Thesis: University of Iceland > School of Engineering and Natural Sciences > Master's theses - School of Engineering and Natural Sciences

Please use this identifier to cite or link to this item: http://hdl.handle.net/1946/2071

Title: 
  • Adaptive Revisiting with Heritrix
Submitted: 
  • May 2005
Abstract: 
  • The World Wide Web contains an increasingly significant amount of the
    world’s knowledge and heritage. Since the Web is also in a constant state
    of change, significant efforts are now underway to capture and preserve its
    contents. These efforts extend the traditional legal deposit laws that have
    been aimed at preserving printed material over the last centuries.
    The first three chapters outline the fundamental challenges for collecting
    the Web and present the software, Heritrix, which has been designed to
    perform this task. The first chapter focuses on the reasons and history
    behind this endeavour, with chapters two and three focusing on more
    technical aspects.
    The goal of this project was to develop a new way of collecting parts of
    the Web that are believed to change very rapidly and are considered of
    significant interest. The later chapters focus on defining such an
    incremental strategy, which we call an ‘adaptive revisiting strategy’, and
    how it was implemented as a part of Heritrix. A part of this discussion is
    how to detect change in documents.
    Finally, we discuss initial impressions of the new software and highlight
    areas that require further work or attention. As the goal of the project was
    primarily to establish the foundation for such incremental crawling and
    provide a simple and sturdy implementation, this section contains many
    thoughts on issues that could be improved on in the future.

Accepted: 
  • Mar 16, 2009
URI: 
  • http://hdl.handle.net/1946/2071


Files in This Item:
Filename | Size | Visibility | Description | Format
Adaptive Revisiting with Heritrix - Thesis.pdf | 1.12 MB | Open | Thesis | PDF