Please use this identifier to cite or link to this item: http://hdl.handle.net/1946/2071
The World Wide Web contains an increasingly significant amount of the
world’s knowledge and heritage. Since the Web is also in a constant state
of change, significant efforts are now underway to capture and preserve its
contents. These efforts extend the traditional legal deposit laws that have
been aimed at preserving printed material over the last centuries.
The first three chapters outline the fundamental challenges for collecting
the Web and present the software, Heritrix, which has been designed to
perform this task. The first chapter focuses on the reasons and history
behind this endeavour, with chapters two and three focusing on more
The goal of this project was to develop a new way of collecting parts of
the Web that are believed to change very rapidly and are considered of
significant interest. The later chapters focus on defining such an
incremental strategy, which we call an ‘adaptive revisiting strategy’, and
on how it was implemented as part of Heritrix. Part of this discussion
covers how to detect change in documents.
Finally, we discuss initial impressions of the new software and highlight
areas that require further work or attention. As the goal of the project was
primarily to establish the foundation for such incremental crawling and
provide a simple and sturdy implementation, this section contains many
thoughts on issues that could be improved on in the future.
Adaptive Revisiting with Heritrix - Thesis.pdf | 1.12 MB | Open | Thesis