Article
National and University Library of Iceland > Rit starfsmanna Lbs-Hbs

Please use this identifier to cite or link to this item: http://hdl.handle.net/1946/6074

Title

Managing duplicates across sequential crawls

Published
September 2006

Abstract

Handling documents that remain unchanged between harvesting rounds is an important issue for many organizations that archive the World Wide Web. This paper discusses some of the key problems involved and then outlines a simple yet effective way of managing at least part of them, implemented as an add-on module for the popular web crawler Heritrix. The paper presents the results of crawls run with this new software. Finally, it discusses the limitations of the approach and some of the future work needed to improve duplicate handling.

Issued Date
27/08/2010


Artifacts
Name                        Size    Visibility  Description    Format
kristinn-sigurdsso... .pdf  245 KB  Open        Complete Text  PDF