
Article


Please use this identifier to cite or link to this work: https://hdl.handle.net/1946/6074

Title: 
  • Managing duplicates across sequential crawls
Published: 
  • September 2006
Abstract: 
  • Dealing with documents that remain unchanged between harvesting rounds is an important topic for many organizations archiving the World Wide Web. This paper discusses some of the key problems in dealing with such duplicates and then outlines a simple yet effective way of managing at least part of the problem. This is done in the form of an add-on module for the popular web crawler Heritrix. The paper contains the results of crawls using this new software. Finally, there is a discussion of the limitations and some of the future work needed to improve duplicate handling.

Related URL: 
Accepted: 
  • 27.8.2010
URI: 
  • http://hdl.handle.net/1946/6074


Files
Filename                          Size      Access  Description  File type
kristinn-sigurdsson-iwaw06.pdf    239.4 kB  Open    Full text    PDF