Where and How Duplicates Occur in the Web
Cholula, Mexico October 25-October 27
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/LA-WEB.2006.39
Fourth Latin American Web Congress (L ...
Alvaro Pereira Jr, Federal Univ. of Minas Gerais, Brazil
Ricardo Baeza-Yates, Yahoo! Research, Spain & Chile
Nivio Ziviani, Federal Univ. of Minas Gerais, Brazil
In this paper we study duplicates on the Web, using collections that contain documents from all sites under the .cl domain and that form accurate and representative subsets of the Web. We identify duplicate and near-duplicate documents in our collections and study how documents are distributed across clusters of duplicates. We also study the occurrence of duplicates in both parts of our Web graphs, the connected and the disconnected components, aiming to identify where duplicates occur more frequently. We show, for the first time, that the number of duplicates in the Web is significantly greater than the number of duplicates in the connected component of the Web graph. Previous works that estimated the number of duplicates in the Web used collections drawn from connected components of the Web. In those cases the sample of the Web was biased.
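The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it detects exact duplicates only, by hashing whitespace-normalized text, and does not attempt near-duplicate detection. The function names, the toy documents, and the `connected` set of document ids are all hypothetical and exist only for illustration.

```python
import hashlib
from collections import defaultdict


def duplicate_clusters(docs):
    """Group document ids whose whitespace-normalized text is identical."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        clusters[digest].append(doc_id)
    # Only groups with two or more members count as duplicate clusters.
    return [ids for ids in clusters.values() if len(ids) > 1]


def duplicate_rate(docs):
    """Fraction of documents that are redundant copies (all but one per cluster)."""
    if not docs:
        return 0.0
    redundant = sum(len(ids) - 1 for ids in duplicate_clusters(docs))
    return redundant / len(docs)


# Toy collection (hypothetical data, not taken from the paper).
docs = {
    "a": "hello world",
    "b": "hello   world",
    "c": "unique page",
    "d": "mirror copy",
    "e": "mirror copy",
    "f": "mirror copy",
}
# Assumption: ids of documents that belong to the connected component.
connected = {"a", "b", "c", "d"}

connected_docs = {k: v for k, v in docs.items() if k in connected}
rate_all = duplicate_rate(docs)
rate_connected = duplicate_rate(connected_docs)
print(rate_all, rate_connected)
```

In this toy collection, measuring only the connected subset gives a lower duplicate rate than measuring the whole collection, which is the kind of sampling bias the abstract refers to.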
Citation:
Alvaro Pereira Jr, Ricardo Baeza-Yates, Nivio Ziviani, "Where and How Duplicates Occur in the Web," Fourth Latin American Web Congress (LA-WEB'06), pp. 127-134, 2006