Syntactic Similarity of Web Documents
Santiago, Chile November 10-November 12
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/LAWEB.2003.1250297
First Latin American Web Congress (LA ...
Álvaro R. Pereira Jr, Federal University of Minas Gerais
Nivio Ziviani, Federal University of Minas Gerais
This paper presents and compares two methods for evaluating the syntactic similarity between documents. Given an original document and a set of candidate documents, both methods find the candidates that have some similarity relationship with the original. The first method builds a Patricia tree from the original document and computes similarity by searching for the text of each candidate document in the tree. The second method uses the concept of shingles to obtain a similarity measure for each document pair: every shingle of the original document is inserted into a hash table, and the shingles of each candidate document are then looked up in that table. Experimental results were obtained with a system that generates plagiarized documents from 900 documents collected from the Web. Measured as the arithmetic average of the absolute differences between the expected and the obtained similarity, the shingle-based algorithm scored 4.13% and the Patricia-tree algorithm scored 7.50%.
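The shingle-based method and the evaluation measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the shingle size, the tokenization, or the exact similarity formula, so the word-level shingles, the default `w=4`, the choice of "fraction of candidate shingles found in the original" as the score, and the function names `shingles`, `shingle_similarity`, and `mean_abs_difference` are all assumptions made for illustration.

```python
def shingles(text, w=4):
    """Return the set of word-level shingles (w consecutive words) of text.

    Assumption: lowercase, whitespace tokenization; the paper may differ.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_similarity(original, candidate, w=4):
    """Fraction of the candidate's shingles that occur in the original.

    A Python set plays the role of the hash table holding the
    original document's shingles (assumed similarity formula).
    """
    table = shingles(original, w)      # "hash table" built from the original
    cand = shingles(candidate, w)
    if not cand:
        return 0.0
    hits = sum(1 for s in cand if s in table)  # look up each candidate shingle
    return hits / len(cand)


def mean_abs_difference(expected, obtained):
    """Arithmetic average of |expected - obtained| over paired values."""
    if len(expected) != len(obtained) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - o) for e, o in zip(expected, obtained)) / len(expected)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    candidate = "the quick brown fox sleeps"
    print(shingle_similarity(original, candidate, w=2))
    print(mean_abs_difference([0.5, 1.0], [0.4, 0.8]))
```

Building the table once from the original document keeps each candidate lookup at expected constant time per shingle, which is the property the hash-table design in the abstract relies on.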
Citation:
Álvaro R. Pereira Jr, Nivio Ziviani, "Syntactic Similarity of Web Documents," la-web, pp.194, First Latin American Web Congress (LA-WEB'03), 2003