Robust Identification of Fuzzy Duplicates
Tokyo, Japan April 05-April 08
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2005.125
21st International Conference on Data ...
Surajit Chaudhuri, Microsoft Research
Venkatesh Ganti, Microsoft Research
Rajeev Motwani, Stanford University
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Citation:
Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani, "Robust Identification of Fuzzy Duplicates," 21st International Conference on Data Engineering (ICDE'05), pp. 865-876, 2005