loading...
Rapid Identification of Column Heterogeneity
Hong Kong December 18-December 22
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2006.132Sixth IEEE International Conference o ...
 This Article 
 
PDF
HTML
 
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Bing Tian Dai, National Univ. of Singapore, Singapore
Nick Koudas, University of Toronto
Beng Chin Ooi, National Univ. of Singapore, Singapore
Divesh Srivastava, AT&T Labs-Research
Suresh Venkatasubramanian, AT&T Labs--Research
Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, e.g., accuracy, freshness and completeness, to capture common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and describe our technique to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present detailed experimental results, using diverse data sets of different types, to demonstrate that our approach provides a robust mechanism for identifying and quantifying database column heterogeneity.
Citation:
Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian, "Rapid Identification of Column Heterogeneity," icdm, pp.159-170, Sixth IEEE International Conference on Data Mining (ICDM'06), 2006
Usage of this product signifies your acceptance of the Terms of Use.


Suggestions