An Extensible Framework for Data Cleaning
San Diego, California February 28-March 03
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2000.839429
16th International Conference on Data ...
Eric Simon, INRIA
Dennis Shasha, New York University
Data quality concerns arise when one wants to correct anomalies in a single data source (e.g., duplicate elimination in a file), or when one wants to integrate data coming from multiple sources into a single new data source (e.g., data warehouse construction). Three data quality problems are typically encountered: (1) the absence of universal keys across different databases that is known as the object identity problem, (2) the existence of keyboard errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is globally called the data cleaning process.

We propose a framework that models a data cleaning application as a directed graph of data transformations. Transformations are divided into four distinct classes: mapping, matching, clustering and merging; and each of them is implemented by a macro-operator. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to include human interaction explicitly in the process. Finally, we study performance optimizations which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation.
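To make the four transformation classes concrete, the following is a minimal Python sketch of a duplicate-elimination pipeline that chains mapping, matching, clustering and merging. It is an illustration only, not the paper's macro-operators or its SQL extension: the function names, the record layout (`id`, `name`), the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mapping(records):
    """Mapping step (hypothetical): normalize case and whitespace in names."""
    return [
        {"id": r["id"], "name": " ".join(r["name"].lower().split())}
        for r in records
    ]


def matching(records, threshold=0.85):
    """Matching step (hypothetical): pair up records with similar names.

    SequenceMatcher is a stand-in similarity measure chosen for this sketch.
    """
    pairs = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if score >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs


def clustering(records, pairs):
    """Clustering step (hypothetical): connected components via union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r)
    return list(clusters.values())


def merging(clusters):
    """Merging step (hypothetical): keep the longest name in each cluster."""
    return [max(cluster, key=lambda r: len(r["name"])) for cluster in clusters]


def clean(records):
    """Run the four steps in sequence, as a linear transformation graph."""
    mapped = mapping(records)
    pairs = matching(mapped)
    return merging(clustering(mapped, pairs))
```

Calling `clean` on a list of dictionaries with `id` and `name` keys returns one representative record per cluster of near-duplicate names. The all-pairs comparison in `matching` is quadratic in the number of records; it is kept here for clarity and does not reflect the optimizations named in the abstract.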
Index Terms:
data quality, data cleaning, query language, query optimization, data transformation, duplicate elimination, approximate join, object matching
Citation:
Helena Galhardas, Daniela Florescu, Eric Simon, Dennis Shasha, "An Extensible Framework for Data Cleaning," icde, pp.312, 16th International Conference on Data Engineering (ICDE'00), 2000