Scalable Model-based Clustering by Working on Data Summaries
Melbourne, Florida November 19-November 22
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2003.1250907
Third IEEE International Conference o ...
Huidong Jin, Lingnan University, Hong Kong
Man-Leung Wong, Lingnan University, Hong Kong
Kwong-Sak Leung, The Chinese University of Hong Kong
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework. First, a large data set is summarized into sub-clusters. Then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
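The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EMADS, gEMADS, or bEMADS: the abstract does not give their update equations, so everything below is an assumption made for illustration. The function names `summarize` and `em_on_summaries`, the fixed-width grid used for summarization, and the simplified EM variant (all points in a sub-cluster share one responsibility vector) are hypothetical choices. The sketch only shows how a Gaussian mixture might be fitted from the cardinality, mean, and covariance of each sub-cluster without revisiting the raw data.

```python
import numpy as np

def summarize(X, cell_width):
    """Phase 1 (assumed grid scheme): bucket points into cells and
    keep only (count, mean, covariance) for each non-empty cell."""
    keys = np.floor(X / cell_width).astype(int)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts, means, covs = [], [], []
    for c in range(inverse.max() + 1):
        pts = X[inverse == c]
        mu = pts.mean(axis=0)
        diff = pts - mu
        counts.append(len(pts))
        means.append(mu)
        covs.append(diff.T @ diff / len(pts))
    return np.array(counts, dtype=float), np.array(means), np.array(covs)

def em_on_summaries(n, mu, gamma, k, iters=50, reg=1e-6):
    """Phase 2 (simplified, not EMADS): EM for a Gaussian mixture that
    sees only summaries n[m], mu[m], gamma[m] of each sub-cluster."""
    m, d = mu.shape
    N = n.sum()
    # Deterministic farthest-point initialisation of the component means.
    idx = [int(np.argmax(n))]
    for _ in range(k - 1):
        d2 = ((mu[:, None, :] - mu[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(np.argmax(d2)))
    centers = mu[idx].copy()
    # Isotropic initial covariance from the overall per-dimension variance.
    grand = (n @ mu) / N
    var0 = (n @ (((mu - grand) ** 2).sum(1)
                 + np.trace(gamma, axis1=1, axis2=2))) / (N * d)
    sigmas = np.tile(var0 * np.eye(d), (k, 1, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: one responsibility vector per sub-cluster, based on the
        # expected log-density of its members under each component.
        log_r = np.empty((m, k))
        for j in range(k):
            inv = np.linalg.inv(sigmas[j])
            _, logdet = np.linalg.slogdet(sigmas[j])
            diff = mu - centers[j]
            maha = np.einsum('mi,ij,mj->m', diff, inv, diff)
            spread = np.einsum('ij,mji->m', inv, gamma)  # trace(inv @ gamma[m])
            log_r[:, j] = np.log(weights[j]) - 0.5 * (logdet + maha + spread)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: sub-clusters act as weighted points that carry their
        # own within-cluster scatter gamma[m].
        w = n[:, None] * r
        nk = np.maximum(w.sum(axis=0), 1e-12)
        weights = nk / N
        centers = (w.T @ mu) / nk[:, None]
        for j in range(k):
            diff = mu - centers[j]
            scatter = gamma + diff[:, :, None] * diff[:, None, :]
            sigmas[j] = (np.einsum('m,mij->ij', w[:, j], scatter) / nk[j]
                         + reg * np.eye(d))
    return weights, centers, sigmas
```

In this sketch the cost of each EM iteration depends on the number of sub-clusters rather than the number of raw points, which is the kind of saving the framework aims for. A call such as `em_on_summaries(*summarize(X, 1.0), k=2)` would fit two components to an array `X` of shape `(num_points, dim)`.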
Citation:
Huidong Jin, Man-Leung Wong, Kwong-Sak Leung, "Scalable Model-based Clustering by Working on Data Summaries," icdm, pp.91, Third IEEE International Conference on Data Mining (ICDM'03), 2003