Categorization and Keyword Identification of Unlabeled Documents
Houston, Texas November 27-November 30
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2005.39
Fifth IEEE International Conference o ...
Ning Kang, George Mason University
Carlotta Domeniconi, George Mason University
Daniel Barbará, George Mason University
In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights assigned to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We consider both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insight into how the spam problem differs from the general classification case.
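The feature selection idea in the abstract (representing each document by words that co-occur frequently across the corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: it restricts itemsets to word pairs for brevity, and the function names, the `min_support` parameter, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_word_pairs(docs, min_support):
    """Return word pairs that co-occur in at least min_support documents.

    Assumption: itemsets are limited to size 2 to keep the sketch short.
    """
    # Apriori-style pruning: a pair can only be frequent if both words are.
    word_counts = Counter(w for doc in docs for w in set(doc))
    frequent_words = {w for w, c in word_counts.items() if c >= min_support}

    pair_counts = Counter()
    for doc in docs:
        kept = sorted(set(doc) & frequent_words)
        pair_counts.update(combinations(kept, 2))
    return {p for p, c in pair_counts.items() if c >= min_support}


def select_features(docs, min_support):
    """Represent each document by its words that belong to a frequent pair."""
    pairs = frequent_word_pairs(docs, min_support)
    vocab = {w for p in pairs for w in p}
    return [sorted(set(doc) & vocab) for doc in docs]


# Hypothetical toy corpus: two spam-like and two non-spam documents.
docs = [
    "cheap pills buy now".split(),
    "buy cheap watches now".split(),
    "meeting agenda for monday".split(),
    "monday meeting moved".split(),
]
reduced = select_features(docs, min_support=2)
for original, kept in zip(docs, reduced):
    print(original, "->", kept)
```

Words that appear in only one document, or that never co-occur often enough with another word, are dropped, so the reduced documents share a much smaller vocabulary before any clustering step is applied.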
Citation:
Ning Kang, Carlotta Domeniconi, Daniel Barbará, "Categorization and Keyword Identification of Unlabeled Documents," icdm, pp.677-680, Fifth IEEE International Conference on Data Mining (ICDM'05), 2005