Pruning The Vocabulary For Better Context Recognition
Cambridge UK August 23-August 26
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPR.2004.1334270
17th International Conference on Patt ...
Rasmus Elsborg Madsen, Technical University of Denmark
Sigurdur Sigurdsson, Technical University of Denmark
Lars Kai Hansen, Technical University of Denmark
Jan Larsen, Technical University of Denmark
Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, though, and contains many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With the reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
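As a rough illustration of the information-gain side of this idea, the sketch below scores each term by how much knowing its presence or absence reduces the entropy of the class labels, then keeps only the top-ranked fraction of the vocabulary. It is not the authors' implementation: the function names (entropy, information_gain, prune_vocabulary), the binary presence/absence treatment of terms, the keep_fraction parameter, and the toy documents are all assumptions made for the example, and the sensitivity-map, latent semantic indexing, and classifier stages are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction in the labels from knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) + (
        len(without_term) / n
    ) * entropy(without_term)
    return entropy(labels) - conditional


def prune_vocabulary(docs, labels, keep_fraction):
    """Rank terms by information gain and keep the top `keep_fraction`."""
    vocab = sorted(set().union(*docs))
    scores = {t: information_gain(docs, labels, t) for t in vocab}
    n_keep = max(1, round(keep_fraction * len(vocab)))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:n_keep], scores


# Hypothetical toy corpus: each document is the set of terms it contains.
docs = [
    {"goal", "match", "the"},
    {"team", "match", "the"},
    {"stock", "market", "the"},
    {"market", "bank", "the"},
]
labels = ["sport", "sport", "finance", "finance"]

kept, scores = prune_vocabulary(docs, labels, keep_fraction=0.3)
print(kept)
print(scores["the"])
```

On this toy corpus the two retained terms are "market" and "match", which separate the classes perfectly, while "the" occurs in every document and scores zero. The keep_fraction of 0.3 is chosen only because the toy vocabulary is tiny; the 90%-98% reductions reported in the abstract would correspond to a keep_fraction of roughly 0.02 to 0.10.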
Citation:
Rasmus Elsborg Madsen, Sigurdur Sigurdsson, Lars Kai Hansen, Jan Larsen, "Pruning The Vocabulary For Better Context Recognition," icpr, vol. 2, pp.483-488, 17th International Conference on Pattern Recognition (ICPR'04) - Volume 2, 2004