Text Classification Improved through Automatically Extracted Sequences
Atlanta, Georgia April 03-April 07
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.158
22nd International Conference on Data ...
Dou Shen, Hong Kong University of Science and Technology
Jian-Tao Sun, Microsoft Research Asia
Qiang Yang, Hong Kong University of Science and Technology
Hui Zhao, Hong Kong University of Science and Technology
Zheng Chen, Microsoft Research Asia
We propose using the n-multigram model to aid automatic text classification. This model can automatically discover the latent semantic sequences contained in the document set of each category. Based on the n-multigram model and the n-gram language model, we put forward two text classification algorithms. Experiments on RCV1 show that the algorithm based on the n-multigram model achieves classification performance similar to that of the one based on the n-gram model, while its model size is only 4.21% of the latter's. The second algorithm, based on a combination of the n-multigram model and the n-gram model, improves the micro-F1 and macro-F1 values by 3.5% and 4.5% respectively, which supports the validity of our approach.
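The abstract reports its gains as micro-F1 and macro-F1. As a reminder of how the two averages differ, here is a minimal sketch, not the authors' evaluation code. It assumes each document has exactly one gold and one predicted category, which simplifies the multi-label setting of RCV1; the function names `f1` and `micro_macro_f1` are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro-F1, macro-F1) for equal-length lists of labels.

    Assumption: single-label classification (one category per document).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[g] += 1  # gold category gains a false negative

    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0, 0.0

    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    gold = ["a", "a", "a", "b"]
    pred = ["a", "a", "b", "b"]
    print(micro_macro_f1(gold, pred))
```

Micro-F1 is dominated by the large categories because it pools counts, whereas macro-F1 weights every category equally, so rare categories affect it more.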
Citation:
Dou Shen, Jian-Tao Sun, Qiang Yang, Hui Zhao, Zheng Chen, "Text Classification Improved through Automatically Extracted Sequences," icde, pp.121, 22nd International Conference on Data Engineering (ICDE'06), 2006