Dou Shen, Hong Kong University of Science and Technology
Qiang Yang, Hong Kong University of Science and Technology
Hui Zhao, Hong Kong University of Science and Technology
We propose using the n-multigram model to aid automatic text classification. This model can automatically discover the latent semantic sequences contained in the document set of each category. Based on the n-multigram model and the n-gram language model, we put forward two text classification algorithms. Experiments on RCV1 show that our algorithm based on the n-multigram model achieves classification performance similar to that of the algorithm based on the n-gram model, while its model size is only 4.21% of the latter's. The second proposed algorithm, which combines the n-multigram model with the n-gram model, improves the micro-F1 and macro-F1 values by 3.5% and 4.5% respectively, which supports the validity of our approach.
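For readers unfamiliar with language-model-based classification, the following is a minimal sketch of the general idea behind an n-gram baseline: train one bigram model per category and assign a document to the category whose model gives it the highest log-likelihood. It is not the authors' algorithm and does not implement the n-multigram model. The class name `BigramClassifier`, the add-one smoothing, the `<s>` start token, and the omission of class priors are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bigrams(tokens):
    """Return (previous, current) pairs, with a start marker before the first token."""
    padded = ["<s>"] + list(tokens)
    return list(zip(padded, padded[1:]))


class BigramClassifier:
    """Hypothetical per-category bigram language-model classifier (illustrative only)."""

    def __init__(self):
        self.bigram_counts = defaultdict(Counter)   # category -> counts of (prev, word)
        self.context_counts = defaultdict(Counter)  # category -> counts of prev
        self.vocab = set()

    def train(self, category, tokens):
        for prev, word in bigrams(tokens):
            self.bigram_counts[category][(prev, word)] += 1
            self.context_counts[category][prev] += 1
        self.vocab.update(tokens)

    def log_prob(self, category, tokens):
        # Add-one smoothing; the extra 1 in the vocabulary size covers unseen words.
        v = len(self.vocab) + 1
        total = 0.0
        for prev, word in bigrams(tokens):
            num = self.bigram_counts[category][(prev, word)] + 1
            den = self.context_counts[category][prev] + v
            total += math.log(num / den)
        return total

    def classify(self, tokens):
        # Pick the category whose model scores the document highest (uniform priors assumed).
        return max(self.bigram_counts, key=lambda c: self.log_prob(c, tokens))


clf = BigramClassifier()
clf.train("sports", "the team won the match".split())
clf.train("finance", "the bank raised the rate".split())
predicted = clf.classify("the team won".split())
```

In this toy setup `predicted` is the category whose training text shares the most bigrams with the query. A model of this kind stores a count for every observed n-gram per category, which is why model size is a natural point of comparison in the abstract above.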
Citation:
Dou Shen, Jian-Tao Sun, Qiang Yang, Hui Zhao, Zheng Chen, "Text Classification Improved through Automatically Extracted Sequences," 22nd International Conference on Data Engineering (ICDE'06), p. 121, 2006.