This paper reports on experiments in multi-class document categorization with supervised machine learning techniques. The document collection consists of a set of personal e-mail messages. Two distinct document representations are used to characterize these messages: a standard word-based representation and a character n-gram representation. Based on these representations, the categorization performance of five machine learning approaches is assessed and compared. Overall, both document representations yielded comparable results across the classifiers. However, the n-gram-based representation produced clearly better results when an aggressive feature selection strategy was applied.
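To illustrate the difference between the two representations named in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the function names, the tokenization rule, the whitespace normalization, and the default n = 3 are assumptions made for this example, since the abstract does not specify them.

```python
import re
from collections import Counter


def word_features(text):
    """Word-based representation: counts of lowercase alphabetic tokens.

    The tokenization rule (runs of a-z) is an assumption for illustration.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def char_ngram_features(text, n=3):
    """Character n-gram representation: counts of every n-character window.

    Lowercasing, whitespace normalization, and n=3 are assumptions.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))
```

For example, `char_ngram_features("abab", n=2)` counts "ab" twice and "ba" once, while `word_features("abab")` sees only the single token "abab". Because n-grams span word boundaries and sub-word fragments, the two feature spaces can differ considerably for the same message.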
Citation:
Helmut Berger, Michael Dittenbach, Dieter Merkl, "Analyzing the Effect of Document Representation on Machine Learning Approaches in Multi-Class e-Mail Filtering," 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), pp. 297-300, 2006.