Automatic text classification for Web collections is a non-trivial task. Thai academic Web pages usually present technical articles, so they may contain many technical terms in both Thai and English. This paper presents two approaches to the problem of a large number of unique terms in a Web page: 1) term weighting schemes and 2) schemes using Web link information. We propose using inverse class frequency instead of inverse document frequency in centroid-based text categorization. Web link information, which lets users follow a link to another part or page, adds useful unique terms for classification. The experimental results show that inverse class frequency is useful on a set of Thai academic Web documents that are categorized by source (site) of information, and that it should be applied to both prototype and query vectors. Moreover, Web link information proves useful when inverse class frequency is also applied.
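As a rough illustration of the idea, the following Python sketch weights terms by term frequency times inverse class frequency in a centroid-based classifier and applies the weight to both the prototype (centroid) vectors and the query vector. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the form icf(t) = log(C / cf(t)), where C is the number of classes and cf(t) is the number of classes containing term t, the cosine similarity, the whitespace tokenizer (Thai text would need a proper word segmenter), and the toy English documents and site labels.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Placeholder tokenizer; real Thai text needs a word segmenter.
    return text.lower().split()

def weight(tf, icf):
    # TF x ICF weighting; terms unseen in training get weight 0.
    return {t: f * icf.get(t, 0.0) for t, f in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(docs):
    # docs: list of (class_label, text) pairs.
    class_tf = defaultdict(Counter)
    for label, text in docs:
        class_tf[label].update(tokenize(text))
    # cf(t): number of classes in which term t occurs.
    class_freq = Counter()
    for tf in class_tf.values():
        class_freq.update(tf.keys())
    n_classes = len(class_tf)
    # Assumed ICF form: log(C / cf(t)).
    icf = {t: math.log(n_classes / cf) for t, cf in class_freq.items()}
    # Prototype vectors: summed class term frequencies weighted by ICF.
    centroids = {label: weight(tf, icf) for label, tf in class_tf.items()}
    return centroids, icf

def classify(text, centroids, icf):
    # ICF is applied to the query vector as well as the prototypes.
    query = weight(Counter(tokenize(text)), icf)
    return max(centroids, key=lambda c: cosine(query, centroids[c]))

# Hypothetical toy data: two "sites" acting as classes.
docs = [
    ("siteA", "protein gene sequence analysis"),
    ("siteA", "gene expression analysis"),
    ("siteB", "network routing protocol analysis"),
    ("siteB", "wireless network protocol"),
]
centroids, icf = train(docs)
print(classify("gene analysis", centroids, icf))
print(classify("network protocol", centroids, icf))
```

In this sketch a term that occurs in every class, such as "analysis" in the toy data, receives an ICF of zero and so does not influence the similarity, while terms confined to one class dominate the comparison.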
Index Terms:
Text Classification, Text Categorization, Inverse Class Frequency, Web Link Information
Citation:
Verayuth Lertnattee, Thanaruk Theeramunkong, "Improving Thai Academic Web Page Classification Using Inverse Class Frequency and Web Link Information," 22nd International Conference on Advanced Information Networking and Applications - Workshops (AINA Workshops 2008), pp. 1144-1149, 2008