An algorithm to identify and remove term redundancy is proposed for text classifiers that use ranking-based feature selection. The proposed method employs a normalized mutual information, called the inclusion measure, to estimate the asymmetric dependency between two terms. A dependency matrix is constructed from these pairwise dependency measures. In this paper, an algorithm is proposed to learn term dependency links from the term dependency matrix and to visualize the dependencies between terms in a graph called the term dependency tree. All nodes of the tree are categorized into two groups: hubs and links. Any node whose outdegree is less than two joins the links group. We show that all link nodes are most likely redundant. We also introduce a criterion, called substitution cost, to decide whether to remove or retain a candidate redundant term. The proposed approach is applied to four well-known benchmark data sets with an SVM and a Rocchio classifier using a set of highly aggressive feature selection schemes. The results show the effectiveness of the proposed method, especially when applied to weak classifiers.
Citation:
Masoud Makrehchi, Mohamed S. Kamel, "Learning Term Dependency Links Using Information Theoretic Inclusion Measure," Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pp. 423-428, 2007
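
The abstract does not give the formula for the inclusion measure or the tree-learning procedure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the inclusion measure normalizes mutual information by the entropy of the first term, I(X;Y) / H(X), which makes it asymmetric, and it assumes tree edges point from parent to child so that outdegree counts children. The function names (`inclusion`, `dependency_matrix`, `categorize`) are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete observations."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def inclusion(xs, ys):
    """ASSUMED normalization: I(X;Y) / H(X).

    The result is asymmetric because the denominator depends on X only.
    The paper's exact definition may differ.
    """
    hx = entropy(xs)
    return mutual_information(xs, ys) / hx if hx > 0 else 0.0


def dependency_matrix(term_vectors):
    """Pairwise inclusion values as a nested dict: matrix[a][b] = inclusion(a, b)."""
    terms = list(term_vectors)
    return {
        a: {b: inclusion(term_vectors[a], term_vectors[b]) for b in terms if b != a}
        for a in terms
    }


def categorize(edges, nodes):
    """Split nodes into hubs (outdegree >= 2) and links (outdegree < 2).

    ASSUMPTION: each edge is a (parent, child) pair.
    """
    outdeg = Counter(src for src, _ in edges)
    hubs = {n for n in nodes if outdeg[n] >= 2}
    links = set(nodes) - hubs
    return hubs, links


# Toy binary occurrence vectors over eight documents (made-up data).
vectors = {
    "a": [1, 1, 1, 1, 0, 0, 0, 0],
    "b": [1, 1, 0, 0, 0, 0, 0, 0],
}
matrix = dependency_matrix(vectors)

# Hypothetical tree: r has two children, x has one, y and z have none.
hubs, links = categorize([("r", "x"), ("r", "y"), ("x", "z")], ["r", "x", "y", "z"])
```

In this toy case `matrix["b"]["a"]` is larger than `matrix["a"]["b"]`, which illustrates the asymmetry, and only `r` qualifies as a hub. How the tree edges are selected from the matrix and how substitution cost is computed are not described in the abstract and are left out.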