Title :
Learning Term Dependency Links Using Information Theoretic Inclusion Measure
Author :
Makrehchi, Masoud ; Kamel, Mohamed S.
Abstract :
An algorithm to identify and remove term redundancy is proposed for text classifiers using ranking-based feature selection. The proposed method employs a normalized mutual information, called the inclusion measure, to estimate the asymmetric dependency between two terms. Based on pair-wise dependency measures, a dependency matrix is constructed. In this paper, an algorithm is proposed to learn term dependency links from the term dependency matrix and to visualize the dependency between terms in a graph called the term dependency tree. All nodes of the tree are categorized into two groups: hubs and links. Any node whose outdegree is less than two joins the links group. We show that all link nodes are most likely redundant. We also introduce a criterion, called substitution cost, to decide whether to remove or retain a candidate redundant term. The proposed approach is applied to four well-known benchmark data sets with an SVM and a Rocchio classifier using a set of highly aggressive feature selection schemes. The results show the effectiveness of the proposed method, especially when applied to weak classifiers.
Keywords :
Conferences; Data mining; Electric variables measurement; Gain measurement; Machine learning; Pattern analysis; Support vector machine classification; Support vector machines; Text categorization; Tree graphs;
Conference_Title :
Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
Print_ISBN :
978-0-7695-3019-2
Electronic_ISBN :
978-0-7695-3033-8
DOI :
10.1109/ICDMW.2007.21