Title :
Feature selection based on word-sentence relation
Author :
Schönhofen, Péter ; Benczur, András A.
Author_Institution :
Informatics Lab. Comput. & Autom. Res. Inst., Hungarian Acad. of Sci., Budapest, Hungary
Abstract :
Feature selection has been shown to improve both the speed and the quality of classification. Methods such as mutual information, information gain, and chi-square are all based on the joint distribution of classes and words; only a few methods exploit contextual information for feature selection. We introduce an algorithm based on word and word-pair frequencies that reduces both vocabulary and total word size prior to classification. We measure the effectiveness of our algorithm by clustering Ken Lang's 20 newsgroups corpus and obtain significantly better size reduction than state-of-the-art methods. We perform keyword selection in three steps: identifying correlated word pairs within sentences; measuring how strongly a word in a given document takes part in such pairs; and finally selecting those keywords that take part in several such pairs in several documents.
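The three keyword-selection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how correlation or participation strength is measured, so the sketch assumes a raw sentence co-occurrence count for both. The function names and the thresholds min_pair_count, min_pairs, and min_docs are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def sentences(text):
    """Split text into sentences, each returned as a set of lowercase words."""
    parts = re.split(r"[.!?]+", text)
    return [set(re.findall(r"[a-z]+", p.lower())) for p in parts if p.strip()]


def select_keywords(docs, min_pair_count=2, min_pairs=1, min_docs=2):
    """Illustrative keyword selection; all thresholds are assumptions."""
    doc_sents = [sentences(d) for d in docs]

    # Step 1: find word pairs that co-occur within sentences.
    # Assumption: a raw count threshold stands in for "correlated".
    pair_counts = Counter()
    for sents in doc_sents:
        for s in sents:
            pair_counts.update(combinations(sorted(s), 2))
    correlated = {p for p, c in pair_counts.items() if c >= min_pair_count}

    doc_freq = Counter()
    for sents in doc_sents:
        # Step 2: per document, count the distinct correlated pairs
        # each word takes part in.
        pairs_in_doc = set()
        for s in sents:
            pairs_in_doc.update(
                p for p in combinations(sorted(s), 2) if p in correlated
            )
        participation = Counter(w for p in pairs_in_doc for w in p)
        doc_freq.update(w for w, n in participation.items() if n >= min_pairs)

    # Step 3: keep words that qualify in enough documents.
    return {w for w, n in doc_freq.items() if n >= min_docs}
```

With a toy corpus in which "graphics" and "cards" share a sentence in two documents, those two words are the only ones selected under the default thresholds. A statistical correlation measure could replace the count threshold in step 1 without changing the overall structure.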
Keywords :
feature extraction; pattern classification; pattern clustering; feature selection; newsgroups corpus clustering; total word size; vocabulary; word pair frequencies; word-sentence relation; Automation; Clustering algorithms; Frequency measurement; Informatics; Laboratories; Mutual information; Performance evaluation; Size measurement; Testing; Vocabulary
Conference_Titel :
Machine Learning and Applications, 2005. Proceedings. Fourth International Conference on
Print_ISBN :
0-7695-2495-8
DOI :
10.1109/ICMLA.2005.32