Title :
Boosting the Feature Space: Text Classification for Unstructured Data on the Web
Author :
Song, Yang ; Zhou, Ding ; Huang, Jian ; Councill, Isaac G. ; Zha, Hongyuan ; Giles, C. Lee
Author_Institution :
Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA
Abstract :
The search for efficient and effective methods of classifying unstructured text in large document corpora has received much attention in recent years. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually leads to sparse, high-dimensional feature spaces and makes high classification accuracy hard to achieve. This paper addresses the problem of classifying unstructured documents on the Web. A classification approach is proposed that combines traditional feature reduction techniques with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that the approach improves classification accuracy over traditional methods for both support vector machine and AdaBoost classifiers.
Keywords :
Internet; classification; feature extraction; information filtering; text analysis; AdaBoost classifier; Web; bag-of-words feature selection method; collaborative filtering method; document corpora; document feature space augmentation; feature reduction technique; support vector machine; unstructured text classification; Bismuth; Boosting; Collaboration; Data mining; Filtering; Neural networks; Space technology; Support vector machine classification; Support vector machines; Text categorization
Conference_Title :
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2701-7
DOI :
10.1109/ICDM.2006.31