Title :
Evaluating the utility of statistical phrases and latent semantic indexing for text classification
Author :
Wu, Huiwen ; Gunopulos, Dimitrios
Author_Institution :
Comput. Sci. & Eng. Dept., California Univ., Riverside, CA, USA
Abstract :
The term-based vector space model is a prominent technique for retrieving textual information. In this paper we examine the usefulness of phrases as terms in vector-based document classification. We focus on statistical techniques to extract both adjacent and window phrases from documents. We find that adding phrase terms yields only a very limited improvement when single-word terms already achieve good performance, even when SVD/LSI is used as the dimensionality reduction method.
Keywords :
classification; indexing; information retrieval; statistical analysis; text analysis; adjacent phrase extraction; dimensionality reduction method; latent semantic indexing; single-word terms; statistical phrases; term-based vector space model; text classification; textual information retrieval; vector-based document classification; window phrase extraction; Computer science; Data mining; Dictionaries; Frequency; Indexing; Information retrieval; Large scale integration; Support vector machine classification; Support vector machines; Text categorization;
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
DOI :
10.1109/ICDM.2002.1184036