Title :
A Study of χ²-test for Text Categorization
Author :
Chen, Yao-Tsung ; Chen, Meng Chang
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Penghu Univ.
Abstract :
In this paper, we propose the χ²-classifier, which employs the χ²-test to test the homogeneity of two random samples of term vectors when making text categorization decisions. First, the properties of the χ²-test for text categorization are studied. One advantage of the χ²-test is that its significance level α equals the miss rate, which provides a foundation for a theoretical performance guarantee. The χ²-classifier also considers term aggregation and selection methods to improve categorization performance. Cosine similarity with the TF*IDF weighting function generally performs reasonably well in text categorization. However, its performance depends on the given threshold value and may fluctuate even near the optimal threshold. To alleviate these problems, the χ²-classifier combines the χ²-test with cosine similarity. Extensive experimental results verify the properties of the χ²-test and the performance of the combined classifier.
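As an illustration only, the two measures named in the abstract can be sketched as follows. This is a minimal sketch of a standard χ² test of homogeneity on two term-count vectors and of plain cosine similarity, not the authors' classifier: the function names, the handling of terms absent from both samples, and the caller-supplied `critical_value` are assumptions, and term aggregation, term selection, TF*IDF weighting, and the way the paper combines the two measures are not shown because the abstract does not specify them.

```python
import math


def chi2_homogeneity(counts_a, counts_b):
    """Return (statistic, degrees_of_freedom) for a 2 x k table of term counts.

    counts_a and counts_b are term-frequency vectors over the same vocabulary
    (hypothetical inputs, e.g. a document and a category sample).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both vectors must cover the same terms")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one term occurrence")
    grand = total_a + total_b

    statistic = 0.0
    used_terms = 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # assumption: a term absent from both samples is skipped
        used_terms += 1
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic, used_terms - 1


def is_homogeneous(counts_a, counts_b, critical_value):
    """Accept homogeneity when the statistic does not exceed critical_value.

    critical_value is looked up by the caller for the chosen significance
    level and degrees of freedom (e.g. 3.841 for alpha = 0.05, df = 1).
    """
    statistic, _ = chi2_homogeneity(counts_a, counts_b)
    return statistic <= critical_value


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

For example, `chi2_homogeneity([10, 0], [0, 10])` yields a large statistic with one degree of freedom, so `is_homogeneous` rejects homogeneity at the 0.05 level, whereas two proportional vectors yield a statistic of zero. Unlike the cosine threshold, the χ² cut-off here comes from the chosen significance level rather than from tuning.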
Keywords :
classification; data mining; learning (artificial intelligence); sampling methods; statistical testing; text analysis; χ²-classifier; χ²-test; cosine similarity; random sampling; term aggregation method; term selection method; text categorization; Computer science; Frequency; Information science; Machine learning; Statistical analysis; Statistical distributions; Statistics; Testing; Text categorization; Text mining; G.3g Nonparametric statistics; H.2.8.I Text mining; I.2.6.g Machine Learning;
Conference_Title :
Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2747-7