DocumentCode
2897726
Title
Analysis of inverse class frequency in centroid-based text classification
Author
Lertnattee, Verayuth ; Theeramunkong, Thanaruk
Author_Institution
Inf. Technol. Program, Sirindhorn Int. Inst. of Technol., Maung, Thailand
Volume
2
fYear
2004
fDate
26-29 Oct. 2004
Firstpage
1171
Abstract
Most previous work on text categorization has used term frequency and inverse document frequency to represent the importance of terms. This work presents an analysis of inverse class frequency in centroid-based text categorization. The paper has two aims: to find appropriate functions of inverse class frequency, and to identify the key factors for using inverse class frequency. The experimental results show that the key factors that improve classification accuracy are the numbers of few-class terms and most-class terms. When the data contain large numbers of few-class terms and most-class terms, the logarithmic function of inverse class frequency is the most effective when combined with term frequency. The square root of inverse class frequency, incorporated into TFIDF, works well when data sets include only a small number of few-class terms and most-class terms. The numbers of these effective terms can be increased by using higher-gram models, a small number of classes, and a large number of training sets.
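The two weighting schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: inverse class frequency is taken as the ratio of the number of classes to the number of classes containing the term, IDF is taken as the log of the number of documents over document frequency, and centroids are normalized sums of normalized document vectors compared by dot product. All function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def class_frequency(docs, labels):
    """Count the distinct classes in which each term occurs."""
    classes_per_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            classes_per_term[term].add(label)
    return {term: len(cs) for term, cs in classes_per_term.items()}


def document_frequency(docs):
    """Count the documents in which each term occurs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def weight_vector(tokens, stats, scheme):
    """Weight one document; terms unseen in training are skipped."""
    df, cf, n_docs, n_classes = stats
    vec = {}
    for term, tf in Counter(tokens).items():
        if term not in cf:
            continue
        icf = n_classes / cf[term]  # assumed ICF definition
        if scheme == "tf_log_icf":
            w = tf * math.log(icf)
        elif scheme == "tfidf_sqrt_icf":
            w = tf * math.log(n_docs / df[term]) * math.sqrt(icf)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if w > 0:
            vec[term] = w
    return vec


def normalize(vec):
    """Scale a sparse vector to unit length (empty if all zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(docs, labels, scheme):
    """Build one unit-length centroid per class."""
    stats = (document_frequency(docs), class_frequency(docs, labels),
             len(docs), len(set(labels)))
    sums = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term, w in normalize(weight_vector(tokens, stats, scheme)).items():
            sums[label][term] += w
    return {label: normalize(s) for label, s in sums.items()}, stats


def classify(tokens, centroids, stats, scheme):
    """Return the class whose centroid is most similar to the document."""
    vec = normalize(weight_vector(tokens, stats, scheme))

    def score(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=score)
```

Under these assumptions, a term that occurs in every class receives a weight of zero in the logarithmic scheme, while the square-root scheme only leaves its TFIDF weight unscaled.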
Keywords
pattern classification; statistical analysis; text analysis; TFIDF; centroid-based text classification; classification accuracy; few-class terms; inverse class frequency; most-class terms; term frequency; text categorization; Bayesian methods; Electronic mail; Frequency; Information resources; Information technology; Neural networks; Prototypes; Support vector machine classification; Support vector machines; Text categorization
fLanguage
English
Publisher
IEEE
Conference_Titel
Communications and Information Technology, 2004. ISCIT 2004. IEEE International Symposium on
Print_ISBN
0-7803-8593-4
Type
conf
DOI
10.1109/ISCIT.2004.1413903
Filename
1413903