Title :
On a new model for automatic text categorization based on Vector Space Model
Author :
Suzuki, Makoto ; Yamagishi, Naohide ; Ishida, Takashi ; Goto, Masayuki ; Hirasawa, Shigeichi
Author_Institution :
Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan
Abstract :
In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). This is a simple technique that adds up the ratios of term frequencies among categories, and it can use an unlimited number of index terms. We then adopted character N-grams to form index terms, thereby improving FRAM. However, FRAM did not have a satisfactory mathematical basis. Therefore, we present here a new mathematical model based on a “Vector Space Model” and consider its implications. The proposed method is evaluated in several experiments in which we classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set. The Reuters-21578 data set is a benchmark data set for automatic text categorization. It is shown that FRAM has good classification accuracy. Specifically, the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method can perform classification with a single program and is language-independent.
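The abstract describes FRAM only in outline: index terms are character N-grams, and classification adds up the ratios of term frequencies among categories. The following is a minimal sketch of one plausible reading of that description, not the authors' formulation. The function names (char_ngrams, train, classify), the choice of n = 3, the ratio definition (a term's count in one category divided by its total count over all categories), and the toy training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-term frequency ratios across categories.

    Assumed ratio: count of the term in a category divided by the
    term's total count over all categories.
    """
    freq = defaultdict(Counter)  # term -> Counter(category -> count)
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            freq[term][category] += 1

    ratios = {}
    for term, counts in freq.items():
        total = sum(counts.values())
        ratios[term] = {cat: k / total for cat, k in counts.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of the document's terms and pick the top category."""
    scores = Counter()
    for term in char_ngrams(text, n):
        for cat, ratio in ratios.get(term, {}).items():
            scores[cat] += ratio
    if not scores:
        return None  # no known term occurred in the document
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from Reuters-21578 or CD-Mainichi 2002.
training = [
    ("wheat corn harvest", "grain"),
    ("crude oil barrel", "energy"),
]
model = train(training)
print(classify("corn harvest report", model))
print(classify("oil barrel prices", model))
```

Because the sketch works on raw characters and needs no word segmentation, the same code runs unchanged on English or Japanese text, which is consistent with the language independence claimed in the abstract.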
Keywords :
document handling; natural language processing; pattern classification; text analysis; vectors; vocabulary; English reuter; Japanese CD-Mainichi; automatic text categorization; character N-gram; frequency ratio accumulation method; index term; newspaper article; vector space model; Benchmark testing; Lead; N-gram; classification; newspaper; text mining;
Conference_Title :
Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on
Conference_Location :
Istanbul
Print_ISBN :
978-1-4244-6586-6
DOI :
10.1109/ICSMC.2010.5642259