DocumentCode :
2550537
Title :
Text categorization based on the ratio of word frequency in each categories
Author :
Suzuki, Makoto ; Hirasawa, Shigeichi
Author_Institution :
Shonan Inst. of Technol., Shonan
fYear :
2007
fDate :
7-10 Oct. 2007
Firstpage :
3535
Lastpage :
3540
Abstract :
In this paper, we treat automatic text categorization as a sequence of information-processing steps and propose a new classification technique called the Frequency Ratio Accumulation Method (FRAM). FRAM is a simple technique that calculates the sum of the ratios of word frequency in each category. Because FRAM places no limit on the number of feature terms that can be used, we propose using character N-grams and word N-grams as feature terms. We evaluate the proposed technique through a number of experiments in which we classify newspaper articles from the Japanese CD-Mainichi 2002 corpus and the English Reuters-21578 corpus using the Naive Bayes method (the baseline) and the proposed method. The results show that the classification accuracy of the proposed method is far better than that of the baseline: 87.3% for CD-Mainichi 2002 and 86.1% for Reuters-21578. Although the proposed method is simple, it provides a new perspective, is language-independent, and has high potential for further development.
Keywords :
Bayes methods; classification; text analysis; English Reuters; Japanese CD-Mainichi; Naive Bayes method; automatic text categorization; frequency ratio accumulation method; information classification; information processing; newspaper articles; word frequency ratio; Data mining; Feature extraction; Ferroelectric films; Frequency; Information processing; Machine learning algorithms; Natural languages; Nonvolatile memory; Random access memory; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
978-1-4244-0990-7
Electronic_ISBN :
978-1-4244-0991-4
Type :
conf
DOI :
10.1109/ICSMC.2007.4414216
Filename :
4414216
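Illustrative sketch (not part of the original record):
The abstract describes FRAM only as summing the ratios of word frequency in each category, with character or word N-grams as feature terms. The Python sketch below is one possible reading of that description, not the authors' implementation. It assumes that the ratio of a term for a category is its frequency in that category divided by its total frequency across all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names (char_ngrams, train_fram, classify_fram) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_fram(labeled_docs, n=2):
    """Build per-term frequency ratios from (text, category) pairs.

    Assumed definition (not given in the abstract):
    ratio[term][cat] = freq(term, cat) / sum of freq(term, c) over all c.
    """
    freq = defaultdict(Counter)
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            freq[term][category] += 1

    ratios = {}
    for term, per_category in freq.items():
        total = sum(per_category.values())
        ratios[term] = {c: k / total for c, k in per_category.items()}
    return ratios


def classify_fram(text, ratios, n=2):
    """Accumulate the ratios of the document's terms and pick the top category.

    Terms never seen in training contribute nothing. Returns None when no
    term of the document was seen in training.
    """
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy data: two one-document categories over a two-letter alphabet.
toy_docs = [("aaab", "A"), ("abbb", "B")]
model = train_fram(toy_docs, n=2)
print(classify_fram("aab", model, n=2))
```

In this toy run the bigram "aa" occurs only in category A (ratio 1.0) and "ab" occurs once in each category (ratio 0.5 each), so "aab" accumulates 1.5 for A against 0.5 for B. Word N-grams could be substituted by replacing char_ngrams with a tokenizer-based function; the abstract does not say how the authors tokenized either corpus.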