Title :
Multilingual text categorization using Character N-gram
Author :
Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Hirasawa, Shigeichi
Author_Institution :
Shonan Inst. of Technol., Fujisawa
Abstract :
In a previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM), a simple technique that adds up the ratios of term frequency among categories. FRAM places no restriction on which feature terms are used. In the present paper, we exploit this property by adopting character N-grams as feature terms. Because character N-grams do not depend on the rules of grammar, the proposed method is language-independent, and texts in multiple languages can be classified into categories using a single program. We evaluate the proposed method in several experiments, classifying newspaper articles from the English Reuters-21578, Japanese CD-Mainichi 2002, and Chinese China Times 2005 collections using FRAM. The results show good classification accuracy: the recall of the proposed method is 87.8% for English, 86.0% for Japanese, and 72.8% for Chinese. Although classification accuracy for Chinese was considerably lower than for English and Japanese in these experiments, the proposed method is language-independent, offers a new perspective, and shows excellent potential.
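The abstract describes the method only in outline, so the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that the "ratio of term frequency among categories" for a character N-gram is its count in one category divided by its count over all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names, the choice of N = 3, and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Assumed ratio: count of an n-gram in a category / its count over all categories."""
    counts = defaultdict(Counter)  # n-gram -> Counter mapping category -> frequency
    for text, category in labeled_docs:
        for gram in char_ngrams(text, n):
            counts[gram][category] += 1
    ratios = {}
    for gram, per_category in counts.items():
        total = sum(per_category.values())
        ratios[gram] = {cat: freq / total for cat, freq in per_category.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of every n-gram in text; return the best-scoring category."""
    scores = Counter()
    for gram in char_ngrams(text, n):
        # N-grams never seen in training contribute nothing.
        for cat, ratio in ratios.get(gram, {}).items():
            scores[cat] += ratio
    return max(scores, key=scores.get) if scores else None


# Hypothetical toy data, for illustration only.
training = [
    ("stock market shares fell", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the final match", "sports"),
    ("player scores in the match", "sports"),
]
model = train(training, n=3)
print(classify("the match was won by the team", model))
print(classify("interest rates and shares", model))
```

Because the features are raw character sequences, the same code runs unchanged on text without word boundaries, such as Japanese or Chinese; no tokenizer or grammar-specific preprocessing is involved.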
Keywords :
classification; computational linguistics; grammars; text analysis; character N-gram; frequency ratio accumulation method; grammar; language-independent method; multilanguage classification; multilingual text categorization; Data mining; Feature extraction; Ferroelectric films; Frequency; Natural languages; Nonvolatile memory; Performance evaluation; Random access memory; Testing; Text categorization; N-gram; newspaper; text mining
Conference_Title :
Soft Computing in Industrial Applications, 2008. SMCia '08. IEEE Conference on
Conference_Location :
Muroran
Print_ISBN :
978-1-4244-3782-5
Electronic_ISBN :
978-4-9904-2590-6
DOI :
10.1109/SMCIA.2008.5045934