Title : 
English and Taiwanese text categorization using N-gram based on Vector Space Model
         
        
            Author : 
Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Ishida, Takashi ; Goto, Masayuki
         
        
            Author_Institution : 
Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan
         
        
        
        
        
        
            Abstract : 
In this paper, we present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method, FRAM, in several experiments that classify newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. FRAM achieves good classification accuracy for English, with a micro-averaged F-measure of 94.5%, but only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving classification accuracy for Taiwanese remains future work.
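The abstract does not describe the internals of the proposed method, so the following is only a generic illustration of what N-gram text categorization in a vector space can look like, not the authors' model. It assumes character trigrams, raw count vectors, one summed profile per category, and cosine similarity as the matching score; the function names and the toy training texts are hypothetical. Character N-grams need no word segmentation, which is one common reason such approaches are described as language-independent.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in a text (assumed n=3)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_docs, n=3):
    """Build one n-gram profile per category by summing document counts."""
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def classify(text, profiles, n=3):
    """Return the category whose profile is most similar to the text."""
    vec = char_ngrams(text, n)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


# Hypothetical toy data, not taken from Reuters-21578 or China Times 2005.
toy_docs = [
    ("grain", "wheat corn harvest wheat export"),
    ("money", "bank interest rate currency bank"),
]
toy_profiles = train(toy_docs)
print(classify("wheat harvest", toy_profiles))
```

Because the functions operate on characters rather than tokens, the same code accepts Chinese-script text unchanged, although the abstract's results suggest that accuracy can differ considerably between languages.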
         
        
            Keywords : 
natural language processing; pattern classification; text analysis; English Reuters-21578 data set; English text categorization; N-gram; Taiwanese China Times 2005 data set; Taiwanese classification accuracy; Taiwanese text categorization; automatic text categorization; language-independent; mathematical model; microaveraged F-measure; newspaper articles; vector space model; Accuracy; Computers; Feature extraction; Mathematical model; Nonvolatile memory; Text categorization; Training; classification; newspaper; text mining
         
        
        
        
            Conference_Title : 
Information Theory and its Applications (ISITA), 2010 International Symposium on
         
        
            Conference_Location : 
Taichung
         
        
            Print_ISBN : 
978-1-4244-6016-8
         
        
            Electronic_ISBN : 
978-1-4244-6017-5
         
        
        
            DOI : 
10.1109/ISITA.2010.5649453