• DocumentCode
    2260278
  • Title
    Automatic Evaluation of Document Classification Using N-Gram Statistics
  • Author
    Choi, Dongjin ; Ko, Byeongkyu ; Lee, Eunji ; Hwang, Myunggwon ; Kim, Pankoo
  • Author_Institution
    Dept. of Comput. Eng., Chosun Univ., Gwangju, South Korea
  • fYear
    2012
  • fDate
    26-28 Sept. 2012
  • Firstpage
    739
  • Lastpage
    742
  • Abstract
    With the development of World Wide Web technologies, people are flooded with trillions of web pages at every moment, and the size of the Web continues to increase dramatically. As a result, it is becoming more difficult to find web documents relevant to what users want to read. Classifying documents into predefined categories is one of the most important tasks in the field of Natural Language Processing. Over the years, many statistical and linguistic approaches have been applied to overcome the limitations of traditional classification machines. However, the problem remains unsolved: there is not yet a perfect way for a machine to understand human language, and every possibility for making machines think as humans do has to be considered. In this paper, we propose a method for classifying textual documents using n-gram co-occurrence statistics, which have great potential for finding similarities between given documents. We also compare the proposed method with the traditional method suggested by Keselj. This paper covers only simple approaches, and more sophisticated experiments are still needed. Nevertheless, the proposed method performs better than the Keselj approach.
  • Keywords
    Web sites; computational linguistics; natural language processing; pattern classification; statistical analysis; text analysis; Keselj approach; Web documents; Web pages; World Wide Web technologies; classification machine; document classification automatic evaluation; linguistical approach; n-gram co-occurrence statistics; natural language processing field; statistical approach; textural document classification; Bioinformatics; Computer vision; Computers; Data mining; Humans; Semantics; Training; N-gram; Natural Language Processing; document classification; formatting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network-Based Information Systems (NBiS), 2012 15th International Conference on
  • Conference_Location
    Melbourne, VIC
  • Print_ISBN
    978-1-4673-2331-4
  • Type
    conf
  • DOI
    10.1109/NBiS.2012.96
  • Filename
    6354916