• DocumentCode
    2692009
  • Title

    The Research on Automatic Construction Techniques of Large-Scale Corpus for Chinese Text Categorization

  • Author

    Hu, Yan ; Wu, Wei ; Miao, Miao

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Wuhan Univ. of Technol., Wuhan, China
  • fYear
    2009
  • fDate
    16-17 May 2009
  • Firstpage
    640
  • Lastpage
    645
  • Abstract
    A large-scale corpus contains abundant language phenomena. It can reflect the universal laws of language use, has drawn interest from the information technology and linguistics communities in many countries, and has become a hot topic in natural language processing. However, Chinese corpora are currently scarce, and corpora for Chinese text categorization are especially rare. Text categorization has become the core and foundation of large-scale data processing applications, so lagging corpus research has become an obstacle to the development of information technology. Therefore, by analyzing the characteristics of Chinese categorization corpora, drawing on the Internet as the largest knowledge base available at present, and relying on the retrieval capability of search engines, this paper proposes and implements an algorithm for automatically constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers and has practical value.
  • Keywords
    Internet; linguistics; natural language processing; search engines; text analysis; Chinese text categorization; automatic construction technique; classifier; information technology; large-scale corpus; large-scale data processing; search engine; universal language law; Application software; Computer science; Electronic commerce; Information analysis; Information processing; Information technology; Large-scale systems; Materials science and technology; Natural languages; Text categorization; Automatic Construction; Chinese Text Categorization; Large-scale Corpus
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Engineering and Electronic Commerce, 2009. IEEC '09. International Symposium on
  • Conference_Location
    Ternopil
  • Print_ISBN
    978-0-7695-3686-6
  • Type

    conf

  • DOI
    10.1109/IEEC.2009.141
  • Filename
    5175198