• DocumentCode
    2183547
  • Title
    Automatic training corpora acquisition through Web mining
  • Author
    Huang, Chien-Chung ; Lin, Kuan-Ming ; Chien, Lee-Feng
  • Author_Institution
    Dartmouth Coll., Hanover, NH, USA
  • fYear
    2005
  • fDate
    19-22 Sept. 2005
  • Firstpage
    193
  • Lastpage
    199
  • Abstract
    Text classification is a task that has been extensively studied for decades. However, most previous work presupposes the existence of explicitly labeled corpora. In this study, we focus on the issue of automatic corpora acquisition. We propose a Web-based mining approach to collect the necessary corpora, which can be of great use to both common users and system designers. Moreover, the proposed technique can be combined with existing classification techniques to further boost classifier performance. It has been shown that the concept of a class can be captured by the class name and its associated terms (Huang et al., 2004). In this work, we analyze Web-retrieved documents to discover the associated terms, which are then used to collect more training corpora. Working iteratively, the proposed approach can acquire training corpora of high quality. We give empirical evidence that the classifiers thus created have promising accuracy. In sum, the convenience and efficiency of the proposed approach, along with the new perspective on the issue of corpora acquisition, are the primary contributions of this work.
  • Keywords
    Internet; classification; data mining; document handling; information retrieval; Web mining; Web-retrieved document; automatic training corpora acquisition; classification technique; Classification algorithms; Educational institutions; History; Humans; Labeling; Prototypes; Support vector machine classification; Support vector machines; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on
  • Print_ISBN
    0-7695-2415-X
  • Type
    conf
  • DOI
    10.1109/WI.2005.39
  • Filename
    1517842