• DocumentCode
    3105481
  • Title
    Topic-Based Vietnamese News Document Filtering in the BioCaster Project
  • Author
    Hoang, Vu ; Nguyen, Nguyen ; Dinh, Dien ; Collier, Nigel
  • fYear
    2007
  • fDate
    22-24 Aug. 2007
  • Firstpage
    224
  • Lastpage
    229
  • Abstract
    In this paper, we describe a topic-based Vietnamese news document filtering (VTDF) system in the BioCaster Project, which automatically classifies news documents from a wide variety of sources into relevant topics suitable for disease outbreak detection. Given the very large number of news reports that have to be analyzed each day, VTDF is a crucial preprocessing step that reduces the burden of semantic annotation. Here we present two different approaches to the Vietnamese document classification problem for use in the VTDF system. We evaluated two widely used classification approaches on our task, Bag of Words (BOW) and Statistical N-Gram Language Modeling (N-Gram), and showed that N-Gram could achieve an average of 95% accuracy with an average filtering time of 79 minutes for about 14,000 documents (3 docs/sec).
  • Keywords
    Diseases; Drugs; Humans; Information filtering; Information filters; Marketing and sales; Natural languages; Ontologies; Surveillance; Text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on
  • Conference_Location
    Luoyang, Henan, China
  • Print_ISBN
    978-0-7695-2930-1
  • Type
    conf
  • DOI
    10.1109/ALPIT.2007.56
  • Filename
    4460644