• DocumentCode
    492502
  • Title
    A Novel POS-Based Approach to Chinese News Topic Extraction from Internet
  • Author
    Zhao, Xujian ; Jin, Peiquan ; Yue, Lihua
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Univ. of Sci. & Technol. of China
  • Volume
    2
  • fYear
    2008
  • fDate
    13-15 Dec. 2008
  • Firstpage
    39
  • Lastpage
    42
  • Abstract
    News topic extraction is important for news search engines. Traditional methods are based on pattern matching and linguistic analysis, which mainly depend on measuring feature similarity. For two reasons, however, these methods are inefficient for extracting Chinese news topics from the Internet: Natural Language Processing (NLP) for Chinese is difficult, and Internet news is diverse and updated quickly. Some recent work exploits the special structure of news (e.g., the title) to extract Chinese news topics. However, two problems remain unsolved: (1) some news topics are missed and (2) irregular topic words are produced. To solve these two problems, we propose a POS-based approach to news topic extraction. We first segment the news title into words and tag their parts of speech (POS), and then eliminate segmentation errors according to POS information and positional relations. After that, topic words are associated and combined into larger ones, and different topic weights are assigned to those larger words. We conduct an experiment on 600 Chinese news Web pages to evaluate our approach. The experimental results show that our approach achieves higher recall and precision in news topic extraction and clearly reduces irregular topic words.
  • Keywords
    Internet; information resources; information retrieval; natural language processing; search engines; word processing; Chinese news topic extraction; news search engine; novel POS-approach; topic word segmentation; Computer science; Conferences; Data mining; IP networks; Pattern analysis; Pattern matching; Thesauri; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Future Generation Communication and Networking Symposia, 2008. FGCNS '08. Second International Conference on
  • Conference_Location
    Sanya
  • Print_ISBN
    978-1-4244-3430-5
  • Electronic_ISBN
    978-0-7695-3546-3
  • Type
    conf
  • DOI
    10.1109/FGCNS.2008.71
  • Filename
    4813517