• DocumentCode
    1649086
  • Title
    Automatic Elements Extraction of Chinese Web News Using Prior Information of Content and Structure
  • Author
    Chengru Song ; Shifeng Weng ; Changshui Zhang
  • Author_Institution
    Dept. of Autom., Tsinghua Univ., Beijing, China
  • fYear
    2013
  • Firstpage
    340
  • Lastpage
    345
  • Abstract
    We propose a set of efficient processes for extracting all four elements of Chinese news web pages: the news title, release date, news source, and main text. Our approach is based on a deep analysis of the content and structure features of current Chinese news. We use content indicators as the key to recovering the tree structure of the main text. Additionally, we introduce the concept of Length-Distance Ratio to help improve performance. Unlike most existing methods, our method depends little on the selection of samples and has strong generalization ability regardless of the training process. We tested our approach on 1721 labeled Chinese news pages from 429 web sites. Results show 87% accuracy for news source extraction and over 95% accuracy for the other three elements.
  • Keywords
    Web sites; data mining; information retrieval; natural language processing; text analysis; tree data structures; Chinese Web news; Chinese news Web pages; automatic elements extraction; content indicators; deep analysis; generalization ability; length-distance ratio; news source extraction; news title; release date; training process; tree structure; Accuracy; Data mining; Educational institutions; Feature extraction; HTML; Media; Vectors; LDR; TF-IDF; news extraction; term vector model
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ACPR), 2013 2nd IAPR Asian Conference on
  • Conference_Location
    Naha
  • Type
    conf
  • DOI
    10.1109/ACPR.2013.52
  • Filename
    6778337