• DocumentCode
    584447
  • Title

    Content Extraction from Chinese Web Pages Based on Punctuations Distribution

  • Author

    Peng, Qian ; Wang, Qinglin ; Li, Yuan ; Zhang, Jixian ; Hao, Yuexing

  • Author_Institution
    Sch. of Autom., Beijing Inst. of Technol., Beijing, China
  • fYear
    2012
  • fDate
    11-13 Aug. 2012
  • Firstpage
    1351
  • Lastpage
    1355
  • Abstract
    Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and general approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, the distribution of Chinese punctuation marks in the HTML source is computed to find a position that lies inside the page content. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the two boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
  • Keywords
    Internet; hypermedia markup languages; information resources; information retrieval; natural language processing; Chinese punctuation distribution; Chinese web pages; HTML page; content extraction; left boundary computation; right boundary computation; Accuracy; Data mining; Feature extraction; HTML; Kernel; Navigation; Web pages; kernel punctuation; punctuation distribution;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science & Service System (CSSS), 2012 International Conference on
  • Conference_Location
    Nanjing
  • Print_ISBN
    978-1-4673-0721-5
  • Type

    conf

  • DOI
    10.1109/CSSS.2012.341
  • Filename
    6394579
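
The three steps outlined in the abstract (find a position inside the content, compute the left and right boundaries, extract what lies between them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the punctuation set, the unit of analysis, or the boundary rule, so the sketch assumes line-level counting, a small set of full-width punctuation marks, and a boundary that stops after `max_gap` consecutive punctuation-free lines. All function and parameter names are hypothetical.

```python
import re

# Assumption: a small set of full-width Chinese punctuation marks.
CN_PUNCT = ",。;:?!、"
TAG_RE = re.compile(r"<[^>]*>")


def punct_counts(lines):
    """Count Chinese punctuation marks on each line of the HTML source."""
    return [sum(ch in CN_PUNCT for ch in line) for line in lines]


def find_seed(counts):
    """Step 1: pick the line with the most punctuation as a point inside the content."""
    return max(range(len(counts)), key=counts.__getitem__)


def expand(counts, seed, step, max_gap):
    """Step 2: walk left (step=-1) or right (step=+1) from the seed.

    The boundary is the last line containing punctuation that is reached
    before more than max_gap consecutive punctuation-free lines are seen.
    """
    boundary = seed
    gap = 0
    i = seed + step
    while 0 <= i < len(counts) and gap <= max_gap:
        if counts[i] > 0:
            boundary = i
            gap = 0
        else:
            gap += 1
        i += step
    return boundary


def extract_content(html, max_gap=2):
    """Step 3: return the tag-stripped text between the two boundaries."""
    lines = html.splitlines()
    counts = punct_counts(lines)
    if not counts or max(counts) == 0:
        return ""  # no Chinese punctuation found, so nothing to anchor on
    seed = find_seed(counts)
    left = expand(counts, seed, -1, max_gap)
    right = expand(counts, seed, +1, max_gap)
    text = [TAG_RE.sub("", ln).strip() for ln in lines[left:right + 1]]
    return "\n".join(t for t in text if t)


if __name__ == "__main__":
    page = "\n".join([
        "<html><body>",
        "<div class='nav'><a href='/'>首页</a> <a href='/news'>新闻</a></div>",
        "<p>今天,天气很好。我们去公园,看花、散步。</p>",
        "<p>明天呢?也许会下雨!</p>",
        "<div class='footer'>版权所有</div>",
        "</body></html>",
    ])
    print(extract_content(page))
```

Navigation and footer lines typically contain few or no sentence-level punctuation marks, which is why punctuation density is a plausible anchor for the main text. A real page would likely need a finer unit than source lines and a tuned `max_gap`.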