• DocumentCode
    2392481
  • Title

    Content extraction from web pages based on Gaussian Smoothing

  • Author

    Liao, Baohua ; Cheng, Bo ; Liu, Chuanchang ; Cheng, Junliang ; Tan, Gang

  • Author_Institution
    State Key Lab. of Networking & Switching Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2010
  • fDate
    26-28 Oct. 2010
  • Firstpage
    42
  • Lastpage
    47
  • Abstract
    Web pages are a potential source for information retrieval and data mining, but most HTML documents on the Internet are cluttered with large amounts of less informative and typically unrelated material. Content extraction is the process of identifying the main content region and removing the other material. Based on the differing properties of Tag and Text nodes, we propose a general, accurate, and efficient content extraction framework named Gaussian Smoothing Content Extractor (GSCE) to solve this problem. In addition, building on the identification of the main content, we also describe the extraction of the Title and Published Date. In an evaluation on a large data set, GSCE achieves high precision and recall for most Web pages.
  • Keywords
    Gaussian processes; Internet; content-based retrieval; hypermedia markup languages; smoothing methods; Gaussian smoothing content extractor; HTML documents; Web pages; content extraction; data mining; information retrieval; published date extraction; title extraction; HTML; Head; Tutorials; DOM; Gaussian Smoothing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6769-3
  • Type

    conf

  • DOI
    10.1109/ICBNMT.2010.5704866
  • Filename
    5704866