• DocumentCode
    2324778
  • Title

    A Novel Approach To Automatically Extracting Main Content of Web News

  • Author

    Wang, Xuan ; Wang, WeiPing ; Liu, Bowen ; Wang, Zhen ; Wang, Xicai

  • Author_Institution
    Bus. Intell. Lab., Univ. of Sci. & Technol. of China, Hefei
  • fYear
    2009
  • fDate
    23-24 May 2009
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    The Web has recently become a major data repository, and much research has been devoted to obtaining relevant information from it. The typical function of Web news extraction is to locate the useful content text and filter out the noise; both remain difficult, which makes Web news extraction an open research problem. In this paper, we describe an approach that clusters pages sharing a common extraction path and automatically locates the main text passages. The approach can be applied to structural Web pages. Moreover, we developed an extraction system using our algorithm. Experiments were carried out on several important online news sites, and the results show that the approach achieves higher extraction accuracy than the RTDM algorithm.
  • Keywords
    Web sites; content management; information filtering; pattern clustering; text analysis; Web site; automatic Web news content extraction; content text; data repository; noise filtering; structural Web page clustering; Clustering algorithms; Computer vision; Costs; Data mining; Information filtering; Information filters; Navigation; Tree data structures; Vegetation mapping; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    E-Business and Information System Security, 2009. EBISS '09. International Conference on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-2909-7
  • Electronic_ISBN
    978-1-4244-2910-3
  • Type

    conf

  • DOI
    10.1109/EBISS.2009.5137884
  • Filename
    5137884