• DocumentCode
    3499593
  • Title

    Automatic Web News Content Extraction Based on Similar Pages

  • Author

    Zhang, Chunyuan ; Lin, Zhiyang

  • Author_Institution
    Dept. of Comput. Sci., Hainan Univ., Haikou, China
  • Volume
    1
  • fYear
    2010
  • fDate
    23-24 Oct. 2010
  • Firstpage
    232
  • Lastpage
    236
  • Abstract
    Most news pages today are generated from an underlying structured source, so template-dependent wrappers should suit them better than template-independent wrappers. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as two HTML DOM trees. Second, we create the maximum matching tree between the DOM trees using our simple tree matching and backtracking algorithm. Then, by analyzing the characteristics of nodes in the maximum matching tree, we eliminate the noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structures are similar to the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction, with the average harmonic mean of precision and recall reaching 98.3%.
  • Keywords
    information resources; tree searching; HTML DOM trees; automatic template-dependent Web news content extraction; backtracking algorithm; extraction template; maximum matching tree; template-dependent wrapper; template-independent wrapper; tree matching; Web news content extraction; similar pages; simple tree matching and backtracking algorithm
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information Systems and Mining (WISM), 2010 International Conference on
  • Conference_Location
    Sanya
  • Print_ISBN
    978-1-4244-8438-6
  • Type

    conf

  • DOI
    10.1109/WISM.2010.154
  • Filename
    5662317