• DocumentCode
    2893884
  • Title

    A Comprehensive Survey on Web Content Extraction Algorithms and Techniques

  • Author

    Al-Ghuribi, Sumaia Mohammed ; Alshomrani, Saleh

  • Author_Institution
    Fac. of Comput. & Inf. Technol., King Abdulaziz Univ., Jeddah, Saudi Arabia
  • fYear
    2013
  • fDate
    24-26 June 2013
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    Web content extraction is an important problem that has been studied through many different approaches and algorithms. It is concerned with extracting the meaningful and useful data from a Webpage, which is typically surrounded by noisy data such as advertisements and navigation links. Many applications benefit from the extracted content, including crawlers, indexers, document classification, and information retrieval. This survey aims to provide a comprehensive overview of the approaches that have been constructed for extracting Webpage content. The approaches are classified into categories, and for each category some approaches are described in detail along with their weaknesses. Based on a deep analysis of these approaches, we draw out the fundamental factors for constructing an optimal Web content extractor.
  • Keywords
    Web sites; content management; data mining; pattern classification; Web crawlers; Webpage content extraction algorithm; Webpage content extraction technique; Webpage data extraction; advertisement links; document classification; indexers; information retrieval; navigation links; noisy data; optimal Web content extractor; Algorithm design and analysis; Classification algorithms; Data mining; Feature extraction; HTML; Visualization; Web sites;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Science and Applications (ICISA), 2013 International Conference on
  • Conference_Location
    Suwon
  • Print_ISBN
    978-1-4799-0602-4
  • Type

    conf

  • DOI
    10.1109/ICISA.2013.6579445
  • Filename
    6579445