  • DocumentCode
    22643
  • Title
    A Survey on Region Extractors from Web Documents
  • Author
    Sleiman, Hassan A.; Corchuelo, Rafael
  • Author_Institution
    ETSI Inf., Univ. of Sevilla, Sevilla, Spain
  • Volume
    25
  • Issue
    9
  • fYear
    2013
  • fDate
    Sept. 2013
  • Firstpage
    1960
  • Lastpage
    1981
  • Abstract
    Extracting information from web documents has become a research area in which new proposals sprout up year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provides a complete picture, because they do not take region extractors into account. These tools are a kind of preprocessor, because they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming a must to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.
  • Keywords
    Web sites; document handling; relevance feedback; search engines; Web crawling; Web document complexity; adaptive content delivery; information extraction; information retrieval; mashups; metasearch engines; region extractors; topic distillation; Data mining; Engines; Feature extraction; HTML; Metasearch; Proposals; Information extractors; enterprise information integration; web documents; wrappers
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.135
  • Filename
    6231632