• DocumentCode
    2457347
  • Title

    Automatic Extraction of Structured Web Data with Domain Knowledge

  • Author

    Derouiche, Nora ; Cautis, Bogdan ; Abdessalem, Talel

  • Author_Institution
    LTCI, Telecom ParisTech, Paris, France
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    726
  • Lastpage
    737
  • Abstract
    We present a novel approach for extracting structured data from the Web, whose goal is to harvest real-world items from template-based HTML pages (the structured Web). It illustrates a two-phase querying of the Web, in which an intentional description of the targeted data is first provided, in a flexible and widely applicable manner. The extraction process then leverages both the input description and the source structure. Our approach is domain-independent, in the sense that it applies to any relation, either flat or nested, describing real-world items. Extensive experiments on five different domains, and a comparison with the main state-of-the-art extraction systems from the literature, illustrate its flexibility and precision. We advocate via our technique that automatic extraction and integration of complex structured data can be done quickly and effectively when the redundancy of the Web meets knowledge about the to-be-extracted data.
  • Keywords
    Internet; query processing; automatic extraction; domain knowledge; input description; intentional description; source structure; structured Web data; template-based HTML pages; two-phase querying; Data mining; Feature extraction; HTML; Semantics; Silicon; Web pages; Wrapping
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2012 IEEE 28th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-0042-1
  • Type

    conf

  • DOI
    10.1109/ICDE.2012.90
  • Filename
    6228128