• DocumentCode
    1802693
  • Title

    Semi-Automated Extraction of Targeted Data from Web Pages

  • Author

    Estiévenart, Fabrice ; Meurisse, Jean-Roch ; Hainaut, Jean-Luc ; Thiran, Philippe

  • Author_Institution
    CETIC, Belgium
  • fYear
    2006
  • fDate
    2006
  • Firstpage
    48
  • Lastpage
    48
  • Abstract
    The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-embedded data can be extracted towards a more computable format. The definition of mapping rules is based on direct user input mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
  • Keywords
    Bridges; Computer science; Data mining; HTML; Humans; Information management; Information resources; Software agents; Web sites; XML;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering Workshops, 2006. Proceedings. 22nd International Conference on
  • Conference_Location
    Atlanta, GA, USA
  • Print_ISBN
    0-7695-2571-7
  • Type

    conf

  • DOI
    10.1109/ICDEW.2006.135
  • Filename
    1623843