• DocumentCode
    2533878
  • Title

    Automatically maintaining wrappers for Web sources

  • Author

    Raposo, Juan ; Pan, Alberto ; Álvarez, Manuel ; Hidalgo, Justo

  • Author_Institution
    A Coruna Univ., Spain
  • fYear
    2005
  • fDate
    25-27 July 2005
  • Firstpage
    105
  • Lastpage
    114
  • Abstract
    A substantial subset of Web data follows some kind of underlying structure. Nevertheless, HTML does not contain any schema or semantic information about the data it represents. A program able to provide software applications with a structured view of those semi-structured Web sources is usually called a wrapper. Wrappers are able to accept a query against the source and return a set of structured results, thus enabling applications to access Web data in a manner similar to accessing information from databases. A significant problem in this approach arises because Web sources may undergo changes that invalidate the current wrappers. In this paper, we present novel heuristics and algorithms to address this problem. Our approach is based on collecting some query results during wrapper operation. Then, when the source changes, the collected results are used to generate a set of labeled examples, which are provided as input to a wrapper induction algorithm able to regenerate the wrapper. We have tested our methods in several real-world Web data extraction domains, obtaining high accuracy in all the steps of the process.
  • Keywords
    Internet; information retrieval; software maintenance; HTML; Web data extraction domains; automatic wrapper maintenance; semi-structured Web sources; wrapper induction algorithm; Application software; Data engineering; Data mining; Databases; Heuristic algorithms; Induction generators; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2404-4
  • Type

    conf

  • DOI
    10.1109/IDEAS.2005.13
  • Filename
    1540901