• DocumentCode
    419374
  • Title

    Data extraction from Web data sources

  • Author

    Robinson, Jerome

  • Author_Institution
    Dept. of Comput. Sci., Essex Univ., Colchester, UK
  • fYear
    2004
  • fDate
    30 Aug.-3 Sept. 2004
  • Firstpage
    282
  • Lastpage
    288
  • Abstract
    An explanation is given of the basic data structures used in a new page analysis technique to create wrappers (data extractors) for the result pages that Web sites produce in response to user queries submitted via Web page forms. The key structure, called a tpGrid, is a representation of the Web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
  • Keywords
    Web sites; data structures; grid computing; hypermedia markup languages; information retrieval; HTML code; Web data source; Web page analysis; Web sites; data extraction; data extractor; data structure; repetition patterns; tagSets; tpGrid; wrappers; Computer science; Data mining; Data structures; Databases; HTML; Pattern analysis; Production; Springs; Web pages; Web server;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Database and Expert Systems Applications, 2004. Proceedings. 15th International Workshop on
  • ISSN
    1529-4188
  • Print_ISBN
    0-7695-2195-9
  • Type

    conf

  • DOI
    10.1109/DEXA.2004.1333487
  • Filename
    1333487