• DocumentCode
    3264141
  • Title
    A gateway from HTML to XML
  • Author
    Fu, Tao; Liu, Mengchi
  • Author_Institution
    Sch. of Comput. Sci., Carleton Univ., Ottawa, Ont., Canada
  • fYear
    2004
  • fDate
    7-9 July 2004
  • Firstpage
    205
  • Lastpage
    214
  • Abstract
    XML is gaining popularity as an industrial standard for presenting and exchanging structured information on the Web. Meanwhile, the majority of online documents are still marked up with HTML, which is designed specifically for display rather than for automated access by applications. To make Web information accessible to applications and thereby enable automation, interoperation, and intelligent services, information extraction programs called "wrappers" have been developed to extract structured data from HTML pages. In this paper, we present a layout-based approach that separates the data layer from its presentation in HTML and extracts the pure data, together with its hierarchical structure, into XML. The approach aims to offer a general-purpose methodology that can automatically convert HTML to XML without any tuning for a particular domain.
  • Keywords
    Internet; XML; information retrieval; HTML; Web information accessibility; World Wide Web; information extraction; online documents; structured data extraction; Automation; Classification tree analysis; Computer industry; Computer science; Data mining; Displays; Drives; Intelligent structures
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Applications Symposium, 2004. IDEAS '04. Proceedings. International
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2168-1
  • Type
    conf
  • DOI
    10.1109/IDEAS.2004.1319793
  • Filename
    1319793