Title :
A gateway from HTML to XML
Author :
Fu, Tao ; Liu, Mengchi
Author_Institution :
Sch. of Comput. Sci., Carleton Univ., Ottawa, Ont., Canada
Abstract :
XML is gaining popularity as an industry standard for presenting and exchanging structured information on the Web. Meanwhile, the majority of online documents are still marked up with HTML, which is designed for display rather than for automated access by applications. To make Web information accessible to applications, and thereby support automation, interoperation, and intelligent services, information extraction programs called "wrappers" have been developed to extract structured data from HTML pages. In this paper, we present a layout-based approach that separates the data layer from its presentation in HTML and extracts the pure data, together with its hierarchical structure, into XML. The approach aims to offer a general-purpose methodology that can automatically convert HTML to XML without any tuning for a particular domain.
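The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of recovering hierarchy from layout cues, not the authors' method. It assumes, purely for illustration, that heading levels (h1 to h6) are the only layout cue and that block-level elements delimit data items; the names HeadingOutline and html_to_xml are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: heading depth is the only layout cue used to infer hierarchy.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
# Assumption: these block-level tags delimit individual pieces of data.
BLOCKS = set(HEADINGS) | {"p", "li", "td", "th", "div", "tr", "ul", "ol", "table", "br"}


class HeadingOutline(HTMLParser):
    """Hypothetical helper: nests text under the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [(0, self.root)]  # (heading level, XML element)
        self.heading_level = None      # set while inside <h1>..<h6>
        self.buffer = []               # text collected since the last block boundary

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKS:
            self._flush()
        if tag in HEADINGS:
            self.heading_level = HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in BLOCKS:
            self._flush()
        if tag in HEADINGS:
            self.heading_level = None

    def handle_data(self, data):
        self.buffer.append(data)

    def _flush(self):
        # Collapse whitespace; presentation-only gaps carry no data.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if not text:
            return
        if self.heading_level is not None:
            # Close every open section at the same or a deeper level.
            while self.stack[-1][0] >= self.heading_level:
                self.stack.pop()
            section = ET.SubElement(self.stack[-1][1], "section", title=text)
            self.stack.append((self.heading_level, section))
        else:
            ET.SubElement(self.stack[-1][1], "item").text = text

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def html_to_xml(html):
    """Return an XML string whose nesting mirrors the heading outline."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


sample = ("<h1>Books</h1><p>Intro</p><h2>Fiction</h2>"
          "<ul><li>Dune</li><li>Emma</li></ul><h1>Music</h1><p>Jazz</p>")
print(html_to_xml(sample))
```

In this toy version the presentation markup (paragraphs, lists, inline tags) is discarded and only the text and its nesting survive. A general-purpose converter of the kind the abstract describes would need further cues, such as tables, fonts, and spacing, which this sketch does not attempt.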
Keywords :
Internet; XML; information retrieval; HTML; Web information accessibility; World Wide Web; information extraction; online documents; structured data extraction; Automation; Classification tree analysis; Computer industry; Computer science; Data mining; Displays; Drives; Intelligent structures
Conference_Title :
Proceedings. International Database Engineering and Applications Symposium, 2004. IDEAS '04.
Print_ISBN :
0-7695-2168-1
DOI :
10.1109/IDEAS.2004.1319793