DocumentCode
3264141
Title
A gateway from HTML to XML
Author
Fu, Tao ; Liu, Mengchi
Author_Institution
Sch. of Comput. Sci., Carleton Univ., Ottawa, Ont., Canada
fYear
2004
fDate
7-9 July 2004
Firstpage
205
Lastpage
214
Abstract
XML is gaining popularity as an industrial standard for presenting and exchanging structured information on the Web. Meanwhile, the majority of online documents are still marked up in HTML, which is designed for display rather than for automated access by applications. To make Web information accessible to applications, and so enable automation, interoperation, and intelligent services, information extraction programs called "wrappers" have been developed to extract structured data from HTML pages. In this paper, we present a layout-based approach that separates the data layer from its presentation in HTML and extracts the pure data, together with its hierarchical structure, into XML. The approach aims to offer a general-purpose methodology that can automatically convert HTML to XML without any tuning for a particular domain.
Keywords
Internet; XML; information retrieval; HTML; Web information accessibility; World Wide Web; information extraction; online documents; structured data extraction; Automation; Classification tree analysis; Computer industry; Computer science; Data mining; Displays; Drives; Intelligent structures
fLanguage
English
Publisher
IEEE
Conference_Titel
Database Engineering and Applications Symposium, 2004. IDEAS '04. Proceedings. International
ISSN
1098-8068
Print_ISBN
0-7695-2168-1
Type
conf
DOI
10.1109/IDEAS.2004.1319793
Filename
1319793