Title :
Generating Structured Documents from HTML Tables
Author :
Kim, Yeon-Seok ; Lee, Kyong-Ho
Author_Institution :
Yonsei University
Abstract :
A table is a facility for presenting relational information structurally and concisely. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase cleans up the table, normalizes it, and segments the normalized table into attribute and value areas by checking visual and semantic coherency. In particular, heuristic rules are proposed to handle complex tables. In the structure analysis phase, the hierarchical structure between the attribute and value areas is analyzed and transformed into an XML representation using the proposed table model. Experimental results with a large number of HTML tables show that the proposed method performs better than the conventional method.
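The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the coherency checks, heuristic rules, or table model, so this sketch substitutes a much simpler assumption, namely that the attribute area is the run of leading rows made up entirely of <th> cells and that everything below it is the value area. The names TableGrid, segment, tag_name, and to_xml, as well as the <table>/<record> output elements, are hypothetical. Only colspan is normalized (rowspan is ignored), and well-formed markup with explicit closing tags is assumed.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableGrid(HTMLParser):
    """Collect table cells into rows of (text, is_header) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            span = int(dict(attrs).get("colspan") or 1)
            self._cell = {"text": "", "header": tag == "th", "span": span}

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            cell = self._cell
            # Normalization: repeat a spanning cell so each row is a flat grid row.
            pair = (cell["text"].strip(), cell["header"])
            self.rows[-1].extend([pair] * cell["span"])
            self._cell = None


def segment(rows):
    """Assumed rule: leading all-<th> rows are attributes, the rest are values."""
    split = 0
    while split < len(rows) and rows[split] and all(h for _, h in rows[split]):
        split += 1
    return rows[:split], rows[split:]


def tag_name(text):
    """Turn header text into a usable XML element name ('' if nothing is left)."""
    name = re.sub(r"\W+", "_", text).strip("_")
    return "_" + name if name[:1].isdigit() else name


def to_xml(html):
    parser = TableGrid()
    parser.feed(html)
    attr_rows, value_rows = segment(parser.rows)
    width = len(attr_rows[0]) if attr_rows else 0
    # Structure analysis: read each column's headers top-down as a nesting path.
    paths = []
    for col in range(width):
        path = []
        for row in attr_rows:
            name = tag_name(row[col][0]) if col < len(row) else ""
            if name and (not path or path[-1] != name):
                path.append(name)
        paths.append(path or ["column%d" % col])
    root = ET.Element("table")
    for row in value_rows:
        record = ET.SubElement(root, "record")
        for path, (text, _) in zip(paths, row):
            node = record
            for name in path[:-1]:
                found = node.find(name)
                node = found if found is not None else ET.SubElement(node, name)
            ET.SubElement(node, path[-1]).text = text
    return ET.tostring(root, encoding="unicode")


SAMPLE = """
<table>
<tr><th>Name</th><th colspan="2">Score</th></tr>
<tr><th>Name</th><th>Math</th><th>Art</th></tr>
<tr><td>Kim</td><td>90</td><td>85</td></tr>
</table>
"""

if __name__ == "__main__":
    print(to_xml(SAMPLE))
```

In this sketch a header that spans several columns ("Score") becomes a parent element of the headers beneath it ("Math", "Art"), which is one simple way to express a hierarchy between the attribute and value areas in XML.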
Keywords :
Computer science; Data mining; HTML; Information analysis; Markup languages; Ontologies; Performance analysis; Process design; Text analysis; XML
Conference_Title :
Hybrid Information Technology, 2006. ICHIT '06. International Conference on
Conference_Location :
Cheju Island
Print_ISBN :
0-7695-2674-8
DOI :
10.1109/ICHIT.2006.253669