DocumentCode :
2791651
Title :
Generating Structured Documents from HTML Tables
Author :
Kim, Yeon-Seok ; Lee, Kyong-Ho
Author_Institution :
Yonsei University
Volume :
2
fYear :
2006
fDate :
9-11 Nov. 2006
Firstpage :
605
Lastpage :
610
Abstract :
A table is a facility for presenting relational information structurally and concisely. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase cleans up the table and segments the normalized table into attribute and value areas by checking visual and semantic coherency. In particular, heuristic rules are proposed to handle complex tables. In the structure analysis phase, the hierarchical structure between attribute and value areas is analyzed and transformed into an XML representation using the proposed table model. Experimental results with a large number of HTML tables show that the proposed method performs better than the conventional method.
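The two phases named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify their coherency checks, heuristic rules, or table model, so the sketch substitutes one simple assumption, namely that leading rows made up entirely of `<th>` cells form the attribute area and all remaining rows form the value area. The names `TableParser`, `segment`, and `to_xml` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableParser(HTMLParser):
    """Collects each table row as a list of (cell_tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None  # [tag, list of text fragments]

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Normalize whitespace inside the cell (a stand-in for "clean-up").
            text = " ".join("".join(self._cell[1]).split())
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def segment(rows):
    """Simplified area segmentation (assumption: leading all-<th> rows are attributes)."""
    split = 0
    while split < len(rows) and rows[split] and all(t == "th" for t, _ in rows[split]):
        split += 1
    return rows[:split], rows[split:]


def to_xml(html):
    """Simplified structure analysis: pair each value with its column attribute."""
    parser = TableParser()
    parser.feed(html)
    attr_rows, value_rows = segment(parser.rows)
    # Assumption: only the last attribute row is used; nested headers are ignored.
    names = [text for _, text in attr_rows[-1]] if attr_rows else []
    root = ET.Element("table")
    for row in value_rows:
        record = ET.SubElement(root, "record")
        for i, (_, text) in enumerate(row):
            name = names[i] if i < len(names) else ""
            tag = re.sub(r"\W+", "_", name).strip("_") or f"col{i + 1}"
            if tag[0].isdigit():
                tag = "_" + tag  # XML element names cannot start with a digit
            ET.SubElement(record, tag).text = text
    return ET.tostring(root, encoding="unicode")


sample = (
    "<table><tr><th>Name</th><th>Unit Price</th></tr>"
    "<tr><td>Pen</td><td>2</td></tr></table>"
)
result = to_xml(sample)
```

For the sample table, each value row becomes a `record` element whose children are named after the column attributes. Spanning cells, multi-level headers, and attribute columns on the left, which are the complex cases the abstract's heuristic rules target, are outside the scope of this sketch.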
Keywords :
Computer science; Data mining; HTML; Information analysis; Markup languages; Ontologies; Performance analysis; Process design; Text analysis; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Hybrid Information Technology, 2006. ICHIT '06. International Conference on
Conference_Location :
Cheju Island
Print_ISBN :
0-7695-2674-8
Type :
conf
DOI :
10.1109/ICHIT.2006.253669
Filename :
4021274