Title :
Extraction and integration information in HTML tables
Author :
Li, Shijun ; Peng, Zhiyong ; Liu, Mengchi
Author_Institution :
Sch. of Comput., Wuhan Univ., China
Abstract :
A large amount of the information available on the Web is formatted in HTML tables, which are mainly presentation-oriented and not suited to database applications. As a result, capturing the information in HTML tables semantically and integrating the relevant information is a challenge. In this paper, we present a new approach that automatically captures the semantic hierarchies of HTML tables and semi-automatically integrates them. It first automatically captures the attribute-value pairs in HTML tables by normalization, and introduces the notion of an eigenvalue in formatting information to recognize the headings of HTML tables. After the global concepts and global schema are generated manually by defining what data are to be integrated, it learns the lexical semantic set and the contexts for each global concept by labelling the attributes of example HTML tables with their corresponding global concepts. Finally, it integrates the data of each source HTML table, using the lexical semantic sets and the contexts to eliminate conflicts and resolve the nondeterministic problems in mapping each source schema to the global schema.
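Illustrative sketch (not taken from the paper): the abstract does not spell out the normalization step, so the Python below shows only one plausible reading of it. It assumes that normalization means expanding rowspan and colspan so every grid position holds a cell, and that the number of heading rows is already known. The paper's eigenvalue-based heading recognition is not reproduced here. The names TableGridParser, normalize, attribute_value_pairs and heading_rows are hypothetical.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells row by row as [text, rowspan, colspan]."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data


def normalize(html):
    """Expand rowspan/colspan so that every grid position holds a cell text."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip positions filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text.strip()
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(grid, heading_rows=1):
    """Pair each data cell with the heading path of its column.

    Assumption: the first `heading_rows` rows are headings; nested headings
    are joined with "." to approximate a semantic hierarchy.
    """
    heads = []
    for c in range(len(grid[0])):
        path = []
        for r in range(heading_rows):
            label = grid[r][c]
            if label and (not path or path[-1] != label):
                path.append(label)
        heads.append(".".join(path))
    return [list(zip(heads, row)) for row in grid[heading_rows:]]
```

For a table whose "Score" heading spans the sub-headings "Math" and "Art", this sketch yields attributes such as "Score.Math" and "Score.Art", a rough stand-in for the semantic hierarchy the abstract refers to.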
Keywords :
data structures; eigenvalues and eigenfunctions; hypermedia markup languages; knowledge acquisition; merging; HTML tables; attribute-value pairs; data integration; information extraction; information formatting; information integration; lexical semantic set; nondeterministic problems; semantic hierarchy; semiautomatic integration; Application software; Computer science; Data mining; Databases; Drives; Eigenvalues and eigenfunctions; HTML; Information retrieval; Labeling; Software engineering;
Conference_Title :
Computer and Information Technology, 2004. CIT '04. The Fourth International Conference on
Print_ISBN :
0-7695-2216-5
DOI :
10.1109/CIT.2004.1357214