Title :
Top-down extraction of semi-structured data
Author :
Ribeiro-Neto, Berthier ; Laender, Alberto H F ; Da Silva, Altigran S.
Author_Institution :
Dept. of Comput. Sci., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil
Abstract :
We propose an innovative approach to extracting semi-structured data from Web sources. The idea is to collect a couple of example objects from the user and to use this information to extract new objects from new pages or texts. We propose a top-down strategy that extracts complex objects and decomposes them into less complex objects until atomic objects have been extracted. Through experimentation, we demonstrate that, with a small number of given examples, our strategy is able to extract most of the objects present in a Web source given as input.
Keywords :
data handling; information resources; information retrieval; Web sources; atomic objects; complex objects; example objects; object decomposition; semi-structured data; top-down extraction; top-down strategy; Computer science; Data mining; Databases; Electrical capacitance tomography; Electronic switching systems; Explosives; Natural language processing; Ontologies; Read only memory; Web pages;
Conference_Title :
String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware
Conference_Location :
Cancun
Print_ISBN :
0-7695-0268-7
DOI :
10.1109/SPIRE.1999.796593