DocumentCode :
2931316
Title :
Learning Rules to Pre-process Web Data for Automatic Integration
Author :
Simon, Kai ; Hornung, Thomas ; Lausen, Georg
Author_Institution :
Inst. für Informatik, Univ. Freiburg
fYear :
2006
fDate :
Nov. 2006
Firstpage :
107
Lastpage :
116
Abstract :
Web pages such as product catalogues and search engine result pages often follow a global layout template, which makes it easier for a user to retrieve information. In this paper we present a technique that makes such content machine-processable by extracting it and transforming it into tabular form. We achieve this with ViPER, our fully automatic wrapper system, which localizes and extracts structured data records from such Web pages using a strategy based on the visual perception of a Web page. The first contribution of this paper is a detailed account of the post-processing heuristics of ViPER, which are materialized as a set of rules. Once these rules are defined, the regular content of a Web page can be abstracted into a relational view. Second, we show that new, unseen content rendered with the same layout only has to be extracted by ViPER; the remaining transformation can be performed by applying the learned rules.
Keywords :
Internet; information retrieval; learning (artificial intelligence); ViPER; Web data preprocessing; Web pages; Web sites; automatic integration; fully automatic wrapper system; learning rules; product catalogues; search engine; structured data record extraction; structured data record localization; Data mining; Databases; HTML; Humans; Information resources; Search engines; Visual perception
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Rules and Rule Markup Languages for the Semantic Web, Second International Conference on
Conference_Location :
Athens, GA
Print_ISBN :
0-7695-2652-7
Type :
conf
DOI :
10.1109/RULEML.2006.16
Filename :
4032397