DocumentCode :
2850332
Title :
A probabilistic approach for adapting information extraction wrappers and discovering new attributes
Author :
Wong, Tak-Lam ; Lam, Wai
Author_Institution :
Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, China
fYear :
2004
fDate :
1-4 Nov. 2004
Firstpage :
257
Lastpage :
264
Abstract :
We develop a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Wrapper adaptation aims at automatically adapting a previously learned wrapper from the source Web site to a new unseen site for information extraction. One unique characteristic of our framework is that it can discover new or previously unseen attributes as well as headers from the new site. It is based on a generative model for the generation of text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site. The first kind of information is the extraction knowledge contained in the previously learned wrapper from the source Web site. The second kind of information is the previously extracted or collected items. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper for the new unseen site. To solve the new attribute discovery problem, we develop a model which analyzes the surrounding text fragments of the attributes in the new unseen site. A Bayesian learning method is developed to discover the new attributes and their headers. The EM technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.
Keywords :
Bayes methods; Web sites; data mining; learning (artificial intelligence); Bayesian learning; EM technique; Web page; Web site; attribute discovery; information extraction wrappers; knowledge extraction; learned wrapper; probability; text fragments; wrapper adaptation; Bayesian methods; Books; Data mining; Humans; Learning systems; Research and development management; Systems engineering and theory; Web pages; Web search
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
Print_ISBN :
0-7695-2142-8
Type :
conf
DOI :
10.1109/ICDM.2004.10111
Filename :
1410292