DocumentCode
2850332
Title
A probabilistic approach for adapting information extraction wrappers and discovering new attributes
Author
Wong, Tak-Lam ; Lam, Wai
Author_Institution
Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, China
fYear
2004
fDate
1-4 Nov. 2004
Firstpage
257
Lastpage
264
Abstract
We develop a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Wrapper adaptation aims to automatically adapt a wrapper previously learned from a source Web site to a new, unseen site for information extraction. One unique characteristic of our framework is that it can discover new or previously unseen attributes, as well as their headers, from the new site. It is based on a generative model for the generation of text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site. The first is the extraction knowledge contained in the previously learned wrapper. The second is the set of previously extracted or collected items. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new unseen site. To solve the new attribute discovery problem, we develop a model that analyzes the text fragments surrounding the attributes in the new unseen site. A Bayesian learning method is developed to discover the new attributes and their headers. The EM technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.
Keywords
Bayes methods; Web sites; data mining; learning (artificial intelligence); Bayesian learning; EM technique; Web page; Web site; attribute discovery; information extraction wrappers; knowledge extraction; learned wrapper; probability; text fragments; wrapper adaptation; Bayesian methods; Books; Data mining; Humans; Learning systems; Research and development management; Systems engineering and theory; Web pages; Web search; Web sites;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
Print_ISBN
0-7695-2142-8
Type
conf
DOI
10.1109/ICDM.2004.10111
Filename
1410292