  • DocumentCode
    2850332
  • Title
    A probabilistic approach for adapting information extraction wrappers and discovering new attributes
  • Author
    Wong, Tak-Lam ; Lam, Wai
  • Author_Institution
    Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, China
  • fYear
    2004
  • fDate
    1-4 Nov. 2004
  • Firstpage
    257
  • Lastpage
    264
  • Abstract
    We develop a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Wrapper adaptation aims at automatically adapting a wrapper previously learned from a source Web site to a new, unseen site for information extraction. One unique characteristic of our framework is that it can discover new or previously unseen attributes, as well as their headers, from the new site. It is based on a generative model for the generation of text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site. The first is the extraction knowledge contained in the wrapper previously learned from the source Web site. The second is the set of previously extracted or collected items. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new unseen site. To solve the new attribute discovery problem, we develop a model that analyzes the text fragments surrounding the attributes in the new unseen site. A Bayesian learning method is developed to discover the new attributes and their headers. The EM technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.
  • Keywords
    Bayes methods; Web sites; data mining; learning (artificial intelligence); Bayesian learning; EM technique; Web page; Web site; attribute discovery; information extraction wrappers; knowledge extraction; learned wrapper; probability; text fragments; wrapper adaptation; Bayesian methods; Books; Data mining; Humans; Learning systems; Research and development management; Systems engineering and theory; Web pages; Web search
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
  • Print_ISBN
    0-7695-2142-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2004.10111
  • Filename
    1410292