• DocumentCode
    557151
  • Title
    An integrated approach for information extraction
  • Author
    Xia, YingJu ; Yang, YuHang ; Ge, Fujiang ; Zhang, Shu ; Yu, Hao
  • Author_Institution
    Fujitsu R&D Center Co., Ltd, Beijing, China
  • Volume
    1
  • fYear
    2011
  • fDate
    24-26 Oct. 2011
  • Firstpage
    122
  • Lastpage
    127
  • Abstract
    This paper proposes an integrated approach to automatic information extraction from forums, blogs, and news web sites using wrappers. A tree alignment and transfer learning method is presented to generate the wrapper. The tree alignment algorithm is adopted to find the best matching structure of the input web pages, and a linear regression method is employed to obtain the weights of the different tag matches. For wrapper maintenance, the paper presents a method that uses a log likelihood ratio test to detect change points in the similarity series obtained from the wrapper and the input web pages. Experimental results show that the method achieves high accuracy and steady performance.
  • Keywords
    Web sites; information retrieval; learning (artificial intelligence); regression analysis; trees (mathematics); Web pages; automatic information extraction; blogs; forums; integrated approach; linear regression method; log likelihood ratio test; news Web sites; tag matching; transfer learning method; tree alignment method; wrapper maintenance; Accuracy; Blogs; Data mining; Estimation; Linear regression; Maintenance engineering
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Science and Service Science (NISS), 2011 5th International Conference on New Trends in
  • Conference_Location
    Macao
  • Print_ISBN
    978-1-4577-0665-3
  • Type
    conf
  • Filename
    6093405