DocumentCode :
557151
Title :
An integrated approach for information extraction
Author :
Xia, YingJu ; Yang, YuHang ; Ge, Fujiang ; Zhang, Shu ; Yu, Hao
Author_Institution :
Fujitsu R&D Center Co., Ltd, Beijing, China
Volume :
1
fYear :
2011
fDate :
24-26 Oct. 2011
Firstpage :
122
Lastpage :
127
Abstract :
This paper proposes an integrated, wrapper-based approach to automatic information extraction from forum, blog, and news web sites. The wrapper is generated with a tree alignment and transfer learning method: the tree alignment algorithm finds the best-matching structure of the input web pages, and a form of linear regression estimates the weights of the different tag matchings. For wrapper maintenance, a log likelihood ratio test detects change points in the series of similarity values computed between the wrapper and the input web pages. Experimental results show that the method achieves high accuracy and steady performance.
Keywords :
Web sites; information retrieval; learning (artificial intelligence); regression analysis; trees (mathematics); Web pages; automatic information extraction; blogs; forums; integrated approach; linear regression method; log likelihood ratio test; news Web sites; tag matching; transfer learning method; tree alignment method; wrapper maintenance; Accuracy; Blogs; Data mining; Estimation; Linear regression; Maintenance engineering; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Science and Service Science (NISS), 2011 5th International Conference on New Trends in
Conference_Location :
Macao
Print_ISBN :
978-1-4577-0665-3
Type :
conf
Filename :
6093405