DocumentCode :
258616
Title :
The research and implementation of web information extraction technology based on multi-level pages
Author :
Hengyu Lai ; Yifei Wei ; Yali Wang ; Mei Song ; Xiaojun Wang
Author_Institution :
Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2014
fDate :
26-27 June 2014
Firstpage :
292
Lastpage :
297
Abstract :
With the development of the Internet, online information has become increasingly rich and complex, and how to extract target information from multi-level web pages and reconstruct it as structured data is worth investigating. This paper puts forward two methods of web information extraction. The first is a width-priority analysis method based on regular expressions, which is more flexible and applicable to all regular data. The second is a depth-priority analysis method based on the DOM tree, which is easier to implement and applicable to HTML structured data. The proposed methods are implemented, and their performance is tested by extracting TV program information from the Yahoo website.
Keywords :
Internet; Web sites; hypermedia markup languages; information retrieval; DOM tree; HTML structured data; TV program information extraction; Web information extraction technology; Yahoo Website; depth priority analysis method; multilevel pages; online information; regular expressions; width priority analysis method; Semi-structured information; web information extraction
fLanguage :
English
Publisher :
IET
Conference_Titel :
Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET
Conference_Location :
Limerick
Type :
conf
DOI :
10.1049/cp.2014.0701
Filename :
6912772