Title :
The research and implementation of web information extraction technology based on multi-level pages
Author :
Hengyu Lai ; Yifei Wei ; Yali Wang ; Mei Song ; Xiaojun Wang
Author_Institution :
Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
With the development of the Internet, online information is becoming increasingly rich and complex, and how to extract target information from multi-level web pages and reconstruct it as structured data is worth investigating. This paper puts forward two methods of web information extraction. The first is a width-priority (breadth-first) analysis method based on regular expressions, which is more flexible and applicable to any regularly formatted data. The second is a depth-priority (depth-first) analysis method based on the DOM tree, which is easier to implement and applicable to HTML structured data. Both methods are implemented, and their performance is tested by extracting TV program information from the Yahoo website.
Keywords :
Internet; Web sites; hypermedia markup languages; information retrieval; DOM tree; HTML structured data; TV program information extraction; Web information extraction technology; Yahoo Website; depth priority analysis method; multilevel pages; online information; regular expressions; width priority analysis method; Semi-structured information; web information extraction
Conference_Title :
25th IET Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014)
Conference_Location :
Limerick
DOI :
10.1049/cp.2014.0701
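
Illustrative sketch :
The abstract names two extraction strategies without showing them. The following is a minimal Python sketch of how such strategies might look, not the authors' implementation. Everything in it is an assumption made for illustration: the in-memory `PAGES` site, its markup, the regular expressions, and the function names `extract_breadth_first` and `extract_depth_first` are all hypothetical. The DOM builder also assumes well-formed HTML with every tag explicitly closed.

```python
import re
from collections import deque
from html.parser import HTMLParser

# Hypothetical two-level site: an index page that links to detail pages.
PAGES = {
    "/index": '<ul><li><a href="/ch1">Channel One</a></li>'
              '<li><a href="/ch2">Channel Two</a></li></ul>',
    "/ch1": '<table><tr><td class="time">19:00</td>'
            '<td class="title">News</td></tr></table>',
    "/ch2": '<table><tr><td class="time">20:30</td>'
            '<td class="title">Movie</td></tr></table>',
}

# Assumed patterns; they only fit the sample markup above.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
ROW_RE = re.compile(
    r'<td class="time">([^<]+)</td><td class="title">([^<]+)</td>')


def extract_breadth_first(start):
    """Regex-based extraction that visits pages level by level (queue)."""
    records, seen, queue = [], {start}, deque([(start, None)])
    while queue:
        url, channel = queue.popleft()
        html = PAGES[url]
        for time, title in ROW_RE.findall(html):
            records.append({"channel": channel, "time": time, "title": title})
        for href, text in LINK_RE.findall(html):
            if href in PAGES and href not in seen:
                seen.add(href)
                queue.append((href, text))  # defer until this level is done
    return records


class Node:
    """Minimal DOM-like node: tag, attributes, text, children."""
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_depth_first(url, channel=None, seen=None):
    """DOM-based extraction that follows each link as soon as it is met."""
    seen = set() if seen is None else seen
    seen.add(url)
    records = []

    def walk(node):
        if node.tag == "tr":
            cells = {c.attrs.get("class"): c.text
                     for c in node.children if c.tag == "td"}
            if "time" in cells and "title" in cells:
                records.append({"channel": channel,
                                "time": cells["time"],
                                "title": cells["title"]})
        elif node.tag == "a":
            href = node.attrs.get("href")
            if href in PAGES and href not in seen:
                # Descend into the linked page before continuing here.
                records.extend(extract_depth_first(href, node.text, seen))
        for child in node.children:
            walk(child)

    walk(parse(PAGES[url]))
    return records
```

In this sketch the regex version depends only on textual patterns, so it could in principle be pointed at any regularly formatted text, while the DOM version depends on the tag structure and therefore only fits HTML. This mirrors the trade-off stated in the abstract.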