Regional Information Center for Science and Technology (RICeST) - The research and implementation of web information extraction technology based on multi-level pages

DocumentCode :

258616

Title :

The research and implementation of web information extraction technology based on multi-level pages

Author :

Hengyu Lai ; Yifei Wei ; Yali Wang ; Mei Song ; Xiaojun Wang

Author_Institution :

Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China

fYear :

2014

fDate :

26-27 June 2014

Firstpage :

292

Lastpage :

297

Abstract :

With the development of the Internet, online information has become increasingly rich and complex, and how to extract target information from multi-level web pages and reconstruct it as structured data is worth investigating. This paper puts forward two methods of web information extraction. The first is a width-priority (breadth-first) analysis method based on regular expressions, which is more flexible and applicable to all regular data. The second is a depth-priority (depth-first) analysis method based on the DOM tree, which is easier to implement and applicable to HTML-structured data. The proposed methods are implemented, and their performance is tested by extracting TV program information from the Yahoo website.
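The two traversal strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `PAGES` site, the two regular expressions, and the function names are hypothetical, and `xml.etree.ElementTree` on well-formed markup stands in for a real HTML DOM parser.

```python
import re
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical two-level "site" (URL -> markup), standing in for fetched pages.
PAGES = {
    "/": '<ul><li><a href="/ch1">Channel 1</a></li>'
         '<li><a href="/ch2">Channel 2</a></li></ul>',
    "/ch1": '<div class="prog"><span>19:00</span> <b>News</b></div>',
    "/ch2": '<div class="prog"><span>20:00</span> <b>Movie</b></div>',
}

# Assumed patterns for this toy markup only.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
PROG_RE = re.compile(r'<span>([^<]+)</span>\s*<b>([^<]+)</b>')


def extract_breadth_first(start):
    """Visit pages level by level; pull fields out with regular expressions."""
    records, queue, seen = [], deque([(start, None)]), {start}
    while queue:
        url, channel = queue.popleft()
        html = PAGES[url]
        for start_time, title in PROG_RE.findall(html):
            records.append((channel, start_time, title))
        # Queue child pages; they are processed after the current level.
        for href, label in LINK_RE.findall(html):
            if href in PAGES and href not in seen:
                seen.add(href)
                queue.append((href, label))
    return records


def extract_depth_first(url, channel=None, seen=None):
    """Walk each page's element tree; follow a link as soon as it is met."""
    seen = set() if seen is None else seen
    seen.add(url)
    records = []
    root = ET.fromstring(PAGES[url])
    for node in root.iter():  # document-order walk, root included
        if node.tag == "div" and node.get("class") == "prog":
            records.append((channel, node.findtext("span"), node.findtext("b")))
        elif node.tag == "a":
            href = node.get("href")
            if href in PAGES and href not in seen:
                # Descend into the linked page before continuing this one.
                records.extend(extract_depth_first(href, node.text, seen))
    return records
```

Calling `extract_breadth_first("/")` or `extract_depth_first("/")` on this toy site yields the same list of `(channel, time, title)` tuples; the difference is only in the order pages are visited and in whether fields are located by pattern or by tree position.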

Keywords :

Internet; Web sites; hypermedia markup languages; information retrieval; DOM tree; HTML structured data; TV program information extraction; Web information extraction technology; Yahoo Website; depth priority analysis method; multilevel pages; online information; regular expressions; width priority analysis method; Semi-structured information; web information extraction

fLanguage :

English

Publisher :

IET

Conference_Title :

25th IET Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014)

Conference_Location :

Limerick

Type :

conf

DOI :

10.1049/cp.2014.0701

Filename :

6912772

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=258616