Regional Information Center for Science and Technology - DESP: An Automatic Data Extractor on Deep Web Pages

DocumentCode :

2261062

Title :

DESP: An Automatic Data Extractor on Deep Web Pages

Author :

Ma, Ji ; Shen, Derong ; Nie, Tiezheng

Author_Institution :

Dept. of Comput. Sci. & Eng., Northeastern Univ., Shenyang, China

fYear :

2010

fDate :

20-22 Aug. 2010

Firstpage :

132

Lastpage :

136

Abstract :

We present DESP, an automatic data extractor for Deep Web pages in the book domain, which extracts data items and labels their attributes at the same time. The use case of DESP is to extract book information such as title, author, price and publisher from result pages returned by bookstore web sites. Although DESP targets a specific domain, its method is highly adaptive and can suit other domains. The system consists of two parts. The first is the Data Record Locater, whose Modified Data Locating algorithm overcomes the shortcoming of the MDR algorithm. The second is the Attribute Labeler, whose Detect Combine algorithm gives each data item a more explicit meaning.

Keywords :

Internet; Web sites; information retrieval; DESP; attribute labeler; automatic data extractor; book domain; books information extraction; bookstore Web sites; data record locater; deep Web pages; detect combine algorithm; modified data locating algorithm; Data mining; HTML; Hidden Markov models; Labeling; USA Councils; Web pages; Web; edit distance; string similarity algorithm

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems and Applications Conference (WISA), 2010 7th

Conference_Location :

Hohhot

Print_ISBN :

978-1-4244-8440-9

Type :

conf

DOI :

10.1109/WISA.2010.15

Filename :

5581384

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2261062