DocumentCode :
177172
Title :
A novel method for extracting entity data from Deep Web precisely
Author :
Hai-tao Yu ; Jian-Yi Guo ; Zheng-Tao Yu ; Yan-Tuan Xian ; Xin Yan
Author_Institution :
Sch. of Inf. Eng. & Autom., Kunming Univ. of Sci. & Technol., Kunming, China
fYear :
2014
fDate :
May 31 2014-June 2 2014
Firstpage :
5049
Lastpage :
5053
Abstract :
To make better use of the information hidden in the Deep Web and to gain fast, accurate access to the entity data embedded in it, this paper presents a method for precisely extracting entity data from the Deep Web and designs an entity extraction system that extracts this data automatically. First, a web crawler is designed around the characteristics of the Deep Web and used to fetch resources from the Internet. Second, the fetched web resources are preprocessed, and non-standard pages are normalized. Finally, the entity data is located and extracted: based on the hierarchy and layout features of the DOM tree, XPath is combined with regular expressions (RegExp) to locate the entity data, and the extracted entity attributes and attribute values are then stored. Experiments show that this method locates and extracts entity data from the Deep Web quickly and efficiently, and achieves higher accuracy.
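The final step of the abstract (an XPath expression narrows the DOM tree to the region holding the entity, then a regular expression splits each node into an attribute and its value) can be illustrated with a minimal sketch. It is not the authors' system: the sample page, the `detail` table id, the "attribute: value" text layout, and the names `PAGE`, `PAIR`, and `extract_entity` are all assumptions made for illustration. The page is assumed to be already normalized to well-formed markup, as the preprocessing step would provide, and only the limited XPath subset of Python's standard library is used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result page, assumed to be already normalized (well-formed).
PAGE = """
<html><body>
  <div class="nav">Home | Search</div>
  <table id="detail">
    <tr><td>Name: Kunming Lake Hotel</td></tr>
    <tr><td>Phone: 0871-5551234</td></tr>
    <tr><td>Price: 320 CNY</td></tr>
  </table>
</body></html>
"""

# Assumed layout: each cell holds "attribute: value" on one line.
PAIR = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_entity(page, xpath=".//table[@id='detail']/tr/td"):
    """Locate candidate nodes with XPath, then split them with a RegExp."""
    root = ET.fromstring(page)
    entity = {}
    for cell in root.findall(xpath):
        # Join all text inside the cell, including nested inline tags.
        text = "".join(cell.itertext())
        match = PAIR.match(text)
        if match:  # cells that do not look like "attribute: value" are skipped
            entity[match.group(1)] = match.group(2)
    return entity


entity = extract_entity(PAGE)
print(entity)
```

The XPath step uses the hierarchy of the page to discard navigation and other clutter before any text matching happens, so the regular expression only has to handle the local "attribute: value" pattern. A real Deep Web page would likely need a full XPath engine and site-specific expressions.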
Keywords :
Internet; document handling; information retrieval; DOM tree; RegExp; Web crawler; Web resources; XPath; attribute values; deep Web; document object model; embedded entity data access; entity attributes; entity data extraction; hidden information value; Crawlers; Data mining; Feature extraction; HTML; Standards; DOM; Deep Web; Entity Extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
The 26th Chinese Control and Decision Conference (2014 CCDC)
Conference_Location :
Changsha
Print_ISBN :
978-1-4799-3707-3
Type :
conf
DOI :
10.1109/CCDC.2014.6853078
Filename :
6853078