DocumentCode :
2533254
Title :
A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration
Author :
Liu, Tong ; Shen, Derong ; Shan, Jing ; Nie, Tiezheng ; Kou, Yue
Author_Institution :
Coll. of Inf. Sci. & Eng., Northeastern Univ., Shenyang, China
fYear :
2011
fDate :
21-23 Oct. 2011
Firstpage :
150
Lastpage :
155
Abstract :
Nowadays, the most popular methods for web data extraction (WDE) are top-down ones that depend on page structure. However, these techniques do not scale well to complex pages. Consequently, we put forward a bottom-up approach to WDE based on entity recognition and integration, which avoids over-dependence on the structure of web pages. The proposed approach focuses first on labeling primary text sequences and also takes their repetitive patterns into account. We propose a two-level extraction model for entity recognition and a repetitive pattern extraction algorithm for entity integration. Our approach can effectively reduce attribute labeling mistakes. We also demonstrate our approach with experimental results. The conclusion is that our approach performs better than traditional extraction techniques, especially on complex Web pages.
Keywords :
Internet; information retrieval; Web data extraction; Web pages; bottom-up approach; entity integration; entity recognition; text sequences; Arrays; Context; Data mining; HTML; Labeling; Redundancy; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Information Systems and Applications Conference (WISA), 2011 Eighth
Conference_Location :
Chongqing
Print_ISBN :
978-1-4577-1812-0
Type :
conf
DOI :
10.1109/WISA.2011.37
Filename :
6093582