Title :
Web Data Extraction Based on Visual Information and Partial Tree Alignment
Author :
Siwu Fan ; Xinjun Wang ; Yongquan Dong
Author_Institution :
Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan, China
Abstract :
Web databases contain a huge amount of structured data that can be obtained only through their query interfaces. The query results are presented in dynamically generated web pages, usually in the form of data records, for human use. Automatic web data extraction is critical to web integration. A number of approaches have been proposed. Early work is mostly based on the source code or the tag tree of the page. More recent approaches use visual features to extract data and perform better than the earlier work. However, these approaches still have inherent limitations. In this paper, we propose a novel approach that makes use of visual features to extract data from web pages, including both the data records and the data items. Experimental results on a large set of query result pages from different domains show that the proposed approach is highly effective.
Keywords :
Internet; feature extraction; information retrieval systems; information services; query processing; Web databases; Web integration; Web pages; automatical Web data extraction; data information extraction; data items; data records; partial tree alignment; query interfaces; structured data; visual feature extraction; visual information; Data mining; Databases; Educational institutions; Noise; Visualization; Web data extraction; Web mining; Wrapper generation
Conference_Title :
Web Information System and Application Conference (WISA), 2014 11th
Print_ISBN :
978-1-4799-5726-2
DOI :
10.1109/WISA.2014.12