Title :
Parallel Approach and Platform for Large-Scale Web Data Extraction
Author :
Yi Shen ; Shengsheng Shi ; Haitao Wang ; Wu Wei ; Chunfeng Yuan ; Yihua Huang
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
Abstract :
As the most popular information publishing platform, the Web contains a great deal of valuable information of interest to users and applications. Although many data extraction techniques have been studied over the last decade, they still fall far short of meeting the needs of real-world data extraction. On the one hand, most of them cannot support the whole web information extraction process, which involves three stages: web page navigation, data extraction, and data integration. On the other hand, they cannot support parallel data extraction for large-scale collections of web pages. In this paper, we propose a parallel approach and platform based on Hadoop MapReduce for large-scale web data extraction. Our approach can perform the whole three-stage web data extraction process in parallel. Experimental results show that our approach is efficient and can achieve linear speedup.
Keywords :
Internet; Web sites; information retrieval; parallel processing; Hadoop MapReduce; Web information extraction process; Web page navigation; data integration; information publishing platform; large-scale Web data extraction; parallel approach; parallel data extraction process; Data integration; Data mining; Data models; Knowledge engineering; Navigation; Web pages; Large-scale web data extraction; Parallel platform; Parallel web data extraction; Web data integration; Web page navigation;
Conference_Title :
Advanced Cloud and Big Data (CBD), 2013 International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4799-3260-3
DOI :
10.1109/CBD.2013.24