Title :
A peer-to-peer based passive web crawling system
Author :
Chen, Qing-cai ; Yang, Xiao-hong ; Wang, Xiao-long
Author_Institution :
Dept. of Comput. Sci. & Technol., Harbin Inst. of Technol., Shenzhen, China
Abstract :
Despite the commercial success of search engines and large-scale web page crawlers, the problems of page refresh, new URL discovery, large file downloading, and distributed multimedia content feature extraction and indexing remain open. Because each crawler works independently, it is very hard to solve all of these problems within the classical web crawler architecture. To address them, this paper proposes an innovative client/server-based web crawling system. The system consists of a crawler server and a crawler client, which run at the search engine end and the website end, respectively. The crawler server registers itself with the client and joins a temporary peer-to-peer network to cooperate and share downloaded data with other crawler servers. Unlike classical crawlers, the data downloading procedure is initiated by the client, so from the crawler server's point of view this is a “passive” web crawling system. The main benefits of the system include the ability of a crawler to track web changes in a timely manner, savings in website bandwidth, the ability to download large files or multimedia content features, and the ability to protect intellectual property while indexing and searching the content. Experiments on a simulation system show its efficiency and practicability for real Internet environments.
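The interaction described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the abstract does not specify the protocol, so the class names, the method names (`register`, `publish_change`, `receive`), and the choice to upload to the first registered server are all assumptions made for illustration only.

```python
class CrawlerServer:
    """Search-engine-side component (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.peers = []   # other crawler servers in the temporary P2P group
        self.store = {}   # url -> content received so far

    def receive(self, url, content, from_peer=False):
        # The server never fetches pages itself; it only accepts pushed data.
        self.store[url] = content
        if not from_peer:
            # Assumption: the server that got the upload relays it to its peers.
            for peer in self.peers:
                peer.receive(url, content, from_peer=True)


class CrawlerClient:
    """Website-side component (hypothetical model)."""

    def __init__(self, site):
        self.site = site
        self.registered = []

    def register(self, server):
        # A newly registered server joins the peer group of the servers
        # already registered with this client.
        for other in self.registered:
            other.peers.append(server)
            server.peers.append(other)
        self.registered.append(server)

    def publish_change(self, url, content):
        # The client initiates the transfer when a page changes.
        # Assumption: it uploads to one server only, and the peer group
        # spreads the data, so the website pays for a single upload.
        if not self.registered:
            return 0
        self.registered[0].receive(url, content)
        return 1  # number of uploads performed by the website


client = CrawlerClient("example.org")
servers = [CrawlerServer(n) for n in ("a", "b", "c")]
for s in servers:
    client.register(s)
uploads = client.publish_change("/index.html", "version 2")
```

In this toy model the website performs one upload regardless of how many crawler servers are registered, which is one way to read the bandwidth saving claimed in the abstract.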
Keywords :
feature extraction; multimedia computing; peer-to-peer computing; search engines; Internet environments; URL discovery; distributed multimedia content feature extraction; distributed multimedia content feature indexing; file downloading; multimedia content; peer-to-peer based passive web crawling system; search engines; website bandwidth resources; Bandwidth; Crawlers; Feature extraction; History; Protocols; Servers; Web pages; Passive web crawler; peer-to-peer network; search engine;
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2011 International Conference on
Conference_Location :
Guilin
Print_ISBN :
978-1-4577-0305-8
DOI :
10.1109/ICMLC.2011.6016959