Title :
Optimal Crawling Strategies for Multimedia Search Engines
Author :
Hama, Hiromitsu ; Zin, Thi Thi ; Tin, Pyke
Author_Institution :
Grad. Sch. of Eng., Osaka City Univ., Osaka, Japan
Abstract :
In this paper we propose a novel optimal crawling strategy for next-generation multimedia search engines. We model a Web crawl as a two-dimensional (2D) random walker on a graph whose vertices are the Web pages and whose edges are the hyperlinks. The proposed crawler is a two-part scheme that optimizes the crawling process so that the average staleness over all pages is minimized and the quality of the search engine from the user's perspective is maximized. To do so, we employ techniques from probability theory and the theory of functional equations, which are highly computationally efficient; this is crucial for practicality because the size of the problem in the Web environment is immense. We show that a combination of breadth and depth crawling that includes the largest sites is a practical and optimal strategy. In particular, several probabilistic models for user browsing in an infinite Web are proposed and studied to estimate how deep and how broad a crawl must go to download a significant portion of the Web site that is actually visited. Experimental and simulation results show that a crawler needs to download just a few levels in depth and breadth to reach the maximum number of pages that users actually visit. The results also suggest that the largest sites should be included in the crawling process.
Keywords :
Web sites; functional equations; multimedia systems; online front-ends; probability; search engines; Web browsers; Web crawling; Web pages; Web site; breadth-depth crawling; graph; hyperlinks; multimedia search engines; optimal crawling strategies; probability theory; two-dimensional random walker; Algorithm design and analysis; Context modeling; Crawlers; Equations; Focusing; Shape; Signal processing; Tin; breadth-depth; crawling; multimedia search engine; random walk
Conference_Title :
Intelligent Information Hiding and Multimedia Signal Processing, 2009. IIH-MSP '09. Fifth International Conference on
Conference_Location :
Kyoto
Print_ISBN :
978-1-4244-4717-6
Electronic_ISBN :
978-0-7695-3762-7
DOI :
10.1109/IIH-MSP.2009.225