Title :
A multi-layer bloom filter for duplicated URL detection
Author :
Zhiwang, Cen ; Jungang, Xu ; Jian, Sun
Author_Institution :
Sch. of Inf. Sci. & Eng., Grad. Univ. of Chinese Acad. of Sci., Beijing, China
Abstract :
Because the Internet contains a very large number of web pages, improving the speed at which a web crawler collects and updates data is of great significance. This paper proposes a duplicated URL detection approach based on a multi-layer Bloom filter algorithm, which divides an entire URL into several layers and stores them in a multi-layer Bloom filter. Experimental results show that the false positive rate of the multi-layer Bloom filter algorithm is significantly lower than that of the classical Bloom filter algorithm, while the efficiency of the former is almost the same as that of the latter.
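The abstract does not say how a URL is split into layers or how the layers are combined, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes three layers (host, host plus path, host plus path plus query), one classical Bloom filter per layer, and a rule that a URL counts as a duplicate only when every layer reports a hit. The names `BloomFilter`, `MultiLayerBloomFilter`, `url_layers`, and the parameters `num_bits` and `num_hashes` are hypothetical and chosen for illustration.

```python
import hashlib
from urllib.parse import urlsplit

NUM_LAYERS = 3  # assumption: host, host+path, host+path+query


class BloomFilter:
    """Classical Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def url_layers(url):
    """Split a URL into cumulative layers (an assumed layering scheme)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    return [host, host + path, host + path + "?" + parts.query]


class MultiLayerBloomFilter:
    """One Bloom filter per URL layer; a duplicate must hit in every layer."""

    def __init__(self, **kwargs):
        self.filters = [BloomFilter(**kwargs) for _ in range(NUM_LAYERS)]

    def seen(self, url):
        return all(layer in f
                   for layer, f in zip(url_layers(url), self.filters))

    def add(self, url):
        for layer, f in zip(url_layers(url), self.filters):
            f.add(layer)

    def check_and_add(self, url):
        # Returns True if the URL was (probably) seen before, then records it.
        duplicate = self.seen(url)
        self.add(url)
        return duplicate


if __name__ == "__main__":
    crawler_seen = MultiLayerBloomFilter()
    print(crawler_seen.check_and_add("http://example.com/a?x=1"))  # False
    print(crawler_seen.check_and_add("http://example.com/a?x=1"))  # True
```

Storing cumulative prefixes rather than independent fragments is a design choice of this sketch: it keeps a host from one URL and a path from another from combining into a spurious match. Under this layering, a new URL is misreported as a duplicate only if every layer it has not genuinely populated returns a false hit.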
Keywords :
Internet; data handling; filtering theory; Web crawler; data collection; data updating; duplicated URL detection; multilayer Bloom filter algorithm; bloom filter; false positive
Conference_Title :
Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-6539-2
DOI :
10.1109/ICACTE.2010.5578947