DocumentCode :
659549
Title :
Efficient near-duplicate document detection using FPGAs
Author :
Luo, Xi ; Najjar, Walid ; Hristidis, Vagelis
Author_Institution :
Comput. Sci. & Eng., UC Riverside, Riverside, CA, USA
fYear :
2013
fDate :
6-9 Oct. 2013
Firstpage :
54
Lastpage :
61
Abstract :
Detecting duplicate and near-duplicate documents is critical in applications like Web crawling since it helps save document processing resources. Simhash is a state-of-the-art method to assign a bit-string fingerprint to a document, such that similar documents have similar fingerprints. Finding the near-duplicates in a large collection of documents consists of two stages: (a) compute the simhash fingerprint of each document, and (b) find pairs of similar fingerprints by computing their Hamming distance. Previous work has focused on optimizing the second stage, i.e., avoiding the quadratic number of comparisons needed to compute the all-to-all Hamming distance. However, our experiments show that the total time is dominated by the first stage (the fingerprint computation), which is the focus of this paper. We propose an implementation of simhash on Field Programmable Gate Arrays (FPGAs), implementing a customized fingerprint computing engine in hardware that exploits parallelization and pipelining opportunities. We present a comprehensive experimental evaluation on large, diverse, real document datasets. Our experiments show a speedup of 362× in the simhash computation, and savings of up to 98% in overall near-duplicate detection execution time compared to using multi-core CPUs.
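The two stages named in the abstract can be illustrated with a minimal software sketch in Python. This is not the paper's FPGA engine; it only shows the general shape of simhash fingerprinting followed by Hamming-distance comparison. The 64-bit fingerprint width, the truncated-MD5 token hash, the unweighted whitespace tokens, the distance threshold of 3, and all function names are assumptions made for illustration and are not taken from the abstract.

```python
import hashlib
from itertools import combinations

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def token_hash(token: str) -> int:
    """Hash a token to a 64-bit integer (truncated MD5, an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Stage (a): compute a simhash fingerprint from unweighted whitespace tokens."""
    counts = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 where its hash bit is set, -1 where it is clear.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(docs, max_distance=3):
    """Stage (b): naive all-to-all comparison (quadratic in the number of documents)."""
    fingerprints = [simhash(doc) for doc in docs]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(fingerprints), 2)
        if hamming_distance(a, b) <= max_distance
    ]
```

Calling `near_duplicate_pairs(list_of_texts)` returns index pairs whose fingerprints differ in at most `max_distance` bits. The brute-force pairing in stage (b) is the quadratic step that, per the abstract, previous work has tried to avoid, while the per-token bit voting in stage (a) is the part the paper moves into hardware.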
Keywords :
Internet; document handling; field programmable gate arrays; multiprocessing systems; search engines; FPGA; Hamming distance; Web crawling; bit string fingerprint; document processing resources; field programmable gate arrays; fingerprint computing engine; multicore CPU; near duplicate document detection; quadratic number; simhash computation; Encyclopedias; Engines; Field programmable gate arrays; Fingerprint recognition; Hardware; Logic gates; Software; FPGA; document similarity; duplicate detection; hardware; hashing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
Type :
conf
DOI :
10.1109/BigData.2013.6691698
Filename :
6691698