DocumentCode
3562880
Title
Near-duplicate detection using GPU-based simhash scheme
Author
Xiaowen Feng ; Hai Jin ; Ran Zheng ; Lei Zhu
Author_Institution
Services Comput. Technol. & Syst. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China
fYear
2014
Firstpage
223
Lastpage
228
Abstract
With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Eliminating near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is finding near-duplicate records in large-scale collections efficiently. Several efforts have already implemented near-duplicate detection on different architectures. In this paper, a new implementation using a special hash function, namely simhash, is proposed to identify near-duplicate documents on CUDA-enabled devices. Two mechanisms, swapping and dynamic allocation, are designed to achieve higher performance. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving a speedup of up to 18 times.
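For orientation, the general simhash idea named in the abstract can be sketched on the CPU as follows. This is a minimal illustration of the generic technique only, not the paper's GPU implementation: the whitespace tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the threshold `k=3` are all assumptions chosen for the example, and the function names are hypothetical.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint of whitespace-separated tokens."""
    weights = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        # Hash each token to 64 bits (MD5 is an arbitrary choice here).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each set bit votes +1 for its position, each clear bit votes -1.
        for i in range(FINGERPRINT_BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes won.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3) -> bool:
    """Treat documents as near-duplicates if fingerprints differ in <= k bits.

    The threshold k is an assumed value for illustration.
    """
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

Comparing fingerprints pair by pair is independent work for each pair, which is the kind of computation that maps naturally onto many GPU threads; how the paper partitions that work is not stated in the abstract.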
Keywords
cryptography; data mining; document handling; graphics processing units; parallel architectures; CUDA enabled devices; GPU-based simhash scheme; dynamic allocation; hash function; large-scale collections; near-duplicate detection; near-duplicate document identification; near-duplicate records; search index quality improvement; storage cost reduction; swapping; Computer architecture; Dynamic scheduling; Fingerprint recognition; Hamming distance; Instruction sets; Kernel; Simhash; Similarity
fLanguage
English
Publisher
IEEE
Conference_Titel
Smart Computing (SMARTCOMP), 2014 International Conference on
Print_ISBN
978-1-4799-5710-1
Type
conf
DOI
10.1109/SMARTCOMP.2014.7043862
Filename
7043862