DocumentCode :
3562880
Title :
Near-duplicate detection using GPU-based simhash scheme
Author :
Xiaowen Feng ; Hai Jin ; Ran Zheng ; Lei Zhu
Author_Institution :
Services Comput. Technol. & Syst. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China
fYear :
2014
Firstpage :
223
Lastpage :
228
Abstract :
With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Eliminating near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is finding near-duplicate records efficiently in large-scale collections. Several efforts have already implemented near-duplicate detection on different architectures. In this paper, a new implementation using a special hash function, namely simhash, is proposed to identify near-duplicate documents on CUDA-enabled devices. Two mechanisms, swapping and dynamic allocation, are designed to achieve higher performance. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving a speedup of up to 18 times.
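For readers unfamiliar with the technique named in the abstract, the following is a minimal serial sketch of simhash fingerprinting and Hamming-distance comparison. It is not the authors' GPU implementation: the whitespace tokenizer, the use of MD5 as a token hash, the unweighted tokens, the 64-bit fingerprint width, and the threshold `k=3` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash(text):
    """Return a 64-bit simhash fingerprint of whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer (MD5 is only a mixer here).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set where the votes are positive.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a, doc_b, k=3):
    """Treat two documents as near-duplicates if their fingerprints
    differ in at most k bits (k is an assumed threshold)."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

In a sketch like this, the pairwise Hamming-distance comparisons are independent of one another, which is the kind of work that maps naturally onto many GPU threads.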
Keywords :
cryptography; data mining; document handling; graphics processing units; parallel architectures; CUDA enabled devices; GPU-based simhash scheme; dynamic allocation; hash function; large-scale collections; near-duplicate detection; near-duplicate document identification; near-duplicate records; search index quality improvement; storage cost reduction; swapping; Computer architecture; Dynamic scheduling; Fingerprint recognition; Hamming distance; Instruction sets; Kernel; Simhash; Similarity;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Smart Computing (SMARTCOMP), 2014 International Conference on
Print_ISBN :
978-1-4799-5710-1
Type :
conf
DOI :
10.1109/SMARTCOMP.2014.7043862
Filename :
7043862