• DocumentCode
    3562880
  • Title

    Near-duplicate detection using GPU-based simhash scheme

  • Author

    Xiaowen Feng ; Hai Jin ; Ran Zheng ; Lei Zhu

  • Author_Institution
    Services Comput. Technol. & Syst. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China
  • fYear
    2014
  • Firstpage
    223
  • Lastpage
    228
  • Abstract
    With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Eliminating near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is finding near-duplicate records efficiently in large-scale collections. Several efforts have already implemented near-duplicate detection on different architectures. In this paper, a new implementation based on the simhash hash function is proposed to identify near-duplicate documents on CUDA-enabled devices. Two mechanisms, swapping and dynamic allocation, are designed to achieve higher performance. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving a speedup of up to 18 times.
  • Keywords
    cryptography; data mining; document handling; graphics processing units; parallel architectures; CUDA enabled devices; GPU-based simhash scheme; dynamic allocation; hash function; large-scale collections; near-duplicate detection; near-duplicate document identification; near-duplicate records; search index quality improvement; storage cost reduction; swapping; Computer architecture; Dynamic scheduling; Fingerprint recognition; Hamming distance; Instruction sets; Kernel; Simhash; Similarity
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Smart Computing (SMARTCOMP), 2014 International Conference on
  • Print_ISBN
    978-1-4799-5710-1
  • Type

    conf

  • DOI
    10.1109/SMARTCOMP.2014.7043862
  • Filename
    7043862
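
The abstract names simhash fingerprints and (in the keywords) Hamming distance as the basis for near-duplicate detection. The following is a minimal serial sketch of that general idea in Python, not the paper's GPU implementation. The 64-bit fingerprint width, the truncated-MD5 token hash, whitespace tokenization, and the distance threshold of 3 are all assumptions chosen for illustration; the record does not state any of them.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the record does not state one


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (truncated MD5, an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Compute a simhash fingerprint from frequency-weighted tokens."""
    weights = Counter(text.lower().split())
    totals = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +weight for a set bit, -weight for a clear bit.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, max_distance: int = 3) -> bool:
    """Treat two documents as near-duplicates if their fingerprints are close.

    max_distance is a hypothetical threshold, not a value from the paper.
    """
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= max_distance
```

Because every fingerprint pair can be compared independently with an XOR and a bit count, the comparison step is a natural fit for data-parallel hardware such as the CUDA devices the abstract mentions.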