• DocumentCode
    659461
  • Title

    Massively scalable near duplicate detection in streams of documents using MDSH

  • Author

    Logasa Bogen, Paul ; Symons, Christopher T. ; McKenzie, Amber ; Patton, Robert M. ; Gillen, Robert E.

  • Author_Institution
    Comput. Data Analytics Group, Oak Ridge Nat. Lab., Oak Ridge, TN, USA
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    480
  • Lastpage
    486
  • Abstract
    In a world where large-scale text collections are not only becoming ubiquitous but are also growing at increasing rates, near-duplicate documents are a growing concern with the potential to hinder many information filtering tasks. While others have tried to address this problem, prior techniques have only been applied to limited collection sizes and static cases. We briefly describe the problem in the context of Open Source analysis, along with our additional performance constraints. In this work we propose two variations on Multi-dimensional Spectral Hash (MDSH) tailored to extremely large, growing sets of text documents. We analyze the memory and runtime characteristics of our techniques and provide an informal analysis of the quality of the near-duplicate clusters they produce.
  • Keywords
    file organisation; information filtering; public domain software; text analysis; MDSH; document stream; information filtering task; large-scale text collections; memory characteristics; multidimensional spectral hash; near duplicate detection; near duplicate documents; near-duplicate clusters; open source analysis; quality informal analysis; runtime characteristics; text documents; Electronic publishing; Encyclopedias; Internet; Memory management; Random access memory; Runtime; Big Data; MDSH; Near Duplicate Detection; Open Source Intelligence; Streaming Text
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type

    conf

  • DOI
    10.1109/BigData.2013.6691610
  • Filename
    6691610