• DocumentCode
    2207764
  • Title

    Improved Consistent Sampling, Weighted Minhash and L1 Sketching

  • Author

    Ioffe, Sergey

  • Author_Institution
    Google Inc., Mountain View, CA, USA
  • fYear
    2010
  • fDate
    13-17 Dec. 2010
  • Firstpage
    246
  • Lastpage
    255
  • Abstract
    We propose a new Consistent Weighted Sampling method, where the probability of drawing identical samples for a pair of inputs is equal to their Jaccard similarity. Our method takes deterministic constant time per non-zero weight, improving on the best previous approach which takes expected constant time. The samples can be used as Weighted Minhash for efficient retrieval and compression (sketching) under Jaccard or L1 metric. A method is presented for using simple data statistics to reduce the running time of hash computation by two orders of magnitude. We compare our method with the random projection method and show that it has better characteristics for retrieval under L1. We present a novel method of mapping hashes to short bit-strings, apply it to Weighted Minhash, and achieve more accurate distance estimates from sketches than existing methods, as long as the inputs are sufficiently distinct. We show how to choose the optimal number of bits per hash for sketching, and demonstrate experimental results which agree with the theoretical analysis.
  • Keywords
    cryptography; data compression; file organisation; information retrieval; pattern matching; sampling methods; Jaccard similarity; L1 sketching; bit string; consistent weighted sampling method; data statistics; deterministic constant time; hash computation; random projection method; running time reduction; sample compression; sample retrieval; weighted Minhash; Compression; Hashing; Minhash; Retrieval; Sampling; Sketching;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Mining (ICDM), 2010 IEEE 10th International Conference on
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-9131-5
  • Electronic_ISBN
    1550-4786
  • Type

    conf

  • DOI
    10.1109/ICDM.2010.80
  • Filename
    5693978