• DocumentCode
    3710111
  • Title

    Hashing for Statistics over K-Partitions

  • Author

    Søren ;Mathias Bæk Tejs ;Eva Rotenberg;Mikkel Thorup

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Copenhagen, Copenhagen, UK
  • fYear
    2015
  • Firstpage
    1292
  • Lastpage
    1310
  • Abstract
    In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS´83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA´97] and in large-scale machine learning by Li et al. [NIPS´12] for minwise estimation of set similarity. The main issue of k-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of k-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. a simple and efficient construction for invertible bloom filters and uniform hashing on a given set.
  • Keywords
    "Yttrium","Polynomials","Correlation","Frequency estimation","Estimation","Computer science","Machine learning algorithms"
  • Publisher
    ieee
  • Conference_Titel
    Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on
  • ISSN
    0272-5428
  • Type

    conf

  • DOI
    10.1109/FOCS.2015.83
  • Filename
    7354457