DocumentCode :
1941048
Title :
Optimal sampling algorithms for frequency estimation in distributed data
Author :
Zengfeng Huang ; Ke Yi ; Yunhao Liu ; Guihai Chen
Author_Institution :
Hong Kong Univ. of Sci. & Technol., Hong Kong, China
fYear :
2011
fDate :
10-15 April 2011
Firstpage :
1997
Lastpage :
2005
Abstract :
Consider a distributed system with n nodes where each node holds a multiset of items. In this paper, we design sampling algorithms that allow us to estimate the global frequency of any item with a standard deviation of εN, where N denotes the total cardinality of all these multisets. Our algorithms have a communication cost of O(n +√n/ε), which is never worse than the O(n + 1/ε2) cost of uniform sampling, and could be much better when n ≪ 1/ε2. In addition, we prove that one version of our algorithm is instance-optimal in a fairly general sampling framework. We also design algorithms that achieve optimality on the bit level, by combining Bloom filters of various granularities. Finally, we present some simulation results comparing our algorithms with previous techniques. Other than the performance improvement, our algorithms are also much simpler and easily implementable in a large-scale distributed system.
Keywords :
computational complexity; distributed processing; frequency estimation; sampling methods; Bloom filter; distributed system; global frequency estimation; optimal sampling algorithm; Algorithm design and analysis; Approximation algorithms; Distributed databases; Encoding; Frequency estimation; Peer to peer computing; Probabilistic logic;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
INFOCOM, 2011 Proceedings IEEE
Conference_Location :
Shanghai
ISSN :
0743-166X
Print_ISBN :
978-1-4244-9919-9
Type :
conf
DOI :
10.1109/INFCOM.2011.5935005
Filename :
5935005
Link To Document :
بازگشت