DocumentCode :
3121960
Title :
Sketching Sampled Data Streams
Author :
Rusu, Florin ; Dobra, Alin
Author_Institution :
CISE Dept., Univ. of Florida, Gainesville, FL
fYear :
2009
fDate :
March 29 2009-April 2 2009
Firstpage :
381
Lastpage :
392
Abstract :
Sampling is used as a universal method to reduce the running time of computations - the computation is performed on a much smaller sample and then the result is scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams and they proved to be useful for estimating frequency moments and aggregates over joins. A possibility to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size of join and the self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling - Bernoulli sampling which is used for load shedding, sampling with replacement which is used to generate i.i.d. samples from a distribution, and sampling without replacement which is used by online aggregation engines - and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data even when the sample size is only 10% or less of the dataset size. This is equivalent to a speed-up factor of at least 10 when updating the sketch.
Keywords :
approximation theory; data analysis; data mining; Bernoulli sampling; approximation method; generic sampling process; load shedding; sampled data streams; Aggregates; Approximation methods; Engines; Frequency estimation; Sampling methods; Sketches; moment generating functions; sampling;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
Conference_Location :
Shanghai
ISSN :
1084-4627
Print_ISBN :
978-1-4244-3422-0
Electronic_ISBN :
1084-4627
Type :
conf
DOI :
10.1109/ICDE.2009.31
Filename :
4812419
Link To Document :
بازگشت