DocumentCode :
2404964
Title :
Approximating a data stream for querying and estimation: algorithms and performance evaluation
Author :
Guha, Sudipto ; Koudas, Nick
Author_Institution :
Pennsylvania Univ., Philadelphia, PA, USA
fYear :
2002
fDate :
2002
Firstpage :
567
Lastpage :
576
Abstract :
Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal
Keywords :
distributed databases; query processing; software performance evaluation; statistics; accuracy; application requirements; approximate queries; approximate querying; continuously arriving data; data distributions; data mining; data set size; data stream approximation; database management; estimation algorithms; finite data set; fixed-window histogram construction problem; histogram construction; histogram quality; histogram-based approximation; incremental histogram maintenance; infinite data set; infinite data stream; performance evaluation; similarity searching; single-pass algorithms; space-efficient data distribution representation; speed; Data mining; Data warehouses; Databases; Financial management; Histograms; Monitoring; Quality management; Query processing; Telecommunication traffic; Web server;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Engineering, 2002. Proceedings. 18th International Conference on
Conference_Location :
San Jose, CA
ISSN :
1063-6382
Print_ISBN :
0-7695-1531-2
Type :
conf
DOI :
10.1109/ICDE.2002.994775
Filename :
994775
Link To Document :
بازگشت