• DocumentCode
    2404964
  • Title

    Approximating a data stream for querying and estimation: algorithms and performance evaluation

  • Author

    Guha, Sudipto ; Koudas, Nick

  • Author_Institution
    Pennsylvania Univ., Philadelphia, PA, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    567
  • Lastpage
    576
  • Abstract
    Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal
  • Keywords
    distributed databases; query processing; software performance evaluation; statistics; accuracy; application requirements; approximate queries; approximate querying; continuously arriving data; data distributions; data mining; data set size; data stream approximation; database management; estimation algorithms; finite data set; fixed-window histogram construction problem; histogram construction; histogram quality; histogram-based approximation; incremental histogram maintenance; infinite data set; infinite data stream; performance evaluation; similarity searching; single-pass algorithms; space-efficient data distribution representation; speed; Data mining; Data warehouses; Databases; Financial management; Histograms; Monitoring; Quality management; Query processing; Telecommunication traffic; Web server;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering, 2002. Proceedings. 18th International Conference on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1531-2
  • Type

    conf

  • DOI
    10.1109/ICDE.2002.994775
  • Filename
    994775