• DocumentCode
    3121497
  • Title
    A Framework for Clustering Massive-Domain Data Streams
  • Author
    Aggarwal, Charu C.
  • Author_Institution
    IBM T. J. Watson Res. Center, Hawthorne, NY
  • fYear
    2009
  • fDate
    March 29 2009-April 2 2009
  • Firstpage
    102
  • Lastpage
    113
  • Abstract
    In this paper, we examine the problem of clustering massive-domain data streams. Massive-domain data streams are those in which the number of possible domain values for each attribute is very large and cannot be easily tracked for clustering purposes. Examples of such streams include IP-address streams, credit-card transaction streams, and streams of sales data over large numbers of items. In such cases, it is well known that even simple stream operations such as counting can be extremely difficult, because summary information is hard to maintain over the different discrete values. The task of clustering is significantly more challenging in such cases, since the intermediate statistics for the different clusters cannot be maintained efficiently. We propose a method for clustering massive-domain data streams with the use of sketches. We prove probabilistic results which show that a sketch-based clustering method can provide similar results to an infinite-space clustering algorithm with high probability. We present experimental results which validate these theoretical results, and show that it is possible to approximate the behavior of an infinite-space algorithm accurately.
  • Keywords
    pattern clustering; probability; IP-address streams; clustering massive-domain data streams; credit-card transaction streams; infinite-space clustering algorithm; intermediate statistics; sketch-based clustering method; Clustering algorithms; Clustering methods; Computational efficiency; Data engineering; Data structures; Design methodology; Marketing and sales; Probability; Statistics; USA Councils; clustering; data streams; massive-domain;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type
    conf
  • DOI
    10.1109/ICDE.2009.13
  • Filename
    4812395