• DocumentCode
    3133974
  • Title
    A fast algorithm for subspace clustering by pattern similarity
  • Author
    Wang, Haixun; Chu, Fang; Fan, Wei; Yu, Philip S.; Pei, Jian
  • Author_Institution
    T. J. Watson Res. Center, IBM, USA
  • fYear
    2004
  • fDate
    21-23 June 2004
  • Firstpage
    51
  • Lastpage
    60
  • Abstract
    Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including large-scale scientific data analysis, target marketing, and Web usage analysis. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle data sets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides their huge volume, many data sets are also characterized by their sequentiality; for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large data sets, and ii) to discover pattern similarity embedded in data sequences. In this paper, we present a novel algorithm that offers this capability. Experimental results from both real-life and synthetic data sets prove its effectiveness and efficiency.
  • Keywords
    biology computing; computational complexity; data mining; pattern clustering; scientific information systems; very large databases; Web usage analysis; customer purchase records; data sequences; data volume; dataset sequentiality; network event logs; pCluster algorithm; pattern similarity clustering; scientific data analysis; subspace pattern clustering; target marketing; Clustering algorithms; Clustering methods; Computer science; DNA; Data analysis; Data mining; Gene expression; Genomics; Large-scale systems; Pattern analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
  • ISSN
    1099-3371
  • Print_ISBN
    0-7695-2146-0
  • Type
    conf
  • DOI
    10.1109/SSDM.2004.1311193
  • Filename
    1311193