• DocumentCode
    57286
  • Title

    From K-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors

  • Author

    Papalexakis, Evangelos E. ; Sidiropoulos, Nicholas ; Bro, Rasmus

  • Author_Institution
    Electr. & Comput. Eng. Dept., Tech. Univ. of Crete, Chania, Greece
  • Volume
    61
  • Issue
    2
  • fYear
    2013
  • fDate
    Jan.15, 2013
  • Firstpage
    493
  • Lastpage
    506
  • Abstract
    Co-clustering is a generalization of unsupervised clustering that has recently drawn renewed attention, driven by emerging data mining applications in diverse areas. Whereas clustering groups entire columns of a data matrix, co-clustering groups columns over select rows only, i.e., it simultaneously groups rows and columns. The concept generalizes to data “boxes” and higher-way tensors, for simultaneous grouping along multiple modes. Various co-clustering formulations have been proposed, but no workhorse analogous to K-means has emerged. This paper starts from K-means and shows how co-clustering can be formulated as a constrained multilinear decomposition with sparse latent factors. For three- and higher-way data, uniqueness of the multilinear decomposition implies that, unlike matrix co-clustering, it is possible to unravel a large number of possibly overlapping co-clusters. A basic multi-way co-clustering algorithm is proposed that exploits multilinearity using Lasso-type coordinate updates. Various line search schemes are then introduced to speed up convergence, and suitable modifications are proposed to deal with missing values. The imposition of latent sparsity pays a collateral dividend: it turns out that sequentially extracting one co-cluster at a time is almost optimal, hence the approach scales well for large datasets. The resulting algorithms are benchmarked against the state-of-art in pertinent simulations, and applied to measured data, including the ENRON e-mail corpus.
  • Keywords
    convergence; data mining; electronic mail; pattern clustering; sparse matrices; tensors; unsupervised learning; ENRON e-mail corpus; K-means clustering; Lasso-type coordinate updates; collateral dividend; constrained multilinear decomposition; data boxes; data matrix; data mining applications; groups columns coclustering; higher-way coclustering; higher-way data; higher-way tensors; latent sparsity; matrix coclustering; multilinear decomposition; multiple modes; sparse latent factors; three-way data; unsupervised clustering; various line search schemes; Arrays; Convergence; Educational institutions; Matrix decomposition; Signal processing algorithms; Sparse matrices; Vectors; Co-clustering; compressed sensing; factor analysis; k-means; multi-way analysis; sparsity; tensor decomposition; triclustering; unsupervised clustering;
  • fLanguage
    English
  • Journal_Title
    Signal Processing, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1053-587X
  • Type

    jour

  • DOI
    10.1109/TSP.2012.2225052
  • Filename
    6331561