• DocumentCode
    34702
  • Title
    Sketch and Validate for Big Data Clustering
  • Author
    Traganitis, Panagiotis A. ; Slavakis, Konstantinos ; Giannakis, Georgios B.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Minnesota, Minneapolis, MN, USA
  • Volume
    9
  • Issue
    4
  • fYear
    2015
  • fDate
    June 2015
  • Firstpage
    678
  • Lastpage
    690
  • Abstract
    In response to the need for learning tools tuned to big data analytics, the present paper introduces a framework for efficient clustering of huge sets of (possibly high-dimensional) data. Building on random sampling and consensus (RANSAC) ideas pursued earlier in a different (computer vision) context for robust regression, a suite of novel dimensionality- and set-reduction algorithms is developed. The advocated sketch-and-validate (SkeVa) family includes two algorithms that rely on K-means clustering per iteration on a reduced number of dimensions and/or feature vectors: the first operates in a batch fashion, while the second, sequential one offers computational efficiency and suitability for streaming modes of operation. For clustering even nonlinearly separable vectors, the SkeVa family also offers a member based on user-selected kernel functions. Further trading off performance for reduced complexity, a fourth member of the SkeVa family is based on a divergence criterion for selecting proper minimal subsets of feature variables and vectors, thus bypassing the need for K-means clustering per iteration. Extensive numerical tests on synthetic and real data sets highlight the potential of the proposed algorithms and demonstrate their competitive performance relative to state-of-the-art random projection alternatives.
  • Keywords
    Big Data; data analysis; pattern clustering; K-means clustering; RANSAC; SkeVa family; big data analytics; big data clustering; dimensionality-reduction algorithms; divergence criterion; random sampling; reduced complexity; robust regression; set-reduction algorithms; sketch-and-validate family; user-selected kernel functions; Clustering algorithms; Complexity theory; Kernel; Signal processing algorithms; Special issues and sections; Vectors; K-means; Clustering; feature vector selection; high-dimensional data; sketching; validation; variable selection
  • fLanguage
    English
  • Journal_Title
    Selected Topics in Signal Processing, IEEE Journal of
  • Publisher
    IEEE
  • ISSN
    1932-4553
  • Type
    jour
  • DOI
    10.1109/JSTSP.2015.2396477
  • Filename
    7018966
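
The sketch-and-validate idea summarized in the abstract (run K-means on a small random subset per iteration, then keep the draw that validates best) can be illustrated with the rough Python sketch below. It is not the authors' algorithm: the abstract does not state the validation criterion, so the score used here (mean squared distance of a second random subset to its nearest centroid) is an assumption, as are the function names kmeans, assign and sketch_and_validate and all default parameter values. Only set reduction (subsampling feature vectors) is shown; dimensionality reduction, the sequential variant, the kernel variant and the divergence-based variant are omitted.

```python
import numpy as np


def kmeans(X, k, rng, iters=20):
    """Plain Lloyd iterations on X; returns the k centroids."""
    # Initialize centroids from k distinct data points (copy, so X is untouched).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def assign(X, centers):
    """Nearest-centroid labels and squared distances for every row of X."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), dist.min(axis=1)


def sketch_and_validate(X, k, n_sketch=20, n_validate=50, n_trials=10, seed=0):
    """Hypothetical batch sketch-and-validate loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    best_cost, best_centers = np.inf, None
    for _ in range(n_trials):
        # Sketch: cluster a small random subset of the feature vectors.
        sketch = X[rng.choice(len(X), size=n_sketch, replace=False)]
        centers = kmeans(sketch, k, rng)
        # Validate: score the centroids on a second random subset.
        # ASSUMPTION: mean squared distance to the nearest centroid.
        val = X[rng.choice(len(X), size=n_validate, replace=False)]
        _, d = assign(val, centers)
        cost = d.mean()
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    # Final pass: label the full data set with the best centroids found.
    labels, _ = assign(X, best_centers)
    return best_centers, labels
```

Each trial runs K-means on only n_sketch points, so the per-iteration cost is independent of the full data set size; the single pass over all data happens once at the end.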