Title :
Distributed Column Subset Selection on MapReduce
Author :
Farahat, Ahmed K. ; Elgohary, Ahmed ; Ghodsi, Ali ; Kamel, Mohamed S.
Author_Institution :
Univ. of Waterloo, Waterloo, ON, Canada
Abstract :
Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era, as it enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection. It then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
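The two steps described in the abstract (random projection, then per-machine selection against the concise representation) can be illustrated with a single-process NumPy sketch. This is not the authors' implementation. The function names, the Gaussian projection matrix, the naive least-squares greedy rule, and the final step that merges per-machine candidates are all assumptions made for illustration, and the "machines" are simulated as a list of column blocks.

```python
import numpy as np


def random_projection(blocks, c, rng):
    """Concise representation B = A @ Omega, accumulated block by block.

    Each block holds some columns of A, so it only needs the matching
    rows of Omega; the partial products are summed (a reduce-style step).
    The Gaussian Omega is an assumption of this sketch.
    """
    m = blocks[0].shape[0]
    B = np.zeros((m, c))
    for A_i in blocks:
        Omega_i = rng.standard_normal((A_i.shape[1], c)) / np.sqrt(c)
        B += A_i @ Omega_i
    return B


def residual_norm(A_S, B):
    """Frobenius norm of B minus its projection onto span(A_S)."""
    coeffs, *_ = np.linalg.lstsq(A_S, B, rcond=None)
    return np.linalg.norm(B - A_S @ coeffs)


def greedy_select(A, B, k):
    """Naive greedy stand-in: add the column that most reduces the error."""
    selected = []
    for _ in range(min(k, A.shape[1])):
        best_j, best_err = None, np.inf
        for j in range(A.shape[1]):
            if j in selected:
                continue
            err = residual_norm(A[:, selected + [j]], B)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected


def distributed_css(blocks, k, c, seed=0):
    """Select k global column indices from column blocks of A."""
    rng = np.random.default_rng(seed)
    B = random_projection(blocks, c, rng)

    # Per-machine step: each block picks k local candidates against B.
    candidates, offset = [], 0
    for A_i in blocks:
        local = greedy_select(A_i, B, k)
        candidates.extend(offset + j for j in local)
        offset += A_i.shape[1]

    # Assumed merge step: choose the final k among the pooled candidates.
    A = np.hstack(blocks)
    final = greedy_select(A[:, candidates], B, k)
    return [candidates[j] for j in final]
```

As a usage example, `distributed_css([A1, A2, A3], k=4, c=10)` returns four column indices into the horizontally stacked matrix, where `A1`, `A2`, `A3` are hypothetical arrays with the same number of rows. The naive greedy loop refits a least-squares problem for every candidate, so it is only suitable for small illustrations.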
Keywords :
data analysis; distributed algorithms; greedy algorithms; MapReduce algorithm; corresponding matrix; data analysts; distributed column subset selection; generalized column subset selection problem; low-rank approximation; random projection; Approximation algorithms; Approximation methods; Cascading style sheets; Data handling; Data storage systems; Distributed databases; Information management; Big Data; Column Subset Selection; Distributed Computing; Greedy Algorithms; MapReduce
Conference_Title :
2013 IEEE 13th International Conference on Data Mining (ICDM)
Conference_Location :
Dallas, TX
DOI :
10.1109/ICDM.2013.155