Title :
A scalable solution for group feature selection
Author :
Priya Govindan;Ruobing Chen;Katya Scheinberg;Soundararajan Srinivasan
Author_Institution :
Rutgers University
Abstract :
In many applications, we want to build a classifier with high confidence while reducing the number of features. We consider the case where features are assigned to predefined groups and cannot be removed individually. An additional, important constraint is that the datasets may be very large and may not fit in memory. We use logistic regression with a group penalty, which yields solutions that are sparse at the group level. In our implementation, we use L-BFGS to build a quadratic approximation of the logistic regression loss function and use block coordinate descent to solve for each group. Our contributions can be summarized as follows: (1) we discuss different scalable approaches, depending on characteristics of the dataset, such as a large number of data points, a large number of features, or a large number of groups; (2) for datasets with a large number of data points and few groups of features, we identify the bottlenecks for scalability; (3) we present Spark solutions in Python and discuss the advantages of our solution over alternative solutions; (4) we present experiments and results on synthetic data and on real data from manufacturing applications.
Keywords :
"Sparks","Logistics","Runtime","Sparse matrices","Approximation methods","Machine learning algorithms","Big data"
Conference_Title :
Big Data (Big Data), 2015 IEEE International Conference on
DOI :
10.1109/BigData.2015.7364098
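
Illustrative sketch :
The group-penalized logistic regression described in the abstract can be illustrated with a small in-memory example. This is a minimal sketch under stated assumptions, not the authors' implementation: it replaces the L-BFGS quadratic approximation with a plain proximal gradient step per group, runs on NumPy arrays instead of Spark, assumes labels in {-1, +1}, and weights each group's penalty by the square root of its size. The function names `group_soft_threshold` and `group_lasso_logreg` are hypothetical.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrinks the whole block, or zeroes it."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def group_lasso_logreg(X, y, groups, lam, n_sweeps=200):
    """Block coordinate descent with one proximal gradient step per group.

    Minimizes mean(log(1 + exp(-y * Xw))) + lam * sum_g sqrt(|g|) * ||w_g||_2.
    X: (n, p) array; y: labels in {-1, +1}; groups: list of column-index arrays.
    (Assumed formulation; the paper's exact objective is not given in the abstract.)
    """
    n, p = X.shape
    w = np.zeros(p)
    margin = X @ w  # cached X @ w, updated incrementally after each block step
    # Upper bound on each block's Lipschitz constant: ||X_g||_2^2 / (4 n)
    lipschitz = [np.linalg.norm(X[:, g], 2) ** 2 / (4.0 * n) for g in groups]
    for _ in range(n_sweeps):
        for g, L in zip(groups, lipschitz):
            if L == 0.0:
                continue  # all-zero block: nothing to update
            z = y * margin
            # Derivative of the loss w.r.t. each margin: -y * sigmoid(-z),
            # with sigmoid(-z) written as 0.5 * (1 - tanh(z / 2)) for stability
            residual = -y * 0.5 * (1.0 - np.tanh(0.5 * z))
            grad_g = X[:, g].T @ residual / n
            w_new = group_soft_threshold(w[g] - grad_g / L,
                                         lam * np.sqrt(len(g)) / L)
            margin += X[:, g] @ (w_new - w[g])
            w[g] = w_new
    return w
```

Calling `group_lasso_logreg(X, y, groups, lam)` returns a weight vector in which every group is either entirely zero or entirely retained, which is the group-level sparsity the abstract refers to. Larger values of `lam` remove more groups.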