Title :
Data intensive parallel feature selection method study
Author :
Zhanquan Sun ; Zhao Li
Author_Institution :
Shandong Provincial Key Lab. of Comput. Network, Shandong Comput. Sci. Center, Jinan, China
Abstract :
Feature selection is an important research topic in machine learning and pattern recognition. It is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. With the development of computer science, a data deluge has occurred in many application fields. Classical feature selection methods break down on large-scale datasets because of their expensive computational cost. This paper concentrates on the study of a data-intensive parallel feature selection method based on the MapReduce programming model. In each map node, a novel method is used to calculate mutual information, and the combinatory contribution degree is used to determine the number of selected features. In each epoch, the features selected by all map nodes are collected at a reduce node, from which one feature is selected through synthesization. The parallel feature selection method is scalable, and its efficiency is illustrated through an example analysis.
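The following is a minimal, self-contained sketch of the map/reduce pattern described in the abstract, not the authors' implementation: the data are partitioned into blocks, each "map" step scores the remaining features by mutual information with the class label on its block, and a "reduce" step synthesizes the per-block scores (here by simple averaging, an assumed stand-in for the paper's synthesization) and greedily selects one feature per epoch. The combinatory contribution degree stopping rule is replaced by a fixed number of epochs for illustration.

    import numpy as np

    def mutual_information(x, y):
        """Mutual information (in nats) between two discrete 1-D arrays."""
        n = len(x)
        joint = {}
        for xi, yi in zip(x, y):
            joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
        px = {v: np.mean(x == v) for v in set(x)}
        py = {v: np.mean(y == v) for v in set(y)}
        mi = 0.0
        for (xi, yi), c in joint.items():
            pxy = c / n
            mi += pxy * np.log(pxy / (px[xi] * py[yi]))
        return mi

    def map_step(X_block, y_block, remaining):
        """Map node: score every remaining feature on the local data block."""
        return {j: mutual_information(X_block[:, j], y_block) for j in remaining}

    def reduce_step(partial_scores):
        """Reduce node: average per-block scores and return the best feature index."""
        merged = {}
        for scores in partial_scores:
            for j, s in scores.items():
                merged.setdefault(j, []).append(s)
        return max(merged, key=lambda j: np.mean(merged[j]))

    def parallel_feature_selection(X, y, n_blocks=4, n_select=3):
        """Greedy MapReduce-style selection: one feature per epoch."""
        blocks = np.array_split(np.arange(len(y)), n_blocks)
        remaining, selected = set(range(X.shape[1])), []
        for _ in range(n_select):
            partial = [map_step(X[idx], y[idx], remaining) for idx in blocks]
            best = reduce_step(partial)
            selected.append(best)
            remaining.remove(best)
        return selected

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, 400)
        informative = (y[:, None] + rng.integers(0, 2, (400, 2))) % 3  # correlated with y
        noise = rng.integers(0, 3, (400, 4))                           # uninformative features
        X = np.hstack([informative, noise])
        print("selected features:", parallel_feature_selection(X, y))

In an actual MapReduce deployment, map_step would run on separate workers holding their own data blocks and reduce_step would run on the reduce node each epoch; the single-machine loop above only mirrors that data flow.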
Keywords :
feature selection; parallel programming; MapReduce program model; combinatory contribution degree; computational cost; data deluge; data intensive parallel feature selection method; dimensionality reduction; epoch; irrelevant data removal; large-scale dataset processing; learning accuracy improvement; map node collection; mutual information; node reduction; result comprehensibility improvement; synthesization; Computational modeling; Entropy; Joints; Mutual information; Support vector machines; Training; Vectors; Feature selection; MapReduce; contribution degree; mutual information;
Conference_Titel :
Neural Networks (IJCNN), 2014 International Joint Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4799-6627-1
DOI :
10.1109/IJCNN.2014.6889409