DocumentCode :
1796739
Title :
Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data
Author :
Ditzler, Gregory ; Austen, Matthew ; Rosen, Gail ; Polikar, Robi
Author_Institution :
Drexel Univ., Philadelphia, PA, USA
fYear :
2014
fDate :
9-12 Dec. 2014
Firstpage :
439
Lastpage :
445
Abstract :
Feature subset selection is an important step towards producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used in making inferences on the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to infer which variables best describe multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, but knowing how many variables to select is typically a nontrivial task. Other approaches require a specific variable subset selection framework to be used. In this work, we examine an approach to feature subset selection that works with a generic variable selection algorithm and provides statistical inference on the number of relevant features, which may be unknown to the generic variable selection algorithm. This work extends our previous implementation of a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset selection algorithm. Specifically, we examine the conservativeness of the NPFS approach by biasing the hypothesis test, and examine other heuristics for NPFS. We include results from carefully designed synthetic datasets. Furthermore, we demonstrate NPFS's ability to perform on data of a massive scale.
Keywords :
computational complexity; data mining; feature selection; NPFS hypothesis test; Neyman-Pearson feature selection hypothesis test; Neyman-Pearson subset selection approach; bioinformatics; generic variable selection algorithm; genomics feature subset selection; heuristics; massive data mining; meta-subset selection algorithm; statistical inference; Neyman-Pearson; feature subset selection
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Data Mining (CIDM), 2014 IEEE Symposium on
Conference_Location :
Orlando, FL
Type :
conf
DOI :
10.1109/CIDM.2014.7008701
Filename :
7008701