Regional Information Center for Science and Technology - Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data

DocumentCode :

1796739

Title :

Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data

Author :

Ditzler, Gregory ; Austen, Matthew ; Rosen, Gail ; Polikar, Robi

Author_Institution :

Drexel Univ., Philadelphia, PA, USA

fYear :

2014

fDate :

9-12 Dec. 2014

Firstpage :

439

Lastpage :

445

Abstract :

Feature subset selection is an important step towards producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used in making inferences on the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to make inferences about the variables that best describe multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, but knowing how many variables to select is typically a nontrivial task. Other approaches require a specific variable subset selection framework to be used. In this work, we examine an approach to feature subset selection that works with a generic variable selection algorithm and provides statistical inference on the number of features that are relevant, which may be unknown to the generic variable selection algorithm. This work extends our previous implementation of a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset selection algorithm. Specifically, we examine the conservativeness of the NPFS approach by biasing the hypothesis test, and examine other heuristics for NPFS. We include results from carefully designed synthetic datasets. Furthermore, we demonstrate NPFS's ability to perform on data of a massive scale.
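The abstract does not spell out the test itself. The following is a minimal sketch of one way such a meta-selection test could work, assuming (this is not stated in the record) that a base selector picking k of K features is run on repeated resamples, and that a feature's selection count is compared against a Binomial(n, k/K) null in which features are chosen by chance. The function names `binomial_sf`, `critical_count`, and `npfs_sketch`, the significance level, and the toy selector are all hypothetical and chosen for illustration; the biasing of the test and the other heuristics mentioned in the abstract are not modeled.

```python
import math
import random


def binomial_sf(z, n, p):
    """Return P(Z >= z) for Z ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(z, n + 1))


def critical_count(n_runs, k, n_features, alpha=0.01):
    """Smallest count z such that P(Z >= z | chance selection) <= alpha."""
    p0 = k / n_features  # assumed null: each feature is picked with probability k/K
    for z in range(n_runs + 2):
        if binomial_sf(z, n_runs, p0) <= alpha:
            return z


def npfs_sketch(selections, n_features, k, alpha=0.01):
    """selections: one set of chosen feature indices per run of the base selector."""
    counts = [0] * n_features
    for chosen in selections:
        for j in chosen:
            counts[j] += 1
    z_crit = critical_count(len(selections), k, n_features, alpha)
    # Keep features selected at least as often as the critical count.
    return [j for j, c in enumerate(counts) if c >= z_crit]


# Toy base selector (hypothetical): features 0, 1, 2 are always chosen,
# and 2 of the remaining 17 features are chosen at random in each run.
rng = random.Random(0)
K, k, n_runs = 20, 5, 50
runs = [{0, 1, 2} | set(rng.sample(range(3, K), 2)) for _ in range(n_runs)]
relevant = npfs_sketch(runs, n_features=K, k=k)
print(relevant)
```

In this sketch the number of features returned is decided by the test rather than fixed in advance, which is the property the abstract highlights: k is only the working subset size handed to the base selector.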

Keywords :

computational complexity; data mining; feature selection; NPFS hypothesis test; Neyman-Pearson feature selection hypothesis test; Neyman-Pearson subset selection approach; bioinformatics; generic variable selection algorithm; genomics feature subset selection; heuristics; massive data mining; meta-subset selection algorithm; statistical inference; Neyman-Pearson; feature subset selection

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computational Intelligence and Data Mining (CIDM), 2014 IEEE Symposium on

Conference_Location :

Orlando, FL

Type :

conf

DOI :

10.1109/CIDM.2014.7008701

Filename :

7008701

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1796739