Title :
Comparison of Stability for Different Families of Filter-Based and Wrapper-Based Feature Selection
Author :
Wald, Randall ; Khoshgoftaar, Taghi ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Due to the prevalence of high dimensionality (having a large number of independent attributes), feature selection techniques (which reduce the feature set to a more manageable size) have become quite popular. These reduced feature subsets can help improve the performance of classification models and can also inform researchers about which features are most relevant for the problem at hand. For this latter purpose, it is often most important that the features chosen are consistent even in the face of changes (perturbations) to the dataset. While previous studies have considered the problem of finding so-called "stable" feature selection techniques, none has examined stability across all three major categories of feature selection technique: filter-based feature rankers (which use statistical measures to assign scores to each feature), filter-based subset evaluators (which also employ statistical approaches, but consider whole feature subsets at a time), and wrapper-based subset evaluators (which also consider whole subsets, but build classification models to evaluate these subsets). In the present study, we use two datasets from the domain of Twitter profile mining to compare the stability of five filter-based rankers, two filter-based subset evaluators, and five wrapper-based subset evaluators. We find that the rankers are most stable, followed by the filter-based subset evaluators, with the wrappers being the least stable. We also show that the relative performance among the techniques within each group is consistent across datasets and perturbation levels. However, the relative stability of the two datasets does vary between the groups, showing that the effects are more complex than simply "one group is always more stable than another group".
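The notion of stability under perturbation described in the abstract can be illustrated with a minimal sketch. The abstract does not name the stability metric, the rankers, or the perturbation scheme used in the study, so everything below is an assumption for illustration: a toy correlation-based ranker stands in for a filter-based ranker, perturbation is modeled as random subsampling of instances, and stability is taken as the mean pairwise Jaccard similarity of the selected feature subsets. The function names (`correlation_ranker`, `stability`) are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def correlation_ranker(rows, labels, k):
    """Toy filter-based ranker (assumed, not from the paper):
    score each feature by |correlation with the label| and keep the top k."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    top = sorted(scores, reverse=True)[:k]
    return frozenset(j for _, j in top)


def jaccard(a, b):
    """Jaccard similarity of two feature-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(rows, labels, k, keep_fraction=0.8, runs=10, seed=0):
    """Mean pairwise Jaccard similarity of the feature subsets chosen on
    `runs` random subsamples, each keeping `keep_fraction` of the instances."""
    rng = random.Random(seed)
    n = len(rows)
    subsets = []
    for _ in range(runs):
        idx = rng.sample(range(n), int(keep_fraction * n))
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        subsets.append(correlation_ranker(sub_rows, sub_labels, k))
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))
```

A value near 1 means the same features are chosen regardless of which instances were dropped; lowering `keep_fraction` corresponds to a stronger perturbation. A subset evaluator or wrapper could be compared by swapping it in for `correlation_ranker`.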
Keywords :
data mining; feature selection; information filters; pattern classification; social networking (online); statistical analysis; Twitter profile mining; classification model performance improvement; feature subset reduction; filter-based feature rankers; filter-based feature selection; filter-based subset evaluators; high-dimensional data; perturbation level; relative stability analysis; score assignment; statistical measures; wrapper-based feature selection; wrapper-based subset evaluation; Buildings; Feature extraction; Indexes; Measurement; Stability criteria; Twitter; Stability
Conference_Title :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
DOI :
10.1109/ICMLA.2013.162