DocumentCode :
1830974
Title :
The importance of performance metrics within wrapper feature selection
Author :
Wald, Randall ; Khoshgoftaar, Taghi ; Napolitano, Antonio
fYear :
2013
fDate :
14-16 Aug. 2013
Firstpage :
105
Lastpage :
111
Abstract :
Many important datasets are affected by the problem of high dimensionality (having a large number of attributes or features), which can result in complex and time-consuming classification models. Feature selection techniques try to identify an optimal subset of features which may show improved classification performance as well as identify important features for the application at hand. Wrapper feature selection in particular uses a classifier to discover which feature subsets are most useful. However, feature selection can be affected by another dataset problem: imbalanced data. When one class outnumbers the other class(es), the chosen features may not reflect those most important to all classes, especially when wrapper feature selection uses a performance metric which does not consider class imbalance. No previous work has examined how the choice of performance metric within wrapper-based feature selection will affect classification performance. To study this effect, in this paper we consider two high-dimensional datasets drawn from the field of Twitter profile mining, both of which exhibit class imbalance. Using the Logistic Regression learner, we perform wrapper feature selection followed by classification, using five different performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Best Arithmetic Mean of TPR and TNR, Best Geometric Mean of TPR and TNR, and Overall Accuracy) both within the wrapper and for evaluating the classification model. We find that performance metrics which take class imbalance into account, especially the Area Under the Precision-Recall Curve, are far more effective than Overall Accuracy when used within the wrapper, producing much better performance as evaluated by the metrics which consider imbalance. In fact, even when Overall Accuracy is the classification metric, it is not the best metric to use within the wrapper. In addition, we find that there is no direct connection between the metric used inside the wrapper and the metric used for classification evaluation: the results show similar patterns across all four balance-aware metrics (i.e., all but Overall Accuracy).
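The contrast the abstract draws between Overall Accuracy and the balance-aware metrics can be illustrated with a minimal sketch. It is not the authors' code: the toy labels and scores are invented for illustration, the function names `rates` and `best_means` are hypothetical, and "best" is assumed here to mean the maximum of the mean of TPR and TNR over all candidate decision thresholds.

```python
from math import sqrt

def rates(labels, scores, threshold):
    """Return (TPR, TNR, accuracy) when scores >= threshold are predicted positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / pos, tn / neg, (tp + tn) / len(labels)

def best_means(labels, scores):
    """Best arithmetic and geometric mean of TPR and TNR over candidate thresholds.

    Assumption: candidate thresholds are the observed scores plus +infinity
    (the latter predicts every instance as negative).
    """
    thresholds = sorted(set(scores)) + [float("inf")]
    pairs = [rates(labels, scores, t)[:2] for t in thresholds]
    bam = max((tpr + tnr) / 2 for tpr, tnr in pairs)
    bgm = max(sqrt(tpr * tnr) for tpr, tnr in pairs)
    return bam, bgm

# Invented imbalanced toy data: 2 positive and 8 negative instances.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A model that gives every instance the same score (no ranking ability).
constant = [0.1] * 10
# A model that ranks both positives above all but one negative.
informative = [0.40, 0.45, 0.10, 0.10, 0.20, 0.20, 0.30, 0.30, 0.35, 0.42]

# Overall Accuracy at the default 0.5 threshold is identical for both models.
acc_constant = rates(labels, constant, 0.5)[2]
acc_informative = rates(labels, informative, 0.5)[2]

# The balance-aware metrics separate the two models.
bam_constant, bgm_constant = best_means(labels, constant)
bam_informative, bgm_informative = best_means(labels, informative)

print(acc_constant, acc_informative)
print(bam_constant, bam_informative)
print(bgm_constant, bgm_informative)
```

On this toy data both models predict every instance as negative at the 0.5 threshold, so Overall Accuracy cannot tell them apart, while the best arithmetic and geometric means reward the model that actually ranks the minority class higher. A wrapper that scores candidate feature subsets by accuracy would face the same blind spot.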
Keywords :
data mining; pattern classification; regression analysis; Twitter profile mining; area under the precision-recall curve; area under the receiver operating characteristic curve; balance-aware metrics; best arithmetic mean; best geometric mean; classification performance; feature selection techniques; logistic regression learner; performance metrics; wrapper feature selection; Accuracy; Buildings; Feature extraction; Logistics; Measurement; Pragmatics; Twitter; social bots
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on
Conference_Location :
San Francisco, CA
Type :
conf
DOI :
10.1109/IRI.2013.6642460
Filename :
6642460