DocumentCode :
1415369
Title :
Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data
Author :
Javed, Kashif ; Babri, Haroon A. ; Saeed, Mehreen
Author_Institution :
Dept. of Electr. Eng., Univ. of Eng. & Technol., Lahore, Pakistan
Volume :
24
Issue :
3
fYear :
2012
fDate :
3/1/2012
Firstpage :
465
Lastpage :
477
Abstract :
Data and knowledge management systems employ feature selection algorithms to remove irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection: feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure, while the rankings obtained from the two criteria closely approximate each other. CDFE uses a filtrapper approach to select a final subset. For data sets with hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. FSS algorithms, on the other hand, analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods into a two-stage feature selection algorithm. When introduced as a preprocessing step to FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification, but also relieves them of heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three different real-life data sets (NOVA, HIVA, and GINA) of the "Agnostic Learning versus Prior Knowledge" challenge. As a stand-alone method, CDFE achieves up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and achieves up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining third position in the case of GINA.
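The abstract describes CDFE as ranking binary features by weights derived from class-dependent densities, with the ranking approximating a mutual-information ordering at lower cost. The paper's exact weight formula is not given in this record; the sketch below is a hypothetical illustration that uses the absolute gap between per-class feature densities (the fraction of 1s in each class) as the ranking weight, and then keeps a subset of top-ranked features as a first stage before any FSS method.

```python
import numpy as np

def class_density_rank(X, y):
    """Rank binary features by a class-dependent density gap.

    Hypothetical sketch (the published CDFE weight formula may differ):
    the weight of each feature is the absolute difference between its
    density (fraction of 1s) in the positive and negative classes.
    Returns feature indices ordered from most to least discriminative,
    together with the weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d_neg = X[y == 0].mean(axis=0)   # per-feature density in class 0
    d_pos = X[y == 1].mean(axis=0)   # per-feature density in class 1
    weights = np.abs(d_pos - d_neg)  # large gap => more discriminative
    order = np.argsort(weights)[::-1]
    return order, weights

def select_top_fraction(X, y, keep=0.1):
    """First-stage filter: keep the top `keep` fraction of features,
    yielding a reduced matrix an FSS method could then refine."""
    order, _ = class_density_rank(X, y)
    k = max(1, int(keep * X.shape[1]))
    kept = np.sort(order[:k])
    return X[:, kept], kept
```

A feature that is constant across both classes gets weight 0 and is eliminated first, which matches the intuition of removing irrelevant information before the heavier subset-search stage.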
Keywords :
pattern classification; GINA data set; HIVA data set; NOVA data set; class-dependent density; class-dependent density-based feature elimination; data management system; feature classification; feature ranking approach; feature selection algorithm; feature subset selection approach; filtrapper approach; high-dimensional binary data; kernel ridge regression classifier; knowledge management system; naive Bayes classifier; Accuracy; Algorithm design and analysis; Approximation algorithms; Frequency selective surfaces; Markov processes; Mutual information; Redundancy; Feature ranking; binary data; classification; feature subset selection; two-stage feature selection;
fLanguage :
English
Journal_Title :
IEEE Transactions on Knowledge and Data Engineering
Publisher :
ieee
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2010.263
Filename :
5677524