Abstract:
Traditional feature evaluation methods, such as information gain, entropy and mutual information, generally evaluate the discriminating power of individual features independently, using a wide variety of metrics; we refer to this family as the TopK approach. Although a few feature evaluation methods, such as wrappers and criterion functions, evaluate the discriminating power of a subset of features instead, they are usually either based upon a heuristic scheme or burdened by high computational cost. As a result, when applied to multi-class classification on large data sets, existing feature evaluation methods either suffer the “siren pitfall” of a surplus of discriminating features for some classes and a lack of discriminating features for the remaining classes, or become inapplicable due to problems of repeatability and computational cost. Specifically, when applied to multi-class classification, the TopK approach overweights individually discriminating features while neglecting their collective discriminating power, and the optimal feature subsets discovered by wrapper methods depend on the chosen classifier and lack repeatability, in addition to incurring a rather high computational cost. In this paper, we propose an effective feature evaluation method for mixed-valued data sets based on set cover criteria. Our set cover feature evaluation method offers several advantages in addressing the “siren pitfall” problem: its feature selection scheme is more robust and relies on little prior knowledge, its feature evaluation process is repeatable, and its computational cost is rather low. In addition, the set cover method is applicable to mixed-valued data sets and is able to quantitatively weigh the discriminating power of features. Experimental results indicate the effectiveness of our set cover method.
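To make the set cover intuition concrete, the following is a minimal sketch of a greedy, set-cover-style feature selection loop. It is an illustrative reconstruction, not the paper's exact criterion: the names `covers`, `greedy_set_cover_selection`, and the numeric-gap threshold `num_gap` are hypothetical, and the assumption that a feature "covers" a cross-class sample pair when it distinguishes that pair (different categorical values, or a numeric difference above the threshold) is only one plausible way to instantiate a set cover criterion on mixed-valued data.

```python
# Illustrative greedy set-cover feature selection sketch (assumed coverage
# criterion, not the paper's definition): a feature "covers" a pair of samples
# from different classes when it distinguishes them.
from itertools import combinations

def covers(value_a, value_b, num_gap=0.0):
    """Return True if this pair of feature values distinguishes two samples."""
    if isinstance(value_a, (int, float)) and isinstance(value_b, (int, float)):
        return abs(value_a - value_b) > num_gap   # numeric feature: gap test
    return value_a != value_b                     # categorical feature: inequality

def greedy_set_cover_selection(X, y, num_gap=0.0):
    """X: samples as lists of mixed-valued features; y: class labels."""
    n_features = len(X[0])
    # All cross-class sample pairs that still need to be covered.
    uncovered = {(i, j) for i, j in combinations(range(len(X)), 2) if y[i] != y[j]}
    selected = []
    while uncovered:
        # Greedy step: pick the feature distinguishing the most uncovered pairs.
        best_f, best_covered = None, set()
        for f in range(n_features):
            if f in selected:
                continue
            covered = {(i, j) for (i, j) in uncovered
                       if covers(X[i][f], X[j][f], num_gap)}
            if len(covered) > len(best_covered):
                best_f, best_covered = f, covered
        if best_f is None:        # remaining pairs cannot be distinguished
            break
        selected.append(best_f)
        uncovered -= best_covered
    return selected

# Example: two classes, one categorical and two numeric features.
X = [["a", 1.0, 5.0], ["a", 1.1, 5.2], ["b", 3.0, 5.1], ["b", 3.2, 4.9]]
y = [0, 0, 1, 1]
print(greedy_set_cover_selection(X, y, num_gap=0.5))   # -> [0]
```

Because each greedy step is judged by how many still-uncovered class pairs a feature resolves, this kind of criterion evaluates the collective discrimination of the selected subset rather than ranking features independently, which is the behavior the abstract contrasts with the TopK approach.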
Keywords:
feature evaluation; feature selection; feature subset selection; set cover