Abstract:
In machine learning, numerous supervised techniques extend naturally to analogous unsupervised methods, such as clustering. In this paper, we consider so-called rare-weak models, in which the number of important features is small (rare) and the signal strength of each important feature is minimal (weak). When classical clustering is applied crudely in "big data" scenarios, significant problems can arise, including long computational run times and large clustering errors. One solution is to use feature selection (FS) to reduce dataset dimensionality before clustering. We introduce two novel unsupervised feature selection methods, one parametric and one nonparametric, based on what we call bimodal feature selection. These methods produce ranked lists of features based on their univariate multimodality. Unlike previously developed univariate FS methods, which have typically been restricted to 2-cluster scenarios, our methods have been adapted and tested to discriminate both binary and higher-order clusterings. The approach is particularly advantageous in rare-weak settings, since reducing data dimensionality allows classical clustering methods to run faster and with greater accuracy.
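The paper's own estimators are not reproduced here, but the core idea of ranking features by univariate multimodality can be sketched with a simple, well-known surrogate score: Sarle's bimodality coefficient. In the sketch below, the function names and the synthetic rare-weak data (a few features drawn from a two-component Gaussian mixture among many pure-noise features) are illustrative assumptions, not the authors' method.

```python
import numpy as np

def bimodality_coefficient(x):
    """Sarle's bimodality coefficient: (skewness^2 + 1) / kurtosis.
    Uses Pearson (non-excess) kurtosis; values above ~5/9 (the uniform
    baseline) are commonly read as evidence of bi/multimodality.
    This is a stand-in score, not the paper's parametric/nonparametric
    estimators."""
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)  # Pearson kurtosis (normal ~ 3)
    return (skew ** 2 + 1.0) / kurt

def rank_features_by_bimodality(X):
    """Rank columns of X by decreasing bimodality score.
    Returns (order, scores): order[0] is the most bimodal feature."""
    scores = np.array([bimodality_coefficient(X[:, j])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

if __name__ == "__main__":
    # Illustrative rare-weak setup: 5 informative bimodal features
    # hidden among 195 unimodal noise features.
    rng = np.random.default_rng(0)
    n, p, k = 500, 200, 5
    X = rng.normal(size=(n, p))
    X[:, :k] += rng.choice([-2.0, 2.0], size=(n, k))  # mixture shift
    order, scores = rank_features_by_bimodality(X)
    print("top features:", sorted(order[:k].tolist()))
```

A downstream pipeline would then keep only the top-ranked features before handing the reduced matrix to a classical clustering algorithm such as k-means, which is where the computational and accuracy gains in rare-weak settings come from.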
Keywords: Clustering methods, Kernel, Clustering algorithms, Estimation, Standards, Electronic mail