Regional Information Center for Science and Technology - On Feature Selection for Genomic Signal Processing and Data Mining

Abstract:

The effectiveness of a data mining system lies in its representation of pattern vectors. The most vital information to represent is the set of characteristics embedded in the raw data that are most essential for the intended application. A useful high-level representation should provide concise, invariant, and/or intelligible information about the input patterns. The curse of dimensionality has traditionally been a serious concern in many genomic applications. For example, the feature dimension of gene expression data is often on the order of thousands. This motivates exploration into feature selection and representation, both of which aim to reduce the feature dimensionality in order to facilitate training and prediction on genomic data. The challenge lies in how to reduce the feature dimension while sacrificing as little accuracy as possible. For feature selection, both individual and group information are important, and each has its own pros and cons in measuring the truly relevant information. Individual quantification is simple, as each of the M features can be represented by a single value. However, it cannot deal with inter-feature redundancy, which is especially abundant in genomic data. In contrast, group information can fully address the mutual redundancy, but it is often too complicated to process. (Note that there are 2^M possible groups.) Between the two extremes, fortunately, there is a convenient compromise: the pairwise kernel, which has a low complexity (M² pairs) and yet reveals the critical information regarding inter-feature redundancy. Indeed, it has already been found very useful in many genomic applications. In particular, we describe how pairwise-based feature selection can be successfully applied to genomic subcellular localization. A special method (VIA-SVM) designed exclusively for pairwise scoring kernels is introduced.
This is the first method that fully utilizes the reflexive property of the so-called self-supervised training data, which arises uniquely in multiple sequence alignment. Based on several subcellular localization experiments, the VIA-SVM, when combined with some filter-type metrics, appears to deliver a substantial dimension reduction (one order of magnitude) with only a small degradation in accuracy.
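
The contrast between individual and pairwise information can be illustrated with a minimal sketch. This is not the VIA-SVM method described in the abstract. It is a generic greedy filter, written under the assumption that absolute Pearson correlation is an acceptable stand-in for both the individual relevance score and the pairwise redundancy measure. The function names, the `max_redundancy` threshold, and the toy data are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(columns, labels, k, max_redundancy=0.9):
    """Greedy filter (illustrative assumption, not VIA-SVM).

    Ranks features by an individual score, then skips any feature whose
    pairwise redundancy with an already chosen feature exceeds the threshold.
    """
    m = len(columns)
    # Individual information: one value per feature (M values).
    relevance = [abs(pearson(col, labels)) for col in columns]
    # Pairwise information: one value per unordered pair, M*(M-1)/2 values.
    redundancy = {
        (i, j): abs(pearson(columns[i], columns[j]))
        for i, j in combinations(range(m), 2)
    }
    chosen = []
    for i in sorted(range(m), key=lambda f: relevance[f], reverse=True):
        if all(redundancy[(min(i, j), max(i, j))] <= max_redundancy for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen


# Toy data (hypothetical): feature 1 is a near-copy of feature 0,
# feature 2 is weakly relevant but not redundant with feature 0.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
]
labels = [0, 0, 0, 1, 1, 1]

print(select_features(columns, labels, k=2))
```

Ranking on individual scores alone would pick features 0 and 1, which carry nearly the same information. Consulting the pairwise values lets the filter skip the near-duplicate and keep feature 2 instead, at the cost of computing on the order of M² pair scores rather than examining all 2^M groups.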