DocumentCode :
2551595
Title :
On Feature Selection for Genomic Signal Processing and Data Mining
Author :
Kung, S.Y.
Author_Institution :
Princeton Univ., Princeton
fYear :
2007
fDate :
27-29 Aug. 2007
Firstpage :
1
Lastpage :
20
Abstract :
The effectiveness of a data mining system lies in the representation of pattern vectors. The most vital information to be represented is the characteristics embedded in the raw data that are most essential for the intended applications. To extract a useful high-level representation, it is desirable that the representation provide concise, invariant, and/or intelligible information on input patterns. The curse of dimensionality has traditionally been a serious concern in many genomic applications. For example, the feature dimension of gene expression data is often on the order of thousands. This motivates exploration into feature selection and representation, both aiming at reducing the feature dimensionality to facilitate the training and prediction of genomic data. The challenge lies in how to reduce the feature dimension with minimal sacrifice in accuracy. For feature selection, both individual and group information are important, and each has its own pros and cons in measuring the truly relevant information. Individual quantification is simple, as each of the M features can be represented by a single value. However, it cannot deal with inter-feature redundancy, which abounds especially in genomic data. In contrast, group information can fully address the mutual redundancy, but it is often too complicated to process. (Note that there are 2^M possible groups.) Fortunately, between the two extremes there is a convenient compromise: the pairwise kernel, which has low complexity (M^2 pairs) and yet reveals the critical information regarding inter-feature redundancy. Indeed, it has already been found very useful for many genomic applications. In particular, we describe how pairwise-based feature selection may be successfully applied to genomic subcellular localization. A special method (VIA-SVM) designed exclusively for pairwise scoring kernels is introduced. This is the first method that fully utilizes the reflexive property of the so-called self-supervised training data, which is uniquely available in multiple sequence alignment. Based on several subcellular localization experiments, VIA-SVM, when combined with some filter-type metrics, appears to deliver a substantial dimension reduction (one order of magnitude) with only little degradation in accuracy.
Keywords :
biology computing; data analysis; data mining; data reduction; feature extraction; learning (artificial intelligence); signal processing; support vector machines; SVM; data mining; dimensionality reduction; feature selection; gene expression data; genomic signal processing; self-supervised training data; Bioinformatics; Data mining; Design methodology; Gene expression; Genomics; Kernel; Machine learning; Proteins; Signal processing; Training data;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning for Signal Processing, 2007 IEEE Workshop on
Conference_Location :
Thessaloniki
ISSN :
1551-2541
Print_ISBN :
978-1-4244-1566-3
Electronic_ISBN :
1551-2541
Type :
conf
DOI :
10.1109/MLSP.2007.4414275
Filename :
4414275