DocumentCode :
2319454
Title :
Robust integrated framework for effective feature selection and sample classification and its application to gene expression data analysis
Author :
Gao, Shang ; Addam, Omar ; Qabaja, Ala ; ElSheikh, Abdallah ; Zarour, Omar ; Nagi, Mohamad ; Triant, Flouris ; Almansoori, W. ; Ozyer, O.S.T. ; Zeng, Jia ; Rokne, Jon ; Alhajj, Reda
Author_Institution :
Dept. of Comput. Sci., Univ. of Calgary, Calgary, AB, Canada
fYear :
2012
fDate :
9-12 May 2012
Firstpage :
112
Lastpage :
119
Abstract :
Genes are coding regions that form essential building blocks within the cell and give rise to proteins that carry out various functions. However, some genes may be mutated due to internal or external factors, and such mutations are a main cause of various diseases. Mutations can be discovered by closely examining samples taken from patients to identify faulty genes. In other words, it is important to identify mutated genes as disease biomarkers. Normal and infected samples can then be used to build a classifier model capable of classifying new samples as infected or normal. The work described in this paper addresses this problem by introducing a comprehensive framework that incorporates the two stages of the process, namely feature selection and sample classification. The high dimensionality in terms of the number of genes, combined with the small number of samples, makes gene expression data an ideal application for the proposed framework. Reducing the dimensionality is essential to efficiently analyze the samples for effective knowledge discovery. There is a tradeoff between feature selection and maintaining acceptable accuracy. The target is to find the reduction level, or compact set of features, that once used for knowledge discovery will lead to improved performance and acceptable accuracy. For the first stage, we concentrate on four feature selection techniques, namely chi-square from statistics, frequent pattern mining and clustering from data mining, and community detection from network analysis. The effectiveness of the feature reduction techniques is demonstrated in the second stage by coupling them with classification techniques, namely associative classification, support vector machine, and naive Bayesian classifier. Majority voting is applied in both stages. The results reported for four cancer datasets demonstrate the applicability and effectiveness of the proposed framework.
Keywords :
Bayes methods; bioinformatics; cancer; data mining; feature extraction; genetics; molecular biophysics; network analysis; pattern classification; support vector machines; associative classification; cancer; chi-square method; classifier model; community detection; data clustering; data mining; dimensionality reduction; disease biomarker; feature reduction technique; feature selection; frequent pattern mining; gene expression data analysis; knowledge discovery; mutation; naive Bayesian classifier; network analysis; proteins; statistics; support vector machine; Bayesian methods; Biomarkers; Communities; Entropy; Gene expression; Itemsets; Support vector machines; Feature selection; SVM; associative classifier; chisquare; classification; clustering; frequent pattern mining; gene expression data; naive Bayesian classifier; network analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2012 IEEE Symposium on
Conference_Location :
San Diego, CA
Print_ISBN :
978-1-4673-1190-8
Type :
conf
DOI :
10.1109/CIBCB.2012.6217219
Filename :
6217219