DocumentCode :
1925029
Title :
Pattern Recognition in Mining High-Throughput Genomics/Proteomics Data: The New Challenges in Old Questions
Author :
Zhang, Xuegong
Author_Institution :
Dept. of Autom., Tsinghua Univ., Beijing
fYear :
2007
fDate :
5-7 March 2007
Firstpage :
242
Lastpage :
244
Abstract :
Summary form only given. Current molecular biology and systems biology are characterized by the rapid accumulation of high-throughput genomics and proteomics data, such as microarray and mass spectrometry (MS) data. Through our study of microarray and MS data, we have observed that the cancer classification and gene/biomarker selection task has many unique characteristics that distinguish it from other standard pattern recognition tasks. Because the sample size is extremely small, reliable assessment of classification accuracy becomes a major question. For gene/biomarker selection, a key question is the significance of the selected genes/markers. We studied these questions with both simulated and real microarray and MS data. We developed a perturbation-based method for estimating the distribution of error rates of a support vector machine classifier. For evaluating the statistical significance of gene lists selected by sophisticated machine learning methods, we defined the problem of rank significance of genes and developed a heuristic strategy for estimating this significance. These questions highlight two important aspects of pattern recognition problems in high-throughput computational molecular biology. Awareness of such questions is key to properly applying computational methods to practical data and to developing new methods that truly target the scientific questions.
Keywords :
biology computing; cancer; data mining; genetics; learning (artificial intelligence); molecular biophysics; pattern classification; proteins; support vector machines; cancer classification; computational molecular biology; data mining; genomics data; machine learning; mass spectrometry data; microarray data; pattern recognition; perturbation-based method; proteomics data; statistical significance; support vector machine; Bioinformatics; Biology computing; Biomarkers; Cancer; Error analysis; Genomics; Mass spectroscopy; Pattern recognition; Proteomics; Systems biology;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computing: Theory and Applications, 2007. ICCTA '07. International Conference on
Conference_Location :
Kolkata
Print_ISBN :
0-7695-2770-1
Type :
conf
DOI :
10.1109/ICCTA.2007.103
Filename :
4127374