Title :
Pattern Recognition in Mining High-Throughput Genomics/Proteomics Data: The New Challenges in Old Questions
Author_Institution :
Dept. of Autom., Tsinghua Univ., Beijing
Abstract :
Summary form only given. Current molecular biology and systems biology are characterized by the rapid accumulation of high-throughput genomics and proteomics data, such as microarray and mass spectrometry (MS) data. Through our study of microarray and MS data, we have observed that the cancer classification and gene/biomarker selection tasks have many unique characteristics that distinguish them from standard pattern recognition tasks. Because of the extremely small sample size, reliable assessment of classification accuracy becomes a major question. For gene/biomarker selection, a key question is the significance of the selected genes/markers. We studied these questions with both simulated and real microarray and MS data. We developed a perturbation-based method for estimating the distribution of error rates of a support vector machine classifier. For evaluating the statistical significance of gene lists selected by sophisticated machine learning methods, we defined the problem of rank significance of genes and developed a heuristic strategy for estimating this significance. These questions highlight two important aspects of pattern recognition problems in high-throughput computational molecular biology. Awareness of such questions is key to properly applying computational methods to practical data and to developing new methods that truly target the scientific questions.
Keywords :
biology computing; cancer; data mining; genetics; learning (artificial intelligence); molecular biophysics; pattern classification; proteins; support vector machines; cancer classification; computational molecular biology; genomics data; machine learning; mass spectrometry data; microarray data; pattern recognition; perturbation-based method; proteomics data; statistical significance; Bioinformatics; Biomarkers; Error analysis; Genomics; Mass spectroscopy; Proteomics; Systems biology
Conference_Title :
Computing: Theory and Applications, 2007. ICCTA '07. International Conference on
Conference_Location :
Kolkata
Print_ISBN :
0-7695-2770-1
DOI :
10.1109/ICCTA.2007.103