Title :
Robust and accurate cancer classification with gene expression profiling
Author :
Li, Haifeng ; Zhang, Keshu ; Jiang, Tao
Author_Institution :
Dept. of Comput. Sci., California Univ., Riverside, CA, USA
Abstract :
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples-to-features-per-class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix Sw be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when Sw is nonsingular. Unlike the conventional LDA, GLDA does not assume the nonsingularity of Sw, and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. In particular, on some difficult instances that have very small samples-to-genes-per-class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines and random forests.
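The abstract's remark that Sw is always singular in this setting can be illustrated with a minimal sketch. It is not the authors' GLDA algorithm; it only builds a within-class scatter matrix from synthetic data and checks its rank. The sample count, gene count, class count, and the helper name `within_class_scatter` are assumptions chosen for illustration.

```python
import numpy as np

def within_class_scatter(X, y):
    """Sum, over classes, of the scatter of samples around their class mean."""
    n_features = X.shape[1]
    Sw = np.zeros((n_features, n_features))
    for label in np.unique(y):
        Xc = X[y == label]
        centered = Xc - Xc.mean(axis=0)   # subtract the class mean
        Sw += centered.T @ centered
    return Sw

# Hypothetical sizes: far fewer samples than genes, as in expression data.
n_samples, n_genes, n_classes = 20, 200, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_genes))            # synthetic "expression" matrix
y = np.repeat(np.arange(n_classes), n_samples // n_classes)

Sw = within_class_scatter(X, y)
rank = np.linalg.matrix_rank(Sw)

# rank(Sw) is at most n_samples - n_classes, which is below n_genes here,
# so Sw cannot be inverted as the conventional LDA requires.
print(f"Sw is {Sw.shape[0]}x{Sw.shape[1]} with rank {rank}")
```

Because each class contributes at most (class size minus one) independent directions, the rank bound holds for any data of these dimensions, not only for the random matrix used here.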
Keywords :
cancer; genetics; medical computing; support vector machines; tumours; GLDA; accurate cancer classification; cancer treatment; diagnose tumors; fast algorithm; gene expression profile; generalized linear discriminant analysis; map gene expression data; optimize Fisher criterion; random forests; scatter matrix; support vector machine; trustable error rates; Cancer; Computer science; Gene expression; Humans; Linear discriminant analysis; Neoplasms; Robustness; Scattering; Support vector machine classification; Support vector machines;
Conference_Title :
Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE
Print_ISBN :
0-7695-2344-7
DOI :
10.1109/CSB.2005.49