Title :
Predicting "Essential" Genes across Microbial Genomes: A Machine Learning Approach
Author :
Palaniappan, Krishnaveni ; Mukherjee, Sumitra
Author_Institution :
Biol. Data Manage. & Technol. Center, Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
Abstract :
Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologists and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence-derived input features. We investigate three supervised classification methods, Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT), for this binary classification task. The classifiers are trained and evaluated using 37830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant features derived from the genomic sequence is used as input for the classifiers. The models were evaluated using two novel blind testing schemes, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO), and a 10-fold stratified cross-validation (10-f-cv) strategy, on both the full multi-genome dataset and its class-imbalance-reduced variants. Experimental results (10 X 10-f-cv) indicate that SVM and ANN perform better than DT, with Area under the Receiver Operating Characteristics (AU-ROC) scores of 0.80, 0.79 and 0.68, respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes using only the gene sequence and features derived from it. The LOGO and LOTO blind test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.
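The Leave-One-Genome-Out scheme named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `leave_one_genome_out` and the toy genome identifiers are hypothetical, and the sketch assumes only that each example carries a label saying which genome it came from. Each genome is held out in turn as a blind test set while the examples from all other genomes form the training set. Passing taxon-group labels instead of genome labels would give the analogous Leave-One-Taxon-group-Out split.

```python
from collections import defaultdict


def leave_one_genome_out(genome_ids):
    """Yield (held_out_genome, train_indices, test_indices) for each genome.

    genome_ids holds one genome label per example (hypothetical input format).
    """
    # Group example indices by the genome they belong to.
    by_genome = defaultdict(list)
    for idx, genome in enumerate(genome_ids):
        by_genome[genome].append(idx)

    # Hold out one genome at a time; train on all remaining genomes.
    for held_out in sorted(by_genome):
        test_idx = by_genome[held_out]
        train_idx = sorted(
            i
            for genome, indices in by_genome.items()
            if genome != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy data: one genome identifier per gene example.
genome_ids = ["genome_A", "genome_A", "genome_B", "genome_C", "genome_B"]
splits = list(leave_one_genome_out(genome_ids))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no example from the held-out genome appears in training, the score on that genome estimates how well a classifier transfers to a genome it has never seen, which is the property the blind tests are meant to probe.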
Keywords :
bioinformatics; decision trees; genetics; genomics; learning (artificial intelligence); neural nets; pattern classification; support vector machines; 10-fold stratified cross validation strategy; ANN; AU-ROC scores; LOGO scheme; LOTO scheme; SVM; area under the receiver operating characteristics; artificial neural network; binary classification task; biotechnology; blind testing schemes; decision tree; essential gene identification; essential gene prediction; genome biologist; genomic sequence; leave-one-genome-out scheme; leave-one-taxon-group-out scheme; medicine; microbial genomes; multigenome datasets; supervised classification methods; supervised machine learning method; support vector machine; trained classifiers; Artificial neural networks; Bioinformatics; Genomics; Organisms; Support vector machines; Testing; Training; bioinformatics; essential genes; microbial genomes; supervised learning;
Conference_Title :
Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4577-2134-2
DOI :
10.1109/ICMLA.2011.114