Author :
Ko, Kyung Dae ; Hong, Yoo Jin ; Van Rossum, Damian B. ; Patterson, Randen L.
Author_Institution :
Dept. of Biol., Penn State Univ., University Park, PA, USA
Abstract :
In principle, the amino acid sequence of a protein encodes its structural, functional, and evolutionary characteristics. Investigating these characteristics with computational methods provides a powerful resource. However, existing methods are limited in their ability to annotate the characteristics of proteins accurately. In an attempt to overcome this drawback, we have developed a unified computational pipeline, called the Gestalt domain detection algorithm basic local alignment tool (GDDA-BLAST), for measuring the structural, functional, and evolutionary characteristics of a protein. In homology detection, GDDA-BLAST outperforms other methods such as SAM and PSI-BLAST. Using GDDA-BLAST, we implemented a classification library to find quantitative thresholds capable of inferring protein function. Using this library, we first identified RNA-binding proteins (RBPs) containing structurally unique motifs, using 2,695 expanded position-specific scoring matrix (PSSM) profiles, in a test dataset of 37 positive and 118 negative sequences. We achieved 100% specificity, 96.8% accuracy, and 86.5% sensitivity. For the specific nucleotide binding folds (dsRNA vs. dsDNA, dsRNA vs. dsDNA, and ssRNA vs. ssDNA), our results exceeded those obtained using support vector machine (SVM) learning algorithms. Using this method, we also identified 29 and 168 novel RBPs in the yeast and human proteomes, respectively. We extended our experiments to additional protein functions: Ankyrin-repeat (ANK), integral lipid-binding (ILB), and calmodulin (CaM)-binding. For ANK, 449 PSSMs were used to measure 126 negative and 32 positive sequences. For ILB and CaM-binding, 24,378 PSSMs were used to measure 24 negatives and 32 positives, and 820 PSSMs were used to measure 17 negatives and 65 positives, respectively. By ROC curve analysis, we achieved ~100%, ~93%, and ~72% sensitivity at a false positive rate of ~10% for ANK, ILB, and CaM-binding classification, respectively.
These results again confirmed that proteins can be classified using function-specific PSSM sets. We believe that performance can be improved with more carefully curated PSSM sets. Taken together, these results suggest that this method can be used to create PSSM databases for the quantitative measurement and classification of any protein function.
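As an illustrative aside, the RBP figures quoted in the abstract (100% specificity, 96.8% accuracy, 86.5% sensitivity on 37 positives and 118 negatives) can be checked with a few lines of Python. This is a minimal sketch, not the authors' evaluation code. The abstract does not state the per-class counts: the split used here (32 true positives, 5 false negatives, 118 true negatives, 0 false positives) is an assumption chosen because it reproduces the reported percentages. The `Confusion` class and the variable `rbp` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a binary classification test set."""
    tp: int  # positives correctly called positive
    fn: int  # positives missed
    tn: int  # negatives correctly called negative
    fp: int  # negatives wrongly called positive

    @property
    def sensitivity(self) -> float:
        # Fraction of true positives recovered (recall).
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of true negatives correctly rejected.
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        # Fraction of all sequences classified correctly.
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


# ASSUMPTION: these counts are inferred, not reported in the abstract.
# They are one split of 37 positives and 118 negatives that matches
# the stated percentages.
rbp = Confusion(tp=32, fn=5, tn=118, fp=0)

print(f"sensitivity: {100 * rbp.sensitivity:.1f}%")  # 86.5%
print(f"specificity: {100 * rbp.specificity:.1f}%")  # 100.0%
print(f"accuracy:    {100 * rbp.accuracy:.1f}%")     # 96.8%
```

Under this assumed split, five of the 37 known RBPs fall below the classification threshold and no negative sequence rises above it, which is what the reported combination of perfect specificity and sub-100% sensitivity implies.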
Keywords :
DNA; bioinformatics; macromolecules; molecular biophysics; pattern classification; proteins; proteomics; Ankyrin-repeat protein function; CaM-binding classification; Gestalt domain detection algorithm basic local alignment tool; PSSM databases; RNA-binding proteins; ROC curve analysis; amino acid sequence; bioinformatics; calmodulin-binding protein function; classification library; dsDNA; dsRNA; functional PSSM; human proteomes; integral lipid-binding protein function; nucleotide binding folds; position specific scoring metric profiles; primary protein sequence; protein classification; structural-specific PSSM; unified computational pipeline; yeast; Amino acids; Detection algorithms; Fungi; Libraries; Machine learning; Pipelines; Protein engineering; Sensitivity; Support vector machines; Testing;