DocumentCode :
2369649
Title :
Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction
Author :
Hu, Zhang-Zhi ; Torii, Manabu ; Wang, Jinlian ; Liu, Hongfang ; Hart, Gerald W.
Author_Institution :
Dept. of Oncology, Georgetown Univ. Med. Center, Washington, DC, USA
fYear :
2009
fDate :
1-4 Nov. 2009
Firstpage :
346
Lastpage :
346
Abstract :
Protein glycosylation is one of the most common posttranslational modifications (PTMs) and occurs in several types. O-GlcNAcylation is an O-linked glycosylation in which beta-N-acetylglucosamine (GlcNAc) is attached to Ser/Thr residues by O-GlcNAc transferase (OGT) and removed by O-GlcNAcase. Unlike mucin-type O-glycosylation, O-GlcNAcylation occurs primarily on nucleocytoplasmic proteins, and the monosaccharide is not further extended. Moreover, O-GlcNAcylation is dynamic and often reciprocal to phosphorylation at the same or adjacent Ser/Thr residues. Growing evidence suggests that O-GlcNAcylation is very common and has broad roles in physiology and disease, especially through its interplay with phosphorylation, e.g., in the regulation of insulin signaling and transcription and in diabetes and neurodegenerative diseases. In contrast to the enormous body of research on the cellular roles of phosphorylation, the amount of research on O-GlcNAcylation has been disproportionately small, and annotation of O-GlcNAcylated sites in protein databases is currently scarce. An O-GlcNAcylation site prediction program was developed in 2002, but it was based on a small data set of the 40 O-GlcNAcylation sites known at that time. Here we seek to develop a database of O-GlcNAcylated proteins and sites, named dbOGAP, together with an O-GlcNAcylation site prediction system based on the known sites in dbOGAP, to facilitate annotation and proteomic identification of O-GlcNAcylation sites. We developed dbOGAP based primarily on O-GlcNAcylated proteins and sites published in peer-reviewed articles dating back to 1984, when the modification was first described. Most of these proteins were mapped to UniProtKB protein IDs, except for some that could not be unambiguously mapped. The database currently contains 540 protein entries with experimental O-GlcNAcylation information, and 338 O-GlcNAc sites for 164 proteins.
About 59% of these proteins are human; other organisms include rat, mouse, fly and African frog. Among the 164 proteins with known O-GlcNAcylation sites, 122 have both phosphorylation sites (total 1634) and O-GlcNAc sites (total 263). Gene ontology (GO) profiling showed that the known O-GlcNAcylated proteins have a broad range of functions, including developmental processes, transcriptional regulation, cell signaling, metabolic regulation, and cellular transport and trafficking. The GO profile also showed that O-GlcNAcylated proteins are primarily nuclear and cytoplasmic, including membrane-associated intracellular proteins. The database is also populated with additional protein sequences orthologous to known O-GlcNAcylated proteins. Additional functional data, including other PTM features, biological pathways and disease information, have been integrated into the database. We developed an O-GlcNAcylation site prediction program using a support vector machine (SVM). As positive instances, sequence fragments surrounding 322 O-GlcNAcylated Ser/Thr sites were extracted from 157 proteins in dbOGAP, and over 28,000 sequence fragments surrounding the remaining Ser/Thr sites in those proteins were assumed to be negative instances. Two thirds of this data set was randomly selected as development data and used for tuning the parameters of the SVM classifiers, while the rest was set apart as a held-out test data set. To reduce the impact of imbalanced data on the performance of the trained classifiers, we explored different ratios of positive to negative instances in the training data, controlled by under-sampling the negative instances. The optimal parameters of the prediction system were sought in five-fold cross-validation tests conducted on the development data set, and the final classifier, trained on the entire development data set, was evaluated on the held-out test data set.
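The data preparation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not give the fragment length, the padding scheme, or the sampling ratios tried, so the half-width of 10, the pad character `X`, the 1:3 positive-to-negative ratio, and all function names here are hypothetical.

```python
import random

def extract_windows(sequence, positive_sites, half_width=10, pad="X"):
    """Return (positives, negatives): fragments centred on each Ser/Thr.

    positive_sites holds 1-based positions of known O-GlcNAc sites; every
    other S/T in the protein is assumed negative, as in the abstract.
    half_width and pad are assumptions, not values from the source.
    """
    padded = pad * half_width + sequence + pad * half_width
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue not in "ST":
            continue
        # Residue i sits at index i + half_width in the padded string.
        window = padded[i:i + 2 * half_width + 1]
        if (i + 1) in positive_sites:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives

def split_development_test(items, dev_fraction=2 / 3, seed=0):
    """Randomly split items into a development set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

def undersample_negatives(positives, negatives, neg_per_pos=3, seed=0):
    """Keep at most neg_per_pos negatives per positive (ratio is assumed)."""
    rng = random.Random(seed)
    n_keep = min(len(negatives), neg_per_pos * len(positives))
    return rng.sample(negatives, n_keep)
```

Under-sampling would be applied only to the training portion of each cross-validation fold, so that the held-out test set keeps its natural class imbalance.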
We used four encoding methods for feature vector extraction, including binary encoding, composition of k-spaced amino acid pairs (CKS
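Two of the encodings named above, binary encoding and the composition of k-spaced amino acid pairs, are commonly implemented along the following lines. This is a minimal sketch of the general techniques, not the authors' implementation: the 20-letter alphabet ordering, the handling of non-standard residues, the default `max_k`, and the function names are all assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering assumed)

def binary_encode(window):
    """One-hot encode each position as a 20-dimensional 0/1 block.

    Residues outside AMINO_ACIDS (e.g. a pad character) become all zeros.
    """
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector

def k_spaced_pair_composition(window, max_k=2):
    """Frequencies of residue pairs separated by k positions, k = 0..max_k.

    For each k there are 400 features (one per ordered pair), each the
    count of that pair divided by the number of pair positions in the
    window. max_k=2 is an assumed default, not a value from the source.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    index = {pair: j for j, pair in enumerate(pairs)}
    features = []
    for k in range(max_k + 1):
        counts = [0] * len(pairs)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in index:  # skip pairs containing non-standard residues
                counts[index[pair]] += 1
        features.extend(
            c / n_positions if n_positions > 0 else 0.0 for c in counts
        )
    return features
```

Either function turns a sequence fragment into a fixed-length numeric vector that an SVM can consume; the binary encoding preserves position, while the pair composition summarises local residue co-occurrence.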
Keywords :
Internet; bioinformatics; proteins; support vector machines; O-GlcNAc transferase; O-GlcNAcase; O-GlcNAcylated proteins; O-GlcNAcylation; O-linked glycosylation; SVM classifiers; Ser/Thr residues; UniProtKB protein IDs; Web resource; beta-N-acetylglucosamine; binary encoding; bioinformatics resource; bioinformatics tool; cell signaling; cellular trafficking; cellular transport; dbOGAP; developmental process; feature vector extraction; gene ontology; k-spaced amino acid pairs; membrane-associated intracellular proteins; metabolic regulation; monomer spectrum; neurodegenerative diseases; nucleocytoplasmic proteins; orthologous protein sequences; phosphorylation cellular roles; posttranslational modifications; protein databases; protein glycosylation; site prediction; support vector machine; transcriptional regulation; Bioinformatics; Data mining; Diseases; Encoding; Nuclear facility regulation; Protein engineering; Spatial databases; Support vector machines; System testing; Training data; O-GlcNAcylation; database; protein glycosylation; site prediction; support vector machine;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Biomedicine Workshop, 2009. BIBMW 2009. IEEE International Conference on
Conference_Location :
Washington, DC
Print_ISBN :
978-1-4244-5121-0
Type :
conf
DOI :
10.1109/BIBMW.2009.5332094
Filename :
5332094