Author/Authors :
Mauri, Theo Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France , Bouaouiche, Laurence Menu Normandy University - UNIROUEN, Laboratoire Glyco-MEV EA4358, Rouen, France , Bardor, Muriel Normandy University - UNIROUEN, Laboratoire Glyco-MEV EA4358, Rouen, France , Lefebvre, Tony Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France , Lensink, Marc F Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France , Brysbaert, Guillaume Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France
Abstract :
Background: O-GlcNAcylation is an essential post-translational modification (PTM) in
mammalian cells. It consists of the addition of an N-acetylglucosamine (GlcNAc) residue
onto serines or threonines by an O-GlcNAc transferase (OGT). Inhibition of OGT is lethal,
and misregulation of this PTM can lead to diverse pathologies including diabetes,
Alzheimer’s disease and cancers. Knowing the location of O-GlcNAcylation sites and being
able to predict them accurately are therefore of prime importance for a better understanding
of this process and its related pathologies.
Purpose: Here, we present an evaluation of the current predictors of O-GlcNAcylation sites
based on a newly built dataset and an investigation to improve predictions.
Methods: Several datasets of experimentally proven O-GlcNAcylated sites were combined,
and the resulting meta-dataset was used to evaluate three prediction tools. We then defined
a set of new features from an analysis of the primary to tertiary structures of experimentally
proven O-GlcNAcylated sites, and used them with several types of machine learning techniques
in an attempt to improve predictions.
Results: Our results show that currently available algorithms fail to predict
O-GlcNAcylated sites with a precision exceeding 9%. Our efforts to improve the precision
with new features using machine learning techniques succeed for equal proportions of
O-GlcNAcylated and non-O-GlcNAcylated sites but fail, like the other tools, for real-life
proportions, where ~1.4% of serines/threonines (S/T) are O-GlcNAcylated.
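The gap between balanced and real-life proportions can be illustrated with a short calculation. The sketch below is not the authors' evaluation code; the function name and the sensitivity and specificity values (0.8 each) are hypothetical and chosen only for illustration. It shows how the precision of a fixed classifier falls as the proportion of true sites drops from 50% to the ~1.4% reported above.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Expected precision of a classifier at a given proportion of true sites.

    All three arguments are fractions between 0 and 1.
    """
    true_pos = sensitivity * prevalence                   # modified sites correctly flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # unmodified sites wrongly flagged
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 80% sensitivity and 80% specificity (assumed values).
SENS, SPEC = 0.8, 0.8

balanced = precision(SENS, SPEC, 0.5)     # equal proportions of both classes
real_life = precision(SENS, SPEC, 0.014)  # ~1.4% of S/T modified, as stated above

# A random predictor (sensitivity equal to the false-positive rate) has a
# precision equal to the prevalence itself.
random_baseline = precision(0.5, 0.5, 0.014)

print(f"balanced:  {balanced:.3f}")
print(f"real-life: {real_life:.3f}")
print(f"random:    {random_baseline:.3f}")
```

Under these assumed values, the same classifier that looks accurate on a balanced set yields mostly false positives at real-life proportions, because unmodified S/T outnumber modified ones by roughly 70 to 1.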
Conclusion: Present-day algorithms for O-GlcNAcylation prediction narrowly outperform
random prediction. The inclusion of additional features, in combination with machine
learning algorithms, does not enhance these predictions, emphasizing a pressing need for
further development. We hypothesize that the improvement of prediction algorithms requires
characterization of OGT’s partners.
Keywords :
machine learning , glycosylation , O-GlcNAc , post-translational modification , dataset , OGT