Title of article :
Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
Author/Authors :
Olivieri, David N Department of Computer Science - University of Vigo - Ourense, Spain , Gambon-Deza, Francisco Department of Immunology - Hospital of Meixoeiro - Vigo, Spain
Abstract :
In jawed vertebrates, variable (V) genes code for antigen-binding regions of B and T lymphocyte receptors, which generate a
specific response to foreign pathogens. Obtaining the detailed repertoire of these genes across the jawed vertebrate kingdom would
help to understand their evolution and function. However, annotations of V-genes are known for only a few model species since
their extraction is not amenable to standard gene finding algorithms. Also, the more evolutionarily distant a taxon is from such
model species, the less homology there is between their V-gene sequences. Here, we present an iterative supervised machine
learning algorithm that begins by training on a small set of known and verified V-gene sequences. The algorithm successively
discovers homologous unaligned V-exons from a larger set of whole genome shotgun (WGS) datasets from many taxa. Upon each
iteration, newly uncovered V-genes are added to the training set for the next predictions. This iterative learning/discovery process
terminates when the number of new sequences discovered is negligible. This process is akin to “online” or reinforcement learning
and has proven useful for discovering homologous V-genes in taxa successively more distant from the original set. Results
are demonstrated for 14 primate WGS datasets and validated against Ensembl annotations. This algorithm is implemented in the
Python programming language and is freely available at http://vgenerepertoire.org.
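
The iterative learning/discovery loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the classifier or its features, so a k-mer Jaccard similarity against the current training set stands in for the supervised model here. The function names (`kmer_set`, `jaccard`, `iterative_discovery`) and the parameters (`threshold`, `min_new`, `k`) are hypothetical and chosen only for illustration.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def iterative_discovery(seeds, candidates, threshold=0.5, min_new=1, k=3):
    """Grow a training set from verified seeds, one round at a time.

    ASSUMPTION: similarity to any training sequence stands in for the
    prediction of a trained classifier. Returns the sequences accepted
    in each round, in round order.
    """
    training = [kmer_set(s, k) for s in seeds]
    remaining = list(candidates)
    rounds = []
    while remaining:
        # "Predict": accept candidates close enough to the training set.
        accepted = [
            c for c in remaining
            if max((jaccard(kmer_set(c, k), t) for t in training),
                   default=0.0) >= threshold
        ]
        # Terminate when the number of new sequences is negligible.
        if len(accepted) < min_new:
            break
        rounds.append(accepted)
        # "Retrain": newly found sequences join the training set.
        training.extend(kmer_set(c, k) for c in accepted)
        remaining = [c for c in remaining if c not in accepted]
    return rounds


# Toy data: the second candidate resembles the first candidate, not the seed,
# so it can only be found after the first round has enlarged the training set.
seeds = ["AAAACCCC"]
candidates = ["AACCCCGG", "CCCCGGGG", "TTTTTTTT"]
result = iterative_discovery(seeds, candidates)
print(result)
```

In this toy run the second candidate is reached only through the first, which mirrors how enlarging the training set at each iteration lets the search extend to sequences more distant from the original seeds.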
Keywords :
Gene , Bootstrapped , Multiresolution , WGS
Journal title :
Computational and Mathematical Methods in Medicine