Author :
Kuo-ching Liang ; Xiaodong Wang ; Anastassiou, D.
Author_Institution :
Dept. of Electr. Eng., Columbia Univ., New York, NY, USA
Abstract :
It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
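As a rough illustration of the Viterbi step mentioned in the abstract, the sketch below decodes a toy discrete-emission HMM in Python. It is not the paper's model. The hidden states are the four bases, each observation is assumed to be the label of the dominant dye channel at one peak, and the start, transition, and emission probabilities are made-up values chosen only for demonstration. The names `viterbi`, `log_start`, `log_trans`, and `log_emit` are hypothetical, and the function assumes a non-empty observation sequence.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a discrete-emission HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions, not estimated from any data).
states = ["A", "C", "G", "T"]
symbols = ["a", "c", "g", "t"]  # hypothetical dominant-channel labels
log_start = {s: math.log(0.25) for s in states}
log_trans = {p: {s: math.log(0.25) for s in states} for p in states}
log_emit = {
    s: {o: math.log(0.85 if o == s.lower() else 0.05) for o in symbols}
    for s in states
}

called = "".join(viterbi(list("gattaca"), states, log_start, log_trans, log_emit))
print(called)
```

With uniform transitions the decoder simply follows the strongest channel at each position. In a trained model, the transition and emission parameters would carry the information that the abstract describes estimating, whether by a separate training step or by the proposed MCMC approach.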
Keywords :
Bayes methods; DNA; Monte Carlo methods; Viterbi decoding; biological techniques; biology computing; fluorescence; genetics; hidden Markov models; microorganisms; molecular biophysics; Bayesian approach; Bayesian basecalling algorithm; DNA sequence analysis; Legionella pneumophila; Markov chain Monte Carlo method; Viterbi algorithm; electropherograms; hidden Markov models; prior biological knowledge encoding; sequenced genome; Artificial Intelligence; Base Pairing; Base Sequence; Bayes Theorem; DNA; Electrophoresis; Markov Chains; Molecular Sequence Data; Pattern Recognition, Automated; Sequence Analysis, DNA
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/TCBB.2007.1027