DocumentCode :
2468238
Title :
The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes from Prokaryote Genomes
Author :
Vidyasagar, M.
Author_Institution :
Software Units Layout, Tata Consultancy Services, Hyderabad
fYear :
2006
fDate :
13-15 Dec. 2006
Firstpage :
1
Lastpage :
6
Abstract :
In this paper, we present a new algorithm called 4M (Mixed Memory Markov Model) for finding genes in a prokaryote genome. Strictly speaking, the algorithm can be used in any problem of classifying strings over a finite alphabet into one of two distinct families. However, the specific application here is gene finding, and the finite alphabet is the four-symbol nucleotide alphabet {A,C,G,T}. The algorithm is based on modelling the known coding regions of a genome as a set of sample paths of one stochastic process, and the known non-coding regions of the same genome as a set of sample paths of another stochastic process. An ORF (open reading frame) is then classified as either a coding region or a non-coding region based on likelihood estimation. Initially, each stochastic process is modelled as a fifth-order Markov process. A further reduction in the size of the state space is then realized by observing that different strings can have 'memories' of different lengths (which is the rationale for the name). The 4M algorithm is applied to about 70 genomes from both bacteria and archaea, and its performance is compared to that of Glimmer-2, one of the most widely used algorithms. The 4M algorithm consistently matches or exceeds the performance of Glimmer-2 in the test cases. The size of the state space used by the 4M algorithm is a few hundred states, compared with 16,384 for Glimmer-2. Moreover, since the 4M algorithm is based on standard methods in statistical analysis, the significance of the various tests performed can be estimated precisely.
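The likelihood-based classification step described in the abstract can be illustrated with a minimal Python sketch. It is not the 4M algorithm itself: it fits a plain fixed-order Markov model to each family of strings and labels an ORF by the sign of the log-likelihood ratio, and it omits the mixed-memory state-space reduction entirely. The function names, the add-one pseudocount, and the zero decision threshold are all assumptions made for illustration and do not come from the paper.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=5, pseudocount=1.0):
    """Fit a fixed-order Markov model; return a log-probability function.

    Hypothetical helper: the pseudocount smoothing is an assumption,
    not something stated in the abstract.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            # Count how often each symbol follows each length-`order` context.
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(symbol, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob, order=5):
    """Sum the conditional log-probabilities along the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i])
               for i in range(order, len(seq)))


def classify_orf(orf, coding_lp, noncoding_lp, order=5):
    """Label an ORF by the log-likelihood ratio of the two models.

    The threshold of zero is an illustrative assumption.
    """
    llr = (log_likelihood(orf, coding_lp, order)
           - log_likelihood(orf, noncoding_lp, order))
    return ("coding" if llr > 0 else "non-coding"), llr
```

In use, `train_markov` would be called once on the known coding regions and once on the known non-coding regions of the same genome, and `classify_orf` would then be applied to each candidate ORF with the same `order` used in training.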
Keywords :
Markov processes; genetics; maximum likelihood estimation; pattern classification; Glimmer-2; archaea; bacterial genomes; fifth-order Markov process; gene finding; likelihood estimation; mixed memory Markov model algorithm; nucleotide alphabet; open reading frame; prokaryote genomes; state space; statistical analysis; stochastic process; Archaea; Bioinformatics; Genomics; Markov processes; Microorganisms; Performance evaluation; State-space methods; Statistical analysis; Stochastic processes; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Decision and Control, 2006 45th IEEE Conference on
Conference_Location :
San Diego, CA
Print_ISBN :
1-4244-0171-2
Type :
conf
DOI :
10.1109/CDC.2006.377780
Filename :
4177249