Title :
DNA sequence compression using the normalized maximum likelihood model for discrete regression
Author :
Tabus, Ioan ; Korodi, Gergely ; Rissanen, Jorma
Author_Institution :
Inst. of Signal Process., Tampere Univ. of Technol., Finland
Abstract :
The use of the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions was discussed. A particular version of the NML model was presented for discrete regression, which was shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model, a fast lossless compression method was obtained that compares favorably with existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state-of-the-art compression ratios obtained for DNA sequences with much more complex models. Being a minimum description length (MDL) model, the NML model may later prove useful in studying global and local features of DNA or possibly of other biological sequences.
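As an illustration of the "simple first-order Markov model" component only, the following minimal sketch computes the ideal adaptive code length, in bits, of a DNA string. It is not the authors' implementation and does not include the NML repeat model. The function name `markov1_code_length_bits`, the add-one initialisation of the counts, and the 2-bit cost of the first base are assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def markov1_code_length_bits(seq: str) -> float:
    """Ideal code length (bits) of seq under an adaptive first-order Markov model.

    Each base is predicted from the previous base using running counts,
    which are updated after every symbol so a decoder could mirror them.
    Counts start at 1 (add-one smoothing, an assumption of this sketch).
    """
    if not seq:
        return 0.0

    # counts[prev][cur] = 1 + number of times cur has followed prev so far
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    # The first base has no context; charge a uniform 2 bits for it.
    total_bits = 2.0

    for prev, cur in zip(seq, seq[1:]):
        ctx = counts[prev]
        p = ctx[cur] / sum(ctx.values())  # predictive probability of cur
        total_bits += -math.log2(p)       # ideal arithmetic-coding cost
        ctx[cur] += 1                     # update the model after coding
    return total_bits
```

Dividing the returned value by `len(seq)` gives bits per base, which can be compared with the 2 bits per base of a plain fixed-length encoding of the four nucleotides.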
Keywords :
DNA; Markov processes; biology computing; data compression; encoding; maximum likelihood estimation; sequences; DNA compression programs; DNA sequence compression; MDL model; Markov model; NML model; approximate repetitions; biological sequences; compression ratio; deoxyribonucleic acids; discrete regression; fast lossless compression method; global DNA feature; local DNA features; minimum description length; normalized maximum likelihood; parameter updating; Biological information theory; Biological system modeling; Biomedical signal processing; DNA; Data compression; Dictionaries; Encoding; Entropy; History; Sequences;
Conference_Title :
Data Compression Conference, 2003. Proceedings. DCC 2003
Print_ISBN :
0-7695-1896-6
DOI :
10.1109/DCC.2003.1194016