Exploring deep Markov models in genomic data compression using sequence pre-analysis

Author

Pratas, Diogo ; Pinho, Armando J.

Author_Institution

DETI/IEETA, Signal Process. Lab., Univ. of Aveiro, Aveiro, Portugal

fYear

2014

fDate

1-5 Sept. 2014

Firstpage

2395

Lastpage

2399

Abstract

The pressure to find efficient genomic compression algorithms is being felt worldwide, as shown by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy enables us to use deeper context models, supported by hash tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case for genomic sequences. These deeper context models compress highly repetitive genomic sequences very effectively, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be applied to any type of textual data (such as quality scores).
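The idea of a deep context model backed by a hash table can be illustrated with a minimal sketch. It is not the authors' implementation and it does not reproduce the pre-analysis step; the function name `estimate_bits`, the estimator parameter `alpha`, and the 2-bit warm-up cost are assumptions made for illustration. The point it shows is that a directly indexed table for depth 32 over four symbols would need 4^32 = 2^64 entries, whereas a hash table stores only the contexts that actually occur, and a depth-32 context packs into a 64-bit integer key at 2 bits per symbol.

```python
import math
from collections import defaultdict

# Assumed 2-bit encoding of the four DNA symbols.
SYMBOL_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def estimate_bits(sequence, depth, alpha=1.0 / 16):
    """Estimate the code length, in bits, of `sequence` under an adaptive
    order-`depth` finite-context model whose counts live in a hash table."""
    mask = (1 << (2 * depth)) - 1  # keeps only the last `depth` symbols
    counts = defaultdict(lambda: [0, 0, 0, 0])  # one row per context seen
    context = 0
    total_bits = 0.0
    for position, base in enumerate(sequence):
        symbol = SYMBOL_CODE[base]
        if position >= depth:
            row = counts[context]
            # Additive-smoothing estimate of P(symbol | context).
            probability = (row[symbol] + alpha) / (sum(row) + 4 * alpha)
            total_bits -= math.log2(probability)
            row[symbol] += 1
        else:
            # Warm-up (assumption): no full context yet, charge 2 bits.
            total_bits += 2.0
        # Slide the context window: append the new symbol, drop the oldest.
        context = ((context << 2) | symbol) & mask
    return total_bits


# Hypothetical test input: a highly repetitive sequence.
repetitive = "ACGTTGCA" * 200
print(estimate_bits(repetitive, depth=32) / len(repetitive), "bits per base")
```

On a repetitive input like this one, only a handful of distinct depth-32 contexts are ever inserted, so the table stays small while the per-base cost falls well below the 2 bits of a uniform code. The sketch accepts only the symbols A, C, G and T; any other character raises a `KeyError`.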

Keywords

Markov processes; biology computing; data analysis; data compression; genomics; data sequence pre-analysis; deep Markov models; genomic data compression algorithm; hash-tables; low complexity regions; repetitive genomic sequences; textual data; Bioinformatics; Context; Context modeling; DNA; Data compression; Data models; Genomics; Genomic data compression; finite-context models

fLanguage

English

Publisher

IEEE

Conference_Titel

2014 Proceedings of the 22nd European Signal Processing Conference (EUSIPCO)

Conference_Location

Lisbon

Type

conf

Filename

6952879