Title :
Exploring deep Markov models in genomic data compression using sequence pre-analysis
Author :
Pratas, Diogo ; Pinho, Armando J.
Author_Institution :
DETI/IEETA, Signal Process. Lab., Univ. of Aveiro, Aveiro, Portugal
Abstract :
The pressure to find efficient genomic compression algorithms is being felt worldwide, as evidenced by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy enables us to use deeper context models, supported by hash-tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case for genomic sequences. These deeper context models show very high compression capabilities on highly repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be applied to any type of textual data (such as quality scores).
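To make the idea of a hash-table-backed deep context model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a four-symbol alphabet coded in 2 bits per symbol, so a context of depth 32 packs into a single 64-bit key, and it stores counts only for contexts that actually occur. The class name `HashedContextModel`, the estimator parameter `alpha`, and the 2-bit cost charged to the first symbols are illustrative assumptions; the pre-analysis step described in the abstract is not modelled here.

```python
from math import log2

SYMBOLS = "ACGT"
CODE = {s: i for i, s in enumerate(SYMBOLS)}  # 2 bits per symbol


class HashedContextModel:
    """Order-k finite-context model with counts kept in a hash table (hypothetical sketch)."""

    def __init__(self, order, alpha=1.0 / 16):
        if not 1 <= order <= 32:
            raise ValueError("order must be in 1..32 so the context fits in 64 bits")
        self.order = order
        self.alpha = alpha                      # assumed smoothing parameter
        self.mask = (1 << (2 * order)) - 1      # keeps only the last `order` symbols
        self.counts = {}                        # context key -> [nA, nC, nG, nT]

    def prob(self, ctx, sym):
        """Smoothed probability of `sym` after context `ctx`."""
        c = self.counts.get(ctx)
        if c is None:                           # unseen context: uniform estimate
            return 0.25
        return (c[sym] + self.alpha) / (sum(c) + 4 * self.alpha)

    def update(self, ctx, sym):
        self.counts.setdefault(ctx, [0, 0, 0, 0])[sym] += 1

    def estimate_bits(self, seq):
        """Ideal code length (in bits) of `seq` under adaptive modelling."""
        ctx, bits = 0, 0.0
        for i, ch in enumerate(seq):
            s = CODE[ch]
            if i >= self.order:
                bits += -log2(self.prob(ctx, s))
                self.update(ctx, s)
            else:
                bits += 2.0                     # assumed flat cost until a full context exists
            ctx = ((ctx << 2) | s) & self.mask  # slide the context window
        return bits


repetitive = "ACGT" * 200
model = HashedContextModel(order=12)
bits = model.estimate_bits(repetitive)
print(f"{bits / len(repetitive):.3f} bits per base, {len(model.counts)} contexts stored")
```

Because only observed contexts are stored, memory grows with the number of distinct contexts in the data rather than with the 4^k possible contexts, which is what makes large depths practical on repetitive input.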
Keywords :
Markov processes; biology computing; data analysis; data compression; genomics; data sequence pre-analysis; deep Markov models; genomic data compression algorithm; hash-tables; low complexity regions; repetitive genomic sequences; textual data; Bioinformatics; Context; Context modeling; DNA; Data models; Genomic data compression; finite-context models
Conference_Title :
Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European
Conference_Location :
Lisbon