Title :
Compression-based normal similarity measures for DNA sequences
Author :
Ferreira, P.J.S.G. ; Pinho, Armando J.
Author_Institution :
Dept. Electron., IEETA, Univ. de Aveiro, Aveiro, Portugal
Abstract :
Compression-based similarity measures assess the distance between two objects by the number of bits needed to describe one, given a description of the other. In theory, compression-based similarity rests on the concept of Kolmogorov complexity, which is non-computable. Practical implementations therefore require compression algorithms that are approximately normal. The approach has important advantages (for example, there are no signal features to identify and extract), but the compression method must be normal. This paper proposes normal algorithms based on mixtures of finite context models. Normality is attained by combining two new ideas: least-recently-used caching in the context models, to allow deeper contexts, and data interleaving, to make better use of that cache. Examples for DNA sequences are given at the human genome scale.
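As a rough illustration of the kind of measure the abstract describes, the sketch below computes the normalized compression distance listed in the keywords, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is not the paper's method: zlib stands in for the compressor (an assumption made only for illustration, in place of the proposed mixtures of finite context models), and the names compressed_size and ncd, as well as the random test sequences, are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # cost of describing both together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Two unrelated synthetic "DNA" sequences (hypothetical test data).
rng = random.Random(0)
a = bytes(rng.choice(b"ACGT") for _ in range(2000))
b = bytes(rng.choice(b"ACGT") for _ in range(2000))

# For a normal compressor C(xx) is close to C(x), so ncd(a, a) should be
# small, while unrelated sequences should give a value near 1.
print(f"ncd(a, a) = {ncd(a, a):.3f}")
print(f"ncd(a, b) = {ncd(a, b):.3f}")
```

The idempotency check ncd(a, a) is one simple way to see how far a given compressor is from normal on a particular input. zlib's DEFLATE uses a 32 KiB window, so with this stand-in the check only behaves well for inputs much shorter than the genome-scale sequences the abstract mentions.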
Keywords :
DNA; biology computing; cache storage; data compression; sequences; DNA sequences; compression-based normal similarity measures; data interleaving; finite context models; least-recently-used caching; normal algorithms; Bioinformatics; Complexity theory; Context; Context modeling; DNA; Genomics; Image coding; DNA sequences; LRU cache; Normalized compression distance; finite context models; interleaving;
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
Conference_Location :
Florence
DOI :
10.1109/ICASSP.2014.6853630