Title :
An Adaptive Difference Distribution-Based Coding with Hierarchical Tree Structure for DNA Sequence Compression
Author :
Wenrui Dai ; Hongkai Xiong ; Xiaoqian Jiang ; Ohno-Machado, L.
Author_Institution :
Dept. of Electron. Eng., Shanghai Jiaotong Univ., Shanghai, China
Abstract :
Previous reference-based compression methods for DNA sequences do not fully exploit the intrinsic statistics, because they consider only approximate matches. In this paper, an adaptive difference distribution-based coding framework is proposed that operates on fragments of nucleotides organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction adapt to the stepped size structure. Matching against approximate repeats in the reference is performed with a Hamming-like weighted distance measure function in a local region close to the current fragment, so that the accuracy of matching and the overhead of describing the matching offset can be balanced. A well-designed coding scheme compacts both the difference sequence and the additional parameters, e.g. sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement over the best reference-based compressor, GReEn.
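The local matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the distance weights, the search radius, or the tie-breaking rule, so the function names, the uniform default weights, and the "prefer the smaller offset" rule below are all assumptions made for illustration.

```python
def weighted_hamming(a, b, weights=None):
    """Hamming-like distance: sum of per-position weights where a and b differ.

    The weighting scheme is an assumption; uniform weights give the plain
    Hamming distance.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def best_local_match(reference, fragment, position, radius):
    """Search offsets in [-radius, radius] around `position` in the reference.

    Returns (offset, distance). Ties on distance go to the smaller |offset|,
    a hypothetical stand-in for "cheaper to describe".
    """
    n = len(fragment)
    best = None
    for offset in range(-radius, radius + 1):
        start = position + offset
        if start < 0 or start + n > len(reference):
            continue  # candidate window falls outside the reference
        d = weighted_hamming(reference[start:start + n], fragment)
        key = (d, abs(offset))
        if best is None or key < best[0]:
            best = (key, offset)
    if best is None:
        raise ValueError("no candidate window fits inside the reference")
    return best[1], best[0][0]


def difference_sequence(reference, fragment, start):
    """Per-position mismatch flags (0 = match, 1 = mismatch) for a chosen match."""
    window = reference[start:start + len(fragment)]
    return [0 if r == t else 1 for r, t in zip(window, fragment)]
```

A small radius keeps the offset cheap to encode at the risk of missing a better match farther away, which is the trade-off the abstract refers to. A good match yields a difference sequence dominated by zeros, which is what a concentrated distribution means in this sketch.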
Keywords :
DNA; Hamming codes; adaptive codes; biology computing; data compression; molecular biophysics; pattern matching; tree data structures; DNA sequence compression; Hamming-like weighted distance measure function; adaptive difference distribution-based coding; approximate matching offset; hierarchical tree structure; nucleotides; stepped size structure; sub-fragment size offset; Bioinformatics; Biological cells; DNA; Distance measurement; Encoding; Genomics; Sequential analysis;
Conference_Title :
Data Compression Conference (DCC), 2013
Conference_Location :
Snowbird, UT
Print_ISBN :
978-1-4673-6037-1
DOI :
10.1109/DCC.2013.45