DocumentCode :
3602154
Title :
CoGI: Towards Compressing Genomes as an Image
Author :
Xiaojing Xie ; Shuigeng Zhou ; Jihong Guan
Author_Institution :
Shanghai Key Lab. of Intell. Inf. Process., Fudan Univ., Shanghai, China
Volume :
12
Issue :
6
fYear :
2015
Firstpage :
1275
Lastpage :
1285
Abstract :
Genomic science is now facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer costs, and thus to boost the efficiency of data distribution and utilization. To date, a number of algorithms and tools have been developed for compressing genomic sequences. Unlike the existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (short for Compressing Genomes as an Image) for genome compression, which transforms the genomic sequences into a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based compressor or a reference-free compressor. For the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves a comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM, a state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. CoGI can therefore serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://admis.fudan.edu.cn/projects/cogi.htm.
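Illustrative sketch (not taken from the paper): the abstract describes turning a genomic sequence into a two-dimensional binary image but does not give the mapping. The Python below shows one plausible way such a transform could look. The 2-bit base encoding, the fixed row width, and the helper names `sequence_to_bitmap` and `xor_bitmaps` are all assumptions made for illustration. The XOR difference image is likewise only a guess at how a reference might be used, and the rectangular partition coding step is not shown.

```python
# Hypothetical 2-bit encoding; the abstract does not specify CoGI's actual mapping.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def sequence_to_bitmap(seq, width):
    """Map a DNA string to rows of 0/1 pixels, `width` pixels per row.

    The last row is zero-padded so that every row has the same length.
    """
    bits = [bit for base in seq.upper() for bit in BASE_BITS[base]]
    bits += [0] * (-len(bits) % width)  # pad to a multiple of the row width
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def xor_bitmaps(target, reference):
    """Difference image: a pixel is 1 where target and reference disagree.

    Assumes both bitmaps have the same shape (an assumption of this sketch).
    """
    return [
        [t ^ r for t, r in zip(target_row, reference_row)]
        for target_row, reference_row in zip(target, reference)
    ]


if __name__ == "__main__":
    target = sequence_to_bitmap("ACGT", width=4)
    reference = sequence_to_bitmap("ACGA", width=4)
    for row in xor_bitmaps(target, reference):
        print("".join(str(pixel) for pixel in row))
```

In this sketch, a target that closely matches its reference yields a mostly zero difference image, which is the kind of sparse bitmap a region-based coder could represent compactly.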
Keywords :
DNA; biological techniques; genomics; molecular biophysics; probability; reviews; CoGI; RLZ-opt; compressing genomic sequences; data distribution; entropy-based algorithms; genomic data storage; genomic data transferring; high compression efficiency; one-dimensional text strings; performance evaluation; probability models; rectangular partition coding algorithm; sequencing technology; state-of-the-art reference-based genome compressors GReEn; state-of-the-art reference-free genome compressor; two-dimensional binary imaging; Bioinformatics; Computational biology; Encoding; Entropy; Genomics; Image coding; Partitioning algorithms; Genomics, genomes compression, reference-based compression, sequence matrixization, rectangular partition coding, entropy coding;
fLanguage :
English
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
1545-5963
Type :
jour
DOI :
10.1109/TCBB.2015.2430331
Filename :
7102721