• DocumentCode
    3602154
  • Title

    CoGI: Towards Compressing Genomes as an Image

  • Author

    Xiaojing Xie ; Shuigeng Zhou ; Jihong Guan

  • Author_Institution
    Shanghai Key Lab. of Intell. Inf. Process., Fudan Univ., Shanghai, China
  • Volume
    12
  • Issue
    6
  • fYear
    2015
  • Firstpage
    1275
  • Lastpage
    1285
  • Abstract
    Genomic science is now facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer cost, and thus to boost data distribution and utilization efficiency. Up to now, a number of algorithms and tools have been developed for compressing genomic sequences. Unlike the existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (the abbreviation of Compressing Genomes as an Image) for genome compression, which transforms the genomic sequences into a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based compressor or a reference-free compressor. For the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves a comparable compression ratio but two orders of magnitude higher compression efficiency than XM, a state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. So, CoGI can serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://admis.fudan.edu.cn/projects/cogi.htm.
  • Keywords
    DNA; biological techniques; genomics; molecular biophysics; probability; reviews; CoGI; RLZ-opt; compressing genomic sequences; data distribution; entropy-based algorithms; genomic data storage; genomic data transferring; high compression efficiency; one-dimensional text strings; performance evaluation; probability models; rectangular partition coding algorithm; sequencing technology; state-of-the-art reference-based genome compressors GReEn; state-of-the-art reference-free genome compressor; two-dimensional binary imaging; Bioinformatics; Computational biology; Encoding; Entropy; Genomics; Image coding; Partitioning algorithms; genomes compression; reference-based compression; sequence matrixization; rectangular partition coding; entropy coding
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2015.2430331
  • Filename
    7102721
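
The abstract describes transforming a genomic sequence into a two-dimensional binary image before compressing it. The record does not give the encoding itself, so the following is only a minimal sketch of the general idea, under stated assumptions: the 2-bit nucleotide mapping, the fixed row width, and the XOR against a reference bitmap are all hypothetical choices made for illustration, and the function names `sequence_to_bitmap` and `xor_bitmaps` are not taken from CoGI. The rectangular partition coding step is not shown.

```python
# Hypothetical 2-bit code per nucleotide; CoGI's actual mapping may differ.
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def sequence_to_bitmap(seq, width):
    """Pack a DNA string into rows of `width` bits (last row zero-padded)."""
    bits = [b for base in seq.upper() for b in CODE[base]]
    bits += [0] * (-len(bits) % width)  # pad so the final row is full
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def xor_bitmaps(target, reference):
    """Difference bitmap for two equally shaped bitmaps.

    A cell is 1 only where the target differs from the reference, so a
    closely related reference leaves a mostly-zero image (an assumed way
    of using a reference, not necessarily the paper's).
    """
    return [[t ^ r for t, r in zip(t_row, r_row)]
            for t_row, r_row in zip(target, reference)]


if __name__ == "__main__":
    target = sequence_to_bitmap("ACGT", 4)
    reference = sequence_to_bitmap("ACGA", 4)
    for row in xor_bitmaps(target, reference):
        print("".join(map(str, row)))
```

In this sketch, a sparse difference image with large uniform regions is the kind of input that a region-based coder of a binary image could be expected to handle well.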