DocumentCode :
2890020
Title :
No-Reference Compression of Genomic Data Stored in FASTQ Format
Author :
Bhola, Vishal ; Bopardikar, Ajit S. ; Narayanan, Rangavittal ; Lee, Kyusang ; Ahn, TaeJin
Author_Institution :
Samsung India Software Oper., SAIT-India, Bangalore, India
fYear :
2011
fDate :
12-15 Nov. 2011
Firstpage :
147
Lastpage :
150
Abstract :
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics, which are then used to choose the representation for each field that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the reads or only the reads and quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
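As a rough illustration of two steps named in the abstract (splitting a FASTQ file into its component fields, and run-length encoding a quality string), a minimal Python sketch follows. It is not the authors' implementation: the function names `parse_fastq`, `rle_encode`, and `rle_decode` are hypothetical, the parser assumes exactly four lines per record, and the plain (symbol, count) run-length scheme is only the simplest member of the family of run-length methods the abstract mentions.

```python
from itertools import groupby


def parse_fastq(text):
    """Split FASTQ text into parallel lists: headers, reads, quality strings.

    Assumption: every record occupies exactly four lines
    (@header, read, +separator, quality), with no wrapped reads.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    headers, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, read, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(read) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(read)
        quals.append(qual)
    return headers, reads, quals


def rle_encode(s):
    """Collapse consecutive repeated symbols into (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def rle_decode(pairs):
    """Invert rle_encode, restoring the original string."""
    return "".join(symbol * count for symbol, count in pairs)
```

Storing each field in its own stream, as `parse_fastq` does, is what allows a different representation to be chosen per field; the run-length pair list would still need an entropy coder or bit-packing stage before it saves any space.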
Keywords :
DNA; Markov processes; bioinformatics; data compression; entropy codes; genomics; runlength codes; statistical analysis; text analysis; DNA read compression; FASTQ format; Markov encoding; entropy coding; field representation; genomic data stored; nearly lossless compression; next generation sequencing information compression; no-reference compression; quality data encoding; repeat finding based method; run length based method; statistics; text data parsing; Bioinformatics; Dictionaries; Encoding; Genomics; Next generation networking; FASTQ; Genomic Data Compression; Next generation sequencing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4577-1799-4
Type :
conf
DOI :
10.1109/BIBM.2011.110
Filename :
6120426