Title :
Sorting next generation sequencing data improves compression effectiveness
Author :
Wan, Raymond ; Asai, Kiyoshi
Author_Institution :
Dept. of Comput. Biol., Univ. of Tokyo, Kashiwa, Japan
Abstract :
With the increasing use of next generation sequencing, the problem of effectively storing and transmitting such massive amounts of data needs to be addressed. Repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression system (GZIP) for data archiving. In this work, we investigate how GZIP (and BZIP2) can be made more effective for read archiving by pre-sorting the reads. Pre-sorting improves compression effectiveness, yielding a reduction of up to 12% when only the sequences are considered and of up to 6% when the original FASTQ data is considered.
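The pre-sorting idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain lexicographic sorting of the read sequences, uses Python's standard gzip and bz2 modules in place of the GZIP and BZIP2 command-line tools, and runs on synthetic reads. The names compressed_sizes and make_toy_reads are hypothetical helpers introduced only for this illustration.

```python
import bz2
import gzip
import random


def compressed_sizes(reads):
    """Return compressed byte counts for the reads in original and sorted order."""
    # One read per line; sorting is plain lexicographic order (an assumption).
    original = "\n".join(reads).encode("ascii")
    presorted = "\n".join(sorted(reads)).encode("ascii")
    return {
        "gzip_original": len(gzip.compress(original)),
        "gzip_sorted": len(gzip.compress(presorted)),
        "bzip2_original": len(bz2.compress(original)),
        "bzip2_sorted": len(bz2.compress(presorted)),
    }


def make_toy_reads(count=2000, length=36, seed=0):
    """Draw fixed-length reads from a short random 'genome' (synthetic data)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(500))
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(count)]
    return [genome[s:s + length] for s in starts]


if __name__ == "__main__":
    # Print the four sizes so the sorted and unsorted cases can be compared.
    for name, size in compressed_sizes(make_toy_reads()).items():
        print(f"{name}: {size} bytes")
```

The sketch compresses sequences only, and the sizes it prints for synthetic reads say nothing about the percentages reported in the abstract. Sorting a complete FASTQ file would presumably require moving each record (identifier, sequence, and quality line) as a unit, which this example does not attempt.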
Keywords :
bioinformatics; data compression; sequences; sorting; BZIP2; FASTQ format; GZIP; Sequence Read Archive; compression effectiveness; data archiving; general-purpose compression systems; next generation sequencing data sorting; data pre-processing; next generation sequencing
Conference_Title :
Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-8303-7
Electronic_ISBN :
978-1-4244-8304-4
DOI :
10.1109/BIBMW.2010.5703863