DocumentCode :
2379423
Title :
Sorting next generation sequencing data improves compression effectiveness
Author :
Wan, Raymond ; Asai, Kiyoshi
Author_Institution :
Dept. of Comput. Biol., Univ. of Tokyo, Kashiwa, Japan
fYear :
2010
fDate :
18 Dec. 2010
Firstpage :
567
Lastpage :
572
Abstract :
With the increased use of next generation sequencing, the problem of effectively storing and transmitting such massive amounts of data will need to be addressed. Repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression system (GZIP) for data archiving. In this work, we investigate how GZIP (and BZIP2) can be made more effective for read archiving by pre-sorting the reads. The improvement in compression effectiveness is a reduction of up to 12% for the sequences alone and of up to 6% when the original FASTQ data is considered.
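The pre-sorting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state the sort key, so plain lexicographic order is assumed here, and the reads are synthetic (the helper `make_toy_reads` and all its parameters are hypothetical). The sketch compresses the same set of reads with GZIP and BZIP2 before and after sorting and prints the resulting sizes; the numbers it produces say nothing about the figures reported in the paper.

```python
import bz2
import gzip
import random


def compressed_sizes(reads):
    """Return (gzip_bytes, bzip2_bytes) for the reads written one per line."""
    payload = ("\n".join(reads) + "\n").encode("ascii")
    return len(gzip.compress(payload)), len(bz2.compress(payload))


def make_toy_reads(n_reads=2000, read_len=36, genome_len=5000, seed=0):
    """Hypothetical test data: fixed-length substrings of a random 'genome'."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


reads = make_toy_reads()

# Assumption: lexicographic sort of the sequence strings.
sorted_reads = sorted(reads)

unsorted_gz, unsorted_bz = compressed_sizes(reads)
sorted_gz, sorted_bz = compressed_sizes(sorted_reads)

print(f"gzip : {unsorted_gz} -> {sorted_gz} bytes")
print(f"bzip2: {unsorted_bz} -> {sorted_bz} bytes")
```

Sorting discards the original read order, so a real archive would need to decide whether that order must be recoverable. Applying the same comparison to full FASTQ records would also require keeping each read's identifier and quality string attached to its sequence while sorting.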
Keywords :
bioinformatics; data compression; sequences; sorting; BZIP2; FASTQ format; GZIP; Sequence Read Archive; compression effectiveness; data archiving; general-purpose compression systems; next generation sequencing data sorting; data pre-processing; next generation sequencing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-8303-7
Electronic_ISBN :
978-1-4244-8304-4
Type :
conf
DOI :
10.1109/BIBMW.2010.5703863
Filename :
5703863