DocumentCode :
167338
Title :
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Author :
Wang, Yi ; Agrawal, Gagan ; Ozer, Gulcin ; Huang, Kun
Author_Institution :
Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2014
fDate :
19-23 May 2014
Firstpage :
508
Lastpage :
517
Abstract :
Throughput from sequencing instruments has been increasing at an unprecedented rate, leading to an explosion of next-generation sequencing (NGS) data and to challenges in storing, managing, and analyzing these datasets. Parallelism is the key to handling large-scale data, and some progress has been made in parallelizing important steps, such as sequence alignment. However, other major steps remain sequential, limiting the ability to handle massive datasets. In this paper, we focus on parallelizing algorithms from two areas. The first is efficient data format conversion among a wide variety of sequence data formats, which is important for cross-utilization of different analysis modules. The second is statistical analysis. Our parallel sequence data format converter allows sequence datasets in BAM/SAM format to be converted into multiple formats, including SAM/BAM, BED, FASTA, FASTQ, BEDGRAPH, JSON, and YAML, using both shared-memory and distributed-memory parallelism. The converter currently comprises three instances: a SAM format converter, a BAM format converter, and a preprocessing-optimized SAM format converter. The converter also supports partial format conversion, which performs format conversion only on a specified chromosome region. The statistical analysis module includes a parallelized non-local means (NL-means) algorithm and false discovery rate (FDR) computation. Through extensive evaluation, we demonstrate the high scalability of our framework.
Keywords :
data analysis; distributed memory systems; electronic data interchange; shared memory systems; statistical analysis; BAM format converter; BED format; BEDGRAPH format; FASTA format; FASTQ format; FDR computation; JSON format; NGS data analysis; SAM format converter; YAML format; data format conversion; distributed memory parallelism; false discovery rate computation; large-scale data handling; next-generation sequencing data analysis; parallelization sequence data format converter; parallelized NL-means algorithm; parallelized nonlocal means algorithm; parallelizing algorithms; partial format conversion; preprocessing-optimized SAM format converter; sequence alignment; shared memory parallelism; statistical analysis; Algorithm design and analysis; Bioinformatics; Genomics; Histograms; Program processors; Sequential analysis; Statistical analysis; Data Format Conversion; Next-Generation Sequencing; Parallelization; Statistical Analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
Type :
conf
DOI :
10.1109/IPDPSW.2014.64
Filename :
6969430