DocumentCode
168752
Title
Accelerating Comparative Genomics Workflows in a Distributed Environment with Optimized Data Partitioning
Author
Choudhury, Olivia ; Hazekamp, Nicholas L. ; Thain, D. ; Emrich, S.
Author_Institution
Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN, USA
fYear
2014
fDate
26-29 May 2014
Firstpage
711
Lastpage
719
Abstract
The advent of new sequencing technology has generated massive amounts of biological data at unprecedented rates. High-throughput bioinformatics tools are required to keep pace with this. Here, we implement a workflow-based model for parallelizing the data-intensive task of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore different approaches to partitioning data and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We identify the various challenges encountered while developing such an application and provide insight into addressing them. We report significant performance improvements, from 12 days to 4 hours, while running the BWA-GATK pipeline using 100 nodes for analyzing high-coverage oak tree data.
Keywords
biology computing; distributed processing; genomics; tree data structures; workflow management software; BWA-GATK pipeline; GATK HaplotypeCaller; alignment-based partitioning; biological data; comparative genomics workflow; data intensive task; distributed environment; genome alignment; granularity-based partitioning; high-coverage oak tree data; high-throughput bioinformatics tools; optimized data partitioning; partitioning data; sequencing technology; workflow-based model; Bioinformatics; Genomics; III-V semiconductor materials; Pipelines; Runtime; Sequential analysis; BWA; Bioinformatics; Comparative Genomics; Data Partitioning; Distributed Computing; GATK; Genome Alignment; Makeflow; Variant Calling; Work Queue;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location
Chicago, IL
Type
conf
DOI
10.1109/CCGrid.2014.79
Filename
6846523