• DocumentCode
    168752
  • Title

    Accelerating Comparative Genomics Workflows in a Distributed Environment with Optimized Data Partitioning

  • Author

    Choudhury, Olivia ; Hazekamp, Nicholas L. ; Thain, D. ; Emrich, S.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN, USA
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    711
  • Lastpage
    719
  • Abstract
    The advent of new sequencing technology has generated massive amounts of biological data at unprecedented rates. High-throughput bioinformatics tools are required to keep pace with this growth. Here, we implement a workflow-based model for parallelizing the data-intensive task of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore different approaches to partitioning data and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We identify the various challenges encountered while developing such an application and provide insight into addressing them. We report significant performance improvements, from 12 days to 4 hours, while running the BWA-GATK pipeline using 100 nodes for analyzing high-coverage oak tree data.
  • Keywords
    biology computing; distributed processing; genomics; tree data structures; workflow management software; BWA-GATK pipeline; GATK HaplotypeCaller; alignment-based partitioning; biological data; comparative genomics workflow; data intensive task; distributed environment; genome alignment; granularity-based partitioning; high-coverage oak tree data; high-throughput bioinformatics tools; optimized data partitioning; partitioning data; sequencing technology; workflow-based model; Bioinformatics; Genomics; III-V semiconductor materials; Pipelines; Runtime; Sequential analysis; BWA; Bioinformatics; Comparative Genomics; Data Partitioning; Distributed Computing; GATK; Genome Alignment; Makeflow; Variant Calling; Work Queue;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
  • Conference_Location
    Chicago, IL
  • Type

    conf

  • DOI
    10.1109/CCGrid.2014.79
  • Filename
    6846523