• DocumentCode
    228698
  • Title

    Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly

  • Author

    Georganas, Evangelos ; Buluc, Aydin ; Chapman, Jarrod ; Oliker, Leonid ; Rokhsar, Daniel ; Yelick, Katherine

  • Author_Institution
    Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    437
  • Lastpage
    448
  • Abstract
    De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous fragments called reads. We study optimized parallelization of the most time-consuming phases of Meraculous, a state of-the-art production assembler. First, we present a new parallel algorithm for k-mer analysis, characterized by intensive communication and I/O requirements, and reduce the memory requirements by 6.93×. Second, we efficiently parallelize de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We provide a novel algorithm that leverages one-sided communication capabilities of the Unified Parallel C (UPC) to facilitate the requisite fine-grained parallelism and avoidance of data hazards, while analytically proving its scalability properties. Overall results show unprecedented performance and efficient scaling on up to 15,360 cores of a Cray XC30, on human genome as well as the challenging wheat genome, with performance improvement from days to seconds.
  • Keywords
    Cray computers; biology computing; genomics; graph theory; input-output programs; parallel algorithms; program assemblers; Cray XC30; I/O requirement; Meraculous; UPC; data hazard avoidance; de novo assembler; de novo genome assembly; distributed hash table; fine-grained parallelism; genomic sequence; human genome; k-mer analysis; memory requirements; one-sided communication capability; optimized parallelization; parallel algorithm; parallel de Bruijn graph construction and traversal; production assembler; scalability property; unified parallel C; wheat genome; Algorithm design and analysis; Assembly; Bioinformatics; Genomics; Memory management; Program processors; Sequential analysis;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.41
  • Filename
    7013023