DocumentCode
228698
Title
Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly
Author
Georganas, Evangelos ; Buluc, Aydin ; Chapman, Jarrod ; Oliker, Leonid ; Rokhsar, Daniel ; Yelick, Katherine
Author_Institution
Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
fYear
2014
fDate
16-21 Nov. 2014
Firstpage
437
Lastpage
448
Abstract
De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous fragments called reads. We study optimized parallelization of the most time-consuming phases of Meraculous, a state of-the-art production assembler. First, we present a new parallel algorithm for k-mer analysis, characterized by intensive communication and I/O requirements, and reduce the memory requirements by 6.93×. Second, we efficiently parallelize de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We provide a novel algorithm that leverages one-sided communication capabilities of the Unified Parallel C (UPC) to facilitate the requisite fine-grained parallelism and avoidance of data hazards, while analytically proving its scalability properties. Overall results show unprecedented performance and efficient scaling on up to 15,360 cores of a Cray XC30, on human genome as well as the challenging wheat genome, with performance improvement from days to seconds.
Keywords
Cray computers; biology computing; genomics; graph theory; input-output programs; parallel algorithms; program assemblers; Cray XC30; I/O requirement; Meraculous; UPC; data hazard avoidance; de novo assembler; de novo genome assembly; distributed hash table; fine-grained parallelism; genomic sequence; human genome; k-mer analysis; memory requirements; one-sided communication capability; optimized parallelization; parallel algorithm; parallel de Bruijn graph construction and traversal; production assembler; scalability property; unified parallel C; wheat genome; Algorithm design and analysis; Assembly; Bioinformatics; Genomics; Memory management; Program processors; Sequential analysis;
fLanguage
English
Publisher
ieee
Conference_Titel
High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
Conference_Location
New Orleans, LA
Print_ISBN
978-1-4799-5499-5
Type
conf
DOI
10.1109/SC.2014.41
Filename
7013023
Link To Document