DocumentCode
1995563
Title
GPU-Accelerated Protein Family Identification for Metagenomics
Author
Wu, Changjun ; Kalyanaraman, Ananth
Author_Institution
Xerox Innovation Group, Xerox Res. Center, Webster, NY, USA
fYear
2013
fDate
20-24 May 2013
Firstpage
559
Lastpage
568
Abstract
The clustering of putative protein/Open Reading Frame (ORF) sequences from large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families in environmental microbial communities. Implementing this function, however, is challenged not only by data size but also by data complexity. In this paper, we present a CPU-GPU implementation of a randomized graph clustering heuristic called Shingling, originally developed by Gibson et al. Our implementation divides the stages of computation between the CPU and the GPU, assigning the most time-consuming steps to the GPU. Experimental results on a 2M ocean metagenomics data set obtained from the Sorcerer II Global Ocean Sampling project show that our new implementation achieves a ~7X speedup over our serial implementation without using asynchronous CPU-GPU communication, with the GPU alone delivering a speedup of roughly 374X in the accelerated part. Qualitative evaluation on the 2M data set shows that our method improves the sensitivity of clustering over existing methods and recruits more sequences into the clustering without reducing overall specificity. As a demonstration of a large-scale run, we clustered a real-world homology graph containing 11M vertices and 640M edges, constructed from sequences of an ongoing Pacific Ocean metagenomics survey project, in about 94 minutes.
Keywords
biocomputing; biology computing; graphics processing units; multiprocessing systems; proteins; CPU-GPU implementation; GPU accelerated protein family identification; ORF; Shingling; Sorcerer II global ocean sampling project; environmental microbial communities; graph clustering heuristic; metagenomics; metagenomics data; open reading frame; Algorithm design and analysis; Clustering algorithms; Graphics processing units; Instruction sets; Oceans; Proteins; Random access memory; Dense subgraph detection; GPGPU application; Parallel graph clustering algorithm; Protein family identification
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location
Cambridge, MA
Print_ISBN
978-0-7695-4979-8
Type
conf
DOI
10.1109/IPDPSW.2013.185
Filename
6650931