DocumentCode :
1995563
Title :
GPU-Accelerated Protein Family Identification for Metagenomics
Author :
Wu, Changjun ; Kalyanaraman, Ananth
Author_Institution :
Xerox Innovation Group, Xerox Res. Center, Webster, NY, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
559
Lastpage :
568
Abstract :
The clustering of putative protein/Open Reading Frame (ORF) sequences available from large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families of environmental microbial communities. The implementation of this function, however, is currently challenged not only by data size but also by data complexity. In this paper, we present a CPU-GPU implementation of a randomized graph clustering heuristic called Shingling, which was originally developed by Gibson et al. Our implementation uses the CPU and GPU for different stages of computation, using GPUs for the most time-consuming steps. Experimental results on a 2M ocean metagenomics data set obtained from the Sorcerer II Global Ocean Sampling project show that our new implementation achieves a ~7X speedup over our serial implementation without using asynchronous CPU-GPU communication, with the GPU part alone contributing a ~374X speedup in the accelerated part. Qualitative evaluation on the 2M data set shows that our method improves clustering sensitivity over existing methods and recruits more sequences into the clustering without affecting overall specificity. As a demonstration of a large-scale run, we clustered a real-world homology graph containing 11M vertices and 640M edges, constructed from sequences of an ongoing Pacific Ocean metagenomics survey project, in about 94 minutes.
Keywords :
biocomputing; biology computing; graphics processing units; multiprocessing systems; proteins; CPU-GPU implementation; GPU accelerated protein family identification; ORF; Shingling; Sorcerer II global ocean sampling project; environmental microbial communities; graph clustering heuristic; metagenomics; metagenomics data; open reading frame; Algorithm design and analysis; Clustering algorithms; Graphics processing units; Instruction sets; Oceans; Proteins; Random access memory; Dense subgraph detection; GPGPU application; Parallel graph clustering algorithm; Protein family identification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
Type :
conf
DOI :
10.1109/IPDPSW.2013.185
Filename :
6650931