  • DocumentCode
    1995563
  • Title
    GPU-Accelerated Protein Family Identification for Metagenomics
  • Author
    Changjun Wu ; Ananth Kalyanaraman
  • Author_Institution
    Xerox Innovation Group, Xerox Research Center, Webster, NY, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    559
  • Lastpage
    568
  • Abstract
    Clustering the putative protein/Open Reading Frame (ORF) sequences produced by large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families in environmental microbial communities. The implementation of this function, however, is currently challenged not only by data size but also by data complexity. In this paper, we present a CPU-GPU implementation of a randomized graph clustering heuristic called Shingling, which was originally developed by Gibson et al. Our implementation divides the stages of computation between the CPU and the GPU, assigning the most time-consuming steps to the GPU. Experimental results on a 2M ocean metagenomics data set obtained from the Sorcerer II Global Ocean Sampling project show that our new implementation achieves a ~7X speedup over our serial implementation without using asynchronous CPU-GPU communication, with the GPU alone contributing a ~374X speedup in the accelerated part. Qualitative evaluation on the 2M data set shows that our method improves clustering sensitivity over existing methods and recruits more sequences into clusters without impacting overall specificity. As a demonstration of a large-scale run, we clustered a real-world homology graph with 11M vertices and 640M edges, constructed from sequences of an ongoing Pacific Ocean metagenomics survey project, in about 94 minutes.
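The abstract names Shingling without describing it. As a rough illustration only, the following is a minimal serial sketch of a single min-wise-hashing pass in the spirit of that heuristic: each vertex's neighbor list is reduced to c shingles of s neighbors each, and vertices that share a shingle are grouped together. The function names, the parameters s and c, and the choice of BLAKE2b as the hash are assumptions made for this example; this is not the authors' CPU-GPU implementation and it omits any later passes of the full algorithm.

```python
import hashlib
from collections import defaultdict


def _hash(seed, item):
    # Deterministic seeded hash (hypothetical choice: BLAKE2b, 8-byte digest).
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(neighbors, s, c):
    """Return up to c shingles for one vertex.

    For each of c seeded hash functions, the shingle is the set of the
    s neighbors with the smallest hash values. The seed is kept in the
    shingle so that only shingles from the same trial are compared.
    """
    out = []
    for seed in range(c):
        ranked = sorted(neighbors, key=lambda v: (_hash(seed, v), v))
        if len(ranked) >= s:
            out.append((seed, tuple(sorted(ranked[:s]))))
    return out


def group_by_shingle(adj, s=2, c=5):
    """Map each shingle to the vertices that generated it.

    adj maps a vertex to an iterable of its neighbors. Only shingles
    shared by at least two vertices are returned, since those are the
    ones that hint at a densely connected group.
    """
    groups = defaultdict(set)
    for u, nbrs in adj.items():
        for sh in shingles(nbrs, s, c):
            groups[sh].add(u)
    return {sh: vs for sh, vs in groups.items() if len(vs) > 1}
```

Vertices with heavily overlapping neighbor lists tend to produce identical shingles, which is why shared shingles serve as a cheap randomized signal for candidate clusters. In a CPU-GPU setting, the per-vertex hashing loop is the kind of independent, repetitive work that would plausibly be offloaded to the GPU, though the record does not say which steps the authors accelerated.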
  • Keywords
    biocomputing; biology computing; graphics processing units; multiprocessing systems; proteins; CPU-GPU implementation; GPU accelerated protein family identification; ORF; Shingling; Sorcerer II global ocean sampling project; environmental microbial communities; graph clustering heuristic; metagenomics; metagenomics data; open reading frame; Algorithm design and analysis; Clustering algorithms; Graphics processing units; Instruction sets; Oceans; Proteins; Random access memory; Dense subgraph detection; GPGPU application; Parallel graph clustering algorithm; Protein family identification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW)
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-4979-8
  • Type
    conf
  • DOI
    10.1109/IPDPSW.2013.185
  • Filename
    6650931