Title :
A Map-Reduce Framework for Clustering Metagenomes
Author :
Rasheed, Z. ; Rangwala, Huzefa
Author_Institution :
Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
Abstract :
The past few years have seen an explosion in the use of sequencing technologies for metagenomics, i.e., determination of the collective genome of microorganisms co-existing within several environments. In parallel, there has been rapid development of computational tools for the quantification of abundance, diversity and functionality of different species within these communities. Several clustering algorithms (also called binning algorithms) have been developed to categorize similar metagenome sequence reads for efficient post-processing and analysis. In this paper we present a distributed algorithm for clustering metagenome sequence reads. The algorithm is implemented within the Map-Reduce-based Hadoop platform, and approximates the computation of pairwise sequence similarity with a minwise hashing approach. The algorithm, referred to as MrMC-MinH, can perform either agglomerative hierarchical clustering or greedy clustering. The key advantage of MrMC-MinH is its ability to handle large volumes of sequence reads obtained from targeted 16S metagenomic or whole metagenomic data. We evaluate the performance of our algorithm on several real and simulated metagenome benchmarks and demonstrate that our approach is computationally efficient and produces accurate clustering results when evaluated using external ground truth. The source code for MrMC-MinH will be made available at the supplementary website.
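The idea of approximating pairwise sequence similarity with minwise hashing and then clustering greedily can be illustrated with a minimal single-machine sketch. This is not the MrMC-MinH implementation and does not use Hadoop; the k-mer length, the number of hash functions, the similarity threshold, the use of seeded BLAKE2b as the hash family, and all function names are assumptions chosen for illustration only.

```python
import hashlib


def kmers(seq, k=5):
    """Return the set of overlapping k-mers in a read (k is an assumed value)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Hypothetical hash family: one 64-bit BLAKE2b hash per integer seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=5, num_hashes=64):
    """Signature = minimum hash value of the k-mer set under each hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("read is shorter than k")
    return [min(seeded_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity of the k-mer sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def greedy_cluster(reads, threshold=0.5, k=5, num_hashes=64):
    """Assign each read to the first cluster whose representative is similar enough,
    otherwise start a new cluster with this read as its representative."""
    representatives = []  # one signature per cluster
    labels = []
    for read in reads:
        sig = minhash_signature(read, k, num_hashes)
        for cluster_id, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels


if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGT", "ACGTACGTACGTACGT", "TTTTTTTTTTTTGGGG"]
    print(greedy_cluster(reads))
```

Comparing fixed-length signatures instead of aligning reads keeps the cost of each pairwise comparison constant in the read length, which is the property that makes the approximation attractive for large read sets.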
Keywords :
biology computing; genomics; greedy algorithms; microorganisms; parallel algorithms; pattern clustering; MapReduce based Hadoop platform; MrMC-MinH; agglomerative hierarchical clustering; collective genome determination; distributed algorithm; external ground truth; greedy clustering approach; metagenome sequence read clustering; metagenome clustering algorithm; metagenomic data; microorganisms; minwise hashing approach; pairwise sequence similarity; Algorithm design and analysis; Approximation algorithms; Bioinformatics; Clustering algorithms; Communities; Genomics; Sequential analysis; map-reduce; metagenome clustering; minwise hashing;
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
DOI :
10.1109/IPDPSW.2013.100