DocumentCode :
3205497
Title :
Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds
Author :
Yang, Xiao ; Zola, Jaroslaw ; Aluru, Srinivas
Author_Institution :
Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
1223
Lastpage :
1233
Abstract :
Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next-generation sequencing is facilitating the creation of large metagenomic samples, while at the same time making the clustering problem harder because of the short sequence lengths supported and the unknown species sampled. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt the sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of the map-reduce framework in relatively complicated algorithmic settings.
Keywords :
biology computing; cloud computing; genomics; graph theory; parallel algorithms; pattern clustering; hierarchical taxonomic clustering; map-reduce clouds; map-reduce framework; maximal quasi-clique enumeration; metagenomics classification problem; parallel algorithm; parallel metagenomic sequence clustering; similarity graph; similarity threshold; sketching technique; web document clustering; Clustering algorithms; Couplings; DNA; Organisms; Silicon; Strontium;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
Conference_Location :
Anchorage, AK
ISSN :
1530-2075
Print_ISBN :
978-1-61284-372-8
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.116
Filename :
6012859