DocumentCode :
1826774
Title :
Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce
Author :
Estrada, Trilce ; Zhang, Boyu ; Taufer, Michela ; Cicotti, Pietro ; Armen, R.
Author_Institution :
Dept. of Comput. & Inf. Sci., Univ. of Delaware, Newark, DE, USA
fYear :
2012
fDate :
25-27 June 2012
Firstpage :
351
Lastpage :
359
Abstract :
We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this end, we transform each three-dimensional (3D) molecular conformation into a single point in 3D space, reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each 3D point that maps a docked ligand, generate a tree from the whole space, and apply tree-based clustering on the reduced conformation space to identify the densest hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load balancing, fault tolerance, and scalability of MapReduce allow screening of very large conformation datasets that are not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.
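The identifier-and-tree idea in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation: the function names `octkey` and `densest_octant`, the `min_count` threshold, and the fixed bounding cube are all assumptions made for illustration, and the map and reduce steps are only emulated in memory rather than run on Hadoop.

```python
from collections import Counter


def octkey(point, lo, hi, depth):
    """Return a string of octant digits (0-7) locating `point` in the box [lo, hi).

    Hypothetical helper: each digit records which half of the current box the
    point falls in along x (bit 0), y (bit 1), and z (bit 2).
    """
    lo, hi = list(lo), list(hi)
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis
                lo[axis] = mid   # descend into the upper half
            else:
                hi[axis] = mid   # descend into the lower half
        digits.append(str(digit))
    return "".join(digits)


def densest_octant(points, lo, hi, max_depth, min_count):
    """Return (key_prefix, count) for the deepest octant holding >= min_count points.

    Assumed stopping rule: descend level by level while the most populated
    octant still contains at least `min_count` points.
    """
    # "Map" step (emulated): one identifier per 3D point.
    keys = [octkey(p, lo, hi, max_depth) for p in points]
    best = ("", len(keys))
    for level in range(1, max_depth + 1):
        # "Reduce" step (emulated): count points per key prefix at this level.
        counts = Counter(k[:level] for k in keys)
        prefix, count = max(counts.items(), key=lambda kv: kv[1])
        if count < min_count:
            break
        best = (prefix, count)
    return best
```

Each point is visited once per tree level, so the cost grows linearly with the number of points for a fixed depth. In a real MapReduce job the key prefix would serve as the shuffle key and the per-prefix count would be computed by the reducers.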
Keywords :
computational complexity; computational geometry; distributed programming; medical computing; molecular dynamics method; pattern clustering; proteins; public domain software; resource allocation; software fault tolerance; systems re-engineering; tree data structures; 3D point mapping; 3D space reduction; Hadoop; MapReduce; conformation space reduction; docked ligand; fault-tolerance; high-throughput molecular dataset re-engineering; high-throughput molecular dynamics simulations; hyperspaces; large-conformation datasets; linear clustering approach; load-balancing; molecular geometries; molecular similarities; optimal solutions; scalable clustering; space complexity; three-dimensional molecular conformation; tree-based clustering; Clustering algorithms; Computational modeling; Geometry; Peptides; Proteins; Scalability; Solid modeling; Linear clustering; MapReduce; Molecular docking; Octree;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on
Conference_Location :
Liverpool
Print_ISBN :
978-1-4673-2164-8
Type :
conf
DOI :
10.1109/HPCC.2012.54
Filename :
6332193