DocumentCode
1826774
Title
Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce
Author
Estrada, Trilce ; Zhang, Boyu ; Taufer, Michela ; Cicotti, Pietro ; Armen, R.
Author_Institution
Dept. of Comput. & Inf. Sci., Univ. of Delaware, Newark, DE, USA
fYear
2012
fDate
25-27 June 2012
Firstpage
351
Lastpage
359
Abstract
We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this end, we transform each three-dimensional (3D) molecular conformation into a single point in 3D space, reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each 3D point representing a docked ligand, generate a tree from the whole space, and apply tree-based clustering on the reduced conformation space to identify the densest hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load balancing, fault tolerance, and scalability of MapReduce allow screening of very large conformation datasets that are not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.
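The tree-based step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each conformation has already been reduced to a 3D point normalized to the unit cube [0, 1), and it stands in for Hadoop with plain Python generators. The names octant_key, map_phase, reduce_phase, and densest_octant are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def octant_key(point, depth, lo=0.0, hi=1.0):
    """Return a string of `depth` digits (0-7) locating `point` in an octree.

    Assumption: the point lies in the cube [lo, hi)^3. Each digit records
    which of the eight child octants contains the point at that level.
    """
    mins = [lo, lo, lo]
    maxs = [hi, hi, hi]
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (mins[axis] + maxs[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis   # upper half along this axis
                mins[axis] = mid
            else:
                maxs[axis] = mid     # lower half along this axis
        digits.append(str(digit))
    return "".join(digits)


def map_phase(points, depth):
    """Map step: emit (octant identifier, 1) for every 3D point."""
    for point in points:
        yield octant_key(point, depth), 1


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each octant identifier."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def densest_octant(points, depth):
    """Return (identifier, count) of the most populated octant at `depth`."""
    counts = reduce_phase(map_phase(points, depth))
    return counts.most_common(1)[0]


if __name__ == "__main__":
    # Hypothetical points: two near one corner of the cube, one far away.
    points = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.09), (0.90, 0.90, 0.90)]
    print(densest_octant(points, depth=2))
```

Each point is processed once and the work per point depends only on the tree depth, so the cost grows linearly with the number of conformations. This is the property that makes a per-point map step followed by a counting reduce step a natural fit.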
Keywords
computational complexity; computational geometry; distributed programming; medical computing; molecular dynamics method; pattern clustering; proteins; public domain software; resource allocation; software fault tolerance; systems re-engineering; tree data structures; 3D point mapping; 3D space reduction; Hadoop; MapReduce; conformation space reduction; docked ligand; fault-tolerance; high-throughput molecular dataset re-engineering; high-throughput molecular dynamics simulations; hyperspaces; large-conformation datasets; linear clustering approach; load-balancing; molecular geometries; molecular similarities; optimal solutions; scalable clustering; space complexity; three-dimensional molecular conformation; tree-based clustering; Clustering algorithms; Computational modeling; Geometry; Peptides; Proteins; Scalability; Solid modeling; Linear clustering; MapReduce; Molecular docking; Octree;
fLanguage
English
Publisher
IEEE
Conference_Titel
2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS)
Conference_Location
Liverpool
Print_ISBN
978-1-4673-2164-8
Type
conf
DOI
10.1109/HPCC.2012.54
Filename
6332193