• DocumentCode
    1826774
  • Title
    Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce
  • Author
    Estrada, Trilce ; Zhang, Boyu ; Taufer, Michela ; Cicotti, Pietro ; Armen, R.
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Univ. of Delaware, Newark, DE, USA
  • fYear
    2012
  • fDate
    25-27 June 2012
  • Firstpage
    351
  • Lastpage
    359
  • Abstract
    We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this end, we transform each three-dimensional (3D) molecular conformation into a single point in 3D space, reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each 3D point that maps a docked ligand, generate a tree from the whole space, and apply tree-based clustering on the reduced conformation space to identify the densest hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load balancing, fault tolerance, and scalability of MapReduce allow screening of very large conformation datasets that are not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.
  • Keywords
    computational complexity; computational geometry; distributed programming; medical computing; molecular dynamics method; pattern clustering; proteins; public domain software; resource allocation; software fault tolerance; systems re-engineering; tree data structures; 3D point mapping; 3D space reduction; Hadoop; MapReduce; conformation space reduction; docked ligand; fault-tolerance; high-throughput molecular dataset re-engineering; high-throughput molecular dynamics simulations; hyperspaces; large-conformation datasets; linear clustering approach; load-balancing; molecular geometries; molecular similarities; optimal solutions; scalable clustering; space complexity; three-dimensional molecular conformation; tree-based clustering; Clustering algorithms; Computational modeling; Geometry; Peptides; Proteins; Scalability; Solid modeling; Linear clustering; MapReduce; Molecular docking; Octree;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on
  • Conference_Location
    Liverpool
  • Print_ISBN
    978-1-4673-2164-8
  • Type
    conf
  • DOI
    10.1109/HPCC.2012.54
  • Filename
    6332193
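
The tree-based clustering outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Hadoop implementation. It assumes that each conformation has already been reduced to a 3D point normalized to the unit cube [0, 1)^3, that the identifier is a string of octant digits (one per octree level), and that "densest" means the deepest octant holding at least `min_points` points. The names `octkey`, `densest_octant`, `max_depth`, and `min_points` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def octkey(point, depth):
    """Return `depth` octant digits (0-7) locating `point` in an octree over [0, 1)^3."""
    lo = [0.0, 0.0, 0.0]  # lower corner of the current octant
    size = 1.0            # edge length of the current octant
    digits = []
    for _ in range(depth):
        size /= 2.0
        digit = 0
        for axis, value in enumerate(point):
            # Set this axis's bit when the point lies in the upper half.
            if value >= lo[axis] + size:
                digit |= 1 << axis
                lo[axis] += size
        digits.append(str(digit))
    return "".join(digits)


def densest_octant(points, max_depth, min_points):
    """Return (key prefix, count) of the deepest octant with >= min_points points."""
    keys = [octkey(p, max_depth) for p in points]
    best = ("", len(keys))  # the root octant contains every point
    for level in range(1, max_depth + 1):
        # Map step: emit each key's prefix at this level.
        # Reduce step: count the points that share a prefix.
        counts = Counter(k[:level] for k in keys)
        prefix, count = counts.most_common(1)[0]
        if count < min_points:
            break  # going deeper only splits octants further
        best = (prefix, count)
    return best
```

Because each level needs only a per-prefix count, the work per level is linear in the number of points, and the counting step has the same shape as a MapReduce word count. That is why this kind of search is a plausible fit for Hadoop.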