• DocumentCode
    1913942
  • Title
    A Hadoop-based Massive Molecular Data Storage Solution for Virtual Screening
  • Author
    Zhang, Yan ; Zhang, Ruisheng ; Chen, Qiuqiang ; Gao, Xiaopan ; Hu, Rongjing ; Zhang, Ying ; Liu, Guangcai
  • Author_Institution
    Eng. Res. Center of Open Source Software & Real-time Syst., Lanzhou Univ., Lanzhou, China
  • fYear
    2012
  • fDate
    20-23 Sept. 2012
  • Firstpage
    142
  • Lastpage
    147
  • Abstract
    Virtual screening involves massive computing tasks in which millions of molecules are docked against a target protein. Such data-intensive science faces the challenge of managing datasets of tens of terabytes, which calls for large-scale storage. Efficient query and transmission of these large-scale datasets are further key requirements during the virtual screening process. In this data-intensive application, a massive data storage solution is therefore expected to improve the efficiency of storing and accessing large-scale molecular data and docking results, and to facilitate the data preparation and analysis phases of virtual screening. To address these requirements, we propose a novel storage solution based on Hadoop for virtual screening. HBase is used as a distributed database to persist the properties of massive numbers of molecules and their docking results. HDFS is used as the storage system for molecule source files. A comparison of system performance is also presented. We conclude that the proposed storage solution can be considered an alternative approach to enabling efficient storage of and access to large-scale molecular data and docking results in virtual screening research.
  • Keywords
    biology computing; data analysis; distributed databases; drugs; molecular biophysics; proteins; query processing; storage management; HBase; Hadoop-based massive molecular data storage solution; TB datasets; data analysis phase; data preparing phase; data-intensive application; distributed database; large-scale dataset query; large-scale dataset transmission; large-scale molecules; molecule source files storage system; protein; virtual screening; Chemicals; Distributed databases; Fault tolerance; Fault tolerant systems; Indexes; Memory; Cloud Computing; HBase; HDFS; Hadoop; Massive Data Storage; Virtual Screening;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2623-0
  • Electronic_ISBN
    978-0-7695-4816-6
  • Type
    conf
  • DOI
    10.1109/ChinaGrid.2012.26
  • Filename
    6337289