A Hadoop-based Massive Molecular Data Storage Solution for Virtual Screening

Author

Zhang, Yan ; Zhang, Ruisheng ; Chen, Qiuqiang ; Gao, Xiaopan ; Hu, Rongjing ; Zhang, Ying ; Liu, Guangcai

Author_Institution

Eng. Res. Center of Open Source Software & Real-time Syst., Lanzhou Univ., Lanzhou, China

fYear

2012

fDate

20-23 Sept. 2012

Firstpage

142

Lastpage

147

Abstract

Virtual screening involves massive computing tasks in which millions of molecules are docked against a target protein. Such data-intensive science faces the challenge of managing datasets of tens of terabytes, which calls for large-scale storage. Efficient query and transmission of these large-scale datasets are further key requirements during the virtual screening process. A massive data storage solution is therefore expected to improve the efficiency of storing and accessing large numbers of molecules and their docking results, and to facilitate the data preparation and analysis phases of virtual screening. To address these requirements, we propose a novel storage solution for virtual screening based on Hadoop. HBase is employed as a distributed database to persist the properties of the molecules and the docking results, and HDFS is used as the storage system for molecule source files. A comparison of system performance is also presented. We conclude that the proposed storage solution can be considered an alternative approach to enabling efficient storage of, and access to, large-scale molecules and docking results in virtual screening research.

Keywords

biology computing; data analysis; distributed databases; drugs; molecular biophysics; proteins; query processing; storage management; HBase; Hadoop-based massive molecular data storage solution; TB datasets; data analysis phase; data preparing phase; data-intensive application; distributed database; large-scale dataset query; large-scale dataset transmission; large-scale molecules; molecule source files storage system; protein; virtual screening; Chemicals; Distributed databases; Fault tolerance; Fault tolerant systems; Indexes; Memory; Cloud Computing; HBase; HDFS; Hadoop; Massive Data Storage; Virtual Screening;

fLanguage

English

Publisher

IEEE

Conference_Titel

ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh

Conference_Location

Beijing

Print_ISBN

978-1-4673-2623-0

Electronic_ISBN

978-0-7695-4816-6

Type

conf

DOI

10.1109/ChinaGrid.2012.26

Filename

6337289