Title :
Scalability of DNA sequence database on low-end cluster using Hadoop
Author :
Jamal, Ade ; Pradani, Winangsari ; Hasanati, Nida'ul ; Supriyanto, Arief ; Pujianto, Rahman
Author_Institution :
Dept. of Inf. Eng., Univ. Al-Azhar Indonesia, Jakarta, Indonesia
Abstract :
Publicly available DNA sequence databases such as GenBank, managed by the National Center for Biotechnology Information (NCBI), are very large and still growing exponentially. The sequence data are stored in flat-file format, grouped into divisions based on source taxonomy. The bacterial division alone consists of more than 100 files totaling about 6 gigabytes. Searching 100 files on a single server took about 1500 seconds on a quad-core machine. To speed up this process, the bacterial sequence data were uploaded to the Hadoop Distributed File System (HDFS) on a low-end cluster. The search algorithm runs under the MapReduce computation model in conjunction with HDFS, as the two technologies are the main components of the Hadoop framework. A scalability evaluation was performed to investigate whether increasing the number of nodes in the cluster is worthwhile.
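The abstract does not describe the search algorithm itself, so the following is only a minimal sketch of how a MapReduce-style search over GenBank flat-file records might look. It runs in a single Python process with no Hadoop dependency. The function names, the sample records, and the file name `gbbct1.seq` are hypothetical, and exact substring matching is an assumed stand-in for whatever search the authors used.

```python
from collections import defaultdict

# Hypothetical two-record sample in GenBank flat-file layout (not real data).
SAMPLE = """LOCUS       TEST0001   20 bp    DNA     linear   BCT
ORIGIN
        1 acgtacgtac ggattacaaa
//
LOCUS       TEST0002   20 bp    DNA     linear   BCT
ORIGIN
        1 tttttttttt cccccccccc
//
"""

def parse_records(text):
    """Split flat-file text into records; each record ends with a '//' line."""
    record = []
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)

def map_search(file_name, text, query):
    """Map step: emit (file_name, locus) for every record containing the query."""
    query = query.lower()
    for record in parse_records(text):
        locus = None
        in_origin = False
        bases = []
        for line in record:
            if line.startswith("LOCUS"):
                parts = line.split()
                locus = parts[1] if len(parts) > 1 else None
            elif line.startswith("ORIGIN"):
                in_origin = True
            elif in_origin:
                # Sequence lines hold a position number, then blocks of bases.
                bases.append("".join(line.split()[1:]))
        if locus and query in "".join(bases).lower():
            yield file_name, locus

def reduce_hits(pairs):
    """Reduce step: group the matching locus names by source file."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

if __name__ == "__main__":
    print(reduce_hits(map_search("gbbct1.seq", SAMPLE, "gattaca")))
```

On a cluster, the map step would run in parallel over the file blocks stored in HDFS and the framework would perform the grouping. Here both steps are ordinary function calls, which shows the data flow but none of the scalability behavior the paper evaluates.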
Keywords :
DNA; bioinformatics; data handling; microorganisms; parallel processing; DNA sequence database; GenBank; Hadoop distributed file system; MapReduce computation model; NCBI; National Center for Biotechnology Information; bacterial division; bacterial sequence data; flat file format; low-end cluster; quad cores machine; scalability evaluation; searching algorithm; source taxonomy; Computational modeling; Databases; File systems; Microorganisms; Scalability; Servers; Distributed File System; Hadoop; MapReduce
Conference_Title :
Information Technology Systems and Innovation (ICITSI), 2014 International Conference on
DOI :
10.1109/ICITSI.2014.7048237