DocumentCode :
3077036
Title :
Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture
Author :
Islam, Nusrat Sharmin; Lu, Xiaoyi; Wasi-ur-Rahman, M.; Shankar, Dipti; Panda, Dhabaleswar K.
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
101
Lastpage :
110
Abstract :
HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, because local storage space is limited, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughput of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also improves the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
Keywords :
Big Data; distributed databases; fault tolerant computing; parallel processing; CloudBurst application; Grep; HDFS; HPC clusters; Hadoop distributed file system; IO bottlenecks; Lustre; PUMA; SequenceCount; Triple-H; big data applications; data locality; data placement policies; fault-tolerance; heterogeneous storage architecture; high performance computing; parallel file system; sort benchmark; tri-replicated data blocks; Engines; Fault tolerance; Fault tolerant systems; File systems; Performance evaluation; Random access memory; Servers; Big Data; HDFS; HPC; Heterogeneous Storage
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.161
Filename :
7152476