DocumentCode :
3471587
Title :
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Author :
Islam, Nusrat Sharmin ; Xiaoyi Lu ; Wasi-ur-Rahman, Md ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2013
fDate :
21-23 Aug. 2013
Firstpage :
75
Lastpage :
78
Abstract :
The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications due to its reliability and fault tolerance. HDFS provides fault tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, resulting in higher replication times. Such large replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme is able to outperform the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark. We are also able to increase the throughput reported by the TestDFSIO benchmark by up to 12%. The proposed parallel replication is also able to enhance the HBase Put operation performance by 17%. However, for lower performance networks like 1 GigE and for smaller data sizes, parallel replication does not improve performance.
Keywords :
IP networks; computer network performance evaluation; fault tolerant computing; network operating systems; peripheral interfaces; pipeline processing; replicated databases; 10 GigE; Apache Hadoop Distributed File System; HBase Put operation performance enhancement; HDFS fault-tolerance; HDFS reliability; IP over InfiniBand; IPoIB; RDMA-based HDFS design; TeraGen benchmark; TestDFSIO benchmark; availability guarantee; big-data applications; data block replication; data nodes; execution time reduction; high-performance interconnects; high-performance networks; parallel replication time; pipelined replication scheme; real-time latency-sensitive applications; remote direct memory access; socket-based HDFS design; throughput; Benchmark testing; Data handling; Data storage systems; File systems; Information management; Protocols; Throughput; Big Data; HDFS; High Performance Interconnects; Replication;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on
Conference_Location :
San Jose, CA
Type :
conf
DOI :
10.1109/HOTI.2013.24
Filename :
6627739