Title :
On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems
Author :
Yunfeng Zhu; Patrick P. C. Lee; Yinlong Xu; Yuchong Hu; Liping Xiang
Author_Institution :
University of Science and Technology of China, Hefei, China
Abstract :
Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach.
Keywords :
fault tolerant computing; storage management; XOR operations; XOR-based erasure codes; availability guarantees; data redundancy; fast recovery solution; hill-climbing technique; large-scale erasure-coded storage systems; networked storage system testbed; node failures; parallelized architecture; replace recovery algorithm; single-node failure recovery; vulnerability window; Algorithm design and analysis; Distributed databases; Encoding; Equations; Generators; Mathematical model; Strips; XOR-coded storage system; recovery algorithm; single-node failure
Journal_Title :
IEEE Transactions on Parallel and Distributed Systems
DOI :
10.1109/TPDS.2013.244