Title :
Fused Data Structures for Handling Multiple Faults in Distributed Systems
Author :
Balasubramanian, Bharath ; Garg, Vijay K.
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Texas at Austin, Austin, TX, USA
Abstract :
The paper describes a technique to correct crash faults in large data structures hosted on distributed servers, based on the concept of fused backups. The prevalent solution to this problem is replication. To correct f crash faults among n distinct data structures, replication requires nf additional replicas. If each of the primaries contains O(m) nodes of O(s) size each, this translates to O(nmsf) total backup space. Our technique uses a combination of erasure correcting codes and selective replication to correct f crash faults using just f additional backups consuming O(msf) total backup space, while incurring minimal overhead during normal operation. Since the data is maintained in the coded form, recovery is costly as compared to replication. However, in a system with infrequent faults, the savings in space outweighs the cost of recovery. We explore the theory and algorithms for these fused backups and provide a library of such backups for all the data structures in the Java 6 Collection framework. Our experimental evaluation confirms that fused backups are space-efficient as compared to replication (almost n times), while they cause very little overhead for updates. Many real world distributed systems such as Amazon´s Dynamo data store use replication to achieve reliability. An alternate, fusion-based design can result in significant savings in space as well as other resources such as power.
Keywords :
computational complexity; data handling; data structures; distributed processing; sensor fusion; system recovery; Amazon dynamo data store; crash fault; distributed server; distributed system; erasure correcting code; fused data structure; fusion-based design; multiple fault handling; real world distributed system; selective replication; space efficient fused backup library; space outweighs saving; Arrays; Computer crashes; Fault tolerance; Fault tolerant systems; Indexes; Servers; Data Structures; Distributed Systems; Fault Tolerance;
Conference_Titel :
Distributed Computing Systems (ICDCS), 2011 31st International Conference on
Conference_Location :
Minneapolis, MN
Print_ISBN :
978-1-61284-384-1
Electronic_ISBN :
1063-6927
DOI :
10.1109/ICDCS.2011.80