Title :
Protecting against rare event failures in archival systems
Author :
Wildani, Avani ; Schwarz, Thomas J E ; Miller, Ethan L. ; Long, Darrell D E
Author_Institution :
Storage Syst. Res. Center, Univ. of California, Santa Cruz, CA, USA
Abstract :
Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices. In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger super-groups, each of which has a corresponding super-parity; super-parity is used to recover data only when disk failures overwhelm the redundancy in a single reliability stripe. Super-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low. Our calculations of failure probabilities show that adding super-parity allows our system to absorb many more disk failures without data loss. Through discrete event simulation, we find that adding super-groups has a significant impact on mean time to data loss and that rebuilds are slow but not unmanageable. Finally, we show that robustness against rare events can be achieved for a fraction of total system cost.
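The two-level recovery path described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: both the stripe parity and the super-parity are plain XOR (the abstract does not say which redundancy codes are used), every stripe in a super-group has the same number of data disklets, and the function names (`stripe_parity`, `super_parity`, `recover_stripe`) are hypothetical. Lost disklets are represented as `None`.

```python
from functools import reduce
from typing import List, Optional, Sequence

Disklet = Optional[bytes]  # None marks a disklet lost to a device failure


def xor(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def stripe_parity(data: Sequence[bytes]) -> bytes:
    """Assumed first-level code: one XOR parity disklet per reliability stripe."""
    return xor(data)


def super_parity(stripes: Sequence[Sequence[bytes]]) -> List[bytes]:
    """Assumed second-level code: for each disklet position, XOR that
    position across every stripe in the super-group."""
    return [xor(column) for column in zip(*stripes)]


def recover_stripe(
    stripes: Sequence[Sequence[Disklet]],
    idx: int,
    lost: Sequence[int],
    parity: bytes,
    sp: Sequence[bytes],
) -> List[bytes]:
    """Rebuild stripe `idx`, whose disklets at positions `lost` are missing."""
    stripe = list(stripes[idx])
    if len(lost) == 1:
        # Small-scale failure: the stripe's own parity is enough.
        j = lost[0]
        survivors = [d for i, d in enumerate(stripe) if i != j]
        stripe[j] = xor(survivors + [parity])
        return stripe
    # The stripe's redundancy is overwhelmed: fall back to super-parity.
    # This requires the other stripes to be intact at the lost positions.
    for j in lost:
        others = [s[j] for i, s in enumerate(stripes) if i != idx]
        stripe[j] = xor(others + [sp[j]])
    return stripe


# Usage: a super-group of three stripes, each with three 2-byte data disklets.
stripes = [[b"ab", b"cd", b"ef"], [b"gh", b"ij", b"kl"], [b"mn", b"op", b"qr"]]
parities = [stripe_parity(s) for s in stripes]
sp = super_parity(stripes)

damaged = [list(s) for s in stripes]
damaged[1][0] = None  # two losses in one stripe exceed its single parity
damaged[1][2] = None
rebuilt = recover_stripe(damaged, 1, [0, 2], parities[1], sp)
```

In this sketch the super-parity is read only on the fallback path, which mirrors the abstract's point that super-parity is consulted only when failures exceed what a single reliability stripe can tolerate.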
Keywords :
discrete event simulation; information retrieval systems; records management; NV-RAM; archival systems; digital archives; rare event failure protection; rare event failures; super-parity; Cooling; Costs; Data engineering; Insurance; Power system protection; Power system reliability; Redundancy; Reliability engineering; Robustness; Surge protection
Conference_Title :
Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium on
Conference_Location :
London
Print_ISBN :
978-1-4244-4927-9
Electronic_ISBN :
1526-7539
DOI :
10.1109/MASCOT.2009.5366825