Title :
Reliability mechanisms for very large storage systems
Author :
Xin, Qin ; Miller, Ethan L. ; Schwarz, Thomas ; Long, Darrell D. E. ; Brandt, Scott A. ; Litwin, Witold
Author_Institution :
California Univ., Santa Cruz, CA, USA
Abstract :
Reliability and availability are increasingly important in large-scale storage systems built from thousands of individual storage devices. Large systems must survive the failure of individual components; in systems with thousands of disks, even infrequent failures are likely in some device. We focus on two types of errors: nonrecoverable read errors and drive failures. We discuss mechanisms for detecting and recovering from such errors, introducing improved techniques for detecting errors in disk reads and fast recovery from disk failure. We show that simple RAID cannot guarantee sufficient reliability; our analysis examines the tradeoffs among other schemes between system availability and storage efficiency. Based on our data, we believe that two-way mirroring should be sufficient for most large storage systems. For those that need very high reliability, we recommend either three-way mirroring or mirroring combined with RAID.
Keywords :
RAID; computer network reliability; fault tolerant computing; file servers; hard discs; redundancy; drive failures; large-scale storage systems; nonrecoverable read errors; storage efficiency; system availability; three-way mirroring; two-way mirroring; Availability; Bandwidth; Costs; Disk drives; Error analysis; Fault tolerant systems; Frequency; High performance computing; Large-scale systems; Redundancy
Conference_Title :
Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings. 20th IEEE/11th NASA Goddard Conference on
Print_ISBN :
0-7695-1914-8
DOI :
10.1109/MASS.2003.1194851