DocumentCode :
2846566
Title :
On-the-Fly Recovery of Job Input Data in Supercomputers
Author :
Wang, Chao ; Zhang, Zhe ; Vazhkudai, Sudharshan S. ; Ma, Xiaosong ; Mueller, Frank
Author_Institution :
Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC
fYear :
2008
fDate :
9-12 Sept. 2008
Firstpage :
620
Lastpage :
627
Abstract :
Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.
Keywords :
meta data; parallel machines; parallel memories; storage management; HPC center service; I-O failure; Lustre parallel file system; Petascale computing; application-transparent online recovery solution; fine-granular locking; job input data recovery; metadata update; on-the-fly recovery framework; storage system failure; supercomputer parallel file systems; two-phase blocking protocol; Access protocols; Chaos; Computer science; Delay; File systems; Laboratories; Mathematics; Parallel processing; Petascale computing; Supercomputers;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Processing, 2008. ICPP '08. 37th International Conference on
Conference_Location :
Portland, OR
ISSN :
0190-3918
Print_ISBN :
978-0-7695-3374-2
Electronic_ISBN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2008.28
Filename :
4625901