DocumentCode :
168591
Title :
A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers
Author :
Sato, Kiminori ; Mohror, Kathryn ; Moody, Adam ; Gamblin, Todd ; de Supinski, Bronis R. ; Maruyama, Naoya ; Matsuoka, Shingo
Author_Institution :
Dept. of Math. & Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan
fYear :
2014
fDate :
26-29 May 2014
Firstpage :
21
Lastpage :
30
Abstract :
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage resources positioned between the compute nodes and the parallel file system, and this new tier within the storage hierarchy fills the performance gap between node-local storage and parallel file systems. With burst buffers, an application can quickly store checkpoints with increased reliability. In this work, we explore how burst buffers can improve efficiency compared to using only node-local storage. To fully exploit the bandwidth of burst buffers, we develop a user-level InfiniBand-based file system (IBIO). We also develop performance models for coordinated and uncoordinated checkpoint/restart strategies, and we apply those models to investigate the best checkpoint strategy using burst buffers on future large-scale systems.
Keywords :
checkpointing; fault tolerant computing; parallel processing; storage management; IBIO; burst buffers; checkpoint strategy; checkpoint-restart techniques; dedicated storage resources; fault tolerance technique; high-performance computing applications; large-scale systems; node-local storage; parallel file system; user-level InfiniBand-based file system; Bandwidth; Buffer storage; Checkpointing; Computational modeling; Instruction sets; Reliability; Servers; burst buffer; checkpoint/restart; fault tolerance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
Type :
conf
DOI :
10.1109/CCGrid.2014.24
Filename :
6846437