DocumentCode
168591
Title
A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers
Author
Sato, Kento ; Mohror, Kathryn ; Moody, Adam ; Gamblin, Todd ; de Supinski, Bronis R. ; Maruyama, Naoya ; Matsuoka, Satoshi
Author_Institution
Dept. of Math. & Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan
fYear
2014
fDate
26-29 May 2014
Firstpage
21
Lastpage
30
Abstract
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage resources positioned between the compute nodes and the parallel file system, and this new tier within the storage hierarchy fills the performance gap between node-local storage and parallel file systems. With burst buffers, an application can quickly store checkpoints with increased reliability. In this work, we explore how burst buffers can improve efficiency compared to using only node-local storage. To fully exploit the bandwidth of burst buffers, we develop a user-level InfiniBand-based file system (IBIO). We also develop performance models for coordinated and uncoordinated checkpoint/restart strategies, and we apply those models to investigate the best checkpoint strategy using burst buffers on future large-scale systems.
Keywords
checkpointing; fault tolerant computing; parallel processing; storage management; IBIO; burst buffers; checkpoint strategy; checkpoint-restart techniques; dedicated storage resources; fault tolerance technique; high-performance computing applications; large-scale systems; node-local storage; parallel file system; user-level InfiniBand-based file system; Bandwidth; Buffer storage; Checkpointing; Computational modeling; Instruction sets; Reliability; Servers; burst buffer; checkpoint/restart; fault tolerance;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location
Chicago, IL
Type
conf
DOI
10.1109/CCGrid.2014.24
Filename
6846437