• DocumentCode
    168591
  • Title

    A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers

  • Author

    Sato, Kento ; Mohror, Kathryn ; Moody, Adam ; Gamblin, Todd ; de Supinski, Bronis R. ; Maruyama, Naoya ; Matsuoka, Satoshi

  • Author_Institution
    Dept. of Math. & Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    21
  • Lastpage
    30
  • Abstract
    Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage resources positioned between the compute nodes and the parallel file system, and this new tier within the storage hierarchy fills the performance gap between node-local storage and parallel file systems. With burst buffers, an application can quickly store checkpoints with increased reliability. In this work, we explore how burst buffers can improve efficiency compared to using only node-local storage. To fully exploit the bandwidth of burst buffers, we develop a user-level InfiniBand-based file system (IBIO). We also develop performance models for coordinated and uncoordinated checkpoint/restart strategies, and we apply those models to investigate the best checkpoint strategy using burst buffers on future large-scale systems.
  • Keywords
    checkpointing; fault tolerant computing; parallel processing; storage management; IBIO; burst buffers; checkpoint strategy; checkpoint-restart techniques; dedicated storage resources; fault tolerance technique; high-performance computing applications; large-scale systems; node-local storage; parallel file system; user-level InfiniBand-based file system; Bandwidth; Buffer storage; Checkpointing; Computational modeling; Instruction sets; Reliability; Servers; burst buffer; checkpoint/restart; fault tolerance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
  • Conference_Location
    Chicago, IL
  • Type

    conf

  • DOI
    10.1109/CCGrid.2014.24
  • Filename
    6846437