• DocumentCode
    2441955
  • Title
    A high-performance fault-tolerant software framework for memory on commodity GPUs
  • Author
    Maruyama, Naoya ; Nukada, Akira ; Matsuoka, Satoshi
  • Author_Institution
    GSIC, Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2010
  • fDate
    19-23 April 2010
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    As GPUs are increasingly used to accelerate HPC applications by allowing more flexibility and programmability, their fault tolerance is becoming much more important than before when they were used only for graphics. The current generation of GPUs, however, does not have standard error detection and correction capabilities, such as SEC-DED ECC for DRAM, which is almost always exercised in HPC servers. We present a high-performance software framework to enhance commodity off-the-shelf GPUs with DRAM fault tolerance. It combines data coding for detecting bit-flip errors and checkpointing for recovering computations when such errors are detected. We analyze performance of data coding in GPUs and present optimizations geared toward memory-intensive GPU applications. We present performance studies of the prototype implementation of the framework and show that the proposed framework can be realized with negligible overheads in compute intensive applications such as N-body problem and matrix multiplication, and as low as 35% in a highly-efficient memory intensive 3-D FFT kernel.
  • Keywords
    DRAM chips; checkpointing; coprocessors; error correction; error detection; fast Fourier transforms; matrix multiplication; operating system kernels; software fault tolerance; DRAM fault tolerance; HPC applications; HPC servers; N-body problem; SEC-DED ECC; bit-flip errors; checkpointing; commodity GPU; commodity off-the-shelf GPU; data coding; error correction capability; high-performance fault-tolerant software framework; high-performance software framework; matrix multiplication; memory intensive 3-D FFT kernel; memory-intensive GPU applications; standard error detection; Acceleration; Application software; Checkpointing; Data analysis; Error correction; Error correction codes; Fault tolerance; Graphics; Performance analysis; Random access memory; GPGPU; fault tolerance; memory soft errors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
  • Conference_Location
    Atlanta, GA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-6442-5
  • Type
    conf
  • DOI
    10.1109/IPDPS.2010.5470473
  • Filename
    5470473
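
The Abstract describes combining data coding (to detect bit-flip errors) with checkpointing (to recover the computation once an error is detected). The following is a minimal CPU-side sketch of that general idea, not the paper's implementation: the XOR parity word, the `ProtectedBuffer` class, and the `run_with_recovery` helper are all hypothetical names and choices made for illustration, and the record does not say which code the authors actually use.

```python
from functools import reduce
from operator import xor


def parity_word(words):
    """XOR of all words; flipping any single bit in the data changes the result."""
    return reduce(xor, words, 0)


class ProtectedBuffer:
    """Hypothetical buffer that keeps a parity code and one checkpoint of its data."""

    def __init__(self, words):
        self.data = list(words)
        self.code = parity_word(self.data)
        self.checkpoint = list(self.data)

    def write(self, index, value):
        # Update the code incrementally: remove the old word, add the new one.
        self.code ^= self.data[index] ^ value
        self.data[index] = value

    def verify(self):
        # Detection only: a mismatch says an error occurred, not where.
        return parity_word(self.data) == self.code

    def save_checkpoint(self):
        if not self.verify():
            raise RuntimeError("refusing to checkpoint corrupted data")
        self.checkpoint = list(self.data)

    def rollback(self):
        self.data = list(self.checkpoint)
        self.code = parity_word(self.data)


def run_with_recovery(buf, step, fault=None):
    """Run one step; on a detected error, roll back and re-run. Returns True if recovery ran."""
    buf.save_checkpoint()
    step(buf)
    if fault is not None:
        index, bit = fault
        buf.data[index] ^= 1 << bit  # simulated memory bit flip, bypasses write()
    if buf.verify():
        return False
    buf.rollback()
    step(buf)
    return True


def double_all(buf):
    """Example computation step: double every word through the protected write path."""
    for i in range(len(buf.data)):
        buf.write(i, buf.data[i] * 2)
```

In this sketch a single XOR word only detects an odd number of flips per bit position and cannot correct them, which is why recovery relies on re-executing from the checkpoint rather than on repairing the data in place.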