  • DocumentCode
    2547176
  • Title
    Assessing Fault Sensitivity in MPI Applications
  • Author
    Lu, Charng-Da ; Reed, Daniel A.
  • Author_Institution
    University of Illinois at Urbana-Champaign
  • fYear
    2004
  • fDate
    06-12 Nov. 2004
  • Firstpage
    37
  • Lastpage
    37
  • Abstract
    Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications. Using software fault injection, we simulated single-bit memory errors, register file upsets, and MPI message payload corruption and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected.
  • Keywords
    Application software; Fault detection; Hardware; Large-scale systems; Payloads; Personal communication networks; Registers; Software measurement; Software standards; Software systems
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Supercomputing, 2004. Proceedings of the ACM/IEEE SC2004 Conference
  • Print_ISBN
    0-7695-2153-3
  • Type
    conf
  • DOI
    10.1109/SC.2004.12
  • Filename
    1392967