• DocumentCode
    3448335
  • Title
    1st workshop on fault-tolerance for HPC at extreme scale FTXS 2010
  • Author
    Daly, John ; DeBardeleben, Nathan
  • Author_Institution
    Center for Exceptional Computing / Department of Defense, USA
  • fYear
    2010
  • fDate
    June 28 2010-July 1 2010
  • Firstpage
    1
  • Lastpage
    1
  • Abstract
    With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: scaling in the number of processing elements that supersedes the historical trend of scaling in processor frequencies. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point instead of single-point and interdependent instead of independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a larger fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates are apparently diverging from system failure rates.
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on
  • Conference_Location
    Chicago, IL
  • Print_ISBN
    978-1-4244-7729-6
  • Electronic_ISBN
    978-1-4244-7728-9
  • Type
    conf
  • DOI
    10.1109/DSNW.2010.5542628
  • Filename
    5542628