1st workshop on fault-tolerance for HPC at extreme scale FTXS 2010

Author

Daly, John ; DeBardeleben, Nathan

Author_Institution

Center for Exceptional Computing / Department of Defense, USA

fYear

2010

fDate

June 28 2010-July 1 2010

Firstpage

615

Lastpage

615

Abstract

With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: a scaling in number of processing elements that supersedes the historical trend of scaling in processor frequencies. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point instead of single-point and interdependent instead of independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a larger fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates are apparently diverging from system failure rates.

Keywords

Conferences; Error correction; Fault tolerance; Fault tolerant systems; Government; Hardware; Laboratories; Predictive models; Software performance; Space technology;

fLanguage

English

Publisher

IEEE

Conference_Titel

Dependable Systems and Networks (DSN), 2010 IEEE/IFIP International Conference on

Conference_Location

Chicago, IL, USA

Print_ISBN

978-1-4244-7500-1

Electronic_ISBN

978-1-4244-7499-8

Type

conf

DOI

10.1109/DSN.2010.5544426

Filename

5544426
