Title :
Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems
Author :
Brandt, Jim ; Debusschere, Bert ; Gentile, Ann ; Mayo, Jackson ; Pebay, P. ; Thompson, David ; Wong, Matthew
Author_Institution :
Sandia Nat. Labs., Livermore, CA
Abstract :
The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt, because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application.
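The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's statistical model, so a plain z-score ranking is substituted here as an assumption. The function name `select_nodes`, the `max_abs_z` threshold, and the node readings are all hypothetical and do not come from the paper.

```python
from statistics import mean, stdev


def select_nodes(readings, count, max_abs_z=2.0):
    """Pick `count` nodes whose monitored value lies closest to the pool mean.

    readings: dict mapping node name -> one monitored value (hypothetical
    metric, e.g. a temperature sample). Needs at least two nodes.
    Assumption: nodes near the centre of the distribution are preferred.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # All nodes report the same value; order by name only.
        ranked = sorted(readings)
    else:
        # Absolute z-score: distance from the mean in standard deviations.
        z = {node: abs(v - mu) / sigma for node, v in readings.items()}
        eligible = [node for node in z if z[node] <= max_abs_z]
        ranked = sorted(eligible, key=lambda node: (z[node], node))
    if len(ranked) < count:
        raise ValueError("not enough nodes within the allowed deviation")
    return ranked[:count]


# Hypothetical pool: n5 reads far from the other nodes.
pool = {"n1": 50.0, "n2": 51.0, "n3": 49.0, "n4": 50.5, "n5": 80.0}
chosen = select_nodes(pool, 3)
```

In this sketch a scheduler would call `select_nodes` with the node count the application requests. A tighter `max_abs_z` stands in for a stricter reliability requirement, at the cost of a smaller eligible pool.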
Keywords :
scheduling; statistical distributions; HPC systems; computational power; hardware level monitoring; high performance computing; runtime faults; statistical analysis; statistical distribution; Aggregates; Hardware; High performance computing; Monitoring; Power system interconnection; Resource management; Robustness; Runtime; Statistical analysis; Statistical distributions; RAS; abnormality detection; cluster monitoring; fault tolerance; probabilistic characterization; resilience; statistical analysis
Conference_Title :
Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
Conference_Location :
Lyon
Print_ISBN :
978-0-7695-3156-4
Electronic_ISBN :
978-0-7695-3156-4
DOI :
10.1109/CCGRID.2008.124