DocumentCode :
2251022
Title :
Probabilistic failure diagnosis in large-scale HPC applications with automaDeD
Author :
Laguna, Ignacio
Author_Institution :
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1
Lastpage :
32
Abstract :
This article consists of a collection of slides from the author's conference presentation. Some of the specific conclusions presented include: 1. The debugging approach, AutomaDeD, diagnoses failures in HPC applications at large scale; 2. The distributed debugging method is scalable (it takes a fraction of a second with 32,000 MPI tasks); 3. Tests were performed with a difficult-to-catch real-world bug and with fault injections in HPC benchmarks. Possible future directions are noted.
Keywords :
failure analysis; parallel processing; program debugging; HPC application; automaDeD; distributed debugging method; fault injection; probabilistic failure diagnosis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-6218-4
Type :
conf
DOI :
10.1109/SCC.2012.6522601
Filename :
6522601