Title :
Large scale debugging of parallel tasks with AutomaDeD
Author :
Laguna, Ignacio ; Gamblin, Todd ; De Supinski, Bronis R. ; Bagchi, Saurabh ; Bronevetsky, Greg ; Ahn, Dong H. ; Schulz, Martin ; Rountree, Barry
Author_Institution :
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
Abstract :
Developing correct HPC applications continues to be a challenge as the number of cores increases in today's largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. The overhead of collecting large amounts of runtime information and an absence of scalable error detection algorithms generally cause poor scalability. In this work, we present novel, highly efficient techniques that facilitate the process of debugging large-scale parallel applications. Our approach extends our previous work, AutomaDeD, in three major areas to isolate anomalous tasks in a scalable manner: (i) we efficiently compare elements of graph models (used in AutomaDeD to model parallel tasks) using pre-computed lookup tables and pointer comparison; (ii) we compress per-task graph models before the error detection analysis so that comparison between models involves many fewer elements; (iii) we use scalable sampling-based clustering and nearest-neighbor techniques to isolate abnormal tasks when bugs and performance anomalies are manifested. Our evaluation with fault injections shows that AutomaDeD scales well to thousands of tasks and that it can find anomalous tasks in under 5 seconds in an online manner.
Keywords :
graph theory; parallel programming; pattern clustering; program debugging; software fault tolerance; table lookup; task analysis; AutomaDeD; HPC; debugging techniques; fault injections; lookup tables; nearest neighbor techniques; parallel tasks; per-task graph models; scalable error detection algorithms; scalable sampling based clustering; Clustering algorithms; Complexity theory; Debugging; Gaussian distribution; Image edge detection; Probability distribution; Runtime;
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0