DocumentCode
2765092
Title
Selective Recovery from Failures in a Task Parallel Programming Model
Author
Dinan, James ; Singri, Arjun ; Sadayappan, P. ; Krishnamoorthy, Sriram
Author_Institution
Dept. Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear
2010
fDate
17-20 May 2010
Firstpage
709
Lastpage
714
Abstract
We present a fault-tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tracking mechanism. Compared with conventional checkpoint/restart techniques, this system offers a recovery penalty that is proportional to the degree of failure rather than the system size. We evaluate this system using the Self-Consistent Field (SCF) kernel, which forms an important component of ab initio methods for computational chemistry. Experimental results indicate that fault-tolerant task pools are robust in the presence of an arbitrary number of failures and that they offer low overhead in the absence of faults.
Keywords
Chemistry; Clouds; Computer science; Electronics packaging; Fault tolerance; Grid computing; Hardware; Kernel; Parallel processing; Parallel programming; Global Arrays; PGAS; selective recovery; task parallelism
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on
Conference_Location
Melbourne, Australia
Print_ISBN
978-1-4244-6987-1
Type
conf
DOI
10.1109/CCGRID.2010.34
Filename
5493399