  • DocumentCode
    2765092
  • Title
    Selective Recovery from Failures in a Task Parallel Programming Model
  • Author
    Dinan, James ; Singri, Arjun ; Sadayappan, P. ; Krishnamoorthy, Sriram
  • Author_Institution
    Dept. Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2010
  • fDate
    17-20 May 2010
  • Firstpage
    709
  • Lastpage
    714
  • Abstract
    We present a fault-tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tracking mechanism. Compared with conventional checkpoint/restart techniques, this system offers a recovery penalty that is proportional to the degree of failure rather than the system size. We evaluate this system using the Self-Consistent Field (SCF) kernel, which forms an important component of ab initio methods for computational chemistry. Experimental results indicate that fault-tolerant task pools are robust in the presence of an arbitrary number of failures and that they offer low overhead in the absence of faults.
  • Keywords
    Chemistry; Clouds; Computer science; Electronics packaging; Fault tolerance; Grid computing; Hardware; Kernel; Parallel processing; Parallel programming; Global Arrays; PGAS; selective recovery; task parallelism
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on
  • Conference_Location
    Melbourne, Australia
  • Print_ISBN
    978-1-4244-6987-1
  • Type
    conf
  • DOI
    10.1109/CCGRID.2010.34
  • Filename
    5493399