• DocumentCode
    3079289
  • Title

    A Multilevel Fault-Tolerance Technique for the DAG Data Driven Model

  • Author

    Hao Fu ; Ce Yu ; Jizhou Sun ; Jun Du ; Mengmeng Wang

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Tianjin Univ., Tianjin, China
  • fYear
    2015
  • fDate
    4-7 May 2015
  • Firstpage
    1127
  • Lastpage
    1130
  • Abstract
    Tolerating hardware failures is a challenging task for parallel programming in massively parallel processing environments. However, traditional rollback-recovery techniques, which can be classified into checkpoint-based and log-based, introduce extra overhead for recording an overall snapshot of an application. For a specialized programming model, a private recovery technique is valuable and can achieve better performance. In this paper, a multilevel fault-tolerance technique designed for the DAG data driven model is proposed. It uses checkpoint-based fault tolerance for system recovery, and timeouts to detect and recover from performance faults. It consists of two kinds of checkpoints: the DAG pattern checkpoint and the intermediate result checkpoint. The DAG pattern checkpoint traces the current processing progress of the DAG model, while the intermediate result checkpoint records the outputs of compute nodes. We also implement this technique in the EasyHPS runtime system. Experimental results show that the checkpointing overhead is as low as 2.6%.
  • Keywords
    checkpointing; directed graphs; parallel programming; software fault tolerance; DAG data driven model; EasyHPS runtime system; checkpoint-based fault tolerance technique; directed acyclic graph; hardware failure; multilevel fault-tolerance technique; parallel processing environment; private recovery technique; rollback-recovery techniques; Computational modeling; Data models; Dynamic programming; Fault tolerance; Fault tolerant systems; Program processors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Type

    conf

  • DOI
    10.1109/CCGrid.2015.89
  • Filename
    7152603