• DocumentCode
    1998414
  • Title
    Sustained Resilience via Live Process Cloning
  • Author
    Rezaei, A.; Mueller, Frank
  • Author_Institution
    Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    1498
  • Lastpage
    1507
  • Abstract
    More flexible fault tolerance approaches with lower overhead are a must for the next generation of supercomputers that rely on massive numbers of computational elements. This work proposes a reactive method for fault resilience in high-performance computing (HPC) systems based on forward execution instead of rollback to checkpoints. We study the feasibility of combining redundancy with live process cloning to create highly reliable HPC systems. The main motivation is to avoid costly checkpoint/restart approaches. We present live process cloning as a mechanism to create a copy of a running process on the fly. We show that the reliability of a dual redundant system with live process cloning is as good as that of a triple redundant system, even for very large systems. We also investigate the effect of node failure and the changes in the mean time to interrupt (MTTI) of the application. This provides a better understanding of the time available to recover from a failure by cloning a healthy replica.
  • Keywords
    checkpointing; fault tolerant computing; parallel processing; redundancy; reliability; HPC systems; MTTI; checkpoint restart approach; computational elements; dual redundant system; fault resilience; flexible fault tolerance approaches; high-performance computing systems; live process cloning; mean time to interrupt; next generation supercomputers; sustained resilience; triple redundant system; Checkpointing; Cloning; Computational modeling; Logic gates; Redundancy; Resilience; Fault Resilience; HPC; Process Cloning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-4979-8
  • Type
    conf
  • DOI
    10.1109/IPDPSW.2013.224
  • Filename
    6651044