• DocumentCode
    3664283
  • Title

    Communication Pattern-Based Distributed Snapshots in Large-Scale Systems

  • Author

    Salem Saker;Adnan Agbaria

  • Author_Institution
    Acad. Arab Coll. for Educ. in Israel, Univ. of Haifa, Haifa, Israel
  • fYear
    2015
  • fDate
    5/1/2015 12:00:00 AM
  • Firstpage
    1062
  • Lastpage
    1071
  • Abstract
    Large-scale systems (LSSs) continue to attract more attention from the scientific community for addressing high-performance computing. Providing fault tolerance in distributed systems is a challenge, and this challenge becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for providing fault tolerance in LSSs and focuses on the limitations behind this provision. It then presents an innovative and scalable distributed snapshot approach for LSSs. In this approach, when a new snapshot is taken, a process coordinates only with the processes that it has communicated with since the last snapshot. Our protocol improves on the Chandy and Lamport distributed snapshot protocol, which was presented in 1985. This improvement may enable developers and planners of systems to consider this protocol. We compare the performance of our new approach with that of other well-known distributed snapshot approaches using stochastic models. The results show that our approach achieves significantly lower overhead.
  • Keywords
    "Protocols","Fault tolerance","Fault tolerant systems","Checkpointing","Process control","Complexity theory","History"
  • Publisher
    ieee
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2015.117
  • Filename
    7284427