DocumentCode :
3664283
Title :
Communication Pattern-Based Distributed Snapshots in Large-Scale Systems
Author :
Salem Saker;Adnan Agbaria
Author_Institution :
Acad. Arab Coll. for Educ. in Israel, Univ. of Haifa, Haifa, Israel
fYear :
2015
fDate :
5/1/2015 12:00:00 AM
Firstpage :
1062
Lastpage :
1071
Abstract :
Large-scale systems (LSSs) continue to attract attention from the scientific community for high-performance computing. Providing fault tolerance in distributed systems is a challenge, and it becomes more difficult in LSSs. Distributed snapshots are an important building block for distributed systems and, among other applications, are useful for providing fault tolerance. This paper motivates the need for fault tolerance in LSSs and focuses on the limitations of providing it. It then presents an innovative and scalable distributed snapshot approach for LSSs. In this approach, when a new snapshot is taken, a process coordinates only with the processes that it has communicated with since the last snapshot. Our protocol improves on the Chandy and Lamport distributed snapshot protocol, which was presented in 1985. This improvement may enable system developers and planners to consider this protocol. We compare the performance of our approach with that of other well-known distributed snapshot approaches using stochastic models. The results show that our approach achieves significantly lower overhead.
Keywords :
"Protocols","Fault tolerance","Fault tolerant systems","Checkpointing","Process control","Complexity theory","History"
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International
Type :
conf
DOI :
10.1109/IPDPSW.2015.117
Filename :
7284427