DocumentCode :
2399726
Title :
Replaying distributed programs without message logging
Author :
Netzer, Robert H B ; Xu, Yikang
Author_Institution :
Dept. of Comput. Sci., Brown Univ., Providence, RI, USA
fYear :
1997
fDate :
5-8 Aug 1997
Firstpage :
137
Lastpage :
147
Abstract :
Debugging long program runs can be difficult because of the delays required to repeatedly re-run the execution. Even a moderately long run of five minutes can incur aggravating delays. To address this problem, techniques exist that allow re-executing a distributed program from intermediate points by using combinations of checkpointing and message logging. In this paper, we explore another idea: how to support replay without logging the contents of any message. When no messages are logged, the set of global states from which replay is possible is constrained, and it has been unknown how to compute this set without exhaustively searching the space of all global states, whose size is exponential in the number of processes. We present a simple and efficient hybrid on-the-fly/post-mortem algorithm for detecting the necessary and sufficient conditions under which parts of the execution can be replayed without message logs. A small amount of trace (two vectors) is recorded at each checkpoint, and a fast post-mortem algorithm computes global states from which replay can begin. This algorithm is independent of the checkpointing technique used.
Keywords :
parallel programming; program debugging; checkpointing; debugging; distributed programs; global states; message logging; post-mortem algorithm; replay; computer science; costs; delay; fault tolerance
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Distributed Computing, 1997. Proceedings. The Sixth IEEE International Symposium on
Conference_Location :
Portland, OR
ISSN :
1082-8907
Print_ISBN :
0-8186-8117-9
Type :
conf
DOI :
10.1109/HPDC.1997.622370
Filename :
622370