Title :
Fault detection using hints from the socket layer
Author :
Neves, Nuno ; Fuchs, W. Kent
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Abstract :
Describes a fault detection mechanism that uses the error codes returned by stream sockets to locate process failures. Since these errors are generated automatically when there is communication with a failed process, the mechanism does not incur any failure-free overhead. However, for some types of faults, detection can only be achieved if the surviving processes use certain communication operations. To assess the coverage and latency of the proposed mechanism, faults were injected during the execution of parallel applications. Our results show that in most cases, faults could be found using only the errors from the socket layer. Depending on the type of fault that was injected, detection occurred in an interval ranging from a few milliseconds to less than nine minutes.
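
The idea of treating socket-layer errors as failure hints can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper name `peer_failure_hint`, the `FAILURE_ERRNOS` set, and the choice of error codes are assumptions made for illustration only. The sketch performs an ordinary send or receive on a connected stream socket and reports an end-of-file or a connection-related error code as a hint that the peer process may have failed.

```python
import errno
import socket

# Hypothetical selection of error codes treated as hints of a failed peer.
# The abstract does not list specific codes; these are common stream-socket errors.
FAILURE_ERRNOS = {
    errno.ECONNRESET,
    errno.EPIPE,
    errno.ETIMEDOUT,
    errno.ECONNREFUSED,
    errno.EHOSTUNREACH,
}


def peer_failure_hint(sock, payload=None):
    """Do one normal socket operation and report any failure hint it yields.

    If payload is given, send it; otherwise receive. Returns "EOF" when the
    peer has closed the stream, the symbolic errno name (e.g. "EPIPE") when
    the operation fails with a code in FAILURE_ERRNOS, and None when the
    operation completes without any sign of failure.
    """
    try:
        if payload is not None:
            sock.sendall(payload)
            return None
        # Note: on a blocking socket this waits until data or EOF arrives.
        data = sock.recv(4096)
        return "EOF" if data == b"" else None
    except OSError as exc:
        if exc.errno in FAILURE_ERRNOS:
            return errno.errorcode[exc.errno]
        raise  # unrelated errors are not interpreted as failure hints
```

Because the check piggybacks on operations the application performs anyway, it adds no extra messages while the peer is healthy. Which operation is used matters: in this sketch a receive only reports a failure once an end-of-file or error reaches the local socket, whereas a send can also surface errors produced when the data cannot be delivered.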
Keywords :
error detection; fault diagnosis; parallel processing; system recovery; automatically generated errors; communication operations; coverage; error codes; fault detection mechanism; fault injection; latency; parallel applications execution; process failure location; socket layer; stream sockets; surviving processes; Availability; Computer crashes; Computer errors; Computer networks; Contracts; Delay; Fault detection; Protocols; Safety; Sockets;
Conference_Title :
Reliable Distributed Systems, 1997. Proceedings., The Sixteenth Symposium on
Conference_Location :
Durham, NC
Print_ISBN :
0-8186-8177-2
DOI :
10.1109/RELDIS.1997.632799