DocumentCode
2632777
Title
On integrating error detection into a fault diagnosis algorithm for massively parallel computers
Author
Altmann, Jörn ; Bartha, Tamas ; Pataricza, András
Author_Institution
Dept. of Comput. Sci., Erlangen-Nurnberg Univ., Germany
fYear
1995
fDate
24-26 Apr 1995
Firstpage
154
Lastpage
164
Abstract
Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. We introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques such as ⟨I'm alive⟩ messages and built-in hardware mechanisms. The structure of the implemented algorithm is presented and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the error detection mechanism based on ⟨I'm alive⟩ messages, on application performance.
Keywords
computer testing; distributed algorithms; error detection; fault location; multiprocessing systems; parallel machines; I'm alive messages; application performance; built-in hardware mechanisms; error detection integration; event-driven distributed system-level diagnosis algorithm; fault diagnosis algorithm; fault localization; fault tolerance mechanisms; general diagnosis model; large massively parallel multiprocessor systems; massively parallel computers; messages; program modules; scalable fault diagnosis; simultaneously existing faults; test results; Application software; Clustering algorithms; Computer errors; Concurrent computing; Fault detection; Fault diagnosis; Fault tolerant systems; Hardware; Instruments; Scalability
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Performance and Dependability Symposium, 1995. Proceedings., International
Conference_Location
Erlangen
Print_ISBN
0-8186-7059-2
Type
conf
DOI
10.1109/IPDS.1995.395836
Filename
395836