DocumentCode :
2632662
Title :
Assessing the effects of communication faults on parallel applications
Author :
Carreira, João ; Madeira, Henrique ; Silva, João Gabriel
Author_Institution :
Dept. de Engenharia Inf., Univ. de Coimbra, Portugal
fYear :
1995
fDate :
24-26 Apr 1995
Firstpage :
214
Lastpage :
223
Abstract :
This paper addresses the problem of injecting faults into the communication system of disjoint memory parallel computers. It presents fault injection results showing that 5% to 30% of the faults injected in the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. In all these cases it would be virtually impossible to detect that the benchmark output was erroneous, as the size of the results file was plausible and no system errors had been detected. This emphasizes the need for fault tolerant techniques in parallel systems in order to achieve confidence in the application results. This is especially true in massively parallel computers, as the probability of faults occurring increases with the number of processing nodes. Moreover, in disjoint memory computers, which are the most popular and scalable parallel architecture, the communication subsystem plays an important role and is also very prone to errors. CSFI (Communication Software Fault Injector) is a versatile tool for injecting communication faults in parallel computers. CSFI uses software to directly emulate communication faults and spurious messages generated by non-fail-silent nodes, allowing the evaluation of the impact of faults in parallel systems and the assessment of fault tolerant techniques. The use of CSFI is nearly transparent to the target application, as it requires only minor adaptations. Deterministic faults of different natures can be injected without user intervention, and fault injection results are collected automatically by CSFI.
Keywords :
fault tolerant computing; parallel architectures; performance evaluation; Communication Software Fault Injector; benchmark output; scalable parallel architecture; communication faults; disjoint memory parallel computers; fault injection; fault tolerant techniques; massively parallel computers; parallel applications; parallel systems; Application software; Circuit faults; Computer applications; Computer errors; Concurrent computing; Fault tolerant systems; Integrated circuit interconnections; Message passing; Parallel architectures; Parallel processing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Performance and Dependability Symposium, 1995. Proceedings., International
Conference_Location :
Erlangen
Print_ISBN :
0-8186-7059-2
Type :
conf
DOI :
10.1109/IPDS.1995.395830
Filename :
395830