• DocumentCode
    2632777
  • Title
    On integrating error detection into a fault diagnosis algorithm for massively parallel computers
  • Author
    Altmann, Jörn ; Bartha, Tamas ; Pataricza, András
  • Author_Institution
    Dept. of Comput. Sci., Erlangen-Nurnberg Univ., Germany
  • fYear
    1995
  • fDate
    24-26 Apr 1995
  • Firstpage
    154
  • Lastpage
    164
  • Abstract
    Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. We introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques, such as ⟨I'm alive⟩ messages, and built-in hardware mechanisms. The structure of the implemented algorithm is presented and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the error detection mechanism based on ⟨I'm alive⟩ messages, on application performance.
  • Keywords
    computer testing; distributed algorithms; error detection; fault location; multiprocessing systems; parallel machines; I'm alive messages; application performance; built-in hardware mechanisms; error detection integration; event-driven distributed system-level diagnosis algorithm; fault diagnosis algorithm; fault localization; fault tolerance mechanisms; general diagnosis model; large massively parallel multiprocessor systems; massively parallel computers; messages; program modules; scalable fault diagnosis; simultaneously existing faults; test results; Application software; Clustering algorithms; Computer errors; Concurrent computing; Fault detection; Fault diagnosis; Fault tolerant systems; Hardware; Instruments; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Performance and Dependability Symposium, 1995. Proceedings., International
  • Conference_Location
    Erlangen
  • Print_ISBN
    0-8186-7059-2
  • Type
    conf
  • DOI
    10.1109/IPDS.1995.395836
  • Filename
    395836