Title :
How fail-stop are faulty programs?
Author :
Chandra, S. ; Chen, P.M.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Michigan Univ., MI, USA
Abstract :
Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether program failures actually follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that can trace how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant number of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.
Keywords :
relational databases; software fault tolerance; system recovery; transaction processing; virtual machines; Postgres database; SimOS; complete-machine simulator; experiment; fail-stop model; fault-tolerant systems; faulty programs; halt-on-failure; Application software; Computer bugs; Computer science; Condition monitoring; Fault detection; Kernel; Software systems; System software; Transaction databases; Workstations
Conference_Title :
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location :
Munich, Germany
Print_ISBN :
0-8186-8470-4
DOI :
10.1109/FTCS.1998.689475