Title :
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
Author :
Prvulovic, Milos ; Zhang, Zheng ; Torrellas, Josep
Author_Institution :
Illinois Univ., Urbana, IL, USA
fDate :
2002
Abstract :
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when errors occur as often as once per day.
Keywords :
error detection; performance evaluation; shared memory systems; system recovery; ReVive; architectural support; checkpointing; distributed parity protection; hardware cost; performance; rollback recovery; shared-memory multiprocessors; simulation results; Availability; Bit error rate; Checkpointing; Error correction; Hardware; Integrated circuit reliability; Performance loss; Power system reliability; Protection; Redundancy
Conference_Titel :
Computer Architecture, 2002. Proceedings. 29th Annual International Symposium on
Conference_Location :
Anchorage, AK
Print_ISBN :
0-7695-1605-X
DOI :
10.1109/ISCA.2002.1003567