DocumentCode :
2588656
Title :
Increasing register file immunity to transient errors
Author :
Memik, Gokhan ; Kandemir, Mahmut T. ; Ozturk, Ozcan
Author_Institution :
Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA
fYear :
2005
fDate :
7-11 March 2005
Firstpage :
586
Abstract :
Transient errors are a major cause of downtime in many systems. Prior research has largely neglected the register file, but because it is accessed very frequently, the probability of transient errors is high. These errors can quickly spread to other parts of the system and cause an application crash or silent data corruption. This paper addresses the reliability of register files in superscalar processors. We propose to duplicate actively used physical registers in unused physical registers. If the protection mechanism (parity or ECC) used for the primary copy indicates an error, the duplicate can provide the data, as long as it is not itself corrupted. We implement two strategies based on register duplication. In the "conservative strategy", we limit ourselves to the given register usage behavior and duplicate register contents only in otherwise unused registers. Consequently, there is no impact on the original performance when there is no error, except for the protection mechanism used for the primary copy. Experiments with two different versions of this strategy show that, with the more powerful conservative scheme, 78% of register accesses are to physical registers with duplicates. The "aggressive strategy" sacrifices some performance to increase the number of register accesses with duplicates. It does so by marking registers that have not been used for a long time as "dead" and using them to duplicate actively used registers. Experiments with this strategy indicate that it raises the fraction of reliable register accesses to 84% and degrades overall performance by only 0.21% on average.
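The duplication idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the class names `PhysReg` and `RegisterFile`, the single even-parity bit, and the first-free allocation policy are all hypothetical choices made for illustration. It models the conservative behavior, where a duplicate lives only in an otherwise unused register and is reclaimed as soon as that register is needed for a primary value.

```python
from dataclasses import dataclass
from typing import List, Optional


def parity(value: int) -> int:
    """Even-parity bit of a non-negative integer (assumed protection scheme)."""
    return bin(value).count("1") & 1


@dataclass
class PhysReg:
    value: int = 0
    parity_bit: int = 0
    in_use: bool = False                # True when holding a live primary value
    duplicate_of: Optional[int] = None  # index of the primary this register mirrors


class RegisterFile:
    """Hypothetical model: primaries are mirrored into otherwise unused registers."""

    def __init__(self, size: int) -> None:
        self.regs: List[PhysReg] = [PhysReg() for _ in range(size)]

    def _duplicate_index(self, idx: int) -> Optional[int]:
        for i, r in enumerate(self.regs):
            if r.duplicate_of == idx:
                return i
        return None

    def _find_unused(self) -> Optional[int]:
        for i, r in enumerate(self.regs):
            if not r.in_use and r.duplicate_of is None:
                return i
        return None

    def write(self, idx: int, value: int) -> None:
        reg = self.regs[idx]
        reg.value, reg.parity_bit = value, parity(value)
        # A register claimed for a primary value stops being anyone's duplicate.
        reg.in_use, reg.duplicate_of = True, None
        dup = self._duplicate_index(idx)
        if dup is None:
            dup = self._find_unused()
        if dup is not None:  # duplicate only if a spare register exists
            d = self.regs[dup]
            d.value, d.parity_bit, d.duplicate_of = value, reg.parity_bit, idx

    def read(self, idx: int) -> int:
        reg = self.regs[idx]
        if parity(reg.value) == reg.parity_bit:
            return reg.value
        # Primary copy failed its parity check: fall back to the duplicate.
        dup = self._duplicate_index(idx)
        if dup is not None:
            d = self.regs[dup]
            if parity(d.value) == d.parity_bit:
                return d.value
        raise RuntimeError("uncorrectable register error")


rf = RegisterFile(size=4)
rf.write(0, 0b1011)
rf.regs[0].value ^= 0b0001   # inject a single-bit flip into the primary copy
assert rf.read(0) == 0b1011  # parity mismatch detected; duplicate supplies the data
```

In this model a read is "reliable" in the abstract's sense only while a duplicate exists. Once every spare register has been claimed for a primary value, a corrupted read raises an error instead of recovering, which is the gap the aggressive strategy narrows by freeing long-unused registers.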
Keywords :
error correction; error statistics; file organisation; microprocessor chips; reliability; ECC; aggressive strategy; application crash; conservative strategy; data corruption; error probability; parity; register content duplication; register file; superscalar processors; transient error immunity; Computer crashes; Computer errors; Computer science; Degradation; Error correction; Error correction codes; Error probability; Packaging; Protection; Registers
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Design, Automation and Test in Europe, 2005. Proceedings
ISSN :
1530-1591
Print_ISBN :
0-7695-2288-2
Type :
conf
DOI :
10.1109/DATE.2005.181
Filename :
1395632