Title :
Timely Error Detection for Effective Recovery in Light-Lockstep Automotive Systems
Author :
Hernandez, Carles ; Abella, Jaume
Author_Institution :
Barcelona Supercomputing Center, Barcelona, Spain
Abstract :
Safety-relevant systems in the automotive domain often implement features such as lockstep execution for error detection, and reset and re-execution for error correction. Light-lockstep has already been adopted in some such systems because of its relatively low implementation cost: it does not require deep changes to nonlockstep hardware. Because only off-core activities (i.e., data and addresses sent) need to be compared across cores, light-lockstep designs are minimally intrusive. This approach has been proven sufficient to guarantee functional correctness of the system in the presence of errors in the cores, in particular with respect to certification against safety standards such as ISO 26262 in the automotive domain. However, error detection in light-lockstep systems may occur long after the error actually occurs, thus jeopardizing timing guarantees, which are as critical as functional ones in hard real-time systems. In this paper, we analyze the timing behavior of errors due to transient and permanent faults in light-lockstep systems. Our results show that the time elapsed until an error is detected can be inordinately large, especially for permanent faults. Based on this observation and building upon the specific characteristics of light-lockstep systems, we propose lightly verbose (LiVe), a new mechanism that enforces the early detection of errors due to both transient and permanent faults, thus enabling the computation of tight error detection timing bounds. We also analyze how existing error recovery mechanisms for multicore systems become more effective when light-lockstep operates in LiVe mode in the context of mixed-criticality workloads.
Keywords :
automobiles; error correction; error detection; fault tolerant computing; multiprocessing systems; real-time systems; traffic engineering computing; ISO26262 standard; LiVe mode; error detection timing bounds; error recovery; functional correctness; hard real-time systems; light-lockstep automotive systems; light-lockstep designs; lightly verbose; mixed-criticality workloads; multicore systems; off-core activities; permanent faults; safety-relevant systems; timing behavior; transient faults; Automotive engineering; Circuit faults; Hardware; Registers; Safety; Timing; Transient analysis; Embedded systems; embedded test; fault diagnosis; testing; timing analysis
Journal_Title :
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on
DOI :
10.1109/TCAD.2015.2434958