Title :
A user-level library for fault tolerance on shared memory multicore systems
Author :
Mushtaq, Hamid ; Al-Ars, Zaid ; Bertels, Koen
Author_Institution :
Comput. Eng. Lab., Delft Univ. of Technol., Delft, Netherlands
Abstract :
The ever-decreasing transistor size has made it possible to integrate multiple cores on a single die. On the downside, this has introduced reliability concerns, as smaller transistors are more prone to both transient and permanent faults. However, the abundant extra processing resources of a multicore system can be exploited to provide fault tolerance through redundant execution. We have designed a library for multicore processing that can make a multithreaded user-level application fault tolerant with simple modifications to the code. It uses the abundant cores found in the system to perform redundant execution for error detection, and it also allows recovery through checkpoint/rollback. Our library is portable, since it does not depend on any special hardware. Furthermore, the overhead that our library adds to the original application (up to 46% for 4 threads) is lower than that of other existing approaches, such as Respec.
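The combination of redundant execution for error detection and checkpoint/rollback for recovery can be illustrated with a minimal sketch. This is not the library's API, which the abstract does not describe. The names `run_with_redundancy`, `accumulate`, `flip_low_bit`, `fault_at`, and `corrupt` are hypothetical, the two replicas run one after the other rather than on separate cores, and the transient fault is simulated by corrupting one replica's result once.

```python
import copy

def run_with_redundancy(step, state, n_steps, fault_at=None, corrupt=None):
    """Run `step` on two replicas, compare results, roll back on mismatch.

    Hypothetical illustration only: `fault_at` and `corrupt` simulate a
    single transient fault in one replica at the given step index.
    """
    checkpoint = copy.deepcopy(state)  # last state both replicas agreed on
    rollbacks = 0
    fault_pending = fault_at
    i = 0
    while i < n_steps:
        # Both replicas start from the same checkpointed state.
        leader = step(copy.deepcopy(checkpoint), i)
        follower = step(copy.deepcopy(checkpoint), i)
        if fault_pending is not None and fault_pending == i:
            follower = corrupt(follower)  # simulated transient fault
            fault_pending = None          # transient: it happens only once
        if leader != follower:
            # Error detected: discard both results and re-execute this
            # step from the checkpoint (rollback).
            rollbacks += 1
            continue
        checkpoint = leader  # replicas agree: commit a new checkpoint
        i += 1
    return checkpoint, rollbacks

def accumulate(state, i):
    """Toy workload: add the square of the step index to an accumulator."""
    state["acc"] += i * i
    return state

def flip_low_bit(state):
    """Simulated bit flip in one replica's result."""
    state["acc"] ^= 1
    return state

final, rollbacks = run_with_redundancy(
    accumulate, {"acc": 0}, 5, fault_at=2, corrupt=flip_low_bit
)
print(final, rollbacks)
```

In this sketch a checkpoint is taken after every step for simplicity; a real system would likely checkpoint less often to limit overhead, at the cost of re-executing more work after a rollback.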
Keywords :
checkpointing; fault tolerant computing; libraries; multi-threading; redundancy; shared memory systems; checkpoint-rollback; error detection; multicore processing; multithreaded user-level application fault tolerance; redundant execution; reliability concerns; shared memory multicore systems; user-level library; Benchmark testing; Fault tolerance; Fault tolerant systems; Instruction sets; Libraries; Memory management; Multicore processing
Conference_Title :
Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2012 IEEE 15th International Symposium on
Conference_Location :
Tallinn
Print_ISBN :
978-1-4673-1187-8
Electronic_ISBN :
978-1-4673-1186-1
DOI :
10.1109/DDECS.2012.6219071