Abstract :
Summary form only given. As a side effect of semiconductor technology scaling, chips are becoming ever less reliable. Prominent reasons for this are the sheer number of transistors packed into a given silicon area and their shrinking device features. As a consequence, fault tolerance, provided for example through various redundancy schemes, has to be applied more often, while its power and performance costs grow disproportionately. To make matters worse, chip power density is becoming a significant limiting factor for performance and for SoC design in general, as it caps the number of transistors that can be active at once per unit of chip area. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overhead in future SoCs. Attempting to design and manufacture a totally fault-free system would heavily, or even prohibitively, impact the design, manufacturing, and testing costs, as well as the performance and power consumption of the system. In this tutorial, we will address the above challenges and discuss new design techniques for more efficient, adaptive fault-tolerant SoCs. We will describe a SoC architecture which uses a small fraction of the chip, guaranteed reliable by manufacturing, to manage the remaining unreliable SoC resources. We will further discuss how the flexibility of reconfigurable hardware can be used to tolerate (permanent) chip defects and ageing faults. We will also present novel approaches for dealing with transient and intermittent faults, as well as (software) runtime optimization mechanisms that adapt the system on demand to various fault types and densities, in order to improve system efficiency and facilitate graceful system degradation.
Finally, as an interesting case study, we will discuss the particular safety and system requirements of two cutting-edge medical devices and explain how the above design techniques can be applied to such highly demanding systems.
Keywords :
circuit optimisation; fault tolerant computing; integrated circuit design; integrated circuit manufacture; integrated circuit reliability; integrated circuit testing; performance evaluation; reconfigurable architectures; system-on-chip; SoC architecture; SoC resources; adaptive fault-tolerant SoCs; ageing faults; chip defects; chip power density; density on demand; fault tolerance; faultfree system; manufacturing costs; power consumption; reconfigurable hardware flexibility; redundancy schemes; reliable SoC design techniques; runtime optimization mechanisms; safety requirements; semiconductor technology scaling; system efficiency; system requirements; testing costs; transistor sheer number; Abstracts;