Title :
Achieving low-overhead fault tolerance for parallel accelerators with dynamic partial reconfiguration
Author :
Davis, Jeffery Jonathan ; Cheung, Peter Y. K.
Author_Institution :
Dept. of Electr. & Electron. Eng., Imperial Coll. London, London, UK
Abstract :
While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of fault tolerance will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many commonly implemented operators are suited to modification resulting in datapath error detection capabilities with low area overheads. FPGAs are uniquely placed to allow further area savings to be made when incorporating fault avoidance mechanisms thanks to their dynamic reconfigurability. In this paper, we examine the practicalities and costs involved in implementing hardware-software fault tolerance on a test platform: a parallel matrix multiplication accelerator in hardware, with controller in software, running on a Xilinx Zynq system-on-chip. A combination of 'bolt-on' error detection logic and software-triggered routing reconfiguration serve to provide low-overhead datapath fault tolerance at runtime. Rapid yet accurate fault diagnoses along with low hardware (area), software (configuration storage) and performance penalties are achieved.
Keywords :
error detection; fault tolerance; field programmable gate arrays; integrated circuit reliability; matrix multiplication; system-on-chip; FPGA; Xilinx Zynq system-on-chip; bolt-on error detection logic reconfiguration; count-per-device expansion; datapath error detection capability; design-time consideration; dynamic partial reconfiguration; dynamic reconfigurability; electronic systems; fault avoidance mechanisms; fault susceptibility; hardware-software fault tolerance; low-overhead datapath fault tolerance; parallel matrix multiplication accelerator; software-triggered routing reconfiguration; test platform; transistor shrinkage; Digital signal processing; Random access memory; Redundancy; Tunneling magnetoresistance
Conference_Title :
Field Programmable Logic and Applications (FPL), 2014 24th International Conference on
Conference_Location :
Munich
DOI :
10.1109/FPL.2014.6927447