Title :
Detecting matrix multiplication faults in many-core systems
Author_Institution :
Fac. of Inf. Technol., UAE Univ., Al Ain, United Arab Emirates
Abstract :
Many-core systems are characterized by a large number of components built on ever-shrinking circuit geometries. System reliability becomes a concern because of system complexity, the large number of components, and the susceptibility of nanoscale circuits to soft errors. While information redundancy techniques can be used for fault tolerance, they consume considerable memory space and increase memory and network bandwidth demands. Moreover, resources in many-core systems are plentiful, which encourages the design of simple cores without hardware fault tolerance. Thus, in the absence of information redundancy, software fault detection techniques become necessary. Herein, we present fault detection techniques for 2×2 matrix multiplication, which we extend to n×n matrix multiplication. These tests can detect transient faults and some intermittent and permanent hardware faults. They are also suitable for computing grids and distributed heterogeneous systems, where the result-forming node may run the tests in software to validate the sub-results submitted by the grid nodes.
Keywords :
computational complexity; geometry; mathematics computing; matrix multiplication; multiprocessing systems; software fault tolerance; software reliability; circuit geometries; fault tolerance; information redundancy techniques; many core systems; matrix multiplication faults; software fault detection; system complexity; system reliability; Circuit faults; Fault tolerant systems; Hardware; Program processors; Redundancy; fault detection; many-core systems; parallel or distributed matrix multiplication;
Conference_Title :
Innovations in Information Technology (IIT), 2011 International Conference on
Conference_Location :
Abu Dhabi
Print_ISBN :
978-1-4577-0311-9
DOI :
10.1109/INNOVATIONS.2011.5893843