Title :
Fault-tolerant task management and load re-distribution on massively parallel hypercube systems
Author :
Ahmad, Ishfaq ; Ghafoor, Arif
Author_Institution :
Sch. of Comput. & Inf. Sci., Syracuse Univ., NY, USA
Abstract :
The authors present a scheme for managing real-time task allocation and load redistribution with fault tolerance on hypercube systems. A set of processors, called fault-control processors (FCPs), keep duplicate copies of tasks and reallocate those tasks if their original processors fail. Two-level task redundancy is achieved by grouping the FCPs as primary and secondary for each processor. The proposed scheme provides a high degree of fault tolerance because each FCP is itself monitored by other FCPs. Assuming a failure-repair system environment, the performance of the proposed strategy is evaluated through simulation experiments and compared with a fault-free environment for 256-node and 512-node hypercubes. The authors also introduce a measure of goodness, success probability, which represents the probability that reallocated tasks meet their deadlines despite processor failures. It is shown that, using the proposed scheme, a large percentage of the rescheduled tasks can still meet their deadlines. The probability of a task being lost altogether, due to multiple failures, is shown to be extremely low.
Keywords :
fault tolerant computing; hypercube networks; parallel architectures; real-time systems; resource allocation; failure-repair system; fault-control processors; fault-tolerance; hypercube systems; load redistribution; real-time task allocation; success probability; task redundancy; Concurrent computing; Dynamic scheduling; Engineering management; Fault tolerance; Fault tolerant systems; Hypercubes; Large-scale systems; Load management; Real time systems; Timing
Conference_Title :
Supercomputing '92., Proceedings
Conference_Location :
Minneapolis, MN
Print_ISBN :
0-8186-2630-5
DOI :
10.1109/SUPERC.1992.236688