Title :
Checkpointing and Recovery Mechanism in Grid
Author :
Mehta, Janki ; Chaudhary, Sanjay
Author_Institution :
Dhirubhai Ambani Inst. of Technol. & Sci., Gandhinagar
Abstract :
A grid is a collection of distributed computing resources that work in coordination to achieve high-end computational capability by dividing a given task into sub-tasks. Each sub-task can be large and may run for several hours or days on a number of grid nodes. If a sub-task fails to complete, even at a single site, all the computations must be performed again. In scalable distributed systems, an individual component failure usually does not cause the entire system to fail, but the probability of a component failure rises rapidly as the number of components in the system increases. As the system grows in size, an efficient recovery mechanism therefore becomes essential for highly parallel, mission-critical, long-running applications in a grid environment. This paper presents a checkpoint-based recovery mechanism for grid service failures that result in task or transaction failures in a computational or data grid, so that computations need not be restarted from scratch. The work helps preserve two main objectives of a grid, namely optimal resource utilization and speedy computation, by using resources to improve system performance rather than engaging them in rollbacks that result from cascading aborts. The checkpointing mechanism also addresses recovery from system failures that bring down running services and the computational tasks or transactions being executed. The state saved in checkpoints can also be used for job migration by grid job schedulers.
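The general idea of checkpoint-based recovery described in the abstract can be illustrated with a minimal sketch. This is not the authors' mechanism: the function names (`save_checkpoint`, `load_checkpoint`, `run_subtask`), the JSON checkpoint file, the `interval` parameter, and the simulated failure are all assumptions made for illustration on a single node.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the new file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "partial_sum": 0}
    with open(path) as handle:
        return json.load(handle)


def run_subtask(items, path, interval=10, fail_at=None):
    """Sum items, checkpointing every `interval` items (hypothetical sub-task).

    `fail_at` simulates a service failure at a given index; it is an
    illustration aid, not part of any real grid API.
    """
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated grid service failure")
        state["partial_sum"] += items[index]
        state["next_index"] = index + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["partial_sum"]
```

Calling `run_subtask` again with the same checkpoint path after a failure resumes from the last saved index instead of from the beginning. In principle, the same saved file could be shipped to another node to migrate the job, which is the use the abstract suggests for job schedulers.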
Keywords :
checkpointing; grid computing; operating systems (computers); scheduling; system recovery; computational grid; data grid; distributed computing resources; grid service failure; job migration; job schedulers; optimal resource utilization; recovery mechanism; scalable distributed systems; speedy computations; Computer architecture; Computer networks; Distributed computing; Hardware; High performance computing; Processor scheduling; Resource management; Web services
Conference_Title :
Advanced Computing and Communications, 2008. ADCOM 2008. 16th International Conference on
Conference_Location :
Chennai
Print_ISBN :
978-1-4244-2962-2
Electronic_ISBN :
978-1-4244-2963-9
DOI :
10.1109/ADCOM.2008.4760439