Title :
Automatic checkpointing based fault tolerance in computational grid
Author :
Babu, Ch Ratna ; Rao, C. D. V. Subba
Author_Institution :
Dept. of CSE, JNTU Kakinada, Kakinada, India
Abstract :
Although technology changes quickly, more sophisticated computational techniques are still needed to sustain it. Workload logs from the majority of computational grids show that node or job failure is the major challenge to deal with. Very robust scheduling algorithms are used to handle varied resource allocation in computational grids. However, previous studies leave a need for an approach that remedies failures and delays in job execution with respect to resource availability, and that handles both scheduling and efficient failure handling in any large-scale, high-performance computational application. Consequently, the major issue addressed here is fault tolerance, that is, tolerating failures with regard to both job scheduling and an efficient failure-handling mechanism. Synchronization is therefore needed to combine the two techniques. The fault tolerance techniques most frequently used in the majority of computational applications are periodic job checkpointing and replication. However, most job checkpointing techniques are not based on the scheduling algorithm. This work presents an automated checkpointing strategy for computational grids based on different scheduling algorithms. Experimental results show that the proposed automated job checkpointing, based on a fault-tolerant scheduling strategy, achieves a considerable improvement over conventional adaptive checkpointing algorithms.
Keywords :
checkpointing; grid computing; parallel processing; resource allocation; scheduling; software fault tolerance; synchronisation; adaptive checkpointing algorithm; automated checkpointing strategy; automatic checkpointing based fault tolerance; computational grid; computational techniques; failure handling mechanism; fault tolerant scheduling strategy; high performance computational application; job checkpointing techniques; job failure; job scheduling; periodic job checkpointing; resource allocation; resource availability; scheduling algorithm; synchronization; Checkpointing; Fault tolerance; Fault tolerant systems; Kernel; Scheduling algorithms; Torque; Automatic checkpointing; Computational Grid; job scheduling; node failure; replication;
Conference_Title :
Computing, Management and Telecommunications (ComManTel), 2014 International Conference on
Conference_Location :
Da Nang
Print_ISBN :
978-1-4799-2904-7
DOI :
10.1109/ComManTel.2014.6825575