Automatic checkpointing based fault tolerance in computational grid

Author

Babu, Ch Ratna ; Rao, C. D. V. Subba

Author_Institution

Dept. of CSE, JNTU Kakinada, Kakinada, India

fYear

2014

fDate

27-29 April 2014

Firstpage

41

Lastpage

45

Abstract

Although technology changes quickly, more sophisticated computational techniques are still needed to support it. Workload logs from the majority of computational grids show that node and job failures are the major challenge to deal with. Very robust scheduling algorithms are used to handle varied resource allocation in computational grids. However, previous studies leave a need to remedy job failures and execution delays with respect to resource availability in a way that handles both scheduling and efficient failure handling in large-scale, high-performance computational applications. Consequently, the major issue addressed here is fault tolerance: tolerating failures with regard to both job scheduling and an efficient failure-handling mechanism. Synchronization is therefore needed to combine the two techniques. The fault-tolerance techniques most commonly used in mainstream computational applications are periodic job checkpointing and replication. Most job checkpointing techniques, however, are not based on the scheduling algorithm. This work presents an automated checkpointing strategy for computational grids based on different scheduling algorithms. Experimental results show that the proposed automated checkpointing of jobs, based on a fault-tolerant scheduling strategy, achieves considerable improvement over conventional adaptive checkpointing algorithms.

Keywords

checkpointing; grid computing; parallel processing; resource allocation; scheduling; software fault tolerance; synchronisation; adaptive checkpointing algorithm; automated checkpointing strategy; automatic checkpointing based fault tolerance; computational grid; computational techniques; failure handling mechanism; fault tolerant scheduling strategy; high performance computational application; job checkpointing techniques; job failure; job scheduling; periodic job checkpointing; resource allocation; resource availability; scheduling algorithm; synchronization; Checkpointing; Fault tolerance; Fault tolerant systems; Kernel; Scheduling algorithms; Torque; Automatic checkpointing; Computational Grid; job scheduling; node failure; replication;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computing, Management and Telecommunications (ComManTel), 2014 International Conference on

Conference_Location

Da Nang

Print_ISBN

978-1-4799-2904-7

Type

conf

DOI

10.1109/ComManTel.2014.6825575

Filename

6825575