Conference record number :
3537
Article title :
A Genetic-based Optimal Checkpoint Placement Strategy for Multicore Processors
Author/Authors :
Atieh Lotfi, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran; Saeed Safari, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran
Keywords :
Genetic Algorithm , Multicore Architectures , Optimal Checkpoint Placement , Fault Tolerance
Conference title :
16th International Symposium on Computer Architecture and Digital Systems
English abstract :
Multicore processors are increasingly being
deployed in high-performance computing systems. As system
complexity increases, the probability of failure
increases substantially, so such systems require techniques
that support fault tolerance. Checkpointing is widely
used to reduce the execution time of long-running programs in
the presence of failures and to enhance the reliability of such
systems. Optimizing the number of checkpoints in a parallel
application running on a multicore processor is a complicated
and challenging task. Infrequent checkpointing results in long
reprocessing time after a failure, while overly short checkpointing intervals lead to
high checkpointing overhead. Since this is a multi-objective
optimization problem, a search can easily become trapped in local
optima. Bio-inspired algorithms, on the other hand, are
powerful function optimizers that are successfully used to solve
problems in many different areas. In this paper, we apply
a genetic algorithm, a well-known bio-inspired computing
algorithm, to find an optimal checkpoint placement in parallel
applications. Under certain fault conditions, this new
checkpoint placement strategy outperforms existing ones with
a significant reduction in total wasted time. Our
experimental results show that the method, which can be
implemented on any message-passing multicore system,
can find optimal points at which checkpoints should be
taken.
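
The trade-off described in the abstract (long reprocessing with sparse checkpoints, high overhead with dense ones) can be illustrated with a toy genetic algorithm. The sketch below is not the authors' method: it assumes a single sequential task split into equal slots, a fixed per-checkpoint cost, and the classic first-order rework estimate (a segment of length s fails with probability about fail_rate * s and loses about s / 2 on average). The names wasted_time and evolve, the bit-vector encoding, and all parameter values are hypothetical choices made for illustration. Under this cost model, Young's approximation sqrt(2 * ckpt_cost / fail_rate) gives a reference interval to compare against.

```python
import math
import random


def wasted_time(bits, slot_len, ckpt_cost, fail_rate):
    """Toy cost: checkpoint overhead plus first-order expected rework.

    bits[i] == 1 means a checkpoint is taken after slot i (assumed encoding).
    """
    overhead = ckpt_cost * sum(bits)
    rework = 0.0
    seg = 0.0
    for b in list(bits) + [1]:  # the trailing 1 closes the final segment
        seg += slot_len
        if b:
            # P(failure in segment) ~ fail_rate * seg, expected loss ~ seg / 2
            rework += fail_rate * seg * seg / 2.0
            seg = 0.0
    return overhead + rework


def evolve(n_slots=100, slot_len=1.0, ckpt_cost=2.0, fail_rate=0.01,
           pop_size=40, generations=200, p_mut=0.01, seed=0):
    """Minimal GA: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n_bits = n_slots - 1  # one candidate checkpoint between adjacent slots

    def cost(ind):
        return wasted_time(ind, slot_len, ckpt_cost, fail_rate)

    # Sparse random placements plus the no-checkpoint baseline.
    pop = [[0] * n_bits] + [
        [1 if rng.random() < 0.05 else 0 for _ in range(n_bits)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=cost)
        nxt = [pop[0][:], pop[1][:]]  # elitism: carry over the two best
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 2), key=cost)  # binary tournament
            b = min(rng.sample(pop, 2), key=cost)
            cut = rng.randrange(1, n_bits)         # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    best = min(pop, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    best, best_cost = evolve()
    positions = [i + 1 for i, bit in enumerate(best) if bit]
    print("checkpoints after slots:", positions)
    print("estimated wasted time:", best_cost)
    print("Young's reference interval:", math.sqrt(2 * 2.0 / 0.01))
```

Because the two best individuals are copied unchanged into each generation, the returned placement is never worse than the no-checkpoint baseline included in the initial population. A model of a real message-passing application would also need to account for inter-process dependencies and consistent global states, which this single-task sketch ignores.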