Title :
Efficient Resubmission Strategies to Design Robust Grid Production Environments
Author :
Lingrand, Diane ; Montagnat, Johan
Author_Institution :
CNRS, Univ. of Nice - Sophia Antipolis, Sophia Antipolis, France
Abstract :
Production grids exhibit high failure rates hampering the development of many large-scale scientific applications. End users require robust experiment production environments ensuring efficient resubmission of failed tasks. Proper parameterization of resubmission strategies is a complex problem that depends on the non-stationary workload conditions experienced by the infrastructure. In order to determine optimal resubmission parameters, probabilistic models of the overhead experienced by grid jobs are defined, taking into account the distribution of faults as measured on the infrastructure. Two strategies that can be implemented on the client side are proposed. Their models are evaluated under variable workload conditions to assess their validity over time. Their results are compared, and a trade-off between usability and model accuracy is discussed.
Keywords :
fault tolerance; grid computing; production engineering computing; experiment production environment; large scale scientific application; nonstationary workload condition; optimal resubmission parameter; probabilistic model; production grid; resubmission strategy; robust grid production environment; Computational modeling; Delay; Equations; Mathematical model; Monitoring; Probabilistic logic; Production; Fault tolerance; Grid computing; Probabilistic modeling;
Conference_Title :
e-Science (e-Science), 2010 IEEE Sixth International Conference on
Conference_Location :
Brisbane, QLD
Print_ISBN :
978-1-4244-8957-2
Electronic_ISBN :
978-0-7695-4290-4
DOI :
10.1109/eScience.2010.11