Title :
Handling Failures in Parallel Scientific Workflows Using Clouds
Author :
Costa, Francois ; de Oliveira, Daniel ; Ocala, Kary ; Ogasawara, Eduardo ; Dias, Joana ; Mattoso, Marta
Author_Institution :
COPPE, Fed. Univ. of Rio de Janeiro, Rio de Janeiro, Brazil
Abstract :
Failures are common in High Performance Computing (HPC) environments and can significantly impact the performance of scientific workflows executing on top of these large-scale computing environments. Computing clouds are being used as promising HPC environments. Although clouds offer several advantages, such as elasticity and availability, failures are very frequent in this type of environment, where virtualization, instabilities and providers' actions directly impact workflow execution. Activity failures are therefore almost inevitable in clouds, where virtual machine failures are a reality rather than a possibility. In this paper we present a set of failure handling heuristics based on cloud characteristics, which are implemented within SciMultaneous, a service-oriented architecture that manages re-executions of failed scientific workflow activities using runtime provenance data. Experimental results on clouds showed that SciMultaneous and its heuristics considerably increase workflow completion and reduce the total execution time (TET) of the workflow (even considering executions or re-executions) by up to 45% when compared to a posteriori re-execution approaches. We analyzed SciMultaneous' behavior under a series of activity failure types and concluded that even a single activity failure can have a large detrimental effect on scientific workflow TET.
Keywords :
cloud computing; operating systems (computers); parallel processing; service-oriented architecture; system recovery; virtual machines; HPC environments; SciMultaneous; TET; cloud characteristics; handling failures; high performance computing; large scale computing environments; parallel scientific workflows; total execution time; virtual machine failures; workflow execution; Failure handling; Scientific Workflows
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-6218-4
DOI :
10.1109/SC.Companion.2012.28