Title :
Kangaroo: Reliable Execution of Scientific Applications with DAG Programming Model
Author :
Zhang, Kai ; Chen, Kang ; Xue, Wei
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Abstract :
As high performance computing (HPC) systems grow in scale, the likelihood of component failure grows with them, and so does the need for fault-tolerant systems. However, current fault tolerance mechanisms, including replay, checkpointing, and redundant execution, do not scale well in large-scale scientific computing. Kangaroo is a reliable execution engine for scientific applications. Parallel programs are modeled as directed acyclic graphs (DAGs) and executed on clusters under a graph-theory-based scheduling policy. Kangaroo provides effective execution of scalable parallel programs and transparently tolerates failures at runtime. In this paper, we describe the implementation of the Kangaroo system, discuss the design of its scheduling and fault tolerance, and evaluate performance using a dense matrix inversion program. The results demonstrate that scheduling policies have a strong effect on program performance. They also demonstrate the feasibility and effectiveness of our approach to fault tolerance.
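The idea of modeling a program as a DAG and tolerating task failures by re-execution can be illustrated with a minimal sketch. This is not Kangaroo's implementation or API: the function `run_dag`, the `max_retries` parameter, the task names, and the simulated failure are all hypothetical, and a plain topological order from Python's standard `graphlib` module stands in for the paper's scheduling policy.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_retries=2):
    """Run tasks in dependency order, re-executing a task that fails.

    tasks: maps a task name to a callable (hypothetical interface).
    deps:  maps a task name to the list of tasks whose results it consumes.
    """
    results = {}
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        for attempt in range(max_retries + 1):
            try:
                results[name] = tasks[name](*inputs)
                break
            except RuntimeError:
                # Assumption: a failure surfaces as RuntimeError and the task
                # is safe to re-run because its inputs are still available.
                if attempt == max_retries:
                    raise
    return results


# Usage: task "b" fails once (a simulated node failure) and is re-executed.
calls = {"b": 0}


def flaky_b():
    calls["b"] += 1
    if calls["b"] == 1:
        raise RuntimeError("simulated node failure")
    return 3


tasks = {"a": lambda: 2, "b": flaky_b, "c": lambda x, y: x + y}
deps = {"a": [], "b": [], "c": ["a", "b"]}
results = run_dag(tasks, deps)
```

Because each task depends only on the recorded results of its predecessors, a failed task can be re-run in isolation without restarting the whole program, which is the property a DAG model makes easy to exploit.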
Keywords :
graph theory; matrix algebra; parallel processing; software fault tolerance; DAG; DAG programming model; HPC; Kangaroo; directed acyclic graph; fault tolerant systems; graph theory; high performance computing; matrix inversion program; parallel programs; reliable execution; scientific applications; Clustering algorithms; Fault tolerance; Fault tolerant systems; Hardware design languages; Heuristic algorithms; Programming; Parallel programming; directed acyclic graph; fault tolerance; scientific computing;
Conference_Title :
Parallel Processing Workshops (ICPPW), 2011 40th International Conference on
Conference_Location :
Taipei City
Print_ISBN :
978-1-4577-1337-8
Electronic_ISBN :
1530-2016
DOI :
10.1109/ICPPW.2011.28