Title :
The design and implementation of a fault-tolerant RPC system: Ninf-C
Author :
Nakada, Hidemoto ; Tanaka, Yoshio ; Matsuoka, Satoshi ; Sekiguchi, Satoshi
Author_Institution :
Nat. Inst. of Adv. Ind. Sci. & Technol., Tsukuba, Japan
Abstract :
We describe the design and implementation of a fault-tolerant GridRPC system, Ninf-C, designed for easy programming of large-scale master-worker programs that take from a few days to a few months to execute in a grid environment. Ninf-C employs Condor, developed at the University of Wisconsin, as the underlying middleware, which supports remote file transmission and checkpointing for system-wide robustness for application users on the grid. Ninf-C layers all the GridRPC communication and task-parallel programming features on top of Condor in a non-trivial fashion, assuming that the entire program is structured in a master-worker style; in fact, older Ninf master-worker programs can be run directly or trivially ported to Ninf-C. In contrast to the original Ninf, Ninf-C exploits and extends Condor features extensively for robustness and transparency, such as 1) checkpointing and stateful recovery of the master process, 2) the master and workers mutually communicating using (remote) files, not IP sockets, and 3) automated throttling of parallel GridRPC calls. In contrast to using Condor directly, programmers can set up complex dynamic workflows as well as master-worker parallel structures with almost no learning curve. To evaluate the robustness of the system, we performed an experiment on a heterogeneous cluster consisting of x86 and SPARC CPUs, running a simple but long-running master-worker program with staged rebooting of multiple nodes to simulate serious fault situations. The program finished normally despite all the fault scenarios, demonstrating the robustness of Ninf-C.
Keywords :
grid computing; middleware; parallel programming; remote procedure calls; software fault tolerance; task analysis; Condor; GridRPC system; Ninf-C; SPARC CPU; dynamic workflow; fault-tolerant RPC system; grid environment; heterogeneous cluster; large-scale master-worker programs; remote file transmission; remote procedure calls; task parallel programming; Checkpointing; Fault tolerant systems; Grid computing; High performance computing; Large-scale systems; Middleware; Parallel programming; Programming profession; Robustness; Sockets;
Conference_Title :
High Performance Computing and Grid in Asia Pacific Region, 2004. Proceedings. Seventh International Conference on
Print_ISBN :
0-7695-2138-X
DOI :
10.1109/HPCASIA.2004.1324011