Title :
A checkpointing/recovery system for MPI applications on cluster of IA-64 computers
Author :
Zhang, Youhui ; Xue, Ruini ; Wong, Dongsheng ; Zheng, Weimin
Author_Institution :
Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
Abstract :
As clusters continue to grow in size and popularity, fault tolerance and reliability become limiting factors for application scalability and system availability. To address these issues, we design and implement ChaRM64 for MPI, a high-availability parallel run-time system that provides checkpoint-based rollback recovery and migration for MPI programs on a cluster of IA-64 computers. Our approach integrates MPICH with a user-level, single-process checkpoint/recovery library for IA-64 Linux, and modifies the P4 libraries to implement a coordinated checkpointing and rollback recovery (CRR) and migration mechanism for parallel applications. In addition, CRR of file operations is supported. Testing shows that the CRR mechanism introduces negligible performance overhead in our implementation.
Keywords :
Linux; application program interfaces; checkpointing; fault tolerant computing; message passing; parallel programming; software libraries; workstation clusters; ChaRM64; IA-64 Linux; IA-64 computer cluster; MPI applications; MPI programs; P4 libraries; application scalability; fault tolerance; file operations; migration system; parallel applications; parallel run-time system; recovery system; reliability; rollback recovery; system availability; user-level library; Application software; Availability; Computer architecture; Computer network reliability; Computer networks; Concurrent computing; Fault tolerant systems; Libraries
Conference_Title :
Parallel Processing, 2005. ICPP 2005 Workshops. International Conference Workshops on
Print_ISBN :
0-7695-2381-1
DOI :
10.1109/ICPPW.2005.5