DocumentCode :
1854074
Title :
Checkpointing Process Groups in a Grid Environment
Author :
Mehnert-Spahn, John ; Schottner, Michael ; Morin, Christine
Author_Institution :
Dept. of Comput. Sci., Heinrich-Heine Univ., Duesseldorf
fYear :
2008
fDate :
1-4 Dec. 2008
Firstpage :
243
Lastpage :
251
Abstract :
The EU-funded XtreemOS project implements a grid operating system that transparently exploits the resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires saving and restoring jobs executing in a distributed, heterogeneous grid environment. Such an environment may span millions of grid nodes (PCs, clusters, and mobile devices) that use different system-specific checkpointers to save and restore the application and kernel data structures of processes executing on a grid node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid and the system-specific checkpointers. We then discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.
Keywords :
checkpointing; data structures; grid computing; software architecture; Linux control groups; POSIX interface; XtreemOS grid checkpointing architecture; checkpointing process; distributed heterogeneous grid environment; kernel data structures; resource isolation; virtual organizations; Checkpointing; Computer science; Kernel; Linux; Middleware; Operating systems; Personal communication networks; Power system management; Power system security; Resource management; fault tolerance
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Computing, Applications and Technologies, 2008. PDCAT 2008. Ninth International Conference on
Conference_Location :
Otago
Print_ISBN :
978-0-7695-3443-5
Type :
conf
DOI :
10.1109/PDCAT.2008.14
Filename :
4710987