DocumentCode
244407
Title
Grid-Oriented Process Clustering System for Partial Message Logging
Author
Jitsumoto, Hideyuki ; Todoroki, Yuki ; Ishikawa, Yozo ; Sato, Mitsuhisa
Author_Institution
Inf. Technol. Center, Univ. of Tokyo, Tokyo, Japan
fYear
2014
fDate
23-26 June 2014
Firstpage
714
Lastpage
719
Abstract
In a computer cluster composed of many nodes, the mean time between failures becomes shorter as the number of nodes increases. As a result, long-running tasks may be unable to complete because failures interrupt them. Fault tolerance has therefore become an essential part of high-performance computing. Partial message logging groups processes into clusters, coordinates checkpoints within each cluster, and logs only the messages exchanged between clusters. Our study proposes a system with two features to improve the efficiency of partial message logging: 1) the communication log used for clustering is recorded at runtime, and 2) a graph partitioning algorithm reduces the complexity of the system by geometrically partitioning a grid graph. The proposed system is evaluated by executing a scientific application. The process clustering results are compared with those of existing methods in terms of clustering performance and quality.
Keywords
checkpointing; computational complexity; fault tolerant computing; graph theory; grid computing; message passing; natural sciences computing; parallel processing; checkpoints; clustering performance; communication log; complexity reduction; computer cluster; fault tolerance; graph partitioning algorithm; grid graph geometric partitioning; grid-oriented process clustering system; high-performance computing; mean time between failure; partial message logging; scientific application; Computational complexity; Fault tolerance; Fault tolerant systems; Partitioning algorithms; Runtime; Three-dimensional displays; Topology; fault tolerance; graph partition; message logging;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location
Atlanta, GA
Type
conf
DOI
10.1109/DSN.2014.72
Filename
6903630