Title :
A scalable and efficient self-organizing failure detector for grid applications
Author :
Horita, Yuuki ; Taura, Kenjiro ; Chikayama, Takashi
Author_Institution :
Tokyo Univ., Japan
Abstract :
Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments, and these building blocks need to be scalable, reliable, and efficient in practice. As available resources grow larger and more widely distributed, it becomes more essential that they can be used easily, with little manual configuration, in grid environments, where connectivity between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration, so that each process is monitored by a small number of other processes, and they quickly disseminate notifications along the monitoring relationships when failures are detected. Through simulations and real experiments, we show that our failure detector has practical scalability, high reliability, and good efficiency. The overhead with 313 processes was at most 2 percent even when the heartbeat interval was set to 0.1 second, and correspondingly smaller with longer intervals.
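The abstract describes three ideas: each process is monitored by a small number of others, a missed heartbeat signals a failure, and notifications travel along the monitoring relationships. The following is a minimal illustrative sketch of those ideas, not the protocol from the paper. The function names, the random choice of monitors, the fixed timeout, and the breadth-first flooding are all assumptions made for illustration.

```python
import random
from collections import deque

def assign_monitors(processes, k, seed=0):
    """Assumption: each process gets k monitors chosen uniformly at random."""
    rng = random.Random(seed)
    monitors = {}
    for p in processes:
        others = [q for q in processes if q != p]
        monitors[p] = rng.sample(others, min(k, len(others)))
    return monitors

def suspected(last_heartbeat, now, timeout):
    """Return the processes whose last heartbeat is older than `timeout`."""
    return sorted(p for p, t in last_heartbeat.items() if now - t > timeout)

def flood(monitors, origin):
    """Spread a notification over monitoring links (treated as undirected).

    Returns a dict mapping each reached process to its hop count from origin.
    """
    neighbors = {p: set() for p in monitors}
    for target, watchers in monitors.items():
        for w in watchers:
            neighbors[target].add(w)
            neighbors.setdefault(w, set()).add(target)
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        p = queue.popleft()
        for q in neighbors.get(p, ()):
            if q not in hops:
                hops[q] = hops[p] + 1
                queue.append(q)
    return hops
```

For example, `suspected({"a": 0.0, "b": 0.95}, now=1.0, timeout=0.3)` flags only `"a"`, and `flood` shows how many hops a notification needs to reach each process from the detecting one.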
Keywords :
computer networks; failure analysis; fault diagnosis; fault tolerant computing; grid computing; groupware; transport protocols; distributed environments; failure detection protocol; grid applications; grid environments; group membership management; notification dissemination; process relationship monitoring; self-organizing failure detector; self-repairing systems; Computer crashes; Condition monitoring; Detectors; Environmental management; Fault detection; Heart beat; Libraries; Network address translation; Protocols; Scalability;
Conference_Title :
Grid Computing, 2005. The 6th IEEE/ACM International Workshop on
Print_ISBN :
0-7803-9492-5
DOI :
10.1109/GRID.2005.1542743