Abstract :
When developing networked or distributed systems, network monitoring is becoming an essential facility for controlling and managing their performance and quality of service. As these networks rapidly scale up, distributed monitoring schemes based on a hierarchy of monitoring managers have been proposed and used. However, when a monitoring manager fails, the network elements it manages are not polled continuously and correctly until the manager is repaired. To address this problem, this paper proposes an efficient fault-tolerance scheme for monitoring managers that exploits their hierarchical structure. The scheme keeps failure detection overhead low because each monitoring manager periodically sends a manager advertisement message only to its immediate super manager. Even if several managers crash concurrently, the scheme allows their immediate super managers to take over for them, which minimizes the number of live managers affected by the failures. Moreover, once failed managers have recovered, the scheme allows them to resume their pre-failure roles immediately, restoring the overall monitoring performance degraded by the failures.
Keywords :
computer network management; quality of service; system recovery; telecommunication network reliability; failure detection; hierarchical distributed monitoring; monitoring manager fault-tolerance; network monitoring; recovery scheme; Computer crashes; Computer science; Computerized monitoring; Condition monitoring; Control systems; Fault tolerance; Grid computing; Information management; Peer to peer computing; Telecommunication traffic
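
Illustrative sketch :
The behavior summarized in the abstract (advertisements sent only to the immediate super manager, takeover on a missed advertisement, and role restoration on recovery) can be sketched roughly as follows. This is a minimal simulation under stated assumptions, not the paper's actual protocol: the class name Manager, the timeout of three advertisement intervals, the explicit `now` timestamps, and the idea that a super manager records each sub-manager's element list at registration are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

ADVERT_INTERVAL = 1.0          # assumed seconds between advertisements
TIMEOUT = 3 * ADVERT_INTERVAL  # assumed: silence longer than this means failure


@dataclass
class Manager:
    name: str
    elements: Set[str] = field(default_factory=set)  # elements this manager polls
    parent: Optional["Manager"] = None               # immediate super manager
    alive: bool = True
    last_seen: Dict[str, float] = field(default_factory=dict)       # child -> last advert time
    known_elements: Dict[str, Set[str]] = field(default_factory=dict)  # child -> its elements
    adopted: Dict[str, Set[str]] = field(default_factory=dict)      # failed child -> taken-over elements

    def add_child(self, child: "Manager", now: float) -> None:
        """Register a sub-manager and remember which elements it polls."""
        child.parent = self
        self.last_seen[child.name] = now
        self.known_elements[child.name] = set(child.elements)

    def advertise(self, now: float) -> None:
        """Send an advertisement only to the immediate super manager."""
        if self.alive and self.parent is not None:
            self.parent.receive_advert(self.name, now)

    def receive_advert(self, child_name: str, now: float) -> None:
        """Record the advertisement; a recovered child gets its role back."""
        self.last_seen[child_name] = now
        self.adopted.pop(child_name, None)

    def check_children(self, now: float) -> None:
        """Take over the elements of any child that has been silent too long."""
        for child_name, seen in self.last_seen.items():
            if child_name not in self.adopted and now - seen > TIMEOUT:
                self.adopted[child_name] = set(self.known_elements[child_name])

    def polled(self) -> Set[str]:
        """Elements currently polled by this manager, including adopted ones."""
        result = set(self.elements)
        for elems in self.adopted.values():
            result |= elems
        return result
```

In this sketch only the super manager of a crashed manager changes its behavior; sibling managers are unaffected, which mirrors the abstract's goal of limiting how many live managers a failure touches. How the sub-managers of a failed manager are re-parented is left out here because the abstract does not describe it.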