DocumentCode :
1802650
Title :
A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks
Author :
Banerjee, Dipyaman ; Madduri, Venkateswara R. ; Srivatsa, Mudhakar
Author_Institution :
Res. Lab., IBM India, New Delhi, India
fYear :
2009
fDate :
27-30 Sept. 2009
Firstpage :
246
Lastpage :
255
Abstract :
As the size of a centrally managed IP network increases, the cost of monitoring network devices and the number of reported events increase super-linearly. This in turn degrades the performance of the event correlation engine that is responsible for suppressing dependent events and escalating root cause events to a network administrator. To solve this scalability problem, we propose a distributed framework that partitions the network into smaller management domains and enables concurrent monitoring and event correlation in those domains. The gain in performance, however, comes with the challenge of correlating cross-domain events, which occur when a failure in one domain induces events in other domains. In this paper, we investigate such situations and show that, in the worst case, it is impossible to determine the root cause. We propose a two-step approach to solve this problem. First, we define a property called route-closure, which, if satisfied by every partition, not only minimizes the number of cross-domain events but also eliminates cases in which root cause analysis may be inconclusive. We also describe a technology-centric partitioning mechanism that constructs partitions satisfying the route-closure property. Next, we propose a distributed architecture to efficiently identify and correlate cross-domain events. We use a commercial network management system to implement our distributed framework and run experiments by injecting synthetic events on large, real network topologies. Our experimental results show that our approach can manage over 200,000 managed entities and handle event bursts of size 15,000 in under five minutes without compromising the efficacy of event correlation.
Keywords :
IP networks; routing protocols; cross-domain events; distributed monitoring; large IP networks; root cause analysis; route-closure; technology-centric partitioning mechanism; Condition monitoring; Costs; Databases; Degradation; Engines; IP networks; Network topology; Performance analysis; Probes; USA Councils; Event Correlation; Fault Diagnosis; Network Management;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Reliable Distributed Systems, 2009. SRDS '09. 28th IEEE International Symposium on
Conference_Location :
Niagara Falls, NY
ISSN :
1060-9857
Print_ISBN :
978-0-7695-3826-6
Type :
conf
DOI :
10.1109/SRDS.2009.22
Filename :
5283232