DocumentCode
1802650
Title
A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks
Author
Banerjee, Dipyaman ; Madduri, Venkateswara R. ; Srivatsa, Mudhakar
Author_Institution
Res. Lab., IBM India, New Delhi, India
fYear
2009
fDate
27-30 Sept. 2009
Firstpage
246
Lastpage
255
Abstract
As the size of a centrally managed IP network increases, the cost of monitoring network devices and the number of reported events increase super-linearly. This in turn degrades the performance of the event correlation engine that is responsible for suppressing dependent events and escalating root cause events to a network administrator. To solve this scalability problem, we propose a distributed framework that partitions the network into smaller management domains and enables concurrent monitoring and event correlation in those domains. The gain in performance, however, comes with the challenge of correlating cross-domain events, which occur when a failure in one domain induces events in other domain(s). In this paper, we investigate such situations and show that, in the worst case, it would be impossible to determine the root cause. We propose a two-step approach to solve this problem. First, we define a property called route-closure, which, if satisfied by every partition, not only minimizes the number of cross-domain events but also eliminates cases wherein root cause analysis may be inconclusive. We also describe a technology-centric partitioning mechanism that constructs partitions satisfying the route-closure property. Next, we propose a distributed architecture to efficiently identify and correlate cross-domain events. We use a commercial network management system to implement our distributed framework and run experiments by injecting synthetic events on large, real network topologies. Our experimental results show that our approach can manage over 200,000 managed entities and handle event bursts of size 15,000 in under five minutes without compromising the efficacy of event correlation.
Keywords
IP networks; routing protocols; cross-domain events; distributed monitoring; large IP networks; root cause analysis; route-closure; technology-centric partitioning mechanism; Condition monitoring; Costs; Databases; Degradation; Engines; IP networks; Network topology; Performance analysis; Probes; USA Councils; Event Correlation; Fault Diagnosis; Network Management
fLanguage
English
Publisher
IEEE
Conference_Titel
Reliable Distributed Systems, 2009. SRDS '09. 28th IEEE International Symposium on
Conference_Location
Niagara Falls, NY
ISSN
1060-9857
Print_ISBN
978-0-7695-3826-6
Type
conf
DOI
10.1109/SRDS.2009.22
Filename
5283232