A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks

Author

Banerjee, Dipyaman ; Madduri, Venkateswara R. ; Srivatsa, Mudhakar

Author_Institution

Res. Lab., IBM India, New Delhi, India

fYear

2009

fDate

27-30 Sept. 2009

Firstpage

246

Lastpage

255

Abstract

As the size of a centrally managed IP network increases, the cost of monitoring network devices and the number of reported events increase super-linearly. This in turn degrades the performance of the event correlation engine that is responsible for suppressing dependent events and escalating root cause events to a network administrator. To solve this scalability problem, we propose a distributed framework that partitions the network into smaller management domains and enables concurrent monitoring and event correlation in those domains. The gain in performance, however, comes with the challenge of correlating cross-domain events, which occur when a failure in one domain induces events in other domain(s). In this paper, we investigate such situations and show that in the worst case it would be impossible to determine the root cause. We propose a two-step approach to solve this problem. First, we define a property called route-closure which, if satisfied by every partition, not only minimizes the number of cross-domain events but also eliminates cases wherein root cause analysis may be inconclusive. We also describe a technology-centric partitioning mechanism that constructs partitions satisfying the route-closure property. Next, we propose a distributed architecture to efficiently identify and correlate cross-domain events. We use a commercial network management system to implement our distributed framework and run experiments by injecting synthetic events on large, real network topologies. Our experimental results show that our approach can manage over 200,000 managed entities and handle event bursts of size 15,000 in under five minutes without compromising the efficacy of event correlation.

Keywords

IP networks; routing protocols; cross-domain events; distributed monitoring; large IP networks; root cause analysis; route-closure; technology-centric partitioning mechanism; Condition monitoring; Costs; Databases; Degradation; Engines; Network topology; Performance analysis; Probes; USA Councils; Event Correlation; Fault Diagnosis; Network Management

fLanguage

English

Publisher

IEEE

Conference_Titel

Reliable Distributed Systems, 2009. SRDS '09. 28th IEEE International Symposium on

Conference_Location

Niagara Falls, NY

ISSN

1060-9857

Print_ISBN

978-0-7695-3826-6

Type

conf

DOI

10.1109/SRDS.2009.22

Filename

5283232