DocumentCode :
2262648
Title :
Efficient verification of IT change operations or: How we could have prevented Amazon's cloud outage
Author :
Hagen, Sebastian ; Seibold, Michael ; Kemper, Alfons
Author_Institution :
Dept. of Comput. Sci., Tech. Univ. Munchen, Garching, Germany
fYear :
2012
fDate :
16-20 April 2012
Firstpage :
368
Lastpage :
376
Abstract :
On April 21st, 2011, a major outage occurred in Amazon's US east coast data center which led to significant disruptions on customer services. The root cause of the outage was an IT change to route traffic off from a router to a redundant router to conduct a network upgrade. The change was wrongly executed as a router was picked that could not handle the traffic due to capacity constraints. Consequently, network outages occurred, finally leading to unavailability, temporary, and even durable data loss of customers. We propose an object-oriented verification technique to detect conflicts among IT change operations and safety constraints, such as network capacity constraints, in the verification phase before the execution of IT changes. Based on Amazon's incident report different scenarios in static and dynamic routing environments that cause a network overload are shown to be detectable by logical verification. The verification algorithm is proven to be sound and has linear runtime complexity for Amazon's network overload scenarios. A performance analysis confirms the theoretical results and promises scalability to thousands of IT changes and safety constraints.
Keywords :
cloud computing; computational complexity; computer centres; computer network performance evaluation; customer services; object-oriented methods; security of data; telecommunication network routing; Amazon cloud outage prevention; Amazon's network overload scenarios; IT Change Operations; US east coast data center; conflict detection; customer services; durable data loss; dynamic routing environments; linear runtime complexity; logical verification; network outages; network upgrade; object-oriented verification technique; performance analysis; redundant router; safety constraints; static routing environments; temporary data loss; traffic handling; Logic gates; Redundancy; Routing; Routing protocols; Safety; Servers
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Network Operations and Management Symposium (NOMS), 2012 IEEE
Conference_Location :
Maui, HI
ISSN :
1542-1201
Print_ISBN :
978-1-4673-0267-8
Electronic_ISBN :
1542-1201
Type :
conf
DOI :
10.1109/NOMS.2012.6211920
Filename :
6211920