DocumentCode
3678435
Title
New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
Author
Jim Brandt;Ann Gentile;Cindy Martin;Jason Repik;Narate Taerat
Author_Institution
Sandia Nat. Labs., Albuquerque, NM, USA
fYear
2015
Firstpage
658
Lastpage
665
Abstract
Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant and possibly new types of messages and the conditions that cause them can require significant effort and time. The initial standup of a machine can provide opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner case conditions can provide knowledge of system behavior for significant issues, which will enable easier diagnosis and mitigation of such issues when they actually occur during the platform lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup, including changes in configuration and software and corner case testing, have provided insights that can inform future monitoring and operating conditions, both of our test systems and the eventual large-scale Trinity system.
Keywords
"Blades","Testing","Monitoring","Program processors","Cooling","Layout","Temperature"
Publisher
ieee
Conference_Titel
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/CLUSTER.2015.116
Filename
7307665
Link To Document