Title :
OVIS-2: A robust distributed architecture for scalable RAS
Author :
Brandt, J.M. ; Debusschere, B.J. ; Gentile, A.C. ; Mayo, J.R. ; Pébay, P.P. ; Thompson, D. ; Wong, M.H.
Author_Institution :
Sandia Nat. Labs., Livermore, CA
Abstract :
Resource utilization in high-performance computing (HPC) clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant framework for reliability, availability, and serviceability (RAS). In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We also describe some of its sophisticated statistical analysis and 3-D visualization capabilities, along with use cases for them. Using this framework and its associated tools allows engineers to explore the behaviors and complex interactions of low-level system elements, while giving system administrators their desired level of detail about ongoing system and component health.
Keywords :
resource allocation; system monitoring; workstation clusters; 3D visualization; Ovis-2; high performance compute clusters; resource utilization; robust distributed architecture; run-time characterization; scalable RAS; scalable fault-tolerant RAS framework; statistical analysis; system state information; Computer architecture; Displays; Failure analysis; Fault tolerance; Fault tolerant systems; Monitoring; Resource management; Robustness; Statistical analysis; US Department of Energy; RAS; cluster monitoring; distributed analysis; failure prediction; fault-tolerance; scalable analysis;
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-1693-6
ISSN :
1530-2075
DOI :
10.1109/IPDPS.2008.4536549