Title :
Challenges of HPC monitoring
Author :
Allcock, W. ; Felix, E. ; Lowe, M. ; Rheinheimer, R. ; Fullop, J.
Author_Institution :
Argonne Nat. Lab., Argonne, IL, USA
Abstract :
At a recent meeting of monitoring experts from nine large supercomputing centers, there was a broad divergence of opinion on what monitoring in our environment actually is, what ought to be monitored, what technology should be used, etc. Broad consensus can be summarized in a couple of key points: Data management is increasingly a problem. As a result, historical information is rarely kept, or, if kept, rarely accessed. A proliferation of e-mails is ignored, and slow database interfaces are not used. At least some portion of the HPC monitoring solution at each site can be summarized as "Scripts written by smart personnel over the years". An example is the "Is this node ready to run?" script developed essentially in isolation at each site. Given this environment of supercomputing centers trying to solve a seemingly simple, common problem with largely divergent technologies, philosophies, and problem definitions, we feel that a public conversation will be of value to the supercomputing community as a whole. This report outlines the general positions with regard to monitoring of five experienced supercomputing personnel, and is intended to be of benefit to the general community by revealing a variety of opinions on the following topics: What do you understand the monitoring of supercomputing to be? What are the most difficult problems in monitoring today, and which of the problems of five years ago have been put to rest? What areas of supercomputer monitoring are you most focused on at your site? Are there any particularly promising technologies you're using? If you could have the vendor community do one thing in this area, what would it be?
Keywords :
monitoring; parallel machines; HPC; data management; e-mails; supercomputer monitoring; Databases; File systems; Hardware; Humans; Servers; Supercomputers; Blue Gene; Nagios; Zenoss
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0