• DocumentCode
    560217
  • Title

    Challenges of HPC monitoring

  • Author

    Allcock, W. ; Felix, E. ; Lowe, M. ; Rheinheimer, R. ; Fullop, J.

  • Author_Institution
    Argonne Nat. Lab., Argonne, IL, USA
  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    At a recent meeting of monitoring experts from nine large supercomputing centers, there was a broad divergence of opinion on what monitoring in our environment actually is, what ought to be monitored, what technology should be used, etc. Broad consensus can be summarized in a couple of key points: Data management is increasingly a problem. As a result, historical information is rarely kept, or, if kept, rarely accessed. A proliferation of e-mails is ignored, and slow database interfaces are not used. At least some portion of the HPC Monitoring solution at each site can be summarized as "Scripts written by smart personnel over the years". An example is the "Is this node ready to run?" script developed essentially in isolation at each site. Given this environment of supercomputing centers trying to solve a seemingly simple, common problem with largely divergent technologies, philosophies, and problem definitions, we feel that a public conversation will be of value to the supercomputing community as a whole. This report outlines the general positions with regard to monitoring of five experienced supercomputing personnel, and is intended to be of benefit to the general community by revealing a variety of opinions on the following topics: What do you understand the monitoring of supercomputing to be? What are the most difficult problems in monitoring today, and which of the problems of five years ago have been put to rest? What areas of supercomputer monitoring are you most focused on at your site? Are there any particularly promising technologies you're using? If you could have the vendor community do one thing in this area, what would it be?
  • Keywords
    monitoring; parallel machines; HPC; data management; e-mails; supercomputer monitoring; Databases; File systems; Hardware; Humans; Monitoring; Servers; Supercomputers; Blue Gene; HPC; Nagios; Zenoss; monitoring;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type

    conf

  • Filename
    6114488