• DocumentCode
    228652
  • Title

    The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

  • Author

    Agelastos, Anthony ; Allan, Benjamin ; Brandt, Jim ; Cassella, Paul ; Enos, Jeremy ; Fullop, Joshi ; Gentile, Ann ; Monk, Steve ; Naksinehaboon, Nichamon ; Ogden, Jeff ; Rajan, Mahesh ; Showerman, Michael ; Stevenson, Joel ; Taerat, Narate ; Tucker, Tom

  • Author_Institution
    Sandia Nat. Labs. ABQ, Albuquerque, NM, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    154
  • Lastpage
    165
  • Abstract
    Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.
  • Keywords
    parallel processing; resource allocation; software metrics; computing system monitoring; high performance computing platform; lightweight distributed metric service; resource utilization; Bandwidth; Instruction sets; Measurement; Memory management; Monitoring; Resource management; Sockets; resource management; resource monitoring
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.18
  • Filename
    7013000