• DocumentCode
    1825882
  • Title
    Availability modeling and analysis on high performance cluster computing systems
  • Author
    Song, Hertong ; Leangsuksun, Chokchai Box ; Nassar, Raja ; Gottumukkala, Narasimha Raju ; Scott, Stephen
  • Author_Institution
    Coll. of Eng. & Sci., Louisiana Tech. Univ., Ruston, LA, USA
  • fYear
    2006
  • fDate
    20-22 April 2006
  • Abstract
    Cluster computing has attracted growing attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. Availability is a key system attribute that must be considered at the system design stage and must also reflect the system's actual behavior. System monitoring and logging enable the identification of unplanned events, so that the availability estimate reflects the actual system. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis, and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster at Lawrence Livermore National Laboratory.
  • Keywords
    fault tolerant computing; system monitoring; workstation clusters; data analysis; dynamic availability modeling; event monitoring; high performance cluster computing systems; real time event logs; system design; Availability; Computer industry; Costs; Filtering; High performance computing; Monitoring; Performance analysis; Power system modeling; Scalability; System analysis and design
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Availability, Reliability and Security, 2006. ARES 2006. The First International Conference on
  • Print_ISBN
    0-7695-2567-9
  • Type
    conf
  • DOI
    10.1109/ARES.2006.37
  • Filename
    1625325