• DocumentCode
    3206104
  • Title

    Co-analysis of RAS Log and Job Log on Blue Gene/P

  • Author

    Ziming Zheng ; Li Yu ; Wei Tang ; Zhiling Lan ; Gupta, Rajesh ; Desai, Narayan ; Coghlan, Susan ; Buettner, David

  • Author_Institution
    Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    840
  • Lastpage
    851
  • Abstract
    With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on the RAS logs alone has proved insufficient for understanding failures and system behaviors. To overcome the limitations of these existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on the fault resilience of large-scale systems.
  • Keywords
    mainframes; system recovery; Blue Gene-P system; RAS log coanalysis; failure analysis; job log coanalysis; petascale system reliability; reliability-availability-serviceability log; Correlation; Hardware; Kernel; Laboratories; Large-scale systems; Reliability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.83
  • Filename
    6012893