DocumentCode
3206104
Title
Co-analysis of RAS Log and Job Log on Blue Gene/P
Author
Ziming Zheng ; Li Yu ; Wei Tang ; Zhiling Lan ; Rajesh Gupta ; Narayan Desai ; Susan Coghlan ; David Buettner
Author_Institution
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
fYear
2011
fDate
16-20 May 2011
Firstpage
840
Lastpage
851
Abstract
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on RAS logs alone has proved insufficient for understanding failures and system behaviors. To overcome the limitations of existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience of large-scale systems.
Keywords
mainframes; system recovery; Blue Gene-P system; RAS log coanalysis; failure analysis; job log coanalysis; petascale system reliability; reliability-availability-serviceability log; Correlation; Hardware; Kernel; Laboratories; Large-scale systems; Reliability;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
Conference_Location
Anchorage, AK
ISSN
1530-2075
Print_ISBN
978-1-61284-372-8
Electronic_ISBN
1530-2075
Type
conf
DOI
10.1109/IPDPS.2011.83
Filename
6012893