DocumentCode :
3206104
Title :
Co-analysis of RAS Log and Job Log on Blue Gene/P
Author :
Ziming Zheng ; Li Yu ; Wei Tang ; Zhiling Lan ; Rajesh Gupta ; Narayan Desai ; Susan Coghlan ; David Buettner
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
840
Lastpage :
851
Abstract :
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on just the RAS logs has proved to be insufficient in understanding failures and system behaviors. To overcome the limitations of existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience of large-scale systems.
Keywords :
mainframes; system recovery; Blue Gene-P system; RAS log coanalysis; failure analysis; job log coanalysis; petascale system reliability; reliability-availability-serviceability log; Correlation; Hardware; Kernel; Laboratories; Large-scale systems; Reliability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
Conference_Location :
Anchorage, AK
ISSN :
1530-2075
Print_ISBN :
978-1-61284-372-8
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.83
Filename :
6012893