Title :
Co-analysis of RAS Log and Job Log on Blue Gene/P
Author :
Ziming Zheng ; Li Yu ; Wei Tang ; Zhiling Lan ; Rajesh Gupta ; Narayan Desai ; Susan Coghlan ; David Buettner
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
Abstract :
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on the RAS logs alone has proved insufficient for understanding failures and system behaviors. To overcome the limitations of existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience of large-scale systems.
Keywords :
mainframes; system recovery; Blue Gene-P system; RAS log coanalysis; failure analysis; job log coanalysis; petascale system reliability; reliability-availability-serviceability log; Correlation; Hardware; Kernel; Laboratories; Large-scale systems; Reliability
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-61284-372-8
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.83