Title :
Filtering failure logs for a BlueGene/L prototype
Author :
Liang, Yinglung ; Zhang, Yanyong ; Sivasubramaniam, Anand ; Sahoo, Ramendra K. ; Moreira, Jose ; Gupta, Manish
Author_Institution :
Dept. of Electr. & Comput. Eng., Rutgers Univ., Piscataway, NJ, USA
fDate :
28 June-1 July 2005
Abstract :
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
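The three filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout (`Event`), the severity labels in `FAILURE_SEVERITIES`, the 300-second windows, and the rule that a duplicate report extends the suppression window are all assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    time: float     # seconds since the start of the log (assumed unit)
    location: str   # hypothetical location identifier, e.g. a node or chip id
    entry: str      # error description text
    severity: str   # hypothetical severity label

# Assumption: which severity labels count as failures.
FAILURE_SEVERITIES = {"FATAL", "FAILURE"}

def extract_failures(events):
    """Step 1: keep only failure events, ordered by time."""
    failures = [e for e in events if e.severity in FAILURE_SEVERITIES]
    return sorted(failures, key=lambda e: e.time)

def temporal_filter(events, window):
    """Step 2: drop repeats of the same entry from the same location
    that arrive within `window` seconds of the previous report."""
    last_seen = {}
    kept = []
    for e in events:
        key = (e.location, e.entry)
        prev = last_seen.get(key)
        if prev is None or e.time - prev > window:
            kept.append(e)
        last_seen[key] = e.time  # a duplicate also extends the window
    return kept

def spatial_filter(events, window):
    """Step 3: coalesce reports of the same entry from different
    locations that arrive within `window` seconds of each other."""
    last_seen = {}
    kept = []
    for e in events:
        prev = last_seen.get(e.entry)
        if prev is None or e.time - prev > window:
            kept.append(e)
        last_seen[e.entry] = e.time
    return kept

def filter_log(events, t_window=300.0, s_window=300.0):
    """Apply the three steps in order. Window sizes are placeholders."""
    failures = extract_failures(events)
    return spatial_filter(temporal_filter(failures, t_window), s_window)
```

In this sketch, a burst of identical reports from one location collapses to a single event in step 2, and the same error echoed by neighboring locations collapses in step 3. Real window sizes would have to be chosen from the log data itself.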
Keywords :
IBM computers; failure analysis; parallel machines; IBM BlueGene/L prototype; error event log filtering; failure log filtering; temporal filtering; three-step filtering algorithm; Application software; Condition monitoring; Error analysis; Filtering; Hardware; Large-scale systems; Military computing; Parallel machines; Prototypes; System software
Conference_Titel :
Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
Print_ISBN :
0-7695-2282-3
DOI :
10.1109/DSN.2005.50