Filtering Failure Logs for a BlueGene/L Prototype
Yokohama, Japan June 28-July 01
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2005.502005 International Conference on Depe ...
Yinglung Liang, Rutgers University
Yanyong Zhang, Rutgers University
Anand Sivasubramaniam, Penn State University
Ramendra K. Sahoo, IBM T. J. Watson Research Center
Jose Moreira, IBM T. J. Watson Research Center
Manish Gupta, IBM T. J. Watson Research Center
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
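The three filtering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Event` record layout, the use of the message text to identify "the same error", the sliding-window rule, and the 300-second default thresholds are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import namedtuple

# Hypothetical record layout; field names are assumptions, not the real log schema.
Event = namedtuple("Event", ["time", "location", "category", "message"])


def extract_failures(events, failure_categories):
    """Step 1: keep only events whose category marks a failure."""
    return [e for e in events if e.category in failure_categories]


def _drop_repeats(events, key_of, window):
    """Keep an event only if no event with the same key was seen
    within `window` seconds before it (sliding window, assumed rule)."""
    kept = []
    last_seen = {}
    for e in sorted(events, key=lambda ev: ev.time):
        key = key_of(e)
        prev = last_seen.get(key)
        if prev is None or e.time - prev > window:
            kept.append(e)
        # Dropped repeats also refresh the timestamp, so a long burst
        # of duplicates collapses into a single kept event.
        last_seen[key] = e.time
    return kept


def temporal_filter(events, window):
    """Step 2: remove duplicate reports from the same location."""
    return _drop_repeats(events, lambda e: (e.location, e.category), window)


def spatial_filter(events, window):
    """Step 3: coalesce reports of the same error across locations.
    'Same error' is approximated here by identical message text."""
    return _drop_repeats(events, lambda e: e.message, window)


def filter_log(events, failure_categories, temporal_window=300, spatial_window=300):
    """Run the three steps in order; the window defaults are placeholders."""
    failures = extract_failures(events, failure_categories)
    return spatial_filter(temporal_filter(failures, temporal_window), spatial_window)
```

For example, given two reports of one error from the same node ten seconds apart, a third report of that error from another node shortly after, and a fourth report much later, `filter_log` under these assumptions keeps only the first and the last. Real thresholds would have to be tuned against the logs themselves.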
Citation:
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Ramendra K. Sahoo, Jose Moreira, Manish Gupta, "Filtering Failure Logs for a BlueGene/L Prototype," dsn, pp.476-485, 2005 International Conference on Dependable Systems and Networks (DSN'05), 2005