Modeling Coordinated Checkpointing for Large-Scale Supercomputers
Yokohama, Japan June 28-July 01
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2005.67
2005 International Conference on Depe ...
Long Wang, University of Illinois at Urbana-Champaign
Karthik Pattabiraman, University of Illinois at Urbana-Champaign
Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign
Ravishankar K. Iyer, University of Illinois at Urbana-Champaign
Lawrence Votta, Sun Microsystems
Christopher Vick, Sun Microsystems
Alan Wood, Sun Microsystems
Abstract. Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
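The scalability question raised in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's Stochastic Activity Network model. It is a simplified stand-in that assumes independent, exponentially distributed node failures, a fixed checkpoint cost, and a fixed recovery cost, and it ignores correlated failures and any growth of coordination overhead with node count. The function name and all parameter values are hypothetical. Like the model described above, it lets a failure strike during checkpointing or recovery, not only during computation.

```python
import random


def useful_work_fraction(nodes, node_mtbf_h, interval_h, ckpt_h, recovery_h,
                         work_h=200.0, seed=0):
    """Estimate the fraction of wall-clock time spent on useful computation.

    Assumptions (illustrative only, not taken from the paper):
    - node failures are independent and exponential, so the system
      failure rate is nodes / node_mtbf_h failures per hour;
    - any single node failure rolls the whole job back to the last
      committed coordinated checkpoint;
    - checkpoint and recovery durations are constants.
    """
    rng = random.Random(seed)
    rate = nodes / node_mtbf_h      # system-wide failure rate (per hour)
    done = 0.0                      # useful work committed by a checkpoint
    clock = 0.0                     # elapsed wall-clock time
    recovering = False              # True if the next attempt starts with a recovery

    while done < work_h:
        segment = min(interval_h, work_h - done)
        # An attempt is: optional recovery, then compute, then checkpoint.
        need = (recovery_h if recovering else 0.0) + segment + ckpt_h
        time_to_failure = rng.expovariate(rate)
        if time_to_failure >= need:
            # No failure during recovery, computation, or checkpointing.
            clock += need
            done += segment
            recovering = False
        else:
            # A failure interrupted the attempt; everything since the
            # last committed checkpoint is lost and recovery restarts.
            clock += time_to_failure
            recovering = True
    return work_h / clock


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        frac = useful_work_fraction(n, node_mtbf_h=87_600.0, interval_h=0.5,
                                    ckpt_h=0.05, recovery_h=0.1)
        print(f"{n:>7} nodes: useful-work fraction = {frac:.3f}")
```

Because the exponential distribution is memoryless, drawing a fresh time-to-failure for each attempt is equivalent to running one continuous failure process. Extending the sketch toward the paper's concerns would mean making `ckpt_h` depend on `nodes` (coordination overhead) and replacing the independent failure process with one that produces correlated bursts.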
Citation:
Long Wang, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Lawrence Votta, Christopher Vick, Alan Wood, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," 2005 International Conference on Dependable Systems and Networks (DSN'05), pp. 812-821, 2005