Optimizing Checkpoint Sizes in the C3 System
Denver, Colorado April 04-April 08
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2005.316
19th IEEE International Parallel and ...
Daniel Marques, Cornell University, Ithaca, NY
Greg Bronevetsky, Cornell University, Ithaca, NY
Rohit Fernandes, Cornell University, Ithaca, NY
Keshav Pingali, Cornell University, Ithaca, NY
Paul Stodghill, Cornell University, Ithaca, NY
The running times of many computational science applications are much longer than the mean-time-between-failures (MTBF) of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures.
Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform and cannot optimize the checkpointing process using application-specific knowledge.
We are exploring an alternative called automatic application-level checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform. In this paper, we evaluate a mechanism that utilizes application knowledge to minimize the amount of information saved in a checkpoint.
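To make the checkpoint-and-restart idea described above concrete, the following is a minimal single-process sketch in Python. It is not the C3 pre-processor or its mechanism; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the toy sum-of-squares computation, and the `fail_at` parameter used to simulate a failure are all assumptions made for illustration. It shows only the general pattern: the program itself decides which variables make up its state, so a large temporary buffer is left out of the checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to stable storage
    os.replace(tmp, path)     # readers see the old or the new file, never half


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the squares of 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a hardware failure.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    step, acc = state["step"], state["acc"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        scratch = [step] * 1000  # large temporary; recomputed, so never saved
        acc += scratch[0] * scratch[0]
        step += 1
        if step % interval == 0:
            # Application knowledge: only `step` and `acc` are needed to resume.
            save_checkpoint(path, {"step": step, "acc": acc})
    return acc
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from the beginning. A system-level snapshot of the same process would also capture `scratch`, which is the kind of saving that application knowledge can avoid.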
Citation:
Daniel Marques, Greg Bronevetsky, Rohit Fernandes, Keshav Pingali, Paul Stodghill, "Optimizing Checkpoint Sizes in the C3 System," ipdps, vol. 11, pp. 226a, 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10, 2005