FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
San Diego, CA, USA September 20-September 23
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CLUSTR.2004.1392606
Sixth IEEE International Conference o ...
Gengbin Zheng, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Lixia Shi, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
L.V. Kale, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
As high-performance clusters continue to grow in size, the mean time between failures shrinks. Fault tolerance and reliability are therefore becoming major challenges for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves restarting it (often manually) on all processors and having it read the data from disk on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy also requires that the failed processor be replaced, so that the number of processors is the same at checkpoint time and at recovery time. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when no spare processor is available, the program can continue to run on the remaining processors while minimizing the performance penalty due to the lost processors. The method is useful for applications whose memory footprint is small at checkpoint time, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with a large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.
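The idea of keeping checkpoints in memory and continuing on the surviving processors can be illustrated with a toy, single-process simulation. This is a minimal sketch and not the FTC-Charm++ implementation: the `Processor` class, the ring-style `buddy_of` placement, and the round-robin redistribution of lost objects are all assumptions made for illustration, none of which are specified in the abstract.

```python
from copy import deepcopy


class Processor:
    """Hypothetical stand-in for one processor's memory."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {}       # live application objects: id -> value
        self.local_ckpt = {}  # in-memory copy of own state at last checkpoint
        self.buddy_ckpt = {}  # in-memory copy of another processor's state


def buddy_of(rank, n):
    # Assumed placement: the next rank in a ring holds the second copy.
    return (rank + 1) % n


def checkpoint(procs):
    """Store each processor's state in its own memory and in its buddy's."""
    n = len(procs)
    for p in procs:
        p.local_ckpt = deepcopy(p.state)
        procs[buddy_of(p.rank, n)].buddy_ckpt = deepcopy(p.state)


def restart_without(procs, failed_rank):
    """Roll back to the last checkpoint and continue on the survivors."""
    n = len(procs)
    # The failed processor's objects survive in its buddy's memory.
    lost = deepcopy(procs[buddy_of(failed_rank, n)].buddy_ckpt)
    survivors = [p for p in procs if p.rank != failed_rank]
    for p in survivors:
        p.state = deepcopy(p.local_ckpt)
    # Assumed policy: spread the recovered objects round-robin.
    for i, (key, value) in enumerate(sorted(lost.items())):
        survivors[i % len(survivors)].state[key] = value
    return survivors


if __name__ == "__main__":
    procs = [Processor(r) for r in range(4)]
    for p in procs:
        p.state = {f"obj{p.rank}_{k}": p.rank * 10 + k for k in range(2)}
    checkpoint(procs)
    survivors = restart_without(procs, failed_rank=2)
    for p in survivors:
        print(p.rank, sorted(p.state))
```

In this sketch no disk is touched during recovery, and the application resumes on three processors instead of four. A real runtime would also need to take a fresh checkpoint after recovery, since the surviving copies are no longer duplicated.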
Citation:
Gengbin Zheng, Lixia Shi, L.V. Kale, "FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI," cluster, pp.93-103, Sixth IEEE International Conference on Cluster Computing (CLUSTER'04), 2004