A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Long Beach, CA, USA March 26-March 30
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2007.370307
2007 IEEE International Parallel and ...
Chao Wang, Department of Computer Science, North Carolina State University Raleigh, NC
Frank Mueller, Department of Computer Science, North Carolina State University Raleigh, NC. mueller@cs.ncsu.edu, phone: +1.919.515.7889, fax: +1.919.515.7896
Christian Engelmann, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN
Stephen L. Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN
Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters because the mean time to failure (MTTF) is on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint, while failed nodes are dynamically replaced by spares before the job resumes from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, and transparent coordinated checkpoint scheduling, as well as a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet our approach removes the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings for our scheme in the experiments. Our solution further provides full transparency and automation, with the additional benefit of reusing existing resources.
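The pause-and-resume idea described in the abstract can be illustrated with a toy simulation. The sketch below is not LAM/MPI or BLCR code and does not reflect the authors' implementation; the names `Node`, `Job`, `coordinated_checkpoint`, and `pause_and_recover` are hypothetical, and a single integer stands in for each process's state. It only shows the control flow: surviving nodes stay in the job and roll back in place, while a failed node's rank is handed to a spare that loads the same checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node hosting one MPI rank."""
    name: str
    alive: bool = True
    state: int = 0  # toy placeholder for application progress


@dataclass
class Job:
    """Hypothetical job: list index in `nodes` plays the role of the MPI rank."""
    nodes: list
    spares: list
    checkpoint: dict = field(default_factory=dict)  # rank -> saved state

    def step(self):
        # Advance every rank by one unit of (simulated) work.
        for node in self.nodes:
            node.state += 1

    def coordinated_checkpoint(self):
        # Record the state of all ranks at one consistent point.
        self.checkpoint = {rank: n.state for rank, n in enumerate(self.nodes)}

    def pause_and_recover(self):
        """Swap failed nodes for spares, roll all ranks back, return replaced ranks."""
        if not self.checkpoint:
            raise RuntimeError("no checkpoint to roll back to")
        replaced = []
        for rank, node in enumerate(self.nodes):
            if not node.alive:
                if not self.spares:
                    raise RuntimeError("no spare node available")
                # The spare takes over the failed node's rank.
                self.nodes[rank] = self.spares.pop(0)
                replaced.append(rank)
        # Live nodes roll back in place; spares load the failed rank's image.
        for rank, node in enumerate(self.nodes):
            node.state = self.checkpoint[rank]
        return replaced


# Usage: three ranks, one spare, failure of rank 1 after a checkpoint.
job = Job(nodes=[Node("n0"), Node("n1"), Node("n2")], spares=[Node("spare0")])
job.step()
job.coordinated_checkpoint()
job.step()
job.nodes[1].alive = False
replaced_ranks = job.pause_and_recover()
```

In this toy model the job object itself survives the failure, which mirrors the abstract's point that the run-time environment need not be rebooted; a full restart would instead rebuild every node entry from scratch.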
Citation:
Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," ipdps, pp.117, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007