Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
Xi'an, China September 10-September 14
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPP.2007.44
2007 International Conference on Para ...
Qi Gao, The Ohio State University, USA
Wei Huang, The Ohio State University, USA
Matthew J. Koop, The Ohio State University, USA
Dhabaleswar K. Panda, The Ohio State University, USA
As more and more clusters with thousands of nodes are being deployed for high performance computing (HPC), fault tolerance in cluster environments has become a critical requirement. Checkpointing and rollback recovery is a common approach to achieving fault tolerance. Although widely adopted in practice, coordinated checkpointing has a known scalability limitation. Severe contention for bandwidth to the storage system can occur when a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications. In this paper, we propose a novel group-based checkpointing design to alleviate this scalability limitation. By carefully scheduling the MPI processes to take checkpoints in smaller groups, our design reduces the number of processes simultaneously taking checkpoints, while allowing those processes not taking checkpoints to proceed with computation. We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner. Experimental results show that our group-based checkpointing design can significantly reduce the effective delay for checkpointing, by up to 78% for HPL and up to 70% for MotifMiner.
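The contention argument in the abstract can be illustrated with a toy back-of-the-envelope model. The sketch below is not the paper's design or its measurement method. It assumes that every process writes an image of the same size, that the processes writing at the same moment share a fixed storage bandwidth equally, and that processes outside the active group keep computing without being blocked. The function name and all parameters are hypothetical and chosen only for illustration.

```python
def checkpoint_delays(num_procs, image_mb, storage_mb_per_s, group_size):
    """Toy model of storage contention during a coordinated checkpoint.

    Assumptions (illustrative only, not taken from the paper):
      * every process writes an image of image_mb megabytes,
      * concurrent writers share storage_mb_per_s equally,
      * a process is stalled only while its own group is writing.

    Returns (stall_s, total_s):
      stall_s  - time one process in a full group is stalled,
      total_s  - time until every process has been checkpointed.
    """
    if not 1 <= group_size <= num_procs:
        raise ValueError("group_size must be between 1 and num_procs")

    # A full group writes group_size images through the shared bandwidth.
    stall_s = group_size * image_mb / storage_mb_per_s

    # The total data volume is unchanged, so the overall checkpoint
    # takes the same time no matter how the processes are grouped.
    total_s = num_procs * image_mb / storage_mb_per_s
    return stall_s, total_s


if __name__ == "__main__":
    # 64 processes, 128 MB images, 1024 MB/s aggregate storage bandwidth.
    all_at_once = checkpoint_delays(64, 128, 1024, group_size=64)
    in_groups = checkpoint_delays(64, 128, 1024, group_size=16)
    print("all at once :", all_at_once)
    print("groups of 16:", in_groups)
```

Under these assumptions the total time to checkpoint all processes is unchanged, but each process is stalled only for its own group's share of the writing. A real implementation must also keep the global checkpoint consistent while groups are at different stages, which this model ignores entirely.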
Citation:
Qi Gao, Wei Huang, Matthew J. Koop, Dhabaleswar K. Panda, "Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand," icpp, pp.47, 2007 International Conference on Parallel Processing (ICPP 2007), 2007