Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
Singapore May 16-May 19
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CCGRID.2006.81
Sixth IEEE International Symposium on ...
Yuan Tang, University of Tennessee, USA
Graham E. Fagg, University of Tennessee, USA
Jack J. Dongarra, University of Tennessee, USA
As the number of processors in modern HPC (High Performance Computing) systems grows, two problems become pressing: scalability and fault tolerance. In our previous work, we extended the MPI specification's handling of fault tolerance by specifying a systematic framework of recovery methods, communicator modes, message modes, etc. that define the behavior of MPI when an error occurs. These extensions not only specify how the implementation of the MPI library and the RTE (Run Time Environment) handle failures at the system level, but also give ordinary HPC application developers a range of recovery choices with varying performance and cost. In this paper, we continue the work of extending MPI's capability in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user's data. More importantly, we argue that the programming model for future fault tolerant MPI applications should be recover-and-continue rather than the more traditional stop-and-restart. Recover-and-continue means that when an error occurs, only the failed processes are re-spawned. All surviving processes stay where they are, keeping their original processor mapping and their state in memory. The main benefits of recover-and-continue are a much lower cost of system recovery and the opportunity to employ in-memory checkpoint/rollback techniques. Compared with stable-storage or local-disk techniques, which are the only choices for stop-and-restart, the in-memory approach significantly reduces the performance penalty of checkpoint/rollback. Additionally, it makes it possible to establish a concurrent multiple level checkpoint/rollback framework. As our work progresses, a picture of the hierarchy of future fault tolerant HPC systems will gradually be unveiled.
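The recover-and-continue idea with in-memory checkpoints can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the library proposed in the paper: it uses no MPI calls, the names `Process`, `checkpoint`, and `recover_and_continue` are hypothetical, and the "buddy" scheme (each rank keeps a copy of its checkpoint in its right neighbor's memory) is an assumed placement that the abstract does not specify. It assumes at least two processes and a single failure at a time.

```python
import copy


class Process:
    """Simulated rank (hypothetical; stands in for an MPI process)."""

    def __init__(self, rank):
        self.rank = rank
        self.data = [0.0] * 4    # user data to protect
        self.own_ckpt = None     # in-memory copy of this rank's own data
        self.buddy_ckpt = None   # copy held on behalf of the left neighbor


def checkpoint(procs):
    """Take an in-memory checkpoint: one local copy, one copy on the right neighbor."""
    n = len(procs)
    for p in procs:
        p.own_ckpt = copy.deepcopy(p.data)
        procs[(p.rank + 1) % n].buddy_ckpt = copy.deepcopy(p.data)


def recover_and_continue(procs, failed_rank):
    """Re-spawn only the failed rank; survivors stay in place and roll back."""
    n = len(procs)
    fresh = Process(failed_rank)  # stands in for re-spawning the failed process

    # The right neighbor still holds the failed rank's checkpoint in memory.
    holder = procs[(failed_rank + 1) % n]
    fresh.data = copy.deepcopy(holder.buddy_ckpt)
    fresh.own_ckpt = copy.deepcopy(holder.buddy_ckpt)

    # The failed rank also lost the copy it kept for its left neighbor;
    # rebuild it from that neighbor's own local checkpoint.
    left = procs[(failed_rank - 1) % n]
    fresh.buddy_ckpt = copy.deepcopy(left.own_ckpt)

    procs[failed_rank] = fresh

    # Survivors are not restarted: they roll back from their local copy.
    for p in procs:
        if p.rank != failed_rank:
            p.data = copy.deepcopy(p.own_ckpt)
    return procs


# Usage: four simulated ranks, checkpoint, make progress, then lose rank 2.
procs = [Process(r) for r in range(4)]
for p in procs:
    p.data = [float(p.rank)] * 4
checkpoint(procs)
for p in procs:
    p.data = [x + 10.0 for x in p.data]  # work done after the checkpoint
recover_and_continue(procs, failed_rank=2)
```

In this sketch no checkpoint ever touches disk, and the surviving `Process` objects are the same objects before and after recovery, which mirrors the claim that living processes keep their place while only the failed one is re-created.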
Citation:
Yuan Tang, Graham E. Fagg, Jack J. Dongarra, "Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation," ccgrid, pp.27-34, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006