The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that a machine failure does not cause all the computation done so far to be lost. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capabilities to their applications. This work presents two different approaches to endowing the MPI version of an air quality simulation with fault tolerance. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The two approaches differ mainly in portability, transparency level, and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
Citation:
J.C. Mourino, M.J. Martin, P. Gonzalez, R. Doallo, "Fault-tolerant solutions for a MPI compute intensive application," 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'07), pp. 246-253, 2007.