Many scientific applications place high demands on memory storage and computing capability. Recent improvements in commodity processors and networks provide an opportunity to support such applications within an everyday computing infrastructure. Because such environments change constantly, applications running on them must be adaptable and fault tolerant. Based on a simulation of relativistic particle transport, this paper proposes a data-level checkpointing scheme for common scientific applications. The scheme takes advantage of the regular program layout, dominant computing loops, and fine-grained iterations typical of such codes. Stack and heap segments are not handled directly; only application data is saved and restored as the computation state. The checkpointing interval can be adjusted dynamically to satisfy sensitivity and efficiency requirements for feasible fault tolerance. With this periodic, fixed-location checkpointing scheme, the MPI-based simulation system can be reconfigured by shutting it down and then restarting it on the same or a different computer cluster, and application data can be redistributed for the new configuration. Experimental results demonstrate the scheme's efficiency and effectiveness.
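The idea of data-level checkpointing at a fixed location can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the toy particle update, and the use of `pickle` are all assumptions made for illustration, and MPI and data redistribution are left out. Only application data (a step counter and particle arrays) is written, always at the top of the dominant loop, and the interval is a plain parameter that can differ between runs.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write application data atomically: temp file first, then rename."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_checkpoint(path):
    """Return the saved application data, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, stop_after=None):
    """Toy particle loop (hypothetical). stop_after simulates a shutdown."""
    state = load_checkpoint(path) or {
        "step": 0,
        "positions": [0.0, 1.0, 2.0],
        "velocities": [0.5, -0.25, 0.125],
    }
    dt = 0.5
    while state["step"] < total_steps:
        # Fixed checkpoint location: top of the dominant computing loop.
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated shutdown; work since last save is lost
        state["positions"] = [
            x + v * dt
            for x, v in zip(state["positions"], state["velocities"])
        ]
        state["step"] += 1
    return state
```

Calling `run` again with the same `path` resumes from the last saved step rather than from the point of shutdown, so a shorter interval loses less work at the cost of more frequent writes. Because the saved state is plain application data rather than a process image, a restart is free to use a different interval or, in a distributed setting, a different data layout.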
Index Terms:
Simulation, Fault Tolerance, Reconfiguration, Checkpointing, Relativistic Particle Transport
Citation:
Ruipeng Li, Hai Jiang, Hung-Chi Su, Bin Zhang, Jeff Jenness, "Adaptive and Fault Tolerant Simulation of Relativistic Particle Transport with Data-Level Checkpointing," 2008 11th IEEE International Conference on Computational Science and Engineering (CSE), pp. 345-352, 2008.