Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
|
Over the past two decades, rollback-recovery via checkpoint-restart has been used with reasonable success for long-running applications, such as scientific workloads that take from few hours to few months to complete. Currently, several commercial systems and publicly available libraries exist to support various flavors of checkpointing. Programmers typically use these systems if they are satisfactory or otherwise embed checkpointing support themselves within the application. In this paper, we project the performance and functionality of checkpointing algorithms and systems as we know them today into the future. We start by surveying the current technology roadmap and particularly how Peta-Flop capable systems may be plausibly constructed in the next few years. We consider how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes. Our projections predict that, unlike current practice, the effect of rollback-recovery may play a more prominent role in how systems may be configured to reach the desired performance level. System planners may have to devote additional resources to enable rollback-recovery and the current practice of using "cheap commodity” systems to form large-scale clusters may face serious obstacles. We suggest new avenues for research to react to these trends.
[1] 97 L. Alvisi, E.N. Elnozahy, S. Rao, S.A. Husain, and A. Del Mel, “An Analysis of Communication-Induced Checkpointing,” Proc. 29th Int'l Symp. Fault-Tolerant Computing, June 1999.
[2] O. Babaoglu and W. Joy, “Converting a Swap-Based System to Do Paging in an Architecture Lacking Page-Reference Bits,” Proc. Symp. Operating Systems Principles, pp. 78-86, 1981.
[3] A. Beguelin, E. Seligman, and P. Stephan, “Application Level Fault Tolerance in Heterogeneous Networks of Workstations,” J. Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, June 1997.
[4] B. Bhargava and S.R. Lian, “Independent Checkpointing and Concurrent Rollback for Recovery— An Optimistic Approach,” Proc. Symp. Reliable Distributed Systems, pp. 3-12, 1988.
[5] D. Briatico, A. Ciuffoletti, and L. Simoncini, “A Distributed Domino-Effect Free Recovery Algorithm,” Proc. IEEE Int'l Symp. Reliability, Distributed Software, and Databases, pp. 207-215, Dec. 1984.
[6] M. Chandy and C.V. Ramamoorthy, “Rollback and Recovery Strategies for Computer Programs,” IEEE Trans. Computers, vol. 21, no. 6, pp. 546-556, June 1972.
[7] M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Trans. Computing Systems, vol. 3, no. 1, pp. 63-75, Aug. 1985.
[8] E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, “A Survey of Rollback-Recovery Protocols in Message Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, Sept. 2002.
[9] E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel, “The Performance of Consistent Checkpointing,” Proc. 11th Symp. Reliable Distributed Systems, pp. 39-47, Oct. 1992.
[10] A.H. Johnston, “Scaling and Technology Issues for Soft Error Rates,” Proc. Fourth Ann. Research Conf. Reliability, Oct. 2000.
[11] J.M. Hélary, A. Mostefaoui, R.H. Netzer, and M. Raynal, “Preventing Useless Checkpoints in Distributed Computations,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 183-190, Oct. 1997.
[12] Y. Huang and Y.-M. Wang, “Why Optimistic Message Logging Has Not Been Used in Telecommunication Systems,” Proc. 25th Int'l Symp. Fault-Tolerant Computing (FTCS-25), pp. 459-463, June 1995.
[13] M. Kanellos, “Intel Hastily Redraws Road Maps,” CNET News, May 2004.
[14] R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[15] C.C. Li and W.K. Fuchs, “CATCH: Compiler-Assisted Techniques for Checkpointing,” Proc. 20th Int'l Symp. Fault-Tolerant Computing (FTCS-20), pp. 74-81, June 1990.
[16] G. Moore, “Cramming More Components unto Integrated Circuits,” Electronics, Apr. 1965.
[17] J.S. Plank, “Efficient Checkpointing on MIMD Architectures,” PhD thesis, Princeton Univ., 1993.
[18] J.S. Plank, M. Beck, G. Kingsley, and K. Li, “Lipckpt: Transparent Checkpointing under UNIX,” Proc. USENIX Winter 1995 Technical Conf., pp. 213-223, Jan. 1995.
[19] J.S. Plank and K. Li, “Faster Checkpointing with N + 1 Parity,” Proc. 24th Int'l Symp. Fault-Tolerant Computing (FTCS-24), pp. 288-297, June 1994.
[20] J.S. Plank, J. Xu, and R.B. Netzer, “Compressed Differences: An Algorithm for Fast Incremental Checkpointing,” Technical Report CS-95-302, Univ. of Tennessee at K noxville, Aug. 1995.
[21] B. Randell, “System Structure for Software Fault-Tolerance,” IEEE Trans. Software Eng., vol. 1, no. 2, pp. 220-232, June 1975.
[22] R.D. Schlichting and F.B. Schneider, “Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems,” ACM Trans. Computer Systems, vol. 1, no. 3, pp. 222-238, Aug. 1983.
[23] A. Sistla and J. Welch, “Efficient Distributed Recovery Using Message Logging,” Proc. Eighth Ann. ACM Symp. Principles of Distributed Computing (PODC), pp. 223-238, Aug. 1989.
[24] R. Strom and S. Yemini, “Optimistic Recovery in Distributed Systems,” ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, Aug. 1985.
[25] N. Vaidya, “Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme,” IEEE Trans. Computers, vol. 46, no. 8, pp. 942-947, Aug. 1997.
[26] N. Vaidya, “On Staggered Checkpointing,” Proc. Eighth IEEE Symp. Parallel and Distributed Processing, Oct. 1996.
Index Terms:
Distributed systems, distributed applications, performance of systems, fault tolerance, modeling techniques, reliability, availability, serviceability, measurement, evaluation, modeling, simulation of multiple-processor systems.
Citation:
Elmootazbellah N. Elnozahy, James S. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, Apr.-June 2004, doi:10.1109/TDSC.2004.15