The Interplay of Power Management and Fault Recovery in Real-Time Systems
Abstract—This paper describes how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. The paper shows that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks.
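The trade-off described in the abstract can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not the paper's own model or derivation: speed is normalized so that maximum speed is 1, the task needs `C` time units of work at maximum speed, each checkpoint costs `r` time units at maximum speed, the `n` checkpoints are placed uniformly, at most one fault occurs, recovery re-executes one segment and its checkpoint at maximum speed, and failure-free energy is taken to be proportional to speed squared times the work performed. The function names are hypothetical.

```python
def min_speed(C, D, r, n):
    """Lowest normalized speed that still leaves room for one recovery.

    Assumed model: n uniform segments, each followed by a checkpoint of
    cost r. Failure-free time at speed s is (C + n*r) / s. One recovery
    at maximum speed costs C/n + r. Returns None if no speed <= 1 works.
    """
    recovery = C / n + r          # time reserved for re-executing one segment
    budget = D - recovery         # time left for failure-free execution
    if budget <= 0:
        return None
    s = (C + n * r) / budget
    return s if s <= 1.0 else None


def energy(C, D, r, n):
    """Failure-free energy under the assumed s**2-per-unit-work model."""
    s = min_speed(C, D, r, n)
    if s is None:
        return None
    return s ** 2 * (C + n * r)


def best_checkpoint_count(C, D, r, max_n=1000):
    """Search n = 1..max_n for the feasible count with the lowest energy."""
    best = None
    for n in range(1, max_n + 1):
        e = energy(C, D, r, n)
        if e is not None and (best is None or e < best[1]):
            best = (n, e)
    return best  # (n, energy) or None if no n is feasible


if __name__ == "__main__":
    # Hypothetical task: 10 units of work, deadline 20, checkpoint cost 0.5.
    print(best_checkpoint_count(C=10.0, D=20.0, r=0.5))
```

Under these assumptions, too few checkpoints force a large recovery reserve and therefore a high speed, while too many inflate the checkpointing work itself, so the energy is minimized at an intermediate count. The sketch covers only uniform placement; the nonuniform policy the abstract favors is not modeled here.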
Index Terms:
Checkpointing, fault tolerance, frequency scaling, power management, real-time systems, reliability, voltage scaling.
Citation:
Rami Melhem, Daniel Mossé, Elmootazbellah (Mootaz) Elnozahy, "The Interplay of Power Management and Fault Recovery in Real-Time Systems," IEEE Transactions on Computers, vol. 53, no. 2, pp. 217-231, Feb. 2004, doi:10.1109/TC.2004.1261830