This paper presents an objective measure, called the overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which affects the length of the recovery process and therefore the expected program run-time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratios.
Citation:
Adnan Agbaria, Ari Freund, Roy Friedman, "Evaluating Distributed Checkpointing Protocols," 23rd IEEE International Conference on Distributed Computing Systems (ICDCS'03), p. 266, 2003