Characterization of Consistent Global Checkpoints in Large-Scale Distributed Systems
Chenju, Korea August 28-August 30
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/FTDCS.1995.525000
5th IEEE Workshop on Future Trends of ...
Backward error recovery is one of the most widely used schemes for ensuring fault tolerance in distributed systems. Upon the occurrence of a failure, it restores the distributed computation to an error-free global state from which it can be resumed to produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery. In large-scale distributed systems, on the one hand a coordinated approach to taking checkpoints is not practicable; on the other hand, with an uncoordinated approach the probability of a domino effect during recovery may no longer be negligible. In this paper, we present a framework that allows us first to define the domino effect formally, and second to state and prove a theorem that determines whether an arbitrary set of checkpoints is consistent. This theorem is very general, as it considers a semantics that includes missing and orphan messages. It plays a key role in designing uncoordinated checkpointing algorithms that take as few additional checkpoints as possible in order to ensure domino-free recovery.
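To make the notions of orphan and missing messages concrete, the following is a minimal sketch and not the framework or theorem of the paper. It assumes the commonly used definitions: a message is an orphan with respect to a set of checkpoints if its receipt is recorded but its send is not, and missing if its send is recorded but its receipt is not. The names `Message`, `classify`, `is_consistent`, and the representation of a checkpoint as a local event index are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # local event index of the send at the sender
    receiver: str
    recv_time: int   # local event index of the receive at the receiver


def classify(messages, cut):
    """Split messages into orphans and missing messages for a given cut.

    `cut` maps each process name to the local event index of its
    checkpoint; events with index <= cut[p] are recorded by p's checkpoint.
    (Assumed model, not taken from the paper.)
    """
    orphans, missing = [], []
    for m in messages:
        sent_recorded = m.send_time <= cut[m.sender]
        recv_recorded = m.recv_time <= cut[m.receiver]
        if recv_recorded and not sent_recorded:
            orphans.append(m)      # received "before" it was sent
        elif sent_recorded and not recv_recorded:
            missing.append(m)      # sent but lost from the recorded state
    return orphans, missing


def is_consistent(messages, cut, allow_missing=True):
    """A cut is accepted if it has no orphans and, optionally, no missing messages."""
    orphans, missing = classify(messages, cut)
    return not orphans and (allow_missing or not missing)
```

Whether missing messages are tolerated depends on the chosen semantics (for example, whether channels can replay logged messages), which is why the sketch exposes it as the `allow_missing` parameter rather than fixing one interpretation.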
Citation:
R. Baldoni, J. Brzezinski, J.M. Helary, A. Mostefaoui, M. Raynal, "Characterization of Consistent Global Checkpoints in Large-Scale Distributed Systems," ftdcs, p. 314, 5th IEEE Workshop on Future Trends of Distributed Computing Systems, 1995