In this paper, we describe new protocols that augment traditional cache coherency mechanisms to implement fault tolerance based on Recovery Blocks and checkpointing. Concurrent processes complicate rollback recovery, since rolling back one process can force the processes it has communicated with to roll back as well, potentially leading to a "domino effect" whereby the processes are rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One set of such techniques requires communicating processes to synchronize periodically in order to checkpoint a globally consistent state. These schemes can be implemented more naturally on distributed shared memory systems by using synchronization on shared memory. We have developed extensions to well-known cache-coherency methods (e.g., directory-based) to implement checkpointing of consistent states.
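The idea of synchronizing before a checkpoint can be illustrated with a minimal sketch. It is not the protocol described in the paper: it uses Python threads and a barrier as a stand-in for processes synchronizing on shared memory, and the names `run_with_checkpoints`, `rollback`, and `checkpoint_every` are hypothetical, introduced only for this illustration.

```python
import threading


def run_with_checkpoints(num_workers=3, rounds=4, checkpoint_every=2):
    """Run workers that periodically synchronize and checkpoint shared state.

    Illustrative assumption: a Python list stands in for shared memory and
    threads stand in for communicating processes.
    """
    shared = [0] * num_workers   # stand-in for shared memory
    checkpoints = []             # saved globally consistent snapshots

    def take_checkpoint():
        # Runs in exactly one thread after all workers have reached the
        # barrier and before any is released, so no worker is mid-update.
        checkpoints.append(list(shared))

    barrier = threading.Barrier(num_workers, action=take_checkpoint)

    def worker(wid):
        for rnd in range(1, rounds + 1):
            shared[wid] += wid + 1           # local work on the worker's own slot
            if rnd % checkpoint_every == 0:
                barrier.wait()               # synchronize, then checkpoint

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, checkpoints


def rollback(shared, checkpoints):
    """Restore shared state from the most recent consistent checkpoint."""
    shared[:] = checkpoints[-1]


if __name__ == "__main__":
    state, saved = run_with_checkpoints()
    print(saved)         # [[2, 4, 6], [4, 8, 12]]
    state[0] = 99        # simulate a fault that corrupts shared state
    rollback(state, saved)
    print(state)         # [4, 8, 12]
```

Because every checkpoint is taken while all workers are stopped at the same point, recovery only has to go back to the latest snapshot, which is how synchronized checkpointing bounds the domino effect.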
Index Terms:
Checkpointing, Backward recovery, Cache-Coherency, Conversations, Recovery Blocks, Directory-Based Protocols, Distributed Shared Memory
Citation:
D.L. Hecht, K.M. Kavi, R.K. Gaede, C. Katsinis, "Fault-Tolerance Using Cache-Coherent Distributed Shared Memory Systems," 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), p. 100, 1999.