This paper presents a processor group membership protocol for fault-tolerant distributed real-time systems that use periodic, time-triggered scheduling to send messages over the system's communication network. The protocol allows fault-free nodes to reach agreement on the operational state of all nodes in the presence of fail-silent or fail-reporting node failures, as well as network failures (lost or corrupted messages). The protocol is based on the principle that each message sent by a node in the membership is acknowledged by k other nodes in a system of n nodes, where k can be set to any integer from 2 to n-1. Agreement on node failure (membership departure) and agreement on node recovery (membership reintegration) are handled by two different mechanisms. Agreement on departure is guaranteed if no more than f = k-1 failures occur in the same communication round, while at most one node can be reintegrated into the membership per communication round.
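The following is a minimal sketch, not the authors' protocol. It illustrates why the departure bound f = k-1 is plausible. The sketch assumes, hypothetically, that the k acknowledgers of a message are the k nodes that follow the sender in a cyclic time-triggered sending order. The helper names `acknowledgers` and `every_message_has_fault_free_acker` are illustrative only. The exhaustive check confirms that with at most k-1 failed nodes, every message still has at least one fault-free acknowledger, and that this no longer holds with k failures. Reintegration is a separate mechanism and is not modeled here.

```python
from itertools import combinations


def acknowledgers(sender: int, n: int, k: int) -> list[int]:
    """Hypothetical assignment: the k nodes after `sender` in a cyclic sending order."""
    if not 2 <= k <= n - 1:
        raise ValueError("k must satisfy 2 <= k <= n-1")
    return [(sender + i) % n for i in range(1, k + 1)]


def every_message_has_fault_free_acker(n: int, k: int, f: int) -> bool:
    """Exhaustively check all failure sets of size <= f in one round.

    Returns True if, for every failure set and every sender, at least one
    of the sender's k acknowledgers is fault-free.
    """
    for size in range(f + 1):
        for failed in combinations(range(n), size):
            failed_set = set(failed)
            for sender in range(n):
                if all(a in failed_set for a in acknowledgers(sender, n, k)):
                    return False
    return True


if __name__ == "__main__":
    n, k = 6, 3
    print(acknowledgers(4, n, k))                             # nodes 5, 0, 1
    print(every_message_has_fault_free_acker(n, k, k - 1))    # within f = k-1
    print(every_message_has_fault_free_acker(n, k, k))        # one failure too many
```

With k acknowledgers and at most k-1 failures, the pigeonhole principle leaves at least one acknowledger fault-free. Raising k therefore trades extra acknowledgement traffic for a higher per-round failure bound, which matches the flexibility the abstract describes.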
Citation:
Raul Barbosa and Johan Karlsson, "Flexible, Cost-Effective Membership Agreement in Synchronous Systems," in Proc. 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06), 2006, pp. 105-113.