Fault detection is a fundamental issue for fault tolerance in distributed systems. This paper presents the DPCP (Discard Past Consider Present) approach, which discards the past elapsed times of fault detection messages and considers only the current one. In this way, DPCP enables fast, accurate, and scalable adaptive fault monitoring for asynchronous distributed systems. Scalability comes from the MinimumTimeUnit parameter, which controls the minimum frequency of the fault monitoring messages. Speed and accuracy come from changing the timeout and monitoring interval values as soon as the system workload and MinimumTimeUnit allow. Experiments with DPCP on ACE+TAO were carried out to observe its behavior under changing network workloads.
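The abstract does not give the update rule itself, so the following is only a minimal Python sketch of the general idea of adapting from the current elapsed time alone. The class name `CurrentOnlyMonitor`, the `safety_factor` multiplier, and the specific formulas are assumptions made for illustration and are not taken from the paper; only the notion of a `MinimumTimeUnit` floor and of ignoring past measurements comes from the text above.

```python
class CurrentOnlyMonitor:
    """Hypothetical monitor that adapts using only the latest elapsed time."""

    def __init__(self, minimum_time_unit, safety_factor=2.0):
        # minimum_time_unit: floor (in seconds) for both timeout and interval.
        # safety_factor: assumed margin over the observed elapsed time.
        self.minimum_time_unit = minimum_time_unit
        self.safety_factor = safety_factor
        self.timeout = minimum_time_unit
        self.interval = minimum_time_unit

    def observe(self, elapsed):
        """Recompute timeout and interval from the current elapsed time.

        No history is stored, so earlier measurements have no influence.
        """
        self.timeout = max(self.minimum_time_unit, self.safety_factor * elapsed)
        self.interval = max(self.minimum_time_unit, elapsed)
        return self.timeout, self.interval


monitor = CurrentOnlyMonitor(minimum_time_unit=0.5)
slow = monitor.observe(2.0)   # heavy load: values stretch immediately
fast = monitor.observe(0.1)   # load drops: values fall back to the floor
```

In this sketch a single fast reply is enough to bring both values back down to the floor, which is the kind of immediate reaction the abstract attributes to discarding past measurements. An estimator that averages over history would react more gradually.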
Citation:
I. Sotoma, E. Madeira, "DPCP (Discard Past Consider Present) -- A Novel Approach to Adaptive Fault Detection in Distributed Systems," 8th IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS'01), p. 76, 2001.