Emerging networking technologies rely on complex network interfaces, which has renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique for recovering from network interface failures, in particular network processor hangs. We demonstrate the technique in the context of Myrinet. Fault recovery is achieved by restoring the state of the network interface from a small backup copy that contains just the information required for complete recovery. Fault detection is based on a software watchdog that detects network processor hangs. Results on the Myrinet platform show that complete fault recovery can be achieved in under 2 seconds while incurring a latency overhead of just 1.5 µs during normal operation. The paper also shows how this fault recovery can be made completely transparent to the user.
Citation:
Vijay Lakamraju, Israel Koren, C.M. Krishna, "Low Overhead Fault Tolerant Networking in Myrinet," 2003 International Conference on Dependable Systems and Networks (DSN'03), p. 193, 2003