Low Overhead Fault Tolerant Networking in Myrinet
San Francisco, California June 22-June 25
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2003.1209930
2003 International Conference on Depe ...
Vijay Lakamraju, University of Massachusetts at Amherst
Israel Koren, University of Massachusetts at Amherst
C.M. Krishna, University of Massachusetts at Amherst
Emerging networking technologies have complex network interfaces, which has renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique for recovering from network interface failures, in particular network processor hangs. We demonstrate the technique in the context of Myrinet. Fault recovery is achieved by restoring the state of the network interface from a small backup copy that contains just the information required for complete recovery. Fault detection is based on a software watchdog that detects network processor hangs. Results on the Myrinet platform show that complete fault recovery can be achieved in under 2 seconds while incurring a latency overhead of just 1.5 µs during normal operation. The paper also shows how this fault recovery can be made completely transparent to the user.
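The two mechanisms named in the abstract, a software watchdog for hang detection and recovery from a small backup copy of interface state, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use any Myrinet API: the classes `SimulatedNic` and `HostWatchdog`, the heartbeat counter, the `max_missed` threshold, and the `next_seq` state key are all hypothetical names chosen for illustration, and time is simulated with explicit `tick()` and `poll()` calls.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedNic:
    """Hypothetical stand-in for a network interface (not the Myrinet API)."""
    heartbeat: int = 0   # counter the network processor advances while alive
    hung: bool = False   # set to True to simulate a network processor hang
    state: dict = field(default_factory=dict)  # state needed to resume traffic

    def tick(self) -> None:
        # A healthy network processor advances its heartbeat on every tick.
        if not self.hung:
            self.heartbeat += 1

    def reset(self) -> None:
        # A reset clears the hang but also wipes the interface state.
        self.hung = False
        self.heartbeat = 0
        self.state = {}


class HostWatchdog:
    """Assumed host-side watchdog that keeps a small backup of NIC state."""

    def __init__(self, nic: SimulatedNic, max_missed: int = 3) -> None:
        self.nic = nic
        self.max_missed = max_missed   # polls without progress before recovery
        self.last_seen = nic.heartbeat
        self.missed = 0
        self.backup: dict = {}
        self.recoveries = 0

    def checkpoint(self, key: str, value: object) -> None:
        # Update the interface state and the backup copy together.
        self.nic.state[key] = value
        self.backup[key] = value

    def poll(self) -> bool:
        """Return True if a hang was detected and recovery was performed."""
        if self.nic.heartbeat != self.last_seen:
            # Progress observed: remember it and clear the miss counter.
            self.last_seen = self.nic.heartbeat
            self.missed = 0
            return False
        self.missed += 1
        if self.missed < self.max_missed:
            return False
        # Hang assumed: reset the interface, then restore state from backup.
        self.nic.reset()
        self.nic.state = dict(self.backup)
        self.last_seen = self.nic.heartbeat
        self.missed = 0
        self.recoveries += 1
        return True
```

In this sketch the only cost during normal operation is the extra write in `checkpoint`, which loosely mirrors the low-overhead goal stated in the abstract; the detection delay is governed by the polling interval and `max_missed`, both of which are illustrative parameters rather than values from the paper.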
Citation:
Vijay Lakamraju, Israel Koren, C.M. Krishna, "Low Overhead Fault Tolerant Networking in Myrinet," dsn, pp.193, 2003 International Conference on Dependable Systems and Networks (DSN'03), 2003