Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters
Singapore May 16-May 19
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CCGRID.2006.125
Sixth IEEE International Symposium on ...
Dmitry Mogilevsky, University of Illinois at Urbana-Champaign, USA
Gregory A. Koenig, University of Illinois at Urbana-Champaign, USA
William Yurcik, University of Illinois at Urbana-Champaign, USA
Recent shifts in high-performance computing have increased the use of clusters built around cheap commodity processors. A typical cluster consists of individual nodes, each containing one or several processors, connected by a high-bandwidth, low-latency interconnect. There are many benefits to using clusters for computation, but also some drawbacks, including a tendency to exhibit low Mean Time To Failure (MTTF) due to the sheer number of components involved. Recently, a number of fault-tolerance techniques have been proposed and developed to mitigate the inherent unreliability of clusters. These techniques, however, fail to address the detection of non-obvious faults, particularly Byzantine faults. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for run-time detection of Byzantine hardware errors as part of the Charm++ parallel programming framework.
Citation:
Dmitry Mogilevsky, Gregory A. Koenig, William Yurcik, "Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters," ccgrid, vol. 2, p. 30, Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), 2006