Active/Active Replication for Highly Available HPC System Services
Vienna, Austria April 20-April 22
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ARES.2006.23
First International Conference on Ava ...
C. Engelmann, University of Reading, Reading, RG6 6AH, UK
S. L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN
C. Leangsuksun, Louisiana Tech University, Ruston, LA
X. He, Tennessee Technological University, Cookeville, TN
Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease-of-use and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.
Citation:
C. Engelmann, S. L. Scott, C. Leangsuksun, X. He, "Active/Active Replication for Highly Available HPC System Services," ares, pp.639-645, First International Conference on Availability, Reliability and Security (ARES'06), 2006