Supporting fault-tolerance in heterogeneous distributed applications
Geneva, Switzerland, April 1
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HCW.1997.581421
6th Heterogeneous Computing Workshop ...
P. Maheshwari, Sch. of Comput. Sci. & Eng., New South Wales Univ., Sydney, NSW, Australia
J. Ouyang, Sch. of Comput. Sci. & Eng., New South Wales Univ., Sydney, NSW, Australia
Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, algorithm design for applications, scheduling of parallel tasks, interconnection network technology, and support for reliable distributed heterogeneous computing. A trend in supporting fault-tolerance in distributed computing systems is to incorporate it into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. We present an approach for developing efficient, reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, to support fault-tolerance in heterogeneous systems with low run-time cost. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery, integrated with a user-level network communication protocol. Novel mechanisms keep communication overhead to a minimum when taking a consistent distributed checkpoint and when catching messages that are in transit during a checkpoint. By providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.
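The abstract does not describe H-Libra's own checkpointing mechanism, so the following is only an illustrative sketch of the general idea of a consistent distributed checkpoint that also catches in-transit messages. It uses the classic marker-based snapshot technique (Chandy-Lamport) over FIFO channels, simulated in a single Python process. All names (Process, Network, run_demo) and the account-balance workload are hypothetical and are not taken from H-Libra.

```python
from collections import deque

MARKER = "MARKER"  # control message that delimits a checkpoint on a channel


class Network:
    """In-memory FIFO channels, one per (sender, receiver) pair."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver(self, src, dst, procs):
        msg = self.channels[(src, dst)].popleft()
        procs[dst].receive(src, msg, self)


class Process:
    """Hypothetical process whose only state is an integer balance."""

    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance
        self.peers = peers
        self.saved_balance = None  # local checkpoint, None until taken
        self.recording = {}        # sender -> messages caught in transit
        self.closed = set()        # senders whose marker has arrived

    def checkpoint(self, net):
        # Save local state, then tell every peer via a marker.
        self.saved_balance = self.balance
        self.recording = {p: [] for p in self.peers}
        for p in self.peers:
            net.send(self.pid, p, MARKER)

    def receive(self, sender, msg, net):
        if msg == MARKER:
            if self.saved_balance is None:
                self.checkpoint(net)
            self.closed.add(sender)  # nothing more to record on this channel
            return
        self.balance += msg
        # A message that arrives after the local checkpoint but before the
        # sender's marker was in transit when the checkpoint was taken.
        if self.saved_balance is not None and sender not in self.closed:
            self.recording[sender].append(msg)


def run_demo():
    net = Network()
    procs = {i: Process(i, 100, [j for j in range(3) if j != i])
             for i in range(3)}

    procs[1].balance -= 30
    net.send(1, 0, 30)          # transfer already in flight
    procs[0].checkpoint(net)    # process 0 initiates the checkpoint
    procs[2].balance -= 5
    net.send(2, 1, 5)           # sent before process 2 sees a marker

    # Deliver everything until all channels are empty.
    while any(net.channels.values()):
        for (src, dst), queue in list(net.channels.items()):
            if queue:
                net.deliver(src, dst, procs)

    saved = sum(p.saved_balance for p in procs.values())
    in_transit = sum(sum(msgs) for p in procs.values()
                     for msgs in p.recording.values())
    return saved, in_transit


if __name__ == "__main__":
    saved, in_transit = run_demo()
    print(saved, in_transit, saved + in_transit)
```

In this sketch the saved balances plus the recorded in-transit transfers add up to the initial total, which is the consistency property a rollback would rely on: restoring the saved states and replaying the recorded messages loses nothing and duplicates nothing.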
Index Terms:
software fault-tolerance; heterogeneous distributed applications; parallel processing; algorithm design; parallel task scheduling; interconnection network; reliable distributed heterogeneous computing; low cost; run time performance; programming; library prototype; H-Libra; distributed consistent checkpointing; rollback-recovery; user-level network communication protocol; high-level message-passing interface
Citation:
P. Maheshwari, J. Ouyang, "Supporting fault-tolerance in heterogeneous distributed applications," hcw, pp.195, 6th Heterogeneous Computing Workshop (HCW '97), 1997