MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
Brisbane, Australia May 15-May 18
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CCGRID.2001.923171
First IEEE International Symposium on ...
Rajanikanth Batchu, MPI Software Technology, Inc.
Anthony Skjellum, MPI Software Technology, Inc.
Zhenqian Cui, MPI Software Technology, Inc.
Murali Beddhu, MPI Software Technology, Inc.
Jothi P. Neelamegam, Mississippi State University
Yoginder Dandass, Mississippi State University
Manoj Apte, Mississippi State University
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware; these environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency and event handling, as well as evolvability of legacy MPI codes, form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multi-threaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT - real-time MPI - are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in the future.
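As a rough illustration of the n-modular redundancy (nMR) idea mentioned in the abstract, the sketch below runs a computation on n replicas and accepts the value that a strict majority agrees on. It is not the MPI/FT implementation or API: the names nmr_vote, run_nmr, and fault_injector are hypothetical, the replicas run sequentially in one process rather than as MPI ranks, and results are assumed to be hashable and exactly comparable.

```python
from collections import Counter


def nmr_vote(replica_results):
    """Return the value agreed on by a strict majority of replicas.

    Hypothetical helper for illustration; raises if no majority exists.
    """
    if not replica_results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(replica_results).most_common(1)[0]
    # Strict majority: more than half of the replicas must agree.
    if count * 2 <= len(replica_results):
        raise RuntimeError("no majority among replicas")
    return value


def run_nmr(task, n=3, fault_injector=None):
    """Run `task` n times (standing in for n replicas) and vote on the results.

    `fault_injector(replica_id, result)` is an assumed hook used only to
    simulate a corrupted replica in this sketch.
    """
    results = []
    for replica_id in range(n):
        result = task()
        if fault_injector is not None:
            result = fault_injector(replica_id, result)
        results.append(result)
    return nmr_vote(results)
```

With n = 3, a single faulty replica is outvoted by the other two; if every replica disagrees, the vote fails and some other recovery path (for example, restarting from a checkpoint) would have to take over.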
Citation:
Rajanikanth Batchu, Anthony Skjellum, Zhenqian Cui, Murali Beddhu, Jothi P. Neelamegam, Yoginder Dandass, Manoj Apte, "MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing," ccgrid, pp. 26, First IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01), 2001