Architecture of LA-MPI, A Network-Fault-Tolerant MPI
Santa Fe, New Mexico April 26-April 30
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2004.1302920
18th International Parallel and Distr ...
Rob T. Aulwes, Los Alamos National Laboratory
David J. Daniel, Los Alamos National Laboratory
Nehal N. Desai, Los Alamos National Laboratory
Richard L. Graham, Los Alamos National Laboratory
L. Dean Risinger, Los Alamos National Laboratory
Mark A. Taylor, Los Alamos National Laboratory
Timothy S. Woodall, Los Alamos National Laboratory
Mitchel W. Sukalski, Sandia National Laboratories
We discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI), a high-performance, network-fault-tolerant, thread-safe MPI library. LA-MPI is designed for use on terascale clusters, which are inherently unreliable because of their sheer number of system components and the tradeoffs made between cost and performance. We examine in detail the design concepts used to implement LA-MPI. These include reliability features such as application-level checksumming, message retransmission, and automatic message re-routing. We also examine key performance-enhancing features, such as concurrent message routing over multiple, diverse network adapters and protocols, and communication-specific optimizations (e.g., shared memory).
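The reliability features named in the abstract (checksumming, retransmission, re-routing) can be illustrated with a small sketch. It is not taken from LA-MPI: the function names, the CRC-32 choice, the retry count, and the simulated paths are all assumptions made for illustration. It shows only the general idea of verifying a message end to end, resending on a checksum mismatch, and falling back to another path when one keeps failing.

```python
import zlib
from typing import Callable, List, Optional, Tuple

# Hypothetical model of a network path: takes the bytes sent, returns the bytes that arrive.
Path = Callable[[bytes], bytes]


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes) -> Optional[bytes]:
    """Return the payload if the trailing checksum matches, otherwise None."""
    if len(data) < 4:
        return None
    payload, received = data[:-4], int.from_bytes(data[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_reliable(
    payload: bytes, paths: List[Path], retries_per_path: int = 2
) -> Tuple[bytes, int, int]:
    """Retransmit on checksum failure; move to the next path after repeated failures.

    Returns (delivered payload, index of the path that worked, total attempts).
    """
    attempts = 0
    for index, path in enumerate(paths):
        for _ in range(retries_per_path):
            attempts += 1
            delivered = unframe(path(frame(payload)))
            if delivered is not None:
                return delivered, index, attempts
    raise RuntimeError("all paths failed")


def corrupting_path(data: bytes) -> bytes:
    """Simulated faulty adapter: flips one bit in the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


def clean_path(data: bytes) -> bytes:
    """Simulated healthy adapter: delivers the bytes unchanged."""
    return data


if __name__ == "__main__":
    print(send_reliable(b"halo exchange", [corrupting_path, clean_path]))
```

In this toy setup the first path always corrupts the message, so the sender exhausts its retries there and then delivers over the second path. A real library would work with asynchronous acknowledgements and timeouts rather than a synchronous return value.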
Citation:
Rob T. Aulwes, David J. Daniel, Nehal N. Desai, Richard L. Graham, L. Dean Risinger, Mark A. Taylor, Timothy S. Woodall, Mitchel W. Sukalski, "Architecture of LA-MPI, A Network-Fault-Tolerant MPI," ipdps, vol. 1, p. 15b, 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Papers, 2004