Achieving high-performance parallel computing requires a system that is both large-scale and reliable. To address this challenge, we describe the design and implementation of MPICH-OPeN, an implementation of the Message Passing Interface for parallel computing over a peer-to-peer network. For reliability, our implementation uses the Condor standalone checkpoint library and the Chandy-Lamport algorithm, with extensions to make it decentralized. We use the OPeN architecture with an adaptive peer-to-peer protocol that caches connections between peers according to the communication requirements of the parallel processes. We used PlanetLab to compare the performance of our implementation with that of MPICH-P4 and to measure the impact of dynamic peers on parallel program execution.
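
For readers unfamiliar with the Chandy-Lamport algorithm named in the abstract, the following is a minimal single-machine sketch of the textbook version, under stated assumptions. It is not the MPICH-OPeN code, does not use the Condor checkpoint library, and does not include the decentralized extensions the authors describe. The class names (`Process`, `Network`), the integer "balance" used as process state, and the `demo` scenario are all hypothetical and chosen only for illustration. Channels are assumed to be reliable and FIFO, as the algorithm requires.

```python
from collections import deque

MARKER = object()  # sentinel separating pre-snapshot from post-snapshot messages


class Process:
    def __init__(self, balance):
        self.balance = balance   # live application state (hypothetical: an integer)
        self.recorded = None     # local state captured for the snapshot
        self.recording = set()   # senders whose incoming channels are still recorded
        self.in_flight = {}      # sender -> messages recorded as in transit


class Network:
    def __init__(self, balances):
        self.procs = {pid: Process(b) for pid, b in balances.items()}
        # One FIFO channel per ordered pair (src, dst).
        self.channels = {(s, d): deque()
                         for s in self.procs for d in self.procs if s != d}

    def send(self, src, dst, amount):
        # Application message: move `amount` from src to dst.
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, pid):
        # Record local state, start recording every incoming channel,
        # then send a marker on every outgoing channel.
        p = self.procs[pid]
        p.recorded = p.balance
        p.recording = {s for s in self.procs if s != pid}
        p.in_flight = {s: [] for s in p.recording}
        for d in self.procs:
            if d != pid:
                self.channels[(pid, d)].append(MARKER)

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg is MARKER:
            if p.recorded is None:
                self.start_snapshot(dst)  # first marker seen: record now
            p.recording.discard(src)      # this channel's recorded state is final
        else:
            p.balance += msg
            if src in p.recording:
                p.in_flight[src].append(msg)  # message crossed the cut

    def drain(self):
        # Deliver messages round-robin until every channel is empty.
        while any(self.channels.values()):
            for (src, dst), queue in self.channels.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_total(self):
        states = sum(p.recorded for p in self.procs.values())
        transit = sum(sum(msgs) for p in self.procs.values()
                      for msgs in p.in_flight.values())
        return states + transit


def demo():
    net = Network({"A": 10, "B": 20, "C": 30})
    net.send("A", "B", 3)      # in flight when the snapshot starts
    net.start_snapshot("B")
    net.send("C", "A", 5)      # sent before C has seen any marker
    net.drain()
    return net.snapshot_total()
```

In this sketch the recorded process states plus the recorded in-transit messages sum to the same total as the initial balances, which is the consistency property a checkpoint needs in order to be a valid restart point. A real deployment would replace the integer state with a full process checkpoint and the in-memory queues with network connections.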
Index Terms:
parallel program execution, message passing interface, MPICH-OPeN architecture, adaptive peer-to-peer network, high performance parallel computing, large scale reliable system, Condor standalone checkpoint library, Chandy-Lamport algorithm, adaptive peer-to-peer protocol, PlanetLab, MPICH-P4
Citation:
L. Ni, A. Harwood, "An Implementation of the Message Passing Interface over an Adaptive Peer-to-Peer Network," 2006 15th IEEE International Conference on High Performance Distributed Computing (HPDC), pp. 371-372, 2006.