A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform
Seattle, Washington, June 21, 2003
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CLADE.2003.1209999
International Workshop on Challenges ...
Christian Engelmann, Oak Ridge National Laboratory
Al Geist, Oak Ridge National Laboratory
This paper discusses fault tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Because a 100,000-processor system will be less reliable than smaller systems, scientific applications need to recover from failures more efficiently. In this paper, we adapt the existing technique of diskless checkpointing to such large distributed systems in order to equip existing scientific algorithms with super-scalable fault tolerance. First, we discuss the method of diskless checkpointing; then we adapt this technique to super-scale architectures; finally, we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault tolerance.
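For readers unfamiliar with diskless checkpointing, the following is a minimal sketch of the general idea as it is commonly described: each process keeps a checkpoint in memory, and a parity block held by another process allows one lost checkpoint to be rebuilt without disk I/O. It is not the adapted super-scale algorithm from this paper, whose details the abstract does not give. The XOR parity scheme, the four simulated processes, and the function names (xor_bytes, make_parity, recover) are all assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Combine all in-memory checkpoints into one parity block.

    Assumption: the parity block lives in the memory of a spare process.
    """
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Four hypothetical processes, each holding an 8-byte in-memory checkpoint.
checkpoints = [bytes([rank] * 8) for rank in range(1, 5)]
parity = make_parity(checkpoints)

# Simulate the loss of one process and rebuild its state from memory only.
lost = 2
survivors = checkpoints[:lost] + checkpoints[lost + 1:]
restored = recover(survivors, parity)
assert restored == checkpoints[lost]
```

A single parity block of this kind tolerates only one failure per group of processes; handling more simultaneous failures, or many thousands of processes, needs a different arrangement, which is the kind of problem the paper addresses.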
Citation:
Christian Engelmann, Al Geist, "A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform," clade, p. 47, International Workshop on Challenges of Large Applications in Distributed Environments, 2003