The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery
Brasov, Romania September 15-September 19
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PACT.2007.73
16th International Conference on Para ...
Xuejun Yang, National University of Defense Technology, China
Yunfei Du, National University of Defense Technology, China
Panfeng Wang, National University of Defense Technology, China
Hongyi Fu, National University of Defense Technology, China
Jia Jia, National University of Defense Technology, China
Zhiyuan Wang, National University of Defense Technology, China
Guang Suo, National University of Defense Technology, China
This paper addresses fault tolerance in parallel computing and proposes a new method, parallel recomputing, which recovers from failures automatically by having the surviving processes recompute the workload of failed processes in parallel. The paper first defines the fault tolerant parallel algorithm (FTPA) as a parallel algorithm that tolerates failures by parallel recomputing. It then proposes an inter-process definition-use relationship analysis method, based on conventional definition-use analysis, that reveals the relationships between variables in different processes. Guided by this method, the paper gives principles for designing fault tolerant parallel algorithms. Finally, the authors present FTPA designs for matrix-matrix multiplication and NPB kernels and evaluate them experimentally on a cluster system. The experimental results show that the overhead of FTPA is less than the overhead of checkpointing.
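The idea of parallel recomputing can be illustrated with a small sequential simulation of matrix-matrix multiplication. This is only a sketch under stated assumptions, not the authors' FTPA design: the function names (`multiply_rows`, `partition_rows`, `multiply_with_recovery`), the round-robin row assignment, and the assumption that the input matrices remain available to every surviving process are all hypothetical choices made for illustration. Process ranks are simulated with plain loops rather than a real message-passing library.

```python
from typing import Dict, List

Matrix = List[List[float]]


def multiply_rows(a: Matrix, b: Matrix, rows: List[int]) -> Dict[int, List[float]]:
    """Compute only the requested rows of the product a @ b."""
    inner, cols = len(b), len(b[0])
    return {
        i: [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in rows
    }


def partition_rows(n_rows: int, n_procs: int) -> Dict[int, List[int]]:
    """Assign output rows to simulated process ranks round-robin (an assumption)."""
    return {rank: list(range(rank, n_rows, n_procs)) for rank in range(n_procs)}


def multiply_with_recovery(a: Matrix, b: Matrix, n_procs: int, failed_rank: int) -> Matrix:
    """Simulate one failed rank whose rows are recomputed by the survivors."""
    owned = partition_rows(len(a), n_procs)
    survivors = [rank for rank in owned if rank != failed_rank]
    if not survivors:
        raise ValueError("at least one surviving process is required")

    results: Dict[int, List[float]] = {}

    # Normal phase: every surviving rank computes the rows it owns.
    for rank in survivors:
        results.update(multiply_rows(a, b, owned[rank]))

    # Recovery phase: the failed rank's rows are split across the survivors,
    # so the lost work is redone by several processes instead of one.
    # Assumes a and b are still readable by the survivors.
    lost = owned[failed_rank]
    for idx, _rank in enumerate(survivors):
        share = lost[idx::len(survivors)]
        results.update(multiply_rows(a, b, share))

    return [results[i] for i in range(len(a))]
```

In a real cluster each loop iteration over `survivors` would run on a separate process, so the recovery phase takes roughly the failed workload divided by the number of survivors, which is the intuition behind recovering by recomputation rather than by rolling back to a checkpoint.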
Citation:
Xuejun Yang, Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia, Zhiyuan Wang, Guang Suo, "The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery," pact, pp.199-212, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), 2007