Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing
Singapore May 16-May 19
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CCGRID.2006.45
Sixth IEEE International Symposium on ...
Yawei Li, Illinois Institute of Technology, USA
Zhiling Lan, Illinois Institute of Technology, USA
As clusters grow in scale, it is becoming hard for long-running applications to complete without encountering failures. Checkpointing/restart is widely used to provide basic fault tolerance, yet it suffers from high overhead and is reactive by nature. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to optimally choose among migration, checkpointing, and no action, so as to reduce application execution time in the presence of failures. A cost-based evaluation model is presented for making this decision dynamically at run-time. Using an actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro outperforms the traditional checkpointing/restart strategy by 13%-30% in terms of application execution time despite failures, which is a significant performance improvement for long-running applications.
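The abstract does not spell out the cost-based evaluation model, so the following is only a minimal sketch of how a three-way decision of this kind could look, not the FT-Pro model itself. All names (`Costs`, `choose_action`, `p_fail`, `unsaved_work`) and the cost formulas are assumptions made for illustration. In particular, the sketch assumes that a successful migration fully avoids the predicted failure and that a failure costs the unsaved work plus a restart.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action overheads, in seconds."""
    checkpoint: float  # time to write one checkpoint
    migration: float   # time to move processes off a suspect node
    restart: float     # time to restart from the last checkpoint


def choose_action(p_fail: float, unsaved_work: float, costs: Costs):
    """Pick the action with the lowest expected time cost.

    p_fail       -- assumed probability of a failure before the next decision point
    unsaved_work -- seconds of computation done since the last checkpoint
    Returns a tuple (action, expected_cost).
    """
    expected = {
        # Do nothing: a failure loses the unsaved work and forces a restart.
        "none": p_fail * (unsaved_work + costs.restart),
        # Checkpoint now: pay the overhead; a failure then costs only a restart.
        "checkpoint": costs.checkpoint + p_fail * costs.restart,
        # Migrate: pay the overhead; assumed here to avoid the failure entirely.
        "migrate": costs.migration,
    }
    action = min(expected, key=expected.get)
    return action, expected[action]


if __name__ == "__main__":
    costs = Costs(checkpoint=60.0, migration=120.0, restart=300.0)
    for p_fail, unsaved in [(0.01, 600.0), (0.1, 3600.0), (0.3, 3600.0)]:
        action, cost = choose_action(p_fail, unsaved, costs)
        print(f"p_fail={p_fail}, unsaved={unsaved}s: {action} (~{cost:.0f}s expected)")
```

With these illustrative numbers, a low predicted failure probability favors no action, a moderate one favors checkpointing, and a high one favors migration, which is the qualitative behavior the abstract describes. A real model would also need to account for prediction errors and for failures that occur during the chosen action.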
Citation:
Yawei Li, Zhiling Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing," ccgrid, pp.531-538, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006