This paper presents a fault tolerance framework for applications that process data in a pipelined fashion through a distributed network of user-defined operations. The framework saves intermediate results and the messages exchanged among application components in a distributed data management system, so that execution can recover quickly from failures. Experimental results show that the framework scales well and introduces very little overhead to application execution.
Citation:
Tulio Tavares, George Teodoro, Tahsin Kurc, Renato Ferreira, Dorgival Guedes, Wagner Meira Jr., Umit Catalyurek, Shannon Hastings, Scott Oster, Steve Langella, Joel Saltz, "An Efficient and Reliable Scientific Workflow System," Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), pp. 445-452, 2007.