A major challenge in a dynamic Grid with thousands of machines connected to each other is fault tolerance. The more resources and components are involved, the more complicated and error-prone the system becomes. Migol is an adaptive Grid middleware that addresses the fault tolerance of Grid applications and services by providing the capability to recover applications from checkpoint files automatically. A critical aspect of automatic recovery is the availability of checkpoint files: if a resource becomes unavailable, it is very likely that the associated storage is also unreachable, e.g., due to a network partition. One strategy to increase the availability of checkpoints is replication. In this paper, we present the Checkpoint Replication Service. A key feature of this service is the ability to automatically replicate and monitor checkpoints in the Grid.
Index Terms:
Grid Computing, Checkpointing, Replication
Citation:
Andre Luckow, Bettina Schnor, "Adaptive Checkpoint Replication for Supporting the Fault Tolerance of Applications in the Grid," 2008 Seventh IEEE International Symposium on Network Computing and Applications (NCA), pp. 299-306, 2008