This paper explores the diagnosis of cluster-based parallel architectures and proposes a hierarchical strategy well suited to them. The strategy avoids a costly, fully distributed diagnosis of the network by running an adaptive diagnosis algorithm within each cluster and collecting all test results at the host level. Key results include realistic fault and architecture models, an adaptive cluster diagnosis algorithm, and a global diagnosis strategy for cluster-based parallel machines.
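To make the two-level idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes permanent faults, a fault-free host, and that a fault-free tester always reports the true state of the node it tests. The names `diagnose_cluster`, `diagnose_system`, and `simulated_test` are hypothetical, as is the choice to let the host test nodes directly until it finds a fault-free entry point in each cluster.

```python
from typing import Callable, Dict, List

# Hypothetical test primitive: test(tester, tested) is True when `tested` passes.
TestFn = Callable[[str, str], bool]


def diagnose_cluster(trusted: str, remaining: List[str], test: TestFn) -> Dict[str, bool]:
    """Adaptive pass inside one cluster, starting from a node assumed fault-free."""
    status = {trusted: True}
    tester = trusted
    for node in remaining:
        ok = test(tester, node)
        status[node] = ok
        if ok:
            # Adaptive step: the outcome of this test decides who runs the next one.
            tester = node
    return status


def diagnose_system(clusters: Dict[str, List[str]], test: TestFn,
                    host: str = "host") -> Dict[str, Dict[str, bool]]:
    """Host-level pass: find a trusted node per cluster, then collect its results."""
    report: Dict[str, Dict[str, bool]] = {}
    for name, nodes in clusters.items():
        status: Dict[str, bool] = {}
        for i, node in enumerate(nodes):
            if test(host, node):
                # Assumption: the host can test a cluster node directly.
                status.update(diagnose_cluster(node, nodes[i + 1:], test))
                break
            status[node] = False  # failed the host's own test
        report[name] = status
    return report


# Simulated fault set, for illustration only.
FAULTY = {"c0n2", "c1n0"}


def simulated_test(tester: str, tested: str) -> bool:
    # Every tester that is actually used here is fault-free, so results are accurate.
    return tested not in FAULTY


clusters = {"c0": ["c0n0", "c0n1", "c0n2", "c0n3"], "c1": ["c1n0", "c1n1"]}
report = diagnose_system(clusters, simulated_test)
```

In this sketch each cluster needs only one test per node plus the host's probes, and the host never takes part in testing beyond finding an entry point; it only gathers the per-cluster results into `report`.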
Citation:
O. Benkahla, C. Aktouf, C. Robach, "System-Diagnosis of Cluster-Based Parallel Architectures," 4th Euromicro Workshop on Parallel and Distributed Processing (PDP '96), p. 305, 1996