Architectural Support for System Software on Large-Scale Clusters
Montreal, Quebec, Canada August 15-August 18
2004 International Conference on Para ...
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPP.2004.1327962
Juan Fernández, Universidad de Murcia and Los Alamos National Laboratory
Eitan Frachtenberg, Los Alamos National Laboratory
Fabrizio Petrini, Los Alamos National Laboratory
Scalable management of distributed resources is one of the major challenges in the deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel computing: parallel I/O, deterministic behavior, and responsiveness. Meeting these requirements with commodity hardware and operating systems is difficult because they were not designed to support global management of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, inspired by concepts from the BSP and SIMD computational models, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
Index Terms:
Cluster computing, cluster operating system, network hardware, debuggability, resource management, fault tolerance.
Citation:
Juan Fernández, Eitan Frachtenberg, Fabrizio Petrini, "Architectural Support for System Software on Large-Scale Clusters," 2004 International Conference on Parallel Processing (ICPP'04), pp. 519-528, 2004