• DocumentCode
    3351673
  • Title
    Architectural support for system software on large-scale clusters
  • Author
    Fernández, Juan ; Frachtenberg, Eitan ; Petrini, Fabrizio ; Davis, Kei ; Sancho, Jose C.
  • Author_Institution
    Departamento de Ingenieria y Tecnologia de Computadores, Murcia Univ., Spain
  • fYear
    2004
  • fDate
    15-18 Aug. 2004
  • Firstpage
    519
  • Abstract
    Scalable management of distributed resources is one of the major challenges in the deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel computing: parallel I/O, deterministic behavior, and responsiveness. Meeting these requirements with commodity hardware and operating systems is difficult because they were not designed to support global management of a large-scale system. We propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, inspired by concepts from the BSP and SIMD computational models, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
  • Keywords
    fault tolerant computing; message passing; network operating systems; parallel processing; resource allocation; workstation clusters; SIMD computational model; cluster computing; cluster operating system; distributed resources; large-scale cluster deployment; network hardware mechanism; parallel I/O; parallel computing; resource management; scalable system software; single-node workstation responsiveness; transparent fault tolerance; Delay; Fault tolerance; Hardware; Large-scale systems; Operating systems; Parallel processing; Power system interconnection; Power system management; Resource management; System software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing, 2004. ICPP 2004. International Conference on
  • ISSN
    0190-3918
  • Print_ISBN
    0-7695-2197-5
  • Type
    conf
  • DOI
    10.1109/ICPP.2004.1327962
  • Filename
    1327962