• DocumentCode
    2011095
  • Title

    Communication across fault-containment firewalls on the SGI Origin

  • Author

    Ghosh, Kaushik ; Christie, Allan J.

  • Author_Institution
    Silicon Graphics Comput. Syst., Mountain View, CA, USA
  • fYear
    1998
  • fDate
    1-4 Feb 1998
  • Firstpage
    277
  • Lastpage
    287
  • Abstract
    Scalability and reliability are inseparable in high-performance computing. Fault isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between failure domains. On the other hand, relaxing fault isolation domains allows efficient communication, but at the risk of failure propagation, and thus reduced reliability. We are concerned with finding a middle ground between these extremes. We first review a few salient aspects of the SGI Origin-2000 architecture, mentioning the hardware features germane to efficient communication and to building protection firewalls. Then, we describe a mechanism for risk-free, point-to-point communication between processes on distinct failure domains. Quoting performance numbers, we show that the overheads of crossing domains render this mechanism unattractive for small messages. To address this issue, we describe a mechanism for controlled opening of the firewalls, thereby achieving explicit inter-partition shared memory for communication. We describe the kernel software that addresses the resulting reliability issues, and discuss how familiar IPC mechanisms such as MPI and SysV shared memory can use the explicit shared memory to advantage. Finally, based on the lessons learnt, we discuss some future directions and draw concluding remarks.
  • Keywords
    fault tolerant computing; parallel architectures; reliability; shared memory systems; IPC mechanisms; SGI Origin-2000; fault-containment firewalls; high-performance computing; kernel software; performance numbers; shared-memory; Buildings; Delay; Graphics; Hardware; Kernel; Operating systems; Protection; Registers; Scalability; Silicon
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High-Performance Computer Architecture, 1998. Proceedings., 1998 Fourth International Symposium on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    0-8186-8323-6
  • Type

    conf

  • DOI
    10.1109/HPCA.1998.650567
  • Filename
    650567