• DocumentCode
    1468613
  • Title

    A gracefully degrading massively parallel system using the BSP model, and its evaluation

  • Author

    Savva, Andreas ; Nanya, Takashi

  • Author_Institution
    Fujitsu Labs. Ltd., Kawasaki, Japan
  • Volume
    48
  • Issue
    1
  • fYear
    1999
  • fDate
    1/1/1999 12:00:00 AM
  • Firstpage
    38
  • Lastpage
    52
  • Abstract
    The Bulk-Synchronous Parallel (BSP) Model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). By using the BSP model with RSM, we construct a gracefully degrading massively parallel system using a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up the reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and graceful degradation of the system experimentally by an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
  • Keywords
    fault tolerant computing; parallel processing; performance evaluation; BSP model; asymptotically optimal emulation; bulk-synchronous parallel model; fault tolerance; global memory integrity; graceful degradation; gracefully degrading massively parallel system; logical properties; memory duplication; parallel random access machine; randomized shared memory; system performance; Computational modeling; Concurrent computing; Degradation; Emulation; Failure analysis; Fault tolerant systems; Multiprocessor interconnection networks; Performance analysis; Phase change random access memory; System performance
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/12.743410
  • Filename
    743410