DocumentCode
1468613
Title
A gracefully degrading massively parallel system using the BSP model, and its evaluation
Author
Savva, Andreas ; Nanya, Takashi
Author_Institution
Fujitsu Labs. Ltd., Kawasaki, Japan
Volume
48
Issue
1
fYear
1999
fDate
1/1/1999
Firstpage
38
Lastpage
52
Abstract
The Bulk-Synchronous Parallel (BSP) Model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). By using the BSP model with RSM, we construct a gracefully degrading massively parallel system using a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up the reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and graceful degradation of the system experimentally by an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
Keywords
fault tolerant computing; parallel processing; performance evaluation; BSP model; asymptotically optimal emulation; bulk-synchronous parallel model; fault tolerance; global memory integrity; graceful degradation; gracefully degrading massively parallel system; logical properties; memory duplication; parallel random access machine; randomized shared memory; system performance; Computational modeling; Concurrent computing; Degradation; Emulation; Failure analysis; Fault tolerant systems; Multiprocessor interconnection networks; Performance analysis; Phase change random access memory; System performance
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/12.743410
Filename
743410