Title :
Rebound: Scalable checkpointing for coherent shared memory
Author :
Agarwal, Rishi ; Garg, Pranav ; Torrellas, Josep
Author_Institution :
Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Abstract :
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
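The idea of building checkpoint and rollback operations around dynamic groups of communicating processors can be illustrated with a small software sketch. It is not the paper's hardware mechanism or its distributed algorithms. It assumes that every data transfer from a writer to a reader is recorded as a dependence, that a checkpoint must include every processor the initiator has transitively consumed data from, and that a rollback must include every processor that has transitively consumed data from the failed one. The names DependenceTracker, record_transfer, checkpoint_set, and rollback_set are hypothetical.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Illustrative model of inter-thread dependence tracking (hypothetical names)."""

    def __init__(self):
        # producers[p]: processors whose data p has read since the last checkpoint.
        self.producers = defaultdict(set)
        # consumers[p]: processors that have read data written by p.
        self.consumers = defaultdict(set)

    def record_transfer(self, writer, reader):
        """Record that `reader` obtained data last written by `writer`.

        Assumption: in a directory protocol, such transfers are visible as
        coherence transactions, so they can be observed without extra messages.
        """
        if writer != reader:
            self.producers[reader].add(writer)
            self.consumers[writer].add(reader)

    @staticmethod
    def _closure(start, edges):
        """Breadth-first transitive closure over `edges`, including `start`."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in edges.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def checkpoint_set(self, initiator):
        """Processors assumed to checkpoint together with `initiator`."""
        return self._closure(initiator, self.producers)

    def rollback_set(self, failed):
        """Processors assumed to roll back together with `failed`."""
        return self._closure(failed, self.consumers)
```

With transfers recorded from processor 0 to 1 and from 1 to 2, a checkpoint started by processor 2 involves processors 0, 1, and 2 under these assumptions, while an unrelated pair such as 3 and 4 is left undisturbed. That locality is what distinguishes this approach from a global checkpoint across all processors.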
Keywords :
cache storage; checkpointing; microprocessor chips; parallel programming; shared memory systems; IO-intensive loads; Rebound; barrier synchronization; coherent shared memory; coordinated local checkpointing; directory-based cache coherence; hardware-based global checkpointing schemes; interthread dependences; manycores; multiple checkpoints; parallel programs; rollback operations; scalability barriers; scalable checkpointing; shared-memory machines; Checkpointing; Coherence; Hardware; Program processors; Protocols; Registers; Faults; Scalable Checkpointing; Shared-Memory Multiprocessors;
Conference_Title :
Computer Architecture (ISCA), 2011 38th Annual International Symposium on
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-4503-0472-6