Title :
Energy Consumption of Resilience Mechanisms in Large Scale Systems
Author :
Mills, B. ; Znati, Taieb ; Melhem, Rami ; Ferreira, Kurt B. ; Grant, Ryan E.
Author_Institution :
Dept. of Comput. Sci., Univ. of Pittsburgh, Pittsburgh, PA, USA
Abstract :
As HPC systems continue to grow to meet the requirements of tomorrow's exascale-class systems, two of the biggest challenges are power consumption and system resilience. On current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods, such as process replication, have been suggested to augment checkpoint/restart. In this paper we address resilience and power together, in contrast to much of the existing work, which treats them independently. Using an analytical model that accounts for both power consumption and failures, we study the performance of checkpoint and replication-based techniques on current and future systems and use power measurements from current systems to validate our findings. Lastly, in an attempt to optimize power consumption for replication, we introduce a new protocol termed shadow replication, which not only reduces energy consumption but also produces faster response times than checkpoint/restart and traditional replication when operating under system power constraints.
Keywords :
checkpointing; parallel processing; power consumption; HPC systems; analytical model; checkpoint-restart; dominant resilience technique; energy consumption; exascale-class systems; large scale systems; power consumption; power measurements; process replication; replication-based techniques; shadow replication; system power constraints; system resilience mechanisms; Checkpointing; Energy consumption; Fault tolerance; Fault tolerant systems; Resilience; Sockets; Time factors; energy-aware; fault tolerance; power-aware; resilience; scheduling; shadow computing;
Conference_Title :
Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on
Conference_Location :
Torino
DOI :
10.1109/PDP.2014.111