• DocumentCode
    125641
  • Title

    Energy Consumption of Resilience Mechanisms in Large Scale Systems

  • Author

    Mills, B.; Znati, Taieb; Melhem, Rami; Ferreira, Kurt B.; Grant, Ryan E.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Pittsburgh, Pittsburgh, PA, USA
  • fYear
    2014
  • fDate
    12-14 Feb. 2014
  • Firstpage
    528
  • Lastpage
    535
  • Abstract
    As HPC systems continue to grow to meet the requirements of tomorrow's exascale-class systems, two of the biggest challenges are power consumption and system resilience. On current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods have been suggested to augment checkpoint/restart, for example process replication. In this paper we address resilience and power together, in contrast to much of the completed work, which addresses them independently. Using an analytical model that accounts for both power consumption and failures, we study the performance of checkpoint and replication-based techniques on current and future systems and use power measurements from current systems to validate our findings. Lastly, in an attempt to optimize power consumption for replication, we introduce a new protocol termed shadow replication, which not only reduces energy consumption but also produces faster response times than checkpoint/restart and traditional replication when operating under system power constraints.
  • Keywords
    checkpointing; parallel processing; power consumption; HPC systems; analytical model; checkpoint-restart; dominant resilience technique; energy consumption; exascale-class systems; large scale systems; power measurements; process replication; replication-based techniques; shadow replication; system power constraints; system resilience mechanisms; Checkpointing; Energy consumption; Fault tolerance; Fault tolerant systems; Resilience; Sockets; Time factors; energy-aware; fault tolerance; power-aware; resilience; scheduling; shadow computing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on
  • Conference_Location
    Torino
  • ISSN
    1066-6192
  • Type

    conf

  • DOI
    10.1109/PDP.2014.111
  • Filename
    6787325