Regional Information Center for Science and Technology - Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

DocumentCode :

228757

Title :

Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

Author :

Sarood, Osman ; Langer, Akhil ; Gupta, Arpan ; Kale, Laxmikant

Author_Institution :

Dept. of Comput. Sci., Univ. of Illinois Urbana-Champaign, Urbana, IL, USA

fYear :

2014

fDate :

16-21 Nov. 2014

Firstpage :

807

Lastpage :

818

Abstract :

Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.

Keywords :

computer centres; mainframes; parallel machines; power consumption; resource allocation; scheduling; SLURM scheduling policy; adaptive runtime system; hardware facilitated capability; network interconnects; node allocation; online resource manager; overprovisioned HPC data centers; performance modeling scheme; power 4.75 MW; power allocation; power consumption; power efficient computers; power response characteristics; resource allocation decisions; software-based online resource management system; strict power budget; supercomputers; throughput maximization; Linear programming; Mathematical model; Parallel processing; Power demand; Resource management; Throughput; Time-frequency analysis;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for

Conference_Location :

New Orleans, LA

Print_ISBN :

978-1-4799-5499-5

Type :

conf

DOI :

10.1109/SC.2014.71

Filename :

7013053

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=228757