ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers

Author

Kim, Yoongu ; Han, Dongsu ; Mutlu, Onur ; Harchol-Balter, Mor

Author_Institution

Carnegie Mellon Univ., Pittsburgh, PA, USA

fYear

2010

fDate

9-14 Jan. 2010

Firstpage

Lastpage

Abstract

Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable - as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant finegrained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of mult- - iprogrammed SPEC 2006 workloads and systems with 4-32 cores and 1-16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS´s performance benefit increases as the number of cores increases.

Keywords

Pareto distribution; digital storage; microprocessor chips; multiprocessing systems; scheduling; ATLAS memory scheduling; Pareto workload distribution; adaptive per-thread least-attained-service; chip multiprocessor; multiple memory controllers; multiprogrammed SPEC 2006 workloads; scheduling algorithm; single-server queue; Adaptive control; Algorithm design and analysis; Bandwidth; Control systems; Feeds; Programmable control; Queueing analysis; Scheduling algorithm; Throughput; Yarn;

fLanguage

English

Publisher

ieee

Conference_Titel

High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on

Conference_Location

Bangalore

ISSN

1530-0897

Print_ISBN

978-1-4244-5658-1

Type

conf

DOI

10.1109/HPCA.2010.5416658

Filename

5416658

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3522940