A low-cost memory interface for high-throughput accelerators

Author

Huang, Jing ; Huang, Yuanjie ; Temam, Olivier ; Ienne, Paolo ; Chen, Yunji ; Wu, Chengyong

Author_Institution

State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China

fYear

2014

fDate

12-17 Oct. 2014

Firstpage

1

Lastpage

10

Abstract

Heterogeneous multi-cores, a mix of cores and accelerators, are becoming prevalent. These accelerators are designed for both speed and energy improvements, and thus they increasingly come with a large number of load/store ports to achieve a high degree of parallelism. However, beyond GPGPUs, accelerators such as ASICs and CGRAs are increasingly capable of accelerating computations with irregular control flow and memory accesses; as a result, such accelerators need to be connected to caches instead of scratchpads, yet few studies focus on accelerator-to-cache interfaces. The main existing alternative is the Load/Store Queue (LSQ), traditionally used to connect superscalar processors to caches and memory, but in the context of accelerators, LSQs are overkill and could significantly reduce the area and power benefits of accelerators. Moreover, we show that they are poorly suited to accelerators connected to multi-banked caches. In this article, we propose a fast accelerator-to-cache interface with a moderate area and power footprint compared to LSQs, even for a large number of load/store ports. For that purpose, we introduce a set of low-overhead techniques for ensuring in-order delivery of requests to/from cache banks. We synthesize and lay out at 65 nm the design of both our interface and an LSQ specially adapted to accelerators for a fair comparison. We find that our interface achieves on average 78% of the performance of an LSQ using only 16% of the area and 24% of the power.

Keywords

application specific integrated circuits; cache storage; graphics processing units; integrated circuit design; multiprocessing systems; ASIC; CGRA; GPGPU; LSQ; accelerator-to-cache interfaces; cache banks; control flow; heterogeneous multicores; high-throughput accelerators; load-store queues; memory accesses; memory interface; multibanked caches; superscalar processors; Acceleration; Application specific integrated circuits; Computer architecture; Layout; Out of order; Ports (Computers)

fLanguage

English

Publisher

IEEE

Conference_Titel

Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2014 International Conference on

Conference_Location

Jaypee Greens

Type

conf

DOI

10.1145/2656106.2656109

Filename

6972464