• DocumentCode
    2534883
  • Title
    Pinot: speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities

  • Author
    Ohsawa, Taku; Takagi, Masamichi; Kawahara, Shoji; Matsushita, Satoshi

  • Author_Institution
    NEC Corp., Japan
  • fYear
    2005
  • fDate
    12-16 Nov. 2005
  • Abstract
    We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limited parallelism and the overhead incurred by parallelization, it is better to extract coarse-grain parallelism. However, coarse-grain parallelism is unevenly distributed, appearing mainly in some programs (chiefly numerical ones) and some program portions. Therefore, exploiting both coarse- and fine-grain parallelism is key to the performance of speculative multithreading. The features of Pinot are as follows: (1) A parallelizing tool extracts parallelism at any level of granularity (even, e.g., ten thousand instructions) from any program substructure (e.g., loops, calls, or basic blocks). The tool uses a formulation that reduces the parallelization process to a combinatorial optimization problem. (2) A parallel execution model with extended thread control instructions is designed to minimize the increase in dynamic instruction count. The model employs implicit thread termination and cancellation, as well as register value transfer without synchronization. (3) A versioning cache called the version resolution cache (VRC) supports both coarse- and fine-grained speculative multithreading. The VRC operates as a large buffer for coarse-grained multi-threading; in addition, it provides low-latency inter-thread communication with an update-based protocol for fine-grained multi-threading. We performed cycle-accurate simulations with 38 programs from the SPEC and MiBench benchmarks. The speedup with a 4-processor-element Pinot is up to 3.7 times, and 2.2 times on geometric mean, over a conventional processor. The speedup on one program (susan) drops from 3.7 to 1.6 when the speculative buffer size is limited to 256 bytes, confirming that exploiting coarse-grain parallelism is essential to the improved performance. An FPGA implementation shows 32% area overhead and a 12% increase in critical path delay compared to a conventional processor.
  • Keywords
    multi-threading; FPGA implementation; MiBench benchmarks; Pinot; SPEC benchmarks; coarse-grain parallelism; coarse-grained speculative multithreading; combinatorial optimization problem; critical path delay; fine-grain parallelism; fine-grained speculative multithreading; implicit thread termination; low latency inter-thread communication; multithreading processor architecture; parallel execution; parallelizing tool; register value transfer; speculative buffer size; thread control instructions; update-based protocol; version resolution cache; Delay; Design optimization; Field programmable gate arrays; Multithreading; Protocols; Throughput;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), 2005
  • Print_ISBN
    0-7695-2440-0
  • Type
    conf

  • DOI
    10.1109/MICRO.2005.26
  • Filename
    1540950