• DocumentCode
    2534883
  • Title
    Pinot: speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities

  • Author
    Ohsawa, Taku; Takagi, Masamichi; Kawahara, Shoji; Matsushita, Satoshi

  • Author_Institution
    NEC Corp., Japan
  • fYear
    2005
  • fDate
    12-16 Nov. 2005
  • Abstract
    We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limited parallelism and the overhead incurred by parallelization, it is better to extract coarse-grain parallelism. However, coarse-grain parallelism is unevenly distributed, appearing mainly in some programs (chiefly numerical ones) and some program portions. Therefore, exploiting both coarse- and fine-grain parallelism is key to the performance of speculative multithreading. The features of Pinot are as follows: (1) A parallelizing tool extracts parallelism at any level of granularity (even, e.g., ten thousand instructions) from any program substructure (e.g., loops, calls, or basic blocks). The tool uses a formulation that reduces the parallelization process to a combinatorial optimization problem. (2) A parallel execution model with extended thread control instructions is designed to minimize the increase in dynamic instruction count. The model employs implicit thread termination and cancellation, as well as register value transfer without synchronization. (3) A versioning cache called the version resolution cache (VRC) supports both coarse- and fine-grained speculative multithreading. The VRC operates as a large buffer for coarse-grained multi-threading; in addition, it provides low-latency inter-thread communication with an update-based protocol for fine-grained multi-threading. We performed cycle-accurate simulations with 38 programs from the SPEC and MiBench benchmarks. The speedup with a 4-processor-element Pinot is up to 3.7 times, and 2.2 times on geometric mean, over a conventional processor. The speedup on one program (susan) drops from 3.7 to 1.6 when the speculative buffer size is limited to 256 bytes, confirming that exploiting coarse-grain parallelism is essential to the improved performance. An FPGA implementation shows 32% area overhead and a 12% increase in critical path delay compared to a conventional processor.
  • Keywords
    multi-threading; FPGA implementation; MiBench benchmarks; Pinot; SPEC benchmarks; coarse-grain parallelism; coarse-grained speculative multithreading; combinatorial optimization problem; critical path delay; fine-grain parallelism; fine-grained speculative multithreading; implicit thread termination; low latency inter-thread communication; multithreading processor architecture; parallel execution; parallelizing tool; register value transfer; speculative buffer size; thread control instructions; update-based protocol; version resolution cache; Delay; Design optimization; Field programmable gate arrays; Multithreading; Protocols; Throughput;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), 2005
  • Print_ISBN
    0-7695-2440-0
  • Type
    conf

  • DOI
    10.1109/MICRO.2005.26
  • Filename
    1540950