• DocumentCode
    572388
  • Title

    OUTRIDER: Efficient memory latency tolerance with decoupled strands

  • Author

    Crago, Neal C. ; Patel, Sanjay J.

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2011
  • fDate
    4-8 June 2011
  • Firstpage
    117
  • Lastpage
    128
  • Abstract
    We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order microarchitecture. Moreover, instead of adding more threads as is done in modern GPUs, Outrider can tolerate memory latency with fewer threads and reduced contention for resources shared amongst threads. We demonstrate that Outrider can outperform single-threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% on data-parallel applications in a 1024-core system. Moreover, Outrider achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved area efficiency compared to a multithreaded core.
  • Keywords
    computational complexity; multi-threading; multiprocessing systems; parallel architectures; performance evaluation; pipeline processing; resource allocation; storage management; tolerance analysis; 1024-core system; 4-way simultaneous multithreaded core; OUTRIDER; data parallel applications; decoupled strands; highly threaded workloads; low-complexity in-order microarchitecture; memory latency tolerance; memory-accessing instruction; memory-consuming instruction; multiple decoupled instruction streams; performance improvement; processor pipeline; resource sharing; throughput-oriented processor architecture; Abstracts; Biological system modeling; Heating; Instruction sets;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture (ISCA), 2011 38th Annual International Symposium on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6897
  • Print_ISBN
    978-1-4503-0472-6
  • Type

    conf

  • Filename
    6307751