• DocumentCode
    625583
  • Title

    Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor

  • Author

    Heinecke, Alexander ; Vaidyanathan, Karthikeyan ; Smelyanskiy, Mikhail ; Kobotov, Alexander ; Dubtsov, Roman ; Henry, Greg ; Shet, Aniruddha G. ; Chrysos, Grigorios ; Dubey, Pradeep

  • Author_Institution
    Dept. of Inf., Tech. Univ. München, Munich, Germany
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    126
  • Lastpage
    137
  • Abstract
    Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.
  • Keywords
    coprocessors; linear algebra; multiprocessing systems; processor scheduling; 100-node cluster; DGEMM implementation; Intel Xeon Phi coprocessor; Linpack benchmark; dense linear algebra; dynamic scheduling; enhanced look-ahead scheme; hardware accelerators; Knights Corner; multicore processors; multinode systems; salient architectural features; single node systems; Bandwidth; Kernel; Matrix decomposition; Prefetching; Registers; Tiles; Vectors; HPL; LU factorization; SIMD; TLP; Xeon Phi; hybrid parallelization; panel factorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type

    conf

  • DOI
    10.1109/IPDPS.2013.113
  • Filename
    6569806