Regional Information Center for Science and Technology - The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor

DocumentCode :

652520

Title :

The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor

Author :

Calvin, C. ; Fan Ye ; Petiton, S.

Author_Institution :

DM2S, Commissariat à l'Énergie Atomique (CEA), France

fYear :

2013

fDate :

28-30 Oct. 2013

Firstpage :

166

Lastpage :

173

Abstract :

In this paper we investigate different multithreading programming paradigms on x86 architectures, namely the recently released Intel Xeon Phi coprocessor and the commonly used Intel Xeon processors, as well as on the NVIDIA K20 GPU, which represents the cutting-edge general-purpose graphics processing unit. The numerical algorithm selected for the study is the power method, which is widely used to compute the dominant eigenvalue of a matrix. This work focuses on dense linear algebra. Frequently used parallelism techniques for multi-core and many-core processors include OpenMP, Intel Cilk Plus, and Intel Threading Building Blocks (TBB), along with optimized computing libraries such as the Intel Math Kernel Library (MKL) and the NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library. Optimized implementations of these techniques were applied separately to the aforementioned architectures. Because a single programming model may not satisfy the growing demand for performance, we also explored possible combinations of these models. The study shows that a hybrid pattern of multithreading and data parallelism via explicit vectorization maximizes performance on x86 architectures, which allows us to obtain 80% of the sustainable peak performance in double precision on the Intel Many Integrated Core (MIC) Architecture. In single precision, this figure reaches 96%. In addition, this approach delivers reasonable performance while requiring the least development time. The number of iterations until convergence is roughly the same on the CPU and GPU architectures. The GPU performs better for small matrix sizes. However, the Intel Xeon Phi coprocessor excels for large sizes, with better scalability.
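
As a rough illustration of the power method named in the abstract, a minimal pure-Python sketch follows. It is not the authors' implementation, which according to the abstract relied on OpenMP, Cilk Plus, TBB, MKL and cuBLAS. The function name `power_method`, the start vector, the Rayleigh-quotient estimate, and the tolerance are all assumptions made for this example.

```python
import math


def power_method(matrix, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenvalue of a dense square matrix.

    Returns a tuple (eigenvalue_estimate, iterations_used).
    All names and defaults here are hypothetical, chosen for illustration.
    """
    n = len(matrix)
    # Start from a normalized vector of ones (an arbitrary choice for this sketch).
    x = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for iteration in range(1, max_iter + 1):
        # Dense matrix-vector product y = A x: the kernel that dominates the cost.
        y = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in matrix]
        # Rayleigh quotient x^T A x (x has unit length) as the eigenvalue estimate.
        new_eigenvalue = sum(x_i * y_i for x_i, y_i in zip(x, y))
        norm = math.sqrt(sum(y_i * y_i for y_i in y))
        if norm == 0.0:
            raise ValueError("A x vanished; choose a different start vector")
        # Normalize so the iterate does not overflow or underflow.
        x = [y_i / norm for y_i in y]
        # Stop when successive estimates agree to the relative tolerance.
        if abs(new_eigenvalue - eigenvalue) <= tol * abs(new_eigenvalue):
            return new_eigenvalue, iteration
        eigenvalue = new_eigenvalue
    return eigenvalue, max_iter


if __name__ == "__main__":
    value, iterations = power_method([[2.0, 0.0], [0.0, 1.0]])
    print(value, iterations)
```

In a sketch like this, the matrix-vector product and the two reductions are the loops that a multithreaded, vectorized implementation would distribute across cores and vector lanes; the outer iteration itself is sequential.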

Keywords :

graphics processing units; mathematics computing; matrix algebra; multi-threading; multiprocessing systems; ubiquitous computing; Intel Cilk Plus; Intel Math Kernel Library; Intel Threading Building Blocks; Intel Xeon Phi coprocessor; NVIDIA CUDA basic linear algebra subroutines library; NVIDIA K20 GPU; OpenMP; data parallelism; dense linear algebra; dissimilar multithreading programming paradigms; fine-grained parallel model; general purpose graphics processing unit; iteration method; many-core processor parallelism techniques; multicore processor parallelism techniques; multithreading; optimized computing libraries; pervasive model; Arrays; Computational modeling; Coprocessors; Graphics processing units; Kernel; Vectors

fLanguage :

English

Publisher :

IEEE

Conference_Title :

P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on

Conference_Location :

Compiègne

Type :

conf

DOI :

10.1109/3PGCIC.2013.31

Filename :

6681224

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=652520