DocumentCode :
720570
Title :
Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS
Author :
Amer, Abdelhalim ; Lu, Huiwei ; Balaji, Pavan ; Matsuoka, Satoshi
Author_Institution :
Tokyo Inst. of Technol., Tokyo, Japan
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
1075
Lastpage :
1083
Abstract :
With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of application developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained internode communication. Hybrid MPI+threads models, on the other hand, can handle intranode parallelism more effectively and alleviate some of the overheads associated with internode communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm in order to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions where threads independently communicate with remote processes while cooperating on local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as excessive polling for communication progress and inefficient global synchronization. To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.
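The abstract centers on a level-synchronous breadth-first search, the traversal pattern that the paper partitions across MPI ranks (with per-level frontier exchange and a global synchronization between levels). The following minimal serial sketch, with a small hypothetical example graph, shows only that core pattern, not the paper's distributed MPI-only or MPI+threads implementations:

```python
# Level-synchronous BFS: expand the frontier one level at a time.
# In the distributed setting described by the paper, each rank owns a
# vertex partition, exchanges discovered frontier vertices per level,
# and the frontier swap below corresponds to a global synchronization.
from collections import defaultdict

def bfs_levels(adj, source):
    """Return {vertex: level} via level-synchronous traversal."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:        # first visit claims the vertex
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier          # barrier point in the distributed version
    return level

# Hypothetical example: a small undirected graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(bfs_levels(adj, 0))
```

The per-level barrier is exactly the "inefficient global synchronization" the abstract identifies as a scaling bottleneck at large process counts.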
Keywords :
application program interfaces; message passing; tree searching; BFS; Blue Gene/Q system; MPI-only implementation; breadth-first search algorithm; hybrid MPI+threads applications; message passing interface; Computational modeling; Data models; Instruction sets; Memory management; Message systems; Programming; Synchronization; BFS; MPI; OpenMP; breadth-first search; hybrid model; multithreading; parallel programming models; runtime contention; threading models; threads
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.93
Filename :
7152594