Title :
Tightly Coupled Accelerators Architecture for Minimizing Communication Latency among Accelerators
Author :
Hanawa, T. ; Kodama, Yuetsu ; Boku, Taisuke ; Sato, Mitsuhisa
Author_Institution :
Center for Comput. Sci., Univ. of Tsukuba, Tsukuba, Japan
Abstract :
In recent years, heterogeneous clusters using accelerators have been widely used in high performance computing systems. In such clusters, inter-node communication among accelerators requires several memory copies via CPU memory, and the resulting communication latency causes severe performance degradation. To address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture, which reduces the communication latency between accelerators on different nodes. In addition, we are conducting the HA-PACS project at the Center for Computational Sciences, University of Tsukuba, to build the HA-PACS base cluster system, a commodity GPU cluster, and to develop an experimental system based on the TCA architecture as a proprietary interconnection network connecting accelerators across nodes. In the present paper, we describe the TCA architecture and the design and implementation of PEACH2, the chip that realizes it. We also evaluate the functionality and basic performance of the PEACH2 chip. The results demonstrate that the chip delivers sufficient maximum performance, reaching 93% of the theoretical peak, with a latency between adjacent nodes of approximately 0.8 μs.
Keywords :
graphics processing units; multiprocessor interconnection networks; parallel architectures; performance evaluation; peripheral interfaces; CPU memory; Center for Computational Sciences; HA-PACS base cluster system; HA-PACS project; PCI express adaptive communication hub version 2; PEACH2 chip performance; PEACH2 design; PEACH2 implementation; TCA architecture; University of Tsukuba; commodity GPU cluster; communication latency; communication latency minimization; communication latency reduction; functionality evaluation; heterogeneous clusters; high performance computing systems; interconnection network; internode communication; memory copies; performance degradation; tightly coupled accelerator architecture; Field programmable gate arrays; Graphics processing units; Memory management; Peer-to-peer computing; Ports (Computers); Routing; Accelerator Computing; CUDA; GPGPU; GPU Direct; Interconnection Network; PCI Express; Remote DMA;
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
DOI :
10.1109/IPDPSW.2013.226