Title :
Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication
Author :
Kazuya Matsumoto;Toshihiro Hanawa;Yuetsu Kodama;Hisafumi Fujii;Taisuke Boku
Author_Institution :
Center for Comput. Sci., Univ. of Tsukuba, Tsukuba, Japan
fDate :
5/1/2015 12:00:00 AM
Abstract :
We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between compute nodes in a GPU cluster. This paper describes an implementation of the Conjugate Gradient (CG) method using TCA and the results of its performance evaluation on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation uses TCA for the allgather and allreduce collective communications. A comparison between the implementation using TCA and an implementation using MPI shows that TCA reduces latency when gathering relatively small data in the allgather and is about twice as fast in the allreduce. As a result, the CG implementation using TCA outperforms the implementation using MPI for sparse matrices whose size ranges from thousands to tens of thousands.
Keywords :
"Graphics processing units","Sparse matrices","Bandwidth","Performance evaluation","Ports (Computers)","Peer-to-peer computing","Computer architecture"
Conference_Titel :
Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International
DOI :
10.1109/IPDPSW.2015.102