NUMA-optimized parallel breadth-first search on multicore single-node system

Author

Yasui, Yuichiro ; Fujisawa, Katsuki ; Goto, Keisuke

Author_Institution

Chuo Univ., Tokyo, Japan

fYear

2013

fDate

6-9 Oct. 2013

Firstpage

394

Lastpage

402

Abstract

The breadth-first search (BFS) is one of the most important kernels in graph theory. The Graph500 benchmark measures the performance of any supercomputer performing a BFS in terms of traversed edges per second (TEPS). Previous studies have proposed hybrid approaches that combine a well-known top-down algorithm and an efficient bottom-up algorithm for large frontiers. This reduces some unnecessary searching of outgoing edges in the BFS traversal of a small-world graph, such as a Kronecker graph. In this paper, we describe a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the non-uniform memory access (NUMA) architecture. We explicitly manage the way in which each working thread accesses a partial adjacency list in local memory during BFS traversal. Our implementation has achieved a processing rate of 11.15 billion edges per second on a 4-way Intel Xeon E5-4640 system for a scale-26 problem of a Kronecker graph with 2²⁶ vertices and 2³⁰ edges. Not all of the speedup techniques in this paper are limited to the NUMA architecture system. With our winning Green Graph500 submission of June 2013, we achieved 64.12 GTEPS per kilowatt hour on an ASUS Pad TF700T with an NVIDIA Tegra 3 mobile processor.

Keywords

multiprocessing systems; parallel algorithms; storage management; tree searching; 4-way Intel Xeon E5-4640 system; ASUS Pad TF700T; BFS; Graph500 benchmark; Green Graph500; Kronecker graph; NUMA-optimized parallel breadth-first search; NVIDIA Tegra 3 mobile processor; adjacency list; bottom-up algorithm; column-wise partitioning; graph edges; graph theory; graph vertices; multicore single-node system; nonuniform memory access architecture; small-world graph; supercomputer; top-down algorithm; Approximation algorithms; Benchmark testing; Computer architecture; Instruction sets; Kernel; Partitioning algorithms; Random access memory; Breadth-first search; graph algorithms; multicore processing; parallel algorithms;

fLanguage

English

Publisher

ieee

Conference_Titel

Big Data, 2013 IEEE International Conference on

Conference_Location

Silicon Valley, CA

Type

conf

DOI

10.1109/BigData.2013.6691600

Filename

6691600