DocumentCode
739758
Title
Affine Parallelization Using Dependence and Cache Analysis in a Binary Rewriter
Author
Kotha, Aparna ; Anand, Kapil ; Creech, Timothy ; ElWazeer, Khaled ; Smithson, Matthew ; Yellareddy, Greeshma ; Barua, Rajeev
Author_Institution
Dept. of ECE, Univ. of Maryland, College Park, MD, USA
Volume
26
Issue
8
fYear
2015
Firstpage
2154
Lastpage
2163
Abstract
Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect by manually rewriting source code in parallel is prohibitively expensive. Hence, automatic parallelization is an attractive alternative. We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program and the output is a parallel binary executable. The advantages of parallelization in a binary rewriter versus a compiler include: (i) applicability to legacy binaries whose source is not available; (ii) applicability to library code that is often distributed only as binary; (iii) usability by end-user of the software; and (iv) applicability to assembly-language programs. Adapting existing parallelizing compiler methods that work on source code to binaries is a significant challenge. This is primarily because symbolic and array index information used by existing parallelizing compilers to take parallelizing decisions is not available from a binary. We show how to adapt existing parallelization methods to binaries without using symbolic and array index information to achieve equivalent source parallelization from binaries. Results using our x86 binary rewriter called SecondWrite are presented in detail for the dense-matrix and regular Polybench benchmark suite. For these, the average speedup from our method for binaries is 27.95X versus 28.22X from source on 24 threads, compared to the input serial binary. Super-linear speedups are possible due to cache optimizations. In addition our results show that our method is robust and has correctly parallelized much larger binaries from the SPEC 2006 and OMP 2001 benchmark suites, totaling over 2 million source lines of code (SLOC). Good speedups result on the subset of those benchmarks that have affine parallelism in them; this subset exceeds 100,000 SLOC. This robustness is unusual even for the latest leading- source parallelizers, many of which are famously fragile.
Keywords
affine transforms; general purpose computers; rewriting systems; source code (software); affine parallelization; array index information; assembly-language programs; automatic parallelization; binary rewriter; cache analysis; dense-matrix; dependence analysis; general-purpose computers; library code; parallel binary executable; regular polybench benchmark suite; serial binary executable program; source code rewriting; source parallelization; super-linear speedups; symbolic information; Arrays; Binary codes; Educational institutions; Indexes; Optimization; Parallel processing; Vectors; Parallel programming; automatic programming; multithreading;
fLanguage
English
Journal_Title
Parallel and Distributed Systems, IEEE Transactions on
Publisher
ieee
ISSN
1045-9219
Type
jour
DOI
10.1109/TPDS.2014.2349501
Filename
6880338
Link To Document