DocumentCode :
2793071
Title :
Optimizing the Fast Fourier Transform on a Multi-core Architecture
Author :
Chen, Long ; Hu, Ziang ; Lin, Junmin ; Gao, Guang R.
Author_Institution :
Electr. & Comput. Eng. Dept., Delaware Univ., Newark, DE
fYear :
2007
fDate :
26-30 March 2007
Firstpage :
1
Lastpage :
8
Abstract :
The rapid revolution in microprocessor chip architecture due to multi-core technology is presenting unprecedented challenges to application developers as well as system software designers: how best to exploit the parallelism potential of such multi-core architectures? In this paper, we report an in-depth study of these challenges based on our experience of optimizing the fast Fourier transform (FFT) on the IBM Cyclops-64 (C64) chip architecture - a large-scale multi-core chip architecture consisting of 160 thread units, associated memory banks, and an interconnection network that connects them in a shared memory organization. We demonstrate how multi-core architectures like the C64 can be used to achieve a high-performance implementation of the FFT in both the 1D and 2D cases. We analyze the optimization challenges and opportunities, including problem decomposition, load balancing, work distribution, and data reuse, together with the exploitation of C64 architecture features such as the multi-level memory hierarchy and large register files. Furthermore, the experience gained during the hand-tuned optimization process has provided valuable guidance for our compiler optimization design and implementation. The main contributions of this paper include: 1) our study demonstrates that successful optimization for C64-like large-scale multi-core architectures requires a careful analysis that can identify certain domain-specific features of a target application (e.g. FFT) and match them well with some key multi-core architecture features; 2) our optimization, assisted by a hand-tuning process, provides quantitative evidence of the importance of each optimization identified in 1); 3) automatic optimization by our compiler, the design and implementation of which are guided by the feedback from 1) and 2), shows excellent results that are often comparable to the results derived from our time-consuming hand-tuned code.
Keywords :
computer architecture; fast Fourier transforms; mathematics computing; microprocessor chips; optimising compilers; shared memory systems; IBM Cyclops-64 multicore chip architecture; compiler automatic optimization; fast Fourier transform optimization; memory hierarchy; multicore microprocessor chip architecture; shared memory organization; Application software; Computer architecture; Design optimization; Fast Fourier transforms; Large-scale systems; Microprocessor chips; Multicore processing; Optimizing compilers; Software design; System software;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Conference_Location :
Long Beach, CA
Print_ISBN :
1-4244-0910-1
Electronic_ISBN :
1-4244-0910-1
Type :
conf
DOI :
10.1109/IPDPS.2007.370639
Filename :
4228367