DocumentCode
3350986
Title
Improving load/store queues usage in scientific computing
Author
Lemuet, Christophe ; Jalby, William ; Touati, Sid-Ahmed-Ali
Author_Institution
PRiSM Lab., Univ. of Versailles, France
fYear
2004
fDate
15-18 Aug. 2004
Firstpage
38
Abstract
Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increasing instruction-level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address-bit comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such as the Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate that memory address disambiguation mechanisms can cause significant performance loss. We show that, even if data are located in low cache levels and enough ILP exists, execution can be up to 21 times slower if no care is taken over the order in which independent memory addresses are accessed. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. This solution is based on the classical (and robust) ld/st vectorization. Our experiments highlight the effectiveness of this method on BLAS 1 codes, which are representative of vector scientific loops.
Keywords
instruction sets; multiprocessing systems; storage management; vector processor systems; Alpha 21264 processor; BLAS 1 codes; Itanium 2 processor; Power 4 processor; independent memory address; instruction level parallelism; load queue; memory disambiguation mechanism; scientific computing; software compilation technique; store queues; vector scientific loop; Application software; Hardware; Libraries; Optimization methods; Parallel processing; Performance analysis; Performance loss; Pollution; Scientific computing; Software performance;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Processing, 2004. ICPP 2004. International Conference on
ISSN
0190-3918
Print_ISBN
0-7695-2197-5
Type
conf
DOI
10.1109/ICPP.2004.1327902
Filename
1327902