Title :
High performance cache block replication using re-reference probability in CMPs
Author :
Wang, Jinglei ; Wang, Dongsheng ; Wang, Haixia ; Xue, Yibo
Author_Institution :
Tsinghua Nat. Lab. for Inf. Sci. & Technol., Tsinghua Univ., Beijing, China
Abstract :
In a Chip Multiprocessor (CMP) with shared caches, the last level cache (LLC) is distributed across all the cores. This increases the on-chip communication delay and thus influences the processor's performance. The LLC is also quite inefficient due to plenty of dead blocks. Replication can be provided in shared caches by replicating cache blocks evicted from cores to the local LLC slices, minimizing access latency by utilizing the cache space of dead blocks, which will not be referenced again before they are evicted. However, naively allowing all evicted blocks to be replicated has limited performance benefit, as such replication does not take into account the reuse probability of replicated blocks. This paper proposes Adaptive Probability Replication (APR), a mechanism that counts each block's accesses in L2 cache slices, and monitors the number of evicted blocks with different numbers of accesses, to estimate the re-reference probability of blocks in their lifetime at runtime. Using the predicted re-reference probability, APR adopts a probability replication policy and a probability insertion policy to replicate blocks at corresponding probabilities and insert them at appropriate positions, according to their re-reference probability. We evaluate APR for a 16-core tiled CMP using SPLASH-2 and PARSEC benchmarks. APR improves performance by 21% on average compared to a conventional shared cache design, by 17% over Victim Replication (VR), by 10% over Adaptive Selective Replication (ASR), and by 15% over Reactive NUCA (R-NUCA). The additional hardware cost of APR is well under 1% of an L2 cache slice.
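The estimation step described in the abstract can be illustrated with a minimal software sketch. The abstract does not give the estimator's formula or the hardware structures, so everything below is an assumption: the class name `ReRefEstimator`, the eviction histogram, and the ratio used (blocks evicted with more than k accesses divided by blocks evicted with at least k accesses) are hypothetical choices, not the paper's design.

```python
import random
from collections import Counter


class ReRefEstimator:
    """Hypothetical re-reference probability estimator (not the paper's design)."""

    def __init__(self):
        # evicted[k] = number of blocks evicted after exactly k accesses
        self.evicted = Counter()

    def record_eviction(self, accesses):
        """Update the histogram when a block leaves the L2 slice."""
        self.evicted[accesses] += 1

    def probability(self, accesses):
        """Assumed estimate: the share of evicted blocks that reached
        `accesses` accesses and were then accessed at least once more."""
        reached = sum(n for k, n in self.evicted.items() if k >= accesses)
        went_on = sum(n for k, n in self.evicted.items() if k > accesses)
        return went_on / reached if reached else 0.0

    def should_replicate(self, accesses, rng=random):
        """Replicate an evicted block with its estimated probability."""
        return rng.random() < self.probability(accesses)
```

Under these assumptions, a histogram of six blocks evicted after one access, three after two, and one after three gives an estimate of 4/10 for a block with one access and 1/4 for a block with two. A matching insertion policy could map the same estimate to a position in the replacement stack, which this sketch leaves out.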
Keywords :
cache storage; microprocessor chips; multiprocessing systems; probability; 16-core tiled CMP; L2 cache slices; access latency minimization; account reuse probability; adaptive probability replication; adaptive selective replication; cache space utilization; chip multiprocessor; dead blocks; high performance cache block replication; last level cache; local LLC slices; on-chip communication delay; probability insertion policy; probability replication policy; re-reference probability; reactive NUCA; shared caches; victim replication; Benchmark testing; Hardware; Monitoring; Radiation detectors; Runtime; System-on-a-chip; Tiles;
Conference_Title :
High Performance Computing (HiPC), 2011 18th International Conference on
Conference_Location :
Bangalore
Print_ISBN :
978-1-4577-1951-6
Electronic_ISBN :
978-1-4577-1949-3
DOI :
10.1109/HiPC.2011.6152739