Abstract:
Future supercomputer systems will face serious reliability challenges. Of the possible failure scenarios, latent errors are among the most serious. Preserving multiple versions of critical data is a promising approach to dealing with such errors. We are developing the Global View Resilience (GVR) library, with multi-version global arrays as one of its key features. This paper presents three array versioning architectures: flat array, flat array with change tracking, and log-structured array. We use a synthetic workload that mimics the memory access patterns of radix sort, N-body simulation, and matrix multiplication to compare the three architectures in terms of runtime performance, memory requirements, and version restoration costs. The experiments show that the flat array with change tracking delivers the best runtime performance: for versioning frequencies of 10^-5 ops^-1 or higher, it either matches the second-best architecture or outperforms it by a factor of up to 23. The log-structured array is preferable when low memory usage matters, since it saves up to 98% of memory compared with a flat array.
Keywords:
Arrays, Libraries, Resilience, Kernel, Memory management, Runtime