Author_Institution :
Univ. of Texas at Dallas, Richardson, TX, USA
Abstract :
With process scaling and the adoption of post-CMOS technologies, reliability and power are becoming significant concerns for future computing systems, especially highly parallel systems. Previous approaches have investigated augmenting applications with additional logic to detect and correct errors efficiently. In this research, we investigate the impact of different algorithmic designs on error resilience and propose an approach to algorithm selection for one class of equations, partial differential equations (PDEs), which are at the core of many of the scientific computing applications that drive HPC systems. Many different schemes have been devised for approximating PDE systems, each with different accuracy, stability, and performance properties. We address two primary questions: (1) Does numerical stability translate to error resilience? and (2) How do we design schemes to improve error resilience? If an algorithm's error resilience is correlated with its numerical stability properties, we may be able to design more resilient applications by leveraging well-established information on numerical stability. Even with a clear translation of numerical stability into error resilience properties, however, the question of how to design these algorithms remains, owing to the variety of implementations and schemes and the largely input-specific nature of the design. We propose one approach to automated design using machine learning. We observe that intelligent selection of the algorithm for a given problem improves robustness by 20%-50%, on average, over the traditional selection of algorithms, without the addition of any other detection/correction logic.
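The first question in the abstract (does numerical stability translate to error resilience?) can be illustrated with a small sketch. It is not the authors' experiment. It assumes the 1D heat equation, the explicit FTCS scheme, and a fault modeled as a single additive perturbation to one grid value. The function names (`ftcs_step`, `run`, `fault_impact`), the grid size, and the fault size are all hypothetical choices made for illustration. For FTCS, the update is stable when the ratio r = alpha*dt/dx^2 is at most 0.5, so the sketch compares r = 0.4 with r = 0.6.

```python
import math


def ftcs_step(u, r):
    """One explicit FTCS step for u_t = alpha * u_xx; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def run(r, steps, fault_step=None, fault_index=None, fault_size=0.0, n=21):
    """Advance a half-sine initial profile; optionally inject one additive fault.

    The fault model (a single perturbation added after step `fault_step`)
    is an assumption of this sketch, not taken from the paper.
    """
    u = [math.sin(math.pi * i / (n - 1)) for i in range(n)]
    for k in range(steps):
        u = ftcs_step(u, r)
        if k == fault_step:
            u[fault_index] += fault_size
    return u


def fault_impact(r, steps=200, fault_size=1e-3):
    """Largest final difference between a fault-free run and a faulty run."""
    clean = run(r, steps)
    faulty = run(r, steps, fault_step=10, fault_index=10, fault_size=fault_size)
    return max(abs(a - b) for a, b in zip(clean, faulty))


if __name__ == "__main__":
    # r = 0.4 satisfies the FTCS stability bound (r <= 0.5); r = 0.6 violates it.
    for r in (0.4, 0.6):
        print(f"r = {r}: final deviation caused by the fault = {fault_impact(r):.3e}")
```

Under these assumptions, the stable setting damps the injected perturbation below its original size, while the unstable setting amplifies it. This is only the intuition behind the question; whether the correlation holds across schemes and fault models is what the research itself investigates.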
Keywords :
fault tolerant computing; learning (artificial intelligence); natural sciences computing; numerical stability; parallel processing; partial differential equations; HPC systems; PDE systems; algorithm selection; error resilience properties; high performance computing; machine learning; scientific computing applications; Algorithm design and analysis; Circuit faults; Equations; Mathematical model; Resilience; Stability analysis; ABFT; Error resilience; HPC; PDE solvers; algorithmic fault tolerance