• DocumentCode
    2440580
  • Title

    Analyzing the soft error resilience of linear solvers on multicore multiprocessors

  • Author

    Malkowski, Konrad ; Raghavan, Padma ; Kandemir, Mahmut

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
  • fYear
    2010
  • fDate
    19-23 April 2010
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impacts of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residuals (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces the CG and GMRES soft error vulnerability by a factor of 18 to 220 in L1 cache, relative to an unprotected L1 cache, and energy consumption by 16%, relative to a cache with strong protection. The PBA-ECC scheme reduces the CG and GMRES soft error vulnerability by a factor of 9 × 10^3 to 8.6 × 10^9, relative to an unprotected L2 cache, and reduces the energy consumption by 8.5%, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5% and 14% respectively.
  • Keywords
    cache storage; conjugate gradient methods; multiprocessing programs; multiprocessing systems; software fault tolerance; chip transistor density; conjugate gradient; energy efficiency; error correction code; generalized minimum residual; multicore multiprocessors; prefetcher based adaptive ECC scheme; soft error resilience; sparse linear solvers; write eviction hybrid ECC scheme; Character generation; Computer errors; Data structures; Error analysis; Error correction codes; Hardware; Multicore processing; Prefetching; Protection; Resilience;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
  • Conference_Location
    Atlanta, GA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-6442-5
  • Type

    conf

  • DOI
    10.1109/IPDPS.2010.5470411
  • Filename
    5470411