• DocumentCode
    2932486
  • Title

    Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems

  • Author

    Mushtaq, Hamid ; Al-Ars, Zaid ; Bertels, Koen

  • Author_Institution
    Comput. Eng. Lab., Delft Univ. of Technol., Delft, Netherlands
  • fYear
    2011
  • fDate
    11-14 Dec. 2011
  • Firstpage
    12
  • Lastpage
    17
  • Abstract
    With the advent of modern nano-scale technology, it has become possible to implement multiple processing cores on a single die. Shrinking transistor sizes, however, have made reliability a concern for such systems, as smaller transistors are more prone to permanent as well as transient faults. To reduce the probability of failure of such systems, online fault tolerance techniques can be applied. These techniques need to be efficient, as they execute concurrently with the applications running on such systems. This paper discusses the challenges involved in online fault tolerance and existing work that tackles these challenges. We classify fault tolerance into four steps: proactive fault management, error detection, fault diagnosis, and recovery. We discuss related work for each step, with a focus on techniques for shared memory multicore/multiprocessor systems. We also highlight the additional difficulties in tolerating faults for parallel execution on shared memory multicore/multiprocessor systems.
  • Keywords
    error detection; fault diagnosis; fault tolerance; parallel processing; shared memory systems; system recovery; failure probability; fault recovery; nanoscale technology; online fault tolerance techniques; parallel execution; proactive fault management; shared memory multicore system; shared memory multiprocessor system; transistor sizes; Checkpointing; Fault tolerance; Fault tolerant systems; Hardware; Multicore processing; Program processors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design and Test Workshop (IDT), 2011 IEEE 6th International
  • Conference_Location
    Beirut
  • ISSN
    2162-0601
  • Print_ISBN
    978-1-4673-0468-9
  • Electronic_ISBN
    2162-0601
  • Type

    conf

  • DOI
    10.1109/IDT.2011.6123094
  • Filename
    6123094