• DocumentCode
    3080148
  • Title
    Architectures for online error detection and recovery in multicore processors
  • Author
    Gizopoulos, Dimitris ; Psarakis, Mihalis ; Adve, Sarita V. ; Ramachandran, Pradeep ; Hari, Siva Kumar Sastry ; Sorin, Daniel ; Meixner, Albert ; Biswas, Arijit ; Vera, Xavier
  • Author_Institution
    Dept. of Inf., Univ. of Piraeus, Piraeus, Greece
  • fYear
    2011
  • fDate
    14-18 March 2011
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The huge investment in the design and production of multicore processors may be put at risk because the emerging highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel, multi-core processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or soft) errors, (b) latent manufacturing defects as well as aging/wearout phenomena that produce permanent (or hard) errors, and (c) verification inefficiencies that allow important design bugs to escape into the system. In an effort to cope with these reliability threats, several research teams have recently proposed multicore processor architectures that provide low-cost dependability guarantees against hardware errors and design bugs. This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation. It discusses a taxonomy of representative approaches and presents a qualitative comparison based on hardware cost, performance overhead, types of faults detected, and detection latency. It also describes in more detail three recently proposed effective architectural approaches: a software-anomaly detection technique (SWAT), a dynamic verification technique (Argus), and a core salvaging methodology.
  • Keywords
    error detection; fault diagnosis; formal verification; multiprocessing systems; parallel architectures; Argus; aging phenomena; architectural approach; core salvaging methodology; design bugs; detection latency; dynamic verification technique; environmental disturbance; fault type detection; hardware cost; hardware errors; latent manufacturing defects; multicore processor architecture; multicore processor chip; multicore processor design; multicore processor production; online error detection; online error diagnosis; online error recovery; online error repair; parallel processor chip; performance overhead; software-anomaly detection technique; wearout phenomena; Built-in self-test; Hardware; Maintenance engineering; Multicore processing; Program processors; dependable architectures; multicore microprocessors; online error detection/recovery/repair
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011
  • Conference_Location
    Grenoble
  • ISSN
    1530-1591
  • Print_ISBN
    978-1-61284-208-0
  • Type
    conf
  • DOI
    10.1109/DATE.2011.5763096
  • Filename
    5763096