• DocumentCode
    2496478
  • Title

    Localizing transient faults using dynamic bayesian networks

  • Author

    Jha, Susmit ; Li, Wenchao ; Seshia, Sanjit A.

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., UC Berkeley, Berkeley, CA, USA
  • fYear
    2009
  • fDate
    4-6 Nov. 2009
  • Firstpage
    82
  • Lastpage
    87
  • Abstract
    Transient faults are a major concern in today's deep sub-micron semiconductor technology. These faults are rare but they have been known to cause catastrophic system-level failures. Transient errors often occur due to physical effects on deployed systems and hence, diagnosis of transient errors must be performed over manufactured chips or systems assembled from black-box components where arbitrary instrumentation of the system is not possible and hence, the system state is only partially observable. Further, these systems are often composed of components that are third party IP which further adds opaqueness to the system. In this paper, we propose a probabilistic approach to localize transient faults in space and time for such partially observable systems. From a set of correct traces and a failure trace, we seek to locate the faulty component and the cycle of operation at which the fault occurred. Our technique uses correct system traces over monitored components of the system to learn a dynamic Bayesian network (DBN) summarizing the temporal dependencies across the monitored components. This DBN is augmented with different error hypotheses allowed by the fault model. The most probable explanation (MPE) among these hypotheses corresponds to the most likely location of the error. We evaluated the effectiveness of our technique on a set of ISCAS89 benchmarks and a router design used in on-chip networks in a multi-core design.
  • Keywords
    Bayes methods; integrated circuit reliability; system-on-chip; transients; ISCAS89 benchmarks; black-box components; correct traces; deep submicron semiconductor technology; dynamic Bayesian networks; failure trace; multicore design; on-chip networks; router design; system-level failures; transient error debugging; transient faults; Assembly systems; Bayesian methods; Clocks; Computer bugs; Computer errors; Condition monitoring; Error correction; Fault diagnosis; Manufacturing; Semiconductor device manufacture;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Level Design Validation and Test Workshop, 2009. HLDVT 2009. IEEE International
  • Conference_Location
    San Francisco, CA
  • ISSN
    1552-6674
  • Print_ISBN
    978-1-4244-4823-4
  • Electronic_ISBN
    1552-6674
  • Type

    conf

  • DOI
    10.1109/HLDVT.2009.5340170
  • Filename
    5340170
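
The workflow summarized in the abstract (learn temporal dependencies from correct traces, add one error hypothesis per component and cycle, pick the most probable one) can be illustrated with a deliberately simplified sketch. This is not the authors' implementation. It assumes a fully observed toy system of two binary signals, a single bit-flip fault model, a uniform prior over hypotheses, and a two-slice model in which every signal depends on the whole previous state. The names `simulate`, `learn_transitions`, `log_prob`, and `localize` are hypothetical.

```python
import math
from collections import defaultdict

def simulate(start, cycles, fault=None):
    """Toy 2-signal system (an assumption): x0 toggles, x1 copies the old x0.
    fault=(component, cycle) flips that bit once at that cycle."""
    state = tuple(start)
    trace = [state]
    for t in range(1, cycles):
        x0, _ = state
        nxt = [1 - x0, x0]
        if fault is not None and fault[1] == t:
            nxt[fault[0]] ^= 1  # transient single-bit flip
        state = tuple(nxt)
        trace.append(state)
    return trace

def learn_transitions(traces, alpha=1.0):
    """Estimate P(x_i[t] | x[t-1]) from correct traces by counting,
    with Laplace smoothing alpha. counts[i][prev_state] = [n_zero, n_one]."""
    n = len(traces[0][0])
    counts = [defaultdict(lambda: [alpha, alpha]) for _ in range(n)]
    for trace in traces:
        for prev, cur in zip(trace, trace[1:]):
            for i in range(n):
                counts[i][tuple(prev)][cur[i]] += 1
    return counts

def log_prob(counts, i, prev, value):
    """Log-probability that signal i takes `value` after state `prev`."""
    c = counts[i][tuple(prev)]
    return math.log(c[value] / (c[0] + c[1]))

def localize(counts, failure_trace):
    """Score every hypothesis (component i, cycle t). Under a hypothesis,
    the transition into signal i at cycle t is attributed to the fault and
    all other transitions must be explained by the learned model."""
    n = len(failure_trace[0])
    lp = {}
    for t in range(1, len(failure_trace)):
        for i in range(n):
            lp[(i, t)] = log_prob(counts, i, failure_trace[t - 1],
                                  failure_trace[t][i])
    total = sum(lp.values())
    scores = {h: total - v for h, v in lp.items()}
    best = max(scores, key=scores.get)  # most probable single-fault hypothesis
    return best, scores

# Learn from correct traces, then localize a fault injected at (signal 1, cycle 3).
correct = [simulate(s, 6) for s in [(0, 0), (0, 1), (1, 0), (1, 1)]]
model = learn_transitions(correct)
failure = simulate((0, 1), 6, fault=(1, 3))
best, scores = localize(model, failure)
print("most likely (component, cycle):", best)
```

With a uniform prior and exactly one fault, maximizing the hypothesis score reduces to finding the transition the learned model finds most surprising. The partially observable setting described in the abstract would require inference over hidden variables, which this sketch omits.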