DocumentCode
2496478
Title
Localizing transient faults using dynamic Bayesian networks
Author
Jha, Susmit ; Li, Wenchao ; Seshia, Sanjit A.
Author_Institution
Dept. of Electr. Eng. & Comput. Sci., UC Berkeley, Berkeley, CA, USA
fYear
2009
fDate
4-6 Nov. 2009
Firstpage
82
Lastpage
87
Abstract
Transient faults are a major concern in today's deep sub-micron semiconductor technology. These faults are rare, but they have been known to cause catastrophic system-level failures. Transient errors often arise from physical effects on deployed systems, so they must be diagnosed on manufactured chips or on systems assembled from black-box components, where arbitrary instrumentation is not possible and the system state is only partially observable. These systems are also often composed of third-party IP components, which makes them more opaque still. In this paper, we propose a probabilistic approach to localize transient faults in space and time for such partially observable systems. From a set of correct traces and a failure trace, we seek to locate the faulty component and the cycle of operation at which the fault occurred. Our technique uses correct system traces over the monitored components to learn a dynamic Bayesian network (DBN) that summarizes the temporal dependencies across those components. This DBN is augmented with the different error hypotheses allowed by the fault model. The most probable explanation (MPE) among these hypotheses corresponds to the most likely location of the error. We evaluated the effectiveness of our technique on a set of ISCAS89 benchmarks and on a router design used in on-chip networks in a multi-core design.
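The flow described in the abstract (learn temporal dependencies from correct traces, enumerate fault hypotheses, pick the most probable one) can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation. It assumes that every monitored signal is a single bit, that each signal depends only on the full monitored state of the previous cycle, that the fault model is a single bit flip, and that exhaustive scoring stands in for DBN inference. The toy circuit and the names learn_transitions, localize and step are hypothetical.

```python
import math
from collections import defaultdict


def learn_transitions(traces, alpha=1.0):
    """Estimate P(bit of component c at t | monitored state at t-1) from
    correct traces, with Laplace smoothing alpha (an assumed choice)."""
    # (component, previous state) -> [times the bit was 0, times it was 1]
    counts = defaultdict(lambda: [0.0, 0.0])
    for trace in traces:
        for prev, cur in zip(trace, trace[1:]):
            for c, bit in enumerate(cur):
                counts[(c, prev)][bit] += 1

    def prob(c, prev, bit):
        n0, n1 = counts[(c, prev)]
        return ((n1 if bit else n0) + alpha) / (n0 + n1 + 2 * alpha)

    return prob


def localize(prob, failure_trace):
    """Score every single-bit-flip hypothesis (component, cycle) and return
    the most probable one under a uniform prior over hypotheses."""
    n = len(failure_trace[0])
    best = None
    for t in range(1, len(failure_trace)):
        for c in range(n):
            score = 0.0
            for u in range(1, len(failure_trace)):
                prev, cur = failure_trace[u - 1], failure_trace[u]
                for d in range(n):
                    bit = cur[d]
                    if (u, d) == (t, c):
                        # Hypothesis: the fault-free value was the opposite bit.
                        bit = 1 - bit
                    score += math.log(prob(d, prev, bit))
            if best is None or score > best[0]:
                best = (score, c, t)
    return best[1], best[2]


def step(state):
    """Toy 2-bit circuit: bit 0 toggles, bit 1 copies the previous bit 0."""
    return (1 - state[0], state[0])


# Correct traces of 6 cycles from every initial state.
correct_traces = []
for start in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    trace = [start]
    for _ in range(5):
        trace.append(step(trace[-1]))
    correct_traces.append(trace)

# Failure trace: component 1 is flipped at cycle 3, then the circuit runs on.
failure_trace = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 1), (1, 0)]

prob = learn_transitions(correct_traces)
print(localize(prob, failure_trace))
```

On this toy example the highest-scoring hypothesis is component 1 at cycle 3, which is where the flip was injected. The sketch treats the monitored signals as the whole state; handling partial observability as the paper describes would require inference over hidden variables.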
Keywords
Bayes methods; integrated circuit reliability; system-on-chip; transients; ISCAS89 benchmarks; black-box components; correct traces; deep submicron semiconductor technology; dynamic Bayesian networks; failure trace; multicore design; on-chip networks; router design; system-level failures; transient error debugging; transient faults; Assembly systems; Bayesian methods; Clocks; Computer bugs; Computer errors; Condition monitoring; Error correction; Fault diagnosis; Manufacturing; Semiconductor device manufacture;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Level Design Validation and Test Workshop, 2009. HLDVT 2009. IEEE International
Conference_Location
San Francisco, CA
ISSN
1552-6674
Print_ISBN
978-1-4244-4823-4
Electronic_ISBN
1552-6674
Type
conf
DOI
10.1109/HLDVT.2009.5340170
Filename
5340170