DocumentCode
1971609
Title
Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults
Author
Tang, Dong ; Carruthers, Peter ; Totari, Zuheir ; Shapiro, Michael W.
Author_Institution
Sun Microsystems Inc., Mountain View, CA
fYear
2006
fDate
25-28 June 2006
Firstpage
365
Lastpage
370
Abstract
The Solaris 10 operating system includes a number of new features for predictive self-healing. One such feature is the ability of the fault management software to diagnose memory errors and drive automatic memory page retirement (MPR), which is intended to reduce the negative impact on system reliability, availability, and serviceability (RAS) of permanent memory faults that generate either correctable or uncorrectable errors. The MPR technique allows memory pages suffering from correctable errors, and relocatable clean pages suffering from uncorrectable errors, to be removed from use in the virtual memory system without interrupting user applications. It also allows relocatable dirty pages associated with uncorrectable errors to be isolated with limited impact on the affected user processes, avoiding an outage for the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify the reduction this operating system self-healing technique can achieve in system interruptions, yearly downtime, and number of services caused by permanent hardware faults, for typical low-end and mid-range server systems. The results show that deploying the MPR capability can significantly improve these three system RAS metrics.
Keywords
fault tolerant computing; operating systems (computers); paged storage; program diagnostics; Solaris 10 operating system; automatic memory page retirement; fault management software; hardware fault; memory error diagnosis; permanent memory fault; predictive self-healing; system availability; system interruption; system reliability; system serviceability; virtual memory system; Analytical models; Availability; Error correction; Error correction codes; Hardware; Memory management; Operating systems; Reliability; Retirement; Sun;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks, 2006. DSN 2006. International Conference on
Conference_Location
Philadelphia, PA
Print_ISBN
0-7695-2607-1
Type
conf
DOI
10.1109/DSN.2006.13
Filename
1633525