Title :
Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults
Author :
Tang, Dong ; Carruthers, Peter ; Totari, Zuheir ; Shapiro, Michael W.
Author_Institution :
Sun Microsystems Inc., Mountain View, CA
Abstract :
The Solaris 10 operating system includes a number of new features for predictive self-healing. One such feature is the ability of the fault management software to diagnose memory errors and drive automatic memory page retirement (MPR), which is intended to reduce the negative impact on system reliability, availability, and serviceability (RAS) of permanent memory faults that generate either correctable or uncorrectable errors. The MPR technique allows memory pages suffering from correctable errors, and relocatable clean pages suffering from uncorrectable errors, to be removed from use in the virtual memory system without interrupting user applications. It also allows relocatable dirty pages associated with uncorrectable errors to be isolated with limited impact on the affected user processes, avoiding an outage of the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify the reduction that this operating system self-healing technique can achieve in system interruptions, yearly downtime, and number of services introduced by permanent hardware faults, for typical low-end and mid-range server systems. The results show that deploying the MPR capability can significantly improve all three of these system RAS metrics.
Keywords :
fault tolerant computing; operating systems (computers); paged storage; program diagnostics; Solaris 10 operating system; automatic memory page retirement; fault management software; hardware fault; memory error diagnosis; permanent memory fault; predictive self-healing; system availability; system interruption; system reliability; system serviceability; virtual memory system; Analytical models; Availability; Error correction; Error correction codes; Hardware; Memory management; Operating systems; Reliability; Retirement; Sun;
Conference_Title :
Dependable Systems and Networks, 2006. DSN 2006. International Conference on
Conference_Location :
Philadelphia, PA
Print_ISBN :
0-7695-2607-1
DOI :
10.1109/DSN.2006.13