DocumentCode :
3056486
Title :
A Reinforcement Learning Approach to Automatic Error Recovery
Author :
Zhu, Qijun ; Yuan, Chun
Author_Institution :
Tianjin Univ., Tianjin
fYear :
2007
fDate :
25-28 June 2007
Firstpage :
729
Lastpage :
738
Abstract :
The increasing complexity of modern computer systems makes fault detection and localization prohibitively expensive, so fast recovery from failures is becoming increasingly important. A significant fraction of failures can be cured by executing specific repair actions, e.g., rebooting, even when the exact root causes are unknown. However, designing reasonable recovery policies that effectively schedule potential repair actions can be difficult and error-prone. In this paper, we present a novel approach that automates recovery policy generation with reinforcement learning techniques. Based on the recovery history of the original user-defined policy, our method can learn a new, locally optimal policy that outperforms the original one. In our experimental work on data from a real cluster environment, we found that the automatically generated policy can save 10% of machine downtime.
Keywords :
learning (artificial intelligence); system recovery; automatic error recovery; real cluster environment; recovery policy generation; reinforcement learning approach; Artificial intelligence; Availability; Computer bugs; Computer crashes; Computer errors; Costs; Hardware; Learning; Redundancy; Software systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks, 2007. DSN '07. 37th Annual IEEE/IFIP International Conference on
Conference_Location :
Edinburgh
Print_ISBN :
0-7695-2855-4
Type :
conf
DOI :
10.1109/DSN.2007.11
Filename :
4273024