Abstract:
The increasing complexity of modern computer systems makes fault detection and localization prohibitively expensive, so fast recovery from failures is becoming increasingly important. A significant fraction of failures can be cured by executing specific repair actions, e.g., rebooting, even when the exact root causes are unknown. However, designing reasonable recovery policies that effectively schedule potential repair actions can be difficult and error-prone. In this paper, we present a novel approach that automates recovery policy generation using reinforcement learning techniques. Based on the recovery history of the original user-defined policy, our method learns a new, locally optimal policy that outperforms the original one. In our experiments on data from a real cluster environment, we found that the automatically generated policy can save 10% of machine downtime.
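The learning setup the abstract describes, deriving an improved policy from logged executions of a hand-written recovery policy, can be illustrated with a minimal off-policy tabular Q-learning sketch. All names, states, actions, and reward values below are hypothetical, not taken from the paper; the only assumption carried over is that reward reflects (negative) downtime incurred by each repair step.

```python
from collections import defaultdict

def q_learning_from_logs(episodes, alpha=0.5, gamma=0.9, sweeps=50):
    """Learn action values from logged recovery episodes.

    episodes: list of episodes; each episode is a list of
    (state, action, reward, next_state) transitions, next_state=None if terminal.
    Replays the fixed log repeatedly (batch, off-policy), so it can improve on
    the policy that generated the data.
    """
    q = defaultdict(float)
    actions = {a for ep in episodes for (_, a, _, _) in ep}
    for _ in range(sweeps):
        for ep in episodes:
            for s, a, r, s2 in ep:
                # Standard Q-learning backup; terminal states have zero future value.
                best_next = 0.0 if s2 is None else max(q[(s2, b)] for b in actions)
                q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    # Greedy policy: per state, pick the repair action with the highest learned value.
    states = {s for ep in episodes for (s, _, _, _) in ep}
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Toy log under a hypothetical original policy: for a hung machine, a reboot
# recovers with little downtime, while reimaging is far more costly.
logs = [
    [("hung", "reboot", -1.0, None)],
    [("hung", "reimage", -10.0, None)],
    [("hung", "reboot", -1.0, None)],
]
policy = q_learning_from_logs(logs)
print(policy["hung"])  # prefers the cheaper repair action, "reboot"
```

This is only a sketch of the general technique; the paper's actual method, state representation, and reward design may differ substantially from this illustration.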
Keywords:
learning (artificial intelligence); system recovery; automatic error recovery; real cluster environment; recovery policy generation; reinforcement learning approach; artificial intelligence; availability; computer bugs; computer crashes; computer errors; costs; hardware; learning; redundancy; software systems