Title :
Dyna-like reinforcement learning based on accumulative and average rewards
Author :
Hwang, Kao-Shing ; Lo, Chia-Yue
Author_Institution :
Dept. of Electr. Eng., Nat. Chung Cheng Univ., Chiayi, Taiwan
Abstract :
An approach to accelerating the learning process of the actor-critic algorithm for reinforcement learning is presented. The algorithm is derived from principles based on the prediction of average rewards and on temporal difference (TD) learning with averaged and discounted rewards. The derived algorithm is applied to neural networks and demonstrates effective operation in nonlinear control problems. The motivation for the proposed algorithm is to show how a learning scheme implemented by artificial neural networks (ANNs) can speed up the learning process through an arrangement akin to Dyna-Q learning, in which a simulative model of the controlled plant is established for virtual learning between two control cycles. Instead of modeling the complicated plant, the approach introduces only a simple predictor of rewards for virtual learning in simulation mode. Two TD learning methods, based on discounted and averaged rewards respectively, are used alternately in the control and simulation modes to facilitate the derived algorithm. The proposed Alternative Learning Critic (ALC) algorithm consists of two subsystems: one is the Evaluation Predictor (EP), which approximates a long-term evaluation function; the other is an immediate action selector composed of two ANNs, an Action Controller (AC) and a Reinforcement Predictor (RP). The proposed learning scheme is then applied to controlling a pendulum system tracking a desired trajectory to demonstrate its performance and robustness. Using reinforcement signals from the environment, the system takes appropriate actions on a plant with unknown dynamics so that the actual output of the plant tracks the desired one closely within a few learning cycles. Furthermore, the ALC is used as a compensator for a PI controller, which by itself works well only on linear systems, to control the same pendulum system. The results show that the combined system, the trained ALC together with the PI controller, can control a nonlinear plant with unknown dynamics.
Keywords :
PI control; learning (artificial intelligence); neural nets; nonlinear control systems; PI controller; accumulative rewards; action controller; actor-critic learning; affined system; alternative learning critic; artificial neural networks; average rewards; discounted rewards; Dyna-Q learning; Dyna-like reinforcement learning; evaluation predictor; nonlinear control problems; nonlinear system; pendulum system; reinforcement predictor; temporal difference learning; unknown dynamics; virtual learning; Artificial neural networks; Potentiometers; Robustness; Training; intelligent control; linearization; neural networks; reinforcement learning;
Conference_Title :
Systems, Man and Cybernetics (SMC), 2010 IEEE International Conference on
Print_ISBN :
978-1-4244-6586-6
DOI :
10.1109/ICSMC.2010.5642415