Authors:
Aslani, Mohammad; Mesgari, Mohammad Saadi (M. S. Mesgari), K. N. Toosi University of Technology; Motieyan, Hamid (H. Motieyan), K. N. Toosi University of Technology
Keywords:
Actor-Critic, Adaptive Traffic Signal Control, Reinforcement Learning, Multi-agent Systems, Actor-Critic Architecture, Traffic Control
Abstract (Persian):
In the second half of the last century, most societies witnessed the onset of a phenomenon called urban traffic congestion, which occurs when a large number of vehicles pass through the same transportation infrastructure at the same time. Urban traffic congestion has well-known economic and environmental consequences, including air pollution, reduced speed, longer travel times, higher fuel consumption and even more accidents. One economical way to manage the growth in travel demand and prevent urban congestion is to increase the efficiency of the existing infrastructure through intelligent traffic control systems.
Traffic control, owing to its distributed and autonomous nature, can be modeled well by multi-agent systems. Drivers and traffic signals can be regarded as agents that exhibit intelligent behavior. Producing such behavior requires that the necessary knowledge of the surrounding environment be placed in the agent's mind, but because of the high complexity of urban traffic patterns and the non-stationarity of most traffic environments, embedding prior knowledge of the environment in the agents is very difficult and impractical. A method by which the agent can acquire the required knowledge through interaction with the environment is therefore essential; in this research, reinforcement learning is used to address this challenge. The goal of this paper is to improve traffic control strategies, and in particular intelligent traffic signal control, by developing reinforcement learning techniques in multi-agent systems. The Actor-Critic architecture, a common architecture in reinforcement learning that has separate memory structures for both the policy and the value function, was employed. The results show that intelligent traffic signal control reduces queue length by 23% and travel time by 16% compared with non-intelligent signal control for a single intersection.
Abstract (English):
Nowadays, most urban societies have experienced a phenomenon known as urban traffic congestion, which is caused by too many vehicles crossing the same transportation infrastructure at the same time. Traffic congestion has various consequences such as air pollution, reduced speed, longer travel times, higher fuel consumption and even more incidents. One of the feasible solutions for managing the increase in transportation demand is to improve the efficiency of the existing infrastructure by means of intelligent traffic control systems. From a traffic engineering point of view, a traffic control system consists of the physical network, control devices (traffic signals, variable message signs, and so forth), a model of transportation demand, and a control strategy. The focus of this paper is on the latter, especially traffic signal control.
Traffic signal control is well suited to modeling with multi-agent systems because of its distributed and autonomous nature. In this context, drivers and traffic signals are considered distributed, autonomous and intelligent agents. However, due to the high complexity of urban traffic patterns and the non-stationarity of the traffic environment, building an optimized multi-agent system from preprogrammed agent behavior is largely impractical. Therefore, the agents must instead acquire their knowledge through a learning mechanism, by interacting with the environment.
Reinforcement Learning (RL) is a promising approach for training an agent that optimizes its behavior by interacting with the environment. At each time step the agent receives information on the current state of the environment, performs an action, which may change the state of the environment, and receives a scalar reward that reflects how appropriate its behavior has been. The function that indicates which action to take in a given state is called the policy. The goal of RL is to find a policy that maximizes the long-term reward. Several types of RL algorithms have been introduced; they can be divided into three groups: Actor-Only, Critic-Only and Actor-Critic methods.
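This interaction loop can be illustrated with a minimal sketch; the toy environment, the random policy and the discount factor below are illustrative assumptions only and do not come from the paper or from AIMSUN:

    import random

    # Toy sketch of the agent-environment interaction loop described above.
    # The environment dynamics, the random policy and GAMMA are assumed here
    # purely for illustration.
    GAMMA = 0.9  # discount factor for the long-term reward

    def toy_env_step(state, action):
        # Hypothetical dynamics: the action may change the state, and a scalar
        # reward reflects how appropriate the behavior was.
        next_state = (state + action) % 5
        reward = -abs(next_state - 2)  # highest reward when the state stays near 2
        return next_state, reward

    def policy(state):
        # The policy maps the current state to an action; here it is random.
        return random.choice([0, 1, 2])

    state, discounted_return = 0, 0.0
    for t in range(100):
        action = policy(state)
        state, reward = toy_env_step(state, action)
        discounted_return += (GAMMA ** t) * reward  # the quantity RL tries to maximize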
Actor-Only methods typically work with a parameterized family of policies over which optimization procedures can be applied directly; often the gradient of the value of a policy with respect to the policy parameters is estimated and then used to improve the policy. The drawback of Actor-Only methods is that performance improvements are harder to estimate when no value function is learned. Critic-Only methods are based on the idea of first finding the optimal value function and then deriving an optimal policy from it; this approach undermines the ability to use continuous actions and thus to find the true optimum. In this research, Actor-Critic reinforcement learning is applied as the learning method for truly adaptive traffic signal control. The Actor-Critic method is a temporal-difference method with a separate memory structure that explicitly represents the policy independently of the value function. The policy structure is known as the actor, because it is used to select actions, while the critic is a state-value function that evaluates those actions.
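As a rough illustration of these separate actor and critic structures, the following tabular sketch follows the standard actor-critic scheme; the learning rates, discount factor and softmax policy are assumptions, not the exact formulation used in the paper:

    import math
    import random
    from collections import defaultdict

    # Tabular actor-critic sketch: the critic keeps a state-value function V(s),
    # while the actor keeps separate action preferences h(s, a) used by a softmax
    # policy. Learning rates, GAMMA and the softmax form are assumed values.
    ALPHA_CRITIC, ALPHA_ACTOR, GAMMA = 0.1, 0.05, 0.9
    ACTIONS = [0, 1, 2]

    V = defaultdict(float)  # critic: state-value estimates
    h = defaultdict(float)  # actor: preferences h[(state, action)]

    def select_action(state):
        # Softmax policy over the actor's preferences for this state.
        prefs = [math.exp(h[(state, a)]) for a in ACTIONS]
        threshold = random.random() * sum(prefs)
        acc = 0.0
        for a, p in zip(ACTIONS, prefs):
            acc += p
            if threshold <= acc:
                return a
        return ACTIONS[-1]

    def actor_critic_update(state, action, reward, next_state):
        # Temporal-difference error computed by the critic.
        td_error = reward + GAMMA * V[next_state] - V[state]
        V[state] += ALPHA_CRITIC * td_error           # critic update
        h[(state, action)] += ALPHA_ACTOR * td_error  # actor update: reinforce the chosen action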
In this paper, AIMSUN, a microscopic traffic simulator, is used to model the traffic environment. AIMSUN models stochastic vehicle flow by employing car-following, lane-changing and gap-acceptance models. The AIMSUN API was used to construct the state, execute the action, and calculate the reward signal for each traffic light. The state of each agent is represented by a vector of 1 + P components, where the first component is the phase number and P is the number of entrance streets leading into the intersection. The action of the agent is the duration of the current phase. The immediate reward is defined as the reduction in the total number of cars waiting in all entrance streets; that is, the difference between the total numbers of waiting cars at two successive decision points is used as the reward signal. The reinforcement learning controller is benchmarked against optimized pretimed control. The results indicate that the Actor-Critic controller decreases queue length, travel time, fuel consumption and air pollution compared with the optimized pretimed controller.
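A compact sketch of how such a state vector and reward could be assembled is given below; queue_length is a hypothetical placeholder standing in for the AIMSUN API calls that report the number of waiting cars, which are not detailed in the abstract:

    # Sketch of the agent's state vector and reward described above. The
    # queue_length callable is a hypothetical placeholder for the AIMSUN API
    # calls that report the number of cars waiting on an entrance street.

    def build_state(current_phase, entrance_streets, queue_length):
        # State = [phase number, queue on street 1, ..., queue on street P],
        # i.e. a vector of 1 + P components.
        return [current_phase] + [queue_length(s) for s in entrance_streets]

    def compute_reward(prev_total_waiting, entrance_streets, queue_length):
        # Immediate reward = reduction in the total number of waiting cars
        # between two successive decision points (positive when queues shrink).
        current_total = sum(queue_length(s) for s in entrance_streets)
        return prev_total_waiting - current_total, current_total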