


Design and realization of a new architecture based on multi-agent systems and reinforcement learning for traffic signal control

Abstract— The increasing number of cars in cities creates traffic congestion, largely because traffic lights are managed statically. Reinforcement learning (RL) is an artificial intelligence approach that enables adaptive, real-time control at intersections. In this paper, we propose a new architecture based on multi-agent systems and RL that makes the signal control system more autonomous, able to learn from its environment and to make decisions that optimize road traffic.

Keywords—Reinforcement Learning, Multi-Agent System, AUML, Q-learning

I. INTRODUCTION

Nowadays, the number of cars is increasing significantly worldwide, and this increase causes traffic congestion. Road congestion is a global problem that challenges both the scientific community and governments [1]. To remedy this, the most relevant solution is to propose an approach that predicts traffic congestion levels in real time, in order to prevent severe congestion. To achieve this goal, we must first analyze the problem's source: traffic lights are managed statically, and instead of being a source of safety and efficiency, they become a cause of delay under high demand. The common goal of existing research is to improve the use of existing roads by proposing new control strategies at intersections that allow vehicles to cross faster and minimize waiting times. In this context, many researchers have studied traffic light control; a detailed description can be found in [2]. Indeed, recent traffic signal control programs are based on reinforcement learning (RL). RL is one of the most recently used control optimization techniques applied to traffic light control [3]–[6], and it is most appropriate when the environment is stochastic. In addition, the traffic management problem can be modeled as a Markov decision process (MDP) [7]. Consider an observer who counts the number of vehicles stopped at the stop line of an intersection at a given moment; the data collected by this observer over a period represents a stochastic process whose random variable is the number of vehicles. The observed number of vehicles can take different values at different times. A stochastic process may also represent the variation over time of the number of vehicles waiting in a queue; this value changes when a new vehicle joins the queue or when a vehicle leaves it. At any time, the number of vehicles waiting in a queue describes the system's state, and the prediction of this number at the next moment depends only on the current state (i.e. the number of vehicles currently in the queue). The MDP provides a mathematical framework for modeling decision-making in such a Markov process. An MDP is defined by:

• A set of states S;

• A set of actions A(s) available in state s;

• A transition model T(s' | s, a), where a ∈ A(s);

• A reward function R(s) that gives the utility of being in state s.

The main goal of an MDP is to find an optimal policy to adopt. It is in this context that we find dynamic programming (DP) and reinforcement learning (RL).

The value iteration algorithm is one of the dynamic programming techniques that can be used to find the optimal policy. It is an iterative method that first finds the optimal value function and then derives the corresponding optimal policy [8]. One of the disadvantages of this algorithm is that it presupposes a perfect model of the environment, represented by the transition probability and reward matrices, which is not available for traffic networks [9]. Moreover, this algorithm may not be practical for large-scale problems, since it involves calculations over all MDP states and requires a lot of memory to store large transition probability and reward matrices at each iteration.
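To make this concrete, the following minimal sketch shows value iteration on a toy MDP; the two-state queue model, its transition probabilities and its rewards are illustrative placeholders, not the traffic model studied in this paper.

```python
# Minimal value-iteration sketch for a toy MDP (illustrative only; the
# states, transitions and rewards below are placeholders, not this
# paper's traffic model).

GAMMA = 0.9      # discount factor
THETA = 1e-6     # convergence threshold

# T[s][a] = list of (probability, next_state, reward) triples
T = {
    "low_queue":  {"extend_green": [(0.8, "low_queue", 1.0), (0.2, "high_queue", -1.0)],
                   "switch_phase": [(1.0, "low_queue", 0.0)]},
    "high_queue": {"extend_green": [(1.0, "high_queue", -2.0)],
                   "switch_phase": [(0.7, "low_queue", 1.0), (0.3, "high_queue", -1.0)]},
}

V = {s: 0.0 for s in T}   # value function, initialised to zero

while True:
    delta = 0.0
    for s in T:
        # Expected return of each action, then keep the best one.
        q_values = [sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a]) for a in T[s]]
        best = max(q_values)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:      # stop when the value function is stable
        break

# Greedy policy extracted from the converged value function.
policy = {s: max(T[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a]))
          for s in T}
print(V, policy)
```

Note that the sketch needs the full transition table T up front, which is precisely the model that is unavailable for real traffic networks.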

However, RL does not choose an optimal action from the beginning: it starts with random actions and, after trials and experiments, it can reach an optimal, or near-optimal, action value f(s, a).

Maha Rezzai*
Systems Architecture Team, Laboratory of Research in Engineering, ENSEM, University Hassan II Casablanca, Casablanca 20200, Morocco
[email protected]

Wafaa Dachry
Systems Architecture Team, Laboratory of Research in Engineering, ENSEM, University Hassan II Casablanca, Casablanca 20200, Morocco
[email protected]

Fouad Moutaouakkil
Systems Architecture Team, Laboratory of Research in Engineering, ENSEM, University Hassan II Casablanca, Casablanca 20200, Morocco
fmoutaouakkil@hotmail.

Hicham Medromi
Systems Architecture Team, Laboratory of Computer System and Renewable Energy, ENSEM, University Hassan II Casablanca, Casablanca 20200, Morocco
[email protected]

978-1-5386-6220-5/18/$31.00 ©2018 IEEE


The agent memorizes the action taken and then improves it; in other words, it learns from experience. As a result, we have the following equation:

Experience(t) = Experience(t−1) + (1/c) × Learning (1)

where c is the number of visits to the pair (s, a).

Over time, c increases, so the term (1/c) × Learning tends to zero. This means that the agent no longer learns but instead exploits its accumulated experience and knowledge. This is the principle from which reinforcement learning draws its inspiration.
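A minimal sketch of equation (1) is given below; the variable names and the learning signal are hypothetical and serve only to illustrate how the 1/c factor makes the correction shrink as the pair (s, a) is visited more often.

```python
# Illustrative sketch of equation (1): the more a (state, action) pair has
# been visited, the smaller the correction 1/c, so the agent gradually
# stops learning and starts exploiting. Names are hypothetical.

from collections import defaultdict

experience = defaultdict(float)   # accumulated estimate per (state, action)
visits = defaultdict(int)         # c: number of visits of the pair (s, a)

def update(state, action, learning_signal):
    pair = (state, action)
    visits[pair] += 1
    c = visits[pair]
    # Experience(t) = Experience(t-1) + (1/c) * Learning
    experience[pair] += (1.0 / c) * learning_signal
    return experience[pair]
```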

The figure below illustrates the evolution of an agent applying RL. It shows the importance of learning in accumulating experience and reflects the growth of the agent's intelligence over time. In addition, the key to the success of this approach lies in the way of learning, which depends on the action chosen at each step.

Fig. 1. Agent's evolution over time

This paper proposes a distributed architecture for traffic signal control at intersections. The platform is based on multi-agent systems (MAS) and reinforcement learning (RL). The MAS gives our system the autonomy and intelligence needed to better manage heavy traffic situations; indeed, it provides a real-time response to dynamic traffic flows [10]. The different agents of our architecture work together to observe, analyze and decide, in an intelligent way, the action to be taken, while RL allows the agent to learn from its environment and its experiences in order to react effectively.

The paper is organized as follows. Section 2 presents the different traffic control systems based on reinforcement learning. Section 3 details the basic principles of reinforcement learning. The multi-agent approach is presented in Section 4. Section 5 sheds light on the fundamental traffic variables. Section 6 proposes the architecture of the traffic control system. Finally, Section 7 contains the conclusion of this study and directions for future research.

II. STATE OF THE ART

A. Related work

Many researchers have studied the implementation of RL in traffic control systems. Their common goal is to reduce the congestion level at intersections.

Tahifa [11] proposed a multi-agent model with collaborative learning for smart mobility in an urban environment. Moghaddam et al. [12] proposed a real-time traffic control system that uses two phases to adjust the duration of the lights according to the traffic flows. Jin and Ma [13] proposed a learning-based control system combining group-based phasing techniques and a multi-agent system; in terms of traffic mobility, their simulation results showed that RL-based approaches consistently outperform fixed-time group-based signal control by a wide margin, regardless of the demand level. El-Tantawy and Abdulhai [9] proposed a system called MARLIN-ATSC, in which agents coordinate and cooperate with each other. Abdul Aziz et al. [5] proposed the RMART technique, which controls the signal lights using a Markov decision process in a multi-agent framework. Arel et al. [14] introduced a multi-agent reinforcement learning system to obtain an effective traffic control policy that minimizes the average delay, congestion and the probability of blocking at an intersection. De Oliveira et al. [15] proposed an RL method called Reinforcement Learning with Context Detection (RL-CD) to control traffic lights at isolated junctions, an appealing approach for dealing with non-stationarity and handling stochastic traffic patterns. Wiering [16] used model-based RL (with state transition models and state transition probabilities) to control traffic-light agents so as to minimize the waiting time of vehicles in a small grid network. Abdulhai et al. [4] applied the Q-learning algorithm to optimize the signal control of a single intersection, with the objective of optimally controlling severely disrupted traffic. Thorpe [17] applied an RL algorithm named SARSA to traffic signal control; it minimizes total travel time and vehicle waiting times.

Table 1 compares and contrasts these studies. The rows represent the method or approach used to solve the problem, while the columns represent the comparison criteria. These criteria include the TD learning method, the definition of the RL elements (state, action (phasing sequence), reward), the exploration method, the coordination level (in other words, whether information is exchanged between adjacent intersections), and finally the mechanism (centralized or decentralized).

III. REINFORCEMENT LEARNING RL

Reinforcement learning (RL) [18], as its name suggests, consists in learning from experience what to do in different situations (the policy), based on observed successes or failures (rewards or penalties), in order to optimize cumulative rewards over time. In RL [19], the learning agent initially interacts with an unknown environment and modifies its action policy to maximize its cumulative gains.


TABLE I. RL’S IMPLEMENTATION IN TRAFFIC CONTROL SYSTEM

The figure below illustrates the different interactions that exist between the agent and the environment.

Fig. 2. Interactions between agent and environment

A. RL’s Algorithms

Numerous RL algorithms have been studied in the literature [20]–[22]; the most relevant are the temporal-difference (TD) algorithms. They allow the agent to learn directly from experience, without needing a dynamic model of the environment. Two types of TD algorithms are cited in the literature [20], namely TD(0) and eligibility-trace TD(λ). TD(λ) is a learning algorithm invented by Richard S. Sutton, in which the agent looks ahead to the end of the learning horizon and updates the value functions based on all future rewards. We focus on TD(0) for the rest of this article.

In TD (0), we find:

SARSA Algorithm

SARSA stands for State-Action-Reward-State-Action. It is an on-policy algorithm: it starts with a simple policy that is improved as the state space is sampled. In SARSA, the estimates of the state-action values are built up from the start of the experiment, and the state-action value functions are updated after the execution of the actions selected by the current policy [9]. The disadvantage of this algorithm is that the update formula follows the action actually chosen, which is not always the optimal one [23].
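As an illustration, the following sketch shows an on-policy SARSA update with an ε-greedy behavior policy; the learning rate, discount factor and exploration rate are illustrative assumptions, not values used in this work.

```python
# Hedged sketch of the on-policy SARSA update; ALPHA, GAMMA and EPSILON
# are illustrative choices.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = defaultdict(float)            # Q[(state, action)], initialised to 0

def choose_action(state, actions):
    """Epsilon-greedy choice under the *current* policy (on-policy)."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """The target uses the action a_next actually selected by the policy."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

# Typical control-loop usage: a = choose_action(s, actions); apply a,
# observe r and s_next; a_next = choose_action(s_next, actions);
# sarsa_update(s, a, r, s_next, a_next).
```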

Q-learning Algorithm

Q-learning is an 'off-policy' algorithm: it initially collects information in a random way, then evaluates the states while slowly reducing this randomness. Q-learning is one of the most widely used reinforcement learning methods because of its simplicity. Its advantage is that it takes into account that the policy changes regularly, which makes it more effective. Q-learning is efficient at all levels of congestion and performs better than fixed signal plans regardless of the congestion level [5].

The difference between SARSA and Q-learning is the following: in SARSA, the update formula follows exactly the action chosen (which is not always optimal), whereas in Q-learning, the update formula uses the maximum value over the actions possible from the next state.
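The following companion sketch shows the corresponding off-policy Q-learning update; unlike the SARSA sketch above, the target uses the maximum over the actions available in the next state. The constants are again illustrative assumptions.

```python
# Hedged sketch of the off-policy Q-learning update; ALPHA and GAMMA are
# illustrative choices.

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(float)            # Q[(state, action)], initialised to 0

def q_learning_update(s, a, r, s_next, actions_next):
    # Off-policy target: maximum Q-value over the possible next actions,
    # independent of the action the behaviour policy actually takes.
    best_next = max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```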

Note that the necessary elements of RL, i.e. states, actions and rewards, are defined in [7].


IV. MULTI-AGENT APPROACH

A multi-agent system [24] is defined as a set of agents that satisfy several criteria: a certain degree of autonomy, a certain degree of artificial intelligence, a representation of their environment, and interactions with it. Indeed, agents are able to take the initiative to communicate, and they can adapt to different situations [25].

A. Agent paradigm

An agent is defined as an intelligent, independent entity, able to communicate with other agents and to perceive and represent its environment. Each agent performs specific actions based on its perception of that environment [2].

Fig. 3. Representation of a multi-agent system according to Ferber

Therefore, we can define a multi-agent system (MAS) as a system composed of a set of agents, located in a certain environment, and interacting with each other to achieve a common goal.

There are several types of agents [26]:

• The reactive agent is often described as not being "intelligent" by itself. It is a very simple component that perceives the environment and can act on it: it simply acquires perceptions and responds to them using predefined rules.

• The cognitive agent is a more or less intelligent agent, characterized essentially by a symbolic representation of knowledge and mental concepts. It has a partial representation of the environment and explicit objectives; it is able to plan its behavior, remember its past actions, communicate by sending messages, negotiate, etc.

• The intentional agent is an intelligent agent that applies the model of human intelligence and the human perspective on the world, using mental concepts such as knowledge, beliefs, intentions, desires, choices and commitments. Its behavior can be predicted from the beliefs, desires and intentions attributed to it.

• The rational agent acts in a way that allows it to achieve the greatest success in carrying out the tasks assigned to it. This requires a measure of performance, if possible associated with the particular task the agent must perform.

• The adaptive agent adapts to any changes in its environment. It is highly intelligent because it is able to change its goals and knowledge base as the environment changes.

• The communicative agent serves to communicate information to the agents around it. This information can come from its own perceptions or be passed on by other agents.

MAS applications are increasingly present in research on traffic regulation [9], [11], [27]. MAS are applied to the urban traffic problem because they add some autonomy at intersections: specific agents provide computing capabilities at intersections, while others monitor the global environment and coordinate actions between intersections [27]. A traffic signal control system based on multi-agent technology outperforms traditional methods because it provides a real-time, proactive response to dynamic traffic flows [10]. This allows an improvement of the traffic signal control systems adopted in urban areas.

V. THE FUNDAMENTAL TRAFFIC VARIABLES

Our traffic management system requires the fundamental traffic variables, namely flow, speed, density, occupancy rate, the time interval between vehicles, and the state of the queue. It is therefore necessary to shed light on the following concepts:

• Phase: a single set of traffic movements, controlled by a group of traffic signal lights that change color at the same time.

• Cycle of lights: the sequence of the different phases of the lights.

• Cycle time: the time required to complete all phases at the intersection. It includes the green, amber and red times of each phase in use at the intersection.

• Green time: the period during which vehicles in a lane are permitted to cross the intersection.

• Flow Q: the number of vehicles passing through a point during a given period, expressed in veh/h.

• Density (or concentration) K: the distribution of vehicles in space, in other words the number of vehicles on a road section, expressed in veh/km.

• Occupancy rate: the proportion of time during which a point on the roadway is occupied by vehicles, expressed in %. It is given by:

Occupancy rate = Σ ti / T (2)

where ti is the presence time of vehicle i at the point on the road, and T is the observation period.

• Time interval between vehicles (GAP): the time elapsed between two consecutive vehicles passing through a given road section, measured in s/veh.

The relationship between flow and GAP is as follows:

GAP = 1/Q (3)

• Queue state: the state of the remaining queue of a road. It is calculated from the following equation:

qi(t) = Li(t) / (K × li) (4)

where:

qi(t): state of the remaining queue of road i at step t;

Li(t): queue length of road i at step t;

K: density of the traffic;

li: length of road i.

A computational sketch of these variables is given below.
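The sketch computes the variables of this section from hypothetical sensor measurements; the function names, the argument units and the reconstructed form of equation (4) are assumptions made for illustration, not the paper's code.

```python
# Hedged sketch of the traffic variables of Section V; the measurement
# inputs (vehicle counts, presence times, queue lengths) are hypothetical
# placeholders for whatever the roadside sensors actually deliver.

def flow(vehicle_count, period_hours):
    """Flow Q in veh/h: vehicles passing a point during the period."""
    return vehicle_count / period_hours

def density(vehicles_on_section, section_length_km):
    """Density K in veh/km: vehicles present on a road section."""
    return vehicles_on_section / section_length_km

def occupancy_rate(presence_times_s, observation_period_s):
    """Equation (2): proportion of time a roadway point is occupied."""
    return sum(presence_times_s) / observation_period_s

def gap(flow_veh_per_h):
    """Equation (3): GAP = 1/Q, mean headway between consecutive vehicles."""
    return 3600.0 / flow_veh_per_h    # 1/Q converted from hours to s/veh

def queue_state(queue_length_veh, density_veh_per_km, road_length_km):
    """Equation (4) as reconstructed above: remaining-queue state of road i."""
    return queue_length_veh / (density_veh_per_km * road_length_km)
```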


VI. ARCHITECTURE BASED ON MAS

This section presents the proposed architecture of the traffic control system. Note that this first version addresses the control of a single intersection.

The main expected properties of our global architecture are the following:

• Intelligence: characterized by a set of faculties for understanding facts and adapting easily to new situations. In our system, intelligence consists in analyzing the traffic and adapting the signals to make appropriate decisions that achieve the desired objective.

• Learning: the accumulation of knowledge is based on the perception of the environment, the interactions of its components and the relationships between its actions. The system typically uses a memory, known as the knowledge base, which stores the behaviors, conditions and constraints encountered during execution.

• Adaptation: the system is able to modify its organization in response to a change in the environment. Several levels of adaptation can be distinguished in our system: a low level for a calm city without congestion, and a higher level corresponding to a learning ability.

• Flexibility: the ability to adapt to the environment. Our system faces several kinds of changes (e.g. emergencies, accidents, etc.).

A. System architecture

Fig. 4. Control architecture proposed for traffic control

Figure 4 shows the different agents of the architecture proposed for traffic regulation. These agents collaborate with each other continuously in order to observe, analyze and decide the action to be taken.

At the highest level, there is the environment. It contains all the elements necessary for data collection, in other words the inputs of our system: the sensors installed on each road, the traffic lights and the traffic regulator. The perception layer simply acts according to the perceived stimuli, without needing to understand the world or its goals; in other words, it provides the connection between the hardware and software layers of the architecture. Next comes the treatment layer, which is responsible for the pretreatment of the data coming from the previous layer and then carries out the learning needed to estimate the state-action value function Q(s, a). The role of the decision layer is to decide which action the regulator should take, based on the information from the previous layer. Finally, the action layer translates the decision into a set of actions that are executed by the traffic regulator.

In what follows, we detail the different agents of the architecture.

B. Agents Specification

Sensor Agent: this agent provides the elements necessary for data collection, according to the types of sensors installed on the roads: for example, sensors that give the number of cars entering and leaving the area where they are located, sensors that provide the occupancy rate of lane i, or devices capable of measuring the time interval between two consecutive vehicles (GAP). This agent sends these elements to the Pretreatment Agent.

Information (Controller) Agent: this agent confirms to the Manager Agent that the actions decided by the Decision Agent have actually been executed (hardware level).

Pretreatment Agent: this agent determines the state of the intersection and then sends it to the Manager Agent. It is responsible for calculating the fundamental traffic variables for each road i, namely the queue length, the flow, the occupancy rate and the GAP. Note that each intersection has its own characteristics (number of roads, number of phases, number of directions, tramway, etc.).

Manager Agent: this agent holds all the intersection data (states, actions, rewards, and the Q-values received from the Learning Agent). It has a global view of the intersection.

Learning Agent: this agent updates the state-action value function using the Q-learning algorithm. It receives all the necessary data from the Manager Agent, namely the states at times i−1 and i, the action, and the reward value R(si−1, si / ai).

Decision Agent: this agent is responsible for balancing exploitation and exploration when selecting an optimal action, using one of the action selection algorithms (ε-greedy, Softmax or ε-Softmax [7]) to determine the policy to follow in the current situation.
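As an illustration of the Decision Agent's role, the following sketch implements ε-greedy and Softmax selection over a state's action values; ε, the temperature and the Q-table layout are illustrative assumptions.

```python
# Hedged sketch of two action-selection strategies mentioned for the
# Decision Agent; epsilon, the temperature and the data layout are
# illustrative assumptions, not values from this work.

import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated Q(s, a) for the current state."""
    if random.random() < epsilon:
        return random.choice(list(q_values))          # explore
    return max(q_values, key=q_values.get)            # exploit

def softmax(q_values, temperature=1.0):
    """Boltzmann selection: higher-valued actions are chosen more often."""
    weights = {a: math.exp(q / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for action, w in weights.items():
        acc += w
        if r <= acc:
            return action
    return action   # fallback for floating-point rounding
```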

Transfer Agent: This agent translates the decision received from the Decision Agent into a set of actions.

Action Regulator Agent: this agent takes the response received from the Transfer Agent and executes the resulting action, then sends the required set of behaviors to the traffic controllers.

Communication Agent: This agent provides communication between the various agents of the architecture. The proposed architecture’s design is described in the following section.
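Before turning to that design, the following hypothetical skeleton sketches how these agents could be wired together in code; the class interfaces, sensor data, state encoding, reward and phase set are invented placeholders rather than the implementation of the proposed platform.

```python
# Hypothetical skeleton of the control loop implied by the agent
# specification above. Class names follow the agents of the architecture,
# but every data format and value below is an invented placeholder.

import random
from collections import defaultdict

class SensorAgent:
    def read(self):
        # Placeholder for real roadside measurements (queue per approach).
        return {"queue_lengths": [random.randint(0, 10) for _ in range(4)]}

class PretreatmentAgent:
    def state(self, raw):
        # Discretize each approach queue into low / medium / high.
        return tuple(min(q // 4, 2) for q in raw["queue_lengths"])

class LearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)
        self.alpha, self.gamma = alpha, gamma
    def update(self, s, a, r, s_next, actions):
        best = max(self.q[(s_next, a2)] for a2 in actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best - self.q[(s, a)])

class DecisionAgent:
    def choose(self, q, s, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

PHASES = ["NS_green", "EW_green"]           # illustrative action set
sensor, pre = SensorAgent(), PretreatmentAgent()
learner, decider = LearningAgent(), DecisionAgent()

s = pre.state(sensor.read())
for step in range(100):
    a = decider.choose(learner.q, s, PHASES)
    # ... the Transfer and Action Regulator agents would apply phase `a` here ...
    raw = sensor.read()
    s_next = pre.state(raw)
    r = -sum(raw["queue_lengths"])           # reward: negative total queue length
    learner.update(s, a, r, s_next, PHASES)
    s = s_next
```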

C. Design Model

In this part, we propose a design based on the AUML language. AUML is a UML extension that reflects agent concepts, and it inherits the representations proposed by UML [26].


AUML contains two types of diagrams: behavioral (dynamic) diagrams and structural (static) diagrams.

The design of the proposed architecture is described through an agent class diagram and an agent sequence diagram, illustrating the static and dynamic aspects of the platform respectively.

D. Static Aspect

We present in this article the conceptual level and the implementation level. The conceptual level is a sufficiently high-level view of the MAS: it removes all surface details so that the structure of the system can be understood. The agent class diagram in the figure below shows the conceptual level of the platform.

Fig. 5. Agent class diagram (conceptual level).

E. Dynamic Aspect

Agent sequence diagrams represent message exchanges between agents. The sequence diagram presented in Figure 6 shows the interactions between the different agents over time in the nominal case, i.e. when everything goes well.

Fig. 6. Agent Sequence Diagram

VII. CONCLUSION

In this article, we reviewed the existing studies on signal control systems based on RL and presented a table comparing and contrasting these studies. We also presented the architecture of a traffic control system for an isolated intersection based on multi-agent systems and the Q-learning algorithm. The main goal of our work is the development of a new approach that ensures efficiency and reduces slowdowns, even at the most critical times. The proposed approach can be deployed in the field. Therefore, the next step will be dedicated to performing an intersection simulation using the open-source simulator SUMO.

REFERENCES

[1] A. Adnane, M. S. Lmimouni, M. Rezzai, and H. Medromi, 'Traffic Congestion Manager, a Cost-Effective Approach', presented at the UNET, Casablanca, 2015.

[2] M. Rezzai, W. Dachry, F. Mouataouakkil, and hicham medromi, ‘Designing an Intelligent System for Traffic Management’, Journal of Communication and Computer, 2015.

[3] B. Abdulhai and L. Kattan, ‘Reinforcement learning: Introduction to theory and potential for transport applications’, Can. J. Civ. Eng., vol. 30, no. 6, pp. 981–991, Dec. 2003.

[4] B. Abdulhai, R. Pringle, and G. J. Karakoulas, ‘Reinforcement learning for true adaptive traffic signal control’, J. Transp. Eng., vol. 129, no. 3, pp. 278–285, 2003.

[5] H. M. Abdul Aziz, F. Zhu, and S. V. Ukkusuri, 'Reinforcement learning based signal control using R-Markov Average Reward Technique (RMART) accounting for neighborhood congestion information sharing', presented at the Transportation Research Board Conference and published in Transportation Research Record, 2012.

[6] J. C. Medina and R. F. Benekohal, ‘Reinforcement Learning Agents for Traffic Signal Control in Oversaturated Networks’, 2011, pp. 132–141.

[7] M. Rezzai, W. Dachry, F. Mouataouakkil, and H. Medromi, 'Reinforcement learning for traffic control system: Study of Exploration methods using Q-learning', International Research Journal of Engineering and Technology, India, pp. 1838–1848, 2017.

[8] Onivola Henintsoa Minoarivelo, ‘Application of Markov Decision Processes to the Control of a Traffic Intersection’, University of Barcelona, Spain, 2009.

[9] S. El-Tantawy and B. Abdulhai, ‘Multiagent Reinforcement Learning for Integrated Network of Adaptive Traffic Signal Controllers (MARLIN-ATSC): Methodology and Large-Scale Application on Downtown Toronto’, 2013.

[10] S.-Y. Lin, D.-K. Chen, and R.-S. Chen, ‘Apply adaptive and cooperative MAS to urban traffic signal control’, in Proc. 11th Mediterranean Conference on Control and Automation (MED’03), IEEE Control System Society Press, Rhodes, Greece, 2003.

[11] M. Tahifa, 'Modèles multi-agents et apprentissage collaboratif pour une mobilité intelligente dans un environnement urbain', Université Sidi Mohammed Ben Abdellah, Fès, Maroc, 2016.

[12] M. J. Moghaddam, M. Hosseini, and R. Safabakhsh, 'Traffic light control based on fuzzy Q-learning', 2015, pp. 124–128.

[13] J. Jin and X. Ma, ‘Adaptive Group-based Signal Control by Reinforcement Learning’, Transp. Res. Procedia, vol. 10, pp. 207–216, 2015.

[14] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, ‘Reinforcement learning-based multi-agent system for network traffic signal control’, IET Intell. Transp. Syst., vol. 4, no. 2, p. 128, 2010.

[15] D. De Oliveira, A. L. Bazzan, and V. Lesser, ‘Using cooperative mediation to coordinate traffic lights: a case study’, in Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems, 2005, pp. 463–470.

[16] M. Wiering, J. Van Veenen, J. Vreeken, and A. Koopman, ‘Intelligent traffic light control’, Inst. Inf. Comput. Sci. Utrecht Univ., 2004.

[17] T. L. Thorpe and C. W. Anderson, ‘Traffic light control using sarsa with three state representations’, Citeseer, 1996.

[18] T. M. Mitchell, ‘Does machine learning really work?’, AI Mag., vol. 18, no. 3, p. 11, 1997.

[19] X. Xu, L. Zuo, and Z. Huang, ‘Reinforcement learning algorithms with function approximation: Recent advances and applications’, Inf. Sci., vol. 261, pp. 1–31, Mar. 2014.

[20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[21] A. Gosavi, Simulation-Based Optimization, vol. 55. Boston, MA: Springer US, 2015.

[22] C. J. C. H. Watkins and P. Dayan, ‘Q-learning’, Mach. Learn., vol. 8, no. 3–4, pp. 279–292, May 1992.

[23] B. Bouzy, ‘Apprentissage par renforcement (3)’, Cours D’apprentissage Autom., 2005.

[24] W. Dachry, B. Aghezzaf, and B. Bensassi, 'Collaborative Information System for Managing the Supply Chain: Architecture Based on a Multi Agent System', International Journal of Research in Industrial Engineering, vol. 2, pp. 1–16, 2013.

[25] N. Daily, Multi-agent systems in the EIAH. 2003.

[26] S. Maalal and M. Addou, 'A new approach of designing Multi-Agent Systems', arXiv preprint arXiv:1204.1581, 2012.