
Deep Deterministic Policy Gradient for Urban Traffic Light Control

Noe Casas

Abstract—Traffic light timing optimization is still an active line of research despite the wealth of scientific literature on the topic, and the problem remains unsolved for any non-toy scenario. One of the key issues with traffic light optimization is the large scale of the input information that is available to the controlling agent, namely all the traffic data that is continually sampled by the traffic detectors that cover the urban network. This issue has in the past forced researchers to focus on agents that work on localized parts of the traffic network, typically on individual intersections, and to coordinate every individual agent in a multi-agent setup. In order to overcome the large scale of the available state information, we propose to rely on the ability of deep learning approaches to handle large input spaces, in the form of the Deep Deterministic Policy Gradient (DDPG) algorithm. We performed several experiments with a range of models, from a very simple one (a single intersection) to a more complex one (a big city section).

Index Terms—deep learning, reinforcement learning, traffic light control

I. INTRODUCTION

CITIES are characterized by the evolution of their transit dynamics. Originally meant solely for pedestrians, urban streets soon shared usage with carriages and then with cars. Traffic organization soon became an issue that led to the introduction of signaling, traffic lights and transit planning.

Nowadays, traffic lights either have fixed programs or are actuated. Fixed programs (also referred to as pretimed control) are those where the timings of the traffic lights are fixed, that is, the sequences of red, yellow and green phases have fixed durations. Actuated traffic lights change their phase to green or red depending on traffic detectors that are located near the intersection; this way, actuated traffic lights are dynamic and adapt to the traffic conditions to some degree; however, they only take into account the conditions local to the intersection. This also leads to a lack of coordination with the traffic light cycles of other nearby intersections, and hence they are not used in dense urban areas. Neither pretimed nor actuated traffic lights take into account the current traffic flow conditions at the city level. Nevertheless, cities have large vehicle detector infrastructures that feed traffic volume forecasting tools used to predict congestion situations. Such information is normally only used to apply classic traffic management actions like sending police officers to divert part of the traffic.

This way, traffic light timings could be improved by means of machine learning algorithms that take advantage of the knowledge about traffic conditions to optimize the flow of vehicles. This has been the subject of several lines of research in the past. For instance, Wiering proposed different variants of reinforcement learning to be applied to traffic light control [1], and created the Green Light District (GLD) simulator to demonstrate them, which was further used in other works like [2]. Several authors explored the feasibility of applying fuzzy logic, like [3] and [4]. Multi-agent systems were also applied to this problem, like [5] and [6].

Most of the aforementioned approaches simplify the scenario to a single intersection or a reduced group of them. Other authors propose multi-agent systems where each agent controls a single intersection and where agents may communicate with each other to share information to improve coordination (e.g. in a connected vehicle setup [7]) or may receive a piece of shared information to be aware of the crossed effects on other agents' performance ([8]). However, none of the aforementioned approaches fully profited from the availability of all the vehicle flow information, that is, the decisions taken by those agents were in all cases only partially informed. The main justification for the lack of holistic traffic light control algorithms is the poor scalability of most algorithms. In a big city there can be thousands of vehicle detectors and hundreds of traffic lights. Those numbers amount to huge state and action spaces, which are difficult to handle with classical approaches.

The problem addressed in this work is therefore the design of an agent that receives traffic data and, based on it, controls the traffic lights at a large scale in order to improve the flow of traffic.

II. TRAFFIC SIMULATION

In order to evaluate the performance of our work, we make use of a traffic simulation. The base of a traffic simulation is the network, that is, the representation of the roads and intersections where the vehicles move. Connected to some roads there are centroids, which act as sources/sinks of vehicles. The amount of vehicles generated/absorbed by the centroids is expressed in a traffic demand matrix, or origin-destination (OD) matrix, which contains one cell per pair of origin and destination centroids. During a simulation, different OD matrices can be applied to different periods of time in order to mimic the dynamics of real traffic through time. On the roads of the network there can be traffic detectors, which mimic induction loops beneath the ground that are able to measure traffic data as vehicles pass over them. Typical measurements that can be taken with traffic detectors include vehicle counts, average speed and percentage of occupancy.
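To make these concepts concrete, the following minimal Python sketch shows one possible in-memory representation of an OD matrix and of a single detector sample; all names and numbers are purely illustrative and do not come from the simulator used in this work.

# Hypothetical example of an OD matrix and one detector sample.
od_matrix = {
    # (origin centroid, destination centroid) -> vehicles generated in the period
    ("north", "south"): 150,
    ("north", "east"): 80,
    ("south", "north"): 120,
}

detector_sample = {
    "detector_id": "det_42",
    "vehicle_count": 17,      # vehicles that crossed the loop in the sampling period
    "avg_speed_kmh": 34.5,    # average speed during the sampling period
    "occupancy": 0.22,        # fraction of time a vehicle was over the loop
}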


There can also be traffic lights. In many cases they are used to regulate the traffic at intersections. In those cases, all the traffic lights in an intersection are coordinated so that when one is red, another one is green, and vice versa (this way, the use of the intersection is regulated so that vehicles do not block the intersection due to their intention to reach an exit of the intersection that is currently in use). All the traffic lights in the intersection change their state at the same time. This intersection-level configuration of the traffic lights is called a phase, and it is completely defined by the states of each traffic light in the intersection plus its duration. The different phases in an intersection form its control plan. The phases in the control plan are applied cyclically, so the phases are repeated after the cycle duration elapses. Normally, the control plans of adjacent intersections are synchronized to maximize the flow of traffic, avoiding unnecessary stops.

Urban traffic simulation software can keep models at different levels of abstraction. Microscopic simulators simulate vehicles individually, computing their positions every few milliseconds, with the dynamics of the vehicles governed by a simplified model that drives the behaviour of the driver under different conditions, while macroscopic simulators work in an aggregated way, managing traffic like a flow network in fluid dynamics. There are different variations between microscopic and macroscopic models, broadly referred to as mesoscopic simulators. For our purposes, the proper simulation level is microscopic, because we need information on individual vehicles and their responses to changes in the traffic lights, closely mimicking real-world dynamics in terms of congestion. As a third-party simulator we chose Aimsun [9], [10], a widely used commercial microscopic, mesoscopic and macroscopic simulator, employed both in the private consulting sector and in traffic management institutions.

III. PRELIMINARY ANALYSIS

The main factor that has prevented further advances in the traffic light timing control problem is the large scale of any realistic experiment. On the other hand, there is a family of machine learning algorithms whose very strength is their ability to handle large input spaces, namely deep learning. Recently, deep learning has been successfully applied to reinforcement learning, gaining much attention due to the effectiveness of Deep Q-Networks (DQN) at playing Atari games using as input the raw pixels of the game [11], [12]. Subsequent successes of a similar approach called Deep Deterministic Policy Gradient (DDPG) were achieved in [13], which will be used in our work as reference article, given the similarity of the nature of the problems addressed there, namely large continuous state and action spaces. This way, the theme of this work is the application of deep reinforcement learning to the traffic light optimization problem with a holistic approach, by leveraging deep learning to cope with the large state and action spaces. Specifically, the hypothesis that drives this work is that deep reinforcement learning can be successfully applied to urban traffic light control, with similar or better performance than other approaches.

This is hence the main contribution of the present work, along with the different techniques applied to make this application possible and effective. Taking into account the nature of the problem and the abundant literature on the subject, we know that some of the challenges of devising a traffic light timing control algorithm that acts at a large scale are:

• Define a sensible state space. This includes finding a suitable representation of the traffic information. Deep learning is normally used with input signals over which convolution is easily computable, like images (i.e. pixel matrices) or sounds (i.e. 1-D signals). Traffic information may not be easily represented as a matrix, but rather as a labelled graph. This is addressed in section VI-D.

• Define a proper action space for our agent to perform. The naive approach would be to let the controller simply set the traffic light timing directly (i.e. setting the color of each traffic light individually at each simulation step). This, however, may lead to breaking the normal routing rules, as the traffic lights in an intersection have to be synchronized so that the different intersection exit routes do not interfere with each other. Therefore a careful definition of the agent's actions is needed. This is addressed in section VI-E.

• Study and ensure the convergence of the approach: despite the successes of Deep Q-Networks and DDPG, granted by their numerous contributions to the stability of reinforcement learning with value function approximation, convergence of such approaches is not guaranteed. Stability of the training is studied and measures to mitigate divergence are put in place. This is addressed in section VI-I.

• Create a sensible test bed: a proper test bed should simulate relatively realistically the traffic of a big city, including a realistic design of the city itself. This is addressed in section VII.

IV. RELATED WORK

In this section we identify and explore research lines that are especially close to ours, either regarding the techniques used, or regarding the problem they address.

A. Classic Reinforcement Learning

Reinforcement Learning has been applied in the past to urban traffic light control. Most of the instances from the literature consist of a classical algorithm like Q-Learning, SARSA or TD(λ) controlling the timing of a single intersection. Rewards are typically based on the reduction of the travel time of the vehicles or of the queue lengths at the traffic lights. [14] offers a thorough review of the different approaches followed by a subset of articles from the literature that apply reinforcement learning to traffic light timing control. As shown there, many studies use as state space information such as the length of the queues and the travel time delay; such types of measures are rarely available in a real-world setup and can therefore only be obtained in a simulated environment.


Most of the approaches use discrete actions (or alternatively discretize the continuous actions by means of tile coding), and use either ε-greedy selection (choose the action with the highest Q value with probability 1 − ε, or a random action otherwise) or softmax selection (turn Q values into probabilities by means of the softmax function and then choose stochastically accordingly). In most of the applications of reinforcement learning to traffic control, the validation scenario consists of a single intersection, like in [15]. This is due to the scalability problems of classical RL tabular approaches: as the number of controlled intersections increases, so grows the state space, making the learning unfeasible due to the impossibility for the agent to apply every action under every possible state. This led some researchers to study multi-agent approaches, with varying degrees of complexity: some approaches like that from [16] train each agent separately, without any notion that other agents even exist, despite the coordination problems that this approach poses. Others like [17] train each agent separately, but only the intersection with maximum reward executes its action. More elaborate approaches, like [18], train several agents together, modeling their interaction as a competitive stochastic game. Alternatively, some lines of research like [19] and [20] study cooperative interaction of agents by means of coordination mechanisms, like coordination graphs ([21]).
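For reference, the two action-selection rules mentioned above (ε-greedy and softmax over the Q-values) can be sketched in Python as follows; the function names are illustrative only.

import math
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_selection(q_values, temperature=1.0):
    # Turn the Q-values into a probability distribution and sample from it.
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]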

As described throughout this section, there are several examples in the literature of the application of classical reinforcement learning to traffic light control. Many of them focus on a single intersection. Others apply multi-agent reinforcement learning techniques to address the problems derived from the high dimensionality of state and action spaces. Two characteristics shared by most of the explored approaches are that the information used to elaborate the state space is hardly available in a real-world environment and that no realistic testing environments are used.

B. Deep Reinforcement Learning

There are some recent works that, like ours, study the applicability of deep reinforcement learning to traffic light control:

Li et al. studied in [22] the application of deep learning to traffic light timing in a single intersection. Their testing setup consists of a single cross-shaped intersection with two lanes per direction, where no turns are allowed at all (i.e. all traffic either flows North-South/South-North or East-West/West-East); hence the traffic light set only has two phases. This scenario is therefore simpler than our simple network A presented in VII-B. For the traffic simulation, they use the proprietary software PARAllel MICroscopic Simulation (PARAMICS) [23], which implements the Cell Transmission Model. This model does not take into account individual vehicles. Instead, it divides the road into cells and computes the flow of vehicles moving inside each cell. Therefore, it implements a very simplified version of the traffic dynamics. Their approach consists of a Deep Q-Network ([11], [12]) comprised of a stack of auto-encoders [24], [25] with sigmoid activation functions, where the input is the state of the network and the output is the Q function value for each action. The inputs to the deep Q-network are the queue lengths of each lane at time t (measured in meters), totalling 8 inputs. The network generates 2 actions: remain in the current phase or switch to the other one. The reward is the absolute value of the difference between the maximum North-South flow and the maximum East-West flow. The stacked autoencoders are pre-trained (i.e. trained using the state of the traffic as both input and output) layer-wise so that an internal representation of the traffic state is learned, which should improve the stability of the learning in further fine tuning to obtain the Q function as output ([26]). The authors use an experience-replay memory to improve learning convergence. In order to balance exploration and exploitation, the authors use an ε-greedy policy, choosing a random action with a small probability. For evaluating the performance of the algorithm, the authors compare it with normal Q-learning ([27]). For each algorithm, they show the queue lengths over time and perform a linear regression plot on the queue lengths for each direction (in order to check the balance of the queue lengths).

Van der Pol explores in [28] the application of deep learning to traffic light coordination, both in a single intersection and in a more complex configuration. Their testing setup consists of a single cross-shaped intersection with one lane per direction, where no turns are allowed. For the simulation software, the author uses SUMO (Simulation of Urban MObility), a popular open-source microscopic traffic simulator. Given that SUMO teleports vehicles that have been stuck for a long time (see footnote 1), the author needs to take this into account in the reward function, in order to penalize traffic light configurations that favour vehicle teleportation. Their approach consists of a Deep Q-Network. The author experiments with two alternative architectures, taken verbatim respectively from [11] and [12]. Those convolutional networks were meant to play Atari games and receive as input the pixel matrix with bare preprocessing (downscaling and graying). In order to enable those architectures to be fed with the traffic data as input, an image is created by plotting a point at the location of each vehicle. The action space is comprised of the different legal traffic light configurations (i.e. those that do not lead to flow conflicts), among which the network chooses which to apply. The reward is a weighted sum of several factors: vehicle delay (defined as the road maximum speed minus the vehicle speed, divided by the road maximum speed), vehicle waiting time, the number of times the vehicle stops, the number of times the traffic light switches and the number of teleportations. In order to improve convergence of the algorithm, the author applies deep reinforcement learning techniques such as prioritized experience replay and keeping a shadow target network, but also experiments with double Q-learning [29], [30].

1. See http://sumo.dlr.de/wiki/Simulation/Why Vehicles are teleporting


They also tested different optimization algorithms apart from normal stochastic gradient optimization, such as the ADAM optimizer [31], Adagrad [32] or RMSProp [33]. The performance of the algorithm is evaluated visually by means of plots of the reward and average travel time during the training phase. The author also explores the behaviour of the algorithm in a scenario with multiple intersections (up to four) by means of a multi-agent approach. This is achieved by training two neighbouring intersections on their mutual influence and then transferring the learned joint Q function to higher numbers of intersections.

Genders et al. explore in [34] the application of deep convolutional learning to traffic light timing. Their test setup consists of a single cross-shaped intersection with four lanes in each direction, where the inner lane is meant only for turning left and the outer lane only for turning right. As simulation software, the authors use SUMO, like the work by Van der Pol [28] (see above). However, Genders et al. do not address the teleportation problem and do not take into account its effect on the results. Their approach consists of a Deep Convolutional Q-Network. Like in [28], Genders et al. transform the vehicle positions into a matrix so that it becomes a suitable input for the convolutional network. They, however, scale the value of the pixels with the local density of vehicles. The authors refer to this representation as discrete traffic state encoding (DTSE). The actions generated by the Q-Network are the different phase configurations of the traffic light set in the intersection. The reward is defined as the variation in cumulative vehicle delay since the last action was applied. The network is fed using experience replay.

V. THEORETICAL BACKGROUND

Reinforcement Learning (RL) aims at training an agent so that it applies actions optimally to an environment based on its state, with the downside that it is not known which actions are good or bad, but it is possible to evaluate the goodness of their effects after they are applied. Using RL terminology, the goal of the algorithm is to learn an optimal policy for the agent, based on the observable state of the environment and on a reinforcement signal that represents the reward (either positive or negative) obtained when an action has been applied. The underlying problem that reinforcement learning tries to solve is that of credit assignment. For this, the algorithm normally tries to estimate the expected cumulative future reward to be obtained when applying a certain action in a certain state of the environment. RL algorithms act at discrete points in time. At each time step t, the agent tries to maximize the expected total return R_t, that is, the accumulated rewards obtained after each performed action: R_t = r_{t+1} + r_{t+2} + \cdots + r_T, where T is the number of time steps ahead until the problem finishes. However, as T is normally dynamic or even infinite (i.e. the problem has no end), instead of the summation of the rewards, the discounted return is used:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \quad (1)
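As a concrete illustration of (1), the discounted return can be computed from a finite sequence of observed rewards as in the following Python sketch (the infinite sum is truncated at the end of the available sequence):

def discounted_return(rewards, gamma):
    # R_t = sum over k of gamma^k * r_{t+k+1}, truncated to the observed rewards.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards observed after time t, with gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62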

The state of the environment is observable, either totally or partially. The definition of the state is specific to each problem. One example of the state of the environment is the position x of a vehicle that moves in one dimension. Note that the state can certainly contain information that condenses past states of the environment. For instance, apart from the position x from the previous example, we could also include the speed \dot{x} and acceleration \ddot{x} in the state vector. Reinforcement Learning problems that depend only on the current state of the environment are said to comply with the Markov property and are referred to as Markov Decision Processes. Their dynamics are therefore defined by the probability of reaching a state s' from a state s by means of action a:

p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a) \quad (2)

This way, we can define the reward obtained when transitioning from state s to s' by means of action a:

r(s, a, s') = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right] \quad (3)

Deep Reinforcement Learning refers to reinforcement learning algorithms that use a deep neural network as value function approximator. The first success of reinforcement learning with neural networks as function approximation was TD-Gammon [35]. Despite the initial enthusiasm in the scientific community, the approach did not succeed when applied to other problems, which led to its abandonment ([36]). The main reason for its failure was a lack of stability derived from:

• The neural network was trained with the values that were generated on the go; such values were sequential in nature and thus auto-correlated (i.e. not independently and identically distributed).

• Oscillation of the policy with small changes to Q-values that change the data distribution.

• Too large optimization steps upon large rewards.

The recent rise in popularity of this family of approaches is due to the success of Deep Q-Networks (DQN) at playing Atari games using as input the raw pixels of the game [11], [12].

L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^2 \right] \quad (4)

In DQNs, there is a neural network that receives the environment state as input and generates as output the Q-values for each of the possible actions, using the loss function (4), which implies following the direction of the gradient (5):

\nabla_\theta L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right) \nabla_\theta Q(s, a; \theta) \right] \quad (5)

In order to mitigate the stability problems inherent to reinforcement learning with value function approximation, in [11], [12] the authors applied the following measures:

• Experience replay: keep a memory of past action-rewards and train the neural network with random samples from it instead of using the real time data, therefore eliminating the temporal autocorrelation problem.

• Reward clipping: scale and clip the values of the rewards to the range [−1, +1] so that the weights do not blow up when backpropagating.


• Target network: keep a separate DQN so that one is used to compute the target values and the other one accumulates the weight updates, which are periodically loaded onto the first one. This avoids oscillations in the policy upon small changes to Q-values.
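A minimal Python sketch of how these three measures could be wired together is shown below; the networks are assumed to be torch.nn.Module instances defined elsewhere, and all names are illustrative.

import random
from collections import deque

replay_buffer = deque(maxlen=100_000)      # experience replay memory

def store(transition):
    replay_buffer.append(transition)       # transition = (s, a, r, s')

def sample_batch(batch_size=32):
    # Random sampling breaks the temporal autocorrelation of consecutive steps.
    return random.sample(replay_buffer, batch_size)

def clip_reward(r):
    # Reward clipping to [-1, +1].
    return max(-1.0, min(1.0, r))

def sync_target(online_net, target_net):
    # Periodically copy the online weights onto the separate target network
    # (both networks are assumed to be torch.nn.Module instances).
    target_net.load_state_dict(online_net.state_dict())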

However, DQNs are meant for problems with a few possible actions, and are therefore not appropriate for continuous action spaces, like in our case. Nevertheless, a recently proposed deep RL algorithm referred to as Deep Deterministic Policy Gradient or DDPG ([13]) naturally accommodates this kind of problem. It combines the classical actor-critic RL approach [27] with Deterministic Policy Gradient [37]. The original formulation of the policy gradient algorithm was proposed in [38], which proved the policy gradient theorem for a stochastic policy π(s, a; θ):

Theorem 1. (Policy Gradient theorem from [38]) For any MDP, if the parameters θ of the policy are updated proportionally to the gradient of its performance ρ, then θ can be assured to converge to a locally optimal policy in ρ, with the gradient computed as

\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta} = \alpha \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a)

with α being a positive step size and where d^\pi is defined as the discounted weighting of states encountered starting at s_0 and then following π: d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi).

This theorem was further extended in the same article to the case where an approximation function f is used in place of the policy π. Under these conditions, the theorem holds valid as long as the weights of the approximation tend to zero upon convergence. In our reference articles [37] and [13], the authors propose to use a deterministic policy (as opposed to a stochastic one) approximated by a neural network actor π(s; θ^π) that depends on the state of the environment s and has weights θ^π, and another separate network Q(s, a; θ^Q) implementing the critic, which is updated by means of the Bellman equation like DQN (5):

Q(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1}}\left[ r(s_t, a_t) + \gamma Q(s_{t+1}, \pi(s_{t+1})) \right] \quad (6)

The actor is updated by applying the chain rule to the loss function (4), updating the weights θ^π by following the gradient of the loss with respect to them:

\nabla_{\theta^\pi} L \approx \mathbb{E}_s\left[ \nabla_{\theta^\pi} Q(s, \pi(s \mid \theta^\pi) \mid \theta^Q) \right] = \mathbb{E}_s\left[ \nabla_a Q(s, a \mid \theta^Q)\big|_{a = \pi(s \mid \theta^\pi)} \, \nabla_{\theta^\pi} \pi(s \mid \theta^\pi) \right] \quad (7)

In order to introduce exploration behaviour, thanks to the DDPG algorithm being off-policy, we can add random noise N to the policy. This enables the algorithm to try unexplored areas of the action space to discover improvement opportunities, much like the role of ε in ε-greedy policies in Q-learning.

In order to improve stability, the same measures as for DQNs can also be applied to DDPG, namely reward clipping, experience replay (by means of a replay buffer referred to as R in algorithm 1) and a separate target network. In order to implement this last measure for DDPG, two extra target actor and critic networks (referred to as π' and Q' in algorithm 1) are kept to compute the target Q values, separated from the normal actor and critic (referred to as π and Q in algorithm 1), which are updated at every step and whose weights are used to compute small updates to the target networks. The complete DDPG algorithm, as proposed in [13], is summarized in algorithm 1.

Algorithm 1 Deep Deterministic Policy Gradient algorithm

  Randomly initialize the weights of Q(s, a|θ^Q) and π(s|θ^π).
  Initialize the target network weights θ^{Q'} ← θ^Q, θ^{π'} ← θ^π.
  Initialize the replay buffer R.
  for each episode do
    Initialize a random process N for action exploration.
    Receive the initial observation state s_1.
    for each step t of the episode do
      Select action a_t = π(s_t|θ^π) + N_t.
      Execute a_t and observe reward r_t and state s_{t+1}.
      Store transition (s_t, a_t, r_t, s_{t+1}) in R.
      Sample from R a minibatch of N transitions (s_i, a_i, r_i, s_{i+1}).
      Set y_i = r_i + γ Q'(s_{i+1}, π'(s_{i+1}|θ^{π'})|θ^{Q'}).
      Update the critic by minimizing the loss:
        L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))^2
      Update the actor using the sampled policy gradient:
        ∇_{θ^π} L ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{a=π(s_i)} ∇_{θ^π} π(s|θ^π)|_{s_i}
      Update the target networks:
        θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
        θ^{π'} ← τ θ^π + (1 − τ) θ^{π'}
    end for
  end for
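The inner update of algorithm 1 could be written as in the following sketch; PyTorch is an assumption (the paper does not state the framework used), and the actor, critic, target networks and optimizers are taken as given. Rewards are assumed to have the same shape as the critic output (a scalar per sample in standard DDPG), and the hyperparameter defaults are illustrative.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    states, actions, rewards, next_states = batch

    # Critic update: regress Q(s, a) towards the Bellman target y.
    with torch.no_grad():
        next_actions = target_actor(next_states)
        y = rewards + gamma * target_critic(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the sampled policy gradient (maximize Q).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks with rate tau.
    for target, online in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(target.parameters(), online.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)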

VI. PROPOSED APPROACH

In this section we explain the approach we propose to address the control of urban traffic lights, along with the rationale that led to it. We begin in section VI-A by defining which information shall be used as input to our algorithm among all the data that is available from our simulation environment. We proceed by choosing a problem representation for such information to be fed into our algorithm, in section VI-D for the traffic state and section VI-F for the rewards.

A. Input Information

The fact that we are using a simulator to evaluate the performance of our proposed application of deep learning to traffic control makes the traffic state fully observable to us. However, in order for our system to be applicable to the real world, it must be possible for our input information to be derived from data that is available in a typical urban traffic setup. The most remarkable examples of readily available data are the ones sourced by traffic detectors. They are sensors located throughout the traffic network that provide measurements about the traffic passing over them. Although there are different types of traffic detectors, the most usual ones are induction loops placed under the pavement that send real time information about the vehicles going over them.


The information that can normally be taken from this type of detector comprises the vehicle count (number of vehicles that went over the detector during the sampling period), the average vehicle speed during the sampling period and the occupancy (the percentage of time in which there was a vehicle located over the detector). This way, we decide to constrain the information received about the state of the network to the vehicle counts, average speeds and occupancies of every detector in our traffic networks, along with the description of the network itself, comprising the location of all roads, their connections, etc.

B. Congestion Measurement

Following the self-imposed constraint to use only data that is actually available in a real scenario, we shall elaborate a summary of the state of the traffic based on vehicle counts, average speeds and occupancy. This way, we define a measure called speed score, defined for detector i as:

speed\_score_i = \min\left( \frac{avg\_speed_i}{max\_speed_i},\ 1.0 \right) \quad (8)

where avg\_speed_i refers to the average of the speeds measured by traffic detector i and max\_speed_i refers to the maximum speed of the road where detector i is located. Note that the speed score hence ranges in [0, 1]. This measure will be the base for elaborating the representation of both the state of the environment (section VI-D) and the rewards for our reinforcement learning algorithm (section VI-F).
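A direct implementation of (8) is straightforward; the function and argument names below are illustrative.

def speed_score(avg_speed, max_speed):
    # Ratio of the measured average speed to the road's maximum speed, capped at 1.0.
    return min(avg_speed / max_speed, 1.0)

# Example: a detector reporting 35 km/h on a 50 km/h road yields a score of 0.7.
print(speed_score(35.0, 50.0))  # 0.7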

C. Data Aggregation Period

The microscopic traffic simulator used for our experiments divides the simulation into steps. At each step, a small fixed amount of time is simulated and the state of the vehicles (e.g. position, speed, acceleration) is updated according to the dynamics of the system. This amount of time is configured to be 0.75 seconds by default, and we have kept this parameter. However, such an amount of time is too short to imply a change in the vehicle counts of the detectors. Therefore, a larger period over which the data is aggregated is needed; we refer to this period as the episode step, or simply "step" when there is no risk of confusion. This way, the data is collected at each simulation step and then aggregated every episode step for the DDPG algorithm to receive it as input. In order to properly combine the speed scores of several simulation steps, we take their weighted average, using the proportion of vehicle counts as weights. In an analogous way, the traffic light timings generated by the DDPG algorithm are used during the following episode step. The duration of the episode step was chosen by means of grid search, determining an optimum value of 120 seconds.
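The aggregation over one episode step can be sketched as a count-weighted average of the per-simulation-step speed scores of a detector; the names below are illustrative, and the handling of the zero-count case is an assumption.

def aggregate_speed_score(step_scores, step_counts):
    """Combine the per-simulation-step speed scores of one detector into a single
    episode-step value, weighting each score by its vehicle count."""
    total_count = sum(step_counts)
    if total_count == 0:
        return 0.0   # assumption: no vehicles observed yields a neutral score
    return sum(s * c for s, c in zip(step_scores, step_counts)) / total_count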

D. State Space

In order to keep a state vector of the environment, we make direct use of the speed score described in section VI-B, as it not only properly summarizes the congestion of the network, but also incorporates the notion of the maximum speed of each road. This way, the state vector has one component per detector, each one defined as shown in (9).

state_i = speed\_score_i \quad (9)

The rationale for choosing the speed score is that the higher the speed score, the higher the speed of the vehicles relative to the maximum speed of the road, and hence the higher the traffic flow.

E. Action Space

In the real world there are several instruments to dynamically regulate traffic: traffic lights, police agents, traffic information displays, temporary traffic signs (e.g. to block a road where there is an accident), etc. Although it is possible to apply many of these alternatives in traffic simulation software, we opted to keep the problem at a manageable level and constrain the actions to be applied only to traffic lights. The naive approach would be to let our agent simply control the traffic lights directly by setting the color of each traffic light individually at every simulation step, that is, the actions generated by our agent would be a list with the color (red, green or yellow) for each traffic light. However, traffic lights in an intersection are synchronized: when one of the traffic lights of the intersection is green, the traffic in the perpendicular direction is forbidden by setting the traffic lights of such a direction to red. This allows multiplexing the usage of the intersection. Therefore, letting our agent freely control the colors of the traffic lights would probably lead to chaotic situations. In order to avoid that, we should keep the phases of the traffic lights in each intersection. With that premise, we shall only control the phase duration, hence the dynamics are kept the same, only being accelerated or decelerated. This way, if the network has N different phases, the action vector has N components, each of them being a real number that has a scaling effect on the duration of the phase. However, for each intersection, the total duration of the cycle (i.e. the sum of all phases in the intersection) should be kept unchanged. This is important because in most cases the cycles of nearby intersections are synchronized so that vehicles travelling from one intersection to the other can catch the proper phase, thus improving the traffic flow. In order to ensure that the intersection cycle is kept, the scaling factors of the phases from the same intersection are passed through a softmax function (also known as the normalized exponential function). The result is the ratio of each phase duration over the total cycle duration. In order to ensure a minimum phase duration, the scaling factor is only applied to 80% of the duration.
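The cycle-preserving adjustment can be sketched as follows: the raw per-phase factors of one intersection are normalized into ratios (the agent uses a softmax; a plain normalization is used here for brevity) and only 80% of the cycle is redistributed, so that each phase keeps a minimum duration. The exact split between the fixed and adjustable parts is an interpretation of the text, and all names are illustrative.

def adjust_phase_durations(raw_factors, original_durations, adjustable_fraction=0.8):
    """Rescale the phases of ONE intersection so that the cycle duration is kept.

    raw_factors: non-negative scaling factors produced by the agent, one per phase.
    original_durations: original phase durations (seconds) of the same intersection.
    """
    cycle = sum(original_durations)
    fixed = [(1.0 - adjustable_fraction) * d for d in original_durations]
    adjustable_budget = adjustable_fraction * cycle

    total = sum(raw_factors)
    ratios = [f / total for f in raw_factors]   # ratio of each phase over the adjustable part

    return [fixed_d + r * adjustable_budget for fixed_d, r in zip(fixed, ratios)]

# Example: two phases of 15 s and 70 s (85 s cycle); doubling the first phase's
# factor shifts green time towards it while the cycle stays at 85 s.
print(adjust_phase_durations([2.0, 1.0], [15.0, 70.0]))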

F. Rewards

The role of the rewards is to provide feedback to the reinforcement learning algorithm about the performance of the actions taken previously. As commented in the previous section, it would be possible for us to define a reward scheme that makes use of information about the travel times of the vehicles. However, as we are constraining ourselves to the information that is available in real world scenarios, we cannot rely on measures other than detector data, e.g. vehicle counts and speeds.


This way, we shall use the speed score described in section VI-B. But the speed score alone does not tell whether the actions taken by our agent actually improve the situation or make it worse. Therefore, in order to capture such information, we shall introduce the concept of baseline, defined as the speed score of a detector during a hypothetical simulation that is exactly like the one under evaluation but with no intervention by the agent, recorded at the same time step. This way, our reward is the difference between the speed score and the baseline, scaled by the vehicle counts passing through each detector (in order to give more weight to scores where the number of vehicles is higher), and further scaled by a factor α to keep the reward in a narrow range, as shown in (10).

reward_i = \alpha \cdot count_i \cdot (speed\_score_i - baseline_i) \quad (10)

Note that we might want to normalize the weights by dividing by the total number of vehicles traversing all the detectors. This would constrain the rewards to the range [−1, +1]. This, however, would make the rewards obtained at different simulation steps not comparable (i.e. a lower total number of vehicles in the simulation at instant t would lead to higher rewards). The factor α was chosen to be 1/50 empirically, by observing the unscaled values on different networks and choosing a value with an order of magnitude that leaves the scaled value around 1.0. This is important in order to control the scale of the resulting gradients. Another alternative used in [11], [12] with this very purpose is reward clipping; this, however, implies losing information about the scale of the rewards. Therefore, we chose to apply a proper scaling instead. There is a reward computed for each detector at each simulation time step. Such rewards are not combined in any way, but are all used for the DDPG optimization, as described in section VI-H. Given the stochastic nature of the microsimulator used, the results obtained depend on the random seed set for the simulation. This way, when computing the reward, the baseline is taken from a simulation with the same seed as the one under evaluation.
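A direct sketch of (10), with the baseline values taken from the paired no-intervention simulation run with the same random seed; the names are illustrative.

ALPHA = 1.0 / 50.0   # empirical scaling factor reported in the text

def detector_rewards(counts, speed_scores, baseline_scores, alpha=ALPHA):
    """One reward per detector: count-weighted improvement over the baseline."""
    return [alpha * c * (s - b)
            for c, s, b in zip(counts, speed_scores, baseline_scores)]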

G. Deep Network Architecture

Our neural architecture consists of a Deep Deterministic Actor-Critic Policy Gradient approach. It is comprised of two networks: the actor network π and the critic network Q. The actor network receives the current state of the simulation (as described in section VI-D) and outputs the actions, as described in VI-E. As shown in figure 1, the network is comprised of several layers. It starts with several fully connected layers (also known as dense layers) with Leaky ReLU activations [39], where the number of units is indicated in brackets, with nd being the number of detectors in the traffic network and np the number of phases of the traffic network. Across those layers, the width of the network increases and then decreases, until it has as many units as actions, that is, the last mentioned dense layer has as many units as traffic light phases in the network. At that point, we introduce a batch normalization layer and another fully connected layer with ReLU activation. The outputs of the last mentioned layer are real numbers in the range [0, +∞), so we should apply some kind of transformation that allows us to use them as scaling factors for the phase durations (e.g. clipping to the range [0.2, 3.0]). However, as mentioned in section VI-E, we want to keep the traffic light cycles constant. Therefore, we apply an element-wise scaling computed on the summation of the actions of the phases in the same traffic light cycle, that is, each scaling factor is divided by the sum of all the factors for phases belonging to the same group (hence obtaining the new ratio of each phase over the cycle duration) and then multiplied by the original duration of the cycle. In order to keep a minimum duration for each phase, such a computation is only applied to 80% of the duration of the cycle. Such a computation can be pre-calculated into a matrix, which we call the phase adjustment matrix, which is applied in the layer labeled as "Phase adjustment" in figure 1, and which finally gives the scaling factors to be applied to the phase durations. This careful scaling meant to keep the total cycle duration could be ruined by the exploration component of the algorithm, described in algorithm 1, which consists of adding noise to the actions (and would therefore likely break the total cycle duration). This way, we implement the injection of noise as another layer prior to the phase adjustment. The critic network receives the current state of the simulation plus the action generated by the actor, and outputs the Q-values associated to them. Like the actor, it is comprised of several fully connected layers with Leaky ReLU activations, plus a final dense layer with linear activation.

Figure 1. Critic and Actor networks. The actor consists of dense layers of widths nd, nd + np, nd + np and np with Leaky ReLU activations, together with batch normalization, a ReLU dense layer, Gaussian noise injection and the phase adjustment layer, producing the actions from the state. The critic consists of dense layers of widths nd + np, nd + np and nd + np with Leaky ReLU activations followed by a dense layer of width nd with linear activation, producing the Q-values from the state and the actions.
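A minimal PyTorch sketch of the two networks in figure 1 is given below. PyTorch itself, the exact layer ordering, and the way the noise and the phase adjustment are passed in are assumptions; the layer widths follow the figure, and the nd-dimensional critic output anticipates the disaggregated rewards of section VI-H.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, n_detectors: int, n_phases: int):
        super().__init__()
        nd, np_ = n_detectors, n_phases
        self.body = nn.Sequential(
            nn.Linear(nd, nd), nn.LeakyReLU(),
            nn.Linear(nd, nd + np_), nn.LeakyReLU(),
            nn.Linear(nd + np_, nd + np_), nn.LeakyReLU(),
            nn.Linear(nd + np_, np_),
            nn.BatchNorm1d(np_),
            nn.Linear(np_, np_), nn.ReLU(),   # non-negative raw scaling factors
        )

    def forward(self, state, noise=None, phase_adjust=None):
        x = self.body(state)
        if noise is not None:         # exploration noise injected before the adjustment
            x = x + noise
        if phase_adjust is not None:  # renormalize so that each cycle keeps its duration
            x = phase_adjust(x)
        return x

class Critic(nn.Module):
    def __init__(self, n_detectors: int, n_phases: int):
        super().__init__()
        nd, np_ = n_detectors, n_phases
        self.body = nn.Sequential(
            nn.Linear(nd + np_, nd + np_), nn.LeakyReLU(),
            nn.Linear(nd + np_, nd + np_), nn.LeakyReLU(),
            nn.Linear(nd + np_, nd + np_), nn.LeakyReLU(),
            nn.Linear(nd + np_, nd),          # linear output: one Q-value per detector
        )

    def forward(self, state, action):
        return self.body(torch.cat([state, action], dim=-1))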

H. Disaggregated Rewards

In our reference article [13], as well as in all the landmark ones like [11] and [12], the reward is a single scalar value. However, in our case we build a reward value for each detector in the network. One option to use such a vector of rewards would be to scalarize it into a single value. This, however, would imply losing valuable information regarding the location of the effects of the actions taken by the actor. Instead, we keep them disaggregated, leveraging the structure of the DDPG algorithm, which climbs in the direction of the gradient of the critic. This is partially analogous to a regression problem on the Q-value and hence does not impose constraints on the dimensionality of the rewards. This way, we have an N-dimensional reward vector, where N is the number of detectors in the network. This extends the policy gradient theorem from [37] so that the reward function is no longer defined as r : S × A → R but as r : S × A → R^N. This is analogous to having N agents sharing the same actor and critic networks (i.e. sharing weights θ^π and θ^Q) and being trained simultaneously over N different unidimensional reward functions. This, effectively, implements multiobjective reinforcement learning. To the best of our knowledge, disaggregated rewards have not been used before in the reinforcement learning literature. Despite having proved useful in our experiments, further study is needed in order to fully characterize the effect of disaggregated rewards on benchmark problems. This is one of the future lines of research that can be spawned from this work. Such an approach could be further refined by weighting the rewards according to traffic control expert knowledge, which would then be incorporated in the computation of the policy gradients.
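Concretely, with N detectors the critic target of algorithm 1 simply becomes an N-dimensional vector, as in the following sketch; array shapes and names are illustrative.

import numpy as np

def disaggregated_targets(rewards, next_q_values, gamma):
    """rewards and next_q_values have shape (batch, n_detectors)."""
    return rewards + gamma * next_q_values        # one Bellman target per detector

def critic_loss(q_values, targets):
    # Mean squared error over both the batch and the detector dimensions,
    # i.e. a multi-output regression on the Q-values.
    return np.mean((targets - q_values) ** 2)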

I. Convergence

There are different aspects that needed to be properly tuned in order for the learning to achieve convergence:

• Weight initialization has been a key issue in the results cast by deep learning algorithms. The early architectures could only achieve acceptable results if they were pre-trained by means of unsupervised learning so that they could learn the input data structure [26]. The use of sigmoid or hyperbolic tangent activations makes it difficult to optimize neural networks due to the numerous local minima in the loss function defined over the parameter space. With pre-training, the exploration of the parameter space does not begin at a random point, but at a point that hopefully is not too far from a good local minimum. Pre-training became no longer necessary to achieve convergence thanks to the use of rectified linear activation units (ReLUs) [40] and sensible weight initialization strategies. In our case, we tried different random weight initializations (i.e. Glorot's [41] and He's [42]), finally selecting He's approach, which gave the best results.

• Updates to the critic: after our first experiments, the divergence of the learning of the network became evident. Careful inspection of the algorithm byproducts revealed that the cause of the divergence was that the critic network Q' predicted higher outcomes at every iteration, as it was trained according to equation (11), extracted from algorithm 1.

y_i = r_i + \gamma Q'(s_{i+1}, \pi'(s_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}) \quad (11)

As DDPG learning, like any other reinforcement learning approach with value function approximation, is a closed-loop system in which the target value at step t + 1 is biased by the training at step t, drifts can be amplified, thus ruining the learning, as the distance between the desired value for Q and the obtained one grows more and more.

In order to mitigate this divergence problem, our proposal consists in reducing the coupling by means of the application of a schedule on the value of the discount factor γ from Bellman's equation, which is shown in figure 2.

Figure 2. Schedule for the discount factor γ.

The schedule of γ is applied at the level of the experiment, not within the episode. The oscillation in γ shown in figure 2 is meant to keep the critic network from entering the regime where the feedback leads to divergence. To the best of our knowledge, discount factor scheduling has not been used before in the literature to improve the convergence of reinforcement learning with value function approximation. Despite having proved useful in our experiments, further study is needed in order to fully characterize the effect of discount factor schedules on benchmark problems. This is one of the future lines of research that can be spawned from this work. A small sketch of such a schedule, together with the gradient norm clipping described next, is given after this list.

• Gradient evolution: the convergence of the algorithm can be evaluated thanks to the norm of the gradient used to update the actor network π. If such a norm decreases over time and stagnates around a low value, it is a sign that the algorithm has reached a stable point and that the results might not improve further. This way, in the experiments described in subsequent sections, monitoring of the gradient norm is used to track progress. The gradient norm can also be controlled in order to avoid too large updates that make the algorithm diverge, e.g. [11]. This mechanism is called gradient norm clipping and consists of scaling the gradient so that its norm is not over a certain value. Such a value was empirically established as 0.5 in our case.
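The following sketch illustrates the last two measures. The exact shape of the γ schedule is not specified in the text beyond figure 2, so the triangular oscillation between 0.0 and 0.8 is only an assumption read off the figure; the clipping value of 0.5 is the one reported above, and PyTorch is assumed for the clipping utility.

import torch

def gamma_schedule(step, period=200, gamma_min=0.0, gamma_max=0.8):
    """Oscillating discount factor, updated per experiment step (not per episode)."""
    phase = (step % period) / period
    tri = 2 * phase if phase < 0.5 else 2 * (1 - phase)   # triangle wave in [0, 1]
    return gamma_min + (gamma_max - gamma_min) * tri

def clipped_actor_step(actor, actor_opt, actor_loss, max_norm=0.5):
    """Backpropagate the actor loss with gradient norm clipping."""
    actor_opt.zero_grad()
    actor_loss.backward()
    torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm)
    actor_opt.step()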

J. Summary

Our proposal is to apply Deep Deterministic Policy Gradient, as formulated in [13], to the traffic optimization problem by controlling the traffic light timings. We make use of a multilayer perceptron type of architecture, both for the actor and the critic networks. The actor is designed so that the modifications to the traffic light timings keep the cycle duration. In order to optimize the networks we make use of stochastic gradient descent. In order to improve convergence, we make use of a replay memory, gradient norm clipping and a schedule for the discount rate γ. The input state used to feed the network consists of traffic detector information, namely vehicle counts and average speeds, which are combined into a single speed score. The rewards used as reinforcement signal are the improvements over the measurements without any control action being performed (i.e. the baseline). Such rewards are not aggregated but fed directly as the expected values of the critic network.


VII. EXPERIMENTS

In this section we describe the experiments conducted in order to evaluate the performance of the proposed approach. In section VII-A we show the different traffic scenarios used, while in section VII-E we describe the results obtained in each one, along with lessons learned from the problems found, plus hints for future research.

A. Design of the Experiments

In order to evaluate our deep RL algorithm, we devised traffic networks of increasing complexity. For each one, we applied our DDPG algorithm to control the traffic light timings, but we also applied multi-agent Q-learning and random timings in order to have a reference to properly assess the performance of our approach. At each experiment, the DDPG algorithm receives as input the information of all detectors in the network, and generates the timings of all traffic light phases. In the multi-agent Q-learning implementation, there is one agent managing each intersection phase. It receives the information from the few closest detectors and generates the timings for the aforementioned phase. Given the tabular nature of Q-learning, both the state space and the action space need to be categorical. For this, tile coding is used. Regarding the state space, the tiles are defined based on the same state space values as DDPG (see section VI-D), clustered into one of the following 4 ranges: [−1.0, −0.2], [−0.2, −0.001], [−0.001, 0.02], [0.02, 1.0], which were chosen empirically. As one Q-learning agent controls the N_i phases of the traffic lights of an intersection i, the number of states for an agent is 4^{N_i}. The action space is analogous, the generated timing being one of the values 0.2, 0.5, 1.0, 2.0 or 3.5. The selected ratio (i.e. ratio over the original phase duration) is applied to the duration of the phase controlled by the Q-learning agent. As there is one agent per phase, this is a multi-agent reinforcement learning setup, where agents do not communicate with each other. They do have overlapping inputs, though, as the data from a detector can be fed to the agents of several phases. In order to keep the cycle times constant, we apply the same phase adjustment used for the DDPG agent, described in section VI-E. The random agent generates random timings in the range [0, 1], and then the previously mentioned phase adjustment is applied to keep the cycle durations constant (see section VI-E).
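A hypothetical sketch of the tile coding used by the Q-learning baseline is given below: each input value is bucketed into one of the four ranges above and the state of an agent is the combination of its inputs' buckets, while the action is one of the five duration ratios. All names are illustrative.

BINS = [(-1.0, -0.2), (-0.2, -0.001), (-0.001, 0.02), (0.02, 1.0)]
ACTION_RATIOS = [0.2, 0.5, 1.0, 2.0, 3.5]

def discretize(value):
    """Map a continuous input value to the index of its bin."""
    for idx, (low, high) in enumerate(BINS):
        if low <= value <= high:
            return idx
    return 0 if value < BINS[0][0] else len(BINS) - 1

def state_index(input_values):
    """Encode a tuple of bin indices as a single state id (4^N states)."""
    idx = 0
    for v in input_values:
        idx = idx * len(BINS) + discretize(v)
    return idx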

Given the stochastic nature of the microscopic traffic simulator used, the results obtained in the experiments depend on the random seed set for the simulation. In order to address the implications of this, we do as follows:

• In order for the algorithms not to overfit to the dynamics of a single simulation, we randomize the seed of each simulation. We also take this into account for the computation of the baseline, as described in section VI-F.

• We repeat the experiments several times, and present the results over all of them (showing the average, maximum or minimum data depending on the case).

B. Network A

This network, shown in figure 3, consists only of an intersection of two 2-lane roads. At the intersection, vehicles can either go straight or turn to their right. It is forbidden to turn left, therefore simplifying the traffic dynamics and traffic light phases. There are 8 detectors (in each road there is one detector before the intersection and another one after it). There are two phases in the traffic light group: phase 1 allows horizontal traffic while phase 2 allows vertical circulation. Phase 1 lasts 15 seconds and phase 2 lasts 70 seconds, with a 5-second inter-phase. Phases 1 and 2 have unbalanced durations on purpose, to have the horizontal road accumulate vehicles for a long time. This gives our algorithm room to easily improve the traffic flow with phase duration changes. The simulation comprises 1 hour and the vehicle demand is constant: for each pair of centroids, there are 150 vehicles.

Figure 3. Urban network A

The traffic demand is defined by hand, with the proper order of magnitude to ensure congestion. The definition and duration of the phases were computed by means of the classical cycle length determination and green time allocation formulas from [43].

C. Network B

This network, shown in figure 4, consists of a grid layout of 3 vertical roads and 2 horizontal ones, crossing in 6 intersections that all have traffic lights. Traffic in an intersection can go straight, left or right, that is, all turns are allowed, complicating the traffic light phases, which have been generated algorithmically by the software with the optimal timing, totalling 30 phases. There are detectors before and after each intersection, totalling 17 detectors. The traffic demand is defined by hand, ensuring congestion. The traffic light phases were defined, like for network A, with the classical approach from [43]. Four out of the six junctions have 5 phases, while the remaining two junctions have 4 and 6 phases each. The traffic demand has been created in a random manner, but ensuring that enough vehicles are present and trying to collapse some of the sections of the network.

D. Network C

This network, shown in figure 5, is a replica of the Sants area in the city of Barcelona (Spain).


Figure 4. Urban network B

There are 43 junctions, totalling 102 traffic light phases, and 29 traffic detectors. The locations of the detectors match the real world. The traffic demand matches that of the peak hour in Barcelona and presents a high degree of congestion.

Figure 5. Urban network C

The number of controlled phases per junction² ranges from 1 to 6, with most junctions having only two phases.

E. Results

In order to evaluate the performance of our DDPG approach compared to both plain Q-learning and random timings on each of our test networks, our main reference measure is the episode average reward. Note that, as described in section VI-F, there is actually a vector of rewards with one element per detector in the network, which is why we compute the average reward. We report the best experiment trial, understanding "best" as the trial in which the maximum episode average reward was obtained. Figure 6 shows the performance comparison for network A.

² Note that phases of the network that have a very small duration (i.e. 2 seconds or less) are excluded from the control of the agent.

Figure 6. Algorithm performance comparison on network A: episode average reward per training episode for DDPG, Q-learning and random timings.

Both the DDPG approach and classical Q-learning reach the same levels of reward. On the other hand, the difference in the convergence of the two approaches is noticeable: while Q-learning is unstable, DDPG remains remarkably stable once it reaches its peak performance. Figure 7 shows the performance comparison for network B.

Figure 7. Algorithm performance comparison on network B: episode average reward per training episode for DDPG, Q-learning and random timings.

While Q-learning stays within the same band of variation throughout the simulations, DDPG starts to converge. Given the great computational cost of running the full set of simulations for one network, it was not affordable to let it run indefinitely, despite the promising trend. Figure 8 shows the performance comparison for network C, from which we can appreciate that both DDPG and Q-learning perform at the same level, and that such a level is below zero, meaning that they are actually worse than doing nothing. This way, the performance of DDPG is clearly superior to Q-learning for the simplest scenario (network A), slightly better for scenarios with a few intersections (network B) and at the same level for the real-world network (network C).


Figure 8. Algorithm performance comparison on network C: episode average reward per training episode for DDPG, Q-learning and random timings.

From the evolution of the gradient for the medium and large networks, we observed that convergence was not achieved: the gradient norm always remains at the maximum value induced by the gradient norm clipping. This suggests that the algorithm either needs more training time to converge (probable for network B) or that it diverges (probable for network C). In any case, further study would be needed in order to assess the required training times and suitable convergence improvement techniques.
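A simple way to monitor this diagnostic is to log the pre-clipping gradient norm at every update and check whether it stays saturated at the clipping threshold. The sketch below shows one possible implementation in PyTorch; the paper does not state its framework or its clipping threshold, so both are assumptions here.

```python
# Monitoring sketch (PyTorch shown only for illustration): log the
# pre-clipping gradient norm of each update; a norm that is persistently
# at or above the threshold means every step is being clipped, i.e. the
# optimization has not settled.

import torch

MAX_GRAD_NORM = 1.0   # assumed clipping threshold, for illustration only

def clipped_update(network, optimizer, loss):
    """One gradient step with norm clipping; returns the pre-clipping norm."""
    optimizer.zero_grad()
    loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(network.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return float(total_norm)

# During training:
# norms = [clipped_update(critic, opt, compute_loss(batch)) for batch in batches]
# saturation = sum(n >= MAX_GRAD_NORM for n in norms) / len(norms)
# A saturation close to 1.0 matches the behaviour observed for networks B and C.
```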

VIII. CONCLUSIONS

We studied the application of Deep Deterministic Policy Gradient (DDPG) to increasingly complex scenarios. We obtained good results in network A, which is analogous to most of the scenarios used to test reinforcement learning applied to traffic light control (see section IV for details); nevertheless, for such a small network, vanilla Q-learning performs on par, albeit with less stability. However, when the complexity of the network increases, Q-learning can no longer scale, while DDPG can still improve the obtained rewards consistently. With a real-world scenario, our DDPG approach is not able to control the traffic better than doing nothing. The good trend for network B shown in figure 7 suggests that longer training times may lead to better results. This might also be true for network C, but the extremely high computational costs could not be handled without a large-scale hardware infrastructure. Our results show that DDPG is able to scale to larger networks better than classical tabular approaches like Q-learning. Therefore, DDPG is able to address, at least partially, the curse of dimensionality [44] in the traffic light control domain. However, it is not clear that the chosen reward scheme (described in section VI-F) is appropriate. One of its weaknesses is that it judges the performance of the algorithm on the individual detector information, treating all detectors equally. In real-life traffic optimization it is common to favour some areas so that traffic flow in arterials or large roads is improved, at the cost of worsening small side roads. The same principle could be applied to engineer a more realistic reward function from the point of view of traffic control theory.

In order to properly assess the applicability of the proposed approach to real-world setups, it would also be necessary to introduce a wide range of variations in the conditions of the simulation, from changes in the traffic demand to road incidents. Another aspect that needs further study is the effect of the amount and location of traffic detectors on the performance of the algorithm. In our networks A and B there were detectors at every section of the network, while in network C their placement was scattered, which is the norm in real-world scenarios. We observe a loose relation between the degree of observability of the state of the network and the performance of our proposed traffic light timing control algorithm. Further assessment of the influence of the observability of the network state would help characterize the performance of the DDPG algorithm, and could even turn it into a means for choosing potential locations for new detectors in the real world. Another issue regarding the performance of our approach is the sudden drops in the rewards obtained throughout the training process. This suggests that the landscape of the reward function with respect to the actor and critic network parameters is very irregular, which leads the optimization to fall into bad regions when climbing in the direction of the gradient. A possible future line of research that addresses this problem could be applying Trust Region Policy Optimization [45], that is, leveraging the simulated nature of our setup to explore the solution space more efficiently. This would make the approach more data efficient, achieving comparable results with less training.

We have introduced two concepts that, to the best of our knowledge, have not been used before in the deep reinforcement learning literature, namely the use of disaggregated rewards (described in section VI-H) and the scheduling of the discount factor γ (described in section VI-I). These techniques need to be studied in isolation from other factors on benchmark problems in order to properly assess their effect and their contribution to the performance of the algorithms. This is another possible line of research to be spawned from this work.
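As an illustration only (the actual schedule is the one described in section VI-I, which the sketch below does not claim to reproduce), a discount factor schedule can be as simple as a linear ramp over the first training episodes:

```python
# Minimal, hypothetical example of a discount-factor schedule: gamma is ramped
# linearly from a short-sighted value to a far-sighted one during early training,
# then held constant. All numbers are illustrative assumptions.

def gamma_schedule(episode: int, gamma_start: float = 0.5,
                   gamma_end: float = 0.99, ramp_episodes: int = 200) -> float:
    """Linearly increase gamma over the first `ramp_episodes` episodes."""
    t = min(episode / ramp_episodes, 1.0)
    return gamma_start + t * (gamma_end - gamma_start)

# gamma_schedule(0) -> 0.5, gamma_schedule(100) -> 0.745, gamma_schedule(500) -> 0.99
```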

On the other hand, we have failed to profit from the geometric information about the traffic network. This is clearly a possible future line of research, which can leverage recent advances in the application of convolutional networks to arbitrary graphs, similar to [46].

Finally, we have verified the applicability of simple deep learning architectures to the problem of traffic flow optimization by traffic light timing control on small and medium-sized traffic networks. However, for larger networks further study is needed, probably along the lines of exploring the results with significantly larger training times, using the geometric information of the network and devising data efficiency improvements.


REFERENCES

[1] M. Wiering, J. Vreeken, J. Van Veenen, and A. Koopman, "Simulation and optimization of traffic in a city," in Intelligent Vehicles Symposium, 2004 IEEE. IEEE, 2004, pp. 453-458.
[2] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," Intelligent Transportation Systems, IEEE Transactions on, vol. 12, no. 2, pp. 412-421, 2011.
[3] J. Favilla, A. Machion, and F. Gomide, "Fuzzy traffic control: adaptive strategies," in Fuzzy Systems, 1993, Second IEEE International Conference on. IEEE, 1993, pp. 506-511.
[4] S. Chiu and S. Chand, "Adaptive traffic signal control using fuzzy logic," in Fuzzy Systems, 1993, Second IEEE International Conference on. IEEE, 1993, pp. 1371-1376.
[5] C.-Q. Cai and Z.-S. Yang, "Study on urban traffic management based on multi-agent system," in 2007 International Conference on Machine Learning and Cybernetics, vol. 1. IEEE, 2007, pp. 25-29.
[6] Z. Shen, K. Wang, and F. Zhu, "Agent-based traffic simulation and traffic signal timing optimization with GPU," in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011, pp. 145-150.
[7] Y. Feng, K. L. Head, S. Khoshmagham, and M. Zamanipour, "A real-time adaptive signal control in a connected vehicle environment," Transportation Research Part C: Emerging Technologies, vol. 55, pp. 460-473, 2015.
[8] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140-1150, 2013.
[9] J. Casas, J. L. Ferrer, D. Garcia, J. Perarnau, and A. Torday, "Traffic simulation with Aimsun," in Fundamentals of traffic simulation. Springer, 2010, pp. 173-232.
[10] T. Aimsun, "Dynamic simulators users manual," Transport Simulation Systems, vol. 20, 2012.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[14] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Design of reinforcement learning parameters for seamless application of adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227-245, 2014.
[15] T. L. Thorpe, "Vehicle traffic light control using SARSA," [Online]. Available: citeseer.ist.psu.edu/thorpe97vehicle.html, 1997.
[16] I. Arel, C. Liu, T. Urbanik, and A. Kohls, "Reinforcement learning-based multi-agent system for network traffic signal control," IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128-135, 2010.
[17] M. Wiering et al., "Multi-agent reinforcement learning for traffic light control," in ICML, 2000, pp. 1151-1158.
[18] E. Camponogara and W. Kraus Jr, "Distributed learning agents in urban traffic control," in Portuguese Conference on Artificial Intelligence. Springer, 2003, pp. 324-335.
[19] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis, "Multiagent reinforcement learning for urban traffic control using coordination graphs," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 656-671.
[20] B. Bakker, S. Whiteson, L. Kester, and F. C. Groen, "Traffic light control by multiagent reinforcement learning systems," in Interactive Collaborative Information Systems. Springer, 2010, pp. 475-510.
[21] C. Guestrin, M. Lagoudakis, and R. Parr, "Coordinated reinforcement learning," in ICML, vol. 2, 2002, pp. 227-234.
[22] L. Li, Y. Lv, and F. Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247-254, July 2016.
[23] G. D. Cameron and G. I. Duncan, "Paramics—parallel microscopic simulation of road traffic," The Journal of Supercomputing, vol. 10, no. 1, pp. 25-53, 1996.
[24] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
[25] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371-3408, 2010.
[26] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625-660, 2010.
[27] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
[28] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," 2016.
[29] H. V. Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems, 2010, pp. 2613-2621.
[30] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," CoRR, abs/1509.06461, 2015.
[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[32] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121-2159, 2011.
[33] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
[34] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, 2016.
[35] G. Tesauro, "Temporal difference learning and TD-Gammon," Communications of the ACM, vol. 38, no. 3, pp. 58-68, 1995.
[36] J. B. Pollack and A. D. Blair, "Why did TD-Gammon work?" Advances in Neural Information Processing Systems, pp. 10-16, 1997.
[37] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (ICML-14). JMLR Workshop and Conference Proceedings, 2014, pp. 387-395. [Online]. Available: http://jmlr.org/proceedings/papers/v32/silver14.pdf
[38] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., "Policy gradient methods for reinforcement learning with function approximation," in NIPS, vol. 99, 1999, pp. 1057-1063.
[39] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013.
[40] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[41] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, 2010, pp. 249-256.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[43] F. Webster, "Traffic signal settings, road research technical paper no. 39," Road Research Laboratory, 1958.
[44] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[45] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in ICML, 2015, pp. 1889-1897.
[46] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3837-3845.