Asynchronous Methods for Deep Reinforcement Learning
Lavrenti Frobeen
Hardware Acceleration for Data Processing – Fall 2017
14.11.2017
Ingredients of Reinforcement Learning
▪ Set of states S
▪ Set of actions A
▪ The agent observes s_t ∈ S and takes action a_t ∈ A according to its policy, receiving a new state s_{t+1} ∈ S and a reward r_{t+1} ∈ ℝ.
▪ Total reward: R_t = Σ_{k=0}^∞ γ^k · r_{t+k}, with discount factor γ ∈ (0, 1)
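A minimal sketch (plain Python; the function name and the finite-episode assumption are mine) of how the discounted total reward R_t is computed from an episode's reward sequence:

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, truncated at the end of the episode."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# A reward of 1 received three steps in the future is discounted by gamma^3:
print(discounted_return([0, 0, 0, 1], gamma=0.5))  # 0.125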
What makes RL difficult?
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
Solving RL: Q-Learning (1)
Idea: Tabulate the "quality" of state-action pairs:
Q^π(s, a) = 𝔼[R_t | s_t = s, a_t = a, π]
Choose the action that maximizes Q (ε-greedily).
For an optimal policy it holds that:
Q*(s, a) = r                            if s′ terminal
Q*(s, a) = r + γ · max_{a′} Q*(s′, a′)  else

Q-table:
           Action 1   …   Action m
State 1       1       …     0.5
State 2       0       …     0.6
…             …       …      …
State n      0.7      …      2
Solving RL: Q-Learning (2) [Watkins 1989]
Iterate two steps on the Q-table above:
1. Choose the action that maximizes Q (ε-greedily)
2. Update the table according to
   Q*(s, a) = r                            if s′ terminal
   Q*(s, a) = r + γ · max_{a′} Q*(s′, a′)  else
3. Repeat until convergence
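To make the loop concrete, here is a minimal tabular Q-learning sketch in Python/NumPy. The environment interface (env.reset(), env.step(a) returning (next state, reward, done)) and the learning rate alpha are my assumptions, not part of the slides:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: epsilon-greedy action choice, then an update
    of Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 1. Choose the action that maximizes Q (epsilon-greedily).
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)   # assumed environment API
            # 2. Update the table toward the one-step target.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q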
Problem: Q-Tables Do Not Scale

              Action 1   …   Action 361
State 1          1       …      0.5
State 2          0       …      0.6
…                …       …       …
State 10^170    0.7      …       2

[Diagram: input state looked up in the table to obtain Q-values]
Reinforcement Learning with Neural Networks
[Diagram: input state → learned representation → Q-values]
Q(s, a, θ) ≈ Q*(s, a)
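A minimal sketch of the function-approximation idea in NumPy: a small network parameterized by θ maps a state vector to one Q-value per action (layer sizes and names are illustrative only, not the architecture shown on the following slide):

import numpy as np

class QNetwork:
    """Tiny MLP: state vector -> one Q-value per action, parameterized by theta."""
    def __init__(self, state_dim, n_actions, hidden=64):
        self.W1 = np.random.randn(state_dim, hidden) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = np.random.randn(hidden, n_actions) * 0.1
        self.b2 = np.zeros(n_actions)

    def q_values(self, s):
        h = np.maximum(0.0, s @ self.W1 + self.b1)  # ReLU hidden layer
        return h @ self.W2 + self.b2                # Q(s, a, theta) for every a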
Reinforcement Learning with Deep Neural Networks [Mnih et al. 2015]
[Figure: DQN architecture, input state → convolutional representation → Q-values. Taken from "Human-level control through deep reinforcement learning" (Mnih et al. 2015)]
Deep Q-Learning
The network is parameterized by θ.
θ can be trained by minimizing the loss
L(θ) = 𝔼[(target Q − estimated Q)²]
     = 𝔼[(r + γ · max_{a′} Q(s′, a′, θ) − Q(s, a, θ))²]
[Diagram: for a transition (state s, action a, reward r, state s′ after a), the network produces Q-values for s and s′; max_{a′} Q(s′, a′, θ) and Q(s, a, θ) yield the gradient ∂L/∂θ]
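A sketch of how this loss could be evaluated for a single transition, with a deliberately simple linear Q-function standing in for the deep network (autodiff, minibatching, and the optimizer are omitted; all names are mine):

import numpy as np

def dqn_loss(theta, transition, gamma=0.99):
    """Squared TD error (r + gamma * max_a' Q(s', a', theta) - Q(s, a, theta))^2
    for one transition, with Q(s, ., theta) = s @ theta."""
    s, a, r, s_next, done = transition
    estimated_q = (s @ theta)[a]
    target_q = r if done else r + gamma * np.max(s_next @ theta)
    return (target_q - estimated_q) ** 2

# Example with a 4-dimensional state and 2 actions (numbers are made up):
theta = np.zeros((4, 2))
print(dqn_loss(theta, (np.ones(4), 0, 1.0, np.ones(4), False)))  # 1.0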
Issues with Deep Q-Learning
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect the Q-function globally
[Diagram: the same network produces both the targets and the estimates, so every update ∂L/∂θ also moves the targets]
Breaking Feedback Loops
[Diagram: a separate target network with parameters θ⁻ computes max_{a′} Q(s′, a′, θ⁻), while the trained network (θ) computes Q(s, a, θ); the gradient ∂L/∂θ updates only the trained network]
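A minimal sketch of the mechanism, assuming the network parameters live in a dict; the refresh interval and names are illustrative:

import copy
import numpy as np

def make_target(theta):
    """Freeze a copy of the trained parameters theta as the target network theta_minus."""
    return copy.deepcopy(theta)

theta = {"W": np.zeros((4, 2))}        # trained network, updated every step
theta_minus = make_target(theta)       # target network, held fixed

for step in range(1, 100_001):
    # ... targets are computed with theta_minus, gradients update theta ...
    if step % 10_000 == 0:             # refresh interval (illustrative choice)
        theta_minus = make_target(theta)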
Breaking Feedback Loops
[Diagram: the trained parameters θ are periodically copied into the target network (θ⁻)]
Issues with Deep Q-Learning
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect the Q-function globally
[Diagram: the Deep Q-Learning training loop as before]
Experience Replay
Problem
Learning from new experiences might override previous progress!
Idea
Store state transitions during exploration.
Sample from past experience during learning to maintain diversity.
[Diagram: stored transitions, e.g. (a = f, r = 0), (a = f, r = 1), (a = b, r = 0), are sampled to compute ∂L/∂θ]
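A minimal replay-memory sketch in plain Python: transitions are stored during exploration and sampled uniformly at random for updates (capacity and batch size are illustrative):

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # Uniform sampling decorrelates updates from the current trajectory.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))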
Issues with Deep Q-Learning
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect the Q-function globally
N-Step Methods
Problem
Rewards propagate through the DQN slowly!
Idea
Take multiple actions in sequence before updating the network.
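A sketch of an n-step target: rewards from n consecutive actions are summed before bootstrapping from the last state (plain Python; the q_last argument stands in for max_{a′} Q(s_{t+n}, a′)):

def n_step_target(rewards, q_last, gamma=0.99, done=False):
    """Target = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * q_last."""
    target = 0.0
    for k, r in enumerate(rewards):             # the n observed rewards
        target += (gamma ** k) * r
    if not done:                                # bootstrap unless the episode ended
        target += (gamma ** len(rewards)) * q_last
    return target

# A reward three steps ahead reaches the current state in a single update:
print(n_step_target([0.0, 0.0, 1.0], q_last=0.0, gamma=0.5, done=True))  # 0.25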
Parallel Deep Q-Learning
[Diagram: a parameter server holds θ; Agent 1 … Agent N each work with a copy of θ and exchange updates with the server]
Massively Parallel Methods for Deep RL [Nair et al. 2015]
▪ Distributed version of DQN
▪ (Then) state-of-the-art results on the Atari domain
▪ Training takes days on 100+ machines
[Figure taken from "Massively Parallel Methods for Deep Reinforcement Learning" (Nair et al. 2015)]
Going HOGWILD!
[Diagram: on a single machine, Agent/Thread 1 … Agent/Thread N update a shared θ in RAM without locks]
Asynchronous one-step Q-learning [Mnih et al. 2016]
[Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ⁻; accumulated gradients are applied to the shared θ every 5 frames, and θ is copied into the target network θ⁻ every 40000 frames]
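A heavily simplified sketch of one such worker thread, following the structure of the diagram: gradients are accumulated thread-locally, applied to the shared θ every few steps, and θ is copied into θ⁻ at a much longer interval. The environment API, the linear Q-function, grad_fn, and the learning rate are my assumptions; the real method additionally uses a shared RMSProp optimizer and per-thread exploration rates:

import numpy as np

def async_q_worker(env, shared, grad_fn, gamma=0.99, epsilon=0.1,
                   update_every=5, target_every=40_000, max_steps=1_000_000):
    """One asynchronous one-step Q-learning thread (HOGWILD!-style, no locks)."""
    accumulated = np.zeros_like(shared["theta"])
    s, done = env.reset(), False
    for step in range(1, max_steps + 1):
        if done:
            s, done = env.reset(), False
        # Epsilon-greedy action from the shared parameters.
        if np.random.rand() < epsilon:
            a = env.sample_action()
        else:
            a = int(np.argmax(s @ shared["theta"]))
        s_next, r, done = env.step(a)
        # One-step target from the shared target network theta_minus.
        target = r if done else r + gamma * np.max(s_next @ shared["theta_minus"])
        accumulated += grad_fn(shared["theta"], s, a, target)  # thread-local gradient
        s = s_next
        if step % update_every == 0:       # "every 5 frames" on the slide
            shared["theta"] -= 1e-4 * accumulated
            accumulated[:] = 0.0
        if step % target_every == 0:       # "every 40000 frames" on the slide
            shared["theta_minus"] = shared["theta"].copy()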
What about Experience Replay?
Experience replay was scrapped!
Intuition
Parallel learners explore different paths anyway!
Parallelism has a similar effect to experience replay.
Diversity can be enforced by giving the threads different exploration policies.
[Diagram: experience replay: a single agent samples stored transitions, e.g. (a = f, r = 0), (a = f, r = 1), to compute ∂L/∂θ. Actor parallelism: Agent/Thread 1 … Agent/Thread N each compute ∂L/∂θ from their own current experience]
Policy Gradient Methods
The policy is parametrized directly as
π(a_t | s_t; θ)
and trained by gradient ascent using
∇_θ log π(a_t | s_t; θ) · R_t
which is an unbiased estimate of
∇_θ 𝔼[R_t]
[Diagram: the network maps the state to a probability distribution π(a_t | s_t; θ) over the actions a_1, a_2, a_3, …, a_n]
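A minimal sketch of this gradient estimate for a softmax policy over a linear score function (pure NumPy; the linear parameterization is mine, chosen so ∇_θ log π can be written in closed form):

import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_sample(theta, s, a, R_t):
    """grad_theta log pi(a | s; theta) * R_t for pi(. | s; theta) = softmax(theta^T s)."""
    probs = softmax(s @ theta)
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    # d/dtheta log softmax(s @ theta)[a] = outer(s, one_hot(a) - probs)
    return np.outer(s, one_hot - probs) * R_t

theta = np.zeros((4, 3))                     # 4-dimensional state, 3 actions
g = policy_gradient_sample(theta, np.ones(4), a=1, R_t=2.0)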
Asynchronous Advantage Actor-Critic (A3C)
Advantage Actor-Critic:
∇_θ log π(a_t | s_t; θ) · (R_t − V(s_t; θ_v))
Note:
R_t depends on the action taken a_t; V(s_t; θ_v) depends on the state only.
R_t − V(s_t; θ_v) is called the advantage of a_t.
[Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v; the network outputs π(a_t | s_t; θ) and V(s_t; θ_v)]
Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016]
[Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v; each thread's network outputs the policy π(a_t | s_t; θ) and the value estimate V(s_t; θ_v)]
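A sketch of the per-step quantities an A3C worker would accumulate (pure NumPy, with linear policy and value functions standing in for the shared network; R_t is the return the worker has just computed, and all names are mine):

import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def a3c_step_directions(theta, theta_v, s, a, R_t):
    """Actor and critic update directions for one step:
    actor:  grad_theta log pi(a | s; theta) * (R_t - V(s; theta_v))
    critic: direction that moves V(s; theta_v) toward R_t."""
    probs = softmax(s @ theta)
    value = float(s @ theta_v)
    advantage = R_t - value                   # advantage of action a
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    d_theta = np.outer(s, one_hot - probs) * advantage
    d_theta_v = advantage * s                 # negative gradient of 0.5 * (R_t - V)^2
    return d_theta, d_theta_v                 # added (with a step size) to the shared parameters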
Evaluation on the Atari Domain
[Figure: normalized scores on 57 Atari games (Mnih et al. 2016)]
Evaluation: Asynchronous Methods vs. DQN
[Figure: averaged scores of the different algorithms over training time for 5 games (Mnih et al. 2016)]
(Super-)Linear Speedup
[Figures: average speedup of the different algorithms for different thread counts; average speedup of A3C for different thread counts (Mnih et al. 2016)]
Evaluation with TORCS
[Figure: averaged scores of the different algorithms on the TORCS racing simulator (Mnih et al. 2016)]
Stability & Robustness
[Figure: scores of A3C with different learning rates for 5 games (Mnih et al. 2016)]
Stability & Robustness
[Figures: scores of A3C with different learning rates for 5 games; raw scores of the different algorithms for a selection of Atari games (Mnih et al. 2016)]
Takeaway Message
▪ Deep reinforcement learning is hard
▪ It requires techniques like experience replay
▪ Deep RL is easily parallelizable
▪ Parallelism can replace experience replay
▪ Dropping experience replay allows on-policy methods like actor-critic
▪ A3C surpasses state-of-the-art performance
QUESTIONS?
One-step Q-learning vs. N-step Q-learning
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0           0
Street       0           0
Goal         –           –
Hyperparameters: ε = 0.999, γ = 0.5
Next action: Go Forwards
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0           0
Street       0           0
Goal         –           –
Previous action: Go Forwards (no reward)
Next action: Go Forwards
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0           0
Street       0           1
Goal         –           –
Previous action: Go Forwards (reward: +1)
Update the table according to
Q*(s, a) = r                            if s′ terminal
Q*(s, a) = r + γ · max_{a′} Q*(s′, a′)  else
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0           0
Street       0           1
Goal         –           –
The game has restarted!
Next action: Go Forwards
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0          0.5
Street       0           1
Goal         –           –
Previous action: Go Forwards (no reward, but a rewarding follow-up action)
Update the table according to
Q*(s, a) = r                            if s′ terminal
Q*(s, a) = r + γ · max_{a′} Q*(s′, a′)  else
Next action: Go Backwards
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start        0          0.5
Street      0.25         1
Goal         –           –
Previous action: Go Backwards (no reward, but a rewarding follow-up action)
Update the table according to
Q*(s, a) = r                            if s′ terminal
Q*(s, a) = r + γ · max_{a′} Q*(s′, a′)  else
Next action: Go Backwards
Q-Learning: A Toy Example
Current Q-table:
         Backwards   Forwards
Start       0.25        0.5
Street      0.25         1
Goal         –           –
Done