
Source: ais.informatik.uni-freiburg.de/.../deep_learning...

A Brief Introduction to Reinforcement Learning

Jingwei Zhang [email protected]


Outline

• Characteristics of Reinforcement Learning (RL)

• Components of RL (MDP, value, policy, Bellman)

• Planning (policy iteration, value iteration)

• Model-free Prediction (MC, TD)

• Model-free Control (Q-Learning)

• Deep Reinforcement Learning (DQN)


Characteristics of RL: SL vs RL

• Supervised Learning

• i.i.d. data

• direct and strong supervision (label: what is the right thing to do)

• instantaneous feedback

• Reinforcement Learning

• sequential data, non-i.i.d.

• no supervisor, only a reward signal (rule: what you did is good or bad)

• delayed feedback


Components of RL: MDP

• A general framework for sequential decision making

• An MDP is a tuple:

• Markov property:

“The future is independent of the past given the present”

⟨S, A, P, R, γ⟩

S: states

A: actions

P: transition probabilities, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]

R: reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]

γ: discount factor, γ ∈ [0, 1]
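The tuple above can be written down directly as plain data. A minimal sketch, assuming a made-up two-state "rested/tired" MDP (all names and numbers here are invented for illustration):

```python
import random

# States S and actions A of a toy MDP (invented for illustration).
S = ["rested", "tired"]
A = ["work", "rest"]

# P[s][a] maps successor states s' to P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a].
P = {
    "rested": {"work": {"tired": 0.8, "rested": 0.2},
               "rest": {"rested": 1.0}},
    "tired":  {"work": {"tired": 1.0},
               "rest": {"rested": 0.6, "tired": 0.4}},
}

# R[s][a] is the expected immediate reward R^a_s = E[R_{t+1} | S_t = s, A_t = a].
R = {
    "rested": {"work": 2.0, "rest": 0.0},
    "tired":  {"work": 1.0, "rest": -0.5},
}

gamma = 0.9  # discount factor, in [0, 1]

def step(s, a):
    # The Markov property in action: sampling S_{t+1} needs only (s, a),
    # not the history that led to s.
    succ = P[s][a]
    return random.choices(list(succ), weights=list(succ.values()))[0]
```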


Components of RL: Policy, Return & Value

• Policy: π(a|s) = P[A_t = a | S_t = s]

• Return: G_t = R_{t+1} + γ R_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

• State-value function: v_π(s) = E_π[G_t | S_t = s]

• Action-value function: q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
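The return is the quantity everything else is defined from. A small sketch (the rewards and γ are arbitrary illustrative numbers) of computing G_t from a finite reward sequence, both directly and via the recursion G_t = R_{t+1} + γ G_{t+1}:

```python
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}, truncated to a finite episode
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_backward(rewards, gamma):
    # Same value via the recursion G_t = R_{t+1} + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards [1, 1, 1] with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75 either way.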


Components of RL: Bellman Equations

• Bellman Expectation Equation

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

• Bellman Optimality Equation

v_*(s) = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_*(s') )

Components of RL: Prediction vs Control

• Prediction

given a policy, evaluate how much reward you can get by following that policy

• Control

find an optimal policy that maximizes the cumulative future reward


Components of RL: Planning vs Learning

• Planning

• the underlying MDP is known

• agent only needs to perform computations on the given model

• dynamic programming (policy iteration, value iteration)

• Learning

• the underlying MDP is initially unknown

• agent needs to interact with the environment

• model-free (learn value / policy) / model-based (learn model, plan on it)


Planning: Dynamic Programming

• Applied when optimal solutions can be decomposed into subproblems

• For prediction (iterative policy evaluation):

• Input: the MDP ⟨S, A, P, R, γ⟩ and a policy π

• Output: v_π

• For control (policy iteration, value iteration):

• Input: the MDP ⟨S, A, P, R, γ⟩

• Output: v_*, π_*


Planning: Iterative Policy Evaluation

• Iterative application of the Bellman expectation backup:

v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_k(s') )

v_1 → v_2 → … → v_π
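The backup above can be sketched in a few lines; the two-state MDP, its rewards, and the fixed policy below are invented for illustration:

```python
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": {"go": 1.0}, "s1": {"stay": 1.0}}  # fixed policy to evaluate
gamma = 0.5

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman expectation backup for state s
        v = sum(pi[s].get(a, 0.0) *
                (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:  # sweep until the values stop changing
        break
```

Here the fixed point can be checked by hand: v_π(s1) = 2 + 0.5 v_π(s1) gives 4, and v_π(s0) = 1 + 0.5 * 4 = 3.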

Planning: Policy Iteration

• Evaluate the given policy and get: v_π

• Get an improved policy by acting greedily: π' = greedy(v_π)
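The improvement step π' = greedy(v_π) is a one-step lookahead over the model; a hedged sketch with placeholder tables (the MDP and the evaluated V are invented for illustration):

```python
gamma = 0.9
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0}}
V = {"s0": 15.0, "s1": 20.0}  # pretend this came from policy evaluation

def greedy(V):
    # pi'(s) = argmax_a ( R^a_s + gamma * sum_s' P^a_{ss'} V(s') )
    return {s: max(P[s], key=lambda a: R[s][a] +
                   gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in P}
```

In s0, staying is worth 0 + 0.9 * 15 = 13.5 while going is worth 1 + 0.9 * 20 = 19, so the greedy policy picks "go".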

Planning: Value Iteration

• Iterative application of the Bellman optimality backup:

v_{k+1}(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_k(s') )

v_1 → v_2 → … → v_*
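The optimality backup can be sketched the same way; the toy two-state MDP below is invented for illustration:

```python
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}
gamma = 0.5

V = {s: 0.0 for s in P}
for _ in range(100):  # the error shrinks by a factor gamma per sweep
    # Bellman optimality backup; the comprehension reads the old V
    # before the assignment rebinds it.
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s])
         for s in P}
```

By hand: v_*(s1) = 2 + 0.5 v_*(s1) = 4 and v_*(s0) = 1 + 0.5 * 4 = 3.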

Planning: Synchronous DP Algorithms (summary table in the original slides)


Model-free Prediction: MC vs TD

• Monte Carlo (MC) Learning

• learns from complete trajectories, no bootstrapping

• estimates values as the empirical mean of sample returns

• Temporal Difference (TD) Learning

• learns from incomplete episodes by bootstrapping, substituting the remainder of the trajectory with our current estimate

• updates a guess towards a guess


Model-free Prediction: MC

• Goal: learn v_π from episodes of experience under policy π

• Recall: the return is the total discounted reward: G_t = R_{t+1} + γ R_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

• Recall: the value function is the expected return: v_π(s) = E_π[G_t | S_t = s]

• MC policy evaluation (every-visit MC): uses the empirical mean return instead of the expected return
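Every-visit MC reduces to simple bookkeeping over sampled episodes. A sketch with hand-made episodes (the states, rewards, and γ = 1 are invented for illustration):

```python
from collections import defaultdict

gamma = 1.0
returns_sum = defaultdict(float)
returns_cnt = defaultdict(int)

# Each episode is a list of (S_t, R_{t+1}) pairs up to termination.
episodes = [
    [("a", 1.0), ("b", 0.0), ("b", 2.0)],
    [("a", 0.0), ("b", 2.0)],
]

for ep in episodes:
    g = 0.0
    # Walk backwards so g is always the return that followed this visit.
    for s, r in reversed(ep):
        g = r + gamma * g
        returns_sum[s] += g
        returns_cnt[s] += 1

# The empirical mean return per state stands in for the expected return.
V = {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```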


Model-free Prediction: MC → TD

• Goal: learn v_π from episodes of experience under policy π

• MC: V(S_t) ← V(S_t) + α (G_t − V(S_t))

updates V(S_t) towards the actual return: G_t

• TD: V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))

updates V(S_t) towards the estimated return: R_{t+1} + γ V(S_{t+1})
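The two updates differ only in their target; a sketch on a single transition (α, γ, and all values are illustrative numbers):

```python
alpha, gamma = 0.1, 0.9

def mc_update(v_s, g_t):
    # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)); target = actual return
    return v_s + alpha * (g_t - v_s)

def td_update(v_s, r, v_s_next):
    # V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t));
    # target = estimated return, bootstrapped from V(S_{t+1})
    return v_s + alpha * (r + gamma * v_s_next - v_s)
```

MC must wait until the episode ends to know G_t; TD can apply its update immediately after every step.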

Model-free Prediction: MC vs TD (Driving Home example)

Model-free Prediction: MC Backup

Model-free Prediction: TD Backup

Model-free Prediction: DP Backup

Model-free Prediction: Unified View

(These slides are figures in the original deck: the driving-home example, the three backup diagrams, and a unified view relating MC, TD, and DP.)


Model-free Control: Why model-free?

• The MDP is unknown, but experience can be sampled

• The MDP is known, but too big to use directly; we can only sample from it


Model-free Control: Generalized Policy Iteration

• Evaluate the given policy and get: v_π

• Get an improved policy by acting greedily: π' = greedy(v_π)

Model-free Control: V → Q

• Greedy policy improvement over V(s) requires a model of the MDP:

π'(s) = argmax_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} V(s') )

• Greedy policy improvement over Q(s, a) is model-free:

π'(s) = argmax_{a∈A} Q(s, a)
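The contrast is easy to see in code: acting greedily from Q is a bare argmax, with no P or R in sight (the Q-table values below are invented for illustration):

```python
actions = ["left", "right"]
Q = {("s0", "left"): 1.2, ("s0", "right"): 3.4}  # toy action values

def greedy_from_q(s):
    # pi'(s) = argmax_a Q(s, a): no transition model needed
    return max(actions, key=lambda a: Q[(s, a)])
```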

Model-free Control: SARSA

Q(S, A) ← Q(S, A) + α (R + γ Q(S', A') − Q(S, A))

Model-free Control: Q-Learning

Q(S, A) ← Q(S, A) + α (R + γ max_{a'} Q(S', a') − Q(S, A))
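Side by side, the two backups differ only in how they bootstrap: SARSA uses the action A' actually taken next (on-policy), Q-learning uses the best available action (off-policy). A sketch where α, γ, and the Q-table are illustrative numbers:

```python
alpha, gamma = 0.5, 0.9

def sarsa_update(Q, s, a, r, s2, a2):
    # bootstrap from Q(S', A') for the action A' actually taken
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions):
    # bootstrap from max_a' Q(S', a'), regardless of the action taken
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = {("s", "l"): 0.0, ("s", "r"): 0.0, ("t", "l"): 1.0, ("t", "r"): 2.0}
sarsa_update(Q, "s", "l", 0.0, "t", "l")              # uses Q(t, l) = 1
q_learning_update(Q, "s", "r", 0.0, "t", ["l", "r"])  # uses max = Q(t, r) = 2
```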

Model-free Control: SARSA vs Q-Learning

Model-free Control: DP vs TD

(Both comparisons appear as tables in the original slides.)

Deep Reinforcement Learning: Why?

• So far we have represented the value function by a lookup table

• every state s has an entry V(s)

• every state-action pair (s, a) has an entry Q(s, a)

• Problem with large MDPs:

• too many states and/or actions to store in memory

• too slow to learn the value of each state individually


Deep Reinforcement Learning: How?

• Use deep networks to represent:

• value function (value-based methods)

• policy (policy-based methods)

• model (model-based methods)

• Optimize value function / policy / model end-to-end

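A value-based example of the idea, hedged: instead of one table entry per state, represent v(s; w) with parameters and features and adjust w by a semi-gradient TD(0) step. A linear approximator stands in here for a deep network, and the feature vectors are invented for illustration:

```python
alpha, gamma = 0.1, 0.9

def v(w, x):
    # linear value estimate v(s; w) = w . x(s)
    return sum(wi * xi for wi, xi in zip(w, x))

def td0_step(w, x, r, x_next):
    # w <- w + alpha * (R + gamma * v(s'; w) - v(s; w)) * grad_w v(s; w),
    # where in the linear case grad_w v(s; w) = x(s)
    delta = r + gamma * v(w, x_next) - v(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]
```

With a deep network, the same TD error would instead be pushed through the network's gradients end-to-end.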

Deep Reinforcement Learning: Q-Learning → DQN
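The DQN details are shown as figures in the original deck; as a hedged sketch of the core step, DQN computes Q-learning targets with a frozen target network on minibatches drawn from a replay buffer. The "target network" below is a toy dict-backed function and every number is invented:

```python
import random

gamma = 0.99

# Replay buffer of (s, a, r, s', done) transitions; sampling minibatches
# from it breaks the correlation between consecutive experiences.
replay = [("s0", "a0", 1.0, "s1", False),
          ("s1", "a1", 0.0, "s2", True)]

def q_target(s, a):
    # stands in for the frozen target network Q(s, a; theta^-)
    return {("s1", "a0"): 2.0, ("s1", "a1"): 3.0}.get((s, a), 0.0)

def dqn_target(r, s2, done, actions=("a0", "a1")):
    # y = r                                       if the episode ended
    # y = r + gamma * max_a' Q(s', a'; theta^-)   otherwise
    if done:
        return r
    return r + gamma * max(q_target(s2, a) for a in actions)

batch = random.sample(replay, k=1)
targets = [dqn_target(r, s2, done) for (_, _, r, s2, done) in batch]
# The online network is then regressed towards these targets, e.g. by
# minimizing (y - Q(s, a; theta))^2 over the minibatch.
```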


Deep Reinforcement Learning: AI = RL + DL

• Reinforcement Learning (RL)

• a general-purpose framework for decision making

• learn policies to maximize future reward

• Deep Learning (DL)

• a general-purpose framework for representation learning

• given an objective, learn the representation required to achieve it

• DRL: a single agent that can solve any human-level task

• RL defines the objective

• DL gives the mechanism

• RL + DL = general intelligence


Some Recommendations

• Reinforcement Learning lectures by David Silver on YouTube

• Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto, 2nd edition

• DQN Nature paper: Human-level Control Through Deep Reinforcement Learning

• Flappy Bird:

• Tabular RL: https://github.com/SarvagyaVaish/FlappyBirdRL

• Deep RL: https://github.com/songrotek/DRL-FlappyBird

• Many third-party implementations; search for "deep reinforcement learning", "dqn", or "a3c" on GitHub

• My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl
