
Source: ais.informatik.uni-freiburg.de/.../deep_learning...

A Brief Introduction to Reinforcement Learning

Jingwei Zhang [email protected]


Outline

• Characteristics of Reinforcement Learning (RL)

• Components of RL (MDP, value, policy, Bellman)

• Planning (policy iteration, value iteration)

• Model-free Prediction (MC, TD)

• Model-free Control (Q-Learning)

• Deep Reinforcement Learning (DQN)


Characteristics of RL: SL vs RL

• Supervised Learning

• i.i.d. data

• direct and strong supervision (label: what is the right thing to do)

• instantaneous feedback

• Reinforcement Learning

• sequential data, non-i.i.d.

• no supervisor, only a reward signal (rule: what you did is good or bad)

• delayed feedback


Components of RL: MDP

• A general framework for sequential decision making

• An MDP is a tuple:

• Markov property:

“The future is independent of the past given the present”

⟨S, A, P, R, γ⟩

S: states

A: actions

P: transition probabilities, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]

R: reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]

γ: discount factor, γ ∈ [0, 1]
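The tuple above can be written down directly as plain data. A minimal sketch, assuming a made-up two-state "rested/tired" MDP (all names and numbers here are invented for illustration):

```python
import random

# States S and actions A of a toy MDP (invented for illustration).
S = ["rested", "tired"]
A = ["work", "rest"]

# P[s][a] maps successor states s' to P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a].
P = {
    "rested": {"work": {"tired": 0.8, "rested": 0.2},
               "rest": {"rested": 1.0}},
    "tired":  {"work": {"tired": 1.0},
               "rest": {"rested": 0.6, "tired": 0.4}},
}

# R[s][a] is the expected immediate reward R^a_s = E[R_{t+1} | S_t = s, A_t = a].
R = {
    "rested": {"work": 2.0, "rest": 0.0},
    "tired":  {"work": 1.0, "rest": -0.5},
}

gamma = 0.9  # discount factor, in [0, 1]

def step(s, a):
    # The Markov property in action: sampling S_{t+1} needs only (s, a),
    # not the history that led to s.
    succ = P[s][a]
    return random.choices(list(succ), weights=list(succ.values()))[0]
```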


Components of RL: Policy, Return & Value

• Policy: π(a|s) = P[A_t = a | S_t = s]

• Return: G_t = R_{t+1} + γ R_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

• State-value function: v_π(s) = E_π[G_t | S_t = s]

• Action-value function: q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
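The return is the quantity everything else is defined from. A small sketch (the rewards and γ are arbitrary illustrative numbers) of computing G_t from a finite reward sequence, both directly and via the recursion G_t = R_{t+1} + γ G_{t+1}:

```python
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}, truncated to a finite episode
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_backward(rewards, gamma):
    # Same value via the recursion G_t = R_{t+1} + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards [1, 1, 1] with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75 either way.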


Components of RL: Bellman Equations

• Bellman Expectation Equation

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

• Bellman Optimality Equation

v_*(s) = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_*(s') )

Components of RL: Prediction vs Control

• Prediction

given a policy, evaluate how much reward you can get by following that policy

• Control

find an optimal policy that maximizes the cumulative future reward


Components of RL: Planning vs Learning

• Planning

• the underlying MDP is known

• agent only needs to perform computations on the given model

• dynamic programming (policy iteration, value iteration)

• Learning

• the underlying MDP is initially unknown

• agent needs to interact with the environment

• model-free (learn value / policy) / model-based (learn model, plan on it)


Planning: Dynamic Programming

• Applied when optimal solutions can be decomposed into subproblems

• For prediction (iterative policy evaluation):

• Input: the MDP ⟨S, A, P, R, γ⟩ and a policy π

• Output: v_π

• For control (policy iteration, value iteration):

• Input: the MDP ⟨S, A, P, R, γ⟩

• Output: v_*, π_*


Planning: Iterative Policy Evaluation

• Iterative application of the Bellman expectation backup:

v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_k(s') )

v_1 → v_2 → … → v_π
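The backup above can be sketched in a few lines; the two-state MDP, its rewards, and the fixed policy below are invented for illustration:

```python
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": {"go": 1.0}, "s1": {"stay": 1.0}}  # fixed policy to evaluate
gamma = 0.5

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman expectation backup for state s
        v = sum(pi[s].get(a, 0.0) *
                (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:  # sweep until the values stop changing
        break
```

Here the fixed point can be checked by hand: v_π(s1) = 2 + 0.5 v_π(s1) gives 4, and v_π(s0) = 1 + 0.5 * 4 = 3.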

Planning: Policy Iteration

• Evaluate the given policy and get: v_π

• Get an improved policy by acting greedily: π' = greedy(v_π)
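The improvement step π' = greedy(v_π) is a one-step lookahead over the model; a hedged sketch with placeholder tables (the MDP and the evaluated V are invented for illustration):

```python
gamma = 0.9
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0}}
V = {"s0": 15.0, "s1": 20.0}  # pretend this came from policy evaluation

def greedy(V):
    # pi'(s) = argmax_a ( R^a_s + gamma * sum_s' P^a_{ss'} V(s') )
    return {s: max(P[s], key=lambda a: R[s][a] +
                   gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in P}
```

In s0, staying is worth 0 + 0.9 * 15 = 13.5 while going is worth 1 + 0.9 * 20 = 19, so the greedy policy picks "go".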

Planning: Value Iteration

• Iterative application of the Bellman optimality backup:

v_{k+1}(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_k(s') )

v_1 → v_2 → … → v_*
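The optimality backup can be sketched the same way; the toy two-state MDP below is invented for illustration:

```python
# Toy MDP tables: P[s][a] = {s': prob}, R[s][a] = expected reward.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}
gamma = 0.5

V = {s: 0.0 for s in P}
for _ in range(100):  # the error shrinks by a factor gamma per sweep
    # Bellman optimality backup; the comprehension reads the old V
    # before the assignment rebinds it.
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s])
         for s in P}
```

By hand: v_*(s1) = 2 + 0.5 v_*(s1) = 4 and v_*(s0) = 1 + 0.5 * 4 = 3.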

Planning: Synchronous DP Algorithms (summary table in the original slides)


Model-free Prediction: MC vs TD

• Monte Carlo (MC) Learning

• learns from complete trajectories, no bootstrapping

• estimates values as the empirical mean of sample returns

• Temporal Difference (TD) Learning

• learns from incomplete episodes by bootstrapping, substituting the remainder of the trajectory with our current estimate

• updates a guess towards a guess


Model-free Prediction: MC

• Goal: learn v_π from episodes of experience under policy π

• Recall: the return is the total discounted reward: G_t = R_{t+1} + γ R_{t+2} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

• Recall: the value function is the expected return: v_π(s) = E_π[G_t | S_t = s]

• MC policy evaluation (every-visit MC): uses the empirical mean return instead of the expected return
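Every-visit MC reduces to simple bookkeeping over sampled episodes. A sketch with hand-made episodes (the states, rewards, and γ = 1 are invented for illustration):

```python
from collections import defaultdict

gamma = 1.0
returns_sum = defaultdict(float)
returns_cnt = defaultdict(int)

# Each episode is a list of (S_t, R_{t+1}) pairs up to termination.
episodes = [
    [("a", 1.0), ("b", 0.0), ("b", 2.0)],
    [("a", 0.0), ("b", 2.0)],
]

for ep in episodes:
    g = 0.0
    # Walk backwards so g is always the return that followed this visit.
    for s, r in reversed(ep):
        g = r + gamma * g
        returns_sum[s] += g
        returns_cnt[s] += 1

# The empirical mean return per state stands in for the expected return.
V = {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```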


Model-free Prediction: MC → TD

• Goal: learn v_π from episodes of experience under policy π

• MC: V(S_t) ← V(S_t) + α (G_t − V(S_t))

updates V(S_t) towards the actual return: G_t

• TD: V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))

updates V(S_t) towards the estimated return: R_{t+1} + γ V(S_{t+1})
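The two updates differ only in their target; a sketch on a single transition (α, γ, and all values are illustrative numbers):

```python
alpha, gamma = 0.1, 0.9

def mc_update(v_s, g_t):
    # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)); target = actual return
    return v_s + alpha * (g_t - v_s)

def td_update(v_s, r, v_s_next):
    # V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t));
    # target = estimated return, bootstrapped from V(S_{t+1})
    return v_s + alpha * (r + gamma * v_s_next - v_s)
```

MC must wait until the episode ends to know G_t; TD can apply its update immediately after every step.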

Model-free Prediction: MC vs TD (Driving Home example)

Model-free Prediction: MC Backup

Model-free Prediction: TD Backup

Model-free Prediction: DP Backup

Model-free Prediction: Unified View

(These slides are figures in the original deck: the driving-home example, the three backup diagrams, and a unified view relating MC, TD, and DP.)


Model-free Control: Why model-free?

• The MDP is unknown, but experience can be sampled

• The MDP is known, but too big to use directly; we can only sample from it


Model-free Control: Generalized Policy Iteration

• Evaluate the given policy and get: v_π

• Get an improved policy by acting greedily: π' = greedy(v_π)

Model-free Control: V → Q

• Greedy policy improvement over V(s) requires a model of the MDP:

π'(s) = argmax_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} V(s') )

• Greedy policy improvement over Q(s, a) is model-free:

π'(s) = argmax_{a∈A} Q(s, a)
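The contrast is easy to see in code: acting greedily from Q is a bare argmax, with no P or R in sight (the Q-table values below are invented for illustration):

```python
actions = ["left", "right"]
Q = {("s0", "left"): 1.2, ("s0", "right"): 3.4}  # toy action values

def greedy_from_q(s):
    # pi'(s) = argmax_a Q(s, a): no transition model needed
    return max(actions, key=lambda a: Q[(s, a)])
```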

Model-free Control: SARSA

Q(S, A) ← Q(S, A) + α (R + γ Q(S', A') − Q(S, A))

Model-free Control: Q-Learning

Q(S, A) ← Q(S, A) + α (R + γ max_{a'} Q(S', a') − Q(S, A))
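Side by side, the two backups differ only in how they bootstrap: SARSA uses the action A' actually taken next (on-policy), Q-learning uses the best available action (off-policy). A sketch where α, γ, and the Q-table are illustrative numbers:

```python
alpha, gamma = 0.5, 0.9

def sarsa_update(Q, s, a, r, s2, a2):
    # bootstrap from Q(S', A') for the action A' actually taken
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions):
    # bootstrap from max_a' Q(S', a'), regardless of the action taken
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = {("s", "l"): 0.0, ("s", "r"): 0.0, ("t", "l"): 1.0, ("t", "r"): 2.0}
sarsa_update(Q, "s", "l", 0.0, "t", "l")              # uses Q(t, l) = 1
q_learning_update(Q, "s", "r", 0.0, "t", ["l", "r"])  # uses max = Q(t, r) = 2
```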

Model-free Control: SARSA vs Q-Learning

Model-free Control: DP vs TD

(Both comparisons appear as tables in the original slides.)

Deep Reinforcement Learning: Why?

• So far we have represented the value function by a lookup table

• every state s has an entry V(s)

• every state-action pair (s, a) has an entry Q(s, a)

• Problem with large MDPs:

• too many states and/or actions to store in memory

• too slow to learn the value of each state individually


Deep Reinforcement Learning: How?

• Use deep networks to represent:

• value function (value-based methods)

• policy (policy-based methods)

• model (model-based methods)

• Optimize value function / policy / model end-to-end

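A value-based example of the idea, hedged: instead of one table entry per state, represent v(s; w) with parameters and features and adjust w by a semi-gradient TD(0) step. A linear approximator stands in here for a deep network, and the feature vectors are invented for illustration:

```python
alpha, gamma = 0.1, 0.9

def v(w, x):
    # linear value estimate v(s; w) = w . x(s)
    return sum(wi * xi for wi, xi in zip(w, x))

def td0_step(w, x, r, x_next):
    # w <- w + alpha * (R + gamma * v(s'; w) - v(s; w)) * grad_w v(s; w),
    # where in the linear case grad_w v(s; w) = x(s)
    delta = r + gamma * v(w, x_next) - v(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]
```

With a deep network, the same TD error would instead be pushed through the network's gradients end-to-end.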

Deep Reinforcement Learning: Q-Learning → DQN
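The DQN details are shown as figures in the original deck; as a hedged sketch of the core step, DQN computes Q-learning targets with a frozen target network on minibatches drawn from a replay buffer. The "target network" below is a toy dict-backed function and every number is invented:

```python
import random

gamma = 0.99

# Replay buffer of (s, a, r, s', done) transitions; sampling minibatches
# from it breaks the correlation between consecutive experiences.
replay = [("s0", "a0", 1.0, "s1", False),
          ("s1", "a1", 0.0, "s2", True)]

def q_target(s, a):
    # stands in for the frozen target network Q(s, a; theta^-)
    return {("s1", "a0"): 2.0, ("s1", "a1"): 3.0}.get((s, a), 0.0)

def dqn_target(r, s2, done, actions=("a0", "a1")):
    # y = r                                       if the episode ended
    # y = r + gamma * max_a' Q(s', a'; theta^-)   otherwise
    if done:
        return r
    return r + gamma * max(q_target(s2, a) for a in actions)

batch = random.sample(replay, k=1)
targets = [dqn_target(r, s2, done) for (_, _, r, s2, done) in batch]
# The online network is then regressed towards these targets, e.g. by
# minimizing (y - Q(s, a; theta))^2 over the minibatch.
```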


Deep Reinforcement Learning: AI = RL + DL

• Reinforcement Learning (RL)

• a general-purpose framework for decision making

• learn policies to maximize future reward

• Deep Learning (DL)

• a general-purpose framework for representation learning

• given an objective, learn the representation required to achieve it

• DRL: a single agent that can solve any human-level task

• RL defines the objective

• DL gives the mechanism

• RL + DL = general intelligence


Some Recommendations

• Reinforcement Learning lectures by David Silver on YouTube

• Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto, 2nd edition

• DQN Nature paper: Human-level Control Through Deep Reinforcement Learning

• Flappy Bird:

• Tabular RL: https://github.com/SarvagyaVaish/FlappyBirdRL

• Deep RL: https://github.com/songrotek/DRL-FlappyBird

• Many third-party implementations; search for "deep reinforcement learning", "dqn", or "a3c" on GitHub

• My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl
