
Machine Learning

Reinforcement Learning

Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Institut für Informatik, Technische Fakultät
Albert-Ludwigs-Universität Freiburg
[email protected]

Motivation

Can a software agent learn to play Backgammon by itself?
Learning from success or failure
Neuro-Backgammon: playing at world-champion level (Tesauro, 1992)

Can a software agent learn to balance a pole by itself?
Learning from success or failure
Neural RL controllers: noisy, unknown, nonlinear (Riedmiller et al.)

Can a software agent learn to cooperate with others by itself?
Learning from success or failure
Cooperative RL agents: complex, multi-agent, cooperative (Riedmiller et al.)

Reinforcement Learning

has biological roots: reward and punishment
no teacher, but: actions + goal → learn an algorithm / policy

[Figure: an agent chooses among Action 1, Action 2, ... to reach a Goal — 'Happy Programming']

Actor-Critic Scheme (Barto, Sutton, 1983)

[Figure: actor and critic interacting with the world; the critic turns the external reward into an internal training signal for the actor]

Actor-Critic Scheme:

• Critic maps the external, delayed reward into an internal training signal
• Actor represents the policy

Overview

I Reinforcement Learning - Basics


A First Example

[Maze figure: grid world with a Goal cell]

Repeat
    Choose action a ∈ {→, ←, ↑}
Until the Goal is reached

The 'Temporal Credit Assignment' Problem

Which action(s) in the sequence have to be changed?

⇒ Temporal Credit Assignment Problem

Sequential Decision Making

Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), Cart-Pole Balancing (AHC/ACE, Barto, Sutton, Anderson, 1983), Robotics and control, ...

Three Steps

⇒ Describe the environment as a Markov Decision Process (MDP)
⇒ Formulate the learning task as a dynamic optimization problem
⇒ Solve the dynamic optimization problem by dynamic programming methods

1. Description of the environment

S: (finite) set of states
A: (finite) set of actions

Behaviour of the environment ('model'):
p : S × S × A → [0, 1], where p(s′, s, a) is the probability distribution of the transition.

For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function

f : S × A → S,   s′ = f(s, a)

'Markov' property: the transition only depends on the current state and action:

Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, ...)
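A minimal sketch in Python (not part of the lecture) of what such a deterministic model can look like for a toy corridor 'maze': the sets S and A and the transition function f follow the slide's notation, while the concrete five-cell corridor with an absorbing goal cell is an assumption made up for illustration.

# Toy deterministic model (assumption): a corridor of cells 0..4, cell 4 is the goal.
S = list(range(5))          # (finite) set of states
A = ["left", "right"]       # (finite) set of actions
GOAL = 4

def f(s, a):
    """Deterministic transition function f: S x A -> S, s' = f(s, a)."""
    if s == GOAL:                       # the goal is absorbing
        return s
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

# The Markov property holds by construction: f depends only on (s, a).
print(f(2, "right"))                    # -> 3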

2. Formulation of the learning task

Every transition emits transition costs ('immediate costs'), c : S × A → ℝ (sometimes also called 'immediate reward', r).

Now an agent policy π : S → A can be evaluated (and judged). Consider the path costs

J^π(s) = Σ_t c(s_t, π(s_t)),   s_0 = s

Wanted: an optimal policy π* : S → A where

J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }

[Maze figure with immediate transition costs of 1 and 2]

⇒ Additive (path-)costs allow to consider all events
⇒ Does this solve the temporal credit assignment problem? YES!

Choice of the immediate cost function c(·) specifies the policy to be learned

Example:

c(s) = 0,    if s is a success state (s ∈ Goal)
       1000, if s is a failure state (s ∈ Failure)
       1,    else

[Maze figure: two paths from the start to the Goal]

J^π(s_start) = 12 and J^π(s_start) = 1004 for the two policies shown in the figure.

⇒ Specification of the requested policy by c(·) is simple!
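To see how the choice of c(·) singles out the desired behaviour, the following sketch rolls out two fixed policies on an assumed toy corridor (one Goal cell, one Failure cell, both made up, not the slide's maze) and sums up the immediate costs with the scheme from the slide: 0 on success, 1000 on failure, 1 per ordinary step.

GOAL, FAIL = 4, -1                      # hypothetical terminal cells

def f(s, a):
    """Deterministic transitions on a corridor; terminal cells are absorbing."""
    if s in (GOAL, FAIL):
        return s
    return s + 1 if a == "right" else s - 1

def c(s, a):
    """Immediate cost of the transition taken in s (the slide writes c on states)."""
    s_next = f(s, a)
    if s_next == GOAL:
        return 0
    if s_next == FAIL:
        return 1000
    return 1

def path_costs(pi, s, max_steps=100):
    """J_pi(s) = sum_t c(s_t, pi(s_t)) along the trajectory starting in s."""
    total = 0
    while s not in (GOAL, FAIL) and max_steps > 0:
        total += c(s, pi(s))
        s = f(s, pi(s))
        max_steps -= 1
    return total

to_goal = lambda s: "right"             # walks straight towards the goal
to_fail = lambda s: "left"              # walks into the failure cell
print(path_costs(to_goal, 0), path_costs(to_fail, 0))   # 3 1000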

3. Solving the optimization problem

For the optimal path costs it is known that

J*(s) = min_a { c(s, a) + J*(f(s, a)) }

(Principle of Optimality, Bellman, 1959)

⇒ Can we compute J*? (We will see why soon.)

Computing J*: the value iteration (VI) algorithm

Start with an arbitrary J_0(s)

for all states s:   J_{k+1}(s) := min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }

[Figure: one backup in the grid world — a cell with unknown value '?', successor values 5, 7, 2 and transition costs 1 each, receives the value 3 = min{5, 7, 2} + 1]
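The update above is easy to run as a table sweep. The sketch below applies the synchronous backup J_{k+1}(s) := min_a { c(s, a) + J_k(f(s, a)) } to the assumed toy corridor used in the earlier sketches; the unit step costs and the number of sweeps are illustrative choices.

# Value iteration on the assumed corridor: states 0..4, goal = 4, unit step costs.
S = list(range(5))
A = ["left", "right"]
GOAL = 4

def f(s, a):
    if s == GOAL:
        return s
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

def c(s, a):
    return 0 if s == GOAL else 1        # zero cost in the absorbing goal

J = {s: 0.0 for s in S}                 # arbitrary J_0
for k in range(50):                     # a few synchronous sweeps over all states
    J = {s: min(c(s, a) + J[f(s, a)] for a in A) for s in S}

print(J)   # converges to the shortest-path costs {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}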

Convergence of value iteration

Value iteration converges under certain assumptions, i.e. we have

lim_{k→∞} J_k = J*

⇒ Discounted problems: J^{π*}(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s }, where 0 ≤ γ < 1 (contraction mapping)

⇒ Stochastic shortest path problems:

• there exists an absorbing terminal state with zero costs
• there exists a 'proper' policy (a policy that has a non-zero chance to finally reach the terminal state)
• every non-proper policy has infinite path costs for at least one state

Ok, now we have J*

⇒ When J* is known, then we also know an optimal policy:

π*(s) ∈ argmin_{a ∈ A} { c(s, a) + J*(f(s, a)) }

[Figure: greedy action selection in the grid world, using the successor values 5, 7, 2 and transition costs 1]
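Extracting the greedy policy from J* can be sketched as follows; J_star is the table one obtains from value iteration on the assumed corridor, not a value taken from the lecture.

S, A, GOAL = list(range(5)), ["left", "right"], 4

def f(s, a):
    if s == GOAL:
        return s
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

def c(s, a):
    return 0 if s == GOAL else 1

J_star = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}          # optimal costs-to-go from value iteration

def pi_star(s):
    """pi*(s) in argmin_a { c(s, a) + J*(f(s, a)) }."""
    return min(A, key=lambda a: c(s, a) + J_star[f(s, a)])

# The action chosen in the absorbing goal is arbitrary (all actions are equally good there).
print([pi_star(s) for s in S])                    # ['right', 'right', 'right', 'right', 'left']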

Back to our maze

[Figure: the maze with Start and Goal, each cell annotated with its optimal costs-to-go J* — small values (1 to 11) near the Goal, values around 1000 for cells whose path leads through the failure state]

Overview of the approach so far

• Description of the learning task as an MDP: S, A, T, f, c — where c specifies the requested behaviour / policy
• Iterative computation of the optimal path costs J*:
  ∀s ∈ S : J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }
• Computation of an optimal policy from J*:
  π*(s) ∈ argmin_{a ∈ A} { c(s, a) + J*(f(s, a)) }
• The value function ('costs-to-go') can be stored in a table

Overview of the approach: Stochastic Domains

• Value iteration in stochastic environments:
  ∀s ∈ S : J_{k+1}(s) = min_{a ∈ A} { Σ_{s′ ∈ S} p(s, s′, a) (c(s, a) + J_k(s′)) }
• Computation of an optimal policy from J*:
  π*(s) ∈ argmin_{a ∈ A} { Σ_{s′ ∈ S} p(s, s′, a) (c(s, a) + J*(s′)) }
• The value function J ('costs-to-go') can be stored in a table
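A sketch of the stochastic backup with the same argument order p(s, s′, a) as on the slide; the three-state 'slippery' chain and its transition probabilities are invented purely for illustration.

S = [0, 1, 2]                       # state 2 is an absorbing, zero-cost goal (assumption)
A = ["go", "stay"]

def p(s, s_next, a):
    """Transition probabilities p(s, s', a) of the toy chain."""
    if s == 2:                                   # absorbing goal
        return 1.0 if s_next == 2 else 0.0
    if a == "go":                                # moving on succeeds with probability 0.8
        return 0.8 if s_next == s + 1 else (0.2 if s_next == s else 0.0)
    return 1.0 if s_next == s else 0.0           # "stay" keeps the state

def c(s, a):
    return 0.0 if s == 2 else 1.0

J = {s: 0.0 for s in S}
for _ in range(100):                             # synchronous sweeps
    J = {s: min(sum(p(s, s2, a) * (c(s, a) + J[s2]) for s2 in S) for a in A)
         for s in S}

print(J)   # expected costs-to-go, approx. J[0] = 2.5, J[1] = 1.25, J[2] = 0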

Reinforcement Learning

Problems of Value Iteration:

for all s ∈ S :   J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }

Problems:

• Size of S (Chess, robotics, ...) ⇒ learning time, storage?
• The 'model' (transition behaviour) f(s, a) or p(s′, s, a) must be known!

Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview

• Real-Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
• Model-free learning (Q-Learning, Watkins, 1989)
• Neural representation of the value function (or alternative function approximators)

Real-Time Dynamic Programming (Barto, Sutton, Watkins, 1989)

Idea: instead of "for all s ∈ S", now "for some s ∈ S" ...
⇒ learning based on trajectories (experiences)

[Maze figure: a sample trajectory towards the Goal]

Q-Learning

Idea (Watkins, Diss., 1989):

In every state, store for every action the expected costs-to-go. Q^π(s, a) denotes the expected future path costs for applying action a in state s (and continuing according to policy π):

Q^π(s, a) := Σ_{s′ ∈ S} p(s′, s, a) (c(s, a) + J^π(s′))

where J^π(s′) are the expected path costs when starting from s′ and acting according to π.

Q-learning: Action selection

Action selection is now possible without a model:

Original VI: state evaluation. Action selection:
π*(s) ∈ argmin_a { c(s, a) + J*(f(s, a)) }

Q: state-action evaluation. Action selection:
π*(s) = argmin_a Q*(s, a)

[Figure: greedy selection via state values (successors 5, 7, 2 plus transition costs 1) vs. directly via Q-values (3, 6, 8)]

Learning an optimal Q-Function

To find Q*, a value iteration algorithm can be applied:

Q_{k+1}(s, a) := Σ_{s′ ∈ S} p(s′, s, a) (c(s, a) + J_k(s′))

where J_k(s) = min_{a′ ∈ A(s)} Q_k(s, a′)

⋄ Furthermore, learning a Q-function without a model, from experience of transition tuples (s, a) → s′ only, is possible:

Q-learning (Q-Value Iteration + Robbins-Monro stochastic approximation):

Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (c(s, a) + min_{a′ ∈ A(s′)} Q_k(s′, a′))
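The model-based Q-value iteration above can be sketched as follows. On the assumed deterministic corridor the sum over s′ collapses to the single successor f(s, a), so p(s′, s, a) does not appear explicitly; costs and iteration count are again illustrative.

S, A, GOAL = list(range(5)), ["left", "right"], 4

def f(s, a):
    if s == GOAL:
        return s
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

def c(s, a):
    return 0 if s == GOAL else 1

Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(50):
    # Q_{k+1}(s, a) := c(s, a) + min_{a'} Q_k(f(s, a), a')   (deterministic model)
    Q = {(s, a): c(s, a) + min(Q[(f(s, a), a2)] for a2 in A)
         for s in S for a in A}

# Greedy action selection argmin_a Q*(s, a) needs no model at decision time.
print({s: min(A, key=lambda a: Q[(s, a)]) for s in S})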

Summary Q-learning

Q-learning is a variant of value iteration for the case when no model is available.

It is based on two major ingredients:

• it uses a representation of the costs-to-go for state/action pairs, Q(s, a)
• it uses a stochastic approximation scheme to incrementally compute expectation values on the basis of observed transitions (s, a) → s′

⋄ It converges under the same assumptions as value iteration + 'every state/action pair has to be visited infinitely often' + the conditions for stochastic approximation.

Q-Learning algorithm

Repeat
    start in an arbitrary initial state s_0; t = 0
    Repeat
        choose the action greedily, u_t := argmin_{a ∈ A} Q_k(s_t, a),
        or choose u_t according to an exploration scheme
        apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t)
        learn the Q-value:
        Q_{k+1}(s_t, u_t) := (1 − α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})),
        where J_k(s_{t+1}) := min_{a ∈ A} Q_k(s_{t+1}, a)
    Until a terminal state is reached
Until the policy is optimal ('enough')
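A compact sketch of this loop on an assumed noisy corridor: the agent never uses a model, only the observed transitions (s, a) → s′. The ε-greedy exploration scheme, the step noise, the learning rate and the episode count are illustrative choices, not the lecture's.

import random

S, A, GOAL = list(range(5)), ["left", "right"], 4
ALPHA, EPS = 0.5, 0.1                             # learning rate and exploration rate

def step(s, a):
    """Environment (assumption): the intended move succeeds with probability 0.9."""
    if s == GOAL:
        return s
    move = a if random.random() < 0.9 else random.choice(A)
    return min(s + 1, GOAL) if move == "right" else max(s - 1, 0)

def c(s, a):
    return 0 if s == GOAL else 1

Q = {(s, a): 0.0 for s in S for a in A}

for episode in range(500):                        # Repeat (episodes)
    s = random.choice(S)                          # start in an arbitrary initial state
    while s != GOAL:                              # Repeat until a terminal state is reached
        if random.random() < EPS:                 # exploration scheme (epsilon-greedy)
            a = random.choice(A)
        else:                                     # greedy action u_t = argmin_a Q(s_t, a)
            a = min(A, key=lambda a_: Q[(s, a_)])
        s_next = step(s, a)                       # apply the action in the environment
        J_next = min(Q[(s_next, a_)] for a_ in A) # J_k(s_{t+1}) = min_a Q_k(s_{t+1}, a)
        # learn the Q-value (stochastic-approximation update)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (c(s, a) + J_next)
        s = s_next

print({s: min(A, key=lambda a: Q[(s, a)]) for s in S})   # learned greedy policy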

Representation of the path costs in a function approximator

Idea: neural representation of the value function (or alternative function approximators) (Neuro-Dynamic Programming, Bertsekas, 1987)

[Figure: multilayer neural network with weights w_ij mapping a state x to J(x)]

⇒ few parameters (here: weights) specify the value function for a large state space

⇒ learning by gradient descent:   ∂E/∂w_ij = ∂(J(s′) − c(s, a) − J(s))² / ∂w_ij
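As a sketch of the gradient-descent idea, with a linear function approximator instead of the lecture's neural network and a semi-gradient step that holds the target c + J(s′) fixed (both of these are assumptions, not the lecture's setup):

import numpy as np

def phi(x):
    """Hypothetical feature vector for a one-dimensional state x."""
    return np.array([1.0, x, x * x])

w = np.zeros(3)                          # the few parameters that represent J
LR = 0.01                                # learning rate

def J(x):
    return float(w @ phi(x))

def td_update(x, cost, x_next):
    """One gradient-descent step on the squared TD error for a transition x -> x'."""
    global w
    delta = cost + J(x_next) - J(x)      # temporal-difference error
    w += LR * delta * phi(x)             # descent direction with the target held fixed

# Example with a single made-up transition of immediate cost 1.
td_update(x=0.5, cost=1.0, x_next=0.7)
print(w)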

Example: learning to intercept in robotic soccer

• as fast as possible (anticipation of the intercept position)
• random noise in ball and player movement → need for corrections
• a sequence of turn(θ) and dash(v) commands is required

⇒ Handcoding a routine is a lot of work, with many parameters to tune!

Reinforcement learning of intercept

Goal: the ball is in the kick range of the player

• state space: S_work = positions on the pitch
• S+: ball in kick range
• S−: e.g. collision with an opponent
• c(s) = 0,    if s ∈ S+
         1,    if s ∈ S−
         0.01, else
• actions: turn(10°), turn(20°), ..., turn(360°), ..., dash(10), dash(20), ...
• neural value function (6-20-1 architecture)

Learning curves

[Plots from "kick.stat": percentage of successes, and costs (time to intercept), over the course of training]