
Page 1

Reinforcement Learning

Tamara Berg

CS 590-133 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer

Page 2

Announcements

• HW1 is graded
  – Grades are in your submission folders in a file called grade.txt (mean = 14.15, median = 18)
  – On Thursday we'll have demos of the extra-credit submissions

• Office hours canceled today

Page 3

Reminder

• Mid-term exam next Tuesday, Feb 18
  – Held during regular class time in SN014 and SN011
  – Closed book
  – Short-answer written questions
  – Shubham will hold a mid-term review + Q&A session, Feb 14 at 5pm in SN014

Page 4

Exam topics

1) Intro to AI, agents and environments
   Turing test
   Rationality
   Expected utility maximization
   PEAS
   Environment characteristics: fully vs. partially observable, deterministic vs. stochastic, episodic vs. sequential, static vs. dynamic, discrete vs. continuous, single-agent vs. multi-agent, known vs. unknown

2) Search
   Search problem formulation: initial state, actions, transition model, goal state, path cost
   State space
   Search tree
   Frontier
   Evaluation of search strategies: completeness, optimality, time complexity, space complexity
   Uninformed search strategies: breadth-first search, uniform cost search, depth-first search, iterative deepening search
   Informed search strategies: greedy best-first, A*, weighted A*
   Heuristics: admissibility, dominance

Page 5

Exam topics

3) Constraint satisfaction problems
   Backtracking search
   Heuristics: most constrained/most constraining variable, least constraining value
   Forward checking, constraint propagation, arc consistency
   Tree-structured CSPs
   Local search

4) Games
   Zero-sum games
   Game tree
   Minimax/Expectimax/Expectiminimax search
   Alpha-beta pruning
   Evaluation function
   Quiescence search
   Horizon effect
   Stochastic elements in games

Page 6

Exam topics

5) Markov decision processes
   Markov assumption, transition model, policy
   Bellman equation
   Value iteration
   Policy iteration

6) Reinforcement learning
   Model-based vs. model-free approaches
   Passive vs. Active
   Exploration vs. exploitation
   Direct Estimation
   TD Learning
   TD Q-learning

Page 7

Reminder from last class

Page 8

Stochastic, sequential environments

Image credit: P. Abbeel and D. Klein

Markov Decision Processes

Page 9

Markov Decision Processes

• Components:
  – States s, beginning with initial state s0
  – Actions a
    • Each state s has actions A(s) available from it
  – Transition model P(s' | s, a)
    • Markov assumption: the probability of going to s' from s depends only on s and a, and not on any other past actions or states
  – Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
  – The "solution" to an MDP (a minimal code sketch of these components follows)
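As a concrete illustration of the components listed above, here is a minimal tabular MDP sketch in Python. The class name SimpleMDP and its method names are illustrative assumptions, not from the course code; it simply stores the states, the actions A(s), the transition model P(s' | s, a), and the rewards R(s).

```python
class SimpleMDP:
    """Tabular MDP: states, actions A(s), transition model P(s'|s,a), rewards R(s)."""

    def __init__(self, states, actions, transitions, rewards, gamma=0.9):
        self.states = states            # iterable of states s
        self.actions = actions          # dict: s -> list of available actions A(s)
        self.transitions = transitions  # dict: (s, a) -> list of (s_next, probability)
        self.rewards = rewards          # dict: s -> R(s)
        self.gamma = gamma              # discount factor

    def A(self, s):
        """Actions available from state s (empty list for terminal states)."""
        return self.actions.get(s, [])

    def P(self, s, a):
        """Transition distribution over successor states for taking a in s."""
        return self.transitions.get((s, a), [])

    def R(self, s):
        """Reward received in state s."""
        return self.rewards.get(s, 0.0)
```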

Page 10

Overview

• First, we will look at how to “solve” MDPs, or find the optimal policy when the transition model and the reward function are known

• Second, we will consider reinforcement learning, where we don’t know the rules of the environment or the consequences of our actions

Page 11

Grid world

R(s) = -0.04 for every non-terminal state

Transition model:

The agent moves in the intended direction with probability 0.8, and slips perpendicular to the intended direction (left or right) with probability 0.1 each.

Source: P. Abbeel and D. Klein

Page 12

Goal: Policy

Source: P. Abbeel and D. Klein

Page 13

Grid world

R(s) = -0.04 for every non-terminal state

Transition model:

Page 14

Grid world

Optimal policy when R(s) = -0.04 for every non-terminal state

Page 15

Grid world

• Optimal policies for other values of R(s):

Page 16

Solving MDPs

• MDP components:
  – States s
  – Actions a
  – Transition model P(s' | s, a)
  – Reward function R(s)
• The solution:
  – Policy π(s): mapping from states to actions
  – How do we find the optimal policy?

Page 17

Finding the utilities of states

(Diagram: an expectimax-style backup with a max node for the state s, chance nodes for each action a, transition probabilities P(s' | s, a), and successor utilities U(s').)

• What is the expected utility of taking action a in state s?

  $\sum_{s'} P(s' \mid s, a)\, U(s')$

• How do we choose the optimal action?

  $\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$

• What is the recursive expression for U(s) in terms of the utilities of its successor states?

  $U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$
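In code, the two backups on this slide are a chance-node sum and a max-node choice. A minimal sketch, assuming the illustrative SimpleMDP interface from earlier and a utility table U:

```python
def expected_utility(mdp, s, a, U):
    """Chance-node backup: expected utility of taking action a in state s."""
    return sum(prob * U[s_next] for s_next, prob in mdp.P(s, a))


def best_action(mdp, s, U):
    """Max-node choice: the action in A(s) with the highest expected utility."""
    return max(mdp.A(s), key=lambda a: expected_utility(mdp, s, a, U))
```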

Page 18

The Bellman equation

• Recursive relationship between the utilities of successive states:

  $U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$

  Reading left to right: receive reward R(s), choose the optimal action a, end up in s' with probability P(s' | s, a), and get utility U(s') (discounted by γ).

Page 19

The Bellman equation

• Recursive relationship between the utilities of successive states:

  $U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$

• For N states, we get N equations in N unknowns
  – Solving them solves the MDP
  – The max operator makes the equations nonlinear, so instead of solving them algebraically we use iterative methods
  – Two methods: value iteration and policy iteration

Page 20

Method 1: Value iteration

• Start out with every U(s) = 0
• Iterate until convergence
  – During the i-th iteration, update the utility of each state according to this rule:

    $U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$

• In the limit of infinitely many iterations, guaranteed to find the correct utility values
  – In practice, we don't need an infinite number of iterations…
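As a concrete illustration, here is a short value iteration sketch in Python. It assumes the illustrative SimpleMDP interface from earlier (states, A(s), P(s, a), R(s), gamma); the stopping threshold and iteration cap are arbitrary choices, not something prescribed by the slides.

```python
def value_iteration(mdp, epsilon=1e-6, max_iters=1000):
    """Repeatedly apply the Bellman update until the utilities stop changing."""
    U = {s: 0.0 for s in mdp.states}                      # start with U(s) = 0 everywhere
    for _ in range(max_iters):
        U_next, delta = {}, 0.0
        for s in mdp.states:
            best = max((sum(p * U[s2] for s2, p in mdp.P(s, a)) for a in mdp.A(s)),
                       default=0.0)                       # terminal states have no actions
            U_next[s] = mdp.R(s) + mdp.gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon:                               # utilities have (nearly) converged
            break
    return U
```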

Page 21

Value iteration

• What effect does the update have?

  $U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$

Value iteration demo

Page 22

Values vs Policy

• Basic idea: approximations get refined towards optimal values

• Policy may converge long before values do

Page 23

Method 2: Policy iteration

• Start with some initial policy π_0 and alternate between the following steps:
  – Policy evaluation: calculate $U^{\pi_i}(s)$ for every state s
  – Policy improvement: calculate a new policy π_{i+1} based on the updated utilities:

    $\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')$

Page 24

Policy evaluation

• Given a fixed policy π, calculate $U^{\pi}(s)$ for every state s
• The Bellman equation for the optimal policy:

  $U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$

  – How does it need to change if our policy is fixed? The max over actions disappears; we simply follow the action π(s) prescribed by the policy:

    $U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$

  – Can solve a linear system to get all the utilities!
  – Alternatively, can apply the following update (a sketch of both policy iteration steps follows):

    $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U_i(s')$
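The two alternating steps can be written compactly. This is a sketch only, again assuming the illustrative SimpleMDP interface from earlier; it uses iterative (rather than exact linear-system) policy evaluation for simplicity, with an arbitrary number of sweeps.

```python
def policy_evaluation(mdp, policy, k=50):
    """Approximate U^pi by repeated application of the max-free Bellman update."""
    U = {s: 0.0 for s in mdp.states}
    for _ in range(k):
        U_next = {}
        for s in mdp.states:
            a = policy.get(s)
            expected = sum(p * U[s2] for s2, p in mdp.P(s, a)) if a is not None else 0.0
            U_next[s] = mdp.R(s) + mdp.gamma * expected
        U = U_next
    return U


def policy_iteration(mdp):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: (mdp.A(s)[0] if mdp.A(s) else None) for s in mdp.states}
    while True:
        U = policy_evaluation(mdp, policy)
        changed = False
        for s in mdp.states:
            if not mdp.A(s):
                continue  # terminal state: nothing to improve
            best = max(mdp.A(s), key=lambda a: sum(p * U[s2] for s2, p in mdp.P(s, a)))
            if best != policy[s]:
                policy[s], changed = best, True
        if not changed:
            return policy, U
```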

Page 25
Page 26
Page 27

Reinforcement learning (Chapter 21)

Page 28

Short intro to learning …much more to come later

Page 29

What is machine learning?

• Computer programs that can learn from data

• Two key components
  – Representation: how should we represent the data?
  – Generalization: the system should generalize from its past experience (observed data items) to perform well on unseen data items.

Page 30

Types of ML algorithms

• Unsupervised
  – Algorithms operate on unlabeled examples
• Supervised
  – Algorithms operate on labeled examples
• Semi/Partially-supervised
  – Algorithms combine both labeled and unlabeled examples

Page 31
Page 32

Types of ML algorithms

• Unsupervised
  – Algorithms operate on unlabeled examples
• Supervised
  – Algorithms operate on labeled examples
• Semi/Partially-supervised
  – Algorithms combine both labeled and unlabeled examples

Page 33

Slide from Dan Klein


Page 34

Example: Image classification

(Figure: example input images with their desired output labels: apple, pear, tomato, cow, dog, horse.)

Slide credit: Svetlana Lazebnik

Page 35

Slide from Dan Klein (http://yann.lecun.com/exdb/mnist/index.html)

Page 36

Reinforcement learning for flight

• Stanford autonomous helicopter

Page 37

Types of ML algorithms

• Unsupervised
  – Algorithms operate on unlabeled examples
• Supervised
  – Algorithms operate on labeled examples
• Semi/Partially-supervised
  – Algorithms combine both labeled and unlabeled examples

Page 39

However, for many problems, labeled data can be rare or expensive (need to pay someone to do it, requires special testing, …). Unlabeled data is much cheaper.

Slide Credit: Avrim Blum

Page 40

However, for many problems, labeled data can be rare or expensive (need to pay someone to do it, requires special testing, …). Unlabeled data is much cheaper.

Example domains: speech, images, medical outcomes, customer modeling, protein sequences, web pages.

Slide Credit: Avrim Blum

Page 41

However, for many problems, labeled data can be rare or expensive (need to pay someone to do it, requires special testing, …). Unlabeled data is much cheaper.

[From Jerry Zhu]

Slide Credit: Avrim Blum

Page 42

However, for many problems, labeled data can be rare or expensive (need to pay someone to do it, requires special testing, …). Unlabeled data is much cheaper.

Can we make use of cheap unlabeled data?

Slide Credit: Avrim Blum

Page 43

Semi-Supervised Learning

Can we use unlabeled data to augment a small labeled sample to improve learning?

But unlabeled data is missing the most important info!!
But maybe it still has useful regularities that we can use.
But… but… but…

Slide Credit: Avrim Blum

Page 44

Reinforcement Learning

Page 45

Reinforcement Learning

• Components (same as MDP):
  – States s, beginning with initial state s0
  – Actions a
    • Each state s has actions A(s) available from it
  – Transition model P(s' | s, a)
  – Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
  – The "solution"
• New twist: we don't know the transition model or the reward function ahead of time!
  – Have to actually try out actions and states to learn

Page 46
Page 47
Page 48

Reinforcement learning: Basic scheme

• In each time step:
  – Take some action
  – Observe the outcome of the action: successor state and reward
  – Update some internal representation of the environment and policy
  – If you reach a terminal state, just start over (each pass through the environment is called a trial)
• Why is this called reinforcement learning?

Page 49
Page 50

Passive reinforcement learning strategies

• Model-based
  – Learn the model of the MDP (transition probabilities and rewards) and evaluate the state utilities under the given policy
• Model-free
  – Learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
  – TD learning: use the observed transitions and rewards to adjust the utilities of states so that they agree with the Bellman equations

Page 51

Model-based reinforcement learning

• Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and evaluate the given policy
  – Keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies (see the counting sketch below)
  – Keep track of the rewards R(s)
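A counting-based estimator for this passive, model-based approach might look like the sketch below; the class and method names are illustrative assumptions. It tallies observed transitions, normalizes them into relative frequencies, and records each state's observed reward. Once the counts stabilize, the estimated model could be plugged into a policy-evaluation routine like the one sketched earlier.

```python
from collections import defaultdict


class ModelBasedEstimator:
    """Estimate P(s'|s,a) and R(s) from experience gathered under a fixed policy."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rewards = {}                                     # s -> observed R(s)

    def observe(self, s, a, s_next, r):
        """Record one experienced transition (s, a, s') and the reward seen in s."""
        self.counts[(s, a)][s_next] += 1
        self.rewards[s] = r

    def P(self, s, a):
        """Relative-frequency estimate of the transition distribution for (s, a)."""
        total = sum(self.counts[(s, a)].values())
        return [(s2, c / total) for s2, c in self.counts[(s, a)].items()] if total else []
```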

Page 52
Page 53

Model-free reinforcement learning

• Idea: learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
• Direct Utility Estimation
  – The utility of a state is the expected total reward from that state onward
  – Each trial provides a sample of this quantity for each state visited
  – Just keep a running average for each state in a table (see the sketch below)
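A possible implementation of direct utility estimation, assuming each completed trial is recorded as a list of (state, reward) pairs (this data format is an assumption, not something fixed by the slides):

```python
from collections import defaultdict


def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go for every state visited across trials."""
    totals, visits = defaultdict(float), defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        samples = []
        for s, r in reversed(trial):          # walk the trial backwards to accumulate returns
            reward_to_go = r + gamma * reward_to_go
            samples.append((s, reward_to_go))
        for s, g in samples:
            totals[s] += g                    # running sum of sampled returns per state
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}
```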

Page 54
Page 55

Model-free reinforcement learning

• Idea: learn state utilities without explicitly modeling the transition probabilities P(s' | s, a)
• Temporal Difference (TD) learning idea: update U(s) each time we experience (s, a, s', r)
  – Policy still fixed!
  – Use observed transitions to adjust the utilities of states so that they agree with the Bellman equations
  – Likely successors s' will contribute updates more often
• When a transition occurs from s to s', apply the update:

  $U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \big( R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s) \big)$

  where α is the learning rate; it should start at 1 and decay as O(1/t).
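In code, a single TD(0) backup is one line. A minimal sketch, assuming a tabular utility dictionary U and a per-state visit count used to decay the learning rate roughly as O(1/t); the grid-style states in the usage example are hypothetical.

```python
from collections import defaultdict


def td_update(U, counts, s, r, s_next, gamma=0.9):
    """One TD(0) backup for an observed transition (s, r, s_next) under a fixed policy."""
    counts[s] += 1
    alpha = 1.0 / counts[s]                       # learning rate decaying as O(1/t)
    u_s = U.get(s, 0.0)
    u_next = U.get(s_next, 0.0)
    U[s] = u_s + alpha * (r + gamma * u_next - u_s)


# Hypothetical usage: U and counts start empty and are updated as transitions arrive.
U, counts = {}, defaultdict(int)
td_update(U, counts, s=(1, 1), r=-0.04, s_next=(1, 2), gamma=1.0)
```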