Reinforcement Learning: How far can it Go?
Rich Sutton, University of Massachusetts and AT&T Research
With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr


Page 1: Reinforcement Learning: How far can it Go?

Reinforcement Learning: How far can it Go?

Rich Sutton, University of Massachusetts

AT&T Research

With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr

Page 2: Reinforcement Learning: How far can it Go?

Reinforcement Learning

An active, popular, successful approach to AI
- 15-50 years old
- Emphasizes learning from interaction
- Does not assume complete knowledge of the world
- World-class applications
- Strong theoretical foundations
- Parallels in other fields: operations research, control theory, psychology, neuroscience
- Seeks simple general principles

How Far Can It Go?

Page 3: Reinforcement Learning: How far can it Go?

World-Class Applications of RL

TD-Gammon and Jellyfish (Tesauro; Dahl): world's best backgammon player

Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller

Job-Shop Scheduling (Zhang & Dietterich): world's best scheduler of space-shuttle payload processing

Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): world's best assigner of radio channels to mobile telephone calls

Page 4: Reinforcement Learning: How far can it Go?

Outline

RL Past: Trial-and-Error Learning

RL Present: Learning and Planning Values

RL Future: Constructivism

(Timeline graphic: 1950, 1985, 2000)

Page 5: Reinforcement Learning: How far can it Go?

RL began with dissatisfaction with previous learning problems, such as:
- Learning from examples
- Unsupervised learning
- Function optimization

None seemed to be purposive. Where is the learning of how to get something? Where is the learning by trial and error?

Need rewards and penalties, interaction with the world!

Page 6: Reinforcement Learning: How far can it Go?

Rooms Example

Early learning methods could not learn how to get reward

Page 7: Reinforcement Learning: How far can it Go?

The Reward Hypothesis

That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment.

Is this reasonable? Is it demeaning? Is there no other choice?

It seems to be adequate.
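As a concrete reading of "cumulative sum of a scalar reward signal" (a standard formulation, not copied from the slides; the discount rate γ in [0, 1] is an assumption here), the agent maximizes the expected return

```latex
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} .
```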

Page 8: Reinforcement Learning: How far can it Go?

RL Past: Trial-and-Error Learning
- Learned only a policy (a mapping from states to actions)
- Maximized only short-term reward (e.g., learning automata), or delayed reward via simple action traces
- Assumed good/bad rewards immediately distinguishable, e.g., positive is good, negative is bad: an implicitly known reinforcement baseline

Next steps were to learn baselines and internal rewards. Taking these next steps quickly led to modern value functions and temporal-difference learning.

Page 9: Reinforcement Learning: How far can it Go?

A Policy

Movement is in the wrong direction 1/3 of the time

Page 10: Reinforcement Learning: How far can it Go?

Problems with Value-less RL Methods

Page 11: Reinforcement Learning: How far can it Go?

Outline

RL Past: Trial-and-Error Learning

RL Present: Learning and Planning Values

RL Future: Constructivism

(Timeline graphic: 1950, 1985, 2000)

Page 12: Reinforcement Learning: How far can it Go?

The Value-Function Hypothesis

Value functions = measures of expected reward following states:

V: States → Expected future reward

or following state-action pairs:

Q: States × Actions → Expected future reward

All efficient methods for optimal sequential decision making estimate value functions.

The hypothesis: that the dominant purpose of intelligence is to approximate these value functions.
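In the usual notation (a standard formulation consistent with the slide; the expectation is over trajectories generated by policy π and γ is the discount rate, symbols not on the slide itself):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right].
```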

Page 13: Reinforcement Learning: How far can it Go?

State-Value Function

Page 14: Reinforcement Learning: How far can it Go?

RL Present: Learning and Planning Values
- Accepts the reward and value hypotheses
- Many real-world applications, some impressive
- Theory strong and active, yet still with more questions than answers
- Strong links to operations research
- A part of modern AI's interest in uncertainty: MDPs, POMDPs, Bayes nets, connectionism
- Includes deliberative planning

Page 15: Reinforcement Learning: How far can it Go?

New Applications of RL

CMUnited RoboCup Soccer Team (Stone & Veloso): world's best player of RoboCup simulated soccer, 1998

KnightCap and TDLeaf (Baxter, Tridgell & Weaver): improved chess play from intermediate to master in 300 games

Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods

Walking Robot (Benbrahim & Franklin): learned critical parameters for bipedal walking

Real-world applications using on-line learning (backprop)


Page 16: Reinforcement Learning: How far can it Go?

RL Present, Part II: The Space of Methods

(Diagram: the space of methods spanned by two dimensions of backups, full vs. sample backups and shallow (bootstrapping) vs. deep backups; the corners are dynamic programming, temporal-difference learning, exhaustive search, and Monte Carlo.)

Also: function approximation, explore/exploit, planning/learning, action/state values, actor-critic, ...

Page 17: Reinforcement Learning: How far can it Go?

The TD Hypothesis

The hypothesis: that all value learning is driven by TD errors.
- Even “Monte Carlo” methods can benefit: TD methods enable them to be done incrementally
- Even planning can benefit: trajectory following improves function approximation and state sampling; sample backups reduce the effect of the branching factor
- Psychological support: TD models of reinforcement, classical conditioning
- Physiological support: reward neurons show TD behavior (Schultz et al.)
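For reference, the tabular TD(0) update behind this hypothesis, in standard notation (the step size α and discount rate γ are assumptions here, not slide content):

```latex
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t .
```

The TD error δ_t is the quantity the hypothesis says drives all value learning.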

Page 18: Reinforcement Learning: How far can it Go?

Planning

Modern RL includes planning, as in planning for MDPs: a form of state-space planning, still controversial for some.

Planning and learning are nearly identical in RL: the same algorithms run on real or imagined experience, with the same value functions, backups, and function approximation.

(Diagram: value/policy, experience, and model connected by acting, direct RL, model learning, and planning; interaction with the world yields real experience, the model yields imagined interaction, and both feed the same RL algorithm and value/policy.)
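A minimal sketch of the "same algorithm on real or imagined experience" idea, in the spirit of Dyna-Q (hypothetical code, not from the talk; the class, parameters, and the (s, a, r, s') interface are assumptions):

```python
import random
from collections import defaultdict

class DynaQSketch:
    """One Q-learning backup applied to both real and imagined (model-generated) experience."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
        self.Q = defaultdict(float)      # (state, action) -> estimated value
        self.model = {}                  # (state, action) -> (reward, next_state), learned model
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps = planning_steps

    def act(self, s):
        # Epsilon-greedy action selection from the current value estimates.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def backup(self, s, a, r, s2):
        # The same value backup, whether (s, a, r, s2) is real or imagined.
        target = r + self.gamma * max(self.Q[(s2, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

    def observe(self, s, a, r, s2):
        self.backup(s, a, r, s2)                  # direct RL from real experience
        self.model[(s, a)] = (r, s2)              # model learning
        for _ in range(self.planning_steps):      # planning from imagined experience
            (ps, pa), (pr, ps2) = random.choice(list(self.model.items()))
            self.backup(ps, pa, pr, ps2)
```

The only difference between learning and planning here is where the transitions come from; the backup and the value function are shared.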

Page 19: Reinforcement Learning: How far can it Go?

Planning with Imagined Experience

Real experience

Imagined experience

Page 20: Reinforcement Learning: How far can it Go?

Outline

RL Past: Trial-and-Error Learning

RL Present: Learning and Planning Values

RL Future: Constructivism

(Timeline graphic: 1950, 1985, 2000)

Page 21: Reinforcement Learning: How far can it Go?

Constructivism

The active construction of representations and models of the world to facilitate the learning and planning of values

(Diagram: policy and value functions built on top of representations and models, with great flexibility at that lower level.)

Piaget, Drescher

Page 22: Reinforcement Learning: How far can it Go?

Constructivist Prophecy

Whereas RL present is about solving an MDP, RL future will be about representing the states, actions, transitions, rewards, and features to construct an MDP. Constructing the world to be the way we want it: Markov, linear, small, reliable, independent, shallow, deterministic, additive, low branching.

The RL agent as active world modeler.

Page 23: Reinforcement Learning: How far can it Go?

Representing State, Part I: Features and Function Approximation
- Linear-in-the-features methods are state of the art (as are memory-based methods)
- Two-stage architecture (sketched in code below):
  1. Compute feature values: a nonlinear, expansive, fixed or slowly changing mapping
  2. Map the feature values linearly to the result: a linear, convergent, fast-changing mapping
- Works great if features are appropriate: fast, reliable, local learning; good generalization
- Feature construction best done by hand ... or by methods yet to be found

State → Features → Values (constructive induction)
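A minimal sketch of the two-stage architecture (hypothetical code; the radial-basis feature map and all parameter names are assumptions, standing in for any fixed nonlinear expansion):

```python
import numpy as np

def rbf_features(state, centers, width=0.5):
    """Stage 1: fixed, nonlinear, expansive mapping from a state vector to feature values."""
    state = np.asarray(state, dtype=float)
    return np.exp(-np.sum((centers - state) ** 2, axis=1) / (2.0 * width ** 2))

class LinearValue:
    """Stage 2: linear, fast-changing mapping from feature values to a predicted value."""

    def __init__(self, num_features, alpha=0.05, gamma=0.95):
        self.w = np.zeros(num_features)
        self.alpha, self.gamma = alpha, gamma

    def value(self, phi):
        return float(self.w @ phi)

    def td_update(self, phi, reward, phi_next):
        # Semi-gradient TD(0): only the linear weights change; the feature map stays fixed.
        delta = reward + self.gamma * self.value(phi_next) - self.value(phi)
        self.w += self.alpha * delta * phi

# Usage sketch: centers define the fixed feature map, e.g. a grid over a 2-D state space.
centers = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
vf = LinearValue(num_features=len(centers))
phi, phi_next = rbf_features([1.2, 3.4], centers), rbf_features([1.3, 3.3], centers)
vf.td_update(phi, reward=0.0, phi_next=phi_next)
```

If the features are appropriate, the linear stage learns quickly and locally; constructing the feature map itself is the open problem the slide points at.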

Page 24: Reinforcement Learning: How far can it Go?

Good Features vs. Bad Features

Good features correspond to regions of similar value; bad features are unrelated to values.

Page 25: Reinforcement Learning: How far can it Go?

Representing State, Part II: Partial Observability

When immediate observations do not uniquely identify the current state: non-Markov problems.
- Not as big a deal as widely thought
- A greater problem for theory than for practice
- Need not use POMDP ideas
- Can treat as a function approximation issue: making do with imperfect observations/features, finding the right memories to add as new features
- The key is to construct state representations that make the world more Markov (McCallum's thesis)

Page 26: Reinforcement Learning: How far can it Go?

Representations of Action

Nominally, actions in RL are low-level: the lowest level at which behavior can vary. But people work mostly with courses of action:
- We decide among these
- We make predictions at this level
- We plan at this level

Remarkably, all this can be incorporated in RL:
- Course of action = policy + termination condition (sketched in code below)
- Almost all RL ideas, algorithms, and theory extend
- Wherever actions are used, courses of action can be substituted

Parr, Bradtke & Duff, Precup, Singh, Dietterich, Kaelbling, Huber & Grupen, Szepesvari, Dayan, Ryan & Pendrith, Hauskrecht, Lin...
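A minimal sketch of "course of action = policy + termination condition" as a data structure (hypothetical names; the options framework also adds an initiation set, omitted here):

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class CourseOfAction:
    """A course of action: an internal policy plus a termination condition."""
    policy: Callable[[Any], Any]       # state -> primitive action
    terminate: Callable[[Any], bool]   # state -> True when the course should end

def follow(course: CourseOfAction, state, step_fn, gamma=0.95, max_steps=1000) -> Tuple[float, Any]:
    """Execute a course until it terminates; return its discounted reward and final state.

    step_fn(state, action) -> (next_state, reward) is an assumed environment interface.
    """
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        if course.terminate(state):
            break
        state, reward = step_fn(state, course.policy(state))
        total += discount * reward
        discount *= gamma
    return total, state
```

Wherever a primitive action could be executed, a course like this can be followed instead, which is what lets the usual RL machinery carry over.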

Page 27: Reinforcement Learning: How far can it Go?

Room-to-Room Courses of Action

A course of action for each hallway from each room (2 of 8 shown)

Page 28: Reinforcement Learning: How far can it Go?

Representing Transitions

Models can also be learned for courses of action:
- What state will we be in at termination?
- How much reward will we receive along the way?

The mathematical form of the models follows from the theory of semi-Markov decision processes, and permits planning at a higher level.
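One common mathematical form of such models, following the options/SMDP literature (the notation below is an assumption, not copied from the slides): for a course of action o initiated in state s at time t and terminating after k steps in state s',

```latex
r^{o}_{s} = \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \;\middle|\; o \text{ initiated in } s \text{ at } t \right],
\qquad
p^{o}_{ss'} = \sum_{k=1}^{\infty} \gamma^{k} \, \Pr\!\left[ o \text{ terminates in } s' \text{ after } k \text{ steps} \right].
```

These play the role of expected reward and (discounted) transition probability in higher-level Bellman equations.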

Page 29: Reinforcement Learning: How far can it Go?

Planning (Value Iteration) with Courses of Action

(Figure: value iteration shown at iterations #0, #1, #2, once with cell-to-cell primitive actions and once with room-to-room courses of action; V(goal) = 1.)

Page 30: Reinforcement Learning: How far can it Go?

Reconnaissance Example
- Mission: fly over (observe) the most valuable sites and return to base
- Stochastic weather affects observability (cloudy or clear) of sites
- Limited fuel
- Intractable with classical optimal control methods
- Actions: primitives (which direction to fly) and courses (which site to head for)
- Courses compress space and time: reduce steps from ~600 to ~6, reduce states from ~10^11 to ~10^6
- Enable finding of best solutions

(Map figure: sites with rewards 10, 50, 50, 50, 100, 25, 15, 5, 25, 8; the base; 100 decision steps; mean time between weather changes.)

B. Ravindran, UMass

Page 31: Reinforcement Learning: How far can it Go?

Courses of action permit enormous flexibility

Page 32: Reinforcement Learning: How far can it Go?

Subgoals

Courses of action are often goal-oriented, e.g., drive-to-work, open-the-door.
- A course can be learned to achieve its goal
- Many can be learned at once, independently
- Solves classic problem of subgoal credit assignment
- Solves psychological puzzle of goal-oriented action
- Goal-oriented courses of action create a better MDP: fewer states, smaller branching factor; compartmentalizes dependencies
- Their models are also goal-oriented recognizers...

Page 33: Reinforcement Learning: How far can it Go?

Perception

Real perception, like real action, is temporally extended. Features are ability-oriented rather than sensor-oriented: What is a chair? Something that can be sat upon.

Consider a goal-oriented course of action, like dock-with-charger. Its model gives the probability of successfully docking as a function of state, i.e., a feature (detector) for states that afford docking. Such features can be learned without supervision.

(Figure: charger and its dockable region)

Page 34: Reinforcement Learning: How far can it Go?

This is RL with a totally different feel. Still one primary policy and set of values, but many other policies, values, and models are learned not directly in service of reward. The dominant purpose is discovery, not reward:
- What possibilities does this world afford?
- How can I control and predict it in a variety of ways?

In other words, constructing representations to make the world: Markov, linear, small, reliable, independent, shallow, deterministic, additive, low branching.

Page 35: Reinforcement Learning: How far can it Go?

Imagine

An agent driven primarily by biased curiosity, to discover how it can predict and control its interaction with the world:
- What courses of action have predictable effects?
- What salient observables can be controlled?
- What models are most useful in planning?

A human coach presenting a series of problems/tasks and courses of action: highlighting key states, providing subpolicies, termination conditions…

Page 36: Reinforcement Learning: How far can it Go?

What is New?

Constructivism itself is not new.

But actually doing it would be!

Does RL really change it, make it easier? That is, do values and policies help?

Yes! Because so much constructed knowledge is well represented as values and policies in service of approximating values and policies

RL’s goal-orientation is also critical to modeling goal-oriented action and perception

Page 37: Reinforcement Learning: How far can it Go?

Take Home Messages

RL Past: Let's revisit, but not repeat, past work.

RL Present: Do you accept that value functions are critical? And that TD methods are the way to find them?

RL Future: It's time to address representation construction.
- Explore/understand the world rather than control it
- RL/values provide new structure for this
- May explain goal-oriented action and perception

Page 38: Reinforcement Learning: How far can it Go?

How far can RL go?

A simple and general formulation of AI

Yet there is enough structure to make progress

While this is true, we should complicate it no further, but seek general principles of AI

They may take us all the way to human-level intelligence