A Survey of Reinforcement Learning

Literature: Kaelbling, Littman, and Moore;
Sutton and Barto; Russell and Norvig

Presenter: Prashant J. Doshi

CS594: Optimal Decision Making

A Survey of Reinforcement Learning – p.1/35

Overview

Part I

Reinforcement Learning model

Exploitation vs Exploration

Learning Optimal Policies using Model-based Methods

Learning Optimal Policies using Model-free Methods

Computing Optimal Policies by Learning Models

Part II

Generalizations

Partially Observable Environments

Reinforcement Learning Applications

A Survey of Reinforcement Learning – p.2/35

Part I: Roadmap

Reinforcement Learning (RL) Model
Definition
Key Concepts
Models of Optimal Behaviour
Learning Performance Metrics

Exploitation vs Exploration
Tradeoff involved
Formally Justified Exploration Techniques
Ad-hoc Exploration Techniques

Delayed Reward
Learning Policies: Given a Model
Learning Policies: Model-free Methods
Learning Policies: Model-based Methods

A Survey of Reinforcement Learning – p.3/35

RL Model

Problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment.

A class of problems, as opposed to a set of techniques

Formal definition of an RL model:

Discrete set of environment states, S;

Discrete set of environment actions, A; and

Set of scalar reinforcement signals; {0, 1}

A Survey of Reinforcement Learning – p.4/35

Key Concepts

Policy: Mapping from perceived states to actions to be taken when in those states - stimulus-response rules

Reward: Short term intrinsic desirability of the state. The RL agent's sole objective is to maximize the total reward it receives in the long run

Value: Total amount of reward an agent can expect to accumulate over the future, starting from that state. Long-term desirability of the state

Backup: Updating the value of a state using values of future states

Sweep: A sweep consists of applying a backup operation to each state

A Survey of Reinforcement Learning – p.5/35

Other AI Methods vs RL

Supervised Learning | Reinforcement Learning
Presentation of input/output pairs | Agent is told the immediate reward and the next state; the agent is not told which action is best in the long term
Learning occurs offline | Online performance is important; the system is evaluated while the agent is learning
No exploration of the environment | Explicit exploration of the environment is required

Search | Reinforcement Learning
Entire state space need not be enumerated | Requires the entire state space to be enumerated and stored in memory

A Survey of Reinforcement Learning – p.6/35

Models of Optimal Behaviour

Finite Horizon Model

$E\left(\sum_{t=0}^{h} r_t\right)$

Non-stationary policy $\pi$: h-step optimal action, (h-1)-step optimal action, . . . , 1-step optimal action

Infinite Horizon Discounted Model

$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, where the discount factor satisfies $0 \le \gamma < 1$

Stationary policy

Average Reward Model

$\lim_{h \to \infty} E\left(\frac{1}{h}\sum_{t=0}^{h} r_t\right)$

Gain optimal policy

Bias optimal model
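To make the three criteria concrete, below is a small Python sketch (an illustration added here, not from the survey) that scores a single reward sequence under each model; the reward values and gamma are arbitrary.

```python
# Sketch: scoring one reward stream under the three optimality models.
# The reward sequence and gamma are arbitrary illustration values.

def finite_horizon_return(rewards, h):
    """E[sum_{t=0}^{h} r_t], here for a single observed sequence."""
    return sum(rewards[: h + 1])

def discounted_return(rewards, gamma):
    """E[sum_t gamma^t r_t] with 0 <= gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """lim_{h->inf} E[(1/h) sum_t r_t], approximated by the sample mean."""
    return sum(rewards) / len(rewards)

rewards = [0, 0, 1, 0, 1, 1, 0, 1]
print(finite_horizon_return(rewards, h=3))    # 1
print(discounted_return(rewards, gamma=0.9))  # ~2.53
print(average_reward(rewards))                # 0.5
```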

A Survey of Reinforcement Learning – p.7/35

Measuring Learning Performance

Eventual Convergence to optimal
Provable guarantee of asymptotic convergence to optimal behavior

e.g., value functions in an MDP

Speed of convergence to optimality
Speed of convergence to near-optimality

Level of performance after a given time

Regret
Difference between the expected total reward gained by following a learning algorithm and the expected total reward one could gain by playing for the maximum expected reward from the start.

A Survey of Reinforcement Learning – p.8/35

Exploitation vs Exploration

A Simple RL problem:

k-armed bandit problem
Agent is permitted h pulls → an h-step finite horizon
Immediate boolean payoff of 0 or 1, arm $i$ paying off with probability $p_i$

Exploit or explore
Typically, premature sub-optimal decisions may affect the optimal strategy

Solutions
Formally Justified Techniques

Ad-Hoc Techniques

A Survey of Reinforcement Learning – p.9/35

Formally Justified Techniques

Dynamic Programming Approach
Belief state: $\{n_1, w_1, n_2, w_2, \ldots, n_k, w_k\}$, where $n_i$ is the number of pulls of arm $i$ and $w_i$ the number of payoffs received

Each $p_i$ has an independent prior uniform distribution (Beta)

$V^*(n_1, w_1, n_2, w_2, \ldots, n_k, w_k)$: expected future payoff when we act optimally

If $\sum_i n_i = h$, then $V^*(n_1, w_1, n_2, w_2, \ldots, n_k, w_k) = 0$ . . . (Basis)

$V^*(n_1, w_1, n_2, w_2, \ldots, n_k, w_k)$ = max over $i$ of E(future payoff of performing action $i$ then acting optimally for the remaining pulls)
$= \max_i \big( \rho_i\,(1 + V^*(n_1, w_1, \ldots, n_i{+}1, w_i{+}1, \ldots, n_k, w_k)) + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i{+}1, w_i, \ldots, n_k, w_k) \big)$

$\rho_i$ is the posterior probability of action $i$ paying off, given $n_i$, $w_i$ and our prior probability distribution → Bayesian updating

A Survey of Reinforcement Learning – p.10/35
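Below is a minimal sketch of this recursion for a small bandit. It assumes the uniform prior, under which the posterior payoff probability is $\rho_i = (w_i + 1)/(n_i + 2)$; the arm count K and horizon H are arbitrary illustration values, and the code is my own, not the survey's.

```python
from functools import lru_cache

K = 2   # number of arms (illustration)
H = 5   # total pulls permitted (illustration)

@lru_cache(maxsize=None)
def v_star(counts):
    """counts = ((n_1, w_1), ..., (n_K, w_K)); expected future payoff
    when acting optimally for the remaining H - sum(n_i) pulls."""
    pulls_used = sum(n for n, _ in counts)
    if pulls_used == H:                       # basis case: no pulls left
        return 0.0
    best = 0.0
    for i, (n, w) in enumerate(counts):
        rho = (w + 1) / (n + 2)               # Bayesian posterior under a uniform prior
        win  = counts[:i] + ((n + 1, w + 1),) + counts[i + 1:]
        lose = counts[:i] + ((n + 1, w),)     + counts[i + 1:]
        value = rho * (1 + v_star(win)) + (1 - rho) * v_star(lose)
        best = max(best, value)
    return best

print(v_star(((0, 0),) * K))   # expected total payoff of the optimal strategy
```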

Formally Justified Techniques (Contd)

Gittins Allocation Indices
Utilizes the discounted expected reward model

Gives a table of allocation index values for different discount factors: $I(n_i, w_i)$

Index value: expected payoff of action $i$ + value of the information gained in selecting $i$

At each step, choosing the action with the largest index guarantees an optimal balance between exploration and exploitation

Indices are computed through an iterated dynamic programming approach

Simple table lookup makes it computationally efficient

A Survey of Reinforcement Learning – p.11/35

Ad-Hoc Techniques

Greedy Strategy
Select the action with the highest estimated payoff

true optimal action may get starved

Optimism in the face of uncertainty

Assume optimistic prior beliefs; actions are not easily eliminated from consideration

Randomized Strategy

$\epsilon$: random action; $1 - \epsilon$: greedy action
$\epsilon$ controls the amount of exploration

Boltzmann exploration:

$P(a) = \frac{e^{ER(a)/T}}{\sum_{a'} e^{ER(a')/T}}$

T controls the amount of exploration
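The two randomized strategies can be sketched in a few lines of Python; `q` is a list of current payoff estimates, and `epsilon` / `temperature` are the exploration knobs named above (the concrete numbers are arbitrary illustrations).

```python
import math
import random

def epsilon_greedy(q, epsilon):
    """With probability epsilon take a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def boltzmann(q, temperature):
    """P(a) = exp(q[a]/T) / sum_b exp(q[b]/T); higher T means more exploration."""
    prefs = [math.exp(v / temperature) for v in q]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q)), weights=probs)[0]

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q, epsilon=0.1))
print(boltzmann(q, temperature=0.5))
```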

A Survey of Reinforcement Learning – p.12/35

Ad-Hoc Techniques (Contd)

Interval-based Estimation
Tabulate for each action $a_i$: $n_i$, $w_i$

Compute an upper bound of a $100 \cdot (1 - \alpha)\%$ confidence interval on $p_i$

Select an action with the largest upper bound

A small $\alpha$ encourages greater exploration
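A sketch of the idea follows; the slide does not commit to a particular bound, so the normal-approximation form of the upper confidence bound below is an assumption of this illustration.

```python
import math

def upper_bound(wins, pulls, z=1.96):
    """Normal-approximation upper bound of a confidence interval on p_i;
    z = 1.96 corresponds to roughly 95% confidence (smaller alpha -> larger z)."""
    if pulls == 0:
        return 1.0                      # untried actions look maximally good
    p_hat = wins / pulls
    return p_hat + z * math.sqrt(p_hat * (1.0 - p_hat) / pulls)

def interval_estimation_action(stats, z=1.96):
    """stats = [(wins_i, pulls_i), ...]; pick the action with the largest upper bound."""
    return max(range(len(stats)), key=lambda i: upper_bound(*stats[i], z=z))

print(interval_estimation_action([(3, 10), (1, 4), (0, 0)]))  # -> 2 (the untried arm)
```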

A Survey of Reinforcement Learning – p.13/35

Delayed Reward

Agent may take a long sequence of actions receiving insignificant rewards and then finally arrive at a state with a high reward

Markov Decision Processes (MDP)

A set of states $S$

A set of actions $A$

A reward function $R : S \times A \to \mathbb{R}$

A state transition function $T : S \times A \to \Pi(S)$, where $\Pi(S)$ is the set of probability distributions over $S$

The model is Markov if the state transitions are independent of any previous states or actions

A Survey of Reinforcement Learning – p.14/35

Learning Policies: Given a Model

Results
For the infinite horizon discounted model, there exists an optimal deterministic stationary policy

Optimal value of state:

$V^*(s) = \max_\pi E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$

Any policy that is greedy w.r.t. $V^*$ is an optimal policy

Methods to learn the optimal policy
Value Iteration

Policy Iteration
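For reference, a compact value-iteration sketch over an explicit tabular model; the two-state MDP at the bottom is only an illustration, and policy iteration would reuse the same model structure.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """T[s][a] is a list of (next_state, prob); R[s][a] is the expected reward.
    Repeated sweeps of the Bellman backup until the value function stabilizes."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Two-state illustration: action 'stay' keeps the state, 'go' switches it.
states, actions = ["s0", "s1"], ["stay", "go"]
T = {s: {"stay": [(s, 1.0)], "go": [({"s0": "s1", "s1": "s0"}[s], 1.0)]} for s in states}
R = {"s0": {"stay": 0.0, "go": 0.0}, "s1": {"stay": 1.0, "go": 0.0}}
print(value_iteration(states, actions, T, R))   # roughly {'s0': 9.0, 's1': 10.0}
```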

A Survey of Reinforcement Learning – p.15/35

Learning Policies: Model-free Methods

RL is primarily concerned with the task of learning policies when the model is not known in advance ($T$ and $R$ are not known)

Agent interacts with the environment directly to obtain information

Temporal credit assignment problem: when we get a large reward after a series of actions, how do we figure out which action had the most impact?

Temporal difference methods:
NewEstimate = OldEstimate + LearningRate × [Target − OldEstimate]

A Survey of Reinforcement Learning – p.16/35

Learning Policies: Model-free Methods

Monte-Carlo Method

Accumulate experience over an entire episode

Occurrence of state $s$ in an episode is called a visit to $s$

Standard deviation of the error falls as $1/\sqrt{n}$, where $n$ is the number of returns averaged

Algorithm:

$\pi$ ← policy to be evaluated
$V$ ← an arbitrary state-value function
Returns(s) ← an empty list, for all $s \in S$

Repeat forever:
    Generate an episode using $\pi$
    For each state $s$ appearing in the episode:
        $R$ ← return following the first occurrence of $s$
        Append $R$ to Returns(s)
        $V(s)$ ← average(Returns(s))
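A direct transcription of this loop into Python; `generate_episode(policy)` is an assumed helper that returns a list of (state, reward) pairs, since the slide does not specify how episodes are produced.

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, num_episodes=1000, gamma=1.0):
    """Evaluate `policy`: average the return following the first visit to each state.
    `generate_episode(policy)` must return [(state, reward), ...] (assumed helper)."""
    returns = defaultdict(list)              # Returns(s): list of observed returns
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G is the return that follows each step;
        # the last write for a state corresponds to its first visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_return[state] = G
        for state, G in first_visit_return.items():
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V
```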

A Survey of Reinforcement Learning – p.17/35

Learning Policies: Model-free Methods

Adaptive Heuristic Critic and TD(0)
Adaptive version of the policy iteration method

AHC (the critic): using $\pi$, it computes the expected discounted value function for each state

RL component (the actor): computes $\pi$ by maximizing over the value function

Experience tuple: $\langle s, a, r, s' \rangle$

Sutton's TD(0) algorithm:
$V(s) := V(s) + \alpha\,\big(r + \gamma V(s') - V(s)\big)$ — a sample backup

TD(λ)

A Survey of Reinforcement Learning – p.18/35
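The TD(0) rule amounts to a one-line update of a value table. A tiny sketch (illustration only; the step size and the driver values are arbitrary):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) := V(s) + alpha * (r + gamma * V(s') - V(s))  -- a sample backup."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, s="s0", r=1.0, s_next="s1")   # one experience tuple <s, a, r, s'>
print(V)                                     # {'s0': 0.1, 's1': 0.0}
```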

Learning Policies: Model-free Methods

Q-Learning (Off-Policy TD(0))

$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^*(s', a')$

Q-Learning Rule (1-step)
Experience tuple: $\langle s, a, r, s' \rangle$

$Q(s, a) := Q(s, a) + \alpha\,\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$

Q values are guaranteed to converge provided each action-value pair is tried out an infinite number of times

Exploration insensitive

Exploration strategy will not affect the guarantee but may affect the speed of convergence of the Q values

A Survey of Reinforcement Learning – p.19/35

Learning Policies: Model-free Methods

Q-Learning Algorithm

Initialize $Q(s, a)$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of the episode):
        Choose $a$ from $s$ using a policy derived from $Q$ (e.g., ε-greedy)
        Take action $a$, observe $r$, $s'$
        $Q(s, a) := Q(s, a) + \alpha\,\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$
        $s := s'$
    Until $s$ is terminal
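A runnable sketch of the same loop, assuming a Gym-style environment interface with `reset()`, `step(action)`, and a `num_actions` attribute; these interface details are assumptions, since the slide leaves the environment abstract.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)                       # Q[(state, action)], initialized to 0
    actions = list(range(env.num_actions))       # assumed attribute
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy action selection
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```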

A Survey of Reinforcement Learning – p.20/35

Backup Comparisons

A Survey of Reinforcement Learning – p.21/35

Learning Policies: Model-based Methods

Disadvantage of Model-free methods
Require a great deal of experience; they make inefficient use of the gathered data

Use the experience to learn the models?

Certainty Equivalent Methods

Algorithm:

Use the experience to statistically learn $T$ and $R$

Use Value Iteration or Policy Iteration to learn the policy

Limitations:

Arbitrary division between the learning phase and acting phase

Non-stationary environments
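The model-learning step can be sketched as simple maximum-likelihood counting; the resulting estimates would then be fed to value or policy iteration (e.g., the value-iteration sketch earlier). The tuple format of `experience` is an assumption of this illustration.

```python
from collections import defaultdict

def estimate_model(experience):
    """experience: list of (s, a, r, s') tuples gathered while acting.
    Returns maximum-likelihood estimates T_hat[(s, a)] = {s': prob} and R_hat[(s, a)]."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s']
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in experience:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    T_hat = {(s, a): {s2: n / visits[(s, a)] for s2, n in nexts.items()}
             for (s, a), nexts in counts.items()}
    R_hat = {(s, a): reward_sum[(s, a)] / visits[(s, a)] for (s, a) in visits}
    return T_hat, R_hat
```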

A Survey of Reinforcement Learning – p.22/35

Learning Policies: Model-based Methods

Sutton’s DynaInterleaves model learning with acting

Computationally efficient than the certainty equivalence method

Algorithm:

Update the model $\hat{T}$ and $\hat{R}$

Update the action-value function at $s$ using $\hat{T}$ and $\hat{R}$:
$Q(s, a) := \hat{R}(s, a) + \gamma \sum_{s' \in S} \hat{T}(s, a, s') \max_{a'} Q(s', a')$

Perform $k$ additional updates at randomly chosen state-action pairs $(s_k, a_k)$:
$Q(s_k, a_k) := \hat{R}(s_k, a_k) + \gamma \sum_{s' \in S} \hat{T}(s_k, a_k, s') \max_{a'} Q(s', a')$

Choose an action $a'$ to perform in state $s'$, which may be greedy on $Q$

A Survey of Reinforcement Learning – p.23/35
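A sketch of one Dyna step under the same tabular assumptions as the earlier Q-learning sketch. For brevity it stores a deterministic model (last observed reward and next state per state-action pair), whereas the backup on the slide uses the full estimated transition distribution.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, k=10):
    """One Dyna-style update: learn the (deterministic) model transition, back up
    the real experience, then k simulated backups from remembered transitions.
    Q is a dict with a default of 0 for unseen pairs (e.g., defaultdict(float))."""
    model[(s, a)] = (r, s_next)                       # update the learned model

    def backup(s0, a0, r0, s1):
        Q[(s0, a0)] += alpha * (r0 + gamma * max(Q[(s1, b)] for b in actions) - Q[(s0, a0)])

    backup(s, a, r, s_next)                           # backup at the real experience
    for _ in range(k):                                # k additional random backups
        (s0, a0), (r0, s1) = random.choice(list(model.items()))
        backup(s0, a0, r0, s1)
```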

Learning Policies: Model-based Methods

Dyna is relatively undirected

Prioritized Sweeping / Queue-Dyna
Updates are prioritized rather than random; we update $V$ (not $Q$)

Algorithm:

Select a high priority state from a queue

Remember the current value of the state: $V_{old} := V(s)$

Update
$V(s) := \max_a \big( \hat{R}(s, a) + \gamma \sum_{s' \in S} \hat{T}(s, a, s')\, V(s') \big)$

Reset the state’s priority back to 0

Compute $\Delta = |V_{old} - V(s)|$

Set the priority of each predecessor $\bar{s}$ of $s$ (a state with an action $\bar{a}$ that can lead to $s$) to $\Delta \cdot \hat{T}(\bar{s}, \bar{a}, s)$

If this priority exceeds a small threshold $\epsilon$, then insert the predecessor into the queue

Priority of backing up and updating depends on the size of the change and the current transition probabilities
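A sketch of the queue mechanics. It assumes a learned model `T_hat[(s, a)] = {s': prob}` and `R_hat[(s, a)]`, a value table `V` covering all states, and a `predecessors(s)` function returning the (state, action) pairs that can lead to s; none of these structures are spelled out on the slide.

```python
import heapq

def prioritized_sweeping(V, T_hat, R_hat, predecessors, actions,
                         queue, gamma=0.9, threshold=1e-3, num_updates=100):
    """Pop high-priority states, back them up, and push predecessors whose
    values are likely to have changed. `queue` holds (-priority, state) pairs."""
    for _ in range(num_updates):
        if not queue:
            break
        _, s = heapq.heappop(queue)                   # highest-priority state
        v_old = V[s]                                  # remember the current value
        V[s] = max(R_hat[(s, a)] + gamma * sum(p * V[s2]
                   for s2, p in T_hat[(s, a)].items()) for a in actions)
        delta = abs(v_old - V[s])                     # size of the change
        for s_prev, a_prev in predecessors(s):        # propagate backwards
            priority = delta * T_hat[(s_prev, a_prev)].get(s, 0.0)
            if priority > threshold:
                heapq.heappush(queue, (-priority, s_prev))
    return V
```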

A Survey of Reinforcement Learning – p.24/35

Learning Policies: Model-based Methods

Performance Comparison

Problem domain: a 3277-state grid world formulated as a shortest-path learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of 0 elsewhere, and a discount factor is used.

Algorithm | Steps | Backups
Q-Learning | 531,000 | 531,000
Dyna | 62,000 | 3,055,000
Prioritized sweeping | 28,000 | 1,010,000

A Survey of Reinforcement Learning – p.25/35

End of Part I: Recap

RL provides us with an intuitive mechanism for learning policies

3 models of optimal behaviour and some measures of learning performance

k-armed bandit problem:
Formally justified techniques

Ad-hoc techniques

Learning Policies for Delayed Reward
Given a model: Value and Policy Iteration

Model-free: TD(0) and Q-Learning

Model-based: Dyna and Prioritized Sweeping

Q-Learning requires the most steps before convergence; prioritized sweeping requires the fewest steps, and far fewer backups than Dyna, before convergence

A Survey of Reinforcement Learning – p.26/35

Part II: Roadmap

Generalization
Generalization over input

Generalization over actions

Hierarchical Methods

Partially Observable Environments

Reinforcement Learning Applications
Game Playing

Discussion

A Survey of Reinforcement Learning – p.27/35

Generalization

Trivial problems

Enumeration of state and action spaces

Store state values as tables

Experience is handled inefficiently

Non-trivial problems
Large state and action spaces (possibly continuous)

Possible to aggregate over state and action spaces

Generalization: How can experience with a limited subset of the state space be usefully generalized to produce a good approximation over a much larger subset?

Generalization technique: Function approximation
Gradient descent methods such as backpropagation NNs

A Survey of Reinforcement Learning – p.28/35

Generalization Over Input

Approximate $V^\pi$: value prediction with function approximation

Method of Approximation

$V \approx V_{\vec{\theta}}$ — a parameterized functional form with parameter vector $\vec{\theta} = (\theta(1), \theta(2), \ldots, \theta(n))^\top$

e.g. $V(s) = \vec{\theta}^{\,\top} \vec{\phi}_s = \sum_{i=1}^{n} \theta(i)\,\phi_s(i)$ (Linear)

where $\vec{\phi}_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^\top$ is the column vector of features that characterizes state $s$

Task: Learn $\vec{\theta}$ using supervised learning on a training set of $s \mapsto v$, where $v = r + \gamma V(s')$ in the case of TD(0)

Supervised learning methods seek to minimize the mean squared error over some distribution $P$ of the states:
$MSE(\vec{\theta}) = \sum_{s \in S} P(s)\,\big(V^\pi(s) - V(s)\big)^2$

A Survey of Reinforcement Learning – p.29/35

Generalization Over Input

Gradient Descent: Adjust the parameter after each example by a small amount in the direction that would reduce the error on that example

Method
$\vec{\theta} := \vec{\theta} - \frac{1}{2}\,\alpha\,\nabla_{\vec{\theta}}\big(V^\pi(s) - V(s)\big)^2$
$\vec{\theta} := \vec{\theta} + \alpha\,\big(V^\pi(s) - V(s)\big)\,\nabla_{\vec{\theta}} V(s)$, where $\nabla_{\vec{\theta}} V(s) = \vec{\phi}_s$ (linear model)

But we don't know $V^\pi(s)$; however, from the $t$-th training example $s \mapsto v_t$:
$\vec{\theta} := \vec{\theta} + \alpha\,\big(v_t - V(s)\big)\,\nabla_{\vec{\theta}} V(s)$

Sample Function Approximations
Function computed by an NN utilizing the backpropagation algorithm, with $\vec{\theta}$ as the connection weights

A Survey of Reinforcement Learning – p.30/35
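In the linear case the sketch below is essentially the whole method: the value estimate is $\vec{\theta}^{\,\top}\vec{\phi}_s$ and the gradient is just $\vec{\phi}_s$. The feature vectors and numbers are illustrative only.

```python
def linear_value(theta, phi):
    """V_theta(s) = sum_i theta[i] * phi(s)[i]  (linear function approximation)."""
    return sum(t * f for t, f in zip(theta, phi))

def gradient_td0_update(theta, phi_s, r, phi_s_next, alpha=0.01, gamma=0.9):
    """theta := theta + alpha * (target - V(s)) * grad_theta V(s),
    where the target is r + gamma * V(s') and grad V(s) = phi(s) in the linear case."""
    target = r + gamma * linear_value(theta, phi_s_next)
    error = target - linear_value(theta, phi_s)
    return [t + alpha * error * f for t, f in zip(theta, phi_s)]

theta = [0.0, 0.0]
theta = gradient_td0_update(theta, phi_s=[1.0, 0.0], r=1.0, phi_s_next=[0.0, 1.0])
print(theta)   # [0.01, 0.0]
```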

Generalization Over Actions

Approximate $Q^\pi$: action-value prediction using function approximation

Method of Approximation

$Q \approx Q_{\vec{\theta}}$ — a parameterized functional form with parameter vector $\vec{\theta}$

Training examples are of the form $(s, a) \mapsto v$, where $v = r + \gamma Q(s', a')$

Gradient Descent:
$\vec{\theta} := \vec{\theta} + \alpha\,\big(v - Q(s, a)\big)\,\nabla_{\vec{\theta}} Q(s, a)$

We can then combine action-value prediction with techniques for policy improvement and action selection

A Survey of Reinforcement Learning – p.31/35

Hierarchical Methods

Large state spaces → a hierarchy of learning problems

Hierarchical learners as gated behaviors
Collection of behaviors and a gating function that selects a particular behavior based on the state of the environment

Feudal Q-Learner
High-level master and low-level slave modules

Master:
Receives reinforcement from the environment and sends commands to the slave
Learns a mapping from states to commands

Slave:
Receives reinforcement from the master for taking actions that satisfy the commands
Learns a mapping from (state, command) pairs to actions

A Survey of Reinforcement Learning – p.32/35

Partially Observable Environments

Naive Approach

Ignore partial observability and treat observations as if they were states of the environment

TD(0) and Q-Learning can be applied, but they lead to suboptimal behaviour

POMDP Approach
State Estimator: computes a new belief state $b'$ from the previous belief state, the action, and the observation

Policy $\pi$: computed from a piecewise-linear and convex function over the belief space

A Survey of Reinforcement Learning – p.33/35
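The state estimator is the standard Bayesian belief update, $b'(s') \propto O(o \mid s', a)\,\sum_s T(s, a, s')\,b(s)$. A sketch follows; the observation model `O` is an assumption of this illustration, since the slide only names the state estimator.

```python
def update_belief(b, a, o, T, O, states):
    """b: {s: prob}; T[(s, a)] = {s': prob}; O[(s', a)] = {o: prob}.
    Returns the new belief b'(s') proportional to O(o | s', a) * sum_s T(s, a, s') * b(s)."""
    b_new = {}
    for s_next in states:
        pred = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in states)
        b_new[s_next] = O[(s_next, a)].get(o, 0.0) * pred
    total = sum(b_new.values())
    return {s: p / total for s, p in b_new.items()} if total > 0 else b_new
```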

Reinforcement Learning Applications

Game Playing: Tesauro's TD-Gammon 0.0

Combination of TD(λ) and a non-linear function ($V$) approximation ($\approx 10^{20}$ states)

$V$ → estimate of the probability of victory for the current player, modeled as a multi-layer neural network trained using the backpropagation algorithm (uses a gradient-descent method)

Training examples: obtained through constant self-play and a greedy exploration strategy

Achieved tournament-level performance

TD-Gammon 1.0 utilized inductive biases

A Survey of Reinforcement Learning – p.34/35

Discussion

For complex problems, tabulation of values may not be enough. Inductive biases will give leverage to the learning process

Shaping

Imitation

Problem decomposition

Reflexes

Future work: Methods for approximating, decomposing and incorporating bias into problems

A Survey of Reinforcement Learning – p.35/35