A Survey of Reinforcement Learning
Literature: Kaelbling, Littman, and Moore;
Sutton and Barto; Russell and Norvig
Presenter: Prashant J. Doshi
CS594: Optimal Decision Making
Overview
Part I
Reinforcement Learning model
Exploitation vs Exploration
Learning Optimal Policies using Model-based Methods
Learning Optimal Policies using Model-free Methods
Computing Optimal Policies by Learning Models
Part II
Generalizations
Partially Observable Environments
Reinforcement Learning Applications
Part I: Roadmap
Reinforcement Learning (RL) Model
Definition; Key Concepts; Models of Optimal Behaviour; Learning Performance Metrics
Exploitation vs Exploration
Tradeoff involved; Formally Justified Exploration Techniques; Ad-hoc Exploration Techniques
Delayed Reward
Learning Policies: Given a Model; Learning Policies: Model-free Methods; Learning Policies: Model-based Methods
RL Model
Problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment.
A class of problems, as opposed to a set of techniques
Formal definition of an RL model:
Discrete set of environment states, S;
Discrete set of environment actions, A; and
Set of scalar reinforcement signals, {0,1}
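To make the model concrete, here is a minimal Python sketch of the interaction it implies; the `Environment` and `Agent` interfaces (reset, step, act, observe) are illustrative assumptions, not part of the survey.

```python
# Minimal sketch of the trial-and-error interaction loop implied by the RL model.
# `env` and `agent` are hypothetical objects exposing the methods used below.

def run_episode(env, agent, max_steps=100):
    """Run one episode of agent-environment interaction."""
    state = env.reset()                                   # initial state drawn from S
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # choose an action from A
        next_state, reward, done = env.step(action)       # scalar reinforcement signal
        agent.observe(state, action, reward, next_state)  # learning hook
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```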
Key Concepts
Policy: mapping from perceived states to the actions to be taken in those states; stimulus-response rules
Reward: short-term intrinsic desirability of a state. The RL agent's sole objective is to maximize the total reward it receives in the long run
Value: total amount of reward an agent can expect to accumulate over the future starting from that state; long-term desirability of the state
Backup: updating the value of a state using the values of future states
Sweep: applying a backup operation once to each state
Other AI Methods vs RL
Supervised Learning                               Reinforcement Learning
Presentation of input/output pairs                Agent is told the immediate reward and the next state;
                                                  the agent is not told which action is best in the long term
Learning occurs offline                           Online performance is important;
                                                  the system is evaluated while the agent is learning
No exploration of the environment                 Explicit exploration of the environment is required

Search                                            Reinforcement Learning
Entire state space need not be enumerated         Requires the entire state space to be
                                                  enumerated and stored in memory
Models of Optimal Behaviour
Finite Horizon Model
E(Σ_{t=0}^{h} r_t)
Non-stationary policy π: h-step optimal action, (h−1)-step optimal action, . . . , 1-step optimal action
Infinite Horizon Discounted Model
E(Σ_{t=0}^{∞} γ^t r_t), where 0 ≤ γ < 1
Stationary policy
Average Reward Model
lim_{h→∞} E((1/h) Σ_{t=0}^{h} r_t)
Gain optimal policy
Bias optimal policy
Measuring Learning Performance
Eventual convergence to optimality
Provable guarantee of asymptotic convergence to optimal behavior
e.g., value functions in an MDP
Speed of convergence to optimality
Speed of convergence to near-optimality
Level of performance after a given time
Regret
Difference between the expected total reward gained by following a learning algorithm and the expected total reward one could gain by playing for the maximum expected reward from the start.
Exploitation vs Exploration
A simple RL problem: the k-armed bandit problem
The agent is permitted h pulls → an h-step finite horizon
Immediate boolean payoff of 0 or 1, received with probability p_i for arm i
Exploit or explore?
Typically, premature sub-optimal decisions may affect the optimal strategy
Solutions:
Formally Justified Techniques
Ad-Hoc Techniques
Formally Justified Techniques
Dynamic Programming Approach
Belief state: {n_1, w_1, n_2, w_2, . . . , n_k, w_k}, where n_i is the number of pulls of arm i and w_i the number of payoffs received from it
Each p_i has an independent prior uniform distribution (Beta)
V*(n_1, w_1, n_2, w_2, . . . , n_k, w_k): expected future payoff when we act optimally
If Σ_i n_i = h, then
V*(n_1, w_1, n_2, w_2, . . . , n_k, w_k) = 0 . . . (Basis)
V*(n_1, w_1, n_2, w_2, . . . , n_k, w_k) = max_i E(future payoff of performing action i, then acting optimally for the remaining pulls)
= max_i ( ρ_i V*(n_1, w_1, . . . , n_i+1, w_i+1, . . . , n_k, w_k) + (1 − ρ_i) V*(n_1, w_1, . . . , n_i+1, w_i, . . . , n_k, w_k) )
ρ_i is the posterior probability of action i paying off given n_i, w_i and our prior probability distribution → Bayesian updating (see the code sketch below)
Formally Justified Techniques (Contd)
Gittins Allocation Indices
Utilizes the discounted expected reward model
Gives a table of allocation index values for different discount factors: I(n_i, w_i)
Index value: expected payoff of action i + the value of the information gained in selecting i
At each step, choosing the action with the largest index guarantees an optimal balance between exploration and exploitation
Indices are computed through an iterated dynamic-programming approach
Simple table lookup makes the method computationally efficient
Ad-Hoc Techniques
Greedy Strategy
Select the action with the highest estimated payoff
The truly optimal action may get starved
Optimism in the face of uncertainty
Assume optimistic prior beliefs; actions are not easily eliminated from consideration
Randomized Strategy
With probability ε take a random action; with probability 1 − ε take the greedy action; ε controls the amount of exploration
Boltzmann exploration:
P(a) = e^{ER(a)/T} / Σ_{a'} e^{ER(a')/T}
T (temperature) controls the amount of exploration
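A small sketch of both randomized strategies; `estimates` is a hypothetical list of estimated payoffs ER(a), one per action.

```python
import math
import random

def epsilon_greedy(estimates, epsilon):
    """With probability epsilon explore at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

def boltzmann(estimates, temperature):
    """Sample an action with probability proportional to exp(ER(a)/T)."""
    prefs = [math.exp(er / temperature) for er in estimates]
    total = sum(prefs)
    return random.choices(range(len(estimates)), weights=[p / total for p in prefs])[0]
```

As T grows the Boltzmann distribution approaches uniform (pure exploration); as T shrinks it approaches the greedy choice.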
Ad-Hoc Techniques (Contd)
Interval-based Estimation
Tabulate n_i and w_i for each action a_i
Compute the upper bound of a 100·(1 − α)% confidence interval on p_i
Select the action with the largest upper bound
Small α encourages greater exploration
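One way to realize this, sketched below with a normal approximation to the upper confidence bound; `counts` (n_i) and `wins` (w_i) are hypothetical per-action tallies, and the specific bound formula is an assumption rather than the survey's exact recipe.

```python
import math
from statistics import NormalDist

def interval_estimation_action(counts, wins, alpha=0.05):
    """Pick the action with the largest 100*(1 - alpha)% upper bound on p_i."""
    z = NormalDist().inv_cdf(1 - alpha)              # one-sided critical value
    bounds = []
    for n, w in zip(counts, wins):
        if n == 0:
            bounds.append(1.0)                       # untried arms look maximally promising
        else:
            p_hat = w / n
            bounds.append(p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n))
    return max(range(len(bounds)), key=lambda a: bounds[a])
```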
Delayed Reward
The agent may take a long sequence of actions receiving insignificant rewards and then finally arrive at a state with a high reward
Markov Decision Processes (MDP)
A set of states S,
A set of actions A,
A reward function R : S × A → ℝ
A state transition function T : S × A → Π(S), giving a probability distribution over next states
The model is Markov if the state transitions are independent of any previous states or actions
Learning Policies: Given a Model
Results
For the infinite-horizon discounted model, there exists an optimal deterministic stationary policy
Optimal value of a state:
V*(s) = max_π E(Σ_{t=0}^{∞} γ^t r_t)
Any policy that is greedy with respect to V* is an optimal policy
Methods to learn the optimal policy (value iteration is sketched below):
Value Iteration
Policy Iteration
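A compact value-iteration sketch for a known model (S, A, T, R); the dictionary layout `T[s][a] = {s_next: prob}` and `R[s][a]` is an assumption for illustration only.

```python
def value_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    """Compute V* by repeated full backups, then extract a greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # full backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
            q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()) for a in A]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # any policy greedy with respect to V* is optimal
    policy = {s: max(A, key=lambda a: R[s][a] +
                     gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
              for s in S}
    return V, policy
```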
Learning Policies: Model-free Methods
RL is primarily concerned with the task of learning policies when the model is not known in advance (T and R are not known)
The agent interacts with the environment directly to obtain information
Temporal credit assignment problem: when we get a large reward after a series of actions, how do we figure out which action had the most impact?
Temporal difference methods:
NewEstimate ← OldEstimate + LearningRate × [Target − OldEstimate]
Learning Policies: Model-free Methods
Monte-Carlo Method
Accumulate experience over an entire episode
An occurrence of state s in an episode is called a visit to s
The standard deviation of the error falls as 1/√n, where n is the number of returns averaged
Algorithm (sketched in code below):
π ← policy to be evaluated
V ← an arbitrary state-value function
Returns(s) ← an empty list, for all s ∈ S
Repeat forever:
Generate an episode using π
For each state s appearing in the episode:
R ← return following the first occurrence of s
Append R to Returns(s)
V(s) ← average(Returns(s))
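A first-visit Monte-Carlo evaluation sketch; `generate_episode(policy)` is a hypothetical function returning a list of (state, reward) pairs for one episode under the policy being evaluated.

```python
from collections import defaultdict

def mc_first_visit(policy, generate_episode, num_episodes=1000, gamma=1.0):
    returns = defaultdict(list)   # Returns(s)
    V = defaultdict(float)        # state-value estimates
    for _ in range(num_episodes):
        episode = generate_episode(policy)       # [(s_0, r_1), (s_1, r_2), ...]
        # return following each time step, computed by a backward pass
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue                         # first visits only
            seen.add(s)
            returns[s].append(G[t])
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```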
Learning Policies: Model-free Methods
Adaptive Heuristic Critic and TD(0)
An adaptive version of the policy iteration method
AHC (critic): using the current policy π, it computes the expected discounted value function for each state
RL component: computes π by maximizing over the value function
Experience tuple: ⟨s, a, r, s'⟩
Sutton's TD(0) algorithm:
V(s) := V(s) + α ( r + γ V(s') − V(s) ) — a sample backup
TD(λ)
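The TD(0) update above in a few lines of Python; `V` is assumed to be a dict of state values, `alpha` the learning rate, and `gamma` the discount factor.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Sample backup: move V(s) toward the one-step target r + gamma*V(s')."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
```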
Learning Policies: Model-free Methods
Q-Learning (Off-Policy TD(0))
Q*(s, a) = R(s, a) + γ Σ_{s'∈S} T(s, a, s') max_{a'} Q*(s', a')
Q-Learning rule (1-step); experience tuple: ⟨s, a, r, s'⟩
Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
Q values are guaranteed to converge provided each state-action pair is tried an infinite number of times
Exploration insensitive
The exploration strategy will not affect the convergence guarantee, but it may affect the speed of convergence of the Q values
Learning Policies: Model-free Methods
Q-Learning Algorithm (a Python sketch follows)
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
Initialize s
Repeat (for each step of the episode):
Choose a from s using a policy derived from Q (e.g., ε-greedy)
Take action a, observe r, s'
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
s ← s'
Until s is terminal
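A tabular Q-learning sketch of the algorithm above, assuming a hypothetical `env` with `reset()` and `step(a)` methods and a discrete action list `actions`.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                     # Q(s, a), initialized arbitrarily (to 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # one-step Q-learning backup
            target = r + (0.0 if done else gamma * max(Q[(s_next, act)] for act in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```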
Backup Comparisons
Learning Policies: Model-based Methods
Disadvantage of model-free methods:
They require a great deal of experience; they make inefficient use of the gathered data
Use the experience to learn the models?
Certainty Equivalent Methods
Algorithm:
Use the experience to statistically learn T and R
Use Value Iteration or Policy Iteration to learn the policy
Limitations:
Arbitrary division between the learning phase and the acting phase
Poor handling of non-stationary environments
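A sketch of the model-estimation half of the certainty-equivalence idea: estimate T̂ and R̂ from experience counts, then plan with value or policy iteration as if the estimates were exact. The `(s, a, r, s_next)` tuple format is an illustrative assumption.

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: iterable of (s, a, r, s_next) experience tuples."""
    counts = defaultdict(lambda: defaultdict(int))     # counts[(s,a)][s_next]
    reward_sums = defaultdict(float)
    totals = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1
    # maximum-likelihood estimates of the transition and reward functions
    T_hat = {(s, a): {s2: c / totals[(s, a)] for s2, c in nxt.items()}
             for (s, a), nxt in counts.items()}
    R_hat = {(s, a): reward_sums[(s, a)] / totals[(s, a)] for (s, a) in totals}
    return T_hat, R_hat
```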
Learning Policies: Model-based Methods
Sutton's Dyna
Interleaves model learning with acting
More computationally efficient than the certainty-equivalence method
Algorithm (sketched in code below):
Update the model T̂ and R̂
Update the action-value function at s using T̂ and R̂:
Q(s, a) := R̂(s, a) + γ Σ_{s'} T̂(s, a, s') max_{a'} Q(s', a')
Perform k additional updates at randomly chosen state-action pairs (s_k, a_k):
Q(s_k, a_k) := R̂(s_k, a_k) + γ Σ_{s'} T̂(s_k, a_k, s') max_{a'} Q(s', a')
Choose an action a' to perform in state s', which may be greedy with respect to Q
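A Dyna-style update sketch following the full-backup variant on this slide (not the sample-backup Dyna-Q of Sutton and Barto), assuming `T_hat`/`R_hat` are maintained as in `estimate_model` above and `Q` is a defaultdict over (state, action) pairs.

```python
import random

def dyna_backup(Q, T_hat, R_hat, s, a, actions, gamma=0.9):
    """Full backup of Q(s, a) using the learned model."""
    Q[(s, a)] = R_hat[(s, a)] + gamma * sum(
        p * max(Q[(s2, a2)] for a2 in actions)
        for s2, p in T_hat[(s, a)].items())

def dyna_step(Q, T_hat, R_hat, s, a, actions, k=5, gamma=0.9):
    dyna_backup(Q, T_hat, R_hat, s, a, actions, gamma)          # update at the real (s, a)
    # k additional updates at randomly chosen, previously visited state-action pairs
    for s_k, a_k in random.sample(list(T_hat.keys()), min(k, len(T_hat))):
        dyna_backup(Q, T_hat, R_hat, s_k, a_k, actions, gamma)
```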
Learning Policies: Model-based Methods
Dyna is relatively undirected
Prioritized Sweeping / Queue-Dyna
Updates are prioritized rather than random; we update V (not Q)
Algorithm (sketched in code below):
Select a high-priority state s from the queue
Remember the current value of the state: V_old = V(s)
Update V(s) := max_a ( R̂(s, a) + γ Σ_{s'} T̂(s, a, s') V(s') )
Reset the state's priority back to 0
Compute Δ = |V_old − V(s)|
Set the priorities of the predecessors of s to Δ · T̂(s̄, ā, s), where (s̄, ā) can lead to s
If this priority exceeds a threshold, insert the predecessor into the queue
The priority of backing up and updating a state depends on the size of the change and the current transition probabilities
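One prioritized-sweeping update, sketched under several assumptions: a learned model (`T_hat`, `R_hat`) in the same layout as above, a value table `V`, orderable (e.g., integer) states so they can live in a heap, and a hypothetical `predecessors(s)` map giving the (state, action) pairs that can reach s.

```python
import heapq

def prioritized_sweep(V, T_hat, R_hat, actions, queue, predecessors,
                      gamma=0.9, threshold=1e-4):
    _, s = heapq.heappop(queue)                       # highest-priority state (negated)
    v_old = V[s]
    # full backup of V(s) under the learned model
    V[s] = max(R_hat[(s, a)] + gamma * sum(p * V[s2]
               for s2, p in T_hat[(s, a)].items()) for a in actions)
    delta = abs(v_old - V[s])
    # push predecessors whose value is likely to have changed
    for s_prev, a_prev in predecessors(s):
        priority = delta * T_hat[(s_prev, a_prev)].get(s, 0.0)
        if priority > threshold:
            heapq.heappush(queue, (-priority, s_prev))  # max-queue via negation
```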
Learning Policies: Model-based Methods
Performance Comparison
Problem domain: a 3277-state grid world formulated as a shortest-path learning problem, which yields the same result as giving a reward of 1 at the goal, a reward of 0 elsewhere, and using a discount factor.

Algorithm               Steps       Backups
Q-Learning              531,000     531,000
Dyna                     62,000   3,055,000
Prioritized sweeping     28,000   1,010,000
End of Part I: Recap
RL provides us with an intuitive mechanism for learning policies
Three models of optimal behaviour and some measures of learning performance
k-armed bandit problem:
Formally justified techniques
Ad-hoc techniques
Learning policies for delayed reward:
Given a model: Value and Policy Iteration
Model-free: TD(0) and Q-Learning
Model-based: Dyna and Prioritized Sweeping
Q-Learning requires the most steps of experience before convergence; prioritized sweeping requires the fewest steps, and far fewer backups than Dyna
Part II: Roadmap
Generalization
Generalization over input
Generalization over actions
Hierarchical Methods
Partially Observable Environments
Reinforcement Learning Applications
Game Playing
Discussion
Generalization
Trivial problems
Enumeration of state and action spaces
Store state values as tables
Experience is handled inefficiently
Non-trivial problems
Large state and action spaces (possibly continuous)
Possible to aggregate over state and action spaces
Generalization: how can experience with a limited subset of the state space be usefully generalized to produce a good approximation over a much larger subset?
Generalization technique: function approximation
Gradient-descent methods such as backpropagation neural networks
Generalization Over Input
Approximate V^π: value prediction with function approximation
Method of approximation:
V_t ≈ V^π — a parameterized functional form with parameter vector θ_t = (θ_t(1), θ_t(2), . . . , θ_t(n))^T
e.g., V_t(s) = θ_t^T φ_s = Σ_i θ_t(i) φ_s(i)  (linear)
where φ_s = (φ_s(1), φ_s(2), . . . , φ_s(n))^T is the column vector of features that characterize state s
Task: learn θ using supervised learning on a training set of pairs s → v, where v = r + γ V(s') in the case of TD(0)
Supervised learning methods seek to minimize the mean squared error over some distribution P of the states:
MSE(θ_t) = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²
Generalization Over Input
Gradient Descent: adjust the parameter vector after each example by a small amount in the direction that would reduce the error on that example
Method:
θ_{t+1} = θ_t − ½ α ∇_θ [ V^π(s_t) − V_t(s_t) ]²
θ_{t+1} = θ_t + α [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t), where ∇_θ V_t(s) = φ_s for the linear model
But we do not know V^π(s); instead, from the t-th training example s_t → v_t:
θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)
Sample function approximators: the function computed by a neural network using the backpropagation algorithm, with θ as the connection weights
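A sketch of the linear-case update above, where `phi(s)` is a hypothetical feature function returning a NumPy vector and the TD(0) target v_t = r + γ·V(s') stands in for the unknown V^π(s).

```python
import numpy as np

def linear_td0_update(theta, phi, s, r, s_next, alpha=0.01, gamma=0.9):
    v_s = theta @ phi(s)                          # V_t(s) = theta^T phi_s
    target = r + gamma * (theta @ phi(s_next))    # v_t, the TD(0) training target
    # theta_{t+1} = theta_t + alpha (v_t - V_t(s)) grad V_t(s), with grad = phi_s
    return theta + alpha * (target - v_s) * phi(s)
```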
Generalization Over Actions
Approximate Q^π: action-value prediction using function approximation
Method of approximation:
Q_t ≈ Q^π — a parameterized functional form with parameter vector θ
Training examples are of the form (s_t, a_t) → v_t, where v_t = r + γ Q(s', a')
Gradient descent:
θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)
We can then combine action-value prediction with techniques for policy improvement and action selection
Hierarchical Methods
Large state spaces → a hierarchy of learning problems
Hierarchical learners as gated behaviors
A collection of behaviors and a gating function that selects a particular behavior based on the state of the environment
Feudal Q-Learner
High-level master and low-level slave modules
Master: receives reinforcement from the environment and sends commands to the slave; learns a mapping from states to commands
Slave: receives reinforcement from the master for taking actions that satisfy the commands; learns a mapping from (state, command) pairs to actions
Partially Observable Environments
Naive approach
Ignore partial observability and treat observations as if they were states of the environment
TD(0) and Q-Learning can be applied, but they lead to suboptimal behaviour
POMDP approach
State estimator: computes a new belief state b
Policy π: computed from a piecewise-linear and convex value function over the belief space
Reinforcement Learning Applications
Game Playing: Tesauro's TD-Gammon 0.0
Combination of TD(λ) and non-linear function approximation of V (approximately 10^20 states)
V → estimate of the probability of victory for the current player, modeled as a multi-layer neural network trained using the backpropagation algorithm (a gradient-descent method)
Training examples: obtained through constant self-play and a greedy exploration strategy
Achieved tournament-level performance
TD-Gammon 1.0 utilized inductive biases
Discussion
For complex problems, tabulation of values may not be enough. Inductive biases will give leverage to the learning process:
Shaping
Imitation
Problem decomposition
Reflexes
Future work: methods for approximating, decomposing, and incorporating bias into problems