Reinforcement Learning: Overview
Cheng-Zhong Xu, Wayne State University
Introduction
In RL, the learner is a decision-making agent that takes actions in an environment and receives a reward (or penalty) for each action. An action may change the environment's state. After a set of trial-and-error runs, the agent should learn the best policy: the sequence of actions that maximizes the total reward.
Supervised learning: learning from examples provided by a teacher.
RL: learning with a critic (reward or penalty); goal-directed learning from interaction.
Examples:
Game playing: a sequence of moves to win a game
Robot in a maze: a sequence of actions to find a goal
C. Xu, 2008
Example: K-Armed Bandit
Given $10 to play on a slot machine with 5 levers. Each play costs $1; each pull of a lever may produce a payoff of $0, $1, $5, or $10. Find the optimal policy that pays off the most.
Tradeoff between exploitation and exploration:
Exploitation: continue to pull the lever that has returned a positive payoff
Exploration: try pulling a new lever
Deterministic model: the payoff of each lever is fixed, but unknown in advance
Stochastic model: the payoff of each lever is uncertain, with known or unknown probability
K-Armed Bandit in General
In the deterministic case:
Q(a): value of action a
Reward of action a is ra
Q(a) = ra
Choose a* if Q(a*) = maxa Q(a)
In the stochastic model:
Reward is nondeterministic: p(r|a)
Qt(a): estimate of the value of action a at time t
Delta rule:
Qt+1(a) = Qt(a) + η [rt(a) − Qt(a)]
where η is the learning factor. Qt+1(a) is an expected value and should converge to the mean of p(r|a) as t increases.
C. Xu, 2008
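A minimal sketch of the delta rule combined with ε-greedy exploration. The 5 levers' mean payoffs, the noise model, and the values of η and ε are all assumptions made up for illustration:

```python
import random

# Hypothetical 5-armed bandit: these true mean payoffs are assumed
# for illustration only.
TRUE_MEANS = [0.0, 1.0, 5.0, 10.0, 2.0]

def pull(arm):
    """Stochastic payoff: true mean plus Gaussian noise."""
    return TRUE_MEANS[arm] + random.gauss(0, 1)

def run_bandit(n_plays=1000, eta=0.1, epsilon=0.1, seed=0):
    random.seed(seed)
    q = [0.0] * len(TRUE_MEANS)   # Q_t(a): value estimates, initially 0
    for _ in range(n_plays):
        # epsilon-greedy: explore with prob epsilon, else exploit
        if random.random() < epsilon:
            a = random.randrange(len(q))
        else:
            a = max(range(len(q)), key=lambda i: q[i])
        r = pull(a)
        # Delta rule: Q_{t+1}(a) = Q_t(a) + eta * (r_t(a) - Q_t(a))
        q[a] += eta * (r - q[a])
    return q

estimates = run_bandit()
```

With enough plays, the estimate for the best lever converges toward its true mean payoff, and the greedy choice settles on it.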
K-Armed Bandit as Simplified RL
Single state (single slot machine) vs. multiple states:
p(r|si, aj): different reward probabilities in each state
Q(si, aj): value of action aj in state si, to be learnt
An action causes a state change, in addition to a reward
Rewards are not necessarily immediate: delayed rewards
[Maze diagram: states Start, S2, S3, S4, S5, S7, S8, with a Goal state reached from S5]
Elements of RL
st: state of the agent at time t
at: action taken at time t
In st, action at is taken, the clock ticks, reward rt+1 is received, and the state changes to st+1
Next-state probability: P(st+1 | st, at) — a Markov system
Reward probability: p(rt+1 | st, at)
Initial state(s), goal state(s)
Episode (trial): a sequence of actions from an initial state to the goal
Policy and Cumulative Reward
Policy: π: S → A, with at = π(st)
State value of a policy π: Vπ(st)
Finite horizon:
Vπ(st) = E[rt+1 + rt+2 + … + rt+T] = E[Σ (i = 1..T) rt+i]
Infinite horizon (discounted):
Vπ(st) = E[rt+1 + γ rt+2 + γ² rt+3 + …] = E[Σ (i = 1..∞) γ^(i−1) rt+i]
where 0 ≤ γ < 1 is the discount rate.
Bellman's Equation
V*(st) = max(at) E[rt+1 + γ V*(st+1)]
       = max(at) ( E[rt+1] + γ Σ (st+1) P(st+1 | st, at) V*(st+1) )
Value of at in st:
Q*(st, at) = E[rt+1] + γ Σ (st+1) P(st+1 | st, at) max(at+1) Q*(st+1, at+1)
V*(st) = max(at) Q*(st, at)
State Value Function Example
GridWorld: a simple MDP
Grid cell ~ environment state
Four possible actions at each cell: n/s/e/w, moving one cell in the respective direction
The agent remains in its location if a move would take it off the grid, but receives a reward of −1
Every other move receives a reward of 0, except moves out of states A and B: a reward of 10 for each move out of A (to A') and 5 for each move out of B (to B')
Policy: the agent selects the four actions with equal probability; assume γ = 0.9
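The equiprobable policy on this GridWorld can be evaluated iteratively. A minimal sketch, assuming the classic 5x5 layout with A at (0, 1), A' at (4, 1), B at (0, 3), and B' at (2, 3) — the slides fix only the rewards and γ = 0.9, so the grid size and positions are assumptions:

```python
# Iterative evaluation of the equiprobable-random policy on GridWorld.
# Grid size and A/A'/B/B' positions are assumed (classic layout); the
# rewards (+10, +5, -1 off-grid, 0 otherwise) and gamma follow the text.
N, GAMMA = 5, 0.9
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # n, s, w, e

def step(state, move):
    """Return (next_state, reward) for one deterministic move."""
    if state == A:
        return A_PRIME, 10.0          # every move out of A goes to A'
    if state == B:
        return B_PRIME, 5.0           # every move out of B goes to B'
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0                # move would leave the grid

def evaluate_policy(sweeps=200):
    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(sweeps):
        # V(s) = sum_a pi(a|s) [ r + gamma * V(s') ], pi uniform (1/4)
        v = {s: sum(0.25 * (rew + GAMMA * v[s2])
                    for s2, rew in (step(s, m) for m in MOVES))
             for s in v}
    return v

values = evaluate_policy()
```

State A ends up the most valuable; cells along the bottom edge have negative value because the random policy keeps bumping them off the grid.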
Model-Based Learning
The environment, P(st+1 | st, at) and p(rt+1 | st, at), is known.
There is no need for exploration; the problem can be solved using dynamic programming.
Solve for the optimal value function:
V*(st) = max(at) ( E[rt+1 | st, at] + γ Σ (st+1) P(st+1 | st, at) V*(st+1) )
Optimal policy:
π*(st) = arg max(at) ( E[rt+1 | st, at] + γ Σ (st+1) P(st+1 | st, at) V*(st+1) )
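Dynamic programming over a known model can be sketched as value iteration. The tiny 3-state MDP below is invented for illustration; P and R encode P(st+1 | st, at) and E[rt+1 | st, at]:

```python
# Value iteration on a known model. This 3-state MDP is made up:
# P[s][a] lists (prob, next_state) pairs, R[s][a] is E[r | s, a].
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(0.7, 2), (0.3, 1)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},      # absorbing goal state
}
R = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 0.0, 1: 1.0},
    2: {0: 0.0, 1: 0.0},
}

def value_iteration(theta=1e-8):
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # V*(s) = max_a [ E[r|s,a] + gamma * sum_s' P(s'|s,a) V*(s') ]
            new = max(R[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])
                      for a in P[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v

def greedy_policy(v):
    """Read the optimal policy off the converged value function."""
    return {s: max(P[s], key=lambda a: R[s][a] +
                   GAMMA * sum(p * v[s2] for p, s2 in P[s][a]))
            for s in P}

v_star = value_iteration()
pi_star = greedy_policy(v_star)
```

Here the optimal policy chooses action 1 in states 0 and 1, heading toward the rewarding transition into the goal.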
Value Iteration vs. Policy Iteration
Policy iteration typically needs fewer iterations than value iteration, although each of its iterations (a full policy evaluation) is more expensive.
Model-Free Learning
The environment, P(st+1 | st, at) and p(rt+1 | st, at), is not known: model-free learning, based on both exploitation and exploration.
Temporal difference (TD) learning: use the (discounted) reward received in the next time step to update the value of the current state (action): 1-step TD.
Temporal difference: the difference between the value of the current action and the value discounted from the next state.
Deterministic Rewards and Actions
Q*(st, at) = E[rt+1] + γ Σ (st+1) P(st+1 | st, at) max(at+1) Q*(st+1, at+1)
is reduced to
Q(st, at) = rt+1 + γ max(at+1) Q(st+1, at+1)
Therefore, we have a backup update rule:
Q̂(st, at) ← rt+1 + γ max(at+1) Q̂(st+1, at+1)
Initially Q̂(st, at) = 0, and its value increases as learning proceeds, episode by episode.
[Maze diagram: states Start, S2, S3, S4, S5, S7, S8, with a Goal state reached from S5]
In the maze, all rewards of intermediate states are zero in the first episode. When the goal is reached, we get reward r, and the Q value of the last state, S5, is updated to r. In the next episode, when S5 is reached, the Q value of its preceding state S4 is updated to γr.
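The episode-by-episode propagation described above can be sketched on a simple chain maze (the state names follow the diagram; treating the maze as a single chain and setting the goal reward r = 1 are assumptions):

```python
# Deterministic backup rule Q(s) <- r + gamma * max Q(s') on a chain
# maze Start -> S2 -> ... -> S5 -> Goal. Single action per state, so
# the max over next actions is trivial. Goal reward r = 1 is assumed.
GAMMA = 0.9
STATES = ["start", "s2", "s3", "s4", "s5", "goal"]
R_GOAL = 1.0

def run_episode(q):
    """Walk the chain once, applying the backup rule at each step."""
    for i, s in enumerate(STATES[:-1]):
        nxt = STATES[i + 1]
        reward = R_GOAL if nxt == "goal" else 0.0
        q[s] = reward + GAMMA * q[nxt]

q = {s: 0.0 for s in STATES}   # Q-hat starts at zero
run_episode(q)                 # episode 1: only s5 (next to goal) learns r
after_ep1 = dict(q)
run_episode(q)                 # episode 2: the value backs up to s4 as gamma*r
```

After the first episode only S5 has a nonzero value (r); after the second, S4 has picked up γr, exactly as in the text.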
Nondeterministic Rewards and Actions
Uncertainty in reward and state change is due to the presence of opponents or randomness in the environment.
Q-learning (Watkins & Dayan '92): keep a running average for each state-action pair:
Q̂(st, at) ← Q̂(st, at) + η ( rt+1 + γ max(at+1) Q̂(st+1, at+1) − Q̂(st, at) )
This averages the value of a sample of instances for each (st, at), converging to
Q*(st, at) = E[rt+1] + γ Σ (st+1) P(st+1 | st, at) max(at+1) Q*(st+1, at+1)
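A minimal sketch of this running-average update on a made-up two-state problem (η, γ, and the noisy reward distribution are assumptions for illustration):

```python
import random

# Q-learning with the running-average update. The 2-state MDP (state 0
# with a noisy-reward choice, absorbing state 1) is made up; eta and
# gamma values are assumptions.
ETA, GAMMA = 0.1, 0.9

def env_step(state, action, rng):
    """Nondeterministic reward: action 1 pays 1 on average, action 0 pays 0."""
    if action == 1:
        return 1, rng.gauss(1.0, 0.5)   # absorbing state, noisy reward
    return 1, 0.0

def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for _ in range(episodes):
        s = 0
        a = rng.choice([0, 1])          # pure exploration, for the sketch
        s2, r = env_step(s, a, rng)
        target = r + GAMMA * max(q[(s2, 0)], q[(s2, 1)])
        # running average: Q <- Q + eta * (target - Q)
        q[(s, a)] += ETA * (target - q[(s, a)])
    return q

q = q_learning()
```

Despite individual rewards being noisy, Q̂(0, 1) converges toward the mean reward of action 1, while Q̂(0, 0) stays at zero.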
Exploration Strategies
Greedy: choose the action that maximizes the immediate reward.
ε-greedy: with probability ε, choose one action uniformly at random; choose the best action with probability 1 − ε.
Softmax selection:
P(a|s) = e^Q(s,a) / Σ (b = 1..|A|) e^Q(s,b)
To gradually move from exploration to exploitation, a temperature variable T can drive the annealing process:
P(a|s) = e^(Q(s,a)/T) / Σ (b = 1..|A|) e^(Q(s,b)/T)
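A sketch of softmax selection with temperature. The Q values and temperatures below are made up; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the probabilities:

```python
import math
import random

def softmax_probs(q_values, T):
    """P(a|s) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)."""
    m = max(q_values)                 # shift for numerical stability
    exps = [math.exp((q - m) / T) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(q_values, T, rng=random):
    """Sample an action according to the softmax probabilities."""
    probs = softmax_probs(q_values, T)
    return rng.choices(range(len(q_values)), weights=probs)[0]

q = [1.0, 2.0, 3.0]                   # made-up Q values for one state
hot = softmax_probs(q, T=10.0)        # high T: near-uniform (exploration)
cold = softmax_probs(q, T=0.1)        # low T: near-greedy (exploitation)
```

Annealing T from high to low moves the agent smoothly from exploration toward greedy exploitation.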
Summary
RL is a process of learning by interaction, in contrast to supervised learning from examples.
Elements of RL for an agent and its environment: state value function, state-action value function (Q-value), reward, state-change probability, policy.
Tradeoff between exploitation and exploration.
Markov Decision Process.
Model-based learning: value function in the Bellman equation; dynamic programming.
Model-free learning: temporal difference (TD) and Q-learning (running average) to update the Q value.
Action selection for exploration: ε-greedy, softmax-based selection.