Using Advice to Transfer Knowledge Acquired in
One Reinforcement Learning Task to Another
Lisa Torrey, Trevor Walker, Jude Shavlik
University of Wisconsin-Madison, USA
Richard Maclin, University of Minnesota-Duluth, USA
Our Goal
Transfer knowledge…
… between reinforcement learning tasks
… employing SVM function approximators
… using advice
Transfer
Learn first task → knowledge acquired → learn related task
Exploit previously learned models
Improve learning of new tasks
[Plot: performance vs. experience, comparing learning with transfer and without transfer]
Reinforcement Learning
Agent loop: state → action → reward → new state
Q-function: Qaction(state) estimates the value of taking an action from a state
Policy: take the action with the maximum Qaction(state)
[Diagram: example action values +2, -1, and 0]
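A minimal sketch (not the authors' code) of acting greedily on learned Q-functions; the action names and the Q-values echoing the +2/-1/0 example above are hypothetical.

```python
def greedy_action(state, q_functions):
    """Return the action whose Q-function scores this state highest."""
    return max(q_functions, key=lambda action: q_functions[action](state))

# Stand-ins for learned Q_action(state) estimators.
q_functions = {
    "hold_ball": lambda s: 2.0,
    "pass_near": lambda s: -1.0,
    "pass_far":  lambda s: 0.0,
}
print(greedy_action(state=None, q_functions=q_functions))  # -> hold_ball
```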
Advice for Transfer
Task A Solution: “Based on what worked in Task A, I suggest…”
Task B Learner: “I’ll try it, but if it doesn’t work I’ll do something else.”
Advice improves RL performance
Advice can be refined or even discarded
Transfer Process
[Diagram: Task A experience → Task A Q-functions → Transfer Advice → Task B Q-functions ← Task B experience. A mapping from the user relates Task A to Task B, and advice from the user (optional) can be added at either learning stage.]
RoboCup Soccer Tasks
KeepAway: keep the ball from opponents [Stone & Sutton, ICML 2001]
BreakAway: score a goal [Maclin et al., AAAI 2005]
RL in RoboCup Tasks
KeepAway: actions Pass, Hold; reward +1 each time step
BreakAway: actions Pass, Move, Shoot; reward +2, 0, or -1 at the end; features also include (time left)
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Approximating Q-Functions
Learn linear coefficients: y = w1 f1 + … + wn fn + b
Non-linearity from Boolean tile features: tile(i, lower, upper) = 1 if lower ≤ fi < upper
Given examples: state features Si = <f1, …, fn>
Estimated values: y ≈ Qaction(Si)
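A small sketch, under assumed details, of the approximator described above: raw features are expanded with Boolean tile indicators, and a linear model is fit over the expanded vector. The tile boundaries below are hypothetical.

```python
import numpy as np

def tile_features(f, boundaries):
    """Expand raw features with Boolean tile indicators.

    boundaries[i] is a list of (lower, upper) intervals for feature i;
    tile(i, lower, upper) = 1 if lower <= f[i] < upper, else 0.
    """
    tiles = [1.0 if lo <= f[i] < hi else 0.0
             for i, intervals in enumerate(boundaries)
             for (lo, hi) in intervals]
    return np.concatenate([f, tiles])

# Two raw features, each tiled into two intervals (hypothetical ranges).
boundaries = [[(0, 5), (5, 10)], [(0, 15), (15, 30)]]
f = np.array([3.0, 20.0])
x = tile_features(f, boundaries)   # length 2 raw + 4 tiles = 6
w, b = np.zeros(6), 0.0            # coefficients to be learned
y = w @ x + b                      # estimated Q_action(S)
```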
Support Vector Regression
Input: state S; output: Q-estimate y
Linear program:
  minimize   ||w||1 + |b| + C ||k||1
  such that  y - k ≤ Sw + b ≤ y + k
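One way to solve this linear program is with an off-the-shelf LP solver; the sketch below uses scipy.optimize.linprog (an assumption, not the authors' solver), splitting w and b into nonnegative parts so the 1-norms become linear.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1_svr(S, y, C=1.0):
    """Sketch of the 1-norm SVR above:
    minimize ||w||_1 + |b| + C||k||_1  such that  y - k <= Sw + b <= y + k.

    Absolute values are linearized by splitting variables into
    nonnegative parts: w = wp - wn, b = bp - bn; slacks k >= 0.
    Variable vector: x = [wp (n), wn (n), bp, bn, k (m)].
    """
    m, n = S.shape
    c = np.concatenate([np.ones(2 * n + 2), C * np.ones(m)])
    ones = np.ones((m, 1))
    I = np.eye(m)
    # Sw + b - k <= y    and    -(Sw + b) - k <= -y
    A_ub = np.vstack([
        np.hstack([ S, -S,  ones, -ones, -I]),
        np.hstack([-S,  S, -ones,  ones, -I]),
    ])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(c))
    w = res.x[:n] - res.x[n:2 * n]
    b = res.x[2 * n] - res.x[2 * n + 1]
    return w, b

# Tiny synthetic check: three states, two features.
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w, b = fit_l1_svr(S, y)
```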
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Advice Example
Need only follow advice approximately
Add soft constraints to linear program
if distance_to_goal ≤ 10
and shot_angle ≥ 30
then prefer shoot over all other actions
Incorporating Advice [Maclin et al., AAAI 2005]
if   v11 f1 + … + v1n fn ≤ d1
…
and  vm1 f1 + … + vmn fn ≤ dm
then Qshoot > Qother for all other actions
Advice and Q-functions have the same language: linear expressions of features
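The method above compiles such rules directly into soft constraints of the linear program; the simplified sketch below (not the paper's formulation) only illustrates a rule's semantics, representing the preconditions as V f ≤ d and measuring how far the preferred action falls short on states where the rule applies. All names and the margin value are hypothetical.

```python
import numpy as np

class AdviceRule:
    """Advice of the form: if V f <= d, prefer one action over the rest."""

    def __init__(self, V, d, preferred, margin=1.0):
        self.V, self.d = np.asarray(V), np.asarray(d)
        self.preferred, self.margin = preferred, margin

    def applies(self, f):
        """Preconditions: every linear inequality V_i . f <= d_i holds."""
        return bool(np.all(self.V @ f <= self.d))

    def violation(self, f, q_values):
        """Soft-constraint penalty: how far the preferred action's
        Q-value falls short of the best other action plus a margin."""
        if not self.applies(f):
            return 0.0
        others = [q for a, q in q_values.items() if a != self.preferred]
        return max(0.0, max(others) + self.margin - q_values[self.preferred])

# "if distance_to_goal <= 10 and shot_angle >= 30 then prefer shoot"
# (shot_angle >= 30 is written as -shot_angle <= -30).
rule = AdviceRule(V=[[1, 0], [0, -1]], d=[10, -30], preferred="shoot")
f = np.array([8.0, 40.0])   # (distance_to_goal, shot_angle)
print(rule.violation(f, {"shoot": 0.2, "pass": 0.5, "move": 0.1}))
```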
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Expressing Policy with Advice
Old Q-functions: Qhold_ball(s), Qpass_near(s), Qpass_far(s)
Advice expressing the policy:
if   Qhold_ball(s) > Qpass_near(s)
and  Qhold_ball(s) > Qpass_far(s)
then prefer hold_ball over all other actions
Mapping Actions
Mapping from user:
  hold_ball → move
  pass_near → pass_near
  pass_far
Mapped policy (using old Q-functions):
if   Qhold_ball(s) > Qpass_near(s)
and  Qhold_ball(s) > Qpass_far(s)
then prefer move over all other actions
Mapping Features
Mapping from user yields the Q-function mapping:
Qhold_ball(s) = w1 (dist_keeper1) + w2 (dist_taker2) + …
Q´hold_ball(s) = w1 (dist_attacker1) + w2 (MAX_DIST) + …
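A sketch (hypothetical helper, not the authors' code) of this feature mapping: each old-task feature maps to a new-task feature, or to a constant such as MAX_DIST when it has no counterpart; constant-mapped terms can equivalently be folded into the bias.

```python
MAX_DIST = 30.0  # assumed placeholder value for unmapped distances

feature_map = {
    "dist_keeper1": "dist_attacker1",  # teammate -> teammate
    "dist_taker2":  MAX_DIST,          # no counterpart: use a constant
}

def map_q_function(weights, bias, feature_map):
    """Rewrite Q(s) = sum_i w_i * f_i + b over the new task's features.

    A weight on a constant-mapped feature contributes w * constant,
    which is folded into the bias (equivalent to keeping the
    w * MAX_DIST term written out on the slide above).
    """
    new_weights, new_bias = {}, bias
    for old_feature, w in weights.items():
        target = feature_map[old_feature]
        if isinstance(target, str):
            new_weights[target] = new_weights.get(target, 0.0) + w
        else:
            new_bias += w * target
    return new_weights, new_bias

old_w = {"dist_keeper1": 0.7, "dist_taker2": -0.3}
new_w, new_b = map_q_function(old_w, bias=0.1, feature_map=feature_map)
```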
Transfer Example
Old model:
  Qx = wx1 f1 + wx2 f2 + bx
  Qy = wy1 f1 + by
  Qz = wz2 f2 + bz
Mapped model:
  Q´x = wx1 f´1 + wx2 f´2 + bx
  Q´y = wy1 f´1 + by
  Q´z = wz2 f´2 + bz
Advice:
  if Q´x > Q´y and Q´x > Q´z then prefer x´
Advice (expanded):
  if   wx1 f´1 + wx2 f´2 + bx > wy1 f´1 + by
  and  wx1 f´1 + wx2 f´2 + bx > wz2 f´2 + bz
  then prefer x´ to all other actions
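A sketch (hypothetical representation) of generating the expanded advice from the mapped model: each "prefer x´ over a" clause becomes the linear inequality Q´x - Q´a > 0, expressed as coefficient differences so it stays in the advice language of the earlier slides.

```python
def prefer_constraints(mapped_model, preferred):
    """mapped_model: action -> (weights over new features f', bias).

    For each other action a, return coefficients (v, d) such that the
    advice clause is  v . f' > d, i.e. Q'_preferred(f') - Q'_a(f') > 0.
    """
    w_p, b_p = mapped_model[preferred]
    constraints = {}
    for action, (w_a, b_a) in mapped_model.items():
        if action == preferred:
            continue
        features = set(w_p) | set(w_a)
        v = {f: w_p.get(f, 0.0) - w_a.get(f, 0.0) for f in features}
        constraints[action] = (v, b_a - b_p)  # v . f' > b_a - b_p
    return constraints

# Hypothetical weights for the x, y, z example above.
mapped_model = {
    "x": ({"f1": 0.5, "f2": 0.2}, 0.1),   # Q'_x = 0.5 f'1 + 0.2 f'2 + 0.1
    "y": ({"f1": 0.4}, 0.0),              # Q'_y = 0.4 f'1
    "z": ({"f2": 0.6}, -0.2),             # Q'_z = 0.6 f'2 - 0.2
}
print(prefer_constraints(mapped_model, preferred="x"))
```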
Transfer Experiment
Between RoboCup subtasks: from 3-on-2 KeepAway to 2-on-1 BreakAway
Two simultaneous mappings: transfer passing skills; map passing skills to shooting
Experiment Mappings
Play a moving KeepAway game: Pass → Pass, Hold → Move
Pretend a teammate is standing in the goal: Pass → Shoot
[Diagram: imaginary teammate positioned in the goal]
Experimental Methodology
Averaged over 10 BreakAway runs
Transfer: advice from one KeepAway model
Control: runs without advice
Results
[Plot: Probability(Score Goal), 0 to 0.8, vs. Games Played, 0 to 10,000]
Analysis
Transfer advice helps BreakAway learners: 7% more likely to score a goal after learning
Improvement is delayed: advantage begins after 2500 games
Some advice rules apply rarely: preconditions for shoot advice not often met
Related Work: Transfer
Remember action subsequences [Singh, ML 1992]
Restrict action choices [Sherstov & Stone, AAAI 2005]
Transfer Q-values directly in KeepAway [Taylor & Stone, AAMAS 2005]
Related Work: Advice
“Take action A now” [Clouse & Utgoff, ICML 1992]
“In situations S, action A has value X” [Maclin & Shavlik, ML 1996]
“In situations S, prefer action A over B” [Maclin et al., AAAI 2005]
Future Work
Increase speed of linear-program solving
Decrease sensitivity to imperfect advice
Extract advice from kernel-based models
Help user map actions and features
Conclusions
Transfer exploits previously learned models to improve learning of new tasks
Advice is an appealing way to transfer
Linear regression approach incorporates advice straightforwardly
Transferring a policy accommodates different reward structures
Acknowledgements
DARPA grant HR0011-04-1-0007
United States Naval Research Laboratory grant N00173-04-1-G026
Michael Ferris, Olvi Mangasarian, Ted Wild