Using Advice to Transfer Knowledge Acquired in
One Reinforcement Learning Task to Another
Lisa Torrey, Trevor Walker, Jude Shavlik
University of Wisconsin-Madison, USA
Richard Maclin, University of Minnesota-Duluth, USA
Our Goal
Transfer knowledge…
… between reinforcement learning tasks
… employing SVM function approximators
… using advice
Transfer
Learn first task → knowledge acquired → learn related task
Exploit previously learned models
Improve learning of new tasks
[Plot: performance vs. experience, comparing learning with transfer and without transfer]
Reinforcement Learning
Agent loop: state → action → reward → new state
Q-function: Qaction(state) estimates the value of taking an action from a state
Policy: take the action with the maximum Qaction(state)
[Diagram: example action values +2, -1, and 0]
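A minimal sketch (not the authors' code) of acting greedily on learned Q-functions; the action names and the Q-values echoing the +2/-1/0 example above are hypothetical.

```python
def greedy_action(state, q_functions):
    """Return the action whose Q-function scores this state highest."""
    return max(q_functions, key=lambda action: q_functions[action](state))

# Stand-ins for learned Q_action(state) estimators.
q_functions = {
    "hold_ball": lambda s: 2.0,
    "pass_near": lambda s: -1.0,
    "pass_far":  lambda s: 0.0,
}
print(greedy_action(state=None, q_functions=q_functions))  # -> hold_ball
```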
Advice for Transfer
Task A Solution: “Based on what worked in Task A, I suggest…”
Task B Learner: “I’ll try it, but if it doesn’t work I’ll do something else.”
Advice improves RL performance
Advice can be refined or even discarded
Transfer Process
[Diagram: Task A experience → Task A Q-functions → Transfer Advice → Task B Q-functions ← Task B experience. A mapping from the user relates Task A to Task B, and advice from the user (optional) can be added at either learning stage.]
RoboCup Soccer Tasks
KeepAway: keep the ball from opponents [Stone & Sutton, ICML 2001]
BreakAway: score a goal [Maclin et al., AAAI 2005]
RL in RoboCup Tasks
KeepAway: actions Pass, Hold; reward +1 each time step
BreakAway: actions Pass, Move, Shoot; reward +2, 0, or -1 at the end; features also include (time left)
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Approximating Q-Functions
Learn linear coefficients: y = w1 f1 + … + wn fn + b
Non-linearity from Boolean tile features: tile(i, lower, upper) = 1 if lower ≤ fi < upper
Given examples: state features Si = <f1, …, fn>
Estimated values: y ≈ Qaction(Si)
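A small sketch, under assumed details, of the approximator described above: raw features are expanded with Boolean tile indicators, and a linear model is fit over the expanded vector. The tile boundaries below are hypothetical.

```python
import numpy as np

def tile_features(f, boundaries):
    """Expand raw features with Boolean tile indicators.

    boundaries[i] is a list of (lower, upper) intervals for feature i;
    tile(i, lower, upper) = 1 if lower <= f[i] < upper, else 0.
    """
    tiles = [1.0 if lo <= f[i] < hi else 0.0
             for i, intervals in enumerate(boundaries)
             for (lo, hi) in intervals]
    return np.concatenate([f, tiles])

# Two raw features, each tiled into two intervals (hypothetical ranges).
boundaries = [[(0, 5), (5, 10)], [(0, 15), (15, 30)]]
f = np.array([3.0, 20.0])
x = tile_features(f, boundaries)   # length 2 raw + 4 tiles = 6
w, b = np.zeros(6), 0.0            # coefficients to be learned
y = w @ x + b                      # estimated Q_action(S)
```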
Support Vector Regression
Input: state S; output: Q-estimate y
Linear program:
  minimize   ||w||1 + |b| + C ||k||1
  such that  y - k ≤ Sw + b ≤ y + k
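One way to solve this linear program is with an off-the-shelf LP solver; the sketch below uses scipy.optimize.linprog (an assumption, not the authors' solver), splitting w and b into nonnegative parts so the 1-norms become linear.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1_svr(S, y, C=1.0):
    """Sketch of the 1-norm SVR above:
    minimize ||w||_1 + |b| + C||k||_1  such that  y - k <= Sw + b <= y + k.

    Absolute values are linearized by splitting variables into
    nonnegative parts: w = wp - wn, b = bp - bn; slacks k >= 0.
    Variable vector: x = [wp (n), wn (n), bp, bn, k (m)].
    """
    m, n = S.shape
    c = np.concatenate([np.ones(2 * n + 2), C * np.ones(m)])
    ones = np.ones((m, 1))
    I = np.eye(m)
    # Sw + b - k <= y    and    -(Sw + b) - k <= -y
    A_ub = np.vstack([
        np.hstack([ S, -S,  ones, -ones, -I]),
        np.hstack([-S,  S, -ones,  ones, -I]),
    ])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(c))
    w = res.x[:n] - res.x[n:2 * n]
    b = res.x[2 * n] - res.x[2 * n + 1]
    return w, b

# Tiny synthetic check: three states, two features.
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w, b = fit_l1_svr(S, y)
```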
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Advice Example
Need only follow advice approximately
Add soft constraints to linear program
if distance_to_goal ≤ 10
and shot_angle ≥ 30
then prefer shoot over all other actions
Incorporating Advice [Maclin et al., AAAI 2005]
if   v11 f1 + … + v1n fn ≤ d1
…
and  vm1 f1 + … + vmn fn ≤ dm
then Qshoot > Qother for all other actions
Advice and Q-functions have the same language: linear expressions of features
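The method above compiles such rules directly into soft constraints of the linear program; the simplified sketch below (not the paper's formulation) only illustrates a rule's semantics, representing the preconditions as V f ≤ d and measuring how far the preferred action falls short on states where the rule applies. All names and the margin value are hypothetical.

```python
import numpy as np

class AdviceRule:
    """Advice of the form: if V f <= d, prefer one action over the rest."""

    def __init__(self, V, d, preferred, margin=1.0):
        self.V, self.d = np.asarray(V), np.asarray(d)
        self.preferred, self.margin = preferred, margin

    def applies(self, f):
        """Preconditions: every linear inequality V_i . f <= d_i holds."""
        return bool(np.all(self.V @ f <= self.d))

    def violation(self, f, q_values):
        """Soft-constraint penalty: how far the preferred action's
        Q-value falls short of the best other action plus a margin."""
        if not self.applies(f):
            return 0.0
        others = [q for a, q in q_values.items() if a != self.preferred]
        return max(0.0, max(others) + self.margin - q_values[self.preferred])

# "if distance_to_goal <= 10 and shot_angle >= 30 then prefer shoot"
# (shot_angle >= 30 is written as -shot_angle <= -30).
rule = AdviceRule(V=[[1, 0], [0, -1]], d=[10, -30], preferred="shoot")
f = np.array([8.0, 40.0])   # (distance_to_goal, shot_angle)
print(rule.violation(f, {"shoot": 0.2, "pass": 0.5, "move": 0.1}))
```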
Transfer Process
[Diagram repeated: Task A experience → Task A Q-functions → Transfer Advice (via the user's Task A → Task B mapping) → Task B Q-functions ← Task B experience]
Expressing Policy with Advice
Old Q-functions: Qhold_ball(s), Qpass_near(s), Qpass_far(s)
Advice expressing the policy:
if   Qhold_ball(s) > Qpass_near(s)
and  Qhold_ball(s) > Qpass_far(s)
then prefer hold_ball over all other actions
Mapping Actions
Mapping from user:
  hold_ball → move
  pass_near → pass_near
  pass_far
Mapped policy (using old Q-functions):
if   Qhold_ball(s) > Qpass_near(s)
and  Qhold_ball(s) > Qpass_far(s)
then prefer move over all other actions
Mapping Features
Mapping from user yields the Q-function mapping:
Qhold_ball(s) = w1 (dist_keeper1) + w2 (dist_taker2) + …
Q´hold_ball(s) = w1 (dist_attacker1) + w2 (MAX_DIST) + …
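A sketch (hypothetical helper, not the authors' code) of this feature mapping: each old-task feature maps to a new-task feature, or to a constant such as MAX_DIST when it has no counterpart; constant-mapped terms can equivalently be folded into the bias.

```python
MAX_DIST = 30.0  # assumed placeholder value for unmapped distances

feature_map = {
    "dist_keeper1": "dist_attacker1",  # teammate -> teammate
    "dist_taker2":  MAX_DIST,          # no counterpart: use a constant
}

def map_q_function(weights, bias, feature_map):
    """Rewrite Q(s) = sum_i w_i * f_i + b over the new task's features.

    A weight on a constant-mapped feature contributes w * constant,
    which is folded into the bias (equivalent to keeping the
    w * MAX_DIST term written out on the slide above).
    """
    new_weights, new_bias = {}, bias
    for old_feature, w in weights.items():
        target = feature_map[old_feature]
        if isinstance(target, str):
            new_weights[target] = new_weights.get(target, 0.0) + w
        else:
            new_bias += w * target
    return new_weights, new_bias

old_w = {"dist_keeper1": 0.7, "dist_taker2": -0.3}
new_w, new_b = map_q_function(old_w, bias=0.1, feature_map=feature_map)
```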
Transfer Example
Old model:
  Qx = wx1 f1 + wx2 f2 + bx
  Qy = wy1 f1 + by
  Qz = wz2 f2 + bz
Mapped model:
  Q´x = wx1 f´1 + wx2 f´2 + bx
  Q´y = wy1 f´1 + by
  Q´z = wz2 f´2 + bz
Advice:
  if Q´x > Q´y and Q´x > Q´z then prefer x´
Advice (expanded):
  if   wx1 f´1 + wx2 f´2 + bx > wy1 f´1 + by
  and  wx1 f´1 + wx2 f´2 + bx > wz2 f´2 + bz
  then prefer x´ to all other actions
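A sketch (hypothetical representation) of generating the expanded advice from the mapped model: each "prefer x´ over a" clause becomes the linear inequality Q´x - Q´a > 0, expressed as coefficient differences so it stays in the advice language of the earlier slides.

```python
def prefer_constraints(mapped_model, preferred):
    """mapped_model: action -> (weights over new features f', bias).

    For each other action a, return coefficients (v, d) such that the
    advice clause is  v . f' > d, i.e. Q'_preferred(f') - Q'_a(f') > 0.
    """
    w_p, b_p = mapped_model[preferred]
    constraints = {}
    for action, (w_a, b_a) in mapped_model.items():
        if action == preferred:
            continue
        features = set(w_p) | set(w_a)
        v = {f: w_p.get(f, 0.0) - w_a.get(f, 0.0) for f in features}
        constraints[action] = (v, b_a - b_p)  # v . f' > b_a - b_p
    return constraints

# Hypothetical weights for the x, y, z example above.
mapped_model = {
    "x": ({"f1": 0.5, "f2": 0.2}, 0.1),   # Q'_x = 0.5 f'1 + 0.2 f'2 + 0.1
    "y": ({"f1": 0.4}, 0.0),              # Q'_y = 0.4 f'1
    "z": ({"f2": 0.6}, -0.2),             # Q'_z = 0.6 f'2 - 0.2
}
print(prefer_constraints(mapped_model, preferred="x"))
```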
Transfer Experiment
Between RoboCup subtasks: from 3-on-2 KeepAway to 2-on-1 BreakAway
Two simultaneous mappings: transfer passing skills; map passing skills to shooting
Experiment Mappings
Play a moving KeepAway game: Pass → Pass, Hold → Move
Pretend a teammate is standing in the goal: Pass → Shoot
[Diagram: imaginary teammate positioned in the goal]
Experimental Methodology
Averaged over 10 BreakAway runs
Transfer: advice from one KeepAway model
Control: runs without advice
Results
[Plot: Probability(Score Goal), 0 to 0.8, vs. Games Played, 0 to 10,000]
Analysis
Transfer advice helps BreakAway learners: 7% more likely to score a goal after learning
Improvement is delayed: advantage begins after 2500 games
Some advice rules apply rarely: preconditions for shoot advice not often met
Related Work: Transfer
Remember action subsequences [Singh, ML 1992]
Restrict action choices [Sherstov & Stone, AAAI 2005]
Transfer Q-values directly in KeepAway [Taylor & Stone, AAMAS 2005]
Related Work: Advice
“Take action A now” [Clouse & Utgoff, ICML 1992]
“In situations S, action A has value X” [Maclin & Shavlik, ML 1996]
“In situations S, prefer action A over B” [Maclin et al., AAAI 2005]
Future Work
Increase speed of linear-program solving
Decrease sensitivity to imperfect advice
Extract advice from kernel-based models
Help user map actions and features
Conclusions
Transfer exploits previously learned models to improve learning of new tasks
Advice is an appealing way to transfer
Linear regression approach incorporates advice straightforwardly
Transferring a policy accommodates different reward structures
Acknowledgements
DARPA grant HR0011-04-1-0007
United States Naval Research Laboratory grant N00173-04-1-G026
Michael Ferris, Olvi Mangasarian, Ted Wild