Reinforcement Learning
Peter Sunehag
2013
Reinforcement Learning: Reinforcement Learning Agents, an approach to AGI
Peter Sunehag
Tutorial at the Seventh Conference on AGI
Quebec City, August 2014
Many slides borrowed from e.g. Sutton & Barto
!"C-*#&$+J"*K""$+!,+-$B+30+
!"#$%&'(")"$*+,"-'$#$.+L'-)"K&'MS%&E(L%#%&"04(?AT%&.(BA./5"00()=(3&A$0%B.
!,+/'&JC")6+
N+3C.&'#*2)6
</A?2"./1?'(-#M#A6#'(#A#]1717@7(%#T1&A#B%#/.
3'*#%#(#-C
0$*"CC#."$("
*&"@1/1A#"00E(@%/%&B1#1./1?'(M#A6#((6A&0@(5(30"##1#L(3&A$0%B
O*-*#6*#(-C+
8-(2#$"+,"-'$#$.
VA./0E(1717@7(@"/"(?0"..1F1?"/1A#'(
&%L&%..1A#'(?0-./%&1#L
O&)"+3PPC#(-*#&$6+&%+!,¥ !2%?M%&.(N<"B-%0'(8ObOQ
F1&./(-.%(AF(;K(1#("#(1#/%&%./1#L(&%"0(L"B%
¥ N=#T%&/%@Q(2%01?A3/%&(F01L2/(N9L(%/("07(,++[Q
$%//%&(/2"#("#E(2-B"#
¥ !AB3-/%&(cA(N:!*'(,++DQF1#"00E(.AB%(3&AL&"B.(61/2(X&%".A#"$0%Y(30"E
¥ ;A$A?-3(<A??%&(*%"B.( N</A#%(G(S%0A.A'(;%1@B100%&(%/("07Q
dA&0@e.($%./(30"E%&(AF(.1B-0"/%@(.A??%&'(8OOOf(;-##%&]-3(,+++
¥ =#T%#/A&E(V"#"L%B%#/( NS"#(;AE'(H%&/.%M".'(K%%(G(*.1/.1M01.Q8+]8bg(1B3&AT%B%#/(AT%&(1#@-./&E(./"#@"&@(B%/2A@.
¥ JE#"B1?(!2"##%0()..1L#B%#/( N<1#L2(G(H%&/.%M".'(91%(G(Z"EM1#Q
dA&0@h.($%./("..1L#%&(AF(&"@1A(?2"##%0.(/A(BA$10%(/%0%32A#%(?"00.
¥ >0%T"/A&(!A#/&A0( N!&1/%.(G(H"&/AQNU&A$"$0EQ(6A&0@h.($%./(@A6#]3%"M(%0%T"/A&(?A#/&A00%&
¥ V"#E(;A$A/.#"T1L"/1A#'($1]3%@"0(6"0M1#L'(L&".31#L'(.61/?21#L($%/6%%#(.M100.777
¥ *J]c"BBA#("#@(i%00EF1.2( N*%."-&A'(J"20QdA&0@h.($%./($"?ML"BBA#(30"E%&7(c&"#@B"./%&(0%T%0
Arcade Learning Environment (Naddaf 2010, Bellemare et al. 2012, ...): an on-going challenge
4&)PC"*"+3."$*
¥ *%B3A&"00E(.1/-"/%@
¥ !A#/1#-"0(0%"#L("#@(30"##1#L
¥ _$^%?/(1.(/A(-%%"(* /2%(%#T1&A#B%#/
¥ >#T1&A#B%#/(1.(./A?2"./1?("#@(-#?%&/"1#
[Diagram: the agent sends actions to the environment; the environment returns state and reward to the agent.]
D2-*+#6+!"#$%&'(")"$*+,"-'$#$.F
¥ )#("33&A"?2(/A()&/1F1?1"0(=#/%001L%#?%
¥ K%"#L(F&AB(1#/%&"?/1A#
¥ cA"0]A&1%#/%@(0%"#L
¥ K%"#L("$A-/'(F&AB'("#@(6210%(1#/%&"?/1#L(61/2("#(%R/%&#"0(%#T1&A#B%#/
¥ K%"#L(62"/(/A(@Aj2A6(/A(B"3(.1/-"/1A#.(/A("?/1A#.j.A(".(/A(B"R1B1W%("(#-B%&1?"0(&%6"&@(.1L#"0
42-P*"'+RS+0$*'&B5(*#&$
Psychology
Artificial Intelligence
Control Theory and Operations Research
Artificial Neural Networks
Reinforcement Learning (RL)
Neuroscience
T"E+L"-*5'"6+&%+!,
¥ K%"&#%&(1.(#A/(/A0@(621?2("?/1A#.(/A(/"M%
¥ *&1"0]"#@]>&&A&(.%"&?2
¥ UA..1$101/E(AF(@%0"E%@(&%6"&@
<"?&1F1?%(.2A&/]/%&B(L"1#.(FA&(L&%"/%&(0A#L]/%&B(L"1#.
¥ *2%(#%%@(/A("UPC&'" "#@("UPC&#*
¥ !A#.1@%&.(/2%(62A0%(3&A$0%B(AF("(LA"0]@1&%?/%@("L%#/(1#/%&"?/1#L(61/2("#(-#?%&/"1#(%#T1&A#B%#/
O5P"'A#6"B+,"-'$#$.
Supervised!Learning!
SystemInputs Outputs
Training!Info!!=!!desired!(target)!outputs
Error!!=!!(target!output!!Ð actual!output)
!"#$%&'(")"$*+,"-'$#$.
RL
SystemInputs Outputs!(ÒactionsÓ)
Training!Info!!=!!evaluations!(ÒrewardsÓ!/!ÒpenaltiesÓ)
Objective:!!get!as!much!reward!as!possible
3$+WU*"$B"B+WU-)PC"S+V#(HV-(HV&"
[Figure: a tic-tac-toe game tree alternating between x's moves and o's moves.]

Assume an imperfect opponent: he/she sometimes makes mistakes
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:

State                V(s): estimated probability of winning
[board]              .5   ?
[board]              .5   ?
...
[win for x]          1    win
[loss for x]         0    loss
[drawn board]        0    draw

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states:

Just pick the next state with the highest estimated probability of winning, the largest V(s): a greedy move.
But 10% of the time pick a move at random: an exploratory move.
!,+,"-'$#$.+!5C"+%&'+V#(HV-(HV&"
ÒExploratoryÓ move
s!!! Ð !the!state!before!our!greedy!move
s !! Ð !!the!state!after!our!greedy!move
We!increment!each!V(s) toward!V(s ) Ð a!backup :
V(s) V (s) V(s ) V (s)
a!small!positive!fraction, e.g., .1
the!step - size parameter
c *
*
*g
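For concreteness, here is a minimal Python sketch of this table-based update (not from the slides); the string board encoding and α = 0.1 are illustrative assumptions.

```python
# Minimal sketch of the tic-tac-toe value update V(s) <- V(s) + alpha*[V(s') - V(s)].
# The string board encoding and alpha = 0.1 are illustrative choices.
from collections import defaultdict

alpha = 0.1                              # step-size parameter
V = defaultdict(lambda: 0.5)             # estimated probability of winning, default 0.5
# terminal states would be fixed to 1 (win) or 0 (loss/draw), as in the table above

def backup(state, next_state):
    """Move V(state) a fraction alpha toward V(next_state) after a greedy move."""
    V[state] += alpha * (V[next_state] - V[state])

# Example: a greedy move takes us from board s to board s_next
s, s_next = "x.o......", "x.o.x...."
backup(s, s_next)
```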
The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each choice is called a play.
After each play a_t, you get a reward r_t, where E{r_t | a_t} = Q*(a_t); these are the unknown action values.
The distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
The Exploration/Exploitation Dilemma

Suppose you form action value estimates Q_t(a) ≈ Q*(a).
The greedy action at time t is a_t* = argmax_a Q_t(a).
Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.
You can't exploit all the time; you can't explore all the time.
You can never stop exploring; but you should always reduce exploring. Maybe.
Optimistic Initial Values

All methods so far depend on Q_0(a), i.e., they are biased.
Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.
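A minimal sketch, assuming a synthetic 10-armed testbed with Gaussian rewards, of ε-greedy action selection combined with optimistic initial values; all names and constants are illustrative.

```python
# Hypothetical sketch of an epsilon-greedy n-armed bandit with optimistic initial values.
# The bandit itself (true_values, Gaussian rewards) is invented for illustration.
import random

n, epsilon, Q0 = 10, 0.1, 5.0
true_values = [random.gauss(0, 1) for _ in range(n)]   # unknown Q*(a)
Q = [Q0] * n          # optimistic initial estimates encourage early exploration
N = [0] * n           # play counts

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(n)                    # explore
    else:
        a = max(range(n), key=lambda i: Q[i])      # exploit: greedy action
    r = random.gauss(true_values[a], 1)            # reward depends only on a
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
```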
The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
  Agent observes state at step t:   s_t ∈ S
  produces action at step t:        a_t ∈ A(s_t)
  gets resulting reward:            r_{t+1}
  and resulting next state:         s_{t+1}

[Diagram: ... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...]
Bellman Equation for a Policy π

The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:  V^π(s) = E_π { R_t | s_t = s }
            = E_π { r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
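As an illustration (not part of the slides), iterative policy evaluation applies the last equation as an update until the values stop changing; the nested-dict representation of π, P and R below is an assumed interface.

```python
# Sketch of iterative policy evaluation using the Bellman equation above.
# pi[s][a], P[s][a][s2] and R[s][a][s2] are assumed to be given as nested dicts.
def policy_evaluation(states, actions, pi, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:                 # stop once the sweep changes nothing
            return V
```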
Golf

State is ball location
Reward of −1 for each stroke until the ball is in the hole
Value of a state?
Actions:
  putt (use putter)
  driver (use driver)
putt succeeds anywhere on the green
Optimal Value Function for Golf
We can hit the ball farther with driver than with putter,
but with less accuracy
Q*(s,driver) gives the value of using driver first, then
using whichever actions are best
Simple Monte Carlo

[Backup diagram: a full episode is sampled from state s_t down to a terminal state.]

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.
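A minimal sketch of the constant-α Monte Carlo update above; the episode is assumed to be stored as parallel lists of visited states and the rewards received after leaving them.

```python
# Sketch of the constant-alpha Monte Carlo update: after an episode ends, each
# visited state is moved toward the actual return that followed it.
# episode_rewards[i] is assumed to be the reward received after leaving episode_states[i].
def mc_update(V, episode_states, episode_rewards, alpha=0.1, gamma=1.0):
    G = 0.0
    # iterate backwards so G accumulates the return following each state
    for s, r in zip(reversed(episode_states), reversed(episode_rewards)):
        G = r + gamma * G
        old = V.get(s, 0.0)
        V[s] = old + alpha * (G - old)      # V(s_t) <- V(s_t) + alpha*(R_t - V(s_t))
    return V
```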
Simplest TD Method

[Backup diagram: a single sampled step from s_t to s_{t+1} with reward r_{t+1}.]

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
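And the corresponding TD(0) sketch, which applies the update after every single step instead of waiting for the actual return; the dictionary-based value table is an illustrative choice.

```python
# Sketch of the tabular TD(0) update performed after every step, in contrast to the
# Monte Carlo update above, which waits for the actual return at the end of the episode.
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=1.0):
    """V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))."""
    old = V.get(s, 0.0)
    target = r_next + gamma * V.get(s_next, 0.0)   # bootstrapped target
    V[s] = old + alpha * (target - old)
    return V
```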
Dynamic Programming

[Backup diagram: a full one-step expectation over all actions and successor states of s_t.]

V(s_t) ← E_π { r_{t+1} + γ V(s_{t+1}) | s_t }
TD methods bootstrap and sample

Bootstrapping: update involves an estimate
  MC does not bootstrap
  DP bootstraps
  TD bootstraps
Sampling: update does not involve an expected value
  MC samples
  DP does not sample
  TD samples
Advantages of TD Learning

TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
  You can learn before knowing the final outcome
    - Less memory
    - Less peak computation
  You can learn without the final outcome
    - From incomplete sequences
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
TD Gammon
Tesauro 1992, 1994, 1995, ...

White has just rolled a 5 and a 2, so can move one of his pieces 5 and one (possibly the same) 2 steps
Objective is to advance all pieces to points 19-24
Hitting
Doubling
30 pieces, 24 locations implies an enormous number of configurations
Effective branching factor of 400
TD-Gammon combined TD(λ) with a neural network as value function approximator
A&%0#&)8'3#(")GHKHI)GHUV
SamuelÕs Checkers Player
82,&()=,'&*)2,.-/6#&'%/,.$)=;)')W$2,&/.6)1,";.,3/'"X)
M'-%(&)80'..,.I)GHKRN
Y/./3'T %,)*(%(&3/.()W='24(*+#1)$2,&(X),-)')1,$/%/,.
A"10'+=(%' 2#%,--$
?,%()"('&./.69)$'7()('20)=,'&*)2,.-/6)(.2,#.%(&(*)%,6(%0(&)
5/%0)='24(*+#1)$2,&(
.((*(*)')W$(.$(),-)*/&(2%/,.X9)"/4()*/$2,#.%/.6
D('&./.6)=;)6(.(&'"/Z'%/,.9)$/3/"'&)%,)FP)'"6,&/%03)5/%0)
"/.('&)7'"#()-#.2%/,.)'11&,T/3'%/,.@
Samuel's Backups
The Basic Idea

". . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."
  A. L. Samuel, Some Studies in Machine Learning Using the Game of Checkers, 1959
Elevator Dispatching
[&/%($)'.*)C'&%,))GHHU
State Space

• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40
• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states
Actions

• When moving (halfway between floors):
  - stop at next floor
  - continue past next floor
• When stopped at a floor:
  - go up
  - go down
• Asynchronous
Performance Criteria

Minimize:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Computer Go
8;"7'/.)B(""; MERRJN
GQ
WF'$4)]'&)ST2(""(.2()-,&)A!X)MO'.$)C(&"/.(&N
W^(5)P&,$,10/"'),-)A!X)M_,0.)Y2['&%0;N
WB&'.*)[0'""(.6()F'$4X)MP'7/*)Y(20.(&N
Monte Carlo Tree Search + Playout Policy
DP & TD do not apply: state space too large
MC & ES do not apply: action space too large

New Approach
In the current state, sample and learn to construct a locally specialized policy
Exploration/exploitation dilemma dealt with by Upper Confidence Trees (UCT)
Evaluate leaves using a Playout Policy
Inverted Helicopter Flight
0%%19aa0("/@$%'.-,&*@(*#a
GK
A.*&(5)^6)(%)'"@)MERRJ+ERR\N
?D)"('&.$)1,"/2;)=(%%(&)%0'.)'.;)0#3'.
O("/2,1%(&)$0,#"*)
-,"",5)')*($/&(*)
%&'>(2%,&;@
?(5'&*
c#.2%/,.
?(/.-,&2(3(.%
D('&./.6
P;.'3/2$
Y,*("
P'%'
F&'>(2%,&;)d
](.'"%;)c#.2%/,.
],"/2;
B(.(&'%/7()3,*(")-,&)3#"%/1"()$#=,1%/3'")*(3,.$%&'%/,.$@
D('&./.6)'"6,&/%03)%0'%)(T%&'2%$9!.%(.*(*)%&'>(2%,&;
O/60+'22#&'2;)*;.'3/2$)3,*("
ST1(&/3(.%'")&($#"%$9)S.'="(*)%0(3)%,)-";)'#%,.,3,#$)0("/2,1%(&)'(&,='%/2$)5("")=(;,.*)%0()2'1'=/"/%/($),-)'.;),%0(&)'#%,.,3,#$)0("/2,1%(&@
The Arcade Learning Environment (ALE)
ALE (Naddaf 2010, Bellemare et al. 2012) is built upon the open-source Atari 2600 emulator Stella and provides a convenient interface to ATARI 2600 games.
Features for ALE
• Basic Abstraction of Screen Shots (BASS, from Naddaf 2010) first stores a background of the game it is playing. For every frame it then subtracts away the background and divides the screen into 16x14 tiles. For each tile and each colour (the 8-colour SECAM palette) it creates a feature, and it then takes the pairwise interactions of all these base features, resulting in 1,606,528 features (see the sketch after this list).
• Color provides object recognition.
• We study linear function approximation with BASS. We want to see how well one can do with that, if one finds the right parameters.
• Improved results have been achieved with non-linear neural/deep approaches.
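A rough sketch of the BASS pipeline described above, assuming the frame and background are 2-D arrays of SECAM colour indices; this is an illustration of the idea, not the ALE implementation (note that 1792 · 1793 / 2 = 1,606,528, matching the feature count quoted above).

```python
# Rough sketch of BASS-style features: subtract a background, divide the screen
# into 16x14 tiles, set one binary feature per (tile, colour), then take pairwise
# products. Frame/background shapes and colour coding are illustrative assumptions.
import numpy as np

N_TILES_X, N_TILES_Y, N_COLOURS = 16, 14, 8     # 16x14 tiles, 8-colour SECAM palette

def bass_base_features(frame, background):
    """Binary feature per (tile, colour): colour present among non-background pixels."""
    h, w = frame.shape
    th, tw = h // N_TILES_Y, w // N_TILES_X
    base = np.zeros((N_TILES_Y, N_TILES_X, N_COLOURS), dtype=bool)
    foreground = frame != background             # background subtraction
    for ty in range(N_TILES_Y):
        for tx in range(N_TILES_X):
            tile = frame[ty*th:(ty+1)*th, tx*tw:(tx+1)*tw]
            mask = foreground[ty*th:(ty+1)*th, tx*tw:(tx+1)*tw]
            for c in np.unique(tile[mask]):
                base[ty, tx, int(c)] = True
    return base.ravel()                          # 16*14*8 = 1792 base features

def bass_features(frame, background):
    base = bass_base_features(frame, background)
    # pairwise interactions (upper triangle incl. the diagonal, whose self-products
    # reproduce the base features): 1792*1793/2 = 1,606,528 features
    return np.outer(base, base)[np.triu_indices(base.size)]
```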
The gap
Table: The gap between scores (more is better) achieved by (linear) learning and (UCT) planning

Game             UCT      Best Learner
Beam Rider       6,624.6  929.4
Seaquest         5,132.4  288
Space Invaders   2,718    250.1
Pong             21       −19
Out of 55 games, UCT has the best performance on 45 of them.The remaining games require a terribly long horizon.
Learning from an oracle
• Reinforcement learning is made much more difficult than supervised learning by the need to explore.
• Therefore, many authors have in recent years been developing ways of teaching an RL agent through e.g. demonstration or advice, with a reduction to supervised learning.
• I will here discuss this idea in the context of Atari games through the Arcade Learning Environment (ALE) framework.
Learning from UCT
A common scenario when applying reinforcement learning algorithms in real-world situations is to learn in a simulator and apply in the real world.
• UCT in the “real-world” still requires the simulator.
• UCT does not provide a policy representation, merely atrajectory.
• How do you extract a complete explicit policy from UCT?
• We will treat the value estimates from UCT as adviceprovided to the agent and we can then learn to play Pongwith just a few episodes of data.
• Learning the value function is now a regression problem we solve using LibLinear (also exploring kernels, which brings us back to feature selection/sparsification).
• Similar to the Dataset Aggregation algorithm for imitationlearning (Ross and Bagnell 2010)
DAgger for reinforcement learning with advice

Initialise D ← ∅
Initialise π1 (= π*), t = 0
for i = 1 to N do
    while not end of episode do
        for each action a do
            Obtain feature φ(st, a) and the oracle's Q*(st, a)
            Add training sample {(φ(st, a), Q*(st, a))} to Da
        end
        Act according to πi
    end
    for each action a do
        Learn new model Q̂ᵃᵢ := (wᵃᵢ)ᵀ φ from Da using regression
    end
    π(i+1)(φ) = arg maxa Q̂ᵃᵢ(φ)
end
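The following is an illustrative Python sketch of this loop using per-action least-squares regression in place of LibLinear; env, features() and oracle_q() are assumed interfaces, not ALE code, and the initial policy simply starts from zero weights rather than from π*.

```python
# Illustrative sketch of the advice-DAgger loop: record the oracle's (UCT) value
# estimates for every action, act with the current policy, then refit per-action
# linear regressors on all aggregated data. All interfaces are assumptions.
import numpy as np

def dagger_with_advice(env, actions, features, oracle_q, n_iters=10, feat_dim=100):
    D = {a: [] for a in actions}                    # aggregated dataset per action
    w = {a: np.zeros(feat_dim) for a in actions}    # linear models Q_hat^a(phi) = w_a . phi
    policy = lambda phis: max(actions, key=lambda a: w[a] @ phis[a])

    for i in range(n_iters):
        s, done = env.reset(), False
        while not done:                             # one episode
            phis = {a: features(s, a) for a in actions}
            for a in actions:                       # record the oracle's advice for every action
                D[a].append((phis[a], oracle_q(s, a)))
            s, r, done = env.step(policy(phis))     # but act according to the current policy
        for a in actions:                           # refit each action's regressor on all data
            X = np.array([phi for phi, _ in D[a]])
            y = np.array([q for _, q in D[a]])
            w[a], *_ = np.linalg.lstsq(X, y, rcond=None)
    return policy
```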
Preliminary results
[Figure: two panels of reward vs. episodes (0-30) on Pong. Left: average performance, RLAdvice vs. SARSA. Right: best reward so far, RLAdvice.]
Figure: Pong results: RLAdvice with different amounts of aggregated data (1-30 games) vs. SARSA (linear function approximation) after 5000 games played. Results averaged over 8 runs.
By Daswani, Sunehag, Hutter 2014
Partially Observable Markov Decision Processes
• Exacerbated exploration issues due to not knowing the state
• Not knowing the underlying state space makes things even worse
• Explicit or implicit restrictions to subclasses
• Recent work (feature RL) on history-based methods that learn a map from histories to some finite/compact state space (an alternative is PSRs)
• The model-free version becomes feature selection for high-dimensional RL
General Reinforcement Learning
• Classes of completely general reinforcement learning environments, which may seem beyond hopeless
• No finite underlying state space is assumed. Theoretical work exists under some further assumptions
• Bayesian (computable env.), AIXI (Hutter 2005)
• Finite/compact classes; Optimistic agents (Sunehag & Hutter 2012), Max. exploration (Lattimore & Hutter & Sunehag 2013)
Feature Reinforcement Learning
[Slide illustration: "Life, the Universe and Everything" reduced to "42".]
Feature RL aims to automatically reduce a complex real-worldnon-Markovian problem to a useful (computationally tractable)representation (MDP).
Formally we create a map φ from an agent’s history to a staterepresentation. φ is then a function that produces a relevantsummary of the history.
φ(ht) = st
Feature Markov Decision Process (ΦMDP)

To select the best φ, one defines a cost function:

φ_best = arg min_φ Cost(φ).

• Feature RL is a recent framework.
• The original cost from Hutter 2009 is a model-based criterion:

Cost(φ|h) = CL(s1:n|a1:n) + CL(r1:n|s1:n, a1:n) + CL(φ)

A practically useful modification adds a parameter α to control the balance between reward coding and state coding (a toy evaluation sketch follows the bullets below):

Cost_α(φ|hn) := α CL(s1:n|a1:n) + (1 − α) CL(r1:n|s1:n, a1:n) + CL(φ).
• A global stochastic search (e.g. simulated annealing) is usedto find the φ with minimal cost.
• For fixed φ, MDP methods can be used to find a good policy
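Purely as an illustration of how such a cost could be evaluated, the sketch below approximates the code lengths CL(·) by empirical conditional log-losses and treats CL(φ) as a user-supplied complexity penalty; this is a crude stand-in, not the actual coding scheme from Hutter 2009.

```python
# Illustrative sketch of the alpha-weighted PhiMDP cost. Code lengths CL(.) are
# approximated by empirical conditional log-losses (an assumption for illustration);
# cl_phi stands in for CL(phi), the complexity of the feature map itself.
import math
from collections import Counter

def conditional_code_length(targets, contexts):
    """Sum over t of -log2 of the empirical probability of targets[t] given contexts[t]."""
    joint, marginal = Counter(zip(contexts, targets)), Counter(contexts)
    return sum(-math.log2(joint[(c, x)] / marginal[c]) for c, x in zip(contexts, targets))

def cost_alpha(phi, history, actions, rewards, alpha, cl_phi):
    """history, actions, rewards are parallel sequences of length n."""
    states = [phi(history[:t + 1]) for t in range(len(history))]
    # CL(s_1:n | a_1:n): code each next state given the previous (state, action)
    cl_states = conditional_code_length(states[1:], list(zip(states[:-1], actions[:-1])))
    # CL(r_1:n | s_1:n, a_1:n): code each reward given the current (state, action)
    cl_rewards = conditional_code_length(rewards, list(zip(states, actions)))
    return alpha * cl_states + (1 - alpha) * cl_rewards + cl_phi
```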
Model-free cost criterion
Daswani&Sunehag&Hutter 2013 introduced a fitted-Q cost
CostQL(φ) =minQ
12
∑nt=1(rt+1+γmaxa Q(φ(ht+1), a)−Q(φ(ht), at))2+Reg(φ)
• CostQL also extends easily to the linear function approximationsetting by approximating Q(ht , at)← ξ(ht , at)
Tw whereξ : H×A → Rk for some k ∈ R.
• Connects feature rl to feature selection for TD methods, e.g.Lasso-TD or Dantzig-TD using `1 regularization while aboveReg tends to be a more aggressive `0.
• For a fixed policy, a TD cost without maxa can be defined butone can also reduce the problem to feature selection forsupervised learning using pairs (st ,Rt) where Rt is the returnachieved after state st .
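A toy sketch of evaluating the fitted-Q cost for a candidate feature map ξ and a fixed weight vector w (the min over Q in the definition would be an outer optimisation, e.g. fitted-Q iteration); Reg is taken here as an ℓ0 count of non-zero weights, and all names are illustrative.

```python
# Toy evaluation of the fitted-Q cost for a linear Q given by feature map xi and
# weights w. Index conventions (rewards[t] is r_{t+1}) are stated assumptions.
import numpy as np

def cost_ql(w, xi, history, actions, rewards, all_actions, gamma=0.99, reg_weight=1.0):
    Q = lambda h, a: xi(h, a) @ w                  # linear Q via feature map xi
    cost = 0.0
    for t in range(len(history) - 1):
        h_t, h_next = history[:t + 1], history[:t + 2]
        target = rewards[t] + gamma * max(Q(h_next, a) for a in all_actions)
        cost += 0.5 * (target - Q(h_t, actions[t])) ** 2
    return cost + reg_weight * np.count_nonzero(w)   # aggressive l0-style Reg
```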
Input: Environment Env()
Initialise φ
Initialise history with observations and rewards from t = init_history random actions
Initialise M to be the number of timesteps per epoch
while true do
    φ = SimulAnneal(φ, ht)
    s1:t = (φ(h1), φ(h2), ..., φ(ht))
    π = FindPolicy(s1:t, r1:t, a1:t−1)
    for i = 1, 2, 3, ... M do
        at ← π(st)
        ot+1, rt+1 ← Env(ht, at)
        ht+1 ← ht at ot+1 rt+1
        t ← t + 1
    end
end
Algorithm 1: A high-level view of the generic ΦMDP algorithm.
Feature maps
• Tabular: use suffix trees to map histories to states (Nguyen & Sunehag & Hutter 2011, 2012). Looping trees for long-term dependencies (Daswani & Sunehag & Hutter 2012)
• Function approximation: define a new feature class of event selectors. A feature ξi checks position n−m in the history hn for an observation-action pair (o, a).

[Figure: a small suffix-tree automaton with states s0, s1, s2 and transitions labelled 0 and 1.]
If the history is (0, 1), (0, 2), (3, 4), (1, 2), then an event selector checking 3 steps in the past for the observation-action pair (0, 2) will be turned on.
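A minimal sketch of such an event-selector feature, assuming the history is stored as a list of (observation, action) pairs as in the example above.

```python
# Minimal sketch of an event-selector feature: xi fires iff the observation-action
# pair m steps back in the history equals a fixed (o, a).
def event_selector(m, o, a):
    def xi(history):
        return 1.0 if len(history) >= m and history[-m] == (o, a) else 0.0
    return xi

# The feature from the example: looks 3 steps into the past for the pair (0, 2).
xi = event_selector(3, 0, 2)
print(xi([(0, 1), (0, 2), (3, 4), (1, 2)]))   # 1.0 -- the selector is turned on
```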
Bayesian general reinforcement learning: MC-AIXI-CTW
Unlike Feature RL, the Bayesian approach does not pick one map butuses a mixture of all instead. The problem is (again) split into two mainareas:
• Learning - online sequence prediction / model building
• Planning/Control - search / sequential decision theory
The hard parts:
• A large model class is required for the Bayesian mixture predictor to have general prediction capabilities.
• Fortunately, an efficient and general class exists: all Prediction Suffix Trees of maximum finite depth D. The class contains over 2^(2^(D−1)) models!
• The planning problem can be performed approximately with Monte-Carlo Tree Search (UCT)
• MC-AIXI-CTW (Veness et al. 2010) combines the above
Overview of proposed agent architecture
Domain : POCMAN
[Figure: cumulative reward vs. epochs (1,000-3,500) on POCMAN, rolling average over 1000 epochs; curves for FAhQL and MC-AIXI 48.]
Figure: MC-AIXI vs. hQL on POCMAN
Agent            Cores  Memory (GB)  Time (hours)  Iterations
MC-AIXI 96 bits  8      32           60            1 · 10^5
MC-AIXI 48 bits  8      14.5         49.5          3.5 · 10^5
FAhQL            1      0.4          17.5          3.5 · 10^5
Conclusions/Outlook
• Reinforcement Learning is a powerful paradigm within which (basically) all AI problems can be formulated
• Many practical successes using MDPs, by engineering problem reductions/representations
• Practically increasing the versatility of agents by learning reductions automatically
• Recently introduced arcade gaming environment (from Alberta) for RL containing all ATARI games. Aim: have one generic RL agent solve them all!
• Use data on how valuable states are, from either simulations or experience, to reduce complexity