Reinforcement Learning
Peter Sunehag
2013
Reinforcement Learning: Reinforcement Learning Agents, an approach to AGI
Peter Sunehag
Tutorial at the Seventh Conference on AGI
Quebec City, August 2014
Many slides borrowed from e.g. Sutton & Barto
!"C-*#&$+J"*K""$+!,+-$B+30+
!"#$%&'(")"$*+,"-'$#$.+L'-)"K&'MS%&E(L%#%&"04(?AT%&.(BA./5"00()=(3&A$0%B.
!,+/'&JC")6+
N+3C.&'#*2)6
</A?2"./1?'(-#M#A6#'(#A#]1717@7(%#T1&A#B%#/.
3'*#%#(#-C
0$*"CC#."$("
*&"@1/1A#"00E(@%/%&B1#1./1?'(M#A6#((6A&0@(5(30"##1#L(3&A$0%B
O*-*#6*#(-C+
8-(2#$"+,"-'$#$.
VA./0E(1717@7(@"/"(?0"..1F1?"/1A#'(
&%L&%..1A#'(?0-./%&1#L
O&)"+3PPC#(-*#&$6+&%+!,¥ !2%?M%&.(N<"B-%0'(8ObOQ
F1&./(-.%(AF(;K(1#("#(1#/%&%./1#L(&%"0(L"B%
¥ N=#T%&/%@Q(2%01?A3/%&(F01L2/(N9L(%/("07(,++[Q
$%//%&(/2"#("#E(2-B"#
¥ !AB3-/%&(cA(N:!*'(,++DQF1#"00E(.AB%(3&AL&"B.(61/2(X&%".A#"$0%Y(30"E
¥ ;A$A?-3(<A??%&(*%"B.( N</A#%(G(S%0A.A'(;%1@B100%&(%/("07Q
dA&0@e.($%./(30"E%&(AF(.1B-0"/%@(.A??%&'(8OOOf(;-##%&]-3(,+++
¥ =#T%#/A&E(V"#"L%B%#/( NS"#(;AE'(H%&/.%M".'(K%%(G(*.1/.1M01.Q8+]8bg(1B3&AT%B%#/(AT%&(1#@-./&E(./"#@"&@(B%/2A@.
¥ JE#"B1?(!2"##%0()..1L#B%#/( N<1#L2(G(H%&/.%M".'(91%(G(Z"EM1#Q
dA&0@h.($%./("..1L#%&(AF(&"@1A(?2"##%0.(/A(BA$10%(/%0%32A#%(?"00.
¥ >0%T"/A&(!A#/&A0( N!&1/%.(G(H"&/AQNU&A$"$0EQ(6A&0@h.($%./(@A6#]3%"M(%0%T"/A&(?A#/&A00%&
¥ V"#E(;A$A/.#"T1L"/1A#'($1]3%@"0(6"0M1#L'(L&".31#L'(.61/?21#L($%/6%%#(.M100.777
¥ *J]c"BBA#("#@(i%00EF1.2( N*%."-&A'(J"20QdA&0@h.($%./($"?ML"BBA#(30"E%&7(c&"#@B"./%&(0%T%0
Arcade Learning Environment (Naddaf 2010, Bellemare et al. 2012, ...): an on-going challenge
4&)PC"*"+3."$*
¥ *%B3A&"00E(.1/-"/%@
¥ !A#/1#-"0(0%"#L("#@(30"##1#L
¥ _$^%?/(1.(/A(-%%"(* /2%(%#T1&A#B%#/
¥ >#T1&A#B%#/(1.(./A?2"./1?("#@(-#?%&/"1#
[Diagram: the agent sends actions to the environment; the environment returns state and reward to the agent.]
D2-*+#6+!"#$%&'(")"$*+,"-'$#$.F
¥ )#("33&A"?2(/A()&/1F1?1"0(=#/%001L%#?%
¥ K%"#L(F&AB(1#/%&"?/1A#
¥ cA"0]A&1%#/%@(0%"#L
¥ K%"#L("$A-/'(F&AB'("#@(6210%(1#/%&"?/1#L(61/2("#(%R/%&#"0(%#T1&A#B%#/
¥ K%"#L(62"/(/A(@Aj2A6(/A(B"3(.1/-"/1A#.(/A("?/1A#.j.A(".(/A(B"R1B1W%("(#-B%&1?"0(&%6"&@(.1L#"0
42-P*"'+RS+0$*'&B5(*#&$
Psychology
Artificial Intelligence
Control Theory and Operations Research
Artificial Neural Networks
Reinforcement Learning (RL)
Neuroscience
T"E+L"-*5'"6+&%+!,
¥ K%"&#%&(1.(#A/(/A0@(621?2("?/1A#.(/A(/"M%
¥ *&1"0]"#@]>&&A&(.%"&?2
¥ UA..1$101/E(AF(@%0"E%@(&%6"&@
<"?&1F1?%(.2A&/]/%&B(L"1#.(FA&(L&%"/%&(0A#L]/%&B(L"1#.
¥ *2%(#%%@(/A("UPC&'" "#@("UPC&#*
¥ !A#.1@%&.(/2%(62A0%(3&A$0%B(AF("(LA"0]@1&%?/%@("L%#/(1#/%&"?/1#L(61/2("#(-#?%&/"1#(%#T1&A#B%#/
O5P"'A#6"B+,"-'$#$.
Supervised!Learning!
SystemInputs Outputs
Training!Info!!=!!desired!(target)!outputs
Error!!=!!(target!output!!Ð actual!output)
!"#$%&'(")"$*+,"-'$#$.
RL
SystemInputs Outputs!(ÒactionsÓ)
Training!Info!!=!!evaluations!(ÒrewardsÓ!/!ÒpenaltiesÓ)
Objective:!!get!as!much!reward!as!possible
3$+WU*"$B"B+WU-)PC"S+V#(HV-(HV&"
[Figure: a tic-tac-toe game tree alternating between x's moves and o's moves.]

Assume an imperfect opponent: he/she sometimes makes mistakes
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:

State                V(s): estimated probability of winning
[board]              .5   ?
[board]              .5   ?
...
[win for x]          1    win
[loss for x]         0    loss
[drawn board]        0    draw

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states:

Just pick the next state with the highest estimated probability of winning, the largest V(s): a greedy move.
But 10% of the time pick a move at random: an exploratory move.
!,+,"-'$#$.+!5C"+%&'+V#(HV-(HV&"
ÒExploratoryÓ move
s!!! Ð !the!state!before!our!greedy!move
s !! Ð !!the!state!after!our!greedy!move
We!increment!each!V(s) toward!V(s ) Ð a!backup :
V(s) V (s) V(s ) V (s)
a!small!positive!fraction, e.g., .1
the!step - size parameter
c *
*
*g
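For concreteness, here is a minimal Python sketch of this table-based update (not from the slides); the string board encoding and α = 0.1 are illustrative assumptions.

```python
# Minimal sketch of the tic-tac-toe value update V(s) <- V(s) + alpha*[V(s') - V(s)].
# The string board encoding and alpha = 0.1 are illustrative choices.
from collections import defaultdict

alpha = 0.1                              # step-size parameter
V = defaultdict(lambda: 0.5)             # estimated probability of winning, default 0.5
# terminal states would be fixed to 1 (win) or 0 (loss/draw), as in the table above

def backup(state, next_state):
    """Move V(state) a fraction alpha toward V(next_state) after a greedy move."""
    V[state] += alpha * (V[next_state] - V[state])

# Example: a greedy move takes us from board s to board s_next
s, s_next = "x.o......", "x.o.x...."
backup(s, s_next)
```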
The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each choice is called a play.
After each play a_t, you get a reward r_t, where E{r_t | a_t} = Q*(a_t); these are the unknown action values.
The distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
The Exploration/Exploitation Dilemma

Suppose you form action value estimates Q_t(a) ≈ Q*(a).
The greedy action at time t is a_t* = argmax_a Q_t(a).
Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.
You can't exploit all the time; you can't explore all the time.
You can never stop exploring; but you should always reduce exploring. Maybe.
Optimistic Initial Values

All methods so far depend on Q_0(a), i.e., they are biased.
Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.
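A minimal sketch, assuming a synthetic 10-armed testbed with Gaussian rewards, of ε-greedy action selection combined with optimistic initial values; all names and constants are illustrative.

```python
# Hypothetical sketch of an epsilon-greedy n-armed bandit with optimistic initial values.
# The bandit itself (true_values, Gaussian rewards) is invented for illustration.
import random

n, epsilon, Q0 = 10, 0.1, 5.0
true_values = [random.gauss(0, 1) for _ in range(n)]   # unknown Q*(a)
Q = [Q0] * n          # optimistic initial estimates encourage early exploration
N = [0] * n           # play counts

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(n)                    # explore
    else:
        a = max(range(n), key=lambda i: Q[i])      # exploit: greedy action
    r = random.gauss(true_values[a], 1)            # reward depends only on a
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
```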
The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
  Agent observes state at step t:   s_t ∈ S
  produces action at step t:        a_t ∈ A(s_t)
  gets resulting reward:            r_{t+1}
  and resulting next state:         s_{t+1}

[Diagram: ... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...]
Bellman Equation for a Policy π

The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:  V^π(s) = E_π { R_t | s_t = s }
            = E_π { r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
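As an illustration (not part of the slides), iterative policy evaluation applies the last equation as an update until the values stop changing; the nested-dict representation of π, P and R below is an assumed interface.

```python
# Sketch of iterative policy evaluation using the Bellman equation above.
# pi[s][a], P[s][a][s2] and R[s][a][s2] are assumed to be given as nested dicts.
def policy_evaluation(states, actions, pi, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:                 # stop once the sweep changes nothing
            return V
```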
Golf

State is ball location
Reward of −1 for each stroke until the ball is in the hole
Value of a state?
Actions:
  putt (use putter)
  driver (use driver)
putt succeeds anywhere on the green
Optimal Value Function for Golf
We can hit the ball farther with driver than with putter,
but with less accuracy
Q*(s,driver) gives the value of using driver first, then
using whichever actions are best
Simple Monte Carlo

[Backup diagram: a full episode is sampled from state s_t down to a terminal state.]

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.
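A minimal sketch of the constant-α Monte Carlo update above; the episode is assumed to be stored as parallel lists of visited states and the rewards received after leaving them.

```python
# Sketch of the constant-alpha Monte Carlo update: after an episode ends, each
# visited state is moved toward the actual return that followed it.
# episode_rewards[i] is assumed to be the reward received after leaving episode_states[i].
def mc_update(V, episode_states, episode_rewards, alpha=0.1, gamma=1.0):
    G = 0.0
    # iterate backwards so G accumulates the return following each state
    for s, r in zip(reversed(episode_states), reversed(episode_rewards)):
        G = r + gamma * G
        old = V.get(s, 0.0)
        V[s] = old + alpha * (G - old)      # V(s_t) <- V(s_t) + alpha*(R_t - V(s_t))
    return V
```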
Simplest TD Method

[Backup diagram: a single sampled step from s_t to s_{t+1} with reward r_{t+1}.]

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
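And the corresponding TD(0) sketch, which applies the update after every single step instead of waiting for the actual return; the dictionary-based value table is an illustrative choice.

```python
# Sketch of the tabular TD(0) update performed after every step, in contrast to the
# Monte Carlo update above, which waits for the actual return at the end of the episode.
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=1.0):
    """V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))."""
    old = V.get(s, 0.0)
    target = r_next + gamma * V.get(s_next, 0.0)   # bootstrapped target
    V[s] = old + alpha * (target - old)
    return V
```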
Dynamic Programming

[Backup diagram: a full one-step expectation over all actions and successor states of s_t.]

V(s_t) ← E_π { r_{t+1} + γ V(s_{t+1}) | s_t }
TD methods bootstrap and sample

Bootstrapping: update involves an estimate
  MC does not bootstrap
  DP bootstraps
  TD bootstraps
Sampling: update does not involve an expected value
  MC samples
  DP does not sample
  TD samples
Advantages of TD Learning

TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
  You can learn before knowing the final outcome
    - Less memory
    - Less peak computation
  You can learn without the final outcome
    - From incomplete sequences
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
TD Gammon
Tesauro 1992, 1994, 1995, ...

White has just rolled a 5 and a 2, so can move one of his pieces 5 and one (possibly the same) 2 steps
Objective is to advance all pieces to points 19-24
Hitting
Doubling
30 pieces, 24 locations implies an enormous number of configurations
Effective branching factor of 400
TD-Gammon combined TD(λ) with a neural network as value function approximator
A&%0#&)8'3#(")GHKHI)GHUV
SamuelÕs Checkers Player
82,&()=,'&*)2,.-/6#&'%/,.$)=;)')W$2,&/.6)1,";.,3/'"X)
M'-%(&)80'..,.I)GHKRN
Y/./3'T %,)*(%(&3/.()W='24(*+#1)$2,&(X),-)')1,$/%/,.
A"10'+=(%' 2#%,--$
?,%()"('&./.69)$'7()('20)=,'&*)2,.-/6)(.2,#.%(&(*)%,6(%0(&)
5/%0)='24(*+#1)$2,&(
.((*(*)')W$(.$(),-)*/&(2%/,.X9)"/4()*/$2,#.%/.6
D('&./.6)=;)6(.(&'"/Z'%/,.9)$/3/"'&)%,)FP)'"6,&/%03)5/%0)
"/.('&)7'"#()-#.2%/,.)'11&,T/3'%/,.@
Samuel's Backups
The Basic Idea

". . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."
  A. L. Samuel, Some Studies in Machine Learning Using the Game of Checkers, 1959
Elevator Dispatching
[&/%($)'.*)C'&%,))GHHU
State Space

• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40
• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states
Actions

• When moving (halfway between floors):
  - stop at next floor
  - continue past next floor
• When stopped at a floor:
  - go up
  - go down
• Asynchronous
Performance Criteria

Minimize:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Computer Go
8;"7'/.)B(""; MERRJN
GQ
WF'$4)]'&)ST2(""(.2()-,&)A!X)MO'.$)C(&"/.(&N
W^(5)P&,$,10/"'),-)A!X)M_,0.)Y2['&%0;N
WB&'.*)[0'""(.6()F'$4X)MP'7/*)Y(20.(&N
Monte Carlo Tree Search + Playout Policy
DP & TD do not apply: state space too large
MC & ES do not apply: action space too large

New Approach
In the current state, sample and learn to construct a locally specialized policy
Exploration/exploitation dilemma dealt with by Upper Confidence Trees (UCT)
Evaluate leaves using a Playout Policy
Inverted Helicopter Flight
0%%19aa0("/@$%'.-,&*@(*#a
GK
A.*&(5)^6)(%)'"@)MERRJ+ERR\N
?D)"('&.$)1,"/2;)=(%%(&)%0'.)'.;)0#3'.
O("/2,1%(&)$0,#"*)
-,"",5)')*($/&(*)
%&'>(2%,&;@
?(5'&*
c#.2%/,.
?(/.-,&2(3(.%
D('&./.6
P;.'3/2$
Y,*("
P'%'
F&'>(2%,&;)d
](.'"%;)c#.2%/,.
],"/2;
B(.(&'%/7()3,*(")-,&)3#"%/1"()$#=,1%/3'")*(3,.$%&'%/,.$@
D('&./.6)'"6,&/%03)%0'%)(T%&'2%$9!.%(.*(*)%&'>(2%,&;
O/60+'22#&'2;)*;.'3/2$)3,*("
ST1(&/3(.%'")&($#"%$9)S.'="(*)%0(3)%,)-";)'#%,.,3,#$)0("/2,1%(&)'(&,='%/2$)5("")=(;,.*)%0()2'1'=/"/%/($),-)'.;),%0(&)'#%,.,3,#$)0("/2,1%(&@
The Arcade Learning Environment (ALE)
ALE (Naddaf 2010, Bellemare et al. 2012) is built upon the open-source Atari 2600 emulator Stella and provides a convenient interface to ATARI 2600 games.
Features for ALE
• Basic Abstraction of Screen Shots (BASS, from Naddaf 2010) first stores a background of the game it is playing. For every frame it then subtracts away the background and divides the screen into 16x14 tiles. For each tile and each colour (the 8-colour SECAM palette) it creates a feature, and it then takes the pairwise interactions of all these base features, resulting in 1,606,528 features (see the sketch after this list).
• Color provides object recognition.
• We study linear function approximation with BASS. We want to see how well one can do with that, if one finds the right parameters.
• Improved results have been achieved with non-linear neural/deep approaches.
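A rough sketch of the BASS pipeline described above, assuming the frame and background are 2-D arrays of SECAM colour indices; this is an illustration of the idea, not the ALE implementation (note that 1792 · 1793 / 2 = 1,606,528, matching the feature count quoted above).

```python
# Rough sketch of BASS-style features: subtract a background, divide the screen
# into 16x14 tiles, set one binary feature per (tile, colour), then take pairwise
# products. Frame/background shapes and colour coding are illustrative assumptions.
import numpy as np

N_TILES_X, N_TILES_Y, N_COLOURS = 16, 14, 8     # 16x14 tiles, 8-colour SECAM palette

def bass_base_features(frame, background):
    """Binary feature per (tile, colour): colour present among non-background pixels."""
    h, w = frame.shape
    th, tw = h // N_TILES_Y, w // N_TILES_X
    base = np.zeros((N_TILES_Y, N_TILES_X, N_COLOURS), dtype=bool)
    foreground = frame != background             # background subtraction
    for ty in range(N_TILES_Y):
        for tx in range(N_TILES_X):
            tile = frame[ty*th:(ty+1)*th, tx*tw:(tx+1)*tw]
            mask = foreground[ty*th:(ty+1)*th, tx*tw:(tx+1)*tw]
            for c in np.unique(tile[mask]):
                base[ty, tx, int(c)] = True
    return base.ravel()                          # 16*14*8 = 1792 base features

def bass_features(frame, background):
    base = bass_base_features(frame, background)
    # pairwise interactions (upper triangle incl. the diagonal, whose self-products
    # reproduce the base features): 1792*1793/2 = 1,606,528 features
    return np.outer(base, base)[np.triu_indices(base.size)]
```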
The gap
Table: The gap between scores (more is better) achieved by (linear) learning and (UCT) planning

Game             UCT      Best Learner
Beam Rider       6,624.6  929.4
Seaquest         5,132.4  288
Space Invaders   2,718    250.1
Pong             21       −19
Out of 55 games, UCT has the best performance on 45 of them.The remaining games require a terribly long horizon.
Learning from an oracle
• Reinforcement learning is made much more difficult than supervised learning by the need to explore.
• Therefore, many authors have in recent years been developing ways of teaching an RL agent through e.g. demonstration or advice, with a reduction to supervised learning.
• I will here discuss this idea in the context of Atari games through the Arcade Learning Environment (ALE) framework.
Learning from UCT
A common scenario when applying reinforcement learning algorithms in real-world situations is to learn in a simulator and apply in the real world.
• UCT in the “real-world” still requires the simulator.
• UCT does not provide a policy representation, merely atrajectory.
• How do you extract a complete explicit policy from UCT?
• We will treat the value estimates from UCT as adviceprovided to the agent and we can then learn to play Pongwith just a few episodes of data.
• Learning the value function is now a regression problem we solve using LibLinear (also exploring kernels, which brings us back to feature selection/sparsification).
• Similar to the Dataset Aggregation algorithm for imitationlearning (Ross and Bagnell 2010)
DAgger for reinforcement learning with advice

Initialise D ← ∅
Initialise π1 (= π*), t = 0
for i = 1 to N do
    while not end of episode do
        for each action a do
            Obtain feature φ(st, a) and the oracle's Q*(st, a)
            Add training sample {(φ(st, a), Q*(st, a))} to Da
        end
        Act according to πi
    end
    for each action a do
        Learn new model Q̂ᵃᵢ := (wᵃᵢ)ᵀ φ from Da using regression
    end
    π(i+1)(φ) = arg maxa Q̂ᵃᵢ(φ)
end
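The following is an illustrative Python sketch of this loop using per-action least-squares regression in place of LibLinear; env, features() and oracle_q() are assumed interfaces, not ALE code, and the initial policy simply starts from zero weights rather than from π*.

```python
# Illustrative sketch of the advice-DAgger loop: record the oracle's (UCT) value
# estimates for every action, act with the current policy, then refit per-action
# linear regressors on all aggregated data. All interfaces are assumptions.
import numpy as np

def dagger_with_advice(env, actions, features, oracle_q, n_iters=10, feat_dim=100):
    D = {a: [] for a in actions}                    # aggregated dataset per action
    w = {a: np.zeros(feat_dim) for a in actions}    # linear models Q_hat^a(phi) = w_a . phi
    policy = lambda phis: max(actions, key=lambda a: w[a] @ phis[a])

    for i in range(n_iters):
        s, done = env.reset(), False
        while not done:                             # one episode
            phis = {a: features(s, a) for a in actions}
            for a in actions:                       # record the oracle's advice for every action
                D[a].append((phis[a], oracle_q(s, a)))
            s, r, done = env.step(policy(phis))     # but act according to the current policy
        for a in actions:                           # refit each action's regressor on all data
            X = np.array([phi for phi, _ in D[a]])
            y = np.array([q for _, q in D[a]])
            w[a], *_ = np.linalg.lstsq(X, y, rcond=None)
    return policy
```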
Preliminary results
[Figure: two panels of reward vs. episodes (0-30) on Pong. Left: average performance, RLAdvice vs. SARSA. Right: best reward so far, RLAdvice.]
Figure: Pong results: RLAdvice with different amounts of aggregated data (1-30 games) vs. SARSA (linear function approximation) after 5000 games played. Results averaged over 8 runs.
By Daswani, Sunehag, Hutter 2014
Partially Observable Markov Decision Processes
• Exacerbated exploration issues due to not knowing the state
• Not knowing the underlying state space makes things even worse
• Explicit or implicit restrictions to subclasses
• Recent work (feature RL) on history-based methods that learn a map from histories to some finite/compact state space (an alternative is PSRs)
• The model-free version becomes feature selection for high-dimensional RL
General Reinforcement Learning
• Classes of completely general reinforcement learning environments, which may seem beyond hopeless
• No finite underlying state space is assumed. Theoretical work exists under some further assumptions
• Bayesian (computable env.), AIXI (Hutter 2005)
• Finite/compact classes; Optimistic agents (Sunehag & Hutter 2012), Max. exploration (Lattimore & Hutter & Sunehag 2013)
Feature Reinforcement Learning
[Slide illustration: "Life, the Universe and Everything" reduced to "42".]
Feature RL aims to automatically reduce a complex real-worldnon-Markovian problem to a useful (computationally tractable)representation (MDP).
Formally we create a map φ from an agent’s history to a staterepresentation. φ is then a function that produces a relevantsummary of the history.
φ(ht) = st
Feature Markov Decision Process (ΦMDP)

To select the best φ, one defines a cost function:

φ_best = arg min_φ Cost(φ).

• Feature RL is a recent framework.
• The original cost from Hutter 2009 is a model-based criterion:

Cost(φ|h) = CL(s1:n|a1:n) + CL(r1:n|s1:n, a1:n) + CL(φ)

A practically useful modification adds a parameter α to control the balance between reward coding and state coding (a toy evaluation sketch follows the bullets below):

Cost_α(φ|hn) := α CL(s1:n|a1:n) + (1 − α) CL(r1:n|s1:n, a1:n) + CL(φ).
• A global stochastic search (e.g. simulated annealing) is usedto find the φ with minimal cost.
• For fixed φ, MDP methods can be used to find a good policy
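Purely as an illustration of how such a cost could be evaluated, the sketch below approximates the code lengths CL(·) by empirical conditional log-losses and treats CL(φ) as a user-supplied complexity penalty; this is a crude stand-in, not the actual coding scheme from Hutter 2009.

```python
# Illustrative sketch of the alpha-weighted PhiMDP cost. Code lengths CL(.) are
# approximated by empirical conditional log-losses (an assumption for illustration);
# cl_phi stands in for CL(phi), the complexity of the feature map itself.
import math
from collections import Counter

def conditional_code_length(targets, contexts):
    """Sum over t of -log2 of the empirical probability of targets[t] given contexts[t]."""
    joint, marginal = Counter(zip(contexts, targets)), Counter(contexts)
    return sum(-math.log2(joint[(c, x)] / marginal[c]) for c, x in zip(contexts, targets))

def cost_alpha(phi, history, actions, rewards, alpha, cl_phi):
    """history, actions, rewards are parallel sequences of length n."""
    states = [phi(history[:t + 1]) for t in range(len(history))]
    # CL(s_1:n | a_1:n): code each next state given the previous (state, action)
    cl_states = conditional_code_length(states[1:], list(zip(states[:-1], actions[:-1])))
    # CL(r_1:n | s_1:n, a_1:n): code each reward given the current (state, action)
    cl_rewards = conditional_code_length(rewards, list(zip(states, actions)))
    return alpha * cl_states + (1 - alpha) * cl_rewards + cl_phi
```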
Model-free cost criterion
Daswani&Sunehag&Hutter 2013 introduced a fitted-Q cost
CostQL(φ) =minQ
12
∑nt=1(rt+1+γmaxa Q(φ(ht+1), a)−Q(φ(ht), at))2+Reg(φ)
• CostQL also extends easily to the linear function approximationsetting by approximating Q(ht , at)← ξ(ht , at)
Tw whereξ : H×A → Rk for some k ∈ R.
• Connects feature rl to feature selection for TD methods, e.g.Lasso-TD or Dantzig-TD using `1 regularization while aboveReg tends to be a more aggressive `0.
• For a fixed policy, a TD cost without maxa can be defined butone can also reduce the problem to feature selection forsupervised learning using pairs (st ,Rt) where Rt is the returnachieved after state st .
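A toy sketch of evaluating the fitted-Q cost for a candidate feature map ξ and a fixed weight vector w (the min over Q in the definition would be an outer optimisation, e.g. fitted-Q iteration); Reg is taken here as an ℓ0 count of non-zero weights, and all names are illustrative.

```python
# Toy evaluation of the fitted-Q cost for a linear Q given by feature map xi and
# weights w. Index conventions (rewards[t] is r_{t+1}) are stated assumptions.
import numpy as np

def cost_ql(w, xi, history, actions, rewards, all_actions, gamma=0.99, reg_weight=1.0):
    Q = lambda h, a: xi(h, a) @ w                  # linear Q via feature map xi
    cost = 0.0
    for t in range(len(history) - 1):
        h_t, h_next = history[:t + 1], history[:t + 2]
        target = rewards[t] + gamma * max(Q(h_next, a) for a in all_actions)
        cost += 0.5 * (target - Q(h_t, actions[t])) ** 2
    return cost + reg_weight * np.count_nonzero(w)   # aggressive l0-style Reg
```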
Input: Environment Env()
Initialise φ
Initialise history with observations and rewards from t = init_history random actions
Initialise M to be the number of timesteps per epoch
while true do
    φ = SimulAnneal(φ, ht)
    s1:t = (φ(h1), φ(h2), ..., φ(ht))
    π = FindPolicy(s1:t, r1:t, a1:t−1)
    for i = 1, 2, 3, ... M do
        at ← π(st)
        ot+1, rt+1 ← Env(ht, at)
        ht+1 ← ht at ot+1 rt+1
        t ← t + 1
    end
end
Algorithm 1: A high-level view of the generic ΦMDP algorithm.
Feature maps
• Tabular: use suffix trees to map histories to states (Nguyen & Sunehag & Hutter 2011, 2012). Looping trees for long-term dependencies (Daswani & Sunehag & Hutter 2012)
• Function approximation: define a new feature class of event selectors. A feature ξi checks position n−m in the history hn for an observation-action pair (o, a).

[Figure: a small suffix-tree automaton with states s0, s1, s2 and transitions labelled 0 and 1.]
If the history is (0, 1), (0, 2), (3, 4), (1, 2), then an event selector checking 3 steps in the past for the observation-action pair (0, 2) will be turned on.
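A minimal sketch of such an event-selector feature, assuming the history is stored as a list of (observation, action) pairs as in the example above.

```python
# Minimal sketch of an event-selector feature: xi fires iff the observation-action
# pair m steps back in the history equals a fixed (o, a).
def event_selector(m, o, a):
    def xi(history):
        return 1.0 if len(history) >= m and history[-m] == (o, a) else 0.0
    return xi

# The feature from the example: looks 3 steps into the past for the pair (0, 2).
xi = event_selector(3, 0, 2)
print(xi([(0, 1), (0, 2), (3, 4), (1, 2)]))   # 1.0 -- the selector is turned on
```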
Bayesian general reinforcement learning: MC-AIXI-CTW
Unlike Feature RL, the Bayesian approach does not pick one map butuses a mixture of all instead. The problem is (again) split into two mainareas:
• Learning - online sequence prediction / model building
• Planning/Control - search / sequential decision theory
The hard parts:
• A large model class is required for the Bayesian mixture predictor to have general prediction capabilities.
• Fortunately, an efficient and general class exists: all Prediction Suffix Trees of maximum finite depth D. The class contains over 2^(2^(D−1)) models!
• The planning problem can be performed approximately with Monte-Carlo Tree Search (UCT)
• MC-AIXI-CTW (Veness et al. 2010) combines the above
Overview of proposed agent architecture
Domain : POCMAN
[Figure: cumulative reward vs. epochs (1,000-3,500) on POCMAN, rolling average over 1000 epochs; curves for FAhQL and MC-AIXI 48.]
Figure: MC-AIXI vs. hQL on POCMAN
Agent            Cores  Memory (GB)  Time (hours)  Iterations
MC-AIXI 96 bits  8      32           60            1 · 10^5
MC-AIXI 48 bits  8      14.5         49.5          3.5 · 10^5
FAhQL            1      0.4          17.5          3.5 · 10^5
Conclusions/Outlook
• Reinforcement Learning is a powerful paradigm within which (basically) all AI problems can be formulated
• Many practical successes using MDPs, by engineering problem reductions/representations
• Practically increasing the versatility of agents by learning reductions automatically
• Recently introduced arcade gaming environment (from Alberta) for RL containing all ATARI games. Aim: have one generic RL agent solve them all!
• Use data on how valuable states are, from either simulations or experience, to reduce complexity