Pennsylvania State University
College of Information Sciences and Technology, Artificial Intelligence Research Laboratory

Agents that Learn: Reinforcement Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology
Bioinformatics and Genomics Graduate Program
The Huck Institutes of the Life Sciences
Pennsylvania State University
[email protected]
http://faculty.ist.psu.edu/vhonavar

Principles of Artificial Intelligence, IST 597F, Fall 2017, (C) Vasant Honavar
Agent in an Environment

[Figure: agent–environment loop – the agent acts on the environment; the environment returns a new state and a reward or punishment.]
Learning from Interaction with the World

• An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.
Markov Decision Processes

• Assume:
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step, the agent observes state st ∈ S and chooses action at ∈ A, receives immediate reward rt, and the environment state changes to st+1.
• Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
  – i.e., rt and st+1 depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not necessarily be known to the agent – this is the reinforcement learning setting
Acting Rationally in the Presence of Delayed Rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes state at step t: st ∈ S, produces action at step t: at ∈ A(st), and gets the resulting reward rt+1 ∈ ℜ and the resulting next state st+1.

[Figure: interaction timeline … st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 → …]
Key Elements of an RL System

• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model – what follows what

[Figure: policy, value, and reward sitting atop a model of the environment.]
The Agent Uses a Policy to Select Actions

Policy at step t, πt: a mapping from states to action probabilities; πt(s, a) = probability that at = a when st = s.

• A rational agent's goal is to get as much reward as it can over the long run.
Goals and Rewards

• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control.
• The agent must be able to measure success:
  – explicitly
  – frequently during its lifespan
Rewards for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Cumulative discounted reward:

Rt = rt+1 + γ rt+2 + γ² rt+3 + … = Σ_{k=0}^∞ γ^k rt+k+1,

where 0 ≤ γ ≤ 1 is the discount rate:

γ → 0: shortsighted; γ → 1: farsighted.
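For concreteness, a minimal sketch (not from the slides) that computes a truncated discounted return for a sample reward sequence:

# A minimal sketch (not from the slides): the discounted return
# R_t = sum_{k>=0} gamma^k * r_{t+k+1}, truncated to a finite sample.

def discounted_return(rewards, gamma):
    """rewards = [r_{t+1}, r_{t+2}, ...]; returns the discounted sum."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.9))
# 1 + 0.9 + 0.81 + 0.729 = 3.439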
Rewards

Suppose the sequence of rewards after step t is rt+1, rt+2, rt+3, … What do we want to maximize?

In general, we want to maximize the expected return, E{Rt}, for each step t.

Episodic tasks – interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze:

Rt = rt+1 + rt+2 + … + rT,

where T is a final time step at which a terminal state is reached, ending an episode.
Markov Decision Processes

• If a reinforcement learning task has the Markov property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, the MDP is a finite MDP.
• To define a finite MDP, you need to specify (a concrete sketch follows):
  – state and action sets;
  – one-step dynamics defined by transition probabilities:
    P^a_{ss′} = Pr{ st+1 = s′ | st = s, at = a }  ∀ s, s′ ∈ S, a ∈ A(s);
  – expected rewards:
    R^a_{ss′} = E{ rt+1 | st = s, at = a, st+1 = s′ }  ∀ s, s′ ∈ S, a ∈ A(s).
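To make the specification concrete, here is a minimal sketch of a finite MDP as plain tables; the two-state example itself is hypothetical, not from the slides:

# A minimal sketch of a finite MDP as plain data structures.
# The two-state example is hypothetical, not from the slides.

states = ["s0", "s1"]
actions = ["stay", "go"]

# P[(s, a)] maps next states s' to Pr{s_{t+1} = s' | s_t = s, a_t = a}
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] maps next states s' to E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'}
R = {
    ("s0", "stay"): {"s0": 0.0},
    ("s0", "go"):   {"s0": 0.0, "s1": 1.0},
    ("s1", "stay"): {"s1": 2.0},
    ("s1", "go"):   {"s0": 0.0, "s1": 0.0},
}

# Sanity check: transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, sa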
Example – Pole-Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
reward = +1 for each step before failure ⇒ return = number of steps before failure.

As a continuing task with discounted return:
reward = −1 upon failure, 0 otherwise ⇒ return = −γ^k, for k steps before failure.

In either case, return is maximized by avoiding failure for as long as possible.
Example – Driving Task

Get to the top of the hill as quickly as possible.

reward = −1 for each step when not at the top of the hill
⇒ return = −(number of steps before reaching the top of the hill).

Return is maximized by minimizing the number of steps taken to reach the top of the hill.
Recycling Robot: a Finite MDP Example

• At each step, the robot has to decide whether it should (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected.
Some Notable RL Applications

• TD-Gammon – the world's best backgammon program
• Elevator scheduling
• Inventory management – 10%–15% improvement over state-of-the-art methods
• Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls
The Markov Property

• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov property:

Pr{ st+1 = s′, rt+1 = r | st, at, rt, st−1, at−1, …, r1, s0, a0 } = Pr{ st+1 = s′, rt+1 = r | st, at }

for all s′, r, and histories st, at, rt, st−1, at−1, …, r1, s0, a0.
Reinforcement Learning

• The learner is not told which actions to take.
• Rewards and punishments may be delayed – sacrifice short-term gains for greater long-term gains.
• There is a need to trade off between exploration and exploitation.
• The environment may be unobservable or only partially observable.
• The environment may be deterministic or stochastic.
Value Functions

• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

Vπ(s) = Eπ{ Rt | st = s } = Eπ{ Σ_{k=0}^∞ γ^k rt+k+1 | st = s }

• The value of taking an action in a state under policy π is the expected cumulative reward starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

Qπ(s, a) = Eπ{ Rt | st = s, at = a } = Eπ{ Σ_{k=0}^∞ γ^k rt+k+1 | st = s, at = a }
Bellman Equation for a Policy π

The basic idea:

Rt = rt+1 + γ rt+2 + γ² rt+3 + γ³ rt+4 + …
   = rt+1 + γ ( rt+2 + γ rt+3 + γ² rt+4 + … )
   = rt+1 + γ Rt+1

So:

Vπ(s) = Eπ{ Rt | st = s } = Eπ{ rt+1 + γ Vπ(st+1) | st = s }

Or, without the expectation operator:

Vπ(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Vπ(s′) ]
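A minimal sketch of iterative policy evaluation, which applies the Bellman equation above as an update until the values stop changing; it expects states/actions/P/R tables shaped like the hypothetical MDP sketch given earlier:

# A minimal sketch of iterative policy evaluation: sweep the states,
# applying V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [R^a_ss' + gamma V(s')].

def policy_evaluation(states, actions, P, R, policy, gamma=0.9, tol=1e-8):
    """policy[(s, a)] = pi(s, a). Returns a dict s -> V^pi(s)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                pi_sa = policy.get((s, a), 0.0)
                if pi_sa == 0.0:
                    continue
                for s2, p in P[(s, a)].items():
                    v_new += pi_sa * p * (R[(s, a)][s2] + gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new            # in-place (Gauss-Seidel style) sweep
        if delta < tol:
            return V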
Optimal Value Functions

• For finite MDPs, policies can be partially ordered:

π ≥ π′ if and only if Vπ(s) ≥ Vπ′(s) for all s ∈ S

• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
• Optimal policies share the same optimal state-value function:

V*(s) = max_π Vπ(s) for all s ∈ S

• Optimal policies also share the same optimal action-value function:

Q*(s, a) = max_π Qπ(s, a) for all s ∈ S and a ∈ A(s)

This is the expected cumulative reward for taking action a in state s and thereafter following an optimal policy.
Why Optimal State-Value Functions Are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
Example: Tic-Tac-Toe

[Figure: a game tree of tic-tac-toe positions, with levels alternating between X's moves and O's moves.]

Assume an imperfect opponent: it sometimes makes mistakes.
A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state: V(s) – the estimated probability of winning from state s. Terminal states get their true values – 1 for a win, 0 for a loss, 0.5 for a draw – and every other state starts at 0.5.

Play lots of games. To pick our moves, look ahead one step from the current state to the possible next states. Pick the next state with the highest estimated probability of winning – the largest V(s) – a greedy move; occasionally pick a move at random instead – an exploratory move.
Learning Rule for Tic-Tac-Toe

[Figure: one game as a sequence of positions from the starting position, alternating our moves and the opponent's moves, with an occasional "exploratory" move and backup arrows from each greedy successor state to its predecessor.]

Let s be the state before our greedy move and s′ the state after our greedy move. We increment each V(s) toward V(s′) – a backup:

V(s) ← V(s) + α [ V(s′) − V(s) ]
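A minimal sketch of this backup as code; the keying of states is a hypothetical choice, not from the slides:

# A minimal sketch of the tabular backup V(s) <- V(s) + alpha [V(s') - V(s)].

from collections import defaultdict

V = defaultdict(lambda: 0.5)   # estimated probability of winning; 0.5 default
ALPHA = 0.1                    # step size

def backup(s, s_next):
    """Move V(s) a fraction ALPHA of the way toward V(s_next)."""
    V[s] += ALPHA * (V[s_next] - V[s])

# Terminal states are pinned to their true values, per the slide:
# V[win] = 1.0, V[loss] = 0.0, V[draw] = 0.5. After each greedy move
# from s to s', call backup(s, s').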
Why Is Tic-Tac-Toe Too Easy?

• The number of states is small and finite.
• One-step look-ahead is always possible.
• The state is completely observable.
What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
Simpler Setting: The n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play.
• After each play at, you get a reward rt, where E{ rt | at } = Q*(at).
• The distribution of rt depends only on at; there is no dependence on state.
• The objective is to maximize the reward in the long term, e.g., over 1000 plays.
The Exploration–Exploitation Dilemma

• Suppose you form action-value estimates Qt(a) ≈ Q*(a).
• The greedy action at step t is at* = argmax_a Qt(a):
  – at = at* ⇒ exploitation
  – at ≠ at* ⇒ exploration
• You can't exploit all the time; you can't explore all the time.
• You can never stop exploring; but you could reduce exploring.
Action-Value Methods

• Adapt action-value estimates and nothing else.
• Suppose that by the t-th play, action a had been chosen ka times, producing rewards r1, r2, …, rka. Then

Qt(a) = ( r1 + r2 + … + rka ) / ka

and lim_{ka→∞} Qt(a) = Q*(a).
ε-Greedy Action Selection

• Greedy: at = at* = argmax_a Qt(a)
• ε-Greedy: at = at* with probability 1 − ε; a random action with probability ε
• Boltzmann:

Pr(choosing action a at time t) = e^{Qt(a)/τ} / Σ_{b=1}^n e^{Qt(b)/τ}

where τ is the computational temperature.
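A minimal sketch of these three selection rules over a dictionary of value estimates; the example Q values are hypothetical:

# A minimal sketch of greedy, epsilon-greedy, and Boltzmann selection.

import math
import random

def greedy(Q):
    """argmax_a Q(a) over a dict a -> estimate (ties broken arbitrarily)."""
    return max(Q, key=Q.get)

def epsilon_greedy(Q, eps):
    """Greedy with probability 1 - eps, uniform random with probability eps."""
    if random.random() < eps:
        return random.choice(list(Q))
    return greedy(Q)

def boltzmann(Q, tau):
    """Sample a with probability proportional to exp(Q(a) / tau)."""
    actions = list(Q)
    m = max(Q.values())                       # subtract max for stability
    weights = [math.exp((Q[a] - m) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

Q = {"a1": 0.2, "a2": 1.0, "a3": 0.5}         # hypothetical estimates
print(epsilon_greedy(Q, eps=0.1), boltzmann(Q, tau=0.5))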
Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman optimality equation requires:
  – accurate knowledge of environment dynamics;
  – enough space and time to do the computation;
  – the Markov property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge.
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman optimality equation.
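One classic dynamic-programming method of the kind alluded to above is value iteration; a minimal sketch, assuming known dynamics in the table shapes used in the earlier MDP sketch:

# A minimal sketch of value iteration: repeatedly apply the Bellman
# optimality backup V(s) <- max_a sum_s' P^a_ss' [R^a_ss' + gamma V(s')].

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R[(s, a)][s2] + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
                 for a in actions]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V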
The Agent Learns a Policy

Policy at step t, πt: a mapping from states to action probabilities; πt(s, a) = probability that at = a when st = s.

• Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.
Reinforcement Learning

[Figure: agent–environment loop – the agent selects an action; the environment returns the next state and a reward.]
Agent's Learning Task

• Execute actions in the environment, observe the results, and
• Learn an action policy π : S → A that maximizes
  E[ rt + γ rt+1 + γ² rt+2 + … ]
  from any starting state in S.
• Here 0 ≤ γ < 1 is the discount factor for future rewards.

• The target function to be learned is π : S → A.
• But we have no training examples of the form ⟨s, a⟩.
• Training examples are of the form ⟨⟨s, a⟩, r⟩.
Reinforcement Learning Problem

• Goal: learn to choose actions that maximize r0 + γ r1 + γ² r2 + …, where 0 ≤ γ < 1.
Value Function

• To begin with, consider deterministic worlds...
• For each possible policy π the agent might adopt, we can define an evaluation function over states:

Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + … ≡ Σ_{i=0}^∞ γ^i rt+i

where rt, rt+1, … are generated by following policy π starting at state s.

π* ≡ argmax_π Vπ(s), (∀s)

• Restated, the task is to learn the optimal policy π*.
What to Learn

• We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
• It could then do a look-ahead search to choose the best action from any state s, because

π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• This works if the agent knows δ : S × A → S and r : S × A → ℜ.
• But when it doesn't, it can't choose actions this way.
Action-Value Function – the Q Function

• Define a new function very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• If the agent learns Q, it can choose the optimal action even without knowing δ:

π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)

• Q is the evaluation function the agent will learn.
Training Rule to Learn Q

• Note that Q and V* are closely related:

V*(s) = max_{a′} Q(s, a′)

• This allows us to write Q recursively as

Q(st, at) = r(st, at) + γ V*(δ(st, at)) = r(st, at) + γ max_{a′} Q(δ(st, at), a′)

• Let Q̂ denote the learner's current approximation to Q, and consider the training rule

Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

where s′ is the state resulting from applying action a in state s.
Q-Learning

Q(st, at) ← Q(st, at) + α [ rt+1 + γ max_a Q(st+1, a) − Q(st, at) ]
Q Learning for Deterministic Worlds

• For each s, a initialize the table entry Q̂(s, a) ← 0.
• Do forever:
  – Observe the current state s
  – Select an action a and execute it
  – Receive immediate reward r
  – Observe the new state s′
  – Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  – Update the state: s ← s′.
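A minimal sketch of this loop in Python; the environment interface step(s, a) -> (reward, next_state) and the random exploration policy are assumptions, not part of the slide:

# A minimal sketch of the deterministic-world Q-learning loop above.

import random
from collections import defaultdict

def q_learning(step, start_state, actions, gamma=0.9, n_steps=10_000):
    Q = defaultdict(float)                   # table entries initialized to 0
    s = start_state
    for _ in range(n_steps):
        a = random.choice(actions)           # select an action and execute it
        r, s2 = step(s, a)                   # receive reward, observe new state
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        s = s2                               # s <- s'
    return Q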
Updating Q

Q̂(s1, a_right) ← r + γ max_{a′} Q̂(s2, a′)
              ← 0 + 0.9 max{ 63, 81, 100 }
              ← 90

Notice that if the rewards are non-negative, then
∀(s, a, n): Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
∀(s, a, n): 0 ≤ Q̂_n(s, a) ≤ Q(s, a).
Convergence Theorem

• Theorem: Q̂ converges to Q. Consider the case of a deterministic world with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
• Proof: Define a full interval as an interval during which each ⟨s, a⟩ is visited at least once. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δn be the maximum error in Q̂_n:

Δn = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
Convergence Theorem (continued)

• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

| Q̂_{n+1}(s, a) − Q(s, a) | = | ( r + γ max_{a′} Q̂_n(s′, a′) ) − ( r + γ max_{a′} Q(s′, a′) ) |
                            = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
Convergence Theorem (continued)

| Q̂_{n+1}(s, a) − Q(s, a) | = | ( r + γ max_{a′} Q̂_n(s′, a′) ) − ( r + γ max_{a′} Q(s′, a′) ) |
                            = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                            ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                            ≤ γ max_{s″,a′} | Q̂_n(s″, a′) − Q(s″, a′) |
                            = γ Δn

Note that we used the general fact that

| max_a f1(a) − max_a f2(a) | ≤ max_a | f1(a) − f2(a) |.
Non-Deterministic Case

• What if the reward and next state are non-deterministic?
• We redefine V and Q by taking expected values:

Vπ(s) ≡ E[ rt + γ rt+1 + γ² rt+2 + … ] ≡ E[ Σ_{i=0}^∞ γ^i rt+i ]

Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]
Nondeterministic Case (continued)

Q learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂_n(s, a) ← (1 − αn) Q̂_{n−1}(s, a) + αn [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]

where

αn = 1 / ( 1 + visits_n(s, a) ).

Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992].
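A minimal sketch of this rule with the decaying step size; the per-pair visit counter is the only state beyond the Q table itself:

# A minimal sketch of the nondeterministic training rule, with the
# decaying step size alpha_n = 1 / (1 + visits_n(s, a)) from the slide.

from collections import defaultdict

Q = defaultdict(float)       # current estimate of Q(s, a)
visits = defaultdict(int)    # visit counts per (s, a) pair

def update(s, a, r, s2, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target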
Sarsa

Always update the policy to be greedy with respect to the current estimate.
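The slide names the algorithm without spelling out its update; for reference, the standard Sarsa rule uses the quintuple (st, at, rt+1, st+1, at+1) that gives the method its name. A minimal sketch:

# A minimal sketch of the standard (on-policy) Sarsa update: the target
# uses the action a2 actually selected in s2, not a max over actions
# as in Q-learning.

from collections import defaultdict

Q = defaultdict(float)

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])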
Temporal Difference Learning

• Temporal Difference (TD) learning methods:
  – can be used when accurate models of the environment are unavailable – neither the state transition function nor the reward function is known;
  – can be extended to work with implicit representations of action-value functions;
  – are among the most useful reinforcement learning methods.
Q-Learning: The Simplest TD Method, TD(0)

[Figure: TD(0) backup diagram – the value at state st is updated from the single sampled reward rt+1 and successor state st+1, rather than from a full tree of possibilities ending in terminal states T.]
Example – TD-Gammon

• Learns to play backgammon (Tesauro, 1995).
• Immediate reward: +100 if win, −100 if lose, 0 for all other states.
• Trained by playing 1.5 million games against itself.
• Now comparable to the best human player.
Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

Q(1)(st, at) ≡ rt + γ max_a Q̂(st+1, a)

Why not two steps?

Q(2)(st, at) ≡ rt + γ rt+1 + γ² max_a Q̂(st+2, a)

Or n?

Q(n)(st, at) ≡ rt + γ rt+1 + … + γ^{n−1} rt+n−1 + γ^n max_a Q̂(st+n, a)

Blend all of these:

Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + … ]
Temporal Difference Learning (continued)

Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + … ]

Equivalent expression:

Qλ(st, at) = rt + γ [ (1 − λ) max_a Q̂(st+1, a) + λ Qλ(st+1, at+1) ]

• The TD(λ) algorithm uses the above training rule.
• It sometimes converges faster than Q learning.
• It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
• Tesauro's TD-Gammon uses this algorithm.
Advantages of TD Learning

• TD methods do not require a model of the environment, only experience.
• TD methods can be fully incremental:
  – You can learn before knowing the final outcome
    • less memory
    • less peak computation
  – You can learn without the final outcome
    • from incomplete sequences
• TD converges (under reasonable assumptions).
Optimality of TD(0)

• Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
• Compute updates according to TD(0), but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Handling Large State Spaces

• Replace the Q̂ table with a neural net or other function approximator.
• Virtually any function approximator would work, provided it can be updated in an online fashion.
Learning State-Action Values

• Training examples of the form ⟨ (st, at), vt ⟩.
• The general gradient-descent rule:

θt+1 = θt + α [ vt − Qt(st, at) ] ∇θ Q(st, at)
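A minimal sketch of this rule with a linear approximator Qt(s, a) = θ · φ(s, a), for which ∇θ Q = φ(s, a); the feature map φ is an assumed ingredient, not given on the slide:

# A minimal sketch of the general gradient-descent rule for a linear model.

import numpy as np

def gd_update(theta, phi_sa, v_t, alpha=0.1):
    """theta <- theta + alpha * [v_t - Q_t(s,a)] * grad_theta Q_t(s,a)."""
    q = theta @ phi_sa                 # Q_t(s, a) under the linear model
    return theta + alpha * (v_t - q) * phi_sa

theta = np.zeros(4)
phi_sa = np.array([1.0, 0.0, 0.5, 0.0])   # hypothetical features for (s, a)
theta = gd_update(theta, phi_sa, v_t=1.0)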
Gradient Descent Methods

θt = ( θt(1), θt(2), …, θt(n) )ᵀ

Assume Vt is a (sufficiently smooth) differentiable function of θt, for all s ∈ S.

Assume, for now, training examples of the form { ⟨ st, Vπ(st) ⟩ }.
Gradient Descent

θt+1 = θt − ½ α ∇θ MSE(θt)
     = θt − ½ α ∇θ Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ]²
     = θt + α Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ] ∇θ Vt(s)
Gradient Descent (continued)

Use just the per-sample gradient:

θt+1 = θt − ½ α ∇θ [ Vπ(st) − Vt(st) ]²
     = θt + α [ Vπ(st) − Vt(st) ] ∇θ Vt(st)

Since each sample gradient is an unbiased estimate of the true gradient,

E{ [ Vπ(st) − Vt(st) ] ∇θ Vt(st) } = Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ] ∇θ Vt(s),

this converges to a local minimum of the MSE if α decreases appropriately with t.
But We Don't Have These Targets

Suppose we just have targets vt instead:

θt+1 = θt + α [ vt − Vt(st) ] ∇θ Vt(st)

If each vt is an unbiased estimate of Vπ(st), i.e., E{ vt } = Vπ(st), then gradient descent converges to a local minimum (provided α decreases appropriately).
What About TD(λ) Targets?

θt+1 = θt + α [ Rt^λ − Vt(st) ] ∇θ Vt(st)

θt+1 = θt + α δt et, where:

δt = rt+1 + γ Vt(st+1) − Vt(st), as usual, and
et = γ λ et−1 + ∇θ Vt(st)
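A minimal sketch of the eligibility-trace form above, for a linear value function Vt(s) = θ · φ(s), so that ∇θ Vt(s) = φ(s):

# A minimal sketch of one TD(lambda) step with an eligibility trace e.

import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s2, r, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) update; returns the new (theta, e)."""
    delta = r + gamma * (theta @ phi_s2) - (theta @ phi_s)  # TD error delta_t
    e = gamma * lam * e + phi_s                             # trace update
    return theta + alpha * delta * e, e

theta = np.zeros(3)
e = np.zeros(3)
phi_s, phi_s2 = np.array([1., 0., 0.]), np.array([0., 1., 0.])  # hypothetical
theta, e = td_lambda_step(theta, e, phi_s, phi_s2, r=1.0)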
Linear Gradient Descent: Watkins' Q(λ)

[Figure: pseudocode for the linear, gradient-descent form of Watkins' Q(λ).]
Not Covered

• Using hierarchical state and action representations
• Coping with partial observability
• Coping with extremely large state spaces
• Neural basis of reinforcement learning