Pennsylvania State University
College of Information Sciences and Technology, Artificial Intelligence Research Laboratory

Agents that Learn: Reinforcement Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology
Bioinformatics and Genomics Graduate Program
The Huck Institutes of the Life Sciences
Pennsylvania State University
[email protected]
http://faculty.ist.psu.edu/vhonavar

Principles of Artificial Intelligence, IST 597F, Fall 2017, (C) Vasant Honavar
Agent in an Environment

[Figure: agent–environment loop – the agent acts on the environment; the environment returns a new state and a reward or punishment.]
Learning from Interaction with the World

• An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.
Markov Decision Processes

• Assume:
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step, the agent observes state st ∈ S and chooses action at ∈ A, receives immediate reward rt, and the environment state changes to st+1.
• Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
  – i.e., rt and st+1 depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not necessarily be known to the agent – this is the reinforcement learning setting
Acting Rationally in the Presence of Delayed Rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes state at step t: st ∈ S, produces action at step t: at ∈ A(st), and gets the resulting reward rt+1 ∈ ℜ and the resulting next state st+1.

[Figure: interaction timeline … st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 → …]
Key Elements of an RL System

• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model – what follows what

[Figure: policy, value, and reward sitting atop a model of the environment.]
The Agent Uses a Policy to Select Actions

Policy at step t, πt: a mapping from states to action probabilities; πt(s, a) = probability that at = a when st = s.

• A rational agent's goal is to get as much reward as it can over the long run.
Goals and Rewards

• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control.
• The agent must be able to measure success:
  – explicitly
  – frequently during its lifespan
Rewards for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Cumulative discounted reward:

Rt = rt+1 + γ rt+2 + γ² rt+3 + … = Σ_{k=0}^∞ γ^k rt+k+1,

where 0 ≤ γ ≤ 1 is the discount rate:

γ → 0: shortsighted; γ → 1: farsighted.
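For concreteness, a minimal sketch (not from the slides) that computes a truncated discounted return for a sample reward sequence:

# A minimal sketch (not from the slides): the discounted return
# R_t = sum_{k>=0} gamma^k * r_{t+k+1}, truncated to a finite sample.

def discounted_return(rewards, gamma):
    """rewards = [r_{t+1}, r_{t+2}, ...]; returns the discounted sum."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.9))
# 1 + 0.9 + 0.81 + 0.729 = 3.439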
Rewards

Suppose the sequence of rewards after step t is rt+1, rt+2, rt+3, … What do we want to maximize?

In general, we want to maximize the expected return, E{Rt}, for each step t.

Episodic tasks – interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze:

Rt = rt+1 + rt+2 + … + rT,

where T is a final time step at which a terminal state is reached, ending an episode.
Markov Decision Processes

• If a reinforcement learning task has the Markov property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, the MDP is a finite MDP.
• To define a finite MDP, you need to specify (a concrete sketch follows):
  – state and action sets;
  – one-step dynamics defined by transition probabilities:
    P^a_{ss′} = Pr{ st+1 = s′ | st = s, at = a }  ∀ s, s′ ∈ S, a ∈ A(s);
  – expected rewards:
    R^a_{ss′} = E{ rt+1 | st = s, at = a, st+1 = s′ }  ∀ s, s′ ∈ S, a ∈ A(s).
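To make the specification concrete, here is a minimal sketch of a finite MDP as plain tables; the two-state example itself is hypothetical, not from the slides:

# A minimal sketch of a finite MDP as plain data structures.
# The two-state example is hypothetical, not from the slides.

states = ["s0", "s1"]
actions = ["stay", "go"]

# P[(s, a)] maps next states s' to Pr{s_{t+1} = s' | s_t = s, a_t = a}
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] maps next states s' to E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'}
R = {
    ("s0", "stay"): {"s0": 0.0},
    ("s0", "go"):   {"s0": 0.0, "s1": 1.0},
    ("s1", "stay"): {"s1": 2.0},
    ("s1", "go"):   {"s0": 0.0, "s1": 0.0},
}

# Sanity check: transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, sa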
Example – Pole-Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
reward = +1 for each step before failure ⇒ return = number of steps before failure.

As a continuing task with discounted return:
reward = −1 upon failure, 0 otherwise ⇒ return = −γ^k, for k steps before failure.

In either case, return is maximized by avoiding failure for as long as possible.
Example – Driving Task

Get to the top of the hill as quickly as possible.

reward = −1 for each step when not at the top of the hill
⇒ return = −(number of steps before reaching the top of the hill).

Return is maximized by minimizing the number of steps taken to reach the top of the hill.
Recycling Robot: a Finite MDP Example

• At each step, the robot has to decide whether it should (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected.
Some Notable RL Applications

• TD-Gammon – the world's best backgammon program
• Elevator scheduling
• Inventory management – 10%–15% improvement over state-of-the-art methods
• Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls
The Markov Property

• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov property:

Pr{ st+1 = s′, rt+1 = r | st, at, rt, st−1, at−1, …, r1, s0, a0 } = Pr{ st+1 = s′, rt+1 = r | st, at }

for all s′, r, and histories st, at, rt, st−1, at−1, …, r1, s0, a0.
Reinforcement Learning

• The learner is not told which actions to take.
• Rewards and punishments may be delayed – sacrifice short-term gains for greater long-term gains.
• There is a need to trade off between exploration and exploitation.
• The environment may be unobservable or only partially observable.
• The environment may be deterministic or stochastic.
Value Functions

• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

Vπ(s) = Eπ{ Rt | st = s } = Eπ{ Σ_{k=0}^∞ γ^k rt+k+1 | st = s }

• The value of taking an action in a state under policy π is the expected cumulative reward starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

Qπ(s, a) = Eπ{ Rt | st = s, at = a } = Eπ{ Σ_{k=0}^∞ γ^k rt+k+1 | st = s, at = a }
Bellman Equation for a Policy π

The basic idea:

Rt = rt+1 + γ rt+2 + γ² rt+3 + γ³ rt+4 + …
   = rt+1 + γ ( rt+2 + γ rt+3 + γ² rt+4 + … )
   = rt+1 + γ Rt+1

So:

Vπ(s) = Eπ{ Rt | st = s } = Eπ{ rt+1 + γ Vπ(st+1) | st = s }

Or, without the expectation operator:

Vπ(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Vπ(s′) ]
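A minimal sketch of iterative policy evaluation, which applies the Bellman equation above as an update until the values stop changing; it expects states/actions/P/R tables shaped like the hypothetical MDP sketch given earlier:

# A minimal sketch of iterative policy evaluation: sweep the states,
# applying V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [R^a_ss' + gamma V(s')].

def policy_evaluation(states, actions, P, R, policy, gamma=0.9, tol=1e-8):
    """policy[(s, a)] = pi(s, a). Returns a dict s -> V^pi(s)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                pi_sa = policy.get((s, a), 0.0)
                if pi_sa == 0.0:
                    continue
                for s2, p in P[(s, a)].items():
                    v_new += pi_sa * p * (R[(s, a)][s2] + gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new            # in-place (Gauss-Seidel style) sweep
        if delta < tol:
            return V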
Optimal Value Functions

• For finite MDPs, policies can be partially ordered:

π ≥ π′ if and only if Vπ(s) ≥ Vπ′(s) for all s ∈ S

• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
• Optimal policies share the same optimal state-value function:

V*(s) = max_π Vπ(s) for all s ∈ S

• Optimal policies also share the same optimal action-value function:

Q*(s, a) = max_π Qπ(s, a) for all s ∈ S and a ∈ A(s)

This is the expected cumulative reward for taking action a in state s and thereafter following an optimal policy.
Why Optimal State-Value Functions Are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
Example: Tic-Tac-Toe

[Figure: a game tree of tic-tac-toe positions, with levels alternating between X's moves and O's moves.]

Assume an imperfect opponent: it sometimes makes mistakes.
A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state: V(s) – the estimated probability of winning from state s. Terminal states get their true values – 1 for a win, 0 for a loss, 0.5 for a draw – and every other state starts at 0.5.

Play lots of games. To pick our moves, look ahead one step from the current state to the possible next states. Pick the next state with the highest estimated probability of winning – the largest V(s) – a greedy move; occasionally pick a move at random instead – an exploratory move.
Learning Rule for Tic-Tac-Toe

[Figure: one game as a sequence of positions from the starting position, alternating our moves and the opponent's moves, with an occasional "exploratory" move and backup arrows from each greedy successor state to its predecessor.]

Let s be the state before our greedy move and s′ the state after our greedy move. We increment each V(s) toward V(s′) – a backup:

V(s) ← V(s) + α [ V(s′) − V(s) ]
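A minimal sketch of this backup as code; the keying of states is a hypothetical choice, not from the slides:

# A minimal sketch of the tabular backup V(s) <- V(s) + alpha [V(s') - V(s)].

from collections import defaultdict

V = defaultdict(lambda: 0.5)   # estimated probability of winning; 0.5 default
ALPHA = 0.1                    # step size

def backup(s, s_next):
    """Move V(s) a fraction ALPHA of the way toward V(s_next)."""
    V[s] += ALPHA * (V[s_next] - V[s])

# Terminal states are pinned to their true values, per the slide:
# V[win] = 1.0, V[loss] = 0.0, V[draw] = 0.5. After each greedy move
# from s to s', call backup(s, s').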
Why Is Tic-Tac-Toe Too Easy?

• The number of states is small and finite.
• One-step look-ahead is always possible.
• The state is completely observable.
What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
Simpler Setting: The n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play.
• After each play at, you get a reward rt, where E{ rt | at } = Q*(at).
• The distribution of rt depends only on at; there is no dependence on state.
• The objective is to maximize the reward in the long term, e.g., over 1000 plays.
The Exploration–Exploitation Dilemma

• Suppose you form action-value estimates Qt(a) ≈ Q*(a).
• The greedy action at step t is at* = argmax_a Qt(a):
  – at = at* ⇒ exploitation
  – at ≠ at* ⇒ exploration
• You can't exploit all the time; you can't explore all the time.
• You can never stop exploring; but you could reduce exploring.
Action-Value Methods

• Adapt action-value estimates and nothing else.
• Suppose that by the t-th play, action a had been chosen ka times, producing rewards r1, r2, …, rka. Then

Qt(a) = ( r1 + r2 + … + rka ) / ka

and lim_{ka→∞} Qt(a) = Q*(a).
ε-Greedy Action Selection

• Greedy: at = at* = argmax_a Qt(a)
• ε-Greedy: at = at* with probability 1 − ε; a random action with probability ε
• Boltzmann:

Pr(choosing action a at time t) = e^{Qt(a)/τ} / Σ_{b=1}^n e^{Qt(b)/τ}

where τ is the computational temperature.
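A minimal sketch of these three selection rules over a dictionary of value estimates; the example Q values are hypothetical:

# A minimal sketch of greedy, epsilon-greedy, and Boltzmann selection.

import math
import random

def greedy(Q):
    """argmax_a Q(a) over a dict a -> estimate (ties broken arbitrarily)."""
    return max(Q, key=Q.get)

def epsilon_greedy(Q, eps):
    """Greedy with probability 1 - eps, uniform random with probability eps."""
    if random.random() < eps:
        return random.choice(list(Q))
    return greedy(Q)

def boltzmann(Q, tau):
    """Sample a with probability proportional to exp(Q(a) / tau)."""
    actions = list(Q)
    m = max(Q.values())                       # subtract max for stability
    weights = [math.exp((Q[a] - m) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

Q = {"a1": 0.2, "a2": 1.0, "a3": 0.5}         # hypothetical estimates
print(epsilon_greedy(Q, eps=0.1), boltzmann(Q, tau=0.5))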
Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman optimality equation requires:
  – accurate knowledge of environment dynamics;
  – enough space and time to do the computation;
  – the Markov property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge.
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman optimality equation.
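One classic dynamic-programming method of the kind alluded to above is value iteration; a minimal sketch, assuming known dynamics in the table shapes used in the earlier MDP sketch:

# A minimal sketch of value iteration: repeatedly apply the Bellman
# optimality backup V(s) <- max_a sum_s' P^a_ss' [R^a_ss' + gamma V(s')].

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R[(s, a)][s2] + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
                 for a in actions]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V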
The Agent Learns a Policy

Policy at step t, πt: a mapping from states to action probabilities; πt(s, a) = probability that at = a when st = s.

• Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.
Reinforcement Learning

[Figure: agent–environment loop – the agent selects an action; the environment returns the next state and a reward.]
Agent's Learning Task

• Execute actions in the environment, observe the results, and
• Learn an action policy π : S → A that maximizes
  E[ rt + γ rt+1 + γ² rt+2 + … ]
  from any starting state in S.
• Here 0 ≤ γ < 1 is the discount factor for future rewards.

• The target function to be learned is π : S → A.
• But we have no training examples of the form ⟨s, a⟩.
• Training examples are of the form ⟨⟨s, a⟩, r⟩.
Reinforcement Learning Problem

• Goal: learn to choose actions that maximize r0 + γ r1 + γ² r2 + …, where 0 ≤ γ < 1.
Value Function

• To begin with, consider deterministic worlds...
• For each possible policy π the agent might adopt, we can define an evaluation function over states:

Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + … ≡ Σ_{i=0}^∞ γ^i rt+i

where rt, rt+1, … are generated by following policy π starting at state s.

π* ≡ argmax_π Vπ(s), (∀s)

• Restated, the task is to learn the optimal policy π*.
What to Learn

• We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
• It could then do a look-ahead search to choose the best action from any state s, because

π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• This works if the agent knows δ : S × A → S and r : S × A → ℜ.
• But when it doesn't, it can't choose actions this way.
Action-Value Function – the Q Function

• Define a new function very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• If the agent learns Q, it can choose the optimal action even without knowing δ:

π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)

• Q is the evaluation function the agent will learn.
Training Rule to Learn Q

• Note that Q and V* are closely related:

V*(s) = max_{a′} Q(s, a′)

• This allows us to write Q recursively as

Q(st, at) = r(st, at) + γ V*(δ(st, at)) = r(st, at) + γ max_{a′} Q(δ(st, at), a′)

• Let Q̂ denote the learner's current approximation to Q, and consider the training rule

Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

where s′ is the state resulting from applying action a in state s.
Q-Learning

Q(st, at) ← Q(st, at) + α [ rt+1 + γ max_a Q(st+1, a) − Q(st, at) ]
Q Learning for Deterministic Worlds

• For each s, a initialize the table entry Q̂(s, a) ← 0.
• Do forever:
  – Observe the current state s
  – Select an action a and execute it
  – Receive immediate reward r
  – Observe the new state s′
  – Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  – Update the state: s ← s′.
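A minimal sketch of this loop in Python; the environment interface step(s, a) -> (reward, next_state) and the random exploration policy are assumptions, not part of the slide:

# A minimal sketch of the deterministic-world Q-learning loop above.

import random
from collections import defaultdict

def q_learning(step, start_state, actions, gamma=0.9, n_steps=10_000):
    Q = defaultdict(float)                   # table entries initialized to 0
    s = start_state
    for _ in range(n_steps):
        a = random.choice(actions)           # select an action and execute it
        r, s2 = step(s, a)                   # receive reward, observe new state
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        s = s2                               # s <- s'
    return Q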
Updating Q

Q̂(s1, a_right) ← r + γ max_{a′} Q̂(s2, a′)
              ← 0 + 0.9 max{ 63, 81, 100 }
              ← 90

Notice that if the rewards are non-negative, then
∀(s, a, n): Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
∀(s, a, n): 0 ≤ Q̂_n(s, a) ≤ Q(s, a).
Convergence Theorem

• Theorem: Q̂ converges to Q. Consider the case of a deterministic world with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
• Proof: Define a full interval as an interval during which each ⟨s, a⟩ is visited at least once. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δn be the maximum error in Q̂_n:

Δn = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
Convergence Theorem (continued)

• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

| Q̂_{n+1}(s, a) − Q(s, a) | = | ( r + γ max_{a′} Q̂_n(s′, a′) ) − ( r + γ max_{a′} Q(s′, a′) ) |
                            = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
Convergence Theorem (continued)

| Q̂_{n+1}(s, a) − Q(s, a) | = | ( r + γ max_{a′} Q̂_n(s′, a′) ) − ( r + γ max_{a′} Q(s′, a′) ) |
                            = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                            ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                            ≤ γ max_{s″,a′} | Q̂_n(s″, a′) − Q(s″, a′) |
                            = γ Δn

Note that we used the general fact that

| max_a f1(a) − max_a f2(a) | ≤ max_a | f1(a) − f2(a) |.
Non-Deterministic Case

• What if the reward and next state are non-deterministic?
• We redefine V and Q by taking expected values:

Vπ(s) ≡ E[ rt + γ rt+1 + γ² rt+2 + … ] ≡ E[ Σ_{i=0}^∞ γ^i rt+i ]

Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]
Nondeterministic Case (continued)

Q learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂_n(s, a) ← (1 − αn) Q̂_{n−1}(s, a) + αn [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]

where

αn = 1 / ( 1 + visits_n(s, a) ).

Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992].
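A minimal sketch of this rule with the decaying step size; the per-pair visit counter is the only state beyond the Q table itself:

# A minimal sketch of the nondeterministic training rule, with the
# decaying step size alpha_n = 1 / (1 + visits_n(s, a)) from the slide.

from collections import defaultdict

Q = defaultdict(float)       # current estimate of Q(s, a)
visits = defaultdict(int)    # visit counts per (s, a) pair

def update(s, a, r, s2, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target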
Sarsa

Always update the policy to be greedy with respect to the current estimate.
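The slide names the algorithm without spelling out its update; for reference, the standard Sarsa rule uses the quintuple (st, at, rt+1, st+1, at+1) that gives the method its name. A minimal sketch:

# A minimal sketch of the standard (on-policy) Sarsa update: the target
# uses the action a2 actually selected in s2, not a max over actions
# as in Q-learning.

from collections import defaultdict

Q = defaultdict(float)

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])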
Temporal Difference Learning

• Temporal Difference (TD) learning methods:
  – can be used when accurate models of the environment are unavailable – neither the state transition function nor the reward function is known;
  – can be extended to work with implicit representations of action-value functions;
  – are among the most useful reinforcement learning methods.
Q-Learning: The Simplest TD Method, TD(0)

[Figure: TD(0) backup diagram – the value at state st is updated from the single sampled reward rt+1 and successor state st+1, rather than from a full tree of possibilities ending in terminal states T.]
Example – TD-Gammon

• Learns to play backgammon (Tesauro, 1995).
• Immediate reward: +100 if win, −100 if lose, 0 for all other states.
• Trained by playing 1.5 million games against itself.
• Now comparable to the best human player.
Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

Q(1)(st, at) ≡ rt + γ max_a Q̂(st+1, a)

Why not two steps?

Q(2)(st, at) ≡ rt + γ rt+1 + γ² max_a Q̂(st+2, a)

Or n?

Q(n)(st, at) ≡ rt + γ rt+1 + … + γ^{n−1} rt+n−1 + γ^n max_a Q̂(st+n, a)

Blend all of these:

Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + … ]
Temporal Difference Learning (continued)

Qλ(st, at) ≡ (1 − λ) [ Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + … ]

Equivalent expression:

Qλ(st, at) = rt + γ [ (1 − λ) max_a Q̂(st+1, a) + λ Qλ(st+1, at+1) ]

• The TD(λ) algorithm uses the above training rule.
• It sometimes converges faster than Q learning.
• It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
• Tesauro's TD-Gammon uses this algorithm.
Advantages of TD Learning

• TD methods do not require a model of the environment, only experience.
• TD methods can be fully incremental:
  – You can learn before knowing the final outcome
    • less memory
    • less peak computation
  – You can learn without the final outcome
    • from incomplete sequences
• TD converges (under reasonable assumptions).
Optimality of TD(0)

• Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
• Compute updates according to TD(0), but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Handling Large State Spaces

• Replace the Q̂ table with a neural net or other function approximator.
• Virtually any function approximator would work, provided it can be updated in an online fashion.
Learning State-Action Values

• Training examples of the form ⟨ (st, at), vt ⟩.
• The general gradient-descent rule:

θt+1 = θt + α [ vt − Qt(st, at) ] ∇θ Q(st, at)
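A minimal sketch of this rule with a linear approximator Qt(s, a) = θ · φ(s, a), for which ∇θ Q = φ(s, a); the feature map φ is an assumed ingredient, not given on the slide:

# A minimal sketch of the general gradient-descent rule for a linear model.

import numpy as np

def gd_update(theta, phi_sa, v_t, alpha=0.1):
    """theta <- theta + alpha * [v_t - Q_t(s,a)] * grad_theta Q_t(s,a)."""
    q = theta @ phi_sa                 # Q_t(s, a) under the linear model
    return theta + alpha * (v_t - q) * phi_sa

theta = np.zeros(4)
phi_sa = np.array([1.0, 0.0, 0.5, 0.0])   # hypothetical features for (s, a)
theta = gd_update(theta, phi_sa, v_t=1.0)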
Gradient Descent Methods

θt = ( θt(1), θt(2), …, θt(n) )ᵀ

Assume Vt is a (sufficiently smooth) differentiable function of θt, for all s ∈ S.

Assume, for now, training examples of the form { ⟨ st, Vπ(st) ⟩ }.
Gradient Descent

θt+1 = θt − ½ α ∇θ MSE(θt)
     = θt − ½ α ∇θ Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ]²
     = θt + α Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ] ∇θ Vt(s)
Gradient Descent (continued)

Use just the per-sample gradient:

θt+1 = θt − ½ α ∇θ [ Vπ(st) − Vt(st) ]²
     = θt + α [ Vπ(st) − Vt(st) ] ∇θ Vt(st)

Since each sample gradient is an unbiased estimate of the true gradient,

E{ [ Vπ(st) − Vt(st) ] ∇θ Vt(st) } = Σ_{s∈S} P(s) [ Vπ(s) − Vt(s) ] ∇θ Vt(s),

this converges to a local minimum of the MSE if α decreases appropriately with t.
But We Don't Have These Targets

Suppose we just have targets vt instead:

θt+1 = θt + α [ vt − Vt(st) ] ∇θ Vt(st)

If each vt is an unbiased estimate of Vπ(st), i.e., E{ vt } = Vπ(st), then gradient descent converges to a local minimum (provided α decreases appropriately).
What About TD(λ) Targets?

θt+1 = θt + α [ Rt^λ − Vt(st) ] ∇θ Vt(st)

θt+1 = θt + α δt et, where:

δt = rt+1 + γ Vt(st+1) − Vt(st), as usual, and
et = γ λ et−1 + ∇θ Vt(st)
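A minimal sketch of the eligibility-trace form above, for a linear value function Vt(s) = θ · φ(s), so that ∇θ Vt(s) = φ(s):

# A minimal sketch of one TD(lambda) step with an eligibility trace e.

import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s2, r, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) update; returns the new (theta, e)."""
    delta = r + gamma * (theta @ phi_s2) - (theta @ phi_s)  # TD error delta_t
    e = gamma * lam * e + phi_s                             # trace update
    return theta + alpha * delta * e, e

theta = np.zeros(3)
e = np.zeros(3)
phi_s, phi_s2 = np.array([1., 0., 0.]), np.array([0., 1., 0.])  # hypothetical
theta, e = td_lambda_step(theta, e, phi_s, phi_s2, r=1.0)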
Linear Gradient Descent: Watkins' Q(λ)

[Figure: pseudocode for the linear, gradient-descent form of Watkins' Q(λ).]
Not Covered

• Using hierarchical state and action representations
• Coping with partial observability
• Coping with extremely large state spaces
• Neural basis of reinforcement learning