Pennsylvania State University
College of Information Sciences and Technology
Artificial Intelligence Research Laboratory

Agents that Learn: Reinforcement Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology
Bioinformatics and Genomics Graduate Program
The Huck Institutes of the Life Sciences
Pennsylvania State University
[email protected]
http://faculty.ist.psu.edu/vhonavar

Principles of Artificial Intelligence, IST 597F, Fall 2017, (C) Vasant Honavar




Agent in an Environment

[Diagram: the agent acts on the environment; the environment returns the resulting state and a reward or punishment.]


Learning from Interaction with the World

•  An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.

•  The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.


Markov Decision Processes

•  Assume
  –  Finite set of states S
  –  Finite set of actions A
•  At each discrete time t
  –  The agent observes state st ∈ S and chooses action at ∈ A, receiving immediate reward rt
  –  The environment state changes to st+1
•  Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
  –  i.e., rt and st+1 depend only on the current state and action
  –  The functions δ and r may be nondeterministic
  –  The functions δ and r may not necessarily be known to the agent – this is the reinforcement learning setting (see the sketch below)
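To make the notation concrete, here is a minimal sketch of this interface in Python for a made-up two-state, two-action deterministic world; the state names, dynamics, and rewards are illustrative, not from the lecture.

```python
# A toy deterministic MDP with the interface described above.
S = {"s0", "s1"}          # finite set of states S
A = {"stay", "move"}      # finite set of actions A

def delta(s, a):
    """State-transition function: s_{t+1} = delta(s_t, a_t)."""
    return s if a == "stay" else ("s1" if s == "s0" else "s0")

def r(s, a):
    """Reward function: r_t = r(s_t, a_t). Values are made up."""
    return 1.0 if (s, a) == ("s0", "move") else 0.0

# One step of agent-environment interaction:
s, a = "s0", "move"
reward, s_next = r(s, a), delta(s, a)
print(s, a, reward, s_next)   # s0 move 1.0 s1
```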


Acting Rationally in the Presence of Delayed Rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
  The agent observes state at step t:  st ∈ S
  produces action at step t:  at ∈ A(st)
  gets resulting reward:  rt+1 ∈ ℜ
  and resulting next state:  st+1

  … st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 → …


Key Elements of an RL System

•  Policy – what to do
•  Reward – what is good
•  Value – what is good because it predicts reward
•  Model – what follows what

[Diagram: policy, reward, value, and a model of the environment.]


The Agent Uses a Policy to Select Actions

Policy at step t, πt: a mapping from states to action probabilities
  πt(s, a) = probability that at = a when st = s

•  A rational agent's goal is to get as much reward as it can over the long run.


Goals and Rewards

•  Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
•  A goal should specify what we want to achieve, not how we want to achieve it.
•  A goal is typically outside the agent's direct control.
•  The agent must be able to measure success:
  –  explicitly
  –  frequently during its lifespan


Rewards for Continuing Tasks

Continuing tasks: the interaction does not have natural episodes.

Cumulative discounted reward:

  Rt = rt+1 + γ rt+2 + γ2 rt+3 + … = Σk=0..∞ γk rt+k+1

where 0 ≤ γ ≤ 1 is the discount rate.

  γ → 0: shortsighted;  γ → 1: farsighted.
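As a quick check of the definition, a short sketch that computes a finite prefix of the discounted return; the reward sequence is made up.

```python
# R_t = sum_k gamma^k * r_{t+k+1}, truncated to the rewards we have.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```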


Rewards

Suppose the sequence of rewards after step t is rt+1, rt+2, rt+3, … What do we want to maximize?

In general, we want to maximize the expected return E{Rt} for each step t.

Episodic tasks: the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

  Rt = rt+1 + rt+2 + … + rT

where T is a final time step at which a terminal state is reached, ending an episode.


Markov Decision Processes

•  If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP).
•  If the state and action sets are finite, the MDP is a finite MDP.
•  To define a finite MDP, you need to specify:
  –  state and action sets;
  –  one-step dynamics defined by transition probabilities:
       Pass′ = Pr{st+1 = s′ | st = s, at = a}  ∀ s, s′ ∈ S, a ∈ A(s);
  –  expected rewards:
       Rass′ = E{rt+1 | st = s, at = a, st+1 = s′}  ∀ s, s′ ∈ S, a ∈ A(s).


Example: Pole-Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
  reward = +1 for each step before failure
  ⇒ return = number of steps before failure

As a continuing task with discounted return:
  reward = −1 upon failure; 0 otherwise
  ⇒ return = −γk, for k steps before failure

In either case, the return is maximized by avoiding failure for as long as possible.


Example: Driving Task

Get to the top of the hill as quickly as possible.

  reward = −1 for each step when not at the top of the hill
  ⇒ return = −(number of steps before reaching the top of the hill)

The return is maximized by minimizing the number of steps taken to reach the top of the hill.


Recycling Robot: A Finite MDP Example

•  At each step, the robot has to decide whether it should
  a)  actively search for a can,
  b)  wait for someone to bring it a can, or
  c)  go to home base and recharge.
•  Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
•  Decisions are made on the basis of the current energy level: high or low.
•  Reward = number of cans collected. (A sketch of the resulting MDP follows.)
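One way to write the recycling robot down as a finite MDP is a table of one-step dynamics, as sketched below; the transition probabilities and reward values here are hypothetical placeholders, not numbers from the lecture.

```python
# P[(state, action)] -> list of (next_state, probability, expected_reward).
# All numbers below are illustrative.
P = {
    ("high", "search"):   [("high", 0.7, 2.0), ("low", 0.3, 2.0)],
    ("high", "wait"):     [("high", 1.0, 1.0)],
    ("low",  "search"):   [("low", 0.6, 2.0), ("high", 0.4, -3.0)],  # rescued: bad
    ("low",  "wait"):     [("low", 1.0, 1.0)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

# Sanity check: outgoing probabilities sum to 1 for each (state, action).
for sa, outcomes in P.items():
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9, sa
```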


Some Notable RL Applications

•  TD-Gammon – world's best backgammon program
•  Elevator scheduling
•  Inventory management – 10%–15% improvement over state-of-the-art methods
•  Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls


The Markov Property

•  By the state at step t, we mean whatever information is available to the agent at step t about its environment.
•  The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
•  Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov Property:

  Pr{st+1 = s′, rt+1 = r | st, at, rt, st−1, at−1, …, r1, s0, a0} = Pr{st+1 = s′, rt+1 = r | st, at}
  for all s′, r, and histories st, at, rt, st−1, at−1, …, r1, s0, a0.


Reinforcement Learning

•  The learner is not told which actions to take.
•  Rewards and punishments may be delayed
  –  Sacrifice short-term gains for greater long-term gains.
•  The need to trade off between exploration and exploitation.
•  The environment may not be observable, or only partially observable.
•  The environment may be deterministic or stochastic.


Value Functions

•  The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

  Vπ(s) = Eπ{Rt | st = s} = Eπ{ Σk=0..∞ γk rt+k+1 | st = s }

•  The value of taking an action in a state under policy π is the expected cumulative reward starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

  Qπ(s, a) = Eπ{Rt | st = s, at = a} = Eπ{ Σk=0..∞ γk rt+k+1 | st = s, at = a }



Bellman Equation for a Policy π

The basic idea:

  Rt = rt+1 + γ rt+2 + γ2 rt+3 + γ3 rt+4 + …
     = rt+1 + γ (rt+2 + γ rt+3 + γ2 rt+4 + …)
     = rt+1 + γ Rt+1

So:

  Vπ(s) = Eπ{Rt | st = s} = Eπ{rt+1 + γ Vπ(st+1) | st = s}

Or, without the expectation operator:

  Vπ(s) = Σa π(s, a) Σs′ Pass′ [Rass′ + γ Vπ(s′)]
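The Bellman equation suggests a fixed-point iteration for Vπ (iterative policy evaluation). A minimal sketch, assuming a made-up two-state deterministic MDP and a uniform-random policy:

```python
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] -> list of (s_next, prob, reward); an illustrative toy world.
P = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "move"): [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 1.0, 0.0)],
    ("s1", "move"): [("s0", 1.0, 0.0)],
}
pi = {(s, a): 1.0 / len(actions) for s in states for a in actions}

V = {s: 0.0 for s in states}
for _ in range(1000):  # sweep V(s) = sum_a pi(s,a) sum_s' P [R + gamma V(s')]
    V = {
        s: sum(
            pi[s, a] * sum(p * (rew + gamma * V[s2]) for s2, p, rew in P[s, a])
            for a in actions
        )
        for s in states
    }
print(V)
```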


Optimal Value Functions

•  For finite MDPs, policies can be partially ordered:
     π ≥ π′ if and only if Vπ(s) ≥ Vπ′(s) for all s ∈ S
•  There is always at least one (and possibly many) policies that are better than or equal to all the others. This is an optimal policy. We denote them all π*.
•  Optimal policies share the same optimal state-value function:
     V∗(s) = maxπ Vπ(s) for all s ∈ S
•  Optimal policies also share the same optimal action-value function:
     Q∗(s, a) = maxπ Qπ(s, a) for all s ∈ S and a ∈ A(s)
   This is the expected cumulative reward for taking action a in state s and thereafter following an optimal policy.



Why Optimal State-Value Functions Are Useful

Any policy that is greedy with respect to V∗ is an optimal policy.

Therefore, given V∗, one-step-ahead search produces the long-term optimal actions.


Example: Tic-Tac-Toe

[Diagram: a game tree of board positions, with levels alternating between X's moves and O's moves.]

Assume an imperfect opponent: one who sometimes makes mistakes.


A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state:

  State    V(s) – estimated probability of winning
  …        0.5
  win      1
  loss     0
  draw     0.5

Play lots of games. To pick our moves, look ahead one step from the current state to the possible next states:
•  Pick the next state with the highest estimated probability of winning – the largest V(s) – a greedy move.
•  Occasionally pick a move at random – an exploratory move.


Learning Rule for Tic-Tac-Toe

[Diagram: a trajectory through the game tree from the starting position, alternating our moves and the opponent's moves; greedy moves are starred, and one branch is an "exploratory" move.]

Let s be the state before our greedy move, and s′ the state after our greedy move. We increment each V(s) toward V(s′) – a backup:

  V(s) ← V(s) + α [V(s′) − V(s)]
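A sketch of this backup as a table update; the state keys, the initial value of 0.5, and the step size α are illustrative.

```python
from collections import defaultdict

V = defaultdict(lambda: 0.5)   # estimated probability of winning
alpha = 0.1

def backup(s, s_next):
    """V(s) <- V(s) + alpha [V(s') - V(s)]: move V(s) toward V(s')."""
    V[s] += alpha * (V[s_next] - V[s])

V["win-state"] = 1.0
backup("some-board", "win-state")
print(V["some-board"])   # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```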


Why Is Tic-Tac-Toe Too Easy?

•  The number of states is small and finite.
•  One-step look-ahead is always possible.
•  The state is completely observable.


What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

  π∗(s) = argmaxa∈A(s) Q∗(s, a)


A Simpler Setting: The n-Armed Bandit Problem

•  Choose repeatedly from one of n actions; each choice is called a play.
•  After each play at, you get a reward rt, where
     E{rt | at} = Q*(at)
•  The distribution of rt depends only on at.
•  There is no dependence on state.
•  The objective is to maximize the reward in the long term, e.g., over 1000 plays.


The Exploration–Exploitation Dilemma

•  Suppose you form action-value estimates Qt(a) ≈ Q*(a).
•  The greedy action at t is
     at* = argmaxa Qt(a)
     at = at*  ⇒ exploitation
     at ≠ at*  ⇒ exploration
•  You can't exploit all the time; you can't explore all the time.
•  You can never stop exploring; but you could reduce exploring.


Action-Value Methods

•  Adapt action-value estimates and nothing else.
•  Suppose that by the t-th play, action a had been chosen ka times, producing rewards r1, r2, …, rka. Then

     Qt(a) = (r1 + r2 + … + rka) / ka

     limka→∞ Qt(a) = Q*(a)


ε-Greedy Action Selection

§  Greedy:
     at = at* = argmaxa Qt(a)
§  ε-Greedy:
     at = at* with probability 1 − ε; a random action with probability ε
§  Boltzmann:
     Pr(choosing action a at time t) = eQt(a)/τ / Σb=1..n eQt(b)/τ
   where τ is the computational temperature. (A sketch of these selection rules follows.)
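A sketch of ε-greedy and Boltzmann selection on an n-armed bandit, using the incremental sample-average estimates from the previous slide; the true reward means are made up.

```python
import math
import random

n = 5
true_means = [random.gauss(0, 1) for _ in range(n)]  # unknown Q*(a)
Q = [0.0] * n                                        # estimates Q_t(a)
k = [0] * n                                          # play counts k_a

def epsilon_greedy(eps=0.1):
    if random.random() < eps:
        return random.randrange(n)               # explore
    return max(range(n), key=lambda a: Q[a])     # exploit (greedy)

def boltzmann(tau=0.5):
    weights = [math.exp(Q[a] / tau) for a in range(n)]
    return random.choices(range(n), weights=weights)[0]

for t in range(1000):
    a = epsilon_greedy()                 # or boltzmann()
    reward = random.gauss(true_means[a], 1)
    k[a] += 1
    Q[a] += (reward - Q[a]) / k[a]       # incremental sample average

print(max(range(n), key=lambda a: Q[a]),
      max(range(n), key=lambda a: true_means[a]))  # estimated vs. true best arm
```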


Solving the Bellman Optimality Equation

•  Finding an optimal policy by solving the Bellman Optimality Equation requires:
  –  accurate knowledge of environment dynamics;
  –  enough space and time to do the computation;
  –  the Markov Property.
•  How much space and time do we need?
  –  Polynomial in the number of states (via dynamic programming methods),
  –  BUT the number of states is often huge.
•  We usually have to settle for approximations.
•  Many RL methods can be understood as approximately solving the Bellman Optimality Equation.




The Agent Learns a Policy

Policy at step t, πt: a mapping from states to action probabilities
  πt(s, a) = probability that at = a when st = s

•  Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
•  Roughly, the agent's goal is to get as much reward as it can over the long run.

Reinforcement Learning

[Diagram: the agent selects an action; the environment returns the resulting state and reward.]



Agent’slearningtask

•  Executeactionsinenvironment,observeresults,and•  Learnactionpolicyπ:S → Athatmaximizes

E [rt + γrt+1 + γ2rt+2 + … ] fromanystartingstateinS

•  Here0≤γ<1isthediscountfactorforfuturerewards

•  Targetfunctionistobelearnedisπ:S → A•  Butwehavenotrainingexamplesofform〈s, a〉•  Trainingexamplesareofform〈〈s, a〉,r〉


The Reinforcement Learning Problem

•  Goal: learn to choose actions that maximize r0 + γr1 + γ2r2 + …, where 0 ≤ γ < 1.


Value Function

•  To begin with, consider deterministic worlds.
•  For each possible policy π the agent might adopt, we can define an evaluation function over states:

     Vπ(s) ≡ rt + γ rt+1 + γ2 rt+2 + … ≡ Σi=0..∞ γi rt+i

   where rt, rt+1, … are generated by following policy π starting at state s.
•  Restated, the task is to learn the optimal policy π*:

     π* ≡ argmaxπ Vπ(s), (∀s)



What to Learn

•  We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
•  It could then do a look-ahead search to choose the best action from any state s, because

     π*(s) ≡ argmaxa [r(s, a) + γ V*(δ(s, a))]

A problem:
•  This works if the agent knows δ : S × A → S and r : S × A → ℜ.
•  But when it doesn't, it can't choose actions this way.


The Action-Value Function: The Q Function

•  Define a new function very similar to V*:

     Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

•  If the agent learns Q, it can choose the optimal action even without knowing δ:

     π*(s) ≡ argmaxa [r(s, a) + γ V*(δ(s, a))] = argmaxa Q(s, a)

•  Q is the evaluation function the agent will learn.


Training Rule to Learn Q

•  Note that Q and V* are closely related:

     V*(s) = maxa′ Q(s, a′)

•  This allows us to write Q recursively as

     Q(st, at) = r(st, at) + γ V*(δ(st, at)) = r(st, at) + γ maxa′ Q(δ(st, at), a′)

•  Let Q̂ denote the learner's current approximation to Q. Consider the training rule

     Q̂(s, a) ← r + γ maxa′ Q̂(s′, a′)

   where s′ is the state resulting from applying action a in state s.


Q-Learning

  Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa Q(st+1, a) − Q(st, at)]


Q-Learning for Deterministic Worlds

•  For each s, a initialize the table entry Q̂(s, a) ← 0.
•  Do forever:
  –  Observe the current state s.
  –  Select an action a and execute it.
  –  Receive immediate reward r.
  –  Observe the new state s′.
  –  Update the table entry for Q̂(s, a) as follows:
       Q̂(s, a) ← r + γ maxa′ Q̂(s′, a′)
  –  Update the state: s ← s′. (A sketch of this loop follows.)
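A sketch of this loop on the toy deterministic world used earlier; the exploration scheme (uniformly random actions) and the number of steps are illustrative.

```python
from collections import defaultdict
import random

gamma = 0.9
actions = ["stay", "move"]

def delta(s, a):  # deterministic transition function (toy world)
    return s if a == "stay" else ("s1" if s == "s0" else "s0")

def r(s, a):      # deterministic reward function (toy world)
    return 1.0 if (s, a) == ("s0", "move") else 0.0

Q = defaultdict(float)   # Q_hat(s, a), initialized to 0
s = "s0"
for _ in range(10000):
    a = random.choice(actions)           # any exploration scheme works here
    reward, s2 = r(s, a), delta(s, a)
    Q[s, a] = reward + gamma * max(Q[s2, a2] for a2 in actions)
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```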


Updating Q̂

  Q̂(s1, aright) ← r + γ maxa′ Q̂(s2, a′)
              ← 0 + 0.9 max{63, 81, 100}
              ← 90

Notice that if rewards are non-negative, then

  ∀(s, a, n):  Q̂n+1(s, a) ≥ Q̂n(s, a)

and

  ∀(s, a, n):  0 ≤ Q̂n(s, a) ≤ Q(s, a)


Convergence Theorem

•  Theorem: Q̂ converges to Q. Consider the case of a deterministic world, with bounded immediate rewards, where each 〈s, a〉 is visited infinitely often.
•  Proof: Define a full interval as an interval during which each 〈s, a〉 is visited at least once. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
•  Let Q̂n be the table after n updates, and Δn be the maximum error in Q̂n:

     Δn = maxs,a |Q̂n(s, a) − Q(s, a)|


Convergence Theorem (continued)

•  For any table entry Q̂n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂n+1(s, a) is

     |Q̂n+1(s, a) − Q(s, a)| = |(r + γ maxa′ Q̂n(s′, a′)) − (r + γ maxa′ Q(s′, a′))|
                             = γ |maxa′ Q̂n(s′, a′) − maxa′ Q(s′, a′)|


Convergence Theorem (continued)

  |Q̂n+1(s, a) − Q(s, a)| = |(r + γ maxa′ Q̂n(s′, a′)) − (r + γ maxa′ Q(s′, a′))|
                          = γ |maxa′ Q̂n(s′, a′) − maxa′ Q(s′, a′)|
                          ≤ γ maxa′ |Q̂n(s′, a′) − Q(s′, a′)|
                          ≤ γ maxs″,a′ |Q̂n(s″, a′) − Q(s″, a′)|
                          = γ Δn

Note that we used the general fact that

  |maxa f1(a) − maxa f2(a)| ≤ maxa |f1(a) − f2(a)|


The Non-Deterministic Case

•  What if the reward and next state are non-deterministic?
•  We redefine V and Q by taking expected values:

     Vπ(s) ≡ E[rt + γ rt+1 + γ2 rt+2 + …] ≡ E[Σi=0..∞ γi rt+i]

     Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]


The Non-Deterministic Case (continued)

Q-learning generalizes to non-deterministic worlds. Alter the training rule to

  Q̂n(s, a) ← (1 − αn) Q̂n−1(s, a) + αn [r + γ maxa′ Q̂n−1(s′, a′)]

where

  αn = 1 / (1 + visitsn(s, a))

Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992].
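A sketch of this decayed-α update as a reusable function; the states, actions, and γ are placeholders.

```python
from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)
gamma = 0.9

def update(s, a, reward, s2, actions):
    """Nondeterministic Q-learning step with alpha_n = 1/(1 + visits_n(s, a))."""
    visits[s, a] += 1
    alpha = 1.0 / (1 + visits[s, a])
    target = reward + gamma * max(Q[s2, a2] for a2 in actions)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```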


Sarsa

Always update the policy to be greedy with respect to the current estimate.
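The slide's algorithm box did not survive extraction. For reference, the standard Sarsa update backs up the value of the action a′ actually taken at s′ (on-policy), rather than the max over actions as in Q-learning; a minimal sketch:

```python
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.9   # illustrative parameters

def sarsa_update(s, a, reward, s2, a2):
    """Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)]."""
    Q[s, a] += alpha * (reward + gamma * Q[s2, a2] - Q[s, a])
```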


Temporal Difference Learning

•  Temporal Difference (TD) learning methods:
  –  Can be used when accurate models of the environment are unavailable – neither the state transition function nor the reward function is known.
  –  Can be extended to work with implicit representations of action-value functions.
  –  Are among the most useful reinforcement learning methods.


Q-Learning: The Simplest TD Method, TD(0)

[Backup diagram: from state st, a single sampled transition yields rt+1 and st+1; branches end in terminal states T.]


Example: TD-Gammon

•  Learns to play backgammon (Tesauro, 1995).
•  Immediate reward:
  –  +100 if win
  –  −100 if lose
  –  0 for all other states
•  Trained by playing 1.5 million games against itself.
•  Now comparable to the best human players.


Temporal Difference Learning: TD(λ)

Q-learning reduces the discrepancy between successive Q estimates.

One-step time difference:

  Q(1)(st, at) ≡ rt + γ maxa Q̂(st+1, a)

Why not two steps?

  Q(2)(st, at) ≡ rt + γ rt+1 + γ2 maxa Q̂(st+2, a)

Or n?

  Q(n)(st, at) ≡ rt + γ rt+1 + … + γn−1 rt+n−1 + γn maxa Q̂(st+n, a)

Blend all of these:

  Qλ(st, at) ≡ (1 − λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ2 Q(3)(st, at) + …]


Temporal Difference Learning: TD(λ) (continued)

•  The TD(λ) algorithm uses the above training rule.
•  It sometimes converges faster than Q-learning.
•  It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
•  Tesauro's TD-Gammon uses this algorithm.

Equivalent expression:

  Qλ(st, at) = rt + γ [(1 − λ) maxa Q̂(st+1, a) + λ Qλ(st+1, at+1)]


Advantages of TD Learning

•  TD methods do not require a model of the environment, only experience.
•  TD methods can be fully incremental:
  –  You can learn before knowing the final outcome
    •  Less memory
    •  Less peak computation
  –  You can learn without the final outcome
    •  From incomplete sequences
•  TD converges (under reasonable assumptions).


Optimality of TD(0)

•  Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
•  Compute updates according to TD(0), but only update estimates after each complete pass through the data.
•  For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.


Handling Large State Spaces

•  Replace the table with a neural net or other function approximator.
•  Virtually any function approximator will work, provided it can be updated in an online fashion.


Learning State-Action Values

•  Training examples are of the form 〈(st, at), vt〉.
•  The general gradient-descent rule (see the sketch below):

     θt+1 = θt + α [vt − Qt(st, at)] ∇θ Q(st, at)
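A sketch of this rule for a linear approximator Q(s, a) = θ·φ(s, a), for which ∇θQ = φ(s, a); the feature map φ is a hypothetical placeholder.

```python
import numpy as np

theta = np.zeros(4)

def phi(s, a):
    """Hypothetical feature vector for a (numeric) state-action pair."""
    return np.array([1.0, s, a, s * a])

def q(s, a):
    return theta @ phi(s, a)   # Q_t(s, a) = theta . phi(s, a)

def update(s, a, v_t, alpha=0.05):
    """theta <- theta + alpha [v_t - Q_t(s,a)] grad_theta Q(s,a)."""
    global theta
    theta = theta + alpha * (v_t - q(s, a)) * phi(s, a)
```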


Gradient Descent Methods

Assume Vt is a (sufficiently smooth) differentiable function of

  θt = (θt(1), θt(2), …, θt(n))T

for all s ∈ S.

Assume, for now, training examples of the form {〈st, Vπ(st)〉}.


Gradient Descent

  θt+1 = θt − ½ α ∇θ MSE(θt)
       = θt − ½ α ∇θ Σs∈S P(s) [Vπ(s) − Vt(s)]2
       = θt + α Σs∈S P(s) [Vπ(s) − Vt(s)] ∇θ Vt(s)


Gradient Descent (continued)

Use just the per-sample gradient:

  θt+1 = θt − ½ α ∇θ [Vπ(st) − Vt(st)]2
       = θt + α [Vπ(st) − Vt(st)] ∇θ Vt(st)

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

  E{[Vπ(st) − Vt(st)] ∇θ Vt(st)} = Σs∈S P(s) [Vπ(s) − Vt(s)] ∇θ Vt(s)


ButWeDon’thavetheseTargets

[ ] )()(

:instead targetshavejust weSuppose

1 ttttttt

t

sVsVv

v

θαθθ !

!!∇−+=+

{ }ely).appropriat decreases (provided minimum local a to

convergesdescent gradient then ,)( i.e.,

,)( of estimate unbiasedan is each If

α

π

π

tt

tt

sVvEsVv

=


What About TD(λ) Targets?

  θt+1 = θt + α [Rtλ − Vt(st)] ∇θ Vt(st)

Equivalently, with eligibility traces:

  θt+1 = θt + α δt et

where δt = rt+1 + γ Vt(st+1) − Vt(st), as usual, and

  et = γ λ et−1 + ∇θ Vt(st)
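A sketch of the eligibility-trace form for a linear value function V(s) = θ·φ(s); the feature map and the parameter values are illustrative.

```python
import numpy as np

alpha, gamma, lam = 0.05, 0.9, 0.8
theta = np.zeros(3)
e = np.zeros(3)                        # eligibility trace e_t

def phi(s):
    return np.array([1.0, s, s * s])   # hypothetical features

def v(s):
    return theta @ phi(s)              # V_t(s) = theta . phi(s)

def td_lambda_step(s, reward, s2):
    """One TD(lambda) update: delta_t, trace decay, then theta += alpha delta e."""
    global theta, e
    delta = reward + gamma * v(s2) - v(s)
    e = gamma * lam * e + phi(s)       # e_t = gamma lambda e_{t-1} + grad V(s)
    theta = theta + alpha * delta * e
```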


Linear Gradient Descent: Watkins' Q(λ)


Not Covered

•  Using hierarchical state and action representations
•  Coping with partial observability
•  Coping with extremely large state spaces
•  Neural basis of reinforcement learning