
STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

SERIES EDITORS
Ralf Herbrich, Amazon Development Center, Berlin, Germany
Thore Graepel, Microsoft Research Ltd., Cambridge, UK

AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning and pattern recognition through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning theory, natural language processing, computer vision, game AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.

PUBLISHED TITLES

BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha

UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow

HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau

COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao

COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim

MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland

SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik

A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami

STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama

MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye

REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou

ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Masashi Sugiyama
University of Tokyo
Tokyo, Japan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20150128
International Standard Book Number-13: 978-1-4398-5690-1 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Foreword
Preface
Author

I  Introduction

1  Introduction to Reinforcement Learning
   1.1  Reinforcement Learning
   1.2  Mathematical Formulation
   1.3  Structure of the Book
        1.3.1  Model-Free Policy Iteration
        1.3.2  Model-Free Policy Search
        1.3.3  Model-Based Reinforcement Learning

II  Model-Free Policy Iteration

2  Policy Iteration with Value Function Approximation
   2.1  Value Functions
        2.1.1  State Value Functions
        2.1.2  State-Action Value Functions
   2.2  Least-Squares Policy Iteration
        2.2.1  Immediate-Reward Regression
        2.2.2  Algorithm
        2.2.3  Regularization
        2.2.4  Model Selection
   2.3  Remarks

3  Basis Design for Value Function Approximation
   3.1  Gaussian Kernels on Graphs
        3.1.1  MDP-Induced Graph
        3.1.2  Ordinary Gaussian Kernels
        3.1.3  Geodesic Gaussian Kernels
        3.1.4  Extension to Continuous State Spaces
   3.2  Illustration
        3.2.1  Setup
        3.2.2  Geodesic Gaussian Kernels
        3.2.3  Ordinary Gaussian Kernels
        3.2.4  Graph-Laplacian Eigenbases
        3.2.5  Diffusion Wavelets
   3.3  Numerical Examples
        3.3.1  Robot-Arm Control
        3.3.2  Robot-Agent Navigation
   3.4  Remarks

4  Sample Reuse in Policy Iteration
   4.1  Formulation
   4.2  Off-Policy Value Function Approximation
        4.2.1  Episodic Importance Weighting
        4.2.2  Per-Decision Importance Weighting
        4.2.3  Adaptive Per-Decision Importance Weighting
        4.2.4  Illustration
   4.3  Automatic Selection of Flattening Parameter
        4.3.1  Importance-Weighted Cross-Validation
        4.3.2  Illustration
   4.4  Sample-Reuse Policy Iteration
        4.4.1  Algorithm
        4.4.2  Illustration
   4.5  Numerical Examples
        4.5.1  Inverted Pendulum
        4.5.2  Mountain Car
   4.6  Remarks

5  Active Learning in Policy Iteration
   5.1  Efficient Exploration with Active Learning
        5.1.1  Problem Setup
        5.1.2  Decomposition of Generalization Error
        5.1.3  Estimation of Generalization Error
        5.1.4  Designing Sampling Policies
        5.1.5  Illustration
   5.2  Active Policy Iteration
        5.2.1  Sample-Reuse Policy Iteration with Active Learning
        5.2.2  Illustration
   5.3  Numerical Examples
   5.4  Remarks

6  Robust Policy Iteration
   6.1  Robustness and Reliability in Policy Iteration
        6.1.1  Robustness
        6.1.2  Reliability
   6.2  Least Absolute Policy Iteration
        6.2.1  Algorithm
        6.2.2  Illustration
        6.2.3  Properties
   6.3  Numerical Examples
   6.4  Possible Extensions
        6.4.1  Huber Loss
        6.4.2  Pinball Loss
        6.4.3  Deadzone-Linear Loss
        6.4.4  Chebyshev Approximation
        6.4.5  Conditional Value-At-Risk
   6.5  Remarks

III  Model-Free Policy Search

7  Direct Policy Search by Gradient Ascent
   7.1  Formulation
   7.2  Gradient Approach
        7.2.1  Gradient Ascent
        7.2.2  Baseline Subtraction for Variance Reduction
        7.2.3  Variance Analysis of Gradient Estimators
   7.3  Natural Gradient Approach
        7.3.1  Natural Gradient Ascent
        7.3.2  Illustration
   7.4  Application in Computer Graphics: Artist Agent
        7.4.1  Sumie Painting
        7.4.2  Design of States, Actions, and Immediate Rewards
        7.4.3  Experimental Results
   7.5  Remarks

8  Direct Policy Search by Expectation-Maximization
   8.1  Expectation-Maximization Approach
   8.2  Sample Reuse
        8.2.1  Episodic Importance Weighting
        8.2.2  Per-Decision Importance Weight
        8.2.3  Adaptive Per-Decision Importance Weighting
        8.2.4  Automatic Selection of Flattening Parameter
        8.2.5  Reward-Weighted Regression with Sample Reuse
   8.3  Numerical Examples
   8.4  Remarks

9  Policy-Prior Search
   9.1  Formulation
   9.2  Policy Gradients with Parameter-Based Exploration
        9.2.1  Policy-Prior Gradient Ascent
        9.2.2  Baseline Subtraction for Variance Reduction
        9.2.3  Variance Analysis of Gradient Estimators
        9.2.4  Numerical Examples
   9.3  Sample Reuse in Policy-Prior Search
        9.3.1  Importance Weighting
        9.3.2  Variance Reduction by Baseline Subtraction
        9.3.3  Numerical Examples
   9.4  Remarks

IV  Model-Based Reinforcement Learning

10  Transition Model Estimation
    10.1  Conditional Density Estimation
          10.1.1  Regression-Based Approach
          10.1.2  ε-Neighbor Kernel Density Estimation
          10.1.3  Least-Squares Conditional Density Estimation
    10.2  Model-Based Reinforcement Learning
    10.3  Numerical Examples
          10.3.1  Continuous Chain Walk
          10.3.2  Humanoid Robot Control
    10.4  Remarks

11  Dimensionality Reduction for Transition Model Estimation
    11.1  Sufficient Dimensionality Reduction
    11.2  Squared-Loss Conditional Entropy
          11.2.1  Conditional Independence
          11.2.2  Dimensionality Reduction with SCE
          11.2.3  Relation to Squared-Loss Mutual Information
    11.3  Numerical Examples
          11.3.1  Artificial and Benchmark Datasets
          11.3.2  Humanoid Robot
    11.4  Remarks

References

Index

Foreword

How can agents learn from experience without an omniscient teacher explicitly telling them what to do? Reinforcement learning is the area within machine learning that investigates how an agent can learn an optimal behavior by correlating generic reward signals with its past actions. The discipline draws upon and connects key ideas from behavioral psychology, economics, control theory, operations research, and other disparate fields to model the learning process. In reinforcement learning, the environment is typically modeled as a Markov decision process that provides immediate reward and state information to the agent. However, the agent does not have access to the transition structure of the environment and needs to learn how to choose appropriate actions to maximize its overall reward over time.

This book by Prof. Masashi Sugiyama covers the range of reinforcement learning algorithms from a fresh, modern perspective. With a focus on the statistical properties of estimating parameters for reinforcement learning, the book relates a number of different approaches across the gamut of learning scenarios. The algorithms are divided into model-free approaches that do not explicitly model the dynamics of the environment, and model-based approaches that construct descriptive process models for the environment. Within each of these categories, there are policy iteration algorithms which estimate value functions, and policy search algorithms which directly manipulate policy parameters.

For each of these different reinforcement learning scenarios, the book meticulously lays out the associated optimization problems. A careful analysis is given for each of these cases, with an emphasis on understanding the statistical properties of the resulting estimators and learned parameters. Each chapter contains illustrative examples of applications of these algorithms, with quantitative comparisons between the different techniques. These examples are drawn from a variety of practical problems, including robot motion control and Asian brush painting.

In summary, the book provides a thought-provoking statistical treatment of reinforcement learning algorithms, reflecting the author's work and sustained research in this area. It is a contemporary and welcome addition to the rapidly growing machine learning literature. Both beginner students and experienced researchers will find it to be an important source for understanding the latest reinforcement learning techniques.

Daniel D. Lee
GRASP Laboratory
School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA, USA

Preface

In the coming big data era, statistics and machine learning are becoming indispensable tools for data mining. Depending on the type of data analysis, machine learning methods are categorized into three groups:

• Supervised learning: Given input-output paired data, the objective of supervised learning is to analyze the input-output relation behind the data. Typical tasks of supervised learning include regression (predicting the real value), classification (predicting the category), and ranking (predicting the order). Supervised learning is the most common data analysis and has been extensively studied in the statistics community for a long time. A recent trend of supervised learning research in the machine learning community is to utilize side information in addition to the input-output paired data to further improve the prediction accuracy. For example, semi-supervised learning utilizes additional input-only data, transfer learning borrows data from other similar learning tasks, and multi-task learning solves multiple related learning tasks simultaneously.

• Unsupervised learning: Given input-only data, the objective of unsupervised learning is to find something useful in the data. Due to this ambiguous definition, unsupervised learning research tends to be more ad hoc than supervised learning. Nevertheless, unsupervised learning is regarded as one of the most important tools in data mining because of its automatic and inexpensive nature. Typical tasks of unsupervised learning include clustering (grouping the data based on their similarity), density estimation (estimating the probability distribution behind the data), anomaly detection (removing outliers from the data), data visualization (reducing the dimensionality of the data to 1–3 dimensions), and blind source separation (extracting the original source signals from their mixtures). Also, unsupervised learning methods are sometimes used as data pre-processing tools in supervised learning.

• Reinforcement learning: Supervised learning is a sound approach, but collecting input-output paired data is often too expensive. Unsupervised learning is inexpensive to perform, but it tends to be ad hoc. Reinforcement learning is placed between supervised learning and unsupervised learning: no explicit supervision (output data) is provided, but we still want to learn the input-output relation behind the data. Instead of output data, reinforcement learning utilizes rewards, which evaluate the validity of predicted outputs. Giving implicit supervision such as rewards is usually much easier and less costly than giving explicit supervision, and therefore reinforcement learning can be a vital approach in modern data analysis. Various supervised and unsupervised learning techniques are also utilized in the framework of reinforcement learning.

This book is devoted to introducing fundamental concepts and practical algorithms of statistical reinforcement learning from the modern machine learning viewpoint. Various illustrative examples, mainly in robotics, are also provided to help understand the intuition and usefulness of reinforcement learning techniques. Target readers are graduate-level students in computer science and applied statistics as well as researchers and engineers in related fields. Basic knowledge of probability and statistics, linear algebra, and elementary calculus is assumed.

Machine learning is a rapidly developing area of science, and the author hopes that this book helps the reader grasp various exciting topics in reinforcement learning and stimulates readers' interest in machine learning. Please visit our website at: http://www.ms.k.u-tokyo.ac.jp.

Masashi Sugiyama
University of Tokyo, Japan

Author

Masashi Sugiyama was born in Osaka, Japan, in 1974. He received Bachelor, Master, and Doctor of Engineering degrees in Computer Science, all from Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor in the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014.

He received an Alexander von Humboldt Foundation Research Fellowship and researched at Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and researched at the University of Edinburgh, Scotland. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, and the Young Scientists' Prize from the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology for his contribution to the density-ratio paradigm of machine learning.

His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control. He published Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012) and Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).

The author thanks his collaborators, Hirotaka Hachiya, Sethu Vijayakumar, Jan Peters, Jun Morimoto, Zhao Tingting, Ning Xie, Voot Tangkaratt, Tetsuro Morimura, and Norikazu Sugimoto, for exciting and creative discussions. He acknowledges support from MEXT KAKENHI 17700142, 18300057, 20680007, 23120004, 23300069, 25700022, and 26280054, the Okawa Foundation, EU Erasmus Mundus Fellowship, AOARD, SCAT, the JST PRESTO program, and the FIRST program.


Part I

Introduction


Chapter 1

Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.

1.1 Reinforcement Learning

A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and evaluation of that action is given as a "reward" (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in the short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.

Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide him to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions which correspond to movement toward the north, south, east, and west directions. States and actions are fundamental elements that define a reinforcement learning problem.

FIGURE 1.1: Reinforcement learning (interaction between the agent and the environment through actions, states, and rewards).

Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action "east" or from state 18 to state 17 by action "north" (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as more preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.

To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.

In the maze problem, the value of a state can be computed from the values of neighboring states.

FIGURE 1.2: A maze problem. We want to guide the robot agent to the goal.

FIGURE 1.3: States are visitable positions in the maze.

FIGURE 1.4: Actions are possible movements of the robot agent.

FIGURE 1.5: Transitions specify connections between states via actions. Thus, knowing the transitions means knowing the map of the maze.

FIGURE 1.6: A positive reward is given when the robot agent reaches the goal. Thus, the reward specifies the goal location.

FIGURE 1.7: A policy specifies an action the robot agent takes at each state. Thus, a policy also specifies a trajectory, which is a series of states and actions that the robot agent takes from a start state to an end state.

FIGURE 1.8: Values of each state when reward +1 is given at the goal state and the reward is discounted at the rate of 0.9 according to the number of steps.

For example, let us compute the value of state 7 (see Figure 1.5 again). From state 7, the robot agent can reach state 2, state 6, and state 8 by a single step. If the robot agent knows the values of these neighboring states, the best action the robot agent should take is to visit the neighboring state with the largest value, because this allows the robot agent to earn the largest amount of rewards in the long run. However, the values of neighboring states are unknown in practice and thus they should also be computed.

Now, we need to solve 3 subproblems of computing the values of state 2, state 6, and state 8. Then, in the same way, these subproblems are further decomposed as follows:

• The problem of computing the value of state 2 is decomposed into 3 subproblems of computing the values of state 1, state 3, and state 7.

• The problem of computing the value of state 6 is decomposed into 2 subproblems of computing the values of state 1 and state 7.

• The problem of computing the value of state 8 is decomposed into 3 subproblems of computing the values of state 3, state 7, and state 9.

Thus, by removing overlaps, the original problem of computing the value of state 7 has been decomposed into 6 unique subproblems: computing the values of state 1, state 2, state 3, state 6, state 8, and state 9.

If we further continue this problem decomposition, we encounter the problem of computing the value of state 17, where the robot agent can receive reward +1. Then the values of state 12 and state 18 can be explicitly computed. Indeed, if a discounting factor (a multiplicative penalty for delayed rewards) is 0.9, the values of state 12 and state 18 are $(0.9)^1 = 0.9$. Then we can further know that the values of state 13 and state 19 are $(0.9)^2 = 0.81$. By repeating this procedure, we can compute the values of all states (as illustrated in Figure 1.8). Based on these values, we can know the optimal action the robot agent should take, i.e., an action that leads the robot agent to the neighboring state with the largest value.
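To make the dynamic-programming computation above concrete, the following is a minimal Python sketch under simplifying assumptions: only a small hypothetical fragment of the maze around the goal is encoded, and the value of a state is taken to be the discounted value of its best neighboring state, with value 1 at the goal, so that the resulting numbers match Figure 1.8.

```python
# A minimal sketch of the backward value computation described above. The maze
# connectivity is a hypothetical fragment around the goal (state 17), not the
# full 21-state maze from the figures.
gamma = 0.9                       # discount factor
# neighbors[s]: states reachable from s in one step (illustrative only)
neighbors = {17: [12, 18], 12: [17, 13], 18: [17, 19],
             13: [12, 14], 19: [18, 20], 14: [13], 20: [19]}

values = {s: 0.0 for s in neighbors}
values[17] = 1.0                  # reward +1 is obtained at the goal state

for _ in range(50):               # repeat the update until convergence
    for s in neighbors:
        if s == 17:
            continue
        # value of s = discounted value of the best neighboring state
        values[s] = gamma * max(values[s2] for s2 in neighbors[s])

print(round(values[12], 2), round(values[13], 2))   # 0.9 and 0.81 (cf. Figure 1.8)
```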

Note that, in real-world reinforcement learning tasks, transitions are often not deterministic but stochastic, because of some external disturbance; in the case of the above maze example, the floor may be slippery and thus the robot agent cannot move as perfectly as it desires. Also, stochastic policies, in which the mapping from a state to an action is not deterministic, are often employed in many reinforcement learning formulations. In these cases, the formulation becomes slightly more complicated, but essentially the same idea can still be used for solving the problem.

To further highlight the notable advantage of reinforcement learning that not the immediate rewards but the long-term accumulation of rewards is maximized, let us consider a mountain-car problem (Figure 1.9). There are two mountains and a car is located in a valley between the mountains. The goal is to guide the car to the top of the right-hand hill. However, the engine of the car is not powerful enough to directly run up the right-hand hill and reach the goal. The optimal policy in this problem is to first climb the left-hand hill and then go down the slope to the right with full acceleration to get to the goal (Figure 1.10).

Suppose we define the immediate reward such that moving the car to the right gives a positive reward +1 and moving the car to the left gives a negative reward −1. Then, a greedy solution that maximizes the immediate reward moves the car to the right, which does not allow the car to get to the goal due to lack of engine power. On the other hand, reinforcement learning seeks a solution that maximizes the return, i.e., the discounted sum of immediate rewards that the agent can collect over the entire trajectory. This means that the reinforcement learning solution will first move the car to the left even though negative rewards are given for a while, to receive more positive rewards in the future. Thus, the notion of "prior investment" can be naturally incorporated in the reinforcement learning framework.

FIGURE 1.9: A mountain-car problem. We want to guide the car to the goal. However, the engine of the car is not powerful enough to directly run up the right-hand hill.

FIGURE 1.10: The optimal policy to reach the goal is to first climb the left-hand hill and then head for the right-hand hill with full acceleration.

1.2 Mathematical Formulation

In this section, the reinforcement learning problem is mathematically formulated as the problem of controlling a computer agent under a Markov decision process.

We consider the problem of controlling a computer agent under a discrete-time Markov decision process (MDP). That is, at each discrete time step t, the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, makes a transition to $s_{t+1} \in \mathcal{S}$, and receives an immediate reward
\[
r_t = r(s_t, a_t, s_{t+1}) \in \mathbb{R}.
\]
$\mathcal{S}$ and $\mathcal{A}$ are called the state space and the action space, respectively. $r(s,a,s')$ is called the immediate reward function.

The initial position of the agent, $s_1$, is drawn from the initial probability distribution. If the state space $\mathcal{S}$ is discrete, the initial probability distribution is specified by the probability mass function $P(s)$ such that
\[
0 \le P(s) \le 1, \;\; \forall s \in \mathcal{S}, \qquad \sum_{s \in \mathcal{S}} P(s) = 1.
\]
If the state space $\mathcal{S}$ is continuous, the initial probability distribution is specified by the probability density function $p(s)$ such that
\[
p(s) \ge 0, \;\; \forall s \in \mathcal{S}, \qquad \int_{s \in \mathcal{S}} p(s)\, \mathrm{d}s = 1.
\]
Because the probability mass function $P(s)$ can be expressed as a probability density function $p(s)$ by using the Dirac delta function¹ $\delta(s)$ as
\[
p(s) = \sum_{s' \in \mathcal{S}} \delta(s' - s)\, P(s'),
\]
we focus only on the continuous state space below.

The dynamics of the environment, which represent the transition probability from state s to state s′ when action a is taken, are characterized by the transition probability distribution with conditional probability density $p(s'|s,a)$:
\[
p(s'|s,a) \ge 0, \;\; \forall s, s' \in \mathcal{S},\; \forall a \in \mathcal{A}, \qquad
\int_{s' \in \mathcal{S}} p(s'|s,a)\, \mathrm{d}s' = 1, \;\; \forall s \in \mathcal{S},\; \forall a \in \mathcal{A}.
\]

The agent's decision is determined by a policy π. When we consider a deterministic policy, where the action to take at each state is uniquely determined, we regard the policy as a function of states:
\[
\pi(s) \in \mathcal{A}, \;\; \forall s \in \mathcal{S}.
\]
Action a can be either discrete or continuous. On the other hand, when developing more sophisticated reinforcement learning algorithms, it is often more convenient to consider a stochastic policy, where an action to take at a state is probabilistically determined. Mathematically, a stochastic policy is a conditional probability density of taking action a at state s:
\[
\pi(a|s) \ge 0, \;\; \forall s \in \mathcal{S},\; \forall a \in \mathcal{A}, \qquad
\int_{a \in \mathcal{A}} \pi(a|s)\, \mathrm{d}a = 1, \;\; \forall s \in \mathcal{S}.
\]
By introducing stochasticity in action selection, we can more actively explore the entire state space. Note that when action a is discrete, the stochastic policy is expressed using Dirac's delta function, as in the case of the state densities.

A sequence of states and actions obtained by the procedure described in Figure 1.11 is called a trajectory.

¹ The Dirac delta function δ(·) allows us to obtain the value of a function f at a point τ via convolution with f:
\[
\int_{-\infty}^{\infty} f(s)\, \delta(s - \tau)\, \mathrm{d}s = f(\tau).
\]
Dirac's delta function δ(·) can be expressed as the Gaussian density with standard deviation σ → 0:
\[
\delta(a) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{a^2}{2\sigma^2} \right).
\]

1. The initial state $s_1$ is chosen following the initial probability $p(s)$.
2. For t = 1, ..., T:
   (a) The action $a_t$ is chosen following the policy $\pi(a_t|s_t)$.
   (b) The next state $s_{t+1}$ is determined according to the transition probability $p(s_{t+1}|s_t, a_t)$.

FIGURE 1.11: Generation of a trajectory sample.
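The following is a minimal Python sketch of the sampling procedure in Figure 1.11, together with the discounted sum of rewards along the sampled trajectory (the return R(h) introduced below). All quantities (initial distribution, policy, transition probabilities, rewards) are random stand-ins for a small hypothetical finite MDP, not the book's examples.

```python
import numpy as np

# A minimal sketch of trajectory generation (Figure 1.11) on a hypothetical
# tabular MDP, followed by the computation of the return R(h).
n_states, n_actions, T, gamma = 4, 2, 10, 0.9
rng = np.random.default_rng(0)
p_init = np.full(n_states, 1.0 / n_states)                         # p(s)
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # pi(a|s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s,a)
r = rng.normal(size=(n_states, n_actions, n_states))               # r(s,a,s')

s = rng.choice(n_states, p=p_init)      # step 1: draw the initial state
h, rewards = [], []
for t in range(T):                      # step 2: roll out for T steps
    a = rng.choice(n_actions, p=pi[s])               # (a) draw the action
    s_next = rng.choice(n_states, p=P[s, a])         # (b) draw the next state
    h.extend([s, a])
    rewards.append(r[s, a, s_next])
    s = s_next
h.append(s)                             # h = [s1, a1, ..., sT, aT, sT+1]

R_h = sum(gamma ** t * rew for t, rew in enumerate(rewards))   # return R(h)
print(h, R_h)
```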

When the number of steps, T, is finite or infinite, the situation is called the finite horizon or infinite horizon, respectively. Below, we focus on the finite-horizon case because the trajectory length is always finite in practice. We denote a trajectory by h (which stands for a "history"):
\[
h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}].
\]
The discounted sum of immediate rewards along the trajectory h is called the return:
\[
R(h) = \sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t, s_{t+1}),
\]
where γ ∈ [0, 1) is called the discount factor for future rewards. The goal of reinforcement learning is to learn the optimal policy $\pi^*$ that maximizes the expected return:
\[
\pi^* = \operatorname*{argmax}_{\pi}\; \mathbb{E}_{p_\pi(h)}\big[ R(h) \big],
\]
where $\mathbb{E}_{p_\pi(h)}$ denotes the expectation over trajectory h drawn from $p_\pi(h)$, and $p_\pi(h)$ denotes the probability density of observing trajectory h under policy π:
\[
p_\pi(h) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t).
\]
"argmax" gives the maximizer of a function (Figure 1.12).

FIGURE 1.12: "argmax" gives the maximizer of a function, while "max" gives the maximum value of a function.

For policy learning, various methods have been developed so far. These methods can be classified into model-based reinforcement learning and model-free reinforcement learning. The term "model" indicates a model of the transition probability $p(s'|s,a)$. In the model-based reinforcement learning approach, the transition probability is learned in advance and the learned transition model is explicitly used for policy learning. On the other hand, in the model-free reinforcement learning approach, policies are learned without explicitly estimating the transition probability. If strong prior knowledge of the transition model is available, the model-based approach would be more favorable. On the other hand, learning the transition model without prior knowledge itself is a hard statistical estimation problem. Thus, if good prior knowledge of the transition model is not available, the model-free approach would be more promising.

1.3 Structure of the Book

In this section, we explain the structure of this book, which covers major reinforcement learning approaches.

1.3.1 Model-Free Policy Iteration

Policy iteration is a popular and well-studied approach to reinforcement learning. The key idea of policy iteration is to determine policies based on the value function.

Let us first introduce the state-action value function $Q^\pi(s,a) \in \mathbb{R}$ for policy π, which is defined as the expected return the agent will receive when taking action a at state s and following policy π thereafter:
\[
Q^\pi(s,a) = \mathbb{E}_{p_\pi(h)}\big[ R(h) \,\big|\, s_1 = s, a_1 = a \big],
\]
where "$|\,s_1 = s, a_1 = a$" means that the initial state $s_1$ and the first action $a_1$ are fixed at $s_1 = s$ and $a_1 = a$, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of R(h) given $s_1 = s$ and $a_1 = a$.

Let $Q^*(s,a)$ be the optimal state-action value at state s for action a, defined as
\[
Q^*(s,a) = \max_{\pi} Q^\pi(s,a).
\]

Based on the optimal state-action value function, the optimal action the agent should take at state s is deterministically given as the maximizer of $Q^*(s,a)$ with respect to a. Thus, the optimal policy $\pi^*(a|s)$ is given by
\[
\pi^*(a|s) = \delta\Big( a - \operatorname*{argmax}_{a'} Q^*(s,a') \Big),
\]
where δ(·) denotes Dirac's delta function.

Because the optimal state-action value $Q^*$ is unknown in practice, the policy iteration algorithm alternately evaluates the value $Q^\pi$ for the current policy π and updates the policy π based on the current value $Q^\pi$ (Figure 1.13):

1. Initialize policy π(a|s).
2. Repeat the following two steps until the policy π(a|s) converges.
   (a) Policy evaluation: compute the state-action value function $Q^\pi(s,a)$ for the current policy π(a|s).
   (b) Policy improvement: update the policy as
   \[
   \pi(a|s) \longleftarrow \delta\Big( a - \operatorname*{argmax}_{a'} Q^\pi(s,a') \Big).
   \]

FIGURE 1.13: Algorithm of policy iteration.
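As a concrete illustration of the loop in Figure 1.13, the following is a minimal Python sketch for a small hypothetical tabular MDP. Policy evaluation is performed exactly by solving the Bellman equation in matrix form, and the deterministic policy is represented simply as an array of action indices rather than a delta function; all quantities are random stand-ins.

```python
import numpy as np

# A minimal sketch of policy iteration on a hypothetical tabular MDP with
# known transition probabilities P[s, a, s'] and expected rewards R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

pi = np.zeros(n_states, dtype=int)        # 1. initialize a deterministic policy
while True:
    # 2(a) Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy
    P_pi = P[np.arange(n_states), pi]                 # (n_states, n_states)
    R_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    Q = R + gamma * P @ V                             # Q(s,a) for all actions
    # 2(b) Policy improvement: greedy update of the policy
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):                    # stop when the policy converges
        break
    pi = pi_new

print(pi, V)
```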

The performance of the above policy iteration algorithm depends on the quality of policy evaluation; i.e., how to learn the state-action value function from data is the key issue. Value function approximation corresponds to a regression problem in statistics and machine learning. Thus, various statistical machine learning techniques can be utilized for better value function approximation. Part II of this book addresses this issue, including least-squares estimation and model selection (Chapter 2), basis function design (Chapter 3), efficient sample reuse (Chapter 4), active learning (Chapter 5), and robust learning (Chapter 6).

1.3.2 Model-Free Policy Search

One of the potential weaknesses of policy iteration is that policies are learned via value functions. Thus, improving the quality of value function approximation does not necessarily contribute to improving the quality of the resulting policies. Furthermore, a small change in value functions can cause a big difference in policies, which is problematic in, e.g., robot control because such instability can damage the robot's physical system. Another weakness of policy iteration is that policy improvement, i.e., finding the maximizer of $Q^\pi(s,a)$ with respect to a, is computationally expensive or difficult when the action space $\mathcal{A}$ is continuous.

Policy search, which directly learns policy functions without estimating value functions, can overcome the above limitations. The basic idea of policy search is to find the policy that maximizes the expected return:
\[
\pi^* = \operatorname*{argmax}_{\pi}\; \mathbb{E}_{p_\pi(h)}\big[ R(h) \big].
\]
In policy search, how to find a good policy function in a vast function space is the key issue to be addressed. Part III of this book focuses on policy search and introduces gradient-based methods and the expectation-maximization method in Chapter 7 and Chapter 8, respectively. However, a potential weakness of these direct policy search methods is their instability due to the stochasticity of policies. To overcome the instability problem, an alternative approach called policy-prior search, which learns the policy-prior distribution for deterministic policies, is introduced in Chapter 9. Efficient sample reuse in policy-prior search is also discussed there.

1.3.3 Model-Based Reinforcement Learning

In the above model-free approaches, policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment, $p(s'|s,a)$). On the other hand, the model-based approach explicitly learns the environment in advance and uses the learned environment model for policy learning.

No additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is particularly useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Part IV of this book focuses on model-based reinforcement learning. In Chapter 10, a non-parametric transition model estimator that possesses the optimal convergence rate with high computational efficiency is introduced. However, even with the optimal convergence rate, estimating the transition model in high-dimensional state and action spaces is still challenging. In Chapter 11, a dimensionality reduction method that can be efficiently embedded into the transition model estimation procedure is introduced and its usefulness is demonstrated through experiments.

Part II

Model-Free Policy Iteration

In Part II, we introduce a reinforcement learning approach based on value functions called policy iteration.

The key issue in the policy iteration framework is how to accurately approximate the value function from a small number of data samples. In Chapter 2, a fundamental framework of value function approximation based on least squares is explained. In this least-squares formulation, how to design good basis functions is critical for better value function approximation. A practical basis design method based on manifold-based smoothing (Chapelle et al., 2006) is explained in Chapter 3.

In real-world reinforcement learning tasks, gathering data is often costly. In Chapter 4, we describe a method for efficiently reusing previously collected samples in the framework of covariate shift adaptation (Sugiyama & Kawanabe, 2012). In Chapter 5, we apply a statistical active learning technique (Sugiyama & Kawanabe, 2012) to optimizing data collection strategies for reducing the sampling cost.

Finally, in Chapter 6, an outlier-robust extension of the least-squares method based on robust regression (Huber, 1981) is introduced. Such a robust method is highly useful in handling noisy real-world data.


Chapter 2

Policy Iteration with Value Function Approximation

In this chapter, we introduce the framework of least-squares policy iteration. In Section 2.1, we first explain the framework of policy iteration, which iteratively executes the policy evaluation and policy improvement steps for finding better policies. Then, in Section 2.2, we show how value function approximation in the policy evaluation step can be formulated as a regression problem and introduce a least-squares algorithm called least-squares policy iteration (Lagoudakis & Parr, 2003). Finally, this chapter is concluded in Section 2.3.

2.1 Value Functions

A traditional way to learn the optimal policy is based on the value function. In this section, we introduce two types of value functions, the state value function and the state-action value function, and explain how they can be used for finding better policies.

2.1.1 State Value Functions

The state value function $V^\pi(s) \in \mathbb{R}$ for policy π measures the "value" of state s, which is defined as the expected return the agent will receive when following policy π from state s:
\[
V^\pi(s) = \mathbb{E}_{p_\pi(h)}\big[ R(h) \,\big|\, s_1 = s \big],
\]
where "$|\,s_1 = s$" means that the initial state $s_1$ is fixed at $s_1 = s$. That is, the right-hand side of the above equation denotes the conditional expectation of return R(h) given $s_1 = s$.

By recursion, $V^\pi(s)$ can be expressed as
\[
V^\pi(s) = \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[ r(s,a,s') + \gamma V^\pi(s') \big],
\]
where $\mathbb{E}_{p(s'|s,a)\pi(a|s)}$ denotes the conditional expectation over a and s′ drawn from $p(s'|s,a)\pi(a|s)$ given s. This recursive expression is called the Bellman equation for state values. $V^\pi(s)$ may be obtained by repeating the following update from some initial estimate:
\[
V^\pi(s) \longleftarrow \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[ r(s,a,s') + \gamma V^\pi(s') \big].
\]

The optimal state value at state s, $V^*(s)$, is defined as the maximizer of the state value $V^\pi(s)$ with respect to policy π:
\[
V^*(s) = \max_{\pi} V^\pi(s).
\]
Based on the optimal state value $V^*(s)$, the optimal policy $\pi^*$, which is deterministic, can be obtained as
\[
\pi^*(a|s) = \delta\big(a - a^*(s)\big),
\]
where δ(·) denotes Dirac's delta function and
\[
a^*(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \Big\{ \mathbb{E}_{p(s'|s,a)}\big[ r(s,a,s') + \gamma V^*(s') \big] \Big\}.
\]
$\mathbb{E}_{p(s'|s,a)}$ denotes the conditional expectation over s′ drawn from $p(s'|s,a)$ given s and a. This algorithm, first computing the optimal value function and then obtaining the optimal policy based on the optimal value function, is called value iteration.
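The following is a minimal Python sketch of value iteration for a small hypothetical tabular MDP with known transition probabilities; expectations are computed exactly from the transition matrix rather than estimated from data, and all quantities are random stand-ins.

```python
import numpy as np

# A minimal sketch of tabular value iteration, assuming a hypothetical MDP
# with known transition probabilities P[s, a, s'] and rewards R[s, a, s'].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions, n_states))              # r(s,a,s')

V = np.zeros(n_states)
for _ in range(1000):
    # Q(s,a) = E_{p(s'|s,a)}[ r(s,a,s') + gamma * V(s') ]
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)                 # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop when the update converges
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)                     # greedy (deterministic) optimal policy
print(V, pi)
```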

A possible variation is to iteratively perform policy evaluation and improvement as
\[
\text{Policy evaluation: } V^\pi(s) \longleftarrow \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[ r(s,a,s') + \gamma V^\pi(s') \big].
\]
\[
\text{Policy improvement: } \pi^*(a|s) \longleftarrow \delta\big(a - a^\pi(s)\big),
\]
where
\[
a^\pi(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \Big\{ \mathbb{E}_{p(s'|s,a)}\big[ r(s,a,s') + \gamma V^\pi(s') \big] \Big\}.
\]
These two steps may be iterated either for all states at once or in a state-by-state manner. This iterative algorithm is called the policy iteration (based on state value functions).

2.1.2 State-Action Value Functions

In the above policy improvement step, the action to take is optimized based on the state value function $V^\pi(s)$. A more direct way to handle this action optimization is to consider the state-action value function $Q^\pi(s,a)$ for policy π:
\[
Q^\pi(s,a) = \mathbb{E}_{p_\pi(h)}\big[ R(h) \,\big|\, s_1 = s, a_1 = a \big],
\]
where "$|\,s_1 = s, a_1 = a$" means that the initial state $s_1$ and the first action $a_1$ are fixed at $s_1 = s$ and $a_1 = a$, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of return R(h) given $s_1 = s$ and $a_1 = a$.

Let $r(s,a)$ be the expected immediate reward when action a is taken at state s:
\[
r(s,a) = \mathbb{E}_{p(s'|s,a)}\big[ r(s,a,s') \big].
\]
Then, in the same way as $V^\pi(s)$, $Q^\pi(s,a)$ can be expressed by recursion as
\[
Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\big[ Q^\pi(s',a') \big],
\tag{2.1}
\]
where $\mathbb{E}_{\pi(a'|s')p(s'|s,a)}$ denotes the conditional expectation over s′ and a′ drawn from $\pi(a'|s')p(s'|s,a)$ given s and a. This recursive expression is called the Bellman equation for state-action values.

Based on the Bellman equation, the optimal policy may be obtained by iterating the following two steps:
\[
\text{Policy evaluation: } Q^\pi(s,a) \longleftarrow r(s,a) + \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\big[ Q^\pi(s',a') \big].
\]
\[
\text{Policy improvement: } \pi(a|s) \longleftarrow \delta\Big( a - \operatorname*{argmax}_{a' \in \mathcal{A}} Q^\pi(s,a') \Big).
\]

In practice, it is sometimes preferable to use an explorative policy. For example, Gibbs policy improvement is given by
\[
\pi(a|s) \longleftarrow \frac{\exp\big(Q^\pi(s,a)/\tau\big)}{\int_{\mathcal{A}} \exp\big(Q^\pi(s,a')/\tau\big)\,\mathrm{d}a'},
\]
where τ > 0 determines the degree of exploration. When the action space $\mathcal{A}$ is discrete, ε-greedy policy improvement is also used:
\[
\pi(a|s) \longleftarrow
\begin{cases}
1 - \varepsilon + \varepsilon/|\mathcal{A}| & \text{if } a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q^\pi(s,a'),\\
\varepsilon/|\mathcal{A}| & \text{otherwise},
\end{cases}
\]
where ε ∈ (0, 1] determines the randomness of the new policy.

The above policy improvement step based on $Q^\pi(s,a)$ is essentially the same as the one based on $V^\pi(s)$ explained in Section 2.1.1. However, the policy improvement step based on $Q^\pi(s,a)$ does not contain the expectation operator and thus policy improvement can be more directly carried out. For this reason, we focus on the above formulation, called policy iteration based on state-action value functions.
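The following is a minimal Python sketch of the two explorative policy-improvement rules above for a discrete action space; the Q-table used here is a toy stand-in rather than a learned value function.

```python
import numpy as np

# A minimal sketch of Gibbs (softmax) and epsilon-greedy policy improvement
# for a discrete action space, applied row-wise to a hypothetical Q-table.
def gibbs_policy(Q, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau)."""
    z = Q / tau
    z -= z.max(axis=1, keepdims=True)          # stabilize the exponential
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def epsilon_greedy_policy(Q, eps=0.1):
    """pi(a|s) = 1 - eps + eps/|A| for the greedy action, eps/|A| otherwise."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - eps
    return pi

Q = np.array([[1.0, 0.5, 0.2], [0.1, 0.4, 0.3]])   # toy Q-table (2 states, 3 actions)
print(gibbs_policy(Q, tau=0.5))
print(epsilon_greedy_policy(Q, eps=0.2))
```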

2.2 Least-Squares Policy Iteration

As explained in the previous section, the optimal policy function may be learned via the state-action value function $Q^\pi(s,a)$. However, learning the state-action value function from data is a challenging task for continuous states s and actions a.

Learning the state-action value function from data can actually be regarded as a regression problem in statistics and machine learning. In this section, we explain how the least-squares regression technique can be employed in value function approximation, which is called least-squares policy iteration (Lagoudakis & Parr, 2003).

2.2.1 Immediate-Reward Regression

Let us approximate the state-action value function $Q^\pi(s,a)$ by the following linear-in-parameter model:
\[
\sum_{b=1}^{B} \theta_b\, \phi_b(s,a),
\]
where $\{\phi_b(s,a)\}_{b=1}^{B}$ are basis functions, B denotes the number of basis functions, and $\{\theta_b\}_{b=1}^{B}$ are parameters. Specific designs of basis functions will be discussed in Chapter 3. Below, we use the following vector representation for compactly expressing the parameters and basis functions:
\[
\theta^\top \phi(s,a),
\]
where $\top$ denotes the transpose and
\[
\theta = (\theta_1, \ldots, \theta_B)^\top \in \mathbb{R}^B, \qquad
\phi(s,a) = \big(\phi_1(s,a), \ldots, \phi_B(s,a)\big)^\top \in \mathbb{R}^B.
\]

From the Bellman equation for state-action values (2.1), we can express the expected immediate reward $r(s,a)$ as
\[
r(s,a) = Q^\pi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\big[ Q^\pi(s',a') \big].
\]
By substituting the value function model $\theta^\top \phi(s,a)$ in the above equation, the expected immediate reward $r(s,a)$ may be approximated as
\[
r(s,a) \approx \theta^\top \phi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\big[ \theta^\top \phi(s',a') \big].
\]
Now let us define a new basis function vector $\psi(s,a)$:
\[
\psi(s,a) = \phi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\big[ \phi(s',a') \big].
\]
Then the expected immediate reward $r(s,a)$ may be approximated as
\[
r(s,a) \approx \theta^\top \psi(s,a).
\]
As explained above, the linear approximation problem of the state-action value function $Q^\pi(s,a)$ can be reformulated as the linear regression problem of the expected immediate reward $r(s,a)$ (see Figure 2.1). The key trick was to push the recursive nature of the state-action value function $Q^\pi(s,a)$ into the composite basis function $\psi(s,a)$.

FIGURE 2.1: Linear approximation of the state-action value function $Q^\pi(s,a)$ as linear regression of the expected immediate reward $r(s,a)$.

2.2.2 Algorithm

Now, we explain how the parameters θ are learned in the least-squares framework. That is, the model $\theta^\top \psi(s,a)$ is fitted to the expected immediate reward $r(s,a)$ under the squared loss:
\[
\min_{\theta}\; \mathbb{E}_{p_\pi(h)}\left[ \frac{1}{T} \sum_{t=1}^{T} \Big( \theta^\top \psi(s_t, a_t) - r(s_t, a_t) \Big)^2 \right],
\]
where h denotes the history sample following the current policy π:
\[
h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}].
\]
For history samples $\mathcal{H} = \{h_1, \ldots, h_N\}$, where
\[
h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}, s_{T+1,n}],
\]
an empirical version of the above least-squares problem is given as
\[
\min_{\theta}\; \frac{1}{N} \sum_{n=1}^{N} \frac{1}{T} \sum_{t=1}^{T} \Big( \theta^\top \widehat{\psi}(s_{t,n}, a_{t,n}; \mathcal{H}) - r(s_{t,n}, a_{t,n}, s_{t+1,n}) \Big)^2.
\]
Here, $\widehat{\psi}(s,a;\mathcal{H})$ is an empirical estimator of $\psi(s,a)$ given by
\[
\widehat{\psi}(s,a;\mathcal{H}) = \phi(s,a) - \frac{\gamma}{|\mathcal{H}_{(s,a)}|} \sum_{s' \in \mathcal{H}_{(s,a)}} \mathbb{E}_{\pi(a'|s')}\big[ \phi(s',a') \big],
\]
where $\mathcal{H}_{(s,a)}$ denotes a subset of $\mathcal{H}$ that consists of all transition samples from state s by action a, $|\mathcal{H}_{(s,a)}|$ denotes the number of elements in the set $\mathcal{H}_{(s,a)}$, and $\sum_{s' \in \mathcal{H}_{(s,a)}}$ denotes the summation over all destination states s′ in the set $\mathcal{H}_{(s,a)}$.

Let $\widehat{\Psi}$ be the $NT \times B$ matrix and $r$ be the NT-dimensional vector defined as
\[
\widehat{\Psi}_{N(t-1)+n,\,b} = \widehat{\psi}_b(s_{t,n}, a_{t,n}), \qquad
r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).
\]
$\widehat{\Psi}$ is sometimes called the design matrix. Then the above least-squares problem can be compactly expressed as
\[
\min_{\theta}\; \frac{1}{NT} \big\| \widehat{\Psi}\theta - r \big\|^2,
\]
where $\|\cdot\|$ denotes the ℓ2-norm. Because this is a quadratic function with respect to θ, its global minimizer $\widehat{\theta}$ can be analytically obtained by setting its derivative to zero as
\[
\widehat{\theta} = \big( \widehat{\Psi}^\top \widehat{\Psi} \big)^{-1} \widehat{\Psi}^\top r.
\tag{2.2}
\]

FIGURE 2.2: Gradient descent for minimizing $\frac{1}{NT}\|\widehat{\Psi}\theta - r\|^2$.

If B is too large and computing the inverse of $\widehat{\Psi}^\top \widehat{\Psi}$ is intractable, we may use a gradient descent method. That is, starting from some initial estimate θ, the solution is updated until convergence, as follows (see Figure 2.2):
\[
\theta \longleftarrow \theta - \varepsilon \big( \widehat{\Psi}^\top \widehat{\Psi}\theta - \widehat{\Psi}^\top r \big),
\]
where $\widehat{\Psi}^\top \widehat{\Psi}\theta - \widehat{\Psi}^\top r$ corresponds to the gradient of the objective function $\|\widehat{\Psi}\theta - r\|^2$ and ε is a small positive constant representing the step size of gradient descent.
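The following is a minimal numerical sketch of Eq. (2.2) and of the gradient-descent alternative. The design matrix $\widehat{\Psi}$ and the reward vector r are random stand-ins here; in least-squares policy iteration they would be built from the collected trajectories via the composite basis functions $\widehat{\psi}$.

```python
import numpy as np

# A minimal sketch of the analytic least-squares solution (2.2) and the
# gradient-descent update, with random stand-ins for Psi_hat and r.
NT, B = 200, 10
rng = np.random.default_rng(0)
Psi_hat = rng.normal(size=(NT, B))
r = rng.normal(size=NT)

# Analytic solution: theta = (Psi^T Psi)^{-1} Psi^T r.
# lstsq is used instead of an explicit matrix inverse for numerical stability.
theta_hat, *_ = np.linalg.lstsq(Psi_hat, r, rcond=None)

# Gradient descent on ||Psi theta - r||^2, for the case where inverting
# Psi^T Psi is too costly.
theta = np.zeros(B)
eps = 1e-3                                   # step size
for _ in range(5000):
    grad = Psi_hat.T @ (Psi_hat @ theta) - Psi_hat.T @ r
    theta -= eps * grad

print(np.max(np.abs(theta - theta_hat)))     # the two solutions should agree
```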

A notable variation of the above least-squares method is to compute the solution by
\[
\widetilde{\theta} = \big( \Phi^\top \widehat{\Psi} \big)^{-1} \Phi^\top r,
\]
where $\Phi$ is the $NT \times B$ matrix defined as
\[
\Phi_{N(t-1)+n,\,b} = \phi_b(s_{t,n}, a_{t,n}).
\]
This variation is called the least-squares fixed-point approximation (Lagoudakis & Parr, 2003) and is shown to handle the estimation error included in the basis function $\widehat{\psi}$ in a sound way (Bradtke & Barto, 1996). However, for simplicity, we focus on Eq. (2.2) below.

2.2.3 Regularization

Regression techniques in machine learning are generally formulated as minimization of a goodness-of-fit term and a regularization term. In the above least-squares framework, the goodness-of-fit of our model is measured by the squared loss. In the following chapters, we discuss how other loss functions can be utilized in the policy iteration framework, e.g., sample reuse in Chapter 4 and outlier-robust learning in Chapter 6. Here we focus on the regularization term and introduce practically useful regularization techniques.

The ℓ2-regularizer is the most standard regularizer in statistics and machine learning; least squares with the ℓ2-regularizer is also called ridge regression (Hoerl & Kennard, 1970):
\[
\min_{\theta}\; \frac{1}{NT} \big\| \widehat{\Psi}\theta - r \big\|^2 + \lambda \|\theta\|^2,
\]
where λ ≥ 0 is the regularization parameter. The role of the ℓ2-regularizer $\|\theta\|^2$ is to penalize the growth of the parameter vector θ to avoid overfitting to noisy samples. A practical advantage of the use of the ℓ2-regularizer is that the minimizer $\widehat{\theta}$ can still be obtained analytically:
\[
\widehat{\theta} = \big( \widehat{\Psi}^\top \widehat{\Psi} + \lambda I_B \big)^{-1} \widehat{\Psi}^\top r,
\]
where $I_B$ denotes the B × B identity matrix. Because of the addition of $\lambda I_B$, the matrix to be inverted has a better numerical condition and thus the solution tends to be more stable than the solution obtained by plain least squares without regularization.
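A minimal sketch of the ℓ2-regularized solution above, again with random stand-ins for $\widehat{\Psi}$ and r. Note that the ℓ1-penalized (LASSO) problem discussed next has no closed-form solution and is typically solved iteratively.

```python
import numpy as np

# A minimal sketch of the ridge solution theta = (Psi^T Psi + lambda I)^{-1} Psi^T r,
# with random stand-ins for the design matrix and reward vector.
NT, B, lam = 200, 10, 0.1
rng = np.random.default_rng(1)
Psi_hat = rng.normal(size=(NT, B))
r = rng.normal(size=NT)

I_B = np.eye(B)
theta_ridge = np.linalg.solve(Psi_hat.T @ Psi_hat + lam * I_B, Psi_hat.T @ r)
print(theta_ridge)
```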

Note that the same solution as the above ℓ2-penalized least-squares problem can be obtained by solving the following ℓ2-constrained least-squares problem:
\[
\min_{\theta}\; \frac{1}{NT} \big\| \widehat{\Psi}\theta - r \big\|^2
\quad \text{subject to} \quad \|\theta\|^2 \le C,
\]
where C is determined from λ. Note that the larger the value of λ is (i.e., the stronger the effect of regularization is), the smaller the value of C is (i.e., the smaller the feasible region is). The feasible region (i.e., the region where the constraint $\|\theta\|^2 \le C$ is satisfied) is illustrated in Figure 2.3(a).

FIGURE 2.3: Feasible regions (i.e., regions where the constraint is satisfied) for (a) the ℓ2-constraint and (b) the ℓ1-constraint. The least-squares (LS) solution is the bottom of the elliptical hyperboloid, whereas the solution of constrained least squares (CLS) is located at the point where the hyperboloid touches the feasible region.

Another popular choice of regularization in statistics and machine learning is the ℓ1-regularizer, which is also called the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):
\[
\min_{\theta}\; \frac{1}{NT} \big\| \widehat{\Psi}\theta - r \big\|^2 + \lambda \|\theta\|_1,
\]
where $\|\cdot\|_1$ denotes the ℓ1-norm defined as the absolute sum of the elements:
\[
\|\theta\|_1 = \sum_{b=1}^{B} |\theta_b|.
\]
In the same way as in the ℓ2-regularization case, the same solution as the above ℓ1-penalized least-squares problem can be obtained by solving the following constrained least-squares problem:
\[
\min_{\theta}\; \frac{1}{NT} \big\| \widehat{\Psi}\theta - r \big\|^2
\quad \text{subject to} \quad \|\theta\|_1 \le C,
\]
where C is determined from λ. The feasible region is illustrated in Figure 2.3(b).

A notable property of ℓ1-regularization is that the solution tends to be sparse, i.e., many of the elements $\{\theta_b\}_{b=1}^{B}$ become exactly zero. The reason why the solution becomes sparse can be intuitively understood from Figure 2.3(b): the solution tends to be on one of the corners of the feasible region, where the solution is sparse. On the other hand, in the ℓ2-constraint case (see Figure 2.3(a) again), the solution is similar to the ℓ1-constraint case, but it is not generally on an axis and thus the solution is not sparse. Such a sparse solution has various computational advantages. For example, the solution for large-scale problems can be computed efficiently, because all parameters do not have to be explicitly handled; see, e.g., Tomioka et al., 2011. Furthermore, the solutions for all different regularization parameters can be computed efficiently (Efron et al., 2004), and the output of the learned model can be computed efficiently.

FIGURE 2.4: Cross validation. The data set is divided into K subsets (1st, ..., Kth); K − 1 subsets are used for estimation and the held-out subset for validation.

2.2.4 Model Selection

In regression, tuning parameters are often included in the algorithm, such as basis parameters and the regularization parameter. Such tuning parameters can be objectively and systematically optimized based on cross-validation (Wahba, 1990) as follows (see Figure 2.4).

First, the training data set $\mathcal{H}$ is divided into K disjoint subsets of approximately the same size, $\{\mathcal{H}_k\}_{k=1}^{K}$. Then the regression solution $\widehat{\theta}_k$ is obtained using $\mathcal{H} \setminus \mathcal{H}_k$ (i.e., all samples without $\mathcal{H}_k$), and its squared error for the hold-out samples $\mathcal{H}_k$ is computed. This procedure is repeated for k = 1, ..., K, and the model (such as the basis parameter and the regularization parameter) that minimizes the average error is chosen as the most suitable one.

One may think that the ordinary squared error could be directly used for model selection, instead of its cross-validation estimator. However, the ordinary squared error is heavily biased (or, in other words, over-fitted) since the same training samples are used twice, for learning parameters and for estimating the generalization error (i.e., the out-of-sample prediction error). On the other hand, the cross-validation estimator of the squared error is almost unbiased, where "almost" comes from the fact that the number of training samples is reduced due to data splitting in the cross-validation procedure.
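The following is a minimal Python sketch of K-fold cross-validation for choosing the regularization parameter λ of the ridge solution from Section 2.2.3; the data and the candidate values of λ are arbitrary stand-ins.

```python
import numpy as np

# A minimal sketch of K-fold cross-validation for selecting the ridge
# regularization parameter; Psi and r are synthetic stand-ins.
def ridge_fit(Psi, r, lam):
    B = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(B), Psi.T @ r)

NT, B, K = 200, 10, 5
rng = np.random.default_rng(2)
Psi = rng.normal(size=(NT, B))
r = Psi @ rng.normal(size=B) + 0.5 * rng.normal(size=NT)

folds = np.array_split(rng.permutation(NT), K)     # K disjoint subsets
candidates = [1e-3, 1e-2, 1e-1, 1.0, 10.0]         # arbitrary lambda candidates
cv_error = []
for lam in candidates:
    errs = []
    for k in range(K):
        hold = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(Psi[train], r[train], lam)            # estimation
        errs.append(np.mean((Psi[hold] @ theta - r[hold]) ** 2))  # validation
    cv_error.append(np.mean(errs))

best_lam = candidates[int(np.argmin(cv_error))]
print(best_lam, cv_error)
```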

In general, cross-validation is computationally expensive because the squared error needs to be estimated many times. For example, when performing 5-fold cross-validation for 10 model candidates, the learning procedure has to be repeated 5 × 10 = 50 times. However, this is often acceptable in practice because sensible model selection gives an accurate solution even with a small number of samples. Thus, in total, the computation time may not grow that much. Furthermore, cross-validation is suitable for parallel computing since error estimation for different models and different folds is independent of each other. For instance, when performing 5-fold cross-validation for 10 model candidates, the use of 50 computing units allows us to compute everything at once.

2.3 Remarks

Reinforcement learning via regression of state-action value functions is a highly powerful and flexible approach, because we can utilize various regression techniques developed in statistics and machine learning such as least squares, regularization, and cross-validation.

In the following chapters, we introduce more sophisticated regression techniques such as manifold-based smoothing (Chapelle et al., 2006) in Chapter 3, covariate shift adaptation (Sugiyama & Kawanabe, 2012) in Chapter 4, active learning (Sugiyama & Kawanabe, 2012) in Chapter 5, and robust regression (Huber, 1981) in Chapter 6.

Chapter 3

Basis Design for Value Function Approximation

Least-squares policy iteration explained in Chapter 2 works well, given appropriate basis functions for value function approximation. Because of its smoothness, the Gaussian kernel is a popular and useful choice as a basis function. However, it does not allow for discontinuity, which is conceivable in many reinforcement learning tasks. In this chapter, we introduce an alternative basis function based on geodesic Gaussian kernels (GGKs), which exploit the non-linear manifold structure induced by the Markov decision processes (MDPs). The details of GGK are explained in Section 3.1, and its relation to other basis function designs is discussed in Section 3.2. Then, experimental performance is numerically evaluated in Section 3.3, and this chapter is concluded in Section 3.4.

3.1 Gaussian Kernels on Graphs

In least-squares policy iteration, the choice of basis functions $\{\phi_b(s,a)\}_{b=1}^{B}$ is an open design issue (see Chapter 2). Traditionally, Gaussian kernels have been a popular choice (Lagoudakis & Parr, 2003; Engel et al., 2005), but they cannot approximate discontinuous functions well. To cope with this problem, more sophisticated methods of constructing suitable basis functions have been proposed which effectively make use of the graph structure induced by MDPs (Mahadevan, 2005). In this section, we introduce an alternative way of constructing basis functions by incorporating the graph structure of the state space.

3.1.1 MDP-Induced Graph

Let G be a graph induced by an MDP, where the states $\mathcal{S}$ are nodes of the graph and the transitions with non-zero transition probabilities from one node to another are edges. The edges may have weights determined, e.g., based on the transition probabilities or the distance between nodes. The graph structure corresponding to an example grid world shown in Figure 3.1(a) is illustrated in Figure 3.1(c). In practice, such graph structure (including the connection weights) is estimated from samples of a finite length. We assume that the graph G is connected. Typically, the graph is sparse in reinforcement learning tasks, i.e.,
\[
\ell \ll n(n-1)/2,
\]
where ℓ is the number of edges and n is the number of nodes.

FIGURE 3.1: An illustrative example of a reinforcement learning task of guiding an agent to a goal in the grid world. (a) Black areas are walls over which the agent cannot move, while the goal is represented in gray; arrows on the grids represent one of the optimal policies. (b) Optimal state value function (in log-scale). (c) Graph induced by the MDP and a random policy.

3.1.2 Ordinary Gaussian Kernels

Ordinary Gaussian kernels (OGKs) on the Euclidean space are defined as
\[
K(s, s') = \exp\!\left( -\frac{\mathrm{ED}(s, s')^2}{2\sigma^2} \right),
\]
where $\mathrm{ED}(s, s')$ is the Euclidean distance between states s and s′; for example,
\[
\mathrm{ED}(s, s') = \| x - x' \|,
\]
when the Cartesian positions of s and s′ in the state space are given by x and x′, respectively. $\sigma^2$ is the variance parameter of the Gaussian kernel.

The above Gaussian function is defined on the state space $\mathcal{S}$, where s′ is treated as a center of the kernel. In order to employ the Gaussian kernel in least-squares policy iteration, it needs to be extended over the state-action space $\mathcal{S} \times \mathcal{A}$. This is usually carried out by simply "copying" the Gaussian function over the action space (Lagoudakis & Parr, 2003; Mahadevan, 2005). More precisely, let the total number k of basis functions be mp, where m is the number of possible actions and p is the number of Gaussian centers. For the i-th action $a^{(i)} \in \mathcal{A}$ (i = 1, 2, ..., m) and for the j-th Gaussian center $c^{(j)} \in \mathcal{S}$ (j = 1, 2, ..., p), the (i + (j−1)m)-th basis function is defined as
\[
\phi_{i+(j-1)m}(s, a) = I(a = a^{(i)})\, K(s, c^{(j)}),
\tag{3.1}
\]
where I(·) is the indicator function:
\[
I(a = a^{(i)}) =
\begin{cases}
1 & \text{if } a = a^{(i)},\\
0 & \text{otherwise}.
\end{cases}
\]
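A minimal sketch of the "copying" construction in Eq. (3.1) for a discrete action set follows; the Gaussian centers, the kernel width, and the example state are hypothetical values chosen only for illustration.

```python
import numpy as np

# A minimal sketch of Eq. (3.1): Gaussian kernels over states are replicated
# across a discrete action set via indicator functions.
def ogk_basis(x, a, centers, n_actions, sigma=1.0):
    """Return the length-(m*p) feature vector for state position x and action index a."""
    p = len(centers)
    K = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))  # K(s, c_j)
    phi = np.zeros(n_actions * p)
    for j in range(p):
        phi[a + j * n_actions] = K[j]      # non-zero only for the taken action
    return phi

centers = np.array([[1.0, 1.0], [3.0, 2.0], [5.0, 5.0]])   # p = 3 Gaussian centers
print(ogk_basis(np.array([1.2, 0.8]), a=2, centers=centers, n_actions=4))
```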

3.1.3 Geodesic Gaussian Kernels

On graphs, a natural definition of the distance would be the shortest path. The Gaussian kernel based on the shortest path is given by
\[
K(s, s') = \exp\!\left( -\frac{\mathrm{SP}(s, s')^2}{2\sigma^2} \right),
\tag{3.2}
\]
where $\mathrm{SP}(s, s')$ denotes the shortest path from state s to state s′. The shortest path on a graph can be interpreted as a discrete approximation to the geodesic distance on a non-linear manifold (Chung, 1997). For this reason, we call Eq. (3.2) a geodesic Gaussian kernel (GGK) (Sugiyama et al., 2008).

Shortest paths on graphs can be efficiently computed using the Dijkstra algorithm (Dijkstra, 1959). With its naive implementation, the computational complexity for computing the shortest paths from a single node to all other nodes is $O(n^2)$, where n is the number of nodes. If the Fibonacci heap is employed, the computational complexity can be reduced to $O(n \log n + \ell)$ (Fredman & Tarjan, 1987), where ℓ is the number of edges. Since the graph in value function approximation problems is typically sparse (i.e., $\ell \ll n^2$), using the Fibonacci heap provides significant computational gains. Furthermore, there exist various approximation algorithms which are computationally very efficient (see Goldberg & Harrelson, 2005, and references therein).
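A minimal sketch of the geodesic Gaussian kernel in Eq. (3.2) follows. Shortest-path distances are computed here with a simple binary-heap Dijkstra implementation (not the Fibonacci-heap variant mentioned above), on a small hypothetical weighted graph.

```python
import heapq
import numpy as np

# A minimal sketch of the geodesic Gaussian kernel: Dijkstra shortest paths
# followed by a Gaussian of the path length, on a toy weighted graph.
def dijkstra(adj, source):
    """adj[u] = list of (v, weight); returns shortest distances from source."""
    dist = {u: float('inf') for u in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def geodesic_gaussian_kernel(adj, center, sigma=1.0):
    """K(s, c) = exp(-SP(s, c)^2 / (2 sigma^2)) for every state s."""
    sp = dijkstra(adj, center)
    return {s: np.exp(-d ** 2 / (2 * sigma ** 2)) for s, d in sp.items()}

# Toy graph: a short chain 0 - 1 - 2 - 3 with unit edge weights.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
print(geodesic_gaussian_kernel(adj, center=0, sigma=2.0))
```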

Analogously to OGKs, we need to extend GGKs to the state-action space to use them in least-squares policy iteration. A naive way is to just employ Eq. (3.1), but this can cause a shift in the Gaussian centers since the state usually changes when an action is taken. To incorporate this transition, the basis functions are defined as the expectation of the Gaussian functions after the transition:

φ_{i+(j−1)m}(s, a) = I(a = a(i)) Σ_{s′∈S} P(s′|s, a) K(s′, c(j)).          (3.3)

This shifting scheme is shown to work very well when the transition is predominantly deterministic (Sugiyama et al., 2008).
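A minimal sketch of the shifting scheme of Eq. (3.3), assuming the transition probabilities are stored as a dense array and the GGK values have been precomputed; all names and layouts below are hypothetical, not from the original text.

import numpy as np

def ggk_shifted_basis(s_idx, a_idx, P, K):
    # Eq. (3.3): phi_{i+(j-1)m}(s, a) = I(a = a(i)) * sum_{s'} P(s'|s, a) K(s', c(j)).
    # P: (n_states, n_actions, n_states) transition probabilities.
    # K: (n_states, p) geodesic Gaussian kernel values K(s', c(j)).
    n_actions = P.shape[1]
    expected_k = P[s_idx, a_idx] @ K            # expectation of K(s', c(j)) after the transition
    ind = np.eye(n_actions)[a_idx]              # indicator I(a = a(i))
    return np.outer(expected_k, ind).ravel()    # length m*p, entry (j-1)*m + i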

3.1.4 Extension to Continuous State Spaces

So far, we have focused on discrete state spaces. However, the concept of GGKs can be naturally extended to continuous state spaces, as explained here. First, the continuous state space is discretized, which gives a graph as a discrete approximation to the non-linear manifold structure of the continuous state space. Based on the graph, GGKs can be constructed in the same way as in the discrete case. Finally, the discrete GGKs are interpolated, e.g., using a linear method, to give continuous GGKs.

Although this procedure discretizes the continuous state space, it must be noted that the discretization is only for the purpose of obtaining the graph as a discrete approximation of the continuous non-linear manifold; the resulting basis functions themselves are continuously interpolated and hence the state space is still treated as continuous, as opposed to conventional discretization procedures.

3.2 Illustration

In this section, the characteristics of GGKs are discussed in comparison to existing basis functions.

3.2.1 Setup

Let us consider a toy reinforcement learning task of guiding an agent to a goal in a deterministic grid world (see Figure 3.1(a)). The agent can take 4 actions: up, down, left, and right. Note that actions which make the agent collide with a wall are disallowed. A positive immediate reward +1 is given if the agent reaches a goal state; otherwise it receives no immediate reward. The discount factor is set at γ = 0.9.

In this task, a state s corresponds to a two-dimensional Cartesian grid position x of the agent. For illustration purposes, let us display the state value function,

V^π(s) : S → R,

which is the expected long-term discounted sum of rewards the agent receives when it takes actions following policy π from state s. From the definition, it can be confirmed that V^π(s) is expressed in terms of Q^π(s, a) as

V^π(s) = Q^π(s, π(s)).

The optimal state value function V*(s) (in log-scale) is illustrated in Figure 3.1(b). An MDP-induced graph structure estimated from 20 series of random walk samples¹ of length 500 is illustrated in Figure 3.1(c). Here, the edge weights in the graph are set at 1 (which is equivalent to the Euclidean distance between two nodes).

3.2.2 Geodesic Gaussian Kernels

An example of GGKs for this graph is depicted in Figure 3.2(a), where the variance of the kernel is set at a large value (σ² = 30) for illustration purposes. The graph shows that GGKs have a nice smooth surface along the maze, but not across the partition between the two rooms. Since GGKs have "centers," they are extremely useful for adaptively choosing a subset of bases, e.g., using a uniform allocation strategy, a sample-dependent allocation strategy, or a maze-dependent allocation strategy of the centers. This is a practical advantage over some non-ordered basis functions. Moreover, since GGKs are local by nature, the ill effects of local noise are constrained locally, which is another useful property in practice.

The approximated value functions obtained by 40 GGKs² are depicted in Figure 3.3(a), where one GGK center is put at the goal state and the remaining 39 centers are chosen randomly. For GGKs, the kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic (see Section 3.1.3).

¹ More precisely, in each random walk, an initial state is chosen randomly. Then an action is chosen randomly and a transition is made; this is repeated 500 times. This entire procedure is independently repeated 20 times to generate the training set.

² Note that the total number k of basis functions is 160 since each GGK is copied over the action space as per Eq. (3.3).

FIGURE 3.2: Examples of basis functions. (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels. (c) Graph-Laplacian eigenbases. (d) Diffusion wavelets.

FIGURE 3.3: Approximated value functions in log-scale. The errors are computed with respect to the optimal value function illustrated in Figure 3.1(b). (a) Geodesic Gaussian kernels (MSE = 1.03×10⁻²). (b) Ordinary Gaussian kernels (MSE = 1.19×10⁻²). (c) Graph-Laplacian eigenbases (MSE = 4.73×10⁻⁴). (d) Diffusion wavelets (MSE = 5.00×10⁻⁴).

The GGK-based method produces a nice smooth function along the maze, while the discontinuity around the partition between the two rooms is sharply maintained (cf. Figure 3.1(b)). As a result, for this particular case, GGKs give the optimal policy (see Figure 3.4(a)).

As discussed in Section 3.1.3, the sparsity of the state transition matrix allows efficient and fast computation of shortest paths on the graph. Therefore, least-squares policy iteration with GGK-based bases is still computationally attractive.

3.2.3 Ordinary Gaussian Kernels

OGKs share some of the preferable properties of GGKs described above. However, as illustrated in Figure 3.2(b), the tails of OGKs extend beyond the partition between the two rooms. Therefore, OGKs tend to undesirably smooth out the discontinuity of the value function around the barrier wall (see Figure 3.3(b)).

FIGURE 3.4: Obtained policies. (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels. (c) Graph-Laplacian eigenbases. (d) Diffusion wavelets.

This causes an error in the policy around the partition (see x = 10, y = 2, 3, …, 9 of Figure 3.4(b)).

3.2.4 Graph-Laplacian Eigenbases

Mahadevan (2005) proposed employing the smoothest vectors on graphs as bases in value function approximation. According to spectral graph theory (Chung, 1997), such smooth bases are given by the minor eigenvectors of the graph-Laplacian matrix, which are called graph-Laplacian eigenbases (GLEs). GLEs may be regarded as a natural extension of Fourier bases to graphs.

Examples of GLEs are illustrated in Figure 3.2(c), showing that they have a Fourier-like structure on the graph. It should be noted that GLEs are rather global in nature, implying that noise in a local region can potentially degrade the global quality of approximation. An advantage of GLEs is that they have a natural ordering of the basis functions according to smoothness, which is practically very helpful in choosing a subset of basis functions. Figure 3.3(c) depicts the approximated value function in log-scale, where the top 40 smoothest GLEs out of 326 GLEs are used (note that the actual number of bases is 160 because of the duplication over the action space). It shows that GLEs globally give a very good approximation, although the small local fluctuation is significantly emphasized since the graph is in log-scale. Indeed, the mean squared error (MSE) between the approximated and optimal value functions described in the captions of Figure 3.3 shows that GLEs give a much smaller MSE than GGKs and OGKs. However, the obtained value function contains systematic local fluctuations, and this results in an inappropriate policy (see Figure 3.4(c)).

MDP-induced graphs are typically sparse. In such cases, the resultant graph-Laplacian matrix is also sparse and GLEs can be obtained just by solving a sparse eigenvalue problem, which is computationally efficient. However, finding minor eigenvectors could be numerically unstable.

3.2.5 Diffusion Wavelets

Coifman and Maggioni (2006) proposed diffusion wavelets (DWs), which are a natural extension of wavelets to graphs. The construction is based on a symmetrized random walk on a graph, which is diffused on the graph up to a desired level, resulting in a multi-resolution structure. A detailed construction algorithm and mathematical properties of DWs are described in Coifman and Maggioni (2006).

When constructing DWs, the maximum nest level of wavelets and the tolerance used in the construction algorithm need to be specified by users. Here, the maximum nest level is set at 10 and the tolerance is set at 10⁻¹⁰, as suggested by the authors. Examples of DWs are illustrated in Figure 3.2(d), showing a nice multi-resolution structure on the graph. DWs are over-complete bases, so one has to appropriately choose a subset of bases for better approximation. Figure 3.3(d) depicts the approximated value function obtained by DWs, where we chose the most global 40 DWs from 1626 over-complete DWs (note that the actual number of bases is 160 because of the duplication over the action space). The choice of the subset of bases could possibly be enhanced using multiple heuristics. However, the current choice is reasonable since Figure 3.3(d) shows that DWs give a much smaller MSE than Gaussian kernels. Nevertheless, similarly to GLEs, the obtained value function contains a lot of small fluctuations (see Figure 3.3(d)), and this results in an erroneous policy (see Figure 3.4(d)).

Thanks to the multi-resolution structure, the computation of diffusion wavelets can be carried out recursively. However, due to the over-completeness, it is still rather demanding in computation time. Furthermore, appropriately determining the tuning parameters as well as choosing an appropriate basis subset is not straightforward in practice.

3.3 Numerical Examples

As discussed in the previous section, GGKs bring a number of preferable properties for making value function approximation effective. In this section, the behavior of GGKs is illustrated numerically.

3.3.1 Robot-Arm Control

Here, a simulator of a two-joint robot arm (moving in a plane), illustrated in Figure 3.5(a), is employed. The task is to lead the end-effector ("hand") of the arm to an object while avoiding the obstacles. Possible actions are to increase or decrease the angle of each joint ("shoulder" and "elbow") by 5 degrees in the plane, simulating coarse stepper-motor joints. Thus, the state space S is the 2-dimensional discrete space consisting of the two joint angles, as illustrated in Figure 3.5(b). The black area in the middle corresponds to the obstacle in the joint-angle state space. The action space A involves 4 actions: increase or decrease one of the joint angles. A positive immediate reward +1 is given when the robot's end-effector touches the object; otherwise the robot receives no immediate reward. Note that actions which make the arm collide with obstacles are disallowed. The discount factor is set at γ = 0.9. In this environment, the robot can change a joint angle by exactly 5 degrees, and therefore the environment is deterministic. However, because of the obstacles, it is difficult to explicitly compute an inverse kinematic model. Furthermore, the obstacles introduce discontinuity in value functions. Therefore, this robot-arm control task is an interesting testbed for investigating the behavior of GGKs.

Training samples from 50 series of 1000 random arm movements are collected, where the start state is chosen randomly in each trial. The graph induced by the above MDP consists of 1605 nodes, and uniform weights are assigned to the edges. Since there are 16 goal states in this environment (see Figure 3.5(b)), the first 16 Gaussian centers are put at the goals and the remaining centers are chosen randomly in the state space. For GGKs, the kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic in this experiment.

Figure 3.6 illustrates the value functions approximated using GGKs and OGKs. The graphs show that GGKs give a nice smooth surface with the obstacle-induced discontinuity sharply preserved, while OGKs tend to smooth out the discontinuity. This makes a significant difference in avoiding the obstacle. From "A" to "B" in Figure 3.5(b), the GGK-based value function results in a trajectory that avoids the obstacle (see Figure 3.6(a)). On the other hand, the OGK-based value function yields a trajectory that tries to move the arm through the obstacle by following the gradient upward (see Figure 3.6(b)), causing the arm to get stuck behind the obstacle.

FIGURE 3.5: A two-joint robot arm. (a) A schematic. (b) State space. In this experiment, GGKs are put at all the goal states and the remaining kernels are distributed uniformly over the maze; the shifting scheme is used in GGKs.

Figure 3.7 summarizes the performance of GGKs and OGKs measured by the percentage of successful trials (i.e., the end-effector reaches the object) over 30 independent runs. More precisely, in each run, 50,000 training samples are collected using a different random seed, a policy is then computed by GGK- or OGK-based least-squares policy iteration, and finally the obtained policy is tested. The graph shows that GGKs remarkably outperform OGKs since the arm can successfully avoid the obstacle. The performance of OGKs does not go beyond 0.6 even when the number of kernels is increased. This is caused by the tail effect of OGKs; as a result, the OGK-based policy cannot lead the end-effector to the object if it starts from the bottom-left half of the state space.

When the number of kernels is increased, the performance of both GGKs and OGKs gets worse at around k = 20.

FIGURE 3.6: Approximated value functions with 10 kernels (the actual number of bases is 40 because of the duplication over the action space). (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels.

FIGURE 3.7: Fraction of successful trials as a function of the number of kernels (curves: GGK(5), GGK(9), OGK(5), OGK(9)).

This is caused by the kernel allocation strategy: the first 16 kernels are put at the goal states and the remaining kernel centers are chosen randomly. When k is less than or equal to 16, the approximated value function tends to have a unimodal profile since all kernels are put at the goal states. However, when k is larger than 16, this unimodality is broken and the surface of the approximated value function has slight fluctuations, causing errors in the policies and degrading performance at around k = 20. This performance degradation tends to recover as the number of kernels is further increased.

Motion examples of the robot arm trained with GGK and OGK are illustrated in Figure 3.8 and Figure 3.9, respectively.

Overall, the above results show that when GGKs are combined with the above-mentioned kernel-center allocation strategy, almost perfect policies can be obtained with a small number of kernels. Therefore, the GGK method is computationally highly advantageous.

3.3.2 Robot-Agent Navigation

The above simple robot-arm control simulation shows that GGKs are promising. Here, GGKs are applied to a more challenging task of mobile-robot navigation, which involves a high-dimensional and very large state space.

A Khepera robot, illustrated in Figure 3.10(a), is employed for the navigation task. The Khepera robot is equipped with 8 infrared sensors ("s1" to "s8" in the figure), each of which gives a measure of the distance from the surrounding obstacles. Each sensor produces a scalar value between 0 and 1023: the sensor takes the maximum value 1023 if an obstacle is just in front of the sensor, and the value decreases as the obstacle gets farther away until it reaches the minimum value 0. Therefore, the state space S is 8-dimensional. The Khepera robot has two wheels and takes the following four actions: forward, left rotation, right rotation, and backward (i.e., the action space A contains four actions). The speed of the left and right wheels for each action is described in Figure 3.10(a) in brackets (the unit is pulses per 10 milliseconds). Note that the sensor values and the wheel speeds are highly stochastic due to crosstalk, sensor noise, slip, etc. Furthermore, perceptual aliasing occurs due to the limited range and resolution of the sensors. Therefore, the state transition is also highly stochastic. The discount factor is set at γ = 0.9.

The goal of the navigation task is to make the Khepera robot explore the environment as much as possible. To this end, a positive reward +1 is given when the Khepera robot moves forward and a negative reward −2 is given when the Khepera robot collides with an obstacle. No reward is given to the left rotation, right rotation, and backward actions. This reward design encourages the Khepera robot to go forward without hitting obstacles, through which extensive exploration of the environment can be achieved.

Training samples are collected from 200 series of 100 random movements in a fixed environment with several obstacles (see Figure 3.11(a)). Then, a graph is constructed from the gathered samples by discretizing the continuous state space using a self-organizing map (SOM) (Kohonen, 1995). A SOM consists of neurons located on a regular grid. Each neuron corresponds to a cluster, and neurons are connected to adjacent ones by a neighborhood relation. The SOM is similar to the k-means clustering algorithm, but it differs in that the topological structure of the entire map is taken into account. Thanks to this, the entire space tends to be covered by the SOM. The number of nodes

FIGURE 3.8: A motion example of the robot arm trained with GGK (from left to right and top to bottom).

FIGURE 3.9: A motion example of the robot arm trained with OGK (from left to right and top to bottom).

FIGURE 3.10: Khepera robot. (a) A schematic. (b) State space projected onto a 2-dimensional subspace for visualization. In this experiment, GGKs are distributed uniformly over the maze without the shifting scheme.

(states) in the graph is set at 696 (equivalent to a SOM map size of 24 × 29). This value is computed by the standard rule-of-thumb formula 5√n (Vesanto et al., 2000), where n is the number of samples. The connectivity of the graph is determined by the state transitions occurring in the samples. More specifically, if there is a state transition from one node to another in the samples, an edge is established between these two nodes and the edge weight is set according to the Euclidean distance between them.

Figure 3.10(b) illustrates an example of the obtained graph structure. For visualization purposes, the 8-dimensional state space is projected onto a 2-dimensional subspace spanned by

(−1  −1  0  0  1  1  0  0),
(0  0  1  1  0  0  −1  −1).

FIGURE 3.11: Simulation environment. (a) Training. (b) Test.

Note that this projection is performed only for the purpose of visualization; all the computations are carried out using the entire 8-dimensional data. The i-th element in the above bases corresponds to the output of the i-th sensor (see Figure 3.10(a)). The projection onto this subspace roughly means that the horizontal axis corresponds to the distance to the left and right obstacles, while the vertical axis corresponds to the distance to the front and back obstacles. For clear visibility, only the edges whose weight is less than 250 are plotted. Representative local poses of the Khepera robot with respect to the obstacles are illustrated in Figure 3.10(b). This graph has a notable feature: the nodes around the region "B" in the figure are directly connected to the nodes at "A," but are very sparsely connected to the nodes at "C," "D," and "E." This implies that the geodesic distance from "B" to "C," "B" to "D," or "B" to "E" is typically larger than the Euclidean distance.

Since the transition from one state to another is highly stochastic in the current experiment, the GGK function is simply duplicated over the action space (see Eq. (3.1)). For obtaining continuous GGKs, the GGK functions need to be interpolated (see Section 3.1.4). A simple linear interpolation method may be employed in general, but the current experiment has a unique characteristic: at least one of the sensor values is always zero since the Khepera robot is never completely surrounded by obstacles. Therefore, samples are always on the surface of the 8-dimensional hypercube-shaped state space. On the other hand, the node centers determined by the SOM are not generally on the surface. This means that a sample is not included in the convex hull of its nearest nodes and the function value needs to be extrapolated. Here, the Euclidean distance between the sample and its nearest node is simply added when computing the kernel values. More precisely, for a state s that is not generally located on a node center, the GGK-based basis function is defined as

φ_{i+(j−1)m}(s, a) = I(a = a(i)) exp( −( ED(s, s̃) + SP(s̃, c(j)) )² / (2σ²) ),

where s̃ is the node closest to s in the Euclidean distance.
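A small Python sketch (not from the original text) of this extrapolated GGK basis for continuous states; the node coordinates, the precomputed shortest-path table, and the action indexing are hypothetical.

import numpy as np

def ggk_continuous_basis(s, a_idx, node_centers, sp_to_centers, n_actions, sigma):
    # phi_{i+(j-1)m}(s, a) = I(a = a(i)) * exp(-(ED(s, s~) + SP(s~, c(j)))^2 / (2 sigma^2)),
    # where s~ is the graph node closest to s in Euclidean distance.
    # node_centers: (n_nodes, dim) node coordinates.
    # sp_to_centers: (n_nodes, p) shortest-path distances from every node to every kernel center.
    s = np.asarray(s, dtype=float)
    d = np.linalg.norm(node_centers - s, axis=1)
    nearest = np.argmin(d)                                       # index of s~
    k = np.exp(-(d[nearest] + sp_to_centers[nearest]) ** 2 / (2.0 * sigma ** 2))
    ind = np.eye(n_actions)[a_idx]                               # simple duplication over actions, Eq. (3.1)
    return np.outer(k, ind).ravel()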

Figure 3.12 illustrates an example of the actions selected at each node by the GGK-based and OGK-based policies. One hundred kernels are used and the width is set at 1000. The symbols ↑, ↓, ⊂, and ⊃ in the figure indicate the forward, backward, left rotation, and right rotation actions. This shows that there is a clear difference between the obtained policies at the state "C": the backward action is most likely to be taken by the OGK-based policy, while the left and right rotations are most likely to be taken by the GGK-based policy. This causes a significant difference in performance. To explain this, suppose that the Khepera robot is at the state "C," i.e., it faces a wall. The GGK-based policy guides the Khepera robot from "C" to "A" via "D" or "E" by taking the left and right rotation actions, and it can avoid the obstacle successfully. On the other hand, the OGK-based policy tries to plan a path from "C" to "A" via "B" by activating the backward action. As a result, the forward action is taken at "B," so the Khepera robot returns to "C" again and ends up moving back and forth between "C" and "B."

For the performance evaluation, a more complicated environment than the one used for gathering training samples (see Figure 3.11) is used. This means that how well the obtained policies generalize to an unknown environment is evaluated here. In this test environment, the Khepera robot runs from a fixed starting position (see Figure 3.11(b)) and takes 150 steps following the obtained policy, with the sum of rewards (+1 for the forward action) computed. If the Khepera robot collides with an obstacle before 150 steps, the evaluation is stopped. The mean test performance over 30 independent runs is depicted in Figure 3.13 as a function of the number of kernels. More precisely, in each run, a graph is constructed based on the training samples taken from the training environment and the specified number of kernels is put randomly on the graph. Then, a policy is learned by GGK- or OGK-based least-squares policy iteration using the training samples. Note that the actual number of bases is four times larger because of the extension of the basis functions over the action space. The test performance is measured 5 times for each policy and the average is output. Figure 3.13 shows that GGKs significantly outperform OGKs, demonstrating that GGKs are promising even in this challenging setting with a high-dimensional, large state space.

Figure 3.14 depicts the computation time of each method as a function of the number of kernels. The computation time monotonically increases as the number of kernels increases, and the GGK-based and OGK-based methods have comparable computation time. However, given that the GGK-based method works much better than the OGK-based method with a smaller number of kernels (see Figure 3.13), the GGK-based method can be regarded as a computationally efficient alternative to the standard OGK-based method.

Finally, the trained Khepera robot is applied to map building. Starting from an initial position (indicated by a square in Figure 3.15), the Khepera

FIGURE 3.12: Examples of obtained policies. The symbols ↑, ↓, ⊂, and ⊃ indicate the forward, backward, left rotation, and right rotation actions. (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels.

robot takes an action 2000 times following the learned policy. Eighty kernels with Gaussian width σ = 1000 are used for value function approximation. The results of GGKs and OGKs are depicted in Figure 3.15. The graphs show that the GGK result gives a broader profile of the environment, while the OGK result only reveals a local area around the initial position.

Motion examples of the Khepera robot trained with GGK and OGK are illustrated in Figure 3.16 and Figure 3.17, respectively.

FIGURE 3.13: Average amount of exploration (averaged total rewards as a function of the number of kernels).

FIGURE 3.14: Computation time [sec] as a function of the number of kernels.

FIGURE 3.15: Results of map building (cf. Figure 3.11(b)). (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels.

FIGURE 3.16: A motion example of the Khepera robot trained with GGK (from left to right and top to bottom).

FIGURE 3.17: A motion example of the Khepera robot trained with OGK (from left to right and top to bottom).

3.4 Remarks

The performance of least-squares policy iteration depends heavily on the choice of basis functions for value function approximation. In this chapter, the geodesic Gaussian kernel (GGK) was introduced and shown to possess several preferable properties, such as smoothness along the graph and easy computability. It was also demonstrated that the policies obtained by GGKs are not very sensitive to the choice of the Gaussian kernel width, which is a useful property in practice. Also, the heuristic of putting Gaussian centers on goal states was shown to work well.

However, when the transition is highly stochastic (i.e., the transition probability has a wide support), the graph constructed from the transition samples can be noisy. When an erroneous transition results in a short-cut over obstacles, the graph-based approach may not work well since the topology of the state space changes significantly.

Chapter 4

Sample Reuse in Policy Iteration

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques to compensate for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, representative off-policy value function approximation techniques including adaptive importance sampling are reviewed in Section 4.2. Then, in Section 4.3, it is explained how the adaptivity of importance sampling can be optimally controlled. In Section 4.4, off-policy value function approximation techniques are integrated into the framework of least-squares policy iteration for efficient sample reuse. Experimental results are shown in Section 4.5, and finally this chapter is concluded in Section 4.6.

4.1 Formulation

As explained in Section 2.2, least-squares policy iteration models the state-action value function Q^π(s, a) by a linear architecture,

θ⊤φ(s, a),

and learns the parameter θ so that the generalization error G is minimized:

G(θ) = E_{p^π(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² ].          (4.1)

Here, E_{p^π(h)} denotes the expectation over the history

h = [s_1, a_1, …, s_T, a_T, s_{T+1}]

following the target policy π, and

ψ(s, a) = φ(s, a) − γ E_{π(a′|s′) p(s′|s,a)} [ φ(s′, a′) ].

When history samples following the target policy π are available, the situation is called on-policy reinforcement learning. In this case, just replacing the expectation contained in the generalization error G by sample averages gives a statistically consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of samples goes to infinity).

Here, we consider the situation called off-policy reinforcement learning, where the sampling policy π̃ used for collecting data samples is generally different from the target policy π. Let us denote the history samples following π̃ by

H^π̃ = { h^π̃_1, …, h^π̃_N },

where each episodic sample h^π̃_n is given as

h^π̃_n = [ s^π̃_{1,n}, a^π̃_{1,n}, …, s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n} ].

Under the off-policy setup, naive learning by minimizing the sample-approximated generalization error Ĝ_NIW leads to an inconsistent estimator:

Ĝ_NIW(θ) = (1/NT) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )²,

where

ψ̂(s, a; H) = φ(s, a) − (1/|H_{(s,a)}|) Σ_{s′∈H_{(s,a)}} E_{π(a′|s′)} [ γ φ(s′, a′) ].

H_{(s,a)} denotes the subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and Σ_{s′∈H_{(s,a)}} denotes the summation over all destination states s′ in the set H_{(s,a)}. NIW stands for "No Importance Weight," which will be explained later.

This inconsistency problem can be avoided by gathering new samples following the target policy π, i.e., when the current policy is updated, new samples are gathered following the updated policy and these new samples are used for policy evaluation. However, when the data sampling cost is high, this is too expensive. It would be more cost efficient if previously gathered samples could be reused effectively.

4.2 Off-Policy Value Function Approximation

Importance sampling is a general technique for dealing with the off-policy situation. Suppose we have i.i.d. (independent and identically distributed) samples {x_n}_{n=1}^N from a strictly positive probability density function p̃(x). Using these samples, we would like to compute the expectation of a function g(x) over another probability density function p(x). A consistent approximation of the expectation is given by the importance-weighted average as

(1/N) Σ_{n=1}^N g(x_n) p(x_n)/p̃(x_n)  →  E_{p̃(x)}[ g(x) p(x)/p̃(x) ]   (as N → ∞)
  = ∫ g(x) ( p(x)/p̃(x) ) p̃(x) dx = ∫ g(x) p(x) dx = E_{p(x)}[ g(x) ].

However, applying the importance sampling technique in off-policy reinforcement learning is not straightforward since our training samples of state s and action a are not i.i.d. due to the sequential nature of Markov decision processes (MDPs). In this section, representative importance-weighting techniques for MDPs are reviewed.

4.2.1 Episodic Importance Weighting

Based on the independence between episodes,

p(h, h′) = p(h) p(h′) = p(s_1, a_1, …, s_T, a_T, s_{T+1}) p(s′_1, a′_1, …, s′_T, a′_T, s′_{T+1}),

the generalization error G can be rewritten as

G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_T ],

where w_T is the episodic importance weight (EIW):

w_T = p^π(h) / p^π̃(h).

p^π(h) and p^π̃(h) are the probability densities of observing the episodic data h under policies π and π̃:

p^π(h) = p(s_1) Π_{t=1}^T π(a_t|s_t) p(s_{t+1}|s_t, a_t),
p^π̃(h) = p(s_1) Π_{t=1}^T π̃(a_t|s_t) p(s_{t+1}|s_t, a_t).

Note that the importance weight can be computed without explicitly knowing p(s_1) and p(s_{t+1}|s_t, a_t), since they cancel out:

w_T = Π_{t=1}^T π(a_t|s_t) / Π_{t=1}^T π̃(a_t|s_t).

Using the training data H^π̃, we can construct a consistent estimator of G as

Ĝ_EIW(θ) = (1/NT) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{T,n},          (4.2)

where

ŵ_{T,n} = Π_{t=1}^T π(a^π̃_{t,n}|s^π̃_{t,n}) / Π_{t=1}^T π̃(a^π̃_{t,n}|s^π̃_{t,n}).

4.2.2 Per-Decision Importance Weighting

A crucial observation in EIW is that the error at the t-th step does not depend on the samples after the t-th step (Precup et al., 2000). Thus, the generalization error G can be rewritten as

G(θ) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ⊤ψ(s_t, a_t) − r(s_t, a_t) )² w_t ],

where w_t is the per-decision importance weight (PIW):

w_t = [ p(s_1) Π_{t′=1}^t π(a_{t′}|s_{t′}) p(s_{t′+1}|s_{t′}, a_{t′}) ] / [ p(s_1) Π_{t′=1}^t π̃(a_{t′}|s_{t′}) p(s_{t′+1}|s_{t′}, a_{t′}) ]
    = Π_{t′=1}^t π(a_{t′}|s_{t′}) / Π_{t′=1}^t π̃(a_{t′}|s_{t′}).

Using the training data H^π̃, we can construct a consistent estimator as follows (cf. Eq. (4.2)):

Ĝ_PIW(θ) = (1/NT) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² ŵ_{t,n},

where

ŵ_{t,n} = Π_{t′=1}^t π(a^π̃_{t′,n}|s^π̃_{t′,n}) / Π_{t′=1}^t π̃(a^π̃_{t′,n}|s^π̃_{t′,n}).

ŵ_{t,n} only contains the relevant terms up to the t-th step, while ŵ_{T,n} includes all the terms until the end of the episode.
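The following short sketch (not from the original text) computes the episodic and per-decision importance weights from the action probabilities along one episode; the data layout is hypothetical.

import numpy as np

def per_decision_weights(pi_probs, pi_tilde_probs):
    # w_t = prod_{t' <= t} pi(a_t'|s_t') / pi~(a_t'|s_t')  for t = 1, ..., T.
    # pi_probs, pi_tilde_probs: length-T arrays of action probabilities along one
    # episode under the target policy pi and the sampling policy pi~.
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(pi_tilde_probs, dtype=float)
    w = np.cumprod(ratios)
    return w            # w[-1] is the episodic importance weight w_T used by EIW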

4.2.3 Adaptive Per-Decision Importance Weighting

The PIW estimator is guaranteed to be consistent. However, neither EIW nor PIW is efficient in the statistical sense (Shimodaira, 2000), i.e., they do not have the smallest admissible variance. For this reason, the PIW estimator can have large variance in finite-sample cases and therefore learning with PIW tends to be unstable in practice.

To improve the stability, it is important to control the trade-off between consistency and efficiency (or, similarly, bias and variance) based on the training data. Here, a flattening parameter ν (∈ [0, 1]) is introduced to control the trade-off by slightly "flattening" the importance weights (Shimodaira, 2000; Sugiyama et al., 2007):

Ĝ_AIW(θ) = (1/NT) Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) − r(s^π̃_{t,n}, a^π̃_{t,n}, s^π̃_{t+1,n}) )² (ŵ_{t,n})^ν,

where AIW stands for the adaptive per-decision importance weight. When ν = 0, AIW is reduced to NIW and therefore has large bias but relatively small variance. On the other hand, when ν = 1, AIW is reduced to PIW; thus it has small bias but relatively large variance. In practice, an intermediate value of ν will yield the best performance.

Let Ψ̂ be the NT × B matrix, Ŵ the NT × NT diagonal matrix, and r the NT-dimensional vector defined as

Ψ̂_{N(t−1)+n, b} = ψ̂_b(s_{t,n}, a_{t,n}),
Ŵ_{N(t−1)+n, N(t−1)+n} = ŵ_{t,n},
r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Then Ĝ_AIW can be compactly expressed as

Ĝ_AIW(θ) = (1/NT) (Ψ̂θ − r)⊤ Ŵ^ν (Ψ̂θ − r).

Because this is a convex quadratic function with respect to θ, its global minimizer θ̂_AIW can be obtained analytically by setting its derivative to zero:

θ̂_AIW = (Ψ̂⊤ Ŵ^ν Ψ̂)⁻¹ Ψ̂⊤ Ŵ^ν r.

This means that the cost of computing θ̂_AIW is essentially the same as that of θ̂_NIW, which is given as follows (see Section 2.2.2):

θ̂_NIW = (Ψ̂⊤ Ψ̂)⁻¹ Ψ̂⊤ r.
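A minimal Python sketch (not from the original text) of the weighted least-squares solution θ̂_AIW above; the matrix layout follows the definitions of Ψ̂, Ŵ, and r, and the small ridge term is an added assumption for numerical stability.

import numpy as np

def fit_aiw(Psi, w, r, nu, ridge=1e-8):
    # theta_AIW = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r, with W = diag(w).
    # nu = 0 recovers the NIW solution, nu = 1 the PIW solution.
    # Psi: (NT, B) matrix of psi-hat features, w: (NT,) importance weights, r: (NT,) rewards.
    Wnu = np.asarray(w, dtype=float) ** nu
    A = Psi.T @ (Wnu[:, None] * Psi) + ridge * np.eye(Psi.shape[1])   # ridge term: not in the text
    b = Psi.T @ (Wnu * r)
    return np.linalg.solve(A, b)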

4.2.4 Illustration

Here, the influence of the flattening parameter ν on the estimator θ̂_AIW is illustrated using the chain-walk MDP shown in Figure 4.1.

FIGURE 4.1: Ten-state chain-walk MDP.

The MDP consists of 10 states,

S = { s^(1), …, s^(10) },

and two actions,

A = { a^(1), a^(2) } = { "L", "R" }.

The reward +1 is given when visiting s^(1) and s^(10). The transition probability p is indicated by the numbers attached to the arrows in the figure. For example, p(s^(2)|s^(1), a = "R") = 0.9 and p(s^(1)|s^(1), a = "R") = 0.1 mean that the agent can successfully move to the right node with probability 0.9 (indicated by solid arrows in the figure) and the action fails with probability 0.1 (indicated by dashed arrows in the figure). Six Gaussian kernels with standard deviation σ = 10 are used as basis functions, and the kernel centers are located at s^(1), s^(5), and s^(10). More specifically, the basis functions φ(s, a) = (φ_1(s, a), …, φ_6(s, a)) are defined as

φ_{3(i−1)+j}(s, a) = I(a = a^(i)) exp( −(s − c_j)² / (2σ²) ),

for i = 1, 2 and j = 1, 2, 3, where

c_1 = 1, c_2 = 5, c_3 = 10,

and

I(x) = 1 if x is true, and 0 if x is not true.

The experiments are repeated 50 times, where the sampling policy π̃(a|s) and the current policy π(a|s) are chosen randomly in each trial such that π̃ ≠ π. The discount factor is set at γ = 0.9. The model parameter θ̂_AIW is learned from the training samples H^π̃ and its generalization error is computed from the test samples H^π.

The left column of Figure 4.2 depicts the true generalization error G averaged over 50 trials as a function of the flattening parameter ν for N = 10, 30, and 50. Figure 4.2(a) shows that when the number of episodes is large (N = 50), the generalization error tends to decrease as the flattening parameter increases. This would be a natural result due to the consistency of θ̂_AIW

FIGURE 4.2: Left: True generalization error G averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. The number of steps is fixed at T = 10. The trend of G differs depending on the number N of episodic samples. Right: Generalization error estimated by 5-fold importance-weighted cross-validation (IWCV) (Ĝ_IWCV) averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem. IWCV nicely captures the trend of the true generalization error G. (a) N = 50. (b) N = 30. (c) N = 10.

when ν = 1. On the other hand, Figure 4.2(b) shows that when the number of episodes is not large (N = 30), ν = 1 performs rather poorly. This implies that the consistent estimator tends to be unstable when the number of episodes is not large enough; ν = 0.7 works best in this case. Figure 4.2(c) shows the results when the number of episodes is further reduced (N = 10). This illustrates that the consistent estimator with ν = 1 is even worse than the ordinary estimator (ν = 0) because the bias is dominated by the large variance. In this case, the best ν is even smaller and is achieved at ν = 0.4.

The above results show that AIW can outperform PIW, particularly when only a small number of training samples are available, provided that the flattening parameter ν is chosen appropriately.

4.3 Automatic Selection of Flattening Parameter

In this section, the problem of selecting the flattening parameter in AIW is addressed.

4.3.1 Importance-Weighted Cross-Validation

Generally, the best ν tends to be large (small) when the number of training samples is large (small). However, this general trend is not sufficient to fine-tune the flattening parameter since the best value of ν depends on the training samples, the policies, the model of value functions, etc. In this section, we discuss how model selection can be performed to choose the best flattening parameter ν automatically from the training data and policies.

Ideally, the value of ν should be set so that the generalization error G is minimized, but the true G is not accessible in practice. To cope with this problem, we can use cross-validation (see Section 2.2.4) for estimating the generalization error G. However, in the off-policy scenario where the sampling policy π̃ and the target policy π are different, ordinary cross-validation gives a biased estimate of G. In the off-policy scenario, importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is more useful, where the cross-validation estimate of the generalization error is obtained with importance weighting.

More specifically, let us divide the training dataset H^π̃ containing N episodes into K subsets {H^π̃_k}_{k=1}^K of approximately the same size. For simplicity, we assume that N is divisible by K. Let θ̂^k_AIW be the parameter learned from H \ H_k (i.e., all samples without H_k). Then, the generalization error is estimated with

FIGURE 4.3: True generalization error G averaged over 50 trials obtained by NIW (ν = 0), PIW (ν = 1), and AIW+IWCV (ν is chosen by IWCV) in the 10-state chain-walk MDP.

importance weighting as

Ĝ_IWCV = (1/K) Σ_{k=1}^K Ĝ^k_IWCV,

where

Ĝ^k_IWCV = (K/NT) Σ_{h∈H^π̃_k} Σ_{t=1}^T ( (θ̂^k_AIW)⊤ ψ̂(s_t, a_t; H^π̃_k) − r(s_t, a_t, s_{t+1}) )² ŵ_t.

The generalization error estimate Ĝ_IWCV is computed for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter ν) and the one that minimizes the estimated generalization error is chosen:

ν̂_IWCV = argmin_ν Ĝ_IWCV.
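A compact Python sketch (not from the original text) of this K-fold IWCV procedure; the episode data structure, the tiny ridge term, and the candidate grid are hypothetical assumptions.

import numpy as np

def iwcv_error(episodes, nu, K=5):
    # K-fold importance-weighted cross-validation estimate of the generalization
    # error for a candidate flattening parameter nu.
    # `episodes`: list of dicts (hypothetical layout) with keys
    #   'Psi': (T, B) psi-hat features, 'r': (T,) rewards, 'w': (T,) per-decision weights.
    n = len(episodes)
    folds = np.array_split(np.arange(n), K)
    fold_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Psi = np.vstack([episodes[i]['Psi'] for i in train_idx])
        r = np.concatenate([episodes[i]['r'] for i in train_idx])
        w = np.concatenate([episodes[i]['w'] for i in train_idx])
        # AIW fit on the remaining folds: theta = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r.
        Wnu = w ** nu
        A = Psi.T @ (Wnu[:, None] * Psi) + 1e-8 * np.eye(Psi.shape[1])  # ridge: assumption
        theta = np.linalg.solve(A, Psi.T @ (Wnu * r))
        # Held-out squared residuals, weighted by the (unflattened) importance weights.
        fold_errors.append(np.mean([
            np.mean(((episodes[i]['Psi'] @ theta - episodes[i]['r']) ** 2) * episodes[i]['w'])
            for i in test_idx]))
    return float(np.mean(fold_errors))

# Model selection over a hypothetical candidate grid: nu_IWCV = argmin_nu G_IWCV(nu).
# nu_best = min(np.arange(0.0, 1.1, 0.1), key=lambda nu: iwcv_error(episodes, nu))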

4.3.2 Illustration

To illustrate how IWCV works, let us use the same numerical examples as in Section 4.2.4. The right column of Figure 4.2 depicts the generalization error estimated by 5-fold IWCV averaged over 50 trials as a function of the flattening parameter ν. The graphs show that IWCV nicely captures the trend of the true generalization error in all three cases.

Figure 4.3 describes, as a function of the number N of episodes, the average true generalization error obtained by NIW (AIW with ν = 0), PIW (AIW with ν = 1), and AIW+IWCV (ν ∈ {0.0, 0.1, …, 0.9, 1.0} is selected in each trial using 5-fold IWCV). This result shows that the improvement of the performance by NIW saturates when N ≥ 30, implying that the bias caused by NIW is not negligible. The performance of PIW is worse than NIW when N ≤ 20, which is caused by the large variance of PIW. On the other hand, AIW+IWCV consistently gives good performance for all N, illustrating the strong adaptation ability of AIW+IWCV.

4.4 Sample-Reuse Policy Iteration

In this section, AIW+IWCV is extended from single-step policy evaluation to full policy iteration. This method is called sample-reuse policy iteration (SRPI).

4.4.1 Algorithm

Let us denote the policy at the L-th iteration by π_L. In on-policy policy iteration, new data samples H^{π_L} are collected following the new policy π_L during the policy evaluation step. Thus, previously collected data samples H^{π_1}, …, H^{π_{L−1}} are not used:

π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_3}] ⋯ →[I] π_L,

where "E: H" indicates the policy evaluation step using the data sample H and "I" indicates the policy improvement step. It would be more cost efficient if all previously collected data samples were reused in policy evaluation:

π_1 →[E: H^{π_1}] Q̂^{π_1} →[I] π_2 →[E: H^{π_1}, H^{π_2}] Q̂^{π_2} →[I] π_3 →[E: H^{π_1}, H^{π_2}, H^{π_3}] ⋯ →[I] π_L.

Since the previous policies and the current policy are different in general, an off-policy scenario needs to be explicitly considered to reuse previously collected data samples. Here, we explain how AIW+IWCV can be used in this situation. For this purpose, the definition of Ĝ_AIW is extended so that multiple sampling policies π_1, …, π_L are taken into account:

Ĝ^L_AIW = (1/LNT) Σ_{l=1}^L Σ_{n=1}^N Σ_{t=1}^T ( θ⊤ψ̂(s^{π_l}_{t,n}, a^{π_l}_{t,n}; {H^{π_l}}_{l=1}^L) − r(s^{π_l}_{t,n}, a^{π_l}_{t,n}, s^{π_l}_{t+1,n}) )² ( Π_{t′=1}^t π_L(a^{π_l}_{t′,n}|s^{π_l}_{t′,n}) / Π_{t′=1}^t π_l(a^{π_l}_{t′,n}|s^{π_l}_{t′,n}) )^{ν_L},          (4.3)

where Ĝ^L_AIW is the generalization error estimated at the L-th policy evaluation step using AIW. The flattening parameter ν_L is chosen by IWCV before performing policy evaluation.

FIGURE 4.4: The performance of policies learned in three scenarios: ν = 0, ν = 1, and SRPI (ν is chosen by IWCV) in the 10-state chain-walk problem. The performance is measured by the average return computed from test samples over 30 trials. The agent collects the training sample H^{π_L} (N = 5 or 10 with T = 10) at every iteration and performs policy evaluation using all collected samples H^{π_1}, …, H^{π_L}. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (a) N = 5. (b) N = 10.

4.4.2 Illustration

Here, the behavior of SRPI is illustrated under the same experimental setup as in Section 4.3.2. Let us consider three scenarios: ν is fixed at 0, ν is fixed at 1, and ν is chosen by IWCV (i.e., SRPI). The agent collects samples H^{π_L} in each policy iteration following the current policy π_L and computes θ̂^L_AIW from all collected samples H^{π_1}, …, H^{π_L} using Eq. (4.3). Three Gaussian kernels are used as basis functions, where the kernel centers are randomly selected from the state space S in each trial. The initial policy π_1 is chosen randomly and Gibbs policy improvement,

π(a|s) ← exp( Q^π(s, a)/τ ) / Σ_{a′∈A} exp( Q^π(s, a′)/τ ),          (4.4)

is performed with τ = 2L.
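A minimal Python sketch (not from the original text) of the Gibbs (softmax) policy improvement of Eq. (4.4); the max-subtraction trick is an added assumption for numerical stability.

import numpy as np

def gibbs_policy(q_values, tau):
    # Eq. (4.4): pi(a|s) = exp(Q(s,a)/tau) / sum_{a'} exp(Q(s,a')/tau).
    # q_values: (n_actions,) estimated Q-values at a state s; tau > 0 is the temperature.
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                 # stabilization only; does not change the resulting probabilities
    p = np.exp(z)
    return p / p.sum()

# In SRPI the temperature is set according to the iteration number, e.g., tau = 2L
# at the L-th iteration, as stated in the text above.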

Figure 4.4 depicts the average return over 30 trials for N = 5 and 10 with a fixed number of steps (T = 10). The graphs show that SRPI provides stable and fast learning of policies, while the performance improvement of the policies learned with ν = 0 saturates in early iterations. The method with ν = 1 can improve policies well, but its progress tends to lag behind SRPI.

Figure 4.5 depicts the average value of the flattening parameter used in SRPI as a function of the total number of episodic samples. The graphs show that the value of the flattening parameter chosen by IWCV tends to rise in the beginning and go down later. At first sight, this does not agree with the general trend of preferring a low-variance estimator in early stages and preferring a

FIGURE 4.5: Flattening parameter values used by SRPI averaged over 30 trials as a function of the total number of episodic samples in the 10-state chain-walk problem. (a) N = 5. (b) N = 10.

low-bias estimator later. However, this result is still consistent with the general trend: when the return increases rapidly (while the total number of episodic samples is up to 15 when N = 5 and up to 30 when N = 10 in Figure 4.5), the value of the flattening parameter increases (see Figure 4.4). After that, the return does not increase anymore (see Figure 4.4) since policy iteration has already converged. It is then natural to prefer a small flattening parameter (Figure 4.5) since the sample selection bias becomes mild after convergence.

These results show that SRPI can effectively reuse previously collected samples by appropriately tuning the flattening parameter according to the condition of the data samples, policies, etc.

4.5 Numerical Examples

In this section, the performance of SRPI is numerically investigated in more complex tasks.

4.5.1 Inverted Pendulum

First, we consider the task of the swing-up inverted pendulum illustrated in Figure 4.6, which consists of a pole hinged at the top of a cart. The goal of the task is to swing the pole up by moving the cart. There are three actions: applying positive force +50 (kg·m/s²) to the cart to move it right, negative force −50 to move it left, and zero force to just coast. That is, the action space A is discrete and described by

A = { 50, −50, 0 } kg·m/s².

FIGURE 4.6: Illustration of the inverted pendulum task.

Note that the force itself is not strong enough to swing the pole up; the cart needs to be moved back and forth several times to swing the pole up. The state space S is continuous and consists of the angle ϕ [rad] (∈ [0, 2π]) and the angular velocity ϕ̇ [rad/s] (∈ [−π, π]). Thus, a state s is described by a two-dimensional vector s = (ϕ, ϕ̇)⊤. The angle ϕ and angular velocity ϕ̇ are updated as follows:

ϕ_{t+1} = ϕ_t + ϕ̇_{t+1} Δt,
ϕ̇_{t+1} = ϕ̇_t + [ 9.8 sin(ϕ_t) − α w d (ϕ̇_t)² sin(2ϕ_t)/2 + α cos(ϕ_t) a_t ] / ( 4d/3 − α w d cos²(ϕ_t) ) · Δt,

where α = 1/(W + w) and a_t (∈ A) is the action chosen at time t. The reward function r(s, a, s′) is defined as

r(s, a, s′) = cos(ϕ_{s′}),

where ϕ_{s′} denotes the angle ϕ of state s′. The problem parameters are set as follows: the mass of the cart W is 8 [kg], the mass of the pole w is 2 [kg], the length of the pole d is 0.5 [m], and the simulation time step Δt is 0.1 [s].
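A small Python sketch (not from the original text) of one simulation step of these dynamics, using the parameter values stated above; wrapping the angle into [0, 2π] and clipping the angular velocity are left out for brevity.

import numpy as np

def pendulum_step(phi, phi_dot, a, W=8.0, w=2.0, d=0.5, dt=0.1):
    # One Euler step of the swing-up inverted pendulum dynamics; a is the
    # applied force in {+50, -50, 0} [kg*m/s^2].
    alpha = 1.0 / (W + w)
    phi_acc = (9.8 * np.sin(phi)
               - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2.0
               + alpha * np.cos(phi) * a) / (4.0 * d / 3.0 - alpha * w * d * np.cos(phi) ** 2)
    phi_dot_next = phi_dot + phi_acc * dt
    phi_next = phi + phi_dot_next * dt
    reward = np.cos(phi_next)        # r(s, a, s') = cos(phi of the next state)
    return phi_next, phi_dot_next, reward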

Forty-eight Gaussian kernels with standard deviation σ = π are used as basis functions, and the kernel centers are located over the following grid points:

{ 0, 2π/3, 4π/3, 2π } × { −3π, −π, π, 3π }.

That is, the basis functions φ(s, a) = (φ_1(s, a), …, φ_48(s, a)) are set as

φ_{16(i−1)+j}(s, a) = I(a = a^(i)) exp( −‖s − c_j‖² / (2σ²) ),

for i = 1, 2, 3 and j = 1, …, 16, where

c_1 = (0, −3π)⊤, c_2 = (0, −π)⊤, …, c_16 = (2π, 3π)⊤.

FIGURE 4.7: Results of SRPI in the inverted pendulum task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) in each iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) The performance of policies learned with ν = 0, ν = 1, and SRPI. The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values chosen by IWCV in SRPI over 20 trials.

The initial policy π_1(a|s) is chosen randomly, and the initial-state probability density p(s) is set to be uniform. The agent collects data samples H^{π_L} (N = 10 and T = 100) at each policy iteration following the current policy π_L. The discount factor is set at γ = 0.95 and the policy is updated by Gibbs policy improvement (4.4) with τ = L.

Figure 4.7(a) describes the performance of the learned policies. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration. On the other hand, the performance when the flattening parameter is fixed at ν = 0 or ν = 1 is not properly improved after the middle of the iterations. The average flattening parameter values depicted in Figure 4.7(b) show that the flattening parameter tends to increase quickly in the beginning and is then kept at medium values. Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV and with ν = 1 are illustrated in Figure 4.8 and Figure 4.9, respectively.

These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.

FIGURE 4.8: Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV (from left to right and top to bottom).

FIGURE 4.9: Motion examples of the inverted pendulum by SRPI with ν = 1 (from left to right and top to bottom).

4.5.2 Mountain Car

Next, we consider the mountain-car task illustrated in Figure 4.10. The task consists of a car and two hills whose landscape is described by sin(3x). The top of the right hill is the goal to which we want to guide the car.

FIGURE 4.10: Illustration of the mountain-car task.

There are three actions,

{ +0.2, −0.2, 0 },

which are the values of the force applied to the car. Note that the force of the car is not strong enough to climb up the slope to reach the goal. The state space S is described by the horizontal position x [m] (∈ [−1.2, 0.5]) and the velocity ẋ [m/s] (∈ [−1.5, 1.5]): s = (x, ẋ)⊤. The position x and velocity ẋ are updated by

x_{t+1} = x_t + ẋ_{t+1} Δt,
ẋ_{t+1} = ẋ_t + ( −9.8 w cos(3x_t) + a_t/w − k ẋ_t ) Δt,

where a_t (∈ A) is the action chosen at time t. The reward function R(s, a, s′) is defined as

R(s, a, s′) = 1 if x_{s′} ≥ 0.5, and −0.01 otherwise,

where x_{s′} denotes the horizontal position x of state s′. The problem parameters are set as follows: the mass of the car w is 0.2 [kg], the friction coefficient k is 0.3, and the simulation time step Δt is 0.1 [s].
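A small Python sketch (not from the original text) of one simulation step of these mountain-car dynamics with the parameter values stated above; clipping the position and velocity to their stated ranges is omitted for brevity.

import numpy as np

def mountain_car_step(x, x_dot, a, w=0.2, k=0.3, dt=0.1):
    # One Euler step of the mountain-car dynamics; a is the applied force in {+0.2, -0.2, 0}.
    x_dot_next = x_dot + (-9.8 * w * np.cos(3 * x) + a / w - k * x_dot) * dt
    x_next = x + x_dot_next * dt
    reward = 1.0 if x_next >= 0.5 else -0.01   # R(s, a, s') as defined in the text
    return x_next, x_dot_next, reward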

The same experimental setup as for the swing-up inverted pendulum task in Section 4.5.1 is used, except that the number of Gaussian kernels is 36, the kernel standard deviation is set at σ = 1, and the kernel centers are allocated over the following grid points:

{ −1.2, 0.35, 0.5 } × { −1.5, −0.5, 0.5, 1.5 }.

Figure 4.11(a) shows the performance of the learned policies measured by the average return computed from the test samples.

FIGURE 4.11: Results of sample-reuse policy iteration in the mountain-car task. The agent collects the training sample H^{π_L} (N = 10 and T = 100) at every iteration and policy evaluation is performed using all collected samples H^{π_1}, …, H^{π_L}. (a) The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values used by SRPI over 20 trials.

The graph shows similar tendencies to the swing-up inverted pendulum task for SRPI and ν = 1, while the method with ν = 0 performs relatively well this time. This implies that the bias in the previously collected samples does not affect the estimation of the value functions that strongly, because the function approximator is better suited to representing the value function for this problem. The average flattening parameter values (cf. Figure 4.11(b)) show that the flattening parameter decreases soon after the increase in the beginning, and smaller values then tend to be chosen. This indicates that SRPI tends to use low-variance estimators in this task. Motion examples by SRPI with ν chosen by IWCV are illustrated in Figure 4.12.

These results show that SRPI can achieve stable and fast learning by effectively reusing previously collected data.

FIGURE 4.12: Motion examples of the mountain car by SRPI with ν chosen by IWCV (from left to right and top to bottom).

4.6 Remarks

Instability has been one of the critical limitations of importance-sampling techniques, which often makes off-policy methods impractical. To overcome this weakness, an adaptive importance-sampling technique was introduced for controlling the trade-off between consistency and stability in value function approximation. Furthermore, importance-weighted cross-validation was introduced for automatically choosing the trade-off parameter.

The range of application of importance sampling is not limited to policy iteration. We will explain how importance sampling can be utilized for sample reuse in the policy search frameworks in Chapter 8 and Chapter 9.

Chapter 5

Active Learning in Policy Iteration

In Chapter 4, we considered the off-policy situation where the data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always chosen following the target policy. However, a clever choice of sampling policies can actually further improve the performance. The topic of choosing sampling policies is called active learning in statistics and machine learning. In this chapter, we address the problem of choosing sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed for optimizing the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates the active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and finally this chapter is concluded in Section 5.4.

5.1 Efficient Exploration with Active Learning

The accuracy of estimated value functions depends on the training samples collected following the sampling policy π̃(a|s). In this section, we explain how a statistical active learning method (Sugiyama, 2006) can be employed for value function approximation.

5.1.1 Problem Setup

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see Figure 5.6). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap since we just need to control the robot arm and record its state-action trajectories over time. However, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of the links, air resistance, air currents, and so on. For this reason, in practice, we may have to put the robot in an open space, let the robot really hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than gathering state-action trajectory samples. In such a situation, immediate reward samples are too expensive to be used for designing the sampling policy; only state-action trajectory samples may be used for designing sampling policies.

The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. However, since the generalization error is not accessible in practice, it needs to be estimated from samples for performing active learning. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples, without using immediate reward samples. This means that standard generalization error estimation techniques such as cross-validation cannot be employed. Below, we explain how the generalization error can be estimated without the reward samples.

5.1.2 Decomposition of Generalization Error

The information we are allowed to use for estimating the generalization error
is a set of roll-out samples without immediate rewards:

  H^π̃ = {h^π̃_1, ..., h^π̃_N},

where each episodic sample h^π̃_n is given as

  h^π̃_n = [s^π̃_{1,n}, a^π̃_{1,n}, ..., s^π̃_{T,n}, a^π̃_{T,n}, s^π̃_{T+1,n}].

Let us define the deviation of an observed immediate reward r^π̃_{t,n} from
its expectation r(s^π̃_{t,n}, a^π̃_{t,n}) as

  ǫ^π̃_{t,n} = r^π̃_{t,n} − r(s^π̃_{t,n}, a^π̃_{t,n}).

Note that ǫ^π̃_{t,n} could be regarded as additive noise in the context of
least-squares function fitting. By definition, ǫ^π̃_{t,n} has mean zero and its
variance generally depends on s^π̃_{t,n} and a^π̃_{t,n}, i.e., heteroscedastic
noise (Bishop, 2006). However, since estimating the variance of ǫ^π̃_{t,n}
without using reward samples is not generally possible, we ignore the
dependence of the variance on s^π̃_{t,n} and a^π̃_{t,n}. Let us denote the
input-independent common variance by σ².

We would like to estimate the generalization error,

  G(θ̂) = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ̂^⊤ ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ],

from H^π̃. Its expectation over "noise" can be decomposed as follows
(Sugiyama, 2006):

  E_{ǫ^π̃}[ G(θ̂) ] = Bias + Variance + ModelError,

where E_{ǫ^π̃} denotes the expectation over "noise" {ǫ^π̃_{t,n}}_{t=1,n=1}^{T,N}.
"Bias," "Variance," and "ModelError" are the bias term, the variance term, and
the model error term defined by

  Bias = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (E_{ǫ^π̃}[θ̂] − θ*)^⊤ ψ̂(s_t, a_t; H^π̃) }² ],

  Variance = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T { (θ̂ − E_{ǫ^π̃}[θ̂])^⊤ ψ̂(s_t, a_t; H^π̃) }² ],

  ModelError = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ*^⊤ ψ̂(s_t, a_t; H^π̃) − r(s_t, a_t) )² ].

θ* denotes the optimal parameter in the model:

  θ* = argmin_θ E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ( θ^⊤ ψ(s_t, a_t) − r(s_t, a_t) )² ].

Note that, for a linear estimator θ̂ such that

  θ̂ = L̂ r,

where L̂ is some matrix and r is the NT-dimensional vector defined as

  r_{N(t−1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}),

the variance term can be expressed in a compact form as

  Variance = σ² tr(U L̂ L̂^⊤),

where the matrix U is defined as

  U = E_{p^π̃(h)} [ (1/T) Σ_{t=1}^T ψ̂(s_t, a_t; H^π̃) ψ̂(s_t, a_t; H^π̃)^⊤ ].    (5.1)

5.1.3 Estimation of Generalization Error

Since we are interested in finding a minimizer of the generalization error
with respect to π̃, the model error, which is constant, can be safely ignored
in generalization error estimation. On the other hand, the bias term includes
the unknown optimal parameter θ*. Thus, it may not be possible to estimate the
bias term without using reward samples. Similarly, it may not be possible to
estimate the "noise" variance σ² included in the variance term without using
reward samples.

It is known that the bias term is small enough to be neglected when the model
is approximately correct (Sugiyama, 2006), i.e., θ*^⊤ ψ̂(s, a) approximately
agrees with the true function r(s, a). Then we have

  E_{ǫ^π̃}[ G(θ̂) ] − ModelError − Bias ∝ tr(U L̂ L̂^⊤),    (5.2)

which does not require immediate reward samples for its computation. Since
E_{p^π̃(h)} included in U is not accessible (see Eq. (5.1)), U is replaced by
its consistent estimator Û:

  Û = (1/(NT)) Σ_{n=1}^N Σ_{t=1}^T ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃)^⊤ ŵ_{t,n}.

Consequently, the following generalization error estimator is obtained:

  J = tr(Û L̂ L̂^⊤),

which can be computed only from H^π̃ and thus can be employed in the active
learning scenarios. If it is possible to gather H^π̃ multiple times, the above
J may be computed multiple times and their average is used as a generalization
error estimator.

Note that the values of the generalization error estimator J and the true
generalization error G are not directly comparable since irrelevant additive
and multiplicative constants are ignored (see Eq. (5.2)). However, this is no
problem as long as the estimator J has a similar profile to the true error G
as a function of sampling policy π̃, since the purpose of deriving a
generalization error estimator in active learning is not to approximate the
true generalization error itself, but to approximate the minimizer of the true
generalization error with respect to sampling policy π̃.

5.1.4 Designing Sampling Policies

Based on the generalization error estimator derived above, a sampling policy
is designed as follows:

1. Prepare K candidates of sampling policy: {π̃_k}_{k=1}^K.

2. Collect episodic samples without immediate rewards for each
   sampling-policy candidate: {H^π̃k}_{k=1}^K.

3. Estimate U using all samples {H^π̃k}_{k=1}^K:

   Û = (1/(KNT)) Σ_{k=1}^K Σ_{n=1}^N Σ_{t=1}^T ψ̂(s^π̃k_{t,n}, a^π̃k_{t,n}; {H^π̃k}_{k=1}^K) ψ̂(s^π̃k_{t,n}, a^π̃k_{t,n}; {H^π̃k}_{k=1}^K)^⊤ ŵ^π̃k_{t,n},

   where ŵ^π̃k_{t,n} denotes the importance weight for the k-th sampling
   policy π̃_k:

   ŵ^π̃k_{t,n} = ∏_{t′=1}^t π(a^π̃k_{t′,n} | s^π̃k_{t′,n}) / ∏_{t′=1}^t π̃_k(a^π̃k_{t′,n} | s^π̃k_{t′,n}).

4. Estimate the generalization error for each k:

   J_k = tr(Û L̂^π̃k (L̂^π̃k)^⊤),

   where L̂^π̃k is defined as

   L̂^π̃k = (Ψ̂^π̃k⊤ Ŵ^π̃k Ψ̂^π̃k)^{−1} Ψ̂^π̃k⊤ Ŵ^π̃k.

   Ψ̂^π̃k is the NT × B matrix and Ŵ^π̃k is the NT × NT diagonal matrix
   defined as

   Ψ̂^π̃k_{N(t−1)+n, b} = ψ̂_b(s^π̃k_{t,n}, a^π̃k_{t,n}),
   Ŵ^π̃k_{N(t−1)+n, N(t−1)+n} = ŵ^π̃k_{t,n}.

5. (If possible) repeat 2 to 4 several times and calculate the average for
   each k.

6. Determine the sampling policy as

   π̃_AL = argmin_{k=1,...,K} J_k.

7. Collect training samples with immediate rewards following π̃_AL.

8. Learn the value function by least-squares policy iteration using the
   collected samples.
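For concreteness, the following Python sketch illustrates how the criterion
J_k = tr(Û L̂^π̃k (L̂^π̃k)^⊤) of steps 3 and 4 might be computed once the
feature matrices and importance weights of the roll-out samples are available.
The data layout and function names are illustrative assumptions; only the
small ridge term added before inversion is taken from the experimental setting
described later in this section.

```python
import numpy as np

def al_criteria(Psis, weights, ridge=1e-3):
    """Active-learning criteria J_k = tr(U_hat L_hat_k L_hat_k^T) for K candidates.

    Psis    : list of (N*T, B) feature matrices, one per sampling-policy candidate
    weights : list of (N*T,) importance weights w_hat_{t,n} for each candidate
    """
    # Step 3: pooled estimate U_hat over all candidates' samples
    B = Psis[0].shape[1]
    U = np.zeros((B, B))
    total = 0
    for Psi, w in zip(Psis, weights):
        U += (Psi * w[:, None]).T @ Psi
        total += len(w)
    U /= total

    # Step 4: J_k from the weighted least-squares matrix L_hat_k
    Js = []
    for Psi, w in zip(Psis, weights):
        A = Psi.T @ (Psi * w[:, None]) + ridge * np.eye(B)  # ridge avoids degeneracy
        L = np.linalg.solve(A, (Psi * w[:, None]).T)        # (Psi^T W Psi)^{-1} Psi^T W
        Js.append(np.trace(U @ L @ L.T))
    return Js

# Step 6: pick the candidate minimizing the criterion, e.g.,
# k_AL = int(np.argmin(al_criteria(Psis, weights)))
```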

5.1.5 Illustration

Here, the behavior of the active learning method is illustrated on a toy
10-state chain-walk environment shown in Figure 5.1. The MDP consists of 10
states,

  S = {s^(i)}_{i=1}^{10} = {1, 2, ..., 10},

and 2 actions,

  A = {a^(i)}_{i=1}^2 = {"L", "R"}.

FIGURE 5.1: Ten-state chain walk. Filled and unfilled arrows indicate the
transitions when taking action "R" and "L," and solid and dashed lines
indicate the successful and failed transitions.

The immediate reward function is defined as

  r(s, a, s′) = f(s′),

where the profile of the function f(s′) is illustrated in Figure 5.2.

The transition probability p(s′|s, a) is indicated by the numbers attached to
the arrows in Figure 5.1. For example, p(s^(2)|s^(1), a = "R") = 0.8 and
p(s^(1)|s^(1), a = "R") = 0.2. Thus, the agent can successfully move to the
intended direction with probability 0.8 (indicated by solid-filled arrows in
the figure) and the action fails with probability 0.2 (indicated by
dashed-filled arrows in the figure). The discount factor γ is set at 0.9. The
following 12 Gaussian basis functions φ(s, a) are used:

  φ_{2(i−1)+j}(s, a) = { I(a = a^(j)) exp(−(s − c_i)² / (2τ²))   for i = 1, ..., 5 and j = 1, 2,
                       { I(a = a^(j))                             for i = 6 and j = 1, 2,

where c_1 = 1, c_2 = 3, c_3 = 5, c_4 = 7, c_5 = 9, and τ = 1.5. I(a = a′)
denotes the indicator function:

  I(a = a′) = { 1 if a = a′,
              { 0 if a ≠ a′.
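As a concrete illustration, a minimal Python sketch of these 12 basis
functions might look as follows; the vectorized layout and the function name
are my own choices and not from the original text.

```python
import numpy as np

CENTERS = np.array([1, 3, 5, 7, 9])  # c_1, ..., c_5
TAU = 1.5                            # Gaussian width tau

def phi(s, a):
    """12-dimensional feature vector phi(s, a) for the 10-state chain walk.

    s : state in {1, ..., 10}
    a : action index in {0, 1} corresponding to ("L", "R")
    """
    feat = np.zeros(12)
    gauss = np.exp(-(s - CENTERS) ** 2 / (2 * TAU ** 2))
    # Gaussian features are active only for the taken action (indicator I(a = a^(j)))
    feat[2 * np.arange(5) + a] = gauss
    # Constant feature for the taken action (i = 6)
    feat[10 + a] = 1.0
    return feat
```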

Sampling policies and evaluation policies are constructed as follows. First, a
deterministic "base" policy π is prepared, for example, "LLLLLRRRRR," where
the i-th letter denotes the action taken at s^(i). Let π^ǫ be the "ǫ-greedy"
version of the base policy π, i.e., the intended action can be successfully
chosen with probability 1 − ǫ/2 and the other action is chosen with
probability ǫ/2.

FIGURE 5.2: Profile of the function f(s′).

Experiments are performed for three different evaluation policies:

  π_1: "RRRRRRRRRR,"
  π_2: "RRLLLLLRRR,"
  π_3: "LLLLLRRRRR,"

with ǫ = 0.1. For each evaluation policy π^0.1_i (i = 1, 2, 3), 10 candidates
of the sampling policy {π̃^(k)_i}_{k=1}^{10} are prepared, where
π̃^(k)_i = π^{k/10}_i. Note that π̃^(1)_i is equivalent to the evaluation
policy π^0.1_i.

For each sampling policy, the active learning criterion J is computed 5 times
and their average is taken. The numbers of episodes and steps are set at
N = 10 and T = 10, respectively. The initial-state probability p(s) is set to
be uniform. When the matrix inverse is computed, 10^{−3} is added to the
diagonal elements to avoid degeneracy. This experiment is repeated 100 times
with different random seeds and the mean and standard deviation of the true
generalization error and its estimate are evaluated.

The results are depicted in Figure 5.3 as functions of the index k of the
sampling policies. The graphs show that the generalization error estimator
overall captures the trend of the true generalization error well for all three
cases.

Next, the values of the obtained generalization error G are evaluated when k
is chosen so that J is minimized (active learning, AL), when the evaluation
policy (k = 1) is used for sampling (passive learning, PL), and when k is
chosen optimally so that the true generalization error is minimized (optimal,
OPT). Figure 5.4 shows that the active learning method compares favorably with
passive learning and performs well for reducing the generalization error.

5.2 Active Policy Iteration

In Section 5.1, the unknown generalization error was shown to be accurately
estimated without using immediate reward samples in one-step policy
evaluation. In this section, this one-step active learning idea is extended to
the framework of sample-reuse policy iteration introduced in Chapter 4, which
is called active policy iteration. Let us denote the evaluation policy at the
L-th iteration by π_L.

FIGURE 5.3: The mean and standard deviation of the true generalization error
G (left) and the estimated generalization error J (right) over 100 trials, as
functions of the sampling policy index k, for the evaluation policies
π^0.1_1 (a), π^0.1_2 (b), and π^0.1_3 (c).

5.2.1 Sample-Reuse Policy Iteration with Active Learning

In the original sample-reuse policy iteration, new data samples H^πl are
collected following the new target policy π_l for the next policy evaluation
step:

  π_1 --E:{H^π1}--> Q̂^π1 --I--> π_2 --E:{H^π1,H^π2}--> Q̂^π2 --I--> π_3
      --E:{H^π1,H^π2,H^π3}--> ··· --I--> π_{L+1},

FIGURE 5.4: The box-plots of the values of the obtained generalization error
G over 100 trials when k is chosen so that J is minimized (active learning,
AL), the evaluation policy (k = 1) is used for sampling (passive learning,
PL), and k is chosen optimally so that the true generalization error is
minimized (optimal, OPT), for the evaluation policies π^0.1_1 (a), π^0.1_2
(b), and π^0.1_3 (c). The box-plot notation indicates the 5% quantile, 25%
quantile, 50% quantile (i.e., median), 75% quantile, and 95% quantile from
bottom to top.

where "E: {H}" indicates policy evaluation using the data sample H and "I"
denotes policy improvement. On the other hand, in active policy iteration, the
optimized sampling policy π̃_l is used at each iteration:

  π_1 --E:{H^π̃1}--> Q̂^π1 --I--> π_2 --E:{H^π̃1,H^π̃2}--> Q̂^π2 --I--> π_3
      --E:{H^π̃1,H^π̃2,H^π̃3}--> ··· --I--> π_{L+1}.

Note that, in active policy iteration, the previously collected samples are
used not only for value function approximation, but also for active learning.
Thus, active policy iteration makes full use of the samples.

5.2.2 Illustration

Here, the behavior of active policy iteration is illustrated using the same
10-state chain-walk problem as Section 5.1.5 (see Figure 5.1).

The initial evaluation policy π_1 is set as

  π_1(a|s) = 0.15 p_u(a) + 0.85 I(a = argmax_{a′} Q̂_0(s, a′)),

where p_u(a) denotes the probability mass function of the uniform distribution
and

  Q̂_0(s, a) = Σ_{b=1}^{12} φ_b(s, a).

Policies are updated in the l-th iteration using the ǫ-greedy rule with
ǫ = 0.15/l. In the sampling-policy selection step of the l-th iteration, the
following four sampling-policy candidates are prepared:

  π̃^(1)_l = π^{0.15/l}_l,  π̃^(2)_l = π^{0.15/l+0.15}_l,
  π̃^(3)_l = π^{0.15/l+0.5}_l,  π̃^(4)_l = π^{0.15/l+0.85}_l,

where π_l denotes the policy obtained by greedy update using Q̂^{π_{l−1}}.

The number of iterations to learn the policy is set at 7, the number of steps
is set at T = 10, and the number N of episodes is different in each iteration
and defined as {N_1, ..., N_7}, where N_l (l = 1, ..., 7) denotes the number
of episodes collected in the l-th iteration. In this experiment, two types of
scheduling are compared: {5, 5, 3, 3, 3, 1, 1} and {3, 3, 3, 3, 3, 3, 3},
which are referred to as the "decreasing N" strategy and the "fixed N"
strategy, respectively. The J-value calculation is repeated 5 times for active
learning. The performance of the finally obtained policy π_8 is measured by
the return for test samples {r^{π8}_{t,n}}_{t,n=1}^{T,N} (50 episodes with 50
steps collected following π_8):

  Performance = (1/N) Σ_{n=1}^N Σ_{t=1}^T γ^{t−1} r^{π8}_{t,n},

where the discount factor γ is set at 0.9.

The performance of passive learning (PL; the current policy is used as the
sampling policy in each iteration) and active learning (AL; the best sampling
policy is chosen from the policy candidates prepared in each iteration) is
compared. The experiments are repeated 1000 times with different random seeds
and the average performance of PL and AL is evaluated. The results are
depicted in Figure 5.5, showing that AL works better than PL in both types of
episode scheduling with statistical significance by the t-test at the
significance level 1% (Henkel, 1976) for the error values obtained after the
7th iteration. Furthermore, the "decreasing N" strategy outperforms the "fixed
N" strategy for both PL and AL, showing the usefulness of the "decreasing N"
strategy.

FIGURE 5.5: The mean performance over 1000 trials in the 10-state chain-walk
experiment. The dotted lines denote the performance of passive learning (PL)
and the solid lines denote the performance of the proposed active learning
(AL) method. The error bars are omitted for clear visibility. For both the
"decreasing N" and "fixed N" strategies, the performance of AL after the 7th
iteration is significantly better than that of PL according to the t-test at
the significance level 1% applied to the error values at the 7th iteration.

5.3 Numerical Examples

In this section, the performance of active policy iteration is evaluated using
a ball-batting robot illustrated in Figure 5.6, which consists of two links
and two joints. The goal of the ball-batting task is to control the robot arm
so that it drives the ball as far away as possible. The state space S is
continuous and consists of angles ϕ_1 [rad] (∈ [0, π/4]) and ϕ_2 [rad]
(∈ [−π/4, π/4]) and angular velocities ϕ̇_1 [rad/s] and ϕ̇_2 [rad/s]. Thus, a
state s (∈ S) is described by a 4-dimensional vector s = (ϕ_1, ϕ̇_1, ϕ_2, ϕ̇_2)^⊤.
The action space A is discrete and contains two elements:

  A = {a^(i)}_{i=1}^2 = {(50, −35)^⊤, (−50, 10)^⊤},

where the i-th element (i = 1, 2) of each vector corresponds to the torque
[N·m] added to joint i.

FIGURE 5.6: A ball-batting robot.

The open dynamics engine (http://ode.org/) is used for physical calculations
including the update of the angles and angular velocities, and collision
detection between the robot arm, ball, and pin. The simulation time step is
set at 7.5 [ms] and the next state is observed after 10 time steps. The action
chosen in the current state is taken for 10 time steps. To make the
experiments realistic, noise is added to actions: if action (f_1, f_2)^⊤ is
taken, the actual torques applied to the joints are f_1 + ε_1 and f_2 + ε_2,
where ε_1 and ε_2 are drawn independently from the Gaussian distribution with
mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given
only when the robot arm collides with the ball for the first time at state s′
after taking action a at current state s. For value function approximation,
the following 110 basis functions are used:

  φ_{2(i−1)+j}(s, a) = { I(a = a^(j)) exp(−‖s − c_i‖² / (2τ²))   for i = 1, ..., 54 and j = 1, 2,
                       { I(a = a^(j))                             for i = 55 and j = 1, 2,

where τ is set at 3π/2 and the Gaussian centers c_i (i = 1, ..., 54) are
located on the regular grid:
{0, π/4} × {−π, 0, π} × {−π/4, 0, π/4} × {−π, 0, π}.

For L = 7 and T = 10, the "decreasing N" strategy and the "fixed N" strategy
are compared. The "decreasing N" strategy is defined by {10, 10, 7, 7, 7, 4, 4}
and the "fixed N" strategy is defined by {7, 7, 7, 7, 7, 7, 7}. The initial
state is always set at s = (π/4, 0, 0, 0)^⊤, and J-calculations are repeated 5
times in the active learning method. The initial evaluation policy π_1 is set
at the ǫ-greedy policy defined as

  π_1(a|s) = 0.15 p_u(a) + 0.85 I(a = argmax_{a′} Q̂_0(s, a′)),
  Q̂_0(s, a) = Σ_{b=1}^{110} φ_b(s, a).

Policies are updated in the l-th iteration using the ǫ-greedy rule with
ǫ = 0.15/l. Sampling-policy candidates are prepared in the same way as the
chain-walk experiment in Section 5.2.2.

The discount factor γ is set at 1 and the performance of the learned policy
π_8 is measured by the return for test samples {r^{π8}_{t,n}}_{t,n=1}^{10,20}
(20 episodes with 10 steps collected following π_8):
Σ_{n=1}^{N} Σ_{t=1}^{T} r^{π8}_{t,n}.

The experiment is repeated 500 times with different random seeds and the
average performance of each learning method is evaluated. The results,
depicted in Figure 5.7, show that active learning outperforms passive
learning. For the "decreasing N" strategy, the performance difference is
statistically significant by the t-test at the significance level 1% for the
error values after the 7th iteration.

FIGURE 5.7: The mean performance over 500 trials in the ball-batting
experiment. The dotted lines denote the performance of passive learning (PL)
and the solid lines denote the performance of the proposed active learning
(AL) method. The error bars are omitted for clear visibility. For the
"decreasing N" strategy, the performance of AL after the 7th iteration is
significantly better than that of PL according to the t-test at the
significance level 1% for the error values at the 7th iteration.

Motion examples of the ball-batting robot trained with active learning and
passive learning are illustrated in Figure 5.8 and Figure 5.9, respectively.

FIGURE 5.8: A motion example of the ball-batting robot trained with active
learning (from left to right and top to bottom).

FIGURE 5.9: A motion example of the ball-batting robot trained with passive
learning (from left to right and top to bottom).

5.4 Remarks

When we cannot afford to collect many training samples due to high sampling
costs, it is crucial to choose the most informative samples for efficiently
learning the value function. In this chapter, an active learning method for
optimizing data sampling strategies was introduced in the framework of
sample-reuse policy iteration, and the resulting active policy iteration was
demonstrated to be promising.

Chapter 6

Robust Policy Iteration

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2
is useful, thanks to its computational efficiency and analytical tractability.
However, due to the squared loss, it tends to be sensitive to outliers in
observed rewards. In this chapter, we introduce an alternative policy
iteration method that employs the absolute loss for enhancing robustness and
reliability. In Section 6.1, the robustness and reliability brought by the use
of the absolute loss are discussed. In Section 6.2, the policy iteration
framework with the absolute loss, called least-absolute policy iteration
(LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated
through experiments. Variations of LAPI are considered in Section 6.4, and
finally this chapter is concluded in Section 6.5.

6.1 Robustness and Reliability in Policy Iteration

The basic idea of LSPI is to fit a linear model to immediate rewards under the
squared loss, while the absolute loss is used in this chapter (see Figure
6.1). This is just a replacement of loss functions, but this modification
highly enhances robustness and reliability.

FIGURE 6.1: The absolute and squared loss functions for reducing the
temporal-difference error.

6.1.1 Robustness

In many robotics applications, immediate rewards are obtained through
measurement such as distance sensors or computer vision. Due to intrinsic
measurement noise or recognition error, the obtained rewards often deviate
from the true value. In particular, the rewards occasionally contain outliers,
which are significantly different from regular values.

Residual minimization under the squared loss amounts to obtaining the mean of
samples {x_i}_{i=1}^m:

  argmin_c Σ_{i=1}^m (x_i − c)² = mean({x_i}_{i=1}^m) = (1/m) Σ_{i=1}^m x_i.

If one of the values is an outlier having a very large or small value, the
mean would be strongly affected by this outlier. This means that all the
values {x_i}_{i=1}^m are responsible for the mean, and therefore even a single
outlier observation can significantly damage the learned result.

On the other hand, residual minimization under the absolute loss amounts to
obtaining the median:

  argmin_c Σ_{i=1}^{2n+1} |x_i − c| = median({x_i}_{i=1}^{2n+1}) = x_{n+1},

where x_1 ≤ x_2 ≤ ··· ≤ x_{2n+1}. The median is influenced not by the
magnitude of the values {x_i}_{i=1}^{2n+1} but only by their order. Thus, as
long as the order is kept unchanged, the median is not affected by outliers.
In fact, the median is known to be the most robust estimator in light of
breakdown-point analysis (Huber, 1981; Rousseeuw & Leroy, 1987).

Therefore, the use of the absolute loss would remedy the problem of robustness
in policy iteration.
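As a quick numerical illustration of this point, the following sketch
contrasts the two estimators on a small sample containing one outlier; the
particular numbers are made up for illustration only.

```python
import numpy as np

# Nine "regular" reward observations plus a single large outlier.
rewards = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 0.9, 1.1, 1.0, 1.0, 50.0])

# The squared-loss fit (the mean) is dragged toward the outlier,
# while the absolute-loss fit (the median) barely moves.
print("mean  :", rewards.mean())      # about 5.9
print("median:", np.median(rewards))  # 1.0
```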

6.1.2 Reliability

In practical robot-control tasks, we often want to attain a stable
performance, rather than to achieve a "dream" performance with little chance
of success. For example, in the acquisition of a humanoid gait, we may want
the robot to walk forward in a stable manner with high probability of success,
rather than to rush very fast at a chance level.

On the other hand, we do not want to be too conservative when training robots.
If we are overly concerned with unrealistic failure, no practically useful
control policy can be obtained. For example, any robot can be broken in
principle if it is activated for a long time. However, if we fear this fact
too much, we may end up praising a control policy that does not move the robot
at all, which is obviously nonsense.

Since the squared-loss solution is not robust against outliers, it is
sensitive to rare events with either positive or negative very large immediate
rewards. Consequently, the squared loss prefers an extraordinarily successful
motion even if the success probability is very low. Similarly, it dislikes an
unrealistic trouble even if such a terrible event may not happen in reality.
On the other hand, the absolute-loss solution is not easily affected by such
rare events due to its robustness. Therefore, the use of the absolute loss
would produce a reliable control policy even in the presence of such extreme
events.

6.2 Least Absolute Policy Iteration

In this section, a policy iteration method with the absolute loss is
introduced.

6.2.1 Algorithm

Instead of the squared loss, a linear model is fitted to immediate rewards
under the absolute loss as

  min_θ Σ_{t=1}^T | θ^⊤ ψ̂(s_t, a_t) − r_t |.

This minimization problem looks cumbersome due to the absolute value operator,
which is non-differentiable, but it can be reduced to the following linear
program (Boyd & Vandenberghe, 2004):

  min_{θ, {b_t}_{t=1}^T}  Σ_{t=1}^T b_t
  subject to  −b_t ≤ θ^⊤ ψ̂(s_t, a_t) − r_t ≤ b_t,  t = 1, ..., T.

The number of constraints is T in the above linear program. When T is large,
we may employ sophisticated optimization techniques such as column generation
(Demiriz et al., 2002) for efficiently solving the linear programming problem.
Alternatively, an approximate solution can be obtained by gradient descent or
the (quasi-)Newton methods if the absolute loss is approximated by a smooth
loss (see, e.g., Section 6.4.1).

The policy iteration method based on the absolute loss is called least
absolute policy iteration (LAPI).
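For concreteness, here is one possible way to solve this linear program with
off-the-shelf tools; the use of scipy.optimize.linprog and the variable layout
[θ, b] are illustrative choices, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import linprog

def least_absolute_fit(Psi, r):
    """Fit theta by minimizing sum_t |theta^T psi_t - r_t| via a linear program.

    Psi : (T, B) matrix whose rows are psi_hat(s_t, a_t)
    r   : (T,)  vector of observed immediate rewards
    """
    T, B = Psi.shape
    # Decision variables: x = [theta (B), b (T)]; minimize sum of slacks b_t.
    c = np.concatenate([np.zeros(B), np.ones(T)])
    # Constraints  theta^T psi_t - r_t <= b_t  and  -(theta^T psi_t - r_t) <= b_t.
    A_ub = np.block([[Psi, -np.eye(T)],
                     [-Psi, -np.eye(T)]])
    b_ub = np.concatenate([r, -r])
    bounds = [(None, None)] * B + [(0, None)] * T  # theta free, slacks nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]
```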

6.2.2 Illustration

For illustration purposes, let us consider the 4-state MDP problem described
in Figure 6.2. The agent is initially located at state s^(0) and the actions
the agent is allowed to take are moving to the left or right state. If the
left movement action is chosen, the agent always receives a small positive
reward +0.1 at s^(L). On the other hand, if the right movement action is
chosen, the agent receives negative reward −1 with probability 0.9999 at
s^(R1) or it receives a very large positive reward +20,000 with probability
0.0001 at s^(R2). The mean and median rewards for left movement are both +0.1,
while the mean and median rewards for right movement are +1.0001 and −1,
respectively.

FIGURE 6.2: Illustrative MDP problem.

If Q(s^(0), "Left") and Q(s^(0), "Right") are approximated by the
least-squares method, it returns the mean rewards, i.e., +0.1 and +1.0001,
respectively. Thus, the least-squares method prefers right movement, which is
a "gambling" policy: negative reward −1 is almost always obtained at s^(R1),
but it is possible to obtain the very high reward +20,000 with a very small
probability at s^(R2). On the other hand, if Q(s^(0), "Left") and
Q(s^(0), "Right") are approximated by the least absolute method, it returns
the median rewards, i.e., +0.1 and −1, respectively. Thus, the least absolute
method prefers left movement, which is a stable policy under which the agent
can always receive the small positive reward +0.1 at s^(L).

If all the rewards in Figure 6.2 are negated, the value functions are also
negated and a different interpretation can be obtained: the least-squares
method is afraid of the risk of receiving the very large negative reward
−20,000 at s^(R2) with a very low probability, and consequently it ends up
with a very conservative policy under which the agent always receives negative
reward −0.1 at s^(L). On the other hand, the least absolute method tries to
receive positive reward +1 at s^(R1) without being afraid of visiting s^(R2)
too much.

As illustrated above, the least absolute method tends to provide qualitatively
different solutions from the least-squares method.

6.2.3 Properties

Here, properties of the least absolute method are investigated when the model
Q̂(s, a) is correctly specified, i.e., there exists a parameter θ* such that
Q̂(s, a) = Q(s, a) for all s and a.

Under the correct model assumption, when the number of samples T tends to
infinity, the least absolute solution θ̂ would satisfy the following equation
(Koenker, 2005):

  θ̂^⊤ ψ(s, a) = M_{p(s′|s,a)}[r(s, a, s′)]  for all s and a,    (6.1)

where M_{p(s′|s,a)} denotes the conditional median over p(s′|s, a) given s and
a. ψ(s, a) is defined by

  ψ(s, a) = φ(s, a) − γ E_{p(s′|s,a)} E_{π(a′|s′)}[φ(s′, a′)],

where E_{p(s′|s,a)} denotes the conditional expectation of s′ over p(s′|s, a)
given s and a, and E_{π(a′|s′)} denotes the conditional expectation of a′ over
π(a′|s′) given s′.

From Eq. (6.1), we can obtain the following Bellman-like recursive expression:

  Q̂(s, a) = M_{p(s′|s,a)}[r(s, a, s′)] + γ E_{p(s′|s,a)} E_{π(a′|s′)}[Q̂(s′, a′)].    (6.2)

Note that in the case of the least-squares method, where

  θ̂^⊤ ψ(s, a) = E_{p(s′|s,a)}[r(s, a, s′)]

is satisfied in the limit under the correct model assumption, we have

  Q̂(s, a) = E_{p(s′|s,a)}[r(s, a, s′)] + γ E_{p(s′|s,a)} E_{π(a′|s′)}[Q̂(s′, a′)].    (6.3)

This is the ordinary Bellman equation, and thus Eq. (6.2) could be regarded as
an extension of the Bellman equation to the absolute loss.

From the ordinary Bellman equation (6.3), we can recover the original
definition of the state-action value function Q(s, a):

  Q^π(s, a) = E_{p^π(h)} [ Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}) | s_1 = s, a_1 = a ],

where E_{p^π(h)} denotes the expectation over trajectory
h = [s_1, a_1, ..., s_T, a_T, s_{T+1}] and "| s_1 = s, a_1 = a" means that the
initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a,
respectively. In contrast, from the absolute-loss Bellman equation (6.2), we
have

  Q′(s, a) = E_{p^π(h)} [ Σ_{t=1}^T γ^{t−1} M_{p(s_{t+1}|s_t,a_t)}[r(s_t, a_t, s_{t+1})] | s_1 = s, a_1 = a ].

This is the value function that the least absolute method is trying to
approximate, which is different from the ordinary value function. Since the
discounted sum of median rewards, not the expected rewards, is maximized, the
least absolute method is expected to be less sensitive to outliers than the
least-squares method.

FIGURE 6.3: Illustration of the acrobot. The goal is to swing up the end
effector by only controlling the second joint.

6.3 Numerical Examples

In this section, the behavior of LAPI is illustrated through experiments using
the acrobot shown in Figure 6.3. The acrobot is an under-actuated system and
consists of two links, two joints, and an end effector. The length of each
link is 0.3 [m], and the diameter of each joint is 0.15 [m]. The diameter of
the end effector is 0.10 [m], and the height of the horizontal bar is 1.2 [m].
The first joint connects the first link to the horizontal bar and is not
controllable. The second joint connects the first link to the second link and
is controllable. The end effector is attached to the tip of the second link.
The control command (action) we can choose is to apply positive torque +50
[N·m], no torque 0 [N·m], or negative torque −50 [N·m] to the second joint.
Note that the acrobot moves only within a plane orthogonal to the horizontal
bar.

The goal is to acquire a control policy such that the end effector is swung up
as high as possible. The state space consists of the angle θ_i [rad] and
angular velocity θ̇_i [rad/s] of the first and second joints (i = 1, 2).

The immediate reward is given according to the height y of the center of the
end effector as

  r(s, a, s′) = { 10                              if y > 1.75,
                { exp(−(y − 1.85)² / (2(0.2)²))   if 1.5 < y ≤ 1.75,
                { 0.001                           otherwise.

Note that 0.55 ≤ y ≤ 1.85 in the current setting.

Here, suppose that the length of the links is unknown. Thus, the height y
cannot be directly computed from state information. The height of the end
effector is supposed to be estimated from an image taken by a camera: the end
effector is detected in the image and then its vertical coordinate is
computed. Due to recognition error, the estimated height is highly noisy and
could contain outliers.

In each policy iteration step, 20 episodic training samples of length 150 are
gathered. The performance of the obtained policy is evaluated using 50
episodic test samples of length 300. Note that the test samples are not used
for learning policies; they are used only for evaluating learned policies. The
policies are updated in a soft-max manner:

  π(a|s) ←− exp(Q(s, a)/η) / Σ_{a′∈A} exp(Q(s, a′)/η),

where η = 10^{−l+1} with l being the iteration number. The discount factor is
set at γ = 1, i.e., no discount. As basis functions for value function
approximation, the Gaussian kernel with standard deviation π is used, where
the Gaussian centers are located at

  (θ_1, θ_2, θ̇_1, θ̇_2) ∈ {−π, −π/2, 0, π/2, π} × {−π, 0, π} × {−π, 0, π} × {−π, 0, π}.

The above 135 (= 5 × 3 × 3 × 3) Gaussian kernels are defined for each of the
three actions. Thus, 405 (= 135 × 3) kernels are used in total.
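A minimal sketch of this soft-max policy update might look as follows in
Python; the function name and the way Q-values are passed in are illustrative
assumptions.

```python
import numpy as np

def softmax_policy(q_values, iteration):
    """Soft-max action-selection probabilities pi(a|s) from Q-values.

    q_values  : (num_actions,) array of Q(s, a) for the current state
    iteration : policy iteration counter l, giving temperature eta = 10^(-l+1)
    """
    eta = 10.0 ** (-iteration + 1)
    z = q_values / eta
    z -= z.max()                 # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: with small eta (later iterations) the policy becomes nearly greedy.
# softmax_policy(np.array([1.0, 0.5, -0.2]), iteration=3)
```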

Let us consider two noise environments: one is the case where no noise is
added to the rewards and the other is the case where Laplacian noise with mean
zero and standard deviation 2 is added to the rewards with probability 0.1.
Note that the tail of the Laplacian density is heavier than that of the
Gaussian density (see Figure 6.4), implying that a small number of outliers
tend to be included in the Laplacian noise environment. An example of the
noisy training samples is shown in Figure 6.5. For each noise environment, the
experiment is repeated 50 times with different random seeds and the averages
of the sum of rewards obtained by LAPI and LSPI are summarized in Figure 6.6.
The best method in terms of the mean value and comparable methods according to
the t-test (Henkel, 1976) at the significance level 5% are marked in the
figure.

In the noiseless case (see Figure 6.6(a)), both LAPI and LSPI improve the
performance over iterations in a comparable way. On the other hand, in the
noisy case (see Figure 6.6(b)), the performance of LSPI is not improved much
due to outliers, while LAPI still produces a good control policy.

FIGURE 6.4: Probability density functions of Gaussian and Laplacian
distributions.

FIGURE 6.5: Example of training samples with Laplacian noise. The horizontal
axis is the height of the end effector. The solid line denotes the noiseless
immediate reward, and the scattered points denote noisy training samples.

FIGURE 6.6: Average and standard deviation of the sum of rewards over 50 runs
for the acrobot swinging-up simulation: (a) no noise, (b) Laplacian noise. The
best method in terms of the mean value and comparable methods according to the
t-test at the significance level 5% are indicated.

Figure 6.7 and Figure 6.8 depict motion examples of the acrobot learned by
LSPI and LAPI, respectively, in the Laplacian-noise environment. When LSPI is
used (Figure 6.7), the second joint is swung hard in order to lift the end
effector. However, the end effector tends to stay below the horizontal bar,
and therefore only a small amount of reward can be obtained by LSPI. This
would be due to the existence of outliers. On the other hand, when LAPI is
used (Figure 6.8), the end effector goes beyond the bar, and therefore a large
amount of reward can be obtained even in the presence of outliers.

FIGURE 6.7: A motion example of the acrobot learned by LSPI in the
Laplacian-noise environment (from left to right and top to bottom).

FIGURE 6.8: A motion example of the acrobot learned by LAPI in the
Laplacian-noise environment (from left to right and top to bottom).

6.4 Possible Extensions

In this section, possible variations of LAPI are considered.

6.4.1 Huber Loss

Use of the Huber loss corresponds to making a compromise between the squared
and absolute loss functions (Huber, 1981):

  argmin_θ Σ_{t=1}^T ρ^HB_κ( θ^⊤ ψ̂(s_t, a_t) − r_t ),

where κ (≥ 0) is a threshold parameter and ρ^HB_κ is the Huber loss defined as
follows (see Figure 6.9):

  ρ^HB_κ(x) = { x²/2           if |x| ≤ κ,
              { κ|x| − κ²/2    if |x| > κ.

The Huber loss converges to the absolute loss as κ tends to zero, and it
converges to the squared loss as κ tends to infinity.

FIGURE 6.9: The Huber loss function (with κ = 1), the pinball loss function
(with τ = 0.3), and the deadzone-linear loss function (with ǫ = 1).

The Huber loss function is rather intricate, but the solution can be obtained
by solving the following convex quadratic program (Mangasarian & Musicant,
2000):

  min_{θ, {b_t}_{t=1}^T, {c_t}_{t=1}^T}  (1/2) Σ_{t=1}^T b_t² + κ Σ_{t=1}^T c_t
  subject to  −c_t ≤ θ^⊤ ψ̂(s_t, a_t) − r_t − b_t ≤ c_t,  t = 1, ..., T.

Another way to obtain the solution is to use a gradient descent method, where
the parameter θ is updated as follows until convergence:

  θ ← θ − ε Σ_{t=1}^T ∆ρ^HB_κ( θ^⊤ ψ̂(s_t, a_t) − r_t ) ψ̂(s_t, a_t).

ε (> 0) is the learning rate and ∆ρ^HB_κ is the derivative of ρ^HB_κ given by

  ∆ρ^HB_κ(x) = { x     if |x| ≤ κ,
               { κ     if x > κ,
               { −κ    if x < −κ.

In practice, the following stochastic gradient method (Amari, 1967) would be
more convenient. For a randomly chosen index t ∈ {1, ..., T} in each
iteration, repeat the following update until convergence:

  θ ← θ − ε ∆ρ^HB_κ( θ^⊤ ψ̂(s_t, a_t) − r_t ) ψ̂(s_t, a_t).

The plain/stochastic gradient methods also come in handy when approximating
the least absolute solution, since the Huber loss function with small κ can be
regarded as a smooth approximation to the absolute loss.
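A sketch of this stochastic gradient update in Python could look like the
following; the step size, the fixed number of updates in place of a
convergence check, and the data layout are illustrative assumptions.

```python
import numpy as np

def huber_grad(x, kappa):
    # Derivative of the Huber loss: x inside [-kappa, kappa], clipped to +/-kappa outside.
    return np.clip(x, -kappa, kappa)

def fit_huber_sgd(Psi, r, kappa=1.0, lr=1e-2, n_updates=10000, seed=0):
    """Stochastic gradient fit of theta under the Huber loss.

    Psi : (T, B) matrix of features psi_hat(s_t, a_t)
    r   : (T,)  observed immediate rewards
    """
    rng = np.random.default_rng(seed)
    T, B = Psi.shape
    theta = np.zeros(B)
    for _ in range(n_updates):
        t = rng.integers(T)                  # randomly chosen index
        residual = Psi[t] @ theta - r[t]     # theta^T psi_t - r_t
        theta -= lr * huber_grad(residual, kappa) * Psi[t]
    return theta
```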

6.4.2 Pinball Loss

The absolute loss induces the median, which corresponds to the 50-percentile
point. A similar discussion is also possible for an arbitrary percentile 100τ
(0 ≤ τ ≤ 1) based on the pinball loss (Koenker, 2005):

  min_θ Σ_{t=1}^T ρ^PB_τ( θ^⊤ ψ̂(s_t, a_t) − r_t ),

where ρ^PB_τ(x) is the pinball loss defined by

  ρ^PB_τ(x) = { 2τx          if x ≥ 0,
              { 2(τ − 1)x    if x < 0.

The profile of the pinball loss is depicted in Figure 6.9. When τ = 0.5, the
pinball loss is reduced to the absolute loss.

The solution can be obtained by solving the following linear program:

  min_{θ, {b_t}_{t=1}^T}  Σ_{t=1}^T b_t
  subject to  b_t / (2(τ − 1)) ≤ θ^⊤ ψ̂(s_t, a_t) − r_t ≤ b_t / (2τ),  t = 1, ..., T.
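The pinball loss itself is simple to evaluate; a small helper like the one
below (an illustrative sketch, not from the text) makes the connection to the
absolute loss at τ = 0.5 easy to check numerically.

```python
import numpy as np

def pinball_loss(x, tau):
    # 2*tau*x for nonnegative residuals, 2*(tau - 1)*x for negative ones.
    return np.where(x >= 0, 2 * tau * x, 2 * (tau - 1) * x)

# At tau = 0.5 the pinball loss coincides with the absolute loss:
x = np.linspace(-3, 3, 7)
assert np.allclose(pinball_loss(x, 0.5), np.abs(x))
```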

6.4.3 Deadzone-Linear Loss

Another variant of the absolute loss is the deadzone-linear loss (see Figure
6.9):

  min_θ Σ_{t=1}^T ρ^DL_ǫ( θ^⊤ ψ̂(s_t, a_t) − r_t ),

where ρ^DL_ǫ(x) is the deadzone-linear loss defined by

  ρ^DL_ǫ(x) = { 0          if |x| ≤ ǫ,
              { |x| − ǫ    if |x| > ǫ.

That is, if the magnitude of the error is less than ǫ, no error is assessed.
This loss is also called the ǫ-insensitive loss and is used in support vector
regression (Vapnik, 1998).

When ǫ = 0, the deadzone-linear loss is reduced to the absolute loss. Thus,
the deadzone-linear loss and the absolute loss are related to each other.
However, the effect of the deadzone-linear loss is completely opposite to the
absolute loss when ǫ > 0. The influence of "good" samples (with small error)
is deemphasized in the deadzone-linear loss, while the absolute loss tends to
suppress the influence of "bad" samples (with large error) compared with the
squared loss.

The solution can be obtained by solving the following linear program (Boyd &
Vandenberghe, 2004):

  min_{θ, {b_t}_{t=1}^T}  Σ_{t=1}^T b_t
  subject to  −b_t − ǫ ≤ θ^⊤ ψ̂(s_t, a_t) − r_t ≤ b_t + ǫ,
              b_t ≥ 0,  t = 1, ..., T.

6.4.4 Chebyshev Approximation

The Chebyshev approximation minimizes the error for the "worst" sample:

  min_θ max_{t=1,...,T} | θ^⊤ ψ̂(s_t, a_t) − r_t |.

This is also called the minimax approximation.

The solution can be obtained by solving the following linear program (Boyd &
Vandenberghe, 2004):

  min_{θ, b}  b
  subject to  −b ≤ θ^⊤ ψ̂(s_t, a_t) − r_t ≤ b,  t = 1, ..., T.

FIGURE 6.10: The conditional value-at-risk (CVaR).

6.4.5 Conditional Value-At-Risk

In the area of finance, the conditional value-at-risk (CVaR) is a popular risk
measure (Rockafellar & Uryasev, 2002). The CVaR corresponds to the mean of the
error for a set of "bad" samples (see Figure 6.10).

More specifically, let us consider the distribution of the absolute error over
all training samples {(s_t, a_t, r_t)}_{t=1}^T:

  Φ(α|θ) = P( (s_t, a_t, r_t) : | θ^⊤ ψ̂(s_t, a_t) − r_t | ≤ α ).

For β ∈ [0, 1), let α_β(θ) be the 100β percentile of the absolute error
distribution:

  α_β(θ) = min{ α | Φ(α|θ) ≥ β }.

Thus, only the fraction (1 − β) of the absolute error
| θ^⊤ ψ̂(s_t, a_t) − r_t | exceeds the threshold α_β(θ). α_β(θ) is also
referred to as the value-at-risk (VaR).

Let us consider the β-tail distribution of the absolute error:

  Φ_β(α|θ) = { 0                           if α < α_β(θ),
             { (Φ(α|θ) − β) / (1 − β)      if α ≥ α_β(θ).

Let φ_β(θ) be the mean of the β-tail distribution of the absolute
temporal-difference (TD) error:

  φ_β(θ) = E_{Φ_β}[ | θ^⊤ ψ̂(s_t, a_t) − r_t | ],

where E_{Φ_β} denotes the expectation over the distribution Φ_β. φ_β(θ) is
called the CVaR. By definition, the CVaR of the absolute error is reduced to
the mean absolute error if β = 0 and it converges to the worst absolute error
as β tends to 1. Thus, the CVaR smoothly bridges the least absolute and
Chebyshev approximation methods. CVaR is also referred to as the expected
shortfall.

The CVaR minimization problem in the current context is formulated as

  min_θ E_{Φ_β}[ | θ^⊤ ψ̂(s_t, a_t) − r_t | ].

This optimization problem looks complicated, but the solution θ̂^CV can be
obtained by solving the following linear program (Rockafellar & Uryasev,
2002):

  min_{θ, {b_t}_{t=1}^T, {c_t}_{t=1}^T, α}  T(1 − β)α + Σ_{t=1}^T c_t
  subject to  −b_t ≤ θ^⊤ ψ̂(s_t, a_t) − r_t ≤ b_t,
              c_t ≥ b_t − α,
              c_t ≥ 0,  t = 1, ..., T.

Note that if the definition of the absolute error is slightly changed, the
CVaR minimization method amounts to minimizing the deadzone-linear loss
(Takeda, 2007).

6.5 Remarks

LSPI can be regarded as regression of immediate rewards under the squared
loss. In this chapter, the absolute loss was used for regression, which
contributes to enhancing robustness and reliability. The least absolute method
is formulated as a linear program and it can be solved efficiently by standard
optimization software.

LSPI maximizes the state-action value function Q(s, a), which is the
expectation of returns. Another way to address robustness and reliability is
to maximize other quantities such as the median or a quantile of returns.
Although Bellman-like simple recursive expressions are not available for
quantiles of rewards, a Bellman-like recursive equation holds for the
distribution of the discounted sum of rewards (Morimura et al., 2010a;
Morimura et al., 2010b). Developing robust reinforcement learning algorithms
along this line of research would be a promising future direction.

Part III

Model-Free Policy Search

In the policy iteration approach explained in Part II, the value function is
first estimated and then the policy is determined based on the learned value
function. Policy iteration was demonstrated to work well in many real-world
applications, especially in problems with discrete states and actions
(Tesauro, 1994; Williams & Young, 2007; Abe et al., 2010). Although policy
iteration can also handle continuous states by function approximation
(Lagoudakis & Parr, 2003), continuous actions are hard to deal with due to the
difficulty of finding a maximizer of the value function with respect to
actions. Moreover, since policies are indirectly determined via value function
approximation, misspecification of value function models can lead to an
inappropriate policy even in very simple problems (Weaver & Baxter, 1999;
Baxter et al., 2001). Another limitation of policy iteration, especially in
physical control tasks, is that control policies can vary drastically in each
iteration. This causes severe instability in the physical system and thus is
not favorable in practice.

Policy search is an alternative approach to reinforcement learning that can
overcome the limitations of policy iteration (Williams, 1992; Dayan & Hinton,
1997; Kakade, 2002). In the policy search approach, policies are directly
learned so that the return (i.e., the discounted sum of future rewards),

  Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

is maximized.

In Part III, we focus on the framework of policy search. First, direct policy
search methods are introduced, which try to find the policy that achieves the
maximum return via gradient ascent (Chapter 7) or expectation-maximization
(Chapter 8). A potential weakness of the direct policy search approach is its
instability due to the randomness of stochastic policies. To overcome the
instability problem, an alternative approach called policy-prior search is
introduced in Chapter 9.

Chapter 7

Direct Policy Search by Gradient Ascent

The direct policy search approach tries to find the policy that maximizes the
expected return. In this chapter, we introduce gradient-based algorithms for
direct policy search. After the problem formulation in Section 7.1, the
gradient ascent algorithm is introduced in Section 7.2. Then, in Section 7.3,
its extension using natural gradients is described. In Section 7.4, an
application to computer graphics is shown. Finally, this chapter is concluded
in Section 7.5.

7.1 Formulation

In this section, the problem of direct policy search is mathematically
formulated.

Let us consider a Markov decision process specified by

  (S, A, p(s′|s, a), p(s), r, γ),

where S is a set of continuous states, A is a set of continuous actions,
p(s′|s, a) is the transition probability density from current state s to next
state s′ when action a is taken, p(s) is the probability density of initial
states, r(s, a, s′) is an immediate reward for transition from s to s′ by
taking action a, and 0 < γ ≤ 1 is the discount factor for future rewards.

Let π(a|s, θ) be a stochastic policy parameterized by θ, which represents the
conditional probability density of taking action a in state s. Let h be a
trajectory of length T:

  h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

The return (i.e., the discounted sum of future rewards) along h is defined as

  R(h) = Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

and the expected return for policy parameter θ is defined as

  J(θ) = E_{p(h|θ)}[R(h)] = ∫ p(h|θ) R(h) dh,

where E_{p(h|θ)} is the expectation over trajectory h drawn from p(h|θ), and
p(h|θ) denotes the probability density of observing trajectory h under policy
parameter θ:

  p(h|θ) = p(s_1) ∏_{t=1}^T p(s_{t+1}|s_t, a_t) π(a_t|s_t, θ).

The goal of direct policy search is to find the optimal policy parameter θ*
that maximizes the expected return J(θ):

  θ* = argmax_θ J(θ).

However, directly maximizing J(θ) is hard since J(θ) usually involves high
non-linearity with respect to θ. Below, a gradient-based algorithm is
introduced to find a local maximizer of J(θ). An alternative approach based on
the expectation-maximization algorithm is provided in Chapter 8.

FIGURE 7.1: Gradient ascent for direct policy search.
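Since J(θ) is defined as an expectation over trajectories, in practice it can
only be estimated by sampling; the tiny sketch below shows the Monte Carlo
estimate that the algorithms in this chapter build on. The reward containers
and function names are my own illustrative choices.

```python
import numpy as np

def trajectory_return(rewards, gamma):
    # R(h) = sum_t gamma^(t-1) r(s_t, a_t, s_{t+1})
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def expected_return(reward_sequences, gamma):
    # Monte Carlo estimate of J(theta) from N sampled trajectories.
    return float(np.mean([trajectory_return(r, gamma) for r in reward_sequences]))
```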

7.2 Gradient Approach

In this section, a gradient ascent method for direct policy search is
introduced (Figure 7.1).

7.2.1 Gradient Ascent

The simplest approach to finding a local maximizer of the expected return is
gradient ascent (Williams, 1992):

  θ ←− θ + ε ∇_θ J(θ),

where ε is a small positive constant and ∇_θ J(θ) denotes the gradient of the
expected return J(θ) with respect to policy parameter θ. The gradient
∇_θ J(θ) is given by

  ∇_θ J(θ) = ∫ ∇_θ p(h|θ) R(h) dh
           = ∫ p(h|θ) ∇_θ log p(h|θ) R(h) dh
           = ∫ p(h|θ) Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) dh,

where the so-called "log trick" is used:

  ∇_θ p(h|θ) = p(h|θ) ∇_θ log p(h|θ).

This expression means that the gradient ∇_θ J(θ) is given as the expectation
over p(h|θ):

  ∇_θ J(θ) = E_{p(h|θ)} [ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) ].

Since p(h|θ) is unknown, the expectation is approximated by the empirical
average as

  ∇̂_θ J(θ) = (1/N) Σ_{n=1}^N Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ) R(h_n),

where

  h_n = [s_{1,n}, a_{1,n}, ..., s_{T,n}, a_{T,n}, s_{T+1,n}]

is an independent sample from p(h|θ). This algorithm is called REINFORCE
(Williams, 1992), which is an acronym for "REward Increment = Nonnegative
Factor × Offset Reinforcement × Characteristic Eligibility."

A popular choice for the policy model π(a|s, θ) is the Gaussian policy model,
where policy parameter θ consists of mean vector µ and standard deviation σ:

  π(a|s, µ, σ) = (1 / (σ√(2π))) exp( −(a − µ^⊤φ(s))² / (2σ²) ).    (7.1)

Here, φ(s) denotes the basis function. For this Gaussian policy model, the
policy gradients are explicitly computed as

  ∇_µ log π(a|s, µ, σ) = ((a − µ^⊤φ(s)) / σ²) φ(s),
  ∇_σ log π(a|s, µ, σ) = ((a − µ^⊤φ(s))² − σ²) / σ³.
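Putting the empirical gradient and these Gaussian policy gradients together, a
REINFORCE-style estimator might be sketched as below; the trajectory container
and the absence of a baseline are simplifying assumptions made for this
example.

```python
import numpy as np

def gaussian_log_policy_grads(a, s_feat, mu, sigma):
    """Gradients of log pi(a|s, mu, sigma) for the Gaussian policy (7.1)."""
    diff = a - mu @ s_feat
    grad_mu = diff / sigma ** 2 * s_feat
    grad_sigma = (diff ** 2 - sigma ** 2) / sigma ** 3
    return grad_mu, grad_sigma

def reinforce_gradient(trajectories, mu, sigma, gamma):
    """Empirical policy gradient: (1/N) sum_n sum_t grad log pi(a|s) R(h_n).

    trajectories : list of lists of (s_feat, a, r) tuples, one list per episode
    """
    N = len(trajectories)
    grad_mu = np.zeros_like(mu)
    grad_sigma = 0.0
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        R = sum(gamma ** t * r for t, r in enumerate(rewards))  # return R(h)
        for s_feat, a, _ in traj:
            g_mu, g_sigma = gaussian_log_policy_grads(a, s_feat, mu, sigma)
            grad_mu += g_mu * R / N
            grad_sigma += g_sigma * R / N
    return grad_mu, grad_sigma

# One gradient-ascent step:  mu += eps * grad_mu;  sigma += eps * grad_sigma.
```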

As shown above, the gradient ascent algorithm for direct policy search is very
simple to implement. Furthermore, the property that policy parameters are
gradually updated in the gradient ascent algorithm is preferable when
reinforcement learning is applied to the control of a vulnerable physical
system such as a humanoid robot, because a sudden policy change can damage the
system. However, the variance of policy gradients tends to be large in
practice (Peters & Schaal, 2006; Sehnke et al., 2010), which can result in
slow and unstable convergence.

7.2.2 Baseline Subtraction for Variance Reduction

Baseline subtraction is a useful technique to reduce the variance of gradient
estimators. Technically, baseline subtraction can be viewed as the method of
control variates (Fishman, 1996), which is an effective approach to reducing
the variance of Monte Carlo integral estimators.

The basic idea of baseline subtraction is that an unbiased estimator η̂ is
still unbiased if a zero-mean random variable m multiplied by a constant ξ is
subtracted:

  η̂_ξ = η̂ − ξ m.

The constant ξ, which is called a baseline, may be chosen so that the variance
of η̂_ξ is minimized. By baseline subtraction, a more stable estimator than
the original η̂ can be obtained.

A policy gradient estimator with baseline ξ subtracted is given by

  ∇̂_θ J_ξ(θ) = ∇̂_θ J(θ) − ξ (1/N) Σ_{n=1}^N Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ)
             = (1/N) Σ_{n=1}^N (R(h_n) − ξ) Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ),

where the expectation of ∇_θ log π(a|s, θ) is zero:

  E[∇_θ log π(a|s, θ)] = ∫ π(a|s, θ) ∇_θ log π(a|s, θ) da
                       = ∫ ∇_θ π(a|s, θ) da
                       = ∇_θ ∫ π(a|s, θ) da = ∇_θ 1 = 0.

The optimal baseline is defined as the minimizer of the variance of the
gradient estimator with respect to the baseline (Greensmith et al., 2004;
Weaver & Tao, 2001):

  ξ* = argmin_ξ Var_{p(h|θ)}[ ∇̂_θ J_ξ(θ) ],

where Var_{p(h|θ)} denotes the trace of the covariance matrix:

  Var_{p(h|θ)}[ζ] = tr( E_{p(h|θ)}[ (ζ − E_{p(h|θ)}[ζ])(ζ − E_{p(h|θ)}[ζ])^⊤ ] )
                  = E_{p(h|θ)}[ ‖ζ − E_{p(h|θ)}[ζ]‖² ].

It was shown in Peters and Schaal (2006) that the optimal baseline ξ* is given
as

  ξ* = E_{p(h|θ)}[ R(h) ‖Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ)‖² ] / E_{p(h|θ)}[ ‖Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ)‖² ].

In practice, the expectations are approximated by sample averages.
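The sample-average version of this optimal baseline is straightforward to
compute once the per-trajectory score sums are available; the following
fragment (with the same illustrative trajectory layout as before) is one way
to do it.

```python
import numpy as np

def optimal_baseline(score_sums, returns):
    """Sample estimate of the optimal baseline xi*.

    score_sums : (N, D) array, row n is sum_t grad log pi(a_t|s_t) for trajectory n
    returns    : (N,)  array of returns R(h_n)
    """
    sq_norms = np.sum(score_sums ** 2, axis=1)   # ||sum_t grad log pi||^2 per trajectory
    return np.dot(returns, sq_norms) / np.sum(sq_norms)

# The baseline-subtracted gradient estimate is then
#   (1/N) sum_n (R(h_n) - xi*) * score_sums[n].
```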

7.2.3 Variance Analysis of Gradient Estimators

Here, the variance of gradient estimators is theoretically investigated for
the Gaussian policy model (7.1) with φ(s) = s. See Zhao et al. (2012) for
technical details.

In the theoretical analysis, subsets of the following assumptions are
considered:

Assumption (A): r(s, a, s′) ∈ [−β, β] for β > 0.
Assumption (B): r(s, a, s′) ∈ [α, β] for 0 < α < β.
Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^T and
{d_t}_{t=1}^T such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at
least 1 − δ/(2N), respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A). Let

  ζ(T) = C_T α² − D_T β²/(2π),

where

  C_T = Σ_{t=1}^T c_t²  and  D_T = Σ_{t=1}^T d_t².

First, the variance of gradient estimators is analyzed.

First,thevarianceofgradientestimatorsisanalyzed.

Theorem7.1UnderAssumptions(A)and(C),thefollowingupperbound

holdswithprobabilityatleast1−δ/2:

h

i

D

Var

b

Tβ2(1−γT)2

p(h|θ)∇µJ(µ,σ)≤

.

Nσ2(1−γ)2

UnderAssumption(A),itholdsthat

h

i

2Tβ2(1−γT)2

Var

b

p(h|θ)∇σJ(µ,σ)≤

.

Nσ2(1−γ)2

The above upper bounds are monotone increasing with respect to the trajectory
length T.

For the variance of ∇̂_µ J(µ, σ), the following lower bound holds (its upper
bound has not been derived yet):

Theorem 7.2 Under Assumptions (B) and (C), the following lower bound holds
with probability at least 1 − δ:

  Var_{p(h|θ)}[ ∇̂_µ J(µ, σ) ] ≥ ((1 − γ^T)² / (N σ² (1 − γ)²)) ζ(T).

This lower bound is non-trivial if ζ(T) > 0, which can be fulfilled, e.g., if
α and β satisfy 2π C_T α² > D_T β².

Next, the contribution of the optimal baseline is investigated. It was shown
(Greensmith et al., 2004; Weaver & Tao, 2001) that the excess variance for an
arbitrary baseline ξ is given by

  Var_{p(h|θ)}[ ∇̂_θ J_ξ(θ) ] − Var_{p(h|θ)}[ ∇̂_θ J_{ξ*}(θ) ]
    = ((ξ − ξ*)² / N) E_{p(h|θ)}[ ‖Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ)‖² ].

Based on this expression, the following theorem can be obtained.

Theorem 7.3 Under Assumptions (B) and (C), the following bounds hold with
probability at least 1 − δ:

  (C_T α² (1 − γ^T)²) / (N σ² (1 − γ)²)
    ≤ Var_{p(h|θ)}[ ∇̂_µ J(µ, σ) ] − Var_{p(h|θ)}[ ∇̂_µ J_{ξ*}(µ, σ) ]
    ≤ (β² (1 − γ^T)² D_T) / (N σ² (1 − γ)²).

This theorem shows that the lower bound of the excess variance is positive and
monotone increasing with respect to the trajectory length T. This means that
the variance is always reduced by optimal baseline subtraction and the amount
of variance reduction is monotone increasing with respect to the trajectory
length T. Note that the upper bound is also monotone increasing with respect
to the trajectory length T.

Finally, the variance of gradient estimators with the optimal baseline is
investigated:

Theorem 7.4 Under Assumptions (B) and (C), it holds that

  Var_{p(h|θ)}[ ∇̂_µ J_{ξ*}(µ, σ) ] ≤ ((1 − γ^T)² / (N σ² (1 − γ)²)) (β² D_T − α² C_T),

where the inequality holds with probability at least 1 − δ.

FIGURE 7.2: Ordinary and natural gradients: (a) ordinary gradients, (b)
natural gradients. Ordinary gradients treat all dimensions equally, while
natural gradients take the Riemannian structure into account.

This theorem shows that the upper bound of the variance of the gradient
estimators with the optimal baseline is still monotone increasing with respect
to the trajectory length T. Thus, when the trajectory length T is large, the
variance of the gradient estimators can still be large even with the optimal
baseline.

In Chapter 9, another gradient approach will be introduced for overcoming this
large-variance problem.

7.3 Natural Gradient Approach

The gradient-based policy parameter update used in the REINFORCE algorithm is
performed under the Euclidean metric. In this section, we show another useful
choice of the metric for gradient-based policy search.

7.3.1 Natural Gradient Ascent

Use of the Euclidean metric implies that all dimensions of the policy
parameter vector θ are treated equally (Figure 7.2(a)). However, since a
policy parameter θ specifies a conditional probability density π(a|s, θ), use
of the Euclidean metric in the parameter space does not necessarily mean all
dimensions are treated equally in the space of conditional probability
densities. Thus, a small change in the policy parameter θ can cause a big
change in the conditional probability density π(a|s, θ) (Kakade, 2002).

Figure 7.3 describes the Gaussian densities with mean µ = −5, 0, 5 and
standard deviation σ = 1, 2. This shows that if the standard deviation is
doubled, the difference in mean should also be doubled to maintain the same
overlapping level. Thus, it is "natural" to compute the distance between two
Gaussian densities parameterized with (µ, σ) and (µ + ∆µ, σ) not by ∆µ, but by
∆µ/σ.

FIGURE 7.3: Gaussian densities with different means and standard deviations.
If the standard deviation is doubled (from the solid lines to dashed lines),
the difference in mean should also be doubled to maintain the same overlapping
level.

Gradients that treat all dimensions equally in the space of probability
densities are called natural gradients (Amari, 1998; Amari & Nagaoka, 2000).
The ordinary gradient is defined as the steepest ascent direction under the
Euclidean metric (Figure 7.2(a)):

  ∇_θ J(θ) = argmax_{∆θ} J(θ + ∆θ)  subject to  ∆θ^⊤ ∆θ ≤ ǫ,

where ǫ is a small positive number. On the other hand, the natural gradient is
defined as the steepest ascent direction under the Riemannian metric (Figure
7.2(b)):

  ∇̃_θ J(θ) = argmax_{∆θ} J(θ + ∆θ)  subject to  ∆θ^⊤ R_θ ∆θ ≤ ǫ,

where R_θ is the Riemannian metric, which is a positive definite matrix. The
solution of the above optimization problem is given by

  ∇̃_θ J(θ) = R_θ^{−1} ∇_θ J(θ).

Thus, the ordinary gradient ∇_θ J(θ) is modified by the inverse Riemannian
metric R_θ^{−1} in the natural gradient.

A standard distance metric in the space of probability densities is the
Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence
from density p to density q is defined as

  KL(p‖q) = ∫ p(θ) log( p(θ)/q(θ) ) dθ.

KL(p‖q) is always non-negative and zero if and only if p = q. Thus, smaller
KL(p‖q) means that p and q are "closer." However, note that the KL divergence
is not symmetric, i.e., KL(p‖q) ≠ KL(q‖p) in general.

For small ∆θ, the KL divergence from p(h|θ) to p(h|θ + ∆θ) can be approximated
by

  ∆θ^⊤ F_θ ∆θ,

where F_θ is the Fisher information matrix:

  F_θ = E_{p(h|θ)}[ ∇_θ log p(h|θ) ∇_θ log p(h|θ)^⊤ ].

Thus, F_θ is the Riemannian metric induced by the KL divergence.

Then the update rule of the policy parameter θ based on the natural gradient
is given by

  θ ←− θ + ε F̂_θ^{−1} ∇_θ J(θ),

where ε is a small positive constant and F̂_θ is a sample approximation of
F_θ:

  F̂_θ = (1/N) Σ_{n=1}^N ∇_θ log p(h_n|θ) ∇_θ log p(h_n|θ)^⊤.
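A small sketch of this update, estimating the Fisher matrix from the
per-trajectory score vectors, is given below; the regularization term added
before inversion is my own numerical-stability assumption and not part of the
original update rule.

```python
import numpy as np

def natural_gradient_step(theta, score_sums, vanilla_grad, lr=0.1, reg=1e-6):
    """One natural-gradient update  theta <- theta + lr * F_hat^{-1} grad J.

    score_sums   : (N, D) array, row n is grad_theta log p(h_n | theta)
    vanilla_grad : (D,)   ordinary policy-gradient estimate
    """
    N, D = score_sums.shape
    F_hat = score_sums.T @ score_sums / N    # sample Fisher information matrix
    F_hat += reg * np.eye(D)                 # regularize before inversion (assumption)
    nat_grad = np.linalg.solve(F_hat, vanilla_grad)
    return theta + lr * nat_grad
```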

Under mild regularity conditions, the Fisher information matrix F_θ can be
expressed as

  F_θ = −E_{p(h|θ)}[ ∇²_θ log p(h|θ) ],

where ∇²_θ log p(h|θ) denotes the Hessian matrix of log p(h|θ). That is, the
(b, b′)-th element of ∇²_θ log p(h|θ) is given by
∂² log p(h|θ) / (∂θ_b ∂θ_{b′}). This means that the natural gradient takes the
curvature into account, by which the convergence behavior at flat plateaus and
steep ridges tends to be improved. On the other hand, a potential weakness of
natural gradients is that computation of the inverse Riemannian metric tends
to be numerically unstable (Deisenroth et al., 2013).

7.3.2 Illustration

Let us illustrate the difference between ordinary and natural gradients
numerically.

Consider the one-dimensional real-valued state space S = R and the
one-dimensional real-valued action space A = R. The transition dynamics is
linear and deterministic as s′ = s + a, and the reward function is quadratic
as r = 0.5s² − 0.05a. The discount factor is set at γ = 0.95. The Gaussian
policy model,

  π(a|s, µ, σ) = (1 / (σ√(2π))) exp( −(a − µs)² / (2σ²) ),

is employed, which contains the mean parameter µ and the standard deviation
parameter σ. The optimal policy parameters in this setup are given by
(µ*, σ*) ≈ (−0.912, 0).

FIGURE 7.4: Numerical illustrations of ordinary gradients (a) and natural
gradients (b).

Figure 7.4 shows a numerical comparison of ordinary and natural gradients for
the Gaussian policy. The contour lines and the arrows indicate the expected
return surface and the gradient directions, respectively. The graphs show that
the ordinary gradients tend to strongly reduce the standard deviation
parameter σ without really updating the mean parameter µ. This means that the
stochasticity of the policy is lost quickly and thus the agent becomes less
exploratory. Consequently, once σ gets close to zero, the solution is at a
flat plateau along the direction of µ and thus policy updates in µ are very
slow. On the other hand, the natural gradients reduce both the mean parameter
µ and the standard deviation parameter σ in a balanced way. As a result,
convergence gets much faster than with the ordinary gradient method.

7.4 Application in Computer Graphics: Artist Agent

Oriental ink painting, which is also called sumie, is one of the most
distinctive painting styles and has attracted artists around the world. Major
challenges in sumie simulation are to abstract complex scene information and
reproduce smooth and natural brush strokes. Reinforcement learning is useful
to automatically generate such smooth and natural strokes (Xie et al., 2013).
In this section, the REINFORCE algorithm explained in Section 7.2 is applied
to sumie agent training.

7.4.1 Sumie Painting

Among various techniques of non-photorealistic rendering (Gooch & Gooch,
2001), stroke-based painterly rendering synthesizes an image from a source
image in a desired painting style by placing discrete strokes (Hertzmann,
2003). Such an algorithm simulates the common practice of human painters who
create paintings with brush strokes.

Western painting styles such as water-color, pastel, and oil painting overlay
strokes onto multiple layers, while oriental ink painting uses a few
expressive strokes produced by soft brush tufts to convey significant
information about a target scene. The appearance of the stroke in oriental ink
painting is therefore determined by the shape of the object to paint, the path
and posture of the brush, and the distribution of pigments in the brush.

Drawing smooth and natural strokes in arbitrary shapes is challenging since an
optimal brush trajectory and the posture of a brush footprint are different
for each shape. Existing methods can efficiently map brush texture by
deformation onto a user-given trajectory line or the shape of a target stroke
(Hertzmann, 1998; Guo & Kunii, 2003). However, the geometrical process of
morphing the entire texture of a brush stroke into the target shape leads to
undesirable effects such as unnatural foldings and creased appearances at
corners or curves.

Here, a soft-tuft brush is treated as a reinforcement learning agent, and the
REINFORCE algorithm is used to automatically draw artistic strokes. More
specifically, given any closed contour that represents the shape of a desired
single stroke without overlap, the agent moves the brush on the canvas to fill
the given shape from a start point to an end point with stable poses along a
smooth continuous movement trajectory (see Figure 7.5).

In oriental ink painting, there are several different brush styles that
characterize the paintings. Below, two representative styles called the
upright brush style and the oblique brush style are considered (see Figure
7.6). In the upright brush style, the tip of the brush should be located on
the medial axis of the expected stroke shape, and the bottom of the brush
should be tangent to both sides of the boundary. On the other hand, in the
oblique brush style, the tip of the brush should touch one side of the
boundary and the bottom of the brush should be tangent to the other side of
the boundary. The choice of the upright brush style and the oblique brush
style is exclusive and a user is asked to choose one of the styles in advance.

7.4.2 Design of States, Actions, and Immediate Rewards

Here, specific design of states, actions, and immediate rewards tailored to
the sumie agent is described.

FIGURE 7.5: Illustration of the brush agent and its path. (a) Brush model: a
stroke is generated by moving the brush with the following 3 actions: Action 1
is regulating the direction of the brush movement, Action 2 is pushing
down/lifting up the brush, and Action 3 is rotating the brush handle. Only
Action 1 is determined by reinforcement learning, and Action 2 and Action 3
are determined based on Action 1. (b) Footprints: the top symbol illustrates
the brush agent, which consists of a tip Q and a circle with center C and
radius r; the others illustrate footprints of a real brush with different ink
quantities. (c) Basic stroke styles: there are 6 basic stroke styles: full
ink, dry ink, first-half hollow, hollow, middle hollow, and both-end hollow.
Small footprints on the top of each stroke show the interpolation order.

7.4.2.1 States

The global measurement (i.e., the pose configuration of a footprint under the
global Cartesian coordinate) and the local measurement (i.e., the pose and the
locomotion information of the brush agent relative to the surrounding
environment) are used as states. Here, only the local measurement is used to
calculate a reward and a policy, by which the agent can learn a drawing policy
that is generalizable to new shapes. Below, the local measurement is regarded
as states and the global measurement is dealt with only implicitly.


FIGURE 7.6: Upright brush style (left) and oblique brush style (right).

The local state-space design consists of two components: a current surrounding shape and an upcoming shape. More specifically, the state vector consists of the following six features:

s = (ω, φ, d, κ₁, κ₂, l)⊤.

Each feature is defined as follows (see Figure 7.7):

• ω ∈ (−π, π]: The angle of the velocity vector of the brush agent relative to the medial axis.

• φ ∈ (−π, π]: The heading direction of the brush agent relative to the medial axis.

• d ∈ [−2, 2]: The ratio of the offset distance δ from the center C of the brush agent to the nearest point P on the medial axis M over the radius r of the brush agent (|d| = δ/r). d takes a positive/negative value when the center of the brush agent is on the left-/right-hand side of the medial axis:

  – d takes the value 0 when the center of the brush agent is on the medial axis.

  – d takes a value in [−1, 1] when the brush agent is inside the boundaries.

  – The value of d is in [−2, −1) or in (1, 2] when the brush agent goes over the boundary on one side.


FIGURE 7.7: Illustration of the design of states. Left: The brush agent consists of a tip Q and a circle with center C and radius r. Right: The ratio d of the offset distance δ over the radius r. Footprint f_{t−1} is inside the drawing area, and the circle with center C_{t−1} and the tip Q_{t−1} touch the boundary on each side; in this case, δ_{t−1} ≤ r_{t−1} and d_{t−1} ∈ [0, 1]. On the other hand, f_t goes over the boundary, and then δ_t > r_t and d_t > 1. Note that d is restricted to be in [−2, 2], and P is the nearest point on the medial axis M to C.

Note that the center of the agent is restricted to lie within the shape. Therefore, the extreme values of d are ±2, attained when the center of the agent is on the boundary.

• κ₁, κ₂ ∈ (−1, 1): κ₁ provides the current surrounding information at the point P_t, whereas κ₂ provides the upcoming shape information at the point P_{t+1}:

\kappa_i = \frac{2}{\pi} \arctan\frac{0.05}{r'_i},

where r′_i is the radius of the curve. More specifically, the value takes 0/negative/positive when the shape is straight/left-curved/right-curved, and the larger its absolute value is, the tighter the curve is.

• l ∈ {0, 1}: A binary label that indicates whether the agent moves to a region covered by the previous footprints or not. l = 0 means that the agent moves to a region covered by a previous footprint. Otherwise, l = 1 means that it moves to an uncovered region.
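To make the state design concrete, the following minimal Python sketch computes the offset-ratio feature d and the curvature features κ₁ and κ₂ from quantities that are assumed to be available from the geometry routines (the signed offset distance, the brush radius, and the signed radius of curvature of the medial axis); the function and variable names are hypothetical.

```python
import numpy as np

def offset_ratio(delta, r):
    """Ratio d of the signed offset distance delta (center C to nearest
    medial-axis point P) over the brush radius r, clipped to [-2, 2]."""
    return float(np.clip(delta / r, -2.0, 2.0))

def curvature_feature(radius_of_curve):
    """Curvature feature kappa = (2/pi) * arctan(0.05 / r'), which is 0 for a
    straight segment and approaches +/-1 for very tight curves."""
    return (2.0 / np.pi) * np.arctan(0.05 / radius_of_curve)

# Example: a brush of radius 0.5 whose center is 0.3 to the left of the axis,
# currently on a gentle left curve and heading into a tighter one.
s = np.array([0.1,                      # omega: velocity angle vs. medial axis
              0.05,                     # phi: heading direction vs. medial axis
              offset_ratio(0.3, 0.5),   # d
              curvature_feature(-4.0),  # kappa_1 (current point P_t)
              curvature_feature(-1.0),  # kappa_2 (upcoming point P_{t+1})
              1.0])                     # l: moving into an uncovered region
```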

7.4.2.2 Actions

To generate elegant brush strokes, the brush agent should move inside the given boundaries properly. Here, the following actions are considered to control the brush (see Figure 7.5(a)):

• Action 1: Movement of the brush on the canvas paper.

• Action 2: Scaling up/down of the footprint.


• Action 3: Rotation of the heading direction of the brush.

Since properly covering the whole desired region is the most important factor for visual quality, the movement of the brush (Action 1) is regarded as the primary action. More specifically, Action 1 takes a value in (−π, π] that indicates the offset turning angle of the motion direction relative to the medial axis of the expected stroke shape. In practical applications, the agent should be able to deal with arbitrary strokes at various scales. To achieve stable performance at different scales, the velocity is adaptively set to r/3, where r is the radius of the current footprint.

Action 1 is determined by the Gaussian policy function trained by the REINFORCE algorithm, and Action 2 and Action 3 are determined as follows.

• Oblique brush stroke style: The tip of the agent is set to touch one side of the boundary, and the bottom of the agent is set to be tangent to the other side of the boundary.

• Upright brush stroke style: The tip of the agent is chosen to travel along the medial axis of the shape.

If it is not possible to satisfy the above constraints by adjusting Action 2 and Action 3, the new footprint simply keeps the same posture as the previous one.

7.4.2.3 Immediate Rewards

The immediate reward function measures the quality of the brush agent's movement after taking an action at each time step. The reward is designed to reflect the following two aspects:

• The distance between the center of the brush agent and the nearest point on the medial axis of the shape at the current time step: this detects whether the agent moves out of the region or travels backward from the correct direction.

• The change of the local configuration of the brush agent after executing an action: this detects whether the agent moves smoothly.

These two aspects are formalized by defining the reward function as follows:

r(s_t, a_t, s_{t+1}) = \begin{cases} 0 & \text{if } f_t = f_{t+1} \text{ or } l_{t+1} = 0, \\ \dfrac{2 + |\kappa_1(t)| + |\kappa_2(t)|}{E_{\mathrm{location}}(t) + E_{\mathrm{posture}}(t)} & \text{otherwise,} \end{cases}

where f_t and f_{t+1} are the footprints at time steps t and t+1, respectively. This reward design implies that the immediate reward is zero when the brush is blocked by a boundary (so that f_t = f_{t+1}) or the brush is going backward to a region that has already been covered by previous footprints.


κ₁(t) and κ₂(t) are the values of κ₁ and κ₂ at time step t. The term |κ₁(t)| + |κ₂(t)| adaptively increases the immediate reward depending on the curvatures κ₁(t) and κ₂(t) of the medial axis.

E_location(t) measures the quality of the location of the brush agent with respect to the medial axis, defined by

E_{\mathrm{location}}(t) = \begin{cases} \tau_1 |\omega_t| + \tau_2 (|d_t| + 5) & d_t \in [-2,-1) \cup (1,2], \\ \tau_1 |\omega_t| + \tau_2 |d_t| & d_t \in [-1,1], \end{cases}

where d_t is the value of d at time step t. τ₁ and τ₂ are weight parameters, which are chosen depending on the brush style: τ₁ = τ₂ = 0.5 for the upright brush style and τ₁ = 0.1 and τ₂ = 0.9 for the oblique brush style. Since d_t contains information about whether the agent goes over the boundary or not, as illustrated in Figure 7.7, the penalty +5 is added to E_location when the agent goes over the boundary of the shape.

E_posture(t) measures the quality of the posture of the brush agent based on neighboring footprints, defined by

E_{\mathrm{posture}}(t) = \frac{\Delta\omega_t + \Delta\phi_t + \Delta d_t}{3},

where Δω_t, Δφ_t, and Δd_t are the changes in the angle ω of the velocity vector, the heading direction φ, and the offset-distance ratio d, respectively. The notation Δx_t denotes the normalized squared change between x_{t−1} and x_t, defined by

\Delta x_t = \begin{cases} 1 & \text{if } x_t = x_{t-1} = 0, \\ \dfrac{(x_t - x_{t-1})^2}{(|x_t| + |x_{t-1}|)^2} & \text{otherwise.} \end{cases}
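The following minimal Python sketch puts the pieces of this reward together, assuming the state features and the previous/current footprints are already computed by the geometry code; the function name and the feature ordering are hypothetical, and the default weights follow the upright-style values quoted above.

```python
def immediate_reward(s_prev, s_curr, same_footprint, tau1=0.5, tau2=0.5):
    """Sumie-agent reward: zero if the brush is blocked (same footprint) or
    re-enters covered area (l = 0); otherwise a curvature-boosted score divided
    by the location and posture penalties."""
    omega, phi, d, k1, k2, l = s_curr
    if same_footprint or l == 0:
        return 0.0

    # Location penalty: +5 is added to |d| when the brush crosses the boundary.
    over = abs(d) > 1.0
    e_location = tau1 * abs(omega) + tau2 * (abs(d) + (5.0 if over else 0.0))

    # Posture penalty: normalized squared changes of omega, phi, d.
    def delta(x_prev, x_curr):
        if x_prev == 0.0 and x_curr == 0.0:
            return 1.0
        return (x_curr - x_prev) ** 2 / (abs(x_curr) + abs(x_prev)) ** 2

    e_posture = (delta(s_prev[0], omega) + delta(s_prev[1], phi)
                 + delta(s_prev[2], d)) / 3.0

    return (2.0 + abs(k1) + abs(k2)) / (e_location + e_posture)
```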

7.4.2.4 Training and Test Sessions

A naive way to train the agent is to use an entire stroke shape as a training sample. However, this has several drawbacks, e.g., collecting many training samples is costly and generalization to new shapes is hard. To overcome these limitations, the agent is trained on partial shapes, not the entire shapes (Figure 7.8(a)). This allows us to generate various partial shapes from a single entire shape, which significantly increases the number and variation of training samples. Another merit is that the generalization ability to new shapes can be enhanced, because even when the entire profile of a new shape is quite different from that of the training data, the new shape may contain similar partial shapes. Figure 7.8(c) illustrates 8 examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting. Boundaries are extracted as the shape information and are arranged in a queue for training (see Figure 7.8(b)).

FIGURE 7.8: Policy training scheme. (a) Combination of shapes: each entire shape is composed of one of the upper regions U_i, the common region Ω, and one of the lower regions L_j. (b) Setup of policy training: boundaries are extracted as the shape information and are arranged in a queue for training. (c) Training shapes: eight examples of the 80 digitized real single brush strokes that are commonly used in oriental ink painting.

In the training session, the initial position of the first episode is chosen to be the start point of the medial axis, and the direction of movement is chosen toward the goal point, as illustrated in Figure 7.8(b). In the first episode, the initial footprint is set at the start point of the shape. Then, in the following episodes, the initial footprint is set at either the last footprint of the previous episode or the start point of the shape, depending on whether the agent moved well or was blocked by the boundary in the previous episode.

After learning a drawing policy, the brush agent applies the learned policy to covering given boundaries with smooth strokes. The location of the agent is initialized at the start point of a new shape. The agent then sequentially selects actions based on the learned policy and makes transitions until it reaches the goal point.

FIGURE 7.9: Average and standard deviation of returns obtained by the reinforcement learning (RL) method over 10 trials and the upper limit of the return value. Both panels plot the return against the policy-update iteration: (a) upright brush style; (b) oblique brush style.

7.4.3 Experimental Results

First, the performance of the reinforcement learning (RL) method is investigated. Policies are separately trained by the REINFORCE algorithm for the upright brush style and the oblique brush style using 80 single strokes as training data (see Figure 7.8(c)). The parameters of the initial policy are set at

θ = (µ⊤, σ)⊤ = (0, 0, 0, 0, 0, 0, 2)⊤,

where the first six elements correspond to the Gaussian mean and the last element is the Gaussian standard deviation. The agent collects N = 300 episodic samples with trajectory length T = 32. The discount factor is set at γ = 0.99.
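For concreteness, here is a minimal Python sketch of the Gaussian policy used for Action 1 in this setup, with a six-dimensional mean vector acting on the state features and a shared standard deviation. The log-policy gradients shown are the standard ones for a Gaussian and are what a REINFORCE-style update would accumulate; this is only an illustrative sketch, not the exact implementation used in the experiments.

```python
import numpy as np

def sample_action(theta, s, rng=np.random.default_rng()):
    """theta = (mu_1..mu_6, sigma); s is the 6-dimensional state feature vector."""
    mu, sigma = theta[:6], theta[6]
    return rng.normal(mu @ s, sigma)

def log_policy_grad(theta, s, a):
    """Gradient of log N(a | mu^T s, sigma^2) with respect to (mu, sigma)."""
    mu, sigma = theta[:6], theta[6]
    diff = a - mu @ s
    grad_mu = diff / sigma**2 * s
    grad_sigma = (diff**2 - sigma**2) / sigma**3
    return np.concatenate([grad_mu, [grad_sigma]])
```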

The average and standard deviation of the return over the 300 training episodic samples, averaged over 10 trials, are plotted in Figure 7.9. The graphs show that the average returns sharply increase in an early stage and approach the optimal values (i.e., receiving the maximum immediate reward, +1, for all steps).

Next, the performance of the RL method is compared with that of the dynamic programming (DP) method (Xie et al., 2011), which involves discretization of the continuous state space. In Figure 7.10, the experimental results obtained by DP with different numbers of footprint candidates in each step of the DP search are plotted together with the result obtained by RL. This shows that the execution time of the DP method increases significantly as the number of footprint candidates increases.


FIGURE 7.10: Average return and computation time for reinforcement learning (RL) and dynamic programming (DP), plotted against the number of footprint candidates. (a) Average return. (b) Computation time.

In the DP method, the best return value of 26.27 is achieved when the number of footprint candidates is set at 180. Although this maximum value is comparable to the return obtained by the RL method (26.44), RL is about 50 times faster than the DP method. Figure 7.11 shows some exemplary strokes generated by RL (the top two rows) and DP (the bottom two rows). This shows that the agent trained by RL is able to draw nice strokes with stable poses after the 30th policy update iteration (see also Figure 7.9). On the other hand, as illustrated in Figure 7.11, the DP results for 5, 60, and 100 footprint candidates are unacceptably poor. Given that the DP method requires manual tuning of the number of footprint candidates at each step for each input shape, the RL method is demonstrated to be promising.

The RL method is further applied to more realistic shapes, illustrated in Figure 7.12. Although these shapes are not included in the training samples, the RL method can produce smooth and natural brush strokes for various unlearned shapes. More results are illustrated in Figure 7.13, showing that the RL method is promising for photo conversion into the sumie style.

7.5 Remarks

In this chapter, gradient-based algorithms for direct policy search are introduced. These gradient-based methods are suitable for controlling vulnerable physical systems such as humanoid robots, thanks to the nature of gradient methods that parameters are updated gradually. Furthermore, direct policy search can handle continuous actions in a straightforward way, which is an advantage over policy iteration, explained in Part II.


FIGURE 7.11: Examples of strokes generated by RL and DP. The top two rows (a) show the RL results over policy update iterations (1st, 10th, 20th, 30th, and 40th iterations), while the bottom two rows (b) show the DP results for different numbers of footprint candidates (5, 60, 100, 140, and 180 candidates). The line segment connects the center and the tip of a footprint, and the circle denotes the bottom circle of the footprint.

The gradient-based method was successfully applied to automatic sumie painting generation. Considering local measurements in the state design was shown to be useful, which allowed the brush agent to learn a general drawing policy that is independent of a specific entire shape. Another important factor was to train the brush agent on partial shapes, not the entire shapes. This contributed highly to enhancing the generalization ability to new shapes, because even when a new shape is quite different from the training data as a whole, it often contains similar partial shapes. In this kind of real-world application, manually designing immediate reward functions is often time consuming and difficult. The use of inverse reinforcement learning (Abbeel & Ng, 2004) would be a promising approach for this purpose.


FIGURE 7.12: Results on new shapes. (a) Real photo. (b) User input boundaries. (c) Trajectories estimated by RL. (d) Rendering results.

In particular, in the context of sumie drawing, such data-driven design of reward functions will allow automatic learning of the style of a particular artist from his/her drawings.

A practical weakness of the gradient-based approach is that the step size of gradient ascent is often difficult to choose. In Chapter 8, a step-size-free method of direct policy search based on the expectation-maximization algorithm will be introduced. Another critical problem of direct policy search is that the policy update is rather unstable due to the stochasticity of policies. Although variance reduction by baseline subtraction can mitigate this problem to some extent, the instability problem is still critical in practice. The natural gradient method could be an alternative, but computing the inverse Riemannian metric tends to be unstable. In Chapter 9, another gradient approach that can address the instability problem will be introduced.


FIGURE 7.13: Photo conversion into the sumie style.

Chapter 8

Direct Policy Search by Expectation-Maximization

Gradient-based direct policy search methods introduced in Chapter 7 are useful particularly in controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method, based on the expectation-maximization (EM) algorithm, that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Then its experimental performance is evaluated in Section 8.3, and this chapter is concluded in Section 8.4.

8.1 Expectation-Maximization Approach

The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

J(\theta) = \int p(h|\theta)\, R(h)\, \mathrm{d}h.

To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

\int q(h)\, f(g(h))\, \mathrm{d}h \ge f\!\left( \int q(h)\, g(h)\, \mathrm{d}h \right),


where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

\int q(h) \log g(h)\, \mathrm{d}h \le \log \int q(h)\, g(h)\, \mathrm{d}h.   (8.1)

Assume that the return R(h) is non-negative. Let θ̃ be the current policy parameter during the optimization procedure, and let q and g in Eq. (8.1) be set as

q(h) = \frac{p(h|\widetilde{\theta})\, R(h)}{J(\widetilde{\theta})} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\widetilde{\theta})}.

Then the following lower bound holds for all θ:

\log \frac{J(\theta)}{J(\widetilde{\theta})} = \log \int \frac{p(h|\theta)\, R(h)}{J(\widetilde{\theta})}\, \mathrm{d}h = \log \int \frac{p(h|\widetilde{\theta})\, R(h)}{J(\widetilde{\theta})}\, \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h \ge \int \frac{p(h|\widetilde{\theta})\, R(h)}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h.

This yields

\log J(\theta) \ge \log \widetilde{J}(\theta),

where

\log \widetilde{J}(\theta) = \int \frac{R(h)\, p(h|\widetilde{\theta})}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h + \log J(\widetilde{\theta}).

In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound J̃(θ):

\widehat{\theta} = \mathop{\mathrm{argmax}}_{\theta} \widetilde{J}(\theta).

Since log J̃(θ̃) = log J(θ̃), the lower bound J̃ touches the target function J at the current solution θ̃:

\widetilde{J}(\widetilde{\theta}) = J(\widetilde{\theta}).

Thus, monotone non-decrease of the expected return is guaranteed:

J(\widehat{\theta}) \ge J(\widetilde{\theta}).

This update is iterated until convergence (see Figure 8.1).

FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound J̃(θ), which touches the true expected return J(θ) at the current solution θ̃.

Let us employ the Gaussian policy model defined as

\pi(a|s,\theta) = \pi(a|s,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right),

where θ = (µ⊤, σ)⊤ and φ(s) denotes the basis function.

The maximizer θ̂ = (µ̂⊤, σ̂)⊤ of the lower bound J̃(θ) can be analytically obtained as

\widehat{\mu} = \left( \int p(h|\widetilde{\theta})\, R(h) \sum_{t=1}^{T} \phi(s_t)\phi(s_t)^\top \mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta})\, R(h) \sum_{t=1}^{T} a_t \phi(s_t)\, \mathrm{d}h
\approx \left( \sum_{n=1}^{N} R(h_n) \sum_{t=1}^{T} \phi(s_{t,n})\phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^{N} R(h_n) \sum_{t=1}^{T} a_{t,n} \phi(s_{t,n}),

\widehat{\sigma}^2 = \left( \int p(h|\widetilde{\theta})\, R(h)\, \mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta})\, R(h)\, \frac{1}{T} \sum_{t=1}^{T} (a_t - \widehat{\mu}^\top \phi(s_t))^2\, \mathrm{d}h
\approx \left( \sum_{n=1}^{N} R(h_n) \right)^{-1} \sum_{n=1}^{N} R(h_n)\, \frac{1}{T} \sum_{t=1}^{T} (a_{t,n} - \widehat{\mu}^\top \phi(s_{t,n}))^2,

where the expectation over h is approximated by the average over the roll-out samples H = {h_n}_{n=1}^{N} collected from the current policy θ̃:

h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
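The RWR update above is just a return-weighted least-squares fit, which the following minimal Python sketch makes explicit. It assumes episodic data arrays `states` (N×T×B basis features), `actions` (N×T), and `returns` (N); these array names are hypothetical, and the sketch implements only the update equations written above.

```python
import numpy as np

def rwr_update(states, actions, returns):
    """One reward-weighted regression (EM) update for a Gaussian policy.
    states: (N, T, B) basis features phi(s_t); actions: (N, T); returns: (N,)."""
    N, T, B = states.shape
    A = np.zeros((B, B))
    b = np.zeros(B)
    for n in range(N):
        A += returns[n] * states[n].T @ states[n]   # sum_n R(h_n) sum_t phi phi^T
        b += returns[n] * states[n].T @ actions[n]  # sum_n R(h_n) sum_t a_t phi
    mu = np.linalg.solve(A, b)

    resid = actions - states @ mu                   # (N, T) residuals a_t - mu^T phi
    sigma2 = (returns @ (resid**2).mean(axis=1)) / returns.sum()
    return mu, np.sqrt(sigma2)
```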


8.2 Sample Reuse

In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with this instability problem.

8.2.1 Episodic Importance Weighting

The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy πℓ in the ℓ-th policy update iteration:

H^{\pi_\ell} = \{ h^{\pi_\ell}_1, \ldots, h^{\pi_\ell}_N \},

where each trajectory sample h^{πℓ}_n is given as

h^{\pi_\ell}_n = [ s^{\pi_\ell}_{1,n}, a^{\pi_\ell}_{1,n}, \ldots, s^{\pi_\ell}_{T,n}, a^{\pi_\ell}_{T,n}, s^{\pi_\ell}_{T+1,n} ].

We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies {πℓ}_{ℓ=1}^{L} remain unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 gives a consistent estimator θ̂^{NIW}_{L+1} = (µ̂^{NIW⊤}_{L+1}, σ̂^{NIW}_{L+1})⊤, where

\widehat{\mu}^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \sum_{t=1}^{T} \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \sum_{t=1}^{T} a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\widehat{\sigma}^{\mathrm{NIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, \frac{1}{T} \sum_{t=1}^{T} \big( a^{\pi_\ell}_{t,n} - \widehat{\mu}^{\mathrm{NIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \big)^2 \right).

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, the data samples {H^{πℓ}}_{ℓ=1}^{L} collected over iterations generally follow different probability distributions induced by the different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.


In the same way as in the discussion in Chapter 4, importance sampling can be used to cope with this problem. The basic idea of importance sampling is to weight the samples drawn from a different distribution so as to match the target distribution. More specifically, from i.i.d. (independent and identically distributed) samples {h^{πℓ}_n}_{n=1}^{N} following p(h|θℓ), the expectation of a function g(h) over another probability density function p(h|θ_L) can be estimated in a consistent manner by the importance-weighted average:

\frac{1}{N} \sum_{n=1}^{N} g(h^{\pi_\ell}_n)\, \frac{p(h^{\pi_\ell}_n|\theta_L)}{p(h^{\pi_\ell}_n|\theta_\ell)} \;\xrightarrow{N\to\infty}\; \mathbb{E}_{p(h|\theta_\ell)}\!\left[ g(h)\, \frac{p(h|\theta_L)}{p(h|\theta_\ell)} \right] = \int g(h)\, \frac{p(h|\theta_L)}{p(h|\theta_\ell)}\, p(h|\theta_\ell)\, \mathrm{d}h = \int g(h)\, p(h|\theta_L)\, \mathrm{d}h = \mathbb{E}_{p(h|\theta_L)}[g(h)].

The ratio of the two densities, p(h|θ_L)/p(h|θ_ℓ), is called the importance weight for trajectory h.

This importance sampling technique can be employed in RWR to obtain a consistent estimator θ̂^{EIW}_{L+1} = (µ̂^{EIW⊤}_{L+1}, σ̂^{EIW}_{L+1})⊤, where

\widehat{\mu}^{\mathrm{EIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^{T} \phi(s^{\pi_\ell}_{t,n}) \phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^{T} a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),

(\widehat{\sigma}^{\mathrm{EIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n)\, \frac{1}{T} \sum_{t=1}^{T} \big( a^{\pi_\ell}_{t,n} - \widehat{\mu}^{\mathrm{EIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n}) \big)^2 \right).

Here, w^{(L,ℓ)}(h) denotes the importance weight defined by

w^{(L,\ell)}(h) = \frac{p(h|\theta_L)}{p(h|\theta_\ell)}.

The superscript "EIW" stands for "episodic importance weight."

p(h|θ_L) and p(h|θ_ℓ) denote the probability densities of observing the trajectory

h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]

under the policy parameters θ_L and θ_ℓ, which can be explicitly written as

p(h|\theta_L) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t,a_t)\, \pi(a_t|s_t,\theta_L),


p(h|\theta_\ell) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t,a_t)\, \pi(a_t|s_t,\theta_\ell).

The two probability densities p(h|θ_L) and p(h|θ_ℓ) both contain the unknown probability densities p(s_1) and {p(s_{t+1}|s_t, a_t)}_{t=1}^{T}. However, since these cancel out in the importance weight, it can be computed without the knowledge of p(s) and p(s′|s, a) as

w^{(L,\ell)}(h) = \frac{\prod_{t=1}^{T} \pi(a_t|s_t,\theta_L)}{\prod_{t=1}^{T} \pi(a_t|s_t,\theta_\ell)}.

Although the importance-weighted estimator θ̂^{EIW}_{L+1} is guaranteed to be consistent, it tends to have large variance (Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Therefore, the importance-weighted estimator tends to be unstable when the number of episodes N is rather small.

8.2.2 Per-Decision Importance Weight

Since the reward at the t-th step does not depend on the future state-action transitions after the t-th step, an episodic importance weight can be decomposed into stepwise importance weights (Precup et al., 2000). For instance, the expected return J(θ_L) can be expressed as

J(\theta_L) = \int R(h)\, p(h|\theta_L)\, \mathrm{d}h = \int \sum_{t=1}^{T} \gamma^{t-1} r(s_t,a_t,s_{t+1})\, w^{(L,\ell)}(h)\, p(h|\theta_\ell)\, \mathrm{d}h = \int \sum_{t=1}^{T} \gamma^{t-1} r(s_t,a_t,s_{t+1})\, w^{(L,\ell)}_t(h)\, p(h|\theta_\ell)\, \mathrm{d}h,

where w^{(L,ℓ)}_t(h) is the t-step importance weight, called the per-decision importance weight (PIW), defined as

w^{(L,\ell)}_t(h) = \frac{\prod_{t'=1}^{t} \pi(a_{t'}|s_{t'},\theta_L)}{\prod_{t'=1}^{t} \pi(a_{t'}|s_{t'},\theta_\ell)}.
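The episodic and per-decision importance weights are both just running products of policy ratios, which the following minimal Python sketch computes for the Gaussian policy of Section 8.1; the array names are hypothetical, and the sketch assumes the basis features and actions of one trajectory are given.

```python
import numpy as np

def gaussian_policy_pdf(mu, sigma, phis, acts):
    """pi(a_t | s_t) at each step of one trajectory under a Gaussian policy."""
    means = phis @ mu
    return np.exp(-(acts - means)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def per_decision_weights(mu_L, sigma_L, mu_l, sigma_l, phis, acts):
    """w_t^{(L,l)}(h) for t = 1..T; the last entry equals the episodic weight."""
    ratio = (gaussian_policy_pdf(mu_L, sigma_L, phis, acts)
             / gaussian_policy_pdf(mu_l, sigma_l, phis, acts))
    return np.cumprod(ratio)
```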

Here, the PIW idea is applied to RWR and a more stable algorithm is developed. A slight complication is that the policy update formulas given in Section 8.2.1 contain double sums over the T steps, e.g.,

R(h) \sum_{t'=1}^{T} \phi(s_{t'}) \phi(s_{t'})^\top = \sum_{t,t'=1}^{T} \gamma^{t-1} r(s_t,a_t,s_{t+1})\, \phi(s_{t'}) \phi(s_{t'})^\top.

In this case, the summand

\gamma^{t-1} r(s_t,a_t,s_{t+1})\, \phi(s_{t'}) \phi(s_{t'})^\top


does not depend on the future state-action pairs after the max(t, t′)-th step. Thus, the episodic importance weight for

\gamma^{t-1} r(s_t,a_t,s_{t+1})\, \phi(s_{t'}) \phi(s_{t'})^\top

can be simplified to the per-decision importance weight w^{(L,ℓ)}_{max(t,t′)}.

Consequently, the PIW-based policy update rules are given as

\widehat{\mu}^{\mathrm{PIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n})\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

(\widehat{\sigma}^{\mathrm{PIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r_{t,n}\, w^{(L,\ell)}_{t}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \frac{1}{T} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n} \big( a^{\pi_\ell}_{t',n} - \widehat{\mu}^{\mathrm{PIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \big)^2 w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),

where

r_{t,n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

This PIW estimator θ̂^{PIW}_{L+1} = (µ̂^{PIW⊤}_{L+1}, σ̂^{PIW}_{L+1})⊤ is consistent and potentially more stable than the plain EIW estimator θ̂^{EIW}_{L+1}.

8.2.3 Adaptive Per-Decision Importance Weighting

To more actively control the stability of the PIW estimator, the adaptive per-decision importance weight (AIW) is employed. More specifically, an importance weight w^{(L,ℓ)}_{max(t,t′)}(h) is "flattened" by a flattening parameter ν ∈ [0, 1] as (w^{(L,ℓ)}_{max(t,t′)}(h))^ν, i.e., the ν-th power of the per-decision importance weight.

Then we have θ̂^{AIW}_{L+1} = (µ̂^{AIW⊤}_{L+1}, σ̂^{AIW}_{L+1})⊤, where

\widehat{\mu}^{\mathrm{AIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n}) \phi(s^{\pi_\ell}_{t',n})^\top \big( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \big)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \big( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \big)^{\nu} \right),

(\widehat{\sigma}^{\mathrm{AIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r_{t,n} \big( w^{(L,\ell)}_{t}(h^{\pi_\ell}_n) \big)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \frac{1}{T} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n} \big( a^{\pi_\ell}_{t',n} - \widehat{\mu}^{\mathrm{AIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n}) \big)^2 \big( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \big)^{\nu} \right).

When ν = 0, AIW is reduced to NIW. Therefore, it is relatively stable, but not consistent. On the other hand, when ν = 1, AIW is reduced to PIW. Therefore, it is consistent, but rather unstable. In practice, an intermediate ν often produces a better estimator. Note that the value of the flattening parameter can be different in each iteration, i.e., ν may be replaced by νℓ. However, for simplicity, a single common value ν is considered here.
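Building on the per-decision weights computed earlier, flattening itself is a one-line operation, as the following Python sketch shows; the trade-off is exactly the one described above (ν = 0 ignores the weights, ν = 1 uses them fully).

```python
import numpy as np

def flatten_weights(per_decision_w, nu):
    """Adaptive per-decision importance weights: the nu-th power of w_t.
    nu = 0 recovers NIW (all ones); nu = 1 recovers PIW."""
    return np.asarray(per_decision_w) ** nu
```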

8.2.4 Automatic Selection of Flattening Parameter

The flattening parameter allows us to control the trade-off between consistency and stability. Here, we show how the value of the flattening parameter can be optimally chosen using data samples.

The goal of policy search is to find the optimal policy that maximizes the expected return J(θ). Therefore, the optimal flattening parameter value ν*_L at the L-th iteration is given by

\nu^*_L = \mathop{\mathrm{argmax}}_{\nu} J\!\big( \widehat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \big).

Directly obtaining ν*_L requires the computation of the expected return J(θ̂^{AIW}_{L+1}(ν)) for each candidate of ν. To this end, data samples following π(a|s; θ̂^{AIW}_{L+1}(ν)) are needed for each ν, which is prohibitively expensive. To reuse samples generated by previous policies, a variation of cross-validation called importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is employed.

The basic idea of IWCV is to split the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^{L} into an "estimation part" and a "validation part." Then the policy parameter θ̂^{AIW}_{L+1}(ν) is learned from the estimation part and its expected return J(θ̂^{AIW}_{L+1}(ν)) is approximated using the importance-weighted loss on the validation part. As pointed out in Section 8.2.1, importance weighting tends to be unstable when the number N of episodes is small. For this reason, per-decision importance weighting is used for cross-validation. Below, how IWCV is applied to the selection of the flattening parameter ν in the current context is explained in more detail.

Let us divide the training dataset H^{π_{1:L}} = {H^{πℓ}}_{ℓ=1}^{L} into K disjoint subsets {H^{π_{1:L}}_k}_{k=1}^{K} of the same size, where each H^{π_{1:L}}_k contains N/K episodic samples from every H^{πℓ}. For simplicity, we assume that N is divisible by K, i.e., N/K is an integer. K = 5 will be used in the experiments later.

Let θ̂^{AIW}_{L+1,k}(ν) be the policy parameter learned from {H^{π_{1:L}}_{k'}}_{k'≠k} (i.e., all data without H^{π_{1:L}}_k) by AIW estimation. The expected return of θ̂^{AIW}_{L+1,k}(ν) is


estimated using the PIW estimator from H^{π_{1:L}}_k as

\widehat{J}^{\,k}_{\mathrm{IWCV}}\!\big( \widehat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) \big) = \frac{1}{\eta} \sum_{h \in H^{\pi_{1:L}}_k} \sum_{t=1}^{T} \gamma^{t-1} r(s_t,a_t,s_{t+1})\, w^{(L,\ell)}_t(h),

where η is a normalization constant. An ordinary choice is η = LN/K, but the more stable variant given by

\eta = \sum_{h \in H^{\pi_{1:L}}_k} w^{(L,\ell)}_t(h)

is often preferred in practice (Precup et al., 2000).

The above procedure is repeated for all k = 1, ..., K, and the average score,

\widehat{J}_{\mathrm{IWCV}}\!\big( \widehat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \big) = \frac{1}{K} \sum_{k=1}^{K} \widehat{J}^{\,k}_{\mathrm{IWCV}}\!\big( \widehat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu) \big),

is computed. This is the K-fold IWCV estimator of J(θ̂^{AIW}_{L+1}(ν)), which was shown to be almost unbiased (Sugiyama et al., 2007).

This K-fold IWCV score is computed for each candidate value of the flattening parameter ν, and the one that maximizes the IWCV score is chosen:

\widehat{\nu}_{\mathrm{IWCV}} = \mathop{\mathrm{argmax}}_{\nu} \widehat{J}_{\mathrm{IWCV}}\!\big( \widehat{\theta}^{\mathrm{AIW}}_{L+1}(\nu) \big).

This IWCV scheme can also be used for choosing the basis functions φ(s) of the Gaussian policy model.

Note that when the importance weights w^{(L,ℓ)}_{max(t,t′)} are all one (i.e., no importance weighting), the above IWCV procedure is reduced to the ordinary CV procedure. The use of IWCV is essential here since the target policy π(a|s, θ̂^{AIW}_{L+1}(ν)) is usually different from the previous policies used for collecting the data samples H^{π_{1:L}}. Therefore, the expected return estimated using ordinary CV, Ĵ_CV(θ̂^{AIW}_{L+1}(ν)), would be heavily biased.
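The selection loop itself is plain K-fold cross-validation with importance-weighted scores, as the following Python sketch shows; `aiw_update` and `iwcv_score` stand in for the AIW policy update and the per-decision importance-weighted return estimate defined above, so both names and their exact signatures are hypothetical.

```python
import numpy as np

def select_flattening_parameter(folds, aiw_update, iwcv_score,
                                candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """K-fold IWCV selection of the flattening parameter nu.
    folds: list of K disjoint subsets of the collected episodic data."""
    K = len(folds)
    best_nu, best_score = None, -np.inf
    for nu in candidates:
        scores = []
        for k in range(K):
            train = [f for j, f in enumerate(folds) if j != k]   # estimation part
            theta_k = aiw_update(train, nu)                      # AIW-based policy update
            scores.append(iwcv_score(theta_k, folds[k]))         # validation part
        score = np.mean(scores)
        if score > best_score:
            best_nu, best_score = nu, score
    return best_nu
```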

8.2.5 Reward-Weighted Regression with Sample Reuse

So far, we have introduced AIW to control the stability of the policy-parameter update and IWCV to automatically choose the flattening parameter based on the estimated expected return. The policy search algorithm that combines these two methods is called reward-weighted regression with sample reuse (RRR).

In each iteration (L = 1, 2, ...) of RRR, episodic data samples H^{π_L} are collected following the current policy π(a|s, θ^{AIW}_L), the flattening parameter ν is chosen so as to maximize the expected return Ĵ_IWCV(ν) estimated by IWCV using {H^{πℓ}}_{ℓ=1}^{L}, and then the policy parameter is updated to θ^{AIW}_{L+1} using {H^{πℓ}}_{ℓ=1}^{L}.


FIGURE 8.2: Ball balancing using a robot arm simulator. Two joints of the robot (the elbow and the wrist) are controlled to keep the ball in the middle of the tray.

8.3 Numerical Examples

The performance of RRR is experimentally evaluated on a ball-balancing task using a robot arm simulator (Schaal, 2009).

As illustrated in Figure 8.2, a 7-degree-of-freedom arm is mounted upside down on the ceiling and is equipped with a circular tray of radius 0.24 [m] at the end effector. The goal is to control the joints of the robot so that the ball is brought to the middle of the tray. However, the difficulty is that the angle of the tray cannot be controlled directly, which is a typical restriction in real-world joint-motion planning based on feedback from the environment (e.g., the state of the ball).

To simplify the problem, only two joints are controlled here: the wrist angle α_roll and the elbow angle α_pitch. All the remaining joints are fixed. Control of the wrist and elbow angles roughly corresponds to changing the roll and pitch angles of the tray, but not directly.

Two separate control subsystems are designed here, each of which is in charge of controlling either the roll or the pitch angle. Each subsystem has its own policy parameter θ, state space S, and action space A. The state space S is continuous and consists of (x, ẋ), where x [m] is the position of the ball on the tray along the corresponding axis and ẋ [m/s] is the velocity of the ball. The action space A is continuous and corresponds to the target angle a [rad] of the joint. The reward function is defined as

r(s, a, s') = \exp\!\left( -\frac{5(x')^2 + (\dot{x}')^2 + a^2}{2(0.24/2)^2} \right),

where the number 0.24 in the denominator comes from the radius of the tray. Below, how the control system is designed is explained in more detail.


FIGURE 8.3: The block diagram of the robot-arm control system for ball balancing. The control system has two feedback loops, i.e., joint-trajectory planning by RRR and trajectory tracking by a high-gain proportional-derivative (PD) controller.

As illustrated in Figure 8.3, the control system has two feedback loops: trajectory planning using an RRR controller and trajectory tracking using a high-gain proportional-derivative (PD) controller (Siciliano & Khatib, 2008). The RRR controller outputs the target joint angle obtained by the current policy every 0.2 [s]. Nine Gaussian kernels are used as the basis functions φ(s), with the kernel centers {c_b}_{b=1}^{9} located over the state space at

(x, ẋ) ∈ {(−0.2, −0.4), (−0.2, 0), (−0.1, 0.4), (0, −0.4), (0, 0), (0, 0.4), (0.1, −0.4), (0.2, 0), (0.2, 0.4)}.

The Gaussian width is set at σ_basis = 0.1. Based on the discrete-time target angles obtained by RRR, the desired joint trajectory in the continuous time domain is linearly interpolated as

a_{t,u} = a_t + u\, \dot{a}_t,

where u is the time elapsed since the last output a_t of RRR at the t-th step, and ȧ_t is the angular velocity computed by

\dot{a}_t = \frac{a_t - a_{t-1}}{0.2},

where a_0 is the initial angle of the joint. The angular velocity is assumed to be constant during the 0.2 [s] cycle of trajectory planning.

On the other hand, the PD controller converts the desired joint trajectories to motor torques as

\tau_{t,u} = \mu_p * (a_{t,u} - \alpha_{t,u}) + \mu_d * (\dot{a}_t - \dot{\alpha}_{t,u}),

where τ is the 2-dimensional vector of the torques applied to the wrist and elbow joints. a = (a_pitch, a_roll)⊤ and ȧ = (ȧ_pitch, ȧ_roll)⊤ are the 2-dimensional vectors of the desired angles and velocities.


α = (α_pitch, α_roll)⊤ and α̇ = (α̇_pitch, α̇_roll)⊤ are the 2-dimensional vectors of the current joint angles and velocities. µ_p and µ_d are the 2-dimensional vectors of the proportional and derivative gains, and "∗" denotes the element-wise product. Since the control cycle of the robot arm is 0.002 [s], the PD controller is applied 100 times (i.e., u = 0.002, 0.004, ..., 0.198, 0.2) in each RRR cycle.
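The interplay of the 0.2 [s] planning cycle and the 0.002 [s] control cycle can be summarized in a short Python sketch. The structure mirrors the description above, but the sketch is only illustrative: `robot_step` is a hypothetical placeholder for the simulator, and the gain vectors are passed in because their values are not specified in the text.

```python
import numpy as np

def track_one_rrr_cycle(a_prev, a_t, alpha, alpha_dot, mu_p, mu_d, robot_step,
                        dt=0.002, cycle=0.2):
    """Track one 0.2 [s] RRR planning cycle with a 0.002 [s] PD control loop.
    a_prev, a_t: previous/current target joint angles (2-vectors);
    alpha, alpha_dot: current joint angles/velocities;
    mu_p, mu_d: proportional/derivative gain vectors; robot_step: simulator call."""
    a_dot = (a_t - a_prev) / cycle                  # constant desired angular velocity
    for k in range(1, int(round(cycle / dt)) + 1):  # 100 PD updates per RRR cycle
        u = k * dt
        a_desired = a_t + u * a_dot                 # linearly interpolated target angle
        tau = mu_p * (a_desired - alpha) + mu_d * (a_dot - alpha_dot)
        alpha, alpha_dot = robot_step(tau, dt)      # hypothetical simulator step
    return alpha, alpha_dot
```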

Figure 8.4 depicts a desired trajectory of the wrist joint generated by a random policy and the actual trajectory obtained using the high-gain PD controller described above. The graphs show that the desired trajectory is followed by the robot arm reasonably well.

The policy parameter θ_L is learned through the RRR iterations. The initial policy parameters θ_1 = (µ_1⊤, σ_1)⊤ are set manually as

µ_1 = (−0.5, −0.5, 0, −0.5, 0, 0, 0, 0, 0)⊤ and σ_1 = 0.1,

so that a wide range of states and actions can be safely explored in the first iteration. The initial position of the ball is randomly selected as x ∈ [−0.05, 0.05]. The dataset collected in each iteration consists of 10 episodes with 20 steps each. The duration of an episode is 4 [s] and the sampling cycle of RRR is 0.2 [s].

Three scenarios are considered here:

• NIW: Sample reuse with ν = 0.

• PIW: Sample reuse with ν = 1.

• RRR: Sample reuse with ν chosen by IWCV from {0, 0.25, 0.5, 0.75, 1} in each iteration.

The discount factor is set at γ = 0.99. Figure 8.5 depicts the averaged expected return over 10 trials as a function of the number of policy update iterations. The expected return in each trial is computed from 20 test episodic samples that have not been used for training. The graph shows that RRR nicely improves the performance over iterations. On the other hand, the performance for ν = 0 is saturated after the 3rd iteration, and the performance for ν = 1 is improved in the beginning but suddenly goes down at the 5th iteration. The result for ν = 1 indicates that a large change in policies causes severe instability in sample reuse.

Figure 8.6 and Figure 8.7 depict examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) and RRR (ν chosen by IWCV) after the 10th iteration. With the policy obtained by NIW, the ball goes through the middle of the tray, i.e., (x_roll, x_pitch) = (0, 0), and does not stop. On the other hand, the policy obtained by RRR successfully guides the ball to the middle of the tray along the roll axis, although the movement along the pitch axis looks similar to that by NIW. Motion examples by RRR with ν chosen by IWCV are illustrated in Figure 8.8.


FIGURE 8.4: An example of desired and actual trajectories of the wrist joint in the realistic ball-balancing task. The target joint angle is determined by a random policy every 0.2 [s], and then a linearly interpolated angle and constant velocity are tracked using the proportional-derivative (PD) controller in the cycle of 0.002 [s]. (a) Trajectory in angles. (b) Trajectory in angular velocities.

FIGURE 8.5: The performance of learned policies when ν = 0 (NIW), ν = 1 (PIW), and ν is chosen by IWCV (RRR) in ball balancing using a simulated robot-arm system. The performance is measured by the return averaged over 10 trials, plotted against the iteration. The symbol in the plot indicates that the method is the best or comparable to the best one in terms of the expected return by the t-test at the significance level 5%, performed at each iteration. The error bars indicate 1/10 of a standard deviation.


FIGURE 8.6: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x (along the pitch and roll axes, with the middle of the tray at 0), and the reward r for policies obtained by NIW (ν = 0) at the 10th iteration in the ball-balancing task.

FIGURE 8.7: Typical examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by RRR (ν chosen by IWCV) at the 10th iteration in the ball-balancing task.


FIGURE 8.8: Motion examples of ball balancing by RRR (from left to right and top to bottom).


8.4 Remarks

A direct policy search algorithm based on expectation-maximization (EM) iteratively maximizes a lower bound of the expected return. The EM-based approach does not include a step size parameter, which is an advantage over the gradient-based approach introduced in Chapter 7. A sample-reuse variant of the EM-based method was also provided, which contributes to improving the stability of the algorithm in small-sample scenarios.

In practice, however, the EM-based approach is still rather unstable even if it is combined with the sample-reuse technique. In Chapter 9, another policy search approach will be introduced to further improve the stability of policy updates.

Chapter 9

Policy-Prior Search

The direct policy search methods explained in Chapter 7 and Chapter 8 are useful for solving problems with continuous actions such as robot control. However, they tend to suffer from instability of the policy update. In this chapter, we introduce an alternative policy search method called policy-prior search, which is adopted in the PGPE (policy gradients with parameter-based exploration) method (Sehnke et al., 2010). The basic idea is to use deterministic policies to remove excessive randomness and to introduce useful stochasticity by considering a prior distribution over policy parameters.

After formulating the problem of policy-prior search in Section 9.1, a gradient-based algorithm is introduced in Section 9.2, including its improvement using baseline subtraction, theoretical analysis, and experimental evaluation. Then, in Section 9.3, a sample-reuse variant is described and its performance is theoretically analyzed and experimentally investigated using a humanoid robot. Finally, this chapter is concluded in Section 9.4.

9.1 Formulation

In this section, the policy search problem is formulated based on policy priors.

The basic idea is to use a deterministic policy and introduce stochasticity by drawing policy parameters from a prior distribution. More specifically, policy parameters are randomly determined following the prior distribution at the beginning of each trajectory, and thereafter action selection is deterministic (Figure 9.1). Note that transitions are generally stochastic, and thus trajectories are also stochastic even though the policy is deterministic. Thanks to this per-trajectory formulation, the variance of gradient estimators in policy-prior search does not increase with respect to the trajectory length, which allows us to overcome the critical drawback of direct policy search.

Policy-prior search uses a deterministic policy, typically with a linear architecture:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),

where δ(·) is the Dirac delta function and φ(s) is the basis function.

134

StatisticalReinforcementLearning

a

s

a

s

a

s

s

a

s

a

s

a

s

(a)Stochasticpolicy

a

s

a

s

s

a

s

a

s

a

s

a

s

(b)Deterministicpolicywithprior

FIGURE9.1:Illustrationofthestochasticpolicyandthedeterministicpol-

icywithapriorunderdeterministictransition.Thenumberofpossibletra-

jectoriesisexponentialwithrespecttothetrajectorylengthwhenstochastic

policiesareused,whileitdoesnotgrowwhendeterministicpoliciesdrawn

fromapriordistributionareused.

The policy parameter θ is drawn from a prior distribution p(θ|ρ) with hyper-parameter ρ.

The expected return in policy-prior search is defined in terms of the expectations over both the trajectory h and the policy parameter θ as a function of the hyper-parameter ρ:

J(\rho) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[R(h)] = \iint p(h|\theta)\, p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta,

where E_{p(h|θ)p(θ|ρ)} denotes the expectation over the trajectory h and the policy parameter θ drawn from p(h|θ)p(θ|ρ). In policy-prior search, the hyper-parameter ρ is optimized so that the expected return J(ρ) is maximized. Thus, the optimal hyper-parameter ρ* is given by

\rho^* = \mathop{\mathrm{argmax}}_{\rho} J(\rho).
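The sampling scheme behind this formulation is easy to state in code: draw θ once from the prior, then run the deterministic linear policy for the whole episode. The following Python sketch assumes a hypothetical `env` object with `reset`/`step` methods and a feature map `phi`; it only illustrates the per-trajectory exploration idea, not any particular implementation.

```python
import numpy as np

def rollout_with_prior(env, phi, eta, tau, T, gamma=0.99,
                       rng=np.random.default_rng()):
    """One episode of policy-prior search: sample theta ~ N(eta, diag(tau^2))
    once, then act deterministically with a = theta^T phi(s)."""
    theta = rng.normal(eta, tau)       # exploration comes only from the prior
    s = env.reset()
    ret = 0.0
    for t in range(T):
        a = theta @ phi(s)             # deterministic linear policy
        s, r = env.step(a)
        ret += (gamma ** t) * r        # discounted return R(h)
    return theta, ret
```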

9.2 Policy Gradients with Parameter-Based Exploration

In this section, a gradient-based algorithm for policy-prior search is given.


9.2.1 Policy-Prior Gradient Ascent

Here, a gradient method is used to find a local maximizer of the expected return J with respect to the hyper-parameter ρ:

\rho \longleftarrow \rho + \varepsilon \nabla_{\rho} J(\rho),

where ε is a small positive constant and ∇_ρJ(ρ) is the derivative of J with respect to ρ:

\nabla_{\rho} J(\rho) = \iint p(h|\theta)\, \nabla_{\rho} p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_{\rho} \log p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\!\left[ \nabla_{\rho} \log p(\theta|\rho)\, R(h) \right],

where the logarithmic derivative,

\nabla_{\rho} \log p(\theta|\rho) = \frac{\nabla_{\rho} p(\theta|\rho)}{p(\theta|\rho)},

was used in the derivation. The expectations over h and θ are approximated by the empirical averages:

\widehat{\nabla}_{\rho} J(\rho) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\rho} \log p(\theta_n|\rho)\, R(h_n),   (9.1)

where each trajectory sample h_n is drawn independently from p(h|θ_n) and the parameter θ_n is drawn from p(θ|ρ). Thus, in policy-prior search, samples are pairs of θ and h:

H = \{ (\theta_1, h_1), \ldots, (\theta_N, h_N) \}.

As the prior distribution for the policy parameter θ = (θ_1, ..., θ_B)⊤, where B is the dimensionality of the basis vector φ(s), the independent Gaussian distribution is a standard choice. For this Gaussian prior, the hyper-parameter ρ consists of the prior means η = (η_1, ..., η_B)⊤ and the prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^{B} \frac{1}{\sqrt{2\pi}\,\tau_b} \exp\!\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.2)

Then the derivatives of the log-prior log p(θ|η, τ) with respect to η_b and τ_b are given as

\nabla_{\eta_b} \log p(\theta|\eta,\tau) = \frac{\theta_b - \eta_b}{\tau_b^2}, \qquad \nabla_{\tau_b} \log p(\theta|\eta,\tau) = \frac{(\theta_b - \eta_b)^2 - \tau_b^2}{\tau_b^3}.

By substituting these derivatives into Eq. (9.1), the policy-prior gradients with respect to η and τ can be approximated.
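Putting Eq. (9.1) and the log-prior derivatives together gives the following minimal Python sketch of the PGPE gradient estimate; `thetas` and `returns` are the sampled parameter/return pairs from rollouts such as the one sketched in Section 9.1, and the optional baseline argument anticipates Section 9.2.2 (all names are illustrative).

```python
import numpy as np

def pgpe_gradient(thetas, returns, eta, tau, baseline=0.0):
    """Empirical PGPE gradients of J with respect to the prior means eta and
    standard deviations tau, following Eq. (9.1)."""
    thetas, returns = np.asarray(thetas), np.asarray(returns)
    diff = thetas - eta                             # (N, B)
    grad_log_eta = diff / tau**2                    # d log p / d eta_b
    grad_log_tau = (diff**2 - tau**2) / tau**3      # d log p / d tau_b
    weights = (returns - baseline)[:, None]         # (N, 1)
    grad_eta = (weights * grad_log_eta).mean(axis=0)
    grad_tau = (weights * grad_log_tau).mean(axis=0)
    return grad_eta, grad_tau

# One gradient-ascent step on the hyper-parameters:
# eta = eta + eps * grad_eta; tau = tau + eps * grad_tau
```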


9.2.2 Baseline Subtraction for Variance Reduction

As explained in Section 7.2.2, subtraction of a baseline can reduce the variance of gradient estimators. Here, a baseline subtraction method for policy-prior search is described.

For a baseline ξ, a modified gradient estimator is given by

\widehat{\nabla}_{\rho} J_{\xi}(\rho) = \frac{1}{N} \sum_{n=1}^{N} (R(h_n) - \xi)\, \nabla_{\rho} \log p(\theta_n|\rho).

Let ξ* be the optimal baseline that minimizes the variance of the gradient:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\rho} J_{\xi}(\rho) \right],

where Var_{p(h|θ)p(θ|ρ)} denotes the trace of the covariance matrix:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\zeta] = \mathrm{tr}\, \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\!\left[ (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta]) (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])^\top \right] = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\!\left[ \| \zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta] \|^2 \right].

It was shown in Zhao et al. (2012) that the optimal baseline for policy-prior search is given by

\xi^* = \frac{ \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\!\left[ R(h)\, \| \nabla_{\rho} \log p(\theta|\rho) \|^2 \right] }{ \mathbb{E}_{p(\theta|\rho)}\!\left[ \| \nabla_{\rho} \log p(\theta|\rho) \|^2 \right] },

where E_{p(θ|ρ)} denotes the expectation over the policy parameter θ drawn from p(θ|ρ). In practice, the expectations are approximated by the sample averages.

9.2.3 Variance Analysis of Gradient Estimators

Here the variance of gradient estimators is theoretically investigated for the independent Gaussian prior (9.2) with φ(s) = s. See Zhao et al. (2012) for technical details.

Below, subsets of the following assumptions are considered (which are the same as the ones used in Section 7.2.3):

Assumption (A): r(s, a, s′) ∈ [−β, β] for β > 0.

Assumption (B): r(s, a, s′) ∈ [α, β] for 0 < α < β.

Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^{T} and {d_t}_{t=1}^{T} such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at least 1 − δ/(2N), respectively, over the choice of sample paths.


Note that Assumption (B) is stronger than Assumption (A).

Let

G = \sum_{b=1}^{B} \tau_b^{-2}.

First, the variance of gradient estimators in policy-prior search is analyzed:

Theorem 9.1 Under Assumption (A), the following upper bounds hold:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{\beta^2 G}{N(1-\gamma)^2},

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\tau} J(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{2\beta^2 G}{N(1-\gamma)^2}.

The second upper bounds are independent of the trajectory length T, while the upper bounds for direct policy search (Theorem 7.1 in Section 7.2.3) are monotone increasing with respect to the trajectory length T. Thus, gradient estimation in policy-prior search is expected to be more reliable than that in direct policy search when the trajectory length T is large.

The following theorem more explicitly compares the variance of gradient estimators in direct policy search and policy-prior search:

Theorem 9.2 In addition to Assumptions (B) and (C), assume that

\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi)

is positive and monotone increasing with respect to T, where

C_T = \sum_{t=1}^{T} c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^{T} d_t^2.

If there exists T_0 such that

\zeta(T_0) \ge \beta^2 G \sigma^2,

then it holds that

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\mu} J(\theta) \right] > \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J(\eta,\tau) \right]

for all T > T_0, with probability at least 1 − δ.

The above theorem means that policy-prior search is more favorable than direct policy search in terms of the variance of gradient estimators of the mean if the trajectory length T is large.

Next, the contribution of the optimal baseline to the variance of the gradient estimator with respect to the mean parameter η is investigated. It was shown in Zhao et al. (2012) that the excess variance for a baseline ξ is given by

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\rho} J_{\xi}(\rho) \right] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\rho} J_{\xi^*}(\rho) \right]


= \frac{(\xi - \xi^*)^2}{N}\, \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\!\left[ \| \nabla_{\rho} \log p(\theta|\rho) \|^2 \right].

Based on this expression, the following theorem holds.

Theorem 9.3 If r(s, a, s′) ≥ α > 0, the following lower bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J(\eta,\tau) \right] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J_{\xi^*}(\eta,\tau) \right] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

Under Assumption (A), the following upper bound holds:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J(\eta,\tau) \right] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J_{\xi^*}(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.

The above theorem shows that the lower bound on the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by subtracting the optimal baseline and that the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of the gradient estimator with the optimal baseline is investigated:

Theorem 9.4 Under Assumptions (B) and (C), the following upper bound holds with probability at least 1 − δ:

\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\!\left[ \widehat{\nabla}_{\eta} J_{\xi^*}(\eta,\tau) \right] \le \frac{(1-\gamma^T)^2}{N(1-\gamma)^2} (\beta^2 - \alpha^2) G \le \frac{(\beta^2 - \alpha^2) G}{N(1-\gamma)^2}.

The second upper bound is independent of the trajectory length T, while Theorem 7.4 in Section 7.2.3 showed that the upper bound of the variance of gradient estimators with the optimal baseline in direct policy search is monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, policy-prior search is more favorable than direct policy search in terms of the variance of the gradient estimator with respect to the mean, even when optimal baseline subtraction is applied.

9.2.4 Numerical Examples

Here, the performance of the direct policy search and policy-prior search algorithms is experimentally compared.

9.2.4.1 Setup

Let the state space S be one-dimensional and continuous, and let the initial state be randomly chosen following the standard normal distribution. The action space A is also set to be one-dimensional and continuous. The transition dynamics of the environment is set at

s_{t+1} = s_t + a_t + \varepsilon,


TABLE 9.1: Variance and bias of estimated parameters.

(a) Trajectory length T = 10

Method        | Variance (µ, η) | Variance (σ, τ) | Bias (µ, η) | Bias (σ, τ)
REINFORCE     | 13.257          | 26.917          | -0.310      | -1.510
REINFORCE-OB  | 0.091           | 0.120           | 0.067       | 0.129
PGPE          | 0.971           | 1.686           | -0.069      | 0.132
PGPE-OB       | 0.037           | 0.069           | -0.016      | 0.051

(b) Trajectory length T = 50

Method        | Variance (µ, η) | Variance (σ, τ) | Bias (µ, η) | Bias (σ, τ)
REINFORCE     | 188.386         | 278.310         | -1.813      | -5.175
REINFORCE-OB  | 0.545           | 0.900           | -0.299      | -0.201
PGPE          | 1.657           | 3.372           | -0.105      | -0.329
PGPE-OB       | 0.085           | 0.182           | 0.048       | -0.078

where ε ∼ N(0, 0.5²) is stochastic noise and N(µ, σ²) denotes the normal distribution with mean µ and variance σ². The immediate reward is defined as

r = \exp\!\left( -s^2/2 - a^2/2 \right) + 1,

which is bounded as 1 < r ≤ 2. The length of the trajectory is set at T = 10 or 50, the discount factor is set at γ = 0.9, and the number of episodic samples is set at N = 100.
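This one-dimensional benchmark is simple enough to reproduce in a few lines. The following Python sketch implements the transition dynamics and reward exactly as defined above, wrapped in a small environment class whose `reset`/`step` interface is an assumption made for convenience; the reward is computed from the current state and action, which is one reading of the definition above.

```python
import numpy as np

class LinearBenchmark:
    """1-D benchmark: s_{t+1} = s_t + a_t + eps with eps ~ N(0, 0.5^2), and
    reward r = exp(-s^2/2 - a^2/2) + 1, so that 1 < r <= 2."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.rng.standard_normal()   # initial state ~ N(0, 1)
        return self.s

    def step(self, a):
        r = np.exp(-self.s**2 / 2 - a**2 / 2) + 1.0
        self.s = self.s + a + self.rng.normal(0.0, 0.5)
        return self.s, r
```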

9.2.4.2 Variance and Bias

First, the variance and the bias of the gradient estimators of the following methods are investigated:

• REINFORCE: REINFORCE (gradient-based direct policy search) without a baseline (Williams, 1992).

• REINFORCE-OB: REINFORCE with optimal baseline subtraction (Peters & Schaal, 2006).

• PGPE: PGPE (gradient-based policy-prior search) without a baseline (Sehnke et al., 2010).

• PGPE-OB: PGPE with optimal baseline subtraction (Zhao et al., 2012).

Table 9.1 summarizes the variance of gradient estimators over 100 runs, showing that the variance of REINFORCE is overall larger than that of PGPE. A notable difference between REINFORCE and PGPE is that the variance of REINFORCE significantly grows as the trajectory length T increases, whereas


that of PGPE is not influenced that much by T. This agrees well with the theoretical analyses given in Section 7.2.3 and Section 9.2.3. Optimal baseline subtraction (REINFORCE-OB and PGPE-OB) is shown to contribute highly to reducing the variance, especially when the trajectory length T is large, which also agrees well with the theoretical analysis.

The bias of the gradient estimator of each method is also investigated. Here, gradients estimated with N = 1000 are regarded as the true gradients, and the bias of the gradient estimators is computed. The results are also included in Table 9.1, showing that the introduction of baselines does not increase the bias; rather, it tends to reduce the bias.

9.2.4.3 Variance and Policy Hyper-Parameter Change through the Entire Policy-Update Process

Next, the variance of gradient estimators is investigated when the policy hyper-parameters are updated over iterations. If the deviation parameter σ takes a negative value during the policy-update process, it is set at 0.05. In this experiment, the variance is computed from 50 runs for T = 20 and N = 10, and policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, the above experiments are repeated 20 times with a random choice of the initial mean parameter µ from [−3.0, −0.1], and the average variance of gradient estimators with respect to the mean parameter µ is investigated over the 20 trials. The results are plotted in Figure 9.2. Figure 9.2(a) compares the variance of REINFORCE with/without baselines, whereas Figure 9.2(b) compares the variance of PGPE with/without baselines. These graphs show that the introduction of baselines contributes highly to the reduction of the variance over iterations.

Let us illustrate how parameters are updated by PGPE-OB over 50 iterations for N = 10 and T = 10. The initial mean parameter is set at η = −1.6, −0.8, or −0.1, and the initial deviation parameter is set at τ = 1. Figure 9.3 depicts the contour of the expected return and illustrates trajectories of parameter updates over iterations by PGPE-OB. In the graph, the maximum of the return surface is located at the middle bottom, and PGPE-OB leads the solutions to a maximum point rapidly.

9.2.4.4 Performance of Learned Policies

Finally, the return obtained by each method is evaluated. The trajectory length is fixed at T = 20, and the maximum number of policy-update iterations is set at 50. Average returns over 20 runs are investigated as functions of the number of episodic samples N. Figure 9.4(a) shows the results when the initial mean parameter µ is chosen randomly from [−1.6, −0.1], which tends to perform well. The graph shows that PGPE-OB performs the best, especially when N < 5; REINFORCE-OB follows with a small margin.


FIGURE 9.2: Mean and standard error of the variance of gradient estimators with respect to the mean parameter through policy-update iterations (variance shown on a log10 scale). (a) REINFORCE and REINFORCE-OB. (b) PGPE and PGPE-OB.

FIGURE 9.3: Trajectories of policy-prior parameter updates by PGPE (axes: policy-prior mean η and policy-prior standard deviation τ; contours show the expected return).


FIGURE 9.4: Average and standard error of returns over 20 runs as functions of the number of episodic samples N, for REINFORCE, REINFORCE-OB, PGPE, and PGPE-OB. (a) Good initial policy. (b) Poor initial policy.

The plain PGPE also works reasonably well, although it is slightly unstable due to its larger variance. The plain REINFORCE is highly unstable, which is caused by the huge variance of its gradient estimators (see Figure 9.2 again). Figure 9.4(b) describes the results when the initial mean parameter µ is chosen randomly from [−3.0, −0.1], which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than in the case with good initial policies, meaning that REINFORCE is sensitive to the choice of initial policies. Overall, the PGPE methods tend to outperform the REINFORCE methods, and among the PGPE methods, PGPE-OB works very well and converges quickly.


9.3 Sample Reuse in Policy-Prior Search

Although PGPE was shown to outperform REINFORCE, its behavior is still rather unstable if the number of data samples used for estimating the gradient is small. In this section, the sample-reuse idea is applied to PGPE. Technically, the original PGPE is categorized as an on-policy algorithm, where data drawn from the current target policy is used to estimate policy-prior gradients. On the other hand, off-policy algorithms are more flexible in the sense that a data-collecting policy and the current target policy can be different. Here, PGPE is extended to the off-policy scenario using the importance-weighting technique.

9.3.1 Importance Weighting

Let us consider an off-policy scenario where a data-collecting policy and the current target policy are different in general. In the context of PGPE, two hyper-parameters are considered: ρ as the target policy to learn and ρ′ as the policy for data collection. Let us denote the data samples collected with hyper-parameter ρ′ by H′:

H' = \{ (\theta'_n, h'_n) \}_{n=1}^{N'} \overset{\mathrm{i.i.d.}}{\sim} p(h|\theta)\, p(\theta|\rho').

If the data H′ is naively used to estimate the policy-prior gradients by Eq. (9.1), we suffer from an inconsistency problem:

\frac{1}{N'} \sum_{n=1}^{N'} \nabla_{\rho} \log p(\theta'_n|\rho)\, R(h'_n) \;\overset{N'\to\infty}{\nrightarrow}\; \nabla_{\rho} J(\rho),

where

\nabla_{\rho} J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_{\rho} \log p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta

is the gradient of the expected return,

J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta,

with respect to the policy hyper-parameter ρ. Below, this naive method is referred to as non-importance-weighted PGPE (NIW-PGPE).

This inconsistency problem can be systematically resolved by importance weighting:

\widehat{\nabla}_{\rho} J_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n)\, \nabla_{\rho} \log p(\theta'_n|\rho)\, R(h'_n) \;\overset{N'\to\infty}{\longrightarrow}\; \nabla_{\rho} J(\rho),


where w(θ) = p(θ|ρ)/p(θ|ρ′) is the importance weight. This extended method is called importance-weighted PGPE (IW-PGPE).

Below, the variance of gradient estimators in IW-PGPE is theoretically analyzed. See Zhao et al. (2013) for technical details. As described in Section 9.2.1, the deterministic linear policy model is used here:

\pi(a|s,\theta) = \delta(a = \theta^\top \phi(s)),   (9.3)

where δ(·) is the Dirac delta function and φ(s) is the B-dimensional basis function. The policy parameter θ = (θ_1, ..., θ_B)⊤ is drawn from the independent Gaussian prior, where the policy hyper-parameter ρ consists of the prior means η = (η_1, ..., η_B)⊤ and the prior standard deviations τ = (τ_1, ..., τ_B)⊤:

p(\theta|\eta,\tau) = \prod_{b=1}^{B} \frac{1}{\sqrt{2\pi}\,\tau_b} \exp\!\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right).   (9.4)
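The off-policy correction only changes the per-sample weight in the PGPE gradient of Section 9.2.1, as the following Python sketch shows; the weight is the ratio of Gaussian prior densities of the sampled parameter under the target and data-collecting hyper-parameters, and the helper names are illustrative.

```python
import numpy as np

def gaussian_prior_logpdf(theta, eta, tau):
    """log p(theta | eta, tau) for the independent Gaussian prior (9.4)."""
    return np.sum(-0.5 * ((theta - eta) / tau)**2
                  - np.log(np.sqrt(2.0 * np.pi) * tau), axis=-1)

def iw_pgpe_gradient(thetas, returns, eta, tau, eta_data, tau_data, baseline=0.0):
    """Importance-weighted PGPE gradient: samples were drawn under
    (eta_data, tau_data) but the gradient targets (eta, tau)."""
    thetas, returns = np.asarray(thetas), np.asarray(returns)
    w = np.exp(gaussian_prior_logpdf(thetas, eta, tau)
               - gaussian_prior_logpdf(thetas, eta_data, tau_data))   # w(theta)
    diff = thetas - eta
    grad_log_eta = diff / tau**2
    grad_log_tau = (diff**2 - tau**2) / tau**3
    weights = (w * (returns - baseline))[:, None]
    return ((weights * grad_log_eta).mean(axis=0),
            (weights * grad_log_tau).mean(axis=0))
```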

Let

G = \sum_{b=1}^{B} \tau_b^{-2},

and let Var_{p(h′|θ′)p(θ′|ρ′)} denote the trace of the covariance matrix:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] = \mathrm{tr}\, \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ (\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta]) (\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta])^\top \right] = \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \| \zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] \|^2 \right],

where E_{p(h′|θ′)p(θ′|ρ′)} denotes the expectation over the trajectory h′ and the policy parameter θ′ drawn from p(h′|θ′)p(θ′|ρ′). Then the following theorem holds:

Theorem 9.5 Assume that for all s, a, and s′, there exists β > 0 such that r(s, a, s′) ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max},

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max}.

It is interesting to note that these upper bounds are the same as the ones for the plain PGPE (Theorem 9.1 in Section 9.2.3) except for the factor w_max. When w_max = 1, the bounds are reduced to those of the plain PGPE method. However, if the sampling distribution is significantly different from the target distribution, w_max can take a large value and thus IW-PGPE can produce a gradient estimator with large variance. Therefore, IW-PGPE may not be a reliable approach as it is. Below, a variance reduction technique for IW-PGPE is introduced, which leads to a practically useful algorithm.


9.3.2 Variance Reduction by Baseline Subtraction

Here, a baseline is introduced for IW-PGPE to reduce the variance of gradient estimators, in the same way as for the plain PGPE explained in Section 9.2.2.

A policy-prior gradient estimator with a baseline ξ ∈ ℝ is defined as

\widehat{\nabla}_{\rho} J^{\xi}_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - \xi)\, w(\theta'_n)\, \nabla_{\rho} \log p(\theta'_n|\rho).

Here, the baseline ξ is determined so that the variance is minimized. Let ξ* be the optimal baseline for IW-PGPE that minimizes the variance:

\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\rho} J^{\xi}_{\mathrm{IW}}(\rho) \right].

Then the optimal baseline for IW-PGPE is given as follows (Zhao et al., 2013):

\xi^* = \frac{ \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ R(h')\, w^2(\theta')\, \| \nabla_{\rho} \log p(\theta'|\rho) \|^2 \right] }{ \mathbb{E}_{p(\theta'|\rho')}\!\left[ w^2(\theta')\, \| \nabla_{\rho} \log p(\theta'|\rho) \|^2 \right] },

where E_{p(θ′|ρ′)} denotes the expectation over the policy parameter θ′ drawn from p(θ′|ρ′). In practice, the expectations are approximated by the sample averages. The excess variance for a baseline ξ is given as

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\rho} J^{\xi}_{\mathrm{IW}}(\rho) \right] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\rho} J^{\xi^*}_{\mathrm{IW}}(\rho) \right] = \frac{(\xi - \xi^*)^2}{N'}\, \mathbb{E}_{p(\theta'|\rho')}\!\left[ w^2(\theta')\, \| \nabla_{\rho} \log p(\theta'|\rho) \|^2 \right].

Next, the contribution of the optimal baseline to variance reduction in IW-PGPE is analyzed for the deterministic linear policy model (9.3) and the independent Gaussian prior (9.4). See Zhao et al. (2013) for technical details.

Theorem 9.6 Assume that for all s, a, and s′, there exists α > 0 such that r(s, a, s′) ≥ α, and, for all θ, there exists w_min > 0 such that w(θ) ≥ w_min. Then, the following lower bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J_{\mathrm{IW}}(\eta,\tau) \right] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\min},

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J_{\mathrm{IW}}(\eta,\tau) \right] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \ge \frac{2\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\min}.

Assume that for all s, a, and s′, there exists β > 0 such that r(s, a, s′) ∈ [−β, β], and, for all θ, there exists 0 < w_max < ∞ such that 0 < w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J_{\mathrm{IW}}(\eta,\tau) \right] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max},

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J_{\mathrm{IW}}(\eta,\tau) \right] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max}.

This theorem shows that the bounds on the variance reduction in IW-PGPE brought by the optimal baseline depend on the bounds of the importance weight, w_min and w_max: the larger the upper bound w_max is, the more optimal baseline subtraction can reduce the variance.

From Theorem 9.5 and Theorem 9.6, the following corollary can be immediately obtained:

Corollary 9.7 Assume that for all s, a, and s′, there exists 0 < α < β such that r(s, a, s′) ∈ [α, β], and, for all θ, there exist 0 < w_min < w_max < ∞ such that w_min ≤ w(θ) ≤ w_max. Then, the following upper bounds hold:

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\eta} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{(1-\gamma^T)^2 G}{N'(1-\gamma)^2} \left( \beta^2 w_{\max} - \alpha^2 w_{\min} \right),

\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\!\left[ \widehat{\nabla}_{\tau} J^{\xi^*}_{\mathrm{IW}}(\eta,\tau) \right] \le \frac{2(1-\gamma^T)^2 G}{N'(1-\gamma)^2} \left( \beta^2 w_{\max} - \alpha^2 w_{\min} \right).

From Theorem 9.5 and this corollary, we can confirm that the upper bounds for the baseline-subtracted IW-PGPE are smaller than those for the plain IW-PGPE without baseline subtraction, because α²w_min > 0. In particular, if w_min is large, the upper bounds for the baseline-subtracted IW-PGPE can be much smaller than those for the plain IW-PGPE without baseline subtraction.

9.3.3 Numerical Examples

Here, we consider the task of controlling the humanoid robot CB-i (Cheng et al., 2007) shown in Figure 9.5(a). The goal is to lead the end effector of the right arm (the right hand) to a target object. First, its simulated upper-body model, illustrated in Figure 9.5(b), is used to investigate the performance of the IW-PGPE-OB method. Then the IW-PGPE-OB method is applied to the real robot.

9.3.3.1 Setup

The performance of the following four methods is compared:


FIGURE 9.5: Humanoid robot CB-i and its upper-body model: (a) CB-i; (b) simulated upper-body model. The humanoid robot CB-i was developed by the JST-ICORP Computational Brain Project and ATR Computational Neuroscience Labs (Cheng et al., 2007).

• IW-REINFORCE-OB: Importance-weighted REINFORCE with the optimal baseline.

• NIW-PGPE-OB: Data-reuse PGPE-OB without importance weighting.

• PGPE-OB: Plain PGPE-OB without data reuse.

• IW-PGPE-OB: Importance-weighted PGPE with the optimal baseline.

The upper body of CB-i has 9 degrees of freedom: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch (Figure 9.5(b)). At each time step, the controller receives states from the system and sends out actions. The state space is 18-dimensional, which corresponds to the current angle and angular velocity of each joint. The action space is 9-dimensional, which corresponds to the target angle of each joint. Both states and actions are continuous.

Given the state and action at each time step, the physical control system calculates the torque at each joint by using a proportional-derivative (PD) controller as
\[
\tau_i = K^i_p (a_i - s_i) - K^i_d \dot{s}_i,
\]
where s_i, ṡ_i, and a_i denote the current angle, the current angular velocity, and the target angle of the i-th joint, respectively. K^i_p and K^i_d denote the position and velocity gains for the i-th joint, respectively. These parameters are set at K^i_p = 200 and K^i_d = 10 for the elbow pitch joints, and K^i_p = 2000 and K^i_d = 100 for the other joints.
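The PD rule above is simple enough to state as a short sketch. The joint indices of the elbow pitch joints and the vector layout below are assumptions made for illustration only.

```python
import numpy as np

# Illustrative gains following the text; the elbow pitch joint indices
# are hypothetical placeholders for this sketch.
KP = np.full(9, 2000.0)
KD = np.full(9, 100.0)
ELBOW_PITCH = [2, 5]
KP[ELBOW_PITCH] = 200.0
KD[ELBOW_PITCH] = 10.0

def pd_torque(angle, angular_velocity, target_angle):
    """PD control: tau_i = Kp_i * (a_i - s_i) - Kd_i * sdot_i."""
    return KP * (target_angle - angle) - KD * angular_velocity
```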

The initial position of the robot is fixed at the standing-up-straight pose with the arms down. The immediate reward r_t at time step t is defined as
\[
r_t = \exp(-10 d_t) - 0.0005 \min(c_t, 10{,}000),
\]
where d_t is the distance between the right hand of the robot and the target object, and c_t is the sum of the control costs over the joints. The linear deterministic policy is used for the PGPE methods, and the Gaussian policy is used for IW-REINFORCE-OB. In both cases, the linear basis function φ(s) = s is used. For PGPE, the initial prior mean η is randomly chosen from the standard normal distribution, and the initial prior standard deviation τ is set at 1.

To evaluate the usefulness of data reuse methods with a small number of samples, the agent collects only N = 3 on-policy samples with trajectory length T = 100 at each iteration. All previous data samples are reused to estimate the gradients in the data reuse methods, while only on-policy samples are used to estimate the gradients in the plain PGPE-OB method. The discount factor is set at γ = 0.9.

9.3.3.2 Simulation with 2 Degrees of Freedom

First, the performance on the reaching task with only 2 degrees of freedom is investigated. The body of the robot is fixed and only the right shoulder pitch and right elbow pitch are used. Figure 9.6 depicts the averaged expected return over 10 trials as a function of the number of iterations. The expected return at each trial is computed from 50 newly drawn test episodic data that are not used for policy learning. The graph shows that IW-PGPE-OB nicely improves the performance over iterations with only a small number of on-policy samples. The plain PGPE-OB method can also improve the performance over iterations, but slowly. NIW-PGPE-OB is not as good as IW-PGPE-OB, especially at the later iterations, because of the inconsistency of the NIW estimator.

The distance from the right hand to the object and the control costs along the trajectory are also investigated for three policies: the initial policy, the policy obtained at the 20th iteration by IW-PGPE-OB, and the policy obtained at the 50th iteration by IW-PGPE-OB. Figure 9.7(a) plots the distance to the target object as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the distance rapidly compared with

FIGURE 9.6: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with 2 degrees of freedom (right shoulder pitch and right elbow pitch).

FIGURE 9.7: Distance (a) and control costs (b) of arm reaching with 2 degrees of freedom using the policy learned by IW-PGPE-OB.

FIGURE 9.8: Typical example of arm reaching with 2 degrees of freedom using the policy obtained by IW-PGPE-OB at the 50th iteration (from left to right and top to bottom).

the initial policy and the policy obtained at the 20th iteration, which means that the robot can reach the object quickly by using the learned policy.

Figure 9.7(b) plots the control cost as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the control cost steadily until the reaching task is completed. This is because the robot mainly adjusts the shoulder pitch in the beginning, which consumes a larger amount of energy than the energy required for controlling the elbow pitch. Then, once the right hand gets closer to the target object, the robot starts adjusting the elbow pitch to reach the target object. The policy obtained at the 20th iteration actually consumes less control cost, but it cannot lead the arm to the target object.

Figure 9.8 illustrates a typical solution of the reaching task with 2 degrees of freedom by the policy obtained by IW-PGPE-OB at the 50th iteration. The images show that the right hand is successfully led to the target object within only 10 time steps.

9.3.3.3 Simulation with All 9 Degrees of Freedom

Next, the same experiment is carried out using all 9 degrees of freedom. The position of the target object is more distant from the robot, so that it cannot be reached by only using the right arm.

FIGURE 9.9: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with all 9 degrees of freedom.

Because all 9 joints are used, the dimensionality of the state space is much increased, and the values of the importance weights tend to grow exponentially. In order to mitigate the large values of importance weights, we decided not to reuse all previously collected samples, but only samples collected in the last 5 iterations. This allows us to keep the difference between the sampling distribution and the target distribution reasonably small, and thus the values of importance weights can be suppressed to some extent. Furthermore, following Wawrzynski (2009), we consider a version of IW-PGPE-OB, denoted as "truncated IW-PGPE-OB" below, where the importance weight is truncated as w = min(w, 2).

The results plotted in Figure 9.9 show that the performance of the truncated IW-PGPE-OB is the best. This implies that the truncation of importance weights is helpful when applying IW-PGPE-OB to high-dimensional problems.

Figure 9.10 illustrates a typical solution of the reaching task with all 9 degrees of freedom by the policy obtained by the truncated IW-PGPE-OB at the 400th iteration. The images show that the policy learned by our proposed method successfully leads the right hand to the target object, and the irrelevant joints are kept at their initial positions to reduce the control costs.

9.3.3.4 Real Robot Control

Finally, the IW-PGPE-OB method is applied to the real CB-i robot shown in Figure 9.11 (Sugimoto et al., 2014).

The experimental setting is essentially the same as in the above simulation studies with 9 joints, but policies are updated only every 5 trials and samples taken from the last 10 trials are reused for stabilization purposes. Figure 9.12

FIGURE 9.10: Typical example of arm reaching with all 9 degrees of freedom using the policy obtained by the truncated IW-PGPE-OB at the 400th iteration (from left to right and top to bottom).

FIGURE 9.11: Reaching task by the real CB-i robot (Sugimoto et al., 2014).

plots the cumulative rewards obtained over policy update iterations, showing that the rewards steadily increase over iterations. Figure 9.13 exhibits the acquired reaching motion based on the policy obtained at the 120th iteration, showing that the end effector of the robot can successfully reach the target object.

FIGURE 9.12: Cumulative reward over policy update iterations.

9.4 Remarks

When the trajectory length is large, direct policy search tends to produce gradient estimators with large variance, due to the randomness of stochastic policies. Policy-prior search can avoid this problem by using deterministic policies and introducing stochasticity by considering a prior distribution over policy parameters. Both theoretically and experimentally, advantages of policy-prior search over direct policy search were shown.

A sample reuse framework for policy-prior search was also introduced, which is highly useful in real-world reinforcement learning problems with high sampling costs. Following the same line as the sample reuse methods for policy iteration described in Chapter 4 and direct policy search introduced in Chapter 8, importance weighting plays an essential role in sample-reuse policy-prior search. When the dimensionality of the state-action space is high, however, importance weights tend to take extremely large values, which causes instability of the importance weighting methods. To mitigate this problem, truncation of the importance weights is useful in practice.

FIGURE 9.13: Typical example of arm reaching using the policy obtained by the IW-PGPE-OB method (from left to right and top to bottom).

Part IV

Model-Based Reinforcement Learning

The reinforcement learning methods explained in Part II and Part III are categorized into the model-free approach, meaning that policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent). On the other hand, in Part IV, we introduce an alternative approach called the model-based approach, which explicitly models the environment in advance and uses the learned environment model for policy learning.

In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging.

In Chapter 10, we introduce a non-parametric model estimator that possesses the optimal convergence rate with high computational efficiency, and demonstrate its usefulness through experiments. Then, in Chapter 11, we combine dimensionality reduction with model estimation to cope with the high dimensionality of state and action spaces.


Chapter 10

Transition Model Estimation

In this chapter, we introduce transition probability estimation methods for model-based reinforcement learning (Wang & Dietterich, 2003; Deisenroth & Rasmussen, 2011). Among the methods described in Section 10.1, a non-parametric transition model estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) is shown to be the most promising approach (Tangkaratt et al., 2014a). Then in Section 10.2, we describe how the transition model estimator can be utilized in model-based reinforcement learning. In Section 10.3, the experimental performance of a model-based policy-prior search method is evaluated. Finally, in Section 10.4, this chapter is concluded.

10.1 Conditional Density Estimation

In this section, the problem of approximating the transition probability p(s′|s,a) from independent transition samples \{(s_m, a_m, s'_m)\}_{m=1}^{M} is addressed.

10.1.1 Regression-Based Approach

In the regression-based approach, the problem of transition probability estimation is formulated as a function approximation problem of predicting output s′ given input s and a under Gaussian noise:
\[
s' = f(s, a) + \epsilon,
\]
where f is an unknown regression function to be learned, ε is an independent Gaussian noise vector with mean zero and covariance matrix σ²I, and I denotes the identity matrix.

Let us approximate f by the following linear-in-parameter model:
\[
f(s, a, \Gamma) = \Gamma^\top \phi(s, a),
\]
where Γ is the B × dim(s) parameter matrix and φ(s,a) is the B-dimensional


basis vector. A typical choice of the basis vector is the Gaussian kernel, which is defined for B = M as
\[
\phi_b(s, a) = \exp\!\left( -\frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2} \right),
\]
and κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s and a may be used if necessary.

The parameter matrix Γ is learned so that the regularized squared error is minimized:
\[
\widehat{\Gamma} = \mathop{\mathrm{argmin}}_{\Gamma}
\left[ \sum_{m=1}^{M} \bigl\| f(s_m, a_m, \Gamma) - s'_m \bigr\|^2
+ \mathrm{tr}\bigl(\Gamma^\top R \Gamma\bigr) \right],
\]
where R is the B × B positive semi-definite matrix called the regularization matrix. The solution Γ̂ is given analytically as
\[
\widehat{\Gamma} = (\Phi^\top \Phi + R)^{-1} \Phi^\top (s'_1, \ldots, s'_M)^\top,
\]
where Φ is the M × B design matrix defined as
\[
\Phi_{m,b} = \phi_b(s_m, a_m).
\]
We can confirm that the predicted output vector ŝ′ = f(s, a, Γ̂) actually follows the Gaussian distribution with mean
\[
(s'_1, \ldots, s'_M)\, \Phi\, (\Phi^\top \Phi + R)^{-1} \phi(s, a)
\]
and covariance matrix δ̂²I, where
\[
\widehat{\delta}^2 = \sigma^2\, \mathrm{tr}\bigl( (\Phi^\top \Phi + R)^{-2} \Phi^\top \Phi \bigr).
\]
The tuning parameters such as the Gaussian kernel width κ and the regularization matrix R can be determined either by cross-validation or evidence maximization if the above method is regarded as Gaussian process regression in the Bayesian framework (Rasmussen & Williams, 2006).

This is the regression-based estimator of the transition probability density p(s′|s,a) for an arbitrary test input s and a. Thus, by the use of kernel regression models, the regression function f (which is the conditional mean of the output s′) is approximated in a non-parametric way. However, the conditional distribution of the output itself is restricted to be Gaussian, which is highly restrictive in real-world reinforcement learning.
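The following is a minimal sketch of this regression-based estimator, assuming an isotropic regularization matrix R = reg·I and Gaussian centers supplied by the caller; the function names and default numerical values are illustrative assumptions.

```python
import numpy as np

def fit_gaussian_transition_model(S, A, S_next, centers, kappa=1.0, reg=1e-3):
    """Kernel ridge regression of s' on (s, a):
    Gamma_hat = (Phi^T Phi + R)^{-1} Phi^T S', with R = reg * I (assumed here)."""
    X = np.hstack([S, A])                       # (M, dim(s)+dim(a)) inputs
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq_dist / (2 * kappa ** 2))   # (M, B) design matrix
    B = Phi.shape[1]
    Gamma = np.linalg.solve(Phi.T @ Phi + reg * np.eye(B), Phi.T @ S_next)
    return Gamma

def predict_mean(s, a, centers, Gamma, kappa=1.0):
    """Conditional mean of s' at a test input (s, a)."""
    x = np.concatenate([s, a])
    phi = np.exp(-((centers - x) ** 2).sum(axis=1) / (2 * kappa ** 2))
    return Gamma.T @ phi
```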

10.1.2 ǫ-Neighbor Kernel Density Estimation

When the conditioning variables (s, a) are discrete, the conditional density p(s′|s,a) can be easily estimated by standard density estimators such as kernel density estimation (KDE), by only using samples s′_i such that (s_i, a_i) agrees with the target values (s, a). ǫ-neighbor KDE (ǫKDE) extends this idea to the continuous case by using samples such that (s_i, a_i) are close to the target values (s, a).

More specifically, ǫKDE with the Gaussian kernel is given by
\[
\widehat{p}(s'|s,a) = \frac{1}{|\mathcal{I}_{(s,a),\epsilon}|}
\sum_{i \in \mathcal{I}_{(s,a),\epsilon}} N(s'; s'_i, \sigma^2 I),
\]
where I_{(s,a),ǫ} is the set of sample indices such that ‖(s,a) − (s_i,a_i)‖ ≤ ǫ, and N(s′; s′_i, σ²I) denotes the Gaussian density with mean s′_i and covariance matrix σ²I. The Gaussian width σ and the distance threshold ǫ may be chosen by cross-validation.

ǫKDE is a useful non-parametric density estimator that is easy to implement. However, it is unreliable in high-dimensional problems due to its distance-based construction.
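Because the estimator only averages Gaussian bumps over nearby samples, it fits in a few lines. The interface below (returning the estimate as a callable) is an illustrative choice, not prescribed by the text.

```python
import numpy as np

def epsilon_kde(s, a, S, A, S_next, eps=0.5, sigma=0.1):
    """epsilon-neighbor KDE estimate of p(s'|s,a), returned as a function of s'."""
    query = np.concatenate([s, a])
    X = np.hstack([S, A])
    idx = np.where(np.linalg.norm(X - query, axis=1) <= eps)[0]
    d = S_next.shape[1]
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)   # isotropic Gaussian normalizer

    def density(s_next):
        if len(idx) == 0:
            return 0.0                           # no neighbors within eps
        diffs = S_next[idx] - s_next
        return np.mean(np.exp(-np.sum(diffs ** 2, axis=1) / (2 * sigma ** 2)) / norm)

    return density
```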

10.1.3 Least-Squares Conditional Density Estimation

A non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) possesses various useful properties:

• It can directly handle multi-dimensional multi-modal inputs and outputs.

• It was proved to achieve the optimal convergence rate (Kanamori et al., 2012).

• It has high numerical stability (Kanamori et al., 2013).

• It is robust against outliers (Sugiyama et al., 2010).

• Its solution can be analytically and efficiently computed just by solving a system of linear equations (Kanamori et al., 2009).

• Generating samples from the learned transition model is straightforward.

Let us model the transition probability p(s′|s,a) by the following linear-in-parameter model:
\[
\alpha^\top \phi(s, a, s'),
\tag{10.1}
\]
where α is the B-dimensional parameter vector and φ(s, a, s′) is the B-dimensional basis function vector. A typical choice of the basis function is the Gaussian kernel, which is defined for B = M as
\[
\phi_b(s, a, s') = \exp\!\left( -\frac{\|s - s_b\|^2 + (a - a_b)^2 + \|s' - s'_b\|^2}{2\kappa^2} \right),
\]
where κ > 0 denotes the Gaussian kernel width. If B is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for s, a, and s′ may be used if necessary.

The parameter α is learned so that the following squared error is minimized:
\[
\begin{aligned}
J_0(\alpha) &= \frac{1}{2} \iiint \bigl( \alpha^\top \phi(s,a,s') - p(s'|s,a) \bigr)^2 p(s,a)\, ds\, da\, ds' \\
&= \frac{1}{2} \iiint \bigl( \alpha^\top \phi(s,a,s') \bigr)^2 p(s,a)\, ds\, da\, ds'
- \iiint \alpha^\top \phi(s,a,s')\, p(s,a,s')\, ds\, da\, ds' + C,
\end{aligned}
\]
where the identity p(s′|s,a) = p(s,a,s′)/p(s,a) is used in the second term and
\[
C = \frac{1}{2} \iiint p(s'|s,a)\, p(s,a,s')\, ds\, da\, ds'.
\]
Because C is a constant independent of α, only the first two terms will be considered from here on:
\[
J(\alpha) = J_0(\alpha) - C = \frac{1}{2} \alpha^\top U \alpha - \alpha^\top v,
\]
where U is the B × B matrix and v is the B-dimensional vector defined as
\[
U = \iint \Phi(s,a)\, p(s,a)\, ds\, da, \qquad
v = \iiint \phi(s,a,s')\, p(s,a,s')\, ds\, da\, ds', \qquad
\Phi(s,a) = \int \phi(s,a,s')\, \phi(s,a,s')^\top ds'.
\]
Note that, for the Gaussian model (10.1), the (b, b′)-th element of the matrix Φ(s,a) can be computed analytically as
\[
\Phi_{b,b'}(s,a) = (\sqrt{\pi}\kappa)^{\dim(s')}
\exp\!\left( -\frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2} \right)
\exp\!\left( -\frac{\|s - s_b\|^2 + \|s - s_{b'}\|^2 + (a - a_b)^2 + (a - a_{b'})^2}{2\kappa^2} \right).
\]
Because U and v included in J(α) contain expectations over the unknown densities p(s,a) and p(s,a,s′), they are approximated by sample averages. Then we have
\[
\widehat{J}(\alpha) = \frac{1}{2} \alpha^\top \widehat{U} \alpha - \widehat{v}^\top \alpha,
\]


where
\[
\widehat{U} = \frac{1}{M}\sum_{m=1}^{M} \Phi(s_m, a_m)
\qquad\text{and}\qquad
\widehat{v} = \frac{1}{M}\sum_{m=1}^{M} \phi(s_m, a_m, s'_m).
\]

By adding an ℓ₂-regularizer to Ĵ(α) to avoid overfitting, the LSCDE optimization criterion is given as
\[
\widetilde{\alpha} = \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^{M}}
\left[ \widehat{J}(\alpha) + \frac{\lambda}{2} \|\alpha\|^2 \right],
\]
where λ ≥ 0 is the regularization parameter. The solution α̃ is given analytically as
\[
\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},
\]
where I denotes the identity matrix. Because conditional probability densities are non-negative by definition, the solution α̃ is modified as
\[
\widehat{\alpha}_b = \max(0, \widetilde{\alpha}_b).
\]
Finally, the solution is normalized in the test phase. More specifically, given a test input point (s, a), the final LSCDE solution is given as
\[
\widehat{p}(s'|s,a) = \frac{\widehat{\alpha}^\top \phi(s,a,s')}{\int \widehat{\alpha}^\top \phi(s,a,s'')\, ds''},
\]
where, for the Gaussian model (10.1), the denominator can be analytically computed as
\[
\int \widehat{\alpha}^\top \phi(s,a,s'')\, ds''
= (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b
\exp\!\left( -\frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2} \right).
\]
Model selection of the Gaussian width κ and the regularization parameter λ is possible by cross-validation (Sugiyama et al., 2010).
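Since Û and v̂ have closed forms under the Gaussian model, the whole fit reduces to building two kernel matrices and solving one linear system. The following is a direct sketch of those formulas with all samples used as centers (B = M); the default κ and λ are placeholders that would in practice be chosen by cross-validation.

```python
import numpy as np

def lscde_fit(S, A, S_next, kappa=1.0, lam=0.1):
    """LSCDE with Gaussian kernels centered on all samples (B = M)."""
    X = np.hstack([S, A])                       # inputs (M, dim(s)+dim(a))
    Y = S_next                                  # outputs (M, dim(s'))
    M, d_out = Y.shape
    Dx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # (M, M)
    Dy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # (M, M)
    Kx = np.exp(-Dx / (2 * kappa ** 2))         # Kx[m, b] = input kernel values
    # U_hat[b,b'] = (sqrt(pi) kappa)^d_out * exp(-||y_b - y_b'||^2 / 4k^2)
    #               * mean_m exp(-(||x_m - x_b||^2 + ||x_m - x_b'||^2) / 2k^2)
    U = (np.sqrt(np.pi) * kappa) ** d_out \
        * np.exp(-Dy / (4 * kappa ** 2)) * (Kx.T @ Kx) / M
    # v_hat[b] = mean_m phi_b(x_m, y_m)
    v = np.mean(Kx * np.exp(-Dy / (2 * kappa ** 2)), axis=0)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return np.maximum(alpha, 0.0)               # enforce non-negativity
```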

10.2 Model-Based Reinforcement Learning

Model-based reinforcement learning is simply carried out as follows (a minimal sketch of this loop is given after the list):

1. Collect transition samples \{(s_m, a_m, s'_m)\}_{m=1}^{M}.

2. Obtain a transition model estimate p̂(s′|s,a) from \{(s_m, a_m, s'_m)\}_{m=1}^{M}.

3. Run a model-free reinforcement learning method using trajectory samples of length T artificially generated from the estimated transition model p̂(s′|s,a) and the current policy π(a|s,θ).
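The sketch below only fixes the control flow of the three steps; every callback (data collection, model fitting, policy update, and the model's sampling interface) is a hypothetical placeholder and not part of the text.

```python
def model_based_rl(collect_samples, fit_model, update_policy,
                   policy, n_iterations=100, n_artificial=1000, T=20):
    """Minimal model-based RL loop following steps 1-3 above (all callbacks assumed)."""
    S, A, S_next = collect_samples()            # step 1: real transition data
    model = fit_model(S, A, S_next)             # step 2: e.g., LSCDE estimate of p(s'|s,a)
    for _ in range(n_iterations):               # step 3: model-free RL on artificial rollouts
        trajectories = [rollout(model, policy, T) for _ in range(n_artificial)]
        policy = update_policy(policy, trajectories)
    return policy

def rollout(model, policy, T):
    """Generate one artificial trajectory from the learned transition model."""
    s = model.sample_initial_state()            # assumed model interface
    traj = []
    for _ in range(T):
        a = policy(s)
        s_next = model.sample_next_state(s, a)  # draw s' ~ p_hat(s'|s,a)
        traj.append((s, a, s_next))
        s = s_next
    return traj
```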

Model-based reinforcement learning is particularly advantageous when the sampling cost is limited. More specifically, in model-free methods, we need to fix the sampling schedule in advance, for example, whether many samples are gathered in the beginning or only a small batch of samples is collected over a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to just blindly design the sampling schedule in practice, which can cause significant performance degradation. On the other hand, model-based methods do not suffer from this problem, because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.

10.3 Numerical Examples

In this section, the experimental performances of the model-free and model-based versions of PGPE (policy gradients with parameter-based exploration) are evaluated:

M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.

M-PGPE(GP): The model-based PGPE method with the transition model estimated by Gaussian process (GP) regression.

IW-PGPE: The model-free PGPE method with sample reuse by importance weighting (the method introduced in Chapter 9).

10.3.1 Continuous Chain Walk

Let us first consider a simple continuous chain walk task, described in Figure 10.1.

10.3.1.1 Setup

Let
\[
s \in \mathcal{S} = [0, 10], \qquad a \in \mathcal{A} = [-5, 5], \qquad
r(s, a, s') = \begin{cases} 1 & (4 < s' < 6), \\ 0 & (\text{otherwise}). \end{cases}
\]
That is, the agent receives positive reward +1 at the center of the state space.

The trajectory length is set at T = 10 and the discount factor is set at γ = 0.99.

FIGURE 10.1: Illustration of continuous chain walk (the reward region lies between 4 and 6 on the state interval [0, 10]).

The following linear-in-parameter policy model is used in both the M-PGPE and IW-PGPE methods:
\[
a = \sum_{i=1}^{6} \theta_i \exp\!\left( -\frac{(s - c_i)^2}{2} \right),
\]
where (c₁, …, c₆) = (0, 2, 4, 6, 8, 10). If an action determined by the above policy is out of the action space, it is pulled back to be confined in the domain.

As transition dynamics, the following two scenarios are considered (a small simulator sketch follows this list):

Gaussian: The true transition dynamics is given by
\[
s_{t+1} = s_t + a_t + \varepsilon_t,
\]
where ε_t is Gaussian noise with mean 0 and standard deviation 0.3.

Bimodal: The true transition dynamics is given by
\[
s_{t+1} = s_t \pm a_t + \varepsilon_t,
\]
where ε_t is Gaussian noise with mean 0 and standard deviation 0.3, and the sign of a_t is randomly chosen with probability 1/2.

If the next state is out of the state space, it is projected back to the domain. Below, the budget for data collection is assumed to be limited to N = 20 trajectory samples.
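The following sketch implements the chain walk dynamics, reward, and policy model exactly as stated above; the random seed and function names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a, bimodal=False):
    """One transition: s' = s + a + eps (Gaussian) or s' = s +/- a + eps (bimodal)."""
    if bimodal and rng.random() < 0.5:
        a = -a
    s_next = s + a + rng.normal(0.0, 0.3)
    return float(np.clip(s_next, 0.0, 10.0))    # project back into S = [0, 10]

def reward(s_next):
    return 1.0 if 4.0 < s_next < 6.0 else 0.0

def policy(s, theta, centers=(0, 2, 4, 6, 8, 10)):
    """Linear-in-parameter policy a = sum_i theta_i exp(-(s - c_i)^2 / 2),
    pulled back into the action space A = [-5, 5]."""
    feats = np.exp(-(s - np.array(centers)) ** 2 / 2.0)
    return float(np.clip(theta @ feats, -5.0, 5.0))
```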

10.3.1.2 Comparison of Model Estimators

When the transition model is learned in the M-PGPE methods, all N = 20 trajectory samples are gathered randomly in the beginning at once. More specifically, the initial state s₁ and the action a₁ are chosen from the uniform distributions over S and A, respectively. Then the next state s₂ and the immediate reward r₁ are obtained. After that, the action a₂ is chosen from the uniform distribution over A, and the next state s₃ and the immediate reward r₂ are obtained. This process is repeated until r_T is obtained, by which a trajectory sample is obtained. This data generation process is repeated N times to obtain N trajectory samples.

Figure 10.2 and Figure 10.3 illustrate the true transition dynamics and

FIGURE 10.2: Gaussian transition dynamics and its estimates by LSCDE and GP: (a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP.

their estimates obtained by LSCDE and GP in the Gaussian and bimodal cases, respectively. Figure 10.2 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 10.3 shows that LSCDE can still successfully capture the entire profile of the true transition dynamics even in the bimodal case, but GP fails to capture the bimodal structure.

Based on the estimated transition models, policies are learned by the M-PGPE method. More specifically, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are used for baseline estimation. Then policies are updated based on these artificial trajectory samples. This policy update step is repeated 100 times. For evaluating the return of a learned policy, 100 additional test trajectory samples are used which are not employed for policy learning. Figure 10.4 and Figure 10.5 depict the averages and standard errors of returns over 100 runs for the Gaussian and bimodal cases, respectively. The results show that, in the Gaussian case, the GP-based method performs very well and LSCDE also exhibits reasonable performance. In the bimodal case, on the other hand, GP performs poorly and LSCDE gives much better results than GP. This illustrates the high flexibility of LSCDE.

FIGURE 10.3: Bimodal transition dynamics and its estimates by LSCDE and GP: (a) true transition; (b) transition estimated by LSCDE; (c) transition estimated by GP.

FIGURE 10.4: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for Gaussian transition.

FIGURE 10.5: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for bimodal transition.

FIGURE 10.6: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for Gaussian transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).

FIGURE 10.7: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for bimodal transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).

10.3.1.3 Comparison of Model-Based and Model-Free Methods

Next, the performances of the model-based and model-free PGPE methods are compared.

Under the fixed budget scenario, the schedule for collecting 20 trajectory samples needs to be determined for the IW-PGPE method. First, the influence of the choice of sampling schedules is illustrated. Figure 10.6 and Figure 10.7 show expected returns averaged over 100 runs under the sampling schedule in which a batch of k trajectory samples is gathered 20/k times, for different values of k. Here, the policy update is performed 100 times after observing each batch of k trajectory samples, because this performed better than the usual scheme of updating the policy only once. Figure 10.6 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering k = 20 trajectory samples at once is shown to be the best choice in the Gaussian case. Figure 10.7 shows that gathering k = 20 trajectory samples at once is also the best choice in the bimodal case.

Although the best sampling schedule is not accessible in practice, the optimal sampling schedule is used for evaluating the performance of IW-PGPE. Figure 10.4 and Figure 10.5 show the averages and standard errors of returns obtained by IW-PGPE over 100 runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all trajectory samples are gathered at once in the beginning. The performance of IW-PGPE may be further improved if it is possible to gather more trajectory samples. However, this is prohibited under the fixed budget scenario. On the other hand, the returns of M-PGPE keep increasing over iterations, because artificial trajectory samples can be generated continually without additional sampling costs. This illustrates a potential advantage of model-based reinforcement learning (RL) methods.

10.3.2 Humanoid Robot Control

Finally, the performance of M-PGPE is evaluated on a practical control problem of a simulated upper-body model of the humanoid robot CB-i (Cheng et al., 2007), which was also used in Section 9.3.3; see Figure 9.5 for illustrations of CB-i and its simulator.

10.3.2.1 Setup

The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch. The state vector is 18-dimensional and real-valued, which corresponds to the current angle in degrees and the current angular velocity for each joint. The action vector is 9-dimensional and real-valued, which corresponds to the target angle of each joint in degrees. The goal of the control problem is to lead the end effector of the right arm (the right hand) to the target object. A noisy control system is simulated by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each element of the action vector, Gaussian noise with mean 0 and standard deviation 3 is added with probability 0.6, and Gaussian noise with mean −5 and standard deviation 3 is added with probability 0.4.

The initial posture of the robot is fixed to be standing up straight with the arms down. The target object is located in front of and above the right hand, which is reachable by using the controllable joints. The reward function at each time step is defined as
\[
r_t = \exp(-10 d_t) - 0.000005 \min(c_t, 1{,}000{,}000),
\]
where d_t is the distance between the right hand and the target object at time step t, and c_t is the sum of the control costs over the joints. The deterministic policy model used in M-PGPE and IW-PGPE is defined as a = θ⊤φ(s) with the basis function φ(s) = s. The trajectory length is set at T = 100 and the discount factor is set at γ = 0.9.
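The bimodal action noise described above can be written as a one-line mixture; the sketch below is an illustrative implementation, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_action(a):
    """Bimodal Gaussian action noise: N(0, 3^2) with probability 0.6,
    N(-5, 3^2) with probability 0.4, applied independently per element."""
    mean = np.where(rng.random(a.shape) < 0.6, 0.0, -5.0)
    return a + rng.normal(mean, 3.0)
```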

10.3.2.2 Experiment with 2 Joints

First, we consider using only 2 joints among the 9 joints; i.e., only the right shoulder pitch and right elbow pitch are allowed to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionalities of the state vector s and the action vector a are 4 and 2, respectively.

We suppose that the budget for data collection is limited to N = 50 trajectory samples. For the M-PGPE methods, all trajectory samples are collected at first using uniformly random initial states and a uniformly random policy. More specifically, the initial state is chosen from the uniform distribution over S. At each time step, the action a_i of the i-th joint is first drawn from the uniform distribution on [s_i − 5, s_i + 5], where s_i denotes the state of the i-th joint. In total, 5000 transition samples are collected for model estimation. Then, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are generated for baseline estimation in each iteration. The sampling schedule of the IW-PGPE method is chosen to collect k = 5 trajectory samples 50/k times, which performs well, as shown in Figure 10.8. The average and standard error of the return obtained by each method over 10 runs are plotted in Figure 10.9, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE.

Figure 10.10 illustrates an example of the reaching motion with 2 joints obtained by M-PGPE(LSCDE) at the 60th iteration. This shows that the learned policy successfully leads the right hand to the target object within only 13 steps in this noisy control system.

10.3.2.3 Experiment with 9 Joints

Finally, the performance of M-PGPE(LSCDE) and IW-PGPE is evaluated on the reaching task with all 9 joints.

The experimental setup is essentially the same as in the 2-joint case, but a budget of N = 1000 trajectory samples is given to this complex and high-dimensional task. The position of the target object is moved to the far left, which is not reachable by using only 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. Five thousand randomly chosen transition samples are used as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE is set at gathering 1000 trajectory samples at once, which is the best sampling schedule according to Figure 10.11. The averages and standard errors of returns obtained by M-PGPE(LSCDE) and IW-PGPE over 30 runs are plotted in Figure 10.12, showing that M-PGPE(LSCDE) tends to outperform IW-PGPE.

Figure 10.13 exhibits a typical reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. This shows that the right hand is led to the distant object successfully within 14 steps.

FIGURE 10.8: Averages and standard errors of returns obtained by IW-PGPE over 10 runs for the 2-joint humanoid robot simulator for different sampling schedules (e.g., 5×10 means gathering k = 5 trajectory samples 10 times).

FIGURE 10.9: Averages and standard errors of obtained returns over 10 runs for the 2-joint humanoid robot simulator. All methods use 50 trajectory samples for policy learning. In M-PGPE(LSCDE) and M-PGPE(GP), all 50 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 5 trajectory samples is gathered for 10 iterations, which was shown to be the best sampling scheduling (see Figure 10.8). Note that the policy update is performed 100 times after observing each batch of trajectory samples, which we confirmed to perform well. The bottom horizontal axis is for the M-PGPE methods, while the top horizontal axis is for the IW-PGPE method.

FIGURE 10.10: Example of arm reaching with 2 joints using a policy obtained by M-PGPE(LSCDE) at the 60th iteration (from left to right and top to bottom).

FIGURE 10.11: Averages and standard errors of returns obtained by IW-PGPE over 30 runs for the 9-joint humanoid robot simulator for different sampling schedules (e.g., 100×10 means gathering k = 100 trajectory samples 10 times).

FIGURE 10.12: Averages and standard errors of obtained returns over 30 runs for the humanoid robot simulator with 9 joints. Both methods use 1000 trajectory samples for policy learning. In M-PGPE(LSCDE), all 1000 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 1000 trajectory samples is gathered at once, which was shown to be the best scheduling (see Figure 10.11). Note that the policy update is performed 100 times after observing each batch of trajectory samples. The bottom horizontal axis is for the M-PGPE method, while the top horizontal axis is for the IW-PGPE method.

FIGURE 10.13: Example of arm reaching with 9 joints using a policy obtained by M-PGPE(LSCDE) at the 1000th iteration (from left to right and top to bottom).

10.4 Remarks

Model-based reinforcement learning is a promising approach, given that the transition model can be estimated accurately. However, estimating a high-dimensional conditional density is challenging. In this chapter, a non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) was introduced, and model-based PGPE with LSCDE was shown to work excellently in experiments.

Under a fixed sampling budget, the model-free approach requires us to design the sampling schedule appropriately in advance. However, this is practically very hard unless strong prior knowledge is available. On the other hand, model-based methods do not suffer from this problem, which is an excellent practical advantage over the model-free approach.

In robotics, the model-free approach seems to be preferred because accurately learning the transition dynamics of complex robots is challenging (Deisenroth et al., 2013). Furthermore, model-free methods can utilize prior knowledge in the form of policy demonstration (Kober & Peters, 2011). On the other hand, the model-based approach is advantageous in that no interaction with the real robot is required once the transition model has been learned, and the learned transition model can be utilized for further simulation.

Actually, the choice between model-free and model-based methods is not only an ongoing research topic in machine learning, but also a much-debated issue in neuroscience. Therefore, further discussion would be necessary to more deeply understand the pros and cons of the model-based and model-free approaches. Combining or switching between the model-free and model-based approaches would also be an interesting direction to be further investigated.

Chapter 11

Dimensionality Reduction for Transition Model Estimation

Least-squares conditional density estimation (LSCDE), introduced in Chapter 10, is a practical transition model estimator. However, transition model estimation is still challenging when the dimensionality of the state and action spaces is high. In this chapter, a dimensionality reduction method is introduced for LSCDE which finds a low-dimensional expression of the original state and action vector that is relevant to predicting the next state. After mathematically formulating the problem of dimensionality reduction in Section 11.1, a detailed description of the dimensionality reduction algorithm based on squared-loss conditional entropy is provided in Section 11.2. Then numerical examples are given in Section 11.3, and this chapter is concluded in Section 11.4.

11.1 Sufficient Dimensionality Reduction

Sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005) is a framework of dimensionality reduction in a supervised learning setting of analyzing an input-output relation; in our case, the input is the state-action pair (s, a) and the output is the next state s′. Sufficient dimensionality reduction is aimed at finding a low-dimensional expression z of the input (s, a) that contains "sufficient" information about the output s′.

Let z be a linear projection of the input (s, a). More specifically, using a matrix W such that WW⊤ = I, where I denotes the identity matrix, z is given by
\[
z = W \begin{pmatrix} s \\ a \end{pmatrix}.
\]
The goal of sufficient dimensionality reduction is, from independent transition samples \{(s_m, a_m, s'_m)\}_{m=1}^{M}, to find W such that s′ and (s, a) are conditionally independent given z. This conditional independence means that z contains all information about s′ and is equivalently expressed as
\[
p(s'|s,a) = p(s'|z).
\tag{11.1}
\]


11.2 Squared-Loss Conditional Entropy

In this section, the dimensionality reduction method based on the squared-loss conditional entropy (SCE) is introduced.

11.2.1 Conditional Independence

SCE is defined and expressed as
\[
\mathrm{SCE}(s'|z) = -\frac{1}{2}\iint p(s'|z)\, p(s',z)\, dz\, ds'
= -\frac{1}{2}\iint \bigl( p(s'|z) - 1 \bigr)^2 p(z)\, dz\, ds' - 1 + \frac{1}{2}\int ds'.
\]
It was shown in Tangkaratt et al. (2015) that
\[
\mathrm{SCE}(s'|z) \ge \mathrm{SCE}(s'|s,a),
\]
and the equality holds if and only if Eq. (11.1) holds. Thus, sufficient dimensionality reduction can be performed by minimizing SCE(s′|z) with respect to W:
\[
W^* = \mathop{\mathrm{argmin}}_{W \in \mathbb{G}} \mathrm{SCE}(s'|z).
\]
Here, 𝔾 denotes the Grassmann manifold, which is the set of matrices W such that WW⊤ = I, without redundancy in terms of the span.

Since SCE contains the unknown densities p(s′|z) and p(s′,z), it cannot be directly computed. Here, let us employ the LSCDE method introduced in Chapter 10 to obtain an estimator p̂(s′|z) of the conditional density p(s′|z). Then, by replacing the expectation over p(s′,z) with the sample average, SCE can be approximated as
\[
\widehat{\mathrm{SCE}}(s'|z) = -\frac{1}{2M}\sum_{m=1}^{M} \widehat{p}(s'_m|z_m)
= -\frac{1}{2}\,\widetilde{\alpha}^\top \widehat{v},
\]
where
\[
z_m = W \begin{pmatrix} s_m \\ a_m \end{pmatrix}
\qquad\text{and}\qquad
\widehat{v} = \frac{1}{M}\sum_{m=1}^{M} \phi(z_m, s'_m).
\]
φ(z, s′) is the basis function vector used in LSCDE, given by
\[
\phi_b(z, s') = \exp\!\left( -\frac{\|z - z_b\|^2 + \|s' - s'_b\|^2}{2\kappa^2} \right),
\]


where κ > 0 denotes the Gaussian kernel width. α̃ is the LSCDE solution given by
\[
\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},
\]
where λ ≥ 0 is the regularization parameter and
\[
\widehat{U}_{b,b'} = \frac{(\sqrt{\pi}\kappa)^{\dim(s')}}{M}
\exp\!\left( -\frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2} \right)
\sum_{m=1}^{M} \exp\!\left( -\frac{\|z_m - z_b\|^2 + \|z_m - z_{b'}\|^2}{2\kappa^2} \right).
\]
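Because Û and v̂ above are the same LSCDE quantities with z in place of (s, a), the SCE estimate reduces to one linear solve followed by an inner product. The following is a sketch under the usual assumptions (all samples as centers, illustrative κ and λ).

```python
import numpy as np

def sce_estimate(Z, S_next, kappa=1.0, lam=0.1):
    """Plug-in estimate SCE_hat(s'|z) = -(1/2) alpha_tilde^T v_hat via LSCDE."""
    M, d_out = S_next.shape
    Dz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    Dy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(axis=2)
    Kz = np.exp(-Dz / (2 * kappa ** 2))
    U = (np.sqrt(np.pi) * kappa) ** d_out \
        * np.exp(-Dy / (4 * kappa ** 2)) * (Kz.T @ Kz) / M
    v = np.mean(Kz * np.exp(-Dy / (2 * kappa ** 2)), axis=0)
    alpha = np.linalg.solve(U + lam * np.eye(M), v)
    return -0.5 * alpha @ v
```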

11.2.2 Dimensionality Reduction with SCE

With the above SCE estimator, a practical formulation for sufficient dimensionality reduction is given by
\[
\widehat{W} = \mathop{\mathrm{argmax}}_{W \in \mathbb{G}} S(W),
\quad\text{where}\quad
S(W) = \widetilde{\alpha}^\top \widehat{v}.
\]
The gradient of S(W) with respect to W_{ℓ,ℓ′} is given by
\[
\frac{\partial S}{\partial W_{\ell,\ell'}}
= -\widetilde{\alpha}^\top \frac{\partial \widehat{U}}{\partial W_{\ell,\ell'}} \widetilde{\alpha}
+ 2\, \frac{\partial \widehat{v}^\top}{\partial W_{\ell,\ell'}} \widetilde{\alpha}.
\]
In Euclidean space, the above gradient gives the steepest direction (see also Section 7.3.1). However, on the Grassmann manifold, the natural gradient (Amari, 1998) gives the steepest direction. The natural gradient at W is the projection of the ordinary gradient onto the tangent space of the Grassmann manifold. If the tangent space is equipped with the canonical metric ⟨W, W′⟩ = ½ tr(W⊤W′), the natural gradient at W is given as follows (Edelman et al., 1998):
\[
\frac{\partial S}{\partial W}\, W_\perp^\top W_\perp,
\]
where W_⊥ is a matrix such that (W⊤, W_⊥⊤) is an orthogonal matrix. The geodesic from W in the direction of the natural gradient over the Grassmann manifold can be expressed using t ∈ ℝ as
\[
W_t = \begin{pmatrix} I & O \end{pmatrix}
\exp\!\left( -t
\begin{pmatrix}
O & \dfrac{\partial S}{\partial W} W_\perp^\top \\[2mm]
-W_\perp \dfrac{\partial S}{\partial W}^{\!\top} & O
\end{pmatrix}
\right)
\begin{pmatrix} W \\ W_\perp \end{pmatrix},
\]
where "exp" for a matrix denotes the matrix exponential and O denotes the zero matrix. Then line search along the geodesic in the natural gradient direction is performed by finding the maximizer among \{W_t \mid t \ge 0\} (Edelman et al., 1998).
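As a sketch of one candidate point on that geodesic, the block matrix can be assembled and exponentiated directly, assuming SciPy is available; the function name and interface are illustrative.

```python
import numpy as np
from scipy.linalg import expm, null_space

def geodesic_step(W, grad_S, t):
    """One candidate W_t on the Grassmann geodesic in the natural-gradient direction.
    W: (m, D) with W W^T = I; grad_S: (m, D) Euclidean gradient dS/dW; t >= 0."""
    m, D = W.shape
    W_perp = null_space(W).T                    # (D - m, D), orthonormal complement
    B = grad_S @ W_perp.T                       # dS/dW * W_perp^T, shape (m, D - m)
    A = np.zeros((D, D))
    A[:m, m:] = B                               # top-right block
    A[m:, :m] = -B.T                            # bottom-left block
    stacked = np.vstack([W, W_perp])            # (W; W_perp), shape (D, D)
    return (expm(-t * A) @ stacked)[:m, :]      # (I  O) exp(-tA) (W; W_perp)
```

In practice one would evaluate S(W_t) for several values of t and keep the best, which is the line search described in the text.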


Once W is updated by the natural gradient method, SCE is re-estimated for the new W and natural gradient ascent is performed again. This entire procedure is repeated until W converges, and the final solution is given by
\[
\widehat{p}(s'|z) = \frac{\widehat{\alpha}^\top \phi(z, s')}{\int \widehat{\alpha}^\top \phi(z, s'')\, ds''},
\]
where α̂_b = max(0, α̃_b), and the denominator can be analytically computed as
\[
\int \widehat{\alpha}^\top \phi(z, s'')\, ds''
= (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b
\exp\!\left( -\frac{\|z - z_b\|^2}{2\kappa^2} \right).
\]
When SCE is re-estimated, performing cross-validation for LSCDE in every step is computationally expensive. In practice, cross-validation may be performed only once every several gradient updates. Furthermore, to find a better local optimum, this gradient ascent procedure may be executed multiple times with randomly chosen initial solutions, and the one achieving the largest objective value is chosen.

11.2.3 Relation to Squared-Loss Mutual Information

The above dimensionality reduction method minimizes SCE:
\[
\mathrm{SCE}(s'|z) = -\frac{1}{2}\iint \frac{p(z, s')^2}{p(z)}\, dz\, ds'.
\]
On the other hand, the dimensionality reduction method proposed in Suzuki and Sugiyama (2013) maximizes the squared-loss mutual information (SMI):
\[
\mathrm{SMI}(z, s') = \frac{1}{2}\iint \frac{p(z, s')^2}{p(z)\, p(s')}\, dz\, ds'.
\]
Note that SMI can be approximated almost in the same way as SCE by the least-squares method (Suzuki & Sugiyama, 2013). The above equations show that the essential difference between SCE and SMI is whether p(s′) is included in the denominator of the density ratio, and SCE is reduced to the negative SMI if p(s′) is uniform. However, if p(s′) is not uniform, the density ratio function p(z,s′)/(p(z)p(s′)) included in SMI may fluctuate more than p(z,s′)/p(z) included in SCE. Since a smoother function can be more accurately estimated from a small number of samples in general (Vapnik, 1998), SCE-based dimensionality reduction is expected to work better than SMI-based dimensionality reduction.


11.3 Numerical Examples

In this section, the experimental behavior of the SCE-based dimensionality reduction method is illustrated.

11.3.1 Artificial and Benchmark Datasets

The following dimensionality reduction schemes are compared:

• None: No dimensionality reduction is performed.

• SCE (Section 11.2): Dimensionality reduction is performed by minimizing the least-squares SCE approximator using natural gradients over the Grassmann manifold (Tangkaratt et al., 2015).

• SMI (Section 11.2.3): Dimensionality reduction is performed by maximizing the least-squares SMI approximator using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).

• True: The "true" subspace is used (only for artificial datasets).

After dimensionality reduction, the following conditional density estimators are run:

• LSCDE (Section 10.1.3): Least-squares conditional density estimation (Sugiyama et al., 2010).

• ǫKDE (Section 10.1.2): ǫ-neighbor kernel density estimation, where ǫ is chosen by least-squares cross-validation.

First, the behavior of SCE-LSCDE is compared with the plain LSCDE with no dimensionality reduction. The datasets have 5-dimensional input x = (x^{(1)}, …, x^{(5)})⊤ and 1-dimensional output y. Among the 5 dimensions of x, only the first dimension x^{(1)} is relevant to predicting the output y, and the other 4 dimensions x^{(2)}, …, x^{(5)} are just standard Gaussian noise. Figure 11.1 plots the first dimension of the input and the output of the samples in the datasets and the conditional density estimation results. The graphs show that the plain LSCDE does not perform well due to the irrelevant noise dimensions in the input, while SCE-LSCDE gives much better estimates.

Next, artificial datasets with 5-dimensional input x = (x^{(1)}, …, x^{(5)})⊤ and 1-dimensional output y are used. Each element of x follows the standard Gaussian distribution and y is given by
\[
\text{(a)}\quad y = x^{(1)} + (x^{(1)})^2 + (x^{(1)})^3 + \varepsilon,
\qquad
\text{(b)}\quad y = (x^{(1)})^2 + (x^{(2)})^2 + \varepsilon,
\]

FIGURE 11.1: Examples of conditional density estimation by plain LSCDE and SCE-LSCDE: (a) bone mineral density; (b) Old Faithful geyser.

where ε is Gaussian noise with mean zero and standard deviation 1/4.

The top row of Figure 11.2 shows the dimensionality reduction error between the true W* and its estimate Ŵ for different sample sizes n, measured by
\[
\mathrm{Error}_{\mathrm{DR}} = \bigl\| \widehat{W}^\top \widehat{W} - W^{*\top} W^* \bigr\|_{\mathrm{Frobenius}},
\]
where ‖·‖_Frobenius denotes the Frobenius norm. The SMI-based and SCE-based dimensionality reduction methods both perform similarly for dataset (a), while the SCE-based method clearly outperforms the SMI-based method for dataset (b). The histograms of \{y_i\}_{i=1}^{400} plotted in the 2nd row of Figure 11.2 show that the profile of the histogram (which is a sample approximation of p(y)) in dataset (b) is much sharper than that in dataset (a). As explained in Section 11.2.3, the density ratio function used in SMI contains p(y) in the denominator. Therefore, it would be highly non-smooth and thus hard to approximate. On the other hand, the density ratio function used in SCE does not contain p(y). Therefore, it would be smoother than the one used in SMI and thus easier to approximate.

The 3rd and 4th rows of Figure 11.2 plot the conditional density estimation error between the true p(y|x) and its estimate p̂(y|x), evaluated by the squared loss (without a constant):
\[
\mathrm{Error}_{\mathrm{CDE}}
= \frac{1}{2n'} \sum_{i=1}^{n'} \int \widehat{p}(y|\widetilde{x}_i)^2\, dy
- \frac{1}{n'} \sum_{i=1}^{n'} \widehat{p}(\widetilde{y}_i|\widetilde{x}_i),
\]
where \{(\widetilde{x}_i, \widetilde{y}_i)\}_{i=1}^{n'} is a set of test samples that have not been used for conditional density estimation. We set n′ = 1000. The graphs show that LSCDE overall outperforms ǫKDE for both datasets. For dataset (a), SMI-LSCDE and SCE-LSCDE perform equally well, and are much better than

FIGURE 11.2: Top row: the mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. 2nd row: histograms of the outputs \{y_i\}_{i=1}^{400}. 3rd and 4th rows: the mean and standard error of the conditional density estimation error over 20 runs.

180

StatisticalReinforcementLearning

plainLSCDEwithnodimensionalityreduction(LSCDE)andcomparableto

LSCDEwiththetruesubspace(LSCDE*).Forthedataset(b),SCE-LSCDE

outperformsSMI-LSCDEandLSCDEandiscomparabletoLSCDE*.

Next,theUCIbenchmarkdatasets(Bache&Lichman,2013)areusedfor

performanceevaluation.nsamplesareselectedrandomlyfromeachdatasetfor

conditionaldensityestimation,andtherestofthesamplesareusedtomeasure

theconditionaldensityestimationerror.Sincethedimensionalityofzisun-

knownforthebenchmarkdatasets,itwasdeterminedbycross-validation.The

resultsaresummarizedinTable11.1,showingthatSCE-LSCDEworkswell

overall.Table11.2describesthedimensionalitiesselectedbycross-validation,

showingthatboththeSCE-basedandSMI-basedmethodsreducethedimen-

sionalitysignificantly.

11.3.2

HumanoidRobot

Finally,SCE-LSCDEisappliedtotransitionestimationofahumanoid

robot.Weuseasimulatoroftheupper-bodypartofthehumanoidrobot

CB-i(Chengetal.,2007)(seeFigure9.5).

Therobothas9controllablejoints:shoulderpitch,shoulderroll,elbow

pitchoftherightarm,andshoulderpitch,shoulderroll,elbowpitchofthe

leftarm,waistyaw,torsoroll,andtorsopitchjoints.Postureoftherobotis

describedby18-dimensionalreal-valuedstatevectors,whichcorrespondsto

theangleandangularvelocityofeachjointinradianandradian-per-second,

respectively.Therobotiscontrolledbysendinganactioncommandatothe

system.Theactioncommandaisa9-dimensionalreal-valuedvector,which

correspondstothetargetangleofeachjoint.Whentherobotiscurrentlyat

statesandreceivesactiona,thephysicalcontrolsystemofthesimulator

calculatestheamountoftorquetobeappliedtoeachjoint(seeSection9.3.3

fordetails).

Intheexperiment,theactionvectoraisrandomlychosenandanoisy

controlsystemissimulatedbyaddingabimodalGaussiannoisevector.More

specifically,theactionaiofthei-thjointisfirstdrawnfromtheuniformdis-

tributionon[si−0.087,si+0.087],wheresidenotesthestateforthei-th

joint.ThedrawnactionisthencontaminatedbyGaussiannoisewithmean

0andstandarddeviation0.034withprobability0.6andGaussiannoisewith

mean−0.087andstandarddeviation0.034withprobability0.4.Byrepeat-

edlycontrollingtherobotMtimes,transitionsamples(sm,am,s′m)M

m=1

areobtained.Ourgoalistolearnthesystemdynamicsasastatetransition

probabilityp(s′|s,a)fromthesesamples.

Thefollowingthreescenariosareconsidered:usingonly2joints(right

shoulderpitchandrightelbowpitch),only4joints(inaddition,rightshoulder

rollandwaistyaw),andall9joints.Thesesetupscorrespondto6-dimensional

inputand4-dimensionaloutputinthe2-jointcase,12-dimensionalinputand

8-dimensionaloutputinthe4-jointcase,and27-dimensionalinputand18-

dimensionaloutputinthe9-jointcase.Fivehundred,1000,and1500transition

DimensionalityReductionforTransitionModelEstimation

181

r

llea

0

t-test

le

1

1

1

1

1

1

1

1

1

0

1

1

0

0

ca

1

1

1

(sm

×××××××××

××

ed

S

×

××

ira

)

)

)

)

)

)

)

)

)

)

)

)

)

)

sets

p

1

4

6

2

1

1

4

2

3

4

4

4

1

2

ta

E

a

le

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.1

(.1

(.1

(.0

(.0

d

p

n

D

3

6

2

5

1

9

3

6

0

5

5

5

5

9

s

m

.1

.4

.7

.9

.9

.8

.1

.9

.8

.9

.2

.6

.7

.8

u

ǫK

1

1

2

2

0

1

1

6

0

1

9

3

0

0

-sa

ctio

−−−−−−−−−−−−−−

rio

o

u

)

)

va

tw

)

)

)

)

)

)

)

)

)

)

)

)

red

5

4

9

4

1

1

2

7

2

6

3

3

3

6

r

e

o

E

fo

th

D

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.1

(.1

(.0

(.3

s

N

C

1

5

2

2

9

6

3

0

1

2

5

5

3

0

n

to

S

.4

.6

.7

.0

.0

.4

.1

.1

.3

.9

.8

.6

1

.7

2

1

.1

2

2

3

1

2

7

3

0

1

ru

L

1

1

g

−−

−−−−−−−−−

0

1

inrd

)

er

)

)

)

)

)

)

)

)

)

)

)

)

)

8

5

1

9

2

2

7

2

4

9

3

0

6

1

ov

cco

E

(.3

r

a

(.0

(.0

(.1

(.2

(.0

(.1

(.1

(.0

(.0

(.4

(.6

(.1

(.5

s

D

2

7

5

7

7

0

3

3

8

1

7

4

8

7

d

.6

.7

.9

.4

.9

.6

.9

.9

.1

.4

.2

.4

.3

.3

erro

o

ǫK

0

sed

1

1

2

5

0

2

1

6

1

3

1

7

1

2

n

eth

a

−−

−−−−

−−−

b

−−

tioam

I-

)

)

)

)

)

)

)

)

)

)

)

)

)

)

le

M

5

4

8

6

1

2

3

4

3

7

0

4

5

3

b

S

ED(.0(.0(.1(.2(.0(.0(.0(.0(.0(.4(.5(.8(.2(.6

estim

ra

3

3

4

6

a

C

1

5

9

0

5

2

0

2

0

4

y

p

S

.9

.8

.6

.6

.2

.3

.8

.9

.3

.0

.4

.0

.0

.7

L

1

1

2

5

1

2

2

6

1

6

9

8

2

9

sit

m

−−−−−−−−−−−−−−

en

co

d

d

)

)

)

)

)

l

)

6

4

4

)

5

)

)

)

)

7

)

)

)

a

n

1

2

7

3

6

2

4

4

7

a

n

r

E

(.1

(.0

(.1

(.1

(.0

(.1

(.1

(.0

(.0

(.2

(.3

(.5

(.1

(.1

io

D

7

4

3

3

9

7

5

3

0

8

5

0

3

4

it

.5

.9

.9

.9

.2

.1

.5

.7

.4

d

erro

ce.

ǫK

1

.7

.0

.2

0

.4

1

6

1

4

.7

7

1

2

n

1

3

6

2

9

n

fa

sed

−−−−−−−−−−−−−−

co

a

e

ea

ld

-b

)

m

o

E

)

)

)

)

)

)

)

)

)

)

)

6

)

)

th

b

9

4

8

2

1

1

2

2

3

4

3

1

3

f

e

C

E

o

y

S

(.8

th

b

D

(.0

(.0

(.1

(.0

(.0

(.0

(.0

(.0

(.0

(.0

(.5

9

(.2

(.8

r

f

C

3

0

2

6

9

1

5

8

6

3

7

1

7

o

ed

S

.7

.8

.9

.4

.1

.3

.8

.1

.3

.1

.3

.4

.8

.3

L

1

1

2

6

1

2

2

7

1

7

8

0

2

8

erro

s

−1

ecifi

−−−−−

−−−−

−−−

rdatermsp

0

0

0

0

0

0

0

0

0

0

0

0

d

0

0

0

0

0

0

0

0

0

0

0

0

0

0

n

n

in

re

1

1

5

8

5

4

3

1

3

2

1

1

2

5

a

)

)

sta

d

)

)

)

)

)

o

)

)

)

)

)

)

)

)

8

d

%5

,dy

,1

,1

,1

,1

,1

,1

,1

,1

,1

,2

,2

,4

,8

,1

n

3

1

1

2

2

7

a

eth

el

(1

(7

(4

(6

(9

(1

(1

(1

(8

(8

(7

(6

(1

n

m

v

(dx

(2

le

ea

est

e

e

es

M

b

ce

g

G

em

in

:

P

ir

y

ts

ts

ts

e

n

set

o

t

ch

in

W

F

.1

h

ca

sin

M

ch

o

W

crete

ck

erg

in

in

in

1

T

o

o

o

ifi

ta

u

erv

a

e

n

n

to

J

J

J

1

a

n

o

to

sic

it

D

S

Y

o

S

H

u

y

h

ed

rest

C

E

2

4

9

E

o

sig

A

h

W

R

F

L

P

e

B

etter).

th

A

b

t

T

is

a


TABLE 11.2: Mean and standard error of the chosen subspace dimensionality over 10 runs for benchmark and robot transition datasets.

                                 SCE-based                SMI-based
Dataset        (dx,dy)      LSCDE        ǫKDE         LSCDE        ǫKDE
Housing        (13,1)       3.9(0.74)    2.0(0.79)    2.0(0.39)    1.3(0.15)
AutoMPG        (7,1)        3.2(0.66)    1.3(0.15)    2.1(0.67)    1.1(0.10)
Servo          (4,1)        1.9(0.35)    2.4(0.40)    2.2(0.33)    1.6(0.31)
Yacht          (6,1)        1.0(0.00)    1.0(0.00)    1.0(0.00)    1.0(0.00)
Physicochem    (9,1)        6.5(0.58)    1.9(0.28)    6.6(0.58)    2.6(0.86)
White Wine     (11,1)       1.2(0.13)    1.0(0.00)    1.4(0.31)    1.0(0.00)
Red Wine       (11,1)       1.0(0.00)    1.3(0.15)    1.2(0.20)    1.0(0.00)
Forest Fires   (12,1)       1.2(0.20)    4.9(0.99)    1.4(0.22)    6.8(1.23)
Concrete       (8,1)        1.0(0.00)    1.0(0.00)    1.2(0.13)    1.0(0.00)
Energy         (8,2)        5.9(0.10)    3.9(0.80)    2.1(0.10)    2.0(0.30)
Stock          (7,2)        3.2(0.83)    2.1(0.59)    2.1(0.60)    2.7(0.67)
2 Joints       (6,4)        2.9(0.31)    2.7(0.21)    2.5(0.31)    2.0(0.00)
4 Joints       (12,8)       5.2(0.68)    6.2(0.63)    5.4(0.67)    4.6(0.43)
9 Joints       (27,18)      13.8(1.28)   15.3(0.94)   11.4(0.75)   13.2(1.02)

samples are generated for the 2-joint, 4-joint, and 9-joint cases, respectively. Then randomly chosen n = 100, 200, and 500 samples are used for conditional density estimation, and the rest are used for evaluating the test error. The results are summarized in Table 11.1, showing that SCE-LSCDE performs well for all three cases. Table 11.2 describes the dimensionalities selected by cross-validation. This shows that the dimensionalities are much reduced, implying that the transition of the humanoid robot is highly redundant.

11.4 Remarks

Coping with the high dimensionality of the state and action spaces is one of the most important challenges in model-based reinforcement learning. In this chapter, a dimensionality reduction method for conditional density estimation was introduced. The key idea was to use the squared-loss conditional entropy (SCE) for dimensionality reduction, which can be estimated by least-squares conditional density estimation. This allowed us to perform dimensionality reduction and conditional density estimation simultaneously in an integrated manner. In contrast, dimensionality reduction based on squared-loss mutual information (SMI) yields a two-step procedure of first reducing the dimensionality and then estimating the conditional density. SCE-based dimensionality reduction was shown to outperform the SMI-based method, particularly when the output follows a skewed distribution.

References

Abbeel,P.,&Ng,A.Y.(2004).Apprenticeshiplearningviainverserein-

forcementlearning.ProceedingsofInternationalConferenceonMachine

Learning(pp.1–8).

Abe,N.,Melville,P.,Pendus,C.,Reddy,C.K.,Jensen,D.L.,Thomas,V.P.,

Bennett,J.J.,Anderson,G.F.,Cooley,B.R.,Kowalczyk,M.,Domick,M.,

&Gardinier,T.(2010).Optimizingdebtcollectionsusingconstrainedrein-

forcementlearning.ProceedingsofACMSIGKDDInternationalConference

onKnowledgeDiscoveryandDataMining(pp.75–84).

Amari,S.(1967).Theoryofadaptivepatternclassifiers.IEEETransactions

onElectronicComputers,EC-16,299–307.

Amari,S.(1998).Naturalgradientworksefficientlyinlearning.NeuralCom-

putation,10,251–276.

Amari,S.,&Nagaoka,H.(2000).Methodsofinformationgeometry.Provi-

dence,RI,USA:OxfordUniversityPress.

Bache,K.,&Lichman,M.(2013).UCImachinelearningrepository.http:

//archive.ics.uci.edu/ml/

Baxter,J.,Bartlett,P.,&Weaver,L.(2001).Experimentswithinfinite-

horizon,policy-gradientestimation.JournalofArtificialIntelligenceRe-

search,15,351–381.

Bishop,C.M.(2006).Patternrecognitionandmachinelearning.NewYork,

NY,USA:Springer.

Boyd,S.,&Vandenberghe,L.(2004).Convexoptimization.Cambridge,UK:

CambridgeUniversityPress.

Bradtke,S.J.,&Barto,A.G.(1996).Linearleast-squaresalgorithmsfor

temporaldifferencelearning.MachineLearning,22,33–57.

Chapelle,O.,Schölkopf,B.,&Zien,A.(Eds.).(2006).Semi-supervisedlearn-

ing.Cambridge,MA,USA:MITPress.

Cheng,G.,Hyon,S.,Morimoto,J.,Ude,A.,Joshua,G.H.,Colvin,G.,Scrog-

gin,W.,&Stephen,C.J.(2007).CB:Ahumanoidresearchplatformfor

exploringneuroscience.AdvancedRobotics,21,1097–1114.

183

184

References

Chung,F.R.K.(1997).Spectralgraphtheory.Providence,RI,USA:American

MathematicalSociety.

Coifman,R.,&Maggioni,M.(2006).Diffusionwavelets.AppliedandCom-

putationalHarmonicAnalysis,21,53–94.

Cook,R.D.,&Ni,L.(2005).Sufficientdimensionreductionviainverse

regression.JournaloftheAmericanStatisticalAssociation,100,410–428.

Dayan,P.,&Hinton,G.E.(1997).Usingexpectation-maximizationforrein-

forcementlearning.NeuralComputation,9,271–278.

Deisenroth,M.P.,Neumann,G.,&Peters,J.(2013).Asurveyonpolicy

searchforrobotics.FoundationsandTrendsinRobotics,2,1–142.

Deisenroth,M.P.,&Rasmussen,C.E.(2011).PILCO:Amodel-basedand

data-efficientapproachtopolicysearch.ProceedingsofInternationalCon-

ferenceonMachineLearning(pp.465–473).

Demiriz,A.,Bennett,K.P.,&Shawe-Taylor,J.(2002).Linearprogramming

boostingviacolumngeneration.MachineLearning,46,225–254.

Dempster,A.P.,Laird,N.M.,&Rubin,D.B.(1977).Maximumlikelihood

fromincompletedataviatheEMalgorithm.JournaloftheRoyalStatistical

Society,seriesB,39,1–38.

Dijkstra,E.W.(1959).Anoteontwoproblemsinconnexion[sic]withgraphs.

NumerischeMathematik,1,269–271.

Edelman,A.,Arias,T.A.,&Smith,S.T.(1998).Thegeometryofalgo-

rithmswithorthogonalityconstraints.SIAMJournalonMatrixAnalysis

andApplications,20,303–353.

Efron,B.,Hastie,T.,Johnstone,I.,&Tibshirani,R.(2004).Leastangle

regression.AnnalsofStatistics,32,407–499.

Engel,Y.,Mannor,S.,&Meir,R.(2005).ReinforcementlearningwithGaus-

sianprocesses.ProceedingsofInternationalConferenceonMachineLearn-

ing(pp.201–208).

Fishman,G.S.(1996).MonteCarlo:Concepts,algorithms,andapplications.

Berlin,Germany:Springer-Verlag.

Fredman,M.L.,&Tarjan,R.E.(1987).Fibonacciheapsandtheiruses

inimprovednetworkoptimizationalgorithms.JournaloftheACM,34,

569–615.

Goldberg,A.V.,&Harrelson,C.(2005).Computingtheshortestpath:A*

searchmeetsgraphtheory.ProceedingsofAnnualACM-SIAMSymposium

onDiscreteAlgorithms(pp.156–165).

References

185

Gooch, B., & Gooch, A. (2001). Non-photorealistic rendering. Natick, MA, USA: A. K. Peters Ltd.

Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.

Guo, Q., & Kunii, T. L. (2003). "Nijimi" rendering algorithm for creating quality black ink paintings. Proceedings of Computer Graphics International (pp. 152–159).

Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publications.

Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. Proceedings of Annual Conference on Computer Graphics and Interactive Techniques (pp. 453–460).

Hertzmann, A. (2003). A survey of stroke-based rendering. IEEE Computer Graphics and Applications, 23, 70–81.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Huber, P. J. (1981). Robust statistics. New York, NY, USA: Wiley.

Kakade, S. (2002). A natural policy gradient. Advances in Neural Information Processing Systems 14 (pp. 1531–1538).

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86, 335–367.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 90, 431–460.

Kober, J., & Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84, 171–203.

Koenker, R. (2005). Quantile regression. Cambridge, UK: Cambridge University Press.

Kohonen, T. (1995). Self-organizing maps. Berlin, Germany: Springer.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.

Mahadevan, S. (2005). Proto-value functions: Developmental reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 553–560).

Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 799–806).

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 368–375).

Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225).

Peters, J., & Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. Proceedings of International Conference on Machine Learning (pp. 745–750). Corvallis, Oregon, USA.

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of International Conference on Machine Learning (pp. 759–766).

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press.

Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443–1471.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY, USA: Wiley.

Schaal, S. (2009). The SL simulation and real-time control software package (Technical Report). Computer Science and Neuroscience, University of Southern California.

Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23, 551–559.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.

Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin, Germany: Springer-Verlag.

Sugimoto, N., Tangkaratt, V., Wensveen, T., Zhao, T., Sugiyama, M., & Morimoto, J. (2014). Efficient reuse of previous experiences in humanoid motor learning. Proceedings of IEEE-RAS International Conference on Humanoid Robots (pp. 554–559).

Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141–166.

Sugiyama, M., Hachiya, H., Towell, C., & Vijayakumar, S. (2008). Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25, 287–304.

Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.

Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press.

Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725–758.

Takeda, A. (2007). Support vector machine based on conditional value-at-risk minimization (Technical Report B-439). Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.

Tangkaratt, V., Mori, S., Zhao, T., Morimoto, J., & Sugiyama, M. (2014). Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks, 57, 128–140.

Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27, 228–254.

Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.

Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.

Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM toolbox for Matlab 5 (Technical Report A57). Helsinki University of Technology.

Wahba, G. (1990). Spline models for observational data. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.

Wang, X., & Dietterich, T. G. (2003). Model-based policy gradient reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 776–783).

Wawrzynski, P. (2009). Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22, 1484–1497.

Weaver, L., & Baxter, J. (1999). Reinforcement learning from state and temporal differences (Technical Report). Department of Computer Science, Australian National University.

Weaver, L., & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 538–545).

Williams, J. D., & Young, S. J. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Xie, N., Hachiya, H., & Sugiyama, M. (2013). Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE Transactions on Information and Systems, E95-D, 1134–1144.

Xie, N., Laga, H., Saito, S., & Nakajima, M. (2011). Contour-driven Sumi-e rendering of real photos. Computers & Graphics, 35, 122–134.

Zhao, T., Hachiya, H., Niu, G., & Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26, 118–129.

Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., & Sugiyama, M. (2013). Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25, 1512–1547.

Document Outline

Cover
Contents
Foreword
Preface
Author
Part I: Introduction
  Chapter 1: Introduction to Reinforcement Learning
Part II: Model-Free Policy Iteration
  Chapter 2: Policy Iteration with Value Function Approximation
  Chapter 3: Basis Design for Value Function Approximation
  Chapter 4: Sample Reuse in Policy Iteration
  Chapter 5: Active Learning in Policy Iteration
  Chapter 6: Robust Policy Iteration
Part III: Model-Free Policy Search
  Chapter 7: Direct Policy Search by Gradient Ascent
  Chapter 8: Direct Policy Search by Expectation-Maximization
  Chapter 9: Policy-Prior Search
Part IV: Model-Based Reinforcement Learning
  Chapter 10: Transition Model Estimation
  Chapter 11: Dimensionality Reduction for Transition Model Estimation
References
