
CSC2621 Topics in Robotics: Reinforcement Learning in Robotics

Week 2: Supervised & Imitation Learning

Instructor: Animesh Garg

TA: Dylan Turpin & Tingwu Wang

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

Wu, Rajeswaran, Duan, Kumar, Bayen, Kakade, Mordatch, Abbeel

Topic: Policy Gradients

Presenter: Animesh Garg

Motivation

• Policy Gradient: improve the log-probability of actions, weighted by expected future returns

• The gradient estimator has high variance!
  - Credit assignment to actions (especially in long-horizon tasks)
  - High-dimensional action spaces

• The variance of the estimator matters because it affects stability and convergence.

Key Insights

• Use a baseline! But what does a baseline do?
  Intuition: it removes the effect of future actions from the total reward!

$A(s, a) = Q(s, a) - V(s)$

• But do we have to be limited to state-only baselines?
  💡 If/when the individual actions produced by the policy can be decomposed into multiple factors, we can incorporate this additional information into the baseline to further reduce variance. Information about the other factors can provide a better evaluation of how well a specific factor performs.
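A hedged one-line sketch of the idea, using the paper's notation for the i-th action factor $a_i$ and the remaining factors $a_{-i}$ (not spelled out on this slide): generalize the state-only baseline $b(s) = V(s)$ to a per-factor baseline that may also condition on the other action factors.

```latex
\[
A_i(s, a) = Q(s, a) - b_i(s, a_{-i}), \qquad a = (a_1, \dots, a_m)
\]
```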

State-dependent Baseline

Key Insights

When do actions decouple?

• Different action dimensions act on independent components of the observation space.

• Policy composed of multiple independent function approximators, one per action factor (no weight sharing)

• Multivariate Gaussian policies with a diagonal covariance (see the sketch after this list)

• Multi-agent & distributed RL: centralized learning + decentralized execution
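A minimal sketch of the diagonal-Gaussian case above, assuming NumPy only (not from the slides): the joint log-probability splits into per-dimension terms, which is what lets each action dimension carry its own baseline $b_i(s, a_{-i})$.

```python
import numpy as np

def gaussian_logprob_per_dim(a, mu, sigma):
    """Per-dimension log pi_i(a_i | s) of a diagonal-covariance Gaussian policy.
    a, mu, sigma: shape-(m,) arrays (action, mean, std for each dimension)."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

# The joint log-density is exactly the sum of the per-dimension terms.
mu = np.array([0.1, -0.3, 0.5])
sigma = np.array([0.2, 0.4, 0.1])
a = mu + sigma * np.random.randn(3)

per_dim = gaussian_logprob_per_dim(a, mu, sigma)
joint = (-0.5 * np.sum(((a - mu) / sigma) ** 2)
         - np.sum(np.log(sigma))
         - 0.5 * len(a) * np.log(2.0 * np.pi))
assert np.isclose(per_dim.sum(), joint)  # log pi(a|s) = sum_i log pi_i(a_i|s)
```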

Contributions

• An action-dependent baseline enables using additional signals beyond the state to achieve bias-free variance reduction.

• Derive an Optimal Action-Dependent Baseline

• Analysis of improvement in variance reduction

• Empirical results to show the effects of the proposed baselines and a comparison of several choices of baselines

Background

• Value-Function Based: gradient through Q (or V)
  - Low variance · Biased (often) · Sample efficient · Can be unstable

• Actor-Critic: gradient through critic

• Policy Gradients: gradient through rollouts
  - High variance · Unbiased · Less sample efficient · More stable

Background: Variance Reduction

• MDP Objective

• Score Function Estimator (Williams 1992)
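A hedged reconstruction of the standard forms (the slide's own equations are not reproduced in this text): the MDP objective over trajectories and the score-function (REINFORCE) estimator, with $\hat{Q}_t$ the return following $(s_t, a_t)$.

```latex
\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right]
\]
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}_t\right]
\]
```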

Background: Variance Reduction

• Policy Gradient

• State-Dependent Baseline
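Again as a hedged reconstruction of the standard form: subtracting a state-dependent baseline $b(s_t)$ keeps the estimator unbiased, because the baseline is constant with respect to the action.

```latex
\[
\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(\hat{Q}_t - b(s_t)\bigr)\right]
\]
% Unbiased, since for any fixed state s:
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
  = b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, da = 0
\]
```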

Action-Dependent Baseline

• Assume m-dimensional action space

• So the MDP objective becomes

• For each dimension i, the baseline may depend on the state and on the other action components $a_{-i}$, but not on $a_i$ itself (and $a_i$ is independent of $a_{-i}$ given the state), hence the estimator remains unbiased:
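A sketch of the factorization and the unbiasedness argument, assuming the paper's setting of a policy that factorizes across the m action dimensions:

```latex
\[
\pi_\theta(a \mid s) = \prod_{i=1}^{m} \pi_\theta(a_i \mid s)
\]
% Each b_i may depend on s and a_{-i} but not on a_i, so its contribution vanishes:
\[
\mathbb{E}_{a_i \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a_i \mid s)\, b_i(s, a_{-i})\right]
  = b_i(s, a_{-i})\, \mathbb{E}_{a_i}\!\left[\nabla_\theta \log \pi_\theta(a_i \mid s)\right] = 0
\]
```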

Action-Dependent Baseline

• New Gradient Estimator

• Notice that this is similar to the advantage function
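The resulting estimator, written out as a sketch in the paper's notation; the term in parentheses plays the role of a per-dimension advantage $A_i(s, a) = Q(s, a) - b_i(s, a_{-i})$.

```latex
\[
\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{i=1}^{m} \nabla_\theta \log \pi_\theta(a_i \mid s)\,\bigl(Q^{\pi}(s, a) - b_i(s, a_{-i})\bigr)\right]
\]
```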

Optimal State-dependent Baseline

• Reformulate the objective

• Optimal State-dep Baseline
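A hedged reconstruction of the variance-minimizing state-dependent baseline (Q weighted by the squared magnitude of the score); the exact slide equation is not shown in this text.

```latex
\[
b^{*}(s) = \frac{\mathbb{E}_{a \sim \pi_\theta}\!\left[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^{2}\, Q^{\pi}(s, a)\right]}
                {\mathbb{E}_{a \sim \pi_\theta}\!\left[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^{2}\right]}
\]
```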

Optimal Action-Dependent Baseline

• Reformulate the objective

• Independent action dimensions

• Optimal Action Baseline
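The per-dimension analogue takes the same score-weighted form, with the expectation over $a_i$ only, conditioned on the state and the other components (a reconstruction in the paper's notation):

```latex
\[
b_i^{*}(s, a_{-i}) = \frac{\mathbb{E}_{a_i \sim \pi_\theta}\!\left[\lVert \nabla_\theta \log \pi_\theta(a_i \mid s) \rVert^{2}\, Q^{\pi}(s, a)\right]}
                          {\mathbb{E}_{a_i \sim \pi_\theta}\!\left[\lVert \nabla_\theta \log \pi_\theta(a_i \mid s) \rVert^{2}\right]}
\]
```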

How are they related?

• The action-dependent baseline does not degenerate to the state-dependent baseline.

• The difference is particularly large when the Q function is highly sensitive to the actions, especially along directions that influence the gradient the most.

This difference is a weighted sum of the deviation of the per-component score-weighted marginalized Q from the component weight (based on score only, not Q) of the overall aggregated marginalized Q values.

Potential Choices of Action-Dep Baselines

• Marginalized Q-Baseline

nearly optimal if $\mathrm{Corr}(\log \pi, Q) \approx 0$

• Monte Carlo marginalized Q baseline

• Mean marginalized Q baseline
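A sketch of the three choices listed above, in the paper's notation (the slide equations are not reproduced here):

```latex
% (1) Marginalized Q baseline: integrate a_i out under the policy
\[
b_i(s, a_{-i}) = \mathbb{E}_{a_i \sim \pi_\theta(\cdot \mid s)}\!\left[Q^{\pi}(s, a_i, a_{-i})\right]
\]
% (2) Monte Carlo marginalized Q baseline: average over M samples of a_i
\[
b_i(s, a_{-i}) \approx \frac{1}{M} \sum_{j=1}^{M} Q^{\pi}\!\bigl(s, a_i^{(j)}, a_{-i}\bigr), \qquad a_i^{(j)} \sim \pi_\theta(\cdot \mid s)
\]
% (3) Mean marginalized Q baseline: evaluate Q at the policy mean for dimension i
\[
b_i(s, a_{-i}) \approx Q^{\pi}\!\bigl(s, \bar{a}_i, a_{-i}\bigr), \qquad \bar{a}_i = \mathbb{E}_{\pi_\theta}\!\left[a_i \mid s\right]
\]
```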

Experiments

Experiments: Does it help vs. b(s)?

Experiments: Different Baseline Choices

• Variants of the action-dependent baseline that use: (i) sampling actions to estimate the conditional expectation of the Q-function; (ii) using the mean action to form a linear approximation to the conditional expectation (see the sketch after this list).

• Both variants are comparable, with the latter being more computationally efficient.
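A minimal sketch of the two variants described above (hypothetical helper names, not the authors' code): `q_fn(s, a)` stands in for any learned Q-function approximator, `sample_ai()` draws $a_i \sim \pi_i(\cdot \mid s)$, and `mean_ai` is the policy mean for dimension i.

```python
import numpy as np

def mc_marginalized_baseline(q_fn, s, a, i, sample_ai, n_samples=4):
    """Variant (i): resample only dimension i and average Q over the samples."""
    resampled = []
    for _ in range(n_samples):
        a_j = np.array(a, dtype=float)
        a_j[i] = sample_ai()          # a_i ~ pi_i(. | s); a_{-i} stays fixed
        resampled.append(q_fn(s, a_j))
    return float(np.mean(resampled))

def mean_marginalized_baseline(q_fn, s, a, i, mean_ai):
    """Variant (ii): plug in the policy mean for dimension i (a single Q call)."""
    a_mean = np.array(a, dtype=float)
    a_mean[i] = mean_ai               # E[a_i | s] under pi_i
    return float(q_fn(s, a_mean))
```

The Monte Carlo variant costs one Q evaluation per sample, while the mean-action variant needs only one evaluation per dimension, which is why the slide calls it more computationally efficient.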

Experiments: Effect of global information

Contributions (Recap)

• An action-dependent baseline enables using additional signals beyond the state to achieve bias-free variance reduction.

• Derive an Optimal Action-Dependent Baseline

• Analysis of improvement in variance reduction

• Empirical results to show the effects of the proposed baselines and a comparison of several choices of baselines

Is this it?

• Some people don’t believe this works at all!

References

• Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. https://arxiv.org/pdf/1803.07246.pdf

• Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. https://arxiv.org/abs/1611.02247

• The Mirage of Action-Dependent Baselines in Reinforcement Learning. https://arxiv.org/pdf/1802.10031.pdf

• Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods. https://arxiv.org/abs/1908.03263

Presentations

Jan 28

• Need 4 students

• Presentation Review Thurs/Fri (sign up)

Feb 5

• Need a minimum of 4 students

• Presentation Review Tues Jan 28 and Wed Jan 29

Projects

Jan 28

• Proposal: Due

• 2 pages

• LaTeX template provided (CoRL).

• Should include an introduction with motivation, the intuition, and a description of the aimed contribution.

• If empirical, state a broad experimental plan and the conditions of success.

• If analysis, then state what property you are analysing.

Going from Go to Robot/Control

• Known Environment vs Unstructured/Open World

• Need for Behavior Transfer

• Discrete vs Continuous States-Actions

• Single vs Variable Goals

• Reward Oracle vs Reward Inference

Other Open Problems

• Single algorithm for multiple tasks

• Learn new tasks very quickly

• Reuse past information about related problems

• Reward modelling in open environment

• How and what to build a model of?

• How much to rely on the model vs direct reflex (model-free)

• Learn without interaction if we have already seen a lot of data

What this course plans to cover

• Imitation Learning: Supervised

• Policy Gradient Algorithms

• Actor-Critic Methods

• Value Based Methods

• Distributional RL

• Model-Based Methods

• Imitation Learning: Inverse RL

• Exploration Methods

• Bayesian RL

• Hierarchical RL

Recommended