Active Reinforcement Learning
Virginia Tech CS4804
Outline
• Active reinforcement learning
• Active adaptive dynamic programming
• Q-learning
• Policy Search
Passive Learning
• Recordings of an agent running a fixed policy
• Observe states, rewards, actions
• Direct utility estimation
• Adaptive dynamic programming (ADP)
• Temporal-difference (TD) learning
Problems with Passive Reinforcement Learning
• A passive learner only evaluates its fixed policy; an active agent must choose its own actions, e.g., greedily with respect to the learned utilities:
π(s) = argmax_a Σ_{s'} P(s'|s,a) U(s')
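A minimal Python sketch of this greedy policy extraction, assuming a learned transition model P[s][a] (a dict mapping next states to probabilities) and utility estimates U; all names are illustrative, not from the slides:

def greedy_policy(P, U, states, actions):
    # π(s) = argmax_a Σ_{s'} P(s'|s,a) U(s')
    policy = {}
    for s in states:
        policy[s] = max(actions,
                        key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
    return policy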
Exploration
• Naive approach: occasionally choose a random action
• shrink the probability of a random action over time
• Another approach: always act greedily, but overestimate rewards for unexplored states (both ideas are sketched in code after this slide)
U⁺(s) ← R(s) + γ max_a f( Σ_{s'} P(s'|s,a) U⁺(s'), N(s,a) )
f(u, n) = R⁺   if n < N_e
          u    otherwise
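A small Python sketch of both exploration strategies; R_PLUS, N_E, and the decay schedule are illustrative choices, not values from the slides:

import random

R_PLUS = 1.0   # assumed optimistic reward bound R⁺
N_E = 5        # assumed exploration threshold N_e
GAMMA = 0.9

def f(u, n):
    # exploration function: stay optimistic until (s, a) has been tried N_E times
    return R_PLUS if n < N_E else u

def epsilon_greedy(best_action, actions, step, eps0=1.0, decay=0.99):
    # naive approach: random action with probability shrinking over time
    if random.random() < eps0 * decay ** step:
        return random.choice(actions)
    return best_action

def u_plus_update(s, R, P, U_plus, N, actions):
    # U⁺(s) ← R(s) + γ max_a f(Σ_{s'} P(s'|s,a) U⁺(s'), N(s,a))
    return R[s] + GAMMA * max(
        f(sum(p * U_plus[s2] for s2, p in P[s][a].items()), N[(s, a)])
        for a in actions)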
Active TD-Learning
• Exactly the same as non-active TD learning
U(s) ← U(s) + α (R(s) + γ U(s') − U(s))
U(s) = R(s) + γ E_{s'}[U(s')]
• Still need estimates of transition probabilities
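The update itself is one line in Python; names are illustrative, and alpha is the learning rate:

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    # U(s) ← U(s) + α(R(s) + γ U(s') − U(s))
    U[s] += alpha * (r + gamma * U[s_next] - U[s])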
Action-Utility Functions
• Combine transition probabilities with utilities

Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')

U(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U(s')

U(s) = max_a Q(s,a)
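A short sketch relating Q and U, reusing the illustrative model dictionaries from the earlier sketch:

def q_from_model(s, a, R, P, Q, actions, gamma=0.9):
    # Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
    return R[s] + gamma * sum(p * max(Q[(s2, b)] for b in actions)
                              for s2, p in P[s][a].items())

def u_from_q(s, Q, actions):
    # U(s) = max_a Q(s,a)
    return max(Q[(s, a)] for a in actions)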
Q-Learning
Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')

Apply TD learning to Q directly, so no transition model is needed:
Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))

Compare the TD update for U:
U(s) ← U(s) + α (R(s) + γ U(s') − U(s))
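A tabular Q-learning sketch built on this update; the environment interface (reset/step) and the hyperparameters are assumptions for illustration:

import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.1):
    # env is assumed to expose reset() -> s and step(s, a) -> (r, s2, done)
    s = env.reset()
    done = False
    while not done:
        # ε-greedy action selection over the current Q estimates
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s2, done = env.step(s, a)
        # Q(s,a) ← Q(s,a) + α(R(s) + γ max_{a'} Q(s',a') − Q(s,a))
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

Q = defaultdict(float)  # Q-table, all entries start at zero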
Approximate Q-Learning
Q̂(s,a) := g(s,a,θ) := θ₁ f₁(s,a) + θ₂ f₂(s,a) + … + θ_d f_d(s,a)

θ_i ← θ_i + α (R(s) + γ max_{a'} Q̂(s',a') − Q̂(s,a)) ∂g/∂θ_i

For linear g, ∂g/∂θ_i = f_i(s,a), so:
θ_i ← θ_i + α (R(s) + γ max_{a'} Q̂(s',a') − Q̂(s,a)) f_i(s,a)
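A sketch of the linear update, assuming a hypothetical feature function features(s, a) that returns a list of d numbers:

def approx_q(theta, feats):
    # Q̂(s,a) = Σ_i θ_i f_i(s,a)
    return sum(t * f for t, f in zip(theta, feats))

def approx_q_update(theta, features, s, a, r, s2, actions, alpha=0.01, gamma=0.9):
    q_sa = approx_q(theta, features(s, a))
    q_next = max(approx_q(theta, features(s2, b)) for b in actions)
    delta = r + gamma * q_next - q_sa        # TD error
    for i, f_i in enumerate(features(s, a)):
        theta[i] += alpha * delta * f_i      # θ_i ← θ_i + α δ f_i(s,a)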
Loss Minimization for Parameterized Q
Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))

Estimation error:
ℓ(s,s') = ℓ( R(s) + γ max_{a'} Q(s',a') − Q(s,a) )

Compute ∇_Q ℓ(s,s') with backpropagation, do stochastic gradient descent.
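A sketch using squared error as ℓ with the linear Q̂ from the previous sketch, so the gradient can be written out by hand; for a neural-network Q, a framework's backpropagation would supply it instead:

def sgd_step(theta, features, s, a, r, s2, actions, alpha=0.01, gamma=0.9):
    # ℓ(δ) = ½ δ², where δ = R(s) + γ max_{a'} Q̂(s',a') − Q̂(s,a)
    delta = (r + gamma * max(approx_q(theta, features(s2, b)) for b in actions)
             - approx_q(theta, features(s, a)))
    f_sa = features(s, a)
    for i in range(len(theta)):
        grad_i = -delta * f_sa[i]      # semi-gradient: target held constant
        theta[i] -= alpha * grad_i     # stochastic gradient descent step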
Policy Search
• Instead, parameterize the policy as a conditional probability distribution π_θ(a|s)
• Tune the parameters by approximating the policy gradient
• Gradient descent
• If curious, read https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
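A minimal REINFORCE-style sketch, one standard way to approximate the policy gradient (not necessarily the exact variant the slides intend), using a softmax policy over the linear features assumed earlier:

import math

def softmax_policy(theta, features, s, actions):
    # π_θ(a|s) ∝ exp(θ · f(s,a))
    prefs = [sum(t * f for t, f in zip(theta, features(s, a))) for a in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, features, episode, actions, alpha=0.01, gamma=0.9):
    # episode is a list of (s, a, r); G is the return from each step onward
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax_policy(theta, features, s, actions)
        f_sa = features(s, a)
        # ∇ log π_θ(a|s) for a softmax policy: f(s,a) − Σ_b π_θ(b|s) f(s,b)
        for i in range(len(theta)):
            expected_f = sum(p * features(s, b)[i] for p, b in zip(probs, actions))
            theta[i] += alpha * G * (f_sa[i] - expected_f)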
Summary
• Active ADP: choose random actions, or act greedily and overestimate utility
• Active TD: exactly the same as standard TD
• Q-learning: learn the action-utility function directly, via TD learning
• Approximation uses the derivative of the utility function, easy if linear
• Policy search
Summary

                                | Direct Utility Estimation | ADP                              | TD Learning                           | Q-Learning                     | Policy Search (Gradient)
Learns p(s'|s,a)                | No                        | Yes, by counting                 | No                                    | No                             | No
Learns R(s)                     | No                        | Yes                              | No                                    | No                             | No
Learns state utility (U)        | U^π                       | optimal U                        | optimal U if active, U^π otherwise    | No                             | No
Learns state-action utility (Q) | No                        | No                               | No                                    | Yes                            | No
Learns π                        | No                        | Yes                              | No (need p)                           | Yes                            | Yes
Strategy                        | Average observed utility  | Solve Bellman w/ estimated p & R | Adjust U based on change from s to s' | TD learning for Q instead of U | Fancy calculus