Efficient Policy Gradient Optimization/Learning of Feedback Controllers
Chris Atkeson
Punchlines
• Optimize and learn policies: switch from “value iteration” to “policy iteration”.
• This is a big switch from optimizing and learning value functions.
• Use gradient-based policy optimization.
Motivations
• Efficiently design nonlinear policies.
• Make policy-gradient reinforcement learning practical.
Model-Based Policy Optimization
• Simulate the policy u = π(x,p) from a set of initial states x0 to find the policy cost.
• Use your favorite local or global optimizer to optimize the simulated policy cost.
• If gradients are used, they are typically estimated numerically.
• Δp = -ε Σ_x0 w(x0) V_p (first-order gradient step)
• Δp = -(Σ_x0 w(x0) V_pp)^(-1) Σ_x0 w(x0) V_p (second-order Newton step)
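A minimal sketch of this simulate-and-numerically-differentiate loop, to make its cost concrete. The function names, rollout horizon, and finite-difference step h are illustrative assumptions, not code from the slides:

```python
import numpy as np

def simulate_cost(p, x0, f, policy, L, horizon=100):
    """Roll out u = pi(x, p) from x0 and accumulate the cost L(x, u)."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = policy(x, p)
        cost += L(x, u)
        x = f(x, u)
    return cost

def numerical_policy_gradient(p, x0s, w, f, policy, L, h=1e-5):
    """Central-difference estimate of sum_x0 w(x0) V_p; p must be a float array."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        dp = np.zeros_like(p)
        dp[i] = h
        for x0, wi in zip(x0s, w):
            g[i] += wi * (simulate_cost(p + dp, x0, f, policy, L)
                          - simulate_cost(p - dp, x0, f, policy, L)) / (2 * h)
    return g

# First-order update: Delta p = -eps * sum_x0 w(x0) V_p
def first_order_step(p, x0s, w, f, policy, L, eps=1e-2):
    return p - eps * numerical_policy_gradient(p, x0s, w, f, policy, L)
```

Note the expense: one gradient estimate takes 2·dim(p)·|x0s| full rollouts, which is exactly what the next question is about.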
Can we make model-based policy gradient more efficient?
Analytic Gradients
• Deterministic policy: u = π(x,p)
• Policy iteration (Bellman equation):
V^(k-1)(x,p) = L(x,π(x,p)) + V^k(f(x,π(x,p)),p)
• Linear models:
f(x,u) = f0 + f_x Δx + f_u Δu
L(x,u) = L0 + L_x Δx + L_u Δu
π(x,p) = π0 + π_x Δx + π_p Δp
V(x,p) = V0 + V_x Δx + V_p Δp
• Policy gradient (backward recursion):
V_x^(k-1) = L_x + L_u π_x + V_x (f_x + f_u π_x)
V_p^(k-1) = (L_u + V_x f_u) π_p + V_p
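The two recursions give the exact gradient of a rollout's cost in a single backward pass. A minimal sketch, assuming the local Jacobians have already been evaluated along one simulated trajectory; the slide notation is kept (Lx, Lu, fx, fu, pix, pip), but the list-of-dicts container and function name are illustrative assumptions:

```python
import numpy as np

def analytic_policy_gradient(trajectory):
    """trajectory[k]: dict of local models at step k of one rollout.
       Shapes: Lx (1,n), Lu (1,m), fx (n,n), fu (n,m), pix (m,n), pip (m,P)."""
    n = trajectory[0]["fx"].shape[0]
    P = trajectory[0]["pip"].shape[1]
    Vx = np.zeros((1, n))  # dV/dx after the last step (no terminal cost assumed)
    Vp = np.zeros((1, P))  # dV/dp accumulated backward
    for m in reversed(trajectory):
        # V_p^(k-1) = (L_u + V_x f_u) pi_p + V_p   (uses V_x at step k)
        Vp = (m["Lu"] + Vx @ m["fu"]) @ m["pip"] + Vp
        # V_x^(k-1) = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
        Vx = m["Lx"] + m["Lu"] @ m["pix"] + Vx @ (m["fx"] + m["fu"] @ m["pix"])
    return Vp  # exact rollout-cost gradient from one backward pass
```

One forward rollout plus one backward pass replaces the 2·dim(p) rollouts per initial state that the finite-difference estimate needs.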
Handling Constraints
• Lagrange multiplier approach, with constraint violation value function.
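One hedged reading of this bullet: keep a multiplier λ on a constraint-violation value function G(p) alongside the policy cost V(p), descend the policy parameters on V + λG, and ascend λ on the remaining violation. The function handles and step sizes below are illustrative assumptions, not from the slides:

```python
def constrained_step(p, lam, grad_V, grad_G, G_value, eps_p=1e-2, eps_lam=1e-1):
    """One primal-dual step on the augmented cost V(p) + lam * G(p)."""
    p_new = p - eps_p * (grad_V(p) + lam * grad_G(p))   # primal descent on parameters
    lam_new = max(0.0, lam + eps_lam * G_value(p_new))  # dual ascent on the violation
    return p_new, lam_new
```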
V_pp: Second-Order Models
Regularization
LQBR: Linear (dynamics) Quadratic (cost) Bilinear (policy) Regulator
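A hedged sketch of the LQBR setting as I read the acronym: linear dynamics x' = Ax + Bu, quadratic cost, and a static output-feedback policy u = -K y, which is bilinear in the gain K and the measurement y = Cx. The matrices A, B, C, Q, R, K and the horizon are illustrative placeholders:

```python
import numpy as np

def lqbr_rollout_cost(A, B, C, Q, R, K, x0, horizon=200):
    """Simulated quadratic cost of the policy u = -K C x from x0."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        y = C @ x        # measurement
        u = -K @ y       # bilinear policy: linear in K, linear in y
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost
```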
Timing Test
Antecedents
• Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.
• Optimal output feedback design (1960s-1970s)
• Multiple model adaptive control (MMAC)
• Policy gradient reinforcement learning
• Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP
When Will LQBR Work?
• An initial stabilizing policy is known (“output stabilizable”).
• L_uu is positive definite.
• L_xx is positive semi-definite and (√L_xx, f_x) is detectable.
• The measurement matrix C has full row rank.
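These conditions can be checked numerically. The sketch below assumes discrete-time dynamics and uses a PBH-style rank test for detectability; matrix names follow the slides (Luu, Lxx, fx, C), but the test itself is my assumption:

```python
import numpy as np

def lqbr_conditions_hold(Luu, Lxx, fx, C, tol=1e-9):
    ok = bool(np.all(np.linalg.eigvalsh(Luu) > tol))      # Luu positive definite
    ok &= bool(np.all(np.linalg.eigvalsh(Lxx) > -tol))    # Lxx positive semi-definite
    n = fx.shape[0]
    H = np.linalg.cholesky(Lxx + tol * np.eye(n)).T       # H with H^T H ~ Lxx
    for lam in np.linalg.eigvals(fx):
        if abs(lam) >= 1.0:                               # unstable or marginal mode
            pbh = np.vstack([fx - lam * np.eye(n), H])
            ok &= np.linalg.matrix_rank(pbh) == n         # mode must be observable
    return ok and np.linalg.matrix_rank(C) == C.shape[0]  # C has full row rank
```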
Locally Linear Policies
Local Policies
GOAL
Cost Of One Gradient Calculation
Continuous Time
Other Issues
• Model Following
• Stochastic Plants
• Receding Horizon Control/MPC
• Adaptive RHC/MPC
• Combine with Dynamic Programming
• Dynamic Policies -> Learn State Estimator
Optimize Policies
• Policy Iteration, with gradient-based policy improvement step.
• Analytic gradients are easy.
• Non-overlapping sub-policies make second-order gradient calculations fast.
• Big problem: how to choose the policy structure?