
Page 1

New Developments in Integral Reinforcement Learning:
Continuous-time Optimal Control and Games

F.L. Lewis, National Academy of Inventors

Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

and

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries,
Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: ONR, US NSF
Supported by: China NNSF, China Project 111

Page 2

Invited by Manfred Morari, Konstantinos Gatsis, Pramod Khargonekar, George Pappas

Page 3

New Research Results
- Integral Reinforcement Learning for Online Optimal Control
- IRL for Online Solution of Multi-player Games
- Multi-Player Games on Communication Graphs
- Off-Policy Learning
- Experience Replay
- Bio-inspired Multi-Actor Critics
- Output Synchronization of Heterogeneous MAS

Applications to:
- Microgrid
- Robotics
- Industrial Process Control

Page 4

Optimal Control is Effective for:
- Aircraft Autopilots
- Vehicle engine control
- Aerospace Vehicles
- Ship Control
- Industrial Process Control

Multi-player Games Occur in:
- Networked Systems - bandwidth assignment
- Economics
- Control Theory - disturbance rejection
- Team games
- International politics
- Sports strategy

But, optimal control and game solutions are found by offline solution of matrix design equations.
A full dynamical model of the system is needed.

Optimality and Games

Page 5

Optimal Control - The Linear Quadratic Regulator (LQR)

System:  $\dot{x} = Ax + Bu$,  with state feedback $u = -Kx$ in an on-line real-time control loop.

User prescribed optimization criterion $(Q, R)$:
$V(x(t)) = \int_t^\infty \big(x^T Q x + u^T R u\big)\,d\tau$

Off-line design loop using the Algebraic Riccati Equation (ARE):
$0 = A^T P + P A + Q - P B R^{-1} B^T P, \qquad K = R^{-1} B^T P$

An offline design procedure that requires knowledge of the system dynamics model (A, B).

System modeling is expensive, time consuming, and inaccurate.
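To make the offline design loop concrete, here is a minimal sketch (my illustration, not from the slides) that solves the ARE with SciPy for assumed example matrices A, B, Q, R and recovers the LQR gain; it presumes the full model (A, B) is known, which is exactly the limitation the talk goes on to remove.

```python
# Minimal sketch of the classical offline LQR design described above.
# The system matrices here are assumed example values, not from the talk.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])   # assumed plant dynamics
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                               # user-prescribed state weight
R = np.array([[1.0]])                       # user-prescribed control weight

# Offline design loop: solve 0 = A'P + PA + Q - P B R^{-1} B' P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)             # K = R^{-1} B' P
print("LQR gain K =", K)                    # used online as u = -Kx
```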

Page 6

Adaptive Control is online and works for unknown systems. Generally not optimal.

Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.

Reinforcement Learning turns out to be the key to this!

We want to find optimal control solutions
- Online, in real time
- Using adaptive control techniques
- Without knowing the full dynamics

For nonlinear systems and general performance indices

Bring together Optimal Control and Adaptive Control

Page 7

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

Books

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012.

New chapters on: Reinforcement Learning, Differential Games

Page 8

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.

Page 9

Multi-player Game Solutions, IEEE Control Systems Magazine, Dec. 2017

Page 10

RL for Markov Decision Processes $(X, U, P, R)$

X = states, U = controls
P = probability of going to state x' from state x given that the control is u
R = expected reward on going to state x' from state x given that the control is u

R.S. Sutton and A.G. Barto, Reinforcement Learning - An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
W.B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, New York, 2009.

Optimal control problem: determine a policy $\pi(x,u)$ to minimize the expected future cost.

Expected value of a policy:
$V_\pi(x_k) = E_\pi\{J_k \mid x_k\} = E_\pi\Big\{\sum_{i=k}^{T} \gamma^{i-k} r_i \mid x_k\Big\}$

Optimal policy:
$\pi^*(x,u) = \arg\min_\pi V_\pi(x_k) = \arg\min_\pi E_\pi\Big\{\sum_{i=k}^{T} \gamma^{i-k} r_i \mid x_k\Big\}$

Optimal value:
$V^*(x_k) = \min_\pi V_\pi(x_k) = \min_\pi E_\pi\Big\{\sum_{i=k}^{T} \gamma^{i-k} r_i \mid x_k\Big\}$

Policy Iteration (Discrete State)

Policy evaluation by the Bellman equation:
$V_j(x) = \sum_u \pi_j(x,u) \sum_{x'} P^u_{xx'}\big[R^u_{xx'} + \gamma V_j(x')\big], \quad \text{for all } x \in X$

Policy improvement:
$\pi_{j+1}(x,u) = \arg\min_u \sum_{x'} P^u_{xx'}\big[R^u_{xx'} + \gamma V_j(x')\big], \quad \text{for all } x \in X$

The Policy Evaluation equation is a system of N simultaneous linear equations, one for each state. Policy Improvement makes $V'(x) \le V(x)$.
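As an illustration of discrete-state policy iteration as just described, the following minimal sketch runs policy evaluation (solving the N simultaneous linear Bellman equations) and greedy policy improvement on a small made-up MDP; the transition tensor P, cost table R, and discount gamma are assumed example values, not data from the talk.

```python
# Minimal policy iteration sketch for a small discrete MDP (assumed example data).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, u, x']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected cost r(x, u)

policy = np.zeros(n_states, dtype=int)          # start from an arbitrary policy
for _ in range(50):
    # Policy evaluation: solve the N simultaneous linear Bellman equations
    P_pi = P[np.arange(n_states), policy]       # transition matrix under the policy
    r_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: greedy (cost-minimizing) one-step lookahead
    Q_table = R + gamma * P @ V                 # Q_table[x, u]
    new_policy = Q_table.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("optimal policy:", policy, "value:", V)
```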

Page 11

How can one do Policy Iteration for unknown continuous-time systems?
What is Value Iteration for continuous-time systems?
How can one do ADP for CT systems?

RL ADP has been developed for Discrete-Time Systems:  $x_{k+1} = f(x_k, u_k)$

Discrete-Time System Hamiltonian Function:
$H\big(x_k, h(x_k), \Delta V_k\big) = r\big(x_k, h(x_k)\big) + V(x_{k+1}) - V(x_k)$
- Directly leads to temporal difference techniques
- System dynamics does not occur
- Two occurrences of the value allow APPROXIMATE DYNAMIC PROGRAMMING methods

Continuous-Time System:  $\dot{x} = f(x, u)$

Continuous-Time System Hamiltonian Function:
$H\Big(x, u, \frac{\partial V}{\partial x}\Big) = \frac{\partial V}{\partial x}^T \dot{x} + r(x,u) = \frac{\partial V}{\partial x}^T f(x,u) + r(x,u)$
- How to define the temporal difference? System dynamics DOES occur. Only ONE occurrence of the value gradient.
- Leads to off-line solutions if the system dynamics is known. Hard to do on-line learning.

Page 12

Discrete-Time Systems: Adaptive (Approximate) Dynamic Programming

Four ADP Methods proposed by Paul Werbos

Critic NN to approximate:
- Heuristic dynamic programming - the value $V(x_k)$
- Dual heuristic programming - the gradient $\partial V/\partial x$
- AD Heuristic dynamic programming (Watkins Q Learning) - the Q function $Q(x_k, u_k)$
- AD Dual heuristic programming - the gradients $\partial Q/\partial x$, $\partial Q/\partial u$

Action NN to approximate the Control.

Policy Iteration / Value Iteration:
- Bertsekas - Neurodynamic Programming
- Barto & Bradtke - Q-learning proof (imposed a settling time)

Page 13

CT Systems - Derivation of the Nonlinear Optimal Regulator

Nonlinear system dynamics:  $\dot{x} = f(x,u) = f(x) + g(x)u$

Cost/value:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

Leibniz gives the differential equivalent, the Bellman equation, in terms of the Hamiltonian function:
$0 = H\Big(x, u, \frac{\partial V}{\partial x}\Big) = \frac{\partial V}{\partial x}^T \dot{x} + r(x,u) = \frac{\partial V}{\partial x}^T \big(f(x) + g(x)u\big) + r(x,u), \quad V(0) = 0$

The stationarity condition $\frac{\partial H}{\partial u} = 0$ gives the stationary control policy:
$u = h(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V}{\partial x}$

HJB equation:
$0 = Q(x) + \frac{dV^*}{dx}^T f(x) - \tfrac{1}{4}\frac{dV^*}{dx}^T g(x) R^{-1} g^T(x) \frac{dV^*}{dx}, \quad V^*(0) = 0$

Off-line solution. The HJB is hard to solve and may not have a smooth solution. The dynamics must be known.

To find online methods for optimal control, focus on these two equations: the Bellman equation and the policy update.

Problem - the system dynamics shows up in the Hamiltonian.

Page 14

CT Policy Iteration - a Reinforcement Learning Technique

Utility:  $r(x,u) = Q(x) + u^T R u$

The cost is given by solving the CT Bellman equation (a scalar equation):
$0 = \frac{\partial V}{\partial x}^T f(x,u) + r(x,u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big)$

Policy Iteration Solution

Pick a stabilizing initial control policy $h_0(x)$. Given any admissible policy $u(x) = h(x)$:

Policy evaluation - find the cost from the Bellman equation:
$0 = \frac{\partial V_j}{\partial x}^T f\big(x, h_j(x)\big) + r\big(x, h_j(x)\big), \quad V_j(0) = 0$

Policy improvement - update the control:
$h_{j+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_j}{\partial x}$

Converges to the solution of the HJB equation:
$0 = Q(x) + \frac{dV^*}{dx}^T f(x) - \tfrac{1}{4}\frac{dV^*}{dx}^T g(x) R^{-1} g^T(x) \frac{dV^*}{dx}$

- Convergence proved by Leake and Liu 1967, Saridis 1979 if the Lyapunov equation is solved exactly
- Beard & Saridis used Galerkin integrals to solve the Lyapunov equation
- Abu-Khalaf & Lewis used NN to approximate V for nonlinear systems and proved convergence

M. Abu-Khalaf, F.L. Lewis, and J. Huang, "Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation," IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, Dec. 2006.

Full system dynamics must be known. Off-line solution.

Page 15

Policy Iterations for the Linear Quadratic Regulator

System:  $\dot{x} = Ax + Bu$

Cost:  $V(x(t)) = \int_t^\infty \big(x^T Q x + u^T R u\big)\,d\tau = x^T(t) P x(t)$

The differential equivalent is the Bellman equation:
$0 = H\Big(x, u, \frac{\partial V}{\partial x}\Big) = \frac{\partial V}{\partial x}^T \dot{x} + x^T Q x + u^T R u = 2x^T P (Ax + Bu) + x^T Q x + u^T R u$

Given any stabilizing feedback policy $u = -Kx$, the cost value is found by solving the Lyapunov equation = Bellman equation:
$0 = (A - BK)^T P + P(A - BK) + Q + K^T R K$

The optimal control is  $u = -R^{-1} B^T P x = -Kx$,  where P solves the Algebraic Riccati equation:
$0 = A^T P + P A + Q - P B R^{-1} B^T P$

Full system dynamics must be known. Off-line solution.

Page 16

LQR Policy iteration = Kleinman algorithm (Kleinman 1968)

1. For a given control policy $u = -K_j x$, solve for the cost $P_j$ from the Lyapunov equation (a matrix equation; Bellman eq. = Lyapunov eq.):
$0 = A_j^T P_j + P_j A_j + Q + K_j^T R K_j, \qquad A_j = A - B K_j$

2. Improve the policy:
$K_{j+1} = R^{-1} B^T P_j$

If started with a stabilizing control policy $K_0$, the matrix $P_j$ monotonically converges to the unique positive definite solution of the Riccati equation.

Every iteration step returns a stabilizing controller. The system has to be known.

OFF-LINE DESIGN. MUST SOLVE A LYAPUNOV EQUATION AT EACH STEP.
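For concreteness, here is a minimal sketch (assumed example matrices, not from the talk) of the Kleinman iteration using SciPy's Lyapunov solver; note that it needs the full model (A, B) at every step, which is the limitation IRL removes in the following slides.

```python
# Minimal sketch of the Kleinman (1968) iteration described above.
# A, B, Q, R and the stabilizing initial gain K0 are assumed example values.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.zeros((1, 2))                       # K0 = 0 is stabilizing here since A is stable

for j in range(20):
    Aj = A - B @ K
    # Policy evaluation: 0 = Aj' Pj + Pj Aj + Q + K' R K  (Lyapunov equation)
    Pj = solve_continuous_lyapunov(Aj.T, -(Q + K.T @ R @ K))
    # Policy improvement: K_{j+1} = R^{-1} B' Pj
    K_new = np.linalg.solve(R, B.T @ Pj)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new
print("Converged gain K =", K)             # matches the LQR gain from the ARE
```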

Page 17

Integral Reinforcement Learning - work of Draguna Vrabie, 2009

Lemma 1 - Draguna Vrabie

Value:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^{t+T} r(x,u)\,d\tau + \int_{t+T}^\infty r(x,u)\,d\tau$

Integral reinforcement form (IRL) of the CT Bellman equation ("Good Bellman Equation"):
$V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T)), \quad V(0) = 0$

is equivalent to the differential form ("Bad Bellman Equation"):
$0 = \frac{\partial V}{\partial x}^T f(x,u) + r(x,u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big), \quad V(0) = 0$

Solves the Bellman equation without knowing f(x,u).

Allows definition of a temporal difference error for CT systems:
$e(t) = -V(x(t)) + \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T))$

Key Idea = US Patent

Page 18

Integral Reinforcement Learning (IRL) - Draguna Vrabie

IRL Policy Iteration. An initial stabilizing control is needed.

Policy evaluation - cost update via the IRL Bellman equation (f(x) and g(x) do not appear):
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Policy improvement - control gain update (g(x) is needed for the control update):
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

The IRL Bellman equation is equivalent to the CT Bellman equation
$0 = \frac{\partial V}{\partial x}^T f(x,u) + r(x,u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big)$,
so it solves the Bellman equation (a nonlinear Lyapunov equation) without knowing the system dynamics.

D. Vrabie proved convergence to the optimal value and control (Automatica 2009, Neural Networks 2009). Converges to the solution of the HJB equation:
$0 = Q(x) + \frac{dV^*}{dx}^T f(x) - \tfrac{1}{4}\frac{dV^*}{dx}^T g(x) R^{-1} g^T(x) \frac{dV^*}{dx}$

Page 19

Approximate Dynamic Programming Implementation
Direct Optimal Adaptive Control for Partially Unknown CT Systems

Value Function Approximation (VFA) to solve the Bellman equation - Paul Werbos (ADP), Dimitri Bertsekas (NDP)

IRL Bellman equation:
$V_k(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + V_k(x(t+T))$

Approximate the value by a Weierstrass approximator network  $V(x) = W^T \phi(x)$:
$W_k^T \phi(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + W_k^T \phi(x(t+T))$

$W_k^T \big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau$

regression vector | reinforcement on the time interval [t, t+T]

This is a scalar equation with vector unknowns - the same form as standard system ID problems in adaptive control.
Now use RLS or batch least-squares along the trajectory to get the new weights $W_k$.

Then find the updated feedback:
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x} = -\tfrac{1}{2} R^{-1} g^T(x)\, \nabla\phi^T(x(t))\, W_k$

Optimal Control and Adaptive Control come together on this slide, because of RL.

Page 20

Solving the IRL Bellman Equation

Solve for the value function parameters. For a second-order system the value is $V(x) = x^T P x$ with
$P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$, so the unknown weight vector is $W^T = [\,p_{11}\ \ p_{12}\ \ p_{22}\,]$: three unknowns.

Need data from 3 time intervals to get 3 equations for the 3 unknowns:

$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T}\big(Q(x) + u_k^T R u_k\big)\,d\tau \equiv \rho(t, t+T)$

$W_k^T\big[\phi(x(t+T)) - \phi(x(t+2T))\big] = \int_{t+T}^{t+2T}\big(Q(x) + u_k^T R u_k\big)\,d\tau \equiv \rho(t+T, t+2T)$

$W_k^T\big[\phi(x(t+2T)) - \phi(x(t+3T))\big] = \int_{t+2T}^{t+3T}\big(Q(x) + u_k^T R u_k\big)\,d\tau \equiv \rho(t+2T, t+3T)$

Put together:
$W_k^T\big[\phi(x(t)) - \phi(x(t+T)) \ \ \ \phi(x(t+T)) - \phi(x(t+2T)) \ \ \ \phi(x(t+2T)) - \phi(x(t+3T))\big] = \big[\rho(t, t+T) \ \ \ \rho(t+T, t+2T) \ \ \ \rho(t+2T, t+3T)\big]$

Now solve by batch least-squares, or use Recursive Least-Squares (RLS).
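The following sketch (my illustration under assumed example values) runs this batch least-squares IRL policy iteration for a small LQR problem: it measures x(t) and the integral reinforcement over successive windows, solves the stacked linear equations for the value weights, and then improves the gain. The plant matrix A is used only inside the simulator, never by the learner; resetting the state each iteration is a crude substitute for persistence of excitation.

```python
# Minimal sketch of IRL policy iteration for an LQR problem via batch least-squares.
# All numerical values (matrices, interval T, number of windows) are assumed examples.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, -2.0]])    # unknown to the learner (used only to simulate)
B = np.array([[0.0], [1.0]])                # g(x) = B is needed for the policy update
Q, R = np.eye(2), np.array([[1.0]])
T, n_windows = 0.05, 12                     # reinforcement interval and windows per iteration

def phi(x):                                  # quadratic basis: V(x) = W' phi(x)
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def simulate(x0, K, T):
    # Integrate the plant plus the running cost rho = integral of x'Qx + u'Ru over [0, T].
    def ode(t, z):
        x = z[:2]
        u = -K @ x
        return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])
    z = solve_ivp(ode, [0.0, T], np.concatenate([x0, [0.0]]), rtol=1e-8).y[:, -1]
    return z[:2], z[2]                       # x(t+T), rho(t, t+T)

rng = np.random.default_rng(0)
K = np.zeros((1, 2))                         # stabilizing initial policy (A is stable)
for k in range(8):
    Phi, rho = [], []
    x = rng.uniform(-2.0, 2.0, size=2)       # fresh initial state: keeps the data exciting
    for _ in range(n_windows):
        x_next, r = simulate(x, K, T)
        Phi.append(phi(x) - phi(x_next))     # regression vector for this window
        rho.append(r)                        # integral reinforcement for this window
        x = x_next
    W = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)[0]
    P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])   # V = x'Px from the weights
    K = np.linalg.solve(R, B.T @ P)          # policy improvement: K = R^{-1} B' P
print("IRL gain K =", K)
```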

Page 21

Integral Reinforcement Learning (IRL)

Data set on each interval [t, t+T):  $x(t),\ \rho(t, t+T),\ x(t+T)$

On interval $(t, t+T)$: observe x(t), apply $u_k = -K_k x$, observe the cost integral, update P.
On interval $(t+T, t+2T)$: observe x(t+T), apply $u_k = -K_k x$, observe the cost integral, update P.
On interval $(t+2T, t+3T)$: observe x(t+2T), apply $u_k = -K_k x$, observe the cost integral, update P.

Do RLS until convergence to $P_k$, or use batch least-squares on the Bellman equation
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} x^T\big(Q + K_k^T R K_k\big)x\,d\tau \equiv \rho(t, t+T)$

Then update the control gain:  $K_{k+1} = R^{-1} B^T P_k$

Solving the Bellman equation this way solves the Lyapunov equation without knowing the dynamics.
This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.
A is not needed anywhere.

Page 22

Continuous-time control with discrete gain updates

Control:  $u(t) = -K_k x(t)$

[Figure: piecewise-constant gain updates $K_k$, k = 0, 1, 2, 3, 4, 5, plotted against continuous time t]

The reinforcement intervals T need not be the same; they can be selected on-line in real time. The interval T can vary.

Page 23

Persistence of Excitation

The regression vector must be PE. This relates to the choice of the reinforcement interval T.

$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau$

Page 24

Implementation - Policy evaluation. Need to solve online:
$W_k^T\big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} x^T\big(Q + K_k^T R K_k\big)x\,d\tau \equiv \rho(t, t+T)$

Add a new state = the integral reinforcement:
$\dot{\rho} = x^T Q x + u^T R u$

This is the controller dynamics, or memory.

Page 25

Direct Optimal Adaptive Controller - CT Actor-Critic Structure (Draguna Vrabie)

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

IRL requires a dynamic control system with MEMORY.

[Block diagram: System $\dot{x} = Ax + Bu$; integral-reinforcement state $\dot{\rho} = x^T Q x + u^T R u$; Critic sampled with period T (ZOH); Actor gain K feeding the control u back to the system]

Run RLS or use batch least-squares to identify the value of the current control.
Update the feedback gain after the critic has converged.
The reinforcement interval T can be selected on line, on the fly - it can change.

Solves the Riccati equation online without knowing the A matrix.

Page 26

Actor/Critic structure for CT Systems - Optimal Adaptive IRL

A new structure of adaptive controllers. (D. Vrabie, 2009)

Critic (reinforcement learning; cf. theta waves, 4-8 Hz):
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Actor (cf. motor control, 200 Hz):
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Page 27

Data-driven Online Adaptive Optimal Control (DDO)

System:  $\dot{x} = Ax + Bu$, with on-line control loop $u = -Kx$, $K = R^{-1} B^T P$.

An online supervisory control procedure that requires no knowledge of the system dynamics model A.
It automatically tunes the control gains in real time to optimize a user given cost function, using measured data (u(t), x(t)) along the system trajectories.

User prescribed optimization criterion J(Q, R).

On-line performance loop - IRL Bellman equation:
$x^T(t) P_k x(t) = \int_t^{t+T} x^T\big(Q + K_k^T R K_k\big) x\,d\tau + x^T(t+T) P_k x(t+T)$

Gain update:  $K_{k+1} = R^{-1} B^T P_k$

Data set on each interval [t, t+T):  $x(t),\ \rho(t, t+T),\ x(t+T)$

(Compare the off-line design loop using the ARE:  $0 = A^T P + P A + Q - P B R^{-1} B^T P$.)

Page 28

Optimal Control Design Allows a Lot of Design Freedom

Page 29

IRL Value Iteration - Draguna Vrabie

IRL Policy Iteration (an initial stabilizing control is needed):

Policy evaluation - cost update via the IRL Bellman equation (CT PI Bellman eq. = Lyapunov eq.):
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Policy improvement - control gain update:
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Converges to the solution of the HJB equation:
$0 = Q(x) + \frac{dV^*}{dx}^T f(x) - \tfrac{1}{4}\frac{dV^*}{dx}^T g(x) R^{-1} g^T(x) \frac{dV^*}{dx}$

IRL Value Iteration (an initial stabilizing control is NOT needed):

Value evaluation - value update via the CT VI Bellman equation:
$V_{k+1}(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Policy improvement - control gain update:
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x}$

Converges if T is small enough.
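A compact way to see the PI/VI difference in code: using the same data windows and basis as in the batch least-squares sketch earlier (Phi_t, Phi_next, rho are assumed names for stacked phi(x(t)), phi(x(t+T)) and integral reinforcements), policy iteration regresses on the difference of basis vectors, while value iteration keeps the previous weights on the right-hand side and so forms an explicit regression target.

```python
# Minimal sketch contrasting the PI and VI weight updates above (same data, same basis).
import numpy as np

def pi_weight_update(Phi_t, Phi_next, rho):
    # Policy iteration: W_k appears on both sides, so regress on the basis difference.
    return np.linalg.lstsq(Phi_t - Phi_next, rho, rcond=None)[0]

def vi_weight_update(Phi_t, Phi_next, rho, W_prev):
    # Value iteration: the right-hand side uses the PREVIOUS weights -> explicit target.
    target = rho + Phi_next @ W_prev
    return np.linalg.lstsq(Phi_t, target, rcond=None)[0]
```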

Page 30

Actor/Critic structure for CT Systems - Optimal Adaptive IRL

A new structure of adaptive controllers. (D. Vrabie, 2009)

Critic (reinforcement learning; cf. theta waves, 4-8 Hz):
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Actor (cf. motor control, 200 Hz):
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Page 31

Doya, Kimura, Kawato 2001

[Figure: brain learning loops - limbic system, deliberative evaluation at theta rhythms 4-10 Hz, motor control at 200 Hz]

Page 32

Summary of Motor Control in the Human Nervous System (Kenji Doya; picture by E. Stingu and D. Vrabie)

[Figure: hierarchy of multiple parallel loops - cerebral cortex motor areas; thalamus and basal ganglia (reinforcement learning, dopamine); cerebellum (supervised learning, eye movement, inferior olive); hippocampus and limbic system (unsupervised learning, memory functions, long and short term); brainstem; spinal cord (reflex); interoceptive and exteroceptive receptors; muscle contraction and movement. Motor control at 200 Hz, theta rhythms 4-10 Hz, gamma rhythms 30-100 Hz.]

Page 33

Synchronous Real-time Data-driven Optimal Control

Page 34

Actor/Critic structure for CT Systems - Optimal Adaptive Integral Reinforcement Learning

A new structure of adaptive controllers. (D. Vrabie, 2009)

Critic (cf. theta waves, 4-8 Hz):
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,d\tau + V_k(x(t+T))$

Actor (cf. motor control, 200 Hz):
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Policy Iteration gives the structure needed for the online optimal solution.

Page 35

Synchronous Online Solution of Optimal Control for Nonlinear Systems - Kyriakos Vamvoudakis

IRL Bellman equation:
$V(x(t-T)) = \int_{t-T}^{t} \big(Q(x) + u^T R u\big)\,d\tau + V(x(t))$

Critic Network - take the VFA as  $\hat V(x) = \hat W_1^T \phi_1(x)$.  Then the IRL Bellman equation becomes
$\hat W_1^T \phi_1(x(t-T)) = \int_{t-T}^{t} \big(Q(x) + u^T R u\big)\,d\tau + \hat W_1^T \phi_1(x(t))$

Action Network for control approximation:
$\hat u(x) = -\tfrac{1}{2} R^{-1} g^T(x)\, \nabla\phi_1^T(x)\, \hat W_2$

Define  $\Delta\phi_1(x(t)) = \phi_1(x(t)) - \phi_1(x(t-T))$.  The Bellman equation becomes
$\hat W_1^T \Delta\phi_1(x(t)) + \int_{t-T}^{t} \Big(Q(x) + \tfrac{1}{4}\hat W_2^T \bar D_1(x) \hat W_2\Big)\,d\tau = 0$
where $\bar D_1(x) = \nabla\phi_1(x)\, g(x) R^{-1} g^T(x)\, \nabla\phi_1^T(x)$.

Page 36

Theorem (Vamvoudakis & Vrabie) - Online Learning of Nonlinear Optimal Control

Data set at time [t-T, t):  $x(t-T),\ \rho(t-T, t),\ x(t)$

Let $\Delta\phi_1(x(t)) = \phi_1(x(t)) - \phi_1(x(t-T))$ be persistently exciting (PE). Tune the critic NN weights as (learning the value):
$\dot{\hat W}_1 = -a_1 \frac{\Delta\phi_1(x(t))}{\big(1 + \Delta\phi_1^T \Delta\phi_1\big)^2}\Big[\Delta\phi_1^T(x(t))\,\hat W_1 + \int_{t-T}^{t}\Big(Q(x) + \tfrac{1}{4}\hat W_2^T \bar D_1(x) \hat W_2\Big)\,d\tau\Big]$

Tune the actor NN weights as (learning the control policy):
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2 \hat W_2 - F_1 \hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\frac{\Delta\phi_1^T(x(t))}{\big(1 + \Delta\phi_1^T \Delta\phi_1\big)^2}\,\hat W_1\Big\}$

Then there exists an $N_0$ such that, for the number of hidden layer units $N > N_0$, the closed-loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN error $\tilde W_2 = W_1 - \hat W_2$ are UUB (uniformly ultimately bounded).

Data-driven Online Synchronous Policy Iteration using IRL. Does not need to know f(x).

Vamvoudakis & Vrabie

Page 37

Lyapunov energy-based proof:

$L(t) = V(x(t)) + \tfrac{1}{2}\,tr\big(\tilde W_1^T a_1^{-1} \tilde W_1\big) + \tfrac{1}{2}\,tr\big(\tilde W_2^T a_2^{-1} \tilde W_2\big)$

V(x) = unknown solution to the HJB equation:
$0 = \frac{dV}{dx}^T f + Q(x) - \tfrac{1}{4}\frac{dV}{dx}^T g R^{-1} g^T \frac{dV}{dx}$

$W_1$ = unknown least-squares solution to the Bellman equation for the given N:
$H(x, W_1, u) = W_1^T \nabla\phi_1\,(f + gu) + Q(x) + u^T R u$

$\tilde W_1 = W_1 - \hat W_1, \qquad \tilde W_2 = W_1 - \hat W_2$

Guarantees stability.

Page 38

Synchronous Online Solution of Optimal Control for Nonlinear Systems

Adaptive Critic structure - Reinforcement learning. Two learning networks, tuned simultaneously.
A new form of Adaptive Control with TWO tunable networks - a new structure of adaptive controllers.

Critic tuning:
$\dot{\hat W}_1 = -a_1 \frac{\Delta\phi_1(x(t))}{\big(1 + \Delta\phi_1^T \Delta\phi_1\big)^2}\Big[\Delta\phi_1^T(x(t))\,\hat W_1 + \int_{t-T}^{t}\Big(Q(x) + \tfrac{1}{4}\hat W_2^T \bar D_1(x) \hat W_2\Big)\,d\tau\Big]$

Actor tuning:
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2 \hat W_2 - F_1 \hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\frac{\Delta\phi_1^T(x(t))}{\big(1 + \Delta\phi_1^T \Delta\phi_1\big)^2}\,\hat W_1\Big\}$

K.G. Vamvoudakis and F.L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878-888, May 2010.

Page 39

A New Class of Adaptive Control

[Block diagram: Plant with control input and output]

- Identify the controller - Direct Adaptive
- Identify the system model - Indirect Adaptive
- Identify the performance value - Optimal Adaptive:  $V(x) = W^T \phi(x)$

Page 40

Data-driven Online Solution of Differential Games
Synchronous Solution of Multi-player Non Zero-sum Games

Page 41

Multi-player Game Solutions, IEEE Control Systems Magazine, Dec. 2017

Multi‐player Differential Games

Page 42

Sun Tzu bing fa (孙子兵法, The Art of War)

Games on Communication Graphs

500 BC

Page 43

F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2013

Key Point

Lyapunov functions and performance indices must depend on the graph topology.

Hongwei Zhang, F.L. Lewis, and Abhijit Das, "Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback," IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.

H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.

Page 44

Graphical Games: Synchronization - the Cooperative Tracker Problem

Node dynamics:  $\dot{x}_i = A x_i + B_i u_i$,  with  $x_i(t) \in \mathbb{R}^n$, $u_i(t) \in \mathbb{R}^{m_i}$

Target generator (leader) dynamics:  $\dot{x}_0 = A x_0$

Synchronization problem:  $x_i(t) \to x_0(t), \ \forall i$

Local neighborhood tracking error (Lihua Xie):
$\varepsilon_i = \sum_{j \in N_i} e_{ij}(x_i - x_j) + g_i (x_i - x_0)$

Local neighborhood tracking error dynamics - local agent dynamics driven by the neighbors' controls:
$\dot{\varepsilon}_i = A\varepsilon_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$

Define the local neighborhood performance index - values driven by the neighbors' controls:
$J_i\big(\varepsilon_i(0), u_i, u_{-i}\big) = \tfrac{1}{2}\int_0^\infty \Big(\varepsilon_i^T Q_{ii}\varepsilon_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\Big)\,dt = \tfrac{1}{2}\int_0^\infty L_i\big(\varepsilon_i(t), u_i(t), u_{-i}(t)\big)\,dt$

K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, "Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.

M. Abouheaf, K. Vamvoudakis, F.L. Lewis, S. Haesaert, and R. Babuska, "Multi-Agent Discrete-Time Graphical Games and Reinforcement Learning Solutions," Automatica, vol. 50, no. 12, pp. 3038-3053, 2014.
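A small numerical illustration of the local neighborhood tracking error just defined, for an assumed 3-agent directed graph; the edge weights e_ij, pinning gains g_i, and agent states are arbitrary example values.

```python
# Illustration of the local neighborhood tracking error for an assumed 3-agent graph.
import numpy as np

n = 2                                         # state dimension
E = np.array([[0, 1, 0],                      # e_ij: weight of edge from node j into node i
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
g = np.array([1.0, 0.0, 0.0])                 # only agent 0 observes the leader (pinning gain)

x = np.random.randn(3, n)                     # agent states x_i (arbitrary)
x0 = np.zeros(n)                              # leader state

def local_error(i):
    # eps_i = sum_j e_ij (x_i - x_j) + g_i (x_i - x_0)
    return sum(E[i, j] * (x[i] - x[j]) for j in range(3)) + g[i] * (x[i] - x0)

eps = np.array([local_error(i) for i in range(3)])
# eps = 0 for all agents iff every x_i equals x_0, provided the augmented graph
# contains a spanning tree rooted at the leader (as it does here: leader -> 0 -> 2 -> 1).
print(eps)
```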

Page 45

New Differential Graphical Game

[Figure: communication graph with the players' control actions $u_1$, $u_2$, ...; $u_i$ = control action of player i]

State dynamics of agent i (local dynamics):
$\dot{\varepsilon}_i = A\varepsilon_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$

Value function of player i (local value function):
$J_i\big(\varepsilon_i(0), u_i, u_{-i}\big) = \tfrac{1}{2}\int_0^\infty \Big(\varepsilon_i^T Q_{ii}\varepsilon_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\Big)\,dt$

The local dynamics and the local value function only depend on the graph neighbors.

Page 46

Standard Multi-Agent Differential Game

Central dynamics:
$\dot{z} = Az + \sum_{i=1}^{N} B_i u_i$

Value function of player i:
$J_i\big(z(0), u_i, u_{-i}\big) = \tfrac{1}{2}\int_0^\infty \Big(z^T Q z + \sum_{j=1}^{N} u_j^T R_{ij} u_j\Big)\,dt$

[Figure: players with control actions $u_1$, $u_2$, ...; $u_i$ = control action of player i]

Central dynamics; the local value function depends on ALL other control actions.

Page 47

New Definition of Nash Equilibrium for Graphical Games

Def: Interactive Nash equilibrium. The policies $u_1^*, u_2^*, \ldots, u_N^*$ are in Interactive Nash equilibrium if

1. They are in Nash equilibrium:
$J_i^* = J_i\big(u_i^*, u_{G_{-i}}^*\big) \le J_i\big(u_i, u_{G_{-i}}^*\big), \quad \forall i \in N$

2. Interaction condition: for every $i, j \in N$ there exists a policy $u_j$ such that
$J_i\big(u_j^*, u_{G_{-j}}^*\big) \ne J_i\big(u_j, u_{G_{-j}}^*\big)$

That is, every player can find a policy that changes the value of every other player. This restores the symmetry of Nash equilibrium.

Page 48

Graphical Game Solution Equations

Value function:
$V_i(\varepsilon_i(t)) = \tfrac{1}{2}\int_t^\infty \Big(\varepsilon_i^T Q_{ii}\varepsilon_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\Big)\,d\tau$

The differential equivalent (Leibniz formula) is Bellman's equation:
$H_i\Big(\varepsilon_i, \frac{\partial V_i}{\partial \varepsilon_i}, u_i, u_{-i}\Big) = \frac{\partial V_i}{\partial \varepsilon_i}^T\Big(A\varepsilon_i + (d_i + g_i)B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j\Big) + \tfrac{1}{2}\Big(\varepsilon_i^T Q_{ii}\varepsilon_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\Big) = 0$

Stationarity condition:
$0 = \frac{\partial H_i}{\partial u_i} \ \Rightarrow\ u_i = -(d_i + g_i) R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \varepsilon_i}$

1. Coupled HJ equations:
$0 = \frac{\partial V_i}{\partial \varepsilon_i}^T A_i^c \varepsilon_i + \tfrac{1}{2}\varepsilon_i^T Q_{ii}\varepsilon_i + \tfrac{1}{2}(d_i + g_i)^2 \frac{\partial V_i}{\partial \varepsilon_i}^T B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \varepsilon_i} + \tfrac{1}{2}\sum_{j \in N_i}(d_j + g_j)^2 \frac{\partial V_j}{\partial \varepsilon_j}^T B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \varepsilon_j}, \quad i \in N$

where
$A_i^c \varepsilon_i = A\varepsilon_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \varepsilon_i} + \sum_{j \in N_i} e_{ij}(d_j + g_j) B_j R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \varepsilon_j}$

At the optimum:  $H_i\Big(\varepsilon_i, \frac{\partial V_i}{\partial \varepsilon_i}, u_i^*, u_{-i}^*\Big) = 0$

Now use synchronous PI to learn the optimal Nash policies online, in real time, as the players interact.

Distributed Multi-Agent Learning Proofs

Page 49

Online Solution of Graphical Games

Use Reinforcement Learning Convergence Results

POLICY ITERATION

Kyriakos Vamvoudakis - Multi-agent Learning Convergence Proofs

Page 50

Data-driven Online Solution of Differential Games
Zero-sum 2-Player Games and H-infinity Control

Page 51

H-Infinity Control Using Reinforcement Learning - Disturbance Rejection

System:
$\dot{x} = f(x) + g(x)u + k(x)d, \qquad y = h(x), \qquad z = [\,y^T \ \ u^T\,]^T$
with control $u = l(x)$, disturbance d, and performance output z.

L2 Gain Problem: find the control u(t) so that, for all L2 disturbances and a prescribed gain $\gamma$,
$\frac{\int_0^\infty \|z(t)\|^2\,dt}{\int_0^\infty \|d(t)\|^2\,dt} = \frac{\int_0^\infty \big(h^T h + \|u\|^2\big)\,dt}{\int_0^\infty \|d(t)\|^2\,dt} \le \gamma^2$

Zero-sum differential game - Nature as the opposing player.

The game has a unique value (saddle-point solution) iff the Nash condition holds.

Page 52

Online Zero-Sum Differential Games - H-infinity Control (2 players)

System:
$\dot{x} = f(x,u) = f(x) + g(x)u + k(x)d, \qquad y = h(x)$

Cost:
$V\big(x(t), u, d\big) = \int_t^\infty \big(h^T h + u^T R u - \gamma^2\|d\|^2\big)\,d\tau = \int_t^\infty r(x, u, d)\,d\tau$

Leibniz gives the differential equivalent; the game saddle-point solution is found from the Hamiltonian - the ZS game BELLMAN EQUATION:
$H\Big(x, \frac{\partial V}{\partial x}, u, d\Big) = h^T h + u^T R u - \gamma^2\|d\|^2 + \frac{\partial V}{\partial x}^T\big(f(x) + g(x)u + k(x)d\big) = 0$

The optimal control/disturbance policies are found from the stationarity conditions $\frac{\partial H}{\partial u} = 0$, $\frac{\partial H}{\partial d} = 0$:
$u = -\tfrac{1}{2}R^{-1}g^T(x)\nabla V, \qquad d = \tfrac{1}{2\gamma^2}k^T(x)\nabla V$

HJI equation:
$0 = H(x, \nabla V^*, u^*, d^*) = h^T h + \nabla V^{*T}(x) f(x) - \tfrac{1}{4}\nabla V^{*T}(x)\, g(x) R^{-1} g^T(x)\,\nabla V^*(x) + \tfrac{1}{4\gamma^2}\nabla V^{*T}(x)\, k k^T\, \nabla V^*(x)$

K.G. Vamvoudakis and F.L. Lewis, "Online solution of nonlinear two-player zero-sum games using synchronous policy iteration," Int. J. Robust and Nonlinear Control, vol. 22, pp. 1460-1483, 2012.

D. Vrabie and F.L. Lewis, "Adaptive dynamic programming for online solution of a zero-sum differential game," J. Control Theory App., vol. 9, no. 3, pp. 353-360, 2011.

Page 53

Actor-Critic structure - three time scales

[Block diagram: linear system $\dot{x} = Ax + B_2 u + B_1 w$ with state x; Controller/Player 1 providing the control u from the current value kernel; Disturbance/Player 2 providing w; Critic learning procedure that identifies the value of the current control/disturbance pair from measured data and updates the two players on two different (fast/slow) time scales]

Page 54

New Developments in IRL for CT Systems
- Q Learning for CT Systems
- Experience Replay
- Off-Policy IRL

Page 55

IRL with Experience Replay
Humans use memories of past experiences to tune current policies.

System:  $\dot{x}(t) = f(x(t)) + g(x(t))u(t)$

Value (with input constraints of magnitude $\lambda$):
$V(x(t)) = \int_t^\infty \Big(Q(x) + 2\int_0^u \big(\lambda\tanh^{-1}(v/\lambda)\big)^T R\,dv\Big)\,d\tau$

Bellman equation:
$Q(x) + 2\int_0^u \big(\lambda\tanh^{-1}(v/\lambda)\big)^T R\,dv + \nabla V^T(x)\big(f(x) + g(x)u\big) = 0, \quad V(0) = 0$

Action update:
$u^* = -\lambda\tanh\Big(\tfrac{1}{2\lambda} R^{-1} g^T(x)\nabla V^*(x)\Big)$

IRL Bellman equation:
$V(x(t-T)) = \int_{t-T}^{t}\Big(Q(x) + 2\int_0^u \big(\lambda\tanh^{-1}(v/\lambda)\big)^T R\,dv\Big)\,d\tau + V(x(t))$

VFA - Value Function Approximation:  $\hat V(x) = \hat W_1^T \phi(x)$

i/o data measurements:
$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)), \qquad p(t) = \int_{t-T}^{t}\Big(Q(x) + 2\int_0^u \big(\lambda\tanh^{-1}(v/\lambda)\big)^T R\,dv\Big)\,d\tau$

The Bellman equation gives a linear equation for the weights:
$p(t) + W_1^T\Delta\phi(x(t)) \equiv \varepsilon_B(t)$

Standard critic weight tuning:
$\dot{\hat W}_1(t) = -\alpha_1 \frac{\Delta\phi(t)}{\big(1 + \Delta\phi^T(t)\Delta\phi(t)\big)^2}\Big(\Delta\phi^T(t)\,\hat W_1(t) + p(t)\Big)$

Modares and Lewis, Automatica 2014; Girish Chowdhary - concurrent learning; Sutton and Barto book.

Page 56

IRL with Experience Replay
Humans use memories of past experiences to tune current policies.

VFA - Value Function Approximation:  $\hat V(x) = \hat W_1^T \phi(x)$

NN weight tuning uses past samples. Critic weight tuning with experience replay:
$\dot{\hat W}_1(t) = -\alpha_1 \frac{\Delta\phi(t)}{\big(1 + \Delta\phi^T(t)\Delta\phi(t)\big)^2}\Big(\Delta\phi^T(t)\,\hat W_1(t) + p(t)\Big) - \alpha_1 \sum_{j=1}^{l} \frac{\Delta\phi_j}{\big(1 + \Delta\phi_j^T\Delta\phi_j\big)^2}\Big(\Delta\phi_j^T\,\hat W_1(t) + p_j\Big)$

where the i/o data measurements are
$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)), \qquad p(t) = \int_{t-T}^{t}\Big(Q(x) + 2\int_0^u \big(\lambda\tanh^{-1}(v/\lambda)\big)^T R\,dv\Big)\,d\tau$
and $\Delta\phi_j$, $p_j$ are stored data from previous time intervals.

Improvements:
1. Speeds up convergence
2. The PE condition is milder

Modares and Lewis, Automatica 2014; Girish Chowdhary - concurrent learning; Sutton and Barto book.
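A minimal sketch of the experience-replay tuning law above: each Euler step of the critic adds, to the gradient on the current interval's Bellman error, gradients on stored past samples. The function and variable names (critic_step, replay_buffer, dphi, p) are placeholders of mine, and the gains are assumed values.

```python
# Sketch of the experience-replay critic tuning law, discretized by one Euler step.
import numpy as np

def critic_step(W1, dphi_now, p_now, replay_buffer, alpha1=1.0, dt=0.001):
    def grad_term(dphi, p):
        norm = (1.0 + dphi @ dphi) ** 2
        return dphi / norm * (dphi @ W1 + p)       # normalized Bellman-error gradient
    W1_dot = -alpha1 * grad_term(dphi_now, p_now)  # current interval
    for dphi_j, p_j in replay_buffer:              # replay of stored past intervals
        W1_dot -= alpha1 * grad_term(dphi_j, p_j)
    return W1 + dt * W1_dot

# Usage: keep appending (dphi, p) pairs measured on past intervals to replay_buffer,
# then call critic_step once per integration step of the continuous tuning law.
```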

Page 57

New Principles

Off-Policy Learning

Page 58

On-policy RL

Target policy: the policy that we are learning about.
Behavior policy: the policy that generates actions and behavior.

In on-policy RL the target policy and the behavior policy are the same.

[Block diagram: Ref. → target and behavior policy → System]

Sutton and Barto book.

Off-Policy Reinforcement Learning: humans can learn optimal policies while actually playing suboptimal policies.

Page 59

Off-policy RL

In off-policy RL the target policy and the behavior policy are different.

[Block diagram: Ref. → Behavior Policy → System, with a separate Target Policy being learned]

Humans can learn optimal policies while actually applying suboptimal policies.

H. Modares, F.L. Lewis, and Z.-P. Jiang, "H-infinity Tracking Control of Completely-unknown Continuous-time Systems via Off-policy Reinforcement Learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550-2562, Oct. 2015.

Ruizhuo Song, F.L. Lewis, Qinglai Wei, "Off-Policy Integral Reinforcement Learning Method to Solve Nonlinear Continuous-Time Multi-Player Non-Zero-Sum Games," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 704-713, 2017.

Bahare Kiumarsi, Frank L. Lewis, Zhong-Ping Jiang, "H-infinity Control of Linear Discrete-time Systems: Off-policy Reinforcement Learning," Automatica, vol. 78, pp. 144-152, 2017.

Page 60

Off-policy IRL (DDO)
Humans can learn optimal policies while actually applying suboptimal policies.

System:  $\dot{x} = f(x) + g(x)u$

Value:  $J(x) = \int_t^\infty r\big(x(\tau), u(\tau)\big)\,d\tau$

On-policy IRL (must know g(x) for the policy update):
$J^{[i]}(x(t)) = \int_t^{t+T}\big(Q(x) + u^{[i]T} R\, u^{[i]}\big)\,d\tau + J^{[i]}(x(t+T)), \qquad u^{[i+1]} = -\tfrac{1}{2} R^{-1} g^T \frac{\partial J^{[i]}}{\partial x}$

Off-policy IRL: write the system as  $\dot{x} = f + g u^{[i]} + g\big(u - u^{[i]}\big)$, where u is the applied (behavior) control. Then
$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T}\big(Q(x) + u^{[i]T} R\, u^{[i]}\big)\,d\tau + 2\int_t^{t+T} u^{[i+1]T} R\big(u - u^{[i]}\big)\,d\tau$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$. They can be found simultaneously online using measured data, via the Kronecker product and VFA. (Yu Jiang & Zhong-Ping Jiang, Automatica 2012)

1. Completely unknown system dynamics.
2. Can use the applied u(t) for:
   - disturbance rejection - Z.P. Jiang; R. Song and Lewis, 2015
   - robust control - Y. Jiang & Z.P. Jiang, IEEE TCS 2012
   - exploring probing noise - without bias! - J.Y. Lee, J.B. Park, Y.H. Choi 2012
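The sketch below illustrates off-policy IRL for a single-input linear system in the spirit of the equation above: the behavior control carries probing noise, while the target policy u^[i] = -K_i x is only evaluated from data; the value weights and the improved gain are found simultaneously by least squares. All numerical values are assumed, and the plant matrices are used only to generate data.

```python
# Minimal sketch of off-policy IRL for a single-input linear system (assumed example).
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, -2.0]])    # used only to generate data
B = np.array([[0.0], [1.0]])
Q, r = np.eye(2), 1.0                        # state weight and scalar control weight R = r
T, n_windows = 0.05, 20

def collect_window(x0, t0, K):
    """Apply the behavior control over [t0, t0+T] and accumulate the two integrals
    appearing in the off-policy Bellman equation."""
    def ode(t, z):
        x = z[:2]
        u = (-K @ x).item() + 0.5 * np.sin(7 * t) + 0.3 * np.cos(13 * t)   # probing noise
        q_dot = x @ Q @ x + r * (K @ x).item() ** 2        # Q(x) + u_i' R u_i, with u_i = -Kx
        c_dot = (u + (K @ x).item()) * x                   # (u - u_i) x'
        return np.concatenate([A @ x + B.flatten() * u, [q_dot], c_dot])
    z = solve_ivp(ode, [t0, t0 + T], np.concatenate([x0, [0.0], np.zeros(2)]), rtol=1e-8).y[:, -1]
    return z[:2], z[2], z[3:5]

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])    # V = [p11, 2p12, p22] . phi(x)
K = np.zeros((1, 2))                                       # stabilizing initial target policy
for i in range(6):
    rows, rhs, x, t = [], [], np.array([1.0, -1.0]), 0.0
    for _ in range(n_windows):
        x_next, q_int, c_int = collect_window(x, t, K)
        # W_P'[phi(x) - phi(x')] + 2 r c_int' K_{i+1}' = q_int  (linear in W_P and K_{i+1})
        rows.append(np.concatenate([phi(x) - phi(x_next), 2.0 * r * c_int]))
        rhs.append(q_int)
        x, t = x_next, t + T
    theta = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    K = theta[3:5].reshape(1, 2)                           # improved target policy
print("off-policy IRL gain K =", K)                        # approaches the LQR gain
```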

Page 61

Off-policy IRL for Multi-player NZS Games (DDO) - Angela Song and Lewis

System:  $\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x) u_j$

Value of player i:
$V_i(x(t)) = \int_t^\infty r_i\big(x, u_1, u_2, \ldots, u_N\big)\,d\tau = \int_t^\infty \Big(Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j\Big)\,d\tau$

Off-policy: write the system in terms of the target policies $u_j^{[k]}$ and the applied behavior controls $u_j$:
$\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x) u_j^{[k]} + \sum_{j=1}^{N} g_j(x)\big(u_j - u_j^{[k]}\big)$

Off-policy IRL Bellman equation for player i (the on-policy evaluation plus a correction term for the behavior controls):
$V_i^{[k]}(x(t)) - V_i^{[k]}(x(t+T)) = \int_t^{t+T}\Big(Q_i(x) + \sum_{j=1}^{N} u_j^{[k]T} R_{ij}\, u_j^{[k]}\Big)\,d\tau - \int_t^{t+T}\frac{\partial V_i^{[k]}}{\partial x}^T \sum_{j=1}^{N} g_j(x)\big(u_j - u_j^{[k]}\big)\,d\tau$

1. Solve online using measured data for $V_i^{[k]}$ and $u_i^{[k+1]}$.
2. Completely unknown dynamics.
3. Add exploring noise with no bias.

Page 62

Off-Policy Learning for Estimating Malicious Adversaries' Hidden True Intent

Two Opposing Teams - MAS H-infinity control

MA systems:  $\dot{x}_i = A x_i + B u_i + D v_i$

Cost:
$V_i(x_i(t)) = \tfrac{1}{2}\int_t^\infty \Big(x_i^T Q_{ii} x_i + u_i^T R_{ii} u_i + \sum_j u_j^T R_{ij} u_j - \gamma^2 v_i^T T_{ii} v_i - \gamma^2 \sum_j v_j^T T_{ij} v_j\Big)\,dt = \int_t^\infty r_i(x_i, u_i, v_i)\,dt$

Optimal target policies:
$u_i^{k+1} = -\tfrac{1}{2} R_{ii}^{-1} B^T \frac{\partial V_i^k}{\partial x_i}, \qquad v_i^{k+1} = \tfrac{1}{2\gamma^2} T_{ii}^{-1} D^T \frac{\partial V_i^k}{\partial x_i}$

Off-policy IRL: write the dynamics in terms of the target policies and the actual behavior policies $u_i$, $v_i$:
$\dot{x}_i = A x_i + B u_i^k + D v_i^k + B\big(u_i - u_i^k\big) + D\big(v_i - v_i^k\big)$

Off-policy Bellman equation:
$V_i^k(x_i(t)) - V_i^k(x_i(t+T)) = \int_t^{t+T} r_i\big(x_i, u_i^k, v_i^k\big)\,dt + \int_t^{t+T}\Big(2\,u_i^{(k+1)T} R_{ii}\big(u_i - u_i^k\big) - 2\gamma^2\, v_i^{(k+1)T} T_{ii}\big(v_i - v_i^k\big)\Big)\,dt$

Page 63

Output Synchronization of Heterogeneous MAS

Page 64

Output Synchronization of Heterogeneous MAS

Heterogeneous multi-agents:  $\dot{x}_i = A_i x_i + B_i u_i, \qquad y_i = C_i x_i$

Leader node:  $\dot{\zeta}_0 = S\zeta_0, \qquad y_0 = R\zeta_0$

Output regulation error:  $\epsilon_i(t) = y_i(t) - y_0(t) \to 0$

Output regulator equations:
$A_i\Pi_i + B_i\Gamma_i = \Pi_i S, \qquad C_i\Pi_i = R$

The dynamics are different and the state dimensions can be different. The output regulator equations capture the common core of all the agents' dynamics and define a synchronization manifold.

Page 65

Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Nageshrao, Modares, Lopes, Babuska, Lewis

MAS:  $\dot{x}_i = A_i x_i + B_i u_i, \quad y_i = C_i x_i$
Leader:  $\dot{\zeta}_0 = S\zeta_0, \quad y_0 = R\zeta_0$

Optimal Tracker Problem - Our Solution

Augmented systems:  $X_i(t) = \big[x_i^T(t)\ \ \zeta_0^T(t)\big]^T \in \mathbb{R}^{n_i + p}$
$\dot{X}_i = T_i X_i + B_{1i} u_i, \qquad T_i = \begin{bmatrix} A_i & 0 \\ 0 & S \end{bmatrix}, \quad B_{1i} = \begin{bmatrix} B_i \\ 0 \end{bmatrix}$

Performance index (discounted):
$V_i(X_i(t)) = \int_t^\infty e^{-\gamma_i(\tau - t)}\, X_i^T\big(C_{1i}^T Q_i C_{1i} + K_i^T W_i K_i\big) X_i\,d\tau = X_i^T(t) P_i X_i(t)$
where $C_{1i} X_i = y_i - y_0$ is the output regulation error.

Control:  $u_i = K_{1i} x_i + K_{2i}\zeta_0 = K_i X_i$

Page 66

Off-Policy RL

Tracker dynamics:  $\dot{X}_i = T_i X_i + B_{1i} u_i$

Rewrite as:
$\dot{X}_i = \big(T_i + B_{1i} K_i^\kappa\big) X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big) \equiv T_i^\kappa X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big)$

Now the Bellman equation becomes
$e^{-\gamma_i \delta t}\, X_i^T(t+\delta t)\, P_i^\kappa X_i(t+\delta t) - X_i^T(t)\, P_i^\kappa X_i(t) = -\int_t^{t+\delta t} e^{-\gamma_i(\tau - t)}\Big[(y_i - y_0)^T Q_i (y_i - y_0) + \big(K_i^\kappa X_i\big)^T W_i \big(K_i^\kappa X_i\big)\Big]\,d\tau + 2\int_t^{t+\delta t} e^{-\gamma_i(\tau - t)}\big(u_i - K_i^\kappa X_i\big)^T W_i K_i^{\kappa+1} X_i\,d\tau$

with an extra term containing $K_i^{\kappa+1}$.

Algorithm 2. Off-policy IRL data-based algorithm: iterate on this equation and solve for $P_i^\kappa, K_i^{\kappa+1}$ simultaneously at each step.

Note about probing noise: if $u_i = K_i^\kappa X_i + e_i$ then $u_i - K_i^\kappa X_i = e_i$.

Do not have to know any dynamics - agent ($\dot{x}_i = A_i x_i + B_i u_i$, $y_i = C_i x_i$) or leader ($\dot{\zeta}_0 = S\zeta_0$, $y_0 = R\zeta_0$).

Page 67

Theorem - Off-policy Algorithm 2 converges to the solution of the ARE
$T_i^T P_i + P_i T_i - \gamma_i P_i + C_{1i}^T Q_i C_{1i} - P_i B_{1i} W_i^{-1} B_{1i}^T P_i = 0$

Theorem - output regulator equation solution

Let  $P_i = \begin{bmatrix} P_{11}^i & P_{12}^i \\ P_{21}^i & P_{22}^i \end{bmatrix}$.

Then the solution to the output regulator equations
$A_i\Pi_i + B_i\Gamma_i = \Pi_i S, \qquad C_i\Pi_i = R$
is given by
$\Pi_i = -\big(P_{11}^i\big)^{-1} P_{12}^i, \qquad \Gamma_i = K_{2i} - K_{1i}\big(P_{11}^i\big)^{-1} P_{12}^i$

Do not have to know the agent dynamics or the leader's dynamics (S, R).

Page 68

New Principles

There Appear to be Multiple Reinforcement Learning Loops in the Brain

Multiple Actor-Critic Learning Structures

Narendra MMAC - Multiple Model Adaptive Control

Page 69

Applications of Reinforcement Learning

- Microgrid Control
- Human-Robot Interactive Learning
- Industrial process control - mineral grinding in Gansu, China
- Resilient Control to Cyber-Attacks in Networked Multi-agent Systems
- Decision & Control for Heterogeneous MAS (different dynamics)

Page 70

Intelligent Operational Control for Complex Industrial Processes

Professor Chai Tianyou

State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, May 20, 2013

Jinliang Ding

1. Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, "Data-based Multi-objective Plant-wide Performance Optimization of Industrial Processes under Dynamic Environments," IEEE Trans. Industrial Informatics, vol. 12, no. 2, pp. 454-465, April 2016.

2. Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, "Data-driven Optimal Control of Operational Indices for a Class of Industrial Processes," IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.

Page 71

Production line for mineral processing plant

Mineral Processing Plant in Gansu China

Page 72

RL for Human-Robot Interaction (HRI)

1. H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, "Optimized Assistive Human-robot Interaction using Reinforcement Learning," IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 655-667, 2016.

2. I. Ranatunga, F.L. Lewis, D.O. Popa, and S.M. Tousif, "Adaptive Admittance Control for Human-Robot Interaction Using Model Reference Design and Adaptive Inverse Filtering," IEEE Transactions on Control Systems Technology, vol. 25, no. 1, pp. 278-285, Jan. 2017.

3. B. AlQaudi, H. Modares, I. Ranatunga, S.M. Tousif, F.L. Lewis, and D.O. Popa, "Model reference adaptive impedance control for physical human robot interaction," Control Theory and Technology, vol. 14, no. 1, pp. 1-15, Feb. 2016.

Page 73