Source: folk.ntnu.no/torarnj/Kerrigan ECC 2016.pdf

Tutorial: Algorithms & Hardware for Embedded Optimization

14:00 What Is Different about Embedded Optimization? Eric Kerrigan

14:40 Survey of Industrial Applications of Embedded MPC Alexander Domahidi

15:00 Efficient QP Frameworks for Industrial Embedded MPC Giorgio Kufoalor

15:20 Implicit vs Explicit MPC Martin Klauco (on behalf of Michal Kvasnica)

15:40 Robustness of Explicit MPC Pedro Ayerbe

What can I expect from this session?

What is Different About Embedded Optimization?

Eric Kerrigan, Bulat Khusainov, George Constantinides

[Diagram: a computing system (communication, processing, storage) and a physical system]

On time & on budget in an uncertain world

Applications of Embedded Optimization

A Fundamental Problem

$\frac{dt}{dt} = 1$

Correctness should be a function of time

Process + Embedded Optimizer = Uncertain Cyber-Physical System

Time + Uncertainty ⇒ Approximate today might be better than accurate tomorrow

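$$u^*(y) := \arg\min_u f(u, y) \quad \text{s.t.} \quad g(u, y) = 0, \quad h(u, y) \le 0$$

[Diagram: the physical system, subject to disturbances, sends measurements $y$ to the computing system, which is subject to numerical errors and returns the optimal inputs $u^*(y)$]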

$$(p^*, c^*) := \arg\min_{p, c} \phi(p, c) \quad \text{s.t.} \quad \alpha(p, c) = 0, \quad \beta(p, c) \le 0$$

[Diagram: a co-designer returns the optimal design parameters $p^*$ for the physical system and $c^*$ for the computing system of the cyber-physical system]

Discretized Optimal Control/Estimation Problems

$$\min_{q \in Q,\, s \in S} \sum_{i=0}^{N-1} \ell(q_i, s_i, s_{i+1}, i, d, t)$$

$$\text{s.t.} \quad f(q_i, s_i, s_{i+1}, i, d, t) = 0, \quad i = 0, \ldots, N-1$$

$$g(q_i, s_i, s_{i+1}, i, d, t) \le 0, \quad i = 0, \ldots, N-1$$

Ordering is important:

$$x := (s_0, q_0, s_1, q_1, \ldots, s_{N-1}, q_{N-1}, s_N)$$
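A small sketch of why the ordering of x matters, assuming every s_i and q_i is a scalar (an assumption made only to keep the example short). Stage i couples q_i, s_i and s_{i+1}, so the positions of those variables in x decide how far the nonzero entries sit from the diagonal of the resulting matrices. The helper names are hypothetical.

```python
def coupling_pairs(N):
    """Pairs of variables that appear together in the same stage term.

    Variables are labelled ("s", i) for i = 0..N and ("q", i) for i = 0..N-1.
    Stage i couples q_i, s_i and s_{i+1}.
    """
    pairs = []
    for i in range(N):
        stage = [("q", i), ("s", i), ("s", i + 1)]
        pairs += [(a, b) for a in stage for b in stage]
    return pairs


def bandwidth(order, pairs):
    """Largest index distance |row - col| over all coupled pairs."""
    pos = {var: k for k, var in enumerate(order)}
    return max(abs(pos[a] - pos[b]) for a, b in pairs)


N = 10

# Interleaved ordering: (s_0, q_0, s_1, q_1, ..., s_{N-1}, q_{N-1}, s_N)
interleaved = []
for i in range(N):
    interleaved += [("s", i), ("q", i)]
interleaved.append(("s", N))

# Grouped ordering: all states first, then all inputs
grouped = [("s", i) for i in range(N + 1)] + [("q", i) for i in range(N)]

pairs = coupling_pairs(N)
bw_interleaved = bandwidth(interleaved, pairs)
bw_grouped = bandwidth(grouped, pairs)
print(bw_interleaved, bw_grouped)
```

In this sketch the interleaved ordering keeps the bandwidth at 2 regardless of N, while the grouped ordering gives a bandwidth of N + 1. A narrow band is the kind of structure that sparse, structure-exploiting solvers can use.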

Nonlinear Programming Strategies on High-Performance Computers

Victor M. Zavala, Scalable Systems Laboratory, Department of Chemical & Biological Engineering, University of Wisconsin-Madison

With: Carl Laird (Purdue), Jia Kang (Sabre Corporation), Nai-Yuan Chiang (United Technologies)

Acknowledgement: Yankai Cao (Purdue)

Sparse & structured matrices ⇒ small (and full) not always better

Exploit Structure: Interior Point Method

[Table with columns: Time, Space]

Proposed = [Cantoni, Farokhi, Kerrigan, Shames, AuCC16]

Control / Automation

[Block diagram: SYSTEM, MODEL, ALGORITHM and CONTROLLER blocks; signals: unknown inputs, known inputs, known outputs, unknown outputs, estimates of unknown outputs, estimates of known outputs, corrections]

Real-time Optimal/Predictive Control (e.g. Receding Horizon)

[Animation: output and input trajectories plotted against time, built up over the following steps]

1. Take measurement

2. Solve optimal control problem

3. Implement first part

4. Go to step 1
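The four steps of the receding horizon loop can be sketched as follows. This is a minimal illustration, not the method of any particular solver: it assumes a scalar linear plant x+ = a*x + b*u, a quadratic cost, no constraints, a perfectly measured state and zero computational delay. All function names and numerical values are hypothetical.

```python
def solve_ocp(x0, a, b, q, r, N):
    """Finite-horizon LQ problem for x+ = a*x + b*u via a backward Riccati sweep.

    Returns the planned input sequence u_0..u_{N-1} starting from x0.
    """
    p = q                                    # terminal cost weight
    gains = []
    for _ in range(N):
        k = (a * b * p) / (r + b * b * p)    # feedback gain for this stage
        p = q + a * a * p - a * b * p * k    # Riccati update
        gains.append(k)
    gains.reverse()                          # gains[0] now belongs to the first stage
    inputs, x = [], x0
    for k in gains:
        u = -k * x
        inputs.append(u)
        x = a * x + b * u                    # predicted state along the plan
    return inputs


def receding_horizon(x0, steps, a=1.2, b=1.0, q=1.0, r=0.1, N=10):
    x, history = x0, [x0]
    for _ in range(steps):
        y = x                                # 1. take measurement (full state assumed)
        plan = solve_ocp(y, a, b, q, r, N)   # 2. solve optimal control problem
        u = plan[0]                          # 3. implement first part of the plan
        x = a * x + b * u                    #    the plant moves one sample ahead
        history.append(x)                    # 4. go to step 1
    return history


trajectory = receding_horizon(x0=1.0, steps=20)
print(trajectory[:4])
```

Only the first element of each plan is applied; the rest is discarded and recomputed from the next measurement, which is what turns the open-loop plan into feedback.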

Feedback Algorithm

Signal Processing / Estimation / Learning

[Block diagram: SYSTEM, MODEL, ALGORITHM and ESTIMATOR blocks; signals: unknown inputs, known inputs, known outputs, unknown outputs, estimates of unknown outputs, estimates of known outputs, corrections]

Process + Embedded Optimizer = Uncertain Cyber-Physical System

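[Block diagram: blocks H, C and P in closed loop; signals z, v, y, w, e, u; labels "Unknowns" and "Behaviour"]

Optimal closed-loop system?

$$u(t) \in \operatorname{Proj} \arg\min_x \{ J(x, d, t) \mid x \in X(d, t),\ d = D(y, v, t) \}$$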

When is Sub-optimal Optimal? Real-time Dynamic Optimization

[Plot: objective function value vs. latency/computational delay, with three curves: optimal, monotonic & sub-optimal, non-monotonic & sub-optimal]

When is Sub-optimal Optimal? Real-time Dynamic Optimization

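$$(u^*(\cdot, \delta), y^*(\cdot, \delta)) := \arg\min_{(u, y)} V(u), \qquad V(u) := \int_0^1 u(t)^2 \, dt$$

s.t.

$$\dot{y}(t) = u(t), \quad \forall t \in [0, 1)$$

$$u(t) = 0, \quad \forall t \in [0, \delta)$$

$$y(0) = 0, \quad y(1) \ge 1$$

$$u^*(t, \delta) = \begin{cases} 0 & \forall t \in [0, \delta) \\ 1/(1 - \delta) & \forall t \in [\delta, 1) \end{cases}$$

$$V(u^*(\cdot, 0.2)) = 1.25 < V(1.1\,u^*(\cdot, 0.2)) < V(u^*(\cdot, 0.4)) \approx 1.7$$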
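The inequality chain in this example can be checked numerically. The sketch below integrates u(t)^2 with a midpoint rule for an input that is zero during the computational delay (called `delta` here) and constant afterwards; the `scale` argument models a sub-optimal input that is 10% too large. The function name and the grid size are assumptions made for illustration.

```python
def cost_of_scaled_optimal(delta, scale=1.0, n=100_000):
    """Midpoint-rule approximation of V(scale * u*(., delta)) = int_0^1 u(t)^2 dt."""
    dt = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        # The input is forced to zero while the optimizer is still computing.
        u = 0.0 if t < delta else scale / (1.0 - delta)
        total += u * u * dt
    return total


v_fast_exact = cost_of_scaled_optimal(0.2)               # optimal, short delay
v_fast_rough = cost_of_scaled_optimal(0.2, scale=1.1)    # 10% off, short delay
v_slow_exact = cost_of_scaled_optimal(0.4)               # optimal, long delay
print(v_fast_exact, v_fast_rough, v_slow_exact)
```

For this problem the cost has the closed form scale^2 / (1 - delta), so a rough answer delivered after a delay of 0.2 still beats the exact answer delivered after a delay of 0.4.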

Precision, Accuracy and Latency: Fast Gradient Method

Bounds for $i \to \infty$ [Jerez et al., ECC 2013, IEEE TAC 2014]

[Plot: error $\|z^*(x) - z_i\|_2$ (log scale, $10^{-9}$ to $10^{-1}$) vs. number of fast gradient iterations $i$ (0 to 100); curves for double precision and for $b = 18$, $b = 24$, $b = 32$]

Quantization Errors in Communication: Distributed First-Order Optimization

[Pu, Zeilinger, Jones, arXiv 2015]

Hardware or Algorithm?

Computing Systems and Resources

[Diagram: computing systems (communication, storage, processing) and resources (space, energy, time)]

Possible Design Parameters

Computer Hardware | Algorithm
cost, space, energy, power | accuracy, termination tolerances
# processors/cores/arithmetic units | # iterations in each loop
pipeline depth | step length parameters
clock frequency and supply voltage | amount of data/results to store
memory architecture, latency, size | time horizon
communication architecture, bandwidth | complexity of physical model
number representation, word length | coarseness of discretization
actuation and sampling schedule/rate | scheduling/communication strategy

Co-Design as Multi-objective Optimization

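[Plot: performance cost vs. computing resources, showing the Pareto frontier, the performance limit, the performance constraint and the computational constraint]

[Block diagram: blocks H(c), C(c) and P in closed loop; signals z, v, y, w, e, u]

$$\min_c \{ F(H(c), c) \mid (H(c), c) \in G \}$$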

[Khusainov, Kerrigan, Constantinides, ECC2016]
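A minimal sketch of the Pareto frontier idea, assuming each candidate design has already been evaluated to a pair (computing resources, performance cost) with lower being better in both. The candidate names and numbers are made up for illustration and do not come from the cited work.

```python
def pareto_front(designs):
    """Return the designs that no other design beats in both objectives.

    Each design is (name, resources, cost); lower is better for both numbers.
    """
    front = []
    for name, res, cost in designs:
        dominated = any(
            (r2 <= res and c2 <= cost) and (r2 < res or c2 < cost)
            for _, r2, c2 in designs
        )
        if not dominated:
            front.append((name, res, cost))
    return sorted(front, key=lambda d: d[1])


# Hypothetical candidates: (label, computing resources, closed-loop performance cost)
candidates = [
    ("8 bits", 1.0, 1.60),
    ("12 bits", 2.0, 1.20),
    ("16 bits", 3.0, 1.05),
    ("16 bits, deeper pipeline", 4.0, 1.10),
    ("32 bits", 6.0, 1.00),
]

front = pareto_front(candidates)

# Apply a computational constraint and a performance constraint to the frontier.
max_resources, max_cost = 5.0, 1.25
feasible = [d for d in front if d[1] <= max_resources and d[2] <= max_cost]
print(front)
print(feasible)
```

The constraints cut the frontier down to the designs worth considering; choosing among those is then a question of how resources are traded against performance.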

Explicit Constrained LQR in Fixed Point

[Plot: closed-loop cost (1 to 1.6) vs. computing power [W] (log scale, $10^{-4}$ to $10^0$)]

[Suardi et al., ECC 2013]

[Block diagram: blocks H(c), C(c) and P in closed loop; signals z, v, y, w, e, u]

$$\min_c \{ F(H(c), c) \mid (H(c), c) \in G \}$$

c := #bits

Size is Very Important in Microprocessor Design

Die area = 1 Working = 64

Die area = 4 Working = 4

Cost per die = f(area^x), x ∈ [2, 4]
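One way to read the two die examples, assuming both cases use wafers of the same cost (an assumption, as are the variable names): the cost per working die is the wafer cost divided by the number of working dies, and comparing the two cases gives the exponent they imply.

```python
import math

wafer_cost = 1.0                  # arbitrary unit, assumed equal in both cases

area_small, working_small = 1, 64
area_large, working_large = 4, 4

cost_small = wafer_cost / working_small
cost_large = wafer_cost / working_large

cost_ratio = cost_large / cost_small
area_ratio = area_large / area_small

# If cost scales as area**x, then x = log(cost ratio) / log(area ratio).
implied_exponent = math.log(cost_ratio) / math.log(area_ratio)
print(cost_ratio, implied_exponent)
```

Under that assumption, a die with four times the area costs sixteen times as much per working die, which corresponds to an exponent of 2, the lower end of the quoted range.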

FPGA Resources for an Optimal Controller

[Plots, each against # of bits (5, 23, 52): # of FFs required (scale ×10^5), # of LUTs required (scale ×10^5), # of DSPs required (up to 4000), computation delay [s] (scale ×10^−6)]

Interior-point [Rao et al., 1998] on Xilinx Virtex 6 [Longo, Kerrigan, Constantinides, Automatica 2014]

Optimal Control in Low Precision Floating-Point

[Plots against time [s] (0 to 20): state 5 and input 2, for 5 bits (shift), 5 bits (delta) and 52 bits (shift)]

[Longo, Kerrigan, Constantinides, Automatica 2014]

Computational Resources for an Adder

Xilinx Virtex-7 XT 1140 FPGA:

Number representation | Registers/Flip-Flops (FFs) | Lookup-Tables (LUTs) | Latency/delay (clock cycles)
double floating-point, 52-bit mantissa | 1035 | 852 | 12
single floating-point, 23-bit mantissa | 542 | 445 | 12
fixed-point, 53 bits | 53 | 53 | 1
fixed-point, 24 bits | 24 | 24 | 1

Cheap and low power processors often only have fixed-point
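A fixed-point adder is cheap because it is just an integer adder: the position of the binary point is a convention, not something the hardware has to track. The sketch below emulates this in Python; the word length, the number of fractional bits and the function names are arbitrary choices for illustration.

```python
def to_fixed(x, frac_bits):
    """Encode a real number as an integer count of steps of size 2**-frac_bits."""
    return round(x * (1 << frac_bits))


def from_fixed(n, frac_bits):
    """Decode the integer representation back to a real number."""
    return n / (1 << frac_bits)


def add_fixed(a, b, word_bits):
    """Two's-complement addition that wraps on overflow, like a plain hardware adder."""
    mask = (1 << word_bits) - 1
    s = (a + b) & mask
    return s - (1 << word_bits) if s >= 1 << (word_bits - 1) else s


FRAC, WORD = 8, 16    # 16-bit words with 8 fractional bits: range [-128, 128)

a, b = to_fixed(1.5, FRAC), to_fixed(2.25, FRAC)
total = from_fixed(add_fixed(a, b, WORD), FRAC)

# 100 + 100 does not fit in the range above, so the result wraps around.
big = to_fixed(100.0, FRAC)
wrapped = from_fixed(add_fixed(big, big, WORD), FRAC)
print(total, wrapped)
```

The price of the cheap adder is that range and resolution are fixed in advance: the first sum is exact, while the second silently overflows. This is why the scaling of variables needs care when an algorithm is moved to fixed point.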

Floating-Point Arithmetic

Fixed-Point Arithmetic

Computational Resources for an Adder

Given a fixed amount of silicon (£/$/€): 200x more fixed-point additions per second and per Joule than in floating point

Fast Gradient Method in Fixed Point Arithmetic

Atomic force microscope: actuation rate > 1 MHz

[Diagram: cantilever, sample and piezo plate actuator; signals y, d, r, u]

[Jerez et al., ECC 2013 & IEEE TAC 2014]

QP solver on FPGA: latency < 1 µs, power < 1 W, within 0.1% of optimal

$$\min_u \; u'Hu + u'D(y, v, t) \quad \text{s.t.} \quad \underline{u} \le u \le \overline{u}$$
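A minimal sketch of a projected fast gradient method for a box-constrained QP of this form, written for two variables so that it needs no libraries. It uses the constant-momentum variant for strongly convex problems, which is one common choice and not necessarily the exact scheme of the cited work. Fixed-point storage is only emulated by rounding each stored iterate to `frac_bits` fractional bits; the arithmetic itself runs in double precision. The problem data and all names are hypothetical.

```python
import math


def fast_gradient_box_qp(H, d, lo, hi, iters=100, frac_bits=None):
    """Projected fast gradient method for min u'Hu + u'd s.t. lo <= u <= hi.

    H must be a symmetric positive definite 2x2 matrix. If frac_bits is given,
    every stored iterate is rounded to that many fractional bits.
    """
    # Extreme eigenvalues of the Hessian 2H (closed form for a symmetric 2x2 matrix).
    tr = H[0][0] + H[1][1]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    root = math.sqrt(tr * tr / 4.0 - det)
    L, mu = 2.0 * (tr / 2.0 + root), 2.0 * (tr / 2.0 - root)
    beta = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))

    def quantize(x):
        if frac_bits is None:
            return x
        scale = 2 ** frac_bits
        return round(x * scale) / scale

    u = [0.0, 0.0]
    w = [0.0, 0.0]
    for _ in range(iters):
        # Gradient of u'Hu + u'd at the extrapolated point w.
        grad = [2.0 * (H[j][0] * w[0] + H[j][1] * w[1]) + d[j] for j in range(2)]
        # Gradient step with step length 1/L, then projection onto the box.
        u_new = [quantize(min(hi[j], max(lo[j], w[j] - grad[j] / L))) for j in range(2)]
        # Momentum (extrapolation) step.
        w = [quantize(u_new[j] + beta * (u_new[j] - u[j])) for j in range(2)]
        u = u_new
    return u


H = [[2.0, 0.5], [0.5, 1.0]]
d = [-2.0, -8.0]
lo, hi = [-1.0, -1.0], [1.0, 1.0]

u_double = fast_gradient_box_qp(H, d, lo, hi)
u_fixed = fast_gradient_box_qp(H, d, lo, hi, frac_bits=8)
print(u_double, u_fixed)
```

Each iteration needs only a matrix-vector product, a clip and a few additions, with no divisions or branching on data beyond the clip. That regularity is what makes the method attractive for fixed-point hardware, at the price of an error floor set by the word length.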

Latency, Precision and Silicon: MINRES

[Plot for a Xilinx Virtex-7 XT 1140 FPGA: latency (cycles per iteration, 0 to 1000) vs. % registers (FFs, 0 to 100) for float52, float23, fixed53 and fixed29; annotations: "more parallelism", "faster"]

[Jerez, Constantinides, Kerrigan, IEEE TC 2015]

Fixed point (FPGA) vs floating point (GPU): MINRES

[Plot: operations per second (0 to $5 \times 10^{12}$) vs. error tolerance for >90% of problems ($\eta$, log scale from $10^{-15}$ to $10^0$). Actual sustained FPGA performance at (k = 58, P = 2), (k = 41, P = 4), (k = 23, P = 11) and (k = 17, P = 21), compared with the theoretical peak of the GPU]

NVIDIA C2050: 1.03 TFLOP/s, 1.15 GHz, 100 W, 10 GFLOP/s/W

Xilinx Virtex-7 XT 1140: 400 MHz, 22 W, >180 GOP/s/W

[Jerez, Constantinides, Kerrigan, IEEE TC 2015]

What is Different About Embedded Optimization?

Optimal?

[Block diagram: blocks H, C and P in closed loop; signals z, v, y, w, e, u]


Feedback

Structure

Hardware ⇔ Algorithm
