
Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures

Jérôme Darbon
Division of Applied Mathematics, Brown University

Applied Mathematics and Computation Seminar
Department of Mathematics and Statistics
University of Massachusetts Amherst
April 27, 2021

Joint work with Tingwei Meng and Gabriel Provencher Langlois

This work is funded by AFOSR MURI FA9550-20-1-0358.

Context and motivation

Consider the initial value problem
\[
\begin{cases}
\dfrac{\partial S}{\partial t}(x,t) + H(\nabla_x S(x,t), x, t) = \varepsilon \Delta_x S(x,t) & \text{in } \mathbb{R}^n \times (0,+\infty),\\[4pt]
S(x,0) = J(x) & \forall x \in \mathbb{R}^n,
\end{cases}
\]
which arises in many applications such as (stochastic) optimal control, imaging and data sciences, UAVs, ...

Goals: compute the viscosity solution for a given $(x,t) \in \mathbb{R}^n \times [0,+\infty)$
- evaluate $S(x,t)$ and $\nabla_x S(x,t)$
- in very high dimension
- fast enough for applications requiring real-time computations
- with low memory and low energy for embedded systems

Pros and cons of computations with grid-based approaches:
- many advanced and sophisticated numerical schemes (e.g., ENO, WENO, DG) with excellent theoretical properties
- the number of grid points is exponential in $n$
  → using such numerical approximations is not doable for $n \geq 4$
  → curse of dimensionality, a term coined by R. Bellman.

Overcoming/mitigating the curse of dimensionality

Several approaches to mitigate/overcome the curse of dimensionality:
- Max-plus methods [Akian, Dower, McEneaney, Fleming, Gaubert, Qu, ...]
- Tensor decomposition methods [Dolgov, Horowitz, Kalise, Kunisch, Todorov, ...]
- Sparse grids [Bokanowski, Garcke, Griebel, Kang, Klompmaker, Kröner, Wilcox]
- Model order reduction [Alla, Kunisch, Falcone, Volkwein, ...]
- Optimization techniques via representation formulas [Darbon, Dower, Osher, Yegorov, ...]
- ...

More recently, there has been a significant trend in using machine learning and neural network techniques for solving PDEs.
- A key idea is to leverage universal approximation theorems.
- It seems hard to find neural networks that are interpretable and generalizable and that yield reproducible results.

Neural Network: a computational point of view

Neural networks have a huge computational advantage. Due to their success, many new hardware designs have been proposed to implement neural networks efficiently:
- dedicated hardware for NNs is now available, e.g., Xilinx AI (FPGA + silicon design), Intel AI (FPGA + new CPU assembly instructions)
- high throughput / low latency (a more precise meaning of "fast")
- low energy requirements (e.g., a few Watts)
- Intel bought Altera in 2015 for ~$15 billion; AMD bought Xilinx in 2020 for ~$30 billion
- realizing these new hardware and silicon designs requires new dedicated software to implement neural networks on the new platforms
- available as boards or on the cloud: Amazon EC2 F1 instances, Microsoft Azure, ...

Can we leverage these computational resources for high-dimensional HJ PDEs? How can we mathematically certify that neural networks (NNs) actually compute a viscosity solution of an HJ PDE?

⇒ Establish new connections between NN architectures and representation formulas of HJ PDE solutions
→ the physics of some HJ PDEs can be encoded by NN architectures
→ the parameters or activation functions of the NNs define the Hamiltonians and initial data
→ no approximation: exact evaluation of $S(x,t)$ and (sometimes) $\nabla_x S(x,t)$
→ suggests an interpretation of some NN architectures in terms of HJ PDEs
→ does not rely on NN universal approximation theorems

Outline

1. Shallow NN architectures and representation of solutions of HJ PDEs inspired by the Hopf representation formula
   1.1 A class of first-order HJ PDEs
   1.2 Associated conservation laws (1D)

2. NN architectures and representation of solutions of HJ PDEs inspired by the Lax–Oleinik representation formula

3. Some conclusions and ongoing work

A first shallow network architecture

We consider the following HJ PDE
\[
\begin{cases}
\dfrac{\partial S}{\partial t}(x,t) + H(\nabla_x S(x,t)) = 0 & \text{in } \mathbb{R}^n \times (0,+\infty),\\[4pt]
S(x,0) = J(x) & \forall x \in \mathbb{R}^n.
\end{cases}
\]

When $J$ is convex (and under some continuity assumptions on $J$ and $H$), the viscosity solution is given by the Hopf formula, which reads
\[
S(x,t) = \sup_{p \in \mathbb{R}^n}\{\langle p, x\rangle - tH(p) - J^*(p)\}. \tag{1}
\]
($J^*$ denotes the Fenchel–Legendre transform of $J$.)

A first shallow network architecture

Architecture: a fully connected layer followed by the activation function "max-pooling". This network defines a function $f : \mathbb{R}^n \times [0,+\infty) \to \mathbb{R}$,
\[
f(x, t; \{(p_i, \theta_i, \gamma_i)\}_{i=1}^m) = \max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - t\theta_i - \gamma_i\}.
\]

Goal: identify the HJ PDEs which can be solved by this NN under some assumptions on the parameters.
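For illustration, here is a minimal NumPy sketch (not from the slides) of this architecture: a single affine layer in $(x,t)$ followed by max-pooling over the $m$ neurons. The names `P`, `theta`, `gamma` are placeholders for the parameters $\{p_i\}$, $\{\theta_i\}$, $\{\gamma_i\}$.

```python
import numpy as np

def hopf_network(x, t, P, theta, gamma):
    """Evaluate f(x, t) = max_i { <p_i, x> - t*theta_i - gamma_i }.

    P     : (m, n) array whose rows are the vectors p_i
    theta : (m,)  array of the scalars theta_i
    gamma : (m,)  array of the scalars gamma_i
    """
    scores = P @ x - t * theta - gamma   # affine layer: one score per neuron
    return np.max(scores)                # "max-pooling" over the m neurons

# Toy usage with arbitrary parameters (for illustration only).
rng = np.random.default_rng(0)
m, n = 8, 5
P, theta, gamma = rng.normal(size=(m, n)), rng.normal(size=m), rng.normal(size=m)
print(hopf_network(rng.normal(size=n), 0.5, P, theta, gamma))
```

The cost of one evaluation is a single matrix–vector product, independent of any spatial grid, which is what makes the high-dimensional setting tractable.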

Assumptions on the parameters

Recall: the network $f(x, t; \{(p_i,\theta_i,\gamma_i)\}_{i=1}^m) = \max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - t\theta_i - \gamma_i\}$.

We adopt the following assumptions on the parameters:
(A1) The parameters $\{p_i\}_{i=1}^m$ are pairwise distinct, i.e., $p_i \neq p_j$ if $i \neq j$.
(A2) There exists a convex function $g : \mathbb{R}^n \to \mathbb{R}$ such that $g(p_i) = \gamma_i$.
(A3) For any $j \in \{1,\dots,m\}$ and any $(\alpha_1,\dots,\alpha_m) \in \Lambda_m$ that satisfy $\alpha_j = 0$, $\sum_{i \neq j}\alpha_i p_i = p_j$ and $\sum_{i \neq j}\alpha_i \gamma_i = \gamma_j$, there holds $\sum_{i \neq j}\alpha_i \theta_i > \theta_j$, where $\Lambda_m$ denotes the unit simplex of dimension $m$.

(A1) and (A3) are NOT strong assumptions:
- (A1) simplifies the mathematical analysis.
- (A3) simply states that each "neuron" should contribute to the definition of $f$; if (A3) is not satisfied, some neurons can be removed and the NN still defines the same function $f$.

(A2) is important, and it is related to the term $J^*(p)$ in the Hopf formula.
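As a side note, (A1) and (A3) can in principle be checked numerically: for each neuron $j$, (A3) amounts to a small linear program. The sketch below is my own illustration (assuming scipy.optimize.linprog; not part of the slides); it flags neuron $j$ as redundant when the LP is feasible and its optimal value does not exceed $\theta_j$.

```python
import numpy as np
from scipy.optimize import linprog

def check_A1_A3(P, theta, gamma, tol=1e-9):
    """Heuristic check of assumptions (A1) and (A3) for parameters
    P (m, n), theta (m,), gamma (m,). Returns (A1_ok, A3_ok)."""
    m, n = P.shape
    # (A1): the p_i are pairwise distinct.
    A1_ok = all(np.linalg.norm(P[i] - P[j]) > tol
                for i in range(m) for j in range(i + 1, m))
    # (A3): for each j, minimize sum_i alpha_i * theta_i over alpha in the
    # unit simplex with alpha_j = 0, sum_i alpha_i p_i = p_j and
    # sum_i alpha_i gamma_i = gamma_j.
    A3_ok = True
    for j in range(m):
        A_eq = np.vstack([P.T,                    # n rows: sum alpha_i p_i = p_j
                          gamma.reshape(1, -1),   # 1 row : sum alpha_i gamma_i = gamma_j
                          np.ones((1, m))])       # 1 row : sum alpha_i = 1
        b_eq = np.concatenate([P[j], [gamma[j]], [1.0]])
        bounds = [(0.0, 0.0) if i == j else (0.0, None) for i in range(m)]
        res = linprog(theta, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        # If such alpha exist and give sum alpha_i theta_i <= theta_j,
        # neuron j is redundant and (A3) fails (vacuously true if infeasible).
        if res.status == 0 and res.fun <= theta[j] + tol:
            A3_ok = False
    return A1_ok, A3_ok
```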

Define initial data and Hamiltonians from parameters

Recall: the network $f$,
\[
f(x, t; \{(p_i,\theta_i,\gamma_i)\}_{i=1}^m) = \max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - t\theta_i - \gamma_i\}. \tag{2}
\]

Define the initial data $J$ using the NN parameters $\{(p_i, \gamma_i)\}_{i=1}^m$:
\[
f(x, 0) = J(x) := \max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - \gamma_i\}. \tag{3}
\]

Then $J : \mathbb{R}^n \to \mathbb{R}$ is convex, and its Legendre transform $J^*$ reads
\[
J^*(p) =
\begin{cases}
\displaystyle\min_{\substack{(\alpha_1,\dots,\alpha_m) \in \Lambda_m\\ \sum_{i=1}^m \alpha_i p_i = p}} \;\sum_{i=1}^m \alpha_i \gamma_i, & \text{if } p \in \mathrm{co}\,(\{p_i\}_{i=1}^m),\\[10pt]
+\infty, & \text{otherwise.}
\end{cases}
\]

Denote by $A(p)$ the set of minimizers in the above optimization problem. Define the Hamiltonian $H : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ by
\[
H(p) :=
\begin{cases}
\displaystyle\inf_{\alpha \in A(p)} \;\sum_{i=1}^m \alpha_i \theta_i, & \text{if } p \in \mathrm{dom}\, J^*,\\[8pt]
+\infty, & \text{otherwise.}
\end{cases} \tag{4}
\]
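For illustration, the quantities $J^*(p)$ and $H(p)$ in Eqs. (3)–(4) can be evaluated numerically with two small linear programs over the simplex: the first computes $J^*(p)$, the second minimizes $\sum_i \alpha_i\theta_i$ over the (approximate) minimizers $A(p)$. This is only a sketch of the definitions (my own, assuming scipy.optimize.linprog), not part of the network itself.

```python
import numpy as np
from scipy.optimize import linprog

def J_star_and_H(p, P, theta, gamma, tol=1e-9):
    """Evaluate J*(p) and H(p) from Eqs. (3)-(4) by linear programming.
    Returns (inf, inf) when p lies outside co({p_i})."""
    m, n = P.shape
    A_eq = np.vstack([P.T, np.ones((1, m))])   # sum alpha_i p_i = p, sum alpha_i = 1
    b_eq = np.concatenate([p, [1.0]])
    bounds = [(0.0, None)] * m
    # First LP: J*(p) = min sum alpha_i gamma_i subject to the simplex constraints.
    lp1 = linprog(gamma, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if lp1.status != 0:                        # infeasible: p not in co({p_i})
        return np.inf, np.inf
    J_star = lp1.fun
    # Second LP: H(p) = min sum alpha_i theta_i over the (near-)minimizers A(p),
    # enforced by the extra constraint sum alpha_i gamma_i <= J*(p) + tol.
    lp2 = linprog(theta, A_ub=gamma.reshape(1, -1), b_ub=[J_star + tol],
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return J_star, (lp2.fun if lp2.status == 0 else np.inf)
```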

NN computes viscosity solutions

Theorem. Assume (A1)–(A3) hold. Let $f$ be the neural network defined by Eq. (2) with parameters $\{(p_i,\theta_i,\gamma_i)\}_{i=1}^m$. Let $J$ and $H$ be the functions defined in Eqs. (3) and (4), respectively, and let $\tilde{H} : \mathbb{R}^n \to \mathbb{R}$ be a continuous function. Then the following two statements hold.

(i) The neural network $f$ is the unique uniformly continuous viscosity solution to the Hamilton–Jacobi equation
\[
\begin{cases}
\dfrac{\partial f}{\partial t}(x,t) + H(\nabla_x f(x,t)) = 0, & x \in \mathbb{R}^n,\ t > 0,\\
f(x,0) = J(x), & x \in \mathbb{R}^n.
\end{cases} \tag{5}
\]
Moreover, $f$ is jointly convex in $(x,t)$.

(ii) The neural network $f$ is the unique uniformly continuous viscosity solution to the Hamilton–Jacobi equation
\[
\begin{cases}
\dfrac{\partial f}{\partial t}(x,t) + \tilde{H}(\nabla_x f(x,t)) = 0, & x \in \mathbb{R}^n,\ t > 0,\\
f(x,0) = J(x), & x \in \mathbb{R}^n,
\end{cases} \tag{6}
\]
if and only if $\tilde{H}(p_i) = H(p_i)$ for every $i = 1,\dots,m$ and $\tilde{H}(p) \geq H(p)$ for every $p \in \mathrm{dom}\, J^*$.

NN computes viscosity solutions

\[
f(x, t; \{(p_i,\theta_i,\gamma_i)\}_{i=1}^m) = \max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - t\theta_i - \gamma_i\}.
\]
- The network computes the viscosity solution for the $H$ and $J$ given by the parameters of the network.
- Hamiltonians are not unique. However, among all admissible Hamiltonians $\tilde{H}$, $H$ is the smallest one.
- $J$ needs to be convex, but $\tilde{H}$ need not be convex. For instance, $\tilde{H}$ can be concave (see the next example).
- In addition, $\nabla_x S(x,t)$ (when it exists) is given by the $p_i$ that realizes the maximum in the max-pooling.

NN computes viscosity solutions: An example

Denote by $e_i$ the $i$-th standard unit vector in $\mathbb{R}^n$. Let $m = 2n$, $\{p_i\}_{i=1}^m = \{\pm e_i\}_{i=1}^n$, $\theta_i = -\tfrac{1}{2}$, and $\gamma_i = 0$ for every $i \in \{1,\dots,m\}$.

The parameters $\{(p_i, -\tfrac{1}{2}, 0)\}_{i=1}^m$ define the following initial data and smallest Hamiltonian:
\[
J(x) = \|x\|_\infty \quad \text{for every } x \in \mathbb{R}^n; \qquad
H(p) =
\begin{cases}
-\tfrac{1}{2}, & p \in B_n,\\
+\infty, & \text{otherwise,}
\end{cases}
\]
where $B_n$ is the unit ball with respect to the $\ell^1$ norm in $\mathbb{R}^n$.

By part (i) of the theorem, this NN gives the viscosity solution to the HJ PDE with Hamiltonian $H$ and initial data $J$. By part (ii), the same NN also gives the viscosity solution to another HJ PDE, which reads
\[
\begin{cases}
\dfrac{\partial S}{\partial t}(x,t) - \dfrac{1}{2}\|\nabla_x S(x,t)\|_2^2 = 0 & \text{in } \mathbb{R}^n \times (0,+\infty),\\[4pt]
S(x,0) = \|x\|_\infty & \text{in } \mathbb{R}^n.
\end{cases} \tag{7}
\]

NN computes viscosity solutions: Another example

Consider $J(x) = \max\bigl\{\|x\|_\infty,\ \tfrac{1}{\sqrt{2}}(|x_1| + |x_2|)\bigr\}$ and $H(p) = \|p\|_1$.

The corresponding neural network has $m = 2n + 5$ neurons, with parameters
\[
\{(p_i,\theta_i,\gamma_i)\}_{i=1}^{2n} = \{(e_i, 1, 0)\}_{i=1}^n \cup \{(-e_i, 1, 0)\}_{i=1}^n,
\]
\[
(p_{2n+1},\theta_{2n+1},\gamma_{2n+1}) = (\mathbf{0}, 0, 0),
\]
\[
\{(p_i,\theta_i,\gamma_i)\}_{i=2n+2}^{2n+5} = \Bigl\{\bigl(\tfrac{1}{\sqrt{2}}(\alpha e_1 + \beta e_2),\ \sqrt{2},\ 0\bigr) : \alpha,\beta \in \{\pm 1\}\Bigr\},
\]
where $e_i$ is the $i$-th standard unit vector in $\mathbb{R}^n$ and $\mathbf{0}$ denotes the zero vector in $\mathbb{R}^n$.

Therefore, by the theorem, the proposed neural network represents the viscosity solution to the HJ PDE with Hamiltonian $H$ and initial data $J$.

Illustrations for $n = 16$: the following figures depict the results on a 2-dimensional slice at different times for this HJ PDE, $(x_1, x_2, t) \mapsto S(x_1, x_2, 0,\dots,0, t)$.
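A minimal sketch (my own, not from the slides) that assembles the parameters above for $n = 16$, assuming $\theta_i = \|p_i\|_1 = \sqrt{2}$ for the four diagonal neurons, and evaluates the network on the 2-dimensional slice $(x_1, x_2, 0,\dots,0)$ shown in the figures:

```python
import numpy as np

n = 16                                       # m = 2n + 5 neurons
I = np.eye(n)
P = np.vstack([I, -I,                        # (±e_i, 1, 0)
               np.zeros((1, n)),             # (0, 0, 0)
               [(a * I[0] + b * I[1]) / np.sqrt(2)
                for a in (1, -1) for b in (1, -1)]])   # diagonal directions
theta = np.concatenate([np.ones(2 * n), [0.0], np.full(4, np.sqrt(2))])
gamma = np.zeros(2 * n + 5)

def f(x, t):
    """Shallow Hopf-type network for this parameter set."""
    return np.max(P @ x - t * theta - gamma)

# Evaluate the 2D slice (x1, x2, 0, ..., 0) at time t = 1.
x1, x2 = np.meshgrid(np.linspace(-4, 4, 101), np.linspace(-4, 4, 101))
S = np.zeros_like(x1)
for i in range(x1.shape[0]):
    for j in range(x1.shape[1]):
        x = np.zeros(n)
        x[0], x[1] = x1[i, j], x2[i, j]
        S[i, j] = f(x, t=1.0)
```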

NN computes viscosity solutions: Another example

[Figures: two-dimensional slices of the graph of the solution at $(x_1, x_2, 0,\dots,0)$ for $t = 0, 1, 2, 3$.]

NN architecture for the gradient map

This NN architecture computes the spatial gradient $\nabla_x S(x,t)$ of the solution (i.e., the momentum). Consider $u : \mathbb{R}^n \times [0,+\infty) \to \mathbb{R}^n$ defined by
\[
u(x,t) = \nabla_x f(x,t) = p_j, \quad \text{where } j \in \arg\max_{i \in \{1,\dots,m\}}\{\langle p_i, x\rangle - t\theta_i - \gamma_i\}. \tag{8}
\]
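A short NumPy sketch of this gradient map (placeholder names as before); the formula is valid only where the argmax in Eq. (8) is unique:

```python
import numpy as np

def hopf_network_grad(x, t, P, theta, gamma):
    """u(x, t) = grad_x f(x, t) = p_j, with j realizing the max-pooling."""
    scores = P @ x - t * theta - gamma
    j = int(np.argmax(scores))   # index attaining the maximum
    return P[j]                  # the momentum p_j (where the argmax is unique)
```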

NN architecture and 1D conservation laws

Theorem. Consider the one-dimensional case, i.e., $n = 1$. Suppose assumptions (A1)–(A3) hold. Let $u := \nabla_x f$ be the function from $\mathbb{R} \times [0,+\infty)$ to $\mathbb{R}$ defined in Eq. (8). Let $J$ and $H$ be the functions defined in Eqs. (3) and (4), respectively, and let $\tilde{H} : \mathbb{R} \to \mathbb{R}$ be a locally Lipschitz continuous function. Then the following two statements hold.

(i) The neural network $u$ is the entropy solution to the conservation law
\[
\begin{cases}
\dfrac{\partial u}{\partial t}(x,t) + \nabla_x\bigl(H(u(x,t))\bigr) = 0, & x \in \mathbb{R},\ t > 0,\\
u(x,0) = \nabla J(x), & x \in \mathbb{R}.
\end{cases} \tag{9}
\]

(ii) The neural network $u$ is the entropy solution to the conservation law
\[
\begin{cases}
\dfrac{\partial u}{\partial t}(x,t) + \nabla_x\bigl(\tilde{H}(u(x,t))\bigr) = 0, & x \in \mathbb{R},\ t > 0,\\
u(x,0) = \nabla J(x), & x \in \mathbb{R},
\end{cases} \tag{10}
\]
if and only if there exists a constant $C \in \mathbb{R}$ such that $\tilde{H}(p_i) = H(p_i) + C$ for every $i \in \{1,\dots,m\}$ and $\tilde{H}(p) \geq H(p) + C$ for any $p \in \mathrm{conv}\,\{p_i\}_{i=1}^m$.

The results simply follow from [Dafermos 1972].

Partial summary

- We have exhibited classes of network architectures that represent the viscosity solution of certain HJ PDEs whose initial data and Hamiltonians take a particular form.
- The initial data and Hamiltonians are given by the parameters of the network:
  - the initial data is completely determined by the parameters of the NN;
  - the same NN is also a viscosity solution for other Hamiltonians, but the smallest Hamiltonian is completely determined by the parameters of the NN;
  - the initial data has to be convex.
- Our result may provide a better understanding and interpretation of these NN architectures in terms of HJ PDEs.
- Are there other NN architectures that can represent viscosity solutions of certain HJ PDEs?

Details can be found in:
J. Darbon, G.P. Langlois, and T. Meng, Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Research in the Mathematical Sciences, 2020, 7(3):20.

Other classes of NN architectures based on Lax–Oleinik representation formulas

Consider the following HJ PDE
\[
\begin{cases}
\dfrac{\partial S}{\partial t}(x,t) + H(\nabla_x S(x,t)) = 0 & \text{in } \mathbb{R}^n \times (0,+\infty),\\[4pt]
S(x,0) = J(x) & \text{in } \mathbb{R}^n,
\end{cases}
\]
where $H : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is a convex Hamiltonian and $J : \mathbb{R}^n \to \mathbb{R}$ is the initial data. Denote by $L = H^*$ the Fenchel–Legendre transform of $H$. The Lax–Oleinik representation formula reads
\[
S_{LO}(x,t) = \inf_{u \in \mathbb{R}^n}\Bigl\{J(u) + tL\Bigl(\frac{x-u}{t}\Bigr)\Bigr\} = \inf_{v \in \mathbb{R}^n}\bigl\{J(x - tv) + tL(v)\bigr\}.
\]

We propose two NN architectures inspired by Lax–Oleinik formulas.

A first class of NN architectures based on Lax–Oleinik representation formulas

This network defines a function $f_1 : \mathbb{R}^n \times (0,+\infty) \to \mathbb{R}$,
\[
f_1(x,t) = \min_{i \in \{1,\dots,m\}}\Bigl\{tL\Bigl(\frac{x - u_i}{t}\Bigr) + a_i\Bigr\}.
\]

Initial data: $f_1(x,0) = \lim_{t \to 0^+} f_1(x,t) = \min_{i \in \{1,\dots,m\}}\{L'_\infty(x - u_i) + a_i\}$, where $L'_\infty$ is defined by $L'_\infty(d) := \lim_{h \to +\infty} \frac{1}{h}\bigl(L(x_0 + hd) - L(x_0)\bigr)$ (the asymptotic function of $L$, independent of $x_0$).

A first class of NN architectures

\[
f_1(x,t) = \min_{i \in \{1,\dots,m\}}\Bigl\{tL\Bigl(\frac{x - u_i}{t}\Bigr) + a_i\Bigr\}.
\]

Theorem. Let $L : \mathbb{R}^n \to \mathbb{R}$ be a convex, uniformly Lipschitz function. Then $f_1 = S_{LO}$, where $S_{LO}$ is the Lax–Oleinik formula with the convex Hamiltonian $H = L^*$ and the initial condition $J = \min_{i \in \{1,\dots,m\}}\{L'_\infty(x - u_i) + a_i\}$. Therefore, $f_1$ is a viscosity solution to the corresponding HJ PDE.

Remark: although we assume $J$ has a specific form, this is not very restrictive. In fact, such $J$ are dense in the set of "reasonable" initial conditions.

A first class of NN architectures

\[
f_1(x,t) = \min_{i \in \{1,\dots,m\}}\Bigl\{tL\Bigl(\frac{x - u_i}{t}\Bigr) + a_i\Bigr\}.
\]

- In the corresponding HJ PDE, $H$ needs to be convex, but $J$ may be non-convex.
- The Hamiltonian $H$ is defined through the activation function $L = H^*$.
- The initial data $J$ is defined through the parameters $(u_i, a_i)$ and the activation function:
\[
J = \min_{i \in \{1,\dots,m\}}\{L'_\infty(x - u_i) + a_i\}.
\]
- Although we assume $J$ has this specific form, this is not very restrictive: such $J$ are dense in the set of "reasonable" initial conditions.

A first class of NN architectures: An example

The dimension is $n = 10$. The number of neurons is $m = 3$. The activation function $L$ is given by
\[
L(x) = \max\{\|x\|_2 - 1, 0\} =
\begin{cases}
\|x\|_2 - 1 & \text{if } \|x\|_2 > 1,\\
0 & \text{if } \|x\|_2 \leq 1.
\end{cases}
\]
It defines the Hamiltonian
\[
H(p) = L^*(p) =
\begin{cases}
\|p\|_2 & \text{if } \|p\|_2 \leq 1,\\
+\infty & \text{if } \|p\|_2 > 1.
\end{cases} \tag{11}
\]
The parameters are
\[
u_1 = (-2, 0, 0, 0, \dots, 0),\ a_1 = -0.5; \quad
u_2 = (2, -2, -1, 0, \dots, 0),\ a_2 = 0; \quad
u_3 = (0, 2, 0, 0, \dots, 0),\ a_3 = -1.
\]
The initial data is defined by
\[
J(x) = \min_{i \in \{1,2,3\}}\{\|x - u_i\|_2 + a_i\}. \tag{12}
\]
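A minimal sketch of this example (my own, not from the slides; `U`, `a`, `L`, `f1` are placeholder names), evaluating $f_1$ at a point of the 2-dimensional slice for $t = 1$:

```python
import numpy as np

# Parameters of the example (n = 10, m = 3).
U = np.zeros((3, 10))
U[0, 0] = -2.0
U[1, :3] = [2.0, -2.0, -1.0]
U[2, 1] = 2.0
a = np.array([-0.5, 0.0, -1.0])

def L(z):
    """Activation: L(z) = max(||z||_2 - 1, 0)."""
    return max(np.linalg.norm(z) - 1.0, 0.0)

def f1(x, t):
    """f1(x, t) = min_i { t * L((x - u_i) / t) + a_i },  for t > 0."""
    return min(t * L((x - U[i]) / t) + a[i] for i in range(len(a)))

x = np.zeros(10)
x[0], x[1] = 1.0, -0.5       # a point on the slice (x1, x2, 0, ..., 0)
print(f1(x, t=1.0))
```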

A first class of NN architectures: An example

[Figures: two-dimensional slices of the graph of the solution at $(x_1, x_2, 0,\dots,0)$ for $t = 10^{-6}, 1, 3, 5$.]

A second class of NN architectures

This network defines a function $f_2 : \mathbb{R}^n \times [0,+\infty) \to \mathbb{R}$,
\[
f_2(x,t) = \min_{i \in \{1,\dots,m\}}\bigl\{J(x - tv_i) + tb_i\bigr\}, \tag{13}
\]
where $\{(v_i, b_i)\}_{i=1}^m \subset \mathbb{R}^n \times \mathbb{R}$ is the set of parameters and $J : \mathbb{R}^n \to \mathbb{R}$ is the activation function, which corresponds to the initial data of the HJ PDE.

A second class of NN architectures

Assumption: there exists a convex function $\ell : \mathbb{R}^n \to \mathbb{R}$ satisfying $\ell(v_i) = b_i$ for all $i \in \{1,\dots,m\}$.

This network defines a function $f_2 : \mathbb{R}^n \times [0,+\infty) \to \mathbb{R}$,
\[
f_2(x,t) = \min_{i \in \{1,\dots,m\}}\bigl\{J(x - tv_i) + tb_i\bigr\}, \tag{14}
\]
where $\{(v_i, b_i)\}_{i=1}^m \subset \mathbb{R}^n \times \mathbb{R}$ is the set of parameters and $J : \mathbb{R}^n \to \mathbb{R}$ is the activation function, which corresponds to the initial data of the HJ PDE.

Theorem. Assume the activation function $J : \mathbb{R}^n \to \mathbb{R}$ is concave and the assumption above is satisfied. Then $f_2 = S_{LO}$, where $S_{LO}$ is the Lax–Oleinik formula with initial condition $J$ and Hamiltonian $H$ defined by
\[
H(p) = \max_{i \in \{1,\dots,m\}}\{\langle p, v_i\rangle - b_i\}
\]
for every $p \in \mathbb{R}^n$. Hence $f_2$ is a concave viscosity solution to the corresponding HJ PDE.

- The corresponding HJ PDE has a piecewise affine convex $H$ and a concave $J$.
- The Hamiltonian $H$ is defined by the parameters $(v_i, b_i)$, and the initial data $J$ is given by the activation function.

A second class of NN architectures: An example

The dimension is $n = 10$. The number of neurons is $m = 3$. The activation function is $J(x) = -\frac{\|x\|_2^2}{2}$.
The parameters are
\[
v_1 = (-2, 0, 0, 0, \dots, 0),\ b_1 = 0.5; \quad
v_2 = (2, -2, -1, 0, \dots, 0),\ b_2 = -5; \quad
v_3 = (0, 2, 0, 0, \dots, 0),\ b_3 = 1.
\]
By the theorem, this NN solves the HJ PDE where
- the Hamiltonian is
\[
H(p) = \max_{i \in \{1,2,3\}}\{\langle p, v_i\rangle - b_i\}; \tag{15}
\]
- the initial data is $J(x) = -\frac{\|x\|_2^2}{2}$.
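A minimal sketch of this example (my own, not from the slides; `V`, `b`, `J_act`, `f2` are placeholder names):

```python
import numpy as np

# Parameters of the example (n = 10, m = 3).
V = np.zeros((3, 10))
V[0, 0] = -2.0
V[1, :3] = [2.0, -2.0, -1.0]
V[2, 1] = 2.0
b = np.array([0.5, -5.0, 1.0])

def J_act(x):
    """Activation = initial data, concave: J(x) = -||x||_2^2 / 2."""
    return -0.5 * np.dot(x, x)

def f2(x, t):
    """f2(x, t) = min_i { J(x - t * v_i) + t * b_i }."""
    return min(J_act(x - t * V[i]) + t * b[i] for i in range(len(b)))

x = np.zeros(10)
x[0], x[1] = 1.0, -0.5       # a point on the slice (x1, x2, 0, ..., 0)
print(f2(x, t=1.0))
```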

A second class of NN architectures: An example

[Figures: two-dimensional slices of the graph of the solution at $(x_1, x_2, 0,\dots,0)$ for $t = 0, 1, 3, 5$.]

A second class of NN architectures: Another example

The dimension is $n = 5$. The number of neurons is $m = 2n = 10$. The activation function is $J(x) = -\frac{\|x\|_2^2}{2}$.
The parameters are
- $\{v_i\}_{i=1}^m = \{\pm e_j\}_{j=1}^n$, where $e_j$ is the $j$-th coordinate basis vector in $\mathbb{R}^n$;
- $b_i = 0$ for each $i \in \{1,\dots,m\}$.

The Hamiltonian is given by
\[
H(p) = \max_{i \in \{1,\dots,m\}}\{\langle p, v_i\rangle - b_i\} = \max_{j \in \{1,\dots,n\}} |p_j| = \|p\|_\infty.
\]
By the theorem, this NN solves the following HJ PDE
\[
\begin{cases}
\dfrac{\partial S}{\partial t}(x,t) + \|\nabla_x S(x,t)\|_\infty = 0 & \text{in } \mathbb{R}^5 \times (0,+\infty),\\[4pt]
S(x,0) = -\dfrac{\|x\|_2^2}{2} & \text{in } \mathbb{R}^5.
\end{cases}
\]
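As a sanity check (my own sketch, not from the slides), one can verify numerically that $f_2$ satisfies $\frac{\partial S}{\partial t} + \|\nabla_x S\|_\infty = 0$ at points where it is differentiable, using centered finite differences:

```python
import numpy as np

n = 5
V = np.vstack([np.eye(n), -np.eye(n)])       # v_i = ±e_j
b = np.zeros(2 * n)

def f2(x, t):
    """f2(x, t) = min_i { -||x - t v_i||^2 / 2 + t b_i }."""
    return min(-0.5 * np.dot(x - t * v, x - t * v) + t * bi
               for v, bi in zip(V, b))

def pde_residual(x, t, h=1e-5):
    """Finite-difference check of dS/dt + ||grad_x S||_inf = 0 at a point
    where the solution is differentiable (the residual should be ~0)."""
    dSdt = (f2(x, t + h) - f2(x, t - h)) / (2 * h)
    grad = np.array([(f2(x + h * e, t) - f2(x - h * e, t)) / (2 * h)
                     for e in np.eye(n)])
    return dSdt + np.max(np.abs(grad))

x = np.array([0.3, -1.2, 0.7, 0.1, -0.4])    # generic point (unique max coordinate)
print(pde_residual(x, t=0.8))                # expected to be close to 0
```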

A second class of NN architectures: Another example

[Figures: two-dimensional slices of the graph of the solution at $(x_1, x_2, 0,\dots,0)$ for $t = 0, 1, 3, 5$.]

Summary

We have presented three NN architectures:
- one shallow NN inspired by the Hopf formula (plus two extended NNs which are variations of this NN obtained by replacing the activation functions);
- two NNs with 2 hidden layers inspired by the Lax–Oleinik formula (denoted by LO1 and LO2 below).

The following list summarizes the classes of Hamiltonians $H$ and initial data $J$ which can be solved using these NNs.

- Hopf: Hamiltonian — piecewise affine $H$ and any $\tilde{H} \geq H$ (e.g., concave $\tilde{H}$) (parameters in the NN); initial data — piecewise affine convex ($\{p_i, \gamma_i\}$ in the NN).
- LO1: Hamiltonian — convex with bounded domain (activation function in the 1st layer); initial data — $J(x) = \min_i\{L'_\infty(x - u_i) + a_i\}$ (parameters & activation function).
- LO2: Hamiltonian — piecewise affine convex (parameters in the NN); initial data — concave (activation function in the 1st layer).

Details can be found in:
J. Darbon and T. Meng, On some neural network architectures that can represent viscosity solutions of certain high dimensional Hamilton–Jacobi partial differential equations. Journal of Computational Physics, 2021.

Summary and ongoing works

- Our work may provide a better understanding and interpretation of certain NN building blocks in terms of HJ PDEs.
- These NNs avoid the curse of dimensionality for certain HJ PDEs, but they may suffer from a curse of complexity (in the number of neurons $m$).
- Combine these NN architectures with other approaches (e.g., min/max-plus methods) to cope with more general Hamiltonians and initial data. In particular, we have exhibited neural network architectures that compute viscosity solutions of HJ PDEs with certain space-dependent Hamiltonians (ongoing work with P. Dower and W. McEneaney).
- Derive error estimates for these more general NN frameworks.
- An FPGA implementation is underway.

References:
J. Darbon, G.P. Langlois, and T. Meng, Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Research in the Mathematical Sciences, 2020, 7(3):20.
J. Darbon and T. Meng, On some neural network architectures that can represent viscosity solutions of certain high dimensional Hamilton–Jacobi partial differential equations. Journal of Computational Physics, 2021.