Data Assimilation, Machine Learning:
Statistical Physics Problems
Introduction, Core Ideas, Applications
Henry D. I. Abarbanel
Department of Physics
and
Marine Physical Laboratory (Scripps Institution of Oceanography)
Center for Engineered Natural Intelligence
University of California, San Diego
This is meant to be an introductory and an
advanced set of pedagogical talks on data assimilation
and machine learning.
These are statistical Physics problems.
I hope you will ask a lot of questions.
My colleague Dan Margoliash will provide a
neurobiological setting for utilizing many of the methods
discussed. Much of what I address has been developed
with him and tested and improved in application to results
obtained in his laboratory.
I received a bipolar transistor for a
New Year’s present. I want to know how
it works so I can use many of them to
build a follow-on K computer (before
2020?).
What do I do?
General answer:
Hook up my nice new transistor to a known RLC
circuit and drive the dynamical variables of the transistor
through their dynamical range. Measure some of the
variables of the circuit V(t); this produces data. Make a
model of the transistor, drive it in precisely the same way, to
get model output Vmodel(t).
Minimize the distance
$$\sum_{t=0}^{T}\big(V_{\mathrm{data}}(t) - V_{\mathrm{model}}(t)\big)^2,$$
subject to our model equations of motion.
Test the model, completed by the estimated parameters, through prediction for t > T.
What do we need to complete this task?
➢ a model of the origin of the data
➢ data
➢ a way to minimize the distance between the data and
the model variables
We first generate our own data from our model, then use our minimization method to show that it works; this is called a twin experiment.
Then, with some confidence, we use the method on experimental data to determine the parameters in my new transistor.
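Here is a minimal sketch of that recipe in code. It is not the lecture's transistor model: a damped, driven oscillator stands in for the transistor-plus-RLC circuit, scipy's least_squares stands in for the constrained optimization, and every parameter value is made up for illustration.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def rhs(t, state, gamma, omega0):
    # state = [V, dV/dt]: a damped, driven oscillator standing in for the circuit
    v, vdot = state
    drive = np.sin(2.0 * t)                      # known driving signal
    return [vdot, -2.0 * gamma * vdot - omega0**2 * v + drive]

t_obs = np.linspace(0.0, 20.0, 400)

def simulate(params):
    gamma, omega0 = params
    sol = solve_ivp(rhs, (0.0, 20.0), [0.0, 0.0], t_eval=t_obs,
                    args=(gamma, omega0), rtol=1e-8)
    return sol.y[0]                              # the "measured" voltage V(t)

true_params = np.array([0.15, 1.3])
V_data = simulate(true_params) + 0.01 * np.random.randn(t_obs.size)   # twin data

# Minimize sum_t (V_data(t) - V_model(t))^2 over the model parameters.
fit = least_squares(lambda p: simulate(p) - V_data, x0=[0.5, 1.0])
print("estimated (gamma, omega0):", fit.x)       # should land near (0.15, 1.3)

The estimated parameters landing near the values used to generate the data is exactly the check the twin experiment provides before we trust the method on the real circuit.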
[Circuit diagram: Colpitts oscillator, built from a bipolar transistor (base B, collector C, emitter E), an inductor L, resistors R and Ree, capacitors C1 and C2, and supply voltage Vee.]
Colpitts Oscillator Circuit: 1920s, 1950s, 1970s, and 1990s
$$I_C(V_E) = (1\ \mathrm{mA})\,\exp\!\left(\frac{|V_E|}{k_B T/e}\right).$$
J. J. Ebers and J. L. Moll, "Large-signal behavior of junction transistors," Proceedings of the IRE, vol. 42, no. 12, pp. 1761-1772, Dec. 1954.
H. K. Gummel and H. C. Poon, "An Integral Charge Control Model of Bipolar Transistors," Bell System Technical Journal, vol. 49, no. 5, pp. 827-852, 1970.
[Figure: voltage time series VE(t) recorded from a Colpitts circuit operating in the chaotic regime; sampling interval Δt = 10 μs, time axis in ms.]

Rescaled Colpitts Oscillator (Data Source)
$$\frac{dx_1(t)}{dt} = \alpha\, x_2(t) \qquad (\alpha\ \text{is first kept fixed, then driven:}\ \alpha(t))$$
$$\frac{dx_2(t)}{dt} = -\gamma\big(x_1(t) + x_3(t)\big) - q\, x_2(t)$$
$$\frac{dx_3(t)}{dt} = \eta\big(x_2(t) + 1 + e^{-x_1(t)}\big)$$
[Figure: synchronization of the Colpitts model with experimental data from the Colpitts circuit. With no coupling of the data into the model, u(t) = k = 0, the trajectories separate; with coupling k = 1.9 they synchronize.]
Data Source
$$\frac{dx_1(t)}{dt} = \alpha\, x_2(t), \qquad \frac{dx_2(t)}{dt} = -\gamma\big(x_1(t)+x_3(t)\big) - q\,x_2(t), \qquad \frac{dx_3(t)}{dt} = \eta\big(x_2(t) + 1 + e^{-x_1(t)}\big)$$

Model Equations (x1(t) is passed to the model)
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big)$$
$$\frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad u(t) \ge 0$$
$$\frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big)$$
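The synchronization behavior summarized in the figure above can be reproduced in a few lines. The sketch below uses the Lorenz '63 system as a stand-in for the Colpitts circuit (its standard parameter values are well known, whereas specific Colpitts values are not quoted here); the nudging term u(t)(x1(t) - y1(t)) is exactly the coupling written in the model equations above.

import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x1, x2, x3 = s
    return [sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3]

t_eval = np.linspace(0.0, 20.0, 4000)
data = solve_ivp(lorenz, (0.0, 20.0), [1.0, 2.0, 3.0], t_eval=t_eval, rtol=1e-9).y

def nudged_model(t, s, u):
    # model copy: only the "measured" component x1(t) of the data is passed in
    x1_data = np.interp(t, t_eval, data[0])
    dy = lorenz(t, s)
    dy[0] += u * (x1_data - s[0])                # nudging term u*(x1 - y1)
    return dy

for u in (0.0, 30.0):
    y = solve_ivp(nudged_model, (0.0, 20.0), [-5.0, 0.0, 10.0],
                  t_eval=t_eval, args=(u,), rtol=1e-9).y
    err = np.mean(np.abs(data[1] - y[1])[2000:])     # unmeasured x2 vs y2, late times
    print(f"u = {u}: mean |x2 - y2| over the second half = {err:.3f}")

With u = 0 the two chaotic trajectories stay far apart; with sufficiently strong nudging (u = 30 here, close to replacing y1 by the data) the unmeasured components of the model lock onto those of the data.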
Minimize
$$C(y,u,p) = \frac{1}{2N}\sum_{m=0}^{N}\Big[\big(x_1(m) - y_1(m)\big)^2 + u(m)^2\Big]$$
subject to the model equations
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big), \qquad \frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad \frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big),$$
with x1(t) the data.
The solution of the optimization problem is an iterative process in (y(n), u(n), p) space. Given initial values (y0(n), u0(n), p0), iterate, adjusting all (ym(n), um(n), pm), m = 1, 2, 3, ..., to minimize the cost function.
The procedure tracks the state variables and correctly estimates the parameters even when they are time dependent. It tracks accurately through bifurcations in the system behavior: chaotic → fixed point → limit cycle → chaotic.
The number of variables in each optimal estimation calculation is about 3000-5000.
$$\big(y(n)_0,\, u(n)_0,\, p_0\big) \rightarrow \big(y(n)_1,\, u(n)_1,\, p_1\big) \rightarrow \cdots \rightarrow \big(y(n)_{Final},\, u(n)_{Final},\, p_{Final}\big)$$
until the objective (cost) function is minimized, subject to the model equations. We require $u(n)_{Final} \approx 0$.
Chaotic Colpitts Oscillator; Initial Conditions in Optimization are free
x1(t) observed; other state variables evaluated by SNOPT
Chaotic Colpitts Oscillator; external driving of the parameter α(t): α > 5.0 gives chaotic behavior, α < 5.0 gives regular behavior.
For a general model with R state variables, only the observed variable is nudged:
$$\frac{dy_1(t)}{dt} = F_1\big(y_1(t),\ldots,y_R(t),\, q\big) + u(t)\big(x_1(t) - y_1(t)\big)$$
$$\frac{dy_r(t)}{dt} = F_r\big(y_1(t),\ldots,y_R(t),\, q\big), \qquad r = 2,\ldots,R.$$
[Figures: Colpitts oscillator, model output vs. data, shown as SNOPT reaches the solution.]
Experimental Colpitts Oscillator Circuit, Δt = 10 µs.
VE(t) is presented to the state and parameter estimation procedure. VCE(t) and IL(t) are estimated, using 10 ms of data. Then predictions of VE(t), VCE(t), and IL(t) are made from the estimated state at t = 10 ms.
[Figures: measured VE(t) and estimated VCE(t) and IL(t); predicted VE(t), VCE(t), and IL(t) from the estimates at t = 10 ms.]
Minimize
$$C(y,u,p) = \frac{1}{2N}\sum_{m=0}^{N}\Big[\big(x_1(m) - y_1(m)\big)^2 + u(m)^2\Big]$$
subject to the model equations
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big), \qquad \frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad \frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big),$$
with x1(t) the data.

Where did all this come from? Why this C(y,u,p)? Why this "nudging" term?
Just for the record, this is the wrong
answer. We will derive the correct answer.
With the electronic circuit in mind, we turn to a general view of the problem of transferring information from observations to models of the processes producing those observations.
This is not actually a new problem. Newton did this in 1687 in determining that elliptical orbits satisfying Kepler's laws require a 1/r² force.
The questions we pose 330 years later are richer: we collect information from many sources of observations of complex systems, and we want to do that in a systematic manner that allows large data sets and rich models of the processes producing those data sets.
My transistor is the same, in scientific spirit, as your:
❖ Atmosphere
❖ Neuron
❖ Lake
❖ Ocean
❖ Biological cell
❖ whatever is your complex system of interest
In every case we need:
➢ a model of the origin of the data
➢ data
➢ a way to minimize the distance between the data and the model variables
Topics:
❖ Investigating rules of nonlinear dynamics in physical and biological (complex) systems
❖ A complex oscillator: data assimilation
❖ General setting: a neurobiological example (see the Margoliash talks)
❖ General problems: numerical algorithms
❖ Machine learning: statistical Physics and data assimilation
Data Assimilation in a time window [t0,tF]: transfer information from a data library y(τ) to a model x(t).
[Diagram: observations y(τ1), y(τ2), y(τ3), ..., y(τk), ..., y(τF) are made at times within [t0, tF]; between observations the model is moved forward.]
$$P(X\,|\,Y) = \frac{P(X,Y)}{P(Y)}$$
$$X = \{x(t_0), x(t_1), \ldots, x(t_F)\}\ \ \text{(states and parameters of the model)}, \qquad Y = \{y(\tau_1), y(\tau_2), \ldots, y(\tau_F)\}\ \ \text{(data)}.$$
Data Assimilation:
Transfer of Information from Measurements
to a Model of the Observations
We start with noisy measurements y_l(t), l = 1, 2, ..., L; errors in the model for the state x_a(t), a = 1, 2, ..., D, with D >> L; and uncertain initial conditions x(t0).
We wish to incorporate the information in measurements at t0, t1, ..., tF = T into our statistical estimate of the complete state of the model at these times and into our statistical estimate of the model parameters.
The model has errors; given x(T), we use it to predict x(t > T). This is a validation (or not) of the model.
Times: $\{t_0, t_1, \ldots, t_n, \ldots, t_F = T\}$.
Measurements: $y_l(n),\ l = 1, 2, \ldots, L$.
Model: $x_a(n+1) = f_a(x(n)),\ a = 1, 2, \ldots, D$.
The measurements correspond to model variables: $y_l(n) \leftrightarrow x_l(n)$, with $L \ll D$.
Data source: Transmitter
Model: Receiver
Generalized synchronization of
the transmitter and receiver
Statistical data assimilation is communication of information from measurements (transmitter) to a dynamical model (receiver).
At the end of an observation window [t0,tF] we want the conditional probability distribution of the state of the system, P(X|observations), where X = {x(t0), x(t1), ..., x(tF)} is the path of the model through [t0,tF], given the measurements made during the window.
We then want to predict the future conditional probability distribution P(x(t > tF)|observations) for new forcing of the system.
Typical situation: The measurements are noisy. The model has
errors. We are unsure of the state of the system when we begin observing.
Observation window in time: $t_0, t_1, \ldots, t_N$.
$$X(n) = \{x(t_0), x(t_1), \ldots, x(t_n)\} = \{x(0), x(1), x(2), \ldots, x(n)\}$$
are the model state vectors and parameters at times $t_0, t_1, \ldots, t_n$.
$$Y(n) = \{y(1), y(2), y(3), \ldots, y(n)\}$$
are the observed data vectors at times $t_0 \le \tau_1, \tau_2, \ldots, \tau_n \le t_N = t_F$.
We want to express P(X(n+1)|Y(n+1)) in terms of
P(X(n)|Y(n)).
Then we iterate from n=N -1, back to n=0. The
product of these probabilities gives us a representation
of P(X(N)|Y(N)) starting at P(x(0)).
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \frac{P\big(x(n+1), X(n), y(n+1), Y(n)\big)}{P\big(y(n+1), Y(n)\big)}
= \frac{P\big(y(n+1)\,|\,x(n+1), X(n), Y(n)\big)}{P\big(y(n+1)\,|\,Y(n)\big)}\; P\big(x(n+1)\,|\,X(n), Y(n)\big)\; P\big(X(n)\,|\,Y(n)\big).$$
The Markov property of the model dynamics gives $P\big(x(n+1)\,|\,X(n), Y(n)\big) = P\big(x(n+1)\,|\,x(n)\big)$, so
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \exp\Big[\mathrm{CMI}\big(y(n+1);\, x(n+1), X(n)\,\big|\,Y(n)\big)\Big]\; P\big(x(n+1)\,|\,x(n)\big)\; P\big(X(n)\,|\,Y(n)\big).$$
The first factor is the change due to the observation; the second moves the model forward.
Here CMI is Shannon's (1948) conditional mutual information,
$$\mathrm{CMI}(a;\, b\,|\,c) = \log\!\left[\frac{P(a, b\,|\,c)}{P(a\,|\,c)\, P(b\,|\,c)}\right],$$
with $a = y(n+1)$, $b = \{x(n+1), X(n)\}$, and $c = Y(n)$.
$$X = \{x(t_0), x(t_1), \ldots, x(t_F)\}, \qquad Y = \{y(\tau_1), y(\tau_2), \ldots, y(\tau_F)\}$$
Iterating the recursion from $P(x(0))$ gives
$$P(X\,|\,Y) \propto \prod_{k=0}^{F} P\big(y(k)\,|\,X(k)\big)\; \prod_{n=0}^{F-1} P\big(x(n+1)\,|\,x(n)\big)\; P\big(x(0)\big) \;=\; e^{-A(X)},$$
normalized by $\int dX\, e^{-A(X)}$, with $dX = \prod_{n=0}^{F} d^{D}x(n)$.
Expected value of a function G(X) on the path X:
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}.$$
Our job is this: given data Y and a model dx(t)/dt = F(x(t)) [equivalently x(n+1) = f(x(n))], do the integral over the path of the state in [t0, tF].
Action for State and Parameter Estimation
$$A(X) = -\sum_{n=0}^{N} \mathrm{CMI}\big(X(n), y(n)\,\big|\,Y(n-1)\big) \;-\; \sum_{n=0}^{N-1} \log\big[P\big(x(n+1)\,|\,x(n)\big)\big] \;-\; \log\big[P\big(x(0)\big)\big]$$
(information transfer term, dynamics term, initial condition term).
Everything rests on the structure of A(X) in path space:
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}.$$
Two methods for evaluating such high dimensional
integrals:
(1) Laplace’s method (1774); seek minima of A(X)--
there are multiple minima
(2) Monte Carlo searches
First seeks minima of A0(X); second samples near
these minima.
$$\frac{\partial A(X)}{\partial X_a} = 0 \quad\text{and}\quad \frac{\partial^2 A(X)}{\partial X_a\,\partial X_b} > 0\ \text{(positive definite)}$$
$$\left.\frac{\partial A_0(X)}{\partial X_a}\right|_{X^q} = 0, \quad q = 0, 1, \ldots, \qquad A_0(X^0) < A_0(X^{q\neq 0}).$$
We focus on the Laplace method. How can we find the minima X^q, and which minimum gives the biggest contribution to the integral?
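Why the path with the smallest action dominates: the standard Laplace estimate (a sketch; the corrections mentioned later are not shown) is
$$\int dX\, e^{-A(X)}\, G(X) \;\approx\; \sum_q G(X^q)\; e^{-A(X^q)}\,\sqrt{\frac{(2\pi)^{\dim(X)}}{\det\big[\partial^2 A(X^q)\big]}},$$
so when one action level A(X^0) lies well below the others, $\langle G(X)\rangle \approx G(X^0)$ with relative corrections of order $e^{-[A(X^1)-A(X^0)]}$.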
Everything rests on the structure of A(X) in path space.
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}$$
A(X) is nonlinear in X and has multiple minima. Location and
number of these minima depend on number of measurements at
each observation time in [t0,tN].
Standard model, Gaussian Error Action
Observations have Gaussian noise and models have Gaussian errors; the action is
$$A_{SM}(X) = \sum_{n=0}^{N}\sum_{l=1}^{L} \frac{R_m(n,l)}{2}\big(x_l(n) - y_l(n)\big)^2 \;+\; \sum_{n=0}^{N-1}\sum_{a=1}^{D} \frac{R_f(a)}{2}\big(x_a(n+1) - f_a(x(n))\big)^2.$$
It is not Gaussian in X if f(x) is nonlinear. Finding paths and associated minima at any Rm and Rf is not hard (IPOPT and other public domain optimization algorithms), but finding the path with the smallest action is a challenge; it is NP-complete, in general.
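As a concrete reading of A_SM, here is a small sketch that evaluates the two sums for a given discrete-time path; the linear map used at the end is purely illustrative, not a model from the lecture.

import numpy as np

def standard_model_action(X, y, f, Rm, Rf, obs_idx):
    # X: (N+1, D) path of model states x(n); y: (N+1, L) data y_l(n)
    # f: map x(n) -> x(n+1); Rm, Rf: scalar precisions; obs_idx: measured components
    meas_err = 0.5 * Rm * np.sum((X[:, obs_idx] - y) ** 2)
    model_err = 0.5 * Rf * np.sum((X[1:] - np.array([f(x) for x in X[:-1]])) ** 2)
    return meas_err + model_err

# Illustrative use with a trivial linear "model" x(n+1) = 0.9 x(n), D = 3, L = 1:
f = lambda x: 0.9 * x
X = np.random.randn(50, 3)                            # a candidate path
y = X[:, [0]] + 0.1 * np.random.randn(50, 1)          # noisy observations of x_0
print(standard_model_action(X, y, f, Rm=1.0, Rf=10.0, obs_idx=[0]))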
The origin of the multiple minima is instability on the synchronization manifold y_l(n) = x_l(n). Measurements act to transfer information and stabilize directions in state space. Looking in continuous time shows this:
$$A_{SM}\big(x(t), dx(t)/dt\big) = \int_{t_0}^{t_f} dt\, L\big(x(t), dx(t)/dt, t\big)
= \int_{t_0}^{t_f} dt \left[\sum_{l=1}^{L}\frac{R_m(t)}{2}\big(x_l(t)-y_l(t)\big)^2 + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big(dx_a(t)/dt - F_a(x(t))\big)^2\right],$$
which gives the Euler-Lagrange equations (with $DF_{ba}(x) = \partial F_b(x)/\partial x_a$)
$$\frac{d}{dt}\left[\frac{dx_a(t)}{dt} - F_a\big(x(t)\big)\right] + \left[\frac{dx_b(t)}{dt} - F_b\big(x(t)\big)\right] DF_{ba}\big(x(t)\big) = \frac{R_m}{R_f}\,\delta_{al}\,\big(x_l(t) - y_l(t)\big),$$
a 'nudging' of x(t) toward y(t) in the measured directions; at a minimum $\delta A_0(X) = 0$. These equations have the boundary conditions $p_a(t_0) = p_a(t_f) = 0$, where $p_a(t) = R_f(a)\big(dx_a(t)/dt - F_a(x(t))\big)$ is the canonical momentum.
Writing the measurement error term in the continuous-time action $A_{SM}$ as a 'potential',
$$c\big(x(t)-y(t)\big) = \sum_{l=1}^{L}\frac{R_m(t)}{2}\big(x_l(t)-y_l(t)\big)^2,$$
the Euler-Lagrange equations take the form of Newton's equations for a charged particle in electric and magnetic fields:
$$\frac{d^2x_a(t)}{dt^2} - \Omega_{ab}\big(x(t)\big)\frac{dx_b(t)}{dt} = \frac{\partial}{\partial x_a(t)}\left[\frac{c\big(x(t)-y(t)\big)}{R_f} + \frac{F\big(x(t)\big)^2}{2}\right] + \frac{\partial F_a\big(x(t)\big)}{\partial t},$$
with the antisymmetric generator
$$\Omega_{ab}\big(x(t)\big) = \frac{\partial F_a\big(x(t)\big)}{\partial x_b(t)} - \frac{\partial F_b\big(x(t)\big)}{\partial x_a(t)}.$$
In three dimensions this reads
$$\frac{d^2x_a(t)}{dt^2} = \left[\frac{dx(t)}{dt}\times B\big(x(t)\big)\right]_a + E_a\big(x(t)\big),$$
$$B_a\big(x(t)\big) = \epsilon_{abc}\,\frac{\partial}{\partial x_b(t)}A_c\big(x(t)\big) = \frac{1}{2}\,\epsilon_{abc}\left[\frac{\partial A_c\big(x(t)\big)}{\partial x_b(t)} - \frac{\partial A_b\big(x(t)\big)}{\partial x_c(t)}\right], \qquad E_a\big(x(t)\big) = -\frac{\partial \varphi\big(x(t)\big)}{\partial x_a(t)} - \frac{\partial A_a\big(x(t)\big)}{\partial t}.$$
Now we move on to evaluating the expected value
integrals using Laplace’s method
We do not discuss corrections to the method here, though one can compute them; the algebra is complicated.
A(X) is nonlinear in X and has multiple minima. The location and number of these minima depend on the number of measurements at each observation time in [t0,tN].
Standard model, Gaussian Error Action: observations have Gaussian noise and models have Gaussian errors, so
$$A_{SM}(X) = \sum_{n=0}^{N}\sum_{l=1}^{L} \frac{R_m(n,l)}{2}\big(x_l(n) - y_l(n)\big)^2 \;+\; \sum_{n=0}^{N-1}\sum_{a=1}^{D} \frac{R_f(a)}{2}\big(x_a(n+1) - f_a(x(n))\big)^2.$$
Now we are ready to minimize the action, i.e. maximize the probability distribution.
Standard Model
We want to minimize the action A(X) above over all x(n) and over the parameters in f(x(n)). If f(x(n)) is nonlinear, the action A(x(n), x(n+1)) has many minima in general.
The search for the smallest minimum of a nonlinear objective function, such as A(x(n), x(n+1)), is, in general, NP-complete. An NP-complete problem cannot be solved in polynomial time in any known way.
For us, that is not good news.
To determine the path giving the lowest minimum of the action: find the minimum for a very small model error value Rf, then slowly increase Rf to larger values. We call this variational annealing (it is distinct from standard simulated annealing in statistical Physics). If Rf → ∞, the model error is 0.
We look at the opposite limit Rf → 0, where the model plays no role and the dynamical phase space structure is absent.
At Rf = 0 the minimum is degenerate at xl(t) = yl(t); the other, unmeasured states are undetermined. With Rf = Rf0 very small, choose N0 initial starting paths with xl(t) = yl(t) and the others drawn from a uniform distribution; this is a set of paths X0 for numerical minimization. We call the outcomes X1.
Use the N0 paths X1 as initial starting paths with Rf = αRf0, α > 1, to arrive at N0 paths X2. Increase Rf to α²Rf0, ..., and continue using the outcome paths as initial choices for the next optimizations, slowly increasing Rf by powers of α.
Plot A0(X^q) versus β = log_α[Rf/Rf0]: action level plots. A minimal sketch of this loop follows.
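A schematic of that loop, assuming an action(X, Rf) function such as the standard_model_action sketch above; the L-BFGS-B call stands in for the IPOPT/SNOPT optimizations actually used.

import numpy as np
from scipy.optimize import minimize

def variational_annealing(action, X0_list, Rf0=1e-8, alpha=1.5, n_beta=20):
    # Slowly raise Rf by powers of alpha, restarting each optimization from the
    # previous outcome paths; return the action levels A0(X^q) at each beta.
    paths = [np.array(X0, dtype=float) for X0 in X0_list]
    levels = []
    for beta in range(n_beta):
        Rf = Rf0 * alpha ** beta
        new_paths, new_levels = [], []
        for X in paths:
            res = minimize(lambda z: action(z.reshape(X.shape), Rf),
                           X.ravel(), method="L-BFGS-B")
            new_paths.append(res.x.reshape(X.shape))
            new_levels.append(res.fun)
        paths = new_paths
        levels.append(new_levels)
    return np.array(levels), paths

Plotting the returned levels against beta gives the action level plots discussed below.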
Simple Model Neuron NaKL: D + NP = 4 + 19, L = 1 (voltage). Twin Experiment on the NaKL Neuron.
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big) + I_{applied}(t)$$
$$\frac{dx(t)}{dt} = \frac{x_\infty\big(V(t)\big) - x(t)}{\tau_x\big(V(t)\big)}, \qquad x(t) = \{m(t), h(t), n(t)\}$$
$$x_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_x}{dV_x}\right)\right], \qquad \tau_x(V) = \tau_{x0} + \tau_{x1}\left[1 - \tanh^2\!\left(\frac{V - V_x}{dV_x}\right)\right]$$
Generate data from the NaKL equations: y(t) = x(t) + σN(0,1) noise. D = 4, L = 1.
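A sketch of generating such twin-experiment data for the NaKL neuron. The gating parameterization, the parameter values, and the injected-current protocol are illustrative choices in the spirit of the parameter table below, not the lecture's exact model.

import numpy as np
from scipy.integrate import solve_ivp

P = dict(gNa=120.0, ENa=50.0, gK=20.0, EK=-77.0, gL=0.3, EL=-54.0, C=0.8)
KIN = {  # V_half (mV), width (mV), tau0 (ms), tau1 (ms) for each gate
    "m": (-40.0, 15.0, 0.1, 0.4),
    "h": (-60.0, -15.0, 1.0, 7.0),
    "n": (-55.0, 30.0, 1.0, 5.0),
}

def x_inf(V, Vh, dV):
    return 0.5 * (1.0 + np.tanh((V - Vh) / dV))

def tau_x(V, Vh, dV, t0, t1):
    return t0 + t1 * (1.0 - np.tanh((V - Vh) / dV) ** 2)

def I_app(t):
    return 15.0 * (np.sin(0.03 * t) > 0.2)           # a simple current protocol

def nakl(t, s):
    V, m, h, n = s
    dV = (P["gNa"] * m**3 * h * (P["ENa"] - V) + P["gK"] * n**4 * (P["EK"] - V)
          + P["gL"] * (P["EL"] - V) + I_app(t)) / P["C"]
    gates = []
    for x, name in zip((m, h, n), ("m", "h", "n")):
        Vh, w, t0, t1 = KIN[name]
        gates.append((x_inf(V, Vh, w) - x) / tau_x(V, Vh, w, t0, t1))
    return [dV] + gates

t = np.arange(0.0, 500.0, 0.02)                      # ms, dt = 0.02 ms
sol = solve_ivp(nakl, (t[0], t[-1]), [-65.0, 0.1, 0.6, 0.3], t_eval=t, max_step=0.05)
V_data = sol.y[0] + 1.0 * np.random.randn(t.size)    # the one observed, noisy series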
Annealing in Model Error Accuracy
Paths giving minima of the action depend on the number of measurements L.
For the Standard Model, when the action levels become independent of Rf, the action level is dictated by the statistics of the measurement error term; this is a consistency check on the action level evaluations.
[Figure: NaKL model action level plot, measuring voltage only; horizontal axis log_α[Rf/Rf0] with α = 3/2.]
NaKL Neuron Twin Experiment

Parameter   Known     Estimated   LB       UB
gNa         120.0     108.4       50.0     200.0
ENa         50.0      49.98       0.0      100.0
gK          20.0      21.11       5.0      40.0
EK          -77.0     -77.09      -100.0   -50.0
gL          0.3       0.3028      0.1      1.0
EL          -54.0     -54.05      -60.0    -50.0
C           0.8       0.81        0.5      1.5
Vm          -40.0     -40.24      -60.0    -30.0
dVm         0.0667    0.0669      0.01     0.1
τm0         0.1       0.0949      0.05     0.25
τm1         0.4       0.4120      0.1      1.0
Vh          -60.0     -59.43      -70.0    -40.0
dVh         -0.0667   -0.0702     -0.1     -0.01
τh0         1.0       1.0321      0.1      5.0
τh1         7.0       7.76        1.0      15.0
Vn          -55.0     -54.52      -70.0    -40.0
dVn         0.0333    0.0328      0.01     0.1
τn0         1.0       1.06        0.1      5.0
τn1         5.0       4.97        2.0      12.0
Lorenz96 Model, D = 11
$$\frac{dx_a(t)}{dt} = x_{a-1}(t)\big(x_{a+1}(t) - x_{a-2}(t)\big) - x_a(t) + f, \qquad a = 1, 2, \ldots, D;$$
$$x_{-1}(t) = x_{D-1}(t), \quad x_0(t) = x_D(t), \quad x_{D+1}(t) = x_1(t).$$
f is a fixed parameter, f = 10. Solutions are chaotic.
'Twin Experiments': used to test methods of data assimilation and to design experiments.
Generate data with a known model; add noise to the model output; present l = 1, 2, ..., L < D noisy time series to the assimilation procedure. A small sketch of this for Lorenz96 follows.
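With D = 11, f = 10, and L < D noisy observed time series (the noise level and integration settings are illustrative):

import numpy as np
from scipy.integrate import solve_ivp

D, f = 11, 10.0

def lorenz96(t, x):
    # cyclic indexing handled by np.roll: x[a-1], x[a+1], x[a-2]
    return np.roll(x, 1) * (np.roll(x, -1) - np.roll(x, 2)) - x + f

t = np.arange(0.0, 20.0, 0.025)
x0 = f * np.ones(D)
x0[0] += 0.01                                        # perturb off the fixed point
sol = solve_ivp(lorenz96, (t[0], t[-1]), x0, t_eval=t, rtol=1e-9)

L = 4                                                # number of measured series, L < D
obs_idx = np.linspace(0, D - 1, L, dtype=int)
y = sol.y[obs_idx] + 0.2 * np.random.randn(L, t.size)    # noisy observations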
Neurobiological Example: inject a current Iapplied(t) into the neuron and measure the response voltage V(t).
Neuron Model (D = 4, L = 1, p = 20 parameters)
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big) + I_{applied}(t)$$
$$\frac{da(t)}{dt} = \frac{a_\infty\big(V(t)\big) - a(t)}{\tau_a\big(V(t)\big)}, \qquad a(t) = \{m(t), h(t), n(t)\}$$
$$a_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_a}{dV_a}\right)\right], \qquad \tau_a(V) = \tau_{a0} + \tau_{a1}\left[1 - \tanh^2\!\left(\frac{V - V_a}{dV_a}\right)\right]$$
Measure V(t) with a selected Iapplied(t); evaluate all parameters and all unobserved state variables a(t).
This is the challenge: using laboratory experiments on individual neurons and on collections of neurons, build biophysically based models of functional neural networks which match the experiments and predict the response to new stimuli.
Our strategy is this:
o create a model of the functional network of interest (e.g. the song production network of songbirds) and, of course, of the individual neurons in the network. What is a sensible model? We use Hodgkin-Huxley models.
o using the model itself, design experiments that stimulate all degrees of freedom of the neuron/network and measure enough quantities at each observation time; these are numerical simulations.
o use the model along with data, voltage across membranes and perhaps other measurements, to determine the unknown parameters in the model and the unobserved state variables in the model.
o "validate the neuron model" via prediction. These validated neurons can then be used in network construction.
One mainstream view of network modeling and operation is that details do not matter, and that some form of network "organization" or structure determines network operations.
Our use of data assimilation to design experiments and to test and validate models of cells and systems points to the advantages of other directions.
Why would we want the kind of detail about neural or cellular processes that accurate modeling and careful data assimilation provide?
➢ use models of nerve cells (neurons) to compare healthy and diseased cells and provide biophysical targets for therapies
➢ use detailed models of regulatory networks for genetic action to design interventions
➢ use detailed, verified models of functional network connectivity and nodal performance to engineer functions into high-performance electronics, e.g. sequence generation and recognition with human accuracy but machine performance
[Diagram: avian song system. Green, motor pathway: HVC → RA → respiration/syrinx → song production, with auditory feedback. Red, anterior forebrain pathway (AFP): HVC → Area X → DLM → LMAN → Area X and HVC; control and song maintenance. The song box (syrinx) produces the song.]
Neurobiological Laboratory Experiments
Margoliash Laboratory, UChicago
Isolated Neurons from the Avian Song System
On each neuron many different Iapplied(t) measurements in time
“epochs” of 2-6 seconds
Membrane Voltage Observed
Sampling time 0.02 ms (50 kHz), 500-1500 ms of observations
Use all this to estimate the unknown parameters in the neuron
and unmeasured state variables, then predict the response of
the neuron to new stimuli (forcing). Model Validation
[Figure: injected current and recorded voltage; time axis in units of 0.02 ms, 2000 ms shown.]
Why this Iappl(t) ?
Back to the Song System Nucleus HVC: Interneurons, L = 1 (voltage)
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big)$$
$$\qquad\qquad +\; g_{Ca}\,a(t)\,b(t)\,V(t)\,\frac{[\mathrm{Ca}^{2+}]_{\mathrm{ext}} - [\mathrm{Ca}^{2+}](t)\,e^{-V(t)/V_T}}{1 - e^{-V(t)/V_T}} \;+\; \text{other currents} \;+\; I_{applied}(t)$$
$$\frac{dx(t)}{dt} = \frac{x_\infty\big(V(t)\big) - x(t)}{\tau_x\big(V(t)\big)}, \qquad x(t) = \{m(t), h(t), n(t), a(t), b(t)\}$$
$$x_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_x}{dV_x}\right)\right], \qquad \tau_x(V) = \tau_{x0} + \tau_{x1}\left[1 - \tanh^2\!\left(\frac{V - V_x}{dV_x}\right)\right]$$
VLSI Neuromorphic Chip
➢ Test parameters on the chip to check quality of fabrication
versus design
➢ Use a twin experiment to test method of data assimilation:
generate data from the VLSI chip. Use “voltages” on chip
neurons as measured quantities to estimate parameters
known from first step.
➢ Use voltage data from biological neuron to readjust chip
parameters and state variables to those for the data, then
predict voltage response to new current stimulation.
Unfinished Business
Measurements for networks of neurons: extracellular potentials? Other technologies?
Computational capability for the future
Port network models to VLSI
Use principles of network functions to solve similar problems in other space and time domains.
Some application areas for Data Assimilation:
Genetic regulatory networks
signal transduction pathways
systems biology; synthetic biology; Immunology
biophysical modeling of neurons and functional networks
neutrino astrophysics
coastal flows and transport of toxic constituents after storms
electrical and chemical engineering
identifying oil and gas reservoirs
hydrological models of streams and lakes
neuromorphic engineering---neurons and functional networks on a chip
numerical weather prediction
Machine Learning
Feedforward Multi-layer Perceptron
Data Assimilation in a time window [t0,tF]: transfer information from a data library y(τ) to a model x(t).
[Diagram: at each observation y(τ1), y(τ2), ..., y(τk), ..., y(τF) the factor P(y|x) enters; between observations the model is moved forward with P(x_{n+1}|x_n).]

Multi-Layer Perceptron over layers [l0,lF]: transfer information from a data library {y(l0), y(lF)} to a model x(l+1) = f(W x(l)).
[Diagram: input y(l0) presented at layer l0, output y(lF) at layer lF; units j = 1, 2, ..., N in each layer; the model moves forward layer to layer with P(x(l+1)|x(l)), and the data enter through P(y|x(l0)) and P(y|x(lF)).]
Total Probability = $P\big(y\,|\,x(l_0)\big)\,\prod_l P\big(x(l+1)\,|\,x(l)\big)\; P\big(y\,|\,x(l_F)\big)$
$$= \exp\Big[-\Big(\sum -\log P(y\,|\,x) \;+\; \sum -\log P\big(x(l+1)\,|\,x(l)\big)\Big)\Big] = \exp\big[-A_{ML}(x(l))\big] = e^{-A_{ML}(x(l))}$$
$$\text{Expected Value of } G(X) = \frac{\int dX\; G(X)\, e^{-A_{ML}(x(l))}}{\int dX\; e^{-A_{ML}(x(l))}}$$
Machine Learning requires evaluating a Statistical Physics Integral.
Machine Learning Action
In standard Machine Learning we have a network with an input layer l0 and an output layer lF, and between them intermediate "hidden" layers. Information in noisy pairs {y^k(l0), y^k(lF)}, k = 1, 2, ..., M, is presented to the network.
We want to minimize the cost function
$$\frac{1}{2ML}\sum_{k=1,\,r=1}^{M,\,L} \frac{R_m(r,l)}{2}\big(x_r^{(k)}(l) - y_r^{(k)}(l)\big)^2$$
subject to the network rules for layer l, with active units ("neurons") x_j(l), j = 1, 2, ..., N, satisfying
$$x_j(l+1) = f_j\big[W_{ji}(l)\,x_i(l)\big].$$
Relax the equality constraint:
$$A_{ML}\big(x(l)\big) = \frac{1}{2ML}\sum_{k=1,\,r=1}^{M,\,L} \frac{R_m(r,l)}{2}\big(x_r^{(k)}(l) - y_r^{(k)}(l)\big)^2 \;+\; \sum \frac{R_f(l)}{2}\big[x^{(k)}(l+1) - f\big(W(l)\,x^{(k)}(l)\big)\big]^2,$$
with $R_m(l) \neq 0$ only when $l \in \{l_0, l_F\}$.
Total Probability = exp[-A_ML(x(l))]. To approximate the expected value integral: maximize the overall probability, i.e. minimize A_ML(x(l)) over x(l) and W(l). The model is exact when Rf → ∞.
ML example
Data is generated by selecting a network with 100 layers
and 10 `neurons’ at each layer. Weights are selected from a
uniform distribution U[-0.1,0.1]. Inputs xk(l0); k = 1,2,… are
passed through the network producing outputs xk(lF).
Gaussian noise N(0,σ2=0.0025) is added to the inputs
and the outputs. These make our library of data {yk(l0), yk(lF)}.
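A sketch of building that data library. The activation function is assumed here to be tanh (the text does not specify f), and the input distribution is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
n_layers, n_units, M, sigma = 100, 10, 200, 0.05     # sigma^2 = 0.0025

W = rng.uniform(-0.1, 0.1, size=(n_layers - 1, n_units, n_units))

def forward(x0):
    x = x0
    for Wl in W:
        x = np.tanh(Wl @ x)                          # assumed activation f = tanh
    return x

X0 = rng.normal(size=(M, n_units))                   # inputs x^k(l0)
XF = np.array([forward(x) for x in X0])              # outputs x^k(lF)
Y0 = X0 + sigma * rng.normal(size=X0.shape)          # noisy library {y^k(l0)}
YF = XF + sigma * rng.normal(size=XF.shape)          # noisy library {y^k(lF)}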
We then build a network with lF layers and N active
units per layer. Train this network by minimizing:
$$A_{ML}\big(x_r^k(l), W_{ji}(l)\big) = \frac{1}{M}\sum_{k=1}^{M}\sum_{l=l_0}^{l_F}\left\{\sum_{r=1}^{L}\frac{R_m(l)}{2L}\big(x_r^k(l) - y_r^k(l)\big)^2 + \sum_{j=1}^{N}\frac{R_f}{2}\big(x_j^k(l+1) - f\big(W_{ji}(l)\,x_i^k(l)\big)\big)^2\right\}$$
The data assimilation time index maps to the machine learning layer index: t ⇒ l.
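A numpy reading of this action for a single data pair k, again assuming tanh units; in training it is minimized over both the activations x and the weights W while Rf is annealed upward, exactly as in the data assimilation case.

import numpy as np

def A_ML(x, W, y0, yF, Rm=1.0, Rf=10.0):
    # x: (n_layers, n_units) activations; W: (n_layers-1, n_units, n_units) weights
    # y0, yF: the noisy data presented at the first and last layers only
    meas = 0.5 * Rm * (np.sum((x[0] - y0) ** 2) + np.sum((x[-1] - yF) ** 2))
    model = 0.5 * Rf * np.sum(
        (x[1:] - np.tanh(np.einsum('lij,lj->li', W, x[:-1]))) ** 2)
    return meas + model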
Adversarial Perturbations
The network is trained using variational annealing with M data pairs. Prediction (generalization) is performed using MP new pairs {y^k(l0), y^k(lF)}: the input y^k(l0) is presented to the trained network at l0, and the output x^k(lF) is compared to the output member of the data pair at lF.
Quality of predictions:
$$E^2(l_F, M_P) = \frac{1}{N M_P}\sum_{k=1}^{M_P}\sum_{j=1}^{N}\big[x_j^k(l_F) - y_j^k(l_F)\big]^2$$
Deepest Learning: the layer becomes a continuous variable.
$$A\big(x(l), x'(l)\big) = \int_{l_0}^{l_F} dl\; L\big(x(l), x'(l), l\big)$$
$$L\big(x(l), x'(l), l\big) = \sum_{r=1}^{L}\frac{R_m(r,l)}{2}\big(x_r(l) - y_r(l)\big)^2 + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big[x_a'(l) - F_a\big(x(l), l\big)\big]^2$$
$$\qquad\qquad = c\big(x(l) - y(l)\big) + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big[x_a'(l) - F_a\big(x(l), l\big)\big]^2$$
Our variational principle is in Lagrangian coordinates {x(l), x'(l)}, with boundary conditions p(l0) = p(lF) = 0. As in the data assimilation case, the Euler-Lagrange equations have the form of motion in electric and magnetic fields,
$$\frac{d^2x(t)}{dt^2} = \text{``}v(t)\times B\big(x(t),t\big)\text{''} + \big[\nabla\Phi\big(x(t),t\big) + \partial_t A\big(x(t),t\big)\big],$$
with Ω the generator of a rotation in D dimensions.
In Hamiltonian coordinates {x(l), p(l)}, with $p(l) = \partial L\big(x(l), x'(l), l\big)/\partial x'(l)$,
$$H\big(x(l), p(l), l\big) = \frac{p^2}{2R_f} + p\cdot F\big(x(l)\big) - \text{Measurement Error Term},$$
and Hamilton's equations read
$$\frac{dx(l)}{dl} = F\big(x(l), l\big) + \frac{p(l)}{R_f}, \qquad \frac{dp(l)}{dl} = -\frac{\partial F\big(x(l), l\big)}{\partial x(l)}\,p(l) + R_m(r,l)\big(x_r(l) - y_r(l)\big),$$
with boundary conditions p(l0) = p(lF) = 0. This is Back Propagation, starting at lF.
In discrete layers,
$$A_{ML}\big(x(l), x'(l)\big) = \sum_{l=l_0}^{l_F-1} L\big(x(l), x(l+1)\big),$$
and the Lagrangian variation is
$$\delta A_{ML} = \delta x(l_0)\,\frac{\partial L\big(x(l_0), x(l_1)\big)}{\partial x(l_0)} + \delta x(l_F)\,\frac{\partial L\big(x(l_{F-1}), x(l_F)\big)}{\partial x(l_F)} + \text{discrete Euler-Lagrange equation at each layer},$$
which supplies the boundary conditions. The Lagrangian variation satisfies the symplectic symmetry of the problem and gives accurate estimation of the minima of the action A(X). More stable than Back Propagation?
Further possibilities:
Recurrent Networks
Performance on large libraries of labeled Images
Use of ML method in identification of functional network
connectivity of biophysical neurons
Information input at intermediate layers ?
More complex networks; learn by introducing many Rf
terms and use them as required?
Recursion notation for the path and the data:
$$X(n+1) = \{x(t_0), x(t_1), \ldots, x(t_n), x(t_{n+1})\} = \{X(n), x(n+1)\}$$
$$Y(n+1) = \{y(t_0), y(t_1), \ldots, y(t_n), y(t_{n+1})\} = \{Y(n), y(n+1)\}$$
So, what did we learn ?
1. Make a model—no algorithms for this, use your best
knowledge of the (bio)physics.
2. Make a big model—experiments will prune the model
3. Do twin experiments to determine how many measurements
you need to get the “global” minimum of A0(X)—annealing.
4. Use twin experiments to design laboratory experiments
5. Do experiments to determine consistency of model with data.
6. Use the completed model and estimated x(T), via probability
distribution or dx/dt = F(x(t)), to predict, for t > T. This
validates (or not) the model.
So, what did we learn ?
7. Use Laplace method + computable corrections to determine
consistency of numerical methods.
8. If there are not enough measurements at each observation
time, (a) get more; (b) use waveform information via time delays.
9. Using Data Assimilation, one can (a) test new fabrications of
VLSI neurons, (b) test DA methods on verified VLSI chip, (c)
complete neuron model on chip from biological data; predict
response to new forcing.
To see how the optimization process is working, we look at the optimization output at various steps during the iteration:
$$\big(y(n)_0,\, u(n)_0,\, p_0\big) \rightarrow \big(y(n)_1,\, u(n)_1,\, p_1\big) \rightarrow \cdots \rightarrow \big(y(n)_{Final},\, u(n)_{Final},\, p_{Final}\big)$$
until the objective (cost) function is minimized, subject to the model equations. We require $u(n)_{Final} \approx 0$.
We will discuss the equivalence between machine learning (ML) and data assimilation (DA).
Data Assimilation is the transfer of information from (often sparse) observations to dynamical models of complex systems: numerical weather prediction, neurobiology, ...
We will review the formulation of each, DA and ML; the equivalence will be clear.
Then we will give a variational annealing method for locating the smallest minimum of the action/cost function in variational calculations in each field.
Using this in an example from each field gives design insight into "deep learning".
We then formulate "deepest learning", in which the ML layers become continuous. This puts back propagation in a familiar perspective: the Euler-Lagrange equations of the variational principle. It also suggests more stable variational approaches.
Total Probability = $P(y\,|\,x)\,P(x_{n+1}\,|\,x_n)\,P(y\,|\,x)\,P(x_{n+1}\,|\,x_n)\cdots P(y\,|\,x)$
$$= \exp\Big[-\Big(\sum -\log P(x\,|\,y) \;+\; \sum -\log P(x_{n+1}\,|\,x_n)\Big)\Big] = e^{-\mathrm{Action}(X)} = e^{-A(X)}$$
$$\text{Expected Value of } G(X) = \frac{\int dX\; G(X)\, e^{-A(X)}}{\int dX\; e^{-A(X)}}$$
Data Assimilation requires evaluating a Statistical Physics Integral.
We want P(X(n+1)|Y(n+1)) in terms of P(X(n)|Y(n)). This recursion relation will give us a representation of P(X(N)|Y(N)).
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \frac{P\big(x(n+1), y(n+1), X(n), Y(n)\big)}{P\big(y(n+1), Y(n)\big)} \qquad \text{(definition of conditional probability)}$$
$$= \exp\Big[\mathrm{CMI}\big(y(n+1);\, x(n+1), X(n)\,\big|\,Y(n)\big)\Big]\; P\big(x(n+1)\,|\,X(n), Y(n)\big)\; P\big(X(n)\,|\,Y(n)\big)$$
Markov property: x(n+1) depends ONLY on x(n) (true of the differential equations in biophysics), so
$$P\big(x(n+1)\,|\,X(n), Y(n)\big) = P\big(x(n+1)\,|\,x(n)\big).$$
Collecting terms, $P\big(X(N)\,|\,Y(N)\big) = \exp[-A(X)]$ up to terms independent of X.
[Figure: Lorenz96 model action level plots, D = 11, with L = 2, 4, 5, 6 measured time series.]