Data Assimilation, Machine Learning:
Statistical Physics Problems
Introduction, Core Ideas, Applications
Henry D. I. Abarbanel
Department of Physics
and
Marine Physical Laboratory (Scripps Institution of Oceanography)
Center for Engineered Natural Intelligence
University of California, San Diego
This is meant to be an introductory and an
advanced set of pedagogical talks on data assimilation
and machine learning.
These are statistical Physics problems.
I hope you will ask a lot of questions.
My colleague Dan Margoliash will provide a
neurobiological setting for utilizing many of the methods
discussed. Much of what I address has been developed
with him and tested and improved in application to results
obtained in his laboratory.
I received a bipolar transistor for a
New Year’s present. I want to know how
it works so I can use many of them to
build a follow-on K computer (before
2020?).
What do I do?
General answer:
Hook up my nice new transistor to a known RLC
circuit and drive the dynamical variables of the transistor
through their dynamical range. Measure some of the
variables of the circuit V(t); this produces data. Make a
model of the transistor, drive it in precisely the same way, to
get model output Vmodel(t).
Minimize the distance
$$\sum_{t=0}^{T}\big(V_{\mathrm{data}}(t) - V_{\mathrm{model}}(t)\big)^2,$$
subject to our model equations of motion.
Test the model, completed by the estimated parameters, through prediction for t > T.
What do we need to complete this task?
➢ a model of the origin of the data
➢ data
➢ a way to minimize the distance between the data and
the model variables
We first generate our own data from our model, then use our minimization method to show that it works; this is called a twin experiment.
Then, with some confidence, we use the method on experimental data to determine the parameters in my new transistor.
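Here is a minimal sketch of that recipe in code. It is not the lecture's transistor model: a damped, driven oscillator stands in for the transistor-plus-RLC circuit, scipy's least_squares stands in for the constrained optimization, and every parameter value is made up for illustration.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def rhs(t, state, gamma, omega0):
    # state = [V, dV/dt]: a damped, driven oscillator standing in for the circuit
    v, vdot = state
    drive = np.sin(2.0 * t)                      # known driving signal
    return [vdot, -2.0 * gamma * vdot - omega0**2 * v + drive]

t_obs = np.linspace(0.0, 20.0, 400)

def simulate(params):
    gamma, omega0 = params
    sol = solve_ivp(rhs, (0.0, 20.0), [0.0, 0.0], t_eval=t_obs,
                    args=(gamma, omega0), rtol=1e-8)
    return sol.y[0]                              # the "measured" voltage V(t)

true_params = np.array([0.15, 1.3])
V_data = simulate(true_params) + 0.01 * np.random.randn(t_obs.size)   # twin data

# Minimize sum_t (V_data(t) - V_model(t))^2 over the model parameters.
fit = least_squares(lambda p: simulate(p) - V_data, x0=[0.5, 1.0])
print("estimated (gamma, omega0):", fit.x)       # should land near (0.15, 1.3)

The estimated parameters landing near the values used to generate the data is exactly the check the twin experiment provides before we trust the method on the real circuit.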
[Circuit diagram: Colpitts oscillator, built from a bipolar transistor (base B, collector C, emitter E), an inductor L, resistors R and Ree, capacitors C1 and C2, and supply voltage Vee.]
Colpitts Oscillator Circuit: 1920s, 1950s, 1970s, and 1990s
$$I_C(V_E) = (1\ \mathrm{mA})\,\exp\!\left(\frac{|V_E|}{k_B T/e}\right).$$
J. J. Ebers and J. L. Moll, "Large-signal behavior of junction transistors," Proceedings of the IRE, vol. 42, no. 12, pp. 1761-1772, Dec. 1954.
H. K. Gummel and H. C. Poon, "An Integral Charge Control Model of Bipolar Transistors," Bell System Technical Journal, vol. 49, no. 5, pp. 827-852, 1970.
[Figure: voltage time series VE(t) recorded from a Colpitts circuit operating in the chaotic regime; sampling interval Δt = 10 μs, time axis in ms.]

Rescaled Colpitts Oscillator (Data Source)
$$\frac{dx_1(t)}{dt} = \alpha\, x_2(t) \qquad (\alpha\ \text{is first kept fixed, then driven:}\ \alpha(t))$$
$$\frac{dx_2(t)}{dt} = -\gamma\big(x_1(t) + x_3(t)\big) - q\, x_2(t)$$
$$\frac{dx_3(t)}{dt} = \eta\big(x_2(t) + 1 + e^{-x_1(t)}\big)$$
[Figure: synchronization of the Colpitts model with experimental data from the Colpitts circuit. With no coupling of the data into the model, u(t) = k = 0, the trajectories separate; with coupling k = 1.9 they synchronize.]
Data Source
$$\frac{dx_1(t)}{dt} = \alpha\, x_2(t), \qquad \frac{dx_2(t)}{dt} = -\gamma\big(x_1(t)+x_3(t)\big) - q\,x_2(t), \qquad \frac{dx_3(t)}{dt} = \eta\big(x_2(t) + 1 + e^{-x_1(t)}\big)$$

Model Equations (x1(t) is passed to the model)
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big)$$
$$\frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad u(t) \ge 0$$
$$\frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big)$$
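The synchronization behavior summarized in the figure above can be reproduced in a few lines. The sketch below uses the Lorenz '63 system as a stand-in for the Colpitts circuit (its standard parameter values are well known, whereas specific Colpitts values are not quoted here); the nudging term u(t)(x1(t) - y1(t)) is exactly the coupling written in the model equations above.

import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x1, x2, x3 = s
    return [sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3]

t_eval = np.linspace(0.0, 20.0, 4000)
data = solve_ivp(lorenz, (0.0, 20.0), [1.0, 2.0, 3.0], t_eval=t_eval, rtol=1e-9).y

def nudged_model(t, s, u):
    # model copy: only the "measured" component x1(t) of the data is passed in
    x1_data = np.interp(t, t_eval, data[0])
    dy = lorenz(t, s)
    dy[0] += u * (x1_data - s[0])                # nudging term u*(x1 - y1)
    return dy

for u in (0.0, 30.0):
    y = solve_ivp(nudged_model, (0.0, 20.0), [-5.0, 0.0, 10.0],
                  t_eval=t_eval, args=(u,), rtol=1e-9).y
    err = np.mean(np.abs(data[1] - y[1])[2000:])     # unmeasured x2 vs y2, late times
    print(f"u = {u}: mean |x2 - y2| over the second half = {err:.3f}")

With u = 0 the two chaotic trajectories stay far apart; with sufficiently strong nudging (u = 30 here, close to replacing y1 by the data) the unmeasured components of the model lock onto those of the data.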
Minimize
$$C(y,u,p) = \frac{1}{2N}\sum_{m=0}^{N}\Big[\big(x_1(m) - y_1(m)\big)^2 + u(m)^2\Big]$$
subject to the model equations
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big), \qquad \frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad \frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big),$$
with x1(t) the data.
The solution of the optimization problem is an iterative process in (y(n), u(n), p) space. Given initial values (y0(n), u0(n), p0), iterate, adjusting all (ym(n), um(n), pm), m = 1, 2, 3, ..., to minimize the cost function.
The procedure tracks the state variables and correctly estimates the parameters even when they are time dependent. It tracks accurately through bifurcations in the system behavior: chaotic → fixed point → limit cycle → chaotic.
The number of variables in each optimal estimation calculation is about 3000-5000.
$$\big(y(n)_0,\, u(n)_0,\, p_0\big) \rightarrow \big(y(n)_1,\, u(n)_1,\, p_1\big) \rightarrow \cdots \rightarrow \big(y(n)_{Final},\, u(n)_{Final},\, p_{Final}\big)$$
until the objective (cost) function is minimized, subject to the model equations. We require $u(n)_{Final} \approx 0$.
Chaotic Colpitts Oscillator; Initial Conditions in Optimization are free
x1(t) observed; other state variables evaluated by SNOPT
Chaotic Colpitts Oscillator; external driving of the parameter α(t): α > 5.0 gives chaotic behavior, α < 5.0 gives regular behavior.
For a general model with R state variables, only the observed variable is nudged:
$$\frac{dy_1(t)}{dt} = F_1\big(y_1(t),\ldots,y_R(t),\, q\big) + u(t)\big(x_1(t) - y_1(t)\big)$$
$$\frac{dy_r(t)}{dt} = F_r\big(y_1(t),\ldots,y_R(t),\, q\big), \qquad r = 2,\ldots,R.$$
[Figures: Colpitts oscillator, model output vs. data, shown as SNOPT reaches the solution.]
Experimental Colpitts Oscillator Circuit, Δt = 10 µs.
VE(t) is presented to the state and parameter estimation procedure. VCE(t) and IL(t) are estimated, using 10 ms of data. Then predictions of VE(t), VCE(t), and IL(t) are made from the estimated state at t = 10 ms.
[Figures: measured VE(t) and estimated VCE(t) and IL(t); predicted VE(t), VCE(t), and IL(t) from the estimates at t = 10 ms.]
Minimize
$$C(y,u,p) = \frac{1}{2N}\sum_{m=0}^{N}\Big[\big(x_1(m) - y_1(m)\big)^2 + u(m)^2\Big]$$
subject to the model equations
$$\frac{dy_1(t)}{dt} = \alpha_M\, y_2(t) + u(t)\big(x_1(t) - y_1(t)\big), \qquad \frac{dy_2(t)}{dt} = -\gamma_M\big(y_1(t)+y_3(t)\big) - q_M\, y_2(t), \qquad \frac{dy_3(t)}{dt} = \eta_M\big(y_2(t) + 1 + e^{-y_1(t)}\big),$$
with x1(t) the data.

Where did all this come from? Why this C(y,u,p)? Why this "nudging" term?
Just for the record, this is the wrong
answer. We will derive the correct answer.
With the electronic circuit in mind, we turn to a general view of the problem of transferring information from observations to models of the processes producing those observations.
This is not actually a new problem. Newton did this in 1687 in determining that elliptical orbits satisfying Kepler's laws require a 1/r² force.
The questions we pose 330 years later are richer: we collect information from many sources of observations of complex systems, and we want to do that in a systematic manner that allows large data sets and rich models of the processes producing those data sets.
My transistor is the same, in scientific spirit, as your:
❖ Atmosphere
❖ Neuron
❖ Lake
❖ Ocean
❖ Biological cell
❖ whatever is your complex system of interest
In every case we need:
➢ a model of the origin of the data
➢ data
➢ a way to minimize the distance between the data and the model variables
Topics:
❖ Investigating rules of nonlinear dynamics in physical and biological (complex) systems
❖ A complex oscillator: data assimilation
❖ General setting: a neurobiological example (see the Margoliash talks)
❖ General problems: numerical algorithms
❖ Machine learning: statistical Physics and data assimilation
Data Assimilation in a time window [t0,tF]: transfer information from a data library y(τ) to a model x(t).
[Diagram: observations y(τ1), y(τ2), y(τ3), ..., y(τk), ..., y(τF) are made at times within [t0, tF]; between observations the model is moved forward.]
$$P(X\,|\,Y) = \frac{P(X,Y)}{P(Y)}$$
$$X = \{x(t_0), x(t_1), \ldots, x(t_F)\}\ \ \text{(states and parameters of the model)}, \qquad Y = \{y(\tau_1), y(\tau_2), \ldots, y(\tau_F)\}\ \ \text{(data)}.$$
Data Assimilation:
Transfer of Information from Measurements
to a Model of the Observations
We start with noisy measurements y_l(t), l = 1, 2, ..., L; errors in the model for the state x_a(t), a = 1, 2, ..., D, with D >> L; and uncertain initial conditions x(t0).
We wish to incorporate the information in measurements at t0, t1, ..., tF = T into our statistical estimate of the complete state of the model at these times and into our statistical estimate of the model parameters.
The model has errors; given x(T), we use it to predict x(t > T). This is a validation (or not) of the model.
Times: $\{t_0, t_1, \ldots, t_n, \ldots, t_F = T\}$.
Measurements: $y_l(n),\ l = 1, 2, \ldots, L$.
Model: $x_a(n+1) = f_a(x(n)),\ a = 1, 2, \ldots, D$.
The measurements correspond to model variables: $y_l(n) \leftrightarrow x_l(n)$, with $L \ll D$.
Data source: Transmitter
Model: Receiver
Generalized synchronization of
the transmitter and receiver
Statistical data assimilation is communication of information from measurements (transmitter) to a dynamical model (receiver).
At the end of an observation window [t0,tF] we want the conditional probability distribution of the state of the system, P(X|observations), where X = {x(t0), x(t1), ..., x(tF)} is the path of the model through [t0,tF], given the measurements made during the window.
We then want to predict the future conditional probability distribution P(x(t > tF)|observations) for new forcing of the system.
Typical situation: The measurements are noisy. The model has
errors. We are unsure of the state of the system when we begin observing.
Observation window in time: $t_0, t_1, \ldots, t_N$.
$$X(n) = \{x(t_0), x(t_1), \ldots, x(t_n)\} = \{x(0), x(1), x(2), \ldots, x(n)\}$$
are the model state vectors and parameters at times $t_0, t_1, \ldots, t_n$.
$$Y(n) = \{y(1), y(2), y(3), \ldots, y(n)\}$$
are the observed data vectors at times $t_0 \le \tau_1, \tau_2, \ldots, \tau_n \le t_N = t_F$.
We want to express P(X(n+1)|Y(n+1)) in terms of
P(X(n)|Y(n)).
Then we iterate from n=N -1, back to n=0. The
product of these probabilities gives us a representation
of P(X(N)|Y(N)) starting at P(x(0)).
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \frac{P\big(x(n+1), X(n), y(n+1), Y(n)\big)}{P\big(y(n+1), Y(n)\big)}
= \frac{P\big(y(n+1)\,|\,x(n+1), X(n), Y(n)\big)}{P\big(y(n+1)\,|\,Y(n)\big)}\; P\big(x(n+1)\,|\,X(n), Y(n)\big)\; P\big(X(n)\,|\,Y(n)\big).$$
The Markov property of the model dynamics gives $P\big(x(n+1)\,|\,X(n), Y(n)\big) = P\big(x(n+1)\,|\,x(n)\big)$, so
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \exp\Big[\mathrm{CMI}\big(y(n+1);\, x(n+1), X(n)\,\big|\,Y(n)\big)\Big]\; P\big(x(n+1)\,|\,x(n)\big)\; P\big(X(n)\,|\,Y(n)\big).$$
The first factor is the change due to the observation; the second moves the model forward.
Here CMI is Shannon's (1948) conditional mutual information,
$$\mathrm{CMI}(a;\, b\,|\,c) = \log\!\left[\frac{P(a, b\,|\,c)}{P(a\,|\,c)\, P(b\,|\,c)}\right],$$
with $a = y(n+1)$, $b = \{x(n+1), X(n)\}$, and $c = Y(n)$.
$$X = \{x(t_0), x(t_1), \ldots, x(t_F)\}, \qquad Y = \{y(\tau_1), y(\tau_2), \ldots, y(\tau_F)\}$$
Iterating the recursion from $P(x(0))$ gives
$$P(X\,|\,Y) \propto \prod_{k=0}^{F} P\big(y(k)\,|\,X(k)\big)\; \prod_{n=0}^{F-1} P\big(x(n+1)\,|\,x(n)\big)\; P\big(x(0)\big) \;=\; e^{-A(X)},$$
normalized by $\int dX\, e^{-A(X)}$, with $dX = \prod_{n=0}^{F} d^{D}x(n)$.
Expected value of a function G(X) on the path X:
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}.$$
Our job is this: given data Y and a model dx(t)/dt = F(x(t)) [equivalently x(n+1) = f(x(n))], do the integral over the path of the state in [t0, tF].
Action for State and Parameter Estimation
$$A(X) = -\sum_{n=0}^{N} \mathrm{CMI}\big(X(n), y(n)\,\big|\,Y(n-1)\big) \;-\; \sum_{n=0}^{N-1} \log\big[P\big(x(n+1)\,|\,x(n)\big)\big] \;-\; \log\big[P\big(x(0)\big)\big]$$
(information transfer term, dynamics term, initial condition term).
Everything rests on the structure of A(X) in path space:
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}.$$
Two methods for evaluating such high dimensional
integrals:
(1) Laplace’s method (1774); seek minima of A(X)--
there are multiple minima
(2) Monte Carlo searches
First seeks minima of A0(X); second samples near
these minima.
$$\frac{\partial A(X)}{\partial X_a} = 0 \quad\text{and}\quad \frac{\partial^2 A(X)}{\partial X_a\,\partial X_b} > 0\ \text{(positive definite)}$$
$$\left.\frac{\partial A_0(X)}{\partial X_a}\right|_{X^q} = 0, \quad q = 0, 1, \ldots, \qquad A_0(X^0) < A_0(X^{q\neq 0}).$$
We focus on the Laplace method. How can we find the minima X^q, and which minimum gives the biggest contribution to the integral?
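Why the path with the smallest action dominates: the standard Laplace estimate (a sketch; the corrections mentioned later are not shown) is
$$\int dX\, e^{-A(X)}\, G(X) \;\approx\; \sum_q G(X^q)\; e^{-A(X^q)}\,\sqrt{\frac{(2\pi)^{\dim(X)}}{\det\big[\partial^2 A(X^q)\big]}},$$
so when one action level A(X^0) lies well below the others, $\langle G(X)\rangle \approx G(X^0)$ with relative corrections of order $e^{-[A(X^1)-A(X^0)]}$.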
Everything rests on the structure of A(X) in path space.
$$E\big[G(X)\,|\,Y\big] = \langle G(X)\rangle = \frac{\int dX\; e^{-A(X)}\, G(X)}{\int dX\; e^{-A(X)}}$$
A(X) is nonlinear in X and has multiple minima. Location and
number of these minima depend on number of measurements at
each observation time in [t0,tN].
Standard model, Gaussian Error Action
Observations have Gaussian noise and models have Gaussian errors; the action is
$$A_{SM}(X) = \sum_{n=0}^{N}\sum_{l=1}^{L} \frac{R_m(n,l)}{2}\big(x_l(n) - y_l(n)\big)^2 \;+\; \sum_{n=0}^{N-1}\sum_{a=1}^{D} \frac{R_f(a)}{2}\big(x_a(n+1) - f_a(x(n))\big)^2.$$
It is not Gaussian in X if f(x) is nonlinear. Finding paths and associated minima at any Rm and Rf is not hard (IPOPT and other public domain optimization algorithms), but finding the path with the smallest action is a challenge; it is NP-complete, in general.
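As a concrete reading of A_SM, here is a small sketch that evaluates the two sums for a given discrete-time path; the linear map used at the end is purely illustrative, not a model from the lecture.

import numpy as np

def standard_model_action(X, y, f, Rm, Rf, obs_idx):
    # X: (N+1, D) path of model states x(n); y: (N+1, L) data y_l(n)
    # f: map x(n) -> x(n+1); Rm, Rf: scalar precisions; obs_idx: measured components
    meas_err = 0.5 * Rm * np.sum((X[:, obs_idx] - y) ** 2)
    model_err = 0.5 * Rf * np.sum((X[1:] - np.array([f(x) for x in X[:-1]])) ** 2)
    return meas_err + model_err

# Illustrative use with a trivial linear "model" x(n+1) = 0.9 x(n), D = 3, L = 1:
f = lambda x: 0.9 * x
X = np.random.randn(50, 3)                            # a candidate path
y = X[:, [0]] + 0.1 * np.random.randn(50, 1)          # noisy observations of x_0
print(standard_model_action(X, y, f, Rm=1.0, Rf=10.0, obs_idx=[0]))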
The origin of the multiple minima is instability on the synchronization manifold y_l(n) = x_l(n). Measurements act to transfer information and stabilize directions in state space. Looking in continuous time shows this:
$$A_{SM}\big(x(t), dx(t)/dt\big) = \int_{t_0}^{t_f} dt\, L\big(x(t), dx(t)/dt, t\big)
= \int_{t_0}^{t_f} dt \left[\sum_{l=1}^{L}\frac{R_m(t)}{2}\big(x_l(t)-y_l(t)\big)^2 + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big(dx_a(t)/dt - F_a(x(t))\big)^2\right],$$
which gives the Euler-Lagrange equations (with $DF_{ba}(x) = \partial F_b(x)/\partial x_a$)
$$\frac{d}{dt}\left[\frac{dx_a(t)}{dt} - F_a\big(x(t)\big)\right] + \left[\frac{dx_b(t)}{dt} - F_b\big(x(t)\big)\right] DF_{ba}\big(x(t)\big) = \frac{R_m}{R_f}\,\delta_{al}\,\big(x_l(t) - y_l(t)\big),$$
a 'nudging' of x(t) toward y(t) in the measured directions; at a minimum $\delta A_0(X) = 0$. These equations have the boundary conditions $p_a(t_0) = p_a(t_f) = 0$, where $p_a(t) = R_f(a)\big(dx_a(t)/dt - F_a(x(t))\big)$ is the canonical momentum.
Writing the measurement error term in the continuous-time action $A_{SM}$ as a 'potential',
$$c\big(x(t)-y(t)\big) = \sum_{l=1}^{L}\frac{R_m(t)}{2}\big(x_l(t)-y_l(t)\big)^2,$$
the Euler-Lagrange equations take the form of Newton's equations for a charged particle in electric and magnetic fields:
$$\frac{d^2x_a(t)}{dt^2} - \Omega_{ab}\big(x(t)\big)\frac{dx_b(t)}{dt} = \frac{\partial}{\partial x_a(t)}\left[\frac{c\big(x(t)-y(t)\big)}{R_f} + \frac{F\big(x(t)\big)^2}{2}\right] + \frac{\partial F_a\big(x(t)\big)}{\partial t},$$
with the antisymmetric generator
$$\Omega_{ab}\big(x(t)\big) = \frac{\partial F_a\big(x(t)\big)}{\partial x_b(t)} - \frac{\partial F_b\big(x(t)\big)}{\partial x_a(t)}.$$
In three dimensions this reads
$$\frac{d^2x_a(t)}{dt^2} = \left[\frac{dx(t)}{dt}\times B\big(x(t)\big)\right]_a + E_a\big(x(t)\big),$$
$$B_a\big(x(t)\big) = \epsilon_{abc}\,\frac{\partial}{\partial x_b(t)}A_c\big(x(t)\big) = \frac{1}{2}\,\epsilon_{abc}\left[\frac{\partial A_c\big(x(t)\big)}{\partial x_b(t)} - \frac{\partial A_b\big(x(t)\big)}{\partial x_c(t)}\right], \qquad E_a\big(x(t)\big) = -\frac{\partial \varphi\big(x(t)\big)}{\partial x_a(t)} - \frac{\partial A_a\big(x(t)\big)}{\partial t}.$$
Now we move on to evaluating the expected value
integrals using Laplace’s method
We do not discuss corrections to the method here, though one can compute them; the algebra is complicated.
A(X) is nonlinear in X and has multiple minima. The location and number of these minima depend on the number of measurements at each observation time in [t0,tN].
Standard model, Gaussian Error Action: observations have Gaussian noise and models have Gaussian errors, so
$$A_{SM}(X) = \sum_{n=0}^{N}\sum_{l=1}^{L} \frac{R_m(n,l)}{2}\big(x_l(n) - y_l(n)\big)^2 \;+\; \sum_{n=0}^{N-1}\sum_{a=1}^{D} \frac{R_f(a)}{2}\big(x_a(n+1) - f_a(x(n))\big)^2.$$
Now we are ready to minimize the action, i.e. maximize the probability distribution.
Standard Model
We want to minimize the action A(X) above over all x(n) and over the parameters in f(x(n)). If f(x(n)) is nonlinear, the action A(x(n), x(n+1)) has many minima in general.
The search for the smallest minimum of a nonlinear objective function, such as A(x(n), x(n+1)), is, in general, NP-complete. An NP-complete problem cannot be solved in polynomial time in any known way.
For us, that is not good news.
To determine the path giving the lowest minimum of the action: find the minimum for a very small model error value Rf, then slowly increase Rf to larger values. We call this variational annealing (it is distinct from standard simulated annealing in statistical Physics). If Rf → ∞, the model error is 0.
We look at the opposite limit Rf → 0, where the model plays no role and the dynamical phase space structure is absent.
At Rf = 0 the minimum is degenerate at xl(t) = yl(t); the other, unmeasured states are undetermined. With Rf = Rf0 very small, choose N0 initial starting paths with xl(t) = yl(t) and the others drawn from a uniform distribution; this is a set of paths X0 for numerical minimization. We call the outcomes X1.
Use the N0 paths X1 as initial starting paths with Rf = αRf0, α > 1, to arrive at N0 paths X2. Increase Rf to α²Rf0, ..., and continue using the outcome paths as initial choices for the next optimizations, slowly increasing Rf by powers of α.
Plot A0(X^q) versus β = log_α[Rf/Rf0]: action level plots. A minimal sketch of this loop follows.
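A schematic of that loop, assuming an action(X, Rf) function such as the standard_model_action sketch above; the L-BFGS-B call stands in for the IPOPT/SNOPT optimizations actually used.

import numpy as np
from scipy.optimize import minimize

def variational_annealing(action, X0_list, Rf0=1e-8, alpha=1.5, n_beta=20):
    # Slowly raise Rf by powers of alpha, restarting each optimization from the
    # previous outcome paths; return the action levels A0(X^q) at each beta.
    paths = [np.array(X0, dtype=float) for X0 in X0_list]
    levels = []
    for beta in range(n_beta):
        Rf = Rf0 * alpha ** beta
        new_paths, new_levels = [], []
        for X in paths:
            res = minimize(lambda z: action(z.reshape(X.shape), Rf),
                           X.ravel(), method="L-BFGS-B")
            new_paths.append(res.x.reshape(X.shape))
            new_levels.append(res.fun)
        paths = new_paths
        levels.append(new_levels)
    return np.array(levels), paths

Plotting the returned levels against beta gives the action level plots discussed below.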
Simple Model Neuron NaKL: D + NP = 4 + 19, L = 1 (voltage). Twin Experiment on the NaKL Neuron.
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big) + I_{applied}(t)$$
$$\frac{dx(t)}{dt} = \frac{x_\infty\big(V(t)\big) - x(t)}{\tau_x\big(V(t)\big)}, \qquad x(t) = \{m(t), h(t), n(t)\}$$
$$x_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_x}{dV_x}\right)\right], \qquad \tau_x(V) = \tau_{x0} + \tau_{x1}\left[1 - \tanh^2\!\left(\frac{V - V_x}{dV_x}\right)\right]$$
Generate data from the NaKL equations: y(t) = x(t) + σN(0,1) noise. D = 4, L = 1.
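A sketch of generating such twin-experiment data for the NaKL neuron. The gating parameterization, the parameter values, and the injected-current protocol are illustrative choices in the spirit of the parameter table below, not the lecture's exact model.

import numpy as np
from scipy.integrate import solve_ivp

P = dict(gNa=120.0, ENa=50.0, gK=20.0, EK=-77.0, gL=0.3, EL=-54.0, C=0.8)
KIN = {  # V_half (mV), width (mV), tau0 (ms), tau1 (ms) for each gate
    "m": (-40.0, 15.0, 0.1, 0.4),
    "h": (-60.0, -15.0, 1.0, 7.0),
    "n": (-55.0, 30.0, 1.0, 5.0),
}

def x_inf(V, Vh, dV):
    return 0.5 * (1.0 + np.tanh((V - Vh) / dV))

def tau_x(V, Vh, dV, t0, t1):
    return t0 + t1 * (1.0 - np.tanh((V - Vh) / dV) ** 2)

def I_app(t):
    return 15.0 * (np.sin(0.03 * t) > 0.2)           # a simple current protocol

def nakl(t, s):
    V, m, h, n = s
    dV = (P["gNa"] * m**3 * h * (P["ENa"] - V) + P["gK"] * n**4 * (P["EK"] - V)
          + P["gL"] * (P["EL"] - V) + I_app(t)) / P["C"]
    gates = []
    for x, name in zip((m, h, n), ("m", "h", "n")):
        Vh, w, t0, t1 = KIN[name]
        gates.append((x_inf(V, Vh, w) - x) / tau_x(V, Vh, w, t0, t1))
    return [dV] + gates

t = np.arange(0.0, 500.0, 0.02)                      # ms, dt = 0.02 ms
sol = solve_ivp(nakl, (t[0], t[-1]), [-65.0, 0.1, 0.6, 0.3], t_eval=t, max_step=0.05)
V_data = sol.y[0] + 1.0 * np.random.randn(t.size)    # the one observed, noisy series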
Annealing in Model Error Accuracy
Paths giving minima of the action depend on the number of measurements L.
For the Standard Model, when the action levels become independent of Rf, the action level is dictated by the statistics of the measurement error term; this is a consistency check on the action level evaluations.
[Figure: NaKL model action level plot, measuring voltage only; horizontal axis log_α[Rf/Rf0] with α = 3/2.]
NaKL Neuron Twin Experiment

Parameter   Known     Estimated   LB       UB
gNa         120.0     108.4       50.0     200.0
ENa         50.0      49.98       0.0      100.0
gK          20.0      21.11       5.0      40.0
EK          -77.0     -77.09      -100.0   -50.0
gL          0.3       0.3028      0.1      1.0
EL          -54.0     -54.05      -60.0    -50.0
C           0.8       0.81        0.5      1.5
Vm          -40.0     -40.24      -60.0    -30.0
dVm         0.0667    0.0669      0.01     0.1
τm0         0.1       0.0949      0.05     0.25
τm1         0.4       0.4120      0.1      1.0
Vh          -60.0     -59.43      -70.0    -40.0
dVh         -0.0667   -0.0702     -0.1     -0.01
τh0         1.0       1.0321      0.1      5.0
τh1         7.0       7.76        1.0      15.0
Vn          -55.0     -54.52      -70.0    -40.0
dVn         0.0333    0.0328      0.01     0.1
τn0         1.0       1.06        0.1      5.0
τn1         5.0       4.97        2.0      12.0
Lorenz96 Model, D = 11
$$\frac{dx_a(t)}{dt} = x_{a-1}(t)\big(x_{a+1}(t) - x_{a-2}(t)\big) - x_a(t) + f, \qquad a = 1, 2, \ldots, D;$$
$$x_{-1}(t) = x_{D-1}(t), \quad x_0(t) = x_D(t), \quad x_{D+1}(t) = x_1(t).$$
f is a fixed parameter, f = 10. Solutions are chaotic.
'Twin Experiments': used to test methods of data assimilation and to design experiments.
Generate data with a known model; add noise to the model output; present l = 1, 2, ..., L < D noisy time series to the assimilation procedure. A small sketch of this for Lorenz96 follows.
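With D = 11, f = 10, and L < D noisy observed time series (the noise level and integration settings are illustrative):

import numpy as np
from scipy.integrate import solve_ivp

D, f = 11, 10.0

def lorenz96(t, x):
    # cyclic indexing handled by np.roll: x[a-1], x[a+1], x[a-2]
    return np.roll(x, 1) * (np.roll(x, -1) - np.roll(x, 2)) - x + f

t = np.arange(0.0, 20.0, 0.025)
x0 = f * np.ones(D)
x0[0] += 0.01                                        # perturb off the fixed point
sol = solve_ivp(lorenz96, (t[0], t[-1]), x0, t_eval=t, rtol=1e-9)

L = 4                                                # number of measured series, L < D
obs_idx = np.linspace(0, D - 1, L, dtype=int)
y = sol.y[obs_idx] + 0.2 * np.random.randn(L, t.size)    # noisy observations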
Neurobiological Example: inject a current Iapplied(t) into the neuron and measure the response voltage V(t).
Neuron Model (D = 4, L = 1, p = 20 parameters)
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big) + I_{applied}(t)$$
$$\frac{da(t)}{dt} = \frac{a_\infty\big(V(t)\big) - a(t)}{\tau_a\big(V(t)\big)}, \qquad a(t) = \{m(t), h(t), n(t)\}$$
$$a_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_a}{dV_a}\right)\right], \qquad \tau_a(V) = \tau_{a0} + \tau_{a1}\left[1 - \tanh^2\!\left(\frac{V - V_a}{dV_a}\right)\right]$$
Measure V(t) with a selected Iapplied(t); evaluate all parameters and all unobserved state variables a(t).
This is the challenge: using laboratory experiments on individual neurons and on collections of neurons, build biophysically based models of functional neural networks which match the experiments and predict the response to new stimuli.
Our strategy is this:
o create a model of the functional network of interest (e.g. the song production network of songbirds) and, of course, of the individual neurons in the network. What is a sensible model? We use Hodgkin-Huxley models.
o using the model itself, design experiments that stimulate all degrees of freedom of the neuron/network and measure enough quantities at each observation time; these are numerical simulations.
o use the model along with data, voltage across membranes and perhaps other measurements, to determine the unknown parameters in the model and the unobserved state variables in the model.
o "validate the neuron model" via prediction. These validated neurons can then be used in network construction.
One mainstream view of network modeling and operation is that details do not matter, and that some form of network "organization" or structure determines network operations.
Our use of data assimilation to design experiments and to test and validate models of cells and systems points to the advantages of other directions.
Why would we want the kind of detail about neural or cellular processes that accurate modeling and careful data assimilation provide?
➢ use models of nerve cells (neurons) to compare healthy and diseased cells and provide biophysical targets for therapies
➢ use detailed models of regulatory networks for genetic action to design interventions
➢ use detailed, verified models of functional network connectivity and nodal performance to engineer functions into high-performance electronics, e.g. sequence generation and recognition with human accuracy but machine performance
[Diagram: avian song system. Green, motor pathway: HVC → RA → respiration/syrinx → song production, with auditory feedback. Red, anterior forebrain pathway (AFP): HVC → Area X → DLM → LMAN → Area X and HVC; control and song maintenance. The song box (syrinx) produces the song.]
Neurobiological Laboratory Experiments
Margoliash Laboratory, UChicago
Isolated Neurons from the Avian Song System
On each neuron many different Iapplied(t) measurements in time
“epochs” of 2-6 seconds
Membrane Voltage Observed
Sampling time 0.02 ms (50 kHz), 500-1500 ms of observations
Use all this to estimate the unknown parameters in the neuron
and unmeasured state variables, then predict the response of
the neuron to new stimuli (forcing). Model Validation
[Figure: injected current and recorded voltage; time axis in units of 0.02 ms, 2000 ms shown.]
Why this Iappl(t) ?
Back to the Song System Nucleus HVC: Interneurons, L = 1 (voltage)
$$C\frac{dV(t)}{dt} = g_{Na}\,m(t)^3 h(t)\big(E_{Na}-V(t)\big) + g_K\,n(t)^4\big(E_K - V(t)\big) + g_L\big(E_L - V(t)\big)$$
$$\qquad\qquad +\; g_{Ca}\,a(t)\,b(t)\,V(t)\,\frac{[\mathrm{Ca}^{2+}]_{\mathrm{ext}} - [\mathrm{Ca}^{2+}](t)\,e^{-V(t)/V_T}}{1 - e^{-V(t)/V_T}} \;+\; \text{other currents} \;+\; I_{applied}(t)$$
$$\frac{dx(t)}{dt} = \frac{x_\infty\big(V(t)\big) - x(t)}{\tau_x\big(V(t)\big)}, \qquad x(t) = \{m(t), h(t), n(t), a(t), b(t)\}$$
$$x_\infty(V) = \frac{1}{2}\left[1 + \tanh\!\left(\frac{V - V_x}{dV_x}\right)\right], \qquad \tau_x(V) = \tau_{x0} + \tau_{x1}\left[1 - \tanh^2\!\left(\frac{V - V_x}{dV_x}\right)\right]$$
VLSI Neuromorphic Chip
➢ Test parameters on the chip to check quality of fabrication
versus design
➢ Use a twin experiment to test method of data assimilation:
generate data from the VLSI chip. Use “voltages” on chip
neurons as measured quantities to estimate parameters
known from first step.
➢ Use voltage data from biological neuron to readjust chip
parameters and state variables to those for the data, then
predict voltage response to new current stimulation.
Unfinished Business
Measurements for networks of neurons: extracellular potentials? Other technologies?
Computational capability for the future
Port network models to VLSI
Use principles of network functions to solve similar problems in other space and time domains.
Some application areas for Data Assimilation:
Genetic regulatory networks
signal transduction pathways
systems biology; synthetic biology; Immunology
biophysical modeling of neurons and functional networks
neutrino astrophysics
coastal flows and transport of toxic constituents after storms
electrical and chemical engineering
identifying oil and gas reservoirs
hydrological models of streams and lakes
neuromorphic engineering---neurons and functional networks on a chip
numerical weather prediction
Machine Learning
Feedforward Multi-layer Perceptron
Data Assimilation in a time window [t0,tF]: transfer information from a data library y(τ) to a model x(t).
[Diagram: at each observation y(τ1), y(τ2), ..., y(τk), ..., y(τF) the factor P(y|x) enters; between observations the model is moved forward with P(x_{n+1}|x_n).]

Multi-Layer Perceptron over layers [l0,lF]: transfer information from a data library {y(l0), y(lF)} to a model x(l+1) = f(W x(l)).
[Diagram: input y(l0) presented at layer l0, output y(lF) at layer lF; units j = 1, 2, ..., N in each layer; the model moves forward layer to layer with P(x(l+1)|x(l)), and the data enter through P(y|x(l0)) and P(y|x(lF)).]
Total Probability = $P\big(y\,|\,x(l_0)\big)\,\prod_l P\big(x(l+1)\,|\,x(l)\big)\; P\big(y\,|\,x(l_F)\big)$
$$= \exp\Big[-\Big(\sum -\log P(y\,|\,x) \;+\; \sum -\log P\big(x(l+1)\,|\,x(l)\big)\Big)\Big] = \exp\big[-A_{ML}(x(l))\big] = e^{-A_{ML}(x(l))}$$
$$\text{Expected Value of } G(X) = \frac{\int dX\; G(X)\, e^{-A_{ML}(x(l))}}{\int dX\; e^{-A_{ML}(x(l))}}$$
Machine Learning requires evaluating a Statistical Physics Integral.
Machine Learning Action
In standard Machine Learning we have a network with an input layer l0 and an output layer lF, and between them intermediate "hidden" layers. Information in noisy pairs {y^k(l0), y^k(lF)}, k = 1, 2, ..., M, is presented to the network.
We want to minimize the cost function
$$\frac{1}{2ML}\sum_{k=1,\,r=1}^{M,\,L} \frac{R_m(r,l)}{2}\big(x_r^{(k)}(l) - y_r^{(k)}(l)\big)^2$$
subject to the network rules for layer l, with active units ("neurons") x_j(l), j = 1, 2, ..., N, satisfying
$$x_j(l+1) = f_j\big[W_{ji}(l)\,x_i(l)\big].$$
Relax the equality constraint:
$$A_{ML}\big(x(l)\big) = \frac{1}{2ML}\sum_{k=1,\,r=1}^{M,\,L} \frac{R_m(r,l)}{2}\big(x_r^{(k)}(l) - y_r^{(k)}(l)\big)^2 \;+\; \sum \frac{R_f(l)}{2}\big[x^{(k)}(l+1) - f\big(W(l)\,x^{(k)}(l)\big)\big]^2,$$
with $R_m(l) \neq 0$ only when $l \in \{l_0, l_F\}$.
Total Probability = exp[-A_ML(x(l))]. To approximate the expected value integral: maximize the overall probability, i.e. minimize A_ML(x(l)) over x(l) and W(l). The model is exact when Rf → ∞.
ML example
Data is generated by selecting a network with 100 layers
and 10 `neurons’ at each layer. Weights are selected from a
uniform distribution U[-0.1,0.1]. Inputs xk(l0); k = 1,2,… are
passed through the network producing outputs xk(lF).
Gaussian noise N(0,σ2=0.0025) is added to the inputs
and the outputs. These make our library of data {yk(l0), yk(lF)}.
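A sketch of building that data library. The activation function is assumed here to be tanh (the text does not specify f), and the input distribution is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
n_layers, n_units, M, sigma = 100, 10, 200, 0.05     # sigma^2 = 0.0025

W = rng.uniform(-0.1, 0.1, size=(n_layers - 1, n_units, n_units))

def forward(x0):
    x = x0
    for Wl in W:
        x = np.tanh(Wl @ x)                          # assumed activation f = tanh
    return x

X0 = rng.normal(size=(M, n_units))                   # inputs x^k(l0)
XF = np.array([forward(x) for x in X0])              # outputs x^k(lF)
Y0 = X0 + sigma * rng.normal(size=X0.shape)          # noisy library {y^k(l0)}
YF = XF + sigma * rng.normal(size=XF.shape)          # noisy library {y^k(lF)}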
We then build a network with lF layers and N active
units per layer. Train this network by minimizing:
$$A_{ML}\big(x_r^k(l), W_{ji}(l)\big) = \frac{1}{M}\sum_{k=1}^{M}\sum_{l=l_0}^{l_F}\left\{\sum_{r=1}^{L}\frac{R_m(l)}{2L}\big(x_r^k(l) - y_r^k(l)\big)^2 + \sum_{j=1}^{N}\frac{R_f}{2}\big(x_j^k(l+1) - f\big(W_{ji}(l)\,x_i^k(l)\big)\big)^2\right\}$$
The data assimilation time index maps to the machine learning layer index: t ⇒ l.
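A numpy reading of this action for a single data pair k, again assuming tanh units; in training it is minimized over both the activations x and the weights W while Rf is annealed upward, exactly as in the data assimilation case.

import numpy as np

def A_ML(x, W, y0, yF, Rm=1.0, Rf=10.0):
    # x: (n_layers, n_units) activations; W: (n_layers-1, n_units, n_units) weights
    # y0, yF: the noisy data presented at the first and last layers only
    meas = 0.5 * Rm * (np.sum((x[0] - y0) ** 2) + np.sum((x[-1] - yF) ** 2))
    model = 0.5 * Rf * np.sum(
        (x[1:] - np.tanh(np.einsum('lij,lj->li', W, x[:-1]))) ** 2)
    return meas + model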
Adversarial Perturbations
The network is trained using variational annealing with M data pairs. Prediction (generalization) is performed using MP new pairs {y^k(l0), y^k(lF)}: the input y^k(l0) is presented to the trained network at l0, and the output x^k(lF) is compared to the output member of the data pair at lF.
Quality of predictions:
$$E^2(l_F, M_P) = \frac{1}{N M_P}\sum_{k=1}^{M_P}\sum_{j=1}^{N}\big[x_j^k(l_F) - y_j^k(l_F)\big]^2$$
Deepest Learning: the layer becomes a continuous variable.
$$A\big(x(l), x'(l)\big) = \int_{l_0}^{l_F} dl\; L\big(x(l), x'(l), l\big)$$
$$L\big(x(l), x'(l), l\big) = \sum_{r=1}^{L}\frac{R_m(r,l)}{2}\big(x_r(l) - y_r(l)\big)^2 + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big[x_a'(l) - F_a\big(x(l), l\big)\big]^2$$
$$\qquad\qquad = c\big(x(l) - y(l)\big) + \sum_{a=1}^{D}\frac{R_f(a)}{2}\big[x_a'(l) - F_a\big(x(l), l\big)\big]^2$$
Our variational principle is in Lagrangian coordinates {x(l), x'(l)}, with boundary conditions p(l0) = p(lF) = 0. As in the data assimilation case, the Euler-Lagrange equations have the form of motion in electric and magnetic fields,
$$\frac{d^2x(t)}{dt^2} = \text{``}v(t)\times B\big(x(t),t\big)\text{''} + \big[\nabla\Phi\big(x(t),t\big) + \partial_t A\big(x(t),t\big)\big],$$
with Ω the generator of a rotation in D dimensions.
In Hamiltonian coordinates {x(l), p(l)}, with $p(l) = \partial L\big(x(l), x'(l), l\big)/\partial x'(l)$,
$$H\big(x(l), p(l), l\big) = \frac{p^2}{2R_f} + p\cdot F\big(x(l)\big) - \text{Measurement Error Term},$$
and Hamilton's equations read
$$\frac{dx(l)}{dl} = F\big(x(l), l\big) + \frac{p(l)}{R_f}, \qquad \frac{dp(l)}{dl} = -\frac{\partial F\big(x(l), l\big)}{\partial x(l)}\,p(l) + R_m(r,l)\big(x_r(l) - y_r(l)\big),$$
with boundary conditions p(l0) = p(lF) = 0. This is Back Propagation, starting at lF.
In discrete layers,
$$A_{ML}\big(x(l), x'(l)\big) = \sum_{l=l_0}^{l_F-1} L\big(x(l), x(l+1)\big),$$
and the Lagrangian variation is
$$\delta A_{ML} = \delta x(l_0)\,\frac{\partial L\big(x(l_0), x(l_1)\big)}{\partial x(l_0)} + \delta x(l_F)\,\frac{\partial L\big(x(l_{F-1}), x(l_F)\big)}{\partial x(l_F)} + \text{discrete Euler-Lagrange equation at each layer},$$
which supplies the boundary conditions. The Lagrangian variation satisfies the symplectic symmetry of the problem and gives accurate estimation of the minima of the action A(X). More stable than Back Propagation?
Further possibilities:
Recurrent Networks
Performance on large libraries of labeled Images
Use of ML method in identification of functional network
connectivity of biophysical neurons
Information input at intermediate layers ?
More complex networks; learn by introducing many Rf
terms and use them as required?
Recursion notation for the path and the data:
$$X(n+1) = \{x(t_0), x(t_1), \ldots, x(t_n), x(t_{n+1})\} = \{X(n), x(n+1)\}$$
$$Y(n+1) = \{y(t_0), y(t_1), \ldots, y(t_n), y(t_{n+1})\} = \{Y(n), y(n+1)\}$$
So, what did we learn ?
1. Make a model—no algorithms for this, use your best
knowledge of the (bio)physics.
2. Make a big model—experiments will prune the model
3. Do twin experiments to determine how many measurements
you need to get the “global” minimum of A0(X)—annealing.
4. Use twin experiments to design laboratory experiments
5. Do experiments to determine consistency of model with data.
6. Use the completed model and estimated x(T), via probability
distribution or dx/dt = F(x(t)), to predict, for t > T. This
validates (or not) the model.
So, what did we learn ?
7. Use Laplace method + computable corrections to determine
consistency of numerical methods.
8. If there are not enough measurements at each observation
time, (a) get more; (b) use waveform information via time delays.
9. Using Data Assimilation, one can (a) test new fabrications of
VLSI neurons, (b) test DA methods on verified VLSI chip, (c)
complete neuron model on chip from biological data; predict
response to new forcing.
To see how the optimization process is working, we look at the optimization output at various steps during the iteration:
$$\big(y(n)_0,\, u(n)_0,\, p_0\big) \rightarrow \big(y(n)_1,\, u(n)_1,\, p_1\big) \rightarrow \cdots \rightarrow \big(y(n)_{Final},\, u(n)_{Final},\, p_{Final}\big)$$
until the objective (cost) function is minimized, subject to the model equations. We require $u(n)_{Final} \approx 0$.
We will discuss the equivalence between machine learning (ML) and data assimilation (DA).
Data Assimilation is the transfer of information from (often sparse) observations to dynamical models of complex systems: numerical weather prediction, neurobiology, ...
We will review the formulation of each, DA and ML; the equivalence will be clear.
Then we will give a variational annealing method for locating the smallest minimum of the action/cost function in variational calculations in each field.
Using this in an example from each field gives design insight into "deep learning".
We then formulate "deepest learning", in which the ML layers become continuous. This puts back propagation in a familiar perspective: the Euler-Lagrange equations of the variational principle. It also suggests more stable variational approaches.
Total Probability = $P(y\,|\,x)\,P(x_{n+1}\,|\,x_n)\,P(y\,|\,x)\,P(x_{n+1}\,|\,x_n)\cdots P(y\,|\,x)$
$$= \exp\Big[-\Big(\sum -\log P(x\,|\,y) \;+\; \sum -\log P(x_{n+1}\,|\,x_n)\Big)\Big] = e^{-\mathrm{Action}(X)} = e^{-A(X)}$$
$$\text{Expected Value of } G(X) = \frac{\int dX\; G(X)\, e^{-A(X)}}{\int dX\; e^{-A(X)}}$$
Data Assimilation requires evaluating a Statistical Physics Integral.
We want P(X(n+1)|Y(n+1)) in terms of P(X(n)|Y(n)). This recursion relation will give us a representation of P(X(N)|Y(N)).
$$P\big(X(n+1)\,|\,Y(n+1)\big) = \frac{P\big(x(n+1), y(n+1), X(n), Y(n)\big)}{P\big(y(n+1), Y(n)\big)} \qquad \text{(definition of conditional probability)}$$
$$= \exp\Big[\mathrm{CMI}\big(y(n+1);\, x(n+1), X(n)\,\big|\,Y(n)\big)\Big]\; P\big(x(n+1)\,|\,X(n), Y(n)\big)\; P\big(X(n)\,|\,Y(n)\big)$$
Markov property: x(n+1) depends ONLY on x(n) (true of the differential equations in biophysics), so
$$P\big(x(n+1)\,|\,X(n), Y(n)\big) = P\big(x(n+1)\,|\,x(n)\big).$$
Collecting terms, $P\big(X(N)\,|\,Y(N)\big) = \exp[-A(X)]$ up to terms independent of X.
[Figure: Lorenz96 model action level plots, D = 11, with L = 2, 4, 5, 6 measured time series.]