HAL Id: hal-02475962 (https://hal.archives-ouvertes.fr/hal-02475962v2), submitted on 10 Jun 2020.

To cite this version: Pierre Dubois, Thomas Gomez, Laurent Planckaert, Laurent Perret. Data-driven predictions of the Lorenz system. Physica D: Nonlinear Phenomena, Elsevier, 2020, 408, pp.132495. 10.1016/j.physd.2020.132495. hal-02475962v2

Data-driven predictions of the Lorenz system

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview


Data-driven predictions of the Lorenz system

Pierre Dubois (a,*), Thomas Gomez (a), Laurent Planckaert (a), Laurent Perret (b)

(a) Univ. Lille, CNRS, ONERA, Arts et Metiers Institute of Technology, Centrale Lille, UMR 9014 - LMFL - Laboratoire de Mecanique des Fluides de Lille - Kampe de Feriet, F-59000 Lille, France

(b) Centrale Nantes, LHEEA UMR CNRS 6598, Nantes, France

Abstract

This paper investigates the use of a data-driven method to model the dynamics of the chaotic Lorenz system. An architecture based on a recurrent neural network with long and short-term dependencies predicts multiple time steps ahead the position and velocity of a particle, using a sequence of past states as input. To account for modeling errors and make a continuous forecast, a dense artificial neural network assimilates online data to detect and update wrong predictions such as non-relevant switchings between lobes. The data-driven strategy leads to good prediction scores and does not require statistics of errors to be known, thus providing significant benefits compared to a simple Kalman filter update.

Keywords: data-driven modeling, data assimilation, chaotic system, neural networks

1. Introduction

Chaotic dynamical systems exhibit characteristics (nonlinearities, boundedness, initial condition sensitivity) [1] encountered in real-world problems such as meteorology [2] and oceanography [3]. The multiple time steps ahead prediction of such a system is challenging because governing equations may be unknown or too costly to evaluate. For instance, the Navier-Stokes equations require prohibitive computational resources to predict with great accuracy the velocity field of a turbulent flow [4].

Data-driven modeling of dynamical systems is an active research field whose objective is to infer dynamics from data [5]. Regressive methods in machine learning [6] are particularly suitable for such tasks and have proven to reliably reconstruct the state of a given system [7]. If parameters are not overfitted to training examples, the data-driven model can also be used for predictive tasks, provided the input lies in the input domain used for training. Main techniques in the literature include autoregressive techniques [8], dynamic mode decomposition (DMD) [9], Hankel alternative view of Koopman (HAVOK) [10] or unsupervised methods such as CROM [11]. Neural networks are also of increasing interest since they can perform nonlinear regressions that are fast to evaluate. Architectures with recurrent units are recommended for time-series predictions because memory is incorporated in the prediction process. Neural networks can then learn chaotic dynamics [12] and predict with great accuracy the future state [13].

∗Corresponding author: [email protected]

However, errors in modeling can lead to bad multiple time steps ahead predictions of chaotic dynamical systems: a tiny change in the initial condition results in a big change in the output [12]. To overcome the propagation of uncertainties from the dynamical model (a bad regression choice in a data-driven approach or bad turbulence modeling in CFD, for instance), data assimilation (DA) techniques have been developed [14].


They combine the predicted state of a system with online measurements to get an updated state. Such methods have successfully been applied in fluid mechanics to obtain a better description of initial or boundary conditions by finding the best compromise between experimental measurements and CFD predictions [15]. Nevertheless, the dynamical model can be slow to evaluate (limiting the use to offline assimilations) and errors (initial condition, dynamical model, measurements and uncertainties) can be hard to estimate in real-world applications.

In this paper, a data-driven approach is used to discover a dynamical model for the Lorenz system. To handle the chaotic nature of the system, a recurrent neural network (RNN) dealing with long and short-term dependencies (LSTM) is considered [16]. To correct modeling errors, a dense neural network (denoted hereafter DAN), whose design is based on Kalman filtering techniques, is developed. Results are promising for predicting multiple steps ahead the position and velocity of a particle on the Lorenz attractor, using only the initial sequence and real-time measurements of the complete acceleration, the complete velocity or a single component of the velocity.

The paper is organized as follows. In Section 2, the overall strategy is presented, together with a brief overview of how neural networks work. In Section 3, results on the low-dimensional Lorenz system are shown, with a particular interest in the impact of the forecast horizon and of noise. A discussion is given in Section 4 before concluding remarks.

2. Strategy

2.1. Proposed methodology

This paper investigates the use of neural networks to continuously predict a chaotic system using a data-driven dynamical model and online measurements. The method is summarized in Figure 1 and contains the following steps (a minimal sketch of the resulting forecasting cycle is given after the list):

. Consider m temporal states of the system. The sequence is denoted [s]_{t-m+1}^{t}, where s is the state of the system, of dimension n_f.

. Predict n future states using a RNN with long and short-term memory (LSTM). This gives a predicted sequence [s^b]_{t+1}^{t+n}, where superscript b indicates a prediction.

. Predict the measured sequence. This gives [y^b]_{t+1}^{t+n}, where y^b is the predicted measure of the state. The mapping between the state space and the measurement space is performed by a dense neural network called the shallow encoder (SE).

. Assimilate the exact sequence of measurements [y]_{t+1}^{t+n} and update the predicted sequence of states. This is performed by a dense neural network which gives an updated sequence [s^a]_{t+1}^{t+n}, where superscript a stands for "analyzed". The network is called the data assimilation network (DAN).

. Construct [s^a]_{t+n-m+1}^{t+n} by adding m-n updated states from the previous iteration. This gives a new input that can be used to cycle and continue the forecasting process.
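The cycle can be summarized by the following minimal Python sketch, where rnn_lstm, shallow_encoder and dan stand for the three trained networks (hypothetical names and calling conventions, not the authors' code):

```python
import numpy as np

def continuous_forecast(rnn_lstm, shallow_encoder, dan, s_init, get_measurements,
                        n_cycles, m, n):
    """Predict n states from the last m states, then correct the prediction with
    the real sequence of online measurements (assumes m > n)."""
    window = np.array(s_init)              # initial sequence [s]_{t-m+1}^{t}, shape (m, n_f)
    trajectory = []
    for k in range(n_cycles):
        s_b = rnn_lstm(window)             # predicted states [s^b]_{t+1}^{t+n}, shape (n, n_f)
        y_b = shallow_encoder(s_b)         # predicted measurements [y^b]_{t+1}^{t+n}
        y = get_measurements(k)            # real online measurements [y]_{t+1}^{t+n}
        s_a = dan(s_b, y_b, y)             # analyzed (updated) states [s^a]_{t+1}^{t+n}
        trajectory.append(s_a)
        # new input: the m-n last updated states of the previous window plus the n new ones
        window = np.concatenate([window[n:], s_a], axis=0)
    return np.concatenate(trajectory, axis=0)
```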

In this section, we give a quick overview of neural networks and explain the architectures behind the dynamical model (RNN-LSTM), the measurement operator (SE) and the data assimilation process (DAN).

2.2. Quick overview of neural networks

A neuron is a unit passing a sum of weighted inputs through an activation function that introduces nonlinearities. These functions are classically a sigmoid σ(x) = 1/(1 + e^{-x}), a hyperbolic tangent tanh(x) or a rectified linear unit relu(x) = max(0, x). When neurons are organized in fully connected layers, the resulting network is called a dense neural network. The universal approximation theorem [17] states that any function can be approximated by a sufficiently large network, i.e. one hidden layer with a large number of neurons.


(Figure 1 diagram: [s]_{t-m+1}^{t} → RNN-LSTM → [s^b]_{t+1}^{t+n} → SE → [y^b]_{t+1}^{t+n}; together with the real measurements [y]_{t+1}^{t+n}, the DAN produces [s^a]_{t+1}^{t+n}, from which the new input [s^a]_{t+n-m+1}^{t+n} is built.)

Figure 1: Summary of the data-driven method to make predictions of a chaotic system. A data-driven dynamical model (RNN-LSTM) predicts n future states of the system and the predicted sequence is updated according to a real sequence of measurements.

Just like a linear regression y = ax + b aims at learning the best a and b parameters, a neural network regression y = NN(x) aims at learning the best weights and biases in the network by optimizing a loss function evaluated on a set of training data.

Although they are universal approximators, dense neural networks face some limitations: they may suffer from vanishing or exploding gradients (arising from derivatives of activation functions, see [18]), are prone to overfitting (a fit that corresponds too closely to training data) and inputs are not individually processed. Other architectures of artificial neural networks have therefore been developed, including convolutional networks (CNN, for image recognition) and recurrent neural networks (RNN, where inputs are taken sequentially). Recurrent networks use their internal state (denoted h) to process each input from the sequence of inputs. This internal state is computed using an activation function but, to avoid the limitations of dense networks, its form is more elaborate. For example, Long Short-Term Memory (LSTM) cells [19] are combinations of classical activation functions (sigmoids and tanh) that incorporate a long and short-term memory mechanism through the cell state (see Figure 2).

Several techniques exist to learn the parameters of neural networks. The most common is gradient descent, which iteratively updates parameters according to the gradient of the cost function with respect to weights and biases. The computation of gradients is made by backpropagating errors in the network, using backpropagation for dense neural networks or backpropagation through time for RNN [6]. The equations can be found in [20] for the curious reader. In this paper, all neural networks are implemented using the Keras library [21].
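As an illustration, a dense Keras regression could be set up along the following lines (a minimal sketch assuming the TensorFlow implementation of Keras and a toy target; this is not the authors' code):

```python
import numpy as np
from tensorflow import keras

# Toy regression problem y = NN(x): learn to approximate sin(3x) on [-1, 1].
x = np.random.uniform(-1.0, 1.0, size=(1000, 1))
y = np.sin(3.0 * x)

model = keras.Sequential([
    keras.layers.Dense(32, activation="tanh", input_shape=(1,)),  # hidden layer
    keras.layers.Dense(1, activation="linear"),                   # regression output
])
model.compile(optimizer="adam", loss="mse")   # gradient-based optimization of the MSE loss
model.fit(x, y, epochs=50, batch_size=32, validation_split=0.2, verbose=0)
```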

In this paper, hyperparameters are not tuned. No grid search or genetic optimization is attempted, and the number of neurons, the number of hidden layers and the activation functions are found by successive trials. The architectures defined here should therefore not be taken as a rule of thumb.

2.3. Novelty of the work

This paper proposes a regressive framework for assimilating data, as opposed to standard data assimilation techniques whose architecture does not depend on the problem. Besides, the present paper considers time marching of an entire sequence of the state, while the most standard approaches involve a time marching of the predicted state at regular time units. More details about existing works are given in Section 4.

2.4. Dynamical model

The first step is to establish a dynamical model mapping m previous states s(t) to n future states. The chosen architecture is summarized in Figure 3.


(Figure 2 panels: (a) simple RNN; (b) RNN-LSTM; (c) LSTM cell. The recurrent unit is composed of a cell state and gating mechanisms. The cell state C is modified when fed with a new time step from the input sequence, forgetting past information (via the forget gate FG), storing new information (via the input gate IG) and creating a short memory (via the output gate OG). Mathematical details are given in the appendix.)

Figure 2: Two types of recurrent neural networks: a simple RNN handling short-term dependencies via a hidden state h (subfigure a) and a RNN-LSTM handling short and long-term dependencies via a hidden state h, a cell state C and gating mechanisms (subfigures b and c). Each time step s(j) from the input sequence is combined with h(j-1) (and C(j-1) for the LSTM-RNN) computed at the previous time step.

In the recurrent layer, 2m LSTM cells (making the cell state a 2m-dimensional vector) process the input sequence [s]_{t-m+1}^{t}. This results in a final output o(t) = h(t) summarizing all relevant information from the input sequence. In the dense layers, the final output from the recurrent layer is used to predict the n future states [s^b]_{t+1}^{t+n}. Concerning the number of recurrent units, it has been chosen to echo the results of Faqih et al. [1], where the best scores were obtained by considering twice as many neurons as the history window. The authors drew this conclusion after trying to predict multiple steps ahead the state of the Lorenz 63 system using a dense neural network with radial basis functions.
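A Keras sketch of this architecture could look as follows (layer sizes taken from the text and Figure 3; the linear output layer and the reshape are assumptions, not the authors' exact implementation):

```python
from tensorflow import keras

m, n, n_f = 100, 50, 6   # history window, forecast horizon, state dimension (example values)

dynamical_model = keras.Sequential([
    keras.layers.LSTM(2 * m, input_shape=(m, n_f)),       # 2m LSTM cells, returns h(t)
    keras.layers.Dense(2 * m, activation="sigmoid"),       # first dense layer
    keras.layers.Dense(m, activation="sigmoid"),           # second dense layer
    keras.layers.Dense(n * n_f, activation="linear"),      # flattened sequence [s^b]_{t+1}^{t+n}
    keras.layers.Reshape((n, n_f)),
])
dynamical_model.compile(optimizer="adam", loss="mse")
```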

About the training of the model, the procedure is as follows:

1. Simulate the system to get data t → s(t). For the considered Lorenz system, only one trajectory is simulated, but it covers a good region of the phase space.

2. Split the data into training and testing sets. In this work, 2/3 of the data is used for training and 1/3 for testing. To detect possible overfitting, cross-validation is first performed by considering 20% of the training data as validation data. Depending on the evolution of both training and validation errors, dropout layers can be added to the neural networks. The model is then trained on the whole training data set and errors are computed on the yet unseen testing data set. This step is necessary to ensure that test errors are representative of generalization errors.



3. Define supervised problems by writing the data as [s]_{t-m+1}^{t} → [s]_{t+1}^{t+n}. The number of training examples is increased by considering a sliding window of one time step, i.e. the training set is composed of [s]_{0}^{m-1} → [s]_{m}^{m+n-1}, [s]_{1}^{m} → [s]_{m+1}^{m+n}, etc. For the testing set, a sliding window of n time steps is used.

4. Find the optimal weights and biases in the network by minimizing the mean square error evaluated on batches of training data. The chosen optimization algorithm is ADAM [22]. There are numerous parameters to find, including all weights and biases for each LSTM cell in the recurrent architecture and all parameters in the dense layers. During the optimization process, the mean square error is also computed on the validation set. To avoid overfitting and ensure that the weights and biases learned during training are relevant for future use on the test set, errors evaluated on the training and validation sets should be close.

5. Evaluate the performance of the final model using test data. Test 1 uses exact input sequences [s]_{t-m+1}^{t} to compute [s^b]_{t+1}^{t+n}. Test 2 uses the first exact sequence [s]_{0}^{m-1} to compute [s^b]_{m}^{m+n-1}, which is then used as a new input, and so on. The metric used to quantify errors is the normalized mean square error, which indicates how far predictions are from expectations on average. It is computed at the end of the process, using the temporal testing states as a reference. Its formula is based on all predicted states s^b and corresponding true states s:

NMSE = (1/n_test) Σ_{i=1}^{n_test} ||s(t_i) − s^b(t_i)||² / ||s(t_i)||²

where n_test is the total number of test states and ||.|| is the l2 operator computing the norm of the considered vector (of dimension n_f).
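Steps 3 and 5 can be illustrated with a short sketch (the trajectory is assumed to be an array of shape (n_samples, n_f); an illustration, not the authors' code):

```python
import numpy as np

def make_windows(s, m, n, stride=1):
    """Build supervised pairs [s]_{i}^{i+m-1} -> [s]_{i+m}^{i+m+n-1}.
    A stride of 1 is used for the training set, a stride of n for the testing set."""
    X, Y = [], []
    for i in range(0, len(s) - m - n + 1, stride):
        X.append(s[i:i + m])
        Y.append(s[i + m:i + m + n])
    return np.array(X), np.array(Y)

def nmse(s_true, s_pred):
    """Normalized mean square error averaged over all predicted states (rows)."""
    num = np.sum((s_true - s_pred) ** 2, axis=1)   # ||s(t_i) - s^b(t_i)||^2
    den = np.sum(s_true ** 2, axis=1)              # ||s(t_i)||^2
    return np.mean(num / den)
```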

2.5. Data assimilation

To make a continuous forecast of the state using a data-driven dynamical model, it is necessary to limit the accumulation of prediction errors [23] by incorporating online data in the prediction process. Consider y(t), an exact measurement of the state at time t. The mapping between the state space and the measurement space is done using the measurement operator H. In Kalman filtering techniques, a predicted state s^b(t) is updated according to s^a(t) = s^b(t) + K_t [y(t) − H(s^b(t))], where the Kalman gain K_t blends errors from the prediction and the measurement. Such a method is based on the Bayes theorem, which helps to compute the probability density function of the state conditioned on the measurement. However, these techniques require the statistics of errors to be explicitly known, and they work on a sequence of states only when considering the sequence as the state. The objective here is to adapt the strategy to directly update, in a regressive manner, a sequence of states using a sequence of measurements.
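For comparison, a single Kalman-style analysis step with a linear measurement operator could be written as follows (a generic sketch assuming the prediction error covariance P and the measurement noise covariance R are known; it is not part of the authors' framework):

```python
import numpy as np

def kalman_update(s_b, y, H, P, R):
    """Analysis step s^a = s^b + K (y - H s^b) for a linear measurement operator H."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return s_b + K @ (y - H @ s_b)       # updated (analyzed) state
```

The DAN replaces this explicit gain with a nonlinear regression learned from data, so no error statistics have to be specified.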

The first stage is to establish a relationship between the state and its measurement, i.e. to find an approximation of the H operator. This task is performed by a shallow encoder which nonlinearly explains a measurement by its state. Figure 4 summarizes the retained architecture, with n_f being the number of features in the state and p the number of observed variables. For simple and known mappings, this step is not necessary. However, to develop a complete data-driven framework, we choose to keep the shallow encoder despite the simplicity of the true measurement operator for the considered Lorenz system.


(Figure 3 diagram. Step 1, get h(t): the sequence s(t-j), j = m-1 to 0, is processed by 2m LSTM cells, producing h(t-j) and C(t-j). Step 2, get [s^b]_{t+1}^{t+n}: h(t) is passed through a dense layer of 2m sigmoid neurons and a dense layer of m sigmoid neurons.)

Figure 3: Architecture of the dynamical model, a network composed of a recurrent layer and dense layers. Idea behind the design: 2m cells are used to echo the results obtained in [1], where the best prediction scores were obtained when considering twice the history window.

The training and testing of the shallow encoder is performed using data s(t) → y(t). The determination coefficient R² is used as a metric to assess the quality of the regression. It is defined by:

R² = 1 − Σ_{i=1}^{n_s} ||y(t_i) − y^b(t_i)||² / Σ_{i=1}^{n_s} ||y(t_i) − ȳ||²

where n_s is the number of samples (n_train for assessing the quality of the fit, n_test for assessing the quality of prediction) and ȳ is the temporal mean measurement vector.
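Following the layer sizes described in Figure 4, a minimal Keras sketch of the shallow encoder could be (example values for n_f and p; an illustration rather than the authors' code):

```python
from tensorflow import keras

n_f, p = 6, 3   # number of state features and of observed variables (example values)

shallow_encoder = keras.Sequential([
    keras.layers.Dense(2 * n_f, activation="tanh", input_shape=(n_f,)),  # first tanh layer
    keras.layers.Dense(n_f, activation="tanh"),                          # second tanh layer
    keras.layers.Dense(p, activation="linear"),                          # measurement y(t)
])
shallow_encoder.compile(optimizer="adam", loss="mse")
```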

The second stage is to blend a predicted sequence [s^b]_{t+1}^{t+n} with its associated sequence of measurements [y^b]_{t+1}^{t+n} and the real sequence of measurements [y]_{t+1}^{t+n} to produce the updated sequence [s^a]_{t+1}^{t+n}. This job is done by a dense neural network whose architecture is summarized in Figure 5. The training process is as follows:

1. Simulate the system to get t → s(t), y(t). Measurements are supposed to be exact and no noise is applied yet.

2. Split the temporal states and measurements into training and testing sets. The procedure is the same as for the dynamical model training preparation.

3. Get a supervised formulation of the training data. The sliding window is set to n. The output sequences are [s, y]_{jn}^{(j+1)n+m-1}, where the index j describes the j-th output.

4. Perform Test 1 (see subsection 2.4) and the shallow encoder on the defined outputs. This leads to possible predicted states and measurements [s^b, y^b]_{jn}^{(j+1)n+m-1}.

5. Each training measurement sequence [y]_{t+1}^{t+n} is associated to 20% random pairs from the set defined in step 4. In doing so, a real sequence of measurements is associated to a randomly selected 20% of all possible predictions in order to produce the target sequence [s]_{t+1}^{t+n}. All possible predictions could be used, but this would increase the computational time needed to prepare the training set.

6. Perform gradient descent to optimize weights and biases.

7. Evaluate the performance of the final model using test data. This is denoted as Test 3, which is summarized in Figure 6. Like in Test 2, inputs are fed back recursively to assess the continuous prediction of test data. The metric is still the normalized mean square error.


(Figure 4 diagram: input layer of n_f neurons taking a state; hidden tanh layer of 2 n_f neurons; hidden tanh layer of n_f neurons; linear output layer of p neurons producing its measurement.)

Figure 4: Shallow artificial neural network mapping a state to its measurement, i.e. an approximation of the H operator.


The assimilation technique proposed here is a nonlinear regression learned on training data. This is different from Kalman filtering techniques, where the Kalman gain only relies on statistics of errors and whose formula does not depend on the considered case.
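Based on the sizes given in Figure 5, a hedged Keras sketch of the DAN could be (functional API, example values; not the authors' exact implementation):

```python
from tensorflow import keras

m, n, n_f, p = 100, 50, 6, 3   # history window, forecast horizon, state and measurement sizes

s_b = keras.Input(shape=(n * n_f,), name="predicted_states")        # [s^b]_{t+1}^{t+n}
y_b = keras.Input(shape=(n * p,), name="predicted_measurements")    # [y^b]_{t+1}^{t+n}
y = keras.Input(shape=(n * p,), name="real_measurements")           # [y]_{t+1}^{t+n}

h = keras.layers.Concatenate()([s_b, y_b, y])
h = keras.layers.Dense(int(0.1 * m * (n_f + 2 * p)), activation="tanh")(h)   # hidden layer 1
h = keras.layers.Dense(int(0.1 * m * (n_f + p)), activation="relu")(h)       # hidden layer 2
s_a = keras.layers.Dense(n * n_f, activation="linear", name="analyzed_states")(h)

dan = keras.Model(inputs=[s_b, y_b, y], outputs=s_a)
dan.compile(optimizer="adam", loss="mse")
```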

3. Results

3.1. Lorenz system

The Lorenz system of equations is a simplified model for atmospheric convection [2][24]. Close initial conditions lead to very different trajectories, making the Lorenz system a chaotic dynamical system. The system is defined by:

ẋ = σ(y − x)
ẏ = x(ρ − z) − y
ż = xy − βz

Parameters σ, ρ and β are respectively set to 10, 28 and 8/3. The trajectory of a particle lies on an attractor whose shape resembles a butterfly. In [10], Brunton proposed a method to write a chaotic system as a forced linear system. Following this method, the forcing statistics appear non-Gaussian, with long tails corresponding to rare intermittent forcing preceding switching events (see Figures 7a and 7c). The system is simulated using a Runge-Kutta 4 method, a random initial condition and a time step of 0.005 s, for a total of 15000 samples. The chosen state is s = (x, y, z, v_x, v_y, v_z), i.e. the position and velocity of the particle on the attractor. The state is normalized using statistics from the training data. The time-series of the x feature, plotted in Figure 7b, clearly shows the lobe switching process (positive values when the particle travels on the right wing and negative values when it travels on the left wing). The objective is to extract from the simulated data a dynamical model mapping m past states to n future states. To account for modeling errors, predictions are enforced using sequences of measurements. Our choice of observed variables is y = (a_x, a_y, a_z) (the complete acceleration), y = (v_x, v_y, v_z) (the complete velocity) or y = v_x. Measurements are directly linked to the state and the data-driven models should automatically detect these relations.
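The simulation described above can be reproduced with a few lines of Python (a sketch; the normalization here uses the whole trajectory for brevity, whereas the paper uses training statistics):

```python
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def lorenz(u):
    x, y, z = u
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(u, dt):
    k1 = lorenz(u)
    k2 = lorenz(u + 0.5 * dt * k1)
    k3 = lorenz(u + 0.5 * dt * k2)
    k4 = lorenz(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

dt, n_samples = 0.005, 15000
u = np.random.uniform(-1.0, 1.0, size=3)                # random initial condition
states = []
for _ in range(n_samples):
    states.append(np.concatenate([u, lorenz(u)]))       # s = (x, y, z, vx, vy, vz)
    u = rk4_step(u, dt)
states = np.array(states)
states = (states - states.mean(axis=0)) / states.std(axis=0)   # normalization
```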

Before training the models and specifying the prediction window n, the impact of n on the global error is investigated. Considering that wrong predictions are more likely to appear when predicting lobe switchings, two sources of errors, quantified on training data, can be identified:

. Source 1, denoted e1 → the ratio between the mean position of the switching in a switching sequence and the prediction horizon n. The smaller the ratio, the bigger the impact on the global error.

. Source 2, denoted e2 → the ratio between the number of sequences with switchings and the number of training sequences. The bigger the ratio, the bigger the impact on the global error.

Figure 7d shows the impact on the global error for n ∈ [10, 90]. As expected, increasing the forecast window leads to a bigger impact on the global score (e2/e1 increasing) because prediction errors accumulate on longer sequences. However, the impact is not strictly monotone, indicating a dependence on the initial position of the particle. For the considered starting point, a forecast horizon n = 80 has a bigger impact on the NMSE than n = 90, indicating that lobe switchings (so possibly wrong predictions) are more likely to appear at the beginning of a new sequence to predict for n = 80 and in the middle of the sequence for n = 90. In the next sections, a history window m = 100 is used so as to capture at most one lobe switching.


(Figure 5 diagram: linear input layers of n_f × n neurons for [s^b]_{t+1}^{t+n}, p × n neurons for [y^b]_{t+1}^{t+n} and p × n neurons for [y]_{t+1}^{t+n}; hidden layer 1 with 0.1 m (n_f + 2p) tanh neurons; hidden layer 2 with 0.1 m (n_f + p) relu neurons; linear output layer of n_f × n neurons for [s^a]_{t+1}^{t+n}.)

Figure 5: Data assimilation network. The nonlinear regression corrects predicted sequences of states by assimilating sequences of real measurements. Hyperparameters must not be considered as a rule of thumb and are tailored to the Lorenz 63 system.

(Figure 6 diagram: input → RNN → prediction → DAN update → new input.)

Figure 6: Procedure to test the data assimilation network. The dynamical model is used to predict n future states of the system using a history of m states. The predicted sequence is updated using the data assimilation network and the next input is formed. All predicted sequences are then compared to all expected sequences using the normalized mean square error as a metric.


3.2. Testing the dynamical model

Nine dynamical models are established with m = 100 and n ∈ [10, 90]. Learning is stopped when the validation mean square error does not decrease for 3 epochs in a row. All learning curves show converged and close training and validation errors. The use of dropout layers or regularization techniques is then not necessary since no overfitting is noted. With models trained on the whole training data set, errors calculated on testing data are less than 1% for Test 1 but always exceed 100% for Test 2 (see step 5 in Section 2.4 for the definition of the tests). This means that the dynamical model alone performs well only for short-term predictions. Figure 8 shows predictions of the x feature for both tests with a forecast horizon of 50 time steps. It is worth noting that, despite the global score for Test 2, the dynamical model successfully recovers the region of phase space it was learnt on.

3.3. Testing the data assimilation network

Concerning the shallow encoder (mapping a state to its measurement), training is extremely fast and accurate, with training and testing determination coefficients close to 1. This means that the nonlinear regression performed by the neural network recovers nearly all the variance observed in the training and testing data. This is not a surprise since the relationship between the state and its measurement is simple (differentiation for the acceleration, or selecting features for the velocity). Concerning the data assimilation network, quantitative results of Test 3 are shown in Figure 9, and qualitative results for x and v_x predictions using acceleration measurements are shown in Figure 10.


(Figure 7 panels: (a) Lorenz attractor (training attractor, x-y-z view); (b) time-series of the states of the Lorenz system, x(t), y(t), z(t), vx(t), vy(t), vz(t); (c) distribution of the forcing signal compared to a Gaussian; (d) impact of the forecast horizon n on the NMSE, for n between 10 and 90.)

Figure 7: Analysis of training data. The attractor (subfigure a) can be seen as a forced linear system with a non-Gaussian distribution of the forcing signal (subfigure c, method from [10]). Lobe switchings are visible in the time-series of the x feature (subfigure b), where the blue signal corresponds to training data and the orange signal corresponds to testing data. The forecast horizon n has an impact on the global score, as shown in subfigure d.

Several comments can be made:

1. Qualitatively, most bad predictions are followed by good predictions: wrong predictions are detected and corrected by the DAN given the online measurements.

2. Using the complete velocity leads to slightly better results than using the complete acceleration, which seems reasonable because giving the velocity means giving three features out of six in the sequence to update.

3. Using only v_x leads to bad results for small sequences. To understand this behavior, the mean linear correlation coefficient between sequences of v_x and sequences of the state has been studied (a small sketch of this diagnostic is given after the list). It is defined by:

r(v_x, s_k) = (1/n_strain) Σ_{i=1}^{n_strain} r([v_x]_i, [s_k]_i), with r(X, Y) = cov(X, Y) / (σ_X σ_Y)

where n_strain is the number of output training sequences of size n, [v_x]_i is the i-th training sequence of v_x (of size n) and [s_k]_i is the i-th training sequence of the k-th feature of the state. Results are shown in Figure 11. It appears that small sequences of v_x are linearly correlated to all features in the state (linear correlation coefficient close to 1), which is no longer the case for medium and large sequences where nonlinearities arise (linear correlation coefficient between 0.6 and 0.7).


(Figure 8 panels, time iteration versus x, with stacked predictions, real state and start-of-prediction markers: (a) predictions of x (test set) when performing Test 1; (b) predictions of x (test set) when performing Test 2.)

Figure 8: Test of the dynamical model for n = 50. Dots indicate the start of a new prediction, using m past exact states (Test 1) or m past predicted states (Test 2) as input.

Therefore, the data assimilation network has a too complex architecture for updating small sequences: a lot of unnecessary parameters must be calculated during learning because the state could entirely be recovered by a linear regression on the online v_x. This is a form of overfitting and the hyperparameters should be optimized, but this is not the scope here.

4. The worst results are obtained for n = 80, which is linked to the dependence on the initial state used to generate the training data (Figure 7d).
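The correlation diagnostic of comment 3 can be computed along these lines (a sketch; the array shapes are assumptions):

```python
import numpy as np

def mean_sequence_correlation(vx_seqs, state_seqs):
    """Average Pearson correlation between training sequences of v_x and training
    sequences of each state feature.
    vx_seqs: array of shape (n_seq, n); state_seqs: array of shape (n_seq, n, n_f)."""
    n_seq, _, n_f = state_seqs.shape
    r = np.zeros(n_f)
    for k in range(n_f):
        for i in range(n_seq):
            r[k] += np.corrcoef(vx_seqs[i], state_seqs[i, :, k])[0, 1]
    return r / n_seq
```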

We now suppose that the initial sequence is noisy (Gaussian noise with standard deviation σ0 = 0.3), just like the online measurements (Gaussian noise with standard deviation σy = 0.2).


(Figure 9 plot: test errors for Test 3, NMSE_test (%) versus n (time steps), for online measurements (a_x, a_y, a_z), (v_x, v_y, v_z) or v_x alone.)

Figure 9: Testing errors computed from Test 3 using the complete acceleration, the complete velocity or v_x alone to update predictions.

The objective is to compare the proposed strategy with a simple Kalman filter update. The tests are restricted as follows:

1. The simplest Kalman filtering technique requires the mapping between the state and its measurement to be linear (i.e. the H operator is simply a matrix). The tests will then concern v_x or the complete velocity as online measurements.

2. A Kalman filter requires the covariance of the prediction error to be advanced in time. In this paper, the covariance of the prediction error is supposed to be known at each new prediction and the predicted sequences are updated as a whole state.

Quantitative results are shown in Figure 12. We can observe that the DAN performs better for medium and large sequences but has poor performance on small sequences compared to the Kalman filter. This result was expected: the limited size of the noisy input sequence makes it harder to detect and learn regularity, resulting in sub-optimal performance. This effect is not noticeable for large sequences because a pattern can still be detected in the noisy sequence. It is also worth noting that the noise applied to a small initial sequence has no influence on performance, since no weights are attributed to predicted sequences in the DAN architecture. When considering the complete velocity, the results obtained from the DAN are close to those obtained with the Kalman filter (which requires the errors to be known, while the DAN does not). The influence of noise on small sequences is not as important as when using v_x alone, since the correction does not rely on a single neuron.

4. Discussions

The data assimilation framework proposed in this paper is based on regressive methods. Its success is then tied to a good choice and design of the method and to the relevance of the training data. If the trajectory to predict lies in the learned region of phase space, one can expect the DAN to have great performance. Otherwise, Kalman filtering techniques, which do not depend on the problem, should be preferred. The reader is referred to Mons et al. [14] for an overview of existing techniques.

Concerning the combination of neural networks with Kalman filtering techniques, some work has already been done in the literature. Most innovations concern the estimation of uncertainties. For instance, Coskun et al. [25] use LSTM cells to predict the internals of the Kalman filter. In doing so, the authors implicitly learn a dynamical model and covariance errors to use for a Kalman update. In Loh et al. [23], the authors update LSTM predictions of flow rates in gas wells using an ensemble Kalman filter, thus estimating errors via the covariance of an ensemble of predictions. In Becker et al. [26], a new architecture called the recurrent Kalman network is developed: using an encoder-decoder network, the authors are able to estimate uncertainties in the features. Finally, Vashistha [27] directly trains a RNN-LSTM network to simulate ensemble Kalman filter data assimilation using the differentiable architecture search framework. In this paper, we proposed a framework to directly work on sequences of data, separating the dynamical model from the data assimilation process.


(Figure 10 panels, time iteration versus the predicted quantity, with stacked predictions, real state and start-of-prediction markers: (a) predictions of x (test set) when performing Test 3; (b) predictions of v_x (test set) when performing Test 3.)

Figure 10: Test of the data assimilation network for n = 70. Dots indicate the start of a new prediction, using m past reconstructed states as input (Test 3).

Future investigations could include the explicit estimation of uncertainties, using for instance bootstrapping methods or Gaussian processes [28].

Before applying the proposed strategy to a higher-dimensional system, several challenges need to be addressed. First, the relevant phase-space region where to learn the regressive models may be hard to detect as more features are involved. Second, optimal hyperparameters may not be easily guessable, thus requiring a grid search or a genetic algorithm. Third, the dimensionality of the system could be reduced (by projecting it on a well-defined basis) but some information would be sacrificed, thus raising the problem of which relevant features should be kept. Vlachas et al. [29] address some of these problems by establishing RNN-LSTM models to forecast the Lorenz 96 system, but no data assimilation coupled to their dynamical model has yet been attempted.


(Figure 11 plot: mean correlation of v_x sequences with the state's sequences, features x, y, z, v_x, v_y, v_z versus the forecast horizon n, with correlation values ranging from about 0.5 to 1.0.)

Figure 11: Mean correlation coefficient between output training sequences of v_x measurements and each feature in the output training sequences of the state. Small sequences of v_x measurements are linearly correlated to all features in small sequences of the state. Nonlinearities arise for higher forecast horizons.


Finally, we are enthusiastic about using this framework for simple fluid mechanics problems: after reducing the state to estimate (using a POD technique, see [30]), we aim to use data assimilation to continuously predict a simple flow. The neural networks involved in the dynamical model or in the data assimilation process will be improved by incorporating physics. For instance, a physics-embedding neural network is proposed by Ling [31] to establish a deep learning Reynolds-averaged Navier-Stokes model embedding the invariance of the anisotropy stress tensor.

5. Conclusion

In this paper, we investigated the use of neural networks to predict multiple steps ahead the state of the Lorenz system. The first stage consisted in establishing a dynamical model mapping previous states to future states. A recurrent neural network handling long and short-term dependencies was used for this purpose. Supposing the input sequence was exact, the output proved to be accurate, with less than 1% error. However, when running the dynamical model with predicted states as new inputs, errors accumulated at each new prediction, leading to a bad prediction of the time-series: the system being chaotic with extreme events, a small error in the initial condition leads to a radically different output. To overcome this accumulation of errors and make a continuous forecast of the Lorenz system, a data assimilation strategy based on sequential techniques was developed. It consists in a dense network nonlinearly mapping a predicted sequence of states and a corresponding sequence of online measurements to an updated sequence. This strategy proved to be effective when starting with the exact initial sequence and feeding the system with exact online measurements, notably when using the complete acceleration or the complete velocity. A deeper analysis of the DAN structure showed that this strategy was less relevant when using a single measurement or when working with small forecast windows. Besides, the DAN proved to be sensitive (at least for small forecast windows) to noise in the measurements but not to noise in the initial condition. The DAN remains a good alternative to a simple Kalman filter, where the estimation of errors may be a difficult task, especially when updating sequences. It must nonetheless be noted that the success of the DAN is mainly due to the quality of the training data, and extra care must be taken when learning the regression parameters. All in all, the global strategy developed here seems promising to continuously forecast other chaotic systems evolving on an attractor. Future work could include the tuning of hyperparameters (to have an optimal design for each neural network) and the application to a high-dimensional attractor where, similarly to the Lorenz system, extreme events could be encountered.

6. Acknowledgments

The authors wish to thank ONERA and the Hauts-de-France region for their funding.


(Figure 12 panels, NMSE_test (%) versus n (time steps): (a) using noisy v_x as online measurement, with NMSE = 12/16/20/43/23 % for n = 50/60/70/80/90; (b) using the noisy velocity as online measurement, comparing the DAN and the Kalman filter.)

Figure 12: Results from Test 3 using a Kalman filter update or the data assimilation network. Noise was incorporated in both the initial sequence (starting point of the continuous forecast) and the online measurements.

7. References

[1] A. Faqih, B. Kamanditya, B. Kusumoputro, Multi-step ahead prediction of Lorenz's chaotic system using SOM ELM-RBFNN, in: 2018 International Conference on Computer, Information and Telecommunication Systems (CITS), IEEE, 2018, pp. 1-5.
[2] E. N. Lorenz, Deterministic nonperiodic flow, Journal of the Atmospheric Sciences 20 (2) (1963) 130-141.
[3] J. Overland, J. Adams, H. Mofjeld, Chaos in the North Pacific: spatial modes and temporal irregularity, Progress in Oceanography 47 (2-4) (2000) 337-354.
[4] K. T. Carlberg, A. Jameson, M. J. Kochenderfer, J. Morton, L. Peng, F. D. Witherden, Recovering missing CFD data for high-order discretizations using deep neural networks and dynamics learning, Journal of Computational Physics (2019).
[5] J. N. Kutz, Data-driven modeling & scientific computation: methods for complex systems & big data, Oxford University Press, 2013.
[6] S. Brunton, B. Noack, P. Koumoutsakos, Machine learning for fluid mechanics, arXiv preprint arXiv:1905.11075 (2019).
[7] N. B. Erichson, L. Mathelin, Z. Yao, S. L. Brunton, M. W. Mahoney, J. N. Kutz, Shallow learning for fluid flow reconstruction with limited sensors and limited data, arXiv preprint arXiv:1902.07358 (2019).
[8] C.-K. Ing, Multistep prediction in autoregressive processes, Econometric Theory 19 (2) (2003) 254-279.
[9] P. J. Schmid, Dynamic mode decomposition of numerical and experimental data, Journal of Fluid Mechanics 656 (2010) 5-28.
[10] S. L. Brunton, B. W. Brunton, J. L. Proctor, E. Kaiser, J. N. Kutz, Chaos as an intermittently forced linear system, Nature Communications 8 (1) (2017) 19.
[11] E. Kaiser, B. R. Noack, L. Cordier, A. Spohn, M. Segond, M. Abel, G. Daviller, J. Osth, S. Krajnovic, R. K. Niven, Cluster-based reduced-order modelling of a mixing layer, Journal of Fluid Mechanics 754 (2014) 365-414.
[12] R. Yu, S. Zheng, Y. Liu, Learning chaotic dynamics using tensor recurrent neural networks, in: Proceedings of the ICML, Vol. 17.
[13] J. Wang, Y. Li, Multi-step ahead wind speed prediction based on optimal feature extraction, long short term memory neural network and error correction strategy, Applied Energy 230 (2018) 429-443.
[14] V. Mons, J.-C. Chassaing, T. Gomez, P. Sagaut, Reconstruction of unsteady viscous flows using data assimilation schemes, Journal of Computational Physics 316 (2016) 255-280.
[15] A. Gronskis, D. Heitz, E. Memin, Inflow and initial conditions for direct numerical simulation based on adjoint data assimilation, Journal of Computational Physics 242 (2013) 480-497.
[16] J.-S. Zhang, X.-C. Xiao, Predicting chaotic time series using recurrent neural network, Chinese Physics Letters 17 (2) (2000) 88.
[17] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359-366.
[18] R. Pascanu, T. Mikolov, Y. Bengio, Understanding the exploding gradient problem, CoRR, abs/1211.5063 2 (2012).
[19] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735-1780.
[20] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, S. Valaee, Recent advances in recurrent neural networks, arXiv preprint arXiv:1801.01078 (2017).
[21] F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[23] K. Loh, P. S. Omrani, R. van der Linden, Deep learning and data assimilation for real-time production prediction in natural gas wells, arXiv preprint arXiv:1802.05141 (2018).
[24] Z.-M. Chen, W. Price, On the relation between Rayleigh-Benard convection and Lorenz system, Chaos, Solitons & Fractals 28 (2) (2006) 571-578.
[25] H. Coskun, F. Achilles, R. DiPietro, N. Navab, F. Tombari, Long short-term memory Kalman filters: Recurrent neural estimators for pose regularization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5524-5532.
[26] P. Becker, H. Pandya, G. Gebhardt, C. Zhao, J. Taylor, G. Neumann, Recurrent Kalman networks: Factorized inference in high-dimensional deep feature spaces, arXiv preprint arXiv:1905.07357 (2019).
[27] H. V. Vashistha, RNN/LSTM Data Assimilation for the Lorenz Chaotic Models, University of Maryland, Baltimore County, 2018.
[28] X. Qiu, E. Meyerson, R. Miikkulainen, Quantifying point-prediction uncertainty in neural networks via residual estimation with an I/O kernel, arXiv preprint arXiv:1906.00588 (2019).
[29] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, P. Koumoutsakos, Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 474 (2213) (2018) 20170844.
[30] A. T. Mohan, D. V. Gaitonde, A deep learning based approach to reduced order modeling for turbulent flow control using LSTM neural networks, arXiv preprint arXiv:1804.09269 (2018).
[31] J. Ling, A. Kurzawski, J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance, Journal of Fluid Mechanics 807 (2016) 155-166.

Appendix A. LSTM cell

Long Short-Term Memory cells are recurrent units deploying a cell state and gating mechanisms. The equations of the forget (f_t), input (i_t), output (o_t) and activation (a_t) gates are as follows:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
a_t = tanh(W_a x_t + U_a h_{t-1} + b_a)

where the W are weights associated to the input x_t, the U are weights associated to the hidden input h_{t-1} and the b are biases. The outputs are the cell state c_t and the hidden state h_t, computed according to:

c_t = c_{t-1} ⊙ f_t + a_t ⊙ i_t
h_t = o_t ⊙ tanh(c_t)

where ⊙ is the pointwise product. Given a new piece of information [x_t, h_{t-1}], the long-term memory c_t forgets information (via f_t ⊙ c_{t-1}) and stores a part of the new information (via a_t ⊙ i_t). The short-term memory h_t depends on the long-term memory (via tanh(c_t)) and on the activation of the cell (via o_t) given the new information.
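For completeness, a direct NumPy transcription of these equations could read as follows (an illustrative layout with weight dictionaries; not tied to the Keras implementation used in the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step following the appendix equations.
    W, U, b are dicts holding the weight matrices and biases of the f, i, o, a gates."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    a_t = np.tanh(W["a"] @ x_t + U["a"] @ h_prev + b["a"])   # candidate activation
    c_t = c_prev * f_t + a_t * i_t                           # long-term memory (cell state)
    h_t = o_t * np.tanh(c_t)                                 # short-term memory (hidden state)
    return h_t, c_t
```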
