Spatio-Temporal Modelling using Integro-Difference Equations with Bivariate Stable Kernels

Robert Richardson

Department of Statistics, Brigham Young University, Provo, Utah 84602, U.S.A.

Athanasios Kottas and Bruno Sanso

Department of Applied Mathematics and Statistics, University of California at Santa Cruz, Santa Cruz, California 95064, U.S.A.

March 1, 2017

Abstract

Integro-difference equations can be represented as hierarchical spatio-temporal dynamic models using appropriate

parameterizations. The dynamics of the process defined by an integro-difference equation depends on the choice of a

bivariate kernel distribution, where more flexible shapes generally result in more flexible models. We consider the use

of the stable family of distributions for the kernel, as they are infinitely divisible and offer a variety of tail behaviours,

orientations and skewness. Many of the attributes of the bivariate stable distribution are controlled by a measure,

which we model using a flexible Bernstein polynomial basis prior. This nonparametric prior specification, along

with spatially dependent parameters for the stable kernel distribution, results in a model which is more general than

existing integro-difference equation models. A Fourier basis representation with thresholding of the relevant basis

functions and parallelization of certain parts of the posterior simulation algorithm allow the computational burden to

be reduced. The method is shown to improve prediction over the state-of-the-art Gaussian kernel model in an example

using Pacific sea surface temperature data.

Keywords: Bernstein polynomials; Fourier series; Geometric weights prior; Markov chain Monte Carlo; Semiparametric Bayesian modelling.

1 Introduction

Spatio-temporal data analysis requires the specification of spatial and temporal dynamics. In separable models, the

space and time components are modelled independently. However, spatio-temporal processes that occur naturally

in fields such as climate, ecology, and epidemiology often exhibit an interaction between the spatial and temporal

dynamics. Learning the nature of these space-time interactions is a key component of spatio-temporal modelling.

Hierarchical dynamic spatio-temporal models (Wikle & Hooten, 2010) provide a framework that can capture a wide

variety of non-separable model parameterizations. We explore a particular parameterization of a hierarchical dynamic

spatio-temporal model that offers a great deal of flexibility in capturing complicated physical characteristics, while

providing computational tractability.

Denoting by Xt(s) the process at time t and spatial location s, an integro-difference equation model is written as

$$X_{t+1}(s) = \int k(u - s \mid \theta)\, X_t(u)\, du + \varepsilon_t(s),$$

where εt(s) is a zero mean error process which may be spatially colored, and k is a bivariate redistribution kernel.

The kernel parameters θ may be spatially dependent, in which case they are denoted by θs. The concept of an

integro-difference equation is that the process at time t is propagated in time by re-weighting it using kernel k, acting in

space. This model provides a simple, yet effective and interpretable way to express the relationship between the space

and time components, using linear dynamics. As the kernel shifts or expands, the nature of the dependence between

the process at a specific location and the process at previous time points changes accordingly. In fact, it has been

shown that the first two moments of the kernel determine, respectively, the advection and the diffusion of the process

(Brown et al., 2000; Storvik et al., 2002). In Richardson et al. (2017) it is shown that integro-difference equations

with non-Gaussian kernels allow for the description of higher order properties in the spatio-temporal process. Thus,

although complex spatial fields will often exhibit non-linear dynamics, non-Gaussian kernels can help explain some of

the complexity. This feature can be enhanced by making the kernel parameters vary across the spatial region, as in Xu

et al. (2005). The introduction of spatial non-stationarity in the model is then used to capture the physical properties of

the field of interest, through the characteristics of the kernels (Lemos & Sanso, 2009). This is the approach we follow

here. In contrast, Wikle & Holan (2011) and Cressie & Wikle (2011) detail methods that account explicitly for

non-linearity. Such approaches offer powerful modelling tools, but suffer from inferential and computational difficulties.
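The role of the kernel moments described above can be illustrated with a minimal numerical sketch. It assumes a one-dimensional regular grid, a Gaussian kernel, and no error term; the function name `ide_step` and all numerical values are illustrative and not part of the model fitted later. Because the kernel is evaluated at u − s, a kernel mean of µ draws mass from location s + µ, so the field is displaced by −µ, while the kernel variance adds to the spread of the field.

```python
import numpy as np

def ide_step(x, grid, mu, sd):
    """One noise-free step X_{t+1}(s) = int k(u - s) X_t(u) du on a regular 1-D grid."""
    du = grid[1] - grid[0]
    diff = grid[None, :] - grid[:, None]          # diff[i, j] = u_j - s_i
    kernel = np.exp(-0.5 * ((diff - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return kernel @ x * du                        # Riemann sum over u

grid = np.linspace(-20.0, 20.0, 401)
x0 = np.exp(-0.5 * grid ** 2)                     # initial bump centred at 0, unit spread
x1 = ide_step(x0, grid, mu=2.0, sd=1.0)

def centre_of_mass(x):
    return np.sum(grid * x) / np.sum(x)

def spread(x):
    c = centre_of_mass(x)
    return np.sum((grid - c) ** 2 * x) / np.sum(x)

print("centre before/after:", centre_of_mass(x0), centre_of_mass(x1))  # advection
print("spread before/after:", spread(x0), spread(x1))                  # diffusion
```

In this sketch the centre moves by the negative of the kernel mean and the spread grows by the kernel variance, which is the advection and diffusion behaviour attributed to the first two kernel moments.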

In essentially all applications of integro-difference equation models, Gaussian kernels are used, owing to the

convenience in their specification and computation. To apply flexible integro-difference equation modelling to

spatio-temporal data, we propose a bivariate stable kernel due to its ability to capture varying tail behaviour and levels of

skewness. In addition, stable kernels are infinitely divisible. This property is important to guarantee that the evolution

equation of the process remains valid when the time scale is changed (Brown et al., 2000). An important feature of

the bivariate stable distribution is that it is defined through a measure Γ which dictates key distributional properties,

such as skewness, spread, and orientation. By flexibly modelling this measure we can achieve a wide variety of

kernel shapes in two dimensions. This treatment of the kernel provides a more general integro-difference equation

and improves prediction relative to Gaussian kernel models. To our knowledge, this is the first attempt at comparing

Gaussian versus non-Gaussian kernels in integro-difference equations for two-dimensional space. The proposed work

also offers a novel Bayesian semiparametric modelling approach for the bivariate stable distribution. Finally, while the

flexibility afforded by the proposed model is greater than that of Gaussian kernel models, the computational burden is

not significantly larger.

Section 2 provides relevant background and covers technical details useful in the fitting of integro-difference

equations with stable distributions. Methods of modelling for the measure Γ and inference for the model parameters

are given in Section 3. Section 4 details an analysis of sea surface temperature data comparing the bivariate stable

kernel with a Gaussian kernel. Using this data set, we show that the proposed integro-difference equation model

improves prediction over the Gaussian kernel model. Finally, Section 5 concludes with a summary.

2 Stable Distributions

2.1 Definitions

Here, we discuss some of the key properties of the class of stable distributions. A bivariate random vector X belongs

to the stable family of distributions if and only if for any positive constants a and b and independent copies X(1)

and X(2) of X , there exists a positive constant c and vector D ∈ R2 such that aX(1) + bX(2) and cX + D are

equal in distribution (Samorodnitsky & Taqqu, 1997). Many properties stem from this definition, including infinite

divisibility. Integro-difference equations with infinitely divisible kernels can be connected to well-studied partial

differential equation systems, and thus an infinitely divisible kernel is an attractive choice. As a family of distributions

without finite moments, the stable family has been used for applications involving heavy-tailed distributions (Panorska,

1996; Nolan, 2014).

In one dimension, the stable family is defined by four parameters: location parameter µ ∈ R, scale parameter

c > 0, parameter α ∈ (0, 2] which controls the thickness of the tails, and skewness parameter β ∈ [−1, 1]. The

extension to higher dimensions does not result in a comparably simple parametric form. Lacking a general

closed-form expression for its density, the stable distribution is typically defined through its characteristic function. For

α ≠ 1, and for d ≥ 2, the d-dimensional stable characteristic function is given by

$$g(t \mid \alpha, \mu, \Gamma) = \exp\left\{ i t'\mu - \int_{v \in S_d} |t'v|^{\alpha} \left[ 1 - i\, \mathrm{sign}(t'v) \tan(\pi\alpha/2) \right] d\Gamma(v) \right\}. \tag{1}$$

Relating this to the one-dimensional stable distribution, the vector µ ∈ R2 is a location parameter vector, α ∈ (0, 2]

controls tail behavior, and Γ is a finite measure that controls characteristics of the distribution such as skewness,

orientation, and spread, effectively extending c and β to the multivariate setting. The integration space Sd is the unit

sphere in Rd, that is, the unit circle, S2, in two dimensions. In this case, a change of variables can be made to v = (cos(z), sin(z))′ where the integral is now over z ∈ [0, 2π]. The unscaled density of the measure, γ = dΓ, must be non-negative, but can take on a variety of shapes. Figure 1 shows how the shape of the distribution changes with γ. Skewness changes when γ(z) and γ(z+π) are more disparate. The spread of the distribution changes with the scale of the measure, that is, a measure with larger values for all z will have a larger spread. The orientation of the distribution rotates with a shift in γ. A variety of other distributional shapes can be achieved by combining these properties.

Figure 1: Bivariate stable densities with µ = (0, 0)′, α = 1.5, and different choices for the Γ measure. In the top row, γ is a step function resulting in a skewed distribution. In the middle row, γ is a constant function resulting in different spreads of the distribution. In the bottom row, γ is a sine function with a changing shift, resulting in different orientations of the distribution. Panels (both axes from −20 to 20): top row, dΓ = 5−4I(z>π), dΓ = 3, dΓ = 1+4I(z>π); middle row, dΓ = 1, dΓ = 2, dΓ = 3; bottom row, dΓ = 2sin(2z)+2.5, dΓ = 2sin(2z+.5π)+2.5, dΓ = 2sin(2z+π)+2.5.
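As a rough illustration of how γ enters the characteristic function, the following sketch evaluates equation (1) for d = 2 with the change of variables v = (cos(z), sin(z))′, using a midpoint rule over z. The function name `stable_cf`, the quadrature size, and the evaluation points are assumptions made for this example; the three γ functions are taken from the panel titles of Figure 1.

```python
import numpy as np

def stable_cf(t, alpha, mu, gamma, n=4000):
    """Evaluate equation (1) for d = 2 by a midpoint rule over z in [0, 2*pi]."""
    t = np.asarray(t, dtype=float)
    mu = np.asarray(mu, dtype=float)
    dz = 2.0 * np.pi / n
    z = (np.arange(n) + 0.5) * dz
    tv = t[0] * np.cos(z) + t[1] * np.sin(z)      # t'v with v = (cos z, sin z)'
    bracket = 1.0 - 1j * np.sign(tv) * np.tan(np.pi * alpha / 2.0)
    integral = np.sum(np.abs(tv) ** alpha * bracket * gamma(z)) * dz
    return np.exp(1j * (t @ mu) - integral)

# Densities of the measure taken from the Figure 1 panel titles.
step = lambda z: 5.0 - 4.0 * (z > np.pi)
flat = lambda z: 3.0 + 0.0 * z
sine = lambda z: 2.0 * np.sin(2.0 * z) + 2.5

for name, g in [("step", step), ("constant", flat), ("sine", sine)]:
    print(name, stable_cf([0.0, 0.3], 1.5, [0.0, 0.0], g))
```

With µ = (0, 0)′, the constant and sine choices give a characteristic function that is real up to rounding, whereas the step function produces a non-zero imaginary part, which is the numerical signature of skewness.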

2.2 Symmetry

In Richardson et al. (2017) it is shown that the degree of symmetry of an integro-difference equation kernel can

be connected to the physical characteristic of dispersion for the process. Thus, understanding the symmetry of a

bivariate stable kernel is key in the context of the proposed space-time modelling approach. Elliptically contoured

stable distributions are a simplification of stable laws (Nolan, 2013) and have characteristic functions of the form

$\exp\{i t'\mu - (t'\Sigma t)^{\alpha/2}\}$, where Σ is a positive definite matrix. These are also referred to as sub-Gaussian distributions

because they can be represented as scale mixtures of normals. A stable distribution is called symmetric α-stable if

there exists µ such that −X + µ and X + µ have the same distribution. This property is equivalent to that of elliptical

symmetry, which is often defined using the first and second moments. Equation 2.5.8 in Samorodnitsky & Taqqu

(1997) provides a specific form for Γ that yields elliptically contoured stable laws. The following lemma provides a

necessary and sufficient condition for elliptical symmetry.

Lemma 1. The condition γ(z) = γ(z+π), for z ∈ [0, π], is necessary and sufficient for a bivariate stable distribution

to be elliptically symmetric.

Proof. Without loss of generality, assume µ = (0, 0)′. Using a Fourier representation of the density, $f(x) = (2\pi)^{-2} \sum_{t \in \mathbb{Z}^2} g(t) \exp(i t'x)$, where g(t) is given by (1) with µ = (0, 0)′. Hence, f(x) − f(−x) can be written as $(2\pi)^{-2} \sum_{t \in \mathbb{Z}^2} \{g(t) - g(-t)\} \exp(i t'x)$.

First, assume the distribution is elliptically symmetric, which implies f(x) − f(−x) = 0. Then, Fourier and inverse Fourier transform properties yield g(t) − g(−t) = 0, for all t ∈ Z2. By collecting terms and dividing constants, we obtain $\int_{z=0}^{2\pi} \mathrm{sign}(t'v)\, |t'v|^{\alpha}\, d\Gamma(z) = 0$. The integral can be split and written as $\int_{z=0}^{\pi} |t'v|^{\alpha}\, d\Gamma(z) - \int_{z=0}^{\pi} |t'v|^{\alpha}\, d\Gamma(z + \pi) = 0$, for all t ∈ Z2, which holds true only when γ(z) = γ(z + π), for z ∈ [0, π].

Now, assume that γ(z) = γ(z + π), for z ∈ [0, π]. By working backwards in the above argument, this implies $\int_{z=0}^{2\pi} \mathrm{sign}(t'v)\, |t'v|^{\alpha}\, d\Gamma(z) = 0$. The characteristic function then becomes $\exp\left\{ \int_{z=0}^{2\pi} |t'v|^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z) \right\}$. Therefore, g(t) − g(−t) = 0, for t ∈ Z2, and thus f(x) − f(−x) = 0, resulting in the property of elliptical symmetry.

This result formalizes our previous statement, based on graphical exploration, that varying degrees of skewness in

the bivariate stable distribution can be achieved with more or less disparate values of γ(z) and γ(z + π).
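The condition in Lemma 1 can also be checked numerically. The sketch below approximates the integral of sign(t′v)|t′v|^α γ(z) over [0, 2π], which appears in the proof, for one γ that satisfies γ(z) = γ(z + π) and one that does not. The function name `skew_integral`, the quadrature rule, and the chosen frequencies are assumptions made for illustration only.

```python
import numpy as np

def skew_integral(t, alpha, gamma, n=4000):
    """Midpoint approximation of int_0^{2 pi} sign(t'v) |t'v|^alpha gamma(z) dz."""
    dz = 2.0 * np.pi / n
    z = (np.arange(n) + 0.5) * dz
    tv = t[0] * np.cos(z) + t[1] * np.sin(z)
    return np.sum(np.sign(tv) * np.abs(tv) ** alpha * gamma(z)) * dz

symmetric = lambda z: 2.0 * np.sin(2.0 * z) + 2.5   # satisfies gamma(z) = gamma(z + pi)
skewed = lambda z: 5.0 - 4.0 * (z > np.pi)          # step function, violates the condition

for t in [(1, 0), (0, 1), (2, -3), (-1, 4)]:
    print(t, skew_integral(t, 1.5, symmetric), skew_integral(t, 1.5, skewed))
```

For the symmetric γ the integral vanishes at every frequency up to rounding error, while the step function leaves a clearly non-zero value at frequencies such as t = (0, 1)′.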

2.3 Two-Dimensional Fourier Series Representation

The typical method for fitting an integro-difference equation model to spatio-temporal data involves decomposing the

process and the kernel into an orthonormal basis series expansion (Wikle, 2002). For a given set of basis functions,

$\psi_i(s)$, the process is written as $X_t(s) = \sum_{i=1}^{\infty} a_i(t)\psi_i(s)$ and the kernel as $k(u - s \mid \theta_s) = \sum_{j=1}^{\infty} b_j(s, \theta_s)\psi_j(u)$. Any

orthonormal basis will then lead to the representation of the process at time t + 1 as $X_{t+1}(s) = \sum_{i=1}^{\infty} a_i(t) b_i(s, \theta_s)$.

The process coefficients, ai(t), must be estimated, whereas, given parameters θs, the kernel coefficients, bj(s, θs), are

determined by the choice of kernel distribution.
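The reduction of the integral to a sum of coefficient products relies only on orthonormality of the basis. A minimal one-dimensional sketch, assuming a real Fourier basis on [0, 2π] and arbitrary randomly drawn coefficients (the names `fourier_basis`, `a`, and `b` are illustrative), is:

```python
import numpy as np

def fourier_basis(u, n_freq):
    """Real Fourier basis evaluated at u, orthonormal on [0, 2*pi]."""
    cols = [np.full_like(u, 1.0 / np.sqrt(2.0 * np.pi))]
    for k in range(1, n_freq + 1):
        cols.append(np.cos(k * u) / np.sqrt(np.pi))
        cols.append(np.sin(k * u) / np.sqrt(np.pi))
    return np.column_stack(cols)

n = 1024
du = 2.0 * np.pi / n
u = np.arange(n) * du
Psi = fourier_basis(u, n_freq=5)

rng = np.random.default_rng(0)
a = rng.normal(size=Psi.shape[1])      # process coefficients a_i(t)
b = rng.normal(size=Psi.shape[1])      # kernel coefficients b_i(s, theta_s) at one location s
process = Psi @ a                      # X_t(u) on the grid
kernel = Psi @ b                       # k(u - s | theta_s) on the grid

integral = np.sum(kernel * process) * du   # numerical int k(u - s) X_t(u) du
print(integral, a @ b)                     # the two values agree up to rounding
```

The agreement between the numerical integral and the inner product of the coefficient vectors is what allows the state evolution to be written entirely in terms of basis coefficients.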

Owing to the connection between the Fourier transform for densities and the characteristic function, a frequent

choice of basis is the Fourier series. If g(t), where t = (t1, t2)′, is the characteristic function of a bivariate density f ,

then, $f(x) = (2\pi)^{-2} \sum_{t \in \mathbb{Z}^2} g(t) \exp(i t'x)$. When g(t) = exp{a(t)+ib(t)}, where a(t) and b(t) are real-valued scalar functions, the real basis functions are cos(t′x) and sin(t′x) and the coefficients of the expansion are, respectively, exp{a(t)} cos{b(t)} and exp{a(t)} sin{b(t)}. For identifiability, we can combine certain basis functions such that only the indices where t1 ≥ 1 for all t2, or for t2 ≥ 1 when t1 = 0, are included in the basis function set. The resulting coefficients are exp{a(t)} cos{b(t)} + exp{a(−t)} cos{b(−t)} and exp{a(t)} sin{b(t)} − exp{a(−t)} sin{b(−t)}. When t1 = t2 = 0, the basis coefficient is exp{a(0)} cos{b(0)}.

Table 1: Basis functions and coefficients for the real-valued Fourier basis expansion of the bivariate stable distribution where t ∈ Z2.

Basis Function    Coefficient

$(2\pi)^{-1} \cos(t'x)$    $\pi^{-1} \exp\left\{ \int_{z=0}^{2\pi} \lvert t'v \rvert^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z) \right\} \cos\left\{ t'\mu - \int_{z=0}^{2\pi} \mathrm{sign}(t'v) \lvert t'v \rvert^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z) \right\}$

$(2\pi)^{-1} \sin(t'x)$    $\pi^{-1} \exp\left\{ \int_{z=0}^{2\pi} \lvert t'v \rvert^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z) \right\} \sin\left\{ t'\mu - \int_{z=0}^{2\pi} \mathrm{sign}(t'v) \lvert t'v \rvert^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z) \right\}$

The characteristic function for the multivariate stable in equation (1) can be written for the bivariate case as

exp{a(t)+ib(t)}, where $a(t) = \int_{z=0}^{2\pi} |t'v|^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z)$, $b(t) = t'\mu - \int_{z=0}^{2\pi} \mathrm{sign}(t'v)\, |t'v|^{\alpha} \tan(\pi\alpha/2)\, d\Gamma(z)$, and

v = (cos(z), sin(z))′. Table 1 includes the Fourier basis functions and coefficients for the bivariate stable distribution.

These basis coefficients are defined for t1 ≥ 1 for all t2, and for t2 ≥ 1 when t1 = 0. The basis coefficient when

t1 = t2 = 0 is $(2\pi)^{-1}$, and only the cosine basis function should be included. In the integro-difference equation decomposition, the values in Table 1 will be the coefficients bj(s, θs), where in the spatially varying case, the parameters µ and Γ will be replaced with s + µs and Γs. The pairs of Fourier frequencies used in the decomposition will need to be ordered to match the sum from 1 to ∞. The particular ordering does not matter as long as the orderings of the coefficients, bj(s, θs), and of the basis functions, ψj(s), match.
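One way to build such an ordered set of frequency pairs is sketched below. The truncation level `k_max` and the ordering by increasing squared frequency norm are assumptions made for this example; any fixed ordering shared by the coefficients and the basis functions would serve equally well.

```python
def half_plane_frequencies(k_max):
    """Integer pairs (t1, t2) with t1 >= 1 for all t2, or t1 == 0 and t2 >= 1,
    restricted to |t1|, |t2| <= k_max."""
    pairs = [(t1, t2)
             for t1 in range(0, k_max + 1)
             for t2 in range(-k_max, k_max + 1)
             if t1 >= 1 or t2 >= 1]
    # Illustrative ordering: increasing squared norm, ties broken lexicographically.
    return sorted(pairs, key=lambda t: (t[0] ** 2 + t[1] ** 2, t))

freqs = half_plane_frequencies(3)
print(len(freqs))   # 24, i.e. half of the (2 * 3 + 1) ** 2 - 1 non-zero pairs
print(freqs[:5])
```

Each retained pair stands for both t and −t, so exactly one member of every such pair appears, and the t1 = t2 = 0 term is handled separately with the cosine basis function only.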

3 Methodology

3.1 Semiparametric Model for Γ

Here, we present our integro-difference equation model, starting with a semiparametric model for the bivariate stable

distribution kernel which is then extended to incorporate spatially dependent parameters.

As indicated in Section 2.3, and detailed later in Section 3.2, the stable distribution enters the hierarchical model

for the data through its characteristic function, thus overcoming the lack of a closed-form expression for the density.

Its definition in terms of the finite-dimensional parameter vector (α, µ) and measure Γ provides a natural setting for

a Bayesian semiparametric model specification. In particular, we develop a model for γ = dΓ building on ideas from

Bernstein polynomial priors for densities with compact support (Petrone, 1999). In light of the ensuing extension to

spatially dependent parameters for the kernel distribution, a key consideration for the model structure is to balance

model flexibility and computational feasibility for inference.

The fact that Γ is a finite measure on [0, 2π] implies that, up to a scaling parameter c > 0, γ = dΓ is a density

function on [0, 2π]. We model γ using a weighted combination of Beta densities with different shapes,

$$\gamma(z) = c\,(2\pi)^{-1} \sum_{m=1}^{M} w_m\, \mathrm{be}(z/2\pi \mid m, M - m + 1), \quad z \in [0, 2\pi],$$

where be(· | a, b) denotes the density of the Beta distribution with mean a/(a + b). The mixture weights are defined

through an underlying distribution function F on [0, 2π], more specifically, wm = F (2πm/M)− F (2π(m− 1)/M),

for m = 1, ...,M .
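The mixture form of γ can be evaluated directly once the weights are given. The sketch below is a minimal illustration, with a hand-picked weight vector standing in for the weights induced by F; the function names `beta_pdf` and `gamma_density` and the numerical values are assumptions for this example. It also checks numerically that γ integrates to the scaling parameter c over [0, 2π].

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1), computed on the log scale for stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log1p(-x))

def gamma_density(z, c, w):
    """gamma(z) = c (2 pi)^{-1} sum_m w_m be(z / 2 pi | m, M - m + 1)."""
    M = len(w)
    x = z / (2.0 * np.pi)
    mix = sum(w[m - 1] * beta_pdf(x, m, M - m + 1) for m in range(1, M + 1))
    return c * mix / (2.0 * np.pi)

M = 40
w = np.zeros(M)
w[4], w[24] = 0.6, 0.4                      # illustrative weights on two components

n = 20000
dz = 2.0 * np.pi / n
z = (np.arange(n) + 0.5) * dz               # midpoints, strictly inside (0, 2 pi)
g = gamma_density(z, c=2.0, w=w)
total_mass = g.sum() * dz
print(total_mass)                           # close to c = 2
```

Concentrating the weights on a few components, as here, produces a bimodal γ, which is the kind of shape the nonparametric prior on F is meant to reach.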

Key to the flexibility of this mixture representation is the capacity of F to admit general shapes, including multimodalities, which can lead to effective selection of the Beta mixture components that contribute most to the construction of γ. This motivates using a nonparametric prior for F that supports general discrete distributions on [0, 2π]. In particular, we assign a geometric weights prior (Mena et al., 2011) to the random distribution F, such that $F = \sum_{j=1}^{\infty} q(1 - q)^{j-1} \delta_{x_j}$. Here, the xj are independent from a base distribution on [0, 2π], which is taken to be uniform, and q follows a distribution on the unit interval. Hence, F is a discrete distribution with atoms xj and corresponding weights $q(1 - q)^{j-1}$ defined through a single random variable q. This model structure is more economical in the number of parameters than, for instance, the stick-breaking weights of the Dirichlet process, enabling the extension to a spatially varying kernel distribution. In practice, the countable representation for F is truncated at finite level J, defining the last weight to be 1 minus the sum of the previous weights to force a proper probability vector. For a given q, the sum of the weights from J + 1 to ∞ is equal to $(1 - q)^{J}$, so J must be large enough to make this value close to 0, but for efficiency may be adjusted according to the expected value of q.

Note that wm arises by summing up the geometric weights $q(1 - q)^{j-1}$ for the xj that lie in interval (2π(m − 1)/M, 2πm/M). Since the xj are uniformly distributed on [0, 2π], we can associate a Multinomial(1, (1/M, ..., 1/M)) vector, (Zj1, ..., ZjM), with each xj, and express the mixture weights as $w_m = \sum_{j=1}^{J} q(1 - q)^{j-1} Z_{jm}$, for m = 1, ..., M. This replaces a continuous latent variable with a discrete variable, which facilitates estimation. By construction, only one element of (Zj1, ..., ZjM) is 1, the rest being 0, and thus the dimensionality of the new parameter set remains effectively the same.
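The construction of the mixture weights from q and the indicator variables can be sketched as follows. The names `geometric_weights`, `mixture_weights`, and `labels` are illustrative; `labels[j]` stores the index of the single non-zero element of (Zj1, ..., ZjM), and the labels are drawn from their uniform prior rather than from a posterior.

```python
import numpy as np

def geometric_weights(q, J):
    """Truncated geometric weights: q (1 - q)^(j - 1) for j < J, remainder on the last atom."""
    j = np.arange(1, J)
    p = q * (1.0 - q) ** (j - 1)
    return np.append(p, 1.0 - p.sum())

def mixture_weights(p, labels, M):
    """w_m = sum_j p_j Z_{jm}, with labels[j] = m - 1 whenever Z_{jm} = 1."""
    return np.bincount(labels, weights=p, minlength=M)

rng = np.random.default_rng(1)
q, J, M = 0.3, 40, 40
p = geometric_weights(q, J)
labels = rng.integers(0, M, size=J)     # each atom falls in one of M equal bins
w = mixture_weights(p, labels, M)

# Mass of the untruncated geometric weights lying beyond level J, computed numerically.
beyond_J = 1.0 - np.sum(q * (1.0 - q) ** (np.arange(1, J + 1) - 1))
print(w.sum(), p[-1], beyond_J)
```

Printing the mass beyond level J for a plausible range of q gives a quick check on whether a chosen truncation level is adequate.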

Therefore, measure Γ is defined in terms of parameters c, q, and {Zjm : j = 1, ..., J ;m = 1, ...,M}, and

the specification of the bivariate stable kernel distribution is completed with location parameter vector µ = (µ1, µ2)

and tail parameter α. To construct a non-stationary spatio-temporal process, we elaborate on the model such that

parameters µ, c, and q are spatially varying. To illustrate the flexibility of the proposed model and the effect of

spatially varying q, Figure 2 shows how the bivariate stable density changes with q with M = 40 and J = 40. Even

though the atoms are the same for each kernel, the shape can drastically change in both skewness and orientation

by only varying q. Also, Figure 2 shows that the kernel changes smoothly with q, implying that kernels at nearby locations, having similar values for q, will be similar in shape. This smooth transition of kernel shape is desirable for integro-difference equation spatio-temporal models. Further exploration has revealed that the added flexibility from choosing more than 40 Bernstein polynomials (M > 40) is not significant enough to justify the extra computational cost. The support does not expand much past this point.

Figure 2: A bivariate stable density is shown with a scaled Bernstein polynomial measure and a geometric weights base distribution. µ = (0, 0)′, c = 2π, and α = 1.5 for each plot. Only the geometric weight, q, changes. The first 10 atoms are (2, 5.14, 3.6, .4, 3.4, .6, 3.5, .5, 3.5, .5). The other atoms were randomly drawn from a uniform distribution on [0, 2π], but are the same for all 9 figures. Panels (both axes from −15 to 15): q = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9.

The spatially dependent parameters are modelled via kernel process convolutions (Higdon, 1998) for µ1(s), µ2(s), log(c(s)), and Φ−1(q(s)), where Φ is the standard normal distribution function. For instance, for µ1(s), consider a grid of knots u1, ..., uQ and latent variables $(\zeta_{\mu_1}(u_1), \ldots, \zeta_{\mu_1}(u_Q))$ which are independent $N(0, \sigma_{\mu_1})$. Then, $\mu_1(s) = m_{\mu_1} + \sum_{i=1}^{Q} k_\zeta(u_i, s)\, \zeta_{\mu_1}(u_i)$, where $k_\zeta(u_i, s)$ is some convolution kernel such as a Gaussian or a Matérn function. The process µ2(s) has a similar construction. Analogously, $\log(c(s)) = m_c + \sum_{i=1}^{Q} k_\zeta(u_i, s)\, \zeta_c(u_i)$, where $\zeta_c(u_i) \sim N(0, \sigma_c)$, and $\Phi^{-1}(q(s)) = m_q + \sum_{i=1}^{Q} k_\zeta(u_i, s)\, \zeta_q(u_i)$, where $\zeta_q(u_i) \sim N(0, \sigma_q)$. Using process convolutions reduces the parameter space, as Q is typically set to be a number much smaller than the number of locations, but they may potentially ignore some short range variability. The parameter processes, however, should vary smoothly in space, so a process convolution approximation is acceptable.
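A minimal sketch of the process convolution construction is given below. It assumes a Gaussian convolution kernel with a hand-picked bandwidth, a 4 × 4 grid of knots on the square [−10, 10]², zero means, and treats the second argument of each normal distribution as a standard deviation; none of these choices is taken from the analysis in this paper, and all function names are illustrative.

```python
import numpy as np
from math import erf, sqrt

def conv_weights(sites, knots, bandwidth):
    """Gaussian convolution kernel k_zeta(u_i, s) for every site s and knot u_i."""
    d2 = ((sites[:, None, :] - knots[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def std_normal_cdf(x):
    """Standard normal distribution function Phi, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

rng = np.random.default_rng(0)
axis = np.linspace(-10.0, 10.0, 4)
knots = np.array([(a, b) for a in axis for b in axis])     # Q = 16 knots
sites = rng.uniform(-10.0, 10.0, size=(200, 2))            # 200 spatial locations
K = conv_weights(sites, knots, bandwidth=5.0)              # 200 x 16 matrix

def field(mean, sd):
    """mean + sum_i k_zeta(u_i, s) zeta(u_i) with independent normal latent variables."""
    zeta = rng.normal(0.0, sd, size=len(knots))
    return mean + K @ zeta

mu1 = field(0.0, 1.0)                       # mu_1(s)
c = np.exp(field(0.0, 0.5))                 # log(c(s)) is modelled, so c(s) > 0
q = std_normal_cdf(field(0.0, 0.5))         # Phi^{-1}(q(s)) is modelled, so 0 < q(s) < 1
print(mu1[:3], c[:3], q[:3])
```

Each field is determined by only Q latent variables, which is the source of the dimension reduction, and the log and probit links keep c(s) and q(s) inside their admissible ranges.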

In Section 4, the flexibility of the model will be explored through the analysis of sea surface temperature data.

As a comparison, a Gaussian kernel integro-difference equation model will also be fit with all 5 parameters varying in space: two location

parameters, two variances, and the covariance. The proposed bivariate stable kernel model achieves a wider array

of kernel shapes with four spatially varying process parameters, thus offering a more general inference framework

without sacrificing computational feasibility relative to the state-of-the-art Gaussian kernel model.

3.2 Posterior Sampling

Define a spatial data vector Yt = (Yt(st,1), ..., Yt(st,nt))′ where the number of data points and the locations of

observations may change over time. Using the basis function decomposition of the process and the distribution, the

integro-difference equation model with a stable kernel can be represented as a dynamic linear model. The full model

for a given kernel parameter set θ is

$$Y_t \mid a_t, \sigma^2 \sim N(\Psi_t a_t, \sigma^2 I_t), \tag{2}$$

$$a_t \mid a_{t-1}, \tau^2, \theta \sim N(G_t B_{\theta,t}\, a_{t-1},\ \tau^2 G_t V_t G_t'), \quad t = 1, \ldots, T, \tag{3}$$

where the i-th row of Ψt contains the Fourier basis functions as found in Table 1 evaluated at the i-th spatial location at time t, at contains the stochastic basis function coefficients for the process which will be estimated, $G_t = (\Psi_t'\Psi_t)^{-1}\Psi_t'$, and the i-th row of $B_{\theta,t}$ contains the fixed coefficients for the basis expansion of the kernel given the parameters, also found in Table 1 with µ and Γ replaced with $s_{t,i} + \mu_{s_{t,i}}$ and $\Gamma_{s_{t,i}}$. More details on how this formulation is derived for a general kernel are found in Wikle (2002) and Xu et al. (2005). The observational variance $\sigma^2 I_t$ is a scaled identity. The matrix Vt is a spatial covariance that is transformed spectrally by Gt and scaled by τ2.
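For fixed kernel parameters, equations (2) and (3) form a linear Gaussian state-space model, so the state coefficients can be filtered with the standard Kalman recursions. The sketch below shows a single forward filtering step under that assumption; it uses randomly generated stand-ins for Ψt, Bθ,t, and Vt, the function name `kalman_step` is illustrative, and it is not claimed to reproduce the sampler used in this work.

```python
import numpy as np

def kalman_step(m, C, y, Psi, B, V, sigma2, tau2):
    """One Kalman filter step for equations (2)-(3), given a_{t-1} ~ N(m, C)."""
    G = np.linalg.solve(Psi.T @ Psi, Psi.T)        # G_t = (Psi' Psi)^{-1} Psi'
    M = G @ B                                      # evolution matrix G_t B_{theta,t}
    W = tau2 * G @ V @ G.T                         # evolution covariance
    a = M @ m                                      # prior mean of a_t
    R = M @ C @ M.T + W                            # prior covariance of a_t
    f = Psi @ a                                    # one-step forecast of Y_t
    Q = Psi @ R @ Psi.T + sigma2 * np.eye(len(y))  # forecast covariance
    K = np.linalg.solve(Q, Psi @ R).T              # gain R Psi' Q^{-1} (R, Q symmetric)
    return a + K @ (y - f), R - K @ Q @ K.T        # filtered mean and covariance

rng = np.random.default_rng(0)
n, p = 30, 5                                       # locations and basis functions
Psi = rng.normal(size=(n, p))                      # stand-in for the Fourier design matrix
B = rng.normal(size=(n, p))                        # stand-in for the kernel coefficients
V = np.eye(n)                                      # stand-in for the spatial covariance
y = rng.normal(size=n)
m1, C1 = kalman_step(np.zeros(p), np.eye(p), y, Psi, B, V, sigma2=1.0, tau2=1.0)
print(m1)
```

Repeating this step over t = 1, ..., T gives the filtered moments from which the states can then be drawn backwards in time.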

The state vectors {a0, ..., aT} are sampled via dynamic linear model filtering. The kernel parameters are updated using Markov chain Monte Carlo sampling. Using Gibbs sampling within Markov chain Monte Carlo, the posterior distribution for the latent variables is proportional to $p(a_t \mid a_{t-1}, \tau^2, \sigma_0^2, \{\zeta_{\mu_j}(u_i) : j = 1, 2\}, \{\zeta_c(u_i)\}, \{\zeta_q(u_i)\})\, p(\{\zeta_{\mu_1}(u_i)\})$. The prior for each latent variable $\zeta_{\mu_1}(u_i)$ is $N(0, \sigma_{\mu_1})$. The parameter $\sigma_{\mu_1}$ is given an inverse gamma prior. Conditional on $\zeta_{\mu_1}(u_i)$ for i = 1, ..., Q, the posterior distribution for $\sigma_{\mu_1}$ will also be inverse gamma. The conditional posterior distributions for $\sigma_{\mu_2}$, $\sigma_c$, and $\sigma_q$ are also conjugate if they are paired with inverse gamma priors. With normal priors on $m_{\mu_1}$, $m_{\mu_2}$, $m_c$, and $m_q$, the posteriors are conjugate as well.

The parameter α for the stable distribution can be difficult to learn. To aid in estimation, we assume a discrete

prior for α, on a grid between 1.05 and 2 with step size .05. With a uniform discrete prior, the posterior probability that

α = λi is proportional to the likelihood evaluated at α = λi. Another advantage of discretizing α is that the integrals

in the basis coefficients from Table 1 can be calculated prior to the Markov chain Monte Carlo run for each Bernstein

polynomial and for each possible value for α, resulting in a significant speed-up. The Zjm variables are also discrete.

For each set Zj1, ..., ZjM , the single variable which is equal to 1 can be sampled from a discrete posterior. Allowing lj

to be an indicator variable where lj = m when Zjm is 1, the probabilities are proportional to the likelihood evaluated

at lj = m, which means that it can be sampled discretely with conditional posterior probability

$$p(l_j = m \mid \{a_t : t = 1, \ldots, T\}, \cdot) = \frac{\prod_{t=1}^{T} p(a_t \mid a_{t-1}, \cdot, l_j = m)\, p(l_j = m)}{\sum_{i=1}^{M} \prod_{t=1}^{T} p(a_t \mid a_{t-1}, \cdot, l_j = i)\, p(l_j = i)}.$$

This should be done for all J sets of latent variables. We note that, from a computational perspective, the discreteness

of these variables lends itself to parallelization, as does sampling the discretized variable α. This is done by sending

the calculations of the components of the discrete probabilities to different nodes and then collecting the proportional

posteriors to calculate the probabilities. For a large enough data set and using M nodes, this can significantly increase

the speed of the Markov chain Monte Carlo algorithm. For this work, the parallelization was done in C++ using OpenMP.
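The discrete updates for α and for the indicators lj share the same structure: evaluate an unnormalized log posterior at every candidate value, normalize, and draw. A minimal sketch for α under a uniform prior is given below. The function `toy_log_lik` is a hypothetical stand-in for the state-evolution likelihood of equation (3), and the subtraction of the maximum before exponentiating is a standard numerical safeguard rather than something described in this paper.

```python
import numpy as np

def sample_discrete(log_post, rng):
    """Draw an index with probability proportional to exp(log_post)."""
    shifted = log_post - np.max(log_post)     # guards against underflow
    prob = np.exp(shifted)
    prob /= prob.sum()
    return rng.choice(len(prob), p=prob), prob

# Grid for alpha: 1.05, 1.10, ..., 2.00.
alpha_grid = np.round(np.arange(1.05, 2.0001, 0.05), 2)

def toy_log_lik(alpha):
    """Hypothetical log likelihood, used only to make the example runnable."""
    return -200.0 * (alpha - 1.5) ** 2

# Each entry could be evaluated on a separate node before the normalization step.
log_post = np.array([toy_log_lik(a) for a in alpha_grid])

rng = np.random.default_rng(0)
idx, prob = sample_discrete(log_post, rng)
print(alpha_grid[idx], prob.max())
```

Because the entries of `log_post` are computed independently of one another, the loop over candidate values is the part that lends itself to parallel evaluation.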

3.3 Thresholding

The method of decomposing the process and the kernel using a basis expansion requires specification of how to truncate

to a finite number of basis functions (Xu et al., 2005). In one dimension, the number of basis functions required for

accurately representing the kernel in integro-difference equation modelling is relatively small, ranging between 20 and

100. To achieve the same level of accuracy in 2 dimensions, hundreds of basis functions may be required. The main

determining factor of the optimal number of basis functions to use is the width of the kernel compared to the range of

the data. For example, when using a normal kernel, a higher variance requires fewer basis functions. This means that

the optimal number of basis functions to use is not constant but changes throughout the Markov chain Monte Carlo run.

Recall that Bθ is a matrix where the (i, j)-th element is the coefficient of the j-th basis function at location si. Figure

3 shows how the width of the kernel affects the number of influential basis functions. This shows that the number of

significant basis functions for a particular basis index across all locations is generally smaller when the spread is larger.

Two other observations can be made from this. The first is that a small percentage of these are significantly larger than

0. The other observation is that the size of the coefficient is not fully determined by the order of the frequencies. As

basis function index increases, the frequency of the function increases. The general trend is that the coefficients get

smaller, but several coefficients of smaller frequency basis functions appear to be less significant. By exploiting these

facts the dimension of the state space can be intelligently decreased.

The Markov chain Monte Carlo algorithm can be adjusted at each iteration to include only the most important basis functions and decrease the computational time of the algorithm. After calculating Bθ, the column maximum of Bθ can be used to decide the number of basis functions. There are several options for thresholding in the literature. Hard thresholding and soft thresholding are two of the most common (Donoho & Johnstone, 1994). Hard thresholding involves setting all coefficients less than a certain value to 0, $b_j^* = b_j \times I(|b_j| > \varepsilon)$. Soft thresholding additionally subtracts ε from values which are non-zero, $b_j^* = \mathrm{sign}(b_j)(|b_j| - \varepsilon)\, I(|b_j| > \varepsilon)$. There are several arguments for and against either of these

thresholding techniques. A more elegant option is the threshold based on the generalized double Pareto distribution (Armagan et al., 2013). Using this more technical approach maintains the continuity of the target function without over-shrinking. This method sets bj to 0 when $|b_j| < \varepsilon\sqrt{\eta + 1}$. When $|b_j| > \varepsilon\sqrt{\eta + 1}$ the values are

$$b_j^* = \begin{cases} \dfrac{b_j - \varepsilon\sqrt{\eta+1} + \left[ b_j^2 + 2 b_j \varepsilon\sqrt{\eta+1} - 3\varepsilon^2(\eta+1) \right]^{1/2}}{2} & \text{if } b_j > 0, \\[2ex] \dfrac{b_j + \varepsilon\sqrt{\eta+1} - \left[ b_j^2 + 2 b_j \varepsilon\sqrt{\eta+1} - 3\varepsilon^2(\eta+1) \right]^{1/2}}{2} & \text{if } b_j < 0. \end{cases} \tag{4}$$

The value for ε defines the level of truncation and η controls the shrinkage. In order to be able to predict the run time of the learning algorithm, it may help to keep the number of basis functions constant. The only way to accomplish this would be to change the thresholding level, which is ε in the equations above. This approach will also ensure that a reasonable number of basis functions will always be present. For this work, the generalized double Pareto thresholding was used with ε chosen to yield a fixed number of basis functions for each iteration of the Markov chain Monte Carlo algorithm. If the column maximum of Bθ is less than $\varepsilon\sqrt{\eta + 1}$ then the column is not used and the dimensionality is reduced. If the column maximum is greater, then the values are adjusted as in equation (4). For this work, η was set to 1. Sensitivity to the choice of η was not explored, but the resulting amount of dimension reduction compared reasonably to what we expected.

Figure 3: The left plots show the kernel and the right plots show the maximum coefficient for every basis function across all spatial locations. Left panels ("Kernel"): both axes from −20 to 20. Right panels ("Largest Basis Coefficient"): basis function index from 0 to 400 on the horizontal axis.
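The simpler thresholding rules and the column-based dimension reduction can be sketched as follows. Only hard and soft thresholding are implemented; the generalized double Pareto rule of equation (4) is not reproduced here. The helper `keep_top_columns`, which retains a fixed number of columns ranked by their largest absolute coefficient, is an assumed way of fixing the number of basis functions, and the matrix `B` is randomly generated for illustration.

```python
import numpy as np

def hard_threshold(b, eps):
    """b* = b I(|b| > eps)."""
    return b * (np.abs(b) > eps)

def soft_threshold(b, eps):
    """b* = sign(b) (|b| - eps) I(|b| > eps)."""
    return np.sign(b) * (np.abs(b) - eps) * (np.abs(b) > eps)

def keep_top_columns(B, n_keep):
    """Keep the n_keep columns of B with the largest maximum absolute coefficient."""
    col_max = np.abs(B).max(axis=0)
    keep = np.sort(np.argsort(col_max)[::-1][:n_keep])
    return keep, B[:, keep]

b = np.array([-3.0, -0.5, 0.2, 2.0])
print(hard_threshold(b, 1.0))    # entries with |b| <= 1 are set to zero
print(soft_threshold(b, 1.0))    # surviving entries are also shrunk towards zero by 1

rng = np.random.default_rng(0)
scales = 1.0 / (1.0 + np.arange(400))          # coefficients tend to decay with index
B = rng.normal(size=(50, 400)) * scales        # 50 locations, 400 basis functions
keep, B_reduced = keep_top_columns(B, n_keep=60)
print(B_reduced.shape)
```

Fixing the number of retained columns, rather than the threshold itself, keeps the dimension of the state vector, and hence the cost of each iteration, constant.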

4 Sea Surface Temperature Data Analysis

Sea surface temperature in the tropical Pacific Ocean has been useful in predicting a number of climate phenomena

(Philander, 1985). The most prominent of these phenomena is El Nino, which is a warming of ocean temperature

occurring between −5◦ and 5◦ Latitude and 180◦ and 240◦ E Longitude (see http://www.elnino.noaa.gov). This

warming results in a shift of nutrients in the water which can affect agriculture and fisheries in several countries. It

can also have profound climate effects that include increased precipitation in some areas of the Americas, as well

as very dry conditions in some others (Ropelewski & Halpert, 1987). There is a rich history of work dedicated to

predicting when El Nino will occur, as well as its counterpart La Nina, which is the cooling that usually follows. El

Nino follows a 2 to 7 year cycle and typically begins in Autumn, staying as long as a year. While many deterministic

physical models have arisen to explain and predict the occurrences (Jan van Oldenborgh et al., 2005), much success

has come from simply using sea surface temperature data over a large region in the Pacific Ocean in a stochastic model

(Berliner et al., 2000). These methods include linear systems (Penland & Magorian, 1993), but nonlinear methods

have proven more successful (Wikle & Hooten, 2010; Cressie & Wikle, 2011). Non-linear methods have been applied

to sea surface temperature data in the context of integro-difference equation models (Wikle & Holan, 2011), although

the specific nature of the kernel distribution was not the focus of that application.

We will illustrate the bivariate stable kernel for integro-difference equation modelling using the sea surface tem-

perature data. We do not claim to have a breakthrough method of analyzing sea surface temperature for El Nino

prediction. We simply use the data set to illustrate the proposed model and its predictive power. The data includes

9692 locations on a 1° by 1° resolution grid. It is collected monthly from December 1980 to July 2015. Seasonality is

accounted for by modelling the anomalies, which are differences between the data and the long term monthly average

for each location. The anomalies for July 2015 are shown in Figure 4, showing the large El Nino that begins in the

summer of 2015. It can be seen in the figure how the warm temperatures gather near the equator East of the Date Line,

indicating El Nino. We will compare prediction and model accuracy between the spatially varying stable kernel and

the spatially varying Gaussian kernel.
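
The anomaly calculation described above can be illustrated for a single location as follows. This is a minimal sketch in plain Python; the function name `monthly_anomalies` and the toy series are hypothetical and are not taken from the data set.

```python
def monthly_anomalies(series, first_month):
    """Subtract the long-term average of each calendar month.

    series: monthly values for one location, in time order.
    first_month: calendar month (1-12) of series[0].
    """
    sums = [0.0] * 12
    counts = [0] * 12
    for t, value in enumerate(series):
        m = (first_month - 1 + t) % 12
        sums[m] += value
        counts[m] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [value - means[(first_month - 1 + t) % 12]
            for t, value in enumerate(series)]

# Toy series: two years starting in December, with a seasonal cycle
# and a uniform 1-degree warming in the second year.
seasonal = [20.0 + m for m in range(12)]  # January ... December
series = [seasonal[(11 + t) % 12] + (1.0 if t >= 12 else 0.0)
          for t in range(24)]
anoms = monthly_anomalies(series, first_month=12)
```

In the toy series the seasonal cycle cancels, leaving anomalies of −0.5 in the first year and +0.5 in the second.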

We apply the model of Section 3 to the sea surface temperature monthly anomalies. The parameter space is quite

large, including the observational variance σ2, the process variance τ2 and the kernel parameters, which include the

latent variables involved in the kernel convolution for the processes µ(s), q(s), and c(s), and the hyperparameters

for the latent ζ parameters for each process. The locations of the data are given in latitude and longitude, but for the

analysis they are scaled to between -10 and 10 in both directions and then scaled back to the original locations for


Figure 4: Data is shown for sea surface temperature anomalies for July 2015.

inference. The full model is

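$$
\begin{aligned}
Y_t \mid a_t, \sigma^2 &\sim N(\Psi a_t, \sigma^2 I_t), \quad t = 1, \ldots, T, & \sigma^2 &\sim IG(\alpha_\sigma, \beta_\sigma), \\
a_t \mid a_{t-1}, \tau^2, \theta &\sim N(G B_\theta a_{t-1}, \tau^2 G V G'), & \tau^2 &\sim IG(\alpha_\tau, \beta_\tau), \\
\gamma(z) &= c(s) \sum_{m=1}^{M} w_{M,m}\, \mathrm{Beta}(z/2\pi \mid m, M - m + 1)/2\pi, & \alpha &\sim p(\alpha), \\
w_{M,m} &= \sum_{j=1}^{J} q(s)(1 - q(s))^{j-1} Z_{jm}, & (Z_{j1}, \ldots, Z_{jM}) &\sim MN(1, (1/M, \ldots, 1/M)), \\
\phi^{-1}(q(s)) &= \mu_q \mathbf{1} + K_\zeta \zeta_q, \quad \zeta_q(u_i) \mid \sigma_q \sim N(0, \sigma_q), & \mu_q, \sigma_q &\sim p(\mu_q)\,p(\sigma_q), \\
\log c(s) &= \mu_c \mathbf{1} + K_\zeta \zeta_c, \quad \zeta_c(u_i) \mid \sigma_c \sim N(0, \sigma_c), & \mu_c, \sigma_c &\sim p(\mu_c)\,p(\sigma_c), \\
\mu_i(s) &= \mu_{\mu_i} \mathbf{1} + K_\zeta \zeta_{\mu_i}, \quad \zeta_{\mu_i}(u_i) \mid \sigma_{\mu_i} \sim N(0, \sigma_{\mu_i}), & \mu_{\mu_i}, \sigma_{\mu_i} &\sim p(\mu_{\mu_i})\,p(\sigma_{\mu_i}).
\end{aligned}
$$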

The matrix V is the unscaled spatial covariance matrix based on a Matern covariance function with κ = 1.5 and an

effective range of 2. The number of Bernstein polynomials used is M = 40 and the number of geometric weights is

equal to J = 40. The priors for σ2 and τ2 are IG(5, 3). The matrix Kζ maps the latent ζθ = (ζθ(u1), ..., ζθ(uQ))′

vectors for θ ∈ {q, σ, µ1, µ2} to the processes governing the integro-difference equation kernel parameters. The values

of Kζ correspond to a convolution kernel, where the (i, j)-th element is kζ(si − uj). The convolution kernel is a Matern

function with κ = 2.5 and an effective range of 4 for all the parameters, forcing a smooth evolution of the parameter

processes across the domain. The knots are chosen on a 30 by 30 grid from -11 to 11 in both directions, resulting in

a dimension reduction of the variables representing the spatially varying parameter processes from 9692 to 900. The

priors for the means of these processes are µq ∼ N(−1, 0.5), µc ∼ N(0, 4), and µµ1, µµ2 ∼ N(0, 1). The scale terms

for the process covariances are given priors of IG(4, 3) for σc and σµi, and IG(10, 6) for σq. These priors are based

on the scale of the data and reasonable shapes of kernels, but are not too restrictive. The process q(s) is very sensitive

to these priors, as the posterior is not necessarily identifiable, especially for very diffuse priors. There may be several


combinations of q(s) and the latent Zjk variables which result in the same values for wM,m, which are identifiable.

The other parameters are not overly sensitive to the prior.
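
To make the weight construction in the full model concrete, the following sketch computes the geometric weights and evaluates γ(z) at a single location. It is a minimal illustration in plain Python under stated assumptions: each multinomial draw (Zj1, ..., ZjM) is represented by the index of its single non-zero entry, the values q = 0.3 and c = 1 are arbitrary, and the function names `beta_pdf`, `geometric_weights`, and `gamma_density` are hypothetical. For simplicity the Beta density is set to zero at the endpoints.

```python
import math
import random

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1), computed via log-gamma."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # simplification at the endpoints
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x)
                    + (b - 1) * math.log(1.0 - x))

def geometric_weights(q, labels, M):
    """w[m-1] = sum_j q (1-q)^(j-1) 1{labels[j-1] == m}, labels in 1..M."""
    w = [0.0] * M
    for j, m in enumerate(labels, start=1):
        w[m - 1] += q * (1.0 - q) ** (j - 1)
    return w

def gamma_density(z, c, w):
    """gamma(z) = c * sum_m w_m Beta(z / 2pi | m, M - m + 1) / 2pi."""
    M = len(w)
    x = z / (2.0 * math.pi)
    mix = sum(w[m - 1] * beta_pdf(x, m, M - m + 1) for m in range(1, M + 1))
    return c * mix / (2.0 * math.pi)

rng = random.Random(0)
M, J, q, c = 40, 40, 0.3, 1.0
# One label per geometric weight, uniform on 1..M (the one-hot Z_j).
labels = [rng.randint(1, M) for _ in range(J)]
w = geometric_weights(q, labels, M)
```

With a finite J the weights sum to 1 − (1 − q)^J rather than exactly 1, so γ integrates to c times that quantity over (0, 2π). Different label assignments that place the same geometric mass on each basis function give the same weights, which is one way to see why the weights are better identified than q(s) and the latent variables separately.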

75,000 samples are drawn from the posterior distributions using Metropolis-Hastings with the first 60,000 as burn-

in, leaving 15,000 samples. The convergence was checked using trace plots of the values of the kernel densities

at each location. That is to say that the trace plots suggested convergence by 60,000 iterations of the values for

k(s|µ(s), σ(s), q(s)) for various locations throughout the domain of the data.

The posterior kernel and associated measure may reveal information about the nature of dependence between

locations. In Section 2.2, the property of elliptical symmetry is connected to the values of γ. The posterior kernel for

3 locations is shown in Figure 5 with the associated densities scaled by 2π. It is hard to compare the kernels at the

different locations by eye. The differences in the measure are much more easily detected for the three locations.

[Figure 5 panels, one row per location: dΓ at Location (171 E, 23 N), Location (166, -1), and Location (91 E, 9 N), each plotted over (0, 2π), with the corresponding Posterior Kernel Mean plotted over (-7.5, 7.5) in both directions.]

Figure 5: The scaled density corresponding to the measure Γ and associated 95% credible bands are shown for three locations on the left with the associated posterior mean kernel on the right.

To detect trends, we devise a method of measuring symmetry across the spatial field. Based on Lemma 1, we

use $\mathrm{sym}_{ell}(s) = \int_0^{\pi} \{\gamma_s(z) - \gamma_s(z + \pi)\}^2 \, dz$ as a metric of elliptical symmetry. Similarly, a metric of spherical

symmetry can be defined as $\mathrm{sym}_{sph}(s) = \int_0^{2\pi} \{\gamma_s(z)/c(s) - (2\pi)^{-1}\}^2 \, dz$. These symmetry metrics are plotted as

a function of spatial location in Figure 6. The kernels at locations south of the equator are symmetric whereas the

kernels for locations north of the equator are non-symmetric. Additionally, the spherical symmetry map is very similar

to the elliptical symmetry map. This illustrates the importance of using a kernel that is spatially varying and can be


(a) Elliptical Symmetry (b) Spherical Symmetry

Figure 6: The symmetry metrics are shown across the spatial field for both elliptical and spherical symmetry. For both metrics, smaller values are associated with more symmetric kernels.

asymmetric. The splitting of the symmetry of the stable kernel suggests that the relationship between the spatial and

temporal components of the process behaves differently on the two sides of the equator. The physical interpretation of

this is not clear.
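
The two symmetry metrics defined above are one-dimensional integrals and can be approximated numerically. The sketch below uses a midpoint rule in plain Python; the function names `sym_ell` and `sym_sph`, the number of quadrature points, and the three test densities are illustrative assumptions rather than quantities from the fitted model.

```python
import math

def sym_ell(gamma, n=2000):
    """Midpoint approximation of int_0^pi {gamma(z) - gamma(z + pi)}^2 dz."""
    h = math.pi / n
    return h * sum((gamma((i + 0.5) * h) - gamma((i + 0.5) * h + math.pi)) ** 2
                   for i in range(n))

def sym_sph(gamma, c, n=2000):
    """Midpoint approximation of int_0^{2 pi} {gamma(z)/c - 1/(2 pi)}^2 dz."""
    h = 2.0 * math.pi / n
    return h * sum((gamma((i + 0.5) * h) / c - 1.0 / (2.0 * math.pi)) ** 2
                   for i in range(n))

def uniform(z):
    # Constant density on (0, 2 pi): both metrics are zero.
    return 1.0 / (2.0 * math.pi)

def pi_periodic(z):
    # gamma(z) = gamma(z + pi), but not constant.
    return (1.0 + math.cos(2.0 * z)) / (2.0 * math.pi)

def skewed(z):
    # Not pi-periodic: mass concentrated around z = 0.
    return (1.0 + math.cos(z)) / (2.0 * math.pi)

metrics = {f.__name__: (sym_ell(f), sym_sph(f, 1.0))
           for f in (uniform, pi_periodic, skewed)}
```

The π-periodic example has a zero elliptical metric but a positive spherical metric, while the skewed example is positive on both, consistent with smaller values indicating more symmetric kernels.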

We will assess model performance through its predictive power. An additional model fit was conducted based on

a spatially varying Gaussian kernel, following Wikle (2002) and Xu et al. (2005). We score the predictions for each

time point based on energy scores of the form

$$
\mathrm{es}(F, y) = \frac{1}{m} \sum_{i=1}^{m} \|y^{(i)} - y\| - \frac{1}{2m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|y^{(i)} - y^{(j)}\|, \qquad (5)
$$

where $y^{(1)}, \ldots, y^{(m)}$ are samples from F, the posterior predictive distribution, and y denotes the data vector (Gneiting

et al., 2008). We find that the stable kernel integro-difference equation model scores better in 241 time points, which

is 59.7% of all time points. Based on these alone, the advantage seems minimal. However slight the departures

from elliptical symmetry were, they do exist, but this does not seem to affect the scoring results. We also compare

the models using K-step ahead forecasts. This is done by propagating the state variables through the process level

of the model, $a^*_t \sim N(G B_\theta a_{t-1}, \tau^2 G V G')$. This can be propagated several steps past the final time, T, followed

by drawing the K-step ahead prediction $Y^*_{T+k} \sim N(\Psi a^*_{T+k}, \sigma^2 I)$. The scoring comparison for the K-step ahead

predictions, including in-sample predictions for K = 0, is shown in Table 2. As can be seen, while the in-sample

and 1-step ahead predictions are comparable, the stable distribution consistently outperforms the normal in each of

the forecasts.
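
The energy score in equation (5) is straightforward to compute from posterior predictive draws. The following plain-Python sketch assumes the draws and the data vector are stored as lists of floats and uses the Euclidean norm; the function names `euclid` and `energy_score` and the toy inputs are hypothetical.

```python
import math

def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def energy_score(samples, y):
    """Sample version of equation (5); lower values indicate better forecasts."""
    m = len(samples)
    term1 = sum(euclid(s, y) for s in samples) / m
    term2 = sum(euclid(si, sj) for si in samples for sj in samples) / (2.0 * m * m)
    return term1 - term2

# Two toy predictive samples with the same spread, scored against y = 1.
y = [1.0]
centred = [[0.0], [2.0]]    # centred on the observation
shifted = [[10.0], [12.0]]  # same spread, far from the observation
score_centred = energy_score(centred, y)
score_shifted = energy_score(shifted, y)
```

The double sum costs O(m²) distance evaluations, so for long chains one might score a thinned subset of the predictive draws; whether that was done here is not stated in the text.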

To perform true out-of-sample predictions, the last four months were left out of the analysis, resulting in 400 time

points ending in March 2015 being included in the model fit. The posterior distributions for $Y^*_{401}, \ldots, Y^*_{404}$ are drawn as

part of the Markov chain Monte Carlo algorithm. The means of these posterior distributions are shown in Figure 7 for the stable

and normal kernel integro-difference equation models compared against the truth for March through July 2015. The

real data shows the El Nino which is known to be occurring in 2015. Both predictions include the El Nino warming


K-Ahead                     0     1     2     3     4     5
Percentage Stable Lower  48.2  59.7  69.7  62.0  57.5  56.1
Percentage Normal Lower  51.8  40.3  30.3  38.0  42.5  43.9

Table 2: Shown here is the percentage of time points where the energy score from the posterior predictive was lower for each model.

to some degree, but it is clear that the intensity of the predicted warming using the integro-difference equation model

with the stable kernel is much closer to the truth than that of the model with the normal kernel, especially closer to the coast

of South America.

There are several ways to numerically declare an El Nino event. Most of these involve high anomalies in certain

regions in the Pacific. For example, the official National Oceanic and Atmospheric Administration (NOAA) criterion

requires the block average anomalies in El Nino region 3.4 to be above 0.5°C for 3 consecutive months, although this

is modified to 5 consecutive months for the NOAA's Climate Prediction Center (Larkin & Harrison, 2005). The El

Nino region 3.4 includes 120° to 170° W Longitude and −5° to 5° Latitude. Figure 8 shows the block averages for the

data and fitted models. These estimates are sampled from the posterior predictive distributions for K-steps ahead given

all information through March 2015. Because the fitting procedure is Bayesian, we have a full probability distribution

for the predictions, allowing us to calculate distributions on the probability of future El Nino events. Figure 8 shows

the posterior predictive probabilities that an El Nino will occur within the next 9 months for all months dating back

to the beginning of the data set. The red X’s are when El Nino did occur. The green bars extend from 6 months prior

to an El Nino occurring to 3 months after, so blue dots that occur in the regions with the green bars are indications

of correct predictions of an El Nino event. The model still has some false positives and false negatives, but overall, it

seems to be able to predict an El Nino months in advance in most cases.
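
The way a posterior predictive sample translates into an event probability can be sketched as follows. This plain-Python illustration assumes that each predictive path is a sequence of region 3.4 block-average anomalies for the coming months and applies the 3-consecutive-month criterion with a 0.5°C threshold; the function names `has_event` and `event_probability` and the toy paths are hypothetical, and the exact procedure used for Figure 8 is not specified beyond the description above.

```python
def has_event(block_avgs, threshold=0.5, run_length=3):
    """True if block_avgs has run_length consecutive values above threshold."""
    run = 0
    for value in block_avgs:
        run = run + 1 if value > threshold else 0
        if run >= run_length:
            return True
    return False

def event_probability(predictive_paths, threshold=0.5, run_length=3):
    """Fraction of predictive paths that satisfy the criterion."""
    hits = sum(has_event(p, threshold, run_length) for p in predictive_paths)
    return hits / len(predictive_paths)

# Four toy 9-month predictive paths of block-average anomalies.
paths = [
    [0.2, 0.6, 0.7, 0.8, 0.4, 0.1, 0.0, 0.2, 0.3],  # 3-month run: event
    [0.6, 0.6, 0.4, 0.6, 0.6, 0.4, 0.6, 0.6, 0.4],  # runs of 2 only
    [0.1] * 9,                                       # never above threshold
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 1.1, 1.2],  # late 3-month run: event
]
prob = event_probability(paths)
```

Setting `run_length=5` gives the stricter variant attributed above to the Climate Prediction Center.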

5 Summary

We have proposed a model that explicitly considers space-time dynamics using a location-dependent linear evolution

equation that is specified by a wide and flexible family of infinitely-divisible kernels. The model can accommodate

both spatial and temporal non-stationarity. Furthermore, the proposed model incorporates observational error, and can

handle irregularly located observations and missing values.

In order to infer the characteristics of the kernel and the states of the spatio-temporal process of interest, we

use a Fourier basis representation for both the process and the kernel. A random measure is used as a prior for the

measure that controls the kernel shape. This, coupled with thresholding of the basis coefficients, produces a substantial


dimension reduction. As a result, the model, despite being more flexible, is more parsimonious than a fully flexible

Gaussian kernel model. In spite of this, the model’s predictive skills are superior to those of the more standard model,

for the non-trivial example considered in the paper. Also, the use of a hierarchical model within a Bayesian approach

allows for a full probabilistic assessment of the prediction uncertainty. In addition, the structure of the proposed model

allows for efficient parallelization of the sampling algorithm. As a result, experiments with simulations as well as with

real data show that the model can effectively handle large numbers of observations.

References

ARMAGAN, A., DUNSON, D. B. & LEE, J. (2013). Generalized double pareto shrinkage. Statistica Sinica 23, 119.

BERLINER, L. M., WIKLE, C. K. & CRESSIE, N. (2000). Long-lead prediction of pacific ssts via bayesian dynamic

modeling. Journal of climate 13, 3953–3968.

BROWN, P. E., ROBERTS, G. O., KARESEN, K. F. & TONELLATO, S. (2000). Blur-generated non-separable space–

time models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 847–860.

CRESSIE, N. & WIKLE, C. K. (2011). Statistics for spatio-temporal data. New York: John Wiley & Sons.

DONOHO, D. L. & JOHNSTONE, I. M. (1994). Threshold selection for wavelet shrinkage of noisy data. In Engineer-

ing in Medicine and Biology Society, 1994. Engineering Advances: New Opportunities for Biomedical Engineers.

Proceedings of the 16th Annual International Conference of the IEEE. IEEE.

GNEITING, T., STANBERRY, L. I., GRIMIT, E. P., HELD, L. & JOHNSON, N. A. (2008). Assessing probabilistic

forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. Test 17, 211–235.

HIGDON, D. (1998). A process-convolution approach to modelling temperatures in the north atlantic ocean. Environ-

mental and Ecological Statistics 5, 173–190.

JAN VAN OLDENBORGH, G., BALMASEDA, M. A., FERRANTI, L., STOCKDALE, T. N. & ANDERSON, D. L.

(2005). Did the ecmwf seasonal forecast model outperform statistical enso forecast models over the last 15 years?

Journal of climate 18, 3240–3249.

LARKIN, N. K. & HARRISON, D. (2005). On the definition of el nino and associated seasonal average us weather

anomalies. Geophysical Research Letters 32.

LEMOS, R. T. & SANSO, B. (2009). A spatio-temporal model for mean, anomaly, and trend fields of north atlantic

sea surface temperature. Journal of the American Statistical Association 104, 5–18.


MENA, R. H., RUGGIERO, M. & WALKER, S. G. (2011). Geometric stick-breaking processes for continuous-time

bayesian nonparametric modeling. Journal of Statistical Planning and Inference 141, 3217–3230.

NOLAN, J. P. (2013). Multivariate elliptically contoured stable distributions: theory and estimation. Computational

Statistics 28, 2067–2089.

NOLAN, J. P. (2014). Financial modeling with heavy-tailed stable distributions. Wiley Interdisciplinary Reviews:

Computational Statistics 6, 45–55.

PANORSKA, A. K. (1996). Generalized stable models for financial asset returns. Journal of computational and applied

mathematics 70, 111–114.

PENLAND, C. & MAGORIAN, T. (1993). Prediction of nino 3 sea surface temperatures using linear inverse modeling.

Journal of Climate 6, 1067–1076.

PETRONE, S. (1999). Bayesian density estimation using bernstein polynomials. Canadian Journal of Statistics 27,

105–126.

PHILANDER, S. (1985). El nino and la nina. Journal of the Atmospheric Sciences 42, 2652–2662.

RICHARDSON, R., KOTTAS, A. & SANSO, B. (2017). Flexible integro-difference equation modeling for spatio-

temporal data. Computational Statistics & Data Analysis 109, 182–198.

ROPELEWSKI, C. F. & HALPERT, M. S. (1987). Global and regional scale precipitation patterns associated with the

el nino/southern oscillation. Monthly weather review 115, 1606–1626.

SAMORODNITSKY, G. & TAQQU, M. S. (1997). Stable non-gaussian random processes. Econometric Theory 13,

133–142.

STORVIK, G., FRIGESSI, A. & HIRST, D. (2002). Stationary space-time gaussian fields and their time autoregressive

representation. Statistical Modelling 2, 139–161.

WIKLE, C. K. (2002). A kernel-based spectral model for non-gaussian spatio-temporal processes. Statistical Mod-

elling 2, 299–314.

WIKLE, C. K. & HOLAN, S. H. (2011). Polynomial nonlinear spatio-temporal integro-difference equation models.

Journal of Time Series Analysis 32, 339–350.

WIKLE, C. K. & HOOTEN, M. B. (2010). A general science-based framework for dynamical spatio-temporal models.

Test 19, 417–451.


XU, K., WIKLE, C. K. & FOX, N. I. (2005). A kernel-based spatio-temporal dynamical model for nowcasting

weather radar reflectivities. Journal of the American Statistical Association 100, 1133–1144.


(a) March 2015 Truth (b) March 2015 Stable Fitted (c) March 2015 Normal Fitted

(d) April 2015 Truth (e) April 2015 Stable Predictions (f) April 2015 Normal Predictions

(g) May 2015 Truth (h) May 2015 Stable Predictions (i) May 2015 Normal Predictions

(j) June 2015 Truth (k) June 2015 Stable Predictions (l) June 2015 Normal Predictions

(m) July 2015 Truth (n) July 2015 Stable Predictions (o) July 2015 Normal Predictions

Figure 7: The data and posterior K-step ahead predictions using the stable and normal kernels are shown given information through March 2015. The months March 2015 through July 2015 are shown.


[Figure 8 plot: Stable IDE El Nino Prediction Probabilities; x-axis: Year (1985 to 2015); y-axis: Probability (0.5 to 1.0).]

Figure 8: Posterior probabilities that an El Nino will occur within the coming 9 months. The red X's show El Nino occurrences and the green bars extend from 6 months before the El Nino event to 3 months after. Additionally, points with probabilities above .8 are dotted blue.
