Upload
vuongdung
View
231
Download
3
Embed Size (px)
Citation preview
Spatio-Temporal Modelling using Integro-Difference Equations
with Bivariate Stable Kernels
Robert Richardson
Department of Statistics, Brigham Young University, Provo, Utah 84602, U.S.A.
Athanasios Kottas and Bruno Sanso
Department of Applied Mathematics and Statistics, University of California at Santa Cruz, Santa Cruz, California 95064, U.S.A.
March 1, 2017
Abstract
Integro-difference equations can be represented as hierarchical spatio-temporal dynamic models using appropriate
parameterizations. The dynamics of the process defined by an integro-difference equation depends on the choice of a
bivariate kernel distribution, where more flexible shapes generally result in more flexible models. We consider the use
of the stable family of distributions for the kernel, as they are infinitely divisible and offer a variety of tail behaviours,
orientations and skewness. Many of the attributes of the bivariate stable distribution are controlled by a measure,
which we model using a flexible Bernstein polynomial basis prior. This nonparametric prior specification, along
with spatially dependent parameters for the stable kernel distribution, results in a model which is more general than
existing integro-difference equation models. A Fourier basis representation with thresholding of the relevant basis
functions and parallelization of certain parts of the posterior simulation algorithm allow the computational burden to
be reduced. The method is shown to improve prediction over the state-of-the-art Gaussian kernel model in an example
using Pacific sea surface temperature data.
Keywords: Bernstein polynomials; Fourier series; Geometric weights prior; Markov chain Monte Carlo; Semi-
parametric Bayesian modelling.
1 Introduction
Spatio-temporal data analysis requires the specification of spatial and temporal dynamics. In separable models, the
space and time components are modelled independently. However, spatio-temporal processes that occur naturally
in fields such as climate, ecology, and epidemiology often exhibit an interaction between the spatial and temporal
1
dynamics. Learning the nature of these space-time interactions is a key component of spatio-temporal modelling.
Hierarchical dynamic spatio-temporal models (Wikle & Hooten, 2010) provide a framework that can capture a wide
variety of non-separable model parameterizations. We explore a particular parameterization of a hierarchical dynamic
spatio-temporal model that offers a great deal of flexibility in capturing complicated physical characteristics, while
providing computational tractability.
Denoting by Xt(s) the process at time t and spatial location s, an integro-difference equation model is written as
Xt+1(s) =
∫k(u− s | θ)Xt(u) du + εt(s)
where εt(s) is a zero mean error process which may be spatially colored, and k is a bivariate redistribution kernel.
The kernel parameters θ may be spatially dependent, in which case they are denoted by θs. The concept of an
integro-difference equation is that the process at time t is propagated in time by re-weighting it using kernel k, acting in
space. This model provides a simple, yet effective and interpretable way to express the relationship between the space
and time components, using linear dynamics. As the kernel shifts or expands, the nature of the dependence between
the process at a specific location and the process at previous time points changes accordingly. In fact, it has been
shown that the first two moments of the kernel determine, respectively, the advection and the diffusion of the process
(Brown et al., 2000; Storvik et al., 2002). In Richardson et al. (2017) it is shown that integro-difference equations
with non-Gaussian kernels allow for the description of higher order properties in the spatio-temporal process. Thus,
although complex spatial fields will often exhibit non-linear dynamics, non-Gaussian kernels can help explain some of
the complexity. This feature can be enhanced by making the kernel parameters vary across the spatial region, as in Xu
et al. (2005). The introduction of spatial non-stationarity in the model is then used to capture the physical properties of
the field of interest, through the characteristics of the kernels (Lemos & Sanso, 2009). This is the approach we follow
here. In contrast, Wikle & Holan (2011) and Cressie & Wikle (2011) detail methods that account explicitly for non-
linearity. Such approaches offer powerful modelling tools, but suffer from inferential and computational difficulties.
In essentially all applications of integro-difference equation models, Gaussian kernels are used, owing to the
convenience in their specification and computation. To apply flexible integro-difference equation modelling to spatio-
temporal data, we propose a bivariate stable kernel due to its ability to capture varying tail behaviour and levels of
skewness. In addition, stable kernels are infinitely divisible. This property is important to guarantee that the evolution
equation of the process remains valid when the time scale is changed (Brown et al., 2000). An important feature of
the bivariate stable distribution is that it is defined through a measure Γ which dictates key distributional properties,
such as skewness, spread, and orientation. By flexibly modelling this measure we can achieve a wide variety of
kernel shapes in two dimensions. This treatment of the kernel provides a more general integro-difference equation
and improves prediction relative to Gaussian kernel models. To our knowledge, this is the first attempt at comparing
2
Gaussian versus non-Gaussian kernels in integro-difference equations for two-dimensional space. The proposed work
also offers a novel Bayesian semiparametric modelling approach for the bivariate stable distribution. Finally, while the
flexibility afforded by the proposed model is greater than that of Gaussian kernel models, the computational burden is
not significantly larger.
Section 2 provides relevant background and covers technical details useful in the fitting of integro-difference
equations with stable distributions. Methods of modelling for the measure Γ and inference for the model parameters
are given in Section 3. Section 4 details an analysis of sea surface temperature data comparing the bivariate stable
kernel with a Gaussian kernel. Using this data set, we show that the proposed integro-difference equation model
improves prediction over the Gaussian kernel model. Finally, Section 5 concludes with a summary.
2 Stable Distributions
2.1 Definitions
Here, we discuss some of the key properties of the class of stable distributions. A bivariate random vector X belongs
to the stable family of distributions if and only if for any positive constants a and b and independent copies X(1)
and X(2) of X , there exists a positive constant c and vector D ∈ R2 such that aX(1) + bX(2) and cX + D are
equal in distribution (Samorodnitsky & Taqqu, 1997). Many properties stem from this definition, including infinite
divisibility. Integro-difference equations with infinitely divisible kernels can be connected to well-studied partial
differential equation systems, and thus an infinitely divisible kernel is an attractive choice. As a family of distributions
without finite moments, the stable family has been used for applications involving heavy-tailed distributions (Panorska,
1996; Nolan, 2014).
In one dimension, the stable family is defined by four parameters: location parameter µ ∈ R, scale parameter
c > 0, parameter α ∈ (0, 2] which controls the thickness of the tails, and skewness parameter β ∈ [−1, 1]. The
extension to higher dimensions does not result in a comparably simple parametric form. Lacking a general closed-
form expression for its density, the stable distribution is typically defined through its characteristic function. For
α 6= 1, and for d ≥ 2, the d-dimensional stable characteristic function is given by
g(t | α, µ,Γ) = exp
{it′µ−
∫v∈Sd
|t′v|α [1− i sign(t′v) tan(πα/2)] dΓ(v)
}. (1)
Relating this to the one-dimensional stable distribution, the vector µ ∈ R2 is a location parameter vector, α ∈ (0, 2]
controls tail behavior, and Γ is a finite measure that controls characteristics of the distribution such as skewness,
orientation, and spread, effectively extending c and β to the multivariate setting. The integration space Sd is the unit
sphere in Rd, that is, the unit circle, S2, in two dimensions. In this case, a change of variables can be made to v =
3
dΓ = 5-4I(z>π)
-20 -10 0 10 20
-20
-10
010
20
dΓ = 3
-20 -10 0 10 20
-20
-10
010
20
dΓ = 1+4I(z>π)
-20 -10 0 10 20
-20
-10
010
20
dΓ = 1
-20 -10 0 10 20
-20
-10
010
20
dΓ = 2
-20 -10 0 10 20
-20
-10
010
20
dΓ = 3
-20 -10 0 10 20
-20
-10
010
20
dΓ = 2sin(2z)+2.5
-20 -10 0 10 20
-20
-10
010
20
dΓ = 2sin(2z+.5π)+2.5
-20 -10 0 10 20
-20
-10
010
20
dΓ = 2sin(2z+π)+2.5
-20 -10 0 10 20
-20
-10
010
20
Figure 1: Bivariate stable densities with µ = (0, 0)′, α = 1.5, and different choices for the Γ measure. In the toprow, γ is a step function resulting in a skewed distribution. In the middle row, γ is a constant function resulting indifferent spreads of the distribution. In the bottom row, γ is a sine function with a changing shift, resulting in differentorientations of the distribution.
(cos(z), sin(z))′ where the integral is now over z ∈ [0, 2π]. The unscaled density of the measure, γ = dΓ, must be
non-negative, but can take on a variety of shapes. Figure 1 shows how the shape of the distribution changes with γ.
Skewness changes when γ(z) and γ(z+π) are more disparate. The spread of the distribution changes with the scale of
the measure, that is, a measure with larger values for all z will have a larger spread. The orientation of the distribution
rotates with a shift in γ. A variety of other distributional shapes can be achieved by combining these properties.
2.2 Symmetry
In Richardson et al. (2017) it is shown that the degree of symmetry of an integro-difference equation kernel can
be connected to the physical characteristic of dispersion for the process. Thus, understanding the symmetry of a
bivariate stable kernel is key in the context of the proposed space-time modelling approach. Elliptically contoured
stable distributions are a simplification of stable laws (Nolan, 2013) and have characteristic functions of the form
exp{it′µ− (t′Σt)α/2}, where Σ is a positive definite matrix. These are also referred to as sub-Gaussian distributions
because they can be represented as scale mixtures of normals. A stable distribution is called symmetric α-stable if
4
there exists µ such that −X + µ and X + µ have the same distribution. This property is equivalent to that of elliptical
symmetry, which is often defined using the first and second moments. Equation 2.5.8 in Samorodnitsky & Taqqu
(1997) provides a specific form for Γ that yields elliptically contoured stable laws. The following lemma provides a
necessary and sufficient condition for elliptical symmetry.
Lemma 1. The condition γ(z) = γ(z+π), for z ∈ [0, π], is necessary and sufficient for a bivariate stable distribution
to be elliptically symmetric.
Proof. Without loss of generality, assume µ = (0, 0)′. Using a Fourier representation of the density, f(x) =
(2π)−2∑t∈Z2 g(t) exp(it′x), where g(t) is given by (1) with µ = (0, 0)′. Hence, f(x) − f(−x) can be written
as (2π)−2∑t∈Z2{g(t)− g(−t)} exp(it′x).
First, assume the distribution is elliptically symmetric, which implies f(x)−f(−x) = 0. Then, Fourier and inverse
Fourier transform properties yield g(t) − g(−t) = 0, for all t ∈ Z2. By collecting terms and dividing constants, we
obtain∫ 2π
z=0sign(t′v)|t′v|αdΓ(z) = 0. The integral can be split and written as
∫ πz=0|t′v|αdΓ(z)−
∫ πz=0|t′v|αdΓ(z +
π) = 0, for all t ∈ Z2, which holds true only when γ(z) = γ(z + π), for z ∈ [0, π].
Now, assume that γ(z) = γ(z + π), for z ∈ [0, π]. By working backwards in the above argument, this implies∫ 2π
z=0sign(t′v)|t′v|αdΓ(z) = 0. The characteristic function then becomes exp
{∫ 2π
z=0|t′v|α tan(πα/2)dΓ(z)
}. There-
fore, g(t)− g(−t) = 0, for t ∈ Z2, and thus f(x)− f(−x) = 0, resulting in the property of elliptical symmetry.
This result formalizes our previous statement, based on graphical exploration, that varying degrees of skewness in
the bivariate stable distribution can be achieved with more or less disparate values of γ(z) and γ(z + π).
2.3 Two-Dimensional Fourier Series Representation
The typical method for fitting an integro-difference equation model to spatio-temporal data involves decomposing the
process and the kernel into an orthonormal basis series expansion (Wikle, 2002). For a given set of basis functions,
ψi(s), the process is written asXt(s) =∑∞i=1 ai(t)ψi(s) and the kernel as k(u−s | θs) =
∑∞j=1 bj(s, θs)ψj(u). Any
orthonormal basis will then lead to the representation of the process at time t + 1 as Xt+1(s) =∑∞i=1 ai(t)bi(s, θs).
The process coefficients, ai(t), must be estimated, whereas, given parameters θs, the kernel coefficients, bj(s, θs), are
determined by the choice of kernel distribution.
Owing to the connection between the Fourier transform for densities and the characteristic function, a frequent
choice of basis is the Fourier series. If g(t), where t = (t1, t2)′, is the characteristic function of a bivariate density f ,
then, f(x) = (2π)−2∑t∈Z2 g(t) exp(it′x). When g(t) = exp{a(t)+ib(t)}, where a(t) and b(t) are real-valued scalar
functions, the real basis functions are cos(t′x) and sin(t′x) and the coefficients of the expansion are, respectively,
exp{a(t)} cos{b(t)} and exp{a(t)} sin{b(t)}. For identifiability, we can combine certain basis functions such that
5
Basis Function Coefficient
(2π)−1 cos (t′x) π−1 exp{∫ 2πz=0 |t
′v|α tan(πα/2)dΓ(z)} cos{t′µ−∫ 2πz=0 sign(t′v)|t′v|α tan(πα/2)dΓ(z)}
(2π)−1 sin (t′x) π−1 exp{∫ 2πz=0 |t
′v|α tan(πα/2)dΓ(z)} sin{t′µ−∫ 2πz=0 sign(t′v)|t′v|α tan(πα/2)dΓ(z)}
Table 1: Basis functions and coefficients for the real-valued Fourier basis expansion of the bivariate stable distributionwhere t ∈ Z2.
only the indices where t1 ≥ 1 for all t2, or for t2 ≥ 1 when t1 = 0, are included in the basis function set. The resulting
coefficients are exp{a(t)} cos{b(t)} + exp{a(−t)} cos{b(−t)} and exp{a(t)} sin{b(t)} − exp{a(−t)} sin{b(−t)}.
When t1 = t2 = 0, the basis coefficient is exp{a(0)} cos{b(0)}.
The characteristic function for the multivariate stable in equation (1) can be written for the bivariate case as
exp{a(t)+ib(t)}, where a(t) =∫ 2π
z=0|t′v|α tan(πα/2)dΓ(z), b(t) = t′µ−
∫ 2π
z=0sign(t′v)|t′v|α tan(πα/2)dΓ(z), and
v = (cos(z), sin(z))′. Table 1 includes the Fourier basis functions and coefficients for the bivariate stable distribution.
These basis coefficients are defined for t1 ≥ 1 for all t2, and for t2 ≥ 1 when t1 = 0. The basis coefficient when
t1 = t2 = 0 is (2π)−1, and only the cosine basis function should be included. In the IDE decomposition, the values in
Table 1 will be the coefficients bj(s, θs), where in the spatially varying case, the parameters µ and Γ will be replaced
with s + µs and Γs. The pairs of Fourier frequencies used in the decomposition will need to be ordered to match the
sum from 1 to ∞. The particular ordering doesn’t matter as long as the ordering for the coefficients, bj(s, θs) and
basis functions, ψj(s), match.
3 Methodology
3.1 Semiparametric Model for Γ
Here, we present our integro-difference equation model, starting with a semiparametric model for the bivariate stable
distribution kernel which is then extended to incorporate spatially dependent parameters.
As indicated in Section 2.3, and detailed later in Section 3.2, the stable distribution enters the hierarchical model
for the data through its characteristic function, thus overcoming the lack of a closed-form expression for the density.
Its definition in terms of the finite-dimensional parameter vector (α, µ) and measure Γ provides a natural setting for
a Bayesian semiparametric model specification. In particular, we develop a model for γ = dΓ building on ideas from
Bernstein polynomial priors for densities with compact support (Petrone, 1999). In light of the ensuing extension to
spatially dependent parameters for the kernel distribution, a key consideration for the model structure is to balance
model flexibility and computational feasibility for inference.
The fact that Γ is a finite measure on [0, 2π] implies that, up to a scaling parameter c > 0, γ = dΓ is a density
6
function on [0, 2π]. We model γ using a weighted combination of Beta densities with different shapes,
γ(z) = c (2π)−1M∑m=1
wm be(z/2π | m,M −m+ 1), z ∈ [0, 2π],
where be(· | a, b) denotes the density of the Beta distribution with mean a/(a + b). The mixture weights are defined
through an underlying distribution function F on [0, 2π], more specifically, wm = F (2πm/M)− F (2π(m− 1)/M),
for m = 1, ...,M .
Key to the flexibility of this mixture representation is the capacity of F to admit general shapes, including mul-
timodalities, which can lead to effective selection of the Beta mixture components that contribute most to the con-
struction of γ. This motivates using a nonparametric prior for F that supports general discrete distributions on [0, 2π].
In particular, we assign a geometric weights prior (Mena et al., 2011) to the distribution F associated with random
distribution function F , such that F =∑∞j=1 q(1 − q)j−1 δxj
. Here, the xj are independent from a base distribution
on [0, 2π], which is taken to be uniform, and q follows a distribution on the unit interval. Hence, F is a discrete distri-
bution with atoms xj and corresponding weights q(1− q)j−1 defined through a single random variable q. This model
structure is more economical in the number of parameters than, for instance, the stick-breaking weights of the Dirichlet
process, enabling the extension to a spatially varying kernel distribution. In practice, the countable representation for
F is truncated at finite level J , defining the last weight to be 1 minus the sum of the previous weights to force a proper
probability vector. The expected sum of the weights between J + 1 and∞ for a given q is equal to (1− q)J−1, so J
must be large enough to make this value close to 0, but for efficiency may be adjusted according to the expected value
of q.
Note that wm arises by summing up the geometric weights q(1 − q)j−1 for the xj that lie in interval (2π(m −
1)/M, 2πm/M). Since the xj are uniformly distributed on [0, 2π], we can associate a Multinomial(1, (1/M, ..., 1/M))
vector, (Zj1, ..., ZjM ), with each xj , and express the mixture weights as wm =∑Jj=1 q(1 − q)j−1Zjm, for m =
1, ...,M . This replaces a continuous latent variable with a discrete variable, which facilitates estimation. By construc-
tion, only one element of (Zj1, ..., ZjM ) is 1, the rest being 0, and thus the dimensionality of the new parameter set
remains effectively the same.
Therefore, measure Γ is defined in terms of parameters c, q, and {Zjm : j = 1, ..., J ;m = 1, ...,M}, and
the specification of the bivariate stable kernel distribution is completed with location parameter vector µ = (µ1, µ2)
and tail parameter α. To construct a non-stationary spatio-temporal process, we elaborate on the model such that
parameters µ, c, and q are spatially varying. To illustrate the flexibility of the proposed model and the effect of
spatially varying q, Figure 2 shows how the bivariate stable density changes with q with M = 40 and J = 40. Even
though the atoms are the same for each kernel, the shape can drastically change in both skewness and orientation
by only varying q. Also, Figure 2 shows that the kernel changes smoothly with q, implying that kernels at nearby
7
q = 0.1
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.2
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.3
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.4
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.5
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.6
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.7
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.8
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
q = 0.9
-15 -10 -5 0 5 10 15
-15-10
-50
510
15
Figure 2: A bivariate stable density is shown with a scaled Bernstein polynomial measure and a geometric weightsbase distribution. µ = (0, 0)′, c = 2π, and α = 1.5 for each plot. Only the geometric weight, q, changes. The first 10atoms are (2, 5.14, 3.6, .4, 3.4, .6, 3.5, .5, 3.5, .5). The other atoms were randomly drawn from a uniform distributionon [0, 2π], but are the same for all 9 figures.
locations and having similar values for q, will be similar in shape. This smooth transition of kernel shape is desirable
for integro-difference equation spatio-temporal models. Further exploration has revealed that the added flexibility for
choosing greater than 40 Bernstein polynomials (M > 40) is not significant enough to justify the extra computational
cost. The support does not expand much past this point.
The spatially dependent parameters are modelled via kernel process convolutions (Higdon, 1998) for µ1(s), µ2(s),
log(c(s)), and Φ−1(q(s)), where Φ is the standard normal distribution function. For instance, for µ1(s), consider a
grid of knots u1, ..., uQ and latent variables (ζµ1(u1), ..., ζµ1
(uQ)) which are independent N(0, σµ1). Then, µ1(s) =
mµ1+∑Qi=1 kζ(ui, s)ζµ1
(ui), where kζ(ui, s) is come convolution kernel such as a Gaussian or a Matern function.
The process µ2(s) has a similar construction. Analogously, log(c(s)) = mc +∑Qi=1 kζ(ui, s)ζc(ui), where ζc(ui) ∼
N(0, σc), and Φ−1(q(s)) = mq +∑Qi=1 kζ(ui, s)ζq(ui), where ζq(ui) ∼ N(0, σq). Using process convolutions
reduces the parameter space as Q is typically set to be a number much smaller than the number of locations, but they
may potentially ignore some short range variability. The parameter processes, however, should vary smoothly in space,
so a process convolution approximation is acceptable.
In section 4, the flexibility of the model will be explored through the analysis of sea surface temperature data.
As a comparison, a Gaussian kernel IDE model will also be fit with all 5 parameters varying in space: two location
parameters, two variances, and the covariance. The proposed bivariate stable kernel model achieves a wider array
8
of kernel shapes with four spatially varying process parameters, thus offering a more general inference framework
without sacrificing computational feasibility relative to the state-of-the-art Gaussian kernel model.
3.2 Posterior Sampling
Define a spatial data vector Yt = (Yt(st,1), ..., Yt(st,nt))′ where the number of data points and the locations of
observations may change over time. Using the basis function decomposition of the process and the distribution, the
integro-difference equation model with a stable kernel can be represented as a dynamic linear model. The full model
for a given kernel parameter set θ is
Yt | at, σ2 ∼ N(Ψtat, σ2It), (2)
at | at−1, τ2, θ ∼ N(GtBθ,t at−1, τ2GtVtG
′t) t = 1, ..., T, (3)
where the i-th row of Ψt contains the Fourier basis functions as found in Table 1 evaluated at the i-th spatial
location at time t, at contains the stochastic basis function coefficients for the process which will be estimated,
Gt = (Ψ′tΨt)−1Ψ′t, and the i-th row of Bθ,t contains the fixed coefficients for the basis expansion of the kernel
given the parameters, also found in Table 1 with µ and Γ replaced with st,i +µst,i and Γst,i . More details on how this
formulation is derived for a general kernel is found in Wikle (2002) and Xu et al. (2005). The observational variance
σ2It is a scaled identity. The matrix Vt is a spatial covariance that is transformed spectrally by Gt and scaled by τ2.
The state vectors {a0, ..., aT } are sampled via dynamic linear model filtering. The kernel parameters will be up-
dated using Monte Carlo Markov chain sampling. Using Gibbs sampling within Monte Carlo Markov chain, the poste-
rior distribution for the latent variables is proportional to p(at | at−1, τ2, σ20 , {ζµj (ui) : j = 1, 2}, {ζc(ui)}, {ζq(ui)})p({ζµ1(ui)}).
The prior for the latent variables ζµ1(ui) are N(0, σµ1
). The parameter σµ1is given an inverse gamma prior. Con-
ditional on ζµ1(ui) for i = 1, ..., Q, the posterior distribution for σµ1
will also be inverse gamma. The conditional
posterior distributions for σµ2 , σc, and σq are also conjugate if they are paired with inverse gamma priors. With
normal priors on mµ1, mµ2
, mc, and mq , the posteriors are conjugate as well.
The parameter α for the stable distribution can be difficult to learn. To aid in estimation, we assume a discrete
prior for α, on a grid between 1.05 and 2 with step size .05. With a uniform discrete prior, the posterior probability that
α = λi is proportional to the likelihood evaluated at α = λi. Another advantage of discretizing α is that the integrals
in the basis coefficients from Table 1 can be calculated prior to the Monte Carlo Markov chain for each Bernstein
polynomial and for each possible value for α, resulting in a significant speed-up. The Zjm variables are also discrete.
For each set Zj1, ..., ZjM , the single variable which is equal to 1 can be sampled from a discrete posterior. Allowing lj
to be an indicator variable where lj = m when Zjm is 1, the probabilities are proportional to the likelihood evaluated
9
at lj = m, which means that it can be sampled discretely with conditional posterior probability
p(lj = m|{at : t = 1, ..., T}, ·) =
∏Tt=1 p(at|at−1, ·, lj = m)p(lj = m)∑M
i=1
∏Tt=1 p(at|at−1, ·, lj = i)p(lj = i)
.
This should be done for all J sets of latent variables. We note that, from a computational perspective, the discreteness
of these variables lends itself to parallelization, as does sampling the discretized variable α. This is done by sending
the calculations of the components of the discrete probabilities to different nodes and then collecting the proportional
posteriors to calculate the probabilities. For a large enough data set and using M nodes, this can significantly increase
the speed of the Monte Carlo Markov chain. For this work, the parallelization was done in C++ using openMP.
3.3 Thresholding
The method of decomposing the process and the kernel using a basis expansion requires specification of how to truncate
to a finite number of basis functions (Xu et al., 2005). In one dimension, the number of basis functions required for
accurately representing the kernel in integro-difference equation modelling is relatively small, ranging between 20 and
100. To achieve the same level of accuracy in 2 dimensions, hundreds of basis functions may be required. The main
determining factor of the optimal number of basis functions to use is the width of the kernel compared to the range of
the data. For example, when using a normal kernel, a higher variance requires fewer basis functions. This means that
the optimal number of basis functions to use is not constant but changes throughout the Monte Carlo Markov chain.
Recall that Bθ is a matrix where the (i, j)-th element is the coefficient of the j-th basis function at location si. Figure
3 shows how the width of the kernel affects the number of influential basis functions. This shows that the number of
significant basis functions for a particular basis index across all locations is generally smaller when the spread is larger.
Two other observations can be made from this. The first is that a small percentage of these are significantly larger than
0. The other observation is that the size of the coefficient is not fully determined by the order of the frequencies. As
basis function index increases, the frequency of the function increases. The general trend is that the coefficients get
smaller, but several coefficients of smaller frequency basis functions appear to be less significant. By exploiting these
facts the dimension of the state space can be intelligently decreased.
The Monte Carlo Markov chain can be adjusted at each iteration to include only the most important basis functions
and decrease computational time of the algorithm. After calculating Bθ, the column maximum of Bθ can be used to
decide the number of basis functions. There are several options for thresholding in the literature. Hard thresholding
and soft thresholding are two of the most common (Donoho & Johnstone, 1994). Hard thresholding involves setting all
coefficients less than a certain value to 0, b∗j = bj × I(|bj | > ε). Soft thresholding additionally subtracts ε from values
which are non-zero, b∗j = sign(bj)(|bj | − ε)I(|bj | > ε). There are several arguments for and against either of these
thresholding techniques. A more elegant option is the threshold based on the generalized double Pareto distribution
10
Kernel
-20 -10 0 10 20
-20
-10
010
20
0 100 200 300 400
0.0000.0100.0200.030
Largest Basis Coefficient
Kernel
-20 -10 0 10 20
-20
-10
010
20
0 100 200 300 4000.00
0.01
0.02
0.03
0.04 Largest Basis Coefficient
Figure 3: The left plots show the kernel and the right plots shows the maximum coefficient for every basis functionacross all spatial locations.
(Armagan et al., 2013). Using this more technical approach maintains the continuity of the target function without
over-shrinking. This method sets bj to 0 when bj < ε√
(η + 1). When bj > ε√
(η + 1) the values are
b∗j =
bj−ε√
(η+1)+[b2j+2bjε√
(η+1)−3ε2(η+1)]1/2
2 if bj > 0
bj+ε√
(η+1)−[b2j+2bjε√
(η+1)−3ε2(η+1)]1/2
2 if bj < 0
(4)
The value for ε defines the level of truncation and η controls the shrinkage. In order to be able to predict the length of
the learning algorithm, it may help to keep the number of basis functions constant. The only way to accomplish this
would be to change the thresholding level, which is ε in the equations above. This approach will also ensure that a
reasonable number of basis functions will always be present. For this work, the generalized double Pareto thresholding
was used with ε chosen to yield a fixed number of basis functions for each iteration of the Monte Carlo Markov chain.
If the column maximum of Bθ is less than ε√
(η + 1) then the column is not used and the dimensionality is reduced.
If the column maxmum is greater, then the values are adjusted as in equation (4). For this work, η was set to 1.
Sensitivity of the choice for η was not explored, but the resulting amount of dimension reduction compared reasonably
to what we expected.
11
4 Sea Surface Temperature Data Analysis
Sea surface temperature in the tropical Pacific Ocean has been useful in predicting a number of climate phenomena
(Philander, 1985). The most prominent of these phenomena is El Nino, which is a warming of ocean temperature
occurring between −5◦ and 5◦ Latitude and 180◦ and 240◦ E Longitude (see http://www.elnino.noaa.gov). This
warming results in a shift of nutrients in the water which can affect agriculture and fisheries in several countries. It
can also have profound climate effects that include increased precipitation in some areas of the Americas, as well
as very dry conditions in some others (Ropelewski & Halpert, 1987). There is a rich history of work dedicated to
predicting when El Nino will occur, as well as its counterpart La Nina, which is the cooling that usually follows. El
Nino follows a 2 to 7 year cycle and typically begins in Autumn, staying as long as a year. While many deterministic
physical models have arisen to explain and predict the occurrences (Jan van Oldenborgh et al., 2005), much success
has come from simply using sea surface temperature data over a large region in the Pacific Ocean in a stochastic model
(Berliner et al., 2000). These methods include linear systems (Penland & Magorian, 1993), but nonlinear methods
have proven more successful (Wikle & Hooten, 2010; Cressie & Wikle, 2011). Non-linear methods have been applied
to sea surface temperature data in the context of integro-difference equation models (Wikle & Holan, 2011), although
the specific nature of the kernel distribution was not the focus of that application.
We will illustrate the bivariate stable kernel for integro-difference equation modelling using the sea surface tem-
perature data. We do not claim to have a breakthrough method of analyzing sea surface temperature for El Nino
prediction. We simply use the data set to illustrate the proposed model and its predictive power. The data includes
9692 locations on a 1◦ by 1◦ resolution grid. It is collected monthly from December 1980 to July 2015. Seasonality is
accounted for by modelling the anomalies, which are differences between the data and the long term monthly average
for each location. The anomalies for July 2015 are shown in Figure 4, showing the large El Nino that begins in the
summer of 2015. It can be seen in the figure how the warm temperatures gather near the equator East of the Date Line,
indicating El Nino. We will compare prediction and model accuracy between the spatially varying stable kernel and
the spatially varying Gaussian kernel.
We apply the model of Section 3 to the sea surface temperature monthly anomalies. The parameter space is quite
large, including the observational variance σ2, the process variance τ2 and the kernel parameters, which include the
latent variables involved in the kernel convolution for the processes µ(s), q(s), and c(s), and the hyperparameters
for the latent ζ parameters for each process. The locations of the data are given in latitude and longitude, but for the
analysis they are scaled to between -10 and 10 in both directions and then scaled back to the original locations for
12
Figure 4: Data is shown for sea surface temperature anomalies for July 2015.
inference. The full model is
Yt|at, σ2 ∼ N(Ψat, σ2It), t = 1, ..., T, σ2 ∼ IG(ασ, βσ),
at|at−1, τ2, θ ∼ N(GBθat−1, τ2GV G′), τ2 ∼ IG(ατ , βτ ),
γ(z) = c(s)
M∑k=1
wM,mBeta(z/2π|m,M −m+ 1)/2π, α ∼ p(α),
wM,m =
J∑j=1
q(s)(1− q(s))j−1Zjm, (Zj1, ..., ZjM ) ∼ MN(1, (1/M, ..., 1/M)),
φ−1(q(s)) = µq1 +Kζζq, ζq(ui)|σq ∼ N(0, σq), µq, σq ∼ p(µq)p(σq),
log c(s) = µc1 +Kζζc, ζc(ui)|σc ∼ N(0, σc), µc, σc ∼ p(µc)p(σc),
µi(s) = µµi1 +Kζζµi
, ζµi(ui)|σµi
∼ N(0, σµi), µµi
, σµi∼ p(µµi
)p(σµi).
The matrix V is the unscaled spatial covariance matrix based on a Matern covariance function with κ = 1.5 and an
effective range of 2. The number of Bernstein polynomials used is M = 40 and the number of geometric weights is
equal to J = 40. The priors for σ2 and τ2 are IG(5, 3). The matrix Kζ maps the latent ζθ = (ζθ(u1), ..., ζθ(uQ))′
vectors for θ ∈ {q, σ, µ1, µ2} to the processes governing the integro-difference equation kernel parameters. The values
of Kζ correspond to convolution kernel where the (i, j)-th element is kζ(si−uj). The convolution kernel is a Matern
function with κ = 2.5 and an effective range of 4 for all the parameters, forcing a smooth evolution of the parameter
processes across the domain. The knots are chosen on a 30 by 30 grid from -11 to 11 in both directions, resulting in
a dimension reduction of the variables representing the spatially varying parameter processes from 9692 to 900. The
priors for the means of these processes are µq ∼ N(−1, .5), µc ∼ N(0, 4), and µµ1, µµ2
∼ N(0, 1). The scale terms
for the process covariances are given priors of IG(4, 3) for σc and σµiand IG(10, 6) for σq . These priors are based
on the scale of the data and reasonable shapes of kernels, but aren’t too restrictive. The process q(s) is very sensitive
to these priors, as the posterior is not necessarily identifiable, especially for very diffuse priors. There may be several
13
combinations of q(s) and the latent Zjk variables which results in the same values for wM,m, which are identifiable.
The other parameters are not overly sensitive to the prior.
75,000 samples are drawn from the posterior distributions using Metropolis-Hastings with the first 60,000 as burn-
in, leaving 15,000 samples. The convergence was checked using trace plots of the values of the kernel densities
at each location. That is to say that the trace plots suggested convergence by 60,000 iterations of the values for
k(s|µ(s), σ(s), q(s)) for various locations throughout the domain of the data.
The posterior kernel and associated measure may reveal information about the nature of dependence between
locations. In section 2.2, the property of elliptical symmetry is connected to the values of γ. The posterior kernel for
3 locations is shown in Figure 5 with the associated densities scaled by 2π. It is hard to compare the kernels at the
different locations by eye. The differences in the measure are much more easily detected for the three locations.
0 1 2 3 4 5 6
0.5
1.5
2.5
dΓ Location (171 E,23 N)
c(0, 2 * pi)
c(0.
3, 3
)
-5 0 5
-50
5
Posterior Kernel Mean
c(-7.5, 7.5)
c(-7
.5, 7
.5)
0 1 2 3 4 5 6
0.5
1.5
2.5
dΓ Location (166,-1)
c(0, 2 * pi)
c(0.
3, 3
)
-5 0 5
-50
5
Posterior Kernel Mean
c(-7.5, 7.5)
c(-7
.5, 7
.5)
0 1 2 3 4 5 6
0.5
1.5
2.5
dΓ Location (91 E,9 N)
c(0, 2 * pi)
c(0.
3, 3
)
-5 0 5
-50
5
Posterior Kernel Mean
c(-7.5, 7.5)
c(-7
.5, 7
.5)
Figure 5: The scaled density corresponding to the measure Γ and associated 95% credible bands are shown for threelocations on the left with the associated posterior mean kernel on the right.
To detect trends, we devise a method of measuring symmetry across the spatial field. Based on Lemma 1, we
use symell(s) =∫ π0{γs(z) − γs(z + π)}2dz as a metric of elliptical symmetry. Similarly, a metric of spherical
symmetry can be defined as symsph(s) =∫ 2π
0{γs(z)/c(s) − (2π)−1}2dz. These symmetry metrics are plotted as
a function of spatial location in Figure 6. The kernels at locations south of the equator are symmetric whereas the
kernels for locations north of the equator are non-symmetric. Additionally, the spherical symmetry map is very similar
to the elliptical symmetry map. This illustrates the importance of using a kernel that is spatially varying and can be
14
(a) Elliptical Symmetry (b) Spherical Symmetry
Figure 6: The symmetry metrics are shown across the spatial field for both elliptical and spherical symmetry. For bothmetrics, smaller values are associated with more symmetric kernels.
asymmetric. The splitting of the symmetry of the stable kernel suggests that the relationship between the spatial and
temporal components of the process behaves differently on the two sides of the equator. The physical interpretation of
this is not clear.
We will assess model performance through its predictive power. An additional model fit was conducted based on
a spatially varying Gaussian kernel, following Wikle (2002) and Xu et al. (2005). We score the predictions for each
time point based on energy scores of the form
es(F, y) =1
m
m∑i=1
||y(i) − y|| − 1
2m2
m∑i=1
m∑j=1
||y(i) − y(j)||, (5)
where y(1), ..., y(m) are samples from F , the posterior predictive distribution and y denotes the data vector (Gneiting
et al., 2008). We find that the stable kernel integro-difference equation model scores better in 241 time points, which
is 59.7% of all time points. Based on these alone, the advantage seems minimal. However slight the departures
from elliptical symmetry were, they do exist, but this does not seem to affect the scoring results. We also compare
the models using K-step ahead forecasts. This is done by propagating the state variables through the process level
of the model, a∗t ∼ N(GBθat−1, τ2GV G′). This can be propagated several steps past the final time, T , followed
by drawing the K-step ahead prediction Y ∗T+k ∼ N(Ψa∗T+k, σ2I). The scoring comparison for the K-step ahead
predictions, including in-sample predictions for K = 0, are shown in Table 2. As can be seen, while the in-sample
and 1-step ahead predictions are comparable, the stable distribution consistently outperforms the normal in each of
forecasts.
To perform true out of sample predictions, the last four months were left out of the analysis, resulting in 400 time
points ending in March 2015 being included in the model fit. The posterior distributions for Y ∗401, ..., Y∗404 are drawn as
part of the Monte Carlo Markov chain. The means of these posterior distributions are shown in Figure 7 for the stable
and normal kernel integro-difference equation models compared against the truth for March through July 2015. The
real data shows the El Nino which is known to be occurring in 2015. Both the predictions include the El Nino warming
15
K-Ahead 0 1 2 3 4 5
Percentage Stable Lower 48.2 59.7 69.7 62.0 57.5 56.1
Percentage Normal Lower 51.8 40.3 30.3 38.0 42.5 43.9
Table 2: Shown here are the percentage of time points where the energy score from the posterior predictive was lowerfor each model.
to some degree, but it is clear that the intensity of the predicted warming using the integro-difference equation model
with the stable kernel is much closer to the truth than with the model the normal kernel, especially closer to the coast
of South America.
There are several ways to numerically declare an El Nino event. Most of these involve high anomalies in certain
regions in the Pacific. For example, the official National Oceanic and Atmospheric Administration (NOAA) criterion
involves the block average anomalies in El Nino region 3.4 to be above .5◦ C for 3 consecutive months, although this
is modified to 5 consecutive months for the NOAA’s Climate Prediction Center (Larkin & Harrison, 2005). The El
Nino region 3.4 includes 120◦ to 170◦ W Longitude and -5◦ to 5◦ Latitude. Figure 8 shows the block averages for the
data and fitted models. These estimates are sampled from the posterior predictive distributions for K-steps ahead given
all information through March 2015. Because the fitting procedure is Bayesian, we have a full probability distribution
for the predictions, allowing us to calculate distributions on the probability of future El Nino events. Figure 8 shows
the posterior predictive probabilities that an El Nino will occur within the next 9 months for all months dating back
to the beginning of the data set. The red X’s are when El Nino did occur. The green bars extend from 6 months prior
to an El Nino occurring to 3 months after, so blue dots that occur in the regions with the green bars are indications
of correct predictions of an El Nino event. The model still has some false positives and false negatives, but overall, it
seems to be able to predict an El Nino months in advance in most cases.
5 Summary
We have proposed a model that explicitly considers space-time dynamics using a location-dependent linear evolution
equation that is specified by a wide and flexible family of infinitely-divisible kernels. The model can accommodate
both spatial and temporal non-stationarity. Furthermore, the proposed model incorporates observational error, and can
handle irregularly located observations and missing values.
In order to infer the characteristics of the kernel and the states of the spatio-temporal process of interest, we
use a Fourier basis representation for both the process and the kernel. A random measure is used as a prior for the
measure that controls the kernel shape. This, coupled with thresholding of the basis coefficients, produces a substantial
16
dimension reduction. As a result, the model, even being more flexible, is more parsimonious than a fully flexible
Gaussian kernel model. In spite of this, the model’s predictive skills are superior to those of the more standard model,
for the non-trivial example considered in the paper. Also, the use of a hierarchical model within a Bayesian approach
allows for a full probabilistic assessment of the prediction uncertainty. In addition, the structure of the proposed model
allows for efficient parallelization of the sampling algorithm. As a result, experiments with simulations as well as with
real data show that the model can effectively handle large numbers of observations.
References
ARMAGAN, A., DUNSON, D. B. & LEE, J. (2013). Generalized double pareto shrinkage. Statistica Sinica 23, 119.
BERLINER, L. M., WIKLE, C. K. & CRESSIE, N. (2000). Long-lead prediction of pacific ssts via bayesian dynamic
modeling. Journal of climate 13, 3953–3968.
BROWN, P. E., ROBERTS, G. O., KARESEN, K. F. & TONELLATO, S. (2000). Blur-generated non-separable space–
time models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 847–860.
CRESSIE, N. & WIKLE, C. K. (2011). Statistics for spatio-temporal data. New York: John Wiley & Sons.
DONOHO, D. L. & JOHNSTONE, I. M. (1994). Threshold selection for wavelet shrinkage of noisy data. In Engineer-
ing in Medicine and Biology Society, 1994. Engineering Advances: New Opportunities for Biomedical Engineers.
Proceedings of the 16th Annual International Conference of the IEEE. IEEE.
GNEITING, T., STANBERRY, L. I., GRIMIT, E. P., HELD, L. & JOHNSON, N. A. (2008). Assessing probabilistic
forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. Test 17, 211–235.
HIGDON, D. (1998). A process-convolution approach to modelling temperatures in the north atlantic ocean. Environ-
mental and Ecological Statistics 5, 173–190.
JAN VAN OLDENBORGH, G., BALMASEDA, M. A., FERRANTI, L., STOCKDALE, T. N. & ANDERSON, D. L.
(2005). Did the ecmwf seasonal forecast model outperform statistical enso forecast models over the last 15 years?
Journal of climate 18, 3240–3249.
LARKIN, N. K. & HARRISON, D. (2005). On the definition of el nino and associated seasonal average us weather
anomalies. Geophysical Research Letters 32.
LEMOS, R. T. & SANSO, B. (2009). A spatio-temporal model for mean, anomaly, and trend fields of north atlantic
sea surface temperature. Journal of the American Statistical Association 104, 5–18.
17
MENA, R. H., RUGGIERO, M. & WALKER, S. G. (2011). Geometric stick-breaking processes for continuous-time
bayesian nonparametric modeling. Journal of Statistical Planning and Inference 141, 3217–3230.
NOLAN, J. P. (2013). Multivariate elliptically contoured stable distributions: theory and estimation. Computational
Statistics 28, 2067–2089.
NOLAN, J. P. (2014). Financial modeling with heavy-tailed stable distributions. Wiley Interdisciplinary Reviews:
Computational Statistics 6, 45–55.
PANORSKA, A. K. (1996). Generalized stable models for financial asset returns. Journal of computational and applied
mathematics 70, 111–114.
PENLAND, C. & MAGORIAN, T. (1993). Prediction of nino 3 sea surface temperatures using linear inverse modeling.
Journal of Climate 6, 1067–1076.
PETRONE, S. (1999). Bayesian density estimation using bernstein polynomials. Canadian Journal of Statistics 27,
105–126.
PHILANDER, S. (1985). El nino and la nina. Journal of the Atmospheric Sciences 42, 2652–2662.
RICHARDSON, R., KOTTAS, A. & SANSO, B. (2017). Flexible integro-difference equation modeling for spatio-
temporal data. Computational Statistics & Data Analysis 109, 182–198.
ROPELEWSKI, C. F. & HALPERT, M. S. (1987). Global and regional scale precipitation patterns associated with the
el nino/southern oscillation. Monthly weather review 115, 1606–1626.
SAMORODNITSKY, G. & TAQQU, M. S. (1997). Stable non-gaussian random processes. Econometric Theory 13,
133–142.
STORVIK, G., FRIGESSI, A. & HIRST, D. (2002). Stationary space-time gaussian fields and their time autoregressive
representation. Statistical Modelling 2, 139–161.
WIKLE, C. K. (2002). A kernel-based spectral model for non-gaussian spatio-temporal processes. Statistical Mod-
elling 2, 299–314.
WIKLE, C. K. & HOLAN, S. H. (2011). Polynomial nonlinear spatio-temporal integro-difference equation models.
Journal of Time Series Analysis 32, 339–350.
WIKLE, C. K. & HOOTEN, M. B. (2010). A general science-based framework for dynamical spatio-temporal models.
Test 19, 417–451.
18
XU, K., WIKLE, C. K. & FOX, N. I. (2005). A kernel-based spatio-temporal dynamical model for nowcasting
weather radar reflectivities. Journal of the American Statistical Association 100, 1133–1144.
19
(a) March 2015 Truth (b) March 2015 Stable Fitted (c) March 2015 Normal Fitted
(d) April 2015 Truth (e) April 2015 Stable Predictions (f) April 2015 Normal Predictions
(g) May 2015 Truth (h) May 2015 Stable Predictions (i) May 2015 Normal Predictions
(j) June 2015 Truth (k) June 2015 Stable Predictions (l) June 2015 Normal Predictions
(m) July 2015 Truth (n) July 2015 Stable Predictions (o) July 2015 Normal Predictions
Figure 7: The data and posterior K-step ahead predictions using the stable and normal kernels are shown given infor-mation through March 2015. The months March 2015 through July 2015 are shown.
20
1985 1990 1995 2000 2005 2010 2015
0.5
0.6
0.7
0.8
0.9
1.0
Stable IDE El Nino Prediction Probabilities
Year
Probability
Figure 8: Posterior probabilities that an El Nino will occur within the coming 9 months. The red X’s show El Ninooccurrences and the green bars extend from 6 months before the El Nino event to 3 months after. Additionally, pointswith probabilities above .8 are dotted blue.
21