140
Model-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health and Paulo Justiniano Ribeiro, Jr Department of Statistics, Universidade Federal do Paran´a

Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

  • Upload
    vuthuy

  • View
    222

  • Download
    3

Embed Size (px)

Citation preview

Page 1: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Model-based Geostatistics

Peter J Diggle

Lancaster University and Johns Hopkins

University School of Public Health

andPaulo Justiniano Ribeiro, Jr

Department of Statistics, Universidade Federal do

Parana

Page 2: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Outline

1. Introduction - motivating examples

2. Linear models

3. Bayesian inference

4. Generalised linear models

5. Geostatistical design

6. Geostatistics and marked point processes

Diggle and Ribeiro (2007). Model-based Geostatistics.New York : Springer.

Page 3: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 1

Introduction - motivating examples

Page 4: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Geostatistics

• traditionally, a self-contained methodology for spatialprediction, developed at Ecole des Mines, Fontainebleau,France

• nowadays, that part of spatial statistics which isconcerned with data obtained by spatially discretesampling of a spatially continuous process

Page 5: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.1: Measured surface elevations

0 1 2 3 4 5 6

01

23

45

6

X Coord

Y C

oord

No explanatory variables?

Page 6: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 7: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.2: Residual contamination fromnuclear weapons testing

−6000 −5000 −4000 −3000 −2000 −1000 0

−50

00−

4000

−30

00−

2000

−10

000

1000

X Coord

Y C

oord

western area

Page 8: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 9: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 10: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.3: Childhood malaria in Gambia

300 350 400 450 500 550 600

1350

1400

1450

1500

1550

1600

1650

W−E (kilometres)

N−

S (

kilo

met

res)

Western

Eastern

Central

Page 11: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.3: continued

30 35 40 45 50 55 60

0.0

0.2

0.4

0.6

0.8

greenness

prev

alen

ce

Correlation between prevalence and green-ness of vegetation

Page 12: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 13: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.4: Soil data

5000 5200 5400 5600 5800 6000

4800

5000

5200

5400

5600

5800

X Coord

Y C

oord

5000 5200 5400 5600 5800 6000

4800

5000

5200

5400

5600

5800

X Coord

Y C

oord

Ca (left-panel) and Mg (right-panel) concentrations

Page 14: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.4: Continued

20 30 40 50 60 70 80

1015

2025

3035

4045

ca020

mg0

20

Correlation between local Ca and Mg concentrations.

Page 15: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Example 1.4: Continued

5000 5200 5400 5600 5800 6000

2030

4050

6070

80

E−W

ca02

0

(a)

4800 5000 5200 5400 5600

2030

4050

6070

80

N−S

ca02

0

(b)

3.5 4.0 4.5 5.0 5.5 6.0 6.5

2030

4050

6070

80

elevation

ca02

0

(c)

1 2 3

2030

4050

6070

80

region

ca02

0

(d)

Covariate relationships for Ca concentrations.

Page 16: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Model-based Geostatistics

• the application of general principles of statisticalmodelling and inference to geostatistical problems

• Example: kriging as minimum mean square errorprediction under Gaussian modelling assumptions

Page 17: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 2

Linear models

Page 18: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Notation

• Y = {Yi : i = 1, ..., n} is the measurement data

• {xi : i = 1, ..., n} is the sampling design (note lower case)

• Y = {Y (x) : x ∈ A} is the measurement process

• S∗ = {S(x) : x ∈ A} is the signal process

• T = F(S) is the target for prediction

• [S∗, Y ] = [S∗][Y |S∗] is the geostatistical model

Page 19: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Gaussian model-based geostatistics

Model specification:

• Stationary Gaussian process S(x) : x ∈ IR2

· E[S(x)] = µ

· Cov{S(x), S(x′)} = σ2ρ(‖x − x′‖)

• Mutually independent Yi|S(·) ∼ N(S(x), τ2)

Page 20: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Minimum mean square error prediction

[S, Y ] = [S][Y |S]

• T = t(Y ) is a point predictor

• MSE(T ) = E[(T − T )2]

Theorem: MSE(T ) takes its minimum value when T = E(T |Y ).

Proof uses result that for any predictor T ,

E[(T − T )2] = EY [VarT (T |Y )] + EY {[ET (T |Y ) − T ]2}

Immediate corollary is that

E[(T − T )2] = EY [Var(T |Y )] ≈ Var(T |Y )

Page 21: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Simple and ordinary kriging

Recall Gaussian model:

• Stationary Gaussian process S(x) : x ∈ IR2

· E[S(x)] = µ

· Cov{S(x), S(x′)} = σ2ρ(‖x − x′‖)

• Mutually independent Yi|S(·) ∼ N(S(x), τ2)

Page 22: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Gaussian model implies

Y ∼ MVN(µ1, σ2V )

V = R + (τ2/σ2)I Rij = ρ(‖xi − xj‖)

Target for prediction is T = S(x), write r = (r1, ..., rn) where

ri = ρ(‖x − xi‖)

Standard results on multivariate Normal then give [T |Y ] asmultivariate Gaussian with mean and variance

T = µ + r′V −1(Y − µ1) (1)

Var(T |Y ) = σ2(1 − r′V −1r). (2)

Simple kriging: µ = Y Ordinary kriging: µ = (1′V −11)−11′V −1Y

Page 23: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

The Matern family of correlation functions

ρ(u) = 2κ−1(u/φ)κKκ(u/φ)

• parameters κ > 0 and φ > 0

• Kκ(·) : modified Bessel function of order κ

• κ = 0.5 gives ρ(u) = exp{−u/φ}

• κ → ∞ gives ρ(u) = exp{−(u/φ)2}

• κ and φ are not orthogonal:

– helpful re-parametrisation: α = 2φ√

κ

– but estimation of κ is difficult

Page 24: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

u

ρ(u)

κ = 0.5 , φ = 0.25κ = 1.5 , φ = 0.16κ = 2.5 , φ = 0.13

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

2

x

S(x

)

Page 25: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Simple kriging: three examples

1. Varying κ (smoothness of S(x))

0.2 0.4 0.6 0.8

−1.

5−

0.5

0.0

0.5

1.0

1.5

locations

Y(x

)

Page 26: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

2. Varying φ (range of spatial correlation

0.2 0.3 0.4 0.5 0.6 0.7 0.8

−2

−1

01

2

locations

Y(x

)

Page 27: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

3. Varying τ2/σ2 (noise-to-dignal ratio)

0.0 0.2 0.4 0.6 0.8 1.0

−1.

5−

0.5

0.5

1.5

locations

Y(x

)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

1.5

locations

stan

d. d

evia

tion

Page 28: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Predicting non-linear functionals

• minimum mean square error prediction is not invariantunder non-linear transformation

• the complete answer to a prediction problem is thepredictive distribution, [T |Y ]

• Recommended strategy:

– draw repeated samples from [S∗|Y ](conditional simulation)

– calculate required summaries(examples to follow)

Page 29: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Theoretical variograms

• the variogram of a process Y (x) is the function

V (x, x′) =1

2Var{Y (x) − Y (x′)}

• for the spatial Gaussian model, with u = ||x − x′||,

V (u) = τ2 + σ2{1 − ρ(u)}

• provides a summary of the basic structural parametersof the spatial Gaussian process

Page 30: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

1.0

u

V(u

)

sill

nugget

practical range

• the nugget variance: τ2

• the sill: σ2 = Var{S(x)}

• the practical range: φ, such ρ(u) = ρ(u/φ)

Page 31: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Empirical variograms

uij = ‖xi − x)j‖ vij = 0.5[y(xi) − y(xj)]2

• the variogram cloud is a scatterplot of the points (uij , vij)

• the empirical variogram smooths the variogram cloud byaveraging within bins: u − h/2 ≤ uij < u + h/2

• for a process with non-constant mean (covariates), useresiduals r(xi) = y(xi) − µ(xi) to compute vij

Page 32: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Limitations of V (u)

0 2 4 6 8

010

000

2000

030

000

u

V(u)

0 2 4 6 8

010

0020

0030

0040

0050

0060

00

u

V(u)

1. vij ∼ V (uij)χ21

2. the vij are correlated

Consequences:

• variogram cloud is unstable, pointwise and in overall shape

• binning addresses point 1, but not point 2

Page 33: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Parameter estimation using the variogram

• fitting a theoretical variogram function to the empiricalvariogram provides estimates of the model parameters.

• weighted least squares criterion:

W (θ) =∑

k

nk{[Vk − V (uk; θ)]}2

where θ denotes vector of covariance parameters and Vk

is average of nk variogram ordinates vij.

• need to choose upper limit for u (arbitrary?)

• variations include:– fitting models to the variogram cloud– other estimators for the empirical variogram– different proposals for weights

Page 34: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Comments on variogram fitting

1. Can give equally good fits for different extrapolationsat origin.

0 2 4 6 8 10 12 14

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

u

V(u)

Page 35: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

2. Correlation between variogram points inducessmoothness.

Empirical variograms for three simulationsfrom the same model.

0.0 0.2 0.4 0.6

0.0

0.5

1.0

1.5

u

V(u)

Page 36: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

3. Fit is highly sensitive to specification of the mean.

Illustration with linear trend surface:

• solid smooth line: theoretical variogram;

• dotted line: from data;

• solid line: from true residuals;

• dashed line : from estimated residuals.

0 2 4 6

0.0

0.5

1.0

1.5

2.0

2.5

u

V(u)

Page 37: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Parameter estimation: maximum likelihood

Y ∼ MVN(µ1, σ2R + τ2I)

R is the n × n matrix with (i, j)th element ρ(uij) whereuij = ||xi − xj ||, Euclidean distance between xi and xj .

Or more generally:

µ(xi) =k

j=1

fk(xi)βk

where dk(xi) is a vector of covariates at location xi, hence

Y ∼ MVN(Dβ, σ2R + τ2I)

Page 38: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Gaussian log-likelihood function:

L(β, τ, σ, φ, κ) ∝ −0.5{log |(σ2R + τ2I)| +

(y − Dβ)′(σ2R + τ2I)−1(y − Dβ)}.

• write ν2 = τ2/σ2, hence σ2V = σ2(R + ν2I)

• log-likelihood function is maximised for

β(V ) = (D′V −1D)−1D′V −1y

σ2 = n−1(y − Dβ)′V −1(y − Dβ)

• substitute (β, σ2) to give reduced maximisation problem

L∗(τr, φ, κ) ∝ −0.5{n log |σ2| + log |(R + ν2I)|}

• usually just consider κ in a discrete set {0.5, 1, 2, 3, ..., N}

Page 39: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Comments on maximum likelihood

• likelihood-based methods preferable to variogram-basedmethods

• restricted maximum likelihood is widely recommendedbut in our experience is sensitive to mis-specification ofthe mean model.

• in spatial models, distinction between µ(x) and S(x) isnot sharp.

• composite likelihood treats contributions from pairs (Yi, Yj)as if independent

• approximate likelihoods useful for handling largedata-sets

• examining profile likelihoods is advisable, to check forpoorly identified parameters

Page 40: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall data

Page 41: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: trans-Gaussian model

Y ∗i = hλ(Y ) =

{

(yi)λ−1λ

if λ 6= 0log(yi) if λ = 0

ℓ(β, θ, λ) = −1

2{log |σ2V | + (hλ(y) − Dβ)′{σ2V }−1(hλ(y) − Dβ)}

+n

i=1

log(

(yi)λ−1

)

Page 42: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: profile log-likelihoods for λ

Left panel: κ = 0.5 Centre panel: κ = 1 Right panel: κ = 2

Page 43: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: MLE’s (λ = 0.5)

κ µ σ2 φ τ2 log L0.5 18.36 118.82 87.97 2.48 -2464.3151 20.13 105.06 35.79 6.92 -2462.4382 21.36 88.58 17.73 8.72 -2464.185

Likelihood criterion favours κ = 1

Page 44: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: profile log-likelihoods(λ = 0.5, κ = 1)

Left panel: σ2 Centre panel: φ Right panel: τ2

Page 45: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: plug-in predictions andprediction variances

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

100 200 300 400

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

20 40 60 80

Page 46: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: non-linear prediction

A200

Fre

quen

cy

0.36 0.38 0.40 0.42

050

100

150

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

0.2 0.4 0.6 0.8

Left-panel: plug-in prediction for proportion of total

area with rainfall exceeding 200 (= 20mm)

Right-panel: plug-in predictive map of

P (rainfall > 250|Y )

Page 47: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 3

Bayesian inference

Page 48: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Basics

Model specification

[Y, S, θ] = [θ][S|θ][Y |S, θ]

Parameter estimation

• integration gives

[Y, θ] =

[Y, S, θ]dS

• Bayes’ Theorem gives posterior distribution

[θ|Y ] = [Y |θ][θ]/[Y ]

• where [Y ] =∫

[Y |θ][θ]dθ

Page 49: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Prediction: S → S∗

• expand model specification to

[Y, S∗, θ] = [θ][S|θ][Y |S, θ][S∗|S, θ]

• plug-in predictive distribution is

[S∗|Y, θ]

• Bayesian predictive distribution is

[S∗|Y ] =

[S∗|Y, θ][θ|Y ]dθ

• for any target T = t(S∗), required predictive distribution[T |Y ] follows

Page 50: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Notes

• likelihood function is central to both classical and Bayesianinference

• Bayesian prediction is a weighted average of plug-inpredictions, with different plug-in values of θweighted according to their conditional probabilitiesgiven the observed data.

• Bayesian prediction is usually more conservative thanplug-in prediction

Page 51: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Bayesian computation

1. Evaluating the integral which defines [S∗|Y ] is oftendifficult

2. Markov Chain Monte Carlo methods are widely used

3. but for geostatistical problems, reliable implementationof MCMC is not straightforward (no natural Markovianstructure)

4. for the Gaussian model, direct simulation is available

Page 52: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Gaussian models: known (σ2, φ)

Y ∼ N(Dβ, σ2R(φ))

• choose conjugate prior β ∼ N(

mβ ; σ2Vβ

)

• posterior for β is[

β|Y, σ2, φ]

∼ N(

β, σ2 Vβ

)

β = (V −1β + D′R−1D)−1(V −1

β mβ + D′R−1y)

Vβ = σ2 (V −1β + D′R−1D)−1)

• predictive distribution for S∗ is

p(S∗|Y, σ2, φ) =

p(S∗|Y, β, σ2, φ) p(β|Y, σ2, φ) dβ.

Page 53: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Notes

• mean and variance of predictive distribution can bewritten explicitly (but not given here)

• predictive mean compromises between prior and weightedaverage of Y

• predictive variance has three components:

– a priori variance,

– minus information in data

– plus uncertainty in β

• limiting case Vβ → ∞ corresponds to ordinary kriging.

Page 54: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Gaussian models: unknown (σ2, φ)

Convenient choice of prior is:

[β|σ2, φ] ∼ N(

mb, σ2Vb

)

[σ2|φ] ∼ χ2ScI

(

nσ, S2σ

)

[φ] ∼ arbitrary

• results in explicit expression for [β, σ2|Y, φ] andcomputable expression for [φ|Y ], depending on choice ofprior for φ

• in practice, use arbitrary discrete prior for φ and combineposteriors conditional on φ by weighted averaging

Page 55: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Algorithm 1:

1. choose lower and upper bounds for φ according to theparticular application, and assign a discrete uniform priorfor φ on a set of values spanning the chosen range

2. compute posterior [φ|Y ] on this discrete support set

3. sample φ from posterior, [φ|Y ]

4. attach sampled value of φ to conditional posterior,[β, σ2|y, φ], and sample (β, σ2) from this distribution

5. repeat steps (3) and (4) as many times as required;resulting sample of triplets (β, σ2, φ) is a sample fromjoint posterior distribution, [β, σ2, φ|Y ]

Page 56: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Predictive distribution for S∗ given φ is tractable, hence write

p(S∗|Y ) =

p(S∗|Y, φ) p(φ|y) dφ.

Algorithm 2:

1. discretise [φ|Y ], as in Algorithm 1.

2. compute posterior [φ|Y ]

3. sample φ from posterior [φ|Y ]

4. attach sampled value of φ to [S∗|y, φ] and sample fromthis to obtain realisations from [S∗|Y ]

5. repeat steps (3) and (4) as required

Note: Extends immediately to multivariate φ (but may becomputationally awkward)

Page 57: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall

Priors/posteriors for φ (left) and ν2 (right)

0 15 37.5 60 82.5 105 135

priorposterior

0.00

0.05

0.10

0.15

0.20

0.25

0 0.1 0.2 0.3 0.4 0.5

priorposterior

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Page 58: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall

Mean (left-panel) and variance (right-panel) of

predictive distribution

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

100 200 300 400

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

20 40 60 80

Page 59: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: posterior means and 95% cred-ible intervals

parameter estimate 95% intervalβ 144.35 [53.08, 224.28]σ2 13662.15 [8713.18, 27116.35]φ 49.97 [30, 82.5]ν2 0.03 [0, 0.05]

Page 60: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Swiss rainfall: non-linear prediction

0.36 0.38 0.40 0.42 0.44

010

2030

A200

f(A20

0)

0 50 100 150 200 250 300

−50

050

100

150

200

250

X Coord

Y C

oord

0.2 0.4 0.6 0.8

Left-panel: Bayesian (solid) and plug-in (dashed) prediction forproportion of total area with rainfall exceeding 200 (= 20mm)

Right-panel: Bayesian predictive map of P (rainfall > 250|Y )

Page 61: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 4

Generalized linear models

Page 62: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Generalized linear geostatistical model

• Latent spatial process

S(x) ∼ SGP{0, σ2, ρ(u))}ρ(u) = exp(−|u|/φ)

• Linear predictor

η(x) = d(x)′β + S(x)

• Link function

E[Yi] = µi = h{η(xi)}

• Conditional distribution for Yi : i = 1, ..., n

Yi|S(·) ∼ f(y; η) mutually independent

Page 63: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

GLGM

• usually just a single realisation is available, in contrastwith GLMM for longitudinal data analysis

• GLM approach is most appealing when there is a natu-ral sampling mechanism, for example Poisson model forcounts or logistic-linear models for proportions

• transformed Gaussian models may be more useful fornon-Gaussian continuous respones

• theoretical variograms can be derived but are less naturalas summary statistics than in Gaussian case

• but empirical variograms of GLM residuals can still beuseful for exploratory analysis

Page 64: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

A binomial logistic-linear model

• S(·) ∼ zero-mean Gaussian process

• [Y (xi) | S(xi)] ∼ Bin(ni; pi)

• h(pi) = log{pi/(1 − pi)} =∑k

j=1 dijβj + S(xi)

• model can be expanded by adding uncorrelated randomeffects Zi,

h(pi) =k

j=1

dijβj + S(xi) + Zi

to distinguish between two forms of the nugget effect:

– binomial variation is analogue of measurement error

– Zi is analogue of short-range spatial variation

Page 65: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Simulation of a binary logistic-linear model

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

locations

data

• data contain little information about S(x)

• leads to wide prediction intervals for p(x)

• more informative for binomial responses with large ni

Page 66: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Inference

• Likelihood function

L(θ) =

IRn

n∏

i

f(yi; h−1(si))f(s | θ)ds1, . . . , dsn

• involves high-dimensional integration

• MCMC algorithms exploit conditional independencestructure

Page 67: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Conditional independence graph

S∗ S Y

θ β

��

��

@@

@@

@

��

��

@@

@@

@

• only need vertex S∗ at prediction stage

• corresponding DAG would delete edge between S and β

Page 68: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Prediction with known parameters

• simulate s(1), . . . , s(m) from [S|y] (using MCMC).

• simulate s∗(j) from [S∗|s(j)], j = 1, . . . , m(multivariate Gaussian)

• approximate E[T (S∗)|y] by 1m

∑mj=1 T (s∗(j))

• if possible reduce Monte Carlo error by

– calculating E[T (S∗)|s(j)] directly

– estimating E[T (S∗)|y] by 1m

∑mj=1 E[T (S∗)|s(j)]

Page 69: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

MCMC for conditional simulation

• Let S = D′β + Σ1/2Γ, Γ ∼ Nn(0, I).

• Conditional density: f(γ|y) ∝ f(y|γ)f(γ)

Langevin-Hastings algorithm

• Proposal: γ′ from a Nn(ξ(γ), hI),

ξ(γ) = γ +h

2∇ log f(γ | y)

• Example: Poisson-log-linear spatial model:

∇ log f(γ|y) = −γ + (Σ1/2)′(y − exp(s)), s = Σ1/2γ.

• expression generalises to other generalised linear spatialmodels

• MCMC output γ(1), . . . , γ(m), hence sample s(m) = Σ1/2γ(m)from [S|y].

Page 70: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

MCMC for Bayesian inference

Posterior:

• update Γ from [Γ|y, β, log σ), log(φ)] (Langevin-Hastings))

• update β from [β|Γ, log(σ), log(φ)] (RW-Metropolis)

• update log(σ) from [log(σ)|Γ, β, log(φ)] (RW-Metropolis)

• update log(φ) from [log(φ)|Γ, β, log(σ)] (RW-Metropolis)

Predictive:

• simulate (s(j), β(j), σ2(j), φ(j)), j = 1, . . . , m (MCMC)

• simulate s∗(j) from [S∗|s(j), β(j), σ2(j), φ(j)],j = 1, . . . , m (multivariate Gaussian)

Page 71: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Comments

• above is not necessarily the most efficient algorithmavailable

• discrete prior for φ reduces computing time

• can thin MCMC output if storage is a limiting factor

• similar algorithms can be developed for MCMCmaximum likelihood estimation

Page 72: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Some computational resources

• geoR package:http://www.est.ufpr.br/geoR

• geoRglm package:http://www.est.ufpr.br/geoRglm

• R-project:http://www.R-project.org

• CRAN spatial task view:http://cran.r-project.org/src/contrib/Views/Spatial.html

• AI-Geostats web-site:http://www.ai-geostats.org

• and more ...

Page 73: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 74: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 75: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

AfricanProgramme forOnchocerciasisControl

• “river blindness” – an endemic disease in wet tropicalregions

• donation programme of mass treatment with ivermectin

• approximately 30 million treatments to date

• serious adverse reactions experienced by some patientshighly co-infected with Loa loa parasites

• precautionary measures put in place before masstreatment in areas of high Loa loa prevalence

http://www.who.int/pbd/blindness/onchocerciasis/en/

Page 76: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

The Loa loa prediction problem

Ground-truth survey data

• random sample of subjects in each of a number of villages

• blood-samples test positive/negative for Loa loa

Environmental data (satellite images)

• measured on regular grid to cover region of interest

• elevation, green-ness of vegetation

Objectives

• predict local prevalence throughout study-region (Cameroon)

• compute local exceedance probabilities,

P(prevalence > 0.2|data)

Page 77: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Loa loa: a generalised linear model

• Latent spatial process

S(x) ∼ SGP{0, σ2, ρ(u))}ρ(u) = exp(−|u|/φ)

• Linear predictor

d(x) = environmental variables at location x

η(x) = d(x)′β + S(x)

p(x) = log[η(x)/{1 − η(x)}]

• Error distribution

Yi|S(·) ∼ Bin{ni, p(xi)}

Page 78: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Schematic representation of Loa loa model

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

environmental gradient

local fluctuation

sampling errorpre

vale

nce

location

Page 79: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

The modelling strategy

• use relationship between environmental variables and ground-truth prevalence to construct preliminary predictions vialogistic regression

• use local deviations from regression model to estimatesmooth residual spatial variation

• Bayesian paradigm for quantification of uncertainty inresulting model-based predictions

Page 80: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

logit prevalence vs elevation

0 500 1000 1500

−5

−4

−3

−2

−1

0

elevation

logi

t pre

vale

nce

Page 81: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

logit prevalence vs MAX = max NDVI

0.65 0.70 0.75 0.80 0.85 0.90

−5

−4

−3

−2

−1

0

Max Greeness

logi

t pre

vale

nce

Page 82: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Comparing non-spatial and spatial predictionsin Cameroon

Non-spatial

3020100

60

50

40

30

20

10

0

Page 83: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Spatial

403020100

60

50

40

30

20

10

0

Page 84: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Probabilistic prediction in Cameroon

Page 85: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Next Steps

• analysis confirms value of local ground-truth prevalencedata

• in some areas, need more ground-truth data to reducepredictive uncertainty

• but parasitological surveys are expensive

Page 86: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Field-work is difficult!

Page 87: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

RAPLOA

• a cheaper alternative to parasitological sampling:

– have you ever experienced eye-worm?

– did it look like this photograph?

– did it go away within a week?

• RAPLOA data to be collected:

– in sample of villages previously surveyedparasitologically (to calibrate parasitology vs RAPLOAestimates)

– in villages not surveyed parasitologically (to reducelocal uncertainty)

• bivariate model needed for combined analysis ofparasitological and RAPLOA prevalence estimates

Page 88: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

UNDP/World Bank/WHO Special Programme for Research & Training in Tropical Disease (TDR)

R E P O RT O F A M U LT I - C E N T R E S T U DY

Rapid AssessmentProceduresfor Loiasis

TDR/IDE/RP/RAPL/01.1

Page 89: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

RAPLOA calibration

0 20 40 60 80 100

020

4060

8010

0

RAPLOA prevalence

para

sito

logy

pre

vale

nce

−6 −4 −2 0 2

−6

−4

−2

02

RAPLOA logitpa

rasi

tolo

gy lo

git

locatio

n

Empirical logit transformation linearises relationshipColour-coding corresponds to four surveys in different regions

Page 90: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

RAPLOA calibration (ctd)

0 20 40 60 80 100

020

4060

8010

0

RAPLOA prevalence

para

sito

logy

pre

vale

nce

locatio

n

Fit linear functional relationship on logit scale and back-transform

Page 91: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Parasitology/RAPLOA bivariate model

• treat prevalence estimates as conditionally independentbinomial responses

• with bivariate latent Gaussian process {S1(x), S2(x)} inlinear predictor

• to ease computation, write joint distribution as

[S1(x), S2(x)] = [S1(x)][S2(x)|S1(x)]

with low-rank spline representation of S1(x)

Page 92: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Lecture 3

Geostatistical design; geostatistics andmarked point processes

Page 93: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 5

Geostatistical design

Page 94: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Geostatistical design

• Retrospective

Add to, or delete from, an existing set of measurementlocations xi ∈ A : i = 1, ..., n.

• Prospective

Choose optimal positions for a new set of measurementlocations xi ∈ A : i = 1, ..., n.

Page 95: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Naive design folklore

• Spatial correlation decreases with increasing distance.

• Therefore, close pairs of points are wasteful.

• Therefore, spatially regular designs are a good thing.

Page 96: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Less naive design folklore

• Spatial correlation decreases with increasing distance.

• Therefore, close pairs of points are wasteful if you knowthe correct model.

• But in practice, at best, you need to estimate unknownmodel parameters.

• And to estimate model parameters, you need your designto include a wide range of inter-point distances.

• Therefore, spatially regular designs should be temperedby the inclusion of some close pairs of points.

Page 97: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Examples of compromise designs

0 1

0

1

A) Lattice plus close pairs design

0 1

0

1

B) Lattice plus in−fill design

Page 98: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

A Bayesian design criterion

Assume goal is prediction of S(x) for all x ∈ A.

[S|Y ] =

[S|Y, θ][θ|Y ]dθ

For retrospective design, minimise

v =

A

Var{S(x)|Y }dx

For prospective design, minimise

E(v) =

y

A

Var{S(x)|y}f(y)dy

where f(y) corresponds to

[Y ] =

[Y |θ][θ]dθ

Page 99: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Results

Retrospective: deletion of points from a monitoring network

0 1

0

1

Start design

Page 100: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Selected final designs

0 1

0

1

τ2/σ2=0

A

0 1

0

1

τ2/σ2=0

B

0 1

0

1

τ2/σ2=0

C

0 1

0

1

τ2/σ2=0.3

D

0 1

0

1

τ2/σ2=0.3

E

0 1

0

1

τ2/σ2=0.3

F

0 1

0

1

τ2/σ2=0.6

G

0 1

0

1

τ2/σ2=0.6

H

0 1

0

1

τ2/σ2=0.6

I

Page 101: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Prospective: regular lattice vs compromise designs

0.2 0.4 0.6 0.8 10.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

Des

ign

crite

rion

τ2=0

RegularInfillingClose pairs

0.2 0.4 0.6 0.8 10.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5τ2=0.2

RegularInfillingClose pairs

0.2 0.4 0.6 0.8 10.2

0.3

0.4

0.5

0.6

0.7τ2=0.4

RegularInfillingClose pairs

0.2 0.4 0.6 0.8 10.2

0.3

0.4

0.5

0.6

0.7

Des

ign

crite

rion

Maximum φ

τ2=0.6

RegularInfillingClose pairs

0.2 0.4 0.6 0.8 10.3

0.4

0.5

0.6

0.7

0.8

Maximum φ

τ2=0.8

RegularInfillingClose pairs

Page 102: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 103: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Monitoring salinity in the Kattegat basin

Denmark(Jutland)

Kattegat

Nor

th S

ea

Sweden

Baltic Sea

Denmark(Zealand)

Germany

A B

Solid dots are locations deleted for reduced design.

Page 104: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Further remarks on geostatistical design

1. Conceptually more complex problems include:

(a) design when some sub-areas are more interestingthan others;

(b) design for best prediction of non-linear functionalsof S(·);

(c) multi-stage designs (see below).

2. Theoretically optimal designs may not be realistic(eg Loa loa mapping problem)

3. Goal here is not optimal design, but to suggestconstructions for good, general-purpose designs.

Page 105: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Section 6

Geostatistics and marked point processes

Page 106: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

The geostatistical model re-visited

locations X signal S measurements Y

• Conventional geostatistical model: [S, Y ] = [S][Y |S]

• What if X is stochastic?

Usual implicit assumption: [X, S, Y ] = [X][S][Y |S]

Hence, can ignore [X] for likelihood-based inferenceabout [S, Y ].

L(θ) =

[S][Y |S]dS

Page 107: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Marked point processes

locations X marks Y

• X is a point process

• Y need only be defined at points of X

• natural factorisation of [X, Y ] depends onscientific context

[X, Y ] = [X][Y |X] = [Y ][X|Y ]

Page 108: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Preferential sampling

locations X signal S measurements Y

• Conventional model:

[X, S, Y ] = [S][X][Y |S] (1)

• Preferential sampling model:

[X, S, Y ] = [S][X|S][Y |S, X] (2)

Under model (2), typically [Y |S, X] = [Y |S0] where S0 = S(X)denotes the values of S at the points of X

Page 109: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

An idealised model for preferential sampling

[X, S, Y ] = [S][X|S][Y |S, X]

• [S]= SGP(µ, σ2, ρ) (stationary Gaussian process)

• [X|S]= inhomogenous Poisson process with intensity

λ(x) = exp{α + βS(x)}

• [Y |S, X] =∏n

i=1[Yi|S(Xi)]

• [Yi|S(Xi)] = N(S(Xi), τ2)

Page 110: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Simulation of preferential sampling model

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

β = 0.0, 0.25, 0.5

Page 111: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Impact of preferential sampling on spatialprediction

• target for prediction is S(x), x = (0.5, 0.5)

• 100 data-locations on unit square

• three sampling designs

Sampling designuniform clustered preferential

bias (−0.081, 0.059) (−0.082, 0.186) (1.290, 1.578)MSE (0.268, 0.354) (0.948, 1.300) (2.967, 3.729)

Page 112: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Likelihood inference (crude Monte Carlo)

[X, S, Y ] = [S][X|S][Y |S, X]

• data are X and Y , likelihood is

L(θ) =

[X, S, Y ]dS = ES

[

[X|S][Y |S, X]]

• evaluate expectation by Monte Carlo,

LMC(θ) = m−1m∑

j=1

[X|Sj][Y |Sj, X],

using anti-thetic pairs, S2j = −S2j−1

Page 113: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

An importance sampler

Re-write likelihood as

L(θ) =

[X|S][Y |X, S][S|Y ]

[S|Y ][S]dS

• [S] = [S0][S1|S0]

• [S|Y ] = [S0|Y ][S1|S0, Y ] = [S0|Y ][S1|S0]

• [Y |X, S] = [Y |S0]

L(θ) =

[X|S][Y |S0]

[S0|Y ][S0][S|Y ]dS

= ES|Y

[

[X|S][Y |S0]

[S0|Y ][S0]

]

Page 114: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

An importance sampler (continued)

• simulate Sj ∼ [S|Y ] (anti-thetic pairs)

• if Y is measured without error, set [Y |S0j]/[S0j|Y ] = 1

Monte Carlo approximation is:

LMC(θ) = m−1m∑

j=1

[

[X|Sj ][Y |S0j]

[S0j|Y ][S0j]

]

Page 115: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Practical solutions to weak identifability

1. explanatory variables U to break dependence betweenS and X

2. strong Bayesian priors

3. two-stage sampling

Page 116: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 117: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 118: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Ozone monitoring in California

Data:

• yearly averages of O3 from 178 monitoring locationsthroughout California

• census information for each of 1709 zip-codes

Objective:

• estimate spatial average of O3 in designated sub-regions

Page 119: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

California ozone monitoring data

−124 −122 −120 −118 −116 −114

3234

3638

4042

longitude

latit

ude

NA 18 20 22 25 27 29 32 34 36 39 41 43 45 48 50 52 55 57 59 62

ale

nce

Page 120: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Ozone monitoring in California (continued)

Preferential sampling?

• highly non-uniform spatial distribution of monitors,negatively associated with levels of pollution

• may be able to allow for this if demographic and/orsocio-economic factors are associated both with levelsof pollution and with intensity of monitoring

Page 121: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Ozone monitoring in California (continued)

Modelling assumption

• dependence induced by latent variables U ,

[X, S, Y ] =

[X|U ][S|U ][Y |S, U ][U ]dU

• if U observed:

– use conditional likelihood,

[X, S, Y |U ] = [X|U ][S|U ][Y |S, U ]

– and ignore term [X|U ] for inference about S

Page 122: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

California ozone monitors: outlier?

−12400 −12200 −12000 −11800 −11600 −11400

3200

3400

3600

3800

4000

4200

Easting (km)

Nor

thin

g (k

m)

XX

−12400 −12200 −12000 −11800 −11600 −11400

3200

3400

3600

3800

4000

4200

ale

nce

Page 123: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Analysis of California ozone monitorlocations

• monitor intensity associated with:

– population density (positive)

– percentage College-educated (positive)

– median family income (negative)

• good fit to inhomogeneous Poisson process model(after removal of one outlier)

Page 124: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

California ozone monitors: fit to Poisson model

0 20 40 60

−10

000

010

000

2000

030

000

4000

0

Distance of interests (km)

Inho

mog

eneo

us K

ts

Page 125: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 126: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health
Page 127: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Heavy metal bio-monitoring in Galicia

500000 550000 600000 650000

4650

000

4700

000

4750

000

4800

000

4850

000

1997 sample

ale

nce

Page 128: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Heavy metal bio-monitoring in Galicia

500000 550000 600000 650000

4650

000

4700

000

4750

000

4800

000

4850

000

1997 sampleindustry

ale

nce

Page 129: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Heavy metal bio-monitoring in Galicia

500000 550000 600000 650000

4650

000

4700

000

4750

000

4800

000

4850

000

1997 sampleindustry2000 sample

ale

nce

Page 130: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Heavy metal bio-monitoring in Galicia

• 1997 sampling design is good for monitoring effects ofindustrial activity

• but would lead to potential biased estimates of residualspatial variation

• 2000 sampling design is good for fitting model of residualspatial variation

• assuming stability of pollution levels over time, possibleanalysis strategy is:

– use 2000 data, or sub-set thereof, to model spatialvariation

– holding spatial correlation parameters fixed,use 1997 data to model point-source effects ofindustrial locations.

Page 131: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: 2000 predictions (posterior mean)

500000 550000 600000 650000

4650

000

4700

000

4750

000

4800

000

4850

000

ale

nce

Page 132: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: excision of areas close to industry

500000 550000 600000 650000

4650

000

4700

000

4750

000

4800

000

4850

000

ale

nce

Page 133: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: posteriors from analysis of 2000 data

beta

Fre

quen

cy

−0.5 0.0 0.5 1.0 1.5

050

100

150

200

250

300

sigmasq

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5 0.6

010

020

030

040

0

phi

Fre

quen

cy

0 50000 100000 150000 200000 250000

020

4060

8010

012

0

tausq/sigmasq

Fre

quen

cy

0.0 0.5 1.0 1.5 2.0

050

100

150

repla

cem

en

E[S(x)] = µ0 V (u) = τ2 + σ2{1 − exp(−u/φ)}

Page 134: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: analysis of 1997 data

• introduce distance to nearest industry as explanatoryvariable,

µ(x) = µ0 + α + γd(x)

• for β and spatial covariance parameters, use posteriorsfrom 2000 analysis

• resulting posterior mean and SD for (α, γ)

α γmean 0.601 -0.000561

SD 0.179 0.000609

Page 135: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: posteriors for (α, γ)

alpha

Fre

quen

cy

0.5 1.0 1.5

050

100

150

200

250

gamma

Fre

quen

cy

−0.002 0.000 0.002 0.004 0.006

050

100

150

200

250

300

ale

nce

Page 136: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Galicia: a cautionary note

0 20 40 60 80 100 120

0.5

1.0

1.5

2.0

distance (km)

log(

lead

)

ale

nce

Suggests missing explanatory variable(s)?

Page 137: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Closing remarks on preferential sampling

• preferential sampling is widespread in practice, butalmost universally ignored

• its effects may or may not be innocuous

• model parameters may be poorly identifed, hence

• reliance on formal likelihood-based inference for a singledata-set may be unwise

• different pragmatic analysis strategies may be needed fordifferent applications

Page 138: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

Closing remarks on model-basedgeostatistics

• Parameter uncertainty can have a material impact onprediction.

• Bayesian paradigm deals naturally with parameteruncertainty.

• Implementation through MCMC is not whollysatisfactory:

– sensitivity to priors?

– convergence of algorithms?

– routine implementation on large data-sets?

Page 139: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

• Model-based approach clarifies distinctions between:

– the substantive problem;

– formulation of an appropriate model;

– inference within the chosen model;

– diagnostic checking and re-formulation.

• Areas of current research include:

– preferential sampling

– computational issues around large data-sets

– multivariate models

– spatio-temporal models

Page 140: Model-based Geostatistics Peter J Diggle Lancaster ... · PDF fileModel-based Geostatistics Peter J Diggle Lancaster University and Johns Hopkins University School of Public Health

• Analyse problems, not data:

– what is the scientific question?

– what data will best allow us to answer the question?

– what is a reasonable model to impose on the data?

– inference: avoid ad hoc methods if possible

– fit, reflect, re-formulate as necessary

– answer the question.