Simulation of spatially correlated discrete random variables Dan Dalthorp and Lisa Madsen Department of Statistics Oregon State University [email protected]


Outline

I. Generating one pair of correlated discrete random variables
   (a) Lognormal-Poisson hierarchy
   (b) Overlapping sums

II. Generating a vector of correlated discrete random variables by overlapping sums

III. Examples


Introduction

Generate Y1, Y2 where

• Y1, Y2 have specified means $\mu_{Y_1}, \mu_{Y_2}$, variances $\sigma^2_{Y_1}, \sigma^2_{Y_2}$, and correlation $\rho_Y \ge 0$

• Y1, Y2 are count r.v.'s, i.e., y = 0, 1, 2, ...

• Distributions of Y1, Y2 are unimodal, Poisson-like

• If $\sigma^2 < \mu$, then both $\sigma^2$ and $\mu$ are small

Lognormal-Poisson Method For Generating Y1 and Y2

• Generate correlated normal RVs Z1, Z2

• Transform to lognormals Xi = exp(Zi)

• Generate conditionally independent Yi ~ Poisson(Xi)

Y1 and Y2 resemble negative binomial RVs.

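Obtaining the Right Moments

To get $Y_i \sim (\mu_{Y_i}, \sigma^2_{Y_i})$ with corr(Y1, Y2) = $\rho_Y$,

generate lognormals X1, X2 with

$$\mu_{X_i} = \mu_{Y_i}, \qquad \sigma^2_{X_i} = \sigma^2_{Y_i} - \mu_{Y_i}, \qquad \rho_X = \frac{\rho_Y\,\sigma_{Y_1}\sigma_{Y_2}}{\sqrt{(\sigma^2_{Y_1} - \mu_{Y_1})(\sigma^2_{Y_2} - \mu_{Y_2})}}$$

This requires normals Z1, Z2 with

$$\mu_{Z_i} = \log\!\left(\frac{\mu^2_{X_i}}{\sqrt{\sigma^2_{X_i} + \mu^2_{X_i}}}\right), \qquad \sigma^2_{Z_i} = \log\!\left(1 + \frac{\sigma^2_{X_i}}{\mu^2_{X_i}}\right)$$

and

$$\rho_Z = \frac{\log\!\left(1 + \rho_X\,\dfrac{\sigma_{X_1}\sigma_{X_2}}{\mu_{X_1}\mu_{X_2}}\right)}{\sqrt{\log\!\left(1 + \dfrac{\sigma^2_{X_1}}{\mu^2_{X_1}}\right)\log\!\left(1 + \dfrac{\sigma^2_{X_2}}{\mu^2_{X_2}}\right)}}$$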
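The lognormal-Poisson recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `lognormal_poisson_pair` and its arguments are hypothetical, and the moment conversions are the standard lognormal identities ($\sigma^2_Z = \log(1 + \sigma^2_X/\mu^2_X)$, $\mu_Z = \log\mu_X - \sigma^2_Z/2$).

```python
import numpy as np

def lognormal_poisson_pair(mu_y, var_y, rho_y, size, rng=None):
    """Draw `size` pairs (Y1, Y2) via the lognormal-Poisson hierarchy (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_y = np.asarray(mu_y, dtype=float)
    var_y = np.asarray(var_y, dtype=float)

    # Moments of the lognormal intensities X_i.
    mu_x = mu_y
    var_x = var_y - mu_y
    if np.any(var_x <= 0):
        raise ValueError("lognormal-Poisson needs var_y > mu_y for both variables")
    cov_x = rho_y * np.sqrt(var_y[0] * var_y[1])  # cov(X1, X2) = cov(Y1, Y2)

    # Moments of the underlying normals Z_i = log(X_i).
    var_z = np.log1p(var_x / mu_x**2)
    mu_z = np.log(mu_x) - 0.5 * var_z
    cov_z = np.log1p(cov_x / (mu_x[0] * mu_x[1]))
    if abs(cov_z) > np.sqrt(var_z[0] * var_z[1]):
        raise ValueError("target correlation is too high for this method")

    sigma_z = np.array([[var_z[0], cov_z], [cov_z, var_z[1]]])
    z = rng.multivariate_normal(mu_z, sigma_z, size=size)
    # Conditionally independent Poisson counts given X = exp(Z).
    return rng.poisson(np.exp(z))
```

For instance, `lognormal_poisson_pair([2.0, 3.0], [4.0, 6.0], 0.3, 10_000)` returns a 10,000 × 2 array of counts whose sample moments should be close to the targets.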

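Constraints on Moments of Y1, Y2 with Lognormal-Poisson Method

• $\sigma^2_{Y_i} > \mu_{Y_i}$

• $$\rho_Y \le \frac{\mu_{Y_1}\mu_{Y_2}}{\sigma_{Y_1}\sigma_{Y_2}}\left[\exp\!\left(\sqrt{\log\!\left(1 + \frac{\sigma^2_{Y_1} - \mu_{Y_1}}{\mu^2_{Y_1}}\right)\log\!\left(1 + \frac{\sigma^2_{Y_2} - \mu_{Y_2}}{\mu^2_{Y_2}}\right)}\right) - 1\right]$$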
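The upper bound on $\rho_Y$ for the lognormal-Poisson method follows from requiring $\rho_Z \le 1$. A small helper (the name `lp_max_correlation` is hypothetical) makes it easy to evaluate for given target moments:

```python
import math

def lp_max_correlation(mu1, var1, mu2, var2):
    """Largest corr(Y1, Y2) reachable with the lognormal-Poisson hierarchy (sketch)."""
    if var1 <= mu1 or var2 <= mu2:
        raise ValueError("lognormal-Poisson needs variance > mean for both variables")
    # Variances of the underlying normals Z1, Z2.
    s1 = math.log1p((var1 - mu1) / mu1**2)
    s2 = math.log1p((var2 - mu2) / mu2**2)
    # rho_Z = 1 gives cov(X1, X2) = mu1 * mu2 * (exp(sqrt(s1 * s2)) - 1).
    max_cov = mu1 * mu2 * math.expm1(math.sqrt(s1 * s2))
    return max_cov / math.sqrt(var1 * var2)

bound = lp_max_correlation(0.8, 1.9, 0.8, 1.9)
```

When both variables share the same mean and variance, the bound reduces to $(\sigma^2 - \mu)/\sigma^2$, which is a convenient sanity check.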

Upper Bound for Correlation–Lognormal Poisson


Overlapping Sums Method For Generating Y1 and Y2

• Generate independent, discrete RVs X1, X2, X

• Let Y1 = X + X1

Y2 = X + X2

Holgate (1964): Correlated Poissons

We are not concerned with the exact distribution of Y1 and Y2, but we require them to be ecologically plausible.

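Obtaining the Right Moments

To get $Y_i \sim (\mu_{Y_i}, \sigma^2_{Y_i})$ with corr(Y1, Y2) = $\rho_Y$,

generate independent X1, X2, X with

$$\sigma^2_X = \rho_Y\,\sigma_{Y_1}\sigma_{Y_2} = \operatorname{cov}(Y_1, Y_2), \qquad \sigma^2_{X_i} = \sigma^2_{Y_i} - \rho_Y\,\sigma_{Y_1}\sigma_{Y_2}$$

and

$$\mu_X + \mu_{X_1} = \mu_{Y_1}, \qquad \mu_X + \mu_{X_2} = \mu_{Y_2}$$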

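Choose distributions for Xs based on relationship between variance and mean:

• If $\sigma^2_X > \mu_X$, use X ~ Negative binomial($\mu_X$, $\sigma^2_X$)

• If $\sigma^2_X = \mu_X$, use X ~ Poisson($\mu_X$)

• If $\sigma^2_X = \mu_X(1 - \mu_X)$, use X ~ Bernoulli($\mu_X$)

• If $\sigma^2_X > \mu_X(1 - \mu_X)$ and $\sigma^2_X < \mu_X$, use X = P + B, where B ~ Bernoulli(p) and P ~ Poisson($\lambda$), with $p = \sqrt{\mu_X - \sigma^2_X}$ and $\lambda = \mu_X - \sqrt{\mu_X - \sigma^2_X}$

• If $\sigma^2_X < \mu_X(1 - \mu_X)$, then X cannot be simulated by any method.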
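The case analysis above translates directly into a sampler. The sketch below is an assumption-laden illustration: the name `draw_count`, the tolerance `tol`, and the mapping to NumPy's negative binomial parametrisation (`n = mu**2 / (var - mu)`, `p = mu / var`) are choices made here, not taken from the slides.

```python
import numpy as np

def draw_count(mu, var, size, rng, tol=1e-9):
    """Sample counts with mean `mu` and variance `var`, choosing the family by var vs. mu."""
    bernoulli_var = mu * (1.0 - mu)
    if var > mu + tol:
        # Overdispersed: negative binomial with mean n(1-p)/p = mu, variance n(1-p)/p^2 = var.
        return rng.negative_binomial(mu**2 / (var - mu), mu / var, size)
    if abs(var - mu) <= tol:
        return rng.poisson(mu, size)
    if abs(var - bernoulli_var) <= tol:
        return rng.binomial(1, mu, size)
    if var > bernoulli_var:
        # Underdispersed but feasible: Bernoulli(p) + Poisson(lam).
        p = np.sqrt(mu - var)
        if p > 1.0:
            raise ValueError("Bernoulli + Poisson would need p > 1")
        lam = mu - p
        return rng.binomial(1, p, size) + rng.poisson(lam, size)
    raise ValueError("variance below mu * (1 - mu): no count r.v. has these moments")

rng = np.random.default_rng(0)
sample = draw_count(0.0579, 0.0554, 400_000, rng)  # falls in the Bernoulli + Poisson case
```

The last branch mirrors the final bullet: a request with variance below $\mu_X(1 - \mu_X)$ is rejected rather than approximated.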

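Constraints on Moments of Y1, Y2 with Overlapping Sums Method

• No constraints on means of Yi, but we require

▪ $\mu_{Y_i} > 0$

▪ Relationship between $\mu_{Y_i}$ and $\sigma^2_{Y_i}$ ecologically plausible

• $$\rho_Y \le \min\!\left(\frac{\sigma_{Y_1}}{\sigma_{Y_2}}, \frac{\sigma_{Y_2}}{\sigma_{Y_1}}\right)$$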
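The correlation bound is just the requirement that both $\sigma^2_{X_i} = \sigma^2_{Y_i} - \rho_Y\sigma_{Y_1}\sigma_{Y_2}$ stay non-negative. A minimal sketch, with hypothetical helper names and illustrative input values:

```python
import math

def os_max_correlation(var1, var2):
    """Upper bound on corr(Y1, Y2) for the overlapping sums method."""
    return min(math.sqrt(var1 / var2), math.sqrt(var2 / var1))

def os_component_variances(var1, var2, rho):
    """Variances of the shared component X and the private components X1, X2."""
    if not 0.0 <= rho <= os_max_correlation(var1, var2):
        raise ValueError("correlation outside the range overlapping sums can reach")
    common = rho * math.sqrt(var1 * var2)  # var(X) = cov(Y1, Y2)
    return common, var1 - common, var2 - common

# Illustrative targets: var(Y1) = 1.256, var(Y2) = 0.139, rho = 0.2.
var_x, var_x1, var_x2 = os_component_variances(1.256, 0.139, 0.2)
```

With these inputs the three variances come out near 0.0836, 1.172 and 0.0554.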

Upper Bound for Correlation–Overlapping Sums

Comparing Methods

[Figure: maximum possible correlation vs. $\sigma_1^2$ (0.5 to 3) for the overlapping sums (OS) and lognormal-Poisson (LP) methods, with $\mu_1 = 0.8$, $\mu_2 = 0.8$ and $\sigma_2^2 = 0.8$, 1.9, 3]

Comparing Methods

[Figure: maximum possible correlation vs. $\sigma_1^2$ (0.5 to 3) for the overlapping sums (OS) and lognormal-Poisson (LP) methods, with $\mu_1 = 0.8$, $\mu_2 = 1.4$ and $\sigma_2^2 = 1.4$, 3.2, 5]

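A quick example: Simulate Y1 and Y2 with $\mu_{Y_1} = 1.02$, $\sigma^2_{Y_1} = 1.256$, $\mu_{Y_2} = 0.15$, $\sigma^2_{Y_2} = 0.139$, and $\rho = 0.2$

Step 1: Find variances and means of X's

Y1 = X + X1

Y2 = X + X2

where X, X1, and X2 are independent count random variables with ...

Variances:

$$\sigma^2_X = \rho\,\sigma_{Y_1}\sigma_{Y_2} = 0.0836$$

$$\sigma^2_{X_1} = \sigma^2_{Y_1} - \sigma^2_X = 1.172$$

$$\sigma^2_{X_2} = \sigma^2_{Y_2} - \sigma^2_X = 0.0554$$

Means:

$$\mu_{Y_1} = \mu_X + \mu_{X_1}, \qquad \mu_{Y_2} = \mu_X + \mu_{X_2}$$

Two equations, three unknowns ...

Try $\mu_X = 0.5 - \sqrt{0.25 - \sigma^2_X} = 0.0921$ so X would be Bernoulli. Then

$$\mu_X = 0.0921, \qquad \mu_{X_1} = 0.928, \qquad \mu_{X_2} = 0.0579$$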

Step 2: Define distributions for X's

X ~ Bernoulli(0.0921) since $\sigma^2_X = \mu_X(1 - \mu_X)$ by design

X1 ~ Negative binomial with $\mu$ = 0.928 and $\sigma^2$ = 1.172

X2 = Bernoulli(p) + Poisson($\lambda$) with p = 0.05 and $\lambda$ = 0.0079

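Step 3: Simulate

$$\mathbf{Y} = T\mathbf{X}: \qquad \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ X_1 \\ X_2 \end{pmatrix}$$

Y1 = X + X1

Y2 = X + X2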
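The three steps of the quick example can be run end to end with NumPy. This is a sketch under stated assumptions: the seed, the number of draws `n_sims`, and the negative binomial parametrisation (`n = mu**2 / (var - mu)`, `p = mu / var`) are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 200_000  # number of simulated (Y1, Y2) pairs; arbitrary choice

# X ~ Bernoulli(0.0921): the component shared by Y1 and Y2.
x = rng.binomial(1, 0.0921, n_sims)

# X1 ~ negative binomial with mean 0.928 and variance 1.172.
mu1, var1 = 0.928, 1.172
x1 = rng.negative_binomial(mu1**2 / (var1 - mu1), mu1 / var1, n_sims)

# X2 = Bernoulli(0.05) + Poisson(0.0079).
x2 = rng.binomial(1, 0.05, n_sims) + rng.poisson(0.0079, n_sims)

# Y = T X, with one row of T per Y and one column per X.
T = np.array([[1, 1, 0],
              [1, 0, 1]])
Y = T @ np.vstack([x, x1, x2])

sim_means = Y.mean(axis=1)            # targets: 1.02, 0.15
sim_vars = Y.var(axis=1, ddof=1)      # targets: 1.256, 0.139
sim_corr = np.corrcoef(Y)[0, 1]       # target: 0.2
```

Comparing `sim_means`, `sim_vars` and `sim_corr` with the targets is a quick check that the component moments were chosen consistently.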

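Generalizing to n > 2:

$\mathbf{Y}_{n \times 1} = T_{n \times m}\,\mathbf{X}_{m \times 1}$, where X is a vector of independent r.v.'s, and T is a matrix of 0's and 1's

1. Park & Shin (1998) algorithm gives variances for X's:

Find n × m matrix T consisting of 0's and 1's and m-vector $\boldsymbol{\sigma}^2_X = (\sigma^2_{X_1}, \ldots, \sigma^2_{X_m})^\top$ such that $(\sigma^2_{Y_1}, \ldots, \sigma^2_{Y_n})^\top = T\boldsymbol{\sigma}^2_X$ and $\operatorname{cov}(\mathbf{Y}) = T \operatorname{cov}(\mathbf{X})\, T^\top$

2. Linear programming gives reasonable means for X's:

Find m-vector $\boldsymbol{\mu}_X = (\mu_{X_1}, \ldots, \mu_{X_m})^\top$ that solves $(\mu_{Y_1}, \ldots, \mu_{Y_n})^\top = T\boldsymbol{\mu}_X$

subject to constraints: (i) $\mu_i > 0$ for all i; and

(ii) $\mu_i \le 0.5 - \sqrt{0.25 - \sigma_i^2}$ when $\sigma_i^2 \le 0.25$

3. Generate independent X's with the appropriate distributions and multiply by T:

$\sigma^2_X > \mu_X \Rightarrow X \sim$ Negative binomial

$\sigma^2_X = \mu_X \Rightarrow X \sim$ Poisson

$\sigma^2_X = \mu_X(1 - \mu_X) \Rightarrow X \sim$ Bernoulli

$\mu_X(1 - \mu_X) < \sigma^2_X < \mu_X \Rightarrow X \sim$ Bernoulli + Poisson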

Park & Shin (1998) algorithm gives variances of X's

E.g., suppose

$$\operatorname{cov}(\mathbf{Y}) = \begin{pmatrix} 1 & 0.3 & 0.2 & 0.1 \\ 0.3 & 2.3 & 0.5 & 0.12 \\ 0.2 & 0.5 & 2.6 & 0.09 \\ 0.1 & 0.12 & 0.09 & 0.45 \end{pmatrix}$$

The smallest covariance is $\operatorname{cov}(Y_3, Y_4) = \sigma^2_X = 0.09$, so $\sigma^2_X = 0.09$ for the common component of Y3 and Y4. Every covariance is positive, so this component is shared by all four Y's: the first column of T is $(1, 1, 1, 1)^\top$. Subtracting 0.09 from the matrix leaves (upper triangle shown)

$$\begin{pmatrix} 0.91 & 0.21 & 0.11 & 0.01 \\ * & 2.21 & 0.41 & 0.03 \\ * & * & 2.51 & 0 \\ * & * & * & 0.36 \end{pmatrix}$$

The smallest remaining positive entry is 0.01, shared by Y1, Y2 and Y4. The second column of T is $(1, 1, 0, 1)^\top$ with $\sigma^2_X = 0.01$, leaving

$$\begin{pmatrix} 0.90 & 0.20 & 0.11 & 0 \\ * & 2.20 & 0.41 & 0.02 \\ * & * & 2.51 & 0 \\ * & * & * & 0.35 \end{pmatrix}$$

Repeating until nothing remains adds one column of T and one variance per step:

Step   $\sigma^2_X$   Y's sharing the component
1      0.09    Y1, Y2, Y3, Y4
2      0.01    Y1, Y2, Y4
3      0.02    Y2, Y4
4      0.11    Y1, Y2, Y3
5      0.09    Y1, Y2
6      0.27    Y2, Y3
7      0.33    Y4
8      0.70    Y1
9      1.71    Y2
10     2.14    Y3

The result is

$$T = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix}$$

$$\boldsymbol{\sigma}^2_X = (0.09,\ 0.01,\ 0.02,\ 0.11,\ 0.09,\ 0.27,\ 0.33,\ 0.70,\ 1.71,\ 2.14)^\top$$

and the remaining matrix is all zeros.
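A greedy decomposition in the spirit of the Park & Shin steps can be sketched as follows. It is not claimed to reproduce the published algorithm exactly: the function name, the tolerance, and the index-order rule for growing the shared set are assumptions, and intermediate values it produces may differ slightly from the rounded figures shown on the slides.

```python
import numpy as np

def overlapping_sum_decomposition(cov, tol=1e-12):
    """Return (T, variances) with T @ diag(variances) @ T.T equal to `cov` (sketch)."""
    a = np.array(cov, dtype=float)
    n = a.shape[0]
    columns, variances = [], []
    while np.any(a > tol):
        # Smallest remaining positive entry (diagonal included).
        masked = np.where(a > tol, a, np.inf)
        i, j = np.unravel_index(np.argmin(masked), a.shape)
        beta = a[i, j]
        # Grow a set containing i and j whose pairwise entries are all positive.
        members = sorted({int(i), int(j)})
        for k in range(n):
            if k not in members and a[k, k] > tol and all(a[k, m] > tol for m in members):
                members.append(k)
        block = np.ix_(members, members)
        if np.any(a[block] < beta - tol):
            raise ValueError("covariance matrix cannot be built from overlapping sums")
        a[block] -= beta  # remove the shared component from every member pair
        column = np.zeros(n, dtype=int)
        column[members] = 1
        columns.append(column)
        variances.append(beta)
    return np.column_stack(columns), np.array(variances)

cov_y = np.array([[1.0, 0.3, 0.2, 0.1],
                  [0.3, 2.3, 0.5, 0.12],
                  [0.2, 0.5, 2.6, 0.09],
                  [0.1, 0.12, 0.09, 0.45]])
T, var_x = overlapping_sum_decomposition(cov_y)
```

Whatever columns the greedy rule picks, `T @ np.diag(var_x) @ T.T` should reproduce `cov_y`, which is the property the simulation relies on.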

[Figure: maps of grubs, adult activity, distance to nearest tree, and organic matter]

Grub population density as a function of several covariates

Name     Description
clayuk   clay content of soil
dml      distance to nearest tree
dnx      distance to nearest patch of soil with high organic matter content
fair     fairway/rough indicator
heat     intensity of adult activity
om.e     organic matter flexure
tap      total adult population
tw45     number of trees within 45 meters
vix      vegetation index
wbuk     soil organic matter

[Figure: histogram of grub counts (frequency vs. grub count); variance of residuals by quartile of fitted values; correlation of residuals vs. lag distance (feet)]

Are the conditions for multiple regression met?

1. Non-normal response variable

2. Variance not constant

3. Observations not independent

Generalized linear model (Fisher 1935; Dempster 1971; Berk 1972; Nelder and Wedderburn 1972)

A. Accommodates response variables with distribution in exponential family (including normal, binomial, Poisson, gamma, exponential, chi-squared, etc.)
B. Allows for non-constant variance

with quasi-likelihood estimation (Wedderburn 1974)

A. Accommodates response variables that are not in an exponential family (including negative binomial, unspecified distributions)
B. Requires only that the variance of the response variable be expressed as a function of the mean

adapted for spatially dependent observations (Liang and Zeger 1986; McCullagh and Nelder 1989; Albert and McShane 1995; Gotway and Stroup 1997; Dalthorp 2004)

A. Accounts for spatial autocorrelation in the residuals
B. The statistical theory for the model is not well-developed

Example: Japanese beetle grub population density vs. soil organic matter

[Figure: grubs per soil sample vs. organic matter content (%), with three panels: Means, Variances (s² vs. mean), and Correlations (correlation vs. lag distance, feet)]

6.762 3( 33.3 13.2 2.03 0.0965 )( 0.0148)i iOM OM OM

Means (via GLM):

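Variances (via TPL): $\sigma^2 = 1.23\,\mu^{1.148}$

Correlations (via spherical model), with $d = \lVert Y_1 - Y_2 \rVert$ the distance between the locations of Y1 and Y2:

$$\rho(Y_1, Y_2) = \begin{cases} 1 & \text{if } d = 0 \\ 0.25\left(1 - 0.015\,d + 0.5\,(d/100)^3\right) & \text{if } 0 < d \le 100 \\ 0 & \text{if } d > 100 \end{cases}$$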
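The variance and correlation models combine into a target covariance matrix for the Y's. The sketch below assumes the Taylor's power law form $\sigma^2 = 1.23\,\mu^{1.148}$ and a spherical correlation with sill 0.25 and range 100 feet; the function names and the `coords` layout (one row of x, y coordinates per sample) are hypothetical.

```python
import numpy as np

def tpl_variance(mu, a=1.23, b=1.148):
    """Taylor's power law: variance as a power function of the mean."""
    return a * np.asarray(mu, dtype=float) ** b

def spherical_correlation(d, sill=0.25, corr_range=100.0):
    """Spherical correlation model; equals 1 at distance 0 and 0 beyond the range."""
    d = np.asarray(d, dtype=float)
    rho = sill * (1.0 - 1.5 * d / corr_range + 0.5 * (d / corr_range) ** 3)
    rho = np.where(d >= corr_range, 0.0, rho)
    return np.where(d == 0, 1.0, rho)

def target_covariance(coords, mu):
    """Covariance matrix of the Y's from sample coordinates and target means."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sd = np.sqrt(tpl_variance(mu))
    return spherical_correlation(dist) * np.outer(sd, sd)

# Three hypothetical sample locations (feet) with hypothetical target means.
cov_y = target_covariance([[0, 0], [30, 40], [200, 0]], [1.02, 0.15, 0.5])
```

A matrix built this way is the input that the variance decomposition step needs.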

• X's are independent, count-valued random variables
-- variances from Park & Shin's algorithm
-- means from linear programming

### PROBLEM ### No solution found!

Choice between one of the following:

i. One Y mean off-target but no impossible X r.v.'s
   Need: Y with $\mu$ = 0.141
   Can only do: $\mu$ = 0.151

ii. One impossible X r.v. ($\sigma^2 < \mu(1 - \mu)$)
   We need: r.v. with $\mu$ = 0.0385, $\sigma^2$ = 0.0272
   Can do Bernoulli: $\mu$ = 0.0385, $\sigma^2$ = 0.0370
   Consequences? Var(Y16) = 0.139 vs. target of 0.129

The simulation

1000 reps with n = 143: $\mathbf{Y}_{143 \times 1000} = T_{143 \times 3720}\,\mathbf{X}_{3720 \times 1000}$

Results for 1000 simulation runs:

• 3720 X's consisting of:
-- Negative binomial: 1580
-- Bernoulli: 2099
-- Bernoulli + Poisson: 40
-- Impossible: 1 (simulated $\sigma^2$ slightly larger than target)

[Figure: simulated vs. target values in three panels: Means (simulated mean vs. target mean), Variances (simulated variance vs. target variance), and Correlations (simulated vs. target correlation, and correlation vs. lag distance)]

Example: Diamond back moth dispersal

[Figure: DBM count vs. distance from release point; map of release point and traps; panels: Means, Variances (variance vs. mean), and Correlation (correlation vs. lag distance)]

The simulation

1000 reps with n = 114: $\mathbf{Y}_{114 \times 1000} = T_{114 \times 1880}\,\mathbf{X}_{1880 \times 1000}$

• X's are independent negative binomials
-- variances from Park & Shin's algorithm
-- means from linear programming

• T is a matrix of zeros and ones that defines the common components of the Y’s

Results

[Figure: simulated vs. target Means and Variances; Correlation: simulated vs. target, plotted against lag distance]

* Circles are averages for 1000 sims

Example: Weed counts (Chenopodium polyspermum) vs. soil magnesium

Weed counts and soil [Mg] in random quadrats in a field ...

[Figure: weeds per sample vs. soil magnesium; panels: Means, Variances (s² vs. mean), and Correlation (correlation vs. lag distance)]

### Infeasible correlations ###

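Highest possible correlation between Yi, Yj is:

$$\rho_{i,j} = \operatorname{Corr}(Y_i, Y_j) = \frac{\sigma^2_{Y_j}}{\sqrt{\sigma^2_{Y_i}\,\sigma^2_{Y_j}}} = \frac{\sigma_{Y_j}}{\sigma_{Y_i}}, \qquad \text{where } \sigma_{Y_j} \le \sigma_{Y_i}$$

With 49 pairs of points in the weed data, target $\rho_{i,j}$ is too high.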
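Pairs whose target correlation exceeds this bound can be flagged before any simulation is attempted. The helper below is a hypothetical sketch; it checks only the pairwise condition (target correlation between 0 and the ratio of the smaller to the larger standard deviation), which is necessary but may not be sufficient for a full decomposition to exist.

```python
import numpy as np

def infeasible_pairs(cov, tol=1e-12):
    """List index pairs (i, j), i < j, whose target correlation overlapping sums cannot reach."""
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    # Pairwise bound: smaller standard deviation over larger standard deviation.
    bound = np.minimum.outer(sd, sd) / np.maximum.outer(sd, sd)
    i, j = np.triu_indices_from(cov, k=1)
    bad = (corr[i, j] > bound[i, j] + tol) | (corr[i, j] < 0)
    return list(zip(i[bad].tolist(), j[bad].tolist()))

# Hypothetical targets: variances 1 and 4, so the bound on the correlation is 0.5.
ok_pairs = infeasible_pairs([[1.0, 0.9], [0.9, 4.0]])    # correlation 0.45
bad_pairs = infeasible_pairs([[1.0, 1.2], [1.2, 4.0]])   # correlation 0.60
```

In the first call the list is empty; in the second the single pair is reported because 0.60 exceeds the bound of 0.5.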

Summary

• Correlated count r.v.'s can be simulated by overlapping sums of independent negative binomials, Bernoullis, and Poissons

• The simulated r.v.'s are very close to negative binomial where $\mu < \sigma^2$ and very close to Bernoulli + Poisson where $\mu > \sigma^2$

• Negative correlations and strong positive correlations between r.v.’s with very different variances are not attainable, but ...

• The method can accommodate a wide variety of ecologically important scenarios that the hierarchical lognormal-Poisson model balks at, including:

-- underdispersed count r.v.'s

-- moderately strong correlations where $\mu_1 \approx \mu_2$ and $\sigma_1^2 \approx \sigma_2^2$
