Simulation of spatially correlated discrete random variables
Dan Dalthorp and Lisa Madsen
Department of Statistics, Oregon State University
[email protected]@science.oregonstate.edu
Outline
I. Generating one pair of correlated discrete random variables
   (a) Lognormal-Poisson hierarchy
   (b) Overlapping sums
II. Generating a vector of correlated discrete random variables by overlapping sums
III. Examples
Introduction
Generate Y1, Y2 where
• Y1, Y2 have specified means $\mu_{Y_1}, \mu_{Y_2}$, variances $\sigma^2_{Y_1}, \sigma^2_{Y_2}$, and correlation $\rho_Y \ge 0$
• Y1, Y2 are count r.v.'s, i.e., y = 0, 1, 2, ...
• Distributions of Y1, Y2 are unimodal, Poisson-like
• If $\sigma^2 < \mu$, then both $\sigma^2$ and $\mu$ are small
Lognormal-Poisson Method for Generating Y1 and Y2
• Generate correlated normal RVs Z1, Z2
• Transform to lognormals Xi = exp(Zi)
• Generate conditionally independent Yi ~ Poisson(Xi)
Y1 and Y2 resemble negative binomial RVs.
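Obtaining the Right Moments
To get $Y_i \sim (\mu_{Y_i}, \sigma^2_{Y_i})$ with corr(Y1, Y2) = $\rho_Y$,
generate lognormals X1, X2 with
$$\mu_{X_i} = \mu_{Y_i}, \qquad \sigma^2_{X_i} = \sigma^2_{Y_i} - \mu_{Y_i}, \qquad \rho_X = \rho_Y \frac{\sigma_{Y_1}\sigma_{Y_2}}{\sqrt{(\sigma^2_{Y_1} - \mu_{Y_1})(\sigma^2_{Y_2} - \mu_{Y_2})}}$$
This requires normals Z1, Z2 with
$$\mu_{Z_i} = \log\left(\frac{\mu_{X_i}^2}{\sqrt{\mu_{X_i}^2 + \sigma^2_{X_i}}}\right), \qquad \sigma^2_{Z_i} = \log\left(1 + \frac{\sigma^2_{X_i}}{\mu_{X_i}^2}\right)$$
and
$$\rho_Z = \frac{\log\left(1 + \rho_X \dfrac{\sigma_{X_1}\sigma_{X_2}}{\mu_{X_1}\mu_{X_2}}\right)}{\sqrt{\log\left(1 + \dfrac{\sigma^2_{X_1}}{\mu_{X_1}^2}\right)\log\left(1 + \dfrac{\sigma^2_{X_2}}{\mu_{X_2}^2}\right)}}$$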
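The moment matching for the lognormal-Poisson hierarchy can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `lognormal_poisson_normal_params` and its argument layout are assumptions, and it uses only the standard lognormal moment identities together with E[Y] = E[X], Var(Y) = Var(X) + E[X], and cov(Y1, Y2) = cov(X1, X2).

```python
import math

def lognormal_poisson_normal_params(mu_y, var_y, rho_y):
    """Map target moments of (Y1, Y2) to moments of the underlying normals (Z1, Z2).

    mu_y, var_y: length-2 sequences of target means and variances.
    Returns (mu_z, var_z, rho_z). Hypothetical helper for illustration only.
    """
    mu_x = list(mu_y)                                  # E[X_i] = E[Y_i]
    var_x = [v - m for m, v in zip(mu_y, var_y)]       # Var(X_i) = Var(Y_i) - E[Y_i]
    if min(var_x) <= 0:
        raise ValueError("each Y needs variance > mean")
    cov_x = rho_y * math.sqrt(var_y[0] * var_y[1])     # cov(X1, X2) = cov(Y1, Y2)
    var_z = [math.log(1 + v / m ** 2) for m, v in zip(mu_x, var_x)]
    mu_z = [math.log(m) - 0.5 * s for m, s in zip(mu_x, var_z)]
    rho_z = (math.log(1 + cov_x / (mu_x[0] * mu_x[1]))
             / math.sqrt(var_z[0] * var_z[1]))
    if not -1 <= rho_z <= 1:
        raise ValueError("target correlation is not attainable")
    return mu_z, var_z, rho_z
```

Note that `math.log(m) - 0.5 * s` is the same quantity as $\log(\mu_X^2 / \sqrt{\mu_X^2 + \sigma_X^2})$, written in a form that reuses the normal variance.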
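Constraints on Moments of Y1, Y2 with Lognormal-Poisson Method
• $\sigma^2_{Y_i} > \mu_{Y_i}$
• $$\rho_Y \le \frac{\mu_{Y_1}\mu_{Y_2}}{\sigma_{Y_1}\sigma_{Y_2}}\left[\exp\left(\sqrt{\log\left(\frac{\sigma^2_{Y_1} - \mu_{Y_1}}{\mu_{Y_1}^2} + 1\right)\log\left(\frac{\sigma^2_{Y_2} - \mu_{Y_2}}{\mu_{Y_2}^2} + 1\right)}\right) - 1\right]$$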
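The upper bound corresponds to setting the correlation of the underlying normals to 1. A minimal sketch of that bound, assuming the hypothetical function name `max_rho_lognormal_poisson`:

```python
import math

def max_rho_lognormal_poisson(mu1, var1, mu2, var2):
    """Largest corr(Y1, Y2) reachable by the lognormal-Poisson hierarchy (rho_Z = 1)."""
    if var1 <= mu1 or var2 <= mu2:
        raise ValueError("lognormal-Poisson needs variance > mean for each Y")
    a = math.log((var1 - mu1) / mu1 ** 2 + 1)   # variance of Z1
    b = math.log((var2 - mu2) / mu2 ** 2 + 1)   # variance of Z2
    return mu1 * mu2 / math.sqrt(var1 * var2) * (math.exp(math.sqrt(a * b)) - 1)
```

When both variables have the same mean and variance, the expression reduces to $(\sigma^2_Y - \mu_Y)/\sigma^2_Y$, so the bound approaches 0 as the variance approaches the mean.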
Upper Bound for Correlation–Lognormal Poisson
Overlapping Sums Method For Generating Y1 and Y2
• Generate independent, discrete RVs X1, X2, X
• Let Y1 = X + X1
Y2 = X + X2
Holgate (1964): Correlated Poissons
We are not concerned with the exact distribution of Y1 and Y2, but we require them to be ecologically plausible.
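Obtaining the Right Moments
To get $Y_i \sim (\mu_{Y_i}, \sigma^2_{Y_i})$ with corr(Y1, Y2) = $\rho_Y$,
generate independent X1, X2, X with
$$\sigma^2_X = \rho_Y\,\sigma_{Y_1}\sigma_{Y_2} = \operatorname{cov}(Y_1, Y_2), \qquad \sigma^2_{X_i} = \sigma^2_{Y_i} - \rho_Y\,\sigma_{Y_1}\sigma_{Y_2}$$
and
$$\mu_X + \mu_{X_1} = \mu_{Y_1}, \qquad \mu_X + \mu_{X_2} = \mu_{Y_2}$$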
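The variance bookkeeping for overlapping sums is simple enough to check numerically. A small sketch, with the function name `overlapping_sums_variances` chosen here for illustration:

```python
import math

def overlapping_sums_variances(var_y1, var_y2, rho_y):
    """Return (var_x, var_x1, var_x2) for Y1 = X + X1, Y2 = X + X2."""
    var_x = rho_y * math.sqrt(var_y1 * var_y2)   # shared part carries cov(Y1, Y2)
    var_x1 = var_y1 - var_x                      # what is left for X1
    var_x2 = var_y2 - var_x                      # what is left for X2
    if var_x < 0 or var_x1 < 0 or var_x2 < 0:
        raise ValueError("correlation not attainable by overlapping sums")
    return var_x, var_x1, var_x2
```

With the variances 1.256 and 0.139 and correlation 0.2 used in the deck's quick example, this gives approximately 0.0836, 1.172, and 0.0554.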
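Choose distributions for Xs based on relationship between variance and mean:
• If $\sigma^2_X > \mu_X$, use X ~ Negative binomial($\mu_X$, $\sigma^2_X$)
• If $\sigma^2_X = \mu_X$, use X ~ Poisson($\mu_X$)
• If $\sigma^2_X = \mu_X(1 - \mu_X)$, use X ~ Bernoulli($\mu_X$)
• If $\sigma^2_X < \mu_X$ and $\sigma^2_X > \mu_X(1 - \mu_X)$, use X = P + B, where B ~ Bernoulli(p) and P ~ Poisson($\lambda$), with $p = \sqrt{\mu_X - \sigma^2_X}$ and $\lambda = \mu_X - \sqrt{\mu_X - \sigma^2_X}$
• If $\sigma^2_X < \mu_X(1 - \mu_X)$, then X cannot be simulated by any method.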
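The selection rules can be written as one small function. This is a sketch under assumptions: the name `choose_distribution`, the returned labels, and the tolerance `tol` used for the equality cases are all illustrative choices, and the "not_covered" branch marks a case the rules above do not address (p would exceed 1).

```python
import math

def choose_distribution(mu, var, tol=1e-9):
    """Pick a count distribution with mean mu and variance var."""
    bern_var = mu * (1 - mu)                       # variance of Bernoulli(mu)
    if math.isclose(var, mu, abs_tol=tol):
        return ("poisson", {"lam": mu})
    if var > mu:
        return ("negative_binomial", {"mean": mu, "var": var})
    if math.isclose(var, bern_var, abs_tol=tol):
        return ("bernoulli", {"p": mu})
    if var > bern_var:
        p = math.sqrt(mu - var)                    # Bernoulli part
        if p <= 1:
            return ("bernoulli_plus_poisson", {"p": p, "lam": mu - p})
        return ("not_covered", {})                 # outside the cases listed above
    return ("impossible", {})                      # var < mu * (1 - mu)
```

The Bernoulli + Poisson case works because Var(B) + Var(P) = p(1 - p) + (mu - p) = mu - p^2, which equals the target variance when p^2 = mu - var.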
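Constraints on Moments of Y1, Y2 with Overlapping Sums Method
• No constraints on means of Yi, but we require
  ▪ $\mu_{Y_i} > 0$
  ▪ Relationship between $\mu_{Y_i}$ and $\sigma^2_{Y_i}$ ecologically plausible
• $\rho_Y \le \min\left(\dfrac{\sigma_{Y_1}}{\sigma_{Y_2}}, \dfrac{\sigma_{Y_2}}{\sigma_{Y_1}}\right)$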
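The correlation bound for overlapping sums follows from requiring that the variances left for X1 and X2 stay nonnegative. A one-function sketch (the name `max_rho_overlapping_sums` is an assumption):

```python
import math

def max_rho_overlapping_sums(var1, var2):
    """Largest corr(Y1, Y2) reachable by overlapping sums: min(sd1/sd2, sd2/sd1)."""
    sd1, sd2 = math.sqrt(var1), math.sqrt(var2)
    return min(sd1 / sd2, sd2 / sd1)
```

The bound equals 1 only when the two variances are equal and shrinks as they diverge; for example, variances of 1 and 4 cap the correlation at 0.5.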
Upper Bound for Correlation–Overlapping Sums
Comparing Methods
[Figure: maximum possible correlation (0 to 1) vs. $\sigma^2_1$ (0.5 to 3) for the overlapping sums (OS) and lognormal-Poisson (LP) methods, with $\mu_1 = 0.8$, $\mu_2 = 0.8$, and $\sigma^2_2$ = 0.8, 1.9, 3]
Comparing Methods
[Figure: maximum possible correlation (0 to 1) vs. $\sigma^2_1$ (0.5 to 3) for the overlapping sums (OS) and lognormal-Poisson (LP) methods, with $\mu_1 = 0.8$, $\mu_2 = 1.4$, and $\sigma^2_2$ = 1.4, 3.2, 5]
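A quick example: Simulate Y1 and Y2 with $\mu_{Y_1} = 1.02$, $\sigma^2_{Y_1} = 1.256$, $\mu_{Y_2} = 0.15$, $\sigma^2_{Y_2} = 0.139$, and $\rho$ = 0.2
Step 1: Find variances and means of X's
Y1 = X + X1
Y2 = X + X2
where X, X1, and X2 are independent count random variables with ...
Variances:
$\sigma^2_X = \rho\,\sigma_{Y_1}\sigma_{Y_2} = 0.0836$
$\sigma^2_{X_1} = \sigma^2_{Y_1} - \sigma^2_X = 1.172$
$\sigma^2_{X_2} = \sigma^2_{Y_2} - \sigma^2_X = 0.0554$
Means:
$\mu_{Y_1} = \mu_X + \mu_{X_1}$
$\mu_{Y_2} = \mu_X + \mu_{X_2}$
Two equations, three unknowns ...
Try $\mu_X = 0.5 - \sqrt{0.25 - \sigma^2_X} = 0.0921$ so X would be Bernoulli.
$\mu_X = 0.0921$, $\mu_{X_1} = 0.928$, $\mu_{X_2} = 0.0579$
Step 2: Define distributions for X's
X ~ Bernoulli(0.0921) since $\sigma^2_X = \mu_X(1 - \mu_X)$ by design
X1 ~ Negative binomial with $\mu$ = 0.928 and $\sigma^2$ = 1.172
X2 = Bernoulli(p) + Poisson($\lambda$) with p = 0.05 and $\lambda$ = 0.0079
Step 3: Simulate
$Y = TX$:
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}\begin{pmatrix} X \\ X_1 \\ X_2 \end{pmatrix}$$
Y1 = X + X1
Y2 = X + X2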
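The three steps of the quick example can be tried out with the Python standard library alone. This is a sketch under assumptions: the helper names are hypothetical, the negative binomial is drawn as a gamma-Poisson mixture, the Poisson sampler is Knuth's multiplication method (fine for small rates like these), and the seed and number of replicates are arbitrary.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def negative_binomial(mean, var, rng):
    """Gamma-Poisson mixture with the requested mean and variance (needs var > mean)."""
    shape = mean ** 2 / (var - mean)
    return poisson(rng.gammavariate(shape, mean / shape), rng)

def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

def simulate_pair(rng):
    x = bernoulli(0.0921, rng)                          # shared component X
    x1 = negative_binomial(0.928, 1.172, rng)           # X1
    x2 = bernoulli(0.05, rng) + poisson(0.0079, rng)    # X2 = B + P
    return x + x1, x + x2                               # Y = T X, T = [[1,1,0],[1,0,1]]

def summarize(pairs):
    """Sample means, variances, and correlation of the simulated (Y1, Y2) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / (n - 1)
    v2 = sum((b - m2) ** 2 for _, b in pairs) / (n - 1)
    c = sum((a - m1) * (b - m2) for a, b in pairs) / (n - 1)
    return m1, v1, m2, v2, c / math.sqrt(v1 * v2)

rng = random.Random(1)
pairs = [simulate_pair(rng) for _ in range(50_000)]
print(summarize(pairs))   # compare with targets 1.02, 1.256, 0.15, 0.139, 0.2
```

The printed sample moments should land near the targets, with the usual Monte Carlo noise.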
Generalizing to n > 2:
1. Park & Shin (1998) algorithm gives variances for X's:
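Find n × m matrix T consisting of 0's and 1's and m-vector $\sigma^2_X = (\sigma^2_{X_1}, \ldots, \sigma^2_{X_m})'$
such that $Y = TX$ and $\operatorname{cov}(Y) = T\operatorname{cov}(X)\,T'$
2. Linear programming gives reasonable means for X's:
Find m-vector $\mu_X = (\mu_{X_1}, \ldots, \mu_{X_m})'$ that solves $T\mu_X = \mu_Y$, where $\mu_Y = (\mu_{Y_1}, \ldots, \mu_{Y_n})'$,
subject to constraints: (i) $\mu_i > 0$ for all i; and
(ii) $\mu_i \le 0.5 - \sqrt{0.25 - \sigma_i^2}$ when $\sigma_i^2 \le 0.25$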
3. Generate independent X's with the appropriate distributions and multiply by T:
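$\sigma^2_X > \mu_X \Rightarrow$ X ~ Negative binomial
$\sigma^2_X = \mu_X \Rightarrow$ X ~ Poisson
$\sigma^2_X = \mu_X(1 - \mu_X) \Rightarrow$ X ~ Bernoulli
$\mu_X(1 - \mu_X) < \sigma^2_X < 0.25 \Rightarrow$ X ~ Bernoulli + Poisson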
$Y_{n \times 1} = T_{n \times l}\,X_{l \times 1}$, where X is a vector of independent r.v.'s, and T is a matrix of 0's and 1's
$Y = (Y_1, \ldots, Y_n)' = TX$
$\operatorname{cov}(Y) = T\operatorname{cov}(X)\,T'$
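Because the X's are independent, cov(X) is diagonal and the covariance implied by a given T can be checked directly. A small sketch in plain Python (the function name `implied_covariance` is an assumption), applied to the T and variances of the two-variable quick example:

```python
def implied_covariance(T, var_x):
    """cov(Y) = T diag(var_x) T' for independent components X."""
    n, m = len(T), len(var_x)
    return [[sum(T[i][k] * var_x[k] * T[j][k] for k in range(m)) for j in range(n)]
            for i in range(n)]

T = [[1, 1, 0],     # Y1 = X + X1
     [1, 0, 1]]     # Y2 = X + X2
var_x = [0.0836, 1.172, 0.0554]
cov_y = implied_covariance(T, var_x)
print(cov_y)
```

The off-diagonal entry is the variance of the shared component, and each diagonal entry is the sum of the variances of the components that make up that Y.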
Park & Shin (1998) algorithm gives variances of X's

E.g., suppose
$$\operatorname{cov}(Y) = \begin{pmatrix} 1 & 0.3 & 0.2 & 0.1 \\ 0.3 & 2.3 & 0.5 & 0.12 \\ 0.2 & 0.5 & 2.6 & 0.09 \\ 0.1 & 0.12 & 0.09 & 0.45 \end{pmatrix}$$

Step 1: $\sigma^2_X = 0.09$ for the common component of Y3 and Y4. Since $\operatorname{cov}(Y_i, Y_j) \ge \sigma^2_X$ for all i, j, every Y shares this component:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} = \begin{pmatrix} X_{0.09} + \ldots \\ X_{0.09} + \ldots \\ X_{0.09} + \ldots \\ X_{0.09} + \ldots \end{pmatrix}, \qquad T = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \qquad \sigma^2_X = (0.09)$$
Remaining covariance (* marks the symmetric lower triangle):
$$\begin{pmatrix} 0.91 & 0.21 & 0.11 & 0.01 \\ * & 2.21 & 0.41 & 0.03 \\ * & * & 2.51 & 0 \\ * & * & * & 0.36 \end{pmatrix}$$

Step 2: The smallest remaining nonzero element is 0.01, shared by Y1, Y2, and Y4:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} = \begin{pmatrix} X_{0.09} + X_{0.01} + \ldots \\ X_{0.09} + X_{0.01} + \ldots \\ X_{0.09} + \ldots \\ X_{0.09} + X_{0.01} + \ldots \end{pmatrix}, \qquad T = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad \sigma^2_X = (0.09, 0.01)'$$
Remaining covariance:
$$\begin{pmatrix} 0.90 & 0.20 & 0.11 & 0 \\ * & 2.20 & 0.41 & 0.02 \\ * & * & 2.51 & 0 \\ * & * & * & 0.35 \end{pmatrix}$$

The remaining steps proceed the same way: each takes the smallest remaining nonzero element as the variance of a new X, adds a column to T marking the Y's that share it, and subtracts it from the corresponding elements, until the remaining matrix is all zeros.

Step   Variance of new X   Column of T (Y1 Y2 Y3 Y4)
1      0.09                1 1 1 1
2      0.01                1 1 0 1
3      0.02                0 1 0 1
4      0.11                1 1 1 0
5      0.09                1 1 0 0
6      0.27                0 1 1 0
7      0.33                0 0 0 1
8      0.70                1 0 0 0
9      1.71                0 1 0 0
10     2.14                0 0 1 0

Final result:
$$T = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix}$$
$$\sigma^2_X = (0.09,\ 0.01,\ 0.02,\ 0.11,\ 0.09,\ 0.27,\ 0.33,\ 0.70,\ 1.71,\ 2.14)'$$
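The stepwise peeling of the covariance matrix can be sketched as a greedy loop. This is an illustrative reading of the steps shown in the worked example, not a verified transcription of Park & Shin's published algorithm: the function name `decompose_covariance`, the rule for growing the set of Y's that share a component, and the tolerance `tol` are all assumptions.

```python
def decompose_covariance(cov, tol=1e-9):
    """Greedy split of a nonnegative covariance matrix into shared components.

    Returns (T, var_x): T is an n x m 0/1 matrix (list of rows) and var_x holds
    the m component variances, so that cov = T diag(var_x) T'.
    """
    n = len(cov)
    rest = [list(row) for row in cov]             # covariance still unexplained
    columns, var_x = [], []
    while True:
        positive = [(rest[i][j], i, j) for i in range(n) for j in range(i, n)
                    if rest[i][j] > tol]
        if not positive:
            break
        v, i, j = min(positive)                   # smallest remaining element
        members = sorted({i, j})
        for k in range(n):                        # add Y's that covary with all members
            if k not in members and all(rest[k][s] > tol for s in members):
                members.append(k)
        if any(rest[s][s] < v - tol for s in members):
            raise ValueError("matrix cannot be decomposed this way")
        for a in members:                         # remove the new component's share
            for b in members:
                rest[a][b] -= v
        columns.append([1 if k in members else 0 for k in range(n)])
        var_x.append(v)
    if any(abs(rest[a][b]) > tol for a in range(n) for b in range(n)):
        raise ValueError("negative covariances are not supported")
    T = [[col[r] for col in columns] for r in range(n)]
    return T, var_x

cov_y = [[1.0, 0.3, 0.2, 0.1],
         [0.3, 2.3, 0.5, 0.12],
         [0.2, 0.5, 2.6, 0.09],
         [0.1, 0.12, 0.09, 0.45]]
T, var_x = decompose_covariance(cov_y)
```

Taking the smallest remaining element at each step guarantees that the subtraction never drives another positive element below zero, which is why the loop ends with an all-zero remainder for matrices like this one.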
[Figure: maps of grubs, adult activity, distance to nearest tree, and organic matter]
Grub population density as a function of several covariates
Name Description
clayuk clay content of soil
dml distance to nearest tree
dnx distance to nearest patch of soil with high organic matter content
fair fairway/rough indicator
heat intensity of adult activity
om.e organic matter flexure
tap total adult population
tw45 number of trees within 45 meters
vix vegetation index
wbuk soil organic matter
[Figure: histogram of grub count (frequency); variance of residuals by quartile of fitted values; correlation of residuals vs. lag distance (feet)]
Are the conditions for multiple regression met?
1. Non-normal response variable
2. Variance not constant
3. Observations not independent
Generalized linear model (Fisher 1935; Dempster 1971; Berk 1972; Nelder and Wedderburn 1972)
A. Accommodates response variables with distribution in exponential family (including normal, binomial, Poisson, gamma, exponential, chi-squared, etc.)
B. Allows for non-constant variance
with quasi-likelihood estimation (Wedderburn, 1974)
A. Accommodates response variables that are not in an exponential family (including negative binomial, unspecified distributions)
B. Requires only that the variance of the response variable be expressed as a function of the mean
adapted for spatially dependent observations (Liang and Zeger 1986; McCullagh and Nelder 1989; Albert and McShane 1995; Gotway and Stroup 1997; Dalthorp 2004)
A. Accounts for spatial autocorrelation in the residuals
B. The statistical theory for the model is not well-developed
Example: Japanese beetle grub population density vs. soil organic matter
[Figure: grubs per soil sample vs. organic matter content (%); panels: Means, Variances ($s^2$ vs. $\bar{x}$), Correlations (correlation vs. lag distance in feet)]
6.762 3( 33.3 13.2 2.03 0.0965 )( 0.0148)i iOM OM OM
Means (via GLM):
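Variances (via TPL): $\sigma^2 = 1.23\,\mu^{1.148}$
Correlations (via spherical model):
$$\rho(Y_1, Y_2) = \begin{cases} 1 & \text{if } d = 0 \\ 0.25\left(1 - 0.015\,d + 0.5\,(d/100)^3\right) & \text{if } 0 < d \le 100 \\ 0 & \text{if } d > 100 \end{cases}$$
where $d = \lVert Y_1 - Y_2 \rVert$ is the distance between the locations of Y1 and Y2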
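The variance and correlation targets for the grub example can be expressed as two short functions. This is a sketch: the function and parameter names are assumptions, with the constants taken from the Taylor's power law fit ($\sigma^2 = 1.23\,\mu^{1.148}$) and the spherical model (scale 0.25, range 100 feet) quoted for this example.

```python
def tpl_variance(mu, a=1.23, b=1.148):
    """Taylor's power law: variance as a power function of the mean."""
    return a * mu ** b

def spherical_correlation(d, scale=0.25, corr_range=100.0):
    """Spherical correlation model; d is the distance between two sample locations."""
    if d == 0:
        return 1.0
    if d >= corr_range:
        return 0.0
    h = d / corr_range
    return scale * (1 - 1.5 * h + 0.5 * h ** 3)
```

Writing the middle branch with `1.5 * h` is equivalent to the `0.015 d` term in the slide's formula, since the range is 100.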
The simulation
1000 reps with n = 143: $Y_{143 \times 1000} = T_{143 \times 3720}\,X_{3720 \times 1000}$
• X's are independent, count-valued random variables
-- variances from Park & Shin's algorithm
-- means from linear programming (constraint: $\mu \le 0.5 - \sqrt{0.25 - \sigma^2}$)
### PROBLEM ### No solution found!
Choice between one of the following:
i. One Y mean off-target but no impossible X r.v.'s
Need: Y with $\mu$ = 0.141
Can only do: $\mu$ = 0.151
ii. One impossible X r.v. ($\sigma^2 < \mu(1 - \mu)$)
We need: r.v. with $\mu$ = 0.0385, $\sigma^2$ = 0.0272
Can do Bernoulli: $\mu$ = 0.0385, $\sigma^2$ = 0.0370
Consequences? Var(Y16) = 0.139 vs. target of 0.129
Results for 1000 simulation runs:
• 3720 X's consisting of:
-- Negative binomial: 1580
-- Bernoulli: 2099
-- Bernoulli + Poisson: 40
-- Impossible: 1 (simulated $\sigma^2$ slightly larger than target)
[Figure: simulated mean vs. target mean; simulated variance vs. target variance; correlation vs. lag distance; simulated correlation vs. target correlation]
Example: Diamondback moth dispersal
[Figure: DBM count vs. distance from release point; map of release point and traps; panels: Means, Variances (variance vs. mean), Correlation (correlation vs. lag distance)]
The simulation
1000 reps with n = 114: $Y_{114 \times 1000} = T_{114 \times 1880}\,X_{1880 \times 1000}$
• X's are independent negative binomials
-- variances from Park & Shin's algorithm
-- means from linear programming
• T is a matrix of zeros and ones that defines the common components of the Y’s
Results
[Figure: panels Means and Variances (simulated vs. target); Correlation: simulated vs. target, by lag distance]
* Circles are averages for 1000 sims
Example: Weed counts (Chenopodium polyspermum) vs. soil magnesium
Weed counts and soil [Mg] in random quadrats in a field ...
[Figure: weeds per sample vs. soil magnesium; panels: Means, Variances ($s^2$ vs. $\bar{x}$), Correlation (correlation vs. lag distance)]
### Infeasible correlations ###
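Highest possible correlation between Yi, Yj is:
$$\rho_{i,j} = \operatorname{Corr}(Y_i, Y_j) = \frac{\sigma^2_{Y_j}}{\sqrt{\sigma^2_{Y_i}\sigma^2_{Y_j}}} = \frac{\sigma_{Y_j}}{\sigma_{Y_i}} \qquad (\sigma_{Y_j} \le \sigma_{Y_i})$$
With 49 pairs of points in the weed data, target $\rho_{i,j}$ is too high.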
Summary
• Correlated count r.v.'s can be simulated by overlapping sums of independent negative binomials, Bernoullis, and Poissons
• The simulated r.v.'s are very close to negative binomial where $\mu < \sigma^2$ and very close to Bernoulli + Poisson where $\mu > \sigma^2$
• Negative correlations and strong positive correlations between r.v.’s with very different variances are not attainable, but ...
• The method can accommodate a wide variety of ecologically important scenarios that the hierarchical lognormal-Poisson model balks at, including:
-- underdispersed count r.v.'s
-- moderately strong correlations where $\mu_1 \approx \mu_2$ and $\sigma_1^2 \approx \sigma_2^2$