Spatial Data Modeling with GIS and SAS - Long School of...


Spatial Data Modeling with GIS and SAS

M. Norman Oliver, M.D., M.A.

Associate Professor, Departments of Family Medicine, Public Health Sciences, and Anthropology

University of Virginia Health System

Acknowledgements

• This project was supported by grant K07 CA099983 from the National Cancer Institute, HRSA CFDA No. 93.984, Academic Units in Primary Care-Family Medicine, and a grant from the Paul Mellon Prostate Cancer Research Institute.

• Kevin A. Matthews, University of Virginia Department of Family Medicine

• Mir Siadaty, M.S., M.D., University of Virginia Department of Health Evaluation Sciences

• Fern R. Hauck, M.D., University of Virginia Department of Family Medicine

• Linda W. Pickle, Ph.D., Surveillance Research Program, Division of Cancer Control and Population Sciences, NCI

Special thanks to Dr. Pickle: All the good ideas (and some of the slides) in this presentation come from courses she has taught and discussions we’ve had. All bad ideas and mistakes are my own invention.

Why map health data? Place matters

• Disease risk and exposure vary by location

• Public wants to know whether their community has a problem

• Researchers want to know where to conduct further studies (high or low rate areas)

• Public health officials want to know where to start intervention or prevention programs and how to allocate scarce resources

Goals of mapping analysis

• Identify geographical patterns

• Cluster detection

• Hypothesis generation (causes of patterns?)

• Comparison of patterns (e.g., health disparities)

• Changing patterns over time (surveillance, cancer control)

Different goals will lead to different analytic methods and map designs

Oral Cancer in White Females

Source: Mason et al. Atlas of Cancer Mortality for U.S. Counties, 1950-1969. NIH 1975

Lung Cancer in White Males

Source: Mason et al. Atlas of Cancer Mortality for U.S. Counties, 1950-1969. NIH 1975

Cervical Cancer in White Females, 1950-94

[Figure: nine map panels, one per five-year period: 1950-54, 1955-59, 1960-64, 1965-69, 1970-74, 1975-79, 1980-84, 1985-89, 1990-94]

GIS & Disparities in Prostate Cancer

• Second leading cause of cancer death among men

• Unequal burden: African-Americans have both prostate cancer incidence and mortality rates twice that of whites; however, geographic patterns for both

• Reasons for disparities unclear. Diet? Environmental exposures? Access to health care? Genetic susceptibility?

The Whirling Vortex of GIA

[Figure: a cycle of four stages arranged around "GIS"]

• The question you want to answer

• The data you need to answer that question

• The data you can get

• The question you can answer with those data

Source: Waller & Gotway, 2004

Original source: John Richardson, Toxicologist, EPA Region IV

What Statistic to Map?

• Mapping counts ignores population differences, so mapping rates preferred

• Proportions useful for showing case characteristics, e.g., % with late stage diagnosis

• Relative risks (or relative rates) are easily interpretable, put all maps on same scale

• Observed, estimated or smoothed statistic?

• Map residuals as part of modeling process

Data description

• VCR (Virginia Cancer Registry): incident prostate cancer cases, 1990-99 (37,000 cases)

• Geocoded 74% (26,000 cases)

• Cases aggregated to census tract and county levels for comparison to sociodemographic variables in the 1990 U.S. Census

• Census variables: low education (<12 yrs), high education (≥16 yrs), median household income, percent female head of household, % rural, all stratified by racial category

• MD ratio (ARF, Area Resource File), % males screened with PSA (BRFSS, Behavioral Risk Factor Surveillance System)

Rate standardization

• What rates to map?

– Crude rates ignore population difference effects on rates

– Age-specific rates usually too variable due to small #s

– Age-adjusted rates

   • Only meaningful in comparison to similarly-adjusted rates

   • Direct method: $\sum_j (\text{age-specific rate})_{ij}\,(\text{std pop})_j \,/\, (\text{total std pop})$, where the sum is over all ages $j$, for place $i$

   • Indirect method: $\mathrm{SMR}_i = \sum_j (\text{age-specific rate in std pop})_j\,(\text{pop})_{ij} \,/\, (\text{std rate})$

     advantage: can be calculated when # cases small

     disadvantage: SMRs not comparable place to place except under certain conditions (Pickle and White, Stat in Med, 1995)
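A small worked example may help fix the two standardization ideas. The sketch below computes a directly age-adjusted rate for one place and a standardized ratio in its conventional textbook form, observed cases divided by the cases expected if the standard population's age-specific rates applied locally. The function names and all numbers are hypothetical and chosen only for illustration; they are not data from the study.

```python
def direct_adjusted_rate(cases, pop, std_pop):
    """Directly age-adjusted rate for one place.

    cases[j], pop[j]: cases and population at risk in age group j of the place.
    std_pop[j]: standard population in age group j.
    """
    total_std = sum(std_pop)
    # Weight each local age-specific rate by the standard population.
    return sum((c / p) * s for c, p, s in zip(cases, pop, std_pop)) / total_std


def smr(cases, pop, std_rates):
    """Standardized ratio: observed cases / cases expected at standard rates."""
    expected = sum(r * p for r, p in zip(std_rates, pop))
    return sum(cases) / expected


# Hypothetical place with three age groups (all numbers invented).
cases = [2, 30, 40]
pop = [10_000, 5_000, 2_000]
std_pop = [600, 300, 100]            # standard population by age group
std_rates = [0.0001, 0.004, 0.02]    # age-specific rates in the standard population

crude = sum(cases) / sum(pop)
adjusted = direct_adjusted_rate(cases, pop, std_pop)
ratio = smr(cases, pop, std_rates)

print(f"crude rate per 100,000:    {crude * 1e5:.1f}")
print(f"adjusted rate per 100,000: {adjusted * 1e5:.1f}")
print(f"SMR:                       {ratio:.3f}")
```

The indirect calculation needs only the total observed count for the place, which is why it remains usable when age-specific case counts are too small to give stable local rates.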

Exploratory Spatial Data Analysis

• First look at the data

• Quality control check: unexpected patterns?

• Hypothesis generation: associations between predictors and outcomes?

• Quick & dirty method for answering spatial questions

• A way to check model

Smoothing Concepts

• Smoothing is a method of removing some variability in a quantitative map, highlighting the underlying (true?) pattern

• Maps of cancer rates for small areas can be difficult to interpret because of background “noise”

• Different types of smoothers: linear or non-linear, based on distance or nearness of neighbors, but all are variations of 2D moving averages

• Comparison of various 2D smoothers: Kafadar 1994 (Comp Stat & Data Analysis 18:419-439)

Smoothing Health Data

• The most important feature of a 2D smoother for health data is its inclusion of weights

• The variance of both rates & counts is proportional to 1/population, so rates are more or less reliable depending on population size

• Most smoothers in the past did not allow weights; several new ones do

– very stable rates are smoothed less

– more unstable rates (due to small populations) are smoothed more
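To illustrate why weights matter, here is a minimal sketch of a population-weighted 2D moving average: each area's rate is replaced by the population-weighted mean of itself and its k nearest neighbours. This is one simple weighted smoother written for illustration, not any of the published methods; the function name, the use of centroids as locations, and the toy data are all assumptions.

```python
import math


def weighted_knn_smooth(points, rates, pops, k=2):
    """Population-weighted mean of each area and its k nearest neighbours.

    points: (x, y) centroids, assumed distinct; rates and pops align with points.
    """
    smoothed = []
    for xi, yi in points:
        # Sort areas by distance from the current one; the area itself is at
        # distance 0, so with distinct centroids it is always in the window.
        order = sorted(
            range(len(points)),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )
        window = order[: k + 1]
        total_pop = sum(pops[j] for j in window)
        smoothed.append(sum(rates[j] * pops[j] for j in window) / total_pop)
    return smoothed


# Four hypothetical areas on a unit square; area 1 has a tiny population
# and an extreme, unreliable rate.
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
rates = [5.0, 50.0, 6.0, 4.0]             # per 100,000
pops = [100_000, 500, 80_000, 120_000]

smoothed = weighted_knn_smooth(points, rates, pops, k=2)
for raw, new, n in zip(rates, smoothed, pops):
    print(f"population {n:>7}: raw {raw:5.1f} -> smoothed {new:5.2f}")
```

In this toy example the rate based on 500 people is pulled almost entirely toward its large neighbours, while the rate based on 100,000 people moves much less, which is the behaviour the bullets above describe.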

Headbanging

• Proposed by Tukey & Tukey, 1981, implemented by Hansen Simonson in 1991

• Weights added by Mungiole, Pickle, Simonson in 1999

• Algorithm is median-based moving average of a specified # of nearest neighbors for each point

• Available free on http://srab.cancer.gov/headbang as standalone or callable C++ function
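The median-based idea can be sketched in a few lines. The code below is a deliberately simplified stand-in: it replaces each value with the plain median of itself and its k nearest neighbours. It is not the headbanging procedure itself, which is more elaborate and, in its later form, weighted; the function name and the toy transect are assumptions made for illustration.

```python
import math
from statistics import median


def knn_median_smooth(points, values, k=2):
    """Replace each value by the median of itself and its k nearest neighbours."""
    out = []
    for xi, yi in points:
        # Indices of all points ordered by distance from the current point.
        order = sorted(
            range(len(points)),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )
        out.append(median([values[j] for j in order[: k + 1]]))
    return out


# Hypothetical transect of five locations with one spike in the middle.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
values = [4, 5, 100, 6, 7]

smoothed = knn_median_smooth(points, values, k=2)
print(smoothed)
```

Because a median ignores a single extreme value in the window, the spike is removed without dragging its neighbours upward, which a mean-based moving average would do.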

HIV Mortality Rates, 1988-92

[Figure: three maps, each with a five-class legend in rate/100,000]

• Original data: 0 - 1.991, 1.991 - 3.57, 3.57 - 5.002, 5.002 - 7.824, 7.824 - 172.234

• Smoothed, unweighted: 0.8 - 2.9, 2.9 - 3.7, 3.7 - 4.6, 4.6 - 6.4, 6.4 - 48.9

• Smoothed, weighted: 0.6 - 3.4, 3.4 - 4.7, 4.7 - 5.7, 5.7 - 8.2, 8.2 - 97.1

Source: Pickle, et al. Atlas of United States Mortality. NCHS, 1996

Exploratory Data Analysis

[Figure: exploratory plot, axis scaled 0 to 80]

Spatial Data Modeling

• ESDA maps may show patterns; whether they are real or random noise is unknown

• Only inferential statistics can answer that question

Spatial Data Modeling Steps

• Data collection & preparation

• Model identification

• Parameter estimation & model fitting

• Model assessment

• OK? If no, revise and repeat; if yes, continue

• Interpretation of results & fit

• Done

Unit of Analysis

• What is your fundamental unit of analysis? Tumor or individual case or aggregated case counts?

• How to choose

– What is appropriate for hypothesis to be tested?

– Availability of data: sometimes no choice, e.g., SEER records have little information on each patient, so we need to use sociodemographic information for aggregated units

• Aggregated analysis = “ecologic”

– Advantage: often the only available data for analysis, especially for data considered confidential

– Disadvantage: “ecologic fallacy” = bias of results because of heterogeneity within the groups, i.e., associations at group level may not hold for individuals

Spatial Point Processes

• Statistical models useful for spatial cancer data

– Poisson for case counts, e.g., incidence or mortality

– Bernoulli or logistic binomial for individual cases, proportions (e.g., % late stage), prevalence

• Count data are typically overdispersed

“pure” Poisson: $d_i \sim \mathrm{Pois}(n_i \lambda)$, where $d_i$ = # cases, $n_i$ = population at risk in place $i$, $\lambda$ = intensity

$E(d_i) = \mathrm{Var}(d_i) = n_i \lambda$ (i.e., mean = variance)

However, we usually observe $\mathrm{Var}(d_i) = \phi\, n_i \lambda$, $\phi > 1$, so we need to account for this in the model
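One common way to check for overdispersion is to compare the Pearson chi-square statistic with its degrees of freedom. The sketch below does this for the simplest case, a single common intensity λ shared by all places; the function name and the counts are hypothetical, and a real analysis would use the fitted values from the full regression model instead.

```python
def pearson_dispersion(cases, pops):
    """Estimate phi as Pearson chi-square / degrees of freedom,
    assuming d_i ~ Pois(n_i * lam) with one common intensity lam."""
    lam = sum(cases) / sum(pops)          # pooled estimate of the intensity
    chi2 = sum((d - n * lam) ** 2 / (n * lam) for d, n in zip(cases, pops))
    return chi2 / (len(cases) - 1)        # one parameter (lam) was estimated


# Hypothetical counts and populations at risk for four places.
cases = [3, 12, 5, 20]
pops = [1000, 2000, 1500, 2500]

phi = pearson_dispersion(cases, pops)
print(f"estimated dispersion phi = {phi:.3f}")
```

A value near 1 is consistent with "pure" Poisson variation; values well above 1 suggest that the variance should be scaled by φ or that random effects are needed.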

Accounting for Overdispersion

• Overdispersion can arise from clustering of the population, e.g., in families, neighborhoods

• Estimating overdispersion

– Directly by modeling variance as φ times usual variance

– Add random effects to the model (extra variance components)

• Possible random effects for a simple model: $y_i = \alpha + \beta X_i + \varepsilon_i$

– Variance components: random coefficients ($\alpha$, $\beta$), errors in covariates ($X_i$), usual residual variance ($\varepsilon_i$)

• For each random effect, need to specify its distribution, e.g., $\beta_i \sim N(\beta^*, \gamma)$

Random Effects, Tract Within County

[Figure: four panels plotting Y against X, each showing the county mean line and, where they differ from it, lines for Tracts 1-3. Panels: fixed effects only; random intercepts only; random slope only; random slope & intercept]

Spatial Statistical Models Differ

• Spatial data are usually spatially correlated, not independent (as assumed for simple regression)

• Tobler’s First Law of Geography: “Everything is related to everything else, but near things are more related than distant things.” (Tobler 1969)

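Modeling Spatial Correlation

• Consider simple fixed effects regression model: $y_i = \alpha + \beta X_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$

• This assumes that errors $\varepsilon_i$ are independent, all with common variance $\sigma^2$

• If errors are spatially correlated, then the error vector $\varepsilon \sim N(0, \Sigma)$, where $\Sigma$ is a variance-covariance matrix that describes their spatial dependencies

• $\Sigma$ can be defined in terms of distances or neighbors

• One common form of $\Sigma$: $\Sigma_{ii} = \sigma^2$ on the diagonal and $\Sigma_{ij} = \sigma^2 \exp(-d_{ij}/\rho)$ off the diagonal (same $\sigma^2$), where $d_{ij}$ is the distance between places $i$ and $j$ and $\rho$ is a measure of correlation based on distance.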
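As a concrete illustration of the exponential form, the following sketch builds Σ for a handful of locations. The function name, the coordinates, and the values of σ² and ρ are assumptions chosen for the example, not estimates from the study.

```python
import math


def exponential_covariance(coords, sigma2, rho):
    """Sigma[i][j] = sigma2 * exp(-d_ij / rho), d_ij = Euclidean distance.

    On the diagonal d_ii = 0, so Sigma[i][i] = sigma2.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            cov[i][j] = sigma2 * math.exp(-d / rho)
    return cov


# Three hypothetical locations: two close together, one farther away.
coords = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
cov = exponential_covariance(coords, sigma2=2.0, rho=3.0)

for row in cov:
    print("  ".join(f"{v:6.3f}" for v in row))
```

The covariance between the two nearby places is larger than between either of them and the distant one, and larger values of ρ make the covariance decay more slowly with distance.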

More Assumptions . . .

• Statistical tests are based on assumptions.

• Here, we assume a Poisson distribution of our data. We also assume this spatial process shows

– Stationarity: the spatial dependence structure (covariance) is the same everywhere, i.e., the covariance depends only on distance or adjacency of any 2 places

– Isotropy: the spatial autocorrelation does not depend on direction

Variograms and Semivariograms

• The variogram is the variance of the difference of two data values separated by a distance h in a particular direction: var(Z(s+h)-Z(s)).

• The semivariogram is ½ the variogram.

• Both are functions of the separation distance, h, also called the spatial lag.

• The shape of these functions reflects the spatial similarity observed in the data.

Source: Waller and Gotway, 2004
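The definition above can be turned into an estimate from data. The sketch below uses the classical method-of-moments estimator, half the average squared difference over all pairs of observations whose separation falls in the same distance bin, and ignores direction (that is, it assumes isotropy). The function name, the binning rule, and the toy transect are assumptions made for illustration.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, z, bin_width):
    """Return {bin index: gamma_hat}; bin b covers lags [b*w, (b+1)*w)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            h = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            b = int(h // bin_width)
            sums[b] += (z[i] - z[j]) ** 2
            counts[b] += 1
    # Semivariogram = half the mean squared difference within each bin.
    return {b: sums[b] / (2 * counts[b]) for b in sorted(counts)}


# Hypothetical values observed at four equally spaced points on a line.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
z = [1.0, 2.0, 4.0, 7.0]

gamma = empirical_semivariogram(coords, z, bin_width=1.0)
for b, g in gamma.items():
    print(f"lag bin {b}: gamma = {g:.3f}")
```

In this toy series the estimate grows with the lag, meaning nearby values are more alike than distant ones.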

How Do Semivariograms Help?

If A and B are positively correlated, Var(A-B) will be smaller than if they were independent.

If this variance (and correlation) changes with distance between the points, then there is spatial autocorrelation
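A short derivation may make the link between correlation and the semivariogram explicit. The first line is the standard identity for the variance of a difference. The second line additionally assumes a stationary process with common variance $\sigma^2$ and covariance function $C(h)$; that notation is introduced here and is not on the slides.

```latex
\begin{align*}
\operatorname{Var}(A - B) &= \operatorname{Var}(A) + \operatorname{Var}(B) - 2\operatorname{Cov}(A, B) \\
\gamma(h) = \tfrac{1}{2}\operatorname{Var}\bigl(Z(s+h) - Z(s)\bigr) &= \sigma^2 - C(h)
\end{align*}
```

When A and B are positively correlated the covariance term is positive, so Var(A - B) falls below the value it takes under independence. If C(h) shrinks toward zero as the distance h grows, γ(h) rises toward σ².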

Semivariogram

[Figure: semivariogram. γ (variance/2) is plotted against distance, with the nugget, sill, and range marked. Observations within the range of each other are spatially autocorrelated.]

Source: Waller & Gotway, 2004

Computation

• Software available for random & spatial errors

– SAS PROC GLIMMIX

– S+/R non-linear mixed effects (nlme) function

– WinBUGS, MLwiN: full Bayesian MCMC modeling

• Iterative estimation can require many hours to run, sometimes won’t converge at all

• Complex models (with spatial covariance, random effects, interactions, ...) require larger sample sizes than fixed effects regression models

VCR Study Results 1

Table 1. Results of statistical modeling of Virginia prostate cancer incidence at tract level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Age (years) | | | | |
| <50 | -6.7734 | <0.0001 | -6.5974 | <0.0001 |
| 50-74 | -0.5407 | 0.0086 | -0.6459 | <0.0001 |
| >75 [a] | - | - | - | - |

[a] Reference category. [b] Between 100% and 200% poverty. [c] At least four years of college. [d] Less than a high-school education.

VCR Study Results 2

Table 1 contd. Results of statistical modeling of Virginia prostate cancer incidence at tract level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Poverty | | | | |
| <10% | -0.1495 | 0.6984 | 0.4414 | 0.0114 |
| 10-19% | -0.3818 | 0.1837 | 0.3963 | 0.0122 |
| >20% [a] | - | - | - | - |
| % Near poor [b] | 0.0082 | 0.2583 | -0.0083 | 0.0218 |
| Median household income | 0.4987 | 0.0187 | 0.1850 | 0.0027 |
| Education (persons aged >25 years) | | | | |
| % High [c] | -0.0398 | 0.0926 | -0.0024 | 0.7360 |
| % Low [d] | -0.0012 | 0.8283 | -0.0077 | 0.0028 |

[a] Reference category. [b] Between 100% and 200% poverty. [c] At least four years of college. [d] Less than a high-school education.

VCR Study Results 3

Table 1 contd. Results of statistical modeling of Virginia prostate cancer incidence at tract level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Rural | | | | |
| <10% | 1.3537 | 0.0012 | 1.6311 | <0.0001 |
| 10-19% | 0.0430 | 0.9733 | -0.6474 | 0.2379 |
| >20% [a] | - | - | - | - |
| % Female head of household | 0.0109 | 0.0653 | 0.0217 | <0.0001 |
| % PSA screened | -0.0008 | 0.7935 | 0.0006 | 0.5600 |
| MD ratio | 0.00001 | 0.9518 | 0.0002 | 0.0010 |

[a] Reference category. [b] Between 100% and 200% poverty. [c] At least four years of college. [d] Less than a high-school education.

VCR Study Results 4

Table 2. Results of statistical modeling of Virginia prostate cancer incidence at county level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Age (years) | | | | |
| <50 | -6.6549 | <0.0001 | -6.6260 | <0.0001 |
| 50-74 | -0.4954 | 0.0017 | -0.7335 | <0.0001 |
| >75 [a] | - | - | - | - |

VCR Study Results 5

Table 2 contd. Results of statistical modeling of Virginia prostate cancer incidence at county level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Poverty | | | | |
| <10% | -0.1815 | 0.6058 | -0.1267 | 0.5270 |
| 10-19% | -0.2231 | 0.2026 | -0.0066 | 0.9618 |
| >20% [a] | - | - | - | - |
| % Near poor [b] | -0.0145 | 0.2124 | -0.0024 | 0.7769 |
| Median household income | -0.1775 | 0.4188 | -0.1714 | 0.2268 |
| Education (persons aged >25 years) | | | | |
| % High [c] | 0.0062 | 0.7340 | 0.0159 | 0.1788 |
| % Low [d] | -0.0067 | 0.4122 | -0.0025 | 0.6457 |

VCR Study Results 6

Table 2 contd. Results of statistical modeling of Virginia prostate cancer incidence at county level.

| Variables | African Americans: Effect | African Americans: p-value | Whites: Effect | Whites: p-value |
|---|---|---|---|---|
| Rural | | | | |
| <10% | -0.8064 | 0.1184 | -0.2178 | 0.4675 |
| 10-19% | -0.3974 | 0.4125 | -0.3433 | 0.1965 |
| >20% [a] | - | - | - | - |
| % Female head of household | 0.0029 | 0.8586 | -0.0045 | 0.6396 |
| % PSA screened | 0.0023 | 0.1571 | 0.0011 | 0.2205 |
| MD ratio | 0.0000 | 0.9533 | 0.0000 | 0.0709 |

[a] Reference category. [b] Between 100% and 200% poverty. [c] At least four years of college. [d] Less than a high-school education.

Conclusions

• Socioeconomic position affects incidence of prostate cancer

• Differential effect among African Americans points to possible racial discrimination: lack of access, biased delivery of clinical services

• Increased incidence among African Americans not due to socioeconomic measures evaluated in this study

• Oliver MN, Smith E, Siadaty M, Hauck FR, Pickle LW. A spatial analysis of prostate cancer incidence and race in Virginia, 1990-1999. Am J Prev Med. January 2006.

Limitations

• VCR not a gold-standard cancer registry; problems with data collection

• Only used 74% of VCR data, with large fall-out of rural data

  Oliver MN, Matthews KA, Siadaty M, Hauck FR, Pickle LW. Geographic bias related to geocoding in epidemiologic studies. Intl J Health Geogr. 2005;4:29.

• Incidence rates calculated using average 1990 and 2000 populations

• Spatial and temporal lag: a general challenge for GIScience in public health

Spatial Data Modeling Steps: Lessons

• Data collection & preparation

• Model identification

• Parameter estimation & model fitting

• Model assessment

• OK? If no, revise and repeat; if yes, continue

• Interpretation of results & fit

• Done

Spatial Data Modeling with GIS and SAS

Thank you, and good luck!
