www.spatialanalysisonline.com Chapter 5 Part A: Spatial data exploration


3rd edition

Spatial data exploration

Spatial analysis and data models (Anselin, 2002)

                       Object                     Field
GIS                    vector                     raster
Spatial Data           points, lines, polygons    surfaces
Location               discrete                   continuous
Observations           process realisation        sample
Spatial Arrangement    spatial weights            distance function
Statistical Analysis   lattice                    geostatistics
Prediction             extrapolation              interpolation
Models                 lag and error              error
Asymptotics            expanding domain           infill



Sampling frameworks
• Pure random sampling
• Stratified random – by class/strata (proportionate, disproportionate)
• Randomised within defined grids
• Uniform
• Uniform with randomised offsets
• Sampling and declustering
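Three of the frameworks listed above (pure random, uniform, uniform with randomised offsets) can be sketched in a few lines. This is a minimal illustration over a rectangular study area, not a definitive implementation; the function names, the `max_offset` parameter and the 100 x 100 extent are assumptions made for the example.

```python
import random

def pure_random(n, xmin, ymin, xmax, ymax, rng):
    """Pure random sampling: n points drawn uniformly over the rectangle."""
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]

def grid_with_offsets(nx, ny, xmin, ymin, xmax, ymax, rng, max_offset=0.5):
    """One sample point per cell of an nx-by-ny grid.

    max_offset is a fraction of the cell size (hypothetical parameter):
    0.0 gives a plain uniform grid of cell centres, 0.5 lets each point
    fall anywhere inside its own cell.
    """
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    points = []
    for i in range(nx):
        for j in range(ny):
            cx = xmin + (i + 0.5) * dx          # cell centre
            cy = ymin + (j + 0.5) * dy
            ox = rng.uniform(-max_offset, max_offset) * dx
            oy = rng.uniform(-max_offset, max_offset) * dy
            points.append((cx + ox, cy + oy))
    return points

rng = random.Random(42)                          # fixed seed for repeatability
random_pts = pure_random(25, 0, 0, 100, 100, rng)
uniform_pts = grid_with_offsets(5, 5, 0, 0, 100, 100, rng, max_offset=0.0)
offset_pts = grid_with_offsets(5, 5, 0, 0, 100, 100, rng, max_offset=0.5)
```

Setting `max_offset` between 0 and 0.5 moves smoothly from a regular grid towards a sample that is randomised within defined grid cells.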



Sampling frameworks – point sampling



Sampling frameworks – within zones
• Selection of 5 random points per zone
• Grid generation – square grid within field boundaries
• Grid generation (hexagonal) – selection of 1 point per cell, random offset from centre


Spatial data exploration
A. 10% random sample from existing point set: 800 radioactivity monitoring sites in Germany. Random sample of 80 (red/large dots)
B. Stratified random selection, 30% of each stratum: 200 radioactivity monitoring sites in Germany. Random sample of 30 (red/large dots) with <100 units of radiation and 30 (crosses) with >=100 units of radiation
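The two selections described above can be sketched as follows. The site coordinates and radiation values are invented for illustration (they are not the German monitoring data), and the function names are hypothetical.

```python
import random

def simple_random_sample(sites, fraction, rng):
    """Select a fixed fraction of the sites, without replacement."""
    return rng.sample(sites, round(len(sites) * fraction))

def stratified_sample(sites, stratum_of, fraction, rng):
    """Select the same fraction from every stratum (proportionate sampling)."""
    strata = {}
    for site in sites:
        strata.setdefault(stratum_of(site), []).append(site)
    return {name: rng.sample(members, round(len(members) * fraction))
            for name, members in strata.items()}

rng = random.Random(1)
# Hypothetical sites as (x, y, radiation); all values are made up.
sites = [(rng.uniform(0, 600), rng.uniform(0, 800), rng.uniform(0, 200))
         for _ in range(200)]

# A. 10% random sample from the existing point set
sample_a = simple_random_sample(sites, 0.10, rng)

# B. 30% of each stratum, with strata split at 100 units of radiation
sample_b = stratified_sample(
    sites, lambda s: "low" if s[2] < 100 else "high", 0.30, rng)
```

A disproportionate design would pass a different fraction for each stratum instead of a single shared value.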



Random points on a network



EDA, ESDA and ESTDA
EDA – basic aims (after NIST)
• maximize insight into a data set
• uncover underlying structure
• extract important variables
• detect outliers and anomalies
• test underlying assumptions
• develop parsimonious models
• determine optimal factor settings



ESDA (see GeoDa and STARS)
Extending EDA ideas to the spatial domain (lattice/zone models)
• Brushing
• Linking
• Mapped histograms
• Outlier mapping
• Box plots
• Conditional choropleth plots
• Rate mapping



ESDA: Brushing & linking



ESDA: Histogram linkage



ESDA: Parallel coordinate plot & star plot



ESDA: Mapped box plots



ESDA: Conditional choropleth mapping



ESDA: Mapped point data
A. Variable point size
B. Variable colour
C. Semivariogram pairs
D. Voronoi analysis



ESDA: Trend analysis (continuous spatial data)



ESDA: Cluster hunting – GAM/K (steps)
1. Read data for the population at risk
2. Identify the MBR (minimum bounding rectangle) containing the data, identify starting circle radius, and degree of overlap
3. Generate a grid covering the MBR
4. For each grid-intersection generate a circle of radius r
5. Retrieve two counts: for the population at risk and the variable of interest
6. Apply some “significance” test procedure
7. Keep the result if significant
8. Repeat Steps 5 to 7 until all circles have been processed
9. If the radius limit has not been reached, increase circle radius by dr and return to Step 3; otherwise go to Step 10
10. Create a smoothed density surface of excess incidence for the significant circles
11. Map this surface and inspect the results
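Steps 1 to 9 can be sketched as a simple scan. This is an illustration under stated assumptions, not the GAM/K software itself: the slides leave the “significance” test open, so an upper-tail Poisson test against the overall rate is swapped in here, the grid spacing is taken as 2r(1 - overlap), and the zone data are invented. Steps 10 and 11 (smoothing and mapping) are omitted.

```python
import math

def poisson_upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu); the assumed 'significance' test."""
    if k <= 0:
        return 1.0
    term = math.exp(-mu)
    cdf = term
    for i in range(1, k):
        term *= mu / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def gam_scan(zones, r_start, dr, r_max, overlap=0.5, alpha=0.01):
    """zones: list of (x, y, population_at_risk, cases).

    Returns (cx, cy, r, excess_cases) for every significant circle.
    """
    xs = [z[0] for z in zones]
    ys = [z[1] for z in zones]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)  # Step 2: MBR
    rate = sum(z[3] for z in zones) / sum(z[2] for z in zones)   # overall rate
    significant = []
    r = r_start
    while r <= r_max:                                   # Step 9: loop over radii
        step = 2 * r * (1 - overlap)                    # Step 3: grid spacing
        nx = int((xmax - xmin) / step) + 1
        ny = int((ymax - ymin) / step) + 1
        for i in range(nx + 1):
            for j in range(ny + 1):                     # Step 4: one circle per node
                cx, cy = xmin + i * step, ymin + j * step
                inside = [z for z in zones
                          if math.hypot(z[0] - cx, z[1] - cy) <= r]
                pop = sum(z[2] for z in inside)         # Step 5: two counts
                cases = sum(z[3] for z in inside)
                if pop == 0:
                    continue
                p = poisson_upper_tail(cases, rate * pop)   # Step 6
                if p < alpha:                               # Step 7
                    significant.append((cx, cy, r, cases - rate * pop))
        r += dr
    return significant

# Invented data: a 10 x 10 lattice of zones, 100 people each, 1 case each,
# except four adjacent zones with 8 cases each.
cluster = {(2, 2), (2, 3), (3, 2), (3, 3)}
zones = [(float(x), float(y), 100, 8 if (x, y) in cluster else 1)
         for x in range(10) for y in range(10)]
hits = gam_scan(zones, r_start=1.0, dr=1.0, r_max=2.0)
```

Because the circles overlap heavily, the tests are not independent, which is one reason the output is best treated as exploratory rather than as a formal test.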


Spatial data exploration
Grid-based statistics
• Univariate analysis of attribute data (non-spatial metrics)
• Cross-classification and cross-tab analyses
• Spatial pattern analysis for grid data (including Landscape metrics): Patch metrics; Class-level metrics; Landscape-level metrics
• Quadrat analysis
• Multi-grid regression analysis



Grid-based statistics
Landscape metrics
• Non-spatial: Proportional abundance; Richness; Evenness; Diversity
• Spatial: Patch size distribution and density; Patch shape complexity; Core Area; Isolation/Proximity; Contrast; Dispersion; Contagion and Interspersion; Subdivision; Connectivity
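The non-spatial metrics depend only on how many cells fall in each class, not on where they are. The sketch below assumes Shannon's index for diversity and evenness (the slides do not say which index is intended) and uses an invented 4 x 4 land-cover grid.

```python
import math
from collections import Counter

def non_spatial_metrics(grid):
    """Return (proportional abundance, richness, diversity, evenness).

    Diversity is Shannon's H = -sum(p * ln p); evenness is H / ln(richness).
    Both index choices are assumptions made for this example.
    """
    counts = Counter(cell for row in grid for cell in row)
    total = sum(counts.values())
    abundance = {cls: c / total for cls, c in counts.items()}
    richness = len(counts)
    diversity = -sum(p * math.log(p) for p in abundance.values())
    evenness = diversity / math.log(richness) if richness > 1 else 0.0
    return abundance, richness, diversity, evenness

# Hypothetical classified grid: 8 forest, 4 water and 4 urban cells.
grid = [
    ["forest", "forest", "water", "water"],
    ["forest", "forest", "water", "water"],
    ["forest", "forest", "urban", "urban"],
    ["forest", "forest", "urban", "urban"],
]
abundance, richness, diversity, evenness = non_spatial_metrics(grid)
```

Shuffling the cells of `grid` leaves all four values unchanged, which is exactly what separates these metrics from the spatial ones.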



Point (event) based statistics
• Typically analysis of point-pair distances
• Points vs events
• Distance metrics: Euclidean, spherical, Lp or network
• Weighted or unweighted events
• Events, NOT computed points (e.g. centroids)
• Classical statistical models vs Monte Carlo and other computational methods



Point (event) based statistics
Basic Nearest neighbour (NN) model
• Input coordinates of all points
• Compute (symmetric) distances matrix D
• Sort the distances to identify the 1st, 2nd,...kth nearest values
• Compute the mean of the observed 1st, 2nd, ...kth nearest values
• Compare this mean with the expected mean under Complete Spatial Randomness (CSR or Poisson) model
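The first four steps above can be sketched with a brute-force distance matrix. This assumes Euclidean distances and a small point set; the function names are illustrative, and the final comparison with the CSR expectation is not included.

```python
import math

def distance_matrix(points):
    """Symmetric matrix D of Euclidean distances between all point pairs."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d

def mean_kth_nn_distances(points, k_max):
    """Mean observed distance to the 1st, 2nd, ..., k_max-th nearest neighbour."""
    d = distance_matrix(points)
    n = len(points)
    sums = [0.0] * k_max
    for i in range(n):
        # Sort distances from point i to every other point (exclude itself).
        others = sorted(d[i][j] for j in range(n) if j != i)
        for k in range(k_max):
            sums[k] += others[k]
    return [s / n for s in sums]

# Corners of a unit square: two neighbours at distance 1, one at sqrt(2).
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
means = mean_kth_nn_distances(points, 3)
```

The full matrix needs memory proportional to n squared, so for large point sets a spatial index would normally replace it.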



Point (event) based statistics – NN model

[Figure: circle of radius r, Area = $\pi r^2$, surrounded by an annulus of outer radius r+dr and Width = dr, Area = $2\pi r\,dr$]
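The circle and annulus areas in this figure lead to the distribution of NN distances under CSR. As a sketch, assuming events follow a homogeneous Poisson process with density m: the nearest neighbour of an event lies between r and r+dr when the circle of area $\pi r^2$ is empty and the annulus of area $2\pi r\,dr$ contains an event. The symbols f, E and Var are introduced here for illustration.

```latex
\begin{align*}
P(\text{no event within } r) &= e^{-m\pi r^2} \\
f(r)\,dr &= e^{-m\pi r^2}\cdot 2\pi m r\,dr \\
E[r] &= \int_0^\infty r\,f(r)\,dr = \frac{1}{2\sqrt{m}} \\
E[r^2] &= \int_0^\infty r^2 f(r)\,dr = \frac{1}{m\pi} \\
\operatorname{Var}[r] &= E[r^2] - E[r]^2 = \frac{1}{m\pi} - \frac{1}{4m} = \frac{4-\pi}{4\pi m}
\end{align*}
```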



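Point (event) based statistics – NN model
Mean NN distance: $\bar{r}_e = \frac{1}{2\sqrt{m}}$
Variance: $\sigma_e^2 = \frac{4-\pi}{4\pi m}$
NN Index (Ratio): $R = \bar{r}_o / \bar{r}_e$
Z-transform: $z = \frac{\bar{r}_o - \bar{r}_e}{\sigma_e/\sqrt{n}}$, where $z \sim N(0,1)$ and $\sigma_e/\sqrt{n} = 0.261358/\sqrt{mn}$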
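A minimal sketch of the NN index and its z-transform, assuming the CSR expectation 1/(2*sqrt(m)) for the mean NN distance and the standard error 0.261358/sqrt(mn), with the density m estimated as n divided by the study area. The function name and the regular test pattern are illustrative, and no edge correction is applied.

```python
import math

def nn_index(points, area):
    """Return (observed mean, expected mean, index R, z-score) for 1st-order NN."""
    n = len(points)
    m = n / area                                   # density estimate
    r_obs = sum(
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ) / n
    r_exp = 1 / (2 * math.sqrt(m))                 # expected mean under CSR
    se = 0.261358 / math.sqrt(m * n)               # standard error of the mean
    return r_obs, r_exp, r_obs / r_exp, (r_obs - r_exp) / se

# Hypothetical regular pattern: 25 points on a unit lattice in a 5 x 5 region.
lattice = [(i + 0.5, j + 0.5) for i in range(5) for j in range(5)]
r_obs, r_exp, R, z = nn_index(lattice, area=25.0)
```

An index above 1 with a large positive z points towards a regular (dispersed) pattern, while an index below 1 with a large negative z points towards clustering; the result is sensitive to how the study area is defined.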



Point (event) based statistics
Issues
• Are observations n discrete points?
• Sample size (esp. for kth order NN, k>1)
• Model requires density estimation, m
• Boundary definition problems (density and edge effects) – affects all methods
• NN reflexivity of point sets
• Limited use of frequency distribution
• Validity of Poisson model vs alternative models



Frequency distribution of nearest neighbour distances, i.e. the frequency of NN distances in distance bands, say 0-1km, 1-2km, etc.
The cumulative frequency distribution is usually denoted
• G(r) = #(d_i < r)/n, where d_i are the NN distances and n is the number of measurements, or
• F(r) = #(d_i < r)/m, where m is the number of random points used in sampling



Computing G(r) [computing F(r) is similar]
• Find all the NN distances
• Rank them and form the cumulative frequency distribution
• Compare to expected cumulative frequency distribution: $G(r) = 1 - e^{-m\pi r^2}$
• Similar in concept to K-S test with quadrat model, but compute the critical values by simulation rather than table lookup
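The procedure above can be sketched as follows: an observed cumulative distribution of NN distances, the expected curve under CSR with density m, and a simulation envelope standing in for tabulated critical values. The clustered test data, the 10 x 10 study region and the choice of 99 simulations are assumptions for the example, and no edge correction is applied.

```python
import math
import random

def nn_distances(points):
    """Distance from each point to its nearest neighbour."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def g_observed(points, radii):
    """Proportion of NN distances d_i with d_i < r, for each r."""
    d = nn_distances(points)
    return [sum(1 for di in d if di < r) / len(d) for r in radii]

def g_csr(m, radii):
    """Expected cumulative distribution under CSR with density m."""
    return [1 - math.exp(-m * math.pi * r * r) for r in radii]

def g_envelope(n, width, height, radii, n_sims, rng):
    """Pointwise min and max of G over n_sims random patterns of n points."""
    lo, hi = [1.0] * len(radii), [0.0] * len(radii)
    for _ in range(n_sims):
        sim = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        g = g_observed(sim, radii)
        lo = [min(a, b) for a, b in zip(lo, g)]
        hi = [max(a, b) for a, b in zip(hi, g)]
    return lo, hi

rng = random.Random(7)
# Hypothetical clustered pattern: 10 events scattered around each of 3 centres.
centres = [(2, 2), (7, 3), (5, 8)]
events = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in centres for _ in range(10)]
radii = [0.1 * k for k in range(1, 11)]
g_obs = g_observed(events, radii)
g_exp = g_csr(len(events) / (10 * 10), radii)
g_lo, g_hi = g_envelope(len(events), 10, 10, radii, 99, rng)
```

Observed values lying above the upper envelope at short distances would suggest clustering; values below the lower envelope would suggest regularity.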



Point (event) based statistics – clustering (ESDA)
• Is the observed clustering due to natural background variation in the population from which the events arise?
• Over what spatial scales does clustering occur?
• Are clusters a reflection of regional variations in underlying variables?
• Are clusters associated with some feature of interest, such as a refinery, waste disposal site or nuclear plant?
• Are clusters simply spatial or are they spatio-temporal?



Point (event) based statistics – clustering
• kth order NN analysis
• Cumulative distance frequency distribution, G(r)
• Ripley K (or L) function – single or dual pattern
• PCP
• Hot spot and cluster analysis methods



Point (event) based statistics – Ripley K or L

• Construct a circle, radius d, around each point (event), i
• Count the number of other events, labelled j, that fall inside this circle
• Repeat these first two stages for all points i, and then sum the results
• Increment d by a small fixed amount
• Repeat the computation, giving values of K(d) for a set of distances, d
• Adjust to provide ‘normalised measure’ L: $L(d) = \sqrt{K(d)/\pi} - d$
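The counting procedure above can be sketched directly. Two assumptions are made here that the slides do not state: the summed counts are scaled by A/n^2 (A being the study area) to give K(d), and L is taken as sqrt(K(d)/pi) - d, one common normalisation that is close to zero under CSR. No edge correction is applied and the function names are illustrative.

```python
import math

def ripley_k(points, area, distances):
    """K(d) for each d: scaled count of ordered pairs (i, j), i != j, within d."""
    n = len(points)
    k_values = []
    for d in distances:
        count = 0
        for i, p in enumerate(points):          # circle of radius d around event i
            for j, q in enumerate(points):      # count the other events j inside it
                if i != j and math.dist(p, q) <= d:
                    count += 1
        k_values.append(area * count / (n * n))  # assumed scaling A / n^2
    return k_values

def ripley_l(k_values, distances):
    """Normalised measure, assumed here to be sqrt(K(d)/pi) - d."""
    return [math.sqrt(k / math.pi) - d for k, d in zip(k_values, distances)]

# Corners of a unit square inside a hypothetical study area of 4 square units.
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
distances = [0.5, 1.0, 1.5]
k_vals = ripley_k(points, 4.0, distances)
l_vals = ripley_l(k_vals, distances)
```

Minimum and maximum L(d) values for comparison would typically come from repeating the same computation on simulated random patterns.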



Point (event) based statistics – Ripley K
[Figure: Ripley K - Lung Cancer dataset. L(d) observed, L(d) min and L(d) max plotted against Distance]



Point (event) based statistics – comments
• CSR vs PCP vs other models
• Data: location, time, attributes, error, duplicates
• Duplicates: deliberate rounding, data resolution, genuine duplicate locations, agreed surrogate locations, deliberate data modification
• Multi-approach analysis is beneficial
• Methods: choice of methods and parameters
• Other factors: borders, areas, metrics, background variation, temporal variation, non-spatial factors
• Rare events and small samples
• Process-pattern vs cause-effect
• ESDA in most instances



Hot spot and cluster analysis – questions
• Where are the main (most intensive) clusters located?
• Are clusters distinct or do they merge into one another?
• Are clusters associated with some known background variable?
• Is there a common size to clusters or are they variable in size?
• Do clusters themselves cluster into higher order groupings?
• If comparable data are mapped over time, do the clusters remain stable or do they move and/or disappear?



Hot spot (and cool-spot) analysis
• Visual inspection of mapped patterns
• Scale issues
• Proximal and duplicate points
• Point representation (size)
• Background variation/controls (risk adjustment)
• Weighted or unweighted
• Hierarchical or non-hierarchical
• Kernel & K-means methods



Hot spot analysis – Hierarchical NN
Cancer incidence data: 1st and 2nd order clusters
