Using Spatial Statistics In GIS - MSSANZ 2 Articles… · Using Spatial Statistics In GIS K. Krivoruchkoa and C.A. Gotwayb aEnvironmental Systems Research Institute, 380 New York

Using Spatial Statistics In GIS K. Krivoruchkoa and C.A. Gotwayb

aEnvironmental Systems Research Institute, 380 New York Street, Redlands, CA 92373-8100, USAbCenters for Disease Control and Prevention; 1600 Clifton Road NE Atlanta, GA 30333, USA

Abstract: A Geographical Information System (GIS) provides a powerful collection of tools for themanagement and visualization of spatial data. These tools can be even more powerful when they areintegrated with methods for spatial data analysis. In this context, we provide several examples that show thepower of exploratory spatial data analysis (ESDA) within a GIS and how this can provide the foundation for more sophisticated probabilistic modeling. While the ESRI’s ArcGIS software now facilitates the integrationof spatial data analysis and GIS functionality, more tools are needed for comprehensive spatial data analysis.We suggest how to implement additional spatial statistical methods within a GIS, including methods forusing non-Euclidean distances in the analysis of geostatistical, lattice, and point pattern data.

Keywords: Non-Euclidean distance, spatial correlation, GIS, geostatistics, lattice, marked point pattern,exploratory spatial data analysis

1. INTRODUCTION

Statistical analysis within a commercial GIS (Geographical Information System) is rapidlybecoming an impressive suite of tools. Untilrecently, statistical analysis was limited to visualization and exploratory data analysis, whilestatistical modeling was considered problematicfor implementation within a GIS. However,probabilistic reasoning and statistical modelingare now important components of GIS science,and users of commercial GIS software arebeginning to want more sophisticated statisticaltools for spatial analysis.

We discuss our ideas for data exploration andmodeling within the ESRI GIS, trying to balancea variety of user needs with a softwaredeveloper’s perspective. We believe that GISprovides a practical approach to data explorationand this helps to identify areas where statisticalmodeling could be most useful. In this context,we suggest new methods for future developmentand implementation in GIS.

2. SPATIAL ANALYSIS AND SPATIALDATA ANALYSIS IN GIS SOFTWARE

Bailey and Gatrell (1995) distinguish betweenspatial analysis, the study of spatial phenomena using the basic GIS operations such as spatial query, join, buffering, and layering, and spatialdata analysis, the application of statistical theoryand techniques to the modeling of spatially-referenced data, which is the discipline of spatialstatistics. ESRI’s GIS software includes modulesthat address both tasks, namely the Spatial

Analyst and the Geostatistical Analyst extensionsto ArcGIS. While there are some similaritiesbetween these two extensions, there are also somekey differences. Spatial Analyst functions allowthe user to construct maps of where things are andhow they change, find what is inside or nearby,and identify the largest and smallest values in thearea under investigation. Simple descriptivestatistics and statistical graphics such as means,standard deviations and pie charts are oftenenough to quantify the variability in the data andthe results. Zonal addition, proportionalallocation, and buffering are often sufficient forcombining data from different spatial units.Figure 1 shows the main menu of the SpatialAnalyst program that provides a comprehensivesuite of deterministic functions for spatialanalysis.

Figure 1. Main menu of the Spatial Analyst.

713

For many GIS users, these analyses provide morethan enough capability for most of theapplications of interest. However, the analyses ignore uncertainties in the data and the results andproduce new surfaces without taking into accountthe errors that propagate with each operation onthe data. Unfortunately for some researchers, this will not give an adequate analysis and many usersrequire much more sophisticated methods forspatial data analysis. These necessarily requireinferential spatial statistics: estimation, prediction,and hypothesis testing. ESRI’s GeostatisticalAnalyst extension to ArcGIS provides a greatersuite of both qualitative and quantitative statisticaltools for spatially continuous data. Figure 2 showsthe main menu of this extension with a list of theexploratory spatial data analysis (ESDA) toolsthat are included. When these tools areimplemented, graphical dialogs link the results tomaps and data tables, allowing the user to quicklyand effectively assess the variability, distribution,correlations and cross-correlations, and large-scale trends in the data.

Figure 2. Main menu of the GeostatisticalAnalyst.

For modeling the user can select from a variety of kriging models with output in the form ofpredictions, prediction standard errors, quantileand probability maps (Figure 3).

Figure 3. The geostatistical method selectiondialog.

Additional tools include several approaches fordata transformation and detrending. The user canrely on the default parameters or make morespecific choices using the graphical dialogs that

facilitate covariance modeling, selection of searchneighborhood, and provide validation and cross-validation diagnostics. Many options for mappingand post-processing are available.

Geostatistical Analyst has several thousands of users and they have very different backgroundsand interests. Unfortunately, many researchersuse the software only to make maps. Manystatisticians do not understand or appreciate thefull utility of GIS for spatial data analysis. Otherresearchers are not educated in spatial statistics,so they are unaware of techniques for modelinguncertainty, even though they realize thatmeasuring and modeling without errors isimpossible. At the same time, they readily use automatic “geoprocessing” tools thatarithmetically add or average raster data values in the cells without taking into account the impact oferror propagation on the results. After severalsuch geoprocessing steps, the resulting datastructure can be completely random and, consequently, decisions made from these resultsmay be wrong. Still others use the software for analyses for which it was not designed. Thisproblem often arises when users implementGeostatistical Analyst with aggregated data thatare associated with spatially discrete units. For example, we have watched users discuss whichinterpolator, inverse squared distance weightingor splines, is better for mapping of proportions offemales in burial populations. What is the bestprediction of the proportion of females outsidecemeteries? One would hope it is zero. There areother methods in spatial statistics that are moresuitable for this type of data. For example, a marked point pattern analysis that incorporates an attribute value recorded at each location could beused. The mark would be the occurrence (or not)of a female burial at each location, and a markedpoint pattern model could then be used to estimateand map the intensity of females in burialpopulations, assuming that data point locationsare given by nature and not selected by the user.

Hopefully, all of these problems can be solved byeducation. Case studies can help to show usershow GIS can and should be used for moresophisticated statistical analysis and modeling.

3. IMPROVED SPATIAL DATA ANALYSIS USING GIS

Good data visualization is important both for dataunderstanding and for representing the results of statistical analysis. Without a GIS, users maystruggle to create meaningful visualization tools.For instance, consider the graph in Figure 4 that displays land sales with different characteristics.

714

Consider instead the type of map that can beproduced using GIS. Figure 6 shows a map of ozone measurements in 1999 in SouthernCalifornia. The city of Los Angeles lies in a coastal plain, surrounded by mountains thatseparate a desert from the coastal climate. In thesummer, pollutants in the lower layer of air (smog) move from the city to the east, but are blocked by the mountains. So, to obtain a betterunderstanding of how pollution might move, thetopography is also displayed in the same map.One of the major sources of air pollution isexhaust from motor vehicles, so the major roadnetworks are also displayed on the map. Gooddata visualization can help us to construct a modelfor air quality. For example, we can now easily see a large-scale east-west trend in the measurements and the barrier to movement that the mountains provide. The larger values of ozonetend to be close to the mountains in the east, and the ozone concentration declines toward the coast.Thus, any model we select should account for thistrend. This can be easily investigated furtherusing Geostatistical Analyst’s ESDA tools and theresults can then be used to choose an appropriategeostatistical model.

Figure 4. Typical data representation without GIS usage.

Graphs like this are typical and plentiful in manyof today’s journals, but the utility of this graph is very low. In many papers on spatial statistics, theauthors do not even use graphics but only one-dimensional graphs and tables with estimatedparameters and diagnostics. Land cost depends onmany factors and many of them are readily available for GIS users (street networks, schooland shopping locations, etc). Visualizing these factors with a map could be a very valuable toolfor understanding land costs and the results onany statistical analysis.

Good visualization should be the prelude tosophisticated modeling. For example, consider thecontour map in Figure 5 that is typical of mapsdisplayed in statistics journals. Such maps are often the sole visualization tool used to supportthe development of sophisticated statisticalmodels. This illustration is not very helpful forunderstanding the spatial variation in the data, orfor investigating hypothesis about the reasons forsuch variations. Unfortunately, maps like this are plentiful in the scientific literature.

Figure 6. Ozone measurements, elevation, andmajor road network visualization using ArcGIS,Los Angeles area.

There are other useful GIS features which canalso improve statistical data analysis. Withadvances in GIS, the ability to find and usenecessary data has grown tremendously. Forexample, ArcGIS allows users to find and immediately use relevant data associated with a selected region on the map by displaying a list ofall data sources available for this area on theInternet servers. The user can also do a search for a specific type of data. Even if the user did notfind exactly what is necessary, similar data can beused for comparison with data underinvestigation.

Figure 5. The result of predictions is visualizedwithout GIS. Modern GIS software allows the management of

very large datasets. This is particularly important

715

Figure 7. 3D view of the area near Los Angeles.in the environmental sciences because large,remote-sensing images are easily obtained for minimal cost. As another example, Californiadaily measurements of air quality are available formany cities for the last 22 years and querieswithin a GIS can be very helpful for exploringsuch large data sets. With so much spatial dataavailable over the Internet, we can easily obtainelevation measurements, census and epidemiological data, land use classifications, andmeteorological information. Often these data are collected using different coordinate systems, and a GIS is very helpful in changing projection.

Intuitively, distances between objects on this mapshould not be “as the crow flies.” Figure 8 showsthe result of ozone concentration predictions overthe elevation map. In this analysis, the distancebetween locations was calculated using a non-Euclidean distance metric that incorporates the mountainous barrier to ozone movement. Thismetric constructs a cost surface based on information on altitude (Krivoruchko and Gribov,2002).

4. SPATIAL STATISTICS TOOLS ANDMODELS FOR IMPLEMENTATION IN GIS IN THE NEAR FUTURE

One interesting consequence of developingstatistical software for a large audience is the possibility of learning what “typical” users want.We found that the ideal world where data are accurately measured and normally or log-normally distributed is not common in most user applications and the spatial coordinates of thedata are often not known exactly. In this section,we will discuss two important options to beimplemented in the spatial data analysis software in the near future: the use of non-Euclideandistances and methods for adjusting formeasurement and locational errors.

Non-Euclidean distancesFigure 8. Visualization of the result of predictions using GIS. Many GIS users are analyzing data in the

environment with natural and artificial barriers. For many applications, the map in Figure 6 thatuses meteorological, elevation, and traffic data inaddition to the pollution measurements, togetherwith some basic results from typical GIS functions like buffering, will be enough tounderstand the spatial distribution of ozone.However, many applications require moresophisticated analytical methods. Continuing with the example above, Figure 7 shows a 3D view ofthe area near Los Angeles shown in Figure 6.

For some variables of interest, such as airpollution, we need to factor in the mountainousbarriers to movement. For other variables such as ocean temperature and fish abundance, we need to take into account the shape of the coastline.Another example is provided by halibutabundance near the western Canada coastline(Figure 9. Data provided by the InternationalPacific Halibut Commission). Here the smallislands are barriers to aquatic movement. Thus,we would want to use a distance metric thatreflects how the fish swim.

Traditional analysis of geostatistical and spatialpoint processes are based upon straight-linedistance. In practice, environmental and artificialbarriers due to rivers, roads, soil types variability,and other natural boundaries always exist and it isnecessary to account for them to create ameaningful analysis (e.g., as part of a processmodel approach to spatial analysis as described inLaffan, 2002). For example, weights to theneighbors in lattice data analysis can be naturallybased on travel or economic distance between

716

objects in addition to the length of the commonborder and distance between polygon centroids.

Figure 9. Halibut abundance near the westernCanada coastline. Circles are observationallocations.

Figure 10 displays crime events data over a streetnetwork over a six-month period in a mediumsized city. Crime tends to be close to the road and the places where people live or work. Analyzingthe spatial distribution of crime dictates the usageof a more sophisticated distance metric than asimple straight-line distance metric.

The next version of the Geostatistical Analystsoftware will use a general and flexible approachto the problem of using non-Euclidean distancemetrics in spatial data analysis. This approach is based on a cost weighted distance, a commonraster function in GIS that calculates the cost of travel from one cell of a grid to the next.Specifying high costs for travel between certaincells effectively prevents movement betweenthese cells. The determination of the cost value ateach location, calculation of distances betweensampled locations and unsampled ones, andchoice of covariance function in the case of geostatistics is discussed in Krivoruchko andGribov, 2002.

Measurement and locational errors

Two main sources of error are data collection anddata analysis. Errors propagate as a result of data manipulations: the errors in maps are modifiedand usually lead to increased output mapuncertainty and may lead to wrong conclusions.Even in a well-designed experiment, errors oftenarise from imperfection in the experimental setup

and the researcher’s inconsistencies. For mostexperiments with

Figure 10. Example of crime events displayed over street network.

measured outcomes, if an experiment is repeatedmany times, measurement will be slightlydifferent each time. Measurement errors can beattributed to error in the measurement device,human recording error, changes in themeasurement conditions, data integration, andfaulty sampling techniques. Therefore, theinevitable errors of measurement are somethingscience has to live with and the true signal we are interested in has to be extracted from the noisydata. Such errors also arise in determininglocations. Examples of locational errors includemeasurements collected for territories (polygons)for which the area of which is unknown; use ofcentroids to measure the location of a polygon;truncated coordinates; and coordinates distortedby map projection.

Measurement error in geostatistics theory was developed in the form of the filtered kriging fromthe very beginning (Gandin, 1963). Locationalerrors were discussed from time to time, butdetailed theory has appeared only recently(Cressie and Kornak, 2002). The GeostatisticalAnalyst now allows filtered kriging and futurereleases will allow for adjustment of locationalerrors.

Consider an example of lightning strikes inBoulder County, Colorado, on September 1, 2000(Figure 11). The red polygons reflect a radius ofuncertainty in the location of the strikes. Inaddition to the approximate data location, there is information on the polarity and strength of thelightning strikes. Positive polarity is more likelyto ignite wildfire. We could use the GeostatisticalAnalyst to create a probability map of positivepolarity lightning strikes and use this as anindication of the risk of a wildfire. However, thisdoes not account for the spatial distribution in thelightning strikes: there may be far more strikes inother areas, so even if they have negative polarity,

717

5. CONCLUSIONthere might be increased fire risk simply due tothe number of strikes. Although the requirements of GIS users are

varied, they all have two needs in common:interactive visualization and the ability to usestatistical methods and models for spatial data.Many users may not be knowledgeable aboutstatistical theory, so they will appreciate asoftware that helps them make good choices for their data and application. Others moreknowledegable in statistics will appreciate thevisualization tools in GIS that are lacking instatistical software packages. They will also appreciate the ability to choose to among methodsand models that comes with a comprehensive and flexibile set of tools for interactive spatial dataanalysis.

Figure 11. Lightning strikes in Boulder County,Colorado, on September 1, 2000

We think that it is much easier to incorporatemodern spatial statistics into GIS environmentthan it is to implement modern GIS functionalityin statistical software. At the same time, a userfriendly and understandable implementation ofstatistical models into the GIS core is the mostefficient way to involve more people in inferentialspatial data analysis.

From a spatial statistics viewpoint, these datawould be considered as a marked point patternprocess, a process that is doubly stochastic: one process locates the lightning strikes (a pointprocess) and another process controls the strengthand polarity associated with strikes at therecorded locations (random field process).

To account for errors in identifying the locationsof the lightening strikes, we could define muchlarger polygons that partition the data accordingto soil and forest types and then count the numberof lightning strikes and estimate polaritydistribution in the potentially overlappingpolygons. Any polygonal data analysis on suchaggregated data would require a sophisticateddistance metric.

6. REFERENCES

Bailey, T. C. and Gatrell, A. C. 1995. InteractiveSpatial Data Analysis. Essex: Addison WesleyLongman Limited.

Cressie, N. and Kornak, J. 2002. Spatial statisticsin the presence of location error with an application to remote sensing of the environment.Department of Statistics Preprint No. 701, TheOhio State University. Available by request at http://www.stat.ohio-state.edu/~sses/papers.html.

This data discussion raises several questions.First, we need a marked point pattern model withlocational error. Second, we need software forgeostatistical, point and marked point patternanalysis with an option to use non-Euclideandistances. Software that allows flexible definitionof weights for lattice data modeling, includingweights defined using cost surface, is required.After all, real datasets are often a combination ofcontinuous, polygonal, and point data, andpractitioners will appreciate tools for interactingbetween different types of data.

Gandin, L.S. 1963. Objective Analysis ofMeteorological Fields. GidrometeorologicheskoeIzdatel’stvo (GIMIZ), Leningrad (translated byIsrael Program for Scientific Translations,Jerusalem, 1965).

Krivoruchko K. and Gotway C.A. 2002.Expanding the “S” in GIS: Incorporating SpatialStatistics in GIS. Available from ESRI online athttp://www.esri.com/software/arcgis/arcgisxtensions/geostatistical/research_papers.html.To the best of our knowledge, commercial

marked point pattern analysis software simplydoes not exist. Although several implementationsof lattice data analysis are available on themarket, a flexible set of visualization tools for polygonal data analysis is also not available.

Krivoruchko K. and Gribov A. 2002.Geostatistical Interpolation in The Presence ofBarriers. Available from ESRI online athttp://www.esri.com/software/arcgis/arcgisxtensions/geostatistical/research_papers.html.

There are many other important additions to thespatial data analysis arsenal, includingaggregation, disaggregation, and integration ofspatial data obtained at different spatial scales (Krivoruchko and Gotway, 2002).

Laffan, S.W. 2002. Using process models toimprove spatial analysis. International Journal ofGeographic Information Science, 16(3): 245-257.

718

Documents

Using Spatial Statistics In GIS - MSSANZ 2 Articles… · Using Spatial Statistics In GIS K. Krivoruchkoa and C.A. Gotwayb aEnvironmental Systems Research Institute, 380 New York