
Thinking Critically about Geospatial Data Quality

Michael F. Goodchild

University of California, Santa Barbara

Starting points
• All geospatial data leave the user to some extent uncertain about the state of the real world
  – missing data
  – positional and attribute errors
  – uncertain definitions of classes and terms
  – missing metadata
    • projection unspecified
    • horizontal or vertical datum
  – cartographic license

Uncertainty is endemic
• All geographic information leaves some degree of uncertainty about conditions in the real world
• x (location)
  – the Greenwich Meridian
  – the Equator
  – standard time
• z (attributes)
  – definitions of terms
  – errors of measurement

Starting points
• Some applications will be impacted, some will not
  – for any given data set at least one application can be found where uncertainty matters
  – knowing whether data are fit for use
• Our perspective on these issues has changed in the past two decades
  – from error to uncertainty
  – from top-down to bottom-up data production

http://www.thesalmons.org/lynn/wh-greenwich.html

Definitions of terms
• Precision:
  – the number of significant digits used to report a measurement
  – should never be more than is justified by the accuracy of the measuring device
  – internal precision of the computer
    • single precision arithmetic: 1 part in 10^7
    • double precision: 1 part in 10^14
    • relative to the Earth's dimensions, single precision is about a meter of resolution, double is about the size of an atom (see the sketch below)
    • no GIS should ever need more than single precision
    • a GIS's internal precision is effectively infinite
    • hard to persuade designers to drop those spurious digits
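
A minimal sketch of the internal-precision point, using NumPy to show the gap between adjacent representable coordinate values at the magnitude of the Earth's circumference; the circumference figure and the use of NumPy are illustrative assumptions, not part of the original slides:

```python
import numpy as np

# Approximate Earth circumference in metres (assumed figure for illustration)
earth_m = 40_075_000.0

# Gap between adjacent representable values at that magnitude
print(np.spacing(np.float32(earth_m)))  # ~4 m gap: single precision
print(np.spacing(np.float64(earth_m)))  # ~7.5e-9 m gap: double precision
```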

Scale
• Relationship between measurement of distance on a map and measurement on the ground
  – 1:24,000 is larger than 1:250,000
• A map used for digitizing or scanning always has a scale
  – a geospatial database never has scale
  – but we have complex conventions
• Scale as a useful surrogate for map contents, resolution, accuracy
  – scale of original map a useful item of geospatial metadata

Resolution
• The minimum distance over which change is recorded
  – 0.5mm for a paper map
• Positional accuracy
  – set by national map accuracy standards to roughly 0.5mm (see the sketch after the table below)

Scale          Positional accuracy and resolution
1:1,000        0.5 m
1:10,000       5 m
1:24,000       12 m
1:100,000      50 m
1:250,000      125 m
1:1,000,000    500 m
1:2,000,000    1 km
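
A minimal sketch of the 0.5 mm rule of thumb behind the table: ground accuracy and resolution are roughly the map tolerance (0.5 mm) multiplied by the scale denominator. The function name and defaults are illustrative assumptions:

```python
def positional_accuracy_m(scale_denominator: int, map_tolerance_mm: float = 0.5) -> float:
    """Ground accuracy/resolution (m) implied by a map scale,
    assuming a ~0.5 mm tolerance on the source map."""
    return scale_denominator * map_tolerance_mm / 1000.0

for d in (1_000, 24_000, 250_000, 2_000_000):
    print(f"1:{d:,} -> {positional_accuracy_m(d):g} m")
```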

Accuracy
• The difference between a measurement and the truth
  – problems of defining truth for some geospatial variables, e.g. soil class
  – if two people classified the same site, would they agree?
• Uncertainty of definition is a form of inaccuracy, along with:
  – variation between observers or measuring instruments
  – temporal change
  – loss of information on e.g. projection, datum
  – transformation of datum
  – map registration
  – digitizing error
  – imperfect fit of the data model, e.g. heterogeneous polygons, transition zones instead of boundaries
  – fuzziness of many geographic concepts
  – transformation of coordinate system, projection, data model, e.g. raster/vector conversion

Truth
• Often a source of higher accuracy
  – circularity: accuracy is the difference between a measurement and a source of higher accuracy?
• What identifies a source as having higher accuracy?
  – larger scale
  – more recent
  – cost more, took longer to make, more careful
  – more accurate measuring instrument
  – certified by an expert
  – earlier in the chain reality-map-database (less processing)

The problem
• Uncertainty is endemic in geospatial data
  – even a 1:1 mapping would not create a perfect representation of reality
• All GIS products are therefore subject to uncertainty
  – what is the plus or minus on estimates of length, area, counts of objects, positions, attributes, viewsheds, buffer zones, ...?
• GIS products are often used in decision-making by people who do not have intimate knowledge of the methods used to collect, digitize or process the data
  – results are often presented and used visually rather than numerically
• Computer (GIS) output carries a false sense of credibility

Topology vs geometry
• Which property does this pole lie in?
• Which side of the street is this house?
• Do these two streets connect?

Acres       1     2     3     4     5   1+2+5  1+2+3+5  1+2+3+4+5
0-1         0     0     0     1     2    2640    27566      77346
1-5         0   165   182   131    31    2195     7521       7330
5-10        5   498   515   408    10    1421     2108       2201
10-25       1   784   775   688    38    1590     2106       2129
25-50       4   353   373   382    61     801      853        827
50-100      9   238   249   232    64     462      462        413
100-200    12   155   152   158    72     248      208        197
200-500    21    71    83    89    92     133      105         99
500-1000    9    32    31    33    56      39       34         34
1000-5000  19    25    27    21    50      27       24         22
>5000       8     6     7     6    11       2        1          1
Totals     88  2327  2394  2149   487    9558    39188      90599

Data types
• The area-class map
  – soil maps
  – vegetation cover type
  – land use
• Boundaries surrounding areas of uniform type
  – what's the accuracy issue?

The area-class map
• Assigns every location x to a class
  – Mark and Csillag's term
  – c = f(x)
  – a nominal field (or perhaps ordinal)
  – classified scene
  – soil map, vegetation cover map, land use map
• Need to model uncertainty in this type of map

Uncertainty modeling
• Area-class maps are made by a long and complex process involving many stages, some partially subjective
• Maps of the same theme for the same area will not be the same
  – level of detail, generalization
  – vague definitions of classes
  – variation among observers
  – measuring instrument error
  – different classifiers, training sites
  – different sensors

Error and uncertainty
• Error: true map plus distortion
  – systematic measurements disturbed by stochastic effects
  – accuracy (deviation from true value)
  – precision (deviation from mean value)
  – variation ascribed to error
• Uncertainty: differences reflect uncertainty about the real world
  – no true map
  – possible consensus map
  – combining maps can improve estimates

Models of uncertainty
• Determine effects of uncertainty/variation/error on results of analysis
  – if there is known variation, the results of a single analysis cannot be claimed to be correct
  – uncertainty analysis an essential part of GIS
  – error model the preferred term

Traditional error analysis
• Measurements subject to distortion
  – z' = z + δz
• Propagate through transformations
  – r = f(z)
  – r + δr = f(z + δz)
• But f is rarely known
  – complex compilation and interpretation
  – complex spatial dependencies between elements of the resulting data set
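
A minimal Monte Carlo sketch of propagating a measurement error δz through a transformation r = f(z). The choice of transformation (percent slope from two spot heights), the elevations, and the error standard deviation are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z1, z2, dx=100.0):
    """Hypothetical transformation: percent slope from two spot heights dx metres apart."""
    return 100.0 * (z2 - z1) / dx

z1_true, z2_true = 210.0, 204.0   # assumed "true" elevations (m)
sigma = 1.0                       # assumed standard error of each measurement (m)

# z' = z + delta z: perturb the inputs, then recompute r = f(z') many times
z1 = z1_true + rng.normal(0.0, sigma, 10_000)
z2 = z2_true + rng.normal(0.0, sigma, 10_000)
r = f(z1, z2)

print(f"slope = {r.mean():.2f}% +/- {r.std():.2f}%")
```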

Spatial dependence
• In true values z
• In errors e
• cov(e_i, e_j) a decreasing positive function of distance
  – geostatistical framework
• Scale effects, generalization as convolutions of z
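
A minimal sketch of one such covariance function, an exponential model commonly used in the geostatistical framework; the sill and range values are illustrative assumptions:

```python
import numpy as np

def exp_cov(d, sill=1.0, range_m=500.0):
    """Exponential covariance of errors: positive and decreasing with separation d (m)."""
    return sill * np.exp(-d / range_m)

for d in (0, 100, 500, 2000):
    print(f"cov at {d:>4} m: {exp_cov(d):.3f}")
```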

If this were not true
• If Tobler's First Law of Geography did not apply to errors in maps
• If errors were statistically independent
• If relative errors were as large as absolute errors
• Errors in derived products would be impossibly large
  – e.g. slope
  – e.g. length
• Shapes would be unrecognizable
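
A minimal sketch of the slope example: the error in the elevation difference between two neighbouring cells is far smaller when their errors are strongly correlated than when they are independent. The cell size, error magnitude, and correlation value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dx, sigma, n = 30.0, 2.0, 100_000     # assumed cell size (m), elevation error (m), trials

e1 = rng.normal(0.0, sigma, n)        # error in cell 1

# (a) independent error in cell 2
e2_indep = rng.normal(0.0, sigma, n)
# (b) strongly correlated error in cell 2 (Tobler's First Law applied to errors)
rho = 0.99
e2_corr = rho * e1 + np.sqrt(1.0 - rho**2) * rng.normal(0.0, sigma, n)

print("std of slope error, independent errors:", np.std((e2_indep - e1) / dx))
print("std of slope error, correlated errors: ", np.std((e2_corr - e1) / dx))
```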

Realization
• A single instance from an error model
  – an error model must be stochastic
  – Monte Carlo simulation
• The Gaussian distribution metaphor
  – scalar realizations
  – a Gaussian distribution for maps
    • an entire map as a realization

Model
• {p_1, p_2, …, p_n}: a vector of class membership probabilities at each location
• correlation in neighboring cell outcomes
• posterior probabilities equal to priors
• 80% sand, 20% inclusions of clay
• no knowledge of correlations
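
A minimal sketch of drawing one realization from such a model: each cell is assigned a class from its probability vector, with cells drawn independently since no knowledge of correlations is assumed. The grid size is an illustrative assumption; the 80%/20% sand/clay figures follow the slide:

```python
import numpy as np

rng = np.random.default_rng(2)

# Per-cell class probabilities on a small grid: 80% sand, 20% inclusions of clay
p_clay = np.full((5, 5), 0.20)

# One realization: with no knowledge of correlations, draw each cell independently
realization = np.where(rng.random(p_clay.shape) < p_clay, "clay", "sand")
print(realization)
```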

Topographic data
• Definition problems
  – sand dunes
  – trees
  – buildings
• Classic measurement error model
  – measured elevation = truth + error
  – error spatially autocorrelated

z'(x) = z(x) + δz(x)
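
A minimal sketch of simulating one realization of this model, producing a spatially autocorrelated error field by smoothing white noise and rescaling it. The stand-in surface, error standard deviation, and smoothing length are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(3)

z = 200.0 + 50.0 * rng.random((100, 100))   # stand-in "true" surface z(x), metres
sigma_z = 2.0                               # assumed RMS elevation error (m)

# Spatially autocorrelated error: smooth white noise, then rescale to sigma_z
noise = gaussian_filter(rng.normal(size=z.shape), sigma=5)
delta_z = noise * (sigma_z / noise.std())

z_measured = z + delta_z                    # z'(x) = z(x) + delta_z(x)
```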

Glyphs indicating wind direction, magnitude, and uncertainty (Pang, 2001)

Representation of estimated water balance surplus/deficit (using a mesh surface) and uncertainty in the estimates (using bars above and below the surface). The bars depict the range of a set of model predictions with those predictions above the mean in purple and those below the mean in orange.

Fauerbach, Edsall, Barnes, MacEachren animation

Options for uncertainty
• Ignore
• Present parameters
• Present simulations
  – confidence intervals

Point symbol sets depicting uncertainty with variation in (a) saturation (colors vary from saturated green, bottom, to unsaturated, top); (b) crispness of symbol edge – middle; and (c) transparency of symbol – right.

Alternative depictions of data (inorganic nitrogen in Chesapeake Bay) and uncertainty of data interpolated from sparse point samples. Left view shows bivariate depiction in which dark=more nitrogen and certainty is depicted with a diverging color scheme (blue = most certain and red = most uncertain). Right view depicts data in both panels (dark = more nitrogen), with the right panel showing the results of interactive focusing on the most certain data.

Communication of uncertainty
• Producer to user
  – abilities
• Metadata standards
  – parameters of complex models
• Assertion:
  – all knowledge of uncertainty can be expressed in a suitable simulation model
  – equally likely realizations

The "five-fold way"• Positional accuracy• Attribute accuracy• Logical consistency• Completeness• Lineage• Federal Geographic Data Committee

– Spatial Data Transfer Standard– Content Standard for Digital Geospatial Metadata– www.fgdc.gov

Metadata
• Data about data
  – handling instructions
  – catalog entry
  – fitness for use
• What is known about data quality
  – a measure of the success of spatial data quality research
  – much progress has been made
  – FGDC CSDGM 1994
  – ISO 19115 2003

Web 2.0
• Dominated by user-generated content
• Bottom-up supply of geospatial data
• What are the issues?

CSDGM, ISO 19115
• Do they match the state of research?
  – early 1990s
  – SDTS discussions of the 1980s
  – the five-fold way
    • positional accuracy
    • attribute accuracy
    • logical consistency
    • completeness
    • lineage
• Do they represent a user perspective?
  – committees staffed by data producers
  – production control mechanisms?

Producer or user?
• Producer-centric
  – details of the production process: the measurement and compilation systems used
  – tests of data quality conducted under carefully controlled conditions
  – formal specifications of data set contents
• User-centric
  – effects of uncertainties on specific uses of the data, from simple queries to complex analyses
  – simple descriptions of quality that are readily understood by non-expert users
  – tools to enable the user to determine the effects of quality on results

Increasing complexity
• Self-documentation
  – notes to oneself
• A colleague
  – brief description
• Another discipline, language, culture
  – ideal metadata/data ratio?

[Graph: complexity of metadata increases with social distance]

References
• My CV and papers
• Book list

Recommended