

SPATIAL DATA MODELS AND SPATIAL DATA STRUCTURES

TABLE OF CONTENTS

1 Spatial data models: an introduction
2 Geometric entities
    2.1 Problems with the entity definition process
3 Spatial data models and structures
4 The raster approach
    4.1 Raster data structures
        4.1.1 The simple raster
        4.1.2 The complex raster
    4.2 Data compaction methods
        4.2.1 Run length encoding
        4.2.2 The quadtree
5 The vector approach
    5.1 Data structures without topology
        5.1.1 The neighbourhood problem
        5.1.2 The island or hole problem
    5.2 Data structures with topology
        5.2.1 Point
        5.2.2 Lines and networks
        5.2.3 Areas
6 Vector and raster spatial data models: advantages and disadvantages
    6.1 Data volume
    6.2 Topology queries
    6.3 Generality
    6.4 Analytical capabilities
    6.5 Accuracy and precision
7 What have you learned?

This section focuses on the methods available for the actual implementation of

geographic models within GIS. It will review the geometric primitives (spatial entities:

points, lines, areas, networks and surfaces) and the different approaches (data models

and data structures) used to implement these representations of geographic space in a

computerised GIS environment.

1 Spatial data models: an introduction

“…Human perception of space is frequently not the most efficient way to structure a

computer database and does not account for the physical requirements of storing

and repeatedly using digital information. Computers for handling geographical data

therefore need to be programmed to represent the phenomenological structures in

an appropriate manner…”

(Burrough and McDonnell, 1998 p.36)

This pertinent statement provides a timely reminder that computers require unambiguous

instructions on how to perform specific tasks. In the same way, a computer needs to be

told how to build spatial models for implementation within a GIS. Using Peuquet’s (1984)

abstraction schema as a framework this section examines in detail the different levels of

abstraction (levels 1 – 3) and the associated processes used in the design and

implementation of a GIS data model.

The successful design of spatial data models within a GIS must consider the following:

1. Which landscape elements (phenomenological structures or spatial entities) are

necessary to appropriately represent the system under investigation (level 1 abstraction)?

2. What approach (spatial data model) should be used to handle and display these

spatial entities (level 2 abstraction)?

3. What particular set of instructions and information (data structure) will the computer

require to reconstruct the spatial data model in digital form (level 3 abstraction)?

2 Geometric entities

In performing the first stage of data abstraction we need to identify the features / objects

to be represented within our GIS. The design and construction of our GIS is dependent

upon the successful identification of a series of geometric primitives. These are the basic

units of our spatial data models and form the building blocks of GIS. All geographical

phenomena can, in two dimensions at least, be represented by one of three entity types

(Figure 1).

In completing the e-tutorial to construct your own map, you produced a series of spatial

entities of your own to represent the different landscape elements. These entities (or

primitives), were made up of a series of shapes (geometries) including points to represent

the location of trees, a series of lines to illustrate the road centre-lines and areas to

represent recreational space and residential and/or commercial buildings. Let's quickly

remind ourselves of some of the geometric properties of point, line and area entity types.

• What is a point? A point is a spatial entity that has no length or area.

• What is a line? A line is a feature that has length but no area.

• What is an area? An area is a spatial entity that has perimeter and area.

Figure 1. Point, line and area spatial entities.

The three entity types above, however, are not always the most appropriate

representational forms for landscape elements to be included in a GIS. The concepts of

area and line can be extended to produce two other spatial entities, namely surfaces and

networks (Figure 2).

Figure 2. Surface and network entities.

Surfaces can be used, for example, to represent population density, elevation values and

temperature. As well as its fundamental two-dimensional attributes the surface entity is

also capable of three-dimensional representation in GIS and this is revisited later in the

section on dimensionality. Examples of networks include traffic (road) and hydrological

systems.

In many cases, the range of surface and network entity types that you identified will have

been limited only by your knowledge of the area. Some suggestions include:

Surface      Network
Rainfall     Electricity
Elevation    Rivers
Pollution    Sewage pipes
             Telephone cables
             Footpaths

2.1 Problems with the entity definition process

As one might expect, there are a number of problems associated with simplifying the

complexities of the real world into five basic, two-dimensional building blocks. These

include the dynamic nature of the real world, the scale at which a particular problem

needs addressing and the identification of discrete features.

Real world dynamics

The real world is not static: forests grow, rivers flood, cities expand. This poses two

particular problems for the entity definition phase of a GIS project. The first problem is

how to select the entity type that provides the most appropriate representation for the

feature being modelled. For example, is it best to represent a forest as a collection of

points (representing the location of individual trees), or as an area (the boundary of which

defines the territory covered by a forest)? The second problem is one of temporal

change. For example, a forest that was originally represented as an area may decline to

the extent that in reality it is only a dispersed group of trees that are, perhaps, better

represented using point features.

Scale

Scale is also an important concept in the entity definition process. For example, if a GIS

database is to be constructed at a scale of 1:1M (for example, the Digital Chart of the

World) it may be appropriate to represent the city of Manchester as a point feature.

However, at larger scales you would need to employ different entity types to provide a

more appropriate representation of Manchester. For example, at 1:250K, an area feature

would be most suitable; at still larger scales, it is likely that a compound or collection of

entities would be more practical. Ideally, a truly scale-independent GIS would be most

effective, able to operate at any scale, modifying and selecting the appropriate entity

representation as the user zooms in and out of the database.

Definition

The selection of appropriate entity types is further plagued by the fact that many real world

features simply do not fit comfortably into the character of the models available. Feature

boundaries, for instance, are a particular problem in the definition of spatial entity type. In

reality, the boundaries of natural phenomena are rarely discrete, but instead are more

readily characterised by a continuum or transition zone. For example, where should you

place the boundary of an area feature used to represent a stand of forest? Do forests

have edges, or do transitional zones from full to zero forest cover better define their

boundary properties? Very often in GIS we make use of paper maps (secondary data

sources) for data input and these are readily defined by the clear and distinct marking of

any boundaries. While the discretisation of feature boundaries may be very useful and

allow us to more easily generate quantifiable measurements, we must recognise the

problems associated with the choice of entity type and its boundary characteristics,

particularly with regard to natural (and therefore potentially fuzzy) phenomena.

Level 1 of the abstraction process as defined by Peuquet (1984) is characterised by the

five geometric primitives (or spatial entities) used as the basic building blocks of GIS.

Unfortunately, as our discussion has shown, the process of defining what entity type

should reflect which real world feature is far from simple. The decision is vitally important

for the successful design of a GIS, as it controls GIS functionality and the potential for

further spatial operations (an issue addressed later in this unit).

The data layer concept

So far we have just considered the five primitive spatial entity types. However, the

complexity of the real world is such that for most GIS applications it is necessary to

construct more complex models of reality, typically consisting of compound features

(several entity types). The most common method used in GIS to handle this problem is to

adopt a layered approach. Individual data layers are constructed using the various entity

types to represent different spatial elements. Each data layer is stored independently in

the GIS using either raster or vector approaches as mentioned next. These data layers

can then be used either independently (single layer operations) or together (as multiple

layers) depending upon the application. The use of multiple layers, as discussed later,

can cause additional problems dependent upon the choice of spatial data model.

3 Spatial data models and structures

Levels 2 and 3, the next stages in the abstraction process, concern the design,

representation and implementation of the defined spatial entities in the GIS. If you revisit

the quotation from Burrough and McDonnell (1998) you will recognise that as well as

defining our entities we must also instruct the computer on how to turn specific entity data

into (digital) graphical representations. In GIS we mostly employ (one of) two methods to

handle and display our chosen spatial entities. These are commonly referred to as the

raster and vector approaches. The history of GIS software is such that some software

was originally designed for a raster spatial data model e.g. Idrisi and other software is

based on the vector spatial data model e.g. GeoMedia Professional, ArcGIS. Raster data

sets are characterised by their grid cell structure, whereas the vector approach comprises

co-ordinate geometry in an attempt to represent the features or objects of interest as

exactly as possible. As well as employing the raster and vector spatial data models [1] we

are required to provide the computer with further information to reconstruct these models

in digital format. There are a great number of spatial data structures, specific to either the

raster or vector spatial data models, used by commercial GIS. The great diversity of

spatial data structures is one of the reasons why exchanging spatial data between GIS is

problematic. Different GIS may contain information of value to one another, but will be

unable to share that information if the data structures used to store the information are

incompatible. In the case of the UNIGIS supported software, each GIS has its own data

format and structure. This means that we cannot simply transfer a raster file straight into

a vector system or vice versa.

[1] The term data model is often used to describe these two approaches. This can become confusing since the term data modelling is used to describe the entire process of representing reality in a computer. For our own purposes we will use the term spatial data model in association with the terms raster and vector, and the terms data model and data modelling to refer to the overall modelling process.

The following sections explore the raster and vector spatial data models and examine,

using a range of examples, the diversity of their associated spatial data structures.

4 The raster approach

Raster systems are a result of developments in computer graphics technology over the

last forty years. They are widely used in computing and digital television graphics and

work by repeatedly sweeping an electron beam across the computer or television screen,

from side to side and top to bottom. The image attributes (brightness and colour) at each

point on the screen are determined by computer-generated data (Coll, 1991). In

fact, each point on the screen is composed of a small cell structure or pixel

(picture element).

In the raster world, individual (typically square) cells are used to represent the different

geometric entities (points, lines, areas, networks and surfaces) used to build the GIS

image. In a raster file geographic space is divided into regular sized grid tessellations.

Single cells represent point features, whereas lines and areas are identified by groups or

clumps of pixels (Figure 3).

Figure 3. Raster representations of points, lines and areas.

Importantly, raster space can be composed of different tessellation patterns, including the

triangle and hexagon (Figure 4). Peuquet (1990) notes that triangular tessellations are

useful for terrain representation – triangles do not all have the same orientation, making

them more effective for picking out bumps and undulations on the land surface.

Figure 4. Regular tessellations for a raster field-based representation.

Panels: hexagonal tessellations; triangular tessellations; square tessellations.

As well as selecting an appropriate tessellation you will also have to decide on the

resolution of the cells. Too small and the data volumes will be prohibitive, too large and

your data will look coarse and lack precision.
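
The trade-off between cell size and data volume can be made concrete with a little arithmetic. The sketch below is only illustrative: the 10 km x 10 km study area and the three cell sizes are assumptions chosen for the example, and `cell_count` is a hypothetical helper, not part of any GIS package.

```python
import math

def cell_count(width_m: float, height_m: float, cell_size_m: float) -> int:
    """Number of square cells needed to cover a rectangular extent."""
    cols = math.ceil(width_m / cell_size_m)
    rows = math.ceil(height_m / cell_size_m)
    return rows * cols

# Assumed 10 km x 10 km study area at three assumed cell sizes (metres).
for size in (100, 50, 10):
    print(size, cell_count(10_000, 10_000, size))
```

Halving the cell size quadruples the number of cells, which is why very fine resolutions quickly become expensive to store.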

4.1 Raster data structures

There is a wide range of different data structures to represent an entity in a computer

using the raster data model. In the next part of this section we will examine some of these

data structures in detail, including a discussion of the data compaction methods used to

minimise data storage requirements for large or complex raster data sets.

4.1.1 The simple raster

At the most elementary level there is the basic or simple raster data structure where

information is stored for each cell in the image. This information informs the computer of

the presence (or absence) of a feature within a given cell. Figure 5 illustrates what a

raster representation of a simple map may look like at a range of different cell (pixel)

sizes. This usefully illustrates how data quality changes with cell size in a raster image.

Figure 5. Raster views of a simple map.

a: The simple map

b: Fine resolution grid cells

c: Intermediate size grid cells

d: Coarse resolution grid cells

Cell occupancy and mixed pixels

As well as affecting the quality of definition, the pixel size can also lead

to problems when dealing with phenomena (entities) that only partially occupy a raster

grid cell. Typically, this is solved by the application of one of two cell occupancy rules: the

present or absent rule, and the 50% rule. The present or absent rule states that even if

an entity is only minimally occupying a raster cell then it is considered to be present – and

the cell will record an entity feature. As its name suggests, the 50% rule states that if more than

50% of a pixel is occupied by an entity feature then the entity will be acknowledged and

recorded as present. Figure 6 shows how these two different procedures can affect the

shape and character of an entity in a raster spatial data model.

Figure 6. The present or absent rule and the 50% occupancy rule.

Panels: a circular phenomenon; raster representation using the present or absent rule; raster representation using the majority (50%) rule.
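
The two occupancy rules can be compared with a short sketch. It is an approximation made under stated assumptions: the circle's centre and radius, the 8 x 8 grid of unit cells and the function name `rasterise_circle` are all invented for illustration, and the fraction of each cell covered is estimated from a lattice of sample points rather than computed exactly.

```python
def rasterise_circle(cx, cy, radius, n_cells, rule, samples=10):
    """Rasterise a circle on an n_cells x n_cells grid of unit cells.

    rule = "presence": a cell is 1 if any sample point lies inside the circle.
    rule = "majority": a cell is 1 if more than 50% of sample points lie inside.
    Coverage is estimated from a samples x samples lattice of points per cell.
    """
    grid = []
    for row in range(n_cells):
        out_row = []
        for col in range(n_cells):
            inside = 0
            for i in range(samples):
                for j in range(samples):
                    x = col + (i + 0.5) / samples
                    y = row + (j + 0.5) / samples
                    if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                        inside += 1
            fraction = inside / samples ** 2
            threshold = 0.0 if rule == "presence" else 0.5
            out_row.append(1 if fraction > threshold else 0)
        grid.append(out_row)
    return grid

presence = rasterise_circle(4, 4, 2.5, 8, "presence")
majority = rasterise_circle(4, 4, 2.5, 8, "majority")
for p_row, m_row in zip(presence, majority):
    print(p_row, m_row)
```

Every cell recorded under the majority rule is also recorded under the present or absent rule, so the latter always produces the larger (or equal) footprint.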

Another problem with the simple raster model is its inability to identify the

nature of an entity feature (i.e. point, line or area). This reflects the binary codes used in

raster technology to store image information. Using binary coding, entities present are

recorded with a value of 1, and unoccupied cells as 0. The computer therefore sees the

raster image presented in Figure 5 as a series of 0s and 1s, and not as a housing estate,

river or trees since it does not contain any information to distinguish between the three.

Using a simple raster approach, the computer requires a separate layer of information for

each class. Figure 5 would then require 3 separate raster files; one for the tree map, one

for the river map, and one for the housing estate map.

Figure 7. House figure for simple raster data structure exercise.

4.1.2 The complex raster

One of the more obvious problems with the simple raster data structure is the volume of

information that has to be recorded to represent even the simplest map. Complex raster

data structures reduce the volume of information by assigning coded labels to grid cells

that not only tell the computer that a feature is present but also identify its character.

Using our earlier example from Figure 5, the cells representing trees might be assigned a

value of 1. The table below indicates how other entity types could be represented. Note that a

column indicating colour has been included to illustrate how a complex raster image may

appear on screen.

Phenomena        Entity type   Code   Colour
Tree             Point         1      Green
River            Line          2      Blue
Housing estate   Area          3      Red
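
The relationship between a coded (complex) raster and the separate binary layers needed by the simple raster can be sketched as follows. The codes follow the table above; the small grid and the helper name `to_binary_layers` are invented for illustration and do not reproduce Figure 5.

```python
# Codes as in the table: 0 = empty, 1 = tree, 2 = river, 3 = housing estate.
CODES = {1: "tree", 2: "river", 3: "housing estate"}

# Invented 4 x 5 coded raster (not the map in Figure 5).
coded = [
    [0, 1, 0, 2, 0],
    [0, 0, 2, 2, 0],
    [3, 3, 2, 0, 1],
    [3, 3, 2, 0, 0],
]

def to_binary_layers(grid, codes):
    """Split one coded raster into one presence/absence layer per class."""
    return {
        name: [[1 if cell == code else 0 for cell in row] for row in grid]
        for code, name in codes.items()
    }

layers = to_binary_layers(coded, CODES)
for name, layer in layers.items():
    print(name, layer)
```

One coded grid carries the same information as three binary grids of the same size, which is the saving the complex raster offers.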

Figure 8. House figure for complex raster data structure exercise

4.2 Data compaction methods

File size is one of the major problems with raster data sets: space occupancy raster

structures require a value to be recorded and stored for each grid cell. This means a

complex soil map of, say, 100 x 100 pixels, that may contain 20 or more distinct soil

classes requires the same storage space as a simple road map of the same area,

despite the fact that most cells in the raster road map record a value of

0.

Raster data storage requirements have received considerable attention, and a range of

data compression (compaction) methods have been developed. These compaction

methods can reduce the size of a raster data set quite considerably. In the following sub-

sections we will examine some of the more commonly used methods in detail.

4.2.1 Run length encoding

One of the most common and simplest techniques for reducing the data volume

associated with a raster image is a technique known as run length encoding. This

technique reduces the information stored for each line in a raster matrix by storing a single

value for the consecutive number of cells of a given type, rather than storing a value for

each cell. Consider the following simple raster in Figure 9 showing the presence or

absence of clay.

Figure 9. Simple raster file structure.

Row 1   1 1 1 0 0 0 0 1 1 1 0 0 0 0
Row 2   1 1 1 1 1 0 0 0 1 1 1 0 0 0
Row 3   1 1 1 0 0 0 0 0 1 1 1 0 0 1
Row 4   1 1 0 0 0 0 0 0 0 0 1 1 1 1
Row 5   1 1 0 0 0 0 0 0 0 0 1 1 1 1

A run length encoded version of the file would be represented as follows:

Row 1 31, 40, 31, 40

Row 2 51, 30, 31, 30

Row 3 31, 50, 31, 20, 11

Row 4 21, 80, 41

Row 5 21, 80, 41

If we take a closer look at the first row of the run length encoded file we can see how the

method works:

31, 40, 31, 40

The first number (3) represents the number of consecutive cells with the same coding. The

second number (1) is that coding: here 1 = soil type A (clay soil). The third number (4) indicates the number

of unoccupied cells moving from left to right. The fourth number (0) represents the

absence of clay soil. The fifth number (3) denotes the next 3 consecutive cells are

occupied by clay soil (code = 1 again). Finally, the numbers 4 and 0 indicate the absence

of clay soil in the 4 grid cells that complete the row. Note that the commas have been

added to make the file easier to read – they would be absent in a real run length encoded

file, e.g. 31403140.
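
The encoding of row 1 can be reproduced with a minimal sketch. The function name `rle_encode` is hypothetical, and writing the pairs without separators only works under the assumption, implicit in the text, that every run length and code is a single digit.

```python
from itertools import groupby

def rle_encode(row):
    """Return a list of (run_length, value) pairs for one raster row."""
    return [(len(list(run)), value) for value, run in groupby(row)]

row1 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
pairs = rle_encode(row1)
print(pairs)  # [(3, 1), (4, 0), (3, 1), (4, 0)]

# Written without separators, as in the text (single-digit values only).
print("".join(f"{count}{value}" for count, value in pairs))  # 31403140
```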

If we assume one numeric value uses one byte of storage (1 byte = 8 bits) on the

computer then row one of our run length encoded (RLE) file takes up 8 bytes compared

with the 14 bytes required to store the same information using the simple raster data

structure. The equivalent file sizes and savings are given below:

BYTE storage requirements

         Simple raster   Run length raster
Row 1    14              8
Row 2    14              8
Row 3    14              10
Row 4    14              6
Row 5    14              6
Total    70              38
Saving                   32 bytes (46% saving in storage space)
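
The storage figures can be checked with a short calculation. As in the text, the sketch assumes one byte per stored number, so each run costs two bytes (count and code); `rle_bytes` is a hypothetical helper.

```python
from itertools import groupby

# The five rows of the clay presence/absence raster (Figure 9).
rows = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]

def rle_bytes(row):
    """Bytes for one RLE row: two one-byte numbers (count, code) per run."""
    runs = sum(1 for _ in groupby(row))
    return 2 * runs

simple_total = sum(len(row) for row in rows)    # one byte per cell
rle_total = sum(rle_bytes(row) for row in rows)
saving = simple_total - rle_total
print(simple_total, rle_total, saving, round(100 * saving / simple_total))
```

The totals are 70 bytes for the simple raster and 38 bytes for the run length encoded version, a difference of 32 bytes, or about 46%.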

The figure below shows how the data volume associated with the storage of a complex

raster using the RLE method could be reduced in a similar way. Note that the presence or

absence values of 0 and 1 have been extended to include the codes used to identify 3

different soil types present in the grid.

Figure 10. Complex raster file structure.

Row 1   1 1 1 0 0 0 0 2 2 2 0 0 0 0
Row 2   1 1 1 1 1 0 0 0 2 2 2 0 0 0
Row 3   1 1 1 0 0 0 0 0 2 2 2 0 0 2
Row 4   1 1 0 0 0 0 0 0 0 0 2 2 2 2
Row 5   1 1 0 0 0 0 0 0 0 0 2 2 2 2

An RLE version of the complex raster file would be represented as follows:

Row 1 31, 40, 32, 40

Row 2 51, 30, 32, 30

Row 3 31, 50, 32, 20, 12

Row 4 21, 80, 42

Row 5 21, 80, 42
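
Reading an RLE row back into cell values is the reverse operation. The sketch below assumes the separator-free digit-pair form described earlier, in which both the run length and the code are single digits; `rle_decode` is a hypothetical name.

```python
def rle_decode(encoded: str):
    """Expand a digit-pair RLE string such as '31403240' into cell values.

    Each pair is (run length, code); both are assumed to be single digits.
    """
    if len(encoded) % 2:
        raise ValueError("expected an even number of digits")
    cells = []
    for i in range(0, len(encoded), 2):
        count, code = int(encoded[i]), int(encoded[i + 1])
        cells.extend([code] * count)
    return cells

# Row 1 of the complex raster: 31, 40, 32, 40
print(rle_decode("31403240"))
```

A real format would need a way to store runs longer than nine cells and codes above nine, for example by using fixed-width binary fields instead of decimal digits.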

4.2.2 The quadtree

As Peuquet (1990) points out, an advantage of the raster spatial data model that

employs a square tessellation is that each cell can be subdivided into smaller cells of the

same shape and orientation. This unique feature of the grid or raster data model has

produced a range of innovative data storage and data reduction methods that are based

on regularly subdividing geographical space. The most widely implemented is the

quadtree (Samet 1989) based on the recursive decomposition of a grid. There is a range

of quadtree types (see Peuquet (1990) and Samet (1989) for further reading) and the

most common approach is the area quadtree.

The area or region quadtree

The area quadtree works on the principle of recursively subdividing the number of cells by

quads (or quarters) until there is a homogenous block and no more subdivision can take

place (Bonham-Carter 1994). At the end of the subdivision process each cell in the grid

matrix may be classed as having an entity present or absent. The number of subdivisions

in this process depends upon the complexity of the map layer and upon what is

acceptable as the finest division to represent an object. The smallest quad cell size is

determined by pixel resolution. Figure 11 illustrates the quadtree principle.

Figure 11. The quadtree process.

Panels: original feature; 1st subdivision; 2nd subdivision; 3rd subdivision; 4th subdivision.

Figure 11 shows the process of hierarchical subdivision. The image is first divided into

four quadrants – of which none can be wholly classified as not containing the entity.

Therefore a further stage of subdivision is required in each quadrant, where it is possible

to identify ten quadrants that do not contain the entity and six quadrants that do. Of these,

one quadrant wholly contains the entity and five quadrants only partially contain the entity

(3rd subdivision). Further subdivision is therefore only necessary in these five partially

occupied quadrants.

This process of subdivision continues until every cell either contains or does not contain

the entity (4th subdivision). Looking closely at Figure 11 we can identify four hierarchical

levels. The hierarchical nature of the quadtree becomes more apparent when we see it

represented diagrammatically as a binary image tree – also known as a bintree (Figure

12).

Figure 12. Binary image tree.

In the bintree we can clearly distinguish the four hierarchical levels, and by binary coding each

node in the tree with a 1 or 0 we can see whether a quadrant contains part of the entity or

not. To examine how a quadtree is coded and information retrieved for display by the

computer look at Figure 13.

Figure 13. Coding a simple quadtree.
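
The recursive subdivision behind the area quadtree can be sketched in a few lines. This is a minimal illustration under assumptions, not the coding scheme of Figure 13: the grid is an invented 4 x 4 example, its side is assumed to be a power of two, and the quadrant order (NW, NE, SW, SE) and the names `build_quadtree` and `count_leaves` are choices made for the example.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Recursively subdivide a square binary grid (side a power of two).

    Returns 0 or 1 for a homogeneous block, otherwise a list of four
    subtrees in the order NW, NE, SW, SE.
    """
    if size is None:
        size = len(grid)
    first = grid[row][col]
    if all(grid[r][c] == first
           for r in range(row, row + size)
           for c in range(col, col + size)):
        return first  # homogeneous block: stop subdividing
    half = size // 2
    return [
        build_quadtree(grid, row, col, half),                # NW
        build_quadtree(grid, row, col + half, half),         # NE
        build_quadtree(grid, row + half, col, half),         # SW
        build_quadtree(grid, row + half, col + half, half),  # SE
    ]

def count_leaves(node):
    """Number of stored blocks (leaves) in the quadtree."""
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1

# Invented 4 x 4 example, not the feature shown in Figure 11.
grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
tree = build_quadtree(grid)
print(tree)                # [0, 1, [0, 1, 1, 1], 0]
print(count_leaves(tree))  # 7
```

Here 7 blocks are stored instead of 16 cells, because three of the four top-level quadrants are homogeneous and need no further subdivision.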

Raster scan order

Another method of referencing a quadtree is to use the Morton Matrix indexing scheme.

This method, named after its developer (Morton 1966), is based on the Peano scan

method, which generates a track through space that exploits the property that areas close

together in the real world will be close together in a sequential digital file. These scan

orders are best illustrated with reference to a diagram. Figure 14 shows how a Peano

scan would be generated for a raster matrix with 8 columns and 8 rows (dimensions of

raster in Figure 13). The coding system must be adapted to follow the Peano curve and

each cell has a unique identifier, known as a Morton number (Figure 15).

Figure 14. Peano curve.

Figure 15. Morton ordering scheme.

000 001 010 011 100 101 110 111

002 003 012 013 102 103 112 113

020 021 030 031 120 121 130 131

022 023 032 033 122 123 132 133

200 201 210 211 300 301 310 311

202 203 212 213 302 303 312 313

220 221 230 231 320 321 330 331

222 223 232 233 322 323 332 333
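
The numbers in the matrix above follow a regular pattern that can be generated from a cell's row and column. The sketch below is one way of doing so and assumes that rows and columns are counted from 0 starting at the top-left cell; `morton_code` is a hypothetical helper name.

```python
def morton_code(row: int, col: int, levels: int = 3) -> str:
    """Base-4 Morton number for a cell, one digit per quadtree level.

    At each level the digit is 2 * (row bit) + (column bit), so quadrants
    are numbered 0 = NW, 1 = NE, 2 = SW, 3 = SE, as in the matrix above.
    """
    digits = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        digits.append(str(2 * row_bit + col_bit))
    return "".join(digits)

# Reproduce the first two rows of the 8 x 8 matrix.
for r in range(2):
    print(" ".join(morton_code(r, c) for c in range(8)))
```

Because each digit identifies a quadrant at one level of subdivision, cells that share leading digits lie in the same quadtree block.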

Figure 16. House figure for quadtree data structure exercise.

5 The vector approach

The five entity primitives identified at the start of this section can most easily be

represented using the vector spatial data model. The vector approach attempts to

represent the features of interest as exactly as possible employing Cartesian co-ordinate

geometry. Points are represented by coordinate pairs, lines by arcs linking a series or

string of points, areas by lines enclosing homogenous areas (a string of coordinate pairs

with the same origin and end point), networks by connected lines and surfaces by areas

linking points and lines.

Those of you who use and work with topographic paper maps on a regular basis will be

very familiar with the vector approach. Cartographic maps are composed entirely of a

series of representative points, lines, areas and surfaces. The topographic map or vector

model provides a scaled model of reality. For example, a river feature will be represented

using a line of appropriate thickness – rather than a series of contiguous inappropriately

shaped cells (as with the raster model). The high quality of geometric representation

makes interpretation of objects a relatively easy task (take another look at the map

provided in Section 1 to remind yourself).

This precision of representation is very useful. There is, however, further important

information that is needed to develop a vector data structure to store information about

the entities in a vector spatial data model. This information is about the geographical

relationships between points and lines that are used to represent an entity. These spatial

relationships are expressed as topology. As with the raster spatial data model there are

many potential data structures that can be used to represent an entity in the vector world.

However, they can be categorised into two groups:

1. Data structures without topology

2. Data structures with topology

5.1 Data structures without topology

The simplest form of vector data structure that can be used to reproduce a geographical

image in the computer is a set of x and y coordinates. Figure 17 shows how the simple

map (introduced earlier in the section on raster data), might be represented in a vector

view of the world.

Figure 17. A simple vector map.

A simple data structure without topology for this model could be constructed as follows:

Area (housing estate)

H1 (40, 50), H2 (40,50), H3 (50, 45), H4 (50, 35), H5 (70, 35), H6 (70, 30),

H7 (90, 30), H8 (90, 50), H1 (40, 50).

Note: the first and last co-ordinate pair are the same. This ensures that the area or

polygon is closed.

Line (river)

R1 (0, 25), R2 (10, 23), R3 (20, 20), R4 (40, 20), R5 (50, 20), R6 (70, 20),

R7 (80, 20), R8 (90, 15).

Points (trees)

T1 (10, 10), T2 (20, 5), T3 (25, 15), T4 (35, 10), T5 (55, 10), T6 (60, 15),

T7 (65, 5), T8 (75, 10), T9 (75, 5).

Where x, y represents the co-ordinates used to identify the location of the points, which

must be connected to make the entity. The descriptor in brackets, (tree) etc., is added to

the file so that the computer knows what the data represents. A new feature is recorded

by a code such as carriage return <cr>. The limitations of simple vector data structures

start to emerge when we look at more complex spatial entities. Consider for example the

group of areas and lines represented in Figure 18.
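
Before looking at those cases, it helps to see how little the simple structure for Figure 17 actually records. The following is a minimal sketch in Python, assuming each feature is held as a type, a descriptor and a list of coordinate pairs; the layout and the name is_closed are illustrative choices, and the coordinates are copied as printed in the listing above.

```python
# Each feature: (entity type, descriptor, list of (x, y) coordinate pairs).
# Coordinates are copied as printed in the listing for Figure 17.
features = [
    ("area", "housing estate",
     [(40, 50), (40, 50), (50, 45), (50, 35), (70, 35),
      (70, 30), (90, 30), (90, 50), (40, 50)]),
    ("line", "river",
     [(0, 25), (10, 23), (20, 20), (40, 20), (50, 20),
      (70, 20), (80, 20), (90, 15)]),
    ("points", "trees",
     [(10, 10), (20, 5), (25, 15), (35, 10), (55, 10),
      (60, 15), (65, 5), (75, 10), (75, 5)]),
]


def is_closed(coords):
    """An area is closed when its first and last coordinate pairs match."""
    return len(coords) >= 4 and coords[0] == coords[-1]


if __name__ == "__main__":
    for entity_type, descriptor, coords in features:
        print(entity_type, descriptor, len(coords), "pairs,",
              "closed" if is_closed(coords) else "open")
```

Everything needed to draw the map is present, but nothing in the structure says how one feature relates to another.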

Figure 18.

Figure 18a, at its simplest level could be represented by the following data structure:

Area 1 xa,ya xb,yb xc,yc……xk,yk xl,yl xm,ym……

Area 2 xu,yu xv,yv xw,yw……xk,yk xl,yl xm,ym……

If you were to reconstruct the image in Figure 18a from the data structure given above,

you would find that the co-ordinates that define the boundary line, which is shared by the

two polygons, would be stored twice. While this may not appear too much of a problem

for our small example, consider the implications for a map of the soil series of the United

Kingdom or the state boundaries in the US. The amount of duplicated data stored would

be a large proportion of the total data. In Figure 18b we have a slightly different problem.

The network could quite easily be stored using the following simple file structure.

Line 1 x1,y1 x2,y2 x3,y3 etc

Line 2 x4,y4 x5,y5 x3,y3 etc

Line 3 x3,y3 x6,y6 x7,y7 etc

Line 4 x8,y8 x9,y9 x7,y7 etc

Line 5 x7,y7 x10,y10 x11,y11 etc

The computer would be able to reproduce the image but a problem would arise as soon

as we tried to use this information to ask questions about the network. This is because

the computer has not been provided with any information to tell it that line 1 is connected

to line 2 which is connected to line 3 and so on. These spatial linkages are only inferred

in the viewer’s mind when the lines are displayed on the screen and are not contained

explicitly within our data file. This situation has led to simple vector data structures of

this type, without topology, being referred to as 'spaghetti', because what is actually on the

screen is merely a jumble of linear features as far as the computer is concerned.
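
The point can be made concrete with a short sketch in Python. The coordinates assigned to points 1 to 11 are invented for illustration (Figure 18b gives none), and the function name shared_endpoints is an assumption. Because the file holds only coordinates, the only way to discover that two lines meet is to compare their end coordinates pair by pair.

```python
from itertools import combinations

# Hypothetical coordinates for points 1..11 of the network in Figure 18b.
P = {1: (0, 40), 2: (10, 35), 3: (20, 30), 4: (0, 10), 5: (10, 20),
     6: (30, 30), 7: (40, 30), 8: (30, 0), 9: (35, 15),
     10: (50, 35), 11: (60, 40)}

# The 'spaghetti' file: each line is just a string of coordinate pairs.
spaghetti = {
    "Line 1": [P[1], P[2], P[3]],
    "Line 2": [P[4], P[5], P[3]],
    "Line 3": [P[3], P[6], P[7]],
    "Line 4": [P[8], P[9], P[7]],
    "Line 5": [P[7], P[10], P[11]],
}


def shared_endpoints(lines):
    """Infer which lines touch by comparing the end coordinates of every pair.

    Nothing in the file states these links; they have to be rediscovered
    each time by brute-force comparison.
    """
    links = []
    for (name_a, pts_a), (name_b, pts_b) in combinations(lines.items(), 2):
        ends_a = {pts_a[0], pts_a[-1]}
        ends_b = {pts_b[0], pts_b[-1]}
        if ends_a & ends_b:
            links.append((name_a, name_b))
    return links


if __name__ == "__main__":
    for link in shared_endpoints(spaghetti):
        print(link)
```

The search compares every pair of lines, and it only works if shared points have been recorded with exactly the same coordinates. A topological structure stores the links once instead of recomputing them.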

This ‘spaghetti’ approach is used by many design and drawing packages. A true GIS

must use a topological data structure to represent spatial entities if it is to be of any

practical use other than for displaying features. Note that the absence of topology is one

factor which distinguishes a CAD/CAM system from one designed to store, manipulate

and analyse spatial data (GIS).

There are two specific problems of 'spaghetti' data structures that illustrate why topological

information is important. First, ‘spaghetti’ data contains no neighbourhood information,

and second the data structure is unable to cope with what are termed hole or island

polygons.

5.1.1 The neighbourhood problem

The neighbourhood problem has already been alluded to when we discussed the problem

of storing a simple network as a series of lines (Figure 18b). The problem is that while the

lines give the appearance of a network when displayed on the screen the actual file that is

used to create the image contains no information about which line is connected to the

next. In the same way, a series of polygons created using the simple data structure may

appear to be connected, but in fact they are discrete entities which are unaware of the

presence of neighbouring polygons. Even giving each polygon a label or unique identifier

would not solve the problem. What is required is a set of instructions which informs the

computer where one polygon is with respect to its neighbours. Polygon data structures

that contain such information would be termed topologically correct. How a data structure

can be designed to include full topology will be discussed later.

5.1.2 The island or hole problem

Figure 19 illustrates the island or hole problem. From the figure you can see that one

polygon classified as containing the soil type clay is wholly contained within a polygon

classified as soil type loam. The problem is frequently referred to as one of nested or

hierarchical polygons. While a simple file structure would be able to recreate the image in

Figure 19 by a series of x, y co-ordinates it would not be able to inform the computer that

the island polygon was in fact contained within the larger ‘loam’ polygon. Dealing with islands or

holes also requires a fully topological data structure.

Figure 19. The island or hole problem.

5.2 Data structures with topology

5.2.1 Point

A point is the simplest spatial entity that can be represented in the vector world with full

topology, because all that is required for a point to be topologically correct is a pointer or

geographical reference which locates its position with respect to other spatial entities in

the real world. This is performed by tagging the point with a geographical reference.

5.2.2 Lines and networks

Simple lines carry no inherent spatial information about their connectivity. Lines only need

to have topological information attached to them when they become part of a network,

area or surface feature. Topological information is added to line features through the use

of 'pointers' which flag where links occur in the data structure. The most frequently used

pointer in the vector data model is the node. Figure 20 shows the type of information

required to identify connectivity in a line network.

The first stage in turning a series of lines into an intelligent network is to identify the start,

end and junction points. These pointers or nodes are then used to record information

about the connectivity of the network as well as hold information that regulates the nature

and direction of the information flow. Figure 20a illustrates six nodes, four of which

represent the start and end points of the network (B, D, E, F) and two (A, C) which

represent junctions.

The second stage is to identify the lines or arcs that connect the nodes. This information

is presented in Figure 20b. In many cases, direction is also an important network feature

and Figure 20c shows how the direction at which an arc joins a node can be recorded.
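
A minimal sketch of such an arc-node structure in Python is given below. Only the roles of the nodes (B, D, E and F as start and end points, A and C as junctions) come from the text; the arc numbers, the way the arcs join the nodes and their directions are hypothetical, as are the function names.

```python
from collections import defaultdict

# Hypothetical arc table: arc id -> (from_node, to_node).
# The order of each pair records the direction of the arc.
arcs = {
    1: ("B", "A"),
    2: ("D", "A"),
    3: ("A", "C"),
    4: ("C", "E"),
    5: ("C", "F"),
}


def node_degrees(arc_table):
    """Count how many arcs meet at each node."""
    degree = defaultdict(int)
    for from_node, to_node in arc_table.values():
        degree[from_node] += 1
        degree[to_node] += 1
    return dict(degree)


def downstream(arc_table, start):
    """Nodes reachable from start when arcs are followed in their direction."""
    reached, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        for from_node, to_node in arc_table.values():
            if from_node == current and to_node not in reached:
                reached.add(to_node)
                frontier.append(to_node)
    return reached


if __name__ == "__main__":
    degrees = node_degrees(arcs)
    print("junctions:", sorted(n for n, d in degrees.items() if d >= 3))
    print("end points:", sorted(n for n, d in degrees.items() if d == 1))
    print("downstream of B:", sorted(downstream(arcs, "B")))
```

With the connections stored explicitly, questions about the network become simple look-ups in the arc table rather than comparisons of coordinates.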

Figure 20. Network connectivity.

5.2.3 Areas

Topology for a set of area entities is built in a series of stages. These stages have been

described by Burrough and McDonnell (1998) and are only summarised here. The order

in which the stages are carried out by a vector GIS will be GIS-product specific and may

not follow the order described here. However, the principles remain the same, with the

process consisting of four stages:

Stage 1 Generating a boundary network

Stage 2 Linking arcs into polygons

Stage 3 Checking polygons for closure

Stage 4 Providing a unique identifier for each polygon

Stage 1: Generating a boundary network. Figure 21 shows a diagram of a set of simple

polygons. The first step in generating full topology for the entities in Figure 21 would be to

identify those arcs that intersect with one another. Those arcs that cross are

automatically intersected and built into two separate arcs and a node added at the

junction (Figure 22b).

Figure 21. A set of simple polygons.

Figure 22. Generating a boundary network.

The second step involves sorting arcs according to their x and y location so that arcs

topologically close to each other are also in close proximity in the data file. This process

helps speed up retrieval times when searching for adjacent chains.

Now it is possible to generate an outer envelope or boundary network that contains all

other polygons. The outer polygon is only used to build topology for the arc network. The

envelope polygon is built by identifying the arcs that make up the outer boundary of the

area (Figure 22c). A flag should be set to indicate that each of the arcs that make up the

outer envelope has been traversed once.

The following information should be stored for the envelope polygon:

A unique identifier or polygon ID

A code that identifies it as an envelope polygon

A direction pointer indicating the order and direction in which arcs should be linked

together to form the boundary

A list of arcs in the boundary

Its x, y extent

Stage 2: Linking arcs into polygons. Once the outer envelope has been created, topology

can be constructed for each of the other polygons in turn. The same starting point should

be used as employed in the construction of the outer envelope and if the outer envelope

was constructed in a clockwise direction, then the other polygons should be constructed in

a similar fashion. Once a pointer or node is reached the arcs which are to the right should

be followed. Arcs should be dropped from the search once they have been traversed

twice. This process should be repeated until all polygons have been constructed (Figure

22d).

Stage 3: Checking polygons for closure. In building topology for a set of areas it is

essential that the areas themselves can be identified as being closed or not. If a polygon

is left open it is not topologically correct. By consulting the arc table, polygon closure can

be checked quite easily because all arcs must be linked to a node that points them to the

next arc. If any arcs in the table are found without the proper node pointers then either

they will be mistakes or unclosed polygons exist. If arcs of this type are found, then

depending upon the nature of the error, they may be flagged for correction or deletion.
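
One simplified way to picture this check is sketched below in Python. It assumes a hypothetical arc table holding only a from-node and a to-node for each arc, and it treats an arc as suspect when one of its end nodes is not shared with any other arc. A production GIS will apply more elaborate rules, so this is an illustration of the idea rather than a description of any particular product.

```python
from collections import Counter

# Hypothetical arc table: three arcs forming a closed triangle, plus one
# arc (a4) that runs off to a node no other arc reaches.
arc_table = {
    "a1": ("N1", "N2"),
    "a2": ("N2", "N3"),
    "a3": ("N3", "N1"),
    "a4": ("N2", "N4"),
}


def dangling_arcs(table):
    """Return ids of arcs with an end node that no other arc end shares.

    An arc that starts and ends at the same node counts that node twice,
    so a single-arc closed loop is not flagged.
    """
    uses = Counter()
    for from_node, to_node in table.values():
        uses[from_node] += 1
        uses[to_node] += 1
    return sorted(arc_id for arc_id, (from_node, to_node) in table.items()
                  if uses[from_node] < 2 or uses[to_node] < 2)


if __name__ == "__main__":
    print("arcs to flag:", dangling_arcs(arc_table))
```

Whether a flagged arc is a digitising mistake to delete or part of a polygon that still needs closing is a decision the check itself cannot make.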

Stage 4: Providing a unique identifier for each polygon. The final stage in building full

topology for a set of polygons is to ensure that a unique label or identifier is attached to

each polygon. This is important if non-spatial (attribute) data is to be linked with the

polygons that have been created. It is also important for locating (geographically) one

polygon in relation to another.

Figure 23 is a summary of the information necessary for computer storage in order to

reconstruct polygon topology.

Figure 23. Vector polygon with topological data structure.
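
One common way of holding this kind of information is an arc table that records, for every arc, its start and end nodes and the polygons on its left and right. The Python sketch below assumes that layout; the field names, arc and polygon identifiers and the example map (two adjoining polygons P1 and P2, with an island P3 inside P2) are all hypothetical and are not taken from Figure 23.

```python
from collections import namedtuple

# One record per arc: where it starts and ends, and which polygon lies on
# each side when the arc is followed from from_node to to_node.
Arc = namedtuple("Arc", "from_node to_node left_poly right_poly")

arcs = {
    "a1": Arc("N1", "N2", "P1", "outside"),  # outer boundary of P1
    "a2": Arc("N2", "N1", "P1", "P2"),       # boundary shared by P1 and P2
    "a3": Arc("N2", "N1", "P2", "outside"),  # outer boundary of P2
    "a4": Arc("N3", "N3", "P3", "P2"),       # closed loop: island P3 in P2
}


def boundary_arcs(arc_table, poly):
    """Arcs that form part of the boundary of poly."""
    return sorted(arc_id for arc_id, arc in arc_table.items()
                  if poly in (arc.left_poly, arc.right_poly))


def neighbours(arc_table, poly):
    """Polygons found on the other side of any arc bounding poly."""
    found = set()
    for arc in arc_table.values():
        if arc.left_poly == poly:
            found.add(arc.right_poly)
        elif arc.right_poly == poly:
            found.add(arc.left_poly)
    return found - {poly}


if __name__ == "__main__":
    print("boundary of P1:", boundary_arcs(arcs, "P1"))
    print("neighbours of P2:", sorted(neighbours(arcs, "P2")))
    print("neighbours of P3:", sorted(neighbours(arcs, "P3")))
```

In this layout the shared boundary a2 is stored once rather than twice, neighbours are found by reading the table, and the island is tied to its surrounding polygon through the left and right entries of its single closed arc.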

Figure 24. Vector representation exercise.

6 Vector and raster spatial data models: advantages and disadvantages

Maffini (1987) states that the raster-vector approaches are two alternate methods for

storing and representing spatial phenomena. As models they have relative strengths and

weaknesses for describing conditions in the real world. In this section we will explore

some of the merits and weaknesses of each model. Throughout the course there will be

many instances where you will need to determine whether one data model is more

appropriate than another. IGISE (1991) groups the advantages and disadvantages

identified into five generic themes:

• Data volume

• Topology queries

• Generality

• Analytical capability

• Accuracy and precision

These themes are used here to start the debate about the choice of spatial data model as

part of the design process.

6.1 Data volume

One of the most frequently discussed areas in the raster-vector debate is data volume,

which was at its height during the 1970s and the first half of the 1980s when the

technological limitations on computer power and storage were most marked.

The problem is that it is not simply the case that raster data sets are

larger than vector data sets. It depends upon the character and complexity of the spatial

entities you are trying to record. A simple or complex raster for example can require as

much data storage space to record a simple spatial entity with few polygon boundaries as

it would take to record a complex spatial entity with many polygon boundaries. In the

same way an unstructured vector file without topology for a series of 50 complex polygons

can be much smaller in size than a fully topological vector data structure for the same

area. The more complex a spatial entity becomes the closer the data volume

requirements of the different data storage techniques. As a general rule, however, raster

spatial data models are more demanding of data storage than their vector

counterparts.
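
A rough count of stored numbers shows why the comparison depends on complexity. The Python sketch below is a simplification under stated assumptions: it counts one value per raster cell and two values (x and y) per vector vertex, and it ignores file headers, compaction and any topology tables. The raster size and vertex counts are invented for illustration.

```python
def raster_values(rows, cols):
    """A simple raster stores one value per cell, whatever the map shows."""
    return rows * cols


def vector_values(vertex_counts):
    """An unstructured vector file stores an x and a y for every vertex.

    vertex_counts holds the number of coordinate pairs in each feature.
    """
    return sum(2 * count for count in vertex_counts)


if __name__ == "__main__":
    # The same hypothetical 100 x 100 raster is needed for either map.
    print("raster, any content:", raster_values(100, 100))
    # One square recorded as a closed ring of 5 coordinate pairs.
    print("vector, one square:", vector_values([5]))
    # Fifty complex polygons of 120 coordinate pairs each.
    print("vector, 50 complex polygons:", vector_values([120] * 50))
```

The raster count stays fixed while the vector count grows with the number of vertices, so for sufficiently complex entities the usual advantage of the vector file narrows or disappears.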

6.2 Topology queries

As IGISE (1991) stated, an important prerequisite of a GIS project is the ability to ask

questions such as:

• Where is something?

• What is next to something?

• What is contained within something?

The ability of the different data models to provide answers to these questions is of vital

importance to the designer of a GIS project. Both the raster and vector spatial data

models have strengths and weaknesses associated with answering different types of

spatial questions. These are explored in detail in the following sections on single and

multiple layer spatial operations. For the moment it is enough to note that traditionally,

vector data models have been considered more appropriate for answering topological

questions about containment, adjacency and connectivity. However, with the advent of

more intelligent raster data structures such as the quadtree which contain information

about the relationship between cells in the image, two particular spatial queries can now

be performed efficiently using a raster data model. These are identifying the area in which

a point is located. In general, however, where topological queries are likely to constitute

the major application of the GIS, a vector data model is required.

6.3 Generality

In a GIS project it is frequently necessary to be able to change the scale and thematic

resolution of operation. This often makes it essential to be able to generalise the

complexity of spatial features. For example, it might be necessary to have the ability to

dissolve 500 enumeration districts into 14 wards or 200 detailed soil polygons into 15

general soil units. Depending upon the type of generality the vector and raster data

models possess different relative advantages. The vector data model, for example,

handles changes in scale much more easily than its raster counterpart with regard to the

visual representation of entities. This is because of the precise way in which information

is recorded as a set of x, y co-ordinates. Changes of scale pose a problem in the grid

world if a resolution is requested below the cell size specified at the project outset. Increases

in scale in the raster world are typified by the appearance of a blockier image.

On the other hand, in generalising the actual form of an area and of surface entities the

raster model comes into its own because to aggregate a complex soil class map into a

more general one needs only the value of each cell to be reclassified to reconstruct the

new image. While this is possible in the vector world it requires complex calculations of

the intersection and adjacency of polygons with similar attributes. Therefore, if many

calculations of this nature are required a raster data model may be more appropriate.
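
The raster side of this comparison can be sketched in a few lines of Python. The soil codes, the small grid and the look-up table below are invented for illustration; the point is only that generalising the map is a cell-by-cell substitution.

```python
# Hypothetical detailed soil map held as a simple raster of class codes.
detailed = [
    ["A1", "A1", "B2"],
    ["A2", "B1", "B2"],
    ["C1", "C1", "B1"],
]

# Hypothetical look-up table from detailed classes to general soil units.
lookup = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}


def reclassify(grid, table):
    """Replace every cell value with its general class from the table."""
    return [[table[value] for value in row] for row in grid]


if __name__ == "__main__":
    for row in reclassify(detailed, lookup):
        print(" ".join(row))
```

No boundaries have to be located, intersected or dissolved. Neighbouring cells that now carry the same value simply read as one larger area.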

6.4 Analytical capabilities

There is a clear distinction between the analytical capabilities of raster and vector GIS.

This is a major component of the sections of this unit dealing with spatial operations.

6.5 Accuracy and precision

In days gone by, you may have often heard a vendor of a vector GIS product announcing

to his/her client that their product is more accurate at representing spatial features

because it uses the vector spatial data model. This statement provides a useful starting point for

examining accuracy and precision because it highlights one of the most common

mistakes made when comparing raster and vector data: the confusion of the terms

accuracy and precision. Before we proceed it is necessary to define what we mean by the

terms.

Accuracy is the faithfulness with which our spatial entity is represented in our computer

view of the real world including its location (positional or spatial accuracy) and character

(attribute accuracy).

Precision is independent of accuracy and is the degree of exactness used to record the

location and character of our spatial entity. For example, a typical vector based GIS

allocates 8 decimal digits of precision to each of its coordinates and many allocate 16

(Goodchild and Gopal 1990). The level of this precision is much higher than the accuracy

of typical GIS data. Therefore, what the vendor really meant in the above statement is

that the vector data model is more precise at reproducing the shapes and lines that we

are used to seeing on our traditional paper map model of the world. This is because the

paper map uses a vector data model to represent spatial entities. The locations of small

entities are shown as points, roads and rivers as lines and areas of forest as polygons

with a distinct boundary. Naturally, therefore, if we use a vector GIS to capture this data it

will be more precisely reproduced than in the raster world where the points would appear

as cells, the roads and rivers as jagged and stepped linear features and the areas as

blocky irregular entities rather than the smooth boundaries that are formed by the arcs of

the polygon.

At first glance it might appear that our vendor is indeed right that the vector GIS is

much more accurate than an equivalent grid based system. To understand why this is not

the case we need to revisit our definition of accuracy and look closely at the concept of

faithfulness of representation. The first important point to recognise is that all spatial data

are of limited accuracy. IGISE (1991) and Goodchild and Gopal (1990) provide passages

to illustrate the point.

“Consider for example two air photo-interpreters who are evaluating the boundary

for a wooded area. They are likely to produce two different boundaries. Branches

of trees, for example, overlap one another. The overhang and overlap can easily

be several metres. Depending on the season in which the photograph was taken

(i.e. whether or not the trees are in leaf), the boundary line may be drawn

differently.”

IGISE (1991)

“The area labelled ‘soil type A’ on a map of soils is not in reality all type A, and its

boundaries are not sharp breaks but transition zones. Similarly, the area labelled

‘population density 1000-2000/sq.km’ does not in fact have between 1000 and

2000 in every square km, or between 10 and 20 in every hectare, since the spatial

distribution of population is punctiform and can only be approximated by a smooth

surface.”

Goodchild and Gopal (1990)

Now if we return to our vendor, what s/he should have said was that a vector GIS is more

precise in representing a spatial entity as it appears on a map. It is not necessarily more

accurate than a raster GIS at representing the location and character of the true real world

feature. Just because an entity takes on a jagged, blocky or stepped appearance it is not

correct to assume that the database is inaccurate. In addition, many users of GIS

consider the blocky irregular boundary produced between areas when a raster spatial data

model is used to be more appropriate for representing the real world features where

distinct boundaries between spatial phenomena are not present.
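
The difference between the two terms can be illustrated with a small numerical sketch in Python. All of the figures are invented: a hypothetical true position in metres, a vector record stored to eight decimal places but digitised some distance from the truth, and a raster record taken as the centre of a 10 m cell on a grid assumed to have its origin at (0, 0).

```python
import math


def position_error(recorded, true):
    """Straight-line distance between a recorded and a true position."""
    return math.hypot(recorded[0] - true[0], recorded[1] - true[1])


def cell_centre(point, cell_size):
    """Centre of the raster cell containing point (grid origin at 0, 0)."""
    col = math.floor(point[0] / cell_size)
    row = math.floor(point[1] / cell_size)
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


# Hypothetical true position of a feature, in metres.
true_location = (1234.0, 5678.0)

# Vector record: very precise (eight decimal places) but poorly digitised.
vector_record = (1249.12345678, 5670.87654321)

# Raster record: coarse (10 m cells) but placed in the correct cell.
raster_record = cell_centre(true_location, cell_size=10.0)

if __name__ == "__main__":
    print("vector error (m):", round(position_error(vector_record, true_location), 2))
    print("raster error (m):", round(position_error(raster_record, true_location), 2))
```

In this invented case the vector record is about 17 m from the true position while the cell centre is about 3 m from it. The number of digits stored says nothing about how faithfully the position was captured.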

7 What have you learned?

This section has introduced the two field-based models for representing geographic space

in GIS – raster and vector. We have examined a range of storage methods employed by

the cell-based raster spatial data model from the simple raster to the area quadtree. In

considering the vector spatial data model with its precise geometric representation, we

have seen the significance of topology as well as the variety of structures in vector

models.