SPATIAL DATA MODELS AND SPATIAL DATA STRUCTURES
TABLE OF CONTENTS
1 Spatial data models: an introduction
2 Geometric entities
  2.1 Problems with the entity definition process
3 Spatial data models and structures
4 The raster approach
  4.1 Raster data structures
    4.1.1 The simple raster
    4.1.2 The complex raster
  4.2 Data compaction methods
    4.2.1 Run length encoding
    4.2.2 The quadtree
5 The vector approach
  5.1 Data structures without topology
    5.1.1 The neighbourhood problem
    5.1.2 The island or hole problem
  5.2 Data structures with topology
    5.2.1 Point
    5.2.2 Lines and networks
    5.2.3 Areas
6 Vector and raster spatial data models: advantages and disadvantages
  6.1 Data volume
  6.2 Topology queries
  6.3 Generality
  6.4 Analytical capabilities
  6.5 Accuracy and precision
7 What have you learned?
This section focuses on the methods available for the actual implementation of
geographic models within GIS. It will review the geometric primitives (spatial entities:
points, lines, areas, networks and surfaces) and the different approaches (data models
and data structures) used to implement these representations of geographic space in a
computerised GIS environment.
1 Spatial data models: an introduction
“…Human perception of space is frequently not the most efficient way to structure a
computer database and does not account for the physical requirements of storing
and repeatedly using digital information. Computers for handling geographical data
therefore need to be programmed to represent the phenomenological structures in
an appropriate manner…”
(Burrough and McDonnell, 1998 p.36)
This pertinent statement provides a timely reminder that computers require unambiguous
instructions on how to perform specific tasks. In the same way, a computer needs to be
told how to build spatial models for implementation within a GIS. Using Peuquet’s (1984)
abstraction schema as a framework this section examines in detail the different levels of
abstraction (levels 1 – 3) and the associated processes used in the design and
implementation of a GIS data model.
The successful design of spatial data models within a GIS must consider the following:
1. Which landscape elements (phenomenological structures or spatial entities) are
necessary to appropriately represent the system under investigation (level 1 abstraction)?
2. What approach (spatial data model) should be used to handle and display these
spatial entities (level 2 abstraction)?
3. What particular set of instructions and information (data structure) will the computer
require to reconstruct the spatial data model in digital form (level 3 abstraction)?
2 Geometric entities
In performing the first stage of data abstraction we need to identify the features / objects
to be represented within our GIS. The design and construction of our GIS is dependent
upon the successful identification of a series of geometric primitives. These are the basic
units of our spatial data models and form the building blocks of GIS. All geographical
phenomena can, in two dimensions at least, be represented by one of three entity types
(Figure 1).
In completing the e-tutorial to construct your own map, you produced a series of spatial
entities of your own to represent the different landscape elements. These entities (or
primitives), were made up of a series of shapes (geometries) including points to represent
the location of trees, a series of lines to illustrate the road centre-lines and areas to
represent recreational space and residential and/or commercial buildings. Let's quickly
remind ourselves of some of the geometric properties of point, line and area entity types.
• What is a point? A point is a spatial entity that has no length or area.
• What is a line? A line is a feature that has length but no area.
• What is an area? An area is a spatial entity that has perimeter and area.
Figure 1. Point, line and area spatial entities.
The three entity types above, however, are not always the most appropriate
representational forms for landscape elements to be included in a GIS. The concepts of
area and line can be extended to produce two other spatial entities, namely surfaces and
networks (Figure 2).
Figure 2. Surface and network entities.
Surfaces can be used, for example, to represent population density, elevation values and
temperature. As well as its fundamental two-dimensional attributes the surface entity is
also capable of three-dimensional representation in GIS and this is revisited later in the
section on dimensionality. Examples of networks include traffic (road) and hydrological
systems.
In many cases, the range of surface and network entity types that you identified will have
been limited only by your knowledge of the area. Some suggestions include:
Surface        Network
Rainfall       Electricity
Elevation      Rivers
Pollution      Sewage pipes
               Telephone cables
               Footpaths
2.1 Problems with the entity definition process
As one might expect, there are a number of problems associated with simplifying the
complexities of the real world into five basic, two-dimensional building blocks. These
include the dynamic nature of the real world, the scale at which a particular problem
needs addressing and the identification of discrete features.
Real world dynamics
The real world is not static: forests grow, rivers flood, cities expand. This poses two
particular problems for the entity definition phase of a GIS project. The first problem is
how to select the entity type that provides the most appropriate representation for the
feature being modelled. For example, is it best to represent a forest as a collection of
points (representing the location of individual trees), or as an area (the boundary of which
defines the territory covered by a forest)? The second problem is one of temporal
change. For example, a forest that was originally represented as an area may decline to
the extent that in reality it is only a dispersed group of trees that are, perhaps, better
represented using point features.
Scale
Scale is also an important concept in the entity definition process. For example, if a GIS
database is to be constructed at a scale of 1:1M (for example, the Digital Chart of the
World) it may be appropriate to represent the city of Manchester as a point feature.
However, at larger scales you would need to employ different entity types to provide a
more appropriate representation of Manchester. For example, at 1:250K, an area feature
would be most suitable; at larger scales still, it is likely that a compound or collection of
entities would be more practical. Ideally, a truly scale independent GIS would be most
effective, able to operate at any scale, modifying and selecting the appropriate entity
representation as the user zooms in and out of the database.
Definition
The selection of appropriate entity types is further plagued by the fact that many real world
features simply do not fit comfortably into the character of the models available. Feature
boundaries, for instance, are a particular problem in the definition of spatial entity type. In
reality, the boundaries of natural phenomena are rarely discrete, but instead are more
readily characterised by a continuum or transition zone. For example, where should you
place the boundary of an area feature used to represent a stand of forest? Do forests
have edges, or do transitional zones from full to zero forest cover better define their
boundary properties? Very often in GIS we make use of paper maps (secondary data
sources) for data input and these are readily defined by the clear and distinct marking of
any boundaries. While the discretisation of feature boundaries may be very useful and
allow us to more easily generate quantifiable measurements, we must recognise the
problems associated with the choice of entity type and its boundary characteristics,
particularly with regard to natural (and therefore potentially fuzzy) phenomena.
Level 1 of the abstraction process as defined by Peuquet (1984) is characterised by the
five geometric primitives (or spatial entities) used as the basic building blocks of GIS.
Unfortunately, as our discussion has shown, the process of defining what entity type
should reflect which real world feature is far from simple. The decision is vitally important
for the successful design of a GIS, as it controls GIS functionality and the potential for
further spatial operations (an issue addressed later in this unit).
The data layer concept
So far we have just considered the five primitive spatial entity types. However, the
complexity of the real world is such that for most GIS applications it is necessary to
construct more complex models of reality, typically consisting of compound features
(several entity types). The most common method used in GIS to handle this problem is to
adopt a layered approach. Individual data layers are constructed using the various entity
types to represent different spatial elements. Each data layer is stored independently in
the GIS using either raster or vector approaches as mentioned next. These data layers
can then be used either independently (single layer operations) or together (as multiple
layers) depending upon the application. The use of multiple layers, as discussed later,
can cause additional problems dependent upon the choice of spatial data model.
3 Spatial data models and structures
Levels 2 and 3, the next stages in the abstraction process, concern the design,
representation and implementation of the defined spatial entities in the GIS. If you revisit
the quotation from Burrough and McDonnell (1998) you will recognise that as well as
defining our entities we must also instruct the computer on how to turn specific entity data
into (digital) graphical representations. In GIS we mostly employ (one of) two methods to
handle and display our chosen spatial entities. These are commonly referred to as the
raster and vector approaches. The history of GIS software is such that some software
was originally designed for a raster spatial data model (e.g. Idrisi), while other software is
based on the vector spatial data model (e.g. GeoMedia Professional, ArcGIS). Raster data
sets are characterised by their grid cell structure, whereas the vector approach comprises
co-ordinate geometry in an attempt to represent the features or objects of interest as
exactly as possible. As well as employing the raster and vector spatial data models¹, we
are required to provide the computer with further information to reconstruct these models
in digital format. There are a great number of spatial data structures, specific to either the
raster or vector spatial data models, used by commercial GIS. The great diversity of
spatial data structures is one of the reasons why exchanging spatial data between GIS is
problematic. Different GIS may contain information of value to the other, but will be
unable to share that information if the data structures used to store the information are
incompatible. In the case of the UNIGIS supported software, each GIS has its own data
format and structure. This means that we cannot simply transfer a raster file straight into
a vector system or vice versa.

¹ The term data model is often used to describe these two approaches. This can become
confusing since the term data modelling is used to describe the entire process of
representing reality in a computer. For our own purposes we will use the term spatial data
model in association with the terms raster and vector, and the terms data model and data
modelling to refer to the overall modelling process.
The following sections explore the raster and vector spatial data models and examine,
using a range of examples, the diversity of their associated spatial data structures.
4 The raster approach
Raster systems are a result of developments in computer graphics technology over the
last forty years. They are widely used in computing and digital television graphics and
work by repeatedly sweeping an electron beam across the computer or television screen,
from side to side and top to bottom. The image attributes (brightness and colour) at each
point on the screen are determined by computer-generated data (Coll, 1991). In fact,
each point on the screen is composed of a small cell structure or pixel
(picture element).
In the raster world, individual (typically square) cells are used to represent the different
geometric entities (points, lines, areas, networks and surfaces) used to build the GIS
image. In a raster file geographic space is divided into regular sized grid tessellations.
Single cells represent point features, whereas lines and areas are identified by groups or
clumps of pixels (Figure 3).
Figure 3. Raster representations of points, lines and areas.
Importantly, raster space can be composed of different tessellation patterns, including the
triangle and hexagon (Figure 4). Peuquet (1990) notes that triangular tessellations are
useful for terrain representation – triangles do not all have the same orientation, making
them more effective for picking out bumps and undulations on the land surface.
Figure 4. Regular tessellations for a raster field-based representation (panels: hexagonal, triangular and square tessellations).
As well as selecting an appropriate tessellation you will also have to decide on the
resolution of the cells. Too small and the data volumes will be prohibitive, too large and
your data will look coarse and lack precision.
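The trade-off between cell size and data volume can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical square study area 10 km on a side (the extent, the cell sizes and the function name `cell_count` are illustrative assumptions, not values from this section). It shows that halving the cell size quadruples the number of cells to be stored.

```python
import math


def cell_count(extent_m: float, cell_size_m: float) -> int:
    """Number of square cells needed to cover a square extent."""
    cells_per_side = math.ceil(extent_m / cell_size_m)
    return cells_per_side ** 2


# Hypothetical 10 km x 10 km study area at four candidate resolutions.
for size in (100.0, 50.0, 25.0, 10.0):
    print(f"{size:>5.0f} m cells: {cell_count(10_000.0, size):>9,d} cells")
```

Because the count grows with the square of the resolution, a modest gain in precision can carry a large storage cost.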
4.1 Raster data structures
There is a wide range of different data structures to represent an entity in a computer
using the raster data model. In the next part of this section we will examine some of these
data structures in detail, including a discussion of the data compaction methods used to
minimise data storage requirements for large or complex raster data sets.
4.1.1 The simple raster
At the most elementary level there is the basic or simple raster data structure where
information is stored for each cell in the image. This information informs the computer of
the presence (or absence) of a feature within a given cell. Figure 5 illustrates what a
raster representation of a simple map may look like at a range of different cell (pixel)
sizes. This usefully illustrates how data quality changes with cell size in a raster image.
Figure 5. Raster views of a simple map.
a: The simple map
b: Fine resolution grid cells
c: Intermediate size grid cells
d: Coarse resolution grid cells
Cell occupancy and mixed pixels
As well as affecting the quality of definition, the pixel size can also lead
to problems when dealing with phenomena (entities) that only partially occupy a raster
grid cell. Typically, this is solved by the application of one of two cell occupancy rules: the
present or absent rule, and the 50% rule. The present or absent rule states that even if
an entity is only minimally occupying a raster cell then it is considered to be present – and
the cell will record an entity feature. As it suggests, the 50% rule states that if more than
50% of a pixel is occupied by an entity feature then the entity will be acknowledged and
recorded as present. Figure 6 shows how these two different procedures can affect the
shape and character of an entity in a raster spatial data model.
Figure 6. The present or absent rule and the 50% occupancy rule (panels: a circular phenomenon; raster representation using the present or absent rule; raster representation using the majority (50%) rule).
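The effect of the two occupancy rules can be sketched in a few lines of code. The example below is an illustration only: it assumes a hypothetical circular entity on an 8 x 8 grid of unit cells, and it estimates how much of each cell is covered by testing a lattice of sample points, which is an approximation (a very thin sliver of overlap could be missed). All function names are assumptions made for the sketch.

```python
def coverage(cx, cy, radius, col, row, samples=10):
    """Estimate the fraction of unit cell (col, row) lying inside a circle."""
    inside = 0
    for i in range(samples):
        for j in range(samples):
            x = col + (i + 0.5) / samples
            y = row + (j + 0.5) / samples
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                inside += 1
    return inside / samples ** 2


def present_or_absent(fraction):
    """Cell is recorded if any part of it is occupied."""
    return fraction > 0.0


def majority(fraction):
    """Cell is recorded only if more than 50% of it is occupied."""
    return fraction > 0.5


def rasterise(cx, cy, radius, size, rule):
    """Return a size x size grid of 0s and 1s under the given occupancy rule."""
    return [
        [1 if rule(coverage(cx, cy, radius, col, row)) else 0 for col in range(size)]
        for row in range(size)
    ]


pa_grid = rasterise(4.0, 4.0, 2.6, 8, present_or_absent)
mj_grid = rasterise(4.0, 4.0, 2.6, 8, majority)

for label, grid in (("present or absent", pa_grid), ("50% rule", mj_grid)):
    print(label)
    for row in grid:
        print(" ".join(str(cell) for cell in row))
```

The present or absent rule never records fewer cells than the 50% rule, so the same entity appears larger and blockier under the first rule.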
Another problem with the simple raster model is its inability to distinguish the
nature of an entity feature (i.e. point, line or area). This reflects the binary codes used in
raster technology to store image information. Using binary coding, entities present are
recorded with a value of 1, and unoccupied cells as 0. The computer therefore sees the
raster image presented in Figure 5 as a series of 0s and 1s, and not as a housing estate,
river or trees since it does not contain any information to distinguish between the three.
Using a simple raster approach, the computer requires a separate layer of information for
each class. Figure 5 would then require 3 separate raster files; one for the tree map, one
for the river map, and one for the housing estate map.
Figure 7. House figure for simple raster data structure exercise.
4.1.2 The complex raster
One of the more obvious problems with the simple raster data structure is the volume of
information that has to be recorded to represent even the simplest map. Complex raster
data structures reduce the volume of information by assigning coded labels to grid cells
that not only tell the computer that a feature is present but also identify its character.
Using our earlier example from Figure 5, the cells representing trees might be assigned a
value of 1. The table below indicates how other entity types could be represented. Note a
column indicating colour has been included to illustrate how a complex raster image may
appear on screen.
Phenomena        Entity Type   Code   Colour
Tree             Point         1      Green
River            Line          2      Blue
Housing estate   Area          3      Red
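The difference between the simple and complex raster structures can be sketched as follows. The three small binary layers are hypothetical (they are not the map in Figure 5), the codes follow the table above, and the sketch assumes that no cell is occupied by more than one entity.

```python
# Simple raster: one binary (0/1) layer per class. Hypothetical 3 x 4 grids.
trees = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
river = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
housing = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Complex raster: a single layer whose cell values identify the class.
CODES = {"tree": 1, "river": 2, "housing": 3}


def combine(layers):
    """Merge binary layers into one coded layer (assumes no overlaps)."""
    first = next(iter(layers.values()))
    coded = [[0] * len(first[0]) for _ in first]
    for name, layer in layers.items():
        for r, row in enumerate(layer):
            for c, present in enumerate(row):
                if present:
                    coded[r][c] = CODES[name]
    return coded


coded_layer = combine({"tree": trees, "river": river, "housing": housing})
for row in coded_layer:
    print(row)
```

One coded layer replaces three binary layers here, which is the reduction in stored information described above.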
Figure 8. House figure for complex raster data structure exercise
4.2 Data compaction methods
File size is one of the major problems with raster data sets – space occupancy raster
structures require a value to be recorded and stored for each grid cell. This means a
complex soil map of, say 100 x 100 pixels, that may contain 20 or more distinct soil
classes requires the same storage space as a simple road map of the same area –
despite the fact that much of the raster road map contains many cells recording a value of
0.
Raster data storage requirements have received considerable attention, and a range of
data compression (compaction) methods have been developed. These compaction
methods can reduce the size of a raster data set quite considerably. In the following sub-
sections we will examine some of the more commonly used methods in detail.
4.2.1 Run length encoding
One of the most common and simplest techniques for reducing the data volume
associated with a raster image is a technique known as run length encoding. This
technique reduces the information stored for each line in a raster matrix by storing a single
value for the consecutive number of cells of a given type, rather than storing a value for
each cell. Consider the following simple raster in Figure 9 showing the presence or
absence of clay.
Figure 9. Simple raster file structure.
Row 1   1 1 1 0 0 0 0 1 1 1 0 0 0 0
Row 2   1 1 1 1 1 0 0 0 1 1 1 0 0 0
Row 3   1 1 1 0 0 0 0 0 1 1 1 0 0 1
Row 4   1 1 0 0 0 0 0 0 0 0 1 1 1 1
Row 5   1 1 0 0 0 0 0 0 0 0 1 1 1 1
A run length encoded version of the file would be represented as follows:
Row 1 31, 40, 31, 40
Row 2 51, 30, 31, 30
Row 3 31, 50, 31, 20, 11
Row 4 21, 80, 41
Row 5 21, 80, 41
If we take a closer look at the first row of the run length encoded file we can see how the
method works:
31, 40, 31, 40
Each pair of digits gives a run length followed by a cell code. The first number (3)
represents the number of consecutive cells with the same coding, and the second number
(1) is that coding: 1 = soil type A (clay soil). The third number (4) indicates the number
of unoccupied cells moving from left to right, and the fourth number (0) represents the
absence of clay soil. The fifth number (3) denotes that the next 3 consecutive cells are
occupied by clay soil (code = 1 again). Finally, the numbers 4 and 0 indicate the absence
of clay soil in the 4 grid cells that complete the row. Note that the commas have been
added to make the file easier to read; they would be absent in a real run length encoded
file e.g. 31403140
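The encoding of a single row can be sketched in a few lines. This is a minimal illustration that uses the standard library function `itertools.groupby`. The names `rle_encode` and `to_text` are assumptions made for the sketch, and the compact text form is only unambiguous while every run length and code is a single digit.

```python
from itertools import groupby


def rle_encode(row):
    """Return a list of (run length, cell code) pairs for one raster row."""
    return [(len(list(run)), value) for value, run in groupby(row)]


def to_text(pairs):
    """Write the pairs without separators, as in the compact file form."""
    return "".join(f"{count}{value}" for count, value in pairs)


row1 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
pairs = rle_encode(row1)
print(pairs)           # [(3, 1), (4, 0), (3, 1), (4, 0)]
print(to_text(pairs))  # 31403140
```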
If we assume one numeric value uses one byte of storage (1 byte = 8 bits) on the
computer then row one of our run length encoded (RLE) file takes up 8 bytes compared
with the 14 bytes required to store the same information using the simple raster data
structure. The equivalent file sizes and savings are given below:
BYTE storage requirements
          Simple raster   Run length raster
Row 1     14              8
Row 2     14              8
Row 3     14              10
Row 4     14              6
Row 5     14              6
Total     70              38
Saving    32 bytes (a 46% saving in storage space)
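The storage arithmetic can be checked directly from the raster in Figure 9. The sketch below assumes, as the text does, one byte per stored number, so each run costs two bytes (a run length and a code). Working the totals through gives 70 bytes for the simple raster and 38 bytes for the run length encoded version, a difference of 32 bytes or roughly 46%.

```python
from itertools import groupby

# The clay presence/absence raster from Figure 9.
raster = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]


def run_count(row):
    """Number of runs of identical consecutive values in a row."""
    return sum(1 for _ in groupby(row))


simple_bytes = sum(len(row) for row in raster)         # one byte per cell
rle_bytes = sum(2 * run_count(row) for row in raster)  # two bytes per run
saving = simple_bytes - rle_bytes
percent = 100 * saving / simple_bytes

print(simple_bytes, rle_bytes, saving, round(percent))
```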
The figure below shows how the data volume associated with the storage of a complex
raster using the RLE method could be reduced in a similar way. Note that the presence or
absence values of 0 and 1 have been extended to include the codes used to identify 3
different soil types present in the grid.
Figure 10. Complex raster file structure.
Row 1   1 1 1 0 0 0 0 2 2 2 0 0 0 0
Row 2   1 1 1 1 1 0 0 0 2 2 2 0 0 0
Row 3   1 1 1 0 0 0 0 0 2 2 2 0 0 2
Row 4   1 1 0 0 0 0 0 0 0 0 2 2 2 2
Row 5   1 1 0 0 0 0 0 0 0 0 2 2 2 2
An RLE version of the complex raster file would be represented as follows:
Row 1 31, 40, 32, 40
Row 2 51, 30, 32, 30
Row 3 31, 50, 32, 20, 12
Row 4 21, 80, 42
Row 5 21, 80, 42
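Reading a run length encoded file back into a full row is the reverse operation. The sketch below is an illustration only: it assumes the file has already been parsed into (run length, code) pairs, and it expands row 3 of the complex raster. The name `rle_decode` is hypothetical.

```python
def rle_decode(pairs):
    """Expand (run length, cell code) pairs back into a full raster row."""
    row = []
    for count, value in pairs:
        row.extend([value] * count)
    return row


# Row 3 of the complex raster: 31, 50, 32, 20, 12
row3 = rle_decode([(3, 1), (5, 0), (3, 2), (2, 0), (1, 2)])
print(row3)
```

Because nothing is discarded during encoding, the decoded row is identical to the original 14 cells.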
4.2.2 The quadtree
As Peuquet (1990) points out, an advantage of the raster spatial data model that
employs a square tessellation is that each cell can be subdivided into smaller cells of the
same shape and orientation. This unique feature of the grid or raster data model has
produced a range of innovative data storage and data reduction methods that are based
on regularly subdividing geographical space. The most widely implemented is the
quadtree (Samet 1989) based on the recursive decomposition of a grid. There is a range
of quadtree types (see Peuquet (1990) and Samet (1989) for further reading) and the
most common approach is the area quadtree.
The area or region quadtree
The area quadtree works on the principle of recursively subdividing the number of cells by
quads (or quarters) until there is a homogenous block and no more subdivision can take
place (Bonham-Carter 1994). At the end of the subdivision process each cell in the grid
matrix may be classed as having an entity present or absent. The number of subdivisions
in this process depends upon the complexity of the map layer and upon what is
acceptable as the finest division to represent an object. The smallest quad cell size is
determined by pixel resolution. Figure 11 illustrates the quadtree principle.
Figure 11. The quadtree process (panels: original feature; 1st, 2nd, 3rd and 4th subdivisions).
Figure 11 shows the process of hierarchical subdivision. The image is first divided into
four quadrants – of which none can be wholly classified as not containing the entity.
Therefore a further stage of subdivision is required in each quadrant, where it is possible
to identify ten quadrants that do not contain the entity and six quadrants that do. Of these,
one quadrant wholly contains the entity and five quadrants only partially contain the entity
(3rd subdivision). Further subdivision is therefore only necessary in these five partially
occupied quadrants.
This process of subdivision continues until every cell either contains or does not contain
the entity (4th subdivision). Looking closely at Figure 11 we can identify four hierarchical
levels. The hierarchical nature of the quadtree becomes more apparent when we see it
represented diagrammatically as a binary image tree – also known as a bintree (Figure
12).
Figure 12. Binary image tree.
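The recursive subdivision described above can be sketched as a short function. This is a minimal illustration rather than a production data structure: it assumes a square binary grid whose side is a power of two, it uses a hypothetical 4 x 4 grid (not the feature in Figure 11), and it orders the four quadrants as NW, NE, SW, SE, which is an assumption of the sketch.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return 0 or 1 for a homogeneous block, otherwise a list of the
    four sub-quadrants in the order NW, NE, SW, SE."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    block = (grid[r][c] for r in range(row, row + size)
             for c in range(col, col + size))
    if all(value == first for value in block):
        return first  # homogeneous: no further subdivision needed
    half = size // 2
    return [
        build_quadtree(grid, row, col, half),
        build_quadtree(grid, row, col + half, half),
        build_quadtree(grid, row + half, col, half),
        build_quadtree(grid, row + half, col + half, half),
    ]


def count_leaves(node):
    """Number of stored blocks (leaf nodes) in the tree."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(grid)
print(tree)                # [0, 1, [0, 1, 0, 0], 1]
print(count_leaves(tree))  # 7 blocks instead of 16 cells
```

Only the south-west quadrant is mixed, so it is the only one subdivided down to individual cells.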
In the bintree we can clearly distinguish the four hierarchical levels and by binary coding of each
node in the tree with a 1 or 0 we can see whether a quadrant contains part of the entity or
not. To examine how a quadtree is coded and information retrieved for display by the
computer look at Figure 13.
Figure 13. Coding a simple quadtree.
Raster scan order
Another method of referencing a quadtree is to use the Morton Matrix indexing scheme.
This method, named after its developer (Morton 1966), is based on the Peano scan
method, which generates a track through space that exploits the property that areas close
together in the real world will be close together in a sequential digital file. These scan
orders are best illustrated with reference to a diagram. Figure 14 shows how a Peano
scan would be generated for a raster matrix with 8 columns and 8 rows (dimensions of
raster in Figure 13). The coding system must be adapted to follow the Peano curve and
each cell has a unique identifier, known as a Morton number (Figure 15).
Figure 14. Peano curve.
Figure 15. Morton ordering scheme.
000 001 010 011 100 101 110 111
002 003 012 013 102 103 112 113
020 021 030 031 120 121 130 131
022 023 032 033 122 123 132 133
200 201 210 211 300 301 310 311
202 203 212 213 302 303 312 313
220 221 230 231 320 321 330 331
222 223 232 233 322 323 332 333
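The numbers in Figure 15 follow a regular pattern that can be reproduced in code. Each digit is a base-4 value formed from one bit of the row number and one bit of the column number, taking the most significant bits first. The sketch below assumes rows and columns are counted from 0 at the top left of the matrix. The function name `morton_code` is hypothetical.

```python
def morton_code(row, col, levels=3):
    """Base-4 Morton number of a cell in a 2**levels x 2**levels grid."""
    digits = []
    for level in reversed(range(levels)):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        digits.append(str(2 * row_bit + col_bit))
    return "".join(digits)


# Reproduce the 8 x 8 matrix row by row.
for row in range(8):
    print(" ".join(morton_code(row, col) for col in range(8)))
```

Cells that share leading digits lie in the same quadrant, which is why neighbouring areas tend to sit close together in the sequential file.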
Figure 16. House figure for quadtree data structure exercise.
5 The vector approach
The five entity primitives identified at the start of this section can most easily be
represented using the vector spatial data model. The vector approach attempts to
represent the features of interest as exactly as possible employing Cartesian co-ordinate
geometry. Points are represented by coordinate pairs, lines by arcs linking a series or
string of points, areas by lines enclosing homogenous areas (a string of coordinate pairs
with the same origin and end point), networks by connected lines and surfaces by areas
linking points and lines.
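These definitions translate naturally into simple coordinate lists. The sketch below is illustrative only: the coordinates are hypothetical, and the helper names (`line_length`, `is_closed`, `ring_area`) are assumptions rather than part of any particular GIS. It shows a point as one coordinate pair, a line as a string of pairs, and an area as a string that returns to its origin.

```python
import math


def line_length(coords):
    """Total length of a line stored as a string of coordinate pairs."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


def is_closed(coords):
    """An area boundary must start and end at the same coordinate pair."""
    return len(coords) >= 4 and coords[0] == coords[-1]


def ring_area(coords):
    """Area enclosed by a closed boundary (shoelace formula)."""
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(coords, coords[1:]))
    return abs(twice_area) / 2


point = (10.0, 10.0)
line = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
area = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]

print(line_length(line))  # 11.0
print(is_closed(area))    # True
print(ring_area(area))    # 12.0
```

Length and area come straight from the coordinate geometry, which is one reason the vector model suits measurement tasks.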
Those of you who use and work with topographic paper maps on a regular basis will be
very familiar with the vector approach. Cartographic maps are composed entirely of a
series of representative points, lines, areas and surfaces. The topographic map or vector
model provides a scaled model of reality. For example, a river feature will be represented
using a line of appropriate thickness – rather than a series of contiguous inappropriately
shaped cells (as with the raster model). The high quality of geometric representation
makes interpretation of objects a relatively easy task (take another look at the map
provided in Section 1 to remind yourself).
This precision of representation is very useful. There is, however, further important
information that is needed to develop a vector data structure to store information about
the entities in a vector spatial data model. This information is about the geographical
relationships between points and lines that are used to represent an entity. These spatial
relationships are expressed as topology. As with the raster spatial data model there are
many potential data structures that can be used to represent an entity in the vector world.
However, they can be categorised into two groups:
1. Data structures without topology
2. Data structures with topology
5.1 Data structures without topology
The simplest form of vector data structure that can be used to reproduce a geographical
image in the computer is a set of x and y coordinates. Figure 17 shows how the simple
map (introduced earlier in the section on raster data), might be represented in a vector
view of the world.
Figure 17. A simple vector map.
A simple data structure without topology for this model could be constructed as follows:
Area (housing estate)
H1 (40, 50), H2 (40, 45), H3 (50, 45), H4 (50, 35), H5 (70, 35), H6 (70, 30),
H7 (90, 30), H8 (90, 50), H1 (40, 50).
Note: the first and last co-ordinate pair are the same. This ensures that the area or
polygon is closed.
Line (river)
R1 (0, 25), R2 (10, 23), R3 (20, 20), R4 (40, 20), R5 (50, 20), R6 (70, 20),
R7 (80, 20), R8 (90, 15).
Points (trees)
T1 (10, 10), T2 (20, 5), T3 (25, 15), T4 (35, 10), T5 (55, 10), T6 (60, 15),
T7 (65, 5), T8 (75, 10), T9 (75, 5).
Where x, y represents the co-ordinates used to identify the location of the points, which
must be connected to make the entity. The descriptor in brackets, (tree) etc., is added to
the file so that the computer knows what the data represents. A new feature is recorded
by a code such as carriage return <cr>. The limitations of simple vector data structures
start to emerge when we look at more complex spatial entities. Consider for example the
group of areas and lines represented in Figure 18.
Figure 18.
Figure 18a, at its simplest level could be represented by the following data structure:
Area 1 xa,ya xb,yb xc,yc……xk,yk xl,yl xm,ym……
Area 2 xu,yu xv,yv xw,yw……xk,yk xl,yl xm,ym……
If you were to reconstruct the image in Figure 18a from the data structure given above,
you would find that the co-ordinates that define the boundary line, which is shared by the
two polygons, would be stored twice. While this may not appear too much of a problem
for our small example, consider the implications for a map of the soil series of the United
Kingdom or the state boundaries in the US. The amount of duplicated data stored would
be a large proportion of the total data. In Figure 18b we have a slightly different problem.
The network could quite easily be stored using the following simple file structure.
Line 1 x1,y1 x2,y2 x3,y3 etc
Line 2 x4,y4 x5,y5 x3,y3 etc
Line 3 x3,y3 x6,y6 x7,y7 etc
Line 4 x8,y8 x9,y9 x7,y7 etc
Line 5 x7,y7 x10,y10 x11,y11 etc
The computer would be able to reproduce the image but a problem would arise as soon
as we tried to use this information to ask questions about the network. This is because
the computer has not been provided with any information to tell it that line 1 is connected
to line 2 which is connected to line 3 and so on. These spatial linkages are only inferred
in the viewer’s mind when the lines are displayed on the screen and are not contained
explicitly within our data file. This situation has led to simple vector data structures of
this type, without topology, being referred to as 'spaghetti', because what is actually on the
screen is merely a jumble of linear features as far as the computer is concerned.
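The point can be illustrated with a minimal Python sketch. The coordinate values are invented stand-ins for x1,y1 to x11,y11 (the figure gives no actual values), and the function name is hypothetical. Because the file holds nothing but coordinate lists, the only way to discover that two lines meet is to compare their end points pair by pair.

```python
# Hypothetical coordinates standing in for x1,y1 ... x11,y11 in the file above.
P = {1: (0, 10), 2: (5, 8), 3: (10, 5), 4: (0, 0), 5: (5, 2),
     6: (15, 5), 7: (20, 5), 8: (20, 12), 9: (20, 8),
     10: (25, 3), 11: (30, 0)}

# The 'spaghetti' file: each line is only an ordered list of coordinate pairs.
spaghetti = {
    "Line 1": [P[1], P[2], P[3]],
    "Line 2": [P[4], P[5], P[3]],
    "Line 3": [P[3], P[6], P[7]],
    "Line 4": [P[8], P[9], P[7]],
    "Line 5": [P[7], P[10], P[11]],
}


def connected_lines(lines):
    """Infer which lines touch by comparing end-point coordinates pairwise."""
    links = set()
    names = sorted(lines)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ends_a = {lines[a][0], lines[a][-1]}
            ends_b = {lines[b][0], lines[b][-1]}
            if ends_a & ends_b:  # a shared end point implies a junction
                links.add((a, b))
    return links


links = connected_lines(spaghetti)
```

Every connectivity question repeats this coordinate search over all pairs of lines, and it only works if the shared end points were digitised with exactly the same values.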
This ‘spaghetti’ approach is used by many design and drawing packages. A true GIS
must use a topological data structure to represent spatial entities if it is to be of any
practical use other than for displaying features. Note that the absence of topology is one
factor which distinguishes a CAD/CAM system from one designed to store, manipulate
and analyse spatial data (GIS).
There are two specific problems of 'spaghetti' data structures that illustrate why topological
information is important. First, ‘spaghetti’ data contains no neighbourhood information,
and second the data structure is unable to cope with what are termed hole or island
polygons.
5.1.1 The neighbourhood problem
The neighbourhood problem has already been alluded to when we discussed the problem
of storing a simple network as a series of lines (Figure 18b). The problem is that while the
lines give the appearance of a network when displayed on the screen the actual file that is
used to create the image contains no information about which line is connected to the
next. In the same way, a series of polygons created using the simple data structure may
appear to be connected, but in fact they are discrete entities which are unaware of the
presence of neighbouring polygons. Even giving each polygon a label or unique identifier
would not solve the problem. What is required is a set of instructions which informs the
computer where one polygon is with respect to its neighbours. Polygon data structures
that contain such information would be termed topologically correct. How a data structure
can be designed to include full topology will be discussed later.
5.1.2 The island or hole problem
Figure 19 illustrates the island or hole problem. From the figure you can see that one
polygon classified as containing the soil type clay is wholly contained within a polygon
classified as soil type loam. The problem is frequently referred to as one of nested or
hierarchical polygons. While a simple file structure would be able to recreate the image in
Figure 19 by a series of x, y co-ordinates it would not be able to inform the computer that
the island polygon was in fact part of the larger ‘loam’ polygon. Dealing with islands or
holes also requires a fully topological data structure.
Figure 19. The island or hole problem.
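A small Python sketch shows one consequence of the problem. The square coordinates and the dictionary layout are assumptions made for illustration only. Treated as two unrelated rings, the loam polygon's area wrongly includes the clay island; once the nesting is recorded explicitly, the island can be subtracted.

```python
def ring_area(ring):
    """Unsigned area of a closed ring of (x, y) vertices (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical coordinates: a 10 x 10 loam area with a 2 x 2 clay island.
loam_outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
clay_island = [(4, 4), (6, 4), (6, 6), (4, 6)]

# 'Spaghetti' view: two rings with no recorded relationship.
spaghetti_area = {"loam": ring_area(loam_outer), "clay": ring_area(clay_island)}

# Nested view: the loam polygon explicitly lists the ring that forms its hole.
polygons = {
    "loam": {"outer": loam_outer, "holes": [clay_island]},
    "clay": {"outer": clay_island, "holes": []},
}


def polygon_area(poly):
    """Area of the outer ring minus the area of any recorded holes."""
    return ring_area(poly["outer"]) - sum(ring_area(h) for h in poly["holes"])


nested_area = {name: polygon_area(p) for name, p in polygons.items()}
```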
5.2 Data structures with topology
5.2.1 Point
A point is the simplest spatial entity that can be represented in the vector world with full
topology, because all that is required for a point to be topologically correct is a pointer or
geographical reference which locates its position with respect to other spatial entities in
the real world. This is performed by tagging the point with a geographical reference.
5.2.2 Lines and networks
Simple lines carry no inherent spatial information about their connectivity. Lines only need
to have topological information attached to them when they become part of a network,
area or surface feature. Topological information is added to line features through the use
of 'pointers' which flag where links occur in the data structure. The most frequently used
pointer in the vector data model is the node. Figure 20 shows the type of information
required to identify connectivity in a line network.
The first stage in turning a series of lines into an intelligent network is to identify the start,
end and junction points. These pointers or nodes are then used to record information
about the connectivity of the network as well as hold information that regulates the nature
and direction of the information flow. Figure 20a illustrates six nodes, four of which
represent the start and end points of the network (B, D, E, F) and two (A, C) which
represent junctions.
The second stage is to identify the lines or arcs that connect the nodes. This information
is presented in Figure 20b. In many cases, direction is also an important network feature
and Figure 20c shows how the direction at which an arc joins a node can be recorded.
Figure 20. Network connectivity.
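The two stages can be sketched in Python as an arc table plus a node table derived from it. The node labels follow the text, but the arc layout is an assumption, since the figure itself is not reproduced here, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Arc table: arc id -> (from node, to node). The layout is assumed.
arcs = {1: ("B", "A"), 2: ("D", "A"), 3: ("A", "C"), 4: ("C", "E"), 5: ("C", "F")}


def build_node_table(arcs):
    """List, for every node, the arcs that start or end there."""
    table = defaultdict(list)
    for arc_id, (start, end) in arcs.items():
        table[start].append(arc_id)
        table[end].append(arc_id)
    return dict(table)


def reachable(arcs, origin, directed=False):
    """Nodes that can be reached from origin (breadth-first search)."""
    neighbours = defaultdict(set)
    for start, end in arcs.values():
        neighbours[start].add(end)
        if not directed:
            neighbours[end].add(start)
    seen, queue = {origin}, deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


node_table = build_node_table(arcs)
junctions = sorted(n for n, a in node_table.items() if len(a) >= 3)
end_points = sorted(n for n, a in node_table.items() if len(a) == 1)
```

With the node table in place, connectivity questions become table look-ups rather than coordinate comparisons. Setting directed=True treats each arc as one-way from its first node to its second, which is one simple way of recording direction of flow.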
5.2.3 Areas
Topology for a set of area entities is built in a series of stages. These stages have been
described by Burrough and McDonnell (1998) and are only summarised here. The order
in which the stages are carried out by a vector GIS will be GIS-product specific and may
not follow the order described here. However, the principles remain the same, with the
process consisting of four stages:
Stage 1 Generating a boundary network
Stage 2 Linking arcs into polygons
Stage 3 Checking polygons for closure
Stage 4 Providing a unique identifier for each polygon
Stage 1: Generating a boundary network. Figure 21 shows a diagram of a set of simple
polygons. The first step in generating full topology for the entities in Figure 21 would be to
identify those arcs that intersect with one another. Those arcs that cross are
automatically intersected and built into two separate arcs and a node added at the
junction (Figure 22b).
Figure 21. A set of simple polygons.
Figure 22. Generating a boundary network.
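The first step, intersecting arcs that cross, can be sketched in Python as follows. This is a simplified illustration in which each arc is a single straight segment; real arcs have many vertices, and the function names are hypothetical rather than taken from any GIS product.

```python
def intersect(p1, p2, p3, p4):
    """Return the crossing point of segments p1-p2 and p3-p4, or None."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if denom == 0:
        return None  # parallel or collinear segments are not handled here
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if 0 < t < 1 and 0 < u < 1:  # the crossing lies strictly inside both arcs
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None


def split_crossing_arcs(arc_a, arc_b):
    """Split two crossing arcs at their junction and return the new node."""
    node = intersect(arc_a[0], arc_a[1], arc_b[0], arc_b[1])
    if node is None:
        return [arc_a, arc_b], None
    pieces = [(arc_a[0], node), (node, arc_a[1]),
              (arc_b[0], node), (node, arc_b[1])]
    return pieces, node


pieces, node = split_crossing_arcs(((0, 0), (10, 10)), ((0, 10), (10, 0)))
```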
The second step involves sorting arcs according to their x and y location so that arcs
topologically close to each other are also in close proximity in the data file. This process
helps speed up retrieval times when searching for adjacent chains.
Now it is possible to generate an outer envelope or boundary network that contains all
other polygons. The outer polygon is only used to build topology for the arc network. The
envelope polygon is built by identifying the arcs that make up the outer boundary of the
area (Figure 22c). A flag should be set to indicate that each of the arcs that make up the
outer envelope has been traversed once.
The following information should be stored for the envelope polygon:
A unique identifier or polygon ID
A code that identifies it as an envelope polygon
A direction pointer indicating the order and direction in which arcs should be linked
together to form the boundary
A list of arcs in the boundary
Its x, y extent
Stage 2: Linking arcs into polygons. Once the outer envelope has been created, topology
can be constructed for each of the other polygons in turn. The same starting point should
be used as employed in the construction of the outer envelope and if the outer envelope
was constructed in a clockwise direction, then the other polygons should be constructed in
a similar fashion. Once a pointer or node is reached the arcs which are to the right should
be followed. Arcs should be dropped from the search once they have been traversed
twice. This process should be repeated until all polygons have been constructed (Figure
22d).
Stage 3: Checking polygons for closure. In building topology for a set of areas it is
essential that the areas themselves can be identified as being closed or not. If a polygon
is left open it is not topologically correct. By consulting the arc table polygon closure can
be checked quite easily because all arcs must be linked to a node that points them to the
next arc. If any arcs in the table are found without the proper node pointers then either
they will be mistakes or unclosed polygons exist. If arcs of this type are found, then
depending upon the nature of the error, they may be flagged for correction or deletion.
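A much simplified version of this check can be sketched in Python. It assumes a hypothetical arc table that records only the start and end node of each arc, and it treats any node touched by fewer than two arc ends as evidence of a dangling arc or an unclosed polygon.

```python
from collections import Counter


def dangling_nodes(arc_table):
    """Return the nodes at which fewer than two arc ends meet."""
    ends = Counter()
    for start, end in arc_table.values():
        ends[start] += 1
        ends[end] += 1
    return sorted(node for node, count in ends.items() if count < 2)


# Hypothetical arc tables: arc id -> (from node, to node).
closed_ring = {"a1": ("N1", "N2"), "a2": ("N2", "N3"), "a3": ("N3", "N1")}
open_ring = {"a1": ("N1", "N2"), "a2": ("N2", "N3"), "a3": ("N3", "N4")}

closed_problems = dangling_nodes(closed_ring)
open_problems = dangling_nodes(open_ring)
```

An empty result means every arc hands over to another arc at both ends. Any node returned marks a place where an arc would be flagged for correction or deletion.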
Stage 4: Providing a unique identifier for each polygon. The final stage in building full
topology for a set of polygons is to ensure that a unique label or identifier is attached to
each polygon. This is important if non-spatial (attribute) data is to be linked with the
polygons that have been created. It is also important for locating (geographically) one
polygon in relation to another.
Figure 23 is a summary of the information necessary for computer storage in order to
reconstruct polygon topology.
Figure 23. Vector polygon with topological data structure.
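One common way of holding this kind of information is an arc table in which every arc records its end nodes and the polygons on its left and right. The Python sketch below assumes that layout and an invented pair of adjacent polygons, A and B. It is not a transcription of Figure 23.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    from_node: str
    to_node: str
    left_poly: str   # polygon on the left when travelling from_node -> to_node
    right_poly: str  # polygon on the right


ENVELOPE = "outside"  # label assumed here for the outer envelope polygon

# Two adjacent polygons, A and B, sharing arc 1 (hypothetical example).
arc_table = {
    1: Arc("N2", "N1", "A", "B"),        # the shared boundary, stored once
    2: Arc("N1", "N2", "A", ENVELOPE),   # remainder of A's boundary
    3: Arc("N1", "N2", ENVELOPE, "B"),   # remainder of B's boundary
}


def arcs_of(poly):
    """Identifiers of the arcs that bound a polygon."""
    return sorted(i for i, a in arc_table.items()
                  if poly in (a.left_poly, a.right_poly))


def neighbours(poly):
    """Polygons lying on the other side of any of this polygon's arcs."""
    found = set()
    for a in arc_table.values():
        if a.left_poly == poly:
            found.add(a.right_poly)
        elif a.right_poly == poly:
            found.add(a.left_poly)
    return found - {ENVELOPE}
```

Here the shared boundary is stored only once, and adjacency is read directly from the table instead of being inferred from coordinates.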
6 Vector and raster spatial data models: advantages and disadvantages
Maffini (1987) states that the raster-vector approaches are two alternate methods for
storing and representing spatial phenomena. As models they have relative strengths and
weaknesses for describing conditions in the real world. In this section we will explore
some of the merits and weaknesses of each model. Throughout the course there will be
many instances where you will need to determine whether one data model is more
appropriate than another. IGISE (1991) groups the advantages and disadvantages
identified into five generic themes:
• Data volume
• Topology queries
• Generality
• Analytical capability
• Accuracy and precision
These themes are used here to start the debate about the choice of spatial data model as
part of the design process.
6.1 Data volume
One of the most frequently discussed areas in the raster-vector debate is data volume.
The debate was at its height during the 1970s and the first half of the 1980s, when the
technological limitations on computer power and storage were most marked.
The question of which model needs more storage has no simple answer: it is not simply that raster data sets are
larger than vector data sets. It depends upon the character and complexity of the spatial
entities you are trying to record. A simple or complex raster for example can require as
much data storage space to record a simple spatial entity with few polygon boundaries as
it would take to record a complex spatial entity with many polygon boundaries. In the
same way an unstructured vector file without topology for a series of 50 complex polygons
can be much smaller in size than a fully topological vector data structure for the same
area. The more complex a spatial entity becomes, the closer the data volume
requirements of the different data storage techniques become. As a general rule, however, raster
spatial data models are more demanding of data storage than their vector
counterparts.
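A rough back-of-envelope calculation in Python makes the comparison concrete. All the figures are assumptions chosen for illustration: one byte per raster cell, eight bytes per stored co-ordinate value, and no compaction, topology or attribute data.

```python
def raster_bytes(rows, cols, bytes_per_cell=1):
    """Storage for a simple raster: every cell is stored, whatever it holds."""
    return rows * cols * bytes_per_cell


def spaghetti_bytes(vertex_counts, bytes_per_coord=8):
    """Storage for unstructured vector polygons: one x and one y per vertex."""
    return sum(vertex_counts) * 2 * bytes_per_coord


# Assumed study area: a 1000 x 1000 cell grid, or 50 polygons.
grid = raster_bytes(1000, 1000)
simple_polygons = spaghetti_bytes([20] * 50)     # 20 vertices per polygon
complex_polygons = spaghetti_bytes([2000] * 50)  # 2000 vertices per polygon
```

Under these assumptions the raster needs 1,000,000 bytes however simple or complex the entities are. The vector file grows from 16,000 bytes for the simple polygons to 1,600,000 bytes for the complex ones.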
6.2 Topology queries
As IGISE (1991) stated, an important prerequisite of a GIS project is the ability to ask
questions such as:
• Where is something?
• What is next to something?
• What is contained within something?
The ability of the different data models to provide answers to these questions is of vital
importance to the designer of a GIS project. Both the raster and vector spatial data
models have strengths and weaknesses associated with answering different types of
spatial questions. These are explored in detail in the following sections on single and
multiple layer spatial operations. For the moment it is enough to note that traditionally,
vector data models have been considered more appropriate for answering topological
questions about containment, adjacency and connectivity. However, with the advent of
more intelligent raster data structures such as the quadtree which contain information
about the relationship between cells in the image, two particular spatial queries can now
be performed efficiently using a raster data model. These are identifying the area in which
a point is located. In general, however, where topological queries are likely to constitute
the major application of the GIS, a vector data model is required.
6.3 Generality
In a GIS project it is frequently necessary to be able to change the scale and thematic
resolution of operation. This often makes it essential to be able to generalise the
complexity of spatial features. For example, it might be necessary to have the ability to
dissolve 500 enumeration districts into 14 wards or 200 detailed soil polygons into 15
general soil units. Depending upon the type of generality the vector and raster data
models possess different relative advantages. The vector data model, for example,
handles changes in scale much more easily than its raster counterpart with regard to the
visual representation of entities. This is because of the precise way in which information
is recorded as a set of x, y co-ordinates. Changes of scale pose a problem in the grid
world if a resolution is requested that is finer than the cell size specified at the project outset. Increases
in scale in the raster world are typified by the appearance of a blockier image.
On the other hand, in generalising the actual form of an area and of surface entities the
raster model comes into its own because to aggregate a complex soil class map into a
more general one needs only the value of each cell to be reclassified to reconstruct the
new image. While this is possible in the vector world it requires complex calculations of
the intersection and adjacency of polygons with similar attributes. Therefore, if many
calculations of this nature are required a raster data model may be more appropriate.
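Raster reclassification of this kind can be sketched in a few lines of Python. The grid values and the grouping of detailed soil classes into general units are invented for illustration.

```python
# Hypothetical detailed soil classes, one code per cell.
soil_grid = [
    [1, 1, 2],
    [3, 3, 2],
    [4, 4, 4],
]

# Assumed lookup from detailed class to general soil unit.
general_unit = {1: "sandy", 2: "sandy", 3: "clayey", 4: "clayey"}


def reclassify(grid, lookup):
    """Build the generalised image by replacing each cell value via a lookup."""
    return [[lookup[value] for value in row] for row in grid]


general_grid = reclassify(soil_grid, general_unit)
```

No geometry is touched: boundaries between cells that fall into the same general unit simply disappear from the new image.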
6.4 Analytical capabilities
There is a clear distinction between the analytical capabilities of raster and vector GIS.
This is a major component of the sections of this unit dealing with spatial operations.
6.5 Accuracy and precision
In days gone by, you may have often heard a vendor of a vector GIS product announcing
to his/her client that their product is more accurate at representing spatial features
because it uses the vector spatial model. This statement provides a useful starting point for
examining accuracy and precision because it highlights one of the most common
mistakes made when comparing raster and vector data: the confusion of the terms
accuracy and precision. Before we proceed it is necessary to define what we mean by the
terms.
Accuracy is the faithfulness with which our spatial entity is represented in our computer
view of the real world including its location (positional or spatial accuracy) and character
(attribute accuracy).
Precision is independent of accuracy and is the degree of exactness used to record the
location and character of our spatial entity. For example, a typical vector based GIS
allocates 8 decimal digits of precision to each of its coordinates and many allocate 16
(Goodchild and Gopal 1990). The level of this precision is much higher than the accuracy
of typical GIS data. Therefore, what the vendor really meant in the above statement is
that the vector data model is more precise at reproducing the shapes and lines that we
are used to seeing on our traditional paper map model of the world. This is because the
paper map uses a vector data model to represent spatial entities. The locations of small
entities are shown as points, roads and rivers as lines, and areas of forest as polygons
with a distinct boundary. Naturally, therefore, if we use a vector GIS to capture this data it
will be more precisely reproduced than in the raster world where the points would appear
as cells, the roads and rivers as jagged and stepped linear features and the areas as
blocky irregular entities rather than the smooth boundaries that are formed by the arcs of
the polygon.
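The gap between precision and accuracy can be illustrated with a short Python sketch. It uses 4-byte and 8-byte floating point numbers as stand-ins for lower and higher precision co-ordinate storage. The co-ordinate value and the survey accuracy of one metre are both assumptions.

```python
import struct


def store_as(fmt, value):
    """Round-trip a number through a binary float format ('f' = 4 bytes, 'd' = 8 bytes)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


easting = 123456.789        # hypothetical co-ordinate in metres
survey_accuracy_m = 1.0     # assumed positional accuracy of the source data

single = store_as("f", easting)  # lower precision storage
double = store_as("d", easting)  # higher precision storage

single_error = abs(single - easting)
double_error = abs(double - easting)
```

Even the lower precision format changes this co-ordinate by less than a centimetre, which is far below the assumed accuracy of the data. Storing more digits records the same uncertain position more exactly. It does not make the position more accurate.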
At first glance it might appear that our vendor is indeed right that the vector GIS is
much more accurate than an equivalent grid based system. To understand why this is not
the case we need to revisit our definition of accuracy and look closely at the concept of
faithfulness of representation. The first important point to recognise is that all spatial data
are of limited accuracy. IGISE (1991) and Goodchild and Gopal (1990) provide passages
to illustrate the point.
“Consider for example two air photo-interpreters who are evaluating the boundary
for a wooded area. They are likely to produce two different boundaries. Branches
of trees, for example, overlap one another. The overhang and overlap can easily
be several metres. Depending on the season in which the photograph was taken
(i.e. whether or not the trees are in leaf), the boundary line may be drawn
differently.”
IGISE (1991)
“The area labelled ‘soil type A’ on a map of soils is not in reality all type A, and its
boundaries are not sharp breaks but transition zones. Similarly, the area labelled
‘population density 1000-2000/sq.km’ does not in fact have between 1000 and
2000 in every square km, or between 10 and 20 in every hectare, since the spatial
distribution of population is punctiform and can only be approximated by a smooth
surface.”
Goodchild and Gopal (1990)
Now if we return to our vendor, what s/he should have said was that a vector GIS is more
precise in representing a spatial entity as it appears on a map. It is not necessarily more
accurate than a raster GIS at representing the location and character of the true real world
feature. Just because an entity takes on a jagged, blocky or stepped appearance, it is not
correct to assume that the database is inaccurate. In addition, many users of GIS
consider the blocky irregular boundary produced between areas when a raster spatial data
model is used to be more appropriate for representing the real world features where
distinct boundaries between spatial phenomena are not present.
7 What have you learned?
This section has introduced the two field-based models for representing geographic space
in GIS – raster and vector. We have examined a range of storage methods employed by
the cell-based raster spatial data model from the simple raster to the area quadtree. In
considering the vector spatial data model with its precise geometric representation, we
have seen the significance of topology as well as the variety of structures in vector
models.