Lecture 9

By Dr. Borne, 2005. UMUC Data Mining, Lecture 9

Data Mining UMUC CSMN 667

Lecture #9



“Spatial Data Mining”

[Figure: Regional Prior Probability map. Legend classes: Water, Coast, Land, Desert, Cloud, Snow/Ice, Glint]


Outline

• Spatial Mining:
  – What is it?
  – How is it different?

• Spatial Queries

• Indexing

• Spatial Data Mining Primitives

• Concept Hierarchies

• Techniques for Spatial Mining


What is Spatial Mining?

• Spatial Mining = Mining Spatial Data Sets

• Spatial data refer to any data about objects that occupy real physical space.

• Attributes for spatial data usually will include spatial information.

• Spatial information (metadata) is used to describe objects in space.

• Spatial information includes geometric metadata (e.g., location, shape, size, distance, area, perimeter) and topological metadata (e.g., “neighbor of”, “adjacent to”, “included in”, “includes”).


Spatial Mining: how is it different?

• Spatial data types naturally carry linkages to neighboring data elements (e.g., contiguous geographic positions).

• Spatial information corresponds to unique attributes and unique relationships between attributes that are not normally found in other databases.

• In many cases, spatial data are stored as rasters: rows and columns, with (x,y) location information implicit in the placement within the raster (e.g., Remote Sensing images).
  – (x,y) values are not stored anywhere in the database.
  – Special extraction tools are needed.

• These differences can pose challenges to standard data mining algorithms.
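
Because a raster stores location only implicitly, recovering coordinates means applying a mapping from (row, column) to map coordinates. The following is a minimal sketch, assuming a north-up image described by a simple affine transform. The names `GeoTransform` and `pixel_to_lonlat` and all numeric values are illustrative assumptions, not part of any particular extraction tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeoTransform:
    """Hypothetical north-up affine mapping for a raster (an assumption)."""
    lon_origin: float    # longitude of the upper-left corner
    lat_origin: float    # latitude of the upper-left corner
    pixel_width: float   # degrees per column
    pixel_height: float  # degrees per row

def pixel_to_lonlat(gt, row, col):
    """Return (lon, lat) of the center of pixel (row, col)."""
    lon = gt.lon_origin + (col + 0.5) * gt.pixel_width
    lat = gt.lat_origin - (row + 0.5) * gt.pixel_height  # rows increase southward
    return lon, lat

# Made-up transform: upper-left corner at (-80, 40), half-degree pixels
gt = GeoTransform(lon_origin=-80.0, lat_origin=40.0, pixel_width=0.5, pixel_height=0.5)
lon, lat = pixel_to_lonlat(gt, row=2, col=4)
```

The raster indices (2, 4) and the resulting coordinates are different numbers, which is why a raster [x,y] value cannot be used directly as a latitude/longitude index.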


Spatial Mining: more challenges

• Additional challenges …

• How do you index these data collections?
  – Perhaps you can use a spatial index (e.g., Latitude, Longitude).
  – But a raster [x,y] value almost never equals [Latitude, Longitude]!

• How do you describe the spatial relationships among data items? What attributes should you use?
  – These attributes are chosen and controlled by each organization.
  – Spatial data mining algorithms won’t know their meaning.

• Almost any data collected by, for, or about human society can be associated with a geo-location. Special GIS (Geographic Information Systems) tools are (or can be) used; see http://www.gis.com/

• As a consequence of this fact, spatial data repositories are HUGE and growing.


Special Cases

• Image databases (Earth or the Sky)

• Thematic maps (values of attributes or “themes” are displayed in a spatial distribution = a map!)


Spatial Queries

• Queries must be able to handle both spatial and non-spatial attributes

• Queries may include range or region constraints indirectly (e.g., find all zip codes in the Rocky Mountains, or around Lake Michigan)

• The Nearest Neighbor query is aptly named in this setting: it really is the NEAREST neighbor that you are seeking!

• Distance metric probably uses a real distance (Euclidean spatial distance)
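
As a rough illustration of a nearest-neighbor query with a real Euclidean distance, here is a brute-force sketch. The function name `nearest_neighbor` and the sample coordinates are assumptions made for the example; a production system would use a spatial index rather than scanning every record.

```python
import math

def nearest_neighbor(query, points):
    """Return the point closest to `query` by Euclidean distance (brute force)."""
    return min(points, key=lambda p: math.dist(query, p))

# Made-up store locations in planar (x, y) coordinates
stores = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
closest = nearest_neighbor((2.0, 2.0), stores)
```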


Region Queries

• Uses concept of MBR (Minimum Bounding Rectangle)…

• Can be used to index the spatial database: you need to index the database in order to find records.
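
The MBR idea can be sketched in a few lines: compute the smallest axis-aligned rectangle around an object's points, then answer a region query with a cheap rectangle-overlap test. The helper names `mbr` and `intersects` and the coordinates are illustrative assumptions.

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Made-up outline of a lake, reduced to its bounding rectangle
lake = mbr([(2, 1), (5, 3), (4, 6)])
query_region = (4, 5, 8, 9)
far_region = (6, 7, 8, 9)
```

An MBR overlap only says that the object might fall in the query region; an exact geometric test on the surviving candidates is still needed.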


Spatial Database Indexing

• Trees are frequently used to index spatial data.

• Quad Tree: based upon assigning data to spatial quadrants

• R-Tree: based on the ranges of values (Lat, Long) spanned by a set of MBRs.

• k-D Tree: a binary search tree in k dimensions

• … and many more

• Searching a tree-based index is fast: find all intersections at the high level; ignore the rest!


Quad Tree

• Quadrant Tree

• A Quad Tree is a hierarchical decomposition of the space into quadrants.

• Each level in the tree represents the object as being equivalent to the set of quadrants that contain any portion of the object.

• Each finer-grained level provides a more exact representation of the object.

• The number of levels used is determined by the degree of accuracy desired.
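
The decomposition can be sketched recursively: at each level, keep only the quadrants that contain some portion of the object, and subdivide those again. This is a sketch under stated assumptions: the object is approximated by a rectangle, and quadrants are labeled by digit paths (0=SW, 1=SE, 2=NW, 3=NE), a labeling chosen here for convenience that differs from the numbered grids used in the slides.

```python
def quadrants_covering(obj, cell, level, path=""):
    """Return paths of the quadrants at depth `level` that overlap rectangle `obj`.

    Rectangles are (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = cell
    # Prune: this cell's interior does not overlap the object
    if obj[0] >= xmax or obj[2] <= xmin or obj[1] >= ymax or obj[3] <= ymin:
        return []
    if level == 0:
        return [path]
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    children = [
        (xmin, ymin, xm, ym), (xm, ymin, xmax, ym),   # SW, SE
        (xmin, ym, xm, ymax), (xm, ym, xmax, ymax),   # NW, NE
    ]
    found = []
    for digit, child in enumerate(children):
        found += quadrants_covering(obj, child, level - 1, path + str(digit))
    return found

space = (0, 0, 8, 8)   # the whole indexed region
obj = (1, 1, 3, 5)     # made-up object footprint
level1 = quadrants_covering(obj, space, 1)
level2 = quadrants_covering(obj, space, 2)
```

Going one level deeper replaces two coarse quadrants with six smaller ones that hug the object more tightly, which is the accuracy-versus-depth trade-off described above.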


Quad Tree Example

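[Figure: “Indexing the Triangle”. The triangle is represented by sets of quadrants at successively finer levels: (2,3), then (12,13,14), then (49,52,53,54,…).]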


R-Tree

• The R-Tree (Rectangle Tree) uses ranges of coordinate values, in the form of bounding rectangles, to build the index.

• As with the Quad Tree, the region is divided into successively smaller rectangles (MBRs), containing the data items.

• Rectangles need not be of the same size or quantity at each level of the tree.

• Rectangles may actually overlap.

• Each lowest-level cell has only one object.

• Uses tree maintenance algorithms similar to those for B-trees (= the balanced multiway search trees traditionally used to index non-spatial databases).
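
To show how a hierarchy of MBRs speeds up a search, here is a small sketch of a window query over a hand-built tree. It illustrates only the pruning step (descend into a node only if its rectangle intersects the query window); it does not implement insertion or node splitting, and the `Node` class, the rectangles, and the object names are all assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Index node: an MBR plus child nodes, or a single object at the lowest level."""
    mbr: tuple                                   # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)
    obj: object = None                           # set only on lowest-level cells

def overlaps(a, b):
    """True if rectangles a and b share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, window):
    """Collect objects whose MBR intersects `window`, skipping whole subtrees."""
    if not overlaps(node.mbr, window):
        return []                                # ignore everything below this node
    if node.obj is not None:
        return [node.obj]
    hits = []
    for child in node.children:
        hits += search(child, window)
    return hits

# Hand-built two-level index; note that the two mid-level rectangles overlap
west = Node((0, 0, 4, 4), [Node((0, 0, 1, 1), obj="A"), Node((3, 3, 4, 4), obj="B")])
east = Node((3, 2, 9, 9), [Node((3, 2, 5, 5), obj="C"), Node((8, 8, 9, 9), obj="D")])
root = Node((0, 0, 9, 9), [west, east])
hits = search(root, (3.5, 3.5, 4.5, 4.5))
```

Because rectangles may overlap, a query can descend into more than one branch, as happens here.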


R-Tree Example


k-D Tree

• k-Dimensional index (for multi-dimensional data … dimensions can be 2, 3, or much more)

• Designed for multi-attribute data, not necessarily spatial

• Uses the “Divide and Conquer” approach

• This is a variation of the binary search tree

• Each level is used to index one of the dimensions of the spatial object

• Lowest level cell has only one object

• Divisions are not based on MBRs, but based on successive divisions of the longest dimension.
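
A minimal k-D tree sketch follows. As a simplifying assumption, it cycles through the dimensions level by level (a common textbook variant) rather than always splitting the longest dimension; the dictionary layout and function names are illustrative choices.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree as nested dicts, splitting on one dimension per level."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return []
    p, axis = node["point"], node["axis"]
    found = [p] if all(l <= v <= h for l, v, h in zip(lo, p, hi)) else []
    if lo[axis] <= p[axis]:                  # query box reaches the left side
        found += range_search(node["left"], lo, hi)
    if p[axis] <= hi[axis]:                  # query box reaches the right side
        found += range_search(node["right"], lo, hi)
    return found

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]   # made-up 2-D points
tree = build_kdtree(pts)
inside = sorted(range_search(tree, lo=(3, 1), hi=(8, 5)))
```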


k-D Tree Example


Example of Tree-based Indexing: Quad Tree Indexing

• A Quadrant Tree can be used to index data records in a spatial database based upon each record’s spatial location

• A business or government agency may wish to query the database using spatial constraints.

• The Quad Tree (QT) facilitates the storage and querying of spatial data.

• For example:
  – The complex geographical boundaries of different states, counties, and cities may be difficult to specify in a database.
  – However, each address (home or business) in the database may be indexed with a QT value derived from its spatial location.
  – Then, in order to search for all addresses within some particular geographic boundary, one can simply query the database for records that have particular values of the quad tree index.


Quad Tree-based Spatial Mining

• The Quadrant Tree (QT) index can be applied to data records in multiple spatial databases.

• One can look for associations, or patterns, or clusters, or outliers, or nearest neighbors, etc. using the QT index values across the different databases.

• For example:
  – One can apply association mining to the spatial database(s) and find associations among different attributes for database records that correspond to the same location or same type of location (e.g., urban, rural, suburban,…).
  – One can apply a nearest neighbor search using the QT index as the query qualifier.
  – One can use spatial attributes in some of the nodes (question points) in a Decision Tree.


Quad Tree Example: this grid can be overlaid onto any geographic map.

[Figure: hierarchical Quad Tree grid with cells numbered 1 through 84 across successively finer grid levels, a pink circle marking the sales area, and three addresses labeled a, b, and c.]

The pink circle represents the sales area for our local retail business. Our company’s sales region may be described with the following sets of Quad Tree indices: {1,2} or {6,7,9,12} or {26,27,30,37,39,40,49,50} or with even higher-order indices (corresponding to finer grids).

Therefore, if we convert the local phonebook into a high-order Quad-Tree (QT) indexed listing, then all residents whose addresses occur with any of these QT index values {26,27,30,37,39,40,49,50} are potential customers; thus, we will send them some of our company’s advertising mail.

• Address(a) = {24} → No Mail

• Address(b) = {30} → Send Mail

• Address(c) = {75} → No Mail


Quad Tree example, continued

• In the preceding slide, it was easy in this simple example to see whether address (a), (b), or (c) was within the pink boundary (i.e., within the sales region of our company).

• But, in general, you do not have such a pretty map and picture to look at.

• In general, you only have the data in the database, and it might have thousands or millions of records.

• So, in general, the Quad Tree index provides a very rapid means to discover whether or not a database record is within a selected geographic region. Just check the numbers and you are done!

• The biggest decision is how many levels of indices to use (i.e., how fine-grained you want to make the hierarchical grid in the Quad Tree). It is possible to go to very high levels of spatial resolution, requiring very large index numbers. But it is still easy to query even a huge database for the selected values.
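
"Just check the numbers" amounts to a set-membership test. The sketch below uses the QT index values given in the sales-area example; the function name `mailing_decision` and the dictionary layout are assumptions made for illustration.

```python
# Finest-level QT indices covering the sales region (values from the example)
SALES_REGION = {26, 27, 30, 37, 39, 40, 49, 50}

# QT index of each address, as given in the example
addresses = {"a": 24, "b": 30, "c": 75}

def mailing_decision(qt_index, region=SALES_REGION):
    """Return 'Send Mail' if the address's QT index falls in the region."""
    return "Send Mail" if qt_index in region else "No Mail"

decisions = {name: mailing_decision(qt) for name, qt in addresses.items()}
```

In a database, the same test would typically be a simple filter on an indexed QT column, so it stays fast even for millions of records.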


Other Trees for Database Indexing

• What we have said in the preceding slides about Quad Trees is also applicable to R-Trees, k-D Trees, and other tree-based indices used in spatial databases.

• There are many other types of trees used to index more general databases, including the B-tree, hB-tree, R-tree, R+-tree, R*-tree, R**-tree, packed R-tree, M-tree, SR-tree, SS-tree, RD-tree, BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, KDB-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q-tree, SKD-tree, TV-tree, UB-tree, Z-order index, etc.


Spatial Data Mining Primitives

• Orientation relationships:
  – North, South, East, West

• Topological relationships:
  – see next slide

• Distance measures


Topological Relationships

• Disjoint

• Overlaps or Intersects

• Equals

• Covered by or inside or contained in

• Covers or contains
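
For axis-aligned rectangles, these relationships can be decided with a handful of coordinate comparisons. The following is a simplified sketch: the function name and return labels are assumptions, and rectangles that merely touch along an edge are reported as overlapping, which a real GIS would distinguish more carefully.

```python
def topological_relation(a, b):
    """Classify rectangle `a` relative to `b`; both are (xmin, ymin, xmax, ymax)."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"      # a is covered by b
    if a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]:
        return "contains"    # a covers b
    return "overlaps"

# Made-up regions
city = (2, 2, 4, 4)
county = (0, 0, 10, 10)
relation = topological_relation(city, county)
```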


Distance Between Objects

• A lot of this is the same old stuff we talked about previously.

• Can use cluster distances (Single Link, etc. - see page 130)

• Euclidean, Manhattan, etc.

• Some special spatial extensions:


Aggregate Proximity

• Aggregate Proximity – a measure of how close a cluster is to a feature.

• Aggregate proximity relationship finds the k closest features to a cluster.

• CRH Algorithm – uses different shapes:
  – Encompassing Circle
  – Isothetic Rectangle = rectangle with edges parallel to the principal axes
  – Convex Hull


CRH example: mathematical formulae exist to calculate easily the distance between any two convex hulls, or circles, or rectangles
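
Two of these distance formulae are simple enough to sketch directly. The functions below compute the gap between two circles and between two isothetic rectangles; the convex hull case is omitted because it needs more machinery. Function names and sample shapes are assumptions for illustration, and this is not the CRH algorithm itself.

```python
import math

def circle_distance(c1, c2):
    """Gap between two circles given as (x, y, radius); 0 if they touch or overlap."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return max(0.0, math.hypot(x2 - x1, y2 - y1) - r1 - r2)

def rectangle_distance(a, b):
    """Gap between two isothetic rectangles given as (xmin, ymin, xmax, ymax)."""
    dx = max(0.0, a[0] - b[2], b[0] - a[2])   # horizontal separation
    dy = max(0.0, a[1] - b[3], b[1] - a[3])   # vertical separation
    return math.hypot(dx, dy)

d_circles = circle_distance((0, 0, 1), (5, 0, 2))
d_rects = rectangle_distance((0, 0, 1, 1), (4, 5, 6, 7))
```

Cheap shapes such as these can serve as a first filter before the more exact (and more expensive) convex hull comparison.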


Concept Hierarchies

• Specialization (Progressive Refinement) = move down the hierarchy

• Generalization = move up the hierarchy

• Similar to “roll-up” and “drill-down”

• An implementation: STING


Progressive Refinement

• Make approximate answers prior to more accurate ones.

• Filter out data that are not part of answer.

• Hierarchical view of data based on spatial relationships

• Coarse predicate recursively refined
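
A two-step "filter, then refine" query is one simple instance of this idea. The sketch below answers "which points lie within a given radius of a center?" by first applying a coarse bounding-square test and then an exact distance test on the survivors only. The function name and data are assumptions for illustration.

```python
import math

def within_radius(points, center, radius):
    """Two-step query: coarse bounding-square filter, then exact distance test."""
    cx, cy = center
    # Step 1 (coarse): keep points inside the circle's bounding square
    candidates = [
        (x, y) for x, y in points
        if abs(x - cx) <= radius and abs(y - cy) <= radius
    ]
    # Step 2 (refine): exact Euclidean test on the survivors only
    exact = [p for p in candidates if math.dist(p, center) <= radius]
    return candidates, exact

points = [(1, 1), (4, 4), (9, 9), (3, 0)]   # made-up locations
candidates, exact = within_radius(points, center=(2, 2), radius=2)
```

The coarse step never discards a true answer; it only lets through some false candidates, which the refinement step then removes.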


Example


STING

• STatistical INformation Grid-based

• Hierarchical technique to divide area into rectangular cells

• Grid data structure contains summary information about each cell

• Hierarchical clustering

• Similar to Quad Tree
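
The grid-with-summaries idea can be sketched as follows. Each finest-level cell keeps a few summary statistics for one attribute (count, sum, min, max, chosen here for simplicity), and each coarser cell is computed from its 2×2 block of children without revisiting the raw data. This is an illustrative sketch under those assumptions, not the STING algorithm as published; all names and values are made up.

```python
def leaf_stats(points, size, cells):
    """Summarize attribute values per cell of a cells x cells grid over [0, size)."""
    grid = {}
    for x, y, value in points:
        key = (int(x * cells / size), int(y * cells / size))
        s = grid.setdefault(key, {"n": 0, "sum": 0.0, "min": value, "max": value})
        s["n"] += 1
        s["sum"] += value
        s["min"] = min(s["min"], value)
        s["max"] = max(s["max"], value)
    return grid

def parent_stats(grid):
    """Roll 2x2 blocks of child cells up into the next coarser level."""
    parents = {}
    for (i, j), s in grid.items():
        p = parents.setdefault(
            (i // 2, j // 2),
            {"n": 0, "sum": 0.0, "min": s["min"], "max": s["max"]},
        )
        p["n"] += s["n"]
        p["sum"] += s["sum"]
        p["min"] = min(p["min"], s["min"])
        p["max"] = max(p["max"], s["max"])
    return parents

# (x, y, attribute value); the values are made up for illustration
data = [(0.5, 0.5, 10.0), (1.5, 0.5, 20.0), (1.5, 1.5, 60.0), (3.5, 3.5, 5.0)]
level2 = leaf_stats(data, size=4, cells=4)   # 4 x 4 grid
level1 = parent_stats(level2)                # 2 x 2 grid
top = parent_stats(level1)                   # single root cell
```

A query can then start at the top cell and descend only into cells whose summaries look relevant, which mirrors the Quad Tree style of search.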


Nodes in STING data structure:


Spatial Data Mining Algorithms

• Most traditional methods still apply, with special “features” to deal with spatial information (geographic / topological metadata):
  – Association Rules
  – Clustering
  – Classification
  – Decision Trees
  – Neural Nets
  – Bayes Networks

• Refer to text and …

• If you are interested in this subject, try a Google search on “Spatial Data Mining”.


Spatial Rules

• Characteristic Rule :

The average family income in Dallas is $50,000.

• Discriminant Rule :

The average family income in Dallas is $50,000, while in Plano the average income is $75,000.

• Association Rule :

The average family income in Dallas for families living near White Rock Lake is $100,000.


Spatial Association Rules

• Either antecedent or consequent must contain spatial predicates.

• Views the underlying database as a set of spatial objects.

• May generate these rules using a type of progressive refinement
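
To make the role of a spatial predicate concrete, the sketch below measures support and confidence for a rule of the form "near(lake) → high income". The predicate `near`, the lake location, the income threshold, and every record are made up for illustration; none of the numbers come from real data.

```python
import math

def near(p, q, radius):
    """Spatial predicate: True if points p and q are within `radius`."""
    return math.dist(p, q) <= radius

def rule_stats(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n_a = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / len(records)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

LAKE = (0.0, 0.0)   # hypothetical lake location
# (location, family income); made-up records for illustration only
families = [
    ((1.0, 0.0), 110_000),
    ((0.0, 2.0), 95_000),
    ((1.0, 1.0), 40_000),
    ((8.0, 8.0), 50_000),
]
support, confidence = rule_stats(
    families,
    antecedent=lambda r: near(r[0], LAKE, radius=2.0),
    consequent=lambda r: r[1] >= 90_000,
)
```

Here the antecedent carries the spatial predicate, which is what makes this a spatial association rule.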


Spatial Classification

• Partitions the spatial objects into categories

• May use nonspatial attributes and/or spatial attributes

• Generalization and progressive refinement may be used.


Spatial Decision Tree

• Approach similar to that used for spatial association rules.

• A spatial object can be described based on the objects closest to it; this neighborhood is called its Buffer.

• Description of class based upon aggregation of nearby objects.
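
One way to turn a buffer into features for a classifier is to aggregate the objects that fall inside it. The sketch below assumes a fixed-radius circular buffer and counts the types of nearby objects; the function name, radius, and data are assumptions for illustration.

```python
import math
from collections import Counter

def buffer_profile(target, objects, radius):
    """Count the types of objects within `radius` of `target` (its buffer)."""
    return Counter(
        kind for kind, location in objects
        if math.dist(target, location) <= radius
    )

# Made-up (type, location) pairs
objects = [("park", (1, 1)), ("school", (2, 0)), ("factory", (9, 9)), ("park", (0, 2))]
profile = buffer_profile((0, 0), objects, radius=3)
```

The resulting counts could then be used as attributes at the nodes of a decision tree.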


Spatial Clustering

• Detect clusters of irregular shapes.

• Use of centroids and simple distance approaches may not work well.

• Clusters should be independent of order of input.
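
As one possible illustration, the sketch below groups points by linking any two that lie within a distance threshold and taking the connected components (a single-link style grouping). It is a swapped-in simple technique, not a method prescribed by these slides; it uses no centroids, can follow elongated shapes, and gives the same clusters regardless of input order. The function name and the threshold `eps` are assumptions.

```python
import math

def threshold_clusters(points, eps):
    """Group points into connected components of the 'within eps' graph."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)   # merge the two components

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), set()).add(p)
    # A set of frozensets, so the result does not depend on input order
    return {frozenset(g) for g in groups.values()}

# An elongated chain plus a separate pair (made-up points)
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (10, 10), (11, 10)]
clusters = threshold_clusters(pts, eps=1.5)
```

The pairwise loop is quadratic in the number of points, so for large collections a spatial index would be used to find neighbors.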


Summary


Summary of Topics Covered - Lecture 9

• Spatial Mining:
  – What is it?
  – How is it different?

• Spatial Queries

• Indexing

• Spatial Data Mining Primitives

• Concept Hierarchies

• Techniques for Spatial Mining
