
By Dr. Borne 2005 UMUC Data Mining Lecture 9 1

Data Mining UMUC CSMN 667

Lecture #9

Lecture 9

“Spatial Data Mining”

[Figure: Regional Prior Probability map; legend classes: Water, Coast, Land, Desert, Cloud, Snow/Ice, Glint]

Outline

• Spatial Mining:
  – What is it?
  – How is it different?

• Spatial Queries

• Indexing

• Spatial Data Mining Primitives

• Concept Hierarchies

• Techniques for Spatial Mining

What is Spatial Mining?

• Spatial Mining = Mining Spatial Data Sets

• Spatial data refer to any data about objects that occupy real physical space.

• Attributes for spatial data usually will include spatial information.

• Spatial information (metadata) is used to describe objects in space.

• Spatial information includes geometric metadata (e.g., location, shape, size, distance, area, perimeter) and topological metadata (e.g., “neighbor of”, “adjacent to”, “included in”, “includes”).

Spatial Mining: how is it different?

• Spatial data types naturally carry linkages to neighboring data elements (e.g., contiguous geographic positions).

• Spatial information corresponds to unique attributes and unique relationships between attributes that are not normally found in other databases.

• In many cases, spatial data are stored as rasters: rows and columns, with (x,y) location information implicit in the placement within the raster (e.g., Remote Sensing images).
  – (x,y) values are not stored anywhere in the database.
  – Special extraction tools are needed.

• These differences can pose challenges to standard data mining algorithms.

Spatial Mining: more challenges

• Additional challenges …

• How do you index these data collections?
  – Perhaps you can use a spatial index (e.g., Latitude, Longitude).
  – But the raster [x,y] value almost never equals [Latitude, Longitude]!

• How do you describe the spatial relationships among data items? What attributes should you use?
  – These attributes are chosen and controlled by each organization.
  – Spatial data mining algorithms won’t know their meaning.

• Almost any data collected by, for, or about human society can be associated with a geo-location. Special GIS (Geographic Information Systems) tools are, or can be, used to work with such data: http://www.gis.com/

• As a consequence, spatial data repositories are HUGE and growing.
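
Because a raster stores location only implicitly, some georeferencing rule is needed to turn a [row, column] position into [Latitude, Longitude]. The sketch below assumes the simplest case, a north-up image with square-degree pixels and no rotation. The `GeoTransform` type and its field names are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple, Tuple


class GeoTransform(NamedTuple):
    """Hypothetical georeferencing record for a north-up raster (no rotation)."""
    lon_origin: float    # longitude of the left edge of column 0
    lat_origin: float    # latitude of the top edge of row 0
    pixel_width: float   # degrees of longitude per column
    pixel_height: float  # degrees of latitude per row (rows run southward)


def pixel_to_lonlat(gt: GeoTransform, row: int, col: int) -> Tuple[float, float]:
    """Return (longitude, latitude) of the center of pixel (row, col)."""
    lon = gt.lon_origin + (col + 0.5) * gt.pixel_width
    lat = gt.lat_origin - (row + 0.5) * gt.pixel_height
    return lon, lat


gt = GeoTransform(lon_origin=-80.0, lat_origin=40.0,
                  pixel_width=0.5, pixel_height=0.5)
print(pixel_to_lonlat(gt, row=0, col=0))  # (-79.75, 39.75)
print(pixel_to_lonlat(gt, row=2, col=4))  # (-77.75, 38.75)
```

Real remote-sensing products often use map projections or rotated grids, so this linear rule should be read as the minimal case only.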

Special Cases

• Image databases (Earth or the Sky)

• Thematic maps (values of attributes or “themes” are displayed in a spatial distribution = a map!)

Spatial Queries

• Queries must be able to handle both spatial and non-spatial attributes

• Queries may include range or region constraints indirectly (e.g., find all zip codes in the Rocky Mountains, or around Lake Michigan)

• The Nearest Neighbor query is aptly named in this setting: it really is the NEAREST neighbor (in space) that you are seeking!

• Distance metric probably uses a real distance (Euclidean spatial distance)
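
A minimal sketch of the two query styles described on this slide: a region query that mixes a spatial constraint with a non-spatial attribute, and a nearest-neighbor query using Euclidean distance. The `Store` record and the sample data are made up for illustration, and the linear scans stand in for what an indexed database would do.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Store:
    name: str
    x: float
    y: float
    category: str  # non-spatial attribute


def in_region(s: Store, xmin: float, ymin: float, xmax: float, ymax: float) -> bool:
    """Spatial constraint: is the store inside the rectangular region?"""
    return xmin <= s.x <= xmax and ymin <= s.y <= ymax


def nearest_neighbor(stores: List[Store], x: float, y: float) -> Store:
    """Return the store with the smallest Euclidean distance to (x, y)."""
    return min(stores, key=lambda s: math.hypot(s.x - x, s.y - y))


stores = [
    Store("A", 1, 1, "grocery"),
    Store("B", 4, 5, "pharmacy"),
    Store("C", 2, 2, "pharmacy"),
    Store("D", 9, 9, "grocery"),
]

# Spatial and non-spatial attributes combined in one query.
pharmacies_in_region = [s.name for s in stores
                        if in_region(s, 0, 0, 5, 5) and s.category == "pharmacy"]
print(pharmacies_in_region)                    # ['B', 'C']
print(nearest_neighbor(stores, 3, 3).name)     # C
```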

Region Queries

• Uses concept of MBR (Minimum Bounding Rectangle)…

• Can be used to index the spatial database: you need to index the database in order to find records.
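
As a small illustration of the MBR idea, the sketch below computes the minimum bounding rectangle of an object's vertices and tests whether it intersects a query rectangle. The tuple layout `(xmin, ymin, xmax, ymax)` and the sample polygon are assumptions made for this example.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(points: Iterable[Point]) -> Rect:
    """Minimum Bounding Rectangle of a set of points (e.g., polygon vertices)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


lake = [(2, 1), (5, 3), (4, 6), (1, 4)]
lake_mbr = mbr(lake)
print(lake_mbr)                             # (1, 1, 5, 6)
print(intersects(lake_mbr, (4, 5, 8, 8)))   # True
print(intersects(lake_mbr, (6, 0, 7, 1)))   # False
```

An MBR test is only a coarse filter: a query can intersect the rectangle without touching the object inside it, so an exact geometric check may still be needed afterwards.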

Spatial Database Indexing

• Trees are frequently used to index spatial data.

• Quad Tree: based upon assigning data to spatial quadrants

• R-Tree: based on ranges of values (Lat,Long) assigned to a set of MBRs.

• k-D Tree: a binary search tree in k dimensions

• … and many more

• Searching a tree-based index is fast: find all intersections at the high level; ignore the rest!

Quad Tree

• Quadrant Tree

• A Quad Tree is a hierarchical decomposition of the space into quadrants.

• Each level in the tree represents the object as being equivalent to the set of quadrants that contain any portion of the object.

• Each finer-grained level provides a more exact representation of the object.

• The number of levels used is determined by the degree of accuracy desired.
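
The decomposition can be sketched as a short recursion that returns, for a chosen number of levels, the set of quadrants containing any portion of an object. For simplicity the object here is an axis-aligned rectangle, and quadrants are labeled with digit paths (0 = NW, 1 = NE, 2 = SW, 3 = SE); that labeling is an assumption for this example and is not the numbering used in the lecture's figures.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the rectangles share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def covering_cells(obj: Rect, cell: Rect, levels: int, path: str = "") -> List[str]:
    """Paths of the quadrants at the given depth that contain part of obj."""
    if not overlaps(obj, cell):
        return []
    if levels == 0:
        return [path]
    xmin, ymin, xmax, ymax = cell
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    children = {
        "0": (xmin, ym, xm, ymax),   # NW
        "1": (xm, ym, xmax, ymax),   # NE
        "2": (xmin, ymin, xm, ym),   # SW
        "3": (xm, ymin, xmax, ym),   # SE
    }
    cells: List[str] = []
    for label, child in children.items():
        cells += covering_cells(obj, child, levels - 1, path + label)
    return cells


world = (0, 0, 8, 8)
obj = (1, 5, 3, 7)
print(covering_cells(obj, world, 1))  # ['0']
print(covering_cells(obj, world, 2))  # ['00', '01', '02', '03']
print(covering_cells(obj, world, 3))  # ['003', '012', '021', '030']
```

At level 1 the object is approximated by a whole quadrant (16 square units), at level 2 by the same area split four ways, and at level 3 by four unit cells that match the object exactly, which is the "more exact representation" the slide refers to.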

Quad Tree Example

[Figure: “Indexing the Triangle”. A triangle is indexed by sets of quadrants at successively finer levels: (2,3), then (12,13,14), then (49,52,53,54,…).]

R-Tree

• The R-Tree (“R” for rectangle) uses ranges, in the form of bounding rectangles, to build the index.

• As with the Quad Tree, the region is divided into successively smaller rectangles (MBRs) containing the data items.

• Rectangles need not be of the same size or quantity at each level of the tree.

• Rectangles may actually overlap.

• At the lowest level, each MBR encloses only one object.

• Uses tree maintenance algorithms similar to those for B-trees (= the traditional balanced, multiway search trees used in non-spatial databases).
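
The search side of an R-Tree can be illustrated with a deliberately simplified structure: object MBRs are grouped bottom-up into parent MBRs, and a query descends only into nodes whose MBR intersects the query window. This is a static, pre-sorted build and not a real R-Tree, since it has none of the B-tree-like insertion and node-splitting logic. The `Node` class, the `build` function, and the `fanout` parameter are hypothetical names for this sketch; `build` expects a non-empty list and `fanout >= 2`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    mbr: Rect
    obj: Optional[str] = None                      # set only on leaf entries
    children: List["Node"] = field(default_factory=list)


def union(rects: List[Rect]) -> Rect:
    """Smallest rectangle enclosing all the given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build(entries: List[Tuple[Rect, str]], fanout: int = 2) -> Node:
    """Group leaf MBRs (sorted by xmin) into parents until one root remains."""
    nodes = sorted((Node(r, o) for r, o in entries), key=lambda n: n.mbr[0])
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [Node(union([c.mbr for c in g]), children=g) for g in groups]
    return nodes[0]


def search(node: Node, query: Rect) -> List[str]:
    """Return ids of objects whose MBR intersects the query window."""
    if not intersects(node.mbr, query):
        return []                                  # prune this whole subtree
    if not node.children:
        return [node.obj]
    hits: List[str] = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


tree = build([((0, 0, 1, 1), "A"), ((2, 2, 3, 3), "B"),
              ((6, 6, 7, 7), "C"), ((8, 0, 9, 1), "D")])
print(search(tree, (2.5, 2.5, 6.5, 6.5)))  # ['B', 'C']
print(search(tree, (4, 4, 5, 5)))          # []
```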

R-Tree Example

k-D Tree

• k-Dimensional index (for multi-dimensional data … dimensions can be 2, 3, or much more)

• Designed for multi-attribute data, not necessarily spatial

• Uses the “Divide and Conquer” approach

• This is a variation of the binary search tree

• Each level is used to index one of the dimensions of the spatial object

• Lowest level cell has only one object

• Divisions are not based on MBRs, but based on successive divisions of the longest dimension.
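
A compact sketch of a k-D tree with a nearest-neighbor search follows. It uses the common textbook variant in which the splitting dimension simply cycles with depth and the median point becomes the node; choosing the longest dimension at each split, as the last bullet describes, would be an alternative. The names `KDNode`, `build`, and `nearest` are illustrative.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class KDNode:
    def __init__(self, point: Point, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Divide and conquer: split on one dimension per level at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return KDNode(pts[mid], axis,
                  build(pts[:mid], depth + 1),
                  build(pts[mid + 1:], depth + 1))


def nearest(node: Optional[KDNode], target: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```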

k-D Tree Example

Example of Tree-based Indexing: Quad Tree Indexing

• A Quadrant Tree can be used to index data records in a spatial database based upon each record’s spatial location

• A business or government agency may wish to query the database using spatial constraints.

• The Quad Tree (QT) facilitates the storage and querying of spatial data.

• For example:
  – The complex geographical boundaries of different states, counties, and cities may be difficult to specify in a database.
  – However, each address (home or business) in the database may be indexed with a QT value derived from its spatial location.
  – Then, in order to search for all addresses within some particular geographic boundary, one can simply query the database for records that have particular values of the quad tree index.
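
One way to derive a QT value from a location is sketched below. Each address gets a string of quadrant digits, one per level (0 = NW, 1 = NE, 2 = SW, 3 = SE), and a region query becomes a test on index values. This digit-path encoding, the map bounds, and the sample addresses are all assumptions for illustration; the lecture's figures use a different, sequential cell numbering.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def qt_index(x: float, y: float, bounds: Rect, levels: int) -> str:
    """Quad Tree index of point (x, y): one quadrant digit per level."""
    xmin, ymin, xmax, ymax = bounds
    digits = []
    for _ in range(levels):
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        east, north = x >= xm, y >= ym
        digits.append(str((0 if north else 2) + (1 if east else 0)))
        # Shrink the bounds to the chosen quadrant before the next level.
        xmin, xmax = (xm, xmax) if east else (xmin, xm)
        ymin, ymax = (ym, ymax) if north else (ymin, ym)
    return "".join(digits)


bounds = (0.0, 0.0, 8.0, 8.0)
locations: Dict[str, Tuple[float, float]] = {
    "12 Elm St": (1.5, 6.5),
    "9 Oak Ave": (6.2, 1.1),
    "3 Pine Rd": (3.5, 4.5),
}
index = {addr: qt_index(x, y, bounds, levels=3) for addr, (x, y) in locations.items()}
print(index)  # {'12 Elm St': '003', '9 Oak Ave': '330', '3 Pine Rd': '033'}

# A region described by level-2 cells is queried by matching index prefixes.
region = ("00", "03")
print([addr for addr, qt in index.items() if qt.startswith(region)])
# ['12 Elm St', '3 Pine Rd']
```

With this encoding a coarse cell is a prefix of every finer cell inside it, so the same stored index can answer queries at any level up to the depth that was stored.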

Quad Tree-based Spatial Mining

• The Quadrant Tree (QT) index can be applied to data records in multiple spatial databases.

• One can look for associations, or patterns, or clusters, or outliers, or nearest neighbors, etc. using the QT index values across the different databases.

• For example:
  – One can apply association mining to the spatial database(s) and find associations among different attributes for database records that correspond to the same location or same type of location (e.g., urban, rural, suburban,…).
  – One can apply a nearest neighbor search using the QT index as the query qualifier.
  – One can use spatial attributes in some of the nodes (question points) in a Decision Tree.

Quad Tree Example: this grid can be overlaid onto any geographic map.

[Figure: hierarchical Quad Tree grid with cells numbered 1 to 84, a pink circle marking the sales area, and three addresses labeled a, b, and c.]

The pink circle represents the sales area for our local retail business. Our company’s sales region may be described with the following sets of Quad Tree indices: {1,2} or {6,7,9,12} or {26,27,30,37,39,40,49,50} or with even higher-order indices (corresponding to finer grids). Therefore, if we convert the local phonebook into a high-order Quad-Tree (QT) indexed listing, then all residents whose addresses occur with any of these QT index values {26,27,30,37,39,40,49,50} are potential customers, so we will send them some of our company’s advertising mail.

• Address(a)={24} → No Mail

• Address(b)={30} → Send Mail

• Address(c)={75} → No Mail

Quad Tree example, continued

• In the simple example on the preceding slide, it was easy to see whether address (a), (b), or (c) was within the pink boundary (i.e., within the sales region of our company).

• But, in general, you do not have such a pretty map and picture to look at.

• In general, you only have the data in the database, and it might have thousands or millions of records.

• So, in general, the Quad Tree index provides a very rapid means to discover whether or not a database record is within a selected geographic region. Just check the numbers and you are done!

• The biggest decision is how deep to take the indices (i.e., how fine-grained to make the hierarchical grid in the Quad Tree). It is possible to go to very high levels of spatial resolution, requiring very large index numbers. But it is still easy to query even a huge database for the selected values.
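
"Just check the numbers" amounts to a set-membership test. The sketch below reuses the index values quoted in the example (sales region {26,27,30,37,39,40,49,50}; addresses a, b, c with values 24, 30, 75); the function and variable names are made up for illustration.

```python
SALES_REGION = frozenset({26, 27, 30, 37, 39, 40, 49, 50})

# QT index value of each address, as given in the example.
addresses = {"a": 24, "b": 30, "c": 75}


def should_send_mail(qt_value: int, region: frozenset = SALES_REGION) -> bool:
    """A record is in the sales region if its QT index is one of the region's cells."""
    return qt_value in region


decisions = {name: should_send_mail(qt) for name, qt in addresses.items()}
print(decisions)  # {'a': False, 'b': True, 'c': False}
```

In a relational database the same check would typically be an indexed lookup on the QT column, which is why it stays fast even for millions of records.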

Other Trees for Database Indexing

• What we have said in the preceding slides about Quad Trees is also applicable to R-Trees, k-D Trees, and other tree-based indices used in spatial databases.

• There are many other types of trees used to index more general databases, including the B-tree, hB-tree, R-tree, R+-tree, R*-tree, R**-tree, packed R-tree, M-tree, SR-tree, SS-tree, RD-tree, BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, KDB-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q-tree, SKD-tree, TV-tree, UB-tree, Z-order index, etc.

Spatial Data Mining Primitives

• Orientation relationships:
  – North, South, East, West

• Topological relationships:
  – see next slide

• Distance measures

Topological Relationships

• Disjoint

• Overlaps or Intersects

• Equals

• Covered by or inside or contained in

• Covers or contains
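
For axis-aligned rectangles, these five relationships can be decided with a few comparisons. The sketch below is one possible classification; it treats rectangles that merely touch along an edge as overlapping, which is a choice made for this example and not something the slide specifies.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def covers(outer: Rect, inner: Rect) -> bool:
    """True if inner lies entirely within outer (boundaries may coincide)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def relation(a: Rect, b: Rect) -> str:
    """Topological relationship of rectangle a with respect to rectangle b."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if covers(b, a):
        return "inside"      # a is covered by / contained in b
    if covers(a, b):
        return "contains"    # a covers / contains b
    return "overlaps"


print(relation((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint
print(relation((0, 0, 2, 2), (1, 1, 3, 3)))  # overlaps
print(relation((0, 0, 4, 4), (0, 0, 4, 4)))  # equals
print(relation((1, 1, 2, 2), (0, 0, 4, 4)))  # inside
print(relation((0, 0, 4, 4), (1, 1, 2, 2)))  # contains
```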

Distance Between Objects

• Much of this is the same material we covered previously.

• Can use cluster distances (Single Link, etc. - see page 130)

• Euclidean, Manhattan, etc.

• Some special spatial extensions:

Aggregate Proximity

• Aggregate Proximity – a measure of how close a cluster is to a feature.

• Aggregate proximity relationship finds the k closest features to a cluster.

• CRH Algorithm (Circle, Rectangle, Hull) – approximates clusters and features with different shapes:
  – Encompassing Circle
  – Isothetic Rectangle = rectangle with edges parallel to the principal axes
  – Convex Hull

CRH example: mathematical formulae exist to calculate easily the distance between any two convex hulls, or circles, or rectangles
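
Two of those distance formulae are sketched here: the gap between two isothetic rectangles and the gap between two circles. The encompassing circle below is centered on the rectangle's center and reaches the farthest point, so it encloses the points but is not guaranteed to be the smallest such circle; that shortcut, the function names, and the sample point sets are assumptions for illustration. Convex hulls are omitted to keep the example short.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Circle = Tuple[float, float, float]        # (center_x, center_y, radius)


def isothetic_rectangle(points: List[Point]) -> Rect:
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def encompassing_circle(points: List[Point]) -> Circle:
    """An enclosing circle (not necessarily minimal), centered on the rectangle."""
    xmin, ymin, xmax, ymax = isothetic_rectangle(points)
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    radius = max(math.hypot(x - cx, y - cy) for x, y in points)
    return (cx, cy, radius)


def rect_distance(a: Rect, b: Rect) -> float:
    """Shortest distance between two rectangles (0 if they intersect)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)
    return math.hypot(dx, dy)


def circle_distance(a: Circle, b: Circle) -> float:
    """Shortest distance between two circles (0 if they intersect)."""
    return max(0.0, math.hypot(a[0] - b[0], a[1] - b[1]) - a[2] - b[2])


cluster = [(0, 0), (2, 0), (2, 2), (0, 2)]
feature = [(5, 0), (7, 0), (7, 2), (5, 2)]

print(rect_distance(isothetic_rectangle(cluster), isothetic_rectangle(feature)))  # 3.0
print(circle_distance(encompassing_circle(cluster), encompassing_circle(feature)))
```

The two shapes give different answers for the same data (3.0 for rectangles, about 2.17 for circles), which shows that the choice of approximating shape affects the proximity estimate.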

Concept Hierarchies

• Specialization (Progressive Refinement) = move down the hierarchy

• Generalization = move up the hierarchy

• Similar to “roll-up” and “drill-down”

• An implementation: STING

Progressive Refinement

• Produce approximate answers before more accurate ones.

• Filter out data that are not part of answer.

• Hierarchical view of data based on spatial relationships

• Coarse predicate recursively refined
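
A two-step "filter and refine" sketch of this idea: a cheap, coarse predicate (is the point inside the region's bounding rectangle?) discards most non-answers, and an exact but more expensive predicate (ray-casting point-in-polygon) is applied only to the survivors. The triangular region and the points are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def in_bounding_box(pt: Point, poly: List[Point]) -> bool:
    """Coarse predicate: inside the polygon's minimum bounding rectangle?"""
    xs, ys = zip(*poly)
    return min(xs) <= pt[0] <= max(xs) and min(ys) <= pt[1] <= max(ys)


def point_in_polygon(pt: Point, poly: List[Point]) -> bool:
    """Exact predicate: ray casting (counts edge crossings to the right of pt)."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


region = [(0, 0), (4, 0), (0, 4)]          # a triangle
points = [(1, 1), (3, 3), (5, 5)]

candidates = [p for p in points if in_bounding_box(p, region)]   # approximate answer
answers = [p for p in candidates if point_in_polygon(p, region)]  # refined answer

print(candidates)  # [(1, 1), (3, 3)]
print(answers)     # [(1, 1)]
```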

Example

STING

• STatistical INformation Grid-based

• Hierarchical technique to divide area into rectangular cells

• Grid data structure contains summary information about each cell

• Hierarchical clustering

• Similar to Quad Tree
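
The key property of such a grid is that a parent cell's summary can be computed from its four children without rereading the data. The sketch below keeps only count, sum, minimum, and maximum per cell; which statistics a STING implementation stores is a design detail, so treat this field list and the names `Summary`, `bottom_level`, and `parent_level` as assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Summary:
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


def merge(cells: List[Summary]) -> Summary:
    """Combine summaries; used both to add a record and to build parent cells."""
    return Summary(sum(c.count for c in cells), sum(c.total for c in cells),
                   min(c.minimum for c in cells), max(c.maximum for c in cells))


def bottom_level(records: List[Tuple[float, float, float]],
                 size: float, n: int) -> List[List[Summary]]:
    """n x n grid over [0, size) x [0, size); records are (x, y, value)."""
    grid = [[Summary() for _ in range(n)] for _ in range(n)]
    for x, y, value in records:
        row, col = int(y / size * n), int(x / size * n)
        grid[row][col] = merge([grid[row][col], Summary(1, value, value, value)])
    return grid


def parent_level(grid: List[List[Summary]]) -> List[List[Summary]]:
    """Each parent cell summarizes a 2 x 2 block of child cells."""
    half = len(grid) // 2
    return [[merge([grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
                    grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1]])
             for c in range(half)] for r in range(half)]


records = [(0.5, 0.5, 10.0), (1.5, 0.5, 20.0), (3.5, 3.5, 30.0), (0.2, 1.8, 40.0)]
level2 = bottom_level(records, size=4.0, n=4)   # 4 x 4 cells
level1 = parent_level(level2)                   # 2 x 2 cells
top = parent_level(level1)                      # 1 cell for the whole area

print(level1[0][0].count, level1[0][0].minimum, level1[0][0].maximum)  # 3 10.0 40.0
print(top[0][0].count, top[0][0].mean)                                 # 4 25.0
```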

Nodes in STING data structure:

Spatial Data Mining Algorithms

• Most traditional methods still apply, with special “features” to deal with spatial information (geographic / topological metadata):
  – Association Rules
  – Clustering
  – Classification
  – Decision Trees
  – Neural Nets
  – Bayes Networks

• Refer to text and …

• If you are interested in this subject, try a Google search on “Spatial Data Mining”.

Spatial Rules

• Characteristic Rule :

The average family income in Dallas is $50,000.

• Discriminant Rule :

The average family income in Dallas is $50,000, while in Plano the average income is $75,000.

• Association Rule :

The average family income in Dallas for families living near White Rock Lake is $100,000.

Spatial Association Rules

• Either antecedent or consequent must contain spatial predicates.

• Views the underlying database as a set of spatial objects.

• May generate these rules using a type of progressive refinement
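
As a toy illustration, the sketch below evaluates one rule with a spatial predicate in the antecedent, "close_to(house, lake) implies expensive(house)", by computing its support and confidence. The house records, the lake location, the 2-unit radius, and the price threshold are all invented for this example.

```python
import math
from typing import List, Tuple

House = Tuple[float, float, int]  # (x, y, price)

houses: List[House] = [
    (1.0, 0.0, 350_000),
    (0.0, 1.5, 320_000),
    (1.0, 1.0, 250_000),
    (5.0, 5.0, 150_000),
    (6.0, 1.0, 400_000),
]
lake = (0.0, 0.0)


def close_to(house: House, place: Tuple[float, float], radius: float = 2.0) -> bool:
    """Spatial predicate: within `radius` (Euclidean) of the place."""
    return math.dist(house[:2], place) <= radius


def expensive(house: House, threshold: int = 300_000) -> bool:
    """Non-spatial predicate on the price attribute."""
    return house[2] >= threshold


near_lake = [h for h in houses if close_to(h, lake)]
near_and_expensive = [h for h in near_lake if expensive(h)]

support = len(near_and_expensive) / len(houses)        # fraction of all houses
confidence = len(near_and_expensive) / len(near_lake)  # fraction of houses near the lake

print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.40, confidence=0.67
```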

Spatial Classification

• Partitions the spatial objects into categories

• May use nonspatial attributes and/or spatial attributes

• Generalization and progressive refinement may be used.

Spatial Decision Tree

• Approach similar to that used for spatial association rules.

• A spatial object can be described in terms of the objects closest to it; this neighborhood is called its buffer.

• Description of class based upon aggregation of nearby objects.

Spatial Clustering

• Detect clusters of irregular shapes.

• Use of centroids and simple distance approaches may not work well.

• Clusters should be independent of order of input.

Summary

Summary of Topics Covered - Lecture 9

• Spatial Mining:
  – What is it?
  – How is it different?

• Spatial Queries

• Indexing

• Spatial Data Mining Primitives

• Concept Hierarchies

• Techniques for Spatial Mining