By Dr. Borne 2005 UMUC Data Mining Lecture 9 1
Data Mining UMUC CSMN 667
Lecture #9
Lecture 9
“Spatial Data Mining”
[Figure: regional prior probability map with surface classes Water, Coast, Land, Desert, Cloud, Snow/Ice, Glint]
Outline
• Spatial Mining:
– What is it?
– How is it different?
• Spatial Queries
• Indexing
• Spatial Data Mining Primitives
• Concept Hierarchies
• Techniques for Spatial Mining
What is Spatial Mining?
• Spatial Mining = Mining Spatial Data Sets
• Spatial data refer to any data about objects that occupy real physical space.
• Attributes for spatial data usually will include spatial information.
• Spatial information (metadata) is used to describe objects in space.
• Spatial information includes geometric metadata (e.g., location, shape, size, distance, area, perimeter) and topological metadata (e.g., “neighbor of”, “adjacent to”, “included in”, “includes”).
Spatial Mining: how is it different?
• Spatial data types naturally carry linkages to neighboring data elements (e.g., contiguous geographic positions).
• Spatial information corresponds to unique attributes and unique relationships between attributes that are not normally found in other databases.
• In many cases, spatial data are stored as rasters: rows and columns, with (x,y) location information implicit in the placement within the raster (e.g., Remote Sensing images).
– (x,y) values are not stored anywhere in the database.
– Special extraction tools are needed.
• These differences can pose challenges to standard data mining algorithms.
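Because a raster stores location only implicitly, an extraction step has to turn a (row, column) position into map coordinates. The following is a minimal sketch, assuming a north-up raster whose upper-left corner and pixel size are known; the function name and the sample numbers are hypothetical and not taken from the lecture.

```python
def pixel_to_lonlat(row, col, origin_lon, origin_lat, pixel_width, pixel_height):
    """Return the (lon, lat) of the centre of raster cell (row, col).

    Assumes a north-up raster: the origin is the upper-left corner,
    columns increase eastward and rows increase southward.
    """
    lon = origin_lon + (col + 0.5) * pixel_width
    lat = origin_lat - (row + 0.5) * pixel_height
    return lon, lat


# Hypothetical 0.25-degree raster whose upper-left corner is at (-80.0, 40.0).
print(pixel_to_lonlat(0, 0, -80.0, 40.0, 0.25, 0.25))
print(pixel_to_lonlat(2, 4, -80.0, 40.0, 0.25, 0.25))
```

Real imagery may also be rotated or stored in a map projection, in which case a full affine or projection transform would be needed instead of this simple offset-and-scale.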
Spatial Mining: more challenges
• Additional challenges …
• How do you index these data collections?
– Perhaps you can use a spatial index (e.g., Latitude, Longitude)
• Raster [x,y] value almost never equals [Latitude, Longitude]!
• How do you describe the spatial relationships among data items? What attributes should you use?
– These attributes are chosen and controlled by each organization.
– Spatial data mining algorithms won’t know their meaning.
• Almost any data collected by, for, or about human society can be associated with a geo-location. Special GIS (Geographic Information Systems) tools are, or can be, used: http://www.gis.com/
• As a consequence, spatial data repositories are HUGE and growing.
Special Cases
• Image databases (Earth or the Sky)
• Thematic maps (values of attributes or “themes” are displayed in a spatial distribution = a map!)
Spatial Queries
• Queries must be able to handle both spatial and non-spatial attributes
• Queries may include range or region constraints indirectly (e.g., find all zip codes in the Rocky Mountains, or around Lake Michigan)
• A Nearest Neighbor query is aptly named in this setting: it really is the NEAREST neighbor that you are seeking!
• The distance metric is usually a true physical distance (Euclidean spatial distance)
Region Queries
• Uses concept of MBR (Minimum Bounding Rectangle)…
• Can be used to index the spatial database: you need to index the database in order to find records.
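The MBR idea can be sketched in a few lines: each object is summarized by its bounding rectangle, and a region query keeps only the objects whose rectangle intersects the query rectangle. This is an illustrative sketch only; the function names and the sample objects are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def region_query(objects, query_rect):
    """Names of the objects whose MBR intersects the query rectangle."""
    return [name for name, pts in objects.items()
            if intersects(mbr(pts), query_rect)]


# Hypothetical objects described by their vertex lists.
objects = {
    "lake": [(1, 1), (3, 2), (2, 4)],
    "park": [(6, 6), (8, 7), (7, 9)],
}
print(region_query(objects, (0, 0, 2, 2)))
```

An MBR match is only a candidate: the object itself may still miss the query region, so an exact geometric test would normally follow.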
Spatial Database Indexing
• Trees are frequently used to index spatial data.
• Quad Tree: based upon assigning data to spatial quadrants
• R-Tree: based on the ranges of values (e.g., Lat, Long) spanned by a set of MBRs.
• k-D Tree: a binary search tree in k dimensions
• … and many more
• Searching a tree-based index is fast: find all intersections at the high level; ignore the rest!
Quad Tree
• Quadrant Tree
• A Quad Tree is a hierarchical decomposition of the space into quadrants.
• Each level in the tree represents the object as being equivalent to the set of quadrants that contain any portion of the object.
• Each finer-grained level provides a more exact representation of the object.
• The number of levels used is determined by the degree of accuracy desired.
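The decomposition described above can be sketched recursively. This is a minimal illustration, assuming the object is a rectangle inside the unit square; the digit labelling of quadrants is a hypothetical scheme chosen for the sketch, not the numbering used in the lecture figures.

```python
def quadrants_covering(obj, cell=(0.0, 0.0, 1.0, 1.0), depth=2, label=""):
    """Labels of the quadrants `depth` levels down that overlap rectangle `obj`.

    Rectangles are (xmin, ymin, xmax, ymax). Each level appends one digit:
    0 = lower-left, 1 = lower-right, 2 = upper-left, 3 = upper-right.
    """
    x0, y0, x1, y1 = cell
    # Skip cells that share no area with the object.
    if obj[0] >= x1 or obj[2] <= x0 or obj[1] >= y1 or obj[3] <= y0:
        return []
    if depth == 0:
        return [label]
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    children = [
        (x0, y0, xm, ym), (xm, y0, x1, ym),
        (x0, ym, xm, y1), (xm, ym, x1, y1),
    ]
    found = []
    for digit, child in enumerate(children):
        found += quadrants_covering(obj, child, depth - 1, label + str(digit))
    return found


obj = (0.1, 0.1, 0.3, 0.6)
print(quadrants_covering(obj, depth=1))  # coarse representation
print(quadrants_covering(obj, depth=2))  # finer representation
```

Raising `depth` gives a tighter fit to the object at the cost of a longer list of quadrants, which is the accuracy trade-off the last bullet describes.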
Quad Tree Example
[Figure: “Indexing the Triangle”. Quadrant index sets at successively finer levels: (2,3); (12,13,14); (49,52,53,54,…)]
R-Tree
• The R-Tree (Rectangle Tree) uses ranges of coordinate values to build the index.
• As with Quad Tree, the region is divided into successively smaller rectangles (MBRs), containing the data items.
• Rectangles need not be of the same size or quantity at each level of the tree.
• Rectangles may actually overlap.
• Lowest level cell has only one object.
• Uses tree maintenance algorithms similar to those for B-trees (= the balanced multiway search trees traditionally used in non-spatial databases).
R-Tree Example
k-D Tree
• k-Dimensional index (for multi-dimensional data … dimensions can be 2, 3, or much more)
• Designed for multi-attribute data, not necessarily spatial
• Uses the “Divide and Conquer” approach
• This is a variation of the binary search tree
• Each level is used to index one of the dimensions of the spatial object
• Lowest level cell has only one object
• Divisions are not based on MBRs, but based on successive divisions of the longest dimension.
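A k-D tree and a nearest-neighbor search over it can be sketched as follows. This sketch makes a simplifying assumption: it cycles through the dimensions level by level rather than always splitting the longest dimension, and it stores one point per node. The function names and sample points are hypothetical.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-d tree as nested dicts, cycling the split dimension by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to `target` (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))
```

The pruning test in `nearest` is what makes the index pay off: whole subtrees are skipped when they cannot contain anything closer than the current best point.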
k-D Tree Example
Example of Tree-based Indexing: Quad Tree Indexing
• A Quadrant Tree can be used to index data records in a spatial database based upon each record’s spatial location.
• A business or government agency may wish to query the database using spatial constraints.
• The Quad Tree (QT) facilitates the storage and querying of spatial data.
• For example:
– The complex geographical boundaries of different states, counties, and cities may be difficult to specify in a database.
– However, each address (home or business) in the database may be indexed with a QT value derived from its spatial location.
– Then, in order to search for all addresses within some particular geographic boundary, one can simply query the database for records that have particular values of the quad tree index.
Quad Tree-based Spatial Mining
• The Quadrant Tree (QT) index can be applied to data records in multiple spatial databases.
• One can look for associations, or patterns, or clusters, or outliers, or nearest neighbors, etc. using the QT index values across the different databases.
• For example:
– One can apply association mining to the spatial database(s) and find associations among different attributes for database records that correspond to the same location or same type of location (e.g., urban, rural, suburban,…).
– One can apply a nearest neighbor search using the QT index as the query qualifier.
– One can use spatial attributes in some of the nodes (question points) in a Decision Tree.
Quad Tree Example: this grid can be overlaid onto any geographic map.
[Figure: three-level Quad Tree grid. Cells are numbered 1 to 4 at the coarsest level, 5 to 20 at the next level, and 21 to 84 at the finest level. A pink circle marks the sales region, and points a, b, and c mark three addresses.]
The pink circle represents the sales area for our local retail business. Our company’s sales region may be described with the following sets of Quad Tree indices: {1,2} or {6,7,9,12} or {26,27,30,37,39,40,49,50} or with even higher-order indices (corresponding to finer grids). Therefore, if we convert the local phonebook into a high-order Quad-Tree (QT) indexed listing, then all residents whose addresses occur with any of these QT index values {26,27,30,37,39,40,49,50} are potential customers. Thus, we will send them some of our company’s advertising mail.
Address(a) = {24} → No Mail
Address(b) = {30} → Send Mail
Address(c) = {75} → No Mail
Quad Tree example, continued
• In the simple example on the preceding slide, it was obvious from the picture whether address (a), (b), or (c) was within the pink boundary (i.e., within the sales region of our company).
• But, in general, you do not have such a pretty map and picture to look at.
• In general, you only have the data in the database, and it might have thousands or millions of records.
• So, in general, the Quad Tree index provides a very rapid means to discover whether or not a database record is within a selected geographic region. Just check the numbers and you are done!
• The biggest decision is what level to take the indices to (i.e., how fine-grained you want to make the hierarchical grid in the Quad Tree). It is possible to go to very high levels of spatial resolution, requiring very large index numbers. But it is still easy to query even a huge database for the selected values.
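"Just check the numbers" amounts to a set-membership test on the stored index. A minimal sketch using the index values from the sales-region example; the dictionary layout and the function name are assumptions made for illustration.

```python
# Finest-level Quad Tree cells covering the sales region (values from the example).
sales_region = {26, 27, 30, 37, 39, 40, 49, 50}

# Each record carries a precomputed QT index; a, b, c are the example addresses.
addresses = {"a": 24, "b": 30, "c": 75}


def mail_decision(qt_index, region):
    """A record lies in the region when its QT index is one of the region's cells."""
    return "Send Mail" if qt_index in region else "No Mail"


for name, qt_index in addresses.items():
    print(name, qt_index, mail_decision(qt_index, sales_region))
```

In a relational database the same test would typically be a filter on the indexed QT column, so no geometry needs to be evaluated at query time.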
Other Trees for Database Indexing
• What we have said in the preceding slides about Quad Trees is also applicable to R-Trees, k-D Trees, and other tree-based indices used in spatial databases.
• There are many other types of trees used to index more general databases, including the B-tree, hB-tree, R-tree, R+-tree, R*-tree, R**-tree, packed R-tree, M-tree, SR-tree, SS-tree, RD-tree, BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, KDB-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q-tree, SKD-tree, TV-tree, UB-tree, Z-order index, etc.
Spatial Data Mining Primitives
• Orientation relationships:
– North, South, East, West
• Topological relationships:
– see next slide
• Distance measures
Topological Relationships
• Disjoint
• Overlaps or Intersects
• Equals
• Covered by or inside or contained in
• Covers or contains
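For axis-aligned rectangles, the relationships listed above reduce to coordinate comparisons. The following is a sketch under that assumption (general polygons need real geometric tests); the function name and return labels are hypothetical, and rectangles that merely touch are reported as overlapping.

```python
def topological_relation(a, b):
    """Classify rectangle a against rectangle b; both are (xmin, ymin, xmax, ymax)."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"      # a is covered by b
    if a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]:
        return "contains"    # a covers b
    return "overlaps"        # shared points, but neither covers the other


print(topological_relation((0, 0, 1, 1), (2, 2, 3, 3)))
print(topological_relation((0, 0, 2, 2), (1, 1, 3, 3)))
print(topological_relation((1, 1, 2, 2), (0, 0, 3, 3)))
```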
Distance Between Objects
• Much of this repeats material covered previously.
• Can use cluster distances (Single Link, etc. - see page 130)
• Euclidean, Manhattan, etc.
• Some special spatial extensions:
Aggregate Proximity
• Aggregate Proximity – a measure of how close a cluster is to a feature.
• Aggregate proximity relationship finds the k closest features to a cluster.
• CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle = rectangle with edges parallel to the principal axes
– Convex Hull
CRH example: mathematical formulae exist to easily calculate the distance between any two convex hulls, or circles, or rectangles
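Two of these distance formulae are short enough to sketch directly: the gap between two circles and the gap between two isothetic rectangles. This illustrates the geometry only and is not the CRH algorithm itself; the function names and sample shapes are hypothetical.

```python
import math


def circle_distance(c1, c2):
    """Gap between two circles (x, y, radius); 0 if they touch or overlap."""
    centre_gap = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    return max(0.0, centre_gap - c1[2] - c2[2])


def rectangle_distance(a, b):
    """Gap between two isothetic rectangles (xmin, ymin, xmax, ymax); 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)  # horizontal separation
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)  # vertical separation
    return math.hypot(dx, dy)


print(circle_distance((0, 0, 1), (5, 0, 2)))
print(rectangle_distance((0, 0, 1, 1), (4, 5, 6, 7)))
```

Circles and rectangles are cheap to compare, which is why they are useful as coarse stand-ins before the more expensive convex hull comparison.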
Concept Hierarchies
• Specialization (Progressive Refinement) = move down the hierarchy
• Generalization = move up the hierarchy
• Similar to “roll-up” and “drill-down”
• An implementation: STING
Progressive Refinement
• Produce approximate answers before more accurate ones.
• Filter out data that are not part of answer.
• Hierarchical view of data based on spatial relationships
• Coarse predicate recursively refined
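A two-step filter-and-refine search is one simple way to picture progressive refinement: a cheap coarse predicate discards most of the data, and the exact predicate runs only on the survivors. This is a sketch with a single refinement step, assuming point data and a circular query; the function name and sample points are hypothetical.

```python
import math


def within_radius(points, centre, radius):
    """Two-step search: cheap bounding-square filter, then exact distance check."""
    cx, cy = centre
    # Coarse predicate: keep only points inside the circle's bounding square.
    candidates = [
        (x, y) for x, y in points
        if abs(x - cx) <= radius and abs(y - cy) <= radius
    ]
    # Refined predicate: exact Euclidean test on the surviving candidates.
    return [p for p in candidates if math.dist(p, centre) <= radius]


points = [(1, 1), (4, 4), (5, 0), (9, 9)]
print(within_radius(points, (0, 0), 5))
```

Here (4, 4) passes the coarse filter but fails the exact test, which shows why the coarse answer is only approximate.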
Example
STING
• STatistical INformation Grid-based
• Hierarchical technique to divide area into rectangular cells
• Grid data structure contains summary information about each cell
• Hierarchical clustering
• Similar to Quad Tree
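The key property of such a grid is that a parent cell's summary can be computed from its children's summaries without rereading the data. A minimal sketch of that idea follows; the particular statistics kept here (count, sum, min, max) and the sample values are assumptions for illustration.

```python
def leaf_stats(values):
    """Summary stored in a bottom-level cell."""
    if not values:
        return {"count": 0, "sum": 0.0, "min": None, "max": None}
    return {"count": len(values), "sum": float(sum(values)),
            "min": min(values), "max": max(values)}


def merge_stats(children):
    """Parent summary computed from child summaries only."""
    nonempty = [c for c in children if c["count"] > 0]
    if not nonempty:
        return leaf_stats([])
    return {
        "count": sum(c["count"] for c in nonempty),
        "sum": sum(c["sum"] for c in nonempty),
        "min": min(c["min"] for c in nonempty),
        "max": max(c["max"] for c in nonempty),
    }


# Hypothetical attribute values falling in the four child cells of one parent cell.
children = [leaf_stats(v) for v in ([3, 5], [10], [], [2, 8, 8])]
parent = merge_stats(children)
print(parent["count"], parent["sum"] / parent["count"], parent["min"], parent["max"])
```

A query can then start at the top of the hierarchy and descend only into cells whose summaries suggest they are relevant.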
Nodes in STING data structure:
Spatial Data Mining Algorithms
• Most traditional methods still apply, with special “features” to deal with spatial information (geographic / topological metadata):
– Association Rules
– Clustering
– Classification
– Decision Trees
– Neural Nets
– Bayes Networks
• Refer to text and …
• If you are interested in this subject, try a Google search on “Spatial Data Mining”.
Spatial Rules
• Characteristic Rule :
The average family income in Dallas is $50,000.
• Discriminant Rule :
The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
• Association Rule :
The average family income in Dallas for families living near White Rock Lake is $100,000.
Spatial Association Rules
• Either antecedent or consequent must contain spatial predicates.
• Views the underlying database as a set of spatial objects.
• May generate these rules using a type of progressive refinement
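A rule with a spatial predicate in the antecedent can be scored with the usual support and confidence measures. The sketch below uses entirely hypothetical household records and thresholds; only the structure (spatial predicate implies non-spatial predicate) reflects the slide.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    both = sum(1 for r in records if antecedent(r) and consequent(r))
    ante = sum(1 for r in records if antecedent(r))
    support = both / len(records)
    confidence = both / ante if ante else 0.0
    return support, confidence


# Hypothetical households: distance to the nearest lake (km) and family income.
households = [
    {"lake_km": 0.5, "income": 110_000},
    {"lake_km": 0.8, "income": 95_000},
    {"lake_km": 1.5, "income": 60_000},
    {"lake_km": 6.0, "income": 45_000},
    {"lake_km": 9.0, "income": 105_000},
]


def near_lake(r):
    """Spatial predicate (the 2 km threshold is an assumption)."""
    return r["lake_km"] <= 2.0


def high_income(r):
    """Non-spatial predicate (the income threshold is an assumption)."""
    return r["income"] >= 90_000


print(rule_metrics(households, near_lake, high_income))
```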
Spatial Classification
• Partitions the spatial objects into categories
• May use nonspatial attributes and/or spatial attributes
• Generalization and progressive refinement may be used.
Spatial Decision Tree
• Approach similar to that used for spatial association rules.
• Spatial objects can be described based on the objects closest to them; the surrounding area is called the object’s Buffer.
• Description of class based upon aggregation of nearby objects.
Spatial Clustering
• Detect clusters of irregular shapes.
• Use of centroids and simple distance approaches may not work well.
• Clusters should be independent of order of input.
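One way to meet both requirements is to chain together points that lie within a distance threshold of each other, which is single-link clustering with a cutoff. That technique is chosen here for illustration and is not named on this slide; the function name, threshold, and sample points are hypothetical.

```python
import math


def link_clusters(points, eps):
    """Group points into clusters by chaining neighbours no farther than `eps`.

    Clusters are the connected components of the eps-neighbourhood graph, so
    they can take any shape and do not depend on the order of the input.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            i = stack.pop()
            members.append(points[i])
            neighbours = [j for j in unvisited
                          if math.dist(points[i], points[j]) <= eps]
            unvisited.difference_update(neighbours)
            stack.extend(neighbours)
        clusters.append(sorted(members))
    return sorted(clusters)


# An L-shaped chain of points plus a distant pair.
pts = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (8, 8), (8, 9)]
print(link_clusters(pts, eps=1.0))
```

The L-shaped chain comes out as a single cluster even though its two ends are far apart, which a centroid-and-radius description would not capture well.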
Summary
Summary of Topics Covered - Lecture 9
• Spatial Mining:
– What is it?
– How is it different?
• Spatial Queries
• Indexing
• Spatial Data Mining Primitives
• Concept Hierarchies
• Techniques for Spatial Mining