Tutorial on Geographic and Spatial Data Mining
Michael May
15th Italian Symposium on Advanced Database Systems - SEBD’07
Torre Canne, Italy
June 17th
2Michael May
Tutorial Geographic and Spatial Data Mining
Fraunhofer Society
Joseph von Fraunhofer, German physicist and entrepreneur
Fraunhofer mission:
- do state-of-the-art research and use it in challenging customer projects
- Funding is 33% research grants, 33% customer projects, 33% institutional funding
57 institutes, 40 locations, 12,000 employees, €1 billion annual volume
Best-known invention: MP3
Fraunhofer IAIS: Intelligent Analysis and Information Systems
"From sensor data to business intelligence, from media analysis to visual information systems: our research allows companies to do more with data"
New name, long-standing experience
- Founded in 2006 as a merger of the Fraunhofer institutes AIS and IMK
230 people: scientists, project engineers, technical and administrative staff
Located on the Fraunhofer campus Schloss Birlinghoven/Bonn
Joint research groups and cooperation with Univ. Bonn
Fraunhofer IAIS: research and projects
Core research areas:
Machine learning and adaptive systems
Data Mining and Business Intelligence
Automated media analysis
Interactive access and exploration
Autonomous systems
Objectives
Although it covers statistical concepts, algorithms and data structures, the tutorial has a practical, application-oriented focus
Integration of various technologies and algorithms. How do they combine?
Covers a broad range
I do not assume familiarity with spatial concepts, but some basic familiarity with data mining approaches
Three Objectives:
- to stimulate research on spatial data mining related issues
- to stimulate development of more efficient spatial databases tailored for data mining applications
- to stimulate real-world applications
A main message
Spatial Data Mining is not an esoteric research topic; it is a practically and commercially very important, and sometimes business-critical, field!
Later I give an example where the value of several dozen companies directly depends on the predictions given by our spatial data mining algorithms.
Spatial vs. Geographic Data Mining
Geographic Data is data related to the earth
Spatial Data Mining deals with physical space in general, from molecular to astronomical level
Geographic Data Mining is a subset of Spatial Data Mining
Almost all geographic data mining algorithms can work in a general spatial setting (with the same dimensionality)
This tutorial focuses on geographic data in 2D, but most algorithms work on spatial data in general
I do not talk about the specifics of molecular data, face detection, etc.
Agenda
Introduction – Spatial and Geographic Data Mining
Part I: Basic Concepts – Spatial Databases and GIS
• Spatial Data Types
• Spatial Queries
• Construction of Complex Features
Part II: Exploratory Analysis of Spatial Data
Part III: Spatial and Geographic Data Mining Methods
• Autocorrelation
• Mining Point Data – Clustering, Kriging
• Mining Points, Lines, Areas – Clustering, Subgroup Discovery, Association Rules
• Mining Networks – A practical case study
• Mining Tracks in Space and Time – Mining from GPS Data
Challenges
Summary
Introduction – Spatial Data Mining
A classical example of spatial analysis
Dr. John Snow
Investigating causes of a cholera epidemic
London, September 1854
A good representation is often the key to solving a problem
Disease cluster
Infected water pump?
Good representation because...
Represents spatial relation of objects of the same type
Represents spatial relation of objects to other objects
It is important not only where a cluster is, but also what else is there (e.g. a water pump)!
Shows only relevant aspects and hides irrelevant
Goals of Spatial Data Mining
Identifying spatial patterns
Identifying spatial objects that are potential generators of patterns
Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)
Presenting the information in a way that is intuitive to the analyst and supports further analysis
Spatial Data Mining
Data Mining
+Geographic Information Systems
= Spatial Mining
Basic Concepts Spatial Databases and GIS
Commercial
Where to build a new supermarket?
Where are the customers that want to buy new product X?
How many cars pass the main road per hour?
Does it pay to install new antennas?
What percentage of young females sees a billboard located on Ripley Avenue?
Public Sector
Are there clusters of a certain disease?
Is there a relationship between poverty and death rate?
Are there crime hot spots or patterns?
Geographic layers (figure labels): Buildings, Rivers, Streets, Schools, Hospitals, Factory
Attribute data: persons per household, no. of cars, long-term illness, age, profession, ethnic group, unemployment, education, migrants, medical establishments, shopping areas, ...
Elements of a spatial database
Spatial Operators
Spatial Data Types
Spatial Indexes
Spatial Query Language
Metadata
Examples from Oracle Spatial (the figure illustrates the INSIDE operator):
SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
  AND SDO_WITHIN_DISTANCE(c.location, b.location,
      'distance=2 unit=mile') = 'TRUE'
Spatial Datatypes
Two basic types of representation: Fields and Discrete Objects
Fields: raster data model
Discrete objects: vector data model (figure labels: Area, Line)
Vector Data: Data Structure
Ordered sets of xy-coordinates defining points, lines, or polygons
3D or 4D also possible
Easy to scale (linear transformation)
Storage efficient
Relationships between objects (e.g. overlap) are not explicitly represented
Also known as the "spaghetti model"
Data structure:
- Point: (5,10)
- Line (polyline): ((5,10),(9,16),(12,17)), with straight lines between points
- Area (polygon): ((5,10),(9,16),(12,17), …), with a line drawn from the last to the first coordinate
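As a minimal sketch of this data structure (not taken from the tutorial), the three geometry types can be held as plain Python coordinate lists. The function names `polyline_length`, `polygon_area` and `scale` are illustrative assumptions; the coordinates are those shown on the slide.

```python
import math

# Vector geometries as ordered coordinate lists (values from the slide).
point = (5, 10)
polyline = [(5, 10), (9, 16), (12, 17)]
polygon = [(5, 10), (9, 16), (12, 17)]  # ring closed implicitly: last -> first


def polyline_length(coords):
    """Sum of the straight-line segment lengths between consecutive points."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def polygon_area(ring):
    """Shoelace formula; the closing edge runs from the last to the first point."""
    total = 0.0
    for i, (x1, y1) in enumerate(ring):
        x2, y2 = ring[(i + 1) % len(ring)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def scale(coords, factor):
    """Scaling is a linear transformation applied to every coordinate."""
    return [(x * factor, y * factor) for x, y in coords]
```

Nothing in these lists records whether two objects overlap or touch; such relationships have to be computed when needed, which is the price of the compact storage.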
Two Main Types of Vector Data
- non-regular tessellations: closed polylines that partition the space
- discrete isolated objects: point, line, area (polygon)
Tessellations are very useful for aggregation of discrete objects and for feature extraction
UK, Greater Manchester, Stockport
Descriptions of objects are organized in relations (database tables)
Each row in a table describes one object
Different categories of objects are organized in separate relations, each having its own set of attributes.

Buildings (attributes: Geometry, Address, Type, …)
ID  Geometry         Address             Type
1   (1,1),(2,2),…    Gladstone Street 5  1
2   (3,3),(4,4),…    Islington Road 2    2
3   (5,5),(6,6),…    Ripley Avenue 23    1

Hospitals (attributes: Geometry, Address, Phone, #Beds, …)
ID  Geometry         Address        Phone
1   (1,1),(2,2),…    Stepping Hill  234567
2   (3,3),(4,4),…    Great Moore    567897

Rivers, Streets, Schools, Factory: further layers stored in the same way (columns ID, Geometry, Name, Type)
Hierarchy
Often data are organized in spatial hierarchies, e.g.
Country
State
Zip Area
Voting District
Parcel
Hierarchies may overlap
County
District2District1 Districtn
Ward1… Ward1Ward1
WardnWard1Ward2
UK census data
Representation of data in a spatial database
A set of relations R1,...,Rn such that each relation Ri has a geometry attribute Gi or an identifier Ai such that Ri can be linked (joined) to a relation Rk having a geometry attribute Gk
- Geometry attributes Gi consist of ordered sets of x,y-pairs defining points, lines, or polygons
- Different types of spatial objects are organized in different relations Ri (geographic layers), e.g. streets, rivers, enumeration districts, buildings, and
- each layer can have its own set of attributes A1,..., An and at most one geometry attribute G
This representation does not fit well with standard data mining approaches!
This is where the specific research challenge for geographical data mining comes from!
Raster Data
How to represent phenomena conceived as fields?
Divide the world into square cells
No variation within cells
Cell value may be average, max, min, sum, central point, …
Represent discrete objects as collections of one or more cells
Represent fields by assigning attribute values to cells
Figure: raster representation; each color represents a different value of a nominal-scale field. Legend: mixed conifer, Douglas fir, oak savannah, grassland (Longley et al. 2001)
Raster and Vector: Comparison
Raster model
Advantages:
• Simple data structure
• Simple logical and algebraic structures
Disadvantages:
• Large data volumes
• Imprecise geometry
• Expensive transformations of coordinates
• Implicit coordinates
Vector model
Advantages:
• Geometry specified by coordinates
• Topological relationships
• High geometric accuracy
• Storage efficient
Disadvantages:
• Complex data structure
• Compute-intensive logical and algebraic operations
Remember: "Raster is vaster and vector is correcter"
Spatial Queries
Spatial Queries
Problem: Vector data model does not explicitly capture relationships among objects.
They have to be inferred using spatial predicates
Spatial predicates evaluate to true or false for given objects
A query returns
- the set of objects for which the statement is true; or
- using aggregates, the [minimum, maximum, sum, average, …] over the object(s) for which the statement is true
Queries are evaluated using a spatial join among different relations (layers)
Here is where database technology and spatial indexing come in to do the job efficiently!
Still, they can be extremely time consuming!
Spatial Predicates: Egenhofer‘s 9-intersection model
Each object has an interior (i), an exterior (e) and a boundary (b)
This results in a 9-intersection matrix for the relation between two spatial objects A and B
A cell contains a 1 iff the intersection of the point sets is non-empty
In the matrices below, rows are the parts of A and columns are the parts of B.

A meets B:
     b  i  e
b    1  0  1
i    0  0  1
e    1  1  1

A overlaps B:
     b  i  e
b    1  1  1
i    1  1  1
e    1  1  1

A contains B:
     b  i  e
b    0  0  1
i    1  1  1
e    0  0  1
Spatial Predicates
A inside B, B contains A
A contains B, B inside A
A covered-by B, B covers A
A covers B, B covered-by A
A equals B, B equals A
A overlaps B, B overlaps A
A meets B, B meets A
A disjoint B, B disjoint A
9-intersection model for 2 regions (Egenhofer 1991)
Spatial Queries: Distance
Metric spaces:
→ Symmetry: d(i,j) = d(j,i)
→ Triangle inequality: d(i,k) ≤ d(i,j) + d(j,k)
- Euclidean distance: $d_e(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
Distance relation between polygons: minimum distance between any 2 points of the polygons
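A minimal sketch of both distance notions, assuming polygons are given as coordinate rings that do not intersect each other. The helper names (`euclid`, `point_segment_distance`, `polygon_distance`) are illustrative, not part of any spatial database API. For non-intersecting polygons the minimum is always attained at a vertex of one polygon, so checking every vertex against every edge of the other polygon is sufficient.

```python
import math


def euclid(p, q):
    """Euclidean distance d_e between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_segment_distance(p, a, b):
    """Distance from point p to the closest point on segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return euclid(p, a)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return euclid(p, (a[0] + t * dx, a[1] + t * dy))


def edges(ring):
    """Edges of a ring, including the closing edge last -> first."""
    return [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]


def polygon_distance(ring_a, ring_b):
    """Minimum distance between two non-intersecting polygons."""
    d_ab = min(point_segment_distance(p, a, b)
               for p in ring_a for a, b in edges(ring_b))
    d_ba = min(point_segment_distance(p, a, b)
               for p in ring_b for a, b in edges(ring_a))
    return min(d_ab, d_ba)
```

This brute-force check is quadratic in the number of vertices; a spatial database would first prune candidates with an index.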
Spatial Queries: Distance and Proximity
Select the nearest neighbor in space
Select all objects within a certain distance
Figure labels: Main Street, X, Distance, Hospital #1, Hospital #2
Example (Oracle Spatial): select all competitors and their locations within 2 miles of the bank with id 1604
SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
  AND SDO_WITHIN_DISTANCE(c.location, b.location,
      'distance=2 unit=mile') = 'TRUE'
Distance – non-metric
non metric spaces → Asymmetry: d(i,j) ≠ d(j,i) → triangle inequality does not hold
drive time
driving distance
costs
Stockport Database Schema
Attribute data: 95 tables with census data (TAB01, TAB61, TAB95, ...), ~8000 attributes, linked to the enumeration districts (ED) by a standard join on zone_id
Geographical layers: 85 tables, e.g. Water, River, Building, Street, Shopping Region, Vegetation, linked by spatial joins (predicates such as "inside" and "spatially interacts")
Spatial hierarchy:
• County
• District
• Wards
• Enumeration district
Relations between objects are implicit; very flexible and storage efficient, but compute intensive
Implementation of Spatial Databases
Many popular databases have spatial extensions by now:
Oracle Spatial
PostgreSQL
MySQL (since 4.1)
Construction of Complex Features
Spatial Functions
Example: Oracle Spatial 10g
Return a geometry:
- Union
- Difference
- Intersect
- XOR
- Buffer
- CenterPoint
- ConvexHull
Return a number:
- Length
- Area
- Distance
Figure panels: Original, Union, Difference, Intersect, XOR (source: http://colab.cim3.net/file/work/SICoP/2006-06-20/2006-06-21/xlopez06212006.ppt)
Construct new geometry objects from existing ones using point set theory
Efficient implementation using computational geometry
Constructing Cells: Buffer
How many competitors are in the catchment area of my shop?
= How many shops are within the buffer?
Simplistic approximation
Does not take account of barriers (rivers, highways)
Does not take into account road system
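The buffer question can be sketched as a simple radius test, assuming the shop and its competitors are points in a planar coordinate system. The function name `count_in_buffer` and the sample coordinates are hypothetical.

```python
import math


def count_in_buffer(shop, competitors, radius):
    """Number of competitor points inside the circular buffer around the shop."""
    return sum(1 for (x, y) in competitors
               if math.hypot(x - shop[0], y - shop[1]) <= radius)


# Hypothetical example: three competitors, buffer radius 5.
competitors = [(1, 1), (3, 4), (6, 0)]
inside = count_in_buffer((0, 0), competitors, 5)
```

As the slide notes, the circle ignores barriers and the road system, so the count is only a first approximation of the catchment area.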
Voronoi diagram
Which are my nearest competitors?
What is the coverage of my radio antenna?
= Find Voronoi neighbors
Approximation
Does not take account of barriers (rivers, highways)
Does not take into account road system
Decompose space into regions around each point in a set of points S such that all the points in the region around pi are closer to pi than to any other point in S
Complexity: $O(n \lg n)$
Related data structure: Delaunay triangulation (graph of Voronoi neighbors)
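The defining property of a Voronoi region (every location belongs to its closest site) can be illustrated without constructing the diagram. The sketch below assigns query points to their nearest site by brute force, which costs O(n) per query; it is not the O(n lg n) construction algorithm. The names `nearest_site` and `assign_to_cells` are illustrative.

```python
import math
from collections import defaultdict


def nearest_site(q, sites):
    """Index of the site whose Voronoi region contains q (ties go to the lower index)."""
    return min(range(len(sites)),
               key=lambda i: math.hypot(q[0] - sites[i][0], q[1] - sites[i][1]))


def assign_to_cells(points, sites):
    """Group points by the Voronoi cell (nearest site) they fall into."""
    cells = defaultdict(list)
    for p in points:
        cells[nearest_site(p, sites)].append(p)
    return dict(cells)


# Hypothetical example: three antennas and four customers.
sites = [(0, 0), (10, 0), (5, 8)]
customers = [(1, 1), (9, 1), (5, 6), (4, 0)]
cells = assign_to_cells(customers, sites)
```

Aggregating attribute values per cell in this way is one route from point data to area features.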
Drive-Time Zone (Dijkstra)
How many competitors are in the catchment area of my shop?
Realistic approximation
Take account of barriers (rivers, highways)
take into account road system, maximum speed on road
All street segments within a drive-time distance <= d from a given starting point
Use Dijkstra's algorithm
Complexity: $O(V^2)$ to $O(V \lg V + E)$, depending on the data structures used for implementation
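A minimal sketch of a drive-time zone, assuming the road network is a directed graph given as an adjacency dictionary with travel times as edge weights. The function name `drive_time_zone` and the toy network are hypothetical. This version uses a binary heap and returns reachable junctions rather than street segments.

```python
import heapq


def drive_time_zone(graph, start, max_time):
    """Dijkstra's algorithm, cut off at max_time.

    graph maps a node to a list of (neighbor, travel_time) pairs.
    Returns {node: shortest drive time} for all nodes within max_time.
    """
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph.get(node, []):
            nt = t + cost
            if nt <= max_time and nt < best.get(nbr, float("inf")):
                best[nbr] = nt
                heapq.heappush(heap, (nt, nbr))
    return best


# Hypothetical road network with drive times in minutes.
roads = {
    "shop": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1)],
    "c": [("d", 10)],
}
zone = drive_time_zone(roads, "shop", 5)
```

Because edges are directed, one-way streets and asymmetric travel times fit naturally, which a circular buffer cannot express.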
Pre-processing
Several of the feature extractions are computationally quite expensive (at least for large data sets) and there is often a combinatorial explosion of features that might be constructed.
Several strategies are used in Spatial Warehouse Design:
Selective Pre-processing: materializing important joins in advance (storage requirements!)
Approximate precomputing: e.g. using the minimum bounding rectangle to approximate a polygon
Schema Design (e.g. Star-Schema with selective materialization): Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
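The minimum bounding rectangle approximation can be sketched as a cheap filter step: only pairs whose rectangles intersect need the exact (expensive) geometric test afterwards. The function names and the layer dictionaries are illustrative assumptions.

```python
def mbr(coords):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_intersects(a, b):
    """True if two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(layer_a, layer_b):
    """Filter step of a spatial join: pairs whose MBRs intersect."""
    boxes_b = {name: mbr(geom) for name, geom in layer_b.items()}
    return [(name_a, name_b)
            for name_a, geom_a in layer_a.items()
            for name_b, box_b in boxes_b.items()
            if mbr_intersects(mbr(geom_a), box_b)]


# Hypothetical layers: one building, two zones.
buildings = {"building1": [(0, 0), (2, 0), (2, 2), (0, 2)]}
zones = {"zoneA": [(1, 1), (5, 1), (5, 5), (1, 5)],
         "zoneB": [(10, 10), (12, 10), (12, 12)]}
pairs = candidate_pairs(buildings, zones)
```

The filter can report false positives (rectangles intersect, polygons do not) but never misses a truly intersecting pair.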
Spatial Database of Vector Objects: Discussion
Relations between objects implicit
Very flexible: depending on the analysis task, different relationships can be constructed
Storage efficient; no overhead for storing relationship information
Compute intensive (thus spatial indexing is very important)
Consider what and when to materialize
Very rich possibilities to create new, non-trivial objects from existing ones
Makes feature extraction an important topic for Data Mining
Inherently multi-relational setting (but not first-order)
Could also be formulated in a deductive database setting
Interactive Visualization of Spatial Data –
Exploratory Data Analysis
(work by G. Andrienko & N. Andrienko, H. Voss and others at Fraunhofer IAIS)
For the theory behind CommonGIS, see the book
Andrienko, N. and Andrienko G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer, 2005
Geographic Information Systems and CommonGIS
Many commercial tools available
- ESRI ArcGIS
- MapInfo
- Intergraph
- Manifold
But CommonGIS is different and unique …
- Map-based exploratory data analysis
- Stresses interactive visualization and manipulation of statistical data in space
- Elaborate facilities for time-series visualization
CommonGIS can be acquired for non-commercial use by educational institutions for no fee
See web page www.commongis.com
CommonGIS
= Fraunhofer IAIS tool for map-based exploratory data analysis; combines interactive cartography and statistics
- Time-series visualization and analysis
- Combined vector-raster transformation
- Weighted sums
- Ideal point analysis
- Similarity analysis
- Dominant attribute
- Integration with Weka (clustering, decision trees)
Figure labels: multivariate, decision support, multi-dimensional
CommonGIS: Visual analysis of spatial data
Interactive spatial search for geographic objects and recognition of spatial patterns:
- dynamic choropleth maps, pie charts, bar charts, etc. with dynamic removal of outliers and dynamic queries
Comparison of attribute values of geographic objects (relations and correlations) and comparison of spatial patterns (spatial correlations):
- (linked) dynamic maps and interactive diagrams
- multiple (linked) dynamic maps
CommonGIS: Visual analysis of spatio-temporal data
CommonGIS as an interactive browser to study how a spatial pattern evolves over time:
time aware maps (animations)
time series charts
CommonGIS as an interactive browser for temporal behaviours of objects:
set of controls for analysing time intervals (object animations)
CommonGIS as an interactive browser of discrete space-time events to find spatio-temporal clusters:
space-time cube
Time Series: Sales per Shop and Product Category
Figure legend (shop types): bakery, stand-up café, sit-down café, terrace
Different time hierarchies (year, quarter, month, day, …)
CommonGIS: Data transformation
Transformation of data for further analysis:
Attribute transformations:
- calculate statistical indices
- transform and combine attribute data arithmetically
- dynamic classifiers (linked with dynamic choropleth map)
- cross classifiers (linked with dynamic choropleth map)
Geographic transformations:
- query, transform, combine, derive raster data
- illumination model
- raster -> vector transformations (i.e. raster -> area aggregation)
- point/line -> raster transformations
CommonGIS: Combination of Vector and Image Data
Geographic and Spatial Data Mining Methods
Autocorrelation
Spatial Variation
How are variables distributed in space?
Tobler‘s First Law of Geography:
"Everything is related to everything else, but near things are more related than distant things."
distribution of variables depends on space
variables are autocorrelated
Field Soil Moisture
Franke, diploma thesis, Leipzig Univ., 2006
Spatial Autocorrelation: Binary Example
binary attribute (blue, white)
autocorrelation to four immediate neighbors
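Moran index (here): $I = \frac{n_{equal} - n_{change}}{n_{equal} + n_{change}}$, where $n_{equal}$ and $n_{change}$ count the neighbor pairs with equal and with different values
Example patterns: I = 0.86, I = 0.39, I = 0.00, I = -1.00 (Goodchild, CATMOG, GeoBooks, Norwich, 1986)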
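The binary index I = (n_equal - n_change) / (n_equal + n_change) can be sketched for a grid of 0/1 cells, assuming each pair of horizontally or vertically adjacent cells is counted once. The function name `binary_autocorrelation` and the example grids are hypothetical, not the patterns from the slide.

```python
def binary_autocorrelation(grid):
    """(n_equal - n_change) / (n_equal + n_change) over 4-neighbor cell pairs."""
    rows, cols = len(grid), len(grid[0])
    equal = change = 0
    for r in range(rows):
        for c in range(cols):
            # Look right and down so that every neighbor pair is counted once.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < rows and nc < cols:
                    if grid[r][c] == grid[nr][nc]:
                        equal += 1
                    else:
                        change += 1
    return (equal - change) / (equal + change)


checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
two_halves = [[0, 0, 1, 1] for _ in range(4)]
```

A checkerboard gives -1 (every neighbor differs), while a grid split into two uniform halves gives a clearly positive value.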
Moran‘s I
Moran's I is a measure of spatial autocorrelation. It is a weighted correlation coefficient used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns such as clusters and geographic trends.
Values of I larger than 0 indicate positive spatial autocorrelation; values smaller than 0 indicate negative spatial autocorrelation.
Moran's I is a weighted product-moment correlation coefficient, where the weights reflect geographic proximity.
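$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2}$$
z – attribute of interest ($\bar{z}$ is its mean); w – weight; n – number of areal objects
Example: n = 4 areal objects A, B, C, D with weight matrix
w_ij  A  B  C  D
A     0  1  1  0
B     1  0  1  1
C     1  1  0  1
D     0  1  1  0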
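A minimal sketch of Moran's I as a weighted product-moment coefficient, using the 4 x 4 weight matrix of the example (objects A, B, C, D; only A and D are not neighbors). The attribute values in `z` are hypothetical and chosen for easy arithmetic; the function name `morans_i` is an assumption.

```python
def morans_i(z, w):
    """Moran's I for attribute values z and weight matrix w (w[i][i] is ignored)."""
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(w[i][j] * dev[i] * dev[j] for i, j in pairs)
    weight_sum = sum(w[i][j] for i, j in pairs)
    return n * cross / (weight_sum * sum(d * d for d in dev))


# Weight matrix from the example (order A, B, C, D).
w = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
z = [1, 1, 3, 3]  # hypothetical attribute values for A, B, C, D
i_value = morans_i(z, w)
```

With these values most neighboring pairs lie on opposite sides of the mean, so the result is slightly negative.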
Spatial Autocorrelation
similarity in location indicates similarity in attribute value
differs from temporal autocorrelation
- 1-dimensional autocorrelation in time series; spatial autocorrelation spreads in 2 or 3 dimensions
- only forward causality in time series; direction of causality is not restricted in space
depends on scale
Figures: temperature of sunspots; sunspot time series (x-axis: year, y-axis: # sunspots)
Effects of Autocorrelation
makes spatial abstraction possible
makes standard approaches of analysis impossible
- most statistics assume iid
makes local inference attractive
- Kriging, kNN, …
makes choice of sampling interval hard
- autocorrelation depends on scale
makes interpolation easier than extrapolation
zero autocorrelation = independence of location
Figure: spatial autocorrelation; correlation (from -1 to +1) plotted against distance
Problem types for Spatial Data Mining
Spatial Data Mining := partially automated search for patterns and models in large spatial databases
Classification of methods along the following hierarchy
Points
Points, Lines and Area
Networks
Tracks in space and time
Handling spatial data in Data Mining – Basic Options
Treat as ordinary variables
- no special algorithms needed
- spatial properties ignored, e.g. discontiguous areas
Make spatial relationships explicit
- e.g. infer topological relationships
- expensive, but allows normal algorithms to be used
- can be done as pre-processing or dynamically (the latter requires specialized algorithms)
Specialized algorithms
- Neighborhood methods, kriging, Gaussian processes, density-based clustering …
Use proper combination of data, preprocessing, algorithms, and interaction software!
Mining Point Data
Diagram: problem types arranged along the axes space complexity and time complexity; current topic: points
Clustering spatial point data
Point data conceived as discrete objects
Many approaches exist for clustering spatial point data
In statistics, measures of spatial randomness or non-randomness have been developed (e.g. Ripley 1991, Cressie 1993)
- Ripley‘s K function as measuring deviation from complete spatial randomness (as exemplified by a Poisson process)
- Moran‘s I, which measures autocorrelation
Bayesian approaches often coming from image analysis (cf. Lawson et al 2002)
In Geography, spatial clustering algorithms have been developed (Openshaw, GAM, 1991)
Density Based Clustering – a KDD approach [Ester et al. 1996]
Suitable for large databases
Discovers areas of high density and turns them into clusters
Discovers clusters of arbitrary shape
Can handle noise
Algorithm DBSCAN
Note: Relatively straightforward extension to vector data possible (GDBSCAN); requires more complex definition of some key concepts (neighborhood and MinPts)
Clustering spatial data
distance-based clustering is inherently spatial
but assumption of convex clusters (e.g. k-means) inappropriate for many “geographical” tasks
Figure source: Ester et al. 1997
Definitions 1
Eps-neighborhood of a point p: $N_\varepsilon(p) := \{q \in D \mid dist(p, q) \leq \varepsilon\}$
A point p is directly density-reachable from q iff
1. $p \in N_\varepsilon(q)$
2. $|N_\varepsilon(q)| > MinPts$ ("q is a core object")
- Not necessarily symmetric
Figure: p is a border object, q is a core object; p is directly density-reachable from q, but q is not directly density-reachable from p
The choice of Eps is crucial!
Definitions 2
Density-reachable: p is density-reachable from point q wrt Eps and MinPts iff there is a chain of points p1,…,pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
- Transitive, not symmetric
Density-connected: p is density-connected to q iff there is a point o such that both p and q are density-reachable from o wrt Eps and MinPts
- Symmetric
Figures: p is density-reachable from q, but q is not density-reachable from p; p and q are density-connected to each other by o
Density-connected clustering
A cluster C wrt Eps and MinPts is a non-empty subset of database D, where
(1) ∀p,q: if p ∈ C and q is density-reachable from p wrt Eps and MinPts, then q ∈ C
(2) ∀p,q ∈ C: p is density-connected to q wrt Eps and MinPts.
Non-covered points are noise
Each cluster contains at least MinPts points
Exactly one clustering
Algorithm DBSCAN – Basic Idea
Check the Eps-neighborhood of every unclassified point in the database
If the neighborhood of p contains more than MinPts points, a new cluster with p as core object is built
Collect directly density-reachable objects from this set, merging clusters as necessary
Terminate when no new point can be added to any cluster
Complexity: O(n log n) when a spatial index is used, otherwise O(n^2)
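The basic idea can be sketched in a few lines, assuming 2D points and no spatial index, so the neighborhood query is a linear scan and the whole run is O(n^2). The core-object test follows the wording of these slides (more than MinPts points in the Eps-neighborhood, the point itself included); the condition is often stated with ≥ instead. The names `region_query`, `dbscan` and `NOISE` are illustrative.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i] (including i)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks non-covered points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) <= min_pts:      # not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:         # border object reached from a core object
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) > min_pts:  # j is a core object: keep expanding
                seeds.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical data: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Replacing `region_query` with a lookup in a spatial index is what brings the complexity down to O(n log n).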
Kriging – Spatial Interpolation
Kriging
developed by G. Matheron in the 1960s based on work of D. Krige
geostatistical method of interpolation
Point data conceived as samples from a continuous surface
results are smoothly varying surfaces
provides optimality given assumptions (best linear unbiased estimate)
variety of methods, e.g. Ordinary Kriging, Universal Kriging, Co-Kriging, Block Kriging, Stratified Kriging, Indicator Kriging, …
Figure legend: • measurements; ? unknown values
Good introduction: Burrough, P., McDonnell, R. 1998
Spatial Variation
Problem:
spatial variation of a continuous attribute is often too irregular to be modelled by a simple, smooth mathematical function
Solution:
variation can be described by stochastic surface
x – location in n-dimensional space
Z(x) – random variable of interest, e.g. soil moisture
A stochastic process is a family of random variables Z(x) over the index set $D \subset \mathbb{R}^n$: $\{Z(x) : x \in D\}$
A Gaussian process is a stochastic process for which any finite set of Z-variables has a joint multivariate Gaussian distribution.
Components of Spatial Variation
structural component, having a constant mean or trend
random, but spatially correlated component (regionalized variable)
spatially uncorrelated random noise term
$Z(x) = m(x) + \varepsilon'(x) + \varepsilon''$
where m(x) is the trend, ε'(x) the autocorrelated component and ε'' the random noise
The value at location x is a random variable
Stationarity
Problem:
spatial data set is single realization of random process
inference is impossible without further restrictions on spatial variation
Intrinsic stationarity (stationarity under translation):
constant mean (E[...] = 0) or trend (E[...] > 0): $E[Z(x) - Z(x+h)] = const.$
variance of differences at lag h is independent of location: $E[\{Z(x) - Z(x+h)\}^2] = 2\gamma(h)$
Isotropy (stationarity under rotation):
spatial process evolves the same in all directions
Ordinary Kriging
Assumptions:
intrinsic stationarity with a constant mean
- constant mean value in sampling area
- variance of differences depends only on the distance h between sites
Once structural effects have been accounted for, the remaining variation is homogeneous in variance, so that differences between sites are merely a function of the distance between them.
$E[Z(x) - Z(x+h)] = 0$
$Var[Z(x) - Z(x+h)] = E[\{Z(x) - Z(x+h)\}^2] = E[\{\varepsilon'(x) - \varepsilon'(x+h)\}^2] = 2\gamma(h)$
γ(h) is the semivariance
Ordinary Kriging
Procedure:
1. Estimate semivariance γ(h) from data sample
2. Plot the experimental variogram
3. Fit a theoretical model to the experimental variogram
4. Estimate unknown values as weighted sums of neighboring measurements; determine optimal weights from the variogram
Semivariance and Experimental Variogram
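semivariance depends only on distance (lag) h
estimate semivariance between all pairs of measurements with distance h (repeat for all possible h)
$\hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \{z(x_i) - z(x_i + h)\}^2$
where n is the number of pairs of measurements separated by lag h
Figure: experimental variogram, γ(h) plotted against lag h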
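A minimal sketch of the experimental variogram, assuming 2D sample locations and lags grouped into classes of equal width (each pair goes to its nearest lag class). The grouping scheme, the function name `experimental_variogram` and the sample transect are illustrative assumptions.

```python
import math


def experimental_variogram(coords, values, lag_width, n_lags):
    """Return a list of (lag, semivariance, number_of_pairs) per lag class.

    The semivariance of a class is the sum of squared differences of all
    pairs in that class, divided by twice the number of pairs.
    """
    sq_sums = [0.0] * (n_lags + 1)
    counts = [0] * (n_lags + 1)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            k = int(h / lag_width + 0.5)  # index of the nearest lag class
            if 1 <= k <= n_lags:
                sq_sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [(k * lag_width, sq_sums[k] / (2 * counts[k]), counts[k])
            for k in range(1, n_lags + 1) if counts[k] > 0]


# Hypothetical transect: four measurements at unit spacing.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1, 2, 4, 7]
variogram = experimental_variogram(coords, values, lag_width=1.0, n_lags=3)
```

In this example the semivariance grows with the lag, which is the typical picture when nearby measurements are more similar than distant ones.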
Variogram
nugget:
- γ(0) = 0 (by definition)
- nugget effect represents small-scale variation and measurement errors
- estimate of ε''
range:
- spatial dependency
- here, variance of differences increases with distance
- two points are more similar the closer they are
sill:
- semivariance levels off
- variance of differences is independent of distance
Figure: variogram γ(h) against lag h, with nugget, range and sill marked
Variogram Models
experimental variogram must be fitted to an appropriate variogram model
most commonly used are the spherical, exponential, linear or Gaussian model
(Figure: four panels plotting γ(h) against lag h for the Spherical Model, Exponential Model, Linear Model and Gaussian Model)
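As an illustration of step 3 of the procedure, the four model families can be written as small Python functions. The parameterisation (nugget, sill, range) is a common textbook convention and is an assumption here; some texts scale the exponential and Gaussian range differently. The grid-search helper `fit_variogram` is a hypothetical stand-in for a proper least-squares fit.

```python
import math
from itertools import product

def spherical(h, nugget, sill, rng):
    if h == 0:
        return 0.0                      # gamma(0) = 0 by definition
    if h >= rng:
        return sill                     # beyond the range the curve stays at the sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, sill, rng):
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1 - math.exp(-h / rng))

def gaussian(h, nugget, sill, rng):
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1 - math.exp(-((h / rng) ** 2)))

def linear(h, nugget, slope):
    return 0.0 if h == 0 else nugget + slope * h   # no sill: grows without bound

def fit_variogram(model, lags, gammas, candidates):
    """Pick the candidate parameter tuple with the smallest squared error."""
    def sse(params):
        return sum((model(h, *params) - g) ** 2 for h, g in zip(lags, gammas))
    return min(candidates, key=sse)

# Recover known parameters from noise-free synthetic semivariances.
lags = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gammas = [spherical(h, 0.0, 10.0, 4.0) for h in lags]
grid = list(product([0.0, 1.0], [8.0, 10.0, 12.0], [2.0, 4.0, 6.0]))
best = fit_variogram(spherical, lags, gammas, grid)
```

In practice the candidate grid would be replaced by a numerical optimiser, and the choice between model families is usually made by inspecting the experimental variogram.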
Interpolation of unknown Values
unknown value at location x0 is estimated as weighted sum of neighboring measurements
$Z^*(x_0) = \sum_{i=1}^{n} w_i Z(x_i)$
weights w_i are determined according to two restrictions
- Z*(x0) is an unbiased estimate of Z(x0) (the weights sum to 1)
- Z*(x0) is an optimal estimate (minimum estimation variance)
Have to solve system of n+1 linear equations of semivariances and weights
Equation System
restriction on weights introduces Lagrange parameter φ (Restriction 1)
system of (n+1) equations must be solved to obtain optimal weights for each x0
$$
\begin{pmatrix}
\gamma(x_1 - x_1) & \cdots & \gamma(x_1 - x_n) & 1 \\
\vdots & \ddots & \vdots & \vdots \\
\gamma(x_n - x_1) & \cdots & \gamma(x_n - x_n) & 1 \\
1 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_n \\ \phi \end{pmatrix}
=
\begin{pmatrix} \gamma(x_1 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}
$$
Ordinary Kriging is an exact interpolator, i.e. interpolated value of a sample location will be identical with the measurement taken
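The (n+1) equation system can be set up and solved directly. The sketch below uses plain Python with a small Gaussian elimination so that it runs without extra libraries; the function names and the linear variogram used in the example are assumptions for illustration only.

```python
import math

def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(points, values, x0, gamma):
    """Return (estimate at x0, weights) for a fitted variogram function gamma(h)."""
    n = len(points)
    # Left-hand side: semivariances between samples, bordered by ones.
    a = [[gamma(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    # Right-hand side: semivariances between samples and the target location.
    b = [gamma(math.dist(p, x0)) for p in points] + [1.0]
    *weights, _lagrange = solve(a, b)
    return sum(w * z for w, z in zip(weights, values)), weights

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [1.0, 2.0, 3.0]
linear_variogram = lambda h: h                       # assumed model, no nugget
estimate, weights = ordinary_kriging(points, values, (0.5, 0.5), linear_variogram)
```

Evaluating the function at one of the sample locations returns that sample's measured value (up to rounding), which is the exact-interpolator property stated above.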
Variants of Kriging
Universal Kriging
structural component may contain an external trend
Co-Kriging
interpolation for one attribute incorporates information of another, correlated attribute
sparse measurements of an expensive variable are supported by plentiful measurements of a cheap variable
Stratified Kriging
interpolation within sub-areas
equations are adjusted to avoid discontinuities on boundaries
More Details: Burrough, P., McDonnell, R 1998
Mining Points, Lines, and Areas
Points, Lines and Areas
(Figure: diagram with axes Space Complexity and Time Complexity, locating the data types Points; Points, Lines, and Areas)
Points, Lines and Areas
Requirements:
• Point data
• Polygons
• Aggregations
Applications:
• Customer Segmentation
• Catchment Areas
• Location Planning
• Radio Network Analysis
Examples:
• GDBScan Clustering
• Spatial Subgroup Mining
• Spatial Association Rules
• Spatial Model Trees
Clustering of Vector Data: GDBScan [Sander et al 1998]
Extension of DBSCan - Sample Instantiations
Neighborhood predicate: dist < ε | intersects/meets | neighbor
Minimum weight of a neighborhood S: |S| ≥ MinCard | ∑ areas ≥ MinArea | f(S) ≥ MinF
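The generalisation can be sketched by passing both predicates in as functions. The code below is a simplified illustration in the spirit of GDBSCAN, not the published algorithm verbatim; `gdbscan`, `within_eps` and `min_card` are names chosen here. Swapping `within_eps` for an intersects/meets test on polygons, or `min_card` for a sum-of-areas test, gives the other instantiations listed above.

```python
import math
from collections import deque

NOISE = -1

def gdbscan(n_objects, neighborhood, min_weight):
    """Density-based clustering with pluggable predicates.

    neighborhood(i) -> indices of objects in the neighborhood of object i
    min_weight(indices) -> True if that neighborhood is dense enough
    Returns one label per object: a cluster id >= 0, or NOISE.
    """
    labels = [None] * n_objects
    cluster_id = 0
    for start in range(n_objects):
        if labels[start] is not None:
            continue
        seeds = neighborhood(start)
        if not min_weight(seeds):
            labels[start] = NOISE            # may later become a border object
            continue
        labels[start] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id       # border object: joins, does not expand
            elif labels[j] is None:
                labels[j] = cluster_id
                reachable = neighborhood(j)
                if min_weight(reachable):    # core object: keep expanding
                    queue.extend(reachable)
        cluster_id += 1
    return labels

# Instantiation for point data: dist < eps and |S| >= MinCard.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
EPS, MIN_CARD = 1.5, 3

def within_eps(i):
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < EPS]

def min_card(indices):
    return len(indices) >= MIN_CARD

labels = gdbscan(len(points), within_eps, min_card)
```

With these toy points the two tight groups form clusters 0 and 1 and the isolated point is labelled as noise.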
Spatial Subgroup Mining
Typical Data Mining representation
- ‘spreadsheet data’
- exactly 1 table
- atomic values
Data Mining for spatial data: very different from this representation
Subgroup Discovery Search (Klösgen 1996, Wrobel 1997)
Subgroup discovery searches deviation patterns for subgroups
- disproportionately high share of target value (or mean of target variable)
Top-down search from most general to most specific subgroups, exploiting partial ordering of subgroups
S1 ≥ S2: S1 more general than S2
Beam search expands only the n best ones at each level
Evaluating hypothesis according to quality function:
$\frac{p(T|C) - p(T)}{\sqrt{p(T)\,(1 - p(T))}} \sqrt{n} \sqrt{\frac{N}{N - n}}$
N = total population
n = subgroup size
p(T) = target share in total population
p(T|C) = target share in subgroup
Extension to multi-relational representation in Wrobel (1997)
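As a worked example, the quality function can be evaluated directly. The function name `subgroup_quality` is chosen here for illustration; the formula is the one given on the slide and requires n < N and 0 < p(T) < 1.

```python
import math

def subgroup_quality(p_subgroup, p_total, n, N):
    """Quality of a subgroup with target share p(T|C) = p_subgroup.

    p_total : target share p(T) in the total population
    n, N    : subgroup size and total population size (n < N)
    """
    deviation = (p_subgroup - p_total) / math.sqrt(p_total * (1 - p_total))
    return deviation * math.sqrt(n) * math.sqrt(N / (N - n))

# 100 of 1000 districts form the subgroup; target share 40% versus 20% overall.
q = subgroup_quality(0.4, 0.2, n=100, N=1000)
```

Here the deviation term is 0.2 / 0.4 = 0.5, multiplied by √100 = 10 and by the correction factor √(1000/900), giving a quality of about 5.27. A subgroup whose target share equals the population share scores 0.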
Translating Multirelational Subgroups to Object-relational SQL
Domain: relational database schema D = {R1, ..., Rn} having geometry attributes Gi
Hypothesis Language
Multirelational subgroups are represented by a concept set C = {Ci}, where each Ci consists of a set of attribute-value pairs {A1=v1,...,An=vn} from a relation in D,
a set of links L = {Li} linking concepts Ci, Ck via their attributes Am, An of the form (Ci/Am {=|inside|overlaps|...|spatially_interact} Ck/An)
target attribute can be non-numeric (A1=v1) or numeric aggregate (avg(A)=n)
Example:
C = {{district.long_term_illness=high, district.unemployment=high}, {street.name=’Manchester Road’}}
L = {{district.geometry spatially_interact street.geometry}}
“Enumeration districts with high rate of long term illness and unemployment crossed by Manchester Road”
Testing satisfaction of subgroup descriptions
The number of tuples in D that satisfy a subgroup description is evaluated using SQL select statements including joins over multiple relations.
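To make the translation step concrete, the sketch below turns a concept set C and link set L into a counting query. Everything specific in it is an assumption for illustration: the function name, the `id` key column, and the mapping of the link types onto the SQL/MM-style functions `ST_Intersects`, `ST_Within` and `ST_Overlaps`. The system described in the slides may generate different SQL (for example Oracle Spatial operators).

```python
# Hypothetical mapping from link types to spatial SQL functions.
SPATIAL_OPS = {
    "spatially_interact": "ST_Intersects",
    "inside": "ST_Within",
    "overlaps": "ST_Overlaps",
}

def subgroup_to_sql(concepts, links, target_table):
    """Build a query counting the target objects that satisfy a subgroup.

    concepts : {table: {attribute: value}}   (the concept set C)
    links    : [(table_a, attr_a, relation, table_b, attr_b)]   (the link set L)
    Values are inlined for readability only; real code should use bind
    parameters instead of string formatting.
    """
    tables = sorted(concepts)
    conditions = []
    for table in tables:
        for attr, value in sorted(concepts[table].items()):
            conditions.append(f"{table}.{attr} = '{value}'")
    for table_a, attr_a, relation, table_b, attr_b in links:
        if relation == "=":
            conditions.append(f"{table_a}.{attr_a} = {table_b}.{attr_b}")
        else:
            op = SPATIAL_OPS[relation]
            conditions.append(f"{op}({table_a}.{attr_a}, {table_b}.{attr_b})")
    return (
        f"SELECT COUNT(DISTINCT {target_table}.id) "
        f"FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(conditions)}"
    )

query = subgroup_to_sql(
    {
        "district": {"long_term_illness": "high", "unemployment": "high"},
        "street": {"name": "Manchester Road"},
    },
    [("district", "geometry", "spatially_interact", "street", "geometry")],
    target_table="district",
)
```

The resulting string joins `district` and `street` through the spatial predicate, which is the kind of multi-relation select statement the slide refers to.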
Approach: Translation of Spatial Subgroup Mining to SQL (Klösgen, May 2002)
• Representing subgroups in object-relational SQL, i.e. multi-relational representation
• Using representation for spatial geometry based on Spatial Database
• Division of work between RDBMS and Search Manager
• Combining visualization in abstract and physical space
Division of labour between RDBMS and Search Manager (May, Savinov 2003)
(Figure: the Search Algorithm on the Mining Server sends a mining query to the Database Server and receives statistics in return)
• search in hypothesis space
• generation and evaluation of hypotheses (subgroup patterns)
• Database integration: efficiently organize mining queries
• Mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses
SPIN! – Spatial Data Mining System
(Screenshot of the SPIN! user interface: Workspace, Property Editor, Subgroup Viewer, Flowchart-Tool, Subgroup Result List)
Interactive Exploratory Analysis
Combination of spatial and non-spatial visualization
User selects and manipulates variables
Powerful for analysis in low dimensions (3-4)
Scatter Plot
Parallel Coordinate Plot
Choropleth Maps
Display dynamically linked
Visualization of spatial subgroups
Linked Display
Spatial Venn Diagram
Subgroup Overview: p(T|C) vs. p(C)
Subgroup: High long-term illness in districts crossed by M60
Radio Network Planning in Telecommunication
High call cut-off ratio in mountainous regions crossed by highways having a certain technical configuration
Legend: Blue: highway; Brown: high altitude; Black: subgroup
(Screenshot: SPIN! Mapviewer (Common GIS))
Other commercial applications of Subgroup Discovery
How are my customers characterized? Are there interesting profiles?
Where to open the next supermarket? Does it create competition for my other supermarkets?
Should I invest in UMTS in rural areas?
Spatial Association Rules
work and slides by Donato Malerba et al., Univ. Bari
Spatial association rules
An association pattern P (s%) is a spatial association pattern if it contains at least one spatial relation
A large town intersects a road and is adjacent to water (62%)
An association rule Q → R (s%, c%) is a spatial association rule if Q ∧ R is a spatial association pattern
IF a large town intersects a road
THEN it is also adjacent to water (62%, 89%)
Malerba et al
Seminal work by Koperski & Han 1995
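In the rule notation above, s% is the support and c% the confidence. The toy sketch below shows how the two numbers are computed; it deliberately flattens each reference object into a set of (relation, object type) facts, which is a simplification assumed here and much weaker than the first-order patterns used in the work cited. The data are invented for illustration.

```python
def support_count(objects, pattern):
    """Number of reference objects whose facts contain every atom of the pattern."""
    return sum(1 for facts in objects if pattern <= facts)

def support(objects, pattern):
    return support_count(objects, pattern) / len(objects)

def confidence(objects, antecedent, consequent):
    """Share of objects satisfying the antecedent that also satisfy the consequent."""
    both = support_count(objects, antecedent | consequent)
    return both / support_count(objects, antecedent)

ROAD = ("intersects", "road")
WATER = ("adjacent_to", "water")

# Ten hypothetical large towns described by their spatial relations.
towns = [{ROAD, WATER}] * 6 + [{ROAD}] * 2 + [{WATER}] + [set()]

s = support(towns, {ROAD, WATER})        # share of towns matching Q and R
c = confidence(towns, {ROAD}, {WATER})   # share of road towns that are near water
```

With these ten towns the rule "IF intersects road THEN adjacent to water" has 60% support and 75% confidence.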
The problem
Given
- a spatial database (SDB) with a set of reference objects S,
- some sets Rk, 1≤k≤m, of task-relevant objects
- some spatial hierarchies Hk involving objects in Rk
- M granularity levels in the descriptions
- a set of granularity assignments ψk which associate each object in Hk with a granularity level
- a couple of thresholds minsup[l] and minconf[l] for each granularity level
- a domain knowledge
Find strong multiple-level spatial association rules.
Malerba et al
The solution
Solution (Appice et al., IDA Journal, 2003)
based on an Inductive Logic Programming (ILP) approach ⇒ spatial relations easily handled
spatial pattern = conjunction of first-order logic atoms
θ-subsumption orders the space of spatial patterns
monotonicity of support w.r.t. θ-subsumption ⇒ pruning of patterns at the same granularity level in the candidate generation phase
monotonicity of pattern frequency w.r.t. granularity level ⇒ pruning of patterns at different granularity levels in the candidate generation phase
Implemented in SPADA (Spatial Pattern Discovery Algorithm)
European project SPIN (Spatial Mining for Data of Public Interest)
Extensions of initial solutions
Efficiency improvement of pattern evaluation by caching support objects for each stored pattern
Definition of a declarative bias to filter out rules on the basis of users’ preferences ⇒ efficiency improvement is a byproduct
- In real-world applications a large number of spatial patterns can be generated even for a few hundred spatial objects.
- Most of the discovered patterns are useless for the application at hand
- Urban accessibility application: only spatial patterns involving some sociological factor (household with no car) are interesting.
Integration of SPADA in the ARES system that interfaces a Spatial DB (Oracle Spatial)
Mining Network Data
Networks
(Figure: diagram with axes Space Complexity and Time Complexity, locating the data types Points; Points, Lines, and Areas; Networks)
Points and Networks
• Requirements:
  • Point Data
  • Polygons
  • Aggregations
  • Spatial dependencies and relations, networks
• Examples: Traffic frequency prediction
• Method:
  • kNN
Case Study: Outdoor Advertising - Frequency Atlas
Customer:
Fachverband für Außenwerbung (FAW; German Outdoor Advertising Association)
Task:
Performance value assessment of advertising media
Traffic volume forecast
separate for private cars, public transport, pedestrians
Frequency + Media factories = poster reach
Gesellschaft für Konsumforschung
Determining reach of a poster board
The project in numbers
Complete model for all German cities with more than 50.000 inhabitants (192 cities) = ca. 1.000.000 street segments!
Complete model includes, for each segment,
- car frequency
- pedestrian frequency
- public transport frequency
The model is presently being extended to all cities with between 10.000 and 50.000 inhabitants
Basic Data: traffic measurements
Manual traffic measurement at selected poster locations
- 4 times 6 minutes on four days of the week at four times of day
Additional empirical model of day totals
Properties
- Well defined measurements
- Extended measurement period, so concept drift cannot be excluded
Total of 96.000 manual measurements
(Figure: inputs to DATA MINING are the street network, sociodemographics + socioeconomics, the public transport network, frequency measurements, Points of Interest (POI) and secondary data; the output is a map of frequency classes 0, 200, 400, 600, 800, 1000, 1250, 1500, 1750, 2000, ...)
Local Measurements
Inhomogeneous measurements on the same street
How Spatial Autocorrelation helps
(Figure: street map with measured frequency values annotated on individual segments)
Spatial kNN
Attributes of street segments:
- Name, type, ..., class
- Points of Interest
- Spatial coordinates
Locations with measurement values
Distance between two segments xa, xb:
$d(x_a, x_b) = \sum_{m=1}^{M} |x_{am} - x_{bm}|$
Selection of the k closest x1, …, xk
Prediction for new segment xq:
$\hat{y}_q = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}$ with $w_i = \frac{1}{d(x_q, x_i)}$
(Project has actually used specially adapted distance measure)
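A minimal sketch of the distance-weighted prediction follows. It uses a plain sum of absolute attribute differences as the distance, which is an assumption made here: the slide notes that the project used a specially adapted distance measure. The function names and the toy numbers are illustrative.

```python
def attribute_distance(xa, xb):
    """Sum of absolute differences over the M numeric attributes."""
    return sum(abs(a - b) for a, b in zip(xa, xb))

def knn_predict(query, segments, values, k):
    """Inverse-distance weighted mean of the k nearest measured segments."""
    ranked = sorted(
        (attribute_distance(query, seg), y) for seg, y in zip(segments, values)
    )
    nearest = ranked[:k]
    # A measured segment with identical attributes determines the prediction.
    for d, y in nearest:
        if d == 0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Segments described by two numeric attributes (e.g. coordinates), with
# measured frequencies; all numbers are made up.
segments = [(0, 0), (2, 0), (0, 4), (10, 10)]
frequencies = [800, 1200, 1000, 100]
prediction = knn_predict((1, 0), segments, frequencies, k=2)
```

The query segment is equally close to the first two measured segments, so the prediction is their mean, 1000. Categorical attributes such as street type would need their own distance contribution.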
Spatial KNN - Properties
kNN captures the autocorrelation inherent in the data well
Allows background knowledge to be brought in by fine-tuning the distance function
Database Integrated (Oracle Spatial)
Performs dynamic spatial query (minimum distances among polygons)
Performance improvements
Spatial Queries use Index Structures (R-Tree), still relatively costly (i.e. dominates overall run-time)
Partial evaluation of distance function based on lower bounds for distance to minimize number of spatial queries
Can handle data sets that do not fit into main memory
Smoothing based on flow constraints
Measurement errors lead to inconsistencies
Need plausible assignment of frequencies
Solution:
Use Kirchhoff’s law as constraint
- Sum of inputs = sum of outputs
Smoothing algorithm finds locally optimal solution using constraint relaxation
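The slides do not spell out the smoothing algorithm, so the following is only one possible relaxation scheme, stated as an assumption: visit each junction in turn and spread its imbalance (inflow minus outflow) equally over all incident segments until every junction satisfies Kirchhoff's law. The function name and the example network are hypothetical.

```python
def smooth_flows(edges, flows, sweeps=1000, tol=1e-9):
    """Adjust segment frequencies so that inflow equals outflow at junctions.

    edges : list of directed segments (from_node, to_node), no self-loops
    flows : {edge: predicted frequency}
    Nodes with only incoming or only outgoing segments are treated as
    sinks or sources and impose no constraint.
    """
    flows = dict(flows)
    nodes = sorted({node for edge in edges for node in edge})
    for _ in range(sweeps):
        worst = 0.0
        for node in nodes:
            ins = [e for e in edges if e[1] == node]
            outs = [e for e in edges if e[0] == node]
            if not ins or not outs:
                continue
            imbalance = sum(flows[e] for e in ins) - sum(flows[e] for e in outs)
            worst = max(worst, abs(imbalance))
            share = imbalance / (len(ins) + len(outs))
            for e in ins:
                flows[e] -= share        # too much inflow: reduce incoming segments
            for e in outs:
                flows[e] += share        # and raise outgoing segments
        if worst < tol:
            break
    return flows

# Two streets merge into one; the predictions are inconsistent (500 in, 560 out).
edges = [("a", "c"), ("b", "c"), ("c", "d")]
smoothed = smooth_flows(edges, {("a", "c"): 300.0, ("b", "c"): 200.0, ("c", "d"): 560.0})
```

In the example the imbalance of 60 is split three ways, giving 320 + 220 = 540 on both sides of the junction. A production version would also keep frequencies non-negative and could weight adjustments by measurement reliability.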
Explaining frequencies
Problem: Customer wants transparent values, not a black box
=> Problem for Spatial kNN
Solution: Fit an explanatory model to the predicted values
Makes it possible to understand why predictions are as they are
Helps to identify potential outliers and areas of high uncertainty
⇒ Use Model Trees
⇒ Geographic Space encoded in x-y coordinates
Numerical prediction with model trees
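LM1:
FREQUENZ = 2277.3186 * X + 75.4087 * ANZAHL_EINKAUF - 142.4217 * MESSE - 21221.8497
(Figure: model tree with leaf models LM1 to LM6. Split nodes: Fussgängerzone [pedestrian zone]: Nein | Ja [no | yes]; Bahnhof [railway station]: Nein | Ja; Distanz_zu_Bahnhof [distance to station]: <= 150 | > 150; Anzahl_Restaurants [number of restaurants]: <= 5 | > 5 and <= 15 | > 15; ORTSTEIL [district] = INNENSTADT [city centre] (LR) | ...; Straßenkategorie [street category]: Nebenstr. | Hauptstr. [side street | main street]; Y-Koordinate: <= 9.6 | > 9.6; X-Koordinate: <= 52.385 | > 52.385)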
Improving model by spotting outliers based on model tree prediction
Points with large prediction error are checked
- Visual inspection
- Getting additional empirical input by taking new measurements
Corrected values are basis for next round in model building, leading to improved results
Final Result: Frequency Map
(Figure: three frequency maps, for Cars, Public Transport and Pedestrians)
~1 Million street segments predicted based on 96.000 measurements
Final result: frequency atlas (cars, public transport, pedestrians)
Used for determining poster prices in Germany since 2006
Rare instance of a spatial data mining problem that has become business critical
Spatial Model Trees [Malerba, Appice, Ceci 2005]
Standard Model Trees (e.g. M5‘) can do Spatial Mining by splitting along x and y coordinates
Mrs-Smoti (Malerba et al. 2004) is a variant of Model Trees that
- Allows regression nodes as interior nodes
- Handles autocorrelation directly: spatial regression model with dependencies in response variables (spatially lagged response)
It takes as input spatial objects, possibly belonging to separate thematic layers, stored in a spatial database S
- target objects (main subject of analysis)
- non target objects (relevant for the task in hand)
and outputs a spatial model tree T by
- partitioning training spatial data according to intra-layer and inter-layer relationships
- associating different regression models to disjoint spatial areas
Integrates spatial database queries (see Subgroup Discovery)
(Figure: example spatial model tree T with nodes numbered 0 to 7, including the regression node Y = a + bX1 (node 0), the split nodes X'3 ≤ α (node 1), X'2 ≤ β and X'4 ≤ γ, and the leaf models Y' = c + dX'3, Y' = e + fX'2, Y' = g + hX'3 and Y' = i + lX'4)
Mining Tracks in Space and Time
Tracks in Space and Time
(Figure: diagram with axes Space Complexity and Time Complexity, locating the data types Points; Points, Lines, and Areas; Networks; Tracks in Space and Time)
Tracks in space and time
• Requirements:
  • Point data
  • Polygons
  • Aggregations
  • Networks
  • Tracks, GPS/RFID/Sensor-Measurement
• Applications: Traffic prediction, Mobility analysis
• Examples:
  • Sampling, Event analysis, non-linear optimization
Mobility analysis based on GPS-tracks
introduction of new pricing model for poster sites based on GPS tracks
registration of contact frequencies with poster sites
contact extrapolation for target groups:
- socio-demographic characteristics
- residential areas
Media Trend Journal, Nov, 2006
Time patterns
Patterns / Questions
- How long (days) does it take till x% of objects visit all locations?
- How long does it take till x% of objects visit at least one location twice?
Applications
- determine mobility of a group of people
- reach of poster networks
- find popularity of locations (theatres, supermarkets, hospitals)
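The first question can be answered with a single pass over a time-ordered visit log. The sketch below assumes the log is a list of (day, object id, location) tuples, which is a representation chosen here for illustration, as is the function name.

```python
def days_until_share_visited_all(visits, locations, percent):
    """First day on which at least `percent` % of objects have seen all locations.

    visits    : list of (day, object_id, location) tuples
    locations : set of locations that must all be visited
    Returns None if the share is never reached.
    """
    objects = {obj for _, obj, _ in visits}
    needed = -(-percent * len(objects) // 100)   # ceiling division on integers
    seen = {}                                    # object -> locations visited so far
    done = set()                                 # objects that visited everything
    for day, obj, location in sorted(visits):
        if location in locations:
            seen.setdefault(obj, set()).add(location)
            if seen[obj] >= locations:
                done.add(obj)
        if len(done) >= needed:
            return day
    return None

# Four people, two poster locations A and B (made-up log).
log = [
    (1, "u1", "A"), (1, "u2", "A"), (2, "u1", "B"), (2, "u4", "A"),
    (3, "u3", "A"), (4, "u2", "B"), (5, "u3", "B"),
]
half = days_until_share_visited_all(log, {"A", "B"}, percent=50)
everyone = days_until_share_visited_all(log, {"A", "B"}, percent=100)
```

Half of the group has seen both locations by day 4; the full group never does within the log, so the second call returns `None`. The second question (at least one location visited twice) follows the same pattern with a per-object visit counter.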
Modelling tasks
Modelling mobility for cities with GPS-measurements for the overall population
Predicting mobility for cities without measurements (hard task!)
Extrapolating predictions in time
GeoPKDD - FET Project IST-014915
Geographic Privacy-aware Knowledge Discovery and Delivery
December 2005 – November 2008
Project Leader: Fosca Giannotti
http://www.geopkdd.eu
General Project Idea
extracting user-consumable forms of knowledge from large amounts of raw geographic data referenced in space and in time.
knowledge discovery and analysis methods for trajectories of moving objects, which change their position in time, and possibly also their shape or other significant features
devising privacy-preserving methods for data mining from sources that typically contain personal sensitive data
The Consortium
ID | Acronym | Partner | Country
1 | KDDLAB | Knowledge Discovery and Delivery Laboratory, ISTI-CNR, Istituto di Scienza e Tecnologie dell’Informazione, Pisa. http://www.isti.cnr.it/ - jointly with Univ. Pisa, Dept. of Computer Science http://www.di.unipi.it | I
2 | LUC | Univ. Limburg, Theoretical Computer Science Group. http://www.luc.ac.be/theocomp | B
3 | EPFL | EPFL, Lab. DB, Lausanne. http://lbdwww.epfl.ch/e/ | CH
4 | FAIS | Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin. http://www.ais.fraunhofer.de/ | D
5 | WUR | Wageningen UR, Centre for GeoInformation. http://cgi.girs.wageningen-ur.nl/ | NL
6 | CTI | Research Academic Computer Technology Institute, Research and Development Division. http://www.cti.gr/ - jointly with Univ. Piraeus, Dept. of Informatics http://www.unipi.gr | GR
7 | UNISAB | Sabanci University, Faculty of Engineering and Natural Sciences. http://www.sabanciuniv.edu/ | TK
8 | WIND | WIND Telecomunicazioni SpA, Direzione Reti Wind Progetti Finanziati & Technology Scouting. | I
Geographic Privacy-aware Knowledge Discovery Process
(Figure: GeoPKDD process diagram. Components: privacy enforcement, trajectory reconstruction, trajectories warehouse, privacy-aware data mining, ST patterns (e.g. p(x)=0.02), interpretation and visualization, GeoKnowledge. Stakeholders: telecommunication company (WIND); public administration or business companies. Applications: traffic management, accessibility of services, mobility evolution, urban planning, ...; bandwidth/power optimization, mobile cells planning, ...; aggregative location-based services)
GeoPKDD – Specific Goals
models for moving objects, and data warehouse methods to store their trajectories
knowledge discovery and analysis methods for moving objects and trajectories,
techniques to make such methods privacy-preserving
techniques for reasoning on spatio-temporal knowledge and on background knowledge
techniques for delivering the extracted knowledge within the geographic framework
From Traces to Trajectories: the Source Data
GSM network
Entering the cell
- e.g. (UserID, time, IDcell, in)
Exiting the cell
- e.g. (UserID, time, IDcell, out)
Movements inside the cell?
- e.g. (UserID, time, X, Y, IDcell)
streams of log data of mobile phones, e.g. cells in the GSM/UMTS network
Real trajectories are continuous functions
Logs are discrete sampling of real trajectories, dependent on the wireless network technology
- irregular granularity in time and space
- possible imperfection/imprecision
An approximated reconstruction of the real trajectory from its log traces is needed
Source: Pedreschi & Giannotti, 2005
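One simple way to approximate a continuous trajectory from discrete log samples is linear interpolation between consecutive observations. This is an assumption made for illustration; the slides do not say which reconstruction method GeoPKDD uses, and real reconstructions typically also respect the road network and positional uncertainty.

```python
from bisect import bisect_right

def position_at(samples, t):
    """Estimate the (x, y) position at time t from time-sorted (time, x, y) samples.

    Between two samples the object is assumed to move in a straight line at
    constant speed; before the first and after the last sample the nearest
    observed position is returned.
    """
    times = [s[0] for s in samples]
    if t <= times[0]:
        return samples[0][1:]
    if t >= times[-1]:
        return samples[-1][1:]
    j = bisect_right(times, t)               # first sample strictly later than t
    (t0, x0, y0), (t1, x1, y1) = samples[j - 1], samples[j]
    a = (t - t0) / (t1 - t0)                 # fraction of the way from t0 to t1
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

# Irregularly sampled log of one user: (time, x, y).
trace = [(0, 0.0, 0.0), (10, 10.0, 0.0), (30, 10.0, 20.0)]
midway = position_at(trace, 20)
```

At time 20 the user is taken to be halfway between the second and third observation, at (10.0, 10.0).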
Movement patterns
Clustering
- Group together similar trajectories
- For each group produce a summary
Frequent patterns
- Discover frequently followed (sub)paths
Classification
- Extract behaviour rules from history
- Use them to predict behaviour of future users
(Figure: paths annotated with shares 60%, 7%, 8%, 5% and a queried 20%?)
Source: Pedreschi & Giannotti, 2005
Why emphasis on privacy?
More, better data are gathered, more vulnerability from correlation
On the other hand, more and new data bring new opportunities
Need to maintain privacy without giving up opportunities
Need to obtain social acceptance through demonstrably trustworthy solutions
Privacy in GeoPKDD
... is a technical issue, besides ethical, social and legal, in the specific context of ST data
How to formalize privacy constraints over ST data and ST patterns?
- E.g., anonymity threshold on clusters of individual trajectories
How to design DM algorithms that, by construction, only yield patterns that meet the privacy constraints?
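The anonymity-threshold example can be illustrated with a small filter that releases a trajectory cluster only if it is supported by at least k distinct individuals. This is a hypothetical post-processing check written for this text; it is not the by-construction algorithm design the slide asks for, and the data layout is assumed.

```python
def anonymous_clusters(clusters, k):
    """Keep only clusters that contain trajectories of at least k distinct users.

    clusters : {cluster_id: [(user_id, trajectory_id), ...]}
    """
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if len({user for user, _ in members}) >= k
    }

clusters = {
    "commute_north": [("u1", "t1"), ("u2", "t2"), ("u3", "t3")],
    "night_route": [("u1", "t4"), ("u1", "t5"), ("u2", "t6")],
}
released = anonymous_clusters(clusters, k=3)
```

The second cluster has three trajectories but only two distinct users, so with k = 3 it is withheld. Counting users rather than trajectories matters because one person can contribute many tracks.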
Challenges
Causal Inference from Statistical Spatio-Temporal Data
Current project at IAIS for newspaper publisher:
Sales prediction of individual shops.
What happens if a shop closes or is sold out? Predict to which alternative shop customers go.
Spatio-Temporal Clustering of shops
Time Series Prediction
Modeling customer behavior
⇒ Causal inference about customer behavior
“If shop A closes, n% of A's customers go to B, m% to C”
Sales data per day per shop for several years available
Use similarity of time series over some period for determining anomaly in behavior
(Figure: map with legend: Closed Shop, Other shops, Alternative shops (strong to weak))
Use spatial structure to infer potential alternative shops.
People went from A to B when A is closed and B shows anomaly in behavior that cannot be explained otherwise
(Figure: map with legend: Closed Shop, Other shops, Alternative shops (strong to weak))
Diagrams such as this one can be generated automatically for historic cases
Challenge: based on historic examples come up with a predictive model
Ubiquitous Knowledge Discovery
Ubiquitous Knowledge Discovery (Embedded Data Mining on mobile and/or distributed microprocessors)
Grid Mining (Distributed Architecture, Grid Computing)
Knowledge Discovery in mobile Systems (Robots, RFID, GPS, mobile phones, Cars, ...)
Static and dynamic Sensor networks (Reality Mining)
Privacy-Preserving Data Mining
KDUbiq Coordination Action (EU, 2005-2008) – www.kdubiq.org
Ubiquitous Knowledge Discovery
Characteristics of ubiquitous knowledge discovery systems
objects are distributed in time and space
dynamic infrastructure (moving objects, appear and disappear)
analysis situation is in real-time, models evolve incrementally
objects have access to local information only, never see the global picture: only knowledge of local spatial environment
typically, objects exchange information with other objects
Spatial Data Mining is a key issue here!
KDUbiq reflects the future research challenges involved in this area
Summary
Spatial Data form a rich environment for analysis
Feature extraction and construction (Spatial Queries & Functions, Voronoi,…) play a very important role
Efficiency is often a big concern
A variety of approaches to Spatial Data Mining exist, coming from Statistics, Databases, Machine Learning
We have seen examples for density based clustering, kriging, subgroup discovery, association rules, model trees, kNN, Survival Analysis
Methods are different in the data types they can handle
Real-world applications are feasible today
Many more challenges in the future due to ubiquitous environments!
Literature (1)
Andrienko, N. and Andrienko G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer, 2005
Appice, A., M. Ceci, A. Lanza, F.A. Lisi, & D. Malerba (2003). Discovery of Spatial Association Rules in Georeferenced Census Data: A Relational Mining Approach, Intelligent Data Analysis, 7, 6.
Burrough, P., McDonnell, R., Principles of Geographical Information Systems, OUP, 1998
Cressie, N, 1993. Statistics for Spatial Data, Wiley
Egenhofer, M. Reasoning about binary topological relations. In Gunther O. and Schek H.-J., editors, Second Symposium on Large Spatial Databases, volume 525 of LNCS, pages 143-160. Springer, 1991.
Ester M., Kriegel H.-P., Sander J. and Xu X. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, 226-231.
Giannotti, F., Nanni, M., Pedreschi, D.: Efficient Mining of Temporally Annotated Sequences. SDM 2006
Goodchild, M.F., Spatial Autocorrelation. CATMOG 47, Geobooks. 1986, Norwich UK.
Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
Klösgen, W. (1996) Explora: A multipattern and multistrategy discovery assistant. In Fayyad, Advances in Knowledge Discovery and Data Mining. MIT Press.
Klösgen, W., May, M.: Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database. PKDD 2002: 275-28
Klösgen, W., May, M., Petch, J. 2003, Mining census data for spatial effects on mortality, Intelligent Data Analysis, Volume 7, Number 6 / 2003, Pages: 521 - 540
Literature (2)
Koperski, K., Han, J, Discovery of Spatial Association Rules in Geographic Information Databases (1995), Proc. 4th Int. Symp. Advances in Spatial Databases, SSD
Koperski, K., J. Adhikary and J. Han, “Spatial Data Mining: Progress and Challenges”, 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, June 1996
Lawson, A. B. and Denison, D. (2002) (eds) Spatial Cluster Modelling, Chapman & Hall CRC, London.
Lisi, F.A, D. Malerba (2004). Inducing Multi-Level Association Rules from Multiple Relations. Machine Learning, 55:175-210.
Longley, P., Goodchild, M, Maguire, D., Rhind, D, 2001. Geographic Information Systems and Science, Wiley
Malerba, D., Appice, A., Ceci, M. 2005, Mining Model Trees from Spatial Data, LNCS, PKDD 2005
May, M., Ragia, L. 2002, Spatial Subgroup Discovery Applied to the Analysis of Vegetation Data, PAKM 2002, LNCS 2569
May, M., Savinov, A 2004 SPIN!-An Enterprise Architecture for Spatial Data Mining, Knowledge-Based Intelligent Information and Engineering Systems, LNCS 2773, 2003
Openshaw, S., and Craft, A., (1991) 'Using geographical analysis machines to search for evidence of cluster and clustering in childhood leukaemia and non-Hodgkin Lymphomas in Britain'. In G. Draper (ed) 'The Geographical Epidemiology of Childhood Leukaemia and non-Hodgkin Lymphomas in Great Britain 1966-83', Studies in Medical and Population Subjects No 53, OPCS, London, HMSO
Ripley, B. 1988, Statistical Inference for Spatial Processes, CUP
Sander, J., M. Ester, H.-P. Kriegel, and X. Xu. Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Mining and Knowledge Discovery, 2(2):169-194, 1998.
Wrobel, S.: An Algorithm for Multi-relational Discovery of Subgroups. PKDD 1997: 78-87
Fraunhofer IAIS – Knowledge Discovery
Dr. Michael May
Contact:
Michael May
Schloss Birlinghoven
53754 Sankt Augustin
Tel: 02241 / 14 2731 / 2039
eMail: michael.may@iais.fraunhofer.de
Thanks!