Shashi Shekhar Mining For Spatial Patterns 1
Mining for Spatial Patterns
Shashi Shekhar
Department of Computer Science University of Minnesota
http://www.cs.umn.edu/~shekhar
Collaborators:
U. of Minnesota: V. Kumar, G. Karypis, C.T. Lu, W. Wu, Y. Huang, V. Raju, P. Zhang, P. Tan, M. Steinbach
NASA Ames Research Center: C. Potter
California State University, Monterey Bay: S. Klooster
This work was partially funded by NASA and Army High Performance Computing Center
Background
NSF workshop on GIS and DM (3/99)
Spatial data - traffic, bird habitats, global climate, logistics, ...
For spatial patterns - outliers, location prediction, associations, sequential associations, clustering, trends, …
Framework
Problem statement: capture special needs
Data exploration: maps, new methods
Try reusing classical methods from data mining, spatial statistics
If reuse is not possible, invent new methods
Validation, Performance tuning
Research Goals
[Figure: data cube of SST, Precipitation, NPP, and Pressure; axes: Longitude, Latitude, Time; labels: grid cell, zone]
• modeling of ecological data
  • event modeling
  • zone modeling
• finding spatio-temporal patterns
  • associations
  • predictive models
A key interest is finding connections between the ocean and the land.
Sources of Earth Science Data
Before 1950, very sparse, unreliable data.
Since 1950, reliable global data.
Ocean temperature and pressure are based on data from ships.
Most land data (solar, precipitation, temperature and pressure) comes from weather stations.
Since 1981, data has been available from Earth orbiting satellites.
FPAR, a measure related to plant
Since 1999 TERRA, the flagship of the NASA Earth Observing System, is providing much more detailed data.
Example Pattern: Teleconnections
Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth.
For example, El Nino is the anomalous warming of the eastern tropical region of the Pacific, and has been linked to various climate phenomena.
• Droughts in Australia and Southern Africa
• Heavy rainfall along the western coast of South America
• Milder winters in the Midwest
Net Primary Production (NPP)
Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants.
NPP is driven by solar radiation and can be constrained by precipitation and temperature.
NPP is a key variable for understanding the global carbon cycle and ecological dynamics of the Earth. Keeping track of NPP is important because it includes the food source of humans and all other organisms.
Sudden changes in the NPP of a region can have a direct impact on the regional ecology.
An ecosystem model for predicting NPP, CASA (the Carnegie Ames Stanford Approach) provides a detailed view of terrestrial productivity.
Benefits of Data Mining
Data mining provides earth scientists with tools that allow them to spend more time choosing and exploring interesting families of hypotheses.
However, statistics is needed to provide methods for determining the “statistical” significance of results.
By applying the proposed data mining techniques, some of the steps of hypothesis generation and evaluation will be automated, facilitated and improved.
Association rules provide a “new” framework for detecting relationships between events.
Clustering
Interested in relationships between regions, not “points.”
For land, clustering based on NPP or other variables, e.g., precipitation, temperature.
For ocean, clustering based on SST (Sea Surface Temperature).
When “raw” NPP and SST are used, clustering can find seasonal patterns.
Anomalous regions have plant growth patterns that are reversed from those typically observed in the hemisphere in which they reside, and are easy to spot.
Clustering
SNN clusters of SST that are highly correlated with El Nino indices.
[Figure: El Nino Related SST Clusters; axes: longitude (-180 to 180), latitude (-90 to 90)]
El Nino Regions
Niño Region    Longitude Range    Latitude Range
1+2            90°W-80°W          10°S-0°
3              150°W-90°W         5°S-5°N
3.4            170°W-120°W        5°S-5°N
4              160°E-150°W        5°S-5°N
Spatial Association Rule
Citation: Symp. On Spatial Databases 2001
Problem: Given a set of boolean spatial features, find subsets of co-located features, e.g. (fire, drought, vegetation)
Data - continuous space, partition not natural, no reference feature
Classical data mining approach: association rules
But, Look Ma! No Transactions!!! No support measure!
Approach: Work with continuous data without transactionizing it!
• confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
• support: cardinality of spatial join of instances of fire, drought, dry veg.
• participation: min. fraction of instances of a feature in join result
• new algorithm using spatial joins and apriori_gen filters
Event Definition
Convert the time series into sequence of events at each spatial location.
Grid Cell (x,y)    t1           t2              t3
(1,1)              ∅            ∅               ∅
(1,2)              {A, B, D}    {D, L, J}       ∅
(1,3)              ∅            {A, B, E, G}    {B, C, D}
(1,4)              {A, K, M}    ∅               ∅
(2,1)              {B, C, E}    {E, G, M}       {C, F, M}
(2,2)              ∅            {C, E, F}       {A, B, G, L}
(2,3)              ∅            ∅               ∅
(2,4)              {A, B}       {D, F}          {A, B, D}
(3,1)              ∅            ∅               ∅
(3,2)              {A, B, G}    ∅               {A, B, E}
(3,3)              {C, M}       ∅               ∅
(3,4)              ∅            ∅               ∅
(4,1)              ∅            ∅               ∅
(4,2)              ∅            {D, K, L}       ∅
(4,3)              ∅            ∅               {E, G, K}
(4,4)              ∅            {A, B}          {D, E, F}
[Figure: the same event sets drawn on an x-y grid at time steps t1, t2, t3]
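As a rough illustration of how the event table could be queried, the sketch below stores only the non-empty cells and counts how often a set of events occurs together. Treating each non-empty (grid cell, time step) entry as one transaction is an assumption made here for illustration, and `EVENTS` and `support` are hypothetical names; the slides do not specify the counting scheme.

```python
# Event sets per (grid cell, time step), copied from the event table.
# Empty cells are omitted. Treating each entry as one "transaction" is an
# assumption for illustration, not a method stated on the slide.
EVENTS = {
    ((1, 2), "t1"): {"A", "B", "D"},
    ((1, 4), "t1"): {"A", "K", "M"},
    ((2, 1), "t1"): {"B", "C", "E"},
    ((2, 4), "t1"): {"A", "B"},
    ((3, 2), "t1"): {"A", "B", "G"},
    ((3, 3), "t1"): {"C", "M"},
    ((1, 2), "t2"): {"D", "L", "J"},
    ((1, 3), "t2"): {"A", "B", "E", "G"},
    ((2, 1), "t2"): {"E", "G", "M"},
    ((2, 2), "t2"): {"C", "E", "F"},
    ((2, 4), "t2"): {"D", "F"},
    ((4, 2), "t2"): {"D", "K", "L"},
    ((4, 4), "t2"): {"A", "B"},
    ((1, 3), "t3"): {"B", "C", "D"},
    ((2, 1), "t3"): {"C", "F", "M"},
    ((2, 2), "t3"): {"A", "B", "G", "L"},
    ((2, 4), "t3"): {"A", "B", "D"},
    ((3, 2), "t3"): {"A", "B", "E"},
    ((4, 3), "t3"): {"E", "G", "K"},
    ((4, 4), "t3"): {"D", "E", "F"},
}


def support(pattern):
    """Count (cell, time) entries whose event set contains every event in pattern."""
    pattern = set(pattern)
    return sum(1 for events in EVENTS.values() if pattern <= events)


if __name__ == "__main__":
    print("entries containing A and B:", support({"A", "B"}))
    print("entries containing E and G:", support({"E", "G"}))
```

A dictionary keyed by (cell, time) keeps the sparse structure of the table explicit; a real dataset would more likely use an array or database table.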
Interesting Association Patterns
Use domain knowledge to eliminate uninteresting patterns.
A pattern is less interesting if it occurs at random locations.
Approach:
• Partition the land area into distinct groups (e.g., based on land-cover type).
• For each pattern, find the regions for which the pattern can be applied.
• If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.
• If the pattern occurs frequently in all groups of land areas, then it is less interesting.
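The filtering approach above can be sketched as follows. This is only one possible reading: the land-cover groups, the cell ids, and the `ratio` threshold are all hypothetical, since the slides give no numeric criterion for "mostly in a certain group".

```python
def occurrence_rates(pattern_cells, group_of):
    """Fraction of each group's cells in which the pattern occurs."""
    totals, hits = {}, {}
    for cell, group in group_of.items():
        totals[group] = totals.get(group, 0) + 1
        if cell in pattern_cells:
            hits[group] = hits.get(group, 0) + 1
    return {g: hits.get(g, 0) / totals[g] for g in totals}


def is_potentially_interesting(rates, ratio=3.0):
    """True if the top group's rate is at least `ratio` times the next group's rate.

    The ratio threshold is a hypothetical choice; the slides give no number.
    """
    ranked = sorted(rates.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0] > 0 and ranked[0] >= ratio * ranked[1]


# Hypothetical partition: 12 cells in three land-cover groups of 4 cells each.
GROUP_OF = {c: "shrubland" for c in range(0, 4)}
GROUP_OF.update({c: "forest" for c in range(4, 8)})
GROUP_OF.update({c: "cropland" for c in range(8, 12)})

CONCENTRATED = {0, 1, 2, 5}       # mostly in shrubland
SPREAD = {0, 1, 4, 5, 8, 9}       # equally frequent in all groups

if __name__ == "__main__":
    for name, cells in (("concentrated", CONCENTRATED), ("spread", SPREAD)):
        rates = occurrence_rates(cells, GROUP_OF)
        print(name, rates, is_potentially_interesting(rates))
```

Rates are normalized by group size so that a large group does not look interesting merely because it contains more cells.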
Association Rules
Intra-zone non-sequential Patterns
Shrubland regions: FPAR-Hi → NPP-Hi (support 10)
• Region corresponds to semi-arid grasslands, a type of vegetation that is able to take advantage of high precipitation more quickly than forests.
• Hypothesis: FPAR-Hi events could be related to unusual precipitation conditions.
Co-location
Can you find co-location patterns from the following sample dataset?
[Figure: sample dataset of spatial feature instances]
Co-location
Spatial Co-location: a set of features frequently co-located
Given
• A set T of K boolean spatial feature types T = {f1, f2, …, fK}
• A set P of N locations P = {p1, …, pN} in a spatial framework S, where each pi ∈ P is of some spatial feature in T
• A neighbor relation R over locations in S
Find
• Tc = subsets of T frequently co-located
Objective
• Correctness
• Completeness
• Efficiency
Constraints
• R is symmetric and reflexive
• Monotonic prevalence measure
[Figure panels: Reference Feature Centric, Window Centric, Event Centric]
Co-location
Participation index
1. Participation ratio pr(fi, c) of feature fi in co-location c = {f1, f2, …, fk}: fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby
2. Participation index = min{pr(fi, c)}
Algorithm: Hybrid Co-location Miner
Comparison with association rules
                                   Association rules         Co-location rules
underlying space                   discrete sets             continuous space
item-types                         item-types                events / Boolean spatial features
collections                        transactions              neighborhoods
prevalence measure                 support                   participation index
conditional probability measure    Pr.[ A in T | B in T ]    Pr.[ A in N(L) | B at L ]
Spatial Co-location Patterns
[Figure: Dataset]
• Spatial feature A, B, C and their instances
• Possible associations are (A, B), (B, C), etc.
• Neighbor relationship includes following pairs:
  • A1, B1
  • A2, B1
  • A2, B2
  • B1, C1
  • B2, C2
Spatial Co-location Patterns
Spatial feature A, B, C, and their instances
[Figure: Dataset, shown under two different partitionings]
Partition approach [Yasuhiko, KDD 2001]
• One partitioning: Support(A,B) = 2, Support(B,C) = 2
• Another partitioning: Support(A,B) = 1, Support(B,C) = 2
• Support not well defined, i.e. not independent of execution trace
• Has a fast heuristic which is hard to analyze for correctness/completeness
Spatial Co-location Patterns
Spatial feature A, B, C, and their instances
[Figure: Dataset]
Reference feature approach [Han SSD 95]
• C as reference feature to get transactions
• Transactions: (B1) (B2)
• Support(A,B) = ∅ from Apriori algorithm
• Note: Neighbor relationship includes following pairs:
  • A1, B1
  • A2, B1
  • A2, B2
  • B1, C1
  • B2, C2
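To make the reference feature approach concrete, the sketch below builds one transaction per instance of the reference feature from the neighbor pairs listed on the slide. The function names are hypothetical, and reading the feature type from the first letter of an instance id is a convenience assumed here.

```python
# Neighbor pairs as listed on the slide.
NEIGHBOR_PAIRS = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("B1", "C1"), ("B2", "C2")]


def feature(instance):
    """Feature type is the leading letter of an instance id, e.g. 'A1' -> 'A'."""
    return instance[0]


def transactions(reference):
    """One transaction per instance of the reference feature: its direct neighbors.

    Reference instances with no neighbors at all would not appear; that case
    does not arise with the pairs above.
    """
    result = {}
    for a, b in NEIGHBOR_PAIRS:
        for ref, other in ((a, b), (b, a)):
            if feature(ref) == reference:
                result.setdefault(ref, set()).add(other)
    return result


def support(features, reference):
    """Number of transactions containing an instance of every listed feature."""
    wanted = set(features)
    return sum(
        1
        for items in transactions(reference).values()
        if wanted <= {feature(i) for i in items}
    )


if __name__ == "__main__":
    print(transactions("C"))
    print("support(A, B) with C as reference:", support({"A", "B"}, "C"))
```

Because neither C1 nor C2 has an A instance as a direct neighbor, no transaction contains A, which is why the pattern (A, B) gets no support under this approach even though A and B instances are neighbors of each other.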
Spatial Co-location Patterns
Spatial feature A, B, C, and their instances
[Figure: Dataset]
Our approach (Event Centric)
• Neighborhood instead of transactions
• Spatial join on neighbor relationship
• Prevalence instead of support
• Participation index = min. p_ratio
• P_ratio(A, (A,B)) = fraction of instances of A participating in join(A, B, neighbor)
• Examples
  Support(A,B) = min(2/2, 3/3) = 1
  Support(B,C) = min(2/2, 2/2) = 1
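A minimal sketch of the participation ratio and participation index for pairs of features, using the neighbor pairs listed on the earlier slide. The instance sets in `INSTANCES` are an assumption: they contain only the instances named in those pairs, because the dataset figure is not available. The slide writes the second ratio for (A, B) as 3/3, while this sketch gets 2/2 from the recoverable instances; both give an index of 1. All function names are hypothetical.

```python
# Neighbor pairs as listed on the slide.
NEIGHBOR_PAIRS = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("B1", "C1"), ("B2", "C2")]

# Assumption: only instances named in the neighbor pairs exist.
INSTANCES = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1", "C2"}}


def join(f, g):
    """Spatial join on the neighbor relation: rows (instance of f, instance of g)."""
    rows = []
    for a, b in NEIGHBOR_PAIRS:
        if a[0] == f and b[0] == g:
            rows.append((a, b))
        elif a[0] == g and b[0] == f:
            rows.append((b, a))
    return rows


def participation_ratio(f, g):
    """Fraction of instances of f that take part in join(f, g)."""
    participating = {row[0] for row in join(f, g)}
    return len(participating) / len(INSTANCES[f])


def participation_index(f, g):
    """Minimum participation ratio over the features of the co-location {f, g}."""
    return min(participation_ratio(f, g), participation_ratio(g, f))


if __name__ == "__main__":
    for pair in (("A", "B"), ("B", "C"), ("A", "C")):
        print(pair, participation_index(*pair))
```

Unlike the partition and reference feature approaches, this measure depends only on the neighbor relation, so it does not change with the order in which instances are processed.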
Spatial Co-location Patterns
Spatial feature A, B, C, and their instances
[Figure: Dataset]
Partition approach
• Support(A,B) = 2, Support(B,C) = 2
• Support(A,B) = 1, Support(B,C) = 2
Reference feature approach
• C as reference feature
• Transactions: (B1) (B2)
• Support(A,B) = ∅
Our approach
• Support(A,B) = min(2/2, 3/3) = 1
• Support(B,C) = min(2/2, 2/2) = 1
Spatial Outliers
Spatial Outlier: A data point that is extreme relative to its neighbors
Case Study: traffic stations different from neighbors [SIGKDD 2001]
Data - space-time plot, distr. of f(x), S(x)
Distribution of base attribute:
• spatially smooth
• frequency distribution over value domain: normal
Classical test - Pr.[item in population] is low
Q? distribution of diff.[f(x), neighborhood agg{f(x)}]
Insight: this statistic is distributed normally!
Test: (z-score on the statistics) > 2
Performance - spatial join, clustering methods
Spatial Outlier Detection
Given
• A spatial graph G = {V, E}
• A neighbor relationship (K neighbors)
• An attribute function $f: V \to R$
• An aggregation function $F_{aggr}: R^k \to R$
• A comparison function $F_{diff}(f, F_{aggr})$
• Confidence level threshold $\theta$
• Statistic test function ST: R -> {T, F}
Find
• O = {vi | vi ∈ V, vi is a spatial outlier}
Objective
• Correctness: the attribute value of vi is extreme, compared with its neighbors
• Computational efficiency
Constraints
• $F_{aggr}$ and ST are algebraic aggregate functions of $f$ and $F_{diff}$
• Computation cost dominated by I/O op.
Spatial Outlier Detection
Spatial Outlier Detection Test
1. Choice of Spatial Statistic: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
   Theorem: S(x) is normally distributed if f(x) is normally distributed
2. Test for Outlier Detection: $\left| (S(x) - \mu_s) / \sigma_s \right| > \theta$
Hypothesis: I/O cost determined by clustering efficiency
[Figure panels: f(x), S(x)]
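The two-step test can be sketched in a few lines. The ring of ten stations and their readings below are made-up illustration data, the names are hypothetical, and the threshold of 2 follows the "z-score > 2" rule quoted on the earlier slide. The sketch uses the population standard deviation of S(x); the slides do not say which estimator is intended.

```python
from statistics import mean, pstdev


def spatial_statistic(values, neighbors):
    """S(x) = f(x) minus the mean of f over the neighbors of x."""
    return {x: values[x] - mean([values[y] for y in neighbors[x]]) for x in values}


def spatial_outliers(values, neighbors, theta=2.0):
    """Return nodes whose S(x) has an absolute z-score above theta."""
    s = spatial_statistic(values, neighbors)
    s_values = list(s.values())
    mu = mean(s_values)
    sigma = pstdev(s_values)
    if sigma == 0:
        return []  # all S(x) identical: nothing stands out
    return sorted(x for x in s if abs((s[x] - mu) / sigma) > theta)


# Hypothetical data: ten stations on a ring, each with two neighbors.
# Every station reads 10 except station 5, which reads 50.
VALUES = {i: 10 for i in range(10)}
VALUES[5] = 50
NEIGHBORS = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}

if __name__ == "__main__":
    print(spatial_statistic(VALUES, NEIGHBORS))
    print("outliers:", spatial_outliers(VALUES, NEIGHBORS))
```

Note that the neighbors of station 5 also get a non-zero S(x), because their neighborhood average is pulled up, but their z-scores stay below the threshold in this example.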
Spatial Outlier Detection
Results
1. CCAM achieves higher clustering efficiency (CE)
2. CCAM has lower I/O cost
3. High CE => low I/O cost
4. Big Page => high CE
[Figure: CE value and I/O cost for Z-order, CCAM, and Cell-Tree]
A Unified Approach: Spatial Outliers
[Figure panels: Original Data, Our Approach, Scatter Plot]
• Tests: quantitative, graphical
• Results:
  • Computation = spatial self-join
  • Tests: algebraic functions of join
  • Join predicate: neighbor relations
  • I/O-cost: f(clustering efficiency)
  • Our algorithm is I/O-efficient for algebraic tests
Graphical Spatial Tests
[Figure panels: Original Data, Variogram Cloud, Moran Scatter Plot]
Location Prediction
Citations: IEEE Tran. on Multimedia 2002, SIAM DM Conf. 2001, SIGKDD DMKD 2000
Problem: predict nesting site in marshes given vegetation, water depth, distance to edge, etc.
Data - maps of nests and attributes
• spatially clustered nests, spatially smooth attributes
Classical method: logistic regression, decision trees, Bayesian classifier
• but, independence assumption is violated! Misses auto-correlation!
Spatial auto-regression (SAR), Markov random field Bayesian classifier
Open issue: spatial accuracy vs. classification accuracy
Open issue: performance - SAR learning is slow!
Location Prediction
Given:
1. Spatial Framework: $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k}: S \to R$
3. A dependent class: $f_C: S \to C = \{c_1, \ldots, c_M\}$
4. A family of function mappings: $R \times \ldots \times R \to C$
Find: Classification model: $\hat{f}_C$
Objective: maximize classification_accuracy$(\hat{f}_C, f_C)$
Constraints: Spatial Autocorrelation exists
[Figure panels: Nest locations, Distance to open water, Vegetation durability, Water depth]
Solution Procedures
• Spatial Autoregression Model (SAR)
  • $y = \rho W y + X\beta + \epsilon$
  • W models neighborhood relationships
  • $\rho$ models strength of spatial dependencies
  • $\epsilon$ error vector
• Solutions
  • $\rho$ and $\beta$ can be estimated using ML or Bayesian stat.
  • e.g., spatial econometrics package uses Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) method.
  • Likelihood-based estimation requires $O(n^3)$ ops.
  • Other alternatives – divide and conquer, sparse matrix, LU decomposition, etc.
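To show what the SAR equation y = ρWy + Xβ + ε computes, the sketch below evaluates the model for given parameters on a three-site example. It does not estimate ρ or β, and it is not one of the solution procedures listed on the slide: it uses plain fixed-point iteration, with the error term set to zero, which converges here because W is row-normalized and |ρ| < 1. The matrix, parameter values, and function name are hypothetical.

```python
def sar_predict(W, X, beta, rho, iterations=200):
    """Solve y = rho * W y + X beta by fixed-point iteration (error term taken as 0)."""
    n = len(W)
    # X beta: the non-spatial part of the model.
    xb = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
    y = xb[:]
    for _ in range(iterations):
        y = [rho * sum(W[i][j] * y[j] for j in range(n)) + xb[i] for i in range(n)]
    return y


# Hypothetical example: three sites on a line, row-normalized neighbor matrix.
W = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
X = [[1.0], [2.0], [3.0]]
BETA = [1.0]
RHO = 0.5

if __name__ == "__main__":
    print("without spatial term:", sar_predict(W, X, BETA, rho=0.0))
    print("with spatial term:   ", sar_predict(W, X, BETA, rho=RHO))
```

With ρ = 0 the model reduces to ordinary linear regression; with ρ > 0 each site's value is pulled toward the values of its neighbors.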
Evaluation
Linear Regression: $y = X\beta + \epsilon$
Spatial Regression: $y = \rho W y + X\beta + \epsilon$
Spatial model is better
Solution Procedures
• Markov Random Field based Bayesian Classifiers
  • Pr(li | X, Li) = Pr(X | li, Li) Pr(li | Li) / Pr(X)
  • Pr(li | Li) can be estimated from training data
  • Li denotes set of labels in the neighborhood of si excluding labels at si
  • Pr(X | li, Li) can be estimated using kernel functions
• Solutions
  • stochastic relaxation [Geman]
  • Iterated conditional modes [Besag]
  • Graph cut [Boykov]
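A small sketch of iterated conditional modes, one of the solution procedures listed above, on a one-dimensional chain of sites. Two simplifications are assumed and are not taken from the slides: the likelihood Pr(X | li, Li) is replaced by a Gaussian around a per-class mean (instead of a kernel estimate), and the prior Pr(li | Li) is replaced by a Potts-style term that rewards agreement with neighboring labels (instead of an estimate from training data). All names and numbers are hypothetical.

```python
def icm(x, neighbors, means, sigma=0.5, beta=0.5, sweeps=10):
    """Iterated conditional modes for site labels on a graph.

    Score of label l at site i = Gaussian log-likelihood of x[i] around
    means[l] (assumed model) + beta * (number of neighbors currently labeled l).
    """
    def loglik(value, label):
        return -((value - means[label]) ** 2) / (2 * sigma ** 2)

    # Start from the labels that maximize the likelihood alone.
    labels = [max(means, key=lambda l: loglik(v, l)) for v in x]
    for _ in range(sweeps):
        changed = False
        for i, v in enumerate(x):
            def score(l):
                agree = sum(1 for j in neighbors[i] if labels[j] == l)
                return loglik(v, l) + beta * agree
            best = max(means, key=score)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # reached a local optimum
    return labels


# Hypothetical observations on a chain of eight sites; site 2 is noisy.
X_OBS = [0.1, 0.0, 0.6, 0.2, 0.1, 0.9, 1.0, 0.8]
CHAIN = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(X_OBS)] for i in range(len(X_OBS))}
MEANS = {0: 0.0, 1: 1.0}

if __name__ == "__main__":
    print("independent labels:", icm(X_OBS, CHAIN, MEANS, beta=0.0))
    print("ICM labels:        ", icm(X_OBS, CHAIN, MEANS))
```

With `beta=0.0` each site is classified independently and the noisy site 2 is labeled differently from its surroundings; with the neighborhood term switched on, its label is pulled into agreement with its neighbors.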
Comparison
• SAR can be rewritten as $y = (QX)\beta + Q\epsilon$, where $Q = (I - \rho W)^{-1}$, which can be viewed as a spatial smoothing operation.
• This transformation shows that SAR is similar to the linear logistic model, and thus suffers from the same limitations, i.e., the SAR model assumes linear separability of classes in the transformed feature space.
• The SAR model also makes more restrictive assumptions about the distribution of features and class shapes than MRF.
• The relationship between SAR and MRF is analogous to the relationship between logistic regression and Bayesian classifiers.
• Our experimental results show that the MRF model yields better spatial and classification accuracies than SAR predictions.
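The rewriting of SAR can be spelled out step by step. Starting from the SAR model with spatial dependence strength ρ, coefficient vector β, and error vector ε, and assuming the matrix (I − ρW) is invertible:

```latex
\begin{aligned}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1} \epsilon \\
  &= (QX)\beta + Q\epsilon, \qquad Q = (I - \rho W)^{-1}
\end{aligned}
```

In this form the model is linear in the transformed features QX, which is the sense in which SAR assumes linear separability in the transformed feature space.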
MRF vs. SAR
[Figure panels: Confusion Matrix, Spatial Confusion Matrix]
Conclusion and Future Directions
Spatial domains may not satisfy assumptions of classical methods
• data: auto-correlation, continuous geographic space
• patterns: global vs. local, e.g. spatial outliers vs. outliers
• data exploration: maps and albums
Open Issues
• patterns: hot-spots, blobology (shape), spatial trends, …
• metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
• spatio-temporal dataset
• scale and resolution sensitivity of patterns
• geo-statistical confidence measure for mined patterns
References
1. S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, “Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
2. S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
3. S. Shekhar, C.T. Lu, P. Zhang, “Detecting Graph-based Spatial Outliers: Algorithms and Applications”, the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
4. S. Shekhar, C.T. Lu, P. Zhang, “Detecting Graph-based Spatial Outlier”, Intelligent Data Analysis, To appear in Vol. 6(3), 2002.
5. S. Shekhar, S. Chawla, the book “Spatial Database: Concepts, Implementation and Trends”, Prentice Hall, 2002.
6. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. of 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
7. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001.
8. S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”, To appear in IEEE Transactions on Multimedia, 2002.
9. S. Shekhar, V. Kumar, P. Tan, M. Steinbach, Y. Huang, P. Zhang, C. Potter, S. Klooster, “Mining Patterns in Earth Science Data”, IEEE Computing in Science and Engineering (Submitted).
10. S. Shekhar, C.T. Lu, P. Zhang, “A Unified Approach to Spatial Outliers Detection”, IEEE Transactions on Knowledge and Data Engineering (Submitted).
11. S. Shekhar, C.T. Lu, X. Tan, S. Chawla, “Map Cube: A Visualization Tool for Spatial Data Warehouses”, chapter in Geographic Data Mining and Knowledge Discovery, Harvey J. Miller and Jiawei Han (eds.), Taylor and Francis, 2001, ISBN 0-415-23369-0.
12. S. Shekhar, Y. Huang, W. Wu, C.T. Lu, “What's Spatial about Spatial Data Mining: Three Case Studies”, chapter in Data Mining for Scientific and Engineering Applications, V. Kumar, R. Grossman, C. Kamath, R. Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2.
13. S. Shekhar and Y. Huang, “Multi-resolution Co-location Miner: a New Algorithm to Find Co-location Patterns in Spatial Datasets”, Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference), April 2002.