Semantic Aspects in Spatial Data Mining
Vania Bogorny
2
Introduction
• Existing approaches for spatial data mining, in general, do not make use of prior knowledge
• Bogorny (2006) and Bogorny (2007) introduced the idea of using background knowledge
– In data preprocessing, to reduce spatial joins
– In spatial association rule mining, to eliminate well known patterns
3
Main Problems
• Unnecessary spatial relationship computation
• Large numbers of association rules
• Many associations are well known natural geographic dependences
• Existing approaches for mining spatial association rules (SAR) are Apriori-like
– Most approaches do not make use of background knowledge
– They use syntactic constraints for frequent set and rule pruning
– Only the data is considered, not the schema
• Result
– The same associations that the database designer explicitly represents in the schema are extracted by SAR mining algorithms
4
Spatial Relationships (Güting, 1994)
[Figure: three kinds of spatial relationships among features A, B, and C]
– Order: B north A; C southeast A
– Distance: the distance d between two features
– Topological: touches, overlaps, inside, contains, crosses, equals, disjoint
5
Topological Relationships – GEOMETRICALLY POSSIBLE
• Order and distance relationships may exist among any pair of spatial features
• Topological relations depend on the geometry
[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against geometric combinations, following the OGC standard]
6
Topological Relationships – SEMANTICALLY CONSISTENT
[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against semantic combinations of feature types: Factory and Hospital; Bridge and River; Factory and Airport; River and Road; Beach and Sea; State and Country]
7
Spatial Relationships
– Mandatory (spatial constraints, i.e., dependences): <Island> <inside> <1><1> <Water Body>
– Prohibited: <River> <contains> <0><0> <Road>
– Possible: normally undefined, e.g., Road crosses River
For data mining and knowledge discovery, the POSSIBLE and PROHIBITED relationships are the interesting ones.
All others are well known.
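The three categories can be told apart from the cardinality pair attached to each relationship. The following is a minimal sketch, not the author's implementation: the class `SpatialConstraint`, the function `classify`, and the use of 0..unbounded for a "possible" relationship are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialConstraint:
    """Hypothetical record for <source> <relation> <min><max> <target>."""
    source: str
    relation: str
    min_card: int
    max_card: Optional[int]  # None means unbounded (assumed encoding)
    target: str


def classify(c: SpatialConstraint) -> str:
    """Return 'prohibited', 'mandatory', or 'possible' from the cardinalities."""
    if c.max_card == 0:
        return "prohibited"   # <0><0>: the relationship may never hold
    if c.min_card >= 1:
        return "mandatory"    # <1><1>: a well known dependence
    return "possible"         # may or may not hold


island = SpatialConstraint("Island", "inside", 1, 1, "Water Body")
river = SpatialConstraint("River", "contains", 0, 0, "Road")
road = SpatialConstraint("Road", "crosses", 0, None, "River")

for c in (island, river, road):
    print(c.source, c.relation, c.target, "->", classify(c))
```

Only the constraints classified as mandatory would be treated as well known dependences in the pruning steps discussed later.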
8
Well Known Geographic Dependences
[Figure labels: Non-obvious spatial relationships; Well known dependences]
Is_a(gasStation) → intersects(street) (100%)
Is_a(island) → within(waterResource) (100%)
9
Well Known Relationships vs. Association Rules
[Map legends: Bridges & Viaducts, Roads, Vegetation; Bus Stop, Street]
contains(gasStation) → contains(Street) (100%)
intersects(busStop) → intersects(Street) (100%)
contains(viaduct) → contains(road) (100%)
10
Well Known Associations – Conceptual Schemas
[Figure: conceptual schema (class diagram) with the classes below, connected by associations with multiplicities 1, 0..n, and 1..n]
– Country (name, geometry)
– State (name, geometry)
– County (name, population, geometry)
– Factory (name, mainActivity, impactDegree)
– Water Resource (name, extension, geometry)
– Island (geometry)
Well known associations in the schema: {State, Country}, {Factory, County}, {Island, WaterBody}, …
11
Well Known Associations – Conceptual Schemas
Source: 1st Survey Division of the Brazilian Army (1ª Divisão de Levantamento do Exército Brasileiro)
12
Well Known Associations – Conceptual Schemas
13
Well Known Associations – Geo-Ontologies
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
      <owl:allValuesFrom rdf:resource="#WaterResource"/>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="#Within"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
Geographic dependences are explicit in geo-ontologies (Bogorny, 2005b)
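Because the dependence is explicit in the ontology, it can be read out mechanically. The sketch below is an illustration only: it wraps the Island class in an `rdf:RDF` element with the standard RDF, RDFS, and OWL namespaces (the wrapper is an assumption, since the slide shows only the class), and the function name `extract_dependences` is hypothetical. It uses Python's standard `xml.etree.ElementTree` rather than an OWL library.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The Island class from the slide, wrapped in an rdf:RDF root (assumed wrapper).
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="Island">
    <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
        <owl:allValuesFrom rdf:resource="#WaterResource"/>
        <owl:onProperty>
          <owl:ObjectProperty rdf:about="#Within"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>"""


def extract_dependences(xml_text):
    """Return (class, property, target) triples whose minCardinality is >= 1."""
    root = ET.fromstring(xml_text)
    deps = []
    for cls in root.iter(f"{{{OWL}}}Class"):
        name = cls.get(f"{{{RDF}}}ID")
        for restriction in cls.iter(f"{{{OWL}}}Restriction"):
            card = restriction.find(f"{{{OWL}}}minCardinality")
            target = restriction.find(f"{{{OWL}}}allValuesFrom")
            prop = restriction.find(f"{{{OWL}}}onProperty/{{{OWL}}}ObjectProperty")
            if card is None or target is None or prop is None:
                continue
            if int(card.text) >= 1:  # a mandatory relationship is a dependence
                deps.append((name,
                             prop.get(f"{{{RDF}}}about").lstrip("#"),
                             target.get(f"{{{RDF}}}resource").lstrip("#")))
    return deps


print(extract_dependences(DOC))
```

The extracted triples could then feed the set of knowledge constraints used for pruning.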
14
Well Known Associations – Geo-Ontologies
15
Well Known Dependences vs. Spatial Association Rules (SAR)
• Well known dependences affect the 3 main steps in the process of mining SAR:
– Spatial predicate computation: compute unnecessary relationships
– Frequent set generation: generate frequent itemsets with well known patterns
– Association rule extraction: produce a high number of rules with well known dependences
Well Known Dependences in SAR
17
Example of preprocessed spatial dataset
Tuple (city)   Spatial Predicates
1              contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2              contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3              contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4              contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5              contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6              contains(Hospital), contains(TreatedWaterNet), contains(Factory)
18
Problem 1 – Geographic Dependences between the Target Feature Type and Relevant Feature Types
Dependence = City and TreatedWaterNet
contains(TreatedWaterNet) has 100% support in the example dataset
Minconf = 70%
Example rule: contains(Hospital) → contains(TreatedWaterNet)
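The effect of this dependence can be checked directly on the six-city example. The sketch below is illustrative: the helper names `support` and `confidence` are assumptions, and the rows are transcribed from the example dataset. Since every city contains a treated water network, any rule with that predicate as its consequent reaches 100% confidence and passes the 70% threshold without telling the analyst anything new.

```python
# Example dataset: one set of spatial predicates per city (tuples 1 to 6).
ROWS = [
    {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
     "contains(Factory)", "crosses(WaterBody)"},
    {"contains(Hospital)", "contains(TreatedWaterNet)", "crosses(WaterBody)"},
    {"contains(Port)", "contains(TreatedWaterNet)", "contains(Factory)",
     "crosses(WaterBody)"},
    {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
     "crosses(WaterBody)"},
    {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
     "contains(Factory)", "crosses(WaterBody)"},
    {"contains(Hospital)", "contains(TreatedWaterNet)", "contains(Factory)"},
]
TWN = "contains(TreatedWaterNet)"


def support(itemset, rows):
    """Fraction of rows that contain every predicate in itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def confidence(antecedent, consequent, rows):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


print("support(TreatedWaterNet) =", support({TWN}, ROWS))
for predicate in sorted(set().union(*ROWS) - {TWN}):
    print(predicate, "->", TWN, "confidence", confidence({predicate}, {TWN}, ROWS))
```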
19
Problem 2 - Dependences among Relevant Feature Types
Dependence = {Port, WaterBody}
Minsup = 50%
25 frequent sets (6 contain the dependence)
9 closed frequent sets (3 contain the dependence)
Example rule: contains(Port) → crosses(WaterBody)
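The counts on this slide can be reproduced by brute force on the example dataset. This is a sketch for checking the numbers, not an efficient miner: it enumerates every combination of predicates, and it uses the single-letter codes A = contains(Port), C = contains(Hospital), D = contains(TreatedWaterNet), T = contains(Factory), W = crosses(WaterBody). The function name `frequent_sets` is an assumption.

```python
from itertools import combinations

# Example dataset keyed by tuple id (city).
DATASET = {
    1: {"A", "C", "D", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "D", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def frequent_sets(dataset, minsup):
    """Enumerate all predicate sets whose support is at least minsup."""
    items = sorted(set().union(*dataset.values()))
    min_count = minsup * len(dataset)
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for row in dataset.values() if set(combo) <= row)
            if count >= min_count:
                found.append(frozenset(combo))
    return found


FREQUENT = frequent_sets(DATASET, 0.5)
WITH_DEPENDENCE = [s for s in FREQUENT if {"A", "W"} <= s]

print(len(FREQUENT), "frequent sets")
print(len(WITH_DEPENDENCE), "contain the dependence {A,W}")
```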
Pruning Methods using Background Knowledge
21
Frequent Set Pruning (Apriori-KC) (Bogorny, 2006a)
Given: φ,      // set of knowledge constraints
       Ψ,      // dataset generated with spatial_predicate_extraction
       minsup  // minimum support
L1 = {large 1-predicate sets};
For (k = 2; Lk-1 != ∅; k++) do begin
    Ck = apriori_gen(Lk-1);   // generates new candidates
    If (k = 2)                // remove pairs with dependences
        Delete from C2 all pairs with a dependence in φ;   // (step 1)
    Forall rows w in Ψ do begin
        Cw = subset(Ck, w);   // candidates contained in w
        Forall candidates c ∈ Cw do
            c.count++;
    End;
    Lk = {c ∈ Ck | c.count ≥ minsup};
End;
Answer = ∪k Lk
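The pseudocode above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published implementation: the function names `apriori_gen` and `apriori_kc` follow the pseudocode, candidate counting is done by scanning the rows directly instead of with a `subset` structure, and the dataset uses the single-letter predicate codes of the example (A = Port, C = Hospital, D = TreatedWaterNet, T = Factory, W = WaterBody).

```python
from itertools import combinations

ROWS = [
    {"A", "C", "D", "T", "W"},
    {"C", "D", "W"},
    {"A", "D", "T", "W"},
    {"A", "C", "D", "W"},
    {"A", "C", "D", "T", "W"},
    {"C", "D", "T"},
]


def apriori_gen(prev_level, k):
    """Join (k-1)-sets into k-candidates whose (k-1)-subsets are all frequent."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev_level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori_kc(rows, dependences, minsup):
    """Apriori with knowledge constraints: drop dependent pairs from C2."""
    min_count = minsup * len(rows)
    items = set().union(*rows)
    level = {frozenset([i]) for i in items
             if sum(i in row for row in rows) >= min_count}
    answer = set(level)
    k = 2
    while level:
        candidates = apriori_gen(level, k)
        if k == 2:  # step 1: remove pairs with a known dependence
            candidates = {c for c in candidates if c not in dependences}
        level = {c for c in candidates
                 if sum(c <= row for row in rows) >= min_count}
        answer |= level
        k += 1
    return answer


plain = apriori_kc(ROWS, set(), 0.5)
pruned = apriori_kc(ROWS, {frozenset({"A", "W"})}, 0.5)
print(len(plain), "frequent sets without constraints")
print(len(pruned), "frequent sets after removing the pair {A,W}")
```

Removing the pair at k = 2 is enough: because candidate generation requires every subset to be frequent, no superset of {A,W} is ever generated.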
22
Understanding the Pruning Methods
Dependences: {D} and {A,W}
a) Dataset
Tid (city)   Predicate set
1            A, C, D, T, W
2            C, D, W
3            A, D, T, W
4            A, C, D, W
5            A, C, D, T, W
6            C, D, T
b) Frequent predicate sets with minsup 50%
k = 1: {A}, {C}, {D}, {T}, {W}
k = 2: {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
k = 3: {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
k = 4: {A,C,D,W}, {A,D,T,W}
c) Predicates
A = contains(Port), C = contains(Hospital), D = contains(TreatedWaterNet), T = contains(Factory), W = crosses(WaterBody)
23
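Understanding the Pruning Methods (Input Pruning)
[Figure: lattice of the 25 frequent sets. Input pruning removes predicate {D} from the dataset, so every frequent set containing D disappears.]
– Pruned (13 sets containing D): {D}, {A,D}, {C,D}, {D,T}, {D,W}, {A,C,D}, {A,D,T}, {A,D,W}, {C,D,T}, {C,D,W}, {D,T,W}, {A,C,D,W}, {A,D,T,W}
– Remaining (12 sets): {A}, {C}, {T}, {W}, {A,C}, {A,T}, {A,W}, {C,T}, {C,W}, {T,W}, {A,C,W}, {A,T,W}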
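Input pruning happens before mining: the column for the well known predicate is simply dropped from the dataset. The sketch below illustrates this on the example data; `prune_input` and `count_frequent_sets` are hypothetical helper names, and the brute-force counter is only there to show the effect on the number of frequent sets.

```python
from itertools import combinations

ROWS = [
    {"A", "C", "D", "T", "W"},
    {"C", "D", "W"},
    {"A", "D", "T", "W"},
    {"A", "C", "D", "W"},
    {"A", "C", "D", "T", "W"},
    {"C", "D", "T"},
]


def prune_input(rows, well_known):
    """Remove the well known predicates from every row before mining."""
    return [row - well_known for row in rows]


def count_frequent_sets(rows, minsup):
    """Brute-force count of predicate sets with support >= minsup."""
    items = sorted(set().union(*rows))
    min_count = minsup * len(rows)
    total = 0
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if sum(1 for row in rows if set(combo) <= row) >= min_count:
                total += 1
    return total


before = count_frequent_sets(ROWS, 0.5)
after = count_frequent_sets(prune_input(ROWS, {"D"}), 0.5)
print(before, "frequent sets before input pruning")
print(after, "frequent sets after removing D")
```

Because the step only changes the input, it does not depend on the mining algorithm that runs afterwards.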
24
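Understanding the Pruning Methods (Frequent Set Pruning)
[Figure: lattice of the 25 frequent sets. Frequent set pruning removes the pair {A,W}, so its supersets are never generated.]
– Pruned (6 sets containing {A,W}): {A,W}, {A,C,W}, {A,D,W}, {A,T,W}, {A,C,D,W}, {A,D,T,W}
– Remaining (19 sets): {A}, {C}, {D}, {T}, {W}, {A,C}, {A,D}, {A,T}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}, {A,C,D}, {A,D,T}, {C,D,T}, {C,D,W}, {D,T,W}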
25
[Figure: percentage reduction of association rules considering zero (reference), one, and two pairs of dependences with an increasing number of elements (predicates); minconf = 0]
26
Problem 3 – Redundant Frequent Itemsets
Dependence {A,W}
Considering the 25 frequent sets in the example dataset:
– 9 are closed frequent itemsets (3 contain the dependence)
– 16 are redundant (3 contain the dependence)
Dataset
Tid (city)   Predicate set
1            A, C, D, T, W
2            C, D, W
3            A, D, T, W
4            A, C, D, W
5            A, C, D, T, W
6            C, D, T
27
Problem 3 – Redundant Frequent Itemsets
– Remove dependences and then generate closed frequent itemsets
– Problem: the resulting frequent sets are not closed
[Figure: lattice of the 19 frequent sets left after removing {A,W} and its supersets, with tid sets in parentheses]
k = 1: {A} (1345), {C} (12456), {D} (123456), {T} (1356), {W} (12345)
k = 2: {A,C} (145), {A,D} (1345), {A,T} (135), {C,D} (12456), {C,T} (156), {C,W} (1245), {D,T} (1356), {D,W} (12345), {T,W} (135)
k = 3: {A,C,D} (145), {A,D,T} (135), {C,D,T} (156), {C,D,W} (1245), {D,T,W} (135)
For example, {A,D,T} (tid set 135) remains, but its closure in the dataset, {A,D,T,W}, contains the dependence {A,W} and has been removed.
28
Problem 3 – Redundant Frequent Itemsets
Dependence {A,W}
– Generate closed frequent itemsets and then eliminate dependences
– Problem: information is lost
9 closed frequent itemsets (tid sets in parentheses):
{D} (123456)
{C,D} (12456), {D,T} (1356), {D,W} (12345)
{A,D,W} (1345), {C,D,T} (156), {C,D,W} (1245)
{A,C,D,W} (145), {A,D,T,W} (135)
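The loss of information can be made concrete on the example dataset. The sketch below is an illustration with hypothetical helper names (`tidset`, `frequent_sets`, `closed_sets`): it computes the closed frequent itemsets by comparing tid sets, then drops those that contain the dependence {A,W} and reports which tid sets are no longer represented by any remaining closed set.

```python
from itertools import combinations

DATASET = {
    1: {"A", "C", "D", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "D", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def tidset(itemset, dataset):
    """Ids of the rows that contain every predicate in itemset."""
    return frozenset(tid for tid, row in dataset.items() if itemset <= row)


def frequent_sets(dataset, minsup):
    """Brute-force enumeration of sets with support >= minsup."""
    items = sorted(set().union(*dataset.values()))
    min_count = minsup * len(dataset)
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if len(tidset(frozenset(c), dataset)) >= min_count]


def closed_sets(frequent, dataset):
    """Keep the sets that have no proper superset with the same tid set."""
    tids = {s: tidset(s, dataset) for s in frequent}
    return [s for s in frequent
            if not any(s < t and tids[s] == tids[t] for t in frequent)]


FREQUENT = frequent_sets(DATASET, 0.5)
CLOSED = closed_sets(FREQUENT, DATASET)
KEPT = [s for s in CLOSED if not {"A", "W"} <= s]
LOST_TIDSETS = ({tidset(s, DATASET) for s in FREQUENT}
                - {tidset(s, DATASET) for s in KEPT})

print(len(CLOSED), "closed frequent itemsets")
print(len(CLOSED) - len(KEPT), "contain the dependence {A,W}")
print(len(LOST_TIDSETS), "tid sets are no longer represented")
```

Frequent sets such as {A,D} or {A,C,D} do not contain the dependence, yet after the elimination no closed set with their tid set is left to represent them.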
29
Max-FGP (Bogorny, 2006c)
– Remove dependences in a first step
[Figure: lattice of the 25 frequent sets with their tid sets; the sets containing {A,W} are removed: {A,W}, {A,C,W}, {A,D,W}, {A,T,W}, {A,C,D,W}, {A,D,T,W}]
30
Max-FGP
– Remove redundant frequent sets in a second step, generating maximal frequent sets
[Figure: lattice of the 19 remaining frequent sets with their tid sets]
31
Max-FGP (Bogorny, 2006c)
Given: L;   // frequent sets without dependences (Apriori-KC)
       Ψ;   // dataset generated with spatial_predicate_extraction
Find: Maximal M
// find maximal generalized predicate sets
M = L;
For (k = 2; Mk != ∅; k++) do begin
    For (j = k+1; Mj != ∅; j++) do begin
        If (tidSet(Mk) = tidSet(Mj))
            If (Mk ⊂ Mj)   // Mj is more general than Mk
                Delete Mk from M;
    End;
End;
Answer = M;
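The second step of Max-FGP can be sketched as a filter over the frequent sets that survive dependence removal. The code below is an illustration under assumptions, not the published implementation: the input L is obtained by brute force and then filtered for {A,W}, which gives the same sets as Apriori-KC on this example, and the filter is applied to sets of every size, whereas the pseudocode starts at k = 2. The names `tidset`, `frequent_without_dependence`, and `max_fgp` are hypothetical.

```python
from itertools import combinations

DATASET = {
    1: {"A", "C", "D", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "D", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def tidset(itemset, dataset):
    """Ids of the rows that contain every predicate in itemset."""
    return frozenset(tid for tid, row in dataset.items() if itemset <= row)


def frequent_without_dependence(dataset, minsup, dependence):
    """Frequent sets that do not contain the dependent pair (stand-in for L)."""
    items = sorted(set().union(*dataset.values()))
    min_count = minsup * len(dataset)
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if len(tidset(frozenset(c), dataset)) >= min_count
            and not dependence <= set(c)}


def max_fgp(frequent, dataset):
    """Delete every set that has a proper superset in L with the same tid set."""
    tids = {s: tidset(s, dataset) for s in frequent}
    return {s for s in frequent
            if not any(s < t and tids[s] == tids[t] for t in frequent)}


L = frequent_without_dependence(DATASET, 0.5, {"A", "W"})
M = max_fgp(L, DATASET)
print(len(L), "frequent sets without the dependence")
for s in sorted(M, key=lambda s: (len(s), sorted(s))):
    print(sorted(s), sorted(tidset(s, DATASET)))
```

On this example both {A,D,T} and {D,T,W} are kept for tid set 135, since neither is a subset of the other.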
32
Some results on real databases
33
Input Space Pruning
20 predicates
[Figure: Frequent Geographic Patterns Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). Y axis: frequent sets; X axis: minimum support.]
Series (legend order): Apriori; Apriori (removing 1 column); Apriori (removing 2 columns)
Data labels at 10% minimum support: 1,731; 865; 432
Data labels at 15% minimum support: 651; 363; 181
Data labels at 20% minimum support: 331; 165; 82
Reduction: 1 dependence 50%, 2 dependences 70%
[Figure: Association Rules Generated with Apriori Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). Y axis: association rules; X axis: minimum support.]
Series (legend order): Apriori; Apriori (removing 1 column); Apriori (removing 2 columns)
Data labels at 10% minimum support: 22,251; 7,159; 2,241
Data labels at 15% minimum support: 7,128; 2,252; 689
Data labels at 20% minimum support: 2,268; 698; 204
Reduction: 1 dependence 70%, 2 dependences 90%
34
Frequent Set Pruning
15 predicates
[Figure: Frequent Geographic Patterns Removing One Dependence among Relevant Features (Frequent Set Pruning). Y axis: frequent sets; X axis: minimum support.]
Series (legend order): Apriori; Apriori-KC (1 pair); Max-FGP (1 pair); closed frequent sets
Data labels at 5% minimum support: 117; 85; 32; 22
Data labels at 10% minimum support: 71; 51; 20; 16
Data labels at 15% minimum support: 47; 33; 14; 12
[Figure: Computational Time. Y axis: time (s); X axis: minimum support (5%, 10%, 15%). Series: Apriori; Apriori-KC (1 pair); Max-FGP (1 pair); closed frequent sets. Other labels on the slide: 17, 711; 77%, 68%, 58%.]
35
Summary
• Well known dependences exist in several non-spatial application domains
– Biology/Bioinformatics
– Pregnant → Female (confidence = 100%)
– Breast_cancer → Female (confidence = 100%)
– ...
• Almost no data mining approaches consider background knowledge or domain knowledge
36
Future Trends
• Data Mining methods will consider semantics
• 3 workshops (KDD and ICDM) for domain-driven data mining
• ICDM 2008,2009 Workshop - Semantic Aspects in Data Mining
• Book 2008: Domain-Driven Data Mining
37
Summary: Mining SAR using Background Knowledge
• Using background knowledge:
– To prune the input space as much as possible (applicable to any SDM method)
– Apriori-KC generates frequent itemsets without well known dependences
– Max-FGP (Maximal Frequent Geographic Patterns) generates closed frequent itemsets without well known dependences
38
References
Bogorny, V.; Valiati, J.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints. In: IEEE International Conference on Data Mining, IEEE-ICDM, 6., 2006, Hong Kong. 2006c.
Bogorny, V.; Camargo, S.; Engel, P. M.; Alvares, L. O. Towards elimination of well known geographic domain patterns in spatial association rule mining. In: IEEE International Conference on Intelligent Systems, IEEE-IS, 3., 2006, London. IEEE Computer Society, 2006b. p. 532-537.
Bogorny, V.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Frequent Geographic Patterns with Knowledge Constraints. In: ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS, 14., 2006, Arlington. 2006a. p. 139-146.
Bogorny, V.; Palma, A.; Engel, P.; Alvares, L. O. Weka-GDPM: Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications, WAAMD, Florianopolis, 2006d. p. 9-16.
39
References
Bogorny, V.; Engel, P. M.; Alvares, L.O. Enhancing the Process of Knowledge Discovery in Geographic Databases using Geo-Ontologies. In: NIGRO, H. O.; CISARO, S.G.; XODO, D. (Ed.). Data Mining with Ontologies: Implementations, Findings, and Frameworks. Idea Group, 2007.
CLEMENTINI, E.; DI FELICE, P.; KOPERSKI, K. Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, [S.l.], v.34, n.3, p.251-270, Sept. 2000.
GÜTING, R. H. An Introduction to Spatial Database Systems. The International Journal on Very Large Data Bases, [S.l.], v.3, n.4, p. 357-399, Oct. 1994.
KOPERSKI, K.; HAN, J. Discovery of Spatial Association Rules in Geographic Information Databases. In: INTERNATIONAL SYMPOSIUM ON LARGE SPATIAL DATABASES, SSD, 4., 1995, Portland. Proceedings… [S.l.]: Springer, 1995. p.47-66.
MENNIS, J.; LIU, J.W. Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS,[S.l.], v.9, n.1, p. 5-17, Jan. 2005.
OPEN GIS CONSORTIUM. OpenGIS simple features specification for SQL. 1999. Available at <http://www.opengeospatial.org/docs/99-054.pdf>. Visited on Aug. 2005.