Semantic Aspects in Spatial Data Mining Vania Bogorny



Introduction

• Existing approaches for spatial data mining, in general, do not make use of prior knowledge

• Bogorny (2006) and Bogorny (2007) introduced the idea of using background knowledge
– In data preprocessing, to reduce spatial joins
– In spatial association rule mining, to eliminate well known patterns


Main Problems

• Unnecessary spatial relationship computation

• Large numbers of association rules

• Many associations are well known natural geographic dependences

• Existing approaches for mining spatial association rules (SAR) are Apriori-like
– Most approaches do not make use of background knowledge
– They use syntactic constraints for frequent set and rule pruning
– Only the data is considered, not the schema

• Result
– The same associations that the database designer represented explicitly in the schema are extracted by SAR mining algorithms


Spatial Relationships (Guting, 1994)

[Figure: the three kinds of spatial relationships between features A, B, and C]
– Order: B north A; C southeast A
– Distance: the distance d between two features
– Topological: touches, overlaps, inside, contains, crosses, equals, disjoint


• Order and Distance relationships may exist among any pair of spatial features

• Topological Relations depend on the geometry

Topological Relationships – GEOMETRICALLY POSSIBLE (OGC standard)

[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against the geometric combinations for which each is possible]
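The dependence of topological relations on geometry can be sketched as a lookup on geometry dimension. This is a rough illustration, assuming the dimension rules of the OGC Simple Features specification as commonly stated (for example, crosses only for point/line, point/area, line/line, and line/area). The names `DIM` and `possible_relations` are hypothetical and do not come from the slides.

```python
# Dimension of each geometry type: point = 0, line = 1, area (polygon) = 2.
DIM = {"point": 0, "line": 1, "area": 2}


def possible_relations(a: str, b: str) -> set:
    """Return the topological relations that can hold between types a and b.

    Assumption: rules follow the OGC Simple Features dimension constraints.
    """
    da, db = DIM[a], DIM[b]
    relations = {"disjoint"}                  # always geometrically possible
    if da == db:
        relations |= {"equals", "overlaps"}   # both require equal dimension
    if not (da == 0 and db == 0):
        relations.add("touches")              # not defined for point/point
    if (da, db) in {(0, 1), (0, 2), (1, 1), (1, 2)}:
        relations.add("crosses")              # P/L, P/A, L/L, L/A only
    if da <= db:
        relations.add("inside")               # a cannot lie inside a smaller b
    if da >= db:
        relations.add("contains")             # a cannot contain a larger b
    return relations


print(sorted(possible_relations("line", "area")))
```

A pair such as area/area can never produce `crosses`, so a preprocessing step could skip that computation entirely.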


Topological Relationships – SEMANTICALLY CONSISTENT

[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against semantic combinations of feature types: Factory and Hospital, Bridge and River, Factory and Airport, River and Road, Beach and Sea, State and Country]


Spatial Relationships

– Mandatory (spatial constraints), i.e. dependences: <Island> <inside> <1><1> <Water Body>

– Prohibited: <River> <contains> <0><0> <Road>

– Possible: normally undefined, e.g. Road crosses River

For data mining and knowledge discovery, POSSIBLE and PROHIBITED relationships are interesting. All others are well known.
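The three kinds of relationship can be told apart mechanically. The sketch below assumes that the two bracketed numbers in each constraint are a minimum and a maximum cardinality, which the slide does not state explicitly. `Constraint`, `classify`, and `relationship_kind` are hypothetical names.

```python
from typing import NamedTuple


class Constraint(NamedTuple):
    source: str
    relation: str
    min_card: int   # assumed meaning of the first bracketed number
    max_card: int   # assumed meaning of the second bracketed number
    target: str


def classify(c: Constraint) -> str:
    """Classify a declared constraint as prohibited or mandatory."""
    if c.max_card == 0:
        return "prohibited"
    if c.min_card >= 1:
        return "mandatory"
    return "possible"


CONSTRAINTS = [
    Constraint("Island", "inside", 1, 1, "Water Body"),
    Constraint("River", "contains", 0, 0, "Road"),
]


def relationship_kind(source: str, relation: str, target: str) -> str:
    """Relationships with no declared constraint are treated as possible."""
    for c in CONSTRAINTS:
        if (c.source, c.relation, c.target) == (source, relation, target):
            return classify(c)
    return "possible"


print(relationship_kind("Road", "crosses", "River"))
```

Under this reading, only relationships classified as mandatory would be skipped as well known before mining.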


Well Known Geographic Dependences

[Figure: maps contrasting non-obvious spatial relationships with well known dependences]

Well known dependences:
Is_a(gasStation) → intersects(street) (100%)
Is_a(island) → within(waterResource) (100%)


Well Known Relationships X Association Rules

[Figure: maps with the layers Bridges & Viaducts, Roads, Vegetation; Bus Stop, Street]

contains(gasStation) → contains(Street) (100%)

intersects(busStop) → intersects(Street) (100%)

contains(viaduct) → contains(road) (100%)


Well Known Associations – Conceptual Schemas

[Figure: conceptual schema with the classes below, linked by associations with cardinalities 1 to 1..n and 1 to 0..n]
– Country (name, geometry)
– State (name, geometry)
– County (name, population, geometry)
– Factory (name, mainActivity, impactDegree)
– Water Resource (name, extension, geometry)
– Island (geometry)

Well known associations read from the schema: {State, Country}, {Factory, County}, {Island, WaterBody}, …


Well Known Associations – Conceptual Schemas

Source: 1ª Divisão de Levantamento do Exército Brasileiro (1st Survey Division of the Brazilian Army)



Well Known Associations – Geo-Ontologies

<owl:Class rdf:ID="Island">
  <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
      <owl:allValuesFrom rdf:resource="#WaterResource"/>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="#Within"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

Geographic dependences are explicit in geo-ontologies (Bogorny, 2005b)
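Because the dependence is explicit in the ontology, it can be read out automatically. The following is a minimal sketch using only the Python standard library. It wraps the Island class in an `rdf:RDF` element with the usual namespace declarations (an assumption, since the slide shows only the class) and treats any restriction with a minimum cardinality of at least 1 as a dependence. `extract_dependences` is a hypothetical helper, not part of any published tool.

```python
import xml.etree.ElementTree as ET

OWL_SNIPPET = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Island">
    <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
        <owl:allValuesFrom rdf:resource="#WaterResource"/>
        <owl:onProperty>
          <owl:ObjectProperty rdf:about="#Within"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def extract_dependences(owl_xml: str) -> list:
    """Return (feature type, relation, feature type) triples with minCardinality >= 1."""
    root = ET.fromstring(owl_xml)
    dependences = []
    for cls in root.findall("owl:Class", NS):
        name = cls.get(RDF + "ID")
        for restriction in cls.findall("rdfs:subClassOf/owl:Restriction", NS):
            min_card = restriction.find("owl:minCardinality", NS)
            target = restriction.find("owl:allValuesFrom", NS)
            prop = restriction.find("owl:onProperty/owl:ObjectProperty", NS)
            if min_card is None or target is None or prop is None:
                continue
            if int(min_card.text) >= 1:
                dependences.append((
                    name,
                    prop.get(RDF + "about").lstrip("#"),
                    target.get(RDF + "resource").lstrip("#"),
                ))
    return dependences


print(extract_dependences(OWL_SNIPPET))
```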



Well known dependences X Spatial Association Rules (SAR)

• Well known dependences affect the three main steps in the process of mining SAR:

– Spatial predicate computation: compute unnecessary relationships

– Frequent set generation: generate frequent itemsets with well known patterns

– Association rule extraction: produce a high number of rules with well known dependences


Well Known Dependences in SAR


Example of preprocessed spatial dataset

Tuple (city) | Spatial Predicates
1 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2 | contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3 | contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4 | contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6 | contains(Hospital), contains(TreatedWaterNet), contains(Factory)


Problem 1 – Geographic Dependences between the Target Feature Type and Relevant Feature Types

Minconf = 70%

Dependence = City and TreatedWaterNet

100% support

contains(Hospital) → contains(TreatedWaterNet)
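The figures on this slide can be checked against the six-city example dataset. The sketch below computes support and confidence in the usual way; the function names `support` and `confidence` are illustrative only.

```python
# Example dataset: one set of spatial predicates per city (tuples 1 to 6).
DATASET = {
    1: {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
        "contains(Factory)", "crosses(WaterBody)"},
    2: {"contains(Hospital)", "contains(TreatedWaterNet)", "crosses(WaterBody)"},
    3: {"contains(Port)", "contains(TreatedWaterNet)", "contains(Factory)",
        "crosses(WaterBody)"},
    4: {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
        "crosses(WaterBody)"},
    5: {"contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
        "contains(Factory)", "crosses(WaterBody)"},
    6: {"contains(Hospital)", "contains(TreatedWaterNet)", "contains(Factory)"},
}


def support(itemset: set, dataset: dict = DATASET) -> float:
    """Fraction of tuples that contain every predicate in itemset."""
    hits = sum(1 for predicates in dataset.values() if itemset <= predicates)
    return hits / len(dataset)


def confidence(antecedent: set, consequent: set, dataset: dict = DATASET) -> float:
    """support(antecedent and consequent) divided by support(antecedent)."""
    return support(antecedent | consequent, dataset) / support(antecedent, dataset)


print(support({"contains(TreatedWaterNet)"}))
print(confidence({"contains(Hospital)"}, {"contains(TreatedWaterNet)"}))
```

Since every city contains a treated water network, any rule with contains(TreatedWaterNet) in the consequent reaches 100% confidence and passes minconf regardless of the antecedent.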


Problem 2 - Dependences among Relevant Feature Types

Minsup = 50%

25 frequent sets (6 contain the dependence)

9 closed frequent sets (3 have the dependence)

Dependence = {Port, WaterBody}

contains(Port) → crosses(WaterBody)
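The counts of frequent sets can be reproduced by brute-force enumeration over the example dataset. This is only a checking aid under the stated minsup of 50%, not the mining algorithm itself; `frequent_sets` is a hypothetical helper.

```python
from itertools import combinations

P, H, N, F, W = ("contains(Port)", "contains(Hospital)", "contains(TreatedWaterNet)",
                 "contains(Factory)", "crosses(WaterBody)")

# The six cities of the example dataset.
ROWS = [
    {P, H, N, F, W},
    {H, N, W},
    {P, N, F, W},
    {P, H, N, W},
    {P, H, N, F, W},
    {H, N, F},
]


def frequent_sets(rows: list, minsup: float) -> list:
    """Enumerate every predicate set whose support is at least minsup."""
    items = sorted(set().union(*rows))
    min_count = minsup * len(rows)
    result = []
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            if sum(itemset <= row for row in rows) >= min_count:
                result.append(itemset)
    return result


all_frequent = frequent_sets(ROWS, 0.5)
with_dependence = [s for s in all_frequent if {P, W} <= s]
print(len(all_frequent), len(with_dependence))
```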


Pruning Methods using Background Knowledge


Frequent Set Pruning (Apriori-KC) (Bogorny, 2006a)

Given: φ,      // set of knowledge constraints
       Ψ,      // dataset generated with spatial_predicate_extraction
       minsup  // minimum support

L1 = {large 1-predicate sets};
For ( k = 2; Lk-1 != ∅; k++ ) do begin
    Ck = apriori_gen(Lk-1);    // Generates new candidates
    If (k = 2)                 // remove pairs with dependences
        Delete from C2 all pairs with a dependence in φ;    // (step 1)
    Forall rows w ∈ Ψ do begin
        Cw = subset(Ck, w);    // Candidates contained in w
        Forall candidates c ∈ Cw do
            c.count++;
    End;
    Lk = {c ∈ Ck | c.count ≥ minsup};
End;
Answer = ∪k Lk
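The pseudocode above can be turned into a short runnable sketch. It is an illustration on the six-row example dataset with single-letter predicates, not the author's implementation. Dependences are assumed to be given as a set of two-element frozensets.

```python
from itertools import combinations

# Example dataset with A=Port, C=Hospital, D=TreatedWaterNet, T=Factory, W=WaterBody.
ROWS = [set("ACDTW"), set("CDW"), set("ADTW"), set("ACDW"), set("ACDTW"), set("CDT")]


def apriori_gen(prev_level: set, k: int) -> set:
    """Join (k-1)-sets into k-candidates; keep those whose (k-1)-subsets are all frequent."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev_level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori_kc(rows: list, dependences: set, minsup: float) -> set:
    """Apriori with one extra step: drop dependence pairs from C2."""
    min_count = minsup * len(rows)
    items = set().union(*rows)
    level = {frozenset([i]) for i in items
             if sum(i in row for row in rows) >= min_count}
    answer = set(level)
    k = 2
    while level:
        candidates = apriori_gen(level, k)
        if k == 2:
            # Step 1: remove pairs that are well known dependences.
            candidates = {c for c in candidates if c not in dependences}
        level = {c for c in candidates
                 if sum(c <= row for row in rows) >= min_count}
        answer |= level
        k += 1
    return answer


plain = apriori_kc(ROWS, set(), 0.5)
pruned = apriori_kc(ROWS, {frozenset("AW")}, 0.5)
print(len(plain), len(pruned))
```

Removing the pair once at k = 2 is enough: the candidate generator never produces a superset of a pair that is missing from L2, so no later pass has to check the constraints again.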


Understanding the Pruning Methods

Dependences: {D} and {A,W}

a) Dataset

Tid (city) | Predicate Set
1 | A, C, D, T, W
2 | C, D, W
3 | A, D, T, W
4 | A, C, D, W
5 | A, C, D, T, W
6 | C, D, T

b) Frequent predicate sets with minsup 50%

k | Frequent sets
1 | {A}, {C}, {D}, {T}, {W}
2 | {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
3 | {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
4 | {A,C,D,W}, {A,D,T,W}

c) Predicates

A = contains(Port), C = contains(Hospital), D = contains(TreatedWaterNet), T = contains(Factory), W = crosses(WaterBody)


Understanding the Pruning Methods (Input Pruning)

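[Figure: lattice of the 25 frequent sets; input pruning of {D} removes every set that contains D]

Pruned (13 sets): {D}, {A,D}, {C,D}, {D,T}, {D,W}, {A,C,D}, {A,D,T}, {A,D,W}, {C,D,T}, {C,D,W}, {D,T,W}, {A,C,D,W}, {A,D,T,W}

Remaining (12 sets): {A}, {C}, {T}, {W}, {A,C}, {A,T}, {A,W}, {C,T}, {C,W}, {T,W}, {A,C,W}, {A,T,W}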
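Input pruning amounts to dropping a column before mining. A small sketch on the example dataset follows; `prune_input` and `frequent_sets` are hypothetical helpers, and the enumeration is brute force for brevity.

```python
from itertools import combinations

# Example dataset with A=Port, C=Hospital, D=TreatedWaterNet, T=Factory, W=WaterBody.
ROWS = [set("ACDTW"), set("CDW"), set("ADTW"), set("ACDW"), set("ACDTW"), set("CDT")]


def prune_input(rows: list, known: set) -> list:
    """Remove predicates that have a well known dependence with the target feature."""
    return [row - known for row in rows]


def frequent_sets(rows: list, minsup: float) -> list:
    """Brute-force enumeration of all predicate sets with support >= minsup."""
    items = sorted(set().union(*rows))
    min_count = minsup * len(rows)
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sum(set(c) <= row for row in rows) >= min_count
    ]


before = frequent_sets(ROWS, 0.5)
after = frequent_sets(prune_input(ROWS, {"D"}), 0.5)
print(len(before), len(after))
```

Because the predicate never reaches the miner, this kind of pruning works with any mining method and also saves the spatial join that would have computed the column.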


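Understanding the Pruning Methods (Frequent Set Pruning)

[Figure: lattice of the 25 frequent sets; frequent set pruning removes the pair {A,W}, so none of its supersets is generated]

Pruned (6 sets): {A,W}, {A,C,W}, {A,D,W}, {A,T,W}, {A,C,D,W}, {A,D,T,W}

Remaining (19 sets): {A}, {C}, {D}, {T}, {W}, {A,C}, {A,D}, {A,T}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}, {A,C,D}, {A,D,T}, {C,D,T}, {C,D,W}, {D,T,W}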

[Figure: Percentage reduction of association rules considering zero (reference), one, and two pairs of dependences with an increasing number of elements (predicates); minconf = 0]


Problem 3 – Redundant Frequent Itemsets

Considering the 25 frequent sets in the example dataset:
- 9 are closed frequent itemsets (3 contain the dependence)
- 16 are redundant (3 contain the dependence)

Dependence {A,W}

Dataset

Tid (city) | Predicate Set
1 | A, C, D, T, W
2 | C, D, W
3 | A, D, T, W
4 | A, C, D, W
5 | A, C, D, T, W
6 | C, D, T
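The split into closed and redundant sets can be verified directly. In this sketch a frequent set is taken to be closed when no proper superset covers the same tuples, which is the standard definition; the helper names are hypothetical.

```python
from itertools import combinations

# Example dataset, tids 1 to 6.
ROWS = {1: set("ACDTW"), 2: set("CDW"), 3: set("ADTW"),
        4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}


def tidset(itemset: frozenset) -> frozenset:
    """Ids of the tuples that contain every predicate of itemset."""
    return frozenset(tid for tid, row in ROWS.items() if itemset <= row)


def frequent_sets(minsup: float) -> list:
    """Brute-force enumeration of all predicate sets with support >= minsup."""
    items = sorted(set().union(*ROWS.values()))
    min_count = minsup * len(ROWS)
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if len(tidset(frozenset(c))) >= min_count
    ]


frequent = frequent_sets(0.5)
# Closed: no proper superset among the frequent sets has the same tidset.
closed = [s for s in frequent
          if not any(s < t and tidset(s) == tidset(t) for t in frequent)]
redundant = [s for s in frequent if s not in closed]
dependence = frozenset("AW")
print(len(closed), len(redundant))
print(sum(dependence <= s for s in closed), sum(dependence <= s for s in redundant))
```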


Problem 3 – Redundant Frequent Itemsets

- Remove dependences and then generate closed frequent itemsets

Problem: the resulting frequent sets are not closed

[Figure: lattice of the 19 frequent sets that remain after removing {A,W}, each annotated with its tidset; for example, {A,D,T} has tidset (135), the same as the removed set {A,D,T,W}]


Problem 3 – Redundant Frequent Itemsets

- Generate closed frequent itemsets and then eliminate dependences

Problem: information is lost

Dependence {A,W}

[Figure: lattice of the 9 closed frequent itemsets with their tidsets]
{D} (123456)
{C,D} (12456), {D,T} (1356), {D,W} (12345)
{A,D,W} (1345), {C,D,T} (156), {C,D,W} (1245)
{A,C,D,W} (145), {A,D,T,W} (135)


Max-FGP (Bogorny, 2006c)

- Remove dependences in a first step

[Figure: lattice of the 25 frequent sets with their tidsets; the sets containing the dependence {A,W} are removed: {A,W}, {A,C,W}, {A,D,W}, {A,T,W}, {A,C,D,W}, {A,D,T,W}]


Max-FGP

- Remove redundant frequent sets in a second step, generating maximal frequent sets

[Figure: lattice of the 19 frequent sets without {A,W}, each annotated with its tidset]


Max-FGP (Bogorny, 2006c)

Given: L;    // frequent sets without dependences (Apriori-KC)
       Ψ;    // dataset generated with spatial_predicate_extraction

Find: Maximal M

// find maximal generalized predicate sets
M = L;
For ( k = 2; Mk != ∅; k++ ) do begin
    For ( j = k+1; Mj != ∅; j++ ) do begin
        If (tidSet(Mk) = tidSet(Mj))
            If (Mk ⊂ Mj)    // Mj is more general than Mk
                Delete Mk from M;
    End;
End;
Answer = M;
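The second step can be sketched in a few lines on the example dataset. This is an illustration under assumptions, not the published implementation: the input L is built here by brute force and filtered for the dependence {A,W}, which for this example gives the same 19 sets as Apriori-KC, and the hypothetical parameter `start_k` mirrors the loop in the pseudocode, which begins at k = 2 and therefore leaves single predicates untouched.

```python
from itertools import combinations

# Example dataset, tids 1 to 6.
ROWS = {1: set("ACDTW"), 2: set("CDW"), 3: set("ADTW"),
        4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}


def tidset(itemset: frozenset) -> frozenset:
    """Ids of the tuples that contain every predicate of itemset."""
    return frozenset(tid for tid, row in ROWS.items() if itemset <= row)


def frequent_without(dependence: frozenset, minsup: float) -> set:
    """Frequent sets that do not contain the dependence pair (stand-in for Apriori-KC)."""
    items = sorted(set().union(*ROWS.values()))
    min_count = minsup * len(ROWS)
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if len(tidset(frozenset(c))) >= min_count and not dependence <= frozenset(c)
    }


def max_fgp(frequent: set, start_k: int = 2) -> set:
    """Delete every set of size >= start_k that has a superset with the same tidset."""
    maximal = set(frequent)
    for small in frequent:
        if len(small) < start_k:
            continue
        if any(small < big and tidset(small) == tidset(big) for big in frequent):
            maximal.discard(small)
    return maximal


L = frequent_without(frozenset("AW"), 0.5)
M = max_fgp(L)
print(len(L), len(M))
```

With `start_k=1` the single predicates {A}, {C}, {T}, and {W} would also be dropped, since {A,D}, {C,D}, {D,T}, and {D,W} cover the same tuples.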


Some results on real databases


Input Space Pruning

[Figure: Frequent Geographic Patterns Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning); y-axis: Frequent Sets; x-axis: Minimum Support; 20 predicates]

Minimum Support | Apriori | Apriori (Removing 1 column) | Apriori (Removing 2 columns)
10% | 1,731 | 865 | 432
15% | 651 | 363 | 181
20% | 331 | 165 | 82

Reduction: 1 dependence 50%, 2 dependences 70%

[Figure: Association Rules Generated with Apriori Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning); y-axis: Association Rules; x-axis: Minimum Support]

Minimum Support | Apriori | Apriori (Removing 1 column) | Apriori (Removing 2 columns)
10% | 22,251 | 7,159 | 2,241
15% | 7,128 | 2,252 | 689
20% | 2,268 | 698 | 204

Reduction: 1 dependence 70%, 2 dependences 90%


Frequent Set Pruning

[Figure: Frequent Geographic Patterns Removing One Dependence among Relevant Features (Frequent Set Pruning); y-axis: Frequent Sets; x-axis: Minimum Support; 15 predicates]

Bar values in plot order, for the series Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets:
5%: 117, 85, 32, 22
10%: 71, 51, 20, 16
15%: 47, 33, 14, 12

[Figure: Computational Time; y-axis: Time (s); x-axis: Minimum Support (5%, 10%, 15%); series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets; data labels: 17, 7, 11; 77%, 68%, 58%]


Summary

• Well known dependences exist in several non-spatial application domains
– Biology/Bioinformatics
– Pregnant → Female (confidence = 100%)
– Breast_cancer → Female (confidence = 100%)
– ...

• Almost no data mining approaches consider background knowledge or domain knowledge


Future Trends

• Data mining methods will consider semantics

• 3 workshops (KDD and ICDM) for domain-driven data mining

• ICDM 2008 and 2009 Workshop: Semantic Aspects in Data Mining

• Book 2008: Domain-Driven Data Mining


Summary: Mining SAR using Background Knowledge

• Using background knowledge:
– To prune the input space as much as possible (applicable to any SDM method)
– Apriori-KC generates frequent itemsets without well known dependences
– Max-FGP (Maximal Frequent Geographic Patterns) generates closed frequent itemsets without well known dependences


References

Bogorny, V.; Valiati, J.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints. In: IEEE International Conference on Data Mining, IEEE-ICDM, 6., 2006, Hong-Kong. 2006c.

Bogorny, V.; Camargo, S.; Engel, P. M.; Alvares, L. O. Towards elimination of well known geographic domain patterns in spatial association rule mining. In: IEEE International Conference on Intelligent Systems, IEEE-IS, 3., 2006, London. IEEE Computer Society, 2006b. p. 532-537.

Bogorny, V.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Frequent Geographic Patterns with Knowledge Constraints. In: ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS, 14., 2006, Arlington. 2006a. p. 139-146.

Bogorny, V.; Palma, A.; Engel, P.; Alvares, L. O. Weka-GDPM: Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications, WAAMD, Florianopolis, 2006d. p. 9-16.


Bogorny, V.; Engel, P. M.; Alvares, L.O. Enhancing the Process of Knowledge Discovery in Geographic Databases using Geo-Ontologies. In: NIGRO, H. O.; CISARO, S.G.; XODO, D. (Ed.). Data Mining with Ontologies: Implementations, Findings, and Frameworks. Idea Group, 2007.

CLEMENTINI, E.; DI FELICE, P.; KOPERSKI, K. Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, [S.l.], v.34, n.3, p.251-270, Sept. 2000.

GUTING, R. H. An Introduction to Spatial Database Systems. The International Journal on Very Large Data Bases, [S.l.], v.3, n.4, p. 357 – 399, Oct. 1994.

KOPERSKI, K.; HAN, J. Discovery of Spatial Association Rules In Geographic Information Databases. In: INTERNATIONAL SYMPOSIUM ON LARGE GEOGRAPHICAL DATABASES, SSD, 4., 1995, Portland. Proceedings… [S.l.]: Springer, 1995. p.47-66.

MENNIS, J.; LIU, J.W. Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS,[S.l.], v.9, n.1, p. 5-17, Jan. 2005.

OPEN GIS CONSORTIUM. OpenGIS simple features specification for SQL. 1999. Available at <http://www.opengeospatial.org/docs/99-054.pdf>. Visited on Aug. 2005.