Semantic Aspects in Spatial Data Mining

Vania Bogorny

Introduction

• Existing approaches for spatial data mining, in general, do not make use of prior knowledge

• Bogorny (2006) and Bogorny (2007) introduced the idea of using background knowledge
– In data preprocessing, to reduce spatial joins
– In spatial association rule mining, to eliminate well known patterns

Main Problems

• Unnecessary spatial relationship computation

• Large numbers of association rules

• Many associations are well known natural geographic dependences

• Existing approaches for mining spatial association rules (SAR) are Apriori-like
– Most approaches do not make use of background knowledge
– They use syntactic constraints for frequent set and rule pruning
– Only the data is considered, not the schema

• Result
– The same associations that the database designer represented explicitly in the schema are extracted by SAR mining algorithms

Spatial Relationships (Güting, 1994)

[Figure: the three kinds of spatial relationships among features A, B, and C]
– Order: B north A; C southeast A
– Distance: d(B,A), d(C,B)
– Topological: touches, overlaps, inside, contains, crosses, equals, disjoint

• Order and Distance relationships may exist among any pair of spatial features

• Topological Relations depend on the geometry

Topological Relationships – GEOMETRICALLY POSSIBLE (OGC standard)

[Table: geometric combinations that are possible for each topological relation: Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals]

Topological Relationships – SEMANTICALLY CONSISTENT

[Table: semantic combinations of feature types for each topological relation: Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals. Feature types shown, in order: Factory, Hospital, Bridge, River, Factory, Airport, River, Road, Beach, Sea, State, Country]

Spatial Relationships

– Mandatory (spatial constraints, i.e. dependences): <Island> <inside> <1><1> <Water Body>

– Prohibited: <River> <contains> <0><0> <Road>

– Possible: normally undefined, e.g. Road crosses River

For data mining and knowledge discovery, POSSIBLE and PROHIBITED relationships are the interesting ones.

All others are well known.
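The three kinds of relationship above can be told apart from the cardinality pair alone. The following is a minimal sketch, assuming the `<source> <relation> <min><max> <target>` notation of the slide; the class `SpatialConstraint`, the function `kind`, and the use of `None` for an undefined upper bound are illustrative choices, not part of the original work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialConstraint:
    """<source> <relation> <min><max> <target>, as written on the slide."""
    source: str
    relation: str
    min_card: int
    max_card: Optional[int]  # None = no upper bound defined (assumption)
    target: str


def kind(c: SpatialConstraint) -> str:
    """Classify a constraint by its cardinalities."""
    if c.max_card == 0:
        return "prohibited"  # the relationship must never hold
    if c.min_card >= 1:
        return "mandatory"   # a well known dependence
    return "possible"        # left open by the schema


constraints = [
    SpatialConstraint("Island", "inside", 1, 1, "Water Body"),
    SpatialConstraint("River", "contains", 0, 0, "Road"),
    SpatialConstraint("Road", "crosses", 0, None, "River"),
]

# Keep only what the slide calls interesting for mining.
interesting = [c for c in constraints if kind(c) != "mandatory"]

for c in constraints:
    print(c.source, c.relation, c.target, "->", kind(c))
```

Only the mandatory Island/Water Body constraint is filtered out; the other two stay as candidates for mining.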

8

Well Known Geographic Dependences

[Figure: map contrasting non-obvious spatial relationships with well known dependences]

Well known dependences:
Is_a(gasStation) → intersects(street) (100%)
Is_a(island) → within(waterResource) (100%)

Well Known Relationships vs. Association Rules

[Figure: map with the layers Bridges & Viaducts, Roads, Vegetation, Bus Stop, Street]

contains(gasStation) → contains(Street) (100%)

intersects(busStop) → intersects(Street) (100%)

contains(viaduct) → contains(road) (100%)

Well Known Associations – Conceptual Schemas

[Figure: conceptual schema (class diagram)]
– Country (name, geometry)
– State (name, geometry)
– County (name, population, geometry)
– Water Resource (name, extension, geometry)
– Factory (name, mainActivity, impactDegree)
– Island (geometry)
– Associations with multiplicities 1 to 1..n and 1 to 0..n

Well known associations given by the schema: {State, Country}, {Factory, County}, {Island, WaterBody}, …


[Figure: map. Source: 1ª Divisão de Levantamento do Exército Brasileiro (1st Survey Division of the Brazilian Army)]

Well Known Associations – Geo-Ontologies

<owl:Class rdf:ID="Island">
  <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
      <owl:allValuesFrom rdf:resource="#WaterResource"/>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="#Within"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

Geographic dependences are explicit in geo-ontologies (Bogorny, 2005b)
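Because the dependence is explicit in the ontology, it can be read out mechanically. The sketch below is one possible way to do that with Python's standard `xml.etree.ElementTree`; it assumes the class definition is wrapped in an `rdf:RDF` root with the usual namespace declarations, and it treats any restriction with `minCardinality` of at least 1 as a dependence. The function name `extract_dependences` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Island class from the slide, wrapped in an rdf:RDF root (assumption).
OWL_SNIPPET = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Island">
    <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
        <owl:allValuesFrom rdf:resource="#WaterResource"/>
        <owl:onProperty>
          <owl:ObjectProperty rdf:about="#Within"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{" + NS["rdf"] + "}"


def extract_dependences(xml_text):
    """Return (feature type, relation, feature type) triples with minCardinality >= 1."""
    root = ET.fromstring(xml_text)
    dependences = []
    for cls in root.findall("owl:Class", NS):
        source = cls.get(RDF + "ID")
        for restriction in cls.findall("rdfs:subClassOf/owl:Restriction", NS):
            min_card = restriction.find("owl:minCardinality", NS)
            target = restriction.find("owl:allValuesFrom", NS)
            prop = restriction.find("owl:onProperty/owl:ObjectProperty", NS)
            if min_card is None or target is None or prop is None:
                continue  # not a cardinality restriction of the expected shape
            if int(min_card.text) >= 1:
                dependences.append((
                    source,
                    prop.get(RDF + "about").lstrip("#"),
                    target.get(RDF + "resource").lstrip("#"),
                ))
    return dependences


print(extract_dependences(OWL_SNIPPET))
```

For the Island class this yields the single triple Island, Within, WaterResource, which is the pair a pruning step would need.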


Well Known Dependences vs. Spatial Association Rules (SAR)

• Well known dependences affect the 3 main steps in the process of mining SAR:

– Spatial predicate computation: compute unnecessary relationships

– Frequent set generation: generate frequent itemsets with well known patterns

– Association rule extraction: produce a high number of rules with well known dependences

Well Known Dependences in SAR

17

Example of preprocessed spatial dataset

Tuple (city) | Spatial predicates
1 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2 | contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3 | contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4 | contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6 | contains(Hospital), contains(TreatedWaterNet), contains(Factory)

Problem 1 – Geographic Dependences between the Target Feature Type and Relevant Feature Types

Minconf = 70%

Dependence = City and TreatedWaterNet (100% support)

contains(Hospital) → contains(TreatedWaterNet)
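A short calculation shows why such a dependence is a problem. The sketch below uses the six-city example dataset of this deck, with each predicate shortened to its feature-type name; `support` and `confidence` are the textbook definitions, and the helper names are illustrative. Since every city contains TreatedWaterNet, any rule with that predicate as consequent has 100% confidence and passes any minconf.

```python
# One set of feature types per city, following the example dataset.
ROWS = [
    {"Port", "Hospital", "TreatedWaterNet", "Factory", "WaterBody"},  # city 1
    {"Hospital", "TreatedWaterNet", "WaterBody"},                      # city 2
    {"Port", "TreatedWaterNet", "Factory", "WaterBody"},               # city 3
    {"Port", "Hospital", "TreatedWaterNet", "WaterBody"},              # city 4
    {"Port", "Hospital", "TreatedWaterNet", "Factory", "WaterBody"},  # city 5
    {"Hospital", "TreatedWaterNet", "Factory"},                        # city 6
]


def support(items, rows=ROWS):
    """Fraction of rows that contain every item."""
    return sum(items <= row for row in rows) / len(rows)


def confidence(antecedent, consequent, rows=ROWS):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


print(support({"TreatedWaterNet"}))                   # 1.0
print(confidence({"Hospital"}, {"TreatedWaterNet"}))  # 1.0
print(confidence({"Port"}, {"TreatedWaterNet"}))      # 1.0
```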

19

Problem 2 - Dependences among Relevant Feature Types

Minsup = 50%

Dependence = {Port, WaterBody}

contains(Port) → crosses(WaterBody)

25 frequent sets (6 contain the dependence)

9 closed frequent sets (3 contain the dependence)

Pruning Methods using Background Knowledge

21

Frequent Set Pruning (Apriori-KC) (Bogorny, 2006a)

Given: φ,      // set of knowledge constraints
       Ψ,      // dataset generated with spatial_predicate_extraction
       minsup  // minimum support

L1 = {large 1-predicate sets};
For (k = 2; Lk-1 != ∅; k++) do begin
    Ck = apriori_gen(Lk-1);  // generates new candidates
    If (k = 2)               // remove pairs with dependences
        Delete from C2 all pairs with a dependence in φ;  // (step 1)
    Forall rows w ∈ Ψ do begin
        Cw = subset(Ck, w);  // candidates contained in w
        Forall candidates c ∈ Cw do
            c.count++;
    End;
    Lk = {c ∈ Ck | c.count ≥ minsup};
End;
Answer = ∪k Lk
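The pseudocode above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: predicates are single letters as in the worked example of this deck, `apriori_gen` is the standard join-and-prune candidate generation, and the knowledge constraints are passed as a set of predicate pairs.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Candidates of size k whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori_kc(rows, dependences, minsup):
    """Apriori with pairs listed in `dependences` deleted from C2."""
    min_count = minsup * len(rows)

    def count(itemset):
        return sum(itemset <= row for row in rows)

    level = {frozenset([p]) for row in rows for p in row}
    level = {s for s in level if count(s) >= min_count}
    frequent, k = {}, 1
    while level:
        frequent.update({s: count(s) for s in level})
        k += 1
        candidates = apriori_gen(level, k)
        if k == 2:
            candidates -= dependences  # step 1: drop well known pairs
        level = {c for c in candidates if count(c) >= min_count}
    return frequent


# Example dataset of this deck, one string of predicate letters per city.
rows = [frozenset(r) for r in ["ACDTW", "CDW", "ADTW", "ACDW", "ACDTW", "CDT"]]
result = apriori_kc(rows, {frozenset("AW")}, minsup=0.5)
print(len(result))  # 19
```

Deleting the single pair {A,W} at k = 2 is enough: no superset of it passes the prune step of `apriori_gen`, so 6 of the 25 frequent sets are never generated.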

22

Understanding the Pruning Methods

Dependences: {D} and {A,W}

a) Dataset

Tid (city) | Predicate set
1 | A, C, D, T, W
2 | C, D, W
3 | A, D, T, W
4 | A, C, D, W
5 | A, C, D, T, W
6 | C, D, T

b) Frequent predicate sets with minsup 50%

k | Frequent sets
1 | {A}, {C}, {D}, {T}, {W}
2 | {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
3 | {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
4 | {A,C,D,W}, {A,D,T,W}

c) Predicates

A = contains(Port), C = contains(Hospital), W = crosses(WaterBody),

T = contains(Factory), D = contains(TreatedWaterNet)
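The frequent sets listed in (b) can be checked by brute force, since the example has only five predicates. The sketch below enumerates every combination and keeps those supported by at least half of the six cities; the function name `frequent_sets` is illustrative, and exhaustive enumeration is used here only because the example is tiny.

```python
from collections import Counter
from itertools import combinations

# Tid -> predicate letters, copied from table (a).
DATASET = {1: "ACDTW", 2: "CDW", 3: "ADTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def frequent_sets(dataset, minsup):
    """Map each frequent predicate set to its absolute support count."""
    rows = [set(preds) for preds in dataset.values()]
    items = sorted(set().union(*rows))
    min_count = minsup * len(rows)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(set(combo) <= row for row in rows)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result


freq = frequent_sets(DATASET, 0.5)
by_size = Counter(len(s) for s in freq)
print(len(freq), dict(by_size))  # 25 {1: 5, 2: 10, 3: 8, 4: 2}
```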

23

Understanding the Pruning Methods (Input Pruning)

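[Figure: lattice of the 25 frequent sets. Input pruning removes predicate {D}, and with it every frequent set that contains D]

Sets removed with {D}: {D}, {A,D}, {C,D}, {D,T}, {D,W}, {A,C,D}, {A,D,T}, {A,D,W}, {C,D,T}, {C,D,W}, {D,T,W}, {A,C,D,W}, {A,D,T,W}

Sets remaining: {A}, {C}, {T}, {W}, {A,C}, {A,T}, {A,W}, {C,T}, {C,W}, {T,W}, {A,C,W}, {A,T,W}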
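Input pruning amounts to deleting a column before mining starts. A minimal sketch, assuming the six-city example dataset with single-letter predicates; `drop_predicates` and `count_frequent` are hypothetical helper names, and the counting is done by exhaustive enumeration because the example is small.

```python
from itertools import combinations

DATASET = {1: "ACDTW", 2: "CDW", 3: "ADTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def drop_predicates(dataset, known):
    """Remove well known predicates from every row (input pruning)."""
    return {
        tid: "".join(p for p in preds if p not in known)
        for tid, preds in dataset.items()
    }


def count_frequent(dataset, minsup):
    """Number of frequent predicate sets, by exhaustive enumeration."""
    rows = [set(preds) for preds in dataset.values()]
    items = sorted(set().union(*rows))
    min_count = minsup * len(rows)
    return sum(
        1
        for k in range(1, len(items) + 1)
        for combo in combinations(items, k)
        if sum(set(combo) <= row for row in rows) >= min_count
    )


before = count_frequent(DATASET, 0.5)
after = count_frequent(drop_predicates(DATASET, {"D"}), 0.5)
print(before, after)  # 25 12
```

Because D occurs in every row, each remaining set had a twin containing D, so removing one column takes the count from 25 to 12.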

24

Understanding the Pruning Methods (Frequent set pruning)

[Figure: lattice of the 25 frequent sets. Frequent set pruning removes the pair {A,W} at k = 2, so no superset of {A,W} is generated]

[Figure: percentage reduction of association rules considering zero (reference), one, and two pairs of dependences with an increasing number of elements (predicates); minconf = 0]

Problem 3 – Redundant Frequent Itemsets

Considering the 25 frequent sets in the example dataset:
– 9 are closed frequent itemsets (3 contain the dependence)
– 16 are redundant (3 contain the dependence)

Dependence: {A,W}
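These counts can be reproduced on the example dataset. The sketch below uses the usual definition of a closed itemset (no proper superset with the same set of supporting tuples); the helper names are illustrative and the enumeration is exhaustive because there are only five predicates.

```python
from itertools import combinations

DATASET = {1: "ACDTW", 2: "CDW", 3: "ADTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def frequent_tidsets(dataset, minsup):
    """Map each frequent predicate set to the tids of the rows containing it."""
    items = sorted(set("".join(dataset.values())))
    min_count = minsup * len(dataset)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            tids = frozenset(
                tid for tid, preds in dataset.items() if set(combo) <= set(preds)
            )
            if len(tids) >= min_count:
                result[frozenset(combo)] = tids
    return result


def closed_sets(tidsets):
    """Sets that have no proper superset with the same tid set."""
    return {
        s
        for s, tids in tidsets.items()
        if not any(s < other and tids == t for other, t in tidsets.items())
    }


freq = frequent_tidsets(DATASET, 0.5)
closed = closed_sets(freq)
dep = frozenset("AW")
print(len(freq), len(closed), sum(dep <= s for s in closed))  # 25 9 3
```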




Remove dependences and then generate closed frequent itemsets. Problem: the resulting frequent sets are not closed.

[Figure: lattice of the frequent sets that remain after removing the dependence {A,W}, with the tid set of each]

k=1: {A} (1345), {C} (12456), {D} (123456), {T} (1356), {W} (12345)
k=2: {A,C} (145), {A,D} (1345), {A,T} (135), {C,D} (12456), {C,T} (156), {C,W} (1245), {D,T} (1356), {D,W} (12345), {T,W} (135)
k=3: {A,C,D} (145), {A,D,T} (135), {C,D,T} (156), {C,D,W} (1245), {D,T,W} (135)


{A,D,T,W}

28


Generate closed frequent itemsets and then eliminate dependences. Problem: information is lost.

[Figure: the 9 closed frequent itemsets with their tid sets; dependence {A,W}]

{D} (123456)
{C,D} (12456), {D,T} (1356), {D,W} (12345)
{A,D,W} (1345), {C,D,T} (156), {C,D,W} (1245)
{A,C,D,W} (145), {A,D,T,W} (135)

Max-FGP (Bogorny, 2006c)

– Remove dependences in a first step

[Figure: lattice of the 25 frequent sets with their tid sets; the sets containing the dependence {A,W} are removed]

k=1: {A} (1345), {C} (12456), {D} (123456), {T} (1356), {W} (12345)
k=2: {A,C} (145), {A,D} (1345), {A,T} (135), {A,W} (1345), {C,D} (12456), {C,T} (156), {C,W} (1245), {D,T} (1356), {D,W} (12345), {T,W} (135)
k=3: {A,C,D} (145), {A,C,W} (145), {A,D,T} (135), {A,D,W} (1345), {A,T,W} (135), {C,D,T} (156), {C,D,W} (1245), {D,T,W} (135)
k=4: {A,C,D,W} (145), {A,D,T,W} (135)

Max-FGP

– Remove redundant frequent sets in a second step, generating maximal frequent sets

[Figure: lattice of the frequent sets without the dependence {A,W}, with their tid sets; tid set (135) is highlighted]

Max-FGP (Bogorny, 2006c)

Given: L;  // frequent sets without dependences (Apriori-KC)
       Ψ;  // dataset generated with spatial_predicate_extraction

Find: Maximal M

// find maximal generalized predicate sets
M = L;
For (k = 2; Mk != ∅; k++) do begin
    For (j = k+1; Mj != ∅; j++) do begin
        If (tidSet(Mk) = tidSet(Mj))
            If (Mk ⊂ Mj)  // Mj is more general than Mk
                Delete Mk from M;
    End;
End;
Answer = M;
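The second step can be sketched in Python on the example dataset. This is an illustrative reading of the pseudocode, not the authors' code: the input is obtained by enumerating the frequent sets and dropping those that contain the dependence {A,W} (which gives the same 19 sets as Apriori-KC on this example), and, as a simplifying assumption, the subset test is applied to sets of every size rather than starting at k = 2.

```python
from itertools import combinations

DATASET = {1: "ACDTW", 2: "CDW", 3: "ADTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def frequent_without(dataset, minsup, dependence):
    """Frequent sets (with tid sets) that do not contain the dependence pair."""
    items = sorted(set("".join(dataset.values())))
    min_count = minsup * len(dataset)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            if dependence <= itemset:
                continue  # first step: dependences are already removed
            tids = frozenset(
                tid for tid, preds in dataset.items() if itemset <= set(preds)
            )
            if len(tids) >= min_count:
                result[itemset] = tids
    return result


def max_fgp(tidsets):
    """Drop every set that has a proper superset with the same tid set."""
    return {
        s: tids
        for s, tids in tidsets.items()
        if not any(s < other and tids == t for other, t in tidsets.items())
    }


L = frequent_without(DATASET, 0.5, frozenset("AW"))
M = max_fgp(L)
print(len(L), len(M))  # 19 10
print(sorted("".join(sorted(s)) for s in M))
```

On this dataset the 19 input sets shrink to 10. {A,D} survives here although it is not closed in the full lattice, because its equal-support superset {A,D,W} was removed in the first step.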

32

Some results on real databases

33

Input Space Pruning

[Figure: Frequent Geographic Patterns Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). X axis: Minimum Support (10%, 15%, 20%); Y axis: Frequent Sets (0 to 1,800); series: Apriori, Apriori (Removing 1 column), Apriori (Removing 2 columns). Data labels, in order of appearance: 1,731; 865; 432; 651; 363; 181; 331; 165; 82]

[Figure: Association Rules Generated with Apriori Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). X axis: Minimum Support (10%, 15%, 20%); Y axis: Association Rules (0 to 25,000); series: Apriori, Apriori (Removing 1 column), Apriori (Removing 2 columns). Data labels, in order of appearance: 22,251; 7,159; 2,241; 7,128; 2,252; 689; 2,268; 698; 204]

20 predicates

Reductions noted on the slide: 1 dependence 50%, 2 dependences 70%; 1 dependence 70%, 2 dependences 90%

Frequent Set Pruning

[Figure: Frequent Geographic Patterns Removing One Dependence among Relevant Features (Frequent Set Pruning). X axis: Minimum Support (5%, 10%, 15%); Y axis: Frequent Sets (0 to 120); series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed Frequent Sets. Data labels, in order of appearance: 117; 85; 32; 22; 71; 51; 20; 16; 47; 33; 14; 12]

[Figure: Computational Time. X axis: Minimum Support (5%, 10%, 15%); Y axis: Time (s) (0 to 200); series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed Frequent Sets. Labels shown: 17 711; 77%, 68%, 58%]

15 predicates

Summary

• Well known dependences exist in several non-spatial application domains
– Biology/Bioinformatics
– Pregnant → Female (confidence = 100%)
– Breast_cancer → Female (confidence = 100%)
– ...

• Almost no data mining approaches consider background knowledge or domain knowledge

36

Future Trends

• Data Mining methods will consider semantics

• 3 workshops (KDD and ICDM) for domain-driven data mining

• ICDM 2008,2009 Workshop - Semantic Aspects in Data Mining

• Book 2008: Domain-Driven Data Mining

37

Summary: Mining SAR using Background Knowledge

• Using background knowledge:
– To prune the input space as much as possible (applicable to any SDM method)
– Apriori-KC generates frequent itemsets without well known dependences
– Max-FGP (Maximal Frequent Geographic Patterns) generates closed frequent itemsets without well known dependences

38

References

Bogorny, V.; Valiati, J.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints. In: IEEE International Conference on Data Mining, IEEE-ICDM, 6., 2006, Hong Kong. 2006c.

Bogorny, V.; Camargo, S.; Engel, P. M.; Alvares, L. O. Towards elimination of well known geographic domain patterns in spatial association rule mining. In: IEEE International Conference on Intelligent Systems, IEEE-IS, 3., 2006, London. IEEE Computer Society, 2006b. p. 532-537.

Bogorny, V.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Frequent Geographic Patterns with Knowledge Constraints. In: ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS, 14., 2006, Arlington. 2006a. p. 139-146.

Bogorny, V.; Palma, A.; Engel, P.; Alvares, L. O. Weka-GDPM: Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications, WAAMD, Florianopolis, 2006d. p. 9-16.

Bogorny, V.; Engel, P. M.; Alvares, L.O. Enhancing the Process of Knowledge Discovery in Geographic Databases using Geo-Ontologies. In: NIGRO, H. O.; CISARO, S.G.; XODO, D. (Ed.). Data Mining with Ontologies: Implementations, Findings, and Frameworks. Idea Group, 2007.

CLEMENTINI, E.; DI FELICE, P.; KOPERSKI, K. Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, [S.l.], v.34, n.3, p.251-270, Sept. 2000.

GÜTING, R. H. An Introduction to Spatial Database Systems. The International Journal on Very Large Data Bases, [S.l.], v.3, n.4, p. 357-399, Oct. 1994.

KOPERSKI, K.; HAN, J. Discovery of Spatial Association Rules In Geographic Information Databases. In: INTERNATIONAL SYMPOSIUM ON LARGE GEOGRAPHICAL DATABASES, SSD, 4., 1995, Portland. Proceedings… [S.l.]: Springer, 1995. p.47-66.

MENNIS, J.; LIU, J.W. Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS,[S.l.], v.9, n.1, p. 5-17, Jan. 2005.

OPEN GIS CONSORTIUM. OpenGIS simple features specification for SQL. 1999. Available at <http://www.opengeospatial.org/docs/99-054.pdf>. Visited on Aug. 2005.
