Focused Exploration of Geospatial Context on Linked Open Data. Thomas Gottron, Johannes Schmitz, Stuart E. Middleton. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany. 20 October 2014, IESD workshop, Riva del Garda.

Focused Exploration of Geospatial Context on Linked Open Data

DESCRIPTION

Talk by Dr. Thomas Gottron at the IESD 2014 workshop in Riva del Garda (at ISWC).

Abstract: The Linked Open Data cloud provides a wide range of different types of information which are interlinked and connected. When a user or application is interested in specific types of information under time constraints, it is best to explore this vast knowledge network in a focused and directed way. In this paper we address the novel task of focused exploration of Linked Open Data for geospatial resources, helping journalists in real-time during breaking news stories to find contextual geospatial information related to geoparsed content. After formalising the task of focused exploration, we present and evaluate five approaches based on three different paradigms. Our results on a dataset with 425,338 entities show that focused exploration on the Linked Data cloud is feasible and can be implemented at very high levels of accuracy of more than 98%.

Citation preview

Page 1: Focused Exploration of Geospatial Context on Linked Open Data

Thomas Gottron Focused Exploration of LOD 1 Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Focused Exploration of Geospatial Context on Linked Open Data

Thomas Gottron, Johannes Schmitz, Stuart E. Middleton

20 October 2014 IESD workshop, Riva del Garda

Page 6: Focused Exploration of Geospatial Context on Linked Open Data

Challenge: Focused Exploration of LOD

•  Linked Data entities
•  (Semantic) link structure
•  "Relevant" entities
•  Seed entity

[Figure: link graph around the seed entity, with candidate entities marked "?"]

Classification: Which links lead to relevant entities?

Ranking: How probable is a link leading to a relevant entity?

Use cases:
•  Guided exploration
•  Focused LOD crawler

Page 7: Focused Exploration of Geospatial Context on Linked Open Data

Focused exploration of Geospatial Context

Relevant entities: Locations semantically related to seed entities

Bensheim (Germany)

Rovereto

Page 8: Focused Exploration of Geospatial Context on Linked Open Data

Focused Exploration: Formalisation

•  E: set of entities (URIs)
•  R: set of RDF triples (s,p,o)
   – Restricted to s,o ∈ E
•  L ⊆ E: relevant entities
   – For us: locations with coordinates
•  Task: for a given s′ and all (s′,p,o) ∈ R
   – Classification: predict which o are in L
   – Ranking: sort object entities o starting from the one presumed most probable to be relevant

[Figure: example entity s ∈ L with wgs84:long -1.404 and wgs84:lat 50.897]
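The formalisation above can be read as two small functions over a triple set. The following is only a sketch under stated assumptions: the function names, the `ex:` identifiers and the oracle used for relevance are hypothetical and serve illustration, they are not taken from the talk.

```python
# Toy triple set R, restricted to triples whose subject and object are entities.
# All identifiers are made up for illustration.
R = [
    ("ex:seed", "ex:twinCity", "ex:cityA"),
    ("ex:seed", "ex:mayor", "ex:personB"),
    ("ex:seed", "ex:country", "ex:countryC"),
    ("ex:cityA", "ex:country", "ex:countryC"),
]
# L: the relevant entities (here: those assumed to carry coordinates).
L = {"ex:seed", "ex:cityA", "ex:countryC"}


def candidate_objects(triples, seed):
    """Objects o of all triples (seed, p, o), de-duplicated in first-seen order."""
    seen = []
    for s, _p, o in triples:
        if s == seed and o not in seen:
            seen.append(o)
    return seen


def classify(triples, seed, is_relevant):
    """Classification task: predict for each candidate o whether o is in L."""
    return {o: bool(is_relevant(o)) for o in candidate_objects(triples, seed)}


def rank(triples, seed, score):
    """Ranking task: candidates sorted from most to least probably relevant."""
    return sorted(candidate_objects(triples, seed), key=score, reverse=True)


# An oracle that looks up L directly; a real approach has to predict this
# without dereferencing the object entity.
labels = classify(R, "ex:seed", lambda o: o in L)
order = rank(R, "ex:seed", lambda o: 1.0 if o in L else 0.0)
```

Any of the five approaches then amounts to a different choice of `is_relevant` and `score`.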

Page 9: Focused Exploration of Geospatial Context on Linked Open Data

5 Approaches

•  Based on 3 paradigms:
   – Schema semantics (1 approach)
   – Supervised machine learning (2 approaches)
   – Information Retrieval inspired (2 approaches)

Page 10: Focused Exploration of Geospatial Context on Linked Open Data

Exploration based on Schema Semantics

•  Exploit rdfs:range definitions of link predicates
•  Follow links which lead to locations

[Figure: dbponto:twinCity has rdfs:range dbpedia:City; dbpedia:City is rdfs:subClassOf dbpedia:Place]

Page 11: Focused Exploration of Geospatial Context on Linked Open Data

Exploration based on Schema Semantics

Classification
•  Range of any pi is a location? → Label = relevant

Ranking
•  Re-use classification:
   –  Relevant before irrelevant

[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
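A minimal sketch of the schema-based rule, assuming the schema is available as two plain dictionaries. The twinCity/City/Place entries mirror the example on the previous slide; treating dbpedia:Place as the location class, the `ex:` names and all function names are assumptions made for illustration.

```python
# rdfs:range of each predicate and rdfs:subClassOf edges, as plain dictionaries.
ranges = {
    "dbponto:twinCity": {"dbpedia:City"},
    "ex:mayor": {"ex:Person"},  # hypothetical non-location predicate
}
subclass_of = {"dbpedia:City": {"dbpedia:Place"}}
location_classes = {"dbpedia:Place"}  # assumption: Place marks locations


def superclasses(cls, subclass_of):
    """cls together with every class reachable via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subclass_of.get(c, ()))
    return seen


def leads_to_location(predicate):
    """True if some declared range of the predicate is a location class."""
    return any(
        superclasses(r, subclass_of) & location_classes
        for r in ranges.get(predicate, ())
    )


def is_relevant(link_predicates):
    """Label an object relevant if any predicate linking to it has a location range."""
    return any(leads_to_location(p) for p in link_predicates)


# Predicates p1..pm observed between the seed and each candidate object.
links = {
    "ex:o1": ["ex:mayor"],
    "ex:o2": ["dbponto:twinCity", "ex:mayor"],
    "ex:o3": ["ex:undeclaredPredicate"],
}
labels = {o: is_relevant(ps) for o, ps in links.items()}
# Ranking re-uses the labels: relevant objects first, the rest after.
ranking = sorted(links, key=lambda o: not labels[o])
```

Predicates without a declared range never fire in this sketch, which is one plausible reason for the low recall this approach shows in the evaluation.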

Page 12: Focused Exploration of Geospatial Context on Linked Open Data

Supervised Machine Learning

•  Use incoming link predicates as features
   –  Learn predicates which typically lead to locations
•  Train a classifier (e.g. Naive Bayes)

2 variations: use all or only observed predicates

[Figure: entities o and o′, incoming link predicates p2, p3, p4, p6, and wgs84:long / wgs84:lat values (xxx, yyy)]

Page 13: Focused Exploration of Geospatial Context on Linked Open Data

Supervised Machine Learning

Classification
•  P(o ∈ L) > P(o ∉ L)? → Label = relevant

Ranking
•  Rank by odds: O(o ∈ L) = P(o ∈ L) / P(o ∉ L)

[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
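The decision rule and the odds can be sketched with a small hand-written Naive Bayes model. The slides do not state the event model or smoothing, so the Bernoulli model with add-one smoothing below is an assumption, as are the class name, the `use_all` switch (standing in for the "all" versus "observed" predicate variations) and the toy training data.

```python
import math
from collections import Counter


class PredicateNB:
    """Bernoulli Naive Bayes over incoming link predicates (illustrative sketch)."""

    def __init__(self, examples, alpha=1.0):
        # examples: iterable of (set of incoming predicates, is_location)
        self.alpha = alpha
        self.n = {True: 0, False: 0}
        self.count = {True: Counter(), False: Counter()}
        for preds, label in examples:
            self.n[label] += 1
            self.count[label].update(set(preds))
        self.vocab = set(self.count[True]) | set(self.count[False])

    def _p(self, predicate, label):
        # Smoothed P(predicate present | class)
        return (self.count[label][predicate] + self.alpha) / (self.n[label] + 2 * self.alpha)

    def log_odds(self, preds, use_all=True):
        """log O(o in L) = log P(o in L) - log P(o not in L)."""
        preds = set(preds) & self.vocab  # predicates unseen in training are ignored
        lo = math.log(self.n[True] + self.alpha) - math.log(self.n[False] + self.alpha)
        features = self.vocab if use_all else preds
        for p in features:
            pt, pf = self._p(p, True), self._p(p, False)
            if p in preds:
                lo += math.log(pt) - math.log(pf)
            else:  # absence only counts as evidence in the "all predicates" variant
                lo += math.log(1 - pt) - math.log(1 - pf)
        return lo

    def classify(self, preds, use_all=True):
        # P(o in L) > P(o not in L) is the same as positive log-odds
        return self.log_odds(preds, use_all) > 0


train = [
    ({"ex:twinCity"}, True),
    ({"ex:twinCity", "ex:birthPlace"}, True),
    ({"ex:birthPlace"}, True),
    ({"ex:author"}, False),
    ({"ex:author", "ex:spouse"}, False),
    ({"ex:spouse"}, False),
]
model = PredicateNB(train)
candidates = {"ex:o1": {"ex:author"}, "ex:o2": {"ex:twinCity"}}
# Sorting by log-odds gives the same order as sorting by odds.
ranking = sorted(candidates, key=lambda o: model.log_odds(candidates[o]), reverse=True)
```

Working in log space avoids underflow when many predicates are involved and leaves the ranking unchanged, since the logarithm is monotone.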

Page 14: Focused Exploration of Geospatial Context on Linked Open Data

IR Inspired Approaches

•  Discriminativeness of predicates (inspired by tf-idf)
•  Property relevance frequency: prf = c(p,L)
•  Inverse property frequency: ipf = log( c(∗,∗) / c(p,∗) )
•  Combine into prf-ipf and prr-ipf
   –  2nd variation: prr is the normalised prf
•  Total score ρ: aggregate over all predicates

[Figure: entity o with incoming predicate p3]

Page 15: Focused Exploration of Geospatial Context on Linked Open Data

IR Inspired Approaches

Classification
•  Determine threshold
   –  Nearest centroid

Ranking
•  Rank by score ρprr-ipf(o)

[Figure: seed s linked to object o via predicates p1, p2, ..., pm; is o a location?]
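A sketch of how the scores and the threshold might be computed. The slides give prf = c(p,L) and ipf = log(c(∗,∗)/c(p,∗)) but not the exact combination, normalisation or aggregation, so three choices below are assumptions: prf and ipf are multiplied (by analogy with tf-idf), prr is taken as c(p,L)/c(p,∗), and ρ sums over the predicates pointing at o. All names and the toy link counts are hypothetical.

```python
import math
from collections import Counter


def fit_counts(links):
    """links: iterable of (predicate, object_is_relevant) pairs from training data."""
    c_p, c_pL = Counter(), Counter()
    for predicate, relevant in links:
        c_p[predicate] += 1          # c(p, *)
        if relevant:
            c_pL[predicate] += 1     # c(p, L)
    return c_p, c_pL, sum(c_p.values())  # last value is c(*, *)


def rho(predicates, c_p, c_pL, total, normalise=False):
    """Total score of an object: sum of prf*ipf (or prr*ipf) over its predicates."""
    score = 0.0
    for p in predicates:
        if c_p[p] == 0:              # predicate never seen in training
            continue
        prf = c_pL[p]
        if normalise:                # assumed definition of prr
            prf = prf / c_p[p]
        score += prf * math.log(total / c_p[p])
    return score


def centroid_threshold(scores_relevant, scores_irrelevant):
    """Nearest centroid in one dimension: the midpoint between the class means."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(scores_relevant) + mean(scores_irrelevant)) / 2


links = (
    [("ex:twinCity", True)] * 3
    + [("ex:birthPlace", True), ("ex:birthPlace", False)]
    + [("ex:author", False)] * 3
)
c_p, c_pL, total = fit_counts(links)
score_prr = rho(["ex:twinCity"], c_p, c_pL, total, normalise=True)
score_prf = rho(["ex:birthPlace"], c_p, c_pL, total)
threshold = centroid_threshold([1.0, 0.8], [0.0, 0.2])
```

An object is then labelled relevant when its score lies above the threshold, and ranking simply sorts candidates by ρ.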

Page 16: Focused Exploration of Geospatial Context on Linked Open Data

Evaluation

•  Metrics:
   –  Ranking:
      •  ROC curves
      •  AUC
   –  Classification:
      •  Precision
      •  Recall
      •  F1
      •  Accuracy
•  Cross validation:
   –  10-times / 10-fold
   –  Averages

[Figure: dataset construction. Labels: Seed; owl:sameAs; 99,951 entities; Exploration; 1,728,633 links; 425,338 entities, 128,171 relevant]
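For reference, the listed metrics can be computed as follows. This is a plain sketch with hypothetical function names and toy data; the AUC is computed through its rank interpretation (the probability that a randomly chosen relevant object is scored above a randomly chosen irrelevant one, ties counting one half), which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy from boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("AUC needs at least one relevant and one irrelevant object")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three relevant and two irrelevant objects.
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of objects, which is fine for a sketch but would be replaced by a sort-based computation on a dataset of this size.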

Page 17: Focused Exploration of Geospatial Context on Linked Open Data

Performance (Ranking)

[Figure: ROC curves for random, Schema Semantics, NB (all predicates), NB (present predicates), prf-ipf and prr-ipf; both axes run from 0 to 1, with an inset zoomed to 0 to 0.05 on the horizontal axis and 0.95 to 1 on the vertical axis]

Page 18: Focused Exploration of Geospatial Context on Linked Open Data

Performance (Classification & Ranking)

Table 2. Average performance of approaches († indicates significant improvements at confidence level ρ = 0.01)

Method                     Recall    Precision  F1        Accuracy  AUC
Schema Semantics           0.1188    0.8119     0.2073    0.7262    0.5552
NB (all predicates)        0.9906    0.9491     †0.9694   †0.9812   0.9970
NB (observed predicates)   0.9943    0.9436     0.9683    0.9804    0.9968
prf-ipf                    0.8512    †0.9754    0.9091    0.9487    0.9958
prr-ipf                    †0.9973   0.9240     0.9592    0.9745    0.9769

performance in bold. Furthermore, we marked the results where we had a significant improvement over the second best method at confidence level of ρ = 0.01. The aggregated values basically confirm the observations made above. In general, when considering the measures F1, Accuracy and AUC, the Naive Bayes classifier making use of all predicates performs best. However, the advantage in comparison to the Naive Bayes classifier using only observed terms is negligible. In application scenarios, where a high Recall is of importance, instead, the prr-ipf approach achieves the best results with more than 99.7%. When focusing on Precision, prf-ipf performs best and demonstrated the highest values. More than 97% of the objects predicted to have geo-coordinates actually did provide such information. In a setting where we want to focus on promising items this might be the kind of performance the end user is looking for.

One explanation for the very high accuracy in general might also be the dataset. Given that we started the exploration from location entities on DBPedia and LinkedGeoData, the overall dataset was biased towards entities from DBPedia. Hence, we intend to extend the evaluation to see if the quality of the supervised approaches remains at a comparable level, when using larger and even more diverse datasets.

6 Related Work

Previous work related to this paper can be found in three areas, each of which will be described below: (a) Extraction of geographic entities provides a starting point for our approach. The fields of (b) focused crawling on the WWW and (c) machine learning applied to Linked Data in general each share some similarities with our classification and ranking task, although differences do exist.

6.1 Extraction of Geographic Entities

Work done in the TRIDEC project [7] examined how geographic databases such as Geonames, OpenStreetMap and GooglePlaces could be used to avoid the need for error-prone named entity recognition and thus increase the overall precision when geoparsing large volumes of Twitter reports for crisis mapping. This work directly compared crisis maps from Twitter with official post-disaster environment agency impact assessments, highlighting just how accurate maps based on large-scale geospatial report crowd sourcing can be. We are building on this approach within the REVEAL project and extending

Page 19: Focused Exploration of Geospatial Context on Linked Open Data

Summary

•  Focused exploration feasible
•  ML approach performing best
•  Future work:
   – Other data sets
   – Generalise scenario (more than locations)
   – Better approaches using more features

Page 20: Focused Exploration of Geospatial Context on Linked Open Data

Questions?

Thomas Gottron Institute for Web Science and Technologies Universität Koblenz-Landau [email protected]