Mining and Managing Large-scale Linked Open Data

Prof. Ansgar Scherp – [email protected]


GVDB, Nörten-Hardenberg, May 25, 2016

Thanks to: Chifumi Nishioka, Renata Dividino, Thomas Gottron, and many more …


Team Knowledge Discovery

• Ansgar Scherp
• Ahmed Saleh
• Chifumi Nishioka
• Falk Böschen
• Mohammad Abdel-Qader
• Till Blume
• Anke Koslowski (Secretariat)
• Henrik Schmidt (Engineer)
• Lukas Galke
• Florian Mai


Linked Open Data (LOD) Cloud
• Publishing and interlinking data on the web
• Different quality, purpose, and sources
• Using the Resource Description Framework (RDF)

World Wide Web         LOD Cloud
Documents              Data
Hyperlinks via <a>     Typed Links
HTML                   RDF
Addresses (URIs)       Addresses (URIs)


Relevance of Linked Data?

1000+ Datasets, 50+ Billion Triples

[Figure: Linked Data cloud diagrams, May ‘07 and August ‘14. Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences, Social Networking. Source: http://lod-cloud.net]


LOD on One Slide: Example Graph

RDF Triple
Subject: biglynx:matt-briggs
Predicate: rdf:type
Object: foaf:Person

Fully qualified URIs using vocabulary prefixes:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biglynx: <http://biglynx.co.uk/people/> .

[Figure: example graph built up step by step around the entity biglynx:matt-briggs. Types shown: foaf:Person, biglynx:Director. Edge labels: rdf:type, foaf:knows, foaf:based_near, ex:loc, wgs84:lat, wgs84:long. Other nodes: biglynx:dave-smith, dp:London, _1:point, and the literals “-0.118” and “51.509”. Annotations: Entity, Types, Properties.]
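The prefix mechanism on this slide can be illustrated with a small sketch. It expands prefixed names into full URIs using the three prefixes from the slide and the standard W3C namespace URI for `rdf`. The function name `expand` and the dictionary `PREFIXES` are made up for this illustration; a real application would rather use an RDF library.

```python
# Prefix table with the three vocabularies used on the slide.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "biglynx": "http://biglynx.co.uk/people/",
}


def expand(curie, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local


# The example triple: subject, predicate, object.
triple = ("biglynx:matt-briggs", "rdf:type", "foaf:Person")
full_triple = tuple(expand(term) for term in triple)
print(full_triple)
```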


Motivation for the SchemEX Index
• Single entry point to query the LOD cloud
• Search for data sources containing entities like
  – ‘Persons, who are Politicians and Actors’
  – ‘Research data sets’
  – ‘Scientific publications’

Query:
SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}

[Figure: query and index, with steps numbered 1 and 2]


Input Data for SchemEX
• Quads: <subject> <predicate> <object> <context>
• Example:
  <http://biglynx.co.uk/people/matt-briggs>
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://xmlns.com/foaf/0.1/Person>
  <http://biglynx.co.uk/people/matt-briggs.rdf>

[Figure: the triple biglynx:matt-briggs rdf:type foaf:Person inside the dataset <http://biglynx.co.uk/people/matt-briggs.rdf>, which is part of the LOD cloud]
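To make the quad format concrete, here is a minimal sketch that splits one quad line into subject, predicate, object, and context. It is deliberately limited: it assumes all four terms are URIs in angle brackets, as in the example, and does not handle literals or blank nodes. The names `QUAD` and `parse_quad` are illustrative only.

```python
import re

# Four URI terms in angle brackets, separated by whitespace.
# Simplification: literals and blank nodes are not supported.
QUAD = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>")


def parse_quad(line):
    """Return (subject, predicate, object, context) for a URI-only quad."""
    match = QUAD.match(line.strip())
    if match is None:
        raise ValueError(f"not a URI-only quad: {line!r}")
    return match.groups()


example = (
    "<http://biglynx.co.uk/people/matt-briggs> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://biglynx.co.uk/people/matt-briggs.rdf>"
)
subject, predicate, obj, context = parse_quad(example)
print(context)
```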


SchemEX Idea
• Schema-level index
  – Assign RDF entities to graph patterns
  – Map graph patterns to data sources (context)
  – Defined over entities, but store the context
• Construction of schema-level index
  – Stream-based for scalability
  – Stratified bi-simulation for detecting patterns
  – Little loss of accuracy

[KGS+12]
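The idea of mapping schema patterns to data sources can be sketched in a strongly simplified form. The sketch below uses only the set of rdf:type values of each entity as its "pattern"; the stratified bi-simulation used by SchemEX is not implemented here. All names (`build_index`, `lookup`, the `ex:` terms, and the context file names) are assumptions for illustration.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability


def build_index(quads):
    """Map each type set (frozenset of types) to the contexts that use it."""
    types = defaultdict(set)     # subject -> its types
    contexts = defaultdict(set)  # subject -> contexts mentioning it
    for s, p, o, c in quads:
        contexts[s].add(c)
        if p == RDF_TYPE:
            types[s].add(o)
    index = defaultdict(set)
    for s, type_set in types.items():
        index[frozenset(type_set)] |= contexts[s]
    return index


def lookup(index, wanted_types):
    """Return contexts holding entities that have at least the wanted types."""
    wanted = frozenset(wanted_types)
    found = set()
    for type_set, ctxs in index.items():
        if wanted <= type_set:
            found |= ctxs
    return found


quads = [
    ("ex:arnold", RDF_TYPE, "ex:Actor", "a.rdf"),
    ("ex:arnold", RDF_TYPE, "ex:Politician", "a.rdf"),
    ("ex:bob", RDF_TYPE, "ex:Actor", "b.rdf"),
]
index = build_index(quads)
print(lookup(index, {"ex:Actor", "ex:Politician"}))
```

The lookup answers the question from the motivation slide ("which sources contain entities that are both Actors and Politicians?") without touching the instance data itself.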


Building the Index from a Stream
• Stream of quads coming from a LD crawler

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1 → FiFo cache

[Figure: FiFo cache holding nodes 1 to 6, linked to the contexts C1, C2, and C3]

+ Reasonable accuracy at cache size of 50k
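Why a bounded cache trades accuracy for scalability can be seen in a small sketch. It keeps at most `cache_size` entities in a first-in, first-out window and writes an entity's type set to the index when the entity is evicted. This is a simplified assumption about the mechanism (type sets only, no bi-simulation); `stream_index` and the sample data are hypothetical.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability


def stream_index(quads, cache_size):
    """Build a type-set index from a quad stream with a bounded FIFO cache."""
    cache = OrderedDict()     # subject -> (types, contexts), oldest first
    index = defaultdict(set)  # frozenset(types) -> contexts

    def flush(entry):
        types, ctxs = entry
        index[frozenset(types)] |= ctxs

    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # Evict the oldest entity and write what is known about it.
                _, oldest = cache.popitem(last=False)
                flush(oldest)
            cache[s] = (set(), set())
        types, ctxs = cache[s]
        ctxs.add(c)
        if p == RDF_TYPE:
            types.add(o)

    for entry in cache.values():  # write the remaining entities
        flush(entry)
    return index


quads = [
    ("A", RDF_TYPE, "Actor", "c1"),
    ("B", RDF_TYPE, "Actor", "c2"),
    ("A", RDF_TYPE, "Politician", "c1"),
]
small = stream_index(quads, cache_size=1)
large = stream_index(quads, cache_size=2)
print(sorted(map(sorted, small)), sorted(map(sorted, large)))
```

With a cache of one entity, A is evicted before its second type arrives, so the pattern {Actor, Politician} is never recorded; with a cache of two entities it is. Memory use is bounded by the window, which matches the scaling behaviour stated on the slides.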


Full BTC 2011 Data Set: 2.17 Bn Triples

Cache size: 50 k

Winner BTC’11

+ Linear runtime with respect to number of triples

+ Memory consumption scales with window size


[Figure: LODatio search interface, inspired by Google: result list with examples, plus generalization and specialization of the query] [GSK+13]


LODatio Under the Hood

[Figure: processing pipeline. Labels: SPARQL, Query translation, Retrieve Data Sources, Count, Select, Rank, Snippets, Generalize, Specialize]

• Hybrid database with off-the-shelf components


LOD on One Slide: Recap

[Figure: example graph from “LOD on One Slide: Example Graph”, now annotated with Type Set (TS) and Property Set (PS)]

Information theoretic analyses of LOD
• How much information is encoded in TS and PS?
• … information encoded, once TS or PS is known?
• … to which degree are TS and PS redundant?
• Example: 20% of PLDs do not need TS (6% for PS)

[GKS15]
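The questions above can be made concrete with a small sketch that computes the entropy of type sets and property sets and the conditional entropy between them, using the standard identity H(X | Y) = H(X, Y) − H(Y). The four toy entities and all names are invented for illustration and are not taken from [GKS15].

```python
import math
from collections import Counter


def entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    """H(X | Y) = H(X, Y) - H(Y) for a list of (x, y) observations."""
    pairs = list(pairs)
    return entropy(pairs) - entropy([y for _, y in pairs])


# Toy data: one (type set, property set) pair per entity.
entities = [
    (frozenset({"Person"}), frozenset({"name", "knows"})),
    (frozenset({"Person"}), frozenset({"name"})),
    (frozenset({"Org"}), frozenset({"label"})),
    (frozenset({"Org"}), frozenset({"label"})),
]

h_ts = entropy([ts for ts, _ in entities])
h_ps = entropy([ps for _, ps in entities])
h_ps_given_ts = conditional_entropy([(ps, ts) for ts, ps in entities])
h_ts_given_ps = conditional_entropy([(ts, ps) for ts, ps in entities])
print(h_ts, h_ps, h_ps_given_ts, h_ts_given_ps)
```

In this toy data the type set is fully determined by the property set, so H(TS | PS) is zero; this is the sense in which a source "does not need TS".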


Käfer et al.’s Temporal Analysis of LOD
• Data on the cloud changes a lot
• 29 weekly LOD snapshots of ~100 million triples
• Still running since May 2012 (now 200+ weeks)
• But vocabularies defining RDF types and properties are highly static, e.g., RDF, FOAF

[Figure: LOD cloud ~2012 vs. LOD cloud ~2014. Changes?]

[Käfer et al., 2013] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, A. Hogan: Observing Linked Data Dynamics. ESWC 2013: 213-227


But: Do Changes Occur in PS and TS?
• Analysis: expected conditional entropy over time
• $H(PS \mid TS = ts)$: entropy of PS, given that TS = ts is known; analogously $H(TS \mid PS = ps)$
• Observation: types become less important
• Changes in the use of TS and PS?!

[Figure: $H(PS \mid TS = ts)$ and $H(TS \mid PS = ps)$ over time]


Changes over Time
• Extended characteristic sets: ECS = PS ∪ TS [Neumann and Moerkotte, 2011]
• Avg.: 83.898 ECS per week
• Avg. 73% of ECS re-occur next week (orange)
• Avg. 35% of ECS remain unchanged (blue)
• Avg. 20% of entity sets of ECS change / week

[DSG+13]

[Figure: # of ECS per week]

[Neumann and Moerkotte, 2011] Thomas Neumann, Guido Moerkotte: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE 2011: 984-994
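A minimal sketch of how such week-to-week figures could be computed is given below. It makes two assumptions that the slide does not spell out: the property set excludes rdf:type itself, and an ECS counts as "unchanged" when it re-occurs with exactly the same set of entities. Function names and the toy snapshots are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability


def extended_characteristic_sets(triples):
    """Map each ECS (types and properties of an entity) to its entities."""
    terms = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            terms[s].add(o)  # contributes to the type set (TS)
        else:
            terms[s].add(p)  # contributes to the property set (PS)
    ecs = defaultdict(set)
    for s, term_set in terms.items():
        ecs[frozenset(term_set)].add(s)
    return ecs


def compare(week_a, week_b):
    """Return (share of ECS that re-occur, share that keep the same entities)."""
    reoccurring = [e for e in week_a if e in week_b]
    unchanged = [e for e in reoccurring if week_a[e] == week_b[e]]
    return len(reoccurring) / len(week_a), len(unchanged) / len(week_a)


week1 = extended_characteristic_sets([
    ("a", RDF_TYPE, "Person"), ("a", "knows", "b"),
    ("c", RDF_TYPE, "Person"), ("c", "knows", "d"),
])
week2 = extended_characteristic_sets([
    ("a", RDF_TYPE, "Person"), ("a", "knows", "b"),
    ("c", RDF_TYPE, "Org"),
])
print(compare(week1, week2))
```

Here the single ECS {Person, knows} re-occurs in the second week, but its entity set shrinks from {a, c} to {a}, so it is counted as re-occurring and not as unchanged.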


Temporal Dynamics of the Entities?
• Notion of entity motivated by ECS: w.l.o.g., an entity is a set of triples sharing the same subject URI
• Example:
  – 1 entity
  – 4 triples
• Useful to keep LOD caches up-to-date?
• Can we predict when LOD sources will change?


Dynamics Function Θ
• Definition of Θ over change rate function δ
• Approximation as step function over changes:

  $\Theta_{t_i}^{t_j}(X) \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$

• Monotone, non-negative

[DGS+14]

[Figure: Θ of X over time between t_i and t_j]
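The step-function approximation can be sketched as a sum of distances between consecutive snapshots. The slides do not fix a particular δ here, so the Jaccard distance between triple sets is used purely as an assumed example; `jaccard_distance` and `dynamics` are illustrative names.

```python
def jaccard_distance(a, b):
    """Assumed example for delta: 1 - |A ∩ B| / |A ∪ B| on sets of triples."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def dynamics(snapshots, delta=jaccard_distance):
    """Sum delta over consecutive snapshots X_{t_{k-1}}, X_{t_k}."""
    return sum(delta(prev, cur) for prev, cur in zip(snapshots, snapshots[1:]))


# A source that changes and then changes back to its first state.
back_and_forth = [{"t1"}, {"t2"}, {"t1"}]
print(jaccard_distance(back_and_forth[0], back_and_forth[-1]))
print(dynamics(back_and_forth))
```

Comparing only the first and last snapshot reports no change, whereas the dynamics accumulate both changes. Because every summand is non-negative, the running sum can only grow over time, which corresponds to the "monotone, non-negative" property.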


Update Strategies for LOD Sources
• Apply strategies from keeping caches of WWW documents up-to-date to maintain LOD caches
• Assumptions
  – LOD is fetched from various sources
  – Sources are scored and prioritized based on strategy
  – Data of a source is fetched only when the operation can be entirely executed


Scheduling Update Strategies

a) HTTP Header [Dividino et al., 2014a]
b) Age or Last Visited [Dasdan et al., 2009, Cho and Garcia-Molina, 2000]
c) PageRank [Page et al., 1999, Boldi et al., 2004, Baeza-Yates et al., 2005]
d) LOD Source Size
e) Change Ratio [Douglis et al., 1997, Cho et al., 2002, Tan et al., 2007]
f) Change Rate [Olston et al., 2002, Ntoulas et al., 2004, Dividino et al., 2013]
g) History Information: Dynamics [Dividino et al., 2014b]

We borrow strategies developed for the WWW and metrics for data change analysis in the LOD cloud.


e) Change Ratio
• Captures the change frequency of the data (freshness)
• Percentage of data items in the cache that are up-to-date

[Figure: ranking over time, from sources which changed (most) down to sources that changed less or not at all]


f) Change Rate
• Data from sources which are less similar to their previous update (snapshot) should be updated first
• Comparison of two RDF data sets
  – X: Set of triple statements
  – δ: Numeric expression (distance)

[Figure: example of δ between the snapshots at t_i and t_j]


g) History Information: Dynamics
• Data from sources which evolve most in a given period of time should be updated first
• Uses both history information and change rate

  $\Theta_{t_i}^{t_j}(X) \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$

[Figure: Θ of X over time between t_i and t_j]


Evaluation
• Idea: simulation of limitations of available computational resources (network bandwidth, computation time)
• Which sources to prioritise in an update?

[Figure: timeline from t_i to t_{i+1}, starting at 100%]
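One way to simulate such a resource limit is sketched below. It assumes that the relative bandwidth is a fraction of the total number of triples over all sources, and that a source is only fetched if it fits entirely into the remaining budget (in line with the assumption stated for the update strategies). The scores would come from one of the strategies a) to g); the function `schedule` and the numbers are invented for illustration.

```python
def schedule(scores, sizes, bandwidth):
    """Pick sources by descending score until the bandwidth budget is spent.

    scores:    source -> priority assigned by an update strategy
    sizes:     source -> number of triples to fetch
    bandwidth: fraction of all triples that may be fetched (0.0 to 1.0)
    """
    budget = bandwidth * sum(sizes.values())
    chosen = []
    for source in sorted(scores, key=scores.get, reverse=True):
        # A source is fetched only if it can be fetched completely.
        if sizes[source] <= budget:
            chosen.append(source)
            budget -= sizes[source]
    return chosen


scores = {"a.org": 0.9, "b.org": 0.5, "c.org": 0.7}
sizes = {"a.org": 30, "b.org": 50, "c.org": 20}
print(schedule(scores, sizes, bandwidth=0.5))
```

Sources that do not fit are skipped rather than stopping the loop, so smaller low-priority sources can still use leftover budget; stopping at the first misfit would be an equally plausible design.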


Evaluation: Single Step Update

Which strategy is the most appropriate one to keep the cache up-to-date?

[Figure: single update step from t_i to t_{i+1}; labels 100%, 5%, 15%, 40%, 60%, 75%, 95%]


Evaluation: Iterative Updates

Simulates a LOD search engine continuously updating its caches

[Figure: repeated update steps at t_i, t_{i+1}, t_{i+2}, …; labels 100%, 5%, 15%, 40%, 60%, 75%, 95%]


Dataset
• Dynamic Linked Data Observatory
• Weekly snapshots, 14 M triples
• 154 snapshots (approx. 3 years)
• 590 data sources (PLD)

Top 10 largest data sources        Average size
dbpedia.org                        3,406,364.5
edgarwrap.ontologycentral.com        982,631.0
dbtune.org                           864,107.6
dbtropes.org                         787,299.9
data.linkedct.org                    498,986.3
aims.fao.org                         416,708.9
www.legislation.gov.uk               399,601.6
kent.zpr.fer.hr                      387,034.8
identi.ca                            278,316.2
webenemasuno.linkeddata.es           250,557.9


Metrics: Precision & Recall
• Precision: portion of cached data that are actually up-to-date
• Recall: portion of data in the LOD cloud that is identical to the cached data

[Figure: overlap of the cached data and the actual data on the LOD cloud (w.r.t. the 590 sources considered)]
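Read as set operations, the two metrics can be sketched as follows: both use the overlap between the cached triples and the triples currently on the LOD cloud, divided by the size of the cache (precision) or of the actual data (recall). Treating the data as plain sets of triples is an assumption of this sketch, as are the names used.

```python
def precision_recall(cached, actual):
    """Precision and recall of a cache against the actual data."""
    cached, actual = set(cached), set(actual)
    up_to_date = cached & actual  # cached triples that still exist as-is
    precision = len(up_to_date) / len(cached) if cached else 1.0
    recall = len(up_to_date) / len(actual) if actual else 1.0
    return precision, recall


cached = {"t1", "t2", "t3", "t4"}  # what the cache holds
actual = {"t1", "t2", "t5"}        # what the sources currently publish
precision, recall = precision_recall(cached, actual)
print(precision, recall)
```

In the example two of the four cached triples are still valid (precision 0.5), and two of the three published triples are in the cache (recall 2/3).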


Results: Single Step Update

[Figure: single step update results; highlighted setting: 5%]


Results: Iterative Updates

[Figures: iterative update results over time, shown for the settings 5%, 15%, and 40%]


Results: Summary
• Best strategies: ones which capture the change behaviour over time
• Especially for low relative bandwidth


Dynamics Function Θ: Revisited

[Figure: X over time between t_i and t_j]

• Can we predict when LOD sources will change?
• Notion of dynamics to compute periodicities!
• Dynamics as vector of changes:


Temporal Clustering of Entities
• Dynamics as vector:
• Clustering with k-means++ to find patterns
• 165 snapshots
• 65,044 entities
• 7 patterns (after optimizing k)

[NS15]

[Figure: change (log scale) over time]


Periodicity of Entity Dynamics• Examples: ,

Cluster   # of entities   Most likely periodicity
C1        12,982          66
C2        168             23
C3        35              1
C4        12              1
C5        1               1
C6        1,541           56
C7        30              37
CS        50,725

[Elfeky et al., 2005] Mohamed G. Elfeky, Walid G. Aref, Ahmed K. Elmagarmid:Periodicity Detection in Time Series Databases. IEEE Trans. Knowl. Data Eng. 17(7): 875-887 (2005)

• Convolution-based algorithm [Elfeky et al. 2005]

• Entities of found in several clusters (C1,C3,C4,C5,C6)

• No changes (CS): 77.29%
• CS: entities from and
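What a "most likely periodicity" means can be illustrated with a deliberately simple stand-in. The sketch below is not the convolution-based algorithm of [Elfeky et al. 2005]; it merely shifts a change series against itself and picks the shift with the highest share of matching positions. The function name and the toy series (one change every third week) are assumptions.

```python
def most_likely_period(changes):
    """Return the shift at which the series agrees best with itself."""
    n = len(changes)
    best_period, best_score = None, -1.0
    for period in range(1, n // 2 + 1):
        pairs = n - period
        matches = sum(
            1 for i in range(pairs) if changes[i] == changes[i + period]
        )
        score = matches / pairs
        if score > best_score:  # strict '>' keeps the smallest period on ties
            best_period, best_score = period, score
    return best_period


# 1 = entity changed in that week, 0 = no change.
weekly_changes = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(most_likely_period(weekly_changes))
```

A series that never changes agrees with itself at every shift, so this stand-in reports a period of 1 for it.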


Application Areas: More than One!
• Searching for LOD sources [GSK+13, KGS+12]
• Strategies for updating data caches [DGS15]
• Programming queries against LOD [SSS12]
• Recommending LOD vocabularies [SGS16]

Foundation for Future Data-driven Applications


Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to research and innovation?

• H2020 INSO-4 project, duration: 04/2016-03/2019
• Data mining & visualization tools enabling information professionals to deal with large corpora
• Website: http://www.moving-project.eu/


Got Interested?
Knowledge Discovery at ZBW
Contact me!
Prof. Dr. Ansgar Scherp

• Email: [email protected]
• Twitter: https://twitter.com/ansgarscherp
• Slideshare: http://de.slideshare.net/ascherp
• KD-Website:
  http://www.zbw.eu/en/research/knowledge-discovery/
  http://www.kd.informatik.uni-kiel.de/en/


References

[DGS15] R. Dividino, T. Gottron, A. Scherp: Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2) 2015: 356-373

[DGS+14] R. Dividino, T. Gottron, A. Scherp, G. Gröner: From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014

[GKS15] T. Gottron, M. Knauf, A. Scherp: Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage. Distributed and Parallel Databases 33(4): 515-553 (2015)

[DSG+13] R. Dividino, A. Scherp, G. Gröner, T. Gottron: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? COLD 2013

[GSK+13] T. Gottron, A. Scherp, B. Krayer, A. Peters: LODatio: using a schema-level index to support users in finding relevant sources of linked data. K-CAP 2013: 105-108

[KGS+12] M. Konrath, T. Gottron, S. Staab, A. Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16: 52-58 (2012)

[NS15] C. Nishioka, A. Scherp: Temporal Patterns and Periodicity of Entity Dynamics in the Linked Open Data Cloud. K-CAP 2015

[SGS16] J. Schaible, T. Gottron, A. Scherp: TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud. ESWC 2016

[SSS12] S. Scheglmann, A. Scherp, S. Staab: Declarative Representation of Programming Access to Ontologies. ESWC 2012: 659-673


a) HTTP Header
• Data from sources which have been changed since the last update should be updated first

HTTP Response
HEADER
…
Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
CONTENT
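A minimal sketch of this check, using only the Python standard library: the Last-Modified value is parsed into a timestamp and compared with the time of the last cache update. The function name `changed_since` and the fetch dates are made up; the header value is the one from the slide.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(last_modified, last_update):
    """True if the Last-Modified header is newer than our last update."""
    return parsedate_to_datetime(last_modified) > last_update


header = "Tue, 15 Nov 1994 12:45:26 GMT"
early_update = datetime(1994, 11, 1, tzinfo=timezone.utc)
late_update = datetime(1995, 1, 1, tzinfo=timezone.utc)

print(changed_since(header, early_update))  # source changed after our update
print(changed_since(header, late_update))   # cache is newer than the source
```

This strategy depends on the source sending a trustworthy Last-Modified header, which is an assumption that does not hold for every server.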


b) Age or Last Visited
• Time elapsed since the last update (the difference between query time and last update time)
• It guarantees that every source is updated after a period

[Figure: ranking, from sources that were last updated longest ago down to sources that have been recently updated]


c) PageRank and d) Source Size
• PageRank captures popularity/importance of the LOD source
• Data from sources with highest PageRank are updated first
• LOD source size: data from the biggest/smallest LOD sources should be updated first

[Figure: ranking, from sources with higher PR down to sources with lower PR]

