SchemEX -- Building an Index for Linked Open Data

General introduction to Linked Open Data and schema extraction using SchemEX.

Slide 1 of 44SchemEX – Building an Index for LOD

SchemEX – Building an Index for Linked Open Data

Ansgar Scherp, Thomas Gottron, Mathias Konrath
University of Koblenz-Landau, Germany

Oslo, Norway
August 2012

Slide 2 of 44SchemEX – Building an Index for LOD

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Slide 3 of 44SchemEX – Building an Index for LOD

Scenario

• Tim plans to travel
  – from London
  – to a customer in Cologne

Slide 4 of 44SchemEX – Building an Index for LOD

Website of the German Railway

It works, why bother…?

Eurostar

DB

KVB

Slide 5 of 44SchemEX – Building an Index for LOD

Let's Try Different Queries

Bottlenecks in public transportation? Compare the connections with flights? Visualize on a map? …

None of these queries can be answered, because the data …

Slide 6 of 44SchemEX – Building an Index for LOD

… is locked in silos!

– High integration effort
– Lack of data reuse

B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY

Slide 7 of 44SchemEX – Building an Index for LOD

Linked Data

• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

World Wide Web      Linked Data
Documents           Data
Hyperlinks          Typed links
HTML                RDF
Addresses (URIs)    Addresses (URIs)

Example: http://www.uio.no/

Slide 8 of 44SchemEX – Building an Index for LOD

Relevance of Linked Data?

Slide 9 of 44SchemEX – Building an Index for LOD

Linked Data: May '07 to Sept. '11

[Figure: LOD cloud diagrams for May '07 and Sept. '11; < 31 billion triples. Domain legend: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences.]

Source: http://lod-cloud.net

Slide 10 of 44SchemEX – Building an Index for LOD

Linked Data Principles

1. Identification
2. Interlinkage
3. Dereferencing
4. Description

Slide 11 of 44SchemEX – Building an Index for LOD

Example: Big Lynx

[Figure: Big Lynx Company with Matt Briggs and Scott Miller, shown next to the LOD cloud (< 31 billion triples; source: http://lod-cloud.net) with a question mark.]

Slide 12 of 44SchemEX – Building an Index for LOD

1. Use URIs for Identification

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott-miller

Matt Briggs

Scott Miller

Slide 13 of 44SchemEX – Building an Index for LOD

Example: Big Lynx

Big Lynx Company

Matt Briggs

Scott Miller

How to model relationships like knows?

Slide 14 of 44SchemEX – Building an Index for LOD

• Description of resources with RDF triples

Matt Briggs is a Person

Resource Description Framework (RDF)

Subject Predicate Object

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
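The triple above can also be written down as plain data. The following is a minimal sketch using only the Python standard library; the names `Triple` and `to_ntriples` are illustrative assumptions and do not belong to any RDF library, and the sketch only handles triples whose three positions are URIs.

```python
from typing import NamedTuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"


class Triple(NamedTuple):
    """One RDF statement; all three positions hold URIs in this sketch."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# "Matt Briggs is a Person" as subject, predicate, object.
matt_is_person = Triple(
    "http://biglynx.co.uk/people/matt-briggs",
    RDF + "type",
    FOAF + "Person",
)
print(to_ntriples(matt_is_person))
```

The prefixes `rdf:` and `foaf:` in the Turtle notation are only abbreviations; the serialized line shows the full URIs they expand to.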

Slide 15 of 44SchemEX – Building an Index for LOD

1. Use URIs also for Relations

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott-miller

foaf:knows

foaf:knows

Slide 16 of 44SchemEX – Building an Index for LOD

Example: Big Lynx

[Figure: three data sets. Big Lynx Company (Matt Briggs, Scott Miller, Dave Smith), DBpedia (London), and Matt's private website (Matt Briggs). Informal links between them: "same person" and "lives here".]

Slide 17 of 44SchemEX – Building an Index for LOD

2. Establishing Interlinkage

• Relation links between resources

<http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .

• Identity links between resources

<http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .

Slide 18 of 44SchemEX – Building an Index for LOD

Example: Big Lynx

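[Figure: the same three data sets (DBpedia with London, Big Lynx Company with Matt Briggs and Dave Smith, Matt's private website with Matt Briggs). The informal link "lives here" is now labeled foaf:based_near, and "same person" is labeled owl:sameAs.]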

Slide 19 of 44SchemEX – Building an Index for LOD

3. Dereferencing of URIs

• Looking up web documents

• How can we "look up" things of the real world?

http://biglynx.co.uk/people/matt-briggs

Slide 20 of 44SchemEX – Building an Index for LOD

Two Approaches

1. Hash URIs
   – The URI contains a part separated by #, e.g.,
     http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via a "303 See Other" response
   – Request: http://biglynx.co.uk/people/matt-briggs
   – Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
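The two approaches can be sketched without any network access. In the following illustration the table `SEE_OTHER` is a hypothetical stand-in for a server that answers with "303 See Other", and `document_for` is an assumed helper name; a real client would issue an HTTP GET and follow the redirect.

```python
from urllib.parse import urldefrag

# Hypothetical redirect table standing in for a server that answers
# "303 See Other"; a real client would issue an HTTP GET instead.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_for(uri: str) -> str:
    """Return the URI of the document that describes `uri`."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the client strips the fragment before the request.
        return base
    # 303 URI: follow the (simulated) redirect if the server sends one.
    return SEE_OTHER.get(uri, uri)


print(document_for("http://biglynx.co.uk/vocab/sme#Team"))
print(document_for("http://biglynx.co.uk/people/matt-briggs"))
```

With hash URIs the lookup costs a single request, because the fragment never reaches the server. The 303 approach needs a second request but lets the server pick a document per resource.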

Slide 21 of 44SchemEX – Building an Index for LOD

Example: Big Lynx

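[Figure: the three interlinked data sets (DBpedia with London, Big Lynx Company with Matt Briggs and Dave Smith, Matt's private website with Matt Briggs), connected by foaf:based_near and owl:sameAs. Open question: where is the description of Matt?]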

Slide 22 of 44SchemEX – Building an Index for LOD

4. Description of URIs

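[Figure: RDF graph describing biglynx:matt-briggs. Nodes: biglynx:matt-briggs, biglynx:dave-smith, foaf:Person, dp:Birmingham, dp:London, and the blank node _:point. Edge labels: rdf:type, foaf:knows, foaf:based_near (twice), ex:loc, wgs84:lat, wgs84:long. Literals: "-0.118" and "51.509".]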

Slide 23 of 44SchemEX – Building an Index for LOD

RDF / RDF Schema Vocabulary

• rdf:type • rdf:Property • rdf:XMLLiteral • rdf:List • rdf:first • rdf:rest • rdf:Seq • rdf:Bag • rdf:Alt • ... • rdf:value

• rdfs:domain • rdfs:range • rdfs:Resource • rdfs:Literal • rdfs:Datatype • rdfs:Class • rdfs:subClassOf • rdfs:subPropertyOf • rdfs:comment • … • rdfs:label

• Set of URIs defined in the rdf: and rdfs: namespaces

Slide 24 of 44SchemEX – Building an Index for LOD

Semantic Web Layer Cake (Simplified)

Slide 25 of 44SchemEX – Building an Index for LOD

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Slide 26 of 44SchemEX – Building an Index for LOD

Scenario

• People who are politicians and actors

• Who else?
• Where do they live?
• Whom do they know?
• Whom are they married to?

Slide 27 of 44SchemEX – Building an Index for LOD

Problem

• Execute those queries on the LOD cloud

Relevant sources?

• No single federated query interface provided

“politicians and actors”

SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}
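What the query asks for can be illustrated on a handful of triples held in memory. This is a sketch under stated assumptions: the sample data and the helper `instances_of` are hypothetical, and prefixed names are kept as plain strings. It only works because all triples are already in one place, which is exactly what the LOD cloud does not offer.

```python
RDF_TYPE = "rdf:type"

# Hypothetical sample data; the ex: names are placeholders.
triples = {
    ("ex:alice", RDF_TYPE, "ex:Actor"),
    ("ex:alice", RDF_TYPE, "ex:Politician"),
    ("ex:bob", RDF_TYPE, "ex:Actor"),
    ("ex:carol", RDF_TYPE, "ex:Politician"),
}


def instances_of(all_types, data):
    """Return the subjects that carry every type in `all_types`."""
    types_by_subject = {}
    for s, p, o in data:
        if p == RDF_TYPE:
            types_by_subject.setdefault(s, set()).add(o)
    wanted = set(all_types)
    # A subject matches if the wanted types are a subset of its types.
    return {s for s, ts in types_by_subject.items() if wanted <= ts}


print(instances_of(["ex:Actor", "ex:Politician"], triples))
```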

Slide 28 of 44SchemEX – Building an Index for LOD

Solution in Principle

• Suitable index structure for looking up sources

“politicians and actors”

Slide 29 of 44SchemEX – Building an Index for LOD

The Naive Approach

1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup

- Big machinery
- Data is processed late
- High effort to scale with LOD cloud

Can we do this smarter?

Slide 30 of 44SchemEX – Building an Index for LOD

Idea: Schema-level index
• Define families of graph patterns
• Assign instances to graph patterns
• Map graph patterns to context (source URI)

Construction
• Stream-based for scalability
• Little loss of accuracy

Note
• Index defined over instances
• But stores the context

Slide 31 of 44SchemEX – Building an Index for LOD

Input Data: N-Quads

<subject> <predicate> <object> <context>

Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>

w3p:#me

foaf:Person

http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
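Reading such a statement can be sketched with a regular expression. The sketch assumes that all four terms are IRIs in angle brackets, as in the example; literals and blank nodes, which N-Quads also allows, are not handled. The name `parse_quad` is an assumption for illustration and is unrelated to the parser used by SchemEX.

```python
import re

# Matches a statement whose four terms are all IRIs in angle brackets.
# Literals and blank nodes, which real N-Quads allow, are not handled.
QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$"
)


def parse_quad(line: str):
    """Split one IRI-only line into (subject, predicate, object, context)."""
    m = QUAD.match(line)
    if m is None:
        raise ValueError(f"not an IRI-only quad: {line!r}")
    return m.groups()


line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
s, p, o, c = parse_quad(line)
print(c)  # the context: the source document the triple came from
```

The fourth term is what makes the index possible: it records where each triple was found.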

Slide 32 of 44SchemEX – Building an Index for LOD

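Building the Schema and Index

[Figure: layered index structure, from top to bottom:
 Data sources (DS1, DS2, DS3, DS4, DS5, …, DSx)
 Equivalence classes (EQC1, EQC2, …, EQCn)
 Type clusters (TC1, TC2, …, TCm)
 RDF classes (C1, C2, C3, …, Ck)
 Edge labels: consistsOf, hasDataSource, hasEQClass, p1, p2]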

Slide 33 of 44SchemEX – Building an Index for LOD

Layer 1: RDF Classes

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

http://dig.csail.mit.edu/2008/...

C1

DS 1, DS 2, DS 3

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}

All instances of a particular type
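The first layer can be sketched as a mapping from each RDF class to the data sources in which instances of that class occur. The quads, the `ds:` source names, and the function `build_class_index` are hypothetical; the sketch is an illustration of the idea, not the SchemEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical quads (subject, predicate, object, context).
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds:1"),
    ("ex:alice", RDF_TYPE, "foaf:Person", "ds:2"),
    ("ex:doc", RDF_TYPE, "foaf:Document", "ds:3"),
]


def build_class_index(data):
    """Map each RDF class to the data sources with instances of it."""
    index = defaultdict(set)
    for _s, p, o, context in data:
        if p == RDF_TYPE:
            index[o].add(context)
    return index


class_index = build_class_index(quads)
print(sorted(class_index["foaf:Person"]))
```

A lookup for foaf:Person then returns the sources worth querying, instead of the instances themselves.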

Slide 34 of 44SchemEX – Building an Index for LOD

Layer 2: Type Clusters

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

DS 1, DS 2, DS 3

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}

C1

C2

pim:Male

tc4711

pim:Male

All instances belonging to exactly the same set of types

TC1
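A type cluster can be sketched as the exact set of types of an instance, used as a dictionary key. The sample quads and the function `build_type_clusters` are assumptions for illustration only.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical quads (subject, predicate, object, context).
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds:1"),
    ("timbl:card#i", RDF_TYPE, "pim:Male", "ds:1"),
    ("ex:alice", RDF_TYPE, "foaf:Person", "ds:2"),
]


def build_type_clusters(data):
    """Map each exact type set to the sources of the instances having it."""
    types = defaultdict(set)    # instance -> its types
    sources = defaultdict(set)  # instance -> contexts it was seen in
    for s, p, o, context in data:
        if p == RDF_TYPE:
            types[s].add(o)
            sources[s].add(context)
    clusters = defaultdict(set)
    for s, ts in types.items():
        # frozenset makes the type set hashable, so it can act as the key.
        clusters[frozenset(ts)] |= sources[s]
    return clusters


tcs = build_type_clusters(quads)
print(tcs[frozenset({"foaf:Person", "pim:Male"})])
```

In this sketch ex:alice falls into a different cluster than timbl:card#i, because her type set is {foaf:Person} only.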

Slide 35 of 44SchemEX – Building an Index for LOD

Layer 3: Equivalence Classes

Two instances are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC

Similar to 1-bisimulation

EQC1

DS 3DS 2DS 1

C1

C2

TC1

C3

TC2

p

Slide 36 of 44SchemEX – Building an Index for LOD

Layer 3: Equivalence Classes

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

pim:Male

tc4711

pim:Male

foaf:PPD

timbl:card

eqc0815

foaf:PPD

tc1234

eqc0815-maker-tc1234

foaf:maker

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
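The equivalence condition of Layer 3 can be sketched as a key made of the instance's type cluster plus the set of (property, target type cluster) pairs. The triples below are hypothetical and merely mirror the shape of the query; `equivalence_key` is an assumed name, not part of SchemEX.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
PPD = "foaf:PersonalProfileDocument"

# Hypothetical triples shaped like the query pattern.
triples = [
    ("ex:x", RDF_TYPE, "foaf:Person"), ("ex:x", RDF_TYPE, "pim:Male"),
    ("ex:x", "foaf:maker", "ex:doc1"),
    ("ex:w", RDF_TYPE, "foaf:Person"), ("ex:w", RDF_TYPE, "pim:Male"),
    ("ex:w", "foaf:maker", "ex:doc2"),
    ("ex:z", RDF_TYPE, "foaf:Person"), ("ex:z", RDF_TYPE, "pim:Male"),
    ("ex:doc1", RDF_TYPE, PPD), ("ex:doc2", RDF_TYPE, PPD),
]


def equivalence_key(instance, data):
    """Return (type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)
    for s, p, o in data:
        if p == RDF_TYPE:
            types[s].add(o)
    links = frozenset(
        (p, frozenset(types[o]))
        for s, p, o in data
        if s == instance and p != RDF_TYPE
    )
    return frozenset(types[instance]), links


print(equivalence_key("ex:x", triples) == equivalence_key("ex:w", triples))
print(equivalence_key("ex:x", triples) == equivalence_key("ex:z", triples))
```

Here ex:x and ex:w share a key, while ex:z has the same types but no foaf:maker property and therefore ends up in a different equivalence class.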

Slide 37 of 44SchemEX – Building an Index for LOD

The SchemEX Approach

• Stream-based schema extraction
• While crawling the data

[Figure: processing pipeline. Labels: LOD crawler, RDF dump, triple store; N-Quad stream; parser; NxParser; instance cache; FIFO; schema extractor; schema-level index; RDF; RDBMS.]

Slide 38 of 44SchemEX – Building an Index for LOD

Building the Index from a Stream

• Stream of n-quads (coming from an LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: FIFO cache with numbered instances and the RDF classes C1, C2, C3.]

• Linear runtime complexity with respect to the number of input triples
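The stream-based construction can be sketched with a bounded FIFO cache. The sketch is a simplification under stated assumptions: it builds only type clusters, not equivalence classes, and `index_stream` as well as the sample stream are hypothetical. Each quad is handled once with constant-time cache operations, which is where the linear runtime comes from.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"


def index_stream(quads, cache_size):
    """Build a type-cluster index from a quad stream with a FIFO cache."""
    cache = OrderedDict()     # subject -> (types, contexts), oldest first
    index = defaultdict(set)  # type cluster -> data sources

    def flush(entry):
        types, contexts = entry
        index[frozenset(types)] |= contexts

    for s, p, o, context in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # Cache full: evict the oldest instance and index
                # whatever is known about it at this point.
                _, oldest = cache.popitem(last=False)
                flush(oldest)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(context)
        if p == RDF_TYPE:
            types.add(o)

    for entry in cache.values():  # end of stream: index the remainder
        flush(entry)
    return index


# Hypothetical stream: the types of ex:a arrive with a gap in between.
stream = [
    ("ex:a", RDF_TYPE, "foaf:Person", "ds:1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "ds:2"),
    ("ex:a", RDF_TYPE, "pim:Male", "ds:1"),
]
full = frozenset({"foaf:Person", "pim:Male"})
print(full in index_stream(stream, cache_size=2))
print(full in index_stream(stream, cache_size=1))
```

With a cache of two instances, ex:a is still cached when its second type arrives, so the full type cluster is found. With a cache of one, ex:a is evicted early and indexed twice with partial information. This is the loss of accuracy that a larger cache reduces.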

Slide 39 of 44SchemEX – Building an Index for LOD

Computing SchemEX: TimBL Data Set

• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~ 2k triples / sec

• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all types, TCs, EQCs
• Good precision/recall ratio at 50k+
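Comparing an index lookup with a reference result comes down to precision and recall over two sets. The following sketch uses hypothetical result sets; treating an empty set as a score of 1.0 is an assumption of this sketch, not something stated in the slides.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption of this sketch: an empty set counts as a perfect score.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical lookup results: data sources returned for one type cluster.
from_stream_index = {"ds:1", "ds:2", "ds:4"}
from_reference = {"ds:1", "ds:2", "ds:3"}
print(precision_recall(from_stream_index, from_reference))
```

Precision drops when the stream-based index returns sources the reference does not list; recall drops when it misses sources the reference contains.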

Slide 40 of 44SchemEX – Building an Index for LOD

Quality of Stream-based Index Construction

• Runtime hardly increases with window size
• Memory consumption scales with window size

Slide 41 of 44SchemEX – Building an Index for LOD

Computing SchemEX: Full BTC 2011 Data

Cache size: 50 k

Slide 42 of 44SchemEX – Building an Index for LOD

Billion Triple Challenge 2011

Slide 43 of 44SchemEX – Building an Index for LOD

Conclusions: SchemEX

• Linked Open Data (LOD) approach
  • Publishing and interlinking data on the web

• SchemEX
  • Stream-based approach to LOD schema extraction
  • Scalable to arbitrary amounts of Linked Data
  • Applicable on commodity hardware (4GB RAM, single CPU)

Slide 44 of 44SchemEX – Building an Index for LOD

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Slide 45 of 44SchemEX – Building an Index for LOD

Recommended Readings

• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x

• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX – Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716

• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001