The Digital Cavemen of Linked Lascaux

Ruben Verborgh

The Lascaux paintings are 17,300 years old.

How long will your records last?

by Banksy

by Moyan Brenn

SUSTAINABILITY

Lack of a long-term plan for SUSTAINABILITY = a threat to the Semantic Web

SUSTAINABILITY = making promises you can keep

SUSTAINABILITY = a dialog becoming a contract

SUSTAINABILITY = remaining constant under change

How can we promise to remain constant in a changing world?

Changes

Constants

Promises


Changes

Data models

Technology

Interfaces


The oldest data model is a simple table.

Tabular data: Each data item is structured as a line of field values. Fields are the same for all items; a header line can indicate their name. (Figure labels: header, row, column)

Relational model: Data are structured as tables, each of which has its own set of attributes. Records in one table can relate to others by referencing their key column. (Figure labels: table/entity, attributes, key column, relation)

Meta-markup languages: XML documents have a hierarchical structure, which gives them a tree-like appearance. Each element can have one or more children; there is exactly one root element. (Figure labels: root, parent, child, siblings)

RDF: Each fact about a data item is expressed as a triple, which connects a subject to an object through a precise relationship. This leads to graph-structured data that can take any shape. (Figure labels: subject, property, object)

Figure 1.1: Schematic comparison of the four major data models

van Hooland, S. and Verborgh, R. “Linked Data for Libraries, Archives and Museums” (Facet, 2014)

Tables do not cope well with changes in data or schema.

Title                 | Artist     | Born | Died
The Thrill is Gone    | B. B. King | 1925 | 2015
Riding with the King  | John Hiatt | 1952 |
Riding with the King  | B. B. King | 1925 |
…                     | …          | …    | …

Relational databases provide a multi-dimensional table model.


Databases cope with data changes but schema changes are harder.

Title                 | Artist
The Thrill is Gone    | 1
Riding with the King  | 2
Riding with the King  | 1
…                     | …

ID | Name       | Born | Died
1  | B. B. King | 1925 | 2015
2  | John Hiatt | 1952 |
…  | …          | …    | …

There is no interoperability with other databases.

Title                 | Artist
The Thrill is Gone    | 1
Riding with the King  | 2
Riding with the King  | 1
…                     | …

Wikipedia?

XML allows reuse of schemas and identifiers.


XML schema evolution remains a tough nut to crack.


The RDF data model is flexible for changes in data and schema.


RDF involves a trade-off between flexibility and reuse.

custom ontology: perfect match

reuse ontologies: perfect interoperability

So much for change within models… what about change between them?


There’s no ultimate model. They co-exist. Change is inherent.


Changes

Data models

Technology

Interfaces

Even if your data doesn’t change, technology does.

What happens to your data?

new software versions

new software manufacturers

Is your software holding your data hostage?

Is your software the owner of your data?

Intentional or unintentional vendor lock-in?

Or are you?

Can you get your data out at any moment you want?

The Cooper-Hewitt Design Museum had trouble getting their own data.

Data in The Museum System

flexible, but complex relational design

no export button

Website had more flexible demands

complex manual queries to liberate data

parallel CMS to drive website

Changes

Data models

Technology

Interfaces

The Web has been designed with change in mind.

Individual links are allowed to break so the entire Web does not.

—Tim Berners-Lee

The Web evolves rapidly but keeps on working.

What year is it? Then your users need…

1995 – HTML 2.0

2000 – XML

2008 – JSON

2012 – HTML 5

2015 – RDF ?

2017 – … ?

At least HTML seems constant, so the human Web is safe.

http://bib.org/books/978-1-85604-964-1/

around 2005: made in HTML 4

around 2015: made in HTML 5

Markup changes, the identifier does not.

Tim Berners-Lee called these “Cool URIs”.

Web APIs for machines suffer from changes on many levels.

http://api.bib.org/v2/viewBookDetails.php?id=978-1-85604-964-1&format=json&apikey=WSDGU56VP

How does this identifier cope with change?

How long does this identifier work unchanged?


Web APIs for machines suffer from changes on many levels.

dependency on server technology

dependency on API version

dependency on representation

dependency on API key
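The four dependencies listed above can be read directly off the API identifier. The following is a minimal sketch, assuming simple heuristics (a file extension hints at server technology, a path segment like "v2" at an API version); the function name and the heuristics are illustrative assumptions, not part of any real tool.

```python
from urllib.parse import urlsplit, parse_qs

# The API identifier from the slide, and the "cool" URI for the same book.
API_URI = ("http://api.bib.org/v2/viewBookDetails.php"
           "?id=978-1-85604-964-1&format=json&apikey=WSDGU56VP")
COOL_URI = "http://bib.org/books/978-1-85604-964-1/"


def technology_dependencies(uri):
    """List the parts of a URI that tie it to changeable technology.

    Hypothetical heuristics, chosen only for illustration.
    """
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    query = parse_qs(parts.query)
    found = {}
    for segment in segments:
        if "." in segment:
            # A file extension such as ".php" exposes the server technology.
            found["server technology"] = segment.rsplit(".", 1)[1]
        elif segment[0] == "v" and segment[1:].isdigit():
            # A segment such as "v2" pins the identifier to one API version.
            found["API version"] = segment
    if "format" in query:
        # The representation is baked into the identifier.
        found["representation"] = query["format"][0]
    if "apikey" in query:
        # The identifier only works for one registered consumer.
        found["API key"] = query["apikey"][0]
    return found


print(technology_dependencies(API_URI))
print(technology_dependencies(COOL_URI))
```

Under these assumptions, the API identifier carries all four dependencies, while the book's own URI carries none of them.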

Plenty of excuses exist to change machine interfaces.

But our new server does it faster!

But our new API has different features!

But XML is obsolete now so we need JSON!

Even funnier are the excuses for requiring API keys.

But we need to rate limit!

But we need to track automated access!

But we need to protect our data!

Once and for all: API keys do not help with these.


Your HTML interface is still open!

JSON is a convenience, not a necessity.

Anybody can still do whatever they want by scraping HTML pages with the same data.

Protect your data, not just one interface.

Yet other possible changes still appear to be a concern.

Remain constant if your server changes?

Remain constant if your API changes?

Remain constant if data models change?

Changes

Constants

Promises


Constants

URIs

Ontologies

Resources


The RDF model is driven by unique identifiers.

[Figure: a triple, in which a property (P) connects a subject (S) to an object (O)]

Constants allow clients to establish a shared meaning.


http://bib.org/authors/7356/

http://purl.org/dc/terms/creator

Human semantics are in concepts and their meaning to the world.

S: a book

O: a person

P: written by

Machine semantics are in symbols and their structural interrelations.

S: http://digybe.wpq/dgjyj-dgu7945

O: http://aole.wqq/mobd1.tihz

P: http://yudgy.jdu/DHH8DHBtkixhj

We need to be very careful about our choice of symbols.




Is this a book or a description of a book?

:printDate "2014-06-11"
:lastModified "2015-11-25"

Is this a person or a document?

:birthDate "1987-02-28"
:size "17kB"

Although designed for machines, the example only works for humans.




Because, somehow, Web APIs make machine access different.



http://api.bib.org/v2/viewAuthorProfile.php?id=7356&format=json&apikey=WSDGU56VP


That’s why it’s a problem if machines need different identifiers.


Only this triple is a global constant. The other is volatile and local.




Constants

URIs

Ontologies

Resources

Fortunately, we don’t have to pick all the constants ourselves.

Ontologies provide identifiers of concepts that are designed to be reused.

They are necessary to make RDF work.

They are necessary to create queries, especially over multiple data sources.

Of course, we get the benefits only if we actually reuse.

Why have our own my:writtenBy property when dc:creator already exists?

Maybe we have a more specific meaning?

We can still relate both properties with RDF.

But if we all use derivatives of the constants, what is the value of these constants?
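Relating a custom property to a shared one could look as follows. This is a minimal sketch, assuming triples are plain Python tuples and that the custom property lives at a hypothetical example.org address; pairing this particular book with this particular author is also an assumption for illustration. It applies the RDFS rule that a statement made with a subproperty also holds for its superproperty.

```python
# Identifiers: dc:creator and the bib.org URIs appear in the slides;
# the my:writtenBy address is hypothetical.
MY_WRITTEN_BY = "http://example.org/my#writtenBy"
DC_CREATOR = "http://purl.org/dc/terms/creator"
SUB_PROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
BOOK = "http://bib.org/books/978-1-85604-964-1/"
AUTHOR = "http://bib.org/authors/7356/"

DATA = {
    # Our more specific property is declared a subproperty of dc:creator.
    (MY_WRITTEN_BY, SUB_PROPERTY_OF, DC_CREATOR),
    # Our own data only uses the custom property.
    (BOOK, MY_WRITTEN_BY, AUTHOR),
}


def infer_super_properties(triples):
    """Apply the RDFS subproperty rule until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # Collect all (child, parent) property pairs known so far.
        parents = {(s, o) for s, p, o in inferred if p == SUB_PROPERTY_OF}
        for s, p, o in list(inferred):
            for child, parent in parents:
                if p == child and (s, parent, o) not in inferred:
                    inferred.add((s, parent, o))
                    changed = True
    return inferred


for triple in sorted(infer_super_properties(DATA)):
    print(triple)
```

A client that only knows dc:creator can then still find the author, but only if it performs this inference, which is the cost of using derivatives instead of the shared constant itself.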

Authors are not always in control: external semantic drift happens.

foaf:knows was bidirectional…

spec: “some level of reciprocity”

An foaf:knows Pete ⇒ Pete foaf:knows An

…until somebody modeled Twitter followers

Pete follows Angela Merkel ⇒ Pete knows Angela

Yet Angela doesn’t know Pete…

Getting close to Derrida… but we’re not philosophers.

There are only two hard things in Computer Science: cache invalidation and naming things.

—Phil Karlton

Constants

URIs

Ontologies

Resources

The constants you can touch are the constants you can trust.

No matter how hard technology changes, the books we describe remain the same.

Any mechanism of identification should be based on domain resources, not on inevitably changing technology.

The “success” story of the Web API community.

3 Fostering reusability through a self-descriptive bottom-up approach

Lacking better measurements, the Web API community has been heading the same quantity-over-quality course that has characterized the first years of the Linked Data initiative. An often-quoted fact in Web API papers and articles is the ever-increasing number of Web APIs (Figure 1), which is supposed to be an indicator of the ecosystem’s excellent health. However, as Linked Data researchers have become painfully aware, quantity only loosely correlates with quality or usefulness. Perhaps for Web APIs, the correlation between quantity and utility could even be negative. Few other communities would pride themselves on the existence of more than 12,000 different micro-protocols to achieve essentially the same thing: communicating between clients and servers over HTTP. Of course, each application has its own domain and domain-specific vocabulary, but does that also warrant an entirely different way of exposing this, especially when we have RDF as a uniform data model? Each different API currently requires a different client, given the lack of a uniform API description format to explain the API’s response structure and functionality. Clearly, this approach to Web APIs is a dead end.

[Chart: number of indexed Web APIs in ProgrammableWeb, 2005–2015; data labels: 186; 1,263; 2,418; 5,018; 7,182; 10,302; 12,559]

Figure 1: The increasing number of Web APIs is often named an indicator of their success, while the overgrowth of such custom micro-protocols is unnecessary, and detrimental to the development of generic Web API clients. (data: programmableweb.com)

In order for machines to use information autonomously, it has to be composed out of pieces they can recognize and interpret. The RDF model achieves this by identifying each of the triple components by reusable IRIs, which have a meaning beyond the scope that mentions them. Furthermore, the Linked Data principles mandate the use of HTTP URLs, which turn these components into affordances toward relevant information. For instance, given the following RDF triple:

<http://dbpedia.org/resource/Bill_Clinton> <http://xmlns.com/foaf/0.1/knows> <http://dbpedia.org/resource/Al_Gore>.

the knowledge of the foaf:knows predicate is sufficient for a machine to determine that this relation is symmetric, and that dbpedia:Bill_Clinton and dbpedia:Al_Gore are instances of foaf:Person, even though it might have never encountered any of those IRIs before. Furthermore, should the foaf:knows property be unfamiliar, its IRI can be dereferenced to find this information expressed in ontological predicates. Knowledge of these predicates in turn allows an interpretation of foaf:knows and hence the aforementioned derivation. We herein recognize two characteristics in particular:

• The information is structured in a bottom-up way: machines interpret a larger unit of information through its pieces instead of interpreting the pieces through the whole (while humans are capable of doing both simultaneously).

• Each piece in the unit is self-descriptive: anything needed to interpret a piece is contained within itself, with its IRI acting as both an identifier and a direct handle towards additional interpretation mechanisms. No external resource is required beforehand, given the knowledge of a limited set of basic concepts.

This sharply contrasts with current practice for Web APIs. Machines are assumed to interpret each API operation in its entirety, as such smaller pieces do not exist, and API descriptions, if present, are external documents that must be collected and interpreted before consumption is possible. While this does not imply the inviability of such an approach, it raises serious doubt as to whether that is the most effective strategy towards automated Web API consumption by generic clients.

Just imagine we had 15,000 different data models.


Find resources in your domain and assign them an identifier.



It’s just like building a web site. When a user comes, serve HTML.

User → GET → HTML

It’s just like building a web site. When a client comes, serve JSON.

Client → GET → JSON

It’s just like building a web site. When a client comes, serve RDF.

Client → GET → RDF

Content negotiation has existed in HTTP for a long time.

Client → GET (resource) → RDF (representation)
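How a server might pick a representation for one constant URI can be sketched as follows. This is a simplified illustration of HTTP content negotiation on the Accept header, assuming a hypothetical list of available media types and falling back to a default where a stricter server might answer 406 Not Acceptable; it is not a complete implementation of the HTTP rules.

```python
# Hypothetical set of representations the server can produce for a resource.
AVAILABLE = ["text/html", "application/json", "text/turtle"]


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending preference."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # Default quality when no q parameter is given.
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header, available=AVAILABLE, default="text/html"):
    """Choose the representation to serve for the same, unchanged URI."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means "not acceptable".
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/" for "text/*"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return default  # Simplification: serve the default if nothing matches.


print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
print(negotiate("text/turtle"))
print(negotiate("application/json;q=0.5, text/turtle;q=0.9"))
```

Adding a future format then means adding one entry to the list of representations; the identifier of the resource does not change.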

This allows constant URIs even with future changes.

Client → GET → RDF 2.0

It enables different users and machines to talk about things.

The best API is no API. Your website is already an API.

Developers like to build complicated APIs.

API keys are especially cool to build.

Every feature and change comes with a high cost.

If you ask for an API, you’ll get one.

Ask for new representations of your resources instead.

Changes

Constants

Promises


Promises

Web Data

Integration

Scalability


The Semantic Web promised data on the Web.

LODStats: 85,567,007,302 triples from 3,426 datasets

LOD Laundromat: 38,606,408,765 triples from 657,896 entries

How much of this data can we readily access?

data dumps

Linked Data documents

SPARQL endpoints

A data dump means downloading everything and querying locally.

When was the last time you downloaded the full Wikipedia just because you had one question?

Dumps are not Web querying. It’s kind of like giving up.

Semantic Web ⇒ Semantic Basement?

What advantage do we have compared to Big Data?

Still the RDF data model…

But the major difference is Web.

Linked Data documents allow you to traverse a dataset.

That’s similar to what we also do: consume information on Wikipedia by following links.

Much Linked Data is available using the well-known principles.

Servers publish a light-weight interface.

Clients follow their nose to retrieve information.

Linked Data documents allow query evaluation on the Web.

# Other books by the same author
SELECT DISTINCT ?book WHERE {
  books:85604 dc:creator ?author.
  ?book dc:creator ?author.
}

Some queries are hard or impossible to evaluate.

# Books about Hamburg
SELECT DISTINCT ?book ?author WHERE {
  ?book dc:subject dbpedia:Hamburg.
  ?book dc:creator ?author.
}

SPARQL endpoints allow you to ask any question you want.

When was the last time you expected Wikipedia to answer specific questions automatically for you?

A public SPARQL endpoint happily answers this query.


A public SPARQL endpoint also happily answers this query.


A public SPARQL endpoint also happily answers this query…

SELECT DISTINCT ?drug ?drug1 ?drug2 ?drug3 ?drug4 ?d1 WHERE {
  ?drug1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drugCategory> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugcategory/antibiotics> .
  ?drug2 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drugCategory> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugcategory/antiviralAgents> .
  ?drug3 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drugCategory> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugcategory/antihypertensiveAgents> .
  ?drug4 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drugCategory> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugcategory/anti-bacterialAgents> .
  ?drug1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/target> ?o1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/genbankIdGene> ?g1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/locus> ?l1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeight> ?mw1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/hprdId> ?hp1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/swissprotName> ?sn1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/proteinSequence> ?ps1 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/generalReference> ?gr1 .
  ?drug <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/target> ?o1 .
  ?drug2 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/target> ?o2 .
  ?o1 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/genbankIdGene> ?g2 .
  ?o2 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/locus> ?l2 .
  ?o2 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/molecularWeight> ?mw2 .
  ?o2 <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/hprdId> ?hp2 .

There’s a price to pay for being the most expressive HTTP interface.

The majority of public SPARQL endpoints have less than 95% uptime.

This means they are unavailable for more than 1.5 days each month.

This means we cannot rely on them to build Linked Data applications.

Buil-Aranda, Hogan, Umbrich, Vandenbussche: “SPARQL Web-Querying Infrastructure: Ready for Action?”

Promises

Web Data

Integration

Scalability

The main promise of Linked Data is integration, preserving semantics.


Integration is the promise. But does it work on the Web?

data dumps

Linked Data documents

SPARQL endpoints

With data dumps, we just build a bigger basement.

How far do we go?

How do we keep data up to date?

With Linked Data documents, we keep on following our nose.

There are no dataset boundaries.

Some queries will remain hard.

With public SPARQL endpoints, problems become worse.

1 endpoint has 95% availability.

1.5 days down each month

2 endpoints have 90% availability.

3 days down each month

3 endpoints have 85% availability.

4.5 days down each month
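The arithmetic behind these figures can be checked with a short sketch. It assumes that a federated query needs every endpoint at once, that endpoints fail independently, and that a month has 30 days; the function names are illustrative. Under those assumptions the combined availability is the product of the individual ones, which the rounded slide figures slightly understate.

```python
def combined_availability(single, endpoints):
    """Probability that all endpoints are up at once (independent failures)."""
    return single ** endpoints


def days_down_per_month(availability, days_in_month=30):
    """Expected unavailable days in a month of the given length."""
    return (1 - availability) * days_in_month


for n in (1, 2, 3):
    availability = combined_availability(0.95, n)
    print(f"{n} endpoint(s): {availability:.2%} available, "
          f"{days_down_per_month(availability):.2f} days down each month")
```

With independent failures, two endpoints give about 90.25% and three about 85.74%, so the slide's 90% and 85% (3 and 4.5 days) are the slightly pessimistic linear approximation of the same trend.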

Promises

Web Data

Integration

Scalability

Can we think differently about Linked Data on the Web?

data dump | Linked Data documents | SPARQL endpoint

data dump end ↔ SPARQL endpoint end
high client cost ↔ low client cost
low server cost ↔ high server cost
high availability ↔ low availability
high bandwidth ↔ low bandwidth
out-of-date data ↔ live data

[Figure: the same axis of data dump, Linked Data documents, and SPARQL endpoint, with question marks marking unexplored positions]

Let us combine the lessons on changes, constants, and promises.

An interface that withstands change,

simple enough so it doesn’t break

complex enough to query.

Let us combine the lessons on changes, constants, and promises.

Data dumps contain too much.

SPARQL endpoint results are too specific.

Linked Data documents are unidirectional.

Each interface divides a dataset into Linked Data Fragments.

Data dumps: 1 huge fragment

SPARQL endpoints: ∞ specific fragments

Linked Data: 1 fragment per subject

Can we find a new interface with a sustainable balance?

Triple Pattern Fragments: 1 fragment per subject / predicate / object

Browse a dataset by triple pattern—no less, no more.

Machines can access the exact same interface as RDF.

Triple Pattern Fragments extend Linked Data documents with forms.

That’s even more similar to what we do: consume information on Wikipedia by following links and using forms.

Machines solve complex queries by breaking them down.
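Breaking a query down into triple patterns could be sketched as follows. This is a deliberately simplified illustration: the server is an in-memory function that answers one triple pattern at a time, the dataset is hypothetical apart from books:85604, and paging, metadata, and pattern ordering are left out. It is not the algorithm of any particular Triple Pattern Fragments implementation.

```python
# Hypothetical dataset; only books:85604 appears in the slides.
DATASET = [
    ("books:85604", "dc:creator", "authors:7356"),
    ("books:85604", "dc:subject", "dbpedia:Hamburg"),
    ("books:90210", "dc:creator", "authors:7356"),
    ("books:11111", "dc:creator", "authors:42"),
]


def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def fragment(pattern):
    """Server side: return all triples matching a single triple pattern."""
    return [triple for triple in DATASET
            if all(is_var(p) or p == value
                   for p, value in zip(pattern, triple))]


def solve(patterns, binding=None):
    """Client side: answer several patterns through simple fragment lookups."""
    binding = binding or {}
    if not patterns:
        return [binding]
    first, rest = patterns[0], patterns[1:]
    # Fill in the variables that earlier patterns have already bound.
    bound = tuple(binding.get(term, term) for term in first)
    results = []
    for triple in fragment(bound):
        extended = dict(binding)
        extended.update({term: value
                         for term, value in zip(bound, triple)
                         if is_var(term)})
        results.extend(solve(rest, extended))
    return results


# "Books about Hamburg", the query that plain Linked Data documents struggle with.
print(solve([("?book", "dc:subject", "dbpedia:Hamburg"),
             ("?book", "dc:creator", "?author")]))
```

The server only ever performs single-pattern lookups, while the joining work happens on the client, which is where the extra requests and slower querying come from.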




Promises can be kept, because the interface is intelligently light.

Publishing Linked Data that can be queried on the Web is realistic because the workload is divided.

The server doesn’t even need a triplestore.

Since the client is in charge, querying multiple sources is easy.

Promises are negotiated contracts so they always involve trade-offs.

Querying will be slower.

clients send many requests to answer a query

Query times are more consistent.

0.3 secs with a SPARQL endpoint… 95% of time

3 secs with Triple Pattern Fragments… 99.9% of time

Experiment with more complex interfaces.

Make your Linked Data queryable on the Web.

Several open-source implementations: linkeddatafragments.org/software/

Query one or multiple sources online: client.linkeddatafragments.org

Example: bit.ly/harvard-hamburg

Changes

Constants

Promises


Identify the constants, separate them from changes.

Satisfy Linked Data needs with promises you can keep.

Simple enough to be usable,

complex enough to be useful.

Sustainability means promising the simplest useful complexity.

@RubenVerborgh ruben.verborgh.org
