Building a knowledge graph of the Belgian War Press

Let’s talk Linked Data session, Open Belgium 2017
Brecht Van de Vyvere | @brechtvdv


Can I easily link historical newspapers with other data sources?

Agenda

• hetarchief.be
• Knowledge graph
• 5-star Open Data plan
• Adding context
• Linked Data as a Service
• Future work

Dataset

hetarchief.be: “News from the Great War”
• Newspapers 1914 - 1918
• 10+ content partners
• Early 2015: site launched
• Functionality
  • Search by keyword
  • Map with place of publication
  • Collections

1k titles

55k newspapers

300k pages

Human-readable interface

Policy
1. Metadata
• No restrictions → CC0
2. OCR, documents
• Pictures, short stories…
• Uncertain copyright status
• No license or “terms of use” that minimises restrictions for re-use
• Disclaimer

hetarchief.be
• One of the biggest databases online
• No raw data?
• Title
• Description → OCR from ALTO
• Date created
• Owner
• IDs (carrier, Abraham, VIAA)
• URL image


5-star Open Data plan

First 3 stars
• Open license
• Structured
• Non-proprietary

[Diagram labels: VIAA DB, VIAA API, NodeJS; IDs, metadata, CSV; transform]

→ http://github.com/viaacode/hetarchief2lod

Step 4: URIs for everything
• Map VIAA's internal ID to a URI:
  • http://data.viaa.be/noid/{id}
• Use ontologies
  • BBC → Creative Work Ontology
  • schema.org
  • Hydra → collections
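The ID-to-URI mapping above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `id_to_uri` and `uri_to_id` are hypothetical, and percent-encoding is added on the assumption that some internal IDs might contain characters that are unsafe in a URI path. The actual NodeJS code lives in the hetarchief2lod repository and may differ.

```python
from urllib.parse import quote, unquote

# URI pattern taken from the slide: http://data.viaa.be/noid/{id}
BASE = "http://data.viaa.be/noid/"


def id_to_uri(internal_id: str) -> str:
    """Map an internal ID to a URI in the noid namespace (hypothetical helper)."""
    if not internal_id:
        raise ValueError("internal ID must not be empty")
    # Assumption: IDs may contain characters that need percent-encoding.
    return BASE + quote(internal_id, safe="")


def uri_to_id(uri: str) -> str:
    """Recover the internal ID from a URI minted by id_to_uri."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a noid URI: {uri}")
    return unquote(uri[len(BASE):])


if __name__ == "__main__":
    print(id_to_uri("tm71v5c76q_191510XX"))
```

Keeping the mapping reversible means the same ID can later be looked up when the URI is dereferenced.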

Knowledge graph
• Semantic network
  • Concepts
  • Relations
• Linked Data
  • URIs
  • RDF

<http://data.viaa.be/noid/example> <http://www.bbc.co.uk/ontologies/creativework#tag> <http://dbpedia.org/resource/Albert_I_of_Belgium> .

<http://dbpedia.org/resource/Albert_I_of_Belgium> rdf:type <http://xmlns.com/foaf/0.1/Person> .

5-star: link to other sources
• ABRAHAM: catalogue of newspapers in Belgium

<http://data.viaa.be/noid/tm71v5c76q_191510XX> owl:sameAs <http://anet.be/record/abraham/opacbnc/c:bnc:26> .

Basic information triples

Subject: http://data.viaa.be/noid/tm71v5c76q_191510XX
• rdf:type → cwork:CreativeWork
• cwork:title → “L’illustration”
• cwork:dateCreated → “1915-10-XX”
• cwork:content → “On dit que c'est notre imagination et….”
• schema:copyrightHolder → UGENT
• schema:inLanguage → en
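As a rough sketch of how the basic information triples could be produced from the CSV export, the Python below streams rows and emits N-Triples one line at a time, so the whole dataset never has to sit in memory. The column names (`id`, `title`, `date_created`, `owner`, `language`) are hypothetical, and the cwork namespace is copied from the URIs used on these slides. The real transform is written in NodeJS and may work differently.

```python
import csv
from typing import Dict, Iterable, Iterator

BASE = "http://data.viaa.be/noid/"
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"  # as written on the slides
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical CSV column names mapped to the predicates shown above.
COLUMN_TO_PREDICATE = {
    "title": CWORK + "title",
    "date_created": CWORK + "dateCreated",
    "owner": SCHEMA + "copyrightHolder",
    "language": SCHEMA + "inLanguage",
}


def literal(value: str) -> str:
    """Escape a string as a plain N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_triples(row: Dict[str, str]) -> Iterator[str]:
    """Yield one N-Triples line per non-empty metadata field of a row."""
    subject = f"<{BASE}{row['id']}>"
    yield f"{subject} <{RDF_TYPE}> <{CWORK}CreativeWork> ."
    for column, predicate in COLUMN_TO_PREDICATE.items():
        value = row.get(column)
        if value:
            yield f"{subject} <{predicate}> {literal(value)} ."


def csv_to_ntriples(lines: Iterable[str]) -> Iterator[str]:
    """Stream CSV lines (with a header row) into N-Triples lines."""
    for row in csv.DictReader(lines):
        yield from row_to_triples(row)
```

Usage would be along the lines of opening the CSV file and writing each yielded line straight to an output file.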

[Diagram: Hydra paging between http://data.viaa.be/noid/tm71v5c76q and http://data.viaa.be/noid/tm71v5c76q_191804xx_0003; labels: memberOf, totalItems (3), first, last, previous/next]

Problems
• Node limited to 1.7 GB memory
• OCR too big
• Turtle file: 475 MB max (32k newspapers)
• Compressed to HDT: 388 MB
• Basic triples with HDT:
  • 54k newspapers → 8.2 MB

Adding context

Connect with other data sources

• Cfr. Europeana, delpher.nl, lab.kbresearch.nl

Stanford NER
• 4 types: Location, Organisation, Person and Other
• Train your model: golden corpus
• Write code that fits your needs
  • SPARQL query that matches strings
  • REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique
• Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)

DBpedia Spotlight
• Proof of concept
• Models for all languages (nl, en, fr, de)

[Diagram labels: NL/FR/EN/DE trained model, DBpedia matcher, Stanford NER]

Results?
• Filter on OCR quality, e.g. <90% confidence in ALTO
• Wrong time period, e.g. GeoNames
• Standard models; they should be trained
• Use DBpedia knowledge later to filter “impossible” tags
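The OCR quality filter could look roughly like the Python sketch below. It assumes the quality measure is the mean of the per-word confidence that ALTO stores in the `WC` attribute of `String` elements (a value between 0 and 1); the slides do not say whether the 90% threshold was applied per word, per page, or otherwise, so this is one possible reading. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def mean_word_confidence(alto_xml: str) -> Optional[float]:
    """Return the mean WC value over all ALTO String elements, or None if absent."""
    root = ET.fromstring(alto_xml)
    scores = []
    for element in root.iter():
        # Compare on the local name so any ALTO namespace version matches.
        if element.tag.rsplit("}", 1)[-1] == "String":
            wc = element.get("WC")
            if wc is not None:
                scores.append(float(wc))
    return sum(scores) / len(scores) if scores else None


def passes_quality_filter(alto_xml: str, threshold: float = 0.90) -> bool:
    """Keep a page only if its mean word confidence reaches the threshold."""
    score = mean_word_confidence(alto_xml)
    return score is not None and score >= threshold
```

Pages without any `WC` values are rejected here; treating them as "unknown" and keeping them would be an equally defensible choice.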

DBpedia Spotlight
• Running your own endpoint is easy:
  java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest
• Or with Docker:
  docker build -f Dockerfile -t dutch_spotlight .
  docker run -i -p 2223:80 dutch_spotlight spotlight.sh
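Once a local endpoint like the one above is running, it can be queried over HTTP. The Python sketch below assumes the standard DBpedia Spotlight `annotate` resource under the base URL from the slide, with `text` and `confidence` query parameters and a JSON reply that lists matches under `Resources` with `@surfaceForm` and `@URI` keys. The helper names are hypothetical, and the confidence value of 0.5 is only a placeholder.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Tuple


def build_annotate_request(base_url: str, text: str,
                           confidence: float = 0.5) -> urllib.request.Request:
    """Build a GET request for the Spotlight annotate resource."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/annotate?{query}",
        headers={"Accept": "application/json"},
    )


def extract_entities(response: dict) -> List[Tuple[str, str]]:
    """Return (surface form, DBpedia URI) pairs from a Spotlight JSON reply."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]


def annotate(base_url: str, text: str, confidence: float = 0.5) -> List[Tuple[str, str]]:
    """Send the text to a running endpoint and return the recognised entities."""
    with urllib.request.urlopen(build_annotate_request(base_url, text, confidence)) as reply:
        return extract_entities(json.load(reply))


if __name__ == "__main__":
    # Requires the local endpoint from the slide to be running.
    print(annotate("http://localhost:2223/nl/rest", "Koning Albert bezocht het front."))
```

For full OCR pages a POST request would probably be safer than GET, since long texts can exceed URL length limits.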

Linked Data as a Service
• Allow federated queries
• Low server cost
• Be reliable
• Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web

Linked Data Fragments querying
• VIAA is part of the family!
  • http://data.viaa.be/ldf
  • https://query.wikidata.org/bigdata/ldf
  • http://data.linkeddatafragments.org/linkedgeodata
  • http://data.linkeddatafragments.org/dbpedia2014

[Diagram: your browser runs the client-side algorithm and GETs fragments from these interfaces]
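To make "GET fragments" concrete: a client asks the server for one triple pattern at a time and does the joins itself. The Python sketch below only builds the fragment URL. It assumes the interface accepts `subject`, `predicate` and `object` query parameters, which is the common convention; a real client reads the parameter names from the hypermedia controls in each response. The function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def fragment_url(interface: str,
                 subject: Optional[str] = None,
                 predicate: Optional[str] = None,
                 obj: Optional[str] = None) -> str:
    """Build the URL of the fragment matching one triple pattern.

    Components left as None act as variables in the pattern.
    """
    pattern = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pattern if value is not None}
    return f"{interface}?{urlencode(params)}" if params else interface


if __name__ == "__main__":
    # All triples with the title predicate used in the demo queries.
    print(fragment_url(
        "http://data.viaa.be/ldf",
        predicate="http://www.bbc.co.uk/ontologies/creativework#title",
    ))
```

Because every request is this simple, the server stays cheap to run and responses cache well, while the heavier query work moves to the client.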

Demo time!

Demo

• Retrieve all newspaper titles:

SELECT DISTINCT ?title
WHERE {
  ?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title
}

Demo
• Retrieve more info from the corresponding DBpedia URI:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?comment
WHERE {
  <http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?db owl:sameAs ?tag .
  ?db rdfs:label ?label .
  ?db rdfs:comment ?comment
}

Battle of the Somme
• Pages that mention military leaders from the Battle of the Somme, plus a thumbnail:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?paper ?o ?thumbnail
WHERE {
  <http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o .
  ?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag .
  ?o owl:sameAs ?ctag .
  ?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail .
}

Front painters
• Semi-automatic generation of collections, e.g. about front painters:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?newspaper ?artist ?tag ?hetarchief
WHERE {
  ?artist dct:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> .
  ?artist owl:sameAs ?tag .
  ?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief
}

Conclusion

• Extra search method for our researchers
• NER versus OCR: enhanced findability
• Adding extra information (cfr. Abraham) requires effort; we need more TPF interfaces

Future work
• Dereferenceable URIs
  • http://data.viaa.be/noid/{id}
• Content negotiation
  • HTML
  • JSON
  • RDF
• Save location with OLR
• Suggestions are welcome!

http://data.viaa.be/noid/
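The content negotiation item above could be prototyped as below. This is a simplified Python sketch, not a planned implementation: it assumes the three representations map to `text/html`, `application/json` and `text/turtle` (the slide only says "RDF"), orders the `Accept` header by q-value alone without the full specificity rules, and falls back to HTML where a strict server might answer 406. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed media types for the HTML, JSON and RDF representations.
SUPPORTED = ("text/html", "application/json", "text/turtle")


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media type, q) pairs from an Accept header, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_representation(accept: str,
                          supported: Sequence[str] = SUPPORTED,
                          default: str = "text/html") -> str:
    """Pick the media type to serve for a given Accept header."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
    return default
```

With this in place, a browser requesting http://data.viaa.be/noid/{id} would get HTML, while a Linked Data client sending `Accept: text/turtle` would get RDF from the same URI.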

Q&A

Brecht Van de Vyvere | @brechtvdv

Data & Analytics
