Let’s talk Linked Data session, Open Belgium 2017
Brecht Van de Vyvere | @brechtvdv
Building a knowledge graph of the Belgium War Press
Agenda
• hetarchief.be
• Knowledge graph
• 5-star Open Data plan
• Adding context
• Linked Data as a Service
• Future Work
hetarchief.be
“News from the Great War”
• Newspapers 1914 - 1918
• 10+ Content Partners
• Early 2015: site launched
• Functionality
  • Search by keyword
  • Map with place of publication
  • Collections
1k titles
55k newspapers
300k pages
Policy
1. Metadata
• No restrictions → CC0
2. OCR, documents
• Pictures, short stories…
• Uncertain copyright status
• No license or “terms of use” that minimises restrictions for re-use
• Disclaimer
hetarchief.be
• One of the biggest databases online
• No raw data?
• Title
• Description → OCR from ALTO
• Date created
• Owner
• IDs (carrier, Abraham, VIAA)
• URL image
First 3 Stars
• Open License
• Structured
• Non-proprietary
[Diagram: VIAA DB, VIAA API and NodeJS transform pipeline; labels: IDs, Metadata, CSV, Transform]
→ github.com/viaacode/hetarchief2lod
Step 4: URIs for everything
• Map VIAA's internal ID to URI:
  • http://data.viaa.be/noid/{id}
• Use ontologies
  • BBC → Creative Work Ontology
  • schema.org
  • Hydra → collections
<http://data.viaa.be/noid/example> <http://www.bbc.co.uk/ontologies/creativework#tag> <http://dbpedia.org/page/Albert_I_of_Belgium>
<http://dbpedia.org/page/Albert_I_of_Belgium> rdf:type <http://xmlns.com/foaf/0.1/Person>
5-star: link to other sources
• ABRAHAM: catalogue of newspapers in Belgium
<http://data.viaa.be/noid/tm71v5c76q_191510XX> owl:sameAs <http://anet.be/record/abraham/opacbnc/c:bnc:26>
Basic information triples
http://data.viaa.be/noid/tm71v5c76q_191510XX
  rdf:type → cwork:CreativeWork
  cwork:title → L’illustration
  cwork:dateCreated → “1915-10-XX”
  cwork:content → On dit que c'est notre imagination et….
  schema:copyrightHolder → UGENT
  schema:inLanguage → en
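As a rough illustration of the CSV transform step, the sketch below turns one metadata row into N-Triples using the URI pattern and the vocabularies named on the slides. The column names (`id`, `title`, `date_created`, `owner`) and the choice of property for each column are assumptions made for this example. The actual transform is the NodeJS code in the hetarchief2lod repository, which is not reproduced here.

```python
import csv
import io

BASE = "http://data.viaa.be/noid/"
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Escape a string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_ntriples(row):
    """Map one CSV row (hypothetical column names) to N-Triples lines."""
    subject = f"<{BASE}{row['id']}>"
    return [
        f"{subject} <{RDF_TYPE}> <{CWORK}CreativeWork> .",
        f"{subject} <{CWORK}title> {literal(row['title'])} .",
        f"{subject} <{CWORK}dateCreated> {literal(row['date_created'])} .",
        f"{subject} <{SCHEMA}copyrightHolder> {literal(row['owner'])} .",
    ]


def csv_to_ntriples(text):
    """Convert CSV text to a list of N-Triples lines, row by row."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        lines.extend(row_to_ntriples(row))
    return lines
```

Emitting one line per triple keeps memory use flat, since each row can be written out before the next one is read.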
[Diagram: Hydra collection http://data.viaa.be/noid/tm71v5c76q with totalItems 3. Its pages are linked to it by memberOf and to each other by first, last, previous and next:]
http://data.viaa.be/noid/tm71v5c76q_191804xx_0001
http://data.viaa.be/noid/tm71v5c76q_191804xx_0002
http://data.viaa.be/noid/tm71v5c76q_191804xx_0003
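The navigation links in the collection diagram can be derived mechanically from an ordered list of page IDs. The sketch below is one way to do that, not the project's code. The dictionary keys copy the labels on the slide; how they map onto actual Hydra terms, and the assumption that the list order is the page order, are illustrative choices.

```python
BASE = "http://data.viaa.be/noid/"


def page_links(newspaper_id, page_ids):
    """Return the collection description and per-page navigation links."""
    uris = [BASE + page_id for page_id in page_ids]
    collection = {
        "id": BASE + newspaper_id,
        "totalItems": len(uris),
        "first": uris[0] if uris else None,
        "last": uris[-1] if uris else None,
    }
    pages = []
    for index, uri in enumerate(uris):
        pages.append({
            "id": uri,
            "memberOf": collection["id"],
            # The first page has no previous page, the last page no next page.
            "previous": uris[index - 1] if index > 0 else None,
            "next": uris[index + 1] if index + 1 < len(uris) else None,
        })
    return collection, pages
```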
Problems
• Node limited to 1.7 GB memory
• OCR too big
• Turtle file: 475 MB max (32k newspapers)
• Compressed to HDT: 388 MB
• Basic triples with HDT:
  • 54k newspapers → 8.2 MB
Stanford NER
• 4 types: Location, Organisation, Person and Other
• Train your model: golden corpus
• Write code that fits your needs
• SPARQL query that matches strings
• REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique
• Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)
DBpedia Spotlight
• Proof of concept
• Models for all languages (nl, en, fr, de)
[Diagram labels: NL/FR/EN/DE trained model; DBpedia matcher; Stanford NER]
Results?
• Filter on OCR quality; e.g. <90% confidence in ALTO
• Wrong time period, e.g. GeoNames
• Standard models; should be trained
• Use DBpedia knowledge later to filter “impossible” tags
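The OCR-quality filter could look roughly like the sketch below. It assumes the ALTO files record a per-word confidence in the `WC` attribute of `String` elements, on a 0 to 1 scale, and that low-confidence words are simply dropped before entity recognition. Whether the project filtered per word or per page is not stated on the slides.

```python
import xml.etree.ElementTree as ET


def confident_words(alto_xml, threshold=0.9):
    """Return CONTENT of ALTO String elements whose WC is at least threshold."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare local names so any ALTO namespace version is accepted.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        confidence = element.get("WC")
        if confidence is None:
            continue  # no confidence recorded: skip rather than guess
        if float(confidence) >= threshold:
            words.append(element.get("CONTENT", ""))
    return words
```

Passing only the surviving words to the tagger trades recall for fewer tags built on misread text.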
DBpedia Spotlight
• Running your own endpoint is easy:
  • java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest
• Or with Docker:
  • docker build -f Dockerfile -t dutch_spotlight .
  • docker run -i -p 2223:80 dutch_spotlight spotlight.sh
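Once an endpoint is running, text can be sent to its `/annotate` service. The sketch below assumes the local endpoint URL from the `java` command above and the JSON reply shape of the public Spotlight web service (a `Resources` list whose entries carry `@URI` and `@surfaceForm`). A self-hosted 0.7.1 build may differ, so treat the field names as assumptions to verify against your own server.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:2223/nl/rest"  # endpoint from the java command


def build_annotate_request(text, confidence=0.5, endpoint=ENDPOINT):
    """Build a POST request for Spotlight's /annotate service."""
    body = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        endpoint + "/annotate",
        data=body.encode("utf-8"),
        headers={"Accept": "application/json"},
    )


def extract_tags(response_json):
    """Pull (surface form, DBpedia URI) pairs out of a Spotlight JSON reply."""
    resources = json.loads(response_json).get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]


def annotate(text, confidence=0.5):
    """Send text to the local endpoint; requires a running Spotlight server."""
    request = build_annotate_request(text, confidence)
    with urllib.request.urlopen(request) as response:
        return extract_tags(response.read().decode("utf-8"))
```

Raising `confidence` returns fewer but more reliable tags, which matters when the input is noisy OCR.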
Linked Data as a Service
• Allow federated queries
• Low server cost
• Be reliable
• Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web
Linked Data Fragments querying
• VIAA is part of the family!
[Diagram: your browser runs the client-side algorithm and GETs fragments from:]
http://data.viaa.be/ldf
https://query.wikidata.org/bigdata/ldf
http://data.linkeddatafragments.org/linkedgeodata
http://data.linkeddatafragments.org/dbpedia2014
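Each "GET fragments" step is an ordinary HTTP request for one triple pattern. The sketch below builds such a URL using the `subject`, `predicate` and `object` query parameters that the reference Linked Data Fragments server exposes. That naming is an assumption here: a well-behaved client reads the parameter names from the hypermedia controls in each response instead of hard-coding them.

```python
import urllib.parse


def fragment_url(endpoint, subject=None, predicate=None, obj=None):
    """Build a Triple Pattern Fragment URL; None means 'variable'."""
    params = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    if not params:
        return endpoint  # the fragment matching every triple
    return endpoint + "?" + urllib.parse.urlencode(params)
```

For example, `fragment_url("http://data.viaa.be/ldf", predicate="http://www.bbc.co.uk/ontologies/creativework#title")` asks for every triple with the title predicate.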
Demo
• Retrieve all newspaper titles:
SELECT DISTINCT ?title
WHERE {
  ?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title
}
Demo
• Retrieve more info from corresponding DBpedia URI:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?comment
WHERE {
  <http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?db owl:sameAs ?tag .
  ?db rdfs:label ?label .
  ?db rdfs:comment ?comment
}
Battle of the Somme
• Pages that mention military leaders from the Battle of the Somme, plus a thumbnail:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?paper ?o ?thumbnail
WHERE {
  <http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o .
  ?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag .
  ?o owl:sameAs ?ctag .
  ?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail .
}
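A multi-pattern query like this one is joined on the client, one triple pattern at a time. The sketch below is a deliberately simplified version of that idea, not the actual client algorithm: it takes the patterns in the order written, whereas a real Triple Pattern Fragments client reorders them using the count estimates in each fragment. `fetch` is a hypothetical stand-in for the HTTP request to a fragments interface.

```python
def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate(patterns, fetch, bindings=None):
    """Yield solution mappings for a list of triple patterns.

    fetch(pattern) must return the triples matching a pattern in which
    already-bound variables have been replaced by their values.
    """
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    bound = tuple(bindings.get(term, term) for term in first)
    for triple in fetch(bound):
        extended = match(first, triple, bindings)
        if extended is not None:
            yield from evaluate(rest, fetch, extended)
```

Substituting known bindings before each fetch is what keeps the requests selective: the second pattern is only requested for commanders actually found by the first.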
Frontpainters
• Semi-automatic generation of collections, e.g. about front painters (war artists):
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?newspaper ?artist ?tag ?hetarchief
WHERE {
  ?artist dc:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> .
  ?artist owl:sameAs ?tag .
  ?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
  ?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief
}
Conclusion
• Extra search method for our researchers
• NER versus OCR: enhanced findability
• Adding extra information (cfr. Abraham) requires effort; we need more TPF interfaces
Future work
• Dereferenceable URIs
  • http://data.viaa.be/noid/{id}
• Content negotiation
  • HTML
  • JSON
  • RDF
• Save location with OLR
• Suggestions are welcome!
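Content negotiation for the planned dereferenceable URIs could be sketched as below: the server inspects the `Accept` header and picks HTML, JSON or an RDF serialisation. Using `text/turtle` for "RDF" and defaulting to HTML for `*/*` are assumptions of this example, not decisions stated on the slides.

```python
SUPPORTED = ["text/html", "application/json", "text/turtle"]


def parse_accept(header):
    """Parse an Accept header into (media type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((media_type, quality))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header, supported=SUPPORTED):
    """Pick the first supported media type the client accepts, else None."""
    for media_type, quality in parse_accept(header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return supported[0]
        if media_type in supported:
            return media_type
        if media_type.endswith("/*"):
            prefix = media_type[:-1]
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
    return None
```

A `None` result would correspond to answering with HTTP 406 Not Acceptable, or to falling back to a default representation.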