Aidan's PhD Viva

Digital Enterprise Research Institute www.deri.ie

Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora

Aidan Hogan

PhD Viva

Cold Open

Figure 1: Web of Data

explicit data

implicit data

Topic of thesis:

How can consumers tap into the implicit data?

PRELUDE

The Area…

The Problem…

The Hypothesis…

The Area…

…Linked Data / Linking Open Data

Bottom-up Approach to Semantic Web

Individual Publishers should:

1. Use URIs to name things (not just documents)

2. Use HTTP URIs that can be looked up

3. Return information in a common structured data model (RDF)

4. Use external URIs in your data so as to link to related data

…the micro… Linked Data Principles

…the macro… A Web of Data

Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch

September 2010

August 2007

November 2007

February 2008

March 2008

September 2008

March 2009

July 2009

…so what’s The Problem?…

…heterogeneity

Take Query Answering…

SPARQL endpoints over Web data such as YARS2, Virtuoso, FactForge, etc.

Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson, etc.

Take Query Answering…

Gimme webpages relating to

Tim Berners-Lee

foaf:page

timbl:i

timbl:i foaf:page ?pages .

Heterogeneity in terminology…

webpage: properties

foaf:page

foaf:homepage

foaf:isPrimaryTopicOf

foaf:weblog

doap:homepage

foaf:topic

foaf:primaryTopic

mo:musicBrainz

mo:myspace

= rdfs:subPropertyOf

= owl:inverseOf

Linked Data, RDFS and OWL: Linked Vocabularies

…Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman

Heterogeneity in naming…

Tim Berners-Lee: URIs

timbl:i

dblp:100007

identica:45563

adv:timbl

fb:en.tim_berners-lee

db:Tim-Berners_Lee

= owl:sameAs

Returning to our Query…

Gimme webpages relating to

Tim Berners-Lee

foaf:page

timbl:i foaf:page ?pages .

... 7 x 6 = 42 possible patterns

7 properties: foaf:page, foaf:homepage, foaf:isPrimaryTopicOf, doap:homepage, foaf:topic, foaf:primaryTopic, mo:myspace

6 identifiers: timbl:i, dblp:100007, identica:45563, adv:timbl, fb:en.tim_berners-lee, db:Tim-Berners_Lee
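To make the 7 x 6 count concrete, here is a minimal sketch that enumerates one triple pattern per (property, identifier) pair. The property and identifier lists are copied from the slides; treating foaf:topic and foaf:primaryTopic as pointing from the page to the person follows the owl:inverseOf legend, and the helper name `patterns` is purely illustrative.

```python
from itertools import product

# Identifiers for Tim Berners-Lee, as listed on the slide.
URIS = ["timbl:i", "dblp:100007", "identica:45563",
        "adv:timbl", "fb:en.tim_berners-lee", "db:Tim-Berners_Lee"]

# Webpage-like properties from the slide. The flag marks properties that
# point from the page to the person (the inverse direction).
PROPERTIES = [("foaf:page", False), ("foaf:homepage", False),
              ("foaf:isPrimaryTopicOf", False), ("doap:homepage", False),
              ("foaf:topic", True), ("foaf:primaryTopic", True),
              ("mo:myspace", False)]

def patterns():
    """Yield one triple pattern per (property, identifier) combination."""
    for (prop, inverse), uri in product(PROPERTIES, URIS):
        if inverse:
            yield f"?pages {prop} {uri} ."
        else:
            yield f"{uri} {prop} ?pages ."

all_patterns = list(patterns())
print(len(all_patterns), "patterns to ask without reasoning")
```

A consumer without reasoning support would have to issue (or union) every one of these patterns to get complete answers.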

…The Hypothesis?…

…we can use the OWL and RDFS inherent in Linked Data to attenuate the problem of heterogeneity for consumers

Scenario…

…take a static corpus crawled from Linked Data…

…about a billion triples or so…

…and tackle the problem(s) of heterogeneity

…(without domain-specific “cheats”).

Setup…

hardware

…9 machines

…~6 years old… 4 GB RAM, 2.2 GHz, Ethernet

Setup…

corpus

…crawl (9 machines: 52.5 hr)

…took random seed URIs from Billion Triple Challenge 2009 dataset

…crawled ~4 million RDF/XML documents

…from arbitrary domains (e.g., dbpedia.org)

– Only found 785 domains providing RDF/XML

…1.118 billion quadruples

…947 million unique triples

Setup…

ranking (9 machines: 30.3 hr)

…applied PageRank over interlinked source docs.

– …source A links to source B if A uses a URI which “dereferences” (points) to B
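The ranking step can be sketched roughly as follows. This is a toy, in-memory illustration, not the distributed implementation used for the corpus: the damping factor (0.85), the iteration count, the even redistribution of rank from documents without outlinks, and the dropping of self-links are all assumptions, and the document names in the example are hypothetical.

```python
def source_graph(uris_used, dereferences_to):
    """Source A links to source B if A uses a URI that dereferences to B."""
    links = {}
    for doc, uris in uris_used.items():
        targets = {dereferences_to[u] for u in uris if u in dereferences_to}
        targets.discard(doc)  # assumption: self-links are ignored
        links[doc] = targets
    return links

def pagerank(links, damping=0.85, iterations=50):
    """Plain iterative PageRank over a dict: document -> set of linked documents."""
    docs = set(links) | {d for targets in links.values() for d in targets}
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Rank held by documents without outlinks is spread evenly over all documents.
        dangling = sum(rank[d] for d in docs if not links.get(d))
        new_rank = {d: (1 - damping) / n + damping * dangling / n for d in docs}
        for src, targets in links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# Hypothetical example: two documents using FOAF terms both link to the FOAF spec.
uris_used = {"ex:docA": ["foaf:Person", "ex:a"],
             "ex:docB": ["foaf:knows"],
             "foaf:spec": ["foaf:Person"]}
dereferences_to = {"foaf:Person": "foaf:spec", "foaf:knows": "foaf:spec",
                   "ex:a": "ex:docA"}
ranks = pagerank(source_graph(uris_used, dereferences_to))
```

Widely reused vocabularies end up with high ranks because many documents use their terms, which is what the later rank-based annotations rely on.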

Challenges…

…what (OWL) reasoning is feasible for Linked Data?

Linked Data Reasoning: Challenges

CORE

1. Reasoning…

2. Annotated Reasoning…

3. Consolidation…

1. Reasoning

High Level Approach…

…apply a subset of OWL 2 RL/RDF rules over the data

Forward Chaining materialisation:

Avoid runtime expense of backward-chaining

– Users taught impatience by Google

Pre-compute answers for quick retrieval

Web-scale systems should be scalable!

– More data = more disk-space/machines

Web Reasoning: Forward Chaining!

Scalable Authoritative OWL Reasoner

Our Approach

Our Approach…

INPUT: Flat file of triples (quads)

OUTPUT: Flat file of (partial) inferred triples (quads)

Scalable Reasoning: In-mem T-Box

Main optimisation: Store T-Box in memory

T-Box: (loosely) data describing classes and properties. Aka. schemata/vocabularies/ontologies/terminologies. E.g.,

– foaf:topic owl:inverseOf foaf:page .

– sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .

Most commonly accessed data for reasoning

Quite small (~0.1% for our Linked Data corpus)

High selectivity (if you prefer)

A-Box: Lots ?s foaf:page ?o .

vs. T-Box: Few foaf:page ?p ?o . + ?s ?p foaf:page .

Scan 1: Scan input data, separate T-Box statements, load T-Box statements into memory. Do T-Box level reasoning if required (semi-naïve)

Scan 2: Scan all on-disk data, join with in-memory T-Box.

Scalable Reasoning: Two Scans

Figure: two-scan reasoning

ON-DISK A-BOX: ... ex:me foaf:homepage ex:hp . ...

IN-MEM T-BOX: foaf:homepage, foaf:Document, foaf:page, foaf:topic (edges labelled rdfs:domain, rdfs:subPropertyOf, owl:inverseOf)

ON-DISK OUTPUT: ... ex:hp rdf:type foaf:Document . ex:me foaf:page ex:hp . ex:hp foaf:topic ex:me . ...

Execution of three rules:

OWL 2 RL rule prp-inv1

?p1 owl:inverseOf ?p2 .

?x ?p1 ?y .

⇒ ?y ?p2 ?x .

OWL 2 RL rule prp-rng

?p rdfs:range ?c .

?x ?p ?y .

⇒ ?y a ?c .

OWL 2 RL rule prp-spo1

?p1 rdfs:subPropertyOf ?p2 .

?x ?p1 ?y .

⇒ ?x ?p2 ?y .
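The two scans and the three rules above can be sketched as follows. This is a small in-memory illustration under stated assumptions, not the actual reasoner: the data is a Python list standing in for the flat file, the T-Box axioms in the example (including the direction of the owl:inverseOf statement and the rdfs:range axiom) are chosen so that the output matches the ON-DISK OUTPUT triples shown for the example, and the function names are illustrative.

```python
RDF_TYPE = "rdf:type"
TBOX_PREDICATES = {"rdfs:subPropertyOf", "rdfs:range", "owl:inverseOf"}

def load_tbox(triples):
    """Scan 1: keep only T-Box statements, indexed by the property they describe."""
    tbox = {pred: {} for pred in TBOX_PREDICATES}
    for s, p, o in triples:
        if p in TBOX_PREDICATES:
            tbox[p].setdefault(s, set()).add(o)
    return tbox

def infer(triple, tbox):
    """Apply prp-spo1, prp-rng and prp-inv1 to one triple until nothing new appears."""
    seen, todo = {triple}, [triple]
    while todo:
        s, p, o = todo.pop()
        new = [(s, sup, o) for sup in tbox["rdfs:subPropertyOf"].get(p, ())]   # prp-spo1
        new += [(o, RDF_TYPE, c) for c in tbox["rdfs:range"].get(p, ())]       # prp-rng
        new += [(o, inv, s) for inv in tbox["owl:inverseOf"].get(p, ())]       # prp-inv1
        for t in new:
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen - {triple}

def reason(triples):
    """Two scans over a re-iterable collection of triples."""
    tbox = load_tbox(triples)           # scan 1
    for triple in triples:              # scan 2: one triple at a time, no A-Box joins
        yield from infer(triple, tbox)

data = [
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    ("ex:me", "foaf:homepage", "ex:hp"),
]
inferred = set(reason(data))
```

Each A-Box triple is processed in isolation against the in-memory T-Box, so the second scan needs no index over the A-Box and can be split across machines.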

Scalable Reasoning: No A-Box Joins

However: some rules do require A-Box joins

?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z .

⇒ ?x ?p ?z .

Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)

Can lead to quadratic inferences

A lot of useful reasoning still possible without A-Box joins…
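A small illustration of the "quadratic inferences" point: closing a transitive property over a chain of n nodes turns n - 1 input triples into n(n - 1)/2 triples. The naive fixpoint loop below is only meant to show the growth and the repeated A-Box joins, not to be an efficient algorithm; the ex:n… node names are hypothetical.

```python
def transitive_closure(edges):
    """Naive fixpoint: repeatedly join the relation with itself (an A-Box join)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

def chain(n):
    """n nodes linked in a line by a (hypothetical) transitive property."""
    return {(f"ex:n{i}", f"ex:n{i + 1}") for i in range(n - 1)}

for n in (5, 10, 20):
    print(n - 1, "input triples ->", len(transitive_closure(chain(n))), "after closure")
```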

Scalable Reasoning: A-Box joins?

Consider source of T-Box (schemata) data

Class/property URIs dereference to their authoritative document

FOAF spec authoritative for foaf:Person ✓

MY spec not authoritative for foaf:Person ✘

Allow “extension” in authoritative documents

my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships

foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications

foaf:knows a owl:SymmetricProperty . (MY spec) ✘
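The examples above are consistent with a simple check, sketched here as an assumption rather than as the exact definition: an axiom is accepted from a document only if the axiom's subject dereferences to that document. The slides do not spell out the precise test, so the subject-only rule, the `dereferences_to` map, and the document names are all illustrative.

```python
# Hypothetical dereferencing map: term -> document it resolves to.
dereferences_to = {
    "foaf:Person": "foaf:spec",
    "foaf:knows": "foaf:spec",
    "my:Person": "my:spec",
}

def is_authoritative(axiom, source_doc, dereferences_to):
    """Simplified check: the source must be the document the subject dereferences to."""
    subject, _predicate, _object = axiom
    return dereferences_to.get(subject) == source_doc

checks = [
    # (axiom, serving document)
    (("my:Person", "rdfs:subClassOf", "foaf:Person"), "my:spec"),        # extension
    (("foaf:Person", "rdfs:subClassOf", "my:Person"), "my:spec"),        # obscure membership
    (("foaf:knows", "rdf:type", "owl:SymmetricProperty"), "my:spec"),    # hijacks FOAF
]
results = [is_authoritative(axiom, doc, dereferences_to) for axiom, doc in checks]
print(results)
```

Under this reading, a third party can extend foaf:Person with its own subclasses but cannot change what membership of foaf:Person implies for everyone else's data.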

Authoritative Reasoning

Survey of terminology: counts

Looked at use of RDFS and OWL in our corpus

1. rdfs:subClassOf ~307k axioms ~51k docs ✓

2. owl:equivalentClass ~23k axioms ~23k docs ✓

3. rdfs:domain ~16k axioms 623 docs ✓

4. rdfs:range ~14k axioms 717 docs ✓

5. owl:unionOf ~13k axioms 109 docs ✓

6. rdfs:subPropertyOf ~9k axioms 227 docs ✓

7. owl:inverseOf ~1k axioms 98 docs ✓

8. owl:disjointWith 917 axioms 60 docs ✘

9. owl:someValuesFrom 465 axioms 48 docs ✓

10. owl:intersectionOf 325 axioms 12 docs ✓/ ✘

…

...summary please?

Our “cheap rules” cover 99% of RDFS/OWL axioms in our corpus

82.3% of such axioms have an authoritative version

- 78.3% of all non-authoritative axioms come from one doc

- (without which, ~96% of axioms have auth. version)

9.1% of documents have non-authoritative axioms

Authoritative reasoning for cheap rules fully supports 90.6% of the “vocabulary documents”

Survey of terminology: counts

Survey of terminology: ranks

Looked at use of RDFS and OWL wrt. ranks of documents…

1. rdfs:subClassOf 0.295 ✓

2. rdfs:range 0.294 ✓

3. rdfs:domain 0.292 ✓

4. rdfs:subPropertyOf 0.090 ✓

5. owl:FunctionalProperty 0.063 ✘

6. owl:disjointWith 0.049 ✘

7. owl:inverseOf 0.047 ✓

8. owl:unionOf 0.035 ✓

9. owl:SymmetricProperty 0.033 ✓

10. owl:equivalentClass 0.021 ✓

11. owl:InverseFunctionalProperty 0.030 ✘

12. owl:equivalentProperty 0.030 ✓

13. owl:someValuesFrom 0.030 ✓/ ✘

...summary please?

Adding up the ranks of all vocabularies our rules fully support gives 77% of the total rank of all vocabularies

Adding up the ranks of all vocabularies our authoritative rules fully support gives 70% of the total rank of all vocabularies

The highest ranked document our rules do not fully support was 5th overall: SKOS

The highest ranked document with non-authoritative axioms was 7th overall: FOAF

Survey of terminology: ranks

...let’s stick to the simple rules

Scalable Distributed Reasoning

Figure: distributed reasoning. Each machine runs EXTRACT T-BOX over its local data; COLLECT T-BOX then gives every machine the SAME T-BOX; each machine reasons over a DIFF. A-BOX and writes its LOCAL OUTPUT.

T-Box shown on every machine:

ex:presented rdfs:domain foaf:Person .

ex:presented rdfs:range ex:Talk .

A-Box triple shown: ... ex:me ex:presented ex:ThisTalk ...

Triple shown: ... ex:me rdf:type ex:Awesome .
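The extract/collect/reason flow in the figure can be simulated in a single process as below. Each list entry stands in for one machine's segment of the data; only the rdfs:domain and rdfs:range rules from the figure are applied, and the ex:you / ex:OtherTalk triple is a hypothetical addition so that both segments hold some A-Box data.

```python
TBOX_PREDICATES = {"rdfs:domain", "rdfs:range"}

def extract_tbox(segment):
    """EXTRACT T-BOX: each machine pulls schema triples out of its own segment."""
    return {t for t in segment if t[1] in TBOX_PREDICATES}

def reason_locally(segment, tbox):
    """Join one machine's A-Box segment with the shared in-memory T-Box."""
    domain, range_ = {}, {}
    for s, p, o in tbox:
        (domain if p == "rdfs:domain" else range_).setdefault(s, set()).add(o)
    output = set()
    for s, p, o in segment:
        output |= {(s, "rdf:type", c) for c in domain.get(p, ())}
        output |= {(o, "rdf:type", c) for c in range_.get(p, ())}
    return output

def distributed_reasoning(segments):
    # COLLECT T-BOX: union the per-machine extracts so every machine has the SAME T-BOX.
    shared_tbox = set().union(*(extract_tbox(seg) for seg in segments))
    # Each machine then works only on its own (DIFF.) A-Box and writes LOCAL OUTPUT.
    return [reason_locally(seg, shared_tbox) for seg in segments]

segments = [
    [("ex:presented", "rdfs:domain", "foaf:Person"),
     ("ex:you", "ex:presented", "ex:OtherTalk")],
    [("ex:presented", "rdfs:range", "ex:Talk"),
     ("ex:me", "ex:presented", "ex:ThisTalk")],
]
local_outputs = distributed_reasoning(segments)
```

Because the rules need no A-Box joins, the only coordination between machines is the exchange of the small T-Box.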

Reasoning Performance (1 machine)

Reasoning Performance: Distrib.

9 machines: Total 3.35 hours

Reasoning: Results

962 million unique/novel triples

947 million unique triples

2. Annotated Reasoning

Annotated Reasoning

Let’s try to track some meta-information during the reasoning process

Annotate input triples with information

Use annotated reasoning framework for transforming annotations on input triples into annotations on output triples

Each input triple is assigned the sum of the ranks of the documents in which it appears…

foaf:Person rdfs:subClassOf foaf:Agent 0.3 .

timbl:i rdf:type foaf:Person 0.04 .

aidan:me rdf:type foaf:Person 0.0001 .

Annotated Reasoning: ranks

During reasoning, inferences are assigned the rank of the least-trustworthy triple involved in their “proof”

foaf:Person rdfs:subClassOf foaf:Agent 0.3 .

timbl:i rdf:type foaf:Person 0.04 .

⇒ timbl:i rdf:type foaf:Agent 0.04 .
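A minimal sketch of the two annotation steps: summing document ranks per input triple, then giving each inference the minimum rank found in its proof. Only the rdfs:subClassOf rule is applied, in a single pass, and keeping the maximum over alternative proofs is an assumption suggested by the later remark about removing non-optimal triples. The rank values are the ones on the slides; the document names d1 and d2 are hypothetical.

```python
def annotate(quads, doc_rank):
    """Each triple gets the sum of the ranks of the documents in which it appears."""
    ranks = {}
    for s, p, o, doc in set(quads):
        ranks[(s, p, o)] = ranks.get((s, p, o), 0.0) + doc_rank[doc]
    return ranks

def infer_subclass(ranks):
    """?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . => ?x rdf:type ?c2 (min rank of proof)."""
    inferred = {}
    for (c1, p, c2), r_axiom in ranks.items():
        if p != "rdfs:subClassOf":
            continue
        for (x, q, c), r_fact in ranks.items():
            if q == "rdf:type" and c == c1:
                t = (x, "rdf:type", c2)
                # Keep the best annotation if the same triple has several proofs.
                inferred[t] = max(inferred.get(t, 0.0), min(r_axiom, r_fact))
    return inferred

# Annotated input triples from the slides.
ranks = {
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"): 0.3,
    ("timbl:i", "rdf:type", "foaf:Person"): 0.04,
    ("aidan:me", "rdf:type", "foaf:Person"): 0.0001,
}
inferred = infer_subclass(ranks)

# Hypothetical documents d1 and d2 both containing the same triple.
summed = annotate([("ex:a", "ex:p", "ex:b", "d1"), ("ex:a", "ex:p", "ex:b", "d2")],
                  {"d1": 0.25, "d2": 0.5})
```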

Annotated Reasoning

1. Can do top-k materialisation

Only give me inferences above a certain rank threshold

Only give me top-k inferences

2. Can fix inconsistencies in the data…

…aka. logical contradictions

…interpreting the rank values as denoting “trustworthy” data

foaf:Person owl:disjointWith foaf:Document .

Inconsistencies: aka. Contradictions

?c1 owl:disjointWith ?c2 .

?x rdf:type ?c1 .

?x rdf:type ?c2 .

⇒ false

foaf:Person owl:disjointWith foaf:Document .

ex:sleepygirl rdf:type foaf:Person .

ex:sleepygirl rdf:type foaf:Document .

⇒ false

Cannot compute…

Considered two approaches:

1. Find the “consistency threshold” of the input + inferred data: The largest rank such that all data above that rank are consistent

Unfortunately, the 22nd ranked document had an ill-typed literal, and so was inconsistent…

So we would keep the data of ~22 documents

And throw away the data of nearly four million

Fixing inconsistencies

Time for Plan B:

2. Perform a “granular” repair of the data

Remove the weakest triple causing each contradiction

foaf:Person owl:disjointWith foaf:Document 0.3 .

ex:sleepygirl rdf:type foaf:Person 0.007 .

ex:sleepygirl rdf:type foaf:Document 0.002 .
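The granular repair can be sketched like this, restricted to the owl:disjointWith rule and to triples that are all directly present with a rank. Tracing a contradiction back through inferred triples to the input data is not shown, and the function names are illustrative.

```python
def find_contradictions(ranks):
    """?c1 owl:disjointWith ?c2 . ?x rdf:type ?c1 . ?x rdf:type ?c2 . => false"""
    types = {}
    for (s, p, o) in ranks:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    for (c1, p, c2) in ranks:
        if p != "owl:disjointWith":
            continue
        for x, classes in types.items():
            if c1 in classes and c2 in classes:
                yield [(c1, p, c2), (x, "rdf:type", c1), (x, "rdf:type", c2)]

def repair(ranks):
    """Remove the lowest-ranked triple of every contradiction that is still present."""
    repaired = dict(ranks)
    for proof in find_contradictions(ranks):
        if all(t in repaired for t in proof):
            weakest = min(proof, key=lambda t: repaired[t])
            del repaired[weakest]
    return repaired

# Annotated triples from the slides.
ranks = {
    ("foaf:Person", "owl:disjointWith", "foaf:Document"): 0.3,
    ("ex:sleepygirl", "rdf:type", "foaf:Person"): 0.007,
    ("ex:sleepygirl", "rdf:type", "foaf:Document"): 0.002,
}
repaired = repair(ranks)
```

In the slide's example the membership in foaf:Document carries the lowest rank, so it is the triple that gets dropped while the vocabulary axiom and the better-supported membership survive.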

Fixing inconsistencies

~294k ill-typed datatypes

~7k members of disjoint classes

Inconsistencies found

Performance

9 machines

Annotated Reasoning: 14.6 hrs (vs. 3.35 hrs w/o annotations: need to do a distributed sort to remove non-optimal triples)

Detect/Extract Inconsistencies: 2.9 hrs

Diagnosis/Repair: 2.8 hrs

Total ~20.3 hours

3. Consolidation

Consolidation for Linked Data

Baseline Approach…

…use the explicit owl:sameAs relations given in the data…

Scan the data and extract all owl:sameAs triples

timbl:i owl:sameAs identica:45563 .

dbpedia:Berners-Lee owl:sameAs identica:45563 .

Load into memory

Use a map to store equivalences:

timbl:i -> {timbl:i, identica:45563, dbpedia:Berners-Lee}

identica:45563 -> {timbl:i, identica:45563, dbpedia:Berners-Lee}

dbpedia:Berners-Lee -> {timbl:i, identica:45563, dbpedia:Berners-Lee}

Consolidation: Baseline

For each set of equivalent identifiers, choose a canonical term

Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

Scan data a second time: Rewrite identifiers to their canonical version

Skip predicates and values of rdf:type

Canonicalisation

timbl:i rdf:type foaf:Person .

identica:48404 foaf:knows identica:45563 .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .

⇒

dbpedia:Berners-Lee rdf:type foaf:Person .

identica:48404 foaf:knows dbpedia:Berners-Lee .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .

Equivalent identifiers: timbl:i, identica:45563, dbpedia:Berners-Lee
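The baseline steps (extract owl:sameAs, build an in-memory equivalence map, pick a canonical term, rewrite the data) can be sketched as follows. The slides do not say how the canonical term is chosen; picking the lexicographically smallest identifier is an assumption that happens to reproduce the example. A union-find structure stands in for the map of equivalence sets.

```python
def build_canonical_map(triples):
    """Union-find over owl:sameAs; the smallest identifier becomes the set's root."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            a, b = find(s), find(o)
            if a != b:
                parent[max(a, b)] = min(a, b)  # assumption: lexicographic canonical term
    return {term: find(term) for term in list(parent)}

def canonicalise(triples, canon):
    """Rewrite subjects and objects; skip predicates and values of rdf:type."""
    for s, p, o in triples:
        new_o = o if p == "rdf:type" else canon.get(o, o)
        yield (canon.get(s, s), p, new_o)

same_as = [("timbl:i", "owl:sameAs", "identica:45563"),
           ("dbpedia:Berners-Lee", "owl:sameAs", "identica:45563")]
data = [("timbl:i", "rdf:type", "foaf:Person"),
        ("identica:48404", "foaf:knows", "identica:45563"),
        ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date')]

canon = build_canonical_map(same_as)
consolidated = list(canonicalise(data, canon))
```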

Baseline Consolidation: Performance

9 machines

1. Extract owl:sameAs: 0.2 hr

2. Gather owl:sameAs: 0.1 hr

3. Canonicalise data: 0.7 hr

Total ~1.1 hours

Applied over raw input data

~12 million owl:sameAs triples

~2.2 million sets of equivalent identifiers

~5.8 million identifiers involved

~2.65 identifiers per set

~99.99% of terms were URIs

~6.25% of all URIs

Baseline Consolidation: Results

Extended Approach…

…use the owl:sameAs relations inferable through reasoning…

Infer owl:sameAs through reasoning (OWL 2 RL/RDF)

1. explicit owl:sameAs (again)

2. owl:InverseFunctionalProperty

3. owl:FunctionalProperty

4. owl:cardinality 1 / owl:maxCardinality 1

foaf:homepage a owl:InverseFunctionalProperty .

timbl:i foaf:homepage w3c:timblhomepage .

adv:timbl foaf:homepage w3c:timblhomepage .

⇒ timbl:i owl:sameAs adv:timbl .

…then apply consolidation as before

Extended Consolidation

OWL 2 RL/RDF consolidation rules require A-Box joins!

Might not be able to fit owl:sameAs index in memory (4 Gb)!

⇒ Use on-disk batch-processing

Distributed sorts, scans and merge-joins

Derive owl:sameAs on-disk
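The sort-then-scan idea behind the on-disk approach can be illustrated for the inverse-functional-property case. Here Python's in-memory `sorted` stands in for the distributed on-disk sort, and a single pass over the sorted data groups subjects sharing the same (property, value) pair. Linking every group member to the first one, rather than emitting all pairs, is an assumption made to keep the output linear in the group size.

```python
from itertools import groupby

def same_as_from_ifp(triples):
    """Derive owl:sameAs from inverse-functional properties via sort and scan."""
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    # Sort by (property, value) so that subjects sharing a value become adjacent.
    candidates = sorted({(p, o, s) for s, p, o in triples if p in ifps})
    for _key, group in groupby(candidates, key=lambda t: t[:2]):
        subjects = [t[2] for t in group]
        for other in subjects[1:]:
            yield (subjects[0], "owl:sameAs", other)

data = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
    ("ex:someoneElse", "foaf:homepage", "ex:otherpage"),  # hypothetical, no match
]
derived = list(same_as_from_ifp(data))
```

The derived triple links adv:timbl and timbl:i; since owl:sameAs is symmetric, this is equivalent to the inference shown on the slide, and the result can be fed into the same canonicalisation step as before.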

Extended Consolidation: Performance

9 machines

1. Inferring owl:sameAs ~7.4 hr

2. Canonicalise data ~4.9 hr

Total ~12.3 hours (11x baseline)

~12 million explicit owl:sameAs triples (as before)

~8.7 million thru. owl:InverseFunctionalProperty

~106 thousand thru. owl:FunctionalProperty

none thru. owl:cardinality/owl:maxCardinality

~2.8 million sets of equivalent identifiers (1.31x baseline)

~14.86 million identifiers involved (2.58x baseline)

~5.8 million URIs (1.014x baseline)

Extended Consolidation: Results

CONCLUSION

timbl:i foaf:page ?pages .

timbl:i

identica:45563

dbpedia:Berners-Lee

dbpedia:Berners-Lee foaf:page ?pages .

Heterogeneity poses a significant problem for consuming Linked Data

1. Lightweight reasoning can go a long way

Simple/authoritative rules have reasonable coverage

2. Deceit/Noise ≠ End Of World

3. Inconsistency ≠ End Of World

Useful for finding noise in fact!

4. Explicit owl:sameAs vs. extended consolidation: Extended consolidation mostly for consolidating blank-nodes from older FOAF exporters

Conclusions
