
2017 UniBZ Winter Seminar Poster: Managing and Consuming Completeness Information for RDF Data Sources



Fariz Darari

Supervisors: Werner Nutt, Sebastian Rudolph

Managing and Consuming Completeness Information for RDF Data Sources

Why completeness information?

Though generally incomplete, parts of data on the Web are indeed complete!

Completeness information lets us know exactly which parts are complete.

Real-world RDF data sources need a large number of completeness statements, resulting in long reasoning time.

Data-agnostic reasoning optimization

CORNER

• Generic data source
• Supports data-agnostic reasoning
• Highlights: RDFS extension, federated extension

COOL-WD

• Wikidata-specific
• Supports data-aware reasoning
• Highlights: Built-in on Wikidata, completeness analytics, query diagnostics

Complete for all Apollo 11 crew:

Compl(apollo11,crew,?crew)

Give me people who are NOT Apollo 11 crew:

SELECT * WHERE { ?person isA person .
  FILTER NOT EXISTS { apollo11 crew ?person } }

Is this query Qneg sound?

Give me the children of Apollo 11 crew:

SELECT * WHERE { apollo11 crew ?crew .
  ?crew child ?child }

Is this query Qpos complete?*

*Suppose Wikidata is also complete for all children of Neil, Buzz, and Michael.

Let’s manage and consume completeness information!

Data-aware completeness reasoning

Darari et al. (ISWC’13) formalized data-agnostic completeness reasoning.

Abstracting away the data graph results in weaker inferences: for example, data-agnostic reasoning fails to guarantee the completeness of Qpos. Data-aware reasoning, however, can guarantee it.

Incorporating the data graph increases the complexity from NP-complete (for data-agnostic reasoning) to $\Pi_2^P$-complete. Yet, optimization techniques exist for practical settings.

Optimizing completeness reasoning

Data-aware reasoning optimization

Soundness reasoning

Answer soundness reasoning

Is my query answer sound?

Input: P query with negation, C set of completeness statements, G graph, u answer mapping

Output: true iff u is sound wrt. P, C, and G

Characterization: The answer u of P over G wrt. C is sound iff all of P's NOT-EXISTS BGPs (= negative parts), after applying u to them, are complete for G wrt. C.
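The characterization above can be illustrated with a deliberately simplified sketch. It assumes that every completeness statement is a single, unconditional triple pattern, so the graph G is never consulted and the check is sufficient but not necessary. The helper names (`covers`, `apply_mapping`, `answer_is_sound`) and the answer value `alice` are hypothetical and are not taken from CORNER or COOL-WD.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # terms starting with "?" are variables


def is_var(term: str) -> bool:
    return term.startswith("?")


def covers(statement: Triple, pattern: Triple) -> bool:
    """True if the statement pattern is at least as general as `pattern`."""
    binding: Dict[str, str] = {}
    for s_term, q_term in zip(statement, pattern):
        if is_var(s_term):
            # A statement variable must map to one query term consistently.
            if binding.setdefault(s_term, q_term) != q_term:
                return False
        elif s_term != q_term:
            return False
    return True


def apply_mapping(bgp: List[Triple], u: Dict[str, str]) -> List[Triple]:
    """Replace every variable that the answer mapping u binds."""
    return [(u.get(s, s), u.get(p, p), u.get(o, o)) for s, p, o in bgp]


def answer_is_sound(negative_bgps, statements, u) -> bool:
    """Simplified check: every instantiated negative triple must be covered."""
    return all(
        any(covers(c, triple) for c in statements)
        for bgp in negative_bgps
        for triple in apply_mapping(bgp, u)
    )


# Qneg from the poster; "alice" is a made-up answer for illustration.
statements = [("apollo11", "crew", "?crew")]
q_neg_negative_parts = [[("apollo11", "crew", "?person")]]
print(answer_is_sound(q_neg_negative_parts, statements, {"?person": "alice"}))
```

Statements with conditions, and negative parts that are only complete because of what G contains, would need the full data-aware completeness check instead of `covers`.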

Time-aware completeness reasoning

Completeness statements can sometimes be out-of-date. Capturing this data-dynamicity over time increases flexibility in completeness reasoning!

Completeness management tools

To increase the potential uptake of our completeness reasoning framework, we have developed two completeness management tools: CORNER (for Completeness Reasoner) and COOL-WD (for Completeness Tool for Wikidata).

Publications

• Radityo Eko Prasojo, Fariz Darari, Simon Razniewski, Werner Nutt: Managing and Consuming Completeness Information for Wikidata Using COOL-WD. COLD 2016.

• Fariz Darari, Simon Razniewski, Radityo Eko Prasojo, Werner Nutt: Enabling Fine-Grained RDF Data Completeness Assessment. ICWE 2016.

• Fariz Darari, Radityo Eko Prasojo, Werner Nutt: Expressing No-Value Information in RDF. ISWC (P&D) 2015.

• Fariz Darari, Simon Razniewski, Werner Nutt: Bridging the Semantic Gap between RDF and SPARQL using Completeness Statements. ISWC (P&D) 2014.

• Fariz Darari, Radityo Eko Prasojo, Werner Nutt: CORNER: A Completeness Reasoner for SPARQL Queries over RDF Data Sources. ESWC (P&D) 2014.

Cardinality extraction from the Web: Auto-generating completeness information

Cardinality information often expresses complete count information. When such a count matches the count of the respective data in a KB, completeness statements can be generated automatically!

Pipeline components (from the figure):

• Web documents (e.g., Wikipedia), processed with POS tags, NER tags, and parsing
• Distant supervision learning
• Training data: sentences containing a number matching the values’ count of a relation, and sentences containing a number NOT matching the values’ count of a relation
• Learning classifier: Naïve Bayes, Logistic Regression, SVM, Conditional Random Fields
• Cardinalities
• KB with completeness statements
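The distant-supervision labelling step can be sketched as follows. This is only an illustration under strong assumptions: numbers are recognised as digit strings only, no POS or NER tags are used, and the example sentences and the function names `label_sentence` and `build_training_data` are hypothetical rather than part of the actual pipeline.

```python
import re
from typing import List, Optional, Tuple

NUMBER = re.compile(r"\b\d+\b")  # assumption: cardinalities appear as digits


def label_sentence(sentence: str, kb_count: int) -> Optional[int]:
    """Return 1 if a number in the sentence matches the KB count,
    0 if the sentence has numbers but none matches, None if it has no number."""
    numbers = [int(n) for n in NUMBER.findall(sentence)]
    if not numbers:
        return None
    return 1 if kb_count in numbers else 0


def build_training_data(sentences: List[str], kb_count: int) -> List[Tuple[str, int]]:
    """Keep only sentences that contain a number, paired with their label."""
    data = []
    for sentence in sentences:
        label = label_sentence(sentence, kb_count)
        if label is not None:
            data.append((sentence, label))
    return data


# Made-up sentences; the KB is assumed to list 3 values for (apollo11, crew).
sentences = [
    "The mission had 3 crew members.",
    "It launched in 1969.",
    "The crew returned safely.",
]
training_data = build_training_data(sentences, kb_count=3)
```

The labelled sentences would then be fed to one of the classifiers listed above; that training step is not shown here.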

Pattern soundness reasoning

Is my query pattern sound?

Input: P minimal query with negation, C set of completeness statements

Output: true iff P is sound wrt. C

Characterization: The query P is sound wrt. C iff each BGP of the NOT-EXISTS patterns (= negative parts) is complete wrt. C under the condition of the positive part of P.

Qneg is pattern-sound, since the statement Compl(apollo11,crew,?crew) guarantees the completeness of “apollo11 crew ?person” under any condition (hence also under the condition of “?person isA person”).
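To make "complete under the condition of the positive part" concrete, here is a minimal sketch in which a statement is a pair (pattern, condition). A statement is treated as applicable to a negative triple when its pattern maps onto that triple and every condition triple can then be mapped into the positive part. This is a sufficient check only, not the full characterization, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]            # terms starting with "?" are variables
Statement = Tuple[Triple, List[Triple]]  # (pattern, condition)


def match(general: Triple, specific: Triple,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `general` maps onto `specific`, else None."""
    extended = dict(binding)
    for g, s in zip(general, specific):
        if g.startswith("?"):
            if extended.setdefault(g, s) != s:
                return None
        elif g != s:
            return None
    return extended


def condition_holds(condition: List[Triple], positive: List[Triple],
                    binding: Dict[str, str]) -> bool:
    """Backtracking search mapping each condition triple onto a positive triple."""
    if not condition:
        return True
    first, rest = condition[0], condition[1:]
    for pos in positive:
        extended = match(first, pos, binding)
        if extended is not None and condition_holds(rest, positive, extended):
            return True
    return False


def applicable(statement: Statement, triple: Triple, positive: List[Triple]) -> bool:
    pattern, condition = statement
    binding = match(pattern, triple, {})
    return binding is not None and condition_holds(condition, positive, binding)


def pattern_is_sound(positive, negative_bgps, statements) -> bool:
    """Simplified check: every negative triple needs an applicable statement."""
    return all(
        any(applicable(c, triple, positive) for c in statements)
        for bgp in negative_bgps
        for triple in bgp
    )


# Qneg from the poster with its unconditional statement.
positive = [("?person", "isA", "person")]
negative = [[("apollo11", "crew", "?person")]]
unconditional = [(("apollo11", "crew", "?crew"), [])]
print(pattern_is_sound(positive, negative, unconditional))
```

An unconditional statement has an empty condition, which is why it applies "under any condition", as the Qneg explanation above says.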

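Figure (data-aware reasoning for Qpos):

Qpos: apollo11 crew ?crew . ?crew child ?child

Covered by: Compl(apollo11,crew,?crew)

Instantiated remaining parts: neil child ?child; buzz child ?child; michael child ?child

Covered by: Compl(neil,child,?child); Compl(buzz,child,?child); Compl(michael,child,?child)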
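The Qpos example suggests the following evaluate-and-instantiate loop, sketched here under the assumption that all statements are unconditional single triple patterns. It is an illustration of the idea, not the algorithm implemented in COOL-WD, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # terms starting with "?" are variables


def match(general: Triple, specific: Triple) -> Optional[Dict[str, str]]:
    """Map the variables of `general` onto the terms of `specific`, else None."""
    binding: Dict[str, str] = {}
    for g, s in zip(general, specific):
        if g.startswith("?"):
            if binding.setdefault(g, s) != s:
                return None
        elif g != s:
            return None
    return binding


def substitute(bgp: List[Triple], binding: Dict[str, str]) -> List[Triple]:
    return [(binding.get(s, s), binding.get(p, p), binding.get(o, o)) for s, p, o in bgp]


def is_complete(bgp: List[Triple], statements: List[Triple], graph: List[Triple]) -> bool:
    """Pick a covered triple, evaluate it over the graph, and recurse on the
    remaining triples instantiated with each answer."""
    if not bgp:
        return True
    for i, triple in enumerate(bgp):
        if any(match(c, triple) is not None for c in statements):
            rest = bgp[:i] + bgp[i + 1:]
            answers = [b for b in (match(triple, t) for t in graph) if b is not None]
            return all(is_complete(substitute(rest, b), statements, graph) for b in answers)
    return False  # no statement covers any remaining triple


graph = [("apollo11", "crew", "neil"), ("apollo11", "crew", "buzz"),
         ("apollo11", "crew", "michael")]
statements = [("apollo11", "crew", "?crew"), ("neil", "child", "?child"),
              ("buzz", "child", "?child"), ("michael", "child", "?child")]
q_pos = [("apollo11", "crew", "?crew"), ("?crew", "child", "?child")]
print(is_complete(q_pos, statements, graph))
```

Without the graph, "?crew child ?child" cannot be matched against any of the three child statements, which is the gap that data-agnostic reasoning runs into for Qpos.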

Constant-relevance

“A completeness statement C is relevant to the query Q iff all constants in C appear in Q”

Query: michael child ?child. Constants: {michael, child}

• Compl(neil,spouse,?spouse). Constants: {neil, spouse}. Not relevant (X)
• Compl(buzz,child,?child). Constants: {buzz, child}. Not relevant (X)
• Compl(michael,child,?child). Constants: {michael, child}. Relevant

Retrieval of constant-relevant statements can be reduced to subset-querying.
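One simple way to realise the subset query is to index statements by their constant set and look up every subset of the query's constants. The sketch below assumes single-triple statements and small queries (the lookup enumerates all subsets of the query constants); the index structure and function names are hypothetical and may differ from what the tools actually use.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # terms starting with "?" are variables


def constants(triples: List[Triple]) -> FrozenSet[str]:
    return frozenset(t for triple in triples for t in triple if not t.startswith("?"))


def build_index(statements: List[Triple]) -> Dict[FrozenSet[str], List[Triple]]:
    """Hash index from a statement's constant set to the statements having it."""
    index = defaultdict(list)
    for statement in statements:
        index[constants([statement])].append(statement)
    return index


def relevant_statements(index, query: List[Triple]) -> List[Triple]:
    """Return statements whose constants are a subset of the query's constants."""
    query_constants = sorted(constants(query))
    found = []
    for size in range(len(query_constants) + 1):
        for subset in combinations(query_constants, size):
            found.extend(index.get(frozenset(subset), []))
    return found


statements = [("neil", "spouse", "?spouse"), ("buzz", "child", "?child"),
              ("michael", "child", "?child")]
index = build_index(statements)
print(relevant_statements(index, [("michael", "child", "?child")]))
```

For queries with many constants, a dedicated subset-query data structure would be preferable to enumerating every subset.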

Completeness template

“Generalize similar completeness statements for simultaneous matching process”

Compl(neil,child,?child), Compl(buzz,child,?child), Compl(michael,child,?child)

become Compl[$p,child,?child] with $p = {neil, buzz, michael}

Partial matching

“Filter irrelevant completeness templates by ruling out templates whose body is not overlapped with the query’s body”
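Both ideas can be sketched together. The code below assumes single-triple statements with a constant subject, groups them into templates keyed by predicate and object, and interprets "overlap" as term-wise compatibility between a template and some query triple. That interpretation, and all names used, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # terms starting with "?" are variables
TemplateKey = Tuple[str, str]  # (predicate, object); "?" means any variable object


def is_var(term: str) -> bool:
    return term.startswith("?")


def build_templates(statements: List[Triple]) -> Dict[TemplateKey, Set[str]]:
    """Group statements differing only in their (constant) subject."""
    templates = defaultdict(set)
    for s, p, o in statements:
        templates[(p, "?" if is_var(o) else o)].add(s)
    return dict(templates)


def overlaps(key: TemplateKey, subjects: Set[str], query: List[Triple]) -> bool:
    """True if some query triple is term-wise compatible with the template."""
    p, o = key
    return any(
        (is_var(qs) or qs in subjects)
        and (is_var(qp) or qp == p)
        and (o == "?" or is_var(qo) or qo == o)
        for qs, qp, qo in query
    )


def partial_match(templates, query: List[Triple]):
    """Rule out templates that overlap with no triple of the query body."""
    return {k: v for k, v in templates.items() if overlaps(k, v, query)}


def covered(triple: Triple, templates) -> bool:
    """One lookup matches a triple against all statements of a template."""
    s, p, o = triple
    return any(s in templates.get(key, ()) for key in ((p, "?"), (p, o)))


statements = [("neil", "child", "?child"), ("buzz", "child", "?child"),
              ("michael", "child", "?child"), ("neil", "spouse", "?spouse")]
q_pos = [("apollo11", "crew", "?crew"), ("?crew", "child", "?child")]
templates = build_templates(statements)
kept = partial_match(templates, q_pos)
print(kept)
```

Here the three child statements collapse into one template, and the spouse template is ruled out before any matching against instantiated query parts takes place.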

Experiments showed a 50000X speed-up!

Experiments showed a 112X speed-up!

Open-world style of negation, closed-world style of negation

Completeness statements: reducing soundness checking to completeness checking!

2012

Compl(?movie,director,tarantino)

Compl(?movie,actor,tarantino)

SELECT * WHERE { ?movie actor tarantino. ?movie director tarantino }

“GCD := maximum date d s.t. all parts of the query Q can be guaranteed to be complete”

Guaranteed Completeness Date (GCD) = 2012

Algorithm

Going from the latest date in C to the earliest, incrementally compute the union of the query parts that can be guaranteed to be complete, checking along the way whether all query parts are already included.
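A minimal sketch of this sweep, assuming each completeness statement is a single unconditional triple pattern tagged with an integer date, and that each query part is checked on its own. The date 2015 in the example is made up for illustration (the poster only shows 2012), and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # terms starting with "?" are variables


def covers(statement: Triple, pattern: Triple) -> bool:
    """True if the statement pattern is at least as general as `pattern`."""
    binding: Dict[str, str] = {}
    for s_term, q_term in zip(statement, pattern):
        if s_term.startswith("?"):
            if binding.setdefault(s_term, q_term) != q_term:
                return False
        elif s_term != q_term:
            return False
    return True


def guaranteed_completeness_date(
    query: List[Triple], dated_statements: List[Tuple[int, Triple]]
) -> Optional[int]:
    """Sweep from the latest date to the earliest, growing the set of query
    parts known to be complete; stop once every part is included."""
    complete_parts = set()
    all_parts = set(query)
    for date, statement in sorted(dated_statements, key=lambda ds: ds[0], reverse=True):
        complete_parts.update(t for t in query if covers(statement, t))
        if complete_parts == all_parts:
            return date
    return None  # some query part is never guaranteed to be complete


query = [("?movie", "actor", "tarantino"), ("?movie", "director", "tarantino")]
dated_statements = [
    (2012, ("?movie", "director", "tarantino")),
    (2015, ("?movie", "actor", "tarantino")),  # hypothetical date
]
print(guaranteed_completeness_date(query, dated_statements))
```

Because the sweep starts at the latest date, the first date at which all parts are included is the maximum date for which the whole query can be guaranteed complete.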