
Semantic Web 0 (0) 1, IOS Press

Completeness and Soundness Guarantees for Conjunctive SPARQL Queries over RDF Data Sources with Completeness Statements¹

Fariz Darari a,*, Werner Nutt b, Simon Razniewski c, Sebastian Rudolph d

a Faculty of Computer Science, Universitas Indonesia, Indonesia. E-mail: [email protected]

b Faculty of Computer Science, Free University of Bozen-Bolzano, Italy. E-mail: [email protected]

c Max-Planck-Institute for Informatics, Germany. E-mail: [email protected]

d Faculty of Computer Science, TU Dresden, Germany. E-mail: [email protected]

Abstract. RDF generally follows the open-world assumption: information is incomplete by default. Consequently, SPARQL queries cannot retrieve with certainty complete answers, and even worse, when they involve negation, it is unclear whether they produce sound answers. Nevertheless, there is hope to lift this limitation. On many specific topics (e.g., children of Trump, Apollo 11 crew, EU founders), RDF data sources contain complete information, a fact that can be made explicit through completeness statements. In this work, we leverage completeness statements over RDF data sources to provide guarantees of completeness and soundness for conjunctive SPARQL queries. We develop a technique to check whether query completeness can be guaranteed by taking into account also the specifics of the queried graph, and analyze the complexity of such checking. For queries with negation, we approach the problem of query soundness checking, and distinguish between answer soundness (i.e., is an answer of a query sound?) and pattern soundness (i.e., is a query as a whole sound?). We provide a formalization and characterize the soundness problem via a reduction to the completeness problem. We further develop heuristic techniques for completeness checking, and conduct experimental evaluations based on Wikidata, a prominent, real-world knowledge base, to demonstrate the feasibility of our approach.

Keywords: Data quality, data completeness, query completeness, query soundness, RDF, SPARQL

1. Introduction

Over the Web, we are witnessing a growing amount of data available in RDF. As of July 2018, the LOD Cloud² has recorded over 1,200 RDF data sources, covering a wide range of application domains from government to life sciences. RDF follows the open-world assumption (OWA), assuming that data is inherently incomplete [2]. Yet, given such a large quantity of RDF data, one might wonder if it is complete for some topics. As an illustration, consider Wikidata, a collaborative knowledge base (KB) whose content is made available in RDF [3]. For data about the movie Reservoir Dogs, Wikidata is incomplete,³ as it is missing the fact that Michael Sottile acted in that movie.⁴ On the other hand, for data about the European Union (EU), Wikidata actually stores all of its

¹ This paper is an extended and revised version of Darari et al. [1].

* Corresponding author. E-mail: [email protected]

² http://lod-cloud.net/

³ https://www.wikidata.org/wiki/Q72962 (as of July 31, 2018)

⁴ See, e.g., http://www.imdb.com/title/tt0105236/fullcredits

1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved


2 Darari et al. / Ensuring the Completeness and Soundness of Conjunctive SPARQL

founding members,⁵ as shown in Figure 1. Nevertheless, the figure does not provide any indicator of completeness, leaving the user unsure whether the presented facts about the EU founders are complete or not.

Fig. 1. Wikidata is complete for all EU founding members

The incorporation of completeness information can help users assess the quality of data. Over the Web, completeness information is in fact already available in various forms. For instance, Wikipedia provides a template for adding completeness annotations for lists⁶ and contains about 15,000 pages with the keywords ‘complete list of’ and ‘list is complete’; IMDb offers around 52,000 editor-verified (natural language) statements about the completeness of cast and crew;⁷ and OpenStreetMap has around 2,200 pages featuring completeness status information.⁸ For RDF data, such information about completeness is particularly crucial due to RDF’s incomplete nature. In [4], Darari et al. proposed completeness statements, metadata to specify which parts of an RDF data source are complete: for a complete part, all facts that hold in reality are captured by the data source. They also provided an RDF representation of completeness statements, making them machine-readable. The availability of explicit completeness information opens up the possibility of specialized applications for data source curation, discovery, analytics, and so forth. Moreover, it can also benefit data access over RDF data sources, which is mainly done via SPARQL queries [5]. More specifically, the quality of query answers can be made more transparent, given that we now know the quality of data sources with regard to completeness.

⁵ https://europa.eu/european-union/about-eu/history/

⁶ https://en.wikipedia.org/wiki/Template:Complete_list

⁷ E.g., see http://www.imdb.com/title/tt0105236/fullcredits

⁸ For instance, see http://wiki.openstreetmap.org/wiki/Abingdon

Query Completeness. When data sources are enriched with completeness information, the question naturally arises as to whether queries can also be answered completely. Intuitively, queries that are evaluated only over parts of the data captured by completeness statements are guaranteed to be complete. Consider the query “Give the EU founders” over Wikidata:⁹

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE { wd:Q458 wdt:P112 ?c } # EU founder ?c

Without completeness information, evaluating the query would give only query answers whose completeness we do not know. By having the statement that Wikidata is complete for the EU founders, we can then guarantee that the answers are complete. Darari et al. [4, 6] characterized such reasoning, that is, checking whether queries can be guaranteed to be complete by completeness statements. Nonetheless, this approach is limited in the sense that the specifics of the queried graph are not taken into account in the completeness reasoning.

Let us give an example illustrating this limitation. Suppose that in addition to the statement about all EU founders, we also have the statements that Wikidata is complete for the official languages of the following countries: Belgium, France, Germany, Italy, Luxembourg, and the Netherlands. Let us now consider the query “Give the EU founders and their official languages.” We show that the query can be answered completely by applying a data-aware approach: we enumerate the complete EU founders stored in Wikidata, and for each of them, we are complete for its languages. On the other hand, the data-agnostic approach from Darari et al. [4, 6] would fail to capture the query completeness: it can be that all the EU founders are completely different from those in Wikidata, and thus, having completeness statements about the six countries above could not help in the reasoning. We argue that data-aware reasoning can provide more fine-grained insights into the completeness of query answers, which otherwise cannot be captured by relying only on the data-agnostic approach.

⁹ Wikidata has internal identifiers for resources, as shown in the SPARQL query example.


Query Soundness. While it is rather obvious that completeness information may guarantee query completeness, one might wonder if such completeness information can also be leveraged to check the soundness of query answers. Indeed, for the positive fragment of SPARQL, the soundness of query answers trivially holds, thanks to monotonicity. Now let us consider queries with negation. There are different ways to express negation in SPARQL, either by the explicit syntactic constructs MINUS and FILTER NOT EXISTS, which are available in SPARQL 1.1, or by combining OPTIONAL patterns with a check for unbound variables, as was the method of choice in SPARQL 1.0. The meaning of such queries on the Semantic Web has always been dubious (see, e.g., W3C mailing list discussions in [7] and [8]). The evaluation of SPARQL queries that include negation relies on the absence of some information. On the other hand, RDF, which follows the OWA, regards missing information as undecided, that is, it is unknown whether the missing information is false. Given this situation, answers to queries with negation can never be assured to be sound.

Completeness information can tackle the problem of query soundness. To illustrate this, consider asking for “countries that are not EU founders” over Wikidata:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
  ?c wdt:P31 wd:Q6256 # ?c a country
  FILTER NOT EXISTS {
    wd:Q458 wdt:P112 ?c # EU founder ?c
  }
}

The answers include Spain (= wd:Q29). Without any completeness information about Wikidata, we cannot be sure about the soundness of this answer: assume Spain were an EU founder, but this information were missing from the data. In that case, Spain would not be a correct answer. In reality, the EU founders are exactly as shown before in Figure 1. Knowing this guarantees that Spain is not an EU founding country. What we can observe here is that negation in SPARQL, due to its inherent non-monotonicity, may lead to the problem of judging answer soundness: adding new information may invalidate an answer. Soundness of answers is ensured, however, if we know that the parts of the data over which the negated parts of a query range are complete (i.e., not open-world).

Contributions. An earlier version of our ideas on query completeness checking was published in the Proceedings of the International Conference on Web Engineering [1]. In that work, we provided a formalization of the data-aware completeness entailment problem and developed a sound and complete algorithm for the entailment problem. The SPARQL fragment considered in that work is the conjunctive fragment, which is the core fragment underlying all extensions [9, 10]. The present paper significantly extends the previous work in the following ways:

1. we formulate the soundness problem for conjunctive SPARQL queries with negation in the presence of completeness information, and distinguish between answer soundness (i.e., is an answer of a query sound?) and pattern soundness (i.e., is a query as a whole sound?);

2. we provide a full characterization of both the answer and pattern soundness problems via a reduction to the completeness problem;

3. we identify the bottlenecks of the completeness reasoning techniques from [1] and develop heuristic techniques for completeness reasoning;

4. we provide experimental evaluations based on Wikidata, a prominent, real-world data source, to study the effectiveness of the proposed heuristics (and their interplay) and to validate the feasibility of our approach; and

5. we provide a comprehensive complexity analysis of the completeness and soundness entailment problems, include the proofs of all theorems as well as more recent related work, and improve the presentation of the theoretical parts.

A poster by Darari et al. [11] contained a result that can be interpreted as a sufficient condition for soundness in the data-agnostic setting. We now provide a distinction between so-called pattern soundness and answer soundness, a full characterization by means of a reduction to completeness checking, as well as a complexity analysis and experimental evaluations for the soundness problem.

Organization. The rest of the article is organized as follows. Section 2 provides some background about RDF and SPARQL, as well as completeness statements. Section 3 motivates and formalizes the problems of query completeness and query soundness. In Section 4, we first introduce formal notions capturing aspects of completeness, then present an algorithm for completeness entailment checking based on those notions, and conclude with a complexity analysis of the completeness entailment problem. We give a characterization of the two problem variants of query soundness, that is, answer soundness and pattern soundness, in Section 5. We describe our heuristic techniques for completeness checking in Section 6, and report on our experimental evaluations for query completeness and query soundness checking in Section 7. Related work is presented in Section 8. Section 9 provides a discussion of our framework, while Section 10 gives conclusions and future work. Proofs are provided in the appendices.

2. Preliminaries

In this section, we introduce basic notions of RDF and SPARQL, and provide a formalization of completeness statements. We adopt the formalization of RDF and SPARQL as in [12] and base ourselves on the formalization of completeness as in [4, 6].

2.1. RDF and SPARQL

We assume three pairwise disjoint infinite sets I (IRIs), L (literals), and V (variables). We collectively refer to IRIs and literals as RDF terms or simply terms. An RDF graph (or simply, graph) G is a finite set of triples (s, p, o) ∈ I × I × (I ∪ L). For simplicity, we omit namespaces in the abstract representation of RDF graphs.

The standard RDF query language is SPARQL [5]. At the core of SPARQL lie triple patterns, which resemble triples, except that variables are also allowed in each position. A basic graph pattern (BGP) is a set of triple patterns. A mapping µ is a partial function µ : V → I ∪ L. The operator dom(µ) returns the set of all variables that are mapped in µ. We define the mapping with the empty domain as the empty mapping µ∅. The operator var(P) denotes the set of variables that are in a BGP P. Given a BGP P, µP denotes the BGP obtained by replacing variables in P with terms according to µ. Evaluating P over a graph G gives the set of mappings $\llbracket P \rrbracket_G = \{ \mu \mid \mu P \subseteq G \text{ and } \mathrm{dom}(\mu) = \mathrm{var}(P) \}$.

In the following, we define variable freezing, an important notion for the characterizations of completeness and soundness entailment in the subsequent sections.

Definition 1 (Variable Freezing). For a BGP P, the freeze mapping $\tilde{id}$ maps each variable ?v in P to a fresh IRI $\tilde{v}$. With this mapping, we construct the prototypical graph $\tilde{P} := \tilde{id}\,P$ to represent any possible graph that can satisfy the BGP P. The melt mapping $\tilde{id}^{-1}$ undoes the freezing by substituting the fresh IRIs with the original variables.¹⁰

Example 1. Consider the query “Give the founding members of the EU” as introduced in Section 1. The query’s BGP can be written as: (EU, founder, ?c). The freeze mapping of the BGP is $\tilde{id} = \{?c \mapsto \tilde{c}\}$, and the prototypical graph is $(EU, founder, \tilde{c})$.
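Definition 1 and Example 1 can be illustrated with a minimal Python sketch. The representation is an assumption made only for illustration: terms are plain strings, variables carry a leading `?`, a BGP is a set of 3-tuples, and fresh IRIs are formed with a hypothetical `frozen:` prefix that is assumed not to occur elsewhere.

```python
def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def freeze(bgp, prefix="frozen:"):
    """Return (freeze mapping, prototypical graph) for a BGP.

    The prefix is an assumed naming scheme for fresh IRIs; it must not
    clash with any term already used in the BGP or in the data.
    """
    freeze_map = {t: prefix + t[1:] for tp in bgp for t in tp if is_var(t)}
    prototypical = {tuple(freeze_map.get(t, t) for t in tp) for tp in bgp}
    return freeze_map, prototypical


def melt(freeze_map, graph):
    """Undo the freezing: replace the fresh IRIs by the original variables."""
    inverse = {iri: var for var, iri in freeze_map.items()}
    return {tuple(inverse.get(t, t) for t in triple) for triple in graph}


# BGP of the query "Give the founding members of the EU" (Example 1).
bgp = {("EU", "founder", "?c")}
freeze_map, prototypical = freeze(bgp)
melted = melt(freeze_map, prototypical)
```

Melting the prototypical graph returns the original BGP, which is the round trip that the definition describes.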

The standard query type of SPARQL is the SELECT query, which has the abstract form Q = (W, P), where P is a BGP and W ⊆ var(P). The evaluation $\llbracket Q \rrbracket_G$ over a graph G is obtained by projecting the mappings in $\llbracket P \rrbracket_G$ to W. We study only SELECT queries where W = var(P). In this regard, there is no distinction between bag and set semantics (that is, both semantics coincide since duplicates cannot occur). Besides SELECT queries, our formalization will later rely on CONSTRUCT queries. Given two BGPs P1 and P2 where var(P1) ⊆ var(P2), a CONSTRUCT query has the abstract form (CONSTRUCT P1 P2). Evaluating a CONSTRUCT query over G yields a graph where P1 is instantiated with all the mappings in $\llbracket P_2 \rrbracket_G$.
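The abstract evaluation of BGPs, SELECT queries, and CONSTRUCT queries described in this subsection can be sketched in a few lines of Python. This is a naive nested-loop evaluator written for illustration, not an actual SPARQL engine; the tuple-based encoding and the function names `evaluate`, `select`, and `construct` are assumptions of the sketch, as is the `memberOf` predicate in the example.

```python
def is_var(term):
    return term.startswith("?")


def substitute(mu, bgp):
    """Compute mu P: replace variables in the BGP according to mapping mu."""
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}


def evaluate(bgp, graph):
    """Return all mappings mu with dom(mu) = var(bgp) and mu(bgp) within graph."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mu in mappings:
            for triple in graph:
                cand = dict(mu)
                # A variable must agree with its earlier binding (if any);
                # a constant must equal the term at the same position.
                if all(cand.setdefault(p, t) == t if is_var(p) else p == t
                       for p, t in zip(pattern, triple)):
                    extended.append(cand)
        mappings = extended
    return mappings


def select(w, bgp, graph):
    """Evaluate Q = (W, P) by projecting the mappings of P to W."""
    return [{v: mu[v] for v in w} for mu in evaluate(bgp, graph)]


def construct(template, bgp, graph):
    """Evaluate (CONSTRUCT P1 P2): instantiate P1 with all mappings of P2."""
    result = set()
    for mu in evaluate(bgp, graph):
        result |= substitute(mu, template)
    return result


graph = {("EU", "founder", "ger"), ("EU", "founder", "fra"),
         ("ASEAN", "founder", "sgp")}
founders = select(["?c"], {("EU", "founder", "?c")}, graph)
members = construct({("?c", "memberOf", "EU")}, {("EU", "founder", "?c")}, graph)
```

With W = var(P), as assumed in the paper, `select` returns exactly the mappings of the BGP, so no duplicates arise.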

SPARQL with Negation. SPARQL queries can also include negation. We introduce notation that is concise and more convenient for our purposes than the original SPARQL syntax [5]. A NOT-EXISTS pattern is constructed by negating a BGP using ‘¬∃’. A graph pattern P, as used throughout this paper, is defined as a set of triple patterns and NOT-EXISTS patterns. The positive part of P, denoted P+, consists of all triple patterns in P, and the negative part of P, denoted P−, consists of the BGPs of all NOT-EXISTS patterns in P. The evaluation $\llbracket P \rrbracket_G$ of a graph pattern P over a graph G produces a set of mappings and is defined in [5] as $\{ \mu \in \llbracket P^+ \rrbracket_G \mid \forall P_i \in P^- .\ \llbracket \mu P_i \rrbracket_G = \emptyset \}$. For the fragment with negation, we define the corresponding SELECT query as Q = (var(P+), P), where P is a graph pattern.

Example 2. Consider the query with negation, “Give countries that are not EU founders.” The graph pattern of the query can be written as follows: {(?c, a, country), ¬∃{(EU, founder, ?c)}}. The positive part of the query is (?c, a, country), whereas the negative part is (EU, founder, ?c).

¹⁰ Technically, this definition is ambiguous, since there are infinitely many ways to choose the fresh IRIs and therefore there are infinitely many such mappings and corresponding prototypical graphs. In addition, the choice depends on the pattern P so that a correct notation would include a subscript P. To keep our formalism slim, we do not specify how to choose the IRIs, nor do we introduce the subscript, which would be inessential details for the arguments in the paper.

We refer to the SPARQL fragment where positive and negative parts of queries are constructed from BGPs as the conjunctive SPARQL fragment with negation. Note that BGPs are the basic building blocks underlying other SPARQL extensions [9, 10] as well.
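The evaluation of graph patterns with NOT-EXISTS can be sketched as follows: a mapping of the positive part is kept only if every negated BGP, instantiated by that mapping, has no match. The sketch assumes the same illustrative encoding as before (string terms, `?`-prefixed variables, triples as tuples); the small graph is made up for the example.

```python
def is_var(term):
    return term.startswith("?")


def substitute(mu, bgp):
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}


def evaluate(bgp, graph):
    """Naive BGP evaluation: all mappings mu with mu(bgp) contained in graph."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mu in mappings:
            for triple in graph:
                cand = dict(mu)
                if all(cand.setdefault(p, t) == t if is_var(p) else p == t
                       for p, t in zip(pattern, triple)):
                    extended.append(cand)
        mappings = extended
    return mappings


def evaluate_pattern(positive, negatives, graph):
    """Keep mappings of the positive part for which every negated BGP,
    instantiated by the mapping, evaluates to the empty set."""
    return [mu for mu in evaluate(positive, graph)
            if all(not evaluate(substitute(mu, neg), graph)
                   for neg in negatives)]


# Example 2: countries that are not EU founders, over a toy graph.
graph = {("ger", "a", "country"), ("spa", "a", "country"),
         ("EU", "founder", "ger")}
positive = {("?c", "a", "country")}
negatives = [{("EU", "founder", "?c")}]
answers = evaluate_pattern(positive, negatives, graph)
```

Adding the triple (EU, founder, spa) to the graph would remove the answer for Spain, which is the non-monotonic behaviour that motivates the soundness problem.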

2.2. Completeness Statements

We want to formalize a mechanism for specifying which parts of a data source are complete. When talking about the completeness of a data source, one implicitly compares the information available in the source with a possible state of the source that contains all information that holds in the real world. We model this situation with a pair of graphs: one graph is the available, possibly incomplete state, while the other stands for an ideal, conceptual complete reference, which contains the available graph. In this work, we only consider data sources that may miss information, but do not contain wrong information.

Definition 2 (Extension Pair). An extension pair is apair (G,G′) of two graphs, where G ⊆ G′. We call Gthe available graph and G′ the ideal graph.

In an application, the state stored in an RDF data source is our actual, available graph, which consists of a part of the facts that hold in reality. The full set of facts that constitute the ideal state is, however, unknown. Nevertheless, an RDF data source can be complete for some parts of the reality. In order to make assertions in this regard, we now introduce completeness statements as meta-information about the extent to which the available state captures the ideal state. We adopt the unconditional version of the completeness statements defined in [4].

Definition 3 (Completeness Statement). A completeness statement C has the form Compl(PC), where PC is a non-empty BGP.

For example, we express that a data source is complete for all triples about the EU founders using the statement Ceu = Compl((EU, founder, ?f)).¹¹ For the RDF serialization of completeness statements, we refer the reader to [4]. We now define when a completeness statement is satisfied by an extension pair. To a statement C = Compl(PC), we associate the CONSTRUCT query QC = (CONSTRUCT PC PC). Note that, given a graph G, the query QC returns the graph consisting of those instantiations of the pattern PC present in G. For example, the query QCeu = (CONSTRUCT (EU, founder, ?f) (EU, founder, ?f)) returns the founding members of the EU in G. Intuitively, an extension pair (G,G′) satisfies a completeness statement C if the subgraph of G′ identified by C is also present in G.

¹¹ For the sake of readability, we slightly abuse the notation by removing the set brackets of the BGPs of completeness statements.

Definition 4 (Satisfaction of Completeness Statements). An extension pair (G,G′) satisfies a completeness statement C, written (G,G′) |= C, if $\llbracket Q_C \rrbracket_{G'} \subseteq G$.

The above definition naturally extends to the satisfaction of a set 𝒞 of completeness statements, that is, (G,G′) |= 𝒞 iff for all C ∈ 𝒞, it is the case that $\llbracket Q_C \rrbracket_{G'} \subseteq G$.

An important tool for characterizing completeness entailment is the transfer operator, which evaluates over G all CONSTRUCT queries associated to the statements in 𝒞 and returns the union of the results.

Definition 5 (Transfer Operator). Let 𝒞 be a set of completeness statements. The transfer operator $T_{\mathcal{C}}$ maps every graph G to the subgraph

$$T_{\mathcal{C}}(G) = \bigcup_{C \in \mathcal{C}} \llbracket Q_C \rrbracket_G.$$

As an illustration, consider the statement Ceu as before and the graph Gorg = {(EU, founder, ger), (ASEAN, founder, sgp)}. Then, it is the case that $T_{C_{eu}}(G_{org}) = \{(EU, founder, ger)\}$. We have the following immediate characterization of the transfer operator: for all extension pairs (G,G′), it holds that (G,G′) |= 𝒞 iff $T_{\mathcal{C}}(G') \subseteq G$.
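The transfer operator and the satisfaction check of Definitions 4 and 5 can be sketched as follows. A completeness statement Compl(PC) is represented simply by its BGP PC, and a set of statements by a list of such BGPs; this encoding and the function names are assumptions of the sketch. The extra founder triples in the usage lines are hypothetical and serve only to build extension pairs.

```python
def is_var(term):
    return term.startswith("?")


def substitute(mu, bgp):
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}


def evaluate(bgp, graph):
    """Naive BGP evaluation: all mappings mu with mu(bgp) contained in graph."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mu in mappings:
            for triple in graph:
                cand = dict(mu)
                if all(cand.setdefault(p, t) == t if is_var(p) else p == t
                       for p, t in zip(pattern, triple)):
                    extended.append(cand)
        mappings = extended
    return mappings


def transfer(statements, graph):
    """T_C(G): union of the results of (CONSTRUCT P_C P_C) over the graph."""
    result = set()
    for pc in statements:
        for mu in evaluate(pc, graph):
            result |= substitute(mu, pc)
    return result


def satisfies(statements, available, ideal):
    """(G, G') |= C for an extension pair, i.e. G is within G' and
    T_C(G') is within G."""
    return available <= ideal and transfer(statements, ideal) <= available


c_eu = {("EU", "founder", "?f")}
g_org = {("EU", "founder", "ger"), ("ASEAN", "founder", "sgp")}
transferred = transfer([c_eu], g_org)

# An ideal graph with an EU founder missing from g_org violates c_eu,
# whereas an additional ASEAN founder is not constrained by c_eu.
violating = satisfies([c_eu], g_org, g_org | {("EU", "founder", "fra")})
harmless = satisfies([c_eu], g_org, g_org | {("ASEAN", "founder", "idn")})
```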

3. Motivation and Formal Framework

In this section, we motivate and formalize the problems of query completeness and query soundness, adapting the definitions from [4] to the data-aware setting and to the problem of soundness.

3.1. Query Completeness

Given an RDF graph and a set of completeness statements, we want to check whether a query can be answered completely.


Fig. 2. RDF graph Gcou about countries

3.1.1. Motivating Scenario

Consider the RDF graph Gcou about member countries of the United Nations (UN) and official languages of the members as in Figure 2. Next, consider the query Q0 asking for the UN members and their languages:

Q0 = (W0, P0) = ({?m, ?l}, {(UN, member, ?m), (?m, lang, ?l)})

Evaluating Q0 over the graph gives only one mapping, where the member is mapped to Germany and the language is mapped to de. Up until now, nothing can be said about the completeness of the query since (i) there can be another UN member with an official language; (ii) Germany may have another language; or (iii) the USA may have an official language.

Fig. 3. RDF graph Gcou with completeness statements

Let us consider the same graph as above, now enriched with completeness information, as displayed in Figure 3. The figure illustrates the set Ccou of three completeness statements and the parts of the graph to which the statements apply:

– Cun = Compl((UN, member, ?m)), which states that the graph contains all members of the UN;¹²

– Cger = Compl((ger, lang, ?l)), which states that the graph contains all official languages of Germany;

– Cusa = Compl((usa, lang, ?l)), which states that the graph contains all official languages of the USA (i.e., the USA has no official languages).¹³

With the addition of completeness information, let us see whether we can answer our query completely.

¹² For the sake of example, let us suppose that this is true.

¹³ See, e.g., https://www.cia.gov/library/publications/the-world-factbook/geos/us.html

First, from the statement Cun about UN members, we can infer that the part (UN, member, ?m) of Q0 is complete. By evaluating that part over Gcou, we know that all the UN members are Germany and the USA. In terms of extension pairs, this means that no extension G′cou ⊇ Gcou satisfying Cun has UN members other than Germany and the USA. This allows us to instantiate the query Q0 to the following two queries, which together are intuitively equivalent to Q0 itself:

– Q1 = (W1, P1) = ({?l}, {(UN, member, ger), (ger, lang, ?l)})

– Q2 = (W2, P2) = ({?l}, {(UN, member, usa), (usa, lang, ?l)}),

where we record that the variable ?m has been instantiated by Germany and the USA, respectively.

Our task is now transformed to checking whether Q1 and Q2 can be answered completely. As for Q2, we know from the statement Cusa that our data graph is complete with regard to the triple pattern (usa, lang, ?l). This again allows us to instantiate the query Q2 wrt. the graph Gcou. However, now we come to the situation where there is no matching part in Gcou: instantiating the triple pattern (usa, lang, ?l) returns nothing (i.e., the USA has no official languages). In other words, for any possible extension G′cou of Gcou, as guaranteed by Cusa, the extension G′cou is also empty for the part (usa, lang, ?l). Thus, there is no way that Q2 will return an answer, so it can be safely removed. Here we can also see that we are complete for Q2.

Now, only the query Q1 is left. Again, from the statement Cger, we know that we are complete for the part (ger, lang, ?l) of Q1. This allows us to instantiate the query Q1 to the query Q3, which is intuitively equivalent to Q1 itself:

Q3 = (W3, P3) = ({}, {(UN, member, ger), (ger, lang, de)}),

where we record that the variable ?m has been instantiated by Germany and ?l by de. Our graph is complete for Q3, as it contains the whole ground body of Q3. Clearly, no extension G′cou of Gcou can contain more information about Q3. Now, tracing back our reasoning steps, we know that Q3 is in fact intuitively equivalent to our original query Q0. Since we are complete for Q3, we are also complete for Q0 wrt. our graph and completeness statements. In other words, our statements and graph can guarantee the completeness of the query Q0. Concretely, this means that de is the only official language of Germany, the only UN member with an official language.

In summary, we have reasoned about the completeness of a query given a set of completeness statements and a graph. The reasoning is basically done as follows: (i) we find parts of the query that can be guaranteed to be complete by the completeness statements; (ii) we produce equivalent query instantiations by evaluating those complete query parts over the graph and applying the obtained mappings to the query itself; (iii) for all the query instantiations, we repeat the above steps until no further complete parts can be found. The original query is complete iff all the BGPs of the generated queries are contained in the data graph.

Note that using the data-agnostic completeness reasoning approach of [4], it is not possible to derive the same conclusion. Without looking at the available graph, we cannot conclude that Germany and the USA are all the UN members, since it could be the case that the members are completely different items (i.e., neither Germany nor the USA). Consequently, just knowing that the official languages of Germany and the USA are complete does not help in the reasoning.
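Steps (i) to (iii) of the summary above can be sketched in Python for the motivating scenario. This is an illustrative sketch under stated assumptions, not the algorithm of Section 4: the test for "complete parts" (freeze the BGP, add it to the graph, apply the transfer operator, and keep the triple patterns whose frozen form is returned) is one plausible way to realize step (i), and all function names as well as the `frozen:` prefix for fresh IRIs are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def substitute(mu, bgp):
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}


def evaluate(bgp, graph):
    """Naive BGP evaluation: all mappings mu with mu(bgp) contained in graph."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mu in mappings:
            for triple in graph:
                cand = dict(mu)
                if all(cand.setdefault(p, t) == t if is_var(p) else p == t
                       for p, t in zip(pattern, triple)):
                    extended.append(cand)
        mappings = extended
    return mappings


def transfer(statements, graph):
    result = set()
    for pc in statements:
        for mu in evaluate(pc, graph):
            result |= substitute(mu, pc)
    return result


def complete_part(statements, graph, bgp):
    """Step (i): triple patterns of the BGP covered by the statements.
    Assumed test: the frozen pattern is returned by the transfer operator
    applied to the graph extended with the frozen BGP."""
    fz = {t: "frozen:" + t[1:] for tp in bgp for t in tp if is_var(t)}
    guaranteed = transfer(statements, graph | substitute(fz, bgp))
    return {tp for tp in bgp if tuple(fz.get(t, t) for t in tp) in guaranteed}


def entails_completeness(statements, graph, bgp):
    pending = [frozenset(bgp)]
    while pending:
        p = pending.pop()
        part = complete_part(statements, graph, p)
        mappings = evaluate(part, graph)
        if any(is_var(t) for tp in part for t in tp):
            # Step (ii): instantiate the BGP with the mappings of its complete
            # part; with no mappings, this BGP can never match and is dropped.
            pending.extend(frozenset(substitute(mu, p)) for mu in mappings)
        elif mappings and not p <= graph:
            # Step (iii) has reached a fixpoint, but the BGP is not contained
            # in the graph: some valid extension may yield a new answer.
            return False
    return True


g_cou = {("UN", "member", "ger"), ("UN", "member", "usa"), ("ger", "lang", "de")}
c_un = {("UN", "member", "?m")}
c_ger = {("ger", "lang", "?l")}
c_usa = {("usa", "lang", "?l")}
p0 = {("UN", "member", "?m"), ("?m", "lang", "?l")}

with_all = entails_completeness([c_un, c_ger, c_usa], g_cou, p0)
without_usa = entails_completeness([c_un, c_ger], g_cou, p0)
```

Dropping Cusa makes the check fail, mirroring case (iii) of the scenario: the USA might then have an official language that is missing from the graph.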

3.1.2. Formalization of Completeness Reasoning

When querying a data source, we want to know whether the data source provides sufficient information to retrieve all answers to the query, that is, whether the query is complete wrt. the real world. For instance, when querying for members of the UN, it would be interesting to know whether we really get all such countries. Intuitively, a query is complete over an extension pair whenever all answers we retrieve over the ideal graph are also retrieved over the available graph. We now define query completeness wrt. extension pairs.

Definition 6 (Query Completeness). To express that a query Q is complete, we write Compl(Q). An extension pair (G,G′) satisfies Compl(Q) if the result of Q evaluated over G′ also appears in the result of Q over G, that is, $\llbracket Q \rrbracket_{G'} \subseteq \llbracket Q \rrbracket_G$.¹⁴ In this case, we write (G,G′) |= Compl(Q).

The above definition can be naturally adapted to the completeness of a BGP P, written Compl(P), which is used in the remainder of the paper: an extension pair (G,G′) satisfies Compl(P), written (G,G′) |= Compl(P), if $\llbracket P \rrbracket_{G'} \subseteq \llbracket P \rrbracket_G$.
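For a concrete extension pair, the condition of Definition 6 can be checked directly by comparing the answers over both graphs. The following sketch does this for BGPs, again under the illustrative tuple encoding; the two ideal graphs are hypothetical and only serve to contrast a violating and a harmless extension of the graph from the motivating scenario.

```python
def is_var(term):
    return term.startswith("?")


def evaluate(bgp, graph):
    """Naive BGP evaluation: all mappings mu with mu(bgp) contained in graph."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mu in mappings:
            for triple in graph:
                cand = dict(mu)
                if all(cand.setdefault(p, t) == t if is_var(p) else p == t
                       for p, t in zip(pattern, triple)):
                    extended.append(cand)
        mappings = extended
    return mappings


def is_complete(bgp, available, ideal):
    """(G, G') |= Compl(P): every answer over the ideal graph is also
    an answer over the available graph."""
    available_answers = evaluate(bgp, available)
    return all(mu in available_answers for mu in evaluate(bgp, ideal))


g_cou = {("UN", "member", "ger"), ("UN", "member", "usa"), ("ger", "lang", "de")}
p0 = {("UN", "member", "?m"), ("?m", "lang", "?l")}

# Hypothetical ideal graphs: the first adds a language for a UN member,
# the second adds a language for a country that is not listed as a member.
ideal_a = g_cou | {("usa", "lang", "en")}
ideal_b = g_cou | {("fra", "lang", "fr")}
complete_a = is_complete(p0, g_cou, ideal_a)
complete_b = is_complete(p0, g_cou, ideal_b)
```

Completeness entailment (Definition 7) quantifies over all extension pairs satisfying the statements, so it cannot be decided by enumerating ideal graphs in this way.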

Now, the question arises as to when some meta-information about data completeness can provide a guarantee for query completeness. In other words, we ask when the available state contains all data, as guaranteed by completeness statements, that is required for computing the query answers, so that one can trust the result of the query. In the following, we define completeness entailment.

¹⁴ For monotonic queries, the other direction, that is, $\llbracket Q \rrbracket_{G'} \supseteq \llbracket Q \rrbracket_G$, comes for free. Hence, we sometimes use the ‘=’ condition when queries are monotonic.

Definition 7 (Completeness Entailment). Let 𝒞 be a set of completeness statements, G a graph, and Q a query. Then 𝒞 and G entail the completeness of Q, written 𝒞,G |= Compl(Q), if for all extension pairs (G,G′) |= 𝒞, it holds that (G,G′) |= Compl(Q).

In our motivating scenario, we have seen that the graph about the UN and the completeness statements entail the completeness of the query Q0 asking for members of the UN and their official languages.

Since for our SPARQL queries all variables are distinguished, the set of query answers for such a query is the same as the set of mappings satisfying the BGP. We can therefore focus on the BGPs used in the body of queries for completeness entailment. Note that here (again) the distinction between bag and set semantics collapses. The following proposition provides an initial characterization of completeness entailment, which will serve as a starting point to develop formal notions for completeness checking and an algorithm in Section 4. Basically, for a set of completeness statements, a graph, and a BGP, the completeness entailment holds if and only if extending the graph with a possible BGP instantiation (by some mapping) such that the extension satisfies the statements always results in the inclusion of the BGP instantiation in the graph itself.

Proposition 1. Let 𝒞 be a set of completeness statements, G a graph, and P a BGP. Then the following are equivalent:

1. 𝒞,G |= Compl(P);

2. for every mapping µ such that dom(µ) = var(P) and (G, G ∪ µP) |= 𝒞, it is the case that µP ⊆ G.

In other words, the completeness entailment does not hold if and only if we can find a possible BGP instantiation (by some mapping) such that the extension satisfies the statements, but the BGP instantiation is not contained in the graph. The idea here is that, as demonstrated in our motivating example, by using completeness statements we always try to find complete parts of BGPs and instantiate them over the graph, until either all the instantiations are included in the graph (= the success case), or there is one instantiation that is not included there (= the failure case).


3.2. Query Soundness

Here, we motivate the second main problem of this work, query soundness. The problem comes in two variants: answer soundness and pattern soundness.

3.2.1. Answer Soundness

In a nutshell, a mapping is a sound answer for a query with negation over a given graph if it continues to be an answer over all possible completions of the graph. For example, consider the following graph pattern, asking for countries where en is not an official language and whose official languages (if any) do not include an official language of an EU founder:

Pl = {(?c, a, country), ¬∃{(?c, lang, en)}, ¬∃{(?c, lang, ?l), (?f, lang, ?l), (EU, founder, ?f)}}.

For the sake of example, consider the followinggraph about countries:

Gl = {(ger, a, country), (usa, a, country), (sgp, a, country), (spa, a, country), (ger, lang, de), (spa, lang, es), (EU, founder, ger)}.

For this graph, consider also the set Cl of the following four completeness statements: the first two are Cger and Cusa, as introduced in the motivating scenario of query completeness. The other two are Cspa for all official languages of Spain and Ceu for all EU founders.¹⁵ Note that we do not claim anything about the completeness of the official languages of Singapore (= sgp).

Evaluating the graph pattern over the graph in the standard way results in ⟦Pl⟧_Gl = {{?c ↦ usa}, {?c ↦ sgp}, {?c ↦ spa}}. We want to verify whether these answers are sound, that is, whether they cannot have been returned merely because of possibly incomplete information. This amounts to checking that there is no valid extension of Gl wrt. Cl over which the answers are not returned.

Let us analyze {?c ↦ usa}. First, we check whether (usa, lang, en) is certainly not true. Indeed, since we know from the graph and the statement Cusa that the USA has no official languages, the triple (usa, lang, en) cannot be true. Second, we check whether the BGP {(usa, lang, ?l), (?f, lang, ?l), (EU, founder, ?f)} surely fails. This is clearly the case for the same reason as before, namely that there is no official language of the USA. From this reasoning, we conclude that the answer {?c ↦ usa} is sound.

¹⁵ For the sake of example, suppose this is true.

Next, let us analyze {?c ↦ sgp}. We check whether (sgp, lang, en) is indeed not true, that is, whether there is no valid extension where (sgp, lang, en) is true. Now we have a problem: due to the lack of completeness information, it might be that in reality en is an official language of Singapore, but the fact is missing in our data. Thus, we cannot guarantee the soundness of {?c ↦ sgp}.

Last, let us analyze {?c ↦ spa}. First, we check whether the triple (spa, lang, en) is not true. Since we know from Cspa and the graph that Spain's only official language is es, the triple (spa, lang, en) cannot be true. Second, we check whether the BGP {(spa, lang, ?l), (?f, lang, ?l), (EU, founder, ?f)} surely fails. From the graph and the statements Cger and Ceu, we know that Germany is the only EU founder and that its only official language is de, which is different from es. Thus, the BGP must fail. We therefore conclude that the answer {?c ↦ spa} is sound.

In summary, our analysis established for which answers the NOT-EXISTS patterns of the query pattern surely fail, and thus which answers are sound.

3.2.2. Pattern Soundness

Consider now the graph pattern asking for countries where en is not an official language and that are not EU founders:

Pf = {(?c, a, country), ¬∃{(?c, lang, en)}, ¬∃{(EU, founder, ?c)}}.

Consider also the set Cf consisting of two completeness statements: Clang for all languages of countries and Ceu for all EU founders. We will show that the statements guarantee that all answers returned by Pf are sound, independently of the queried graph. In such a case, we say that the pattern itself is sound.

Let us see why Cf guarantees, for any possible graph, the soundness of all answers to Pf. Consider a graph G and suppose that pattern evaluation over G returns {?c ↦ c} for an IRI c. Consider also an arbitrary extension G′ of G such that (G, G′) |= Cf. To show that {?c ↦ c} is sound, we must make sure that c neither has en as an official language nor is an EU founder, over G as well as over G′. By the statement Clang, G is complete for all languages of countries. Therefore, G is also complete for all languages of c. The fact that c is returned by Pf over G means that en is not among its official languages according to G, and due to completeness, also not according to G′. Moreover, the fact that c is returned over G means that c is not an EU founder according to G, and, since G is complete for all EU founders due to Ceu, also not according to G′. Thus we can be sure that the answer {?c ↦ c} is sound. Since the answer and the graph were arbitrary, we conclude that the set Cf of completeness statements entails the soundness of Pf.

In this scenario, as opposed to answer soundness, we have reasoned for a graph pattern whether the soundness of an arbitrary answer over an arbitrary graph can be guaranteed by a set of completeness statements.

3.3. Formalization of Soundness Reasoning

In the following, we provide definitions of answer soundness and pattern soundness.

Definition 8 (Answer Soundness). To express that a mapping µ is sound for a graph pattern P, we write Sound(µ, P). We say that an extension pair (G, G′) satisfies Sound(µ, P) if, whenever µ ∈ ⟦P⟧_G, then also µ ∈ ⟦P⟧_G′. In this case we write (G, G′) |= Sound(µ, P).

From the above definition, for µ ∉ ⟦P⟧_G it is trivial that (G, G′) |= Sound(µ, P). Thus, we are only interested in the soundness of answers occurring in ⟦P⟧_G.

Definition 9 (Answer Soundness Entailment). Given a set C of completeness statements, a graph G, a graph pattern P, and a mapping µ ∈ ⟦P⟧_G, we say that C and G entail the soundness of µ for P, written as C, G |= Sound(µ, P), if for all extension pairs (G, G′) |= C, it holds that (G, G′) |= Sound(µ, P).

In our motivating scenario we saw that for all possible completions of the graph, usa is a sound answer while sgp is not. Thus, Cl, Gl |= Sound({?c ↦ usa}, Pl), whereas Cl, Gl ⊭ Sound({?c ↦ sgp}, Pl).
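Definitions 8 and 9 can be exercised on the motivating scenario with a small Python sketch. It is only an illustration under assumptions that are not stated in this section: triples are plain tuples, variables are strings starting with `?`, the four statements of Cl are modelled as single-triple BGPs, and an extension pair is taken to satisfy a statement Compl(P_C) when every instantiation of P_C over G′ already lies in G. All function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def substitute(mu, bgp):
    """Apply mapping mu to every triple pattern of bgp."""
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}

def evaluate(bgp, graph):
    """All mappings mu with dom(mu) = var(bgp) and mu(bgp) contained in graph."""
    results = [{}]
    for tp in bgp:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                if all(nu.setdefault(p, g) == g if is_var(p) else p == g
                       for p, g in zip(tp, triple)):
                    extended.append(nu)
        results = extended
    return results

def evaluate_pattern(pos, negs, graph):
    """Answers of a pattern with positive part pos and NOT-EXISTS BGPs negs."""
    return [mu for mu in evaluate(pos, graph)
            if all(not evaluate(substitute(mu, n), graph) for n in negs)]

def satisfies(statements, g, g_ext):
    """Assumed semantics of (g, g_ext) |= C for unconditional statements."""
    return all(substitute(mu, pc) <= g
               for pc in statements for mu in evaluate(pc, g_ext))

def sound_on_pair(mu, pos, negs, g, g_ext):
    """(g, g_ext) |= Sound(mu, P) in the sense of Definition 8."""
    return (mu not in evaluate_pattern(pos, negs, g)
            or mu in evaluate_pattern(pos, negs, g_ext))

G_l = {("ger", "a", "country"), ("usa", "a", "country"), ("sgp", "a", "country"),
       ("spa", "a", "country"), ("ger", "lang", "de"), ("spa", "lang", "es"),
       ("EU", "founder", "ger")}
POS = [("?c", "a", "country")]
NEGS = [[("?c", "lang", "en")],
        [("?c", "lang", "?l"), ("?f", "lang", "?l"), ("EU", "founder", "?f")]]
# Assumed shape of Cger, Cusa, Cspa, Ceu.
C_l = [[("ger", "lang", "?l")], [("usa", "lang", "?l")],
       [("spa", "lang", "?l")], [("EU", "founder", "?f")]]

G_ext = G_l | {("sgp", "lang", "en")}   # a valid extension that breaks sgp
answers = sorted(mu["?c"] for mu in evaluate_pattern(POS, NEGS, G_l))
print(answers)
print(satisfies(C_l, G_l, G_ext))
print(sound_on_pair({"?c": "sgp"}, POS, NEGS, G_l, G_ext))
```

A single extension pair can only refute soundness entailment: the pair above witnesses that sgp is not sound, whereas establishing soundness of usa requires reasoning over all valid extensions.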

Pattern soundness, as opposed to answer soundness, is concerned with a graph pattern as a whole and abstracts from any specific answers of the pattern.

Definition 10 (Pattern Soundness). We express that a graph pattern P is sound by Sound(P). We say that P is sound over the extension pair (G, G′), written (G, G′) |= Sound(P), if ⟦P⟧_G ⊆ ⟦P⟧_G′.

Thus, a pattern is sound whenever all answers in the evaluation over G are also present in that over G′. Entailment of pattern soundness by completeness statements is defined in a natural manner.

Definition 11 (Pattern Soundness Entailment). A set of completeness statements C entails the soundness of a graph pattern P, written C |= Sound(P), if for all extension pairs (G, G′) |= C, it holds that (G, G′) |= Sound(P).

In our motivating scenario, it is the case that Cf |= Sound(Pf). Combining the definitions above, we see that a pattern is sound if and only if over all graphs all its answers are sound.

Proposition 2. Let C be a set of completeness statements and P be a graph pattern. Then, C |= Sound(P) iff C, G |= Sound(µ, P) for every graph G and mapping µ ∈ ⟦P⟧_G.

4. Checking Query Completeness

In this section, we introduce formal notions and present an algorithm for checking the entailment of query completeness. We also analyze the complexity of the entailment problem.

4.1. Formal Notions

First, we need a notion for a BGP together with a stored mapping that records earlier variable instantiations. This allows us to represent BGP instantiations in our completeness entailment procedure. We define a partially mapped BGP as a pair (P, µ) where P is a BGP and µ is a mapping with dom(µ) ∩ var(P) = ∅. Over a graph G, the evaluation of (P, µ) is defined as ⟦(P, µ)⟧_G = {µ ∪ ν | ν ∈ ⟦P⟧_G}. It is easy to see that P ≡ (P, µ∅). Furthermore, we define the evaluation of a set of partially mapped BGPs over a graph G as the union of the evaluations of each of them over G.

Example 3. Consider our motivating scenario. Over the BGP P0 of the query Q0, instantiating the variable ?m to ger results in the BGP P1 of the query Q1. Pairing P1 with this instantiation gives the partially mapped BGP (P1, {?m ↦ ger}). Moreover, it is the case that ⟦(P1, {?m ↦ ger})⟧_Gcou = {{?m ↦ ger, ?l ↦ de}}.

Next, we formalize when two partially mapped BGPs are equivalent wrt. a set C of completeness statements and a graph G. Essentially, this is the case if they return the same answers over all possible extensions of G. We need this notion to capture the equivalence of the BGP instantiations that resulted from the evaluation of complete BGP parts.


Definition 12 (Equivalence under C and G). Let (P, µ) and (P′, ν) be partially mapped BGPs, C be a set of completeness statements, and G be a graph. Then (P, µ) is equivalent to (P′, ν) wrt. C and G, written (P, µ) ≡_C,G (P′, ν), if for all (G, G′) |= C, it holds that ⟦(P, µ)⟧_G′ = ⟦(P′, ν)⟧_G′.

The above definition naturally extends to sets of partially mapped BGPs.

Example 4. Consider the queries in our motivating scenario in Section 3.1.1. The equivalences discussed there can now be stated as

{(P0, µ∅)} ≡_Ccou,Gcou {(P1, {?m ↦ ger}), (P2, {?m ↦ usa})} ≡_Ccou,Gcou {(P3, {?m ↦ ger, ?l ↦ de})}.

Next, we would like to figure out which parts of a BGP contain variables that can be instantiated completely. The idea is to 'match' completeness statements to the BGP and the graph, and to return the matched parts of the BGP. Note that the matching also takes the graph into account, since for a single completeness statement it might be the case that some parts of it have to be matched to the BGP while the remaining parts are matched to the graph. For this reason, using the transfer operator from Definition 5, we define

cruc_C,G(P) = P ∩ ĩd⁻¹(T_C(P̃ ∪ G))    (1)

as the crucial part of P wrt. C and G. In the crucial part computation, we evaluate the T_C operator over the union of the prototypical graph P̃ and the graph G in order to obtain the complete parts wrt. the set C of completeness statements. However, we are only interested in the complete parts that overlap with our BGP P. Therefore, we undo the 'freezing' effect of the prototypical graph by the melt operator ĩd⁻¹ to transform it back into a BGP, and then compute the intersection with the BGP P to get the complete parts that are relevant to P.¹⁶

By its construction, we are complete for the crucial part, that is, C, G |= Compl(cruc_C,G(P)). Later on, we will see that the crucial part can be used to guide the instantiation process during completeness entailment checking.

Example 5. Consider the query Q0 = (W0, P0) in our motivating scenario. Applying the freeze mapping ĩd = {?m ↦ m̃, ?l ↦ l̃} to P0, we obtain P̃0 = {(UN, member, m̃), (m̃, lang, l̃)}. Among the three completeness statements in Ccou, only Cun and Cger can be applied to P̃0 ∪ Gcou, and their application results in T_Ccou(P̃0 ∪ Gcou) = {(UN, member, usa), (UN, member, ger), (UN, member, m̃), (ger, lang, de)}. Undoing the freezing and intersecting with P0 results in cruc_Ccou,Gcou(P0) = P0 ∩ ĩd⁻¹(T_Ccou(P̃0 ∪ Gcou)) = {(UN, member, ?m)}. Consequently, the graph Gcou provides us with a complete instantiation of the UN members.

¹⁶ Here we slightly abuse the notation by applying a mapping (which itself is actually a 'reverse' mapping) to a graph instead of a BGP.

Consider next the pattern P2 = {(UN, member, usa), (usa, lang, ?l)}. Now, all completeness statements in Ccou apply to the frozen version of P2 and the graph, which gives us T_Ccou(P̃2 ∪ Gcou) = {(UN, member, ger), (UN, member, usa), (ger, lang, de), (usa, lang, l̃)}. Melting the frozen variable and intersecting with P2 then results in cruc_Ccou,Gcou(P2) = P2.

Consider now the pattern P3 = {(UN, member, ger), (ger, lang, de)}. Then P3 is ground and P̃3 = P3 ⊆ Gcou. Only the statements Cun and Cger apply to Gcou, yielding T_Ccou(P̃3 ∪ Gcou) = T_Ccou(Gcou) = Gcou. Thus, cruc_Ccou,Gcou(P3) = P3 ∩ Gcou = P3.
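The crucial part computation of Equation (1) can be sketched in a few lines of Python. This is an illustrative reading rather than a reference implementation: the graph Gcou and the three statements of Ccou are reconstructed from the example as single-triple BGPs, the transfer operator of Definition 5 is assumed to collect every instantiation of a statement's BGP that matches the given graph, and frozen variables are encoded with a hypothetical `~` prefix.

```python
def is_var(term):
    return term.startswith("?")

def substitute(mu, bgp):
    """Apply mapping mu to every triple pattern of bgp."""
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}

def evaluate(bgp, graph):
    """All mappings mu with dom(mu) = var(bgp) and mu(bgp) contained in graph."""
    results = [{}]
    for tp in bgp:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                if all(nu.setdefault(p, g) == g if is_var(p) else p == g
                       for p, g in zip(tp, triple)):
                    extended.append(nu)
        results = extended
    return results

def freeze(bgp):
    """Prototypical graph: each variable ?x becomes the fresh constant ~x."""
    return {tuple("~" + t[1:] if is_var(t) else t for t in tp) for tp in bgp}

def melt(graph):
    """Inverse of freeze: turn ~x back into ?x."""
    return {tuple("?" + t[1:] if t.startswith("~") else t for t in tr)
            for tr in graph}

def transfer(statements, graph):
    """Assumed reading of T_C: instantiations of statement BGPs matching graph."""
    return {tr for pc in statements for mu in evaluate(pc, graph)
            for tr in substitute(mu, pc)}

def cruc(statements, graph, bgp):
    """Equation (1): intersect P with the melted transfer of frozen P and G."""
    return set(bgp) & melt(transfer(statements, freeze(bgp) | graph))

G_cou = {("UN", "member", "ger"), ("UN", "member", "usa"), ("ger", "lang", "de")}
# Assumed shape of Cun, Cger, Cusa.
C_cou = [[("UN", "member", "?m")], [("ger", "lang", "?l")], [("usa", "lang", "?l")]]
P0 = [("UN", "member", "?m"), ("?m", "lang", "?l")]
P2 = [("UN", "member", "usa"), ("usa", "lang", "?l")]
P3 = [("UN", "member", "ger"), ("ger", "lang", "de")]

print(cruc(C_cou, G_cou, P0))
print(cruc(C_cou, G_cou, P2) == set(P2))
print(cruc(C_cou, G_cou, P3) == set(P3))
```

Under these assumptions the three calls reproduce the crucial parts derived in Example 5.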

The operator below implements the instantiations of a partially mapped BGP wrt. its crucial part.

Definition 13 (Equivalent Partial Grounding). Let C be a set of completeness statements, G be a graph, and (P, ν) be a partially mapped BGP. We define the operator equivalent partial grounding:

epg((P, ν), C, G) = {(µP, ν ∪ µ) | µ ∈ ⟦cruc_C,G(P)⟧_G}.

The following proposition shows that such instantiations produce a set of partially mapped BGPs equivalent to the original partially mapped BGP, hence the name equivalent partial grounding. It holds essentially because the instantiation is done over the crucial part, which is complete wrt. C and G.

Proposition 3 (Equivalent Partial Grounding). Let C be a set of completeness statements, G a graph, and (P, ν) a partially mapped BGP. Then

{(P, ν)} ≡_C,G epg((P, ν), C, G).

Example 6. Consider our motivating scenario. Recall from Example 5 the crucial parts of the BGPs P2, P3, and P0:

– cruc_Ccou,Gcou(P2) = P2;
– cruc_Ccou,Gcou(P3) = P3;
– cruc_Ccou,Gcou(P0) = Pun = {(UN, member, ?m)}.

Evaluating the crucial parts over the graph Gcou yields

– ⟦P2⟧_Gcou = ∅;
– ⟦P3⟧_Gcou = {µ∅};
– ⟦Pun⟧_Gcou = {{?m ↦ ger}, {?m ↦ usa}}.

The corresponding equivalent partial groundings are

– epg((P2, {?m ↦ usa}), Ccou, Gcou) = ∅;
– epg((P3, {?m ↦ ger, ?l ↦ de}), Ccou, Gcou) = {(P3, {?m ↦ ger, ?l ↦ de})};
– epg((P0, µ∅), Ccou, Gcou) = {(P1, {?m ↦ ger}), (P2, {?m ↦ usa})}.

Generalizing from the example above, there are three cases of the operator epg((P, ν), C, G):

– If ⟦cruc_C,G(P)⟧_G = ∅, it returns an empty set.
– If ⟦cruc_C,G(P)⟧_G = {µ∅}, it returns {(P, ν)}.
– Otherwise, it returns a non-empty set of partially mapped BGPs, where some variables in P are instantiated.

Let us describe what these three cases mean for our completeness entailment procedure, and how their applications lead to termination. The first case corresponds to the non-existence of the query answer in any possible extension of the graph that satisfies the set of completeness statements (e.g., the case of the USA's official languages). As demonstrated in Example 6, the first case removes partially mapped BGPs. This is due to the emptiness of the evaluation of the crucial part over the graph. The third case corresponds to the instantiation of complete parts (that is, the crucial parts) of the BGP. It generates more specific partially mapped BGPs that are equivalent to the original partially mapped BGP.

As for the second case, it generates exactly the same partially mapped BGP as the original one. Here, we need a special treatment. We say that a partially mapped BGP (P, ν) is saturated wrt. C and G if epg((P, ν), C, G) = {(P, ν)}, that is, if the second case above applies. Note that the notion of saturation is independent of the mapping in a partially mapped BGP: given a mapping ν, a partially mapped BGP (P, ν) is saturated wrt. C and G iff (P, ν′) is saturated wrt. C and G for any mapping ν′. Thus, we say that a BGP P is saturated wrt. C and G if (P, µ∅) is saturated.

Saturated BGPs hold the key to whether our completeness entailment check succeeds: completeness of saturated BGPs is simply checked by testing whether they are contained in the graph G. Furthermore, once the repeated epg applications with the above three cases hit the saturated case (i.e., the second case), we can readily check completeness entailment.¹⁷ Note that this also ensures the termination of the epg applications.

Lemma 1 (Completeness Entailment of Saturated BGPs). Let P be a BGP, C a set of completeness statements, and G a graph. Suppose P is saturated wrt. C and G. Then:

C, G |= Compl(P) iff P ⊆ G.

By consolidating all the above notions, we are ready to provide an algorithm to check data-aware completeness entailment.

4.2. Algorithm

So far, we have defined the cruc operator to find parts of a BGP that can be instantiated completely. The instantiation process wrt. the crucial part is carried out by the epg operator. We have also learned that repeated application of the epg operator results in saturated BGPs, for which we have to check whether they are contained in the graph in order to know whether our original BGP is complete. Algorithm 1 computes a function that, given a set of completeness statements C, a graph G, and a BGP P, returns the set

sat(P, C, G)

of all mappings that have two properties: the BGP instantiation by each of the mappings is a saturated BGP wrt. C and G; and the original BGP is equivalent wrt. C and G to the set of BGP instantiations produced from all the resulting mappings of the algorithm.

Let us now describe how Algorithm 1 works. Consider a BGP Porig, a set C of completeness statements, and a graph G. First, we transform our original BGP Porig into its equivalent partially mapped BGP (Porig, µ∅) and put it in Pworking. Then, in each iteration of the while loop, we take and remove a partially mapped BGP (P, ν) from Pworking via the method takeOne. Afterwards, we compute epg((P, ν), C, G).

¹⁷ Essentially, it might be that the epg applications hit the first case only, that is, there are no more partially mapped BGPs that have to be checked for completeness. For this case, completeness entailment trivially holds.


ALGORITHM 1: sat(Porig, C, G)

Input: A BGP Porig, a set C of completeness statements, a graph G
Output: A set Ω of mappings

 1  Pworking ← {(Porig, µ∅)}
 2  Ω ← ∅
 3  while Pworking ≠ ∅ do
 4      (P, ν) ← takeOne(Pworking)
 5      Pequiv ← epg((P, ν), C, G)
 6      if Pequiv = {(P, ν)} then
 7          Ω ← Ω ∪ {ν}
 8      else
 9          Pworking ← Pworking ∪ Pequiv
10      end
11  end
12  return Ω

As discussed above, there can be three result cases here: (i) if epg((P, ν), C, G) = ∅, then we simply remove (P, ν) and will not consider it anymore in later iterations; (ii) if epg((P, ν), C, G) = {(P, ν)}, that is, (P, ν) is saturated, then we add the mapping ν to the set Ω; and (iii) otherwise, we add to Pworking a set of partially mapped BGPs instantiated from (P, ν). We keep iterating until Pworking = ∅, and finally return the set Ω.

The next proposition about the function computed by Algorithm 1 follows from the construction of the algorithm and Proposition 3.

Proposition 4. Let P be a BGP, C a set of completeness statements, and G a graph. Then the following properties hold:

– {(P, µ∅)} ≡_C,G {(µP, µ) | µ ∈ sat(P, C, G)};
– µP is saturated wrt. C and G, for all mappings µ ∈ sat(P, C, G).

From the above proposition, we can derive the following theorem, which shows the soundness and completeness of the algorithm to check completeness entailment.

Theorem 1 (Completeness Entailment Check). Let P be a BGP, C a set of completeness statements, and G a graph. Then the following are equivalent:

1. C, G |= Compl(P);
2. µP ⊆ G, for all µ ∈ sat(P, C, G).

Example 7. Consider our motivating scenario. Then sat(P0, Ccou, Gcou) = {{?m ↦ ger, ?l ↦ de}}. For every mapping µ in sat(P0, Ccou, Gcou), it holds that µP0 ⊆ Gcou. Thus, by Theorem 1 the entailment Ccou, Gcou |= Compl(P0) holds.
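Algorithm 1 and the check of Theorem 1 can be sketched in Python as follows. The sketch makes several assumptions that do not come from this section: Gcou and the three statements of Ccou are modelled as single-triple BGPs, the transfer operator T_C of Definition 5 is read as collecting every instantiation of a statement's BGP that matches the given graph, and frozen variables carry a hypothetical `~` prefix. Function names are illustrative only.

```python
def is_var(term):
    return term.startswith("?")

def substitute(mu, bgp):
    """Apply mapping mu to a BGP; unbound variables stay as they are."""
    return frozenset(tuple(mu.get(t, t) for t in tp) for tp in bgp)

def evaluate(bgp, graph):
    """All mappings mu with dom(mu) = var(bgp) and mu(bgp) contained in graph."""
    results = [{}]
    for tp in bgp:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                if all(nu.setdefault(p, g) == g if is_var(p) else p == g
                       for p, g in zip(tp, triple)):
                    extended.append(nu)
        results = extended
    return results

def freeze(bgp):
    return {tuple("~" + t[1:] if is_var(t) else t for t in tp) for tp in bgp}

def melt(graph):
    return {tuple("?" + t[1:] if t.startswith("~") else t for t in tr)
            for tr in graph}

def transfer(statements, graph):
    """Assumed reading of T_C."""
    return {tr for pc in statements for mu in evaluate(pc, graph)
            for tr in substitute(mu, pc)}

def cruc(statements, graph, bgp):
    return set(bgp) & melt(transfer(statements, freeze(bgp) | graph))

def sat(bgp, statements, graph):
    """Algorithm 1: mappings whose instantiations of bgp are saturated."""
    working = [(frozenset(bgp), {})]
    omega = []
    while working:
        p, nu = working.pop()                       # takeOne
        mus = evaluate(cruc(statements, graph, p), graph)
        if mus == [{}]:                             # epg = {(p, nu)}: saturated
            omega.append(nu)
        else:                                       # epg is empty or more specific
            working.extend((substitute(mu, p), {**nu, **mu}) for mu in mus)
    return omega

def entails_completeness(bgp, statements, graph):
    """Theorem 1: mu(bgp) must be contained in graph for every mu in sat."""
    return all(substitute(mu, bgp) <= graph
               for mu in sat(bgp, statements, graph))

G_cou = {("UN", "member", "ger"), ("UN", "member", "usa"), ("ger", "lang", "de")}
C_cou = [[("UN", "member", "?m")], [("ger", "lang", "?l")], [("usa", "lang", "?l")]]
P0 = [("UN", "member", "?m"), ("?m", "lang", "?l")]

print(sat(P0, C_cou, G_cou))
print(entails_completeness(P0, C_cou, G_cou))
print(entails_completeness(P0, C_cou[:2], G_cou))   # without Cusa
```

Dropping Cusa in the last call leaves the instantiation for usa saturated but not contained in the graph, so the check fails, which matches the intuition that nothing is then known about the official languages of the USA.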

Looking back at the initial characterization of completeness entailment in Proposition 1, we see that it does not give us a concrete way to compute a set of mappings to be used in checking completeness entailment. Now, by Theorem 1, it is sufficient for completeness entailment checking to consider only the mappings in sat(P, C, G), which we know how to compute.

4.2.1. Simple Practical Considerations

From the above algorithm for computing saturated mappings and Theorem 1, a corresponding algorithm for checking completeness entailment can easily be derived. In what follows, we provide two simple heuristic techniques for the completeness checking algorithm: early failure detection and completeness skip. More elaborate heuristics are given in Section 6.

Early Failure Detection. In our algorithm, the containment checks for saturated BGPs are done at the end. Indeed, if there is a single saturated BGP not contained in the graph, we cannot guarantee query completeness (recall Theorem 1). Thus, instead of having to collect all saturated BGPs and then check the containment later on, we can improve the performance of the algorithm by performing the containment check right after the saturation check (Line 6 of the algorithm). So, as soon as there is a failure in the containment check, we stop the loop and conclude that the completeness entailment does not hold.

Completeness Skip. Recall the definition of epg as epg((P, ν), C, G) = {(µP, ν ∪ µ) | µ ∈ ⟦cruc_C,G(P)⟧_G}, which relies on the cruc operator. Now, suppose that cruc_C,G(P) = P, implying that we are complete for the whole BGP P. Then we do not actually have to instantiate P in the epg operator, since we know that the instantiation results will be contained in G anyway due to P's completeness wrt. C and G. In conclusion, whenever cruc_C,G(P) = P, we just remove the corresponding (P, ν) from Pworking and thus skip its instantiations.

4.3. Complexity

In this subsection, we analyze the complexity of the problem of data-aware completeness entailment. While checking data-agnostic completeness entailment is NP-complete [4], the addition of the data graph to the entailment increases the complexity: the problem is now Π^P_2-complete. The hardness is shown by a reduction from the validity problem for ∀∃3SAT formulas.


Proposition 5. Deciding the entailment C, G |= Compl(P), given a set C of completeness statements, a graph G, and a BGP P, is Π^P_2-complete.

One might wonder what the complexity of the entailment problem would be if some parts of the input were fixed. We answer this question in the following series of propositions.

The following proposition states that the entailment does not become easier if the graph is fixed. The reason is that the reduction from the validity problem for ∀∃3SAT formulas uses one fixed graph.

Proposition 6. Let G be a graph. Deciding the entailment C, G |= Compl(P), given a set C of completeness statements and a BGP P, is in Π^P_2. There is a graph for which the problem is Π^P_2-complete.

Now, we want to see the complexity when the BGP P is fixed. Recall that in the algorithm, P dominates the complexity of the instantiation process in the epg operator. When it is fixed, the size of the instantiations is bounded polynomially, and the entailment problem becomes NP-complete. Note that it is still NP-hard even when the input graph G is fixed.

Proposition 7. Let P be a BGP. Deciding the entailment C, G |= Compl(P), given a set C of completeness statements and a graph G, is in NP. There is a BGP for which the problem is NP-complete. This still holds when the graph is fixed.

Let us now see the complexity when the set of statements C is fixed. In the algorithm, C dominates the complexity of the T_C operator used in computing the crucial part. When it is fixed, the T_C operator can be applied in PTIME, and the entailment problem becomes CoNP-complete. Again, fixing also the graph does not change the complexity.

Proposition 8. Let C be a set of completeness statements. Deciding the entailment C, G |= Compl(P), given a graph G and a BGP P, is in CoNP. There is a set C of completeness statements for which the problem is CoNP-complete. This still holds when the graph is fixed.

Finally, the following proposition tells us that fixing both the set of statements C and the BGP P reduces the complexity to PTIME.

Proposition 9. Let C be a set of completeness statements and P be a BGP. Deciding the entailment C, G |= Compl(P), given a graph G, is in PTIME.

This result corresponds to some practical cases when queries are assumed to be of limited length¹⁸ and hence, so are completeness statements (which are essentially also queries).

Table 1. Complexity of the data-aware completeness entailment problem with various inputs fixed ('×' denotes 'fixed', '✓' denotes 'part of the input').

C   G   P   Complexity
✓   ✓   ✓   Π^P_2-complete
✓   ×   ✓   Π^P_2-complete
✓   ✓   ×   NP-complete
✓   ×   ×   NP-complete
×   ✓   ✓   CoNP-complete
×   ×   ✓   CoNP-complete
×   ✓   ×   in PTIME

Our complexity results with various inputs fixed are summarized in Table 1. Given these results, it is of interest to investigate how well the problem of completeness entailment can be solved in practice. In later sections, we will provide heuristic techniques, as well as experimental evaluations of the problem.

Completeness of Queries with Projections. The above formalization deals with the completeness of queries with no projections (that is, where all variables in the BGP are distinguished). One may wonder what happens when not all variables are distinguished. The answer depends on whether bag or set semantics is used.

With bag semantics, due to the monotonicity of BGPs and the preservation of answer multiplicity, our results immediately transfer to queries with projections: a query with projections is complete if and only if the projection-free version of the query is complete.

We do not know a characterization of completeness for queries with projections under set semantics. Nevertheless, Theorem 1 yields a sufficient condition for completeness in this case: whenever the BGP of a query with projections is complete, the query is also complete under set semantics.

5. Checking Query Soundness

In this section, we leverage completeness reasoning for checking answer and pattern soundness.

¹⁸ As is also customary in database theory when analyzing the data complexity of query evaluation [13].


5.1. Checking Answer Soundness

We will use data-aware completeness reasoning to judge whether an answer obtained by evaluating a graph pattern over a graph is sound.

Example 8. Recall the motivating scenario of answer soundness in Section 3.2.1. Consider the mapping {?c ↦ usa} ∈ ⟦Pl⟧_Gl. After instantiating the variable ?c in the two negated subpatterns with usa, we obtain the patterns {(usa, lang, en)} and {(usa, lang, ?l), (?f, lang, ?l), (EU, founder, ?f)}. Using the techniques for completeness checking in Section 4, we find that both the entailment Cl, Gl |= Compl({(usa, lang, en)}) and the entailment

Cl, Gl |= Compl({(usa, lang, ?l), (?f, lang, ?l), (EU, founder, ?f)})

hold. Therefore, these subpatterns will fail over every extension of Gl compatible with Cl, and {?c ↦ usa} will continue to be an answer, that is, formally, Cl, Gl |= Sound({?c ↦ usa}, Pl).

In contrast, consider the mapping {?c ↦ sgp} ∈ ⟦Pl⟧_Gl. For the extension G′l = Gl ∪ {(sgp, lang, en)}, we have {?c ↦ sgp} ∉ ⟦Pl⟧_G′l, since instantiating the negated subpattern (?c, lang, en) to (sgp, lang, en) results in a triple satisfied by G′l. Clearly, (Gl, G′l) |= Cl, from which we conclude that Cl, Gl ⊭ Compl({(sgp, lang, en)}). In short, {?c ↦ sgp} is not a sound answer for Pl over Gl, because Cl and Gl do not entail the completeness of the instantiated triple pattern {?c ↦ sgp}(?c, lang, en).

The main theorem of this subsection generalizes the observations of the example. Intuitively, it states that the soundness of an answer mapping of a graph pattern over a graph is guaranteed exactly if all the graph pattern's NOT-EXISTS BGPs, after applying the answer mapping to them, are complete for the graph.

Theorem 2. (ANSWER SOUNDNESS) Let G be a graph, C a set of completeness statements, P a graph pattern, and µ ∈ ⟦P⟧_G a mapping. Then the following are equivalent:

1. C, G |= Sound(µ, P);
2. C, G |= Compl(µPi), for all Pi ∈ P−.

In fact, Theorem 2 holds for a wider class of graph patterns than defined in this article. We only need the positive part of the pattern to be monotonic, that is, a mapping that is a solution over G remains a solution over all extensions of G. We do not make this formal to keep the exposition simple.
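The reduction of Theorem 2 can be tried out on the motivating scenario with the following Python sketch. It combines a compact version of the saturation procedure of Section 4 with one completeness check per NOT-EXISTS BGP. As before, this rests on assumptions outside this section: the statements of Cl are modelled as single-triple BGPs, T_C is read as collecting all matching instantiations of statement BGPs, frozen variables use a hypothetical `~` prefix, and all function names are illustrative.

```python
def is_var(term):
    return term.startswith("?")

def substitute(mu, bgp):
    return frozenset(tuple(mu.get(t, t) for t in tp) for tp in bgp)

def evaluate(bgp, graph):
    """All mappings mu with dom(mu) = var(bgp) and mu(bgp) contained in graph."""
    results = [{}]
    for tp in bgp:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                if all(nu.setdefault(p, g) == g if is_var(p) else p == g
                       for p, g in zip(tp, triple)):
                    extended.append(nu)
        results = extended
    return results

def freeze(bgp):
    return {tuple("~" + t[1:] if is_var(t) else t for t in tp) for tp in bgp}

def melt(graph):
    return {tuple("?" + t[1:] if t.startswith("~") else t for t in tr)
            for tr in graph}

def transfer(statements, graph):
    """Assumed reading of T_C."""
    return {tr for pc in statements for mu in evaluate(pc, graph)
            for tr in substitute(mu, pc)}

def cruc(statements, graph, bgp):
    return set(bgp) & melt(transfer(statements, freeze(bgp) | graph))

def sat(bgp, statements, graph):
    working, omega = [(frozenset(bgp), {})], []
    while working:
        p, nu = working.pop()
        mus = evaluate(cruc(statements, graph, p), graph)
        if mus == [{}]:                      # saturated
            omega.append(nu)
        else:
            working.extend((substitute(mu, p), {**nu, **mu}) for mu in mus)
    return omega

def entails_completeness(bgp, statements, graph):
    return all(substitute(mu, bgp) <= graph
               for mu in sat(bgp, statements, graph))

def answer_is_sound(mu, negs, statements, graph):
    """Theorem 2, assuming mu is an answer of the pattern over graph."""
    return all(entails_completeness(substitute(mu, n), statements, graph)
               for n in negs)

G_l = {("ger", "a", "country"), ("usa", "a", "country"), ("sgp", "a", "country"),
       ("spa", "a", "country"), ("ger", "lang", "de"), ("spa", "lang", "es"),
       ("EU", "founder", "ger")}
NEGS = [[("?c", "lang", "en")],
        [("?c", "lang", "?l"), ("?f", "lang", "?l"), ("EU", "founder", "?f")]]
C_l = [[("ger", "lang", "?l")], [("usa", "lang", "?l")],
       [("spa", "lang", "?l")], [("EU", "founder", "?f")]]

for country in ("usa", "sgp", "spa"):
    print(country, answer_is_sound({"?c": country}, NEGS, C_l, G_l))
```

Under these assumptions the sketch agrees with the analysis of Section 3.2.1: usa and spa are sound answers, while sgp is not.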

Complexity. By Theorem 2, checking whether an answer is sound wrt. a set of completeness statements and a graph can be reduced to a linear number of data-aware completeness checks (as discussed in Section 4). From this, it follows that the answer soundness entailment problem is in Π^P_2. Moreover, the answer soundness problem is also Π^P_2-hard, as the completeness problem can be reduced to it by using Theorem 2. Nevertheless, from a practical perspective, one may expect graph patterns (including BGPs used to construct completeness statements) to be short, giving us a potentially manageable answer soundness check. Section 7 reports an experimental study of answer soundness checking in practical settings.

5.2. Checking Pattern Soundness

As shown in our motivating scenario, it might be the case that completeness statements guarantee the soundness of a graph pattern as such, that is, all answers returned by the graph pattern are known to be sound, no matter the specifics of the graph. To characterize pattern soundness, we follow the same strategy as before: we reduce the problem of soundness checking to completeness checking.

First, we generalize completeness statements to conditional completeness statements, which express the completeness of a BGP under the condition of another BGP. Given two BGPs P and P′, the completeness of P wrt. P′ is denoted as Compl(P | P′). Given an extension pair (G, G′), we define that (G, G′) |= Compl(P | P′) if ⟦(var(P), P ∪ P′)⟧_G′ ⊆ ⟦P⟧_G.¹⁹ This means that the conditional completeness statement is satisfied by the extension pair whenever the evaluation of the BGP P over the graph G contains the evaluation of P under the condition of P′ over the graph G′. For example, the conditional completeness statement Compl((?c, lang, en) | (?c, a, country)) denotes the completeness of all things having English as their language, provided that those things are of type country. Note that conditional completeness statements are more general than completeness statements as introduced in Section 2.2, since a completeness statement Compl(P) can be expressed as a conditional completeness statement with the empty condition, Compl(P | ∅).

We define that the entailment C |= Compl(P | P′) holds if for all extension pairs (G, G′) satisfying C, it is the case that (G, G′) |= Compl(P | P′). The following proposition states that such an entailment holds iff the application of the transfer operator T_C to the prototypical graph P̃ ∪ P̃′ contains P̃. Recall that the prototypical graph represents any possible graph that satisfies a BGP.

¹⁹ Though redundancies may occur in the left-hand side due to the projection, for our characterizations here the ⊆ operation is done under set semantics. Hence, the redundancies are ignored.

Proposition 10. For a set C of completeness statements and BGPs P and P′, it is the case that

C |= Compl(P | P′) iff P̃ ⊆ T_C(P̃ ∪ P̃′).

In the motivating scenario of pattern soundness, it holds that Cf |= Compl((?c, lang, en) | (?c, a, country)) due to the inclusion

{(c̃, lang, en)} ⊆ {(c̃, lang, en), (c̃, a, country)} = T_Cf({(c̃, lang, en), (c̃, a, country)}).

This means that the set Cf of statements guarantees the completeness of all things whose official language is English, under the condition that those things are of type country.
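The test of Proposition 10 amounts to one application of the transfer operator to a frozen BGP, as the following Python sketch illustrates. It is written under assumptions not fixed by this section: Clang is modelled as the two-triple BGP {(?c, a, country), (?c, lang, ?l)} and Ceu as {(EU, founder, ?f)}, T_C is read as collecting all matching instantiations of statement BGPs, and frozen variables use a hypothetical `~` prefix.

```python
def is_var(term):
    return term.startswith("?")

def substitute(mu, bgp):
    return {tuple(mu.get(t, t) for t in tp) for tp in bgp}

def evaluate(bgp, graph):
    """All mappings mu with dom(mu) = var(bgp) and mu(bgp) contained in graph."""
    results = [{}]
    for tp in bgp:
        extended = []
        for mu in results:
            for triple in graph:
                nu = dict(mu)
                if all(nu.setdefault(p, g) == g if is_var(p) else p == g
                       for p, g in zip(tp, triple)):
                    extended.append(nu)
        results = extended
    return results

def freeze(bgp):
    """Prototypical graph: each variable ?x becomes the fresh constant ~x."""
    return {tuple("~" + t[1:] if is_var(t) else t for t in tp) for tp in bgp}

def transfer(statements, graph):
    """Assumed reading of T_C."""
    return {tr for pc in statements for mu in evaluate(pc, graph)
            for tr in substitute(mu, pc)}

def entails_conditional(statements, p, cond):
    """Proposition 10: frozen p must be contained in T_C(frozen p ∪ frozen cond)."""
    frozen_p, frozen_cond = freeze(p), freeze(cond)
    return frozen_p <= transfer(statements, frozen_p | frozen_cond)

C_lang = [("?c", "a", "country"), ("?c", "lang", "?l")]   # assumed shape
C_eu = [("EU", "founder", "?f")]                           # assumed shape
C_f = [C_lang, C_eu]
POSITIVE = [("?c", "a", "country")]

print(entails_conditional(C_f, [("?c", "lang", "en")], POSITIVE))
print(entails_conditional(C_f, [("EU", "founder", "?c")], POSITIVE))
print(entails_conditional([C_eu], [("?c", "lang", "en")], POSITIVE))
```

The first two calls check the two NOT-EXISTS BGPs of Pf against the positive part; the last call shows that the entailment is lost once Clang is removed.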

The following lemma states that the soundness of a graph pattern can be guaranteed if each BGP of the NOT-EXISTS patterns is complete under the condition of the positive part of the graph pattern.

Lemma 2. Let C be a set of completeness statements and P a graph pattern. Then C |= Sound(P) provided that C |= Compl(Pi | P+) for all Pi ∈ P−.

One might wonder whether the converse of the above lemma also holds. However, the following counterexample shows that it does not.

Example 9. Consider the following graph patterns:

P1 = {(?c, a, country), ¬∃{(?c, lang, en)}, ¬∃{(?c, lang, en), (?c, lang, fr)}}
P2 = {(?c, a, country), ¬∃{(?c, lang, en), (?c, lang, ?l)}}.

Suppose both return as answers mappings {?c ↦ c} for an IRI c. Then c is a country that does not have the language en. The reason is that, after instantiating ?c with c, all three negated subpatterns are empty over a graph G if and only if the triple (c, lang, en) is not present in G. Consider next the completeness statement C = Compl((?c, lang, en)). If the triple (c, lang, en) is not present in G, and (G, G′) |= C, then the triple is not present in G′ either. Hence, all three negated subpatterns fail over G′ as well, and {?c ↦ c} remains an answer to P1 and P2 over G′. We thus have C |= Sound(P1) and C |= Sound(P2), even though the condition of Lemma 2 is violated.

Taking a closer look, one notices that both graph patterns in fact contain redundancies, which can be checked via query containment under set semantics (written ⊑_s). For P1, the second NOT-EXISTS pattern is superfluous because the first one is more general; for P2, the triple pattern (?c, lang, ?l) is superfluous since the emptiness of the BGP of the NOT-EXISTS pattern only depends on the triple pattern (?c, lang, en). Consequently, in both cases the completeness statement Compl((?c, lang, en)) alone is sufficient to guarantee soundness.

To identify such redundancies, we associate to each pattern Pi ∈ P− the query (var(P+), P+ ∪ Pi). These are essentially conjunctive queries, which can be checked for set containment and which can be minimized (see [14]). Specifically, we propose a normal form for graph patterns, called non-redundant form (NRF). A graph pattern P is in NRF if it satisfies two conditions: there are no redundant negated patterns, and there are no non-minimal negated patterns. This can be formalized as follows:

– The BGP Pi ∈ P− is redundant if there is a distinct Pj ∈ P− such that (var(P+), P+ ∪ Pi) ⊑s (var(P+), P+ ∪ Pj).
– The BGP Pi ∈ P− is non-minimal if there is a non-empty strict subpattern P′i ⊂ Pi such that (var(P+), P+ ∪ P′i) ⊑s (var(P+), P+ ∪ Pi).
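The two NRF conditions can be illustrated with a small sketch. It assumes that triple patterns are encoded as tuples of strings, that variables are the strings starting with "?", and that containment is tested in the classical way, by searching for a homomorphism from the containing query into the frozen contained query that fixes the head variables. The function names (`contained`, `is_redundant`, `is_non_minimal`) are hypothetical and not taken from the paper.

```python
from itertools import combinations

def is_var(term):
    return term.startswith("?")

def variables(bgp):
    return {t for triple in bgp for t in triple if is_var(t)}

def homomorphisms(pattern, target, binding):
    """Yield every extension of `binding` mapping all triples of `pattern` into `target`."""
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in target:
        new = dict(binding)
        if all((new.setdefault(p, t) == t) if is_var(p) else (p == t)
               for p, t in zip(first, triple)):
            yield from homomorphisms(rest, target, new)

def contained(head, body1, body2):
    """Set-semantics containment of (head, body1) in (head, body2).

    body1 is treated as frozen (its variables act as constants); head
    variables must be mapped to themselves.
    """
    fixed = {v: v for v in head}
    return any(True for _ in homomorphisms(sorted(body2), body1, fixed))

def is_redundant(pos, negs, i):
    """negs[i] is redundant if some other negs[j] makes it superfluous."""
    head = variables(pos)
    return any(j != i and contained(head, pos | negs[i], pos | negs[j])
               for j in range(len(negs)))

def is_non_minimal(pos, neg):
    """neg is non-minimal if a non-empty strict subpattern is already contained."""
    head = variables(pos)
    return any(contained(head, pos | set(sub), pos | neg)
               for k in range(1, len(neg))
               for sub in combinations(sorted(neg), k))

# The patterns P1 and P2 of Example 9.
pos = frozenset({("?c", "a", "country")})
p1_negs = [frozenset({("?c", "lang", "en")}),
           frozenset({("?c", "lang", "en"), ("?c", "lang", "fr")})]
p2_neg = frozenset({("?c", "lang", "en"), ("?c", "lang", "?l")})
print(is_redundant(pos, p1_negs, 1))   # True
print(is_non_minimal(pos, p2_neg))     # True
```

The homomorphism search is exponential in the worst case, which matches the NP nature of conjunctive query containment; for the short patterns expected in practice this brute-force form is usually adequate.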

With this notion in place, we can obtain the main theorem of this subsection. The theorem states that, given an NRF graph pattern, the check of whether it is sound can be reduced to checking whether each BGP among the NOT-EXISTS patterns is complete under the condition of the positive part. Thus, the theorem ensures that the converse of Lemma 2 holds for NRF graph patterns.

Theorem 3. (PATTERN SOUNDNESS) Let C be a set of completeness statements and P a graph pattern in NRF. Then the following are equivalent:

1. C |= Sound(P);
2. C |= Compl(Pi | P+) for all Pi ∈ P−.


Example 10. In the motivating scenario of pattern soundness, it holds that

Cf |= Compl((?c, lang, en) | (?c, a, country))

Cf |= Compl((EU, founder, ?c) | (?c, a, country)).

By Theorem 3, it is the case that Cf |= Sound(Pf).

Complexity. If P is a graph pattern (which may have negated parts) and P̂ is obtained from P by eliminating redundant negated patterns and minimizing the remaining ones, we say that P̂ is an NRF of P. Clearly, P and every such P̂ are equivalent. According to Theorem 3, a straightforward attempt at implementing a soundness check for patterns P would be to compute an NRF P̂ of P and then perform the completeness entailment checks specified by the theorem. Such a P̂ can be computed by a function that runs in polynomial time and makes several calls to an NP-oracle, while the entailment checks can be performed in NP. This gives us P^NP as an upper complexity bound for checking pattern soundness. We will show that the problem is even in NP.

For that purpose we introduce a relationship between patterns with negation, of which the relationship between a pattern and its NRF is a special case. We say that two graph patterns P, P̂ are negation-similar if the following holds:

– P+ = P̂+;
– for every Pi ∈ P−, there is a P̂j ∈ P̂− such that (var(P+), P+ ∪ Pi) ⊑s (var(P+), P+ ∪ P̂j);
– for every P̂j ∈ P̂−, there is a Pi ∈ P− such that (var(P+), P+ ∪ P̂j) ⊑s (var(P+), P+ ∪ Pi).

Note that negation-similarity is in NP, since it can be checked by guessing for every Pi a corresponding P̂j, for every P̂j a corresponding Pi, and then performing containment checks by guessing and verifying query homomorphisms (see [14]).

Negation-similarity is a key property here because it preserves equivalence.

Proposition 11. Negation-similar graph patterns are equivalent.

With this background we can design an NP soundness check for patterns P by, intuitively, first guessing P̂, verifying that it is negation-similar to P, and running the completeness entailment checks of the theorem. We first make the step of guessing P̂ more formal.

Proposition 12. Let P be a graph pattern. Then there exists a graph pattern P̂ such that

– P̂ is in NRF;
– P̂ is obtained from P by dropping negated patterns and by dropping atoms from negated patterns;
– P̂ is negation-similar to P.

We can obtain such a P̂ as an NRF of P by first dropping redundant negated patterns until there are no redundancies any more and then dropping atoms from the remaining Pi ∈ P− during the minimization of the queries (var(P+), P+ ∪ Pi). Clearly, the redundancy removal and minimization steps preserve negation-similarity.

Now we have all ingredients in place for an NP algorithm that checks whether a set of completeness statements C entails the soundness of a graph pattern P:

(i) nondeterministically drop negated patterns from P;
(ii) nondeterministically drop atoms from the remaining negated patterns;
(iii) verify that the resulting pattern P̂ is negation-similar to P;
(iv) verify that C |= Compl(P̂j | P+) for every P̂j ∈ P̂−.

Clearly, the first two steps can be performed in nondeterministic polynomial time. As seen above, negation-similarity can also be checked in nondeterministic polynomial time. Finally, by Proposition 10, the completeness checks in Step (iv) can be reduced to a linear number of TC applications, which are basically evaluations of conjunctive CONSTRUCT queries and are therefore in NP. Thus, all the checks together can be performed in NP.

If the resulting P̂ passes the completeness checks, then we have found a query that is sound by Lemma 2 and is equivalent to P by Proposition 11, so that P is sound. Conversely, if P is sound, then any NRF of P is a pattern that can be guessed and, by Theorem 3, verified by the above algorithm. This shows membership in NP.

NP-hardness is clear, since Theorem 3 also shows that completeness checking can be reduced to soundness checking.

Proposition 13. Soundness checking of graph patterns is NP-complete.

From a practical viewpoint, one may expect graph patterns of queries and BGPs of completeness statements to be short, potentially allowing for a feasible soundness check. Section 7 reports an experimental investigation of pattern soundness checking in practical cases.


Soundness of Queries with Projections. We have provided a full characterization of the soundness of queries with negation where no projections are involved (or where the projection is over all variables in the query body). One may wonder whether our characterization here can also be used for queries with negation that involve generic projections (that is, where some variables can be non-distinguished). The next example shows that in general, the condition from Theorem 3 is not a necessary condition for pattern soundness entailment of queries with projection.

Example 11. Consider the following boolean query, which asks whether there is some triple that cannot be right-shifted:

Q = ({}, {(?x, ?y, ?z), ¬∃{(?z, ?x, ?y)}}).

Consider also the singleton set containing a completeness statement of three possible shifts of triples:

C = {Compl((?x, ?y, ?z), (?z, ?x, ?y), (?y, ?z, ?x))}.

If we assume bag semantics, Q returns a bag of empty mappings µ∅ when evaluated over a graph G, one for each answer to the graph pattern of the query. We show that if G′ is such that (G, G′) |= C, then the number of empty mappings returned by Q over G′ is no less than the number returned over G (otherwise, Q would not be sound). Now suppose that Q returns a copy of µ∅ over G. Then there is a triple (a, b, c) ∈ G such that (c, a, b) ∉ G. If also (c, a, b) ∉ G′, then Q over G′ continues to return a copy of µ∅ for the same reason as before. Consider therefore the case that (c, a, b) ∈ G′. If (b, c, a) ∉ G′, then Q again returns a copy of µ∅, this time because the triple (c, a, b) ∈ G′ does not have its right shift in G′. If, however, also (b, c, a) ∈ G′, then the completeness statement kicks in and enforces that (c, a, b) ∈ G, which contradicts our original assumption. Thus, C entails the pattern soundness of Q, even though C does not entail Compl((?z, ?x, ?y) | (?x, ?y, ?z)).

We do not know a characterization of pattern soundness for queries involving projections. Nevertheless, Theorem 3 still gives a sufficient condition for soundness in this case.

Combining Soundness and Completeness Reasoning. A graph pattern with negation can be both sound and complete. Theorem 3 characterizes when a graph pattern P in NRF is sound wrt. a set C of completeness statements. One can show that P is complete if and only if the positive part P+ is complete. For the right-to-left direction, it can be seen that whenever the positive part is complete, the evaluation of the whole graph pattern P over an extension G′ of the graph G can only lose mappings (that is, it cannot gain new ones). Hence, the evaluation result of P over G is guaranteed to contain that over G′. For the left-to-right direction, whenever the positive part cannot be guaranteed to be complete wrt. the set C of completeness statements, we can build a counterexample extension pair by taking the frozen version of the positive part as the graph G′, and the result of the TC application over G′ as the graph G. Via both characterizations, we can then check whether a graph pattern is sound and/or complete.

6. Heuristics for Completeness Checking

So far, we have formally characterized completeness entailment and, for soundness entailment, provided a reduction to completeness entailment. In this section we investigate how completeness checks can actually be implemented in practice. A vanilla implementation of completeness reasoning over a Wikidata-based experiment setting took about 15 s on average for a single completeness check, whereas query evaluation took just about 1 ms. Such a considerable overhead over query evaluation may hinder the adoption of our completeness framework in practical settings. This motivates the need for heuristic techniques that may provide speed-ups. In this section we discuss several possible heuristics for completeness checking, and then evaluate their performance in the next section in a realistic setting based on Wikidata.

Before delving further into the discussion of heuristic approaches for completeness checking, we provide basic practical assumptions underlying the development of the heuristics. Our first assumption concerns the length of completeness statements. Similarly to queries, which in many practical cases consist of a limited number of triple patterns [15–17], we conjecture that completeness statements would also be of limited length. At the same time, it is conceivable that the number of completeness statements is large in practice, in particular since big KBs such as Wikidata may have many parts of data that are actually complete. Nevertheless, for a given query, most statements are likely to be irrelevant. For example, the statement “All players of Arsenal” is irrelevant to the query “Give founders of the EU.” Consequently, an efficient implementation should filter out such irrelevant statements. Moreover, as users are likely to provide completeness statements for similar topics, we introduce generic completeness templates. By providing a compact representation of completeness statements, completeness templates enable multiple statements to be processed simultaneously. We also provide a heuristic called prioritized evaluation, which tunes the way the crucial part is computed in data-aware completeness reasoning (cf. Eq. (1)).

In the next subsections, we first provide a heuristic technique for a simpler problem, that is, data-agnostic completeness checking, followed by several heuristic techniques for data-aware completeness checking.

6.1. Data-agnostic Completeness Checking

We now address a heuristic technique for data-agnostic completeness checking, originally proposed in [6], which can also be leveraged for pattern soundness checking, thanks to the characterization in Theorem 3. The theorem states that pattern soundness checking can be reduced to the (data-agnostic) checks of whether a set C of completeness statements entails conditional completeness statements of the form Compl(P | P′). By Proposition 10, such a check can be done by evaluating the union of the CONSTRUCT queries QC, for every completeness statement C ∈ C, over the prototypical graph P ∪ P′. A statement C contributes to this evaluation only if the result of QC over this graph is non-empty. Clearly, a necessary condition for C to contribute is that all terms in C (i.e., IRIs and literals) occur among the terms in P ∪ P′, written terms(C) ⊆ terms(P ∪ P′). We call such a C term-relevant for Compl(P | P′). We can retrieve all such term-relevant C by evaluating the subset query asking for the statement set {C ∈ C | terms(C) ⊆ terms(P ∪ P′)}.

A simple (yet efficient, as we will show in Section 7) way to implement such subset queries is to build a hashmap of all statements where the key of C is terms(C). With this hashmap, we can answer the subset query by retrieving, for all non-empty subsets S ⊆ terms(P ∪ P′), the statements C with terms(C) = S. The completeness check can then be done only with these statements, which are potentially far fewer than the original statements.

Example 12. Consider the following set Corg of 10,000 completeness statements:

– Clang = Compl((?c, a, country), (?c, lang, ?l))
– Ceu = Compl((EU, founder, ?f))
– Corg1 = Compl((org1, founder, ?f))
– . . .
– Corg9998 = Compl((org9998, founder, ?f)).

The corresponding hashmap is as follows, where the keys are sets of terms occurring in the completeness statements, and the values are sets of completeness statements with those terms:

– {a, country, lang} ↦ {Clang},
– {EU, founder} ↦ {Ceu},
– {org1, founder} ↦ {Corg1},
– . . .
– {org9998, founder} ↦ {Corg9998}.

Consider the query Pf from Section 3.2.2,

Pf = {(?c, a, country), ¬∃{(?c, lang, en)}, ¬∃{(EU, founder, ?c)}},

and let P1 = {(?c, lang, en)} (i.e., the BGP of the first NOT-EXISTS pattern) and P2 = {(EU, founder, ?c)} (i.e., the BGP of the second NOT-EXISTS pattern). To verify query soundness wrt. Corg, according to Theorem 3 we check whether both Corg |= Compl(P1 | P+f) and Corg |= Compl(P2 | P+f). By applying TCorg according to Proposition 10, we conclude that both entailments hold. However, rather than evaluating all 10,000 statements in Corg over the prototypical graph, we can now use the term hashmap to rule out irrelevant statements, as follows. Consider the first check. The set of terms from the BGP is terms(P1 ∪ P+f) = {a, country, lang, en}. From these four terms, we generate 15 non-empty subsets: {a}, {country}, . . . , and the set terms(P1 ∪ P+f) itself. By looking up the statements for these subsets using the hashmap, we end up with the singleton set {Clang}. Note that the other 9,999 statements are irrelevant and thus are left out. By performing an analogous operation for the latter case, we retrieve the singleton {Ceu}. Thus, instead of considering 10,000 statements in the reasoning, we now consider only 2.
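The term hashmap of Example 12 can be reproduced with a short sketch. It assumes statements are given as lists of triple patterns keyed by a hypothetical name (`C_lang`, `C_eu`, `C_org1`, ...), with variables written as strings starting with "?"; the helper names `build_term_index` and `term_relevant` are illustrative only.

```python
from itertools import combinations

def is_var(term):
    return term.startswith("?")

def terms(bgp):
    """The set of non-variable terms (IRIs and literals) occurring in a BGP."""
    return frozenset(t for triple in bgp for t in triple if not is_var(t))

def build_term_index(statements):
    """Hashmap from terms(C) to the names of the statements C with exactly those terms."""
    index = {}
    for name, bgp in statements.items():
        index.setdefault(terms(bgp), set()).add(name)
    return index

def term_relevant(index, prototypical_bgp):
    """Look up every non-empty subset of the query terms in the index."""
    query_terms = sorted(terms(prototypical_bgp))
    found = set()
    for k in range(1, len(query_terms) + 1):
        for subset in combinations(query_terms, k):
            found |= index.get(frozenset(subset), set())
    return found

# The 10,000 statements of Example 12.
statements = {
    "C_lang": [("?c", "a", "country"), ("?c", "lang", "?l")],
    "C_eu": [("EU", "founder", "?f")],
}
for i in range(1, 9999):
    statements[f"C_org{i}"] = [(f"org{i}", "founder", "?f")]

index = build_term_index(statements)
positive = [("?c", "a", "country")]
print(term_relevant(index, [("?c", "lang", "en")] + positive))     # {'C_lang'}
print(term_relevant(index, [("EU", "founder", "?c")] + positive))  # {'C_eu'}
```

Each check performs 15 hashmap lookups instead of evaluating 10,000 CONSTRUCT queries; the number of lookups is 2^n − 1 for n query terms, which is the exponential growth discussed in the text.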

A potential, theoretical drawback of this heuristic technique is that when queries are long, the number of term subsets generated for hashmap lookups would become large. More precisely, the number of term subsets grows exponentially with the query length (i.e., number of triple patterns). Nevertheless, as discussed above, we expect that queries in the real world are of limited length [15–17].


6.2. Data-aware Completeness Checking

For the data-aware setting, reasoning also needs access to the data graph. The previous heuristic of data-agnostic reasoning, which leaves out statements whose terms are not among the terms of the query, is no longer applicable, since parts of the statements can now be mapped to the data graph. We present three heuristic techniques: completeness templates, partial matching, and prioritized evaluation. Their individual and combined impacts on reasoning time are reported in Section 7.

6.2.1. Completeness Templates

Templates support users in creating completeness statements about similar topics, as they occur for instance in IMDb, which reports completeness for movie cast and crew,²⁰ or in OpenStreetMap, which uses a wiki to record the completeness of objects in different geographical areas.²¹

²⁰ See e.g., http://www.imdb.com/title/tt0105236/fullcredits
²¹ See e.g., http://wiki.openstreetmap.org/wiki/Abingdon

A completeness template is a 3-tuple τ = (C, Vτ, Ω), where C is a completeness statement, Vτ ⊆ var(C) is a set of variables, called meta-variables, and Ω is a set of mappings from Vτ to terms (i.e., IRIs or literals). We also refer to the BGP of the completeness statement C of the template τ as Pτ. As an example of a completeness template, we generalize the statement set

{Compl((en, lang, ?l)), Compl((ger, lang, ?l)), . . . , Compl((spa, lang, ?l))}

to the template (Compl((?c, lang, ?l)), {?c}, Ω), where Ω = {{?c ↦ en}, {?c ↦ ger}, . . . , {?c ↦ spa}}. A template τ = (C, Vτ, Ω) represents the statement set Cτ = {Compl(µPC) | µ ∈ Ω}, obtained by instantiating C with the mappings in Ω. This definition naturally extends to sets of completeness templates, that is, for a set T of templates, the set C_T is the union of the statement sets Cτ for every τ ∈ T. Note that a completeness statement C can be expressed as the completeness template (C, ∅, {µ∅}), where µ∅ is the empty mapping.

A key part of the algorithm for data-aware completeness checking, given a statement set C and a data graph G, is to identify the crucial part P0 of P, that is, the maximal subset P0 ⊆ P such that P0 ⊆ TC(P ∪ G), where TC is the transfer operator for C. Given a set T of completeness templates, analogously to Eq. (1), such a part satisfies the equation

P0 = P ∩ id⁻¹(T_{C_T}(P ∪ G)). (2)

A baseline approach to compute P0 in Eq. (2) is to instantiate templates to yield completeness statements, and then apply the TC-operator wrt. the statements. This may be costly if there are many instances of those templates. Templates, however, allow us to leverage query evaluation for data-aware completeness reasoning by exploiting that a template represents many statements. Essentially, to check whether the TC-operator maps a triple in P by an instantiation of a template τ, we first evaluate Pτ (by treating the meta-variables like variables) over the union graph P ∪ G, and verify in a second step which of the resulting mappings are compatible²² with the instantiations of the template τ. In this way, all instances of τ can be processed simultaneously. To formally represent the above idea, given a set T of completeness templates, a frozen BGP P, and a graph G, we define a template-based transfer operator

T_T(P ∪ G) = ⋃_{τ ∈ T, τ = (C, Vτ, Ω)} {µPτ | µ ∈ ⟦Pτ⟧_{P∪G} ⋈ Ω}.

The above operator computes for each template τ the evaluation of the BGP Pτ over P ∪ G, keeps only those mappings compatible with Ω, and then takes the union. The crucial point here is that instead of evaluating the whole set of (instantiated) completeness statements, we only have to evaluate the (potentially far fewer) completeness templates. By the definition of completeness templates and the construction of T_T, the BGP P0 in Eq. (2) can alternatively be computed using T_T.

Proposition 14. Given a BGP P, a graph G, and a set T of completeness templates, it is the case that

P0 = P ∩ id⁻¹(T_{C_T}(P ∪ G)) = P ∩ id⁻¹(T_T(P ∪ G)).

The proposition holds basically because the mappings that result from the evaluation of a template over P ∪ G and are compatible with Ω are exactly the same as those of the instantiated completeness statements of the template that can be applied over P ∪ G.
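The agreement stated in Proposition 14 can be illustrated with a minimal sketch that compares the template-based operator with the baseline of instantiating every template first. It assumes a template is encoded as a tuple (BGP, meta-variables, Ω) with Ω a list of dictionaries, and that `graph` stands for the union of the frozen BGP and the data graph; the small language graph and all function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def evaluate(bgp, graph, binding=None):
    """Yield all mappings of the variables of `bgp` that match `graph`."""
    binding = dict(binding or {})
    if not bgp:
        yield binding
        return
    first, rest = bgp[0], bgp[1:]
    for triple in graph:
        new = dict(binding)
        if all((new.setdefault(p, t) == t) if is_var(p) else (p == t)
               for p, t in zip(first, triple)):
            yield from evaluate(rest, graph, new)

def instantiate(bgp, mapping):
    return {tuple(mapping.get(t, t) for t in triple) for triple in bgp}

def template_transfer(templates, graph):
    """Evaluate each template BGP once, then keep mappings compatible with Omega."""
    out = set()
    for bgp, meta_vars, omega in templates:
        for mu in evaluate(bgp, graph):
            if {v: mu[v] for v in meta_vars} in omega:
                out |= instantiate(bgp, mu)
    return out

def baseline_transfer(templates, graph):
    """Instantiate every template into plain statements and evaluate each one."""
    out = set()
    for bgp, _meta_vars, omega in templates:
        for inst in omega:
            stmt_bgp = sorted(instantiate(bgp, inst))
            for mu in evaluate(stmt_bgp, graph):
                out |= instantiate(stmt_bgp, mu)
    return out

# Hypothetical data: languages of three countries, template complete for two.
graph = {("ger", "lang", "de"), ("fra", "lang", "fr"), ("spa", "lang", "es")}
template = ([("?c", "lang", "?l")], ["?c"], [{"?c": "ger"}, {"?c": "spa"}])
print(template_transfer([template], graph) == baseline_transfer([template], graph))  # True
```

The template-based variant evaluates one BGP per template regardless of the size of Ω, at the price of a compatibility test per resulting mapping.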

²² That is, when those mappings are projected to the meta-variables of the template, the projected mappings exist in Ω.




6.2.2. Partial Matching

While the above heuristic technique concerns a data structure, called completeness templates, that bundles similar completeness statements, here we develop a heuristic for filtering out irrelevant completeness statements. In the previous data-agnostic heuristic technique, we perform so-called subset matching: for each statement, collect the set of terms of its BGP and check whether this is a subset of the set of terms of the query BGP. This technique is no longer applicable because a part of the BGP of a statement can also be matched to the data graph.

For data-aware completeness reasoning, it is sufficient for a completeness statement to be relevant for a query if the statement partially captures the BGP of the query, which in the worst case means capturing just a single triple pattern of the query. This is shown in the computation of the crucial part (Eq. (1)), where the transfer operator is evaluated over the union of the prototypical graph and the data graph, and the result is then intersected with the BGP of the query. Thus, the notion of relevance needs to be adapted for the data-aware case: a completeness statement may potentially be relevant for a query if there is at least one triple pattern of the statement which is more general than a triple pattern of the query. The notion of generality is defined as follows: a triple pattern (s, p, o) is more general than another triple pattern (s′, p′, o′) if, in each position, the term of the former is either the same as the corresponding term of the latter or a variable; for the subject, say, either s is the same term as s′ or s is a variable (and analogously for the predicate and object). Note that as opposed to the data-agnostic heuristic that supports the matching of the whole statement BGP, we now need to support matchings of single triple patterns. Such a matching of a single triple pattern of a BGP is called a partial matching of the BGP.

Let us first sketch how to retrieve such partially matched statements. Again, we rely on hashmaps. We use each triple pattern of a statement as a hash key, by which the statement can be retrieved. Thus, a statement with three triple patterns, for example, can be retrieved in three different ways. To find statements that are potentially applicable to a frozen BGP P, we perform a hashmap lookup for each triple pattern of P and for all possible generalizations of that triple pattern where non-predicate terms are replaced by a variable.²³

²³ Technically, the predicate can be a variable too. Yet, in relation to completeness statements, we believe that being complete for all possible relationships between two entities might not be reasonable in practice. Hence, only non-predicate terms are generalized.

Let us formalize the above sketch. Our main goal here is partial matching: retrieving only completeness statements having a triple pattern that can potentially be mapped to a triple in a frozen BGP P. To this end, we first introduce the signature operator σ(·) that abstracts away concrete variables by replacing every occurrence of a variable with the reserved IRI _var. It can be applied to any syntactic object. As an illustration, the signature of the BGP Pusa = {(usa, lang, ?l)} is σ(Pusa) = {(usa, lang, _var)}.

Next, we index completeness statements according to (the signatures of) their triple patterns. For this purpose, we define a mapping M from signature triples to sets of completeness statements such that the signature triple is in the signature of the statement’s BGP:

M((s, p, o)) = {C ∈ C | (s, p, o) ∈ σ(PC)}.

In practice, such a mapping can be realized by standard hashmaps. Given a signature triple (s, p, o), the generalization operator gen((s, p, o)) computes the set of all generalizations where non-predicate terms can become variables. As an illustration, the generalization of the signature triple (usa, lang, _var) is the set {(usa, lang, _var), (_var, lang, _var)}.

Now, we are ready to define the operator pmatch that, given a set of statements C, retrieves those elements of C that can potentially ‘transfer’ at least one triple in the frozen BGP P. Technically, pmatch maps P and C to the set of partially matched statements wrt. P and C, denoted pmatch(P, C), and defined as

⋃_{(s,p,o) ∈ σ(P)} ⋃_{(s′,p′,o′) ∈ gen((s,p,o))} M((s′, p′, o′)).

The operator computes, for each triple in the signature of the BGP P, the generalizations of that triple and then maps the generalizations to their corresponding statements in C. By the construction of the mapping M and the generalization operator, it is the case that pmatch(P, C) preserves the crucial-part operator in Eq. (1), as stated in Proposition 15.

Proposition 15. Given a BGP P, a graph G, and a set C of completeness statements, it holds that

P ∩ id⁻¹(TC(P ∪ G)) = P ∩ id⁻¹(T_{pmatch(P,C)}(P ∪ G)).

This means that instead of taking all the statements in C, it is enough to consider only the subset pmatch(P, C), which is potentially smaller than C. The proposition holds since all applicable completeness statements over P ∪ G are actually partially matched statements (and by construction, all partially matched statements appear in C).
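The signature operator, the index M, the generalization operator, and pmatch can be sketched as follows. The sketch assumes triple patterns are tuples of strings with variables starting with "?", realizes M as a Python dictionary from signature triples to sets of (hypothetical) statement names, and generalizes only subject and object, as described in the text.

```python
VAR = "_var"  # reserved IRI standing for "any variable"

def signature(triple):
    """Replace every variable of a triple pattern by the reserved IRI _var."""
    return tuple(VAR if term.startswith("?") else term for term in triple)

def gen(sig_triple):
    """All generalizations in which subject and/or object become _var."""
    s, p, o = sig_triple
    return {(s2, p, o2) for s2 in {s, VAR} for o2 in {o, VAR}}

def build_index(statements):
    """The mapping M: signature triple -> names of statements containing it."""
    index = {}
    for name, bgp in statements.items():
        for triple in bgp:
            index.setdefault(signature(triple), set()).add(name)
    return index

def pmatch(bgp, index):
    """Statements with a triple pattern at least as general as some triple of `bgp`."""
    matched = set()
    for triple in bgp:
        for generalized in gen(signature(triple)):
            matched |= index.get(generalized, set())
    return matched

statements = {
    "C_lang": [("?c", "a", "country"), ("?c", "lang", "?l")],
    "C_eu": [("EU", "founder", "?f")],
    "C_org1": [("org1", "founder", "?f")],
}
index = build_index(statements)
print(gen(signature(("usa", "lang", "?l"))))
print(pmatch([("usa", "lang", "?l")], index))    # {'C_lang'}
print(pmatch([("EU", "founder", "?c")], index))  # {'C_eu'}
```

Each triple of the query costs at most four dictionary lookups, independently of the number of statements.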

Partial matching can also be leveraged immediately to support completeness templates: we simply take the (uninstantiated) statements of the completeness templates for building the hashmap in partial matching (that is, the C part in the template (C, Vτ, Ω)). The rest of the matching procedure can then be adapted accordingly.

6.2.3. Prioritized Evaluation

An important operator in checking data-aware completeness is the crucial-part operator,

P ∩ id⁻¹(TC(P ∪ G)),

which computes the maximal part of a BGP containing variables that can be instantiated completely, which is needed for subsequent steps in the data-aware completeness checking procedure. In the above operator, CONSTRUCT queries of completeness statements are evaluated over the union of the prototypical graph and the data graph, and then (after the melting) are intersected with the query BGP. This means that to check whether the TC-operator maps a triple in P by a statement C, we first evaluate PC over the union graph P ∪ G, with the condition that at least one triple pattern in PC is mapped to a triple in P (since otherwise the mapping does not have any contribution). Directly evaluating the CONSTRUCT query QC over the union graph, which can be large (due to G), may waste computation time. Instead, the evaluation can prioritize the part P: only if QC can be partially mapped to some parts of P are the remaining parts of QC evaluated over the data graph. This way, not all QC’s are evaluated over the data graph, but only those that may be useful for guaranteeing the completeness of parts of the BGP.

To formalize the above idea, we define prioritized evaluation of a BGP over a pair of graphs (G1, G2). In such an evaluation, we consider the first graph G1 as the mandatory graph and the second as the optional graph, which means that at least one triple pattern of the BGP is mapped to a triple in G1, while there is no need to map any triple pattern to G2. Formally, prioritized evaluation of a BGP P over (G1, G2) is defined as

⟦P⟧_(G1,G2) = {µ ∈ ⟦P⟧_{G1∪G2} | µP′ ⊆ G1 for some P′ ⊆ P, P′ ≠ ∅}.

In the case of completeness checking, the mandatory graph will be the frozen BGP P and the optional graph will be the data graph G. In our transfer operator, prioritized evaluation is performed (instead of the standard evaluation) when the CONSTRUCT queries of the statements are evaluated over the prototypical graph and the data graph. Note that by the definition of prioritized evaluation and the crucial-part operator, it is the case that both evaluation methods (i.e., the standard evaluation and the prioritized evaluation) produce the same result for the crucial part.

Example 13. Consider the BGP Pusa = {(usa, lang, ?l)}, asking for languages of the USA, the graph

Gorg = {(org1, founder, ger), (ger, lang, de), (org2, founder, usa), (org2, founder, ger)},

and the completeness statement

C = Compl((?c, lang, ?lang), (?org, founder, ?c)).

Then, with PC as the BGP of the statement C, we have that ⟦PC⟧_(Pusa,Gorg) = {{?c ↦ usa, ?lang ↦ l, ?org ↦ org2}}. The reason is that in a prioritized evaluation, at least one triple of PC has to be matched to the triple in Pusa. The only way to do that is to map ?c ↦ usa and ?lang ↦ l. This leaves only one possibility to match (?org, founder, ?c) to Gorg.
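Example 13 can be replayed with a direct, unoptimized reading of the definition of prioritized evaluation: evaluate over the union and keep only the mappings that send at least one triple pattern into the mandatory graph. In this sketch the constant `"l"` stands for the frozen version of the variable ?l, and the helper names are illustrative assumptions.

```python
def is_var(term):
    return term.startswith("?")

def evaluate(bgp, graph, binding=None):
    """Yield all mappings of the variables of `bgp` that match `graph`."""
    binding = dict(binding or {})
    if not bgp:
        yield binding
        return
    first, rest = bgp[0], bgp[1:]
    for triple in graph:
        new = dict(binding)
        if all((new.setdefault(p, t) == t) if is_var(p) else (p == t)
               for p, t in zip(first, triple)):
            yield from evaluate(rest, graph, new)

def prioritized(bgp, mandatory, optional):
    """Mappings over the union that send at least one triple pattern into `mandatory`."""
    union = mandatory | optional
    return [mu for mu in evaluate(bgp, union)
            if any(tuple(mu.get(t, t) for t in tp) in mandatory for tp in bgp)]

# Frozen BGP of Example 13: "l" plays the role of the frozen variable ?l.
p_usa = {("usa", "lang", "l")}
g_org = {("org1", "founder", "ger"), ("ger", "lang", "de"),
         ("org2", "founder", "usa"), ("org2", "founder", "ger")}
p_c = [("?c", "lang", "?lang"), ("?org", "founder", "?c")]

print(len(list(evaluate(p_c, p_usa | g_org))))  # 3 mappings under standard evaluation
print(prioritized(p_c, p_usa, g_org))
# [{'?c': 'usa', '?lang': 'l', '?org': 'org2'}]
```

This version still evaluates over the whole union and only filters afterwards, so it illustrates the semantics rather than the intended speed-up; the saving comes from evaluating over the small mandatory graph first.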

Next, in the prioritized evaluation of a BGP PC over (P, G), we apply a pruning technique based on the following observation. Each answer mapping µ ∈ ⟦PC⟧_(P,G) determines a non-empty subset P′C ⊆ PC such that µP′C ⊆ P and µP′′C ⊆ G for its complement P′′C := PC \ P′C. Since frozen variables only occur in P and not in G, we conclude that for every variable ?v that occurs both in P′C and P′′C it must be the case that µ(?v) is not a frozen variable.

The algorithm with pruning proceeds as follows. For each non-empty subset P′C ⊆ PC, we first evaluate P′C over P, which yields partial answers λ. We try to complete each such partial answer λ by evaluating the instantiated complement λ(P′′C) over G and joining the answers resulting from this with λ itself. We prune the answers λ of the first evaluation step by keeping only those mappings for which no term λ(?v), ?v ∈ var(P′′C), is a frozen variable. We call such a λ pure. Clearly, for non-pure mappings the subsequent evaluation of λ(P′′C) over G can only result in the empty set. Formally, we compute the union

⋃_{P′C ⊆ PC, P′C ≠ ∅} ⋃_{λ ∈ ⟦P′C⟧_P, λ is pure} {λ} ⋈ ⟦λ(PC \ P′C)⟧_G,

which equals ⟦PC⟧_(P,G). The equality holds for the reason discussed: the union above considers all possible subsets of PC (except the empty set), and non-pure instantiations cannot be evaluated over the data graph G.
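The pruning procedure can be sketched as follows, under the assumption that the frozen variables are passed explicitly as a set of constants (`frozen_terms`, a hypothetical parameter) and that BGPs are lists of string triples. The second statement BGP in the usage part is invented for illustration: its partial answer binds a shared variable to a frozen term and is therefore pruned before the data graph is touched.

```python
from itertools import combinations

def is_var(term):
    return term.startswith("?")

def evaluate(bgp, graph, binding=None):
    """Yield all extensions of `binding` that match `bgp` against `graph`."""
    binding = dict(binding or {})
    if not bgp:
        yield binding
        return
    first, rest = bgp[0], bgp[1:]
    for triple in graph:
        new = dict(binding)
        if all((new.setdefault(p, t) == t) if is_var(p) else (p == t)
               for p, t in zip(first, triple)):
            yield from evaluate(rest, graph, new)

def pruned_prioritized(bgp, frozen_bgp, data_graph, frozen_terms):
    """Prioritized evaluation of `bgp` over (frozen_bgp, data_graph) with pruning."""
    results = []
    positions = range(len(bgp))
    for k in range(1, len(bgp) + 1):
        for chosen in combinations(positions, k):
            part = [bgp[i] for i in chosen]                       # P'_C, matched to P
            rest = [bgp[i] for i in positions if i not in chosen]  # P''_C, matched to G
            rest_vars = {t for tp in rest for t in tp if is_var(t)}
            for lam in evaluate(part, frozen_bgp):
                # Prune non-pure partial answers: a frozen term can never occur in G.
                if any(lam.get(v) in frozen_terms for v in rest_vars):
                    continue
                for mu in evaluate(rest, data_graph, lam):
                    if mu not in results:
                        results.append(mu)
    return results

p_usa = {("usa", "lang", "l")}   # "l" is the frozen version of ?l
g_org = {("org1", "founder", "ger"), ("ger", "lang", "de"),
         ("org2", "founder", "usa"), ("org2", "founder", "ger")}
p_c = [("?c", "lang", "?lang"), ("?org", "founder", "?c")]
print(pruned_prioritized(p_c, p_usa, g_org, {"l"}))
# [{'?c': 'usa', '?lang': 'l', '?org': 'org2'}]

# Hypothetical statement whose second pattern shares ?x, bound to the frozen "l".
p_pruned = [("?c", "lang", "?x"), ("?x", "a", "language")]
print(pruned_prioritized(p_pruned, p_usa, g_org, {"l"}))  # []
```

Enumerating all non-empty subsets is exponential in the length of the statement BGP, which is acceptable under the assumption made earlier that completeness statements are short.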

The above discussion concerns the prioritized evaluation of the transfer operator that uses plain completeness statements. For completeness templates, the template-based transfer operator T_T as in Section 6.2.1 is slightly modified to account for a prioritized evaluation in the way the BGP of the template’s statement is evaluated, as follows, where T is a set of completeness templates, P is a frozen BGP and G is a graph:

T_T(P, G) = ⋃_{τ ∈ T, τ = (C, Vτ, Ω)} {µPτ | µ ∈ ⟦Pτ⟧_(P,G) ⋈ Ω}.

As shown above, the BGP Pτ of a template τ ∈ T is evaluated using the prioritized evaluation over (P, G) rather than the standard evaluation over the union of P and G.

Trade-offs of the Heuristics. A potential benefit of templates is that they can be used to group (possibly many) similar completeness statements into one, and hence we can reduce the number of evaluations in the transfer operator. A potential drawback of using templates is that, since their BGPs are by construction more general than completeness statements, the evaluation of the template’s BGP over the data graph may return many results which have to be checked for compatibility wrt. the set Ω of mappings of the template.

As for partial matching, in principle it may filter out many completeness statements (and templates). Still, a potential drawback is that when the statements overlap heavily with each other, partial matching may retain many partially matched completeness statements. Hence, the reduction of the number of completeness statements that need to be considered in completeness reasoning might not be substantial.

As for prioritized evaluation, using it alone can still be expensive when there is a large number of completeness statements that need to be considered. Combining prioritized evaluation with completeness templates may provide a remedy to the generality of the BGPs of the completeness templates, since in this case, the BGPs of the templates have to be checked first for whether they may capture parts of the query, and only if so are they evaluated over the graph.

In Section 7, we will study how such trade-offs may behave in a practical setting, for all possible combinations of heuristics (that is, they may be applied alone, or together).

7. Experimental Evaluation

In this section, we report on our experimental evaluation of query completeness and query soundness reasoning. We have proposed in Section 6 three heuristic techniques for completeness reasoning, namely completeness templates, partial matching, and prioritized evaluation. Consequently, the practical impact of these techniques (and their interplay) on completeness reasoning needs to be validated. Furthermore, in the light of the query soundness problem, we want to study the performance of answer and pattern soundness checking in a realistic setting. The goal of our experimental evaluation is therefore two-fold:

1. To study how effective completeness templates, partial matching, and prioritized evaluation, as well as their combinations, are for completeness reasoning; and
2. to show the feasibility of soundness reasoning (which in turn reduces to completeness reasoning) in a realistic setting.

7.1. Query Completeness Evaluation

In this subsection, we first describe the experimental setup and then discuss the results of the experiments.

7.1.1. Experimental Setup

Given the three heuristic techniques for completeness reasoning, that is, completeness templates (temp), partial matching (match), and prioritized evaluation (peval), we want to analyze to which degree those techniques speed up completeness reasoning. Specifically, we want to study not only the effect of each individual heuristic, but also the effects of combining different heuristics. We therefore distinguish between eight heuristic settings: no, temp, match, peval, temp+match, temp+peval, match+peval, and all. The no setting is the baseline with no heuristic (that is, the vanilla implementation of completeness reasoning), whereas the all setting applies all heuristics.

Fig. 4. Comparison of completeness checking using different heuristic settings (no, temp, match, peval, temp+match, temp+peval, match+peval, and all) and of query evaluation (qeval). The x-axis is the query length, and the y-axis is the runtime in ms.
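The eight settings are simply all subsets of the three heuristics. As a minimal sketch (the function name `heuristic_settings` is our own and not part of the original framework), they can be enumerated as follows:

```python
from itertools import combinations

# The three heuristics, using the short names from the text.
HEURISTICS = ("temp", "match", "peval")


def heuristic_settings(heuristics=HEURISTICS):
    """Return the names of all subsets of the heuristics, smallest first."""
    names = []
    for size in range(len(heuristics) + 1):
        for subset in combinations(heuristics, size):
            if not subset:
                names.append("no")    # baseline: no heuristic enabled
            elif len(subset) == len(heuristics):
                names.append("all")   # every heuristic enabled
            else:
                names.append("+".join(subset))
    return names


print(heuristic_settings())
```

With three heuristics this yields 2^3 = 8 settings, in the same order as listed in the text.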

As for completeness reasoning, there are three ingredients: graph, queries, and completeness statements. While machine-readable completeness statements in the wild are yet to appear, we aim to make our setup as realistic as possible. We take Wikidata as the graph for our evaluation. More specifically, we use the monolingual, direct-statement fragment (i.e., the fragment without qualifiers, references, and non-English labels) of the Wikidata dump released in April 2018. The graph consists of around 491 million triples (57 GB in the uncompressed NT format). We choose Wikidata mainly because of its popularity and good quality.

As for the experiment queries, they have to use the same vocabulary as the Wikidata graph described above. We therefore generate the experiment queries based on human-made, openly available queries on the Wikidata query page.24 Since our experiment queries have to be conjunctive in order to be checked for completeness, we extract the BGPs of the queries on the Wikidata query page and generate our experiment queries based on these BGPs. We cannot directly use these BGPs as our experiment queries because too few BGPs could be extracted. Nevertheless, these BGPs serve as a starting point for generating our experiment queries, in the following way: (i) we evaluate each extracted BGP over the Wikidata graph; (ii) we randomly take 20 of the mappings, projected on the first variable of the BGP;25 and (iii) we generate experiment queries by instantiating the BGPs with these projected mappings. We generate in total 1,211 queries with an average query length (i.e., the number of triple patterns) of 2.7. Note that this average query length reflects that of real-world queries [17].

24 https://www.mediawiki.org/w/index.php?title=Wikibase/Indexing/SPARQL_Query_Examples&oldid=2099085
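The three generation steps can be illustrated with a small, self-contained sketch. It is not the Java/Jena implementation used in the experiments: the in-memory graph, the naive BGP evaluator, and all function names are assumptions made for illustration, and we assume that projecting on the first variable collapses duplicate values.

```python
import random


def is_var(term):
    return term.startswith("?")


def evaluate_bgp(bgp, graph):
    """Naive BGP evaluation: all mappings (dicts) that match every triple pattern."""
    mappings = [{}]
    for pattern in bgp:
        extended = []
        for mapping in mappings:
            for triple in graph:
                candidate = dict(mapping)
                ok = True
                for p_term, g_term in zip(pattern, triple):
                    if is_var(p_term):
                        # Bind the variable, or check consistency with an earlier binding.
                        if candidate.setdefault(p_term, g_term) != g_term:
                            ok = False
                            break
                    elif p_term != g_term:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        mappings = extended
    return mappings


def first_variable(bgp):
    for pattern in bgp:
        for term in pattern:
            if is_var(term):
                return term
    raise ValueError("BGP has no variable")


def generate_queries(bgp, graph, sample_size=20, seed=0):
    """Steps (i)-(iii): evaluate, sample projected mappings, instantiate the BGP."""
    var = first_variable(bgp)
    values = sorted({m[var] for m in evaluate_bgp(bgp, graph)})   # (i) + projection
    if len(values) > sample_size:                                 # (ii) take all if fewer
        values = random.Random(seed).sample(values, sample_size)
    return [[tuple(v if t == var else t for t in p) for p in bgp]  # (iii)
            for v in values]


# Toy data, purely illustrative.
graph = {
    ("usa", "capital", "washington"),
    ("italy", "capital", "rome"),
    ("italy", "officialLanguage", "italian"),
}
bgp = [("?c", "capital", "?x"), ("?c", "officialLanguage", "?l")]
print(generate_queries(bgp, graph))
```

On the toy graph only Italy matches both triple patterns, so a single instantiated query is produced.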

Now, for the completeness statements, we generate them in a similar way to the experiment queries. The only difference is that, in the second step above, instead of randomly taking 20 of the mappings, we randomly take 50% of the mappings to create completeness statements. This ensures that a large number of completeness statements is generated. In this setting, we also naturally represent completeness statements by completeness templates as follows: we take the extracted BGP as the template’s BGP, and the projected mappings as the template’s mappings. We generate in total 693,928 completeness statements with an average statement length (i.e., the number of triple patterns in the BGP of a completeness statement) of 2.4. Moreover, these statements are represented by 65 completeness templates.

25 If the number of mappings is below 20, then we just take all the mappings.




Implementation The reasoning program and experiment framework are implemented in Java using the Apache Jena library.26 The graph of the experimental evaluation is loaded into a Jena TDB triple store. We measure the runtime of completeness reasoning with the different heuristic settings described above, and of query evaluation. For each query length, we randomly sample 20 queries whose runtime is observed, and take the median runtime over these 20 queries. The experiments are run on a PC with an Intel Core i7 3.6 GHz processor, 16 GB of memory, and a 1 TB HDD.
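The measurement protocol (sample up to 20 queries per query length, report the median runtime) can be sketched as follows. This is an illustrative Python outline rather than the actual Java harness; the function name, the `run` callback, and the choice to use all queries when fewer than 20 exist for a length are our assumptions.

```python
import random
import statistics
import time


def median_runtime_by_length(queries, run, sample_size=20, seed=0):
    """Map each query length to the median runtime (in ms) over a random sample."""
    rng = random.Random(seed)
    by_length = {}
    for query in queries:
        by_length.setdefault(len(query), []).append(query)

    result = {}
    for length, group in sorted(by_length.items()):
        # Assumption: if fewer than sample_size queries exist, use all of them.
        sample = group if len(group) <= sample_size else rng.sample(group, sample_size)
        timings = []
        for query in sample:
            start = time.perf_counter()
            run(query)  # e.g. completeness checking or query evaluation
            timings.append((time.perf_counter() - start) * 1000.0)
        result[length] = statistics.median(timings)
    return result
```

Here a query is any sequence of triple patterns, so `len(query)` is the query length used on the x-axis of the runtime plot.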

7.1.2. Results and Discussion

Figure 4 summarizes the results of the experiments. The figure shows the runtime comparison of completeness checking using different heuristic settings. The x-axis is the query length, whereas the y-axis shows the runtime in ms on a log scale. We also add the query evaluation time to give an orientation as to how completeness checking compares to query evaluation.

In general, we observe that completeness checking using all heuristics (that is, the all setting) beats all the other heuristic settings. The no, temp, and peval settings are the worst-performing. Using the temp or peval heuristic alone does not help much in speeding up completeness reasoning, as each yields only a slight improvement. In the temp setting, despite the lower number of templates in comparison to the number of statements, completeness templates are by design less selective than completeness statements, and thus the transfer operator tends to take longer to compute for them. The peval heuristic does not perform well since all completeness statements have to be evaluated anyway. Interestingly, the match setting gives a mixed impression: for short queries, match seems to be a sufficient heuristic for completeness checking; yet, when the queries become longer, starting from those of length 3, its runtime increases considerably. The only exception, at queries of length 6, might be due to the following reason: in our experiment setup there are only 5 queries of length 6, generated from one extracted BGP only. Hence, those queries might not have many overlaps with the other queries, which is an advantage for the match technique (recall that our completeness statements are generated in a similar way to the queries). The queries of length 8 (of which there are not many in our experiment query set) are in fact the opposite of the queries of length 6: they have many overlaps with the other queries, hence the runtime increase for queries of length 8.

26 http://jena.apache.org/

Let us now observe the performance of combining the heuristics. Adding partial matching on top of completeness templates provides a runtime improvement, as can be seen from the temp+match setting. Here, only partially matched templates are evaluated instead of all templates. For the case of temp+peval, we observe that completeness templates are reasonable for completeness checking when used together with prioritized evaluation. This is because prioritized evaluation tends to avoid the heavy computation of the transfer operator over the graph whenever: (i) no parts of the template can be applied to the prototypical graph; or (ii) parts of the template can be applied to the prototypical graph, but capture frozen variables (which do not exist in the graph). Still, all templates have to be considered, since there is no partial matching. The addition of prioritized evaluation to partial matching provides only a slight improvement, as observed in the match+peval setting. Combining all the heuristic techniques, as in the all setting, gives the best performance. The reason is that in addition to the benefits of temp+peval, as described above, we also obtain the effect of partial matching in filtering out irrelevant completeness templates, so that only potentially relevant templates are considered.

Overall, our experimental evaluation has answered the question of how effective the heuristic techniques (and their combinations) are for completeness checking. Partial matching, though promising for short queries, tends to have a negative effect for long queries. Templates and prioritized evaluation are best applied together. Using all three heuristics brings the advantages of each heuristic into one.

7.2. Query Soundness Evaluation

Here, we describe the setup of the experiments, and then discuss their results.

7.2.1. Experimental Setup

From the characterizations in Section 5, we are able to check query soundness by reducing it to query completeness checking. Pattern soundness checking can be reduced to data-agnostic completeness checking, for which we proposed a heuristic technique in Section 6.1 that is based on the term-relevance principle. As for answer soundness checking, we reduce it to data-aware completeness checking. In Section 7.1, we have shown not only that using the heuristics of completeness templates, partial matching, and prioritized evaluation yields runtime improvements over a plain implementation of a completeness reasoner, but also that applying all these heuristics in combination performs best. Given these improvement techniques for completeness checking, we want to study the feasibility of soundness checking in a realistic setting. Additionally, we want to know how pattern soundness checking compares to answer soundness checking, and how soundness checking compares to query evaluation.

For experimentally evaluating query soundness, we (again) require three ingredients: graph, queries, and completeness statements. We use the Wikidata graph described in the experiments in Section 7.1. For the experiment queries, we now need queries with negation. As before, we take Wikidata queries27 and extract their BGPs to generate experiment queries, this time with negation. We want to have queries with negation of various shapes. For this reason, from the BGPs of the Wikidata queries we generate different sets of queries with negation, differing in the triple patterns that are negated:

– QoneTP, the last triple pattern is negated;

– QoneTPoneTP, the last two triple patterns are independently negated, forming two NOT-EXISTS patterns;

– QtwoTPs, the last two triple patterns are negated together, forming one NOT-EXISTS pattern; and

– QthreeTPs, the last three triple patterns are negated together, forming one NOT-EXISTS pattern.

The number of negated triple patterns is set to at most three, since most real-world queries are of length up to three [15]. We project out all variables in the positive part to correspond to graph pattern evaluation (recall the definition of graph patterns in Section 2.1). The statistics of the generated queries are shown in Table 2. The number of queries across the different cases ranges from 18 to 59. The median query length is 3 for all the cases, except the QthreeTPs case, whose median query length is 4.
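To make the four query shapes concrete, the following sketch builds SPARQL text from a BGP by negating its trailing triple patterns. It is an illustration only: the function `negate`, the `SHAPES` table, the use of `SELECT *` with `FILTER NOT EXISTS`, the requirement of a non-empty positive part, and the example BGP (with prefix declarations omitted) are our assumptions, not the generator used in the experiments.

```python
def fmt(pattern):
    """Render one triple pattern, given as a tuple of SPARQL terms."""
    return " ".join(pattern) + " ."


def negate(bgp, groups):
    """Negate the trailing triple patterns of bgp.

    groups lists the sizes of the NOT-EXISTS patterns, in order; e.g. [1, 1]
    negates the last two triple patterns independently, [2] negates them together.
    """
    n_neg = sum(groups)
    if not 0 < n_neg < len(bgp):
        raise ValueError("need at least one negated and one positive triple pattern")
    positive, rest = bgp[:-n_neg], bgp[-n_neg:]
    parts = [fmt(p) for p in positive]
    start = 0
    for size in groups:
        body = " ".join(fmt(p) for p in rest[start:start + size])
        parts.append("FILTER NOT EXISTS { " + body + " }")
        start += size
    return "SELECT * WHERE { " + " ".join(parts) + " }"


# Group sizes for the four query sets described in the text.
SHAPES = {"QoneTP": [1], "QoneTPoneTP": [1, 1], "QtwoTPs": [2], "QthreeTPs": [3]}

# Illustrative BGP: humans with a place of birth and a child.
bgp = [("?p", "wdt:P31", "wd:Q5"), ("?p", "wdt:P19", "?b"), ("?p", "wdt:P40", "?c")]
print(negate(bgp, SHAPES["QoneTP"]))
print(negate(bgp, SHAPES["QoneTPoneTP"]))
```

In this sketch a shape is skipped (via `ValueError`) when the BGP is too short to leave a positive part, which is one plausible reason why the number of queries differs across the query sets.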

Completeness Statements We used two different methods of generating completeness statements, depending on whether we wanted to perform answer soundness or pattern soundness checking. For answer soundness, we wanted to generate the statements in such a way that there would be a variety of sound and possibly unsound answers. So, we generated the statements as follows: (i) given a query, we evaluated the query and obtained all the answer mappings; (ii) for 25% of these answer mappings, we applied them to the BGP of each NOT-EXISTS pattern of the query and constructed completeness statements out of these instantiated BGPs. This way, we can guarantee that these 25% of the answer mappings are sound.

27 https://www.mediawiki.org/w/index.php?title=Wikibase/Indexing/SPARQL_Query_Examples&oldid=2099085

In this setting, we can naturally represent completeness statements by completeness templates (see Section 6.2). We took the BGPs of the NOT-EXISTS patterns as the templates’ BGPs and the sound answer mappings as the templates’ mappings.

In the particular case of QtwoTPs, however, we also generated completeness statements in an additional way, which differs in how we obtain the BGPs for the completeness statements: instead of taking the whole instantiated BGP of the NOT-EXISTS pattern, we generated completeness statements separately per triple pattern in the instantiated BGP. The first triple pattern28 in the instantiated BGP was taken as is, and the second was (again) instantiated with the answer mappings from the evaluation of the first triple pattern over the graph.
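The generation of statements for answer soundness (the whole-BGP variant) can be sketched as below. A completeness statement is represented here only by its BGP, a tuple of triple patterns; the function names, the rounding of the 25% share with `ceil`, and the toy data are assumptions made for illustration.

```python
import math
import random


def apply_mapping(bgp, mapping):
    """Replace every variable bound by the mapping; other terms stay untouched."""
    return tuple(tuple(mapping.get(term, term) for term in pattern) for pattern in bgp)


def statements_for_answer_soundness(answers, negated_bgps, fraction=0.25, seed=0):
    """Pick a share of the answer mappings and instantiate each NOT-EXISTS BGP with them.

    Returns the chosen (guaranteed-sound) answers and the set of statement BGPs.
    """
    rng = random.Random(seed)
    k = math.ceil(fraction * len(answers))  # assumption: round the share up
    chosen = rng.sample(answers, k)
    statements = {apply_mapping(bgp, m) for m in chosen for bgp in negated_bgps}
    return chosen, statements


# Toy data: eight answers of a query whose single NOT-EXISTS pattern asks for children.
answers = [{"?p": "person%d" % i} for i in range(8)]
negated_bgps = [(("?p", "child", "?c"),)]
chosen, statements = statements_for_answer_soundness(answers, negated_bgps)
print(sorted(statements))
```

Each resulting statement asserts completeness for, e.g., all children of one particular person, so the absence of matching triples for that person makes the corresponding answer sound.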

For the generation of statements for checking pattern soundness, we simply transformed the union of the positive part and each BGP of the NOT-EXISTS patterns into a completeness statement. The statistics of the generated completeness statements are shown in Table 2. The number of statements ranges from around 61,000 to 454,000.

We had five different cases for our experimental evaluation, obtained by combining different query sets and completeness statements:

– oneTP is where the last triple pattern is negated;

– oneTPoneTP is where the last two triple patterns are independently negated;

– twoTPsTO (‘TO’ for together) is where the last two triple patterns are negated together and the statements are for the whole BGP;

– twoTPsSE (‘SE’ for separate) is where the last two triple patterns are negated together, but the statements are obtained separately per triple pattern; and

– threeTPsTO (‘TO’ for together) is where the last three triple patterns are negated together and the statements are for the whole BGP.

28 We fixed an ordering.

In each case, to perform answer soundness checking, we did not use the statements generated for pattern soundness, since that would have made all the answers sound. On the other hand, to perform pattern soundness checking, we also used all the statements generated for answer soundness, as otherwise there would have been too few statements (= the number of queries per case). We measured the runtime of soundness reasoning for both patterns and answers, and also that of query evaluation. For each case, we removed the measurements where the query evaluation returned 0 answers, as answer soundness checking would have become trivial. Each measurement was repeated 15 times and we took the median. Moreover, to get the result summary of each experiment case, we also took the median over the case’s results. We used the median to avoid the effect of extreme values (for instance, some queries returned a large number of results, up to about 490,000). The experiments were done on a PC with an Intel Core i7 3.6 GHz processor, 16 GB of memory, and a 1 TB HDD.

7.2.2. Experimental Results and Discussion

Table 2 summarizes the results of the soundness reasoning experiments for the five cases. In the table, the query evaluation time ranges from 20 ms to 89 ms. Pattern soundness checking always takes less than 0.3 ms. The term-relevance principle is likely to help rule out irrelevant completeness statements before performing the actual pattern soundness checking. Note that pattern soundness checking takes the longest in the oneTPoneTP case, due to the two data-agnostic completeness checks that have to be performed for every pattern soundness check (recall the reduction in Theorem 3). In comparison to query evaluation, pattern soundness checking is much faster, particularly due to its data-agnostic behavior.

As for answer soundness checking, we observe that the runtime ranges from 2 ms to 592 ms. The threeTPsTO case takes longer due to the large number of query answers that have to be checked for soundness. When we break down the time per answer, the computation takes less than 0.4 ms. Also, it is likely that the more triple patterns there are in the negation part, the longer the soundness check per answer takes. In comparison to query evaluation, answer soundness checking is quite competitive. Yet, when compared to pattern soundness checking, answer soundness checking is slower due to its data-aware behavior.

Let us look more closely at answer soundness checking. Figure 5 shows the comparison between the number of query answers, query evaluation time, and answer soundness checking time for the cases oneTP, twoTPsTO, and threeTPsTO. We omitted the figure for oneTPoneTP since it is similar to that of oneTP, and for twoTPsSE as it is similar to that of twoTPsTO. The x-axis is the query rank, with queries ordered by ascending number of query answers. The y-axis is in log scale and shows the respective unit (count for the query answers, and ms for the runtime). There is strong evidence of a positive correlation between the number of query answers and the answer soundness checking time. Moreover, we also see the following trend for all the cases: at first, when there are just a few query answers, query evaluation tends to be slower than answer soundness checking. When the number of query answers increases, the answer soundness checking time outgrows the query evaluation time. When queries become more complicated, the cross-over point occurs earlier. This probably has to do with the soundness checking time per answer increasing as the number of negated triple patterns increases, as discussed above.

To summarize, we have performed an experimental evaluation of soundness checking in a realistic setting based on Wikidata. Soundness checking using the heuristic techniques proposed in Section 6 has been shown to be feasible for our experiment cases. To get an idea of how long soundness checking may take without any heuristics, we performed vanilla answer soundness checking for the oneTP case, and found that it can take more than 300 s for just a single query. In comparison to query evaluation and answer soundness checking, pattern soundness checking runs much faster. We would recommend that in practice, pattern soundness checking be done before answer soundness checking, since it takes less time and, by Proposition 2, if pattern soundness holds, then all answers are sound.
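The recommended order of checks amounts to a simple short-circuit. The sketch below treats the two checkers as caller-supplied functions; `pattern_is_sound` and `answer_is_sound` are placeholders for the actual reasoners, which are not reproduced here.

```python
def sound_answers(query, answers, pattern_is_sound, answer_is_sound):
    """Return the answers that are guaranteed to be sound.

    The cheap, data-agnostic pattern check runs first; if it succeeds, every
    answer is sound (Proposition 2) and no per-answer check is needed.
    """
    if pattern_is_sound(query):
        return list(answers)
    return [a for a in answers if answer_is_sound(query, a)]


# Usage with stub checkers (purely illustrative).
calls = []


def never_pattern_sound(query):
    return False


def even_answers_sound(query, answer):
    calls.append(answer)
    return answer % 2 == 0


print(sound_answers("Q", [1, 2, 3, 4], never_pattern_sound, even_answers_sound))
```

When the pattern check succeeds, the per-answer checker is never invoked, which matters most for queries with many answers.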

8. Related Work

In this section, we discuss related work on query completeness and query soundness.


Table 2. The number of statements |C|, the number of queries N_Q, and the medians of query length |Q|, of number of query answers |⟦Q⟧_G|, of query evaluation time t_Q, of answer soundness checking time t_AS, of answer soundness checking time per answer t_AS/a, and of pattern soundness checking time t_PS for the different cases. All times are in ms.

Case          |C|       N_Q   |Q|   |⟦Q⟧_G|   t_Q   t_AS   t_AS/a   t_PS
oneTP         61,971    59    3     51        21    2.03   0.043    0.13
oneTPoneTP    454,131   30    3     374       89    17.4   0.046    0.21
twoTPsTO      395,399   41    3     198       20    26.5   0.15     0.13
twoTPsSE      414,321   41    3     198       20    12.7   0.06     0.13
threeTPsTO    335,852   18    4     2,671     89    592    0.39     0.15

Fig. 5. Comparison between the number of query answers (|⟦Q⟧_G|), query evaluation time (t_Q), and answer soundness checking time (t_AS) for the experiment cases oneTP, twoTPsTO, and threeTPsTO. The x-axis is the query rank (from the lowest to the highest number of query answers) and the y-axis shows both the number of answers and the runtime in ms.

Query Completeness Data completeness concerns the breadth, depth, and scope of information [18] and is deemed to be one of the most significant data quality dimensions [19]. In the field of relational databases, Motro [20] and Levy [21] were among the first to investigate data completeness. Motro modeled the completeness of an available database instance in terms of its relationship with a hypothetical, unknown instance that represents the real world. This model has been taken up in the present paper by the notion of extension pairs. Motro also defined completeness constraints, which correspond to our completeness statements, and developed a sound technique to check query completeness of conjunctive queries in the data-agnostic setting. Levy generalized completeness constraints to local completeness statements, which correspond to the conditional statements in Section 3.2.2 and are more expressive than the statements in the framework studied here.

Razniewski and Nutt [22] further extended this work by reducing completeness reasoning to containment checking, for which many algorithms are known, and by characterizing the complexity of reasoning for different classes of conjunctive queries. While their investigations were largely at the schema level (that is, data-agnostic), they also showed that for conditional statements the combined complexity of completeness checking in the data-aware setting is Π^P_2-complete and the data complexity is polynomial. Since conjunctive SPARQL queries can be seen as a special case of conjunctive queries in the relational framework, and since our completeness statements are unconditional, the Π^P_2-hardness result in the present paper strengthens the one in [22], while the result on data complexity can be deduced from theirs.

This line of work on data-agnostic reasoning was later continued by Darari et al. [4, 6], incorporating orthogonal aspects such as time-awareness and federation. In contrast, the present work focuses on more restrictive, yet also more intuitive unconditional completeness statements, for which it provides an in-depth complexity analysis and implementation techniques for data-aware reasoning. The problems of answer and pattern soundness are also completely new. In [23], Razniewski et al. proposed completeness patterns and defined a pattern algebra to check the completeness of queries. The work incorporated database instances, yet provided only a sound algorithm for completeness checking. In our work, a sound and complete algorithm for data-aware completeness checking and a comprehensive complexity analysis of the checking are given.

On a more abstract level, Fürber and Hepp [24] distinguished three types of completeness: ontology completeness, concerning which ontology classes and properties are represented; population completeness, referring to whether all objects of the real world are represented; and property completeness, measuring the missing values of a specific property. Those three types of completeness, together with interlinking completeness, i.e., the degree to which instances in the dataset are interlinked, are considered to be the bases of the completeness dimension for RDF data sources [25]. Our work considers completeness statements which are built upon BGPs, and hence have more flexibility in expressing completeness (e.g., “complete for all children of the US presidents who were born in Hawaii”). Mendes et al. [26] proposed Sieve, a framework for expressing quality assessment and fusion methods, where completeness is also considered. With Sieve, users can specify how to compute quality scores and express a quality preference specifying which characteristics of data indicate higher quality. Ermilov et al. [27] presented LODStats, a statistics aggregation of RDF datasets published over various data portals such as data.gov, publicdata.eu, and datahub.io. They discussed several use cases that could be facilitated by such an aggregation, including coverage analysis (e.g., most frequent properties and most frequent namespaces of a dataset). As opposed to Sieve and LODStats, our work puts more focus on describing the completeness of data sources, and on leveraging such completeness descriptions for checking query completeness (and soundness). In the context of crowdsourcing, Chu et al. [28] developed KATARA, a hybrid data cleaning system, which not only cleans data, but may also add new facts to increase the completeness of the KB, whereas Acosta et al. [29] developed HARE, a hybrid SPARQL engine to enhance answer completeness. As opposed to our work, KATARA and HARE cannot be used to check whether queries are complete in the sense that all answers are returned, as they concentrate more on increasing the degree of KB and query completeness.

Galárraga et al. [30] proposed a rule mining system that is able to operate under the Open-World Assumption (OWA) by simulating negative examples using the Partial Completeness Assumption (PCA). The PCA assumes that if the dataset knows some r-attribute of x, then it knows all r-attributes of x. This heuristic was also employed by Dong et al. [31] (called Local Closed-World Assumption in their paper) to develop Knowledge Vault, a Web-scale system for probabilistic knowledge fusion. Our completeness statements are in fact a generalization of the assumption used in the above work.
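The PCA as stated above is easy to express operationally. The following sketch (the function name and toy knowledge base are ours) decides whether a candidate triple counts as a negative example: it does so only if the KB already knows at least one r-attribute of x and the candidate value is not among them.

```python
def pca_negative(kb, x, r, y):
    """True if (x, r, y) is a negative example under the PCA.

    kb is a set of (subject, predicate, object) triples. If no r-attribute of x
    is known, nothing is assumed and the triple is not treated as negative.
    """
    known = {o for (s, p, o) in kb if s == x and p == r}
    return bool(known) and y not in known


kb = {("obama", "child", "malia"), ("obama", "child", "sasha")}
print(pca_negative(kb, "obama", "child", "bob"))    # True: children of obama assumed complete
print(pca_negative(kb, "trump", "child", "bob"))    # False: no child of trump known
print(pca_negative(kb, "obama", "child", "malia"))  # False: the triple is in the KB
```

In contrast to this built-in heuristic, a completeness statement has to be asserted explicitly and can cover a whole BGP rather than a single subject-relation pair.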

Query Soundness The use of negation in querying can be traced back to Codd’s relational calculus [32], where a tuple is included in the complement of a relation if it is not explicitly given in the relation. Reiter [33] and Clark [34] generalized this to rule-based systems. They assumed that the failure to find a proof of a fact implies that the negation is true, and called this the closed-world assumption (CWA). SPARQL, the standard query language for RDF, supports negation by such a non-existence check [5, 35]. However, since the semantics of RDF imposes the open-world assumption (OWA) [2], there remains a conceptual mismatch when negation in SPARQL over RDF datasets is evaluated in a closed-world style. In other words, there is a gap between the normative semantics of negation in SPARQL, which is based on negation-as-failure (‘negation from the failure to find a proof of the fact’) [36], and classical negation (‘the negated fact truly holds’) [37], due to RDF’s openness. Moreover, the fact that RDF is a positive language means that one viable way of having negated facts in RDF is by imposing some (partial) completeness assumption over RDF data: whenever P is complete, then all facts not in P are false.

In the Semantic Web, Polleres et al. [38] first observed this mismatch. They proposed to restrict the scope of negation to particular data sources, thus limiting the search for negative information. In their work, no assumption was made as to whether the knowledge in these data sources is complete. In description logics (DLs), Lutz et al. [39] proposed closed predicates, that is, concepts and roles that are interpreted to be complete, to enable a combination of open- and closed-world reasoning. A similar concept was also employed by Analyti et al. [40]. They proposed ERDF, an extended RDF that supports negation as well as derivation rules. ERDF allows one to have local closed-world information via default closure rules for properties and classes. As opposed to these two approaches, which considered only a simple partial CWA over atomic classes and properties (e.g., all cars, all child relationships, . . . ), our work supports more expressive completeness information in the sense that we can use BGPs to capture completeness. From the practical side of the Semantic Web, negation is featured in test queries of many popular SPARQL benchmarks such as SP2Bench [41], Berlin SPARQL Benchmark (BSBM) [42], and FedBench [43], in which the closed-world assumption (CWA) is employed. Our work provides not only formalizations, but also optimization techniques for checking the soundness of queries with negation, which we have experimentally shown to improve the feasibility of soundness checking in realistic settings based on Wikidata.

More recently, Gutierrez et al. [44] proposed an alternative semantics for SPARQL based on certain answers. They argued that the proposed semantics is more suitable to capture RDF peculiarities, such as the OWA, the unique name assumption (UNA), and blank nodes. For queries with negation, they showed that the queries do not have certain answers, since more facts can be arbitrarily added to falsify the query answers. In our work, we combine open- and closed-world information in RDF, enabling SPARQL queries with negation to have answers that are guaranteed to remain. That is, when queries are guaranteed to be sound by completeness statements, new data that might be added to the graph is restricted by the statements, hence the answers will not be falsified.

Weak Monotonicity The open-world semantics of RDF has led to the question of which type of queries appropriately reflects this fact. Arenas and Pérez noted that for positive graph patterns, that is, patterns composed by the operators AND, UNION, and FILTER, the semantics of SPARQL reflects the open-world semantics in that the answers for such a pattern are exactly the certain answers [45].

As an additional feature that supports querying incomplete information, SPARQL offers the OPT constructor that binds variables if they can be matched to the data, and leaves them unbound otherwise. Patterns with OPT are not necessarily monotone because adding triples to a data graph may lead to bindings for variables that were not bound previously. Such patterns may however be weakly monotone, that is, for every answer mapping returned over the smaller graph there is an answer over the larger graph that extends it. Thus, weakly monotone patterns do not lose information when new information becomes available, although they may not preserve the shape of answers.

While not all patterns with OPT are weakly monotone, this is the case for the class of well-designed patterns [45]. It has been shown, though, that there are also weakly monotone patterns beyond the well-designed patterns. Arenas and Ugarte have investigated patterns that are weakly monotone over potentially infinite graphs and characterized them by applying interpolation techniques from first-order logic [46].

In the presence of incomplete data, OPT can be interpreted in two different ways. For example, the pattern (?p, livesIn, bolzano) OPT (?p, hasEmail, ?e) can be interpreted as asking for all persons living in Bolzano, together with the email addresses present in the data, or for all such persons together with all their email addresses, if they have any. Over complete data, where no information is added, only the second interpretation is meaningful.
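The behavior of this example pattern under growing data can be checked with a small sketch. The evaluator below is hard-wired to the single pattern (?p, livesIn, bolzano) OPT (?p, hasEmail, ?e); the function names and the toy graphs are our own illustrative assumptions.

```python
def eval_opt(graph):
    """Evaluate (?p, livesIn, bolzano) OPT (?p, hasEmail, ?e) over a set of triples."""
    answers = []
    for (s, p, o) in sorted(graph):
        if p == "livesIn" and o == "bolzano":
            emails = sorted(e for (s2, p2, e) in graph if s2 == s and p2 == "hasEmail")
            if emails:
                answers.extend({"?p": s, "?e": e} for e in emails)
            else:
                answers.append({"?p": s})  # ?e is left unbound
    return answers


def extends(big, small):
    """True if mapping big agrees with small on every variable bound in small."""
    return all(big.get(var) == val for var, val in small.items())


def weakly_monotone_on(small_graph, large_graph):
    """Check the weak-monotonicity condition for one pair of graphs."""
    return all(
        any(extends(b, a) for b in eval_opt(large_graph))
        for a in eval_opt(small_graph)
    )


g = {("ann", "livesIn", "bolzano")}
g_larger = g | {("ann", "hasEmail", "ann@example.org")}
print(eval_opt(g))         # ann without an email binding
print(eval_opt(g_larger))  # ann together with her email address
print(weakly_monotone_on(g, g_larger))
```

The answer over the smaller graph is no longer returned over the larger one, so the pattern is not monotone; it is, however, extended by the new answer, as weak monotonicity requires.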

Darari et al. have characterized completeness of well-designed patterns in the data-agnostic setting [4, 6]. We conjecture that techniques from that work and from the current paper can be combined to reason about the completeness of well-designed patterns in the presence of data.

9. Discussion

Here we discuss issues related to our framework: creation of completeness information, no-value information, blank nodes, and RDFS extension.

Creation of Completeness Information Our framework relies on the availability of machine-readable completeness information. We found a widespread interest in collecting completeness information in various forms, for example on Wikipedia, IMDb, and OpenStreetMap. The techniques we develop may serve as an incentive to standardize such information and to make it available in RDF, since such information is then useful not only for managing data quality, but also for assessing query quality in terms of completeness and soundness.

Ideas for approaches to automating the generation of completeness information were collected in [47]. Galárraga et al. [48] investigated various signals, such as popularity, update frequency, and cardinality, that can be used to identify complete parts of a KB via rule-mining techniques. Mirza et al. [49, 50] developed techniques for relation cardinality extraction from text, which can be leveraged to generate completeness statements in the following way: when the extracted cardinality of a relation matches the relation count in a KB, a completeness statement can be generated. COOL-WD is a collaborative, web-based system for managing and consuming completeness information about Wikidata, which currently stores over 10,000 real completeness statements [51], and is available at http://cool-wd.inf.unibz.it. Additionally, COOL-WD demonstrates how provenance information, such as authorship, timestamp, and external reference, can be added to completeness statements, which can then serve as a basis for trust determination over query completeness and soundness checking (e.g., “this query is complete based on the completeness assertions X, Y, and Z, given by A and B on date D, with references to R and S”).
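The cardinality-matching idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the KB contents, the extracted cardinalities, and the function name `generate_statements` are hypothetical, and a statement is represented simply by its triple pattern.

```python
from collections import Counter


def generate_statements(graph, extracted_cardinalities):
    """Return patterns (s, p, '?y') whose KB count equals the extracted cardinality."""
    counts = Counter((s, p) for s, p, _ in graph)
    return [
        (s, p, "?y")
        for (s, p), expected in sorted(extracted_cardinalities.items())
        if counts[(s, p)] == expected
    ]


# Hypothetical KB: john has two recorded children, mary has one.
KB = {
    ("john", "child", "alice"),
    ("john", "child", "bob"),
    ("mary", "child", "carol"),
}
# Hypothetical output of a text-based cardinality extractor.
EXTRACTED = {("john", "child"): 2, ("mary", "child"): 3}

statements = generate_statements(KB, EXTRACTED)
print(statements)
```

Only the pattern for John is emitted, since the KB count for Mary falls short of the extracted cardinality.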

No-Value Information. Completeness statements can also be used to represent no-value information. Such information is particularly useful to distinguish whether a value is absent due to data incompleteness or due to its inapplicability. Wikidata, for instance, contains about 19,000 pieces of no-value information over 269 properties.²⁹ In our motivating scenario, there is the completeness statement about the official languages of the USA with no corresponding data in the graph. In this case, we basically say that the USA has no official languages. As a consequence of having no-value information, we can be complete for queries despite having the empty answer. Such a feature is similar to that proposed in [52]. The only difference is that here we need to pair completeness statements with a graph that has no corresponding data captured by the statements, while in that work, no-value statements are used to directly say that some parts of data do not exist.

SPARQL Fragment. In this paper we have focused on conjunctive SPARQL, that is, SPARQL without UNION, OPTIONAL, or arithmetic comparisons. Conjunctive SPARQL is the foundational fragment underlying all extensions [9, 10], thus we believe that the study of completeness and soundness reasoning for this fragment provides a solid theoretical basis for further analysis of richer fragments. Our main results can be translated to richer fragments as sufficient conditions for completeness and soundness. We conjecture that also necessary conditions can be established in analogy to the data-agnostic case [6], for UNION in terms of completeness and soundness of the disjuncts, or for well-designed patterns with OPTIONAL in terms of completeness and soundness of pattern trees [53], and for patterns with arithmetic comparisons by the use of representative orderings for the containment checks [54].

²⁹ As of Feb 18, 2017.

Blank Nodes. The use of blank nodes in RDF has been a controversial topic in the Semantic Web community [55, 56]. In Linked Data applications, blank nodes add complexity to data processing and data interlinking due to the local scope of their labels [57, 58]. With respect to SPARQL, there are semantic mismatches with the RDF semantics of blank nodes, e.g., when COUNT and NOT-EXISTS features are employed [59]. Nevertheless, blank nodes are used in practice to some degree: (i) for modeling unknown nulls [52, 59], and (ii) for modeling n-ary relations as auxiliary instances in reification [60].

For the former usage, completeness of a topic that contains a blank node is contradictory, as we argue that completeness statements should capture only “known and complete” information. For instance, one may state that a graph is complete for triples of the form (john, child, ?y), while the graph contains the triple (john, child, _:b), indicating that John is complete for his unknown child, which does not really make sense. Nevertheless, a graph with completeness statements may still have blank nodes as long as they are not captured by the statements.

For the latter case, Skolemization as a way to systematically replace blank nodes with fresh Skolem IRIs may be leveraged with almost interchangeable behavior [2, 61, 62], except that Skolem IRIs have a global scope instead of a local scope. This way, completeness statements can capture n-ary relation information encoded originally with blank nodes, and completeness reasoning (which involves SPARQL queries) behaves well (i.e., no semantic mismatches as per [59]). Nevertheless, in practice Semantic Web developers often tend to directly use IRIs instead of blank nodes for representing auxiliary resources, as is done, for instance, in Wikidata [58].³⁰
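As a rough illustration of Skolemization, the following sketch replaces each blank node label with a fresh IRI. It assumes triples are tuples of strings and blank nodes are written with the `_:` prefix; the base IRI under `http://example.org/` and the sample graph are made up for this example, while the `/.well-known/genid/` path follows the convention suggested in RDF 1.1 Concepts [61].

```python
import uuid


def skolemize(graph, base="http://example.org/.well-known/genid/"):
    """Replace every blank node label by a fresh, globally scoped Skolem IRI."""
    mapping = {}

    def sk(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = base + uuid.uuid4().hex
            return mapping[term]
        return term

    return {(sk(s), sk(p), sk(o)) for s, p, o in graph}


# Hypothetical n-ary relation modeled with an auxiliary blank node.
G = {
    ("john", "marriage", "_:m"),
    ("_:m", "spouse", "mary"),
}
skolemized = skolemize(G)
print(sorted(skolemized))
```

The same blank node label is always mapped to the same Skolem IRI, so the structure of the n-ary relation is preserved and can be referred to in completeness statements.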

RDFS Extension. RDFS [63] adds lightweight semantics to describe the structure and interlinking of data, usually sufficient for Linked Data publishers [57]. Main RDFS inference capabilities consist of class and property hierarchies, as well as property domains and ranges [57, 64], which are widely used in practice [65]. Darari et al. [4] formalized the incorporation of RDFS in data-agnostic completeness reasoning. Using a similar technique as in [4], it is also relatively easy to extend our data-aware completeness reasoning framework with the RDFS semantics. The idea is that we strengthen our syntactic characterization of computing the epg operator (see Section 4.1) via the closure operation wrt. RDFS ontologies [64]. More precisely, in the crucial part, the closure has to be computed before and after the T_C operation over P ∪ G. Also, the evaluation of the crucial part needs to be done over the materialized graph G wrt. the RDFS ontology. As for query soundness checking, a similar procedure based on RDFS closure needs to be employed as well. For pattern soundness reasoning, we include the closure computation in the query set containment checking for Non-Redundant Form (NRF), and in the query completeness checking (as in Proposition 10). For answer soundness checking, we can simply rely on the data-aware completeness checking with RDFS incorporation we just sketched. In summary, the addition of the closure computation ensures that the semantics of RDFS is incorporated in the reasoning, while not increasing the complexity as the RDFS closure computation can be done in PTIME [64].

³⁰ For instance, the resource IRI of Wikidata for the marriage between Donald Trump and Ivana Trump is http://www.wikidata.org/entity/statement/q22686-f813c208-48b2-9a72-3c53-cdaed80518d2.
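To give an impression of the closure operation, here is a minimal fixed-point sketch. It covers only a simplified subset of RDFS-style rules (class and property hierarchies, domains, and ranges) and is not the exact rule set of [64]; the prefixed names are used as plain strings and the sample ontology is hypothetical.

```python
TYPE = "rdf:type"
SC = "rdfs:subClassOf"
SP = "rdfs:subPropertyOf"
DOM = "rdfs:domain"
RNG = "rdfs:range"


def rdfs_closure(graph):
    """Apply the simplified rules until no new triple can be derived."""
    closure = set(graph)
    while True:
        derived = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if p == p2 == SC and o == s2:        # subclass transitivity
                    derived.add((s, SC, o2))
                if p == p2 == SP and o == s2:        # subproperty transitivity
                    derived.add((s, SP, o2))
                if p == TYPE and p2 == SC and o == s2:   # type inheritance
                    derived.add((s, TYPE, o2))
                if p == s2 and p2 == SP:             # property inheritance
                    derived.add((s, o2, o))
                if p == s2 and p2 == DOM:            # domain typing
                    derived.add((s, TYPE, o2))
                if p == s2 and p2 == RNG:            # range typing
                    derived.add((o, TYPE, o2))
        if derived <= closure:
            return closure
        closure |= derived


# Hypothetical ontology and data.
G = {
    ("child", SP, "relative"),
    ("child", DOM, "Person"),
    ("Person", SC, "Agent"),
    ("john", "child", "mary"),
}
closed = rdfs_closure(G)
print(("john", "relative", "mary") in closed)
```

The nested loop is quadratic per round and the number of rounds is bounded by the number of derivable triples, so this naive version stays polynomial, in line with the PTIME remark above.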

Extending the Completeness and Soundness Characterizations to Relational Databases. Our completeness and soundness results are based on RDF and SPARQL, which are the triple-based data model and triple-pattern-based query language for the Semantic Web. However, analogous algorithms, characterizations and complexity results can also be formulated and proved for relational databases if we generalize triples to n-ary relations and conjunctive patterns to projection-free conjunctive queries. The completeness statements in this paper can be generalized in two ways. One way is to view them as so-called query completeness statements, which state that a given query is complete (see [66]), where the query in this case is projection-free. Another way is to see them as collections of local completeness statements (see [21, 22]). The proofs for the analogous results would be similar to the ones given here. This class of completeness statements, however, has not yet been considered in the relational setting, while they arise naturally in the Semantic Web context.

10. Conclusions and Future Work

The open-world assumption of RDF and the closed-world evaluation of SPARQL have created a gap on how we should treat the completeness and soundness of query answers. This paper bridges the gap between RDF and SPARQL via completeness statements, metadata about (partial) completeness of RDF data sources. In particular, we have introduced the problem of query completeness checking over RDF data annotated with completeness statements. We also developed an algorithm and performed a complexity analysis for the completeness problem. Then, we formulated the problem of soundness for SPARQL queries with negation, and characterized it via a reduction to completeness checking. We proposed optimizations for completeness checking, and provided experimental evidence that our techniques are feasible in realistic settings over Wikidata for both query completeness and soundness problems.

Our current approach tackles the core fragment of SPARQL, and can be easily adapted to provide sufficient characterizations of richer fragments (e.g., involving union, arithmetic filter, or, for queries with negation, the selection operator), setting a solid basis for future investigations into the full characterization of those fragments. Also, while the current completeness statements are constructed using BGPs, one might wonder what happens if richer constructors are added, to enable statements like “Complete for all students who were born after 1991 and who do not speak German.” On the practical side, the availability of structured completeness information remains a core issue. We hope that our work provides a further incentive for standardization and data publication efforts in this area. Another future direction is to study how (Semantic) Web data publishers and users perceive the problem of completeness, and how they want to benefit from data completeness. Extensive case studies may be conducted in various application domains like healthcare, economics, or government. The purpose is to analyze whether our approach is sufficient or not for their requirements, and if not, on which side it can be improved. Last but not least, we want to study how the soundness of queries with the MINUS negation can also be characterized. We conjecture that our results apply also to queries with the MINUS negation where there is a shared variable between the positive part and each of the negative parts.


References

[1] F. Darari, S. Razniewski, R.E. Prasojo and W. Nutt, Enabling Fine-Grained RDF Data Completeness Assessment, in: Web Engineering - 16th International Conference, ICWE 2016, Lugano, Switzerland, June 6-9, 2016. Proceedings, 2016, pp. 170–187.

[2] P.J. Hayes and P.F. Patel-Schneider (eds), RDF 1.1 Semantics, W3C Recommendation, 25 February 2014.

[3] D. Vrandecic and M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57(10) (2014), 78–85.

[4] F. Darari, W. Nutt, G. Pirrò and S. Razniewski, Completeness Statements about RDF Data Sources and Their Use for Query Answering, in: The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I, 2013, pp. 66–83.

[5] S. Harris and A. Seaborne (eds), SPARQL 1.1 Query Language, W3C Recommendation, 21 March 2013.

[6] F. Darari, W. Nutt, G. Pirrò and S. Razniewski, Completeness Management for RDF Data Sources, ACM Trans. Web 12(3) (2018), 1–53.

[7] semantic-web@w3.org Mail Archives: Open world issue (opening vs closing days) and SPARQL CONSTRUCT, Accessed: 2017-01-15 from https://lists.w3.org/Archives/Public/semantic-web/2008May/0152.html.

[8] public-semweb-lifesci@w3.org Mail Archives: The Open world assumption shoe does not always fit - was: RE: [ontolog-forum] Fwd: Ontolog invited speaker session - Dr. Mark Greaves on the Halo Project - Thu 2008.06.19, Accessed: 2017-01-15 from https://lists.w3.org/Archives/Public/public-semweb-lifesci/2008Jun/0084.html.

[9] E. Liarou, S. Idreos and M. Koubarakis, Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks, in: The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings, 2006, pp. 399–413.

[10] G. Stefanoni, B. Motik and E.V. Kostylev, Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation, in: Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, 2018, pp. 1043–1052.

[11] F. Darari, S. Razniewski and W. Nutt, Bridging the Semantic Gap between RDF and SPARQL using Completeness Statements, in: Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, 2014, pp. 269–272.

[12] J. Pérez, M. Arenas and C. Gutierrez, Semantics and complexity of SPARQL, ACM Trans. Database Syst. 34(3) (2009).

[13] M.Y. Vardi, The Complexity of Relational Query Languages (Extended Abstract), in: Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA, 1982, pp. 137–146.

[14] A.K. Chandra and P.M. Merlin, Optimal Implementation of Conjunctive Queries in Relational Data Bases, in: Proceedings of the 9th ACM Symposium on Theory of Computing (STOC’77), 1977.

[15] M. Arias, J.D. Fernández, M.A. Martínez-Prieto and P. de la Fuente, An Empirical Study of Real-World SPARQL Queries, in: Proceedings of the 1st International Workshop on Usage Analysis and the Web of Data (USEWOD’11), 2011.

[16] L. Rietveld and R. Hoekstra, Man vs. machine: Differences in SPARQL queries, in: Proceedings of the 4th USEWOD Workshop on Usage Analysis and the Web of Data, ESWC, Crete, Greece, 2014.

[17] M. Saleem, M.I. Ali, A. Hogan, Q. Mehmood and A.N. Ngomo, LSQ: The Linked SPARQL Queries Dataset, in: The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II, 2015, pp. 261–269.

[18] R.Y. Wang and D.M. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers, J. of Management Information Systems 12(4) (1996), 5–33.

[19] C. Batini and M. Scannapieco, Data and Information Quality - Dimensions, Principles and Techniques, Data-Centric Systems and Applications, Springer, 2016.

[20] A. Motro, Integrity = Validity + Completeness, ACM Trans. Database Syst. 14(4) (1989), 480–502.

[21] A.Y. Levy, Obtaining Complete Answers from Incomplete Databases, in: VLDB, Morgan Kaufmann, 1996, pp. 402–412.

[22] S. Razniewski and W. Nutt, Completeness of Queries over Incomplete Databases, PVLDB 4(11) (2011), 749–760.

[23] S. Razniewski, F. Korn, W. Nutt and D. Srivastava, Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, in: ACM SIGMOD 2015, 2015, pp. 561–576.

[24] C. Fürber and M. Hepp, SWIQA - a Semantic Web Information Quality Assessment Framework, in: Proceedings of the 2011 European Conference on Information Systems (ECIS), Helsinki, Finland, June 9-11, 2011, 2011.

[25] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann and S. Auer, Quality assessment for Linked Data: A Survey, Semantic Web 7(1) (2016), 63–93.

[26] P.N. Mendes, H. Mühleisen and C. Bizer, Sieve: linked data quality assessment and fusion, in: Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany, March 30, 2012, 2012, pp. 116–123.

[27] I. Ermilov, J. Lehmann, M. Martin and S. Auer, LODStats: The Data Web Census Dataset, in: The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part II, 2016, pp. 38–46.

[28] X. Chu, J. Morcos, I.F. Ilyas, M. Ouzzani, P. Papotti, N. Tang and Y. Ye, KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing, in: ACM SIGMOD 2015, 2015, pp. 1247–1261.

[29] M. Acosta, E. Simperl, F. Flöck and M. Vidal, HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing, in: K-CAP 2015, 2015, pp. 11:1–11:8.

[30] L.A. Galárraga, C. Teflioudi, K. Hose and F.M. Suchanek, AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases, in: WWW 2013, 2013, pp. 413–422.

[31] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun and W. Zhang, Knowledge vault: a web-scale approach to probabilistic knowledge fusion, in: ACM SIGKDD 2014, 2014, pp. 601–610.

[32] E.F. Codd, Relational Completeness of Data Base Sublanguages, in: R. Rustin (ed.): Database Systems (1972).


[33] R. Reiter, On Closed World Data Bases, in: Logic and Data Bases, H. Gallaire and J. Minker, eds, Springer US, Boston, MA, 1978, pp. 55–76.

[34] K.L. Clark, Negation as Failure, in: Logic and Data Bases, 1978, pp. 113–141.

[35] E. Prud’hommeaux and A. Seaborne (eds), SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008.

[36] R. Reiter, Towards a Logical Reconstruction of Relational Database Theory, in: On Conceptual Modelling (Intervale), 1982, pp. 191–233.

[37] M. Gelfond and V. Lifschitz, Classical Negation in Logic Programs and Disjunctive Databases, New Generation Comput. 9(3/4) (1991), 365–386.

[38] A. Polleres, C. Feier and A. Harth, Rules with Contextually Scoped Negation, in: The Semantic Web: Research and Applications, 3rd European Semantic Web Conference, ESWC 2006, Budva, Montenegro, June 11-14, 2006, Proceedings, 2006, pp. 332–347.

[39] C. Lutz, I. Seylan and F. Wolter, Ontology-Based Data Access with Closed Predicates is Inherently Intractable (Sometimes), in: IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, 2013, pp. 1024–1030.

[40] A. Analyti, G. Antoniou, C.V. Damásio and G. Wagner, Extended RDF as a Semantic Foundation of Rule Markup Languages, J. Artif. Intell. Res. (JAIR) 32 (2008), 37–94.

[41] M. Schmidt, T. Hornung, M. Meier, C. Pinkel and G. Lausen, SP2Bench: A SPARQL Performance Benchmark, in: Semantic Web Information Management - A Model-Based Perspective, 2009.

[42] C. Bizer and A. Schultz, The Berlin SPARQL Benchmark, Int. J. Semantic Web Inf. Syst. 5(2) (2009), 1–24.

[43] M. Schmidt, O. Görlitz, P. Haase, G. Ladwig, A. Schwarte and T. Tran, FedBench: A Benchmark Suite for Federated Semantic Data Query Processing, in: ISWC, 2011.

[44] C. Gutierrez, D. Hernández, A. Hogan and A. Polleres, Certain Answers for SPARQL?, in: Proceedings of the 10th Alberto Mendelzon International Workshop on Foundations of Data Management, Panama City, Panama, May 8-10, 2016, 2016. http://ceur-ws.org/Vol-1644/paper13.pdf.

[45] M. Arenas and J. Pérez, Querying semantic web data with SPARQL, in: Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, 2011, pp. 305–316.

[46] M. Arenas and M. Ugarte, Designing a Query Language for RDF: Marrying Open and Closed Worlds, ACM Trans. Database Syst. 42(4) (2017), 21:1–21:46.

[47] S. Razniewski, F.M. Suchanek and W. Nutt, But What Do We Actually Know?, in: Proceedings of the 5th Workshop on Automated Knowledge Base Construction, AKBC@NAACL-HLT 2016, San Diego, CA, USA, June 17, 2016, 2016, pp. 40–44.

[48] L. Galárraga, S. Razniewski, A. Amarilli and F.M. Suchanek, Predicting Completeness in Knowledge Bases, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, 2017, pp. 375–383.

[49] P. Mirza, S. Razniewski and W. Nutt, Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinality Information, in: Proceedings of the ISWC 2016 Posters & Demonstrations Track co-located with 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, October 19, 2016, 2016.

[50] P. Mirza, S. Razniewski, F. Darari and G. Weikum, Cardinal Virtues: Extracting Relation Cardinalities from Text, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, 2017, pp. 347–351.

[51] R.E. Prasojo, F. Darari, S. Razniewski and W. Nutt, Managing and Consuming Completeness Information for Wikidata Using COOL-WD, in: Proceedings of the 7th International Workshop on Consuming Linked Data co-located with 15th International Semantic Web Conference, COLD@ISWC 2016, Kobe, Japan, October 18, 2016, 2016.

[52] F. Darari, R.E. Prasojo and W. Nutt, Expressing No-Value Information in RDF, in: Proceedings of the ISWC 2015 Posters & Demonstrations Track co-located with the 14th International Semantic Web Conference (ISWC-2015), Bethlehem, PA, USA, October 11, 2015, 2015.

[53] A. Letelier, J. Pérez, R. Pichler and S. Skritek, Static analysis and optimization of semantic web queries, in: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, 2012, pp. 89–100.

[54] A. Klug, On conjunctive queries containing inequalities, Journal of the ACM (JACM) 35(1) (1988), 146–160.

[55] semantic-web@w3.org Mail Archives: a blank node issue, Accessed: 2017-01-15 from https://lists.w3.org/Archives/Public/semantic-web/2011Mar/0017.html.

[56] Richard Cyganiak: Blank nodes considered harmful, Accessed: 2017-01-15 from http://richard.cyganiak.de/blog/2011/03/blank-nodes-considered-harmful/.

[57] T. Heath and C. Bizer, Linked Data: Evolving the Web into a Global Data Space, Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool, 2011.

[58] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez and D. Vrandecic, Introducing Wikidata to the Linked Data Web, in: The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, 2014, pp. 50–65.

[59] A. Hogan, M. Arenas, A. Mallea and A. Polleres, Everything you always wanted to know about blank nodes, J. Web Sem. 27 (2014), 42–69.

[60] N. Noy and A. Rector (eds), Defining N-ary Relations on the Semantic Web, W3C Working Group Note, 12 April 2006, Retrieved Jan 10, 2017 from https://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/.

[61] R. Cyganiak, D. Wood and M. Lanthaler (eds), RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, 25 February 2014, Retrieved Jan 15, 2017 from https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.

[62] A. Hogan, Skolemising Blank Nodes while Preserving Isomorphism, in: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, 2015, pp. 430–440.

[63] D. Brickley and R.V. Guha, RDF Schema 1.1, W3C Recommendation, 25 February 2014, Retrieved Jan 10, 2017 from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/.

[64] S. Muñoz, J. Pérez and C. Gutierrez, Simple and Efficient Minimal RDFS, J. Web Sem. 7(3) (2009), 220–234.

[65] A. Polleres, A. Hogan, R. Delbru and J. Umbrich, RDFS and OWL Reasoning for Linked Data, in: Reasoning Web. Semantic Technologies for Intelligent Data Access - 9th International Summer School 2013, Mannheim, Germany, July 30 - August 2, 2013. Proceedings, 2013, pp. 91–149.

[66] A. Motro and I. Rakov, Not all answers are equally good: estimating the quality of database answers, in: Flexible Query Answering Systems, T. Andreasen, H. Christiansen and H.L. Larsen, eds, Kluwer Academic Publishers, Norwell, MA, USA, 1997, pp. 1–21. http://dl.acm.org/citation.cfm?id=285506.285508.

Appendix A. Proofs of Section 3

Proposition 1. Let C be a set of completeness statements, G a graph, and P a BGP. Then the following are equivalent:

1. C, G |= Compl(P);
2. for every mapping µ such that dom(µ) = var(P) and (G, G ∪ µP) |= C, it is the case that µP ⊆ G.

Proof. (⇒) We prove the contrapositive. Suppose there is a mapping µ where dom(µ) = var(P) and (G, G ∪ µP) |= C, but µP ⊈ G. We want to show that C, G ⊭ Compl(P). For this, we need a counterexample extension pair (G, G′) such that (G, G′) |= C, but (G, G′) ⊭ Compl(P).

Take the extension pair (G, G ∪ µP). By assumption, we have that (G, G ∪ µP) |= C. We will show that (G, G ∪ µP) ⊭ Compl(P). Again, by assumption we have that µP ⊈ G. This means that µ ∉ ⟦P⟧_G, as opposed to the obvious fact that µ ∈ ⟦P⟧_{G ∪ µP}. This implies that (G, G ∪ µP) ⊭ Compl(P). Therefore, C, G ⊭ Compl(P), as witnessed by the counterexample extension pair (G, G ∪ µP).

(⇐) Assume that for all mappings µ such that dom(µ) = var(P) and (G, G ∪ µP) |= C, it is the case that µP ⊆ G. We want to show that C, G |= Compl(P). Take an extension pair (G, G′) such that (G, G′) |= C. We need to prove that (G, G′) |= Compl(P). In other words, it has to be shown that ⟦P⟧_{G′} ⊆ ⟦P⟧_G.

Now take a mapping µ ∈ ⟦P⟧_{G′}. By the semantics of BGP evaluation, this implies µP ⊆ G′. We want to show µ ∈ ⟦P⟧_G. Again, by the semantics of BGP evaluation it is sufficient to show that µP ⊆ G. By the assumption that (G, G′) |= C and the semantics of the T_C operator, we have that T_C(G′) ⊆ G. From this and µP ⊆ G′ (and also G ⊆ G′ by the definition of an extension pair), it holds that T_C(G ∪ µP) ⊆ T_C(G′) ⊆ G. Therefore, it is the case that (G, G ∪ µP) |= C. By assumption, we have that µP ⊆ G. Since µ was arbitrary, we can therefore conclude that ⟦P⟧_{G′} ⊆ ⟦P⟧_G.
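The condition of Proposition 1 can be tried out on small inputs with the following sketch. It assumes that T_C collects, for every statement Compl(P′), all instantiations of P′ that occur in the given graph, and that (G, G′) |= C holds iff T_C(G′) ⊆ G, as used in the proof above. Triples and patterns are tuples of strings, variables start with `?`, and all function names and data are illustrative assumptions.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on mismatch."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def eval_bgp(bgp, graph):
    """All mappings under which every triple pattern of `bgp` occurs in `graph`."""
    mappings = [{}]
    for pattern in bgp:
        mappings = [m2 for m in mappings for t in graph
                    if (m2 := match(pattern, t, m)) is not None]
    return mappings


def apply(mapping, bgp):
    """Instantiate the variables of `bgp` according to `mapping`."""
    return {tuple(mapping.get(term, term) for term in tp) for tp in bgp}


def t_c(statements, graph):
    """Transfer operator: the parts of `graph` captured by the statements."""
    captured = set()
    for stmt in statements:
        for mu in eval_bgp(stmt, graph):
            captured |= apply(mu, stmt)
    return captured


def satisfies(statements, g, g_ext):
    """(G, G') |= C iff everything the statements capture in G' is already in G."""
    return t_c(statements, g_ext) <= g


def is_counterexample(statements, g, bgp, mu):
    """True if `mu` violates condition 2 of Proposition 1."""
    mu_p = apply(mu, bgp)
    return satisfies(statements, g, g | mu_p) and not mu_p <= g


G = {("john", "child", "mary")}            # hypothetical graph
P = [("john", "child", "?y")]              # query BGP
C = [[("john", "child", "?y")]]            # one completeness statement

print(is_counterexample(C, G, P, {"?y": "bob"}))    # blocked by the statement
print(is_counterexample([], G, P, {"?y": "bob"}))   # no statement, so a counterexample
```

With the statement in place, the hypothetical extra child violates (G, G ∪ µP) |= C and therefore cannot serve as a counterexample; without any statement, the same mapping witnesses incompleteness.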

Appendix B. Proofs of Section 4

Proposition 3 (Equivalent Partial Grounding). Let C be a set of completeness statements, G a graph, and (P, ν) a partially mapped BGP. Then

(P, ν) ≡_{C,G} epg((P, ν), C, G).





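Proof. Take any G′ such that (G, G′) |= C. We want to show that

$$\llbracket (P, \nu) \rrbracket_{G'} = \bigcup_{(\mu P,\, \nu \cup \mu) \,\in\, \mathrm{epg}((P, \nu), C, G)} \llbracket (\mu P, \nu \cup \mu) \rrbracket_{G'}.$$

Since dom(ν) ∩ var(P) = ∅ by the construction of a partially mapped BGP, it is sufficient to show that

$$\llbracket (P, \mu_\emptyset) \rrbracket_{G'} = \bigcup_{(\mu P,\, \mu) \,\in\, \mathrm{epg}((P, \mu_\emptyset), C, G)} \llbracket (\mu P, \mu) \rrbracket_{G'}.$$

By the construction of the epg operator, this corresponds to showing that

$$\llbracket (P, \mu_\emptyset) \rrbracket_{G'} = \bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G}} \llbracket (\mu P, \mu) \rrbracket_{G'}.$$

Recall that the crucial part of P is complete wrt. C and G, that is, C, G |= Compl(cruc_{C,G}(P)). This implies that ⟦cruc_{C,G}(P)⟧_{G′} = ⟦cruc_{C,G}(P)⟧_G. Therefore, it is the case that

$$\bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G'}} \llbracket (\mu P, \mu) \rrbracket_{G'} = \bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G}} \llbracket (\mu P, \mu) \rrbracket_{G'}.$$

By construction, we have that cruc_{C,G}(P) ⊆ P. Therefore, by the semantics of evaluating partially mapped BGPs,

$$\llbracket (P, \mu_\emptyset) \rrbracket_{G'} = \bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G'}} \llbracket (\mu P, \mu) \rrbracket_{G'}.$$

Thus, we conclude that

$$\llbracket (P, \mu_\emptyset) \rrbracket_{G'} = \bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G'}} \llbracket (\mu P, \mu) \rrbracket_{G'} = \bigcup_{\mu \,\in\, \llbracket \mathrm{cruc}_{C,G}(P) \rrbracket_{G}} \llbracket (\mu P, \mu) \rrbracket_{G'}.$$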

Lemma 1 (Completeness Entailment of Saturated BGPs). Let P be a BGP, C a set of completeness statements, and G a graph. Suppose P is saturated wrt. C and G. Then:

C, G |= Compl(P) iff P ⊆ G.

Proof. (⇒) By contrapositive. Suppose P ⊈ G. We want to give a counterexample for C, G |= Compl(P). Let us take the extension pair (G, G ∪ P̃). Note that P ⊈ G implies P̃ ⊈ G. Consequently, it is the case that ⟦P⟧_{G ∪ P̃} ⊈ ⟦P⟧_G, implying (G, G ∪ P̃) ⊭ Compl(P).

It is left to show (G, G ∪ P̃) |= C. We would like to prove the following: If P is saturated wrt. C and G, then (G, G ∪ P̃) |= C. By definition, wrt. C and G a BGP P is saturated iff (P, µ∅) is saturated. From our assumption that P is saturated, we therefore know that (P, µ∅) is also saturated. By the definition of saturation, this means that epg((P, µ∅), C, G) = {(P, µ∅)}. This implies that ⟦cruc_{C,G}(P)⟧_G = {µ∅}. Consequently, µ∅(cruc_{C,G}(P)) = cruc_{C,G}(P) ⊆ G. Here we know that cruc_{C,G}(P) is ground.

Now we want to show that T_C(G ∪ P̃) ⊆ G for the following reason: by the definition of T_C and the satisfaction of an extension pair wrt. C, it is the case that T_C(G ∪ P̃) ⊆ G implies (G, G ∪ P̃) |= C.

By construction, the T_C operator always returns a subset of the input. There are therefore two components of the result of T_C(G ∪ P̃) for which we have to check whether they are included in G. The first are the parts of the output included in G, that is, G ∩ T_C(G ∪ P̃). Clearly, G ∩ T_C(G ∪ P̃) ⊆ G.

The second are those included in P̃, that is, P̃ ∩ T_C(G ∪ P̃). We want to show that P̃ ∩ T_C(G ∪ P̃) ⊆ G. Recall that cruc_{C,G}(P) ⊆ G. By definition, cruc_{C,G}(P) = P ∩ ĩd^{-1}(T_C(G ∪ P̃)). Since cruc_{C,G}(P) is ground, we have that cruc_{C,G}(P) = P̃ ∩ ĩd^{-1}(T_C(G ∪ P̃)), so that the melting operator ĩd^{-1} does not have any effect, that is, P̃ ∩ ĩd^{-1}(T_C(G ∪ P̃)) = P̃ ∩ T_C(G ∪ P̃). Consequently, we have P̃ ∩ T_C(G ∪ P̃) = cruc_{C,G}(P) ⊆ G.

Since both components are in G, we have that T_C(G ∪ P̃) ⊆ G, and therefore (G, G ∪ P̃) |= C.

(⇐) Assume P ⊆ G. This means that P is ground (i.e., has no variables). Therefore, it is the case that for all extension pairs (G, G′), the equation ⟦P⟧_{G′} = ⟦P⟧_G = {µ∅} holds, implying (G, G′) |= Compl(P). By definition, C, G |= Compl(P) holds if for all (G, G′) |= C, we have (G, G′) |= Compl(P). Hence, C, G |= Compl(P) holds since (G, G′) |= Compl(P) even for all possible extension pairs (G, G′).

Theorem 1 (Completeness Entailment Check). Let P be a BGP, C a set of completeness statements, and G a graph. Then the following are equivalent:

1. C, G |= Compl(P);
2. µP ⊆ G, for all µ ∈ sat(P, C, G).


Proof. (⇒) We prove the contrapositive. Assume there exists a mapping µ ∈ sat(P, C, G) such that µP ⊈ G. From Proposition 4, we have that µP is saturated wrt. C and G. From Lemma 1, it is the case that C, G ⊭ Compl(µP).

From Proposition 4, we have that (P, µ∅) ≡_{C,G} {(νP, ν) | ν ∈ sat(P, C, G)}. Note that by construction, each mapping in sat(P, C, G) is incomparable to the others, in the sense that every mapping is different. Since C, G ⊭ Compl(µP), we have the extension pair (G, G ∪ µP) as a counterexample for C, G |= Compl(P).

(⇐) By the second claim of Proposition 4, we have that µP is saturated wrt. C and G for each µ ∈ sat(P, C, G). Thus, from the right-hand side of Theorem 1 and Lemma 1, we have that C, G |= Compl(µP) for each µ ∈ sat(P, C, G). Therefore, by the first claim of Proposition 4, we have that C, G |= Compl(P).

Proposition 5. Deciding the entailment C, G |= Compl(P), given a set C of completeness statements, a graph G, and a BGP P, is Π^P_2-complete.

Proof. We prove membership in Π^P_2 by showing that non-entailment is in Σ^P_2. This follows from Proposition 1, which can be read as saying that C, G ⊭ Compl(P) iff there exists a mapping µ with dom(µ) = var(P) and µP ⊈ G such that

(G, G ∪ µP) |= C.

We observe that (G, G ∪ µP) ⊭ C iff there is a statement Compl(P′) ∈ C and a mapping ν such that νP′ ⊆ G ∪ µP and νP′ ⊈ G. Thus, non-entailment can be checked by guessing µ in polynomial time and then performing a coNP check.

Next, we prove the hardness by reduction from the validity of a ∀∃3SAT formula. The general shape of a formula is as follows:

ψ = ∀x_1, ..., x_m ∃y_1, ..., y_n γ_1 ∧ ... ∧ γ_k,

where each γ_i is a disjunction of three literals over propositions from vars∀ ∪ vars∃, where vars∀ = {x_1, ..., x_m} and vars∃ = {y_1, ..., y_n}. We will construct a set C of completeness statements, a graph G, and a BGP P such that the following claim holds:

C, G |= Compl(P) iff ψ is valid.

Our encoding is inspired by the following approach to check the validity of ψ: Unfold the universally quantified variables x_1, ..., x_m in ψ, and then check if for every formula in the set Ψ_unfold of the unfolding results, there is an assignment from the existentially quantified variables y_1, ..., y_n to make all the clauses evaluate to true.
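The unfolding approach can be written down directly as a brute-force check. The sketch below is only meant to make the approach concrete and is exponential in the number of variables; the clause representation (lists of `(variable, is_positive)` pairs) and the function names are assumptions made for this example.

```python
from itertools import product


def clause_holds(clause, assignment):
    """A clause is a list of (variable, is_positive) literals."""
    return any(assignment[var] == is_positive for var, is_positive in clause)


def is_valid(universal, existential, clauses):
    """Check a forall-exists 3SAT formula by unfolding the universal variables."""
    for xs in product([False, True], repeat=len(universal)):
        fixed = dict(zip(universal, xs))
        witness_found = any(
            all(clause_holds(c, {**fixed, **dict(zip(existential, ys))})
                for c in clauses)
            for ys in product([False, True], repeat=len(existential))
        )
        if not witness_found:
            return False
    return True


# forall x exists y: (x or y or y) and (not x or not y or not y)
PSI_1 = [
    [("x", True), ("y", True), ("y", True)],
    [("x", False), ("y", False), ("y", False)],
]
# forall x exists y: (x or x or x)
PSI_2 = [[("x", True), ("x", True), ("x", True)]]

print(is_valid(["x"], ["y"], PSI_1))
print(is_valid(["x"], ["y"], PSI_2))
```

The outer loop corresponds to the set of unfolding results, and the inner search corresponds to finding a satisfying assignment for the existentially quantified variables of each unfolded formula.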

(Encoding) First, we construct³¹

G = {(0, varg, c), (1, varg, c)}

and the completeness statement

C_∀ = Compl({(?x, varg, ?y)}),

to denote all the assignment possibilities (i.e., 0 and 1) for the universally quantified variables.

Next, we define

P_ground = {(?x_i, varg, ?c_{x_i}), (?x_i, varc, c_{x_i}) | x_i ∈ vars∀}.

The idea is that P_ground via C_∀ and G will later be instantiated with all possible assignments for the universally quantified variables in ψ.

Now, we define

P_neg = {(0, neg, 1), (1, neg, 0)},

which says that 0 is the negation of 1, and vice versa. This BGP is used later on to assign values for all the propositional variables and their negations. Then, we define

P_true = {(1, 1, 1), ..., (0, 0, 1)},

to denote the seven possible satisfying value combinations for a clause. Our BGP P that we want to check for completeness is therefore as follows:

P = P_true ∪ P_neg ∪ P_ground.

³¹ Recall that we omit namespaces. With namespaces, for example, the 'number' 0 in the encoding can be written as the IRI http://example.org/0.

Now, we want to encode the structure of the formula ψ. For each propositional variable pi, we encode the positive literal pi as the variable υ(pi) = ?pi and the negative literal ¬pi as the variable υ(¬pi) = ?p̄i. Given a clause γi = li1 ∨ li2 ∨ li3, the operator tp(γi) maps γi to (υ(li1), υ(li2), υ(li3)). We then define the following BGP to encode the structure of ψ:

Pψ = {tp(γi) | γi occurring in ψ}.

To encode the inverse relationship between a positive literal and a negative literal, we use the following:

Pposs = {(?pi, neg, ?p̄i), (?p̄i, neg, ?pi) | pi ∈ vars∀ ∪ vars∃}.

This pattern will later be instantiated accordingly wrt. Pneg. Now, for capturing the assignments of the universally quantified variables in P, we use

P∀ = {(?xi, varc, c_xi) | xi ∈ vars∀}.

We are now ready to construct the following completeness statement:

Cψ = Compl(Ptrue ∪ Pposs ∪ P∀ ∪ Pψ).

In summary, our encoding consists of the following ingredients: the set C = {C∀, Cψ} of completeness statements, the graph G, and the BGP P. Let us now prove the claim mentioned above.

(Proof for Encoding) Recall the approach we mentioned above to check the validity of the formula ψ. To simulate the unfolding of the universally quantified variables, we rely on the equivalent partial grounding operator epg((P, µ∅), C, G) as in Algorithm 1, which involves the cruc operator. By definition, cruc_{C,G}(P) = P ∩ id⁻¹(TC(P ∪ G)). By construction, the statement C∀ captures the (?xi, varg, ?c_xi) part of the BGP P where xi ∈ vars∀. Thus, by the construction of G, it is the case that epg((P, µ∅), C, G) consists of 2^m partially mapped BGPs, where m is the number of the universally quantified variables in ψ. Each of the partially mapped BGPs corresponds to an assignment for the universally quantified variables in the set Ψunfold of the unfolding results of ψ.

Now, we prove the simulation of the next step, the existential checking. For each partially mapped BGP (µP, µ) in the unfolding results epg((P, µ∅), C, G), it is either epg((µP, µ), C, G) = ∅ or epg((µP, µ), C, G) = {(µP, µ)}. Let us see what this means.

By construction, the former case happens whenever TC(µP ∪ G) = µP holds, from the fact that ⟦µP⟧_G = ∅. Furthermore, it is the case that TC(µP ∪ G) = µP iff there is a mapping ν from the encoding ?yi of the existentially quantified variables in Pψ such that ν(µPψ) ⊆ Ptrue. Note that the mapping ν simulates a satisfying assignment for the corresponding existentially quantified formula in the set Ψunfold. Whenever this holds for all (µP, µ) ∈ epg((P, µ∅), C, G), from Proposition 3 we can conclude that (P, µ∅) ≡_{C,G} ∅, and therefore C,G |= Compl(P). Also, because we have the satisfying assignments for all the corresponding existentially quantified formulas in the set Ψunfold, the formula ψ evaluates to true.

The latter case happens whenever TC(µP ∪ G) ≠ µP, since there is no mapping ν from the encoding ?yi of the existentially quantified variables in Pψ such that ν(µPψ) ⊆ Ptrue. This simulates the failure in finding a satisfying assignment for the corresponding existentially quantified formula in the set Ψunfold. This implies that ψ evaluates to false. However, whenever the latter case happens, it means that (µP, µ) is saturated. By construction, it is the case that µP ⊈ G. From Lemma 1 and Proposition 3, we conclude that C,G ⊭ Compl(P).


Proposition 6. Let G be a graph. Deciding the entailment C,G |= Compl(P), given a set C of completeness statements and a BGP P, is in Π^P_2. There is a graph for which the problem is Π^P_2-complete.

Proof. The membership follows immediately from Proposition 5, while the existence of a hard case follows from the reduction proof of that proposition, in which the graph is fixed.

Proposition 7. Let P be a BGP. Deciding the entailment C,G |= Compl(P), given a set C of completeness statements and a graph G, is in NP. There is a BGP for which the problem is NP-complete. This still holds when the graph is fixed.

Proof. The membership relies on Algorithm 1 and Theorem 1. Recall that the algorithm contains the epg operator, which performs grounding based on the crucial part over the graph G. However, since the BGP is now fixed, the size of the grounding results is bounded polynomially. Consequently, the only source of complexity is finding the crucial part of BGPs, which can be done in NP (note that the completeness statements are not fixed).


In the hardness proof we will see that the hardness follows even when the graph is fixed. The proof for NP-hardness is by means of a reduction from the 3-colorability problem of classical graphs, which is known to be NP-hard. We encode the problem graph Gp = (V, E), i.e., the classical graph we want to check whether it is 3-colorable, as a set triples(Gp) of triple patterns. We associate to each vertex v ∈ V a new variable ?v. Then, we define triples(Gp) as the union of all triple patterns (?s, edge, ?o) created from each pair (s, o) ∈ E, where ?s is the associated variable of s, edge is an IRI, and ?o is the associated variable of o. Let the BGP Pcol be:

{(r, edge, g), (r, edge, b), (g, edge, r), (g, edge, b), (b, edge, r), (b, edge, g)}.

Next, we create the completeness statement

Cp = Compl(triples(Gp) ∪ Pcol).

Let G be the empty graph. Then, the following claim holds:

The problem graph Gp is 3-colorable iff {Cp}, G |= Compl(Pcol).

Proof of the claim: (⇒) Assume Gp is 3-colorable. Thus, there must be a mapping µ from all the vertices in Gp to an element from the set {r, g, b} such that no adjacent nodes have the same color. From the mapping µ, we can thus create a corresponding mapping ν such that the associated variables of the vertices in Gp are mapped to the same color as in the mapping µ. By construction of the statement Cp, it is the case that the mapping ν is a witness mapping for Q_Cp to derive all the triples in Pcol, hence ensuring the completeness of Pcol.

(⇐) We will prove the contrapositive. Assume that Gp is not 3-colorable. Thus, there is no mapping from the vertices in Gp to an element from the set {r, g, b} such that adjacent nodes have different colors. Consider the extension pair (G, G′), where G′ is the color graph {(r, edge, g), ..., (b, edge, g)}. From the construction of Cp, it is the case that (G, G′) |= Cp but ⟦Pcol⟧_G ≠ ⟦Pcol⟧_G′. Thus, {Cp}, G ⊭ Compl(Pcol).
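The core of the reduction, namely that a 3-coloring of Gp corresponds to a mapping ν sending triples(Gp) into Pcol, can be illustrated with a small brute-force sketch. The edge-list input format and the function names are our assumptions; the search is exponential and only meant to make the correspondence concrete.

```python
from itertools import product

COLORS = ("r", "g", "b")
# The color BGP Pcol: an edge between every pair of distinct colors.
P_COL = {(a, "edge", b) for a in COLORS for b in COLORS if a != b}

def triples(edges):
    # One triple pattern (?s, edge, ?o) per edge (s, o) of the problem graph.
    return {(f"?{s}", "edge", f"?{o}") for s, o in edges}

def witness_mapping(edges):
    # Search for nu with nu(triples(Gp)) contained in Pcol.
    pattern = triples(edges)
    variables = sorted({t for s, _, o in pattern for t in (s, o)})
    for colors in product(COLORS, repeat=len(variables)):
        nu = dict(zip(variables, colors))
        if all((nu[s], p, nu[o]) in P_COL for s, p, o in pattern):
            return nu
    return None
```

A returned mapping is exactly a 3-coloring of the vertices that occur in some edge; `None` corresponds to the case used in the (⇐) direction.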

Proposition 8. Let C be a set of completeness statements. Deciding the entailment C,G |= Compl(P), given a graph G and a BGP P, is in CoNP. There is a set C for which the problem is CoNP-complete. This still holds when the graph is fixed.

Proof. The membership proof is as follows. It is the case that C,G ⊭ Compl(P) iff there exists a graph G′ containing G where:

– (G, G′) |= C, and
– (G, G′) ⊭ Compl(P).

We guess a mapping µ over P such that µP ⊈ G, which implies that (G, G ∪ µP) ⊭ Compl(P). Then, we check in PTIME (since C is now fixed) the entailment (G, G ∪ µP) |= C. If it holds, then C,G ⊭ Compl(P) by the counterexample G′ = G ∪ µP.

In the hardness proof we will see that the hardness follows even when the graph is fixed. The proof for CoNP-hardness is by means of a reduction from the 3-uncolorability problem of classical graphs. We encode the problem graph Gp = (V, E), i.e., the classical graph for which we want to check whether it is 3-uncolorable, as a set triples(Gp) of triple patterns. We associate to each vertex v ∈ V a new variable ?v. Then, we define triples(Gp) as the union of all triple patterns (?s, edge, ?o) created from each pair (s, o) ∈ E, where ?s is the associated variable of s, edge is an IRI, and ?o is the associated variable of o. Let the BGP P be:

triples(Gp) ∪ {(c1, c2, c3)}.

Let the graph G be the color graph:

{(r, edge, g), (r, edge, b), (g, edge, r), (g, edge, b), (b, edge, r), (b, edge, g)}.

Next, we create the completeness statement

C = Compl((?x, edge, ?y)).

Then, the following claim holds:

The problem graph Gp is 3-uncolorable iff {C}, G |= Compl(P).

Proof of the claim: (⇒) The proof relies on Algorithm 1 and Theorem 1. First, observe that by construction, the part triples(Gp) of the BGP P can be grounded completely due to the statement C, that is, the crucial part operator cruc returns exactly that part. Assume Gp is 3-uncolorable. As Gp is 3-uncolorable, there is no mapping µ from all the vertices in Gp to an element from the set {r, g, b} such that no adjacent nodes have the same color. Thus, the epg operator returns an empty set, as evaluating triples(Gp) over G yields the empty answer. This means that the grounding does not output any BGP that still needs to be checked for its completeness. Hence, it is the case that {C}, G |= Compl(P).

(⇐) We will prove the contrapositive. First, observe that by construction, the part triples(Gp) of the BGP P can be grounded completely due to the statement C. Assume that Gp is 3-colorable. Thus, there must be a mapping µ from all the vertices in Gp to an element from the set {r, g, b} such that no adjacent nodes have the same color. Take such a mapping µ arbitrarily. Since the graph Gp is 3-colorable, we can then reuse the mapping µ for mapping triples(Gp) to G. The results of the epg operator therefore include that mapping, which is then applied to the remaining part of P, that is, the triple pattern (c1, c2, c3). Note that the triple pattern consists only of constants, so the mapping application has no effect. Now we have to check the completeness of (c1, c2, c3). As no completeness statements can be evaluated over that remaining part, it is then the case that we are already saturated for (c1, c2, c3). By Theorem 1, the BGP P can be guaranteed to be complete iff all saturated instantiations wrt. C are in G. However, clearly (c1, c2, c3) is not in G. Thus, we have that {C}, G ⊭ Compl(P).

Proposition 9. Let C be a set of completeness statements and P be a BGP. Deciding the entailment C,G |= Compl(P), given a graph G, is in PTIME.

Proof. The proof relies on Algorithm 1 and Theorem 1. Recall that the algorithm contains the epg operator, which performs grounding based on the crucial part over the graph G. However, since the BGP is now fixed, the size of the grounding result is bounded polynomially. Moreover, now that the completeness statements are fixed, the crucial part can be found in PTIME. Hence, the overall procedure can be executed in PTIME.

Appendix C. Proofs of Section 5

Theorem 2. (ANSWER SOUNDNESS) Let G be a graph, C a set of completeness statements, P a graph pattern, and µ ∈ ⟦P⟧_G a mapping. Then the following are equivalent:

1. C,G |= Sound(µ, P);
2. C,G |= Compl(µPi), for all Pi ∈ P⁻.

Proof. (⇐) Let µ ∈ ⟦P⟧_G be a mapping. Suppose that for all Pi ∈ P⁻, we have C,G |= Compl(µPi). Take an extension pair (G, G′) satisfying C. We will show that µ ∈ ⟦P⟧_G′. Since µ ∈ ⟦P⟧_G and G ⊆ G′, it holds that µ ∈ ⟦P⁺⟧_G′. It is left to show that for all Pi ∈ P⁻, we have ⟦µPi⟧_G′ = ∅. Take an arbitrary Pi ∈ P⁻. The inclusion ⟦µPi⟧_G′ ⊆ ⟦µPi⟧_G holds because C,G |= Compl(µPi). Moreover, the equality ⟦µPi⟧_G = ∅ holds because µ ∈ ⟦P⟧_G. Hence, it is the case that ⟦µPi⟧_G′ = ∅.

(⇒) We prove the contrapositive. Suppose there is a BGP Pw ∈ P⁻ ('w' for witness) such that C,G ⊭ Compl(µPw). We will show that C,G ⊭ Sound(µ, P). Since it is the case that C,G ⊭ Compl(µPw), there must be a mapping ν such that: (i) dom(ν) = var(µPw); (ii) (G, G ∪ νµPw) |= C; and (iii) νµPw ⊈ G. This implies that ν ∉ ⟦µPw⟧_G and ν ∈ ⟦µPw⟧_{G ∪ νµPw}. Now, we will show that (G, G ∪ νµPw) ⊭ Sound(µ, P). Since ν ∈ ⟦µPw⟧_{G ∪ νµPw}, it holds that µ ∉ ⟦P⟧_{G ∪ νµPw}. On the other hand, it is the case that µ ∈ ⟦P⟧_G from our assumption. Thus, (G, G ∪ νµPw) ⊭ Sound(µ, P).

Proposition 10. For a set C of completeness statements and BGPs P and P′, it is the case that

C |= Compl(P | P′) iff P ⊆ TC(P ∪ P′).

Proof. (⇒) Suppose that C |= Compl(P | P′). By definition of the entailment, for all (G, G′) |= C, the inclusion ⟦(var(P), P ∪ P′)⟧_G′ ⊆_s ⟦P⟧_G holds. Consider the extension pair (G, G′) where G = TC(P ∪ P′) and G′ = P ∪ P′. By construction, (G, G′) |= C holds. From our assumption, it follows that ⟦(var(P), P ∪ P′)⟧_G′ ⊆_s ⟦P⟧_G. We define the operator π_W(µ) by projecting the mapping µ to the variables in W. By construction, we have that π_var(P)(id) ∈ ⟦(var(P), P ∪ P′)⟧_G′, where id is the freeze mapping of the BGP P ∪ P′ (as defined in Section 2.1). From the set inclusion, it follows that π_var(P)(id) ∈ ⟦P⟧_G. This implies that π_var(P)(id)P = P ⊆ G = TC(P ∪ P′).

(⇐) Assume P ⊆ TC(P ∪ P′). By this assumption and the prototypicality of P ∪ P′, which represents any possible graph satisfying P ∪ P′, it is the case that C |= Compl(P | P′).
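The test P ⊆ TC(P ∪ P′) can be made concrete with a small sketch. It rests on assumptions that go beyond this appendix: we take TC(G) to be the union of all images νP_C of statement BGPs P_C with νP_C ⊆ G, we freeze a variable ?v to the constant `frozen:v`, and all function names are ours.

```python
from itertools import product

def is_var(term):
    # Assumption: variables are strings that start with "?".
    return term.startswith("?")

def freeze(bgp):
    # Freeze mapping: replace each variable ?v by the constant "frozen:v".
    return {tuple("frozen:" + t[1:] if is_var(t) else t for t in triple)
            for triple in bgp}

def transfer(statements, graph):
    # Assumed T_C: collect every image of a statement BGP contained in graph.
    terms = sorted({t for triple in graph for t in triple})
    result = set()
    for p_c in statements:
        vs = sorted({t for triple in p_c for t in triple if is_var(t)})
        for values in product(terms, repeat=len(vs)):
            nu = dict(zip(vs, values))
            image = {tuple(nu.get(t, t) for t in triple) for triple in p_c}
            if image <= graph:
                result |= image
    return result

def entails_conditional(statements, p, p_cond):
    # Proposition 10 test: frozen P must be contained in T_C(frozen P u P').
    return freeze(p) <= transfer(statements, freeze(p | p_cond))
```

With a statement saying that children of presidents are complete, the pattern (?a, child, ?b) is complete under the condition (?a, type, president), but not under the empty condition.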

Lemma 2. Let C be a set of completeness statements and P a graph pattern. Then C |= Sound(P) provided that C |= Compl(Pi | P⁺) for all Pi ∈ P⁻.

Proof. Assume that for all Pi ∈ P⁻, it is the case that C |= Compl(Pi | P⁺). Take any extension pair (G, G′) |= C and suppose there is a mapping µ ∈ ⟦P⟧_G. We want to show that µ ∈ ⟦P⟧_G′. By G ⊆ G′, it holds that µ ∈ ⟦P⁺⟧_G′. Thus, it is left to show that for all Pi ∈ P⁻, it is the case that ⟦µPi⟧_G′ = ∅.

Take any negation part Pi. By C |= Compl(Pi | P⁺) and (G, G′) |= C, it is the case that (G, G′) |= Compl(Pi | P⁺). Consequently, by ⟦(var(Pi), Pi ∪ P⁺)⟧_G′ ⊆_s ⟦Pi⟧_G and ⟦µPi⟧_G = ∅, it must be the case that ⟦µPi⟧_G′ = ∅. As Pi was arbitrary, it is the case that µ ∈ ⟦P⟧_G′.

Theorem 3. (PATTERN SOUNDNESS) Let C be a set of completeness statements and P a graph pattern in NRF. Then the following are equivalent:

1. C |= Sound(P);
2. C |= Compl(Pi | P⁺) for all Pi ∈ P⁻.

Proof. (⇐) This is a direct consequence of Lemma 2.

(⇒) We give a proof by contrapositive. Suppose there is a BGP Pw ∈ P⁻ ('w' for witness) such that C ⊭ Compl(Pw | P⁺). By Proposition 10, it is the case that Pw ⊈ TC(Pw ∪ P⁺). Let us prove that for the extension pair (G, G′) = (P⁺ ∪ TC(Pw ∪ P⁺), Pw ∪ P⁺), it is the case that (G, G′) |= C, but (G, G′) ⊭ Sound(P).

By the definition of TC, it holds that (G, G′) |= C. We now have to show that (G, G′) ⊭ Sound(P). By construction, id ∉ ⟦P⟧_{Pw ∪ P⁺} = ⟦P⟧_G′, where id is the freeze mapping wrt. P⁺. We will show that id ∈ ⟦P⟧_{P⁺ ∪ TC(Pw ∪ P⁺)} = ⟦P⟧_G.

By construction, id ∈ ⟦P⁺⟧_{P⁺ ∪ TC(Pw ∪ P⁺)}. Thus, it is left to show that for every BGP Pi ∈ P⁻, it is the case that ⟦idPi⟧_{P⁺ ∪ TC(Pw ∪ P⁺)} = ∅. Since, due to the non-redundancy property of NRF graph patterns, different negated patterns do not contain one another, there is no negated pattern Pj ≠ Pw such that:

⟦idPj⟧_{P⁺ ∪ TC(Pw ∪ P⁺)} ≠ ∅.

Now, it is left to show that for the BGP Pw, it also holds that

⟦idPw⟧_{P⁺ ∪ TC(Pw ∪ P⁺)} = ∅.

However, this follows from the minimality of negated patterns in an NRF graph pattern. Thus, we have shown that id ∉ ⟦P⟧_G′ but id ∈ ⟦P⟧_G, serving as a counterexample for (G, G′) |= Sound(P).

Proposition 11. Negation-similar graph patterns are equivalent.

Proof. Suppose that P and P′ are negation-similar. We show that P ⊑ P′. The converse statement follows by a symmetric argument.

Let G be a graph and let µ ∈ ⟦P⟧_G. We want to show that µ ∈ ⟦P′⟧_G. Clearly, µ ∈ ⟦P⁺⟧_G and therefore also µ ∈ ⟦P′⁺⟧_G. To show that µ ∈ ⟦P′⟧_G, we must ensure that ⟦µP′j⟧_G = ∅ for all P′j ∈ P′⁻. By assumption, for P′j ∈ P′⁻ there exists a Pi ∈ P⁻ such that (var(P′⁺), P′⁺ ∪ P′j) ⊑_s (var(P⁺), P⁺ ∪ Pi). Since µ ∈ ⟦P⟧_G, we have ⟦µPi⟧_G = ∅, which implies that ⟦µP′j⟧_G ⊆ ⟦µPi⟧_G = ∅. This completes the proof.
