Datalog and Recursion - Inriawebdam.inria.fr/Alice/pdfs/Chapter-12.pdf · Datalog and Recursion I ... Datalog owes a great debt to Prolog and the logic-programming area in general

P A R T

D Datalog and Recursion

In Part B, we considered query languages ranging from conjunctive queries to first-orderqueries in the three paradigms: algebraic, logic, and deductive. We did this by enriching

the conjunctive queries first with union (disjunction) and then with difference (negation).In this part, we further enrich these languages by adding recursion. First we add recursionto the conjunctive queries, which yields datalog. We study this language in Chapter 12.Although it is too limited for practical use, datalog illustrates some of the essential aspectsof recursion. Furthermore, most existing optimization techniques have been developed fordatalog.

Datalog owes a great debt to Prolog and the logic-programming area in general. Afundamental contribution of the logic-programming paradigm to relational query languagesis its elegant notation for expressing recursion. The perspective of databases, however, issignificantly different from that of logic programming. (For example, in databases datalogprograms define mappings from instances to instances, whereas logic programs generallycarry their data with them and are studied as stand-alone entities.) We adapt the logic-programming approach to the framework of databases.

We study evaluation techniques for datalog programs in Chapter 13, which coversthe main optimization techniques developed for recursion in query languages, includingseminaive evaluation and magic sets.

Although datalog is of great theoretical importance, it is not adequate as a practi-cal query language because of the lack of negation. In particular, it cannot express eventhe first-order queries. Chapters 14 and 15 deal with languages combining recursion andnegation, which are proper extensions of first-order queries. Chapter 14 considers the issueof combining negation and recursion. Languages are presented from all three paradigms,which support both negation and recursion. The semantics of each one is defined in fun-damentally operational terms, which include datalog with negation and a straightforward,fixpoint semantics. As will be seen, the elegant correspondence between languages in thethree paradigms is maintained in the presence of recursion.

Chapter 15 considers approaches to incorporating negation in datalog that are closerin spirit to logic programming. Several important semantics for negation are presented,including stratification and well-founded semantics.

271

12 Datalog

Alice: What do we see next?Riccardo: We introduce recursion.

Sergio: He means we ask queries about your ancestors.Alice: Are you leading me down a garden path?

Vittorio: Kind of—queries related to paths in a graph call for recursionand are crucial for many applications.

For a long time, relational calculus and algebra were considered the database languages.Codd even defined as “complete” a language that would yield precisely relational

calculus. Nonetheless, there are simple operations on data that cannot be realized in thecalculus. The most conspicuous example is graph transitive closure. In this chapter, westudy a language that captures such queries and is thus more “complete” than relationalcalculus.1 The language, called datalog, provides a feature not encountered in languagesstudied so far: recursion.

We start with an example that motivates the need for recursion. Consider a databasefor the Parisian Metro. Note that this database essentially describes a graph. (Databaseapplications in which part of the data is a graph are common.) To avoid making theMetro database too static, we assume that the database is describing the available metroconnections on a day of strike (not an unusual occurrence). So some connections maybe missing, and the graph may be partitioned. An instance of this database is shown inFig. 12.1.

Natural queries to ask are as follows:

(12.1) What are the stations reachable from Odeon?

(12.2) What lines can be reached from Odeon?

(12.3) Can we go from Odeon to Chatelet?

(12.4) Are all pairs of stations connected?

(12.5) Is there a cycle in the graph (i.e., a station reachable in one or more stops fromitself)?

Unfortunately, such queries cannot be answered in the calculus without using some a

1 We postpone a serious discussion of completeness until Part E, where we tackle fundamental issuessuch as “What is a formal definition of data manipulation (as opposed to arbitrary computation)?What is a reasonable definition of completeness for database languages?”

273

274 Datalog

Links Line Station Next Station

4 St.-Germain Odeon

4 Odeon St.-Michel

4 St.-Michel Chatelet

1 Chatelet Louvre

1 Louvre Palais-Royal

1 Palais-Royal Tuileries

1 Tuileries Concorde

9 Pont de Sevres Billancourt

9 Billancourt Michel-Ange

9 Michel-Ange Iena

9 Iena F. D. Roosevelt

9 F. D. Roosevelt Republique

9 Republique Voltaire

Figure 12.1: An instance I of the Metro database

priori knowledge on the Metro graph, such as the graph diameter. More generally, given agraph G, a particular vertex a, and an integer n, it is easy to write a calculus query findingthe vertexes at distance less than n from a; but it seems difficult to find a query for allvertexes reachable from a, regardless of the distance. We will prove formally in Chapter 17that such a query is not expressible in the calculus. Intuitively, the reason is the lack ofrecursion in the calculus.

The objective of this chapter is to extend some of the database languages consideredso far with recursion. Although there are many ways to do this (see also Chapter 14), wefocus in this chapter on an approach inspired by logic programming. This leads to a fieldcalled deductive databases, or database logic programming, which shares motivation andtechniques with the logic-programming area.

Most of the activity in deductive databases has focused on a toy language called dat-alog, which extends the conjunctive queries with recursion. The interaction between nega-tion and recursion is more tricky and is considered in Chapters 14 and 15. The importanceof datalog for deductive databases is analogous to that of the conjunctive queries for therelational model. Most optimization techniques for relational algebra were developed forconjunctive queries. Similarly, in this chapter most of the optimization techniques in de-ductive databases have been developed around datalog (see Chapter 13).

Before formally presenting the language datalog, we present informally the syntax andvarious semantics that are considered for that language. Following is a datalog programPTC that computes the transitive closure of a graph. The graph is represented in relation G

and its transitive closure in relation T :

T (x, y)←G(x, y)

T (x, y)←G(x, z), T (z, y).

Datalog 275

Observe that, except for the fact that relation T occurs both in the head and body of thesecond rule, these look like the nonrecursive datalog rules of Chapter 4.

A datalog program defines the relations that occur in heads of rules based on otherrelations. The definition is recursive, so defined relations can also occur in bodies of rules.Thus a datalog program is interpreted as a mapping from instances over the relationsoccurring in the bodies only, to instances over the relations occurring in the heads. Forinstance, the preceding program maps a relation over G (a graph) to a relation over T (itstransitive closure).

A surprising and elegant property of datalog, and of logic programming in general, isthat there are three very different but equivalent approaches to defining the semantics. Wepresent the three approaches informally now.

A first approach is model theoretic. We view the rules as logical sentences stating aproperty of the desired result. For instance, the preceding rules yield the logical formulas

∀x, y(T (x, y) ← G(x, y))(1)

∀x, y, z(T (x, y) ← (G(x, z) ∧ T (z, y))).(2)

The result T must satisfy the foregoing sentences. However, this is not sufficient to deter-mine the result uniquely because it is easy to see that there are many T s that satisfy thesentences. However, it turns out that the result becomes unique if one adds the followingnatural minimality requirement: T consists of the smallest set of facts that makes the sen-tences true. As it turns out, for each datalog program and input, there is a unique minimalmodel. This defines the semantics of a datalog program. For example, suppose that theinstance contains

G(a, b),G(b, c),G(c, d).

It turns out that T (a, d) holds in each instance that obeys (1) and (2) and where these threefacts hold. In particular, it belongs to the minimum model of (1) and (2).

The second proof-theoretic approach is based on obtaining proofs of facts. A proof ofthe fact T (a, d) is as follows:

(i) G(c, d) belongs to the instance;

(ii) T (c, d) using (i) and the first rule;

(iii) G(b, c) belongs to the instance;

(iv) T (b, d) using (iii), (ii), and the second rule;

(v) G(a, b) belongs to the instance;

(vi) T (a, d) using (v), (iv), and the second rule.

A fact is in the result if there exists a proof for it using the rules and the database facts.In the proof-theoretic perspective, there are two ways to derive facts. The first is to

view programs as “factories” producing all facts that can be proven from known facts.The rules are then used bottom up, starting from the known facts and deriving all possiblenew facts. An alternative top-down evaluation starts from a fact to be proven and attemptsto demonstrate it by deriving lemmas that are needed for the proof. This is the underlying

276 Datalog

intuition of a particular technique (called resolution) that originated in the theorem-provingfield and lies at the core of the logic-programming area.

As an example of the top-down approach, suppose that we wish to prove T (a, d). Thenby the second rule, this can be done by proving G(a, b) and T (b, d). We know G(a, b), adatabase fact. We are thus left with proving T (b, d). By the second rule again, it sufficesto prove G(b, c) (a database fact) and T (c, d). This last fact can be proven using the firstrule. Observe that this yields the foregoing proof (i) to (vi). Resolution is thus a particulartechnique for obtaining such proofs. As detailed later, resolution permits variables as wellas values in the goals to be proven and the steps used in the proof.

The last approach is the fixpoint approach. We will see that the semantics of theprogram can be defined as a particular solution of a fixpoint equation. This approach leadsto iterating a query until a fixpoint is reached and is thus procedural in nature. However,this computes again the facts that can be deduced by applications of the rules, and in thatrespect it is tightly connected to the (bottom-up) proof-theoretic approach. It correspondsto a natural strategy for generating proofs where shorter proofs are produced before longerproofs so facts are proven “as soon as possible.”

In the next sections we describe in more detail the syntax, model-theoretic, fixpoint,and proof-theoretic semantics of datalog. As a rule, we introduce only the minimumamount of terminology from logic programming needed in the special database case. How-ever, we make brief excursions into the wider framework in the text and exercises. Thelast section deals with static analysis of datalog programs. It provides decidability andundecidability results for several fundamental properties of programs. Techniques for theevaluation of datalog programs are discussed separately in Chapter 13.

12.1 Syntax of Datalog

As mentioned earlier, the syntax of datalog is similar to that of languages introduced inChapter 4. It is an extension of nonrecursive datalog, which was introduced in Chapter 4.We provide next a detailed definition of its syntax. We also briefly introduce some of thefundamental differences between datalog and logic programming.

Definition 12.1.1 A (datalog) rule is an expression of the form

R1(u1)← R2(u2), . . . , Rn(un),

where n ≥ 1, R1, . . . , Rn are relation names and u1, . . . , un are free tuples of appropriatearities. Each variable occurring in u1 must occur in at least one of u2, . . . , un. A datalogprogram is a finite set of datalog rules.

The head of the rule is the expression R1(u1); and R2(u2), . . . , Rn(un) forms the body.The set of constants occurring in a datalog program P is denoted adom(P ); and for an

instance I, we use adom(P, I) as an abbreviation for adom(P ) ∪ adom(I).

We next recall a definition from Chapter 4 that is central to this chapter.

12.1 Syntax of Datalog 277

Definition 12.1.2 Given a valuation ν, an instantiation

R1(ν(u1))← R2(ν(u2)), . . . , Rn(ν(un))

of a rule R1(u1)← R2(u2), . . . , Rn(un) with ν is obtained by replacing each variable x byν(x).

Let P be a datalog program. An extensional relation is a relation occurring onlyin the body of the rules. An intensional relation is a relation occurring in the head ofsome rule of P . The extensional (database) schema, denoted edb(P ), consists of theset of all extensional relation names; whereas the intensional schema idb(P ) consistsof all the intensional ones. The schema of P , denoted sch(P ), is the union of edb(P )and idb(P ). The semantics of a datalog program is a mapping from database instancesover edb(P ) to database instances over idb(P ). In some contexts, we call the input datathe extensional database and the program the intensional database. Note also that in thecontext of logic-based languages, the term predicate is often used in place of the termrelation name.

Let us consider an example.

Example 12.1.3 The following program Pmetro computes the answers to queries (12.1),(12.2), and (12.3):

St_Reachable(x, x) ←St_Reachable(x, y) ← St_Reachable(x, z),Links(u, z, y)

Li_Reachable(x, u) ← St_Reachable(x, z),Links(u, z, y)

Ans_1(y) ← St_Reachable(Odeon, y)

Ans_2(u) ← Li_Reachable(Odeon, u)

Ans_3() ← St_Reachable(Odeon,Chatelet)

Observe that St_Reachable is defined using recursion. Clearly,

edb(Pmetro)= {Links},idb(Pmetro)= {St_Reachable,Li_Reachable,Ans_1,Ans_2,Ans_3}

For example, an instantiation of the second rule of Pmetro is as follows:

St_Reachable(Odeon,Louvre)← St_Reachable(Odeon,Chatelet),

Links(1,Chatelet,Louvre)

278 Datalog

Datalog versus Logic Programming

Given the close correspondence between datalog and logic programming, we briefly high-light the central differences between these two fields. The major difference is that logicprogramming permits function symbols, but datalog does not.

Example 12.1.4 The simple logic program Pleq is given by

leq(0, x)←leq(s(x), s(y))← leq(x, y)

leq(x,+(x, y))←leq(x, z)← leq(x, y), leq(y, z)

Here 0 is a constant, s a unary function sysmbol, + a binary function sysmbol, and leq abinary predicate. Intuitively, s might be viewed as the successor function, + as addition,and leq as capturing the less-than-or-equal relation. However, in logic programming thefunction symbols are given the “free” interpretation—two terms are considered nonequalwhenever they are syntactically different. For example, the terms +(0, s(0)),+(s(0), 0),and s(0) are all nonequal. Importantly, functional terms can be used in logic programmingto represent intricate data structures, such as lists and trees.

Observe also that in the preceding program the variable x occurs in the head of thefirst rule and not in the body, and analogously for the third rule.

Another important difference between deductive databases and logic programs con-cerns perspectives on how they are typically used. In databases it is assumed that thedatabase is relatively large and the number of rules relatively small. Furthermore, a da-talog program P is typically viewed as defining a mapping from instances over the edbto instances over the idb. In logic programming the focus is different. It is generally as-sumed that the base data is incorporated directly into the program. For example, in logicprogramming the contents of instance Link in the Metro database would be representedusing rules such as Link(4, St.-Germain,Odeon)←. Thus if the base data changes, thelogic program itself is changed. Another distinction, mentioned in the preceding example,is that logic programs can construct and manipulate complex data structures encoded byterms involving function symbols.

Later in this chapter we present further comparisons of the two frameworks.

12.2 Model-Theoretic Semantics

The key idea of the model-theoretic approach is to view the program as a set of first-order sentences (also called a first-order theory) that describes the desired answer. Thusthe database instance constituting the result satisfies the sentences. Such an instance isalso called a model of the sentences. However, there can be many (indeed, infinitelymany) instances satisfying the sentences of a program. Thus the sentences themselvesdo not uniquely identify the answer; it is necessary to specify which of the models is

12.2 Model-Theoretic Semantics 279

the intended answer. This is usually done based on assumptions that are external to thesentences themselves. In this section we formalize (1) the relationship between rules andlogical sentences, (2) the notion of model, and (3) the concept of intended model.

We begin by associating logical sentences with rules, as we did in the beginning of thischapter. To a datalog rule

ρ : R1(u1)← R2(u2), . . . , Rn(un)

we associate the logical sentence

∀x1, . . . , xm(R1(u1)← R2(u2) ∧ · · · ∧ Rn(un)),

where x1, . . . , xm are the variables occurring in the rule and ← is the standard logicalimplication. Observe that an instance I satisfies ρ, denoted I |= ρ, if for each instantiation

R1(ν(u1))← R2(ν(u2)), . . . , Rn(ν(un))

such that R2(ν(u2)), . . . , Rn(ν(un)) belong to I, so does R1(ν(u1)). In the following, wedo not distinguish between a rule ρ and the associated sentence. For a program P , theconjunction of the sentences associated with the rules of P is denoted by �P .

It is useful to note that there are alternative ways to write the sentences associated withrules of programs. In particular, the formula

∀x1, . . . , xm(R1(u1)← R2(u2) ∧ · · · ∧ Rn(un))

is equivalent to

∀x1, . . . , xq(∃xq+1, . . . , xm(R2(u2) ∧ · · · ∧ Rn(un))→ R1(u1)),

where x1, . . . , xq are the variables occurring in the head. It is also logically equivalent to

∀x1, . . . , xm(R1(u1) ∨ ¬R2(u2) ∨ · · · ∨ ¬Rn(un)).

This last form is particularly interesting. Formulas consisting of a disjunction of liter-als of which at most one is positive are called in logic Horn clauses. A datalog programcan thus be viewed as a set of (particular) Horn clauses.

We next discuss the issue of choosing, among the models of �P , the particular modelthat is intended as the answer. This is not a hard problem for datalog, although (as we shallsee in Chapter 15) it becomes much more involved if datalog is extended with negation.For datalog, the idea for choosing the intended model is simply that the model should notcontain more facts than necessary for satisfying �P . So the intended model is minimal insome natural sense. This is formalized next.

Definition 12.2.1 Let P be a datalog program and I an instance over edb(P ). A modelof P is an instance over sch(P ) satisfying �P . The semantics of P on input I, denotedP(I), is the minimum model of P containing I, if it exists.

280 Datalog

Station

OdeonSt.-MichelChateletLouvresPalais-RoyalTuileriesConcorde

Ans_1 Line

41

Ans_2

⟨ ⟩

Ans_3

Figure 12.2: Relations of Pmetro(I)

For Pmetro as in Example 12.1.3, and I as in Fig. 12.1, the values of Ans_1, Ans_2, andAns_3 in P(I) are shown in Fig. 12.2.

We briefly discuss the choice of the minimal model at the end of this section.Although the previous definition is natural, we cannot be entirely satisfied with it at

this point:

• For given P and I, we do not know (yet) whether the semantics of P is defined (i.e.,whether there exists a minimum model of �P containing I).

• Even if such a model exists, the definition does not provide any algorithm forcomputing P(I). Indeed, it is not (yet) clear that such an algorithm exists.

We next provide simple answers to both of these problems.Observe that by definition, P(I) is an instance over sch(P ). A priori, we must consider

all instances over sch(P ), an infinite set. It turns out that it suffices to consider only thoseinstances with active domain in adom(P, I) (i.e., a finite set of instances). For given P andI, let B(P, I) be the instance over sch(P ) defined by

1. For each R in edb(P ), a fact R(u) is in B(P, I) iff it is in I; and

2. For each R in idb(P ), each fact R(u) with constants in adom(P, I) is in B(P, I).

We now verify that B(P, I) is a model of P containing I.

Lemma 12.2.2 Let P be a datalog program and I an instance over edb(P ). Then B(P, I)is a model of P containing I.

Proof Let A1 ← A2, . . . , An be an instantiation of some rule r in P such that A2, . . . ,

An hold in B(P, I). Then consider A1. Because each variable occurring in the head of ralso occurs in the body, each constant occurring in A1 belongs to adom(P, I). Thus bydefinition 2 just given, A1 is in B(P, I). Hence B(P, I) satisfies the sentence associatedwith that particular rule, so B(P, I) satisfies �P . Clearly, B(P, I) contains I by definition 1.

12.2 Model-Theoretic Semantics 281

Thus the semantics of P on input I, if defined, is a subset of B(P, I). This means thatthere is no need to consider instances with constants outside adom(P, I).

We next demonstrate that P(I) is always defined.

Theorem 12.2.3 Let P be a datalog program, I an instance over edb(P ), and X the setof models of P containing I. Then

1. ∩X is the minimal model of P containing I, so P(I) is defined.

2. adom(P (I))⊆ adom(P, I).

3. For each R in edb(P ), P(I)(R)= I(R).

Proof Note that X is nonempty, because B(P, I) is in X . Let r ≡ A1 ← A2, . . . , An bea rule in P and ν a valuation of the variables occurring in the rule. To prove (1), we showthat

(*) if ν(A2), . . . , ν(An) are in ∩X then ν(A1) is also in ∩X .

For suppose that (*) holds. Then ∩X |= r , so ∩X satisfies �P . Because each instance in Xcontains I, ∩X contains I. Hence ∩X is a model of P containing I. By construction, ∩Xis minimal, so (1) holds.

To show (*), suppose that ν(A2), . . . , ν(An) are in ∩X and let K be in X . Because∩X ⊆K, ν(A2), . . . , ν(An) are in K. Because K is in X , K is a model of P , so ν(A1) isin K. This is true for each K in X . Hence ν(A1) is in ∩X and (*) holds, which in turnproves (1).

By Lemma 12.2.2, B(P, I) is a model of P containing I. Therefore P(I)⊆ B(P, I).Hence

• adom(P (I))⊆ adom(B(P, I))= adom(P, I), so (2) holds.

• For each R in edb(P ), I(R) ⊆ P(I)(R) [because P(I) contains I] and P(I)(R) ⊆B(P, I)(R)= I(R); which shows (3).

The previous development also provides an algorithm for computing the semanticsof datalog programs. Given P and I, it suffices to consider all instances that are subsets ofB(P, I), find those that are models of P and contain I, and compute their intersection. How-ever, this is clearly an inefficient procedure. The next section provides a more reasonablealgorithm.

We conclude this section with two remarks on the definition of semantics of datalogprograms. The first explains the choice of a minimal model. The second rephrases ourdefinition in more standard logic-programming terminology.

Why Choose the Minimal Model?

This choice is the natural consequence of an implicit hypothesis of a philosophical nature:the closed world assumption (CWA) (see Chapter 2).

The CWA concerns the connection between the database and the world it models.

282 Datalog

Clearly, databases are often incomplete (i.e., facts that may be true in the world are notnecessarily recorded in the database). Thus, although we can reasonably assume that afact recorded in the database is true in the world, it is not clear what we can say aboutfacts not explicitly recorded. Should they be considered false, true, or unknown? The CWAprovides the simplest solution to this problem: Treat the database as if it records completeinformation about the world (i.e., assume that all facts not in the database are false). Thisis equivalent to taking as true only the facts that must be true in all worlds modeled bythe database. By extension, this justifies the choice of minimal model as the semantics ofa datalog program. Indeed, the minimal model consists of the facts we know must be truein all worlds satisfying the sentences (and including the input instance). As we shall see,this has an equivalent proof-theoretic counterpart, which will justify the proof-theoreticsemantics of datalog programs: Take as true precisely the facts that can be proven truefrom the input and the sentences corresponding to the datalog program. Facts that cannotbe proven are therefore considered false.

Importantly, the CWA is not so simple to use in the presence of negation or disjunction.For example, suppose that a database holds {p ∨ q}. Under the CWA, then both ¬p and¬q are inferred. But the union {p ∨ q,¬p,¬q} is inconsistent, which is certainly not theintended result.

Herbrand Interpretation

We relate briefly the semantics given to datalog programs to standard logic-programmingterminology.

In logic programming, the facts of an input instance I are not separated from thesentences of a datalog program P . Instead, sentences stating that all facts in I are trueare included in P . This gives rise to a logical theory �P,I consisting of the sentences in �P

and of one sentence P(u) [sometimes written P(u)←] for each fact P(u) in the instance.The semantics is defined as a particular model of this set of sentences. A problem is thatstandard interpretations in first-order logic permit interpretation of constants of the theorywith arbitrary elements of the domain. For instance, the constants Odeon and St.-Michelmay be interpreted by the same element (e.g., John). This is clearly not what we meanin the database context. We wish to interpret Odeon by Odeon and similarly for all otherconstants. Interpretations that use the identity function to interpret the constant symbolsare called Herbrand interpretations (see Chapter 2). (If function symbols are present,restrictions are also placed on how terms involving functions are interpreted.) Given a set� of formulas, a Herbrand model of � is a Herbrand interpretation satisfying �.

Thus in logic programming terms, the semantics of a program P given an instance Ican be viewed as the minimum Herbrand model of �P,I.

12.3 Fixpoint Semantics

In this section, we present an operational semantics for datalog programs stemming fromfixpoint theory. We use an operator called the immediate consequence operator. The oper-ator produces new facts starting from known facts. We show that the model-theoretic se-

12.3 Fixpoint Semantics 283

mantics, P(I), can also be defined as the smallest solution of a fixpoint equation involvingthat operator. It turns out that this solution can be obtained constructively. This approachtherefore provides an alternative constructive definition of the semantics of datalog pro-grams. It can be viewed as an implementation of the model-theoretic semantics.

Let P be a datalog program and K an instance over sch(P ). A fact A is an immediateconsequence for K and P if either A ∈K(R) for some edb relation R, or A← A1, . . . , An

is an instantiation of a rule in P and each Ai is in K. The immediate consequence operatorof P , denoted TP , is the mapping from inst(sch(P )) to inst(sch(P )) defined as follows.For each K, TP (K) consists of all facts A that are immediate consequences for K and P .

We next note some simple mathematical properties of the operator TP over sets ofinstances. We first define two useful properties. For an operator T ,

• T is monotone if for each I, J, I⊆ J implies T (I)⊆ T (J).

• K is a fixpoint of T if T (K)=K.

The proof of the next lemma is straightforward and is omitted (see Exercise 12.9).

Lemma 12.3.1 Let P be a datalog program.

(i) The operator TP is monotone.

(ii) An instance K over sch(P ) is a model of �P iff TP (K)⊆K.

(iii) Each fixpoint of TP is a model of �P ; the converse does not necessarily hold.

It turns out that P(I) (as defined by the model-theoretic semantics) is a fixpoint of TP .In particular, it is the minimum fixpoint containing I. This is shown next.

Theorem 12.3.2 For each P and I, TP has a minimum fixpoint containing I, whichequals P(I).

Proof Observe first that P(I) is a fixpoint of TP :

• TP (P (I))⊆ P(I) because P(I) is a model of P ; and

• P(I) ⊆ TP (P (I)). [Because TP (P (I)) ⊆ P(I) and TP is monotone, TP (TP (P (I)))⊆ TP (P (I)). Thus TP (P (I)) is a model of �P . Because TP preserves the contentsof the edb relations and I⊆ P(I), we have I⊆ TP (P (I)). Thus TP (P (I)) is a modelof �P containing I. Because P(I) is the minimum such model, P(I)⊆ TP (P (I)).]

In addition, each fixpoint of TP containing I is a model of P and thus contains P(I) (whichis the intersection of all models of P containing I). Thus P(I) is the minimum fixpoint ofP containing I.

The fixpoint definition of the semantics of P presents the advantage of leading to aconstructive definition of P(I). In logic programming, this is shown using fixpoint theory(i.e., using Knaster-Tarski’s and Kleene’s theorems). However, the database frameworkis much simpler than the general logic-programming one, primarily due to the lack offunction symbols. We therefore choose to show the construction directly, without theformidable machinery of the theory of fixpoints in complete lattices. In Remark 12.3.5

284 Datalog

we sketch the more standard proof that has the advantage of being applicable to the largercontext of logic programming.

Given an instance I over edb(P ), one can compute TP (I), T 2P (I), T

3P (I), etc. Clearly,

I⊆ TP (I)⊆ T 2P (I)⊆ T 3

P (I) . . .⊆ B(P, I).

This follows immediately from the fact that I ⊆ TP (I) and the monotonicity of TP . Let Nbe the number of facts in B(P, I). (Observe that N depends on I.) The sequence {T i

P (I)}ireaches a fixpoint after at most N steps. That is, for each i ≥ N , T i

P (I) = T NP (I). In

particular, TP (T NP (I))= T N

P (I), so T NP (I) is a fixpoint of TP . We denote this fixpoint by

T ωP (I).

Example 12.3.3 Recall the program PTC for computing the transitive closure of agraph G:

T (x, y)←G(x, y)

T (x, y)←G(x, z), T (z, y).

Consider the input instance

I= {G(1, 2),G(2, 3),G(3, 4),G(4, 5)}.

Then we have

TPTC(I )= I ∪ {T (1, 2), T (2, 3), T (3, 4), T (4, 5)}T 2PTC

(I )= TPTC(I ) ∪ {T (1, 3), T (2, 4), T (3, 5)}T 3PTC

(I )= T 2PTC

(I ) ∪ {T (1, 4), T (2, 5)}T 4PTC

(I )= T 3PTC

(I ) ∪ {T (1, 5)}T 5PTC

(I )= T 4PTC

(I ).

Thus T ωPTC

(I )= T 4PTC

(I ).

We next show that T ωP (I) is exactly P(I) for each datalog program P .

Theorem 12.3.4 Let P be a datalog program and I an instance over edb(P ). ThenT ωP (I)= P(I).

Proof By Theorem 12.3.2, it suffices to show that T ωP (I) is the minimum fixpoint of TP

containing I. As noted earlier,

TP (TωP (I))= TP (T

NP (I))= T N

P (I)= T ωP (I).

12.3 Fixpoint Semantics 285

where N is the number of facts in B(P, I). Therefore T ωP (I) is a fixpoint of TP that con-

tains I.To show that it is minimal, consider an arbitrary fixpoint J of TP containing I. Then

J ⊇ T 0P (I)= I. By induction on i, J ⊇ T i

P (I) for each i, so J ⊇ T ωP (I). Thus T ω

P (I) is theminimum fixpoint of TP containing I.

The smallest integer i such that T iP (I)= T ω

P (I) is called the stage for P and I and isdenoted stage(P, I). As already noted, stage(P, I)≤N = |B(P, I)|.

Evaluation

The fixpoint approach suggests a straightforward algorithm for the evaluation of datalog.We explain the algorithm in an example. We extend relational algebra with a while operatorthat allows us to iterate an algebraic expression while some condition holds. (The resultinglanguage is studied extensively in Chapter 17.)

Consider again the transitive closure query. We wish to compute the transitive closureof relation G in relation T . Both relations are over AB. This computation is performed bythe following program:

T :=G;while q(T ) �= T do T := q(T );

where

q(T )=G ∪ πAB(δB→C(G) �� δA→C(T )).

(Recall that δ is the renaming operation as introduced in Chapter 4.)Observe that q is an SPJRU expression. In fact, at each step, q computes the im-

mediate consequence operator TP , where P is the transitive closure datalog program inExample 12.3.3. One can show in general that the immediate consequence operator can becomputed using SPJRU expressions (i.e., relational algebra without the difference opera-tion). Furthermore, the SPJRU expressions extended carefully with a while construct yieldexactly the expressive power of datalog. The test of the while is used to detect when thefixpoint is reached.

The while construct is needed only for recursion. Let us consider again the nonrecur-sive datalog of Chapter 4. Let P be a datalog program. Consider the graph (sch(P ), EP ),where 〈S, S′〉 is an edge in EP if S′ occurs in the head of some rule r in P and S occurs inthe body of r . Then P is nonrecursive if the graph is acyclic. We mentioned already thatnr-datalog programs are equivalent to SPJRU queries (see Section 4.5). It is also easy tosee that, for each nr-datalog program P , there exists a constant d such that for each I overedb(P ), stage(P, I) ≤ d . In other words, the fixpoint is reached after a bounded numberof steps, dependent only on the program. (See Exercise 12.29.) Programs for which thishappens are called bounded. We examine this property in more detail in Section 12.5.

A lot of redundant computation is performed when running the preceding transitiveclosure program. We study optimization techniques for datalog evaluation in Chapter 13.

286 Datalog

Remark 12.3.5 In this remark, we make a brief excursion into standard fixpoint theoryto reprove Theorem 12.3.4. This machinery is needed when proving the analog of thattheorem in the more general context of logic programming. A partially ordered set (U,≤)is a complete lattice if each subset has a least upper bound and a greatest lower bound,denoted sup and inf , respectively. In particular, inf (U) is denoted⊥ and sup(U) is denoted,. An operator T on U is monotone iff for each x, y ∈ U , x ≤ y implies T (x)≤ T (y). Anoperator T on U is continuous if for each subset V , T (sup(V )) = sup(T (V )). Note thatcontinuity implies monotonicity.

To each datalog program P and instance I, we associate the program PI consistingof the rules of P and one rule R(u)← for each fact R(u) in I. We consider the completelattice formed with (inst(sch(P )),⊆) and the operator TPI defined by the following: Foreach K, a fact A is in TPI(K) if A is an immediate consequence for K and PI. The operatorTPI on (inst(sch(P )),⊆) is continuous (so also monotone).

The Knaster-Tarski theorem states that a monotone operator in a complete latticehas a least fixpoint that equals inf ({x | x ∈ U, T (x) ≤ x}). Thus the least fixpoint of TPI

exists. Fixpoint theory also provides the constructive definition of the least fixpoint forcontinuous operators. Indeed, Kleene’s theorem states that if T is a continuous operator ona complete lattice, then its least fixpoint is sup({Ki | i ≥ 0}) where K0 =⊥ and for eachi > 0, Ki = T (Ki−1). Now in our case, ⊥= ∅ and

∅ ∪ TPI(∅) ∪ · · · ∪ T iPI(∅) ∪ · · ·

coincides with P(I).In logic programming, function symbols are also considered (see Example 12.1.4). In

this context, the sequence of {T iPI(I)}i>0 does not generally converge in a finite number

of steps, so the fixpoint evaluation is no longer constructive. However, it does converge incountably many steps to the least fixpoint ∪{T i

PI(∅) | i ≥ 0}. Thus fixpoint theory is useful

primarily when dealing with logic programs with function symbols. It is an overkill in thesimpler context of datalog.

12.4 Proof-Theoretic Approach

Another way of defining the semantics of datalog is based on proofs. The basic idea is thatthe answer of a program P on I consists of the set of facts that can be proven using P andI. The result turns out to coincide, again, with P(I).

The first step is to define what is meant by proof . A proof tree of a fact A from I andP is a labeled tree where

1. each vertex of the tree is labeled by a fact;

2. each leaf is labeled by a fact in I;

3. the root is labeled by A; and

4. for each internal vertex, there exists an instantiation A1 ← A2, . . . , An of a rulein P such that the vertex is labeled A1 and its children are respectively labeledA2, . . . , An.

Such a tree provides a proof of the fact A.

12.4 Proof-Theoretic Approach 287

(a) Datalog proof (b) Context-free derivation

S(1,6)

R(5,a,6)T(1,5)

T(3,5)R(1,a,2) R(2,b,3)

R(3,a,4) R(4,a,5)

rule 2

rule 1

rule 3

S

aT

Ta b

a a

Figure 12.3: Proof tree

Example 12.4.1 Consider the following program:

S(x1, x3)← T (x1, x2), R(x2, a, x3)

T (x1, x4)← R(x1, a, x2), R(x2, b, x3), T (x3, x4)

T (x1, x3)← R(x1, a, x2), R(x2, a, x3)

and the instance

{R(1, a, 2), R(2, b, 3), R(3, a, 4), R(4, a, 5), R(5, a, 6)}.

A proof tree of S(1, 6) is shown in Fig. 12.3(a).

The reader familiar with context-free languages will notice the similarity betweenproof trees and derivation trees in context-free languages. This connection is especiallystrong in the case of datalog programs that have the form of the one in Example 12.4.1.This will be exploited in the last section of this chapter.

Proof trees provide proofs of facts. It is straightforward to show that a fact A is in P(I)iff there exists a proof tree for A from I and P . Now given a fact A to prove, one can lookfor a proof either bottom up or top down.

The bottom-up approach is an alternative way of looking at the constructive fixpointtechnique. One begins with the facts from I and then uses the rules to infer new facts, muchlike the immediate consequence operator. This is done repeatedly until no new facts can beinferred. The rules are used as “factories” producing new facts from already proven ones.This eventually yields all facts that can be proven and is essentially the same as the fixpointapproach.

In contrast to the bottom-up and fixpoint approaches, the top-down approach allowsone to direct the search for a proof when one is only interested in proving particular facts.

288 Datalog

For example, suppose the query Ans_1(Louvre) is posed against the program Pmetro ofExample 12.1.3, with the input instance of Fig. 12.1. Then the top-down approach willnever consider atoms involving stations on Line 9, intuitively because they are are notreachable from Odeon or Louvre. More generally, the top-down approach inhibits theindiscriminate inference of facts that are irrelevant to the facts of interest.

The top-down approach is described next. This takes us to the field of logic program-ming. But first we need some notation, which will remind us once again that “To bar aneasy access to newcomers every scientific domain has introduced its own terminology andnotation” [Apt91].

Notation

Although we already borrowed a lot of terminology and notation from the logic-program-ming field (e.g., term, fact, atom), we must briefly introduce some more.

A positive literal is an atom [i.e., P(u) for some free tuple u]; and a negative literal isthe negation of one [i.e., ¬P(u)]. A formula of the form

∀x1, . . . , xm(A1 ∨ · · · ∨ An ∨ ¬B1 ∨ · · · ∨ ¬Bp),

where the Ai, Bj are positive literals, is called a clause. Such a clause is written in clausalform as

A1, . . . , An← B1, . . . , Bp.

A clause with a single literal in the head (n= 1) is called a definite clause. A definite clausewith an empty body is called a unit clause. A clause with no literal in the head is called agoal clause. A clause with an empty body and head is called an empty clause and is denoted

. Examples of these and their logical counterparts are as follows:

definite T (x, y)← R(x, z), T (z, y) T (x, y) ∨ ¬R(x, z) ∨ ¬T (z, y)unit T (x, y)← T (x, y)

goal ← R(x, z), T (z, y) ¬R(x, z) ∨ ¬T (z, y)empty false

The empty clause is interpreted as a contradiction. Intuitively, this is because it correspondsto the disjunction of an empty set of formulas.

A ground clause is a clause with no occurrence of variables.The top-down proof technique introduced here is called SLD resolution. Goals serve

as the basic focus of activity in SLD resolution. As we shall see, the procedure beginswith a goal such as ← St_Reachable(x,Concorde), Li_Reachable(x, 9). A correct an-swer of this goal on input I is any value a such that St_Reachable(a,Concorde) andLi_Reachable(a, 9) are implied by �Pmetro,I. Furthermore, each intermediate step of thetop-down approach consists of obtaining a new goal from a previous goal. Finally, theprocedure is deemed successful if the final goal reached is empty.

The standard exposition of SLD resolution is based on definite clauses. There is a


subtle distinction between datalog rules and definite clauses: For datalog rules, we imposedthe restriction that each variable that occurs in the head also appears in the body. (Inparticular, a datalog unit clause must be ground.) We will briefly mention some minorconsequences of this distinction.

As already introduced in Remark 12.3.5, to each datalog program P and instance I,we associate the program PI consisting of the rules of P and one rule R(u)← for eachfact R(u) in I. Therefore in the following we ignore the instance I and focus on programsthat already integrate all the known facts in the set of rules. We denote such a program PIto emphasize its relationship to an instance I. Observe that from a semantic point of view

P(I)= PI(∅).

This ignores the distinction between edb and idb relations, which no longer exists for PI.

Example 12.4.2 Consider the program P and instance I of Example 12.4.1. The rulesof PI are

1. S(x1, x3)← T (x1, x2), R(x2, a, x3)

2. T (x1, x4)← R(x1, a, x2), R(x2, b, x3), T (x3, x4)

3. T (x1, x3)← R(x1, a, x2), R(x2, a, x3)

4. R(1, a, 2)←5. R(2, b, 3)←6. R(3, a, 4)←7. R(4, a, 5)←8. R(5, a, 6)←

Warm-Up

Before discussing SLD resolution, as a warm-up we look at a simplified version of thetechnique by considering only ground rules. To this end, consider a datalog program PI(integrating the facts) consisting only of fully instantiated rules (i.e., with no occurrencesof variables). Consider a ground goal g ≡

← A1, . . . , Ai, . . . , An

and some (ground) rule r ≡ Ai ← B1, . . . , Bm in PI. A resolvent of g with r is the groundgoal

← A1, . . . , Ai−1, B1, . . . , Bm,Ai+1, . . . , An.

Viewed as logical sentences, the resolvent of g with r is actually implied by g and r .This is best seen by writing these explicitly as clauses:

290 Datalog

R(5,a,6)

S(1,6) ← T(1,5), R(5,a,6)

R(2,b,3), T(3,5), R(5,a,6)

T(1,5) ← R(1,a,2), R(2,b,3), T(3,5)

T(3,5), R(5,a,6)

R(1,a,2) ←

R(5,a,6)

R(2,b,3) ←

R(4,a,5), R(5,a,6)

T(3,5) ← R(3,a,4), R(4,a,5)

R(5,a,6)

R(3,a,4) ←

R(4,a,5) ←

R(5,a,6) ←

← S(1,6)

← T(1,5),

← R(1,a,2),

← R(2,b,3),

← T(3,5),

← R(3,a,4),

← R(4,a,5),

← R(5,a,6)

←

Figure 12.4: SLD ground refutation

(¬A1 ∨ · · · ∨ ¬Ai ∨ · · · ∨ ¬An) ∧ (Ai ∨ ¬B1 ∨ · · · ∨ ¬Bm)

⇒ (¬A1 ∨ · · · ∨ ¬Ai−1 ∨ ¬B1 ∨ · · · ∨ ¬Bm ∨ ¬Ai+1 ∨ · · · ∨ ¬An).

In general, the converse does not hold.A derivation from g with PI is a sequence of goals g ≡ g0, g1, . . . such that for each

i > 0, gi is a resolvent of gi−1 with some rule in PI. We will see that to prove a fact A, itsuffices to exhibit a refutation of ← A—that is, a derivation

g0 ≡← A, g1, . . . , gi, . . . , gq ≡ .

Example 12.4.3 Consider Example 12.4.1 and the program obtained by all possibleinstantiations of the rules of PI in Example 12.4.2. An SLD ground refutation is shownin Fig. 12.4. It is a refutation of ← S(1, 6) [i.e. a proof of S(1, 6)].

Let us now explain why refutations provide proofs of facts. Suppose that we wish toprove A1 ∧ · · · ∧ An. To do this we may equivalently prove that its negation (i.e. ¬A1 ∨· · · ∨ ¬An) is false. In other words, we try to refute (or disprove) ← A1, . . . , An. Thefollowing rephrasing of the refutation in Fig. 12.4 should make this crystal clear.


Example 12.4.4 Continuing with the previous example, to prove S(1, 6), we try to refuteits negation [i.e.,¬S(1, 6) or← S(1, 6)]. This leads us to considering, in turn, the formulas

Goal Rule used

¬S(1, 6) (1)⇒ ¬T (1, 5) ∨ ¬R(5, a, 6) (2)⇒ ¬R(1, a, 2) ∨ ¬R(2, b, 3) ∨ ¬T (3, 5) ∨ ¬R(5, a, 6) (4)⇒ ¬R(2, b, 3) ∨ ¬T (3, 5) ∨ ¬R(5, a, 6) (5)⇒ ¬T (3, 5) ∨ ¬R(5, a, 6) (3)⇒ ¬R(3, a, 4) ∨ ¬R(4, a, 5) ∨ ¬R(5, a, 6) (6)⇒ ¬R(4, a, 5) ∨ ¬R(5, a, 6) (7)⇒ ¬R(5, a, 6) (8)⇒ false

At the end of the derivation, we have obtained a contradiction. Thus we have refuted¬S(1, 6) [i.e., proved S(1, 6)].

Thus refutations provide proofs. As a consequence, a goal can be thought of as a query.Indeed, the arrow is sometimes denoted with a question mark in goals. For instance, wesometimes write

?- S(1, 6) for ← S(1, 6).

Observe that the process of finding a proof is nondeterministic for two reasons: thechoice of the literal A to replace and the rule that is used to replace it.

We now have a technique for proving facts. The benefit of this technique is that it issound and complete, in the sense that the set of facts in P(I) coincides with the facts thatcan be proven from PI.

Theorem 12.4.5 Let PI be a datalog program and ground(PI) be the set of instantiationsof rules in PI with values in adom(P, I). Then for each ground goal g, PI(∅) |= ¬g iff thereexists a refutation of g with ground(PI).

Crux To show the “only if,” we prove by induction that

(**)for each ground goal g, if T i

PI(∅) |= ¬g,

there exists a refutation of g with ground(PI).

(The “if” part is proved similarly by induction on the length of the refutation. Its proof isleft for Exercise 12.18.)

The base case is obvious. Now suppose that (**) holds for some i ≥ 0, and letA1, . . . , Am be ground atoms such that T i+1

PI(∅) |= A1 ∧ · · · ∧ Am. Therefore each Aj is

in T i+1PI

(∅). Consider some j . If Aj is an edb fact, we are back to the base case. Otherwise

292 Datalog

R(x2,a,x)

S(x1,x3) ← T(x1,x2), R(x2,a,x3)

R(y2,b,x3), T(x3,x2), R(x2,a,x)

T(x1,x4) ← R(x1,a,x2), R(x2,b,x3), T(x3,x4)

T(x3,x2), R(x2,a,x)

R(1,a,2) ←

R(x2,a,x)

R(2,b,3) ←

R(z2,a,x2), R(x2,a,x)

T(x1,x3) ← R(x1,a,x2), R(x2,a,x3)

R(x2,a,x)

R(3,a,4) ←

R(4,a,5) ←

R(5,a,6) ←

← S(1,x)

← T(1,x2),

← R(1,a,y2),

← R(2,b,x3),

← T(3,x2),

← R(3,a,z2),

← R(4,a,x2),

← R(5,a,x)

←

Figure 12.5: SLD refutation

there exists an instantiation Aj ← B1, . . . , Bp of some rule in PI such that B1, . . . , Bp arein T i

PI(∅). The refutation of ← Aj with ground(PI) is as follows. It starts with

← Aj

← B1, B2 . . . , Bp.

Now by induction there exist refutations of ← Bn, 1 ≤ n ≤ p, with ground(PI). Usingthese refutations, one can extend the preceding derivation to a derivation leading to theempty clause. Furthermore, the refutations for each of the Aj ’s can be combined to obtaina refutation of← A1, . . . , Am as desired. Therefore (**) holds for i + 1. By induction, (**)holds.

SLD Resolution

The main difference between the general case and the warm-up is that we now handlegoals and tuples with variables rather than just ground ones. In addition to obtaining thegoal , the process determines an instantiation θ for the free variables of the goal g, suchthat PI(∅) |= ¬θg. We start with an example: An SLD refutation of← S(1, x) is shown inFig. 12.5.

In general, we start with a goal (which does not have to be ground):


← A1, . . . , Ai, . . . , An.

Suppose that we selected a literal to be replaced [e.g., Ai =Q(1, x2, x5)]. Any rule usedfor the replacement must have Q for predicate in the head, just as in the ground case. Forinstance, we might try some rule

Q(x1, x4, x3)← P(x1, x2), P (x2, x3),Q(x3, x4, x5).

We now have two difficulties:

(i) The same variable may occur in the selected literal and in the rule with twodifferent meanings. For instance, x2 in the selected literal is not to be confusedwith x2 in the rule.

(ii) The pattern of constants and of equalities between variables in the selected literaland in the head of the rule may be different. In our example, for the first attributewe have 1 in the selected literal and a variable in the rule head.

The first of these two difficulties is handled easily by renaming the variables of the rules.We shall use the following renaming discipline: Each time a rule is used, a new set ofdistinct variables is substituted for the ones in the rule. Thus we might use instead the rule

Q(x11, x14, x13)← P(x11, x12), P (x12, x13),Q(x13, x14, x15).

The second difficulty requires a more careful approach. It is tackled using unification,which matches the pattern of the selected literal to that of the head of the rule, if possible.In the example, unification consists of finding a substitution θ such that θ(Q(1, x2, x5))=θ(Q(x11, x14, x13)). Such a substitution is called a unifier. For example, the substitu-tion θ(x11) = 1, θ(x2) = θ(x14) = θ(x5) = θ(x13) = y is a unifier for Q(1, x2, x5) andQ(x11, x14, x13), because θ(Q(1, x2, x5)) = θ(Q(x11, x14, x13)) = Q(1, y, y). Note thatthis particular unifier is unnecessarily restrictive; there is no reason to identify all ofx2, x3, x4, x5.

A unifier that is no more restrictive than needed to unify the atoms is called a mostgeneral unifier (mgu). Applying the mgu to the rule to be used results in specializing therule just enough so that it applies to the selected literal. These terms are formalized next.

Definition 12.4.6 Let A,B be two atoms. A unifier for A and B is a substitution θ suchthat θA= θB. A substitution θ is more general than a substitution ν, denoted θ ↪→ ν, iffor some substitution ν′, ν = θ ◦ ν′. A most general unifier (mgu) for A and B is a unifierθ for A,B such that, for each unifier ν of A,B, we have θ ↪→ ν.

Clearly, the relation ↪→ between unifiers is reflexive and transitive but not antisym-metric. Let ≈ be the equivalence relation on substitutions defined by θ ≈ ν iff θ ↪→ ν andν ↪→ θ . If θ ≈ ν, then for each atom A, θ(A) and ν(A) are the same modulo renaming ofvariables.

294 Datalog

Computing the mgu

We now develop an algorithm for computing an mgu for two atoms. Let R be a relationof arity p and R(x1, . . . , xp), R(y1, . . . , yp) two literals with disjoint sets of variables.Compute ≡, the equivalence relation on var ∪ dom defined as the reflexive, transitiveclosure of: xi ≡ yi for each i in [1, p]. The mgu of R(x1, . . . , xp) and R(y1, . . . , yp) doesnot exist if two distinct constants are in the same equivalence class. Otherwise their mgu isthe substitution θ such that

1. If z≡ a for some constant a, θ(z)= a;

2. Otherwise θ(z)= z′, where z′ is the smallest (under a fixed ordering on var) suchthat z≡ z′.

We show that the foregoing computes an mgu.

Lemma 12.4.7 The substitution θ just computed is an mgu for R(x1, . . . , xp) andR(y1, . . . , yp).

Proof Clearly, θ is a unifier for R(x1, . . . , xp) and R(y1, . . . , yp). Suppose ν is anotherunifier for the same atoms. Let ≡ν be the equivalence relation on var ∪ dom defined byx ≡ν y iff ν(x) = ν(y). Because ν is a unifier, ν(xi) = ν(yi). It follows that xi ≡ν yi, so≡ refines≡ν. Then the substitution ν′ defined by ν′(θ(x))= ν(x), is well defined, becauseθ(x)= θ(x′) implies ν(x)= ν(x′). Thus ν = θ ◦ ν′ so θ ↪→ ν. Because this holds for everyunifier ν, it follows that θ is an mgu for the aforementioned atoms.

The following facts about mgu’s are important to note. Their proof is left to the reader(Exercise 12.19). In particular, part (ii) of the lemma says that the mgu of two atoms, if itexists, is essentially unique (modulo renaming of variables).

Lemma 12.4.8 Let A,B be atoms.

(i) If there exists a unifier for A,B, then A,B have an mgu.

(ii) If θ and θ ′ are mgu’s for A,B then θ ≈ θ ′.(iii) Let A,B be atoms with mgu θ . Then for each atom C, if C = θ1A = θ2B for

substitutions θ1, θ2, then C = θ3(θ(A))= θ3(θ(B)) for some substitution θ3.

We are now ready to rephrase the notion of resolvent to incorporate variables. Let

g ≡← A1, . . . , Ai, . . . , An, r ≡ B1 ← B2, . . . , Bm

be a goal and a rule such that

1. g and r have no variable in common (which can always be ensured by renamingthe variables of the rule).

2. Ai and B1 have an mgu θ .


Then the resolvent of g with r using θ is the goal

← θ(A1), . . . , θ(Ai−1), θ(B2), . . . , θ(Bm), θ(Ai+1), . . . , θ(An).

As before, it is easily verified that this resolvent is implied by g and r .An SLD derivation from a goal g with a program PI is a sequence g0 = g, g1, . . . of

goals and θ0, . . . of substitutions such that for each j , gj is the resolvent of gj−1 withsome rule in PI using θj1. An SLD refutation of a goal g with PI is an SLD derivationg0 = g, . . . , gq = with PI.

We now explain the meaning of such a refutation. As in the variable-free case, theexistence of a refutation of a goal ← A1, . . . , An with PI can be viewed as a proof of thenegation of the goal. The goal is

∀x1, . . . , xm(¬A1 ∨ · · · ∨ ¬An)

where x1, . . . , xm are the variables in the goal. Its negation is therefore equivalent to

∃x1, . . . , xm(A1 ∧ · · · ∧ An),

and the refutation can be seen as a proof of its validity. Note that, in the case of datalogprograms (where by definition all unit clauses are ground), the composition θ1 ◦ · · · ◦ θqof mgu’s used while refuting the goal yields a substitution by constants. This substitutionprovides “witnesses” for the existence of the variables x1, . . . , xm making true the conjunc-tion. In particular, by enumerating all refutations of the goal, one could obtain all valuesfor the variables satisfying the conjunction—that is, the answer to the query

{〈x1, . . . , xm〉 | A1 ∧ · · · ∧ An}.

This is not the case when one allows arbitrary definite clauses rather than datalog rules, asillustrated in the following example.

Example 12.4.9 Consider the program

S(x, z)←G(x, z)

S(x, z)←G(x, y), S(y, z)

S(x, x)←that computes in S the reflexive transitive closure of graphG. This is a set of definite clausesbut not a datalog program because of the last rule. However, resolution can be extended to(and is indeed in general presented for) definite clauses. Observe, for instance, that the goal← S(w,w) is refuted with a substitution that does not bind variable w to a constant.

SLD resolution is a technique that provides proofs of facts. One must be sure thatit produces only correct proofs (soundness) and that it is powerful enough to prove all

296 Datalog

true facts (completeness). To conclude this section, we demonstrate the soundness andcompleteness of SLD resolution for datalog programs.

We use the following lemma:

Lemma 12.4.10 Let g ≡← A1, . . . , Ai, . . . , An and r ≡ B1 ← B2, . . . , Bm be a goal anda rule with no variables in common, and let

g′ ≡← A1, . . . , Ai−1, B2, . . . , Bm,Ai+1, . . . , An.

If θg′ is a resolvent of g with r using θ , then the formula r implies:

r ′ ≡ ¬θg′ → ¬θg= θ(A1 ∧ · · · ∧ Ai−1 ∧ B2 ∧ · · · ∧ Bm ∧ Ai+1 ∧ · · · ∧ An)→ θ(A1 ∧ · · · ∧ An).

Proof Let J be an instance over sch(P ) satisfying r and let valuation ν be such that

J |= ν[θ(A1) ∧ · · · ∧ θ(Ai−1) ∧ θ(B2) ∧ · · · ∧ θ(Bm) ∧ θ(Ai+1) ∧ · · · ∧ θ(An)].

Because

J |= ν[θ(B2) ∧ · · · ∧ θ(Bm)]

and J |= B1 ← B2, . . . , Bm, J |= ν[θ(B1)]. That is, J |= ν[θ(Ai)]. Thus

J |= ν[θ(A1) ∧ · · · ∧ θ(An)].

Hence for each ν, J |= νr ′. Therefore J |= r ′. Thus each instance over sch(P ) satisfying r

also satisfies r ′, so r implies r ′.

Using this lemma, we have the following:

Theorem 12.4.11 (Soundness of SLD resolution) Let PI be a program and g ≡←A1, . . . , An a goal. If there exists an SLD-refutation of g with PI and mgu’s θ1, . . . , θq ,then PI implies

θ1 ◦ · · · ◦ θq(A1 ∧ · · · ∧ An).

Proof Let J be some instance over sch(P ) satisfying PI. Let g0 = g, . . . , gq = be anSLD refutation of g with PI and for each j , let gj be a resolvent of gj−1 with some rule inPI using some mgu θj . Then for each j , the rule that is used implies ¬gj → θj(¬gj−1) byLemma 12.4.10. Because J satisfies PI, for each j ,

J |= ¬gj → θj(¬gj−1).


Clearly, this implies that for each j ,

J |= θj+1 ◦ · · · ◦ θq(¬gj)→ θj ◦ · · · ◦ θq(¬gj−1).

By transitivity, this shows that

J |= ¬gq → θ1 ◦ · · · ◦ θq(¬g0),

and so

J |= true→ θ1 ◦ · · · ◦ θq(¬g).

Thus J |= θ1 ◦ · · · ◦ θq(A1 ∧ · · · ∧ An).

We next prove the converse of the previous result (namely, the completeness of SLDresolution).

Theorem 12.4.12 (Completeness of SLD resolution) Let PI be a program and g ≡←A1, . . . , An a goal. If PI implies ¬g, then there exists a refutation of g with PI.

Proof Suppose that PI implies ¬g. Consider the set ground(PI) of instantiations of rulesin PI with constants in adom(P, I). Clearly, ground(PI)(∅) is a model of PI, so it satisfies¬g. Thus there exists a valuation θ of the variables in g such that ground(PI)(∅) satisfies¬θg. By Theorem 12.4.5, there exists a refutation of θg using ground(PI).

Let g0 = θg, . . . , gp = be that refutation. We show by induction on k that for eachk in [0, p],

(†) there exists a derivation g′0 = g, . . . , g′k with PI such that gk = θkg′k for some θk.

For suppose that (†) holds for each k. Then for k = p, there exists a derivation g′1 =g, . . . , g′p with PI such that = gp = θpg

′p for some θp, so g′p = . Therefore there exists

a refutation of g with PI.The basis of the induction holds because g0 = θg = θg′0. Now suppose that (†) holds

for some k. The next step of the refutation consists of selecting some atom B of gk andapplying a rule r in ground(PI). In g′k select the atom B ′ with location in g′ correspondingto the location of B in gk. Note that B = θkB

′. In addition, we know that there is ruler ′′ = B ′′ ← A′′1 . . . A

′′n in PI that has r for instantiation via some substitution θ ′′ (such

a pair B ′, r ′′ exists although it may not be unique). As usual, we can assume that thevariables in g′k are disjoint from those in r ′′. Let θk ⊕ θ ′′ be the substitution defined byθk ⊕ θ ′′(x)= θk(x) if x is a variable in g′k, and θk ⊕ θ ′′(x)= θ ′′(x) if x is a variable in r ′′.Clearly, θk ⊕ θ ′′(B ′)= θk ⊕ θ ′′(B ′′)= B so, by Lemma 12.4.8 (i), B ′ and B ′′ have somemgu θ . Let g′k+1 be the resolvent of g′k with r ′′, B ′ using mgu θ . By the definition of mgu,there exists a substitution θk+1 such that θk ⊕ θ ′′ = θ ◦ θk+1. Clearly, θk+1(g

′k+1) = gk+1

and (†) holds for k + 1. By induction, (†) holds for each k.

298 Datalog

← S(1,x)

← T(1,x2), R(x2,a,x)

1:x1/1,x3/x

2:x1/1,x2/y2,x4/x2

← R(1,a,y2), R(y2,b,x3), T(x3,x2), R(x2,a,x)

3:x1/x3,x2/z2,x3/x2

← R(1,a,y2), R(y2,b,x3), R(x3,a,z2), T(z3,x2), R(x2,a,x)

← R(2,b,x3), R(x3,a,z2), R(z2,a,x2), R(x2,a,x)

5:x3/3

4:y2/2

← R(3,a,z2), R(z2,a,x2), R(x2,a,x)

6:z2/4

← R(4,a,x2), R(x2,a,x)

7:x2/5

← R(5,a,x)

8:x/6

Infinitesubtree

← R(1,a,y2), R(y2,a,x2), R(x2,a,x)

← R(1,a,1), R(1,a,x2), R(x2,a,x)

4:y2/1

no possible derivation

3:x1/1,x2/y2,x3/x2

2

Figure 12.6: SLD tree

SLD Trees

We have shown that SLD resolution is sound and complete. Thus it provides an adequatetop-down technique for obtaining the facts in the answer to a datalog program. To prove thata fact is in the answer, one must search for a refutation of the corresponding goal. Clearly,there are many refutations possible. There are two sources of nondeterminism in searchingfor a refutation: (1) the choice of the selected atom, and (2) the choice of the clause to unifywith the atom. Now let us assume that we have fixed some golden rule, called a selectionrule, for choosing which atom to select at each step in a refutation. A priori, such a rulemay be very simple (e.g., as in Prolog, always take the leftmost atom) or in contrast veryinvolved, taking into account the entire history of the refutation. Once an atom has beenselected, we can systematically search for all possible unifying rules. Such a search can berepresented in an SLD tree. For instance, consider the tree of Fig. 12.6 for the program inExample 12.4.2. The selected atoms are represented with boxes. Edges denote unificationsused. Given S(1, x), only one rule can be used. Given T (1, x2), two rules are applicablethat account for the two descendants of vertex T (1, x2). The first number in edge labelsdenotes the rule that is used and the remaining part denotes the substitution. An SLD treeis a representation of all the derivations obtained with a fixed selection rule for atoms.


There are several important observations to be made about this particular SLD tree:

(i) It is successful because one branch yields .

(ii) It has an infinite subtree that corresponds to an infinite sequence of applicationsof rule (2) of Example 12.4.2.

(iii) It has a blocking branch.

We can now explain (to a certain extent) the acronym SLD. SLD stands for selectionrule-driven linear resolution for definite clauses. Rule-driven refers to the rule used forselecting the atom. An important fact is that the success or failure of an SLD tree does notdepend on the rule for selecting atoms. This explains why the definition of an SLD treedoes not specify the selection rule.

Datalog versus Logic Programming, Revisited

Having established the three semantics for datalog, we summarize briefly the main differ-ences between datalog and the more general logic-programming (lp) framework.

Syntax: Datalog has only relation symbols, whereas lp uses also function symbols. Datalogrequires variables in rule heads to appear in bodies; in particular, all unit clauses areground.

Model-theoretic semantics: Due to the presence of function symbols in lp, models of lpprograms may be infinite. Datalog programs always have finite models. Apart fromthis distinction, lp and datalog are identical with respect to model-theoretic semantics.

Fixpoint semantics: Again, the minimum fixpoint of the immediate consequence operatormay be infinite in the lp case, whereas it is always finite for datalog. Thus the fixpointapproach does not necessarily provide a constructive semantics for lp.

Proof-theoretic semantics: The technique of SLD resolution is similar for datalog and lp,with the difference that the computation of mgu’s becomes slightly more complicatedwith function symbols (see Exercise 12.20). For datalog, the significance of SLDresolution concerns primarily optimization methods inspired by resolution (such as“magic sets”; see Chapter 13). In lp, SLD resolution is more important. Due to thepossibly infinite answers, the bottom-up approach of the fixpoint semantics may notbe feasible. On the other hand, every fact in the answer has a finite proof by SLDresolution. Thus SLD resolution emerges as the practical alternative.

Expressive power: A classical result is that lp can express all recursively enumerable (r.e.)predicates. However, as will be discussed in Part E, the expressive power of dataloglies within ptime. Why is there such a disparity? A fundamental reason is that functionsymbols are used in lp, and so an infinite domain of objects can be constructed from afinite set of symbols. Speaking technically, the result for lp states that if S is a (possiblyinfinite) r.e. predicate over terms constructed using a finite language, then there is anlp program that produces for some predicate symbol exactly the tuples in S. Speakingintuitively, this follows from the facts that viewed in a bottom-up sense, lp providescomposition and looping, and terms of arbitrary length can be used as scratch paper

300 Datalog

(e.g., to simulate a Turing tape). In contrast, the working space and output of range-restricted datalog programs are always contained within the active domain of the inputand the program and thus are bounded in size.

Another distinction between lp and datalog in this context concerns the nature ofexpressive power results for datalog and for query languages in general. Specifically,a datalog program P is generally viewed as a mapping from instances of edb(P )to instances of idb(P ). Thus expressive power of datalog is generally measured incomparison with mappings on families of database instances rather than in terms ofexpressing a single (possibly infinite) predicate.

12.5 Static Program Analysis

In this section, the static analysis of datalog programs is considered.2 As with relationalcalculus, even simple static properties are undecidable for datalog programs. In particular,although tableau homomorphism allowed us to test the equivalence of conjunctive queries,equivalence of datalog programs is undecidable in general. This complicates a systematicsearch for alternative execution plans for datalog queries and yields severe limitationsto query optimization. It also entails the undecidability of many other problems relatedto optimization, such as deciding when selection propagation (in the style of “pushing”selections in relational algebra) can be performed, or when parallel evaluation is possible.

We consider three fundamental static properties: satisfiability, containment, and a newone, boundedness. We exhibit a decision procedure for satisfiability. Recall that we showedin Chapter 5 that an analogous property is undecidable for CALC. The decidability ofsatisfiability for datalog may therefore be surprising. However, one must remember that,although datalog is more powerful than CALC in some respects (it has recursion), it is lesspowerful in others (there is no negation). It is the lack of negation that makes satisfiabilitydecidable for datalog.

We prove the undecidability of containment and boundedness for datalog programsand consider variations or restrictions that are decidable.

Satisfiability

Let P be a datalog program. An intensional relation T is satisfiable by P if there existsan instance I over edb(P ) such that P(I)(T ) is nonempty. We give a simple proof of thedecidability of satisfiability for datalog programs. We will soon see an alternative proofbased on context-free languages.

We first consider constant-free programs. We then describe how to reduce the generalcase to the constant-free one.

To prove the result, we use an auxiliary result about instance homomorphisms that is ofsome interest in its own right. Note that any mapping θ from dom to dom can be extendedto a homomorphism over the set of instances, which we also denote by θ .

2 Recall that static program analysis consists of trying to detect statically (i.e., at compile time)properties of programs.

12.5 Static Program Analysis 301

Lemma 12.5.1 Let P be a constant-free datalog program, I, J two instances over sch(P ),q a positive-existential query over sch(P ), and θ a mapping over dom. If θ(I)⊆ J, then(i) θ(q(I))⊆ q(J), and (ii) θ(P (I))⊆ P(J).

Proof For (i), observe that q is monotone and that q ◦ θ ⊆ θ ◦ q (which is not necessaryif q has constants). Because TP can be viewed as a positive-existential query, a straightfor-ward induction proves (ii).

This result does not hold for datalog programs with constants (see Exercise 12.21).

Theorem 12.5.2 The satisfiability of an idb relation T by a constant-free datalog pro-gram P is decidable.

Proof Suppose that T is satisfiable by a constant-free datalog program P . We prove thatP(Ia)(T ) is nonempty for some particular instance Ia. Let a be in dom. Let Ia be theinstance over edb(P ) such that for each R in edb(P ), Ia(R) contains a single tuple with ain each entry. Because T is satisfiable by P , there exists I such that P(I)(T ) �= ∅. Considerthe function θ that maps every constant in dom to a. Then θ(I) ⊆ Ia. By the previouslemma, θ(P (I)) ⊆ P(Ia). Therefore P(Ia)(T ) is nonempty. Hence T is satisfiable by Piff P(Ia)(T ) �= ∅.

Let us now consider the case of datalog programs with constants. Let P be a datalogprogram with constants. For example, suppose that b, c are the only two constants occur-ring in the program and thatR is a binary relation occurring inP . We transform the probleminto a problem without constants. Specifically, we replace R with nine new relations:

R��, Rb�, Rc�, R�b, R�c, Rbc, Rcb, Rbb, Rcc.

The first one is binary, the next four are unary, and the last four are 0-ary (i.e., are proposi-tions). Intuitively, a factR(x, y) is represented by the factR��(x, y) if x, y are not in {b, c};R(b, x) with x not in {b, c} is represented by Rb�(x), and similarly for Rc�, R�b, R�c. Thefact R(b, c) is represented by proposition Rbc(), etc. Using this kind of transformation foreach relation, one translates program P into a constant-free program P ′ such that T is sat-isfiable by P iff Tw is satisfiable by P ′ for some string w of � or constants occurring in P .(See Exercise 12.22a.)

Containment

Consider two datalog programs P,P ′ with the same extensional relations edb(P ) anda target relation T occurring in both programs. We say that P is included in P ′ withrespect to T , denoted P ⊆T P

′, if for each instance I over edb(P ), P(I)(T )⊆ P ′(I)(T ).The containment problem is undecidable. We prove this by reduction of the containmentproblem for context-free languages. The technique is interesting because it exhibits acorrespondence between proof trees of certain datalog programs and derivation trees ofcontext-free languages.

302 Datalog

We first illustrate the correspondence in an example.

Example 12.5.3 Consider the context-free grammar G = (V ,�,�, S), where V ={S, T }, S is the start symbol, � = {a, b}, and the set � of production rules is

S→ T a

T → abT | aa.The corresponding datalog program PG is the program of Example 12.4.1. A proof treeand its corresponding derivation tree are shown in Fig. 12.3.

We next formalize the correspondence between proof trees and derivation trees.A context-free grammar is a (�) grammar if the following hold:

(1) G is ε free (i.e., does not have any production of the form X→ ε, where εdenotes the empty string) and

(2) the start symbol does not occur in any right-hand side of a production.

We use the following:

Fact It is undecidable, given (�) grammars G1,G2, whether L(G1)⊆ L(G2).

For each (�) grammar G, let PG, the corresponding datalog program, be constructed(similar to Example 12.5.3) as follows: Let G = (V ,�,�, S). We may assume withoutloss of generality that V is a set of relation names of arity 2 and � a set of elements fromdom. Then idb(PG)= V and edb(PG)= {R}, where R is a ternary relation. Let x1, x2, . . .

be an infinite sequence of distinct variables. To each production in �,

T → C1 . . . Cn,

we associate a datalog rule

T (x1, xn+1)← A1, . . . , An,

where for each i

• if Ci is a nonterminal T ′, then Ai = T ′(xi, xi+1);

• if Ci is a terminal b, then Ai = R(xi, b, xi+1).

Note that, for any proof tree of a fact S(a1, an) using PG, the sequence of its leaves is(in this order)

R(a1, b1, a2), . . . , R(an−1, bn−1, an),

for some a2, . . . , an−1 and b1, . . . , bn−1. The connection between derivation trees ofG andproof trees of PG is shown in the following.

12.5 Static Program Analysis 303

Proposition 12.5.4 Let G be a (�) grammar and PG be the associated datalog pro-gram constructed as just shown. For each a1, . . . , an, b1, . . . , bn−1, there is a proof treeof S(a1, an) from PG with leaves R(a1, b1, a2), . . . , R(an−1, bn−1, an) (in this order) iffb1 . . . bn−1 is in L(G).

The proof of the proposition is left as Exercise 12.25. Now we can show the following:

Theorem 12.5.5 It is undecidable, given P,P ′ (with edb(P ) = edb(P ′)) and T ,whether P ⊆T P

′.

Proof It suffices to show that

(‡)for each pair G1,G2 of (�) grammars,

L(G1)⊆ L(G2)⇔ PG1 ⊆S PG2.

Suppose (‡) holds and T containment is decidable. Then we obtain an algorithm to decidecontainment of (�) grammars, which contradicts the aforementioned fact.

Let G2,G2 be two (�) grammars. We show here that

L(G1)⊆ L(G2)⇒ PG1 ⊆S PG2.

(The other direction is similar.) Suppose that L(G1)⊆ L(G2). Let I be over edb(PG1) andS(a1, an) be in PG1(I). Then there exists a proof tree of S(a1, an) from PG1 and I, withleaves labeled by facts

R(a1, b1, a2), . . . , R(an−1, bn−1, an),

in this order. By Proposition 12.5.4, b1 . . . bn−1 is in L(G1). Because L(G1) ⊆ L(G2),b1 . . . bn−1 is inL(G2). By the proposition again, there is a proof tree of S(a1, an) from PG2

with leaves R(a1, b1, a2), . . . , R(an−1, bn−1, an), all of which are facts in I. Thus S(a1, an)

is in PG2(I), so PG1 ⊆S PG2.

Note that the datalog programs used in the preceding construction are very particular:They are essentially chain programs. Intuitively, in a chain program the variables in a rulebody form a chain. More precisely, rules in chain programs are of the form

A0(x0, xn)← A1(x0, x1), A2(x1, x2), . . . , An(xn−1, xn).

The preceding proof can be tightened to show that containment is undecidable even forchain programs (see Exercise 12.26).

The connection with grammars can also be used to provide an alternate proof of thedecidability of satisfiability; satisfiability can be reduced to the emptiness problem forcontext-free languages (see Exercise 12.22c).

Although containment is undecidable, there is a closely related, stronger propertywhich is decidable—namely, uniform containment. For two programs P,P ′ over the same

304 Datalog

set of intensional and extensional relations, we say that P is uniformly contained in P ′,denoted P ⊆ P ′, iff for each I over sch(P ), P(I) ⊆ P ′(I). Uniform containment is asufficient condition for containment. Interestingly, one can decide uniform containment.The test for uniform containment uses dependencies studied in Part D and the fundamentalchase technique (see Exercises 12.27 and 12.28).

Boundedness

A key problem for datalog programs (and recursive programs in general) is to estimate thedepth of recursion of a given program. In particular, it is important to know whether for agiven program the depth is bounded by a constant independent of the input. Besides beingmeaningful for optimization, this turns out to be an elegant mathematical problem that hasreceived a lot of attention.

A datalog program P is bounded if there exists a constant d such that for each I overedb(P ), stage(P, I) ≤ d . Clearly, if a program is bounded it is essentially nonrecursive,although it may appear to be recursive syntactically. In some sense, it is falsely recursive.

Example 12.5.6 Consider the following two-rule program:

Buys(x, y)← Trendy(x),Buys(z, y) Buys(x, y)← Likes(x, y)

This program is bounded because Buys(z,y) can be replaced in the body by Likes(z,y),yielding an equivalent recursion-free program. On the other hand, the program

Buys(x, y)← Knows(x, z),Buys(z, y) Buys(x, y)← Likes(x, y)

is inherently recursive (i.e., is not equivalent to any recursion-free program).

It is important to distinguish truly recursive programs from falsely recursive (bounded)programs. Unfortunately, boundedness cannot be tested.

Theorem 12.5.7 Boundedness is undecidable for datalog programs.

The proof is by reduction of the PCP (see Chapter 2). One can even show that bound-edness remains undecidable under strong restrictions, such as that the programs that areconsidered (1) are constant-free, (2) contain a unique recursive rule, or (3) contain a uniqueintensional relation. Decidability results have been obtained for linear programs or chain-rule programs (see Exercise 12.31).

Bibliographic Notes

It is difficult to attribute datalog to particular researchers because it is a restriction orextension of many previously proposed languages; some of the early history is discussedin [MW88a]. The name datalog was coined (to our knowledge) by David Maier.

Bibliographic Notes 305

Many particular classes of datalog programs have been investigated. Examples are theclass of monadic programs (all intensional relations have arity one), the class of linearprograms (in the body of each rule of these programs, there can be at most one relation thatis mutually recursive with the head relation; see Chapter 13), the class of chain programs[UG88, AP87a] (their syntax resembles that of context-free grammars), and the class ofsingle rule programs or sirups [Kan88] (they consist of a single nontrivial rule and a trivialexit rule).

The fixpoint semantics that we considered in this chapter is due to [CH85]. However,it has been considered much earlier in the context of logic programming [vEK76, AvE82].For logic programming, the existence of a least fixpoint is proved using [Tar55].

The study of stage functions stage(d,H) is a major topic in [Mos74], where they aredefined for finite structures (i.e., instances) as well as for infinite structures.

Resolution was originally proposed in the context of automatic theorem proving. Itsfoundations are due to Robinson [Rob65]. SLD resolution was developed in [vEK76].These form the basis of logic programming introduced by Kowalski [Kow74] and[CKRP73] and led to the language Prolog. Nice presentations of the topic can be foundin [Apt91, Llo87]. Standard SLD resolution is more general than that presented in thischapter because of the presence of function symbols. The development is similar exceptfor the notion of unification, which is more involved. A survey of unification can be foundin [Sie88, Kni89].

The programming language Prolog proposed by Colmerauer [CKRP73] is based onSLD resolution. It uses a particular strategy for searching for SLD refutations. Vari-ous ways to couple Prolog with a relational database system have been considered (see[CGT90]).

The undecidability of containment is studied in [CGKV88, Shm87]. The decidabilityof uniform containment is shown in [CK86, Sag88]. The decidability of containment formonadic programs is studied in [CGKV88]. The equivalence of recursive and nonrecursivedatalog programs is shown to be decidable in [CV92]. The complexity of this problem isconsidered in [CV94].

Interestingly, bounded recursion is defined and used early in the context of universalrelations [MUV84]. Example 12.5.6 is from [Nau86]. Undecidability results for bound-edness of various datalog classes are shown in [GMSV87, GMSV93, Var88, Abi89]. De-cidability results for particular subclasses are demonstrated in [Ioa85, Nau86, CGKV88,NS87, Var88].

Boundedness implies that the query expressed by the program is a positive existentialquery and therefore is expressible in CALC (over finite inputs). What about the converse?If infinite inputs are allowed, then (by a compactness argument) unboundedness impliesnonexpressibility by CALC. But in the finite (database) case, compactness does not hold,and the question remained open for some time. Kolaitis observed that unboundedness doesnot imply nonexpressibility by CALC over finite structures for datalog with inequalities(x �= y). (We did not consider comparators �=, <,≤, etc. in this chapter.) The questionwas settled by Ajtai and Gurevich [AG89], who showed by an elegant argument that nounbounded datalog program is expressible in CALC, even on finite structures.

Another decision problem for datalog concerns arises from the interaction of datalogwith functional dependencies. In particular, it is undecidable, given a datalog program P ,

306 Datalog

set � of fd’s on edb(P ), and set of fd’s on idb(P ) whether P(I) |= whenever I |=�

[AH88].The expressive power of datalog has been investigated in [AC89, ACY91, CH85,

Shm87, LM89, KV90c]. Clearly, datalog expresses only monotonic queries, commuteswith homomorphisms of the database (if there are no constants in the program), and can beevaluated in polynomial time (see also Exercise 12.11). It is natural to wonder if datalogexpresses precisely those queries. The answer is negative. Indeed, [ACY91] shows that theexistence of a path whose length is a perfect square between two nodes is not expressiblein datalog�= (datalog augmented with inequalities x �= y), and so not in datalog. Thisis a monotonic, polynomial-time query commuting with homomorphisms. The parallelcomplexity of datalog is surveyed in [Kan88].

The function symbols used in logic programming are interpreted over a Herbrand do-main and are prohibited in datalog. However, it is interesting to incorporate arithmetic func-tions such as addition and multiplication into datalog. Such functions can also be viewedas infinite base relations. If these are present, it is possible that the bottom-up evaluationof a datalog program will not terminate. This issue was first studied in [RBS87], wherefiniteness dependencies were introduced. These dependencies can be used to describe howthe finiteness of the range of a set of variables can imply the finiteness of the range ofanother variable. [For example, the relation +(x, y, z) satisfies the finiteness dependen-cies {x, y}❀ {z}, {x, z}❀ {y}, and {y, z}❀ {x}.] Safety of datalog programs with infiniterelations constrained by finiteness dependencies is undecidable [SV89]. Various syntac-tic conditions on datalog programs that ensure safety are developed in [RBS87, KRS88a,KRS88b, SV89]. Finiteness dependencies were used to develop a safety condition for therelational calculus with infinite base relations in [EHJ93]. Safety was also considered inthe context of data functions (i.e., functions whose extent is predefined).

Exercises

Exercise 12.1 Refer to the Parisian Metro database. Give a datalog program that yields, foreach pair of stations (a, b), the stations c such that c is reachable (1) from both a and b; and (2)from a or b.

Exercise 12.2 Consider a database consisting of the Metro and Cinema databases, plus arelation Theater-Station giving for each theater the closest metro station. Suppose that you livenear the Odeon metro station. Write a program that answers the query “Near which metro stationcan I see a Bergman movie?” (Having spent many years in Los Angeles, you do not like walking,so your only option is to take the metro at Odeon and get off at the station closest to the theater.)

Exercise 12.3 (Same generation) Consider a binary relation Child_of , where the intendedmeaning of Child_of (a, b) is that a is the child of b. Write a datalog program computing theset of pairs (c, d), where c and d have a common ancestor and are of the same generation withrespect to this ancestor.

Exercise 12.4 We are given two directed graphs Gblack and Gwhite over the same set V ofvertexes, represented as binary relations. Write a datalog program P that computes the set ofpairs (a, b) of vertexes such that there exists a path from a to b where black and white edgesalternate, starting with a white edge.

Exercises 307

Exercise 12.5 Suppose we are given an undirected graph with colored vertexes representedby a binary relation Color giving the colors of vertexes and a binary relation Edge giving theconnection between them. (Although Edge provides directed edges, we ignore the direction, sowe treat the graph as undirected.) Say that a vertex is good if it is connected to a blue vertex(blue is a constant) or if it is connected to an excellent vertex. An excellent vertex is a vertexthat is connected to an outstanding vertex and to a red vertex. An outstanding vertex is a vertexthat is connected to a good vertex, an excellent one, and a yellow one. Write a datalog programthat computes the excellent vertexes.

Exercise 12.6 Consider a directed graph G represented as a binary relation. Show a datalogprogram that computes a binary relation T containing the pairs (a, b) for which there is a pathof odd length from a to b in G.

Exercise 12.7 Given a directed graph G represented as a binary relation, write a datalogprogram that computes the vertexes x such that (1) there exists a cycle of even length passingthrough x; (2) there is a cycle of odd length through x; (3) there are even- and odd-length cyclesthrough x.

Exercise 12.8 Consider the following program P :

R(x, y)←Q(y, x), S(x, y)

S(x, y)←Q(x, y), T (x, z)

T (x, y)←Q(x, z), S(z, y)

Let I be a relation over edb(P ). Describe the output of the program. Now suppose the first ruleis replaced by R(x, y)←Q(y, x). Describe the output of the new program.

Exercise 12.9 Prove Lemma 12.3.1.

Exercise 12.10 Prove that datalog queries are monotone.

Exercise 12.11 Suppose P is some property of graphs definable by a datalog program. Showthat P is preserved under extensions and homomorphisms. That is, if G is a graph satisfying P ,then (1) every supergraph of G satisfies P and (2) if h is a graph homomorphism, then h(G)satisfies P .

Exercise 12.12 Show that the following graph properties are not definable by datalogprograms:

(i) The number of nodes is even.

(ii) There is a nontrivial cycle (a trivial cycle is an edge 〈a, a〉 for some vertex a).

(iii) There is a simple path of even length between two specified nodes.

Show that nontrivial cycles can be detected if inequalities of the form x �= y are allowed in rulebodies.

♠Exercise 12.13 [ACY91] Consider the query perfect square on graphs: Is there a path (notnecessarily simple) between nodes a and b whose length is a perfect square?

(i) Prove that perfect square is preserved under extension and homomorphism.

(ii) Show that perfect square is not expressible in datalog.

Hint: For (ii), consider “words” consisting of simple paths from a to b, and prove a pumpinglemma for words “accepted” by datalog programs.

308 Datalog

Exercise 12.14 Present an algorithm that, given the set of proof trees of depth i with a programP and instance I, constructs all proof trees of depth i + 1. Make sure that your algorithmterminates.

Exercise 12.15 Let P be a datalog program, I an instance of edb(P ), and R in idb(P ). Let ube a vector of distinct variables of the arity of R. Demonstrate that

P(I)(R)= {θR(u) | there is a refutation of ← R(u) using PI and

substitutions θ1, . . . θn such that θ = θ1 ◦ · · · ◦ θn}.Exercise 12.16 (Substitution lemma) Let PI be a program, g a goal, and θ a substitution. Provethat if there exists an SLD refutation of θg with PI and ν, there also exists an SLD refutation ofg with PI and θ ◦ ν.

Exercise 12.17 Reprove Theorem 12.3.4 using Tarski’s and Kleene’s theorems stated in Re-mark 12.3.5.

Exercise 12.18 Prove the “if part” of Theorem 12.4.5.

Exercise 12.19 Prove Lemma 12.4.8.

�Exercise 12.20 (Unification with function symbols) In general logic programming, one canuse function symbols in addition to relations. A term is then either a constant in dom, a variablein var, or an expression f (t1, . . . , tn), where f is an n-ary function symbol and each ti is a term.For example, f (g(x, 5), y, f (y, x, x)) is a term. In this context, a substitution θ is a mappingfrom a subset of var into the set of terms. Given a substitution θ , it is extended in the naturalmanner to include all terms constructed over the domain of θ . Extend the definitions of unifierand mgu to terms and to atoms permitting terms. Give an algorithm to obtain the mgu of twoatoms.

Exercise 12.21 Prove that Lemma 12.5.1 does not generalize to datalog programs withconstants.

Exercise 12.22 This exercise develops three alternative proofs of the generalization of Theo-rem 12.5.2 to datalog programs with constants. Prove the generalization by

(a) using the technique outlined just after the statement of the theorem

(b) making a direct proof using as input an instance IC∪{a}, where C is the set of allconstants occurring in the program and a is new, and where each relation in I containsall tuples constructed using C ∪ {a}

(c) reducing to the emptiness problem for context-free languages.

♠Exercise 12.23 (datalog �=) The language datalog �= is obtained by extending datalog with anew predicate �= with the obvious meaning.

(a) Formally define the new language.

(b) Extend the least-fixpoint and minimal-model semantics to datalog �=.

� (c) Show that satisfiability remains decidable for datalog �= and that it can be tested inexponential time with respect to the size of the program.

�Exercise 12.24 Which of the properties in Exercise 12.12 are expressible in datalog �=?

Exercise 12.25 Prove Proposition 12.5.4.

Exercises 309

Exercise 12.26 Prove that containment of chain datalog programs is undecidable. Hint: Mod-ify the proof of Theorem 12.5.5 by using, for each b ∈ �, a relation Rb such that Rb(x, y) iffR(x, b, y).

Exercise 12.27 Prove that containment does not imply uniform containment by exhibiting twoprograms P,Q over the same edb’s and with S as common idb such that P ⊆S Q but P �⊆Q.

♠Exercise 12.28 (Uniform containment [CK86, Sag88]) Prove that uniform containment of twodatalog programs is decidable.

Exercise 12.29 Prove that each nr-datalog program is bounded.

♠Exercise 12.30 [GMSV87, Var88] Prove Theorem 12.5.7. Hint: Reduce the halting problemof Turing machines on an empty tape to boundedness of datalog programs. More precisely, havethe edb encode legal computations of a Turing machine on an empty tape, and have the programverify the correctness of the encoding. Then show that the program is unbounded iff there areunbounded computations of the machine on the empty tape.

Exercise 12.31 (Boundedness of chain programs) Prove decidability of boundedness for chainprograms. Hint: Reduce testing for boundedness to testing for finiteness of a context-freelanguage.

♠Exercise 12.32 This exercise demonstrates that datalog is likely to be stronger than positivefirst order extended by generalized transitive closure.

(a) [Coo74] Recall that a single rule program (sirup) is a datalog program with onenontrivial rule. Show that the sirup

R(x)← R(y), R(z), S(x, y, z)

is complete in ptime. (This has been called variously the graph accessibility problemand the blue-blooded water buffalo problem; a water buffalo is blue blooded only ifboth of its parents are.)

(b) [KP86] Show that the in some sense simpler sirup

R(x)← R(y), R(z), T (y, x), T (x, z)

is complete in ptime.

310 Datalog

(c) [Imm87b] The generalized transitive closure operator is defined on relations witharity 2n so that TC(R) is the output of the datalog program

ans(x1, . . . , x2n)← R(x1, . . . , x2n)

ans(x1, . . . , xn, z1, . . . , zn)← R(x1, . . . , xn, y1, . . . , yn),

ans(y1, . . . , yn, z1, . . . , zn)

Show that the positive first order extended with generalized transitive closure is inlogspace.

Documents

Datalog and Recursion - Inriawebdam.inria.fr/Alice/pdfs/Chapter-12.pdf · Datalog and Recursion I ... Datalog owes a great debt to Prolog and the logic-programming area in general