Graph Algorithms for Functional Dependency Manipulation


GIORGIO AUSIELLO

University of Rome, Rome, Italy

ALESSANDRO D'ATRI

University of L'Aquila, L'Aquila, Italy

AND

DOMENICO SACCÀ

CRAI, Rende, Italy

Abstract. A graph-theoretic approach for the representation of functional dependencies in relational databases is introduced and applied in the construction of algorithms for manipulating dependencies. This approach allows a homogeneous treatment of several problems (closure, minimization, etc.), which leads to simpler proofs and, in some cases, more efficient algorithms than in the current literature.

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems--computations on discrete structures; G.2.2 [Discrete Mathematics]: Graph Theory--graph algorithms; H.2.1 [Database Management]: Logical Design--normal forms; schema and subschema

General Terms: Algorithms, Design, Management, Theory

Additional Key Words and Phrases: Closure, computational complexity, functional dependency, FD-graph, minimal coverings, relational database

1. Introduction

The manipulation of data dependencies has a decisive impact on the solution of various problems in the logical design of databases (synthesis of relational schemes, view modeling and integration, etc.). Algorithms and data structures for the representation and manipulation of data dependencies, and in particular of functional dependencies (FDs), were defined during the early stages of relational theory [8, 10, 12].

More recently, the computational aspects of the manipulation of functional and multivalued dependencies were considered in [6, 14], and the minimization of FD representation was extensively discussed in [16].

This work was supported in part by Consiglio Nazionale delle Ricerche and by Cassa del Mezzogiorno under Grant PS 35/12 IND. Authors' addresses: G. Ausiello, Istituto di Automatica, University of Rome, Via Eudossiana 18, I-00184, Rome, Italy; A. D'Atri, Istituto di Elettrotecnica, University of L'Aquila, Monteluco, Roio, I-67100, L'Aquila, Italy; D. Saccà, CRAI, Via Bernini 5, I-87030, Rende, Italy. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1983 ACM 0004-5411/83/1000-0752 $00.75

Journal of the Association for Computing Machinery, Vol 30, No 4, October 1983, pp 752-766


This paper presents a new graph-theoretic approach which leads to simpler proofs and more efficient algorithms for the manipulation and representation of FDs than in the current literature.

This approach (introduced in Section 2) is based on the representation of the set of functional dependencies by FD-graphs (a generalization of graphs). Such a representation, which allows an explicit treatment of all sets of attributes needed by the users, provides a unified framework for the treatment of various properties and for the manipulation of FDs. In Section 3 the closure of an FD-graph is considered, and in Section 4 various notions of minimal coverings are defined and the corresponding algorithms are provided. Finally, in Section 5 it is shown that the synthesis problem is actually the same as the problem of finding a minimal representation of a set of functional dependencies.

The major advantages of the approach presented in this paper are that (1) by embedding the problems of FD manipulation into a graph-theoretic setting, a unified treatment of the above-mentioned problems is allowed and the algorithms are seen to coincide essentially with known graph algorithms, and (2) in the case that the FD-graphs to which they are applied are indeed simply graphs, the algorithms perform more efficiently than the algorithms which can be found in the literature.

The reader is required to know the basic notation of the relational model and FDs [7, 17].

2. Graphical Representation of Functional Dependencies

2.1. GRAPH AND HYPERGRAPH REPRESENTATIONS. Several authors have used graph (or hypergraph) formalisms to model the set of data dependencies contained in a relational database scheme.

In particular, hypergraphs have been used in [5] and [11] in the following way: attributes of the schema correspond to labeled nodes and functional dependencies among sets of attributes correspond to directed surfaces. Hypergraphs have also been used in [13] to model join dependencies among attributes under the universal relation assumption.

Multigraphs with labeled arcs (combined AZ-graphs) have been used in [18] to model functional and multivalued dependencies. In this approach the nodes correspond to the attributes, and three kinds of labeled arcs are used in order to identify the dependencies among attributes.

Labeled trees and directed acyclic graphs have also been used in [6] and [16], respectively, to model how a functional dependency can be derived from a set of dependencies by means of Armstrong's inference rules [3].

In this paper we propose a new graphical representation for FDs (derived from the one presented in [4]) which corresponds to an extension of the usual concept of graph, in such a way that the properties of functional dependencies can be formally characterized in terms of graph properties and the computational results obtained coincide, in the particular case of graphs, with the classical results of graph algorithms.

2.2. FD-GRAPHS. Let $U = \{A, B, C, \ldots\}$ be a finite set of attributes; we will denote subsets of $U$ by $\ldots, X, Y, Z$ and will use concatenation for forming such subsets (thus $AB$ stands for $\{A, B\}$). Concatenation will also stand for union (thus $XY$ stands for $X \cup Y$).

Definition 1. A set $\Sigma$ of FDs on $U$ is in reduced form if

(a) there exist no two FDs $X \to Y$ and $X' \to Y'$ such that $X = X'$, and
(b) for all FDs $X \to Y$, $X \cap Y = \emptyset$.

FIG. 1. Example of an FD-graph.

From now on, we will only consider sets of FDs in reduced form, since this form is the most reasonable and concise representation. Whenever we consider the size of the input for the algorithms in this paper, we will refer to the length of strings of symbols needed to represent such a set of FDs. We will denote this size by $|\Sigma| = |\Sigma_l| + |\Sigma_r|$, where $|\Sigma_l|$ ($|\Sigma_r|$) is the sum of the length of the strings of attributes appearing on the left (right) side of the dependencies. In addition $p = \|\Sigma\|$ will be used to denote the number of FDs in $\Sigma$; $U_l$ ($U_r$) is the subset of attributes appearing in the left (right) side of some FD of $\Sigma$ ($U = U_l \cup U_r$).
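The following Python fragment is a minimal sketch of how an arbitrary list of FDs could be brought into the reduced form of Definition 1. It is not part of the original paper: the function name `to_reduced_form` and the encoding of attributes as single characters are assumptions made only for illustration.

```python
from collections import defaultdict

def to_reduced_form(fds):
    """Hypothetical helper: merge FDs that share a left side and drop
    right-side attributes that already occur on the left.
    `fds` is an iterable of (lhs, rhs) pairs of attribute names."""
    merged = defaultdict(set)
    for lhs, rhs in fds:
        merged[frozenset(lhs)] |= set(rhs)   # union rule: X->Y, X->Z give X->YZ
    reduced = {}
    for lhs, rhs in merged.items():
        rhs = rhs - lhs                      # condition (b): X and Y are disjoint
        if rhs:                              # skip FDs that became trivial
            reduced[lhs] = frozenset(rhs)
    return reduced                           # condition (a): one FD per left side

example = [("A", "BC"), ("A", "CF"), ("BD", "DE")]
print(to_reduced_form(example))
```

Merging relies on the union rule and the removal of left-side attributes on reflexivity, so the reduced set has the same closure as the input.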

Definition 2. Given a set of FDs $\Sigma$ on $U$, the FD-graph $G_\Sigma = (V, E)$ associated with $\Sigma$ is the graph with node labeling function $w: V \to \mathcal{P}(U)$ and arc labeling function $w': E \to \{0, 1\}$ such that

(i) for every attribute $A \in U$, there is a node in $V$ labeled $A$ (called a simple node);
(ii) for every dependency $X \to Y$ in $\Sigma$ where $\|X\| > 1$, there is a node in $V$ labeled $X$ (called a compound node);
(iii) for every dependency $X \to Y$ in $\Sigma$ where $Y = A_1 \cdots A_k$, there are arcs labeled 0 (full arcs) from the node labeled $X$ to the nodes labeled $A_1, \ldots, A_k$;
(iv) for every compound node $i$ in $V$ labeled $A_1 \cdots A_k$, there are arcs labeled 1 (dotted arcs) from the node $i$ to all simple nodes (component nodes of $i$) labeled $A_1, \ldots, A_k$.

The set of full arcs (dotted arcs) is denoted $E_0$ ($E_1$).

Note that the number of nodes is $\|V\| \le \|U\| + p$, the number of full arcs is $\|E_0\| = |\Sigma_r|$ and the number of dotted arcs is $\|E_1\| \le |\Sigma_l|$. The total number of arcs is $\|E\| = \|E_0\| + \|E_1\| \le |\Sigma|$.

Example 1. Given the set of FDs $\Sigma = \{A \to FBC,\ C \to D,\ FBD \to H,\ BD \to E\}$ with $|\Sigma| = 13$, $|\Sigma_l| = 7$, $|\Sigma_r| = 6$, $p = 4$, the corresponding FD-graph $G_\Sigma = (V, E)$ is given in Figure 1, where $\|V\| = 9$, $\|E_0\| = 6$, $\|E_1\| = 5$ and $\|E\| = 11$. []
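The counts of Example 1 can be reproduced with a short Python sketch of Definition 2. The representation chosen here (every node is a `frozenset` of attributes, so simple nodes are singletons) and the name `build_fd_graph` are illustrative assumptions, not notation from the paper.

```python
def build_fd_graph(fds):
    """fds: dict mapping frozenset(lhs) -> frozenset(rhs), in reduced form.
    Returns (nodes, full_arcs, dotted_arcs) following Definition 2."""
    nodes, full_arcs, dotted_arcs = set(), set(), set()
    for lhs, rhs in fds.items():
        nodes.add(lhs)                                        # simple or compound
        nodes.update(frozenset([a]) for a in lhs | rhs)       # rule (i)
        full_arcs.update((lhs, frozenset([a])) for a in rhs)  # rule (iii)
        if len(lhs) > 1:                                      # rules (ii), (iv)
            dotted_arcs.update((lhs, frozenset([a])) for a in lhs)
    return nodes, full_arcs, dotted_arcs

sigma = {frozenset("A"): frozenset("FBC"), frozenset("C"): frozenset("D"),
         frozenset("FBD"): frozenset("H"), frozenset("BD"): frozenset("E")}
size_left = sum(len(x) for x in sigma)             # |Sigma_l|
size_right = sum(len(y) for y in sigma.values())   # |Sigma_r|
nodes, full_arcs, dotted_arcs = build_fd_graph(sigma)
print(size_left, size_right, len(sigma))                 # 7 6 4
print(len(nodes), len(full_arcs), len(dotted_arcs))      # 9 6 5
```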

Notice that when there are no compound nodes the FD-graph is simply a graph, and in this case $\|E_0\| = |\Sigma_r|$ and $\|E_1\| = 0$.

3. FD-Graph Closure and Covering

Given a set of FDs $\Sigma$, we denote by $\Sigma^+$ the closure of $\Sigma$ with respect to Armstrong's inference rules [3]:

Reflexivity: If $Y \subseteq X$, then $X \to Y$.
Transitivity: If $X \to Y$ and $Y \to Z$, then $X \to Z$.
Union: If $X \to Y$ and $X \to Z$, then $X \to YZ$.

Notice that such rules have been proved to be a complete and independent set of inference rules for FDs and, hence, other inference rules (augmentation, decomposition, pseudotransitivity) are derivable from them. Given an FD-graph $G_\Sigma = (V, E)$, our first concern is to derive all FDs which can be established among the sets of attributes associated with the nodes of $G_\Sigma$ using the inference rules.
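As an illustration of the derivability claim, the augmentation rule (from $X \to Y$ infer $XZ \to YZ$) can be obtained from reflexivity, transitivity, and union. The short derivation below is a standard argument supplied here for the reader; it does not appear in the original text.

```latex
\begin{align*}
&\text{Given } X \to Y \text{ and any } Z \subseteq U: \\
&XZ \to X  && \text{(reflexivity, since } X \subseteq XZ\text{)} \\
&XZ \to Y  && \text{(transitivity with } X \to Y\text{)} \\
&XZ \to Z  && \text{(reflexivity, since } Z \subseteq XZ\text{)} \\
&XZ \to YZ && \text{(union of the two previous lines)}
\end{align*}
```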


FIG. 2. Full and dotted FD-paths. (a) Full FD-path $\langle A, E \rangle$. (b) Full FD-path $\langle A, D \rangle$. (c) Dotted FD-path $\langle FBD, E \rangle$.

FIG. 3. Closure of the FD-graph of Figure 1.

Definition 3. Given an FD-graph $G_\Sigma = (V, E)$ and two nodes $i, j \in V$, a (directed) FD-path $\langle i, j \rangle$ from $i$ to $j$ is a minimal subgraph $\bar{G}_\Sigma = (\bar{V}, \bar{E})$ of $G_\Sigma$ such that $i, j \in \bar{V}$ and either $(i, j) \in \bar{E}$ or one of the following possibilities holds:

(a) $j$ is a simple node and there exists a node $k$ such that $(k, j) \in \bar{E}$ and there is an FD-path $\langle i, k \rangle$ included in $\bar{G}_\Sigma$ (graph transitivity);

(b) $j$ is a compound node with component nodes $m_1, \ldots, m_r$ and there are dotted arcs $(j, m_1), \ldots, (j, m_r)$ in $\bar{G}_\Sigma$ and $r$ FD-paths $\langle i, m_1 \rangle, \ldots, \langle i, m_r \rangle$ included in $\bar{G}_\Sigma$ (graph union).

Furthermore an FD-path $\langle i, j \rangle$ is dotted if all its arcs leaving $i$ are dotted; otherwise it is full.

Notice that since we require that an FD-path $\langle i, j \rangle$ is minimal, no proper subgraph of it may also be an FD-path from $i$ to $j$.

Notice that a path is also an FD-path, but the converse may not hold. For the FD-graph of Example 1, FD-paths $\langle A, E \rangle$, $\langle A, D \rangle$, and $\langle FBD, E \rangle$ are given in Figure 2.

Finally we may observe that by definition of FD-path, a compound node without outgoing full arcs can only be either a source or a target node of FD-paths to which it belongs.

Definition 4. The closure of an FD-graph $G_\Sigma = (V, E)$ is the graph $G_\Sigma^+ = (V, E^+)$, labeled on the nodes and on the arcs, where the set $V$ is the same as in $G_\Sigma$, while the set $E^+ = (E^+)_0 \cup (E^+)_1$ is defined in the following way:

$(E^+)_1 = \{(i, j) \mid i, j \in V$ and there exists a dotted FD-path $\langle i, j \rangle\}$;
$(E^+)_0 = \{(i, j) \mid i, j \in V$, $(i, j) \notin (E^+)_1$ and there exists a full FD-path $\langle i, j \rangle\}$.

Example 2. The closure of the FD-graph of Example 1 is given in Figure 3. []

THEOREM 1. Let $G_\Sigma = (V, E)$ be the FD-graph associated with the set $\Sigma$ of FDs, and let $G_\Sigma^+ = (V, E^+)$ be its closure. An arc $(i, j)$ is in $E^+$ if and only if $w(i) \to w(j)$ is in $\Sigma^+$.


PROOF

If. It is sufficient to show that if $w(i) \to w(j)$ is in $\Sigma^+$, then there exists an FD-path $\langle i, j \rangle$ in $G_\Sigma$, since if this condition is satisfied, then by definition an arc $(i, j)$ in the closure will exist.

We carry on the proof by induction on the number of applications of Armstrong's inference rules by considering a minimal subset $\bar{\Sigma}$ of $\Sigma$ such that $w(i) \to w(j)$ is in $\bar{\Sigma}^+$. Given two nodes $i, j \in V$, if the dependency $w(i) \to w(j)$ is in $\Sigma^+$, then one of the following conditions must be true.

(i) If $w(i) \to X\,w(j)\,Y$ is in $\bar{\Sigma}$, then the arc $(i, j)$ is in $E^+$ by definition (basis of the induction).

(ii) If $j$ is a simple node, then there must be a node $k$ such that $w(i) \to w(k)$ is in $\bar{\Sigma}^+$ and $w(k) \to X\,w(j)\,Y$ is in $\bar{\Sigma}$. We have that in $G_\Sigma$ there must be an arc from $k$ to $j$ (by construction) and an FD-path from $i$ to $k$ (by induction). Hence an FD-path from $i$ to $j$ will also exist.

(iii) Otherwise $j$ is a compound node with components $m_1, \ldots, m_r$, and hence the dependencies $w(i) \to w(m_1), \ldots, w(i) \to w(m_r)$ must hold in $\bar{\Sigma}^+$. By induction the FD-paths $\langle i, m_1 \rangle, \ldots, \langle i, m_r \rangle$ must exist in $G_\Sigma$ and, as a consequence, also the FD-path $\langle i, j \rangle$ will be in $G_\Sigma$.

Only if. Since the arcs of the closure are obtained by graph transitivity and graph union, and since these graph rules correspond to transitivity, union, and reflexivity inference rules for FDs, this part of the theorem is trivially proved. Q.E.D.

The relevance of Theorem 1 is that it provides a precise characterization of what we could call the "meaningful part" of the closure of a set of functional dependencies. In fact, according to the theorem we consider only the dependencies involving sets of attributes appearing in the original set of dependencies, and in this way we avoid the exponential growth of the closure which arises from the deduction of other less relevant dependencies obtained by the reflexivity rule.
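Theorem 1 can be checked on small inputs against an independent test for membership in the closure. The sketch below uses the textbook attribute-closure computation, which is a standard technique and not the FD-graph algorithm of this paper; the function names `attribute_closure` and `implies` are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Set of attributes functionally determined by `attrs` under `fds`,
    where `fds` is a list of (lhs, rhs) pairs of attribute names."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def implies(fds, lhs, rhs):
    """True when lhs -> rhs belongs to the closure of `fds`."""
    return set(rhs) <= attribute_closure(lhs, fds)

sigma = [("A", "FBC"), ("C", "D"), ("FBD", "H"), ("BD", "E")]  # Example 1
print(implies(sigma, "A", "E"))     # True
print(implies(sigma, "FBD", "E"))   # True
print(implies(sigma, "C", "E"))     # False
```

By Theorem 1, the pairs of node labels for which `implies` returns `True` should coincide with the arcs of the closed FD-graph, which is restricted to the attribute sets that actually label nodes.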

In addition, having established the correspondence between a set of FDs and an FD-graph, we can provide a unified treatment of all properties of sets of functional dependencies and a unified data structure for their representation.

First of all, let us consider the problem of constructing the closure of an FD-graph. In order to obtain the closure $G_\Sigma^+ = (V, E^+)$ of an FD-graph $G_\Sigma = (V, E)$, we may use the following algorithm.

ALGORITHM CLOSURE;
{DATA STRUCTURES}
$V^0$ ($V^1$): is the input set of simple (compound) nodes.
$D_i$, $\forall i \in V$: is the input set $\{j \mid j \in V^1$ and $(j, i) \in E_1\}$ ($D_i$ is empty if $i \in V^1$).
$L_i^0$ ($L_i^1$), $\forall i \in V$: is the input set $\{j \mid j \in V$ and $(i, j) \in E_0$ ($E_1$)$\}$.
$L_i^{0+}$ ($L_i^{1+}$), $\forall i \in V$: is the output set $\{j \mid j \in V$ and $(i, j) \in E_0^+$ ($E_1^+$)$\}$.
$S_i$, $\forall i \in V$: is a working subset of either $L_i^{0+}$ or $L_i^{1+}$.
$q_m$, $\forall m \in V^1$: is a counter of component nodes of the node $m$ currently belonging to $L_i^{0+} \cup L_i^{1+}$, where $i$ is a given node.

    procedure NODECLOSURE (S_i^+: is either L_i^{0+} or L_i^{1+});
    {determine either dotted or full arcs leaving the node i in the closure}
    begin
      while S_i ≠ ∅ do
      begin
        select j from S_i;
        if j is in V^0 then
          forall m in D_j − {i} do
          UNION: begin
            q_m := q_m + 1;
            if q_m = ||L_m^1|| then S_i := S_i ∪ {m};
          end UNION;
        S_i^+ := S_i^+ ∪ {j};
        forall k in L_j^0 ∪ L_j^1 do
          TRANSITIVITY: if NOT (k in S_i^+ ∪ L_i^{1+} ∪ {i}) then S_i := S_i ∪ {k};
      end;
    end NODECLOSURE;

    begin {CLOSURE}
      forall i in V with outdegree > 0 (i.e., L_i^0 ∪ L_i^1 ≠ ∅) do
      begin
        forall m in V^1 do
          if m in D_i then q_m := 1 else q_m := 0;
        L_i^{1+} := ∅; L_i^{0+} := ∅;
        if i in V^1 then {determine dotted arcs of the closure}
        begin
          S_i := L_i^1; NODECLOSURE(L_i^{1+});
        end;
        S_i := L_i^0 − L_i^{1+}; NODECLOSURE(L_i^{0+}); {determine full arcs of the closure}
      end
    end CLOSURE.
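For readers who prefer executable code, the following Python sketch mirrors the structure of Algorithm CLOSURE (a worklist per node, UNION counters, then TRANSITIVITY). It is an illustrative transcription under stated assumptions, not the authors' implementation: nodes are encoded as `frozenset`s of attributes, the function name `fd_graph_closure` is hypothetical, and "select" is taken to remove the chosen node from the working set.

```python
def fd_graph_closure(fds):
    """fds: dict frozenset(lhs) -> frozenset(rhs), in reduced form.
    Returns (full_plus, dotted_plus): for every node with outgoing arcs,
    the targets of its full and of its dotted arcs in the closure."""
    single = lambda a: frozenset([a])
    full = {x: {single(a) for a in y} for x, y in fds.items()}       # L^0
    dotted = {x: {single(a) for a in x} for x in fds if len(x) > 1}  # L^1
    comp_of = {}                                                     # D_j
    for m, comps in dotted.items():
        for c in comps:
            comp_of.setdefault(c, set()).add(m)

    def successors(j):
        return full.get(j, set()) | dotted.get(j, set())

    full_plus, dotted_plus = {}, {}
    for i in fds:                              # exactly the nodes with outdegree > 0
        q = {m: int(i in comps) for m, comps in dotted.items()}
        d_plus, f_plus = set(), set()

        def node_closure(start, result):
            work = set(start)
            while work:
                j = work.pop()
                if len(j) == 1:                # UNION: j is a simple node
                    for m in comp_of.get(j, set()) - {i}:
                        q[m] += 1
                        if q[m] == len(dotted[m]):
                            work.add(m)
                result.add(j)
                for k in successors(j):        # TRANSITIVITY
                    if k not in result and k not in d_plus and k != i:
                        work.add(k)

        if i in dotted:                        # dotted arcs first
            node_closure(dotted[i], d_plus)
        node_closure(full[i] - d_plus, f_plus) # then full arcs
        dotted_plus[i], full_plus[i] = d_plus, f_plus
    return full_plus, dotted_plus

sigma = {frozenset("A"): frozenset("FBC"), frozenset("C"): frozenset("D"),
         frozenset("FBD"): frozenset("H"), frozenset("BD"): frozenset("E")}
full_plus, dotted_plus = fd_graph_closure(sigma)
print(sorted("".join(sorted(n)) for n in dotted_plus[frozenset("FBD")]))
```

On Example 1 the sketch reports a dotted arc from $FBD$ to $E$ and a full arc from $A$ to $E$, in agreement with the FD-paths listed in the caption of Figure 2.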

PROPOSITION 1. Algorithm CLOSURE determines $G_\Sigma^+$ in time $O(p \cdot |\Sigma|)$.

PROOF. Since the algorithm uses the graph transitivity and union rules, for every node $i \in V$, all nodes $j$ of $V$ such that there is an FD-path $\langle i, j \rangle$ in $G_\Sigma$ are determined. Furthermore, since dotted FD-paths are determined first, the labels of the arcs in the closure are correctly computed. The procedure NODECLOSURE is performed twice for every node $i \in \tilde{V}$, where $\tilde{V}$ is the set of nodes with outdegree > 0. Since the UNION block runs in $O(\sum_{j \in V} \|L_j^1\|)$ and the TRANSITIVITY statement in $O(\sum_{j \in V} \|L_j^0 \cup L_j^1\|)$, the procedure NODECLOSURE requires time $O(\|E\|)$. Hence the algorithm runs in $O(\|\tilde{V}\| \cdot \|E\|)$, that is, $O(p \cdot |\Sigma|)$, since $\|\tilde{V}\| = p$ and $\|E\| \le |\Sigma|$. Q.E.D.

Notice that in the case of an FD-graph without compound nodes, the previous algorithm becomes a classical transitive closure algorithm.
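For comparison, a classical transitive closure computed by one search per node, which is the behavior the algorithm reduces to when there are no compound nodes, can be sketched as follows. The dictionary-of-successors representation and the name `transitive_closure` are assumptions for illustration only.

```python
def transitive_closure(succ):
    """succ: dict node -> set of successors (every node is a key).
    Returns dict node -> set of nodes reachable by a nonempty path.
    One graph search per node gives O(|V| * |E|) time overall."""
    reach = {}
    for i in succ:
        seen, stack = set(), list(succ[i])
        while stack:
            j = stack.pop()
            if j not in seen:
                seen.add(j)
                stack.extend(succ[j])
        reach[i] = seen
    return reach

graph = {"A": {"B"}, "B": {"C"}, "C": set()}
print(transitive_closure(graph))
```

Unlike Algorithm CLOSURE, this sketch reports a node as reachable from itself when it lies on a cycle.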

In the case of general directed graphs in [2] it has been shown that the cost of finding the transitive reduction is the same as the cost of finding the transitive closure; similarly, in the case of FD-graphs there is a connection between the problem of determining the closure and the problem of determining "minimal" coverings. Let us first define the concept of covering.

Given two sets of FDs $\Sigma$ and $\Sigma'$, we say that $\Sigma'$ is a covering of $\Sigma$ if $\Sigma'^+ = \Sigma^+$.

Definition 5. Given two FD-graphs $G_\Sigma$ and $G_{\Sigma'}$, $G_{\Sigma'}$ is a covering of $G_\Sigma$ if $\Sigma'$ is a covering of $\Sigma$.

The property of coverings in the case of graphs is that given two nodes i, j , the arc (i, j ) is in the closure of one if and only if it is in the closure of the other. In the case of FD-graphs this property holds, but it may be the case that given two nodes i, j that belong to both coverings, the arc (i, j ) in the closure of one covering may be a full arc while the arc (i, j ) in the closure of the other covering is dotted. The lemma below establishes under what conditions this situation may arise.

Definition 6. Two nodes $i, j$ in an FD-graph $G_\Sigma$ are said to be equivalent if the arcs $(i, j)$ and $(j, i)$ both belong to the closure of $G_\Sigma$. Furthermore a node $i$ of $G_\Sigma$ is said to be equivalent to a node $j$ of $G_{\Sigma'}$, where $G_{\Sigma'}$ is a covering of $G_\Sigma$, if $i, j$ are equivalent in some covering of $G_\Sigma$.

LEMMA 1. Let $G_\Sigma = (V, E)$ and $G_{\bar{\Sigma}} = (\bar{V}, \bar{E})$ be two coverings of the same FD-graph, and let $i, j$ be two nodes belonging to both $V$ and $\bar{V}$. Then

(i) $(i, j) \in E^+$ if and only if $(i, j) \in \bar{E}^+$.

(ii) If $G_{\bar{\Sigma}}$ is a subgraph of $G_\Sigma$ such that all arcs in $E - \bar{E}$ are dotted (i.e., $G_\Sigma$ may contain compound nodes not in $G_{\bar{\Sigma}}$ but no more full arcs) and $(i, j)$ is in $(E^+)_0$ [$(E^+)_1$], then $(i, j)$ is in $(\bar{E}^+)_0$ [$(\bar{E}^+)_1$].

(iii) If $(i, j) \in (E^+)_0$ and $(i, j) \in (\bar{E}^+)_1$, then every dotted FD-path $\langle i, j \rangle$ in $G_{\bar{\Sigma}}$ contains a node $k$ equivalent to $i$.

PROOF

(i) By the correspondence between FD-graph closure and FD closure proved in Theorem 1.

(ii) By hypothesis, $G_\Sigma$ differs from $G_{\bar{\Sigma}}$ only for some compound nodes and their outgoing dotted arcs. Since such nodes have no outgoing full arcs, they cannot be intermediate nodes in FD-paths of $G_\Sigma$. Hence every FD-path $\langle i, j \rangle$ in $G_\Sigma$ such that $i, j$ are in $G_{\bar{\Sigma}}$ is also in $G_{\bar{\Sigma}}$. This implies that if the arc $(i, j)$ is in the closure of $G_\Sigma$, it is also in the closure of $G_{\bar{\Sigma}}$ with the same label.

(iii) The fact that $(i, j) \in (\bar{E}^+)_1$ implies that $i$ is a compound node. Hence in $E$ there must be both full and dotted arcs leaving $i$. Let us now consider any dotted FD-path $\langle i, j \rangle$ in $G_{\bar{\Sigma}}$, and let $k_1, \ldots, k_s$ be the intermediate nodes on $\langle i, j \rangle$ (if no intermediate node exists, the hypothesis of the lemma would not hold). We have to prove that at least one of these intermediate nodes is equivalent to $i$.

Without loss of generality we can assume that such nodes are also in $G_\Sigma$. In fact, we may always refer to the FD-graph $G_{\Sigma'} = (V', E')$ obtained from $G_\Sigma$ by adding the missing compound nodes and their outgoing dotted arcs. By part (ii) of this lemma the arc $(i, j)$ is in $(E'^+)_0$. By Theorem 1, if $(k_e, k_r)$ is in $\bar{E}$, there exists an FD-path $\langle k_e, k_r \rangle$ in $G_\Sigma$. Moreover, since no dotted FD-path $\langle i, j \rangle$ is in $G_\Sigma$, some FD-path $\langle k_e, k_r \rangle$ contains a full arc leaving $i$. By definition of FD-path, in $G_\Sigma$ there exist FD-paths $\langle i, k_e \rangle$ (because the FD-path $\langle i, j \rangle$ contains $k_e$) and $\langle k_e, i \rangle$ (because the FD-path $\langle k_e, k_r \rangle$ in $G_\Sigma$ contains the node $i$). Hence $i$ and $k_e$ are equivalent. Q.E.D.

In the next section we will compute coverings of an FD-graph which have various minimality properties.

4. Minimal Coverings of FD-Graphs

4.1. NONREDUNDANT COVERING. A set of FDs $\Sigma$ is nonredundant if no functional dependency in $\Sigma$ can be derived from the remaining ones.

Definition 7. An FD-graph $G_\Sigma$ is nonredundant if $\Sigma$ is nonredundant.

In order to determine a nonredundant covering of an FD-graph it is sufficient to eliminate redundant nodes.

Definition 8. Given an FD-graph $G_\Sigma = (V, E)$, a compound node $i \in V$ is redundant if, for each full arc $(i, j) \in E$, there exists a dotted FD-path $\langle i, j \rangle$.

PROPOSITION 2. An FD-graph is nonredundant if and only if it has no redundant nodes.

PROOF. Given an FD-graph $G_\Sigma = (V, E)$, let $(i, j_1), \ldots, (i, j_r)$ be the full arcs leaving the compound node $i$ of $V$. Such arcs correspond to the FD $w(i) \to w(j_1) \cdots w(j_r)$. This dependency will be redundant if and only if it can be derived from the other members of $\Sigma$, that is, every full arc $(i, j)$ can be derived from a subgraph of $G_\Sigma$ that does not contain it. Hence the FD $w(i) \to w(j_1) \cdots w(j_r)$ is redundant if and only if there exist dotted FD-paths $\langle i, j_1 \rangle, \ldots, \langle i, j_r \rangle$, that is, if and only if $i$ is redundant. Q.E.D.

FIG. 4. Nonredundant covering of an FD-graph. Panels (a) and (b).

Given an FD-graph $G_\Sigma = (V, E)$, a nonredundant covering of $G_\Sigma$ is obtained by eliminating all redundant nodes and all arcs leaving them. Notice that redundant nodes in an FD-graph can be determined by simply considering its closure; in fact, redundant nodes have no full arcs leaving them in the closure.
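The redundancy test can also be phrased directly on the FDs. The sketch below removes, one at a time, every FD that is derivable from the remaining ones, using the textbook attribute-closure test rather than the FD-graph closure described above; the names `attribute_closure` and `nonredundant_cover` and the three-FD input are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Attributes determined by `attrs` under the (lhs, rhs) pairs in `fds`."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def nonredundant_cover(fds):
    """Drop each FD that follows from the FDs still kept."""
    cover = list(fds)
    for fd in list(fds):
        rest = [g for g in cover if g != fd]
        lhs, rhs = fd
        if set(rhs) <= attribute_closure(lhs, rest):
            cover = rest          # fd is redundant, keep the smaller set
    return cover

sigma = [("A", "B"), ("B", "C"), ("AB", "C")]
print(nonredundant_cover(sigma))   # [('A', 'B'), ('B', 'C')]
```

In FD-graph terms, the compound node $AB$ of this input is redundant because its only full arc, to $C$, is matched by a dotted FD-path through the component node $B$.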

Example 3. A nonredundant covering of the FD-graph in Figure 4a is obtained by eliminating the redundant node $ABC$, as shown in Figure 4b. []

An interesting property of nonredundant coverings of an FD-graph is that all such coverings admit an equipotent partition into strong components.

Definition 9. Given an FD-graph $G_\Sigma$, a strong component is a maximal set of pairwise equivalent nodes. Furthermore, a strong component of $G_\Sigma$ is equivalent to a strong component of $G_{\Sigma'}$, where $G_{\Sigma'}$ is a covering of $G_\Sigma$, if all nodes of such strong components are pairwise equivalent.
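By Theorem 1, two distinct nodes are equivalent exactly when their labels determine each other, that is, when the labels have the same attribute closure. Under that observation, strong components can be sketched in Python by grouping node labels by closure. This is an illustrative shortcut built on the textbook attribute-closure test, not the paper's procedure, and the function names are hypothetical.

```python
from collections import defaultdict

def attribute_closure(attrs, fds):
    """Attributes determined by `attrs` under the (lhs, rhs) pairs in `fds`."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def strong_components(fds):
    """Group the node labels of the FD-graph of `fds` by attribute closure."""
    nodes = {frozenset(lhs) for lhs, _ in fds}              # left sides
    for lhs, rhs in fds:
        nodes |= {frozenset([a]) for a in set(lhs) | set(rhs)}  # simple nodes
    groups = defaultdict(set)
    for n in nodes:
        groups[frozenset(attribute_closure(n, fds))].add(n)
    return list(groups.values())

sigma = [("A", "B"), ("B", "A"), ("AB", "C")]
for component in strong_components(sigma):
    print(sorted("".join(sorted(n)) for n in component))
```

For this hypothetical input the nodes $A$, $B$, and $AB$ fall into one component and $C$ forms a component on its own.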

THEOREM 2. Let $G_{\Sigma'} = (V', E')$ and $G_{\Sigma''} = (V'', E'')$ be two nonredundant coverings of an FD-graph $G_\Sigma = (V, E)$. There exists a bijection $\phi$ between their strong components such that each strong component of $G_{\Sigma'}$ is associated with an equivalent one of $G_{\Sigma''}$.

PROOF. It is sufficient to prove that for each $i'$ in $V'$ there exists a node $i''$ in $V''$ equivalent to $i'$. If $i'$ belongs to both $V'$ and $V''$, then $i'' = i'$; otherwise $i'$ must be a compound node. Hence we can modify $G_{\Sigma''}$ into $\bar{G}_{\Sigma''}$ by adding $i'$ to $V''$ and adding all dotted arcs from $i'$ to its component nodes in $E''$. Since $i'$ is not redundant in $G_{\Sigma'}$, there exists at least one full arc $(i', j)$ in $E'$, where $j$ is a simple node in both $V'$ and $V''$, such that $(i', j)$ is in $(E'^+)_0$. In $\bar{G}_{\Sigma''}$ an FD-path from $i'$ to $j$ must also exist, and by construction this FD-path must be dotted. By Lemma 1(iii) there exists a node in $\bar{G}_{\Sigma''}$ (and then in $G_{\Sigma''}$) equivalent to $i'$, and this concludes our proof. Q.E.D.

Notice that the above theorem corresponds to [8, Lemma 3].

4.2. MINIMUM COVERING. In [16] various concepts of minimal coverings of a set of functional dependencies were given.

First of all, a minimum covering is defined as a nonredundant covering with the minimum number of FDs. In terms of FD-graphs we have

Definition 10. An FD-graph $G_\Sigma$ is minimum if $\Sigma$ is minimum.

LEMMA 2. An FD-graph $G_\Sigma = (V, E)$ is minimum if and only if there exists no covering of $G_\Sigma$ with a smaller number of nodes.


PROOF. Given an FD-graph $G_\Sigma = (V, E)$, the number of nodes is equal to $\|V_{0A}\| + \|\Sigma\|$, where $V_{0A}$ is the set of simple nodes without outgoing arcs. Since all coverings of $G_\Sigma$ have the same set $V_{0A}$, the number of nodes depends only on $\|\Sigma\|$; hence it is minimum if and only if $\|\Sigma\|$ is minimum. Q.E.D.

Definition 11. A node $i$ of an FD-graph $G_\Sigma = (V, E)$ is superfluous if there exists a dotted FD-path $\langle i, j \rangle$ where $j$ is a node of $V$ equivalent to $i$.

THEOREM 3. A nonredundant FD-graph $G_\Sigma = (V, E)$ is minimum if and only if it has no superfluous nodes.

PROOF

Only if. We want to show that if $G_\Sigma$ is minimum, it has no superfluous nodes. Let us prove the result by contradiction. We suppose that $G_\Sigma$ has a superfluous node $i$. This means that there exists a node $j$ equivalent to $i$ such that $\langle i, j \rangle$ is a dotted FD-path. Then we can modify $G_\Sigma$ by moving to the node $j$ all full arcs leaving $i$ and by eliminating the node $i$ and all its outgoing arcs. Since this new FD-graph is a covering of $G_\Sigma$ with fewer nodes, we get a contradiction with Lemma 2. Hence $G_\Sigma$ has no superfluous nodes.

If. We have to prove that a nonredundant FD-graph $G_\Sigma$ without superfluous nodes is indeed minimum. Let $G_{\Sigma'} = (V', E')$ be a minimum covering of $G_\Sigma$. In order to prove the theorem, by Lemma 2 it is sufficient to find a bijection $\psi$ between $\hat{V} = V - V'$ and $\hat{V}' = V' - V$. Given a node $i \in \hat{V}$, we associate to it the node $k$ of $\hat{V}'$ such that if we modify $G_{\Sigma'}$ into $\bar{G}_{\Sigma'} = (\bar{V}', \bar{E}')$ by introducing the node $i$ connected to its component nodes by dotted arcs, $k$ is equivalent to $i$ and there exists a dotted FD-path $\langle i, k \rangle$ in $\bar{G}_{\Sigma'}$ which does not contain other nodes of $\hat{V}'$ equivalent to $i$. Theorem 2 guarantees the existence of $k$ in $V'$; furthermore, $k$ is also in $\hat{V}'$, because otherwise, by Lemma 1(iii), there should exist a dotted FD-path $\langle i, k \rangle$ in $G_\Sigma$ and $i$ would be superfluous.

We only must prove that $\psi$ is injective; in fact, if so, since $\|\hat{V}\| \ge \|\hat{V}'\|$, then $\psi$ is also bijective. Let us suppose that $\psi$ is not injective, that is, there exists a node $i'$ in $\hat{V}$, distinct from $i$ and equivalent to $i$, such that $k = \psi(i')$. We show that in such a case $G_\Sigma$ has superfluous nodes (contradiction). In fact, we modify $G_\Sigma$ into $\bar{G}_\Sigma = (\bar{V}, \bar{E})$ by introducing the node $k$ connected to its component nodes by dotted arcs. Since $k$ is equivalent to $i$ and $i'$, there exist dotted FD-paths $\langle k, i \rangle$ and $\langle k, i' \rangle$ in $\bar{G}_\Sigma$. Furthermore, since the dotted FD-paths $\langle i, k \rangle$ and $\langle i', k \rangle$ in $\bar{G}_{\Sigma'}$ do not contain other nodes equivalent to $i$ or $i'$, by Lemma 1(iii) there exist also dotted FD-paths $\langle i, k \rangle$ and $\langle i', k \rangle$ in $\bar{G}_\Sigma$. Now, without loss of generality, we suppose that the dotted FD-path $\langle k, i' \rangle$ does not contain the node $i$ (otherwise we could refer to $\langle k, i \rangle$). Since there exist dotted FD-paths $\langle i, k \rangle$ and $\langle k, i' \rangle$ and since $\langle k, i' \rangle$ does not contain the node $i$, there exists also a dotted FD-path $\langle i, i' \rangle$ in $\bar{G}_\Sigma$. By Lemma 1(ii) this FD-path is also in $G_\Sigma$ and $i$ is superfluous. Therefore $\psi$ is injective and $G_\Sigma$ is minimum. Q.E.D.

Notice that all minimum coverings of an FD-graph have the same number of nodes for every equivalent strong component (as it is also pointed out in [16]); furthermore, if we consider the closure of two minimum coverings, Lemma 1 (iii) never holds.

Given a nonredundant FD-graph $G_\Sigma = (V, E)$, a minimum covering of $G_\Sigma$ may be obtained in the following way:

1. Examine the closure to check if there exists a superfluous node $i$, that is, whether there exists a dotted arc $(i, j)$ from $i$ to an equivalent node $j$.

2. If such a pair of nodes is found and if there exist nodes $k$ such that $(i, k)$ is full and $(j, k)$ is dotted, then update the closure by transforming the arcs $(j, k)$ to full arcs and by eliminating the node $i$. Store the pair of nodes $(i, j)$ in a list $L$ for subsequent execution of step 4.

3. Repeat steps 1 and 2 until no more superfluous nodes are found.

4. Update $G_\Sigma$, according to the content of list $L$, by eliminating superfluous nodes and by moving their full outgoing arcs to their final destination.

FIG. 5. Minimum covering of a nonredundant FD-graph. Panels (a) and (b).

Notice that steps 1-3 require time $O(\|V_c\| \cdot \|V\|)$, where $V_c$ is the set of compound nodes and $V$ is the set of all nodes in $G_\Sigma$. Besides, since every arc has to be moved at most once, step 4 requires time $O(\|E_0\|)$, where $E_0$ is the set of full arcs in $G_\Sigma$. Since $\|V_c\| \le p$, $\|V\| \le p + \|U\|$, and $\|E_0\| \le |\Sigma|$ (see Section 2.2), this means that the algorithm for determining the minimum covering of $G_\Sigma$ requires time

$O(\max(p^2 + p \cdot \|U\|,\ |\Sigma|))$.

This result is an improvement on the algorithm given in [16] for finding a minimum covering of FDs starting from a nonredundant covering, where time $O(p \cdot |\Sigma|)$ is required.

Example 4. A minimum covering of the nonredundant FD-graph of Figure 5a is obtained by eliminating the superfluous node CD and by replacing the arcs (CD, F) and (CD, G), respectively, with (AB, F) and (AB, G) (Figure 5b).

Notice that the node AB which might have been considered a superfluous node in the original FD-graph is no longer superfluous because its closure has been changed by the removal of CD. []

4.3. LR-MINIMUM COVERING. An LR-minimum covering of a set of functional dependencies $\Sigma$ is defined as a minimum covering of $\Sigma$ in which, for each functional dependency $X \to Y$, both $X$ and $Y$ are minimal (i.e., $X$ and $Y$ have no extraneous attributes); that is, the elimination of any attribute from $X$ or $Y$ would change the closure of $\Sigma$.

Correspondingly, we may give the following.

Definition 12. Given a set of FDs $\Sigma$, the FD-graph $G_\Sigma$ is LR-minimum if $\Sigma$ is LR-minimum.

In order to determine an LR-minimum covering of an FD-graph $G_\Sigma$, we have to eliminate all redundant arcs from a minimum covering of $G_\Sigma$.

Definition 13.

(a) A dotted arc $(i, j)$ is redundant if there exists a dotted FD-path $\langle i, j \rangle$ which does not contain the arc $(i, j)$.

(b) A full arc $(i, j)$ is redundant if there exists an FD-path $\langle i, j \rangle$ which does not contain the arc $(i, j)$.

THEOREM 4. A minimum FD-graph $G_\Sigma$ is LR-minimum if and only if it has neither

(a) redundant full arcs, nor
(b) redundant dotted arcs.


PROOF

(a) The definition of FD-path establishes a correspondence between the concepts of redundant full arc and extraneous attribute on the right side of a functional dependency.

(b) Let $G_\Sigma = (V, E)$ be a minimum FD-graph, and let $A_1 \cdots A_n \to Y$ be a dependency of $\Sigma$. Since such a dependency is nonredundant, $A_1$ is an extraneous attribute if and only if $A_2 \cdots A_n \to A_1$ is in $\Sigma^+$, that is, in terms of the FD-graph formalism, if and only if there is a dotted FD-path from $i$ to $j$, where $i, j$ are nodes of $G_\Sigma$ with labels respectively $A_1 \cdots A_n$ and $A_1$, which does not contain the arc $(i, j)$ (i.e., $(i, j)$ is redundant). Q.E.D.

In order to obtain an LR-minimum covering of a minimum FD-graph, it is sufficient to eliminate all redundant arcs. The simplest approach would be to use the following algorithm: For every arc $(i, j)$ verify whether or not there is an FD-path $\langle i, j \rangle$ that does not contain the arc $(i, j)$. In case $(i, j)$ is dotted, the FD-path must be dotted. This approach would require time $O(|\Sigma|^2)$ and corresponds to the algorithm used in [16].
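The simple quadratic approach can be sketched on the FDs themselves: each attribute of each dependency is tested in turn with one closure computation. The code below is an illustration under stated assumptions rather than the algorithm of [16] or of this paper: it uses the textbook attribute closure, tests left-side attributes with the condition from part (b) of the proof above, expects a minimum covering as input, and the names `attribute_closure` and `remove_extraneous` are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Attributes determined by `attrs` under the (lhs, rhs) pairs in `fds`."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def remove_extraneous(fds):
    """Delete extraneous attributes from both sides of every FD."""
    fds = [(set(lhs), set(rhs)) for lhs, rhs in fds]   # mutable working copy
    for lhs, rhs in fds:
        for a in sorted(lhs):            # left side: (X - a) -> a already holds
            if len(lhs) > 1 and a in attribute_closure(lhs - {a}, fds):
                lhs.discard(a)
        for a in sorted(rhs):            # right side: a is derivable without it
            rhs.discard(a)
            if a not in attribute_closure(lhs, fds):
                rhs.add(a)               # a was needed, put it back
    return [(frozenset(lhs), frozenset(rhs)) for lhs, rhs in fds]

sigma = [("A", "B"), ("BC", "D"), ("ABC", "DE")]
for lhs, rhs in remove_extraneous(sigma):
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

On this hypothetical input the third dependency shrinks to $AC \to E$: $B$ is extraneous on the left because $A \to B$, and $D$ is extraneous on the right because $BC \to D$.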

The method we propose is indeed more efficient and is based on an extension of the following algorithm for the transitive reduction of a directed graph $G = (V, E)$ [2].

1. Determine the transitive closure of $G$ (time $O(\|V\| \cdot \|E\|)$).

2. Determine the strong components of $G$ and choose a node as representative for every component (time $O(\|V\|^2)$).

3. Modify the graph $G$ by replacing every arc $(i, j)$ between nodes of different components by an arc between the representatives of such components (time $O(\|V\|^2)$) and by providing a Hamiltonian circuit among the nodes inside every component. Note that the set of arcs between representatives of different strong components forms an acyclic graph.

4. Eliminate redundant arcs in the above-mentioned acyclic graph (time $O(\|V\| \cdot \|E\|)$).
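Step 4 of the list above, applied to an ordinary acyclic graph, can be sketched as follows: an arc $(i, j)$ is dropped when $j$ is reachable from another successor $k$ of $i$. The dictionary representation and the name `transitive_reduction_dag` are assumptions, and the sketch covers only the plain-graph case, not the FD-graph extension discussed afterwards.

```python
def transitive_reduction_dag(succ):
    """succ: dict node -> set of successors of an acyclic graph
    (every node must appear as a key). Returns the reduced arc sets."""
    def reachable(start):
        # nodes reachable from `start` by a nonempty path
        seen, stack = set(), list(succ[start])
        while stack:
            j = stack.pop()
            if j not in seen:
                seen.add(j)
                stack.extend(succ[j])
        return seen

    reach = {i: reachable(i) for i in succ}
    return {i: {j for j in targets
                if not any(j in reach[k] for k in targets if k != j)}
            for i, targets in succ.items()}

dag = {"A": {"B", "C", "D"}, "B": {"C"}, "C": {"D"}, "D": set()}
print(transitive_reduction_dag(dag))
```

For the sample graph only the chain A, B, C, D survives, since the arcs from A to C and from A to D are implied by longer paths.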

In order to extend such an algorithm to FD-graphs we have to consider the following facts.

First of all, in the case of FD-graphs when we have to connect a node $i$ to a compound node $j$, instead of a full arc, we have a more complex structure made of full arcs from $i$ to all component nodes of $j$ and of dotted arcs from $j$ to such components; we will refer to such a structure as u(nion)-path $[i, j]$. Step 3 of the preceding algorithm has hence to be modified by considering that representative nodes of different strong components and nodes along a Hamiltonian (FD)-circuit may have to be connected by u-paths.

Second, the introduction of u-paths along a Hamiltonian circuit in a strong component may cause the presence of new redundant arcs (as can be seen in Example 5) and make the subsequent step 4 more expensive.

Example 5. Let us consider the strong component (enclosed in the dashed rectangle) of the minimum FD-graph given in Figure 6, where the node $A$ has been chosen as representative node. The arc $(BC, D)$ (belonging to the u-path $[BC, DE]$ which has been introduced in the Hamiltonian circuit $A, BC, DE, A$) is clearly redundant. []

Finally, step 4 has to be modified because, in the case of FD-graphs, it is not true that an arc (i, j) is redundant if and only if there exists a node k such that the arcs (i, k) and (k, j) are in the closure.


FIG. 6. Hamiltonian (FD)-circuit containing a redundant arc.

FIG. 7. FD-graph without redundant arcs.

Example 6. In the FD-graph given in Figure 7, despite the fact that this FD-graph does not contain any cycle, the existence of the FD-path $\langle A, BD \rangle$ and of the arc (BD, C) is not sufficient to say that the arc (A, C) is redundant; in fact the arc (A, C) belongs to said FD-path and this violates Definition 13. □

According to the preceding observations the only modification needed to extend the graph algorithm to FD-graphs concerns step 4. In order to introduce such modification and evaluate the overall cost of the algorithm, let us prove the following proposition.

PROPOSITION 3. Let $G_F = (V, E)$ be an FD-graph obtained from a minimum FD-graph after steps 1-3 of the preceding algorithm, and let (i, j) be a full (dotted) arc of $G_F$; (i, j) is redundant if there exists a representative node k in a different strong component such that (i, k) is in $E$ (in $E_1$) and (k, j) is in $E^+$.

PROOF. Since the arc (k, j) is in $E^+$ (and k cannot be equivalent to i; otherwise it would belong to the same component as i), there exists an FD-path $\langle k, j \rangle$ that does not contain the arc (i, j). Hence, since (i, k) is in $E$ ($E_1$), there exists an FD-path $\langle i, j \rangle$ (dotted if (i, j) is dotted) which does not contain the arc (i, j). Definition 13 is hence satisfied. Q.E.D.

Step 4 is hence modified as follows:

(a) First test the sufficient condition given in Proposition 3 and eliminate the corresponding redundant arcs (time $O(\|V\| \cdot \|E\|)$).

(b) Test remaining full (dotted) arcs (i, j) for the existence of a representative node k such that (i, k) is in $E^+$ ($E_1^+$) and (k, j) is in $E^+$. Such arcs are candidates for being redundant (time $O(\|V\| \cdot \|E\|)$).

(c) Test the condition of Definition 13 for all candidate arcs of the preceding step and eliminate the corresponding redundant arcs (time $O(t' \cdot \|E\|)$, where $t'$ ($0 \le t' \le \|E\|$) is the number of such arcs).
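Steps (a) and (b) can be pictured as two set filters. The sketch below is restricted to full arcs, that is, to the case in which the FD-graph is an ordinary graph; it does not implement step (c) or the dotted-arc variants. The names `step_4a` and `step_4b`, the representation of $E$ and $E^+$ as Python sets of pairs, and the exclusion of k = i and k = j are assumptions of this illustration.

```python
def step_4a(E, E_plus, comp, reps):
    """Arcs flagged by the sufficient condition of Proposition 3.

    E, E_plus: sets of (i, j) pairs (arcs and closure arcs).
    comp: maps each node to an identifier of its strong component.
    reps: the set of representative nodes.
    """
    return {(i, j) for (i, j) in E
            if any(k != j and comp[k] != comp[i]
                   and (i, k) in E and (k, j) in E_plus
                   for k in reps)}


def step_4b(E, E_plus, reps, removed):
    """Remaining arcs that are candidates for being redundant."""
    return {(i, j) for (i, j) in E - removed
            if any(k not in (i, j)
                   and (i, k) in E_plus and (k, j) in E_plus
                   for k in reps)}
```

On the three-node graph with arcs (1, 2), (2, 3), (1, 3), where every node is its own component and representative, step (a) flags (1, 3) and step (b) leaves no further candidates.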

In terms of the parameters of the set of functional dependencies $F$, the overall cost of the algorithm is hence $O(t \cdot |F|)$, where $t = \max(p, t')$ and hence $p \le t \le |F|$. Note that when the FD-graph is a graph, the conditions tested in steps 4a and 4b are necessary and sufficient, and hence the overall cost is $O(p \cdot |F|)$, while the algorithm given in [16] in all cases requires time proportional to $|F|^2$.

FIG. 8. (a) A minimum FD-graph. (b) The configuration after step 3. (c) A final LR-minimum covering.

Example 7. Let us consider the minimum FD-graph given in Figure 8a. The result of steps 1-3 is given in Figure 8b (where the node E has been chosen as representative node of the strong component {ABC, E, F}; the arcs (ABC, E), (E, F) and the u-path [F, ABC] have been chosen in the Hamiltonian circuit). By means of step 4a the arc (BD, A) is eliminated; furthermore, the dotted arcs (ABC, A), (ABC, B) and the full arcs (F, A), (F, B) are candidates for being redundant (step 4b). Finally, in step 4c, (ABC, A) and (F, A) are eliminated, obtaining the FD-graph given in Figure 8c. □

5. Conclusions and Applications to Database Design

In this paper a graph-theoretic formalism has been presented which allows us to represent functional dependencies and to support algorithms for their manipulation (closure, minimization, etc.). Functional dependencies were the first kind of data dependencies to be taken into consideration and have been thoroughly investigated by several authors. Their relevance to the problem of database design has been reconfirmed by the observation that in most cases a set of functional dependencies plus one join dependency are enough to express the dependency structure of a database scheme [13].

The proposed formalism is an extension of the concept of graph, and all our algorithms are extensions of algorithms operating on graphs. The advantage of our approach with respect to the approaches in the literature is that, while in the literature we find ad hoc representations for particular problems, our proposal allows a homogeneous and efficient treatment of various problems related to functional dependency manipulation and its use in various applications.

As an example we can show how our algorithms may be used to obtain a straightforward and efficient solution to the problem of synthesizing a relational scheme in 3NF.

Given a set of attributes U and a set of dependencies $F$ on U, one of the problems that has to be solved in order to determine a "good" set of relation schemes $(R_i(U_i), F_i)$ in the logical design of a database is the problem of finding a decomposition of the (universal) relation scheme $(R(U), F)$ that satisfies the following properties [6]:

(i) The dependencies $F_i$ are embodied¹ in the relation scheme $R_i$, and $(\bigcup_i F_i)^+ = F^+$.

(ii) For every i, $(R_i(U_i), F_i)$ is in third normal form.

(iii) The number of relation schemes is minimum.

This problem can be immediately solved if we have previously determined the LR-minimum covering of $G_F$ according to the algorithms that were described in the preceding paragraphs. In fact, once we have determined an LR-minimum FD-graph corresponding to the given set of functional dependencies, we may associate with every strong component a relation scheme. In such a scheme the keys correspond to the labels of the nodes in the component, and nonprime attributes correspond to the labels of the target nodes of the arcs leaving the strong component. These relation schemes are in third normal form because no transitive dependencies can exist.

For example, starting from the LR-minimum FD-graph of Example 7 we obtain the following relation schemes:

$(R_1(BCEFH), \{BC \to E, E \to FH, F \to BC\})$,

$(R_2(CD), \{C \to D\})$,

$(R_3(BDG), \{BD \to G\})$,

$(R_4(GA), \{G \to A\})$.
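The grouping behind these schemes can be reproduced with a short sketch. It does not build an FD-graph: it swaps in the ordinary attribute-closure computation, using the fact that two left sides are equivalent exactly when their closures coincide, which plays the role of the strong components here. The function names, the representation of an FD as a pair of attribute strings, and the premise that the input FDs are already LR-minimum are assumptions of this example.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def synthesize(fds):
    """One relation scheme per class of equivalent left sides."""
    groups = {}
    for lhs, rhs in fds:
        # Equivalent left sides have identical closures.
        key = frozenset(closure(lhs, fds))
        groups.setdefault(key, []).append((lhs, rhs))
    schemes = []
    for members in groups.values():
        attrs = set()
        for lhs, rhs in members:
            attrs |= set(lhs) | set(rhs)
        schemes.append(("".join(sorted(attrs)), members))
    return schemes


# Dependencies of the example, assumed to be already LR-minimum.
example_fds = [("BC", "E"), ("E", "FH"), ("F", "BC"),
               ("C", "D"), ("BD", "G"), ("G", "A")]
for attributes, members in synthesize(example_fds):
    print(attributes, members)
```

On the dependencies of the example the sketch returns schemes over BCEFH, CD, BDG, and AG, matching the four schemes listed above up to the order of attributes.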

It can be seen immediately that, starting from an LR-minimum FD-graph, the synthesis algorithm runs in time $O(p^2)$. With respect to the synthesis algorithms that were presented and discussed in the literature (in particular [6]), the method provided in this paper is both conceptually simpler (because practically the entire job is fulfilled in the process of determining a particular LR-minimum covering of an FD-graph) and, most of all, more efficient. In fact, as was observed in the preceding paragraphs, our algorithm runs in time $O(t \cdot |F|)$, which may often be as low as $O(p \cdot |F|)$, while the algorithms in [6] run in time proportional to $|F|^2$ no matter what the structure of $F$ is.

Finally, since in general our synthesis algorithm does not produce a lossless join decomposition of the universal relation scheme (see [1] for an introduction to the problem), we can use the FD-graph formalism in order to obtain this kind of decomposition.

To this goal we need to find a kernel of the FD-graph closure. Notice that, in general, the problem of deciding whether a directed graph has a kernel is NP-complete [15]. However, since the FD-graph closure is a directed graph which is closed under a transitivity rule, it has at least one kernel. Moreover, a kernel may be found in time $O(p^2)$ by taking a node for every strong component whose nodes do not have incident arcs from nodes of other strong components. Notice that all kernels have the same number of nodes, which is equal to the number of strong components which satisfy the former property.
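The selection rule just described can be sketched as follows. The code assumes that the arcs passed in are already transitively closed, so that two distinct nodes lie in the same strong component exactly when each has an arc to the other; it is written for clarity and does not attempt to meet the $O(p^2)$ bound. The function name `kernel` and the input representation are assumptions of this illustration.

```python
def kernel(nodes, closure_arcs):
    """Pick one node from every strong component that no other component enters.

    closure_arcs must be transitively closed (assumed, not checked).
    """
    arcs = set(closure_arcs)

    def same_component(a, b):
        return a == b or ((a, b) in arcs and (b, a) in arcs)

    chosen = []
    for v in nodes:
        if any(same_component(v, w) for w in chosen):
            continue  # this component already contributed a node
        members = [w for w in nodes if same_component(v, w)]
        entered = any((u, w) in arcs and not same_component(u, v)
                      for u in nodes for w in members)
        if not entered:
            chosen.append(v)
    return chosen
```

For a closure with nodes AB, BC, A, B, C, D and arcs from AB to A, B, D and from BC to B, C, D, the call returns the two nodes AB and BC, whose labels together cover the attributes ABC.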

If a kernel is a singleton, then the set of synthesized relations has the lossless join property; otherwise we obtain a lossless join decomposition by adding the relation scheme (Rr(W), { }), where W is the union of the labels of the nodes in the kernel. This result derives from the fact, proved in [9], that a decomposition of a relation scheme $(R(U), F)$ has the lossless join property if and only if there exists a relation scheme $(R_j(U_j), F_j)$ in the decomposition such that $U_j \to U$ is in $F^+$.
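The criterion attributed to [9] is easy to check directly with attribute closures. The following sketch does so; the function names and the encoding of schemes and FDs as attribute strings are assumptions made for illustration, and the test applies only to decompositions of the kind discussed here.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def has_lossless_join(scheme_attrs, fds, universe):
    """True if some scheme U_j satisfies U_j -> U, i.e. its closure covers U."""
    return any(closure(u, fds) >= set(universe) for u in scheme_attrs)
```

With the dependencies AB → D and BC → D over ABCD, the schemes ABD and BCD alone fail the test, while adding a scheme over ABC makes it succeed.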

¹ $X \to A \in F_i$ is embodied in $R_i$ iff $X \to U_i \in F^+$ and $A \in U_i$ ($XA \subseteq U_i$).

FIG. 9. FD-graph with kernel {AB, BC}.

For example, the set of relations synthesized in the previous example has the lossless join property, since a kernel of the FD-graph is a singleton (it is one of the nodes of the strong component {BC, E, F}).

In contrast, the set of relations $(R_1(ABD), \{AB \to D\})$, $(R_2(BCD), \{BC \to D\})$ associated with the LR-minimum FD-graph of Figure 9 does not have the lossless join property. In fact, the kernel of the FD-graph closure is {AB, BC}; in order to achieve a lossless join decomposition we have to add a relation scheme on the attributes ABC with the empty set of dependencies { }.

ACKNOWLEDGMENT. We are grateful to one of the referees for suggesting several formal and substantial improvements.

REFERENCES

1. AHO, A.V., BEERI, C., AND ULLMAN, J.D. The theory of joins in relational databases. ACM Trans. Database Syst. 4, 3 (Sept. 1979), 297-314.

2. AHO, A.V., GAREY, M.R., AND ULLMAN, J.D. The transitive reduction of a directed graph. SIAM J. Comput. 1 (1972), 131-137.

3. ARMSTRONG, W. Dependency structure of database relationships. Proc. IFIP 74, North-Holland, Amsterdam, 1974, pp. 550-583.

4. AUSIELLO, G., D'ATRI, A., AND SACCÀ, D. Graph algorithms for the synthesis and manipulation of data base schemes. In Proc. on Graph Theoretic Concepts in Computer Science, Lecture Notes in Computer Science 100, Springer-Verlag, New York, 1981, pp. 212-233.

5. BATINI, C., AND D'ATRI, A. Rewriting systems as a tool for relational data base design. In Proc. on Graph-Grammars and Their Application to Computer Science, Lecture Notes in Computer Science 73, Springer-Verlag, New York, 1979, pp. 139-154.

6. BEERI, C., AND BERNSTEIN, P.A. Computational problems related to the design of normal form relational schemas. ACM Trans. Database Syst. 4, 1 (Mar. 1979), 30-59.

7. BEERI, C., BERNSTEIN, P.A., AND GOODMAN, N. A sophisticate's introduction to database normalization theory. In Proc. 4th Int. Conf. on Very Large Data Bases (Berlin, Oct. 1978), ACM, New York, pp. 113-124.

8. BERNSTEIN, P.A. Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst. 1, 4 (Dec. 1976), 277-298.

9. BISKUP, J., DAYAL, U., AND BERNSTEIN, P.A. Synthesizing independent database schemas. In ACM SIGMOD 1979 Int. Conf. on Management of Data (Boston, Mass., May 30-June 1), ACM, New York, pp. 143-152.

10. CODD, E.F. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377-387.

11. DEGANO, P., LOMANTO, A., AND SIROVICH, F. On finding the optimal access path to resolve a relational data base query. In Proc. on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 88, Springer-Verlag, New York, 1980, pp. 219-230.

12. DELOBEL, C., AND CASEY, R.G. Decomposition of data bases and the theory of boolean switching functions. IBM J. Res. Dev. 17 (1972), 374-386.

13. FAGIN, R., MENDELZON, A.O., AND ULLMAN, J.D. A simplified universal relation assumption and its properties. ACM Trans. Database Syst. 7, 3 (Sept. 1982), 343-360.

14. GALIL, Z. An almost linear-time algorithm for computing a dependency basis in a relational database. J. ACM 29, 1 (Jan. 1982), 96-102.

15. GAREY, M.R., AND JOHNSON, D.S. Computers and Intractability. Freeman, San Francisco, 1979.

16. MAIER, D. Minimum covers in the relational database model. J. ACM 27, 4 (Oct. 1980), 664-674.

17. ULLMAN, J.D. Principles of Database Systems. Computer Science Press, Potomac, Md., 1980.

18. ZANIOLO, C., AND MELKANOFF, M.A. A formal approach to the definition and the design of conceptual schemata for database systems. ACM Trans. Database Syst. 7, 1 (Mar. 1982), 24-59.

RECEIVED NOVEMBER 1981; REVISED SEPTEMBER 1982; ACCEPTED NOVEMBER 1982

Journal of the Association for Computing Machinery, Vol. 30, No. 4, October 1983.
