1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

1

Database Theory and Methodology

2

The Good and the Bad• So far we have not developed any measure of “goodness” to

measure the quality of the design, other than our intuition.• Bad database design

– Redundant Information– Update anomalies– Wasting storage with null values– Generation of invalid data during joins

• Good database design– Easy to explain its semantics– No redundancy (or at least reduced redundancy)– No information loss during joins

• We need formal concepts to define “goodness” and “badness” of relational schemas.

3

The Evils of Redundancy

• Redundancy is at the root of several problems associated with relational schemas:– Redundant storage– Insertion anomalies – Deletion anomalies – Update anomalies

4

Example: Bad DesignFLIGHTS

flt# date airline plane#

DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

• redundancy: airline name repeated for the same flight

• inconsistency: when airline name for a flight changes, it must be changed in many places

5

Bad Database Design

• insertion anomalies: how do we represent that SK912 is flown by Scandinavian without there being a date and a plane assigned?

• deletion anomalies: cancelling AA411 on 10/22/00 makes us lose that it is flown by American.

• update anomalies: if DL242 is flown by Sabena, we must change it everywhere.

FLIGHTS


DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

6

Bad Database Design- decomposition

FLIGHTS


DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

DATE-AIRLINE-PLANE

date airline plane#

10/23/00 Delta k-yo-33297

10/24/00 Delta t-up-73356

10/25/00 Delta o-ge-98722

10/24/00 American p-rw-84663

10/25/00 American q-yg-98237

10/22/00 American h-fe-65748

7

Bad Database Design- information loss

FLIGHTS


DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA211 10/22/00 American h-fe-65748

AA411 10/24/00 American p-rw-84663

AA411 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

DATE-AIRLINE-PLANE

date airline plane#

10/23/00 Delta k-yo-33297

10/24/00 Delta t-up-73356

10/25/00 Delta o-ge-98722

10/24/00 American p-rw-84663

10/25/00 American q-yg-98237

10/22/00 American h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

• information loss: we polluted the database with false facts; we can’t find the true facts.

8

Good Database Design

• no redundancy of FACT (!)

• no inconsistency

• no insertion, deletion or update anomalies

• no information loss

FLIGHTS-DATE-PLANE

flt# date plane#

DL242 10/23/00 k-yo-33297

DL242 10/24/00 t-up-73356

DL242 10/25/00 o-ge-98722

AA121 10/24/00 p-rw-84663

AA121 10/25/00 q-yg-98237

AA411 10/22/00 h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

9

Goal — Devise a Theory for the Following

• Decide whether a particular relation R is in “good” form.

• In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that

– each relation is in good form

– the decomposition is a lossless-join decomposition

• Our theory is based on:

– functional dependencies

– multivalued dependencies

10

Functional Dependencies (FDs)

• A functional dependency is a constraint between two sets of attributes from the database.

• Definition: A functional dependency, denoted by X Y, holds over relation R if, for every allowable instance r of R:– t1 r, t2 r, t1.X = t2.X implies t1.Y =t2.Y– i.e., given two tuples in r, if the X values agree, then the Y

values must also agree. (X and Y are sets of attributes.)• An FD is a statement about all allowable relations.

– Must be identified based on semantics of application.– Given some allowable instance r1 of R, we can check if it

violates some FD f, but we cannot tell if f holds over R!• If X Y in R, this does not say whether or not Y X.

11

Functional Dependencies and Keys

• A key constraint is a special case of an FD.

• Note the following:

– If X is a candidate key, X Y for any subset of attributes Y of R.

– X Y does not require that the set X be minimal; the additional minimality condition must be met for X to be a key.

– If X Y holds where Y is the set of all attributes in R, and there is some subset V of X s.t. V Y, then X is a superkey.

12

Example: Constraints on Entity Set

• Consider relation obtained from Hourly_Emps:– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

• Notation: We will denote this relation schema by listing the attributes: SNLRWH– This is really the set of attributes {S,N,L,R,W,H}.– Sometimes, we will refer to all attributes of a relation by using the

relation name. (e.g., Hourly_Emps for SNLRWH)

• Some FDs on Hourly_Emps:– ssn is the key: S SNLRWH – rating determines hrly_wages: R W

13

Example (Contd.)

• Problems due to R W :

– Update anomaly: Can we change W in just the 1st tuple of SNLRWH?

– Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating?

– Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!

S N L R W H

123-22-3666 Attishoo 48 8 10 40

231-31-5368 Smiley 22 8 10 30

131-24-3650 Smethurst 35 5 7 30

434-26-3751 Guldu 35 5 7 32

612-67-4134 Madayan 35 8 10 40

Will two smaller tables be better?

R W

8 10

5 7

Wages

S N L R H

123-22-3666 Attishoo 48 8 40

231-31-5368 Smiley 22 8 30

131-24-3650 Smethurst 35 5 30

434-26-3751 Guldu 35 5 32

612-67-4134 Madayan 35 8 40

Hourly_Emps2

14

Closure of a set of FDs

• Given some FDs, we can usually infer additional FDs:– ssn did, did lot implies ssn lot– mgr_ssn mgr_phone, dept_no mgr_ssn implies dept_no

mgr_phone

• An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.

• Formally, the set of all dependencies that include F as well as all dependencies that can be inferred from F is called the closure of F; it is denoted by F+.

15

Inference Rules for FDs

• Armstrong’s Axioms (X, Y, Z are sets of attributes):– Reflexivity: If Y X then X Y. – Augmentation: If X Y, then XZ YZ for any Z.– Transitivity: If X Y and Y Z, then X Z.

• These are sound and complete inference rules for FDs!– By sound we mean that they generate only FDs in F+ when applied to a

set F of FDs.– By complete we mean that repeated application of these rules will

generate all FDs in the closure of F+.

16

Derived inference rules

• Union: if XY and XZ, then XYZ.

• Decomposition: if XYZ, then XY and XZ.

• Pseudotransitivity: if XY and WYZ, then WXZ.

• These additional rules are not essential; their soundness can be proved using Armstrong’s Axioms.

• Exercise: Prove rules Union, Decomposition and Pseudotransitivity using A.A.

17

Example

• Consider the following schema:Contracts(cid,sid,jid,did,pid,qty,value)

– C is the key: C CSJDPQV– Project purchases each part using single contract: JP C– Dept purchases at most one part from a supplier: SD P

• JP C, C CSJDPQV imply JPCSJDPQV• SDP implies SDJ JP• SDJ JP, JP CSJDPQV imply SDJ CSJDPQV

– We can infer several additional FDs that are in the closure by using augmentation or decomposition. e.g. C CSJDPQV implies C C, C S,C J, C D, C P, C Q, C V.

– We have also a number of trivial FDs from reflexivity rule.

18

Reasoning About FDs

• Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)

• Typically, we just want to check if a given FD X Y is in the closure of a set of FDs F. An efficient check:

1. Compute attribute closure of X (denoted X+) wrt F:• Set of all attributes A such that X A is in F+.

• There is a linear time algorithm to compute this.

2. Check if Y is in X+.

• Does F = {A B, B C, CD E } imply A E?– i.e, is A E in the closure F+ ? (Equivalently, is E in X+)?

19

Attribute Closure

X+ = X;

repeat {

oldX+ = X+;

for each functional dependency Y Z in F do

if Y X+ then X+ = X+ Z;

until X+ = oldX+

20

Example

• Given the following set of FDs:

F= { Ssn Ename,

Pnumber {Pname, Plocation},

{Ssn,Pnumber} Hours }

• The attribute closures:

{Ssn}+ = {Ssn, Ename}

{Pnumber}+ = {Pnumber, Pname,Plocation}

{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

21

Example R = (A, B, C, G, H, I)

F = {A BA C CG HCG IB H}

(AG)+

1. result = AG.

2. result = ABCG (A C and A B)

3. result = ABCGH(CG H and CG AGBC)

4. result = ABCGHI (CG I and CG AGBCH) Is AG a candidate key?

1. Is AG a super key?• Does AG R? == Is (AG)+ R

2. Is any subset of AG a superkey?• Does A R? == Is (A)+ R

• Does G R? == Is (G)+ R

22

Finding a Key for a Relation

Algorithm: Finding a Key K for R, given a set of FDs

1. Set K = R.

2. For each attribute A in K {

Compute (K-A)+ wrt F;

If (K-A)+ contains all the attributes in R, then

set K = K – {A}

};

23

Example

• Consider the following schema and the set of FDs:

R = (Ssn, Pnumber, Ename, Pname, Plocation, Hours)

F = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}

• The Key is {Ssn,Pnumber}, since

{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

24

Equivalence of two sets of FDs

• Definition: A set of FDs F covers another set of FDs E if every FD in E is also in F+.

• Definition: F and E are equivalent if F+ = E+ (i.e. F covers E and E covers F).

• We can determine whether F covers E by calculating X+ with respect to F for each FD XY in E, and then checking whether this X+ includes the attributes in Y. If this is the case for every FD in E, then F covers E.

FE

F+

25

Minimal Cover

• Sets of functional dependencies may have redundant dependencies that can be inferred from the others

– Eg: A C is redundant in: {A B, B C, A C}

– Parts of a functional dependency may be redundant• E.g. on RHS: {A B, B C, A CD} can be simplified to

{A B, B C, A D}

• E.g. on LHS: {A B, B C, AC D} can be simplified to {A B, B C, A D}

• Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies

26

Minimal Sets of FDs

A set of FDs F is minimal if:

• Every dependency in F has a single attribute for its right-hand side.

• We can’t replace any dependency X A in F with a dependency YA where Y X and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having redundant

attributes on the left-hand side of a dependency.

• We can’t remove any dependency from F and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having a dependency that

can be inferred from the remaining FDs in F.

27

Minimal Cover of a set of FDs

• A minimal cover of a set of FDs E is a minimal set of dependencies F that is equivalent to E.

• There can be several minimal covers for a set of FDs.

• Algorithm: Finding Minimal Cover F for a set of Fds E.

1. Set F = E.

2. Replace each Fd X {A1,..., An} in F by the n Fds XA1,..., XAn.

3. For each Fd XA in F

For each attribute BX

If {{F - {XA}} {(X - {B}) A}} F

Then replace XA with (X - {B}) A in F.

4. For each remaining Fd XA in F

If {F - {XA}}F, then remove XA from F.

28

Examples

• E = { A B, ABCD E, EF GH, ACDF EG } has the following minimal cover:

F = {A B, ACD E, EF G, EF H }

• E = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}}

F= {Ssn Ename, Pnumber Pname, Pnumber Plocation, {Ssn,Pnumber} Hours}

29

Normal Forms

• Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!

• If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.

• Role of FDs in detecting redundancy:– Consider a relation R with 3 attributes, ABC.

• No FDs hold: There is no redundancy here.

• Given A B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

30

Normalization

• Normalization of data is a process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of

– minimizing redundancy

– minimizing update, insert and delete anomalies.

• The process of normalization through decomposition must also confirm two additional properties:

1. Lossless join property (nonadditive join property)

2. Dependency preservation property

31

NF2

1NF

2NF

3NF

BCNF

Overview of NFs

32

Normal Forms

• NF2: non-first normal form.

• 1NF: R is in 1NF iff all domain values are atomic.

• 2NF: R is in 2. NF iff R is in 1NF and every non-key attribute is fully dependent on the key.

• 3NF: R is in 3NF iff R is 2NF and every nonkey attribute is non-transitively dependent on the key

• BCNF: R is in BCNF iff every determinant is a candidate key

– Determinant: an attribute on which some other attribute(s) is fully

functionally dependent.

33

1NF: Atomic ValuesFLT-SCHEDULE

flt# weekday airline dtime from atime to

DL242 MO WE FR DELTA 10:40 ATL 12:30 BOS

SK912 SA SU SAS 12:00 CPH 15:30 JFK

AA242 MO FR AA 08:00 CHI 10:10 ATL

Attributes must be defined over domains with atomic values

FLT-SCHEDULE

flt# weekday airline dtime from atime to

DL242 MO DELTA 10:40 ATL 12:30 BOS

SK912 SA SAS 12:00 CPH 15:30 JFK

AA242 MO AA 08:00 CHI 10:10 ATL

DL242 WE DELTA 10:40 ATL 12:30 BOSDL242 FR DELTA 10:40 ATL 12:30 BOS

SK912 SU SAS 12:00 CPH 15:30 JFK

AA242 FR AA 08:00 CHI 10:10 ATL

34

2NF: Full Functional Dependency

• An FD X Y is a full functional dependency if any attribute A X , X – {A} does not functionally determine Y.

• An FD X Y is a partial functional dependency if some attribute A X , (X – {A}) Y.

• E.g. {ssn, proj_num} hours is a full FD.

{ssn, proj_num} ename is a partial FD.

• Definition: A relation schema R is in 2NF if R is in 1NF and every nonprime attribute A is fully functionally dependent on the primary key.

35

Third Normal Form (3NF)

• Relation R with FDs F is in 3NF if, for all X A in F+

– A X (called a trivial FD), or– X contains a key for R, or– A is part of some key for R.

• Minimality of a key is crucial in third condition above!

• If R is in 3NF, obviously in 2NF.

• If R is in 3NF, some redundancy is still possible. It is a compromise, used when BCNF not achievable (e.g., no ``good’’ decomposition, or performance considerations).– Lossless-join, dependency-preserving decomposition of R

into a collection of 3NF relations always possible.

36

What Does 3NF Achieve?

• If 3NF is violated by X A, one of the following holds:– X is a subset of some key K

• We store (X, A) pairs redundantly.

– X is not a proper subset of any key.• There is a chain of FDs K X A, which means that we cannot

associate an X value with a K value unless we also associate an A value with an X value.

• But: even if relation is in 3NF, these problems could arise.– e.g., Reserves SBDC, S C, C S is in 3NF, but for

each reservation of sailor S, same (S, C) pair is stored.

37

Boyce-Codd Normal Form (BCNF)

• Reln R with FDs F is in BCNF if, for all X A in F+

– A X (called a trivial FD), or– X contains a key for R.

• In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints.– No dependency in R that can be predicted using FDs alone.– If we are shown two tuples that agree upon

the X value, we cannot infer the A value in one tuple from the A value in the other.

– If example relation is in BCNF, then 2 tuples must be identical (since X is a key).

X Y A

x y1 a

x y2 ?

Documents

1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,