37
1 Database Theory and Methodology

1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

  • View
    218

  • Download
    3

Embed Size (px)

Citation preview

Page 1: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

1

Database Theory and Methodology

Page 2: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

2

The Good and the Bad• So far we have not developed any measure of “goodness” to

measure the quality of the design, other than our intuition.• Bad database design

– Redundant Information– Update anomalies– Wasting storage with null values– Generation of invalid data during joins

• Good database design– Easy to explain its semantics– No redundancy (or at least reduced redundancy)– No information loss during joins

• We need formal concepts to define “goodness” and “badness” of relational schemas.

Page 3: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

3

The Evils of Redundancy

• Redundancy is at the root of several problems associated with relational schemas:– Redundant storage– Insertion anomalies – Deletion anomalies – Update anomalies

Page 4: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

4

Example: Bad DesignFLIGHTS

flt# date airline plane#

DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

• redundancy: airline name repeated for the same flight

• inconsistency: when airline name for a flight changes, it must be changed in many places

Page 5: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

5

Bad Database Design

• insertion anomalies: how do we represent that SK912 is flown by Scandinavian without there being a date and a plane assigned?

• deletion anomalies: cancelling AA411 on 10/22/00 makes us lose that it is flown by American.

• update anomalies: if DL242 is flown by Sabena, we must change it everywhere.

FLIGHTS

flt# date airline plane#

DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

Page 6: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

6

Bad Database Design- decomposition

FLIGHTS

flt# date airline plane#

DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

DATE-AIRLINE-PLANE

date airline plane#

10/23/00 Delta k-yo-33297

10/24/00 Delta t-up-73356

10/25/00 Delta o-ge-98722

10/24/00 American p-rw-84663

10/25/00 American q-yg-98237

10/22/00 American h-fe-65748

Page 7: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

7

Bad Database Design- information loss

FLIGHTS

flt# date airline plane#

DL242 10/23/00 Delta k-yo-33297

DL242 10/24/00 Delta t-up-73356

DL242 10/25/00 Delta o-ge-98722

AA121 10/24/00 American p-rw-84663

AA121 10/25/00 American q-yg-98237

AA211 10/22/00 American h-fe-65748

AA411 10/24/00 American p-rw-84663

AA411 10/25/00 American q-yg-98237

AA411 10/22/00 American h-fe-65748

DATE-AIRLINE-PLANE

date airline plane#

10/23/00 Delta k-yo-33297

10/24/00 Delta t-up-73356

10/25/00 Delta o-ge-98722

10/24/00 American p-rw-84663

10/25/00 American q-yg-98237

10/22/00 American h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

• information loss: we polluted the database with false facts; we can’t find the true facts.

Page 8: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

8

Good Database Design

• no redundancy of FACT (!)

• no inconsistency

• no insertion, deletion or update anomalies

• no information loss

FLIGHTS-DATE-PLANE

flt# date plane#

DL242 10/23/00 k-yo-33297

DL242 10/24/00 t-up-73356

DL242 10/25/00 o-ge-98722

AA121 10/24/00 p-rw-84663

AA121 10/25/00 q-yg-98237

AA411 10/22/00 h-fe-65748

FLIGHTS-AIRLINE

flt# airline

DL242 Delta

AA121 American

AA411 American

Page 9: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

9

Goal — Devise a Theory for the Following

• Decide whether a particular relation R is in “good” form.

• In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that

– each relation is in good form

– the decomposition is a lossless-join decomposition

• Our theory is based on:

– functional dependencies

– multivalued dependencies

Page 10: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

10

Functional Dependencies (FDs)

• A functional dependency is a constraint between two sets of attributes from the database.

• Definition: A functional dependency, denoted by X Y, holds over relation R if, for every allowable instance r of R:– t1 r, t2 r, t1.X = t2.X implies t1.Y =t2.Y– i.e., given two tuples in r, if the X values agree, then the Y

values must also agree. (X and Y are sets of attributes.)• An FD is a statement about all allowable relations.

– Must be identified based on semantics of application.– Given some allowable instance r1 of R, we can check if it

violates some FD f, but we cannot tell if f holds over R!• If X Y in R, this does not say whether or not Y X.

Page 11: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

11

Functional Dependencies and Keys

• A key constraint is a special case of an FD.

• Note the following:

– If X is a candidate key, X Y for any subset of attributes Y of R.

– X Y does not require that the set X be minimal; the additional minimality condition must be met for X to be a key.

– If X Y holds where Y is the set of all attributes in R, and there is some subset V of X s.t. V Y, then X is a superkey.

Page 12: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

12

Example: Constraints on Entity Set

• Consider relation obtained from Hourly_Emps:– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

• Notation: We will denote this relation schema by listing the attributes: SNLRWH– This is really the set of attributes {S,N,L,R,W,H}.– Sometimes, we will refer to all attributes of a relation by using the

relation name. (e.g., Hourly_Emps for SNLRWH)

• Some FDs on Hourly_Emps:– ssn is the key: S SNLRWH – rating determines hrly_wages: R W

Page 13: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

13

Example (Contd.)

• Problems due to R W :

– Update anomaly: Can we change W in just the 1st tuple of SNLRWH?

– Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating?

– Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!

S N L R W H

123-22-3666 Attishoo 48 8 10 40

231-31-5368 Smiley 22 8 10 30

131-24-3650 Smethurst 35 5 7 30

434-26-3751 Guldu 35 5 7 32

612-67-4134 Madayan 35 8 10 40

Will two smaller tables be better?

R W

8 10

5 7

Wages

S N L R H

123-22-3666 Attishoo 48 8 40

231-31-5368 Smiley 22 8 30

131-24-3650 Smethurst 35 5 30

434-26-3751 Guldu 35 5 32

612-67-4134 Madayan 35 8 40

Hourly_Emps2

Page 14: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

14

Closure of a set of FDs

• Given some FDs, we can usually infer additional FDs:– ssn did, did lot implies ssn lot– mgr_ssn mgr_phone, dept_no mgr_ssn implies dept_no

mgr_phone

• An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.

• Formally, the set of all dependencies that include F as well as all dependencies that can be inferred from F is called the closure of F; it is denoted by F+.

Page 15: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

15

Inference Rules for FDs

• Armstrong’s Axioms (X, Y, Z are sets of attributes):– Reflexivity: If Y X then X Y. – Augmentation: If X Y, then XZ YZ for any Z.– Transitivity: If X Y and Y Z, then X Z.

• These are sound and complete inference rules for FDs!– By sound we mean that they generate only FDs in F+ when applied to a

set F of FDs.– By complete we mean that repeated application of these rules will

generate all FDs in the closure of F+.

Page 16: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

16

Derived inference rules

• Union: if XY and XZ, then XYZ.

• Decomposition: if XYZ, then XY and XZ.

• Pseudotransitivity: if XY and WYZ, then WXZ.

• These additional rules are not essential; their soundness can be proved using Armstrong’s Axioms.

• Exercise: Prove rules Union, Decomposition and Pseudotransitivity using A.A.

Page 17: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

17

Example

• Consider the following schema:Contracts(cid,sid,jid,did,pid,qty,value)

– C is the key: C CSJDPQV– Project purchases each part using single contract: JP C– Dept purchases at most one part from a supplier: SD P

• JP C, C CSJDPQV imply JPCSJDPQV• SDP implies SDJ JP• SDJ JP, JP CSJDPQV imply SDJ CSJDPQV

– We can infer several additional FDs that are in the closure by using augmentation or decomposition. e.g. C CSJDPQV implies C C, C S,C J, C D, C P, C Q, C V.

– We have also a number of trivial FDs from reflexivity rule.

Page 18: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

18

Reasoning About FDs

• Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)

• Typically, we just want to check if a given FD X Y is in the closure of a set of FDs F. An efficient check:

1. Compute attribute closure of X (denoted X+) wrt F:• Set of all attributes A such that X A is in F+.

• There is a linear time algorithm to compute this.

2. Check if Y is in X+.

• Does F = {A B, B C, CD E } imply A E?– i.e, is A E in the closure F+ ? (Equivalently, is E in X+)?

Page 19: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

19

Attribute Closure

X+ = X;

repeat {

oldX+ = X+;

for each functional dependency Y Z in F do

if Y X+ then X+ = X+ Z;

until X+ = oldX+

Page 20: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

20

Example

• Given the following set of FDs:

F= { Ssn Ename,

Pnumber {Pname, Plocation},

{Ssn,Pnumber} Hours }

• The attribute closures:

{Ssn}+ = {Ssn, Ename}

{Pnumber}+ = {Pnumber, Pname,Plocation}

{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

Page 21: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

21

Example R = (A, B, C, G, H, I)

F = {A BA C CG HCG IB H}

(AG)+

1. result = AG.

2. result = ABCG (A C and A B)

3. result = ABCGH(CG H and CG AGBC)

4. result = ABCGHI (CG I and CG AGBCH) Is AG a candidate key?

1. Is AG a super key?• Does AG R? == Is (AG)+ R

2. Is any subset of AG a superkey?• Does A R? == Is (A)+ R

• Does G R? == Is (G)+ R

Page 22: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

22

Finding a Key for a Relation

Algorithm: Finding a Key K for R, given a set of FDs

1. Set K = R.

2. For each attribute A in K {

Compute (K-A)+ wrt F;

If (K-A)+ contains all the attributes in R, then

set K = K – {A}

};

Page 23: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

23

Example

• Consider the following schema and the set of FDs:

R = (Ssn, Pnumber, Ename, Pname, Plocation, Hours)

F = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}

• The Key is {Ssn,Pnumber}, since

{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

Page 24: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

24

Equivalence of two sets of FDs

• Definition: A set of FDs F covers another set of FDs E if every FD in E is also in F+.

• Definition: F and E are equivalent if F+ = E+ (i.e. F covers E and E covers F).

• We can determine whether F covers E by calculating X+ with respect to F for each FD XY in E, and then checking whether this X+ includes the attributes in Y. If this is the case for every FD in E, then F covers E.

FE

F+

Page 25: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

25

Minimal Cover

• Sets of functional dependencies may have redundant dependencies that can be inferred from the others

– Eg: A C is redundant in: {A B, B C, A C}

– Parts of a functional dependency may be redundant• E.g. on RHS: {A B, B C, A CD} can be simplified to

{A B, B C, A D}

• E.g. on LHS: {A B, B C, AC D} can be simplified to {A B, B C, A D}

• Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies

Page 26: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

26

Minimal Sets of FDs

A set of FDs F is minimal if:

• Every dependency in F has a single attribute for its right-hand side.

• We can’t replace any dependency X A in F with a dependency YA where Y X and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having redundant

attributes on the left-hand side of a dependency.

• We can’t remove any dependency from F and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having a dependency that

can be inferred from the remaining FDs in F.

Page 27: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

27

Minimal Cover of a set of FDs

• A minimal cover of a set of FDs E is a minimal set of dependencies F that is equivalent to E.

• There can be several minimal covers for a set of FDs.

• Algorithm: Finding Minimal Cover F for a set of Fds E.

1. Set F = E.

2. Replace each Fd X {A1,..., An} in F by the n Fds XA1,..., XAn.

3. For each Fd XA in F

For each attribute BX

If {{F - {XA}} {(X - {B}) A}} F

Then replace XA with (X - {B}) A in F.

4. For each remaining Fd XA in F

If {F - {XA}}F, then remove XA from F.

Page 28: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

28

Examples

• E = { A B, ABCD E, EF GH, ACDF EG } has the following minimal cover:

F = {A B, ACD E, EF G, EF H }

• E = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}}

F= {Ssn Ename, Pnumber Pname, Pnumber Plocation, {Ssn,Pnumber} Hours}

Page 29: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

29

Normal Forms

• Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!

• If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.

• Role of FDs in detecting redundancy:– Consider a relation R with 3 attributes, ABC.

• No FDs hold: There is no redundancy here.

• Given A B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

Page 30: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

30

Normalization

• Normalization of data is a process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of

– minimizing redundancy

– minimizing update, insert and delete anomalies.

• The process of normalization through decomposition must also confirm two additional properties:

1. Lossless join property (nonadditive join property)

2. Dependency preservation property

Page 31: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

31

NF2

1NF

2NF

3NF

BCNF

Overview of NFs

Page 32: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

32

Normal Forms

• NF2: non-first normal form.

• 1NF: R is in 1NF iff all domain values are atomic.

• 2NF: R is in 2. NF iff R is in 1NF and every non-key attribute is fully dependent on the key.

• 3NF: R is in 3NF iff R is 2NF and every nonkey attribute is non-transitively dependent on the key

• BCNF: R is in BCNF iff every determinant is a candidate key

– Determinant: an attribute on which some other attribute(s) is fully

functionally dependent.

Page 33: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

33

1NF: Atomic ValuesFLT-SCHEDULE

flt# weekday airline dtime from atime to

DL242 MO WE FR DELTA 10:40 ATL 12:30 BOS

SK912 SA SU SAS 12:00 CPH 15:30 JFK

AA242 MO FR AA 08:00 CHI 10:10 ATL

Attributes must be defined over domains with atomic values

FLT-SCHEDULE

flt# weekday airline dtime from atime to

DL242 MO DELTA 10:40 ATL 12:30 BOS

SK912 SA SAS 12:00 CPH 15:30 JFK

AA242 MO AA 08:00 CHI 10:10 ATL

DL242 WE DELTA 10:40 ATL 12:30 BOSDL242 FR DELTA 10:40 ATL 12:30 BOS

SK912 SU SAS 12:00 CPH 15:30 JFK

AA242 FR AA 08:00 CHI 10:10 ATL

Page 34: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

34

2NF: Full Functional Dependency

• An FD X Y is a full functional dependency if any attribute A X , X – {A} does not functionally determine Y.

• An FD X Y is a partial functional dependency if some attribute A X , (X – {A}) Y.

• E.g. {ssn, proj_num} hours is a full FD.

{ssn, proj_num} ename is a partial FD.

• Definition: A relation schema R is in 2NF if R is in 1NF and every nonprime attribute A is fully functionally dependent on the primary key.

Page 35: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

35

Third Normal Form (3NF)

• Relation R with FDs F is in 3NF if, for all X A in F+

– A X (called a trivial FD), or– X contains a key for R, or– A is part of some key for R.

• Minimality of a key is crucial in third condition above!

• If R is in 3NF, obviously in 2NF.

• If R is in 3NF, some redundancy is still possible. It is a compromise, used when BCNF not achievable (e.g., no ``good’’ decomposition, or performance considerations).– Lossless-join, dependency-preserving decomposition of R

into a collection of 3NF relations always possible.

Page 36: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

36

What Does 3NF Achieve?

• If 3NF is violated by X A, one of the following holds:– X is a subset of some key K

• We store (X, A) pairs redundantly.

– X is not a proper subset of any key.• There is a chain of FDs K X A, which means that we cannot

associate an X value with a K value unless we also associate an A value with an X value.

• But: even if relation is in 3NF, these problems could arise.– e.g., Reserves SBDC, S C, C S is in 3NF, but for

each reservation of sailor S, same (S, C) pair is stored.

Page 37: 1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

37

Boyce-Codd Normal Form (BCNF)

• Reln R with FDs F is in BCNF if, for all X A in F+

– A X (called a trivial FD), or– X contains a key for R.

• In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints.– No dependency in R that can be predicted using FDs alone.– If we are shown two tuples that agree upon

the X value, we cannot infer the A value in one tuple from the A value in the other.

– If example relation is in BCNF, then 2 tuples must be identical (since X is a key).

X Y A

x y1 a

x y2 ?