Page 1: Week 7-8-normalization

Normalization

• Normalization is the process of efficiently organizing data in a database with two goals in mind

• First goal: eliminate redundant data

– for example, storing the same data in more than one table

• Second goal: ensure data dependencies make sense

– for example, only storing related data in a table

Benefits of Normalization

• Less storage space

• Quicker updates

• Less data inconsistency

• Clearer data relationships

• Easier to add data

• Flexible structure

Bad database designs result in:

• redundancy: inefficient storage

• anomalies: data inconsistency, difficulties in maintenance

Example

Relational schema: Product(Name, Price, Category, Manufacturer)

Instance:

Name          Price     Category      Manufacturer
gizmo         $19.99    gadgets       GizmoWorks
Power gizmo   $29.99    gadgets       GizmoWorks
SingleTouch   $149.99   photography   Canon
MultiTouch    $203.99   household     Hitachi

First Normal Form (1NF)

• A database schema is in First Normal Form if all tables are flat

Student (not flat: Courses holds a set of values)

Name    GPA   Courses
Alice   3.8   Math, DB, OS
Bob     3.7   DB, OS
Carol   3.9   Math, OS

Flat version:

Student

Name    GPA
Alice   3.8
Bob     3.7
Carol   3.9

Takes

Student   Course
Alice     Math
Carol     Math
Alice     DB
Bob       DB
Alice     OS
Carol     OS

Course

Course
Math
DB
OS

May need to add keys

Functional Dependencies

• A form of constraint

– hence, part of the schema

• Finding them is part of the database design

• Also used in normalizing the relations

• Warning: this is the most abstract, and “hardest”, part of database design.


Functional Dependencies

Definition:

If two tuples agree on the attributes

A1, A2, …, An

then they must also agree on the attributes

B1, B2, …, Bm

Formally:

A1, A2, …, An → B1, B2, …, Bm

(a functional dependency between the A attributes and the B attributes)

Examples

• EmpID → Name, Phone, Position

• Position → Phone

• but not Phone → Position

EmpID   Name    Phone   Position
E0045   Smith   1234    Clerk
E1847   John    9876    Salesrep
E1111   Smith   9876    Salesrep
E9999   Mary    1234    Lawyer

In General

• To check A → B, erase all other columns

• check if the remaining relation is many-one (called functional in mathematics)

… A … B

X1 Y1

X2 Y2

… …

10

Example

EmpID Name Phone Position

E0045 Smith 1234 Clerk

E1847 John 9876 Salesrep

E1111 Smith 9876 Salesrep

E9999 Mary 1234 Lawyer

Position → Phone
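
The column-erasing check can be sketched in a few lines of Python. This is only an illustration: it assumes each row is stored as a dictionary, and the helper name `fd_holds` is hypothetical, not something defined in these slides.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    seen = {}  # maps each combination of lhs values to the rhs values first seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        if seen.setdefault(key, value) != value:
            return False  # same lhs values, different rhs values
    return True


# The employee instance from the example slide.
employees = [
    {"EmpID": "E0045", "Name": "Smith", "Phone": "1234", "Position": "Clerk"},
    {"EmpID": "E1847", "Name": "John", "Phone": "9876", "Position": "Salesrep"},
    {"EmpID": "E1111", "Name": "Smith", "Phone": "9876", "Position": "Salesrep"},
    {"EmpID": "E9999", "Name": "Mary", "Phone": "1234", "Position": "Lawyer"},
]

print(fd_holds(employees, ["Position"], ["Phone"]))  # True
print(fd_holds(employees, ["Phone"], ["Position"]))  # False
```

Note that a check like this only tells us whether one instance satisfies an FD; whether the FD is a constraint of the schema is a design decision.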

Typical Examples of FDs

Product: name → price, manufacturer

Person: ssn → name, age

Company: name → stockprice, president

Example

Product(name, category, color, department, price)

Consider these FDs:

name → color

category → department

color, category → price

What do they say?


Example

FD’s are constraints on relations:

• On some instances they hold

• On others they don’t

name category color department price

Gizmo Gadget Green Toys 49

Tweaker Gadget Green Toys 99

Does this instance satisfy all the FDs?

name → color

category → department

color, category → price

Example

name category color department price

Gizmo Gadget Green Toys 49

Tweaker Gadget Black Toys 99

Gizmo Stationery Green Office-supp. 59

What about this one?

name → color

category → department

color, category → price

Example

If some FDs are satisfied, then others are satisfied too

If all these FDs are true:

name → color

category → department

color, category → price

Then this FD also holds: name, category → price

Why?

Inference Rules for FD’s

A1, A2, …, An → B1, B2, …, Bm

is equivalent to

A1, A2, …, An → B1

A1, A2, …, An → B2

. . . . .

A1, A2, …, An → Bm

Splitting rule (from the single FD to the list) and combining rule (from the list back to the single FD)

Inference Rules for FD’s (continued)

Trivial Rule

A1, A2, …, An → Ai    where i = 1, 2, ..., n

Why?

Inference Rules for FD’s (continued)

Transitive Closure Rule

If A1, A2, …, An → B1, B2, …, Bm

and B1, B2, …, Bm → C1, C2, …, Cp

then A1, A2, …, An → C1, C2, …, Cp

Why?

Functional Dependencies

We use functional dependencies to:

– test relations to see if they are legal under a given set of functional dependencies.

If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.

– specify constraints on the set of legal relations.

We say that F holds on R if all legal relations on R satisfy the set of functional dependencies F.

Functional Dependencies

• K is a superkey for relation schema R if and only if K → R

• K is a candidate key for R if and only if

– K → R, and

– for no α ⊂ K, α → R

• Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:

bor_loan = (customer_id, loan_number, amount)

We expect this functional dependency to hold:

loan_number → amount

but would not expect the following to hold:

amount → customer_name

Functional Dependencies

• A functional dependency α → β is trivial if β ⊆ α, so that it is satisfied by every instance of the relation

– Example:

• customer_name, loan_number → customer_name

• customer_name → customer_name

Functional Dependencies

• Consider the relation:

PLOTS (prop#, state, plot#, area, price, Tax_rate)

Information about plots available in India. The constraints on the relation are:

– Prop# is unique throughout India

– Plot# is unique within a given state

– For a given state, tax_rate is fixed

– Plots having the same area have the same price, irrespective of the state in which they are located

• Write all the FDs on the relation PLOTS

24

Functional Dependencies

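PLOTS (Prop#, State, Plot#, Area, Price, Tax_rate)

FD1 (PK): Prop# → State, Plot#, Area, Price, Tax_rate

FD2 (CK): State, Plot# → Prop#, Area, Price, Tax_rate

FD3: State → Tax_rate

FD4: Area → Price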

Identify redundancy in PLOTS

Identify update anomalies in PLOTS

Functional Dependencies

PLOTS

FD1 (PK): Prop# → State, Plot#, Area, Price, Tax_rate

FD2 (CK): State, Plot# → Prop#, Area, Price, Tax_rate

FD3: State → Tax_rate

FD4: Area → Price

Dependency Diagram (1NF)

Figure 4.4

Conversion to 1NF

• A relational schema R is in first normal form if the domains of all attributes of R are atomic

• Repeating groups must be eliminated

– Proper primary key developed

• Uniquely identifies attribute values (rows)

• Combination of PROJ_NUM and EMP_NUM

– Dependencies can be identified

• Desirable dependencies based on primary key

• Less desirable dependencies

– Partial

» based on part of composite primary key

– Transitive

» one nonprime attribute depends on another nonprime attribute

1NF Summarized

• Each attribute must be atomic (single value)

• No repeating columns within a row (composite attributes)

• No multi-valued columns.

• All key attributes defined

• All attributes dependent on primary key

• 1NF simplifies attributes

• Queries become easier.

29

Conversion to 2NF

• Start with 1NF format:

• Write each key component on separate line

• Write original key on last line

• Each component is new table

• Write dependent attributes after each key

PROJECT (PROJ_NUM, PROJ_NAME)

EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)

ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

Second Normal Form (2NF)

Each attribute must be functionally dependent on the whole primary key.

• If the primary key is a single attribute, then the relation is in 2NF

• The test for 2NF involves testing for FDs whose left-hand-side attributes are part of the primary key

• Disallow partial dependency, where non-key attributes depend on part of a composite primary key

• In short, remove partial dependencies

2NF improves data integrity.

• Reduces the update, insert, and delete anomalies caused by partial dependencies.

2NF Conversion Results

Figure 4.5

2 NF

• Based on the concept of Full FDs (FFD)

• If A & B are sets of attributes of R, B is said to be FFD on A if A → B, but no proper subset of A determines B

• No partial dependencies on the PK

• Is PLOTS in 2NF?

• YES

• Single attribute PK

• All relations with a single-attribute PK are in 2 NF under this PK-based definition

• 2 NF applies to relations with composite keys

2 NF

• A relation that is in 1NF & in which every non-PK attribute is fully functionally dependent on the PK, is said to be in 2 NF

1 NF → remove all partial dependencies → 2 NF

2NF Summarized

• In 1NF

• Includes no partial dependencies

– No attribute dependent on a portion of primary key

• Still possible to exhibit transitive dependency

– Attributes may be functionally dependent on nonkey attributes

Conversion to 3NF

• Create separate table(s) to eliminate

transitive functional dependencies

PROJECT (PROJ_NUM, PROJ_NAME)

ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS)

JOB (JOB_CLASS, CHG_HOUR)

3 NF

• Based on the concept of transitive dependency

• No non-PK attribute should be transitively dependent on the PK

• Transitive Dependency

If A → B & B → C, then A transitively determines C through B, provided B & C do not determine A

• Is PLOTS in 3NF?

• NO

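3 NF

PLOTS (Prop#, State, Plot#, Area, Price, Tax_rate)

FD1 (PK): Prop# → State, Plot#, Area, Price, Tax_rate

FD2 (CK): State, Plot# → Prop#, Area, Price, Tax_rate

FD3: State → Tax_rate

FD4: Area → Price

Prop# transitively determines Tax_rate through State

Prop# transitively determines Price through Area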

3 NF

• A relation that is in 1NF & 2 NF & in which no non-PK attribute is transitively dependent on the PK, is said to be in 3 NF

2 NF → remove all transitive dependencies → 3 NF

2NF – Example - 1

• Inventory (Item, Supplier, Cost, Supplier Address)

• We first check if Cost is fully functionally dependent upon the ENTIRE primary key (Item, Supplier)

• If I know just Item, can I find out Cost?

– No. We can have more than one supplier for the same product.

• If I know just Supplier, can I find out Cost?

– No. We need to know what the Item is as well.

• So, Cost is fully functionally dependent upon the ENTIRE primary key

2NF – Example - 2

• Inventory (Item, Supplier, Cost, Supplier Address)

• We then check if Supplier Address is fully functionally dependent upon the ENTIRE primary key

• If I know just Item, can I find out Supplier Address?

– No. We can have more than one supplier for the same product.

• If I know just Supplier, can I find out Supplier Address?

– Yes. The supplier’s address does not depend on the Item.

• So, Supplier Address is NOT fully functionally dependent upon the ENTIRE primary key, and Inventory is NOT in 2NF

So putting things together

Inventory (Description, Supplier, Cost, Supplier Address)

is decomposed into

Inventory (Description, Supplier, Cost)

Supplier (Name, Supplier Address)

The above relations are now in 2NF since no non-key attribute depends on only part of a primary key.
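
The "if I know just part of the key" test from the Inventory example can be sketched in Python. This is a rough illustration only: the FD list is read off the example, the attribute name `SupplierAddress` is written without a space for convenience, and the helper names `closure` and `partial_dependencies` are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by attrs under the FDs (pairs of lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, non_key, fds):
    """List (part_of_key, attribute) pairs where a proper subset of the key determines a non-key attribute."""
    found = []
    key = sorted(key)
    for size in range(1, len(key)):
        for part in combinations(key, size):
            for attr in sorted(set(non_key) & closure(part, fds)):
                found.append((part, attr))
    return found


# FDs assumed from the Inventory example.
fds = [
    (("Item", "Supplier"), ("Cost",)),
    (("Supplier",), ("SupplierAddress",)),
]
violations = partial_dependencies(("Item", "Supplier"), ("Cost", "SupplierAddress"), fds)
print(violations)  # [(('Supplier',), 'SupplierAddress')]
```

An empty list would mean that no non-key attribute depends on part of the key, which is the 2NF condition used in these slides.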

Transitive Dependence

Given a relation R (EmpNo, EName, Salary, Address),

assume the following FDs hold:

EmpNo → EName, EName → Address, EmpNo → Address

Note: Both EName and Address are non-key attributes in R, and since Address depends on a non-prime attribute EName, which depends on the primary key (EmpNo), a transitive dependency exists.

Splitting on EName → Address gives:

R1 (EmpNo, EName, Salary)

R2 (EName, Address)

• Boyce-Codd Normal Form (BCNF)

– A relation is in Boyce-Codd normal form (BCNF) if every determinant in the table is a candidate key. (A determinant is any attribute whose value determines other values within a row.)

– If a table contains only one candidate key, the 3NF and the BCNF are equivalent.

– BCNF is a special case of 3NF.

Database Normalization

A Table That Is In 3NF But Not In BCNF (Figure 5.7)

The Decomposition of a Table Structure to Meet BCNF Requirements (Figure 5.8)

Sample Data for a BCNF Conversion

Decomposition into BCNF

BCNF

• Based on FDs that take into account all candidate keys of a relation

• For a relation with only 1 CK, 3NF & BCNF are equivalent

• A relation is said to be in BCNF if every determinant is a CK

• Is PLOTS in BCNF?

• NO

Problem 1

• Consider the relation R(A,B,C) with functional dependencies AB → C and C → B.

• Is R in 2NF?

• Is R in 3NF?

• Is R in BCNF?

Closure of a set of FDs

• Given a set of FDs F on a relation R, it may be possible that several other FDs must also hold for R

• For example, if R = (A, B, C) & the FDs A → B & B → C hold in R, then the FD A → C also holds on R

• For a given value of A, there can be only one corresponding value of B, & for that value of B, there can be only one corresponding value for C

• The closure of F is the set of all FDs that can be inferred from F, & is denoted by F+

Closure of a set of FDs

• It is not sufficient to consider just the given set of FDs

• We need to consider all FDs that hold

• Given F, more FDs can be inferred

• Such FDs are said to be logically implied by F

• F+ is the set of all FDs logically implied by F

• We can compute F+ using the formal definition of FD

• If F were large, this process would be lengthy & cumbersome

• Axioms or Rules of Inference provide a simpler technique

• Armstrong's Axioms

Inference Rules for FDs

Armstrong's inference rules:

IR1. (Reflexive) If Y ⊆ X, then X → Y

IR2. (Augmentation) If X → Y, then XZ → YZ

(Notation: XZ stands for X ∪ Z)

IR3. (Transitive) If X → Y and Y → Z, then X → Z

IR1, IR2, IR3 form a sound & complete set of inference rules

– Sound: they never generate a wrong FD

– Complete: they generate all FDs that hold

Inference Rules for FDs

Some additional inference rules that are useful:

Decomposition: If X → YZ, then X → Y & X → Z

Union: If X → Y & X → Z, then X → YZ

Pseudotransitivity: If X → Y & WY → Z, then WX → Z

• The last three inference rules, as well as any other inference rules, can be deduced from IR1, IR2, and IR3 (completeness property)

Example

• R = (A, B, C, G, H, I)

F = { A → B
      A → C
      CG → H
      CG → I
      B → H }

• some members of F+

– A → H

• by transitivity from A → B and B → H

– AG → I

• by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I

– CG → HI

• by the union rule on CG → H and CG → I

Closure of Attribute Sets

• Given a set of attributes α, define the closure of α under F (denoted by α+) as the set of attributes that are functionally determined by α under F

• Algorithm to compute α+, the closure of α under F

result := α;
while (changes to result) do
    for each β → γ in F do
    begin
        if β ⊆ result then result := result ∪ γ
    end
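
The closure algorithm translates almost line for line into Python. The sketch below assumes attributes are single letters, so a string such as "CG" stands for the set {C, G}; the function name `closure` and the list-of-pairs encoding of F are choices made for this illustration.

```python
def closure(attrs, fds):
    """Compute the closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)                      # result := alpha
    changed = True
    while changed:                           # while (changes to result)
        changed = False
        for lhs, rhs in fds:                 # for each beta -> gamma in F
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)           # result := result U gamma
                changed = True
    return result


# F = {A -> B, A -> C, CG -> H, CG -> I, B -> H}
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
print(sorted(closure("A", F)))   # ['A', 'B', 'C', 'H']
```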

Example of Attribute Set Closure

• R = (A, B, C, G, H, I)

• F = {A → B, A → C, CG → H, CG → I, B → H}

• (AG)+

1. result = AG

2. result = ABCG (A → C and A → B)

3. result = ABCGH (CG → H and CG ⊆ AGBC)

4. result = ABCGHI (CG → I and CG ⊆ AGBCH)

• Is AG a candidate key?

1. Is AG a super key?

1. Does AG → R? == Is (AG)+ ⊇ R

2. Is any subset of AG a superkey?

1. Does A → R? == Is (A)+ ⊇ R

2. Does G → R? == Is (G)+ ⊇ R
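
The candidate-key questions above can be answered mechanically with attribute closure. The following is a minimal sketch, again assuming single-letter attribute names; `is_superkey` and `is_candidate_key` are hypothetical helper names. It relies on the fact that if any proper subset of a set is a superkey, then so is some subset obtained by dropping a single attribute.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (list of (lhs, rhs) pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    """attrs is a superkey if its closure contains every attribute of the schema."""
    return set(schema) <= closure(attrs, fds)


def is_candidate_key(attrs, schema, fds):
    """A candidate key is a superkey from which no attribute can be dropped."""
    if not is_superkey(attrs, schema, fds):
        return False
    return not any(is_superkey(set(attrs) - {a}, schema, fds) for a in attrs)


R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(is_superkey("AG", R, F))        # True
print(is_superkey("A", R, F))         # False
print(is_superkey("G", R, F))         # False
print(is_candidate_key("AG", R, F))   # True
```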

Uses of Attribute Closure

There are several uses of the attribute closure algorithm:

• Testing for superkey:

– To test if α is a superkey, we compute α+, and check if α+ contains all attributes of R.

• Testing functional dependencies

– To check if a functional dependency α → β holds (or, in other words, is in F+), just check if β ⊆ α+.

– That is, we compute α+ by using attribute closure, and then check if it contains β.

– Is a simple and cheap test, and very useful

• Computing closure of F

– For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we output a functional dependency γ → S.

Canonical Cover

• Sets of functional dependencies may have redundant dependencies that can be inferred from the others

– For example: A → C is redundant in: {A → B, B → C, A → C}

– Parts of a functional dependency may be redundant

• E.g.: on RHS: {A → B, B → C, A → CD} can be simplified to {A → B, B → C, A → D}

• E.g.: on LHS: {A → B, B → C, AC → D} can be simplified to {A → B, B → C, A → D}

• Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies

59

Equivalence of Sets of FDs

• Two sets of FDs F and G are equivalent if:

- every FD in F can be inferred from G, &

- every FD in G can be inferred from F

• Hence, F and G are equivalent if F+=G+

Definition: F covers G if every FD in G can be inferred from F (i.e., if G+ ⊆ F+)

• F and G are equivalent if F covers G and G covers F

• There is an algorithm for checking equivalence of sets of FDs

Extraneous Attributes

• Consider a set F of functional dependencies and the functional dependency α → β in F.

– Attribute A is extraneous in α if A ∈ α and F logically implies (F – {α → β}) ∪ {(α – A) → β}.

– Attribute A is extraneous in β if A ∈ β and the set of functional dependencies (F – {α → β}) ∪ {α → (β – A)} logically implies F.

• Note: implication in the opposite direction is trivial in each of the cases above, since a “stronger” functional dependency always implies a weaker one

• Example: Given F = {A → C, AB → C}

– B is extraneous in AB → C because {A → C, AB → C} logically implies A → C (i.e. the result of dropping B from AB → C).

• Example: Given F = {A → C, AB → CD}

– C is extraneous in AB → CD since AB → C can be inferred even after deleting C

Testing if an Attribute is Extraneous

• Consider a set F of functional dependencies and the functional dependency α → β in F.

• To test if attribute A ∈ α is extraneous in α

1. compute ({α} – A)+ using the dependencies in F

2. check that ({α} – A)+ contains β; if it does, A is extraneous in α

• To test if attribute A ∈ β is extraneous in β

1. compute α+ using only the dependencies in F’ = (F – {α → β}) ∪ {α → (β – A)},

2. check that α+ contains A; if it does, A is extraneous in β

Canonical Cover

• A canonical cover for F is a set of dependencies Fc such that

– F logically implies all dependencies in Fc, and

– Fc logically implies all dependencies in F, and

– No functional dependency in Fc contains an extraneous attribute, and

– Each left side of functional dependency in Fc is unique.

• To compute a canonical cover for F:

repeat
    Use the union rule to replace any dependencies in F of the form α1 → β1 and α1 → β2 with α1 → β1 β2
    Find a functional dependency α → β with an extraneous attribute either in α or in β
    If an extraneous attribute is found, delete it from α → β
until F does not change

• Note: Union rule may become applicable after some extraneous attributes have been deleted, so it has to be re-applied

Computing Canonical Cover

• R = (A, B, C)

F = {A → BC, B → C, A → B, AB → C}

• Combine A → BC and A → B into A → BC

– Set is now {A → BC, B → C, AB → C}

• A is extraneous in AB → C

– Check if the result of deleting A from AB → C is implied by the other dependencies

• Yes: in fact, B → C is already present!

– Set is now {A → BC, B → C}

• C is extraneous in A → BC

– Check if A → C is logically implied by A → B and the other dependencies

• Yes: using transitivity on A → B and B → C.

– Can use attribute closure of A in more complex cases

• The canonical cover is: A → B, B → C
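
The repeat-until procedure for a canonical cover can be sketched in Python using attribute closure for the extraneous-attribute tests. This is an illustrative sketch rather than a reference implementation: the helper names are hypothetical, attributes are assumed to be single letters, and the order in which extraneous attributes are removed may differ from the walkthrough above even though the final result is the same for this example.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (list of (lhs, rhs) pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def remove_one_extraneous(cover):
    """Delete one extraneous attribute in place; return True if one was found."""
    for i, (lhs, rhs) in enumerate(cover):
        for a in sorted(lhs):
            # a is extraneous on the left if (lhs - a)+ under F still contains rhs
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, cover):
                cover[i] = (lhs - {a}, rhs)
                return True
        for a in sorted(rhs):
            # a is extraneous on the right if lhs+ under F' still contains a
            reduced = cover[:i] + [(lhs, rhs - {a})] + cover[i + 1:]
            if a in closure(lhs, reduced):
                cover[i] = (lhs, rhs - {a})
                return True
    return False


def canonical_cover(fds):
    cover = [(frozenset(l), frozenset(r)) for l, r in fds]
    while True:
        merged = {}
        for lhs, rhs in cover:  # union rule: one FD per left-hand side
            merged[lhs] = merged.get(lhs, frozenset()) | rhs
        changed = len(merged) != len(cover)
        cover = [(l, r) for l, r in merged.items() if r]  # drop FDs with empty right side
        if remove_one_extraneous(cover):
            changed = True
        if not changed:
            return cover


F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
Fc = canonical_cover(F)
for lhs, rhs in Fc:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

For the example set F this prints `A -> B` and `B -> C`, matching the canonical cover derived on the slide.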

Decomposition

1. Decomposing the schema

R = (bname, bcity, assets, cname, lno, amt)

R1 = (bname, bcity, assets, cname)    R2 = (cname, lno, amt)

R = R1 ∪ R2

2. Decomposing the instance

bname bcity assets cname lno amt

Downtown Bkln 9M Jones L-17 1000

Downtown Bkln 9M Johnson L-23 2000

Mianus Horse 1.7M Jones L-93 500

Downtown Bkln 9M Hayes L-17 1000

bname bcity assets cname

Downtown Bkln 9M Jones

Downtown Bkln 9M Johnson

Mianus Horse 1.7M Jones

Downtown Bkln 9M Hayes

cname lno amt

Jones L-17 1000

Johnson L-23 2000

Jones L-93 500

Hayes L-17 1000

Goals of Decomposition

1. Lossless Joins

Want to be able to reconstruct big (e.g. universal) relation by joining smaller ones (using natural joins)

(i.e. R1 ⋈ R2 = R)

2. Dependency preservation

Want to minimize the cost of global integrity constraints based on FD’s

( i.e. avoid big joins in assertions)

3. Redundancy Avoidance

Avoid unnecessary data duplication (the motivation for decomposition)

Why important?

LJ : information loss

DP: efficiency (time)

RA: efficiency (space), update anomalies

Lossy Decomposition

Original relation:

A B C
1 2 3
4 5 6
7 2 8

Projections:

A B
1 2
4 5
7 2

B C
2 3
5 6
2 8

JOIN of the projections (the last two rows are spurious tuples):

A B C
1 2 3
4 5 6
7 2 8
1 2 8
7 2 3
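
The spurious tuples can be reproduced with a small Python sketch that projects the three-row relation over A, B, C onto (A, B) and (B, C) and joins the projections back. The helpers `project` and `natural_join` are hypothetical names written for this illustration, with rows stored as dictionaries.

```python
def project(rows, attrs):
    """Project rows onto attrs, removing duplicate tuples."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]


def natural_join(left, right):
    """Join two lists of rows on all attribute names they have in common."""
    out = []
    for l in left:
        for r in right:
            common = l.keys() & r.keys()
            if all(l[c] == r[c] for c in common):
                out.append({**l, **r})
    return out


r = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 4, "B": 5, "C": 6},
    {"A": 7, "B": 2, "C": 8},
]

joined = natural_join(project(r, ["A", "B"]), project(r, ["B", "C"]))
spurious = [t for t in joined if t not in r]

print(len(joined))  # 5
print(spurious)     # [{'A': 1, 'B': 2, 'C': 8}, {'A': 7, 'B': 2, 'C': 3}]
```

The join returns more tuples than the original relation, so the original instance cannot be told apart from the noise.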

Decomposition Goal #1: lossless joins

A bad decomposition:

bname bcity assets cname

Downtown Bkln 9M Jones

Downtown Bkln 9M Johnson

Mianus Horse 1.7M Jones

Downtown Bkln 9M Hayes

⋈

cname lno amt

Jones L-17 1000

Johnson L-23 2000

Jones L-93 500

Hayes L-17 1000

=

bname bcity assets cname lno amt

Downtown Bkln 9M Jones L-17 1000

Downtown Bkln 9M Jones L-93 500

Downtown Bkln 9M Johnson L-23 2000

Mianus Horse 1.7M Jones L-17 1000

Mianus Horse 1.7M Jones L-93 500

Downtown Bkln 9M Hayes L-17 1000

Problem: join adds meaningless tuples

“lossy join”: by adding noise, have lost meaningful information as a

result of the decomposition

Decomposition Goal #1: lossless joins

Is the following decomposition lossless or lossy?

bname assets cname lno

Downtown 9M Jones L-17

Downtown 9M Johnson L-23

Mianus 1.7M Jones L-93

Downtown 9M Hayes L-17

lno bcity amt

L-17 Bkln 1000

L-23 Bkln 2000

L-93 Horse 500

Ans: Lossless: R = R1 ⋈ R2, it has 4 tuples

Ensuring Lossless Joins

A decomposition of R: R = R1 ∪ R2

is lossless iff

R1 ∩ R2 → R1, or

R1 ∩ R2 → R2

(i.e., intersecting attributes must form a superkey for one of the resulting smaller relations)

Lossless Decomposition

Theorem

A decomposition of R into R1 and R2 is lossless join wrt FDs F, if and only if at least one of the following dependencies is in F+:

• R1 ∩ R2 → R1

• R1 ∩ R2 → R2

In other words, R1 ∩ R2 forms a superkey of either R1 or R2
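
For a two-way decomposition, the theorem reduces to a single attribute-closure computation on the common attributes. Below is a minimal sketch, assuming single-letter attribute names; `is_lossless` is a hypothetical helper name.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (list of (lhs, rhs) pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 follows from fds."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


F = [("A", "B"), ("B", "C")]

print(is_lossless("AB", "BC", F))   # True:  B -> BC
print(is_lossless("AB", "AC", F))   # True:  A -> AB
print(is_lossless("AB", "BC", []))  # False: with no FDs, B determines neither side
```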

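Lossy Decomposition

S

S#   Status   City
S3   30       Paris
S5   30       Athens

First decomposition: projections on {S#, Status} and {S#, City}

S#   Status
S3   30
S5   30

S#   City
S3   Paris
S5   Athens

Second decomposition: projections on {S#, Status} and {Status, City}

S#   Status
S3   30
S5   30

Status   City
30       Paris
30       Athens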

Lossless Decomposition

• Observe that S satisfies the FDs:

– S# → Status & S# → City

• It cannot be a coincidence that S is equal to the join of its projections on {S#, Status} & {S#, City}

• Heath's Theorem:

Let R{A,B,C} be a relation, where A, B, & C are sets of attributes. If R satisfies A → B & A → C, then R is equal to the join of its projections on {A,B} & {A,C}

• Observe that in the second decomposition of S the FD S# → City is lost

Lossless Decomposition

• The decomposition of R into R1, R2, …, Rn is lossless if for any instance r of R

r = π_R1(r) ⋈ π_R2(r) ⋈ … ⋈ π_Rn(r)

• We can replace R by R1 & R2, knowing that the instance of R can be recovered from the instances of R1 & R2

• We can use FDs to show that decompositions are lossless

Decomposition Goal #2: Dependency preservation

Goal: efficient integrity checks of FD’s

An example w/ no DP:

R = (bname, bcity, assets, cname, lno, amt)

bname → bcity assets

lno → amt bname

Decomposition: R = R1 ∪ R2

R1 = (bname, assets, cname, lno)

R2 = (lno, bcity, amt)

Lossless but not DP. Why?

Ans: bname → bcity assets crosses 2 tables

Decomposition Goal #2: Dependency preservation

To ensure best possible efficiency of FD checks, ensure that only a SINGLE table is needed in order to check each FD

i.e. ensure that: A1 A2 ... An → B1 B2 ... Bm

can be checked by examining Ri = ( ..., A1, A2, ..., An, ..., B1, ..., Bm, ...)

To test if the decomposition R = R1 ∪ R2 ∪ ... ∪ Rn is DP

(1) see which FD’s of R are covered by R1, R2, ..., Rn

(2) compare the closure of (1) with the closure of FD’s of R

Decomposition Goal #2: Dependency preservation

Example: Given F = {A → B, AB → D, C → D}

consider R = R1 ∪ R2 s.t.

R1 = (A, B, D), R2 = (C, D)

(1) F+ = {A → BD, C → D}+

(2) G+ = {A → BD, C → D, ...}+

(3) F+ = G+

note: G+ cannot introduce new FDs not in F+

Decomposition is DP

Dependency Preservation

• Let Fi be the set of dependencies in F+ that include only attributes in Ri.

• A decomposition is dependency preserving, if

(F1 ∪ F2 ∪ … ∪ Fn)+ = F+

• If it is not, then checking updates for violation of functional dependencies may require computing joins, which is expensive.

Testing for Dependency Preservation

• To check if a dependency α → β is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done with respect to F)

– result = α
  while (changes to result) do
      for each Ri in the decomposition
          t = (result ∩ Ri)+ ∩ Ri
          result = result ∪ t

– If result contains all attributes in β, then the functional dependency α → β is preserved.

• We apply the test on all dependencies in F to check if a decomposition is dependency preserving

• This procedure takes polynomial time, instead of the exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+
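
The polynomial-time test can be written down directly. The sketch below assumes single-letter attribute names and a decomposition given as a list of strings; `is_preserved` is a hypothetical helper name.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (list of (lhs, rhs) pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(lhs, rhs, decomposition, fds):
    """Test whether lhs -> rhs can be checked using the decomposed relations only."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            ri = set(ri)
            t = closure(result & ri, fds) & ri   # t = (result ∩ Ri)+ ∩ Ri
            if not t <= result:
                result |= t
                changed = True
    return set(rhs) <= result


F = [("A", "B"), ("B", "C")]

print(all(is_preserved(l, r, ["AB", "BC"], F) for l, r in F))  # True
print(is_preserved("B", "C", ["AB", "AC"], F))                 # False
```

With R1 = (A, B) and R2 = (A, C), the dependency B → C cannot be verified inside either relation, which is why the second call reports that it is not preserved.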

Example

• R = (A, B, C)

F = {A → B, B → C}

– Can be decomposed in two different ways

• R1 = (A, B), R2 = (B, C)

– Lossless-join decomposition:

R1 ∩ R2 = {B} and B → BC

– Dependency preserving

• R1 = (A, B), R2 = (A, C)

– Lossless-join decomposition:

R1 ∩ R2 = {A} and A → AB

– Not dependency preserving (cannot check B → C without computing R1 ⋈ R2)

Decomposition Goal #3: Redundancy Avoidance

Example:

A B C
a x 1
e x 1
g y 2
h y 2
m y 2
n z 1
p z 1

Redundancy for B = x, y and z

(1) An FD that exists in the above relation is: B → C

(2) A superkey in the above relation is A (or any set containing A)

When do you have redundancy?

Ans: when there is some FD X → Y covered by a relation and X is not a superkey

Problems with Decompositions

There are three potential problems to consider:

– Some queries become more expensive

• e.g., What is the price of prop# 1?

– Given instances of the decomposed relations, we

may not be able to reconstruct the corresponding

instance of the original relation!

• Fortunately, not in the PLOTS example

– Checking some dependencies may require joining the

instances of the decomposed relations.

• Fortunately, not in the PLOTS example

Tradeoff: Must consider these issues vs. redundancy

Example

• R = (A, B, C)

F = {A → B, B → C}

Key = {A}

• R is not in BCNF (B → C but B is not a superkey)

• Decomposition R1 = (A, B), R2 = (B, C)

– R1 and R2 in BCNF

– Lossless-join decomposition

– Dependency preserving

Testing for BCNF

• To check if a non-trivial dependency α → β causes a violation of BCNF

1. compute α+ (the attribute closure of α), and

2. verify that it includes all attributes of R, that is, it is a superkey of R.

• Simplified test: To check if a relation schema R is in BCNF, it suffices to check only the dependencies in the given set F for violation of BCNF, rather than checking all dependencies in F+.

– If none of the dependencies in F causes a violation of BCNF, then none of the dependencies in F+ will cause a violation of BCNF either.

• However, the simplified test using only F is incorrect when testing a relation in a decomposition of R

– Consider R = (A, B, C, D, E), with F = {A → B, BC → D}

• Decompose R into R1 = (A, B) and R2 = (A, C, D, E)

• Neither of the dependencies in F contains only attributes from (A, C, D, E), so we might be misled into thinking R2 satisfies BCNF.

• In fact, dependency AC → D in F+ shows R2 is not in BCNF.

BCNF and Dependency Preservation

• R = (J, K, L)

F = {JK → L, L → K}

Two candidate keys = JK and JL

• R is not in BCNF

• Any decomposition of R will fail to preserve JK → L

This implies that testing for JK → L requires a join

It is not always possible to get a BCNF decomposition that is dependency preserving

Third Normal Form: Motivation

• There are some situations where

– BCNF is not dependency preserving, and

– efficient checking for FD violation on updates is

important

• Solution: define a weaker normal form, called Third

Normal Form (3NF)

– Allows some redundancy (with resultant problems; we

will see examples later)

– But functional dependencies can be checked on

individual relations without computing a join.

– There is always a lossless-join, dependency-preserving decomposition into 3NF.

Redundancy in 3NF

• There is some redundancy in this schema

• Example of problems due to redundancy in 3NF

– R = (J, K, L)

F = {JK → L, L → K}

J      L    K
j1     l1   k1
j2     l1   k1
j3     l1   k1
null   l2   k2

• repetition of information (e.g., the relationship l1, k1)

– (i_ID, dept_name)

• need to use null values (e.g., to represent the relationship l2, k2 where there is no corresponding value for J).

– (i_ID, dept_name) if there is no separate relation mapping instructors to departments

Testing for 3NF

• Optimization: Need to check only FDs in F, need not check all FDs in F+.

• Use attribute closure to check for each dependency α → β, if α is a superkey.

• If α is not a superkey, we have to verify if each attribute in β is contained in a candidate key of R

– this test is rather more expensive, since it involves finding candidate keys

– testing for 3NF has been shown to be NP-hard

– Interestingly, decomposition into third normal form (described shortly) can be done in polynomial time

3NF Decomposition Algorithm

Let Fc be a canonical cover for F;
i := 0;
for each functional dependency α → β in Fc do
    if none of the schemas Rj, 1 ≤ j ≤ i contains α β
        then begin
            i := i + 1;
            Ri := α β
        end
if none of the schemas Rj, 1 ≤ j ≤ i contains a candidate key for R
    then begin
        i := i + 1;
        Ri := any candidate key for R;
    end
/* Optionally, remove redundant relations */
repeat
    if any schema Rj is contained in another schema Rk
        then /* delete Rj */
            Rj := Ri;
            i := i - 1;
until no more Rj can be deleted
return (R1, R2, ..., Ri)

Testing Decomposition for BCNF

• To check if a relation Ri in a decomposition of R is in BCNF,

– Either test Ri for BCNF with respect to the restriction of F to Ri (that is, all FDs in F+ that contain only attributes from Ri)

– or use the original set of dependencies F that hold on R, but with the following test:

– for every set of attributes α ⊆ Ri, check that α+ (the attribute closure of α) either includes no attribute of Ri – α, or includes all attributes of Ri.

• If the condition is violated by some α → β in F, the dependency α → (α+ – α) ∩ Ri can be shown to hold on Ri, and Ri violates BCNF.

• We use the above dependency to decompose Ri

BCNF Decomposition Algorithm

result := {R};
done := false;
compute F+;
while (not done) do
    if (there is a schema Ri in result that is not in BCNF)
        then begin
            let α → β be a nontrivial functional dependency that holds on Ri
                such that α → Ri is not in F+, and α ∩ β = ∅;
            result := (result – Ri) ∪ (Ri – β) ∪ (α, β);
        end
    else done := true;

Note: each Ri is in BCNF, and decomposition is lossless-join.
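
The loop can be sketched in Python by replacing "compute F+" with the subset-based closure test from the "Testing Decomposition for BCNF" slide. This is an illustration under stated assumptions: attributes are single letters, the subset search is exponential in the size of a schema (fine for small examples only), and the names `find_violation` and `bcnf_decompose` are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds (list of (lhs, rhs) pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(ri, fds):
    """Return (alpha, beta) with alpha -> beta violating BCNF on ri, or None."""
    ri = frozenset(ri)
    for size in range(1, len(ri)):
        for alpha in combinations(sorted(ri), size):
            alpha = frozenset(alpha)
            alpha_plus = closure(alpha, fds)
            beta = frozenset((alpha_plus - alpha) & ri)   # (alpha+ - alpha) ∩ Ri
            if beta and not ri <= alpha_plus:             # determines something, but is not a superkey
                return alpha, beta
    return None


def bcnf_decompose(r, fds):
    result = [frozenset(r)]
    done = False
    while not done:
        done = True
        for ri in result:
            violation = find_violation(ri, fds)
            if violation:
                alpha, beta = violation
                result.remove(ri)
                result += [ri - beta, alpha | beta]       # (Ri - beta) and (alpha, beta)
                done = False
                break
    return result


schemas = bcnf_decompose("ABC", [("A", "B"), ("B", "C")])
print(sorted("".join(sorted(s)) for s in schemas))  # ['AB', 'BC']
```

For R = (A, B, C) with A → B and B → C, the sketch splits on B → C and returns (A, B) and (B, C), the decomposition shown in the earlier example.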


Example of BCNF Decomposition

• class (course_id, title, dept_name, credits, sec_id, semester, year, building, room_number, capacity, time_slot_id)

• Functional dependencies:

– course_id→ title, dept_name, credits

– building, room_number→capacity

– course_id, sec_id, semester, year→building, room_number, time_slot_id

• A candidate key {course_id, sec_id, semester, year}.

• BCNF Decomposition:

– course_id→ title, dept_name, credits holds

• but course_id is not a superkey.

– We replace class by:

• course(course_id, title, dept_name, credits)

• class-1 (course_id, sec_id, semester, year, building, room_number, capacity, time_slot_id)

BCNF Decomposition (Cont.)

• course is in BCNF

– How do we know this?

• building, room_number→capacity holds on class-1

– but {building, room_number} is not a superkey for class-1.

– We replace class-1 by:

• classroom (building, room_number, capacity)

• section (course_id, sec_id, semester, year, building,

room_number, time_slot_id)

• classroom and section are in BCNF.

4 NF

• BCNF removes any anomalies due to FDs

• Further research has led to the identification of another type of dependency called Multi-valued Dependency (MVD)

• Proposed by R Fagin* in 1977

• MVDs can also cause data redundancy

• MVDs are a generalization of FDs

* R Fagin: “Multi-valued Dependencies & a new normal form for relational databases,” ACM TODS 2, No. 3 (Sept. 1977)

4 NF

• Consider the following relation:

Course     Teacher                      Texts
DBS        N Goyal, J P Misra, Yash     Garcia, Korth, Elmasiri, Raghu
Networks   S Mohan, Rahul, J P Misra    Tannenbaum, Keshav, Petterson

• In relational databases, repeating groups are not allowed

4 NF

• 1 NF Version

COURSE TEACHER TEXTS

DBS N GOYAL GARCIA

DBS N GOYAL KORTH

DBS N GOYAL ELMASIRI

DBS N GOYAL RAGHU R

DBS J P MISRA GARCIA

DBS J P MISRA KORTH

DBS J P MISRA ELMASIRI

DBS J P MISRA RAGHU R

NETWORKS S MOHAN TANNENBAUM

NETWORKS S MOHAN KESHAV

NETWORKS S MOHAN KUROSE

NETWORKS RAHUL TANNENBAUM

NETWORKS RAHUL KESHAV

NETWORKS RAHUL KUROSE

CTX

4 NF

• ANY REDUNDANCY? ANY ANOMALIES?


4 NF

• Redundancy is due to the constraint that the texts for a course are independent of the instructors

• This constraint cannot be expressed in terms of FDs

• Example of MVD

• Is CTX in BCNF?

• New Teacher for DBS

• New Text for Networks

• Teacher teaching DBS leaves

4 NF

• Decompose CTX into CT & TX

CT

COURSE     TEACHER
DBS        N GOYAL
DBS        J P MISRA
DBS        S JAGADISH
NETWORKS   S MOHAN
NETWORKS   RAHUL
NETWORKS   J P MISRA

TX

COURSE     TEXT
DBS        GARCIA
DBS        KORTH
DBS        ELMASIRI
DBS        RAGHU R
NETWORKS   TANNENBAUM
NETWORKS   KESHAV
NETWORKS   PETTERSON

4 NF

• The decomposition of CTX into CT & TX is not done on the basis of FDs

• It is done on the basis of MVDs

• An MVD represents a dependency between attributes of a relation such that, for every value of A, there is a set of values of B and a set of values of C, and the sets of values for B and C are independent of each other

course →→ teacher

course →→ text
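The decomposition can be checked on the 1NF instance of CTX. The sketch below is illustrative only: it rebuilds the 14 rows of CTX from per-course teacher and text lists (variable names are made up), projects CTX onto (course, teacher) and (course, text), and confirms that the natural join of the two projections gives back exactly CTX.

```python
from itertools import product

# Per-course teachers and texts, as in the 1NF version of CTX.
teachers = {"DBS": ["N GOYAL", "J P MISRA"],
            "NETWORKS": ["S MOHAN", "RAHUL"]}
texts = {"DBS": ["GARCIA", "KORTH", "ELMASIRI", "RAGHU R"],
         "NETWORKS": ["TANNENBAUM", "KESHAV", "PETTERSON"]}

# Teachers and texts are independent, so CTX holds every combination per course.
ctx = {(c, t, x) for c in teachers for t, x in product(teachers[c], texts[c])}

# Projections onto (course, teacher) and (course, text).
ct = {(c, t) for c, t, _ in ctx}
tx = {(c, x) for c, _, x in ctx}

# Natural join of the projections on course.
rejoined = {(c, t, x) for (c, t) in ct for (c2, x) in tx if c == c2}

print(len(ctx), len(ct), len(tx))   # rows in CTX, CT, TX
print(rejoined == ctx)              # the decomposition loses no information
```

The join is lossless precisely because course →→ teacher holds; with an arbitrary three-column table the rejoined set could contain extra rows.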


Multi-Valued Dependencies

• A multi-valued dependency occurs when a determinant determines more than one dependent, and the dependents are independent of each other

• Example: course multi-determines teacher; course multi-determines text, where teacher and text are independent

• A relation with course, instructor and text is all key, and exhibits redundancy, but is in 3NF

• Updates can exhibit anomalies

4 NF

• An MVD A →→ B is trivial if

(a) B ⊆ A or

(b) A ∪ B = R

• A relation that is in BCNF & contains no non-trivial MVDs (other than those whose determinant is a superkey) is said to be in 4NF

• CTX is not in 4NF because course →→ teacher is a non-trivial MVD

Fourth Normal Form

• Relation R is in 4 NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial multi-valued dependency A multi-determines B is satisfied, then all attributes of R are also functionally dependent on A

• In the previous example, decompose (course, instructor, text) into two relations: (course, instructor) and (course, text)
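The 4NF condition can be phrased as a small test on one MVD at a time. This is a hedged sketch under stated assumptions: the function names are hypothetical, and the candidate keys are supplied by hand instead of being derived from FDs.

```python
def is_trivial_mvd(lhs, rhs, schema):
    """A ->> B is trivial if B is a subset of A or A union B is the whole schema."""
    return rhs <= lhs or (lhs | rhs) == schema


def violates_4nf(lhs, rhs, schema, candidate_keys):
    """A non-trivial MVD violates 4NF when its determinant is not a superkey."""
    is_superkey = any(key <= lhs for key in candidate_keys)
    return not is_trivial_mvd(lhs, rhs, schema) and not is_superkey


# CTX is all key: its only candidate key is the full attribute set.
CTX = {"course", "teacher", "text"}
print(violates_4nf({"course"}, {"teacher"}, CTX, [CTX]))   # True

# After decomposition, course ->> teacher is trivial on (course, teacher).
CT = {"course", "teacher"}
print(violates_4nf({"course"}, {"teacher"}, CT, [CT]))     # False
```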


Multi-Valued Dependencies

• An MVD is an assertion that 2 attributes or sets of attributes are independent of each other

• Generalization of the concept of FD in the sense that every FD implies a corresponding MVD

• Independence of attribute sets cannot be explained using FDs

• So what causes MVDs?

• Role of MVDs in database schema design

Multi-Valued Dependencies

• Most common source of redundancy in BCNF schemas is to put 2 or more M:M relationships in a single relation

• Note that in CTX, there are no non-trivial FDs

• If you fix the values for one set of attributes, then the values in certain other attributes are independent of all the other attributes in the relation

Multivalued Dependencies (MVDs)

• Let R be a relation schema and let α ⊆ R and β ⊆ R.

The multivalued dependency

α →→ β

holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:

t1[α] = t2[α] = t3[α] = t4[α]

t3[β] = t1[β]

t3[R – β] = t2[R – β]

t4[β] = t2[β]

t4[R – β] = t1[R – β]

MVD (Cont.)

• Tabular representation of α →→ β

Formal Definition of MVD

• The MVD

A1A2…An →→ B1B2…Bm

holds for a relation R if, for each pair of tuples t & u that agree on the As, we can find a tuple v that agrees

1. With t & u on the As

2. With t on the Bs

3. With u on all attributes of R that are not among the As & Bs
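This definition translates directly into a check on a relation instance. The sketch below is a straightforward, unoptimized reading of the three conditions; the function name `mvd_holds` and its argument layout are assumptions made for illustration.

```python
def mvd_holds(rows, attrs, lhs, rhs):
    """Check A ->> B on an instance: rows are tuples, attrs names their columns."""
    idx = {name: i for i, name in enumerate(attrs)}
    rowset = set(rows)
    for t in rowset:
        for u in rowset:
            # Only pairs that agree on the As matter.
            if any(t[idx[a]] != u[idx[a]] for a in lhs):
                continue
            # v takes the As and Bs from t and every other attribute from u.
            v = tuple(t[idx[name]] if (name in lhs or name in rhs) else u[idx[name]]
                      for name in attrs)
            if v not in rowset:
                return False
    return True


attrs = ("course", "teacher", "text")
rows = [
    ("DBS", "N GOYAL", "GARCIA"),
    ("DBS", "N GOYAL", "KORTH"),
    ("DBS", "J P MISRA", "GARCIA"),
    ("DBS", "J P MISRA", "KORTH"),
]
print(mvd_holds(rows, attrs, ["course"], ["teacher"]))        # True
print(mvd_holds(rows[:-1], attrs, ["course"], ["teacher"]))   # False: a row is missing
```

Note that such a check can only refute an MVD on sample data; whether the MVD is a real constraint is a design decision about all legal instances.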

MVD

[Figure: tuples t, v, and u shown against the A's, the B's, and the other attributes; v agrees with t on the A's and B's and with u on the others]

• 4NF

• 5NF

• 6NF

• DKNF


• Fourth Normal Form (4NF)

– Eliminates data redundancy caused by multi-valued dependencies (MVDs).

– A given relation in 4NF may not contain more than one multi-valued dependency.

• MVD?

A multi-valued dependency (X →→ Y) holds in a relation R if, whenever we have two tuples of R that agree on all the attributes of X, we can swap their Y components and get two tuples that are also in R.

• Example

• In relation R(A,B,C), how can we find whether A →→ B holds?

• If the relation has two tuples

A  B  C
1  7  4
1  3  2

then the table should also contain two other tuples in which the B's are swapped:

A  B  C
1  3  4
1  7  2

Do this for all tuples that have the same A values.
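The swap rule can be turned into a small routine that lists the tuples a table would still need for A →→ B to hold. This is a sketch for a three-column relation R(A, B, C) only, and the function name is made up for illustration.

```python
def missing_for_mvd(rows):
    """Tuples that R(A, B, C) lacks for A ->> B to hold on this instance."""
    rowset = set(rows)
    # For each pair of rows with the same A, pair the B of one with the C of the other.
    needed = {(a1, b1, c2)
              for (a1, b1, c1) in rowset
              for (a2, b2, c2) in rowset
              if a1 == a2}
    return needed - rowset


print(sorted(missing_for_mvd([(1, 7, 4), (1, 3, 2)])))
# [(1, 3, 4), (1, 7, 2)]
```

An empty result means the instance already satisfies A →→ B.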


• What is so bad about having a table with multiple multi-valued dependencies?

• Example: Consider R(Department, Job, Part)

The table has the following MVDs:

Department →→ Part

Department →→ Job

• Department d1 works on jobs j1 and j2 with parts p1 and p2

• Department d2 works on jobs j3, j4, and j5 with parts p2 and p4

• Department d3 works on job j2 only, with parts p5 and p6.

Department  Job  Part#
d1          j1   p1
d1          j1   p2
d1          j2   p1
d1          j2   p2
d2          j3   p2
d2          j3   p4
d2          j4   p2
d2          j4   p4
d2          j5   p2
d2          j5   p4
d3          j2   p5
d3          j2   p6

Department →→ Job

Department →→ Part

• If you want to add a part to a department, you must create more than one new row.

• Likewise, removing a part or a job can destroy information.

• Updating a part or job name will also require multiple rows to be changed.

• The solution is to split this table into two tables, one with (department, jobs) in it and one with (department, parts) in it.
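The insertion and deletion anomalies above can be counted on the department example. The sketch below rebuilds the 12-row table from the three bullet descriptions; the new part `p9` and all variable names are hypothetical, introduced only to illustrate the anomalies.

```python
# Jobs and parts per department, from the example.
jobs = {"d1": ["j1", "j2"], "d2": ["j3", "j4", "j5"], "d3": ["j2"]}
parts = {"d1": ["p1", "p2"], "d2": ["p2", "p4"], "d3": ["p5", "p6"]}

# Single table: every job of a department paired with every part it uses.
combined = {(d, j, p) for d in jobs for j in jobs[d] for p in parts[d]}

# Insertion anomaly: giving d2 a new part p9 (hypothetical) needs one row per job.
rows_for_new_part = {("d2", j, "p9") for j in jobs["d2"]}
print(len(combined), len(rows_for_new_part))   # 12 rows, 3 new rows for one fact

# Deletion anomaly: if d3 stops working on j2, its part information vanishes too.
after_delete = {row for row in combined if row[:2] != ("d3", "j2")}
print(any(d == "d3" for d, _, _ in after_delete))   # False: p5 and p6 are lost

# Split tables: one row per fact, and the two facts no longer depend on each other.
dept_job = {(d, j) for d in jobs for j in jobs[d]}
dept_part = {(d, p) for d in parts for p in parts[d]}
dept_job.discard(("d3", "j2"))
print(sorted(p for d, p in dept_part if d == "d3"))   # ['p5', 'p6'] still recorded
```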

**The only desirable MVDs are those whose determinant is a superkey of R.

Special Case: Assume R has the following two multi-valued dependencies:

A →→ B and B →→ C

In this case R will be in the fourth normal form iff B and C are dependent on each other.

A relation R is in 5NF if, for every join dependency *(R1, R2, ..., Rn) that holds on R, at least one of the following holds.

(a) *(R1, R2, ..., Rn) is a trivial join dependency.

(b) Every Ri is a superkey for R.

• A table is said to be in 5NF iff it is in 4NF and every join dependency in it is implied by the candidate keys.

• Sometimes it is impossible to break the table into 2 tables without losing information; that is when you can use the rules of 5NF to normalize.

• A table in 4NF is usually also in 5NF, but sometimes a real-world constraint will cause the relation not to comply with 5NF.

• Join dependencies are a generalization of MVDs.

• A join dependency is a condition where the natural join of a set of projections of R results in the reconstruction of R.

• If such a condition is present, then the relation should be replaced with the tables that consist of its projections.

The psychiatrist is able to offer reimbursable treatment to patients who suffer from the given condition and who are insured by the given insurer. Psychiatrist-to-Insurer-to-Condition is necessary in order to model the situation correctly.

• Suppose, however, that the following rule applies: When a psychiatrist is authorized to offer reimbursable treatment to patients insured by Insurer P, and the psychiatrist is able to treat condition C, then (in the event that Insurer P covers condition C) it must be true that the psychiatrist is able to provide treatment to patients who suffer from condition C and are insured by Insurer P.

These are all the possible binary projections of the previous table. If (R1 ⋈ R2), (R2 ⋈ R3), or (R1 ⋈ R3) results in R, then there is an MVD (a 4NF issue); if the natural join of {R1, R2, R3} results in R, then a join dependency exists and the original table is not in 5NF.
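The difference between the pairwise joins and the three-way join can be seen on a small instance. The data below is made up for illustration (placeholder values p1, p2 for psychiatrists, i1, i2 for insurers, c1, c2 for conditions) and is not the table from the slide; it is chosen so that it obeys the rule stated above.

```python
# Hypothetical instance of R(psychiatrist, insurer, condition).
r = {("p1", "i1", "c2"), ("p1", "i2", "c1"), ("p2", "i1", "c1"), ("p1", "i1", "c1")}

# The three binary projections R1, R2, R3.
pi = {(p, i) for p, i, _ in r}   # psychiatrist-insurer
pc = {(p, c) for p, _, c in r}   # psychiatrist-condition
ic = {(i, c) for _, i, c in r}   # insurer-condition

# Pairwise natural joins: each one produces a spurious row, so none equals R.
pi_pc = {(p, i, c) for (p, i) in pi for (p2, c) in pc if p == p2}
pi_ic = {(p, i, c) for (p, i) in pi for (i2, c) in ic if i == i2}
pc_ic = {(p, i, c) for (p, c) in pc for (i, c2) in ic if c == c2}
print(pi_pc == r, pi_ic == r, pc_ic == r)   # False False False

# Natural join of all three projections: the spurious rows are filtered out.
three_way = {(p, i, c) for (p, i, c) in pi_pc if (i, c) in ic}
print(three_way == r)                       # True: a join dependency holds
```

On this instance no two-table split is lossless, but the three-table split is, which is the situation 5NF addresses.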

• Only in rare situations does a 4NF table not conform to 5NF. These are situations in which a complex real-world constraint governing the valid combinations of attribute values in the 4NF table is not implicit in the structure of that table.

Fifth Normal Form

• A relation R is in 5NF (also called projection-join normal form) if and only if every nontrivial join dependency that is satisfied by R is implied by the candidate key(s) of R

• It is the most general form possible for projection-based normalization

• DKNF offers a complete solution to the problem of avoiding modification anomalies

• Domain/key normal form (DKNF): every constraint on the relation is a logical consequence of its key constraints and domain constraints. A key uniquely identifies each row in a table.

• By enforcing key and domain restrictions, the database is assured of being free from any modification anomaly.

• Ronald Fagin (1981) proved that if a relation is in DKNF then it is free from any anomalies (redundancies), including the ones caused by FDs, MVDs, and JDs.

• DKNF seems simple enough, so why all the hoopla about 1NF, 2NF, 3NF, BCNF, 4NF, 5NF?

DKNF is not always achievable, and there is no formal procedure to verify whether a relation schema is in DKNF

In short, sets of single-theme tables will most likely be in DKNF.

Denormalization

• Denormalization is said to be necessary to improve performance

• Technically normalization is a model concept, not related to stored files

• In practice, denormalization will speed up some queries, and drag down others

