
CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson


Page 1: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

CM20145 Further DB Design – Normalization

Dr Alwyn Barry, Dr Joanna Bryson

Page 2: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Last Time

Database design is an ongoing, iterative process.

Requirements come from data, user demands, design issues.

Change occurs: corporations & technologies grow; programmers & users learn.

Views / security.

Lossless-join decomposition.

Now: Science for improving design.

Page 3: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Design Process & Normalization

We assume a schema R is given.

R could have been generated when converting an E-R diagram to a set of tables.

R could have been a single relation containing all attributes that are of interest (called the universal relation). Normalization breaks R into smaller relations.

R could be the result of any ad hoc design of relations, which we then test & convert to normal form.

Page 4: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Overview

First Normal Form.
Functional Dependencies.
Second Normal Form.
Third Normal Form.
Boyce-Codd Normal Form.
Fourth Normal Form.
Fifth Normal Form.
Domain Key / Normal Form.
Design Process & Problems.

Page 5: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

First Normal Form – 1NF

You aren’t supposed to have more than one value per attribute of a tuple.

All tuples have the same number of attributes.

Necessary for a relational database.

BAD:

Name    Office  Office Hours
Barry   2.23    1pm, 4pm
Bryson  L2.27   11am, 5pm
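One way to bring the table above into 1NF is to move the office hours into a separate relation with one value per row. The following is a minimal sketch of that idea; the dictionary keys and the function name `to_first_normal_form` are assumptions made for illustration, not part of the slides.

```python
# Rows mirroring the slide's non-1NF table: office_hours holds a list in one string.
staff = [
    {"name": "Barry", "office": "2.23", "office_hours": "1pm, 4pm"},
    {"name": "Bryson", "office": "L2.27", "office_hours": "11am, 5pm"},
]

def to_first_normal_form(rows):
    """Split the multi-valued office_hours column into one row per value."""
    staff_rows = [(r["name"], r["office"]) for r in rows]
    hour_rows = [
        (r["name"], hour.strip())
        for r in rows
        for hour in r["office_hours"].split(",")
    ]
    return staff_rows, hour_rows

staff_rows, hour_rows = to_first_normal_form(staff)
```

Each resulting tuple now carries a single office hour, so every attribute value is atomic.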

Page 6: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Getting Caught Out With 1NF

A domain is atomic if its elements are considered to be indivisible units.

Examples of non-atomic domains:
  Set-valued attributes, composite attributes.
  Identifiers like CS101 that can be broken up into parts.

A relational schema R is in first normal form if the domains of all attributes of R are atomic.

Non-atomic values:
  complicate storage,
  encourage redundancy,
  depend on interpretation built into application programs.

Page 7: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Are You Atomic?

Atomicity is not an intrinsic property of the elements of the domain.

Atomicity is a property of how the elements of the domain are used.

E.g. strings containing a possible delimiter (here: a space)
  cities = “Melbourne Sydney” (non-atomic: space separated list)
  surname = “Fortescue Smythe” (atomic: compound surname)

E.g. strings encoding two separate fields
  bucs_login = cssjjb
  If the first two characters are extracted to find the department, the domain bucs_login is not atomic.
  Leads to encoding of information in the application program rather than in the database.

Page 8: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Second Normal Form (2NF)

Violated when a nonkey column is a fact about part of the primary key.

A column is not fully functionally dependent on the full primary key. CUSTOMER-CREDIT in this case:

BAD:

ORDER
ITEMID  CUSTOMERID  QUANTITY  CUSTOMER-CREDIT
Desk    JJB         25        OK
Chair   AMB         3         POOR

FIX:

ITEM (*itemid, …)
ORDER (quantity, …)
CUSTOMER (*customerid, customer-credit)

(From Watson)
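The fix for the ORDER table can be sketched in code: CUSTOMER-CREDIT depends only on CUSTOMERID, so it moves to a CUSTOMER relation keyed on the customer. This is a minimal sketch under that assumption; the function name `split_order` and the tuple layout are illustrative choices.

```python
# ORDER rows from the slide: (itemid, customerid, quantity, customer_credit).
orders = [
    ("Desk", "JJB", 25, "OK"),
    ("Chair", "AMB", 3, "POOR"),
]

def split_order(rows):
    """Move customer_credit, which depends only on customerid, to its own table."""
    order_rows = [(item, cust, qty) for item, cust, qty, _ in rows]
    customer_rows = {}
    for _, cust, _, credit in rows:
        # If customerid -> customer_credit really holds, a customer never
        # appears with two different credit ratings.
        if customer_rows.setdefault(cust, credit) != credit:
            raise ValueError(f"conflicting credit for {cust}")
    return order_rows, customer_rows

order_rows, customer_rows = split_order(orders)
```

After the split, a customer's credit rating is stored once, however many orders that customer places.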

Page 9: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Def: Functional Dependency

Let R be a relation schema, with α ⊆ R and β ⊆ R.

The functional dependency (FD) α → β holds on R (“β is FD on α”) iff for any legal relation r(R): whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β, i.e. α(t1) = α(t2) ⇒ β(t1) = β(t2).

Example: Consider r(A,B) with the following instance of r:

A: Initials  B: Chore
JJB          Grading
AMB          Setting Tutorials
JJB          Writing Unit Reviews

A → B does NOT hold, but B → A does hold.
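The definition can be tested mechanically on a single instance: group tuples by their left-hand-side values and look for two different right-hand-side values. The sketch below assumes rows are represented as dictionaries; `fd_holds` is a hypothetical helper name.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is satisfied by this instance (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True

r = [
    {"A": "JJB", "B": "Grading"},
    {"A": "AMB", "B": "Setting Tutorials"},
    {"A": "JJB", "B": "Writing Unit Reviews"},
]
a_to_b = fd_holds(r, ["A"], ["B"])
b_to_a = fd_holds(r, ["B"], ["A"])
```

A check like this can refute an FD on a schema, but a `True` result only says that this particular instance satisfies it.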

Page 10: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Functional Dependencies: Uses

Way to encode “business rules”.

Specify constraints on the set of legal relations. We say that F holds on R if all legal relations on R satisfy the set of FDs F.

Test relations to see if they are legal under a given set of FDs. If a relation r is legal under a set F of FDs, we say that r satisfies F.

Page 11: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Functional Dependencies

An FD is an assertion about a schema, not an instance.

If we only consider an instance or a few instances, we can’t tell if an FD holds. Inspecting only a few bird species (e.g. crows, cardinals and canaries) we might conclude colour → species.

However, this would be a bad FD, as we would find out if we found some ravens.

Thus, identifying FDs is part of the data modelling process.

Page 12: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Trivial Functional Dependencies

An FD is trivial if it is satisfied by all instances of a relation. E.g.

customer-name, loan-number → customer-name

customer-name → customer-name

In general, α → β is trivial if β ⊆ α.

Permitting such FDs makes certain definitions and algorithms easier to state.

Page 13: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Functional Dependency vs Key

FDs can express the same constraints we could express using keys:

Superkeys: K is a superkey for relation schema R if and only if K → R.

Candidate keys: K is a candidate key for R if and only if K → R, and there is no K’ ⊂ K such that K’ → R.

Of course, which candidate key becomes the primary key is arbitrary.

Page 14: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

FDs <> Keys

FDs can represent more information than keys can on their own.

Consider the following Loan-info-schema:
Loan-info-schema = (customer-name, loan-number, branch-name, amount).

We expect these FDs to hold:
loan-number → amount
loan-number → branch-name

We could try to express this by making loan-number the key; however, the following FD does not hold:
loan-number → customer-name

Incidentally, this isn’t a very good table! (¬2NF)

Page 15: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

FD Closure

Given a set F of FDs, other FDs are logically implied. E.g. if A → B and B → C, we can infer that A → C.

The set of all FDs implied by F is the closure of F, written F+.

Find F+ by applying Armstrong’s Axioms:
  if β ⊆ α, then α → β (reflexivity)
  if α → β, then γα → γβ (augmentation)
  if α → β and β → γ, then α → γ (transitivity)

Additional rules (derivable from Armstrong’s Axioms):
  If α → β holds and α → γ holds, then α → βγ holds (union)
  If α → βγ holds, then α → β holds and α → γ holds (decomposition)
  If α → β holds and γβ → δ holds, then αγ → δ holds (pseudotransitivity)
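As an illustration of how an additional rule follows from Armstrong's Axioms, here is one possible derivation of the union rule. It uses α, β, γ for attribute sets, as in the standard textbook statement of the axioms, and relies on αα = α for sets of attributes.

```latex
\begin{align*}
&\alpha \to \beta            && \text{given} \\
&\alpha \to \alpha\beta      && \text{augmentation of the line above with } \alpha \\
&\alpha \to \gamma           && \text{given} \\
&\alpha\beta \to \beta\gamma && \text{augmentation of the line above with } \beta \\
&\alpha \to \beta\gamma      && \text{transitivity of lines 2 and 4}
\end{align*}
```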

Page 16: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Bad Decomposition Example (From Last Time)

A Non Lossless-Join Decomposition: R = (A, B), R1 = (A), R2 = (B)

r:
A  B
α  1
α  2
β  1

Π_A(r):
A
α
β

Π_B(r):
B
1
2

Π_A(r) ⋈ Π_B(r):
A  B
α  1
α  2
β  1
β  2

Thus, r is different to Π_A(r) ⋈ Π_B(r).

So (R1, R2) is not a lossless-join decomposition of R.
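The loss of information can be reproduced in a few lines: project r onto A and onto B, join the projections, and compare with r. This is a minimal sketch; the A values "α" and "β" are assumed here, since the source shows only the B column, and relations are modelled as Python sets of tuples.

```python
# Instance r over R = (A, B); the A values are assumed placeholders.
r = {("α", 1), ("α", 2), ("β", 1)}

proj_a = {(a,) for a, _ in r}   # projection of r onto A
proj_b = {(b,) for _, b in r}   # projection of r onto B

# R1 = (A) and R2 = (B) share no attributes, so their natural join
# is the Cartesian product of the two projections.
joined = {(a, b) for (a,) in proj_a for (b,) in proj_b}

# Tuples that appear after the join but were never in r.
spurious = joined - r
```

The join contains a tuple that r never held, which is exactly what "not lossless" means: the original relation can no longer be told apart from a larger one.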

Page 17: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

FDs & Lossless Decomposition

All attributes of an original schema (R) must appear in the decomposition (R1, R2):

R = R1 ∪ R2

Lossless-join decomposition. For all possible relations r on schema R:

r = Π_R1(r) ⋈ Π_R2(r)

A decomposition of R into R1 and R2 is lossless-join if and only if at least one of the following dependencies is in F+:

R1 ∩ R2 → R1

R1 ∩ R2 → R2
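The test above does not require computing all of F+. It is enough to compute the attribute closure of R1 ∩ R2 and check whether it covers R1 or R2. The sketch below uses the standard attribute-closure algorithm; the schema R = (A, B, C) and the FD A → B are hypothetical examples, and FDs are represented as pairs of frozensets.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_lossless(r1, r2, fds):
    """True if R1 ∩ R2 -> R1 or R1 ∩ R2 -> R2 follows from fds."""
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure

# Hypothetical schema R = (A, B, C) with F = {A -> B}.
fds = [(frozenset("A"), frozenset("B"))]
good = is_lossless({"A", "B"}, {"A", "C"}, fds)  # common attribute A determines R1
bad = is_lossless({"A", "B"}, {"B", "C"}, fds)   # common attribute B determines neither
```

With no FDs at all and disjoint schemas such as R1 = (A), R2 = (B), the same function reports that the decomposition is not lossless.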

Page 19: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Third Normal Form (3NF)

Violated when a nonkey column is a fact about another nonkey column.

A column is not fully functionally dependent on the primary key.

R is 3NF iff R is 2NF and has no transitive dependencies. EXCHANGE RATE violates this.

BAD:

STOCK
STOCK CODE  NATION  EXCHANGE RATE
GOOG        USA     0.67
NOK         FIN     0.46

FIX:

STOCK (*stock code, firm name, stock price, stock quantity, stock dividend, stock PE)
NATION (*nation code, nation name, exchange rate)

Page 20: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Boyce-Codd (BCNF)

Arises when a table:
  has multiple candidate keys,
  the candidate keys are composite,
  the candidate keys overlap.

R is BCNF iff every determinant is a candidate key.

E.g. assume one consultant per problem per client, and one problem per consultant.

If client-problem is the primary key, how do you add a new consultant?

Like 3NF but now worry about all fields.

BAD:

ADVISOR
CLIENT  PROBLEM     CONSULTANT
Alpha   Marketing   Gomez
Alpha   Production  Raginiski

FIX:

CLIENT (*clientno, …)
CLIENT-PROBLEM (*cltprobdate, …)
PROBLEM (*problemcode, …)
CONSULTANT (*consultid, …)
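The rule "every determinant is a candidate key" can be checked with attribute closures: a non-trivial FD violates BCNF when the closure of its left side does not cover the whole schema. The sketch below encodes the two assumptions stated for ADVISOR as FDs (client, problem → consultant and consultant → problem); the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Return the non-trivial FDs whose left side is not a superkey of schema."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not schema <= closure(lhs, fds)
    ]

advisor = {"client", "problem", "consultant"}
fds = [
    # One consultant per problem per client.
    (frozenset({"client", "problem"}), frozenset({"consultant"})),
    # One problem per consultant.
    (frozenset({"consultant"}), frozenset({"problem"})),
]
violations = bcnf_violations(advisor, fds)
```

Only consultant → problem is reported: consultant is a determinant but not a key, which is why a new consultant cannot be recorded without a client.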

Page 21: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Design Goals & their discontents

Goals for a relational database design:
  eliminate redundancies by decomposing relations,
  must be able to recover original data using lossless joins,
  prefer not to lose dependencies.

BCNF: no redundancies, no guarantee of dependency preservation.

3NF: dependency preservation, but possible redundancies.

Page 22: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Fourth normal form (4NF)

A row should not contain two or more independent multivalued facts.

4NF iff BCNF & no non-trivial multi-valued dependencies.

Multivalued dependency means the value of one attribute determines a set of values for another.

BAD:

STUDENT
STUDENTID  SPORT     SUBJECT  …
50         Football  English  …
50         Football  Music    …
50         Tennis    Botany   …
50         Karate    Botany   …

FIX: [diagram not preserved]

Page 23: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Fifth normal form (5NF)

5NF iff a relation has no join dependency.

The schemas R1, R2,.., Rn have a join dependency over R if they define a lossless-join decomposition over R.

This is way too complicated, don’t worry about it.

Page 24: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Domain Key Normal Form

Every constraint on the relation must be a logical consequence of the domain constraints and the key constraints that apply to the relation.
  Key: unique identifier.
  Constraint: rule governing attribute values.
  Domain: set of values of the same data type.

No known algorithm gives DK/NF.

Page 25: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

E-R Model and Normalization

When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated should not need further normalization.

However, in a real (imperfect) design there can be FDs from non-key attributes of an entity to other attributes of the entity.

The keys identified in E-R diagrams might not be minimal - FDs can help us to identify minimal keys.

FDs from non-key attributes of a relationship set are possible, but rare.

Page 26: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Denormalization & Performance

May want to use non-normalized schema for performance.

E.g. displaying customer-name along with account-number and balance requires join of account with depositor.

Alternative 1: Use denormalized relation containing attributes of account as well as depositor.
  faster lookup.
  extra space and extra execution time for updates.
  extra coding work for programmer and possibility of error in extra code.

Alternative 2: use a materialized view defined as account ⋈ depositor.
  as above, except less extra coding, errors.
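The view-based alternative can be sketched with Python's built-in sqlite3 module. SQLite offers only ordinary views, not materialized ones, so this sketch shows the "less extra coding" part rather than the lookup speed-up. The column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_number TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE depositor (customer_name TEXT, account_number TEXT);
    INSERT INTO account VALUES ('A-101', 500);
    INSERT INTO depositor VALUES ('Jones', 'A-101');
    -- Ordinary (non-materialized) view standing in for account JOIN depositor.
    CREATE VIEW customer_balance AS
        SELECT d.customer_name, a.account_number, a.balance
        FROM account a JOIN depositor d ON a.account_number = d.account_number;
""")

# An update to the base table is visible through the view with no extra code.
conn.execute("UPDATE account SET balance = 700 WHERE account_number = 'A-101'")
rows = conn.execute("SELECT * FROM customer_balance").fetchall()
```

With a hand-maintained denormalized table, the same update would have to be written twice by the programmer; here the database keeps the joined result consistent.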

Page 27: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Limits of Normalization

Examples of bad database design, not caught by normalization.

Good: earnings(company-id, year, amount)

Bad: earnings-2000, earnings-2001, earnings-2002, etc., on (company-id, earnings)
  all are BCNF, but querying across years difficult.
  needs a new table each year.

Bad: company-year(company-id, earnings-2000, earnings-2001, earnings-2002)
  in BCNF, but querying across years difficult.
  requires new attribute each year.
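The difficulty of querying across years can be made concrete by writing the same query against both designs. The sketch below models relations as plain Python collections; the company name and amounts are made-up sample values, and both function names are hypothetical.

```python
# Good design: one relation earnings(company_id, year, amount).
earnings = [
    ("acme", 2000, 10),
    ("acme", 2001, 12),
    ("acme", 2002, 15),
]

def total_earnings(rows, company_id, first_year, last_year):
    """A cross-year query is a single filter over one relation."""
    return sum(
        amount
        for cid, year, amount in rows
        if cid == company_id and first_year <= year <= last_year
    )

# Bad design: one table per year, so the year lives in the table name.
earnings_by_year = {
    2000: [("acme", 10)],
    2001: [("acme", 12)],
    2002: [("acme", 15)],
}

def total_earnings_per_year_tables(tables, company_id, first_year, last_year):
    """The caller must know which per-year tables exist and visit each one."""
    return sum(
        amount
        for year in range(first_year, last_year + 1)
        for cid, amount in tables.get(year, [])
        if cid == company_id
    )

good_total = total_earnings(earnings, "acme", 2000, 2002)
bad_total = total_earnings_per_year_tables(earnings_by_year, "acme", 2000, 2002)
```

Both return the same total, but in the per-year design the year is metadata rather than data, so each new year changes the schema instead of adding rows.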

Page 28: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Summary 1 – Rules to Watch

1NF: attributes not atomic.
2NF: non-key attribute FD on part of key.
3NF: one non-key attribute FD on another.
Boyce-Codd NF: overlapping but otherwise independent candidate keys.
4NF: multiple, independent multi-valued attributes.
5NF: join dependency.
Domain Key / NF: all constraints either domain or key.

Page 29: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Summary 2 – Concepts

Functional Dependencies: Axioms & Closure.

Lossless-join decomposition.

Design Process.

Normalization Problems.

Next: Interfaces and Architectures

Page 30: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Reading & Exercises

Reading:
  Connolly & Begg, Chapters 13, 14.
  Silberschatz, Chapter 7.
  Any other book, the design/normalization chapter.

Exercises:
  Silberschatz 7.1, 7.2, 7.16, 7.23, 7.24, 7.27-29

Page 31: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Next Week

• Architectures and Implementations

• Integrity and Security

Page 32: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Slides after and including this one you are not responsible for, but I am saving in case I decide to use them in the future.

Page 33: CM20145 Further DB Design – Normalization Dr Alwyn Barry Dr Joanna Bryson

Goal: Formalize “Good Design”

Process:
  Decide whether a particular relation R is in “good” form.
  In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that:
    each relation is in good form,
    the decomposition is a lossless-join decomposition.

Theory:
  Constraints on the set of legal relations.
  Require that the value for a certain set of attributes determines uniquely the value for another set of attributes – functional dependencies.