
Schema Normalization, Concluded

Zachary G. Ives

University of Pennsylvania

CIS 550 – Database & Information Systems

October 11, 2005

Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

2

Announcements

Decide on 3-person project groups by 1 week from Thursday (10/20)

Homework 2 answers posted on Web

Homework 3 due Thursday

No class next Tuesday (Fall Break)

Midterm: Thursday 10/20

3

Not All Designs are Equally Good

Why is this a poor schema design?

Stuff(sid, name, serno, subj, cid, exp-grade)

And why is this one better?

Student(sid, name)
Course(serno, cid)
Subject(cid, subj)
Takes(sid, serno, exp-grade)

4

Functional Dependencies Describe “Key-Like” Relationships

A key is a set of attributes where:
  If keys match, then the tuples match

A functional dependency (FD) is a generalization:
  If an attribute set X determines another attribute set Y, written X → Y, then if two tuples agree on attribute set X, they must agree on Y:

  sid → name

What other FDs are there in this data?

FDs are independent of our schema design choice

5

Formal Definition of FD’s

Def. Given a relation schema R and subsets X, Y of R:
An instance r of R satisfies FD X → Y if,

for any two tuples t1, t2 ∈ r, t1[X] = t2[X] implies t1[Y] = t2[Y]

For an FD to hold for schema R, it must hold for every possible instance r of R

(Can a DBMS verify this? Can we determine this by looking at an instance?)
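The definition above can be checked mechanically on a single instance. The following is a minimal sketch, assuming tuples are represented as Python dicts; the function name `satisfies` and the sample rows are invented for illustration. Note that a passing check only shows that this one instance satisfies the FD; it cannot show that the FD holds for the schema, although a failing check does refute it.

```python
from itertools import combinations

def satisfies(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    for t1, t2 in combinations(rows, 2):
        agree_on_lhs = all(t1[a] == t2[a] for a in lhs)
        differ_on_rhs = any(t1[a] != t2[a] for a in rhs)
        if agree_on_lhs and differ_on_rhs:
            return False
    return True

# Hypothetical instance, invented for illustration only.
rows = [
    {"sid": 1, "name": "Sam", "serno": 570103},
    {"sid": 1, "name": "Sam", "serno": 550103},
    {"sid": 23, "name": "Nitin", "serno": 550103},
]

print(satisfies(rows, ["sid"], ["name"]))   # sid -> name is satisfied by this instance
print(satisfies(rows, ["serno"], ["sid"]))  # serno -> sid is violated (550103 has two sids)
```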

6

General Thoughts on Good Schemas

We want all attributes in every tuple to be determined by the tuple’s key attributes, i.e. part of a superkey (for key X → Y, a superkey is a “non-minimal” X)

What does this say about redundancy?

But:
  What about tuples that don’t have keys (other than the entire value)?
  What about the fact that every attribute determines itself?

7

Armstrong’s Axioms: Inferring FDs

Some FDs exist due to others; can compute using Armstrong’s axioms:

Reflexivity: If Y ⊆ X then X → Y (trivial dependencies)
  name, sid → name

Augmentation: If X → Y then XW → YW
  serno → subj so serno, exp-grade → subj, exp-grade

Transitivity: If X → Y and Y → Z then X → Z
  serno → cid and cid → subj so serno → subj

8

Armstrong’s Axioms Lead to…

Union: If X → Y and X → Z then X → YZ

Pseudotransitivity: If X → Y and WY → Z then XW → Z

Decomposition: If X → Y and Z ⊆ Y then X → Z

Let’s prove a few of these from Armstrong’s Axioms
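As one worked case, the union rule can be derived from the three axioms alone. The derivation below is a sketch in the slides' notation, where XY abbreviates X ∪ Y (so XX = X):

```latex
\begin{align*}
X \to Y &\;\Rightarrow\; X \to XY && \text{augmentation with } X \\
X \to Z &\;\Rightarrow\; XY \to YZ && \text{augmentation with } Y \\
X \to XY,\; XY \to YZ &\;\Rightarrow\; X \to YZ && \text{transitivity}
\end{align*}
```

Decomposition follows in a similar way: if Z ⊆ Y then Y → Z by reflexivity, and X → Y with Y → Z gives X → Z by transitivity.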

9

Closure of a Set of FD’s

Defn. Let F be a set of FD’s. Its closure, F+, is the set of all FD’s:

{X → Y | X → Y is derivable from F by Armstrong’s Axioms}

Which of the following are in the closure of our Student-Course FD’s?

name → name

cid → subj

serno → subj

cid, sid → subj

cid → sid

10

Attribute Closures: Is Something Dependent on X?

Defn. The closure of an attribute set X, X+, is:

X+ = {Y | X → Y ∈ F+}

This answers the question “is Y determined (transitively) by X?”; compute X+ by:

  closure := X;
  repeat until no change {
    if there is an FD U → V in F such that U ⊆ closure
      then add V to closure
  }

Does sid, serno → subj, exp-grade?
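The fixpoint loop above translates almost directly into code. This is a minimal sketch, assuming an FD is represented as a (left side, right side) pair of Python sets; the function name `attribute_closure` is an illustrative choice, and the FD set is the one listed for Example 1 in these slides.

```python
def attribute_closure(attrs, fds):
    """Compute X+ for attribute set attrs under the FDs in fds."""
    closure = set(attrs)
    changed = True
    while changed:                      # repeat until no change
        changed = False
        for lhs, rhs in fds:
            # U -> V applies once all of U is already in the closure
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# FD set from Example 1 of these slides.
fds = [
    ({"sid"}, {"name"}),
    ({"serno"}, {"cid"}),
    ({"cid"}, {"subj"}),
    ({"sid", "serno"}, {"exp-grade"}),
]

x_plus = attribute_closure({"sid", "serno"}, fds)
# sid, serno -> subj, exp-grade holds iff both attributes are in the closure
print({"subj", "exp-grade"} <= x_plus)
```

Under this FD set the closure of {sid, serno} contains every attribute, so the FD in question is implied.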

11

Equivalence of FD sets

Defn. Two sets of FD’s, F and G, are equivalent if their closures are equal, F+ = G+

e.g., these two sets are equivalent: {XY → Z, X → Y} and {X → Z, X → Y}

F + contains a huge number of FD’s (exponential in the size of the schema)

Would like to have smallest “representative” FD set
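Because F+ can be exponentially large, equivalence is usually tested without materializing it: F and G are equivalent exactly when every FD of F follows from G and vice versa, and "follows from" can be decided with an attribute closure. The sketch below assumes FDs are (left, right) pairs of sets; the names `closure`, `implies`, and `equivalent` are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs is in the closure of fds."""
    return set(rhs) <= closure(lhs, fds)

def equivalent(f, g):
    """True if F+ = G+, checked FD by FD in both directions."""
    return (all(implies(g, l, r) for l, r in f)
            and all(implies(f, l, r) for l, r in g))

# The two sets from the slide.
F = [({"X", "Y"}, {"Z"}), ({"X"}, {"Y"})]
G = [({"X"}, {"Z"}), ({"X"}, {"Y"})]
print(equivalent(F, G))
```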

12

Minimal Cover

Defn. A FD set F is minimal if:

1. Every FD in F is of the form X → A, where A is a single attribute

2. For no X → A in F is F – {X → A} equivalent to F

3. For no X → A in F and Z ⊂ X is (F – {X → A}) ∪ {Z → A} equivalent to F

Conditions 1 and 3: we express each FD in simplest form. Condition 2: in a sense, each FD is “essential” to the cover.

Defn. F is a minimal cover for G if F is minimal and is equivalent to G.

e.g., {X → Z, X → Y} is a minimal cover for {XY → Z, X → Z, X → Y}
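One common way to compute a minimal cover follows the three conditions in order: split right-hand sides, shrink left-hand sides, then drop redundant FDs. The sketch below is one such procedure under the assumption that FDs are (left, right) pairs of sets; the function names are illustrative, and a different processing order can produce a different (equally valid) minimal cover.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimal_cover(fds):
    # Condition 1: one attribute on each right-hand side.
    split = [(frozenset(lhs), frozenset([a]))
             for lhs, rhs in fds for a in sorted(rhs)]
    # Condition 3: remove left-hand-side attributes that are not needed.
    reduced = []
    for lhs, rhs in split:
        lhs = set(lhs)
        for b in sorted(lhs):
            if rhs <= closure(lhs - {b}, split):
                lhs.discard(b)
        reduced.append((frozenset(lhs), rhs))
    # Condition 2: remove FDs implied by the ones that remain.
    cover = list(dict.fromkeys(reduced))   # drop exact duplicates, keep order
    for fd in list(cover):
        rest = [g for g in cover if g != fd]
        if fd[1] <= closure(fd[0], rest):
            cover = rest
    return set(cover)

# The example from the slide: {XY -> Z, X -> Z, X -> Y}
G = [({"X", "Y"}, {"Z"}), ({"X"}, {"Z"}), ({"X"}, {"Y"})]
cover = minimal_cover(G)
for lhs, rhs in sorted(cover, key=lambda fd: sorted(fd[1])):
    print(sorted(lhs), "->", sorted(rhs))
```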

13

More on Closures

If F is a set of FD’s and X → Y ∉ F+, then for some attribute A ∈ Y, X → A ∉ F+

Proof by contradiction. Assume otherwise and let Y = {A1, ..., An}.

Since we assume X → A1, ..., X → An are in F+, then X → A1 ... An is in F+ by the union rule,

hence X → Y is in F+, which is a contradiction

14

Why Armstrong’s Axioms?

Why are Armstrong’s axioms (or an equivalent rule set) appropriate for FD’s? They are:

  Consistent: any relation satisfying FD’s in F will satisfy those in F+

  Complete: if an FD X → Y cannot be derived by Armstrong’s axioms from F, then there exists some relational instance satisfying F but not X → Y

In other words, Armstrong’s axioms derive all the FD’s that should hold

What is the goal of using these axioms?

15

Decomposition

Consider our original “bad” attribute set:

Stuff(sid, name, serno, subj, cid, exp-grade)

We could decompose it into:

Student(sid, name)
Course(serno, cid)
Subject(cid, subj)

But this decomposition loses information about the relationship between students and courses. Why?

16

Lossless Join Decomposition

R1, …, Rk is a lossless join decomposition of R w.r.t. an FD set F if for every instance r of R that satisfies F,

π_R1(r) ⋈ ... ⋈ π_Rk(r) = r

Consider:

sid  name   serno   subj  cid  exp-grade
1    Sam    570103  AI    570  B
23   Nitin  550103  DB    550  A

What if we decompose on (sid, name) and (serno, subj, cid, exp-grade)?
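The effect of that decomposition can be seen by projecting the two-row instance and joining the projections back together. The sketch below is an illustration only: the helpers `project` and `natural_join` are hypothetical names, and tuples are modeled as frozensets of (attribute, value) pairs so that relations behave as sets.

```python
def project(rows, attrs):
    """Projection onto attrs; duplicates collapse because the result is a set."""
    return {frozenset((a, row[a]) for a in attrs) for row in rows}

def natural_join(left, right):
    """Natural join of two relations given as sets of frozenset tuples."""
    out = set()
    for l in left:
        for r in right:
            merged = dict(l)
            # tuples join only if they agree on every shared attribute
            if all(merged.get(a, v) == v for a, v in r):
                merged.update(r)
                out.add(frozenset(merged.items()))
    return out

rows = [
    {"sid": 1, "name": "Sam", "serno": 570103, "subj": "AI", "cid": 570, "exp-grade": "B"},
    {"sid": 23, "name": "Nitin", "serno": 550103, "subj": "DB", "cid": 550, "exp-grade": "A"},
]
original = {frozenset(r.items()) for r in rows}

# No shared attribute: the join degenerates into a cross product.
lossy = natural_join(project(rows, ["sid", "name"]),
                     project(rows, ["serno", "subj", "cid", "exp-grade"]))

# Keeping sid on both sides lets the join reassemble the original tuples.
lossless = natural_join(project(rows, ["sid", "name"]),
                        project(rows, ["sid", "serno", "subj", "cid", "exp-grade"]))

print(len(original), len(lossy), len(lossless))
```

The first join pairs every student with every course row, so it contains tuples that were never in the original instance; the second returns exactly the original two tuples.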

17

Testing for Lossless Join

R1, R2 is a lossless join decomposition of R with respect to F iff at least one of the following dependencies is in F+

(R1 ∩ R2) → R1 – R2

(R1 ∩ R2) → R2 – R1

So for the FD set:
  sid → name
  serno → cid, exp-grade
  cid → subj

Is (sid, name) and (serno, subj, cid, exp-grade) a lossless decomposition?
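The binary test above reduces to one attribute-closure computation on the shared attributes. The following is a minimal sketch, assuming FDs are (left, right) pairs of sets; `is_lossless_binary` is an illustrative name, and the FD set is the one given on this slide.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless_binary(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 - R2 or (R1 ∩ R2) -> R2 - R1 is in F+."""
    r1, r2 = set(r1), set(r2)
    common_closure = closure(r1 & r2, fds)
    return (r1 - r2) <= common_closure or (r2 - r1) <= common_closure

fds = [
    ({"sid"}, {"name"}),
    ({"serno"}, {"cid", "exp-grade"}),
    ({"cid"}, {"subj"}),
]

# The decomposition asked about on the slide: no shared attributes.
print(is_lossless_binary({"sid", "name"},
                         {"serno", "subj", "cid", "exp-grade"}, fds))
# A variant that keeps sid in both relations.
print(is_lossless_binary({"sid", "name"},
                         {"sid", "serno", "subj", "cid", "exp-grade"}, fds))
```

With no shared attributes the closure of the intersection is empty, so neither dependency can hold; once sid is shared, sid → name makes the first dependency hold.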

18

Dependency Preservation

Ensures we can check whether a FD X → Y is violated during DB updates, without using a join:

F_Z, the projection of FD set F onto attribute set Z, is:

{X → Y | X → Y ∈ F+, X ∪ Y ⊆ Z}

i.e., it is those FDs only applicable to Z’s attributes

A decomposition R1, …, Rk is dependency preserving if F+ = (F_R1 ∪ ... ∪ F_Rk)+ (note we need an extra closure!)

We don’t lose the ability to test the “cover” of our FDs in a single table, just because we decompose
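Computing each projection F_Ri explicitly can be expensive, so a standard alternative checks each FD of F directly: grow the left side using only what each Ri can "see", and test whether the right side is reached. The sketch below takes that approach rather than the literal definition; FDs are assumed to be (left, right) pairs of sets, and `preserves_dependencies` is an illustrative name.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves_dependencies(decomposition, fds):
    """True if every FD in fds follows from the union of the projected FD sets."""
    for lhs, rhs in fds:
        reach = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                ri = set(ri)
                # attributes derivable from reach using only FDs local to Ri
                gained = (closure(reach & ri, fds) & ri) - reach
                if gained:
                    reach |= gained
                    changed = True
        if not set(rhs) <= reach:
            return False
    return True

# The (sid, fid, subj) schema used elsewhere in these slides.
fds = [({"fid"}, {"subj"}), ({"sid", "subj"}, {"fid"})]
print(preserves_dependencies([{"sid", "fid"}, {"fid", "subj"}], fds))
print(preserves_dependencies([{"sid", "fid", "subj"}], fds))
```

In the first call, sid, subj → fid spans both relations and cannot be rebuilt from the local FDs, so the decomposition is not dependency preserving; the undecomposed schema trivially is.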

19

Example 1

For schema R(sid, name, serno, cid, subj, exp-grade) and FD set:
  sid → name
  serno → cid
  cid → subj
  sid, serno → exp-grade

Is R1(sid, name) and R2(serno, subj, cid, exp-grade):
  A lossless decomposition?
  Is it dependency-preserving?

How about R1(sid, name) and R2(sid, serno, subj, cid, exp-grade)?

20

Example 2

Given schema R(name, street, city, st, zip, item, price),

FD set:
  name → street, city
  street, city → st
  street, city → zip
  name, item → price

and decomposition

R1(name, street, city, st, zip) and R2(name, item, price)

Is it lossless? Is it dependency preserving?

What if we replaced the first FD with name, street → city?

21

A More Disturbing Example…

Given schema R(sid, fid, subj) and FD set:
  fid → subj
  sid, subj → fid

Consider the decomposition R1(sid, fid) and R2(fid, subj)

Is it lossless? Is it dependency preserving?

If it isn’t, can you think of a decomposition that is? Can you do this non-redundantly?

22

Redundancy vs. FDs

Ideally, we want a design s.t. for each nontrivial dependency X → Y, X is a superkey for some relation schema in R

We just saw that this isn’t always possible in a non-redundant way…

Thus we have two kinds of normal forms, Boyce-Codd and Third Normal Form

23

Two Important Normal Forms

Boyce-Codd Normal Form (BCNF). For every relation scheme R and for every X → A that holds over R,

  either A ∈ X (it is trivial), or
  X is a superkey for R

Third Normal Form (3NF). For every relation scheme R and for every X → A that holds over R,

  either A ∈ X (it is trivial), or
  X is a superkey for R, or
  A is a member of some key for R
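For a small schema the two definitions can be tested by brute force. The sketch below assumes FDs are (left, right) pairs of sets and checks only the FDs listed in F (with right sides split into single attributes), which is the usual shortcut when testing a whole schema R rather than a piece of a decomposition. The names `candidate_keys` and `normal_form` are illustrative, and the key search is exponential in the number of attributes.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attrs, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            x = set(combo)
            if any(k <= x for k in keys):
                continue            # a smaller key is inside x, so x is not minimal
            if closure(x, fds) >= set(attrs):
                keys.append(x)
    return keys

def normal_form(attrs, fds):
    """Return 'BCNF', '3NF', or 'neither' for schema attrs under fds."""
    prime = set().union(*candidate_keys(attrs, fds))  # attributes in some key
    bcnf = third = True
    for lhs, rhs in fds:
        if closure(lhs, fds) >= set(attrs):
            continue                # X is a superkey: fine for both forms
        for a in set(rhs) - set(lhs):   # nontrivial part of the FD
            bcnf = False
            if a not in prime:
                third = False
    return "BCNF" if bcnf else "3NF" if third else "neither"

print(normal_form({"sid", "fid", "subj"},
                  [({"fid"}, {"subj"}), ({"sid", "subj"}, {"fid"})]))
```

For the (sid, fid, subj) schema, fid → subj violates BCNF because fid is not a superkey, but subj belongs to the key {sid, subj}, so the schema is still in 3NF.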

24

Normal Forms Compared

BCNF is preferable, but sometimes in conflict with the goal of dependency preservation

It’s strictly stronger than 3NF

Let’s see algorithms to obtain:
  A BCNF lossless join decomposition (nondeterministic)
  A 3NF lossless join, dependency preserving decomposition

25

BCNF Decomposition Algorithm (from Korth et al.; our book gives a recursive version)

  result := {R}
  compute F+
  while there is a relation schema Ri in result that isn’t in BCNF {
    let A → B be a nontrivial FD on Ri
      s.t. A → Ri is not in F+ and A and B are disjoint
    result := (result – {Ri}) ∪ {(Ri – B), (A, B)}
  }
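The loop above can be sketched in Python as follows. This is not the textbook's implementation: instead of materializing F+, it searches each Ri for a violating left side using attribute closures, which is exponential and only suitable for small schemas. FDs are assumed to be (left, right) pairs of sets, and the names `find_violation` and `bcnf_decompose` are illustrative. The fixed search order makes this particular sketch deterministic even though the algorithm itself is not.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(ri, fds):
    """Return (A, B) with A -> B nontrivial on Ri and A not a superkey of Ri, or None."""
    for size in range(1, len(ri)):
        for combo in combinations(sorted(ri), size):
            a = frozenset(combo)
            a_plus = closure(a, fds)
            b = (a_plus & ri) - a          # what A determines inside Ri, disjoint from A
            if b and not ri <= a_plus:     # nontrivial, and A -> Ri does not hold
                return a, frozenset(b)
    return None

def bcnf_decompose(attrs, fds):
    todo, done = [frozenset(attrs)], set()
    while todo:
        ri = todo.pop()
        violation = find_violation(ri, fds)
        if violation is None:
            done.add(ri)
        else:
            a, b = violation
            todo.append(ri - b)            # (Ri - B)
            todo.append(a | b)             # (A, B)
    return done

# The "Stuff" schema with the FD set from Example 1 of these slides.
fds = [({"sid"}, {"name"}), ({"serno"}, {"cid"}),
       ({"cid"}, {"subj"}), ({"sid", "serno"}, {"exp-grade"})]
result = bcnf_decompose({"sid", "name", "serno", "subj", "cid", "exp-grade"}, fds)
for schema in sorted(result, key=sorted):
    print(sorted(schema))
```

On this input the sketch arrives at the same four relations as the Student, Course, Subject, and Takes design shown at the start of the lecture.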

26

3NF Decomposition Algorithm

  Let F be a minimal cover
  i := 0
  for each FD A → B in F {                    (build dep.-preserving decomp.)
    if none of the schemas Rj, 1 ≤ j ≤ i, contains AB {
      increment i
      Ri := (A, B)
    }
  }
  if no schema Rj, 1 ≤ j ≤ i, contains a candidate key for R {    (ensure lossless decomp.)
    increment i
    Ri := any candidate key for R
  }
  return (R1, …, Ri)
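A direct transcription of the synthesis steps above might look like the following sketch. It assumes the caller already supplies a minimal cover as (left, right) pairs of sets; the names `some_candidate_key` and `synthesize_3nf` are illustrative, and "contains a candidate key" is tested as "is a superkey of R", which is equivalent.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def some_candidate_key(attrs, fds):
    """Shrink the full attribute set until no attribute can be dropped."""
    key = set(attrs)
    for a in sorted(attrs):
        if closure(key - {a}, fds) >= set(attrs):
            key.discard(a)
    return key

def synthesize_3nf(attrs, minimal_cover):
    schemas = []
    # One schema per FD, unless an existing schema already contains AB.
    for lhs, rhs in minimal_cover:
        ab = set(lhs) | set(rhs)
        if not any(ab <= s for s in schemas):
            schemas.append(ab)
    # Add a key schema if no schema is a superkey of R.
    if not any(closure(s, minimal_cover) >= set(attrs) for s in schemas):
        schemas.append(some_candidate_key(attrs, minimal_cover))
    return schemas

cover = [({"sid"}, {"name"}), ({"serno"}, {"cid"}),
         ({"cid"}, {"subj"}), ({"sid", "serno"}, {"exp-grade"})]
schemas = synthesize_3nf({"sid", "name", "serno", "subj", "cid", "exp-grade"}, cover)
for s in schemas:
    print(sorted(s))
```

Here the schema built from sid, serno → exp-grade already contains the key {sid, serno}, so no extra key relation is added.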

27

Summary of Normalization

We can always decompose into 3NF and get:
  Lossless join
  Dependency preservation

But with BCNF we are only guaranteed lossless joins

BCNF is stronger than 3NF: every BCNF schema is also in 3NF

The BCNF algorithm is nondeterministic, so there is not a unique decomposition for a given schema R

28

XML: A Semi-Structured Data Model

29

Why XML?

XML is the confluence of several factors:
  The Web needed a more declarative format for data
  Documents needed a mechanism for extended tags
  Database people needed a more flexible interchange format
  “Lingua franca” of data
  It’s parsable even if we don’t know what it means!

Original expectation:
  The whole web would go to XML instead of HTML

Today’s reality:
  Not so… But XML is used all over “under the covers”

30

Why DB People Like XML

Can get data from all sorts of sources
  Allows us to touch data we don’t own!
  This was actually a huge change in the DB community

Interesting relationships with DB techniques
  Useful to do relational-style operations
  Leverages ideas from object-oriented, semistructured data

Blends schema and data into one format
  Unlike relational model, where we need schema first
  … But too little schema can be a drawback, too!

31

XML Anatomy

<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>

(Callouts on the slide label these parts of the document: Processing Instr., Element, Attribute, Open-tag, Close-tag)

32

Well-Formed XML

A legal XML document – fully parsable by an XML parser

  All open-tags have matching close-tags (unlike so many HTML documents!), or a special <tag/> shortcut for empty tags (equivalent to <tag></tag>)

  Attributes (which are unordered, in contrast to elements) only appear once in an element

  There’s a single root element

  XML is case-sensitive
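These rules are exactly what a conforming parser enforces, so a simple way to test well-formedness is to try parsing. The sketch below uses Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "matched tags":        "<a><b></b></a>",
    "empty-tag shortcut":  "<a><b/></a>",
    "missing close-tag":   "<a><b></a>",
    "duplicate attribute": "<a x='1' x='2'/>",
    "two root elements":   "<a/><b/>",
    "case mismatch":       "<a></A>",
}
for label, doc in samples.items():
    print(label, is_well_formed(doc))
```

Only the first two samples parse; each of the others breaks one of the rules listed above.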