View
218
Download
3
Tags:
Embed Size (px)
Citation preview
1
Database Theory and Methodology
2
The Good and the Bad• So far we have not developed any measure of “goodness” to
measure the quality of the design, other than our intuition.• Bad database design
– Redundant Information– Update anomalies– Wasting storage with null values– Generation of invalid data during joins
• Good database design– Easy to explain its semantics– No redundancy (or at least reduced redundancy)– No information loss during joins
• We need formal concepts to define “goodness” and “badness” of relational schemas.
3
The Evils of Redundancy
• Redundancy is at the root of several problems associated with relational schemas:– Redundant storage– Insertion anomalies – Deletion anomalies – Update anomalies
4
Example: Bad DesignFLIGHTS
flt# date airline plane#
DL242 10/23/00 Delta k-yo-33297
DL242 10/24/00 Delta t-up-73356
DL242 10/25/00 Delta o-ge-98722
AA121 10/24/00 American p-rw-84663
AA121 10/25/00 American q-yg-98237
AA411 10/22/00 American h-fe-65748
• redundancy: airline name repeated for the same flight
• inconsistency: when airline name for a flight changes, it must be changed in many places
5
Bad Database Design
• insertion anomalies: how do we represent that SK912 is flown by Scandinavian without there being a date and a plane assigned?
• deletion anomalies: cancelling AA411 on 10/22/00 makes us lose that it is flown by American.
• update anomalies: if DL242 is flown by Sabena, we must change it everywhere.
FLIGHTS
flt# date airline plane#
DL242 10/23/00 Delta k-yo-33297
DL242 10/24/00 Delta t-up-73356
DL242 10/25/00 Delta o-ge-98722
AA121 10/24/00 American p-rw-84663
AA121 10/25/00 American q-yg-98237
AA411 10/22/00 American h-fe-65748
6
Bad Database Design- decomposition
FLIGHTS
flt# date airline plane#
DL242 10/23/00 Delta k-yo-33297
DL242 10/24/00 Delta t-up-73356
DL242 10/25/00 Delta o-ge-98722
AA121 10/24/00 American p-rw-84663
AA121 10/25/00 American q-yg-98237
AA411 10/22/00 American h-fe-65748
FLIGHTS-AIRLINE
flt# airline
DL242 Delta
AA121 American
AA411 American
DATE-AIRLINE-PLANE
date airline plane#
10/23/00 Delta k-yo-33297
10/24/00 Delta t-up-73356
10/25/00 Delta o-ge-98722
10/24/00 American p-rw-84663
10/25/00 American q-yg-98237
10/22/00 American h-fe-65748
7
Bad Database Design- information loss
FLIGHTS
flt# date airline plane#
DL242 10/23/00 Delta k-yo-33297
DL242 10/24/00 Delta t-up-73356
DL242 10/25/00 Delta o-ge-98722
AA121 10/24/00 American p-rw-84663
AA121 10/25/00 American q-yg-98237
AA211 10/22/00 American h-fe-65748
AA411 10/24/00 American p-rw-84663
AA411 10/25/00 American q-yg-98237
AA411 10/22/00 American h-fe-65748
DATE-AIRLINE-PLANE
date airline plane#
10/23/00 Delta k-yo-33297
10/24/00 Delta t-up-73356
10/25/00 Delta o-ge-98722
10/24/00 American p-rw-84663
10/25/00 American q-yg-98237
10/22/00 American h-fe-65748
FLIGHTS-AIRLINE
flt# airline
DL242 Delta
AA121 American
AA411 American
• information loss: we polluted the database with false facts; we can’t find the true facts.
8
Good Database Design
• no redundancy of FACT (!)
• no inconsistency
• no insertion, deletion or update anomalies
• no information loss
FLIGHTS-DATE-PLANE
flt# date plane#
DL242 10/23/00 k-yo-33297
DL242 10/24/00 t-up-73356
DL242 10/25/00 o-ge-98722
AA121 10/24/00 p-rw-84663
AA121 10/25/00 q-yg-98237
AA411 10/22/00 h-fe-65748
FLIGHTS-AIRLINE
flt# airline
DL242 Delta
AA121 American
AA411 American
9
Goal — Devise a Theory for the Following
• Decide whether a particular relation R is in “good” form.
• In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that
– each relation is in good form
– the decomposition is a lossless-join decomposition
• Our theory is based on:
– functional dependencies
– multivalued dependencies
10
Functional Dependencies (FDs)
• A functional dependency is a constraint between two sets of attributes from the database.
• Definition: A functional dependency, denoted by X Y, holds over relation R if, for every allowable instance r of R:– t1 r, t2 r, t1.X = t2.X implies t1.Y =t2.Y– i.e., given two tuples in r, if the X values agree, then the Y
values must also agree. (X and Y are sets of attributes.)• An FD is a statement about all allowable relations.
– Must be identified based on semantics of application.– Given some allowable instance r1 of R, we can check if it
violates some FD f, but we cannot tell if f holds over R!• If X Y in R, this does not say whether or not Y X.
11
Functional Dependencies and Keys
• A key constraint is a special case of an FD.
• Note the following:
– If X is a candidate key, X Y for any subset of attributes Y of R.
– X Y does not require that the set X be minimal; the additional minimality condition must be met for X to be a key.
– If X Y holds where Y is the set of all attributes in R, and there is some subset V of X s.t. V Y, then X is a superkey.
12
Example: Constraints on Entity Set
• Consider relation obtained from Hourly_Emps:– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)
• Notation: We will denote this relation schema by listing the attributes: SNLRWH– This is really the set of attributes {S,N,L,R,W,H}.– Sometimes, we will refer to all attributes of a relation by using the
relation name. (e.g., Hourly_Emps for SNLRWH)
• Some FDs on Hourly_Emps:– ssn is the key: S SNLRWH – rating determines hrly_wages: R W
13
Example (Contd.)
• Problems due to R W :
– Update anomaly: Can we change W in just the 1st tuple of SNLRWH?
– Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating?
– Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!
S N L R W H
123-22-3666 Attishoo 48 8 10 40
231-31-5368 Smiley 22 8 10 30
131-24-3650 Smethurst 35 5 7 30
434-26-3751 Guldu 35 5 7 32
612-67-4134 Madayan 35 8 10 40
Will two smaller tables be better?
R W
8 10
5 7
Wages
S N L R H
123-22-3666 Attishoo 48 8 40
231-31-5368 Smiley 22 8 30
131-24-3650 Smethurst 35 5 30
434-26-3751 Guldu 35 5 32
612-67-4134 Madayan 35 8 40
Hourly_Emps2
14
Closure of a set of FDs
• Given some FDs, we can usually infer additional FDs:– ssn did, did lot implies ssn lot– mgr_ssn mgr_phone, dept_no mgr_ssn implies dept_no
mgr_phone
• An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold.
• Formally, the set of all dependencies that include F as well as all dependencies that can be inferred from F is called the closure of F; it is denoted by F+.
15
Inference Rules for FDs
• Armstrong’s Axioms (X, Y, Z are sets of attributes):– Reflexivity: If Y X then X Y. – Augmentation: If X Y, then XZ YZ for any Z.– Transitivity: If X Y and Y Z, then X Z.
• These are sound and complete inference rules for FDs!– By sound we mean that they generate only FDs in F+ when applied to a
set F of FDs.– By complete we mean that repeated application of these rules will
generate all FDs in the closure of F+.
16
Derived inference rules
• Union: if XY and XZ, then XYZ.
• Decomposition: if XYZ, then XY and XZ.
• Pseudotransitivity: if XY and WYZ, then WXZ.
• These additional rules are not essential; their soundness can be proved using Armstrong’s Axioms.
• Exercise: Prove rules Union, Decomposition and Pseudotransitivity using A.A.
17
Example
• Consider the following schema:Contracts(cid,sid,jid,did,pid,qty,value)
– C is the key: C CSJDPQV– Project purchases each part using single contract: JP C– Dept purchases at most one part from a supplier: SD P
• JP C, C CSJDPQV imply JPCSJDPQV• SDP implies SDJ JP• SDJ JP, JP CSJDPQV imply SDJ CSJDPQV
– We can infer several additional FDs that are in the closure by using augmentation or decomposition. e.g. C CSJDPQV implies C C, C S,C J, C D, C P, C Q, C V.
– We have also a number of trivial FDs from reflexivity rule.
18
Reasoning About FDs
• Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)
• Typically, we just want to check if a given FD X Y is in the closure of a set of FDs F. An efficient check:
1. Compute attribute closure of X (denoted X+) wrt F:• Set of all attributes A such that X A is in F+.
• There is a linear time algorithm to compute this.
2. Check if Y is in X+.
• Does F = {A B, B C, CD E } imply A E?– i.e, is A E in the closure F+ ? (Equivalently, is E in X+)?
19
Attribute Closure
X+ = X;
repeat {
oldX+ = X+;
for each functional dependency Y Z in F do
if Y X+ then X+ = X+ Z;
until X+ = oldX+
20
Example
• Given the following set of FDs:
F= { Ssn Ename,
Pnumber {Pname, Plocation},
{Ssn,Pnumber} Hours }
• The attribute closures:
{Ssn}+ = {Ssn, Ename}
{Pnumber}+ = {Pnumber, Pname,Plocation}
{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}
21
Example R = (A, B, C, G, H, I)
F = {A BA C CG HCG IB H}
(AG)+
1. result = AG.
2. result = ABCG (A C and A B)
3. result = ABCGH(CG H and CG AGBC)
4. result = ABCGHI (CG I and CG AGBCH) Is AG a candidate key?
1. Is AG a super key?• Does AG R? == Is (AG)+ R
2. Is any subset of AG a superkey?• Does A R? == Is (A)+ R
• Does G R? == Is (G)+ R
22
Finding a Key for a Relation
Algorithm: Finding a Key K for R, given a set of FDs
1. Set K = R.
2. For each attribute A in K {
Compute (K-A)+ wrt F;
If (K-A)+ contains all the attributes in R, then
set K = K – {A}
};
23
Example
• Consider the following schema and the set of FDs:
R = (Ssn, Pnumber, Ename, Pname, Plocation, Hours)
F = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}
• The Key is {Ssn,Pnumber}, since
{Ssn,Pnumber}+ = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}
24
Equivalence of two sets of FDs
• Definition: A set of FDs F covers another set of FDs E if every FD in E is also in F+.
• Definition: F and E are equivalent if F+ = E+ (i.e. F covers E and E covers F).
• We can determine whether F covers E by calculating X+ with respect to F for each FD XY in E, and then checking whether this X+ includes the attributes in Y. If this is the case for every FD in E, then F covers E.
FE
F+
25
Minimal Cover
• Sets of functional dependencies may have redundant dependencies that can be inferred from the others
– Eg: A C is redundant in: {A B, B C, A C}
– Parts of a functional dependency may be redundant• E.g. on RHS: {A B, B C, A CD} can be simplified to
{A B, B C, A D}
• E.g. on LHS: {A B, B C, AC D} can be simplified to {A B, B C, A D}
• Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies
26
Minimal Sets of FDs
A set of FDs F is minimal if:
• Every dependency in F has a single attribute for its right-hand side.
• We can’t replace any dependency X A in F with a dependency YA where Y X and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having redundant
attributes on the left-hand side of a dependency.
• We can’t remove any dependency from F and still have a set of dependencies equivalent with F. – This ensures that there are no redundancies by having a dependency that
can be inferred from the remaining FDs in F.
27
Minimal Cover of a set of FDs
• A minimal cover of a set of FDs E is a minimal set of dependencies F that is equivalent to E.
• There can be several minimal covers for a set of FDs.
• Algorithm: Finding Minimal Cover F for a set of Fds E.
1. Set F = E.
2. Replace each Fd X {A1,..., An} in F by the n Fds XA1,..., XAn.
3. For each Fd XA in F
For each attribute BX
If {{F - {XA}} {(X - {B}) A}} F
Then replace XA with (X - {B}) A in F.
4. For each remaining Fd XA in F
If {F - {XA}}F, then remove XA from F.
28
Examples
• E = { A B, ABCD E, EF GH, ACDF EG } has the following minimal cover:
F = {A B, ACD E, EF G, EF H }
• E = {Ssn Ename, Pnumber {Pname, Plocation}, {Ssn,Pnumber} Hours}}
F= {Ssn Ename, Pnumber Pname, Pnumber Plocation, {Ssn,Pnumber} Hours}
29
Normal Forms
• Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!
• If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
• Role of FDs in detecting redundancy:– Consider a relation R with 3 attributes, ABC.
• No FDs hold: There is no redundancy here.
• Given A B: Several tuples could have the same A value, and if so, they’ll all have the same B value!
30
Normalization
• Normalization of data is a process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of
– minimizing redundancy
– minimizing update, insert and delete anomalies.
• The process of normalization through decomposition must also confirm two additional properties:
1. Lossless join property (nonadditive join property)
2. Dependency preservation property
31
NF2
1NF
2NF
3NF
BCNF
Overview of NFs
32
Normal Forms
• NF2: non-first normal form.
• 1NF: R is in 1NF iff all domain values are atomic.
• 2NF: R is in 2. NF iff R is in 1NF and every non-key attribute is fully dependent on the key.
• 3NF: R is in 3NF iff R is 2NF and every nonkey attribute is non-transitively dependent on the key
• BCNF: R is in BCNF iff every determinant is a candidate key
– Determinant: an attribute on which some other attribute(s) is fully
functionally dependent.
33
1NF: Atomic ValuesFLT-SCHEDULE
flt# weekday airline dtime from atime to
DL242 MO WE FR DELTA 10:40 ATL 12:30 BOS
SK912 SA SU SAS 12:00 CPH 15:30 JFK
AA242 MO FR AA 08:00 CHI 10:10 ATL
Attributes must be defined over domains with atomic values
FLT-SCHEDULE
flt# weekday airline dtime from atime to
DL242 MO DELTA 10:40 ATL 12:30 BOS
SK912 SA SAS 12:00 CPH 15:30 JFK
AA242 MO AA 08:00 CHI 10:10 ATL
DL242 WE DELTA 10:40 ATL 12:30 BOSDL242 FR DELTA 10:40 ATL 12:30 BOS
SK912 SU SAS 12:00 CPH 15:30 JFK
AA242 FR AA 08:00 CHI 10:10 ATL
34
2NF: Full Functional Dependency
• An FD X Y is a full functional dependency if any attribute A X , X – {A} does not functionally determine Y.
• An FD X Y is a partial functional dependency if some attribute A X , (X – {A}) Y.
• E.g. {ssn, proj_num} hours is a full FD.
{ssn, proj_num} ename is a partial FD.
• Definition: A relation schema R is in 2NF if R is in 1NF and every nonprime attribute A is fully functionally dependent on the primary key.
35
Third Normal Form (3NF)
• Relation R with FDs F is in 3NF if, for all X A in F+
– A X (called a trivial FD), or– X contains a key for R, or– A is part of some key for R.
• Minimality of a key is crucial in third condition above!
• If R is in 3NF, obviously in 2NF.
• If R is in 3NF, some redundancy is still possible. It is a compromise, used when BCNF not achievable (e.g., no ``good’’ decomposition, or performance considerations).– Lossless-join, dependency-preserving decomposition of R
into a collection of 3NF relations always possible.
36
What Does 3NF Achieve?
• If 3NF is violated by X A, one of the following holds:– X is a subset of some key K
• We store (X, A) pairs redundantly.
– X is not a proper subset of any key.• There is a chain of FDs K X A, which means that we cannot
associate an X value with a K value unless we also associate an A value with an X value.
• But: even if relation is in 3NF, these problems could arise.– e.g., Reserves SBDC, S C, C S is in 3NF, but for
each reservation of sailor S, same (S, C) pair is stored.
37
Boyce-Codd Normal Form (BCNF)
• Reln R with FDs F is in BCNF if, for all X A in F+
– A X (called a trivial FD), or– X contains a key for R.
• In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints.– No dependency in R that can be predicted using FDs alone.– If we are shown two tuples that agree upon
the X value, we cannot infer the A value in one tuple from the A value in the other.
– If example relation is in BCNF, then 2 tuples must be identical (since X is a key).
X Y A
x y1 a
x y2 ?