11. NormalizationDatabase Design is the process of finding a "good" logical structure for the database. So far, design has been maintaining well-defined relations and choosing good primary keys. Here is a review of some of the definitions: A CANDIDATE KEY is a set of attributes, K, from the schema which always satisfies UNIQUENESS: no 2 distinct tuples have same K-value, MINIMALITY: None of Ai..Ak can be discarded and still maintain uniqueness.A PRIMARY KEY is one candidate key that is designated as the primary key (by a database administrator?).An ALTERNATE KEY is non-primary candidate key.A SUPERKEY is an attribute superset of a candidate key.NORMALIZATION is the part of Data Base design activities that precludes operational anomalies, and provide good key properties for projections.Normalization classes are (more exist): 1st Normal Form (1NF) 2nd Normal Form (2NF) 3rd Normal Form (3NF) BCNF Boyce Codd Normal Form (BCNF) 4th Normal Form (4NF) ...
I ask again: Who decides primary key? (and other design choices?)
Ph D Part Number will be the primary key for, PARTS(P#, COLOR, WT, TIME-OF-ARRIVAL).MG You're the expert.Ph D Well, P# should be the primary key, because IT IS the lookup attribute!. . . laterPointy-headed Dbexpert = Ph DThe Database design expert?NO! Not in isolation, anyway.Someone from the enterprise who understands the data and the procedures should be consulted.The following story illustrates this point. CAST: Mr. Goodwrench = MG (parts manager);MG Why is lookup so slow?Ph D You do store parts in the stock room ordered by P#, right?MG No. We store by weight! When a shipment comes in, I take each part into the back room and throw it. The lighter ones go further than the heavy ones so they get ordered by weight!
Ph D But, but weight doesn't have Uniqueness property! Parts with the same weight end up together in a pile!MG No they don't. I tire quickly, so the first one I throw goes furthest.Ph D Then well use a composite primary key, (weight, time-of-arrival).MG We get our keys primarily from Murts Lock and Key.Point: This conversation should have taken place during the 1st meeting.
Fucntional DependenciesThe defining conditions get stronger going down this list and therefore the sets of qualifying relations gets smaller going down this list.
By defining conditions 4NF BCNF 3NF 2NF 1NF
By classes of relations 4NF BCNF 3NF 2NF 1NF
Given a relation R with attributes X and Y (possibly composite)
R.Y is FUNCTIONALLY DEPENDENT (FD) on R.Xor equivalently, R.X FUNCTIONALLY DETERMINES R.Y written R.X R.Y iff (x,y1) and (x,y2) R[X,Y] implies y1=y2.
NOTES: All attributes are FD on any attribute with the uniqueness property. Functional Dependencies are like integrity constraints. They are stipulated to hold (all tuples for all times) by designers.They can't be determined simply by observing the data state at a particular time. They are quite different from Association Rules. (ARs are approximate dependencies that hold at a given confidence level over a given subset at a particular time).
Full Functional DependenciesFor composite attribute, X (composed of more than 1 attribute)
R.Y is FULLY FUNCTIONALLY DEPENDENT (FFD) on R.X or R.X FULLY FUNCTIONALLY DETERMINES R.Y written R.XR.Y
iff R.XR.Y but R.ZR.Y for any proper subset ZX.
Said another way; R.X is a candidate key (has uniqueness and minimality) for R[X,Y].
These are SEMANTIC notions. One needs to know how data is used or what is intended (ask people who created/use data!)
First Normal Form (1NF)NORMAL FORMS (initially assuming only one candidate key)
A file or relation is in First Normal Form (1NF) if there are no repeating groups (all attribute values are atomic). Allowing repeating groups in an attribute creates a situation in which the key does not determine a value in that attribute (but possibly many values).We can use a memory technique, that 1NF means: (note: THIS IS A MEMORY TECHNIQUE, NOT A DEFINITION - i.e., don't use this as a definition on an exam) every nonkey attribute is functionally dependent on key.
This EDUCATION1 file is not in First Normal Form (1NF) S#|SNAME |CHILDREN|LCODE |LSTATUS|C#|CNAME|SITE |GR 32|THAISZ|Ed,Jo,Ty|NJ5102| 1 | 8| DSDE| ND |89 25|CLAY |Ann |NJ5101| 1 | 7| CUS | ND |68 32|THAISZ|Ed,Jo,Ty|NJ5102| 1 | 7| CUS | ND |91 25|CLAY |Ann |NJ5101| 1 | 6| 3UA | NJ |76 32|THAISZ|Ed,Jo,Ty|NJ5102| 1 | 6| 3UA | NJ |62
This EDUCATION2 file is in First Normal Form (1NF) S# |SNAME |LCODE |LSTATUS|C#|CNAME|SITE|GR 32|THAISZ|NJ5102| 1 | 8| DSDE| ND |89 25|CLAY |NJ5101| 1 | 7| CUS | ND |68 32|THAISZ|NJ5102| 1 | 7| CUS | ND |91 25|CLAY |NJ5101| 1 | 6| 3UA | NJ |76 32|THAISZ|NJ5102| 1 | 6| 3UA | NJ |62
How do you get EDUCATION1 into 1NF? 1. Create a separate CHILDREN file: CHILD|S# |Ed |32| |Jo |32| |Ty |32| |Ann |25|
Second Normal Form (2NF)A file or relation is in Second Normal Form (2NF) if it is in 1NF and every nonkey attribute is fully functionally dependent on the primary key.( memory scheme: every nonkey attribute is functionally dependent on whole key).
Why do we need (want) relations to be in 2NF?
1NF relations which are not 2NF present PROBLEMS (anomalies), e.g.,
S# |SNAME |LCODE |LSTATUS|C#|CNAME|SITE|GR 32|THAISZ|NJ5102| 1 | 8| DSDE| ND |89 25|CLAY |NJ5101| 1 | 7| CUS | ND |68 32|THAISZ|NJ5102| 1 | 7| CUS | ND |91 25|CLAY |NJ5101| 1 | 6| 3UA | NJ |76 32|THAISZ|NJ5102| 1 | 6| 3UA | NJ |62
INSERT ANOMALY: Can't record JONES' LCODE until he takes a course.
DELETE ANOMALY: Delete 1st record (e.g., THAISZ drops DSDE), loose C#=8 is DSDE in ND
UPDATE ANOMALY: Change SITE of C#=6 from NJ to ND search sequentially for all C#=6.
Second Normal Form (cont.)E.g., In the EDUCATION1 file, with key (S#,C#), FD's which are not FFD are: (S#,C#) SNAME (S#,C#) LCODE (S#,C#) LSTATUS (S#,C#) CNAME (S#,C#) SITE We make these FD's into FFD's by breaking (projecting) the file into 3 files. (puts relations in 2NF) STUDENTS = EDUCATION1[S#,SNAME,LCODE,LSTATUS] ENROLL = EDUCATION1[S#,C#,GRADE] COURSE = EDUCATION1[S#,CNAME,SITE]
STUDENTSENROLLCOURSE|S#|SNAME |LCODE |LSTATUS| |S#|C#|GRADE| |C#|CNAME|SITE||25|CLAY |NJ5101| 1 ||32|8 | 89 ||8 |DSDE |ND ||23|THAISZ|NJ5102| 1 ||32|7 | 91 ||7 |CUS |ND ||38|GOOD |FL6321| 4 ||25|7 | 68 ||6 |3UA |NJ ||17|BAID |NY2091| 3 ||25|6 | 76 ||5 |3UA |ND ||57|BROWN |NY2092| 3 ||32|6 | 62 |
STUDENTS is in 2NF since it has a single attribute as key (S#) COURSE is 2NF since it has a single attribute as key (C#) ENROLL is 2NF since GRADE is FFD on (S#,C#)
Still, we have problems: INSERT ANOMALY: Can't record LSTATUS of LOC=ND2987 until a student from ND2987 registers. DELETE ANOMALY: Deleting THAISZ, loose LCODE=NJ5102 has LSTATUS=1 UPDATE ANOMOLY: Change LSTATUS=3 to 2 and 4 to 3 must search STUDENTS sequentially. (so LSTATUS=2 isn't skipped!)
The problem is: we have a transitive dependency: S#LCODE LSTATUS, i.e., an FD which doesn't involve the key (or even part of the key): LCODELSTATUS A more common transitive dependency is "CityState":
COURSE C# |CNAME|CTY|ST|8 |DSDE |Mot|ND||7 |CUS |Mot|ND||6 |3UA |Bay|NJ||5 |3UA |Mot|ND|
Delete 3rd record (cancel course C#=6) loose fact that Bay is in NJ, etc.
Third Normal Form (3NF)A file or relation is in Third Normal Form (3NF) if it is in 2NF and every nonkey attr is non-transitively dependent on primary key. (there are no transitive dependencies). (memory scheme: every nonkey attribute is functionally dependent on nothing but key)We need to project STUDENT STUDENTS[S#, SNAME, LCODE] and LOCATION STUDENTS[ LCODE, LSTATUS ]
STUDENT LOCATION ENROLL COURSE S# |SNAME |LCODE LCODE |LSTATUS |S#|C#|GRADE |C#|CNAME|SITE |25|CLAY |NJ5101 |NJ5101| 1 |32|8 | 89 |8 |DSDE |ND |23|THAISZ|NJ5102 |NJ5102| 1 |32|7 | 91 |7 |CUS |ND |38|GOOD |FL6321 |FL6321| 4 |25|7 | 68 |6 |3UA |NJ |17|BAID |NY2091 |NY2091| 3 |25|6 | 76 |5 |3UA |ND |57|BROWN |NY2092 |NY2092| 3 |32|6 | 62
In summary, the memory scheme to remember 3NF: (not a definition! Just a memory scheme) every nonkey attribute is functionally dependent upon the key 1NF the whole key and 2NF nothing but the key 3NF so help me Codd" (E.F. Codd invented relational model and normal forms)
This is an analogy based on the way in which witnesses are sworn into legal proceedings in the US: Do you swear to tell the truth, the whole truth and nothing but the truth, so help you God?"
Boyce/Codd Normal Form (BCNF)A file or relation, with possibly multiple candidate keys, is in Boyce/Codd Normal Form (BCNF) if the only determinants are the superkeys (of candidate keys).
Example of a 3NF relation which is not BCNF. ENROLL |S#|C#|GRADE|TUTOR| |32|8 | 89 |Xin | |32|7 | 91 |Sally||25|7 | 68 |Ahmed| |25|6 | 76 |Ben | |32|6 | 62 |Amit |
Primary key = (S#,C#) and each tutor is assigned to only 1 course. (Course to which a tutor is assigned is determined, so TUTOR C# ) FDs: (S#,C#)GRADE (S#,C#)TUTOR (S#,C#)(GRADE,TUTOR) TUTORC#
This isn't BCNF (since Tutor is not superkey), but it is in 3NF (Strictly speaking, since C# is not a non-key attribute).
Fourth Normal Form (4NF)A file or relation, with possibly multiple candidate keys, is in Fourth Normal Form (4NF) if it is in BCNF and all Multivalue Dependencies (MVDs) are just FDs.
What are Multivalue Dependencies? For R(A,X,Y), where A,X,Y are distinct attributes (possibly composite) R[A,X], R[A,Y] is a LOSSLESS decomposition of R iff R[A,X] JOINA R[A,Y] = R(A,X,Y)
A set of projections of a relation with at a one common attribut