
Steps in Normalization - RDBMS


DESCRIPTION

This material is not copyrighted.


THE UNIVERSITY OF TEXAS AT AUSTIN

SCHOOL OF INFORMATION

LIS 384K.11 (known as INF 385M, beginning with the Fall

Semester 2003)

DATABASE-MANAGEMENT PRINCIPLES AND

APPLICATIONS

R. E. Wyllys

Steps in Normalization

Contents: Section 1. Introduction

Section 2. Summary of Definitions of the Normal Forms

Section 3. Functional Dependency and Determinants

Section 4. The 1st Normal Form (1NF)

Section 5. The 2nd Normal Form (2NF)

Section 6. Anomalies and Normalization

Section 7. Turning a Table with Anomalies into Single-Theme Tables

Section 8. The 3rd Normal Form (3NF)

Section 9. The Boyce-Codd Normal Form (BCNF)

Section 10. The 4th Normal Form (4NF)

Section 11. The 5th Normal Form (5NF) and the Domain-Key Normal Form (DKNF)

Section 11.1. Converting a Table with Partial Dependencies into DKNF Tables

Section 11.2. Converting a Table with Transitive Dependencies into DKNF Tables

Section 11.3. Converting into DKNF a Table in Which Not Every Determinant Is a Candidate Key

Section 11.4. Converting a Table with Multivalued Dependencies into DKNF

Section 11.5. Single-Theme Tables and the DKNF

Section 1. Introduction

This handout discusses the normalization of databases. Our goal here is to explain, and to

illustrate the need for, the various normal forms through examples of sets of relations. The

relations in the examples present various difficulties, which are removed by procedures

stemming from the relevant definitions of normal forms.


Note: This lesson presents a detailed discussion of normalization. For a simple introduction to

the ideas of normalization, one source is my lesson entitled Overview of Normalization.

Section 2. Summary of Definitions of the Normal Forms

1st Normal Form (1NF)

Definition: A table (relation) is in 1NF if

1. There are no duplicated rows in the table.

2. Each cell is single-valued (i.e., there are no repeating groups or arrays).

3. Entries in a column (attribute, field) are of the same kind.

Note: The order of the rows is immaterial; the order of the columns is immaterial.

Note: The requirement that there be no duplicated rows in the table means that the table has a

key (although the key might be made up of more than one column--even, possibly, of all the

columns).

2nd Normal Form (2NF)

Definition: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all

of the key.

Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of

the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in

1NF and if it has no partial dependencies."

3rd Normal Form (3NF)

Definition: A table is in 3NF if it is in 2NF and if it has no transitive dependencies.

Boyce-Codd Normal Form (BCNF)

Definition: A table is in BCNF if it is in 3NF and if every determinant is a candidate key.

4th Normal Form (4NF)

Definition: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.

5th Normal Form (5NF)


Definition: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in

4NF and if every join dependency in the table is a consequence of the candidate keys of the

table.

Domain-Key Normal Form (DKNF)

Definition: A table is in DKNF if every constraint on the table is a logical consequence of

the definition of keys and domains.

Section 3. Functional Dependency and Determinants

Before we develop the ideas of normalization further, it is important for you to have an

understanding of "functional dependency." The essence of this idea is that if the existence of

something, call it A, implies that B must exist and have a certain value, then we say that "B is

functionally dependent on A." We also often express this idea by saying that "A determines B,"

or that "B is a function of A," or that "A functionally governs B." Often, the notions of

functionality and functional dependency are expressed briefly by the statement, "If A, then B." It

is important to note that the value B must be unique for a given value of A, i.e., any given value

of A must imply just one and only one value of B, in order for the relationship to qualify for the

name "function." (However, this does not necessarily prevent different values of A from

implying the same value of B.)
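
The uniqueness requirement just described can be checked mechanically. The following is a minimal Python sketch, not part of the original handout; the function name `determines` and the sample pairs are illustrative assumptions. It reports whether every value of A is paired with just one value of B.

```python
def determines(rows, a, b):
    """Return True if attribute `a` functionally determines attribute `b`.

    `rows` is a list of dicts; the check fails as soon as one value of `a`
    is found paired with two different values of `b`.
    """
    seen = {}
    for row in rows:
        key, value = row[a], row[b]
        if key in seen and seen[key] != value:
            return False
        seen[key] = value
    return True


# Illustrative pairs (an assumption for this sketch): B is the square of A.
pairs = [{"A": 3, "B": 9}, {"A": 4, "B": 16}, {"A": -3, "B": 9}]

a_determines_b = determines(pairs, "A", "B")  # each A has exactly one B
b_determines_a = determines(pairs, "B", "A")  # 9 is paired with both 3 and -3
```

In this sample, A determines B even though two different values of A share the same value of B; the reverse direction fails, so B does not determine A.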

For the terminology of relational databases, the word "function" was borrowed from

mathematics, where it is common to say things like "y is a function of x" or "y = f(x)". (The

latter expression is read "y equals f of x".) The determining value, x, is called the argument; the

determined value, y or f(x), is called the result.

The expression "y = f(x)" is a very general, and abstract, way of talking about functionality.

Outside of mathematics--and, in particular, ordinarily in relational database management--we

talk not abstractly but in terms of particular examples. (Indeed, the general idea of a "function" is

best understood when one has seen enough examples of specific functions to be able to start

generalizing about the abstract, or general, properties that the specific functions share.)

Here are some examples of functions. An easy one is y = x^2. This particular function says that if we are given a particular value for x, say 3, then we must say that y has the value 9. (We could also write y = f(x) = x^2 or just f(x) = x^2.) Another easy one is y = x^3. This particular function says that if we are given a particular value for x, say -2, then we must say that y has the value -8.

A common way of indicating functions is to place the determining and determined values side by side in a table. Thus we can place sample values of the function, y = x^2, in a table like the one shown here.

Value of x ("argument," or "A") | Value of y = x^2 ("the function," or "the result", or "B")
3 | 9
4 | 16
-3 | 9

This table shows just three of the infinity of possible pairs of values, x and y, for the function y = x^2. It also shows that for some functions, different values of x (here, 3 and -3) imply the same value (here, 9) of the function.

The functions we have given as examples so far have been functions that are specified by an algebraic formula. But the idea of function is more general; i.e., functions need not be algebraically defined. The essence of the idea of function is that to a specified determining value corresponds a unique determined value. This essence can be defined, among other ways, by placing the determining and determined values in a table that displays and/or defines the relationship between the argument and the result.

Note that the table above displays, but does not fully define, the relationship, y = x^2. This function, since it has an infinite number of pairs of values, cannot be fully defined in a table. For functions that involve only a finite number of pairs of values of argument and result, a table is often a convenient way--and may in fact be the only way--of displaying and, at the same time, defining the function.

Here is a simple example of a finite function that is both displayed and defined in a table. Most of you will be familiar with the conventional (though often delightfully breakable) rules for serving different types of wines with different courses in a dinner. Let us assume for the purpose of this example that these rules can be summarized as follows: with meat, serve red wine; with fish, white wine; and with cheese, rosé wine. Then the following table defines the course-wine function:

Dinner Course | Type of Wine
meat | red
fish | white
cheese | rosé

But note that this table looks just like a database table. In fact, there is no reason not to consider it a database table. Indeed, this table defines a relation in the database sense: it has columns, each of which contains entries of the same kind, and it has no duplicate rows. In other words, not only does the course-wine table display the data about the conventional rules for which wine to serve with which course, but also the table can be viewed as defining a function for which the determining value is the dinner course and the determined value is the type of wine. Thus we can say that type of wine is functionally dependent on the dinner course, or equally well, that the course determines the wine.

In relational database terminology, we often call the argument of the function (the dinner course in this example) the "determinant", and we often use an arrow notation to exhibit the functional dependency. Thus, we can say that the dinner course is the determinant of the type of wine, and we can write: dinner course → wine. And we can say that the attribute, type of wine, is functionally dependent on the attribute, dinner course.

In general, a functional dependency is a relationship among attributes. In relational databases, we can have a determinant that governs one other attribute or several other attributes. To go back to our mathematical examples for a moment, we could view the situation of functional dependency of several attributes on one determinant as being like having several linked functions that share an argument and can be displayed economically in just one table. For example, consider the following table that displays sample values of the algebraic functions y = x^2, y = x^3, and y = x^4.

Value of x | Value of x^2 | Value of x^3 | Value of x^4
3 | 9 | 27 | 81
4 | 16 | 64 | 256
-3 | 9 | -27 | 81

Looking at this table from the relational-database point of view, we can say that the attributes x^2, x^3, and x^4 are all functionally dependent on the attribute x.

Similarly, we could expand the dinner-course and wine table to exhibit also the type of cutlery that would be appropriate in the case of a formal dinner.

Dinner Course | Type of Wine | Type of Cutlery
meat | red | meat fork
fish | white | fish fork
cheese | rosé | cheese fork

From this table we see that the attributes, type of wine and type of cutlery, are functionally dependent on the attribute, dinner course.

Using the arrow notation, we have:

dinner course → wine

and

dinner course → cutlery.

Section 4. The 1st Normal Form (1NF)

Now we are ready to come to grips with the ideas of normalization. The following table, containing information about some students at Enormous State University, is a table that is in 1st Normal Form, 1NF. (Here as elsewhere in the rest of this discussion, you may want to refer back to Section 2. Summary of Definitions of the Normal Forms, where the various normal forms are defined.)

Table 4.1

Social Security Number | FirstName | LastName | Major
123-45-6789 | Jack | Jones | Library and Information Science
222-33-4444 | Lynn | Lee | Library and Information Science
987-65-4321 | Mary | Ruiz | Pre-Medicine
123-54-3210 | Lynn | Smith | Pre-Law
111-33-5555 | Jane | Jones | Library and Information Science

You can easily verify for yourself that this table satisfies the definition of 1NF: viz., it has no duplicated rows; each cell is single-valued (i.e., there are no repeating groups or arrays); and all the entries in a given column are of the same kind.

In Table 4.1 we can see that the key, SSN, functionally determines the other attributes; i.e., a given Social Security Number implies (determines) a particular value for each of the attributes FirstName, LastName, and Major (assuming, at least for the moment, that a student is allowed to have only one major). In the arrow notation: SSN → FirstName, SSN → LastName, and SSN → Major.

A key attribute will, by the definition of key, uniquely determine the values of the other attributes in a table; i.e., all non-key attributes in a table will be functionally dependent on the key. But there may be non-key attributes in a table that determine other attributes in that table. Consider the following table:

Table 4.2

FirstName | LastName | Major | Level
Jack | Jones | LIS | Graduate
Lynn | Lee | LIS | Graduate
Mary | Ruiz | Pre-Medicine | Undergraduate
Lynn | Smith | Pre-Law | Undergraduate
Jane | Jones | LIS | Graduate

In Table 4.2 the Level attribute can be said to be functionally dependent on the Major attribute. Thus we have an example of an attribute that is functionally dependent on a non-key attribute. This statement is true in the table per se, and that is all that the definition of functional dependence requires; but the statement also reflects the real-world fact that Library and Information Science is a major that is open only to graduate students and that Pre-Medicine and Pre-Law are majors that are open only to undergraduate students.

Section 5. The 2nd Normal Form (2NF)

Table 4.2 has another interesting aspect. Its key is a composite key, consisting of the paired attributes, FirstName and LastName. The Level attribute is functionally dependent on this composite key, of course; but, in addition, Level can be seen to be dependent on the attribute LastName alone. (This is true because each value of LastName is paired with just one value of Level; the two rows for Jones, for example, both show Graduate. In contrast, there are two occurrences of the value Lynn for the attribute FirstName, and the two Lynns are paired with different values of Level, so Level is not functionally dependent on FirstName.) Thus this table fails to qualify as a 2nd Normal Form table, since the definition of 2NF requires that all non-key attributes be dependent on all of the key. (Admittedly, this example of a partial dependency is artificially contrived, but nevertheless it illustrates the problem of partial dependency.)

We can turn Table 4.2 into a table in 2NF in an easy way, by adding a column for the Social

Security Number, which will then be the natural thing to use as the key.

Table 5.1

SSN FirstName LastName Major Level

123-45-6789 Jack Jones LIS Graduate

222-33-4444 Lynn Lee LIS Graduate

987-65-4321 Mary Ruiz Pre-Medicine Undergraduate

123-54-3210 Lynn Smith Pre-Law Undergraduate

111-33-5555 Jane Jones LIS Graduate

With the SSN defined as the key, Table 5.1 is in 2NF, as you can easily verify. This illustrates

the fact that any table that is in 1NF and has a single-attribute (i.e., a non-composite) key is

automatically also in 2NF.
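
The partial dependency in Table 4.2 can be confirmed with a short Python sketch. This is an illustration only, not part of the original handout; the helper name `determines` and the dictionary layout are assumptions made for the example. The rows are those of Table 4.2.

```python
def determines(rows, lhs, rhs):
    """Return True if the attributes in `lhs` functionally determine those in `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


COLUMNS = ("FirstName", "LastName", "Major", "Level")
table_4_2 = [dict(zip(COLUMNS, r)) for r in [
    ("Jack", "Jones", "LIS", "Graduate"),
    ("Lynn", "Lee", "LIS", "Graduate"),
    ("Mary", "Ruiz", "Pre-Medicine", "Undergraduate"),
    ("Lynn", "Smith", "Pre-Law", "Undergraduate"),
    ("Jane", "Jones", "LIS", "Graduate"),
]]

# Level depends on the whole composite key ...
whole_key = determines(table_4_2, ["FirstName", "LastName"], ["Level"])
# ... but also on LastName alone, which is the partial dependency ...
last_name_only = determines(table_4_2, ["LastName"], ["Level"])
# ... and not on FirstName, because the two Lynns have different levels.
first_name_only = determines(table_4_2, ["FirstName"], ["Level"])
```

Because `last_name_only` is true while the key is the pair (FirstName, LastName), the table fails the 2NF test. With a single-attribute key such as SSN, no proper part of the key exists, so this kind of failure cannot occur.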

Table 5.1 still exhibits some problems, however. For example, it contains some repeated

information about the LIS-Graduate pairing.


Section 6. Anomalies and Normalization

At this point it is appropriate to note that the main thrust behind the idea of normalizing

databases is the avoidance of insertion and deletion anomalies in databases.

To illustrate the idea of anomalies, consider what would happen to our knowledge (at least, as

explicitly contained in a table) of the level of the major, Pre-Medicine, if Mary Ruiz left

Enormous State University. With the deletion of the row for Ms. Ruiz, we would lose the

information that Pre-Medicine is an Undergraduate major. This is an example of a deletion

anomaly. We may possess the real-world information that Pre-Medicine is an Undergraduate

major, but no such information is explicitly contained in a table in our database.

As an example of an insertion anomaly, we can suppose that a new student wants to enroll in

ESU: e.g., suppose Jane Doe wants to major in Public Affairs. From the information in Table 5.1

we cannot tell whether Public Affairs is an Undergraduate or a Graduate major; in fact, we do

not even know whether Public Affairs is an established major at ESU. We do not know whether

it is permissible to insert the value, Public Affairs, as a value of the attribute, Major, or what to

insert for the attribute, Level, if we were to assume that Public Affairs is a valid value for Major.

The point is that while we may possess real-world information about whether Public Affairs is a

major at ESU and what its level is, this information is not explicitly contained in any table that

we have thus far mentioned as part of our database.
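
Both anomalies can be reproduced on the rows of Table 5.1. The Python sketch below is illustrative only; the tuple layout and the helper name `known_levels` are assumptions, not part of the original handout. It shows what the single table "knows" about majors before and after Mary Ruiz's row is deleted.

```python
# Rows of Table 5.1: (SSN, FirstName, LastName, Major, Level)
table_5_1 = [
    ("123-45-6789", "Jack", "Jones", "LIS", "Graduate"),
    ("222-33-4444", "Lynn", "Lee", "LIS", "Graduate"),
    ("987-65-4321", "Mary", "Ruiz", "Pre-Medicine", "Undergraduate"),
    ("123-54-3210", "Lynn", "Smith", "Pre-Law", "Undergraduate"),
    ("111-33-5555", "Jane", "Jones", "LIS", "Graduate"),
]


def known_levels(rows):
    """Collect the Major -> Level pairings that the table explicitly contains."""
    return {major: level for (_ssn, _first, _last, major, level) in rows}


before = known_levels(table_5_1)

# Deletion anomaly: removing the only Pre-Medicine student also removes
# the fact that Pre-Medicine is an Undergraduate major.
without_ruiz = [row for row in table_5_1 if row[0] != "987-65-4321"]
after = known_levels(without_ruiz)
lost_majors = set(before) - set(after)

# Insertion anomaly: the table holds no information about Public Affairs,
# so there is nothing to look up when Jane Doe asks to enroll in it.
public_affairs_known = "Public Affairs" in before
```

After the deletion, `lost_majors` contains Pre-Medicine, and `public_affairs_known` is false both before and after.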

A database-management system, a DBMS, can work only with the information that we put

explicitly into its tables for a given database and into its rules for working with those tables,

where such rules are appropriate and possible.

How do anomalies relate to normalization? The simple answer is that by arranging that the tables

in a database are sufficiently normalized (in practice, this typically means to at least the 4th level

of normalization), we can ensure that anomalies will not arise in our database. Anomalies are

difficult to avoid directly, because with databases of typical complexity (i.e., several tables) the

database designer can easily overlook possible problems. Normalization offers a rigorous way of

avoiding unrecognized anomalies.

Normalization may look like a difficult process when one views it from the standpoint of the

formal definitions of the various normal forms, as presented in Section 2 of this handout. But in

practice, you can easily attain sufficient normalization in your database by simply ensuring that

the tables in your database are what we can call "single-theme" tables. This idea will be

illustrated as we proceed through the rest of the discussion in this handout.

Section 7. Turning a Table with Anomalies (Table 5.1) into Single-Theme Tables

Although Table 5.1 is in 2NF, it is still open to the problems of insertion and deletion anomalies,

as the discussion in the preceding section shows. The reason is that Table 5.1 deals with more

than a single theme. What can we do to turn it into a set of tables that are, or at least come closer

to being, single-theme tables?


A reasonable way to proceed is to note that Table 5.1 deals with both information about students

(their names and SSNs) and information about majors and levels. This should strike you as two

different themes. Presented below is one possible set of single-theme tables dealing with the

information in Table 5.1. (To save space, the following tables also contain some information that

is not in Table 5.1, and the discussion will deal with this added information.)

Table 7.1

SSN FirstName LastName

123-45-6789 Jack Jones

222-33-4444 Lynn Lee

987-65-4321 Mary Ruiz

123-54-3210 Lynn Smith

111-33-5555 Jane Jones

999-88-7777 Newton Gingpoor

Table 7.2

Major Level

LIS Graduate

Pre-Medicine Undergraduate

Pre-Law Undergraduate

Public Affairs Graduate

Table 7.3


SSN Major

123-45-6789 LIS

222-33-4444 LIS

987-65-4321 Pre-Medicine

123-54-3210 Pre-Law

111-33-5555 LIS

The three preceding tables should strike you as providing a better arrangement of the information

in Table 5.1. For one thing, this arrangement puts the information about the students into a

smaller table, Table 7.1, which happily fails to contain redundant information about the LIS-

Graduate pairing. For another thing, this arrangement permits us to enter information about

students (e.g., Newton Gingpoor) who have not yet identified themselves as pursuing a particular

major. For still another thing, it puts the information about the Major-Level pairings into a

separate table, Table 7.2, which can easily be expanded to include information (e.g., that the

Public Affairs major is at the Graduate level) about majors for which, at the moment, there may

be no students registered. Finally, Table 7.3 provides the needed link between individual students

and their majors (note that Newton Gingpoor's SSN is not in this Table 7.3, which tells us that he

has not yet selected a major).

Tables 7.1 - 7.3 are single-theme tables and are in 2NF, as you can easily verify. (In fact, they are

in DKNF, but we are not yet ready to discuss the latter level in detail.)
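
One way to see that the three single-theme tables lose nothing is to load them into a database and join them back together. The sketch below uses Python's standard sqlite3 module; the table and column names (`student`, `major`, `student_major`, and so on) are assumptions chosen for this illustration, and Lynn Smith's SSN is taken as 123-54-3210, the value shown in Tables 5.1 and 7.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (ssn TEXT PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE major (major TEXT PRIMARY KEY, level TEXT NOT NULL);
    -- One major per student, as assumed at this point in the discussion.
    CREATE TABLE student_major (
        ssn TEXT PRIMARY KEY REFERENCES student(ssn),
        major TEXT NOT NULL REFERENCES major(major)
    );
""")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [   # Table 7.1
    ("123-45-6789", "Jack", "Jones"),
    ("222-33-4444", "Lynn", "Lee"),
    ("987-65-4321", "Mary", "Ruiz"),
    ("123-54-3210", "Lynn", "Smith"),
    ("111-33-5555", "Jane", "Jones"),
    ("999-88-7777", "Newton", "Gingpoor"),
])
conn.executemany("INSERT INTO major VALUES (?, ?)", [        # Table 7.2
    ("LIS", "Graduate"),
    ("Pre-Medicine", "Undergraduate"),
    ("Pre-Law", "Undergraduate"),
    ("Public Affairs", "Graduate"),
])
conn.executemany("INSERT INTO student_major VALUES (?, ?)", [  # Table 7.3
    ("123-45-6789", "LIS"),
    ("222-33-4444", "LIS"),
    ("987-65-4321", "Pre-Medicine"),
    ("123-54-3210", "Pre-Law"),
    ("111-33-5555", "LIS"),
])

# Joining the three tables reproduces the rows of Table 5.1.
rebuilt = conn.execute("""
    SELECT s.ssn, s.first_name, s.last_name, m.major, m.level
    FROM student AS s
    JOIN student_major AS sm ON sm.ssn = s.ssn
    JOIN major AS m ON m.major = sm.major
    ORDER BY s.ssn
""").fetchall()

# Students who have not yet selected a major are still recorded.
undeclared = [name for (name,) in conn.execute(
    "SELECT last_name FROM student WHERE ssn NOT IN (SELECT ssn FROM student_major)"
)]
```

In this arrangement, deleting Mary Ruiz from `student` and `student_major` leaves the Pre-Medicine row of `major` untouched, and the Public Affairs row exists even though no student is registered for it.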

Section 8. The 3rd Normal Form (3NF)

In order to discuss the 3rd Normal Form, we need to begin by discussing the idea of transitive

dependencies.

In mathematics and logic, a transitive relationship is a relationship of the following form: "If A

implies B, and if also B implies C, then A implies C." An example is: "If John Doe is a human,

and if every human is a primate, then John Doe must be a primate." Another way of putting it is

this: "If A functionally governs B, and if B functionally governs C, then A functionally governs

C." In the arrow notation, we have:

[(A → B) and (B → C)] → (A → C)


The following table, Table 8.1, provides an example of how transitive dependencies can occur in

a table in a relational database.

Table 8.1

Author Last Name | Author First Name | Book Title | Subject | Collection or Library | Building
Berdahl | Robert | The Politics of the Prussian Nobility | History | PCL General Stacks | Perry-Castañeda Library
Yudof | Mark | Child Abuse and Neglect | Legal Procedures | Law Library | Townes Hall
Harmon | Glynn | Human Memory and Knowledge | Cognitive Psychology | PCL General Stacks | Perry-Castañeda Library
Graves | Robert | The Golden Fleece | Greek Literature | Classics Library | Waggener Hall
Miksa | Francis | Charles Ammi Cutter | Library Biography | Library and Information Science Collection | Perry-Castañeda Library
Hunter | David | Music Publishing and Collecting | Music Literature | Fine Arts Library | Fine Arts Building
Graves | Robert | English and Scottish Ballads | Folksong | PCL General Stacks | Perry-Castañeda Library

By examining Table 8.1 we can infer that books dealing with history, cognitive psychology, and

folksong are assigned to the PCL General Stacks collection; that books dealing with legal

procedures are assigned to the Law Library; that books dealing with Greek literature are assigned

to the Classics Library; that books dealing with library biography are assigned to the Library and

Information Science Collection (LISC); and that books dealing with music literature are assigned

to the Fine Arts Library.


Further, we can infer that the PCL General Stacks collection and the LISC are both housed in the

Perry-Castañeda Library (PCL) building; that the Classics Library is housed in Waggener Hall;

and that the Law Library and Fine Arts Library are housed, respectively, in Townes Hall and the

Fine Arts Building.

Thus we see that there is a transitive dependency in Table 8.1: any book that deals with history,

cognitive psychology, or library biography will be physically housed in the PCL building (unless

it is temporarily checked out to a borrower); any book dealing with legal procedures will be

housed in Townes Hall; and so on. In short, if we know what subject a book deals with, we also

know not only what library or collection it will be assigned to but also what building it is

physically housed in.

What is wrong with having a transitive dependency or dependencies in a table? For one thing,

there is duplicated information: from three different rows we can see that the PCL General

Stacks are in the PCL building. For another thing, we have possible deletion anomalies: if the

Yudof book were lost and its row removed from Table 8.1, we would lose the information that

books on legal procedures are assigned to the Law Library and also the information that the Law

Library is in Townes Hall. As a third problem, we have possible insertion anomalies: if we

wanted to add a chemistry book to the table, we would find that Table 8.1 nowhere contains the

fact that the Chemistry Library is in Robert A. Welch Hall. As a fourth problem, we have the

chance of making errors in updating: a careless data-entry clerk might add a book to the LISC

but mistakenly enter Townes Hall in the building column.

The solution to the problem is, once again, to place the information in Table 8.1 into appropriate

single-theme tables. Here is one such possible arrangement:

Table 8.2

Author Last Name | Author First Name | Book Title
Berdahl | Robert | The Politics of the Prussian Nobility
Yudof | Mark | Child Abuse and Neglect
Harmon | Glynn | Human Memory and Knowledge
Graves | Robert | The Golden Fleece
Miksa | Francis | Charles Ammi Cutter
Hunter | David | Music Publishing and Collecting
Graves | Robert | English and Scottish Ballads

Table 8.3

Book Title | Subject
The Politics of the Prussian Nobility | History
Child Abuse and Neglect | Legal Procedures
Human Memory and Knowledge | Cognitive Psychology
The Golden Fleece | Greek Literature
Charles Ammi Cutter | Library Biography
Music Publishing and Collecting | Music Literature
English and Scottish Ballads | Folksong

Table 8.4

Subject | Collection or Library
History | PCL General Stacks
Legal Procedures | Law Library
Cognitive Psychology | PCL General Stacks
Greek Literature | Classics Library
Library Biography | Library and Information Science Collection
Music Literature | Fine Arts Library
Folksong | PCL General Stacks

Table 8.5

Collection or Library | Building
PCL General Stacks | Perry-Castañeda Library
Law Library | Townes Hall
Classics Library | Waggener Hall
Library and Information Science Collection | Perry-Castañeda Library
Fine Arts Library | Fine Arts Building

You can verify for yourself that none of these tables contains a transitive dependency; hence, all

of them are in 3NF (and, in fact, in DKNF).
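
The transitive dependency has not been lost by the decomposition; it can be recomputed whenever it is needed. The Python sketch below is an illustration only, with dictionary names that are assumptions. It holds Tables 8.4 and 8.5 as two mappings and derives the subject-to-building relationship by chaining them.

```python
# Table 8.4: Subject -> Collection or Library
subject_to_collection = {
    "History": "PCL General Stacks",
    "Legal Procedures": "Law Library",
    "Cognitive Psychology": "PCL General Stacks",
    "Greek Literature": "Classics Library",
    "Library Biography": "Library and Information Science Collection",
    "Music Literature": "Fine Arts Library",
    "Folksong": "PCL General Stacks",
}

# Table 8.5: Collection or Library -> Building
collection_to_building = {
    "PCL General Stacks": "Perry-Castañeda Library",
    "Law Library": "Townes Hall",
    "Classics Library": "Waggener Hall",
    "Library and Information Science Collection": "Perry-Castañeda Library",
    "Fine Arts Library": "Fine Arts Building",
}

# Chaining the two mappings yields Subject -> Building without storing it.
subject_to_building = {
    subject: collection_to_building[collection]
    for subject, collection in subject_to_collection.items()
}
```

Each collection's building is now recorded exactly once, so the data-entry error described earlier (a Library and Information Science Collection book listed in Townes Hall) has no column in which to occur.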

We can note in passing that the fact that Table 8.2 contains the first and last names of Robert

Graves in two different rows suggests that it might be worthwhile to replace it with two further

tables, along the lines of:

Table 8.6

Author Last Name | Author First Name | Author Identification Number
Berdahl | Robert | 001
Yudof | Mark | 002
Harmon | Glynn | 003
Graves | Robert | 004
Miksa | Francis | 005
Hunter | David | 006

Table 8.7

Author Identification Number | Book Title
001 | The Politics of the Prussian Nobility
002 | Child Abuse and Neglect
003 | Human Memory and Knowledge
004 | The Golden Fleece
005 | Charles Ammi Cutter
006 | Music Publishing and Collecting
004 | English and Scottish Ballads

Though Tables 8.6 and 8.7 together take a little more space than Table 8.2, it is easy to see that

given a much larger collection, in which there would be many more authors with multiple works

to their credit, Tables 8.6 and 8.7 would be more economical of storage space than Table 8.2.

Furthermore, the structure of Tables 8.6 and 8.7 lessens the chance of making updating errors

(e.g., typing Grave instead of Graves, or Miska instead of Miksa).


Section 9. The Boyce-Codd Normal Form (BCNF)

The Boyce-Codd Normal Form (BCNF) deals with the anomalies that can occur when a table

fails to have the property that every determinant is a candidate key. Here is an example, Table

9.1, that fails to have this property. (In Table 9.1 the SSNs are to be interpreted as those of

students with the stated majors and advisers. Note that each of students 123-45-6789 and

987-65-4321 has two majors, with a different adviser for each major.)

Table 9.1

SSN | Major | Adviser
123-45-6789 | Library and Information Science | Dewey
123-45-6789 | Public Affairs | Roosevelt
222-33-4444 | Library and Information Science | Putnam
555-12-1212 | Library and Information Science | Dewey
987-65-4321 | Pre-Medicine | Semmelweis
987-65-4321 | Biochemistry | Pasteur
123-54-3210 | Pre-Law | Hammurabi

We begin by showing that Table 9.1 lacks the required property, viz., that every determinant be a candidate key.

What are the determinants in Table 9.1? One determinant is the pair of attributes, SSN and Major. Each distinct pair of values of SSN and Major determines a unique value for the attribute, Adviser. Another determinant is the pair, SSN and Adviser, which determines unique values of the attribute, Major. Still another determinant is the attribute, Adviser, for each different value of Adviser determines a unique value of the attribute, Major. (These observations about Table 9.1 correspond to the real-world facts that each student has a single adviser for each of his or her majors, and each adviser advises in just one major.)

Now we need to examine these three determinants with respect to the question of whether they are candidate keys. The answer is that the pair, SSN and Major, is a candidate key, for each such pair uniquely identifies a row in Table 9.1. In similar fashion, the pair, SSN and Adviser, is a candidate key. But the determinant, Adviser, is not a candidate key, because the value Dewey occurs in two rows of the Adviser column. So Table 9.1 fails to meet the condition that every determinant in it be a candidate key.

It is easy to check on the anomalies in Table 9.1. For example, if student 987-65-4321 were to leave Enormous State University, the table would lose the information that Semmelweis is an adviser for the Pre-Medicine major. As another example, Table 9.1 has no information about advisers for students majoring in history.

As usual, the solution lies in constructing single-theme tables containing the information in Table

9.1. Here are two tables that will do the job.

Table 9.2

SSN | Adviser
123-45-6789 | Dewey
123-45-6789 | Roosevelt
222-33-4444 | Putnam
555-12-1212 | Dewey
987-65-4321 | Semmelweis
987-65-4321 | Pasteur
123-54-3210 | Hammurabi

Table 9.3

Major | Adviser
Library and Information Science | Dewey
Public Affairs | Roosevelt
Library and Information Science | Putnam
Pre-Medicine | Semmelweis
Biochemistry | Pasteur
Pre-Law | Hammurabi
History | Herodotus

By way of an example of the value of separating Table 9.1 into single-theme tables, Table 9.3

includes information about at least one faculty member at ESU who could be the adviser of a

student who wanted to major in history.

Tables 9.2 and 9.3 are in BCNF (in fact, they are in DKNF), since every determinant in them is

also a candidate key. You can easily verify this statement if you note that the key in Table 9.2 is

a composite key, SSN and Adviser.
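
The determinant-by-determinant reasoning about Table 9.1 can be sketched in Python as follows. This is an illustration only; the helper names `determines` and `identifies_rows_uniquely` are assumptions, and the second helper tests only the uniqueness property used in the discussion above, not minimality.

```python
def determines(rows, lhs, rhs):
    """Return True if the attributes in `lhs` functionally determine those in `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def identifies_rows_uniquely(rows, cols):
    """Return True if no two rows share the same values in `cols`."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    return len(set(keys)) == len(keys)


COLUMNS = ("SSN", "Major", "Adviser")
table_9_1 = [dict(zip(COLUMNS, r)) for r in [
    ("123-45-6789", "Library and Information Science", "Dewey"),
    ("123-45-6789", "Public Affairs", "Roosevelt"),
    ("222-33-4444", "Library and Information Science", "Putnam"),
    ("555-12-1212", "Library and Information Science", "Dewey"),
    ("987-65-4321", "Pre-Medicine", "Semmelweis"),
    ("987-65-4321", "Biochemistry", "Pasteur"),
    ("123-54-3210", "Pre-Law", "Hammurabi"),
]]

# Adviser is a determinant (it determines Major) ...
adviser_is_determinant = determines(table_9_1, ["Adviser"], ["Major"])
# ... but it does not identify rows uniquely, because Dewey appears twice.
adviser_is_key = identifies_rows_uniquely(table_9_1, ["Adviser"])

# The two composite determinants do identify rows uniquely.
ssn_major_is_key = identifies_rows_uniquely(table_9_1, ["SSN", "Major"])
ssn_adviser_is_key = identifies_rows_uniquely(table_9_1, ["SSN", "Adviser"])
```

A determinant that is not a key, as Adviser is here, is exactly the condition that BCNF rules out.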

Section 10. The 4th Normal Form (4NF)

The 4th Normal Form is concerned with the anomalies that can occur when a table contains

multivalued dependencies, i.e., when a table fails to have the 4NF property of containing no such

dependencies. We develop below a table that has these undesirable

multivalued dependencies.

Suppose we have some information about the hobbies of some students at Enormous State

University and want to put this information into a database. Suppose, in particular, that Jack

Jones's hobbies are surfing the Internet and playing chess; Lynn Lee's, photography and stamp

collecting; Mary Ruiz's, surfing the Internet and photography; and Lynn Smith's, playing poker.

If we (foolishly) try to put all this information into just one table, here is what we get.

Table 10.1

LastName Major Hobby

Jones Library and Information Science Surfing the Internet

Jones Library and Information Science Chess

Jones Public Affairs Surfing the Internet

Jones Public Affairs Chess

Lee Library and Information Science Photography

Lee Library and Information Science Stamp collecting

Ruiz Pre-Medicine Surfing the Internet

Ruiz Pre-Medicine Photography

Ruiz Biochemistry Surfing the Internet

Ruiz Biochemistry Photography

Smith Pre-Law Playing poker

The problem is that Jack Jones, for example, has two majors and two hobbies. If we coupled each of

his majors with just one of his hobbies (e.g., LIS with chess, or Public Affairs with surfing the

Internet), we would imply that Jack plays chess only as an LIS major and surfs the Internet only as

a Public Affairs major. This would not make sense. (Note that in this relatively small and simple

example, it is obvious that such restrictive pairing does not make sense. In practice, however, the

problems arise in connection with much larger tables, where it may be very difficult to detect that

restrictive pairing has occurred.) To avoid such false implications, we enter

all pairings of majors and hobbies for all the students. Obviously, however, this approach has the

problem of redundant information. Equally obviously, updating this table presents anomalies; for

example, you can work out for yourself what would have to be added to Table 10.1 if Jones took

up tennis as a third hobby.

This situation is an example of the effects of multivalued dependencies. A multivalued

dependency occurs when (a) a table has at least three attributes, (b) two of the attributes are

multivalued, and (c) the values of the multivalued attributes depend on only one of the remaining

attributes. Table 10.1 fits these specifications for the following reasons: The LastName attribute

determines multiple values of the attributes Major and Hobby, but neither of these latter

attributes depends on the other; they are independent.

The notation for multivalued dependency is a double arrow. In this example, we can write:

LastName →→ Major, and LastName →→ Hobby. We read these expressions as, "LastName

multidetermines Major" and "LastName multidetermines Hobby."

Once again, single-theme tables provide the solution. We break Table 10.1 down into the

following tables.

Table 10.2

LastName Major

Jones Library and Information Science

Jones Public Affairs

Lee Library and Information Science

Ruiz Pre-Medicine

Ruiz Biochemistry

Smith Pre-Law

Table 10.3

LastName Hobby

Jones Surfing the Internet

Jones Chess

Lee Photography

Lee Stamp collecting

Ruiz Surfing the Internet

Ruiz Photography

Smith Playing poker

Tables 10.2 and 10.3 display, separately, the various students' majors and hobbies; and while

doing so, these tables correctly avoid suggesting any connections between particular majors and

particular hobbies.
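
The "all pairings" requirement that Table 10.1 had to satisfy can be stated as a small test. The following Python sketch is our own illustration, not part of the original handout; the function name all_pairings_present is hypothetical. For each LastName it checks that the recorded (Major, Hobby) pairs are exactly the cross product of that student's majors and hobbies, which is what LastName multidetermining Major and Hobby independently amounts to in this example.

```python
# Illustrative sketch only; the function name is an assumption.
from itertools import product

# Rows of Table 10.1 as (LastName, Major, Hobby) tuples.
table_10_1 = [
    ("Jones", "Library and Information Science", "Surfing the Internet"),
    ("Jones", "Library and Information Science", "Chess"),
    ("Jones", "Public Affairs", "Surfing the Internet"),
    ("Jones", "Public Affairs", "Chess"),
    ("Lee", "Library and Information Science", "Photography"),
    ("Lee", "Library and Information Science", "Stamp collecting"),
    ("Ruiz", "Pre-Medicine", "Surfing the Internet"),
    ("Ruiz", "Pre-Medicine", "Photography"),
    ("Ruiz", "Biochemistry", "Surfing the Internet"),
    ("Ruiz", "Biochemistry", "Photography"),
    ("Smith", "Pre-Law", "Playing poker"),
]


def all_pairings_present(rows):
    """True if every student's majors are paired with every one of that student's hobbies."""
    pairs_by_name = {}
    for name, major, hobby in rows:
        pairs_by_name.setdefault(name, set()).add((major, hobby))
    for pairs in pairs_by_name.values():
        majors = {major for major, _ in pairs}
        hobbies = {hobby for _, hobby in pairs}
        if pairs != set(product(majors, hobbies)):
            return False
    return True


# Adding tennis for Jones under only one of his majors creates a restrictive pairing.
one_tennis_row = table_10_1 + [("Jones", "Library and Information Science", "Tennis")]
# The single table stays consistent only if tennis is added once per major.
two_tennis_rows = one_tennis_row + [("Jones", "Public Affairs", "Tennis")]

print(all_pairings_present(table_10_1))
print(all_pairings_present(one_tennis_row))
print(all_pairings_present(two_tennis_rows))
```

The check succeeds for Table 10.1, fails when tennis is entered under just one major, and succeeds again only after a second tennis row is added. With Tables 10.2 and 10.3 the same change is a single new row in Table 10.3.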

Section 11. The 5th Normal Form (5NF) and the Domain-Key Normal Form (DKNF)

The 5th Normal Form is difficult to illustrate in terms of relatively simple examples. Hence, we

will not attempt to illustrate the 5NF property of having every join dependency in the table be a

consequence of the candidate keys of the table. This omission is a minor one, for at least two

reasons: First, in practice the 4NF is often regarded as sufficient; and second, the Domain-Key

Normal Form (DKNF) subsumes the 5NF.

The DKNF is important because it offers a complete solution to the problem of avoiding

anomalies: A set of tables (relations) that is in DKNF is known, as a consequence of a theorem

proved by Ronald Fagin in 1981, to be free of anomalies. We do not attempt here to reproduce

the proof of Fagin's theorem but merely to illustrate how the theorem can be applied in practice.

The DKNF definition is this: A relation is in DKNF if every constraint on the relation is a logical

consequence of the definitions of keys and domains. To understand what this definition means,

we begin by noting that the central ideas are embodied in the words "constraint," "key," and

"domain." By "key" Fagin means both primary keys and candidate keys. By "domain" Fagin

means the set of definitions of the contents of attributes (columns) and any limitations on the

kind of data to be stored in the columns, such as a limitation to only numeric data or only logical

data; in addition, domain limitations may include such matters as the format (e.g., a limitation on

numeric data to being expressed to exactly two decimal digits). By "constraint" Fagin means any

rule dealing with attributes that is clear enough so that one can decide whether the rule is upheld

or broken by any set of the data with which one is dealing.
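
One way to make these three terms concrete is a toy relation that checks only domains and keys when a row is inserted. The Python sketch below is our own illustration and is not Fagin's formalism; the class Relation, the helper rejected, and the small MAJORS domain are all hypothetical. It uses the Adviser-Major pairing from Section 9: with Adviser as the key, the constraint "Adviser determines Major" needs no separate rule, because the key check already enforces it.

```python
# Illustrative sketch only; class, helper, and domain contents are assumptions.
class Relation:
    """A toy relation that enforces nothing except domain and key constraints."""

    def __init__(self, attributes, domains, key):
        self.attributes = attributes  # attribute names, in column order
        self.domains = domains        # attribute name -> predicate on values
        self.key = key                # tuple of attribute names forming the key
        self.rows = []

    def insert(self, **values):
        # Domain constraint: each value must belong to its attribute's domain.
        for name in self.attributes:
            if not self.domains[name](values[name]):
                raise ValueError(f"{name}={values[name]!r} is outside its domain")
        # Key constraint: no two rows may agree on every key attribute.
        new_key = tuple(values[name] for name in self.key)
        for row in self.rows:
            if tuple(row[name] for name in self.key) == new_key:
                raise ValueError(f"duplicate key {new_key!r}")
        self.rows.append({name: values[name] for name in self.attributes})


# Hypothetical domain of permitted majors, for illustration only.
MAJORS = {"Library and Information Science", "Public Affairs", "History"}

advising = Relation(
    attributes=("Adviser", "Major"),
    domains={
        "Adviser": lambda value: isinstance(value, str) and value != "",
        "Major": lambda value: value in MAJORS,
    },
    key=("Adviser",),
)
advising.insert(Adviser="Dewey", Major="Library and Information Science")
advising.insert(Adviser="Herodotus", Major="History")


def rejected(**values):
    """Return True if the relation refuses the row."""
    try:
        advising.insert(**values)
    except ValueError:
        return True
    return False


# A second major for Dewey would break "Adviser determines Major"; the key check stops it.
key_violation = rejected(Adviser="Dewey", Major="Public Affairs")
# A value outside the domain of Major is stopped by the domain check.
domain_violation = rejected(Adviser="Roosevelt", Major="Alchemy")

print(key_violation, domain_violation, len(advising.rows))
```

Both offending rows are refused and the relation still holds two rows. The sketch deliberately contains no rule beyond domains and keys, which is the situation the DKNF definition describes.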

There is an important qualification to be attached to the DKNF definition as presented in the

preceding paragraph. Fagin excludes constraints that are time-dependent or relate to changes

made in data values. That means that a time-dependent constraint (or other constraint on changes

in value) may exist in a table and may fail to be a logical consequence of the definitions of keys

and domains, yet the table may nevertheless be in DKNF.

As an illustration, some states have a property-tax rule specifying that the assessed value of the

primary-residence property owned by a citizen over 65 cannot be increased above the value that

was assessed in the year in which the property owner turned 65. The existence of such a rule

would not, in itself, prevent a table of properties and their assessed values from being in DKNF.

Achieving DKNF amounts to establishing a set of tables in each of which the constraints follow

logically from (i.e., are logical consequences of) the keys and the domain definitions. Although

there is no direct procedure for converting an arbitrary table into one or more tables each of

which is in DKNF, in practice the effort to replace an arbitrary table by a set of single-theme

tables achieves the goal. To show this, we consider some of the previous examples from the

DKNF point of view.

Section 11.1. Converting a Table with Partial Dependencies into DKNF Tables

Here once again is the table, Table 4.2, that we used in our discussion of the problem of partial

dependencies. Since we are going to use it here, we name this copy of it Table 11.1.1.

Table 11.1.1

FirstName LastName Major Level

Jack Jones LIS Graduate

Lynn Lee LIS Graduate

Mary Ruiz Pre-Medicine Undergraduate

Lynn Smith Pre-Law Undergraduate

Jane Jones LIS Graduate

Let us consider Table 11.1.1 from the DKNF point of view. First, we see that the key is composite,

consisting of the LastName-FirstName pair of attributes. We see also that all other attributes in

the table are dependent on this key. But there is another significant aspect to this table: the

Level attribute is dependent on the LastName attribute, i.e., Level is dependent on just part of

the key. (As noted earlier, this partial dependency is contrived, but nevertheless it illustrates

the problem of partial dependency.) Because Level is dependent on just LastName, the table fails to

be one in which all constraints are logical consequences of the key; hence, Table 11.1.1 is not in

DKNF.

From the DKNF point of view, therefore, we see that we should take the Level attribute out of

Table 11.1.1 and put it in some other table, or tables, where it will be a logical consequence of

the keys and domains. Clearly, a table that associates just the attributes Major and Level will

achieve this.

We will also need a table that provides the necessary link between the paired attributes,

FirstName and LastName, and the attribute Major. In such a table, the attribute Major will be a

logical consequence of the keys and domains.

Thus it appears that we need two tables, one containing just Major and Level, and the other

containing FirstName, LastName, and Major. We can indicate this more briefly as Table A:

(Major, Level) and Table B: (FirstName, LastName, Major).

Here are the tables.

Table 11.1.2 (Table A as described above)

Major Level

LIS Graduate

Pre-Medicine Undergraduate

Pre-Law Undergraduate

Table 11.1.3 (Table B as described above)

FirstName LastName Major

Jack Jones LIS

Lynn Lee LIS

Mary Ruiz Pre-Medicine

Lynn Smith Pre-Law

Jane Jones LIS

These are single-theme tables, and we arrived at them by steps aimed at achieving DKNF.
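
As a check on this decomposition, the short Python sketch below (our own illustration, not part of the original handout; all variable names are hypothetical) projects the rows of Table 11.1.1 onto Table A and Table B, rejoins them on Major, and counts how many rows would have to change if the Level recorded for LIS were revised.

```python
# Illustrative sketch only; variable names are assumptions.
# Rows of Table 11.1.1 as (FirstName, LastName, Major, Level) tuples.
table_11_1_1 = [
    ("Jack", "Jones", "LIS", "Graduate"),
    ("Lynn", "Lee", "LIS", "Graduate"),
    ("Mary", "Ruiz", "Pre-Medicine", "Undergraduate"),
    ("Lynn", "Smith", "Pre-Law", "Undergraduate"),
    ("Jane", "Jones", "LIS", "Graduate"),
]

# Table A (Table 11.1.2): one row per distinct (Major, Level) pair.
table_a = sorted({(major, level) for _, _, major, level in table_11_1_1})

# Table B (Table 11.1.3): (FirstName, LastName, Major).
table_b = [(first, last, major) for first, last, major, _ in table_11_1_1]

# Rejoin on Major; Major is the key of Table A, so the lookup is unambiguous.
level_of = dict(table_a)
rejoined = [(first, last, major, level_of[major]) for first, last, major in table_b]

# Rows that mention the Level of LIS before and after the split.
lis_rows_before = sum(1 for row in table_11_1_1 if row[2] == "LIS")
lis_rows_after = sum(1 for major, _ in table_a if major == "LIS")

print(rejoined == table_11_1_1)
print(lis_rows_before, lis_rows_after)
```

The rejoined rows are identical to Table 11.1.1, so no information is lost, and the Level for LIS is stored in one row of Table A rather than in three rows of the original table.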

Section 11.2. Converting a Table with Transitive Dependencies into DKNF Tables

Here once again is the table, Table 8.1, that we used in our discussion of transitive dependencies.

Since we are going to use it here, we name this copy of it Table 11.2.1.

Table 11.2.1

Author Last Name  Author First Name  Book Title                             Subject               Collection or Library                       Building

Berdahl           Robert             The Politics of the Prussian Nobility  History               PCL General Stacks                          Perry-Castañeda Library

Yudof             Mark               Child Abuse and Neglect                Legal Procedures      Law Library                                 Townes Hall

Harmon            Glynn              Human Memory and Knowledge             Cognitive Psychology  PCL General Stacks                          Perry-Castañeda Library

Graves            Robert             The Golden Fleece                      Greek Literature      Classics Library                            Waggener Hall

Miksa             Francis            Charles Ammi Cutter                    Library Biography     Library and Information Science Collection  Perry-Castañeda Library

Hunter            David              Music Publishing and Collecting        Music Literature      Fine Arts Library                           Fine Arts Building

Graves            Robert             English and Scottish Ballads           Folksong              PCL General Stacks                          Perry-Castañeda Library

You will recall from the discussion of this table as Table 8.1 that it exhibits the following

transitive dependencies: Book Title → Subject, Subject → Collection-Library, and

Collection-Library → Building. From the DKNF point of view, this means that the primary key, Book Title,

is not the only thing that determines the Collection-Library attribute and the Building attribute.

In turn, this means that there are constraints that are not logical consequences of the key and,

hence, that the table is not in DKNF.

Reasoning from the DKNF point of view, we would like to have a table in which the Building

attribute is a logical consequence of the key; constructing a table containing the Collection-

Library and Building attributes, with Collection-Library as key, will accomplish that. Again from

the DKNF point of view, we would like to have a table in which the Collection-Library attribute

is a logical consequence of the key; clearly, a table containing Subject (as key) and Collection-

Library suffices. The same point of view leads us to desire a table in which the Author First

Name and Author Last Name attributes will be a logical consequence of the key; such a table is

one that contains Book Title (as key), Author First Name, and Author Last Name. Finally, a table

that contains Book Title (as key) and Subject will be (1) a table in which the attribute Subject

will be a logical consequence of the key and (2) a table that provides the necessary connection

between Title and Subject.

Thus from the DKNF point of view, we are led to the same tables as previously:

Table 11.2.2

Author Last Name  Author First Name  Book Title

Berdahl           Robert             The Politics of the Prussian Nobility

Yudof             Mark               Child Abuse and Neglect

Harmon            Glynn              Human Memory and Knowledge

Graves            Robert             The Golden Fleece

Miksa             Francis            Charles Ammi Cutter

Hunter            David              Music Publishing and Collecting

Graves            Robert             English and Scottish Ballads

Table 11.2.3

Book Title Subject

The Politics of the Prussian Nobility History

Child Abuse and Neglect Legal Procedures

Human Memory and Knowledge Cognitive Psychology

The Golden Fleece Greek Literature

Charles Ammi Cutter Library Biography

Music Publishing and Collecting Music Literature

English and Scottish Ballads Folksong

Table 11.2.4

Subject Collection or Library

History PCL General Stacks

Legal Procedures Law Library

Cognitive Psychology PCL General Stacks

Greek Literature Classics Library

Library Biography Library and Information Science Collection

Music Literature Fine Arts Library

Folksong PCL General Stacks

Table 11.2.5

Collection or Library Building

PCL General Stacks Perry-Castañeda Library

Law Library Townes Hall

Classics Library Waggener Hall

Library and Information Science Collection Perry-Castañeda Library

Fine Arts Library Fine Arts Building

These are the tables presented in Section 8 as single-theme tables that solved the transitive-

dependency problem of Table 8.1. Here we have arrived at these same tables by considering how

the information in Table 11.2.1 (the same information as in Table 8.1) should be re-arranged

from the DKNF point of view.
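
The chain of dependencies can be followed through the three smaller tables one key lookup at a time. The Python sketch below is our own illustration, not part of the original handout; the dictionary names and the function locate are hypothetical. Each table is represented as a dictionary whose keys are the table's key attribute, so each fact is stored once.

```python
# Illustrative sketch only; dictionary and function names are assumptions.
# Table 11.2.3: Book Title -> Subject.
subject_of = {
    "The Politics of the Prussian Nobility": "History",
    "Child Abuse and Neglect": "Legal Procedures",
    "Human Memory and Knowledge": "Cognitive Psychology",
    "The Golden Fleece": "Greek Literature",
    "Charles Ammi Cutter": "Library Biography",
    "Music Publishing and Collecting": "Music Literature",
    "English and Scottish Ballads": "Folksong",
}

# Table 11.2.4: Subject -> Collection or Library.
collection_of = {
    "History": "PCL General Stacks",
    "Legal Procedures": "Law Library",
    "Cognitive Psychology": "PCL General Stacks",
    "Greek Literature": "Classics Library",
    "Library Biography": "Library and Information Science Collection",
    "Music Literature": "Fine Arts Library",
    "Folksong": "PCL General Stacks",
}

# Table 11.2.5: Collection or Library -> Building.
building_of = {
    "PCL General Stacks": "Perry-Castañeda Library",
    "Law Library": "Townes Hall",
    "Classics Library": "Waggener Hall",
    "Library and Information Science Collection": "Perry-Castañeda Library",
    "Fine Arts Library": "Fine Arts Building",
}


def locate(title):
    """Follow Book Title -> Subject -> Collection-Library -> Building."""
    subject = subject_of[title]
    collection = collection_of[subject]
    return subject, collection, building_of[collection]


print(locate("The Golden Fleece"))

# How often the building of the PCL General Stacks is recorded in each design.
rows_in_wide_table = sum(
    1 for title in subject_of if locate(title)[1] == "PCL General Stacks"
)
rows_in_building_table = sum(1 for c in building_of if c == "PCL General Stacks")
print(rows_in_wide_table, rows_in_building_table)
```

In Table 11.2.1 the building that houses the PCL General Stacks is repeated in three rows, whereas in Table 11.2.5 it is recorded once, so a change of building would be a single update.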

Section 11.3. Converting into DKNF a Table in Which Not Every Determinant Is a Candidate Key

Here is the table, Table 9.1, that we used earlier to illustrate the problem of a table in which not

every determinant is a candidate key. Since we are going to use it here, we name this copy of it

Table 11.3.1.

Table 11.3.1

SSN Major Adviser

123-45-6789 Library and Information Science Dewey

123-45-6789 Public Affairs Roosevelt

222-33-4444 Library and Information Science Putnam

555-12-1212 Library and Information Science Dewey

987-65-4321 Pre-Medicine Semmelweis

987-65-4321 Biochemistry Pasteur

123-54-3210 Pre-Law Hammurabi

You will recall from the discussion of this table as Table 9.1 that one determinant is the pair of

attributes, SSN and Major, which determines Adviser; another determinant is the pair, SSN and

Adviser, which determines Major; and still another is Adviser alone, which also determines Major.

And you will recall that the candidate keys are the pairs, SSN-Major and SSN-Adviser. The third

determinant, Adviser, is not a candidate key.

From the DKNF point of view, we reason as follows: If we choose SSN-Adviser as the key, then Major

is determined by, and hence is a logical consequence of, this key. If, instead, we

choose SSN-Major as the key, then Adviser is determined by, and hence is a logical consequence

of, this alternative key. But in either case, the third constraint, viz., that Adviser determines

Major, is not a logical consequence of the key. Hence, the table is not in DKNF.

In order to move from this table to a set of tables in DKNF, we can argue, from the DKNF point

of view, that we need to move Major into a table in which it will be a logical consequence of the

key. Such a table would obviously need to have Adviser as the key. If we put Adviser and Major

into such a table, then we will need at least one other table, viz., a table that provides the

necessary link between SSN and Adviser, so that we will know who each student's adviser is.

Once we have put SSN and Adviser into such a table, there is nothing further that needs to be

done.

Here are the tables.

Table 11.3.2

Major Adviser

Library and Information Science Dewey

Public Affairs Roosevelt

Library and Information Science Putnam

Pre-Medicine Semmelweis

Biochemistry Pasteur

Pre-Law Hammurabi

History Herodotus

Table 11.3.3

SSN Adviser

123-45-6789 Dewey

123-45-6789 Roosevelt

222-33-4444 Putnam

555-12-1212 Dewey

987-65-4321 Semmelweis

987-65-4321 Pasteur

123-54-3210 Hammurabi

These are the tables presented in Section 9 as single-theme tables that solved the failure of Table

9.1 to be in Boyce-Codd Normal Form. Here we have arrived at these same tables by considering

how the information in Table 11.3.1 (the same information as in Table 9.1) should be re-arranged

from the DKNF point of view.
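
The comparison of determinants with candidate keys can be carried out by brute force on the seven rows of Table 11.3.1. The Python sketch below is our own illustration, not part of the original handout; the function names project, determines, and is_unique are hypothetical. Note the assumption involved: the sketch can report only the dependencies that happen to hold in these sample rows, whereas a real dependency is a rule about all possible rows.

```python
# Illustrative sketch only; function names are assumptions, and dependencies
# are inferred from the sample rows rather than from stated business rules.
from itertools import combinations

COLUMNS = ("SSN", "Major", "Adviser")

table_11_3_1 = [
    ("123-45-6789", "Library and Information Science", "Dewey"),
    ("123-45-6789", "Public Affairs", "Roosevelt"),
    ("222-33-4444", "Library and Information Science", "Putnam"),
    ("555-12-1212", "Library and Information Science", "Dewey"),
    ("987-65-4321", "Pre-Medicine", "Semmelweis"),
    ("987-65-4321", "Biochemistry", "Pasteur"),
    ("123-54-3210", "Pre-Law", "Hammurabi"),
]


def project(rows, names):
    """Keep only the named columns of each row."""
    indexes = [COLUMNS.index(name) for name in names]
    return [tuple(row[i] for i in indexes) for row in rows]


def determines(rows, lhs, rhs):
    """True if rows that agree on the lhs columns always agree on the rhs columns."""
    mapping = {}
    for left, right in zip(project(rows, lhs), project(rows, rhs)):
        if mapping.setdefault(left, right) != right:
            return False
    return True


def is_unique(rows, names):
    """True if no two rows share the same values in the named columns."""
    values = project(rows, names)
    return len(set(values)) == len(values)


# Every one- or two-column set that determines some other column.
determinants = [
    (lhs, target)
    for size in (1, 2)
    for lhs in combinations(COLUMNS, size)
    for target in COLUMNS
    if target not in lhs and determines(table_11_3_1, lhs, (target,))
]

# Determinants whose columns do not identify a row, i.e., that are not keys.
not_keys = [(lhs, target) for lhs, target in determinants if not is_unique(table_11_3_1, lhs)]

print(determinants)
print(not_keys)
```

The sketch finds the three determinants named in the text and reports that only Adviser, which determines Major, fails to identify a row. In Table 11.3.2 every adviser appears exactly once, so there the same determinant is the key.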

Section 11.4. Converting a Table with Multivalued Dependencies into DKNF

Here is the table, Table 10.1, that we used previously to illustrate the problem of multivalued

dependencies. Since we are going to use it here, we name this copy of it Table 11.4.1.

Table 11.4.1

LastName Major Hobby

Jones Library and Information Science Surfing the Internet

Jones Library and Information Science Chess

Jones Public Affairs Surfing the Internet

Jones Public Affairs Chess

Lee Library and Information Science Photography

Lee Library and Information Science Stamp collecting

Ruiz Pre-Medicine Surfing the Internet

Ruiz Pre-Medicine Photography

Ruiz Biochemistry Surfing the Internet

Ruiz Biochemistry Photography

Smith Pre-Law Playing poker

If we analyze Table 11.4.1 from the DKNF point of view, the first thing we see is that the key in

the table is composite. It is the triple, LastName-Major-Hobby. But in an intuitive sense, the

natural key would be just LastName, since we know that there are just four students involved and

that we are trying to present data about their majors and their hobbies. The complications arise

because some of the students have more than one major and/or more than

one hobby. Another way of putting it is that the complications of the table arise from the fact that

we are trying to display, in just one table, more information than it is practicable to display in a

single table.

From the DKNF point of view, we have two constraints. One constraint concerns the natural key,

LastName, and the attribute, Major. If we set up one table that houses these attributes, then the

constraint on Major will be a logical consequence of the key, LastName. The other constraint

concerns the natural key, LastName, and the attribute, Hobby. If we set up a second table that

houses these attributes, then the constraint on Hobby will be a logical consequence of the key,

LastName. Having set up these two tables, we will find that there is nothing further to be done.

Here are the tables.

Table 11.4.2

LastName Major

Jones Library and Information Science

Jones Public Affairs

Lee Library and Information Science

Ruiz Pre-Medicine

Ruiz Biochemistry

Smith Pre-Law

Table 11.4.3

LastName Hobby

Jones Surfing the Internet

Jones Chess

Lee Photography

Lee Stamp collecting

Ruiz Surfing the Internet

Ruiz Photography

Smith Playing poker

These are the tables presented in Section 10 as single-theme tables that solved the failure of

Table 10.1 to be in 4NF. Here we have arrived at these same tables by considering how the

information in Table 11.4.1 (the same information as in Table 10.1) should be re-arranged from

the DKNF point of view.
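
As a final check, joining the two small tables on LastName should give back every row of Table 11.4.1. The Python sketch below is our own illustration, not part of the original handout; the function name join_on_last_name is hypothetical, and the data are the rows of Tables 11.4.1, 11.4.2, and 11.4.3.

```python
# Illustrative sketch only; the function name is an assumption.
table_11_4_1 = {
    ("Jones", "Library and Information Science", "Surfing the Internet"),
    ("Jones", "Library and Information Science", "Chess"),
    ("Jones", "Public Affairs", "Surfing the Internet"),
    ("Jones", "Public Affairs", "Chess"),
    ("Lee", "Library and Information Science", "Photography"),
    ("Lee", "Library and Information Science", "Stamp collecting"),
    ("Ruiz", "Pre-Medicine", "Surfing the Internet"),
    ("Ruiz", "Pre-Medicine", "Photography"),
    ("Ruiz", "Biochemistry", "Surfing the Internet"),
    ("Ruiz", "Biochemistry", "Photography"),
    ("Smith", "Pre-Law", "Playing poker"),
}

# Table 11.4.2: (LastName, Major).
table_11_4_2 = [
    ("Jones", "Library and Information Science"),
    ("Jones", "Public Affairs"),
    ("Lee", "Library and Information Science"),
    ("Ruiz", "Pre-Medicine"),
    ("Ruiz", "Biochemistry"),
    ("Smith", "Pre-Law"),
]

# Table 11.4.3: (LastName, Hobby).
table_11_4_3 = [
    ("Jones", "Surfing the Internet"),
    ("Jones", "Chess"),
    ("Lee", "Photography"),
    ("Lee", "Stamp collecting"),
    ("Ruiz", "Surfing the Internet"),
    ("Ruiz", "Photography"),
    ("Smith", "Playing poker"),
]


def join_on_last_name(majors, hobbies):
    """Pair each of a student's majors with each of that student's hobbies."""
    return {
        (name, major, hobby)
        for name, major in majors
        for other_name, hobby in hobbies
        if other_name == name
    }


rebuilt = join_on_last_name(table_11_4_2, table_11_4_3)

# Recording a new hobby is now a single row in the hobby table.
with_tennis = join_on_last_name(table_11_4_2, table_11_4_3 + [("Jones", "Tennis")])

print(rebuilt == table_11_4_1)
print(len(with_tennis) - len(rebuilt))
```

The join reproduces all eleven rows of Table 11.4.1. One new row in Table 11.4.3 accounts for two new rows in the joined result, one for each of Jones's majors, without anyone having to remember to enter both.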

Section 11.5. Single-Theme Tables and the DKNF

What has the preceding discussion shown us?

We have seen that when we analyze, from the DKNF point of view, tables with various kinds of

problems, we find--again and again--that the solutions to the problems consist in turning a

complicated, multi-theme table into sets of single-theme tables, tables which satisfy the

requirements of the DKNF. If, on the other hand, we analyze a complicated, problem-laden table

from the point of view of turning it into a set of single-theme tables, we thereby achieve--again

and again--a set of tables that satisfy the requirements of the DKNF.

In short, sets of single-theme tables will almost always be sets of tables in DKNF and, as such,

will be sets of tables that avoid the various kinds of anomalies that we want to avoid.