SQL Unit 15 Normalization

prepared by Kirk Scott

• 1. Normal Forms

• 2. First Normal Form

• 3. Second Normal Form

• 4. Third Normal Form

• 5. Boyce-Codd Normal Form

• 6. Higher Normal Forms

• 7. Domains

• 8. Nulls and Integrity

1. Normal Forms

• The benefits of relational database theory can be summarized as follows:

• There is a step-by-step way of arriving at a correct design

• There is a way of detecting flaws in a design

• The design process has to do with the problem domain, not with computer-related questions

• The database designer and user are protected from questions related to the implementation of the dbms and the hardware it’s running on

• Finally, if the design is correct, it will be possible to:

• Store all desired information in it;

• Update the information on an ongoing basis;

• Retrieve any/all of the information as needed.

• Correct designs are based on what are called normal forms.

• This section presents the background information to the design process.

• It also discusses and illustrates the use of normal forms.

Identifying Entities

• At its most basic level, design of a database depends on determining what you want to store information about.

• When deciding what the base tables will be, you are trying to identify entities.

• From a language point of view, this involves identifying nouns which do not modify other things.

Identifying Attributes

• Identifying the entities leads to identifying their attributes.

• Attribute names usually end up being nouns too, but you figure out what they are when you try to describe entities, and the descriptions usually involve adjectives.

• One of the key points of database design is that you only store information about the entities and attributes you need to.

• There may be many possible entities

• All entities may have a long list of potential attributes

• But you limit yourself to only those things you will need to retrieve information about in the future.

Identifying Keys

• You are familiar with primary keys and foreign keys.

• When trying to organize the attributes around entities in the design, the idea is to equate an entity with a primary key field

• Then group the attributes with the entities that they describe.

• Relationships between tables are captured by embedding the primary keys of one or more tables as foreign keys in other tables.

Functions, Determination, and Dependency

• When described in general, the foregoing sounds sensible enough.

• That’s why the book claims that if you can model successfully, the result will be a correct design

• In practice it can be difficult to do without formal guidelines.

• This is what the normal forms provide.

• The normal forms are based on and described in terms of an idea taken from math.

• One field in a table may functionally determine another.

• Stated in reverse order: The other field depends functionally on the one.

• This is an example of a mathematical function:

• y = f(x), for example, y = x²

• y is a function of x.

• x is in the domain and y is in the range.

• x functionally determines y

• Or, y functionally depends on x.

• For a mathematical function, you find the dependent value by doing some sort of computation on the determining value.

• The key point underlying a function is the following:

• For each value of x, there can only be one corresponding value of y.

• x uniquely determines y.

• The analogy in database design is the following:

• The primary key of a table should functionally determine the values of the other fields in the table.

• In other words, the non-key fields should functionally depend on the primary key field.

• Just as the primary key uniquely identifies a record, it uniquely determines the values of the fields in the record

• Take this small table for example:

• This is its schema:

• Person(SSN, name, dob)

SSN          Name  dob
123-45-6789  Bob   1/1/01
…

• You don’t find a person’s name or birthdate by doing a computation on their social security number.

• However, given any one social security number, there is exactly one corresponding name and exactly one corresponding date of birth.

• It is true that different people with different social security numbers may have the same name and the same date of birth, but this is not a problem.

• The point of the primary key field is that it is the unique identifier that makes it possible to distinguish between these two people.

• This idea came up at the beginning of the course

• The point now is that the name and date of birth fields functionally depend on the social security number field.
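• A functional dependency can be tested against sample data. The following is a minimal sketch in Python, not part of the original notes. The function name `determines` and the second sample row are made up for illustration.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, each lhs value goes with exactly one rhs value."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            # Same determining value, two different dependent values
            return False
        seen[key] = row[rhs]
    return True


# Hypothetical sample data for Person(SSN, name, dob)
people = [
    {"ssn": "123-45-6789", "name": "Bob", "dob": "1/1/01"},
    # A different person who happens to share a name and date of birth
    {"ssn": "987-65-4321", "name": "Bob", "dob": "1/1/01"},
]

print(determines(people, "ssn", "name"))  # SSN determines name in this data
print(determines(people, "name", "ssn"))  # name does not determine SSN
```

• Note that a check like this can only show that a dependency fails in the data at hand. Whether a dependency really holds is a statement about the problem domain, not about any one set of rows.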

• A new notation can be used to indicate this.

• In this notation, the arrows go from the field that functionally determines another field, to the field that is dependent.

• This is illustrated on the next overhead.

Normal Forms

• Some of the normal forms are identified by number, for example 1st, 2nd, and 3rd normal forms.

• Others are identified by name, for example Boyce-Codd normal form, named after the people who discovered it.

• These four normal forms are abbreviated 1NF, 2NF, 3NF, and BCNF, respectively.

• There are also higher normal forms, 4th, 5th, and domain key normal forms (4NF, 5NF, DKNF).

• The normal forms have to do with finding dependencies in tables which spring from fields other than the primary key.

• These dependencies are undesirable and may be referred to as stray dependencies.

• The normal forms make increasingly strict statements about the kinds of stray dependencies that have to be eliminated from correctly designed tables.

• Designs containing stray dependencies are said to violate the normal forms.

Eliminating Dependencies

• The design process using normal forms consists of repetitive steps:

• Make a design

• Identify stray dependencies (normal form violations)

• Redesign to eliminate the dependencies

• Once you’ve eliminated all occurrences of one type of violation, you will have promoted the design into the next higher normal form

• Repeat until you’ve reached the highest normal form

• The rule of thumb at every stage is to remove stray dependencies in the following way:

• Make any field which determines other fields the primary key of a new table, and move the fields that depend on that field to the new table.

• Make sure that the new table is connected to the old table by a primary key, foreign key pair.

Anomalies

• Design problems that are based on violations of normal forms lead to what are called anomalies.

• The hallmark of a problematic design is that the same information is stored multiple times.

• In other words, there is redundancy in the database.

• Depending on the nature of the redundancy, this can lead to problems when inserting data, when updating data, and when deleting data.

Justifying Normal Forms

• The use of normal forms may seem unnecessarily theoretical at first.

• However, they provide a convenient way of identifying problems in designs and then eliminating them.

• Normal forms are what justify these claims about relational databases:

– There is a step-by-step way of arriving at a correct design

– There is a way of detecting flaws in a design

The Plan of Action for the Following Sections

• Each of the following sections will present a normal form in this way:

• A definition of the normal form will be given.

• A scenario for information to be held in a database will be given, with the underlying assumptions given.

• An example database design which violates the normal form will be given

• The violation will be shown using a diagram with the notation indicating functional dependencies.

• The desired functional dependencies from the primary key will be shown using arrows below the field names.

• Undesired, stray dependencies, which need to be eliminated in order to correct the design, will be shown using arrows above the field names.

• Anomalies resulting from the incorrect design will be discussed.

• In general, there will be insert, update, and delete anomalies

• Finally, a corrected design will be given.

Basis for Examples

• All of the examples will be based on the general topic of cars, salespeople, customers, and car sales.

• Some of the field names are abbreviated, and some of the fields clearly belong together in some way.

• Here is a little preliminary explanation regarding the fields that will be in the examples.

• Not all of the fields will appear in all of the examples.

• vin: vehicle identification number. Vehicles have makes, models, and years.

• spno, spname: Salesperson number and name.

• custno, custname: Customer number and name.

• A car sale has a salesprice and a date.

2. First Normal Form

• 1NF Definition:

• Formally (Watson):

• A relation is in first normal form if and only if all columns are single-valued.

• Informally:

• Data is stored in flat files; there can be no repeating groups in a record.

• (This was mentioned in the very first unit.)

• The assumptions underlying the design are that a salesperson can sell many cars, but each car can only be sold by one salesperson.

• In this design, each car is only sold once, so the design captures information about the sales of new cars.

• These assumptions don’t cause the problem.

• It is the implementation of them that causes the problem.

• Here is the design that violates 1NF:

• Carsale(spno, spname, {vin, salesprice})

• The example design uses {} notation to indicate repeating groups of fields.

• A diagram with arrows illustrating this design is given on the next overhead

• The repeating group alone is a sufficient problem to make this kind of design incorrect.

• However, the design does have anomalies

• These foreshadow the kinds of anomalies that all violations entail.

• Insert: You can’t store information about a car which hasn’t been sold.

• Update: To have an update anomaly, the assumptions would have to be changed.

• If a car could be sold more than once, information about it would appear in more than one row of the table

• An update to the car information (its vin) would require updating multiple rows

• Delete: The deletion anomaly is the mirror image of the insertion anomaly.

• If the record of a car sale is deleted, then information about the car is lost, as well as information about the sale.

• The solution to all basic normal form violations is the same:

• Break the stray dependency out into a separate table

• In this case, break out the information contained in the repeating group

• As stated in the assumptions, one salesperson can sell many cars, but each car is sold only once, so there is a 1-m relationship between the two tables in the resulting design.

• The primary key of the table containing salesperson information will have to be embedded as a foreign key in the table containing car information.

• Here is the corrected design:

• Salesperson(spno, spname)

• Carsale(vin, salesprice, spno f.k.)
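• The corrected 1NF design can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column types, the sample salesperson, and the sample cars are assumptions that do not appear in the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE Salesperson (
        spno   INTEGER PRIMARY KEY,
        spname TEXT NOT NULL
    );
    CREATE TABLE Carsale (
        vin        TEXT PRIMARY KEY,
        salesprice REAL,
        spno       INTEGER NOT NULL REFERENCES Salesperson(spno)
    );
""")

# Hypothetical data: one salesperson who has sold two cars
conn.execute("INSERT INTO Salesperson VALUES (1, 'Ann')")
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [("VIN001", 21000.0, 1), ("VIN002", 18500.0, 1)],
)

# The former repeating group is now one row per car, joined back by spno
rows = conn.execute("""
    SELECT s.spname, c.vin, c.salesprice
    FROM Salesperson AS s JOIN Carsale AS c ON c.spno = s.spno
    ORDER BY c.vin
""").fetchall()
print(rows)
```

• Every column now holds a single value, and the 1-m relationship is carried by the spno foreign key in Carsale.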

3. Second Normal Form

• 2NF Definition:

• Formally (Watson):

• A relation is in second normal form if and only if it is in first normal form, and all nonkey columns are dependent on the key

• Informally:

• In a table with a concatenated primary key field, there can be no stray dependencies that originate in just part of the primary key field.

• The basic idea is that all nonkey fields have to depend on the whole key.

• Stating it in this way will lead to a useful mnemonic device which will be given later.

• When you lay it out in this way, you begin to realize that 2NF deals with tables that have concatenated key fields, where a dependency from only one field of the key might be possible.

• In this example the underlying assumptions are that the same car can come back to the lot and be sold more than once.

• It can be sold by the same salesperson more than once, but not on the same day.

• It can also be sold by different salespeople at different times.

• Although unlikely, the design is made so that two different salespeople could sell the same car on the same date.

• The design doesn’t contain any information about customers, but the scenario would be that one customer brought the car back, and a different salesperson sold it again.

• It seems unlikely that the same customer would buy the same car twice, whether on the same date or different dates.

• In summary, this design works for used car sales and both the date and the salesperson information are needed, along with the car information, to distinguish between different sales.

• Here is the design that violates 2NF:

• Carsale(vin, spno, date, spname)

• A diagram with arrows illustrating this is given on the next overhead

• This faulty design has insert, update, and delete anomalies.

• Suppose a salesperson has not yet sold a car.

• In this case, it is not possible to insert information about that salesperson.

• On the other hand, a salesperson may make many sales.

• This means that the same information about that salesperson would be stored in more than one record in the table.

• This is redundancy.

• Not only is the redundancy itself wasteful, it leads to the update anomaly.

• Suppose the salesperson’s name changes.

• Then it’s necessary to update multiple records to reflect this fact, not just one.

• The delete anomaly is related to the insert anomaly.

• Suppose that as part of the maintenance of the database, on a yearly basis the sales table is cleared.

• When you delete the last record containing a sale by a particular salesperson, you not only get rid of the sales record, you also lose the salesperson’s name.

• As usual, the solution to the problem is to break the stray dependency out into a table of its own.

• Each car sale has only one salesperson, but each salesperson can be involved in many sales, so this is a 1-m relationship.

• The salesperson information is stored in a table by itself, and the primary key of the salesperson table is embedded as a foreign key in the car sale table.

• Here is the corrected design:

• Salesperson(spno, spname)

• Carsale(vin, date, spno f.k.)
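• The following sketch, using Python's sqlite3 module, shows how the corrected 2NF design avoids the three anomalies. The column types, the sample rows, and the column name `saledate` (used here in place of date) are assumptions made for illustration. The concatenated key (vin, spno, saledate) follows the scenario described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (
        spno   INTEGER PRIMARY KEY,
        spname TEXT NOT NULL
    );
    CREATE TABLE Carsale (
        vin      TEXT,
        spno     INTEGER REFERENCES Salesperson(spno),
        saledate TEXT,
        PRIMARY KEY (vin, spno, saledate)
    );
    -- Raj has no sales yet, but can still be stored (no insert anomaly)
    INSERT INTO Salesperson VALUES (1, 'Ann Smith'), (2, 'Raj');
    INSERT INTO Carsale VALUES ('VIN001', 1, '2024-01-05'),
                               ('VIN001', 1, '2024-06-10'),
                               ('VIN002', 1, '2024-02-01');
""")

# A name change touches one row, however many sales Ann has made
cur = conn.execute("UPDATE Salesperson SET spname = 'Ann Jones' WHERE spno = 1")
changed = cur.rowcount

# Clearing the sales table does not lose any salesperson names
conn.execute("DELETE FROM Carsale")
names = [r[0] for r in conn.execute("SELECT spname FROM Salesperson ORDER BY spno")]
print(changed, names)
```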

4. Third Normal Form

• 3NF can be defined as follows:

• There can be no stray dependencies from one non-key field to another.

• In this example, for the sake of simplicity, it is assumed that new cars are being sold and they can only be sold once.

• Information about the customer is also recorded with the sale.

• Each car can only be bought by one customer.

• It would be possible for a customer to buy more than one car.

• Here is the design that violates 3NF:

• Carsale(vin, custno, custname, salesprice, date)

• A diagram with arrows illustrating this is given on the next overhead

• This design also has insert, update, and delete anomalies and the pattern of the anomalies is the same as in the previous example.

• They all stem from the presence of the stray dependency in the design.

• If you have a potential customer who has not yet bought a car, it is impossible to insert information about that person.

• If a customer has bought more than one car, the customer information is stored redundantly.

• In that case, if the customer’s name changes, it’s necessary to change multiple records.

• Finally, if the sales table is cleared on a regular basis, when you delete the last sales record for a given customer, you not only get rid of the sales record, you also lose the customer’s name.
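• Applying the usual rule of thumb, the stray dependency from custno to custname can be broken out into a Customer table, with custno left behind in Carsale as a foreign key. The sketch below, using Python's sqlite3 module, is one way this might look. The table name `CarsaleFlawed`, the column name `saledate`, the column types, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flawed design: custname depends on custno, which is not the key
    CREATE TABLE CarsaleFlawed (
        vin        TEXT PRIMARY KEY,
        custno     INTEGER,
        custname   TEXT,
        salesprice REAL,
        saledate   TEXT
    );
    INSERT INTO CarsaleFlawed VALUES
        ('VIN001', 10, 'Lee', 21000, '2024-01-05'),
        ('VIN002', 10, 'Lee', 18500, '2024-03-09'),
        ('VIN003', 11, 'Kim', 30000, '2024-04-01');

    -- The determining field becomes the primary key of a new table
    CREATE TABLE Customer (
        custno   INTEGER PRIMARY KEY,
        custname TEXT
    );
    INSERT INTO Customer
        SELECT DISTINCT custno, custname FROM CarsaleFlawed;

    -- The old table keeps custno as a foreign key and drops custname
    CREATE TABLE Carsale (
        vin        TEXT PRIMARY KEY,
        custno     INTEGER REFERENCES Customer(custno),
        salesprice REAL,
        saledate   TEXT
    );
    INSERT INTO Carsale
        SELECT vin, custno, salesprice, saledate FROM CarsaleFlawed;
""")

customers = conn.execute(
    "SELECT custno, custname FROM Customer ORDER BY custno"
).fetchall()
print(customers)  # each customer name is now stored once
```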

• There is a situation that can arise in database designs that appears to be a violation of 3NF, but isn’t.

• The most common example of this situation is a table which includes a city, state, and zip code as part of an address.

• The postal service has divided up the country into zones which are identified by zip codes.

• As a simplification, assume that none of these zones cross city or state boundaries.

• That means that a zip code determines the city and state.

• It is not necessary for someone to break this dependency out of their database design.

• The rule of thumb is that if you are not responsible for maintaining the dependency, then you can ignore it.

• The postal service has a table somewhere with zip code as the primary key and all of the descriptive fields about zip code that exist.

• The post office maintains this.

• A table not maintained by the postal service can contain addresses with zip codes and completely ignore the fact that there may in reality be a dependency.

5. Boyce-Codd Normal Form

• A formal statement of BCNF would be somewhat theoretical.

• Once understood, such a definition would make it clear that BCNF is a summation of 1NF through 3NF which covers one other case which is not covered by the previous normal forms.

• It is easier to explain BCNF by just presenting this special case and explaining it.

• BCNF says that there can be no stray dependencies from a non-key field to a field in the key.

• For the purposes of this example suppose that the same car can be sold by the same salesperson more than once, but only one sale of that car is possible per date.

• Suppose also that this dealership has a system for assigning prospective customers to specific salespeople, so that each salesperson is associated with an exclusive list of clients.

• It would be normal to assume that this system is implemented in some sort of table.

• Such a table is not shown here—it will become part of the solution to the problem.

• The point now is to show the problem this assumption leads to in the table of interest.

• Here is the design which violates BCNF:

• Carsale(vin, spno, date, custno)

• A diagram with arrows illustrating this is given on the next overhead

• The anomalies in this design are analogous to the anomalies in the previous designs.

• It is not possible to insert information about the relationship between a given customer and salesperson without a sales record which matches them.

• If the given customer has bought from the same salesperson many times, their relationship is in multiple records.

• An update would require changes in multiple records.

• Finally, if you’re down to the last record containing information about a particular pair, deleting the record would cause the information to be lost.

• As usual, the solution to the problem is to break out the stray dependency in a separate table.

• Here is the corrected design:

• Carsale(vin, date, custno f.k.)

• Customer-Salesperson(custno, spno)

• The point is that if customers are uniquely associated with a single salesperson, and the car sale record tells you who bought the car, you can then look up the salesperson in the Customer-Salesperson table.
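• The lookup can be sketched as a join, using Python's sqlite3 module. The table name `CustomerSalesperson` (written without the hyphen), the column name `saledate`, the column types, and the sample rows are assumptions made for illustration. The key (vin, saledate) reflects the assumption that a car can be sold only once per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CustomerSalesperson (
        custno INTEGER PRIMARY KEY,   -- each customer has exactly one salesperson
        spno   INTEGER NOT NULL
    );
    CREATE TABLE Carsale (
        vin      TEXT,
        saledate TEXT,
        custno   INTEGER REFERENCES CustomerSalesperson(custno),
        PRIMARY KEY (vin, saledate)
    );
    INSERT INTO CustomerSalesperson VALUES (10, 1), (11, 2);
    INSERT INTO Carsale VALUES ('VIN001', '2024-01-05', 10),
                               ('VIN001', '2024-08-20', 11);
""")

# spno is no longer stored in Carsale; it is looked up through custno
sales = conn.execute("""
    SELECT c.vin, c.saledate, c.custno, cs.spno
    FROM Carsale AS c
    JOIN CustomerSalesperson AS cs ON cs.custno = c.custno
    ORDER BY c.saledate
""").fetchall()
print(sales)
```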

• There is another aspect of BCNF that needs to be explained.

• Consider a design which includes both a university-generated student id number and a social security number.

• It would seem to violate BCNF as explained above:

• Here is the design which seems to violate BCNF:

• Student(studentIDno, SSN, name)

• A diagram with arrows illustrating this is given on the next overhead

• The additional part of BCNF is that if the stray dependency results from another field which also could have been chosen as a primary key for the table, then it is not a normal form violation.

• In other words, both studentIDno and SSN are valid, unique identifiers of students.

• You might want to record both.

• It is simply necessary to choose one of them as the primary key of the table.

• The presence of the other one in the table does no harm.

• Up through BCNF the normal forms can be explained in terms of stray dependencies.

• An easy way to remember the requirements for these normal forms is the following statement:

• Every field in a table has to depend on the key, the whole key, and nothing but the key.

• Because they are increasingly strict, the normal forms can be thought of as nested.

• When checking a design, you begin with the lowest normal form, make sure there are no violations, and move on to the following ones.

• This is what makes the design process step-by-step.

• This idea can be represented using a Venn diagram.

• The idea is that the set of designs which is in some normal form is always a subset of those designs which meet the conditions for a lower normal form.

• A diagram of this is shown on the following overhead

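[Venn diagram: nested sets, with 1NF as the outermost set, containing 2NF, which contains 3NF, which contains BCNF, and so on]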

6. Higher Normal Forms

• It was claimed earlier that there are only three kinds of relationships: 1-1, 1-m, and m-n.

• This is not entirely true.

• There may be many-to-many-to-many relationships (relationships between 3 different types of entities at the same time, m-m-m),

• and in theory there is no reason why there can’t be relationships among 4 or more different types of entities at the same time.

• Fourth and fifth normal form, 4NF and 5NF, have to do with cases like these.

• The presentation of 4NF will be done in the opposite order to the presentation of the earlier normal forms.

• First an example of a valid design will be given,

• and then a statement will be made about the nature of a design that violates 4NF.

• Suppose that a given car can be sold more than one time.

• In other words, you’re dealing in used cars.

• Suppose also that salespeople can sell more than one different car, and customers can buy more than one different car.

• This means that there are three 1-m relationships.

• For the three base tables, Car, Salesperson, and Customer, there could be one table in the middle, Carsale, which brought all three together.

• The idea can be represented using ER modeling.

• This results in the star shaped design shown on the next overhead:

[ER diagram: Carsale in the center, connected to Car, Salesperson, and Customer]

• The relationships are captured by embedding primary keys as foreign keys, and a valid design can be given as follows:

• Car(vin, make, model, year)

• Salesperson(spno, spname)

• Customer(custno, custname)

• Carsale(vin, spno, custno, date)
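• The star design can be sketched with Python's sqlite3 module as follows. The column types, the sample rows, the column name `saledate`, and the choice of all four Carsale columns as its primary key are assumptions, since the notes do not mark the key of Carsale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE Car (vin TEXT PRIMARY KEY, make TEXT, model TEXT, year INTEGER);
    CREATE TABLE Salesperson (spno INTEGER PRIMARY KEY, spname TEXT);
    CREATE TABLE Customer (custno INTEGER PRIMARY KEY, custname TEXT);

    -- The table in the middle holds one row per sale that actually occurred
    CREATE TABLE Carsale (
        vin      TEXT    REFERENCES Car(vin),
        spno     INTEGER REFERENCES Salesperson(spno),
        custno   INTEGER REFERENCES Customer(custno),
        saledate TEXT,
        PRIMARY KEY (vin, spno, custno, saledate)
    );

    INSERT INTO Car VALUES ('VIN001', 'Ford', 'Focus', 2015);
    INSERT INTO Salesperson VALUES (1, 'Ann');
    INSERT INTO Customer VALUES (10, 'Lee');
    INSERT INTO Carsale VALUES ('VIN001', 1, 10, '2024-01-05');
""")

# A sale that refers to a car not in the Car table is refused
try:
    conn.execute("INSERT INTO Carsale VALUES ('NO_SUCH_VIN', 1, 10, '2024-02-01')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```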

• If someone tried to create a design which had information on all three types of entities, cars, salespeople, and customers, in the same table, this would be a 4NF violation.

• No example of this is given.

• After working up through BCNF it should be clear that when analyzing such a table you would find more than one stray dependency.

• By removing each of the stray dependencies in succession, you would solve the problem.

• 4NF violations like this are not common.

• Anyone familiar with database design principles would not try to put three types of entities together in a single table in the first place.

• On the other hand, people who are unfamiliar with the rules sometimes think that they should try and cram as much information into a single table as possible.

• If that happens, then a violation such as this is possible.

• You may have realized that there is another way to relate all three of the base tables together.

• What if each pair were related in an m-m relationship?

• The idea can be represented using ER modeling.

• This results in the design with a cycle in it shown on the next overhead:

[ER diagram: Car, Salesperson, and Customer arranged in a cycle, with each pair connected through a linking table: Car-Customer, Customer-Salesperson, and Salesperson-Car]

• The design could also be represented in this way:

• Car(vin, make, model, year)

• Car-Customer(vin f.k., custno f.k.)

• Customer(custno, custname)

• Customer-Salesperson(custno f.k., spno f.k.)

• Salesperson(spno, spname)

• Salesperson-Car(spno f.k., vin f.k., date)

• This design does not violate 4NF like the previous scenario of cramming all of the information into a single table.

• The question is, does this design correctly capture all of the assumptions stated above?

• In general, designs with cycles in them are difficult to understand, and in the context of 4NF, the design with the cycle is not desirable, while the design with the star is desirable.

• If you traced all of the links in the design with the cycle, you would find that every car is connected to every salesperson is connected to every customer.

• Put in business terms, if this design is supposed to represent car sales, at one time or another every salesperson has sold every car and every customer has bought every car.

• This does not agree with the assumptions underlying the star design, where the one table in the middle captures information for that subset of possible sales that actually occurred.

• The cyclical design leads to a brief consideration of 5NF.

• The question now becomes, if every possible pair of relationships actually does exist, which design is better, the one with the star or the one with the cycle?

• In this case, the design with the cycle is better.

• At this point the car sale example breaks down.

• It is unrealistic to think that every car would be sold by every salesperson and bought by every customer.

• However, there are occasionally situations where every entity in every base table is related to every other entity in every other base table.

• A general description of what 5NF says is the following:

• A design is correct if two conditions are met:

• All real relationships between entities are captured by the design;

• no false relationships between entities are captured by the design.
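• A small experiment shows how a design can capture a false relationship. The sketch below, in plain Python, starts from a made-up set of sales, splits it into the three pairwise tables of the cycle design, and then joins them back together. The sample values are hypothetical, and dates are left out to keep the example short.

```python
# Hypothetical sales that actually occurred: (vin, spno, custno)
sales = {("V1", "S1", "C1"), ("V1", "S2", "C2"), ("V2", "S1", "C2")}

# The three pairwise tables of the cycle design, obtained by projection
car_sp = {(v, s) for v, s, c in sales}
sp_cust = {(s, c) for v, s, c in sales}
car_cust = {(v, c) for v, s, c in sales}

# Rejoin: keep every triple whose three pairs all appear in the pairwise tables
rejoined = {
    (v, s, c)
    for (v, s) in car_sp
    for (s2, c) in sp_cust
    if s == s2 and (v, c) in car_cust
}

# Triples the cycle design implies but which were never real sales
spurious = rejoined - sales
print(sorted(spurious))
```

• With this data the rejoin contains every real sale, so the first condition is met, but it also contains a sale that never happened, so the second condition is not. The star design, with one row per actual sale, avoids this.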

7. Domains

• The highest normal form, domain-key normal form (DKNF), is a theoretical statement of how relationships are formed between tables.

• This form is not numbered, because the theoretical statement encompasses all of the other normal forms.

• This theoretical statement does not give you a step-by-step procedure for determining whether or not a design has violations and fixing them.

• DKNF is based on the idea of domains, which have not been explained yet.

• Although this normal form is of little practical use, the idea of domains is important for correctly capturing the relationships between tables, and will be explained.

• There is a preliminary point to be made before talking about domains.

• When a table is created, a complete definition has to tell the data type of each field.

• Some fields may hold numeric values, some may hold strings of characters, some may hold dates, etc.

• If a field holds strings of characters, its length, or maximum length, also has to be stated.

• So, for example, a person’s last name may be defined as containing a maximum of 24 characters.

• A social security number has 9 digits.

• Although the social security number is called a number, it is never used numerically.

• There is no need to add, subtract, multiply, or divide it, and a good design will prevent that.

• The simple way to do so is to define this field as a character field containing 9 characters, where valid characters in this field are limited to digits.
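
• One way such a field definition might look is sketched below. It assumes SQLite, driven from Python's sqlite3 module, and a hypothetical person table. Other database management systems express the same rule with their own syntax. The CHECK constraints are an assumption about how to enforce the rule, since SQLite does not enforce declared character lengths by itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE person (
        -- exactly 9 characters, none of which is a non-digit
        ssn       TEXT NOT NULL
                  CHECK (length(ssn) = 9 AND ssn NOT GLOB '*[^0-9]*'),
        -- at most 24 characters
        last_name TEXT CHECK (length(last_name) <= 24)
    )
    """
)

def try_insert(ssn, last_name):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        con.execute("INSERT INTO person VALUES (?, ?)", (ssn, last_name))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("123456789", "Lincoln")         # 9 digits
rejected_letter = try_insert("12345678X", "Lincoln")  # contains a letter
rejected_short = try_insert("12345", "Lincoln")       # too short
```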

• A domain is a semantic concept.

• Most of the time, the name of a field is descriptive of the kind of information it can hold.

• So for a “last name” field in a table containing information about people, it is informally clear what this means.

• Formally, the term domain refers to the whole set of values that could appear as valid data in that field.

• In general, a name would consist of a sequence of letters of the alphabet.

• Names could come from any language or culture, transliterated into the English alphabet.

• Some names do contain numeric information, usually indicated with Roman numerals, for example, John Smith I, John Smith II, etc.

• It would not be possible to come up with a formula that mathematically defined all possible values.

• Still, the general idea is clear.

• The idea of domains can be further clarified by giving examples of cases where fields are not on the same domain.

• A person’s last name field may be defined as 24 characters.

• A city field could also be defined the same way.

• There may be cases where a person’s name is the same as the name of a city.

• There is a city of Lincoln in England.

• Abraham Lincoln’s ancestors probably came from that area.

• There is also a city of Lincoln in Nebraska, which was named after Abraham Lincoln.

• Even though there may be an intersection of the values in the city and last name fields, conceptually, city name and person last name are two distinct domains.

• Another example of two fields that are not on the same domain would be social security number and zip code.

• A full zip code consists of 5 plus 4, or 9 digits, like a social security number.

• Both might be defined as character fields containing 9 characters.

• However, social security numbers and zip codes have nothing in common.

• There are doubtless cases where someone’s social security number matches some zip code somewhere in the country, but this is purely coincidental.

• Up to this point, the relationship between tables has been explained by the process of embedding the primary key of one table as a foreign key in another.

• This means that the second table is defined with a suitably named field that is on the same domain as the primary key of the first table and is declared to hold the same type of data.

• This is good as far as it goes, but there can be other relationships between tables which are the result of domains, but not the direct result of embedding keys.

• Going back to one of the earlier examples, a database may distinguish between mothers and children as different kinds of entities, and store them in different tables.

• Each of these tables may have social security number fields and last name fields.

• You would not expect a mother and child to have the same social security number.

• This would be a mistake.

• However, in most cases you would expect mothers and children to have the same last names.

• The idea is that a social security number is a social security number, regardless of what table it appears in.

• The idea of a social security number defines a domain.

• Similarly, if the last name fields in both the mother and child tables were defined as containing 24 characters, a last name is a last name, regardless of what table it appears in.

• The idea of a last name defines a domain.

• The last name fields in the two tables have the same meaning even though they are not a primary key, foreign key pair.
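
• Because the two last name fields are on the same domain, comparing them across tables is meaningful even though neither is a key. The sketch below assumes SQLite, driven from Python's sqlite3 module, and uses made-up sample rows. The table and field names are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE mother (ssn TEXT PRIMARY KEY, last_name TEXT);
    CREATE TABLE child  (ssn TEXT PRIMARY KEY, last_name TEXT);
    INSERT INTO mother VALUES ('111111111', 'Lincoln'), ('222222222', 'Smith');
    INSERT INTO child  VALUES ('333333333', 'Lincoln'), ('444444444', 'Jones');
    """
)

# last_name is not a key in either table, but both fields are on the same
# domain, so a join on them produces a sensible answer.
shared_names = con.execute(
    """
    SELECT m.last_name, m.ssn AS mother_ssn, c.ssn AS child_ssn
    FROM mother AS m
    JOIN child AS c ON c.last_name = m.last_name
    """
).fetchall()
```

• A join of last_name against a city field would run just as readily, since both hold character data, but any matches would be coincidental because the fields are on different domains.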

• As you can see, a domain is a cross-table concept.

• Any given database may contain many different fields in its tables, but the database will contain fewer domains because various fields are on the same domain.

• The idea of a domain is more fundamental than the idea of a field.

• A field is just a manifestation of a domain.

• In very general terms, DKNF says that a database is correctly designed if the dependencies among the tables are the result of correct choices of domains for all fields, in particular the domains of the primary and foreign keys of the tables.

8. Nulls and Integrity

• This section is a review of material that was explained earlier.

• It will not be gone over in class.

• However, it is provided below in its entirety in case you want to read the overheads yourself.

• The term “null” refers to the idea that a particular field in a particular record may not have data in it.

• In general, this is permissible.

• Cases often arise in practice where the information doesn’t exist or isn’t known.

• It would be impractical to insist that all fields always contain data.

• If that restriction were imposed, people would get around it by putting in bogus values for information that didn’t exist or wasn’t known.

• However, filling a database with bogus values is not a very good idea.

• When a database management system supports null values in fields, it’s important to understand what this does not mean.

• It does not mean that the fields contain the sequence of characters “null”.

• It also does not mean that the field contains invisible blanks.

• Blank spaces themselves are a form of character.

• What it means is that there is absolutely nothing in the field, and the database management system is able to recognize fields that are in that state.
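
• The distinction can be seen in a small sketch, assuming SQLite driven from Python's sqlite3 module and a hypothetical person table with a middle_name field. Only the row with a true null is picked out by IS NULL. The rows holding the characters “null” or blanks are ordinary data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, middle_name TEXT)")
con.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, None),     # a true null: nothing in the field
     (2, "null"),   # the four characters n-u-l-l
     (3, "   ")],   # three blank characters
)

# IS NULL is how the dbms is asked which fields are in the null state.
null_ids = [row[0] for row in con.execute(
    "SELECT id FROM person WHERE middle_name IS NULL ORDER BY id")]
```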

• The term integrity in database management systems refers to the validity and consistency of data entered into a database.

• The phrase “entity integrity” is the formal expression of a requirement that was stated informally earlier.

• Entity integrity puts the following requirement on a correctly implemented database:

• Every table has a primary key field, and no part of the primary key field can be null for any record in the table.

• Clearly, if all or part of a key were allowed to be null, that would defeat the purpose of the primary key, which is to be the unique identifier for every record in the table.
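
• A minimal sketch of entity integrity on a two-field primary key follows. It assumes SQLite, driven from Python's sqlite3 module, and a hypothetical car_sale table. The explicit NOT NULL declarations are needed because SQLite, unlike most systems, does not automatically forbid nulls in this kind of primary key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE car_sale (
        salesperson_id TEXT NOT NULL,
        car_id         TEXT NOT NULL,
        PRIMARY KEY (salesperson_id, car_id)
    )
    """
)
con.execute("INSERT INTO car_sale VALUES ('s1', 'c1')")

def rejected(row):
    """Return True if the dbms refuses to store the row."""
    try:
        con.execute("INSERT INTO car_sale VALUES (?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True

null_part_rejected = rejected(("s2", None))   # part of the key is null
duplicate_rejected = rejected(("s1", "c1"))   # key value already in use
```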

• As seen in the long discussion of normal forms, it is the primary key to foreign key relationships that support the interconnection between related entities that have been separated into different tables by the design process.

• Once this has been done, it is critically important that the data maintaining the relationships be valid and consistent.

• The phrase “referential integrity” has the following meaning:

• Every value that appears in a foreign key field also has to appear as a value in the corresponding primary key field.

• This can also be stated negatively:

• There can be no foreign key value that does not have a corresponding primary key value.

• The meaning and importance of referential integrity can be most easily explained with a small example showing a violation of it.

• Consider the tables shown on the following overhead:

Mother

mid   Name
1     Lily
2     Matilda

Child

kid   name   mid
a     Ned    3
b     Ann    2
c     June   (null)

• Child a, Ned, is shown as having a mother with mid equal to 3.

• There is no such mother in the Mother table. This is literally nonsense.

• There is no sense in which this can be correct and this is what referential integrity forbids.
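
• A dbms that enforces referential integrity refuses to store such a row in the first place. The sketch below assumes SQLite, driven from Python's sqlite3 module, with the foreign key declared using REFERENCES. The field types are assumptions, since the overheads do not state them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
con.executescript(
    """
    CREATE TABLE mother (mid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        kid  TEXT PRIMARY KEY,
        name TEXT,
        mid  INTEGER REFERENCES mother (mid)
    );
    INSERT INTO mother VALUES (1, 'Lily'), (2, 'Matilda');
    """
)

con.execute("INSERT INTO child VALUES ('b', 'Ann', 2)")      # mother 2 exists
try:
    con.execute("INSERT INTO child VALUES ('a', 'Ned', 3)")  # there is no mother 3
    violation_caught = False
except sqlite3.IntegrityError:
    violation_caught = True

child_count = con.execute("SELECT count(*) FROM child").fetchone()[0]
```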

• This example also illustrates two other things, which are related to “non-existent” values.

• Observe that mother 1, Lily, does not have any matching records in the Child table.

• This does not violate referential integrity.

• It suggests that the Mother table is misnamed, and should be named the Woman table, but it is reasonable to think that you might be recording information about women and children and some women will not have children.

• The other thing visible in the table is that child c, June, does not have a mother listed.

• In other words, the foreign key field is null.

• This also does not violate referential integrity.

• As with null in any situation, it may mean that the mother is not known.
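
• Both of these permitted cases can be checked with a short sketch, again assuming SQLite driven from Python's sqlite3 module and assumed field types. With foreign key enforcement turned on, a mother with no children and a child with a null foreign key are both accepted, and simple queries find them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforcement is on, yet these rows load
con.executescript(
    """
    CREATE TABLE mother (mid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        kid  TEXT PRIMARY KEY,
        name TEXT,
        mid  INTEGER REFERENCES mother (mid)
    );
    INSERT INTO mother VALUES (1, 'Lily'), (2, 'Matilda');
    INSERT INTO child VALUES ('b', 'Ann', 2), ('c', 'June', NULL);
    """
)

# Mothers with no matching child records.
childless = con.execute(
    """
    SELECT m.name
    FROM mother AS m
    LEFT JOIN child AS c ON c.mid = m.mid
    WHERE c.kid IS NULL
    """
).fetchall()

# Children whose foreign key field is null.
motherless = con.execute("SELECT name FROM child WHERE mid IS NULL").fetchall()
```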

• Nobody literally doesn’t have a mother, but if the woman table only records information on living women, for example, then for an orphan, the mother “wouldn’t exist”.

• It is unlikely that you would rename the table “Children and Orphans”—but the idea is that the null value is allowed and this in some sense affects the meaning of what kinds of entities are entered into the table.

• Referential integrity leads to one last consideration.

• The idea behind normalization was to get the stray dependency out of one table and break it into two.

• The problem with stray dependencies was redundancy and anomalies.

• By breaking a design into two tables with a primary to foreign key pair, you introduce interrelationship constraints.

• Put simply, the question is this: What do you do with foreign key values if the corresponding primary key values in another table are deleted or updated?

• A fully-featured database management system will enforce referential integrity constraints.

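• A commonly used combination of settings for these constraints is summarized in these two phrases (the actual defaults vary by product, and standard SQL defaults to NO ACTION for both):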

• On delete, restrict; on update, cascade.

• If these settings are in effect, this is a fuller explanation of what they mean in terms of the concrete mother and child example:

• On delete, restrict:

• No mother record can be deleted if she has corresponding child records in the other table.

• To allow the deletion would lead to a referential integrity violation.

• On update, cascade:

• If the primary key value of a mother record is updated and she has corresponding child records in the other table, the foreign key values in those records are automatically updated to reflect the change.

• This problem arises less frequently because once a primary key value is assigned to an entity, it is rarely changed.
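
• Both rules can be stated explicitly when the foreign key is declared. The sketch below assumes SQLite, driven from Python's sqlite3 module, and assumed field types. The new key value 20 is arbitrary and chosen only for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
con.executescript(
    """
    CREATE TABLE mother (mid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        kid  TEXT PRIMARY KEY,
        name TEXT,
        mid  INTEGER REFERENCES mother (mid)
             ON DELETE RESTRICT
             ON UPDATE CASCADE
    );
    INSERT INTO mother VALUES (1, 'Lily'), (2, 'Matilda');
    INSERT INTO child VALUES ('b', 'Ann', 2);
    """
)

# On delete, restrict: Matilda has a child record, so she cannot be deleted.
try:
    con.execute("DELETE FROM mother WHERE mid = 2")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

# On update, cascade: changing Matilda's key also changes Ann's foreign key.
con.execute("UPDATE mother SET mid = 20 WHERE mid = 2")
ann_mid = con.execute("SELECT mid FROM child WHERE kid = 'b'").fetchone()[0]
```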

The End