Page 1: Database Normalization

Normalization

Constraints: There are two types of constraints:

1) Constraints that define the permitted values the attributes can have.

2) Constraints that define the relationships between the attributes.

Functional Dependency (FD): It is denoted by →. The FD X → Y means that X uniquely determines Y, where X and Y are simple or composite attributes. The dependency from X to Y is said to hold if the application satisfies the following:

If T1 and T2 are two tuples with the same value of X, then the value of Y must also be the same in T1 and T2; i.e., the relationship between X and Y is independent of the other attributes present in the table. In simple words, for a given X there is always a single value of Y.

For example,

A Street of a City → Pin Code: A street of a city has a unique pin code; however, the reverse need not be true.

(ISBN, TITLE) → AUTHOR: Given an ISBN number, one can find the Title and the name of the Author of the book.
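The tuple-based definition above can be checked mechanically against sample data. The following is a minimal sketch, assuming rows are stored as Python dictionaries; the function name `fd_holds` and the street, city and pin values are hypothetical and used only for illustration. A sample can only refute an FD; whether an FD really holds is decided by the application.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs is satisfied by every pair of rows.

    rows is a list of dicts; lhs and rhs are tuples of column names.
    """
    seen = {}  # maps each X value to the Y value first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on X but differ on Y
    return True


# Hypothetical sample data: two streets of one city share a pin code.
addresses = [
    {"street": "Street A", "city": "City 1", "pin": "P100"},
    {"street": "Street B", "city": "City 1", "pin": "P100"},
    {"street": "Street C", "city": "City 1", "pin": "P200"},
]

street_determines_pin = fd_holds(addresses, ("street", "city"), ("pin",))
pin_determines_street = fd_holds(addresses, ("pin",), ("street", "city"))
print(street_determines_pin)  # True
print(pin_determines_street)  # False
```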

Implications and Covers: The application can call for some functional dependencies which may imply additional functional dependencies.

If F is the set of FDs, then we define the closure of F, denoted F+, to be the set of all FDs implied by F.

Written By Arnav Mukhopadhyay EMAIL: [email protected]

To find F+, given F, we need the inference rules for the FDs that are implied by F. The inference rules are very important for good database design for the following reasons:

Given F, one may like to determine whether X → Y is implied or not.

They are needed for computing the closure F+ of F.

Given F, we may want to remove those FDs which are redundant in F. An FD is redundant if it is implied by the other FDs in F.

While designing a database schema, we may want to find a minimal cover G of F.

A minimal cover G does not contain any redundant FDs, and G+ is the same as F+. By computing a minimal cover G of F, we can ensure that when the DBMS enforces the constraints in G, it automatically enforces all the constraints implied by G, and hence all of F+.
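One common way to test whether an FD is implied, and to drop redundant FDs, is to compute the closure of a set of attributes rather than the whole of F+. The sketch below assumes FDs are stored as pairs of frozensets; the helper names `fd`, `closure`, `implies` and `remove_redundant` and the attributes A, B, C are hypothetical. It covers only the redundancy-removal step; a full minimal cover would also split right-hand sides and trim left-hand sides.

```python
def fd(lhs, rhs):
    """Build an FD from two strings of single-letter attribute names."""
    return (frozenset(lhs), frozenset(rhs))


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds."""
    return set(rhs) <= closure(lhs, fds)


def remove_redundant(fds):
    """Drop each FD that is implied by the FDs that remain."""
    kept = list(fds)
    for candidate in list(fds):
        others = [f for f in kept if f != candidate]
        if implies(others, candidate[0], candidate[1]):
            kept = others
    return kept


# Hypothetical F = {A -> B, B -> C, A -> C}; A -> C follows from the other two.
F = [fd("A", "B"), fd("B", "C"), fd("A", "C")]
G = remove_redundant(F)
print(G == [fd("A", "B"), fd("B", "C")])  # True
print(implies(G, "A", "C"))               # True: G+ still contains A -> C
```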

Inference rules for FDs:

The inference rules are known as Armstrong’s Axioms, after Armstrong, who published them. The first three rules are the axioms proper; the remaining three can be derived from them. These properties are as given below:

1. Reflexive property: X → Y is TRUE if Y is a SUBSET of X.

2. Augmentation property: If X → Y is TRUE, then XZ → YZ is also TRUE.

3. Transitivity property: If X → Y and Y → Z are TRUE, then X → Z is implied.

4. Union property: If X → Y and X → Z are TRUE, then X → YZ is also TRUE. Together with the decomposition property, this indicates that an FD whose right-hand side contains many attributes is equivalent to one FD for each of those attributes.

5. Decomposition property: If X → Y is implied and Z is a SUBSET of Y, then X → Z is implied. This property is the reverse of the Union property.

6. Pseudo-transitivity property: If X → Y and WY → Z are given, then XW → Z is TRUE.

Example: Consider a college having a table STUDY with COURSE, TEACHER, ROOM NO and DEPARTMENT as attributes.

STUDY (COURSE, TEACHER, ROOMNO, DEPT). Here we can identify a few FDs, namely:

Course → Teacher

Teacher → Department

Course → Room Number

Additional FDs can be derived from the above by using the inference properties:

By Reflexivity: (Course, Teacher) → Teacher

By Augmentation: (Course, Room Number) → (Teacher, Room Number)

By Transitivity: Course → Department

By Union: Course → (Teacher, Room Number)
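The four derivations above can be reproduced by treating each inference rule as a small function on FDs. This is a sketch only: the representation of an FD as a pair of frozensets and the function names are assumptions made for illustration.

```python
def fd(lhs, rhs):
    """Build an FD from two lists of attribute names."""
    return (frozenset(lhs), frozenset(rhs))


def reflexivity(x, y):
    """X -> Y whenever Y is a subset of X."""
    x, y = frozenset(x), frozenset(y)
    if not y <= x:
        raise ValueError("Y must be a subset of X")
    return (x, y)


def augmentation(f, z):
    """From X -> Y derive XZ -> YZ."""
    x, y = f
    z = frozenset(z)
    return (x | z, y | z)


def transitivity(f, g):
    """From X -> Y and Y -> Z derive X -> Z."""
    (x, y), (y2, z) = f, g
    if y != y2:
        raise ValueError("right side of f must equal left side of g")
    return (x, z)


def union(f, g):
    """From X -> Y and X -> Z derive X -> YZ."""
    (x, y), (x2, z) = f, g
    if x != x2:
        raise ValueError("both FDs must have the same left side")
    return (x, y | z)


# The three FDs identified for STUDY.
course_teacher = fd(["Course"], ["Teacher"])
teacher_dept = fd(["Teacher"], ["Department"])
course_room = fd(["Course"], ["Room Number"])

by_reflexivity = reflexivity(["Course", "Teacher"], ["Teacher"])
by_augmentation = augmentation(course_teacher, ["Room Number"])
by_transitivity = transitivity(course_teacher, teacher_dept)
by_union = union(course_teacher, course_room)

print(by_transitivity == fd(["Course"], ["Department"]))          # True
print(by_union == fd(["Course"], ["Teacher", "Room Number"]))     # True
```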

Armstrong’s axioms are sound and complete. These properties are defined as:

1. Soundness property: If X → Y can be inferred from F using the above axioms, then X → Y will be TRUE in any relation in which F holds.

2. Completeness property: If X → Y cannot be inferred from F using the above axioms, then there exists a relation R in which F holds but X → Y is not TRUE.

Normalization: Consider the table shown below:

Order No.   Order Date   Item Lines
                         Item Code   Quantity   Price / Unit
1456        260289       3687        52         50.40
                         4627        38         60.20
                         3214        20         17.50
1886        040389       4629        45         20.25
                         4627        30         60.20
1788        040489       4627        40         60.20

Table 1: An Unnormalized Relation

In this relation, an order no. includes many items. The attribute “Item Lines” is not a single attribute but is composed of many attributes.

Besides this, the number of item lines is variable. This form is not suitable for storage as a file in a computer. Further, retrieval of data based on a component of a composite attribute is difficult. For example, to find out how many items with a specified item code are ordered, one must break up the composite attribute first before attempting a search. Thus a relation with a format such as that in the above table is not allowed. It is said to be unnormalized.

To normalize this relation, a composite attribute is converted to individual attributes. The normalization step consists of first identifying the fields within a composite attribute as individual attributes. After doing this, the common attributes are duplicated as many times as there are lines in the composite attribute. The normalized relation corresponding to the relation given in Table 1 above is shown in Table 2 below:
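The conversion step described above can be sketched as a short flattening routine. The nested layout used for the unnormalized orders and the function name `to_1nf` are assumptions for illustration; the values are those of Table 1.

```python
# Table 1 as nested data: each order carries a variable number of item lines.
unnormalized = [
    {"order_no": 1456, "order_date": "260289",
     "item_lines": [(3687, 52, 50.40), (4627, 38, 60.20), (3214, 20, 17.50)]},
    {"order_no": 1886, "order_date": "040389",
     "item_lines": [(4629, 45, 20.25), (4627, 30, 60.20)]},
    {"order_no": 1788, "order_date": "040489",
     "item_lines": [(4627, 40, 60.20)]},
]


def to_1nf(orders):
    """Repeat the common attributes once for every item line."""
    flat = []
    for order in orders:
        for item_code, quantity, price in order["item_lines"]:
            flat.append((order["order_no"], order["order_date"],
                         item_code, quantity, price))
    return flat


flat_rows = to_1nf(unnormalized)
for row in flat_rows:
    print(row)
```

Each printed row has the same five single-valued fields, so a search on Item Code no longer requires taking a composite attribute apart.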

Table 2: Normalized form of the Relation given in Table 1

Order No. Order Date Item Code Quantity Price / Unit

1456 260289 3687 52 50.40

1456 260289 4627 38 60.20

1456 260289 3214 20 17.50

1886 040389 4629 45 20.25

1886 040389 4627 30 60.20

1788 040489 4627 40 60.20

The relation shown in Table 2 is said to be in First Normal Form, abbreviated as 1NF. This form is also called a flat file. There are no composite attributes, and every attribute is single-valued and describes only one property.

Converting a relation to 1NF is the first essential step in normalization. There are successively higher normal forms known as 2NF, 3NF, BCNF, 4NF and 5NF. Each form is an improvement over the earlier form. In other words, 2NF is an improvement over 1NF, 3NF is an improvement over 2NF, and so on. The relations in a higher normal form are a SUBSET of the relations in each lower normal form, as shown in Figure 1.

Figure 1: Illustration of successive normal forms of a relation (nested regions, from the outermost to the innermost: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF).

The higher order normalization steps are based on 3 important concepts:

1) Dependence among attributes in a relation.

2) Identification of an attribute or a set of attributes as the key of a relation.

3) Multivalued dependency between attributes.

Functional Dependency:

Remember that there is no foolproof algorithmic method for identifying functional dependencies.

Let X and Y be two attributes of a relation. Given a value of X, if there exists only one value of Y corresponding to it, then Y is said to be functionally dependent on X. This is indicated by the notation:

X → Y

For example, given the value of Item Code, there is only one value of Item Name for it. Thus Item Name is functionally dependent on Item Code. This is shown as:

Item Code → Item Name

Similarly, in Table 2, given an Order No., the date of the order is known. Thus:

Order No. → Order Date

A functional dependency may be based on a composite attribute. For example, if we write:

X, Z → Y

it means that there is only one value of Y corresponding to given values of X and Z. In other words, Y is functionally dependent on the composite X, Z. In Table 2, for example, Order No. and Item Code together determine Quantity and Price. Thus,

Order No., Item Code → Quantity, Price

Example: Consider the relation:

Student (Roll No., Name, Address, Dept., Year of Study)

In this relation, Name is functionally dependent on Roll No. In fact, given the value of Roll No., the values of all the other attributes can be uniquely determined. Name and Department are not functionally dependent, because given the name of a student one cannot find his department uniquely. This is due to the fact that there may be more than one student with the same name. Name in this case is not a key. Department and Year of Study are not functionally dependent, as Year of Study pertains to a student whereas Department is an independent attribute. The functional dependency in this relation is shown in Figure 2 as a dependency diagram. Such dependency diagrams are very useful in normalization.

Relation Key: Given a relation, if the value of an attribute X uniquely determines the values of all other attributes in a row, then X is said to be the key of the relation. Sometimes more than one attribute is needed to uniquely determine the other attributes in a row. In that case, such a set of attributes is the key. In Table 2, Order No. and Item Code together determine Order Date, Quantity and Price. Thus the key is the combination (Order No., Item Code). Similarly, in the relation Supplies (Vendor Code, Item Code, Quantity supplied, Date of Supply, Price/Unit), Vendor Code and Item Code together form the key. This dependency is shown in the dependency diagram of Figure 3:
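Once the FDs of a relation are written down, its key can be found by searching for the smallest attribute sets whose closure covers every attribute. The sketch below is one possible way to do this by brute force; the function names are hypothetical, and the three FDs listed for Table 2 are taken from the dependencies stated in the text.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Minimal attribute sets that determine every attribute (brute force)."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            chosen = set(combo)
            if any(key <= chosen for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(chosen, fds) == set(attributes):
                keys.append(frozenset(combo))
    return keys


ATTRIBUTES = ("Order No.", "Order Date", "Item Code", "Quantity", "Price/Unit")
FDS = [
    (frozenset({"Order No."}), frozenset({"Order Date"})),
    (frozenset({"Item Code"}), frozenset({"Price/Unit"})),
    (frozenset({"Order No.", "Item Code"}), frozenset({"Quantity"})),
]

keys = candidate_keys(ATTRIBUTES, FDS)
print(keys == [frozenset({"Order No.", "Item Code"})])  # True
```

The search is exponential in the number of attributes, which is acceptable for relations of the size discussed here.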

Figure 2: Dependency diagram for the relation “Student” (Roll No. determines Name, Address, Department and Year of Study).

Observe that in the figure, the fact that Vendor Code and Item Code together form the composite key is clearly shown by enclosing them together in a rectangle.

Figure 3: Dependency diagram for the relation “Supplies” (Vendor Code and Item Code together determine Quantity Supplied, Date of Supply and Price/Unit).

Why do we Normalize a Relation?

Relations are normalized so that, when relations in a database are altered during the lifetime of the database, we do not lose information or introduce inconsistencies. The types of alterations normally needed for relations are:

1) Insertions of new data values to a relation. This should be possible without being forced to leave blank fields for some attributes.

2) Deletions of a tuple, namely, a row of a relation. This should be possible without losing vital information unknowingly.

3) Updating or changing a value of an attribute in a tuple. This should be possible without exhaustively searching all the tuples in the relation.

Ideal relations after normalization should have the following properties so that the problems mentioned above do not occur for relations in the (ideal) normalized form:

1) No data value should be duplicated in different rows unnecessarily.

2) A value must be specified (and required) for every attribute in a row.

3) Each relation should be self-contained. In other words, if a row from a relation is deleted, important information should not be accidentally lost.

4) When a row is added to a relation, other relations in the database should not be affected.

5) A value of an attribute in a tuple may be changed independent of other tuples in the relation and other relations.

The idea of normalizing relations to higher and higher normal forms is to attain the goals of having a set of ideal relations meeting the above criteria.

Second Normal Form (2NF) Relation

A relation is said to be in 2NF if:

1) It is already in 1NF form.

2) Non-key attributes are functionally dependent on Key attributes.

3) If the key has more than one attribute, then no non-key attribute should be functionally dependent upon only a part of the key.

Example: Consider Table 2, as it is in 1NF form.

Key = (Order No., Item Code)

The dependency diagram for the attributes of this relation is shown in Figure 4.

The non-key attribute Price/Unit is functionally dependent on Item Code, which is a part of the relation key.

Also, the non-key attribute Order Date is functionally dependent on Order No., which is a part of the relation key.

As these non-key attributes depend only upon a part of the key, the relation in Table 2 is not in 2NF.

To obtain the 2NF form of Table 2, we split it into the three relations shown in Table 3:

The relation “Orders” has Order No. as key. The relation “Order Details” has the composite key Order No. and Item Code. The relation “Prices” has Item Code as key. In all three relations, the non-key attributes are functionally dependent on the whole key.

Observe that by transforming to 2NF relations the repetition of Order Date in Table 2 has been removed.

Further, if an order for an item is cancelled, the price of an item is not lost.
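The split into 2NF relations amounts to projecting Table 2 onto three groups of columns and discarding duplicate rows. A minimal sketch follows; the column names and the helper `project` are assumptions for illustration, and the rows are those of Table 2.

```python
COLUMNS = ("order_no", "order_date", "item_code", "quantity", "price")
TABLE_2 = [
    (1456, "260289", 3687, 52, 50.40),
    (1456, "260289", 4627, 38, 60.20),
    (1456, "260289", 3214, 20, 17.50),
    (1886, "040389", 4629, 45, 20.25),
    (1886, "040389", 4627, 30, 60.20),
    (1788, "040489", 4627, 40, 60.20),
]


def project(rows, columns, keep):
    """Keep only the named columns, dropping duplicate rows but keeping order."""
    positions = [columns.index(name) for name in keep]
    seen, result = set(), []
    for row in rows:
        projected = tuple(row[i] for i in positions)
        if projected not in seen:
            seen.add(projected)
            result.append(projected)
    return result


orders = project(TABLE_2, COLUMNS, ("order_no", "order_date"))
order_details = project(TABLE_2, COLUMNS, ("order_no", "item_code", "quantity"))
prices = project(TABLE_2, COLUMNS, ("item_code", "price"))

print(orders)   # each Order Date now appears once per order
print(prices)   # each Price/Unit now appears once per item
```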

Table 3: Splitting of Relation given in Table 2 into 2NF Relations.

Orders

Order No.   Order Date
1456        260289
1886        040389
1788        040489

Order Details

Order No.   Item Code   Quantity
1456        3687        52
1456        4627        38
1456        3214        20
1886        4629        45
1886        4627        30
1788        4627        40

Prices

Item Code   Price / Unit
3687        50.40
4627        60.20
3214        17.50
4629        20.25

Figure 4: Dependency diagram for the relation given in Table 2 (Order No. determines Order Date; Item Code determines Price/Unit; Order No. and Item Code together determine Quantity).

If Order No. 1886 for Item Code 4629 is cancelled in Table 2, then the fourth row will be removed and the price of the item will be lost.

In Table 3, only the fourth row of the Order Details relation will be removed. The item price is held in the Prices relation, which is not touched, and hence the price of Item 4629 is not lost. The Date of Order is also not lost, as it is held in the Orders relation.

These relations in 2NF form meet all the “ideal” conditions specified. Observe that the 3 relations obtained are self-contained. There is no duplication of data within a relation.
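The cancellation example can be replayed in a few lines. This sketch assumes the rows of Table 2 are held as tuples and uses hypothetical helper names; it compares what is still known about Item 4629 after the cancellation in the single 1NF relation and in the 2NF relations.

```python
# Rows of Table 2: (order_no, order_date, item_code, quantity, price)
flat = [
    (1456, "260289", 3687, 52, 50.40),
    (1456, "260289", 4627, 38, 60.20),
    (1456, "260289", 3214, 20, 17.50),
    (1886, "040389", 4629, 45, 20.25),
    (1886, "040389", 4627, 30, 60.20),
    (1788, "040489", 4627, 40, 60.20),
]


def recorded_prices(rows):
    """Item prices that can still be read off the single flat relation."""
    return {row[2]: row[4] for row in rows}


# 1NF: cancelling order 1886 / item 4629 deletes the only row holding its price.
flat_after = [r for r in flat if not (r[0] == 1886 and r[2] == 4629)]
price_survives_in_flat = 4629 in recorded_prices(flat_after)

# 2NF: the price lives in its own relation, so only Order Details shrinks.
order_details = [(r[0], r[2], r[3]) for r in flat]
prices = recorded_prices(flat)  # the separate Prices relation
details_after = [d for d in order_details if (d[0], d[1]) != (1886, 4629)]

print(price_survives_in_flat)  # False
print(prices[4629])            # 20.25
```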

Third Normal Form (3NF)

Normalization to 3NF is needed when not all attributes in a relation tuple are functionally dependent only on the key attribute. If two non-key attributes are functionally dependent, then there will be unnecessary duplication of data.

Consider the relation given in Table 4.

Table 4: A 2NF Form Relation

Roll No. Name Department Year Hostel Name

1784 Raman Physics 1 Ganga

1648 Krishnan Chemistry 1 Ganga

1768 Gopalan Mathematics 2 Kaveri

1848 Raja Botany 2 Kaveri

1682 Maya Geology 3 Krishna

1485 Singh Zoology 4 Godavari

Here, Roll No. is the key and all other attributes are functionally dependent on it. Thus the relation is in 2NF.

If it is known that in the college all 1st Year students are accommodated in Ganga Hostel, all 2nd Year students in Kaveri, all 3rd Year students in Krishna, and all 4th Year students in Godavari, then the non-key attribute Hostel Name is dependent on the non-key attribute Year. This dependency is shown in Figure 5.

Observe that given the Year of a student, his Hostel is known, and vice versa.

The dependency of Hostel on Year leads to duplication of data, as is evident from Table 4. If it is decided to ask all 1st Year students to move to Kaveri Hostel, and all 2nd Year students to Ganga Hostel, this change must be made in many places in Table 4. Also, when a student’s Year of study changes, the change of Hostel must also be recorded in Table 4. This is undesirable.

A table is said to be in 3NF if it is in 2NF, and no non-key attribute is functionally dependent on any other non-key attribute.

Therefore, Table 4 is not in 3NF.

To transform it into 3NF, we should introduce another relation, which includes the functionally related non-key attributes. This is shown in Table 5.
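The transformation can be sketched by pulling the Year and Hostel Name pair out of Table 4 into its own relation. The function name `split_year_hostel` and the tuple layout are assumptions for illustration; the rows are those of Table 4.

```python
# Rows of Table 4: (roll_no, name, department, year, hostel_name)
TABLE_4 = [
    (1784, "Raman", "Physics", 1, "Ganga"),
    (1648, "Krishnan", "Chemistry", 1, "Ganga"),
    (1768, "Gopalan", "Mathematics", 2, "Kaveri"),
    (1848, "Raja", "Botany", 2, "Kaveri"),
    (1682, "Maya", "Geology", 3, "Krishna"),
    (1485, "Singh", "Zoology", 4, "Godavari"),
]


def split_year_hostel(rows):
    """Separate Year -> Hostel Name from the student data."""
    students, hostel_of_year = [], {}
    for roll_no, name, department, year, hostel in rows:
        # The split is only valid if Year really determines Hostel Name.
        if hostel_of_year.setdefault(year, hostel) != hostel:
            raise ValueError(f"Year {year} maps to more than one hostel")
        students.append((roll_no, name, department, year))
    return students, hostel_of_year


students, hostel_of_year = split_year_hostel(TABLE_4)

# Moving all 1st Year students to another hostel is now a single change.
updated = dict(hostel_of_year)
updated[1] = "Kaveri"

print(hostel_of_year)
print(updated[1])
```

In the single relation of Table 4 the same move would require editing every 1st Year row.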

Figure 5: Dependency diagram for the relation given in Table 4 (Roll No. determines Name, Department, Year and Hostel Name; Year determines Hostel Name).

It should be stressed again that dependency between attributes is a semantic property and has to be stated in the problem specification. In this example, the dependency between Year and Hostel is clearly stated.

If the Hostel allocated to students is independent of their Year in college, then Table 4 is already in 3NF.

Example: The relation Employee is given below and its dependency diagram is shown in the Figure 6:

Employee (Employee Code, Employee Name, Dept., Salary, Project No., Termination Date of project)

Table 5: Conversion of Table 4 into 3NF Relations.

Roll No.   Name       Department    Year
1784       Raman      Physics       1
1648       Krishnan   Chemistry     1
1768       Gopalan    Mathematics   2
1848       Raja       Botany        2
1682       Maya       Geology       3
1485       Singh      Zoology       4

Year   Hostel Name
1      Ganga
2      Kaveri
3      Krishna
4      Godavari

As can be seen from the figure, the Termination Date of the project is dependent on the Project No. Thus the relation is not in 3NF. The 3NF relations are:

Employee (Employee Code, Employee Name, Dept., Salary, Project No.)

Project (Project No., Termination Date)
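The 2NF and 3NF conditions, as they are worded in this text, can be checked directly from a relation's key and its list of FDs. The sketch below is an illustration under that wording; the function name `find_violations` is hypothetical, and the FD lists are read off the dependency descriptions given for the Employee relation and for Table 2.

```python
def find_violations(key, fds):
    """Return (partial, transitive) dependencies for the given key.

    partial:    a non-key attribute depends on part of the key (breaks 2NF)
    transitive: a non-key attribute depends on non-key attributes (breaks 3NF)
    """
    key = frozenset(key)
    partial, transitive = [], []
    for lhs, rhs in fds:
        lhs, rhs = frozenset(lhs), frozenset(rhs)
        non_key_rhs = rhs - key
        if not non_key_rhs:
            continue  # only key attributes on the right side
        if lhs < key:
            partial.append((lhs, non_key_rhs))
        elif not (lhs & key):
            transitive.append((lhs, non_key_rhs))
    return partial, transitive


employee_fds = [
    (["Employee Code"], ["Employee Name", "Dept.", "Salary", "Project No."]),
    (["Project No."], ["Termination Date"]),
]
emp_partial, emp_transitive = find_violations(["Employee Code"], employee_fds)

table_2_fds = [
    (["Order No."], ["Order Date"]),
    (["Item Code"], ["Price/Unit"]),
    (["Order No.", "Item Code"], ["Quantity"]),
]
t2_partial, t2_transitive = find_violations(["Order No.", "Item Code"], table_2_fds)

print(emp_transitive)  # Project No. -> Termination Date breaks 3NF
print(t2_partial)      # two partial dependencies break 2NF
```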

Figure 6: Dependency diagram of Employee Relation (Employee Code determines Employee Name, Department, Salary and Project No.; Project No. determines Termination Date).

Boyce-Codd Normal Form (BCNF)

Assume that a relation has more than one possible key. Assume further that the composite keys have a common attribute. If an attribute of one composite key is dependent on an attribute of the other composite key, a normalization called BCNF is needed.

Consider an example, the relation Professor:

Professor (Professor Code, Department, Head of Department, Percent Time)

It is assumed that:

1) A Professor can work in more than 1 department.

2) The Percentage of the time he spends in each department is given.

3) Each department has only 1 Head of the Department.

The dependency diagram is given in Figure 7. Table 6 gives the relation attributes. The two possible composite keys are Professor Code and Department, or Professor Code and Head of Department. Observe that Department and Head of Department are not non-key attributes. They are each a part of a composite key.

Table 6: Normalization of Relation – “Professor”

Professor Code Department Head of Department Percent Time

P1 Physics Ghosh 50

P1 Mathematics Krishnan 50

P2 Chemistry Rao 25

P2 Physics Ghosh 75

P3 Mathematics Krishnan 100

Figure 7: Dependency diagram of Professor Relation (two alternatives: Professor Code and Department together determine Percent Time, with Department determining Head of Department; or Professor Code and Head of Department together determine Percent Time, with Head of Department determining Department).

The relation given in Table 6 is in 3NF. Observe, however, that the names of Department and Head of Department are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We then lose the information that Rao is the Head of Department of Chemistry.

The normalization of the relation is done by creating a new relation for Department and Head of Department, and deleting Head of Department from Professor Relation. The normalized relations are shown in Table 7:

Table 7: Normalized Professor Relation in BCNF

(a)

Professor Code Department Percent Time

P1 Physics 50

P1 Mathematics 50

P2 Chemistry 25

P2 Physics 75

P3 Mathematics 100

(b)

Department Head of Department

Physics Ghosh

Mathematics Krishnan

Chemistry Rao

The dependency diagram is shown in Figure 8.

Figure 8: Dependency diagram of Professor Relation after normalization (Professor Code and Department together determine Percent Time; Department determines Head of Department).

The dependency diagram gives the important clue to this normalization step, as is clear from Figures 7 and 8.
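A common way to state the BCNF condition is that the left-hand side of every non-trivial FD must determine all attributes of the relation. The sketch below applies that test to the Professor relation before and after the split; the function names and the short attribute labels are assumptions for illustration, and the FDs are those stated for this example.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Non-trivial FDs whose left side does not determine every attribute."""
    everything = set(attributes)
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != everything]


PROF, DEPT, HEAD, TIME = "Professor Code", "Department", "Head", "Percent Time"

professor_fds = [
    (frozenset({PROF, DEPT}), frozenset({TIME})),
    (frozenset({DEPT}), frozenset({HEAD})),
    (frozenset({HEAD}), frozenset({DEPT})),
]
before = bcnf_violations({PROF, DEPT, HEAD, TIME}, professor_fds)

# After the split of Table 7: (a) Professor Code, Department, Percent Time
# and (b) Department, Head of Department.
after_a = bcnf_violations({PROF, DEPT, TIME}, professor_fds[:1])
after_b = bcnf_violations({DEPT, HEAD}, professor_fds[1:])

print(len(before))       # 2: Department -> Head and Head -> Department
print(after_a, after_b)  # [] []
```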

Fourth and Fifth Normal Form (4NF and 5NF)

When attributes in a relation have multivalued dependencies, further normalization to 4NF and 5NF is required.

Example: Consider a vendor supplying many items to many projects in an organization. The following are the assumptions:

1) A Vendor is capable of supplying many Items.

2) A Project uses many Items.

3) A Vendor supplies for many Projects.

4) An Item may be supplied by many Vendors.

Table 8 gives the relation for this problem and Figure 9, the dependency diagram(s).

Vendor Code Item Code Project No.

V1 I1 P1

V1 I2 P1

V1 I1 P3

V1 I2 P3

V2 I2 P3

V2 I3 P1

V3 I1 P1

V3 I1 P2

The relation in Table 8 has a number of problems. For example:

If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for Item Code has to be introduced.

The information about Item I1 is stored twice for Vendor V3.

Observe that the relation given in Table 8 is in 3NF and also in BCNF. It still has the problems mentioned above. The problem is reduced by expressing this relation as two relations in 4NF.

A relation is in 4NF if it has no more than one independent multivalued dependency, possibly together with a functional dependency.

Table 8 can be expressed as the two 4NF relations given in Table 9. The fact that vendors are capable of supplying certain items and the fact that they are assigned to supply for some projects are independently specified in the 4NF relations.
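The split can be sketched as two projections of the rows of Table 8. The variable names are assumptions for illustration, and the two relations here are computed directly from the rows of Table 8 as listed. The sketch also shows how the two problems named above are handled once the facts are stored separately.

```python
# Rows of Table 8: (vendor_code, item_code, project_no)
TABLE_8 = [
    ("V1", "I1", "P1"), ("V1", "I2", "P1"),
    ("V1", "I1", "P3"), ("V1", "I2", "P3"),
    ("V2", "I2", "P3"), ("V2", "I3", "P1"),
    ("V3", "I1", "P1"), ("V3", "I1", "P2"),
]

# Which items each vendor can supply, independent of any project.
vendor_supply = sorted({(vendor, item) for vendor, item, _ in TABLE_8})

# Which projects each vendor supplies for, independent of any item.
vendor_project = sorted({(vendor, project) for vendor, _, project in TABLE_8})

# "Vendor V3 supplies Item I1" is stored twice in Table 8, once after the split.
v3_i1_in_table_8 = sum(1 for v, i, _ in TABLE_8 if (v, i) == ("V3", "I1"))
v3_i1_after_split = vendor_supply.count(("V3", "I1"))

# Assigning V1 to project P2 needs no blank Item Code after the split.
vendor_project_extended = vendor_project + [("V1", "P2")]

print(vendor_supply)
print(vendor_project)
print(v3_i1_in_table_8, v3_i1_after_split)  # 2 1
```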

Figure 9: Dependency Diagram of Vendor-Supply-Project relation. The arrows in the figure indicate multivalued dependency.

Table 9: Vendor-Supply-Project Relations in 4NF.

(a) Vendor Supply

Vendor Code   Item Code
V1            I1
V1            I2
V2            I2
V2            I3
V3            I1

(b) Vendor Project

Vendor Code   Project No.
V1            P1
V1            P3
V2            P1
V3            P1
V3            P2

Table 10: 5NF Additional Relation

Project No.   Item Code
P1            I1
P1            I2
P2            I1
P3            I1
P3            I3

These relations still have a problem. Even though Vendor V1’s capability to supply items and his allotment to supply for specified Projects are known, he may not be actually supplying them to a project as the project may not need it. We thus need another relation, which specifies this. This is called 5NF form. The 5NF relations are the relations in Table 9(a) and (b) together with the relation given in Table 10.
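The problem described above can be made visible by joining the two 4NF relations back together and comparing the result with the original rows. In this sketch all three pairwise relations are computed as projections of Table 8 as listed, which is an assumption made for illustration; the variable names are hypothetical. Combinations that appear in the join but not in Table 8 are vendor, item and project triples that the separate facts allow but that are not actually supplied.

```python
# Rows of Table 8: (vendor_code, item_code, project_no)
TABLE_8 = {
    ("V1", "I1", "P1"), ("V1", "I2", "P1"),
    ("V1", "I1", "P3"), ("V1", "I2", "P3"),
    ("V2", "I2", "P3"), ("V2", "I3", "P1"),
    ("V3", "I1", "P1"), ("V3", "I1", "P2"),
}

vendor_item = {(v, i) for v, i, p in TABLE_8}
vendor_project = {(v, p) for v, i, p in TABLE_8}
project_item = {(p, i) for v, i, p in TABLE_8}

# Join of the two 4NF relations on Vendor Code.
two_way = {(v, i, p)
           for v, i in vendor_item
           for v2, p in vendor_project
           if v == v2}

# The additional Project-Item relation filters out combinations in which
# the project does not use the item at all.
three_way = {(v, i, p) for v, i, p in two_way if (p, i) in project_item}

not_supplied_two = two_way - TABLE_8
not_supplied_three = three_way - TABLE_8

print(sorted(not_supplied_two))
print(sorted(not_supplied_three))
```

The third relation can only remove combinations from the join, never add them. Whether the three relations reproduce the original rows exactly depends on the data, so the sketch measures the difference instead of assuming it is empty.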

Table 11: Summary of Normalization Steps.

Input Relation | Transformation | Output Relation
All relations | Eliminate variable-length records. Remove multi-attribute lines in the table. | 1NF
1NF | Remove dependency of non-key attributes on part of a multi-attribute key. | 2NF
2NF | Remove dependency of non-key attributes on other non-key attributes. | 3NF
3NF | Remove dependency of an attribute of a multi-attribute key on an attribute of another (overlapping) multi-attribute key. | BCNF
BCNF | Remove more than one independent multivalued dependency from the relation by splitting the relation. | 4NF
4NF | Add one relation relating the attributes with multivalued dependency to the two relations with multivalued dependency. | 5NF