INTRODUCTION TO RELATIONAL DATABASE SYSTEMS


DATENBANKSYSTEME 1 (INF 3131)

Torsten Grust, Universität Tübingen

Winter 2017/18


LEGO BUILDING INSTRUCTIONS

Each LEGO set comes with building instructions, an illustrated booklet that details the individual steps of model construction.

One page in the booklet holds one or more instruction steps (steps are numbered 1, 2, …).

Each step lists the pieces (with their color and quantity) required to complete the step.

Each step comes with an illustration of where the listed pieces find their place in the model.

What would be a reasonable design for a building instructions table? Clearly:

1. Do not include LEGO set details in instructions: instead, use a foreign key to refer to table sets.

2. Do not include LEGO piece details in instructions: instead, use a foreign key to refer to table bricks.

3. Represent page numbers, step numbers, image sizes as integers but formulate constraints that avoid data entry errors (e.g. negative page/step numbers).


LEGO BUILDING INSTRUCTIONS

[Figure: Page 25 in Building Instructions for LEGO Set 9495 (Y-Wing)]

LEGO BUILDING INSTRUCTIONS (TABLE DESIGN)

instructions
set     step  piece  color  quantity  page  img        width  height
9495–1  7     3010   2      2         24    ‹image07›  639    533
9495–1  7     3023   2      2         24    ‹image07›  639    533
9495–1  7     2877   86     1         24    ‹image07›  639    533
9495–1  8     3002   7      2         24    ‹image08›  650    522
9495–1  8     30414  1      2         24    ‹image08›  650    522
9495–1  9     30414  85     1         25    ‹image09›  541    638
9495–1  9     3062b  85     2         25    ‹image09›  541    638
9495–1  10    30033  11     1         25    ‹image10›  540    662
9495–1  10    2412b  86     1         25    ‹image10›  540    662
9495–1  10    4589b  86     2         25    ‹image10›  540    662
9495–1  10    87580  85     1         25    ‹image10›  540    662
9495–1  11    3039   2      1         25    ‹image11›  1042   558
9495–1  11    4073   85     4         25    ‹image11›  1042   558
9495–1  11    44728  3      1         25    ‹image11›  1042   558

REDUNDANCY

The design of table instructions appears reasonable. We immediately spot a fair amount of redundancy, though. For example:

1. Step 10 of Set 9495 is printed on page 25. [represented 4 ×]

2. Step 7 of Set 9495 is illustrated by ‹image07›. [3 ×]

3. ‹image09› has dimensions 541 × 638 pixels. [2 ×]

Redundancy comes with a number of serious problems, most importantly:

Storage space is wasted.
Tables occupy more disk space than needed. The query processor has to touch/move more bytes. Archival storage (backup) requires more resources.

Redundant copies will go out of sync.
Eventually, an update operation will miss a copy. The database instance now contains “multiple truths.” Typically, this goes unnoticed by DBMS and user.


EMBEDDED FUNCTIONS AND REDUNDANCY

In table instructions, the source of redundancy is the presence of functions that are embedded in the table.

Leibniz Principle
If f is a function defined on x, y, then

    x = y ∧ f(x) = z ⇒ f(y) = z

Table instructions embeds the materialized functions

1. printed_on(): maps set, step to the page it is printed on

2. illustrated_by(): maps set, step to the illustration stored in image img

3. image_size(): maps an image img to its width and height

FUNCTIONAL DEPENDENCIES

Functional Dependency (FD)

Let (R, α) denote a relational schema. Given β ⊆ α and c ∈ α, the functional dependency β → c holds in R if

    ∀t, u ∈ inst(R) : t.β = u.β ⇒ t.c = u.c

Read: “If two rows agree on the columns in β, they also agree on column c.” (β: function arguments, c: function result).

Notation: the FD β → {c₁, …, cₙ} abbreviates the set of FDs β → c₁, …, β → cₙ.

Note: If c ∈ β, then β → c is called a trivial FD that obviously holds for any instance of R. No interesting insight into R here.


FDs are constraints that document universally valid mini-world facts (e.g., “a step is associated with one illustration”). FDs thus need to hold in all database instances.

instructions
set     step  piece  color  quantity  page  img        width  height
9495–1  7     3010   2      2         24    ‹image07›  639    533
9495–1  7     3023   2      2         24    ‹image07›  639    533
9495–1  7     2877   86     1         24    ‹image07›  639    533
9495–1  8     3002   7      2         24    ‹image08›  650    522
9495–1  8     30414  1      2         24    ‹image08›  650    522
9495–1  9     30414  85     1         25    ‹image09›  541    638
9495–1  9     3062b  85     2         25    ‹image09›  541    638

Which functional dependencies hold in table instructions?


Given table R, check whether the FD { b1, …, bn } → c holds in the current table instance:

SELECT DISTINCT 'The FD { b1, …, bn } → c does not hold'
FROM   R
GROUP BY b1, …, bn
HAVING COUNT(DISTINCT c) > 1

Aggregate Functions

Optional modifier DISTINCT affects the computation of aggregate functions:

‹aggregate›([ ALL ] ‹expression›)     -- aggregate all non-NULL values
‹aggregate›(DISTINCT ‹expression›)    -- aggregate all distinct non-NULL values
‹aggregate›(*)                        -- aggregate all rows (count(*))
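The grouping idea behind the FD check can also be tried outside the DBMS. The following is a minimal Python sketch, not part of the original slides: the function name fd_holds and the dict-based row representation are assumptions made for illustration, and NULL values are not treated specially (unlike COUNT(DISTINCT …) in SQL).

```python
from collections import defaultdict

def fd_holds(rows, beta, c):
    """Return True if the FD beta -> c holds in the given table instance.

    rows: list of dicts (column name -> value); beta: list of column names.
    Hypothetical helper, mirrors the GROUP BY / HAVING query.
    """
    seen = defaultdict(set)
    for row in rows:
        group = tuple(row[b] for b in beta)   # GROUP BY b1, ..., bn
        seen[group].add(row[c])               # distinct c values per group
    # The FD is violated iff some group has more than one distinct c value.
    return all(len(values) <= 1 for values in seen.values())

# A few rows of table instructions (columns not needed here are omitted).
instructions = [
    {"set": "9495–1", "step": 7, "piece": "3010",  "page": 24, "img": "image07"},
    {"set": "9495–1", "step": 7, "piece": "3023",  "page": 24, "img": "image07"},
    {"set": "9495–1", "step": 8, "piece": "3002",  "page": 24, "img": "image08"},
    {"set": "9495–1", "step": 9, "piece": "30414", "page": 25, "img": "image09"},
]

print(fd_holds(instructions, ["set", "step"], "page"))  # True
print(fd_holds(instructions, ["page"], "img"))          # False: page 24 shows two images
```

Like the SQL query, this only checks one table instance; it cannot establish that an FD holds in all instances.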

KEY → FD

Note that a key implicitly defines a particularly strong FD: the key columns functionally determine all columns of the table.

Keys vs FDs (1)
Assume table (R, {a₁, …, aₖ, aₖ₊₁, …, aₙ}).

    {a₁, …, aₖ} is a key of R  ⇔  {a₁, …, aₖ} → {aₖ₊₁, …, aₙ} holds.

So, keys are special FDs.

Turning this around: FDs are a generalization of keys.

FD → (LOCAL, PARTIAL) KEY

Keys vs FDs (2)
Assume table R and FD β → c. Then β is key in the sub-table of R defined by

SELECT DISTINCT β, c
FROM   R

Example: for table instructions and FD { set, step } → page the sub-table is

set     step  page
9495–1  7     24
9495–1  8     24
9495–1  9     25
9495–1  10    25
9495–1  11    25

(i.e., exactly the table materializing the function printed_on(), see above).
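To see the “local key” claim at work, here is a small Python sketch (an illustration only; the helper names sub_table and is_key and the tuple-based tables are assumptions, not taken from the slides). It builds the duplicate-free projection onto β and c and then tests whether β identifies every row of that projection.

```python
def sub_table(rows, beta, c):
    """Emulates SELECT DISTINCT beta, c FROM rows (rows are dicts)."""
    cols = list(beta) + [c]
    return sorted({tuple(row[col] for col in cols) for row in rows})

def is_key(table, width):
    """True if the first `width` columns identify each row of `table`."""
    prefixes = [row[:width] for row in table]
    return len(prefixes) == len(set(prefixes))

# Excerpt of table instructions (only the columns used below).
instructions = [
    {"set": "9495–1", "step": 7, "page": 24, "img": "image07"},
    {"set": "9495–1", "step": 7, "page": 24, "img": "image07"},
    {"set": "9495–1", "step": 8, "page": 24, "img": "image08"},
    {"set": "9495–1", "step": 9, "page": 25, "img": "image09"},
]

printed_on = sub_table(instructions, ["set", "step"], "page")
print(printed_on)              # three rows, one per (set, step)
print(is_key(printed_on, 2))   # True:  { set, step } -> page holds

page_img = sub_table(instructions, ["page"], "img")
print(is_key(page_img, 1))     # False: { page } -> img does not hold
```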


Example: recall table stores of the LEGO Data Warehouse scenario:

stores
store  city               state    country
7      HAMBURG            Hamburg  Germany
8      LEIPZIG            Sachsen  Germany
9      MÜNCHEN            Bayern   Germany
10     MÜNCHEN PASING     Bayern   Germany
11     NÜRNBERG           Bayern   Germany

16     ARDEN FAIR MALL    CA       USA
17     DISNEYLAND RESORT  CA       USA
18     FASHION VALLEY     CA       USA

List the FDs that hold in table stores.

Does the mini-world suggest FDs not implied by the rows shown above?



An FD indicates the presence of a materialized function. Consider the following variant of the users and ratings table:

users
user  rating  stars
Alex  3       ***
Bert  1       *
Cora  4       ****
Drew  5       *****
Erik  1       *
Fred  3       ***

FD { rating } → stars materializes the computable function stars = f(rating) = repeat('*', rating) [see PostgreSQL’s string function library].

In such cases, good database design should consider trading materialization for computation. This removes redundancy.

✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄

SQL: VIEWS

CREATE VIEW

Binds ‹query› to ‹name› which is globally visible. Whenever table ‹name› is referenced in subsequent queries, ‹query› is re-evaluated and its result returned (no materialization of the result of ‹query› is performed):

-- TEMPORARY: automatically drop view after current session
CREATE [ OR REPLACE ] [ TEMPORARY ] VIEW ‹name› AS ‹query›

Compare with CTEs: local visibility in surrounding WITH statement only.

A temporary view named ‹name› shadows a (regular, persistent) table of the same name.


SQL: VIEWS

Views provide data independence: users and applications continue to refer to ‹name›, while the database designer may decide to replace a persistent table with a computed query or vice versa.

Example: turn the materialized function stars = f(rating) into a computed function:

-- drop the materialized function from the table
ALTER TABLE users DROP COLUMN stars;

-- provide the three-column table that users/applications expect
-- ("user" is quoted because USER is a reserved word in PostgreSQL)
CREATE TEMPORARY VIEW users("user", rating, stars) AS
  SELECT u."user", u.rating, repeat('*', u.rating) AS stars
  FROM   users u;

Since PostgreSQL’s repeat() is a pure function, the FD rating → stars trivially holds in the view.

✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄

DERIVING FUNCTIONAL DEPENDENCIES

Given a set F of FDs over table R, simple inference rules (the Armstrong Axioms) suffice to generate all FDs following from those in F.

Armstrong Axioms
Apply exhaustively to generate all FDs implied by FD set F.

Reflexivity:
If γ ⊆ β, then β → γ.

Augmentation (with c ∈ sch(R)):
If β → γ, then β ∪ {c} → γ ∪ {c}.

Transitivity:
If α → β and β → γ, then α → γ.

Note: transitivity closely relates to function composition: if f, g are functions, so is g ∘ f.

DERIVING FDS (COVER)

Problem: Given a set of columns α ⊆ sch(R) and a set of FDs F over R, compute the cover α⁺, i.e. the set of all columns functionally determined by α.

Cover
The cover α⁺ of a set of columns α is the set of all columns c that are functionally determined by the columns in α (with respect to a given FD set F):

    α⁺ := {c | F implies α → c}

⚠ Should we find that α⁺ = sch(R), then α is a candidate key for R.


Compute the cover α⁺ for a given set of FDs:

cover(α, F)   (Input: column set α, FD set F, Output: α⁺)

1. X := α

2. Repeat
     For each FD β → c in F do
       If β ⊆ X then
         X := X ∪ {c}
   Until X did not change

3. Return X


Example: In table instructions, compute {set,step}⁺ with F = { {set,step} → page, {set,step} → img, {img} → width, {img} → height }.

Tracing column set X:

1. X := {set,step}

2. FD {set,step} → page, {set,step} ⊆ X:  X := X ∪ {page}
   FD {set,step} → img,  {set,step} ⊆ X:  X := X ∪ {img}
   FD {img} → width,     {img} ⊆ X:       X := X ∪ {width}
   FD {img} → height,    {img} ⊆ X:       X := X ∪ {height}
   All FDs considered. X = {set, step, page, img, width, height}
   Repetition of 2. does not add new columns to X.

3. Return {set, step, page, img, width, height}.
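The cover computation translates almost line by line into executable code. The following Python sketch is one possible rendering, under the assumption that an FD β → c is represented as a pair (frozenset of columns, column); this representation and the variable names are choices made here for illustration.

```python
def cover(alpha, fds):
    """Return the cover of column set `alpha` with respect to FD list `fds`.

    Each FD is a pair (beta, c): a frozenset of columns and a single column.
    """
    x = set(alpha)                     # 1. X := alpha
    changed = True
    while changed:                     # 2. repeat until X did not change
        changed = False
        for beta, c in fds:
            if beta <= x and c not in x:
                x.add(c)
                changed = True
    return x                           # 3. return X

# FD set F of the worked example for table instructions.
F = [
    (frozenset({"set", "step"}), "page"),
    (frozenset({"set", "step"}), "img"),
    (frozenset({"img"}), "width"),
    (frozenset({"img"}), "height"),
]

print(sorted(cover({"set", "step"}, F)))
# ['height', 'img', 'page', 'set', 'step', 'width']
print(sorted(cover({"img"}, F)))
# ['height', 'img', 'width']
```

The loop terminates because X can only grow and is bounded by the columns mentioned in α and F.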

DERIVING CANDIDATE KEYS

key(K, U, F)   (Input: FD set F, Output: set of all candidate keys for R)

If U = ∅ then
  Return {K}                                  [ Invariant: cover(K ∪ U, F) = sch(R) ]
else
  X := ∅
  For each c ∈ U do
    If c ∉ cover(K ∪ (U ∖ {c}), F) then       [ Is c essential for the key? ]
      X := X ∪ key(K ∪ {c}, U ∖ {c}, F)       ⚑
    else
      X := X ∪ key(K, U ∖ {c}, F)
  Return X

Invoke via key(∅, sch(R), F).

Can optimize at ⚑: invoke key(K ∪ {c}, U ∖ cover(K ∪ {c}, F), F) instead.
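A direct Python transcription of key(K, U, F) may help to experiment with the recursion. This is a sketch under assumptions: FDs are pairs (frozenset of columns, column), the unoptimized variant (without ⚑) is implemented, and memoization via functools.lru_cache is an addition made here to avoid re-exploring identical (K, U) pairs. The FD { set, step, piece, color } → quantity is taken from the key FD of table instructions.

```python
from functools import lru_cache

def cover(alpha, fds):
    """Columns functionally determined by `alpha` under FDs (beta, c)."""
    x = set(alpha)
    changed = True
    while changed:
        changed = False
        for beta, c in fds:
            if beta <= x and c not in x:
                x.add(c)
                changed = True
    return x

def candidate_keys(schema, fds):
    """Set of all candidate keys, computed as key(∅, sch(R), F)."""
    fds = tuple(fds)

    @lru_cache(maxsize=None)
    def key(k, u):
        if not u:                              # U = ∅: K is a key
            return frozenset({k})
        found = set()
        for c in u:
            rest = u - {c}
            if c not in cover(k | rest, fds):  # c is essential for the key
                found |= key(k | {c}, rest)
            else:                              # c can be dropped
                found |= key(k, rest)
        return frozenset(found)

    return key(frozenset(), frozenset(schema))

schema = {"set", "step", "piece", "color", "quantity",
          "page", "img", "width", "height"}
F = [
    (frozenset({"set", "step"}), "page"),
    (frozenset({"set", "step"}), "img"),
    (frozenset({"img"}), "width"),
    (frozenset({"img"}), "height"),
    (frozenset({"set", "step", "piece", "color"}), "quantity"),
]

print([sorted(k) for k in candidate_keys(schema, F)])
# [['color', 'piece', 'set', 'step']]
```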

DATABASE DESIGN WITH FDS

Typically it is a severe sign of poor database design if tables embed functions, i.e. if a table contains

FDs that are not implied by the primary key. ⚠

Consequences of table designs with non-key FDs / embedded functions:

1. Redundancy (see above ✔)

2. Update/Insertion/Deletion Anomalies

3. RDBMS cannot protect the integrity of non-key FDs, thus risk of inconsistency over time:

   SQL DDL does not implement an ALTER TABLE … ADD FUNCTIONAL DEPENDENCY … statement.

   Although FDs embody important mini-world facts they are easily violated without protection. (Can simulate this protection using SQL triggers or rewrite rules. Cumbersome. Inefficient.)


UPDATE/INSERTION/DELETION ANOMALIES

Recall table instructions and embedded FD { img } → { width, height }:

Update anomaly:
Changing a single mini-world fact requires the modification of multiple rows.
[ Modifying an image size requires searching/updating the entire instructions table. ]

Insertion anomaly:
A new mini-world fact cannot be stored unless it is put in larger context.
[ No place to record the width/height dimensions of a new image that is not yet used in an instruction manual. ]

Deletion anomaly:
A formerly stored mini-world fact vanishes once its (last) context is deleted.
[ Information about image width/height is lost once the last instruction manual including that image is deleted from instructions. ]


BOYCE-CODD NORMAL FORM

Boyce-Codd Normal Form (BCNF)
Table R is in Boyce-Codd Normal Form (BCNF) if and only if all its FDs are already implied by its key constraints.

For table R in BCNF and any FD β → c of R one of the following holds:

1. The FD is trivial, i.e., c ∈ β.

2. The FD follows from a key because β (or a subset of it) already is a key of R.

A table in BCNF does not exhibit the three anomalies (no embedded functions).

All FDs in a table in BCNF are protected by the RDBMS through PRIMARY KEY (or UNIQUE) constraints.

BOYCE-CODD NORMAL FORM

Examples:

Table instructions is not in BCNF: key FD { set, step, piece, color } → { quantity, page, img, width, height } does not imply { set, step } → { page, img } or { img } → { width, height }:

instructions
set  step  piece  color  quantity  page  img  width  height

Table users is not in BCNF: { rating } → { stars } is not implied by the key FD:

users
name  rating  stars

Table stores is not in BCNF: { state } → { country } is not implied by the key FD:

stores
store  city  state  country

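The BCNF condition can be tested mechanically: an FD β → c violates BCNF if it is non-trivial and the cover of β does not reach all columns of the table. The Python sketch below illustrates this for the users and stores examples. It rests on assumptions: FDs are pairs (frozenset of columns, column), the function name bcnf_violations is hypothetical, and the key columns (user for users, store for stores) are assumed from the table layouts shown in these slides.

```python
def cover(alpha, fds):
    """Columns functionally determined by `alpha` under FDs (beta, c)."""
    x = set(alpha)
    changed = True
    while changed:
        changed = False
        for beta, c in fds:
            if beta <= x and c not in x:
                x.add(c)
                changed = True
    return x

def bcnf_violations(schema, fds):
    """FDs beta -> c that are non-trivial and whose beta contains no key."""
    schema = frozenset(schema)
    return [(beta, c) for beta, c in fds
            if c not in beta and not schema <= cover(beta, fds)]

# users: assumed key column `user`, plus the embedded FD rating -> stars
users_fds = [
    (frozenset({"user"}), "rating"),
    (frozenset({"user"}), "stars"),
    (frozenset({"rating"}), "stars"),
]
# stores: assumed key column `store`, plus the embedded FD state -> country
stores_fds = [
    (frozenset({"store"}), "city"),
    (frozenset({"store"}), "state"),
    (frozenset({"store"}), "country"),
    (frozenset({"state"}), "country"),
]

for beta, c in bcnf_violations({"user", "rating", "stars"}, users_fds):
    print(sorted(beta), "->", c)     # ['rating'] -> stars
for beta, c in bcnf_violations({"store", "city", "state", "country"}, stores_fds):
    print(sorted(beta), "->", c)     # ['state'] -> country
```

An empty result means that none of the listed FDs violates BCNF.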

BCNF SCHEMA DECOMPOSITION

split(R, F)   (Input: table R with FD set F, Output: split relation schemata)

If β → c ∈ F with c ∉ β and β does not contain a key of R then

  1. Split R and replace by
       R₁((sch(R) ∖ cover(β, F)) ∪ β)
       R₂(cover(β, F))

  2. split(R₁, F_{sch(R₁)})

  3. split(R₂, F_{sch(R₂)})

Notes:

F_C denotes FD set F restricted to those β → c for which β ∪ {c} ⊆ C.

For each split: sch(R₁) ∪ sch(R₂) = sch(R) and sch(R₁) ∩ sch(R₂) = β.
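The recursive structure of split(R, F) can be sketched in Python as follows. This is an illustration under assumptions: FDs are pairs (frozenset of columns, column), the first violating FD in list order is chosen as the split FD, and the helper restrict() implements the restriction F_C from the notes. The FD list for table instructions combines the embedded FDs with { set, step, piece, color } → quantity from the key FD.

```python
def cover(alpha, fds):
    """Columns functionally determined by `alpha` under FDs (beta, c)."""
    x = set(alpha)
    changed = True
    while changed:
        changed = False
        for beta, c in fds:
            if beta <= x and c not in x:
                x.add(c)
                changed = True
    return x

def restrict(fds, cols):
    """F_C: keep only FDs beta -> c with beta ∪ {c} ⊆ cols."""
    return [(beta, c) for beta, c in fds if beta | {c} <= cols]

def split(schema, fds):
    """Return a list of BCNF schemata (frozensets of column names)."""
    schema = frozenset(schema)
    for beta, c in fds:
        closure = cover(beta, fds)
        if c not in beta and not schema <= closure:  # beta contains no key
            r1 = (schema - closure) | beta
            r2 = frozenset(closure)
            return split(r1, restrict(fds, r1)) + split(r2, restrict(fds, r2))
    return [schema]                                  # already in BCNF

schema = {"set", "step", "piece", "color", "quantity",
          "page", "img", "width", "height"}
F = [
    (frozenset({"set", "step"}), "page"),
    (frozenset({"set", "step"}), "img"),
    (frozenset({"img"}), "width"),
    (frozenset({"img"}), "height"),
    (frozenset({"set", "step", "piece", "color"}), "quantity"),
]

for table in split(schema, F):
    print(sorted(table))
# ['color', 'piece', 'quantity', 'set', 'step']
# ['img', 'page', 'set', 'step']
# ['height', 'img', 'width']
```

Each recursive call works on a strictly smaller schema (R₁ lacks c, R₂ is a proper subset of sch(R)), so the recursion terminates.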

BCNF: AFTER DECOMPOSITION (1)

Resultant BCNF tables after split(instructions, F) has been completed:

parts (1/3)
set     step  piece  color  quantity
9495–1  7     3010   2      2
9495–1  7     3023   2      2
9495–1  7     2877   86     1
9495–1  8     3002   7      2
9495–1  8     30414  1      2
9495–1  9     30414  85     1
9495–1  9     3062b  85     2

Note: It is rather straightforward to name the newly generated tables: these tables represent a single real-world concept.

BCNF: AFTER DECOMPOSITION (2)

Resultant BCNF tables after split(instructions, F) has been completed:

layouts (2/3)
set     step  page  img
9495–1  7     24    ‹image07›
9495–1  8     24    ‹image08›
9495–1  9     25    ‹image09›

illustrations (3/3)
img        width  height
‹image07›  639    533
‹image08›  650    522
‹image09›  541    638

To tie the BCNF tables together, establish foreign keys pointing from parts to layouts and from layouts to illustrations.

BCNF: RECONSTRUCTION

Use an equi-join to reconstruct the original wide table instructions from its constituent tables:

Reconstruction after BCNF decomposition

Perform an equi-join over the (non-empty) schema intersections of the BCNF tables:

SELECT p.set, p.step, p.piece, p.color, p.quantity, l.page, l.img, i.width, i.height
FROM   parts p, layouts l, illustrations i
WHERE  p.set = l.set AND p.step = l.step
AND    l.img = i.img

It may make sense to use CREATE VIEW to reestablish the wide table for users and applications.


BCNF: AFTER DECOMPOSITION

Decomposition for table users:

users (R₁)
user  rating
Alex  3
Bert  1
Cora  4
Drew  5
Erik  1
Fred  3

render (R₂)
rating  stars
1       *
3       ***
4       ****
5       *****

The RDBMS protects the FDs (keys): translation from rating to stars in table render is always consistent. No redundancy in table render.

BCNF DECOMPOSITION: LOSSLESS SPLITS

BCNF decomposition builds on the assumption that no information is lost during the splits: original table R can be reconstructed by an equi-join of R₁ and R₂.

Not all decompositions are lossless, however. Consider:

R
A   B   C
a₁  b₁  c₁
a₁  b₁  c₂
a₁  b₂  c₁

and its decomposition into R1(A, B), R2(A, C). The equi-join of R1 and R2 (on A) is:

A   B   C
a₁  b₁  c₁
a₁  b₁  c₂
a₁  b₂  c₁
a₁  b₂  c₂  ⚠

⇒ An extra (bogus!) row has been reconstructed by the join. Information has been lost.
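The bogus row can be reproduced with a few lines of Python. The sketch below is only an illustration: the helpers project and equi_join and the representation of rows as tuples of (column, value) pairs are assumptions made here, and plain ASCII names a1, b1, … stand in for the subscripted values.

```python
def project(rows, cols):
    """Duplicate-free projection of `rows` (list of dicts) onto `cols`."""
    return {tuple((col, row[col]) for col in cols) for row in rows}

def equi_join(r1, r2):
    """Join two projections on all columns they have in common."""
    result = set()
    for t in r1:
        for u in r2:
            a, b = dict(t), dict(u)
            if all(a[col] == b[col] for col in a.keys() & b.keys()):
                result.add(tuple(sorted({**a, **b}.items())))
    return result

R = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c2"},
    {"A": "a1", "B": "b2", "C": "c1"},
]

R1 = project(R, ["A", "B"])
R2 = project(R, ["A", "C"])
joined = equi_join(R1, R2)
original = {tuple(sorted(row.items())) for row in R}

print(len(original), len(joined))    # 3 4
print(joined - original)
# {(('A', 'a1'), ('B', 'b2'), ('C', 'c2'))}
```

The join returns a superset of R; the loss of information lies in no longer being able to tell which of the four rows were in the original table.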

BCNF: LOSSLESS SPLITS

Decomposition Theorem

Consider the decomposition of table R into R₁ and R₂. The reconstruction of R from R₁, R₂ via an equi-join on sch(R₁) ∩ sch(R₂) is lossless if

1. sch(R₁) ∪ sch(R₂) = sch(R) and

2. sch(R₁) ∩ sch(R₂) is a key of R₁ or R₂ (or both).

The splits “along the FD β → c” performed by split(R, F) will always be lossless:

1. sch(R₁) ∪ sch(R₂) = sch(R) and sch(R₁) ∩ sch(R₂) = β ✔

2. Since sch(R₂) = cover(β, F), β is a key for R₂ ✔

We will never lose information through BCNF decomposition.
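The two conditions of the Decomposition Theorem can be checked with the cover function. The following Python sketch is an illustration under assumptions: FDs are pairs (frozenset of columns, column), the function name is_lossless is hypothetical, and “is a key of R₁ or R₂” is tested as “the cover of the shared columns contains all columns of R₁ or of R₂”.

```python
def cover(alpha, fds):
    """Columns functionally determined by `alpha` under FDs (beta, c)."""
    x = set(alpha)
    changed = True
    while changed:
        changed = False
        for beta, c in fds:
            if beta <= x and c not in x:
                x.add(c)
                changed = True
    return x

def is_lossless(schema, r1, r2, fds):
    """Sufficient test for a lossless split of `schema` into `r1`, `r2`."""
    schema, r1, r2 = frozenset(schema), frozenset(r1), frozenset(r2)
    closure = cover(r1 & r2, fds)          # cover of the shared columns
    covers_schema = (r1 | r2) == schema    # condition 1
    shared_is_key = r1 <= closure or r2 <= closure   # condition 2
    return covers_schema and shared_is_key

# R(A, B, C) without any FDs, split into R1(A, B) and R2(A, C)
print(is_lossless({"A", "B", "C"}, {"A", "B"}, {"A", "C"}, []))   # False

# Split along { img } -> { width, height }
F = [(frozenset({"img"}), "width"), (frozenset({"img"}), "height")]
print(is_lossless({"set", "step", "page", "img", "width", "height"},
                  {"set", "step", "page", "img"},
                  {"img", "width", "height"}, F))                 # True
```

A result of False only says that the theorem does not apply; the theorem states a sufficient condition.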

BCNF: NON-DETERMINISM, LOSS OF FDS

⚠ BCNF decomposition is not deterministic: the arbitrary choice of the “split FD” in algorithm split(R, F) leads to different decompositions, in general:

1. For table instructions: splitting along FD { set, step } → { page, img } or { img } → { width, height } first makes no difference. (Try it.)

2. But consider R(A,B,C,D,E) with FDs { C, D } → E and { B } → E.

⚠ BCNF decomposition may fail to preserve dependencies: given FD β → c, the column set β ∪ {c} may be distributed across multiple tables. The FD is “lost” (cannot be enforced by the system).

Consider FDs { zip } → { city, state } and { street, city, state } → zip in table zipcodes.

1. What are the candidate keys for zipcodes?

2. What is a BCNF decomposition for zipcodes?

zipcodes
zip  street  city  state

DENORMALIZATION VS. DECOMPOSITION

BCNF and decomposition come with significant benefits but are no panacea. There are valid reasons to leave database tables in denormalized form:

1. Performance:
   Decomposition requires table reconstruction via equi-joins which incur query evaluation costs. Denormalized tables save this effort at the cost of storing information redundantly.

2. Preservation of FDs:
   In specific applications, preservation of mission-critical FDs may be a higher priority than the removal of redundancy.

Columnar database systems perform full decomposition (beyond the splits required by BCNF normalization): R(id, A, B, C, …) decomposed into R1(id, A), R2(id, B), R3(id, C), … (binary tables).

Queries other than SELECT r.* FROM R r can selectively access the Rᵢ, reading fewer bytes from persistent storage.

DBMS internals simplified: every row is guaranteed to have exactly two fields.
