Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
INTRODUCTION TO
RELATIONAL DATABASE SYSTEMS
DATENBANKSYSTEME 1 (INF 3131)
Torsten GrustUniversität Tübingen
Winter 2017/18
1
LEGO BUILDING INSTRUCTIONS
Each LEGO set comes with building instructions, an illustrated booklet that details theindividual steps of model construction.
One page in the booklet holds one ore more instruction steps (steps are numbered 1, 2, …).
Each step lists the pieces (with their color and quantity) required to complete the step.
Each step comes with an illustration of where the listed pieces find their place in themodel.
What would be a reasonable design for a building instructions table? Clearly:
Do not include LEGO set details in instructions: instead, use a foreign key to refer totable sets.
Do not include LEGO piece details in instructions: instead, use a foreign key to refer totable bricks.
Represent page numbers, step numbers, image sizes as integers but formulateconstraints that avoid data entry errors (e.g. negative page/step numbers).
‐
‐‐‐
‐1.
2.
3.
2
LEGO BUILDING INSTRUCTIONS (TABLEDESIGN)
instructionsset step piece color quantity page img width height
9495–1 7 3010 2 2 24 ‹image07› 639 5339495–1 7 3023 2 2 24 ‹image07› 639 5339495–1 7 2877 86 1 24 ‹image07› 639 5339495–1 8 3002 7 2 24 ‹image08› 650 5229495–1 8 30414 1 2 24 ‹image08› 650 5229495–1 9 30414 85 1 25 ‹image09› 541 6389495–1 9 3062b 85 2 25 ‹image09› 541 6389495–1 10 30033 11 1 25 ‹image10› 540 6629495–1 10 2412b 86 1 25 ‹image10› 540 6629495–1 10 4589b 86 2 25 ‹image10› 540 6629495–1 10 87580 85 1 25 ‹image10› 540 6629495–1 11 3039 2 1 25 ‹image11› 1042 5589495–1 11 4073 85 4 25 ‹image11› 1042 5589495–1 11 44728 3 1 25 ‹image11› 1042 558
4
REDUNDANCY
The design of table instructions appears reasonable. We immediately spot a fair amount ofredundancy, though. For example:
Step 10 of Set 9495 is printed on page 25. [represented 4 ×]
Step 7 of Set 9495 is illustrated by ‹image07›. [3 ×]
‹image09› has dimensions 541 ⨉ 638 pixels. [2 ×]
Redundancy comes with a number of serious problems, most importantly:
Storage space is wasted.Tables occupy more disk space than needed. Query processor has to touch/move morebytes. Archival storage (backup) requires more resources.
Redundant copies will go out of sync.Eventually, an update operation will miss a copy. The database instance now contains“multiple truths.” Typically, this goes unnoticed by DBMS and user.
‐
1.
2.
3.
‐‐
‐
5
EMBEDDED FUNCTIONS AND REDUNDANCY
In table instructions, the source of redundancy is the presence of functions that areembedded in the table.
Leibniz PrincipleIf is a function defined on , then
Table instructions embeds the materialized functions
printed_on(): maps set, step to the page it is printed on
illustrated_by(): maps set, step to the illustration stored in image img
image_size(): maps an image img to its width and height
‐
f x, y
x = y ∧ f(x) = z ⇒ f(y) = z
‐1.
2.
3.
6
FUNCTIONAL DEPENDENCIES
Functional Dependency (FD)
Let denote a relational schema. Given and , the functionaldependency holds in if
Read: “If two rows agree on the columns in , they also agree on column .” ( :function arguments, : function result).
Notation: the FD abbreviates the set of FDs , …, .
Note: If , then is called a trivial FD that obviously holds for any instance of .No interesting insight into here.
(R, α) β ⊆ α c ∈ αβ → c R
∀t, u ∈ inst(R) : t. β = u. β ⇒ t. c = u. c
β c βc
β → { , … , }c1 cn β → c1β → cn
‐ c ∈ β β → c RR
7
FUNCTIONAL DEPENDENCIES
FDs are constraints that document universally valid mini-world facts (e.g., “a step isassociated with one illustration”). FDs thus need to hold in all database instances.
instructionsset step piece color quantity page img width height
9495–1 7 3010 2 2 24 ‹image07› 639 5339495–1 7 3023 2 2 24 ‹image07› 639 5339495–1 7 2877 86 1 24 ‹image07› 639 5339495–1 8 3002 7 2 24 ‹image08› 650 5229495–1 8 30414 1 2 24 ‹image08› 650 5229495–1 9 30414 85 1 25 ‹image09› 541 6389495–1 9 3062b 85 2 25 ‹image09› 541 638
Which functional dependencies hold in table instructions?
‐
‐
8
FUNCTIONAL DEPENDENCIES
Given table R, check whether the FD { b1, …, bn } c holds in the current table instance:
SELECT DISTINCT 'The FD { b1, …, bn } → c does not hold'FROM RGROUP BY b1, …, bnHAVING COUNT(DISTINCT c) > 1
Aggregate Functions
Optional modifier DISTINCT affects the computation of aggregate functions:
‹aggregate›([ ALL ] ‹expression›) -- aggregate all non-NULL values‹aggregate›(DISTINCT ‹expression›) -- aggregate all distinct non-NULL values‹aggregate›(*) -- aggregate all rows (count(*))
‐ →
9
KEY → FD
Note that a key implicitly defines a particularly strong FD: the key columns functionallydetermine all columns of the table.
Keys vs FDs (1)Assume table .
is a key of holds.
So, keys are special FDs.
Turning this around: FDs are a generalization of keys.
‐
(R, { , … , , , … , })a1 ak ak+1 an
{ , … , }a1 ak R ⇔{ , … , } → { , … , }a1 ak ak+1 an
‐‐
10
FD → (LOCAL, PARTIAL) KEY
Keys vs FDs (2)Assume table and FD . Then is key in the sub-table of defined by
SELECT DISTINCT , FROM
Example: for table instructions and FD { set, step } → page the sub-table isset step page
9495–1 7 249495–1 8 249495–1 9 259495–1 10 259495–1 11 25
(i.e., exactly the table materializing the function printed_on(), see above ).
R β → c β R
β cR
‐
11
FUNCTIONAL DEPENDENCIES
Example: recall table stores of the LEGO Data Warehouse scenario:
store city state country
7 HAMBURG Hamburg Germany8 LEIPZIG Sachsen Germany9 MÜNCHEN Bayern Germany10 MÜNCHEN PASING Bayern Germany11 NÜRNBERG Bayern Germany
16 ARDEN FAIR MALL CA USA17 DISNEYLAND RESORT CA USA18 FASHION VALLEY CA USA
List the FDs that hold in table stores.
Does the mini-world suggest FDs not implied by the rows shown above?
‐
‐‐
12
FUNCTIONAL DEPENDENCIES
An FD indicates the presence of a materialized function. Consider the following variant of theusers and ratings table:
usersuser rating starsAlex 3 ***Bert 1 *Cora 4 ****Drew 5 *****Erik 1 *Fred 3 ***
FD { rating } → stars materializes the computable function stars = rating = repeat('*', rating) [see PostgreSQL’s string function library].
In such cases, good database design should consider to trade materialization forcomputation. Removes redundancy.
‐
‐ f( )
‐
13
✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄
SQL: VIEWS
CREATE VIEW
Binds ‹query› to ‹name› which is globally visible. Whenever table ‹name› isreferenced in subsequent queries, ‹query› is re-evaluated and its result returned(no materialization of the result of ‹query› is performed):
-- TEMPORARY: automatically drop view after current sessionCREATE [ OR REPLACE ] [ TEMPORARY ] VIEW ‹name› AS ‹query›
Compare with CTEs: local visibility in surrounding WITH statement only.
A temporary view named ‹name› shadows a (regular, persistent) table of the same name.
‐‐
14
SQL: VIEWS
Views provide data independence: users and applications continue to refer to ‹name›, whilethe database designer may decide to replace a persistent table with a computed query or viceversa.
Example: turn the materialized function stars = rating into a computed function:
-- drop the materialized function from the tableALTER TABLE users DROP COLUMN stars;
-- provide the three-column table that users/applications expectCREATE TEMPORARY VIEW users(user, rating, stars) AS SELECT u.user, u.rating, repeat('*', u.rating) stars FROM users u;
Since PostgreSQL’s repeat() is a pure function, the FD rating → stars trivally holds inthe view.
✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄
‐
‐ f( )
‐
15
DERIVING FUNCTIONAL DEPENDENCIES
Given a set of FDs over table , simple inference rules—the Armstrong Axioms—suffice togenerate all FDs following from those in .
Armstrong AxiomsApply exhaustively to generate all FDs implied by FD set .
Reflexivity:If , then .
Augmentation (with ):If , then
Transitivity:If and , then .
Note: transitivity closely relates to function composition: if are functions, so is .
‐ F RF
F
γ ⊆ β β → γ
c ∈ sch(R)β → γ β ∪ {c} → γ ∪ {c}
α → β β → γ α → γ
‐ f, g g ∘ f16
DERIVING FDS (COVER)
Problem: Given a set of columns and a set of FDs over , compute thecover , i.e. the set of all columns functionally determined by .
CoverThe cover of a set of columns is the set of all columns that are functionallydetermined by the columns in (with respect to a given FD set ):
⚠ Should we find that , then is a candidate key for .
‐ α ⊆ sch(R) F Rα+ α
α+ α cα F
:= {c | F implies α → c}α+
‐ = sch(R)α+ α R
17
DERIVING FDS (COVER)
Compute the cover for a given set of FDs:
(Input: column set , FD set , Output: )
Repeat
For each FD in do
If then
Until did not change
Return
‐ α+
cover(α, F) α F α+
1. X := α
2.
‐ β → c F
‐ β ⊆ X
‐ X := X ∪ {c}X
3. X
18
DERIVING FDS (COVER)
Example: In table instructions, compute {set,step}+ with = { {set,step} → page,{set,step} → img, {img} → width, {img} → height }.
instructionsset step piece color quantity page img width height
Tracing column set :
:= {set,step}
FD {set,step} → page, {set,step} : " := {page}FD {set,step} → img, {set,step} : " := {img}FD {img} → width, {img} : " := {width}FD {img} → height, {img} :" := {height}"All FDs considered. = {set, step, page, img, width, height}Repetition of 2. does not add new columns to .
Return {set, step, page, img, width, height}.
‐ F
‐ X
1. X
2. ⊆ X X X ∪⊆ X X X ∪
⊆ X X X ∪⊆ X X X ∪
XX
3.19
DERIVING CANDIDATE KEYS
(Input: FD set , Output: set of all candidate keys for )
If then Return [ Invariant: ]
else
For each do
If then [ Is essential for the key? ]
⚑
else
Return
Invoke via .
Can optimize at ⚑ : invoke instead.
key(K, U , F) F R
‐ U = ∅ {K} cover(K ∪ U , F) = sch(R)
‐ X := ∅‐ c ∈ U
‐ c ∉ cover(K ∪ (U ∖ {c}), F) c
‐ X := X ∪ key(K ∪ {c}, U ∖ {c}, F)
‐ X := X ∪ key(K, U ∖ {c}, F)
‐ X
‐ key(∅, sch(R), F)‐ key(K ∪ {c}, U ∖ cover(K ∪ {c}, F), F)
20
DATABASE DESIGN WITH FDS
Typically it is a severe sign of poor database design if tables embed functions, i.e. if a tablecontains
FDs that are not implied by the primary key. ⚠
Consequences of table designs with non-key FDs / embedded functions:
Redundancy (see above ✔)
Update/Insertion/Deletion Anomalies
RDBMS cannot protect the integrity of non-key FDs, thus risk of inconsistency over time:
SQL DDL does not implement an ALTER TABLE … ADD FUNCTIONAL DEPENDENCY …statement.
Although FDs embody important mini-world facts they are easily violated withoutprotection. (Can simulate this protection using SQL triggers or rewrite rules.Cumbersome. Inefficient.)
‐
‐1.
2.
3.
‐
‐
21
UPDATE/INSERTION/DELETION ANOMALIES
Recall table instructions and embedded FD { img } → { width, height}:
instructionsset step piece color quantity page img width height
Update anomaly:Changing a single mini-world fact requires the modification of multiple rows.[ Modifying image size requires to search/update entire instruction table. ]
Insertion anomaly:A new mini-world fact cannot be stored unless it is put in larger context.[ No place to record width/height dimension of a new image yet unused in an instructionmanual. ]
Deletion anomaly:A formerly stored mini-world fact vanishes once its (last) context is deleted.[ Information about image width/height is lost once last instruction manual including thatimage is deleted from instructions. ]
‐
‐
‐
‐
22
BOYCE-CODD NORMAL FORM
Boyce-Codd Normal Form (BCNF)Table is in Boyce-Codd Normal Form (BCNF) if and only if all its FDs are alreadyimplied by its key constraints.
For table in BCNF and any FD of one of the following holds:
The FD is trivial, i.e., .
The FD follows from a key because (or a subset of it) already is a key of .
A table in BCNF does not exhibit the three anomalies (no embedded functions).
All FDs in table in BCNF are protected by the RDBMS through PRIMARY KEY (or UNIQUE)constraints.
R
R β → c R
1. c ∈ β
2. β R
‐‐
23
BOYCE-CODD NORMAL FORM
Examples:
Table instructions is not in BCNF: key FD { set, step, piece, color } → { quantity, page, img, width, height} does not imply { set, step } → { page, img } or { img } → { width, height }:
instructionsset step piece color quantity page img width height
Table users not in BCNF: { rating } → { stars } not implied by key FD:usersname rating stars
Table stores not in BCNF: { state } → { country } not implied by key FD:storesstore city state country
‐‐
‐
‐
24
BCNF SCHEMA DECOMPOSITION
(Input: table with FD set , Output: spli%ed relation schemata)
If with and does not contain a key of then
Split and replace by
Notes:
denotes FD set restricted to those for which .
For each split: and .
split(R, F) R F
‐ β → c ∈ F c ∉ β β R
1. R
‐ ((sch(R) ∖ cover(β, F)) ∪ β)R1
‐ (cover(β, F))R2
2. split( , )R1 Fsch( )R1
3. split( , )R2 Fsch( )R2
‐‐ FC F β → c β ∪ {c} ⊆ C
‐ sch( ) ∪ sch( ) = sch(R)R1 R2 sch( ) ∩ sch( ) = βR1 R225
BCNF: AFTER DECOMPOSITION (1)
Resultant BCNF tables after has been completed:
parts (1/3)set step piece color quantity
9495–1 7 3010 2 29495–1 7 3023 2 29495–1 7 2877 86 19495–1 8 3002 7 29495–1 8 30414 1 29495–1 9 30414 85 19495–1 9 3062b 85 2
Note: It is rather straightforward to name the newly generated tables: these tables represent asingle real-world concept.
‐ split(instructions, F)
‐
26
BCNF: AFTER DECOMPOSITION (2)
Resultant BCNF tables after has been completed:
layouts (2/3)set step page img
9495–1 7 24 ‹image07›9495–1 8 24 ‹image08›9495–1 9 24 ‹image09›
illustrations (3/3)img width height
‹image07› 639 533‹image08› 650 522‹image09› 541 638
To tie the BNCF tables together, establish foreign keys pointing from parts to layouts andfrom layouts to illustrations.
‐ split(instructions, F)
‐27
BCNF: RECONSTRUCTION
Use an equi-join to reconstruct the original wide table instructions from its constituenttables:
Reconstruction after BCNF decomposition
Perform an equi-join over the (non-empty) schema intersections of the BCNF tables:
SELECT p.set, p.step, p.piece, p.color, p.quantity, l.page, l.img, i.width, i.heightFROM parts p, layouts l, illustrations iWHERE p.set = l.set AND p.step = l.stepAND l.img = i.img
It may make sense to use CREATE VIEW to reestablish the wide table for users andapplications.
‐
‐
28
BCNF: AFTER DECOMPOSITION
Decomposition for table users:
users ( ) user ratingAlex 3Bert 1Cora 4Drew 5Erik 1Fred 3
render ( )rating stars1 *3 ***4 ****5 ******
The RDBMS protects the FDs (keys): translation from rating to stars in table render is alwaysconsistent. No redundancy in table render.
‐R1
R2
‐29
BCNF DECOMPOSITION: LOSSLESS SPLITS
BNCF decomposition builds on the assumption that no information is lost during the splits:original table can be reconstructed by an equi-join of and .
Not all decompositions are lossless, however. Consider:RA B Ca₁ b₁ c₁a₁ b₁ c₂a₁ b₂ c₁
and its decomposition into R1(A, B), R2(A, C). The equi-join of R1 and R2 (on A) is:A B Ca₁ b₁ c₁a₁ b₁ c₂a₁ b₂ c₁a₁ b₂ c₂ ⚠
⇒ An extra (bogus!) row has been reconstructed by the join. Information has been lost.
‐R R1 R2
‐
30
BCNF: LOSSLESS SPLITS
Decomposition Theorem
Consider the decomposition of table into and . The reconstruction of from , via an equi-join on is lossless if
and
is a key of or (or both).
The splits “along the FD ” performed by will always be lossless:
and ✔
Since , is a key for ✔
We will never lose information through BCNF decomposition.
R R1 R2 RR1 R2 sch( ) ∩ sch( )R1 R2
1. sch( ) ∪ sch( ) = sch(R)R1 R2
2. sch( ) ∩ sch( )R1 R2 R1 R2
‐ β → c split(R, F)1. sch( ) ∪ sch( ) = sch(R)R1 R2 sch( ) ∩ sch( ) = βR1 R2
2. sch( ) = cover(β, F)R2 β R2
‐
31
BCNF: NON-DETERMINISM, LOSS OF FDS
⚠ BCNF is not deterministic: arbitrary choice of the “split FD” in algorithm leads to different decompositions, in general:
For table instructions: spli%ing along FD { set, step } → { page, img } or { img } → { width, height } first makes no difference. (Try it.)
But consider R(A,B,C,D,E) with FDs { C, D } → E and { B } → E.
⚠ BCNF decomposition may fail to preserve dependencies: given FD , the column set may be distributed across multiple tables. The FD is “lost” (cannot be enforced by
the system).
Consider FDs { zip } → { city, state } and { street, city, state } → zip in table zipcodes.
What are the candidate keys for zipcodes?
What is a BCNF decomposition for zipcodes?
zipcodeszip street city state
‐ split(R, F)
1.
2.
‐ β → cβ ∪ {c}
‐
1.
2.
32
DENORMALIZATION VS. DECOMPOSITION
BCNF and decomposition come with significant benefits but are no panacea. There are validreasons to leave database tables in denormalized form:
Performance:Decomposition requires table reconstruction via equi-joins which incur query evaluationcosts. Denormalized table save this effort at the cost of storing information redundantly.
Preservation of FDs:In specific applications, preservation of mission-critical FDs may be a higher priority thanthe removal of redundancy.
Columnar database systems perform full decomposition (beyond the splits required byBCNF normalization): R(id, A, B, C, …) decomposed into R1(id, A), R2(id, B), R3(id, C),… (binary tables).
Queries other than SELECT r.* FROM R r can selectively access the Rᵢ, reading less bytesfrom persistent storage.
DBMS internals simplified: every row is guaranteed to have exactly two fields.
‐
1.
2.
‐
‐
‐
33