Dependency-Preserving Normalization of Relational and XML Data
Solmaz Kolahi
Department of Computer Science
University of Toronto
DBPL 2005 – p.1/28
Motivation
Schema design: coming up with a “good” way of grouping the attributes of interest to avoid insertion, update, and deletion anomalies.
Goals: eliminating redundancies; preserving data; preserving data dependencies and constraints.
Well-known approaches for relational database design: BCNF, 4NF, and 3NF.
BCNF: eliminates all redundancies, may lose dependencies.
3NF: does not eliminate all redundancies, preserves all dependencies.
Motivation (cont’d)
How much redundancy does 3NF tolerate in order to preserve dependencies?
Approach: applying an information-theoretic measure to 3NF.
Is it possible to achieve both redundancy elimination and dependency preservation by representing relational data in XML documents?
Approach: characterizing the cases in which an XML normal form, called XNF, guarantees both, and providing a PTIME algorithm.
How do we achieve dependency preservation in XML normalization techniques?
Approach: defining an equivalent of 3NF for XML.
Outline
Characterizing 3NF using an information-theoretic measure.
Converting relational data into redundancy-free XML documents.
XML dependency preservation and XML Third Normal Form.
Final Remarks.
Measure of Information Content
Proposed by Arenas & Libkin in PODS’03.
Used to measure the redundancy of a data value in a database instance with respect to a set of constraints.
Defined using information theory.
Intuitively, $\mathrm{RIC}_I(p \mid \Sigma)$ measures the information content of position $p$ in instance $I$ with respect to constraints $\Sigma$.
[Example: two instances, with rows (1, 2, 3), (1, 2, 4) and rows (1, 2, 3), (1, 2, 4), (1, 2, 5), each annotated with the information content of one position; the schema, constraints, and numeric values were lost in extraction.]
Measure of Information Content (cont’d)
A database specification $(R, \Sigma)$ is defined as well-designed if for every instance $I$ of $(R, \Sigma)$ and every position $p$ in $I$, $\mathrm{RIC}_I(p \mid \Sigma) = 1$.
That is, every position in every instance carries the maximum amount of information.
It is known that:
If $\Sigma$ contains only FDs, $(R, \Sigma)$ is well-designed iff it is in BCNF.
If $\Sigma$ contains FDs and MVDs, $(R, \Sigma)$ is well-designed iff it is in 4NF.
We would like to apply this measure to 3NF.
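The BCNF side of the result above can be made concrete with a small check. This is a minimal brute-force sketch, not part of the talk: the helper names `fd`, `closure`, and `is_bcnf` are hypothetical, and the test enumerates every attribute subset, so it is only meant for toy schemas.

```python
from itertools import combinations

def fd(lhs, rhs):
    """Build an FD as a pair of frozensets, e.g. fd("AB", "C") for AB -> C."""
    return (frozenset(lhs), frozenset(rhs))

def closure(attrs, fds):
    """Attribute closure of attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def is_bcnf(schema, fds):
    """True iff every attribute set that determines something new is a superkey."""
    schema = frozenset(schema)
    for size in range(len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            x = frozenset(combo)
            x_plus = closure(x, fds)
            # A nontrivial consequence of a non-superkey violates BCNF.
            if x_plus != x and x_plus != schema:
                return False
    return True

print(is_bcnf("ABC", [fd("A", "B")]))    # False: A -> B holds but A is not a key
print(is_bcnf("ABC", [fd("A", "BC")]))   # True: A is a key
```

By the characterization quoted above, the first schema admits instances with a position of information content below 1, while the second does not.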
Characterizing 3NF
Theorem. The specification $(R, \Sigma)$ is in 3NF iff for every instance $I$ of $(R, \Sigma)$ and every position $p$ in $I$ belonging to an attribute $A$, $\mathrm{RIC}_I(p \mid \Sigma) < 1$ implies that $A$ is a prime attribute.
Question: Can this number be arbitrarily small when we have 3NF? In other words, how much redundancy is allowed by 3NF?
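The role of prime attributes in 3NF can be illustrated with a short check. This is a sketch under the textbook definition of 3NF (every nontrivial FD has a superkey on the left or a prime attribute on the right); the names `candidate_keys` and `is_3nf` are hypothetical, and the enumeration is exponential, so it is only suitable for small examples.

```python
from itertools import combinations

def fd(lhs, rhs):
    """Build an FD as a pair of frozensets, e.g. fd("AB", "C") for AB -> C."""
    return (frozenset(lhs), frozenset(rhs))

def closure(attrs, fds):
    """Attribute closure of attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    schema = frozenset(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            x = frozenset(combo)
            # Smaller keys were found first, so any superset is not minimal.
            if closure(x, fds) == schema and not any(k <= x for k in keys):
                keys.append(x)
    return keys

def is_3nf(schema, fds):
    """True iff every nontrivial X -> A has X a superkey or A prime."""
    schema = frozenset(schema)
    prime = frozenset().union(*candidate_keys(schema, fds))
    for size in range(len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            x = frozenset(combo)
            x_plus = closure(x, fds)
            if x_plus == schema:
                continue  # X is a superkey
            if (x_plus - x) - prime:
                return False  # X determines a non-prime attribute
    return True

print(is_3nf("ABC", [fd("AB", "C"), fd("C", "B")]))  # True: every attribute is prime
print(is_3nf("ABC", [fd("A", "B"), fd("B", "C")]))   # False: B -> C with C non-prime
```

The first schema is in 3NF but not in BCNF, which is exactly the situation where, per the theorem, redundancy can only sit on prime attributes.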
Characterizing 3NF (cont’d)
Theorem. For every $\varepsilon \in (0, 1)$, there exists a relation schema $R$, a set of FDs $\Sigma$ over $R$, an instance $I$ of $(R, \Sigma)$, and a position $p$ in $I$ such that $(R, \Sigma)$ is in 3NF and $\mathrm{RIC}_I(p \mid \Sigma) < \varepsilon$.
[Example: a schema and FD set (garbled in extraction) with an instance whose rows are (1, 1, 1, …, 1), (1, 2, 1, …, 1), (1, 3, 1, …, 1), …, (1, k, 1, …, 1); the slide concludes that $\mathrm{RIC}_I(p \mid \Sigma) < \varepsilon$ can be reached for any $\varepsilon \in (0, 1)$.]
Converting into XML: Example
$R(A, B, C)$, $F = \{AB \to C,\ C \to B\}$: hierarchical translation into XML.
Relational instance:
A B C
1 1 1
1 2 2
2 1 1
3 2 2
[Figure: the corresponding XML tree rooted at r. Level 1 holds the B values (1, 2), level 2 the A values, and level 3 the C values.]
DTD:
r → A*
A → B*, attribute @b
B → C*, attribute @a
C → ε, attribute @c
Functional dependencies:
r.A.@b → r.A
{r.A, r.A.B.@a} → r.A.B
{r.A.B, r.A.B.C.@c} → r.A.B.C
{r.A.B.@a, r.A.@b} → r.A.B.C.@c
r.A.B.C.@c → r.A.@b
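A hierarchical translation of this kind can be sketched with Python's standard `xml.etree.ElementTree`. The function `hierarchical_xml` and its parameters are hypothetical, and the level order (B, then A, then C) and the element names A, B, C per level are read off the slide's DTD, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

def hierarchical_xml(rows, ordering, tags, root_tag="r"):
    """Nest tuples level by level: level k stores attribute ordering[k]
    on an element named tags[k]; equal prefixes share the same nodes."""
    root = ET.Element(root_tag)
    index = {}  # (id of parent element, value) -> child element
    for row in rows:
        parent = root
        for tag, attr in zip(tags, ordering):
            key = (id(parent), row[attr])
            child = index.get(key)
            if child is None:
                child = ET.SubElement(parent, tag, {attr.lower(): str(row[attr])})
                index[key] = child
            parent = child
    return root

# The instance from the slide, as dictionaries keyed by relational attribute.
rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 2},
    {"A": 2, "B": 1, "C": 1},
    {"A": 3, "B": 2, "C": 2},
]
root = hierarchical_xml(rows, ordering=["B", "A", "C"], tags=["A", "B", "C"])
print(ET.tostring(root, encoding="unicode"))
```

Each B value is stored once at the top level, so the dependency C → B no longer forces a B value to be repeated for every tuple that shares it.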
Converting into XML: Problem Statement
Given: relation specification $(R, F)$, where:
$R(A_1, \dots, A_m)$ is a relation.
$F$ is a set of FDs over $R$.
Question: is there an XML representation $(D, \Sigma)$, where:
$D$ is a DTD.
$\Sigma$ is a set of XML FDs over $D$.
such that:
$(D, \Sigma)$ is a dependency-preserving hierarchical translation of $(R, F)$.
$(D, \Sigma)$ does not allow redundancy.
By dependency preservation we mean:
A relational instance is valid w.r.t. $(R, F)$ iff its hierarchical XML representation is valid w.r.t. $(D, \Sigma)$.
XML Functional Dependencies
Proposed by Arenas & Libkin in PODS’02.
Based on a relational representation of XML trees: tree tuples.
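[Figure: the example XML tree rooted at r, with element nodes A, B, C and attributes @a, @b, @c; individual tree tuples are highlighted in turn.]
A tree tuple in a DTD $D$ is a mapping $t : \mathit{paths}(D) \to \mathit{nodes} \cup \mathit{Strings} \cup \{\bot\}$, defined in a consistent way.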
XML Functional Dependencies (cont’d)
An XML FD is an expression of the form $\{q_1, \dots, q_n\} \to q$, where $q, q_1, \dots, q_n \in \mathit{paths}(D)$.
XML tree $T$ satisfies $\{q_1, \dots, q_n\} \to q$ iff for every two tree tuples $t_1, t_2$ in $T$: $(\forall i \in [1, n]:\ t_1.q_i = t_2.q_i \neq \bot) \Rightarrow t_1.q = t_2.q$.
[Figure: the example XML tree rooted at r, with element nodes A, B, C and attributes @a, @b, @c.]
Example: r.A.B.C.@c → r.A.@b
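The satisfaction condition can be checked directly once tree tuples are available. In this minimal sketch each tree tuple is a plain dictionary from path strings to values, a missing path stands for the null value, and the node identifiers such as "A1" are made up for illustration; the function `satisfies` is hypothetical.

```python
from itertools import combinations

def satisfies(tree_tuples, lhs, rhs):
    """Check {lhs} -> rhs: tuples that agree on all lhs paths with
    non-null values must also agree on the rhs path."""
    for t1, t2 in combinations(tree_tuples, 2):
        agree = all(
            t1.get(q) is not None and t1.get(q) == t2.get(q) for q in lhs
        )
        if agree and t1.get(rhs) != t2.get(rhs):
            return False
    return True

def tree_tuple(a, b, c):
    """Tree tuple for one relational row, assuming the B, A, C nesting."""
    return {
        "r.A": f"A{b}",                  # node ids are illustrative only
        "r.A.@b": b,
        "r.A.B": f"B{b}.{a}",
        "r.A.B.@a": a,
        "r.A.B.C": f"C{b}.{a}.{c}",
        "r.A.B.C.@c": c,
    }

tuples = [tree_tuple(1, 1, 1), tree_tuple(1, 2, 2),
          tree_tuple(2, 1, 1), tree_tuple(3, 2, 2)]

print(satisfies(tuples, ["r.A.B.C.@c"], "r.A.@b"))            # True
print(satisfies(tuples, ["r.A.B.@a"], "r.A.B.C.@c"))          # False
print(satisfies(tuples, ["r.A.B.@a", "r.A.@b"], "r.A.B.C.@c"))  # True
```

The second check fails because two tuples share @a = 1 but carry different @c values, which mirrors the fact that A alone does not determine C in the relational instance.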
XNF: a Normal Form for XML
$(D, \Sigma)$ is in XNF if for each non-trivial FD $X \to q.@l \in (D, \Sigma)^+$, the FD $X \to q$ is also in $(D, \Sigma)^+$.
XNF generalizes BCNF for the case of XML.
XNF guarantees zero redundancy.
Question: Given $R(A_1, \dots, A_m)$ and a set of FDs $F$, can we translate $(R, F)$ into $(D, \Sigma)$ in XNF and preserve FDs?
Answer: Yes, in some cases, for which we have a precise characterization.
Converting into XML: Conditions
Given $R(A_1, \dots, A_m)$ and FDs $F$ over it:
Condition: We can put the attributes of $R$ in an order $A_{i_1}, \dots, A_{i_m}$ such that for every non-trivial FD $X \to A_{i_j} \in F^+$ and every $k < j$, the FD $X \to A_{i_k}$ is also in $F^+$.
Theorem. $(R, F)$ has an FD-preserving XNF representation iff the above condition holds.
We provide a PTIME algorithm that checks the condition and produces the XML representation.
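The ordering condition can be tested by brute force on small schemas. This sketch is not the PTIME algorithm mentioned on the slide: it simply tries every permutation and every attribute subset, and the names `ordering_ok` and `find_ordering` are hypothetical.

```python
from itertools import combinations, permutations

def fd(lhs, rhs):
    """Build an FD as a pair of frozensets, e.g. fd("AB", "C") for AB -> C."""
    return (frozenset(lhs), frozenset(rhs))

def closure(attrs, fds):
    """Attribute closure of attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def ordering_ok(order, fds):
    """Whenever X nontrivially determines an attribute, X must also
    determine every attribute placed before it in the ordering."""
    attrs = list(order)
    for size in range(len(attrs) + 1):
        for combo in combinations(attrs, size):
            x = frozenset(combo)
            x_plus = closure(x, fds)
            for j, a in enumerate(attrs):
                if a in x_plus and a not in x:
                    if not all(b in x_plus for b in attrs[:j]):
                        return False
    return True

def find_ordering(schema, fds):
    """First ordering (in lexicographic order) meeting the condition, or None."""
    for order in permutations(sorted(schema)):
        if ordering_ok(order, fds):
            return order
    return None

print(find_ordering("ABC", [fd("AB", "C"), fd("C", "B")]))   # ('B', 'A', 'C')
print(find_ordering("ABCD", [fd("A", "B"), fd("C", "D")]))   # None
```

For the running example the search returns the B, A, C nesting used earlier; for the second schema no ordering works, because B and D would each have to come first.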
XNF Normalization
[Figure: XML tree for a company document. The company element has two branch children (@bid "mr201", @type "marketing" and @bid "ad005", @type "admin"). Each branch has a clients child holding client elements, four in total, with attributes @name ("cl1" to "cl4"), @postal_code ("M4Y 2R5", "M4Y 2R5", "K1A 0H9", "K2B 1S5"), and @city ("Toronto", "Toronto", "Ottawa", "Ottawa").]
company.branch.clients.client.@postal_code → company.branch.clients.client.@city
XNF Normalization
[Figure: the normalized tree. The company element keeps its two branch children, whose client elements now carry only @name and @postal_code, and gains two city_info children (@city "Toronto" and "Ottawa"), each with code children whose @val attribute holds a postal code ("M4Y 2R5", "K1A 0H9", "K2B 1S5").]
XNF Normalization
[Figure: the original company tree, before normalization, with branch, clients, and client elements.]
company.branch.clients.client.@postal_code → company.branch.clients.client.@city
{company.branch.clients.client.@city, company.branch.@type} → company.branch
XNF Normalization
[Figure: the normalized company tree, with city_info and code elements.]
The FD {company.city_info.@city, company.branch.@type} → company.branch does not hold for the new document.
XML Dependency Preservation
The concept of dependency preservation is more involved for XML: implication of FDs must be considered in the presence of the DTD.
There are XML specifications $(D, \Sigma)$ for which there is no dependency-preserving XML representation in XNF.
Complete redundancy elimination cannot be achieved for some XML documents without losing constraints.
We need an equivalent of 3NF for XML.
Prime Attribute Path
Given a DTD $D$ and a set of XML FDs $\Sigma$ over $D$:
Definition. Attribute path $q.@l \in \mathit{paths}(D)$ is called prime if there is a nontrivial FD $X' \to q' \in (D, \Sigma)^+$ such that:
$q.@l \in X'$;
$q'$ is an element path;
$X'$ is minimal: $(X' \setminus \{q.@l\}) \to q' \notin (D, \Sigma)^+$.
company.branch.clients.client.@postal_code → company.branch.clients.client.@city
{company.branch.clients.client.@city, company.branch.@type} → company.branch
Third Normal Form for XML
Given a DTD $D$ and a set of XML FDs $\Sigma$ over $D$:
Definition. $(D, \Sigma)$ is in XML third normal form (X3NF) iff for every nontrivial FD $X \to q.@l \in (D, \Sigma)^+$:
1. the FD $X \to q$ is also in $(D, \Sigma)^+$; or
2. attribute path $q.@l$ is prime.
The company example is in X3NF.
X3NF generalizes 3NF for XML.
Final Remarks
3NF is good, because it preserves dependencies and eliminates redundancies for non-prime attributes.
But it admits arbitrary redundancy on prime attributes.
We can sometimes achieve redundancy elimination and dependency preservation by converting into XML.
Normalizing XML to achieve redundancy elimination can result in losing FDs.
Future work: formally defining FD-preservation for XML; verifying that decomposition based on the X3NF definition will preserve FDs.
Backup Slides
Converting into XML: Algorithm
Question: Can we convert $(R, F)$ into hierarchical XML form $(D, \Sigma)$ in XNF?
[Worked example: the slide considers two schemas with their own FD sets, shows closure computations and an attribute ordering for each, and ends with a graph figure; the schemas, FDs, and orderings were lost in extraction.]
Converting into XML (cont’d)
So far we talked about hierarchical translation of relations into XML.
What if we allow XML elements to represent more than one relational attribute (a semi-hierarchical translation)?
[Figure: XML tree rooted at r with four AB elements, each carrying attributes @a and @b and having a C child with attribute @c.]
Given a relation $R$ and FDs $F$ over it:
Theorem. $(R, F)$ has a redundancy-free semi-hierarchical translation iff it has a redundancy-free hierarchical translation.