37
1 Administrivia HW3 deadline extended: Tuesday 10/7 Revised upcoming schedule: Thurs. 10/2: Querying XML Tues. 10/7: More on XML; beginning of views; HW4 out Thurs. 10/9: Views + datalog, ctd.; review for midterm Tues. 10/14: Fall Break Thurs. 10/16: HW4 due; Midterm

1 Administrivia HW3 deadline extended: Tuesday 10/7 Revised upcoming schedule: Thurs. 10/2: Querying XML Tues. 10/7: More on XML; beginning of

  • View
    216

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

1

Administrivia

HW3 deadline extended: Tuesday 10/7

Revised upcoming schedule: Thurs. 10/2: Querying XML Tues. 10/7: More on XML; beginning of views;

HW4 out Thurs. 10/9: Views + datalog, ctd.; review for

midterm Tues. 10/14: Fall Break Thurs. 10/16: HW4 due; Midterm

Page 2: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

2

Lossless Join Decomposition

R1, … Rk is a lossless join decomposition of R w.r.t. an FD set F if for every instance r of R that satisfies F,

R1(r) ⋈ ... ⋈ Rk(r) = r

R1, R2 is a lossless join decomposition of R with respect to F iff at least one of the following dependencies is in F+:

(R1 R2) R1 – R2

or (R1 R2) R2 – R1

Page 3: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

3

Dependency Preservation

Ensures we can check whether a FD X Y is violated during DB updates, without using a join:

FZ, the projection of FD set F onto attribute set Z, is:

{X Y | X Y F +, X Y Z}i.e., it is those FDs only applicable to Z’s attributes

A decomposition R1, …, Rk is dependency preserving if F + = (FR1 ... FRk)+ (note we need an extra closure!)

We don’t lose the ability to test the “cover” of our FDs in a single table, just because we decompose

Page 4: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

4

Example 1

For Schema R(sid, name, serno, cid, subj, exp-grade) and FD set:sid name serno cidcid subj sid, serno exp-grade

Is R1(sid, name) and R2(serno, subj, cid, exp-grade): A lossless decomposition? Is it dependency-preserving?

How about R1(sid, name) and R2(sid, serno, subj, cid, exp-grade)?

Page 5: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

5

Example 2

Given schema R(name, street, city, st, zip, item, price),

FD set name street, city street, city ststreet, city zip name, item

priceand decomposition

R1(name, street, city, st, zip) and R2(name, item, price)

Is it lossless? Is it dependency preserving?

What if we replaced the first FD with name, street city?

Page 6: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

6

A More Disturbing Example…

Given schema R(sid, fid, subj)and FD set: fid subj sid, subj fid

Consider the decomposition R1(sid, fid) and R2(fid, subj)

Is it lossless? Is it dependency preserving?

If it isn’t, can you think of a decomposition that is? Can you do this non-redundantly?

Page 7: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

7

Redundancy vs. FDs

Ideally, we want a design s.t. for each nontrivial dependency X Y, X is a superkey for some relation schema in R

We just saw that this isn’t always possible in a non-redundant way…

Thus we have two kinds of normal forms

Page 8: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

8

Two Important Normal Forms

Boyce-Codd Normal Form (BCNF). For every relation scheme R and for every X A that holds over R,

either A X (it is trivial) ,oror X is a superkey for R

Third Normal Form (3NF). For every relation scheme R and for every X A that holds over R,

either A X (it is trivial), or X is a superkey for R, or A is a member of some key for R

Page 9: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

9

Normal Forms Compared

BCNF is preferable, but sometimes in conflict with the goal of dependency preservation It’s strictly stronger than 3NF

Let’s see algorithms to obtain: A BCNF lossless join decomposition

(nondeterministic) A 3NF lossless join, dependency preserving

decomposition

Page 10: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

10

BCNF Decomposition Algorithm(from Korth et al.; our book gives recursive version)

result := {R}compute F+while there is a schema Ri in result that is not in BCNF{

let A B be a nontrivial FD on Ri

s.t. A Ri is not in F+ and A and B are disjoint

result:= (result – Ri) {(Ri - B), (A,B)}}

Page 11: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

11

3NF Decomposition Algorithm

Let F be a minimal coveri:=0for each FD A B in F { if none of the schemas Rj, 1 j i, contains AB { increment i Ri := (A, B) }}if no schema Rj, 1 j i contains a candidate key for R { increment i Ri := any candidate key for R}return (R1, …, Ri)

Build dep.-preservingdecomp.

Ensurelosslessdecomp.

Page 12: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

12

Summary of Normalization

We can always decompose into 3NF and get: Lossless join Dependency preservation

But with BCNF we are only guaranteed lossless joins

BCNF is stronger than 3NF: every BCNF schema is also in 3NF

The BCNF algorithm is nondeterministic, so there is not a unique decomposition for a given schema R

Page 13: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

XML: Semistructured Data

Zachary G. IvesUniversity of Pennsylvania

CIS 550 – Database & Information Systems

October 1, 2003

Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Page 14: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

14

Why XML?

XML is the confluence of several factors: The Web needed a more declarative format for data Documents needed a mechanism for extended tags Database people needed a more flexible interchange

format “Lingua franca” of data It’s parsable even if we don’t know what it means!

Original expectation: The whole web would go to XML instead of HTML

Today’s reality: Not so… But XML is used all over “under the covers”

Page 15: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

15

Why DB People Like XML

Can get data from all sorts of things Allows us to touch data we don’t own

Interesting relationships with DB techniques Can do relational-style operations Leverages ideas from object-oriented and

semistructured data

Blends schema and data into one format Unlike relational model, where we need schema

first … But too little schema can be a drawback, too!

Page 16: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

16

XML Anatomy<?xml version="1.0" encoding="ISO-8859-1" ?> <dblp> <mastersthesis mdate="2002-01-03" key="ms/Brown92">  <author>Kurt P. Brown</author>   <title>PRPL: A Database Workload Specification Language</title>   <year>1992</year>   <school>Univ. of Wisconsin-Madison</school>   </mastersthesis> <article mdate="2002-01-03" key="tr/dec/SRC1997-018">  <editor>Paul R. McJones</editor>   <title>The 1995 SQL Reunion</title>   <journal>Digital System Research Center Report</journal>   <volume>SRC1997-018</volume>   <year>1997</year>   <ee>db/labs/dec/SRC1997-018.html</ee>   <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>   </article>

Processing Instr.

Element

Attribute

Close-tag

Open-tag

Page 17: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

17

Well-Formed XML

A legal XML document – fully parsable by an XML parser

All open-tags have matching close-tags (unlike so many HTML documents!), or a special: <tag/> shortcut

Attributes (which are unordered, in contrast to elements) only appear once in an element

There’s a single root element XML is case-sensitive

Page 18: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

18

XML as a Data Model

XML “information set” includes 7 types of nodes: Document (root) Element Attribute Processing instruction Text (content) Namespace: Comment

XML data model includes this, plus typing info, plus order info and a few other things

Page 19: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

19

XML Data Model Visualized(and simplified!)

Root

?xml dblp

mastersthesis article

mdate key

author title year school editor title yearjournal volume eeee

mdatekey

2002…

ms/Brown92

Kurt P….

PRPL…

1992

Univ….

2002…

tr/dec/…

Paul R.

The…

Digital…

SRC…

1997

db/labs/dec

http://www.

attributeroot

p-i element

text

Page 20: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

20

What Does XML Do?

Serves as a document format that’s better than HTML Allows custom tags (e.g., used by MS Word,

openoffice) Supplement it with stylesheets (XSL) to define

formatting Provides an exchange format for data (still

need to agree on terminology) Basis for RDF format for describing libraries

and the Semantic Web (more on this later) Format for marshalling and unmarshalling

data in SOAP and Web Services

Page 21: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

21

XML as a Super-HTML(MS Word)

<h1 class="Section1"><a name="_top“ />CIS 550: Database and Information Systems</h1><h2 class="Section1">Fall 2003</h2><p class="MsoNormal">

<place>311 Towne</place>, Tuesday/Thursday<time Hour="13" Minute="30">1:30PM –

3:00PM</time></p>

Page 22: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

22

XML Encodes Relations

sid

serno

exp-grade

1 570103

B

23 550103

A<student-course-grade><tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple><tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple>

</student-course-grade>

Student-course-grade

Page 23: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

23

But XML is More Flexible…“Non-First-Normal-Form” (NF2)

<parents> <parent name=“Jean” >

<son>John</son><daughter>Joan</daughter><daughter>Jill</daughter>

</parent> <parent name=“Feng”>

<daughter>Felicity</daughter> </parent>…

Page 24: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

24

XML and Code

Web Services (.NET, recent Java web service toolkits) are using XML to pass parameters and make function calls Why?

Easy to be forwards-compatible Easy to read over and validate (?) Generally firewall-compatible

Drawbacks? XML is a verbose and inefficient encoding!

XML is used to represent: SOAP: the “envelope” that data is marshalled into XML Schema: gives some typing info about structures being

passed WSDL: the IDL (interface def language) UDDI: provides an interface for querying about web services

Page 25: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

25

What If We Have Multiple Sources with the Same Tags?

Namespaces allow us to specify a context for different tags

Two parts: Binding of namespace to URI Qualified names

<tag xmlns:myns=“http://www.fictitious.com/mypath”><thistag>is in namespace myns</thistag><myns:thistag>is the same</myns:thistag><otherns:thistag>is a different tag</otherns:thistag>

</tag>

Page 26: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

26

XML Isn’t Normally Enough

It’s too unconstrained for many cases! How will we know when we’re getting garbage? How will we query? How will we understand what we got?

We also need: Some idea of the structure

Our focus next Presentation, in some cases – XSL(T)

We’ll talk about this soon Some way of interpreting the tags…?

We’ll talk about this later in the semester

Page 27: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

27

Document Type Definitions (DTDs)

The DTD is an EBNF grammar defining XML structure XML document specifies an associated DTD, plus

the root element DTD specifies children of the root (and so on)

DTD defines special significance for attributes: IDs – special attributes that are analogous to keys

for elements IDREFs – references to IDs IDREFS – a nasty hack that represents a list of

IDREFs

Page 28: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

28

An Example DTD

Example DTD:<!ELEMENT dblp((mastersthesis | article)*)><!ELEMENT mastersthesis(author,title,year,school,committeemember*)><!ATTLIST mastersthesis(mdate CDATA #REQUIRED

key ID #REQUIREDadvisor CDATA #IMPLIED>

<!ELEMENT author(#PCDATA)>

…Example use of DTD in XML file:

<?xml version="1.0" encoding="ISO-8859-1" ?> <!DOCTYPE dblp SYSTEM “my.dtd"> <dblp>…

Page 29: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

29

Representing Graphs in XML

<?xml version="1.0" encoding="ISO-8859-1" ?> <!DOCTYPE graph SYSTEM “special.dtd"> <graph>

<author id=“author1”><name>John Smith</name>

</author><article>

<author ref=“author1” /> <title>Paper1</title></article><article>

<author ref=“author1” /> <title>Paper2</title></article>

Page 30: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

30

Graph Data ModelRoot

!DOCTYPE

graph

authorarticle

nametitle

refref

John Smith

author1author1

Paper2

?xml

article

id

author1

author authortitle

Paper1

Page 31: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

31

Graph Data ModelRoot

!DOCTYPE

graph

authorarticle

nametitle

refref

John Smith

Paper2

?xml

article

id

author1

author authortitle

Paper1

Page 32: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

32

DTDs Aren’t Enough for Some People

DTDs capture grammatical structure, but have some drawbacks: Not themselves in XML – inconvenient to build

tools for them Don’t capture database datatypes’ domains IDs aren’t a good implementation of keys

Why not?

No way of defining OO-like inheritance No way of expressing FDs

Page 33: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

33

XML Schema

Aims to address the shortcomings of DTDs … But an “everything including the kitchen

sink” spec

Features: XML syntax Better way of defining keys using XPaths (which

we’ll see next time) Type subclassing that’s more complex than in a

programming language … And, of course, domains and built-in

datatypes

Page 34: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

34

Schema Example

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name=“mastersthesis" type=“ThesisType"/> <xsd:complexType name=“ThesisType">

<xsd:attribute name=“mdate" type="xsd:date"/><xsd:attribute name=“key" type="xsd:string"/><xsd:attribute name=“advisor" type="xsd:string"/><xsd:sequence>

<xsd:element name=“author" type=“xsd:string"/> <xsd:element name=“title" type=“xsd:string"/> <xsd:element name=“year" type=“xsd:integer"/> <xsd:element name=“school" type=“xsd:string”/> <xsd:element name=“committeemember"

type=“CommitteeType” minOccurs=“0"/> </xsd:sequence>

</xsd:complexType>

Page 35: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

35

Whither XML Schema?

Adopted by a few tools, but… The jury is still out on this spec

Considered too big, bulky, inconsistent Still no way of expressing FDs, and the key

mechanism isn’t that elegant Other people have defined more elegant schema

languages, FD languages, and key languages (much of the work was here @ Penn)

But it’s also at the core of the data model of the XQuery language and the other W3C specs…

Page 36: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

36

Designing an XML Schema/DTD

XML data design is generally not as formalized as relational data design We can still use ER diagrams to break into entity,

relationship sets ER diagrams have extensions for “aggregation” – treating

smaller diagrams as entities – and for composite attributes Note that often we already have our data in relations and

need to design the XML schema to export them! We generally orient the XML tree around the

“central” objects in a particular application A big decision: element vs. attribute

Element if it has its own properties, or if you *might* have more than one of them

Attribute if it is a single property – or perhaps not!

Page 37: 1 Administrivia  HW3 deadline extended: Tuesday 10/7  Revised upcoming schedule:  Thurs. 10/2: Querying XML  Tues. 10/7: More on XML; beginning of

37

Wrap-up

XML is a non-first-normal-form representation Can represent documents, data Standard data exchange format Several competing schema formats – esp.,

DTD and XML Schema – provide typing information

Next time: 3 languages for XML querying