XML and Web Data

XML and Web Data - cs.sfu.ca · XML and Web Data. CMPT 354: Database I -- XML 2 Data in HTML • HyperText Markup Language – Different data elements are set out using tags • No


XML and Web Data


CMPT 354: Database I -- XML 2

Data in HTML

• HyperText Markup Language
– Different data elements are set out using tags
• No schema?

– Based on the data itself, we can make a reasonable guess about the structure

– “Self-describing”


Object and Schema


Semi-structured Data

• Object-like: it can be represented as a collection of objects

• Schemaless: it is not guaranteed to conform to any type structure

• Self-describing
– Often carries only the names of the attributes and has a lower degree of organization than the data in the database

• Semi-structured data: data with the above characteristics


Schemaless But Self-Describing

(#12345,
  [ListName: "Students",
   Contents: {
     [Name: "John Doe", Id: "111111111",
      Address: [Number: 123, Street: "Main St"]],
     [Name: "Joe Public", Id: "666666666",
      Address: [Number: 666, Street: "Hollow Rd"]] }
  ])
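As a rough illustration, the object above can be written as a nested Python structure. The variable name `student_list`, the helper `attribute_names`, the choice of dicts and lists, and the uniform spelling `Id` are assumptions of this sketch. The point is only that the attribute names travel with the data, so a program can discover the structure without a separate schema.

```python
# Hypothetical Python rendering of the self-describing object (oid #12345).
student_list = {
    "ListName": "Students",
    "Contents": [
        {"Name": "John Doe", "Id": "111111111",
         "Address": {"Number": 123, "Street": "Main St"}},
        {"Name": "Joe Public", "Id": "666666666",
         "Address": {"Number": 666, "Street": "Hollow Rd"}},
    ],
}


def attribute_names(obj):
    """Collect every attribute name found anywhere in the object."""
    names = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            names.add(key)
            names |= attribute_names(value)
    elif isinstance(obj, list):
        for item in obj:
            names |= attribute_names(item)
    return names


# The structure is recovered from the data itself, not from a schema.
print(sorted(attribute_names(student_list)))
```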


XML

• Extensible Markup Language
– A standard adopted in 1998 by the W3C (World Wide Web Consortium)
• Optional mechanisms for specifying document structure
– DTD: the Document Type Definition language, part of the XML standard
– XML Schema: a more recent specification built on top of XML
• Query languages for XML
– XPath: lightweight
– XSLT: document transformation language
– XQuery: a full-blown language


From HTML to XML


HTML and XML

• HTML
– A fixed number of tags
– Each tag has its own well-defined meaning, e.g., <table> … </table>
• XML: HTML-like language
– An arbitrary number of user-defined tags
– No a priori semantics
– Mainly for data exchange
– Display using stylesheet


Important Differences

• XML contains a large assortment of tags chosen by the document author
– The only valid tags in HTML are those sanctioned by the official specification of the language; other tags are ignored
• Every opening tag must have a matching closing tag, and the tags must be properly nested
– E.g., <a><b></a></b> is not allowed
– Some HTML tags are not required to be closed, e.g., <p>
• The document has a root element: the element that contains all other elements
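The contrast can be tried with Python's standard library, as a minimal sketch. The class `TagCollector` and the function `xml_accepts` are hypothetical helpers written for this illustration. `html.parser` stands in for a lenient HTML reader and `xml.etree.ElementTree` for a strict XML parser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record the start tags reported by Python's lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def xml_accepts(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


unclosed = "<p>first<p>second"  # <p> is opened twice and never closed

collector = TagCollector()
collector.feed(unclosed)
collector.close()

print(collector.start_tags)           # the HTML parser reads it without complaint
print(xml_accepts(unclosed))          # the XML parser rejects unclosed tags
print(xml_accepts("<a><b></a></b>"))  # and improperly nested tags
print(xml_accepts("<a><b></b></a>"))  # properly nested, single root element
```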


Example

[Figure: an example XML document annotated with the labels "mandatory statement", "root element", "XML elements", "element names", and "element contents"]


Hierarchical Structure

PersonList (Student)
  Title
  Contents
    Person
      Name: John Doe
      Id: 111111111
      Address
        Number: 123
        Street: Main St
    Person
      Name: Joe Public
      Id: 666666666
      Address
        Number: 666
        Street: Hollow Rd
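The tree above can be recovered mechanically from a document. The XML text in `DOCUMENT` is an assumed reconstruction: only the element names and values shown in the tree come from the slides, and the attribute values follow the Attributes slide. The function `outline` is a hypothetical helper that walks the parsed tree depth-first.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the example document (not copied from the slides).
DOCUMENT = """<?xml version="1.0" ?>
<PersonList Type="Student">
  <Title Value="Student List"/>
  <Contents>
    <Person>
      <Name>John Doe</Name>
      <Id>111111111</Id>
      <Address><Number>123</Number><Street>Main St</Street></Address>
    </Person>
    <Person>
      <Name>Joe Public</Name>
      <Id>666666666</Id>
      <Address><Number>666</Number><Street>Hollow Rd</Street></Address>
    </Person>
  </Contents>
</PersonList>"""


def outline(element, depth=0):
    """Return one indented line per element, in depth-first order."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text}" if text else element.tag
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print("\n".join(outline(root)))
```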


Attributes

• <PersonList Type="Student">
– Type is the name of an attribute that belongs to the element PersonList
– Student is the attribute value
– All attribute values must be quoted
– Text strings between tags do not need to be quoted
• Empty element
– <Title Value="Student List"/>
– The element has one attribute and no content
– A shorthand for <Title Value="Student List"></Title>
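A small sketch of how a parser sees these two points, using Python's `xml.etree.ElementTree`. The helper `summary` is hypothetical. It reduces an element to its tag, attributes, text, and number of children so the two spellings of the empty element can be compared.

```python
import xml.etree.ElementTree as ET

person_list = ET.fromstring('<PersonList Type="Student"></PersonList>')
print(person_list.get("Type"))  # the attribute value, without its quotes

short_form = ET.fromstring('<Title Value="Student List"/>')
long_form = ET.fromstring('<Title Value="Student List"></Title>')


def summary(element):
    """Tag, attributes, text content, and child count of an element."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))


# Both spellings of the empty element parse to the same thing.
print(summary(short_form) == summary(long_form))
```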


Processing Instructions & Comments

• Processing instructions
– <?xml version="1.0" ?>
– Contain anything the author might want to communicate to the XML processor, e.g., <?my-command go bring coffee?>
– Rarely used
• Comment
– <!-- A comment -->
– Can occur everywhere except inside the markups, i.e., between symbols < and >
– An integral part of the document
– May be used by a receiver (e.g., a browser)
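That comments and processing instructions are part of the document, and available to a receiver, can be seen with a DOM parser. This is a sketch using Python's `xml.dom.minidom`. The element names `Doc` and `Item` are made up for the illustration.

```python
from xml.dom import minidom, Node

# Doc and Item are hypothetical element names used only for this sketch.
SOURCE = ('<?xml version="1.0" ?>'
          '<Doc><!-- A comment --><?my-command go bring coffee?><Item/></Doc>')

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# The comment and the processing instruction are nodes of the tree,
# alongside the ordinary element.
for node in root.childNodes:
    if node.nodeType == Node.COMMENT_NODE:
        print("comment:", repr(node.data))
    elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        print("instruction:", node.target, "|", node.data)
    elif node.nodeType == Node.ELEMENT_NODE:
        print("element:", node.tagName)
```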


CDATA Construct

• Includes strings of characters that contain markup which would otherwise make the document ill-formed

• <![CDATA[ This is an example of markup in HTML: <b><i> Example </b></i>]]>
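A sketch of the effect, using Python's `xml.etree.ElementTree`. The wrapper element `Note` is hypothetical, and the closing tags inside the string are written with forward slashes. Inside CDATA the improperly nested tags are plain characters. Without CDATA the same text is rejected.

```python
import xml.etree.ElementTree as ET

# Note is a hypothetical wrapper element for this sketch.
WITH_CDATA = ("<Note><![CDATA[ This is an example of markup in HTML: "
              "<b><i> Example </b></i>]]></Note>")
WITHOUT_CDATA = ("<Note> This is an example of markup in HTML: "
                 "<b><i> Example </b></i></Note>")

note = ET.fromstring(WITH_CDATA)
print(note.text)  # the tags arrive as ordinary characters
print(len(note))  # number of child elements created from them

try:
    ET.fromstring(WITHOUT_CDATA)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)     # whether the same text parses without CDATA
```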


XML Elements and Data Objects

• XML allows mixed data/text structure
• XML elements are ordered
• XML has only one primitive type, string, and very weak facilities for specifying constraints

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>

is different from

<Address>
  <Street> Main St </Street>
  <Number> 123 </Number>
</Address>

A legal XML document:

<Address>
  Sally lives on
  <Street> Main St </Street>
  house number
  <Number> 123 </Number>
  in the beautiful Anytown, Canada.
</Address>
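The three points above can be observed with Python's `xml.etree.ElementTree`, as a sketch. The variable names are arbitrary. ElementTree's convention of storing text before the first child in `.text` and text after a child in that child's `.tail` is specific to this library.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    "<Address><Number> 123 </Number><Street> Main St </Street></Address>")
second = ET.fromstring(
    "<Address><Street> Main St </Street><Number> 123 </Number></Address>")

# Same children, different order: the child sequences differ.
print([child.tag for child in first])
print([child.tag for child in second])

mixed = ET.fromstring(
    "<Address>Sally lives on<Street> Main St </Street>house number"
    "<Number> 123 </Number>in the beautiful Anytown, Canada.</Address>")

# Text before the first child is in .text; text following a child is in
# that child's .tail. Together they give the mixed content in order.
pieces = [mixed.text] + [part for child in mixed
                         for part in (child.text, child.tail)]
print(pieces)

# Element content is always delivered as a string, even when it looks numeric.
print(type(mixed.find("Number").text).__name__)
```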


Use of Attributes

• An element can have any number of user-defined attributes
• Whatever attributes can do can also be achieved with elements
– An attribute may occur only once within a tag, while subelements with the same tag may be repeated
• Attributes introduce ambiguity as to whether to represent information as attributes or elements
– Sometimes convenient for representing data, but the same can also be done with elements
– The use of attributes is expected to decline

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>

<Address Number="123" Street="Main St"/>


Attributes in Markup

<Act Number="5">
  <Scene Number="1" Place="Mantua. A street">
    …
    <Apothecary Voice="scared">
      Such mortal drugs I have; but Mantua’s law
      Is death to any he that utters them.
    </Apothecary>
    <Romeo Voice="persistent">
      Art thou so bare and full of wretchedness,
      And fear’st to die?
      …
    </Romeo>
    …
  </Scene>
</Act>


Advantages of Attributes

• Attributes in an element are not ordered
– <Address Number="123" Street="Main St"/>
– <Address Street="Main St" Number="123"/>
• Attributes are more succinct
• Attributes can be declared to have a unique value and can be used to enforce a limited kind of referential integrity

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
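The first two advantages can be checked directly, as a minimal sketch with Python's `xml.etree.ElementTree`. The variable names are arbitrary, and the element form is written here without the padding spaces so that the length comparison is fair.

```python
import xml.etree.ElementTree as ET

attr_form = '<Address Number="123" Street="Main St"/>'
swapped = '<Address Street="Main St" Number="123"/>'
elem_form = "<Address><Number>123</Number><Street>Main St</Street></Address>"

# Attributes are exposed as a mapping, so their written order is irrelevant.
same = ET.fromstring(attr_form).attrib == ET.fromstring(swapped).attrib
print(same)

# The attribute form carries the same two values in fewer characters.
print(len(attr_form), len(elem_form))
```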


ID and IDREF – Cross-References


Well Formed XML Document

• It has a root element
• Every opening tag is followed by a matching closing tag, and the elements are properly nested inside each other
• Any attribute can occur at most once in a given opening tag, its value must be provided, and this value must be quoted
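Each rule can be exercised against a real parser. The sketch below uses Python's `xml.etree.ElementTree`, which reports any well-formedness violation as a `ParseError`. The function `is_well_formed` and the test strings in `CASES` are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical test strings, one per rule.
CASES = {
    "single root, nested, quoted": '<a x="1"><b></b></a>',
    "two top-level elements": "<a></a><b></b>",
    "duplicate attribute": '<a x="1" x="2"></a>',
    "attribute without value": "<a x></a>",
    "unquoted attribute value": "<a x=1></a>",
}

for label, text in CASES.items():
    print(f"{label}: {is_well_formed(text)}")
```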


Namespaces

• A term (tag) might have different meanings in different contexts
– <Name><First>John</First> <Last>Doe</Last></Name>
– <Name>Simon Fraser University</Name>
• With namespaces, every XML tag has two parts: namespace and local name
– General structure: namespace:local-name
– Namespace represented by a URI (uniform resource identifier), which may be an abstract identifier (a general unique string) or a URL (uniform resource locator)


Example – Namespace

• Namespaces are defined using the attribute xmlns
– All names xml* should be considered reserved
• Default namespace xmlns="…"
– Only one default namespace
• Other namespace xmlns:toy="…"
– Prefixes (e.g., toy) must be distinct

<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>
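A sketch of how a namespace-aware parser reads this example, using Python's `xml.etree.ElementTree`. The search prefixes `s` and `t` in the mapping `ns` are arbitrary choices for the query. They need not match the prefixes in the document, because only the URIs identify the namespaces.

```python
import xml.etree.ElementTree as ET

ITEM = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>"""

root = ET.fromstring(ITEM)

# ElementTree writes a qualified name as {namespace-URI}local-name.
print(root.tag)

# Arbitrary search prefixes mapped to the two namespace URIs.
ns = {"s": "http://www.acmeinc.com/jp#supplies",
      "t": "http://www.acmeinc.com/jp#toys"}

print(root.find("s:name", ns).text)     # name in the default namespace
print(root.find(".//t:name", ns).text)  # name in the toy namespace
```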


Namespace Declarations

• Namespace as prefix
– E.g., toy:item, toy:name
– Tags without prefix belong to the default namespace
• Namespace declarations have scope
– Can be nested like a program block


Example – Scopes of Namespaces

<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>
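The scoping can be made visible by listing the namespace each element resolves to. This is a sketch with Python's `xml.etree.ElementTree`. The helper `split_tag` is hypothetical and relies on ElementTree's `{uri}local` notation for qualified names.

```python
import xml.etree.ElementTree as ET

NESTED = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    uri, _, local = tag[1:].partition("}")
    return uri, local


# One entry per element, in document order: ((uri, local-name), text).
resolved = [(split_tag(element.tag), (element.text or "").strip())
            for element in ET.fromstring(NESTED).iter()]

# Inside the inner item, both the default namespace and the toy prefix
# refer to the redeclared URIs.
for (uri, local), text in resolved:
    print(uri.rsplit("#", 1)[1], local, text)
```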


More About Namespace

• The name of a namespace is just a string that happens to be a URL

• It is not necessarily a real address that hosts some kind of schema describing the corresponding set of names

• Don’t be misled by the URL!


Summary

• HTML and XML: differences and applications

• Structure of XML
– Elements
– Attributes
– Well formed XML documents

• Namespace


To-Do-List

• Can every relational table be represented in XML? Can every XML document be represented in a relational table?

• RSS is an application of XML. Try to understand the two RSS segments at http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html