XML and Web Data
CMPT 354: Database I -- XML 2
Data in HTML
• HyperText Markup Language
  – Different data elements are set out using tags
• No schema?
  – Based on the data itself, we can make a reasonable guess about the structure
  – “Self-describing”
Object and Schema
Semi-structured Data
• Object-like: it can be represented as a collection of objects
• Schemaless: it is not guaranteed to conform to any type structure
• Self-describing
  – Often carries only the names of the attributes and has a lower degree of organization than the data in the database
• Semi-structured data: data with the above characteristics
Schemaless But Self-Describing
(#12345,
  [ListName: "Students",
   Contents: {
     [Name: "John Doe", ID: "111111111", Address: [Number: 123, Street: "Main St"]],
     [Name: "Joe Public", Id: "666666666", Address: [Number: 666, Street: "Hollow Rd"]]
   }
  ]
)
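One way to read the object above is as nested records inside a set. The following is only a rough Python sketch of that reading: the variable name student_list and the helper field_names are hypothetical, and dicts and lists stand in for records and sets. It shows that the structure can be recovered from the data alone, and that nothing forces two records to use the same field names (the source writes ID in one record and Id in the other).

```python
# Hypothetical Python rendering of object #12345:
# records become dicts, the set of contents becomes a list.
student_list = {
    "ListName": "Students",
    "Contents": [
        {"Name": "John Doe", "ID": "111111111",
         "Address": {"Number": 123, "Street": "Main St"}},
        {"Name": "Joe Public", "Id": "666666666",
         "Address": {"Number": 666, "Street": "Hollow Rd"}},
    ],
}


def field_names(record):
    """Return the attribute names a record carries (its self-description)."""
    return sorted(record.keys())


# No schema is consulted: each record announces its own fields.
for person in student_list["Contents"]:
    print(field_names(person))
```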
XML
• Extensible Markup Language
  – A standard adopted in 1998 by the W3C (World Wide Web Consortium)
• Optional mechanisms for specifying document structure
  – DTD: the Document Type Definition language, part of the XML standard
  – XML Schema: a more recent specification built on top of XML
• Query languages for XML
  – XPath: lightweight
  – XSLT: document transformation language
  – XQuery: a full-blown language
From HTML to XML
HTML and XML
• HTML
  – A fixed number of tags
  – Each tag has its own well-defined meaning
    • E.g., <table> … </table>
• XML: HTML-like language
  – An arbitrary number of user-defined tags
  – No a priori semantics
  – Mainly for data exchange
  – Display using stylesheet
Important Differences
• XML contains a large assortment of tags chosen by the document author
  – The only valid tags in HTML are those sanctioned by the official specification of the language; other tags are ignored
• Every opening tag must have a matching closing tag, and the tags must be properly nested
  – E.g., <a><b></a></b> is not allowed
  – Some HTML tags are not required to be closed, e.g., <p>
• The document has a root element: the element that contains all other elements
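The nesting and root-element rules can be checked mechanically. The sketch below assumes Python's standard xml.etree.ElementTree parser as the XML processor; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the standard-library parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b></b></a>"))         # properly nested
print(is_well_formed("<a><b></a></b>"))         # crossing tags
print(is_well_formed("<p>unclosed paragraph"))  # HTML-style unclosed tag
print(is_well_formed("<a/><b/>"))               # no single root element
```

Only the first string is accepted; an HTML browser would typically tolerate the other three.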
Example
[Figure: a sample XML document annotated with the mandatory statement, the root element, XML elements, element names, and element contents]
Hierarchical Structure
PersonList (Student)
  Title
  Contents
    Person
      Name: John Doe
      Id: 111111111
      Address
        Number: 123
        Street: Main St
    Person
      Name: Joe Public
      Id: 666666666
      Address
        Number: 666
        Street: Hollow Rd
Attributes
• <PersonList Type="Student">
  – Type is the name of an attribute that belongs to the element PersonList
  – Student is the attribute value
  – All attribute values must be quoted
  – Text strings between tags do not need to be quoted
• Empty element
  – <Title Value="Student List"/>
  – The element has one attribute and no content
  – A shorthand for <Title Value="Student List"></Title>
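A parser exposes attributes as name/value pairs and treats the empty-element shorthand the same as the long form. This is a minimal sketch using Python's xml.etree.ElementTree; the helper summary is hypothetical and compares only the tag, attributes, text, and number of children.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<PersonList Type="Student"><Title Value="Student List"/></PersonList>'
)
print(root.tag, root.attrib)      # element name and its attribute map

title = root.find("Title")
print(title.attrib["Value"])      # attribute value, quotes removed


def summary(elem):
    """Reduce an element to (tag, attributes, text, number of children)."""
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))


# The shorthand <Title .../> and the long form parse to the same element.
long_form = ET.fromstring('<Title Value="Student List"></Title>')
print(summary(title) == summary(long_form))
```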
Processing Instructions & Comments
• Processing instructions
  – <?xml version="1.0" ?>
  – Contain anything the author might want to communicate to the XML processor, e.g., <?my-command go bring coffee?>
  – Rarely used
• Comment
  – <!-- A comment -->
  – Can occur anywhere except inside markup, i.e., between the symbols < and >
  – An integral part of the document
  – May be used by a receiver (e.g., a browser)
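Because comments and processing instructions are part of the document, a receiver can read them. The sketch below assumes Python's xml.dom.minidom as the receiver; the element name Doc and the list seen are hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0" ?>'
    '<!-- A comment -->'
    '<?my-command go bring coffee?>'
    '<Doc/>'
)

# Record what the receiver finds at the top level of the document.
seen = []
for node in doc.childNodes:
    if node.nodeType == node.COMMENT_NODE:
        seen.append(("comment", node.data.strip()))
    elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        seen.append(("instruction", node.target, node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        seen.append(("element", node.tagName))

for entry in seen:
    print(entry)
```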
CDATA Construct
• Includes strings of characters containing markup that might otherwise make the document ill formed
• <![CDATA[ This is an example of markup in HTML: <b><i> Example </b></i>]]>
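The effect of CDATA can be seen by parsing the same text with and without the wrapper. In this sketch the enclosing element Note is hypothetical, the closing tags are written with forward slashes, and Python's xml.etree.ElementTree stands in for the XML processor.

```python
import xml.etree.ElementTree as ET

wrapped = (
    "<Note><![CDATA[ This is an example of markup in HTML: "
    "<b><i> Example </b></i>]]></Note>"
)
note = ET.fromstring(wrapped)
print(len(note))    # number of child elements found inside Note
print(note.text)    # the markup survives as plain character data

# Without CDATA the crossing <b>/<i> tags are treated as real markup.
try:
    ET.fromstring("<Note> <b><i> Example </b></i></Note>")
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False
print(parsed_without_cdata)
```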
XML Elements and Data Objects
• XML allows mixed data/text structure
• XML elements are ordered
• XML has only one primitive type, string, and very weak facilities for specifying constraints
<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
is different from
<Address>
  <Street> Main St </Street>
  <Number> 123 </Number>
</Address>
A legal XML document
<Address>
  Sally lives on
  <Street> Main St </Street>
  house number
  <Number> 123 </Number>
  in the beautiful Anytown, Canada.
</Address>
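A small sketch of these three points, assuming Python's xml.etree.ElementTree (which stores text that follows a child element in that child's tail attribute); whitespace in the samples is trimmed for readability.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    "<Address><Number>123</Number><Street>Main St</Street></Address>")
second = ET.fromstring(
    "<Address><Street>Main St</Street><Number>123</Number></Address>")

# Child order is part of the document, so the two differ.
order_first = [child.tag for child in first]
order_second = [child.tag for child in second]
print(order_first, order_second, order_first == order_second)

# Every value arrives as a string, even one that looks numeric.
print(type(first.find("Number").text).__name__)

# Mixed content: free text interleaved with subelements.
mixed = ET.fromstring(
    "<Address>Sally lives on <Street>Main St</Street> house number "
    "<Number>123</Number> in the beautiful Anytown, Canada.</Address>"
)
for child in mixed:
    print(child.tag, repr(child.text), repr(child.tail))
print("".join(mixed.itertext()))
```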
Use of Attributes
• An element can have any number of user-defined attributes
• What attributes can do can also be achieved with elements
  – An attribute may occur only once within a tag, while subelements with the same tag may be repeated
• Attributes introduce ambiguity as to whether to represent information as attributes or elements
  – Sometimes convenient for representing data, but the same can be done with elements
  – The use of attributes is expected to decline
<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
<Address Number="123" Street="Main St"/>
Attributes in Markup
<Act Number="5">
  <Scene Number="1" Place="Mantua. A street">
    …
    <Apothecary Voice="scared">
      Such mortal drugs I have; but Mantua’s law
      Is death to any he that utters them.
    </Apothecary>
    <Romeo Voice="persistent">
      Art thou so bare and full of wretchedness,
      And fear’st to die?…
    </Romeo>
    …
  </Scene>
</Act>
Advantages of Attributes
• Attributes in an element are not ordered
  – <Address Number="123" Street="Main St"/>
  – <Address Street="Main St" Number="123"/>
• Attributes are more succinct
• Attributes can be declared to have unique values and can be used to enforce a limited kind of referential integrity
<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
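The unordered nature of attributes shows up directly in a parser's data model. In this sketch, which assumes Python's xml.etree.ElementTree, attributes are exposed as a name-to-value mapping, so the two spellings of the Address element compare equal.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<Address Number="123" Street="Main St"/>')
b = ET.fromstring('<Address Street="Main St" Number="123"/>')

# Attribute maps are compared by content, not by the order written.
same_attributes = a.attrib == b.attrib
print(same_attributes)
```

Uniqueness and referential-integrity declarations need a DTD or schema and a validating processor, which this sketch does not cover.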
ID and IDREF – Cross-References
Well Formed XML Document
• It has a root element
• Every opening tag is followed by a matching closing tag, and the elements are properly nested inside each other
• Any attribute can occur at most once in a given opening tag, its value must be provided, and this value must be quoted
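Each rule can be exercised with a small counterexample. The sketch below assumes Python's xml.etree.ElementTree as a non-validating parser; the sample strings and the labels in candidates are made up for illustration.

```python
import xml.etree.ElementTree as ET

candidates = {
    "ok": '<Address Number="123" Street="Main St"/>',
    "duplicate attribute": '<Address Number="123" Number="666"/>',
    "missing value": '<Address Number/>',
    "unquoted value": '<Address Number=123/>',
    "improper nesting": '<Address><Number>123</Address></Number>',
}

# True means the parser accepted the text as well formed.
results = {}
for label, text in candidates.items():
    try:
        ET.fromstring(text)
        results[label] = True
    except ET.ParseError:
        results[label] = False
    print(label, results[label])
```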
Namespaces
• A term (tag) might have different meanings in different contexts
  – <Name><First>John</First> <Last>Doe</Last></Name>
  – <Name>Simon Fraser University</Name>
• Every XML tag must have two parts: namespace and local name
  – General structure: namespace:local-name
  – Namespace represented by URI (uniform resource identifier)
    • An abstract identifier (a general unique string)
    • URL (uniform resource locator)
Example – Namespace
• Namespaces are defined using the attribute xmlns
  – All names xml* should be considered reserved
• Default namespace xmlns="…"
  – Only one default namespace
• Other namespace xmlns:toy="…"
  – Prefixes (e.g., toy) must be distinct
<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>
Namespace Declarations
• Namespace as prefix
  – E.g., toy:item, toy:name
  – Tags without prefix belong to the default namespace
• Namespace declarations have scope
  – Can be nested like a program block
Example – Scopes of Namespaces
<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>
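To see which namespace each tag ends up in, the document can be run through a namespace-aware parser. This sketch assumes Python's xml.etree.ElementTree, which reports every tag as {namespace-URI}local-name; the list resolved is a hypothetical name.

```python
import xml.etree.ElementTree as ET

doc = """
<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item><toy:name>cyberpet</toy:name></toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>
"""

root = ET.fromstring(doc.strip())

# Walk the tree in document order and record each fully resolved tag.
resolved = [(elem.tag, (elem.text or "").strip()) for elem in root.iter()]
for tag, text in resolved:
    print(tag, text)
```

The inner item redeclares both the default namespace and the toy prefix, so notebook and sticker resolve to the supplies2 and toys2 namespaces, while the outer declarations still apply to backpack and cyberpet. The parser never fetches the URLs; it only compares them as strings.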
More About Namespace
• The name of a namespace is just a string that happens to be a URL
• It is not necessarily a real address that contains some kind of schema describing the corresponding set of names
• Don’t be misled by the URL!
Summary
• HTML and XML: differences and applications
• Structure of XML
  – Elements
  – Attributes
  – Well formed XML documents
• Namespace
To-Do-List
• Can every relational table be represented in XML? Can every XML document be represented in a relational table?
• RSS is an application of XML. Try to understand the two RSS segments at http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html