1

An Introduction to XML and Related Standards

• XML Basics
  – Semi-structured data
  – DTD
  – XML Schema

• XML transforming and querying
  – XPath
  – XSLT
  – XQuery

• Semantic Web
  – RDF
  – OWL

2

Background: Markup and Markup Language

Markup
– Annotations (tags) for carrying information about a document’s content
  • a writer’s handwritten notes for typesetting
  • an editor’s corrections in a manuscript

Markup Language
– A language that defines a syntax and grammar for tags

3

Background: SGML

SGML
– Standard Generalized Markup Language
– Standardized in 1986 (ISO)
– A language for defining markup languages
– And for marking up content
– Syntax + Document Type Definition (DTD)
– Tools aimed at document management

4

Background: HTML

HTML
– A markup language
– A particular SGML Document Type (called an “application”)
– Tools for browsing and authoring

5

Background: Limitations of SGML and HTML

SGML
– Complex, with many options and shortcuts
– Must know the DTD to parse correctly
– Cost of SGML technology is high

HTML
– Not extensible: can’t define new tags
– Tags are for presenting data, not describing it
– Doesn’t capture much document structure or content meaning

6

Enter XML

XML (Extensible Markup Language)
– Standardized by W3C in 1998
– For data interchange over the Web
– A simpler SGML:
  • Actually, a subset of SGML
  • DTDs are optional
  • Fewer features and options
– Widely available tools for parsing, authoring, browsing, etc.

7

Uses for XML: Why XML?

– Capture logical structure of documents: presentation independent
– Data interchange: XML is implementation independent
– Storage format: any successful interchange format becomes a storage format
– Metadata: searching, filtering, organizing
– Data packaging, movement, and processing: client-side processing, server-to-server communication, non-browser-based clients, simplified server processing, etc.

8

The Many Standards of XML

Standards surrounding the XML Document:

– DTD: XML DTD
– Query: XQuery, XQL, XML-QL
– Programming: Document Object Model (DOM)
– Transformation: XSLT for rearranging and restructuring XML documents
– Transport: XML-RPC, SOAP, XML-Protocol for message and object serialization and remote procedure calls
– Metadata: RDF, OWL, using XML to define resource metadata
– Schema and Types: XML Schema
– Linking: XLink for simple and complex hyperlinks between XML documents
– Addressing: XPath and XPointer for addressing XML subdocuments

9

The Running Example

Lego Product Catalogs
– catalogs have: a publishing date, an identifier, a title, etc.
– catalogs are made up of products
  • each product is either a kit or an accessory
  • each has an item #, price, name, picture, etc.
  • kits can have an age level, # of pieces, set type (duplo, basic), a theme (star wars), a system (space)

10

An Example XML Catalog Document

<?xml version="1.0"?>
<LegoCatalog>
  <pubDate>2000</pubDate>
  <products>
    <kit>
      <name>X-Wing Fighter</name>
      <ages>
        <minAge>7</minAge>
        <maxAge>12</maxAge>
      </ages>
      <pieces>263</pieces>
      <theme>Star Wars</theme>
      <desc>Take to the skies with Luke as he battles the forces of evil!</desc>
    </kit>
  </products>
</LegoCatalog>

11

An Example XML Document

– prolog: the <?xml version="1.0"?> declaration
– body: the LegoCatalog element and everything inside it
– elements have start and end-tags
– elements can also contain content
– elements are nested: “boxes within boxes”

<?xml version="1.0"?>
<LegoCatalog>
  <pubDate>2000</pubDate>
  <products>
    <kit>
      <name>X-Wing Fighter</name>
      <ages>
        <minAge>7</minAge>
        <maxAge>12</maxAge>
      </ages>
      <pieces>263</pieces>
      <theme>Star Wars</theme>
      <desc>Take to the skies with Luke as he battles the forces of evil! … </desc>
    </kit>
  </products>
</LegoCatalog>

12

Well Formed Documents

Well-formed XML documents:
– A single root element
– Start and end tags required (unlike HTML)
  • <name>X-Wing Fighter</name>
  • empty-element tags: <theme/>
– Elements must be properly nested
  • <kit> <pieces>263</kit></pieces>   (this is NOT properly nested!)
– More rules:
  • naming elements, document has at least one element, etc.
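The well-formedness rules above can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser rejects any violation of the well-formedness rules.
        return False


# Hypothetical sample documents for illustration.
good = "<kit><name>X-Wing Fighter</name><theme/></kit>"
bad_nesting = "<kit><pieces>263</kit></pieces>"  # end tags in the wrong order
two_roots = "<kit/><kit/>"                       # more than one root element
missing_end = "<kit><name>X-Wing Fighter</kit>"  # <name> is never closed

for sample in (good, bad_nesting, two_roots, missing_end):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted; the other three each break one of the rules listed on the slide.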

13

XML Attributes

• Elements can contain attributes

    <kit unitId="7140" price="$29.99" shipWeight="1lb">

  – element name: kit
  – attribute names: unitId, price, shipWeight
  – attribute values: "7140", "$29.99", "1lb"

Attributes are always assigned in element start tags (or empty-element tags), their values are always quoted (with single or double quotes), and attribute names must be unique within the element
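As a small illustration of the attribute rules, the sketch below parses the kit start tag with Python's standard xml.etree.ElementTree and then tries two malformed variants. The helper parses and the malformed strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The kit element from the slide, written as an empty-element tag.
# XML accepts either quote style, so shipWeight uses single quotes here.
kit = ET.fromstring("<kit unitId=\"7140\" price=\"$29.99\" shipWeight='1lb'/>")

print(kit.tag)           # kit
print(kit.attrib)        # dictionary of attribute names to values
print(kit.get("price"))  # $29.99


def parses(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Attribute names must be unique, and values must be quoted.
duplicate_ok = parses('<kit price="1" price="2"/>')
unquoted_ok = parses("<kit price=1/>")
print(duplicate_ok, unquoted_ok)  # False False
```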

14

Attributes vs. Content

In general, it is up to the document designer

In SGML, content usually was for data you see and attributes for metadata

15

DTD and XML Schema

16

Document Type Definition

• Why DTDs?
  – To standardize tags and structure for interchange and creation
  – To make the documents machine processable

• What is a DTD?
  – A grammar for describing XML documents (tags, attributes, nesting, etc.)
  – An XML document that is well-formed and conforms to a DTD is said to be valid

17

An Example DTD: Elements

<!ELEMENT LegoCatalog (pubDate, products)>

<!ELEMENT pubDate (#PCDATA)>

<!ELEMENT products (kit | accessory)*>

<!ELEMENT kit (name, ages, pieces, theme?, series?, desc)>

<!ELEMENT ages (minAge, maxAge)>

<!ELEMENT minAge (#PCDATA)>
<!ELEMENT maxAge (#PCDATA)>
<!ELEMENT pieces (#PCDATA)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT desc (#PCDATA)>

– (pubDate, products) is an element content model for LegoCatalog
– (#PCDATA) is a character data content model for pubDate

Content model operators:
  *    zero or more
  +    one or more
  ?    optional
  |    choice
  ,    strict sequence
  ()   grouping

There are also Empty, Any, and Mixed content models

18

An Example DTD: Attributes

<!ATTLIST kit
    price       CDATA       #REQUIRED
    shipWeight  CDATA       #REQUIRED
    avail       (yes | no)  #IMPLIED
    image       CDATA       "na.jpg"
    unitId      ID          #IMPLIED >

<!ATTLIST accessory
    forKits      IDREFS  #IMPLIED
    orderStatus  CDATA   #FIXED "special">

each attribute has the form: attr-name type default-decl

Attribute types:
  CDATA        character data
  ID           unique identifier
  IDREF        reference to an ID
  IDREFS       list of references
  enumeration  list of possible values

Default declarations:
  #REQUIRED         must appear
  #IMPLIED          optionally appear
  #FIXED + default  the attribute can only have the given value; if the attribute is missing, the parser assumes that value
  Default only      if the attribute is missing, the default is assumed; otherwise any value is allowed

19

Limitations of DTDs

DTDs are not optimal
– Not well-formed XML
  • can’t parse them with an XML parser
  • need different tools to create them
  + but at least you can sort-of read/understand them
– Limited support for defining data types
– Limited modeling capabilities
  • hard to express some structures
  • no support for reusing structure

20

XML Schema

• W3C proposed recommendation (2001)
• Divided into 2 parts: structures, datatypes
• Main features
  – Schemas are themselves well-formed XML documents
  – A schema can span multiple documents
  – Can define new data types and constraints
  – Inheritance among content model types
  – Improves data interchange
    • Offers more precision for computer-computer transfer

21

The .xsd file

<xs:schema xmlns:xs="http://www.w3.org/1999/XMLSchema"
           targetNamespace="http://www.lego.com/products"
           version="1.1">
  ….
</xs:schema>

xmlns:xs - binds the ‘xs’ prefix to the XML Schema namespace, so the schema vocabulary itself (xs:element, xs:complexType, and so on) is referenced through this prefix

targetNamespace - all the elements and types defined in this schema come from this namespace. Use this URI to import or include these definitions in other schemas

22

Example XML Schema

<xs:schema>
  <xs:element name="products">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="kit" type="Product" minOccurs="1" maxOccurs="unbounded"/>
        <xs:element name="accessory" type="Product" minOccurs="0" maxOccurs="unbounded"/>
        ...
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="Product">
    <xs:attribute name="price" type="DollarType"/>
    …
  </xs:complexType>
  <xs:simpleType name="DollarType">
    <xs:pattern value="reg-exp"/>
  </xs:simpleType>
  ...

– ComplexType = content model
– Many ways to describe new data types (not just regular expressions)

23

Main Schema Components

Definitions of:
– Complex types = sub-elements + attributes
– Simple types = no sub-elements, constraints on strings (datatypes)

Declarations of:
– elements (of simple and complex types)
– attributes (simple types), attribute groups

24

Simple Type Definitions

Can have: built-in, pre-declared or anonymous simple type definitions.

<attribute name="State" type="string"/>

<simpleType name="US-State" base="string">
  <enumeration value="AK"/>
  <enumeration value="AL"/>
  <enumeration value="AR"/>
  ……
</simpleType>
<attribute name="State" type="US-State"/>

<address State="California" />

25

Example of Complex Type Definition

<complexType name="personName">
  <element name="title" type="string"/>
  <element name="firstname" type="string"/>
  <element name="lastname" type="string"/>
  <attribute name="age" type="integer"/>
</complexType>

<element name="producer" type="personName"/>

A matching instance (title, firstname and lastname are sub-elements; age is an attribute):

<producer age="…">
  <title>…</title>
  <firstname>…</firstname>
  <lastname>…</lastname>
</producer>

26

Constraints on Element Content

content =
– textOnly : only character data
– mixed : character data appears alongside subelements
– elementOnly : only subelements
– empty : no content (only attributes)
– any

<element name="price">
  <complexType content="empty">
    <attribute name="currency" type="string"/>
    <attribute name="value" type="decimal"/>
  </complexType>
</element>

<price currency="AUD" value="256.76"/>

27

Datatype Example

<simpleType name="TelephoneNumber" base="string">
  <length value="8"/>
  <pattern value="\d{3}-\d{4}"/>
</simpleType>

This creates a new datatype called 'TelephoneNumber'. Elements of this type can hold string values, but the string must be exactly 8 characters long and must follow the pattern ddd-dddd, where '\d' represents a digit.
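The two facets of this datatype (a fixed length and a pattern) can be imitated outside a schema processor. The sketch below is only an approximation in Python's re module, not a schema validator; the function name is_telephone_number and the test values are assumptions for illustration.

```python
import re

# Same regular expression as the <pattern> facet on the slide.
TELEPHONE_PATTERN = re.compile(r"\d{3}-\d{4}")


def is_telephone_number(value: str) -> bool:
    """Apply both facets: length must be 8 and the whole string must match."""
    has_right_length = len(value) == 8
    matches_pattern = TELEPHONE_PATTERN.fullmatch(value) is not None
    return has_right_length and matches_pattern


print(is_telephone_number("555-1234"))   # True
print(is_telephone_number("5551234"))    # False: no hyphen, only 7 characters
print(is_telephone_number("555-12345"))  # False: 9 characters
print(is_telephone_number("abc-defg"))   # False: letters are not digits
```

Here the length facet is redundant, because every string matching ddd-dddd already has 8 characters; the slide lists both to show that facets can be combined.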

28

XPath

29

What is XPath?

• XPath is a syntax used for selecting parts of an XML document

• The way XPath describes paths to elements is similar to the way an operating system describes paths to files

• XPath is almost a small simple programming language; it has functions, tests, and expressions

• XPath is a W3C standard

• XPath is not itself written as XML, but is used heavily in XSLT, XML Schema and XQuery

30

Terminology

<library>
  <book>
    <chapter> </chapter>
    <chapter>
      <section>
        <paragraph/>
        <paragraph/>
      </section>
    </chapter>
  </book>
</library>

• library is the parent of book; book is the parent of the two chapters
• The two chapters are the children of book, and the section is the child of the second chapter
• The two chapters of the book are siblings (they have the same parent)
• library, book, and the second chapter are the ancestors of the section
• The two chapters, the section, and the two paragraphs are the descendants of the book

31

Slashes

• A path that begins with a / represents an absolute path, starting from the top of the document
  – Example: /email/message/header/from
  – Note that even an absolute path can select more than one element
  – A slash by itself means “the whole document”

• A path that does not begin with a / represents a path starting from the current element
  – Example: header/from

• A path that begins with // can start from anywhere in the document
  – Example: //header/from selects every element named from that is a child of an element named header
  – This can be expensive, since it involves searching the entire document
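The difference between these path forms can be tried out with Python's standard xml.etree.ElementTree, which supports a limited XPath subset. The sketch below rests on two assumptions: the email document is invented for illustration, and because ElementTree evaluates paths relative to an element, the absolute path /email/message/header/from is written as message/header/from on the root element and //header/from as .//header/from.

```python
import xml.etree.ElementTree as ET

# Hypothetical email document; `email` is the root element.
email = ET.fromstring("""
<email>
  <message>
    <header><from>alice</from><to>bob</to></header>
    <body><header><from>quoted</from></header></body>
  </message>
  <message>
    <header><from>carol</from></header>
  </message>
</email>
""")

# Equivalent of /email/message/header/from: fixed route from the top.
# Even this "absolute" path selects more than one element.
absolute = [e.text for e in email.findall("message/header/from")]

# Equivalent of //header/from: any header, at any depth.
anywhere = [e.text for e in email.findall(".//header/from")]

# Relative path header/from, evaluated from the first message element.
first_message = email.find("message")
relative = [e.text for e in first_message.findall("header/from")]

print(absolute)  # ['alice', 'carol']
print(anywhere)  # ['alice', 'quoted', 'carol']
print(relative)  # ['alice']
```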

32

Brackets and last()

• A number in brackets selects a particular matching child (counting starts from 1, except in Internet Explorer)
  – Example: /library/book[1] selects the first book of the library
  – Example: //chapter/section[2] selects the second section of every chapter in the XML document
  – Example: //book/chapter[1]/section[2]
  – Only matching elements are counted; for example, if a book has both sections and exercises, the latter are ignored when counting sections

• The function last() in brackets selects the last matching child
  – Example: /library/book/chapter[last()]

• You can even do simple arithmetic
  – Example: /library/book/chapter[last()-1]
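Positional predicates can be explored with the XPath subset in Python's standard xml.etree.ElementTree, which supports [n], [last()] and [last()-n]. The library document and its n attributes below are assumptions made for this sketch, and paths are written relative to the root element library.

```python
import xml.etree.ElementTree as ET

# Hypothetical library; chapter numbers are stored in an `n` attribute.
library = ET.fromstring("""
<library>
  <book>
    <chapter n="1"><section>1.1</section><exercise>e</exercise><section>1.2</section></chapter>
    <chapter n="2"><section>2.1</section></chapter>
    <chapter n="3"><section>3.1</section><section>3.2</section></chapter>
  </book>
  <book>
    <chapter n="B"><section>B.1</section></chapter>
  </book>
</library>
""")

# /library/book[1]: the first book (counting starts at 1).
first_book = library.find("book[1]")

# //chapter/section[2]: second section of every chapter that has one.
# The <exercise> between the sections is not counted.
second_sections = [s.text for s in library.findall(".//chapter/section[2]")]

# /library/book/chapter[last()]: last chapter of each book.
last_chapters = [c.get("n") for c in library.findall("book/chapter[last()]")]

# /library/book/chapter[last()-1]: next-to-last chapter of each book.
next_to_last = [c.get("n") for c in library.findall("book/chapter[last()-1]")]

print(len(first_book.findall("chapter")))  # 3
print(second_sections)                     # ['1.2', '3.2']
print(last_chapters)                       # ['3', 'B']
print(next_to_last)                        # ['2']
```

The second book has a single chapter, so it contributes to chapter[last()] but has no next-to-last chapter.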

33

Stars

• A star, or asterisk, is a “wildcard”: it means “all the elements at this level”
  – Example: /library/book/chapter/* selects every child of every chapter of every book in the library
  – Example: //book/* selects every child of every book
  – Example: /*/*/*/paragraph selects every paragraph that has exactly three ancestors
  – Example: //* selects every element in the entire document
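A minimal sketch of the wildcard, again using the XPath subset of Python's standard xml.etree.ElementTree. The document and the id attributes are assumptions for illustration; since paths are evaluated relative to the root element library, the leading /library (or the first /*) of each slide example is dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical library: one paragraph sits directly in the chapter,
# another one level deeper inside a section.
library = ET.fromstring("""
<library>
  <book>
    <chapter>
      <paragraph id="a"/>
      <section><paragraph id="b"/></section>
    </chapter>
  </book>
</library>
""")

# /library/book/chapter/*: every child of every chapter.
chapter_children = [e.tag for e in library.findall("book/chapter/*")]

# //book/*: every child of every book.
book_children = [e.tag for e in library.findall(".//book/*")]

# /*/*/*/paragraph: paragraphs with exactly three element ancestors.
three_ancestors = [p.get("id") for p in library.findall("*/*/paragraph")]

# //*: every element in the document (iter() includes the root itself).
all_elements = [e.tag for e in library.iter()]

print(chapter_children)   # ['paragraph', 'section']
print(book_children)      # ['chapter']
print(three_ancestors)    # ['a']
print(len(all_elements))  # 6
```

Paragraph b is not selected by the third path because it has four element ancestors (library, book, chapter, section).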

34

Attributes I

• You can select attributes by themselves, or elements that have certain attributes
  – Remember: an attribute consists of a name-value pair, for example in <chapter num="5">, the attribute is named num
  – To choose the attribute itself, prefix the name with @
  – Example: @num will choose every attribute named num
  – Example: //@* will choose every attribute, everywhere in the document

• To choose elements that have a given attribute, put the attribute name in square brackets
  – Example: //chapter[@num] will select every chapter element (anywhere in the document) that has an attribute named num

35

Attributes II

• //chapter[@num] selects every chapter element with an attribute num

• //chapter[not(@num)] selects every chapter element that does not have a num attribute

• //chapter[@*] selects every chapter element that has any attribute

• //chapter[not(@*)] selects every chapter element with no attributes

36

Values of attributes

• //chapter[@num='3'] selects every chapter element with an attribute num with value 3

• The normalize-space() function can be used to remove leading and trailing spaces from a value before comparison

– Example: //chapter[normalize-space(@num)="3"]
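The attribute tests from the last three slides can be sketched in Python. The standard xml.etree.ElementTree supports [@num] and [@num='3'] directly; it does not implement not(), @* or normalize-space(), so those three are imitated here with plain Python filters. The book document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical book: note the padded value " 3 " on the second chapter.
book = ET.fromstring("""
<book>
  <chapter num="1"/>
  <chapter num=" 3 "/>
  <chapter/>
  <chapter num="3" title="XPath"/>
</book>
""")

# //chapter[@num]: chapters that have a num attribute.
with_num = [c.get("num") for c in book.findall(".//chapter[@num]")]

# //chapter[@num='3']: exact string comparison, so " 3 " does not match.
exactly_3 = book.findall(".//chapter[@num='3']")

# //chapter[not(@num)] and //chapter[@*], imitated with Python filters.
without_num = [c for c in book.iter("chapter") if "num" not in c.attrib]
any_attribute = [c for c in book.iter("chapter") if c.attrib]

# //chapter[normalize-space(@num)="3"]: trim and collapse whitespace first.
normalized_3 = [
    c for c in book.iter("chapter")
    if " ".join(c.get("num", "").split()) == "3"
]

print(with_num)            # ['1', ' 3 ', '3']
print(len(exactly_3))      # 1
print(len(without_num))    # 1
print(len(any_attribute))  # 3
print(len(normalized_3))   # 2
```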

37

Location Path

The central construct is the location path:

location path = location step / …/ location step

child::section [ position()<6 ] / descendant::cite / attribute::href

selects all href attributes in cite elements in the first 5 sections of a document

• A location step is evaluated with respect to some context
• A location path is evaluated left-to-right, starting with some initial context
• Each node resulting from the evaluation of one step is used as context for the evaluation of the next, and the results are unioned together
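The left-to-right evaluation described above can be sketched as a tiny evaluator. This is a toy model, not an XPath engine: the helpers child, descendant, attribute and evaluate, the representation of attribute nodes as (owner, name) pairs, and the sample document are all assumptions made for this example. It also keeps results in the order they are found, which happens to be document order here but is not guaranteed in general.

```python
import xml.etree.ElementTree as ET


def child(name, keep=lambda position: True):
    """child::name[...]: children named `name`, filtered by 1-based position."""
    def step(node):
        candidates = [c for c in node if c.tag == name]
        return [c for position, c in enumerate(candidates, start=1) if keep(position)]
    return step


def descendant(name):
    """descendant::name: elements named `name` strictly below the node."""
    def step(node):
        return [d for d in node.iter(name) if d is not node]
    return step


def attribute(name):
    """attribute::name: an (owner, name) pair when the attribute exists."""
    def step(node):
        return [(node, name)] if name in node.attrib else []
    return step


def node_key(item):
    """Identity of a node, used to union results without duplicates."""
    if isinstance(item, tuple):
        owner, name = item
        return (id(owner), name)
    return id(item)


def evaluate(steps, context):
    """Apply each step to every node produced by the previous step."""
    nodes = [context]
    for step in steps:
        merged, seen = [], set()
        for node in nodes:
            for item in step(node):
                if node_key(item) not in seen:
                    seen.add(node_key(item))
                    merged.append(item)
        nodes = merged
    return nodes


# Hypothetical document with six sections; only the first five count.
doc = ET.fromstring("""
<doc>
  <section><p><cite href="a"/></p></section>
  <section><cite href="b"/><cite/></section>
  <section/><section/><section/>
  <section><cite href="f"/></section>
</doc>
""")

# child::section[position()<6] / descendant::cite / attribute::href
path = [
    child("section", keep=lambda position: position < 6),
    descendant("cite"),
    attribute("href"),
]
hrefs = [owner.get(name) for owner, name in evaluate(path, doc)]
print(hrefs)  # ['a', 'b']
```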

38

Location Step

location step = axis :: node-test [ predicate ]

axis
• a rough set of candidate nodes
  – e.g. the child nodes of the context node

node-test
• performs an initial filtration based on
  – types: chardata node, processing instruction, etc.
  – names: element name

predicates
• a further, more complex, filtration
• only candidates for which the predicates evaluate to true are kept


39

Axes :: Node-test [ Predicate ]

Axes:
  child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self, ancestor-or-self

Node tests:
  name, *, text(), comment(), processing-instruction(), node()

Predicates:
  [attribute::name="flour"]
  [attribute::name!="flour"]
  [attribute::amount="0.5" and attribute::unit="cup"]
  [position()=2]

40

Abbreviations

Unabbreviated form               Abbreviation
child::                          nothing (so child is the default axis)
attribute::                      @
/descendant-or-self::node()/     //
self::node()                     .
parent::node()                   ..

.//@href
  selects all href attributes of the context node and its descendants.

section [ position()<6 ] // cite [ @href = "there" ]
  selects all cite elements with href="there" attributes in the first 5 sections

41

XSL

42

XSL (eXtensible Stylesheet Language)

Why do we need it?
– Store in one format, display in another
  • e.g. transforming XML to XHTML and displaying it in a browser
– Convert to a more useful format
– Make the document more compact
  • extracting from XML documents only the data we need

We want to obtain another document that looks the way we specify

43

XSL (eXtensible Stylesheet Language)

consists of two parts:
– XSL Transformations (XSLT)
  • an XSLT stylesheet is an XML document defining a transformation from one class of XML documents into another
– XSL Formatting Objects (XSL-FO)
  • specifies formatting in a more low-level and detailed way

44

A Simple Example

• File data.xml:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="render.xsl"?>
    <message>Howdy!</message>

• File render.xsl:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html><body>
          <h1><xsl:value-of select="message"/></h1>
        </body></html>
      </xsl:template>
    </xsl:stylesheet>

45

The .xsl File

• An XSLT document has the .xsl extension
• The XSLT document begins with:
  – <?xml version="1.0"?>
  – <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
• Contains one or more templates, such as:
  – <xsl:template match="/"> ... </xsl:template>
• And ends with:
  – </xsl:stylesheet>

46

Explanation of render.xsl

The XSL was:

    <xsl:template match="/">
      <html><body>
        <h1><xsl:value-of select="message"/></h1>
      </body></html>
    </xsl:template>

• The <xsl:template match="/"> chooses the root
• The <html><body> <h1> is written to the output file
• The contents of message are written to the output file
• The </h1> </body></html> is written to the output file
• The resultant file looks like:

    <html><body>
      <h1>Howdy!</h1>
    </body></html>

47

How XSLT Works

• The XML text document is read in and stored as a tree of nodes

• The <xsl:template match="/"> template is used to select the entire tree

• The rules within the template are applied to the matching nodes to build a new result tree; the source tree itself is not modified

– If there are other templates, they must be invoked from the main template (for example with xsl:apply-templates or xsl:call-template)

• Parts of the source tree that no template selects do not appear in the result

• After the template is applied, the tree is written out again as a text document

48

xsl:value-of

• <xsl:value-of select="XPath expression"/> selects the contents of an element and adds it to the output stream
  – The select attribute is required
  – Notice that xsl:value-of is not a container, hence it needs to end with a slash

• Example (from an earlier slide):

    <h1> <xsl:value-of select="message"/> </h1>

49

xsl:for-each

• xsl:for-each is a kind of loop statement
• The syntax is

    <xsl:for-each select="XPath expression">
      Text to insert and rules to apply
    </xsl:for-each>

• Example: to select every book (//book) and make an unordered list (<ul>) of their titles (title), use:

    <ul>
      <xsl:for-each select="//book">
        <li> <xsl:value-of select="title"/> </li>
      </xsl:for-each>
    </ul>

50

Filtering output

• You can filter (restrict) output by adding a criterion to the select attribute’s value:

    <ul>
      <xsl:for-each select="//book">
        <li> <xsl:value-of select="title[../author='Terry Smith']"/> </li>
      </xsl:for-each>
    </ul>

• This will select book titles by Terry Smith

51

Filter details

• Here is the filter we just used: <xsl:value-of select="title[../author='Terry Smith']"/>

• author is a sibling of title, so from title we have to go up to its parent, book, then back down to author

• This filter requires a quote within a quote, so we need both single quotes and double quotes

• Legal filter operators are: = != < > (inside an attribute value, < must be written as &lt;)

52

But it doesn’t work right!

• Here’s what we did:

    <xsl:for-each select="//book">
      <li> <xsl:value-of select="title[../author='Terry Smith']"/> </li>
    </xsl:for-each>

• This will output <li> and </li> for every book, so we will get empty bullets for authors other than Terry Smith

• There is no obvious way to solve this with just xsl:value-of

53

xsl:if

• xsl:if allows us to include content if a given condition (in the test attribute) is true

• Example:

    <xsl:for-each select="//book">
      <xsl:if test="author='Terry Smith'">
        <li> <xsl:value-of select="title"/> </li>
      </xsl:if>
    </xsl:for-each>

• This does work correctly!
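To see why the two stylesheets behave differently, the control flow can be imitated in ordinary code. The sketch below is a Python analogue, not XSLT (Python's standard library has no XSLT processor); the function names and the sample books other than the Terry Smith entry are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; "Pat Jones" and the titles are made up.
library = ET.fromstring("""
<library>
  <book><title>XML</title><author>Terry Smith</author></book>
  <book><title>Java</title><author>Pat Jones</author></book>
  <book><title>XSLT</title><author>Terry Smith</author></book>
</library>
""")


def filter_inside_value_of(root):
    """Mirror xsl:for-each plus a filtered xsl:value-of.

    The <li> wrapper is emitted for every book; only its content is filtered.
    """
    items = []
    for book in root.iter("book"):
        is_match = book.findtext("author") == "Terry Smith"
        title = book.findtext("title") if is_match else ""
        items.append(f"<li>{title}</li>")
    return items


def filter_with_if(root):
    """Mirror xsl:if wrapped around the <li>: the whole item is conditional."""
    items = []
    for book in root.iter("book"):
        if book.findtext("author") == "Terry Smith":
            title = book.findtext("title")
            items.append(f"<li>{title}</li>")
    return items


print(filter_inside_value_of(library))  # ['<li>XML</li>', '<li></li>', '<li>XSLT</li>']
print(filter_with_if(library))          # ['<li>XML</li>', '<li>XSLT</li>']
```

The empty bullet in the first result corresponds to the book by another author; moving the test outside the list item removes it.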

54

xsl:choose

• The xsl:choose ... xsl:when ... xsl:otherwise construct is XSLT’s equivalent of a switch ... case ... default statement

• The syntax is:

    <xsl:choose>
      <xsl:when test="some condition">
        ... some code ...
      </xsl:when>
      <xsl:otherwise>
        ... some code ...
      </xsl:otherwise>
    </xsl:choose>

• xsl:choose is often used within an xsl:for-each loop

55

xsl:sort

• You can place an xsl:sort inside an xsl:for-each
• The select attribute of the sort tells what field to sort on
• Example:

    <ul>
      <xsl:for-each select="//book">
        <xsl:sort select="author"/>
        <li> <xsl:value-of select="title"/> by
             <xsl:value-of select="author"/> </li>
      </xsl:for-each>
    </ul>

  – This example creates a list of titles and authors, sorted by author

56

xsl:apply-templates

• If you apply a template to an element that has child elements, templates are not automatically applied to those child elements

• The <xsl:apply-templates> element applies a template rule to the current element or to the current element’s child nodes

• If we add a select attribute, it applies the template rule only to the child that matches

• If we have multiple <xsl:apply-templates> elements with select attributes, the child nodes are processed in the same order as the <xsl:apply-templates> elements

57

Applying templates to children

• <book>
    <title>XML</title>
    <author>Terry Smith</author>
  </book>

• <xsl:template match="/">
    <html> <head></head> <body>
      <b><xsl:value-of select="/book/title"/></b>
      <xsl:apply-templates select="/book/author"/>
    </body> </html>
  </xsl:template>

  <xsl:template match="/book/author">
    by <i><xsl:value-of select="."/></i>
  </xsl:template>

With the <xsl:apply-templates select="/book/author"/> line: XML by Terry Smith

Without this line: XML

58

Calling named templates

• You can name a template, then call it, similar to the way you would call a method in Java

• The named template: <xsl:template name="myTemplateName"> ...body of template... </xsl:template>

• A call to the template: <xsl:call-template name="myTemplateName"/>

• Or: <xsl:call-template name="myTemplateName"> ...parameters... </xsl:call-template>

59

Processing model

• A list of source nodes is processed to create a result tree fragment.
• The result tree is constructed by processing a list containing just the root node.
• A list of source nodes is processed by appending the result tree structure created by processing each of the members of the list in order.
• A node is processed by finding all the template rules with patterns that match the node, and choosing the best amongst them; the chosen rule's template is then instantiated with the node as the current node and with the list of source nodes as the current node list.

• A template typically contains instructions that select an additional list of source nodes for processing.

• The process of matching, instantiation and selection is continued recursively until no new source nodes are selected for processing.

60

XQuery

61

Enter XQuery

XML documents generalize relational data

Relation R:

    A    B    C
    a1   b1   c1
    a2   b2   c2
    a3   b3   c3

The same data as an XML document:

    <R>
      <tuple>
        <A>a1</A>
        <B>b1</B>
        <C>c1</C>
      </tuple>
      <tuple>
        <A>a2</A>
        <B>b2</B>
        <C>c2</C>
      </tuple>
      …
    </R>

How should query languages like SQL be similarly generalized?
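The mapping from a relation to an XML document is mechanical: one element per tuple, one sub-element per column. The following is a minimal sketch with Python's standard xml.etree.ElementTree; the function name relation_to_xml is an assumption for this example.

```python
import xml.etree.ElementTree as ET


def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tuple_element = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tuple_element, column).text = value
    return root


# The relation R(A, B, C) from the slide.
R = relation_to_xml(
    "R",
    ["A", "B", "C"],
    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
)

serialized = ET.tostring(R, encoding="unicode")
print(serialized)

# A relational selection (rows where A = 'a2') becomes a path-style query.
selected = [t.findtext("B") for t in R.findall("tuple") if t.findtext("A") == "a2"]
print(selected)  # ['b2']
```

The reverse direction is not always possible, since XML documents may be nested and irregular in ways a flat relation cannot express; that is the sense in which XML generalizes relational data.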

62

FLWOR Expressions

The main engine of XQuery is the FLWOR expression:
– For-Let-Where-Order-Return
– pronounced "flower"
– generalizes SELECT-FROM-WHERE from SQL

    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>

– for: generates an ordered list of bindings of deptno values $d
– let: for each $d, $e = the list of employee elements with that department number; we now have an ordered list of tuples ($d,$e)
– where: filters that list to retain only the desired tuples
– order by: sorts that list by the given criteria
– return: constructs for each tuple a resulting value

The result is a list of departments with at least 10 employees, sorted by average salary in descending order.
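The five clauses map closely onto an ordinary loop, which may help readers coming from imperative languages. The sketch below is a Python analogue of the query, not XQuery; the function big_departments, the dictionary layout and the sample data are assumptions for illustration.

```python
from statistics import mean

# Hypothetical data standing in for depts.xml and emps.xml.
deptnos = ["D1", "D2", "D3"]
employees = (
    [{"deptno": "D1", "salary": 100} for _ in range(10)]
    + [{"deptno": "D2", "salary": 500} for _ in range(3)]
    + [{"deptno": "D3", "salary": 200} for _ in range(12)]
)


def big_departments(deptnos, employees, min_headcount=10):
    tuples = []
    for d in deptnos:                                       # for $d in ...
        e = [emp for emp in employees if emp["deptno"] == d]  # let $e := ...
        if len(e) >= min_headcount:                         # where count($e) >= 10
            tuples.append((d, e))
    # order by avg($e/salary) descending
    tuples.sort(key=lambda t: mean(emp["salary"] for emp in t[1]), reverse=True)
    # return <big-dept>...</big-dept>, modeled here as a dictionary
    return [
        {
            "deptno": d,
            "headcount": len(e),
            "avgsal": mean(emp["salary"] for emp in e),
        }
        for d, e in tuples
    ]


result = big_departments(deptnos, employees)
for dept in result:
    print(dept["deptno"], dept["headcount"], dept["avgsal"])
```

D2 has the highest salaries but only three employees, so the where step drops it before sorting; D3 then precedes D1 because its average salary is higher.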

63

List Expressions

XQuery expressions often manipulate lists of values

    for $p in distinct-values(document("bib.xml")//publisher)
    let $a := avg(document("bib.xml")//book[publisher = $p]/price)
    return
      <publisher>
        <name>{ $p/text() }</name>
        <avgprice>{ $a }</avgprice>
      </publisher>

List functions: distinct-values, avg, …

64

Conditional expressions

XQuery supports a general if-then-else construction.

    for $h in document("library.xml")//holding
    return
      <holding>
        { $h/title,
          if ($h/@type = "Journal") then $h/editor else $h/author }
      </holding>

This query extracts from the holdings of a library the titles and either editors or authors.

65

Quantified Expressions

    for $b in document("bib.xml")//book
    where some $p in $b//paragraph satisfies
          ( contains($p,"sailing") and contains($p,"fishing") )
    return $b/title

finds the titles of all books which mention both sailing and fishing in the same paragraph

    for $b in document("bib.xml")//book
    where every $p in $b//paragraph satisfies
          contains($p,"sailing")
    return $b/title

finds the titles of all books which mention sailing in every paragraph
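The some and every quantifiers correspond to existential and universal tests over a sequence. The sketch below is a Python analogue using any() and all(), not XQuery; the book data is invented for illustration.

```python
# Hypothetical stand-in for bib.xml: each book has a title and paragraphs.
books = [
    {"title": "Coastal Life",
     "paragraphs": ["We went sailing and fishing.", "Then we slept."]},
    {"title": "Sailing Manual",
     "paragraphs": ["sailing basics", "advanced sailing"]},
    {"title": "Two Hobbies",
     "paragraphs": ["I like sailing.", "I like fishing."]},
]

# some $p in $b//paragraph satisfies (contains(...) and contains(...))
both_in_one_paragraph = [
    b["title"] for b in books
    if any("sailing" in p and "fishing" in p for p in b["paragraphs"])
]

# every $p in $b//paragraph satisfies contains($p, "sailing")
sailing_everywhere = [
    b["title"] for b in books
    if all("sailing" in p for p in b["paragraphs"])
]

print(both_in_one_paragraph)  # ['Coastal Life']
print(sailing_everywhere)     # ['Sailing Manual']
```

"Two Hobbies" mentions both words, but in different paragraphs, so the some test rejects it. As with XQuery's every, all() is true for a book with no paragraphs at all.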