1
An Introduction to XML and Related Standards
• XML Basics
– Semi-structured data
– DTD
– XML Schema
• XML transforming and querying
– XPath
– XSLT
– XQuery
• Semantic Web
– RDF
– OWL
2
Background: Markup and Markup Language
Markup
– Annotations (tags) that carry information about a document’s content
• a writer’s handwritten notes for typesetting
• an editor’s corrections in a manuscript
Markup Language
– A language that defines a syntax and grammar for tags
3
Background: SGML
SGML
– Standard Generalized Markup Language
– Standardized in 1986 (ISO)
– A language for defining markup languages
– And for marking up content
– Syntax + Document Type Definition (DTD)
– Tools aimed at document management
4
Background: HTML
HTML
– A markup language
– A particular SGML document type (called an “application”)
– Tools for browsing and authoring
5
Background: Limitations of SGML and HTML
SGML
– Complex, many options and shortcuts
– Must know the DTD to parse correctly
– Cost of SGML technology is high
HTML
– Not extensible: can’t define new tags
– Tags describe how to present data, not what the data is
– Doesn’t capture much document structure or content meaning
6
Enter XML
XML (Extensible Markup Language)
– Standardized by W3C in 1998
– For data interchange over the Web
– A simpler SGML:
• Actually, a subset of SGML
• DTDs are optional
• Fewer features and options
– Widely available tools for parsing, authoring, browsing, etc.
7
Uses for XML
Why XML?
– Capture the logical structure of documents: presentation independent
– Data interchange: XML is implementation independent
– Storage format: any successful interchange format becomes a storage format
– Metadata: searching, filtering, organizing
– Data packaging, movement, and processing: client-side processing, server-to-server communication, non-browser-based clients, simplified server processing, etc.
8
The Many Standards of XML
XML Document and XML DTD, surrounded by:
– Query: XQuery, XQL, XML-QL
– Programming: Document Object Model (DOM)
– Transformation: XSLT for rearranging and restructuring XML documents
– Transport: XML-RPC, SOAP, XML-Protocol for message and object serialization and remote procedure calls
– Metadata: RDF, OWL, using XML to define resource metadata
– Schema and Types: XML Schema
– Linking: XLink for simple and complex hyperlinks between XML documents
– Addressing: XPath and XPointer for addressing XML subdocuments
9
The Running Example
Lego Product Catalogs
– catalogs have: a publishing date, an identifier, a title, etc.
– catalogs are made up of products
• each product is either a kit or an accessory
• each has an item #, price, name, picture, etc.
• kits can have an age level, # of pieces, set type (duplo, basic), a theme (star wars), a system (space)
10
An Example XML Catalog Document
<?xml version="1.0"?>
<LegoCatalog>
  <pubDate>2000</pubDate>
  <products>
    <kit>
      <name>X-Wing Fighter</name>
      <ages>
        <minAge>7</minAge>
        <maxAge>12</maxAge>
      </ages>
      <pieces>263</pieces>
      <theme>Star Wars</theme>
      <desc>Take to the skies with Luke as he battles the forces of evil!</desc>
    </kit>
  </products>
</LegoCatalog>
11
An Example XML Document
– prolog: the <?xml version="1.0"?> declaration
– body: the LegoCatalog element and its contents
– elements have start and end-tags
– elements can also contain content
– elements are nested: “boxes within boxes”
<?xml version="1.0"?>
<LegoCatalog>
  <pubDate>2000</pubDate>
  <products>
    <kit>
      <name>X-Wing Fighter</name>
      <ages>
        <minAge>7</minAge>
        <maxAge>12</maxAge>
      </ages>
      <pieces>263</pieces>
      <theme>Star Wars</theme>
      <desc>Take to the skies with Luke as he battles the forces of evil! … </desc>
    </kit>
  </products>
</LegoCatalog>
12
Well Formed Documents
Well-formed XML documents:
– A single root element
– Start and end tags required (unlike HTML)
• <name>X-Wing Fighter</name>
• empty-element tags: <theme/>
– Elements must be properly nested
• <kit> <pieces>263</kit></pieces> is NOT properly nested!
– More rules:
• naming elements, document has at least one element, etc.
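The well-formedness rules above can be checked mechanically, because any conforming XML parser must reject a document that breaks them. Below is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested, with an empty-element tag.
good = "<kit><name>X-Wing Fighter</name><theme/></kit>"
# The improperly nested example from the slide.
bad_nesting = "<kit><pieces>263</kit></pieces>"
# Two top-level elements, so there is no single root.
two_roots = "<kit/><kit/>"

for sample in (good, bad_nesting, two_roots):
    print(sample, "->", is_well_formed(sample))
```

Note that this only tests well-formedness; checking a document against a DTD or schema (validity) needs a validating parser, which the standard library module used here is not.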
13
XML Attributes
• Elements can contain attributes
<kit unitId="7140" price="$29.99" shipWeight="1lb">
– element name: kit
– attribute names: unitId, price, shipWeight
– attribute values: "7140", "$29.99", "1lb"
Attributes are always assigned in element start tags (or empty-element tags), their values are always quoted (with either single or double quotes), and each attribute name must be unique within the element
14
Attributes vs. Content
In general, it is up to the document designer
In SGML, content usually was for data you see and attributes for metadata
15
DTD and XML Schema
16
Document Type Definition
• Why DTDs?– To standardize tags and structure for interchange and
creation
– To make the documents machine processable
• What is a DTD?– A grammar for describing XML documents (tags,
attributes, nesting, etc.)
– An XML document that is well-formed and conforms to a DTD is said to be valid
17
An Example DTD: Elements
<!ELEMENT LegoCatalog (pubDate, products)>
<!ELEMENT pubDate (#PCDATA)>
<!ELEMENT products (kit | accessory)*>
<!ELEMENT kit (name, ages, pieces, theme?, series?, desc)>
<!ELEMENT ages (minAge, maxAge)>
<!ELEMENT minAge (#PCDATA)>
<!ELEMENT maxAge (#PCDATA)>
<!ELEMENT pieces (#PCDATA)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
– (pubDate, products) is an element content model for LegoCatalog
– (#PCDATA) is a character data content model for pubDate
Content model operators:
*   zero or more
+   one or more
?   optional
|   choice
,   strict sequence
()  grouping
There are also Empty, Any, and Mixed content models
18
An Example DTD: Attributes
<!ATTLIST kit
    price       CDATA       #REQUIRED
    shipWeight  CDATA       #REQUIRED
    avail       (yes | no)  #IMPLIED
    image       CDATA       "na.jpg"
    unitId      ID          #IMPLIED >
<!ATTLIST accessory
    forKits      IDREFS  #IMPLIED
    orderStatus  CDATA   #FIXED "special">
Each attribute has the form: attr-name type default-decl
Types:
– CDATA = character data
– ID = unique identifier
– IDREF = reference to an ID
– IDREFS = list of references
– enumeration = list of possible values
Default declarations:
– #REQUIRED = must appear
– #IMPLIED = optionally appear
– #FIXED + default = if attribute is missing, parser assumes value
– Default only = if attribute is missing, default is assumed, otherwise any value
19
Limitations of DTDs
DTDs are not optimal
– Not well-formed XML
• can’t parse them with an XML parser
• need different tools to create them
+ but at least you can sort-of read/understand them
– Limited support for defining data types
– Limited modeling capabilities
• hard to express some structures
• no support for reusing structure
20
XML Schema
• W3C Recommendation (May 2001)
• Divided into 2 parts: structures, datatypes
• Main features
– Schemas are themselves well-formed XML documents
– A schema can span multiple documents
– Can define new data types and constraints
– Inheritance among content model types
– Improves data interchange
• Offers more precision for computer-computer transfer
21
The .xsd file
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.lego.com/products"
           version="1.1">
  ….
</xs:schema>
xmlns:xs - binds the ‘xs’ prefix to the XML Schema namespace, so that xs:element, xs:complexType, etc. refer to the schema vocabulary rather than to your own elements
targetNamespace - all the elements and types defined in this schema belong to this namespace. Use this URI to import or include these definitions in other schemas
22
Example XML Schema
<xs:schema>
  <xs:element name="products">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="kit" type="Product" minOccurs="1" maxOccurs="unbounded"/>
        <xs:element name="accessory" type="Product" minOccurs="0" maxOccurs="unbounded"/>
        ...
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="Product">
    <xs:attribute name="price" type="DollarType"/>
    …
  </xs:complexType>
  <xs:simpleType name="DollarType">
    <xs:restriction base="xs:string">
      <xs:pattern value="reg-exp"/>
    </xs:restriction>
  </xs:simpleType>
  ...
Many ways to describe new data types (not just regular expressions)
ComplexType = Content Model
23
Main Schema Components
Definitions of:– Complex types = sub-elements + attributes
– Simple types = no sub-elements, constraints on strings(datatypes)
Declarations of:– elements (of simple and complex types)
– attributes (simple types), attribute groups
24
Simple Type Definitions
Can have: built-in, pre-declared or anonymous simple type definitions.
<attribute name="State" type="string"/>
<simpleType name="US-State">
  <restriction base="string">
    <enumeration value="AK"/>
    <enumeration value="AL"/>
    <enumeration value="AR"/>
    ……
  </restriction>
</simpleType>
<attribute name="State" type="US-State"/>
<address State="California" />
25
Example of Complex Type Definition
<complexType name="personName">
  <sequence>
    <element name="title" type="string"/>
    <element name="firstname" type="string"/>
    <element name="lastname" type="string"/>
  </sequence>
  <attribute name="age" type="integer"/>
</complexType>
<element name="producer" type="personName"/>
<producer age="…"> <title>…</title> <firstname>…</firstname> <lastname>…</lastname> </producer>
26
Constraints on Element Content
Element content can be:
– text only: only character data
– mixed: character data appears alongside subelements
– element only: only subelements
– empty: no content (only attributes)
– any
A complex type that declares only attributes has empty content:
<element name="price">
  <complexType>
    <attribute name="currency" type="string"/>
    <attribute name="value" type="decimal"/>
  </complexType>
</element>
<price currency="AUD" value="256.76"/>
27
Datatype Example
<simpleType name="TelephoneNumber">
  <restriction base="string">
    <length value="8"/>
    <pattern value="\d{3}-\d{4}"/>
  </restriction>
</simpleType>
This creates a new datatype called 'TelephoneNumber'. Elements of this type can hold string values, but the string must be exactly 8 characters long and must follow the pattern ddd-dddd, where '\d' represents a digit.
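To make the two facets concrete, here is a minimal sketch in Python of the check that the TelephoneNumber type describes. It is not a schema validator: the function name is_telephone_number is hypothetical, and the sketch simply applies the length facet and the pattern facet by hand. Schema patterns must match the whole value, which is why the sketch uses fullmatch.

```python
import re

# The same regular expression as the <pattern> facet.
TELEPHONE_PATTERN = re.compile(r"\d{3}-\d{4}")


def is_telephone_number(value: str) -> bool:
    """Apply the length facet (exactly 8) and the pattern facet (ddd-dddd)."""
    has_length_8 = len(value) == 8
    matches_pattern = TELEPHONE_PATTERN.fullmatch(value) is not None
    return has_length_8 and matches_pattern


for candidate in ("555-1234", "5551234", "555-12345", "abc-defg"):
    print(candidate, "->", is_telephone_number(candidate))
```

For this particular pattern the length facet is redundant, since any string matching ddd-dddd already has 8 characters; the two facets are independent constraints in general.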
28
XPath
29
What is XPath?
• XPath is a syntax used for selecting parts of an XML document
• The way XPath describes paths to elements is similar to the way an operating system describes paths to files
• XPath is almost a small simple programming language; it has functions, tests, and expressions
• XPath is a W3C standard
• XPath is not itself written as XML, but is used heavily in XSLT, XML Schema and XQuery
30
Terminology
<library>
  <book>
    <chapter> </chapter>
    <chapter>
      <section>
        <paragraph/>
        <paragraph/>
      </section>
    </chapter>
  </book>
</library>
• library is the parent of book; book is the parent of the two chapters
• The two chapters are the children of book, and the section is the child of the second chapter
• The two chapters of the book are siblings (they have the same parent)
• library, book, and the second chapter are the ancestors of the section
• The two chapters, the section, and the two paragraphs are the descendants of the book
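The family terms above map directly onto tree navigation. The following sketch uses Python's standard xml.etree.ElementTree on the same library document; the helper ancestors and the parent map are illustrative additions, needed because ElementTree elements do not store a reference to their parent.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring(
    "<library><book>"
    "<chapter/>"
    "<chapter><section><paragraph/><paragraph/></section></chapter>"
    "</book></library>"
)

# Build a child -> parent map by visiting every element once.
parent_of = {child: parent for parent in library.iter() for child in parent}


def ancestors(node):
    """Return the tags of all ancestors of node, nearest first."""
    tags = []
    while node in parent_of:
        node = parent_of[node]
        tags.append(node.tag)
    return tags


book = library.find("book")
section = library.find(".//section")

children = [child.tag for child in book]
# iter() yields the element itself first, so skip it to get descendants only.
descendants = [node.tag for node in book.iter()][1:]

print(children)            # the two chapters
print(descendants)         # chapters, section, paragraphs
print(ancestors(section))  # second chapter, book, library
```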
31
Slashes
• A path that begins with a / represents an absolute path, starting from the top of the document
– Example: /email/message/header/from
– Note that even an absolute path can select more than one element
– A slash by itself means “the whole document”
• A path that does not begin with a / represents a path starting from the current element
– Example: header/from
• A path that begins with // can start from anywhere in the document
– Example: //header/from selects every element from that is a child of an element header
– This can be expensive, since it involves searching the entire document
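The three kinds of path can be tried out with Python's standard xml.etree.ElementTree, which implements a subset of XPath. One caveat, stated as an assumption of this sketch: ElementTree evaluates paths relative to an element, so the absolute path /email/message/header/from is written as ./message/header/from starting at the root element email, and //header/from is written as .//header/from. The e-mail addresses are invented sample data.

```python
import xml.etree.ElementTree as ET

email = ET.fromstring(
    "<email>"
    "<message><header><from>a@example.org</from></header></message>"
    "<message><header><from>b@example.org</from></header></message>"
    "</email>"
)

# Equivalent of the absolute path /email/message/header/from.
# It selects more than one element: one per message.
absolute = [e.text for e in email.findall("./message/header/from")]

# Relative path header/from, evaluated with the first message as current element.
first_message = email.find("message")
relative = [e.text for e in first_message.findall("header/from")]

# Equivalent of //header/from: search the whole tree below the root.
anywhere = [e.text for e in email.findall(".//header/from")]

print(absolute)
print(relative)
print(anywhere)
```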
32
Brackets and last()
• A number in brackets selects a particular matching child (counting starts from 1, except in some older versions of Internet Explorer)
– Example: /library/book[1] selects the first book of the library
– Example: //chapter/section[2] selects the second section of every chapter in the XML document
– Example: //book/chapter[1]/section[2]
– Only matching elements are counted; for example, if a book has both sections and exercises, the latter are ignored when counting sections
• The function last() in brackets selects the last matching child
– Example: /library/book/chapter[last()]
• You can even do simple arithmetic
– Example: /library/book/chapter[last()-1]
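Positional predicates are among the XPath features that Python's standard xml.etree.ElementTree supports, so they can be tried directly. In this sketch the id attributes are hypothetical labels added only to tell the chapters apart, and paths are written relative to the root element library.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring(
    "<library><book>"
    "<chapter id='c1'/><chapter id='c2'/><chapter id='c3'/>"
    "</book></library>"
)


def ids(path):
    """Return the id attribute of every element selected by path."""
    return [element.get("id") for element in library.findall(path)]


first = ids("book/chapter[1]")                # counting starts from 1
last = ids("book/chapter[last()]")            # last matching child
second_to_last = ids("book/chapter[last()-1]")

print(first, last, second_to_last)
```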
33
Stars
• A star, or asterisk, is a “wildcard”: it means “all the elements at this level”
– Example: /library/book/chapter/* selects every child of every chapter of every book in the library
– Example: //book/* selects every child of every book
– Example: /*/*/*/paragraph selects every paragraph that has exactly three element ancestors
– Example: //* selects every element in the entire document
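A short sketch of the wildcard, again with Python's standard xml.etree.ElementTree. The sample document is invented. Because ElementTree paths start at an element, .//* here selects every element below the root, whereas //* in full XPath would also include the root element itself.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring(
    "<library><book>"
    "<title/>"
    "<chapter><section/><exercise/></chapter>"
    "</book></library>"
)

# Every child of every chapter, whatever its name.
children_of_chapters = [e.tag for e in library.findall("book/chapter/*")]

# Every child of every book.
children_of_books = [e.tag for e in library.findall(".//book/*")]

# Every element below the root, in document order.
all_below_root = [e.tag for e in library.findall(".//*")]

print(children_of_chapters)
print(children_of_books)
print(all_below_root)
```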
34
Attributes I
• You can select attributes by themselves, or elements that have certain attributes
– Remember: an attribute consists of a name-value pair, for example in <chapter num="5">, the attribute is named num
– To choose the attribute itself, prefix the name with @
– Example: @num will choose the attribute named num of the current element; //@num will choose every attribute named num in the document
– Example: //@* will choose every attribute, everywhere in the document
• To choose elements that have a given attribute, put the attribute name in square brackets
– Example: //chapter[@num] will select every chapter element (anywhere in the document) that has an attribute named num
35
Attributes II
• //chapter[@num] selects every chapter element with an attribute num
• //chapter[not(@num)] selects every chapter element that does not have a num attribute
• //chapter[@*] selects every chapter element that has any attribute
• //chapter[not(@*)] selects every chapter element with no attributes
36
Values of attributes
• //chapter[@num='3'] selects every chapter element with an attribute num with value 3
• The normalize-space() function can be used to remove leading and trailing spaces from a value before comparison
– Example: //chapter[normalize-space(@num)="3"]
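The attribute tests can be sketched with Python's standard xml.etree.ElementTree, with one limitation to keep in mind: that module supports [@num] and [@num='3'] but not the not() or normalize-space() functions, so the sketch emulates those two in plain Python. The sample document and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<chapter num='1'/>"
    "<chapter num=' 3 '/>"
    "<chapter num='3'/>"
    "<chapter/>"
    "</book>"
)

# //chapter[@num]: chapters that have a num attribute.
with_num = book.findall(".//chapter[@num]")

# //chapter[@num='3']: exact comparison, so ' 3 ' does not match.
exactly_3 = book.findall(".//chapter[@num='3']")

# Emulation of //chapter[normalize-space(@num)='3']:
# collapse whitespace runs and trim both ends before comparing.
normalized_3 = [c for c in with_num if " ".join(c.get("num").split()) == "3"]

# Emulation of //chapter[not(@num)].
without_num = [c for c in book.iter("chapter") if "num" not in c.attrib]

print(len(with_num), len(exactly_3), len(normalized_3), len(without_num))
```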
37
Location Path
The central construct is the location path:
location path = location step / …/ location step
child::section [ position()<6 ] / descendant::cite / attribute::href
selects all href attributes in cite elements in the first 5 sections of a document
• A location step is evaluated with respect to some context
• A location path is evaluated left-to-right, starting with some initial context; each node resulting from the evaluation of one step is used as context for the evaluation of the next, and the results are unioned together
38
Location Step
location step = axis :: node-test [ predicate ]
axis
• a rough set of candidate nodes
– e.g. the child nodes of the context node
node-test
• performs an initial filtration based on
– types: text node, processing instruction, etc.
– names: element name
predicates
• a further, more complex, filtration
• only candidates for which the predicates evaluate to true are kept
child::section [ position()<6 ] / descendant::cite / attribute::href
39
Axes :: Node-test [ Predicate ]
child::section [ position()<6 ] / descendant::cite / attribute::href
Axes:
child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self, ancestor-or-self
Node tests:
name, *, text(), comment(), processing-instruction(), node()
Predicates:
[attribute::name="flour"]
[attribute::name!="flour"]
[attribute::amount="0.5" and attribute::unit="cup"]
[position()=2]
40
Abbreviations
Verbose form                    Abbreviation
child::                         nothing (so child is the default axis)
attribute::                     @
/descendant-or-self::node()/    //
self::node()                    .
parent::node()                  ..
.//@href
selects all href attributes in descendants of the context node.
section [ position()<6 ] // cite [ @href = "there" ]
selects all cite elements with href="there" attributes in the first 5 sections
41
XSL
42
XSL (eXtensible Stylesheet Language)
Why do we need it?
– Store in one format, display in another
e.g. transforming XML to XHTML and displaying it in a browser
– Convert to a more useful format
– Make the document more compact
Extract from XML documents only the data we need
We want to produce another document that looks the way we specify
43
XSL (eXtensible Stylesheet Language)
consists of two parts:
– XSL Transformations (XSLT)
An XSLT stylesheet is an XML document that defines a transformation from one class of XML documents into another
– XSL Formatting Objects (XSL-FO)
Specifies formatting in a more low-level and detailed way
44
A Simple Example
• File data.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="render.xsl"?>
<message>Howdy!</message>
• File render.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- one rule, to transform the input root (/) -->
  <xsl:template match="/">
    <html><body>
      <h1><xsl:value-of select="message"/></h1>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
45
The .xsl File
• An XSLT document usually has the .xsl extension
• The XSLT document begins with:
– <?xml version="1.0"?>
– <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
• Contains one or more templates, such as:
– <xsl:template match="/"> ... </xsl:template>
• And ends with:
– </xsl:stylesheet>
46
Explanation of render.xsl
The XSL was:
<xsl:template match="/">
  <html><body>
    <h1><xsl:value-of select="message"/></h1>
  </body></html>
</xsl:template>
• The <xsl:template match="/"> chooses the root
• The <html><body> <h1> is written to the output file
• The contents of message are written to the output file
• The </h1> </body></html> is written to the output file
• The resultant file looks like:
<html><body>
  <h1>Howdy!</h1>
</body></html>
47
How XSLT Works
• The XML text document is read in and stored as a tree of nodes
• Processing starts at the root: the <xsl:template match="/"> template is applied to the entire tree
• The instructions within the template build a new result tree; the source tree itself is not modified
– If there are other templates, they are used only when invoked from the main template (with xsl:apply-templates or xsl:call-template)
• Parts of the source tree that no template selects do not appear in the output
• After the template is applied, the result tree is written out as a text document
48
xsl:value-of
• <xsl:value-of select="XPath expression"/> selects the contents of an element and adds it to the output stream
– The select attribute is required
– Notice that xsl:value-of is not a container, hence it needs to end with a slash
• Example (from an earlier slide):
<h1> <xsl:value-of select="message"/> </h1>
49
xsl:for-each
• xsl:for-each is a kind of loop statement
• The syntax is
<xsl:for-each select="XPath expression">
  Text to insert and rules to apply
</xsl:for-each>
• Example: to select every book (//book) and make an unordered list (<ul>) of their titles (title), use:
<ul>
  <xsl:for-each select="//book">
    <li> <xsl:value-of select="title"/> </li>
  </xsl:for-each>
</ul>
50
Filtering output
• You can filter (restrict) output by adding a criterion to the select attribute’s value:
<ul>
  <xsl:for-each select="//book">
    <li> <xsl:value-of select="title[../author='Terry Smith']"/> </li>
  </xsl:for-each>
</ul>
• This will select book titles by Terry Smith
51
Filter details
• Here is the filter we just used: <xsl:value-of select="title[../author='Terry Smith']"/>
• author is a sibling of title, so from title we have to go up to its parent, book, then back down to author
• This filter requires a quote within a quote, so we need both single quotes and double quotes
• Legal filter operators include: = != < > (inside an attribute value, < must be written as &lt;)
52
But it doesn’t work right!
• Here’s what we did:
<xsl:for-each select="//book">
  <li> <xsl:value-of select="title[../author='Terry Smith']"/> </li>
</xsl:for-each>
• This will output <li> and </li> for every book, so we will get empty bullets for books by authors other than Terry Smith
• There is no obvious way to solve this with just xsl:value-of
53
xsl:if
• xsl:if allows us to include content if a given condition (in the test attribute) is true
• Example:
<xsl:for-each select="//book">
  <xsl:if test="author='Terry Smith'">
    <li> <xsl:value-of select="title"/> </li>
  </xsl:if>
</xsl:for-each>
• This does work correctly!
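To see why moving the test outside the list item fixes the empty bullets, it can help to read the same logic as an ordinary loop. The sketch below is a Python emulation of the for-each plus if combination, not XSLT itself; it uses the standard xml.etree.ElementTree module, and the second book and its author are invented sample data.

```python
import xml.etree.ElementTree as ET

books = ET.fromstring(
    "<books>"
    "<book><title>XML</title><author>Terry Smith</author></book>"
    "<book><title>Java</title><author>Pat Jones</author></book>"
    "</books>"
)

ul = ET.Element("ul")
for book in books.iter("book"):                      # xsl:for-each select="//book"
    if book.findtext("author") == "Terry Smith":     # xsl:if test="author='Terry Smith'"
        item = ET.SubElement(ul, "li")               # <li> is created only when the test passes
        item.text = book.findtext("title")           # xsl:value-of select="title"

html = ET.tostring(ul, encoding="unicode")
print(html)
```

Because the list item is created inside the condition, books by other authors contribute nothing to the output. The emulation checks only the first author child of each book, which is enough for this sample data.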
54
xsl:choose
• The xsl:choose ... xsl:when ... xsl:otherwise construct is XSLT’s equivalent of the switch ... case ... default statement
• The syntax is:
<xsl:choose>
  <xsl:when test="some condition"> ... some code ... </xsl:when>
  <xsl:otherwise> ... some code ... </xsl:otherwise>
</xsl:choose>
• xsl:choose is often used within an xsl:for-each loop
55
xsl:sort
• You can place an xsl:sort inside an xsl:for-each
• The select attribute of the sort tells what field to sort on
• Example:
<ul>
  <xsl:for-each select="//book">
    <xsl:sort select="author"/>
    <li> <xsl:value-of select="title"/> by <xsl:value-of select="author"/> </li>
  </xsl:for-each>
</ul>
– This example creates a list of titles and authors, sorted by author
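The effect of the sort can be sketched as an ordinary sorted loop. This is a Python emulation rather than XSLT, using the standard xml.etree.ElementTree module; the book data is invented, and Python's default string ordering stands in for the text ordering an XSLT processor would apply, which may differ for accented or mixed-case names.

```python
import xml.etree.ElementTree as ET

books = ET.fromstring(
    "<books>"
    "<book><title>XML</title><author>Terry Smith</author></book>"
    "<book><title>Java</title><author>Pat Jones</author></book>"
    "<book><title>SQL</title><author>Alex Brown</author></book>"
    "</books>"
)

# xsl:for-each select="//book" with xsl:sort select="author"
sorted_books = sorted(books.iter("book"), key=lambda b: b.findtext("author"))

# One "title by author" line per list item.
lines = [f"{b.findtext('title')} by {b.findtext('author')}" for b in sorted_books]
print(lines)
```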
56
xsl:apply-templates
• If you apply a template to an element that has child elements, templates are not automatically applied to those child elements
• The <xsl:apply-templates> element applies the matching template rules to the current element’s child nodes
• If we add a select attribute, it applies template rules only to the nodes that the select expression matches
• If we have multiple <xsl:apply-templates> elements with select attributes, the child nodes are processed in the same order as the <xsl:apply-templates> elements
57
Applying templates to children
• <book>
    <title>XML</title>
    <author>Terry Smith</author>
  </book>
• <xsl:template match="/">
    <html> <head></head> <body>
      <b><xsl:value-of select="/book/title"/></b>
      <xsl:apply-templates select="/book/author"/>
    </body> </html>
  </xsl:template>
  <xsl:template match="/book/author">
    by <i><xsl:value-of select="."/></i>
  </xsl:template>
With the xsl:apply-templates line: XML by Terry Smith
Without the xsl:apply-templates line: XML
58
Calling named templates
• You can name a template, then call it, similar to the way you would call a method in Java
• The named template:
<xsl:template name="myTemplateName"> ...body of template... </xsl:template>
• A call to the template:
<xsl:call-template name="myTemplateName"/>
• Or:
<xsl:call-template name="myTemplateName"> ...parameters... </xsl:call-template>
59
Processing model
• A list of source nodes is processed to create a result tree fragment.
• The result tree is constructed by processing a list containing just the root node.
• A list of source nodes is processed by appending the result tree structure created by processing each of the members of the list in order.
• A node is processed by finding all the template rules with patterns that match the node, and choosing the best amongst them; the chosen rule's template is then instantiated with the node as the current node and with the list of source nodes as the current node list.
• A template typically contains instructions that select an additional list of source nodes for processing.
• The process of matching, instantiation and selection is continued recursively until no new source nodes are selected for processing.
60
XQuery
61
Enter XQuery
XML documents generalize relational data
Relation R:
A   B   C
a1  b1  c1
a2  b2  c2
a3  b3  c3
The same data as XML:
<R>
  <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple>
  <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple>
  …
</R>
How should query languages like SQL be similarly generalized?
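The mapping from a relation to XML shown above is mechanical: one tuple element per row, one child element per column. Here is a minimal sketch using Python's standard xml.etree.ElementTree; the function name relation_to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def relation_to_xml(name, columns, rows):
    """Encode a relation as XML: <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tuple_element = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tuple_element, column).text = value
    return root


r = relation_to_xml(
    "R",
    ["A", "B", "C"],
    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
)
xml_text = ET.tostring(r, encoding="unicode")
print(xml_text)
```

The reverse direction is not always possible: XML allows nesting, repetition and missing children that a flat table cannot hold, which is the sense in which XML generalizes relational data.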
62
FLWOR Expressions
The main engine of XQuery is the FLWOR expression:
– For-Let-Where-Order-Return
– pronounced "flower"
– generalizes SELECT-FROM-WHERE from SQL
for $d in doc("depts.xml")//deptno
let $e := doc("emps.xml")//employee[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
  <big-dept>
    { $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal> }
  </big-dept>
– for: generates an ordered list of bindings of deptno values $d
– let: for each $d, $e = the list of employee elements with that department number; we now have an ordered list of tuples ($d,$e)
– where: filters that list to retain only the desired tuples
– order by: sorts that list by the given criteria
– return: constructs for each tuple a resulting value
The result is a list of departments with at least 10 employees, sorted by average salaries.
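The five clauses can be read as five steps of an ordinary program. The sketch below emulates the query in Python over in-memory data instead of XML files; the function name big_departments, the (deptno, salary) tuple layout and the sample figures are all assumptions made for illustration, and the threshold is a parameter so that a small data set can exercise it.

```python
from statistics import mean


def big_departments(deptnos, employees, min_headcount=10):
    """Emulate the FLWOR query; employees is a list of (deptno, salary) pairs."""
    tuples = []
    for d in deptnos:                                         # for $d in ...
        e = [salary for dept, salary in employees if dept == d]  # let $e := ...
        if len(e) >= min_headcount:                           # where count($e) >= ...
            tuples.append((d, len(e), mean(e)))
    tuples.sort(key=lambda t: t[2], reverse=True)             # order by avg(...) descending
    return tuples                                             # return (deptno, headcount, avgsal)


employees = [
    ("D1", 100), ("D1", 200),
    ("D2", 300), ("D2", 300), ("D2", 300),
    ("D3", 500),
]
result = big_departments(["D1", "D2", "D3"], employees, min_headcount=2)
print(result)
```

With the threshold lowered to 2, D3 is filtered out by the where step and the remaining departments come back with the higher average salary first.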
63
List Expressions
XQuery expressions often manipulate lists of values
for $p in distinct-values(doc("bib.xml")//publisher)
let $a := avg(doc("bib.xml")//book[publisher = $p]/price)
return
  <publisher>
    <name>{ $p }</name>
    <avgprice>{ $a }</avgprice>
  </publisher>
List functions: distinct-values, avg, …
64
Conditional expressions
XQuery supports a general if-then-else construction.
The query below extracts from the holdings of a library the titles and either editors or authors.
for $h in doc("library.xml")//holding
return
  <holding>
    { $h/title,
      if ($h/@type = "Journal") then $h/editor else $h/author }
  </holding>
65
Quantified Expressions
for $b in doc("bib.xml")//book
where some $p in $b//paragraph satisfies
  ( contains($p, "sailing") and contains($p, "fishing") )
return $b/title
finds the titles of all books which mention both sailing and fishing in the same paragraph
for $b in doc("bib.xml")//book
where every $p in $b//paragraph satisfies contains($p, "sailing")
return $b/title
finds the titles of all books which mention sailing in every paragraph
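The two quantifiers correspond to the familiar "exists" and "for all". The sketch below emulates them in Python with any() and all() over a book's paragraphs, given as plain strings; the function names and the sample sentences are invented for illustration.

```python
def mentions_both_in_one_paragraph(paragraphs):
    """some $p in paragraphs satisfies (contains sailing and contains fishing)."""
    return any("sailing" in p and "fishing" in p for p in paragraphs)


def mentions_sailing_everywhere(paragraphs):
    """every $p in paragraphs satisfies contains sailing."""
    return all("sailing" in p for p in paragraphs)


same_paragraph = ["We went sailing and fishing.", "It rained."]
different_paragraphs = ["We went sailing.", "Later we went fishing."]

print(mentions_both_in_one_paragraph(same_paragraph))
print(mentions_both_in_one_paragraph(different_paragraphs))
print(mentions_sailing_everywhere(["sailing north", "more sailing"]))
print(mentions_sailing_everywhere([]))
```

One consequence worth noticing: like all() on an empty list, an every expression over an empty sequence is true, so a book with no paragraphs at all would be returned by the second query.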