60
Querying XML Querying XML Documents and Documents and Data Data CBU Summer School CBU Summer School 13.8. - 20.8.2007 (2 ECTS) 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Univ of Kuopio, Dept of Computer Science Science [email protected] [email protected]

Querying XML Documents and Data

  • Upload
    santa

  • View
    60

  • Download
    0

Embed Size (px)

DESCRIPTION

Querying XML Documents and Data. CBU Summer School 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Science [email protected]. order. XML. invoice. Internet. Introduction & Motivation. XML appears everywhere How to query it?. - PowerPoint PPT Presentation

Citation preview

Page 1: Querying XML  Documents and Data

Querying XML Querying XML Documents and DataDocuments and Data

CBU Summer School CBU Summer School 13.8. - 20.8.2007 (2 ECTS)13.8. - 20.8.2007 (2 ECTS)Prof. Pekka KilpeläinenProf. Pekka Kilpeläinen

Univ of Kuopio, Dept of Computer ScienceUniv of Kuopio, Dept of Computer [email protected]@cs.uku.fi

Page 2: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 2

Introduction & MotivationIntroduction & Motivation

XML appears everywhereXML appears everywhere How to query it?How to query it?

XMLXML

InternetInternet

orderorder

invoiceinvoice

Page 3: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 3

Main Topic: Two XML Query ModelsMain Topic: Two XML Query Models

Region algebraRegion algebra– for retrieval of structuded textfor retrieval of structuded text– "lightweight""lightweight"

» reduced language; for ad-hoc files; efficient reduced language; for ad-hoc files; efficient free implementationfree implementation

XQueryXQuery– for general querying/manipulation of XMLfor general querying/manipulation of XML– "heavy""heavy"

» comprehensive and complex language; for (data comprehensive and complex language; for (data viewed as) XML only; production-use viewed as) XML only; production-use implementations?implementations?

Page 4: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 4

Course OutlineCourse Outline

Intro and Arrangements; Intro and Arrangements;

Structured documentsStructured documents

1 Review of XML Basics 1 Review of XML Basics 1.1 XML and XML docs; 1.2 Document grammars 1.1 XML and XML docs; 1.2 Document grammars

1.3. XML DTDs; 1.4 XML Namespaces 1.3. XML DTDs; 1.4 XML Namespaces

1.5 XML Schema1.5 XML Schema

2 Region Algebra and sgrep2 Region Algebra and sgrep

3 W3C XQuery, and XPath 2.03 W3C XQuery, and XPath 2.0

(Apologies for potential dis-organization!)(Apologies for potential dis-organization!)

Page 5: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 5

ArrangementsArrangements

Background: Background: W3C Recommendations (XML, XQuery)W3C Recommendations (XML, XQuery) Reports (Region Algebra, sgrep)Reports (Region Algebra, sgrep) Earlier courses; own research and experimentsEarlier courses; own research and experiments

Some material (to be posted) at Some material (to be posted) at http://www.cs.uku.fi/~kilpelai/CBU07/http://www.cs.uku.fi/~kilpelai/CBU07/

Plan: Plan: Lectures 12 h; Lectures 12 h; hands-on exercises 8 h hands-on exercises 8 h

Page 6: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 6

Structured DocumentsStructured Documents

DocumentDocument: : – a structured representation of information on some a structured representation of information on some

medium (medium ( message) message)

– normally for a human readernormally for a human reader» memos, manuals, articles, books, …memos, manuals, articles, books, …

– also application-to-application messagesalso application-to-application messages» e.g., btw client and server in e.g., btw client and server in Web ServicesWeb Services

– "prose-oriented XML" vs "data-oriented XML""prose-oriented XML" vs "data-oriented XML"– can be treated as a single unit can be treated as a single unit

» (a web page vs a web site)(a web page vs a web site)

Page 7: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 7

Presentation vs StructurePresentation vs Structure

Presentation informs the Presentation informs the human readerhuman reader about the about the meaning of text and the role of its partsmeaning of text and the role of its parts

Markup Markup indicates the presentation or the indicates the presentation or the meaning of different parts of text meaning of different parts of text

» originally hand-written annotations for the typesetter originally hand-written annotations for the typesetter

– nowadays primarily codes embedded in digital nowadays primarily codes embedded in digital documents; documents; <Tags><Tags>

Page 8: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 8

Markup and Markup LanguageMarkup and Markup Language

Procedural markup Procedural markup – commands (start boldface, produce empty line, indent commands (start boldface, produce empty line, indent

5 mm, ...)5 mm, ...)– proprietary word processor formats, nroff, TeX, ...proprietary word processor formats, nroff, TeX, ...

Descriptive Descriptive oror generic markup generic markup– indicates conceptual structures using chosen namesindicates conceptual structures using chosen names– LaTeX: LaTeX: \begin{abstract}\begin{abstract} ... ... \\end{abstract}end{abstract}

– HTML: HTML: <TITLE> <TITLE> ...... </TITLE> </TITLE> Markup languageMarkup language

– a fixed set of markup notations (e.g. nroff, TeX, HTML, a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …) SVG, …)

Page 9: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 9

Structure in DocumentsStructure in Documents

HierarchyHierarchy or or nestingnesting is ubiquitous is ubiquitous– Sections w. subsections etcSections w. subsections etc– (Also overlapping hierarchies!)(Also overlapping hierarchies!)

Linear orderLinear order essential in prose documents essential in prose documents– less important in documents representing data objectsless important in documents representing data objects

HypertextHypertext and and cross-referencescross-references

XML: proper hierarchies, tree-like structures, XML: proper hierarchies, tree-like structures, with cross-references via attribute valueswith cross-references via attribute values

Page 10: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 10

1 Document Instances and Grammars1 Document Instances and Grammars

Overview of fundamentals, and some Overview of fundamentals, and some details, of XMLdetails, of XML

1.1 XML and XML documents1.1 XML and XML documents1.2 Basics of document grammars 1.2 Basics of document grammars 1.3 Basics of XML DTDs1.3 Basics of XML DTDs1.4 XML Namespaces1.4 XML Namespaces

1.5 XML Schema1.5 XML Schema

Page 11: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 11

2.1 XML and XML documents2.1 XML and XML documents

XML - Extensible Markup Language,XML - Extensible Markup Language,W3C Recommendation, February 1998W3C Recommendation, February 1998– not an official standard, but a stable industry standardnot an official standard, but a stable industry standard– 22ndnd Ed 2000, 3 Ed 2000, 3rdrd Ed 2004, 4 Ed 2004, 4thth Ed 2006 Ed 2006

» editorial revisions, editorial revisions, notnot new versions of XML 1.0 new versions of XML 1.0

a simplified subset of SGML, Standard a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987Generalized Markup Language, ISO 8879:1987– validvalid XML documents are also SGML documents XML documents are also SGML documents

Page 12: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 12

What is XML?What is XML?

ExtensibleExtensible Markup Language Markup Language is is notnot a markup a markup language! language! – does not fix a tag set nor its semantics does not fix a tag set nor its semantics

(like markup languages like HTML do)(like markup languages like HTML do)

XML documents have XML documents have no inherentno inherent (processing or (processing or presentation) presentation) semanticssemantics– even though many think that XML is semantic or self-even though many think that XML is semantic or self-

describing; See nextdescribing; See next

Page 13: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 13

Semantics of XML MarkupSemantics of XML Markup

Meaning of this XML fragment?Meaning of this XML fragment?

– The application has to “understand” the tagsThe application has to “understand” the tags– But better off with the tags, though!But better off with the tags, though!

Page 14: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 14

What is XML (2)?What is XML (2)?

XML XML isis– a way to use markup to represent informationa way to use markup to represent information– a a metalanguagemetalanguage

» supports definition of specific markup languages through XML supports definition of specific markup languages through XML DTDs (Document Type Definitions) or SchemasDTDs (Document Type Definitions) or Schemas

» E.g. XHTML a reformulation of HTML using XMLE.g. XHTML a reformulation of HTML using XML

Often “XML” Often “XML” XML + XML technology XML + XML technology

Page 15: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 15

How does it look?How does it look?

<?xml version=’1.0’ encoding=”iso-8859-1” ?><?xml version=’1.0’ encoding=”iso-8859-1” ?>

<invoice num=”1234”><invoice num=”1234”>

<client clNum=”00-01”><client clNum=”00-01”> <name>Pekka Kilpeläinen</name> <name>Pekka Kilpeläinen</name>

<email>[email protected]</email><email>[email protected]</email>

</client></client>

<item price=”60” unit=”EUR”><item price=”60” unit=”EUR”>XML Handbook</item> XML Handbook</item>

<item price=”350” unit=”FIM”><item price=”350” unit=”FIM”>XSLT Programmer’s Ref</item>XSLT Programmer’s Ref</item>

</invoice></invoice>

Page 16: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 16

Essential Features of XMLEssential Features of XML

Overview of XML essentialsOverview of XML essentials– many details skippedmany details skipped– Learn to consult original sources Learn to consult original sources

(specifications, documentation etc) for details!(specifications, documentation etc) for details!» The XML specification is easy to browseThe XML specification is easy to browse

First of all, XML is a textual or character-based First of all, XML is a textual or character-based way to represent dataway to represent data

Page 17: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 17

XML Document CharactersXML Document Characters

XML documents are made of ISO-10646 (32-bit) XML documents are made of ISO-10646 (32-bit) characterscharacters; in practice of their 16-bit Unicode ; in practice of their 16-bit Unicode subset (used, e.g., in Java)subset (used, e.g., in Java)– Unicode 2.0 defines almost 39,000 distinct charactersUnicode 2.0 defines almost 39,000 distinct characters

Characters have three different aspectsCharacters have three different aspects::– their identification as numeric code pointstheir identification as numeric code points– their their representationrepresentation by bytes by bytes– theirtheir visual presentation visual presentation

Page 18: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 18

External Aspects of CharactersExternal Aspects of Characters

Documents are stored/transmitted as a sequence Documents are stored/transmitted as a sequence of bytes (of 8 bits). An of bytes (of 8 bits). An encodingencoding determines how determines how characters are characters are representedrepresented by bytes. by bytes.– UTF-8 (UTF-8 (7-bit ASCII) is the XML default encoding7-bit ASCII) is the XML default encoding– encoding="KOI8R"encoding="KOI8R" should be OK for Cyrillic textsshould be OK for Cyrillic texts

» (I cannot comment on parser support)(I cannot comment on parser support)

A A fontfont determines the determines the visual presentationvisual presentation of of characterscharacters

Page 19: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 19

XML Encoding of Structure 1XML Encoding of Structure 1

XML is, essentially, a textual encoding scheme of XML is, essentially, a textual encoding scheme of labelledlabelled, , orderedordered and and attributedattributed treestrees::– internal nodes are internal nodes are elementselements labelled by type names labelled by type names– leaves are leaves are text nodestext nodes labelled by string values, or labelled by string values, or

empty element nodesempty element nodes– the left-to-right order of children of a node mattersthe left-to-right order of children of a node matters– element nodes may carry element nodes may carry attributesattributes

= (name, string-value) pairs= (name, string-value) pairs

This view is shared by many XML techniques This view is shared by many XML techniques (DOM, (DOM, XPathXPath, XSLT, , XSLT, XQueryXQuery, ...), ...)

Page 20: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 20

XML Encoding of Structure 2XML Encoding of Structure 2

XML encoding of a treeXML encoding of a tree– corresponds to a pre-order walkcorresponds to a pre-order walk– start of an element node with type name A start of an element node with type name A

denoted by a denoted by a start tagstart tag <A>, and its end <A>, and its end denoted by denoted by end tagend tag </A> </A>

– possible attributes written within the start tag:possible attributes written within the start tag:<A attr<A attr11=“value=“value11” … attr” … attrnn=“value=“valuenn”>”>

» Names attrNames attr11,…,attr,…,attrn n must be distinctmust be distinct

– text nodes written as their string valuetext nodes written as their string value

Page 21: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 21

XML Encoding of Structure: XML Encoding of Structure: ExampleExample

<S><S>

SS

EE

<W><W> <W><W></W></W> <E A=‘1’/><E A=‘1’/>HelloHello world!world!

WW

HelloHello

WW

world!world!

</W></W> </S></S>

A=1A=1

Page 22: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 22

XML: Logical Document StructureXML: Logical Document Structure

ElementsElements – indicated by matching (case-sensitive!) tagsindicated by matching (case-sensitive!) tags<ElementTypeName> <ElementTypeName> …… </ElementTypeName></ElementTypeName>

– can contain text and/or subelementscan contain text and/or subelements– can be can be emptyempty::

<elem-type></elem-type><elem-type></elem-type> or or <elem-type/><elem-type/> (e.g. (e.g. <br/><br/> in in

XHTML)XHTML)– unique root element unique root element document a single tree document a single tree

Page 23: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 23

Logical document structure (2)Logical document structure (2)

AttributesAttributes– name-value pairs attached to elementsname-value pairs attached to elements– in start-tag after the element type namein start-tag after the element type name

<div class="preface" date='990126'><div class="preface" date='990126'> … …

– forms forms ""......"" and and ''......'' are interchangeable are interchangeable Also:Also:

– <!--<!-- commentscomments outside other markup outside other markup -->-->– <?note <?note this would be passed to the application as a this would be passed to the application as a

processing instruction named ‘note’processing instruction named ‘note’?>?>

Page 24: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 24

CDATA SectionsCDATA Sections

““CDATA Sections” to include XML markup CDATA Sections” to include XML markup characters as textual contentcharacters as textual content

<![CDATA[<![CDATA[ Here we can easily include markup Here we can easily include markup characters and, for example, code characters and, for example, code fragments:fragments:

<example>if (Count < 5 && Count > 0) <example>if (Count < 5 && Count > 0) </example></example>

]]>]]>

Page 25: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 25

Two levels of correctness (1)Two levels of correctness (1)

Well-formedWell-formed documents documents – roughly: follows the syntax of XML,roughly: follows the syntax of XML,

markup correct (elements properly nested, tag markup correct (elements properly nested, tag names match, attributes of an element have names match, attributes of an element have unique names, ...)unique names, ...)

– violation is a fatal errorviolation is a fatal error ValidValid documentsdocuments

– (in addition to being well-formed) (in addition to being well-formed) obey an associated grammar (DTD/Schema)obey an associated grammar (DTD/Schema)

Page 26: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 26

XML docs and valid XML docsXML docs and valid XML docs

XML documents = well-formed XML documentsXML documents = well-formed XML documents

DTD-valid documentsDTD-valid documents Schema-valid documentsSchema-valid documents

Page 27: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 27

An XML Processor (Parser)An XML Processor (Parser)

Reads XML documents and reports their contents Reads XML documents and reports their contents to an application to an application – relieves the application from details of markup relieves the application from details of markup – XML Recommendation specifies: XML Recommendation specifies: – recognition of characters as markup or data; what recognition of characters as markup or data; what

information to pass to applications; information to pass to applications; how to check the correctness of documents; how to check the correctness of documents;

– validation based on comparing document against its validation based on comparing document against its grammar grammar

Next: Basics of document grammarsNext: Basics of document grammars

Page 28: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 28

1.2 Basics of document grammars1.2 Basics of document grammars

DTDs are variations of DTDs are variations of context-free grammarscontext-free grammars (CFGs), which are widely used to syntax (CFGs), which are widely used to syntax specification (programming languages, XML, …) specification (programming languages, XML, …) and to parser/compiler generation (e.g. and to parser/compiler generation (e.g. YACC/GNU Bison)YACC/GNU Bison)– No knowledge of them is necessary, but connections No knowledge of them is necessary, but connections

with CFGs may be informative for those that know about with CFGs may be informative for those that know about themthem

Page 29: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 29

DTD/CFG CorrespondenceDTD/CFG Correspondence

DTDDTD

----------------------------------------------------------------

XML documentXML document

element typeelement type

element type declarationelement type declaration

#PCDATA#PCDATA

CFGCFG

------------------------------------------------

parse/syntax treeparse/syntax tree

nonterminalnonterminal

productionproduction

terminalterminal

Page 30: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 30

Example: Three Authors of a RefExample: Three Authors of a Ref

RefRef

Author Author Author Author Author Author TitleTitle

. . .. . .

PublDataPublData

Aho Aho Hopcroft Hopcroft Ullman Ullman The Design and Analysis ...The Design and Analysis ...

Ref Ref Author* Title PublData Author* Title PublData P, P,Author Author Author Title PublData Author Author Author Title PublData L( L(Author* Title PublDataAuthor* Title PublData))

Page 31: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 31

Extended ProductionsExtended Productions

Notice the Notice the regular expressionsregular expressions in in productionsproductions– to describe (potentially infinite) sequencesto describe (potentially infinite) sequences

That is, we are using That is, we are using extendedextended CFGs CFGs– content models (of a DTD) correspond to content models (of a DTD) correspond to

regular expressions (in an ECFG production)regular expressions (in an ECFG production)– > number of element’s children generally > number of element’s children generally

unlimited unlimited

Page 32: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 32

1.3 Basics of XML DTDs1.3 Basics of XML DTDs

A A Document Type DeclarationDocument Type Declaration provides a provides a grammar (grammar (document type definitiondocument type definition,, DTD DTD) for a ) for a class of documents [Defined in XML Rec]class of documents [Defined in XML Rec]

Syntax (in the prolog of a document instance):Syntax (in the prolog of a document instance):<!DOCTYPE<!DOCTYPE rootElemType rootElemType SYSTEMSYSTEM "ex.dtd" "ex.dtd"<!-- <!-- "external subset" in file ex.dtd"external subset" in file ex.dtd --> --> [[ <!–- <!–- an optional "internal subset" an optional "internal subset" --> --> ]]>>

DTD = union of the external and internal subsetDTD = union of the external and internal subset– internal has preference for attribute and entity declsinternal has preference for attribute and entity decls

Page 33: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 33

Markup DeclarationsMarkup Declarations

DTD consists of DTD consists of markup declarationsmarkup declarations – element type declarationselement type declarations

» ≈≈ productions of ECFGsproductions of ECFGs

– attribute-list declarationsattribute-list declarations » for declared element typesfor declared element types

– entity declarationsentity declarations» for physical structuresfor physical structures

– notation declarationsnotation declarations

logical structureslogical structures

Page 34: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 34

How do Declarations Look Like?How do Declarations Look Like?

<!ELEMENT invoice (client, item+)><!ELEMENT invoice (client, item+)>

<!ATTLIST invoice num NMTOKEN #REQUIRED><!ATTLIST invoice num NMTOKEN #REQUIRED>

<!ELEMENT client (name, email?)> <!ELEMENT client (name, email?)>

<!ATTLIST client num NMTOKEN #REQUIRED><!ATTLIST client num NMTOKEN #REQUIRED>

<!ELEMENT name (#PCDATA)> <!ELEMENT name (#PCDATA)>

<!ELEMENT email (#PCDATA)> <!ELEMENT email (#PCDATA)>

<!ELEMENT item (#PCDATA)><!ELEMENT item (#PCDATA)>

<!ATTLIST item <!ATTLIST item

priceprice NMTOKEN #REQUIREDNMTOKEN #REQUIRED

unit (FIM | EUR) ”EUR” >unit (FIM | EUR) ”EUR” >

Page 35: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 35

Element Type DeclarationsElement Type Declarations

General form:General form:<!ELEMENT<!ELEMENT elementTypeName elementTypeName ((EE)>)>

where where EE is a is a content modelcontent model regular expression of element namesregular expression of element names Content model operators:Content model operators:

E | F : choiceE | F : choice EE,, F: concatenation F: concatenationE? : optionalE? : optional E* : zero or moreE* : zero or moreE+ : one or moreE+ : one or more (E) : grouping(E) : grouping

Must groupMust group: : (A,B)|C or A,(B|C), but A,B|C forbidden(A,B)|C or A,(B|C), but A,B|C forbidden

Page 36: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 36

Attribute-List DeclarationsAttribute-List Declarations

Can declare attributes for elements:Can declare attributes for elements:– Name, data type and possible default value Name, data type and possible default value

Example:Example:<!ATTLIST FIG<!ATTLIST FIG

idid IDID #IMPLIED#IMPLIEDdescr CDATA #REQUIREDdescr CDATA #REQUIREDclass (a | b | c) class (a | b | c) "a">"a">

Semantics mainly up to the applicationSemantics mainly up to the application– processor checks that processor checks that IDID attributes are unique and that attributes are unique and that

targets of targets of IDREFIDREF attributes exist attributes exist

Page 37: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 37

Mixed, Empty and Arbitrary ContentMixed, Empty and Arbitrary Content

Mixed contentMixed content::<!ELEMENT P<!ELEMENT P (#PCDATA | I | IMG)*>(#PCDATA | I | IMG)*>

– may contain text and elementsmay contain text and elements Empty contentEmpty content::

<!ELEMENT IMG <!ELEMENT IMG EMPTYEMPTY>>

Unrestricted content: Unrestricted content: ANYANY

(= (= (#PCDATA |(#PCDATA | choice-of-all-declared-element-typeschoice-of-all-declared-element-types)* )* ))

Page 38: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 38

Entities (1)Entities (1)

Named storage units of XML documentsNamed storage units of XML documents Multiple uses:Multiple uses:

– character entitiescharacter entities: : » &lt;&lt; &#60;&#60; and and &#x3C;&#x3C; all expand to ‘ all expand to ‘<<‘‘

(treated as data, not as start-of-markup)(treated as data, not as start-of-markup)

» other other predefined entitiespredefined entities: : &amp; &gt; &apos; &quote;&amp; &gt; &apos; &quote;expand toexpand to &&,, > >,, ' ' andand ""

– general entitiesgeneral entities are shorthand notations: are shorthand notations:<!ENTITY UKU "University of Kuopio"><!ENTITY UKU "University of Kuopio">

Page 39: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 39

Entities (2)Entities (2)

physical storage units comprising a documentphysical storage units comprising a document– parsed entitiesparsed entities

<!ENTITY chap1 SYSTEM <!ENTITY chap1 SYSTEM "http://myweb/ch1">"http://myweb/ch1">

– document entity document entity is the starting point of processingis the starting point of processing– entities and elements must nest properly:entities and elements must nest properly:

<!DOCTYPE doc [<!DOCTYPE doc [<!ENTITY chap1 <!ENTITY chap1

((… as above …)… as above …) > ]>> ]><doc><doc>

&chap1;&chap1;</doc></doc>

<sec num="1"><sec num="1">… …

</sec></sec><sec num="2"><sec num="2">

… … </sec></sec>

Page 40: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 40

Unparsed Entities and Parameter EntitiesUnparsed Entities and Parameter Entities

Unparsed entitiesUnparsed entities allow XML documents refer to allow XML documents refer to external binary objects like graphics files external binary objects like graphics files – XML processor handles only textXML processor handles only text– I've rarely used theseI've rarely used these

Parameter entitiesParameter entities are used in DTDs are used in DTDs– useful for modularizing declarationsuseful for modularizing declarations

We skip theseWe skip these

Page 41: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 41

1.4 XML Namespaces1.4 XML Namespaces

Documents often comprise parts processed by different Documents often comprise parts processed by different applications (and/or defined by different grammars) applications (and/or defined by different grammars)

– for example, in XSLT scripts:for example, in XSLT scripts:

<xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1>

<xsl:apply-templates /><xsl:apply-templates /> </H1> </H1>

</xsl:template></xsl:template>

– How to manage multiple sets of names?How to manage multiple sets of names?

HTML HTML elementselements

XSLT XSLT elements/elements/

instructionsinstructions

Page 42: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 42

XML Namespaces (2/5) XML Namespaces (2/5)

Solution: Solution: – By introducing (arbitrary) local name By introducing (arbitrary) local name prefixesprefixes, ,

and binding them to (fixed) globally unique URIsand binding them to (fixed) globally unique URIs– For example, the local prefix “For example, the local prefix “xsl:xsl:” ”

conventionally used in XSLT scriptsconventionally used in XSLT scripts

Page 43: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 43

XML Namespaces briefly (3/5)XML Namespaces briefly (3/5)

Namespace identified by a URI (through Namespace identified by a URI (through the associated local prexif) the associated local prexif) e.g.e.g. http://www.w3.org/http://www.w3.org/1999/XSL/Transform1999/XSL/Transform for XSLTfor XSLT

– conventional but not required to use URLsconventional but not required to use URLs– the identifier has to be unique, but no need to be an the identifier has to be unique, but no need to be an

addressaddress

Association inherited to sub-elementsAssociation inherited to sub-elements– see the next example (of an XSLT script)see the next example (of an XSLT script)

Page 44: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 44

XML Namespaces (4/5) XML Namespaces (4/5)

<xsl:stylesheet version=<xsl:stylesheet version="1.0""1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/TR/xhtml1/strict">xmlns="http://www.w3.org/TR/xhtml1/strict">

<!-- XHTML is the ’default namespace’ --><!-- XHTML is the ’default namespace’ --><xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1>

<xsl:apply-templates /><xsl:apply-templates /> </H1> </H1> </xsl:template> </xsl:template>

</xsl:stylesheet></xsl:stylesheet>

Page 45: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 45

XML Namespaces briefly (5/5)XML Namespaces briefly (5/5)

Mechanism built on top of basic XMLMechanism built on top of basic XML– overloads attribute syntax (overloads attribute syntax (xmlns:xmlns:) to introduce ) to introduce

namespacesnamespaces– does not affect validation does not affect validation

» namespace attributes have to be declared for DTD-namespace attributes have to be declared for DTD-validityvalidity

» all element type names have to be declared (with their all element type names have to be declared (with their initial prefixes!)initial prefixes!)

– > Other schema languages (XML Schema, Relax NG) > Other schema languages (XML Schema, Relax NG) better for validating documents with Namespacesbetter for validating documents with Namespaces

Page 46: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 46

1.5 XML Schemas1.5 XML Schemas

A quick look at XML SchemaA quick look at XML Schema– W3C Recommendation,W3C Recommendation,

11stst Ed. May, 2001; 2 Ed. May, 2001; 2ndnd Ed. Oct, 2004: Ed. Oct, 2004:» XML Schema Part 0: Primer (readable non-XML Schema Part 0: Primer (readable non-

normative introduction; Recommended)normative introduction; Recommended)

» XML Schema Part 1: StructuresXML Schema Part 1: Structures

» XML Schema Part 2: DatatypesXML Schema Part 2: Datatypes

– W3C Draft (didn't lead anywhere?):W3C Draft (didn't lead anywhere?):» Formal Description, 9/2001 Formal Description, 9/2001

Page 47: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 47

Advantages of XML Schema Advantages of XML Schema (1)(1)

XML syntaxXML syntax– easier to manipulate by programs (than DTDs)easier to manipulate by programs (than DTDs)

Compatibility with namespacesCompatibility with namespaces– can validate against declarations from multiple can validate against declarations from multiple

sourcessources Content datatypesContent datatypes

– 44 built-in datatypes (including primitive Java 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute datatypes, datatypes of SQL, and XML attribute types)types)

– mechanisms to derive user-defined datatypesmechanisms to derive user-defined datatypes– used as types of XQueryused as types of XQuery

Page 48: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 48

XSDL built-in types XSDL built-in types

(Part 2, Chap. 3)(Part 2, Chap. 3)

NB: all simple values in NB: all simple values in documents documents stringsstrings

**CDATACDATA

**

**

**

**

**

**

**

**

*: XML attribute *: XML attribute typestypes

Page 49: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 49

Advantages of XML Schema Advantages of XML Schema (2)(2)

Element names and Element names and content typescontent types independent; Compare with independent; Compare with – For example, could define For example, could define titlestitles

» of people as “Mr.”/”Mrs.”/”Ms.”, andof people as “Mr.”/”Mrs.”/”Ms.”, and» of chapters as stringsof chapters as strings

– > extend the power of CFGs/DTDs > extend the power of CFGs/DTDs » where non-terminal / tag-name alone determines where non-terminal / tag-name alone determines

its allowed content its allowed content

– (Is this relevant in practice?) (Is this relevant in practice?)

Page 50: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 50

Advantages of XML Schema Advantages of XML Schema (3)(3)

Ability to specify uniqueness and keys within Ability to specify uniqueness and keys within selected parts of the documentselected parts of the document– for example, that for example, that titletitless of chapters should be unique; or of chapters should be unique; or

key attributes of relationskey attributes of relations– uses XPathuses XPath

Support for schema documentation Support for schema documentation – element element annotationannotation with sub-elements with sub-elements

documentationdocumentation (for human readers) and(for human readers) andappInfoappInfo (for applications)(for applications)

– Only these contain text (#PCDATA)Only these contain text (#PCDATA)

Page 51: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 51

Disadvantages of XML Disadvantages of XML SchemaSchema

Complexity (esp. Rec Part 1!) vs. added power Complexity (esp. Rec Part 1!) vs. added power – > a long learning curve> a long learning curve– > slow adoption by users> slow adoption by users

Immaturity of implementations (?)Immaturity of implementations (?)– W3C web site mentions ~ 60 tools/processorsW3C web site mentions ~ 60 tools/processors– Apache Xerces claims full XSDL supportApache Xerces claims full XSDL support– Some features difficult to implement efficientlySome features difficult to implement efficiently

Alternative schema languages have been suggested, Alternative schema languages have been suggested, tootoo– Relax NGRelax NG– SchematronSchematron– ... ...

Page 52: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 52

XSDL through ExampleXSDL through Example

– Next: walk-through of an XML schema exampleNext: walk-through of an XML schema example– from Chapter 2 of the XML Schema Primerfrom Chapter 2 of the XML Schema Primer

– Consider modelling purchase orders like below:Consider modelling purchase orders like below:

<purchaseOrder orderDate="1999-10-20"><purchaseOrder orderDate="1999-10-20"> <shipTo country="US"> <shipTo country="US"> <name>Alice Smith</name> <name>Alice Smith</name> <street>123 Maple Street</street> <street>123 Maple Street</street> <city>Mill Valley</city> <city>Mill Valley</city> <state>CA</state> <state>CA</state> <zip>90952</zip> <zip>90952</zip> </shipTo> </shipTo>

Page 53: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 53

purchaseOrderpurchaseOrder instance instance continuescontinues

<billTo country="US"><billTo country="US"> <name>Robert Smith</name> <name>Robert Smith</name> <street>8 Oak Avenue</street> <street>8 Oak Avenue</street> <city>Old Town</city> <city>Old Town</city> <state>PA</state> <state>PA</state> <zip>95819</zip> </billTo> <zip>95819</zip> </billTo> <comment>Hurry, my lawn is wild!</comment> <comment>Hurry, my lawn is wild!</comment>

<items><item partNum="872-AA"><items><item partNum="872-AA"> <productName>Lawnmower</productName><productName>Lawnmower</productName> <quantity>1</quantity><quantity>1</quantity> <USPrice>148.95</USPrice><USPrice>148.95</USPrice> <comment>Only if electric</comment><comment>Only if electric</comment> </item></item>

Page 54: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 54

End of the example instanceEnd of the example instance

<item partNum="926-AA"><item partNum="926-AA"> <productName>Baby Phone</productName><productName>Baby Phone</productName> <quantity>1</quantity><quantity>1</quantity> <USPrice>39.98</USPrice><USPrice>39.98</USPrice> <shipDate>1999-05-21</shipDate><shipDate>1999-05-21</shipDate> </item></item> </items></items></purchaseOrder></purchaseOrder>

Next: A schema for such purchase ordersNext: A schema for such purchase orders

Page 55: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 55

The Purchase Order Schema (1/5)The Purchase Order Schema (1/5)

<xs:schema <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="purchaseOrder" type="POrdType"/><xs:element name="purchaseOrder" type="POrdType"/>

<xs:element name="comment" type="xs:string"/><xs:element name="comment" type="xs:string"/>

<xs:complexType name="POrdType"> <xs:complexType name="POrdType"> <xs:sequence> <xs:sequence>

<xs:element name="shipTo" type="USAddr"/> <xs:element name="shipTo" type="USAddr"/> <xs:element name="billTo" type="USAddr"/> <xs:element name="billTo" type="USAddr"/> <xs:element ref="comment" minOccurs="0"/> <xs:element ref="comment" minOccurs="0"/> <xs:element name="items" type="Items"/> <xs:element name="items" type="Items"/>

</xs:sequence> </xs:sequence> <xs:attribute name="ordDate" type="xs:date"/> <xs:attribute name="ordDate" type="xs:date"/> </xs:complexType></xs:complexType>

Page 56: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 56

The Purchase Order Schema (2/5)The Purchase Order Schema (2/5)

<xs:complexType name="USAddr"> <xs:complexType name="USAddr"> <xs:sequence> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="name" type="xs:string"/> <xs:element name="street" <xs:element name="street"

type="xs:string"/> type="xs:string"/> <xs:element name="city" <xs:element name="city" type="xs:string"/> type="xs:string"/> <xs:element name="state" <xs:element name="state" type="xs:string"/> type="xs:string"/> <xs:element name="zip" <xs:element name="zip" type="xs:decimal"/> type="xs:decimal"/> </xs:sequence> </xs:sequence>

<xs:attribute name="country" <xs:attribute name="country" type="xs:NMTOKEN" fixed="US"/> type="xs:NMTOKEN" fixed="US"/>

</xs:complexType></xs:complexType>

Page 57: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 57

The Purchase Order Schema (3/5)The Purchase Order Schema (3/5)

<xs:complexType name="Items"> <xs:complexType name="Items"> <xs:sequence> <xs:sequence> <xs:element name="item" <xs:element name="item"

minOccurs="0" maxOccurs="unbounded"> minOccurs="0" maxOccurs="unbounded"> <xs:complexType> <xs:complexType> <xs:sequence> <xs:sequence>

<xs:element name="productName" <xs:element name="productName" type="xs:string"/> type="xs:string"/>

<xs:element name="quantity"> <xs:element name="quantity"> <xs:simpleType> <xs:simpleType>

<xs:restriction <xs:restriction base="xs:positiveInteger"> base="xs:positiveInteger">

<xs:maxExclusive value="100"/> <xs:maxExclusive value="100"/> </xs:restriction> </xs:restriction>

</xs:simpleType> </xs:simpleType> </xs:element> </xs:element>

anonymous type for anonymous type for itemitem

anon. type anon. type for for

quantityquantity

Page 58: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 58

The Purchase Order Schema (4/5)The Purchase Order Schema (4/5)

<xs:element name="USPrice" <xs:element name="USPrice" type="xs:decimal"/> type="xs:decimal"/>

<xs:element ref="comment" <xs:element ref="comment" minOccurs="0"/> minOccurs="0"/>

<xs:element name="shipDate" <xs:element name="shipDate" type="xs:date" type="xs:date"

minOccurs="0"/> minOccurs="0"/> </xs:sequence> </xs:sequence> <xs:attribute name="partNum" type="SKU" <xs:attribute name="partNum" type="SKU"

use="required"/> use="required"/> </xs:complexType> </xs:complexType> </xs:element> <!-- item --> </xs:element> <!-- item --> </xs:sequence> </xs:sequence>

</xs:complexType> <!-- Items --></xs:complexType> <!-- Items -->

Page 59: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 59

The Purchase Order Schema (5/5)The Purchase Order Schema (5/5)

<!-- Type for Stock Keeping Units, <!-- Type for Stock Keeping Units, (codes for identifying products): --> (codes for identifying products): -->

<xs:simpleType name="SKU"> <xs:simpleType name="SKU"> <xs:restriction base="xs:string"><xs:restriction base="xs:string"><!-- defined by a regular expr: --> <!-- defined by a regular expr: --> <xs:pattern value="\d{3}-[A-Z]{2}" /> <xs:pattern value="\d{3}-[A-Z]{2}" />

<!-- 3 digits, hyphen, 2 letters --> <!-- 3 digits, hyphen, 2 letters --> </xs:restriction> </xs:restriction>

</xs:simpleType> </xs:simpleType>

</xs:schema> </xs:schema>

Page 60: Querying XML  Documents and Data

CBU Summerschool '07

Querying XML: Introduction 60

XML Schema: SummaryXML Schema: Summary

XSDL: an XML-based grammar XSDL: an XML-based grammar formalismformalism– W3C Recommendation; Alternative to DTDsW3C Recommendation; Alternative to DTDs

» support for namespacessupport for namespaces» richer content and attribute datatypesricher content and attribute datatypes

Well accepted(?) in XML industryWell accepted(?) in XML industry– e.g., to describe messages btw clients and servers e.g., to describe messages btw clients and servers

in in Web servicesWeb services; (See, e.g., Web Services ; (See, e.g., Web Services Description Language, Vers. 2.0, W3C Draft 3/07)Description Language, Vers. 2.0, W3C Draft 3/07)

– for typing of for typing of XQueryXQuery