Processing of structured documents
Spring 2002, Part 1
Helena Ahonen-Myka
2
Course organization
581290-5 laudatur course, 3 cu
lectures (in Finnish)
  22.1.-21.2., Tue 12-14, Thu 10-12
  not obligatory
exercise sessions 29.1.-27.2.
  course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14, A318)
  not obligatory
3
Requirements
Exam (Wed 6.3. at 16-20): 45 points
Project: 15 points
Exercises: 5 extra points
Maximum points: 60
4
Outline (preliminary)
1. Descriptions of structure
  context-free grammars
  namespaces, information sets
  (XML DTD,) XML Schema
2. Programming interfaces
  SAX, DOM
  SOAP
3. Traversing documents
  XPath
5
Outline...
4. Querying structured documents
  XML Query
5. XML Linking
6. XML databases
7. Metadata: RDF
8. Compressing XML data
9. ...
6
Prerequisites
You should know the basics of XML
  DTD, elements, attributes, syntax
  XSLT (basics), formatting
Some programming experience is needed
7
Group project
Groups of 4-5 students
  groups are formed in the exercise sessions in the 2nd week
Task: construct a toy B2B e-commerce application
  a travel agency which sells packages containing hotel nights and concerts
  a hotel (or several)
  a concert ticket office
8
Group project
Task continues
  a customer can reserve packages using a web page
  a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets
  for all the communication and for the storage of all the documents you should use XML
9
Group project
Try to get some simple implementation to work
  may depend on the support we can offer
  you don't have to consider all the real-life problems, like consistency of reservations
  concentrate on playing with XML
State of the work is presented in the last exercise sessions (also by students who don't normally attend exercises)
10
Requirements for project
More instructions follow later...
Return a report by 22.3. (as a URL)
The report should include
  (short) requirements analysis
  descriptions of the structure (DTD, Schema)
  other designs, architecture, ...
Some kind of a working prototype
  not necessarily the whole system
11
1. Structure descriptions
Regular expressions, context-free grammars -> What is XML?
(XML Document type definitions)
namespaces, information sets
XML Schema
12
Regular expressions
A way to describe sets of strings over an alphabet (of chars, events, elements, …)
many uses:
  text searching (e.g. emacs, grep, perl)
  in grammatical formalisms (e.g. XML DTDs)
relevant for document structures: what kind of structural content is allowed for different document components
13
Regular expressions
A regular expression over alphabet Σ is either
  ∅ (the empty set)
  ε (epsilon; sometimes lambda, λ)
  a, where a ∈ Σ
  R | S (choice; sometimes R ∪ S)
  R S (catenation) or
  R* (Kleene closure)
where R and S are regular expressions
14
Regular expressions
Regular expression E denotes a language (a set of strings) L(E):
  L(∅) = ∅ (empty set)
  L(ε) = {ε} (singleton set of the empty string)
  L(a) = {a} (singleton set of a ∈ Σ)
  L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
  L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
  L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
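The definition of L(E) can be tried out directly. The following is a minimal sketch, not part of the original slides: it computes the strings of L(E) up to a length bound, since L(R*) is infinite in general. The tuple encoding of expressions and the name `language` are assumptions made for this illustration.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("empty",)      the empty set
#   ("eps",)        epsilon, the empty string
#   ("sym", a)      a single symbol a of the alphabet
#   ("alt", R, S)   choice R | S
#   ("cat", R, S)   catenation R S
#   ("star", R)     Kleene closure R*

def language(expr, max_len):
    """Return the strings of L(expr) with at most max_len symbols.

    Strings are represented as tuples of symbols; () is the empty string.
    """
    kind = expr[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {()}
    if kind == "sym":
        return {(expr[1],)} if max_len >= 1 else set()
    if kind == "alt":
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        # Drop the empty string from the base so every step adds a symbol
        # and the loop terminates once the length bound is reached.
        base = language(expr[1], max_len) - {()}
        result = {()}
        frontier = {()}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# (a | b)* restricted to strings of at most 2 symbols
a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))
short_strings = language(a_or_b_star, 2)
```

Here `short_strings` holds the empty string, the two one-symbol strings and the four two-symbol strings over {a, b}.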
15
Example
top-level structure of a document: Σ = {title, auth, date, sect}
title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
  title auth* (date | ε) sect sect*
common abbreviations: E? = (E | ε); E+ = E E*
  -> title auth* date? sect+
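The abbreviated expression maps almost one-to-one onto the regular expression syntax of text-searching tools. A minimal sketch with Python's `re` module, assuming the child elements are given as a list of names; the helper `matches_top_level` and the space-separated encoding are choices made for this illustration only.

```python
import re

# Content model from the example: title auth* date? sect+
# Element names are joined with single spaces before matching.
TOP_LEVEL = re.compile(r"title( auth)*( date)?( sect)+")

def matches_top_level(children):
    """Check a sequence of element names against the content model."""
    return TOP_LEVEL.fullmatch(" ".join(children)) is not None


ok = matches_top_level(["title", "auth", "auth", "date", "sect", "sect"])
wrong_order = matches_top_level(["title", "date", "auth", "sect"])
no_section = matches_top_level(["title", "auth"])
```

The first sequence is accepted; the other two are rejected, because the date comes before an author and because at least one section is required.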
16
Context-free grammars
Used widely for syntax specification (programming languages)
G = (V, Σ, P, S)
  V: the alphabet of the grammar G; V = Σ ∪ N
  Σ: the set of terminal symbols
  N = V - Σ: the set of nonterminal symbols
  P: set of productions
  S ∈ N: the start symbol
17
Productions and derivations
Productions: A -> α, where A ∈ N, α ∈ V*
  e.g. A -> aBa (1)
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  γ = αAβ, δ = αωβ for some α, β ∈ V*, and
  A -> ω is a production of the grammar
  e.g. AA => AaBa (assuming prod. 1 above)
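A direct derivation step replaces one occurrence of a nonterminal by the right-hand side of one of its productions. A minimal sketch, assuming strings are tuples of symbols and productions are (nonterminal, right-hand side) pairs; the function name `derive_directly` is hypothetical.

```python
def derive_directly(string, productions):
    """Yield every string that `string` derives in a single step.

    `string` is a tuple of symbols; `productions` is a list of
    (nonterminal, right_hand_side) pairs, each right-hand side a tuple.
    """
    for i, symbol in enumerate(string):
        for left, right in productions:
            if symbol == left:
                # Split the string around position i, then replace the
                # nonterminal found there by the production's right-hand side.
                yield string[:i] + right + string[i + 1:]


productions = [("A", ("a", "B", "a"))]          # production (1): A -> aBa
steps = set(derive_directly(("A", "A"), productions))
```

With production (1) only, `steps` contains the two strings AaBa and aBaA, one for each occurrence of A that can be rewritten.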
18
Language generated by a context-free grammar
γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ
The language generated by a CFG G: L(G) = {w ∈ Σ* | S =>* w}
L(G) is a set of strings: to model structural elements, we consider parse trees
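The definition of L(G) suggests a simple, if inefficient, way to list short words of the language: apply direct derivations breadth-first, starting from S, and keep the strings that consist of terminals only. The sketch below assumes that no production has an empty right-hand side, so intermediate strings never shrink and can safely be cut off at the length bound. The grammar S -> aSb | ab and the name `generate` are hypothetical examples.

```python
from collections import deque

def generate(productions, start, terminals, max_len):
    """Return all words of L(G) with at most max_len symbols.

    Assumes no production has an empty right-hand side, so a string
    longer than max_len can never lead to a word within the bound.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        form = queue.popleft()
        if all(symbol in terminals for symbol in form):
            words.add(form)              # only terminals left: a word of L(G)
            continue
        for i, symbol in enumerate(form):
            for left, right in productions:
                if symbol == left:
                    new = form[:i] + right + form[i + 1:]
                    if len(new) <= max_len and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return words


# Hypothetical grammar: S -> aSb | ab
grammar = [("S", ("a", "S", "b")), ("S", ("a", "b"))]
words = generate(grammar, "S", {"a", "b"}, 6)
```

For this grammar and bound, `words` contains ab, aabb and aaabbb.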
19
Parse trees of a CFG
Aka syntax trees or derivation trees
nodes labelled by symbols of V (or by ε):
  internal nodes by nonterminals, root by start symbol
  leaves by terminal symbols (or ε)
parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production
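The parse tree conditions can be checked mechanically. A minimal sketch, assuming an internal node is a pair (label, list of children), a leaf is a plain string, and an internal node with no children stands for an ε-production; the grammar A -> aBa, B -> b and all function names are hypothetical.

```python
def label(node):
    """Label of a node: a leaf is its own label, an internal node is (label, children)."""
    return node if isinstance(node, str) else node[0]

def respects_productions(node, productions, nonterminals):
    """Check that every internal node matches some production A -> X1...Xk."""
    if isinstance(node, str):
        return node not in nonterminals          # leaves carry terminal symbols
    head, children = node
    right_hand_side = tuple(label(child) for child in children)
    if (head, right_hand_side) not in productions:
        return False
    return all(respects_productions(child, productions, nonterminals)
               for child in children)

def is_parse_tree(tree, productions, nonterminals, start):
    """A parse tree has the start symbol at its root and respects the productions."""
    return label(tree) == start and respects_productions(tree, productions, nonterminals)


productions = {("A", ("a", "B", "a")), ("B", ("b",))}
good = ("A", ["a", ("B", ["b"]), "a"])
bad = ("A", ["a", ("B", ["a"]), "a"])            # B -> a is not a production
good_result = is_parse_tree(good, productions, {"A", "B"}, "A")
bad_result = is_parse_tree(bad, productions, {"A", "B"}, "A")
```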
20
CFGs for document structures
Nonterminals represent document structures
  e.g. Ref -> AuthorList Title PublData
       AuthorList -> Author AuthorList
       AuthorList -> ε
problem: obscures the relation of elements (the last Author is several hierarchical levels away from Ref)
  -> solution: extended CFGs
21
Extended CFGs (ECFGs)
Like CFGs, but right-hand sides of productions are regular expressions over V
  e.g. Ref -> Author* Title PublData
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  γ = αAβ, δ = αωβ for some α, β ∈ V*, and
  A -> E is a production such that ω ∈ L(E)
  e.g. Ref => Author Author Author Title PublData
22
Language generated by an ECFG
Defined similarly to CFGs
Theorem: Languages generated by extended and ordinary CFGs are the same
23
Parse trees of an ECFG
Similar to parse trees of an ordinary CFG, except that…
parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E)
-> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
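For an ECFG the check at each node becomes a regular expression match over the labels of the children. A minimal sketch, assuming nodes are (label, children) pairs, each rule is written as a Python regular expression over labels that are each followed by one space, and labels without a rule are treated as leaves. The dictionary `RULES` and the function `conforms` are hypothetical.

```python
import re

# Ref -> Author* Title PublData, written over labels each followed by a space
RULES = {"Ref": re.compile(r"(Author )*Title PublData ")}

def conforms(node, rules):
    """Check that the children of every node match the node's content model."""
    label, children = node
    if label not in rules:
        return not children                      # no rule: must be a leaf
    word = "".join(child[0] + " " for child in children)
    if rules[label].fullmatch(word) is None:
        return False
    return all(conforms(child, rules) for child in children)


three_authors = ("Ref", [("Author", []), ("Author", []), ("Author", []),
                         ("Title", []), ("PublData", [])])
misplaced = ("Ref", [("Title", []), ("Author", []), ("PublData", [])])
three_authors_ok = conforms(three_authors, RULES)
misplaced_ok = conforms(misplaced, RULES)
```

All three Author nodes are direct children of Ref here, which is the flat structure that the ordinary CFG with AuthorList could not express.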
24
What is XML?
metalanguage that can be used to define markup languages
gives syntax for defining extended context-free grammars
XML documents that adhere to an ECFG are strings in that language
document types (grammars) - document instances (strings in the language)
25
XML encoding of structure
XML document essentially a parenthesized linear encoding of a parse tree
  corresponds to a preorder walk
  start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A>
  leaves are strings (or empty elements)
+ certain extensions (especially attributes)
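The encoding can be produced by a preorder walk that writes a start tag on entering an inner node and an end tag on leaving it. A minimal sketch, assuming inner nodes are (label, children) pairs and leaves are strings; the function `encode` and the sample tree with placeholder contents are hypothetical, and attributes are left out.

```python
from xml.sax.saxutils import escape

def encode(node):
    """Serialize a tree as XML text by a preorder walk."""
    if isinstance(node, str):
        return escape(node)                      # leaf: character data
    label, children = node
    if not children:
        return f"<{label}/>"                     # empty element
    inner = "".join(encode(child) for child in children)
    return f"<{label}>{inner}</{label}>"


# Placeholder contents, for illustration only
tree = ("Ref", [("Author", ["A"]), ("Title", ["T & U"]), ("PublData", [])])
text = encode(tree)
```

`escape` replaces the characters `&`, `<` and `>` in character data so that the result stays well-formed.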
26
Terminal symbols in practice
Leaves of parse trees are labeled by single characters (symbols of Σ)
too granular in practice: instead terminal symbols which stand for all values of a type
  e.g. #PCDATA in XML for variable length content of data characters
  richer data types in XML schema formalisms
27
An example DTD
<!DOCTYPE invoice [
<!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
<!ELEMENT orderDate (#PCDATA)>
<!ELEMENT shipDate (#PCDATA)>
<!ELEMENT billingAddress (name, street, city, state, zip)>
<!ELEMENT voice (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
]>
28
And a document:
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>
29
XML processing model
A processor (parser)
  reads XML documents
  passes data to an application
XML Specification tells how to read, what to pass
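One way to see this division of labour is an event-based interface such as SAX, where the parser reads the document and calls back into the application. A minimal sketch using Python's standard `xml.sax` module; the handler class `Collector` and the shortened sample document are assumptions made for this illustration.

```python
import xml.sax

class Collector(xml.sax.ContentHandler):
    """The 'application': it only records what the parser passes to it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver character data in several chunks.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


handler = Collector()
xml.sax.parseString(b"<invoice><fax>555-4321</fax></invoice>", handler)
tags = [event for event in handler.events if event[0] != "text"]
text = "".join(value for kind, value in handler.events if kind == "text")
```

The application never touches the raw text of the document; it sees only the sequence of start, text and end events.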
30
Well-formed XML documents
documents that adhere to the formal requirements (syntax) of the XML specification
if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
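In practice a well-formedness check amounts to running a parser and seeing whether it reports an error. A minimal sketch with Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text` as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


nested = is_well_formed("<a><b/></a>")
unclosed = is_well_formed("<a><b></a>")          # <b> is never closed
two_roots = is_well_formed("<a></a><b></b>")     # more than one root element
```

Only the first string is a well-formed document; the parser rejects the other two.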
31
Valid documents
a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD
an XML processor can be validating or non-validating
sometimes validity is important, sometimes not
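The two levels can be told apart in code. Python's standard library has no validating parser, so the sketch below imitates validation for the invoice example only: the content models are translated by hand into regular expressions over child element names, each followed by one space, and elements without a model are assumed to be (#PCDATA). The names `CONTENT_MODELS`, `is_valid` and `check` are hypothetical, and the sketch ignores stray text inside element content.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models of the invoice DTD
CONTENT_MODELS = {
    "invoice": r"orderDate shipDate billingAddress (voice )*(fax )?",
    "billingAddress": r"name street city state zip ",
}

def is_valid(element):
    """Check the children of `element` against its content model, recursively."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return len(element) == 0                 # (#PCDATA): no child elements
    word = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child) for child in element)

def check(text):
    """Classify a document as not well-formed, well-formed only, or valid."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    if root.tag == "invoice" and is_valid(root):
        return "valid"
    return "well-formed but not valid"


address = ("<billingAddress><name>n</name><street>s</street><city>c</city>"
           "<state>NY</state><zip>10532-0000</zip></billingAddress>")
complete = check("<invoice><orderDate>19990121</orderDate>"
                 "<shipDate>19990125</shipDate>" + address +
                 "<fax>555-4321</fax></invoice>")
no_ship_date = check("<invoice><orderDate>19990121</orderDate>" + address + "</invoice>")
broken = check("<invoice><orderDate>19990121</invoice>")
```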