51
1 SAX and more… CS 236607, Spring 2008/9

1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

  • View
    231

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

1

SAX and more…

CS 236607, Spring 2008/9

Page 2: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

2

SAX Parser

SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser

invokes the corresponding method of the corresponding handler

The handlers are programmer’s implementation of standard Java API (i.e., interfaces and classes)

Page 3: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Page 4: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Start Document

Page 5: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Start Element

Page 6: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Start Element

Page 7: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Comment

Page 8: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Start Element

Page 9: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Characters

Page 10: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

End Element

Page 11: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

End Element

Page 12: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;"> <!--israel-->

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

End Document

Page 13: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

13

SAX Parsers

SAX Parser

When you see the start of the document do …

When you see the start of an element do … When you see

the end of an element do …

<?xml version="1.0"?>...

Page 14: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

Used to create a SAX Parser

Handles document events: start tag, end tag,

etc.

Handles Parser Errors

Handles DTD

Handles Entities

XMLReaderXML

XML-ReaderFactory

ContentHandler

EntityResolver

ErrorHandler

DTDHandler

Page 15: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

15

Creating a Parser The SAX interface is an accepted standard There are many implementations of many vendors

Standard API does not include an actual implementation, but Sun provides one with JDK

We would like to be able to change the implementation used, without changing any code in the program How this is done?

Page 16: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

16

System Properties

Properties are configuration values managed as key/value pairs. In each pair, the key and value are both String values.

The System class maintains a Properties object that describes the configuration of the current working environment. System properties include information about the current user, the

current version of the Java runtime, and the character used to separate components of a file path name.

Java standard System Properties.

Page 17: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

17

Factory Design Pattern Have a “factory” class that creates the actual

parsers org.xml.sax.helpers.XMLReaderFactory

The factory checks configurations, mainly the value of a system property, that specify the implementation

In order to change the implementation, simply change the system property

Read more about XMLReaderFactory Class

Page 18: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

18

Creating a SAX Parser

import org.xml.sax.*;

import org.xml.sax.helpers.*;

public class EchoWithSax {

public static void main(String[] args) throws Exception {

System.setProperty("org.xml.sax.driver",

"org.apache.xerces.parsers.SAXParser");

XMLReader reader =

XMLReaderFactory.createXMLReader();

reader.parse("world.xml");

}

}Read more about XMLReader Interface, Xerces SAXParser class

Implements XMLReader

(not a must)

The factory will look for

this…

Page 19: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

19

Implementing the Content Handler A SAX parser invokes methods such as startDocument, startElement and endElement of its content handler as it runs

In order to react to parsing events we must: implement the ContentHandler interface set the parser’s content handler with an instance of our ContentHandler implementation

Page 20: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

20

ContentHandler Methods

startDocument - parsing begins endDocument - parsing ends startElement - an opening tag is encountered endElement - a closing tag is encountered characters - text (CDATA) is encountered ignorableWhitespace - white spaces that should be

ignored (according to the DTD) and more ...

Read more about ContentHandler Interface

Page 21: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

21

The Default Handler

The class DefaultHandler implements all handler interfaces (usually, in an empty manner) i.e., ContentHandler, EntityResolver, DTDHandler, ErrorHandler

An easy way to implement the ContentHandler

interface is to extend DefaultHandler

Read more about DefaultHandler Class

Page 22: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

A Content Handler Example

import org.xml.sax.helpers.DefaultHandler;

import org.xml.sax.*;

public class EchoHandler extends DefaultHandler {

int depth = 0;

public void print(String line) {

for(int i=0; i<depth; ++i) System.out.print(" ");

System.out.println(line);

}

Page 23: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

23

A Content Handler Example

public void startDocument() throws SAXException {

print("BEGIN"); }

public void endDocument() throws SAXException {

print("END"); }

public void startElement(String ns, String localName,

String qName, Attributes attrs) throws SAXException {

print("Element " + qName + "{");

++depth;

for (int i = 0; i < attrs.getLength(); ++i)

print(attrs.getLocalName(i) + "=" + attrs.getValue(i)); }

We will discuss this interface later…

Page 24: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

24

A Content Handler Example

public void endElement(String ns, String lName,

String qName) throws SAXException {

--depth;

print("}"); }

public void characters(char buf[], int offset, int len)

throws SAXException {

String s = new String(buf, offset, len).trim();

++depth; print(s); --depth; } }

Page 25: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

25

Fixing The Parser

public class EchoWithSax {

public static void main(String[] args) throws Exception {

XMLReader reader =

XMLReaderFactory.createXMLReader();

reader.setContentHandler(new EchoHandler());

reader.parse("world.xml");

}

}

What would happen without this line?

Page 26: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

26

Empty Elements

What do you think happens when the parser parses an empty element?

<rating stars="five" />

Page 27: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

27

Attributes Interface The Attributes interface provides an access to all

attributes of an element getLength(), getQName(i), getValue(i), getType(i), getValue(qname), etc.

The following are possible types for attributes: CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION

Read more about Attributes Interface

#attributes

Page 28: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

28

ErrorHandler Interface We implement ErrorHandler to receive error events

(similar to implementing ContentHandler)

DefaultHandler implements ErrorHandler in an

empty fashion, so we can extend it (as before)

An ErrorHandler is registered with reader.setErrorHandler(handler);

Three methods: void error(SAXParseException ex);

void fatalError(SAXParserExcpetion ex);

void warning(SAXParserException ex);

Page 29: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

29

Parsing Errors

Fatal errors disable the parser from continuing parsing For example, the document is not well formed, an unknown XML

version is declared, etc.

Errors (that is recoverable ones) occur for example when the parser is validating and validity constrains are violated

Warnings occur when abnormal (yet legal) conditions are encountered For example, an entity is declared twice in the DTD

Page 30: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

30

Usually appear with external non-xml resources and describe their

type

EntityResolver and DTDHandler The interface EntityResolver enables the

programmer to specify a new source for translation of external entities

The interface DTDHandler enables the programmer to react to notations (unparsed entities) declarations inside the DTD ( Notation Example )

e.g. external DTD

Read more about EntityResolver Interface, DTDHandler Interface

Page 31: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

31

Features and Properties

SAX parsers can be configured by setting their features

and properties

Syntax: reader.setFeature("feature-url", boolean)

reader.setProperty("property-url", Object)

Standard feature URLs have the form:

http://xml.org/sax/features/feature-name

Standard property URLs have the form

http://xml.org/sax/properties/prop-name

Page 32: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

32

Feature/Property Examples Features:

namespaces - are namespaces supported? validation - does the parser validate (against the

declared DTD) ? http://apache.org/xml/features/nonvalidating/load-external-dtd

Ignore the DTD? (spec. to Xerces implementation)

Properties: lexical-handler - see the next slide...

Read more about Features

Read more about Properties

Page 33: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

33

Lexical Events

Lexical events have to do with the way that a document

was written and not with its content

Examples: A comment is a lexical event (<!-- comment -->)

The use of an entity is a lexical event (&gt;)

These can be dealt with by implementing the

LexicalHandler interface, and setting the lexical-

handler property to an instance of the handler

Page 34: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

34

LexicalHandler Methods

comment(char[] ch, int start, int length) startCDATA() endCDATA() startEntity(java.lang.String name) endEntity(java.lang.String name) and more...

Read more about LexicalHandler Interface

Page 35: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

35

Different Approaches

Page 36: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

36

The DOM Approach

Tree-based API : map an XML document into an internal tree structure, then allow an application to navigate that tree.

The application is active. Provides random-access. Remember that it is possible to construct a parse tree

using an event-based API, and it is possible to use an event-based API to traverse an in-memory tree. (Actually DOM is a level above SAX)

Page 37: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

<?xml version="1.0"?>

<!DOCTYPE countries SYSTEM "world.dtd">

<countries>

<country continent="&as;">

<name>Israel</name>

<population year="2001">6,199,008</population>

<city capital="yes"><name>Jerusalem</name></city>

<city><name>Ashdod</name></city>

</country>

<country continent="&eu;">

<name>France</name>

<population year="2004">60,424,213</population>

</country>

</countries>

Page 38: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

The DOM TreeDocument

countries

country

Asia

continent

Israel

name

year

2001

6,199,008

population

city

capital

yes

name

Jerusalem

country

Europe

continent

France

nameyear

2004

60,424,213

population

city

capital

no

name

Ashdod

Page 39: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

39

The SAX Approach Event based API. The Application is passive. Provides Serial access parser.Given the following XML document:

This XML document, when passed through a SAX parser, will generate the following sequence of events (Pushing)…

Page 40: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

XML Processing Instruction, named xml, with attributes version equal to "1.0" and encoding equal to "UTF-8"

XML Element start, named RootElement, with an attribute param equal to "value" XML Element start, named FirstElement XML Text node, with data equal to "Some Text" (note: text processing, with regard

to spaces, can be changed) XML Element end, named FirstElement XML Element start, named SecondElement, with an attribute param2 equal to

"something" XML Text node, with data equal to "Pre-Text" XML Element start, named Inline XML Text node, with data equal to "Inlined text" XML Element end, named Inline XML Text node, with data equal to "Post-text." XML Element end, named SecondElement XML Element end, named RootElement

Page 41: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

41

Pull vs. Push

SAX is known as a push framework the parser has the initivative the programmer must react to events

An alternative is a pull framework the programmer has the initiative the parser must react to requests

Page 42: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

42

StAX - Streaming API for XML An API to read and write XML documents in the Java

programming language (not like DOM & SAX that were written for Parsing).

The StAX approach: the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward - 'pulling' the information from the parser as it needs.

Thus, the application is active. Not part of the standard Java API.

Page 43: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

43

For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:

Here's the beginning of the output when I ran this across a simple well-formed HTML file: htmlheadtitlemetametalinkmeta...

Page 44: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

44

JDOM An implementation of generic XML trees in Java. JDOM provides a way to represent XML document for easy

and efficient reading, manipulation, and writing. one can construct a tree of elements, then generate a XML

file from it, like:

Page 45: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

45

SAX vs. DOM

A more detailed comparison

Page 46: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

46

Parser Efficiency The DOM object built by DOM parsers is usually

complicated and requires more memory storage than the XML file itself A lot of time is spent on construction before use For some very large documents, this may be impractical

SAX parsers store only local information that is encountered during the serial traversal

Hence, programming with SAX parsers is, in general, more efficient

Page 47: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

47

Programming using SAX is Difficult In some cases, programming with SAX is

difficult: How can we find, using a SAX parser, elements e1

with ancestor e2? How can we find, using a SAX parser, elements e1

that have a descendant element e2? How can we find the element e1 referenced by the

IDREF attribute of e2?

Page 48: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

48

Node Navigation SAX parsers do not provide access to elements other than

the one currently visited in the serial (DFS) traversal of the document

In particular, They do not read backwards They do not enable access to elements by ID or name

DOM parsers enable any traversal method Hence, using DOM parsers is usually more comfortable

Page 49: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

49

More DOM Advantages

You can save time and effort if you send and receive DOM objects instead of XML files But, DOM object are generally larger than the source

DOM parsers provide a natural integration of XML reading and manipulating e.g., “cut and paste” of XML fragments

Page 50: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

50

Which should we use?DOM vs. SAX If your document is very large and you only need

a few elements – use SAX If you need to manipulate (i.e., change) the XML

– use DOM If you need to access the XML many times – use

DOM (assuming the file is not too large)

Page 51: 1 SAX and more… CS 236607, Spring 2008/9. 2 SAX Parser SAX = Simple API for XML XML is read sequentially When a parsing event happens, the parser invokes

51

Resources used for this presentation The Hebrew University of Jerusalem – CS

Faculty. Wikipedia An Introduction to XML and Web Technologies –

Course’s Literature. http://www.saxproject.org/event.html http://www.xml.com/