Java and XML

What is XML

• XML stands for eXtensible Markup Language.

• A markup language is used to provide information about a document.

• Tags are added to the document to provide the extra information.

• HTML tags tell a browser how to display the document.

• XML tags give a reader some idea what some of the data means.

Advantages of XML

• XML is text (Unicode) based.

• One XML document can be displayed differently in different media.

– HTML, video, CD, DVD

– You only have to change the XML document in order to change all the rest.

• XML documents can be modularized. Parts can be reused.

Example of an HTML Document

<html>
<head><title>Example</title></head>
<body>
<h1>This is an example of a page.</h1>
<h2>Some information goes here.</h2>
</body>
</html>

Example of an XML Document

<?xml version="1.0"?>
<address>
    <name>Alice Lee</name>
    <email>[email protected]</email>
    <phone>212-346-1234</phone>
    <birthday>1985-03-22</birthday>
</address>

Difference Between HTML and XML

• HTML tags have a fixed meaning and browsers know what it is.

• XML tags are different for different applications, and users know what they mean.

• HTML tags are used for display.

• XML tags are used to describe documents and data.

XML Rules

• Tags are enclosed in angle brackets.

• Tags come in pairs with start-tags and end-tags.

• Tags must be properly nested.

– <name><email>…</name></email> is not allowed.

– <name><email>…</email></name> is.

• Tags that do not have end-tags must be terminated by a '/'.

– <br /> is an HTML example.

More XML Rules

• Tags are case sensitive.

– <address> is not the same as <Address>.

• Tag names may not begin with the letters xml in any combination of cases.

• Tags may not contain '<' or '&'.

• Tags follow Java naming conventions, except that a single colon and other characters are allowed. They must begin with a letter and may not contain white space.

• Documents must have a single root tag that begins the document.

Well-Formed Documents

• An XML document is said to be well-formed if it follows all the rules.

• An XML parser is used to check that all the rules have been obeyed.

• Recent browsers such as Internet Explorer 5 and Netscape 7 come with XML parsers.

• Parsers are also available for free download over the Internet. One is Xerces, from the Apache open-source project.

• Java 1.4 also supports an open-source parser.
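
The well-formedness check described above can be sketched with the JDK's built-in parser. This is a minimal illustration, not the only way to do it; the class name WellFormedCheck and the helper isWellFormed are hypothetical names introduced here. The parser reports a violated rule, such as improper nesting, as a SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {
    // Returns true if the text parses without a fatal error, i.e. it is well-formed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><email>a</email></name>")); // properly nested
        System.out.println(isWellFormed("<name><email>a</name></email>")); // improperly nested
    }
}
```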

Expanded Example

<?xml version="1.0"?>
<address>
    <name>
        <first>Alice</first>
        <last>Lee</last>
    </name>
    <email>[email protected]</email>
    <phone>123-45-6789</phone>
    <birthday>
        <year>1983</year>
        <month>07</month>
        <day>15</day>
    </birthday>
</address>

XML Files are Trees

address
    name
        first
        last
    email
    phone
    birthday
        year
        month
        day

Validity

• A well-formed document has a tree structure and obeys all the XML rules.

• A particular application may add more rules in either a DTD (document type definition) or in a schema.

• Many specialized DTDs and schemas have been created to describe particular areas.

• These range from disseminating news bulletins (RSS) to chemical formulas.

• DTDs were developed first, so they are not as comprehensive as schemas.

Document Type Definitions

• A DTD describes the tree structure of a document and something about its data.

• There are two data types, PCDATA and CDATA.

– PCDATA is parsed character data.

– CDATA is character data, not usually parsed.

• A DTD determines how many times a node may appear, and how child nodes are ordered.

DTD for address Example

<!ELEMENT address (name, email, phone, birthday)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT birthday (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>

Schemas

• Schemas are themselves XML documents.

• They were standardized after DTDs and provide more information about the document.

• They have a number of data types including string, decimal, integer, boolean, date, and time.

• They divide elements into simple and complex types.

• They also determine the tree structure and how many children a node may have.

Schema for First address Example

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="address">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="name" type="xs:string"/>
                <xs:element name="email" type="xs:string"/>
                <xs:element name="phone" type="xs:string"/>
                <xs:element name="birthday" type="xs:date"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
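
As a sketch of how this schema might be applied, the following uses the javax.xml.validation API (available since Java 5, so later than the Java 1.4 mentioned earlier). The class name SchemaCheck, the helper isValid, and the e-mail address alice@example.com are placeholders introduced for illustration. The schema is embedded as a string so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class SchemaCheck {
    // The schema for the first address example, embedded as a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='address'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='email' type='xs:string'/>"
      + "<xs:element name='phone' type='xs:string'/>"
      + "<xs:element name='birthday' type='xs:date'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if the document is valid against the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<address><name>Alice Lee</name><email>alice@example.com</email>"
                    + "<phone>212-346-1234</phone><birthday>1985-03-22</birthday></address>";
        String bad = good.replace("1985-03-22", "March 22"); // not an xs:date
        System.out.println(isValid(good));
        System.out.println(isValid(bad));
    }
}
```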

Parsers

• There are two principal models for parsers.

• SAX – Simple API for XML

– Uses a call-back method

– Similar to javax listeners

• DOM – Document Object Model

– Creates a parse tree

– Requires a tree traversal

DOM Parser

About DOM

• Stands for Document Object Model

• A World Wide Web Consortium (w3c) standard

• The standard is constantly adding new features; Level 3 Core was just released this month

• We'll cover most of the basics. There's always more, and it's always changing.

DOM abstraction layer in Java -- architecture

Returns a specific parser implementation

org.w3c.dom.Document

The emphasis is on allowing vendors to supply their own DOM implementation without requiring changes to source code.

Sample Code

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

/* set some factory options here */

DocumentBuilder builder = factory.newDocumentBuilder();

Document doc = builder.parse(xmlFile);

A factory instance is the parser implementation. It can be changed with a runtime System property. The JDK has a default; Xerces is much better.

From the factory one obtains an instance of the parser.

xmlFile can be a java.io.File, an InputStream, etc.

For reference, the classes used are javax.xml.parsers.DocumentBuilderFactory, javax.xml.parsers.DocumentBuilder, and org.w3c.dom.Document. Notice that the Document class comes from the w3c-specified bindings.

Validation

• Note that by default the parser will not validate against a schema or DTD

• As of JAXP 1.2, Java provides a default parser that can handle most schema features

• See next slide for details on how to set up
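
One possible way to switch validation on is sketched below for the DTD case. The class name DtdValidation and the helper countValidationErrors are hypothetical names introduced here. The DTD is placed in the document's internal subset so the example needs no external file. Validity problems are reported to the error handler's error() method rather than thrown, so the handler counts them.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class DtdValidation {
    static final String DTD =
        "<!DOCTYPE name [<!ELEMENT name (first, last)>"
      + "<!ELEMENT first (#PCDATA)><!ELEMENT last (#PCDATA)>]>";

    // Parses with DTD validation enabled and returns the number of validity errors.
    static int countValidationErrors(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            final int[] errors = {0};
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors[0]++; // validity errors are recoverable, so just count them
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return errors[0];
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String valid = DTD + "<name><first>Alice</first><last>Lee</last></name>";
        String invalid = DTD + "<name><last>Lee</last></name>"; // <first> is missing
        System.out.println(countValidationErrors(valid));
        System.out.println(countValidationErrors(invalid) > 0);
    }
}
```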

Document object

• Once a Document object is obtained, there is a rich API to manipulate it.

• First call is usually Element root = doc.getDocumentElement();

This gets the root element of the Document as an instance of the Element class.

• Note that Element extends Node and inherits the methods getNodeType(), getNodeName(), getNodeValue(), and getChildNodes().
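
A minimal sketch of that first step follows: obtain the root element and list its element children. The class name RootChildren and the helper childElementNames are hypothetical names introduced here. Whitespace between tags shows up as Text nodes, so the loop filters on the node type.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class RootChildren {
    // Returns the tag names of the element children of the root element.
    static List<String> childElementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            NodeList kids = root.getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < kids.getLength(); i++) {
                Node n = kids.item(i);
                // Skip Text nodes (e.g. whitespace between tags), comments, etc.
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<address> <name>Alice Lee</name> <phone>212-346-1234</phone> </address>";
        System.out.println(childElementNames(xml)); // prints [name, phone]
    }
}
```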

Types of Nodes

• Note that there are many types of nodes (i.e., subinterfaces of Node): Attr, CDATASection, Comment, Document, DocumentFragment, DocumentType, Element, Entity, EntityReference, Notation, ProcessingInstruction, Text.

Each of these has a special and non-obvious associated type, value, and name.

The standards are language-neutral and are specified in the chart on the following slide.

Important: keep this chart nearby when using DOM.

Node                   nodeName()                 nodeValue()                           Attributes    nodeType()
Attr                   Attr name                  Value of attribute                    null          2
CDATASection           #cdata-section             CDATA content                         null          4
Comment                #comment                   Comment content                       null          8
Document               #document                  null                                  null          9
DocumentFragment       #document-fragment         null                                  null          11
DocumentType           Doc type name              null                                  null          10
Element                Tag name                   null                                  NamedNodeMap  1
Entity                 Entity name                null                                  null          6
EntityReference        Name of entity referenced  null                                  null          5
Notation               Notation name              null                                  null          12
ProcessingInstruction  Target                     Entire content excluding the target   null          7
Text                   #text                      Actual text                           null          3
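
To see the chart in action, the sketch below walks a small document and records the type, name, and value of every node, including attributes. The class name NodeChart and its helpers describe, walk, and chart are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class NodeChart {
    // Formats one node as "nodeType nodeName nodeValue".
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    // Depth-first walk: the node itself, then its attributes, then its children.
    static void walk(Node n, List<String> out) {
        out.add(describe(n));
        NamedNodeMap attrs = n.getAttributes(); // null for everything except Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                out.add(describe(attrs.item(i)));
            }
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    static List<String> chart(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc, out);
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Document (9), Element (1), Attr (2), Text (3), Comment (8)
        for (String line : chart("<a x='1'>hi<!--c--></a>")) {
            System.out.println(line);
        }
    }
}
```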

Transformer Architecture

Writing DOM to XML

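import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the input file into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // A Transformer created without a stylesheet copies the source to the result unchanged
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}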

Creating a DOM from scratch

• Sometimes you may want to create a DOM tree directly in memory. This is done with:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

DocumentBuilder builder = factory.newDocumentBuilder();

Document document = builder.newDocument();
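
Once the empty Document exists, elements and text are created from it and attached with appendChild. The following sketch builds a small address tree and serializes it with a Transformer; the class name BuildDom and the helper buildAndSerialize are hypothetical names introduced here.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
class BuildDom {
    // Builds <address><name>Alice Lee</name></address> in memory and returns it as text.
    static String buildAndSerialize() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.newDocument();

            // Nodes are always created by the Document that will own them.
            Element address = document.createElement("address");
            document.appendChild(address); // becomes the root element
            Element name = document.createElement("name");
            name.appendChild(document.createTextNode("Alice Lee"));
            address.appendChild(name);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration to keep the output short.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize());
    }
}
```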

Manipulating Nodes

• Once the root node is obtained, typical tree methods exist to manipulate other elements:

boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent()
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
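
As a small illustration of these methods, the sketch below inserts a new email element in front of an existing phone element and then reads back the order of the children. The class name InsertNode and the helper insertEmailBeforePhone are hypothetical names introduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class InsertNode {
    // Inserts <email/> before <phone/> and returns the resulting child names in order.
    static List<String> insertEmailBeforePhone() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<address><name/><phone/></address>")));
            Element root = doc.getDocumentElement();
            Node name = root.getFirstChild();   // <name/>
            Node phone = name.getNextSibling(); // <phone/>

            Element email = doc.createElement("email");
            root.insertBefore(email, phone);    // new child first, reference child second

            List<String> names = new ArrayList<>();
            for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
                names.add(c.getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(insertEmailBeforePhone()); // prints [name, email, phone]
    }
}
```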

SAX

Simple API for XML

About SAX

• SAX in Java is hosted on SourceForge

• SAX is not a w3c standard

• Originated purely in Java

• Other languages have chosen to implement in their own ways based on this prototype

SAX vs DOM

• Please don't compare unrelated things:

– SAX is an alternative to DOM, but realize that DOM is often built on top of SAX

– SAX and DOM do not compete with JAXP

– They do both compete with JAXB implementations

How a SAX parser works

• A SAX parser scans an XML stream on the fly and responds to certain parsing events as it encounters them.

• This is very different from digesting an entire XML document into memory.

• Much faster, requires less memory.

• However, you need to reparse if you need to revisit data.

Obtaining a SAX parser

• Important classes:

javax.xml.parsers.SAXParserFactory
javax.xml.parsers.SAXParser
javax.xml.parsers.ParserConfigurationException

// get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();

// parse the document
saxParser.parse(new File(argv[0]), handler);

DefaultHandler

• Note that an event handler has to be passed to the SAX parser.

• This must implement the interface org.xml.sax.ContentHandler;

• Easier to extend the adapter org.xml.sax.helpers.DefaultHandler

Overriding Handler methods

• Most important methods to override

– void startDocument()
    • Called once when document parsing begins

– void endDocument()
    • Called once when parsing ends

– void startElement(...)
    • Called each time an element begin tag is encountered

– void endElement(...)
    • Called each time an element end tag is encountered

– void characters(...)
    • Called one or more times between startElement and endElement calls to deliver accumulated character data

startElement

• public void startElement(
      String namespaceURI,  // namespace URI, if a namespace is associated
      String sName,         // non-qualified (local) name
      String qName,         // qualified name
      Attributes attrs)     // list of attributes

• Attribute info is obtained by querying Attributes objects.
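
Querying the Attributes object might look like the sketch below, which records every attribute seen during a parse. The class name AttributeLister and the helper listAttributes are hypothetical names introduced here. The default SAXParserFactory is not namespace-aware, so the qualified name is the reliable one to read in this setup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class AttributeLister {
    // Returns one "element@attribute=value" entry per attribute in the document.
    static List<String> listAttributes(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String sName,
                                     String qName, Attributes attrs) {
                // Attributes are indexed from 0 to getLength() - 1.
                for (int i = 0; i < attrs.getLength(); i++) {
                    found.add(qName + "@" + attrs.getQName(i) + "=" + attrs.getValue(i));
                }
            }
        };
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            saxParser.parse(new InputSource(new StringReader(xml)), handler);
            return found;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            listAttributes("<address><phone type='home'>212-346-1234</phone></address>"));
        // prints [phone@type=home]
    }
}
```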

Characters

• public void characters(
      char buf[],   // buffer of accumulated chars
      int offset,   // index of the first char in buf
      int len)      // number of chars

• Note, characters may be called more than once between begin tag / end tag

• Also, mixed-content elements require careful handling
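
Because characters() may be called several times for one element, a common approach is to append to a buffer and read it only in endElement. The sketch below does this for elements that contain only text; the class name TextCollector and the helper collectText are hypothetical names introduced here, and mixed content would need more careful handling than this.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class TextCollector {
    // Maps each text-only element name to its text, in document order.
    static Map<String, String> collectText(String xml) {
        final Map<String, String> text = new LinkedHashMap<>();
        final StringBuilder buffer = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String sName, String qName, Attributes attrs) {
                buffer.setLength(0); // start fresh for each element
            }

            @Override
            public void characters(char[] buf, int offset, int len) {
                buffer.append(buf, offset, len); // may arrive in several chunks
            }

            @Override
            public void endElement(String uri, String sName, String qName) {
                String value = buffer.toString().trim();
                if (!value.isEmpty()) {
                    text.put(qName, value);
                }
                buffer.setLength(0); // do not leak text into the parent element
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return text;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            collectText("<name> <first>Alice</first> <last>Lee</last> </name>"));
        // prints {first=Alice, last=Lee}
    }
}
```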

Entity references

• Recall that entity references are special character sequences for referring to characters that have special meaning in XML syntax:

– '<' is &lt;

– '>' is &gt;

• In SAX these are automatically converted and passed to the characters stream unless they are part of a CDATA section
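
A short sketch can show what arrives in characters(): the handler below concatenates everything it receives, so an entity reference comes through as the character it stands for, and CDATA content comes through literally. The class name EntityDemo and the helper textOf are hypothetical names introduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class EntityDemo {
    // Returns all character data in the document, concatenated.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                text.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &lt; is converted to '<'; the CDATA section passes "<b>" through as-is.
        System.out.println(textOf("<a>1 &lt; 2 <![CDATA[<b>]]></a>")); // prints 1 < 2 <b>
    }
}
```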

Choosing a Parser

• Choosing your parser implementation

– If no other factory class is specified, the default SAXParserFactory class is used. To use a different vendor's parser, you can change the value of the system property that points to it. You can do that from the command line, like this:

java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...

• The factory name you specify must be a fully qualified class name (all package prefixes included). For more information, see the documentation of the newInstance() method of the SAXParserFactory class.
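
To check which implementation is actually in use, one simple approach is to print the class of the factory that newInstance() returns. This is a sketch only; the class name WhichParser and the helper factoryClassName are hypothetical names introduced here, and the printed name depends on the JDK and on any -D setting.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, for illustration only.
class WhichParser {
    // Returns the fully qualified class name of the factory chosen at run time.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The system property is unset (null) unless given with -D on the command line.
        System.out.println("property: "
            + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("factory:  " + factoryClassName());
    }
}
```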