3
XML namespaces and XPath with Python March 2, 2016 1 XML namespaces and XPath with Python Thomas Aglassinger http://www.roskakori.at 2 XML eXtensible Markup Language a blueprint for other file formats can represent sequences and hierarchies text based (binary somewhat possible using e.g. UUEncode) human readable somewhat verbose supports a Document Object Model (DOM) 2.1 XML with Python xml: part if the standard library xml.etree.ElementTree - XML as pythonic Trees xml.dom.mindom - DOM, warts and all xml.sax - sequential parsing of large documents works, but has limited support for namespaces, XPath etc. lxml: available from http://lxml.de/ Python wrapper to C based XML libraries full support for namespaces, XPath, schemas etc universally used for “serious” XML processing 2.2 Example XML file <?xml version="1.0" encoding="utf-8"?> <people:list xmlns:people="https://www.example.org/xml/people"> <people:updated date="2016-02-16" /> <people:person name="Alice" phone="0650/12345678" size="172" /> <people:person name="Bob" phone="0654/23456789" size="167" /> <people:person name="B¨ arbel" phone="0699/34567890" size="182" /> <people:person name="G¨ unther" size="172"> <people:note>Ask for phone number.</people:note> </people:person> </people:list> 1

XML namespaces and XPath with Python

Embed Size (px)

Citation preview

Page 1: XML namespaces and XPath with Python

XML namespaces and XPath with Python

March 2, 2016

1 XML namespaces and XPath with Python

Thomas Aglassinger http://www.roskakori.at

2 XML

• eXtensible Markup Language

• a blueprint for other file formats

• can represent sequences and hierarchies

• text based (binary somewhat possible using e.g. UUEncode)

• human readable

• somewhat verbose

• supports a Document Object Model (DOM)

2.1 XML with Python

• xml: part if the standard library

• xml.etree.ElementTree - XML as pythonic Trees

• xml.dom.mindom - DOM, warts and all

• xml.sax - sequential parsing of large documents

• works, but has limited support for namespaces, XPath etc.

• lxml: available from http://lxml.de/

• Python wrapper to C based XML libraries

• full support for namespaces, XPath, schemas etc

• universally used for “serious” XML processing

2.2 Example XML file

<?xml version="1.0" encoding="utf-8"?>

<people:list xmlns:people="https://www.example.org/xml/people">

<people:updated date="2016-02-16" />

<people:person name="Alice" phone="0650/12345678" size="172" />

<people:person name="Bob" phone="0654/23456789" size="167" />

<people:person name="Barbel" phone="0699/34567890" size="182" />

<people:person name="Gunther" size="172">

<people:note>Ask for phone number.</people:note>

</people:person>

</people:list>

1

Page 2: XML namespaces and XPath with Python

2.3 XML namespaces

In our example

xmlns:people="https://www.example.org/xml/people"

assigns the shortcut people to the namespace identified by https://www.example.org/xml/people.

2.4 XPath

XPath is a query language to find nodes in XML documents. Examples:

• /people:list/people:person - all person elements in the document

• /people:list/people:person[@phone] - all person elements in the document with a phone attribute

Tutorial: http://www.w3schools.com/xsl/xpath intro.asp

3 Extract information from XML

3.1 Read the document root

Compute the path to our example XML file:

In [1]: import os.path

people_xml_path = os.path.join(’examples’, ’people.xml’)

Build the document root from the file:

In [2]: from lxml import etree

people_root = etree.parse(people_xml_path)

3.2 Setup the namespace

In [3]: NAMESPACES = {

’people’: ’https://www.example.org/xml/people’,

}

3.3 Find persons and print details

In [4]: # Find persons matching XPath.

person_elements = people_root.xpath(

’/people:list/people:person[@phone]’,

namespaces=NAMESPACES)

# Print name and phone of persons found.

for person_element in person_elements:

print(

person_element.attrib[’name’] + ’: ’ +

person_element.attrib[’phone’])

Alice: 0650/12345678

Bob: 0654/23456789

Barbel: 0699/34567890

2

Page 3: XML namespaces and XPath with Python

3.4 Examining XML elements

Elements have a tag, where namespaces are represente using the Clark notation {namespace}tag:

In [5]: person_element.tag

Out[5]: ’{https://www.example.org/xml/people}person’

XML attributes are a simlpe dictionary:

In [6]: person_element.attrib

Out[6]: {’phone’: ’0699/34567890’, ’name’: ’Barbel’, ’size’: ’182’}

3.5 Text nodes

Print notes about persons without a phone:

In [7]: note_elements_for_persons_without_phone = \

people_root.xpath(

’/people:list/people:person[not(@phone)]/people:note’,

namespaces=NAMESPACES)

for note_element in note_elements_for_persons_without_phone:

person_element = note_element.getparent()

person_name = person_element.attrib[’name’]

note_text = note_element.text

print(person_name + ’: ’ + note_text)

Gunther: Ask for phone number.

Use getparent() to access the enclosing XML element (as seen above).

4 Summary

• XML will be around for the foreseeable future so learn to deal with it.

• Use lxmlfor any serious XML work in Python.

• Namespaces and XPath can be taimed.

3