Upload
roskakori
View
314
Download
5
Embed Size (px)
Citation preview
XML namespaces and XPath with Python
March 2, 2016
1 XML namespaces and XPath with Python
Thomas Aglassinger http://www.roskakori.at
2 XML
• eXtensible Markup Language
• a blueprint for other file formats
• can represent sequences and hierarchies
• text based (binary somewhat possible using e.g. UUEncode)
• human readable
• somewhat verbose
• supports a Document Object Model (DOM)
2.1 XML with Python
• xml: part if the standard library
• xml.etree.ElementTree - XML as pythonic Trees
• xml.dom.mindom - DOM, warts and all
• xml.sax - sequential parsing of large documents
• works, but has limited support for namespaces, XPath etc.
• lxml: available from http://lxml.de/
• Python wrapper to C based XML libraries
• full support for namespaces, XPath, schemas etc
• universally used for “serious” XML processing
2.2 Example XML file
<?xml version="1.0" encoding="utf-8"?>
<people:list xmlns:people="https://www.example.org/xml/people">
<people:updated date="2016-02-16" />
<people:person name="Alice" phone="0650/12345678" size="172" />
<people:person name="Bob" phone="0654/23456789" size="167" />
<people:person name="Barbel" phone="0699/34567890" size="182" />
<people:person name="Gunther" size="172">
<people:note>Ask for phone number.</people:note>
</people:person>
</people:list>
1
2.3 XML namespaces
In our example
xmlns:people="https://www.example.org/xml/people"
assigns the shortcut people to the namespace identified by https://www.example.org/xml/people.
2.4 XPath
XPath is a query language to find nodes in XML documents. Examples:
• /people:list/people:person - all person elements in the document
• /people:list/people:person[@phone] - all person elements in the document with a phone attribute
Tutorial: http://www.w3schools.com/xsl/xpath intro.asp
3 Extract information from XML
3.1 Read the document root
Compute the path to our example XML file:
In [1]: import os.path
people_xml_path = os.path.join(’examples’, ’people.xml’)
Build the document root from the file:
In [2]: from lxml import etree
people_root = etree.parse(people_xml_path)
3.2 Setup the namespace
In [3]: NAMESPACES = {
’people’: ’https://www.example.org/xml/people’,
}
3.3 Find persons and print details
In [4]: # Find persons matching XPath.
person_elements = people_root.xpath(
’/people:list/people:person[@phone]’,
namespaces=NAMESPACES)
# Print name and phone of persons found.
for person_element in person_elements:
print(
person_element.attrib[’name’] + ’: ’ +
person_element.attrib[’phone’])
Alice: 0650/12345678
Bob: 0654/23456789
Barbel: 0699/34567890
2
3.4 Examining XML elements
Elements have a tag, where namespaces are represente using the Clark notation {namespace}tag:
In [5]: person_element.tag
Out[5]: ’{https://www.example.org/xml/people}person’
XML attributes are a simlpe dictionary:
In [6]: person_element.attrib
Out[6]: {’phone’: ’0699/34567890’, ’name’: ’Barbel’, ’size’: ’182’}
3.5 Text nodes
Print notes about persons without a phone:
In [7]: note_elements_for_persons_without_phone = \
people_root.xpath(
’/people:list/people:person[not(@phone)]/people:note’,
namespaces=NAMESPACES)
for note_element in note_elements_for_persons_without_phone:
person_element = note_element.getparent()
person_name = person_element.attrib[’name’]
note_text = note_element.text
print(person_name + ’: ’ + note_text)
Gunther: Ask for phone number.
Use getparent() to access the enclosing XML element (as seen above).
4 Summary
• XML will be around for the foreseeable future so learn to deal with it.
• Use lxmlfor any serious XML work in Python.
• Namespaces and XPath can be taimed.
3