LT PyXML: A fast validating XML parser embedded in Python
Henry S. Thompson
HCRC Language Technology Group
University of Edinburgh
HCRC Language Technology Group
Henry S. Thompson, XML DevCon, Montréal, 1999-08-19
Acknowledgements
  This work was carried out in the Language Technology Group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council
  The UK Engineering and Physical Sciences Research Council funded project NSCOPE, which stimulated some of the work discussed here today
  This work was also helped by grants to our group from Sun Microsystems and Microsoft
How we use SGML/XML
  We use SGML and XML in the context of collecting, standardising, distributing, annotating and using large text collections (corpora) for computational linguistics research and development
  These corpora are:
    Large: 10-100 million words
    Densely annotated: often every word has associated markup
  DTDs and validation are very important to us
An aside about validation
  A DTD or schema is a contract between producers and consumers
  It provides a guaranteed interface
  Producers validate to ensure they are providing what they promised
  Consumers validate to check up on producers and to protect their applications
  Application authors validate to simplify their task
    Leave error detection and analysis to the validating parser
How we use XML (2)
  Like any other SME, we produce documents
  Being a university-embedded SME, we produce lots of documents
  Lots of those documents are trivial variations on one another, based on target medium and/or audience
    Overhead slides for teaching
    Web pages for publicity/teaching backup
    Presentation slides for conferences
    Research papers for monographs and journals
Our application needs
  Batch applications to automatically add linguistic annotation
  Modular, pipelined programs supporting data parallelism
  Specialised interactive editors to hand-correct markup
  Authoring tools and publication tools which make content-sharing easy
We built software: RXP & LT XML
  because of the following issues:
    Price
    Efficiency
    C-language interface
    Documentation
  Contrast with EXPAT
    50 to 100% slower
      – but still 90% faster than Java implementations
    Thoroughly documented
    Validates
    Coverage nine nines identical
LT XML: Basic Architecture
  Pipelines of ‘fat’ streams
    cf. Unix ‘thin’ streams
  API provides primitives for XML-appropriate input and output
  Two alternative views:
    micro-sequence: start-tag, comment, char-data, end-tag, proc. inst
    tree-structure: sequence of sub-trees, level ad lib.
Flat view
  Provides GetNextBit, which reads the next bit of XML:
    Start/empty tags (including attributes and all values)
    Text==PCDATA
    End tags
    Processing instructions
  PrintBit will write one of these to an output stream
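The flat view can be illustrated with a small sketch. This is not the LT XML or LT PyXML API: it uses the `xml.dom.pulldom` module from the Python standard library as a stand-in, and the function name `bits` and the tuple layout are assumptions made purely for illustration. It shows the same idea of pulling one "bit" at a time from an XML stream.

```python
from xml.dom import pulldom


def bits(xml_text):
    """Yield the document as a flat sequence of 'bits' (illustrative only).

    Each bit is a tuple whose first member names its kind: start tag (with
    its attributes), text, end tag or processing instruction.
    """
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            # Start tags carry their attributes and values with them.
            yield ("start", node.tagName, dict(node.attributes.items()))
        elif event == pulldom.CHARACTERS:
            yield ("text", node.data)
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName)
        elif event == pulldom.PROCESSING_INSTRUCTION:
            yield ("pi", node.target, node.data)
        # Other events (document start/end, comments) are skipped here.


if __name__ == "__main__":
    for bit in bits('<doc><?target data?><w pos="N">cat</w></doc>'):
        print(bit)
```

A consumer written against such a stream never holds more than one bit at a time, which is what makes this view suitable for pipelined processing of very large corpora.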
Tree-structured view
  Items are subtrees of the SGML structure
  Reading
    GetNextItem
    GetNextQueryItem
  Writing
    PrintItem
  The two views (flat or tree-structured) can be mixed to suit the needs of the application
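As a rough analogue of reading items as subtrees, the sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library. It is not LT PyXML code; the function name `items` and the sample element names are assumptions. Each yielded element is a complete subtree, and it is discarded once the caller has moved on, so memory use stays bounded by the size of one item.

```python
import io
import xml.etree.ElementTree as ET


def items(source, tag):
    """Yield each complete subtree whose element name is `tag`.

    `source` is a file name or a binary file object. The subtree is cleared
    after the caller asks for the next one, so it must be used immediately.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free the subtree once the consumer is done with it


if __name__ == "__main__":
    sample = io.BytesIO(b"<TEXT><P><S>One <W>two</W>.</S><S>Three.</S></P></TEXT>")
    for s in items(sample, "S"):
        print("".join(s.itertext()))
```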
Query language
  LT XML defines a query language which allows the specification of elements from an XML document
  Queries are tree based, using element names, attribute values and textual data
  Similar path-style syntax to XPath
  Regular expressions are allowed for attribute values.
Query language, continued
  The LT XML query language is not a complete relational query language, although that can be built on top
  For efficiency reasons, LT XML doesn't allow queries which require back-tracking or an unbounded amount of left context
  The query language allows programmers to quickly find the sub-structure they are interested in, while ignoring the rest
Query example
.*/TEXT/./P[TYPE=STD]/S[1]
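The slide does not spell out the semantics of this query, so the following is only one plausible reading, stated as an assumption: at any depth, find a TEXT element, then any child element, then a P child whose TYPE attribute is STD, then a particular S child selected by position. The sketch approximates that reading with the limited XPath support in Python's `xml.etree.ElementTree`. It assumes XPath-style counting from 1; if LT XML counts from 0, the positional predicate would need adjusting. The sample document and the name `select_sentences` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample corpus fragment, invented for this sketch.
SAMPLE = """
<CORPUS>
  <TEXT>
    <BODY>
      <P TYPE="STD"><S>a</S><S>b</S></P>
      <P TYPE="HEAD"><S>c</S></P>
      <P TYPE="STD"><S>d</S></P>
    </BODY>
  </TEXT>
</CORPUS>
"""

# ElementTree path approximating the assumed reading of the LT XML query.
PATH = ".//TEXT/*/P[@TYPE='STD']/S[1]"


def select_sentences(root):
    """Return the S elements matched by PATH below `root`."""
    return root.findall(PATH)


if __name__ == "__main__":
    for s in select_sentences(ET.fromstring(SAMPLE)):
        print(s.text)
```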
Simple Tools are Simple to Build
  Less than one page of C code to produce a simple application
  Pipelines mean you can compose simple tools for complex applications
Pre-constructed Tools
  Extract text content: textonly
  Select fragments based on tags, attributes and text content: sggrep
  Count tags: sgcount
  Production-system style transformation: sgmltrans
  Simple pattern-based information extraction: sgrpg
  Indexing for fast access: mkindex
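To give a feel for how small such a tool can be, here is a sketch of a tag counter in the spirit of sgcount. It is not the sgcount source and says nothing about that tool's options or output format; it is a standard-library Python stand-in, and the function name `count_tags` is an assumption.

```python
import collections
import io
import xml.etree.ElementTree as ET


def count_tags(stream):
    """Count start tags by element name in a binary XML stream."""
    counts = collections.Counter()
    # Only 'start' events are needed, so the document is processed as it is
    # read rather than after the whole tree has been built.
    for _event, elem in ET.iterparse(stream, events=("start",)):
        counts[elem.tag] += 1
    return counts


def report(stream):
    """Return one 'name count' line per element name, sorted by name."""
    return ["%s %d" % (tag, n) for tag, n in sorted(count_tags(stream).items())]
```

Calling `report(sys.stdin.buffer)` and printing the lines would turn this into a filter that can sit at the end of a pipeline.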
Availability
  Free to all for research use
  Executables and libraries for Unix (Solaris, SunOS, Linux, FreeBSD) and Win32
  Sources for Unix
  Packaged executable for Mac
  http://www.ltg.ed.ac.uk/software/xml/
What about user interaction?
  C is not the world's easiest or most portable GUI-building environment
  We have in-house clients who are happy with scripting languages
  So we've embedded LT XML inside a number of other contexts
    Common Lisp
    Perl
    Python
  It's the Python embedding that's the main topic for today
LT PyXML Basics
  A C-implemented Python module
  Integrates the LT XML API into Python
    Architecture
      – Both views (bits and tree fragments)
    Objects
      – including garbage collection
    Functions
      – A modest subset
  We've used the Tkinter module for all our GUI work, but Python has other GUI options
LT PyXML functions
  Files
    Open, OpenString, Fopen, Close
  Bits
    GetNextBit, ItemParse
  Attributes
    GetAttrVal, ItemActualAttributes, PutAttrVal
  Queries
    ParseQuery, GetNextQueryItem
  Printing
    Print, PrintEndTag, PrintStartTag, PrintTextLiteral
LT PyXML Objects
  Use native Python lists and dictionaries where we can
  New primitive Objects, often lazy wrt pullthrough
    Files
      – NSL_File
    Doctypes
      – NSL_Doctype, NSL_ElementType, NSL_AttrDefn, NSL_ContentParticle
    Instances
      – NSL_Bit, NSL_Item, NSL_ERef, NSL_OOB
    Queries
      – NSL_Query
LT PyXML limitations
  8-bit character inventory (Python/Tk limitation)
  I haven't delivered on the promise in the abstract, but
    The binary is in the XED distributions
    A proper release will appear shortly
Three applications
  XED
    instance access minimal
    doctype access minimal
  Schema workbench
    instance access paradigmatic
    depends heavily on validation
  XML DTD Normaliser
    instance access non-existent
    doctype access paradigmatic
XED
  A text editor for XML document instances
  Implemented in Python using LT PyXML and Tkinter
  Optimised for hand-authoring small- to medium-sized documents
  Cross-platform
  Free of charge
  Sources not yet available
XED features
  Single-window WYSIWYG presentation
  Add, remove and rename balanced start/end tag pairs and empty elements
  Add, remove and rename attribute name/value pairs
  Add or remove comments, CDATA sections and processing instructions
  Context-sensitive tag and attribute menus
XED features, cont'd
  Filling of text content, indenting of element-only content
  Structure-sensitive point-and-sweep selection paradigm
  Structure-preserving cut and paste
  Multiple undo
  Key bindings based on xxxPad under WIN32; based on Emacs under Unix
XED demo
  See http://www.ltg.ed.ac.uk/ht/xed.html
  The vast bulk of XED is Python/Tk, but it's made possible by LT PyXML
    Control of text segments
    Control of OOB processing
    Context-sensitive menus are initialised from the DTD
  Really helps newcomers to XML get started
  Cannot produce ill-formed XML
Schema Workbench demo
  Not publicly available yet
  Built to facilitate development of the XML Schema spec
  When I started writing large schemata which exploited the refinement aspects of the public WD
    I needed to see the type hierarchy
    I needed to produce a normalised DTD to compare with the originals
Schema Workbench features
  The schema document to schema structures part of this took less than a day to write
  Two main reasons
    Validation on the way in meant
      – I could depend on the presence of required components
      – I didn't need to check for misplaced bits
    Python's object-creation and evaluation facilities
      – Turned most NSL_Items directly into Python objects with object type == GI
  Once I had the structures, implementing refinement was easy
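The idea of turning items directly into objects whose type is the element name (GI) can be sketched with Python's built-in `type()`. This is a guess at the general technique, not the Schema Workbench code: it uses `xml.etree.ElementTree` elements in place of NSL_Items, and the names `class_for`, `build` and the sample element names are hypothetical. Because the input is assumed to have been validated already, the builder does no checking of its own.

```python
import xml.etree.ElementTree as ET

_classes = {}  # one dynamically created class per element name (GI)


def class_for(gi):
    """Return the class named after the element, creating it on first use."""
    if gi not in _classes:
        _classes[gi] = type(gi, (object,), {})
    return _classes[gi]


def build(elem):
    """Turn an element and its descendants into objects with type name == GI."""
    obj = class_for(elem.tag)()
    # XML attributes become Python attributes. A real tool would have to
    # guard against names that clash with 'children' or 'text'.
    obj.__dict__.update(elem.attrib)
    obj.children = [build(child) for child in elem]
    obj.text = (elem.text or "").strip()
    return obj


if __name__ == "__main__":
    # Hypothetical schema-like fragment, invented for this sketch.
    doc = ET.fromstring('<schema><elementType name="p"><mixed/></elementType></schema>')
    schema = build(doc)
    print(type(schema).__name__, schema.children[0].name)
```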
DTD normaliser
  This was a two hour, 1.5 page job:
    Find the DTD
    Construct a string file which uses it
    Open that string
    Sort the doctype
    Print the declarations, sorting disjunctions
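The steps above can be sketched with the standard library's `xml.parsers.expat`, which reports element declarations through `ElementDeclHandler`. This is an approximation under stated assumptions, not the original normaliser: it handles only element declarations given as an internal subset, wraps them in a dummy document named `x`, and the names `render` and `normalise_dtd` are invented here.

```python
import xml.parsers.expat as expat

M = expat.model  # content-model constants supplied by expat

QUANT = {
    M.XML_CQUANT_NONE: "",
    M.XML_CQUANT_OPT: "?",
    M.XML_CQUANT_REP: "*",
    M.XML_CQUANT_PLUS: "+",
}


def render(model):
    """Render an expat content model as text, sorting disjunctions."""
    ctype, quant, name, children = model
    q = QUANT[quant]
    if ctype == M.XML_CTYPE_EMPTY:
        return "EMPTY"
    if ctype == M.XML_CTYPE_ANY:
        return "ANY"
    if ctype == M.XML_CTYPE_NAME:
        return name + q
    parts = [render(child) for child in children]
    if ctype == M.XML_CTYPE_MIXED:
        return "(" + " | ".join(["#PCDATA"] + sorted(parts)) + ")" + q
    if ctype == M.XML_CTYPE_CHOICE:
        return "(" + " | ".join(sorted(parts)) + ")" + q
    return "(" + ", ".join(parts) + ")" + q  # sequence: order is significant


def normalise_dtd(declarations):
    """Return the element declarations sorted by name, in normalised form."""
    decls = {}

    def on_element_decl(name, model):
        decls[name] = render(model)

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    # Construct a string document which uses the DTD, then parse that string.
    parser.Parse("<!DOCTYPE x [\n%s\n]><x/>" % declarations, True)
    return ["<!ELEMENT %s %s>" % (n, decls[n]) for n in sorted(decls)]


if __name__ == "__main__":
    dtd = "<!ELEMENT b (z | a)*>\n<!ELEMENT a (#PCDATA)>\n<!ELEMENT x (b, a)>"
    print("\n".join(normalise_dtd(dtd)))
```

Two DTDs normalised this way can then be compared with an ordinary line-based diff.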
I can't resist :-)
  Once I got the tools built, I could diff the normalised XHTML draft DTD and the DTD produced from my XHTML schema
  I found one error in the DTD!
When it's time to railroad, everybody railroads
  The next big challenge for XML, Schemas particularly, is
    Managing the mapping between document infoset and application infoset
  LT PyXML has proved to be a useful laboratory for exploring this issue