XQuery as an Annotation Query Language: a Use … · XQuery as an Annotation Query Language: a Use Case Analysis ... Recent work has shown that single data model can represent many

XQuery as an Annotation Query Language: a Use Case Analysis

Steve Cassidy

Department of Computing,Macquarie University,

Sydney, [email protected]

http://www.ics.mq.edu.au/˜cassidy

AbstractRecent work has shown that single data model can represent many different kinds of Linguistic annotation. This data model can beexpressed equivalently as a directed graph of temporal nodes (Bird and Liberman, Speech Communication, 2000) as a set of intersectinghierarchies (Cassidy and Harrington, Speech Communication, 2000). While some tools are being built to support this data model, thereis as yet no query language that can be used to search annotations stored in this way. Since the hierarchical view of annotations has muchin common with the XML data model, this paper examines a recent proposal for an XML query language as a candidate annotation querylanguage. The methodology used is a use case analysis. The result of the analysis shows that XQuery provides many useful featuresparticularly when queries include hierarchical constraints but that it is weak in expressing sequential constraints.

1. IntroductionRecent work has shown that a single data model can

represent many different kinds of Linguistic annotation.This data model can be expressed equivalently as a directedgraph of temporal nodes (Bird and Liberman, 2000) or as agroup of intersecting hierarchies (Cassidy and Harrington,2000).

While the data model is now well defined and progressis being made on implementing annotation tools based onthe model there is as yet no query language that can be usedto search corpora stored in this format. Bird et al. (2000)propose an annotation graph query language which closelyfollows the underlying graph structure in the data model.While this language appears to have adequate expressivepower, some quite simple queries can become difficult toexpress (Cassidy et al., 2000). Other query languages forannotations exist in systems such as Emu (Cassidy and Har-rington, 2000) and MATE (McKelvie et al., 2001); whilethese may prove useful starting points for an eventual querylanguage design, they are either known to be incomplete orhave not yet been shown to provide adequate coverage ofall Linguistic annotation requirements.

In a bid to better understand the requirements for anannotation query language, we are collecting a set of repre-sentative use cases that might be used in an objective anal-ysis of a query language proposal. This paper illustratesthis approach by presenting an evaluation of the XQuerylanguage (Boag et al., 1999) which has been developed tosearch XML documents. As we will show, XML sharesmany properties with the annotation data model and so ourwork can be usefully informed by the efforts of the XMLquery community.

1.1. Data Models

A data model is a way of structuring data for access bya computer program. In this case it is a way of storing theannotations on a piece of Linguistic data. A data modelshould capture the important features of the real data andmake them available to programs in a natural way.

The annotation graph data model (Bird and Liberman,

2000) has been proposed as a general purpose data structurefor representing Linguistic annotations in computer sys-tems. An annotation graph (AG) is a directed graph whosenodes represent temporal locations (which may or may notbe associated with time values) and whose arcs representannotated regions with associated features. The AG modelforegrounds the sequential nature of Linguistic data and an-notations and represents hierarchical structures implicitlyby having parent arcs span their children (Figure 1).

Another view of this data model is given by the Emusystem (Cassidy and Harrington, 2000) which foregroundsthe hierarchical structure of an annotation. In the Emumodel, an annotation consists a directed graph with anno-tations as nodes and arcs representing three kinds of rela-tionship: sequence, domination and association (Figure 2).While this appears very different to the AG model we be-lieve them to be isomorphic (Cassidy and Harrington, 2000)and of equal expressive power. The main advantage of theEmu view is that it makes the relationships in the annotationdata explicit and corresponds more closely to the concep-tual model held by corpus constructors. For this reason, thispaper will consider annotations according to the Emu datamodel. However it is important to emphasise the equiva-lence with the AG model and so we will refer to Emu/AGannotations throughout the paper.

1.2. Relationship to XML

XML (Bray et al., 2000) is a standard markup languagedeveloped by the World Wide Web Consortium which isnow used in many application areas that require a standard-ised data annotation format. The XML standard definesa serialisation syntax (the well known <tag>. . .</tag>syntax) and a data model, known as the XML InformationSet (Cowan and Tobin, 2001). In essence, the XML datamodel consists of a single hierarchy of nodes which canbe either the root (document) node, elements or text nodes.Element nodes can have associated attributes with arbitraryvalues and these can be used to link disparate parts of thehierarchy using a pointer mechanism based on unique at-tribute names (IDREF).

1012257

1114120

P/aa

1913650

2013650

T/H*/1

1215240

P/r18

5704000

12360

P/h#

T/0

23270

P/sh

35200S/NP

W/she

1722179

Imt/L-

S/S

Itl/L%

P/iy

46160

P/hv6

9680

S/V

W/had

S/VP

58720

P/ae P/dcl7

10173

P/y8

11077

W/your

S/NP

P/axr9

12019

P/dcl14

16626

W/dark/1

P/d13

16200

P/kcl P/k15

18480

P/sW/suit

1620685

P/uw P/q

Figure 1: An example annotation graph showing an utterance from the TIMIT database which has been augmented withboth a syntactic annotation and a ToBI style intonational annotation. Each anchor (small rectangles) represents a time pointand each arc represents a span of time. The labels on each arc consist of a type and a label, so P/sh denotes an /sh/ phoneme.Hierarchical relations are represented here implicitly by parent arcs spanning their children.

she

L-

L%

had your dark suit

h# sh iy hv ae dcl yaxr dcld aa r kcl k s uw q

H*

NP

S

Phoneme

Word

Intermediate

Intonational

Syntax

she had your dark suit

Tone

VP

V

NP

Figure 2: An example Emu annotation showing the same example as Fig 1. The names of the levels are shown on the left,the Word level has been duplicated to show the links to both the syntactic and intonational hierarchies. The single Toneevent H* is associated with the word ‘dark’. Time information at the phoneme level is used to derive times for all higherlevels.

There is a close correspondence between the XML andEmu/AG data models which is summarised in Table 1.In both cases, the data model consists of hierarchies ofnodes which can have associated attributes (called featuresin AG/Emu). Emu/AG ensures that each node has a definedstart and end anchor (which may be associated with a time);these can be modeled in XML with required attributes foreach element node. The main difference between the datamodels is the requirement for a unique root node and hencea single hierarchy in XML documents. A consequence ofthis is that multi-hierarchy annotations can’t be expresseddirectly in a single XML document.

This restriction has been overcome by using techniquessuch as standoff markup (Thompson and McKelvie, 1997)where multiple hierarchies are stored in different XML doc-uments and linked to a source document using externalreferences. Using this technique allows storage of a datamodel that is largely equivalent to that of Emu/AG.

The obvious motivation for using XML as an annota-tion format is that there is now a massive effort to develop

tools for manipulation of XML in the software industry.These inclde authoring environments, transformation andpresentation tools, query languages and efficient storageand transport mechanisms.

1.3. XML Query Languages

One of the stated advantages of XML is that it allowsa document designer to make the semantic structure of adocument explicit (Berners-Lee et al., 2001). Followingthis, large collections of XML marked up data have be-come available and the requirement for a query languageto extract information from these documents is clear.

The problem of searching XML documents can be seenas a special case of querying semi-structured data: datawhich has some structure but which is not stored in a fixedformat like a relational database (Buneman, 1997). Thework to define an XML query language has been drivenlargely from work in this area. Recently a proposal hasbeen made under the banner of the World Wide Web Con-sortium for an XML query language called XQuery (Boag

XML Data Model Emu Data ModelNodes: documents, elements with attributes Nodes: segments with start/end time and attributesSingle hierarchy of elements with unique root node Multiple hierarchies, many root nodesNode must have a single parent Node can have many parentsNodes are fully ordered (each node has a Nodes are partially ordered (they may overlap and a nodeunique successor) may have more than one successor)Elements have attributes Nodes have one or more featuresArbitrary elements can be related using IDREFS Nodes can be related by association relations

Table 1: The relationship between the XML and Emu data models

et al., 1999) which draws on earlier proposals (Robie et al.,2000; Deutsch et al., 1998) and existing W3C standards.

Queries in XQuery work on XML documents and re-turn a set of document fragments. This provides the im-portant property of compositionality which allows queriesto work on the results of earlier queries. A central part ofthe XQuery standard is the XPath path language which is aW3C standard in it’s own right (Clark and DeRose, 1999).XPath expressions select nodes in the XML hierarchy basedon paths from the root node or some other known point inthe document. XQuery expressions can provide additionalconstraints on the nodes selected by XPath expressions andcan then describe the structure of the returned XML frag-ment which embeds the nodes satisfying all constraints.

Since XQuery is only a proposed standard the languageis something of a moving target. The discussion here con-centrates on the major features of the language which arelikely to be preserved in the final standard. In some casesinformed guesses have been made about how some featuresof the language might work.

1.4. Use Case AnalysisIn order to evaluate XQuery as an annotation query lan-

guage we attempt to express some real queries, or use cases,in the language. Note that we don’t want to have to firsttransform the annotation to multiple XML documents inorder to do this, rather we’ll suggest how XQuery might beextended to be able to cope with intersecting hierarchies inthe Emu/AG model.

Use cases are selected to illustrate real Linguisticqueries but also to exercise different requirements of aquery language. In the larger use case gathering exercisewe seek to collect a representative set of queries. The casesdescribed here have been selected to show the strengths andweaknesses of XQuery in particular. All of the queries citedhere refer to the style of annotation illustrated in Fig 2.

2. Use Cases2.1. Simple XPath Query

The first example query illustrates the basic structure ofXQuery and the use of path expressions to locate elementsin an annotation.

Find all vowel phonemes labels A, E, I, O or U spo-ken by male talkers.

This is a common kind of query in phonetics researchand is well supported in the current Emu query language(Cassidy and Harrington, 2000). The result of the querywould be a list of vowel tokens which could then be used toaccess associated speech data (via their start and end times).

One way of expressing this query in XQuery is shownhere:

//talker[@sex="male"]//word/phoneme[@label = (A, E, I, O, U)]

This query1 uses an XPath (Clark and DeRose, 1999) pathexpression to restrict the result to phonemes which are di-rect children of words (/ denotes a direct link) which are de-scendants (//) of talkers. The talkers are restricted to thosewhich have a sex attribute of male and the phonemes arerestricted based on their label attribute. These restrictionsare made in Xpath by following the element name by addi-tional conditions in square brackets. The result of the queryis a collection of phoneme elements which, in XML, wouldbe returned as XML document fragments.

If this were an XML document, the returned valueswould be the serialisation of the matching nodes and alltheir attributes and child nodes. Importantly, each matchwill be a valid XML document in itself which could formthe target for a further query. This is an important prop-erty to retain in the annotation domain; hence the result ofa query should be a valid annotation. In this case, the re-sult would be a collection of annotations (and associatedfeatures and anchors) corresponding to the nodes matchingthe path expression and this would be a valid annotation.

An alternative return value from a query would be a listof references into the original annotation. This might beuseful if later queries or reports wanted to refer to otherassociations with the matched elements. For example, ifafter finding the vowels matched by this query we wantedto locate the words in which they appear. This could besupported by retaining unique identifiers for each matchingresult such that the resulting annotation is a proper subsetof the original annotaiton.

2.2. Embedded Query

In some cases it is not possible to express a query ina single clause. XQuery allows embedded queries wherethe result of one query is passed to another (recall that theresult of a query is a well formed XML document). Thisquery illustrates the use of this facility.

Find all unique words and return their phonemictranscriptions.

This query can be expressed in XQuery as follows:

1XPath expressions would normally be written on one line like//talker//word/phoneme, they have been split onto multi-ple lines here to fit into the two-column paper format

FOR $wl INdistinct-values(//word/@label)

RETURN<word label={$wl}>

$wl/parent::word/phoneme</word>

This query is structured in two parts: the FOR clausedefines a variable $wlwhich will take on successive valuesas returned by the path expression. The RETURN clausespecifies the form of the XML data to be returned for eachmatch.

The outer query here finds distinct word labels whichare attribute nodes in the XML data model but would benode feature values in the Emu/AG model. The RETURNclause constructs a word element with the same label asthe one matched in the FOR clause. The children of thisword will be the result of another XPath expression whichfinds the phoneme children of the word node matched inthe outer query. Note that we take care to return words withdistinct labels (word types) rather than distinct word ele-ments (word tokens) which would be returned by the pathexpression distinct(//word).

To locate this word, the path expression$wl/parent::word is used. This is a path ex-pression which selects word along the parent axis (ratherthan the default child axis). In this case the word elementis the parent of the attribute node matched in the outerquery. XPath includes a number of axes that can be used inpath expressions, for example following-sibling orancestor. The notation /word/phoneme is shorthandfor /child::word/child::phoneme.

The result of the inner path expression is a set ofphoneme nodes. Hence the result of the the whole queryis a set of word nodes dominating phoneme children.

2.3. Multiple Hierarchies

The primary difference between the XML and Emu/AGdata models is the presence of multiple intersecting hierar-chies in annotations. This query illustrates the need to beable to traverse these intersecting hierarchies.

Find L- Intermediate phrases containing wordswhich form an NP Syntax phrase

There are two approaches to answering this query. Thefirst method looks for NP phrases phrases which dominatewords which all have the same L- Intermediate phrase par-ent. The second considers these two kinds of phrases aspotentially overlapping temporal regions and finds NP syn-tax regions that are wholly inside L- Intermediate phrases.

The first approach can be stated as:

FOR $np in //syntax[@label="NP"]LET $iparent ::=

$np//word[1]/parent::interWHERE

EVERY $int IN$np//word/parent::inter

SATISFIES ($int == $iparent) AND($int/@label = "L-")

RETURN$iparent

This query works by finding NP syntax elements andconstraining every intermediate (note: shortened in the ex-ample to inter) parent of the word children of that NPto be the same as that of the first word – ie. that all thewords are siblings relative to the intermediate node. TheLET clause is used to establish a variable relative to a nodematched in the FOR clause. In this example the variable$iparent is bound to the intermediate phrase parent ofthe first word in $np. The WHERE clause provides addi-tional constraints on the matched nodes, in this case that ev-ery intermediate parent of a word child of $np is the sameas $iparent, the parent of the first word, and has a label“L-”.Note that it is not sufficient to constrain the label onthe intermediate node to have an L- label since this wouldallow NPs spanning two adjacent L- phrases to match.

This query could be made more compact with the defi-nition of a new function sibling which was true if a setof nodes were siblings relative to another node:

FOR $np in //syntax[@label="NP"]WHERE

siblings( $np//word,$np//word[1]/parent::inter)

RETURN$iparent

It is interesting to note that the only generalisation ofthe XQuery model required here is to allow expressions in-volving the parent axis to return node sets rather than singlenodes. Other than this these queries are valid examples ofXQuery.

The alternate approach looks at the boundary times ofthe two kinds of phrase segments. It could be stated as:

FOR $np in //syntax[@label="NP"]$int in //inter[@label="L-"]

WHERE$np/@start >= $int/@start AND$np/@end <= $int/@end

RETURN$int

This solution is similar to that used for the examplequeries in (Teich et al., 2001) to query annotations stored inXML documents. While this seems to provide a simpler so-lution, note that depending on the corpus structure, this ap-proach could yield false positives since it does not constrainthe two phrases to share any words. If the corpus containsannotations of overlapping speech, this query could returnan intermediate phrase by one speaker that overlaps an NPby another.

2.4. Sequential Constraints

The previous cases illustrate the suitability of the XPathcomponent of XQuery to specify hierarchical constraintswithin an annotation, even when multiple hierarchies areinvolved. A common requirement in Linguistic corpora isto specify sequential patterns which can often involve com-plex constraints. A good example is the use of a query tofind examples of groupings defined by a particular syntacticor phonological construct. This construct might be definedin terms of a regular or context free grammar. Hence the

query language needs to be able to express regular or con-text free constraints. The example cited here is taken froma phonetic study which was interested in syllabic structuresdefined as sequences of consonants and vowels following asimple regular grammar:

Find sequences of any number of consonantphonemes followed by a single vowel followed by anynumber of consonants: C+VC+.

Such patterns are easily specified in regular expressionson strings but in XQuery we have to work to satisfy thisquery. While XPath can be used to specify sequences ofnodes (using the following-sibling axis:) it is notable to state the ‘one or more’ constraint in the regular ex-pression ‘C+’. (This limitation is not isolated to sequentialpaths, XPath can’t apply a constraint like this along anyaxis.)

The solution in this case is to write functions specific tothis query which check for ‘one or more consonants’ pre-ceding or following a vowel. User defined functions are anintegral part of XQuery and this recursive function returnssequences of elements along the following-siblingaxis which have the label “C”:

define function fsequence( element $s )returns element * {

let $c := $s/following-sibling::phoneme[@label="C"]

if (empty($c)) then return ()else return ($c, fsequence( $c ) )

}

A similar function (psequence) can be defined towork on the preceding-sibling axis and these twofunctions can then be used to state the query:

FOR $v in //phoneme[@label="V"]LET $pc ::= psequence( $v )

$fc ::= fsequence( $v )WHEN

$pc != () AND $fc != ()RETURN

<result>($pc,$v,$fc)

</result>

This query first finds vowel phonemes and then sets uptwo variables in the LET clause corresponding to C+ se-quences before and after the vowel. The WHEN clause testswhether the sequences of consonants are empty and if theyare not the RETURN clause constructs a <result> nodecontaining the matched sequence.

While the functions psequence and fsequencemight be generalised a little it seems clear that quite a bitof work needs to be done in order to specify patterns ofthis nature. More complicated conditions would be harderto implement and the fit between the statement of the queryand it’s realisation is very poor for this class of query. In ad-dition, the use of user defined functions in this way preventsany optimisation of these kinds of queries in a database sys-tem.

2.5. Embedded Clauses

While the previous case tried to express a regular ex-pression like query, this case is concerned with finding in-stances of a context-free grammar rule. It illustrates a sim-ilar class of weakness in XPath to the previous case butalong a different axis.

The context free grammar rule ’NP � Adj NP’ canproduce embedded NP clauses to arbitrary depth. Findexamples of NP clauses where all embedded NPs are ofthis type except for the final NP which should be a sin-gleton noun.

This query might match clauses like ‘big angry hairygoat’ but should reject ‘the goat’ (which follows the struc-ture ‘NP � Det NP’). To answer this query we much checkthat each NP node in the parent/child sequence has Adj andNP children except for the final one. An XPath expressionfor a one-level deep clause would be:

//syntax[@label=’NP’ andchild:syntax[label=’Adj’]]

/syntax[@label=’NP’]/syntax[@label=’N’]

(In fact, this expression would give false positivematches since we don’t say that the top NP has only twochildren). This path expression uses a conjunction insidethe qualifier for the first path step stating that the syntaxnode must have an NP label and have a syntax child withan Adj label. The second step states that the NP must have asyntax child labelled NP which has a syntax child labelledN. To extend this to the arbitrary level of embedding re-quired by the use case requires a function similar to that inthe previous case to enforce a constraint on a sequence ofsteps along the child axis.

3. DiscussionThe examples above illustrate the main features of

XQuery and show that it can be applied to the Emu/AGdata model without the need for syntactic changes. Inparticular, the XPath component of XQuery is very wellsuited to specifying hierarchical constraints in annotationqueries and extends easily to intersecting hierarchies if themultiple-parent assumption is made.

The provision of different axes (parent, child, sibling,etc) in XPath is a powerful mechanism which could be use-fully extended for annotation queries: one could imaginedefining, for example, an ‘overlap’ axis to allow selectionof temporally overlapping nodes.

An important weakness of XQuery is illustrated by thefinal two use cases which require constraints on a sequenceof nodes along an axis. The problem is that an XPath ismade up of a sequence of steps along an axis but that herewe need to specify conditions on multiple steps. This ismost apparent for the sibling axes but might also occur forexample along an associative axis or on the parent/childaxis. The AGQL query language proposed by Bird et al.(2000) includes this facility for sequential path expressions.The C+VC+ query can be expressed in that language as:

selectF(W).[label: CVC].F(Z)

whereW.[cons]+.X.[vow].Y.[cons]+.Z

(where the details of the conditions in square bracketshave been omitted for brevity). This query asks for oneor more cons arcs between anchors W and X, a vow arcbetween X and Y and one or more cons arcs between Yand Z. The combination of this kind of facility with theflexibility to use different axes in XPath would provide apowerful component of an annotation query language.

It is clear that the most natural queries to express inXQuery are those dealing with purely hierarchical con-straints. An earlier review of the AG query language pro-posed by Bird et al. (2000) found that it could expresssequential constraints well but was awkward hierarchicalqueries (Cassidy et al., 2000). These strengths come di-rectly from the forms of the underlying data models: XMLis primarily hierarchical, AGs are primarily sequential. Ourobservations are that the requirements of Linguistic anno-tation are for a free mixture of sequential and hierarchicalstructures. A query language combining the strengths ofXQuery and the AGQL would go a long way towards satis-fying these requirements.

4. ReferencesTim Berners-Lee, James Hendler, and Ora Lassila.

2001. The Semantic Web. Scientific American, May.http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html.

S. Bird and M. Liberman. 2000. A Formal Framework forLinguistics Annotation. Speech Communication.

S. Bird, P. Buneman, and W. C. Tan. 2000. Towards aQuery Language for Annotation Graphs. In Proceedingsof LREC 2000, Athens, Greece.

Scott Boag, Don Chamberlin, Mary F. Fernandez, DanielaFlorescu, Jonathan Robie, Jerome Simeon, and MugurStefanescu. 1999. XQuery 1.0: An XML Query Lan-guage. W3C Working Draft, November. http://www.w3.org/TR/xpath.

Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, andEve Maler. 2000. Extensible Markup Language (XML)1.0 (Second Edition). W3C Recommendation, October.http://www.w3.org/TR/REC-xml.

Peter Buneman. 1997. Semistructured data. In PODS’97.Invited Tutorial.

S. Cassidy and J. Harrington. 2000. Multi-level Annota-tion in the Emu Speech Database Management System.Speech Communication, 33:61–77.

S. Cassidy, P. Welby, J. McGory, and M. Beckman. 2000.Testing the adequacy of query languages against anno-tated spoken dialog. In Proceedings of the Speech Sci-ence and Technology Conference, pages 428–433, Can-berra, Australia. Australian Speech Science and Technol-ogy Association.

James Clark and Steve DeRose. 1999. XML Path Lan-guage (XPath) Version 1.0. W3C Recommendation,November. http://www.w3.org/TR/xpath.

John Cowan and Richard Tobin. 2001. XML InformationSet. W3C Recommendation, October. http://www.w3.org/TR/xml-infoset/.

Alin Deutsch, Mary Fernandez, Daniela Florescu, AlonLevy, and Dan Suciu. 1998. XML-QL: A Query Lan-guage for XML. available as http://www.w3.org/TR/NOTE-xml-ql, August. W3C Note.

D. McKelvie, A. Isard, A. Mengel, M. Grosse, andM. Klien. 2001. The MATE Workbench - an annotationtool for XML coded speech corpora. Speech Communi-cation.

Jonathan Robie, Don Chamberlin, and Daniela Flo-rescu. 2000. Quilt: An XML Query Language.Presented at XMLEurope 2000, July. http://www.gca.org/papers/xmleurope2000/papers/s08-01.html.

E. Teich, S. Hansen, and P Fankhauser. 2001. Repre-senting and querying multi-layer annotated corpora. InS. Bird, P. Buneman, and M. Liberman, editors, Pro-ceedings of the IRCS Workshop on Linguistic Databases,pages 228–237. http://www.ldc.upenn.edu/annotation/databases.

Henry S. Thompson and David McKelvie. 1997. Hyper-link semantics for standoff markup of read-only doc-uments. In SGML Europe ’97. http://www.ltg.ed.ac.uk/˜ht/sgmleu97.html.

Documents

XQuery as an Annotation Query Language: a Use … · XQuery as an Annotation Query Language: a Use Case Analysis ... Recent work has shown that single data model can represent many