

"Recent Advances in Corpus Annotation"

Brigitte Krenn, Thorsten Brants, Wojciech Skut, Hans Uszkoreit

Workshop

Language and Computation


Construction and Annotation of Test-Items in DiET

Judith Klein^, Sabine Lehmann*, Klaus Netter^, and Tillmann Wegst^

DFKI^ GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken (Germany), [email protected]

ISSCO*, University of Geneva, 54, route des Acacias, CH-1227 Geneva (Switzerland), [email protected]

1 Motivation

As industrial use of language technology is flourishing and the market for competing software systems is expanding, there is an increasing need for the assessment of NLP components on the level of adequacy evaluation and quality assurance. However, effective and efficient assessment is often restricted by the lack of suitable test material and tools. One approach traditionally used for testing and evaluating NLP systems comprises test-suites, i.e. systematic collections of constructed language data associated with annotations describing their properties.

2 Project Aims

The objective of the DiET1 - Diagnostic and Evaluation Tools for NLP Applications - project is to develop a tool-box which will allow the user to construct test-suites, and to adapt and employ these resources for the evaluation of NLP components. The project builds on the results of the tsnlp (Test Suites for Natural Language Processing) project (Lehmann et al. 1996), where a test-suite database tsdb2 was built which contains substantial English, French and German test-suites covering a wide range of syntactic phenomena.
DiET aims at extending and developing the tsnlp test-suites with annotated test-items for syntax, morphology and discourse phenomena, but DiET is not a mere continuation of tsnlp. It goes beyond the tsnlp approach and tries to overcome the data-related and technological shortcomings of tsdb (as summarized e.g. in (Klein 1996)). The main focus of DiET is no longer the building of a large amount of test material but the development of more sophisticated diagnostic and evaluation tools - working efficiently behind a user-friendly graphical interface - for the creation, retrieval and employment

1 The project started in April 1997 and lasts until April 1999. More information on DiET can be obtained through the world-wide web under http://dylan.ucd.ie/DiET/. The DiET project is supported by the European Commission and the Swiss government under the Telematics Application Programme (Language Engineering LE 4204).

2 The test-suite database tsdb can be directly accessed via http://tsnlp.dfki.uni-sb.de/tsnlp/tsdb/tsdb.cgi

of such data. The DiET system provides a flexible annotation schema which comprises a number of already existing annotation-types and allows the user to easily modify these types or define a completely different set of annotation-types according to his/her specific needs. In contrast to tsnlp, where the adaptation of the (conceptually) flexible annotation schema is possible - as, for example, (Flickinger and Oepen 1997) show - but only for people familiar with relational database systems3, in DiET these extensions and modifications can easily be done by means of graphical interface tools.
While the first set of tools is hence devoted to the construction of annotated test-items, the second set concerns the customisation of test and reference data to specific applications and domains, i.e. the user is given the means to construct, modify and extend data as well as the list of annotations for his/her specific requirements. Customisation efforts focus on three different types of applications, namely Grammar and Controlled Language Checkers, Parsers as components of Machine Translation Systems, and Translation Memories. Under the heading of customisation a whole range of different processes and operations is subsumed: (i) functions to perform adaptation to new domains on the basis of lexical replacement, (ii) the development of document profiles, establishing a relation between test-suites and domain-specific corpora, and (iii) support for the process of adapting and customising the test material for concrete evaluation procedures.
This paper focuses on the annotations, i.e. customisation processes are only addressed inasmuch as they are reflected in the annotation schema. Also, only the tools immediately relevant for the annotation task are discussed.

3 Annotation Schema

As already mentioned, within the scope of DiET a set of annotation-types will be specified. For this

3 The modification of the annotation schema, realised as a conceptual database schema, requires basic knowledge of relational database design in order to re-arrange the internal database relations.


purpose the annotation-types defined in tsnlp were revised. Most of the attributes relevant for syntactic analyses (i.e. structural and functional information, wellformedness), phenomenon descriptions (e.g. phenomenon supertypes) and formal properties (e.g. origin) will be kept for DiET, changing only the representation format. Some of the salient characteristics of the syntactic phenomena (the so-called parameters) remain unchanged (e.g. agreement and verbal subcategorisation), others need to be consistently defined and labelled for the three languages (e.g. coordination attributes). A few annotation-types (which were already problematic in tsnlp, e.g. phenomenon interaction) remain to be defined. In general, some new annotation-types describing formal and linguistic properties need to be specified. Furthermore, the whole complex of application-specific annotations will be worked out for the three applications chosen for DiET, since tsnlp made only a general proposal on user & application profile annotations. Finally, DiET will provide two completely new sets of annotations: one comprises annotation-types describing corpus-related information, the other one consists of annotation-types describing evaluation-specific aspects.4

Since it is impossible to foresee all information needed by a user, the annotation schema is open to customisation and extension. It is therefore designed to be flexible and can be tailored to the user requirements: the system will not only provide the means to annotate data on the basis of a given set of different annotation-types, but will allow the user to define his/her own specific types. Many of the annotations consist of simple features attributing certain values to the test-items, but there are also structural annotations, for example to describe the internal phrasal and relational structure of the items.
The objects that an annotation may be attached to are (i) test-items (strings), (ii) (ordered) groups of test-items, and (iii) segments of test-items. The test data are assigned a broad range of different annotations describing (i) linguistic properties, (ii) application-specific features, (iii) corpus-related information based on text profiling, and (iv) evaluation-specific attributes.

3.1 Formal and Linguistic Annotations

In principle the annotations interpret and classify a test-item as a whole, including language, format variations, information about well-formedness and the test-item level (i.e. phrasal, sentence or text level).
The linguistic annotations of the test-items comprise morphological, syntactic and discourse

4 It is important to note that within the DiET project these annotation-types will just be specified. The DiET group does not aim at annotating all test-items according to these annotation-types.

analysis. The morphological annotations provide information on lexical category and attach the lexical items to their respective ambiguity classes. At the syntax level - following the example of the NEGRA annotation tool (Skut et al. 1997) - graphical tree and dependency representations will provide information on the structural analysis of the test-items, where the non-terminal nodes are assigned phrasal categories and the arcs grammatical functions (subject, object, modifier, etc.). It is possible to assign a syntactic well-formedness judgement to the overall structure. The discourse analysis provides information on direction (e.g. antecedent) and type (e.g. co-reference) of semantic relations between test-item-segments.
All test-items can be classified according to the linguistic phenomenon illustrated. In order to characterise test-items for the salient properties of a specific phenomenon - which are language-specific - additional annotation-types will be defined. In German, for example, the annotation-types case, number and gender need to be specified for the syntactic phenomenon NP agreement.

3.2 Application-specific Annotations

The annotation schema also provides information about application-specific properties of the test-items such that the users can formulate specific search statements retrieving suitable test material. But these annotations also serve as reference material which can be used within the comparison phase of the evaluation procedure. For grammar and controlled language checkers, for example, annotation-types can be defined which specify the error type of ill-formed test-items and connect the ungrammatical test-items to their grammatical counterparts. The test material for parsers can be annotated with the number of alternative readings for ambiguous test-items. For translation memories, the test-items of the source language can be connected to adequate translation units from the translation system.
In addition to these pre-defined application-specific annotation-types, the users can easily extend the annotation schema for their specific needs.

3.3 Corpus-related Annotations

Systematically constructed data are useful for diagnosis but should not be deemed sufficient for adequacy evaluation. What is needed are test-suites related to test corpora to provide weighted test data, since for evaluation purposes it is very important how representative a test-item is of a certain (text or application) domain, whether it occurs very frequently, whether it is of crucial relevance, etc. Establishing a relation between the isolated, artificially constructed test-items and real-life corpora presupposes the identification of the typical and salient characteristics of the examined text domain. This process is called text profiling.


DiET will integrate the NEGRA tagging and parsing tools to semi-automatically annotate corpus texts for part-of-speech and possibly also structural and dependency information. Analysis tools will be employed to establish a text profile containing information on frequency and distribution of text elements. The information in the profile will include non-linguistic information such as the use of punctuation marks as well as a description of linguistic characteristics such as sequences of lexical categories, patterns of specific lexical items and lexical categories, or even structural descriptions. Manual inspection and interpretation of the resulting profile will reveal which linguistic constructions are salient in the examined text and hence relevant for the envisaged application domain.
Once the corpus-relevant linguistic characteristics are identified, they will be used to (i) elaborate a relevance measure, i.e. decide to what extent certain criteria are significant to establish the relevance values, and (ii) select test-items and classify them for relevance (extending the annotation schema for this new annotation-type). If, for example, the text profile records a high percentage of co-ordinating conjunctions between two nominal phrases, the user will select test-items exemplifying simple NP co-ordination and give them a high relevance value. This is a simplified example, since very often the relevance value will be some kind of composition of the values given to several annotation-types, e.g. sentence-length, number-of-conjunctions, number-of-coordinated-elements, NP-type, number-of-NP-elements, etc.
Currently, the responsible DiET group works on the conceptual design to realise text profiling and to allow for the linkage between test-items and text corpora, i.e. the concrete mechanisms for this customisation process cannot be provided yet.
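To illustrate how such a composition might work, the following Python sketch computes a relevance value as a weighted combination of several annotation-type values. The attribute names come from the example above, but the weights and the weighted-sum rule are purely illustrative assumptions; the paper leaves the concrete mechanism open.

    # Hypothetical composition of a relevance value from several
    # annotation-type values. Weights would be derived from the text
    # profile of the examined domain (the values here are assumed).
    WEIGHTS = {
        "number-of-conjunctions": 0.4,
        "number-of-coordinated-elements": 0.3,
        "number-of-NP-elements": 0.2,
        "sentence-length": 0.1,
    }

    def relevance(annotations: dict) -> float:
        """Combine several annotation-type values into one relevance
        value by a weighted sum (one possible composition)."""
        return sum(WEIGHTS[name] * value
                   for name, value in annotations.items()
                   if name in WEIGHTS)

    # A test-item exemplifying simple NP co-ordination:
    item = {"number-of-conjunctions": 1.0,
            "number-of-coordinated-elements": 2.0,
            "number-of-NP-elements": 2.0,
            "sentence-length": 0.5}
    print(relevance(item))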

3.4 Evaluation-specific Annotations

The annotation schema will comprise annotation-types describing the evaluation scenario, including the user-type (developer, user, customer), type and name of the system under evaluation, goal of the evaluation (diagnosis, progress, adequacy), conditions (black box, glass box), evaluated objects (e.g. parse tree, error correction, etc.), evaluation criteria (e.g. correct syntactic analysis, correct error flag, etc.), evaluation measure (the analysis of the system results, for example in terms of recall and precision), and the quality predicates (e.g. the interpretation of the results as excellent, good, bad, etc.).
Since the DiET database is also meant to accommodate evaluation results, the annotation schema will comprise annotation-types to attach the actual output of the examined NLP system to individual items. This information may allow for the (at least manual) comparison of the actual to the

expected output (the reference annotation specified within the application-specific annotation-types). Clearly, not all annotations can be gathered automatically, but the DiET system will support the users in organising evaluation runs and keeping track of the results.

3.5 Meta-information on the Annotations

The database also provides information about the test-suite as a whole, such as listings of the vocabulary occurring in the test-items, part-of-speech tag sets or the terminology used for the specification of annotation-types. Additionally, (declarative) information will be assigned to the linguistic phenomena, comprising the (traditional) phenomenon name, its relation to other phenomena displayed in a hierarchical phenomenon classification, and an (informal) description of the linguistic constructions belonging to the phenomenon in question.

4 Annotation Tool

The objective of DiET is to offer a flexible tool package which is open enough to cover the requirements of different types of users who wish to employ the system for a range of applications. A tool package accessible from a graphical user interface, portably implemented in Java, is being built. DiET has a client/server architecture, with a central database system; lexical replacement, tagging and syntactic analysis work as servers, and a client integrates construction, display and editing facilities. The central construction and annotation tool serves to enter new data and to annotate these items on different levels of linguistic abstraction, such as phrasal constituency or anaphoric relations, and for various application- and corpus-specific information. The process of annotation is supported, among others, by part-of-speech tagging and parsers with bootstrapping facilities learning from existing annotations.

Figure 1 gives an impression of how the main window of the tool will look. In principle, the interface aims at simplicity in design and offers clearly arranged operation fields to provide an easy-to-use annotation tool. The left window contains the test-items. The right one is split up into two parts: the upper window shows the hierarchically arranged annotation-types together with the values attributed to the selected test-item; the lower part presents more information on the value(s) of the annotation-type marked in the window above.
How to annotate a test-item? First, the user selects a test-item. From the pool of annotation-types (in the upper right window), s/he chooses an annotation-type, e.g. syntactic analysis, NP.coordination, etc. In the lower right window, fields appropriate for the given annotation-type will appear, allowing the


Figure 1: DiET GUI

value(s) to be entered. In the case of syntactic analysis, for example, this will be a tree window.
It is often the case that the construction of test-items can be based on already existing test-items, since only minor changes in the test-items or in the annotations need to be made. DiET therefore gives the user the possibility to duplicate and adapt a similar entry rather than producing the test-item string and its annotations from scratch.
The user can also easily specify new annotation-types. For the definition of annotation-types a dialog-window will open (see figure 2). To define a new annotation-type, the user chooses a name for the new type, attributes it to the respective data type, (if necessary) defines the range of allowed values, and positions it within the hierarchically ordered list of annotation-types. Annotation-types can be applied to different types of linguistic objects (i.e., test-items, groups of test-items, etc.), and their values (if any) can be configured to be entered manually or, alternatively, to be provided through some server, i.e. the user selects a service (e.g. a tagger) which will provide the values. Instances of such annotation-types could be, for example, phrasal or relational structures over strings building on the data type tree, anaphoric relations making use of a simple data type arc, well-formedness judgements with a boolean value, etc. (a sketch of such a declaration is given at the end of this section).
While most of the described functions of declaration, selection, and data entry will be carried out in the central client module, there will also be a number

of specialised and potentially decentralised servers supporting the tasks of data construction and annotation. (Semi-)automatic annotation of data by servers is foreseen for standard annotation-types such as part-of-speech tagging. These will be available for the three languages, as will be a morphology component to assign standardised morpho-syntactic classifications.
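As a rough illustration of what such an annotation-type declaration might hold (name, data type, allowed values, position in the hierarchy, object it applies to, and whether values are entered manually or provided by a server), here is a minimal Python sketch. The field names and the representation are assumptions for illustration; DiET's actual internal model is not described here.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AnnotationType:
        name: str                              # e.g. "NP.coordination"
        data_type: str                         # e.g. "boolean", "tree", "arc"
        allowed_values: Optional[list] = None  # range of values, if restricted
        parent: Optional[str] = None           # position in the type hierarchy
        applies_to: str = "test-item"          # test-item, group, or segment
        value_provider: str = "manual"         # "manual" or a server, e.g. "tagger"

    # Two of the instances mentioned in the text:
    wellformedness = AnnotationType("wellformedness", "boolean", [True, False])
    syntax = AnnotationType("syntactic-analysis", "tree",
                            value_provider="parser")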

5 Concluding remarks

DiET supports NLP evaluation through the development of tools which allow the user to construct and customise annotated test material, which can consist of a systematic collection of constructed language data or of real-life corpora. In contrast to the Penn Treebank (Marcus et al. 1994) and the NEGRA annotation tool (Skut et al. 1997), DiET provides the means to attach richer annotations to the data which go beyond part-of-speech tagging and syntactic analysis, namely the possibility to assign salient phenomenon-specific characteristics, the so-called parameters, to the test-items. Furthermore, semantic information can be associated with test-item-segments above sentence level.
The project focuses on the flexibility of the annotation schema which - with the given tool-box - can easily be tailored according to the requirements of the user: the test-items can be annotated with respect to formal and linguistic properties, application-specific features, corpus-related information based on document profiling and evaluation-


Figure 2: Configuration of annotation-types

specific attributes.
The project also develops a set of tools which support the customisation of test-suites by adapting them to new domains on the basis of lexical replacement, the development of text profiles and the support for evaluation purposes.

Acknowledgement

We thank all our DiET colleagues, Susan Armstrong (ISSCO), Jochen Bedersdorfer (DFKI), Tibor Kiss (IBM Germany), Bernice McDonagh (LRC), David Milward (SRI Cambridge), Dominique Petitpierre (ISSCO), Steve Pulman (SRI Cambridge), Sylvie Regnier-Prost (Aerospatiale), Reinhard Schaler (LRC), and Hans Uszkoreit (DFKI) for their contributions.

References

Dan Flickinger and Stephan Oepen: Towards Systematic HPSG Grammar Profiling. Test Suite Technology Ten Years After. In: Proceedings of DGfS-CL, Heidelberg 1997.

Judith Klein: TSDB - Ein Informationssystem zur Unterstützung der syntaktischen Evaluierung natürlichsprachlicher Systeme. Master's Thesis, Saarbrücken 1996.

Sabine Lehmann, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Herve Compagnion, Judith Baur, Lorna Balkan, and Doug Arnold: tsnlp - Test Suites for Natural Language Processing. In: Proceedings of COLING, Copenhagen 1996, pp. 711-716.

Marcus, M. P. et al.: The Penn Treebank: Annotating Predicate Argument Structure. ARPA Human Language Technology Workshop 1994. (http://www.ldc.upenn.edu/doc/treebank2/arpa94.html)

Klaus Netter, Susan Armstrong, Tibor Kiss, Judith Klein, Sabine Lehmann, David Milward, Sylvie Regnier-Prost, Reinhard Schaler, Tillmann Wegst: DiET - Diagnostic and Evaluation Tools for Natural Language Processing Applications. In: Proceedings of the First International Conference on Language Resources and Evaluation, Granada 1998, pp. 573-579.

Stephan Oepen, Klaus Netter and Judith Klein: tsnlp - Test Suites for Natural Language Processing. In: Nerbonne, J. (Ed.): Linguistic Databases. CSLI Lecture Notes. Stanford 1997, pp. 13-37.

Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit: An Annotation Scheme for Free Word Order Languages. In: Proceedings of ANLP 1997, Washington 1997, pp. 88-96.


GLOSS: A Visual Interactive Tool for Discourse Annotation

Dan Cristea Ovidiu Craciun Cristian Ursu

University "AI.Cuza" IasiFaculty of Computer Science

16, Berthelot St., 6600 Iasi, Romania

{dcristea, noel, cursu}@thor.infoiasi.ro

Abstract

We present an annotation tool, called GLOSS, that has the following features: it accepts as input SGML source documents and/or their database images and produces as output SGML source documents as well as the associated database images; allows for simultaneous opening of several documents; can collapse independent annotation views of the same original document, which also allows for a layer-by-layer annotation process in different annotation sessions and by different annotators, including automatic ones; offers an attractive interface to the user; and permits discourse structure annotation by offering a pair of building operations (adjoining and substitution) and remaking operations (undo, delete parent-child link and tree dismember). Finally, we present an example that shows how GLOSS is employed to validate, using a corpus, a theory of global discourse.

1. Introduction

It is generally accepted that annotation increases the value of a corpus. The added value resides in the human expertise that is so contributed. A fully automatic annotation, although perhaps not foreseeable, would make the very idea of annotating a corpus completely useless, as, in that hypothetical moment of the future, all human expertise would be totally reproducible. Even if such a picture is not realistic, there will always exist the need for low-level automatic annotation tasks that could speed up tedious phases of the annotation work done by humans. There is also a tremendous need for advanced tools able to help this process, acting in an interactive manner.

On the other hand, recent progress in corpus-driven methods for studying natural language phenomena puts more and more emphasis on the reusability of linguistic resources. It becomes natural to think that annotated corpora created for a certain goal could further be used for a different purpose by another research team that finds in the existing annotation a partial fulfilment of its needs and would like to add the missing parts by its own efforts. In order to speed up an annotation task it is also plausible that the workload be distributed to independent annotators, each responsible for just one set of markups. In such a case it is desirable that the annotators work in parallel, on different copies of

the same base document, and that an integrating device exist, able to combine the contributing markups into a single document.

Such a need is described in Cristea, Ide and Romary (1998a), where a marking procedure is proposed in order to deal with the goal of annotating corpora from two different perspectives: discourse structure and reference. This scheme arose from the necessity to validate a theory on discourse structure in correlation with anaphora (Cristea, Ide and Romary, 1998b).

This paper presents an annotation tool, called GLOSS, that implements the ideas presented in (Cristea, Ide and Romary, 1998a) regarding multiple annotation views on the same original document, while also providing a powerful visual interactive interface for annotation of discourse structure. The result of the annotation process is a pair of files: an ASCII SGML document, and its database mirror, which can be SQL-accessed.

2. The annotator's features

SGML compatibility

The annotator accepts as input original un-annotated texts as well as SGML documents, which must be paired with their DTD files1. At any moment of the annotation process the

1 The current implementation allows for a simplified DTD syntax.


document under annotation can be saved in the SGML format.

Database image copy

During the annotation process, an internal representation of the markups is kept in an associated database. The database records markers to text as pairs of beginning and end indices in the bare text document. When an annotation session is finished, the associated database can be saved for interrogation purposes.

GLOSS supports simple queries to be addressed to the database, in the SQL language, interactively during an annotation session.

Once a database image of a document exists, it can act as input for a subsequent annotation session with GLOSS. This allows certain types of tags to be enriched by an automatic procedure, as sketched in figure 1. In this way manual annotation can be combined with automatic annotation in an easy way.
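As an illustration of such a database image, the following minimal Python sketch (using the standard sqlite3 module) stores markups as pairs of beginning and end indices into the bare text and poses a simple SQL query of the sort described above. The table and column names are assumptions, not GLOSS's actual schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE markup (
                      id INTEGER PRIMARY KEY,
                      tag TEXT,           -- element name, e.g. 'SEG', 'RS'
                      begin_idx INTEGER,  -- start offset in the bare text
                      end_idx INTEGER)""")
    text = "Mimi likes detective stories."
    db.execute("INSERT INTO markup VALUES (1, 'RS', 0, 4)")
    db.execute("INSERT INTO markup VALUES (2, 'RS', 11, 28)")

    # An interactive query of the kind a user might pose in a session:
    for tag, b, e in db.execute(
            "SELECT tag, begin_idx, end_idx FROM markup WHERE tag = 'RS'"):
        print(tag, text[b:e])  # -> RS Mimi / RS detective stories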

Multiple parentage / multiple views feature

The annotator allows for simultaneous opening of more than one document, the same way an editor behaves. Within the same session the user can commute from one document to another, each having an associated workspace window (as shown below).

The multi-document feature is justified by the multiple-views philosophy that it implements (Cristea, Ide and Romary, 1998a). According to this philosophy, a document can inherit from multiple documents, considered as its parents in a hierarchical structure. The overall architecture is that of a directed acyclic graph (DAG) with a base document as root. All documents in the hierarchy represent annotations made from different perspectives of the same original hub document. If a document is defined to inherit from one or more parents, then any or all of the contributing markups can be displayed simultaneously on the screen.

The multiple-views behaviour is obtained by a process of intelligent unifying of the database representations of the parent documents. So, when a document is declared to inherit from two or more parent documents, another database is generated that copies the common parts from the parents once and then adds the markups that are specific to each of them. After the moment

the declaration of parentage is made, which must occur when a new view is created, the current document loses any connection with the parent documents, such that any modification can be made to this version without affecting the original markups in any way. A sketch of this unification step follows.
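A minimal sketch of the unification just described, with markups modelled simply as (tag, begin, end) tuples over the same base text. This only illustrates the copy-the-common-parts-once idea under that assumption; it is not GLOSS's actual algorithm.

    def merge_views(*parents: set) -> set:
        """Build an independent child database from parent databases:
        common markups are copied only once (set union deduplicates
        them), and parent-specific markups are added."""
        child = set()
        for parent in parents:
            child |= parent
        return child

    u_view = {("SEG", 0, 4), ("SEG", 5, 28)}
    rs_view = {("SEG", 0, 4), ("SEG", 5, 28), ("RS", 0, 4)}
    child = merge_views(u_view, rs_view)
    # Later edits to `child` leave u_view and rs_view untouched.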

Dynamic versus static inheritance in the views architecture

Following current approaches to inheritance systems (see, for instance, Daelemans, De Smedt & Gazdar, 1992), two styles of recording data can be used in a hierarchy of views. In a dynamic manner, only new information is recorded, while old information is referred to by pointers into one of the hierarchically upper views (in fact, their associated database images). This mimics a kind of monotonic inheritance. This decision is one that yields a maximum of parsimony in recording data. On the other hand, it obliges all hierarchically upper views to be simultaneously open and active during an annotation or interrogation session. It also blocks performing any modification of the original text in any of the documents in the hierarchy. This is because alignments would be lost down the hierarchy, as text indexes could be changed.

On the contrary, in the static manner a new copy of the database is created for each new view. Although it records the inherited data anew, this


style permits independence from the original parents during annotation and interrogation. As such, different views in the hierarchy are isolated and can diverge significantly. The only restriction is that all parents of a child view be based on the same text document.

GLOSS implements the static inheritance philosophy.

The main interactive workspace

Each opened document under annotation has a main window where the original text is displayed (see figure 2). The SGML tags do not appear on this screen. Instead, the tagged segments of text are highlighted in colours. The user can assign a

different colour to each type of tag. Highlights are sensitive to mouse clicks and can be opened for visualisation and/or modification of the corresponding attribute-value structures.

Text-empty tags, as for instance co-reference links (Bruneseaux and Romary, 1997) or, in a certain representation, discourse relations, can also be displayed. Since there is no text that they could be anchored to by highlighting, only their IDs are displayed. This happens in a child window where the text-empty tags are grouped by their types. As in the case of tags that surround text, the text-empty tags can be opened for inspection and/or modification with a mouse click.

Figure 2: A GLOSS session with two documents opened

Figure 3: Floating window with adjoining and substitution structures


The discourse structure annotation workspace

Annotating the discourse structure in GLOSS is an interactive visual process that aims at creating a binary tree (Marcu, 1996; Cristea & Webber, 1997), where intermediate nodes are relations and terminal nodes are units. The discourse structure window appears near the main document window, as shown in figure 2.

An incremental, unit-by-unit elaboration that would precisely mimic an automatic expectation-driven incremental tree development, as described in Cristea & Webber (1997), is not compulsory during a manual process. Manual annotation resembles more a trial-and-error, islands-driven process. To facilitate the building of the tree structure, GLOSS allows development of partial trees that can subsequently be integrated into existing structures. Each unit or partial structure is incorporated within one of the already existing trees either by an adjoining operation (adding to a developing structure an auxiliary tree - one that has a foot node, marked by a *, on the left extreme of the terminal frontier) or a substitution operation (replacement of a substitution node, marked by a ↓, belonging to the developing structure, with a whole tree) (Cristea & Webber, 1997). Adjoining of a partial tree (minimally a unit) into an already existing, developing tree is allowed only on the nodes of the outer or inner right frontier2, as defined in Cristea & Webber (1997). The approach in Cristea & Webber (1997) that restricts substitution to the innermost substitution node only is also preserved.

Both build operations (adjoining and substitution) require two arguments: the inserted structure and the node of the final structure where the insertion takes place. The adjoining operation is a drag-and-drop one, from the root of the inserted partial structure to the node of an existing structure where the insertion must take place. In the same manner, a substitution operation is performed between an elementary tree structure and a destination substitution node belonging to the partially developed structure. If, instead of dragging, the source node is right-

2 An outer right frontier is the right frontier (the nodes on the path from the root to the rightmost terminal node) in a tree without substitution sites. An inner right frontier is the right frontier of the tree rooted at the left sister of the innermost substitution site in a tree with substitution sites.

clicked, a float appears displaying a family of structures (auxiliary or elementary trees) that could be created using the selected node as material. These structures are shown in figure 3.

After completion of any adjoining or substitution operation, the obtained tree obeys the Principle of Sequentiality3 (Cristea & Webber, 1997).

Remaking the discourse tree

As discourse annotation is an ever-refining process, we have implemented a number of remaking operations that allow for undoing work already done when building trees: cuts of parent-child links and complete dismembering of parts of the developing tree.

These operations are:

• deleteUpper: deletes the upper link of a node. As a result, the tree under the node becomes unbound, and the corresponding link from the parent remains pending and is marked by a ↓, because its further behaviour will be that of a substitution node;

• dismemberTree: forgets all the relations and the parent-child links of the structure under the selected node.

3. Applying GLOSS for validating the Veins Theory

In (Cristea, Ide and Romary, 1998b), a theory of global discourse called Veins Theory (VT) was proposed. VT identifies hidden structures in the discourse tree, called veins, that make it possible, on the one hand, to define referential accessibility domains for each discourse unit and, on the other, to apply the Centering Theory (CT) of Grosz, Joshi & Weinstein (1995) globally, to the whole discourse. Veins are defined as sub-sequences of the sequence of units making up the discourse. From the point of view of discourse cohesion, VT claims that referential strings belonging to a unit are allowed to refer only to discourse antecedents placed on the same vein as the unit itself; if this is not the case, then the referents can be inferred from general or contextual knowledge. The point on coherence defended by VT is that if a metric different from the surface order of the units is used, the inference load imposed by the discourse is smaller. The new metric

3 A left-to-right reading of the terminal frontier of the tree corresponds to the span of text scanned in the same left-to-right order.


being along veins, the comparison criterion is a "smoothness" index computed on the basis of CT transitions (continuation is smoother than retaining, which is smoother than smooth shift, which is smoother than abrupt shift, which itself is smoother than no transition at all). Accordingly, values from 4 (continuation) down to 0 (no transition) are summed up over all discourse units and the result is divided by the number of transitions. The research aims at proving that in all cases the VT smoothness score is at least as good as the one computed following CT, if not better.
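A minimal Python sketch of this smoothness score, assuming the transition for each unit has already been determined (following CT between sequential units, or following VT along veins); the 4-to-0 ranking follows the text.

    SMOOTHNESS = {"continuation": 4, "retaining": 3, "smooth shift": 2,
                  "abrupt shift": 1, "no transition": 0}

    def smoothness_score(transitions: list) -> float:
        """Sum the transition values over all discourse units and
        divide by the number of transitions."""
        return sum(SMOOTHNESS[t] for t in transitions) / len(transitions)

    ct = ["continuation", "no transition", "smooth shift"]  # surface order
    vt = ["continuation", "continuation", "smooth shift"]   # along veins
    assert smoothness_score(vt) >= smoothness_score(ct)     # VT's claim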

The specific scheme proposed is a multi-level (hierarchical) parallel annotation architecture described in detail in Cristea, Ide & Romary (1998a). The overall hierarchical architecture of this schema is a DAG of views as in figure 4.

Figure 4: The hierarchy of views for the validation of VT

Following is a short description of each of these views:

• BD: the base document, contains the raw text or, possibly, markup for basic document structure down to the level of paragraph;

• RS-VIEW: includes markup for isolated reference strings. The basic elements are <RS>'s (reference strings);

• RL-VIEW: the reference links view, imposed over the RS-VIEW, includes reference links between an anaphor, or source, and a referee, or target. Links configure co-reference chains, but can also indicate bridge references (Strube & Hahn, 1996; Passoneau, 1994, 1996). They are coded as <LINK> elements with TYPE=CO-REFERENCE or TYPE=BRIDGE;

• U-VIEW: marks discourse units (sentences, and possibly clauses). Units are marked as <SEG> elements with TYPE=UNIT;

• REL-VIEW: reflects the discourse structure in terms of a tree-like representation. To build the REL-VIEW, full usage of the interactive tree building facilities is made. The parent-child relations are marked as <LINK TYPE=RELATION> with the attributes TARGET1 (left child) and TARGET2 (right child);

• VEINS-VIEW: includes markup for vein expressions. VEIN attributes (with values comprising lists) are added to all <SEG TYPE=UNIT> elements;

• RS-IN-U-VIEW: inherits <RS> and <SEG TYPE=UNIT> elements from U-VIEW and RS-VIEW. It also includes markup that identifies the discourse unit to which a referring string belongs;

• CF-VIEW: inherits all markup from RS-IN-U-VIEW, and adds a list of forward-looking centers (the CF attribute) to each unit in the discourse;

• CT-VIEW (Centering Theory view): inherits the CF attribute from the CF-VIEW and backward references from the RL-VIEW. Using the markup in this view, first backward-looking centers4 (the CB-C5 attribute of the <SEG TYPE=UNIT> elements) and then transitions can be computed following classical CT, therefore between sequential units. A global smoothness score following CT is finally added;

• VT-VIEW (Veins Theory view): inherits forward-looking lists from the CF-VIEW, back-references from the RL-VIEW, and vein expressions from the VEINS-VIEW. The VT-VIEW also includes markup for backward-looking centers computed along the veins of the discourse structure. As, for each unit, these centers depend on the vein on which the current unit is actually placed, they are recorded as the CB-H6 attribute of the <LINK> elements. Transitions are computed following VT and then a global VT smoothness score, thus providing support for the validation of the claim on coherence made by VT. Also, the simultaneous

4 CT defines a backward-looking center as the first element of the forward-looking list of the previous unit that is also realised in the current unit.
5 From "classical".
6 From "hierarchical".


existence of reference links, inherited from the RL-VIEW, and of vein expressions, inherited from the VEINS-VIEW, allows for the validation of the claim on cohesion of VT (references are possible only in their domains of accessibility).

Conclusions

We present an annotation tool, called GLOSS, that has the following features: it accepts as input SGML source documents and/or their database images and produces as output SGML source documents as well as their associated database images; allows for simultaneous opening of several documents; can mix independent annotation views of the same original document, which also allows for a layer-by-layer annotation process in different annotation sessions and by different annotators, including automatic ones; offers an attractive interface to the user; and permits discourse structure annotation by offering a pair of building operations (adjoining and substitution) and remaking operations (delete parent-child link and tree dismember). Finally, we present an example that shows how GLOSS is used to validate, by partially manual, partially automatic corpus annotation, a theory of global discourse. The system currently runs on a PC under Windows '95 or NT.

Although designed with the idea of assisting annotation tasks in discourse, GLOSS could also be used for other types of annotation. The use of SGML with an associated DTD file allows easy definition of any annotation standard with specific markings and their families of attributes. To assist discourse structure coding, the current version implements binary trees with a range of operations that permit interactive building of trees in an add/modify/delete manner. This module could be extended to allow for interactive annotation of any kind of trees. As such, it becomes possible to mark syntactic structures in a GLOSS session.

As further extensions to GLOSS, we plan to add an interface that will allow easy definition and incorporation of other structures, different from trees.

References

L. Bruneseaux & L. Romary (1997). Codage des références et coréférences dans les dialogues homme-machine. Proceedings of ACH/ALLC, Kingston (Ontario).

D. Cristea, N. Ide & L. Romary (1998a). Marking-up multiple views of a text: Discourse and Reference. Proceedings of the First International Conference on Language Resources and Evaluation, Granada, 28-30 May.

D. Cristea, N. Ide & L. Romary (1998b). Veins Theory: A Model of Global Discourse Cohesion and Coherence. Proceedings of ACL/COLING'98.

D. Cristea & B. L. Webber (1997). Expectations in Incremental Discourse Processing. Proceedings of ACL-EACL'97.

W. Daelemans, K. De Smedt & G. Gazdar (1992). Inheritance in Natural Language Processing. Computational Linguistics, vol. 18, no. 2.

N. Ide & G. Priest-Dorman (1996). Corpus Encoding Specification. http://www.cs.vassar.edu/CES/.

B. J. Grosz, A. K. Joshi & S. Weinstein (1995). Centering: A framework for modelling the local coherence of discourse. Computational Linguistics, 21(2), June.

D. Marcu (1996). Building up rhetorical structure trees. Proceedings of the 13th National Conference on AI (AAAI-96), vol. 2, Portland, Ore., 1069-1074.

C. M. Sperberg-McQueen & L. Burnard (1994). Guidelines for Electronic Text Encoding and Interchange. ACH-ACL-ALLC Text Encoding Initiative, Chicago and Oxford.


Linguistic Annotation of Two Prosodic Databases

Maria Wolters
Institut für Kommunikationsforschung und Phonetik

Poppelsdorfer Allee 47, D-53115 Bonn
[email protected]

Abstract

Two prosodic databases were annotated with linguistic information using SGML (Standard Generalized Markup Language), one database of American English and one of Modern Standard German. Only information that might have prosodic correlates was annotated. Phonetic and morphological information was supplied by automatic tools and then hand-corrected. Semantic and pragmatic information was inserted by hand. The SGML tagset is essentially the same for both languages. Tags delimit structural units; all other information is supplied by attributes. The databases themselves, which combine linguistic and phonetic information, are stored as SPSS files.

1 Introduction

The prosody of an utterance is closely related to its linguistic form. Therefore, a prosodic database for a language should contain both phonetic/prosodic and linguistic information. This paper describes how linguistic information is encoded in the Bonn prosodic databases of German (Heuft et al., 1995), henceforth G, and American English (Eisner et al., 1998), henceforth AE. The databases were read by 2-3 native speakers under laboratory conditions. Each speaker read the complete text. AE consists of 443 question/answer dialogues, while G contains 110 question/answer dialogues, 3 short texts and 116 single sentences. Most of the material was written specifically for these corpora. The phonetic information in these corpora is rather detailed for a speech corpus of that size, which permits very fine-grained analyses.

The databases are used to study how information such as dialogue act, sentence type or focus can be signalled prosodically in content-to-speech synthesis (CTS). In CTS, the input is

enriched with information about semantic units and pragmatic functions. For our analyses, we mainly use the statistics package SPSS. Since each segment and each word is stored as a separate case in the SPSS file, word-level linguistic annotations are difficult to maintain in this format. Therefore, they are stored in an SGML-based markup of the text. SGML (Standard Generalized Markup Language, (Goldfarb, 1990)) was chosen because it is a standard language for corpus annotation. In an SGML annotation, the text is partitioned into elements E, which are organized hierarchically. The boundaries of these elements are marked by start (<E>) and end (</E>) tags. Each tag can have an arbitrary number of attributes (<E attr=value>). The main disadvantage of SGML is the strict hierarchical structure it imposes: annotating units that partly overlap in a single document is impossible.

Combining the two annotations is straightforward. The parsed SGML file with full attribute specifications for each tag is converted into a tab-delimited database of words where each tag and each attribute has its own column. Tag columns have two values, 0 for a closed tag and 1 for an open tag. Attribute columns have the value -1 if the tag for which the attribute is defined is closed, and the correct value if that tag is open. In a last step, the information for each word is added to its entry in the SPSS file.
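As an illustration of this conversion, here is a minimal Python sketch operating on an already-parsed word stream. The tag and attribute inventory is a small assumed subset of the tagset described in the following sections.

    # One output row per word: a 0/1 column per tag, and a column per
    # attribute that is -1 while the tag defining it is closed.
    TAGS = ["sent", "np"]
    ATTRS = {"typ": "sent"}   # attribute -> tag on which it is defined

    def rows(words):
        """words: (word, open_tags) pairs; open_tags maps each currently
        open tag to its attribute dictionary."""
        for word, open_tags in words:
            row = {"word": word}
            for tag in TAGS:
                row[tag] = 1 if tag in open_tags else 0
            for attr, tag in ATTRS.items():
                row[attr] = (open_tags[tag].get(attr, -1)
                             if tag in open_tags else -1)
            yield row

    stream = [("Sind", {"sent": {"typ": "deq"}}),
              ("Peter", {"sent": {"typ": "deq"}, "np": {}})]
    for r in rows(stream):
        print("\t".join(str(r[c]) for c in ["word"] + TAGS + list(ATTRS)))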

In the next sections, we present the hierarchical structure of the SGML tagset and give an outline of its four layers. It is beyond the scope of this paper to list all tags and their attributes; this information can be found in (Wolters, 1998). To conclude, we will discuss extension and evaluation of the coding scheme.


2 Overview of the Tagset

Both corpora share a basic tagset. The attributes for identifying sound files, the part-of-speech labels, and the phrase-level tags differ. All phonemic, morphological and syntactic information which was inserted automatically was subsequently corrected by hand.

The tags specify units on four different layers: the structural layer, the word layer, the phrase layer and the semantic/pragmatic layer. The properties of these units are encoded in attributes. The first layer specifies the overall structure of the database. Information annotated in the second and third layer, such as part of speech (POS) and syntactic phrase boundaries, is relevant for text-to-speech (TTS) as well as for CTS systems, because it can be provided automatically. In contrast, the annotations on the fourth layer contain the information that is only exploited in CTS systems.

The third and the fourth layer are largely orthogonal: semantic and pragmatic units are both parts of larger syntactic phrases and contain smaller syntactic phrases. If a semantic/pragmatic unit is coextensive with a syntactic phrase, the phrase boundaries are contained within the unit boundaries.

3 The Structural Layer

There are four main elements on this layer. The tag <dialbase> indicates that the text is a prosodic database. Its units are dialogues (tag: <dialog>), stories (tag: <story>), and sentences (tag: <sent>).1 Each unit has a unique identifier.

Stories and dialogues consist in turn of sentences. Sentences are sequences of words ended by a dot, a question mark or an exclamation mark. Sentences are marked for sentence type (attribute typ) and, in dialogues, for the current speaker (attribute sp). Table 1 lists the six types covered.

4 The Word Layer

The main unit on this layer is the word (<w>). The attribute pos supplies part-of-speech (POS) information, while the tags <phon> and <orth> delimit the canonical

C    command             ST   statement
DEQ  decision question   SEQ  selection question
EQ   echo question       WQ   WH-question

Table 1: Sentence types. The answers to questions are labelled with the label of the question + a for answer. Echo questions are defined as declarative sentences with a question intonation; the definitions of the other types are straightforward.

phonemic and orthographic transcription of a word. Example:2

(1) <w pos=VB> <phon> g I1 v <orth> give

Since POS tags are comparatively easy to supply in a text-to-speech synthesis system, they provide valuable information for a first set of prosodic rules (Widera et al., 1997). Including canonical transcriptions allows checking for reductions, elisions and insertions which cooccur with prosodic parameters such as stress and speaking rate.

The canonical transcriptions for AE were retrieved from cmudict3 and converted to SAMPA for American English (Wolters, 1997); G was transcribed by a rule set implemented in the language P-TRA (Stock, 1992). Each vowel symbol is followed by an indication of stress type: primary (1), secondary (2) or none (0). The AE POS tags were provided by the Brill tagger (Brill, 1993), which uses the Penn Treebank tagset (Santorini, 1990). The POS tags for G were derived from the Bonner Wortdatenbank (Brustkern, 1992).

5 The Phrase Layer

There are two types of units, syntactic phrases and lists (tag: <li>). Lists are frequently used in applications of text-to-speech synthesis. They consist of items (tag: <l>) and coordinators (tag: <cc>). Example:4

(2) <sent typ=deq> Sind <li> <l> Peter </l> <l> Tina </l> <l> Petra </l> <cc> und </cc> <l> Klaus </l> </li> wirklich

1 Turns are not indicated by separate tags, as they would have to be for spontaneous speech, because turn-taking is almost impossible to study using read speech.

2 If not stated otherwise, all examples come from one of the databases.

3 Compiled by Bob Weide, available at ftp://svr-ftp.eng.cam.ac.uk/pub/comp-speech/dictionaries

4 All examples only show the relevant tags.


Figure 1: Hierarchy and layering of SGML elements. CAPS: name of layer. Only the most important tags are shown.

miteinander verwandt?
trans.: Are Peter, Tina, Petra and Klaus really relatives?

Syntactic phrase boundaries were provided by the Apple Pie Parser (Sekine, 1995) for AE and by the SVOX parser (Traber, 1995) for G. These tags are used for examining if and how syntactic boundaries are marked prosodically. There are four phrase types: noun phrases (tag: <np>), verb phrases (tag: <vp>), adverb phrases (tag: <advp>), and adjective phrases (tag: <adjp>). An example from the German database:

(3) <sent typ=c> <advp st=k f=u> <w pos=QPN> <phon> l aU1 t 60 <orth> lauter <w pos=INJ> <phon> b I1 t @0 <orth> bitte </advp>
trans.: "Speak up, please."
Attributes of the adverbial phrase: st=k - comparative, f=u - not inflected

When a parser provides more detailed information about a phrase than just the type of its head, this is stored in attributes. These attributes mostly refer to features of the phrase head. Their default value is "unspecified", in case the parser does not yield this information.

6 Semantic/Pragmatic Level

This level differs from the others in that most of the concepts to be annotated are difficult both to define clearly and to detect automatically. Therefore, all annotations have to be inserted by hand.
Date, time, numbers. These units are used for current research in the VERBMOBIL project. Dates (<d>) must consist at least of

day and month, and times (<t>) at least of hours. Numbers (<nr>) are sequences of digits which refer to e.g. a telephone number. All three units consist of noun phrases at the syntactic level. Some examples:

(4) <sent typ=deq> War seine Telefonnummer nicht <nr> eins vier drei drei zwo </nr>
trans.: Wasn't his phone number one four three three two?

(5) <sent typ=deq> <d> Am siebenundzwanzigsten Mai </d> <t> um siebzehn Uhr fünfunddreißig </t>, kann ich das eintragen?
trans.: The 27th of May at 17.35, can I note that down?

Coreference. It is well known that old information is less likely to be accented than new information. But how can this dimension of familiarity be quantized for annotation? To avoid semantic complications, we only code the familiarity of discourse referents. First, all referring expressions are marked with the tag <coref>, following (MUC, 1996). Referring expressions can consist of syntactic noun phrases and of other referring expressions. Each expression is assigned an integer number id unique within a dialogue or story. The attribute ref specifies the entity an expression refers to. A referent is identified by the cardinal number of the referring expression which introduces it. The attribute ground specifies how the referent is linked to the preceding discourse. The links, summarized in Tab. 2, are a subset of those proposed in (Passonneau, 1997). A constructed example:

(6) Does <coref id=1 ref=1 ground=no> Mimi </coref> like <coref id=2 ref=2 ground=no> detective stories </coref> ?


no    no link, first extensional mention in the discourse
id    referential identity
isa   member of a previously mentioned set
subs  subset of a previously mentioned set
part  part of a previously mentioned whole

Table 2: Types of links between referring expressions.

Yes, <coref id=3 ref=1 ground=yes> she </coref> especially likes <coref id=4 ref=2 ground=subs> old-fashioned whodunits </coref>.

Focus. Five types of foci are annotated: adverbial, answer, contrast, correction, and information foci. The first four types are common, rather descriptive categories, while the last type relies on the focus theory of (Vallduvi, 1992). It was included in order to account for foci which are not covered by the other categories. Although they could have been annotated in terms of Vallduvi's theory, more theory-independent definitions were used in order to facilitate a re-analysis in an arbitrary framework.

The definitions of all five types are purely semantic. What our tags identify are potential focus positions, which allows us to examine in a second step which of these foci were realized prosodically. Thereby, we avoid circularity. The first four focus types are tagged using <f>, with the attribute typ for the focus type. Focus "anchors" are specified by the tag <fa>. An anchor is a constituent in the preceding discourse which the focus refers to. For example, the position of the answer focus in wh-questions depends on the information which is requested, and the position of the adverbial focus depends on the focus adverb, more precisely on the scope of the focus adverb. Although anchors are not likely to be accented, it might be interesting to examine under which conditions they do receive an accent. Furthermore, specifying the anchor also facilitates constructing a semantic representation for the focussed constituent. The attribute ref links foci and their anchors as well as the elements of contrast and correction foci. Tab. 3 gives short definitions of these focus types and their anchors. Some constructed examples:

(7) Does Anne prefer <fa typ=a ref=1>

tea </fa> or <fa typ=a ref=1> coffee </fa>? She prefers <f typ=a ref=1> tea. </f> (answer focus)

(8) <f typ=cn ref=2> Anne </f> likes <f typ=cn ref=3> tea </f>, <f typ=cn ref=2> Ben </f> likes <f typ=cn ref=3> coffee </f>. (contrast focus)

(9) Ben doesn't like <fa typ=cr ref=4> tea </fa>, he prefers <f typ=cr ref=4> coffee </f>. (correction focus)

(10) But sometimes, <fa typ=v ref=5> even </fa> <f typ=v ref=5> Ben </f> drinks tea. (adverb focus)

The fifth type of focus is intended to cover foci which do not belong to one of the first four types. We assume with (Heim, 1983) that there is a file card for each discourse referent which serves as a repository of information about that referent. The constituents of an utterance which are in focus specify information which is to be added to a referent's file card. This concept of focus is based on (Vallduvi, 1992). To distinguish this general type of focus from the other, more specific ones, which it can be extended to include, it is labelled information focus, tag: <if>. There are no anchors for information foci, since they are defined in relation to abstract file cards. The attribute update specifies how the file card to be updated is retrieved. To avoid a proliferation of tags, information focus is only annotated when none of the other focus types is present.

(11) Ben doesn't like chocolate. He <if update=p> doesn't drink milk, </if> either. (information focus, update via pronominal reference)
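Because the attribute syntax is uniform, such markup can be processed with very little machinery. The following is a minimal sketch (in Python, not the project's own tooling) of extracting coreference links; it assumes the unquoted attribute order shown in the examples above.

import re

COREF = re.compile(r"<coref id=(\d+) ref=(\d+) ground=(\w+)>\s*(.*?)\s*</coref>")

def coref_links(text):
    # The ground attribute carries the link types listed in Table 2.
    return [(int(i), int(r), g, m) for i, r, g, m in COREF.findall(text)]

sample = ("Yes, <coref id=3 ref=1 ground=yes>she</coref> especially likes "
          "<coref id=4 ref=2 ground=subs>old-fashioned whodunits</coref> .")
print(coref_links(sample))
# [(3, 1, 'yes', 'she'), (4, 2, 'subs', 'old-fashioned whodunits')]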

Dialogue Act. Dialogue acts are difficult to elicit in read speech, because they depend on the aim a speaker has in mind when producing a certain utterance. When labelling dialogue, this aim has to be reconstructed from the context. For example, the utterance "that won't work" can be assigned three different dialogue acts:

(12) A: Turn the screw now. B: That won't work. (negative feedback)

(13) A: Can we meet on Saturday? B: That won't work. (rejection of suggestion)


Type         Focus                                        Anchor
adverb       scope of adverb                              adverb
correction   correction                                   constituent to be corrected
contrast     constituents to be contrasted                not applicable
answer       wh-question: constituent corresponding
             to wh-phrase                                 wh-constituent
             selection question: member of
             alternative set                              alternative set from question

Table 3: Focus types and their anchors.


Table 4: Types of dialogue acts. Indentation marks main type, subtype and subsubtype; CAPS: value of attribute typ.

feedback
  positive     positive answer to question; previous utterance accepted/understood, FP
    accept     accept suggestion, ACC
  negative     negative answer to question; previous utterance not accepted/understood, FN
    reject     reject suggestion, REJ
inform         give information (default), INF
  give reason  give reason for sth. recently mentioned, GIR
request
  information  ask for information, REQI
  action       request action, REQA
  suggest      request suggestion, REQS
suggest        suggest

I. Heim. 1983. File change semantics and the familiarity theory of definiteness. In R. Bäuerle, C. Schwarze, and A. von Stechow, editors, Meaning, use, and interpretation of language, pages 164-189. de Gruyter, Berlin.

B. Heuft, T. Portele, J. Kramer, H. Meyer, M. Rauth, and G. Sonntag. 1995. Parametric description of F0 contours in a prosodic database. In Proc. Int. Congress of Phon. Sci., Stockholm, pages 378-381.

MUC. 1996. Conference task definition V. 3.0. Message Understanding Conference.

R. Passonneau. 1997. Summary of the coreference group. In J. Carletta, N. Dahlbäck, N. Reithinger, and M.A. Walker, editors, Standards for Dialogue Coding in Natural Language Processing, pages 12-21. http://www.dag.uni-sb.de/ENG.

B. Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, University of Pennsylvania.

S. Sekine. 1995. Apple Pie Parser. Technical report, Department of Computer Science, New York University. http://cs.nyu.edu/cs/projects/proteus/app/.

D. Stock. 1992. P-TRA - eine Programmiersprache zur phonetischen Transkription. In W. Hess and W.F. Sendlmeier, editors, Beiträge zur angewandten und experimentellen Phonetik, pages 222-231. Steiner, Stuttgart.

C. Traber. 1995. SVOX: The Implementation of a Text-to-Speech System for German. Ph.D. thesis, Institut für Technische Informatik und Kommunikationsnetze, ETH Zürich.

E. Vallduví. 1992. The Informational Component. Garland, New York.

C. Widera, T. Portele, and M. Wolters. 1997. Prediction of word prominence. In Proc. Eurospeech, pages 999-1002, Rhodes.

M. Wolters. 1997. A multiphone unit inventory for the concatenative synthesis of American English. VERBMOBIL Memo 120.

M. Wolters. 1998. Konventionen für linguistisches Tagging. Technical report, Institut für Kommunikationsforschung und Phonetik, Universität Bonn. ftp://asll.ikp.uni-bonn.de/pub/mwo/dbaseanno.ps.gz.



A Linguistically Interpreted Corpus of German Newspaper Text

Wojciech Skut, Thorsten Brants, Brigitte Krenn, Hans Uszkoreit
Universität des Saarlandes, Computational Linguistics

D-66041 Saarbrücken, Germany
{skut,brants,krenn,uszkoreit}@coli.uni-sb.de

Abstract

In this paper, we report on the development of an annotation scheme and annotation tools for unrestricted German text. Our representation format is based on argument structure, but also permits the extraction of other kinds of representations. We discuss several methodological issues and the analysis of some phenomena. Additional focus is on the tools developed in our project and their applications.

1 Introduction

Parts of a German newspaper corpus, the Frankfurter Rundschau, have been annotated with syntactic structure. The raw text has been taken from the multilingual CD-ROM produced by the European Corpus Initiative (ECI) and distributed by the Linguistic Data Consortium (LDC).

The aim is to create a linguistically interpreted text corpus, thus setting up a basis for corpus-linguistic research and statistics-based approaches for German. We developed tools to facilitate annotations. These tools are easily adaptable to other annotation schemes.

2 Corpora for Data-Driven NLP

An important paradigm shift is currently taking place in linguistics and language technology. Purely introspective research focussing on a limited number of isolated phenomena is being replaced by a more data-driven view of language. The growing importance of stochastic methods opens new avenues for dealing with the wealth of phenomena found in real texts, especially phenomena requiring a model of preferences or degrees of grammaticality.

This new research paradigm requires very large corpora annotated with different kinds of linguistic information. Since the main objective here is rich, transparent and consistent annotation rather than putting forward hypotheses or explanatory claims, the following requirements are often stressed:

* This is a revised version of the paper (Skut et al., 1997a).
† The work has been carried out in the project NEGRA of the Sonderforschungsbereich 378 'Kognitive ressourcenadaptive Prozesse' (resource-adaptive cognitive processes) funded by the Deutsche Forschungsgemeinschaft.

descriptivity: phenomena should be described rather than explained, as explanatory mechanisms can be derived (induced) from the data.

data-drivenness: the formalism should provide means for representing all types of grammatical constructions occurring in the corpus1.

theory-neutrality: the annotation format should not be influenced by theory-internal considerations. However, annotations should contain enough information to permit the extraction of theory-specific representations.

In addition, the architecture of the annotation scheme should make it easy to refine the information encoded, both in width (adding new description levels) and depth (refining existing representations). Thus a structured, multi-stratal organisation of the representation formalism is desirable.

The representations themselves have to be easy to determine on the basis of simple empirical tests, which is crucial for the consistency and a reasonable speed of annotation.

3 Why Tectogrammatical Structure?

In the data-driven approach, the choice of a particular representation formalism is an engineering problem rather than a matter of 'adequacy'. More important is the theory-independence and reusability of linguistic knowledge, i.e., the recoverability of theory/application-specific representations, which in the area of NL syntax fall into two classes:

Phenogrammatical structure: the structure reflecting surface order, e.g. constituent structure or topological models of surface syntax, cf. (Ahrenberg, 1990), (Reape, 1994).

Tectogrammatical representations: predicate-argument structures reflecting lexical argument structure and providing a guide for assembling

1 This is what distinguishes corpora used for grammar induction from other collections of language data. For instance, so-called test suites (cf. (Lehmann et al., 1996)) consist of typical instances of selected phenomena and thus focus on a subset of real-world language.


meanings. This level is present in almost every theory: D-structure (GB), f-structure (LFG) or argument structure (HPSG). A theory based mainly on tectogrammatical notions is dependency grammar, cf. (Tesnière, 1959).

As annotating both structures separately presents substantial effort, it is better to recover constituent structure automatically from an argument structure treebank, or vice versa. Both alternatives are discussed in the following sections.

3.1 Annotating Constituent Structure

Phenogrammatical annotations require an additional mechanism encoding tectogrammatical structure, e.g., trace-filler dependencies representing discontinuous constituents in a context-free constituent structure (cf. (Marcus, Santorini, and Marcinkiewicz, 1994), (Sampson, 1995)). A major drawback for annotation is that such a hybrid formalism renders the structure less transparent, as is the phrase-structure representation of sentence (1):

(1) daran wird ihn Anna erkennen, dass er weint
    at-it will him Anna recognise that he cries
    'Anna will recognise him at his cry'

Furthermore, the descriptivity requirement could be difficult to meet, since constituency has been used as an explanatory device for several phenomena (binding, quantifier scope, focus projection).

The above remarks carry over to other models of phenogrammatical structure, e.g. topological fields, cf. (Bech, 1955). A sample structure is given below.2

Here, as well, topological information is insufficient to express the underlying tectogrammatical structure (e.g., the attachment of the extraposed that-clause)3. Thus the field model can be viewed

2 LSB, RSB stand for left and right sentence bracket.
3 Even annotating grammatical functions is not enough as long as we do not explicitly encode the tectogrammatical attachment of such functions.

as a non-standard phrase-structure grammar which needs additional tectogrammatical annotations.

3.2 Argument Structure Annotations

An alternative to annotating surface structure is to directly specify the tectogrammatical structure, as shown in the following figure:

This encoding has several advantages. Local and non-local dependencies are represented in a uniform way. Discontinuity does not influence the hierarchical structure, so the latter can be determined on the basis of lexical subcategorisation requirements, agreement and some semantic information.

An important advantage of tectogrammatical structure is its proximity to semantics. This kind of representation is also more theory-neutral, since most differences between syntactic theories occur at the phenogrammatical level, the tectogrammatical structures being fairly similar.

Furthermore, a constituent tree can be recovered from a tectogrammatical structure. Thus tectogrammatical representations provide a uniform encoding of information for which otherwise both constituent trees and trace-filler annotations are needed.

Apart from the work reported in this paper, tectogrammatical annotations have been successfully used in the TSNLP project to construct a language competence database, cf. (Lehmann et al., 1996).

3.3 Suitability for German

Further advantages of tectogrammatical annotations have to do with the fairly weak constraints on German word order, resulting in a good deal of discontinuous constituency. This feature makes it difficult to come up with a precise notion of constituent structure. In effect, different kinds of structures are proposed for German, the criteria often being theory-internal4.

In addition, phrase-structure annotations augmented with the many trace-filler co-references would lack the transparency desirable for ensuring the consistency of annotation.

4 Flat or binary right-recursive structures, not to mention the status of the head in verb-initial, verb-second and verb-final clauses, cf. (Netter, 1992), (Kasper, 1994), (Nerbonne, 1994), (Pollard, 1996).


4 Methodology

The standard methodology of determining constituent structure (e.g., the Vorfeld test) does not carry over to tectogrammatical representations, at least not in all its aspects. The following sections are thus concerned with methodological issues.

4.1 Structures vs. Labels

The first question to be answered here is how much information has to be encoded structurally. Rich structures usually introduce a high spurious-ambiguity potential, while flat representations (e.g., category or function labels) are significantly easier to manipulate (alteration, refinement, etc.).

Thus it is a good strategy to use rather simplestructures and express more information by labels.

4.2 Structural Representations

As already mentioned, tectogrammatical structures are often thought of in terms of dependency grammar (DG, cf. (Hudson, 1984), (Hellwig, 1988)), which might suggest using conventional dependency trees (stemmas) as our representation format. However, this would impose a number of restrictions that follow from the theoretical assumptions of DG. It is mainly the DG notion of heads that creates problems for a flexible and maximally theory-neutral approach. In a conventional dependency tree, heads have to be unique, present and of lexical status, requirements other theories might not agree with.

That is why we prefer a representation format in which heads are distinguished outside the structural component, as shown in the figure below, sentence (2)5:

(2) Bäcker wollte er nie werden
    baker wanted he never become
    'he never wanted to become a baker'

The tree encodes three kinds of information (a minimal sketch of one possible in-memory representation follows the list):

tectogrammatical structure: trees with possibly crossing branches (no non-tangling condition);

syntactic category: node labels and part-of-speech tags (Stuttgart-Tübingen tagset, cf. (Thielen and Schiller, 1995));

functional annotations: edge labels.
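The following sketch illustrates how these three kinds of information might be held in memory, using sentence (2); the data structure is hypothetical and not the project's actual storage format.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                     # category or POS tag
    children: list = field(default_factory=list)   # (edge label, Node) pairs
    position: int = -1                             # surface slot (terminals only)

baecker = Node("NN", position=0)      # Baecker
wollte  = Node("VMFIN", position=1)   # wollte
er      = Node("PPER", position=2)    # er
nie     = Node("ADV", position=3)     # nie
werden  = Node("VAINF", position=4)   # werden

# The VP covers positions 0, 3 and 4 and crosses over the subject at
# position 2; since hierarchy is stored independently of surface order,
# no trace-filler devices are needed.
vp = Node("VP", [("PD", baecker), ("MO", nie), ("HD", werden)])
s  = Node("S",  [("SB", er), ("HD", wollte), ("OC", vp)])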

4.3 Classification of Labels

Compared to the fairly simple structures employed by our annotation scheme, the functional annotations encode a great deal of linguistic information. We have already stressed that the notion head is distinguished at this level. Accordingly, it seems to be the appropriate stratum to encode the differences between different classes of dependencies.

For instance, most linguistic theories distinguish between complements and adjuncts. Unfortunately, the theories do not agree on the criteria for drawing the line between the two classes of dependents. To date there is no single combination of criteria such as category, morphological marking, optionality, uniqueness of role filling, thematic role or semantic properties that can be turned into a transparent operational distinction linguists of different schools would subscribe to.

In our scheme, we try to stay away from a theoretical commitment concerning borderline decisions. The distinction between functional labels such as SB and DA - standing for traditional grammatical functions - on the one hand and phrases labelled MO on the other should not be interpreted as a classification into complements and adjuncts. For the time being, functional labels different from MO are assigned only if the grammatical function of the phrase can easily be detected on the basis of the linguistic data. MO is used, e.g., to label adjuncts as well as prepositional objects. Likewise the label OC is used for easily recognisable clausal complements. Other embedded sentences depending on the verb are labelled as MO6. This is consistent with our philosophy of stepwise refinement. We are in the process of designing a more fine-grained classification of functional labels together with testable criteria for assigning them. This classification will not contain a distinction between complements and adjuncts. Thus the locative phrase in Berlin in the sentence Peter wohnt in Berlin (Peter lives in Berlin) will just be marked as a locative MO with the category PP. As linguistic theories disagree on the question, we will not ask the annotators to decide whether this phrase is a complement of the verb.

This strategy differs from the one pursued by the creators of the Penn Treebank. There the difference between complements and adjuncts is encoded in the


5 Edge labels: HD head, SB subject, OC clausal complement, PD predicative, MO modifier. Note that crossing edges indicate discontinuous constituency.

6 MO is inspired by the usage of the term 'modifier' in traditional structuralist linguistics, where some authors (Bloomfield, 1933) use it for adjuncts and others also for complements (Trubetzkoy, 1939).


hierarchical structure. Verbal complements are encoded as siblings of the verb whereas adjuncts are adjoined at a higher level. In a case of doubt, the annotators are asked to select adjunction. We consider this structural encoding less suitable for refinement than a hierarchy of functional labels in which MO can be further specified by sublabels.

5 Annotation Tools

The development of linguistically interpreted corpora presents a laborious and time-consuming task. In order to make the annotation process more efficient, extra effort has been put into the development of the annotation software.

5.1 Structural Annotation

The annotation tools are an integrated software package that communicates with the user via a comfortable graphical interface (Plaehn, 1998). Both keyboard and mouse input are supported; the structure being annotated is shown on the screen as a tree. The tools can be employed for the annotation of different kinds of structures, ranging from our rudimentary predicate-argument trees to standard phrase structure annotations with trace-filler dependencies, cf. (Marcus, Santorini, and Marcinkiewicz, 1994). A screen dump of the annotation tool is shown in figure 1.

The kernel part of the annotation tool supports purely manual annotation. Further modules permit interaction with an external stochastic or symbolic parser. Thus, the tools are not dependent on a particular automation method. Also the degree of automation can vary from part-of-speech tagging and recognition of grammatical functions to full parsing.

In our project, we rely on an interactive annotation mode in which the annotator specifies rather small annotation increments that are then processed by a stochastic parser. The output of the parser is immediately displayed and the annotator edits it if necessary. Currently, the annotator's task is to specify substructures containing up to 20-30 words; their internal structure as well as the labels for grammatical functions and categories are assigned by the parser. The precision of the parser is about 96% for the assignment of labels and 90% for partial structures (Brants and Skut, 1998; Skut and Brants, 1998a; Skut and Brants, 1998b).

Another part of our software package is the corpus search tool. It is very helpful for both linguistic investigations and detecting annotation errors. As for this latter application, we have also developed programs that compare annotations. Each sentence is annotated independently by two annotators. During the comparison, inconsistencies are highlighted, and the annotators have to correct errors and/or agree on one reading.

In addition to the treebank project, the tools are currently used in the Verbmobil project to annotate transliterated spoken dialogues in English and German (Stegmann and Hinrichs, 1998) and in the FLAG project to annotate spelling errors in German newsgroup texts, and it is planned to employ them in the DiET project to build a linguistic competence database (Netter et al., 1998).

5.2 Automation

The graphical surface communicates with several separate programs to perform the task of semi-automatic annotation. Currently, these separate programs are a part-of-speech tagger, a tagger for grammatical functions and phrasal categories, and an NP/PP chunker.

The part-of-speech tagger is a trigram tagger that is trainable for a wide variety of languages and tagsets (Brants, 1996). We trained it on all previously annotated material in our corpus, using the Stuttgart-Tübingen tagset, and it currently achieves an accuracy of 96% on new, unseen text.

In our project, annotation is an interactive task. After the annotator has specified a partial structure, the tool automatically inserts all the labels into the structure, i.e. the grammatical functions (edge labels) and phrasal categories (node labels). This task is performed by a tagger for grammatical functions and phrasal categories (Brants, Skut, and Krenn, 1997). The underlying mechanism is very similar to part-of-speech tagging. There, states of a Markov model represent tags, and outputs represent words. For tagging grammatical functions, states represent grammatical functions, and outputs represent terminal and non-terminal tags. Thus, tagging is applied to the next higher level.

Grammatical functions have a different distribution within each type of phrase, so each type of phrase is modeled by a different Markov model. If the type of phrase is known, the corresponding model is used to assign grammatical functions. If the type is not known, all models run in parallel and the model assigning the highest probability is used. This determines at the same time the phrasal category. The tagger is also trained on all previous material of the corpus and achieves 97% accuracy for assigning phrasal categories, and 96% accuracy for assigning grammatical functions.
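As a rough illustration of this mechanism, the following sketch decodes one phrase with per-phrase-type Markov models; all names are invented placeholders, and the sketch is not the actual implementation.

from collections import namedtuple
from math import log

Model = namedtuple("Model", "functions trans emit")  # trans[g][f], emit[f][tag]

def best_functions(children, model):
    # Viterbi decoding over one phrase model; children is the list of
    # terminal/non-terminal tags, returns (log prob, function labels).
    paths = {f: (log(model.emit[f].get(children[0], 1e-9)), [f])
             for f in model.functions}
    for tag in children[1:]:
        new = {}
        for f in model.functions:
            p, seq = max(((paths[g][0] + log(model.trans[g].get(f, 1e-9)),
                           paths[g][1]) for g in model.functions),
                         key=lambda c: c[0])
            new[f] = (p + log(model.emit[f].get(tag, 1e-9)), seq + [f])
        paths = new
    return max(paths.values(), key=lambda c: c[0])

def tag_phrase(children, models):
    # Run all phrase models in parallel; the winning model fixes the
    # phrasal category and the grammatical functions at the same time.
    scored = {cat: best_functions(children, m) for cat, m in models.items()}
    category = max(scored, key=lambda cat: scored[cat][0])
    return category, scored[category][1]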

When tagging part-of-speech, grammatical functions, and phrasal categories, we additionally calculate the second-best assignment and its probability. This is used to estimate the reliability of the first assignment. If the probability of the alternative is close to that of the best assignment, the first choice is regarded as unreliable, whereas it is reliable if the alternative has a much lower probability. Reliable and unreliable are distinguished by a threshold on the distance between the best and second-best assignment.


Figure 1: Screen dump of the annotation tool

The annotation tool simply inserts all reliable labels and asks the human annotator for confirmation in the unreliable cases.
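Expressed as code, the decision amounts to a simple margin test; the threshold below is an invented placeholder, since the paper does not state the value used.

def is_reliable(best_logprob, second_logprob, threshold=2.0):
    # The first choice counts as reliable only if the runner-up is much
    # less probable; unreliable labels are confirmed by the annotator.
    return best_logprob - second_logprob >= threshold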

The next level of automation is concerned with the structure of NPs and PPs, which can be fairly complex in German (see figure 2). As shown in (Brants and Skut, 1998), recognition of complete NP/PP structures can also be efficiently performed with Markov models encoding relative structures, i.e. stating whether a word is attached lower, higher or at the same level as its predecessor. The annotator no longer has to build the structure level by level, but marks the boundaries of NPs and PPs, and the internal structure is generated automatically. This approach has an accuracy of 85-90%, depending on the exact task.
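The relative encoding can be turned into a bracketing with a simple stack; the move names below are invented for illustration, and the actual models distinguish more configurations (e.g. closing several levels at once).

def decode(words, moves):
    # moves[i] relates word i to word i-1: 'down' opens a deeper
    # constituent, 'up' closes the current one, 'same' attaches as sibling.
    root = []
    stack = [root]
    for word, move in zip(words, moves):
        if move == "down":
            sub = []
            stack[-1].append(sub)
            stack.append(sub)
        elif move == "up" and len(stack) > 1:
            stack.pop()
        stack[-1].append(word)
    return root

print(decode(["der", "Mann", "mit", "dem", "Hut"],
             ["same", "same", "down", "down", "same"]))
# ['der', 'Mann', ['mit', ['dem', 'Hut']]]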

6 Applications of the Corpus

The corpus provides training and test material for stochastic approaches to natural language processing. It is also a valuable source of data for theoretical linguistic investigations, especially into the relation of competence grammar and language usage.

6.1 Statistical NLP

As described in section 5, statistical annotation methods have been developed and implemented. In our bootstrapping approach, the accuracy of the models improves and their functionality increases as the annotated corpus grows, thus leading to completely automatic NLP methods. For instance, the chunk tagger initially designed to support the annotator is used for the recognition of major phrases in unrestricted text pre-tagged with part-of-speech information (Skut and Brants, 1998a; Skut and Brants, 1998b).

Apart from these applications, the corpus is already used in other projects to train rule-based and statistical taggers and parsers.

6.2 Corpus Linguistic Investigations

The treebank has been successfully used for corpus-linguistic investigations. In this regard, two major classes of applications have arisen so far. Firstly,


Figure 2: Example of a complex NP.

a search program enables the user to find examples of interesting linguistic constructions, which is especially useful for testing predictions made by linguistic theories. It has also proved to be a great help in teaching linguistics.

The second, more ambitious class of applications consists in the statistical evaluation of the corpus data. In a study on relative clause extraposition in German (Uszkoreit et al., 1998), we were able to verify the predictions made by the performance theory of language formulated by Hawkins (1994). The corpus data made it possible to measure the influence of the factors heaviness and distance on the extraposition of relative clauses. The results of these investigations are also supported by psycholinguistic experiments.

For investigations on statistics-based collocation extraction, various portions of the Frankfurter Rundschau corpus have been automatically annotated with parts-of-speech and phrase chunks like NP, PP, AP. The part-of-speech tagger (Brants, 1996) and the chunker (Skut and Brants, 1998b) have been trained on the annotated and hand-corrected corpus. Although error rates of 10 to 15% occur at the stage of chunking, collocation extraction benefits from structurally annotated corpora because of the accessibility of syntactic information: (1) the accuracy of frequency counts increases, i.e. more syntactically plausible collocation candidates are found, and (2) grammatical restrictions on collocations can mostly be derived automatically from the corpus, cf. (Krenn, 1998b).

Syntactically preprocessed corpora are also a valuable source for insights into actual realisations of collocations. This is particularly important in the case of partially flexible collocations. In order to provide material for investigations into collocations as on the one hand grammatically flexible and on the other hand lexically fixed constructions, collocation examples found in syntactically annotated corpora are stored in a database together with competence-based analyses, cf. (Krenn, 1998a).

7 Conclusions

The increasing importance of data-oriented NLP requires the development of a specific methodology, partly different from the generative paradigm which has dominated linguistics for nearly 40 years. The importance of consistent and efficient encoding of linguistic knowledge has absolute priority in this new approach, and thus we have argued for easing the burden of explanatory claims, which has proved to be a severe constraint on linguistic formalisms.

We have presented a number of linguistic analyses used in our treebank and examples of the interaction of different syntactic phenomena. We have also shown how the particular representation chosen enables the derivation of other, theory-specific representations. Finally, we have given examples of applications of the corpus in statistics-based NLP and corpus linguistics. Our claims are backed by an annotated corpus of currently about 12,000 sentences, all of which have been annotated twice in order to ensure consistency.

References

Ahrenberg, L. 1990. A grammar combining phrase structure and field structure. In Proceedings of COLING '90.

Bech, G. 1955. Studien über das deutsche Verbum infinitum. Max Niemeyer Verlag, Tübingen.

Bloomfield, L. 1933. Language. New York.

Brants, Thorsten. 1996. TnT - a statistical part-of-speech tagger. Technical report, Universität des Saarlandes, Computational Linguistics.


Brants, Thorsten and Wojciech Skut. 1998. Automation of treebank annotation. In Proceedings of NeMLaP-3, Sydney, Australia.

Brants, Thorsten, Wojciech Skut, and Brigitte Krenn. 1997. Tagging grammatical functions. In Proceedings of EMNLP-97, Providence, RI, USA.

Hawkins, John A. 1994. A performance theory of order and constituency. Cambridge Univ. Press, Cambridge Studies in Linguistics 73.

Hellwig, P. 1988. Chart parsing according to the slot and filler principle. In COLING 88, pages 242-244.

Hudson, Richard. 1984. Word Grammar. Basil Blackwell Ltd.

Kasper, R. 1994. Adjuncts in the Mittelfeld. In J. Nerbonne, K. Netter, and C. Pollard, editors, German in HPSG. CSLI, Stanford.

Krenn, Brigitte. 1998a. A Representation Scheme and Database for German Support-Verb Constructions. In Proceedings of KONVENS '98, Bonn, Germany.

Krenn, Brigitte. 1998b. Acquisition of Phraseological Units from Linguistically Interpreted Corpora. A Case Study on German PP-Verb Collocations. In Proceedings of the 3rd International Symposium on Phraseology, Stuttgart, Germany.

Lehmann, Sabine, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Herve Compagnion, Judith Baur, Lorna Balkan, and Doug Arnold. 1996. TSNLP - Test Suites for Natural Language Processing. In Proceedings of COLING 1996, Copenhagen.

Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: the Penn Treebank. In Susan Armstrong, editor, Using Large Corpora. MIT Press.

Nerbonne, John. 1994. Partial verb phrases and spurious ambiguities. In John Nerbonne, Klaus Netter, and Carl Pollard, editors, German in Head-Driven Phrase Structure Grammar, number 46 in Lecture Notes. CSLI Publications, Stanford University, pages 109-150.

Netter, Klaus. 1992. On Non-Head Non-Movement. In Günter Görz, editor, KONVENS '92, Reihe Informatik aktuell. Springer-Verlag, Berlin, pages 218-227.

Netter, Klaus, et al. 1998. DiET - diagnostic and evaluation tools for natural language processing applications. In Proceedings of LREC-98, Granada, Spain.

Plaehn, Oliver. 1998. Annotate - Benutzerhandbuch. Technical report, Universität des Saarlandes, Computerlinguistik, Saarbrücken.

Pollard, Carl. 1996. On head nonmovement. In Arthur Horck and Wietske Sijtsma, editors, Discontinuous Constituency. Mouton de Gruyter.

Reape, Mike. 1994. Domain union and word order variation in German. In John Nerbonne, Klaus Netter, and Carl Pollard, editors, German in Head-Driven Phrase Structure Grammar, number 46 in Lecture Notes. CSLI Publications, Stanford University, pages 151-197.

Sampson, Geoffrey. 1995. English for the Computer. Oxford University Press, Oxford.

Skut, Wojciech and Thorsten Brants. 1998a. A Maximum Entropy Partial Parser for Unrestricted Text. In 6th Workshop on Very Large Corpora, Montreal, Canada, August.

Skut, Wojciech and Thorsten Brants. 1998b. Chunk tagger, stochastic recognition of noun phrases. In ESSLLI Workshop on Automated Acquisition of Syntax and Parsing, Saarbrücken, Germany, August.

Skut, Wojciech, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit. 1997a. Annotating unrestricted German text. In DGfS-97.

Skut, Wojciech, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997b. An annotation scheme for free word order languages. In Proceedings of ANLP-97, Washington, DC.

Stegmann, Rosmary and Erhard Hinrichs. 1998. Stylebook for the German treebank in Verbmobil. Verbmobil report, Seminar für Sprachwissenschaft, Universität Tübingen, Germany.

Tesnière, L. 1959. Eléments de Syntaxe Structurale. Klincksieck, Paris.

Thielen, Christine and Anne Schiller. 1995. Ein kleines und erweitertes Tagset fürs Deutsche. In Tagungsberichte des Arbeitstreffens Lexikon + Text 17./18. Februar 1994, Schloß Hohentübingen. Lexicographica Series Maior, Tübingen: Niemeyer.

Trubetzkoy, N. 1939. Le rapport entre le déterminé, le déterminant et le défini. In Mélanges de linguistique offerts à Charles Bally. Geneva.

Uszkoreit, Hans, Thorsten Brants, Denys Duchier, Brigitte Krenn, Lars Konieczny, Stephan Oepen, and Wojciech Skut. 1998. Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen. Kognitionswissenschaft.


Large Scale Dialogue Annotation in VERBMOBIL

Norbert Reithinger and Michael Kipp (or vice versa)
DFKI GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken

e-mail: {reithinger, kipp}@dfki.de

Abstract

In this paper we present an overview of the approach to dialogue-related annotations in VERBMOBIL. We introduce the information annotated in dialogues of the VERBMOBIL corpus and present the rationale behind the use of the open partitur format for annotations. A tool to facilitate the task of the annotators was developed that supports two dialogue-related annotation levels. Finally, we show our approach to measuring inter-coder reliability.

1 Introduction to the Environment and Tasks

Large-scale language processing systems like VERBMOBIL, a speech-to-speech translation system in the domain of time scheduling (Bub et al., 1997), heavily rely on corpus data that can be used for the training and test of knowledge sources and algorithms. For VERBMOBIL, currently about 20 CDROMs with German, English, and Japanese dialogues are available1, containing both speech signals and transliterations. The transliterations of the signal are more than mere orthographic transcriptions. They also cover phenomena like dialectal peculiarities, a symbolic transliteration of all sorts of noises contained in the audio signal, and word breaks.

One of the most important units of processing for the dialogue module are so-called dialogue acts like ACCEPT, REJECT, or SUGGEST that characterise the core intention of an utterance.2 Currently, we use 30 acts that are structured hierarchically. Early during the development of the dialogue module of VERBMOBIL (Alexandersson et al., 1997b), we noticed that the determination and processing of acts has to be based on real data. Therefore, we started to tag the transliterated dialogues of the corpus with dialogue acts.

In this paper, we first describe the format of the dialogues and the annotation level. We then present the tool we developed for annotation and finally show how we take care of the coders' reliability. We close with a look into the future.

2 Annotations and the Partitur Format

When we started annotating the dialogues, we edited the original transliteration files. We added dialogue act tags directly after the utterance using an ad-hoc annotation markup, supported by some EMACS macros. The transliterations looked like this:

ja , guten <!1 gu'n> Tag , Herr Metze . <Ger"ausch> <#Klopfen> @(GREET AB) <"ahm> <#> <Ger"ausch> wir sollen hier einen <!1 ein'> Termin vereinbaren . <Ger"ausch> <A> @(INIT AB)

where the annotation consisted of a special character, the dialogue act and the speaker direction in brackets. This format was OK as long as there was just one level of markup (i.e.

1 http://www.phonetik.uni-muenchen.de/Bas/BasKorporadeu.html

2 We currently use the third version of the dialogue acts as presented in (Alexandersson et al., 1997a).


dialogue acts) and one annotator involved. However, as soon as we added another level of markup3, a Pandora's box of problems was opened. E.g. version problems appeared between different files annotated by different people. Version control tools alone cannot cope with these problems easily, because both the transliterations and the annotations can be changed by different persons and at different institutions, and the tools available tell you where there are changes, but not how to integrate these changes.

As a remedy, we first considered the use of available markup tools like ALEMBIC (Day, 1996), which provides for a structured markup and supports annotation. But it couldn't solve one of the major problems: the data collection agency, the Bavarian Archive for Speech Signals (BAS), updates the transliterations in the master files regularly to correct errors, and these changes couldn't be merged easily with our annotated files since we changed the original text with the markup text.

The solution was to move to an extensible format BAS provides, the so-called Partitur format. In this format all levels of description are independent but time-aligned like the single parts of a score.

It is an open format that contains independent descriptions of as many different levels of the speech signal as necessary, for instance orthography, canonical transcript, phonology, phonetics, prosody, dialogue acts, or POS tags. Symbolic links between the independent levels allow logical assignments related to the linear flow of language. These links are based on the word units of the utterance and are realized as numbers.4

A part of the partitur for the above-mentioned sentence is shown in figure 1. At the beginning it contains some bookkeeping information, followed by the transliteration, the orthographic transcription, the canonic phoneme representation, and finally the dialogue act.

LHD: Partitur 1.2.3
REP: Muenchen
SNB: 2
SAM: 16000
SBF: 01
SSB: 16
NCH: 1
SPN: AAJ
LED:
TR2: 0 ja ,
TR2: 1 guten <!1 gu'n>
TR2: 2 Tag ,
TR2: 3 Herr
TR2: 4 'Metze . <Ger"ausch> <#Klopfen>
TR2: 5 <"ahm> <#> <Ger"ausch>
TR2: 6 wir
TR2: 7 sollen
TR2: 8 hier
TR2: 9 einen <!1 ein'>
TR2: 10 Termin
TR2: 11 vereinbaren . <Ger"ausch> <A>
ORT: 0 ja
ORT: 1 guten
ORT: 2 Tag
ORT: 3 Herr
ORT: 4 Metze
ORT: 5 <"ahm>
ORT: 6 wir
ORT: 7 sollen
ORT: 8 hier
ORT: 9 einen
ORT: 10 Termin
ORT: 11 vereinbaren
KAN: 0 j'a:
KAN: 1 g'u:t@n
KAN: 2 t'a:k
KAN: 3 h'E6
KAN: 4 m"Ets@
KAN: 5 QE:m
KAN: 6 vi:6+
KAN: 7 zOl@n+
KAN: 8 h'i:6
KAN: 9 QaIn@n+
KAN: 10 tE6m'i:n
KAN: 11 f6Q'aInba:r@n
DAS: 0,1,2,3,4 @(GREET AB)
DAS: 5,6,7,8,9,10,11 @(INIT AB)

3 We needed to mark whole turns with a special turn-related set of tags.

4 See http://www.phonetik.uni-muenchen.de/Bas/BasFormatseng.html#Partitur for a detailed description.

Figure 1: The partitur-file for the example

As can be seen, each level of description is marked by a key, e.g. TR2 for transliteration, ORT for orthographic transcription or DAS for dialogue acts, followed by the information for this level. The DAS level shows


the link between the dialogue acts and the basic unit indices, i.e. the canonic KAN units.

Since the KAN track is guaranteed to always stay the same and our own annotations refer to the KAN layer, changes in the TR2 or ORT layer (by BAS) do not affect our annotation. On the other hand, our annotation never in any way modifies other layers (like TR2 or ORT), so that partitur files annotated at different institutes with other layers of information can be compared and merged with other files easily and consistently. Furthermore, new levels needed, e.g. for turn information, could be added swiftly due to the open specification.
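This design also makes the files easy to process programmatically. The following is a minimal sketch (not BAS's or VERBMOBIL's own tooling; the file name is hypothetical) of reading a partitur file and resolving the DAS links, shown here against the ORT level for readability.

def read_partitur(lines):
    levels = {}
    for line in lines:
        key, sep, rest = line.partition(":")
        if sep:
            levels.setdefault(key.strip(), []).append(rest.strip())
    return levels

def dialogue_acts(levels):
    # DAS entries carry comma-separated word indices plus the act label;
    # the indices are resolved against the word entries of another level.
    words = dict(e.split(maxsplit=1) for e in levels.get("ORT", []))
    for entry in levels.get("DAS", []):
        indices, act = entry.split(maxsplit=1)
        yield act, [words[i] for i in indices.split(",")]

for act, words in dialogue_acts(read_partitur(open("example.par"))):
    print(act, " ".join(words))
# @(GREET AB) ja guten Tag Herr Metze
# @(INIT AB) <"ahm> wir sollen hier einen Termin vereinbaren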

In contrast to the human-readable textual transliterations, the partitur is not intended to be worked on with text editors. Since the annotators are only moderately computer literate, we developed a tool to ease the pain of these poor guys.

3 Colourful Tools for Annotation

Human annotation of the training data is a tedious task. Since human annotators have to read and tag thousands of utterances, mistakes of a syntactic and semantic nature are commonplace. Syntactic errors occur even when using tools, since annotators tend to sidestep the tools sometimes, e.g. they use other means to just quickly correct or add a tag with their favourite editor, even if they are told not to do so. Semantic mistakes stem from the documentation of the current dialogue act definitions, which covers most, but not all phenomena, due to the fact that language is - as we all know - very flexible. Also, over the time of the annotation the guidelines dynamically change through the feedback of annotators. This source of problems can only be reduced by reliability checks on a regular basis where disagreement in annotation is analysed in detail (see next section). The other important source of mistakes is the lack of technical support which should provide the annotator with sufficient means to concentrate on the semantic aspect of the annotation task.

What we learnt from these errors is that an annotation tool should draw a clear line between the annotator and the data. In one direction, data should be visualised in a way that suits the annotator. In the other direction, manipulation of data must only be allowed within very limited bounds. The original base document should not be accessible to the annotators directly.

The obvious way of doing this is by visualising only the task-relevant parts of the tool's internal dialogue representation and opening one level to manipulation. Writing back to the original partitur file then changes not the transliteration but only the level under consideration, e.g. the DAS level when annotating dialogue acts.

Our tool, named ANNOTAG, reads partitur files of one dialogue, visualises the transliteration of the turns and presents the annotator with buttons for the various dialogue acts (see fig. 2). Human annotators are now to partition these turns into utterances by labelling a part of the text with one or more dialogue act(s). There may be more than one illocutionary act performed in one utterance, which forces us either to define a new dialogue act whose definition covers the phenomenon or to label two or more of the old ones (in the latter case we speak of a multiple dialogue act). The definition of our domain-specific dialogue acts should keep the number of multiple dialogue acts low.

Segmentation and annotation are done in a single step. We are convinced that a separation of these two steps is both unnecessary and impractical. The decision that there is a dialogue act boundary and the decision which dialogue act to annotate can be made in parallel (and should be made in parallel, since our manual actually makes use of the dialogue act definition for segmentation (Alexandersson et al., 1997a)). Splitting it simply requires the annotator to look at the same data twice and artificially split the reasoning. Also, currently some people (most of whom do not annotate large corpora) argue that you must listen to the speech data in order to be able to segment properly. We


Figure 2: ANNOTAG annotation tool in dialogue act mode

also think this might help in some rare cases, but in general it is not necessary. For example, pauses are transliterated, so the annotators have access to this information, too. But pauses are no criterion for segmentation, since in spoken language pauses do not tell you that much about proper segmentation. Personal communication with other annotation projects like Switchboard-DAMSL and MAPTASK shows that all other large-scale annotation projects also do not separate the two steps of segmentation and annotation.

Annotation with ANNOTAG works like this: mark the text by clicking on a word of a turn. ANNOTAG then highlights all the text beginning at the preceding segment border and ending in this word. A click on one of the dialogue act buttons inserts the corresponding tag into the text. Furthermore, we can remove tags (the segments left and right of the tag are merged into one segment), undo the last action and add more tags to an existing tag (making it a multiple dialogue act). To improve readability, the text can be filtered before being displayed. Furthermore, dialogue act buttons are colour-coded according to their position in the hierarchy, and unknown dialogue acts in the text are marked red.

So far we have made true our twofold wishes: first, to provide the annotator with a comfortable, easy-to-use surface that filters out all technical details and discomforts; second, to keep the human user at a safe distance from the valuable original data. What remains are questions of extendibility. The tool is wholly written in Tcl/Tk (plus the Tix5 extension) and can easily be manipulated. E.g., to change the set of dialogue acts we only need to change a list of strings. Recently, we added a feature enabling the user to annotate turns with turn classes, which are written to a particular level in the partitur file format.

Additionally, we developed tools that were integrated to serve our special needs. For example, when we revised the set of dialogue acts we used a facility which maps all dialogue acts from one version to another. There is also a conversion tool that allows annotated files in the old plain-text transliteration format to be used and maps them to the partitur format. The fact that both formats do not match perfectly (there are words omitted, different turn segmentation etc.) called for a semi-automatic mechanism, showing conversion suggestions for approval/dismissal. We successfully converted more than 400 dialogues this way. Further tools allow inspection of original files, extraction of special information, or conversion of dialogues to other formats.

5 http://www.xpi.com/tix/

4 Kappa or not: Comparing Annotators

To be useful for training and test purposes in a speech processing system, our hand-annotated dialogues have to fulfil very high quality standards.

In order to assess this, we carried out a number of reliability studies for the segmentation and annotation of the data.

To measure the agreement between feature-attributed data sets, the kappa coefficient K is of outstanding importance (for details see (Carletta, 1996)). In the field of content analysis a K value > 0.8 is considered good reliability for the correlation between two variables, while 0.67 < K < 0.8 still allows tentative conclusions to be drawn.
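For two coders labelling the same utterances, the statistic can be computed as in the sketch below, which estimates chance agreement P(E) from the pooled label distribution, so that K = (P(A) - P(E)) / (1 - P(E)); this follows the variant discussed by (Carletta, 1996) but is an illustration, not the project's evaluation code.

from collections import Counter

def kappa(coder_a, coder_b):
    n = len(coder_a)
    p_a = sum(a == b for a, b in zip(coder_a, coder_b)) / n   # observed P(A)
    pooled = Counter(coder_a) + Counter(coder_b)              # both coders' labels
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_a - p_e) / (1 - p_e)

print(kappa(["ACC", "REJ", "INF", "INF"],
            ["ACC", "REJ", "INF", "REJ"]))   # about 0.62 on this toy data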

For measuring the inter-coder replicability of dialogue act coding we used 10 dialogues which altogether consist of 170 utterances and had two human coders annotate this data. The utterance labels of the two coders coincide in as many as 85.30% of the cases. The K value of 0.8261 shows that dialogue acts can be coded quite reliably.

For testing the stability of dialogue acts we asked one coder to relabel five dialogues which altogether contain 191 utterances. The K coefficient for this study is 0.8430, with an overall agreement of 85.86%.

Currently, we annotate German, Japanese, and English dialogues. The annotators are mostly students of these nationalities, with or without linguistic background. They are trained on a set of training dialogues, where the training consists, amongst others, in annotating a set of test dialogues that are compared to the annotations of the other annotators who also annotated this set. Proceeding this way, we can identify the overall agreement of the annotators and can identify classes where an annotator seems to misunderstand the definitions of the manual.

One major use of the dialogue act annotation is to train a statistical dialogue act recogniser (Reithinger and Klesen, 1997). To test the quality of the system, we also measure the K coefficient between the annotations from humans and the program. For a test set of 51 German dialogues, we achieved agreement in 63.48% of the cases and a K value of 0.58, using 215 dialogues for training, which of course did not contain the tested dialogues. For 18 Japanese dialogues the agreement was 71.65% with a K value of 0.68, using 81 dialogues for training.

5 Future Work

At the time we write these lines, and using the approach and the tools presented above, we have about 830 German, English, and Japanese dialogues annotated with dialogue acts, and some 100 with turn classes. The partitur format demonstrates its power in daily use, since it can easily be processed for different purposes with common UNIX tools like sed or perl.

In the future, besides annotating more dialogues with dialogue act and turn class information, we will extend our reliability studies to detect and repair possible weaknesses in the definition and description of the tags. Also, we will annotate the propositional content of the utterances using a domain description language.

As we now have a pool of annotated dialogues, we will use them to bootstrap an automatic annotator which will propose to the human annotator the most likely tag for a unit to be annotated. We hope that we finally will get a (semi-)automatic annotator.


References

Jan Alexandersson, Bianka Buschbeck-Wolf, Tsutomu Fujinami, Elisabeth Maier, Norbert Reithinger, Birte Schmitz, and Melanie Siegel. 1997a. Dialogue Acts in VERBMOBIL-2. Technical Report 204, DFKI Saarbrücken, Universität Stuttgart, Technische Universität Berlin, Universität des Saarlandes.

Jan Alexandersson, Norbert Reithinger, and Elisabeth Maier. 1997b. Insights into the Dialogue Processing of VERBMOBIL. In Proceedings of the Fifth Conference on Applied Natural Language Processing, ANLP '97, pages 33-40, Washington, DC.

Thomas Bub, Wolfgang Wahlster, and Alex Waibel. 1997. Verbmobil: The combination of deep and shallow processing for spontaneous speech translation. In Proceedings of ICASSP-97, pages 71-74, Munich.

Jean Carletta. 1996. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249-254, June.

David S. Day. 1996. Alembic Workbench User's Guide. The MITRE Corporation, Bedford, MA, December.

Norbert Reithinger and Martin Klesen. 1997. Dialogue act classification using language models. In Proceedings of EuroSpeech-97, pages 2235-2238, Rhodes.


Annotating German Language Data for Shallow Processing

Judith Klein and Thierry Declerck
German Research Center for Artificial Intelligence (DFKI GmbH)

Stuhlsatzenhausweg 3, 66123 Saarbrücken (Germany)
<Firstname.Lastname>@dfki.de

1 Motivation

Evaluation of natural language processing (NLP) systems is necessary both to extend the linguistic coverage of the NLP components independently of their integration in a broader application system and to assess and improve the quality of the NLP functionalities with respect to an application-specific task. For NLP-based information extraction (IE) applications, for example, the reliable analysis of natural language input is a necessary prerequisite to ensure high-quality results of the IE task. Profound evaluation cycles during system development and adequacy evaluations on prototypes must be carried out on suitable reference data in order to provide information on the performance of the NLP functionalities and to direct application-oriented improvements.

Even though the importance of annotated text corpora as a means for quality assessment is widely recognized within the NLP community, the realization of evaluation studies is often hampered by the lack of suitable language material - especially for languages other than English.

2 Evaluation Scenario for smes

This paper reports on the building of reference data for the evaluation of the smes system (Saarbrücker Message Extraction System), developed at DFKI (Neumann et al. 1997). An annotated German reference corpus is being constructed for the diagnostic and adequacy evaluation of the NLP functionalities of smes, carried out within the PARADIME (Parameterizable Domain-adaptive Information and Message Extraction1) project.

For shallow text analysis smes employs a tokenizer, a morphological analyzer, a part-of-speech (PoS) tagger and a shallow parsing component. Each component needs a reference basis of different size and format according to its task within the system. While morpho-syntactic processing as the basic processing step requires reference data which aims at a very general linguistic coverage, the parsing component is

designed mainly for fragment analysis, and hence its reference material will consist of selected text units, especially various types of noun phrases and verb groups.

Evaluation experiments on the morphology and the PoS component have already been carried out with the help of manually validated reference corpora. For the parsing module, a part of the PARADIME corpus has been annotated for noun phrases and first evaluation cycles have started on the NP grammar of smes.

Since the building of annotated language resources is a costly and time-consuming task, automatic annotation and re-usability are key issues in the context of NLP evaluation. In order to speed up the annotation task, a bootstrapping approach is followed where the processing modules themselves will be employed to support future annotation tasks. This approach is also followed by many other groups, cf. (Skut et al. 1998), (Oesterle and Maier-Meyer 1998), (Kurohashi and Nagao 1998). But before such semi-automatic annotation is possible, the results of the evaluation studies on the smes components must be integrated in order to improve their performance and guarantee that the analysis results are reliable and need only minor manual post-editing.

Several aspects will ensure re-usability: Firstly, for the tag set used, a mapping has been defined to the Stuttgart-Tübingen tag set (stts)2, which is widely used in the German NLP community. Secondly, the annotation work at PoS and phrasal level will be documented, providing guidelines for general and specific linguistic cases. Thirdly, even though the annotation format corresponds to the output of the respective components, it will be as transparent as possible. Additionally, it is envisaged to develop an xml (extensible markup language) interface to easily exchange the annotated reference data.3
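Such a tag-set mapping can be as simple as a lookup table; in the sketch below the left-hand tags are invented examples, only the stts targets are real.

TO_STTS = {
    "NOUN": "NN",         # common noun
    "NAME": "NE",         # proper name
    "VERB-FIN": "VVFIN",  # finite full verb
    "PREP": "APPR",       # preposition
}

def to_stts(tag):
    return TO_STTS.get(tag, tag)   # fall back to the original tag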

1 Information on PARADIME is available through the www under http://www.dfki.de/lt/projects/paradime-e.html.

2 Information on the stts tag set is available under: http://www.sfs.nphil.uni-tuebingen.de/Elwis/stts/stts.html.

3 Information on xml can be found under Henry Thompson's homepage: http://www.ltg.ed.ac.uk/~ht/.


3 Short Sketch of SMES

Information extraction systems aim at identifying and classifying parts of texts which carry information units relevant for an application-specific task. Since a deep and complete understanding of the text may not be necessary for IE, partial representations as provided by shallow processing methods have proved very useful for this task.

The core engine of the text processing part of smes consists of a tokenizer, an efficient and robust German morphology, a PoS tagger, a shallow parsing module, a linguistic knowledge base and an output construction component (cf. Neumann 1997).

Tokenizer. The tokenizer, implemented in LEX, is based on regular expressions against which the textual input is matched in order to identify certain text structures (e.g. words, date expressions, etc.). Number, date and time expressions are normalized and represented as attribute-value structures.
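As a rough illustration (the real tokenizer is written in LEX; the attribute names here are invented), a date expression might be recognised and normalised like this:

import re

DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{2,4})\b")

def tokenize(text):
    tokens, pos = [], 0
    for m in DATE.finditer(text):
        tokens += text[pos:m.start()].split()
        tokens.append({"type": "date", "day": int(m.group(1)),
                       "month": int(m.group(2)), "year": int(m.group(3))})
        pos = m.end()
    tokens += text[pos:].split()
    return tokens

print(tokenize("Treffen am 3.11.1998 in Bonn"))
# ['Treffen', 'am', {'type': 'date', 'day': 3, 'month': 11, 'year': 1998},
#  'in', 'Bonn']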

Morphix++. The morphological analysis is performed by MORPHIX++, an efficient and robust German morphological analyzer which performs morphological inflection and compound processing4. With its core lexicon containing more than 120,000 stem entries, MORPHIX++ achieves a very broad coverage. The output of MORPHIX++ consists of a list of all possible readings, where each reading is uniformly represented as a triple of the form <STEM,INFLECTION,POS>.

Brill Tagger. Ambiguous readings delivered by MORPHIX++ are disambiguated wrt. PoS using an adapted version of Brill's unsupervised tagger (Brill 1995) for German, which has been integrated into the smes architecture for experimental purposes5. The integrated BRILL tagger consists of a learning component and an application component. Within smes the learning material for the tagger consists of the lexical mark-up of texts provided by MORPHIX++. In the learning phase the tagger induces disambiguation rules on the basis of the morpho-syntactically annotated texts. In a second phase those rules are applied to the morphologically analysed input texts. In PARADIME the integration of Brill's unsupervised tagging strategy is also investigated in order to see to what extent it can be used for the semi-automatic annotation of corpora for new application domains6.

4 MORPHIX++ is based on MORPHIX (Finkler and Neumann 1988).

5 First experiments with Brill's unsupervised tagger within smes were very promising; (Neumann et al. 1997) already reported a tagging accuracy as high as 91.4%.

6 In the future we might investigate the combination of the supervised and the unsupervised strategy. The supervised Brill tagger is currently being trained for German at the University of Zurich, cf. http://www.ifi.unizh.ch/CL/tagger.

Shallow Parsing The shallow parsing component consists of a declarative specification tool for expressing finite state grammars. Shallow parsing is performed in two steps: Firstly, specified finite state transducers (FST) perform fragment recognition and extraction on the basis of the output of the scanner and of MORPHIX++ - i.e. the Brill output is not yet taken into account. Fragments to be identified are user-defined and typically consist of phrasal entities (e.g. NPs, PPs) and application-specific units (e.g. complex time and date expressions). In a second phase, user-defined automata representing sentence patterns operate on extracted fragments to combine them into predicate-argument structures.
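
As a rough illustration of this two-step regime, the following Python sketch first recognizes NP fragments over PoS-tagged input and then matches a simple sentence pattern over the fragments. The tag names and patterns are invented for the example and are not the smes grammar formalism.

    def find_fragments(tagged):
        # Greedy recognition: DET A* N -> NP fragment, V -> verb fragment.
        fragments, i = [], 0
        while i < len(tagged):
            if tagged[i][1] == "DET":
                j = i + 1
                while j < len(tagged) and tagged[j][1] == "A":
                    j += 1
                if j < len(tagged) and tagged[j][1] == "N":
                    fragments.append(("NP", i, j + 1))  # (type, start, end)
                    i = j + 1
                    continue
            if tagged[i][1] == "V":
                fragments.append(("V", i, i + 1))
            i += 1
        return fragments

    def combine(fragments):
        # Toy sentence pattern NP V NP -> predicate-argument structure.
        if [f[0] for f in fragments] == ["NP", "V", "NP"]:
            return {"pred": fragments[1], "args": [fragments[0], fragments[2]]}
        return None

    tagged = [("die", "DET"), ("Firma", "N"), ("legt", "V"),
              ("die", "DET"), ("weitestgehende", "A"), ("Bilanz", "N")]
    print(combine(find_fragments(tagged)))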

The linguistic knowledge base The knowledge base includes a lexicon of about 120,000 root entries. The design of the lexicon allows for structured extensions. For example, information about sub-categorization frames - extracted from the Sardic lexical data base (see Buchholz 1996) developed at the DFKI - has recently been integrated into the verb lexicon. The smes system now has about 12,000 verbs with a total of 30,042 sub-categorization frames at its disposal.

The output construction component The output construction component uses fragment combination patterns to define linguistic head-modifier constructions. The distinction between this component and the automata allows a modular I/O of the grammars. The fragment combiner is also used for the instantiation of templates, providing one of the possible visualizations of the results of the IE task.

4 Reference Corpus for smes

Within the PARADIME project smes is being developed as a parameterizable core machinery for a variety of commercially viable applications in the area of information extraction (IE). The planned employment of smes in other projects (e.g. in the domain of tourist information) stimulates both diagnostic evaluation, to improve the NLP functionalities of smes, and adequacy evaluation, to show what the system's current performance is.

Corpus Selection An important aspect of the improvement of smes is the evaluation of the performance of its components applied to large texts. For this purpose a corpus from the German business magazine "Wirtschaftswoche" from 1992 with about 180,000 tokens was selected. It consists of 366 texts on different topics, written in distinct styles.

Annotation Schema From an earlier evaluation experiment on smes (Klein et al. 1997) we draw the following conclusions on the design of the annotation schema: (i) for easy result comparison the reference


annotations and the system output must cover the same information, and (ii) for automatic comparison the format of the annotations and the system output must be identical or at least allow an easy mapping. In order to provide suitable reference material for the different shallow processing modules, the annotation schema must comprise information on various linguistic levels. The BRILL tagger needs morpho-syntactically annotated lexical items as reference basis, while the grammars require annotations at phrasal level, possibly including syntactic category and grammatical function. Three main aspects characterize the new annotation approach for smes:

• the annotations are directly attached to the language data and not stored separately from the corpus texts;

• the format of the annotations and the system output will allow automatic comparisons;

• the improved NLP modules will be employed for semi-automatic annotation.

4.1 Morphological Annotation

Since MORPHIX++ should be employed to provide the morphological annotation for the reference corpus, several cycles of diagnostic evaluation were carried out in order to extend the linguistic coverage and improve the morphological analysis. No reference material was available for this evaluation process. All corpus texts were given as input, the morphological output was inspected manually, and the necessary modifications and extensions - including the integration of the German SARDIC (Buchholz 1996, Neis 1997) lexicon comprising 600,000 word forms plus morphosyntactic information - were done accordingly. The current version of MORPHIX++ has reached about 94% lexical coverage for the applied text corpus, where most of the words not analyzed are in fact proper nouns or misspelled words.

In order to obtain a morphologically analysed corpus, all 366 texts were run through the improved morphological module, which provides as output the word form (full form plus stem) together with all possible morphological readings. Using the recently implemented parameterizable interface of MORPHIX++, the cardinality of possible tag sets ranges from 23, considering only lexical category (PoS), to 780, considering all possible morpho-syntactic information. For the first annotation step, the morphological analysis was restricted to PoS information only, where ambiguous words are associated with the alternative lexical categories (see figure 1). Items marked as unknown words are tagged as being possibly a noun, an adjective or a verb. In a second round, the tag set was extended, adding a further specification to the lexical category, e.g. V-psp (past participle), A-pred-used (adjective in predicative construction, i.e. without inflection).

(" Fuer" ("filer" :S "VPREF") ("fuer" :S "PREP"))("die" ("d-det" :S "DBF"))("Angaben" ("angeb" :S "V") ("angabe" :S "N"))("in" ("in" :S "PREP"))("unseren" ("unser" :S "POSSPRON"))("Listen" ("list" :S "N") ("list" :S "V") ("liste" :S "N"))("wurde" ("werd" :S "V"))("grundsaetzlich" ("grundsaetzlich" :S "A") ("grundsaet-zlich" :S "ADV" ))("die" ("d-det" :S "DEF"))("weitestgehende" ("weitestgehend" :S "A"))("Bilanz" ("bilanz" :S "N"))("zugrunde" ("zugrunde" :S "VPREF"))("gelegt" ("leg" :S "V"))("." ("." :S "INTP"))

Figure 1: Morphix++ output for PoS only

The entire smes reference corpus is now morphologically annotated with both tag set versions.

4.2 Part-of-Speech Annotation

In order to obtain reliable PoS-tagged reference material, about 22,000 tokens of the first version of the morphologically annotated corpus were independently validated and disambiguated by two annotators. The manual annotation proved once more to be too time-consuming. Looking for a more economical solution, it was decided to support the annotation task with the unsupervised BRILL tagger, which has recently been integrated in the smes system.

The tagger consists of a learning component, which induces disambiguation rules on the basis of the PoS information resulting from MORPHIX++, and an application component, which applies the rules to the morpho-syntactically ambiguous input texts. Instead of requiring manual disambiguation of the MORPHIX++ output, BRILL should provide automatically disambiguated annotated texts. But before the tagger can be employed for automatic annotation, thorough evaluation must prove that BRILL's performance is sufficient for this task.

First evaluation cycles have already been carried out using the validated PoS-tagged reference corpus. BRILL was run over the morpho-syntactically analyzed test corpus using a first version of a simple rule application strategy. A set of tools was implemented to support the automatic comparison and calculation of the results. The evaluation of the BRILL tagger is based on quantitative and qualitative measures. The quantitative result is calculated as the ratio between the number of disambiguations performed and the number of all word ambiguities provided by MORPHIX++. The accuracy of the disambiguation step is measured as the ratio between the number of correct disambiguations and all disambiguations performed by the BRILL tagger. The results were very promising.


((:HEAD "fuer")(•COMP (:QUANTIFIER "d-det") (:HEAD "angabe")

(:END . 3) (:START . 1) (:TYPE . :NP))(:END . 3) (:START 0) (:TYPE . :PP))

((:HEAD "in")(:COMP (QUANTIFIER "unser") (:HEAD "list")

(:END . 6) (:START . 4) (:TYPE . :NP))(:END . 6) (:START . 3) (:TYPE . :PP))

((:HEAD "bilanz") (:MODS "weitestgehend")(:QUANTIFIER "d-det") (:END . 11) (:START . 8)(:TYPE . :NP))

Figure 2: Simplified output of the NP grammar (excluding AGR information) for the sentence "Für die Angaben in unseren Listen wurde grundsätzlich die weitestgehende Bilanz zugrunde gelegt".

62% of the ambiguities in the input text were disambiguated, with an accuracy of 95%. During further development and test phases the tagger will be improved until a performance is reached which proves suitable for automatic annotation. We will take advantage of the rule-based approach implemented in the BRILL tagger, which allows the editing of the set of learned disambiguation rules, and thus the addition of some linguistically motivated rules. Nevertheless, the BRILL output must be manually checked, but we expect that with the help of the tagger the annotation of language data will be sped up considerably.
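
The two measures described above can be stated compactly; the sketch below uses the figures reported here, with the total number of ambiguities chosen arbitrarily for the example.

    # Sketch of the two measures; n_ambiguous is an arbitrary example figure.
    def evaluate(n_ambiguous, n_disambiguated, n_correct):
        disambiguation_rate = n_disambiguated / n_ambiguous  # coverage of the tagger
        accuracy = n_correct / n_disambiguated               # correctness of its decisions
        return disambiguation_rate, accuracy

    rate, acc = evaluate(n_ambiguous=1000, n_disambiguated=620, n_correct=589)
    print(f"disambiguated: {rate:.0%}, accuracy: {acc:.0%}")  # 62%, 95%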

4.3 Phrasal Annotation

For shallow parsing smes makes use of various finite state automata defining sub-grammars, like an NP grammar, a verb group grammar, etc. To improve the coverage of the parsing module, the grammars will in turn be evaluated. We have started with the NP automata, since the recognition, i.e. the identification and correct analysis, of the NPs, including PPs as post-modifiers and APs as pre-modifiers, is very important for information extraction. The task of the NP automata is to identify the noun phrases and provide their internal structure in terms of head-modifier dependencies. The output of the grammar is a bracketed feature value structure, including the syntactic category and the corresponding start and end positions of the spanned input expressions (see figure 2). The first part of the smes reference corpus, about 12,500 tokens, has been manually annotated and validated for noun phrase information7, resulting in 4,050 NPs annotated with phrasal boundaries and agreement information attached at the mother node, as can be seen in figure 3. The annotation format used is easily convertible into the output format of the NP automata. The annotated NPs comprise

7 One annotator has done the manual bracketing and the assignment of agreement information, and another one has checked the annotation.

about 3,100 noun phrases without post-modification - though sometimes with a very complex prenominal part - and about 950 NPs with different kinds of post-modifiers.

("Fuer" ("fuer" :S "PREP"))<NP<NP ("die" ("d-det" :S "DBF"))("Angaben" ("angabe" :S"N"))[AGR a,p,f] NP>("in" ("in" :S "PREP"))< NP ("unseren" ("unser" :S "POSSPRON"))("Listen" ("liste" :S "N"))[AGR d,p,f] NP> [AGR a,p,f] NP>("wurde" ("werd" :S "AUX"))("grundsaetzlich" (" grundsaetzlich" :S"A"))<NP ("die" ("d-det" :S"DEF"))("weitestgehende" ("weitestgehend" :S "ATTR-A"))("Bilanz" ("bilanz" :S "N"))[AGR n,s,f] NP>("zugrunde" ("zugrunde" :S "VPREF"))("gelegt" ("leg" :S "MODV"))("." ("." :S"INTP"))

Figure 3: NP annotation including agreement information (case, number, gender)

First evaluation experiments on the performance of the NP grammar are being carried out with the help of the NP reference corpus. In order to provide a fine-grained performance profile, it is envisaged to examine the output of the NP automata at two levels:

• phrasal level: it is checked whether the identification of the NP, i.e. the external bracketing, is correct.

• internal structure: it is checked whether the head-modifier dependencies are assigned correctly.

The first measurement provides quantitative information on the performance of the identification part of the NP grammar, while the second one measures the accuracy of the attributed dependency structures.

In order to direct the modification and extension of the existing NP automata, it is very important to know how representative a specific NP phenomenon is of a certain application domain. This information will be provided by text profiling, where the typical characteristics of the corpus are identified. On the basis of one portion of the NP reference corpus (about 700 tokens) a text profile was established. The resulting NP classification contains prototypical structures ranging from bare plurals over simply modified NPs including adjectives and genitive NPs up to complex NPs containing various pre- and post-modifications. This first classification of the distinct NP types indicates the internal structure of the nominal constructions occurring in the corpus texts.8
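
For the phrasal level, the comparison amounts to matching spans; a minimal sketch might look as follows. The span pairs and names are illustrative, not the actual evaluation tooling.

    # Compare the NP spans produced by the grammar with the manually
    # annotated reference spans; spans are (start, end) position pairs.
    def bracket_scores(predicted, gold):
        predicted, gold = set(predicted), set(gold)
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    gold = {(1, 3), (4, 6), (8, 11)}   # illustrative spans, cf. figure 2
    pred = {(1, 3), (4, 6), (9, 11)}
    print(bracket_scores(pred, gold))  # (0.666..., 0.666...)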

8 The extraction and grouping of a second part of the corpus NPs made only slight changes in the classification schema necessary.




A more detailed subdivision will be worked out in order to provide the necessary representation format for the annotation of the NP-internal dependency structures. The next step in text profiling will be to assign relevance values to the NP structures according to their frequency and importance in the validated part of the NP reference corpus. In the next annotation step - which will be done with the help of the NP grammar - the NPs will additionally be annotated with their head-modifier dependencies. The grammar output will be manually checked and corrected to establish a reference corpus annotated with phrasal NP boundaries and the internal NP structure. This reference corpus will be used for further evaluations in order to check the accuracy of the NP sub-grammar.

5 Conclusion and Future Work

Annotated reference corpora are a necessary prerequisite for carrying out profound evaluation of NLP systems. The time-consuming manual annotation forced us to look for a more economical solution, namely the employment of the NLP modules for semi-automatic annotation. Even though the use of validated NLP modules for the building of annotated reference material cannot dispense with manual work, first results indicate that this approach provides substantial support for the annotation, where the manual part might be reduced to the control of the automatically attached annotations. Ideally, the tools will provide highly reliable annotations - with only minimal manual checking necessary - thus allowing efficient annotation of large amounts of text material. Especially when smes is employed for further application domains, it will be a great advantage that application-specific reference data can easily be provided by the shallow text processing part. We will also investigate whether the (syntactic) annotation tools, as for example those currently developed within DiET9 and Negra10, could smoothly be integrated into the smes environment. The graphical interfaces of those tools could facilitate the manual inspection of the automatically built reference data. In order to make the annotation schema re-usable for other systems and applications, it is planned to provide an interface to the xml representation format.

Acknowledgment

The research underlying this paper was supported by grants from the German Bundesministerium für


9 More information on DiET is available under http://dylan.ucd.ie/DiET/.

10 For information on the Negra project see http://www.coli.uni-sb.de/cl/projects/negra.html.

Bildung, Wissenschaft, Forschung und Technologie (BMB+F) for the DFKI project PARADIME (FKZ ITW 9704). We are grateful to Günter Neumann for his fruitful comments on the annotation work within PARADIME. We thank Milena Valkova and Birgit Will for their careful annotation work.

References

Brill, E. (1995). Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Proceedings of the Second Workshop on Very Large Corpora, WVLC-95, Boston, 1995.

Buchholz, S. (1996). Entwicklung einer lexikographischen Datenbank für die Verben des Deutschen. Master's Thesis, Universität des Saarlandes, 1996.

Finkler, W.; Neumann, G. (1988). Morphix: A Fast Realization of a Classification-based Approach to Morphology. Proceedings der 4. Österreichischen Artificial-Intelligence-Tagung, Wiener Workshop Wissensbasierte Sprachverarbeitung, pp. 11-19, Berlin, 1988.

Klein, J.; Busemann, S.; Declerck, T. (1997). Diagnostic Evaluation of Shallow Parsing Through an Annotated Reference Corpus. Proceedings of the Speech and Language Technology Club Workshop, SALT-97, pp. 121-128, Sheffield, 1997.

Kurohashi, S.; Nagao, M. (1998). Building a Japanese Parsed Corpus while Improving the Parsing System. Proceedings of the First International Conference on Language Resources and Evaluation, LREC-98, Granada, 1998, pp. 719-724.

Neis, I. (1997). SarDic - das Saarbrücker Dictionary: eine morphologische Workbench und ein morphologischer Server. Master's Thesis, Saarbrücken, 1997.

Neumann, G.; Backofen, R.; Baur, J.; Becker, M.; Braun, C. (1997). An Information Extraction Core System for Real World German Text Processing. Proceedings of the 5th Conference on Applied Natural Language Processing, ANLP-97, pp. 209-216, Washington DC, 1997.

Oesterle, J.; Maier-Meyer, P. (1998). The GNoP (German Noun Phrase) Treebank. Proceedings of the First International Conference on Language Resources and Evaluation, LREC-98, Granada, 1998, pp. 699-703.

Skut, W.; Brants, T.; Krenn, B.; Uszkoreit, H. (1998). A Linguistically Interpreted Corpus of German Newspaper Text. Proceedings of the First International Conference on Language Resources and Evaluation, LREC-98, Granada, 1998, pp. 705-711.


Adding Manual Constraints and Lexical Look-up to a Brill-Tagger for German

Gerold Schneider and Martin Volk
University of Zurich
Department of Computer Science
Computational Linguistics Group
Winterthurerstr. 190, CH-8057 Zurich
gschneid [email protected]

Abstract

We have trained the rule-based Brill-Tagger for German. In this paper we show how the tagging performance improves with increasing corpus size. Training over a corpus of only 28'500 words results in an error rate of around 5% for unseen text. In addition we demonstrate that the error rate can be reduced by looking up unknown words in an external lexicon, and by manually adding rules to the rule set that has been learned by the tagger. We thus obtain an error rate of 2.79% for the reference corpus to which the manual rules were tuned. For a second general reference corpus, lexical look-up and manual rules lead to an error rate of 4.13%.

1 Introduction

There already exist a number of taggers for German (Lezius et al., 1996). We have noticed, however, that none of them is rule-based. But as Samuelsson and Voutilainen (1997) have demonstrated, rule-based taggers can be superior to statistical taggers. We have therefore adapted and trained the supervised version of the rule-based Brill-Tagger to German. To this end we have been building up a German training corpus, which currently consists of about 38'000 tagged words, where all tags have been manually checked.1 In this paper we show how the tagging performance improves with increasing corpus size. In addition we demonstrate that the error rate can be further reduced by looking up unknown words in an external lexicon, and by manually adding rules to the rule set

1 We would like to acknowledge the help of Gero Bassenge, Alexander Glintschert, Sven Hartrumpf, Sebastian Hübner, Sandra Kübler, Andreas Mertens and Elke Teich in checking part of our training corpus.

that has been learned by the tagger.2

1.1 Rule-Based Tagging

We have chosen the Brill-Tagger for the following reasons:

Practical Performance The rule-based Brill-Tagger (Brill, 1992; Brill, 1994) has shown good results for English. Samuelsson and Voutilainen (1997) show that a rule-based tagger for English can achieve better results than a stochastic one. Chanod and Tapanainen (1995) prove the same for French.

Theoretical Advantages While the constraints for French by Chanod and Tapanainen (1995) and for English by Samuelsson and Voutilainen (1997) are hand-written, the Brill-Tagger is self-learning. It employs a transformation-based error-driven learning method. Ramshaw and Marcus (1996) describe this as a compromise method, which means that it involves both a statistical and a symbolic component. Instead of pure n-grams, the Brill-Tagger uses rule templates to restrict the search space.

Linguistic Accessibility and Extensibility Another advantage of rule-based tagging over statistical approaches is the linguistic control, as Ramshaw and Marcus (1996) point out. Linguistic knowledge first defines the linguistic principles to be statistically investigated, i.e. the Brill-Tagger's set of rule templates. Second, it allows one to tune an automatically abstracted

2 The web version of our tagger and information about its availability can be found at http://www.ifi.unizh.ch/CL/tagger.


description of the training corpus, i.e. the rule files. Third, it helps in analysing the results and in pin-pointing the remaining errors.

1.2 The Brill-Tagger for German

The Brill-Tagger was originally developed for English.3 For German, we had to start from scratch, first finding a suitable tag-set, adapting the tagger code slightly, and then manually tagging a German corpus. The changes in the tagger code needed for German are well documented in the tagger manuals. It is necessary to adapt the initial guess for capitalized words, which for English is a tag for proper noun in the original code. This had to be changed to the tag for common noun in our tag-set, because in German all nouns are capitalized.

1.2.1 Training Phase

The Brill-Tagger is trained in two steps. In the first step, each word is assigned its most likely tag, based on the training corpus. In the second step, the errors made in step one are recorded, and the tagger finds rules for the biggest possible error elimination based on the context or internal build-up of words. The tagger formulates these rules on the basis of the rule templates. Every rule is then tested against the training corpus; the number of corrected errors is weighed against the number of errors newly introduced by this rule. The rule with the greatest net improvement is included in the rule set. This learning procedure continues iteratively, until a certain threshold is reached. Due to its iterative character, newly acquired rules are already respected for the elimination of the next error.

Using this procedure, the Brill-Tagger generates a lexicon and two rule files, one for context rules and one for lexical rules.
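
The iterative core of this training procedure can be sketched as follows. This is a simplified illustration, not Brill's implementation; the corpus representation and candidate rule set are invented.

    # Simplified sketch of transformation-based error-driven learning;
    # candidate_rules is a list of functions mapping (tokens, tags) -> tags.
    def learn_rules(tokens, gold, initial, candidate_rules, threshold=1):
        current, learned = list(initial), []
        while True:
            best_rule, best_gain = None, 0
            for rule in candidate_rules:
                proposed = rule(tokens, current)
                gain = (sum(p == g for p, g in zip(proposed, gold))
                        - sum(c == g for c, g in zip(current, gold)))
                if gain > best_gain:
                    best_rule, best_gain = rule, gain
            if best_rule is None or best_gain < threshold:
                return learned
            # Later iterations already see the corrections of earlier rules.
            current = best_rule(tokens, current)
            learned.append(best_rule)

    # Example candidate rule (illustrative): retag capitalized, non-initial
    # tokens as NN, reflecting German noun capitalization.
    def cap_to_nn(tokens, tags):
        return ["NN" if i > 0 and tok[0].isupper() else tag
                for i, (tok, tag) in enumerate(zip(tokens, tags))]

    rules = learn_rules(tokens=["Der", "Hund", "schläft"],
                        gold=["ART", "NN", "VVFIN"],
                        initial=["ART", "VVFIN", "VVFIN"],
                        candidate_rules=[cap_to_nn])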

1.2.2 Application Phase

For the application of the tagger to a text, the tagger uses the files generated during the training. The lexicon contains each word with all its tags as they occurred in the training corpus. The tag at the first position is the most likely tag, which will be assigned to a word as a first guess.

3 The Brill-Tagger is available from its author at http://www.cs.jhu.edu/~brill.

These guesses are then corrected according to the learned context rules.

Context rules take the context of a word into consideration. The Brill-Tagger has an observation window of size 4: the furthest-reaching rule template allows for the consideration of three words to the left or to the right. This is bigger than in most statistical taggers. We will give examples of context rules in section 4.1.

The other rule file contains lexical rules. Lexical rules are solely used for tagging unknown words. The following is an example of a simplified lexical rule:

lich hassuf 4 ADV

This lexical rule will have the effect that unknown words with the four-letter suffix -lich are transformed into adverbs, whatever their first guessed tag was.
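
Its effect can be mimicked in a few lines of Python. This is a hypothetical helper, not Brill's code; in the actual tagger such rules apply only to unknown words.

    # Hypothetical helper mimicking the rule "lich hassuf 4 ADV":
    # a word with the four-letter suffix -lich becomes ADV.
    def apply_lexical_rule(word, guessed_tag, suffix="lich", new_tag="ADV"):
        return new_tag if word.endswith(suffix) else guessed_tag

    print(apply_lexical_rule("sicherlich", "NN"))  # ADV
    print(apply_lexical_rule("Haus", "NN"))        # NN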

Brill rules are transformation rules, which means that a tag is transformed into another tag if a rule applies. But at any stage a word will have exactly one tag. In this sense, transformation rules (Ramshaw and Marcus, 1996) are different from constraint rules (Samuelsson and Voutilainen, 1997).

1.3 Tag-set and Corpora

We use a tag-set widely acknowledged for German, the so-called Stuttgart-Tübingen Tag-Set (STTS) (Schiller et al., 1995), which contains 51 part-of-speech tags plus some punctuation tags. Our corpus consists of texts from the University of Zurich annual report and currently contains about 38'000 words.

2 Performance of the Brill-Tagger for German

In this chapter we show how the tagging accuracy increases with increasing corpus size, until the progress flattens out, partly due to the tagging difficulties for German, which we will describe below. In separate experiments, which are documented in (Volk and Schneider, 1998), we show that the Brill-Tagger and the statistical tagger by Schmid (1995) achieve similar results for German.

2.1 Training the Brill-Tagger with our Corpus

We used a utility provided by Brill to split the 38'000 word corpus into two halves, say A and


B. This utility repeatedly takes the next two sentences from the corpus and randomly puts one sentence into file A and the other one into file B. We divide these halves again by the same method to get four parts. We use the first three parts as the training corpus, which we call TC. The remaining quarter is reference material. We divide this reference material again in the same way. We call the reference corpora we thus get RC1 and RC2.
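
In Python, the scheme could be sketched like this; the sentence list is a stand-in for the real corpus, and the function names are invented.

    import random

    # Sketch of the splitting scheme: walk through sentence pairs and randomly
    # assign one sentence of each pair to each half.
    def split_half(sentences):
        a, b = [], []
        for i in range(0, len(sentences) - 1, 2):
            pair = [sentences[i], sentences[i + 1]]
            random.shuffle(pair)
            a.append(pair[0])
            b.append(pair[1])
        return a, b

    corpus_sentences = ["Satz %d." % i for i in range(16)]  # stand-in corpus
    half_a, half_b = split_half(corpus_sentences)
    a1, a2 = split_half(half_a)
    b1, b2 = split_half(half_b)
    tc = a1 + a2 + b1            # three quarters: training corpus TC
    rc1, rc2 = split_half(b2)    # remaining quarter: RC1 and RC2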

2.1.1 Training Progress

In order to illustrate the training progress, we first train the tagger with only 12.5% of the corpus and tag RC1 with the data obtained from the training. Then we move on to 25%, 50% and 75% of the corpus and tag RC1 again. Table 1 illustrates the training progress. The 75% corpus is TC, i.e. the training corpus we will use for the rest of this paper. After training the Brill-Tagger with the training corpus TC, the training module has learnt 186 lexical rules and 176 context rules. When applying these rules to RC1, the error rate is 5.04%. This means that 94.96% of all the tokens in RC1 receive the same tag as manually prespecified.

For illustration purposes, we also add RC2 to TC and tag RC1 again, reducing the error rate to 4.81%. As expected, the error rate in table 1 flattens out, suggesting that the increase in tagging accuracy from using bigger training corpora will become increasingly smaller.

Table 1: Error Rates on RC1

Of course, we may also tag RC2 with TC. Coincidentally, the error rate is a little higher than with reference corpus RC1, at 5.59%.

When we add the reference material to the training corpus, the error rate drops significantly, as table 2 illustrates. Of course this error rate of 1.49% has no "real world" significance, because no matter how big a training corpus is, there will always be new words and new syntactic constructions in a new text to be tagged.

Table 2: Drastic Error Rate Reduction on Including Reference Material into Training Material


2.1.2 Unknown Words

But we found this increase so striking that we wanted to know whether it is rather due to the known vocabulary or due to the larger number of transformation rules. When tagging RC1 with the rule files learned from TC, but the lexicon learned from the entire corpus (i.e. TC+RC1+RC2), the error rate was only 1.86%, almost as good as when tagging with the rule files learned from the entire corpus. On the contrary, when tagging with the rule files learned from the entire corpus but using only the lexicon learned from TC, the error rate was 4.86% (even a little worse than when using the TC rules). This indicates that the Brill part-of-speech guesser for unknown words is still unsatisfactory (Brill, 1994). Section 3 describes one way to increase the tagging accuracy for unknown words.

2.1.3 Average Ambiguity per Token

Success and error rates alone are not enough as a measure for the efficiency of a tagger. If few words in the text to be tagged are ambiguous in the tagger lexicon, it is easy for the tagger to achieve good results. We therefore provide a figure for every training step for the average ambiguity per token. This figure is calculated over all tokens in a text (RC1 in our case) that are contained in the tagger lexicon. Unknown words are not used in this calculation. Enlarging the training corpus has two effects on the tagger lexicon. First, there will be more tokens in the lexicon, and second, many tokens are assigned multiple tags. This accounts for the increase in the average ambiguity of tokens listed in the third column of table 1.
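
The figure can be computed as follows; the toy lexicon is invented for the example, while in reality the tag lists come from the training corpus.

    # Average ambiguity: for every token of the text that is in the tagger
    # lexicon, count its lexicon tags and average; unknown words are skipped.
    def average_ambiguity(tokens, lexicon):
        counts = [len(lexicon[t]) for t in tokens if t in lexicon]
        return sum(counts) / len(counts) if counts else 0.0

    lexicon = {"die": ["ART", "PRELS", "PDS"],
               "Liste": ["NN"],
               "werden": ["VAFIN", "VAINF"]}
    print(average_ambiguity(["die", "Liste", "werden", "Unbekannt"], lexicon))  # 2.0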

2.2 Tagging Difficulties for German

Adjective vs. Past Participle: In German the distinction between predicative adjectives


and past participles is difficult for linguistic experts and therefore also for the tagger. E.g. in the following corpus sentence

(1) ... während die technikbezogenen Disziplinen an der ETH Zürich vertreten sind.

it is difficult to judge whether vertreten is a past participle or an independent adjective. The sentence can be transformed into a similar active sentence, but it is debatable whether this involves a semantic change. In quantitative terms, however, only 3% of the errors from tagging RC1 are mistakes of this type.

Verb-Forms: The STTS tag set calls for a distinction of finite verb form, infinitive form and past participle form. But in German the finite verb form for the first and third person plural, present tense, is identical with the infinitive form. In addition there are many verbs where the past participle is identical with the infinitive or with a finite verb form. That means that one can decide on the verb form only by looking at the complete verb group in a clause. But in German matrix clauses the verb group is a discontinuous constituent, with the finite verb in second and the rest of the verb group in clause-final position. This means that the distance between a finite auxiliary verb and the rest of the verb group can easily become too big for the window of a tri-gram tagger, as (Schmid, 1995) notes. Unfortunately, the Brill-Tagger window in many cases is not big enough either. In the following examples, our tagger mis-tagged beantragt as a finite verb, while verlangt is mis-tagged as a past participle.

(2) Hier hat der Ausschuss ... für die ersten beiden Punkte der Erziehungsdirektion beantragt, ...

(3) Die Theologische Fakultät verlangt Kenntnisse in Latein, Griechisch und Hebräisch.

When analysing the remaining 5.04% of errors from tagging RC1 with our TC, we find that indeed 25% of the errors involve a wrong verb form.

Capitalisation: Unlike in English, all nouns are capitalised in German. This means that the tagger mis-tags many unknown proper names as common nouns, and that sentence-initial unknown words are also often mis-tagged as common nouns. When analysing the errors in RC1 we find that 17% of the errors involve capitalisation.

3 The Impact of Lexical Lookup

As shown in 2.1.2, unknown words account for a large portion of the errors. We therefore experimented with sending the words not present in the tagger lexicon to the wide-coverage morphological analyser Gertwol (Oy, 1994). An automated mapping procedure over the Gertwol output extracts all possible STTS tags for a given word and temporarily appends these new words with their tags to the tagger lexicon. There is an obvious increase in the tagging accuracy, as table 3 shows.

Table 3: The impact of lexical lookup

But the impact of this lexicon look-up is smaller than expected. The problem is that Gertwol delivers an unordered list of tags for a given word-form, which includes even rare readings. The Brill-Tagger, on the other hand, needs the most likely part-of-speech at the first position in the lexicon. We have not yet found a method to weigh the Gertwol output appropriately.
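
The look-up step can be sketched as below; analyse_with_gertwol stands in for the real Gertwol interface plus the STTS mapping and is not an actual API.

    # Words missing from the tagger lexicon are analysed externally and
    # appended with their mapped STTS tags (a hedged sketch, names invented).
    def augment_lexicon(tagger_lexicon, unknown_words, analyse_with_gertwol):
        for word in unknown_words:
            stts_tags = analyse_with_gertwol(word)  # unordered list of STTS tags
            if stts_tags:
                # Brill expects the most likely tag first, but Gertwol gives
                # no frequency information, so the order here is arbitrary.
                tagger_lexicon[word] = stts_tags
        return tagger_lexicon

    # Stub standing in for Gertwol + STTS mapping (purely illustrative):
    stub = {"zugegebenermassen": ["ADV"]}.get
    lexicon = augment_lexicon({"Haus": ["NN"]}, ["zugegebenermassen"], stub)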

4 The Impact of Manual Constraints

As stated in 1.1, the Brill-Tagger has the advantage that it finds linguistic rules from the training corpus which can be inspected, assessed and extended.

The net of automatically learned and partly interdependent rules might be a fragile system, however, liable to decrease in efficiency after manual editing. Since the rules depend on each other, their position within the rule file is relevant. How can a linguist know at which place


he or she should insert rules? And to what extent are the rules interdependent?

Ramshaw and Marcus (1996) have investigated this question. They state [p. 151-2]:

The trees for a run on 50K words of the Brown Corpus bear out that rule dependencies, at least in the part-of-speech tagging application, are limited. ... [T]he great majority of the learning in this case came from templates that applied in one step directly to the baseline tags, with leveraging being involved in only about 12% of the changes. The relatively small amount of interaction found between the rules also suggests that the order in which the rules are applied may not be a major factor in the success of the method for this particular application, and initial experiments tend to bear this out.

Given this reassurance, we added rules manually to the end of the contextual rule file.

4.1 Examples of Context Rules

As the Brill-Tagger has only little built-in linguistic knowledge, it is on the one hand almost language-independent; on the other hand it has to rely on statistical data for learning linguistic rules.

It is striking to see that the learning algorithm automatically learns many well-known and linguistically sound context rules like:

APPR PTKVZ NEXTTAG $.

This rule means that what is initially tagged as a preposition (APPR) should be transformed into a separated verb prefix tag (PTKVZ) if found at the end of a sentence (NEXTTAG $.) - if the word in question can be found as PTKVZ in the tagger lexicon. Indeed, prepositions never occur at the end of a sentence.4 In the sentence

(4) Ich gebe nie auf.

auf/APPR is thus correctly transformed into auf/PTKVZ. More surprisingly, the learning algorithm even detects rules one is hardly aware of:

VAINF VAFIN NEXTTAG ADV

This rule transforms the infinite auxiliary verb tag into a finite auxiliary verb tag if followed by an adverb. Indeed, German word order seems to forbid sentences in which infinite auxiliaries are post-modified by an adverb:

(5) * Wir werden haben sehr schönes Wetter.

(6) * ... um zu sein ganz sicher.

The learning algorithm also detects a number of linguistically more questionable context rules which, however, work correctly in the majority of language uses, but which may also lead to mistakes:

VVFIN VVPP NEXT1OR2TAG $.

This rule means that sentence-final finite verb tags should be transformed into verb participles. While this rule is correct for e.g.

(7) Der Arzt hat seine Patienten behandelt.

it will produce wrong results for - in our corpus apparently less frequent - sentences like

(8) Der Arzt ist aufmerksam, wenn er seine Patienten behandelt.

Moreover, the learning algorithm misses some basic linguistic facts. In a reference corpus, we find e.g.

(9) unseres/PPOSS Reformvorhabens/NN

PPOSS stands for substituting possessive pronoun. The tag-set distinguishes between substituting and attributive pronouns. Without any linguistic knowledge, the tagger cannot know that a substituting pronoun will hardly be followed by a common noun (NN). A transformation to attributive possessive pronoun seems plausible. This is where our manual rules come in. We add the following rule to the contextual rule file:

PPOSS PPOSAT NEXTTAG NN
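
A minimal interpreter for such NEXTTAG rules can be sketched as follows; this is illustrative only, since the real Brill rule language has many more templates and a wider window.

    # Apply NEXTTAG-style context rules (from_tag, to_tag, next_tag).
    def apply_context_rules(words, tags, rules, lexicon):
        tags = list(tags)
        for from_tag, to_tag, next_tag in rules:
            for i in range(len(tags) - 1):
                # Retag only if the lexicon licenses the target tag for the
                # word; unknown words are retagged unconditionally here.
                if (tags[i] == from_tag and tags[i + 1] == next_tag
                        and to_tag in lexicon.get(words[i], {to_tag})):
                    tags[i] = to_tag
        return tags

    rules = [("APPR", "PTKVZ", "$."),     # separated verb prefix at clause end
             ("PPOSS", "PPOSAT", "NN")]   # the manual rule added above
    lexicon = {"auf": {"APPR", "PTKVZ"}}
    words = ["Ich", "gebe", "nie", "auf", "."]
    tags  = ["PPER", "VVFIN", "ADV", "APPR", "$."]
    print(apply_context_rules(words, tags, rules, lexicon))
    # ['PPER', 'VVFIN', 'ADV', 'PTKVZ', '$.']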

4 Our tagset distinguishes prepositions and postpositions.


After training our tagger with TC, we tag RC1 with lexical look-up (cf. section 3). The error rate is 4.33%. We manually checked the 189 errors, and wrote and individually tested manual rules where we believed rules to be linguistically plausible and expressible in the formalism - like the one for attributive possessive pronouns above. We added the manual rules to the automatically learned rules. With 97 manual rules the error rate drops to 2.79%. Coming up with and testing new rules is a time-consuming process. Determining these 97 manual rules took us around 4 hours.

4.2 Results with Manual Constraints

The fact that rule interdependence is low (cf. section 4) suggests that we can safely add rules. On the other hand this may also indicate that the rules are so independent because each of them only has a limited effect. In order to answer this question, we tag RC2, first only with lexical look-up (as described in 3) and with the automatically learned context rules from TC - we get an error rate of 4.74% -, then together with the above 97 manual context rules, for which we get an error rate of 4.13%.

Because of the small interaction, we may freely add manual rules written on other occasions. At an earlier stage of our research, when our training corpus comprised 28'000 words, we wrote 141 manual rules for test purposes (cf. 4.3). We now add these manual rules to the file containing the automatically learned rules and the manual rules tuned for RC1. If we use this rule file for RC2 (again with lexical look-up), we get a further improvement, to an error rate of 4.09%.

One may fear that manual rules are corpus-specific and bear no general linguistic significance, which would entail that they lead to an error increase in other corpora. In order to test this, we used the above rule file (i.e. automatically learned rules plus manual rules tuned for RC1 plus manual rules from the earlier stage) to tag RC1 (with lexical look-up). We get only a slight error increase from 2.79% to 2.86%. This indicates that only a small fraction of the tuned rules are indeed corpus-specific, while the majority are linguistically accurate, at least in the sense of linguistic performance.

We conclude that because the context rules of the Brill-Tagger are independent and each has only a limited effect, the knowledge can be freely accumulated and will lead to better results in most cases. Adding manual rules is thus a feasible and useful practice for Brill tagging.

4.3 The Limits of Tagging Performance

As mentioned above, at an earlier stage of our research, when the entire corpus comprised about 28'000 words, we wrote context rules based on the results of tagging the training corpus itself.

When a tagger tags its training corpus, the error rate is naturally much lower than in tagging a new text. In the case of our 28'000 word corpus the error rate was 1.81%. Based on these errors we wrote 141 manual context rules and added them to the 121 automatically learned context rules. The resulting error rate was just below 1%, at 0.95%. For the remaining 266 errors, no context rule could be found that resulted in any improvement. We therefore suggest that, given the window of the context rules in the Brill formalism and the restricted expressivity of the rules, and given the relatively free word order of the German language and the STTS tag set, an error rate of just slightly below 1% is about the best possible rate that can be achieved by a Brill-Tagger for German.

5 Conclusions

We have shown that the rule-based Brill-Tagger can be trained successfully over a relatively small annotated corpus. Tagging performance then suffers from unknown words, but this can be alleviated by looking up these words in an external lexicon. This lexicon should not only provide all possible tags but also identify the most likely tag. Current wide-coverage lexical resources like Gertwol do not contain this information. Perhaps a statistical analysis of online dictionaries, as proposed by Coughlin (1996), could help to compute this missing information.

Tagging performance can also be improved by adding manual rules to the automatically learned rule set. In our experiments a set of about 100 manual rules sufficed to increase the tagging accuracy from 95% to around 96%. We also demonstrated that the Brill-Tagger is relatively robust as to the order in which the manual rules are added. Unfortunately, many of the remaining errors (e.g. verb form problems) lie outside the scope of the tagger's observation


window. Therefore we need to add a more powerful component to the tagger or build a shallow parsing post-processor for error correction.

References

Eric Brill. 1992. A simple rule-based part-of-speech tagger. In Proceedings of ANLP, pages 152-155, Trento, Italy. ACL.

Eric Brill. 1994. A report of recent progress in transformation-based error-driven learning. In Proceedings of AAAI.

Jean-Pierre Chanod and Pasi Tapanainen. 1995. Tagging French - comparing a statistical and a constraint-based method. In Proceedings of EACL-95, Dublin.

Deborah A. Coughlin. 1996. Deriving part of speech probabilities from a machine-readable dictionary. In Proceedings of the Second International Conference on New Methods in Natural Language Processing, pages 37-44, Ankara, Turkey.

W. Lezius, R. Rapp, and M. Wettler. 1996. A morphology-system and part-of-speech tagger for German. In D. Gibbon, editor, Natural Language Processing and Speech Technology. Results of the 3rd KONVENS Conference (Bielefeld), pages 369-378, Berlin. Mouton de Gruyter.

Lingsoft Oy. 1994. Gertwol. Questionnaire for Morpholympics 1994. LDV-Forum, 11(1):17-29.

L.A. Ramshaw and M.P. Marcus. 1996. Exploring the nature of transformation-based learning. In J. Klavans and P. Resnik, editors, The Balancing Act. Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, MA.

C. Samuelsson and A. Voutilainen. 1997. Comparing a linguistic and a stochastic tagger. In Proceedings of the ACL/EACL Joint Conference, pages 246-253, Madrid.

A. Schiller, S. Teufel, and C. Thielen. 1995. Guidelines für das Tagging deutscher Textcorpora mit STTS (Draft). Technical report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. Technical report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung. (Revised version of a paper presented at EACL SIGDAT, Dublin 1995.)

Martin Volk and Gerold Schneider. 1998. Comparing a statistical and a rule-based tagger for German. Manuscript.


Annotation Management for Large-Scale NLP

Remi Zajac
Computing Research Laboratory, New Mexico State University

[email protected]

We describe a new flexible annotation scheme used in CRL's Document Manager which extends the Tipster Document Architecture in several innovative directions. Annotation types are defined using typed feature structure definitions; the annotations themselves are instances of these types. Annotations are stored in an object-oriented database system; the corpus files can be stored on a file system (local or remote) or in the database itself; the Document Manager maintains the relations between annotations and the documents.

1 Introduction

The development of modern natural language processing systems relies on the exploitation of corpora, either for extracting linguistic information or for testing the NLP systems. Although the extraction of information is certainly of primary importance, the use of large corpora for testing and evaluating NLP systems is an important feature in the development of large-scale NLP tools. Part of the US Tipster program (Grishman 95, Caid et al. 96) is devoted to the development of a so-called Document Architecture aimed at facilitating the integration, the reuse and the evaluation of NLP components on very large collections of documents, such as the ones used in the Tipster and MUC programs. Although the Tipster architecture is primarily aimed at information retrieval and extraction, it has also been used at CRL for machine translation and machine-aided human translation. Our previous extensive experience with the Tipster-conformant implementation of a Tipster Document Manager developed at CRL (Sharples & Bernick 96)1 led us to a new generic implementation which can be used for NLP at large (see e.g., Zajac et al. 97). This new Document Manager, developed within the Corelli project at CRL, can be viewed as a specialized object-oriented database system for managing large collections of documents annotated with complex data structures. This implementation extends the basic concepts of the Tipster Document Architecture in the following directions:

• The structure of the set of annotations is designed to facilitate the implementation of an NLP system by reusing existing NLP components and minimizing the interaction between these components.

1. Initially prototyped by Ted Dunning.


• The annotation objects themselves are typed feature structures. They extend the Tipster annotations, which are simple sets of attribute-value pairs; they are also used as the representation device in many modern NLP systems.

• The types of the annotations used in the Document Manager must be declared. The Document Manager uses the type definitions (see e.g., Zajac 92) to check that document annotations created by a component conform to the declared types.

The remainder of the paper gives an overview of the Corelli Document Management Architecture and of the extensions which are currently being implemented; presents the new annotation scheme based on feature structures and how the Document Manager makes use of this scheme to provide better integration and testing facilities; and gives an overview of the implementation.

2 Document Management

In the Corelli Document Architecture, components do not talk directly to each other but communicate through annotations attached to the document being processed. Each component of an NLP system reads and writes annotations using the Document Manager interface. This model reduces inter-dependencies between components, promoting the design of modular applications (Figure 1) and


enabling the development of blackboard-type applications such as the one described in (Boitet & Seligman 94). The Corelli Document Architecture provides solutions for

• Representing information about a document,

• Storing and retrieving this information in an efficient way,

• Exchanging this information among all components of an application.

If a system reuses components that were not designed to work with each other, there will still be an impedance mismatch to be resolved: the architecture does not provide ready-made solutions for translating linguistic structures (e.g., mapping two different tagsets or mapping a dependency tree to a constituent structure), since these problems are application-dependent and need to be resolved on a case-by-case basis; such integration is however feasible, as demonstrated by the various Tipster demonstration systems, and use of the architecture significantly reduces the load of integrating a component into an application.

Figure 1: Centralizing document annotations enables a modular architecture and reduces the number of interfaces from the order of n² to the order of n.

2.1 Document Architecture

The basic data object of the architecture is the document: documents can have properties (a set of attribute-value pairs) and annotations, and can be grouped into collections. Annotations are used to store information about a particular segment of the document (identified by a span, i.e., start-end byte offsets in the document content) while the document itself remains unchanged. This contrasts with the SGML solution used in the Multext project, where information about a piece of text is stored as additional SGML mark-up in the document itself (Ballim 95, Thompson 95). The Corelli architecture supports writable data as well as read-only data

(e.g., data stored on a CD-ROM or on a remote file system); no copy or modification of the original documents is needed. This solution enables the processing of very large corpora, such as the ones used in TREC, with reasonable performance. Documents are accessible via a Document Manager which maintains persistent collections, documents and their attributes and annotations, using a commercial database management system to support persistency. The implementation of the architecture takes advantage of the client-server architecture of the database (which uses TCP/IP) and allows local as well as remote clients to connect to a Document Manager.

The Document Manager Graphical User Interface provides a set of tools for manipulating and browsing the documents and collections of documents; specialized viewers allow the display of document annotations.

2.1.1 Document Annotations

The original Tipster Document Architecture provides a relatively low-level structure for annotations: annotations are sets of attribute-value pairs; values can be either strings or numbers or, recursively, annotations; all annotations are stored as a set (actually, a bag). This imposes complex transformations between the data structure produced by a component and the Tipster annotation structures; if annotations are related to each other (for example as edges of a chart parser or as nodes of a parse tree), an additional data structure is required to represent relations between annotations.

The Corelli Document Architecture partitions the set of annotations into labeled sub-sets, where each sub-set is the input and/or the output of a component. Each annotation sub-set relevant to a given component is structured as a lattice, which enables a direct representation of, for example, a word lattice as produced by a speech recognition system or the ambiguous output of a morphological analyzer (Boitet & Seligman 94, Amtrup et al. 97). Each annotation is a (typed) feature structure. The annotation structure facilitates the integration of NLP components such as unification-based parsers or morphological analyzers, since the input of such a component is a word or a sentence and the data structures read and/or produced are feature structures or graphs of


feature structures. Figure 3 shows that annotations can be structured in graphs, where each graph represents the linguistic structure of a sentence as computed by a given component. An annotation is an edge in a graph of annotations and also contains a pair of pointers (byte offsets) to the span of text covered by the annotation.
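
In Python terms (the Document Manager itself is implemented in Java, so this rendering is purely illustrative, with invented field names), an annotation might look like:

    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        fs_type: str                  # declared feature-structure type
        start: int                    # byte offset of span start
        end: int                      # byte offset of span end
        features: dict = field(default_factory=dict)
        successors: list = field(default_factory=list)  # edges in the lattice

    token = Annotation("Token", 0, 4, {"form": "Fuer", "pos": "PREP"})
    np = Annotation("NP", 0, 16, {"head": "angabe"}, successors=[token])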

The specification of a component includes the specification of the pre- and post-conditions of the component. A pre- or post-condition specifies which annotation sub-sets are accessed by the component (using the sub-sets' labels) and, for each sub-set, the required features and values (specified as a typed feature structure).

2.1.2 Annotation Management

The Corelli Document Architecture provides a way to declare annotation types and to check that any annotation added to a document at runtime conforms to the declared types. Annotations are typed feature structures, and allowed feature structures are defined using type definitions (Zajac 92). The Tango language developed at CRL provides the facilities for defining the types of feature structures. This language supports the notion of module (package) and includes a set of pre-defined types (integers, strings, lists and regular expressions). The runtime system provides a set of methods to type-check feature structures as well as a set of unification methods. This runtime engine is used in the implementation of several unification-based formalisms at CRL.

Tango modules are stored in a database, and the Tango development environment supports functionalities to define modules and types, and to compile modules. The runtime modules can then be used in a variety of applications including the Document Manager: instances of these types are document annotations. The Graphical Programming Environment of the Document

Manager gives access to the Tango toolset, and an application programmer can import Tango modules and use the type definitions, but also modify and recompile Tango modules.

The Tango runtime system supports several levels of type checking, and this set of functionalities is used by the Document Manager to check at runtime annotations which are created by a component before actually storing the annotations persistently in the Document Manager persistent store (this type-checking can be turned off by the programmer). The Document Manager uses the pre- and post-conditions declared for a given component to perform runtime type checking (see below).

Figure 3: Integration of NLP components in the Corelli Document Architecture.

2.2 Application Framework

The Corelli Document Architecture includes an Application Framework which supports the construction of NLP applications by integration of NLP components through a high-level Graphical Programming Environment. The Application Editor supports a drag-and-drop graphical interface for integrating components in a single application in a way similar to the GATE GDE (Cunningham et al. 94, 96). The components themselves may be distributed and communicate with the application using a commercial agent-based architecture (from ObjectSpace). The Application Framework interpreter allows a step-wise execution of the application and stores all intermediary results


(output of each component) in the Document Manager, where they can be displayed using the Document Manager viewers.

2.2.1 Component Architecture

The data layer of the Corelli Document Architecture, as described above, provides a static model for component integration through a common data framework. This data model does not provide any support for communication between components, i.e., for executing and controlling the interaction of a set of components, nor for rapid tool integration. The Corelli Component Architecture fills this gap by providing a dynamic model for component integration: this framework provides a high level of plug-and-play, allowing for component interchangeability without modification of the application code, thus facilitating the evolution and upgrade of individual components.

An NLP component is integrated in the architecture by implementing the Corelli Component interface, which defines a standardized set of methods to execute a component's functionalities and provides high-level communication capabilities allowing distribution of components. This interface acts as a wrapper for the component's code, and several integration solutions are possible:

• If the component has a Java API,1 it can be encapsulated directly in the wrapper's code.

• If the component has an API written in one of the languages supported by the Java Native Interface (currently C and C++), it can be dynamically loaded into the wrapper at runtime and accessed via the Java front end.

• If the component is an executable, the wrapper must issue a system call for running the program, and data communication usually occurs through files.

2.2.2 Component Management

In a way which is similar to the GATE component architecture (GATE), a Corelli Component has pre- and post-conditions. These conditions are defined as (typed) feature structures, and the Document Manager can dynamically check the validity of

1. The Document Manager is implemented in Java.

annotations created by a component by checkingthat each input annotation is subsumed by the pre-condition and that each annotation produced by thecomponent is subsumed by the post-condition.

The programmer building an application using the Application Framework defines pre- and post-conditions for each component. A component imports one or more Tango modules, and pre- and post-conditions are defined as expressions of typed feature structures which are instances of types declared in the imported modules. The Graphical Programming Environment type-checks the declarations of pre- and post-conditions using the Tango runtime facilities.

When the programmer defines an application as a graph of components (where the graph is used to express the control flow between components), the Graphical Programming Environment also type-checks the entire application by checking the compatibility of pre- and post-conditions for all possible execution paths in the application.
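As a rough picture of how this checking regime might operate, the Java sketch below models types as a simple subsumption hierarchy and shows both the runtime check of a component's output annotations and the static compatibility check along an execution path. Tango actually operates on typed feature structures, so the single-inheritance model and all class and method names here are our simplifying assumptions, not the project's implementation.

import java.util.*;

// Toy type hierarchy: a type subsumes itself and all of its subtypes.
class Type {
    final String name;
    final Type parent; // null for the root type
    Type(String name, Type parent) { this.name = name; this.parent = parent; }
    boolean subsumes(Type other) {
        for (Type t = other; t != null; t = t.parent)
            if (t == this) return true;
        return false;
    }
}

// A component with declared pre- and post-conditions.
class Component {
    final String name;
    final Type pre;  // every input annotation must be subsumed by this
    final Type post; // every output annotation must be subsumed by this
    Component(String name, Type pre, Type post) {
        this.name = name; this.pre = pre; this.post = post;
    }
}

class ConditionChecker {
    // Runtime check, as the Document Manager would perform before storing
    // a component's output annotations persistently.
    static void checkOutput(Component c, List<Type> produced) {
        for (Type t : produced)
            if (!c.post.subsumes(t))
                throw new IllegalStateException(c.name + " produced " + t.name
                        + ", not subsumed by post-condition " + c.post.name);
    }

    // Static check: along an execution path, each component's post-condition
    // must satisfy the pre-condition of the next component.
    static void checkPath(List<Component> path) {
        for (int i = 0; i + 1 < path.size(); i++)
            if (!path.get(i + 1).pre.subsumes(path.get(i).post))
                throw new IllegalStateException("incompatible edge "
                        + path.get(i).name + " -> " + path.get(i + 1).name);
    }

    public static void main(String[] args) {
        Type annotation = new Type("Annotation", null);
        Type token = new Type("Token", annotation);
        Type posToken = new Type("PosToken", token);
        Component tokenizer = new Component("tokenizer", annotation, token);
        Component tagger = new Component("tagger", token, posToken);
        checkOutput(tokenizer, Arrays.asList(token)); // passes
        checkPath(Arrays.asList(tokenizer, tagger));  // passes
        System.out.println("all checks passed");
    }
}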

3 Conclusions

We described a new annotation scheme that:

• supports annotation of read-only documents;

• supports efficient annotation of very large document collections;

• allows the validity of annotations to be checked as they are added to a document;

• interfaces readily with modern unification-based NLP components.

An alpha version of the Corelli Document Manager has been released and is being tested at several research institutes. The Document Manager is implemented in Java and uses a Java OODBMS back-end from ObjectDesign. The Component Architecture has been prototyped using RMI, and we are still exploring other options to implement the distributed Application Framework, including ObjectSpace's Voyager and CORBA. The Graphical Programming Environment and the Application structure will be derived from the GATE model (REF). The Tango package is also implemented in Java and supports a few unification and type-checking methods, but new optimized unification algorithms will be developed in the future.


Acknowledgments

Research reported in this paper is supported by the DoD, contract MDA904-96-C-1040.

4 References

Jan Amtrup, Henrik Heine, Uwe Jost. 1997. "What's in a Word Graph - Evaluation and Enhancement of Word Lattices". Eurospeech'97 - Proceedings of the 5th European Conference on Speech Communication and Technology. Rhodes, Greece.

A. Ballim. 1995. "Abstract Data Types for Multext Tool I/O". LRE 62-05 Deliverable 1.2.1.

Christian Boitet and Mark Seligman. 1994. "The Whiteboard Architecture: a Way to Integrate Heterogeneous Components of NLP Systems". Proceedings of the 15th International Conference on Computational Linguistics - COLING'94, August 5-9 1994, Kyoto, Japan. pp. 426-430.

Bill Caid, Jamie Callan, Jim Conley, Harold Robin, Jim Cowie, Kathy DiBella, Ted Dunning, Joe Dzikiewicz, Louise Guthrie, Jerry Hobbs, Clint Hyde, Mark Ilgen, Paul Jacobs, Matt Mettler, Bill Ogden, Peggy Otsubo, Bev Schwartz, Ira Sider, Ralph Weischedel and Remi Zajac. "Tipster Text Phase II Architecture Design and Requirements, Version 2.1". Proceedings of the Tipster-II 24-month Workshop, Tysons Corner, VA, 7-10 May, 1996. pp. 249-305.

H. Cunningham, M. Freeman, W.J. Black. 1994. "Software Reuse, Object-Oriented Frameworks and Natural Language Processing". Proceedings of the 1st Conference on New Methods in Natural Language Processing - NEMLAP-1, Manchester.

H. Cunningham, Y. Wilks, R. Gaizauskas. 1996. "New Methods, Current Trends and Software Infrastructure for NLP". Proceedings of the 2nd Conference on New Methods in Natural Language Processing - NEMLAP-2, Ankara, Turkey.

Ralph Grishman, editor. 1995. "Tipster Phase II Architecture Design Document". New York University, NY, July 1995.

Nigel Sharples and Philip Bernick. 1996. "A User's Guide to TDM: The CRL TIPSTER Document Manager". CRL Technical Report MCCS-96-298.

Henry Thompson. 1995. "Multext Workpackage 2, Milestone B, Deliverable Overview". LRE 62-050 Deliverable 2.

Remi Zajac, Mark Casper and Nigel Sharples. 1997. "An Open Distributed Architecture for Reuse and Integration of Heterogeneous NLP Components". Proceedings of the 5th Conference on Applied Natural Language Processing - ANLP'97, 31 March-3 April, Washington DC. pp. 245-252.

Remi Zajac. 1992. "Inheritance and Constraint-based Grammar Formalisms". Computational Linguistics 18/2, June 1992, pp. 159-182.


Extending Grammar Annotation Standards to Spontaneous Speech

Anna RAHMAN
School of Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton, UK BN1 9QH
[email protected]

Geoffrey SAMPSON
School of Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton, UK BN1 9QH
[email protected]

Abstract

We examine the problems that arise in extending an explicit, rigorous scheme of grammatical annotation standards for written English into the domain of spontaneous speech. Problems of principle occur in connexion with part-of-speech tagging; the annotation of speech repairs and structurally incoherent speech; logical distinctions dependent on the orthography of written language (the direct/indirect speech distinction); differentiating between nonstandard usage and performance errors; and integrating inaudible wording into analyses of otherwise-clear passages. Perhaps because speech has contributed little in the past to the tradition of philological analysis, it proves difficult in this domain to devise annotation guidelines which permit the analyst to express what is true without forcing him to go beyond the evidence.

Background

To quote Jane Edwards (1992: 139), "The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways". This principle has not often been observed in the domain of grammatical annotation. Although many alternative lists of grammatical categories have been proposed for English and for other languages, in most cases these are not backed up by detailed, rigorous specifications of boundaries between the categories. A scheme may define how to draw a parse tree for a clear, "textbook" example sentence, with node labels drawn from a large, informative label-alphabet, but may leave it entirely to analysts' discretion how to apply the annotation to the messy constructions that are typical of real-life data.

The SUSANNE scheme, developed over the period 1983-93 (Sampson 1995; ftp://ota.ox.ac.uk/pub/ota/susanne/), is a first published attempt to fill this gap for English; the 500 pages of the scheme aim to define an explicit analysis for everything that occurs in the language in practice. No claim is made that the numerous annotation rules comprised in the SUSANNE scheme are "correct" with respect to some psychological or other reality; undoubtedly there are cases where the opposite choice of rule could have yielded an equally well-defined and internally consistent annotation scheme. But, without some explicit choice of rules on a long list of issues, one has only a list of category-names and symbols, not a well-defined scheme for applying them.

The SUSANNE scheme has been achieving a degree of international recognition: "the detail ... is unrivalled" (Langendoen 1997: 600); "impressive ... very detailed and thorough" (Mason 1997: 169, 170); "meticulous treatment of detail" (Leech & Eyes 1997: 38). We are not aware of any alternative annotation scheme (for English, or for another language) which covers the ground at a comparable level of detail. (The other schemes that we know about seem to have been initiated substantially more recently than SUSANNE, as well as being less detailed; we do not survey them here.) Various research groups may prefer to use different lists of grammatical symbols, but it is not clear what value will attach to statistics derived from annotated corpora unless the boundaries between their categories are defined with respect to the same issues that the SUSANNE scheme treats explicitly.


Currently, the CHRISTINE project (http://www.cogs.susx.ac.uk/users/geoffs/RChristine.html) is extending the SUSANNE scheme, which was based mainly on edited written English, to the domain of spontaneous spoken English. CHRISTINE is developing the outline extensions of the SUSANNE scheme for speech which were contained in Sampson (1995: ch. 6) into a set of annotation guidelines comparable in degree of detail to the rest of the scheme, "debugging" them by applying them manually to samples of British English representing a wide variety of regional, social class, age, and social setting variables. Figure 1 displays an extract from the corpus of annotated speech currently being produced through this process. The sources of language samples used by the CHRISTINE project are the speech section of the British National Corpus (http://info.ox.ac.uk/bnc/), the Reading Emotional Speech Corpus (http://midwich.reading.ac.uk/research/speechlab/emotion/), and the London-Lund Corpus (Svartvik 1990). Figure 1 is extracted from file KSS of the British National Corpus. (Except where otherwise stated, examples quoted in later sections of this paper will also come from the BNC, with the location specified as three-character filename followed after full stop by five-digit "s-unit number". BNC transcriptions include punctuation and capitalization, which are of questionable status in representations of spoken wording; in Figure 1 these matters are normalized away, but they have been allowed to stand in examples quoted in the text below.)

Figure 1: Extract from the CHRISTINE annotated speech corpus (BNC file KSS).

date April 1992, location South Shields, locale home, activity conversation
PS6RC: f, 72, dial Lancashire, Salvation Army, educ X, class UU
PS6R9: m, 45, dial Lancashire, unemployed, educ X, class DE (son of PS6RC)

[The three aligned fields of the figure (wordtags, words, and the labelled tree structure) are not legibly recoverable from this extraction.]

In Figure 1, the words uttered by the speakers are in the next-to-rightmost field. The field to their left classifies the words, using the SUSANNE tagset supplemented by some additional refinements to handle special problems of speech: the part-word i at byte 0692161 is tagged "VVOv/ice" to show that it is a broken-off attempt to utter the verb ice. The rightmost field gives the grammatical analysis of the constructions, in the form of a labelled tree structure which again uses the SUSANNE conventions. All tagmas are classified formally, with a capital letter followed in some cases by lower-case subcategory letters: S stands for "main clause", Nea represents "noun phrase marked as first-person-singular and subject", Ve labels don't know as a verb group marked as negative. Additionally, immediate constituents of clauses are classified functionally, by a letter after a colon: Nea:s in the first line shows that I is the subject of say, Fn:o in the third line shows that the nominal clause (Fn) I don't know where ... is direct object of say. Three-digit index numbers relate surface to logical structures. Thus where in the seventh line is marked as an interrogative adverb phrase (Rq) having no logical role (:G) in its own clause (that is, the clause headed by +'s gon, i.e. is going), but corresponding logically to an unspoken Place adjunct ("p101") within the infinitival clause +na (= to) get cake done. (The character "y" in column 3 identifies a line which contains an element of the structural analysis rather than a spoken word.)
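To make the notation concrete, the following small Java helper (our own illustration, not part of the SUSANNE or CHRISTINE software) decomposes a node label such as Nea:s or Rq:G101 into its formal category, functional letter, and optional three-digit index.

import java.util.regex.*;

// Illustrative decomposition of SUSANNE-style node labels.
class NodeLabel {
    final String category; // formal class, e.g. "Nea", "Fn?", "Rq"
    final char function;   // letter after the colon, e.g. 's', 'o', 'G'; '-' if absent
    final int index;       // three-digit coindex, or -1 if absent

    private NodeLabel(String c, char f, int i) {
        category = c; function = f; index = i;
    }

    static NodeLabel parse(String label) {
        Matcher m = Pattern
                .compile("([A-Z][A-Za-z?]*)(?::([A-Za-z])(\\d{3})?)?")
                .matcher(label);
        if (!m.matches())
            throw new IllegalArgumentException("unrecognized label: " + label);
        char f = m.group(2) == null ? '-' : m.group(2).charAt(0);
        int i = m.group(3) == null ? -1 : Integer.parseInt(m.group(3));
        return new NodeLabel(m.group(1), f, i);
    }
}

On this reading, parse("Rq:G101") yields category Rq, function G (no logical role), and index 101, the coindex linking the phrase to its logical position.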

Legal constraints permitting, the CHRISTINE Corpus will be made freely available electronically, after completion in December 1999, in the same way as the SUSANNE Corpus already is.

Defining a rigorous, predictable structural annotation scheme for spontaneous speech involves a number of difficulties which are not only additional to, but often different in kind from, those involved in defining such a scheme for written language. This paper examines various of these difficulties. In some cases, our project has already identified tentative annotation rules for addressing these difficulties, and in these cases we shall mention the decision adopted; but in other cases we have not yet been able to formulate any satisfactory solution. Even in cases where our project has chosen a provisional solution, discussing this is not central to our aims in the present paper. Our goal, rather, is to identify the types of issue needing to be resolved, and to show how devising an annotation scheme for speech involves problems of principle, of a kind that would have been difficult to anticipate before undertaking the task.

1 The Software Engineering Precedent

The following pages will examine a number of conceptual problems that arise in defining rigorous annotation standards for spontaneous speech. Nothing will be said about computational technicalities, for instance the possibilities of designing an automatic parser that could apply such annotation, or the nature of the software tools used in our project to support manual annotation. (The project has developed a range of such tools, but we regard them as being of interest only to ourselves.)

In our experience, some computational linguists see a paper of this type as insubstantial and of limited value in advancing the discipline. While it is not for us to decide the value of our particular contribution, as a judgement on a genre we see this attitude as profoundly wrong-headed. To explain why, let us draw an analogy with developments in industrial and commercial computing.

Writing programs and watching them running is fun. Coding and typing at keyboards are the programmer activities which are most easy for IT managers to perceive as productive. For both these reasons, in the early decades of computing it was common for software developers to move fairly quickly from taking on a new assignment to drafting code — though, unless the assignment was trivially simple, the first software drafts did not work. Sometimes they could be rescued through debugging — usually a great deal of debugging. Sometimes they could not: the history of IT is full of cases of many-man-year industrial projects which eventually had to be abandoned as irredeemably flawed without ever delivering useful results.

There is nowadays a computer science subdiscipline, software engineering (e.g. Sommerville 1992), which has as one of its main aims the training of computing personnel to resist their instincts and to treat coding as a low priority. Case studies have shown (Boehm 1981: 39-41) that the cost of curing programming mistakes rises massively, depending how late they are caught in the process that begins with analysing a new programming task and ends with maintenance of completed software. In a well-run modern software house, tasks and their component subtasks are rigorously documented at progressively more refined levels of detail, so that unanticipated problems can be detected and resolved before a line of code is written; programming can almost be described as the easy bit at the end of a project.

The subject-matter of computational linguistics, namely human language, is one of the most complex phenomena dealt with by any branch of IT. To someone versed in modern industrial software engineering, which mainly deals with structures and processes much simpler than any natural language, it would seem very strange that our area of academic computing research could devote substantially more effort to developing language-processing software than to analysing in detail the precise specifications which software such as natural-language parsers should be asked to deliver, and to uncovering hidden indeterminacies in those specifications. Accordingly, we make no bones about the data-oriented rather than technique-oriented nature of the present paper. At the current juncture in computational linguistics, consciousness-raising about problematic aspects of the subject-matter is a high priority.

2 Wordtagging

One fundamental aspect of grammatical annotation is classifying the grammatical roles of words in context — wordtagging. The SUSANNE scheme defined an alphabet of over 350 distinct wordtags for written English, most of which are equally applicable to the spoken language, though a few have no relevance to speech (for instance, tags for roman numerals, or mathematical operators). Spoken language also, however, makes heavy use of "discourse items" (Stenstrom 1990) having pragmatic functions with little real parallel in writing: e.g. well as an utterance initiator. Discourse items fall into classes which in most cases are about as clearly distinct as the classifications applicable to written words, and the CHRISTINE scheme provides a set of discourse-item wordtags developed from Stenstrom's classification. However, where words are ambiguous as between alternative discourse-item classes, the fact that discourse items are not normally syntactically integrated into wider structures means that there is little possibility of finding evidence to resolve the tagging ambiguity.

Thus, three discourse-item classes are Expletive (e.g. gosh), Response (e.g. ah), and Imitated Noise (e.g. glug glug). Consider the following extracts from a sample in which children are "playing horses", one riding on the other's back:

KPC.00999-1002 speaker PS1DV: ... all you can do is <pause> put your belly up and I'll go flying! ... Go on then, put your belly up! speaker PS1DR: Gung!

KPC.10977 Chuck a chuck a chuck chuck! Eeee! Go on then.

In the former case, gung is neither a standard English expletive, nor an obviously appropriate vocal imitation of anything happening in the horse game. Conversely, in the latter case ee could equally well be the standard Northern regional expletive expressing mildly shocked surprise, or a vocal imitation of a "riding" noise. In many such cases, the analyst is forced by the current scheme to make arbitrary guesses, yet clear cases of the discourse-item classes are too distinct from one another to justify eliminating guesswork by collapsing the classes into one.

Not all spoken words posing tagging problems are discourse items. In:

KSU.00396-8 Ah ah! Diddums! Yeah.

any English speaker will recognize the word diddums as implying that the speaker regards the hearer as childish, but intuition does not settle how the word should be tagged (noun? if so, proper or common?); and published dictionaries do not help. To date we have formulated no principled rule for choosing an analysis in cases like these.

3 Speech Repairs

Probably the most crucial single area where grammatical standards developed for written language need to be extended to represent the structure of spontaneous spoken utterances is that of speech repairs. The CHRISTINE repair annotation system draws on Levelt (1983) and Howell & Young (1990, 1991), to our knowledge the most fully-worked-out and empirically-based previously existing approach.


This approach identified a set of up to nine repair milestones within a repaired utterance, for instance the point at which the speaker's first grammatical plan is abandoned (the "moment of interruption"), and the earlier point marking the beginning of the stretch of wording which will be replaced by new wording after the moment of interruption. However, this approach is not fully workable for many real-life speech repairs. In one respect it is insufficiently informative: the Levelt/Howell & Young notation provides no means of showing how a local sequence containing a repair fits into the larger grammatical architecture of the utterance containing it. In other respects, the notation proves to be excessively rich: it requires speech repairs to conform to a canonical pattern from which, in practice, many repairs deviate.

Accordingly, CHRISTINE embodies a simplified version of this notation, in which the "moment of interruption" in a speech repair is marked (by a "#" sign within the stream of words), but no attempt is made to identify other milestones, and the role of the repaired sequence is identified by making the "#" node a daughter of the lowest labelled node in a parse tree such that both the material preceding and the material following the # are (partial) attempts to realize that category, and the mother node fits normally into the surrounding structure. This approach works well for the majority of speech repairs, e.g.:

KBJ.00943 That's why I said [Ti:o to get ma ba #, get you back then] ...
KCA.02828 I'll have to [VVOv# cha # change] it

"A:'

In the KBJ case, to get ma ba (in which ma and ba are truncated words, the former identified by the wordtagging as too distorted to reconstruct and the latter as an attempt at back as an adverb), and get you back then, are successive attempts to produce an infinitival clause (Ti) functioning as object (:o) of said. In the KCA case, cha and change are successive attempts to produce a single word whose wordtag is VVOv (base form of verb having transitive and intransitive uses). In Figure 1, the "#" symbol is used at two levels in the same speaker turn: speaker PS6RC makes two attempts to realize a main clause (S), and the second attempt begins with two attempts to pronounce the verb ice.
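Stated procedurally, the attachment rule places the "#" marker under the lowest node whose span covers both the last word of the abandoned attempt and the first word of the retry. The following Java sketch works over a toy span-labelled tree; the data structure and method are our own illustration, not the project's annotation tools.

import java.util.*;

// Toy parse-tree node with a word-index span [start, end).
class TreeNode {
    final String label;
    final int start, end;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode(String label, int start, int end) {
        this.label = label; this.start = start; this.end = end;
    }

    // Attachment site for "#": the lowest node dominating both word i
    // (last word before the interruption) and word j (first word after
    // it), assuming i < j and that this node's span covers both.
    TreeNode lowestDominating(int i, int j) {
        for (TreeNode c : children)
            if (c.start <= i && j < c.end)
                return c.lowestDominating(i, j);
        return this; // no single child covers both words
    }
}

In the KCA example above, the words on either side of the interruption would both fall within the span of the VVOv node, which is therefore the attachment site for the marker.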

However, although the CHRISTINE speech-repair notation is less informative than the full Levelt/Howell & Young scheme, and seems as simple as is consistent with offering an adequate description of repair structure, applying it consistently is not always straightforward. In the first place, as soon as the annotation scheme includes any system for marking speech repairs, analysts are obliged to decide whether particular stretches of wording are in fact repairs or well-formed constructions, and this is often unclear. Sampson (1998) examined a number of indeterminacies that arise in this area; one of these is between repairs and appositional structures, as in:

KSS.05002 she can't be much cop if she 'd open her legs to a first date to a Dutch s- sailor

— where to a Dutch s- sailor might be intended to replace to a first date as the true reason for objecting to the girl, but alternatively to a Dutch s- sailor could be an appositional phrase giving fuller and better particulars of the nature of her offence. Annotation ought not systematically to require guesswork, but it is hard to see how a neutral notation could be devised that would allow the analyst to suspend judgment on such a fundamental issue as whether a stretch of wording is a repair or a well-formed construction.

Even greater problems are posed by a not uncommon type of ill-formed utterance that might be called "syntactically Markovian", in which each element coheres logically with what immediately precedes but the utterance as a whole is not coherent. The following examples come from the London-Lund Corpus, with text numbers followed by first and last tone-unit numbers for the respective extracts:

S.1.3 0901-3 ... of course I would be willing to um <pause> come into the common-room <pause> and uh <pause> in fact I would like nothing I would like better [speaker is undergraduate, age ca 36, describing interview for Oxbridge fellowship]

S.5.5 0539-45 and what is happening <pause> in Britain today <pause> is ay- demand for an entirely new foreign policy quite different from the cold war policy <pause> is emerging from the Left [speaker is Anthony Wedgwood Benn MP on radio discussion programme]

In the former example, nothing functions simultaneously as the last uttered word of an intended sequence I would like nothing better and the first uttered word of an implied sequence something like there is nothing I would like better. In the latter, the long NP an entirely new foreign policy quite different from the cold war policy appears to function both as the complement of the preposition for, and as subject of is emerging. In such cases one cannot meaningfully identify a single point where one grammatical plan is abandoned in favour of another. Because these structures involve phrases which simultaneously play one grammatical role in the preceding construction and a different role in the following construction, they resist analysis in terms of tree-shaped constituency diagrams (or, equivalently, labelled bracketing of the word-string). Yet constituency analysis is so solidly established as the appropriate formalism for representing natural-language structure in general that it seems unthinkable to abandon it merely in order to deal with one special type of speech repair.

4 Logical Distinctions Dependent on the Written Medium

There are cases where grammatical category distinctions that are highly salient in written English seem much less significant in the spoken language, so that maintaining them in the annotation scheme arguably misrepresents the structure of speech. Probably the most important of these is the direct/indirect speech distinction. Written English takes great pains to distinguish clearly between direct speech, involving a commitment to transmit accurately the quoted speaker's exact wording, and indirect speech which preserves only the general sense of the quotation. The SUSANNE annotation scheme uses categories which reflect this distinction (Q v. Fn). However, the most crucial cues to the distinction are orthographic matters such as inverted commas, which lack spoken counterparts. Sometimes the distinction can be drawn in spoken English by reference to pronouns, verb forms, vocatives, etc.:

KD6.03060 ... he says he hates drama because the teacher takes no notice, he said one week Stuart was hitting me with a stick and the teacher just said calm down you boys ...

— the underlined he (rather than I) implies that the complement of says is indirect speech; me implies that the passage beginning one week is a direct quotation, and the imperative form calm and vocative you boys imply that the teacher is quoted directly. But in practice these cues frequently conflict rather than reinforcing one another:

KCT.10673 [reporting speaker's own response to a directly-quoted objection]: I said well that's his hard luck!

KCJ.01053-5 well Billy, Billy says well take that and then he 'll come back and then he er gone and pay that

In the KCT example, the discourse item well and the present tense of is after past-tense said suggest direct speech, but his (which from the context denotes the objector) suggests indirect speech. Likewise in the KCJ example, well and the imperative take imply direct speech, he'll rather than I'll implies indirect speech. Arguably, imposing a sharp two-way direct v. indirect distinction on speech is a distortion; one might instead feel that speech uses a single construction for reporting others' utterances, though different instances may contain more or fewer indicators of the relative directness of the report. On the other hand, logically speaking the direct v. indirect speech distinction is so fundamental that an annotation scheme which failed to recognize it could seem unacceptable. (To date, CHRISTINE analyses retain the distinction.)


5 Nonstandard Usage

Real-life British speech contains many differences from standard usage with respect to both individual words and syntactic patterns.

In the case of wordtagging, the SUSANNE rule (Sampson 1995: §3.67) is that words used in ways characteristic of nonstandard dialects are tagged in the same way as the words that would replace them in standard English. This rule tends to be unproblematic for pronouns and determiners, thus in:

KP4.03497 it's a bit of fun, it livens up me day
KCT.10705 she told me to have them plums

the underlined words are given the tags for standard my, those respectively. It is more difficult to specify a predictable way to apply such a rule in the case of nonstandard uses of strong verb forms. Standard base forms can be used in past contexts, e.g.:

KCJ.01096-8 a man bought a horse and give it to her, now it's won the race

and the solution of tagging such an instance as a past tense is put into doubt because frequently nonstandard English omits the auxiliary of the standard perfective construction, suggesting that give might be tagged as given rather than gave; cf.:

KCA.02536 What I done. I taped it back like that.

KCA.02572 What it is, when you got snooker on and just snooker you're quite <pause> content to watch it ...

(Note that done might be seen as equivalent to standard did, rather than to standard have done; but got meaning "have" is not plausibly analysed as other than a perfective construction.) It is quite impractical for annotation to be based on fully adequate grammatical analyses of each nonstandard dialect in its own terms; but it is not easy to specify consistent rules for annotating such uses as deviations from the known, standard dialect. The CHRISTINE project has attempted to introduce predictability into the analysis of these cases, by recognizing a nonstandard-English "tense" realized as past participle not preceded by auxiliary, and by ruling that any verb form used in a nonstandard structure with past reference will be classified as a past participle (thus give in the KCJ example above is classified as a nonstandard equivalent of given). This approach does work well for many cases, but it remains to be seen whether it deals satisfactorily with all the usages that arise.

At the syntactic level, an example of a nonstandard construction requiring adaptation of the written-English annotation scheme would be relative clauses containing both relative pronoun and undeleted relativized NP, unknown in standard English but usual in various nonstandard dialects, e.g.:

KD6.03075 ... bloody Colin who, he borrowed his computer that time, remember?

Here the CHRISTINE decision is to treat the relativized NP (he) as appositional to the relative pronoun. For the case quoted, this works; but it will not work if a case is ever encountered where the relativized element is not the subject of the relative clause. Examples like this raise the question what it means to specify consistent grammatical annotation standards applicable to a spectrum of different dialects, rather than a single dialect. Written English usually conforms more or less closely to the norms of the national standard language, so that grammatical dialect variation is marginal and annotation standards can afford to ignore it. In the context of speech, it cannot be ignored, but the exercise of specifying annotation standards for unpredictably varying structures seems conceptually confused.

6 Dialect Difference v. Performance Error

Special problems arise in deciding whether a turn of phrase should be annotated as well-formed with respect to the speaker's nonstandard dialect, or as representing standard usage but with words elided as a performance error. Speakers often do omit necessary words, e.g.:


KD2.03102-3 There's one thing I don't like <pause> and that's having my photo taken. And it will be hard when we have to photos.

— it seems safe to assume that the speaker intended something like have to show photos. One might take it that a similar process explains the underlined words in:

KD6.03154 oh she was shouting at him at dinner time <shift shouting> Steven <shift> oh god dinner time she was shouting him.

where at is missing; but this is cast in doubt when other speakers, in separate samples, are found to have produced:

KPC.00332 go in the sitting room until I shout you for tea

KD2.02798 The spelling mistakes only occurred when <pause> I was shouted.

— this may add up to sufficient evidence for taking shout to have a regular transitive use in nonstandard English.

This problem is particularly common at the ends of utterances, where the utterance might be interpreted as broken off before it was grammatically complete (indicated in the SUSANNE scheme by a "#" terminal node as last daughter of the root node), but might alternatively be an intentional nonstandard elision. In:

KE2.08744 That's right, she said Margaret never goes, I said well we never go for lunch out, we hardly ever really

the words we hardly ever really would not occur in standard English without some verb (if only a placeholding do), so the sequence would most plausibly be taken as a broken-off utterance of some clause such as we hardly ever really go out to eat at all; but it is not difficult to imagine that the speaker's dialect might allow we hardly ever really for standard we hardly ever do really, in which case it would be misleading to include the "#" sign.

It seems inconceivable that a detailed annotation scheme could fail to distinguish difference of dialect from performance error; indeed, a scheme which ignored this distinction might seem offensive. But analysts will often in practice have no basis for applying the distinction to particular examples.

7 Transcription Inadequacies

One cannot expect every word of a sample of spontaneous speech recorded in field conditions to be accurately transcribable from the recordings. Our project relies on transcriptions produced by other researchers, which contain many passages marked as "unclear"; the same would undoubtedly be true if we had chosen to gather our own material. A structural annotation system needs to be capable of assigning an analysis to a passage containing unclear segments; to discard any utterance or sentence containing a single unclear word would require throwing away too many data, and would undesirably bias the retained collection of samples towards utterances that were spoken carefully and may therefore share some special structural properties.

The SUSANNE scheme uses the symbol Y to label nodes dominating stretches of wholly unclear speech, or tagmas which cannot be assigned a grammatical category because they contain unclear subsegments that make the categorization doubtful. This system is unproblematic, so long as the unclear material in fact consists of one or more complete grammatical constituents. Often, however, this is not so; e.g.:

KCT.10833 Oh we didn't <unclear> to drink yourselves.

Here it seems sure that the unclear stretch contained multiple words, beginning with one or more words that complete the verb group (V) initiated by didn't; and the relationship of the words to drink yourselves to the main clause could be quite different, depending what the unclear words were. For instance, if the unclear words were give you anything, then to drink would be a modifying tagma within an NP headed by anything; on the other hand, if the unclear stretch were expect you, then to drink would be the head of an object complement clause. Ideally, a grammatical annotation scheme would permit all the clear grammar to be indicated, but allow the analyst to avoid implying any decision about unresolvable issues such as these. Given that clear grammar is represented in terms of labelled bracketing, however, it is very difficult to find usable notational conventions that avoid commitment about the structures to which unclear wording contributes.

Conclusion

In annotating written English, where one is drawing on an analytic tradition evolved over centuries, it seems on the whole to be true that most annotation decisions have definite answers; where some particular example is vague between two categories, these tend to be subcategories of a single higher-level category, so a neutral fallback annotation is available. (Most English noun phrases are either marked as singular or marked as plural, and the odd exceptional case such as the fish can at least be classified as a noun phrase, unmarked for number.) One way of summarizing many of the problems outlined in the preceding sections is to say that, in annotating speech, whose special structural features have had little influence on the analytic tradition, ambiguities of classification constantly arise that cut across traditional category schemes. In consequence, not only is it often difficult to choose a notation which attributes specific properties to an example; unlike with written language, it is also often very difficult to define fallback notations which enable the annotator to avoid attributing properties for which there is no evidence, while allowing what can safely be said to be expressed.

Some members of the research community may be tempted to feel that a paper focusing on these problems ranks as self-indulgent hand-wringing in place of serious effort to move the discipline forward. We hope that our earlier discussion of software engineering will have shown why that feeling would be misguided. Nothing is easier and more appealing than to plunge into the work of getting computers to deliver some desired behaviour, leaving conceptual unclarities to be sorted out as and when they arise. Huge quantities of industrial resources have been wasted over the decades through allowing IT workers to adopt that approach. Natural language processing was one of the first application areas ever proposed for computers (by Alan Turing in 1948 — Hodges 1983: 382); fifty years later, the level of success of NLP software (while not insignificant) does not suggest that computational linguistics can afford to go on ignoring lessons that have already been painfully learned by more central sectors of the IT industry.

Effort put into automatic analysis of natural language implies a prior requirement for serious effort devoted to defining and debugging detailed standard schemes of linguistic analysis. Our SUSANNE and CHRISTINE projects have been and are contributing to this goal, but they are no more than a beginning. We urge other computational linguists to recognize this area as a priority.

Acknowledgment

The research reported here was supported by grant R000 23 6443, "Analytic Standards for Spoken Grammatical Performance", awarded by the Economic and Social Research Council (UK).

References

Boehm B.W. (1981) Software Engineering Economics. Prentice-Hall, Englewood Cliffs, N.J.

Edwards Jane A. (1992) Design principles in the transcription of spoken discourse. In "Directions in Corpus Linguistics", J. Svartvik, ed., Mouton de Gruyter, Berlin.

Hodges A. (1983) Alan Turing: The Enigma of Intelligence. Burnett Books, London.

Howell P. & Young K. (1990) Speech repairs: report of work conducted October 1st 1989-March 31st 1990. Department of Psychology, University College London.

Howell P. & Young K. (1991) The use of prosody in highlighting alterations in repairs from unrestricted speech. Quarterly Journal of Experimental Psychology 43A, pp. 733-758.

Langendoen D.T. (1997) Review of Sampson (1995). Language 73, pp. 600-603.

Leech G.N. & Eyes Elizabeth (1997) Syntactic annotation: treebanks. Ch. 3 in "Corpus Annotation", R.G. Garside et al., eds., Longman, Harlow, Essex.

Levelt W.J.M. (1983) Monitoring and self-repair in speech. Cognition, 14, pp. 41-104.

Mason O. (1997) Review of Sampson (1995). International Journal of Corpus Linguistics, 2.1, pp. 69-72.

Sampson G.R. (1995) English for the Computer. Clarendon Press, Oxford.

Sampson G.R. (1998) Consistent annotation of speech-repair structures. In "Proceedings of the First International Conference on Language Resources and Evaluation", Granada, May 1998.

Sommerville I. (1992) Software Engineering (4th ed.). Addison-Wesley, Wokingham, Berks.

Stenstrom Anna-Brita (1990) Lexical items peculiar to spoken discourse. In Svartvik (1990).

Svartvik J., ed. (1990) The London-Lund Corpus of Spoken English. Lund University Press.
