30
M Syntax + Prosody: A syntactic–prosodic labelling scheme for large spontaneous speech databases A. Batliner a, * , R. Kompe a,b , A. Kießling a,c , M. Mast a,d , H. Niemann a , E. N oth a a Universit at Erlangen-N urnberg, Lehrstuhl f ur Mustererkennung, Informatik 5, Martensstr. 3, 91058 Erlangen, Germany b Sony Stuttgart Technology Center, Stuttgarterstr. 106, 70736 Fellbach, Germany c Ericsson Eurolab Deutschland GmbH, Nordostpark 12, 90411 N urnberg, Germany d IBM Heidelberg Scientific Center, Vangerowstr. 18, 69115 Heidelberg, Germany Received 17 April 1997; received in revised form 25 May 1998 Abstract In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation), we developed a syntactic–prosodic labelling scheme where dierent types of syntactic boundaries are la- belled for a large spontaneous speech corpus. This labelling scheme is presented and compared with other labelling schemes for perceptual–prosodic, syntactic, and dialogue act boundaries. Interlabeller consistencies and estimation of eort needed are discussed. We compare the results of classifiers (multi-layer perceptrons (MLPs) and n-gram language models) trained on these syntactic–prosodic boundary labels with classifiers trained on perceptual–prosodic and pure syntactic labels. The main advantage of the rough syntactic–prosodic labels presented in this paper is that large amounts of data can be labelled with relatively little eort. The classifiers trained with these labels turned out to be superior with respect to purely prosodic or syntactic labelling schemes, yielding recognition rates of up to 96% for the two-class-problem ‘boundary versus no boundary’. The use of boundary information leads to a marked improvement in the syntactic processing of the VM system. Ó 1998 Elsevier Science B.V. All rights reserved. Zusammenfassung Die Segmentierung von kontinuierlich gesprochener Sprache in syntaktisch sinnvolle Einheiten ist f ur die automa- tische Sprachverarbeitung ein großes Problem. Syntaktische Grenzen sind oft prosodisch markiert. Um prosodische Grenzen mit statistischen Modellen bestimmen zu k onnen, ben otigt man allerdings große Trainingskorpora. F ur das Forschungsprojekt Verbmobil zur automatischen Ubersetzung spontaner Sprache wurde daher ein syntaktisch–pro- sodisches Annotationsschema entwickelt und auf ein großes Korpus angewendet. Dieses Schema wird mit anderen Annotationsschemata verglichen, mit denen prosodisch–perzeptive, rein syntaktische bzw. Dialogakt-Grenzen et- ikettiert wurden; Konsistenz der Annotation und ben otigter Aufwand werden diskutiert. Das Ergebnis einer automatischen Klassifikation (multi-layer perceptrons bzw. Sprachmodelle) f ur diese neuen Grenzen wird mit den Erkennungsraten verglichen, die f ur die anderen Grenzen erzielt wurden. Der Hauptvorteil der groben syntaktisch– prosodischen Grenzen, die in diesem Aufsatz eingef uhrt werden, besteht darin, daß ein großes Trainingskorpus in Speech Communication 25 (1998) 193–222 * Corresponding author. E-mail: [email protected]. 0167-6393/98/$ – see front matter Ó 1998 Elsevier Science B.V. All rights reserved. PII: S 0 1 6 7 - 6 3 9 3 ( 9 8 ) 0 0 0 3 7 - 5

M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

M � Syntax + Prosody: A syntactic±prosodic labelling schemefor large spontaneous speech databases

A. Batliner a,*, R. Kompe a,b, A. Kieûling a,c, M. Mast a,d, H. Niemann a,E. N�oth a

a Universit�at Erlangen-N�urnberg, Lehrstuhl f�ur Mustererkennung, Informatik 5, Martensstr. 3, 91058 Erlangen, Germanyb Sony Stuttgart Technology Center, Stuttgarterstr. 106, 70736 Fellbach, Germany

c Ericsson Eurolab Deutschland GmbH, Nordostpark 12, 90411 N�urnberg, Germanyd IBM Heidelberg Scienti®c Center, Vangerowstr. 18, 69115 Heidelberg, Germany

Received 17 April 1997; received in revised form 25 May 1998

Abstract

In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem.

Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic

boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech

translation), we developed a syntactic±prosodic labelling scheme where di�erent types of syntactic boundaries are la-

belled for a large spontaneous speech corpus. This labelling scheme is presented and compared with other labelling

schemes for perceptual±prosodic, syntactic, and dialogue act boundaries. Interlabeller consistencies and estimation of

e�ort needed are discussed. We compare the results of classi®ers (multi-layer perceptrons (MLPs) and n-gram language

models) trained on these syntactic±prosodic boundary labels with classi®ers trained on perceptual±prosodic and pure

syntactic labels. The main advantage of the rough syntactic±prosodic labels presented in this paper is that large

amounts of data can be labelled with relatively little e�ort. The classi®ers trained with these labels turned out to be

superior with respect to purely prosodic or syntactic labelling schemes, yielding recognition rates of up to 96% for the

two-class-problem `boundary versus no boundary'. The use of boundary information leads to a marked improvement in

the syntactic processing of the VM system. Ó 1998 Elsevier Science B.V. All rights reserved.

Zusammenfassung

Die Segmentierung von kontinuierlich gesprochener Sprache in syntaktisch sinnvolle Einheiten ist f�ur die automa-

tische Sprachverarbeitung ein groûes Problem. Syntaktische Grenzen sind oft prosodisch markiert. Um prosodische

Grenzen mit statistischen Modellen bestimmen zu k�onnen, ben�otigt man allerdings groûe Trainingskorpora. F�ur das

Forschungsprojekt Verbmobil zur automatischen �Ubersetzung spontaner Sprache wurde daher ein syntaktisch±pro-

sodisches Annotationsschema entwickelt und auf ein groûes Korpus angewendet. Dieses Schema wird mit anderen

Annotationsschemata verglichen, mit denen prosodisch±perzeptive, rein syntaktische bzw. Dialogakt-Grenzen et-

ikettiert wurden; Konsistenz der Annotation und ben�otigter Aufwand werden diskutiert. Das Ergebnis einer

automatischen Klassi®kation (multi-layer perceptrons bzw. Sprachmodelle) f�ur diese neuen Grenzen wird mit den

Erkennungsraten verglichen, die f�ur die anderen Grenzen erzielt wurden. Der Hauptvorteil der groben syntaktisch±

prosodischen Grenzen, die in diesem Aufsatz eingef�uhrt werden, besteht darin, daû ein groûes Trainingskorpus in

Speech Communication 25 (1998) 193±222

* Corresponding author. E-mail: [email protected].

0167-6393/98/$ ± see front matter Ó 1998 Elsevier Science B.V. All rights reserved.

PII: S 0 1 6 7 - 6 3 9 3 ( 9 8 ) 0 0 0 3 7 - 5

Page 2: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

relativ kurzer Zeit erstellt werden kann. Die Klassi®katoren, die mit diesem Korpus trainiert wurden, erzielten bessere

Ergebnisse als alle fr�uher verwendeten; die beste Erkennungsrate lag bei 96% f�ur das Zwei-Klassen-Problem `Grenze vs.

Nicht-Grenze'. Die Ber�ucksichtigung der Grenzinformation in der syntaktischen Verarbeitung f�uhrt zu einer wesent-

lichen Verbesserung. Ó 1998 Elsevier Science B.V. All rights reserved.

ReÂsumeÂ

En compr�ehension automatique de la parole, la segmentation de parole continue en composants syntaxiques pose un

grand probl�eme. Ces composants sont souvent d�elimit�ees par des indices prosodiques. Cependant l'entraõnement de

mod�eles d'�etiquetage de fronti�eres prosodiques statistiques n�ecessite de tr�es grandes bases de donn�ees. Dans le cadre du

projet allemand Verbmobil (traduction automatique de parole �a parole), nous avons donc d�evelopp�e une m�ethode

d'�etiquetage prosodique±syntaxique de larges corpus de parole spontan�ee o�u deux principaux types de fronti�eres

(fronti�eres syntaxiques majeures et fronti�eres syntaxiques ambig�ues) et certaines autres fronti�eres sp�eciales sont �eti-

quet�es. Cette m�ethode d'�etiquetage est pr�esent�ee et compar�ee �a d'autres m�ethodes d'�etiquetage de fronti�eres bas�ees sur

des crit�eres prosodiques perceptifs, syntaxiques, et de dynamique du dialogue. Plus pr�ecisement, nous comparons les

r�esultats obtenus �a partir de classi®cateurs (perceptrons multi-couches et mod�eles du language) entraõn�es sur les

fronti�eres �etablies par notre �etiqueteur prosodique±syntaxique �a ceux obtenus �a partir d'�etiqueteurs strictement syn-

taxiques ou prosodiques perceptifs. L'un des avantages principaux de la m�ethode d'�etiquetage prosodique±syntaxique

pr�esent�ee dans cet article est la rapidit�e avec laquelle elle permet d'�etiqueter de grandes bases de donn�ees. De plus, les

classi®cateurs entraõn�es avec les �etiquettes de fronti�ere produites se r�ev�elent etre tr�es performants, les taux de recon-

naissance atteignant 96%. Ó 1998 Elsevier Science B.V. All rights reserved.

Keywords: Syntax; Dialogue; Prosody; Phrase boundaries; Prosodic labelling; Automatic boundary classi®cation; Spontaneous speech;

Large databases; Neural networks; Stochastic language models

1. Introduction

The research presented in this paper has beenconducted under the Verbmobil (VM) project, cf.(Wahlster, 1993; Bub and Schwinn, 1996; Wahlsteret al., 1997), which aims at automatic speech-to-speech translation of appointment schedulingdialogues; details are given in Section 2. In VM,prosodic boundaries are used for disambiguationduring syntactic parsing. In spontaneous speech,many elliptic sentences or nonsentential free ele-ments occur. Without knowledge of the prosodicphrasing and/or the dialogue history, a correctsyntactic phrasing that mirrors the intention of thespeaker is often not possible for a parser in suchcases. Consider the following turn ± a typical ex-ample taken from the VM corpora:

ja j zur Not j geht0s j auch j am SamstagjThe vertical bars indicate possible positions forclause boundaries. In written language, most ofthese bars can be replaced by either comma, periodor question mark. In total, there exist at least 36di�erent syntactically correct alternatives for put-

ting the punctuation marks. Examples 1 and 2show two of these alternatives together with atranslation into English.

Example. 1. Ja? Zur Not geht's? Auch am Samstag?(Really? It's possible if necessary? Even on Satur-day?)

Example. 2. Ja. Zur Not. Geht's auch am Samstag?(Yes. If necessary. Would Saturday be possible aswell?)

Without knowledge of context and/or dialoguehistory, for such syntactically ambiguous turns,the use of prosodic information might be the onlyway to ®nd the correct interpretation, e.g., whetherja is a question or a con®rmation. But even ifknowledge of the context is available, we believethat it is much cheaper and easier to infer the sameinformation from prosody than from the context.Furthermore, for syntactically nonambiguousword sequences, the search space during parsingcan be enormous because locally, it is often notpossible to decide for some word boundaries if

194 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 3: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

there is a clause boundary or not. Therefore, thesearch e�ort can be reduced considerably duringparsing if prosodic information about clauseboundaries is available, cf. (Batliner et al., 1996a;Kompe et al., 1997).

Researchers have long noted that there is astrong (albeit not perfect) correspondence betweenprosodic phrasing and syntactic phrasing, cf. (Lea,1980; VaissieÁre, 1988; Price et al., 1991; Cutleret al., 1997). Most of the pertinent research on thistopic that has been conducted in the last two de-cades is basic research with controlled, elicitedspeech, a small database, and only a few speakers.This restriction holds even for most of the studiesthat de®nitely aim at the use of prosody in auto-matic speech processing, cf., e.g. (Bear and Price,1990; Ostendorf and Veilleux, 1994; Hunt, 1994),who used pairs of ambiguous sentences read byprofessional radio news speakers, and (Wang andHirschberg, 1992), which is one of the very fewstudies based on spontaneous speech (a subset ofthe ATIS database (Lamel, 1992)). For a review ofthe di�erent statistical methods used in thesestudies, see (Kompe, 1997). In these studies with(American) English, the labelling is generallybased on the ToBI approach (Beckman and Ayers,1994), in which tone sequences are annotated onthe tonal tier and a hierarchy of break indices onthe break (phrasing) tier. There is no generalagreement on the number of levels needed. Thelowest possible number is of course two: `boun-dary versus no boundary'. ToBI has four levels,and seven levels can be found in (Ostendorf et al.,1990; Price et al., 1990; Bear and Price, 1990).Usually, these detailed schemes are, however, inpractice reduced to some two or three levels.

In this paper, we describe scenario and archi-tecture of the VM project in Section 2. In Sec-tion 3 we discuss brie¯y the boundary annotationsused up to now in VM: The prosodic±perceptual Bboundaries, the purely syntactic S boundaries, andthe dialogue act boundaries D. We then introducein Section 4 a new labelling scheme, the syntactic±prosodic M boundaries, that constitutes a ®rst steptowards an integrated labelling system especiallysuited for the processing of large spontaneousspeech databases used for automatic speech pro-cessing. With these labels, we were able to anno-

tate large amounts of data with relatively littlee�ort without much loss of information which isessential for the exploitation of prosody in theinterpretation of utterances. The correspondencebetween these M labels and the other three label-ling schemes (Section 5) corroborates this as-sumption; its discussion leads to a more detailedrelabelling of the M boundaries described in Sec-tion 6. Correspondences of the new M boundarieswith B and D are dealt with in Section 7, interla-beller consistency and an estimation of overall ef-fort for the annotation of the M labels incomparison with other annotation schemes isgiven in Section 8. Finally, we discuss recognitionrates obtained with a combination of acousticclassi®ers and language models (Section 9) as wellas the overall usefulness of the M labels for syn-tactic processing in VM (Section 10).

2. The Verbmobil project

VM is a speech-to-speech translation project(Wahlster, 1993; Wahlster et al., 1997; Block,1997) in the domain of appointment schedulingdialogues, i.e., two persons try to ®x a meetingdate, time, and place. Currently, the emphasis lieson the translation of German utterances into En-glish. VM deals thus with a special variety ofspontaneous speech found in such human±ma-chine±human communications which is, however,fairly representative for spontaneous speech ingeneral, as far as typical spontaneous speechphenomena (word fragments, false starts, syntacticconstructions not found in written language, etc.)are concerned. After the recording of the sponta-neous utterance, a word hypotheses graph (WHG)is computed by a standard Hidden Markov Model(HMM) word recognizer and enriched with pro-sodic information. This information currentlyconsists of probabilities for a major syntacticboundary being after the word hypotheses, aprobability for accentuation and for three classesof sentence mood (Kieûling, 1997; Kompe, 1997).The WHG is parsed by one of two alternativesyntactic modules, and the best scored word chaintogether with its di�erent possible parse trees(readings), cf. Section 10, is passed onto the

A. Batliner et al. / Speech Communication 25 (1998) 193±222 195

Page 4: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

semantic analysis. Governed by the syntax moduleand the dialogue module, the utterance is trans-lated onto the semantic level (transfer module),and an English utterance is generated and syn-thesized. Parallel to the deep analysis performed bysyntax and semantics, the dialogue module con-ducts a shallow processing, i.e., the important di-alogue acts are detected in the utterance andtranslated roughly. A more detailed account of thearchitecture can be found in (Bub and Schwinn,1996). For the time being, the following modulesuse prosodic information: syntactic analysis, se-mantics, transfer, dialogue processing, and speechsynthesis (cf. Niemann et al., 1997); in this paper,we will only deal with the use of prosody in syn-tactic analysis and dialogue processing.

The architecture of VM is a so-called multi-agent architecture which means that several au-tonomous modules (developed by 29 di�erentpartners at di�erent sites) process speech andlanguage data and interact with each other. Notethat VM uses a sequential, bottom±up approach;syntax and dialogue do not interact with eachother, there is no dialogue history available for thesyntax module and no deep syntactic analysis forthe dialogue module. A system like VM can thusnot mirror perception and understanding of acompetent native speaker that is certainly a com-bination of bottom±up and top±down processesand much more parallel than the architecture ofVM allows, or, for that matter, the architecture ofany other successful system that can be imaginedin the near future. Prosodic analysis has to beadapted to the demands of those modules (syntaxand dialogue) that use its information. This meansthat we use prosodic information to predict eventsthat can be useful for processing in these highermodules. We conceive prosody as a means toconvey di�erent functions, e.g., segmentation ofspeech into syntactic boundaries or dialogue actboundaries. Other functions of prosody (denotingrhythm, speaker idiosyncrasies, emotions, etc.) aretreated in this approach either as interveningfactors that have to be controlled, if their in¯u-ence is strong and systematic, or as random noisethat can be handled quite well with statisticalclassi®ers, if their in¯uence is weak and/or un-systematic.

For all VM dialogues, a so-called `basis trans-literation' is provided, i.e., the words in orthog-raphy together with non-words, as, e.g., pauses,breathing, unde®ned noise, etc., and with speci®ccomments, if, e.g., the pronunciation of a worddeviates considerably from the canonical pronun-ciation (slurring, dialectal variants). A phonetictranscription exists only for a small number ofdialogues and is used for special tasks. Irregularphenomena, e.g., boundaries at speech repairs, areannotated as well.

In a second step, the basis transliteration issupplemented by other partners with annotationsneeded for special tasks, e.g., perceptual±prosodiclabels for accentuation and phrasing, and dialogueact annotations including dialogue act boundaries,cf. Section 3. The resources for these time±con-suming annotations are limited, and often, thedatabases available are too small for a robusttraining of statistical classi®ers. This in turn pre-vents an improvement of automatic classi®cation.This point will be taken up again in the discussionof the di�erent labelling schemes in Section 3.

3. Prosodic, syntactic, and dialogue act boundaries:

Bs, Ss and Ds

In written forms of languages such as Germanand English, syntactic phrasing is ± on the surface± at least partly indicated by word order; for in-stance, a wh-word after an in®nite verb formnormally indicates a syntactic boundary before thewh-word: Wir k�onnen gehen wer kommt mit (Wecan go. Who will join us?). Above that, syntacticphrasing in written language can be disambiguatedwith the help of punctuation marks. In spontane-ous speech, prosodic marking of boundaries cantake over the role of punctuation. In order to useprosodic boundaries during syntactic analysis,automatic classi®ers have to be trained; for this,prosodic reference labels are needed.

Reyelt and Batliner (1994) describe an inven-tory of prosodic labels for the speech data in VMalong the lines of ToBI, an inventory that containsa boundary tier amongst other tiers. The followingdi�erent types of perceptual boundaries were

196 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 5: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

labelled by the VM partner University of Bra-unschweig, cf. (Reyelt, 1998).B3: full intonational boundary with strong intona-

tional marking with/without lengthening orchange in speech tempo.

B2: minor (intermediate) phrase boundary withrather weak intonational marking.

B0: normal word boundary (default, not labelledexplicitly).

B9: `agrammatical' boundaries indicating hesita-tions or repairs.

There are, however, some drawbacks in this ap-proach if one wants to use this information inparsing: First, prosodic labelling by hand is verytime consuming; the labelled database up to now istherefore rather small. Second, perceptual labellingof prosodic boundaries is not an easy task andpossibly not very robust. E�ort needed and con-sistency of annotation will be discussed in Sec-tion 8. Finally and most important, prosodicboundaries do not only mirror syntactic bound-aries but are in¯uenced by other factors such asrhythmic constraints and speaker speci®c style. Inthe worst case, discrepancies between prosodic andsyntactic phrasing might be lethal for a syntacticanalysis if the parser goes on the wrong track andnever returns.

Feldhaus and Kiss (1995) therefore argued for alabelling without listening which is based solely onlinguistic de®nitions of syntactic phrase bound-aries in German (cf. as well (Batliner et al.,1996a)). They proposed that only syntacticboundaries should be labelled, and they should belabelled whether they are marked prosodically ornot. (The assumption behind is, of course, thatmost of the time, a correspondence between syn-tactic and prosodic boundaries really exists. Oth-erwise, prosodic classi®cation of such boundarieswould be rather useless for parsing.) 21 dialoguesof the VM corpus that are labelled with Bs werelabelled by the VM partner IBM Heidelberg usinga very detailed and precise scheme with 59 di�erenttypes of syntactic boundaries that are described in(Feldhaus and Kiss, 1995). The labels code the leftcontext and the right context of every wordboundary. The distribution of the 59 di�erenttypes is of course very unequal and often, only afew tokens per type exist in the database. In this

paper, we therefore use only three main labels, S3+(syntactic clause boundary obligatory), S3) (syn-tactic clause boundary impossible), and S3? (syn-tactic clause boundary ambiguous). Note that S3?is sometimes used for constellations that could notbe analyzed syntactically at all or where the la-beller could not decide between two competinganalyses. The correspondence between S and B la-bels showed a high agreement of 95%, cf. (Batlineret al., 1996a).

In VM, the dialogue as a whole is seen as asequence of basic units, the dialogue acts. Onetask of the dialogue module is to follow the dia-logue in order to provide predictions for the on-going dialogue. Therefore each turn has to besegmented into units which correspond to onedialogue act, i.e. the boundaries ± in the followingcalled D3 in analogy to B3 and S3 ± have to bedetected and for each unit, the corresponding di-alogue act(s) has (have) to be found. For thetraining of classi®ers, a large subsample of the VMdatabase has been labelled with D3 at dialogue actboundaries; every other word boundary is auto-matically labelled with D0. Dialogue acts are de-®ned according to their illocutionary force, e.g.,ACCEPT, SUGGEST, REQUEST, and can besubcategorised for their functional role in the dia-logue or for their propositional content, e.g.,DATE or LOCATION depending on the appli-cation; in our domain, 18 dialogue acts on theillocutionary level and 42 subcategories are de-®ned at present (Jekat et al., 1995). Dialogue actson the illocutionary (functional) level are e.g.REJECT and CLARIFICATION. REJECT issubcategorised w.r.t. the propositional context inthe appointment scheduling domain in RE-JECT_LOCATION and REJECT_TIME. Clari-®cation is subcategorised w.r.t. the function thedialogue act has in the dialogue in CLARIFICA-TION_ANSWER (when it follows a question) andCLARIFICATION_STATE (when there precedesno question, but the speaker wants to clarifysomething with a statement). With this multi-levelde®nition of dialogue acts, the ones on the ill-ocutionary level can be applied to other domains.In VM, dialogue acts are used in di�erent modulesfor quite di�erent purposes (e.g. tracking the dia-logue history, robust translation with keywords)

A. Batliner et al. / Speech Communication 25 (1998) 193±222 197

Page 6: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

for which the more domain speci®c dialogue actson the lower levels are needed.

Turns can be subdivided into smaller unitswhere each unit corresponds to one or more dia-logue acts. It is not easy to give a quantitative andqualitative de®nition of the term dialogue act inspontaneous speech. We de®ned criteria for thesegmentation of turns based on the textual repre-sentation of the dialogue (Mast et al., 1995). Suchcriteria are mostly syntactic, e.g.: All `material'that belongs to the verb frame of a ®nite verbbelongs to the same dialogue act. That way it isguaranteed that both the obligatory and the op-tional elements of a verb are included in the samedialogue act, cf. Example 3. For dependent clausesthe preceding rule is also applicable: Each depen-dent clause which contains a ®nite verb is seen as aunit of its own, cf. Example 4. Conventionalisedexpressions are seen as one unit even if they do notcontain a verb. Typical examples are: hello, goodmorning, thanks. Prosody is not taken into accountin order to be able to label dialogues withouthaving to listen to them and thus to reduce thelabelling e�ort (cf. Section 8): In (Carletta et al.,1997, p. 35) it is reported that the segmentation ofdialogues changes only slightly when the annota-tors can listen to speech data.

Example 3.

Example 4.

Large-scaled parallel annotations of B, S and Dboundaries might be desirable; it is, however, morerealistic to aim at smaller reference databases withannotations of di�erent boundaries in combina-tion with a large database annotated with an in-tegrated labelling system. `Integrated' in thiscontext means that we want to favour a labellingsystem that basically is syntactic but takes expec-tations about prosodic and dialogue structure into

consideration by subclassifying boundaries; bythis, the subclasses can be clustered di�erentlyaccording either to the needs of syntax or to theneeds of dialogue analysis. If such labels can beannotated in a relatively short amount of time,they can be annotated for a very large corpus forthe training of automatic classi®ers without toomuch e�ort. Other annotations, e.g., prosodic ordialogue act boundaries, can be used to evaluatesuch a labelling system. For that, however, smallerreference corpora can be used.

4. A rough syntactic±prosodic labelling scheme: the

Ms

Prosodic boundaries do not mirror syntacticboundaries exactly, but prosodic marking can, onthe other hand, be the only way to disambiguatethe syntactic structure of speech. In the past, wesuccessfully incorporated syntactic±prosodicboundary labels in a context-free grammar whichwas used to generate a large database with readspeech; in this grammar, 36 sentence templateswere used to generate automatically 10.000 uniquesentences out of the time-table-inquiry-domain.We added boundary labels to the grammar andthus to the generation process. The sentences thespeakers had to read contained punctuation marksbut no prosodic labels. In listening experiments,boundaries were de®ned perceptually and usedlater as reference in automatic classi®cation ex-periments. For these perceptually de®ned bound-aries, recognition rates of above 90% could beobtained with acoustic±prosodic classi®ers trainedon automatically generated boundary labels; fordetails, cf. (Kompe et al., 1995; Batliner et al.,1995; Kompe, 1997).

For read, constructed speech, it is thus possibleto label syntactic±prosodic boundaries automati-cally; for spontaneous speech, however, it is ± atleast for the time being ± necessary to label suchboundaries manually. These labels shall be usedboth for acoustic±prosodic classi®ers and for sto-chastic language models (prosodic±syntactic clas-si®ers). The requirements for such a labelling aredescribed in the following.

and on the fourteenth I amleaving for my bobsleddingvacation until the nineteenth

SUGGEST_EX-CLUDE_DATE.

no Friday is not any good REJECT_DATE

because I have got a seminarall day

GIVE_REASON

198 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 7: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

First, it should allow for fast labelling. The la-belling scheme should be rather rough and notbased on a deep syntactic analysis because themore precise it is the more complicated and themore time consuming the labelling will be; roughin this context means considerably rougher thanthe annotation with S labels. A `small' amount oflabelling errors can be tolerated, since it shall beused to train statistical models, which should berobust to cope for these errors.

Second, prosodic tendencies and regularitiesshould be taken into account. For our purposes, itis suboptimal to label a syntactic boundary that ismost of the time not marked prosodically with thesame label as an often prosodically markedboundary for the following reasons: Syntactic la-bels that ± implicitly ± model the words and/orparts-of-speech before and after the boundary canbe used for language models. Prosodic±perceptuallabels can be used for acoustic±prosodic classi®ersirrespective of the syntactic context. Labels thatare used both for language models and foracoustic±prosodic classi®ers have to be subclassi-fed accordingly; examples for such subclassi®ca-tions are given below. Since large quantities ofdata should be labelled within a short time, onlyexpectations about prosodic regularities based onthe textual representation of a turn can be con-sidered. These expectations are either based on theexperience of the labeller or rely on the basistransliteration of the VM dialogues where, e.g.,pauses that are longer than half a second are an-notated; examples will be given below. For thesame reason, we will not use the whole dialoguehistory for interpretation and disambiguation, butonly the immediate context, i.e., the whole turn orat the most the turns before and after. Pauses and/or breathing that are labelled in the transliterationwill be taken as an indication of a prosodicboundary and used for a subclassi®cation of oursyntactic boundaries. Note that pauses ± andhesitations, for that matter ± are often an indica-tion of agrammatical boundaries which, however,are already labelled in the basis transliteration.

Third, the speci®c characteristics of spontane-ous speech (elliptic sentences, frequent use of dis-course particles, etc.) have to be taken intoaccount.

The labels will be used for statistical models(hence M for the ®rst character), corresponding toB, D and S. The strength of the boundary is indi-cated by the second character: 3 at sentences/clauses/phrases, 2 at prosodically heavy constitu-ents, 1 at prosodically weak constituents, 0 at anyother word boundary. The third character tries tocode the type of the adjacent clauses or phrases, asdescribed below. In Table 1, the context of theboundaries is described shortly, and the label andthe main class it is attached to is given, as well asone example for each boundary type; in addition,the frequency of occurrence in the whole databaseis given as well, not counting the end of turns.These, by default, are labelled implicitly with M3.So far a reliable detection of M3 had priority;therefore, for the time being, M2I is only labelledin three dialogues and mapped onto M0 for ourclassi®cation experiments; M1I is currently not la-belled at all.

Agrammatical phenomena such as hesitations,repairs and restarts, are labelled in the basistransliteration, cf. (Kohler et al., 1994), and werealso used for disambiguating between alternativeM labels. However, in very agrammatical passages,a reasonable labelling with M labels is almost im-possible. In general, we follow the strategy thatafter short agrammatical passages, no label isgiven but after rather long passages, especially ifthe syntactic construction starts anew, either M3Sor M3P is labelled; a more detailed discussion canbe found in (Batliner et al., 1996b).

Syntactic main boundaries M3S (`S'entence) areannotated between main clause and main clause,between main clause and subordinate clause, andbefore coordinating particles between clauses.Boundaries at `nonsentential free elements' func-tioning as elliptic sentences are labelled with M3P(`P'hrase), as well as left dislocations (cf. Sec-tion 6). Normally, these phrases do not contain averb. They are idiomatic performative phraseswith a sort of lexicalized meaning such as GutenTag (hello), Wiedersehen (good bye) and vocatives,or they are `normal, productive' elliptic sentencessuch as um vierzehn Uhr (at two p.m.). Boundariesbetween a sentence and a phrase to its right, whichin written language normally would be inside theverbal brace, are labelled with M3E (`E' for ex-

A. Batliner et al. / Speech Communication 25 (1998) 193±222 199

Page 8: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

traposition, or right dislocation with or without apro element): In Dann k�onnen wir es machen M3Edas Tre�en (Then we can do that, the meeting),there is a pro element ( es� it), whereas in Wurdees Ihnen passen M3E am Dienstag (Will it suit youon Tuesday), no pro element exists. In writtenlanguage, the dislocated element would be insidethe verbal brace: Dann k�onnen wir das Tre�enmachen and W�urde es Ihnen am Dienstag passen.Note that the verbal brace (`Verbklammer') is asyntactic phenomenon that does not exist in En-glish. M3E is also labelled at boundaries wherethere is no verbal brace (so-called `open verbalbrace') and thus no de®ning criterion, but where apause etc. in the transliteration denotes a strongerseparation from the clause to its left, e.g. in Tre�en

wir uns M3E <pause> am Freitag (Let's meet M3E<pause> on Friday). This di�erence can in¯uencepresuppositions and thus semantic interpretationbecause a clear pause that can be accompanied bya pronounced accent on the extrapolated elementindicates an ± implicit ± contrast, if, e.g., the dia-logue partner has proposed Saturday, and thespeaker wants to reject this day not explicitely(Nein, nicht am Samstag, sondern am Freitag (No,not on Saturday, but on Friday)) but with the helpof syntactic (and prosodic) means (extrapositionwith or without special accentuation).

Sentences or nonsentential free elements thatare embedded in a sentence are labelled with M3I(`I'nternal). Typically, these are parentheticalasides or embedded relative clauses.

Table 1

Description and examples for boundary labels and their main classes in parentheses, with frequency of occurence of the annotated

labels in the whole database (326 dialogues, 7075 turns, 147 110 words)

Label Main class # Description with example

M3S (M3) 11 473 main/subordinate clause:

vielleicht stelle ich mich kurz vorher noch vor M3S mein Name ist Lerch

perhaps I should ®rst introduce myself M3S my name is LerchM3P (M3) 4535 nonsentential free element/phrase, elliptic sentence, left dislocation:

guten Tag M3P Herr Meier

hello M3P Mr. MeierM3E (M3) 1398 extraposition:

wie w�urde es Ihnen denn am Dienstag passen M3E den achten Juni

will Tuesday suit you M3E the eighth of JuneM3I (M3) 369 embedded sentence/phrase:

eventuell M3I wenn Sie noch mehr Zeit haben M3I hAtmungi 'n bi�chen l�anger

possibly M3I if you've got even more time ábreathingñ M3I a bit longerM3T (M3) 325 pre-/ postsentential particle with ápauseñ/ábreathingñ:

gut M3T hPausei okay

®ne ápauseñ M3T okay

M3D (MU) 5052 pre-/ postsentential particle without ápauseñ/ábreathingñ:also M3D dienstags pa�t es Ihnen M3D ja M3S hAtmungithen M3D Tuesday will suit you M3D won't it / after all M3S ábreathingñ

M3A (MU) 707 syntactically ambiguous:

w�urde ich vorschlagen M3A vielleicht M3A im Dezember M3A noch mal M3A dann

I'd propose M3A possibly M3A in December M3A again M3A then

M2I (M0) ) constituent, marked prosodically:

wie s�ahe es denn M2I bei Ihnen M2I Anfang November aus

will it be possible M2I for you M2I early in November

M1I (M0) ) constituent, not marked prosodically:

M3S h�atten Sie da M1I 'ne Idee M3S

M3s have you've got M1I any idea M3SM0I (M0) every other word (default)

200 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 9: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

Very often in spontaneous speech, a turn or asentence inside a turn begins with presententialparticles (`Satzauftaktpartikeln') such as ja (well),also (well), gut (well), okay (okay), etc. The term`presentential particle' is used here purely syntac-tically for a particle that is the ®rst word in a turnor in a sentence. Such particles can have di�erentfunctions. Often, it cannot be decided whetherthey are just discourse particles or whether theyhave a certain meaning: Their functions are neu-tralized. Often, but not always, prosody can helpto disambiguate these di�erent functions. Notethat normally, only a�rmative but not negativeparticles can be neutralized in presentential posi-tion, cf. the following four answers to the questionKommst du morgen (Will you come tomorrow):(1) con®rmation, semantics of particle neutralized:Ja das geht (Yes/Well, that's possible), (2) rejec-tion, presentential discourse particle without se-mantic function: Ja das geht �uberhaupt nicht (Well,that's not possible at all), (3) rejection: Nein dasgeht nicht (No, that's not possible), (4) ungram-matical and contradictory combination of negativeparticle and a�rmation: Nein das geht (No, that'spossible). The speci®c function can, however, bemarked by prosodic means; presentential particlesthat are followed by a pause or by breathing de-noted as such in the transliteration are thereforelabelled with M3T and all other with M3D (`D'is-course particle). In postsentential position, we la-bel these particles analogously. Here, theynormally function as tags: Geht gut ja (That's okisn't it). Note that inside a clause, they are modalparticles that normally cannot be translated intoEnglish: Das ist ja gut (That's great). (The mne-monic reason for M3T versus M3D is that M3Trepresents a stronger boundary than M3D becauseit is marked by a pause, and the phoneme /t/ isphonologically/phonetically stronger than /d/ aswell, cf. (Grammont, 1923).) Note that for a cor-rect interpretation and translation, a much ®nerclassi®cation than the one between M3D and M3Tshould distinguish between lexical categories andtheir possible syntax/dialogue speci®c roles.

Syntactically ambiguous boundaries M3A(`A'mbiguous) cannot be determined only basedon syntactic criteria. Often there are two or morealternative word boundaries where the syntactic

boundary could be placed. It is thus the job ofprosody to disambiguate between two alternativereadings. M3A and M3D labels are mapped ontothe main class MU (`unde®ned'), all other labelsmentioned so far are mapped onto the main classM3 (`strong boundary').

The labels M2I and M1I denote (`I'nternal)syntactic constituent boundaries within a sentence(typically NPs or PPs) and are mapped onto themain class M0, together with the default class M0I,which implicitly is labelled at each word boundary(except for turn±®nal ones) where none of theabove labels is placed. An M1I constituent boun-dary is in the vicinity of the beginning or the end ofa clause; it is normally not marked prosodicallybecause of rhythmic constraints. An M2I constit-uent boundary is inside a clause or phrase, not inthe vicinity of beginning or end of the turn; it israther often marked prosodically, again because ofrhythmic constraints. In the experiments con-ducted so far, we distinguish only between thethree main classes M3, MU and M0 given in Ta-ble 1; these are for the time being most relevant forthe linguistic analysis in VM. Besides, M3 and M0are `robust' in the sense that their assignment isless equivocal and thus less prone to di�erentsyntactic interpretations or misconceptions bydi�erent labellers than it might be the case for theother, more detailed labels, cf. Section 8. We be-lieve, however, that these detailed labels might berelevant in the future. A more speci®c account ofall labels can be found in (Batliner et al., 1996b).

With the labelling schemes described so far,di�erent sub-corpora were labelled whose size isgiven in Table 2. The following corpora are usedin this paper.· TEST is usually used for the test of classi®ers, cf.

below. It was spoken by six di�erent speakers

Table 2

The di�erent subsets used

# Dialogues # Turns Minutes # Word

tokens

TEST 3 64 11 1513

B-TRAIN 30 797 96 13 145

M-TRAIN 293 6214 869 132 452

S-TRAIN 21 583 66 8222

A. Batliner et al. / Speech Communication 25 (1998) 193±222 201

Page 10: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

(three male, three female). In all other corporaabout one third of the speakers are female.

· B-TRAIN contains all turns annotated by theVM partner University of Braunschweig withprosodic labels except the ones contained inTEST.

· M-TRAIN are all turns labelled with the M la-bels except the ones in TEST and B-TRAINfor which M labels are available as well.

· S-TRAIN is the subset of B-TRAIN for whichsyntactic labels were created by our colleaguesfrom IBM.

· D-TRAIN is the sub-corpus for which D labelsare available; this corpus is constantly growing,cf. Section 8.

The M labels of TEST were checked several timesby the labeller; the labels of all other corpora werenot checked thoroughly. To give an impression ofthe e�ort needed: The S labelling was done by onelinguist at IBM in about two months, the M la-belling by the ®rst author in about four months,i.e., for the M labels, the e�ort is reduced almost bythe factor 10 (twice the amount of time for a ma-terial that is 17 times larger). Note, however, thatthese ®gures are only rough estimates. We willcome back to this topic in Section 8 below.

5. Correspondence between Ms and Ss, Ds and Bs

We will generally refer to S-TRAIN if we dis-cuss correspondences between labels because forthis sub-corpus, all four types of boundaries areavailable: B labels that denote prosodic bound-aries as perceived by human listeners, S labels thatdenote (detailed) syntactic boundaries, D labelsthat denote dialogue act boundaries, and M labelsthat denote rough syntactic±prosodic boundaries.

Tables 3±11 give the ®gures of correspondence.In these tables, the second column shows the fre-quencies of the labels given in the ®rst column. Allother numbers show the percentage of the labels inthe ®rst column coinciding with the labels in the®rst row. For example, the number 84.3 in thethird column, second row of Table 3 means that84.3% of the word boundaries labelled with M3 arealso labelled with S3+. Those numbers, wherefrom the de®nition of the labels a high corre-

spondence could have been expected a priori, aregiven in bold face: We expect high correspon-dences between the boundaries M3, B3, S3+ andD3, and also high correspondences between thenonboundaries M0, B0, S3) and D0. Note thatturn ®nal word boundaries are not considered inthe tables, because these are in all cases labelledwith S3, M3 and D3, and in most cases with B3. Asan almost total correspondence between the labelsat these turn ®nal positions is obvious, their in-clusion would yield very high but not realisticcorrespondences.

There are at least three di�erent factors that canbe responsible for missing correspondences be-tween the di�erent label types.· Labelling errors: Simple labelling errors occur

once in a while, in particular if the labellinghas to be done rather fast. Such errors are typi-cally omitting a label or shifting its position oneword to the left or to the right of the correct po-sition. To give an exact ®gure for these errors isnot possible because this would imply a very ex-act and time consuming check of the labels ±which is exactly what we want to avoid. Such er-rors should be randomly distributed and thus noproblem for statistical classi®ers if they do notoccur very often.

· Systematic factors: Either the clustering is toorough and we had to use a more detailedscheme, or there is no systematic relationshipbetween two types of labels.

· Interlabeller di�erences: These exist of coursenot only for labellers using the same schemebut for labellers who use di�erent schemes; theyare either randomly distributed or caused bysystematic di�erences in analyzing the structureof the turn. While within the same scheme, inter-labeller consistency can be investigated, cf. Sec-tion 8, across schemes, this is normally notpossible.

5.1. A comparison of M and S labels

Of primary interest is the degree of correspon-dence between the M and the S labels, because thelatter were based on a thorough syntactic analysis(deep linguistic analysis with syntactic categoriza-tion of left and right context for each label) while

202 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 11: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

for the M labels, we used a rough scheme andmainly took into account the left context.

Tables 3±5 show that there is a very high cor-respondence between M0 and S3): 96.6% of theM0 correspond to S3) and 98.2% of the S3) toM0. 84.4% of the M3 are also labelled as S3+. Only67.9% of the S3+ labels correspond to M3; most ofthe remaining S3+ (26.2%) are labelled as MU. Acloser look at the subclasses of the M labels showsthat this is not due to labelling errors which can befound in both annotations, but that it has sys-tematic reasons resulting from di�erent syntactic±theoretic assumptions. Mainly responsible for thismismatch is that the majority of the M3D labels, asubclass of MU, is labelled with S3+. Examples 5and 6 show further typical parts of turns, whereS3+ systematically correspond to M0. Neither theone or the other labelling system is wrong, but they

use di�erent syntactic analyses resulting in di�er-ent labels: In Example 5, the time of day expres-sion can be considered to be just an object to theverb (no boundary), or a sort of elliptic subordi-nate clause (boundary); the conjunction in Exam-ple 6 can either be attributed to the second clause(no boundary) or neither to the ®rst or to thesecond clause (boundary).

Example 5. sagen wir lieber M0/S3+ vierzehn Uhrf�unfundzwanzig (let's rather say M0/S3+ two twenty®ve p.m.).

Example 6. aber M0/S3+ Donnerstag vor-mittag. . .w�ar' mir recht (but M0/S3+ Thursday inthe morning. . .would be ®ne)

Only 34.5% of the M3E labels, a subclass of M3,correspond to S3+. This might partly be due to thefact that we took into account pauses denoted inthe transliterations while labelling this class: forthese positions, a pause after the word boundarytriggered the assignment of an M3E label. (Notethat without listening to the turns, we cannot de-cide whether a pause is a `regular' boundarymarker or caused by planning processes; pausesdue to deliberation are, however, very often adja-cent to hesitations which are denoted in thetransliterations. In such cases, the di�erence is sortof neutralized: Did the speaker pause only becauseof the interfering planning process or did boun-dary marking and indication of planning processsimply coincide?) Alternatively, it might be due tothe fact that extraposition is not a phenomenoneverybody agrees on, cf. (Haftka, 1993). In anycase, the M3E labels are surely candidates for apossible rearrangement; this was done in the nextlabelling phase, cf. Section 6. The subclasses M3Sand M3P correspond to S3+ in over 90% of thecases. This meets our expectations, because thesecases should be quite independent from the speci®csyntactic analysis. S3? is de®ned as `syntacticallyambiguous boundary' but at the same time, it isused for boundaries between elements that cannotyet be analyzed syntactically with certainty. M3A isonly used for `contextually' ambiguous bound-aries; it is not used for all kinds of possible syn-tactically ambiguous boundaries but only for those

Table 3

Percentage of M labels corresponding to S labels

Label # S3+ S3? S3)

M3 951 84.3 8.4 7.2

MU 391 79.3 9.2 11.5

M0 6297 1.1 2.3 96.6

Table 4

Percentage of S labels corresponding to M labels

Label # M3 MU M0

S3+ 1181 67.9 26.2 5.8

S3? 259 30.9 13.9 55.2

S3) 6199 1.1 0.7 98.2

Table 5

Percentage of detailed M labels corresponding to S labels

Label # S3+ S3? S3)

M3S 502 94.2 0.6 5.2

M3P 288 93.4 3.8 2.8

M3E 148 34.5 43.2 22.3

M3I 6 50.0 33.3 16.7

M3T 7 85.7 0.0 14.3

M3D 301 90.0 3.6 6.3

M3A 90 43.3 27.8 28.9

M0 6297 1.1 2.3 96.6

A. Batliner et al. / Speech Communication 25 (1998) 193±222 203

Page 12: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

that are not fully impossible in this context; thiscriterion is admittedly vague. Together with thefact that our rather fast labelling procedure didcertainly not reveal all ambiguities these factorsmight explain the rather low correspondence be-tween S3? and MU; cf. as well Section 8.

5.2. The prosodic marking of the M labels

The correspondence between M and B as well asbetween the di�erent M subclasses and the B labelsis given in Tables 6 and 7. The sentence or clauseboundaries M3S are mostly (87.8%) marked with aB3 boundary. This corroborates the conventionalwisdom that there is a high correspondence be-tween syntactic and prosodic boundaries. How-ever, to our knowledge this is the ®rst investigationof a very large spontaneous speech corpus con-cerning this hypothesis. It is thus not the very factbut the amount of correlation that is interesting.The 8% of the M3S which correspond to B2 areoften boundaries between main clause and subor-dinate clause, where the speaker has marked theboundary prosodically only by a slight continua-tion rise. Especially for subordinations, it might beat the discretion of the speaker to what extentprosodic marking is used. The overall speakingrate might play a role as well.

In Example 7, a clause boundary has not beenmarked at all prosodically despite the fact thatthere is no subordinating particle on the syntacticsurface. Nevertheless, from the syntax it is clear (ina left to right analysis already at the word wir) thatthere is such a boundary. The ®rst sentence israther short so that there is no need to separate itprosodically for the purpose of making the listen-ers understanding easier. Many of these B0/M3Scorrespondences occur after short main clausessuch as ich denke (I think) or meinen Sie (do youthink). These constellations will be taken into ac-count in the relabelling, cf. Section 6.

Example 7. <Atmung> ich denke B0/M3S wirsollten das Ganze dann doch auf die n�achste Wocheverschieben (<breathing> I think B0/M3S we shouldmove the whole thing to next week).

Also a high but lower number (75.7%) of theM3P boundaries are marked as B3. This is stillwithin the range of agreements between di�erentpersons labelling the B boundaries (Reyelt, 1995).The lower correspondence of the M3P with respectto the M3S can be explained with the fact that M3Plabels separate elliptic phrases or left dislocations.These are often quite short so that the same ar-gumentation as above for the short main clausesholds here as well.

Some 35.1% of the M3E are not marked at allprosodically. This might on the one hand indicatethat the de®nition and the labelling of M3E shouldbe revised. On the other hand, we assume that forM3E positions, as well as for other M3 subclasses,it is left at the discretion of the speaker whetherthese positions are marked prosodically or not, cf.(de Pijper and Sanderman, 1994).

Two thirds of the M3I boundaries are markedprosodically. The M3D labels coincide with B3, B2or B0, without any clear preference. This could beexpected, because the M3D mark ambiguousboundaries at rather short phrases. On the otherhand, the de®ning criterion of M3T, the presenceof a pause, is responsible for the very high corre-spondence (85.7%) with B3.

At positions marked with M3A, the really am-biguous boundary positions between clauses, ei-ther a strong boundary marking (B3 in 35.5% of

Table 6

Percentage of M labels corresponding to B labels

Label # B3 B2 B9 B0

M3 951 78.7 9.1 0.1 12.1

MU 391 27.1 29.1 0.5 43.2

M0 6297 2.8 4.6 3.7 88.9

Table 7

Percentage of detailed M labels corresponding to B labels

Label # B3 B2 B9 B0

M3S 502 87.8 8.0 0.0 4.2

M3P 288 75.7 10.4 0.3 13.5

M3E 148 53.4 11.5 0.0 35.1

M3I 6 66.7 0.0 0.0 33.3

M3T 7 85.7 0.0 0.0 14.3

M3D 301 24.6 32.9 0.3 42.2

M3A 90 35.5 16.7 1.1 46.7

M0 6297 2.8 4.6 3.7 88.9

204 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 13: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

the cases) or no marking at all (B0, 46.7% of thecases) can be observed, which also meets our ex-pectations.

In accordance with their de®nition, almost allB9 boundaries do not coincide with major syn-tactic boundaries (M3).

5.3. The di�erence between D and M labels

The creation of both M and D labels was ratherrough and fast. Despite this, the numbers in Ta-bles 8±10 are consistent with our expectations:Most of the D3 correspond to M3, and almost all ofthe M0 correspond to D0. Only about half of the M3correspond to D3, that is, a turn segment corre-sponding to a dialogue act often consists of morethan one clause or phrase ± e.g., Example 8 can besegmented into four clauses but only into two dia-logue acts. As for the MU labels, not surprisingly,only 3.3% of the M3D (no syntactic boundary andthus normally no D3 boundary) and 20% of the

M3A (sometimes a syntactic boundary and as such,sometimes a D3 boundary) coincide with a D3boundary. An M3P that coincides with a D3boundary (as in Example 10) will usually be markedprosodically, whereas an M3P that does not coin-cide with D3 (as in Example 9) will be usually notmarked prosodically at all or at least to a lesserextent than M3P in Example 10. M3P in Guten TagM3P Herr/Frau..., e.g., can be assigned to a newsubclass of M3P that is assumed not to be markedprosodically and thus to the main class M0 thatcorresponds to D0. Obviously, M3E does not marka D3 boundary, cf. the low correspondence of 1.4%.Table 11 shows that 91.4% of the D3 boundariesare strongly marked prosodically, that is, they co-incide with a B3 boundary. This number is evenhigher than that for the M3S boundaries. Thiscon®rms the results of other studies which showedthat boundaries at discourse units are very stronglymarked by prosodic means, cf. (Cahn, 1992; Swertset al., 1992; Hirschberg and Grosz, 1994).

Example 8. ich mu� sagen M3S mir w�ar's dannlieber M3S wenn wir die ganze Sache auf Maiverschieben D3/M3S <Pause> geht es da bei Ihnenauch (I would say M3S I then would prefer M3S if wemoved the whole thing onto May D3/M3S <pause>does this suit you as well).

Example 9. Guten Tag M3P Herr Meier . . . (HelloM3P Mr. Meier . . .).

Example 10. Guten Tag D3/M3P ich h�att 'ne Frage. . . (Hello D3/M3P I`ve got a question . . .).

6. Relabelling of the M labels

The ®rst version of the M labelling scheme wasintended for use in the VM prototype which wasdue in October 1996. The labels were used for the

Table 8

Percentage of M labels corresponding to D labels

Label # D3 D0

M3 951 51.5 48.5

MU 391 7.2 92.8

M0 6297 0.2 99.8

Table 9

Percentage of D labels corresponding to M labels

Label # M3 MU M0

D3 533 91.9 5.2 2.8

D0 7106 6.5 5.1 88.4

Table 10

Percentage of detailed M labels corresponding to D labels

Label # D3 D0

M3S 502 75.5 24.5

M3P 288 37.1 62.8

M3E 148 1.4 98.6

M3I 6 0.0 100.0

M3T 7 28.6 71.4

M3D 301 3.3 96.7

M3A 90 20.0 80.0

M0 6297 0.2 99.8

Table 11

Percentage of D labels corresponding to B labels

Label # B3 B2 B9 B0

D3 533 91.4 6.2 1.5 0.9

D0 7106 7.7 6.4 3.2 82.7

A. Batliner et al. / Speech Communication 25 (1998) 193±222 205

Page 14: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

training of automatic classi®ers of boundaries asdescribed in Section 9. The time schedule wasrather tight, and elaboration and evaluation of thescheme were therefore not conducted for the ®rstversion. Even though the M3 boundaries turnedout to be very successful, cf. Section 9, we revisedand extended the labelling scheme for the follow-ing reasons.· Correspondences to the other boundary types

(prosodic±perceptual boundaries and dialogueact boundaries) was good, but in some cases,suboptimal, cf. the discussion in Section 5.

· It turned out that for the higher linguistic mod-ules in VM, a subclassi®cation into more speci®cclasses is desirable.

· To reduce e�ort, the labelling of prosodic phras-es (constituents) inside a sentence was not con-ducted for the ®rst version. These boundariesare, however, very important for the modellingof accent positions, cf. Section 11.

· Di�erent M3 boundaries were not labelled at theend of turn and adjacent to agrammaticalboundaries. Although this is not necessary forour present purposes, such a labelling makes ad-ditional information available that might be use-ful in the future.

The new labels are listed in Tables 12 and 13,where the mapping onto the old labels, the contextwith one example for each label, the label itself,and the main class it is attached to are given. Thenames of the new labels consist of three characterseach with the encoding given in Table 14. Typeand hierarchy describe syntactic phenomena; withstrength, we so to speak code our working hy-pothesis that prosodic (and thereby, to some ex-tent, syntactic) marking of boundaries is scaledalong these lines. Most of the revisions concern asubspeci®cation of the former M labels that mostof the time could not take into account hierarchi-cal dependencies and left/right relationship. Ofcourse, it will not be possible to model and train allnew labels, especially if their frequency is low, butwe will have ample possibilities to try and clusterthese labels in di�erent ways in order to get theoptimal main classes for di�erent demands: Thedialogue module will most probably need a clus-tering that di�ers from that most useful for thesyntax module, cf. Section 7.

The extensional de®nition of most of the labelsdid not change, but they will be subspeci®ed. In afew cases, we decided in favour of a more plausibledenotation of type with the ®rst character, cf.Tables 12 and 13. The labelling is again introducedin the word chain immediately after the last wordof the respective unit at the word boundary andbefore any `nonverbal' such as <�ah>, <pause>,<laughter>, etc. Turn-initially, no label is given.In contrast to the former strategy, turn-®nally, theleft context (last sentence/phrase etc.) is labelledwith the appropriate label as well. By that, we willbe able to model turn-®nal syntactic units, if thiswill be of any use for, e.g., dialogue act classi®-cation or dialogue act boundary classi®cation. Itwould, however, be no problem to map these la-bels onto M3S or `end of turn', if necessary. Up tonow, we followed the strategy only to label with Madjacent to irregular boundaries if a sentence is notcompleted and another syntactic constructionstarts anew. At irregular boundaries, in contrast tothis former strategy, a label is always given, ifpossible. If this information is not of any use, wecan map these M labels that are adjacent to ir-regular boundaries onto `null'; we can, however,have a closer look at these combinations as well.

Generally, we cannot subspecify beyond thelevels encoded by our labels, i.e., we cannot specifytwo levels of subordination; other possible sub-speci®cations are merged. For sentences (up tonow M3S), we denote subordination, coordina-tion, left/right relationship and prosodic marking.With these distinctions, we cannot denote allconstellations. We only have one level for subor-dination, i.e., with SC2, we cannot denote whichone of these clauses is subordinated w.r.t. the otherone. After free phrases (elliptic sentences) followedby a subordinate clause, SM2 or SM1 is labelled aswell: Wunderbar SM1 da� Sie da Zeit haben (GreatSM1 that you'll have time by then); this constella-tion is very rare and because of that, it makes notmuch sense to model it in a special way. In anal-ogy, phrasal coordination at subordinate clauses islabelled with SC3. For free phrases (up to nowM3P), besides the `main' label PM3, we annotatewith PM1 free phrases that are prosodically inte-grated with the following adjacent sequence. Se-quences inside free phrases are analogous to the

206 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 15: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

Table 12

Examples with context for new boundary labels and their main classes, part I (with reference to the old boundary labels)

Main class Label Context (between/at) with example

sentences, up to now: M3SM3 SM3 Main clause and main clause:

vielleicht stelle ich mich kurz vorher noch vor SM3 mein Name ist Lerch

perhaps I should ®rst introduce myself SM3 my name is LerchM3 SM2 Main clause and subordinate clause:

ich wei� nicht SM2 ob es auch bei Ihnen dann pa�t

I don't know SM2 whether it will suit you or notM3 SS2 Subordinate clause and main clause:

da ich aus Kiel komme SS2 wird hier ja relativ wenig gefeiert

because I am from Kiel SS2 we don't celebrate that oftenM3 SM1 Main clause and subordinate clause, prosodically integrated:

ich denke SM1 das k�onnen wir so machen

I think SM1 we can do it that wayM3 SS1 Subordinate clause and main clause, prosodically integrated:

das sieht sowieso ziemlich schlecht aus SS1 w�urd' ich sagen

anyway, that looks rather bad SS1 I'd sayM3 SC3 Coordination of main clauses and of subordinate clauses:

dann nehmen wir den Montag SC3 und tre�en uns dann morgens

then we'll take Monday SC3 and meet in the morningM3 SC2 Subordinate clause and subordinate clause:

da ich froh w�are SC2 diese Sache m�oglichst schnell hinter mich zu bringen

because I would be glad SC2 to get it over as soon as possible

free Phrases, up to now: M3PM3 PM3 free Phrase, stand alone:

sehr gerne PM3 ich liebe Ihre Stadt

with pleasure PM3 I love your townM2 PC2 sequence in free Phrases:

um neun Uhr PC2 in 'nem Hotel PC2 in Stockholm

at nine o'clock PC2 in a hotel PC2 in StockholmM3 PM1 free Phrase, prosodically integrated, no dialogue act boundary:

guten Tag PM1 Herr Meier

hello PM1 Mr. Meier

Left dislocations, up to now: M3PM3 LS2 Left dislocation:

am f�unften LS2 da hab' ich etwas

on the ®fth LS2 I am busyM2 LC2 sequence of Left dislocations:

aber zum Mittagessen LC2 am neunzehnten LS2 wenn Sie vielleicht da Zeit h�atten

but for lunch LC2 on the 19th LS2 if you've got time then

Right dislocations, up to now: M3EM3 RS2 Right dislocation:

wie w�urde es Ihnen denn am Dienstag passen RS2 den achten Juni

will Tuesday suit you RS2 the eighth of JuneM2 RC2 sequence of Right dislocations:

es w�are bei mir dann m�oglich RS2 ab Freitag RC2 dem f�unfundzwanzigsten

it would be possible for me RS2 from Friday onwards RC2 the 25thM2 RC1 Right `dislocation' at open verbal brace:

tre�en wir uns RC1 um eins

let's meet RC1 at one o'clock

A. Batliner et al. / Speech Communication 25 (1998) 193±222 207

Page 16: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

constituent boundaries IC2 and labelled with PC2.Left dislocations (up to now M3P) are constituentsto the left of the matrix sentence, typically but notnecessarily with some sort of anaphoric referencein the matrix sentence. Sequences inside left dis-locations are also analogous to the constituentboundaries IC2 and labelled with LC2. Right dis-locations (up to now M3E) are subspeci®ed furtheras well: Any constituent boundary appearing after

RS2 has to be labelled with RC2 instead of IC2because once a right dislocation is opened, allfollowing constituents become additions to thedislocation. For right dislocations at open verbalbrace, a new label RC1 is introduced. Embeddedsentences (up to now M3I) are all sentences em-bedded in a matrix sentence that continues afterthe embedded sentence. In contrast to the formerstrategy, even very short parentheses (glaub ich)

Table 13

Examples with context for new boundary labels and their main classes, part II (with reference to the old boundary labels)

Main class Label Context (between/at) with example

Embedded strings, up to now: M31

M3 EM3 embedded sentence/phrase:

eventuell EM3 wenn Sie noch mehr Zeit haben EM3 hAtmungi 'n bi�chen l�anger

possibly EM3 if you've got even more time EM3 ábreathingñ a bit longer

Free particles, up to now: M3TM3 FM3 pre-/postsentential particle, with ápauseñ etc.:

gut FM3 hPausei okay

®ne FM3 ápauseñ okay

Discourse particles, up to now: M3DMU DS3 pre-/postsentential particle, ambisentential:

dritter Februar DS3 ja DS3 ab vierzehn Uhr h�att' ich da Zeit

third February DS3 isn't it/well DS3 I have time then after two p.m.MU DS1 pre-/postsentential particle, no ápauseñ etc.:

also DS1 dienstags pa�t es Ihnen DS1 ja M3S hAtmungithen DS1 Tuesday will suit you DS1 won't it / after all ábreathingñ

Ambiguous boundaries, up to now: M3AMU AM3 between sentences, Ambiguous:

w�urde ich vorschlagen AM3 vielleicht AM3 im Dezember AM3 noch mal AM3 dann

I'd propose AM3 possibly AM3 in December AM3 again AM3 thenMU AM2 between free phrases, Ambiguous:

sicherlich AM2 sehr gerne

sure/-ely AM2 with pleasureMU AC1 between constituents, Ambiguous:

wollen wir dann AC1 noch AC1 'n Tre�en machen

should we then (still) have a meeting / should we then have another meeting

Constituents, up to now: M2IM2 IC2 between Constituents:

ich wollte gerne mit Ihnen IC2 ein Fr�uhst�uck vereinbaren

I'd like to arrange IC2 a breakfast with youM2 IC1 asyndetic listing of Constituents (not labelled up to now):

wir haben bis jetzt eins IC1 zwei IC1 drei IC1 vier IC1 f�unf IC1 sechs Termine

until now, we've got one IC1 two IC1 three IC1 four IC1 ®ve IC1 six appointments

Default, no boundary, up to now: M0

M0 IC0 every other word boundary:

da bin ich ganz Ihrer Meinung

I fully agree with you

208 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 17: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

are annotated with E3; if necessary, these shortparentheses (less or equal two words) can be re-labelled automatically. Free/discourse particles (upto now M3T/M3D) are de®ned a bit di�erentlythan before: In contrast to the former strategy, weuse PM3, if such a particle unequivocally can beclassi®ed as a con®rmation, as in A: Pa�t Ihnendrei Uhr SM3 ± B: Ja PM3 Dann zum zweitenTermin . . . (A: Is three o'clock ok with you? SM3 B:Yes. PM3 And now the second date . . .). Muchmore common is, however, that the particle isfollowed by a sort of equivalent con®rmation, e.g.:B: Ja DS1/FM3 pa�t ausgezeichnet SM3 Dann zumzweiten Termin . . . (B: Well/Yes DS1/FM3 that'sfully ok with me. SM3 And now the second date . . .).Here, we simply cannot tell apart the two func-tions `con®rmation' or `discourse particle'. This is,however, not necessary because in these cases, thefunctional load on this particle is rather low. Itmight thus be the most appropriate solution not todecide in favour of the one or the other readingbut to treat this distinction as underspeci®ed orneutralized. This means for the higher linguisticmodules that, in constellations like this, theseparticles might simply be treated as discourseparticles without any pronounced semantic func-tion; i.e., in the short run, they can be neglected.

There are three levels for Ambiguous boundaries(up to now M3A): AM3 and AM2 are ambiguousboundaries between clauses and phrases, respec-tively, and are discussed in more detail in (Batlineret al., 1996b); DS3 labels denote ambiguous sen-

tence/phrase boundaries at pre-/postsententialparticles. Particles that are very often surroundedby the new AC1 label are, e.g., auch (also/as well),doch (but/however/yet), noch (still). There are al-ways at least two syntactic and semantic readingsfor clauses containing such particles; these read-ings can be described with di�erent syntacticbracketing, but prosodically, accent structure ismore important, i.e., whether the particle is ac-cented or not. Consider the sentence: Dann bra-uchen wir noch einen Termin. If noch is accented, ithas to be translated with Then we need anotherdate, if Termin is accented and noch not, thetranslation is: Then we still need a date.

There are two types of constituent boundaries(up to now not labelled) between constituents (IC2)and between words/noun phrases in the case ofasyndetic listing (IC1), i.e., a listing without anyconjunction (und, etc.). The decision whether toput in an IC2 label or not is very often di�cult tomake. The criteria are basically prosodic: First, theboundary should be really inside the clause, i.e. farfrom left and right edges, and second, the con-stituent that precedes the boundary is `prosodic-ally heavy', i.e. normally a noun phrase that can bethe carrier of a primary accent. Primarily, theseboundaries will be used to trigger accent assign-ment, cf. Section 11 below and (Kompe, 1997).The criteria are discussed in more detail in (Batli-ner, 1997).

Basically, the old main classes are still valid; inaddition, we introduce a fourth main class, M2, forthose labels which denote boundaries at constitu-ents (typically noun phrases) within larger syn-tactic units: PC2, LC2, RC2, RC1, IC2 and IC1.

The relabelling was conducted in the followingsteps: First, two linguists, C and N, discussedscheme and necessary editions with the ®rst authorwho has annotated with the ®rst version of thelabelling scheme, cf. Table 1. Then C relabelledand thereby corrected or rede®ned the old labels,and afterwards, N checked and, if necessary, cor-rected C's labels. By that, we had two independentruns where errors could be detected and the la-belling could be systematizised. A new subset, VM-CD 7 (1740 turns), was labelled independently byC and N and serves as database for the computa-tion of the interlabeller consistency, cf. Section 8.

Table 14

Encoding of syntactic type, syntactic hierarchy, and prosodic±

syntactic strength of the new M labels

Label Description

Type Sentence

free Phrase

Left dislocation

Right dislocation

Embedded sentence/phrase

Free particle

Discourse particle

Ambiguous boundary

Internal constituent boundary

Hierarchy Main, Subordinate, Coordinate

Strength Prosodic±syntactic strength: strong (3), intermedi-

ate (2), weak (1), very weak (0)

A. Batliner et al. / Speech Communication 25 (1998) 193±222 209

Page 18: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

7. Correspondences of the new M labels with B and

D labels

As far as the main classes are concerned, we didnot expect that the new M labels (M_new) wouldshow more correspondence with B and D than theold ones (M_old) but that our subcategorizationswould lead to a better separation of subclasses.This is shown in Figs. 1±6 where for B-TRAIN,correspondences of M_new with B and with D areshown; note that due to technical reasons, only 737out of the 797 turns of B-TRAIN could be pro-cessed for these correspondences. The mapping ofM_old onto M_new is given in Tables 12 and 13. Ifinside M_old, the distribution of the new subclas-ses was equal, there should not be much of a dif-ference between the new subclasses. For instance,the old label M3S for `sentences' is now subspeci-

®ed into seven new labels (SM3 to SC2), which areexplained in Table 12. Their correlation with the Blabels is shown in Fig. 3: There are considerabledi�erences, ranging from more than 90% correla-tion with B3/B2 for SM3 to 0% correlation for SS1.The di�erences shown in Figs. 3 and 4 are almostalways congruent with our working hypothesisthat the coding of strength (third character of thelabels) and the coding of hierarchy (second char-acter of the labels) covaries with the percentage ofitems marked by prosodic±perceptual boundaries:Globally, from left to right, and within subgroupsfrom left to right as well, correspondences with B3/B2 decrease and vice versa, those with B0 increase.This overall pattern shows up more clearly in ®g-ures than in tables; we decided therefore in favorof this presentation. This result is consistent with(de Pijper and Sanderman, 1994) who found thatmajor syntactic boundaries tend to be markedprosodically to a greater extent than minor syn-tactic boundaries. A similar behaviour of thesublabels of M_new can be seen for the corre-spondence with D in Figs. 5 and 6. We can thusconclude that M_new meets prosodic regularitiesbetter than M_old.

If we compare the correspondences of M_newwith B0 and D0, it can be seen that for some fewlabels, there is a strong positive correlation be-tween B0 and D0, in particular for SM3 and forIC0, but most of the other labels are somewhere `inbetween'. That means that for a classi®cation ofD3 based on M_new, it would be suboptimal to usethese labels only within a Multi-layer Perceptron(MLP) or only within a language model (LM). AsD3 boundaries tend to be marked prosodically to agreat extent, it might be better to use a two stageprocedure instead: First, to classify these Mboundaries with an MLP into prosodically markedor not, and then to use only the marked bound-aries in a classi®cation of D3. Our expectation isthat D3 is always marked prosodically whereas D0might be marked prosodically, because not everyphrase boundary is a dialog act boundary. This issomehow con®rmed with the ®gures, because whenan M_new label highly correlates with B0 it alsostrongly correlates with D0; when a M_new labeldoes not correlate with B0 it might still correlatewith D0.

Fig. 1. Correspondences: M labels with B labels, main classes.

Fig. 2. Correspondences: M labels with D labels, main classes.

210 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 19: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

Fig. 3. Correspondences: M labels with B labels, sentences, free phrases, left dislocations, right dislocations.

Fig. 4. Correspondences: M labels with B labels, embedding, free/discourse particles, ambiguous boundaries, constituents, no

boundaries.

Fig. 5. Correspondences: M labels with D labels, sentences, free phrases, left dislocations, right dislocations.

A. Batliner et al. / Speech Communication 25 (1998) 193±222 211

Page 20: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

8. Further evaluation of the labelling scheme:

reliability, e�ort

There exist di�erent evaluation measures forlabelling schemes. Common practice in basic re-search is to check the inter- or intralabeller con-sistency, cf. (Pitrelli et al., 1994; Grice et al., 1996;Reyelt, 1998), common practice in applied re-search is the quality of automatic classi®cation, cf.Section 9: If an automatic classi®er yields highrecognition rates, this is great evidence for a con-sistent labelling. A common problem for all thesemeasures is the fact that there is no `objective'reference everybody agrees on as is the case forphoneme or word recognition: All these annota-tions are based on partly di�erent and competingtheoretical assumptions. The ultimate proof forsuch a scheme is, in our opinion, therefore only asort of external validity, namely its usefulness forthe intended purpose, in our case, the usefulness ofthe automatically predicted M labels for the higherlinguistic modules in the VM system, cf. Sec-tion 10. In the present section, we will show howthe M labels meet the criteria of di�erent measuresof reliability. We will distinguish internal and ex-ternal reliability ± internal or external to the spe-ci®c annotation scheme, in our case, the M labels.Internal reliability can be measured intra labeller(same labelling scheme, same data, same labeller,di�erent time) and inter labellers (same labellingscheme, same data, di�erent labellers). Externalreliability can be measured intra labelling scheme

(manual annotation versus automatic classi®ca-tion, same labelling scheme, same sort of data) orinter labelling schemes (correspondence of di�er-ent labelling schemes, automatic classi®cation ofone scheme based on a sample trained with an-other scheme). External reliability intra will bedealt with in Section 9, external reliability inter hasbeen dealt with above in Sections 5 and 7, as far asthe correlation between the di�erent schemes isconcerned, and will be dealt with in Section 9, asfar as the classi®cation of the D labels based on theM labels is concerned.

As a metric for the internal reliability of ourlabelling scheme, we compute either correspon-dences (for single classes) or, as overall metric, thevalue j because it has been computed for otherGerman data and for prosodic±perceptual labelsas well and because it is the most meaningfulmetric for such comparisons, cf. (Reyelt, 1998;Maier, 1997). j cannot be used for the externalreliability or for the correspondence between oldand all four new M labels because there is no one-to-one relationship between the labels. For thecomparison of old versus new labelling scheme, thenew main labels can serve as reference, and the oldones, so to speak, as more or less correct `ap-proximation'. Table 15 thus displays frequency foreach class and percent `correct correspondence' forM_new. M2 has not been labelled in the ®rst stage,so 82% correspondence with M0_old meets ourexpectation. The correspondence of M0_new withM0_old is almost perfect, the one of M3_new with

Fig. 6. Correspondences: M labels with D labels, embedding, free/discourse particles, ambiguous boundaries, constituents, no

boundaries.

212 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 21: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

M3_old is 85%; most of the 14% correspondencewith M0_old are certainly due to the modi®ed la-belling principles because in the ®rst stage, M3 wasnot labelled at the end of turns or at irregularboundaries. For the three classes M3, MU and M0,j is 0.88. Only the correlation of MU_new withMU_old (53%) is rather low; we will come back tothis point in the discussion of Table 16, where wedisplay interlabeller correspondence of the label-lers C and N for the new labels. In this table, we donot display percent values but the frequencies ofcorrespondences between the two labellers.Meaningful percent values are mean percent cor-respondence for the hits: M3: 90%, M2: 67%, MU:65%, M0: 95%. Confusions of M3 with M0 occurvery seldom, i.e., `false alarms' of M3 as M0: 2%,`false alarms' of M0 as M3: 0%. We see that there isa very good agreement for the most importantclasses M3 and M0, while it is not that good for M2and MU. This is due to the fact that these labelscannot be de®ned as strictly as the two other ones:The labelling of M2 is triggered by the imaginationof the labellers w.r.t. the possibility of a prosodicmarking of the respective boundaries. As for theMU labels, they can be labelled rather easily at pre/postsentential particles with DS1 (83%) but ratherless easily for the `real' ambiguous boundaries

AM3 and AM2 (27% and 20%, respectively). Thiscan be explained by the fact that for these labels,the labellers have to decide quickly whether thereis a syntactically ambiguous boundary and, at thesame time, whether it is `semantically/pragmati-cally reasonably ambiguous' as well ± a task thatobviously does not lead to a good interlabelleragreement. Each labeller might be biased towardsMU labels for special constructions, but often, thedecision whether to use MU or not might be ran-dom as well. In such a case, a random assignmentof ambiguous boundaries to one of the four mainclasses should not disturb the behaviour of a sta-tistical classi®er trained on these labels too much.

j for the main classes is 0.79, for all 25 classes, itis 0.74. (Reyelt, 1998) obtained j values between0.5 and 0.8 for prosodic±perceptual boundary la-bels, depending on the experience of the labellersand on the speci®c tasks.

To check intralabeller consistency, labeller Crelabelled eight dialogues from VM-CD 7 afterseveral months. For the main classes, mean per-cent correspondence for the hits are M3: 96%, M2:78%, MU: 78%, M0: 96%. j is 0.86 for the mainclasses and 0.84 for all 25 classes. Maier (1997)reports a j value of 0.93 for the segmentation ofdialogue acts inter labellers and of 0.94 intra la-beller. As is the case for regression coe�cients,there is no threshold value that distinguishes`good' from `bad' values but j values around 80%and above can be considered to represent a verygood agreement.

The ToBI correspondence for boundaries (13labellers, 4 categories) in (Grice et al., 1996) is87%; this value could be compared with our meanpercent values given above; all these coe�cientscannot be compared in a strict sense, however,because number of labellers and/or categories dif-fer. Moreover, the importance of the categories isnot equal: as for the M labels, not overall corre-spondence might be the appropriate measure forthe quality of the labelling scheme but only `falsealarms' for M3 if, e.g., only these are problematicfor the processing in the higher linguistic modules.

It might thus be safe to conclude that interla-beller correspondence for the M labels is goodenough, and that means, that ± with a reasonableamount of training ± the labelling can be con-

Table 15

Correspondence of new (ordinate) with old (abscissa) M labels

in percent, four main classes, with number of occurrence

Label M3 MU M0

# 19135 4706 99719

M3 20214 85 1 14

M2 6415 17 0 83

MU 8421 9 53 38

M0 88510 0 0 100

Table 16

Interlabeller correspondence for new M labels, VM±CD 7, four

main classes, with number of occurrence. j: 0.79

Label M3 M2 MU M0

# 6913 2607 3404 29865

M3 6749 6146 299 223 81

M2 2273 84 1635 56 498

MU 4369 532 259 2500 1078

M0 29398 151 414 625 28208

A. Batliner et al. / Speech Communication 25 (1998) 193±222 213

Page 22: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

ducted by di�erent persons with a fairly goodknowledge of German syntax. For many applica-tions, only some main M boundaries might berelevant which presumably can be labelled by lessexperienced labellers as well.

E�ort needed is, in practice, at least as impor-tant as reliability, and an explicit criterion in ap-plied research. Here, procedures that are coarsebut fast can and must be preferred because a cer-tain time-out is always a delimiter: the end of aproject, the ®nancial means available, the maxi-mum processing time allowed for an automaticspeech processing system. In basic research, e�ortis not that often mentioned but is, in fact, implic-itly equally important. Note, e.g., that practicallyall perception experiments violate the fundamentalclaim of inferential statistics that subjects have tobe chosen random out of the population (Gutt-man, 1977). This is done simply because otherwise,e�ort needed for the selection of subjects wouldparalyse the experiments themselves.

We claim that the annotation of the M labelsneeds considerably less e�ort than any annotationwhere labellers have to listen to the speech mate-rial and particularly, less e�ort than the labellingof prosodic±perceptual boundaries. In order to getan impression of the time needed, labeller C an-notated three di�erent subsets of VM-CD 12 with4 dialogues each, keeping track of the time needed,with the 8 old M labels (ca. 5 times real time), withthe 25 new M labels (ca. 8 times real time), andwith only the 3 old main classes M3, MU and M0(ca. 5 times real time), respectively. Number oflabels and time needed are obviously not corre-lated linearly with each other. Even if the anno-tation of only 3 main classes is intended, it seemsthat the labeller conducts a more thorough syn-tactic analysis in which he uses more than these 3classes. Such a lower limit might correspond to thee�ort needed for the annotation of the 8 old Mclasses. Therefore, roughly the same time is neededfor 8 and for 3 classes; only little more time isneeded for a further splitting up into 25 detailedclasses. Note that while discussing the new label-ling scheme, these 25 classes seemed to be somesort of upper limit for a `rough' labelling on whichwe could agree. It might be that a more detailedlabelling scheme would mean to `cross the Rubi-

con' towards a deep syntactic analysis for whichmuch more time is needed.

The time needed for the annotation of a corre-sponding subsample, two dialogues of VM-CD 1,with ToBI (prosodic±perceptual annotation) is 23times real time, and only with boundaries andaccents, it is 13 times real time; the annotation ofonly the boundary labels takes not much less time.Note that these were `easy' dialogues, i.e., on theaverage, it takes more time because the labeller hasto listen to di�cult turns much more often.Moreover, the only tools for the labelling of Mneeded are either paper and pencil and/or anyASCII-editor on any platform. For the labelling ofprosodic±perceptual boundaries in ToBI, severalpreprocessing stages are necessary: word segmen-tation (by hand or automatically computed withcorrection by hand taking more than 10 times realtime) and computation of F0 curve. First, anno-tation tools have to be developed (or purchased),and there is always a workstation needed for eachlabeller. Besides, the training for the labellers ofToBI takes much more time than that needed forthe labellers of the M labels. All these ®gures anddetails are provided by Reyelt (1997).

Up to the end of 1997, 700 German, 160 En-glish and 100 Japanese dialogues were annotatedat the DFKI, Saarbr�ucken, with dialogue act in-formation including D3 boundaries without lis-tening to the speech data. This took approximately10 hours a week for two years. These ®gures areprovided by Reithinger (1997). Note that for thistask, the annotation of the dialogue acts takesmuch more time than that of the dialogue actboundaries which are a sort of `by-product'.

9. Automatic classi®cation of boundaries

First, we want to stress again that our ultimategoal is to prove the usefulness of our labellingscheme for syntactic and other higher processingas dialogue analysis in the higher modules of theVM project. A necessary but not necessarily suf-®cient precondition for such a successful incorpo-ration in a whole system is a good automaticclassi®cation of boundaries with the help of the Mlabels: If classi®cations were good but if the classes

214 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 23: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

were irrelevant for syntactic processing the M la-bels were of no use. On the other hand, if auto-matic classi®cation was bad it would be veryunlikely that the labels can be of any use. Wetherefore repeated on Ms and Ds some of theclassi®cation experiments we had done before onBs and Ss, cf. (Kieûling et al., 1994b; Kompe et al.,1994, 1995; Batliner et al., 1996a; Kieûling, 1997;Kompe, 1997). According to the comparison of Msand Ds with Bs, cf. above, we can expect theseboundary types to be often marked prosodically.Therefore, MLPs were used for the acoustic±pro-sodic classi®cation of M0 versus M3 and of D0versus D3 boundaries, respectively. All experi-ments described in this section are based on theunrevised old main classes M3, MU and M0(M_old). The revised classes will be used in thesecond stage of the VM project in 1998±2000.Note that the mapping of old onto new labels isvery good, cf. Section 8, and, if changes weremade, transparent. We therefore do not expectthat overall classi®cation results with the new la-bels will be very di�erent but that the subspeci®-cation of the new labels makes even betterclassi®cation results likely for some speci®c tasks.

We distinguish di�erent categories of prosodicfeature levels. The acoustic prosodic features aresignal-based features that usually span over speechunits that are larger than phonemes (syllables,words, turns, etc.). Normally, they are extractedfrom the speci®c speech signal interval that be-longs to the prosodic unit, describing its speci®cprosodic properties, and can be fed directly into aclassi®er. Within this group we can further dis-tinguish as follows.· Basic prosodic features are extracted from the

pure speech signal without any explicit segmen-tation into prosodic units. Examples are theframe-based extraction of fundamental frequen-cy (F0) and energy. Usually the basic prosodicfeatures cannot be directly used for a prosodicclassi®cation.

· Structured prosodic features are computed over alarger speech unit (syllable, syllable nucleus,word, turn, etc.) partly from the prosodic basicfeatures, e.g., features describing the shape ofF0 or energy contour, partly based on segmentalinformation that can be taken from the output

of a word recognizer, e.g., features describingdurational properties of phonemes, syllable nu-clei, syllables, pauses.

On the other hand, prosodic information is highlyinterrelated with `higher' linguistic information,i.e., the underlying linguistic information stronglyin¯uences the actual realization and relevance ofthe measured acoustic prosodic features. In thissense, we speak of linguistic prosodic features thatcan be introduced from other knowledge sources,as lexicon, syntax or semantics; usually they haveeither an intensifying or an inhibitory e�ect on theacoustic prosodic features. The linguistic prosodicfeatures can be further divided into two categories.· Lexical prosodic features are categorical features

that can be extracted from a lexicon that con-tains syllable boundaries in the phonetic tran-scription of the words. Examples for thesefeatures are ¯ags marking if a syllable is word-®-nal or not or denoting which syllable carries thelexical word accent. Other possibilities not con-sidered here might be special ¯ags marking, e.g.,content and function words.

· Syntactic/semantic prosodic features encode thesyntactic and/or semantic structure of an utter-ance. They can be obtained from syntax, e.g.,from the syntactic tree, or they can be basedon predictions of possibly important ± and thusaccented ± words from the semantic or the dia-logue module.

All these categories are dealt with in more detail in(Kieûling, 1997). Here, we do not consider syn-tactic/semantic prosodic features; in the following,the cover term prosodic features means mostlystructured prosodic features and some few lexicalprosodic features. We only use the aligned spokenwords thus simulating 100% word recognition ±and by that, simulating the capability of a humanlistener. The time alignment is done by a standardHMM word recognizer. It is still an open question,which prosodic features are the most relevant forthe di�erent classi®cation problems and how thedi�erent features are interrelated; cf. below and(Batliner et al., 1997). MLPs are generally good athandling features that are even highly correlatedwith each other; we therefore try to be as ex-haustive as possible, and leave it to the statisticalclassi®er to ®nd out the relevant features and the

A. Batliner et al. / Speech Communication 25 (1998) 193±222 215

Page 24: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

optimal weighting of them. As many relevantprosodic features as possible are therefore ex-tracted over a prosodic unit (here: the word-®nalsyllable) and composed into a huge feature vectorwhich represents the prosodic properties of thisand of several surrounding units in a speci®ccontext.

In more detail the features used here are asfollows:· For each syllable and word in the speci®c con-

text minimum and maximum of fundamentalfrequency (F0) and their positions on the timeaxis relative to the position of the actual syllableas well as the F0-mean.

· F0-o�set + position for actual and precedingword.

· F0-onset + position for actual and succeedingword.

· Linear regression coe�cients of F0-contour andenergy contour over di�erent windows to the leftand to the right of the actual syllable character-izing the slope ± falling versus rising ± of thesecontours.

· For each syllable and word in this context max-imum energy (normalized as in Wightman,1992) + positions and mean energy (also nor-malized).

· Duration (absolute and normalized) for eachsyllable/syllable nucleus/word.

· Length of the pause preceding/succeeding actualword.

· For an implicit normalization of the other fea-tures, measures for the speaking rate are com-puted over the whole utterance based on theabsolute and the normalized syllable durations(as in (Wightman, 1992)).

· For each syllable: ¯ags indicating whether thesyllable carries the lexical word accent or wheth-er it is in a word ®nal position.

MLPs of varying topologies were investigated,using always M-TRAIN for the training of M3versus M0 and TEST for testing M3, M0 and MU.In the following, all results reported are for thespoken word chain. The best recognition rate ob-tained so far is 86.0% for the classi®cation of M3versus M0; the confusion matrix is shown in Ta-ble 17. MU is of course not a class which can beidenti®ed by a speci®c prosodic marking but MU

boundaries are either marked prosodically (de-noting a syntactic boundary) or not. Therefore, inthe table we give the percentage of MU mapped toM3 or M0. The fact that the decision is not clear infavour of one or the other proves that MU marksambiguous boundaries.

Tables 17 and 18 display frequency and percentrecognized for each class; all information neces-sary to compute di�erent metrics as, e.g., recall,precision, false alarms, errors, and chance level foreach class are given in these ®gures. Chance levelfor M3, e.g., is 12% (177/(177 + 103 + 1169); forM0, it is 81%. We see, that both classes are rec-ognized well above chance level, whereas an arbi-trary mapping of all boundaries onto M0 wouldresult in a class-wise computed recognition rate of50%. Note that the class-wise results obtained areactually the most relevant ®gures for measuringthe performance of the MLP, because the MLPgot the same number of training patterns for M3and M0. Therefore it does not ``know'' about apriori probabilities of the classes. We chose thisapproach because the MLP was combined with astochastic language model (cf. below) which en-codes the a priori probabilities of M3 or M0 giventhe words in the context of a boundary. The highrecognition rates for M0 as well as for M3 and thefact that only 61% of the MU were recognized as

Table 17

Best recognition results in percent so far for three M main

classes using acoustic±prosodic features and MLP; 121 prosodic

features computed in a context of � 2 syllables were used

Reference % Recognized

Label # M3 M0

M3 177 87.6 12.4

MU 103 61.2 38.8

M0 1169 14.2 85.8

Table 18

Best recognition results in percent so far for three M main

classes using n-grams

Reference % Recognized

Label # M3 MU M0

M3 177 77 0 23

MU 103 8 52 40

M0 1169 2 0 98

216 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 25: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

M3 con®rms the appropriateness of our labellingscheme: M3 and M0 can be told apart in most ofthe cases (the average of the class-wise computedrecognition rates is 86.7%) which is in the samerange as the result obtained with an MLP for B3versus :B3 which was 86.8%, cf. (Kieûling, 1997,p. 191). Remember that the MLP uses mostlyacoustic features and could thus be imagined towork better for the perceptually evaluated B labelsthan for the M labels; of course, the larger trainingdatabase for the M labels contributes as well. Ac-cording to our expectations, the MU labels areneither assigned fully to M3 nor to M0 but dis-tributed roughly equally.

Similar experiments with di�erent feature sub-sets and MLP topologies were conducted for theD0 versus D3 classi®cation as well. Trained withD0 versus D3, they yielded a recognition rate of85% (class-wise recognition: 83%). We achieved arespectable recognition rate of 82% (class-wiserecognition: 82%) with classi®ers trained with M3and M0. Here as well, the larger training databasefor the M labels might contribute, but this resultproves at the same time that a mapping of M ontoD is really possible; more details can be found in(Kompe, 1997, p. 202). The lower recognition ratefor the Ds compared to the Ms is due to the factthat a lot of D0s correspond to M3 and thus can beexpected to be marked prosodically.

In (Batliner et al., 1997), we investigated thepredictive power of the di�erent feature classesincluded in our feature vector. It turned out thatpractically all feature classes alone yield resultsabove chance level, but that the best result can beachieved by simply using all features together(31.6% reduction of error rate w.r.t. all featurestaken together vs. best feature class alone). Thisresult con®rms our general approach not to lookfor important (`distinctive') features but to takeinto account as many features as possible.

Since the Ms and Ds were labelled by linguisticcriteria, one should be able to reliably predict themby a stochastic language model provided they arelabelled consistently. In (Kompe et al., 1995;Kompe, 1997), we introduced a classi®er based onan n-gram language model (LM) and reportedresults for the prediction of boundaries. Thisclassi®er is based on estimates of probabilities for

boundary labels given the words in the context.The same approach was used for the prediction ofMs and Ds, that is, the n-grams were trained on alarge text corpus and then for the test corpus, eachword boundary was labelled automatically withthe most likely label. The results for the Ms givenin Table 18 meet our expectations as well. Noteespecially, that the MUs really cannot be predictedfrom the text alone. In similar classi®cation ex-periments, D0 versus D3 could be recognized cor-rectly in 93% of the cases.

MLP and n-gram each model di�erent proper-ties; it thus makes sense to combine them. In Ta-ble 19, we compare the results for di�erentcombinations of classi®ers (MLP for B versus : Band LMs for S Labels: LMS, and for M Labels:LMM) for the two main classes boundary versusnonboundary for three di�erent types of bound-aries: B, S and M. Here, the unde®ned boundariesMU and S3? are not taken into account. The ®rstnumber shows the overall recognition rate, thesecond is the average of the class-wise recognitionrates. All recognition results were again measuredon TEST. For the training of the MLP and theLMS, all the available labelled data was used ex-cept for the test set (797 and 584 turns, respec-tively); for LMM , 6297 turns were used. It can benoticed that roughly, the results get better fromtop left to bottom right. Best results can beachieved with a combination of the MLP with theLMM no matter whether the perceptual B or thesyntactic±prosodic M labels serve as reference.LMM is better than the LMS even for S3 versusS:3 because of the greater amount of trainingdata. The LMs alone are already very good; wehave, however, to consider that they cannot be

Table 19

Recognition rates (total/class±wise average) in percent for dif-

ferent combinations of classi®ers (®rst column) distinguishing

between di�erent types of boundaries (B, S and M)

B3 vs. :B3 S3+ vs. S3- M3 vs. M0

Cases 165 vs. 1284 210 vs. 1179 190 vs. 1259

MLP 87/86 85/76 86/81

LMS 86/80 92/86 92/83

MLP+LMS 90/87 92/86 93/87

LMM 92/85 95/87 95/86

MLP+LMM 94/91 94/86 96/90

A. Batliner et al. / Speech Communication 25 (1998) 193±222 217

Page 26: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

applied to the unde®ned classes MU and S3? whichare of course very important for a correct syntac-tic/semantic processing. Particularly for thesecases, we need a classi®er trained with perceptual±prosodic labels. Due to the di�erent a prioriprobabilities, the boundaries are recognized lessaccurately than the nonboundaries with the LMs;this causes the lower class-wise recognition rates(e.g., 80.8% for M3 versus 97.7% for M0 forMLP + LMM). It is of course possible to adapt theclassi®cation to various demands, e.g., in order toget better recognition rates for the boundaries ifmore false alarms can be tolerated. More detailsare given in (Kompe, 1997).

Similar classi®cation experiments with syntac-tic±prosodic boundaries are reported in (Wangand Hirschberg, 1992; Ostendorf et al., 1993),where HMMs or classi®cation trees were used. Theauthors rely on perceptual±prosodic labels createdon the basis of the ToBI system (Beckman andAyers, 1994); for such labels, however, a muchsmaller amount of data can be obtained than inour case, cf. Section 8. Our recognition rates arehigher maybe because of the larger amount oftraining data; note, however, that the studiescannot be compared in a strict sense because theydi�er considerably w.r.t. several factors: the la-belling systems are di�erent, the numbers of classesdi�er, Wang and Hirschberg (1992) included theend of turns as label, Ostendorf et al. (1993) usedelicited, systematically ambiguous material thatalready because of that should be marked pro-sodically to a greater extent, and the languagesdi�er.

10. The use of the M labels in the VM system

In this section, we will describe the usefulness ofprosodic information based on an automaticclassi®cation of the M labels for syntactic pro-cessing in the VM system. More details can befound in (Niemann et al., 1997; Kompe et al.,1997; Kompe, 1997, p. 248�). Input to the prosodymodule is the WHG and the speech signal. Outputis a prosodically scored WHG (Kompe et al.,1995), i.e., to each of the word hypotheses, prob-abilities for prosodic accent, for prosodic bound-

aries, and for sentence mood are attached. Here,we will only deal with the boundary classi®cation.

There are two reasons why syntactic processingheavily depends on prosody: First, to ensure thatmost of the spoken words are recognized, forspontaneous speech a large WHG has to be gen-erated. Currently, WHGs of about 10 hypothesesper spoken word are generated. Finding the cor-rect (or approximately correct) path through aWHG is thus an enormous search problem. Acorpus analysis of VM data showed that about70% of the utterances contain more than a singlesentence (Tropf, 1994). About 25% of the utter-ances are longer than 10 s. This is worsened by thefact that spontaneous speech may contain ellipticalconstructions. Second, even if the spoken wordsequence has been recovered by word recognitioncorrectly without alternative word hypotheses,there still might be many di�erent parses possible,due to the high number of ambiguities containedin spontaneous speech and due to the relativelylong utterances occurring in the VM domain.

Both VM syntax modules use the boundaryscores of the prosody module along with theacoustic score of the word hypotheses and n-gramstochastic LMs. Siemens uses a Trace Uni®cationGrammar (TUG) (Block and Schachtl, 1992;Block, 1997), IBM a Head-driven Phrase StructureGrammar (HPSG) (Pollard and Sag, 1987; Kiss,1995). These syntax modules are used alternative-ly. Both grammars contain boundary symbols.The basic di�erence is that the Siemens system hasa word graph parser which searches in the wordgraph for the optimal word chain. It simulta-neously explores di�erent promising (i.e., wellscored) paths. The search is guided by all scoresincluding the prosodic score. In the IBM system,the word graph is ®rst expanded to the n-best wordchains, which include prosodic boundary labels.Here, `best' refers to the scores including the pro-sodic score. Two alternative word chains mightjust di�er in the boundary positions. These wordchains are parsed one after the other until a parsecould be completed successfully.

The following evaluations were conducted on594 word graphs consisting of about 10 hypothesesper spoken word. The usefulness of prosodic in-formation in the syntax module from Siemens can

218 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 27: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

be seen in Table 20 that shows the improvement ofthe Siemens WHG parser by using the prosodicboundary probabilities: The number of readings aswell as the parse time are reduced drastically. Thefact that 9 word graphs (i.e. 2%) could not beanalyzed with the use of prosody is due to the factthat the search space is explored di�erently andthat the ®xed time limit has been reached beforethe analysis succeeded. However, this small num-ber of nonanalyzable word graphs is neglectableconsidering the fact that without prosody, theaverage real-time factor is 6.1 for the parsing. Withprosodic information the real-time factor drops to0.5; the real-time factor for the computation ofprosodic information is 1.0. Note that furthermorea high number of readings results in a largercomputational load in the higher linguistic mod-ules. For the IBM parser results are only availablefor speech recorded during tests with the VMsystem by non-naive users. With this material aspeed-up of 46% was achieved by using the pro-sodic clause boundary information, cf. (Batliner etal., 1996a).

It is therefore safe to conclude that the use ofsyntactic±prosodic information, i.e., informationcoded in the M boundaries, is, at least for the timebeing, vital for the VM system because withoutthis information, the overall processing time wouldbe too long and thus acceptability too low. Notethat we do not want to claim that it is only pro-sodic information that could do this job. This canbe demonstrated easily because everybody whoreads the transliterations without punctuation andkeeps track of the dialogue history can understandpractically all turns. The information needed forthe M labels can, however, for the time being and

for the near future be computed at much lowercosts than the information w.r.t. dialogue history.This is partly due to the relatively advanced stateof prosodic and syntactic analysis compared todialogue analysis (Carletta et al., 1997) where therestill is no agreement on even basic categories. Mostimportant is, however, that the above mentionedhuman capacity cannot be transfered onto end-to-end-systems like VM which have to deal withspoken language, and that means, with WHGs. Toextract the spoken word chain is normally noproblem for human beings but a great problem forspeech understanding systems. We do not want toclaim either that we could not have come quite farwith only a prosodic classi®cation of the Bboundaries. A look at Table 19 shows, however,that the reduction of error rate from top left tobottom right (B to M) is 69% for overall classi®-cation and 29% for class-wise computed classi®-cation results. (The overall error rate for B with anMLP is, e.g., 100 ) 87� 13%; for M with thecombination of MLP with LM, it is100 ) 96� 4%. The di�erence between these twoerror rates is 9%, which means 69% reduction oferror rate: 9/13� 0.69.) Moreover, the M labelsmodel more exactly than the B labels the entitieswe are interested in.

In Section 1, we claimed that prosodic infor-mation might be the only means to ®nd the correctreading/parse for one and the same ambiguoussequence of words, cf. Examples 1 and 2. Severalexperiments on VM sub-corpora proved thatprosody not only improves the parse time or thenumber of parse trees but also the average qualityof the parse trees. This holds especially for thesearch through the WHG. The experiments aredescribed in detail in (Kompe, 1997, pp. 263±269).

11. Concluding remarks and future work

For a robust training of stochastic languagemodels like n-grams, huge amounts of appropri-ately labelled training data are necessary. Thus,the main advantage of the syntactic±prosodic Mlabels is the comparatively low e�ort of the label-ling process and the comparatively large amountof data obtained. Above that, these labels are

Table 20

With the Siemens parser, 594 word graphs were parsed

Without

prosody

With

prosody

Improvement

(%)

# Successful

analyses

368 359 )2

# Readings 137.7 5.6 96

Parse time (s) 38.6 3.1 92

Results are given for parse experiments using prosodic infor-

mation and compared with results where no prosodic infor-

mation was used. The last column shows the relative

improvement achieved by using prosody.

A. Batliner et al. / Speech Communication 25 (1998) 193±222 219

Page 28: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

directly designed for use in the syntax module ofthe VM project but detailed enough so they can beused for automatic classi®cation of dialogue actboundaries as well. This fact will become evenmore important when switching to di�erent ap-plications or languages. For the time being, weadapt the M system onto American English; thisseems to be possible with only some slight modi-®cations; e.g., the verbal brace does not exist inEnglish, and thus the labels for right dislocationshave to be rede®ned. As is the case with ToBI, theM labelling framework can thus be used at least forthese two related languages; that means that la-bellers can use basically the same guidelines andprinciples for the annotation of German and En-glish. For other applications in the same language,the training material can always be used for abootstrap, i.e., the classi®ers can be trained withthe large M database and then later calibrated witha smaller database taken from the new application.

The large training database might be the reasonwhy we could predict the S and the B boundarieswith the M labels to such an extent. This does,however, not mean that we can do without the Blabels: without these, we would be at a loss todisambiguate the semantically crucial MU bound-aries. (Besides, there is 29% reduction of error ratefor the class-wise computed recognition rate forthe combination of LM and MLP w.r.t. the use ofthe LM alone, cf. Table 19.) For this task, how-ever, we can do with a rather small database la-belled with B.

We want to stress again that this sort of label-ling is no substitute for a thorough syntacticanalysis. We can de®ne and label prototypicalboundaries with a great reliability but for quite afew possible boundary locations, di�erent inter-pretations are possible. A small percentage of thelabels in the training database is certainly incorrectbecause the labelling was done rather fast. It isalways possible that the labelling strategy is notfully consistent across the whole labelling proce-dure. On the other hand, the good classi®cationresults show that across this great database, suchinconsistencies do not matter much. The philoso-phy behind this labelling scheme is a `knowledgebased' clustering of syntactic boundaries intosubclasses that might be marked distinctively by

prosodic means or that are prone not to be markedat all prosodically. The purpose of this labellingscheme is not to optimize a stand alone prosodicclassi®cation but to optimize its usefulness forsyntactic analysis in particular and for the lin-guistic modules in VM in general. The wholeconcept is thus a compromise between a ± man-ageable ± generalization and a ± conceivable ±speci®cation of the labels.

In the future, we want to use the new M labelsfor the automatic labelling of `phrase' accent po-sitions along the lines of (Kieûling et al., 1994a).There, we designed rules for the assignment ofphrase accent positions based on syntactic±pro-sodic boundary labels for a large read databaseachieving recognition rates of up to 88.3% for thetwo class problem `accent' versus `no accent'. Themost important factors were part-of-speech andposition in the accent-phrase, i.e., in the syntacticunit that is delimited by M boundaries; details canbe found in (Kieûling, 1997).

For the time being, the VM project is in atransition stage, from VM I to VM II, the latterbeing planned for 1997±2000. Database, task andall modules are redesigned. It is planned to use theM labels for similar purposes as in the past; abovethat, they will be annotated for English and usedin the processing of some additional higher lin-guistic modules, as statistical grammar and sta-tistical translation.

Acknowledgements

We discussed the overall usefulness of our la-belling scheme as well as the labelling strategy andsome problematic cases with S. Schachtl (Siemens,M�unchen), with A. Feldhaus, S. Geiûler, T. Kiss(all of them from IBM, Heidelberg) and with M.Reyelt (TU, Braunschweig). N. Schumann and C.Sch�aferle (LMU, M�unchen) annotated the new Mlabels, V. Warnke and R. Huber (FAU, Erlangen)conducted some additional analyses. We want tothank all of them. Needless to say that none ofthem should be blamed for any inconsistencies orerrors. We would like to thank two anonymousreviewers for helpful discussions of this work. Thiswork was funded by the German Federal Ministry

220 A. Batliner et al. / Speech Communication 25 (1998) 193±222

Page 29: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

of Education, Science, Research and Technology(BMBF) in the framework of the Verbmobil Pro-ject under Grants 01 IV 102 F/4, 01 IV 102 H/0and 01 IV 701 K/5. The responsibility for thecontents lies with the authors.

References

Batliner, A., 1997. M speci®ed: A revision of the syntactic±

prosodic labelling system for large spontaneous speech

databases. Verbmobil Memo 124.

Batliner, A., Kompe, R., Kieûling, A., N�oth, E., Niemann, H.,

Kilian, U., 1995. The prosodic marking of phrase bound-

aries: Expectations and results. In: Rubio Ayuso, A., L�opez

Soler, J. (Eds.), Speech Recognition and Coding. New

Advances and Trends, NATO ASI Series F, Vol. 147.

Springer, Berlin, pp. 325±328.

Batliner, A., Kompe, R., Kieûling, A., Mast, M., N�oth, E.,

1996b. All about Ms and Is, not to forget As, and a

comparison with Bs and Ss and Ds. Towards a syntactic±

prosodic labeling system for large spontaneous speech

databases. Verbmobil Memo 102.

Batliner, A., Feldhaus, A., Geissler, S., Kieûling, A., Kiss, T.,

Kompe, R., N�oth, E., 1996a. Integrating syntactic and

prosodic information for the e�cient detection of empty

categories. In: Proc. Internat. Conf. on Computational

Linguistics, Copenhagen, Vol. 1, pp. 71±76.

Batliner, A., Kieûling, A., Kompe, R., Niemann, H., N�oth, E.,

1997. Can we tell apart intonation from prosody (if we look

at accents and boundaries)?. In: Kouroupetroglou, G. (Ed.),

Proc. ESCA Workshop on Intonation, Department of

Informatics, University of Athens, Athens, pp. 39±42.

Bear, J., Price, P., 1990. Prosody, syntax, and parsing. In: Proc.

28th Conference of the Association for Computational

Lingustics, Ban�, pp. 17±22.

Beckman, M., Ayers, G., 1994. Guidelines for ToBI transcrip-

tion, Version 2. Department of Linguistics, Ohio State

University.

Block, H., 1997. The language components in Verbmobil. In:

Proc. Internat. Conf. on Acoustics, Speech and Signal

Processing, M�unchen, Vol. 79±82, p. 1.

Block, H., Schachtl, S., 1992. Trace & uni®cation grammar. In:

Proc. Internat. Conf. on Computational Linguistics, Nan-

tes, Vol. 1, pp. 87±93.

Bub, T., Schwinn, J., 1996. Verbmobil: The evolution of a

complex large speech-to-speech translation system. In:

Internat. Conf. on Spoken Language Processing, Philadel-

phia, Vol. 4, pp. 1026±1029.

Cahn, J., 1992. An investigation into the correlation of cue

phrases, un®lled pauses and the structuring of spoken

discourse. In: Proc. IRCS Workshop on Prosody in Natural

Speech, University of Pennsylvania, Pennsylvania, pp. 19±30.

Carletta, J., Dahlb�ack, N., Reithinger, N., Walker, M., 1997.

Standards for dialogue coding in natural language process-

ing. Dagstuhl-Seminar-Report 167.

Cutler, A., Dahan, D., van Donselaar, W., 1997. Prosody in the

comprehension of spoken language: A literature review.

Language and Speech 40, 141±201.

de Pijper, J., Sanderman, A., 1994. On the perceptual strength

of prosodic boundaries and its relation to suprasegmental

cues. Journal of the Acoustic Society of America 96, 2037±

2047.

Feldhaus, A., Kiss, T., 1995. Kategoriale Etikettierung der

Karlsruher Dialoge. Verbmobil Memo 94.

Grammont, M., 1923. Notes de phon�etique g�en�erale. VIII.

L'assimilation. In: Bulletin de la Soci�et�e de Linguistique,

Vol. 24, pp. 1±109.

Grice, M., Reyelt, M., Benzm�uller, R., Mayer, J., Batliner, A.,

1996. Consistency in transcription and labelling of German

intonation with GToBI. In: Internat. Conf. on Spoken

Language Processing, Philadelphia, Vol. 3, pp. 1716±1719.

Guttman, L., 1977. What is not what in statistics. The

Statistician 26, 81±107.

Haftka, B., 1993. Topologische Felder und Verset-

zungsph�anomene. In: Jacobs, J., van Stechow, A., Ster-

nefeld, W., Vennemann, T. (Eds.), Syntax ± Ein

Internationales Handbuch zeitgen�ossischer Forschung ±

An International Handbook of Contemporary Research,

Vol. 1. De Gruyter, Berlin, pp. 846±867.

Hirschberg, J., Grosz, B., 1994. Intonation and discourse

structure in spontaneous and read direction-giving. In: Proc.

Internat. Symp. on Prosody, Yokohama, Japan, pp. 103±

109.

Hunt, A., 1994. A prosodic recognition module based on linear

discriminant analysis. In: Internat. Conf. on Spoken Lan-

guage Processing, Yokohama, Japan, Vol. 3, pp. 1119±1122.

Jekat, S., Klein, A., Maier, E., Maleck, I., Mast, M., Quantz, J.,

1995. Dialogue acts in Verbmobil. Verbmobil Report 65.

Kieûling, A., 1997. Extraktion und Klassi®kation prosodischer

Merkmale in der automatischen Sprachverarbeitung. Be-

richte aus der Informatik. Shaker Verlag, Aachen.

Kieûling, A., Kompe, R., Batliner, A., Niemann, H., N�oth, E.,

1994a. Automatic labeling of phrase accents in German. In:

Internat. Conf. on Spoken Language Processing, Yokoha-

ma, Japan, Vol. 1, pp. 115±118.

Kieûling, A., Kompe, R., Niemann, H., N�oth, E., Batliner, A.,

1994b. Detection of phrase boundaries and accents. In:

Niemann, H., De Mori, R., Hanrieder, G. (Eds.), Progress

and Prospects of Speech Research and Technology: Proc.

CRIM / FORWISS Workshop. In®x, Sankt Augustin, PAI

1, pp. 266±269.

Kiss, T., 1995. Merkmale und Repr�asentationen. Westdeut-

scher Verlag, Opladen.

Kohler, K., Lex, G., P�atzold, M., Sche�ers, M., Simpson, A.,

Thon, W., 1994. Handbuch zur Datenaufnahme und

Transliteration in TP14 von Verbmobil, V3.0. Verbmobil

Technisches-Dokument 11, Institut f�ur Phonetik und digi-

tale Sprachverarbeitung, Universit�at Kiel, Kiel.

Kompe, R., 1997. Prosody in Speech Understanding Systems,

Lecture Notes in Arti®cial Intelligence. Springer, Berlin.

Kompe, R., Batliner, A., Kieûling, A., Kilian, U., Niemann, H.,

N�oth, E., Regel-Brietzmann, P., 1994. Automatic classi®-

A. Batliner et al. / Speech Communication 25 (1998) 193±222 221

Page 30: M Syntax + Prosody: A syntactic–prosodic labelling scheme ... · boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation),

cation of prosodically marked phrase boundaries in Ger-

man. In: Proc. Internat. Conf. on Acoustics, Speech and

Signal Processing, Adelaide, Vol. 2, pp. 173±176.

Kompe, R., Kieûling, A., Niemann, H., N�oth, E., Schukat-

Talamazzini, E., Zottmann, A., Batliner, A., 1995. Prosodic

scoring of word hypotheses graphs. In: Proc. European

Conf. on Speech Communication and Technology, Madrid,

Vol. 2, pp. 1333±1336.

Kompe, R., Kieûling, A., Niemann, H., N�oth, E., Batliner, A.,

Schachtl, S., Ruland, T., Block, H., 1997. Improving

parsing of spontaneous speech with the help of prosodic

boundaries. In: Proc. Internat. Conf. on Acoustics, Speech

and Signal Processing, M�unchen, Vol. 2, pp. 811±814.

Lamel, L., 1992. Report on speech corpora development in the

US. ESCA Newsletter 8, 7±10.

Lea, W., 1980. Prosodic aids to speech recognition. In: Lea, W.

(Ed.), Trends in Speech Recognition. Prentice-Hall, Engle-

wood Cli�s, NJ, pp. 166±205.

Maier, E., 1997. Evaluating a scheme for dialogue annotation.

Verbmobil Report 193.

Mast, M., Maier, E., Schmitz, B., 1995. Criteria for the

segmentation of spoken input into individual utterances.

Verbmobil Report 97.

Niemann, H., N�oth, E., Kieûling, A., Kompe, R., Batliner, A.,

1997. Prosodic processing and its use in Verbmobil. In:

Proc. Internat. Conf. on Acoustics, Speech and Signal

Processing, M�unchen, Vol. 1, pp. 75±78.

Ostendorf, M., Veilleux, N., 1994. A hierarchical stochastic

model for automatic prediction of prosodic boundary

location. Computational Linguistics 20 (1), 27±53.

Ostendorf, M., Price, P., Bear, J., Wightman, C., 1990. The use

of relative duration in syntactic disambiguation. In: Speech

and Natural Language Workshop. Morgan Kaufmann,

Hidden Valley, PA, pp. 26±31.

Ostendorf, M., Wightman, C., Veilleux, N., 1993. Parse scoring

with prosodic information: An analysis/synthesis approach.

Computer Speech & Language 7 (3), 193±210.

Pitrelli, J.F., Beckman, M.E., Hirschberg, J., 1994. Evaluation

of prosodic transcription labeling reliability in the ToBI

framework. In: Internat. Conf. on Spoken Language

Processing, Yokohama, Japan, Vol. 1, pp. 123±126.

Pollard, C., Sag, I., 1987. Information-based Syntax and

Semantics, Vol. 1, CSLI Lecture Notes, Vol. 13. CSLI,

Stanford, CA.

Price, P., Wightman, C., Ostendorf, M., Bear, J., 1990. The use

of relative duration in syntactic disambiguation. In: Inter-

nat. Conf. on Spoken Language Processing, Kobe, Japan,

Vol. 1, pp. 13±18.

Price, P., Ostendorf, M., Shattuck-Hufnagel, S., Fong, C.,

1991. The use of prosody in syntactic disambiguation.

Journal of the Acoustic Society of America 90, 2956±2970.

Reithinger, N., 1997. Personal communication.

Reyelt, M., 1995. Consistency of prosodic transcriptions.

Labelling experiments with trained and untrained transcrib-

ers. In: Proc. 13th Internat. Congress of Phonetic Sciences,

Stockholm, Vol. 4, pp. 212±215.

Reyelt, M., 1997. Personal communication.

Reyelt, M., 1998. Experimentelle Untersuchungen zur Festle-

gung und Konsistenz suprasegmentaler Einheiten f�ur die

automatische Sprachverarbeitung. Berichte aus der Inform-

atik. Shaker Verlag, Aachen.

Reyelt, M., Batliner, A., 1994. Ein Inventar prosodischer

Etiketten f�ur Verbmobil. Verbmobil Memo 33.

Swerts, M., Geluykens, R., Terken, J., 1992. Prosodic correlates

of discourse units in spontaneous speech. In: Internat. Conf.

on Spoken Language Processing, Ban�, Vol. 1, pp. 421±424.

Tropf, H., 1994. Spontansprachliche syntaktische Ph�anomene:

Analyse eines Korpus aus der Dom�ane ``Terminabsprache''.

Technical report, Siemens AG, ZFE ST SN 54, M�unchen.

Vaissi�ere, J., 1988. The use of prosodic parameters in automatic

speech recognition. In: Niemann, H., Lang, M., Sagerer, G.

(Eds.), Recent Advances in Speech Understanding and

Dialog Systems, NATO ASI Series F, Vol. 46. Springer,

Berlin, pp. 71±99.

Wahlster, W., 1993. Verbmobil ± Translation of face±to±face

dialogs. In: Proc. European Conf. on Speech Communica-

tion and Technology, Berlin, Opening and Plenary Sessions,

pp. 29±38.

Wahlster, W., Bub, T., Waibel, A., 1997. Verbmobil: The

combination of deep and shallow processing for spontane-

ous speech translation. In: Proc. Internat. Conf. on Acous-

tics, Speech and Signal Processing, M�unchen, Vol. 1, pp.

71±74.

Wang, M., Hirschberg, J., 1992. Automatic classi®cation of

intonational phrase boundaries. Computer Speech & Lan-

guage 6 (2), 175±196.

Wightman, C., 1992. Automatic detection of prosodic constit-

uents. Ph.D. Thesis, Boston University Graduate School.

222 A. Batliner et al. / Speech Communication 25 (1998) 193±222