Rules for the Generation of ToBI-based American English Intonation

Matthias Jilka, Gregor Möhler, Grzegorz Dogil
University of Stuttgart, Institute of Natural Language Processing, Chair of Experimental Phonetics
{jilka, moehler, dogil}@ims.uni-stuttgart.de
Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung, Experimentelle Phonetik; Azenbergstraße 12, 70174 Stuttgart, Germany

Abstract

This study presents an approach to the generation of American English intonation based on prescriptive rules that define the respective features of certain tone labels that in turn represent linguistically relevant F0 configurations. In accordance with the principles of the Tone Sequence Model the F0 contour is analyzed as a series of discrete target values that are connected by means of transitional functions. The target values are associated either with stressed syllables (pitch accents) or the margins of the phrase (phrasal tones). The targets’ exact position is represented relative to pitch range and time. All tone labels are examined according to these parameters and the results are then converted into a set of rules that allows the generation of an F0 contour. ToBI (Tones and Break Indices), a system for transcribing the intonation patterns of American English, provides an inventory of tone labels and a set of example utterances available for analysis. Utterances from ToBI and the Boston Radio News Corpus were used for the evaluation of the generation rules: root mean squared error (RMSE) and correlation between generated and original contour were determined, and in a perception test native speakers assessed the quality of the resynthesized contours which, in general, were judged to sound natural and show few differences to the corresponding originals.

Zusammenfassung

Die vorliegende Studie stellt einen regelbasierten Ansatz zur Generierung der Intonation von Amerikanischem Englisch vor. Die Regeln beschreiben die Eigenschaften bestimmter Tonlabels und deren Umsetzung in eine Grundfrequenzkontur. Gemäß den Prinzipien des Ton-Sequenz Modells wird die F0-Kontur als eine Abfolge von diskreten Zeit-Frequenz-Werten analysiert, die durch Interpolation miteinander verbunden sind. Die Zeit-Frequenz-Werte sind entweder mit betonten Silben (Pitchakzente) oder den Phrasenrändern (Phrasentöne) assoziiert und ihre Position wird relativ zu Pitch Range und Silbe dargestellt. Alle Tonlabels werden im Hinblick auf diese Parameter untersucht. Die Ergebnisse werden in einen Regelsatz umgewandelt, der die Generierung einer F0-Kontur ermöglicht. ToBI (Tones and Break Indices), ein System zur Transkription der Intonationsmuster des Amerikanischen Englisch, stellt ein Inventar von Tonlabels und einen Satz von Beispielsäußerungen zum Zweck der Analyse zur Verfügung. Äußerungen aus ToBI und dem Boston Radio News Corpus wurden zur Bewertung der Generierungsregeln verwendet: Mittleres Fehlerquadrat und Korrelation zwischen generierten und originalen Konturen wurden berechnet, und in einem Perzeptionstest bewerteten Muttersprachler die Qualität der resynthetisierten Konturen insgesamt als natürlich. Ebenso wurden nur geringe Unterschiede zu den entsprechenden Originalen festgestellt.


Résumé

Dans cet article nous présentons une approche à la génération de l’intonation de l’anglais américain basée sur des règles. Les règles décrivent les caractéristiques de certaines étiquettes tonales qui représentent des configurations spécifiques de la fréquence fondamentale. Conformément aux principes du modèle de la séquence des tons le contour F0 est analysé comme une suite de valeurs de temps et fréquence discrètes qui sont liées par des fonctions de transition. Les valeurs de temps et fréquence sont associées soit à des syllabes accentuées (accents toniques) ou aux marges de la phrase (accents phrasaux). Leur position est représentée par rapport à la syllabe et l’étendue de la fréquence fondamentale. Toutes les étiquettes tonales sont examinées sur la base de ces paramètres. Les résultats sont transformés en une série de règles qui rend possible la génération d’un contour F0. ToBI (Tones and Break Indices), un système servant à la transcription des structures tonales de l’anglais américain, fournit un inventaire d’étiquettes tonales et une collection d’énoncés d’exemples pour l’analyse. Des énoncés de ToBI et de la Boston Radio News Corpus ont été utilisés pour l’évaluation des règles de génération: on a déterminé l’erreur quadratique moyenne et la corrélation entre les contours générés et les contours originaux et, dans une expérience de perception des locuteurs natifs ont estimé que la qualité des contours resynthétisés s’avérait naturel dans l’ensemble. De même ils n’ont constaté que de petites différences par rapport aux originaux correspondants.

Keywords: American English intonation - ToBI - Rule-based F0 generation - Synthesis - Evaluation


1. Introduction

In order to successfully generate F0 contours that reflect the shape of meaningful intonation patterns, it is necessary to find an approach that both describes such patterns and allows their reproduction. The Tone Sequence Model is such an approach. It was first introduced by Bruce (1977) for Swedish and Pierrehumbert (1980) for American English. The Tone Sequence Model has strongly influenced over a decade and a half of intonation synthesis during which "enormous progress in modeling fundamental frequency contours" (Beckman 1997, p. 187) has been made, while itself building on

...decades of research by scores of linguists working on the phonology and pragmatics of English intonation, from O’Connor and Ward (1959), Bolinger (1958), and Halliday (1967) to Vanderslice and Ladefoged (1972) (as interpreted by Bruce (1977), Sag and Liberman (1975), and many others.

(Beckman 1997, p. 189).

Besides the Tone Sequence Model, other approaches to the representation and generation of intonation have been developed in recent years. The main difference between the Tone Sequence Model and competing intonation models (e.g. Kohler 1991) is that the latter see F0 movements, not target values, as the basic elements of intonation. The two kinds of model thus make different assumptions as to whether listeners perceive intonation (and the meanings associated with it) in terms of F0 movements or focus exclusively on certain main target points in the contour.

Correspondingly, these models view the intonation contour of an utterance globally, i.e. as one unit, whereas in the Tone Sequence Model it is understood as a sequence of discrete local tonal events that constitute the overall contour.

The most influential models of F0 generation that emphasize the global view are the so-called Superpositional Models. In a superpositional model the complex pattern of the overall contour is constructed by successively adding hierarchically ordered components of the contour. The most prominent superpositional model, the Fujisaki Model (Fujisaki 1988), works with a phrase and an accent component. Other superpositional models may use more and/or different components. Thorsen (1988), for example, integrates text, sentence, phrase and stress group components into her model. The Lund Intonation Model (Gårding 1983) is also predominantly superpositional. Local F0 movements as well as H and L tones (as features of syllables or words) are placed within the framework of a tonal grid which controls the global F0 contour depending on sentence mood and the major syntactic boundaries.

The existence of a global trend in the F0 contour leads to the assumption that speakers are aware of the overall shape of the contour they are producing and can therefore look ahead to what is still to come. This is not possible for the Tone Sequence Model, which claims that the overall contour of an utterance simply consists of a sequence of local accents, thus expressing a completely different view of the relation between word and sentence accent. As a result, there are actual phenomena in intonation that the Tone Sequence Model cannot account for in a satisfactory way and that indeed fit better into the global view of intonation. For this reason the model is criticized by supporters of superpositional approaches.

Psycholinguistic research on slips of the tongue (Zimmer 1988), for example, shows that the global F0 contour remains unchanged when words are inadvertently exchanged, while the exchanged words retain their proper accents. This demonstrates a difference in behavior of word and sentence accent, showing that they are distinct phenomena.

Another argument for the global view of F0 movement within an utterance is given by Kutik et al. (1983), who show that an F0 contour can be interrupted by a parenthesis and afterwards continue on basically unaffected, as if there had not been an interruption.

The fact that the Tone Sequence Model interprets global declination as phonologically conditioned repeated downstepping is also not considered to be an optimal explanation.

Finally, it will be shown in section 3.4. that some special intonation rules described in this article also require prior knowledge of the shape of the following F0 movements.


The problems just listed certainly do not affect superpositional models, which are on the whole just as effective in generating natural-sounding intonation as are approaches based on the Tone Sequence Model. They are, however, more abstract in that it is more difficult to correlate the shape of the contour with a corresponding linguistic description. The representation of local F0 movements as tonal categories in tone sequence-style approaches facilitates linguistic labelling and also allows the assignment of intonational meanings within a framework of compositional discourse semantics (Pierrehumbert and Hirschberg 1990).

The theoretical basis of the Tone Sequence Model and its application to the process of F0 generation will be described in more detail in the rest of this introductory section. According to the principles underlying the model, an F0 contour is analyzed as a series of discrete target values that are connected by means of transitional functions. With the help of these target values or combinations of target values it is possible to define and duplicate particular F0 configurations that decisively influence both the overall shape of the F0 contour and the interpretation of an utterance. All these particular events on the F0 contour can then be identified and marked with specific tone labels. The tone labels used for the analysis described in this paper are taken from the ToBI (for Tones and Break Indices, Beckman and Ayers, 1994) tone inventory. The tone inventory is made up of pitch accents and boundary tones, because according to the Tone Sequence Model it is assumed that target values are associated with either stressed syllables or the margins of the phrase.

Apart from this transcription system for intonation patterns and other aspects of the prosody of English, the ToBI guidelines also contain the set of example utterances which served as the basis for developing the generation rules. ToBI and especially the tone inventory will be introduced in more detail in section 2 of this paper. The exact position of the target values, also called target points or simply targets, is represented relative to the dimensions of pitch range and syllable structure. In the Tone Sequence Model the pitch range is seen as the distance between the independently assigned baseline and topline which encompass the F0 contour. A target point’s position is thus described in terms of fractions of the distance between the two extremes. Similarly, its temporal position is given in percentages in relation to a certain period of time within a labelled syllable. In section 3 it will be argued that for American English the voiced part of the labelled syllable is an appropriate criterion. In the same section we will also examine every ToBI tone label with respect to the target points that correspond to it. The analysis will show that the target points associated with a specific tone label may differ in position depending on the environment in which the latter occurs. All the obtained results are then converted into a set of rules that constitutes the core of the F0 generation, since the rules are responsible for the placement of the target points in the dimensions of pitch range and time.


A short representation of the generation process is depicted in the block diagram in Fig. 1:

Input: read phrase syllable by syllable (phonemes, syllable boundaries, tones)

↓ Tone Sequence Model grammar check

↓ placement of target points according to rules

↓ embedding target points in concrete pitch range (extra parameter)

↓ linear interpolation

↓ output of F0 contour

Fig. 1: Block diagram of the F0 generation tool
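The last three stages of Fig. 1 can be illustrated with a minimal Python sketch. It is not the xwaves-based implementation described in this paper: the function names, the 0-to-1 relative pitch scale, the constant baseline and topline, and the 10 ms frame step are all assumptions made for illustration only.

```python
from bisect import bisect_right


def embed_in_pitch_range(rel_pitch, baseline_hz, topline_hz):
    """Map a relative pitch (0.0 = baseline, 1.0 = topline) to Hz."""
    return baseline_hz + rel_pitch * (topline_hz - baseline_hz)


def interpolate_contour(targets, frame_step=0.01):
    """Connect (time_s, f0_hz) targets by straight lines.

    Returns one (time_s, f0_hz) sample every frame_step seconds.
    """
    targets = sorted(targets)
    if len(targets) < 2:
        raise ValueError("need at least two target points")
    times = [t for t, _ in targets]
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        raise ValueError("target times must be distinct")
    n_frames = int(round((times[-1] - times[0]) / frame_step))
    contour = []
    for i in range(n_frames + 1):
        t = times[0] + i * frame_step
        # Index of the target that closes the segment containing t.
        j = min(bisect_right(times, t), len(targets) - 1)
        (t0, f0), (t1, f1) = targets[j - 1], targets[j]
        weight = (t - t0) / (t1 - t0)
        contour.append((t, f0 + weight * (f1 - f0)))
    return contour


# Hypothetical targets: (time in s, relative pitch); a 100-200 Hz range is assumed.
relative_targets = [(0.0, 0.5), (0.4, 0.9), (1.0, 0.0)]
hz_targets = [(t, embed_in_pitch_range(p, 100.0, 200.0)) for t, p in relative_targets]
contour = interpolate_contour(hz_targets)
```

Keeping the embedding step separate from the interpolation mirrors the block diagram: the same relative targets can be rendered in a different speaker's range by changing only the two range arguments.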

The F0 generation has been implemented and embedded in the xwaves environment (for more details see Möhler 1998 and Möhler and Dogil 1995). The input must provide the syllable and phoneme structure of an utterance as well as its tonal structure, so that the program is able to determine the concrete position of the targets within the labelled syllables. In the following step the program checks whether the tone and phoneme labels used really correspond to the respective inventories. If everything is in order, the generation program internally creates the structure of an intonation phrase, taking into account the labelled configurations just mentioned. After that, the actual generation process begins by scanning the newly created intonation phrase structure. For every labelled or otherwise marked syllable (see 3.3.1.) the appropriate rule is applied, which is responsible for the placement of one or several target points, depending on the environment and/or the tone label itself. Then, the target points’ relative pitch is embedded in a specific speaker’s pitch range; the targets are thus assigned concrete frequency values. In a final step, the individual target points are connected by means of linear interpolation.

Visually, the resulting F0 contour is a good approximation of a natural contour, even though it is piecewise linear.

The quality of the generated contours is evaluated acoustically. This is achieved by resynthesizing the original utterance with the newly generated F0 contour using the PSOLA (Pitch Synchronous Overlap and Add) resynthesis method (Moulines and Charpentier 1990). In a small-scale perceptual experiment the resynthesized utterances, taken from ToBI and the Boston Radio News Corpus (the only generally available prosodically labelled corpus of American English), are assessed as to their naturalness as well as their similarity to, or differences from, the respective originals. This is described in detail in section 4 of this paper. Aside from the strictly perceptual quality evaluation, a plain numerical evaluation is also provided in terms of measures of correlation and the root mean squared error (RMSE).
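The two numerical measures can be computed as in the following sketch. It assumes that the generated and the original contour have already been sampled at the same frames, and it uses the Pearson coefficient as the correlation measure. Both are assumptions of this illustration, since the exact procedure is described only in section 4. The function names are hypothetical.

```python
import math


def rmse(generated, original):
    """Root mean squared error between two frame-aligned F0 contours (Hz)."""
    if len(generated) != len(original) or not generated:
        raise ValueError("contours must be non-empty and of equal length")
    squared = [(g - o) ** 2 for g, o in zip(generated, original)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient of two frame-aligned contours."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("contours must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either contour is constant.
    return cov / math.sqrt(var_x * var_y)
```

The two measures are complementary: RMSE reflects the absolute distance in Hz, while the correlation reflects similarity of shape regardless of a constant offset.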

2. The ToBI example utterances and tone inventory

As already mentioned, the tone labels that designate specific F0 configurations on stressed syllables or syllables at the margins of the intonation phrase are taken from the tone inventory of the Tones and Break Indices system (ToBI), which is intended as a standard for the prosodic transcription of American English (Silverman et al. 1992, p. 867). Its development was motivated mainly by the desire of speech scientists from various disciplines to have a common and generally accepted method for representing prosody. Apart from being easy to understand and use, as well as covering the most important prosodic phenomena in speech, ToBI is also supposed to be more compatible with current work in language processing (speech synthesis and recognition, syntactic parsing) and with formal representations in semantics and pragmatics, as the ToBI representation of prosody in terms of categories is closely connected with both the acoustic speech signal and text and discourse structure. ToBI features a set of 161 example utterances including speech waveform, F0 contour and spectrogram for each utterance. The utterances are described by labels structured in tiers: the orthographic tier, the miscellaneous tier (for comments of all kinds), the break tier (which describes the utterance’s phrasing) and, most importantly, the tone tier.

This section consists of three subsections. The first argues that ToBI provides all the prerequisites necessary for the purposes of this study. The other two contain a short description of the individual elements of the ToBI tone inventory, closely following the ToBI Annotation Conventions by Beckman and Hirschberg (1994). The first of these two subsections describes phrasal tones, the other one deals with pitch accents.

2.1. ToBI as the basis for developing the rule system

The main advantage of using ToBI (its tone inventory and set of example utterances) is that it can function as a prescriptive intonational grammar which covers all possible types of accents, i.e., forms of contours. The ToBI example utterances do not exclusively consist of samples of so-called laboratory speech, but also include spontaneous speech and other speaking styles. All these speaking styles are treated equally in the generation process and do not cause any deviations or other problems. The effects of differences in speaking style have not yet been examined thoroughly. Preliminary work (Bruce 1995, Hirschberg 1995) suggests that such differences may particularly concern speaking rate (e.g. read speech is faster than spontaneous speech) and pitch range (e.g., Bruce shows that in Swedish spontaneous speech a major topic shift is signalled by a clearly widened F0 range). Both of these possible cues to speaking style can be easily accounted for by the generation model presented in this paper. Pitch range is given separately for every utterance anyway, and a difference in speaking rate would manifest itself above all in the length of segments and syllables. This length is indicated by the individual phoneme and syllable labels, which in turn provide the frame of reference in relation to which the generation rules are applied.


The selection of example utterances in ToBI and the rules derived from this study can be considered as both sufficient and appropriate in providing the variety necessary for the description of all possible F0 configurations of American English. We will show that the prescriptive system derived from ToBI can be successfully applied for the generation and synthesis of the speech samples of the Boston Radio News Corpus.

2.2. Phrasal tones

Phrasal tones describe the particular F0 configurations on the margins of an intonation phrase. According to ToBI, American English utterances are structured in certain intonational units, and the specific pitch events at the phrase margins designated by the phrasal tones mark the boundaries between two such units. There are two intonational categories that characterize the phrasing of American English utterances: intermediary phrases (ip’s) and intonation phrases (IP’s). Consequently, there are also two types of phrasal tones to mark the end of these phrases.

In the case of intermediary phrases these tones are called phrase accents. A phrase accent ends an ip either at a point high (H-) or low (L-) in the speaker’s pitch range. Similarly, high (H%) or low (L%) boundary tones mark intonation phrase boundaries. However, boundary tones never occur on their own. The reason is to be found in the hierarchical nature of the phrasing of English utterances. An utterance is made up of one or more intonation phrases, which in turn consist of one or more intermediary phrases. Therefore the end of an IP is by definition also the end of an ip, and thus an IP boundary has two final tones.

There are four possibilities in combining phrase accents and boundary tones. L-L% is associated with a point very low in the speaker’s pitch range. In combination with a preceding H* pitch accent it is considered typical for both standard declarative utterances and Wh-questions (for more detailed information on the latter see especially Daly and Zue (1990)). In the case of L-H%, a low phrase accent closes the last ip and is followed by a high boundary tone. Syntactic yes-no questions usually end in a H-H% boundary tone, which is associated with a point extremely high in the speaker’s range. This is modelled by "upstepping" the boundary tone that follows the high phrase accent. Upstep also applies to H-L%, where the low boundary tone is raised to the level of H-, creating a plateau ("plateau contour").

The beginning of both ip’s and IP’s is usually not labelled with any phrasal tones. The label for the high initial boundary tone, %H, is used only if an intonation phrase begins high in the speaker’s range and if this fact cannot be attributed to a high pitch accent on the first syllable of the phrase’s first word because this syllable is never stressed (e.g., a schwa). In all other cases, a speaker begins an IP in the middle of the pitch range, and this default initial boundary is left unmarked by the ToBI transcription. Should the beginning of an ip not follow an IP-boundary but only another ip-boundary, no label is necessary either, since in such a case the preceding ip’s phrase accent is simply taken over as the first tone of the following ip.

2.3. Pitch accents

Pitch accents describe the particular F0 configurations on accented syllables. For this purpose the ToBI tone inventory contains five pitch accents. An accented syllable is phonologically linked to the starred tone of the pitch accent.

H* (peak accent) is associated with a tone target on the accented syllable in the upper part of a speaker’s pitch range, whereas L* (low accent) is the equivalent in the lowest part of the pitch range. For the two bitonal pitch accents L*+H and L+H*, however, the distinction between the starred and the unstarred tone is essential. L*+H (scooped accent) consists of a low tone target on the accented syllable immediately followed by a rise to the upper part of the speaker’s pitch range.


L+H* (rising peak accent), on the other hand, can be described as a high peak target on the accented syllable immediately preceded by a rise from a valley in the lower part of the speaker’s pitch range. The difference between these two pitch accents thus lies in their relation to the accented syllable as indicated by the starred tone. The fifth pitch accent is H+!H*. It is associated with a clear step down onto the accented syllable from a high pitch which cannot be accounted for by a preceding phrase or pitch accent. The "!" diacritic signals that the following high tone is downstepped.

According to the ToBI Guidelines, downstep is defined as a phonologically triggered compression of the pitch range that lowers F0 targets for any H tones subsequent to a downstep trigger.

(Beckman and Ayers 1994, p. 24)

In other words, after a downstepped tone the topline is lowered so that every following high tone occurs on the same level as the preceding downstepped tone. This is demonstrated in the example utterance <<weight>> (Fig. 2), where the topline is first lowered by downstep on "twelve" and a second time on "thousand". The following regular high tone on "pounds" is as high as the second downstepped tone.

Fig. 2: Downstep as a compression of pitch range, demonstrated in the example utterance <<weight>>

Nearly every high pitch or phrase accent can be downstepped, i.e. has a downstepped counterpart: !H*, L+!H*, L*+!H, !H-, !H-L%. The only exceptions are H-H% and the first H in H+!H*.

The discussion of downstep has shown that topline and baseline, which define the pitch range, are assigned independently, since only the topline is affected. The pitch range model for American English shown in Fig. 3 thus assumes two things: a horizontal topline, which is determined at every local F0 maximum and lowered only if phonologically triggered by downstep, and a continuously falling baseline.

Fig. 3: Illustration of the pitch range model used for American English (example contour: L+H* !H* L-L%). Labels in the figure: topline, baseline, pitch range, L+H*, !H* (downstep, compression of pitch range), L-L%.
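The pitch range model can be sketched in a few lines of Python. The amount by which downstep lowers the topline is not specified in this section, so the compression ratio, the fixed reference floor, the linear shape of the baseline and the function names below are hypothetical choices for illustration.

```python
def baseline_at(t, start_hz, slope_hz_per_s):
    """Continuously falling baseline: a straight line with negative slope."""
    return start_hz + slope_hz_per_s * t


def topline_per_tone(labels, topline_hz, floor_hz, downstep_ratio):
    """Return the topline in effect at each tone label.

    The topline stays horizontal and is lowered only when a label carries
    the "!" diacritic; the lowered value persists for all later tones.
    """
    result = []
    for label in labels:
        if "!" in label:
            # Compress the range above the (assumed) reference floor.
            topline_hz = floor_hz + downstep_ratio * (topline_hz - floor_hz)
        result.append(topline_hz)
    return result


# Hypothetical label sequence with two downsteps followed by a regular H*.
labels = ["H*", "!H*", "!H*", "H*", "L-L%"]
toplines = topline_per_tone(labels, topline_hz=200.0, floor_hz=100.0, downstep_ratio=0.6)
```

In this sketch the regular H* after the second downstep is assigned the same topline as the second downstepped tone, which corresponds to the behavior described for "pounds" in <<weight>>.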


The section concludes with a short summary of the ToBI tone inventory.

pitch accents
    monotonal: H* (!H*) ; L*
    bitonal: L+H* (L+!H*) ; L*+H (L*+!H) ; H+!H*

phrasal tones
    final
        phrase accents: H- (!H-) ; L-
        boundary tones: L-L% ; L-H% ; H-H% ; H-L% (!H-L%)
    initial
        beginning of an IP: %H ; default beginning in the middle of pitch range
        beginning of an ip (after an ip-boundary): taking over the preceding ip’s phrase accent

Tab. 1: Summary of the ToBI tone inventory
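The inventory summarized in Tab. 1 lends itself to a simple label check of the kind the generation program performs before placing targets. The following Python sketch is an illustration only. The set and function names are hypothetical, and the actual grammar check of the tool may be stricter.

```python
PITCH_ACCENTS = {"H*", "!H*", "L*", "L+H*", "L+!H*", "L*+H", "L*+!H", "H+!H*"}
PHRASE_ACCENTS = {"H-", "!H-", "L-"}
BOUNDARY_COMBINATIONS = {"L-L%", "L-H%", "H-H%", "H-L%", "!H-L%"}
INITIAL_BOUNDARY = {"%H"}

INVENTORY = PITCH_ACCENTS | PHRASE_ACCENTS | BOUNDARY_COMBINATIONS | INITIAL_BOUNDARY


def unknown_labels(labels):
    """Return the labels that are not part of the tone inventory in Tab. 1."""
    return [label for label in labels if label not in INVENTORY]


def ends_intonation_phrase(labels):
    """An IP must close with a phrase accent plus boundary tone combination."""
    return bool(labels) and labels[-1] in BOUNDARY_COMBINATIONS
```

A sequence such as `["%H", "H*", "L-L%"]` passes both checks, whereas a sequence that ends in a bare phrase accent closes only an ip, not an IP.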

3. Analysis of the ToBI labels

3.1. Basics

In this section we will associate the individual ToBI labels with specific target values in order to convert them into particular F0 configurations. It must be mentioned, though, that not all important F0 configurations are explicitly marked with ToBI labels, as some occur obligatorily in certain environments. The default beginning of the F0 contour of an IP in the middle of the pitch range is just one example of this; there are also other configurations, like valleys between high tones or special contours between tones opposed in pitch (high vs. low) that are more than three syllables apart.

Many occurrences of one individual tone label would have to be examined in order to arrive at a number big enough to allow the calculation of reliable average values for the corresponding target points. Unfortunately, for many particular configurations of the tone labels only a few examples could be found in the ToBI utterances.

In this analysis all target positions, regardless of whether they represent pitch accents or phrasal tones, are defined relative to the pitch range and the voiced part of the labelled syllable. Theoretically, many segments or periods of time within the syllable, even the whole syllable itself, could have been chosen to indicate the point in time at which a target is placed. The voiced part of the syllable was preferred to other possible criteria like the nucleus, the rhyme, or the voiced rhyme for several reasons. First, the individual values measured for every target exhibit a smaller variance than with other criteria. Second, there are several instances where a peak or a local F0 minimum occurs in the onset or the coda of a syllable. In the ToBI utterance <<word1>>, for example, the peak of a H* pitch accent in the last syllable of the IP occurs on the /w/, which corresponds to 22% of the syllable’s voiced part, but -22% of the nucleus (i.e. before the nucleus).

For the calculation of these averages the correct representation of an utterance’s syllable and phoneme structure is essential. Phonemes and syllable boundaries were labelled by hand. Syllable boundaries were placed according to the principle of maximizing onsets, also taking into account assimilations like /wO tSU/ instead of /wOt jU/ (for "what you") where the syllable boundaries do not agree with the word boundaries given on the ToBI orthographic tier. Thus, a phoneme tier and a syllable tier are added to the description of an example utterance. The transcription system used on the phoneme tier is SAM-PA (Speech Assessment Methods - Phonetic Alphabet) (Wells et al. 1992). The parameters just defined create the prerequisites for the determination of the target points that correspond to specific tone labels.
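The temporal reference used here can be made concrete with a small Python sketch that expresses a point in time as a percentage of a chosen span and converts such a percentage back to an absolute time. The timings below are hypothetical values chosen to reproduce the 22% and -22% pattern of the <<word1>> example; they are not measurements from the ToBI utterance. The function names are assumptions of this illustration.

```python
def relative_position(t, span_start, span_end):
    """Position of time t as a percentage of the span [span_start, span_end]."""
    return 100.0 * (t - span_start) / (span_end - span_start)


def absolute_time(percent, span_start, span_end):
    """Inverse mapping: place a target at a given percentage of the span."""
    return span_start + percent / 100.0 * (span_end - span_start)


# Hypothetical timings in seconds (not taken from the corpus).
peak = 0.11
voiced_part = (0.0, 0.5)
nucleus = (0.165, 0.415)

in_voiced_part = relative_position(peak, *voiced_part)  # about 22
in_nucleus = relative_position(peak, *nucleus)          # about -22
```

A negative percentage simply means that the target lies before the start of the reference span, which is why a peak on the onset consonant cannot be expressed as a position inside the nucleus.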

3.2. Pitch accents

The target points representing the starred tones of pitch accents vary depending on the accents’ position within the ip or IP. Four different cases can be distinguished.

In the very first syllable of an ip/IP, the pitch accent seems to be pushed back by the initial boundary and so occurs relatively late in the voiced part of the accented syllable. In the last syllable of an ip/IP, the exact opposite can be observed: the final boundary pushes the pitch accent forward to a position early in the voiced part of the syllable. In the normal, unmarked position, away from the extreme ends of the ip/IP, the target points of the starred tones of any pitch accent occur shortly after the middle of the accented syllable’s voiced part. The fourth case is very rare. An accented syllable can actually be the only syllable of a complete intermediate phrase (example utterance <<insert>>: [["I"]ip [means insert]ip]IP), and a pitch accent on it consequently occurs in the first and last syllable of the ip at the same time. The corresponding target point is then positioned at about the middle of the voiced part of the syllable. In the ToBI utterances only a few examples were found, all of them with H*. However, this cannot be taken as proof that this special case does not occur with other pitch accents. For this reason, the positions of the target points of the other pitch accents in the case of a one-syllable ip were estimated on the basis of their behavior in other positions of the phrase. In the presentation of the concrete values for all pitch accents, and also for phrasal tones, all estimates made necessary by a lack of examples are marked as such. Most of these estimates were tested by generating and resynthesizing utterances with the affected tone labels in the corresponding environment. The success of such tests can of course confirm neither these tones’ actual occurrence in these positions nor, therefore, the targets’ positions.

The most frequent pitch accent in American English is H*. The target points belonging to its four positional variants are shown in Tab. 2 along with some example utterances.

H*

Position in pitch range: topline; after a downstepped tone, at the level of that tone (lowered topline)

| Position of accented syllable | Position in voiced part of syllable | Example utterances |
|---|---|---|
| first syllable of ip | 85% | <<thought>>, <<elephant3>> |
| last syllable of ip | 25% | <<loan1>>, <<voice>> |
| one-syllable ip | 50% | <<insert>>, <<atlanta>> |
| normal case | 60% | <<jam1>>, <<voiced-h>> |

Tab. 2 Position of target points for all possible variants of H*
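The positional rule of Tab. 2 can be sketched as a small lookup. This is only an illustration, not the generation program itself: the function names, the zero-based syllable index, and the representation of the voiced part as start and end times in seconds are assumptions made for the example.

```python
# Fractions of the voiced part of the accented syllable, taken from Tab. 2.
H_STAR_POSITION = {
    "first": 0.85,         # first syllable of ip
    "last": 0.25,          # last syllable of ip
    "one_syllable": 0.50,  # accented syllable is the only syllable of the ip
    "normal": 0.60,        # any other position
}


def syllable_case(index, n_syllables):
    """Classify a syllable (zero-based index) within an ip of n_syllables."""
    if n_syllables == 1:
        return "one_syllable"
    if index == 0:
        return "first"
    if index == n_syllables - 1:
        return "last"
    return "normal"


def h_star_target_time(voiced_start, voiced_end, index, n_syllables):
    """Time (s) of the H* target inside the voiced part of the syllable."""
    fraction = H_STAR_POSITION[syllable_case(index, n_syllables)]
    return voiced_start + fraction * (voiced_end - voiced_start)
```

For a voiced part running from 1.0 s to 1.2 s, the target of an H* in the first syllable of a four-syllable ip would be placed at about 1.17 s, and in the last syllable at about 1.05 s.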

The downstepped counterpart !H* is realized in the same position of the voiced part of the syllable, but on a lowered topline. Our analysis has revealed that each time a downstepped high tone occurs, the pitch range is compressed by a factor of 0.75 on average. The downstepped high tone is thus assigned to this new topline. For the description of chains of downstepped tones this means that the topline is always lowered by a factor of 0.75 in relation to the preceding topline, even if the latter has already undergone downstepping.

The other monotonal pitch accent L* shows no differences from H* with regard to the positional variants.

L*

Position in pitch range: baseline (0%)

| Position of accented syllable | Position in voiced part of syllable | Example utterances |
|---|---|---|
| first syllable of ip | 85% | <<anna1>>, <<names>> |
| last syllable of ip | 25% | <<jam1>>, <<good2>> |
| one-syllable ip | 50% | estimated |
| normal case | 60% | <<elephant3>>, <<made4>> |

Tab. 3 Position of target points for all possible variants of L*
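The downstep rule stated above for !H* (each downstep lowers the current topline by a factor of 0.75, also within a chain) can be illustrated as follows. The sketch expresses the topline as a fraction of the original pitch range above the baseline, which is an assumption of this example; the function name is hypothetical.

```python
DOWNSTEP_FACTOR = 0.75  # average compression per downstepped high tone


def downstepped_toplines(topline, n_downsteps, factor=DOWNSTEP_FACTOR):
    """Return the topline after each of n_downsteps successive downsteps.

    topline is given as a fraction of the original pitch range (1.0 = 100%).
    Each step is computed relative to the preceding, possibly already
    lowered, topline.
    """
    values = []
    current = topline
    for _ in range(n_downsteps):
        current *= factor
        values.append(current)
    return values
```

A chain of three downstepped tones starting from the full topline would thus be placed at 75%, 56.25% and about 42.2% of the original range.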

The starred tones of bitonal accents can be described in much the same way as monotonal ones. For example, H* (or !H*) in L+H* simply occurs a little later in the voiced part of the syllable, probably because of the influence of the preceding, dependent tone L.

H* in L+H*

Position in pitch range: topline (when first H of the ip), else as high as preceding high tone

| Position of accented syllable | Position in voiced part of syllable | Example utterances |
|---|---|---|
| first syllable of ip | 90% | <<democrat>>, <<blond-baby1>> |
| last syllable of ip | 25% | <<heavy-rain>>, <<legumes>> |
| one-syllable ip | 70% | estimated |
| normal case | 75% | <<noone>>, <<mother4>> |

Tab. 4 Position of target points for all possible variants of H* in L+H*

The position of the target points representing L can only be determined in relation to the starred tone. L precedes H* by the fixed period of time of 0.2 s (reference point).

Fig. 4 Normal distance of 0.2 s between L and H*

However, this is only the case if there is in fact a voiced region occurring 0.2 s before H*. If not, the distance must be extended to 90% of the voiced region of the syllable preceding the reference point. As shown in Fig. 5, such an extension can more than double the distance. If there is also no voiced region in this more distant area because the phrase boundary is close, the distance must be shortened to 20% of the next voiced region of a syllable to the right of the reference point (Fig. 6).

Fig. 5 Distance of 0.45 s between L and H* (example utterance <<spoon2>>)

Fig. 6 Distance of 0.1 s between L and H* (example utterance <<pigs>>)

In addition to this, the target points for L of course must not intersect with those belonging to a preceding tone label. Below is a short summary of the features the targets representing L can have.

L in L+H*

| Parameter | Value |
|---|---|
| position in pitch range | 20% (not as low as a starred low tone) |
| position in voiced part of syllable | 0.2 s before H* (reference point); 90% of voiced region to left of reference point if voiceless at that point; 20% of voiced region to right of reference point if no voicing before that point; not farther to the left than target belonging to preceding tone label |

Tab. 5 Position of target points for all possible variants of L in L+H*
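The placement of the leading tone L summarized in Tab. 5 amounts to a reference point with two fallbacks. The following is a minimal sketch under stated assumptions: voiced regions are given as a time-ordered list of (start, end) pairs in seconds, one per syllable of the phrase, and a target that would precede the target of the preceding tone label is simply moved onto that target's time. The function and parameter names are hypothetical.

```python
def leading_tone_time(star_time, voiced_regions, offset=0.2,
                      prev_target_time=None):
    """Time (s) of the leading tone of a bitonal accent such as L+H*.

    star_time: time of the starred tone's target.
    voiced_regions: time-ordered list of (start, end) voiced stretches.
    offset: fixed distance to the reference point (0.2 s for L+H*).
    prev_target_time: target time of the preceding tone label, if any.
    """
    ref = star_time - offset
    time = None
    # Normal case: the reference point itself is voiced.
    for start, end in voiced_regions:
        if start <= ref <= end:
            time = ref
            break
    if time is None:
        left = [(s, e) for s, e in voiced_regions if e < ref]
        right = [(s, e) for s, e in voiced_regions if s > ref]
        if left:
            # 90% of the voiced region to the left of the reference point.
            s, e = left[-1]
            time = s + 0.9 * (e - s)
        elif right:
            # No voicing before the reference point: 20% of the next
            # voiced region to the right.
            s, e = right[0]
            time = s + 0.2 * (e - s)
        else:
            raise ValueError("no voiced region available")
    # Assumption: never place the target before the preceding target.
    if prev_target_time is not None:
        time = max(time, prev_target_time)
    return time
```

The same scheme, with the directions mirrored or with an offset of 0.15 s, would cover the dependent tones of the other bitonal accents described below.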

L*+H (L+!H*) is the exact opposite of L+H* as L* is the starred tone and the trail tone H follows after 0.2 s.

L* in L*+H

Position in pitch range: baseline (0%)

| Position of accented syllable | Position in voiced part of syllable | Example utterances |
|---|---|---|
| first syllable of ip | 55% | <<stein>> (2 examples) |
| last syllable of ip | 15% | estimated |
| one-syllable ip | 35% | estimated |
| normal case | 40% | <<eileen1>>, <<noodle1>> |

H in L*+H

| Parameter | Value |
|---|---|
| position in pitch range | topline (when first H of ip), else as high as preceding high tone |
| position in voiced part of syllable | 0.2 s after L* (reference point); 20% of voiced region to right of reference point if voiceless at that point; 90% of voiced region to left of reference point if no voicing after that point; not farther to the right than target belonging to following tone label |

Tab. 6 Position of target points for all possible variants of L*+H

The targets representing L* occur relatively early in the syllable, possibly to make room for the trail tone.

With H+!H* it is important to note that the two elements of the pitch accent are interdependent. The starred tone !H* is downstepped in relation to the preceding H, whereas H depends on the position of !H* in the labelled syllable since it precedes it by 0.15 s.

!H* in H+!H*

Position in pitch range: downstepped

| Position of accented syllable | Position in voiced part of syllable | Example utterances |
|---|---|---|
| first syllable | 90% | estimated |
| last syllable | 20% | <<mile>>, <<onions>> |
| one-syllable ip | 60% | <<nose>> |
| normal case | 60% | <<romanelli>>, <<sublime2>> |

H in H+!H*

| Parameter | Value |
|---|---|
| position in pitch range | 90% (when first high tone in ip), else 90% of preceding high tone (different downstep factor) |
| position in voiced part | 0.15 s before !H* (reference point); 90% of voiced region to left of reference point if voiceless at that point; 20% of voiced region to right of reference point if no voicing before that point or if reference point would be more than one syllable away from !H*; not farther to the left than target belonging to preceding tone label |

Tab. 7 Position of target points for all possible variants of H+!H*

3.3. Phrasal tones

3.3.1. Initial

There are three main configurations of target points at the beginning of an ip/IP, but only one of them (%H) is labelled in the ToBI system. The other two are sufficiently defined by the environment in which they occur. The beginning of an IP is marked by a default initial boundary starting in the middle of the pitch range. At the beginning of an ip which is not also the beginning of an IP the phrase accent ending the preceding ip is simply taken over as the first target point of the ip. All three possibilities have in common that the first target point is placed at the very beginning of the voiced part of the syllable.

The description of the F0 configurations associated with the default beginning may, however, require additional target points depending on the distance of the targets belonging to the first pitch accent of the IP (see Tab. 8).
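The default beginning of an IP, with the values listed in Tab. 8, can be sketched as a function that returns one or two targets depending on how far away the next target is and whether it is high or low. The function name and the representation of a target as a pair (fraction of pitch range, fraction of the voiced part of the first syllable) are assumptions of this example.

```python
def default_ip_beginning(next_target_is_high, distance_in_syllables):
    """Targets for the unlabelled default beginning of an IP (Tab. 8).

    Returns a list of (pitch_range_fraction, voiced_part_fraction) pairs,
    both relative to the first syllable of the IP.
    """
    if distance_in_syllables <= 2:
        # One target, already leaning towards the next target.
        level = 0.70 if next_target_is_high else 0.30
        return [(level, 0.0)]
    # Next target more than 2 syllables away: start in the middle of the
    # pitch range (P1) and add a second target (P2) at the end of the
    # voiced part of the first syllable.
    second = 0.75 if next_target_is_high else 0.25
    return [(0.50, 0.0), (second, 1.0)]
```

Called with a high next target one syllable away, the sketch yields a single target at 70% of the pitch range; with a low next target four syllables away it yields P1 at 50% and P2 at 25%.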

| | %H | takeover |
|---|---|---|
| position in pitch range | topline (100%) | taking over position of phrase accent ending preceding phrase (H-, !H-, L-) |
| position in voiced part of syllable | 0% | 0% (of first syllable of ip) |
| example utterances | <<bananas>>; <<voice>> | <<insert>> (L-); <<lazy>> (H-) |

Default beginning of IP

| | distance to next target ≤ 2 syllables, next target high | distance to next target ≤ 2 syllables, next target low | distance to next target > 2 syllables, next target high | distance to next target > 2 syllables, next target low |
|---|---|---|---|---|
| position in pitch range | 70% | 30% | P1: 50%; P2: 75% | P1: 50%; P2: 25% |
| position in voiced part of syllable | 0% (of first syllable) | 0% (of first syllable) | P1: 0%; P2: 100% (of first syllable) | P1: 0%; P2: 100% (of first syllable) |
| example utterances | <<flap>>; <<made3>> | <<stalin>>; <<made4>> | <<name1>>; <<voiced-h>> | <<jam1>> |

Tab. 8 Target points for initial phrasal tones

3.3.2. Final

All characteristic F0 configurations at the end of an ip or IP are marked with labels. In all cases the last target is placed at the very end of the voiced part of the last syllable in the phrase. For some boundary tones, though, it is necessary to add a preceding target in order to reproduce the typical shape of F0 associated with that particular tone label. The description of phrase accents and the boundary tones L-L% and H-H%, however, can be achieved very easily.

Phrase accents

| | H- (!H-) | L- |
|---|---|---|
| position in pitch range | topline (when first H of ip), else as high as preceding high tone; !H- → downstepped | baseline |
| position in voiced part of syllable | 100% | 100% |
| example utterances | <<lazy>>; <<heavy-rain>> (!H-) | <<insert>> |

Tab. 9 Target points for phrase accents H-, !H- and L-

For H-H% and L-L% it must be noted that for both boundary tones the target points’ position in the pitch range is outside the normal range. In the case of H-H%, this is due to upstep (see 2.2.); in the case of L-L%, it is caused by a discourse phenomenon called "final lowering" which signals that a topic is completed (Hirschberg and Pierrehumbert 1986, p. 138).

| | H-H% | L-L% |
|---|---|---|
| position in pitch range | 120% | -20% |
| position in voiced part of syllable | 100% | 100% |
| example utterances | <<names>>; <<jam1>> | <<thought>>; <<word>> |

Tab. 10 Target points for boundary tones H-H% and L-L%
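Positions such as 120% or -20% only become frequencies once a pitch range is chosen. The sketch below assumes a linear mapping in Hz between baseline (0%) and topline (100%); the source does not state the scale on which the pitch range is interpolated, and the baseline and topline values used in the example are hypothetical.

```python
def pitch_range_to_hz(percent, baseline_hz, topline_hz):
    """Map a pitch-range position in percent to a frequency in Hz.

    Assumes linear interpolation in Hz; values above 100% (upstep) or
    below 0% (final lowering) extrapolate beyond the normal range.
    """
    return baseline_hz + (percent / 100.0) * (topline_hz - baseline_hz)


# Hypothetical pitch range of 100-200 Hz:
h_h_boundary = pitch_range_to_hz(120, 100.0, 200.0)   # above the topline
l_l_boundary = pitch_range_to_hz(-20, 100.0, 200.0)   # below the baseline
```

With this hypothetical range, H-H% would end at about 220 Hz and L-L% at about 80 Hz.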

The F0 configurations associated with the other boundary tones are a little more complex. For this reason, two targets are needed to reproduce these configurations. While the second target never changes, the position of the first target depends on the distance and pitch range position of the last target belonging to the closest preceding pitch accent.

L-H%

| Last target point before L-H% | Distance | Position in pitch range | Position in voiced part of syllable | Example utterance |
|---|---|---|---|---|
| high | > 2 syllables | P1: baseline; P2: 80% | P1: 0%; P2: 100% | <<good-aft>> |
| high | < 2 syllables | P1: baseline; P2: 80% | P1: 50%; P2: 100% | <<cream>> |
| low | > 2 syllables | P1: 25%; P2: 80% | P1: 0%; P2: 100% | <<drive>> |
| low | < 2 syllables | P1: -; P2: 80% | P1: -; P2: 100% | <<tags>> |

Tab. 11 Target points for L-H%

H-L% and !H-L% complete the list of ToBI tone labels. !H-L% has to be treated separately, since in this case the downstepped counterpart cannot simply be described by downstepping H-.

| Boundary tone | Case | Position in pitch range | Position in voiced part of syllable | Example utterance |
|---|---|---|---|---|
| H-L% | last target before H-L% is high | P1: -; P2: topline | P1: -; P2: 100% | <<mile>> |
| H-L% | last target before H-L% is low; last target in last syllable of IP | P1: topline; P2: topline | P1: 50%; P2: 100% | estimated |
| H-L% | last target before H-L% is low; distance > 1 syllable | P1: topline; P2: topline | P1: 0%; P2: 100% | <<cheapest>> |
| !H-L% | last target in last syllable of IP | P1: downstep; P2: as high as P1 | P1: 50%; P2: 100% | <<spoon2>> |
| !H-L% | distance > 1 syllable | P1: downstep; P2: as high as P1 | P1: 20%; P2: 100% | <<calling>> |

Tab. 12 Target points for H-L% and !H-L%

3.4. Distance rules

When target points that are opposed in pitch are more than three syllables apart, the F0 contour takes on a particular shape that cannot be reproduced by simple interpolation. The generation process takes this into consideration by adding target points between the opposing target points. It is possible to distinguish two cases depending on whether the starting point of the distance rule is high or low.

| | LH_change (low target followed by high target after more than 3 syllables) | HL_change (high target followed by low target after more than 3 syllables) |
|---|---|---|
| Position in pitch range | P1: 75%; P2: downstep (factor 0.9) in relation to the last target with a high tone label | 25% |
| Position in voiced part of syllable | P1: 100% of syllable following syllable with low target; P2: 0% of syllable preceding syllable with high target | 100% of syllable following syllable with high target |
| Example utterance | <<manitowoc>> | <<flap>> |

Tab. 13 Distance rules
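A minimal sketch of the distance rules in Tab. 13 follows. Several points are assumptions made for illustration: syllables are given as (voiced start, voiced end) pairs in seconds, "more than three syllables apart" is read as a difference of more than 3 between syllable indices, and the downstep of P2 is read as 0.9 times the pitch-range position of the high target. The function name is hypothetical.

```python
def distance_rule_targets(kind, syllables, i_from, i_to, high_level=1.0):
    """Additional targets between two distant targets opposed in pitch.

    kind: "LH_change" (low then high) or "HL_change" (high then low).
    syllables: list of (voiced_start, voiced_end), one per syllable.
    i_from, i_to: syllable indices of the first and second target.
    high_level: pitch-range position of the high target (1.0 = topline).
    Returns a list of (time, pitch_range_fraction) pairs.
    """
    if i_to - i_from <= 3:
        return []  # close enough for plain interpolation
    if kind == "LH_change":
        # P1: end of the voiced part of the syllable after the low target.
        p1 = (syllables[i_from + 1][1], 0.75)
        # P2: start of the voiced part of the syllable before the high
        # target, slightly below that target (factor 0.9, assumed reading).
        p2 = (syllables[i_to - 1][0], 0.9 * high_level)
        return [p1, p2]
    if kind == "HL_change":
        # One target at the end of the syllable after the high target.
        return [(syllables[i_from + 1][1], 0.25)]
    raise ValueError("unknown distance rule: " + kind)
```

For a transition from L* to H-H% (120%) across five syllables, the sketch would insert P1 at 75% after the accented syllable and P2 at 108% just before the final syllable.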

The diagrams in Fig. 7 demonstrate the difference between direct linear interpolation and interpolation using additional target points.

Fig. 7 Demonstration of the necessity of distance rules (LH_change and HL_change), comparing direct linear interpolation with interpolation using additional target points

The example utterance <<manitowoc>> (Fig. 8) shows the natural transition from a L* pitch accent to a H-H% boundary tone. It becomes clear that normal linear interpolation would not be sufficient to reproduce the contour.

Fig. 8 LH_change in <<manitowoc>>

3.5. Valleys

Between two high target points the F0 contour typically falls and rises, taking on the shape of a valley. Valleys occur only between targets that belong to two different high tone labels, but not between two targets that belong to the same label (like H-H%) or when an additional target that is not associated with a label is involved (as in the distance rules or the default beginning). The exact shape of a valley is very difficult to describe. Pierrehumbert 1981 (p. 990) develops a complex function, mainly motivated by linguistic principles, that combines the frequency values of the peaks, the lower baseline value, and the distance between the two peaks. However, this function creates very deep valleys that may even hit the baseline provided that the two target points are far enough apart (0.8 s). Our observations in the ToBI utterances, on the other hand, have shown that the valleys are never particularly deep regardless of the frequency values or the distance between the two peaks. For this reason our generation program generates much flatter valleys. It agrees with Pierrehumbert’s observations in that it distinguishes three main ranges of the distance between the peaks, although they differ slightly from hers. If the distance between the two target points is smaller than 0.25 s, no valley is generated at all. If it is between 0.25 s and 0.5 s, one additional target point is placed exactly in the middle, and if the distance is greater than 0.5 s, two intermediate targets are generated, each 0.25 s from one of the peaks.
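The three distance ranges can be sketched as follows. The function only returns the times of the intermediate targets; how far below the peaks they are placed is not specified in this section, so the sketch leaves the depth out. The function name is hypothetical.

```python
def valley_target_times(t_first_peak, t_second_peak):
    """Times (s) of the intermediate valley targets between two high targets."""
    distance = t_second_peak - t_first_peak
    if distance < 0.25:
        return []  # peaks too close: no valley
    if distance <= 0.5:
        # One target exactly in the middle.
        return [t_first_peak + distance / 2]
    # Two targets, each 0.25 s away from one of the peaks.
    return [t_first_peak + 0.25, t_second_peak - 0.25]
```

Two peaks at 1.0 s and 2.0 s would thus receive intermediate targets at 1.25 s and 1.75 s, while peaks only 0.2 s apart would be connected directly.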

4. Evaluation of the generated contours

The quality of the rules in section 3, and consequently also of the ToBI labels, can be examined in two distinct ways.

First, the shape of the generated contours can be evaluated strictly on the basis of the measured and calculated difference between generated and original contour. The root mean squared error (RMSE) and the correlation coefficient are appropriate measures for this purpose (cf. Dusterhoff and Black 1997, Portele and Heuft 1997 and Ross 1995). The resulting numerical values represent an objective evaluation of the generated contours and also allow, to a limited extent, a comparison with other methods of F0 generation.

This method of evaluating the generated contours will be described in more detail in section 4.3. It does have the disadvantage, however, that it indicates differences between the generated contour and the original whether they are perceptually relevant or not. Listeners may not even be aware of big deviations in certain positions, while in other positions small differences may change the interpretation of an utterance or make it sound unnatural (cf. House 1990).

Since perception has to be the crucial factor in assessing the quality of the generation rules, it makes more sense to evaluate the generated contours auditorily, that is, to resynthesize them and have the resynthesized utterances judged by listeners.

For this reason a perception experiment was devised in order to assess the overall quality of the intonation rules (see section 4.2). The rules were applied to utterances from the Boston Radio News Corpus and to a small extent to ToBI utterances containing types of intonation patterns that do not normally occur in a news corpus (see also section 4.1).3

To our knowledge there are presently no alternative prosodically labelled corpora available that could be used instead.
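The two objective measures named above, RMSE and the correlation coefficient, can be computed as in the following sketch. It assumes that both contours are available as equally long lists of F0 values (e.g. in Hz) sampled at the same points in time; how the contours are sampled and aligned is not specified here, and the function names are chosen for the example.

```python
import math


def rmse(generated, original):
    """Root mean squared error between two equally sampled F0 contours."""
    if len(generated) != len(original) or not generated:
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((g - o) ** 2 for g, o in zip(generated, original))
    return math.sqrt(squared / len(generated))


def correlation(generated, original):
    """Pearson correlation coefficient between two F0 contours."""
    if len(generated) != len(original) or not generated:
        raise ValueError("contours must be non-empty and of equal length")
    n = len(generated)
    mean_g = sum(generated) / n
    mean_o = sum(original) / n
    cov = sum((g - mean_g) * (o - mean_o)
              for g, o in zip(generated, original))
    var_g = sum((g - mean_g) ** 2 for g in generated)
    var_o = sum((o - mean_o) ** 2 for o in original)
    return cov / math.sqrt(var_g * var_o)
```

RMSE captures the absolute deviation in the unit of the contours, whereas the correlation only reflects how similarly the two contours move, so the two values complement each other.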

4.1 The Boston Radio News Corpus

The Boston Radio News Corpus (BRNC for short) offers the possibility of applying the generation rules developed from ToBI data to an independent set of utterances.

According to its creators the BRNC is

...a corpus of professionally read radio news data, including speech and accompanying annotations, suitable for speech and language research.

(Ostendorf et al. 1995, p.1)

It contains over seven hours of read speech recorded from seven radio announcers at WBUR, a public radio station in Boston. The main purpose of the corpus is to support research in text-to-speech synthesis, especially the generation of intonation patterns. It is claimed that the radio news speaking style is more natural than "normal" read speech, thus offering some of the advantages of nonread speech.

Especially as far as prosody is concerned,

there is evidence that [...] newscasters use more clear and consistent indications of prosodic structure than non-professional read speech ...

(Ostendorf et al. 1995, p.3)

For our purposes the BRNC has the big advantage that its F0 patterns are already described by ToBI labels.

As speaker f2b’s recordings are the most completely annotated (tone labels, phoneme labels, and full orthography), utterances from this speaker were used in the perception test. Unfortunately, the individual utterances are much too long to be properly judged by listeners in a perception test. For this reason they were cut into smaller sections (named after especially characteristic words, e.g. <<intact>> or <<JNC>>). The sections were chosen in such a way as to include the highest possible variety of F0 configurations, i.e., tunes. However, as indicated before, it is in the nature of news reports that they do not contain certain types of intonation patterns, particularly those typical of yes-no questions (e.g. L* H-H%) or exclamations (e.g. L+H* H-L%). On the other hand, there are a great number of instances of declarative sentences or phrases which indicate that more information on the same subject is to follow ("continuation rise", e.g. H* L-H%).

On the whole it can be said that the BRNC represents a very good instrument for testing the majority of intonation patterns that occur in American English.

4.2. The perception test

The perception test described in this section was devised merely to assess the general quality of the generation rules. It cannot completely fulfil the task of evaluating the generated intonation patterns only, while excluding all other possible influences on the judgment of the listeners. Evaluating the perception of synthesized speech is a complex cognitive process, as it involves the contribution of semantic, syntactic, morphological, segmental as well as prosodic aspects and is affected by external influences such as the length of the stimuli. Strictly speaking, in an optimal experimental setting for the evaluation of prosody alone, intonational meaning, metrical structure, segmental form (coarticulation, length) and morphosyntactic correlates would have to be dealt with separately.

Unfortunately, to our knowledge no such experimental designs have been established as yet. For this reason we relied on general listening impressions in our prosody evaluation. Readers may judge the quality of the generated stimuli for themselves (http://www.ims.uni-stuttgart.de/phonetik/matthias/ToBI). In the test, stimuli are presented via a WWW browser, allowing native speakers of American English anywhere in the world with internet access (and the necessary audio hardware) to take the test and look at the results.

Seven native speakers of American English participated in the experiment, which tests 24 different utterances with regard to their naturalness and difference from the corresponding original version. The raters were all between 20 and 30 years of age and had no experience at all in intonation research. They received no instructions in addition to those on the webpage, which tell them to click on the play button for listening and on one of the rating buttons for evaluation. The instructions also emphasize that listeners should concentrate on intonation and ignore other aspects such as sound quality.

The test consists of two parts. In the first part, 60 stimuli (utterances) are presented in random order and rated by the listeners with respect to the naturalness of the perceived intonation on a scale from 0 to 5. In order to provide some sort of anchor for consistent criteria, the numerical values are associated with explanatory comments such as 0 (not at all natural) or 5 (completely natural).4 The second part contains 35 pairs of stimuli, usually the original and the resynthesized version of an utterance. In similar fashion to the first part, listeners assess the difference between both versions on a scale from 0 (no difference) to 3 (very big difference).5

The raters control the pace of the test themselves. By clicking on the corresponding button, they can listen to an utterance as often as they want, so that it takes a reasonably thorough rater about 45 minutes to complete the test. This freedom in taking the test is meant to help the listener evaluate stimuli of differing length or differing degree of dependence on the (removed) context in a consistent way.

It certainly is easier to recognize differences between two versions of a short utterance than it is for a pair of longer utterances. The possibility of listening to a stimulus repeatedly should, however, help the rater concentrate on the different individual sections of a larger utterance, leading to a reduction of this effect and thus an appropriate rating.

Very similarly, utterances that are presented without context may appear strange and consequently not receive a high naturalness rating. Again, the possibility of listening to the stimulus repeatedly should help the listener become more familiar with it and weaken the less natural impression.

In order to get an impression of the judges’ consistency, all stimuli in the naturalness test were presented twice in the course of the experiment, while in the comparison part only a selected number of stimulus pairs were repeated due to constraints on the test duration. The listeners also unknowingly rated the naturalness of original utterances and compared completely identical stimuli (either original or resynthesized), which indicated both their general attitude towards some particular intonation patterns (some judges found the news intonation characteristic of the BRNC and/or the occasionally exaggerated intonation in ToBI examples to be rather unnatural) and their degree of attention during the test.6

The examination of naturalness and difference from the original provides quite basic information about the overall quality of the generated contours. It also allows us to draw some (limited) conclusions as to whether a generated contour in fact retains that part of the utterance meaning which is associated with a certain tonal configuration, i.e., a tune.

The tunes that an intonation phrase is made up of express discourse meanings (relevant to information structure). H* L-L%, for example, is typical for declaratives and Wh-questions (see also 2.2.), L* H-H% is usually used in syntactic yes-no questions, H* H-H% in confirmation questions. With the plateau contour H* H-L% the speaker adds information to a preceding statement; by using L* L-H% he or she can express the conviction that the hearer already knows the content of the statement, etc. (see Pierrehumbert and Hirschberg 1990 and Mayer 1997 for a detailed description of these tune meanings).

While one cannot automatically assume that a high rating for the degree of naturalness is equivalent to preserving the same interpretation, it is almost always the case (an apparent exception is described in the example <<dream>> later in this section). Little or no audible difference between original and resynthesized version does, however, allow the conclusion that the original tune meaning has been preserved.

In the reverse case it is also possible to draw some inferences about the preservation of tune meaning from a very poor rating of the degree of difference (i.e., a large perceived difference). Provided that the original utterance itself is judged by listeners to be quite natural, it is possible to conclude that a "relatively unnatural" resynthesized version is very likely to be uninterpretable. If, on the other hand, the resynthesized contour sounds natural, but also very different, it most probably expresses a different meaning.

It would be much more difficult to infer anything conclusive from a comparison based on an original utterance that is not accepted as natural by the judges in the first place. In such a case the listeners will probably make up various interpretations that fit the context/the lexical environment and transfer their interpretation to the resynthesized equivalents.

On the whole, however (apart from the conceptual considerations mentioned above), testing naturalness and difference from the respective original does indeed to a certain extent provide information as to whether the generation rules are adequate to the task of expressing an intended meaning. For this reason it is important to cover as many tune meanings as possible in the perception test. Since the Boston Radio News Corpus does not contain the complete variety of different tunes (see 4.1), a small number of ToBI example utterances were added to the list of stimuli in order to fill these gaps.

Tab. 14 shows a selection of stimuli. The name of the tested utterance, the predominant tonal events (tunes) that occur in the intonation phrases that make up the utterance, and the average ratings of naturalness and difference from the corresponding original are indicated.

| Name of utterance | Most important tune(s) of utterance | Average rating of naturalness, 0 (not at all natural) to 5 (completely natural) | Average rating of difference from original, 0 (no difference) to 3 (very big difference) |
|---|---|---|---|
| <<JNC>> | H* L-H% / H* L-L% | 4.25 | 0.79 |
| <<abilities>> | H* H* H- H* !H* L-L% | 4.38 | 1.07 |
| <<agenda>> | H* L+!H* L-H% / H* L-H% / H* L-L% | 4.00 | 0.43 |
| <<challenge>> | L+H* L* !H* L-L% | 4.19 | 0.21 |
| <<gains>> | L+H* H* L- H* L-H% | 4.13 | 0.86 |
| <<hardact>> | H* L- H* L-L% | 4.19 | 0.57 |
| <<hennessy>> | H+!H* !H* !H* !H* L-L% | 3.88 | 1.00 |
| <<intact>> | L* L-H% / L* H* !H* L-L% | 4.75 | 0.21 |
| <<ninety>> | H* H-L% / H* H* L-L% | 4.06 | 0.57 |
| <<ralph>> | L+H* !H- L+H* L-H% / H* H* !H-L% | 3.63 | 0.43 |
| <<road>> | L+H* L-H% / L* L-L% | 3.75 | 1.50 |
| <<season>> | H* H* !H- H* !H* L-L% | 3.88 | 0.57 |
| <<mile>> (ToBI) | H* H-L% / H+!H* L-L% | 3.69 | 0.00 |
| <<deream>> (ToBI) | H* L-L% | 3.31 | 1.43 |

Tab. 14 Average ratings of a selection of the resynthesized utterances and tunes (both original and resynthesized versions are available at http://www.ims.uni-stuttgart.de/phonetik/matthias/ToBI)

The naturalness rating for all tested utterances combined amounts to 3.81. For the 17 tested Boston utterances alone it is 4.05, distinctly better than the average of 3.24 for the 7 ToBI utterances.

Some of the worse ratings for ToBI utterances can be explained by the unusual, often exaggerated intonation patterns in some utterances (<<manitowoc>> (3.56), <<bananas>> (3.56) or <<deream>> (3.31)) that were regarded by many judges as not quite natural in the first place. In the case of <<jam1>> a combination of this fact and very poor sound quality leads to the extremely bad rating of 1.0.

In the interest of obtaining a more meaningful frame of reference, 5 original Boston utterances (<<agenda>>, <<hardact>>, <<intact>>, <<jay>>, <<season>>) were mixed in with the resynthesized ones. The unsuspecting judges assigned an average rating of 4.39 to these natural utterances (in the sense of a maximum reference). In comparison, the corresponding resynthesized versions scored 4.19 on average. Evidently, this small difference must lead to a more favorable interpretation of the naturalness ratings of the other resynthesized utterances.

The optimal rating 5 (completely natural) was awarded in 39.8% of all cases (Boston: 45.2%; ToBI: 26.8%), rating 4 in 31.0% of all cases (Boston: 31.6%; ToBI: 29.5%), thus covering the majority of stimuli. The rating 0 (not at all natural) occurred only in 3.4% of all cases (Boston: 0.4%; ToBI: 10.7%).

The difference between the rule-generated version and the original was rated 0.80 on average. Again, the Boston utterances (14 in number) were judged to be of a higher quality than the 10 tested ToBI utterances (0.76 vs. 0.86). The test also included comparisons of completely identical utterances (<<april>>, <<deborah>>, <<hardact>>, <<ralph>>), which were correctly recognized as such by the listeners (average rating: 0.04).

The rating 0 (no difference) was awarded in 43.4% of all cases (Boston: 41.8%; ToBI: 45.7%), while a small difference (rating 1) was recognized in 35.4% of all pairs of stimuli (Boston: 41.3%; ToBI: 27.2%).

Thus, the general quality of the resynthesized utterances was quite high and rated accordingly. The resynthesized version of <<challenge>>, shown in Fig. 9, is a good example. Of the judges, 78.6% did not detect any difference at all, while the other 21.4% heard only a small difference (average rating: 0.21).

Fig. 9 <<challenge>> top: original contour; middle: generated contour; bottom: label tiers

Some problems can arise in complex utterances that include several IPs with different pitch ranges. Unfortunately, the pitch range can currently be adjusted only for the whole utterance. It is possible, however, to assign certain registers to the individual IPs within an utterance. These registers can expand, reduce, lower, or otherwise shift the chosen pitch range in the respective IPs. This is demonstrated in the example utterance <<season>> (ratings: 3.88 (naturalness) and 0.57 (difference)), which has to use expanded, shifted, and lowered registers (see extra register tier in Fig. 10) in order to approximate the original contour.

Fig. 10 <<season>> top: original version; middle: generated version; bottom: label tiers (%xr = expanded register; %r = default register; %sr = shift register; %lr = lower register)

Some examples diverge a little farther from the determined averages and lead to small differences between the originals and the resynthesized versions, but they are few in number and do not result in different interpretations. In the case of an H* pitch accent in the last syllable of an IP, however, it can happen that the corresponding target point occurs much later than predicted by the average of 25%, resulting in a different interpretation. The difference is depicted in Fig. 11.

Fig. 11 <<dream>> top: original version; middle: generated version; bottom: label tiers

In the generated version the peak occurs relatively early in the voiced part of the syllable. The phrase <<dream>> is then interpreted as a normal, not particularly marked question. The original version was interpreted differently by the native speakers. They felt that the word "dream" was focused and that the utterance "So what did you dream?" had to be interpreted in a context where a person has announced that (s)he is going to talk about what (s)he has dreamed but then digresses and instead talks about what (s)he has seen, heard, done, etc. A second person now finally wants to know what the first person has dreamed and asks the question with the H* peak relatively late in the one-syllable word "dream." It seems that the later peak emphasizes that dreaming has been chosen from a number of possible alternatives. Although it is certainly very tempting to look for a connection between focus and the late occurrence of the peak, this hypothesis must remain purely speculative since <<dream>> is the only such example in the ToBI utterances. In any case, the observed difference in interpretation is not due to imprecise intonation rules.

A very plausible explanation was proposed by Mark Steedman (p.c.). He suggests that the speaker in this utterance uses a dialect with the tendency to split clusters of plosives and liquids into two syllables, e.g., /pə - liz/ for please (IPA transcription). It is indeed possible to analyze the speaker's pronunciation of the word dream the same way (/də - rim/). Only /rim/ would then be the last syllable of the utterance and the generation program would place a target point at 25% of the voiced part of this shorter syllable. Consequently, this target would occur later in the word and thus be closer to the actual location of the original peak.

A much more common problem is connected with the fact that the generation process determines the voiced part of a syllable based on the phoneme labels and not by directly looking at the F0 contour. Many fricatives and plosives that should be voiced by conventional transcription standards are not represented on the F0 contour. The s of the word means in the utterance <<insert>> ("I" means insert), for example, should be a voiced /z/, since it follows a nasal and precedes a vowel. Nevertheless, it does not show up on the F0 contour. Also, since the generation program is label-based, the voiceless closure phase of a voiced plosive is included in the voiced part of the syllable. Both these problems cause additional small inaccuracies that are, fortunately, not significant. In the future such inaccuracies could be avoided by abandoning the voiced part of the syllable as the criterion for a target's position within a labelled syllable and replacing it with only the sonorants of the syllable.

A last small problem concerns the so-called calling contour (e.g. (L+)H* !H-L% in <<calling>> and <<calling2>>), where the pitch accent's peak stays on the topline for a longer period of time (approximately 0.5 s) and is thus insufficiently represented by just one target point. This problem should be solved relatively easily by treating such utterances, which usually consist of one proper name, as a single unit.

4.3. Numerical evaluation

A numerical evaluation of the generated intonation contours was carried out in order to provide some concrete, objective numbers to assess their quality and to allow a comparison with other methods of intonation generation.

This is commonly achieved by determining two statistical measures: the root mean squared error (RMSE) and the correlation coefficient r.

While the RMSE is a simple distance measure between two F0 contours, the correlation coefficient indicates whether the two contours exhibit the same tendencies by measuring the deviation from the mean F0 at regularly set points in time. If the two contours rise or fall in the same position, then the correlation coefficient is close to 1. If the F0 movements are uncorrelated, it equals 0, and if they are contrary (rise vs. fall), it is negative. Thus, the correlation coefficient is an appropriate method for determining the degree of agreement in the pitch movements of two contours.

Both measures are crucially dependent on a coherent corpus, as they are based on the respective pitch range and mean F0 of the individual utterances. For this reason comparisons to other methods of generation only make sense if drawn for the same corpus. This implies that while a comparison of the calculated measures can be regarded as meaningful, the interpretation of the measures themselves may be quite difficult. The ToBI example utterances, for instance, consist not only of a mix of several male and female voices, but also include utterances with exaggerated intonation of a very high pitch. Consequently, the variations in pitch range across the individual utterances add up to a high average mean F0, which in turn results in a misleadingly good correlation coefficient of 0.860. It may be added though that the main function of the ToBI utterances was to provide the basic data for the development of the rules (training corpus).

The results for the utterances taken from the Boston Radio News Corpus (BRNC), on the other hand, can be considered to be more meaningful. For the generated contours included in the perception test a root mean squared error of 32.4 Hz and a correlation coefficient of r = 0.605 were determined. For the interpretation of these values in comparison with other methods of generation it has to be taken into account, however, that they were calculated only for the 19 test stimuli (approximately 90 s of speech) taken from larger sections in the BRNC (speaker f2b, see 4.1). Also, we adjusted pitch range individually for each utterance (see 4.2).

On the whole it can nevertheless be stated that our rule-based generation of intonation contours is on a level with other methods. Results are relatively similar to those determined also for the BRNC using the rule-based method first described by Anderson et al. (1984) (RMSE: 44.7 Hz; correlation coefficient r = 0.40; see note 7), the method of linear regression in Black and Hunt (1996) (RMSE: 34.8 Hz; correlation coefficient r = 0.62) or the generation based on tilt events as described in Dusterhoff and Black (1997) (RMSE: 32.5 Hz; correlation coefficient r = 0.60).
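The two measures used in this section can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: it assumes that both contours have already been sampled at the same regularly spaced points in time and that unvoiced frames have been dropped beforehand. The function names and the sample values are hypothetical.

```python
import math


def rmse(original, generated):
    """Root mean squared error (Hz) between two equally sampled F0 contours."""
    if not original or len(original) != len(generated):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((o - g) ** 2 for o, g in zip(original, generated))
    return math.sqrt(squared / len(original))


def correlation(original, generated):
    """Pearson correlation coefficient r between two equally sampled F0 contours."""
    n = len(original)
    if n < 2 or n != len(generated):
        raise ValueError("contours must have equal length of at least 2")
    mean_o = sum(original) / n
    mean_g = sum(generated) / n
    # Deviations from the respective mean F0 at each sampling point.
    cov = sum((o - mean_o) * (g - mean_g) for o, g in zip(original, generated))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_g = sum((g - mean_g) ** 2 for g in generated)
    if var_o == 0 or var_g == 0:
        raise ValueError("correlation is undefined for a completely flat contour")
    return cov / math.sqrt(var_o * var_g)


# Hypothetical contours (Hz): same shape, generated version 10 Hz too high.
original_f0 = [200.0, 220.0, 240.0, 220.0, 200.0]
generated_f0 = [210.0, 230.0, 250.0, 230.0, 210.0]
print(rmse(original_f0, generated_f0), correlation(original_f0, generated_f0))
```

The example illustrates why the two measures complement each other: a constant offset leaves the correlation at its maximum while the RMSE registers the distance between the contours.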

5. Conclusion

In summary, it can be stated that the results of the rule-based generation of intonation patterns of American English are on the whole very satisfactory. Describing the F0 contours by means of the ToBI labels and the rules that depend on them, which are based on the Tone Sequence Model, is quite appropriate. With some of the improvements proposed in section 4, the generation program can be expected to become even more accurate.

Utterances from the Boston Radio News Corpus scored better than ToBI utterances in the perception test. This may be due to their superior sound quality and the fact that several of the ToBI utterances contain somewhat exaggerated intonation patterns that were regarded as unnatural by some of the judges (see also 4.2). On the whole, the method of generating American English intonation patterns presented in this study compares very well with other methods, as far as statistical measures are concerned. Perceptual impressions were not tested in the other studies of intonation generation mentioned above. Our PSOLA-based approach certainly yields results that sound more natural than, e.g., those of the previously used LPC analysis-resynthesis method.

The main difference from other methods of resynthesizing intonation from ToBI labels consists particularly in the more direct modeling of alignment points. Anderson et al. (1984) and Silverman (1987), for example, used a method of alignment in which a stylized F0 contour is smoothed by a gliding window filter, whereas the approaches presented by Ross (1995) or Black and Hunt (1996) are data-based.

Another extension pertains to the explicit modeling of register values introduced in cases where numerous IPs enter into complex pragmatic relationships (see Fig. 10). A phonological description of the registers occurring in discourse prosody is provided in Mayer (1997).

The generation rules (in their current state) have been incorporated into the synthesis system FESTIVAL, which is being developed at the Centre for Speech Technology Research at the University of Edinburgh (see Black and Taylor 1997). The quality of the rules can thus be tested using text-to-speech conversion with FESTIVAL.

Our approach also has the advantage of using a linguistic model for the description of F0 contours, which may offer valuable orientation in the analysis (or even teaching and learning) of specific tonal events. The rules are derived from material based on the prescriptive ToBI labelling tutorial. Hence, they display a high degree of prescriptive precision and realizational invariance. Additionally, extra care has been taken to account for the interdependence between the F0 contour and the specific parts of segmental structure in relation to which the respective contour is generated.

It is our view that as more and more insights into the nature of the interdependence between F0 and syllabic/segmental structure are gained (see for example House 1990 and 1996, Beckman 1997 and van Santen and Möbius 1997), an even more exact modeling of it will be made possible, contributing to further progress in improving the naturalness of synthesized intonation.

Notes

1. The set of example utterances can be obtained at the address ftp kiwi.nmt.edu (or internet address ftp 129.138.1.82). More detailed instructions can be found in the ToBI-Guidelines or in the README file.

It is also possible to get an audio tape of the utterances and a printed paper copy of the labelling guide and F0 tracks at this address:

ToBI Labelling Guide, c/o Mary Beckman
Ohio State University, Linguistics Dept.
222 Oxley Hall, 1712 Neil Ave.
Columbus, OH 43210-1298
USA

2. H+!H*, incidentally, replaces H+L* in Pierrehumbert’s original tone inventory (Pierrehumbert 1980). This seems much more appropriate, since the starred tone is still in the upper half of the pitch range and not in the lower half as L* would suggest. H*+L, the sixth pitch accent in Pierrehumbert’s original inventory, is replaced in a similar way by a regular monotonal high pitch accent followed by a downstepped monotonal pitch accent (H*. . . !H*).

3. Any example utterances with such F0 patterns can be tested within the speech synthesis system FESTIVAL into which our generation rules have been incorporated.

4. 0 (not at all natural), 1 (not very natural), 2 (somewhat natural), 3 (reasonably natural), 4 (quite natural), 5 (completely natural)

5. 0 (no difference), 1 (small difference), 2 (big difference), 3 (very big difference)

6. One particular listener, for example, rated the same stimulus once with 5 and once with 1 and also heard a "very big difference" (3) between two identical stimuli. He was excluded from the analysis of the results. All other listeners took the test seriously (despite not being paid).

7. Values for RMSE and correlation coefficient taken from Black and Hunt (1996).

6. Appendix

Detailed summary of rules for the placement of target points

Pitch Accents

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| H* | first syllable of ip | 85% | topline (when first H of the ip), else as high as preceding high tone |
| | last syllable of ip | 25% | topline (when first H of the ip), else as high as preceding high tone |
| | one-syllable-ip | 50% | topline (when first H of the ip), else as high as preceding high tone |
| | else (normal case) | 60% | topline (when first H of the ip), else as high as preceding high tone |
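The placement rule for H* can be read as a lookup followed by a linear interpolation within the voiced part of the accented syllable. The sketch below illustrates that reading only. The function name, the environment keys, and the representation of the voiced part as a start and end time in seconds are assumptions made for illustration and are not taken from the generation program.

```python
# Relative position of the H* target within the voiced part of the syllable,
# as listed in the table for H* (0.0 = start, 1.0 = end of the voiced part).
H_STAR_POSITION = {
    "first_syllable_of_ip": 0.85,
    "last_syllable_of_ip": 0.25,
    "one_syllable_ip": 0.50,
    "normal": 0.60,
}


def target_time(voiced_start, voiced_end, environment, positions=H_STAR_POSITION):
    """Return the time (s) of the target point inside the voiced part."""
    if voiced_end <= voiced_start:
        raise ValueError("voiced part must have positive duration")
    fraction = positions[environment]
    return voiced_start + fraction * (voiced_end - voiced_start)


# Hypothetical voiced part from 1.00 s to 1.20 s.
print(target_time(1.00, 1.20, "normal"))
print(target_time(1.00, 1.20, "last_syllable_of_ip"))
```

With these hypothetical times the normal case places the target at about 1.12 s, and the last-syllable case at about 1.05 s. The tables for the other pitch accents can be plugged in through the `positions` argument.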

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| L* | first syllable of ip | 85% | baseline |
| | last syllable of ip | 25% | baseline |
| | one-syllable-ip | 50% (est.) | baseline |
| | else (normal case) | 60% | baseline |

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| L+H*: H* (target tone) | first syllable of ip | 90% | topline (when first H of the ip), else as high as preceding high tone |
| | last syllable of ip | 25% | topline (when first H of the ip), else as high as preceding high tone |
| | one-syllable-ip | 70% (est.) | topline (when first H of the ip), else as high as preceding high tone |
| | else (normal case) | 75% | topline (when first H of the ip), else as high as preceding high tone |
| L+H*: L (preceding tone) | normal case | 0.2 s before H* (= reference point) | 20% |
| | if voiceless region 0.2 s before H* (= reference point) | 90% of voiced region to left of reference point | 20% |
| | if no voicing at and before reference point that does not belong to a preceding tone label (if a preceding point is set at a distance > 0.2 s) | 20% of voiced region to right of reference point | 20% |
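The three cases for the preceding L tone of L+H* amount to a fallback search around a reference point 0.2 s before the H* target. The following sketch shows one possible reading. It assumes that voiced regions are given as time-sorted `(start, end)` pairs in seconds, and it models "voicing that belongs to a preceding tone label" with a hypothetical `left_limit` parameter: voiced regions starting before that time are treated as unavailable. Both the function name and this simplification are assumptions, not the authors' implementation.

```python
def leading_low_target(h_time, voiced_regions, left_limit=float("-inf"), offset=0.2):
    """Return (time in s, position in pitch range) for the L of L+H*.

    voiced_regions: time-sorted list of (start, end) tuples in seconds.
    left_limit: hypothetical cut-off; voicing starting before it is taken
                to belong to a preceding tone label and is not used.
    """
    low_height = 0.20  # 20% of the pitch range in all three cases
    reference = h_time - offset

    # Normal case: the reference point itself is voiced.
    for start, end in voiced_regions:
        if start <= reference <= end:
            return reference, low_height

    # Voiceless reference point: 90% of the voiced region to its left.
    left = [(s, e) for s, e in voiced_regions if e < reference and s >= left_limit]
    if left:
        start, end = left[-1]
        return start + 0.9 * (end - start), low_height

    # No usable voicing at or before the reference point:
    # 20% of the voiced region to its right.
    right = [(s, e) for s, e in voiced_regions if s > reference]
    if right:
        start, end = right[0]
        return start + 0.2 * (end - start), low_height

    raise ValueError("no voiced region available for the L target")


regions = [(0.0, 0.3), (0.5, 0.9)]        # hypothetical voiced regions
print(leading_low_target(0.80, regions))   # reference point is voiced
print(leading_low_target(0.65, regions))   # falls back to the region on the left
```

The same structure, mirrored in time, would apply to the trail tone of L*+H and, with an offset of 0.15 s, to the preceding H of H+!H*.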

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| L*+H: L* (target tone) | first syllable of ip | 55% | baseline |
| | last syllable of ip | 15% (est.) | baseline |
| | one-syllable-ip | 35% (est.) | baseline |
| | else (normal case) | 40% | baseline |
| L*+H: H (trail tone) | normal case | 0.2 s after L* (reference point) | topline (when first H of the ip), else as high as preceding high tone |
| | if voiceless region 0.2 s after L* (reference point) | 20% of voiced region to right of reference point | topline (when first H of the ip), else as high as preceding high tone |
| | if no voicing at and after reference point that does not belong to a following tone label | 90% of voiced region to left of reference point | topline (when first H of the ip), else as high as preceding high tone |

The high tones of the pitch accents H*, L+H* and L*+H can also occur downstepped (!H*, L+!H*, L*+!H; the target tone in the bitonal pitch accent H+!H* is always downstepped). In that case their position in the pitch range is still on the topline. However, the topline is lowered since the pitch range has been compressed by a factor of 0.75. Thus the frequency value of a downstepped high tone is 0.75 times that of the preceding high tone, which itself could also be downstepped in relation to another preceding high tone (see also 2.3. and 3.2.).
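The cumulative effect of downstep can be illustrated with a short sketch. It follows the literal statement that each downstepped high tone is scaled by 0.75 relative to the preceding high tone, and that a high tone without downstep is as high as the preceding high tone. If the compression is instead meant relative to a non-zero baseline, the resulting values would differ, so the code should be read as one interpretation. The function name and the sample topline are hypothetical.

```python
DOWNSTEP_FACTOR = 0.75  # compression factor for downstepped high tones


def high_tone_values(topline_hz, labels):
    """Return the F0 value (Hz) of each high tone in one ip, in order.

    labels: tone labels of the high tones, e.g. "H*" or "!H*".
            A "!" in the label marks the tone as downstepped.
    """
    values = []
    current = topline_hz  # the first high tone of the ip sits on the topline
    for label in labels:
        if "!" in label:
            current *= DOWNSTEP_FACTOR  # lower the topline for this tone
        values.append(current)          # else: as high as preceding high tone
    return values


# Hypothetical topline of 200 Hz.
print(high_tone_values(200.0, ["H*", "!H*", "!H*", "H*"]))
# [200.0, 150.0, 112.5, 112.5]
```

The last H* in the example is not reset to the original topline, because the rule tables place a non-initial high tone as high as the preceding high tone.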

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| H+!H*: !H* (target tone) | first syllable of ip | 90% (est.) | downstepped (in relation to own preceding H) |
| | last syllable of ip | 20% | downstepped (in relation to own preceding H) |
| | one-syllable-ip | 60% | downstepped (in relation to own preceding H) |
| | else (normal case) | 60% | downstepped (in relation to own preceding H) |
| H+!H*: H (preceding tone) | normal case | 0.15 s before !H* (reference point) | 90% (when first high tone in ip), else downstepped (factor: 0.9) |
| | if voiceless region 0.15 s before !H* (reference point) | 90% of voiced region to left of reference point | 90% (when first high tone in ip), else downstepped (factor: 0.9) |
| | if no voicing at and before reference point that does not belong to a preceding tone label | 20% of voiced region to right of reference point | 90% (when first high tone in ip), else downstepped (factor: 0.9) |
| | if reference point is more than one syllable away from target tone !H* | 20% of voiced region to right of reference point | 90% (when first high tone in ip), else downstepped (factor: 0.9) |

Phrasal Tones

Initial

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| %H | - | 0% | topline |
| Takeover (no label) | at the beginning of an ip following another ip ending with a phrase accent (H-, !H- or L-) | 0% (of first syllable of ip) | taking over position of phrase accent ending preceding ip |
| Default beginning of IP (no label) | distance to next target ≤ 2 syllables and next target high | 0% (of first syllable of IP) | 70% |
| | distance to next target ≤ 2 syllables and next target low | 0% (of first syllable of IP) | 30% |
| | distance to next target > 2 syllables and next target high | P1: 0%; P2: 100% (of first syllable of IP) | P1: 50%; P2: 75% |
| | distance to next target > 2 syllables and next target low | P1: 0%; P2: 100% (of first syllable of IP) | P1: 50%; P2: 25% |
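The default rule for an unlabelled IP beginning is a small decision table, which the following sketch encodes directly. Positions are expressed as fractions (0.0 to 1.0) of the voiced part of the first syllable and of the pitch range. The function name and the argument names are assumptions chosen for illustration.

```python
def default_initial_targets(syllables_to_next_target, next_target_is_high):
    """Return the default target points at the beginning of an IP.

    Each target is a (position in first syllable, position in pitch range)
    pair, both given as fractions between 0.0 and 1.0.
    """
    if syllables_to_next_target <= 2:
        # A single target at the very start of the first syllable.
        return [(0.0, 0.70 if next_target_is_high else 0.30)]
    # Longer distance: two targets, P1 at the start and P2 at the end
    # of the first syllable, moving towards the next target.
    return [(0.0, 0.50), (1.0, 0.75 if next_target_is_high else 0.25)]


print(default_initial_targets(1, True))    # [(0.0, 0.7)]
print(default_initial_targets(4, False))   # [(0.0, 0.5), (1.0, 0.25)]
```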

Final

Phrase Accents

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| H- | - | 100% (of last syllable of ip) | topline (when first H of the ip), else as high as preceding high tone |
| !H- | - | 100% (of last syllable of ip) | downstepped |
| L- | - | 100% (of last syllable of ip) | baseline |

Boundary Tones

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| H-H% | - | 100% (of last syllable of IP) | 120% (upstep) |
| L-L% | - | 100% (of last syllable of IP) | -20% (final lowering) |
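Positions in the pitch range are given as percentages, with H-H% and L-L% deliberately leaving the range (120% and -20%). The sketch below shows how such a percentage could be turned into a frequency value. It assumes a linear mapping in Hz between baseline (0%) and topline (100%). That mapping, the function name, and the sample range are assumptions for illustration, since the scale used by the generation program is not restated in this appendix.

```python
def range_to_hz(position, baseline_hz, topline_hz):
    """Map a relative pitch-range position to Hz (assumed linear scale).

    position: 0.0 = baseline, 1.0 = topline; values outside this interval
              model upstep (e.g. 1.2) and final lowering (e.g. -0.2).
    """
    if topline_hz <= baseline_hz:
        raise ValueError("topline must lie above baseline")
    return baseline_hz + position * (topline_hz - baseline_hz)


# Hypothetical pitch range from 100 Hz (baseline) to 200 Hz (topline).
print(range_to_hz(1.20, 100.0, 200.0))   # H-H%: above the topline
print(range_to_hz(-0.20, 100.0, 200.0))  # L-L%: below the baseline
```

Under these assumptions H-H% ends 20 Hz above the topline and L-L% ends 20 Hz below the baseline.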

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| L-H% | last target before L-H% is high and distance is > 2 syllables | P1: 0%; P2: 100% (of last syllable of IP) | P1: baseline; P2: 80% |
| | last target before L-H% is high and distance is ≤ 2 syllables | P1: 50%; P2: 100% (of last syllable of IP) | P1: baseline; P2: 80% |
| | last target before L-H% is low and distance is > 2 syllables | P1: 0%; P2: 100% (of last syllable of IP) | P1: 25%; P2: 80% |
| | last target before L-H% is low and distance is ≤ 2 syllables | 100% (of last syllable of IP) | 80% |

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| H-L% | last target before H-L% is high | 100% (of last syllable of IP) | topline (when first H of the ip), else as high as preceding high tone |
| | last target before H-L% is low and in last syllable of the IP | P1: 50% (est.); P2: 100% (of last syllable of IP) | P1: topline (see above); P2: topline (see above) |
| | last target before H-L% is low and distance is ≥ 1 syllable | P1: 0%; P2: 100% (of last syllable of IP) | P1: topline (see above); P2: topline (see above) |

| ToBI tone label | Variants depending on environment | Position in voiced part of syllable | Position in pitch range |
|---|---|---|---|
| !H-L% | last target before !H-L% is in last syllable of the IP | P1: 50%; P2: 100% | P1: downstepped; P2: as high as P1 |
| | distance of last target before !H-L% is ≥ 1 syllable | P1: 20%; P2: 100% | P1: downstepped; P2: as high as P1 |

Distance Rules

| Description of the phenomenon | Position in voiced part of the syllable | Position in pitch range |
|---|---|---|
| LH_Change: a low target is followed by a high target with the distance > 3 syllables | additional target points: P1: 100% of syllable following syllable with low target; P2: 0% of syllable preceding syllable with high target | P1: 75%; P2: downstepped (factor 0.9) in relation to the last target with a high tone label |
| HL_Change: a high target is followed by a low target with the distance > 3 syllables | additional target point: 100% of syllable following syllable with high target | 25% |
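The distance rules insert extra target points between two widely separated targets of opposite height. The sketch below encodes that logic under stated assumptions: syllables are identified by integer indices, the distance is taken as the difference between these indices, and the pitch specification is returned symbolically rather than in Hz. None of these representational choices comes from the generation program itself.

```python
def distance_rule_targets(first_syllable, first_is_high, second_syllable, second_is_high):
    """Return additional targets between two consecutive tone targets.

    Each result is (syllable index, position in syllable, pitch spec), where
    pitch spec is ("range", fraction) or ("downstep", factor), the latter
    meaning downstepped relative to the last target with a high tone label.
    """
    distance = second_syllable - first_syllable  # assumed distance measure
    if distance <= 3 or first_is_high == second_is_high:
        return []
    if not first_is_high:
        # LH_Change: P1 at the end of the syllable after the low target,
        # P2 at the start of the syllable before the high target.
        return [
            (first_syllable + 1, 1.0, ("range", 0.75)),
            (second_syllable - 1, 0.0, ("downstep", 0.9)),
        ]
    # HL_Change: one point at the end of the syllable after the high target.
    return [(first_syllable + 1, 1.0, ("range", 0.25))]


print(distance_rule_targets(2, False, 7, True))
print(distance_rule_targets(2, True, 7, False))
print(distance_rule_targets(2, False, 4, True))   # too close: no extra targets
```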

References

M.D. Anderson, J.B. Pierrehumbert and M.Y. Liberman (1984), "Synthesis by rule of English intonation patterns", Proceedings of Int. Conf. Acoust., Speech, Signal Proc., New York, pp. 2.8.1.-2.8.4.

M. E. Beckman (1997), "Speech Models and Speech Synthesis", in: J. van Santen, R. W. Sproat, J. P. Olive and J. Hirschberg, eds., Progress in Speech Synthesis (Springer, New York)

M. E. Beckman and G. M. Ayers (1994), Guidelines to ToBI Labelling. Version 2.0

M. E. Beckman and J. Hirschberg (1994), The ToBI Annotation Conventions.

A. W. Black and A. Hunt (1996), "Generating F0 contours from ToBI labels using linear regression", Proceedings of the International Conference on Spoken Language Processing, Philadelphia, pp. 1385 - 1388

A. W. Black and P. Taylor (1997), Festival Speech Synthesis System: System documentation (1.1.1), Human Communication Research Centre, Technical Report HCRC/TR-83

D. Bolinger (1958), "A Theory of Pitch Accent in English", Word 7, pp. 199 - 210

G. Bruce (1977), Swedish Word Accents in Sentence Perspective (CWK Gleerup / Liber Laromedel, Lund)

G. Bruce (1995), "Modeling Swedish Intonation for Read and Spontaneous Speech", Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 2, Stockholm, pp. 28 - 35

N. Daly and V. Zue (1990), "Acoustic, Perceptual, and Linguistic Analyses of Intonation Contours in Human/Machine Dialogues", Proceedings of the International Conference on Spoken Language Processing, Kobe, pp. 12.4.1. - 14.4.4.

K. Dusterhoff and A.W. Black (1997), "Generating F0 contours for speech synthesis using the Tilt Intonation Model", Proceedings of the ESCA Workshop on Intonation: Theory, Models and Applications, Athens, pp. 107 - 110

H. Fujisaki (1988), "A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour", in: O. Fujimura, ed., Vocal physiology: voice production, mechanisms and functions (Raven, New York), pp. 347 - 355

E. Gårding (1983), "A generative model of intonation", in: A. Cutler and D.R. Ladd, eds., Prosody: models and measurements (Springer, Berlin), pp. 11 - 25

M.A.K. Halliday (1967), Intonation and Grammar in British English (Mouton, The Hague)

J. Hirschberg (1995), "Prosodic and other Acoustic Cues to Speaking Style in Spontaneous and Read Speech", Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 2, Stockholm, pp. 36 - 43

J. Hirschberg and J. B. Pierrehumbert (1986), "The Intonational Structuring of Discourse", Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, New York, pp. 136 - 144

D. House (1990), Tonal Perception in Speech (Lund University Press, Lund)

D. House (1996), "Perception of Prepausal Tonal Contours: Implications for Automatic Stylization of Intonation", Proc. of ICSLP, Philadelphia, pp. 949 - 952

K.J. Kohler (1991), "Prosody in speech synthesis: the interplay between basic research and TTS application", Journal of Phonetics, Vol. 19, pp. 121 - 138

E.J. Kutik, W.E. Cooper and S. Boyce (1983), "Declination of fundamental frequency in speaker’s production of parenthetical and main clauses", J. Acoust. Soc. Amer., Vol. 73, pp. 1731 - 1738

J. Mayer (1997), Intonation und Bedeutung: Aspekte der Prosodie-Semantik-Schnittstelle im Deutschen (Doctoral Dissertation, University of Stuttgart)

G. Möhler (1998), Theoriebasierte Modellierung der Deutschen Intonation für die Sprachsynthese (Doctoral Dissertation, University of Stuttgart)

G. Möhler and G. Dogil (1995), "Test Environment for the Two-Level Model of German Prominence", Proceedings of Eurospeech, Madrid, pp. 1019 - 1022

E. Moulines and F. Charpentier (1990), "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, Vol. 9, pp. 453 - 467

J.D. O’Connor and G.F. Ward (1959), Intonation of Colloquial English (Longman, London)

M. Ostendorf, P.J. Price and S. Shattuck-Hufnagel (1995), The Boston University Radio News Corpus. Technical Report ECS-95-001, Electrical, Computer and Systems Engineering Department, Boston University, Boston, MA

J. B. Pierrehumbert (1979), "The Perception of Fundamental Frequency Declination", J. Acoust. Soc. Amer., Vol. 66, pp. 363 - 369

J. B. Pierrehumbert (1980), The Phonology and Phonetics of English Intonation (PhD Dissertation, MIT)

J. B. Pierrehumbert (1981), "Synthesizing Intonation", J. Acoust. Soc. Amer., Vol. 70, pp. 985 - 995

J. B. Pierrehumbert and J. Hirschberg (1990), "The Meaning of Intonational Contours in the Interpretation of Discourse", in: P. Cohen, J. Morgan and M. Pollock, eds., Intentions in Communications (MIT Press, Cambridge, Mass.), pp. 271 - 311

T. Portele and B. Heuft (1997), "Towards a prominence-based speech synthesis system", Speech Communication, Vol. 21, pp. 61 - 72

K. Ross (1995), Modeling of Intonation for Speech Synthesis (PhD Dissertation, Boston University)

I. Sag and M. Liberman (1975), "The Intonational Disambiguation of Indirect Speech Acts", Papers from the Eleventh Regional Meeting, Chicago Linguistics Society, Chicago, pp. 487 - 497

K. Silverman (1987), The Structure and Processing of Fundamental Frequency Contours (PhD Dissertation, University of Cambridge)

K. Silverman et al. (1992), "ToBI: A Standard for Labelling English Prosody", Proceedings of the 1992 International Conference on Spoken Language Processing, pp. 867 - 870

N.G. Thorsen (1988), "Standard Danish intonation", Annual Report of the Institute of Phonetics (Univ. of Copenhagen) ARIPUC, Vol. 22, pp. 1 - 23

R. Vanderslice and P. Ladefoged (1972), "Binary Suprasegmental Features and Transformational Word-accentuation Rules", Language, Vol. 48, pp. 819 - 838

J. van Santen and B. Möbius (1997), "Modeling pitch accent curves", Proceedings of the ESCA Workshop on Intonation: Theory, Models and Applications, Athens, pp. 321 - 324

J. Wells, W. Barry, M. Grice, A. Fourcin and D. Gibbon (1992), Standard Computer Compatible Transcription. Esprit Project 2589 (SAM). Doc. no. SAM-UCL-037. London; Phonetics and Linguistics Dept. UCL

D.E. Zimmer (1988), So kommt der Mensch zur Sprache (Haffmans, Zürich)