Original Paper

Ruth E. Cumming
Centre for Neuroscience in Education, Department of Experimental Psychology, University of Cambridge
Downing Street
Cambridge, CB2 3EB (UK)
Tel. +44 122 376 7509, E-Mail [email protected]

Fax +41 61 306 12 34
E-Mail [email protected]

© 2011 S. Karger AG, Basel
0031–8388/10/0674–0219 $26.00/0
Accessible online at: www.karger.com/pho

Phonetica 2010;67:219–242
Received: October 18, 2010
Accepted: January 10, 2011
DOI: 10.1159/000324132

The Interdependence of Tonal and Durational Cues in the Perception of Rhythmic Groups

Ruth E. Cumming

Centre for Neuroscience in Education, Department of Experimental Psychology, University of Cambridge, Cambridge, UK

Abstract

This experiment investigated whether rising fundamental frequency (f0) and increased duration are interdependent perceptual cues to rhythmic groups in speech, and whether this depends on listeners’ native language. In a two-alternative forced-choice decision task, listeners had to indicate whether they perceived a 3 + 2 or 2 + 3 grouping in 5-syllable-long series of various digits and letters, in which f0 and duration were manipulated in the second and/or third syllable. The listeners were native speakers of Swiss German, Swiss French, or French. These languages differ from each other in terms of prosodic properties involving f0 and duration. The results demonstrate that rising f0 and increased duration are interdependent cues to rhythmic groups, since the two cues were significantly more effective than one when heard simultaneously, but significantly less effective than one when heard in conflicting positions around the rhythmic-group boundary location. Moreover, listeners’ native language influenced whether f0 or duration was the more effective cue. These findings are important not only for their insight into the perception of prosodic groups in speech, but also because they have implications for current research on speech rhythm.

Copyright © 2011 S. Karger AG, Basel

1. Introduction

1.1. Prosodic Groups and Rhythm in Speech

The idea that continuous speech comprises definable groups has existed for a long time; Halliday [1960] was the first to explicitly discuss that these groups form a hierarchical structure [Ladd, 1996, p. 237]. This concept has since been generally accepted amongst phonologists under the term ‘prosodic hierarchy’ [e.g. Nespor and Vogel, 1986; Pierrehumbert, 1980]. Different-sized prosodic groups within the hierarchy have been given various labels (e.g. intonation(al) phrase/group/unit, tone group/unit, tune, rhythmic group/unit, phonological phrase, accentual phrase, breath group etc.).


The term used in this paper is ‘rhythmic group’, since this was considered most suitable for the experiment reported, which was conducted as part of a series of experiments concerning speech rhythm. However, it could be argued that this experiment should refer to ‘phrasal grouping’ rather than ‘rhythmical grouping’, to distinguish it from other studies that have investigated grouping at a different level in the prosodic hierarchy. This experiment concerns grouping at a phrase level, whereas other experiments on rhythmical grouping [e.g. Hay and Diehl, 2007; Wagner, 2010] have concerned the metrical foot level. It is possible that groups at these different levels are produced and perceived using different acoustic cues. Nevertheless, the term ‘rhythmic group’ was retained here, because groups at various different levels could all contribute to the perception of rhythm in different languages, though probably some levels more than others, and because this research has a rhythm-based perspective. This important point about different levels of grouping is discussed further in section 4.

Although prosodic groups are widely accepted phenomena, it is often unclear how to identify them in spontaneous speech data [Crystal, 1969; Ladd, 1996]. Carlson et al. [2002] suggested that groups are difficult to identify phonetically because it is not fully understood how groups are perceptually cued by fundamental frequency (f0) movement, pre-boundary lengthening, voice-quality changes etc. Similarly, Guaïtella [1999] argued that groups are perceptual rather than production phenomena. It could also be argued that prosodic-group perception might not concern phonetic cues to boundaries alone, but groups’ internal structural coherence as well.

Nevertheless, there is some perceptual evidence that phonetic properties at boundaries cue prosodic groups. De Rooij [1976] and Bouwhuis et al. [1978] constructed series of seven /da/ monosyllables with different f0 contours and certain lengthened vowels. For each series, listeners had to choose the best-fitting sentence from a list of syntactically varied Dutch sentences. Results showed that group-final lengthening, and lengthening plus dynamic pitch, but not dynamic pitch alone, were effective cues to within-sentence prosodic groups. De Pijper and Sanderman [1994] and Swerts [1997] found that untrained listeners perceived boundaries of different prosodic strength (i.e. different groups in the hierarchy) in sentences; boundary-strength ratings were higher the more phonetic cues were present (e.g. f0 movement, pre-boundary lengthening, pauses). Carlson and Swerts [2003] demonstrated that when these phonetic cues were present, listeners could predict upcoming boundaries. Cutler et al. [1997, p. 162] summarised experiments that showed which cues listeners used to locate prosodic-group boundaries in speech, including durational, tonal or amplitude-related cues. Several experiments have shown that listeners recalled a word/digit series better if it comprised groups with boundaries marked by phonetic cues like pauses and f0 movements [e.g. Crowder, 1982; Frankish, 1985, 1989; Reeves et al., 2000]. In such an experiment, the listeners of Reeves et al. [2000] were not instructed to use grouping strategies, but spontaneously used prosodic cues to group stimuli and recall information.

The question of which cues are significant in the perception of groups in continuous speech is important, not only for our understanding of the phenomenon of perceptual groups, but also because it is relevant to research on speech rhythm. As far back as some early psychology experiments investigating rhythm, it was recognised that the perception of groups in a sound sequence was important in the perceived rhythm of that sound sequence [Fraisse, 1982, pp. 155–164]. These experiments on rhythm found that different acoustic cues affected grouping perception in different ways; for example, in a series of clicks, those with stronger intensity were perceived as group-initial, and those with longer duration as group-final [e.g. Bolton, 1894; Woodrow, 1909]. However, research on speech rhythm has generally not focussed on the importance of perceptual grouping, particularly not in terms of cues other than duration. A recent exception is Wagner [forthcoming], who suggested how the fundamental grouping principles that had been shown in experiments using non-linguistic stimuli (e.g. clicks and tones) may also apply to rhythm in language. Hay and Diehl [2007], in an experiment similar to the early rhythm experiments with click series, found that listeners perceived rhythmic groups within series of /ɡɑ/ syllables; again, increased duration had a group-final effect and stronger intensity a group-initial effect. This suggests that speech rhythm may follow the grouping principles of non-speech rhythm, but to fully assess this, further experiments would be needed with stimuli that are more representative of natural language than /ɡɑ/ series.

Furthermore, it has been suggested that listeners’ perception of groups may depend on their native language. For example, Jakobson et al. [1952] claimed that a series of knocks with every third being louder is perceived in different groupings by Czech, French and Polish speakers, influenced by their native language’s word-/phrasal-prominence location. Iversen et al. [2009] provided some experimental evidence for this effect of native language; in recurring sequences of alternating short-long tones, English and Japanese listeners perceived (respectively) short-long and long-short groups, perhaps because short function words precede longer content words in English, but follow in Japanese. However, Iversen et al. [2009] found no cross-linguistic difference in perceived grouping for amplitude-alternating tones, and Hay and Diehl’s [2007] similar experiment with /ɡɑ/ syllables found no difference in perceived grouping between English and French listeners. The question of whether listeners’ native language influences their perception of rhythmic groups is generally under-researched [Cutler et al., 1997], and in particular with speech stimuli that reflect natural language.

1.2. Research Questions

The experiment reported in this paper joins the relatively few perceptual studies concerning which acoustic properties cue prosodic groups in continuous speech. More specifically, it investigates whether f0 and duration are interdependent cues to rhythmic groups in longer utterances, and if so, whether listeners’ native language influences the way in which f0 and duration are interdependent cues.

In order to answer these questions, an experiment inspired by House’s [1990] research was designed. He recorded a Swedish speaker repeating the word [fe͡ɛm] (‘five’); one token was selected and replicated, and the f0 contour was variously manipulated on different replicates, which were then spliced together to form stimuli of five fives (‘55555’). These stimuli tested how dynamic f0 and mean f0 level contributed to the perception of group boundaries, as listeners had to indicate whether they heard 55-555 or 555-55. Duration could conceivably contribute as a group-boundary cue in this context. In the present experiment therefore, f0 and duration were manipulated in similar stimuli, to investigate whether these are interdependent cues to rhythmic groups within five-syllable utterances.

In order to answer the other question investigated in this experiment (whether listeners’ native language influences their use of prosodic-group cues), the languages tested need to have prosodic structures which are realised with f0 and duration in divergent ways, so that the influence of listeners’ native-language prosody would be likely to manifest itself in different responses for each language group of listeners. For this reason, Swiss German (henceforth SG), Swiss French (henceforth SFr) and French (i.e. from France, henceforth Fr) were chosen as the test languages.

1.2.1. Brief Description of the Languages Tested

SG is the collective name for the linguistically heterogeneous Alemannic dialects spoken within Switzerland, which differ so much from (standard) German, that many are not mutually intelligible with it [Haas, 1982]. This experiment concerned SG speakers living in the Zürich canton. SG is not a standard language (according to e.g. Haugen’s [1966] model); in contrast, Fr is a standard language with an official written form, and it could be said that SFr and Fr are two ‘varieties’ of this language. SFr is generally similar to Fr, but differs subtly in phonetics/phonology and some lexical items. Unlike Fr, SFr has word-final vowel-length contrasts [Andreassen, 2006; Grosjean et al., 2007]; according to stereotype, SFr is slower than Fr and has unusual intonation and accentuation [see e.g. Grosjean et al., 2007, pp. 2–3; Knecht and Rubattel, 1984, p. 142]. Yet little empirical evidence exists to support these impressionistic claims [Miller, 2007]. It has also been said that SFr is becoming increasingly homogenised with Fr [Singy, 1996; Voillat, 1971].

According to analyses of SG intonation [Fitzpatrick-Cole, 1999; Nolan and Hausmann, 2005], accented syllables have an f0 rise, which may begin late in the syllable and continue into the following syllable; accented syllables are also lengthened [Häsler et al., 2005; Leeman and Siebenhaar, 2007]. The position of accented syllables depends on lexical stress, which mostly occurs on the word-initial syllable [Fleischer and Schmid, 2006]. SG also has a late, sharp, intonation-phrase-final f0 fall, which native speakers do not perceive as prominent [Fitzpatrick-Cole, 1999; Nolan and Hausmann, 2005]. SG has relatively high durational variability, of both vowels and consonants, when measured with rhythm metrics (the pairwise variability index, %V/ΔV/ΔC) [Cumming, 2010; Schmid, 2001]. This may reflect the fact that SG has phonemic consonant- and vowel-length contrasts, complex syllables, and vowel reduction in unstressed syllables [Fleischer and Schmid, 2006].

According to recent analyses of Fr intonation and accentuation [e.g. Di Cristo, 1999, 2000; Jun and Fougeron, 2000; Post, 2000], each accentual phrase (AP) has one obligatory accented syllable in final position (i.e. accentuation is phrasal rather than dependent on lexical stress), and this AP-final syllable may have falling or rising f0; APs can also have an optional initial rising accent. The AP-final accented syllable, but not the optional AP-initial accented syllable, shows extensive lengthening [e.g. Vaissière, 1991]. It has been suggested that the different types of Fr prominence could have different primary cues: duration for AP-final prominence and f0 for AP-initial prominence [Benguerel, 1971, 1973; Di Cristo, 1999, 2000; Vaissière, 1991]. For SFr, Miller [2007] gave a similar overall account, though her data showed some subtle between-variety differences: AP-initial and AP-final rises may start later and earlier, respectively, in SFr than in Fr. (S)Fr has relatively low durational variability, of both vowels and consonants, when measured with rhythm metrics (pairwise variability index, %V/ΔV/ΔC) [e.g. Cumming, 2010; Grabe and Low, 2002; Ramus et al., 1999]. This may reflect the fact that (S)Fr has no phonemic consonant- or vowel-length contrasts (though vowel-length contrasts exist in SFr), simple (often CV) syllables, and little vowel reduction [Vaissière, 1991].

As well as monolingual speakers of these three languages, SG-SFr bilinguals were included in this experiment. If listeners’ native language influences the way in which f0 and duration are interdependent cues, subjects who acquired both SG and SFr in infancy (i.e. bilinguals) might behave differently when completing the task in each language, since prosody perception develops early in infancy [e.g. Nazzi and Ramus, 2003; Nazzi et al., 2006]. If native language has no effect, all monolinguals and bilinguals should respond similarly. With either outcome, SG-SFr bilinguals provide more insight into how language-universal/-specific the perceptual interdependence of duration and f0 is. Details of how it was ensured that these bilinguals were exposed to both SFr and SG from early infancy onwards are found in section 2.1.

1.3. Hypotheses

It was predicted that listeners would perceive rhythmic-group boundaries using length and pitch cues as follows:
(i) (a) when neither cue was available, the number of responses indicating that one or other of the possible groupings was perceived would be around chance;
    (b) when one cue was available, the number of responses indicating that increased duration or rising f0 was the cue that listeners used would be high;
(ii) when both cues were available and
    (a) accorded (i.e. the same syllable had increased duration and rising f0), the number of responses indicating that these cues were used would be even higher;
    (b) conflicted (i.e. one syllable had increased duration and a neighbouring syllable had rising f0 around the possible boundary location), the number of responses indicating that these cues were used would be considerably lower.

The rationale for hypothesis (i) comes from previous experiments on perceived grouping (section 1.1), which provided evidence for the psychological reality of within-utterance groups, when durational and/or tonal cues to these groups were present in the speech signal. Hypothesis (ii) is more specific to this experiment, since it predicts how durational and tonal cues could be interdependent in the perception of rhythmic groups. No hypothesis was proposed for the potential effect of listeners’ native language, because there has been little research that has directly addressed this question of cross-linguistic variation in perceptual grouping. The ‘Discussion’ returns to this question in light of this experiment’s results.

2. Method

2.1. Subjects

All subjects reported normal hearing, and were offered a small payment. There were 36 SG, 38 SFr, 36 Fr (all monolingual) and 20 bilingual subjects; the mean age of each group was 25.7, 24.1, 24.4 and 23.8 years, respectively.

A simple definition of ‘monolingual’ and ‘bilingual’ does not exist [for discussion, see Galloway, 2007]. In this experiment, distinct monolingual and bilingual groups were defined by three factors: ambient language in childhood, and age and method of second-language acquisition. Relatively few Swiss people acquire two languages in infancy; most acquire one native language depending on which linguistic community of Switzerland they grew up in, and they learn second languages at school. For the monolinguals, each subject grew up in one country with one ambient language in childhood; some had temporarily lived abroad in adulthood, e.g. a gap year after schooling. None had started to learn a second language before compulsory second-language classes in school (starting age 9–11). For the bilinguals, all had begun to acquire both languages aged 4 or below in a naturalistic environment (i.e. not second-language classes). They had one of two types of language acquisition background: 9 subjects had parents who each spoke a different language to them, so they had two ambient languages from birth; 11 subjects were born and brought up in an area where the ambient language was different from that spoken at home, so they were immersed in the other language very early through nursery school and friendships with local children. The bilinguals had to rate how native-like their spoken language and comprehension currently are in both languages (5 = native, 0 = no ability); all scored themselves as 5 for one language, and 4 or 5 for the other language.

The SG subjects were recruited in Zürich, and tested in Zürich University Phonetics Laboratory’s sound-attenuated booth. Many were born and brought up in Zürich, though some had moved there later in life. All were students at Zürich University or Zürich Federal Technical College. The SFr subjects were mostly recruited in Neuchâtel, and tested in a quiet room in Neuchâtel University (a few were recruited and tested in Zürich). They came from various SFr-speaking cantons, though many from Neuchâtel. Most were university students. The Fr subjects were recruited in Cambridge, and tested in Cambridge University Phonetics Laboratory’s sound-attenuated booth. They came from various regions in France. Many were resident in Cambridge, most of whom were university students; others were briefly visiting Cambridge, e.g. summer-school students. None had lived in the UK before age 18.

2.2. Stimulus Preparation

Two sets of stimuli, one SG, one Fr, were created in Praat [Boersma and Weenink, 2008] and were designed to be cross-linguistically comparable and equally appropriate for each language group [this is a challenge of cross-linguistic research; see Beddor and Gottfried, 1995]. The stimuli were sequences of five digits and/or letters (as occur in everyday speech, e.g. telephone numbers, postal codes and serial numbers). Each sequence comprised syllables that were taken individually from naturally produced sequences, then resynthesised with specific durations and f0 contours, and then spliced together.

2.2.1. Recordings

One native SG speaker (Zürich dialect) and 1 native Fr speaker (from Rennes, France) were recorded; neither was phonetically trained. They were each asked to read, at a comfortable rate, a list of digit and letter sequences in the format (X X X) (X X) or (X X) (X X X), where X was a monosyllabic digit/letter. All Xs in each sequence were identical, and each grouping (3 + 2, 2 + 3) occurred once for all single digits (1–9) and all monosyllabic letters of the alphabet (A–Z), in a randomised order down the reading sheets. The main reason for recording sequences in this format, rather than isolated syllables or ungrouped sequences of five, was to ascertain that speakers produced longer syllables with an f0 rise or fall before each group boundary. The manipulations later made on the stimuli were based on these duration and f0 values. One speaker at a time was recorded in a sound-attenuated studio, using a Marantz PMD670 solid-state recorder and a low-noise condenser Sennheiser MKH40P48 microphone with a cardioid frequency response. The recording mode was set to 16-bit linear PCM, with a 44.1-kHz sample rate.

In Praat, the recordings were visually and auditorily inspected to choose the most appropriate individual syllables to make up the stimuli (table 1). These stimuli mostly comprised syllables with various segmental structures (henceforth ‘varied stimuli’), so that listeners did not become bored by hearing a constantly repeated sound, thus ensuring that they heard the stimuli as meaningful digit/letter sequences. Moreover, the varied stimuli tested whether listeners tended to group certain syllable patterns (see the rightmost column of table 1) using criteria other than prosodic cues. For instance, a listener might always hear 3PTP3 as 3P-TP3 if that listener thinks letters sound better together group-initially. Some stimuli, also created by splicing together individual syllables, had segmentally identical syllables (henceforth ‘identical stimuli’), which were included to compare responses with those to the varied stimuli.

There were various reasons for choosing these digits/letters. First, all had a sonorant nucleus, and a voiceless onset and/or coda, which ensured a between-syllable break of periodicity, so f0 manipulations would have clear start and end points. The plosives /b d/ are unvoiced lenis in SG, and usually voiced in Fr, but still have a burst transient disrupting periodicity. Second, the SG and Fr sequences’ segmental structure had to be cross-linguistically as similar as possible. The main issue was choosing syllables in which the voiceless and sonorant parts were durationally similar in each language, so that these parts could be manipulated to the same duration for equivalent stimuli in each language. For example, the SG and Fr plosives /p t b d/ had similar durations (essentially, any differences between them were below the just noticeable difference threshold for speech sounds reported in e.g. Lehiste [1970]), likewise the SG and Fr fricatives /h f s/ had similar durations, as did the SG and Fr vowels /e ɛ ɑ/, and the SG diphthong /ai/ and Fr approximant + vowel /wa/. Third, the ‘5Z4J6’/‘75Z46’ stimuli were designed to have five segmentally different syllables.

2.2.2. Token Selection

One token of each of the digits/letters required was selected (i.e. B, P, Z, 2, 3, 5 etc. – 14 individual syllables per language). For each syllable required, there were ten possible tokens: five per recorded sequence, one grouped 3 + 2 and one 2 + 3. The most appropriate token was selected using various criteria, through visual inspection of the spectrogram and waveform, and simultaneous auditory observation. First, any syllables with non-modal voicing (e.g. creak) were eliminated, since these could be problematic during resynthesis. Then only non-group-final syllables (i.e. 1, 2, 4 in 3 + 2 sequences and 1, 3, 4 in 2 + 3 sequences) were considered, because the ‘base’ syllable to be made from each selected token would sit in these positions in stimuli and have a level f0 and normal (i.e. not increased) duration. Ultimately, syllable 2 in 3 + 2 sequences or syllable 4 in 2 + 3 sequences was the preferred choice for all tokens, because several syllables began with plosives, for which only group-medial tokens were suitable, since the closure duration of post-silence plosives was unknown. An inevitable consequence of taking syllables from continuous speech was the presence of formant transitions at syllable boundaries, which could have decreased naturalness when syllables were removed from one sequence and concatenated with others. However, the task involved attention to prosody rather than segments, and listeners had a simultaneous visual presentation of the digits/letters, so would not be distracted from making decisions based on prosodic cues.

Table 1. Syllable sequences chosen for stimuli

| Type | SG orthography | SG transcription | Fr orthography | Fr transcription | Pattern |
|---|---|---|---|---|---|
| Identical | BBBBB | /be/ (×5) | PPPPP | /pe/ (×5) | [not recoverable] |
| Identical | 22222 | /t͡svai/ (×5) | 33333 | /tʁwa/ (×5) | [not recoverable] |
| Varied | 3PTP3 | /dry.pe.te.pe.dry/ | 2BDB2 | /dø.be.de.be.dø/ | [not recoverable] |
| Varied | 5Z4J6 | /foif.t͡sɛt.fier.jɔt.sɑχs/ | 75Z46 | /sɛt.sɛ̃k.zɛd.katʁ.sis/ | [not recoverable] |
| Varied | H2H2H | /hɑ.t͡svai.hɑ.t͡svai.hɑ/ | C3C3C | /se.tʁwa.se.tʁwa.se/ | [not recoverable] |
| Varied | S888F | /ɛs.ɑɣt.ɑɣt.ɑɣt.ɛf/ | S888F | /ɛs.ɥit.ɥit.ɥit.ɛf/ | [not recoverable] |

2.2.3. ‘Base’ Syllables

Once each selected digit/letter token had been extracted from the recording, duration, intensity and f0 were altered to make all tokens equivalent. The result was a ‘base’ syllable for each digit/letter, which could then be subject to further manipulation. A Praat script was written and run which changed the consonant and vowel duration in each syllable during resynthesis, to make segment durations identical between languages (table 2). Note that the resulting syllable durations varied from 320 to 370 ms depending on syllable structure. These durations were chosen based on a previous analysis of segment and syllable durations in all the relevant sequences produced by the speaker of each language in the recordings described above. The chosen durations were a compromise between the durations observed in each language, though (as already noted) these were not noticeably different.

In each syllable, peak intensity was manipulated to 83 dB, using Praat’s IntensityTier function, which allows a relative decibel value to be specified and multiplied with a sound file’s existing intensity during resynthesis. In each syllable, f0 was manipulated to a level of 120 Hz, using Praat’s PitchTier function, which allows an f0 contour to be specified and replace a sound file’s existing f0 during resynthesis. 120 Hz was a compromise between each speaker’s mean pitch.
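The arithmetic behind a relative decibel adjustment can be sketched as follows. This is only an illustration of the general dB relationship, not the Praat procedure used in the experiment; the function names and the 77-dB starting value are hypothetical.

```python
def gain_db(current_peak_db: float, target_peak_db: float = 83.0) -> float:
    """Relative gain (in dB) needed to move a peak level to the target level."""
    return target_peak_db - current_peak_db


def amplitude_factor(gain: float) -> float:
    """Linear amplitude multiplier equivalent to a gain in dB (20*log10 convention)."""
    return 10 ** (gain / 20)


# Hypothetical example: a token whose peak intensity is 77 dB needs +6 dB,
# i.e. its amplitude is roughly doubled, to reach the 83-dB target.
needed = gain_db(77.0)
factor = amplitude_factor(needed)
print(f"gain: {needed:+.1f} dB, amplitude factor: {factor:.3f}")
```

A token already at the target level needs a gain of 0 dB, which corresponds to an amplitude factor of 1.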

2.2.4. Duration/f0 Manipulations

At this point, the syllables were in their base form (duration = 320–370 ms, intensity = 83 dB, f0 = 120 Hz; table 2). Each base syllable was replicated six times; each replicate then underwent a durational and/or tonal manipulation. The six manipulations are represented schematically in the top box of table 4; 120 Hz at d1 is the base form. Duration was manipulated first (where needed), by running a Praat script written for this purpose; f0 was then manipulated using the PitchTier function. Vowel durations at d2 (i.e. after lengthening, table 3) were a 50% increase on d1, whereas consonant durations were kept equal (since vowels lengthened more than consonants group-finally in the recordings), except in the syllables without an onset consonant. In these, if the coda consonant was kept at 120 ms, a 375-ms vowel sounded odd (compared to the recordings), so the coda was increased until it sounded sufficiently acceptable. Thus durations at d2 were based on the values measured from the production of group-final syllables by the speaker of each language, as well as some auditory observation.
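The relationship between the d1 and d2 durations can be checked with a short sketch. The segment values are copied from tables 2 and 3; the helper name and the choice to round half-milliseconds upwards are assumptions made here for illustration, not details reported in the paper.

```python
import math

# Segment durations in ms from tables 2 and 3: (onset C(C), vowel, coda C(C)).
# Keys are the SG digit/letter groups; 0 means the syllable has no such segment.
D1 = {
    "B, 3, P, T": (70, 250, 0),
    "2, H": (100, 250, 0),
    "4, 5, 6, J, Z": (95, 175, 95),
    "8, F, S": (0, 250, 120),
}
D2 = {
    "B, 3, P, T": (70, 375, 0),
    "2, H": (100, 375, 0),
    "4, 5, 6, J, Z": (125, 263, 125),
    "8, F, S": (0, 375, 160),
}


def lengthened_vowel(vowel_ms: int, factor: float = 1.5) -> int:
    """Apply a 50% vowel lengthening, rounding halves up to whole ms (assumed)."""
    return math.floor(vowel_ms * factor + 0.5)


for group, (onset, vowel, coda) in D1.items():
    d2_onset, d2_vowel, d2_coda = D2[group]
    # The tabulated d2 vowel should equal 1.5 times the d1 vowel.
    assert lengthened_vowel(vowel) == d2_vowel
    print(group, "| d1 total:", onset + vowel + coda,
          "| d2 total:", d2_onset + d2_vowel + d2_coda)
```

The consonant values are taken directly from table 3 and are not derived from d1, since the coda durations in particular were adjusted by ear.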

2.2.5. Concatenation of SyllablesThe sequences in table 4 were created for each digit/letter pattern (table 1) by splicing together,

without pauses, the appropriate syllables created in the previous stage. Various aspects of the

Table 2. Base duration (d1) of each syllable (V included the approximants in Fr /ɥit/ ‘8’ and /tʁwa/ ‘3’, and the short voiced trill in SG /dry/ ‘3’)

Digit/letter Duration of segments, ms Total duration of syllable msSG Fr C(C) V C(C)

B, 3, P, T P, 2, B, D 70 250 3202, H 3, C 100 250 3504, 5, 6, J, Z 4, 5, 6, 7, Z 95 175 95 3658, F, S 8, F, S 250 120 370

Table 3. Duration of each syllable after lengthening (d2)

Digit/letter Duration of segments, ms Total duration of syllablemsSG Fr C(C) V C(C)

B, 3, P, T P, 2, B, D 70 375 4452, H 3, C 100 375 4754, 5, 6, J, Z 4, 5, 6, 7, Z 125 263 125 5138, F, S 8, F, S 375 160 535

227Phonetica 2010;67:219–242Tonal and Durational Cues in Rhythmic- Group Perception

Table 4. Stimuli: sequences of durationally/tonally manipulated syllables spliced together

[Schematic diagrams not reproduced. Each diagram plots f0 (scale marked at 100, 120, 130 and 150 Hz) across a five-syllable sequence, with each syllable at duration d1 or d2.]

Labels for stimuli: Control; Length (a, b); Pitch 1 (a, b); Pitch 2 (a, b); Pitch 1 and length accordant (a, b); Pitch 2 and length accordant (a, b); Pitch 1 and length conflicting (a, b); Pitch 2 and length conflicting (a, b).

The diagrams correspond to the f0 values shown on the scale. The fall only occurred on sequence-final syllables. Tables 2 and 3 give details of d1 and d2 durations. There were no pauses between syllables; the gaps in the diagrams simply allow a clear visual representation of different-length syllables.


sequences in table 4 require further comment. First, the stimulus labels included ‘length’ and ‘pitch’, which refer to perceptual cues and distinguish these from the acoustic manipulations of ‘duration’ and ‘f0’.

Second, ‘pitch1’ refers to a lower f0 excursion (10 Hz), and ‘pitch2’ to a higher f0 excursion (30 Hz). Rises with two different f0 excursions were included to test whether listeners responded differently when the magnitude of f0 excursion was like a pitch-accent or a microfluctuation. The 30-Hz excursion was based on the mean f0 excursion of group-final rises produced by the speakers in the recordings described above; the 10-Hz excursion was chosen because previous experiments have shown that it would be unlikely to be perceived as a rise [Rossi, 1971]. Since these previous experiments reported the threshold for perception of a pitch rise on a Hertz scale, and since the higher f0 excursion used in the present experiment was based on an absolute value from speech production data (as opposed to a perceptually relative value), a Hertz scale was chosen here rather than a semitone scale. It should also be noted that each sequence ended in a long fall, in order to increase naturalness, since both speakers produced long sequence-final falls in the recordings.
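For readers who want to relate the two Hertz excursions to a semitone scale, the following Python sketch applies the standard conversion of 12 times the base-2 logarithm of the frequency ratio. It assumes that both rises start from the 120-Hz base f0; the function name is hypothetical, and the conversion was not part of the original analysis.

```python
import math


def hz_to_semitones(f_start_hz, f_end_hz):
    """Size of an f0 movement in semitones, given start and end in Hz."""
    return 12.0 * math.log2(f_end_hz / f_start_hz)


BASE_F0_HZ = 120.0  # base f0 of the syllables (assumed starting point of the rise)

small_rise_st = hz_to_semitones(BASE_F0_HZ, BASE_F0_HZ + 10.0)  # 'pitch1'
large_rise_st = hz_to_semitones(BASE_F0_HZ, BASE_F0_HZ + 30.0)  # 'pitch2'

print(f"pitch1: {small_rise_st:.2f} st, pitch2: {large_rise_st:.2f} st")
```

On this assumption the 10-Hz rise is a little under 1.4 semitones and the 30-Hz rise a little under 3.9 semitones. The semitone size of a fixed Hertz excursion would differ for a speaker with a different base f0.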

Third, note that listeners had to say whether they heard a 3 + 2 or 2 + 3 grouping; the intended group boundaries are marked in table 4 with bold vertical lines within the sequences, except for the ‘control’ stimuli and the ‘pitch and length conflicting’ stimuli, in which the boundary could be perceived at either of the points marked with a dashed vertical line. In the control stimuli, neither length nor pitch was available as a cue. These stimuli tested whether a bias for 3 + 2 grouping occurred [House, 1990]: if there was no bias, we could expect responses to be 50% 3 + 2 grouping and 50% 2 + 3 grouping.

Fourth, the remaining stimuli were counterbalanced to cancel out potential bias, by including stimulus ‘a’ and stimulus ‘b’ in which the second and third syllables were sequentially opposite. For example, in the ‘length’ condition, stimulus ‘a’ had an increased duration on the third syllable, and stimulus ‘b’ had an increased duration on the second syllable; we would expect listeners to respond 3 + 2 for stimulus ‘a’ and 2 + 3 for stimulus ‘b’, and not simply 3 + 2 (or 2 + 3) every time, which would be the case if they were biased towards perceiving one particular grouping regardless of the cues present. The difference in responses between each pair of ‘a’ and ‘b’ stimuli could be calculated to take account of any bias, and thus reveal the actual effect of each cue (section 3.1).

Fifth, in the ‘pitch and length accordant’ stimuli, the same syllable had increased duration and rising f0, whereas in the ‘pitch and length conflicting’ stimuli, one syllable had increased duration and a neighbouring syllable had rising f0 around the possible boundary location. Conflicting cues could have been placed on the same syllable (e.g. rising f0 and shorter duration). This was rejected during stimulus preparation since listeners might have perceived a group boundary after such an anomalous ‘conflicting’ syllable because it was an unexpected break from normality in the signal rather than because its duration/f0 were cues. Conflicting cues on successive syllables could also sound unusual, but listeners still had to decide between two anomalous syllables of which one was group-final. This eliminated the potential confound of responses based on non-prosodic factors, and more clearly concerned the interaction of duration and f0.

2.3. Procedure

Subjects sat at a MacBook laptop (Mac OS X.4) and listened through binaural Sennheiser HD555 headphones. The experiment was scripted and run in Praat. There were two identical versions, one with SG stimuli and German on-screen text, one with Fr stimuli and on-screen text. The procedure for monolinguals was as follows. They heard stimuli in their native language (Fr for SFr subjects, since the words used do not show between-variety variation). Before testing began, subjects read instructions in their native language (standard German for SGs), and were given a chance to ask for clarification. For each trial, the following statement appeared on screen: ‘These digits/letters could be grouped as. . .’. The task was to click one of the two on-screen buttons labelled with a 3 + 2 or 2 + 3 grouping, e.g. ‘75Z-46’ or ‘75-Z46’ (forced choice). Half the subjects saw 3 + 2 on the left and 2 + 3 on the right; the other half saw the reverse. Each trial began after subjects had clicked to begin the experiment, or to register their response to the previous trial. After this click came 1.5 s of silence before the stimulus. Only one listening per trial was allowed, and response time was unlimited. There were 102 trials [6 digit/letter patterns (table 1) × 8 cue conditions (table 4) × 2 orders (table 4)], including 6 practice trials, with a break after every 16. Subjects could also ask for clarification after the practice session, though none did. The whole experiment lasted approximately 15 min.
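The trial count can be checked with a short Python sketch that enumerates the design. The pattern and condition labels below are illustrative strings (the SG patterns and the condition names used elsewhere in the paper), and treating the control condition as occurring under both order labels is an assumption made so that the product matches the stated 6 × 8 × 2 design.

```python
from itertools import product

# Illustrative labels; the SG digit/letter patterns are used here.
PATTERNS = ["BBBBB", "22222", "3PTP3", "5Z4J6", "H2H2H", "S888F"]
CUE_CONDITIONS = [
    "control",
    "length",
    "pitch1",
    "pitch2",
    "pitch1 and length accordant",
    "pitch2 and length accordant",
    "pitch1 and length conflicting",
    "pitch2 and length conflicting",
]
ORDERS = ["a", "b"]
N_PRACTICE_TRIALS = 6

# Every combination of pattern, cue condition and order is one test trial.
test_trials = list(product(PATTERNS, CUE_CONDITIONS, ORDERS))
total_trials = len(test_trials) + N_PRACTICE_TRIALS

print(len(test_trials), total_trials)
```

The enumeration gives 96 test trials, which together with the 6 practice trials makes the 102 trials reported.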

For bilinguals, the procedure was very similar, except that they completed the task in each language with a break in between. They had a practice session for each language, to allow them to adjust to the new speaker. Half listened to SG first, and half to Fr first; within these two groups, half saw 3 + 2 on the left of the screen, and half saw 3 + 2 on the right. Subjects had been assigned to one of these four groups before they arrived. Since the test procedure was identical for the stimuli in both languages, the bilinguals only read the instructions once, in the language that they listened to first.

2.4. Analysis

Responses were recorded in Praat, and sorted according to the variables: native language(s), cue(s), digit/letter pattern, order (a, b). Data analysis consisted of four stages. First, responses to control stimuli were analysed, to ascertain whether there was a significant bias for the 3 + 2 or 2 + 3 grouping. This was assessed using binomial probabilities approximated from the standard normal distribution, because this statistic indicates whether a large series of responses which are binary in nature are significantly above chance level (i.e. 50% response A and 50% response B). Second, all the variables were included in one logistic regression model using the ‘lmer’ function in the R software environment. The rationale behind this test was to include all the variables manipulated in the experiment in one single statistical model, to reveal which variables had a significant main effect on responses. Since the responses were binary (either 3 + 2 or 2 + 3 grouping), a logistic model was appropriate. Once the data set as a whole had been analysed like this to reveal the main effects, ANOVAs (run in SPSS) were used to explore in more detail the interaction between just two variables at a time, by averaging the results over the other variables, so the data were no longer binary responses and were suitable for ANOVAs. In this way, the third stage of analysis examined the interaction between the digit/letter pattern and native language variables, and the fourth (or main) stage of analysis concerned the two variables crucial to the research questions: cue(s) and native language(s). In the main analysis, the data were averaged over the six digit/letter patterns to avoid a three-way comparison in which any interactions would be complex to interpret with the non-parametric tests which would be required.

Most data series were relatively normally distributed according to visual inspection of histograms, though some were significantly non-normal according to Shapiro-Wilk tests (p < 0.05) [Field, 2005]. Given that ANOVA is robust against violation of the normality assumption [Howell, 2007, p. 316], and that transformations (including square root and arcsine) made little difference to the normality of these data, untransformed data were input to the ANOVAs. No data series violated the homogeneity of variance assumption according to Levene tests (p > 0.05).

3. Results

3.1. Control Stimuli

If no grouping bias occurred, responses to the control stimuli would be 50% ‘3 + 2’ (and 50% ‘2 + 3’). Figure 1 shows that there was a preference for 3 + 2 grouping in most of the ten subject groups, but in only four groups was this bias significant (table 5; the BiSG 3 + 2 left group consists of the same subjects as the BiSFr 3 + 2 right group).

A 3 + 2 bias was also found by House [1990], who used a ‘difference score’ calculation to cancel out this bias in the analysis of results. This calculation was also adopted here, because a significant bias occurred in some groups: ‘A difference score representing the percent 3 + 2 responses for stimulus “a” minus the percent 3 + 2 responses for stimulus “b” is presented for each stimulus pair. The difference scores represent the relative contribution of each variable to the perception of the [rhythmic-group] boundary where a higher difference score for a stimulus pair means a greater contribution to the perception of phrasing of the variable manipulated in that pair. For example, if 100% of the responses for stimulus “a” in a pair were 3 + 2 and 0% of the responses for stimulus “b” in the same pair were 3 + 2 (i.e. 100% 2 + 3 responses) the difference score would be 100 – 0 = 100 giving us the maximum contribution of the variable which was manipulated. If, however, say 70% of the responses for stimulus “a” were 3 + 2 and 50% for stimulus “b” were 3 + 2 we would get a difference score of 70 – 50 = 20 which would mean that there was very little contribution from the manipulated variable’ [House, 1990, p. 94].
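The difference-score calculation quoted from House [1990] can be written out as a minimal Python sketch. The function names and the response coding (the strings "3+2" and "2+3") are hypothetical choices for illustration, not the format in which the responses were actually stored.

```python
def percent_3_2(responses):
    """Percentage of '3+2' responses in a list of '3+2'/'2+3' strings."""
    return 100.0 * responses.count("3+2") / len(responses)


def difference_score(responses_a, responses_b):
    """Percent 3+2 for stimulus 'a' minus percent 3+2 for stimulus 'b'."""
    return percent_3_2(responses_a) - percent_3_2(responses_b)


# Maximum contribution: every 'a' response is 3+2, every 'b' response is 2+3.
max_ds = difference_score(["3+2"] * 10, ["2+3"] * 10)

# Weak contribution: 70% 3+2 for 'a' against 50% 3+2 for 'b'.
weak_ds = difference_score(["3+2"] * 7 + ["2+3"] * 3,
                           ["3+2"] * 5 + ["2+3"] * 5)

print(max_ds, weak_ds)
```

The two calls correspond to the two examples in the quotation, giving difference scores of 100 and 20. A listener who always answered 3 + 2 regardless of the cue would score 0, which is how the calculation removes a grouping bias.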

[Bar chart not reproduced. x axis: language groups SG, SFr, Fr, BiSG, BiSFr, each with separate bars for ‘3 + 2 left’ and ‘3 + 2 right’; y axis: mean percentage of 3 + 2 responses, 0–100.]

Fig. 1. Mean percentage (across subjects) of 3 + 2 responses to control stimuli (error bars: ±1 standard deviation); 3 + 2 left/right refers to which side of the screen subjects saw the 3 + 2 grouping.

Table 5. Frequency of 3 + 2 responses out of total number of control trials per language group, and significance tests

Group    3 + 2 position   3 + 2 responses   Number of control trials   z       Sig.
SG       left             118               216                        1.36    NS
SG       right            102               216                        –0.82   NS
SFr      left             146               228                        4.24    ***
SFr      right            124               228                        1.32    NS
Fr       left             127               216                        2.59    *
Fr       right            108               216                        0.00    NS
BiSG     left             77                120                        3.10    **
BiSG     right            66                120                        1.10    NS
BiSFr    left             63                120                        0.55    NS
BiSFr    right            74                120                        2.56    *

z = z-score for the binomial probability approximated from the standard normal distribution.
Significance: *** p < 0.0001; ** p < 0.001; * p < 0.01; NS, p > 0.05.
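The z-scores in table 5 are consistent with the usual normal approximation to the binomial distribution with a chance level of 50% and no continuity correction. The Python sketch below shows that calculation; the function name is hypothetical, and the absence of a continuity correction is inferred from the tabled values rather than stated in the text.

```python
import math


def binomial_z(successes, trials, chance=0.5):
    """z-score of an observed count under the normal approximation."""
    expected = trials * chance
    standard_deviation = math.sqrt(trials * chance * (1.0 - chance))
    return (successes - expected) / standard_deviation


# (3 + 2 responses, control trials) for three of the groups in table 5.
z_sfr_left = binomial_z(146, 228)
z_sg_left = binomial_z(118, 216)
z_fr_right = binomial_z(108, 216)

print(round(z_sfr_left, 2), round(z_sg_left, 2), round(z_fr_right, 2))
```

For example, 146 responses out of 228 trials lie 32 responses above the expected 114, and dividing by the standard deviation (the square root of 57) gives the tabled z of 4.24.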


3.2. Initial Analysis

3.2.1. All Variables

A logistic regression analysis was conducted which included all the variables. Since each subject responded to several stimuli, a mixed model with both random and fixed effects had to be fitted [Baayen, 2008]. The random effect was subject, which introduced adjustments to the intercept grouped by each subject. The fixed effects were native language(s) (5 levels), cue(s) (7 levels), digit/letter pattern (6 levels), and order (2 levels). The dependent variable was response (‘3 + 2’ = 1; ‘2 + 3’ = 0); it was not necessary to use ‘difference score’ (DS) as the dependent here, since instead order accounted for any bias. Table 6 displays the output; the rightmost column, which displays significance values for fixed effects, is of most interest. Each level of each variable was compared to the baseline level (rows without figures). The following observations can be made: SFr listeners gave significantly more 3 + 2 responses than SG listeners; responses to ‘pitch1/2’ and ‘pitch1/2 and length conflicting’ stimuli differed significantly from responses to ‘length’ stimuli; unsurprisingly, responses differed significantly depending on whether the second or third syllable was manipulated (order a/b); responses to three out of five digit/letter patterns differed significantly from responses to BBBBB/PPPPP stimuli.

3.2.2. Digit/Letter Pattern

For each digit/letter pattern, subjects’ DSs were averaged across all cue conditions, and then input to a two-way mixed-measures ANOVA with the factors language

Table 6. Output of regression model

Fixed effects                          Estimate   SE      z        p value
Intercept                              1.206      0.126   9.55     <0.0001***
Native language(s)
  SG
  SFr                                  0.288      0.138   2.09     0.037*
  Fr                                   –0.038     0.140   –0.27    0.787
  BiSG                                 0.037      0.167   0.22     0.824
  BiSFr                                0.087      0.167   0.52     0.604
Cue(s)
  length
  pitch1                               0.520      0.084   6.19     <0.0001***
  pitch2                               0.295      0.084   3.52     <0.001***
  pitch1 and length accordant          0.049      0.084   0.59     0.558
  pitch2 and length accordant          0.056      0.084   0.67     0.503
  pitch1 and length conflicting        –0.161     0.084   –1.93    0.050*
  pitch2 and length conflicting        –0.256     0.084   –3.05    0.002**
Order
  order a
  order b                              –2.796     0.047   –60.07   <0.0001***
Digit/letter pattern
  BBBBB, PPPPP
  22222, 33333                         0.081      0.078   1.05     0.296
  3PTP3, 2BDB2                         0.410      0.078   5.27     <0.0001***
  5Z4J6, 75Z46                         0.162      0.078   2.09     0.036*
  H2H2H, C3C3C                         –0.199     0.078   –2.56    0.011*
  S888F, S888F                         –0.084     0.078   –1.08    0.278

Significance: *** p < 0.001; ** p < 0.01; * p < 0.05.
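To give a sense of what the estimates in table 6 mean on the probability scale, the Python sketch below applies the inverse logit to two combinations of coefficients. It assumes a standard logit link with the baseline levels coded as zero and a subject whose random intercept adjustment is zero; these are common defaults for this kind of model but are not stated explicitly in the text, and the function name is hypothetical.

```python
import math


def inverse_logit(log_odds):
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


INTERCEPT = 1.206       # baseline: SG, 'length', order a, BBBBB/PPPPP
ORDER_B_EFFECT = -2.796  # change in log-odds for order b

# Predicted probability of a 3 + 2 response for the baseline stimulus.
p_order_a = inverse_logit(INTERCEPT)

# Same listener and cue, but with the manipulated syllable in order b.
p_order_b = inverse_logit(INTERCEPT + ORDER_B_EFFECT)

print(f"{p_order_a:.3f} {p_order_b:.3f}")
```

Under these assumptions a baseline ‘length’ stimulus in order a is predicted to receive a 3 + 2 response roughly 77% of the time, and the order b counterpart roughly 17% of the time, which illustrates why the order effect dominates the model.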


(SG, SFr, Fr) and pattern (x6) (here just the monolinguals’ responses were analysed, since the bilinguals formed a repeated-measures design). No main effect of language occurred [F(2, 107) = 1.025, p > 0.05], nor an interaction of pattern × language [F(9.189, 491.621) = 1.201, p > 0.05], so the SG and Fr stimuli, though segmentally different, were cross-linguistically equivalent in terms of the responses they elicited. There was a main effect of pattern [F(4.595, 491.621) = 38.532, p < 0.0001]; to explore this further, post-hoc pairwise comparisons were computed. With a Bonferroni-adjusted alpha level [p < 0.003 (0.05/15)], the mean DS was: significantly higher for both ‘identical’ than all ‘varied’ patterns; not significantly different between the two ‘identical’ patterns, and lowest for the pattern with five segmentally different syllables (5Z4J6/75Z46), which differed significantly from two of the three other ‘varied’ patterns (3PTP3/2BDB2 and S888F/S888F, but not H2H2H/C3C3C).
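The divisor of 15 in the Bonferroni adjustment is the number of distinct pairs among the six digit/letter patterns. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the sketch only reproduces the threshold, not the pairwise tests themselves.

```python
from math import comb


def bonferroni_alpha(n_levels, family_alpha=0.05):
    """Return (number of pairwise comparisons, per-comparison alpha)."""
    n_comparisons = comb(n_levels, 2)
    return n_comparisons, family_alpha / n_comparisons


# Six digit/letter patterns compared pairwise.
n_pattern_pairs, pattern_alpha = bonferroni_alpha(6)

print(n_pattern_pairs, round(pattern_alpha, 4))
```

With six levels there are 15 pairs, so each comparison is tested at 0.05/15, or about 0.0033. The same helper gives 21 pairs for a factor with seven levels.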

A repeated-measures ANOVA on the bilinguals’ averaged DSs was computed with the factors language (BiSG, BiSFr) and pattern (x6). As for monolinguals, no main effect of language occurred [F(1, 19) = 0.8, p > 0.05], nor an interaction of language × pattern [F(5, 95) = 1.215, p > 0.05], and a main effect of pattern occurred [F(5, 95) = 6.55, p < 0.0001]. Post-hoc pairwise comparisons for pattern with Bonferroni-adjusted alpha levels revealed an overall picture similar to the monolinguals.

3.3. Main Analysis: Cue(s), Native Language(s)

Figure 2 shows the mean DSs (across all subjects per language group). A 100% DS indicates a maximum effect of that variable in subjects’ rhythmic-group judgements, whereas a 0% DS indicates no effect. The mean DS for each digit/letter pattern is shown separately by each different shade of grey and the black line above it; these six separate DSs are added one on top of the other, to give the total DS across all digit/letter patterns. This is why the scale reaches 600%: a maximum of a 100% DS for each of six digit/letter patterns combined.

The ‘landscape’ is similar for all five graphs: a peak on the left for ‘length’, followed by a valley for ‘pitch1’, then a rise to the ‘accordant’ conditions, and a fall to the ‘conflicting’ conditions. Some subtle differences appear between language groups. For monolinguals, the ‘pitch1’ valley is deepest for Fr, slightly less for SFr and much shallower for SG, i.e. ‘pitch1’ influenced responses more in SG than in (S)Fr. The two ‘accordant’ conditions form more of a plateau for (S)Fr compared to a peak for SG, which (on the graph) results from the fact that ‘pitch2’ influenced responses more than ‘length’ in SG. The three leftmost conditions are ranked (most to least effect on responses) as ‘pitch2’ > ‘length’ > ‘pitch1’ in SG, and ‘length’ > ‘pitch2’ > ‘pitch1’ in (S)Fr. The bilinguals responded similarly to monolinguals when listening to stimuli in the respective language. To statistically analyse the data displayed in figure 2, DSs were averaged across the digit/letter pattern conditions.

3.3.1. Monolinguals

A two-way mixed-measures ANOVA was conducted with the factors language (SG, SFr, Fr) and cue(s) (x7). No main effect of language occurred [F(2, 107) = 1.021, p > 0.05], but there was a significant interaction of cue(s) × language [F(7.330, 392.138) = 2.222, p = 0.029], and a main effect of cue(s) [F(3.665, 392.138) = 72.310, p < 0.0001]. For the cue(s) × language interaction, individual contrasts comparing ‘length’ to every other condition revealed that the only significant difference was between ‘length’ and ‘pitch2’ [F(2) = 4.643, p = 0.012]. This shows that the differential ranking of pitch and length observed between languages in figure 2 is significant: for SG, ‘pitch2’ > ‘length’ (> ‘pitch1’); for (S)Fr, ‘length’ > ‘pitch2’ (> ‘pitch1’). To verify that no between-variety difference occurred between SFr and Fr, another two-way mixed-measures ANOVA was calculated for language and cue(s) with only SFr and Fr. No main effect of language occurred [F(1, 72) = 2.722, p > 0.05], nor an interaction of language × cue(s) [F(3.808, 274.173) = 0.132, p > 0.05].

To explore the main effect of cue(s) further, post-hoc pairwise comparisons of all conditions were computed. With a Bonferroni-adjusted alpha level [p < 0.002 (0.05/21)], the mean DS for each cue condition was significantly higher/lower than for all other conditions, except the following comparisons: ‘length’ versus ‘pitch2’, ‘pitch1’ versus ‘pitch1 and length conflicting’, and ‘pitch1 and length accordant’ versus ‘pitch2 and length accordant’. Overall, the cue or cues which occurred on the second and/or third syllable of a five-syllable sequence had a great influence on listeners’ perception of rhythmic groups. The mean DS was not significantly different between ‘length’ and ‘pitch2’, which was the only comparison to reach significance in the cue(s) × language interaction. Thus only with native language taken into account did substantially rising f0 and lengthening differentially influence responses.

3.3.2. Bilinguals

A two-way repeated-measures ANOVA was conducted with the factors language (BiSG, BiSFr) and cue(s) (x7). There was a main effect of cue(s) [F(2.248, 42.713) = 35.569, p < 0.0001], but not language [F(1, 19) = 0.8, p > 0.05], as for monolinguals. The cue(s) × language interaction was not significant overall [F(6, 114) = 0.742, p > 0.05], but individual contrasts for this interaction, comparing ‘length’ to every other condition, revealed a significant difference between ‘length’ and ‘pitch2’ [F(1, 19) = 6.241, p < 0.05], as for monolinguals. When bilinguals listened to SG five-syllable sequences, a substantial f0 rise in the second/third syllable influenced their rhythmic-group perception more than an increased duration did, whereas this was the opposite when they listened to Fr.

The main effect of cue(s) was explored further with post-hoc pairwise comparisons of all conditions. With a Bonferroni-adjusted alpha level [p < 0.001 (0.05/42)], there were fewer significant differences in DSs between cue conditions than for the monolinguals. Particularly noteworthy are the differences in DSs that were significant in one language (of stimuli) but not the other: ‘pitch1’ had a significantly lower mean DS than ‘length’ for Fr but not SG, whereas ‘pitch2’ had a significantly higher mean DS than ‘pitch1’ for SG but not Fr; ‘length’ had a significantly lower mean DS than the two ‘accordant’ conditions for SG but not Fr, whereas ‘pitch2’ had a significantly lower mean DS than the two ‘accordant’ conditions for Fr but not SG. These significant differences all accord with the finding that bilinguals responded like each monolingual group when hearing the respective stimuli.

Then a two-way mixed-measures ANOVA was conducted with the factors group (SG, BiSG) and cue(s) (x7). Likewise a two-way mixed-measures ANOVA was conducted with the factors group (SFr, Fr, BiSFr) and cue(s) (x7). For both ANOVAs, there was a main effect of cue(s) (as above), [SG vs. BiSG: F(3.018, 162.959) = 56.718, p < 0.0001; SFr vs. Fr vs. BiSFr: F(3.715, 338.083) = 64.105, p < 0.0001], but not group [SG vs. BiSG: F(1, 54) = 0.154, p > 0.05; SFr vs. Fr vs. BiSFr: F(2, 91) = 1.496, p > 0.05], nor an interaction of cue(s) × group [SG vs. BiSG: F(3.018, 162.959) = 0.701, p > 0.05; SFr vs. Fr vs. BiSFr: F(7.430, 338.083) = 1.025, p > 0.05]. Thus there was little difference in responses between the bilinguals and each respective monolingual group.

[Fig. 2, panels a–c, not reproduced. a: Swiss German; b: Swiss French; c: French. x axis: cue conditions (Length; Pitch1; Pitch2; Pitch1 & length acc.; Pitch2 & length acc.; Pitch1 & length con.; Pitch2 & length con.); y axis: 0–600%; shadings: BBBBB, 22222, 3PTP3, 5Z4J6, H2H2H, S888F (Swiss German) or PPPPP, 33333, 2BDB2, 75Z46, C3C3C, S888F (Swiss French, French).]

3.3.3. ‘Conflicting’ Stimuli

The significantly lower DSs for ‘conflicting’ stimuli show that listeners were confused by what they heard. Responses to these stimuli reveal how often listeners, when faced with choosing between two cues, used either length or pitch as a cue (i.e. increased duration or rising f0, respectively, was group-final in their perceived first rhythmic group; fig. 3).

[Fig. 2, panels d–e, not reproduced. d: Bilingual (Swiss German); e: Bilingual (Swiss French). x axis: cue conditions (Length; Pitch1; Pitch2; Pitch1 & length acc.; Pitch2 & length acc.; Pitch1 & length con.; Pitch2 & length con.); y axis: 0–600%; shadings: BBBBB, 22222, 3PTP3, 5Z4J6, H2H2H, S888F (Swiss German) or PPPPP, 33333, 2BDB2, 75Z46, C3C3C, S888F (Swiss French).]

Fig. 2. Mean percentage DS (y axes) across all subjects per language group for each cue condition (x axes) and digit/letter pattern (shadings). Note that on the SG and both bilingual graphs, there is some overlap of the lines representing DSs for different digit/letter patterns in the ‘conflicting’ (con.) conditions. Where a line which borders a lighter shading falls below that of a line which borders a darker shading, this indicates a negative DS for the digit/letter pattern with lighter shading. With reference to table 4, a negative DS for ‘conflicting’ stimuli means that pitch was used as a cue (i.e. came at the end of a subject’s perceived RG) more often than length was used as a cue, whereas a positive DS for ‘conflicting’ stimuli means that length was used as a cue more often than pitch was. acc. = ‘accordant’.

According to binomial probabilities approximated from the standard normal distribution, listeners used length as a cue significantly more often (i.e. pitch significantly less often) than chance (50%) in all groups (z > 5, p < 0.001), except SG, BiSG and BiSFr listeners in the ‘pitch2 and length conflicting’ condition (top-left cluster). Generally, pitch was used as a cue more often for stimuli with a higher f0 excursion (black) than for stimuli with a lower f0 excursion (grey). Individuals’ responses to conflicting stimuli were also inspected to check that it was not the case that some listeners only ever used length and some only pitch; all listeners used each cue in around one to two thirds of stimuli.

Although no significant difference between each bilingual group and its corresponding monolingual group was observed when the responses were analysed to include all cue conditions, there appears to be quite a difference in figure 3 between the BiSG and SG groups, and the BiSFr and SFr groups, for the lower excursion conflicting stimuli (grey). However, even when just the responses to the conflicting stimuli were compared between bilingual and monolingual groups (using the number of responses in which length was used as a cue), these differences were not significant [‘Pitch1 and length conflicting’: SG vs. BiSG, t(54) = 1.05, p > 0.05; SFr vs. BiSFr, t(56) = 1.47, p > 0.05. ‘Pitch2 and length conflicting’: SG vs. BiSG, t(54) = –0.03, p > 0.05; SFr vs. BiSFr, t(56) = 1.91, p > 0.05].

4. Discussion

Both hypotheses (i) and (ii) are confirmed, since listeners’ responses indicated that rhythmic groups were perceptible only when durational and/or tonal cues to these groups were present, and that duration and f0 were interdependent cues to these groups. Moreover, responses differed between language groups.

[Scatter plot not reproduced. x axis: responses in which subjects used length (%), 50–80; y axis: responses in which subjects used pitch (%), 20–50; data points are labelled by language group.]

Fig. 3. Percentage of responses to ‘conflicting’ stimuli in which subjects used either length or pitch; grey, ‘pitch1 and length conflicting’, black, ‘pitch2 and length conflicting’.


4.1. Interdependence of Duration and f0

When both increased duration and rising f0 occurred simultaneously, listeners were significantly more convinced of the rhythmic-group boundary location (i.e. higher DS) than when only one cue was available. When the cues conflicted on adjacent syllables, listeners were significantly less convinced of the boundary location (i.e. lower DS), and did not attend to the same cue in every ‘conflicting’ stimulus. These findings demonstrate that increased duration and rising f0 are interdependent cues to rhythmic groups. Moreover, the findings could be interpreted as a situation of phonetic trading relations, which are usually discussed with reference to segmental contrasts. Repp [1982, p. 87] stated: ‘[. . .] virtually every phonetic contrast is cued by several distinct acoustic properties of the speech signal. It follows that, within limits set by the relative perceptual weights and by the ranges of effectiveness of these cues, a change in the setting of one cue (which, by itself, would have led to a change in the phonetic percept) can be offset by an opposed change in the setting of another cue so as to maintain the original phonetic percept. This is a phonetic trading relation. [. . .] neither cue is perceived in isolation; rather, they are perceived together and integrated into a unitary phonetic percept.’

In this experiment, the two cues are increased duration and rising f0, and the contrast is the categorical location of a rhythmic-group boundary after the second or third syllable in a five-syllable sequence. When one syllable has increased length and (sufficiently) rising pitch, the cues are perceived together as being accordant, and integrated into the phonetic percept of a boundary after that syllable. When one syllable with increased length precedes another with rising pitch, or vice versa, the cues are perceived together, but as being conflicting; the phonetic percept resulting from integration of the cues is less clearly in one category or the other. Increased length may have a greater perceptual weight than rising pitch for (S)Frs, so the percept of a boundary may depend more on the duration setting than the f0 setting (present too), and the opposite for SGs. The results for stimuli with only one cue present could also be interpreted in terms of trading relations. Increased length may be the most usual rhythmic-group cue, but when this is missing, rising pitch could be equivalent, and so maintains the percept of a boundary following that syllable. Depending on native language, the relationship may be reversed, i.e. rising pitch is the most usual cue. A future experiment could test this possibility more directly.

4.2. Influence of Listeners’ Native Language

Rising pitch (with sufficiently large excursion) was a significantly more important cue than increased length for monolingual SGs, whereas increased length was a significantly more important cue than rising pitch for monolingual (S)Frs (overall there was little difference in results between SFrs and Frs). In fact, only when native language was taken into account was there a significant difference between length and pitch cues in terms of their effect on responses. Moreover, SGs were more sensitive than (S)Frs even to small (10-Hz) rises, and, in the ‘conflicting’ stimuli, (S)Frs generally used length as a cue more, and pitch less, than SGs. Bilinguals responded like SGs when listening to SG and like (S)Frs when listening to Fr, which provides more evidence that native language influences the relative weighting of duration and f0 as rhythmic-group cues, as bilinguals acquired both languages early in life.


These findings can be interpreted with reference to the prosodic characteristics of each language (references in section 1.2). In (S)Fr, each rhythmic group (or AP) always has a final prominent syllable, and an optional initial prominent syllable. AP-initial prominence is marked by rising f0 but not lengthening, whereas AP-final prominence is marked by dynamic f0 and lengthening, though the primary cue is probably lengthening. Thus it is logical that (S)Frs perceive increased duration as a stronger (right-hand) rhythmic-group-boundary cue than rising f0. SG has phonological vowel- and consonant-length contrasts, so all syllables are relatively tightly constrained in duration. Moreover, accented syllables, which are often not at right-hand rhythmic-group boundaries, are lengthened to some extent compared to unaccented syllables. SGs may use and expect rising f0 to cue rhythmic-group boundaries, because this tonal property is more salient than the durational properties that are already in use as cues to segmental contrasts and (rhythmic-group-medial) prominence. This experiment provides insight on boundary cues in these languages, though further perceptual experiments are needed in general on SG and (S)Fr prosodic cues.

4.3. Other Findings: Segmental Variation and Pitch Excursion

Increased length and rising pitch were more effective rhythmic-group cues in series of identical syllables than in series with segmentally varied syllables; for varied sequences, the effect of these cues differed little between series with syllables which alternated between two/three segmental structures (e.g. 2BDB2) and series with syllables which all had different segments (e.g. 5Z4J6). Real speech is more similar to the segmentally varied stimuli, so the results from these stimuli are more generalisable to speech perception than results from stimuli with identical syllables. The various vowels and consonants which listeners have to process when perceiving series of varied syllables differ in spectral properties, including intrinsic pitch, which leads to f0 microperturbations in the signal. Therefore, these varied series present a greater cognitive load for listeners than series of identical syllables, which means less cognitive capacity would remain to use prosodic cues as effectively when processing the segmentally varied stimuli. This suggestion is similar to House’s [1990] conclusion that f0 movement is perceived optimally when perception is not constrained by attention to spectral changes; this experiment extends to another acoustic cue (increased duration) as well as f0 movement.

Subjects were more convinced of the group boundary location in stimuli with greater f0 excursion on the relevant syllable. This was evident in the difference in responses between the ‘pitch1’ and ‘pitch2’ conditions, in the responses to ‘accordant’ stimuli (a small excursion was a less effective cue than a larger one), and in the responses to ‘conflicting’ stimuli (a small excursion was a less confusing cue than a larger one). The 30-Hz excursion was based on the values observed in the recordings; the 10-Hz excursion was chosen because previous experiments have shown that it would be unlikely to be perceived as a rise [Rossi, 1971]. Indeed the 10-Hz rises were less perceptible, and were probably (depending on native language) considered as microfluctuations, so generally did not signal linguistic (i.e. rhythmic-group) structure.

Only a pitch-excursion threshold, not a duration threshold, was investigated in this experiment. Just as for rising pitch, it is likely that lengthening needs to be above a certain percentage in order to be an effective rhythmic-group cue, and this would be greater than the just noticeable differences for duration reported in previous experiments [Lehiste, 1970]. It is also possible that a duration threshold depends on listeners’ native language. Some listeners may be sensitive to relatively small amounts of lengthening, whereas others may need to hear greater lengthening to perceive a rhythmic-group boundary; this is similar to the finding that SGs were more sensitive than (S)Frs to pitch rises with a small excursion.

4.4. Group-Final versus Group-Initial Cues

This experiment can only draw conclusions about the interdependence of durational and tonal cues in group-final position, but it is interesting to consider a comparison with group-initial cues. This links back to the discussion in section 1 about the potential difference between grouping at a phrase level and the metrical foot level.

In trochaic feet, the first syllable is prominent, due to the presence of certain cues in that syllable, just as in the case of group-final syllables. However, it is possible that group-initial and group-final cues differ, within and between languages. For example, durational and tonal cues may not be interdependent group-initially, with just rising pitch or lengthening being an effective cue; or these cues may be interdependent group-initially, but the relative importance of each cue may differ from that in group-final position. In either case, this would make group endings distinguishable from group beginnings in the speech signal, which would be useful from the perspective of a listener trying to understand what was spoken.

The conflicting stimuli in this experiment had the potential to provide both group-final and group-initial cues: if listeners responded 2 + 3, this could be because they perceived the lengthening on syllable 2 to be group-final and the pitch rise on syllable 3 to be group-initial (in the case of ‘b’ stimuli, or vice versa for ‘a’ stimuli). However, the question of group-initial versus group-final cues was not pursued in the analysis of this experiment. A future experiment could use similar stimuli to investigate further how the interdependence of durational and tonal cues to rhythmic groups extends to group-initial as well as group-final position.

4.5. Implications for Research on Speech Rhythm

Several early psychologists investigating (general auditory) rhythm perception identified that perceived rhythm results from a regular recurrence of certain sound ‘events’, and that both the timing and the physical structure of these events are important in determining the rhythm that is perceived [e.g. Bolton, 1894; Brown, 1908; Wallin, 1901; Woodrow, 1909]. However, over several decades of research on speech rhythm in phonetics, the timing of events in speech (i.e. the duration of units like syllables and feet) has dominated, leaving the physical structure of these units relatively under-researched [for a history of speech-rhythm research, see e.g. Kohler, 2009]. Recently this research has culminated in the use of various rhythm metrics, which are derived (almost exclusively) from measured durations of vowels or consonants [e.g. Bertinetto and Bertini, 2008; Dellwo, 2006; Grabe and Low, 2002; Ramus et al., 1999]. A few exceptions are metrics derived from measures of non-temporal properties that make up the physical structure of speech units, such as amplitude, spectral dispersion and f0, or a combination of some of these [e.g. Lee and Todd, 2004; Low, 1998; Tilsen and Johnson, 2008]. These properties may (partly) cue prominence in speech, which plays an important role in perceived rhythm.
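
To make concrete what a duration-based rhythm metric computes, the following minimal Python sketch derives three widely used ones from a list of labelled interval durations: %V and ΔC, as commonly attributed to Ramus et al. [1999], and the normalised Pairwise Variability Index (nPVI), as commonly attributed to Grabe and Low [2002]. The function names, the interval format and the duration values are assumptions invented for illustration; they are not data or procedures from this experiment, and the cited papers should be consulted for the exact definitions.

```python
from statistics import pstdev

# Hypothetical (label, duration in ms) intervals: "V" = vocalic, "C" = consonantal.
# These values are invented for illustration and are not data from this study.
intervals = [("C", 80), ("V", 120), ("C", 60), ("V", 200), ("C", 100), ("V", 90)]


def durations(intervals, label):
    """Return the durations of all intervals carrying the given label."""
    return [dur for lab, dur in intervals if lab == label]


def percent_v(intervals):
    """%V: percentage of the total duration that is vocalic."""
    total = sum(dur for _, dur in intervals)
    return 100.0 * sum(durations(intervals, "V")) / total


def delta_c(intervals):
    """Delta-C: standard deviation of consonantal interval durations.

    The population SD is used here as an assumption; the sample SD
    (statistics.stdev) is an equally plausible choice.
    """
    return pstdev(durations(intervals, "C"))


def npvi(durs):
    """nPVI: mean normalised difference between successive durations, times 100."""
    if len(durs) < 2:
        raise ValueError("nPVI needs at least two intervals")
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in zip(durs, durs[1:])]
    return 100.0 * sum(terms) / len(terms)


if __name__ == "__main__":
    print(f"%V      = {percent_v(intervals):.1f}")
    print(f"delta-C = {delta_c(intervals):.1f} ms")
    print(f"nPVI(V) = {npvi(durations(intervals, 'V')):.1f}")
```

None of these functions takes f0, amplitude or spectral information as input: each is defined purely over interval durations.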

The experiment reported in this paper was one within a series of planned experiments concerning the interdependence of duration and f0 as cues to speech rhythm [Cumming, 2010, in press; Cumming, 2011, in press]. The rationale was that if f0 and duration are interdependent cues, as indeed this experiment found was the case for rhythmic groups, then this has a major implication for current research on speech rhythm in phonetics. That is, f0 as well as duration should be included in the scope of the research, so that more insight is gained into the timing and physical structure of speech units.

This experiment concerned perceptual grouping, and, more specifically, it demonstrated a group-boundary phenomenon that is related to prominence, as listeners identified group boundaries by perceiving one syllable as more prominent than others. Prominence is not limited to boundaries, though this varies across languages. The early investigations of (general auditory) rhythm established that perceptual grouping is important in the perception of rhythm. This connection between grouping and rhythm has not been widely discussed in speech-rhythm research [though see Wagner, forthcoming]. The perception of speech rhythm may also concern perceptual grouping; a certain pattern of prominent and non-prominent syllables can form a structure of perceived groups in speech, which leads to the perception of a certain rhythm. A further experiment in the series [Cumming, 2011, in press] went on to investigate whether f0 and duration are interdependent cues in the perceived rhythmicality of sentences. The effect found was related not just to group boundaries, but to a longer pattern of prominent and non-prominent syllables in linguistically more complex utterances.

5. Conclusion

Several conclusions can be drawn from this experiment. First, groups of syllables in continuous speech that are cued by prosodic properties (f0, duration) are psychologically real. Second, rising f0 and lengthening are interdependent perceptual cues to rhythmic groups, since the two cues are significantly more effective than one when heard simultaneously, and significantly less effective than one when heard in conflicting positions around the boundary location. When one cue is absent, the other cues the boundary, perhaps because the underlying percept correlates with two physical properties (increased duration and rising f0). Third, rising f0, to be an effective cue, must be of a great enough excursion that it could be related to linguistic structure rather than an acoustic microfluctuation. Finally, the relative weighting of duration and f0 in the interdependence of these cues depends on listeners’ native language.

Acknowledgements

I would like to thank Francis Nolan, Bill Barry and Brechtje Post for comments, suggestions and discussion of the work at various stages, and Stephan Schmid and Daniel Elmiger for providing locations to test participants in Switzerland. This work forms part of research conducted for the author’s PhD which was funded by an AHRC doctoral award.

References

Andreassen, H.N.: Aspects de la durée vocalique dans le vaudois. Bulletin PFC: Phonologie du Français Contemporain – Usages, Variétés et Structure 6: 115–134 (2006).

Baayen, R.H.: Analyzing linguistic data: a practical introduction to statistics (Cambridge University Press, Cambridge 2008).

Beddor, P.S.; Gottfried, T.L.: Methodological issues in cross-language speech perception research with adults; in Strange, Speech perception and linguistic experience: issues in cross-language research, pp. 207–232 (York Press, Baltimore 1995).

Benguerel, A.-P.: Duration of French vowels in unemphatic stress. Lang. Speech 14: 383–391 (1971).

Benguerel, A.-P.: Corrélats physiologiques de l’accent en français. Phonetica 27: 21–35 (1973).

Bertinetto, P.M.; Bertini, C.: On modeling the rhythm of natural languages. Speech Prosody 2008, Campinas 2008.

Boersma, P.; Weenink, D.: Praat: Doing phonetics by computer (version 5.0.21, computer program). Retrieved 24th April 2008, from http://www.praat.org/.

Bolton, T.L.: Rhythm. Am. J. Psychol. 6: 145–238 (1894).

Boucher, V.J.: On the function of stress rhythms in speech: evidence of a link with grouping effects on serial memory. Lang. Speech 49: 495–519 (2006).

Bouwhuis, D.G.; Bergmans, J.; de Rooij, J.J.: A model for the perception of prosodic boundaries. IPO Annu. Progr. Rep. 13: 76–82 (1978).

Brown, W.: Time in English verse rhythm (Science Press, New York 1908).

Carlson, R.; Granström, B.; Heldner, M.; House, D.; Megyesi, B.; Strangert, E.; Swerts, M.: Boundaries and groupings – the structuring of speech in different communicative situations: a description of the GROG project. TMH-QPSR Fonetik 43 (2002).

Carlson, R.; Swerts, M.: Relating perceptual judgments of upcoming prosodic breaks to f0 features. PHONUM 9: 181–184 (2003).

Crowder, R.G.: The demise of short-term memory. Acta psychol. 50: 75–84 (1982).

Crystal, D.: Prosodic systems and intonation in English (Cambridge University Press, Cambridge 1969).

Cumming, R.E.: The effect of dynamic fundamental frequency on the perception of duration. J. Phonet. (2010, in press).

Cumming, R.E.: The language-specific interdependence of tonal and durational cues in perceived rhythmicality. Phonetica (2011, in press).

Cumming, R.E.: Speech rhythm: the language-specific integration of pitch and duration; doct. thesis University of Cambridge (2010).

Cutler, A.; Dahan, D.; van Donselaar, W.: Prosody in the comprehension of spoken language: a literature review. Lang. Speech 40: 141–202 (1997).

de Pijper, J.R.; Sanderman, A.A.: On the perceptual strength of prosodic boundaries and its relation to suprasegmental cues. J. acoust. Soc. Am. 96: 2037–2047 (1994).

de Rooij, J.J.: Perception of prosodic boundaries. IPO Annu. Progr. Rep. 11: 20–24 (1976).

Dellwo, V.: Rhythm and speech rate: a variation coefficient for deltaC; in Karnowski, Szigeti, Language and language processing. Proc. 38th Linguistic Colloquium, pp. 231–241 (Lang, Frankfurt 2006).

Di Cristo, A.: Vers une modélisation de l’accentuation du français: première partie. French Lang. Stud. 9: 143–179 (1999).

Di Cristo, A.: Vers une modélisation de l’accentuation du français: deuxième partie. French Lang. Stud. 10: 27–44 (2000).

Field, A.: Discovering statistics using SPSS (Sage, London 2005).

Fitzpatrick-Cole, J.: The alpine intonation of Bern Swiss German. 14th Int. Congr. Phonet. Sci., San Francisco 1999, pp. 941–944.

Fleischer, J.; Schmid, S.: Illustrations of the IPA: Zurich German. J. Int. Phonet. Assoc. 36: 243–253 (2006).

Fraisse, P.: Rhythm and tempo; in Deutsch, Psychology of music, pp. 149–180 (Academic Press, New York 1982).

Frankish, C.: Modality-specific grouping effects in short-term memory. J. Mem. Lang. 24: 200–209 (1985).

Frankish, C.: Perceptual organization and pre-categorical acoustic storage. J. exp. Psychol. Learning Mem. Cogn. 15: 469–479 (1989).

Galloway, R.E.: Bilinguals’ interacting phonologies? A study of speech production in French-Swiss German bilinguals; MPhil thesis University of Cambridge (2007).

Grabe, E.; Low, E.L.: Durational variability in speech and the rhythm class hypothesis; in Warner, Gussenhoven, Papers in Laboratory Phonology vii, pp. 515–543 (Mouton de Gruyter, Berlin 2002).

Grosjean, F.; Carrard, S.; Godio, C.; Grosjean, L.: Long and short vowels in Swiss French: their production and perception. French Lang. Stud. 17: 1–19 (2007).

Guaïtella, I.: Rhythm in speech: what rhythmic organizations reveal about cognitive processes in spontaneous speech production versus reading aloud. J. Pragmatics 31: 509–523 (1999).

Haas, W.: Die deutschsprachige Schweiz; in Schläpfer, Die viersprachige Schweiz, pp. 73–160 (Benziger, Zürich 1982).

Halliday, M.A.K.: Categories of the theory of grammar. Word 17: 241–292 (1960).

Häsler, K.; Hove, I.; Siebenhaar, B.: Die Prosodie des Schweizerdeutschen – Erkenntnisse aus der sprachsynthetischen Modellierung von Dialekten. Linguistik Online 24: 187–224 (2005).

Haugen, E.: Dialect, language, nation; in Pride, Holmes, Sociolinguistics, pp. 97–116 (Penguin, Harmondsworth 1966).

Hay, J.F.; Diehl, R.L.: Perception of rhythmic grouping: testing the iambic/trochaic law. Perception Psychophysics 69: 113–122 (2007).

House, D.: Tonal perception in speech (Lund University Press, Lund 1990).

Howell, D.C.: Statistical methods for psychology (Thomson Wadsworth, Belmont 2007).

Iversen, J.R.; Patel, A.D.; Ohgushi, K.: Perception of rhythmic grouping depends on auditory experience. J. acoust. Soc. Am. 124: 2263–2271 (2009).

Jakobson, R.; Fant, G.; Halle, M.: Preliminaries to speech analysis: the distinctive features and their correlates. Acoustics Lab., MIT, Techn. Rep. 13 (1952).

Jun, S.-A.; Fougeron, C.: A phonological model of French intonation; in Botinis, Intonation: analysis, modelling and technology, pp. 209–242 (Kluwer Academic, Dordrecht 2000).

Knecht, P.; Rubattel, C.: A propos de la dimension sociolinguistique du français en suisse romande. Français moderne 52: 138–150 (1984).

Kohler, K.J.: Rhythm in speech and language: a new research paradigm. Phonetica 66: 29–45 (2009).

Ladd, D.R.: Intonational phonology (Cambridge University Press, Cambridge 1996).

Lee, C.S.; Todd, N.P.M.: Towards an auditory account of speech rhythm: application of a model of the auditory ‘primal sketch’ to two multi-language corpora. Cognition 93: 225–254 (2004).

Leemann, A.; Siebenhaar, B.: Intonational and temporal features of Swiss German. 16th Int. Congr. Phonet. Sci., Saarbrücken 2007.

Lehiste, I.: Suprasegmentals (MIT Press, Cambridge 1970).

Low, E.L.: Prosodic prominence in Singapore English; doct. thesis University of Cambridge (1998).

Miller, J.S.: Swiss French prosody: intonation, rate and speaking style in the Vaud canton; doct. thesis University of Illinois at Urbana-Champaign (2007).

Nazzi, T.; Iakimova, G.; Bertoncini, J.; Fredonie, S.; Alcantara, C.: Early segmentation of fluent speech by infants acquiring French: emerging evidence for crosslinguistic differences. J. Mem. Lang. 54: 283–299 (2006).

Nazzi, T.; Ramus, F.: Perception and acquisition of linguistic rhythm by infants. Speech Commun. 41: 233–243 (2003).

Nespor, M.; Vogel, I.: Prosodic phonology (Foris, Dordrecht 1986).

Nolan, F.; Hausmann, T.: Are phrase tones in Swiss German intonation stress-seeking? Between Stress and Tone Conf., Leiden 2005.

Pierrehumbert, J.: The phonology and phonetics of English intonation (MIT, Cambridge 1980).

Post, B.: Tonal and phrasal structures in French intonation; doct. thesis Catholic University of Nijmegen (2000).

Ramus, F.; Nespor, M.; Mehler, J.: Correlates of linguistic rhythm in the speech signal. Cognition 73: 263–292 (1999).

Reeves, C.; Schmauder, A.R.; Morris, R.K.: Stress grouping improves performance on an immediate serial list recall task. J. exp. Psychol. Learning Mem. Cogn. 26: 1638–1654 (2000).

Repp, B.H.: Phonetic trading relations and context effects: new experimental evidence for a speech mode of perception. Psychol. Bull. 92: 81–110 (1982).

Rossi, M.: Le seuil de glissando ou seuil de perception des variations tonales pour les sons de la parole. Phonetica 23: 1–33 (1971).

Schmid, S.: Un nouveau fondement phonétique pour la typologie rhythmique des langues? Workshop for the 10ème anniversaire du Laboratoire d’analyse informatique de la parole (LAIP). Lausanne 2001.

Singy, P.: L’image du français en suisse romande: une enquête sociolinguistique en pays de Vaud (L’Harmattan, Paris 1996).

Swerts, M.: Prosodic features at discourse boundaries of different strength. J. acoust. Soc. Am. 101: 514–521 (1997).

Tilsen, S.; Johnson, K.: Low-frequency Fourier analysis of speech rhythm. J. acoust. Soc. Am. 124: EL34–EL39 (2008).

Vaissière, J.: Rhythm, accentuation and final lengthening in French; in Sundberg, Nord, Carlson, Music, language, speech and brain. Proc. int. Symp. at the Wenner-Gren Center, Stockholm 1990, pp. 108–120 (1991).

Voillat, F.: Aspects du français régional actuel; actes du colloque de dialectologie francoprovençale, Neuchâtel 1969, pp. 216–241 (University of Neuchâtel, Neuchâtel 1971).

Wagner, P.: Two sides of the same coin? Investigating iambic and trochaic timing and prominence in German poetry. Proc. Speech Prosody 2010, Chicago 2010.

Wagner, P.: The rhythm of language and speech: constraints, models, metrics and applications. Available at http://www.uni-bielefeld.de/lili/personen/pwagner/pubs.html (accessed September 2010, forthcoming).

Wallin, J.E.W.: Researches on the rhythm of speech. Stud. Yale Psychol. Lab. 9: 1–142 (1901).

Woodrow, H.H.: A quantitative study of rhythm; the effect of variations in intensity, rate and duration (Science Press, New York 1909).