9
Perception & Psychophysics 1980, Vol. 27 (5), 435-443 Conditions on rate normalization in speech perception RANDY L. DIEHL, ARTHUR F. SOUTHER, and CHARLES L. CONVIS University of Texas, Austin, Texas 78712 Recent work by Summerfield (1975) and others indicates that a listener's phonemic judg- ments may vary with the utterance rate of prior context. In particular, if a phonemic distinction is signaled by a temporal cue such as voice onset time (VOT),faster utterance rates tend to shift the phoneme boundary toward smaller values of that cue. The listener thus appears to "normal- ize" temporal cues according to utterance rate. In the present experiment, subjects identified syllables varying in VOT «(ga)·[kha)) following either a slow or a fast version of the phrase "Teddy hears ." Typical normalization effects were observed when the precursor phrase and target syllable had formant frequencies corresponding to an adult male vocal tract. How- ever, a reversal of the typical pattern (i.e., a shift in the perceived voicing boundary toward larger values of VOT with an increased utterance rate) occurred when the precursor and target had formant frequencies corresponding to an adult female vocal tract. Both normalization and "reverse" normalization effects were reduced or eliminated under several conditions of source change between precursor and target. These conditions included a change in fundamental fre- quency, a change in implied vocal-tract size (as reflected in an upward or downward scaling of formant frequencies), or both. There now exists abundant evidence that syllable- sized portions of the speech signal are recognized in a highly context-dependent manner. Syllable labeling is affected by, among other things, the phonemic value and perceptual quality of adjacent syllables (Diehl, Elman, & McCusker, 1978; Diehl, Lang, & Parker, 1980; Eimas, 1963), the language (e.g., En- glish vs. Spanish) of a precursor phrase (Elman, Diehl, & Buchwald, 1977), and the utterance rate of a precursor phrase (Port, 1979; Summerfield, 1975; Minifie, Kuhl, & Stecher, Note 1; Port, Note 2; Summerfield, Note 3, Note 4, Note 5). It is with this last type of contextual influence, usually referred to as rate normalization, that this paper will be con- cerned. In the seminal investigations of rate normalization by Summerfield, listeners identified a series of stop consonant-vowel syllables, each of which occurred at the end of a phrase that was either rapidly or slowly articulated. I The syllables varied in voice onset time (VaT) and were perceived as beginning with either the voiced stop [g] or the voiceless stop [k]. [VaT re- fers to the interval between closure release and the onset of vocal-fold vibration (Lisker & Abramson, 1964). Small values of VaT correspond to voiced initial stops, whereas relatively large VaT values cor- This work was supported by NIH Grant NS13764. Some of the results reported here were also presented at the 96th Meeting of the Acoustical Society of America, Honolulu, November 1978. We thank Carol Lamb for her help in collecting the data. Requests for reprints should be sent to Randy L. Diehl, Department of Psychol- ogy, 330 Mezes, University of Texas at Austin, Austin, Texas 78712. respond to voiceless initial stops.] Summerfield found that, with a faster precursor utterance rate, the perceived boundary between voiced and voiceless stops shifted toward smaller values of VaT. That is, listeners tended to identify more of the test syllables as having initial [k]. Summerfield (Note 3) attributed this result to the listener's tendency to compensate for the temporal reduction of acoustic segments, including vaT inter- vals, which is a natural consequence of faster articu- latory rates. Although this explanation is quite plau- sible, the critical assumption that VaT actually de- creases as rate of utterance increases has not been uniformly supported in the literature. In a study of VaT production in three different rate conditions, Summerfield (1975) did find the expected relation- ship. However, two other studies (Baran, Laufer, & Daniloff, 1977; Lisker & Abramson, 1967) failed to find any correlation between utterance rate and VaT. The discrepancy may be due to an important methodological difference: While Summerfield in- structed his subjects to speak at several rates, the other investigators apparently allowed each subject to establish his own speaking rate. Under the latter condition, one would expect less within-subject rate variation and hence less chance of observing a rate- VaT correlation. For the moment, then, we shall accept Summerfield's account of why listeners shift their category boundaries as speaking rate increases. In addition to the VaT effects reported by Summerfield, rate normalization has been found to occur for medial stops whose voicing characteristic is signaled by closure duration (Port, 1979, Note 2) Copyright 1980 Psychonomic Society, Inc. 435 0031-5117/80/050435-09$01.15/0

Conditions on rate normalization in speech perception

Embed Size (px)

Citation preview

Perception & Psychophysics1980, Vol. 27 (5), 435-443

Conditions on rate normalizationin speech perception

RANDY L. DIEHL, ARTHUR F. SOUTHER, and CHARLES L. CONVISUniversity of Texas, Austin, Texas 78712

Recent work by Summerfield (1975) and others indicates that a listener's phonemic judg­ments may vary with the utterance rate of prior context. In particular, if a phonemic distinctionis signaled by a temporal cue such as voice onset time (VOT), faster utterance rates tend to shiftthe phoneme boundary toward smaller values of that cue. The listener thus appears to "normal­ize" temporal cues according to utterance rate. In the present experiment, subjects identifiedsyllables varying in VOT «(ga)·[kha)) following either a slow or a fast version of the phrase"Teddy hears ." Typical normalization effects were observed when the precursor phraseand target syllable had formant frequencies corresponding to an adult male vocal tract. How­ever, a reversal of the typical pattern (i.e., a shift in the perceived voicing boundary towardlarger values of VOT with an increased utterance rate) occurred when the precursor and targethad formant frequencies corresponding to an adult female vocal tract. Both normalization and"reverse" normalization effects were reduced or eliminated under several conditions of sourcechange between precursor and target. These conditions included a change in fundamental fre­quency, a change in implied vocal-tract size (as reflected in an upward or downward scaling offormant frequencies), or both.

There now exists abundant evidence that syllable­sized portions of the speech signal are recognized ina highly context-dependent manner. Syllable labelingis affected by, among other things, the phonemicvalue and perceptual quality of adjacent syllables(Diehl, Elman, & McCusker, 1978; Diehl, Lang, &Parker, 1980; Eimas, 1963), the language (e.g., En­glish vs. Spanish) of a precursor phrase (Elman,Diehl, & Buchwald, 1977), and the utterance rate ofa precursor phrase (Port, 1979; Summerfield, 1975;Minifie, Kuhl, & Stecher, Note 1; Port, Note 2;Summerfield, Note 3, Note 4, Note 5). It is with thislast type of contextual influence, usually referred toas rate normalization, that this paper will be con­cerned.

In the seminal investigations of rate normalizationby Summerfield, listeners identified a series of stopconsonant-vowel syllables, each of which occurred atthe end of a phrase that was either rapidly or slowlyarticulated. I The syllables varied in voice onset time(VaT) and were perceived as beginning with eitherthe voiced stop [g] or the voiceless stop [k]. [VaT re­fers to the interval between closure release and theonset of vocal-fold vibration (Lisker & Abramson,1964). Small values of VaT correspond to voicedinitial stops, whereas relatively large VaT values cor-

This work was supported by NIH Grant NS13764. Some of theresults reported here were also presented at the 96th Meeting of theAcoustical Society of America, Honolulu, November 1978. Wethank Carol Lamb for her help in collecting the data. Requests forreprints should be sent to Randy L. Diehl, Department of Psychol­ogy, 330 Mezes, University of Texas at Austin, Austin, Texas78712.

respond to voiceless initial stops.] Summerfieldfound that, with a faster precursor utterance rate, theperceived boundary between voiced and voicelessstops shifted toward smaller values of VaT. That is,listeners tended to identify more of the test syllablesas having initial [k].

Summerfield (Note 3) attributed this result to thelistener's tendency to compensate for the temporalreduction of acoustic segments, including vaT inter­vals, which is a natural consequence of faster articu­latory rates. Although this explanation is quite plau­sible, the critical assumption that VaT actually de­creases as rate of utterance increases has not beenuniformly supported in the literature. In a study ofVaT production in three different rate conditions,Summerfield (1975) did find the expected relation­ship. However, two other studies (Baran, Laufer, &Daniloff, 1977; Lisker & Abramson, 1967) failed tofind any correlation between utterance rate andVaT. The discrepancy may be due to an importantmethodological difference: While Summerfield in­structed his subjects to speak at several rates, theother investigators apparently allowed each subjectto establish his own speaking rate. Under the lattercondition, one would expect less within-subject ratevariation and hence less chance of observing a rate­VaT correlation. For the moment, then, we shallaccept Summerfield's account of why listeners shifttheir category boundaries as speaking rate increases.

In addition to the VaT effects reported bySummerfield, rate normalization has been found tooccur for medial stops whose voicing characteristicis signaled by closure duration (Port, 1979, Note 2)

Copyright 1980 Psychonomic Society, Inc. 435 0031-5117/80/050435-09$01.15/0

436 DIEHL, SOUTHER, AND CONVIS

and for initial consonants whose manner (stop vs.glide) is signaled by transition duration (Minifieet al., Note 1). In view of their relative compress­ibility (Klatt, 1976), it is not surprising that vowelsalso are subject to rate normalization (Ainsworth,1974; Verbrugge & Shankweiler, Note 6). There canthus be little doubt that normalization has a generaland useful role in speech recognition.

Our main concern in the present paper is whetherrate normalization may be reduced or eliminated al­together if an abrupt change in source characteristics(e.g., fundamental frequency or implied vocal tractsize) occurs between the precursor phrase and the tar­get syllable. It is quite conceivable that syllable rateis monitored independently of source characteristicsand that the normalizing mechanism simply ignoresdiscontinuous changes in those characteristics. Thereare several reasons, however, to predict that abruptsource changes might, indeed, reduce the listener'stendency to normalize. First, it is well known thatcontinuous source characteristics (in particular, fun­damental frequency) are important in allowing alistener to attend selectively to one voice amongmany (Broadbent & Ladefoged, 1957; Parsons, 1976;Spieth, Curtis, & Webster, 1954) and to fuse fre­quency components into a single message (Darwin &Bethell-Fox, 1977). Both of these functions wouldappear to be useful in achieving accurate normaliza­tion, especially in noisy environments. Second, dis­continuous changes in source characteristics nor­mally imply that a change in talker has occurred.Since there would be no particular reason for thelistener to normalize one talker's utterance on thebasis of a different talker's utterance, one wouldexpect normalization to be blocked.

In Experiment 1, we compared the extent of ratenormalization that occurs (1) when an abrupt changein fundamental frequency (Fs) is introduced betweena precursor phrase and a target syllable, and (2) whenthe precursor and target share a more continuouspitch contour. Experiment 2 was analogous, exceptthat the source-change conditions involved an abruptshift in implied vocal-tract size, as reflected in a sys­tematic raising or lowering of formant frequencies.Finally, in Experiment 3, the effects of combinedF, and vocal-tract-size changes were compared withthose of continuous-source conditions.

EXPERIMENT 1

MethodSubjects. Fifty-three introductory psychology students at the

University of Texas at Austin received course credit for theirparticipation as subjects. All reported having normal hearing.

Stimuli and Procedure. Two series of nine target syllables wereprepared by means of a computer-controlled OVE Illd speechsynthesizer at the University of Texas at Austin. Each seriesvaried in VOT from 10 to 50 msec in 5-msec steps and ranged

perceptually from [gal to [kha]. The syllables had an initial 50­msec period of formant transitions followed by a 290-msec steady­state (vowel) period. The frequencies of the steady-state segmentsof Formants I through 3 (F .. F

"F3) were 755, 1,345, and 2,470 Hz,

respectively. Formant transitions began at 200 Hz (F,), 1,900 Hz(F,), and 1,960 Hz (F3 ) and were approximately linear. Delay invoicing relative to stimulus onset was simulated by sharply atten­uating the first formant through bandwidth expansion and byreplacing periodic excitation with noise excitation during the F I

attenuation period. Amplitude was held constant during the first230 msec of each syllable and then decreased in linear decibelsteps.

The two series differed only in Fo contour and amplitude (theOVE Illd synthesizer is configured to produce sounds of greaterintensity as F, increases, other parameter settings being equal).For one of the series. Fo was set at 112 Hz during the first 230 msecof a syllable, then dropped to 97 Hz for the remaining 110 msec.For the other series, F, was set at 178 Hz during the first 230 msec,then dropped to 145 Hz. The higher pitched syllables were approx­imately 2 dB more intense than the lower pitched ones.

Each target syllable appeared 100 msec after the synthetic pre­cursor phrase "Teddy hears." Four versions of this precursor wereused: slow and low-pitched, fast and low-pitched, slow and high­pitched, fast and high-pitched. The slow and fast precursors haddurations of 730 and 445 msec, respectively. The fast precursorswere produced by shortening all acoustic segments of the slowprecursors, including formant transitions, by roughly the sameproportion while leaving formant frequencies unchanged. Fo forthe low-pitched precursors (both fast and slow) was 115 Hz duringthe initial 30070 of the utterance, then dropped to 112 Hz for theremainder. For the high-pitched precursors, the corresponding F,values were 173 and 168 Hz. These F, contours were selected tobe approximately continuous with those of the low- and high­pitched target syllables, respectively. The experimenters judgedeach of the precursors to be reasonably natural sounding. DespiteF, differences, each of the precursors and targets had an adultmale vocal quality. Figure I displays broad-band spectrograms ofa low-pitched [kha] target syllable in the context of both the slowand fast low-pitched precursors.

Four experimental tapes were prepared, each representing a dif­ferent combination of precursor pitch and target pitch (i.e., low­low, high-low, high-high, and low-high). On a given tape, each ofthe nine target syllables of a particular pitch appeared 10 timeswith each of the two precursors (slow and fast) of a particularpitch, for a total of 180 full utterances per tape. The stimulustokens were completely randomized, and there was a 2-sec inter­stimulus interval. Twenty-six of the subjects were presented thetwo tapes with the low-pitched targets, and the other 27 subjectsheard the two tapes with the high-pitched targets. Order of tapepresentation was counterbalanced across subjects. Before eachidentification test, the subjects were given a short practice sessionin which they listened to a sample of the utterances they would bejudging.

Stimulus tapes were played on a Teac A2300 S tape recorder;the signal was amplified and presented binaurally through Tele­phonics TDH-39 earphones at a comfortable listening level (ap­proximately 78 dB SPL for the low-pitched target syllables). Thesubjects were instructed to identify the initial consonant of thefinal (target) syllable of each utterance as either G or K on ananswer sheet provided.

Results andDiscussionRate normalization was assessed for each subject by

subtracting the number of [g) labeling responses inthe fast-precursor conditions from the number of [g)responses in the corresponding slow-precursor condi­tions. A positive difference (D) represents a tendencyto normalize in the expected way, that is, to shift the

CONDITIONS ON NORMALIZATION 437...-

4NX~- 3)-UZ 2W::::>0wa::LL

l250 MSECJ

TEDDY HEARS KA

-- 3)­UZ 2W::::>awa::LL

l250 MSECJ

TEDDY HEARS KAFigure 1. Broad-band spectrograms of a low-pitched [kha] target syllable in the context of both the slow and fast low-pitched

precursor.

perceivedvoicing boundary toward smaller VOT val­ues in the context of the fast precursor.

The subjects who heard the low-pitched target syl­lables demonstrated a small, but significant, amountof rate normalization both in the low-pitched precur­sor condition (0 = 2.58, p < .01) and in the high­pitched precursor condition (0=2.81, p< .01).1,3These two conditions (low- vs. high-pitched precur­sor) did not differ significantly. For the subjects whoheard the high-pitched targets, there was significantnormalization in the high-pitched precursor condi­tion (D=4.89, p < .001) and also in the low-pitchedprecursor condition (0 = 2.30, p < .Dl), but the for­mer effect was reliably greater than the latter(p < .001). Figure 2 shows the group identificationfunctions for both the low- and high-pitched targetsyllables. In summary, a discontinuous pitch changebetween the precursor and the target syllable reduced

the magnitude of normalization for the high-pitched,but not the low-pitched, targets.

There are at least two possible reasons for thisasymmetry. First, the extent of frequency change was10 Hz greater (66 Hz vs. 56 Hz) between the low­pitched precursor and the high-pitched target thanbetween the high-pitched precursor and the low­pitched target. Second, falling intonation near theend of an utterance is considerably more common inEnglish than rising intonation. Declarative sentencesand WH questions normally end with a falling con­tour (Armstrong & Ward, 1931; Lieberman, 1967;Palmer, 1924; Pike, 1945), and even "yes-no" ques­tions carry a falling terminal contour more oftenthan a rising one (Fries, 1964). In view of this, thedownward-pitch-change condition may have seemedless unnatural to the listeners than the upward-pitch­change condition (at least relative to their respective

438 DIEHL, SOUTHER, AND CONVIS

Figure 2. Results of Experiment 1. Target-syllable identificationfollowing slow precursor (solid circles) and fast precursor (opencircles).

EXPERIMENT 2

It is well known that, for a uniform cylindricalair-filled tube, the resonant frequencies vary in­versely with length. In most of its configurations,the vocal tract is a highly nonuniform tube, butthe above relationship generally remains valid to afair approximation (Fant, 1960). Thus, apart fromFo and other glottal source characteristics (Monsen,1977), the principal acoustic basis for distinguishingone vocal tract from another is the location of energymaxima (i.e., formants) in the output spectrum. Fora given vowel, an adult female's formant frequencieswill in general be higher than those of an adult male(with his longer vocal tract), and a child's formantfrequencies will be higher still (Fant, 1973; Peterson& Barney, 1952).

The present experiment was analogous to Experi­ment 1, except that, instead of manipulating F, in thesource-change condition, we introduced a change inimplied vocal-tract size, as reflected in an upward ordownward scaling of formant frequencies. betweenthe precursor and the target syllable. In a similar-ex­periment, Summerfield (Note 4) found that a scalefactor of 1.09 (upward or downward formant shift)was not sufficient to reduce rate normalization rel­ative to control conditions in which implied vocal­tract size was held constant. We therefore decided touse a larger scale factor (viz, 1.19) in constructingour source-change stimuli. This particular scale fac­tor was chosen because it closely approximates theaverage ratio of adult female/adult male vowel for­mant frequencies (Fant, 1973), and we felt that sucha change could hardly go unnoticed by the listeners.4

MethodSubjects. Thirty-three individuals selected from the same popu­

lation as those in Experiment 1 served as subjects. None had par­ticipated in the previous experiment.

Stimuli and Procedure. Two series of seven [gaj-[khaj targetsyllableswere synthesized. In each series, VOT values ranged from15 to 45 msec in 5-msec steps. The syllables of one series had thesame duration and formant frequencies as the target syllables usedin Experiment 1. The syllables of the other series were identical tothese except that all formant values were multiplied by a scale fac­tor of 1.19. The formant frequencies of the two series corre­sponded to the output of an adult male (large) and an adult fe­male (small) vocal tract, respectively. For both series, F. began at118 Hz and declined linearly to 97 Hz through the voiced portionof the stimulus.

Each target syllable appeared 100 msec after four new versionsof the precursor "Teddy hears.' ~ Two of these had durations andformant frequencies that were the same as those of the slow andfast precursors used in Experiment 1, except that the flap-closureduration of "Teddy" was shortened by 10 msec in both the slowand fast precursors to yield more natural-sounding stimuli. Theother two precursors were identical to these except that all formantfrequencies were multiplied by a scale factor of 1.19. Again, thehigher formant frequencies corresponded to the output of a female(small) vocal tract and the lower formant frequencies to a male(large) vocal tract. For both types of precursors, F. had a slightlyfalling contour from 118to 112 Hz.

LON Fa TARGETLOW F. PRECURSOR HIGH F. PRECURSOR

50.

100

1Il 25...1IlZ0 0

10 20 50Il. 10 201Il...II:

r-rOJ......

HIGH Fo TARGETI-z LOW F. PRECURSOR HIGH F. PRECURSOR... 100(,)II:...Il. 15

50

25

010 20

TARGET SYLLABLE VOICE ONSET TIME (MSEC)

continuous-pitch control conditions). Other thingsbeing equal, greater naturalness would presumablylead to greater normalization. The above argumentnotwithstanding, we wish to emphasize that neitherof the pitch-change conditions mimicked the terminalF, adjustments typically observed in natural utter­ances. Natural Fo changes, although perhaps com­parable in magnitude to those we used. tend to begradual and continuous [see. for example. the Focontours depicted in Lieberman (1967)]. quite unlikethe abrupt changes that occurred between our precur­sors and targets. That neither our upward nor ourdownward pitch-change stimuli sounded very naturalis evidenced by the fact that many of the subjects hadto suppress laughter upon first hearing them duringthe practice session. Our continuous-pitch stimuli didnot evoke this reaction.

Despite the unnatural quality of the stimuli in thetwo pitch-change conditions, significant rate normal­ization effects were still observed, even if somewhatreduced in magnitude for the upward pitch-changecondition. This may mean that the types of pitch dis­continuity we used were not sufficient to signal inevery instance that a change in speaker had occurred.Since, for any given speaker, Fo is a modulable prop­erty of vocal output, listeners may be willing to toler­ate considerable changes in pitch before they ceaseto normalize, especially when rhythmic and othercues signaling source continuity are present. Accord- .ingly, we decided in Experiment 2 to investigate theeffects of changing a property that is ordinarilyunmodulable for a given speaker, namely. impliedvocal-tract size.

CONDITIONS ON NORMALIZATION 439

LARGE-VOCAL-TRACT TARGET

45

15 25 35 45

SMALL-VOCAL-TRACTPRECURSOR

SMALL-VOCAL-TRACT PRECURSOR

TARGET SYLLABLE VOICE ONSET TIME (MSEC)

15 25 35 45

25

LARGE-VOCAL-TRACT PRECURSOR100

75

50

.....OJw

13 25<II~ oL-J.--L.............l-.Eoio_Q.. 15 25 35 45

13II::

SMALL-VOCAL-TRACT TARGET!Z LARGE-VOCAL-TRACT PRECURSORLLI 100

~~ 75

50

Figure 3. Results of Experiment 2. Target-syllable identificationfollowing slow precursor (solid circles) and fast precursor (opencircles).

EXPERIMENT 3

tract, it seems likely that mixed-source stimuli wouldyield some average (Le., cancellation) of the two op­posite response tendencies. The net result would cor­respond to what we, in fact, observed in the twomixed-source conditions, viz, no significant normal­ization. Presently, we have no basis for choosing be­tween these two accounts, and indeed both may becorrect.

The reversal of the usual normalization pattern inthe small-vocal-tract precursor/small-vocal-tract tar­get condition was obviously unexpected. Before at­tempting to explain this reversal, we must determinewhether it is replicable, particularly under morenatural conditions. In the present experiment, bothlarge- and small-vocal-tract stimuli shared an Fo con­tour typical of an adult male. Accordingly, the small­vocal-tract stimuli, while quite intelligible, had asomewhat unnatural quality (much like the speech ofan adult male dwarf). If reverse normalization occursonly for this type of stimulus, it would be of limitedtheoretical interest. In the next experiment, we soughtto determine, among other things, whether reversenormalization occurs for small-vocal-tract stimuliwhen the Fo pattern corresponds more closely to thatof an adult female talker.

In this experiment, we investigated the effects onrate normalization of (1) a change in Fo betweenprecursor and target, (2) a change in implied vocal-tractsize between precursor and target, and (3) a simul­taneous change in both of these variables. For eachof the continuous-source control conditions, the

Four experimental tapes were constructed in a manner analo­gous to those of Experiment 1. Each tape represented a differentcombination of precursor and target vocal-tract size (i.e., large­large, small-large, small-small, and large-small). On a given tape,each of the seven target syllables corresponding to a particularvocal-tract size appeared 10 times with each of the two precursors(slow and fast) corresponding to a particular vocal-tract size.There was thus a total of 140 stimulus items per tape; all wererandomized with 2-sec interstimulus interval. Fifteen of the sub­jects were presented the two tapes with the large-vocal-tract tar­gets, and the other 18 were presented the two tapes with the small­vocal-tract targets. Order of tape presentation was counterbal­anced across subjects. The same procedures and apparatus de­scribed in Experiment 1were used in the present experiment.

Results and DiscussionSubjects identifying the large-vocal-tract targets

showed significant normalization when the precursoralso corresponded to a large vocal tract (0 =2.67,p < .05) but not when it corresponded to a smallvocal tract (0= -.47, n.s.), The difference betweenthe two conditions was reliable (p < .05). These re­sults contrast with those of Summerfield (Note 4),who, as we have noted, found equalamounts of nor­malization in the two conditions. Although stimulusmaterials and procedures differed in various ways be­tween the two studies,' the most likely source of thediscrepancy is the different-sized scale factors used toconstruct the small-vocal-tract stimuli (1.19 vs. 1.09).

The slight, nonsignificant trend toward "reverse"normalization in the small-vocal-tract precursor/large-vocal-tract target condition presaged our re­sults for the small-vocal-tract target conditions." Inthe latter, subjects demonstrated a seemingly para­doxical tendency to hear fewer targets as [k] whenthey followed the fast precursor. This reverse nor­malization was significant (0 = -1.89, p < .01)when the precursors corresponded to a small vocaltract but not when they corresponded to a large vocaltract (0 = -1.06, n.s.). The difference between theseconditions was not significant, however. Group iden­tification functions for both the large- and small­vocal-tract targets are displayed in Figure 3.

Interpretation of the results for the large-vocal­tract targets is complicated by the evidence of reversenormalization elsewhere in the data. One explanationfor the blocking of normalization in the small-vocal­tract precursor/large-vocal-tract target condition isbased on the mere occurrence of an implied talkerchange. As we noted earlier, there is little to begained in normalizing one speaker's utterance on thebasis of a different speaker's utterance. The optimalperceptual strategy would appear to be one that"resets" the normalizing mechanism whenever a newspeaker commences talking. Although this account isplausible, there is an alternative (perhaps compatible)explanation. Since listeners normalized in the ex­pected manner when both precursor and target corre­sponded to a large vocal tract but reversed this pat­tern when they both corresponded to a small vocal

440 DIEHL,SOUTHER, ANDCONVIS

stimuli mimicked the speech of an adult female (i.e.,relatively high F, and formant frequencies).

HIGH Fo• SMALL- VOCAL-TRACT TARGET (ALL CONDITIONS)

GROUP I

TARGET SYLLABLE VOICE ONSET TIME (MSEC)

45

3S 452515

15

LOW Fo • SMALL-VOCAL-TRACTPRECURSOR

HIGH Fo• LARGE-VOCAL-TRACTPRECURSOR

LOW Fo , LARGE-VOCAL-TRACTPRECURSOR

GROUP 3

35 4525

GROUP 2

HIGH Fo, SMALL-VOCAL-TRACTPRECURSOR

75 'l, ,,,,~

,50 ,,,,25 ,,

a15 25 35 45 15 25 35 45

,75 ''l,

50,

",25 <-

,a

15 25 35 45

100

100

50

HIGH Fo • SMALL-VOCAL-TRACTPRECURSOR

HIGH Fo • SMALL-VOCAL-TRACTPRECURSOR

I- 25ZUJ~ a 15UJQ.

13UlZ 100~UlUJ 750:

Figure 4. Results of Experiment 3. Target-syllable identificationfollowing slow precursor (solid circles) and fast precursor (opencircles).

MethodSubjects. Sixty-four introductory psychology students, none of

whom were in the previous experiments, served as subjects.Stimuli and Procedure. A single series of target syllables was

used for all conditions. This series was identical to the small­vocal-tract targets used in Experiment 2, except that Fo began at178 Hz and declined linearly to 145 Hz through the voiced portionof each stimulus. Four slow/fast pairs of precursors were syn­thesized. One pair was identical to the low-pitched, small-vocal­tract precursors used in Experiment 2. A second pair was identicalto these, except that F0 started at 178 Hz and declined during thecourse of the utterance to 168 Hz. This high-pitched pair wasused in the continuous-source control condition. A third pair wasthe same as the low-pitched, large-vocal-tract precursors of Ex­periment 2. A final pair combined the pitch contour of the secondpair and the formant frequencies of the third (i.e., high-pitched,large vocal tract). As in the other experiments, the target syllablesalways appeared 100 msec after the precursors.

Four tapes were prepared in a manner analogous to the previousexperiments. Each contained a different slow/fast precursor pair,but, as noted above, the target series was always the same. Therewere 140 items per tape (2 precursor rates x 7 targets x 10 repe­titions); all items were randomized, with a 2-sec interstimulusinterval.

The subjects were assigned to three groups. Each group waspresented the tape containing high-pitched, small-vocal-tract pre­cursors (continuous-source control tape) plus one of the remainingthree tapes. Group 1 (18 subjects) received the tape in which alarge upward Fo change occurred between precursor and target;Group 2 (22 subjects) received the tape in which an upward shiftin formant frequencies occurred between precursor and target; andGroup 3 (24 subjects) received the tape in which both types ofchange occurred simultaneously. Order of tape presentation wascounterbalanced across subjects. The procedures and apparatusdescribed in Experiment 1 were again used in the present ex­periment.

Results and DiscussionPerhaps the most striking overall result of this ex­

periment was the finding of significant reverse nor­malization for all three groups in the continuous­source control condition (Group 1, D= -2.72,p < .01; Group 2, D= -1.86, p < .05; Group 3,D= -1.83, p < .01). This confirms and generalizesour seemingly paradoxical result for small-vocal­tract stimuli in Experiment 2. Recall that those stim­uli had a pitch contour that was unnaturally lowgiven the implied vocal-tract size. The fact that re­verse normalization also occurred for the more nat­ural (adult female) stimuli of the present experimentsuggests that it may be a quite general phenomenon.

In contrast to the continuous-source condition,none of source-change conditions yielded significantrate normalization in either direction [Group 1 (pitchchange), D =.61; Group 2 (vocal-tract-size change),D = - .27; Group 3 (both pitch and vocal-tract-sizechange), D = 1.20]. For Groups 1 and 3, the differ­ence between continuous-source and source-changeconditions was reliable (p < .01 in both instances).However, this difference was only marginally signifi-

cant for Group 2 (.05 < P < .1). Figure 4 displaysthe target-syllable identification functions for Groups1 through 3.

We initially assumed that concurrent pitch andvocal-tract-size changes might have mutually rein­forcing effects in blocking normalization. If one sub­tracts the D score of the continuous source conditionfrom that of the source-change condition for eachgroup, there is no evidence of such mutual reinforce­ment of effect. Indeed, the pitch change alone yieldednominally greater blocking than did the change inboth pitch and vocal-tract size, although the differ­ence was certainly not significant.

Along with the results of the first two experiments,the present findings clearly show that normalization(in this case, reverse normalization) can be reducedand even blocked altogether if a discontinuous changein source characteristics occurs between the precursorand the target syllable. For Groups 2 and 3, both ofwhose source-change conditions involved a shift inimplied vocal-tract size, it is unclear whether the re­duction in normalization resulted from the source

CONDITIONS ON NORMALIZATION 441

Table IMean Sentence, Syllable, and VOT Durations (in Milliseconds)

for the Sentence "Teddy Hears ka" at Fast and Slow Rates

In view of our findings, one is tempted to speculatethat the discrepancy between Summerfield's resultsand those of Baran et al. is due at least in part tosex differences. It is conceivable that female talkersgenerally employ different strategies of articulatorytiming than males, and, furthermore, that listenersare sensitive to these differences. An intriguing pos­sibility, one that would obviously help explain ourreverse normalization findings, is that female talkers,unlike male talkers, tend to increase certain VOTintervals (e.g., those associated with stressed syl­lables) at faster utterance rates.

To test this possibility, we conducted the followingexperiment. Four adult male and four adult femalespeakers of American English were asked to pro­nounce the sentence "Teddy hears 'ka'" at both afast and slow rate. The utterances were recorded onaudiotape, and five tokens were selected from eachspeaker and rate condition for temporal analysis.(Only tokens that we judged to be both fluent andnatural sounding were included in the sample.) Frombroad-band spectrograms of the utterances, we mea­sured total utterance duration, duration of the [ka]syllable (from release burst to final glottal pulse), andVOT of the [ka] syllable. Group mean values of thesemeasurements are presented in Table 1. The resultsobviously do not support our hypothesis that femalesincrease VOT intervals of stressed syllables at fasterutterance rates. To the contrary, the female talkersshowed as consistent and substantial a reduction ofVOT at faster rates as the male talkers.

In view of these findings, it appears unlikely thatour reverse normalization results reflect a naturalcompensatory strategy on the part of the listeners.We cannot rule out the possibility that certain idio­syncratic properties of the synthetic speech stimuliused in Experiments 2 and 3 may have been responsi­ble for the occurrence of reverse normalization. Al­though formant frequencies and pitch characteristicsof natural female utterances are easily simulated withthe OVE 111 d synthesizer, the source function pro­duced by the OVE 111d is much more characteristicof male than of female glottal outputs. In addition,our use of a uniform scale factor to create the small­vocal-tract stimuli may have resulted in a somewhatunnatural speech quality (but see Footnote 4). How­ever, there are no Obvious reasons why these or anyother stimulus properties should lead to consistent

Sentence

Fast Slow Fast Slow

73100

[lea) VOT

4752

334376

203208

[kal Syllable

Slow

13131443

Fast

656686

MalesFemales

change per se or from a mutual cancellation of nor­malization and reverse normalization tendencies.However, for Group 1, whose source change in­volved pitch alone, we can apparently rule out thelatter explanation. Because reverse normalizationwas demonstrated for both low-pitched and high­pitched small-vocal-tract stimuli, the cancellation hy­pothesis does not apply. A similar argument can,of course, be made with respect to the pitch-changeresults of Experiment 1.

GENERAL DISCUSSION

There were three main results of the present seriesof experiments. First, rate normalization, that is, ashift in perceived voicing boundary toward smallerVOT values when the precursor utterance rate isincreased, occurred under continuous-source condi­tions if the implied vocal-tract size corresponded tothat of an adult male. Second, a reversal of the abovepattern occurred under continuous-source conditionsif the implied vocal tract size corresponded to that ofan adult female. Third, both normalization and re­verse normalization effects were reduced in magni­tude, even eliminated, under certain conditions ofsource change between the precursor and target. Inparticular, pitch changes and/or changes in impliedvocal-tract size were in many instances sufficient toweaken normalization effects. This third result ap­pears to disconfirm any model of the rate normaliza­tion process in which source information is ignoredor used only for other purposes.

Although our results for the large-vocal-tract stim­uli generally agree with the findings of others, thereversal of the usual normalization pattern for small­vocal-tract stimuli was quite unexpected. This rever­sal occurred in both low- and high-pitched conditionsand appears therefore not to depend on female vocalquality per se but only on female-like formant fre­quencies.

It is worthwhile to reconsider at this point the expla­nation of rate normalization offered by Summerfield(Note 3), which we tentatively accepted. He arguedthat an increase in utterance rate tends to compresssegments of the speech signal, including VOT inter­vals corresponding to initial voiceless stops. Attempt­ing to compensate for this natural VOT compression,the listener does not require as large a VOT value inorder to identify an initial stop as voiceless (Le., hislabeling boundary shifts toward a smaller VOT val­ue). We noted that Summerfield (1975), who usedonly male talkers, did find an inverse correlation be­tween produced VOT and utterance rate but thatother investigators (Baran et al., 1977; Lisker &Abramson, 1967) found no such correlation. UnlikeSummerfield (1975), Baran et al. used female talkers.

442 DIEHL, SOUTHER, AND CONVIS

reverse normalization. In future studies, we plan touse edited versions of natural female utterances inorder to determine the generality of this seeminglyparadoxical phenomenon.

Whatever the basis of our reverse normalizationfindings, it is significant that both normalization andreverse normalization were blocked or reduced inmagnitude by the occurrence of abrupt source changesbefore the onset of the target syllable. Such resultssuggest that listeners follow a quite reasonable per­ceptual strategy: given sufficient cues signaling achange in speaker, disregard prior rate informationand begin the normalizing process anew.

REFERENCE NOTES

I. Minifie, F., Kuhl. P., & Stecher, B. Categorical perceptionof /b/ and Iwl during changes in rate of utterance. Paper pre­sented at the 94th Meeting of the Acoustical Society of America.Miami Beach, December 1977.

2. Port, R. F. Effects of word-internal vs. word-external tempoon the voicing boundary for medial stop closure. Paper presentedat the 95th Meeting of the Acoustical Society of America,Providence, May 1978.

3. Summerfield, A. Q. Towards a detailed model for the per­ception of voicing contrasts (Speech Perception, 3, 1-26; ProgressReport). Belfast, Northern Ireland: Department of Psychology,The Queen's University, 1974.

4. Summerfield, A. Q. Cues, contexts and complications inthe perception of voicing contrasts (Speech Perception, 4, 99-129;Progress Report). Belfast, Northern Ireland: Department of Psy­chology, The Queen's University, 1975.

5. Summerfield, A. Q. Speech rate influences on the perceptionofstop voicing. Paper presented at the 91st Meeting of the Acous­tical Society of America, Washington, D.C., Apri11976.

6. Verbrugge, R. R., & Shankweiler, D. P. Prosodic infor­mation for vowel identity (Status Report on Speech Research,SR-51/52, 27-35). New Haven, Conn: Haskins Laboratories, 1977.

REFERENCES

AINSWORTH, W. A. The influence of precursive sequences on theperception of synthesized vowels. Language and Speech, 1974,17, 103-109.

ARMSTRONG, L. E., & WARD, E. C. A handbook of Englishintonation. Cambridge, England: Heffer, 1931.

BARAN, J. A., LAUFER, M. Z., & DANILOFF, R. Phonologicalcontrastivity in conversation: A comparative study of voiceonset time. Journal ofPhonetics, 1977, S, 339-350.

BROADBENT, D. E., & LADEFOGED, P. On the fusion of soundsreaching different sense organs. Journal of the AcousticalSociety ofAmerica, 1957,29,708-710.

DARWIN, C. J., & BETHELL-Fox, C. E. Pitch continuity andspeech source attribution. Journal ofExperimental Psychology:Human Perception and Performance, 1977,3,665-672.

DIEHL, R. L., ELMAN, J. L., & MCCUSKER, S. B. Contrasteffects on stop consonant identification. Journal of Experi­mental Psychology: Human Perception and Performance, 1978,4,599-609.

DIEHL, R. L., LANG, M., & PARKER, E. M. A further parallelbetween selective adaptation and contrast. Journal of Experi­mental Psychology: Human Perception and Performance, 1980,6,24-44.

EIMAS, P. D. The relation between identification and discrimina­tion along speech and nonspeech continua. Language andSpeech, 1963,6,206-217.

ELMAN, J. L., DIEHL, R. L., & BUCHWALD, S. E. Perceptualswitching in bilinguals. Journal of the Acoustical Society ofAmerica, 1977,62,971-974.

FANT, C. G. M. Acoustic theory of speech production. The Hague:Mouton, 1960.

FANT, C. G. M. A note on vocal tract size factors and nonuniformF-pattern sca1ings. In Speech sounds and features. Cambridge,Mass: M.LT. Press, 1973.

FRIES, C. C. On the intonation of 'yes-no' questions in English.In D. Abercrombie, D. B. Fry, P. A. D. MacCarthy, N. C.Scott, & J. L. M. Trim (Eds.), In honour of Daniel Jones.London: Longmans, Green, 1964.

KLATT, D. H. Linguistic uses of segmental duration in English:Acoustic and perceptual evidence. Journal of the AcousticalSociety ofAmerica, 1976,59,1208-1221.

LIEBERMAN, P. Intonation, perception, and language, Cambridge,Mass: M.LT. Press, 1967.

LISKER, L., & ABRAMSON, A. S. A cross-language study of voic­ing in initial stops: Acoustical measurements. Word, 1964, 20,384-422.

LISKER, L., & ABRAMSON, A. S. Some effects of context on voiceonset time in English stops. Language and Speech, 1967, 10,1-28.

MILLER, J. L., & LIBERMAN, A. M. Some effects of later­occurring information on the perception of stop consonant andsemivowel. Perception & Psychophysics, 1979,25,457-465.

MONSEN, R. B. Study of variations in the male and femaleglottal wave. Journal of the Acoustical Society of America,1977,62,981-993.

PALMER, H. E. English intonation with systematic exercises.Cambridge, England: Heffer, 1924.

PARSONS, T. W. Separation of speech from interfering speechby means of harmonic selection. Journal of the AcousticalSociety ofAmerica, 1976,60,911-918.

PETERSON, G. E., & BARNEY, H. L. Control methods used in astudy of the vowels. Journal of the Acoustical Society ofAmerica, 1952,24,175-184.

PIKE, K. L. The intonation of American English. Ann Arbor,Mich: University of Michigan, 1945.

PORT, R. F. The influence of tempo on stop closure duration asa cue for voicing and place. Journal of Phonetics, 1979, 7,45-56.

SPIETH, W., CURTIS, J. F., & WEBSTER, J. C. Responding toone of two simultaneous messages. Journal of the AcousticalSociety ofAmerica, 1954,26,391-396.

SUMMERFIELD, A. Q. How a full account of segmental perceptiondepends on prosody and vice versa. In A. Cohen & S. G.Nooteboom (Eds.), Structure and process in speech perception.New York: Springer-Verlag, 1975.

NOTES

I. Changes in articulatory rate were simulated by varying thesynthesis time base.

2. In all instances, statistical significance was determined bymeans of two-tailed t tests for correlated observations.

3. The relatively small size of the normalization effects reportedhere may be attributable to certain properties of the test stimuli.In particular, the constant durations of both the pre target silentinterval and the target syllable itself across the two rate conditionsprobably diminished some of the differential effect of the two pre­cursor rates on the location of the voicing boundary. Recently,Miller and Liberman (1979) showed that the syllable-initial distinc­tion between [b) and [w), when cued by formant transition dura­tion, was greatly affected by the duration of the remainder of thesyllable. Longer syllable durations produced a shift in the [b)-[w)boundary toward longer transition durations. Also, Port (Note 2)reported that the perceptual boundary separating voiced andvoiceless medial stops (a distinction cued by closure duration) was

more influenced by the tempo of the test word itself than by thetempo of the carrier sentence. Maintaining a constant test-syllableduration may therefore partially override the effects of varyingprecursor rate.

Moreover, it should be pointed out that in fluent speech (asopposed to citation utterances) the VOT distributions of word­initial voiced and voiceless stops overlap considerably (Lisker &Abramson, 1967). Thus, even small normalization effects mayhave a nonnegligible practical significance.

4. Fant (1973) noted that the average female/male formant­frequency ratio varies considerably across formants and vowels.

CONDITIONS ON NORMALIZATION 443

However, for the particular vowels that occur in our precursorphrase and target syllable, this ratio variation is greatly reduced(Fant, 1973, Figure 2). Our use of a uniform scale factor appearstherefore to be justified.

5. By the terms "reverse normalization," we mean only that areversal of the typical normalization pattern has occurred.

(Received for publication July 17,1979;revision accepted February II, 1980.)