
Part V Language, Speech and Hearing

Section 1 Speech Communication
Section 2 Sensory Communication
Section 3 Auditory Physiology
Section 4 Linguistics


Section 1 Speech Communication

Chapter 1 Speech Communication



Academic and Research Staff

Professor Kenneth N. Stevens, Professor Jonathan Allen, Professor Morris Halle, Professor Samuel J. Keyser, Dr. Krishna K. Govindarajan, Dr. David W. Gow, Dr. Helen M. Hanson, Dr. Joseph S. Perkell, Dr. Stefanie Shattuck-Hufnagel, Dr. Alice Turk, Dr. Reiner Wilhelms-Tricarico, Peter C. Guiod, Seth M. Hall, Glenn S. LePrell, Jennell C. Vick, Majid Zandipour

Visiting Scientists and Research Affiliates

Dr. Shyam S. Agrawal,1 Dr. Corine A. Bickley, Dr. Suzanne E. Boyce,2 Dr. Limin Du,3 Dr. Anna Esposito,4 Dr. Carol Espy-Wilson,2 Astrid Hagen,5 Dr. Robert E. Hillman,6 Dr. Eva B. Holmberg,7 Dr. Caroline Huang,8 Dr. Harlan Lane,9 Dr. John I. Makhoul,10 Dr. Sharon Y. Manuel, Dr. Melanie L. Matthies,2 Dr. Richard S. McGowan,11 Dr. Pascal H. Perrier,12 Dr. Yingyong Qi,13 Dr. Lorin F. Wilde,14 Dr. David R. Williams,11 Karen Chenausky,15 Jane W. Wozniak2

Graduate Students

Marilyn Y. Chen, Harold Cheyne, Jeung-Yoon Choi, Michael Harms, Mark A. Hasegawa-Johnson, David M. Horowitz, Hong-Kwang J. Kuo, Kelly L. Poort, Janet L. Slifka, Jason L. Smith, Walter Sun

Undergraduate Students

Barbara B. Barotti, Howard Cheng, Erika S. Chuang, Laura C. Dilley, Emily J. Hanna, Dameon Harrell, Mark D. Knobel, Genevieve Lada, Adrian D. Perez, Adrienne M. Prahler, Hemant Tanaja

Technical and Support Staff

Arlene E. Wint

1 CEERI Centre, CSIR Complex, New Delhi, India.

2 Boston University, Boston, Massachusetts.

3 Institute of Acoustics, Chinese Academy of Sciences, Beijing, China.

4 International Institute for Advanced Scientific Studies SA (IIASS), Italy.

5 University of Erlangen-Nürnberg, Erlangen, Germany.

6 Massachusetts Eye and Ear Infirmary, Boston, Massachusetts.

7 MIT staff member and Massachusetts Eye and Ear Infirmary, Boston, Massachusetts.

8 Altech Inc., Cambridge, Massachusetts, and Boston University, Boston, Massachusetts.

9 Department of Psychology, Northeastern University, Boston, Massachusetts.

10 Bolt, Beranek and Newman, Inc., Cambridge, Massachusetts.

11 Sensimetrics Corporation, Cambridge, Massachusetts.

12 Institut de la Communication Parlée, Grenoble, France.

13 Department of Speech and Hearing Sciences, University of Arizona, Tucson, Arizona.

14 Pure Speech, Inc., Cambridge, Massachusetts.

15 Harvard-MIT Division of Health Sciences and Technology, Cambridge, Massachusetts.


Sponsors

C.J. Lebel Fellowship
Dennis Klatt Memorial Fund
National Institutes of Health
  Grant R01-DC00075
  Grant R01-DC01291
  Grant R01-DC01925
  Grant R01-DC02125
  Grant R01-DC02978
  Grant R01-DC03007
  Grant R29-DC02525
  Grant F32-DC00194
  Grant F32-DC00205
  Grant T32-DC00038
National Science Foundation
  Grant IRI 89-05249 (16)
  Grant IRI 93-14967 (17)
  Grant INT 94-21146 (18)

16 Under subcontract to Boston University.

17 Under subcontract to Stanford Research Institute (SRI) Project ECU 5652.

18 U.S.-India Cooperative Science Program.

1.1 Studies of the Acoustics, Perception, and Modeling of Speech Sounds

1.1.1 Glottal Characteristics of Male Speakers

The configuration of the vocal folds during the production of both normal and disordered speech has an influence on the voicing source waveform and, thus, affects perceived voice quality. Voice quality contains both linguistic and nonlinguistic information which listeners utilize in their efforts to understand spoken language and to recognize speakers. In clinical settings, voice quality also relays information about the health of the voice-production mechanism. The ability to glean voice quality from speech waveforms has implications for computer-based speech recognition, speaker recognition, and speech synthesis, and may be of value for diagnosis or treatment in clinical settings.

Theoretical models have been used to predict how changes in vocal-fold configuration are manifested in the output speech waveform. In previous work, we used acoustic-based methods to study variations in vocal-fold configuration among female speakers (nondisordered). We found a good correlation between the acoustic parameters and perceptions of breathy voice. Preliminary evidence gathered from fiberscopic images of the vocal folds during phonation suggests that these acoustic measures may be useful for categorizing the female speakers by vocal-fold configuration. Currently, that work is being extended to male speakers (also nondisordered). Because male speakers are less likely than female speakers to have posterior glottal openings during phonation, we expect to find less parameter variation among male speakers, in addition to significant differences in mean values, when compared with the female results. The data collected thus far are consistent with the theoretical models. Males tend to have lower average values of the acoustic parameters than female speakers. Although there are individual differences among the male speakers, the variation is smaller than that among female speakers. Voicing-source characteristics may play a role in the perception of gender.

1.1.2 Vowel Amplitude Variation During Sentence Production

During the production of a sentence, there are significant changes in the amplitudes of the vowels and in the spectrum of the glottal source, depending on the degree of prominence that is placed on each syllable. The sources of such variations are of interest for the development of models of speech production, for articulatory synthesis, and for the design of methods for accounting for variability in speech recognition systems. These sources include changes in subglottal pressure, fundamental frequency, vocal-tract losses, and voicing-source spectral tilt. Through a study of both the acoustic sound pressure signal and estimates of subglottal pressure from oral pressure signals, we have examined some of the sources of variation in vowel amplitude and spectrum during sentence production.

We have found that there are individual differences, and perhaps gender differences, in the relationship between subglottal pressure and sound pressure level, and in the degree of vowel amplitude contrast between different types of syllables. Such differences may contribute to the perception of a speaker's individuality or gender. However, an overall trend can also be noted and may be useful for speech synthesis: subglottal pressure can be used to control vowel amplitude at sentence-level main prominences (i.e., phrasal stress), but the vowel amplitude of reduced and non-nuclear full vowels may be manipulated by adjustment of vocal-fold configuration.


1.1.3 The Perception of Vowel Quality

Vowel quality varies continuously in the acoustic realization of words in connected speech. In recent years, it has been claimed that vowel quality drives lexical segmentation. According to this view, listeners posit a word boundary before all unreduced syllables. Moreover, it has been suggested that these unreduced syllables are perceived categorically. To address this claim, synthetic speech continua were generated between schwa and four unreduced vowels [i], [ei], [o], and [I] by manipulating the values of F1, F2, and F3. Identification studies employing these continua have provided mixed evidence for a clear categorical function for vowel quality classification. However, analysis of reaction time (RT) data collected during this task suggests that the perception of tokens that lie between categories requires additional processing effort. It might be concluded, then, that this classification is not automatic. Analysis of discrimination data using these continua shows a peak in discrimination sensitivity at identification boundaries that is consistent with categorical perception. RT data from this task also show a peak in the amount of effort needed to make discriminations of tokens near the classification boundary. Additional studies are planned to determine if vowel quality perception still appears categorical in the perception of connected speech, when processing resources may be limited by the competing demands of continuous word recognition and sentence processing.
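To illustrate how such continua are constructed, the sketch below linearly interpolates F1-F3 between schwa and a target vowel. The formant values and step count are illustrative placeholders, not the study's; the actual stimuli would be rendered with a formant synthesizer.

    import numpy as np

    # Illustrative formant targets (Hz); not the values used in the study.
    SCHWA = np.array([500.0, 1500.0, 2500.0])    # F1, F2, F3 of schwa
    VOWEL_I = np.array([280.0, 2250.0, 2900.0])  # F1, F2, F3 of [i]

    def formant_continuum(start, end, n_steps=7):
        """Equally spaced F1-F3 values between two vowel targets."""
        return np.array([(1 - a) * start + a * end
                         for a in np.linspace(0.0, 1.0, n_steps)])

    for f1, f2, f3 in formant_continuum(SCHWA, VOWEL_I):
        print("F1=%4.0f  F2=%4.0f  F3=%4.0f" % (f1, f2, f3))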

1.1.4 Comparison of Acoustic Characteristics of Speech Sounds Produced by Speakers Over a 30-year Period

In collaboration with Dr. Arthur House of the Institute for Defense Analyses in Princeton, New Jersey, we have been analyzing speech recorded by three adult males on two occasions 30 years apart (1960 and 1990). The speech materials were the same for the two sessions. They consisted of a large number of bisyllabic utterances of the form /hə'C1VC2/. The consonants C1 and C2 were usually the same, and included all the consonants of American English; the utterances contained 12 different vowels V. Almost all combinations of consonants and vowels were included.

A number of acoustic measures were made on these utterances, such as formant frequencies, fundamental frequency, properties relating to glottal source characteristics and to nasalization, spectra of sounds produced with turbulence noise, and duration characteristics. Many of these measures showed little or no change for any of the speakers over the 30-year period. These included vowel formant frequencies, fundamental frequency, vowel and consonant durations, and characteristics of nasal vowels and consonants. In some cases, the changes were statistically significant, but they were small. For example, there was an upward shift in fundamental frequency for all speakers, ranging from 3 to 9 Hz on average.

There were significant differences in the downward tilt of the spectrum of the glottal source for two of the speakers, as measured by the spectrum amplitude in the third-formant region relative to the amplitude of the first harmonic. The change in this measure was 15-20 dB for these speakers. In one speaker, this change was accompanied by a 6 dB reduction (on average) in the amplitude of the first-formant (F1) prominence, indicating an increased F1 bandwidth due to incomplete glottal closure. These changes in the glottal spectrum resulted in modifications in the long-term average spectrum of the speech. All speakers showed a reduction in the spectrum of aspiration noise at high frequencies, and for one speaker there was a reduction in high-frequency energy for strident fricatives.
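The tilt measure just described compares the first-harmonic amplitude (H1) with the strongest harmonic near F3 (A3). A minimal sketch of that comparison follows; it assumes F0 and F3 have already been estimated elsewhere, and the window and tolerance are placeholders.

    import numpy as np

    def h1_minus_a3(frame, fs, f0, f3, tol_hz=100.0):
        """H1 - A3 in dB for one analysis frame (a rough spectral-tilt cue).
        frame: 1-D waveform samples; fs: sampling rate in Hz."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
        db = 20.0 * np.log10(spec + 1e-12)
        h1 = db[np.abs(freqs - f0).argmin()]           # first harmonic
        band = (freqs > f3 - tol_hz) & (freqs < f3 + tol_hz)
        a3 = db[band].max()                            # F3-region prominence
        return h1 - a3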

1.1.5 Formant Regions and Relational Perception of Stop Consonants

Listeners are quite adept at restoring speech segments that have been obliterated by noise. For example, if in running speech the /ʃ/ in "ship" is replaced by noise, listeners report hearing "ship" even though the acoustic cues for /ʃ/ are not present. The current set of experiments investigates whether listeners can restore the first formant (F1) in stimuli where F1 is missing, based on the information present in the rest of the formant structure. Does the listener substitute a default value for F1 or treat the second formant (F2) as the effective "F1"?

Subjects were asked to identify synthetic consonant-vowel (CV) stimuli where F1 had a transition that rose into the vowel, had a flat transition, or was absent. For rising F1 transitions, listeners heard /ba/ or /da/ depending on the F2 transition, and stimuli with a flat F1 transition were heard as /a/. For stimuli where F1 was absent and F2 had a rising transition, listeners perceived either /ba/ or /da/ depending on the F3 transition; and for stimuli with F1 absent and a flat F2 transition, listeners perceived /a/. This experiment suggests that F2 is perceived as "F1" when F1 is missing. A second experiment tried to determine if there is a frequency limit to listeners' treatment of F2 as "F1" by varying F2 in the missing-F1 stimulus. The results from the second experiment imply that listeners treat F2 as "F1" when F2 is below approximately 1350 Hz, but treat F2 as "F2" when F2 is greater than 1350 Hz.

The results from these experiments suggest that there is a region where listeners expect F1 to exist, and they will perceive "F1" within this region even for unusually high values of "F1". Thus, for stimuli with missing F1, listeners treat F2 as "F1" up to 1350 Hz. However, once F2 crosses this boundary, listeners treat F2 as "F2", suggesting an implicit upper limit on "F1".

1.1.6 MEG Studies of Vowel Processing in Auditory Cortex

We are collaborating with researchers at MIT and the University of California at San Francisco in a brain imaging study to determine which attributes of vowels give rise to distinctive responses in auditory cortex. The technique that is employed is magnetoencephalography (MEG), a method that uses passive, non-invasive measurements of magnetic fields outside the head to infer the localization and timing of neural activity in the brain. One measure derived from MEG is the M100 latency, the first peak in the magnetic activity after the onset of the stimulus, which usually occurs near 100 ms. Previous studies using three-formant vowels showed that the M100 latency varied as a function of vowel identity and was not correlated with the fundamental frequency (F0).

To simplify the issue of which components influence the M100 latency, this experiment studied the M100 latency for single-formant synthetic vowels, /a/ and /u/, for both a male F0 (100 Hz) and a female F0 (170 Hz). The single formant corresponds to F1. This M100 latency was compared to the latency determined for the three-formant vowels and to the latency for two-tone complexes that matched F0 and F1 in the single-formant vowels. The latencies across the three conditions showed similar trends. The latency varied with F1 (or the tone corresponding to F1) and not F0 (or the tone corresponding to F0). The similarity to the three-formant stimuli suggests that the most prominent formant, F1, is the primary cause of the M100 latency.
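A minimal sketch of the latency measure itself (not the group's analysis pipeline, which involves trial averaging and source localization): find the dominant peak in the evoked response within a window around 100 ms after stimulus onset. The window bounds are placeholders.

    import numpy as np

    def m100_latency_ms(evoked, fs, onset_idx, window_ms=(60, 160)):
        """Latency (ms) of the largest deflection in a post-onset window
        where M100 responses typically fall. evoked: 1-D averaged MEG
        channel; fs: sampling rate in Hz."""
        lo = onset_idx + int(window_ms[0] * fs / 1000.0)
        hi = onset_idx + int(window_ms[1] * fs / 1000.0)
        peak = lo + int(np.abs(evoked[lo:hi]).argmax())
        return (peak - onset_idx) * 1000.0 / fs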

1.1.7 Synthesis of Hindi

In collaboration with Dr. Shyam Agrawal of the Central Electronics Engineering Research Institute of New Delhi, India, we have initiated a project on speech synthesis of Hindi. The aim of the New Delhi group is to develop a rule-based system for the synthesis of Hindi. As one component of this work, we have synthesized examples of voiced and voiceless aspirated consonants in Hindi. In order to achieve natural-sounding aspiration noise for Hindi, it was necessary to modify the spectrum of the aspiration noise in the Klatt synthesizer to enhance the amount of low-frequency energy in the aspiration source.

1.2 Speech Production Planning and Prosody

Our work in speech production planning focuses on the role of prosodic structure in constraining phonetic and phonological modification processes and speech error patterns in American English. The development of labelling systems and of extensive labelled databases of speech and of speech errors provides resources for testing hypotheses about the role of both prosodic and morpho-syntactic structures in the production planning process. This process results in substantial differences in a given word or segment from one utterance to another, with changes in adjacent phonetic context, speaking rate, intonation, phrasing, etc. An understanding of these systematic variations, and the structures that govern them, will not only improve our models of human speech processing, but will also make possible more natural synthesis and improved recognition algorithms.

1.2.1 Prosodic Structure and Glottalization

Drawing on a large corpus of FM radio news speech that we have labeled for prosodic structure using the ToBI system, we replicated Pierrehumbert and Talkin's (1992) finding that glottalization of vowel-initial words was more likely at prosodically significant locations, such as the onsets of intonational phrases and pitch-accented syllables, in a larger speech corpus with the full range of normal phonetic and intonational contexts. Further, we extended their finding to show that reduced-vowel onsets are more likely to glottalize at the onset of a Full Intonational Phrase than at the onset of an Intermediate Intonational Phrase, providing support for the distinction between these two constituent levels. We also reported profound differences among speakers both in the rate of glottalization for word-initial vowels and in the acoustic shape preferred by each speaker.

This same labeled speech corpus also allowed us to examine how intonational phrase structure influences the placement of optional prenuclear pitch accents before the last (or nuclear) pitch accent of the phrase. Earlier work showed that speakers tend to place the first accent of a new intonational phrase on an early syllable in late-main-stress words, as in "The MASSachusetts instiTUtions." More recently we found that phrase-medial accents avoid clash with accented syllables both to the left and to the right, demonstrating the avoidance of leftward clash and providing further support for the role of intonational phrase structure in production planning.

1.2.2 Detection and Analysis of Laryngealized Voice Source

An algorithm for the automatic detection of laryngealized regions, developed for German at the University of Erlangen, has been found to perform well for read American English speech but not for a corpus of spontaneous telephone-quality speech. An extensive labelling effort was undertaken for the spontaneous speech to determine the causes of this result, which was unexpected since the algorithm performed well for both read and spontaneous band-limited German speech. The labelling has revealed that perceived glottalization is not always directly correlated with evidence for irregular pitch periods in the signal in English, and we are currently exploring this paradox. The four corpora labelled for both laryngealized voice source and prosodic structure (i.e., read speech and spontaneous speech in both American English and German) will permit us to compare the distribution of glottalized regions with respect to prosodic structure in the two languages. Preliminary results support earlier claims that syllable-final /p t k/ do glottalize in spontaneous American English speech, but not in German.
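The Erlangen algorithm's details are not given here, but the core cue it relies on, pitch-period irregularity, can be sketched as follows; the threshold is an illustrative placeholder.

    def irregular_cycles(periods_ms, jitter_thresh=0.08):
        """Flag glottal cycles whose duration deviates from the mean of the
        two neighboring cycles by more than jitter_thresh (a crude
        laryngealization cue; not the Erlangen detector itself)."""
        flags = [False] * len(periods_ms)
        for i in range(1, len(periods_ms) - 1):
            local = 0.5 * (periods_ms[i - 1] + periods_ms[i + 1])
            flags[i] = abs(periods_ms[i] - local) / local > jitter_thresh
        return flags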

1.2.3 Development of a Transcription System for Other Aspects of Prosody

Work by Couper-Kuhlen in the 1980s and by Fant, Kruckenberg, and Nord in the 1990s suggests that speech may be characterized by structure-based manipulations of a regular rhythmic structure. Moreover, work in the British framework of intonational analysis in the 1950s suggests that repeated use of similar F0 contours is an important aspect of the speech signalling system. We have developed a transcription system for these two aspects of spoken language, along with an on-line training tutorial, and are currently evaluating the reliability of the transcription system across listeners, as well as the relationship between these two aspects of speech and the intonational phrase structure and prominence patterns captured by the ToBI labeling system. Preliminary analysis shows that perceived rhythmic beats are often, but not always, coincident with pitch-accented syllables, and that rhythmic patterns can set up expectations that lead the listener to perceive a beat even in a silent region of the speech waveform.

1.2.4 Other Work

We continue to develop the MIT Digitized Speech Error Corpus, the computer categorization system for speech errors captured in written form, and extensive corpora of labeled speech, including phonetic, syntactic, and prosodic labels for the same utterances. We published a tutorial summarizing our examination of the evidence for prosodic structure in the acoustic-phonetic and psycholinguistic literature. This will be followed by a closer examination of the evidence for the smaller constituents in the prosodic hierarchy, such as the mora, the syllable, and the prosodic word, and the evidence for rhythmic constituents such as the Abercrombian cross-word-boundary foot.

1.3 Studies of Normal Speech Production

1.3.1 Experimental Studies

During this year, we completed most of the data collection, signal processing, and data extraction on four interrelated experiments. The experiments are designed to test hypotheses based on a model of speech production. According to the model, the speaker operates under a listener-oriented requirement to produce an intelligible signal, using a production mechanism whose dynamical properties limit kinematic performance. Various strategies are employed to produce speech under these constraints, including the modification of clarity by using values of kinematic parameters that will produce sufficiently distinctive acoustic cues while minimizing the expenditure of articulatory effort.

In the four studies, we have made kinematic and acoustic measures on the same ten subjects using a multiple single-subject design, and we are obtaining intelligibility measures of the subjects' speech from an independent group of listeners. The four studies are: (1) the clarity constraint versus economy of effort: speaker strategies under varying clarity demands; (2) clarity versus economy of effort: the relative timing of articulatory movements; (3) kinematic performance limits on speech articulations; and (4) interarticulator coordination in the production of acoustic/phonetic goals: a motor equivalence strategy. In study one, speech is produced in the following conditions: normal, fast, slow, clear, clear+fast, and casual. The other studies employ subsets of these conditions. On the basis of previous recordings and analyses of data from two pilot subjects, we made improvements in utterance materials, recording protocols, and algorithms for data extraction.

For the eight subsequent subjects, data were collected and derived from the acoustic signal and from the movements of points on the lips, tongue, and mandible, as transduced by an electromagnetic midsagittal articulometer (EMMA) system. Spectral and temporal measures are being extracted from the acoustic signal, and articulatory kinematic measures (displacements, velocities, accelerations, durations) are being extracted from the movement signals. Measures of "effort" are being calculated from the kinematic parameters. These measures are being compared with each other, and they will be compared with the results of tests of intelligibility and prototypicality of phonetic tokens. Initial results show effects of speaking condition, articulator, vowel, consonant manner, and vowel stress. The pattern of results differs somewhat across speakers.
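As a rough illustration of the kinematic measures named above (a sketch that assumes a clean trajectory; real EMMA processing would low-pass filter before differentiating):

    import numpy as np

    def kinematic_measures(position_mm, fs):
        """Displacement range, velocity, and acceleration for one sampled
        EMMA transducer coordinate (1-D array in mm), sampled at fs Hz."""
        velocity = np.gradient(position_mm) * fs       # mm/s
        acceleration = np.gradient(velocity) * fs      # mm/s^2
        displacement = position_mm.max() - position_mm.min()
        return displacement, velocity, acceleration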

1.3.2 Physiological Modeling of Speech Production

In order to achieve practical computation times for a finite-element simulation of the behavior of the tongue, and for a future, more complete vocal tract model, it will be necessary to use parallel computation. For this purpose, work has been done on the theoretical aspects of a parallel implementation. In the resulting proposed algorithm, the tongue is split into interconnected domains, usually containing several finite elements, and the state variables in each sub-domain are updated by one processor. The continuity requirement at the boundaries between the domains is simplified to require that the state variables of two interfacing domains are equal only at nodes in the interfaces. Therefore, in theory, it is possible that different computational methods can be used in the different domains. The continuity requirements are maintained by Lagrange multipliers (one per equality constraint), whose computation is achieved by several processors, each acting on the nodes of one inter-domain surface. The resulting algorithm still leaves much to be desired, since at each time-step an interchange of information between neighboring domains would be required. On the other hand, this scheme should be useful when we interface two different models: a new finite-element tongue model and a mandible and hyoid model that is based on rigid-body movements. The latter model was implemented by other researchers with whom we plan to collaborate.

In the planned collaboration, we will combine a new tongue model with the existing mandible/hyoid-bone model. For this purpose, a modified "lambda" muscle control model has been designed for implementation as part of the tongue modeling. It is an extension of the lambda model that was originally designed for the control of limbs moving as rigid bodies. While in the original lambda model there are only two parameters describing the muscle state, length and velocity, the corresponding variables have first to be computed for each muscle in the tongue model. In this computation, the length and velocity parameters are replaced by the average stretch and stretch velocity of the muscle, which is modeled as a continuum. Both variables can be obtained by integration over the finite elements.
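In symbols (an illustrative reconstruction; the notation is not taken from the model's documentation), the average stretch and its rate for a muscle occupying volume Ω are volume averages that reduce to sums over the muscle's finite elements:

    \bar{\lambda} = \frac{1}{V} \int_{\Omega} \lambda \, dV
                  \approx \frac{1}{\sum_e V_e} \sum_e \lambda_e V_e,
    \qquad
    \dot{\bar{\lambda}} = \frac{1}{V} \int_{\Omega} \dot{\lambda} \, dV

where V_e is the volume of element e and λ_e its stretch along the local muscle-fiber direction; these two quantities stand in for the length and velocity parameters of the original lambda model.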

In addition, progress has been made on implementing an algorithm for parallel computation and on an overall plan for a new model-building method.

1.3.3 Improvements in Facilities for Speech Production Research

We have completed the refinement of facilities for gathering articulatory movement and acoustic data. We made further progress on a facility for performing perceptual tests. In collaboration with a radiologist from the Massachusetts Eye and Ear Infirmary, we have developed techniques for obtaining high-quality MRI images of the vocal tract. We developed software to facilitate the extraction of area-function data from the MR images. We also implemented additional algorithms for articulatory and acoustic data extraction, and we implemented the main component of our program for EMMA data extraction in MATLAB, in anticipation of changing computer hardware and software platforms.


1.4 Speech Research Relating to Special Populations

1.4.1 Speech Production of Cochlear Implant and Bilateral Acoustic Neuroma (NF2) Patients

We began work on a new NIH-funded project, "The Role of Hearing in Speech: Cochlear Implant Users." This research is an extension of work done previously under other funding sources. In this initial year, we refined previously used paradigms, devised new sets of utterance materials, and established new procedures for recruiting patients. We have made four pre-implant recordings of our first new implant patient. She has received her implant, and we will soon begin a series of post-implant recordings. We have also made recordings of two normally hearing speakers. We have analyzed a substantial part of the data from these recordings. We have also given extensive training in techniques for recording and data analysis to two staff members who are new to this research.

Auditory Feedback and Variation in Sound Pressure and Fundamental Frequency Contours in Deafened Adults

Sound pressure level (SPL) and fundamental frequency (F0) contours were obtained from four postlingually deafened adults who received cochlear implants and from a subject with Neurofibromatosis-2 (NF2) whose hearing was severely reduced following surgery to remove an auditory-nerve tumor and insert an auditory brainstem implant. SPL and F0 contours for each phrase in passages read before and after changes in hearing were averaged over repeated readings and then normalized with respect to the highest SPL or F0 value in the contour. The regularity of each average contour was measured by calculating differences between successive syllable means and averaging the absolute values of these differences. With auditory feedback made available, the cochlear implant user with the least contour variation pre-implant showed no change, but all of the remaining speakers produced less variable F0 contours, and three also produced less variable SPL contours. In complementary fashion, when the NF2 speaker had her auditory feedback severely reduced, she produced more variable F0 and SPL contours. The results are interpreted as supporting a dual-process theory of the role of auditory feedback in speech production, according to which one role of self-hearing is to monitor transmission conditions, leading the speaker to make changes in speech "postures" (such as average sound level and speaking rate) aimed at maintaining intelligibility.
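The contour-variation measure described above is simple enough to state directly; a minimal sketch, assuming the per-syllable means have already been extracted:

    import numpy as np

    def contour_variation(syllable_means):
        """Normalize an average SPL or F0 contour by its maximum, then
        average the absolute differences between successive syllable means.
        Larger values indicate a more variable contour."""
        contour = np.asarray(syllable_means, dtype=float)
        contour = contour / contour.max()
        return float(np.mean(np.abs(np.diff(contour))))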

1.4.2 Speech Respiration and Changes in Hearing Status

We have analyzed speech respiration parameters from complete series of longitudinal recordings of seven of our former subjects who gained some hearing from cochlear implants, from three NF2 patients (who experienced hearing loss), and from one control. We have also conducted an extensive search of the literature for normative data. The results from the implant subjects confirm our earlier finding that parameters change toward normal with the acquisition of some hearing. The hearing control maintained respiratory parameter values at about the same levels, and the NF2 patients behaved variously. We are currently examining changes in SPL as a potential confounding variable. As part of this study, we have also processed rate measures on all these subjects. In general, implant users increased speaking rate post-implant, the hearing control showed no change, and the NF2 patients behaved variously.

1.4.3 Voice Source Analysis of Speakers with Vocal Fold Nodules

In collaboration with Dr. Robert Hillman at the Massachusetts Eye and Ear Infirmary, we have been carrying out an analysis of vowels produced by several patients with vocal-fold nodules. The goal is to develop models that will help to interpret the acoustic and aerodynamic observations for these vowels and, hopefully, to explain differences in the glottal source for these patients compared with a normal group of speakers. Previous work has shown that nodule subjects required an increased subglottal pressure of 4-6 dB in order to achieve vowel sound pressure levels comparable to those produced by normals. Other acoustic and aerodynamic measures have shown that the amplitude of the first harmonic in relation to the amplitude of the first-formant peak is increased for the nodule subjects, suggesting an increased first-formant bandwidth and possibly an enhanced amplitude of the first harmonic and an increased spectrum tilt extending into the F1 region.

We have been introducing some modifications into a conventional two-mass model of the vocal folds in order to simulate some aspects of vocal-fold vibration for nodule subjects. Two kinds of modifications have been attempted: changes in the coupling stiffness between the upper and lower masses (in an attempt to simulate the effect of the increased stiffness of the nodule), and an increase in the static separation between the folds (simulating the lack of complete closure when nodules are present). These modifications lead to changes in the threshold subglottal pressure needed to initiate phonation and in the amplitude of the airflow pulses for a given subglottal pressure. The model behavior shows some of the attributes of the glottal source for the subjects with nodules.
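For orientation, the standard two-mass model (in the Ishizaka-Flanagan tradition) represents each fold as two coupled mass-spring-damper sections; the modifications above act on the coupling stiffness k_c and on the rest positions that set the static glottal gap. A generic sketch of the equations of motion, not the group's exact formulation:

    m_1 \ddot{x}_1 + r_1 \dot{x}_1 + k_1 x_1 + k_c (x_1 - x_2) = F_1(P_s, x_1, x_2)
    m_2 \ddot{x}_2 + r_2 \dot{x}_2 + k_2 x_2 + k_c (x_2 - x_1) = F_2(P_s, x_1, x_2)

Here x_1 and x_2 are the displacements of the lower and upper masses from their rest positions, and the aerodynamic driving forces F_1 and F_2 depend on the subglottal pressure P_s and the glottal geometry. Raising k_c stiffens the coupling between the masses (the nodule effect), while widening the rest positions leaves a static gap that prevents complete closure; both changes alter the phonation-threshold pressure and the airflow-pulse amplitude.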

1.4.4 Preliminary Studies of the Production of /s/ and /ʃ/ by Some Dysarthric Speakers

We have begun a detailed study of the fricative consonants /s/ and /ʃ/ produced in word-initial position by four dysarthric speakers with different degrees of severity. The speakers ranged in age from 38 to 61; three have cerebral palsy and one has cerebellar ataxia. The intelligibility of words beginning with these sounds, measured in an earlier study, was found to be relatively low on average. It was assumed that this low intelligibility was due to improper placement and shaping of the tongue blade against the hard palate, and possibly to inappropriate control of the intraoral pressure.

On the basis of acoustic and modeling studies of fricatives, including our own recent research in this area, it has been established that, for normal speakers, the spectrum for /s/ in most phonetic environments has a main prominence in the range of the fourth formant or higher. The spectrum for /ʃ/ has a prominence in the F3 range, as well as one or more additional prominences in the F4 range or higher. The peak amplitudes of the spectral prominences at F4 or higher (called Ah) and in the F3 range (called Am) were measured for the two fricatives as they occurred in initial position in several words. For each word, the amplitude A1 of the spectral prominence corresponding to the first formant in the following vowel was also measured. This vowel amplitude was used as a reference against which the fricative amplitude was specified. All these amplitudes for each utterance were obtained by averaging the spectrum over a time interval of about 20 ms. The differences A1-Ah and A1-Am were then determined for each word, and averages were calculated for the s-words and the sh-words for each speaker. Similar data were obtained for a normal speaker.
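A sketch of these amplitude measures follows. The band edges are illustrative placeholders (the study defines the regions relative to each speaker's F3 and F4), and spectra are averaged over roughly 20 ms of frames as described above.

    import numpy as np

    def band_peak_db(frames, fs, band):
        """Peak level (dB) in a frequency band of the spectrum averaged
        over a list of equal-length waveform frames."""
        spec = np.mean([np.abs(np.fft.rfft(f * np.hanning(len(f))))
                        for f in frames], axis=0)
        freqs = np.fft.rfftfreq(len(frames[0]), 1.0 / fs)
        sel = (freqs >= band[0]) & (freqs <= band[1])
        return 20.0 * np.log10(spec[sel].max() + 1e-12)

    def stridency_measures(fric_frames, vowel_frames, fs):
        a_m = band_peak_db(fric_frames, fs, (2000.0, 3500.0))   # F3 range
        a_h = band_peak_db(fric_frames, fs, (3500.0, 8000.0))   # F4 and above
        a_1 = band_peak_db(vowel_frames, fs, (200.0, 1000.0))   # vowel F1 peak
        return a_1 - a_h, a_1 - a_m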

There appeared to be a reasonable correlation between word intelligibility and the acoustic measure of "stridency" for the s-words: when the high-frequency amplitude in the fricative is weak, the intelligibility is low. For the sh-words there is a similar trend for the mid-frequency amplitude, but it is not as regular. Other factors besides measurements of the average noise spectrum shape are clearly contributing to the loss of intelligibility. These data are preliminary in two ways: (1) the values and ranges of the differences should be given for a number of normal speakers rather than for one speaker; and (2) the data for the dysarthric speakers are based on only a small number of utterances.

The decreased amplitude of the frication noise for the fricatives produced by the dysarthric speakers is probably a consequence of incorrect placement and shaping of the tongue blade when it is raised toward the hard palate: either the constriction is not sufficiently narrow, or the tongue blade is placed too far forward. For some utterances, there was strong vocal-fold vibration, which caused a strong low-frequency peak in the fricative spectrum. In these cases, it is possible that there was insufficient intraoral pressure to cause enough turbulence noise near the tongue-blade constriction.

1.5 Models of Lexical Representation and Lexical Access

1.5.1 Quantitative Error Models for Classification of Stop Consonants

The general aim of this project is to develop a model that will identify consonantal features in running speech. The goal is for the model to have performance similar to that of a human listener. Implementation of the proposed model requires that certain spectral and temporal measurements be made in the vicinity of landmarks where rapid spectral changes occur in the speech signal. In the case of landmarks at stop-consonant releases, these measurements are intended to locate spectral peaks that represent particular vocal-tract resonances, and to track how these resonances change with time. In an initial experiment using a corpus of sentences, a series of such measurements was made by hand by individuals who knew how to interpret the spectra and to avoid labeling spurious spectral prominences. The measurements were made on a large number of stop consonants for which the time of release was specified in advance. A classification system was implemented, and the number of errors in identifying the consonants from these hand-made measurements was relatively small. Algorithms that were developed for making the measurements automatically were somewhat error-prone. However, it was possible to evaluate the error or uncertainty in these algorithms and to predict the difference between the classification errors based on the two types of measurements.


This exercise has emphasized the importance of quantifying the uncertainty in measurements of acoustic parameters and of incorporating this knowledge in models for classifying features.
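The underlying idea can be made concrete with a one-dimensional toy model (an illustration of the principle, not the actual error model used in this work): measurement noise adds variance to the feature distribution, which raises the predicted classification error.

    from math import erf, sqrt

    def predicted_error(mu1, mu2, sd_feature, sd_measurement):
        """Minimum error for two equal-prior Gaussian classes separated on
        one acoustic measurement, after independent measurement noise of
        s.d. sd_measurement is added to the within-class s.d. sd_feature."""
        sd = sqrt(sd_feature ** 2 + sd_measurement ** 2)
        d = abs(mu2 - mu1) / (2.0 * sd)          # half-separation in s.d. units
        return 0.5 * (1.0 - erf(d / sqrt(2.0)))  # tail beyond the midpoint

    # Illustrative burst-peak frequencies (Hz) for two stop-place classes:
    print(predicted_error(2500, 3100, 200, 0))    # hand measurements
    print(predicted_error(2500, 3100, 200, 150))  # noisier automatic ones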

1.5.2 Identifying Features in Isolated Consonant-Vowel Syllables

A test of the validity of a model for lexical access is a comparison of the errors made by the model with the errors made by human listeners. One relatively simple way to make such a comparison is to use consonant-vowel syllables in noise as input to a listener or to a model that is intended to recognize the consonants, and to compare the errors that are made in identification of the consonant features. A preliminary experiment of this kind has been carried out using a set of 10 CV syllables with the consonants /p t b d f s v z m n/ followed by /a/. Various levels of white noise were added to the syllables. The syllables were processed in four ways: listeners identified the consonants; spectrograms of the syllables (with noise) were made and the consonants were identified by a spectrogram reader; the syllables were processed by an algorithm that attempted to identify certain manner features of the consonants (sonorant and continuant); and an HMM recognition system was trained to recognize the noise-free syllables. The principal outcomes of this work were: (1) the types of errors made by the HMM system were quite different from those made by listeners; (2) automatic identification of manner features in noise showed many more errors than listeners made, indicating that improvement in manner detection in noise is necessary; and (3) errors in place of articulation in noise are similar for listeners and for spectrogram reading.

1.6 Locating Glides in Running Speech

One of the initial steps in the model for lexical access that we are proposing is a procedure for identifying landmarks in the speech signal. These landmarks are of three kinds: (1) points in the speech signal where there are abrupt discontinuities, representing points where consonantal constrictions in the vocal tract are created or released; (2) regions in the signal that identify vowel prominences; and (3) regions in the signal where the energy is a minimum but there are no abrupt discontinuities. This last type of landmark is created by glides. An algorithm has been developed for locating glide landmarks in speech. The algorithm is based on locating regions in time where there are minima in amplitude, minima in first-formant frequency, and rates of change of these parameters that are constrained to be in particular ranges. The overall recognition results were 88 percent for glide detection and 91 percent for nonglide detection.
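A minimal sketch of that detection strategy (thresholds and window sizes are placeholders, not those of the actual algorithm): mark frames that are joint local minima of amplitude and F1 whose surrounding transitions are gradual rather than abrupt.

    import numpy as np

    def glide_landmarks(amp_db, f1_hz, frame_rate,
                        max_damp=150.0, max_df1=1500.0, halfwin=5):
        """Candidate glide landmarks from per-frame amplitude (dB) and F1
        (Hz) arrays sampled at frame_rate frames/s."""
        d_amp = np.gradient(amp_db) * frame_rate   # dB/s
        d_f1 = np.gradient(f1_hz) * frame_rate     # Hz/s
        marks = []
        for t in range(halfwin, len(amp_db) - halfwin):
            lo, hi = t - halfwin, t + halfwin + 1
            is_min = (amp_db[t] == amp_db[lo:hi].min()
                      and f1_hz[t] == f1_hz[lo:hi].min())
            gradual = (np.abs(d_amp[lo:hi]).max() < max_damp
                       and np.abs(d_f1[lo:hi]).max() < max_df1)
            if is_min and gradual:
                marks.append(t)
        return marks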

1.6.1 Enhancement Theory and Variability

In the model of lexical access that we and others have proposed, lexical items are organized into segments, and the segments, in turn, are specified in terms of hierarchically arranged binary features. Many of these features are defined in articulatory terms. That is, each feature or group of features specifies instructions to particular articulators. It is understood, however, that, in addition to specifying articulatory instructions, a feature also has acoustic correlates, although these acoustic correlates for a particular feature may depend on other features in the same segment. It is further recognized that the strength of a particular acoustic correlate that signals the presence of a feature in a segment can be enhanced by recruiting articulatory gestures in addition to the one specified by the feature. That is, more than one articulatory gesture can contribute to enhancement of the acoustic contrast corresponding to the + or - value of a feature. (An example is the partial lip rounding that is used to produce the palatoalveolar /ʃ/, presumably to enhance the contrast with /s/.) Our view is that this enhancing gesture (rounding in this example) is graded and does not have the status of a lexically represented feature. Because enhancement can be graded, variability can exist in the acoustic manifestation of a feature, in addition to the variability from other sources.

This enhancement process is most likely to be brought into play for a given feature when there exists a gesture that can indeed help to strengthen the acoustic attribute generated by the primary articulator receiving instructions from the feature. It is also invoked when the acoustic differences between competing similar segments are minimal. We have been examining a number of enhancement processes from this point of view, and we have noted that enhancement is especially utilized to strengthen the voiced-voiceless contrast for obstruents, the contrast between different consonants produced with the tongue blade, and the height and front-back contrasts for vowels.

1.6.2 The Special Status of Word Onsets in Word Recognition and Segmentation

Word onsets have been hypothesized to play a critical role in word recognition and lexical segmentation in the perception of connected speech. In order to understand why onsets appear to have this special status, onsets were examined from the perspectives of work in acoustic phonetics, phonological theory, and behavioral studies of word recognition. A review of research in acoustic phonetics suggested that onsets provide a particularly rich source of segmental featural information, due to a constellation of articulatory and allophonic factors. A cross-linguistic survey of phonological rules conditioning featural change or deletion revealed a striking asymmetry in which onsets are generally insulated against change that occurs widely in other positions. Finally, a review of behavioral research demonstrated that listeners have a lower tolerance for the loss or alteration of word-initial segments than for other segments in word recognition. This pattern coincides with a tendency for increasing sensitivity to lexical effects over the course of a word. Together, these results suggest that onsets: (1) provide particularly robust featural information; (2) are more transparently interpretable in word recognition than non-onsets, due to their lack of phonologically conditioned surface variation; and (3) appear to be particularly well-suited to drive lexically mediated processes that facilitate the perception of words when non-onset information is difficult to recover or interpret due to acoustic or representational underspecification. In the context of a model of segmentation in which segmentation is accomplished as a by-product of word recognition, these observations may account for the acoustic-phonetic nature and distribution of putative cues to word boundaries.

1.7 Publications

1.7.1 Journal Articles

Dilley, L., S. Shattuck-Hufnagel, and M. Ostendorf. "Glottalization of Word-Initial Vowels as a Function of Prosodic Structure." J. Phonetics 24: 423-444 (1996).

Hanson, H.M. "Glottal Characteristics of Female Speakers: Acoustic Correlates." J. Acoust. Soc. Am. 101: 466-481 (1997).

Matthies, M.L., M. Svirsky, J. Perkell, and H. Lane. "Acoustic and Articulatory Measures of Sibilant Production with and without Auditory Feedback from a Cochlear Implant." J. Speech Hear. Res. 39: 936-946 (1996).

Perkell, J.S. "Properties of the Tongue Help toDefine Vowel Categories: Hypotheses Based onPhysiologically-Oriented Modeling." J. Pho-netics 24: 3-22 (1996).

Shattuck-Hufnagel, S., and A. Turk. "A Prosody Tutorial for Investigators of Auditory Sentence Processing." J. Psycholing. Res. 25(2): 193-247 (1996).

Stevens, K.N. "Critique: Articulatory-Acoustic Rela-tions and their Role in Speech Perception." J.Acoust. Soc. Am. 99: 1693-1694 (1996).

Wilhelms-Tricarico, R. "A Biomechanical andPhysiologically-Based Vocal Tract Model and itsControl." J. Phonetics 24: 23-28 (1996).

1.7.2 Published Meeting Papers

Perkell, J.S., M.L. Matthies, R. Wilhelms-Tricarico, H. Lane, and J. Wozniak. "Speech Motor Control: Phonemic Goals and the Use of Feedback." Proceedings of the Fourth Speech Production Seminar: Models and Data (also called the First ESCA Tutorial and Research Workshop on Speech Production Modeling: From Control Strategies to Acoustics), 1996, pp. 133-136.

Stevens, K.N. "Understanding Variability in Speech:A Requisite for Advances in Speech Synthesisand Recognition." Special Session 2aSC ofASA-ASJ Third Joint Meeting, Speech Com-munication for the Next Decade: New Directionsof Research, Technological Development, andEvolving Applications, Honolulu, Hawaii, 1996,pp. 3.1-3.9.

1.7.3 Chapters in Books

Halle, M., and K.N. Stevens. "The Postalveolar Fricatives of Polish." In Festschrift for Osamu Fujimura. Berlin: Mouton de Gruyter, 1997, pp. 177-194.

Perkell, J.S. "Articulatory Processes." In Handbookof Phonetic Science. Eds. W. Hardcastle and J.Laver. Blackwell: Oxford, 1997, pp. 333-370.

Perkell, J.S., and M.H. Cohen. "Token-to-Token Variation of Tongue-body Vowel Targets: The Effect of Context." In Festschrift for Osamu Fujimura. Berlin: Mouton de Gruyter, 1997, pp. 229-240.

Stevens, K.N. "Articulatory-Acoustic-AuditoryRelationships." In Handbook of Phonetic Sci-ences. Eds. W. Hardcastle and J. Laver.Blackwell: Oxford, 1997, pp. 462-506.


Stevens, K.N. "Models of Speech Production." InEncyclopedia of Acoustics. Ed. M. Crocker.New York: John Wiley, 1997, pp. 1565-1578.

Wilhelms-Tricarico, R., and J.S. Perkell. "Biomechanical and Physiologically-Based Speech Modeling." In Speech Synthesis. Eds. J.P.H. van Santen, R.W. Sproat, J.P. Olive, and J. Hirschberg. New York: Springer, 1996, pp. 221-234.

1.7.4 Journal Articles and Book Chapters to be Published

Chen, M., and R. Metson. "Effects of Sinus Surgery on Speech." Arch. Otolaryngol. Forthcoming.

Govindarajan, K.K. "Relational Perception of Formants and Stop Consonants: Evidence from Missing First Formant Stimuli." J. Acoust. Soc. Am. Forthcoming.

Hillman, R.E., E.B. Holmberg, J.S. Perkell, J. Kobler, P. Guiod, C. Gress, and E.E. Sperry. "Speech Respiration in Adult Females with Vocal Nodules." J. Speech Hear. Res. Forthcoming.

Lane, H., J. Wozniak, M.L. Matthies, M.A. Svirsky, J.S. Perkell, M. O'Connell, and J. Manzella. "Changes in Sound Pressure and Fundamental Frequency Contours Following Changes in Hearing Status." J. Acoust. Soc. Am. Forthcoming.

Perkell, J.S., M.L. Matthies, H. Lane, R. Wilhelms-Tricarico, J. Wozniak, and P. Guiod. "Speech Motor Control: Segmental Goals and the Use of Feedback." Submitted to Speech Commun.

Shattuck-Hufnagel, S. "Phrase-level Phonology inSpeech Production Planning: Evidence for theRole of Prosodic Structure." In Prosody: Theoryand Experiment: Studies Presented to GostaBruce. Ed. M. Horne. Stockholm: Kluwer. Forth-coming.

Svirsky, M.A., K.N. Stevens, M.L. Matthies, J. Manzella, J.S. Perkell, and R. Wilhelms-Tricarico. "Tongue Surface Displacement During Obstruent Stop Consonants." J. Acoust. Soc. Am. Forthcoming.

1.7.5 Theses

Hasegawa-Johnson, M.A. Formant and Burst Spectral Measurements with Quantitative Error Models for Speech Sound Classification. Ph.D. diss., Dept. of Electr. Eng. and Comput. Sci., MIT, 1996.

Sun, W. Analysis and Interpretation of Glide Characteristics in Pursuit of an Algorithm for Recognition. S.M. thesis, Dept. of Electr. Eng. and Comput. Sci., MIT, 1996.
