
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 47, NO. 4, APRIL 2000

Development of Speechreading Supplements Based on Automatic Speech Recognition

Paul Duchnowski, David S. Lum, Jean C. Krause, Matthew G. Sexton, Maroula S. Bratakos, and Louis D. Braida*, Member, IEEE

Abstract—In manual-cued speech (MCS) a speaker produces hand gestures to resolve ambiguities among speech elements that are often confused by speechreaders. The shape of the hand distinguishes among consonants; the position of the hand relative to the face distinguishes among vowels. Experienced receivers of MCS achieve nearly perfect reception of everyday connected speech. MCS has been taught to very young deaf children and greatly facilitates language learning, communication, and general education.

This manuscript describes a system that can produce a form of cued speech automatically in real time and reports on its evaluation by trained receivers of MCS. Cues are derived by a hidden Markov model (HMM)-based speaker-dependent phonetic speech recognizer that uses context-dependent phone models and are presented visually by superimposing animated handshapes on the face of the talker. The benefit provided by these cues strongly depends on articulation of hand movements and on precise synchronization of the actions of the hands and the face. Using the system reported here, experienced cue receivers can recognize roughly two-thirds of the keywords in cued low-context sentences correctly, compared to roughly one-third by speechreading alone (SA). The practical significance of these improvements is to support fairly normal rates of reception of conversational speech, a task that is often difficult via SA.

Index Terms—Aids for the deaf, autocuer, automatic speech recognition, cued speech, deafness, speech perception, speechreading, transliteration.

I. INTRODUCTION

THE ABILITY to see and interpret the facial actions that accompany speech production can enhance communication in difficult listening conditions substantially [1]. Although listeners with normal hearing can benefit from integrating cues derived from speechreading with those conveyed by the speech waveform, for many listeners with severe hearing impairments, speechreading is the principal mode of communication.

Manuscript received December 30, 1997; revised January 4, 2000. This work was supported by the National Institute on Deafness and Other Communications Disorders, the H. E. Warren Professorship at Massachusetts Institute of Technology (M.I.T.), and the Motorola Corporation. Asterisk indicates corresponding author.

P. Duchnowski is with the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

D. S. Lum, J. C. Krause, M. G. Sexton, and M. S. Bratakos are with the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

*L. D. Braida is with the Research Laboratory of Electronics, 36-747, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]).

Publisher Item Identifier S 0018-9294(00)02640-9.

However, speechreading alone (SA) typically provides only a minimal basis for communication because many distinctions between speech elements are not manifest visually. While some of the resulting ambiguity can be overcome if the receiver makes use of contextual redundancy, the concomitant demands on attention and memory are often daunting [2]. Even when dealing with easy material that is spoken slowly, with repetitions, and that is viewed under optimal reception conditions, speechreaders typically miss more than one-third of the words spoken.

Although several means are available to help persons who are deaf overcome the limitations associated with speechreading, none are without limitations. One approach, the use of sign language [3], supports high levels of communication, linguistic development, and general education, but is based on a language other than English, so that reading skills are not always well developed [4]. A second approach, electrical stimulation of the peripheral auditory system, can evoke auditory sensations that are highly effective speechreading supplements for many deaf adults, and enables many to communicate without the use of speechreading [5], [6]. However, not all deaf adults are helped by this approach, perhaps because the benefits provided by cochlear implants depend on the number of surviving peripheral auditory neurons. In addition, in current medical practice the implantation of deaf infants is restricted to those beyond the age of two years, well beyond the point at which deafness can be detected and well after normal-hearing infants begin to acquire speech and language skills. A third approach is based on the presentation of speechreading supplements via the visual or tactile senses. These supplements are intended to enable the speechreader to make distinctions between elements or patterns of speech that are not possible via SA.

Although there are many types of such supplements, it is useful to distinguish between two different types. One type uses signal processing techniques to derive parameters or waveforms from acoustic speech (e.g., voice pitch, the speech amplitude envelope contour, etc.), and codes them for presentation to the speechreader. The other type uses automatic speech recognition (ASR) techniques to derive a sequence of discrete symbols from the speech waveform. Many examples of the former type have been evaluated, including those that present the supplements to the visual [7], [8] and tactile [9] senses, and several tactile speechreading aids are currently in use by deaf individuals. No supplements based on speech recognition, such as those related to the system of cued speech developed for the education of deaf persons, appear to be in use at the present time.

In Section II, we describe cued speech and review previous attempts to develop speechreading supplements based on cued speech.


In Section III, we outline the operation of the automatic cueing systems evaluated in the current study. In Section IV, we describe the methods used to evaluate the reception of automatically cued speech (ACS). In Section V, we report the results of these evaluations. In Section VI, we compare the results obtained in this study with those reported for other aids for the deaf and identify problems to be addressed in future research.

II. CUED SPEECH

In cued speech, communication relies on a combination of speechreading and a synchronous sequence of discrete symbols. The symbols attempt to help the receiver distinguish between speech elements with similar facial articulations. Several different systems of cued speech that have been developed share several common elements. Each symbol in the sequence is typically displayed while a CV syllable is articulated. The symbol typically consists of two components or indicators, one for the consonant and the other for the vowel. Each consonant indicator represents a group of consonants that are readily distinguished by SA. Similarly, each vowel indicator represents a group of vowels that are readily distinguished by SA. For ease of reception, the consonant and vowel indicators are typically displayed by perceptually independent elements.

Within these constraints, a wide variety of cued speech systems can be defined. In manual-cued speech (MCS), developed more than 30 years ago by Cornett [10], the indicator for vowels is the position of the hand relative to the face of the talker. The vowels are divided into four groups, each having the property that its members can be distinguished by speechreading. Similarly, the indicator for consonants, which are divided into eight groups, is the shape of the hand. An alternative system, Q-codes, was proposed [11] for Swedish and subsequently adapted to Japanese [12]. In these systems, the assignment of consonants and vowels to groups was based on acoustic-phonetic dimensions that were thought to be easy to detect automatically in the acoustic speech waveform. The set of hand shapes used to represent the consonant groups differs from those of MCS, with specific aspects of shape conveying phonemic distinctions. For example, the spread of the thumb is used to distinguish voiced from unvoiced consonants.
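To make the structure of such a symbol concrete, a cue can be viewed as a pair of independent indicators: a hand shape index (one of the eight MCS consonant groups) and a hand position index (one of the four MCS vowel groups). The Python sketch below is a minimal illustration of this pairing only; the class name and validation logic are ours, and the actual assignment of phonemes to MCS groups follows Cornett [10] and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cue:
    """One cued-speech symbol: a consonant indicator plus a vowel indicator."""
    handshape: int  # consonant indicator: one of the 8 MCS hand shapes
    position: int   # vowel indicator: one of the 4 positions relative to the face

    def __post_init__(self):
        if not 1 <= self.handshape <= 8:
            raise ValueError("MCS defines eight consonant hand shapes")
        if not 1 <= self.position <= 4:
            raise ValueError("MCS defines four vowel positions")

# A cued utterance is then a synchronous sequence of such pairs, one per CV
# syllable, displayed alongside the talker's face.
example = Cue(handshape=5, position=2)  # arbitrary example values
```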

There is no general agreement on the optimal number or constitution of cue groups, either for consonants or vowels. Recent research at the Massachusetts Institute of Technology (M.I.T.) has explored methods for constructing such groups automatically [13], based on the accuracy with which the speech elements can be recognized by an automatic recognizer and the perceptual confusions made by speechreaders. The recognizers available at that time were estimated to aid speech reception maximally when only a relatively small number of groups was used.

Parents and educators have taught the MCS system successfully to very young deaf children, who have subsequently used the system to facilitate communication, language learning, and general education. Several studies have documented the benefits MCS can provide to speech reception [14]–[17]. For example, after seven years of MCS use, 18 teenagers were able to identify words in low-context sentences with an accuracy of 26% by SA versus 97% via MCS [17].

Studies of highly experienced users of MCS conducted at our laboratory have found similar gains for keywords in low-context IEEE sentences [18], with scores improving from 25%–31% for SA to 84% for MCS [13], [19]. Comparable evaluations of alternative cueing systems are not available.

To derive real benefit from cued speech in everyday communication, the talker, or a transliterator, must produce cues for the receiver. However, the number of individuals who are proficient at producing MCS is relatively small, perhaps several thousand, so that cued speech is not widely used by the deaf. To overcome this limitation, Cornett and his colleagues at Gallaudet University and the Research Triangle Institute (R.T.I.) attempted to develop an "Autocuer," a system that would derive cues similar to those of MCS automatically, by electronic analysis of the acoustic speech signal, and display them visually to the cue receiver.

A. The Gallaudet-R.T.I. Autocuer

The most advanced implementation of the Gallaudet-R.T.I. Autocuer [20] displays cues as virtual images of a pair of seven-segment LED elements projected near the face of the talker in the viewing field of an eyeglass lens. Nine distinct symbol shapes, cueing consonant distinctions, can appear at each of four positions, cueing vowel distinctions.

The operation of the speech analysis subsystem used in the wearable version of the Autocuer has not been described publicly. A published block diagram of the system suggests that speech sounds would be assigned to cue groups on the basis of estimates of the voice pitch as well as the zero-crossing rates and peak-to-peak amplitudes in low- and high-frequency bands of speech. Evaluations of the performance of the most recent implementation of the Autocuer [21] indicate that the phoneme identification score is roughly 54%, with a 33% deletion rate and a 13% substitution rate.
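Since the Autocuer's analysis itself is unpublished, the short Python sketch below merely illustrates the kinds of frame measurements named in the block diagram (zero-crossing rate and peak-to-peak amplitude); it is a generic illustration, not a reconstruction of the Autocuer's processing.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs that differ in sign."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def peak_to_peak(frame: np.ndarray) -> float:
    """Peak-to-peak amplitude of the frame."""
    return float(np.max(frame) - np.min(frame))
```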

The Gallaudet-R.T.I. development effort used simulation techniques to estimate the level of recognition performance that would be required for a successful Autocuer system. Speech reception tests were conducted both with "ideal" cues derived manually from spectrograms, and with cues derived automatically by the analyzer. Both normal- and impaired-hearing listeners were tested. They were provided with 40 h of training, since many had no experience with MCS. Early tests of a simulation of the Autocuer system [20] used spectrogram-derived cues. Scores for the reception of common words spoken in isolation improved from 63% (SA) to 84% when cues were presented. However, the subjects do not appear to have been trained sufficiently to use the cues when the words were embedded in short phrases. More recent simulation studies [21] indicated that considerably less benefit would be obtained with recognizer-produced cues: scores on isolated words improved from 59% (SA) to 67% with the automatically produced cues.

Unfortunately, the results of the Gallaudet-R.T.I. studies make very limited contributions to the general development of automatic cueing systems. The participants in these studies were not sufficiently trained on the cues that were produced and displayed to derive much benefit when using the cues to supplement the speechreading of connected discourse, even when the cues were perfectly matched to what was said.


In addition, the simulation studies assumed that the recognizer would require roughly 150 ms to derive and display a cue after the corresponding syllable was spoken. The effect of this delay was not explored, although it arguably could have different effects when test materials consist of running speech rather than isolated words.

B. The M.I.T. Simulation Study

To evaluate potential automatic cueing systems under more realistic conditions, new studies of the effects of imperfections in the recognition and display of cues were recently conducted at M.I.T. [19]. To avoid the problems posed by inadequate training, six highly trained receivers of MCS were tested. The reception of words in low-context sentences [18] was evaluated under conditions that simulated the recognition errors and delays expected of current ASR systems. Cues consisting of images of hands were dubbed at appropriate locations onto recordings of the faces of talkers who spoke sentences.

The test protocol included three reference conditions [SA, MCS, and perfect synthetic cues (PSC), in which the cues were specified by phonetic transcriptions of the sentences] and ten test conditions. In the test conditions, errors were introduced into the phonetic transcriptions and/or the appearance of cues was delayed relative to the indicated beginning of the phone. Several combinations of errors and delays were evaluated. In the reference conditions, the lowest scores were obtained with SA (30%) and the highest with MCS (83%). With PSC, scores were slightly lower than with MCS (77%), perhaps reflecting differences in speaking rates (100 wpm for MCS, 140 for PSC), cue display (articulated versus static), and cue timing (in the PSC condition, the time reference for the display of cues was derived from the acoustic waveform; in MCS the shapes and positions of the cuer's hands often change before there is detectable sound).

Both simulated recognizer errors and delays in the display of cues reduced scores. When 10% of the phones were in error, scores decreased by 14 percentage points; a 20% rate of errors reduced scores by 24 points. In combination with a delay of 165 ms, the effect of a 10% error rate was even larger, reducing scores by 38 points. The effect of a 20% error rate was also increased by a delay of 100 ms (from a 24- to a 36-point reduction in scores). When cues were derived by a state-of-the-art phonetic recognizer that produced 20% phone errors when operated off-line, scores were roughly the same as for the simulated recognizer that produced 10% errors. Apparently the errors produced by the real recognizer tended to cluster in words, so that, for the same error rate, more words in a sentence were likely to be free of errors for the real than for the simulated recognizer.

These results suggest that state-of-the-art recognizers can produce cues that aid speechreading substantially. On the other hand, they also underscore the deleterious effects of recognizer delays on the effectiveness of these cues. The MCS system is based on presenting cues corresponding to CV pairs. While the manual cuer can produce the cue at the start of the initial consonant, an automatic system cannot determine what cue to produce until after the final vowel is spoken and recognized. Typically, this would impose a delay in the display of the cue that is well in excess of 165 ms, whose effect was found to be highly deleterious. One approach to minimizing the effect of such delays is discussed in the following sections.

III. AUTOMATIC CUE GENERATING SYSTEMS

The automatic cue generator investigated here performed three principal functions: 1) capture and parameterization of the acoustic speech input (the recognizer front end); 2) cue identification via speech recognition; and 3) presentation of the identified cues to the cue receiver. Fig. 1 shows the block diagram of this system, which, for convenience, was implemented on two computers.

A. Parameterization

The first two functions were tightly coupled, the recognition algorithm determining the type of speech preprocessing. The speech waveform was captured using a standard lapel-type omni-directional microphone, sampled at 10 kHz, preemphasized at high frequencies by a first-order filter with a cutoff frequency of 150 Hz (to flatten the spectrum, e.g., [22]), and divided into 20-ms-long frames with 10-ms overlap. For each frame, a vector of 25 parameters was derived from the samples, including 12 mel-frequency cepstral coefficients, 12 differences of cepstral coefficients across frames, and the difference between frame energies. Differences were computed over a four-frame span, so that the difference parameters of the nth frame were computed from the static parameters of frames n-2 and n+2. RASTA processing [23] was applied to the parameter vectors to improve robustness.

Signal acquisition and parameterization were performed on a Motorola DSP96000 board installed in PC1, a Pentium Pro-class computer. The same parameterization was used in all experiments. Parameter vectors were time-stamped to allow for subsequent synchronization with the video and sent over the local Ethernet to the recognition program.
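As a concrete illustration of this front end, the sketch below computes 25-parameter vectors (12 mel-frequency cepstral coefficients, their four-frame-span differences, and a frame-energy difference) from a 10-kHz waveform. It is a minimal approximation, not the DSP96000 implementation: it assumes the librosa package for the cepstral analysis, the preemphasis coefficient is one plausible realization of a 150-Hz first-order filter, and the RASTA step is only indicated by a comment.

```python
import numpy as np
import librosa  # assumed available; any MFCC front end could be substituted

FS = 10_000        # 10-kHz sampling rate
FRAME_LEN = 200    # 20-ms frames
HOP = 100          # 10-ms frame advance
N_CEP = 12         # 12 mel-frequency cepstral coefficients per frame
SPAN = 2           # differences taken between frames n+2 and n-2

def front_end(signal: np.ndarray) -> np.ndarray:
    """Return one 25-dimensional parameter vector per frame:
    12 MFCCs, 12 delta-MFCCs, and 1 delta log-energy."""
    # First-order high-frequency preemphasis; alpha chosen for a ~150-Hz cutoff.
    alpha = np.exp(-2.0 * np.pi * 150.0 / FS)
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 12 static cepstra per 20-ms frame with a 10-ms hop.
    cep = librosa.feature.mfcc(y=x, sr=FS, n_mfcc=N_CEP,
                               n_fft=FRAME_LEN, hop_length=HOP).T      # (T, 12)

    # Log frame energy (used only through its difference, as in the text).
    frames = librosa.util.frame(x, frame_length=FRAME_LEN, hop_length=HOP).T
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    T = min(len(cep), len(log_e))
    cep, log_e = cep[:T], log_e[:T]

    def delta(c):
        """Difference over a four-frame span: d[n] = c[n+2] - c[n-2] (edges held)."""
        pad = ((SPAN, SPAN),) + ((0, 0),) * (c.ndim - 1)
        p = np.pad(c, pad, mode="edge")
        return p[2 * SPAN:] - p[:-2 * SPAN]

    # RASTA band-pass filtering of each parameter trajectory [23] would be
    # applied at this point; it is omitted from this sketch.
    return np.hstack([cep, delta(cep), delta(log_e)[:, None]])          # (T, 25)
```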

B. Phone/Cue Recognition

The second subsystem recognized the phones corresponding to the acoustic input and converted them to a time-marked stream of cue codes, which was sent to the display subsystem. Speech was recognized as phones by PC2, a DEC AlphaStation. This software implemented hidden Markov models (HMMs) [24], was based on routines in the HTK suite of programs [25], [26], and operated in speaker-dependent mode.

Three phonetic recognizers were studied. The recognizers differed primarily in the number of separate models trained for each phone and the method of assignment of a subclass of the given phone to each of the available models. In all cases, three-state left-right models were used. Output probability densities were described by mixtures of six diagonal-covariance Gaussian distributions. Static and dynamic parameters formed distinct "streams" with different probability densities, under the implicit assumption that static and dynamic parameters were statistically independent [27]. Pilot studies showed that this was the broadly optimal configuration for this application. For each speaker, models were trained on roughly 60 min (1000 sentences) of speech data using the forward-backward algorithm [24], [28], [29].


Fig. 1. Block diagram of the automatic cue generator.

Recognition was performed using the HTK Viterbi beam search, modified to accept continuously arriving data vectors and to decode the corresponding phone sequence in real time.
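The following numpy sketch summarizes the acoustic-model structure described above (three-state left-right topology, six diagonal-covariance Gaussians per state, and independent static and dynamic streams). The numerical values of the transition probabilities are placeholders, and Baum-Welch training and Viterbi decoding are not reproduced.

```python
import numpy as np

N_STATES = 3    # three emitting states per phone model
N_MIX = 6       # six diagonal-covariance Gaussians per state and stream

# Left-right topology: a state may only repeat or advance to the next state.
transmat = np.array([[0.6, 0.4, 0.0],
                     [0.0, 0.6, 0.4],
                     [0.0, 0.0, 1.0]])   # exit handled by model concatenation
startprob = np.array([1.0, 0.0, 0.0])    # models are always entered at state 1

STREAM_DIMS = {"static": 12, "dynamic": 13}   # 12 MFCCs; 12 deltas + delta-energy

def log_gmm(x, weights, means, variances):
    """Log-likelihood of one stream's slice x under a diagonal-covariance GMM.
    weights: (N_MIX,); means, variances: (N_MIX, dim)."""
    diff = x - means
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances)
                               + diff ** 2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)

def state_log_likelihood(frame, stream_gmms):
    """Independent streams: the per-stream log-likelihoods simply add."""
    x_static, x_dynamic = frame[:12], frame[12:]
    return (log_gmm(x_static, *stream_gmms["static"])
            + log_gmm(x_dynamic, *stream_gmms["dynamic"]))
```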

The first recognizer, C1, used context-independent models of 46 phones (Table I), similar to the simplified phone set used in the TIMIT database [30], [31], but with some minor modifications.1 The sentences used in our simulation studies were transcribed using this set by trained phoneticians. This recognizer operated in real time with virtually no restrictions on the search beam and achieved a phone recognition accuracy2 of 71% in off-line experiments and 65% for live speech. The lower accuracy for live speech, relative to off-line recognition, was likely due to differences between the speaking style, microphones, recording conditions, etc., of the prerecorded and real-time speech.

The second recognizer, C2, used right-context-dependent models, i.e., separate models were trained for a given phone for each possible following phone. This resulted in a significant increase in the number of models, from 46 to 2116, and consequently required much greater amounts of computation during recognition. To achieve real-time operation we implemented a modified decoding procedure, similar to the forward-backward search [32]. In our approach, context-independent models were used in the first (backward) pass to identify a subset of phone hypotheses that exceeded a likelihood threshold.

1For example, we used separate models for the flapped "d" and the flapped "t" because in MCS the cue would reflect the underlying phoneme. Using the separate models, the recognizer was able to distinguish between them at a level that was well above chance, even though they are acoustically very similar.

2Accuracy = 100 × (total phones − substitutions − deletions − insertions)/(total phones).

TABLE I: Phones modeled by the recognition software. A closure refers to the brief period immediately preceding a plosive burst when the vocal tract is completely closed and no sound is emitted from the mouth.

This drastically reduced the number of phone models (the "beam width") whose match to the acoustic data was then evaluated in the second (forward) pass using the more accurate context-dependent models.


TABLE II: Assignment of phones to context classes. Note that, for purposes of context forming, closures were considered to be part of the stop that followed them.

This method of reducing the number of context-dependent models generally resulted in higher accuracy than the simpler beam search [24], which constrains the number of phone hypotheses based on only one pass.

In practice the number of phone models that exceeded the likelihood threshold varied significantly. For example, steady portions of vowels and strong fricatives tended to be matched well to only a few models, while nasals and phone transitions initially produced fairly good matches to many models. Consequently, the time required to process a given vector of acoustic parameters varied as well. To make good use of computational resources, we imposed additional, dynamically adapted constraints on the beam width. The accuracy of C2 was 79% in off-line conditions but only 66%–70%3 for live speech. Apparently some correct phone hypotheses were erroneously pruned, nullifying the gains from improved modeling.
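A hypothetical sketch of this two-pass idea is given below: a cheap context-independent (CI) pass shortlists phone hypotheses within a likelihood beam of the best score, and only the matching context-dependent (CD) models are evaluated in the second pass. The function names, the "phone+context" naming convention, and the fixed beam are illustrative; the actual system adapted the beam dynamically.

```python
def shortlist(ci_log_scores: dict, beam: float) -> set:
    """Keep phones whose CI log-likelihood is within `beam` of the best."""
    best = max(ci_log_scores.values())
    return {phone for phone, score in ci_log_scores.items() if score >= best - beam}

def score_cd_models(frame, active_phones, cd_models: dict, score_fn) -> dict:
    """Evaluate only the CD models whose base phone survived the CI pass."""
    return {name: score_fn(frame, model)
            for name, model in cd_models.items()
            if name.split("+")[0] in active_phones}   # e.g. "ah+labial" -> "ah"
```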

Recognizer C3 used phone models that depended on both the preceding and subsequent phones. To limit the number of models, generalized contexts were used. Thirteen context classes, each containing between two and six phones, were constructed. Thus the same phone model was used for an "ah" when followed by a "b" as when followed by a "p," but a different model was used when "ah" was followed by a "d." The phone context classes generally group consonant phones with similar manner and place of articulation and vowel phones with similar tongue position (Table II). Of the 7774 possible models we used only roughly half: those occurring most frequently in the training data. The accuracy of the C3 recognizer was 80% off-line and 74% for live speech. The live-speech accuracy was higher than that of the C2 recognizer when implemented on the same processor (70%). All recognizers produced running estimates of the sequences of spoken phones together with estimated start and end times. The recognized phones were converted into a sequence of codes for cues via a finite-state grammar. These codes and start/stop times were then sent to the cue display subsystem.
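The conversion from recognized phones to cue codes can be pictured as below. This is an illustrative grouping of a time-marked phone stream into CV cues, not the finite-state grammar actually used, and the lookup tables are placeholders rather than the real MCS assignment of phones to hand shapes and positions.

```python
CONSONANT_GROUP = {"b": 1, "d": 2, "s": 3}   # phone -> hand shape (placeholder values)
VOWEL_GROUP = {"ah": 1, "iy": 2, "uw": 3}    # phone -> hand position (placeholder values)
NEUTRAL = 0                                  # simplified indicator for unpaired phones

def phones_to_cues(phones):
    """phones: list of (label, start_ms, end_ms) from the recognizer.
    Returns a list of (hand_shape, hand_position, start_ms, end_ms) cue codes."""
    cues, i = [], 0
    while i < len(phones):
        label, start, end = phones[i]
        if label in CONSONANT_GROUP:
            shape = CONSONANT_GROUP[label]
            # Pair the consonant with an immediately following vowel, if any.
            if i + 1 < len(phones) and phones[i + 1][0] in VOWEL_GROUP:
                v_label, _, v_end = phones[i + 1]
                cues.append((shape, VOWEL_GROUP[v_label], start, v_end))
                i += 2
                continue
            cues.append((shape, NEUTRAL, start, end))               # lone consonant
        elif label in VOWEL_GROUP:
            cues.append((NEUTRAL, VOWEL_GROUP[label], start, end))  # lone vowel
        # other labels (silences, closures) generate no cue of their own
        i += 1
    return cues
```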

3Differences between recognizer scores of 1.5 percentage points are significant at the 0.05 level.

C. Cue Display

The cue display subsystem 1) captured images of the speaker's face, 2) overlaid the appropriate cues on these images, and 3) displayed the sequence of overlaid images to the cue receiver. As discussed above, the simulation study had underscored the importance of synchronizing the presence of cues to the talker's speech. Since cues cannot be identified and displayed before the speaker has uttered the corresponding syllable, the video image of the talker's face was stored for delayed playback while the cue to be displayed was recognized. An Oculus TCi framegrabber card (Coreco Inc.), running on PC1, digitized the output of a video camera recording the speaker's face at 30 frames/s. The digital images were stored in memory for 2 s, a period that was more than adequate to allow the cue to be identified by the recognizer, and delivered to the display program. Each stored frame was then retrieved, one of the hand shapes was overlaid at the appropriate location, and the overlaid image was displayed on a monitor. The eight hand shapes of MCS were available in memory as digitized still images of an actual hand (not necessarily the speaker's). The artificially cued talker, as seen by the cue receiver, was thus delayed by 2 s relative to the real talker, but was displayed continuously, using smooth, full-motion video.

Since some of the cue receivers who had participated in the simulation study [19] reacted negatively to the discrete motion of the hand shapes from position to position, several alternate styles of cue display4 were studied. In the "smooth" display, the hand image was moved at a uniform rate along a straight-line path between target positions, without pausing at these positions. The display computer interpolated the intermediate locations from the cue endpoints specified by the recognizer [34]. The "dynamic" display used heuristic rules to apportion cue display time between time paused at target positions and time spent in transition between these positions. Typically, 150 ms was allocated to the transition, provided the hand could pause at the target position for at least 100 ms. The movement between target positions was thus smooth unless the cue was short, in which case it would tend to resemble the original "static" display.

Comparison of the output of the system with the hand motions of human cuers suggested further modifications to the cue display. In particular, human cuers often begin to form cues well before producing audible sound. To achieve a similar effect, in our "synchronous" display the time at which cues were displayed was advanced by 100 ms relative to the start time determined by the recognizer. We also changed the timing of the transition from one hand shape to the next so that cues changed halfway through a transition rather than at the end (as in the "dynamic" display). Special rules were used for cues corresponding to diphthongs. In particular, a minimum of 200 ms was allocated to the diphthong motion and 150 ms was allocated for the transition from the diphthong final position to the next cue position. Any remaining time was added to the diphthong motion and any shortfall was first subtracted from the transition to the next cue. MCS specifies that during the diphthong motion the hand shape must change from the shape associated with the consonant to the "open" shape. We selected that change point at 75% of the diphthong motion.
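The timing rules of the "dynamic" and "synchronous" displays can be summarized by the sketch below. The constants come from the text; the frame-level rendering, the mid-transition hand-shape switch, and the diphthong rules are omitted, and the function itself is our illustration rather than the actual display code.

```python
TRANSITION_MS = 150   # nominal time spent moving between target positions
MIN_PAUSE_MS = 100    # minimum pause required at a target position
ADVANCE_MS = 100      # "synchronous" display leads the recognized start time

def plan_cue(start_ms, end_ms, synchronous=True):
    """Split one cue's display interval into (transition, pause) durations."""
    if synchronous:
        # Begin forming the cue early, as human cuers tend to do.
        start_ms -= ADVANCE_MS
        end_ms -= ADVANCE_MS
    duration = end_ms - start_ms
    if duration >= TRANSITION_MS + MIN_PAUSE_MS:
        transition, pause = TRANSITION_MS, duration - TRANSITION_MS
    else:
        # Short cue: jump essentially directly to the target ("static"-like).
        transition, pause = 0, duration
    return start_ms, transition, pause
```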

4Examples of these styles are available on CD-ROM [33].


TABLE III: Histories of subjects experienced in receiving MCS.


IV. EVALUATION METHODS

To assess the benefit of ACS, we conducted speech reception experiments with experienced users of MCS. The goal was to compare the subjects' performance under SA, MCS, and ACS. These experiments also guided the development of the automatic cueing system and allowed us to evaluate its various versions.

A. Subjects

A total of five subjects, ranging in age from 19 to 24 years, were tested in the experiments (Table III). All were highly skilled receivers of MCS and native speakers of English. While all of the subjects used MCS extensively in primary school, at the time of the study they used MCS for 1–4 h/day, usually with a relative or transliterator.

Subjects S1 and S3 had participated in our simulation study [19]. The others had no previous exposure to ACS.

B. Talkers

Cued and uncued sentences were produced by three female talkers with normal hearing. Two teachers of the deaf who use MCS, T1 and T2, produced all the sentences used in Experiment I. Talker T3, certified by the National Cued Speech Association as an Instructor of Cued Speech, produced all sentences in the remaining experiments.

C. Materials

Three types of test materials were used to measure speech reception. The CUNY sentences [35] exhibit relatively high context (e.g., "The football field is right next to the baseball field."). They are organized in lists of 12, each sentence containing from three to fourteen words, for a total of 102 words per list. The IEEE sentences [18] are more difficult and provide fewer contextual cues (e.g., "Glue the sheet to the dark blue background."). They are divided into lists of ten sentences, with each sentence containing five keywords, for a total of 50 keywords per list. Additional IEEE-like sentences were specially created for these experiments to extend the number of low-context test materials. They used the same keywords and sentence structure as the IEEE sentences, but used different phrasing. The resulting sentences (e.g., "Cars left outside are prone to rust.") are thus believed to be of comparable difficulty.

D. Procedures

Live, rather than recorded, sentence productions were used. The talker sat in a sound-proof booth, faced a video camera and monitor, and wore a lapel-style, omni-directional microphone. A monitor displayed the text to be spoken. The cue receiver, who sat in a separate booth, observed the video of the talker on another monitor. The video was delayed by 2 s, whether or not artificial cues were superimposed. No audio signal was provided to the receiver. In some experiments, several subjects were tested simultaneously. No communication was allowed among the subjects.

A typical experiment lasted 3–4 h and was divided into sessions comprising reception of four or five sentence lists. Subjects were given 10- to 20-min breaks between sessions. A single session included a list in the speechreading-alone mode, a list cued manually by the speaker, and two or three lists cued automatically. The order of conditions was randomized from session to session. Over the course of an experiment at least 50 sentences were presented in each of the conditions. Each sentence was presented to a given subject only once.

Subjects wrote as much of each observed sentence as they understood on prepared answer sheets. Guessing at imperfectly perceived words was encouraged. A single training session was conducted at the beginning of the experiment. In general, the subjects had little difficulty understanding the task and the procedure.

V. RESULTS

The receivers' responses were scored as the percentage of keywords recognized correctly. Strict scoring rules were used: responses considered correct had to agree in tense, number, and form with the sentence text. Homophonic responses were scored as correct. For the CUNY sentences, all 102 words in a list were considered keywords. The IEEE and IEEE-like sentences had 50 designated keywords per list.

Table IV summarizes the results5 of the four experiments, which tested the recognizers and displays described in Section III. Scheduling constraints made it impractical to test each subject on a wide variety of systems.

Analysis of variance indicated that in no experiment was there significant interaction (at the 0.01 level) between subject and the pattern of scores across presentation conditions. The Mann–Whitney test indicated that scores for MCS were significantly greater than for SA and for ACS in all experiments. ACS scores were significantly greater than SA scores in Experiments III and IV. The difference in scores between ACS display types was not significant in Experiment III, but was significant in Experiment IV.

The scores shown in Table IV and the statistical analyses indicate the following. First, the better automatic cueing systems improved word reception relative to SA, although the improvement was less than when MCS was used. Second, while improving performance of the cue recognizer (from C1 and C2 to C3) clearly improved scores, the changes in the display style (from "dynamic" to "synchronous") produced a greater improvement in scores.

5Percentage benefit B reports the proportion of the increase in the MCS score (S_MCS) relative to the speechreading score (S_SA) that is achieved by a given automatic cueing system with score S_ACS, i.e., B = 100 (S_ACS − S_SA)/(S_MCS − S_SA).


TABLE IV: Results of speech reception experiments with various cue systems.

All three subjects (S3, S4, and S5) commented positively on the final system tested (C3-"synchronous"), indicating that it provided them with appreciable aid relative to SA.

VI. DISCUSSION

The outcome of our experiments with the automatic cue generation systems largely confirms the results of our earlier simulation studies: speechreaders can derive a significant improvement in speech reception from artificial cues produced by an ASR system. Cues produced by the C3 recognizer and presented in the synchronous mode gave our subjects over 57% of the benefit that would accrue from MCS. Keyword scores increased from 35% via unaided speechreading to 66% when speechreading was supplemented by automatically produced cues. Since these results were obtained using low-context sentences, they show an unambiguous potential for improving speech comprehension by the deaf in realistic, full-discourse situations.
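In terms of the percentage benefit B defined in footnote 5, and taking the SA and ACS keyword scores quoted above (35% and 66%), an MCS score of roughly 89% (assumed here purely for illustration; the measured value appears in Table IV) reproduces the reported figure:

```latex
B = 100\,\frac{S_{\mathrm{ACS}} - S_{\mathrm{SA}}}{S_{\mathrm{MCS}} - S_{\mathrm{SA}}}
  \approx 100 \times \frac{66 - 35}{89 - 35} \approx 57\%.
```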

A. Other Aids Based on ASR

Although the automatic production of cued speech is not the only way ASR technology can be used to aid communication by the deaf, it offers certain advantages that make it an attractive approach. One alternative would be simply to display the recognized phones, perhaps as a running stream of phonetic symbols. This technique was investigated in the VIDVOX project [36]. However, it was found that a phone recognition accuracy of roughly 95% would be required to provide significant benefit to the receiver. This level of recognizer performance is well in excess of the capabilities of currently existing systems.

Another approach would use ASR to caption the speech with words. However, such a display would only benefit those with good reading skills. It would not be appropriate for young children or adults with inadequate reading abilities. The technical feasibility of such a system is also questionable. The most advanced continuous speech recognition systems achieve word accuracies of 90%–95% [37], [38], but only after significant adaptation to a given speaker and with heavy reliance on language models. Moreover, such high levels of performance are typically achieved only under relatively benign conditions: low ambient noise, constant acoustic transmission characteristics, grammatically correct utterances, and careful speaking style. Relaxation of any of these constraints typically increases the error rate substantially [37]. For example, when such systems are applied to spontaneous speech, as found in everyday conversations, error rates often exceed 30% [39].

Spontaneous speech is less of a problem for an automatic system that produces cues for all of the phones uttered by the speaker, including hesitation sounds (e.g., "umm"), repeated words, and even stuttering. Since phonetic recognizers do not rely on word-level language models, they can deal with unusual or ungrammatical utterances. The performance of all ASR systems is degraded by disturbances in the acoustic environment. Trained receivers appear to ignore cues when they are produced by unreliable phonetic recognizers and to rely on speechreading (e.g., Table IV and [19]). How well speechreading would be integrated with an unreliable word-based display is not known.

B. Alternate Visual Speechreading Aids

The approach to developing visual speechreading supplements proposed in this paper is based on the presentation of a small set of discrete symbols that are abstracted from the acoustic speech signal via speech recognition.


Visual supplements that do not identify discrete linguistic units have also been studied [7], [8], [40]. These approaches estimate parameters of the acoustic speech signal (e.g., energy in low- and high-frequency bands), and quantize the parameters. The results of logical computations performed on the quantized parameters control the illumination of light-emitting elements in the field of view or visual periphery of the speechreader. For example, Upton's original system [7] used five lamps to indicate the presence of five types of speech sounds. Evaluations of such supplements [40] indicated that they had potential for aiding speech reception. After only 6 h of training, college students with moderate to severe hearing losses achieved 14%–19% higher scores for monosyllabic words when the aid was used as a supplement to speechreading. Upton himself was documented as achieving a 20% improvement using the device. Demonstrating benefits to the reception of connected discourse proved more difficult: there was no carryover for the college student group. However, one subject with a severe hearing loss achieved 19% higher word scores on IEEE sentences after using the device for a period of 6 months, and Upton himself achieved 27% higher scores for words in more contextual sentences using a colored-light version of the display. These improvements are noticeably smaller than those achieved by our subjects in Experiment IV.

More recently, Ebrahimi and Kunov [8] presented an extensive rationale for the design of a speechreading supplement. They concluded that it would be beneficial to represent voicing (vocal fold activity) and voice onset time, intonation patterns such as fundamental frequency contours, formant frequencies, and the levels of high-frequency sounds. To allow the supplementary cues to be integrated with speechreading without increasing the visual workload, they advocated presenting the supplements via peripheral rather than foveal vision. Their display was based on an LED matrix that presented three parameters derived from the acoustic waveform: voice fundamental frequency, the speech waveform envelope, and speech power above 3 kHz. Ideally, the display would present five distinct spatio-temporal patterns to distinguish among consonants that differ in voicing and manner of articulation. Eight young adults (three normal-hearing and five profoundly deaf) evaluated this device in a 12-consonant (/a/-C-/a/ context) identification task. Scores improved from 41% for SA to 76% for aided speechreading. In particular, distinctions between voiced and unvoiced consonants, which were made with only 55% accuracy with SA, were made with greater than 88% accuracy with the aid.

In spite of the encouraging results reported for these systems, the development of these speechreading supplements is incomplete. Both approaches are highly empirical, and necessarily confound the effects of parameter selection and parameter display with those of training. Even if the benefit can be fairly evaluated after only 6 months of experience with the supplement, it would be extremely time consuming to compare the benefits provided by different parameter sets and/or different displays. Note that although several variants of Upton's aid were developed, no two variants were ever compared in a formal study. Although the use of closed-set identification tests might seem to minimize this problem, this conclusion is highly suspect. Greater amounts of practice would be required to evaluate reception of the complete set of English consonants, particularly if the presentation conditions included a wider set of vowel contexts. Additional tests would be needed to evaluate the reception of vowels. Thus, while this approach might serve to eliminate some parameter sets and/or displays, it is not capable of proving the adequacy of a given system.

Fig. 2. Keyword scores with various speechreading aids versus unaided scores. Solid line delineates the region of significant benefit (speech reception allowing for reasonable conversation).


Our approach circumvents many of these difficulties. Since the supplements consist of sequences of discrete symbols, it is possible to evaluate the adequacy of the recognition stage of the system separately from the adequacy of the display stage. Furthermore, analytical models can be used to predict how well speech segments can be recognized based on the error patterns made by the speech recognizer [13]. Additional advantages accrue from the fact that the supplement is closely related to MCS. As we have shown, highly trained cue receivers are sensitive to differences in display strategies without the need for months of training. Perhaps more importantly, it is possible to incorporate many of the desirable properties of MCS in the system. For example, MCS displays the cues for consonants over the duration of the CV segment rather than for the duration of the consonant itself. This is likely to improve the reception of consonant cues, since many English consonants are of relatively short duration. Similarly, it is unnecessary to speculate whether supplements should use peripheral vision for the display: MCS provides high levels of speech reception via foveal vision.

C. Tactile Aids and Implants

Fig. 2 compares word reception scores obtained by the subjects who evaluated our most advanced system with those obtained with comparable test materials by a sample of users of the Ineraid cochlear implant and the tactile aids (Tactaid 2 and Tactaid 7) [41]. As can be seen, scores obtained by users of ACS exceed those obtained by roughly one-third of the users of the cochlear implant and by roughly two-thirds of the users of the tactile aids. Note that the only superior scores for tactile aids were obtained by a single subject with exceptional speechreading skills.


D. Practical Applications

The development of the cue generating system has progressed rapidly; less than a year elapsed between the first and final versions tested. Additional research directed towards the improvement of recognition and display technology seems likely to increase the benefit that the system provides to the cue receiver. It should be possible to reduce the 2-s delay incorporated in the current system substantially, to facilitate face-to-face communication, and to provide a delayed acoustic output for users with partial hearing. In addition, the cost of providing the recognition and display functions required by automatic cueing systems is decreasing rapidly.

Nonetheless, the computational requirements of ASR and the complexity of the display make it unlikely that a truly portable cueing system, as envisioned by the developers of the Autocuer, can be realized soon. Rather, we expect that initial deployments of the system will be in relatively static and controlled settings such as lecture halls, classrooms, or the homes of deaf children. A camera, controlled automatically to track the face of the speaker [42], [43], would be required in some applications. Cues would be generated and displayed on a laptop computer. In the home environment, the system could be used to produce cues for television broadcasts or to help train family members to produce MCS. Because the system produces a supplement that is compatible with MCS, it could be used immediately by skilled cue receivers with little need for additional training.

ACKNOWLEDGMENT

The authors would like to thank the experienced producers and receivers of MCS who participated in the studies reported. They would also like to thank A. K. Dix, who helped develop the IEEE-like sentences. Finally, they would like to thank C. M. Reed and L. A. Delhorne, who allowed them to use their data on the reception of speech via cochlear implants and tactile aids.

REFERENCES

[1] K. Grant and L. Braida, "Evaluating the articulation index for audiovisual input," J. Acoust. Soc. Amer., vol. 89, pp. 2952–2960, 1991.

[2] E. Douglas-Cowie, "Acquired deafness and communication," in Coping with Acquired Hearing Loss, pp. 13–22, 1988.

[3] E. Klima and U. Bellugi, The Signs of Language. Cambridge, MA: Harvard Univ. Press, 1979.

[4] J. E. Wandel, "Use of internal speech in reading by hearing and hearing impaired students in oral, total communication, and cued speech programs," Ph.D. dissertation, Teacher's College, Columbia Univ., New York, 1989.

[5] M. J. Osberger, L. Fisher, S. Zimmerman-Phillips, L. Geier, and M. J. Baker, "Speech recognition performance of older children with cochlear implants," Amer. J. Otol., vol. 19, pp. 152–157, Mar. 1998.

[6] G. M. Clark, "Cochlear implants in the second and third millennia," in Proc. ICSLP-98, Dec. 1998, pp. 1–6.

[7] H. W. Upton, "Wearable eyeglass speechreading aid," Amer. Ann. Deaf, vol. 113, pp. 222–229, Mar. 1968.

[8] D. Ebrahimi and H. Kunov, "Peripheral vision lipreading aid," IEEE Trans. Biomed. Eng., vol. 38, pp. 944–952, Oct. 1991.

[9] D. K. Oller, Ed., Tactile Aids for the Hearing Impaired. New York: Int. Sensory Aids Soc., Thieme Medical, Nov. 1995.

[10] R. O. Cornett, "Cued speech," Amer. Ann. Deaf, vol. 112, pp. 3–13, 1967.

[11] G. Fant, "Q-codes," in Proc. Int. Symp. Speech Communication Ability and Profound Deafness, G. Fant, Ed., Washington, DC, 1970, pp. 261–299.

[12] S. Hiki and Y. Fukuda, "Proposal of a system of manual signs as an aid for Japanese lipreading," J. Acoust. Soc. Jpn., vol. 2, no. 2, pp. 127–129, 1981.

[13] R. Uchanski, L. Delhorne, A. Dix, C. M. Reed, L. Braida, and N. Durlach, "Automatic speech recognition to aid the hearing impaired: Prospects for the automatic generation of cued speech," J. Rehab. Res. Dev., vol. 31, no. 1, pp. 20–41, 1994.

[14] H. Kaplan, "The effects of cued speech on the speechreading ability of the deaf," Ph.D. dissertation, Univ. Maryland, College Park, 1974.

[15] D. Ling and B. Clarke, "Cued speech: An evaluative study," Amer. Ann. Deaf, vol. 120, pp. 480–488, 1975.

[16] B. Clarke and D. Ling, "The effects of using cued speech: A follow up study," Volta Rev., vol. 78, pp. 23–34, 1976.

[17] G. Nicholls and D. Ling, "Cued speech and the reception of spoken language," J. Speech Hearing Res., vol. 25, pp. 262–269, 1982.

[18] "IEEE recommended practices for speech quality measurements," IEEE, New York, Tech. Rep., 1969.

[19] M. S. Bratakos, P. Duchnowski, and L. D. Braida, "Toward the automatic generation of cued speech," The Cued Speech J., vol. 6, pp. 1–37, June 1997.

[20] O. Cornett, R. Beadles, and B. Wilson, "Automatic cued speech," in Processing Aids for the Deaf, pp. 224–239, 1977.

[21] R. Beadles and L. D. Braida, "Describing the performance of the RTI Autocuer and the results of tests with simulations of the Autocuer," unpublished, Aug. 31, 1989.

[22] D. O'Shaughnessy, Speech Communication, Human and Machine. Reading, MA: Addison-Wesley, 1990.

[23] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 52–59, Oct. 1994.

[24] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.

[25] S. Young, P. Woodland, and W. Byrne, HTK: Hidden Markov Model Toolkit, Version 1.5 User's Manual. Washington, DC: Entropics Inc., 1993.

[26] P. Woodland, C. Leggetter, J. Odell, V. Valtchev, and S. Young, "The 1994 HTK large vocabulary speech recognition system," in Proc. 1995 Int. Conf. Acoust., Speech, and Signal Processing, 1995, pp. 73–76.

[27] S. Furui, "Speaker-independent isolated word recognition using dynamic features of the speech spectrum," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 52–59, Feb. 1986.

[28] L. Baum, "An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1–8, 1972.

[29] L. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 179–190, Feb. 1983.

[30] L. Lamel, R. Kassel, and S. Seneff, "Speech database development: Design and analysis of the acoustic-phonetic corpus," in Proc. DARPA Speech Recognition Workshop, L. Baumann, Ed., Feb. 1986, pp. 100–109.

[31] W. Fisher, V. Zue, J. Bernstein, and D. Pallett, "An acoustic-phonetic data base," J. Acoust. Soc. Amer., vol. 81, no. S1, p. S92, 1987.

[32] S. Austin, R. Schwartz, and P. Placeway, "The forward-backward search algorithm," in Proc. 1991 Int. Conf. Acoust., Speech, and Signal Processing, vol. 1, 1991, pp. 697–700.

[33] P. Duchnowski, L. D. Braida, M. S. Bratakos, D. S. Lum, M. G. Sexton, and J. C. Krause, "A speechreading aid based on phonetic ASR," in Proc. ICSLP-98, Dec. 1998, pp. 3289–3293.

[34] M. G. Sexton, "A video display system for an automatic cue generator," Master's thesis, M.I.T., Cambridge, MA, Feb. 1997.

[35] A. Boothroyd, T. Hnath-Chisolm, and L. Hanin, "A sentence test of speech perception: Reliability, set-equivalence, and short-term learning," City Univ. New York, New York, Tech. Rep. RCI10, 1985.

[36] A. Huggins, R. Houde, and E. Colwell, "Vidvox human factors investigation," BBN Laboratories, Cambridge, MA, Tech. Rep. 6187, 1986.

[37] S. Young, "A review of large-vocabulary continuous-speech recognition," IEEE Signal Processing Mag., vol. 13, pp. 45–57, 1996.

[38] Dragon Systems. (1997, Dec.). Product information for NaturallySpeaking speech recognition software. [Online]. Available: http://www.dragonsystems.com

[39] S. Young, J. Odell, and P. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. ARPA Human Lang. Tech. Workshop, Mar. 1994, pp. 615–621.

[40] R. W. Gengel, "Upton's wearable eyeglass speechreading aid: History and current status," in Hearing and Davis: Essays Honoring Hallowell Davis, S. Hirsh, D. Eldridge, I. J. Hirsh, and S. Silverman, Eds. St. Louis, MO: Washington Univ. Press, 1976, pp. 291–299.


[41] C. Reed and L. Delhorne, "Current results of a field study of adult users of tactile aids," Seminars Hearing, pp. 305–315, 1995.

[42] J. Yang and A. Waibel, "Tracking human faces in real time," Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, Tech. Rep. CMU-CS-95-210, 1995.

[43] N. Oliver, A. Pentland, F. Berard, and J. Coutaz, "LAFTER: Lips and face tracker," MIT Media Lab., Cambridge, MA, Tech. Rep. 396, 1996.

Paul Duchnowski received several degrees from the Massachusetts Institute of Technology.

He is currently developing Internet messaging technology at Software.com, Inc., Lexington, MA. He is also a Visiting Scientist in the Sensory Communications Group at the Massachusetts Institute of Technology.

David S. Lum received the S.B. and M.Eng. degrees from the Massachusetts Institute of Technology, Cambridge, in 1994 and 1995, respectively.

He is currently a Technical Lead at Template Software, Dulles, VA, developing a CORBA binding for Template's Snap language. One of his main interests is evangelizing for Linux.

Jean C. Krause received the B.S.E.E. degree from Georgia Institute of Technology, Atlanta, and an S.M. degree from the Massachusetts Institute of Technology (M.I.T.), Cambridge, in 1993 and 1995, respectively. She is currently pursuing the Ph.D. degree in electrical engineering at M.I.T.

Her research interests include clear speech, speech communication, digital signal processing, cued speech, and automatic cueing technology.

Matthew G. Sexton received the S.B. and M.Eng. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1996 and 1997, respectively.

He is currently a Member of Technical Staff of Mercury Computer Systems, Chelmsford, MA.

Maroula S. Bratakos received the B.S. and M.Eng. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1993 and 1995, respectively.

She is currently a Member of Technical Staff in the DSP group of Nuera Communications, Inc., San Diego, CA.

Louis D. Braida (S'62–M'69) received the B.E.E. degree in electrical engineering from The Cooper Union, New York, NY, in 1964 and the M.S. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1965 and 1969, respectively.

He has been a member of the faculty of the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, since 1969 and is currently Henry Ellis Warren Professor of Electrical Engineering. His research is in the areas of auditory perception and the development of improved hearing aids and aids for the deaf.