
The Limits of Speech Recognition

Ben Shneiderman

To improve speech recognition applications, designers must understand acoustic memory and prosody.

Human-human relationships are rarely a good model for designing effective user interfaces. Spoken language is effective for human-human interaction but often has severe limitations when applied to human-computer interaction. Speech is slow for presenting information, is transient and therefore difficult to review or edit, and interferes significantly with other cognitive tasks. However, speech has proved useful for store-and-forward messages, alerts in busy environments, and input-output for blind or motor-impaired users.

Continued research and development should be able to improve certain speech input, output, and dialogue applications. Speech recognition and generation is sometimes helpful for environments that are hands-busy, eyes-busy, mobility-required, or hostile and shows promise for telephone-based services. Dictation input is increasingly accurate, but adoption outside the disabled-user community has been slow compared to visual interfaces. Obvious physical problems include fatigue from speaking continuously and the disruption in an office filled with people speaking.

By understanding the cognitive processes surrounding human "acoustic memory" and processing, interface designers may be able to integrate speech more effectively and guide users more successfully. By appreciating the differences between human-human interaction and human-computer interaction, designers may then be able to choose appropriate applications for human use of speech with computers. The key distinction may be the rich emotional content conveyed by prosody, or the pacing, intonation, and amplitude in spoken language. The emotive aspects of prosody are potent for human-human interaction but may be disruptive for human-computer interaction. The syntactic aspects of prosody, such as rising tone for questions, are important for a system's recognition and generation of sentences.

Now consider human acoustic memory and processing. Short-term and working memory are sometimes called acoustic or verbal memory.

The part of the human brain that transiently holds chunks of information and solves problems also supports speaking and listening. Therefore, working on tough problems is best done in quiet environments, without speaking or listening to someone. However, because physical activity is handled in another part of the brain, problem solving is compatible with routine physical activities like walking and driving. In short, humans speak and walk easily but find it more difficult to speak and think at the same time (see Figure 1).

Similarly, when operating a computer, most humans type (or move a mouse) and think but find it more difficult to speak and think at the same time. Hand-eye coordination is accomplished in different brain structures, so typing or mouse movement can be performed in parallel with problem solving.

My students and I at the University of Maryland stumbled across this innate human limitation during a 1993 study [4] in which 16 word-processor users were given the chance to issue voice commands for 18 tasks, including page down, boldface, italic, and superscript. For most tasks, this facility enabled a 12% to 30% speed-up, since users could keep their hands on the keyboard and avoid mouse selections. However, one task required the memorization of mathematical symbols, followed by a page down command. The users then had to retype the symbols from memory. Voice-command users had greater difficulty with this task than mouse users. Voice-command users repeatedly scrolled back to review the symbols, because speaking the commands appeared to interfere with their retention.

Product evaluators of an IBM dictation software package also noticed this phenomenon [1]. They wrote that "thought for many people is very closely linked to language. In keyboarding, users can continue to hone their words while their fingers output an earlier version. In dictation, users may experience more interference between outputting their initial thought and elaborating on it." Developers of commercial speech-recognition software packages recognize this problem and often advise dictation of full paragraphs or documents, followed by a review or proofreading phase to correct errors.

A 1999 study of three commercial speech-recognition systems focused on errors and error-correction patterns [2, 3]. It found that when novice users try to fix errors, they often get caught in cascades of errors (up to 22 steps). Part of the explanation is that novices stuck with speech commands for corrections, while more experienced users learned to switch to keyboard correction. While all study participants had longer performance times for composition tasks than transcription tasks, the difference was greater for those using speech. The demands of using speech rather than keyboard entry may have slowed speech users more in the higher-cognitive-load task of composition.

Since speaking consumes precious cognitive resources, it is difficult to solve problems at the same time. Proficient keyboard users can have higher levels of parallelism in problem solving while performing data entry. This may explain why, after 30 years of ambitious attempts to provide military pilots with speech recognition in cockpits, aircraft designers persist in using hand-input devices and visual displays. Complex functionality is built into the pilot's joystick, which has up to 17 functions, including pitch-roll-yaw controls, plus a rich set of buttons and triggers. Similarly, automobile controls may have turn signals, wiper settings, and washer buttons all built onto a single stick, and typical video camera controls may have dozens of settings that are adjustable through knobs and switches. Rich designs for hand input can inform users and free their minds for status monitoring and problem solving.

The interfering effects of acoustic processing are a limiting factor for designers of speech recognition, but the role of emotive prosody raises further concerns. The human voice has evolved remarkably well to support human-human interaction. We admire and are inspired by passionate speeches. We are moved by grief-choked eulogies and touched by a child's calls as we leave for work.

A military commander may bark commands at troops, but there is as much motivational force in the tone as there is information in the words.


Figure 1. A simple resource model showing that cognitive resources available for problem solving and recall are limited when speech input/output consumes short-term and working memory. When hand-eye coordination is used for pointing and clicking, more cognitive resources are available for problem solving and recall.

Loudly barking commands at a computer is not likely to force it to shorten its response time or retract a dialogue box. Promoters of "affective computing," or recognizing, responding to, and making emotional displays, may recommend such strategies, though this approach seems misguided. Many users might want shorter response times without having to work themselves into a mood of impatience. Secondly, the logic of computing requires a user response to a dialogue box independent of the user's mood. And thirdly, the uncertainty of machine recognition could undermine the positive effects of user control and interface predictability.

The efficacy of human-human speech interaction is wrapped tightly with prosody. We listen to radio or TV news in part because we become accustomed to the emotional level of our favorite announcer, such as the classic case of Walter Cronkite. Many people came to know his customary tone: sharp for breaking news, somber for tragedies, perfunctory for the stock market report. This subtle nuance of his vocal tone enriched our understanding of the news, especially his obvious grief reporting John F. Kennedy's death in 1963 and excitement at the first moon landing in 1969.

People learn about each other through continuing relationships and attach meaning to deviations from past experiences. Friendship and trust are built by repeated experiences of shared emotional states, empathic responses, and appropriate assistance. Going with a friend to the doctor demonstrates commitment and builds a relationship. A supportive tone in helping to ask a doctor the right questions and dealing with bad news together are possible due to shared histories and common bodily experiences. Human emotional expression is so varied (across individuals), nuanced (subtly combining anger, frustration, impatience, and more), and situated (contextually influenced in uncountable ways) that accurate simulation or recognition of emotional states is usually impractical.

For routine tasks with limited vocabulary and constrained semantics, such as order entry and bank transfers, the absence of prosody enables limited successes, though visual alternatives may be more effective. Although stock market information and some trading is done today via voice activation, the visual approaches have attracted at least 10 times as many users. For such emotionally charged and highly varying tasks as medical consultations or emergency-response teamwork, the critical role of prosody makes it difficult to provide effective speech recognition.
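To make "limited vocabulary and constrained semantics" concrete, the following minimal Python sketch shows the kind of rigid grammar a bank-transfer prompt might accept. The sentence pattern, account names, number words, and the parse_transfer helper are invented for illustration; they are not taken from the article or from any real telephony product.

import re

# Accounts the prompt is willing to hear (hypothetical).
ACCOUNTS = {"checking", "savings"}

# Spoken number words the prompt understands (hypothetical and deliberately tiny).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "ten": 10, "twenty": 20, "fifty": 50, "hundred": 100,
}

def parse_transfer(utterance: str):
    """Accept only 'transfer <amount> dollars from <account> to <account>'.

    Returns a dict describing the request, or None when the utterance falls
    outside the grammar, in which case a real system would simply re-prompt.
    """
    pattern = r"transfer (?P<amount>[a-z ]+?) dollars? from (?P<src>\w+) to (?P<dst>\w+)"
    match = re.fullmatch(pattern, utterance.strip().lower())
    if match is None:
        return None

    # The amount must be spelled entirely with known number words.
    words = match.group("amount").split()
    if not words or any(w not in NUMBER_WORDS for w in words):
        return None
    amount = sum(NUMBER_WORDS[w] for w in words)  # crude: "twenty five" -> 25

    src, dst = match.group("src"), match.group("dst")
    if src not in ACCOUNTS or dst not in ACCOUNTS or src == dst:
        return None
    return {"action": "transfer", "amount": amount, "from": src, "to": dst}

if __name__ == "__main__":
    print(parse_transfer("Transfer fifty dollars from checking to savings"))
    # {'action': 'transfer', 'amount': 50, 'from': 'checking', 'to': 'savings'}
    print(parse_transfer("Could you move some money over for me?"))
    # None: outside the grammar, so the system re-prompts instead of guessing.

The point is not the parsing technique but the contrast: with no open-ended vocabulary and no reliance on prosody, recognition errors stay contained, whereas the emotionally charged tasks mentioned above resist this kind of constraint.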

In summary, the number of speech interaction success stories is increasing slowly; designers should conduct empirical studies to understand the reasons for their success, as well as their limitations, and their alternatives. A particular concern for everyone on the road today is the plan by several manufacturers to introduce email handling via speech recognition for automobile drivers, when there is already convincing evidence of higher accident rates for cell phone users.

Realistic goals for speech-based human-computer interaction, better human multitasking models, and an understanding of how human-computer interaction is different from human-human interaction would be helpful. Speech systems founder when designers attempt to model or recognize complex human behaviors. Comforting bedside manner, trusted friendships, and inspirational leadership are components of human-human relationships not amenable to building into machines.

On the positive side, I expect speech messaging, alerts, and input-output for blind or motor-impaired users to grow in popularity. Dictation designers will find useful niches, especially for routine tasks. There will be happy speech-recognition users, such as those who wish to quickly record some ideas for later review and keyboard refinement. Telephone-based speech-recognition applications, such as voice dialing, directory search, banking, and airline reservations, may be useful complements to graphical user interfaces. But for many tasks, I see more rapid growth of reliable high-speed visual interaction over the Web as a likely scenario. Similarly, for many physical devices, carefully engineered control sticks and switches will be effective while preserving speech for human-human interaction and keeping rooms pleasantly quiet.

References
1. Danis, C., Comerford, L., Janke, E., Davies, K., DeVries, J., and Bertran, A. StoryWriter: A speech oriented editor. In Proceedings of CHI '94: Human Factors in Computing Systems: Conference Companion (Boston, Apr. 24-28). ACM Press, New York, 1994, 277-278.
2. Halverson, C., Horn, D., Karat, C., and Karat, J. The beauty of errors: Patterns of error correction in desktop speech systems. In Proceedings of INTERACT '99 (Edinburgh, Scotland, Aug. 30-Sept. 3). IOS Press, Amsterdam, 1999, 133-140.
3. Karat, C., Halverson, C., Horn, D., and Karat, J. Patterns of entry and correction in large vocabulary continuous speech recognition systems. In Proceedings of CHI '99: Human Factors in Computing Systems (Pittsburgh, May 15-20). ACM Press, New York, 1999, 568-575.
4. Karl, L., Pettey, M., and Shneiderman, B. Speech versus mouse commands for word processing applications: An empirical evaluation. Int. J. Man-Mach. Stud. 39, 4 (1993), 667-687.

Ben Shneiderman ([email protected]) is a professor in the Department of Computer Science, founding director of the Human-Computer Interaction Laboratory, and member of the Institutes for Advanced Computer Studies and for Systems Research at the University of Maryland in College Park.

© 2000 ACM 0002-0782/00/0900 $5.00
