ECE 259—Lecture 19
Techniques for Speech and Natural Language Recognition
Speech Recognition—2001
The Speech Dialog Circle
[Diagram: ASR converts speech to the words spoken ("I dialed a wrong number"); SLU maps words to meaning ("Billing credit"); DM decides what is next ("Determine correct number"), drawing on data and rules; SLG generates the words of the voice reply to the customer; TTS synthesizes them as speech ("What number did you want to call?").]
Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.
• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.
Milestones in Speech and Multimodal Technology Research
[Timeline, 1962–2002:]
• Small vocabulary, acoustic-phonetics based: isolated words; filter-bank analysis, time normalization, dynamic programming.
• Medium vocabulary, template based: isolated words, connected digits, continuous speech; pattern recognition, LPC analysis, clustering algorithms, level building.
• Large vocabulary, statistical based: connected words, continuous speech; hidden Markov models, stochastic language modeling.
• Large vocabulary; syntax, semantics: continuous speech, speech understanding; stochastic language understanding, finite-state machines, statistical learning.
• Very large vocabulary; semantics, multimodal dialog, TTS: spoken dialog, multiple modalities; concatenative synthesis, machine learning, mixed-initiative dialog.
Basic ASR Formulation
The basic equation of Bayes-rule-based speech recognition is

  Ŵ = argmax_W P(W|X)
    = argmax_W P(W) P(X|W) / P(X)
    = argmax_W P(W) P(X|W)

where X = X1, X2, …, XN is the acoustic observation (feature vector) sequence, W = w1 w2 … wM is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.
Speech Recognition Process
[Diagram: input speech s(n) → Feature Analysis (spectral analysis) → feature vectors Xn → Pattern Classification (decoding, search), which draws on the Acoustic Model (HMM), Word Lexicon, and Language Model (N-gram) → Confidence Scoring (utterance verification) → recognized words Ŵ, e.g., "Hello World" with word confidences (0.9) (0.8); ASR shown within the dialog circle (SLU, DM, SLG, TTS).]
Speech Recognition Processes
• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
  – Text training data set => word lexicon, word grammar (language model), task grammar
  – Speech training data set => acoustic models
• Training algorithm => build models from training set of text and speech
• Testing algorithm => evaluate performance from testing set of speech
• Evaluate performance
  – Speech testing data set
Feature Extraction
Goal: Extract robust features (information) from the speech signal that are relevant for ASR.
Method: Spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
Result: Signal compression, where for each window of speech samples roughly 30 cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Which feature set to use for recognition—cepstral features or those from a high-dimensionality space?
[Block diagram: Feature Extraction → Pattern Classification, supported by the Acoustic Model, Language Model and Word Lexicon, followed by Confidence Scoring.]
What Features to Use?
Short-time spectral analysis:
• Acoustic features: cepstrum (LPC, filterbank, wavelets); formant frequencies, pitch, prosody; zero-crossing rate, energy
• Acoustic-phonetic features: manner of articulation (e.g., stop, nasal, voiced); place of articulation (labial, dental, velar)
• Articulatory features: tongue position, jaw, lips, velum
• Auditory features: ensemble interval histogram (EIH), synchrony
Temporal analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
Feature Extraction Process
[Pipeline: s(t) → Sampling and Quantization → s(n) → Noise Removal / Normalization → Segmentation (blocking; parameters M, N) → s_m(n) → Windowing W(n) → Preemphasis → Spectral Analysis (also yielding pitch, formants, energy, zero-crossing) → Cepstral Analysis → c_m(l) → Filtering → Temporal Derivative (delta cepstrum, delta² cepstrum) → Equalization (bias removal or normalization) → final features.]
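The front half of this pipeline (blocking, windowing, spectral analysis, log compression) can be sketched as follows; the frame size, hop, sampling rate, and the crude bin-averaging used in place of a real mel filterbank are all assumptions for illustration:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # 25 ms frames with a 10 ms hop, assuming a 16 kHz sampling rate
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def log_spectral_features(x, n_fft=512, n_bins=30):
    frames = frame_signal(x) * np.hamming(400)       # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # short-time power spectrum
    # crude stand-in for a filterbank: average contiguous groups of FFT bins
    edges = np.linspace(0, power.shape[1], n_bins + 1, dtype=int)
    feats = np.stack([power[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    return np.log(feats + 1e-10)                     # log compression
```

One second of 16 kHz audio yields 98 frames of 30 log-spectral features each, matching the "roughly 30 features per window" figure above.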
[Key challenges: Robustness, Unlimited Vocabulary, Rejection.]
Methods for Robust Speech Recognition
Problem: A mismatch in the speech signal between the training phase and testing phase can result in performance degradation.
Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
[Diagram: Training — signal → features → model; Testing — signal (enhancement) → features (normalization) → model (adaptation).]
Perception approach: Extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
Noise and Channel Distortion
[Diagram: speech s(t) plus noise n(t) passes through a channel h(t), yielding distorted speech y(t); Fourier-transform spectra are shown with SNRs ranging from roughly 50 dB down to 5 dB.]

  y(t) = [s(t) + n(t)] * h(t)
Speaker Variations
• One source of acoustic variation among speakers is their different vocal tract lengths (which vary from roughly 15 to 20 cm).
• An increase in vocal tract length is inversely proportional to the frequency content of the resulting acoustic signal.
• By warping the frequency content of a signal, one can effectively normalize for variations in vocal tract length.

Maximum likelihood speaker normalization chooses the warp factor α that maximizes the likelihood of the warped features:

  α̂ = argmax_α Pr(X_α | W)
Acoustic Model
Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
Hidden Markov Model (HMM): A statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone; HMMs are also assigned to model extraneous events.
Advantages: A powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM—are there data-driven models that work better for some or all vocabularies?
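To make the HMM idea concrete, here is a minimal log-domain forward algorithm that scores an observation sequence against one HMM; the tiny uniform two-state model in the usage note is invented purely to check the sum-over-paths arithmetic:

```python
import math

def logsumexp(xs):
    # numerically stable log(sum(exp(x))) for log-domain probabilities
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

def forward(obs_logp, log_A, log_pi):
    """Log-likelihood of an observation sequence under an HMM.

    obs_logp[t][j] = log b_j(o_t), log_A[i][j] = log transition prob i->j,
    log_pi[j] = log initial prob of state j.
    """
    n = len(log_pi)
    alpha = [log_pi[j] + obs_logp[0][j] for j in range(n)]
    for t in range(1, len(obs_logp)):
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)]) + obs_logp[t][j]
                 for j in range(n)]
    return logsumexp(alpha)
```

With two states and all probabilities 0.5 over three frames, the 8 state paths each contribute 0.5^6, so the total likelihood is 8/64 = 0.125.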
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules. For example,

  David: /d/ /ey/ /v/ /ih/ /d/

Multiple pronunciations: Several words may have multiple pronunciations. For example,

  Data: /d/ /ae/ /t/ /ax/
  Data: /d/ /ey/ /t/ /ax/

Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?
The Lexicon
• one entry per word, with at least 100K words needed for dictation (find: /f/ /ay/ /n/ /d/)
• either done by hand or with letter-to-sound (LTS) rules; rules can be automatically trained with decision trees (CART), giving less than 8% errors, but proper nouns are hard!
• multiple pronunciations (ta-mey-to vs. to-mah-to)
[Figure: probabilistic pronunciation network for "tomato" over the phones /t/, /ax/, /ow/, /m/, /ey/, /aa/, /dx/, with arc probabilities such as 0.2/0.8, 0.5/0.5 and 0.3/0.7 on the alternative paths.]
Language Model
Goal: Model "acceptable" spoken phrases, constrained by task syntax.
Rule-based: Deterministic grammars that are knowledge driven. For example,
  flying from $city to $city on $date
Statistical: Compute estimates of word probabilities (N-gram, class-based, CFG). For example,
  flying from Newark to Boston tomorrow
Challenges: How do you build a language model rapidly for a new task?
Formal Grammars
[Parse tree for "Mary loves that person": S → NP VP; NP → NAME → Mary; VP → V NP; V → loves; NP → ADJ NP1 → that person.]

Rewrite rules:
 1. S → NP VP
 2. VP → V NP
 3. VP → AUX VP
 4. NP → ART NP1
 5. NP → ADJ NP1
 6. NP1 → ADJ NP1
 7. NP1 → N
 8. NP → NAME
 9. NP → PRON
10. NAME → Mary
11. V → loves
12. ADJ → that
13. N → person
N-Grams
• Trigram estimation:

  P(W) = P(w1, w2, …, wN)
       = P(w1) P(w2|w1) P(w3|w1, w2) … P(wN|w1, …, wN-1)
       ≈ ∏_{i=1}^{N} P(wi | wi-2, wi-1)

  P(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
Example
• training data: "John read her book", "I read a different book", "John read a book by Mulan"
• bigram estimates (with <s> and </s> marking sentence boundaries):

  P(John|<s>) = C(<s>, John) / C(<s>) = 2/3
  P(read|John) = C(John, read) / C(John) = 2/2
  P(a|read) = C(read, a) / C(read) = 2/3
  P(book|a) = C(a, book) / C(a) = 1/2
  P(</s>|book) = C(book, </s>) / C(book) = 2/3

  P(John read a book)
    = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) ≈ 0.148

• but we have a problem here:

  P(Mulan read a book)
    = P(Mulan|<s>) P(read|Mulan) P(a|read) P(book|a) P(</s>|book) = 0
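The worked example above can be reproduced with a few lines of counting; <s> and </s> are the assumed sentence-boundary markers:

```python
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

histories, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    histories.update(toks[:-1])               # count each token as a history
    bigrams.update(zip(toks[:-1], toks[1:]))  # count adjacent word pairs

def sent_prob(sent):
    toks = ["<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for h, w in zip(toks[:-1], toks[1:]):
        prob *= bigrams[(h, w)] / histories[h] if histories[h] else 0.0
    return prob
```

sent_prob("John read a book") recovers 8/54 ≈ 0.148, while the unseen bigram (<s>, Mulan) drives P(Mulan read a book) to zero — exactly the sparseness problem the next slide addresses.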
N-Gram Smoothing
• data sparseness: in millions of words, more than 50% of trigrams occur only once
• we can't assign P(wi|wi-1, wi-2) = 0
• solution: assign a non-zero probability to each unseen n-gram by lowering the probability mass of seen n-grams, e.g., by interpolating with the lower-order model:

  P_smooth(wi | wi-n+1 … wi-1) = λ P_ML(wi | wi-n+1 … wi-1) + (1 − λ) P_smooth(wi | wi-n+2 … wi-1)
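A minimal sketch of the interpolation idea for bigrams (Jelinek-Mercer style), with an invented eight-word corpus and an assumed λ = 0.8:

```python
from collections import Counter

tokens = "john read her book john read a book".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_smooth(w, h, lam=0.8):
    # interpolate the ML bigram with the lower-order (unigram) model
    p_bi = bi[(h, w)] / uni[h] if uni[h] else 0.0
    p_uni = uni[w] / N
    return lam * p_bi + (1 - lam) * p_uni
```

Unseen bigrams such as ("book", "her") now receive a small non-zero probability from the unigram term, at the cost of shaving mass off seen bigrams.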
Perplexity
• cross-entropy of a language model P on a word sequence W is

  H(W) = −(1/N) log2 P(W)

• and its perplexity is

  PP(W) = 2^{H(W)}

• measures the complexity of the language model (geometric mean of the branching factor).
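In code, given the per-word probabilities the model assigns, the two formulas are a few lines; a uniform model over 10 digits gives PP = 10, mirroring the TIDIGITS figure on the next slide:

```python
import math

def perplexity(word_probs):
    # word_probs[i] = P(w_i | history) assigned by the language model
    N = len(word_probs)
    H = -sum(math.log2(p) for p in word_probs) / N  # cross-entropy, bits/word
    return 2 ** H
```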
Perplexity
• the digit recognition task (TIDIGITS) has 10 words, PP = 10, and a 0.2% error rate
• the Airline Travel Information System (ATIS) has 2,000 words and PP = 20
• the Wall Street Journal task has 5,000 words, PP = 130 with a bigram, and a 5% error rate
• in general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP = 7 and a 5% error rate.
Bigram Perplexity
[Chart: bigram perplexity (roughly 100–500) vs. vocabulary size (10k–60k). Trained on 500 million words and tested on the Encarta Encyclopedia.]
Out-Of-Vocabulary (OOV) Rate
[Chart: OOV rate (roughly 0–25%) vs. vocabulary size (10k–60k), measured on the Encarta Encyclopedia; trained on 500 million words.]
WSJ Results

  Model               Perplexity   Word Error Rate
  Unigram Katz          1196.45        14.85%
  Unigram Kneser-Ney    1199.59        14.86%
  Bigram Katz            176.31        11.38%
  Bigram Kneser-Ney      176.11        11.34%
  Trigram Katz            95.19         9.69%
  Trigram Kneser-Ney      91.47         9.60%

• Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task.
• Unigrams, bigrams and trigrams were trained from 260 million words.
• Smoothing mechanisms: Katz and Kneser-Ney.
Pattern Classification
Goal: Combine information (probabilities) from the acoustic model, language model and word lexicon to generate an "optimal" word sequence (highest probability).
Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: How do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network down to 10^8 states—14 orders of magnitude more efficient
What is the theoretical limit of efficiency that can be achieved?
Unlimited Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:

  Features → HMM states → HMM units → Phones → Words → Sentences

• For the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods, including fast match, multi-pass decoding, the A* stack and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.
Weighted Finite State Transducers (WFST)
• a unified mathematical framework for ASR
• efficiency in time and space
[Diagram: component WFSTs — State:HMM, HMM:Phone, Phone:Word, Word:Phrase — are combined and optimized into a single search network.]
WFSTs can compile the network down to 10^8 states—14 orders of magnitude more efficient.
Weighted Finite State Transducer
Word pronunciation transducer for "data":

  d:ε/1 → { ey:ε/0.4 | ae:ε/0.6 } → { dx:ε/0.8 | t:ε/0.2 } → ax:"data"/1
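Enumerating the paths of such a transducer shows how arc weights multiply into pronunciation probabilities; the stage list below hard-codes the arcs from the figure:

```python
from itertools import product

# Arcs of the "data" pronunciation transducer, one list of (phone, weight)
# alternatives per position.
stages = [[("d", 1.0)],
          [("ey", 0.4), ("ae", 0.6)],
          [("dx", 0.8), ("t", 0.2)],
          [("ax", 1.0)]]

pronunciations = []
for path in product(*stages):          # every path through the lattice
    prob = 1.0
    for _, w in path:
        prob *= w                      # path weight = product of arc weights
    pronunciations.append((" ".join(p for p, _ in path), prob))
```

The four pronunciations' probabilities sum to 1, as they must for a properly normalized pronunciation model.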
Algorithmic Speed-up for Speech Recognition
[Chart: relative speed (roughly 0–30x) vs. year (1994–2002), comparing AT&T algorithmic speed-ups against Moore's Law (hardware), for the community and AT&T. North American Business task—vocabulary: 40,000 words, branching factor: 85.]
Confidence Scoring
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:

  Label:      credit please
  Recognized: credit fees
  Confidence: (0.9)  (0.3)

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
The word posterior serves as a confidence score:

  P(W|X) = P(W) P(X|W) / Σ_{W'} P(W') P(X|W')
Rejection
Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.
Effect: Rejection improves the performance of the recognizer, understanding system and dialog manager.
Measure of confidence: Associate word strings with a verification cost that provides an effective measure of confidence (utterance verification).
State-of-the-Art Performance
[Diagram: input speech → Feature Extraction → Pattern Classification (decoding, search), using the Acoustic Model, Language Model and Word Lexicon → Confidence Scoring → recognized sentence; ASR shown within the dialog circle (SLU, DM, SLG, TTS).]
How to Evaluate Performance?
• Dictation applications: insertions, substitutions and deletions

  Word Error Rate = 100% × (#Subs + #Dels + #Ins) / (no. of words in the correct sentence)

• Command-and-control: false rejection and false acceptance => ROC curves.
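The three error counts come from a minimum-edit-distance alignment between the reference and the hypothesis; a compact word-level implementation:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: min substitutions + deletions + insertions aligning r[:i] to h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For the earlier confidence-scoring example, wer("credit please", "credit fees") is 50%: one substitution in a two-word reference.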
Word Error Rates

  CORPUS                                     TYPE                      VOCABULARY SIZE      WORD ERROR RATE
  Connected Digit Strings (TI Database)      Spontaneous               11 (zero–nine, oh)   0.3%
  Connected Digit Strings (Mall Recordings)  Spontaneous               11 (zero–nine, oh)   2.0%
  Connected Digit Strings (HMIHY)            Conversational            11 (zero–nine, oh)   5.0%
  RM (Resource Management)                   Read Speech               1,000                2.0%
  ATIS (Airline Travel Information System)   Spontaneous               2,500                2.5%
  NAB (North American Business)              Read Text                 64,000               6.6%
  Broadcast News                             News Show                 210,000              13–17%
  Switchboard                                Conversational Telephone  45,000               25–29%
  Call Home                                  Conversational Telephone  28,000               40%

Note the factor-of-17 increase in digit error rate from read (0.3%) to conversational (5.0%) speech.
NIST Benchmark Performance
[Chart: word error rate by year for the Resource Management, ATIS and NAB benchmarks. NAB task—vocabulary: 40,000 words, branching factor: 85.]
North American Business

Algorithmic Accuracy for Speech Recognition
[Chart: word accuracy (roughly 30–80%) vs. year (1996–2002) on the Switchboard/Call Home task—vocabulary: 40,000 words, perplexity: 85.]
Growth in Effective Recognition Vocabulary Size
[Chart: vocabulary size on a log scale (1 to 10,000,000 words) vs. year, 1960–2010.]
Human Speech Recognition vs ASR
[Chart: machine error (%) vs. human error (%) on log–log axes for Digits, RM-null, RM-LM, NAB-mic, NAB-omni, WSJ, WSJ-22dB and SWBD, with x1, x10 and x100 reference lines; the region below the x1 line, where machines would outperform humans, is empty.]
Challenges in ASR
System performance: accuracy; efficiency (speed, memory); robustness
Operational performance: end-point detection; user barge-in; utterance rejection; confidence scoring
Machines are 10–100 times less accurate than humans.
Large-Vocabulary ASR Demo
Courtesy of Murat Saraclar
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR (speech → words), SLU (words → meaning), DM (meaning → action), SLG (action → words), TTS (words → speech) — supported by data and rules.]
Spoken Language Understanding (SLU)
• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
  – accurate understanding can often be achieved without correctly recognizing every word
  – SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning
• Performance evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form-filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What goes beyond simple classification systems but below full natural-language voice dialog systems?
SLU Formulation
• let W be a sequence of words and C be its underlying meaning (conceptual structure); then, using Bayes rule, we get

  P(C|W) = P(W|C) P(C) / P(W)

  C* = argmax_C P(W|C) P(C)

• finding the best conceptual structure can be done by parsing and ranking using a combination of acoustic, linguistic and semantic scores.
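A toy instance of C* = argmax_C P(W|C) P(C) is a naive-Bayes concept classifier; the three labeled utterances below are invented training data:

```python
import math
from collections import Counter, defaultdict

# hypothetical (concept, utterance) training pairs
train = [("billing credit", "I dialed a wrong number"),
         ("billing credit", "I was charged for a wrong number"),
         ("rate plan", "what calling plans do you offer")]

word_counts = defaultdict(Counter)
concept_counts = Counter()
vocab = set()
for concept, utt in train:
    concept_counts[concept] += 1
    word_counts[concept].update(utt.split())
    vocab.update(utt.split())

def classify(utt):
    def score(c):                        # log P(C) + sum_w log P(w|C)
        total = sum(word_counts[c].values())
        s = math.log(concept_counts[c] / len(train))
        for w in utt.split():            # add-one smoothing over the vocabulary
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(concept_counts, key=score)
```

Salient words like "wrong number" pull an utterance toward the billing-credit concept even when other words are misrecognized.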
Knowledge Sources for Speech Understanding
[Diagram: ASR draws on acoustic/phonetic knowledge (Acoustic Model: relationship of speech sounds and English phonemes) and phonotactic knowledge (Word Lexicon: rules for phoneme sequences and pronunciation); SLU draws on syntactic knowledge (Language Model: structure of words and phrases in a sentence) and semantic knowledge (Understanding Model: relationships and meanings among words); DM draws on pragmatic knowledge (Dialog Manager: discourse, interaction history, world knowledge).]
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialog systems
• Travel task involving airline, hotel and car information and reservations

  "Yeah I uh I would like to go from New York to Boston tomorrow night with United"

• SLU output (concept decoding):
  Topic: Itinerary
  Origin: New York
  Destination: Boston
  Day of the week: Sunday
  Date: May 25th, 2002
  Time: >6pm
  Airline: United

• XML schema:
  <itinerary>
    <origin><city></city><state></state></origin>
    <destination><city></city><state></state></destination>
    <date></date>
    <time></time>
    <airline></airline>
  </itinerary>
Customer Care IVR and HMIHY
[Diagram comparing call flows:]
• Customer Care IVR: sparkle tone ("Thank you for calling AT&T…") → network menu → LEC misdirect announcement → account verification routine → main menu → LD sub-menu → reverse directory routine, with individual steps of 10–58 seconds. Total time to get to reverse directory lookup: 2:55 minutes!
• HMIHY: sparkle tone ("AT&T, How may I help you?") → account verification routine → reverse directory, with steps of 20 and 8 seconds. Total time to get to reverse directory lookup: 28 seconds!
HMIHY—How Does It Work
• Prompt is "AT&T. How may I help you?"
• User responds with totally unconstrained, fluent speech
• System recognizes the words and determines the meaning of the user's speech, then routes the call
• Dialog technology enables task completion
[Diagram: HMIHY routes calls to Account Balance, Calling Plans, Local Service, Unrecognized Number, …]
HMIHY Example Dialogs
• Irate Customer• Rate Plan• Account Balance• Local Service• Unrecognized Number• Threshold Billing• Billing Credit
Customer Satisfaction
• decreased repeat calls (37%)
• decreased ‘OUTPIC’ rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)
Customer Care Scenario
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR, SLU, DM, SLG, TTS — supported by data and rules.]
Dialog Management (DM)
• Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
  – DM makes viable complex services that require multiple exchanges between the system and the customer
  – dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance evaluation: Speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: Is there a science of dialogs—how do you keep a dialog efficient (turns, time, progress towards a goal), and how do you attain goals (get answers)? Is there an art of dialogs? How does the user interface play into the art/science of dialogs—sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak, which motivates multimodal interactions with machines.
Dialog Management (DM)
• the DM can be considered a model that is programmed from rules and/or data to accomplish desired tasks
• the goal of a DM is to manage elaborate exchanges with the user, providing automated access to information and services
Computational Models for DM
• Structural-based approach
  – assumes that there exists a regular structure in dialog that can be modeled as a state transition network or dialog grammars
  – dialog strategies need to be predefined, thus limiting flexibility
  – several real-time deployed systems exist today which have inference engines and knowledge representation
• Plan-based approach
  – considers communication as "acting": dialog acts and actions are oriented toward goal achievement
  – motivated by human/human interaction, in which humans generally have goals and plans when interacting with others
  – aims to account for general models and theories of discourse
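A structural (state-transition) dialog manager can be sketched as a table of states, prompts, and transitions; the states and prompts here are hypothetical:

```python
# state -> (system prompt, {interpreted user input type -> next state})
DIALOG = {
    "get_departure": ("Please say your departure city.", {"city": "get_arrival"}),
    "get_arrival":   ("Please say your arrival city.",   {"city": "confirm"}),
    "confirm":       ("Shall I book this trip?",         {"yes": "done",
                                                          "no": "get_departure"}),
}

def next_state(state, user_input_type):
    _, transitions = DIALOG[state]
    # unexpected input leaves the dialog in place (a simple reprompt strategy)
    return transitions.get(user_input_type, state)
```

The rigidity is visible in the table itself: every strategy (reprompt, confirm, recover) must be wired in ahead of time, which is exactly the limitation noted above.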
Generic Spoken Dialog System
[Diagram: the Dialog Manager coordinates Context Tracking, Dialog Strategy Modules, SL Understanding and Response Generation; it connects to the ASR module, TTS/player, Resource Manager, Databases and the access device interface (telephone, IP).]
Dialog Strategy
• Confirmation: used to ascertain correctness of the recognized utterance or utterances
• Error recovery: used to get the dialog back on track after a user indicates that the system has misunderstood something
• Reprompting: used when the system expected input but did not receive any
• Completion: used to elicit missing input information from the user
• Constraining: used to reduce the scope of the request so that a reasonable amount of information is retrieved, presented to the user, or otherwise acted upon
• Relaxation: used to increase the scope of the request when no information has been retrieved
• Disambiguation: used to resolve inconsistent input from the user
• Greeting/closing: used to maintain social protocol at the beginning and end of an interaction
Building Good Speech-Based Applications
• Good user interfaces—make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
• Good models of dialog—keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
• Matching the task to the technology—be realistic about the capabilities of the technology
  – fail-soft methods
  – error-correcting techniques (data dips)
Mixed-Initiative Dialog
Who manages the dialog?
• System initiative: "Please say collect, calling card, third number."
• User initiative: "How may I help you?"
Example: Mixed-Initiative Dialog
• System Initiative (long dialogs, but easier to design)
  System: Please say just your departure city.
  User: Chicago
  System: Please say just your arrival city.
  User: Newark
• Mixed Initiative
  System: Please say your departure city.
  User: I need to travel from Chicago to Newark tomorrow.
• User Initiative (shorter dialogs and a better user experience, but more difficult to design)
  System: How may I help you?
  User: I need to travel from Chicago to Newark tomorrow.
Dialog Evaluation
Why?
• mining problematic situations (task failure)
• minimize user hang-ups and routing to an operator
• improve user experience
• perform data analysis and monitoring
• conduct data mining for sales and marketing
How?
• log exchange features from ASR, SLU, DM, ANI
• define and automatically optimize a task objective function
• compute service statistics (instant and delayed)
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR, SLU, DM, SLG, TTS — supported by data.]
Spoken Language Generation (SLG)
• translate the action of the dialog manager into a textual representation
  – template-based SLG
  – stochastic-based SLG
Template-Based SLG
• Language generation is achieved by manipulating text strings using
  – concatenation of strings, and
  – substitution of substrings (slots) by other strings
• Number of templates grows exponentially
  – no syntactic generalization; different templates for singular and plural versions
• Not much variation in output
  – no text planning; simple concatenation of strings
• Maintainability
  – expensive and complex to assemble and change with a large number of templates
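Template-based generation is essentially slot substitution; the intent name, template, and slot values below are invented:

```python
# hypothetical template store: one string per communicative intent
TEMPLATES = {
    "confirm_itinerary": "Flying from {origin} to {destination} on {date}.",
}

def generate(intent, **slots):
    # concatenation + substring (slot) substitution, nothing more
    return TEMPLATES[intent].format(**slots)
```

generate("confirm_itinerary", origin="Newark", destination="Boston", date="May 25th") yields the filled sentence; note there is no syntactic generalization, so plural or reordered variants each need their own template.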
Stochastic-Based SLG
• Text planning
  – transform communicative goals into a sequence of elementary communicative goals
• Sentence planning
  – choose linguistic resources to express the communicative goals
• Surface realization
  – produce surface word order according to the grammar of the language
  – combine linguistic information with statistical models derived from annotated corpora
Automatic Creation of Services from Web Sites – WebTalk
• Goal: constructing dialog applications with zero human intervention by automatically mining information from the client's web site
  – possess as much knowledge as the website does
  – synchronize with updates on the website with zero effort
  – synchronization of telephone-based and Internet-based services
• Challenges:
  – robust website understanding
  – a webpage-based language model that reflects human-machine interaction
  – a web-data-driven dialog manager
  – spoken language understanding that requires no hand-written rules or annotation
  – language generation that is grammatically correct without rules
W: Hi, this is AT&T local service. How may I help you?
U: Hi
W: Hello, what is your question?
U: I need help please
W: Let me give you an idea of what you can say. You can ask for solutions, LNP, products services, local solutions and local voice service. How may I help you?
U: What's LNP?
W: LNP stands for local number portability.
U: Thanks
W: You're welcome. Anything else?
U: What products and services do you provide?
W: Let me give you more information about our products, including centrex, AT&T primeplex, AT&T primecodes, T1 access, AT&T prime digital trunk, …
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR (speech → words spoken), SLU (words → meaning), DM (meaning → action), SLG (action → words to be synthesized), TTS (words → speech) — supported by data and rules.]
Building Spoken Dialog Systems: Rule-Based or Data-Driven

                   Rule-based                     Data-driven
  Task complexity  Simple (e.g., name dialing)    Complex (e.g., customer care)
  Input speech     Read/fluent                    Spontaneous/conversational
  Resources        No data collection             Extensive data collection
  Maintenance      Harder for complex apps        Straightforward if sufficient data is available
  Characteristic   Brittle                        Robust
User Interface
• Good user interfaces
  – make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
  – effective UI design is based on a set of elementary principles: common widgets, sequenced screen presentations, simple error-trap dialogs, and a user manual
Correction UI
• N-best list
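One way to realize an N-best correction UI: when the recognizer is uncertain, offer the ranked hypotheses one at a time and let the user accept or reject each. The sketch below is a minimal, hypothetical version; function and variable names are invented here, and a real UI would present the list visually or by voice.

```python
# Hypothetical sketch of N-best-list correction.

def nbest_correction(nbest, confirm):
    """nbest: list of (hypothesis, score) pairs from the recognizer.
    confirm: callback returning True when the user accepts a hypothesis."""
    ranked = sorted(nbest, key=lambda hs: hs[1], reverse=True)
    for hypothesis, _score in ranked:
        if confirm(hypothesis):
            return hypothesis
    return None  # nothing accepted: fall back to re-prompting the user

# Example: the second-best hypothesis is what the user actually said.
nbest = [("call Jim Brown", 0.62), ("call Tim Brown", 0.31), ("call him now", 0.07)]
result = nbest_correction(nbest, confirm=lambda h: h == "call Tim Brown")
print(result)  # -> call Tim Brown
```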
Multimodal System Technology Components

[Diagram: the same dialog circle as above – ASR, SLU, DM, SLG, and TTS connected through shared data and rules – extended with visual, pen, and gesture inputs alongside speech.]
Multimodal Experience
• Users need access to automated dialog applications at any time and anywhere, using any access device
• Selection of the "most appropriate" UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
Multimodal Architecture

• Parsing (language model): what the user might do (say) – P(F | x, Sn-1)
• Understanding (semantic model): what that means – P(Sn | F, Sn-1)
• Rendering (behavior model): what to show (say) to the user – P(A | Sn)

where x = multimodal inputs, F = surface semantics, Sn = discourse semantics, and A = multimedia outputs; application logic connects understanding to rendering.
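The three conditional models can be chained into a turn-taking pipeline. The sketch below is a toy instantiation: the model tables, state names, and events (zoom_request, map_zoomed, and so on) are invented for illustration, and each stage simply takes the argmax of its distribution; a real system would estimate these distributions from data.

```python
# Toy instantiation of the P(F|x,S), P(S'|F,S), P(A|S') decomposition.
# Model tables and state names are assumptions, not from the lecture.

language_model = {  # P(F | x, Sn-1): inputs + prior state -> surface semantics
    (("speech:'zoom here'", "pen:circle"), "map_shown"):
        {"zoom_request": 0.9, "pan_request": 0.1},
}
semantic_model = {  # P(Sn | F, Sn-1): surface semantics -> discourse state
    ("zoom_request", "map_shown"): {"map_zoomed": 0.95, "map_shown": 0.05},
}
behavior_model = {  # P(A | Sn): discourse state -> multimedia output
    "map_zoomed": {"render_zoomed_map": 0.99, "ask_clarification": 0.01},
}

def turn(x, s_prev):
    """One multimodal turn: pick the most probable F, Sn, and A in sequence."""
    F = max(language_model[(x, s_prev)], key=language_model[(x, s_prev)].get)
    s_n = max(semantic_model[(F, s_prev)], key=semantic_model[(F, s_prev)].get)
    a = max(behavior_model[s_n], key=behavior_model[s_n].get)
    return F, s_n, a

F, s_n, a = turn(("speech:'zoom here'", "pen:circle"), "map_shown")
print(F, s_n, a)  # -> zoom_request map_zoomed render_zoomed_map
```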
Microsoft MiPad
• Multimodal Interactive Pad
• Usability studies show double the throughput for English
• Speech is most useful in cases with many alternatives

MiPad Video
AT&T MATCH: Multimodal Access To City Help

"Are there any cheap Italian places in this neighborhood?"

• Access to information through a voice interface, gesture, and GPS
• Multimodal integration: combination of speech, gesture, and meaning using finite-state technology
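The idea behind MATCH-style integration can be illustrated with a much-simplified sketch: deictic words in the speech stream ("this", "here") are unified with gesture referents, loosely imitating what MATCH does with finite-state transducer composition over speech and gesture lattices. All names here are illustrative, not the actual MATCH implementation.

```python
# Simplified stand-in for finite-state multimodal integration:
# align deictic speech tokens with gesture referents, in order.

def integrate(speech_tokens, gesture_referents):
    """Replace each deictic token with the next gesture referent."""
    deictics = {"this", "here", "there"}
    refs = iter(gesture_referents)
    meaning = []
    for tok in speech_tokens:
        if tok in deictics:
            meaning.append(next(refs))  # consume one gesture per deictic
        else:
            meaning.append(tok)
    return meaning

speech = "cheap italian places in this neighborhood".split()
gestures = ["area:chelsea"]  # e.g., a circling gesture drawn on the map
print(integrate(speech, gestures))
# -> ['cheap', 'italian', 'places', 'in', 'area:chelsea', 'neighborhood']
```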
MATCH Video
Courtesy of Michael Johnston
The Speech Advantage
• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services based on customer preferences
– improve the customer experience
Poor Customer Care

In a recent survey conducted by Mobius Management Systems:
• 54% consider customer service to be poor if they wait over 5 minutes for a rep
• Because of poor customer care:
– 46% of respondents dropped their credit card provider
– 32% dropped their internet provider
– 30% dropped a phone company
– 32% dropped a cellular phone vendor
• 53% of respondents consider touch-tone IVR to be the most frustrating aspect of customer service
Voice-Enabled Services
• Desktop applications – dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology – simple tasks like stock quotes, traffic reports, and weather; access to communications, e.g., voice dialing and voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice portals – 'convert any web page to a voice-enabled site', where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
• E-contact services – call centers, customer care (HMIHY) and help desks, where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics – command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices – control of cellphones and PDAs by voice commands
Customer Care and Help Desk Automation

Major industry sectors:
• Telecommunications
• Banking
• Finance
• Technology
• Retail
• Insurance
• Entertainment

Customer care and help desk costs are expected to be over $100 billion/year worldwide.
It's Just the Beginning…
Armstrong - Telephone should be the most ubiquitous Internet access device
Gates - Speech is key to popularizing the Internet
A Look Into the Future

1990+: keyword spotting, handcrafted grammars, no dialogue – Directory Assistance, VRCP (constrained speech, minimal data collection, manual design)
1995+: medium-size ASR, handcrafted grammars, system initiative – airline reservation, banking (constrained speech, moderate data collection, some automation)
2000+: large-size ASR, limited NLU, mixed initiative – call centers, e-commerce (spontaneous speech, extensive data collection, semi-automation)
2005+: unlimited ASR, deeper NLU, adaptive systems – multimodal, multilingual help desks, e-commerce, e.g., MATCH: Multimodal Access To City Help (spontaneous speech/pen, fully automated systems)
Future of Speech Recognition Technologies

[Chart: projected milestones, 2002–2011 – dialog systems (very large vocabulary, limited tasks, controlled environment); robust systems (very large vocabulary, limited tasks, arbitrary environment); multilingual systems (unlimited vocabulary, unlimited tasks, many languages); multimodal speech-enabled devices.]
Spoken Dialog Interfaces vs. Touch-Tone IVR?
Courtesy of Jay Wilpon