Dialogsystem, AGI
Joakim Gustafson, CTT, KTH
1
J. Gustafson, CTT, KTH http://www.speech.kth.se
Joakim Gustafson
Centrum för Talteknologi (Centre for Speech Technology), KTH
Tal, musik och hörsel (Speech, Music and Hearing), KTH
Dialogue Systems
Agenda
• Introduction of the speaker
• Talking Machines
• Research Spoken Dialogue Systems
• Commercial Speech Applications
• Spoken Dialogue System Components
• Spoken Dialogue System Research
Introduction
Joakim Gustafson
1987-1992 Electrical Engineering program at KTH
1992-1993 Linguistics at Stockholm University
1993-2000 PhD studies at CTT/KTH
2000-2007 Senior researcher at Telia R&D department
2007– Assistant Professor CTT/KTH
My research field: spoken dialogue systems
Multi-disciplinary field
CTT: Centrum för talteknologi (Centre for Speech Technology)
Research areas
• Speech production
• Speech perception
• Communication aids
• Multimodal speech synthesis
• Speech recognition
• Speaker verification
• Spoken conversational speech
• Interactive dialogue systems
The KTH speech group: early days
Gunnar Fant and OVE I, 1953
Name | Code | Year | Period | Credits
Talteknologi (Speech Technology) | 2F1111 | D3-4, E4, F4 | 3 | 4
Talteknologi, utökad kurs (Speech Technology, extended course) | 2F1112 | Media 3-4 | 3 | 5
Spektrala transformer för Media (Spectral Transforms for Media) | 2F1120 | Media 3-4 | 2 | 4
Musikakustik (Music Acoustics) | 2F1212 | D4, E4, F4 | 4 | 4
Musical communication & music technology (Musikalisk kommunikation och musikteknologi) | 2F1213 | Media 3-4, E4, D4 | 4 | 5
Elektroakustik (Electroacoustics) | 2F1400 | E4-5, M4-5 | 1 | 4
Audioteknik (Audio technology) | 2F1410 | Media 3-4, E4, D4 | 2 | 5
Orkesterspelets teori (Theory of orchestral performance) | 2F1601 | all | - | 6
Orkesterspelets praktik (Orchestral performance) | 2F1602 | all | - | 6
Elektroprojekt (Electrical engineering project) | 2U1700 | E1 | 3-4 | 5
Medieteknik grundkurs, avsnitt Ljud (Media Technology basic course, Sound section) | 2D1574 | Media 2 | 1-2 | 12
Talking Machines
Talking machines as we know them
• Star Trek
• HAL 9000 (2001)
• C3PO (Star Wars)
The beauty of speech
1. Works in hands-free situations
2. Works in eyes-free situations
3. Works when other interfaces are inconvenient
4. Works where disabilities make other interfaces useless
5. Works with common hardware, e.g. a telephone
6. Efficient information transfer (as far as humans are concerned).
Speech application domains
• Information retrieval systems
– e.g. train timetable information or directory inquiries
• Ordering
– ticket booking, buying music
• Command control systems
– home control (“turn the radio off”) or voice command shortcuts (“save”)
• Dictation
The beauty of speech (cont)
7. Reasoning
8. Problem solving
9. Naturalness
10. Ease of use
11. Flexibility
12. Error handling and hedging
13. Mutual adaptation enables error handling and more efficient information transfer
14. Social, bond-building
Application Domains (cont)
• Games and entertainment
– Games can take advantage of the more social aspects of spoken dialogue
• Co-ordinated collaboration
– The task of controlling or overseeing complex situations requires flexibility and efficiency
• Expert systems
– Diagnosis and help systems that need to reason about facts and goals and may benefit from natural dialogue
• Learning and training
– Naturalness, flexibility, and robustness are attractive features in training environments
The spoken dialogue system vision
An interface that allows speakers to interact with a computer using spontaneous, unconstrained speech
Multimodal interfaces
• User and system interact by speech and other modalities.
Why multimodality?
• Different modalities suited for different things
• Increased flexibility
• New kinds of terminals
• Facilitates error handling
• Graphical output can confirm spoken input
Research Spoken Dialogue Systems
Classic research dialogue systems
• Historic research systems
– Voyager (1989)
– ATIS (1992)
– SUNDIAL (1993)
– TRAINS (1996)
• Application
– Philips Train Information (1995)
• Large efforts
– Communicator
– Verbmobil
– SmartKom
The TRAINS Project: Natural Spoken Dialogue and Interactive Planning
• Project leader: James Allen
• TRAINS-91 was limited in scope but was an important demonstration of the "doability" of spoken dialogue systems.
The Nordic Scene
• Stockholm, Sweden
– KTH (Waxholm, August, AdApt, Higgins)
– Telia (Resebokning, NICE)
• Linköping, Sweden
– IDA (LINLIN, Ornitology, Movie Finder)
• Göteborg, Sweden
– Lingvistiken (TRINDI, GODIS, TALK)
– Chalmers (State-chart xml)
• Aalborg and Odense, Denmark
• Helsinki and Tampere, Finland
• Trondheim, Norway
The Waxholm Project (92-95)
• Tourist information
– Stockholm archipelago
– time-tables, hotels, hostels, camping and dining possibilities
• Mixed initiative dialogue
– speech recognition
– multimodal synthesis
• Graphic information
– pictures, maps, charts and time-tables
Waxholm System
Waxholm Database
• About 70 subjects (9200 words)
• Phonetically transcribed
• Examples from the Waxholm system
• Five different speakers: EJ, KR, GÖ, LN, MK (audio examples)
Three years later...
The Waxholm system
U: Where can I find hotels?
S: Information about hotels is shown in this table
U: Which hotels are in Waxholm?
S: Information about the hotels in Waxholm is shown in this table
U: Where is Waxholm?
S: Waxholm is shown on this map
U: I think I want to go to Waxholm
S: I am looking for boats to Waxholm. From where do you want to go?
U: The city
S: Which day of the week do you want to go?
U: I want to go tomorrow
S: There are lots of boats from Stockholm to Waxholm on a Friday. At what time do you want to go?
U: When do the evening boats depart?
S: This is a table of the boats...
U: Is it possible to eat in Waxholm?
S: Information about the restaurants in Waxholm is shown in this table
U: Thank you
S: Thank you too
The August kiosk at Kulturhuset 98
• Swedish spoken dialogue system for public use
• Animated agent named after August Strindberg
• Speech technology components developed at KTH
• Designed to be easily expanded and reconfigured
• Used to collect spontaneous speech data
The August system
What is your name?
I call myself Strindberg, but I don’t really have a surname
When were you born?
Strindberg was born in 1849
What do you do for a living?
I can answer questions about Strindberg, KTH and Stockholm
How many people live in Stockholm?
Over a million people live in the Stockholm area
Do you like it here?
People who live in glass houses should not throw stones
Yes, that was a smart thing to say!
I come from the department of Speech, Music and Hearing
The Royal Institute of Technology!
The information is shown on the map
Thank you!
You are welcome!
Good bye!
Perhaps we will meet soon again!
Yes, it might be that we will!
Strindberg was married three times!
The HIGGINS system (03-)
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
Higgins objectives: demo
Commercial Speech Applications
Access Convergence & Speech
The “Killer” Applications of Speech Technology
• Dictation
– Medical
– Legal
– Windows Vista (95-99% without training?)
• Customer care center automation
– Call routing
– Technical support
• Multimodal mobile applications
– Information search
– Content download
– Entertainment
Third generation dialog systems
Figure: applications ordered by complexity (low, medium, high): flight status, stock trading, package tracking, flight/train reservation, banking, customer care, technical support (Tell-Eureka). The generations are labelled 1st generation: informational; 2nd generation: transactional; 3rd generation: problem solving. (Pieraccini 2006)
Problem solving
• Voxify
– Wyndham hotel reservation
– Continental airline tickets
• SpeechCycle
– Broadband support agent
– Video support agent
– IP telephony agent
Speech Technology Components
Spoken dialogue processing
The Components of Spoken Dialogue Systems
• Audio Server
• Automatic Speech Recognition (ASR)
• Natural Language Understanding (NLU)
• Dialogue Management (DM)
• Natural Language Generation (NLG)
• Text-To-Speech synthesis (TTS)
• Multimodal modules
Audio server
Purpose: data capture
Input: speech
Output: digitized version of speech
Considerations: Availability, Bandwidth, Drop-out and Barge-in
Speech Recognizer
Purpose: convert speech to text
Input: digital speech
Output: String(s) or lattice that represent the most probable utterance(s)
Considerations: Vocabulary size, Grammar and Speech type
What makes speech recognition difficult?
Speaker
• Between speakers
– Age
– Gender
– Anatomy
– Dialect
• Within speaker
– Stress
– Mood
– Health
– Formal / Spontaneous
Channel
• Environment
– Additive noise
– Room
• Microphone, Telephone
– Bandwidth
– Noise
Listener
• Age
• First language
• Level of hearing
• Known / Unknown
• Human / Machine
Audio examples: the sentence ”Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst” (“Air travel, rail and the car industry compete for profitability and the favour of the public”). Speakers born in: USA, ex-Yugoslavia.
Speech Recognition
• Identify the speech sounds
– Measure the spectrum of the speech signal 100 times per second
• Match the recognized sounds against a pronunciation dictionary
– Only the allowed words are active
• Construct a sentence of the most probable words
– Taking the probability of the context into account
Statistically based speech recognition
Figure: block diagram of a statistical recognizer. Parametrization (FFT spectral analysis; 16 kHz, 100 Hz) is followed by classification (continuous or discrete) and a search (comparison) that draws on phoneme models (acoustic description), a lexicon (vocabulary + phoneme transcription) and a language model (possible word sequences). The search yields an N-best list, which sentence understanding (analysis, using knowledge) reduces to the selected sentence.
Different speech parameters
• Filter-bank amplitudes (FFT)
– the spectrum of the speech every 10 ms
– Mel scale, based on the frequency resolution of the ear
• Cepstrum
– inverse Fourier transform of the logarithmic spectrum; orthogonal
• Articulatory parameters
– closely tied to speech production
– complicated to compute
• Auditory-based parameters
– simplified modelling of hearing
– promising results for speech disturbed by noise
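The filter-bank parameters can be made concrete with a small sketch. It assumes 16 kHz sampling and one frame every 10 ms as on the slides; the 25 ms window length and the particular mel formula (one widely used variant) are assumptions, not taken from the lecture.

```python
import math

SAMPLE_RATE_HZ = 16_000   # sampling rate from the recognizer overview slide
HOP_S = 0.010             # one spectrum every 10 ms, i.e. 100 frames per second
WINDOW_S = 0.025          # assumed analysis window length (not given in the slides)


def hz_to_mel(freq_hz: float) -> float:
    """Map Hz to mel with one widely used formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_count(num_samples: int) -> int:
    """Number of full analysis windows when hopping every 10 ms."""
    window = int(round(WINDOW_S * SAMPLE_RATE_HZ))   # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE_HZ))         # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def mel_center_frequencies(num_filters: int, low_hz: float, high_hz: float) -> list[float]:
    """Filter centre frequencies (Hz) equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]
```

Calling `mel_center_frequencies(5, 0.0, 8000.0)` gives centres that are close together at low frequencies and far apart at high frequencies, which mirrors the frequency resolution of the ear mentioned above.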
Language models for recognition
• N-grams (word-sequence probabilities) give good results despite their simplicity
– bigram: P(w_i | w_{i-1})
– trigram: P(w_i | w_{i-2}, w_{i-1})
• Class pairs/word pairs if training material is lacking
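A bigram model of the kind listed above can be estimated from counts. The sketch below is illustrative only: the toy corpus, the sentence markers `<s>` and `</s>`, and add-one smoothing are assumptions, not the method used in any system from the lecture.

```python
from collections import Counter


def train_bigram(sentences: list[list[str]]):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        contexts.update(padded[:-1])                  # every word that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, word) pairs
    return contexts, bigrams


def vocabulary(bigrams: Counter) -> set[str]:
    """All words that were observed as a prediction target."""
    return {word for (_, word) in bigrams}


def bigram_prob(prev: str, word: str, contexts: Counter, bigrams: Counter, vocab_size: int) -> float:
    """P(word | prev) with add-one smoothing (an assumed, very simple choice)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


# Toy corpus, invented for illustration.
corpus = [["i", "want", "a", "boat"], ["i", "want", "a", "hotel"]]
contexts, bigrams = train_bigram(corpus)
vocab = vocabulary(bigrams)
```

With this corpus, `bigram_prob("i", "want", contexts, bigrams, len(vocab))` is 3/8: the pair was seen twice, plus one for smoothing, over two occurrences of "i" plus six vocabulary items.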
Results as of 2005 (earlier results in parentheses)
Corpus | Speech type | Lexicon size | Word error rate (%) | Human error rate (%)
Digit string (phone) | spontaneous | 10 | 0.3 | 0.009
Resource Management | read | 1000 | 3.6 | 0.1
ATIS | spontaneous | 2000 | 2 | -
Wall Street Journal | read | 64000 | 6.6 | 1
Radio News | mixed | 64000 | 13.5 (20) | -
Switchboard (phone) | conversation | 10000 | 19.3 (38) | 4
Call Home (phone) | conversation | 10000 | 30 (50) | -
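Word error rate, the measure reported in the table, is conventionally the number of substitutions, deletions and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. A minimal sketch follows; the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion of a reference word
                cur[j - 1] + 1,                      # insertion of a hypothesis word
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.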
Natural Language Understanding
Purpose: produce meaning representation from ASR output
Input: String(s) or lattice of words
Output: Meaning representation
Considerations: Type of grammar (Finite-state, Full parse or Word-spotting)
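Of the grammar types named above, word-spotting is the simplest to sketch: the utterance is scanned for known keywords and everything else is ignored. The keyword tables and the output format below are hypothetical, loosely modelled on the Waxholm examples.

```python
import re

# Hypothetical keyword tables for a tourist-information domain.
TOPIC_KEYWORDS = {
    "hotel": "hotels", "hotels": "hotels",
    "boat": "boats", "boats": "boats",
    "eat": "restaurants", "restaurant": "restaurants", "restaurants": "restaurants",
}
PLACE_KEYWORDS = {"waxholm": "Waxholm", "stockholm": "Stockholm"}


def spot_words(utterance: str) -> dict:
    """Return a flat meaning representation built from the first keyword of each kind."""
    meaning = {}
    for token in re.findall(r"[a-zåäö]+", utterance.lower()):
        if token in TOPIC_KEYWORDS and "topic" not in meaning:
            meaning["topic"] = TOPIC_KEYWORDS[token]
        elif token in PLACE_KEYWORDS and "place" not in meaning:
            meaning["place"] = PLACE_KEYWORDS[token]
    return meaning
```

The approach tolerates disfluencies and recognition errors in the ignored words, at the price of missing anything that depends on word order or negation.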
Phrases and syntax
I want to take the boat
Robust Utterance Analysis
Pragmatics/world knowledge
• Flights can be delayed
• To be delayed is bad
• ”Usually” → recurring in time. How often is ”usually”?
• Is this particular flight delayed a lot?
• What would that mean for connecting flights?
S: There is a flight at 8.10.
U: Isn’t that usually delayed?
Dialogue Management
Purpose: decide the next system action
Input: a meaning representation from NLU
Output: High-level communicative goal(s)
Automating Dialogue
• To automate a dialogue, we must model:
User’s intentions
• Recognized from input media forms (multi-modality)
• Disambiguated by the context
System actions
• Based on the task model and on output medium forms (multi-media)
Dialogue flow
• Content selection (what to say)
• Information packaging (how much to say)
• Turn-taking (who can speak/interrupt)
Knowledge Sources for Dialogue Systems
• Context:
– Dialogue history
– Current task, slot, state
• Ontology:
– World knowledge model
– Domain model
• Language:
– Models of conversational competence: turn-taking strategies, discourse obligations
– Linguistic models: grammars, dictionaries, corpora
• User model:
– Information about the user that could be relevant to the dialogue: age, gender, spoken language, location, preferences, etc.
– Dynamic information: the current mental state (beliefs, desires, intentions)
Dialogue Control Strategies
• Finite-state based systems
– dialog and states explicitly specified
• Frame based systems
– dialog separated from information states
• Agent based systems
– model of intentions, goals, beliefs
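To illustrate the frame-based strategy, here is a minimal sketch in which the information state (a frame of slots) is kept separate from the dialogue logic, so the user may fill slots in any order. The slot names and prompts are hypothetical, borrowed in spirit from the Waxholm timetable dialogue.

```python
from dataclasses import dataclass, field

# Hypothetical slots and prompts for a boat timetable task.
SLOT_PROMPTS = {
    "origin": "From where do you want to go?",
    "destination": "Where do you want to go?",
    "day": "Which day of the week do you want to go?",
}


@dataclass
class Frame:
    """Information state: one value per slot, None while unfilled."""
    slots: dict = field(default_factory=lambda: {name: None for name in SLOT_PROMPTS})

    def update(self, meaning: dict) -> None:
        """Copy any known slot values from an NLU meaning representation."""
        for name, value in meaning.items():
            if name in self.slots and value is not None:
                self.slots[name] = value

    def next_action(self) -> str:
        """Ask for the first missing slot, or answer when the frame is complete."""
        for name, prompt in SLOT_PROMPTS.items():
            if self.slots[name] is None:
                return prompt
        return "Looking up boats from {origin} to {destination} on {day}.".format(**self.slots)
```

A finite-state system would instead hard-code the order of questions; here the order only determines which missing slot is asked for first.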
Interaction control
• Conversation
– Exchange of information
– Control of the exchange of information
• Turn-taking
– Control of the ’floor’
• Feedback
– Perception, attention, understanding, attitude…
• Dialogue systems need both turn-taking and feedback
Conversational “grunts”
• Grunts occur an average of once every 5 seconds in American English conversation (Nigel Ward, 2000).
• In the Switchboard database:
– um was the 6th most frequent item (after I, and, the, you, and a) (Nigel Ward, 2000)
– the four items uh, uh-huh, um and um-hum accounted for 4% of the total (Picone et al. 1998)
• DEMO
An example: Telia’s customer care line 90200
• 14 million calls/year
• A number of areas/families
– Fixed telephony, mobile telephony, mobile data, broadband, IP telephony, IP-TV, ...
• A number of reasons for calling
– Billing, ordering, problem solving, support, information, ...
Simulation (Wizard-of-Oz)
The point of WoZ simulation
• Data for speech recognition
• Typical utterance patterns
• Typical dialogue patterns
• Click-and-talk behaviour
• Hints at important research problems
The WOZ recording environment
• Customer care agents were used as wizards:
– Listened to the customer (acting as ASR)
– Generated system answers via a prompt piano (acting as DM)
– Routed the call to the appropriate agent (acting as call router)
• All calls were logged and recorded
• The wizards could route calls to themselves in problematic cases:
– Less customer irritation
– Human error handling dialogues
– Used to collect interviews
WOZ environment and prompt piano developed by Wirén, Eklund, Engberg and Westermark
The WoZ recordings
• Ten real customer care agents were used as wizards (in-service WOZ collection)
• Five weeks of recordings resulted in 42,000 transcribed and tagged calls
Wirén, M., Eklund, R., Engberg, F. and Westermark, J. “Experiences of an In-Service Wizard-of-Oz Data Collection for the Deployment of a Call-Routing Application”, Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies.
A study of a system with more natural interaction control
• The prompts were redesigned to be more spontaneous and conversational in style
• This should make the system more responsive (indicating listening, thinking)
• Feedback and grunts (mmm, okej) were added to encourage users to keep describing their reason for calling
• Utterances responding to channel checks (hello) and social utterances such as greetings (Hi) were added
Wizard-of-Oz interface 2
Example dialog
SYS: välkommen till telia vad kan jag hjälpa dig med?
KUND: ja hejsan [PAUS] EEH [PAUS] hör du mig
SYS: ja hallå
KUND: ja [PAUS] jo EEH jag har problem med en räkning
SYS: MM?
KUND: jag har EHH anmält att jag har ingen telefon och jag har fått en eh [PAUS] en räkning på ett abonnemang
SYS: okej
KUND: och jag har ringt flera flera gånger och försökt förklara det att jag betalar inte så länge jag inte får telefonen inkopplad
SYS: gäller det din hemtelefon?
KUND: ja!
SYS: MM dröj lite så ska jag koppla dig
KUND: MM.
English translation (KUND = customer):
SYS: Welcome to Telia, how may I help you?
CUSTOMER: Yes hello [PAUSE] EEH [PAUSE] do you hear me?
SYS: Yes hello
CUSTOMER: Yeah [PAUSE] well EH I have a problem with my bill
SYS: MM?
CUSTOMER: I have EHH notified you that I don’t have a phone and I have gotten a EEH [PAUSE] bill
SYS: Okay
CUSTOMER: And I have called you several times to tell you that I refuse to pay until you reconnect my phone
SYS: Is this your home phone?
CUSTOMER: Yes!
SYS: MM, hold on while I connect you
CUSTOMER: MM.
Natural Language Generation
Purpose: produce an output string to be shown/spoken to the user
Input: Representation from DM
Output: String for TTS (possibly marked for prosody, etc.)
Utterance Generation
• Predefined utterances
• Frames with slots
• Generation based on grammar and underlying semantics
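The middle option, frames with slots, amounts to filling templates. A minimal sketch follows; the communicative goals and template wording are hypothetical and only echo the Waxholm system's answers.

```python
import string

# Hypothetical templates keyed by the communicative goal chosen by the dialogue manager.
TEMPLATES = {
    "show_table": "Information about the {topic} in {place} is shown in this table.",
    "show_map": "{place} is shown on this map.",
    "ask_time": "At what time do you want to go?",
}


def generate(goal: str, **slots: str) -> str:
    """Fill the template for a goal, refusing to speak with unfilled slots."""
    template = TEMPLATES[goal]
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(slots)
    if missing:
        raise ValueError(f"missing slot values: {sorted(missing)}")
    return template.format(**slots)
```

A predefined utterance is the special case of a template without slots, such as `ask_time` above; grammar-based generation replaces the fixed strings with rules over a semantic representation.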
Text-to-Speech Synthesis
Input: text string (with tags)
Purpose: speak string to user
Considerations: Human-like, Flexibility
What is speech synthesis?
• Recorded human speech
– Words and phrases
– fixed vocabulary
• Systematically recorded fragments of speech
– one fixed speaker
• Record the phoneme transitions ("diphones")
• Parametric synthesis (artificial)
– Formant synthesis
– Articulatory synthesis
• Multimodal synthesis (talking head)
Text-to-speech (TTS)
Figure: processing chain from text (“abcd”) to sound. Stages: language identification, linguistic analysis, prosodic analysis, phonetic description, sound generation. Resources shown beside the stages: morphological analysis, lexicon and rules, syntax analysis; rules and lexicon; rules and unit selection; concatenation, rules.
Source-Filter Theory of Speech Production
Kent (1997). The Speech Sciences
Formants
General Rules of Vowel Formants
(F1 = first formant frequency, F2 = second formant frequency)
Rules:
F1 is associated with tongue height
• /i/ & /u/ (high vowels): low F1
• /æ/ & /a/ (low vowels): high F1
F2 is associated with the anterior-posterior (front-back) position of the tongue
• /u/ & /a/ (back vowels): low F2
• /i/ & /æ/ (front vowels): high F2
Kent (1997). The Speech Sciences
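The two rules can be read as a crude classifier from formant frequencies to vowel features. In the sketch below the thresholds (500 Hz for F1, 1500 Hz for F2) and the example formant values are rough illustrative assumptions for an adult male voice, not figures from the lecture or from Kent (1997).

```python
def classify_vowel(f1_hz: float, f2_hz: float,
                   f1_threshold_hz: float = 500.0,
                   f2_threshold_hz: float = 1500.0) -> tuple[str, str]:
    """Apply the two rules: low F1 means a high vowel, high F2 means a front vowel."""
    height = "high" if f1_hz < f1_threshold_hz else "low"
    backness = "front" if f2_hz > f2_threshold_hz else "back"
    return height, backness


# Approximate, illustrative (F1, F2) values in Hz; real values vary by speaker.
EXAMPLE_FORMANTS = {
    "i": (270.0, 2290.0),
    "u": (300.0, 870.0),
    "æ": (660.0, 1720.0),
    "a": (730.0, 1090.0),
}
```

Running `classify_vowel(*EXAMPLE_FORMANTS["i"])` returns `("high", "front")`, matching the first and last bullets of the rules.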
Text-to-Speech Synthesis (TTS) Evolution
Timeline axis (year): 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997
Technique | Sites | Quality
Formant synthesis | Bell Labs; Joint Speech Research Unit; MIT (DEC-Talk); Haskins Lab; KTH | Poor intelligibility; poor naturalness
LPC-based diphone/dyad synthesis | Bell Labs; CNET; Bellcore; Berkeley Speech Technology | Good intelligibility; poor naturalness
Unit selection synthesis | ATR in Japan; CSTR in Scotland; BT in England; AT&T Labs (1998); L&H in Belgium | Good intelligibility; customer quality naturalness (limited context)
Source: Larry Rabiner
Parametric methods for speech synthesis
Figure labels: articulators; the form of the vocal tract; tubes; source and resonators; electrical networks; lips
Articulatory Synthesis Issues
• requires highly accurate models of glottis and vocal tract
• requires rules for dynamics of the articulators
Formant synthesis (1959-1987)
• Haskins, 1959
• KTH, Stockholm, 1962
• Bell Labs, 1973
• MIT, 1976
• MITalk, 1979
• Speak & Spell, 1980
• Bell Labs, 1985
• DECtalk, 1987
Source: Larry Rabiner
PSOLA (Pitch-Synchronous Overlap-Add)
• Pitch pulses moved in time to fit F0 contour
• Conceptually simple and computationally efficient
• Limitations:
– Need for precise pitch pulse marking
– Could not handle spectral interpolation
Unit selection
• Large databases of recorded natural speech
• Minimal processing
• Annotation of database: what information is needed?
• Synthesis reduces to a transcription and search problem
• Few cuts: maximally long units are selected (but context and prosody must fit well)
• Target and concatenation costs
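The search problem can be sketched as dynamic programming over candidate units, minimizing the sum of target costs (how well a unit fits the wanted specification) and concatenation costs (how well adjacent units join). The unit representation and both cost functions below are toy assumptions chosen to show why long contiguous stretches win; they are not taken from any system named in the lecture.

```python
from collections import namedtuple

# Toy unit: phone label, pitch in Hz, and where it sits in which recording.
Unit = namedtuple("Unit", "phone pitch recording position")


def target_cost(target, unit) -> float:
    """Assumed cost: pitch mismatch between the wanted (phone, pitch) and the unit."""
    return abs(target[1] - unit.pitch) / 100.0


def join_cost(left, right) -> float:
    """Assumed cost: free if the units were adjacent in the same recording."""
    contiguous = left.recording == right.recording and right.position == left.position + 1
    return 0.0 if contiguous else 1.0


def select_units(targets, candidates):
    """Return (total cost, chosen units) for the cheapest path through the candidates."""
    costs = [target_cost(targets[0], u) for u in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        new_costs, pointers = [], []
        for unit in candidates[i]:
            # Best predecessor for this unit, including the cost of the join.
            k = min(range(len(candidates[i - 1])),
                    key=lambda p: costs[p] + join_cost(candidates[i - 1][p], unit))
            new_costs.append(costs[k] + join_cost(candidates[i - 1][k], unit)
                             + target_cost(targets[i], unit))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)
    k = min(range(len(costs)), key=costs.__getitem__)
    total, path = costs[k], [k]
    for pointers in reversed(back):   # follow back-pointers to the first unit
        k = pointers[k]
        path.append(k)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]
```

In a toy database where recording "B" contains the whole sequence at slightly wrong pitch and recording "A" has better-matching but scattered units, the search prefers the contiguous stretch from "B", which is the "few cuts" behaviour described above.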
Concatenation Units
• Words:
– no complete coverage for broad domains, so words have to be supplemented with smaller units
– limited ability to modify pitch, amplitude and duration without losing naturalness and intelligibility
– need huge database to extract multiple versions of each word
• Sub-words:
– hard to isolate in context due to co-articulation
– need allophonic variations to characterize units in all contexts
– puts large burden on signal processing to smooth at unit join points
Source: Larry Rabiner
Modern TTS Systems (Natural Voices from AT&T)
• Soliloquy from Hamlet
• Gettysburg Address
• Bob Story
• German female
• UK British female
• Spanish female
• Korean female
• French male
Source: Larry Rabiner
Modern Commercial Systems
• Lucent
• AcuVoice
• Festival
• L&H RealSpeak
• SpeechWorks female
• SpeechWorks male
• Cselt (Actor), Italian
Source: Larry Rabiner
Multimodal Synthesis
Input: tagged text string
Purpose: support synthesis with lip movements
Considerations: Synchronization, Human-like?
Talking head
• Parametric 3D model
• Articulatory based parameters
• Rule based control
– speech
– gestures
• Tools for character creation
• Windows, Linux, Unix
Listening & Thinking
Articulatory synthesis
/usu/
Using humans to control animated characters ...
Makes a convincing ”system”
The NICE system
Background
• NICE was an EU project, 2002-2005
• Five partners
– Liquid Media, Sweden
– Scansoft (initially Philips, now Nuance), Germany
– Limsi, France
– TeliaSonera, Sweden
– Nislab, Denmark
• Two systems were developed
– a Swedish fairy-tale game
– an English H.C. Andersen story-teller
The NICE system setup
A computer game with multimodal dialogue
• It puts new and different demands on speech and dialogue technology
– young users
– a game lasts for many hours
– many kinds of dialogue: problem solving, negotiation, information-seeking, socializing, command & control, multi-party, …
– the setup is multimodal and asynchronous
• Many things are easier than for ‘normal’ dialogue systems
– low penalty for recognition errors
– voluntary suspension of disbelief
– gamers already know all the clichés
• $ Might pay off big some day $
Research challenges
• Conversational fairy-tale characters
– Different personalities and goals
– Autonomous
– Reactive
– Aware of the user, the environment and each other
• Conversational multimodal output generation
• Game progression vs. user control
• Understanding children’s multimodal input
Character design in NICE
• Cloddy Hans looks stupid but friendly; Karen looks a bit angry and smart
• Cloddy Hans is a classical friendly helper character
• The tasks the users have to solve are centered around what is seen in the scenes
• All characters have emotional and attention states
• The system features characters with different roles (e.g. helper and gatekeeper)
The fairy-tale game
• Two scenes: the room and the fairy-tale world
– Scene one: multimodal training session (“put-that-there”)
– Scene two: multi-party negotiation dialogue
• Three characters
– Cloddy Hans: slow but friendly helper character
– Karen: fast and sullen gatekeeper character
– Thumbelina: silent supporting character with attitudinal gestures
J. Gustafson, CTT, KTH Interspeech07
Aug 30 2007
User: Eh put that lamp in that hole eh pipe where the jewels are [Concrete]
Cloddy: Could we call it valuable? [Function]
User: Yes do that
Cloddy: Do you want me to put the lamp in valuable? [Function]
User: Yes do that
Cloddy: I will go over to the shelf
User: Yes do that
Cloddy: I will go over to the machine
User: Yeah and put it in that jewelry thing or the jewelry pipe [Function+slot]
Cloddy: I will put the lamp in valuable [Function]
User: Yes…. Okay put the lamp in valuable oops did you do that already? [Function (convergence)]
User: Put the sword in the pipe with the skull [Concrete]
Inside a fairy-tale brain
• World model: Interrelated (Java) objects representing things, characters, and their relationships.
• Discourse history: Who said what when?
• Agenda: Goals, actions and their causal relationships.
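The three components above could be held in a structure like the one below. This is only a minimal sketch, written in Python for brevity even though the slide says the world model consisted of Java objects. Every class, field and goal name here (Entity, Turn, Goal, Brain, go_to_shelf, pick_up_lamp) is hypothetical and not taken from the NICE code.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A thing or character in the world model."""
    name: str
    # Relationships to other entities, e.g. {"on": "shelf"}.
    relations: dict = field(default_factory=dict)


@dataclass
class Turn:
    """One entry in the discourse history: who said what, and when."""
    time: float
    speaker: str
    text: str


@dataclass
class Goal:
    """An agenda item; `requires` names goals that must be done first."""
    name: str
    requires: list = field(default_factory=list)
    done: bool = False


@dataclass
class Brain:
    world: dict = field(default_factory=dict)    # name -> Entity
    history: list = field(default_factory=list)  # list of Turn, oldest first
    agenda: dict = field(default_factory=dict)   # name -> Goal

    def said_by(self, speaker):
        """Return everything a given speaker has said, oldest first."""
        return [t.text for t in self.history if t.speaker == speaker]

    def next_goals(self):
        """Goals that are not done and whose prerequisites are all done."""
        return [g.name for g in self.agenda.values()
                if not g.done and all(self.agenda[r].done for r in g.requires)]


# Hypothetical usage: the lamp is on the shelf, so Cloddy must walk there first.
brain = Brain()
brain.world["lamp"] = Entity("lamp", {"on": "shelf"})
brain.history.append(Turn(0.0, "User", "put the lamp in the pipe"))
brain.agenda["go_to_shelf"] = Goal("go_to_shelf")
brain.agenda["pick_up_lamp"] = Goal("pick_up_lamp", requires=["go_to_shelf"])
```

Keeping the causal links on the agenda items themselves means the brain can answer "what can I do next?" without a separate planner, which is one plausible reading of "goals, actions and their causal relationships".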
Scene scripting
Game
  RoomScene
    Intro
    Main
  WorldScene
    Intro
    Exploration
    Negotiation
    Crossing
There is only one “current scene”
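The "only one current scene" rule can be illustrated with a single pointer into the scene tree. This is a sketch under assumptions: the scene and phase names come from the slide, but the strictly linear progression and the Game class with its current and advance members are hypothetical.

```python
class Game:
    """Holds a scene tree and a single pointer to the current scene phase."""

    # Scene and phase names are from the slide; their linear order is assumed.
    SCENES = {
        "RoomScene": ["Intro", "Main"],
        "WorldScene": ["Intro", "Exploration", "Negotiation", "Crossing"],
    }

    def __init__(self):
        # Flatten the tree into one ordered list of (scene, phase) pairs.
        self._order = [(scene, phase)
                       for scene, phases in self.SCENES.items()
                       for phase in phases]
        self._index = 0

    @property
    def current(self):
        """The one and only active (scene, phase) pair."""
        return self._order[self._index]

    def advance(self):
        """Move to the next phase; stay on the last one when the game ends."""
        if self._index < len(self._order) - 1:
            self._index += 1
        return self.current


game = Game()
game.advance()  # RoomScene/Main
game.advance()  # WorldScene/Intro
```

Because the state is one index rather than a set of active flags, two scenes can never be current at the same time.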
Characters are autonomous
[Figure: architecture diagram. Components: User, ASR, NLU, Dispatcher, Karen’s brain, Cloddy’s brain, NLG, TTS, Virtual World. Data labels: dialogue acts, spoken words, animation track.]
Message-passing
[Figure: the same architecture diagram (User, ASR, NLU, Dispatcher, Karen’s brain, Cloddy’s brain, NLG, TTS, Virtual World), with message labels: Utterance, Broadcast, Receipt.]
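The message labels in the diagram (Utterance, Broadcast, Receipt) suggest that the dispatcher forwards each utterance to the character brains and gets an acknowledgement back. The sketch below illustrates one such scheme. It is an assumption about the protocol, not the NICE implementation: the class names, the rule that a speaker does not receive its own utterance, and the request(...) and accept(...) dialogue-act strings are all hypothetical.

```python
class CharacterBrain:
    """A character brain that records every utterance it is told about."""

    def __init__(self, name):
        self.name = name
        self.heard = []

    def receive(self, speaker, dialogue_act):
        self.heard.append((speaker, dialogue_act))
        return f"receipt:{self.name}"  # acknowledge delivery


class Dispatcher:
    """Routes each utterance to every registered brain except the speaker."""

    def __init__(self):
        self.brains = {}

    def register(self, brain):
        self.brains[brain.name] = brain

    def broadcast(self, speaker, dialogue_act):
        """Deliver the dialogue act and collect one receipt per listener."""
        return [brain.receive(speaker, dialogue_act)
                for name, brain in self.brains.items()
                if name != speaker]


dispatcher = Dispatcher()
cloddy, karen = CharacterBrain("Cloddy"), CharacterBrain("Karen")
dispatcher.register(cloddy)
dispatcher.register(karen)

# The user's utterance reaches both characters; Cloddy's reply reaches Karen.
user_receipts = dispatcher.broadcast("User", "request(PickUp(Lamp))")
cloddy_receipts = dispatcher.broadcast("Cloddy", "accept(PickUp(Lamp))")
```

Broadcasting every utterance is one way to make the characters aware of each other as well as of the user, since each brain hears what the others say.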
Event-based dialogue
The Animation system
Animation handler actions generated by “pick it up” + selection of lamp
System Module (Message) | AnimationHandler Request | Animation Renderer Request
GR(StartOfGesture) | AttentionTo(GestureInput) | 1) RaiseEyebrows
GR(EndOfGesture) | |
GI(Select(Lamp)) | LookAt(Lamp) | 2) Play(Animation(TorsoTurnLeft(amount), HeadTurnLeft(amount)))
ASR(StartOfSpeech) | AttentionTo(SpeechInput) | 3) TurnTo(ActiveCamera)
ASR(EndOfSpeech) | TakeTurnGesture | 4) Play(Animation(Eyegaze(UpRight), Eyebrow(frown)))
NLP(PickUp(Lamp)) (by simple fusion) | |
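The NLP(PickUp(Lamp)) message is described as produced "by simple fusion" of the spoken "pick it up" and the graphical selection of the lamp. The sketch below shows one simple way such fusion could work: a deictic slot in the speech act is filled with the most recently selected object. The fuse function, the (verb, slot) tuple format and the list of deictic words are assumptions made for illustration, not the NICE algorithm.

```python
DEICTIC = ("it", "that")  # assumed set of words that need a gesture referent


def fuse(speech_act, selected_object):
    """Fill a deictic slot with the object selected by gesture.

    speech_act is a (verb, slot) pair, e.g. ("PickUp", "it").
    selected_object is the name from the latest GI(Select(...)) or None.
    """
    verb, slot = speech_act
    if slot in DEICTIC:
        if selected_object is None:
            return None  # unresolved reference: nothing was pointed at
        return f"{verb}({selected_object})"
    # The utterance named the object itself, so the gesture is not needed.
    return f"{verb}({slot})"


# "pick it up" + selection of the lamp
fused = fuse(("PickUp", "it"), "Lamp")
```

Returning None for an unresolved reference leaves it to the dialogue manager to ask a clarification question rather than guess.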
Animation handler actions used to pick up the lamp
System Module (Message) | Animation Handler Request | Animation Renderer Request
DM(convey(tell(PickUp(Lamp)))) | |
NLG(convey(“I will pick up the lamp”)) | Say(“I will pick up the lamp”) (the Synthesizer is first requested to generate the sound file and lip-synch track) | 5) Play(Sound(synthesis_file), Animation(Lipsynch_track))
DM(perform(PickUp(Lamp))) | PickUp(Lamp) | 6a) TurnTo(Lamp); 6b) Play(Animation(PickupGesture)); 6c) PickUp(Lamp); 6d) TurnTo(ActiveCamera)
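The table above reads as a mapping from each high-level animation handler request to an ordered list of low-level renderer requests. A minimal sketch of that expansion follows. The request strings are copied from the slide, while the EXPANSIONS table and the render_requests function are hypothetical names, and a real handler would presumably build these requests from parameters rather than look up fixed strings.

```python
# Animation handler request -> ordered animation renderer requests
# (steps 5 and 6a-6d on the slide).
EXPANSIONS = {
    'Say("I will pick up the lamp")': [
        "Play(Sound(synthesis_file), Animation(Lipsynch_track))",
    ],
    "PickUp(Lamp)": [
        "TurnTo(Lamp)",
        "Play(Animation(PickupGesture))",
        "PickUp(Lamp)",
        "TurnTo(ActiveCamera)",
    ],
}


def render_requests(handler_requests):
    """Expand handler requests, in order, into a flat renderer request list."""
    out = []
    for request in handler_requests:
        # Unknown requests expand to nothing in this sketch.
        out.extend(EXPANSIONS.get(request, []))
    return out


# The character first announces the action, then performs it.
sequence = render_requests(['Say("I will pick up the lamp")', "PickUp(Lamp)"])
```

The final TurnTo(ActiveCamera) step returns the character's attention to the user once the physical action is complete.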
Attentional and emotional output behaviors
Body gestures and physical actions
Signs of social involvement and engagement
• Users form alliances with either Cloddy Hans or Karen
• Users show remorse when accused of deceit
• Users lie and make ironic, sarcastic and humorous remarks
• Users react to the characters’ moods and use politeness markers
• Users lecture Cloddy Hans while referring to the shared dialogue history
The End