
Conversational Systems and the Marriage of Speech & Language

Mari Ostendorf, University of Washington

Research sponsored at various times by Google, Amazon, Tencent, NSF & DARPA

The Story…

Conversational systems are the next frontier of speech processing

2

Conversations pose a number of challenges

… that have implications for the neural architectures used

… and require responsible AI thinking

Why speech?

• People like spoken communication
• Video & podcast content online is rapidly growing
• It is often easier to talk than type (driving, cooking, …)
• Speech technology makes information accessible

• Speech recognition & synthesis have advanced!

3

Why spoken language processing?

• Natural language processing (NLP) has advanced!

  • Information retrieval & extraction
  • Summarization, question answering
  • Sentiment analysis
  • Translation

• Combining speech + language enables
  • Speech interfaces for more complex tasks
  • Making speech & video as accessible as text

4

Why conversations?

• People like spoken interactions
• Web-based chat & FAQs have not replaced call centers
• Informational audio is often interactive: interviews, debates, hearings, lecture Q&A
• In the COVID-19 era, person-to-person meetings (and dinner parties) continue… virtually

• Interaction is useful for learning, problem solving, and team building

5

Conversations are the next frontier!

• Virtual assistants are already becoming more conversational (for constrained tasks) -- see Google Duplex Demo

• And there’s potential for impact in many application areas
  • Call center support
  • Interactive tutorials and dialogic reading
  • Virtual companions & personal robots
  • Conversational speech translation
  • Meeting summarization
  • Medical diagnostics

6

Breazeal et al. NRI project

Marriage of Speech & Language

• To build a human-computer dialog system or process archives of conversational speech, we need speech + NLP
• The challenges ahead …
  • Style differences in written vs. conversational language
  • Information in the acoustic signal beyond the words
  • Interactive nature of conversations

• Require more than throwing words over a fence

7

Speech → NLP

The first problem ….

• Most NLP technology is built on written text, which is very unlike conversational speech transcripts
  • Language use is different (domain adaptation issues)
  • No punctuation, ASR errors, disfluencies

8

Examples of Contextual Variation

Fujitsu Ltd.'s top executive took the unusual step of publicly apologizing for his company's making bids of just one yen for several local government projects, while computer rival NEC Corp. made a written apology for indulging in the same practice.

A: yeah and we’ve been also doing more cooking which is fun
B: yeah that’s that’s true

A: like the wha- how do you say it again? uh bac- bacanau?
B: oh bacalhao
A: bac- bacalhao
B: yeah that was that was a recipe that I wanted to do for quite some time

Interestingly, four Republicans, including the Senate Majority Leader, joined all the Democrats on the losing end of a 17-12 vote. Welcome to the coven of secularists and atheists. Not that this has anything to do with religion, as the upright Senator X swears, … I guess no one told this neanderthal that lying is considered a sin by most religions.

9

(Sources, in order: WSJ article, conversation transcript, blog post)

How most people talk in conversations

10

What we take away

What di- … what’d I do the o- ... the other day? It was ... oh the the then th- th- the pork ribs and the … bunch of Korean food and stuff … but … yeah … I don’t know.

What did I do the other day? Oh, the pork ribs and bunch of Korean food and stuff. Yeah, I don’t know.

… as do justices and lawyers

… and children

because it’s because it's 'citing it's i- and and also she must do it because if she want to do the same thing as me then if she want to do it then she have to … um uh you just have to be nice and act like this excuse me um mm could I paint with you this evening … if she doesn’t then then then s- hen y- then then you c- you can’t yell

It would have to be I would think a reasonable standard is would have to be …

Yes. That that is there there uh uh there are two arguments about the risk of corruption. At the moment the argument that I'm talking about is that the party is a means that that to that that the um contribution limits on individual donors are justified as a means of preventing uh corruption …

(from Alwan & Bailey, UCLA)

State-of-the-Art NLP Uses Word Vectors

• Represent a word with a vector embedding learned from the statistics of its neighbors
  • Vector distance characterizes word similarity
  • Reduced dimension space results in parameter sharing
• Even better: contextualized word embeddings
  • Learn a neural language model; the hidden state represents a word in context
• Use vast amounts of text to learn general-purpose embeddings, which are plugged into task-specific networks

12

GloVe, Word2vec, … (table look-up!)
RNN + word prediction objective → ELMo
Transformer + masking objective → BERT, XLNet, RoBERTa, ...
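To make the static vs. contextualized distinction concrete, here is a minimal sketch of pulling contextualized word vectors from a pretrained BERT checkpoint with the Hugging Face transformers library; the checkpoint name and the use of the last hidden layer are illustrative choices, not something specified in these slides.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pretrained general-purpose encoder; any BERT-style checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "yeah that was a recipe that I wanted to do for quite some time"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)word token, dependent on the surrounding sentence --
# unlike a static GloVe/Word2vec table look-up, the vector for the same word
# differs across contexts.
contextual_embeddings = outputs.last_hidden_state  # shape: (1, n_tokens, 768)
```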

Informality → Domain Mismatch Problem

• General-purpose language models don’t work for all tasks
  • Researchers have trained sciBERT, bioBERT, clinicalBERT, …
  • But there isn’t (yet) a lot of accurately transcribed conversational speech for training a convBERT
• Are written-text embeddings useful for conversations?

13

YES! With fine-tuning

Parser Training         Embeddings      F1
Text (WSJ)              General Text    77.4
Conversations (Swbd)    Conversations   91.0
Conversations           General Text    93.2
Swbd + WSJ              General Text    93.4

Experiments with a neural parser on hand-transcribed sentences from Switchboard conversations

(Tran et al., Interspeech 2019)

What if we want to clean up disfluencies?

Disfluency structure (Shriberg, 1994)

14

[ reparandum + { interregnum } repair ]

reparandum = material to delete; + = interruption point; { interregnum } = optional edit markers & fillers; repair = replacement

[ I think + I think ] [ the stand they have, + {E [ or + or ] } the way they command respect, ] [ I + I ] support that.

I think the way they command respect, I support that.

I think I think the stand they have or or the way they command respect I I support that
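To make the bracket notation concrete, below is a minimal cleanup sketch (assuming a whitespace-tokenized annotation string, not the authors' tooling) that drops each reparandum and interregnum and keeps only the repair:

```python
def _clean(tokens, i, stop):
    """Consume tokens[i:] until a token in `stop`; return (kept_tokens, index)."""
    kept = []
    while i < len(tokens) and tokens[i] not in stop:
        if tokens[i] == "[":
            _, i = _clean(tokens, i + 1, {"+"})       # reparandum: discard
            i += 1                                     # skip the interruption point "+"
            if i < len(tokens) and tokens[i].startswith("{"):
                depth = 0                              # interregnum { ... }: discard
                while i < len(tokens):
                    depth += tokens[i].count("{") - tokens[i].count("}")
                    i += 1
                    if depth == 0:
                        break
            repair, i = _clean(tokens, i, {"]"})       # repair: keep
            kept.extend(repair)
            i += 1                                     # skip "]"
        else:
            kept.append(tokens[i])
            i += 1
    return kept, i

def clean_disfluencies(annotated):
    kept, _ = _clean(annotated.split(), 0, set())
    return " ".join(kept)

print(clean_disfluencies(
    "[ I think + I think ] [ the stand they have, + {E [ or + or ] } "
    "the way they command respect, ] [ I + I ] support that."
))
# -> I think the way they command respect, I support that.
```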

Learning Disfluency Structure

• Treat disfluency detection as a word tagging problem (a minimal tagger sketch follows below)
• A network over word embeddings learns phrase similarity

15

A standard neural tagger is fine for Switchboard conversations, but the similarity network improves results on court arguments and congressional hearings.

(Zayats & Ostendorf, 2018)
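A minimal sketch of the tagging formulation mentioned above: a BiLSTM over word embeddings labels each token as fluent or inside a reparandum. The phrase-similarity network of Zayats & Ostendorf is not shown; vocabulary size, dimensions, and tag set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DisfluencyTagger(nn.Module):
    """Word tagger: label each token as fluent (0) or part of a reparandum (1)."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, n_tags=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))     # (batch, seq_len, 2*hidden)
        return self.out(h)                        # per-token tag logits

# Toy usage: tag scores for a 6-token utterance
tagger = DisfluencyTagger()
logits = tagger(torch.randint(0, 10000, (1, 6)))
pred = logits.argmax(dim=-1)                      # per-token disfluency tags
```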

AND, there is another problem ….

• Most NLP technology is built on written text, which is very unlike conversational speech transcripts
• Most speech systems use only the surface word sequence, not accounting for prosodic meaning

16

… but there’s opportunity here!

Prosody

It’s not what you say, but how you say it.

Found in the Language Log archives.

Signal Processing for Prosody

• Acoustic cues associated with prosody include fundamental frequency (F0), vocal effort, and timing (words and pauses)
• Different phenomena occur at different time scales, e.g.
  • Word emphasis: word
  • Topic change, segmentation, disfluencies: word endings and onsets
  • Asides: phrase-level differences
  • Emotion, attitude: whole segments and/or word-specific tones

18
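As a rough illustration of these cues, the sketch below pools frame-level F0 and energy to the word level with librosa; the forced word alignment, sampling rate, F0 range, and pooling statistics are assumptions for illustration only.

```python
import numpy as np
import librosa

def word_prosody_features(wav_path, word_times):
    """Pool frame-level F0 and energy to word level.

    word_times: list of (word, start_sec, end_sec), e.g. from a forced aligner
    (the alignment itself is assumed, not produced here).
    """
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)   # frame-level F0 (NaN when unvoiced)
    rms = librosa.feature.rms(y=y)[0]                           # frame-level energy
    t_f0 = librosa.times_like(f0, sr=sr)                        # default hop matches pyin/rms defaults
    t_rms = librosa.times_like(rms, sr=sr)

    feats = []
    for word, start, end in word_times:
        sel_f0 = f0[(t_f0 >= start) & (t_f0 < end)]
        sel_rms = rms[(t_rms >= start) & (t_rms < end)]
        feats.append({
            "word": word,
            "f0_mean": float(np.nanmean(sel_f0)) if sel_f0.size else float("nan"),
            "energy_mean": float(sel_rms.mean()) if sel_rms.size else float("nan"),
            "duration": end - start,
        })
    return feats
```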

Prosody for human-centered conversations

• Is the user talking to the device? …finished with the turn?
• Sentence segmentation, statement vs. question
• The meaning of “yeah” and “no”
• Meaning of interjections: “oh”

19

(Figure: the same short responses spoken with different prosody convey different meanings. Examples on the slide: “yeah”, “no”, “I did not”, and a reply to “Did you know that …?”, with intended meanings such as: I get it, answer no, of course, positive correction, uncertain no, I remember, unenthusiastic.)

Computational Modeling Challenges for Prosody

• Prosodic cues are associated with multiple phenomena at different time scales – hard to disentangle
• Differences between read and spontaneous speech
  • Prosody carries more information in spontaneous speech
  • Much prior work on read speech
• In speech understanding…
  • The words matter more; effort on ASR has given greater benefits
  • Prosody used in early dialog systems reflects user uncertainty
• With the major ASR advances, it’s worth revisiting prosody

Example captions: “Fruit in a bowl” / “Bananas, apples and a lemon”

Prosody + Text: Multimodal Analogies

• Caption & image have both complementary and redundant info (as for words & prosody)

• Caption depends on the bounding boxes; sentence meaning depends on emphasis & break location

• Standard multimodal integration issues
  • How tightly coupled modalities are
  • Modality representation learning strategy
• For sentence understanding tasks, we use
  • Word-aligned prosody vectors
  • Parallel vs. conditional encodings

21

Photo: Rodney Chen, UltiPhotos.com

Learn Prosody Features in a Neural Parser

• CNN learns F0 & energy feature functions

• Additional pause & duration embeddings

• Concatenate prosodic features with word embeddings

• End-to-end parser training

(Tran et al., NAACL 2018)
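A minimal sketch in the spirit of this design, not the implementation from Tran et al.: a 1-D CNN over frame-level F0/energy within each word span, max-pooled to a fixed-size vector and concatenated with learned pause and duration embeddings and the word embedding. Dimensions, binning, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordProsodyEncoder(nn.Module):
    """Produce a prosody-augmented word representation for a downstream parser."""
    def __init__(self, n_frame_feats=2, conv_dim=32,
                 n_pause_bins=6, pause_dim=8, n_dur_bins=10, dur_dim=8):
        super().__init__()
        self.conv = nn.Conv1d(n_frame_feats, conv_dim, kernel_size=5, padding=2)
        self.pause_emb = nn.Embedding(n_pause_bins, pause_dim)   # binned pause before the word
        self.dur_emb = nn.Embedding(n_dur_bins, dur_dim)         # binned word duration

    def forward(self, frames, pause_bin, dur_bin, word_emb):
        # frames: (n_words, n_frame_feats, n_frames) F0/energy trajectories per word
        conv_out = torch.relu(self.conv(frames))                 # learned feature functions
        prosody = conv_out.max(dim=-1).values                    # max-pool over time
        extras = torch.cat([self.pause_emb(pause_bin), self.dur_emb(dur_bin)], dim=-1)
        # concatenate with the word embedding; this is the input to the parser encoder
        return torch.cat([word_emb, prosody, extras], dim=-1)
```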

Experiments in Parsing Switchboard

• Task: Constituency parsing given hand transcripts & sentence boundaries
• Base parser: Self-attentive encoder + chart decoder (Kitaev & Klein, 2018)
• Findings:
  • ELMo/BERT contextualized embeddings help with parsing conversational speech
  • Prosody helps over text alone, with or without contextualized embeddings
  • Parsing F-score = 93 for conversational speech!
  • Prosody helps with disfluencies, VP attachments

The same approach does not lead to gains from prosody in disfluency detection, so….

(Tran et al., Interspeech 2019)

Contextually-Normalized Prosody Features

1. Given a text, predict its prosody

2. Compare predicted with true signal: what is the difference relative to the expected variability?

3. Use as the prosody features (called innovations)

(Zayats et al., NAACL 2019)

(Figure: for the utterance “was it I mean did you put …”, a model predicts prosody p̂ from the text, compares it with the observed prosody p, and outputs normalized features z)

p_{i,k} | h_i ~ N(μ_{i,k}, σ²_{i,k})

z_{i,k} = (p_{i,k} − μ_{i,k}) / σ_{i,k}

Innovations = variation that is not accounted for by the word sequence (i.e. the default reading)
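A small sketch of the innovation computation, assuming some text-based model has already produced the predicted mean and standard deviation for each word-level prosody feature (that predictor itself is not shown here):

```python
import numpy as np

def prosody_innovations(observed, pred_mean, pred_std, eps=1e-6):
    """Normalize observed word-level prosody by the text-predicted distribution.

    observed, pred_mean, pred_std: arrays of shape (n_words, n_features).
    The result z reflects variation not explained by the default reading
    of the word sequence.
    """
    return (observed - pred_mean) / (pred_std + eps)
```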

Experiments in Disfluency Detection

• Task: Disfluency detection in Switchboard
• Findings:
  • Disfluency interruption points: words are longer and lower energy
  • Innovations are almost as good as text alone
  • Innovations (but not raw prosody) are useful when combined with text

(Bar chart comparing Text, Raw Prosody, and Innovations features on a 0-100 scale)

Examples where prosody helps:

but it’s just you know leak leak leak everywhere

I mean [ it was + it ]

The problems continue ….

• Most NLP technology is built on written text, which is very unlike conversational speech transcripts
• Most speech systems use only the surface word sequence, not accounting for prosodic meaning
• Most systems treat conversations like documents (a linear sequence of sentences)

26

Written document vs. ICSI meeting

A Multi-party Discussion

27

People interrupt each other, overlap, stop mid sentence, finish someone else’s sentence, respond to earlier talkers, …

People may have specific roles (host vs. guest, student vs. expert, agent vs. customer) or interaction styles

Overlapping backchannels: “no”, “yes”, “cool”, “yeah that’s cool”, “no I didn’t”, “No”, “yes”

i don't know that's an interesting question and is it really true that garlic keeps vampires the wedding and what i what are they have their long fingernails for i think that that's probably true but i think it vampires are evil and they don't care about sustaining things for human be-...

Conversations have graph structure

28

Online discussion vs. in-person conversation

I was talking with a friend and just by random coincidence she also did the same recipe I was doing in the same week

Really?

Yeah, I was like oh

Different links:
• Reply to
• Temporal

what?

Wh- like a childhood friend or or

A friend from college

Oh wow
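A toy sketch of one way to represent this graph structure, with temporal edges between consecutive turns and explicit reply-to edges; the Turn/Conversation classes are illustrative, not taken from any cited system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    speaker: str
    text: str
    reply_to: Optional[int] = None     # index of the turn being responded to

@dataclass
class Conversation:
    """Turns stored in temporal order; reply_to links add graph structure."""
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str, reply_to: Optional[int] = None) -> int:
        self.turns.append(Turn(speaker, text, reply_to))
        return len(self.turns) - 1

    def edges(self):
        # temporal edges between consecutive turns plus explicit reply-to edges
        temporal = [(i - 1, i) for i in range(1, len(self.turns))]
        replies = [(t.reply_to, i) for i, t in enumerate(self.turns)
                   if t.reply_to is not None]
        return {"temporal": temporal, "reply_to": replies}

# Toy usage with the dialog above
conv = Conversation()
a = conv.add("A", "she also did the same recipe I was doing in the same week")
b = conv.add("B", "Really?", reply_to=a)
conv.add("A", "Yeah, I was like oh", reply_to=b)
print(conv.edges())
```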

Prediction Paradigm for Dialog Tasks

29

(Figure: per-speaker state trackers encode Speaker A and Speaker B utterance histories into hidden states h_A, h_B, which update a dialog state sequence z_{t-1}, z_t, z_{t+1} that feeds the prediction task)

Speaker State Tracker for Dialogs

• Two types of information
  • Short-term modes reflected in the current utterance
  • Long-term/carry-over factors reflected in dialog segments
• Leverage latent user modes to represent an utterance
• Use next utterance (response) prediction to provide an unsupervised pre-training signal

30

(Figure: at the turn level, each utterance representation from the local text context attends over a dictionary (memory) of latent mode vectors u_1, …, u_K with attention weights a_1, …, a_K, giving û_t = Σ_k a_k u_k; at the conversation level, a recurrent state h_t, h_{t+1} tracks the sequence of û_t, û_{t+1})

Speaker State Tracking Studies

• Unsupervised pre-training of a speaker state sequence model benefits several tasks involving conversations
  • Reddit discussions: comment popularity prediction
  • Open-domain socialbot chat: topic acceptance prediction
  • Human-human conversation: dialog act tagging
  • Task-oriented dialog: multi-domain state tracking
• What we learn in the speaker modes
  • Topics or topic interests
  • Sentiment, politeness
  • Coarse-grained dialog act

31

(Cheng et al., EMNLP 2017)
(Cheng et al., NAACL 2019)

(Cheng Ph.D., 2019)

A Common Theme…

• For all these challenges, we are taking advantage of general advances in neural networks for language processing

• Always fine-tuning on conversational data, of course

• However, we’re also introducing structure that is explicitly conversational, such as

  • Disfluency structure networks
  • Prosody word vectors
  • Discussion response structure

32

Aside: Interactivity Challenges Evaluation

• There are few tasks with quantitative evaluation metrics
• There is no gold standard conversation
  • Once a human is in the loop, the conversation can go anywhere
  • Many responses may be acceptable
  • Responses that are acceptable to some are not to others

33

Did you know that Malaysian vampires are tiny monsters that burrow into people's heads and force them to talk about cats?

That’s creepy.

Oh my god that’s funny.

Oh gods are you have to hear this.

What the heck.

Cats are my favorite animals.

Wow that’s interesting.

Big Problem: Spoken Language is Personal

• Advances in speaker recognition technology make people identifiable from their speech
• False trigger of hotword detection can expose personal info
• Advances in speech synthesis technology enable voice theft
• All these valid privacy concerns make research more difficult
  • Commercial ASR systems don’t expose the audio → no prosody features
  • Data not shared limits the potential for learning

Conversation technology will need to leverage advances in privacy-sensitive signal processing, learning, etc.

Other Ethical & Social Issues

• People should know when they are talking to a conversational agent – forget about the Turing test
• Architectures and learning strategies for language technologies should work for different languages
  • Note: prosody is used differently in different languages
• Systems should serve a diverse community of users, including age, dialect and cultural differences
  • Issue of bias in machine learning
  • User modeling should not reinforce stereotype bias
• Users can delete their data: limits reproducibility

35

Summary

• Conversational systems are the next frontier
• Speech ≠ noisy text! Spontaneous speech poses many challenges not yet solved by current technology
• Future systems will benefit from:

• General advances in neural sequence modeling, domain adaptation, multi-modal integration, graph-based processing

• Innovations addressing specific challenges of conversations

• Conversational system research needs to be informed by and drive work in privacy and responsible ML

Thanks and more information…

Parsing, disfluencies & prosody

• T. Tran, J. Yuan, Y. Liu and M. Ostendorf, Proc. Interspeech, 2019.

• T. Tran, S. Toshniwal, M. Bansal, K. Gimpel, K. Livescu and M. Ostendorf, Proc. NAACL, pp. 69-81, 2018.

• V. Zayats and M. Ostendorf, Proc. NAACL, 2019.

• V. Zayats and M. Ostendorf, arXiv, 2019.

Conversation structure

• V. Zayats and M. Ostendorf, Trans. Association of Computational Linguistics, 2018.

• H. Cheng, H. Fang and M. Ostendorf, Proc. EMNLP, 2017.

• H. Cheng, H. Fang and M. Ostendorf, Proc. NAACL, 2019.

• H. Cheng, PhD dissertation, 2019.

37

Trang Tran, Vicky Zayats, Hao Cheng, Hao Fang

Sounding Board Alexa Prize Team