
Page 1:

Reinforcement Learning for Spoken Dialogue Systems:

Comparing Strengths & Weaknesses for Practical Deployment

Tim Paek

Microsoft Research

Dialogue on Dialogues Workshop 06

Page 2:

Reinforcement Learning for SDS

• Dialogue manager (DM) in spoken dialogue systems
  – Selects actions | {observations & beliefs}
  – Typically hand-crafted & knowledge intensive
  – RL seeks to formalize & optimize action selection
  – Once dialogue dynamics are represented as a Markov Decision Process (MDP), derive the optimal policy

• MDP in a nutshell

  – Input: tuple (S, A, T, R) → Output: policy π*: S → A

  – Objective function: maximize E[ Σ_{t=0}^{∞} γ^t R_t ], with discount 0 ≤ γ ≤ 1

  – Value function: V*(s) = max_a [ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') ], ∀ s ∈ S
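To make the MDP formulation concrete, the sketch below runs value iteration on a toy slot-filling dialogue. Only the Bellman backup mirrors the value function above; the state names, transition probabilities, and rewards are invented for illustration and are not from the slides.

```python
# Value-iteration sketch for a toy slot-filling dialogue MDP.
# The states, actions, T, and R below are illustrative placeholders; only the
# backup V*(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V*(s') ] is from the slide.

STATES = ["unfilled", "filled_unconfirmed", "confirmed"]
ACTIONS = ["ask", "confirm", "submit"]
GAMMA = 0.95

# T[s][a] -> list of (next_state, probability); R[s][a] -> immediate reward
T = {
    "unfilled": {
        "ask":     [("filled_unconfirmed", 0.8), ("unfilled", 0.2)],
        "confirm": [("unfilled", 1.0)],
        "submit":  [("unfilled", 1.0)],
    },
    "filled_unconfirmed": {
        "ask":     [("filled_unconfirmed", 1.0)],
        "confirm": [("confirmed", 0.7), ("unfilled", 0.3)],
        "submit":  [("confirmed", 0.5), ("unfilled", 0.5)],
    },
    "confirmed": {
        "ask":     [("confirmed", 1.0)],
        "confirm": [("confirmed", 1.0)],
        "submit":  [("confirmed", 1.0)],
    },
}
R = {
    "unfilled":           {"ask": -1, "confirm": -2, "submit": -5},
    "filled_unconfirmed": {"ask": -2, "confirm": -1, "submit":  0},
    "confirmed":          {"ask": -1, "confirm": -1, "submit": 10},
}

# Iterate the Bellman optimality backup until (approximate) convergence.
V = {s: 0.0 for s in STATES}
for _ in range(200):
    V = {
        s: max(
            R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])
            for a in ACTIONS
        )
        for s in STATES
    }

# Read off the greedy policy pi*: S -> A from the converged value function.
policy = {
    s: max(
        ACTIONS,
        key=lambda a: R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a]),
    )
    for s in STATES
}
print(policy)
```

With these toy numbers the derived policy asks while the slot is empty, confirms once it is filled, and submits only after confirmation, which is the kind of flow a designer would otherwise hand-craft.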

To what extent can RL really automate hand-crafted DM? Is RL practical for speech application development?

Page 3:

Strengths & Weaknesses

Objective Function
• Strengths: Optimization framework; explicitly defined & adjustable; can serve as an evaluation metric
• Weaknesses: Unclear what dialogues can & cannot be modeled using a specifiable objective function; not easily adjusted

Reward Function
• Strengths: Overall behavior is very sensitive to changes in reward
• Weaknesses: Mostly hand-crafted & tuned; not easily adjusted

State Space & Transition Function
• Strengths: Once modeled as an MDP or POMDP, well-studied algorithms exist for deriving a policy
• Weaknesses: State space kept small due to algorithmic limitations; selection is still mostly manual; no best practices; no domain-independent state variables; Markov assumption; not easily adjusted

Policy
• Strengths: Guaranteed to be optimal with respect to the data; can function as a black box
• Weaknesses: Takes control away from developers; not easily adjusted; no theoretical insight

Evaluation
• Strengths: Can be rapidly trained and tested with user models (see the sketch after this table)
• Weaknesses: Real user behavior may differ; testing on the same user model is cheating; any hand-crafted comparison policy should be tuned to the same objective function
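As a rough illustration of the Evaluation row, the sketch below scores an RL-derived policy and a hand-crafted baseline under the same objective function, against both the user model used for tuning and a held-out user model, so the comparison avoids testing on the same user model. The simulator, reward values, and placeholder policies are all invented for this example.

```python
import random

# Sketch of the Evaluation row: train/tune against one user model, but report
# results against a held-out user model, and score any hand-crafted baseline with
# the same objective function (here: -1 per turn, +20 for task success).
# The user simulator, reward values, and placeholder policies are invented.

def simulated_user(action, error_rate):
    """Toy user model: the chance a turn succeeds depends on the action taken."""
    base = {"ask": 0.5, "confirm": 0.7}.get(action, 0.3)
    return random.random() < base * (1.0 - error_rate)

def run_dialogue(policy, error_rate, max_turns=10):
    """Roll out one dialogue and return its total reward under the shared objective."""
    total = 0
    for _ in range(max_turns):
        total -= 1                      # per-turn cost
        if simulated_user(policy(), error_rate):
            return total + 20           # task success bonus
    return total                        # dialogue abandoned

def evaluate(policy, error_rate, episodes=2000):
    return sum(run_dialogue(policy, error_rate) for _ in range(episodes)) / episodes

rl_policy = lambda: "confirm"           # stand-in for a policy derived by RL
handcrafted_policy = lambda: "ask"      # stand-in for a designer's hand-crafted policy

# error_rate=0.2 plays the role of the training user model; error_rate=0.4 is a
# held-out user model, so we are not "testing on the same user model".
for name, policy in [("RL", rl_policy), ("hand-crafted", handcrafted_policy)]:
    print(name, evaluate(policy, 0.2), evaluate(policy, 0.4))
```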

Page 4:

Opportunities

Objective Function
• Open up black boxes and optimize ASR / SLU using the same objective function as the DM

Reward Function
• Inverse reinforcement learning
• Adapt reward function / policy based on user type / behavior (similar to adapting mixed initiative)

State Space & Transition Function
• Learn what state space variables are important for local / long-term reward
• Apply more efficient POMDP methods
• Identify domain-independent state variables
• Identify best practices

Policy
• Online policy learning (explore vs. exploit); see the sketch after this list
• Identify domain-independent, reusable error handling mechanisms

Evaluation
• Close the gap between user models and real user behavior
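One concrete reading of the "online policy learning (explore vs. exploit)" item is epsilon-greedy Q-learning run during deployment: the system mostly exploits its current estimates but occasionally explores, and it updates those estimates from live interactions. The sketch below is a generic, minimal version; the action names and constants are placeholders rather than anything the slides prescribe.

```python
import random

# Minimal epsilon-greedy Q-learning sketch for online policy learning.
# ACTIONS, ALPHA, GAMMA, and EPSILON are illustrative placeholders.

ACTIONS = ["ask", "confirm", "submit"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = {}  # (state, action) -> estimated long-term return

def choose_action(state):
    # Explore with probability epsilon, otherwise exploit the current estimates.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

def update(state, action, reward, next_state):
    # One-step Q-learning backup toward reward + gamma * max_a' Q(next_state, a').
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```

After each live turn, the dialogue manager would call choose_action with its current state and, once the turn's reward is observed, call update; tuning EPSILON trades the cost of exploration against the speed of policy improvement.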

Page 5:

Any comments?