Reinforcement Learning for Spoken Dialogue Systems:
Comparing Strengths & Weaknesses for Practical Deployment
Tim Paek
Microsoft Research
Dialogue on Dialogues Workshop 06
Reinforcement Learning for SDS
• Dialogue manager (DM) in spoken dialogue systems
– Selects actions | {observations & beliefs}
– Typically hand-crafted & knowledge-intensive
– RL seeks to formalize & optimize action selection
– Once the dialogue dynamics are represented as a Markov Decision Process (MDP), derive the optimal policy
• MDP in a nutshell
– Input tuple $(S, A, T, R)$ → Output policy $\pi : S \to A$
– Objective Function: $E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right],\; 0 \le \gamma < 1$
– Value Function: $V^{*}(s) = \max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^{*}(s')\right],\; \forall s \in S$ (a value-iteration sketch follows below)
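The Bellman recursion above can be solved by simple dynamic programming. Below is a minimal value-iteration sketch in Python, run on a toy three-state dialogue MDP; the states, transition probabilities, and rewards are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Solve V*(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') V*(s')].

    T: transitions, shape (S, A, S); R: rewards, shape (S, A).
    Returns the optimal value function and the greedy policy pi: S -> A.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Hypothetical 3-state, 2-action dialogue MDP:
# states: 0 = need-info, 1 = confirmed, 2 = done
# actions: 0 = ask, 1 = confirm
T = np.zeros((3, 2, 3))
T[0, 0] = [0.3, 0.7, 0.0]   # ask: usually elicits the slot value
T[0, 1] = [0.8, 0.2, 0.0]   # confirming too early mostly fails
T[1, 0] = [0.1, 0.9, 0.0]
T[1, 1] = [0.0, 0.1, 0.9]   # confirm: usually completes the task
T[2, :, 2] = 1.0            # done is absorbing
R = np.array([[-1.0, -2.0],  # per-turn costs
              [-1.0, -1.0],
              [ 0.0,  0.0]])
V, policy = value_iteration(T, R)
print("V* =", V, "policy =", policy)  # ask in state 0, confirm in state 1
```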
To what extent can RL really automate hand-crafted dialogue management? Is RL practical for speech application development?
Strengths & Weaknesses

Objective Function
  Strengths:
  • Optimization framework
  • Explicitly defined & adjustable
  • Can serve as evaluation metric
  Weaknesses:
  • Unclear what dialogues can & cannot be modeled using a specifiable objective function
  • Not easily adjusted

Reward Function
  Strengths:
  • Overall behavior very sensitive to changes in reward (illustrated in the sketch after this table)
  Weaknesses:
  • Mostly hand-crafted & tuned
  • Not easily adjusted

State Space & Transition Function
  Strengths:
  • Once modeled as an MDP or POMDP, well-studied algorithms exist for deriving a policy
  Weaknesses:
  • State space kept small due to algorithmic limitations
  • Selection is still mostly manual
  • No best practices
  • No domain-independent state variables
  • Markov assumption
  • Not easily adjusted

Policy
  Strengths:
  • Guaranteed to be optimal with respect to the data
  • Can function as a black box
  Weaknesses:
  • Takes control away from developers
  • Not easily adjusted
  • No theoretical insight

Evaluation
  Strengths:
  • Can be rapidly trained and tested with user models
  Weaknesses:
  • Real user behavior may differ
  • Testing on the same user model is cheating
  • A hand-crafted baseline policy should be tuned to the same objective function
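To make the reward-sensitivity point concrete, here is a minimal sketch with invented numbers (a toy one-step choice, not anything from the talk): the DM either confirms before acting or proceeds on the top ASR hypothesis, and scaling the misrecognition penalty flips the optimal action.

```python
# Hypothetical numbers illustrating reward sensitivity: the DM can either
# confirm (costing extra turns, then succeeding) or proceed on the top
# ASR hypothesis, which is correct with probability p.
p = 0.9
confirm_cost = -3.0   # a confirmation sub-dialogue costs a few turns
task_reward = 20.0    # reward for completing the task correctly

for error_penalty in (-5.0, -50.0):
    q_proceed = p * task_reward + (1 - p) * error_penalty
    q_confirm = confirm_cost + task_reward  # simplification: confirm always succeeds
    best = "proceed" if q_proceed > q_confirm else "confirm"
    print(f"penalty {error_penalty:6.1f}: Q(proceed)={q_proceed:5.1f} "
          f"Q(confirm)={q_confirm:5.1f} -> {best}")
```

With a penalty of -5 the policy proceeds; at -50 it confirms, even though nothing else about the domain changed.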
Opportunities
Objective Function
• Open up black boxes and optimize ASR / SLU using the same objective function as the DM
Reward Function
• Inverse reinforcement learning
• Adapt reward function / policy based on user type / behavior (similar to adapting mixed initiative; see the sketch below)
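A minimal sketch of the user-type adaptation idea, with invented reward parameters (nothing here comes from the talk): novices get a harsher misrecognition penalty, pushing the policy toward confirmations, while experts get a higher per-turn cost, pushing toward shorter dialogues.

```python
# Hypothetical per-user-type reward parameters.
REWARDS = {"novice": {"turn_cost": -1.0, "error_penalty": -50.0},
           "expert": {"turn_cost": -3.0, "error_penalty": -10.0}}

def dialogue_reward(user_type, n_turns, misrecognized):
    """Score a finished dialogue under the user type's reward parameters."""
    p = REWARDS[user_type]
    return n_turns * p["turn_cost"] + (p["error_penalty"] if misrecognized else 0.0)

print(dialogue_reward("novice", n_turns=6, misrecognized=False))  # -6.0
print(dialogue_reward("expert", n_turns=6, misrecognized=False))  # -18.0
```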
State Space & Transition Function
• Learn which state space variables are important for local / long-term reward
• Apply more efficient POMDP methods
• Identify domain-independent state variables
• Identify best practices
Policy
• Online policy learning (explore vs. exploit; see the sketch after this list)
• Identify domain-independent, reusable error handling mechanisms
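One concrete reading of the explore-vs-exploit bullet: update the policy online with tabular Q-learning and an ε-greedy action choice. A minimal sketch, assuming a hypothetical env interface (reset / step / actions) standing in for the live dialogue system:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.95):
    """One online episode of tabular Q-learning with epsilon-greedy exploration.

    `env` is an assumed interface: reset() -> state,
    step(action) -> (next_state, reward, done), plus a list env.actions.
    """
    s = env.reset()
    done = False
    while not done:
        # Explore with probability epsilon, otherwise exploit current Q.
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda act: Q[(s, act)])
        s2, r, done = env.step(a)
        # TD update toward the one-step bootstrapped target.
        target = r + (0.0 if done else gamma * max(Q[(s2, act)] for act in env.actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

Q = defaultdict(float)
# for dialogue in range(10_000): q_learning_episode(live_system, Q)  # hypothetical
```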
Evaluation
• Close the gap between user model and real user behavior (sketch below)
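To make the user-model caveats from the table concrete, here is a minimal sketch of keeping the training and evaluation simulators distinct, so the test is not "cheating"; the user-act distributions are invented for illustration.

```python
import random

class SimulatedUser:
    """Crude user model: answers each system act with a fixed
    distribution over user acts (hypothetical probabilities)."""
    def __init__(self, response_dist):
        self.response_dist = response_dist  # {system_act: [(user_act, prob), ...]}

    def respond(self, system_act):
        acts, probs = zip(*self.response_dist[system_act])
        return random.choices(acts, weights=probs)[0]

# Train against one user model, evaluate against a different one
# (or, better, real users):
train_user = SimulatedUser({"ask": [("inform", 0.8), ("silence", 0.2)],
                            "confirm": [("yes", 0.7), ("no", 0.3)]})
test_user  = SimulatedUser({"ask": [("inform", 0.6), ("off_topic", 0.4)],
                            "confirm": [("yes", 0.5), ("no", 0.5)]})
# policy = train(policy, train_user); evaluate(policy, test_user)  # hypothetical
```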
Any comments?