Using Deep Reinforcement Learning for Dialogue Systems
Harm van Seijen, Research Scientist, Montréal, Canada
spoken dialogue system
[diagram: pipeline of natural language understanding → state tracker → policy manager → natural language generation, connected to a data store]
user input: "Hi, do you know a good Indian restaurant?"
user act: inform(food="Indian")
system act: request(price_range)
system response: "Sure. What price range are you thinking of?"
The central question: how to train the policy manager?
outline
1. what is reinforcement learning
2. solution strategies for RL
3. applying RL to dialogue systems
what is reinforcement learning
Reinforcement Learning is a data-driven approach towards learning behaviour.
machine learning comprises unsupervised learning, supervised learning, and reinforcement learning; deep learning can be combined with each of them
reinforcement learning + deep learning = deep reinforcement learning
RL vs supervised learning
behaviour: function that maps environment states to actions
supervised learning: hard to specify the function, easy to identify the correct output
example: recognizing cats in images, where a function f maps an image to cat / no cat
RL vs supervised learning
reinforcement learning: hard to specify the function, hard to identify the correct output, easy to specify the behaviour goal
example: double inverted pendulum
state: θ1, θ2, ω1, ω2
action: clockwise/counter-clockwise torque on the top joint
goal: balance the pendulum upright
advantages of RL
does not require knowledge of a good policy
does not require labelled data
online learning: adapts to environment changes
challenges of RL
requires lots of data
the sample distribution changes during learning
samples are not i.i.d.
outline
1. what is reinforcement learning
2. solution strategies for RL
3. applying RL to dialogue systems
finding the optimal policy
Q-learning: classical RL algorithm that combines (partial) policy evaluation with (partial) policy improvement
update target: r + γ max_a' Q(s', a')
policy evaluation: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]
policy improvement: act greedily (or ε-greedily) with respect to the current Q estimates
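A minimal sketch of the Q-learning update on a hypothetical two-state MDP (the toy dynamics and all names are illustrative; sweeping every state-action pair replaces the usual exploratory sampling so the example stays deterministic):

```python
from collections import defaultdict

# Hypothetical toy MDP: states 0 and 1; action 1 moves to state 1,
# which yields reward 1. Dynamics invented purely for illustration.
ALPHA, GAMMA = 0.5, 0.9
ACTIONS = (0, 1)
Q = defaultdict(float)  # maps (state, action) -> value estimate

def step(state, action):
    """Toy deterministic dynamics: action 1 always leads to state 1."""
    next_state = 1 if action == 1 else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def q_update(state, action, reward, next_state):
    # update target: r + gamma * max_a' Q(s', a')   (partial policy evaluation)
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# Sweep all state-action pairs; a real agent would instead sample them
# by acting (e.g. epsilon-greedily) in the environment.
for _ in range(100):
    for state in (0, 1):
        for action in ACTIONS:
            next_state, reward = step(state, action)
            q_update(state, action, reward, next_state)

# policy improvement: act greedily with respect to the learned Q values
greedy_action = max(ACTIONS, key=lambda a: Q[(0, a)])
assert greedy_action == 1  # the rewarding action is learned to be best
```

Note how the same table of Q values drives both steps: the max in the update target evaluates the greedy policy, and acting greedily on the table improves it.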
deep reinforcement learning
A 2015 Nature paper from DeepMind introduced an RL method based on deep learning, called DQN.
main result: with the same network architecture, it learned to play a large number of Atari 2600 games effectively
DQN characteristics
a variation on Q-learning that uses deep neural networks to approximate the Q function
uses experience replay to deal with non-i.i.d. samples
uses two networks (Q and Q') to mitigate the non-stationarity of update targets
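The experience-replay idea can be sketched as a buffer of past transitions sampled uniformly at random (a hedged sketch with illustrative names; the neural networks themselves are omitted):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer: stores transitions and returns
    uniformly sampled mini-batches, breaking the correlation between
    consecutive samples along a trajectory."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling decorrelates the batch from the current trajectory
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):              # push more transitions than the capacity
    buf.push(t, t % 2, 0.0, t + 1, False)
assert len(buf) == 100            # capacity caps the buffer size
batch = buf.sample(32)
assert len(batch) == 32           # a mini-batch of random past transitions
```

In DQN this buffer sits between the environment and the learner: each interaction is pushed, and gradient updates are computed on sampled batches rather than on the latest transition.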
outline
1. what is reinforcement learning
2. solution strategies for RL
3. applying RL to dialogue systems
applying RL to dialogue systems
training the dialogue manager requires a huge number of online samples; hence a user simulator, trained on offline data, is used to train the dialogue manager
[diagram: training loop in which the user simulator (trained on offline data) emits a dialogue act, the state tracker updates the dialogue state, and the policy manager replies with a system act fed back to the simulator]
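The simulator-in-the-loop setup can be sketched with stub components (everything here, including the act strings, is hypothetical; the real policy manager would be updated by an RL algorithm rather than acting randomly, and the real simulator would be a model trained on offline data):

```python
import random

random.seed(1)  # illustrative acts only; not a real dialogue-act taxonomy
USER_ACTS = ["inform(food=indian)", "inform(price=cheap)", "bye()"]
SYSTEM_ACTS = ["request(price_range)", "offer(restaurant)", "confirm(food)"]

def user_simulator(system_act):
    """Stub simulator: in practice this model is trained on offline data."""
    return random.choice(USER_ACTS)

def policy_manager(dialogue_state):
    """Placeholder policy: the component RL is meant to optimize."""
    return random.choice(SYSTEM_ACTS)

dialogue_state = []           # the state tracker's running turn history
system_act = "request(food)"  # opening system act
for turn in range(10):
    user_act = user_simulator(system_act)
    dialogue_state.append(user_act)          # crude state tracking
    system_act = policy_manager(dialogue_state)
    if user_act == "bye()":                  # simulated user hangs up
        break
```

Because the simulator is cheap to query, the loop can run for the millions of turns that online RL training needs without involving real users.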
deep RL for dialogue systems
the exact state is not observed, hence a belief state is used
belief-state spaces are typically discretized into summary state spaces to make the task tractable
deep RL can be applied directly to the belief-state space due to its strong generalization properties
with pre-training, a deep RL method can become even more efficient
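As an illustration of the discretization step that deep RL lets one skip, here is a hypothetical mapping from one slot's belief distribution to a summary state (top hypothesis plus a confidence bucket); the slot values and bin edges are invented for the example:

```python
def summarize(belief, bins=(0.3, 0.6, 0.8)):
    """Map a slot's belief distribution (value -> probability) to a
    summary state: (most likely value, confidence bucket 0..len(bins))."""
    top_value = max(belief, key=belief.get)       # most probable hypothesis
    confidence = belief[top_value]
    bucket = sum(confidence >= b for b in bins)   # count thresholds cleared
    return top_value, bucket

# hypothetical belief over the "food" slot after one user turn
belief_food = {"indian": 0.7, "italian": 0.2, "none": 0.1}
assert summarize(belief_food) == ("indian", 2)    # 0.7 clears bins 0.3 and 0.6
```

A deep RL policy can instead consume the full probability vector directly, so no information is thrown away by the bucketing.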
summary
RL is a data-driven approach towards learning behaviour
RL does not require knowledge of a good policy
RL can be used for online learning
combining RL with deep learning means RL can be applied to much bigger problems
constructing a good policy for a modern dialogue manager is a challenging task
deep RL is the perfect candidate to address this challenge
Further reading:
“Introduction to Reinforcement Learning” by Richard S. Sutton & Andrew G. Barto https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
“Algorithms for Reinforcement Learning” by Csaba Szepesvári https://sites.ualberta.ca/~szepesva/RLBook.html
“Policy Networks with Two-Stage Training for Dialogue Systems” by Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, Kaheer Suleman https://arxiv.org/abs/1606.03152
Code examples:
simple DQN example in Python: https://edersantana.github.io/articles/keras_rl/
tool for testing/developing RL algorithms: https://gym.openai.com/