Sequence(s)-to-Sequence Transformations in Text Processing

Narada Warakagoda, Universitetet i Oslo

Page 1:

Sequence(s)-to-Sequence Transformations in Text Processing

Narada Warakagoda

Page 2:

Seq2seq Transformation

Model

Variable length input

Variable length output

Page 3:

Example Applications

● Summarization (extractive/abstractive)
● Machine translation
● Dialog systems / chatbots
● Text generation
● Question answering

Page 4:

Seq2seq Transformation

Model size should be constant.

Model

Variable length input

Variable length output

Solution: apply a constant-sized neural net module repeatedly to the data

Page 5:

Possible Approaches

● Recurrent networks
  ● Apply the NN module in a serial fashion
● Convolutional networks
  ● Apply the NN module in a hierarchical fashion
● Self-attention
  ● Apply the NN module in a parallel fashion

Page 6:

Processing Pipeline

Decoder

Variable length input

Variable length output

Encoder

Intermediate representation

Page 7:

Processing Pipeline

Intermediate representation

Decoder

Variable length output

Variable length input

Encoder

Variable length text

Embedding

Attention

Page 8:

Architecture Variants

Encoder                                    Decoder                                    Attention
Recurrent net                              Recurrent net                              No
Recurrent net                              Recurrent net                              Yes
Convolutional net                          Convolutional net                          No
Convolutional net                          Recurrent net                              Yes
Convolutional net                          Convolutional net                          Yes
Fully connected net with self-attention    Fully connected net with self-attention    Yes

Page 9:

Possible Approaches

● Recurrent networks
  ● Apply the NN module in a serial fashion
● Convolutional networks
  ● Apply the NN module in a hierarchical fashion
● Self-attention
  ● Apply the NN module in a parallel fashion

Page 10:

RNN-decoder with RNN-encoder

[Figure: RNN encoder-decoder (each box is an RNN cell). The encoder reads "Tusen Takk <end>"; the decoder, fed "<start> Thanks Very Much", predicts the next word at every step through a softmax layer.]
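A minimal sketch of such an encoder-decoder pair, assuming PyTorch; the class names, vocabulary sizes and dimensions are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        outputs, h = self.rnn(self.embed(src))   # h: (1, batch, hid_dim)
        return outputs, h                        # h summarizes the source sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # followed by a softmax over the vocabulary

    def forward(self, prev_tokens, h):           # prev_tokens: (batch, 1) previous word ids
        output, h = self.rnn(self.embed(prev_tokens), h)
        logits = self.out(output)                # (batch, 1, vocab_size)
        return logits, h
```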

Page 11:

RNN-dec with RNN-enc, Training

[Figure: training. The encoder reads "Tusen Takk <end>"; the decoder is fed the ground-truth words "<start> Thanks Very Much" as inputs, and its softmax outputs are scored against the ground truths "Thanks Very Much <end>".]
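A sketch of one such teacher-forcing training step, reusing the Encoder and Decoder classes sketched above; the optimizer and the <start>/<end>-shifted target sequences are assumed to be prepared by the caller:

```python
import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, src, tgt_in, tgt_out):
    """src: source ids; tgt_in: <start> + reference; tgt_out: reference + <end>."""
    optimizer.zero_grad()
    _, h = encoder(src)
    loss = 0.0
    for t in range(tgt_in.size(1)):
        logits, h = decoder(tgt_in[:, t:t+1], h)            # ground-truth word fed in
        loss = loss + F.cross_entropy(logits.squeeze(1), tgt_out[:, t])
    loss.backward()
    optimizer.step()
    return loss.item()
```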

Page 12:

RNN-dec with RNN-enc, Decoding

[Figure: decoding. The encoder reads "Tusen Takk <end>"; at each step the decoder is fed the word it generated in the previous step, here producing "Thanks Much Very <end>".]

Greedy Decoding
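A sketch of this greedy decoding loop, where the arg-max word from each softmax is fed back as the next decoder input; the token ids and the Encoder/Decoder sketch above are assumed:

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src, start_id, end_id, max_len=50):
    _, h = encoder(src)                          # src: (1, src_len)
    token = torch.tensor([[start_id]])
    result = []
    for _ in range(max_len):
        logits, h = decoder(token, h)
        token = logits.argmax(dim=-1)            # best word at this step
        if token.item() == end_id:
            break
        result.append(token.item())
    return result
```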

Page 13:

Decoding Approaches

● Optimal decoding
● Greedy decoding
  ● Easy
  ● Not optimal
● Beam search
  ● Closer to the optimal decoder
  ● Choose the top N candidates instead of only the best one at each step.

Page 14:

Beam Search Decoding

Beam Width = 3
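A sketch of beam search with a beam width of 3, under the same assumptions as the previous sketches: at every step the top N partial hypotheses, ranked by accumulated log-probability, are kept instead of only the single best word:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(encoder, decoder, src, start_id, end_id, beam_width=3, max_len=50):
    _, h = encoder(src)
    beams = [([start_id], 0.0, h)]                    # (tokens, log-prob, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            if tokens[-1] == end_id:                  # hypothesis already complete
                finished.append((tokens, score))
                continue
            logits, new_state = decoder(torch.tensor([[tokens[-1]]]), state)
            log_probs = F.log_softmax(logits.squeeze(), dim=-1)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, new_state))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend((t, s) for t, s, _ in beams if t[-1] != end_id)
    return max(finished, key=lambda c: c[1])[0]       # best-scoring hypothesis
```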

Page 15:

Straight-forward Extensions

[Figure: drop-in replacements for the basic cell, each mapping (current state, current input) to the next state: the plain RNN cell; the LSTM cell, which additionally carries a control (cell) state from step to step; the bidirectional cell, with a second state running in the opposite direction; and the stacked cell, where the state of one layer feeds the layer above.]

Page 16:

RNN-decoder with RNN-encoder with Attention

[Figure: the same RNN encoder-decoder ("Tusen Takk <end>" in, "Thanks Very Much" out, each box an RNN cell), but at every decoder step the encoder states are combined ("+") into a context vector that is fed to the decoder RNN cell together with the previous word.]

Page 17:

Attention

● Context is given by $c_i = \sum_j a_{ij} h_j$, where the $h_j$ are the encoder states.
● Attention weights $a_{ij}$ are dynamic.
● Generally defined by $a_{ij} = \mathrm{softmax}_j(e_{ij})$ with $e_{ij} = f(s_{i-1}, h_j)$, where $s_{i-1}$ is the previous decoder state and the function f can be defined in several ways:
  ● Dot product: $f(s, h) = s^\top h$
  ● Weighted dot product: $f(s, h) = s^\top W h$
  ● Another MLP (e.g. 2 layers): $f(s, h) = v^\top \tanh(W_1 s + W_2 h)$
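A minimal sketch of this context computation for the plain dot-product score, assuming PyTorch; the function name and batch-first shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def attention_context(s, H):          # s: (batch, hid) decoder state, H: (batch, src_len, hid)
    scores = torch.bmm(H, s.unsqueeze(2)).squeeze(2)    # e_j = s . h_j
    a = F.softmax(scores, dim=1)                        # attention weights a_j
    context = torch.bmm(a.unsqueeze(1), H).squeeze(1)   # c = sum_j a_j h_j
    return context, a
```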

Page 18:

Attention

[Figure: the context vector (weighted sum of encoder states, "+") is combined with the input to the decoder RNN cell.]

Page 19:

Example: Google Neural Machine Translation

Page 20:

Possible Approaches

● Recurrent networks
  ● Apply the NN module in a serial fashion
● Convolutional networks
  ● Apply the NN module in a hierarchical fashion
● Self-attention
  ● Apply the NN module in a parallel fashion

Page 21:

Why Convolution

● Recurrent networks are serial
  ● Cannot be parallelized
  ● The "distance" between a feature vector and the different inputs is not constant
● Convolutional networks
  ● Can be parallelized (faster)
  ● The "distance" between a feature vector and the different inputs is constant

Page 22:

Long range dependency capture with conv nets

[Figure: a convolution with kernel width k sees k inputs directly; stacking n such layers lets a single feature vector depend on roughly n·k inputs, so long-range dependencies are captured hierarchically.]

Page 23:

Conv net, Recurrent net with Attention

Gehring et al., A Convolutional Encoder Model for Neural Machine Translation, 2016

[Figure (from the cited paper): two convolutional encoders over the source, CNN-a and CNN-c, produce the states z_j used for the attention scores and the attention values respectively; the RNN decoder with states h_i computes attention weights a_{i,j} and contexts c_i, scoring against d_i = W h_i + g_i, where g_i is the embedding of the previous target word.]

Page 24:

Two conv nets with attention

Gehring et al., Convolutional Sequence to Sequence Learning, 2017

[Figure (from the cited paper): both encoder and decoder are convolutional. The encoder produces states z_j from the source embeddings e_j; the decoder states h_i are combined with the previous target embeddings g_i through the weight matrices W and W_d to give d_i, which are scored against z_j to obtain attention weights a_{i,j}; the resulting contexts c_i feed the prediction of the next word.]

Page 25:

Possible Approaches

● Recurrent networks
  ● Apply the NN module in a serial fashion
● Convolutional networks
  ● Apply the NN module in a hierarchical fashion
● Self-attention
  ● Apply the NN module in a parallel fashion

Page 26:

Why Self-attention

● Recurrent networks are serial
  ● Cannot be parallelized
  ● The "distance" between a feature vector and the different inputs is not constant
● Self-attention networks
  ● Can be parallelized (faster)
  ● The "distance" between a feature vector and the different inputs does not depend on the input length

Page 27:

FCN with self-attention

[Figure: the Transformer architecture. The encoder takes the inputs, the decoder takes the previous words, and the output is the probability distribution of the next word.]

Vaswani et al., Attention Is All You Need, 2017

Page 28:

Scaled dot product attention

Query Keys Values
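This is the attention $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$ of Vaswani et al. (2017). A minimal sketch assuming PyTorch; the shapes and the optional mask argument are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # block disallowed positions
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
```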

Page 29:

Multi-Head Attention
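As defined in Vaswani et al. (2017), several scaled dot-product attentions run in parallel on learned projections and their outputs are concatenated:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^Q,\; K W_i^K,\; V W_i^V\right)$$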

Page 30:

Encoder Self-attention

Self Attention

Page 31:

Decoder Self-attention

• Almost the same as encoder self-attention
• But only leftward positions (the previous words) are attended
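A sketch of this "leftward only" restriction as a causal mask, assuming PyTorch; it can be passed to the scaled dot-product attention sketch above:

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed (position j <= i), False for future positions
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```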

Page 32:

Encoder-decoder attention

Encoder states Decoder state

Page 33:

Overall Operation

[Figure: overall decoding operation of the full model: the previous words are fed in and the next word is predicted, one position at a time.]

Philipp Koehn, Neural Machine Translation

Page 34:

Reinforcement Learning

● Machine Translation / Summarization
● Dialog Systems

Page 35:

Reinforcement Learning

● Machine Translation / Summarization
● Dialog Systems

Page 36:

Why Reinforcement Learning

● Exposure bias
  ● In training, the ground truths are used. In testing, the word generated in the previous step is used to generate the next word.
  ● Using generated words in training requires sampling: non-differentiable
● The maximum likelihood criterion is not directly relevant to the evaluation metrics
  ● BLEU (machine translation)
  ● ROUGE (summarization)
  ● Using BLEU/ROUGE in training: non-differentiable

Page 37:

Sequence Generation as Reinforcement Learning

● Agent: the recurrent net
● State: hidden layers, attention weights, etc.
● Action: the next word
● Policy: generate the next word (action) given the current hidden layers and attention weights (state)
● Reward: score computed using the evaluation metric (e.g. BLEU)

Page 38:

Maximum Likelihood Training (Revisit)

Minimize the negative log likelihood
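In standard notation, with ground-truth words $y^*_1, \dots, y^*_T$ and input $x$, the criterion minimized here is

$$L_{ML}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x\right)$$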

Page 39:

Reinforcement Learning Formulation

Minimize the expected negative reward, using the REINFORCE algorithm
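In the same notation, with $y^s$ a word sequence sampled from the model distribution $p_\theta$ and $r(\cdot)$ the reward (e.g. BLEU), the objective becomes

$$L_{RL}(\theta) = -\,\mathbb{E}_{y^s \sim p_\theta}\!\left[r(y^s)\right]$$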

Page 40:

Reinforcement Learning Details

● Expected reward

● We need the gradient

● Need to write this as an expectation, so that we can evaluate it using samples. Use the log derivative trick:

● This is an expectation

● Approximate this with sample mean

● In practice we use only one sample
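Spelling out these steps, the log-derivative trick turns the gradient into an expectation that can be approximated with the sample mean, and in practice with a single sample:

$$\nabla_\theta L_{RL}(\theta) = -\sum_{y} r(y)\, \nabla_\theta p_\theta(y) = -\sum_{y} r(y)\, p_\theta(y)\, \nabla_\theta \log p_\theta(y) = -\,\mathbb{E}_{y^s \sim p_\theta}\!\left[r(y^s)\, \nabla_\theta \log p_\theta(y^s)\right] \approx -\, r(y^s)\, \nabla_\theta \log p_\theta(y^s)$$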

Page 41:

Reinforcement Learning Details

● Gradient
● This estimate has high variance. Use a baseline to combat this problem.
● The baseline can be anything independent of the sampled sequence.
● It can, for example, be estimated as the reward of the word sequence generated using argmax at each step.
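With such a baseline $b$, for example $b = r(\hat{y})$ where $\hat{y}$ is the argmax-decoded sequence, the single-sample gradient estimate becomes

$$\nabla_\theta L_{RL}(\theta) \approx -\left(r(y^s) - b\right)\, \nabla_\theta \log p_\theta(y^s)$$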

Page 42:

Reinforcement Learning

● Machine Translation / Summarization
● Dialog Systems

Page 43:

Maximum Likelihood Dialog Systems

[Figure: seq2seq dialog model. The encoder reads "How Are You?"; the decoder, fed "<start> I Am", generates the response "I Am Fine".]

Page 44:

Why Reinforcement Learning

● The maximum likelihood criterion is not directly relevant to successful dialogs
  ● Dull responses ("I don't know")
  ● Repetitive responses
● Need to integrate developer-defined rewards relevant to the longer-term goals of the dialog

Page 45:

Dialog Generation as Reinforcement Learning

● Agent: the recurrent net
● State: previous dialog turns
● Action: the next dialog utterance
● Policy: generate the next dialog utterance (action) given the previous dialog turns (state)
● Reward: score computed based on relevant factors such as ease of answering, information flow, semantic coherence, etc.

Page 46:

Training Setup

[Figure: two agents, each consisting of its own encoder and decoder, take turns generating utterances to each other during training.]

Page 47:

Training Procedure

● From the viewpoint of a given agent, the procedure is similar to that of sequence generation
  ● REINFORCE algorithm
● Appropriate rewards must be calculated based on the current and previous dialog turns.
● Can be initialized with maximum-likelihood-trained models.

Page 48:

Adversarial Learning

● Use a discriminator, as in GANs, to calculate the reward
● Same REINFORCE-based training procedure for the generator

[Figure: the discriminator scores generated dialog against human dialog to produce the reward.]

Page 49:

Question Answering

● Slightly different from the sequence-to-sequence model.

[Figure: the model takes two variable-length inputs, a question/query and a passage/document/context, and produces a fixed-length output: a single-word answer or the start and end points of the answer within the passage.]

Page 50:

QA- Naive Approach

● Combine the question and the passage and use an RNN to classify it.
● Will not work, because the relationship between the passage and the question is not adequately captured.

[Figure: the model takes the concatenated question and passage as a single variable-length input and produces a fixed-length output (a single-word answer or the start and end points of the answer).]

Page 51:

QA- More Successful Approach

● Use attention between the question and the passage
  ● Bi-directional attention, co-attention
● Temporal relationship modeling
● Classification, or prediction of the start and end points of the answer within the passage

Page 52:

QA Example with Bi-directional Attention

Seo M. et al., Bi-Directional Attention Flow for Machine Comprehension