Page 1

Transformer

Applied Deep Learning

April 7th, 2020 http://adl.miulab.tw

Page 2

Sequence Encoding: Basic Attention

Page 3

Representations of Variable Length Data

◉ Input: word sequence, image pixels, audio signal, click logs

◉ Property: continuity, temporal, importance distribution

◉ Example

✓ Basic combination: average, sum

✓ Neural combination: network architectures should consider input domain properties

− CNN (convolutional neural network)

− RNN (recurrent neural network): temporal information


Page 4

Recurrent Neural Networks

◉ Learning variable-length representations

✓ Fit for sentences and sequences of values

◉ Sequential computation makes parallelization difficult

◉ No explicit modeling of long and short range dependencies

[Figure: a stacked RNN processing the tokens "have", "a", "nice" one step at a time.]

Page 5

Convolutional Neural Networks

◉ Easy to parallelize

◉ Exploit local dependencies

✓ Long-distance dependencies require many layers

[Figure: a convolutional encoder over "have", "a", "nice"; convolutions feed position-wise FFNNs and can run in parallel.]

Page 6

Attention

◉ Encoder-decoder model is important in NMT

◉ RNNs need an attention mechanism to handle long-range dependencies

◉ Attention allows us to access any state

[Figure: an RNN encoder reads the source sentence 深度學習 ("deep learning") into hidden states ℎ1–ℎ4; ℎ4, carrying information about the whole sentence, is fed to the RNN decoder, which generates "deep", "learning", "<END>".]

Page 7

Machine Translation with Attention

[Figure: machine translation with attention. The decoder state 𝑧0 acts as the query and is matched against the encoder states ℎ1–ℎ4 (the keys); a softmax turns the match scores into attention weights 𝛼01–𝛼04 (e.g., 0.5, 0.5, 0.0, 0.0), and the context vector 𝑐0 is the corresponding weighted sum of ℎ1–ℎ4 (the values).]

Page 8

Dot-Product Attention

◉ Input: a query 𝑞 and a set of key-value (𝑘-𝑣) pairs, mapped to an output

◉ Output: weighted sum of values

✓ Query 𝑞 is a 𝑑𝑘-dim vector

✓ Key 𝑘 is a 𝑑𝑘-dim vector

✓ Value 𝑣 is a 𝑑𝑣-dim vector

The attention weight on each value is the softmax-normalized inner product of the query and the corresponding key:

𝐴(𝑞, 𝐾, 𝑉) = Σᵢ [exp(𝑞 ⋅ 𝑘ᵢ) / Σⱼ exp(𝑞 ⋅ 𝑘ⱼ)] 𝑣ᵢ
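As a concrete reference for the definition above, here is a minimal NumPy sketch of dot-product attention for a single query; the function name and toy shapes are illustrative, not from the slides.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Weighted sum of values, weighted by softmax(q · k_i) over all keys.
    q: (d_k,), K: (n, d_k), V: (n, d_v) -> output: (d_v,)"""
    scores = K @ q                        # inner product of the query with every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the n key positions
    return weights @ V                    # weighted sum of the value vectors

# toy example: 3 key-value pairs, d_k = d_v = 4
q, K, V = np.random.randn(4), np.random.randn(3, 4), np.random.randn(3, 4)
print(dot_product_attention(q, K, V).shape)  # (4,)
```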

Page 9

Dot-Product Attention in Matrix

◉ Input: a matrix of queries 𝑄 and a set of key-value (𝑘-𝑣) pairs, mapped to a set of outputs

◉ Output: a set of weighted sums of the values

𝐴(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾ᵀ)𝑉, with the softmax applied row-wise
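A matching NumPy sketch of the matrix form, with the row-wise softmax made explicit (shapes are illustrative).

```python
import numpy as np

def attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T) V, softmax applied row-wise.
    Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> (m, d_v)"""
    scores = Q @ K.T                                  # (m, n): every query against every key
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # one weighted sum of values per query

Q, K, V = np.random.randn(5, 8), np.random.randn(7, 8), np.random.randn(7, 16)
print(attention(Q, K, V).shape)  # (5, 16)
```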

Page 10

Sequence Encoding: Self-Attention

Page 11

Attention

◉ Encoder-decoder model is important in NMT

◉ RNNs need an attention mechanism to handle long-range dependencies

◉ Attention allows us to access any state

Idea: use attention to replace the recurrent architecture

Page 12

Self-Attention

◉ Constant “path length” between any two positions

◉ Easy to parallelize

[Figure: self-attention over "have", "a", "nice", "day": each position is compared (cmp) with the others, the comparison scores weight a sum of the representations, and a position-wise FFNN is applied.]

Page 13

Transformer Idea

[Figure: the Transformer idea. The encoder stacks self-attention and feed-forward NN layers; the decoder stacks self-attention, encoder-decoder attention, and feed-forward NN layers, followed by a softmax at the output.]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

Page 14

Encoder Self-Attention (Vaswani+, 2017)

[Figure: encoder self-attention. Each input position is linearly projected into a query (MatMul Q), a key (MatMul K), and a value (MatMul V); dot products of the query with every key are normalized by a softmax, and the resulting weights form a weighted sum of the values that is fed to the FFNN.]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

Page 15

Decoder Self-Attention (Vaswani+, 2017)

[Figure: decoder self-attention, with the same query/key/value projections and softmax-weighted sum; each position attends only to itself and earlier positions.]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
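The slide does not spell out the masking, but in Vaswani et al. the decoder's self-attention is masked so that a position cannot attend to later positions. A minimal NumPy sketch of that causal mask (the function name and the -1e9 constant are illustrative):

```python
import numpy as np

def masked_self_attention(X):
    """Decoder self-attention sketch: position i attends only to positions <= i."""
    n = X.shape[0]
    scores = X @ X.T                                # (n, n) self-attention scores
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)         # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X                              # each row mixes only earlier positions

out = masked_self_attention(np.random.randn(4, 8))  # 4 decoder positions, d = 8
```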

Page 16

Sequence Encoding: Multi-Head Attention

Page 17

Convolutions

[Figure: a convolution centered on "kicked" in "I kicked the ball"; different relative positions pick out who, did what, and to whom.]

Page 18

Self-Attention

[Figure: self-attention from "kicked" over "I kicked the ball" as one weighted average.]

Page 19

Attention Head: who

[Figure: an attention head from "kicked" over "I kicked the ball" that picks out who ("I").]

Page 20

Attention Head: did what

[Figure: an attention head from "kicked" over "I kicked the ball" that picks out did what.]

Page 21

Attention Head: to whom

[Figure: an attention head from "kicked" over "I kicked the ball" that picks out to whom ("the ball").]

Page 22

Multi-Head Attention

[Figure: multi-head attention from "kicked" over "I kicked the ball": separate heads capture who, did what, and to whom in parallel.]

Page 23

Comparison

◉ Convolution: different linear transformations by relative positions

◉ Attention: a weighted average

◉ Multi-Head Attention: parallel attention layers with different linear transformations on input/output

[Figure: "I kicked the ball" centered on "kicked" under convolution, attention, and multi-head attention.]

Page 24

Sequence Encoding: Transformer

Page 25

Transformer Overview

◉ Non-recurrent encoder-decoder for MT

◉ PyTorch explanation by Sasha Rush

http://nlp.seas.harvard.edu/2018/04/03/attention.html

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

Page 26

Transformer Overview

◉ Non-recurrent encoder-decoder for MT

◉ PyTorch explanation by Sasha Rush

http://nlp.seas.harvard.edu/2018/04/03/attention.html

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

[Figure: Transformer architecture with the attention sublayers highlighted: multi-head attention in the encoder, masked multi-head attention over the decoder inputs, and encoder-decoder multi-head attention in the decoder.]

Page 27

Multi-Head Attention

◉ Idea: allow words to interact with one another

◉ Model

− Map V, K, Q to lower dimensional spaces

− Apply attention, concatenate outputs

− Linear transformation (see the sketch below)

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
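A hedged PyTorch sketch of the three steps above (project Q/K/V into lower-dimensional heads, attend, concatenate, apply a final linear map). The class name and default sizes (d_model = 512, h = 8) follow the base model in the paper but are otherwise assumptions; see the Annotated Transformer linked earlier for the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Sketch: project Q/K/V to lower-dim heads, attend per head,
    concatenate the head outputs, then apply a final linear map."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, T, _ = q.shape
        # (B, T, d_model) -> (B, h, T, d_k): one lower-dimensional space per head
        def split(x):
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5     # scaled dot-product
        out = F.softmax(scores, dim=-1) @ V                    # (B, h, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concatenate heads
        return self.w_o(out)                                   # final linear transformation
```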

Page 28

Scaled Dot-Product Attention

◉ Problem: when 𝑑𝑘 gets large, the variance of 𝑞𝑇𝑘 increases

→ some values inside softmax get large

→ the softmax gets very peaked

→ hence its gradient gets smaller

◉ Solution: divide the dot products by √𝑑𝑘, the square root of the query/key dimension

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
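A small NumPy experiment illustrating the problem and the fix (the sampling setup is illustrative): for random 𝑞, 𝑘 with unit-variance components, Var(𝑞ᵀ𝑘) grows roughly linearly with 𝑑𝑘, and dividing by √𝑑𝑘 brings it back to about 1, which keeps the softmax inputs in a range where gradients do not vanish.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_variance(d_k, scaled, n=10000):
    """Empirical variance of q·k for n random query/key pairs of dimension d_k."""
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    scores = (q * k).sum(axis=1)
    if scaled:
        scores = scores / np.sqrt(d_k)   # the scaling used in scaled dot-product attention
    return scores.var()

for d_k in (4, 64, 512):
    print(d_k, round(dot_variance(d_k, scaled=False), 1), round(dot_variance(d_k, scaled=True), 2))
# Unscaled variance grows roughly as d_k; after dividing by sqrt(d_k) it stays near 1.
```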

Page 29

Transformer Overview

◉ Non-recurrent encoder-decoder for MT

◉ PyTorch explanation by Sasha Rush

http://nlp.seas.harvard.edu/2018/04/03/attention.html

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

[Figure: encoder block detail: multi-head attention and feed-forward sublayers, each followed by Add & Norm.]

Page 30

Transformer Encoder Block

◉ Each block has

− multi-head attention

− 2-layer feed-forward NN (w/ ReLU)

◉ Both parts contain

− Residual connection

− Layer normalization (LayerNorm)

Change input to have 0 mean and 1 variance per layer & per training point

→ LayerNorm(x + sublayer(x))

[Figure: encoder block: multi-head attention → Add & Norm → feed-forward → Add & Norm.]

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
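A hedged PyTorch sketch of one encoder block as described above, using nn.MultiheadAttention in place of a hand-written attention layer; the sizes (d_model = 512, d_ff = 2048, 8 heads, dropout 0.1) follow the base model in Vaswani et al. but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head attention and a 2-layer position-wise
    FFN (ReLU), each wrapped as LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        a, _ = self.attn(x, x, x)             # multi-head self-attention sublayer
        x = self.norm1(x + self.drop(a))      # residual connection + LayerNorm
        f = self.ffn(x)                       # position-wise feed-forward sublayer
        return self.norm2(x + self.drop(f))   # residual connection + LayerNorm
```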

Page 31

Encoder Input

◉ Problem: temporal information is missing

◉ Solution: positional encoding allows words at different locations to have different embeddings with fixed dimensions

[Figure: the positional encoding is added to the input embeddings before the encoder block (multi-head attention → Add & Norm → feed-forward → Add & Norm).]

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
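The slide does not give the encoding itself; the paper uses fixed sinusoidal encodings, so each position receives a distinct vector of the same dimension as the word embedding. A NumPy sketch (max_len and d_model values are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from Vaswani et al. (2017), assuming even d_model:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                     # added to the word embeddings

pe = positional_encoding(max_len=50, d_model=512)  # each position gets a distinct, fixed-size vector
```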

Page 32

Page 33

Multi-Head Attention Details

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4

Page 34

Training Tips

◉ Byte-pair encodings (BPE)

◉ Checkpoint averaging

◉ Adam optimizer with a warm-up-then-decay learning-rate schedule (see the sketch after this list)

◉ Dropout during training at every layer, applied just before adding the residual

◉ Label smoothing

◉ Auto-regressive decoding with beam search and length penalties
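For the Adam learning-rate schedule above, the paper increases the learning rate linearly during a warm-up phase and then decays it with the inverse square root of the step number. A small sketch (warmup = 4000 follows the paper; the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Warm-up then inverse-square-root decay used with Adam in Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)"""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate rises linearly for the first `warmup` steps, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 20000):
    print(s, round(transformer_lr(s), 6))
```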

Page 35

MT Experiments

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

Page 36

Parsing Experiments

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.

Page 37

Concluding Remarks

◉ The non-recurrent model is easy to parallelize

◉ Multi-head attention captures different aspects of the interactions between words

◉ Positional encoding captures location information

◉ Each transformer block can be applied to diverse tasks
