
Page 1: Transformers and Attention
ID2223 Scalable Machine Learning and Deep Learning

Karl Fredrik Erliksson, Industrial PhD Candidate, KTH and Peltarion
Peltarion, Holländargatan 17, SE-111 60 Stockholm, Sweden

November 25, 2020

Page 2: Roadmap

01 Contextualized Embeddings
02 From RNNs to Transformers
03 Transformers Step-by-Step
04 BERT
05 Distillation and Practical Examples

Page 3: Acknowledgements

Material based on:

● Christopher Manning's NLP lectures at Stanford
● The Illustrated Transformer by Jay Alammar
● Slides from Jacob Devlin
● Self-attention video from Peltarion

Page 4: 01 Contextualized Embeddings

Page 5: Background to Natural Language Processing (NLP)

● Word embeddings are the basis of NLP
● Popular embeddings like GloVe and Word2Vec are pre-trained on large text corpora based on co-occurrence statistics (a small example follows below)
● "A word is characterized by the company it keeps" [Firth, 1957]

[Peltarion, 2020]
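To make the co-occurrence idea concrete, here is a minimal sketch (mine, not from the slides) that loads pre-trained GloVe vectors through gensim's downloader and inspects similarities; it assumes the glove-wiki-gigaword-100 vectors are available for download.

import gensim.downloader as api

# Pre-trained GloVe vectors, learned from co-occurrence statistics
# on Wikipedia + Gigaword (lowercased vocabulary).
glove = api.load("glove-wiki-gigaword-100")

# Words that keep similar company end up close in the embedding space.
print(glove.most_similar("stockholm", topn=3))
print(glove.similarity("king", "queen"))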

Page 6: Word Embeddings

[Peltarion, 2020]

Page 7: Word Embeddings

Problem: Word embeddings are context-free

[Peltarion, 2020]


Page 9: Word Embeddings

Problem: Word embeddings are context-free
Solution: Create a contextualized representation (illustrated below)

[Peltarion, 2020]
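As a quick illustration of the "contextualized" part, here is a sketch of mine (not from the slides, and jumping ahead to the pre-trained BERT model covered later in the deck): the same surface word gets different vectors depending on the sentence it appears in. It assumes bert-base-uncased can be downloaded.

import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited money at the bank .",
             "We sat on the bank of the river ."]
enc = tokenizer(sentences, return_tensors="tf", padding=True)
hidden = model(enc).last_hidden_state          # shape (2, seq_len, 768)

# Locate the token "bank" in each tokenized sentence.
positions = [tokenizer.convert_ids_to_tokens(ids.numpy().tolist()).index("bank")
             for ids in enc["input_ids"]]

v0, v1 = hidden[0, positions[0]], hidden[1, positions[1]]
cosine = tf.reduce_sum(v0 * v1) / (tf.norm(v0) * tf.norm(v1))
print(float(cosine))   # clearly below 1.0: same word, different contextual vectors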

Page 10: 02 From RNNs to Transformers

Page 11: Problems with RNNs - Motivation for Transformers

● Sequential computation prevents parallelization
● Despite GRUs and LSTMs, RNNs still need attention mechanisms to deal with long-range dependencies
● Attention gives us access to any state… maybe we don't need the costly recursion? 🤔
● Then NLP can have deep models, which solves our computer vision envy!

Page 12: Attention is all you need! [Vaswani, 2017]

● Sequence-to-sequence model for machine translation
● Encoder-decoder architecture
● Multi-headed self-attention
    ○ Models context and has no locality bias

[Vaswani et al., 2017]

Page 13: 03 Transformers Step-by-Step

Page 14: Understanding the Transformer: Step-by-Step

[Alammar, 2018]

Page 15: Understanding the Transformer: Step-by-Step

No recursion; instead, stacked encoder and decoder blocks:

● Originally: 6 layers
● BERT base: 12 layers
● BERT large: 24 layers
● GPT-2 XL: 48 layers
● GPT-3: 96 layers

[Alammar, 2018]

Page 16: The Encoder Block

[Alammar, 2018]

Page 17: The Encoder and Decoder Blocks

[Alammar, 2018]

Page 18: Attention Preliminaries

Attention mimics the retrieval of a value v_i for a query q based on a key k_i in a database, but in a probabilistic fashion.

(Figure: a query q retrieving a value v_i from a database of key-value pairs.)

Page 19: Dot-Product Attention

● Queries, keys and values are vectors
● Output is a weighted sum of the values
● Weights are computed as the scaled dot-product (similarity) between the query and the keys

A(q, K, V) = \sum_i \frac{e^{q \cdot k_i / \sqrt{d_k}}}{\sum_j e^{q \cdot k_j / \sqrt{d_k}}} v_i   (output is a row vector)

● Self-attention: let the word embeddings be the queries, keys and values, i.e. let the words select each other
● Can stack multiple queries into a matrix Q (a NumPy sketch follows below):

A(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k}) V   (output is again a matrix)
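A minimal NumPy sketch of scaled dot-product attention (mine, not from the slides); in the full Transformer, Q, K and V are produced from the embeddings by learned projection matrices W_Q, W_K, W_V rather than used raw as below.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity between queries and keys, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: weighted sum of the values, one row per query.
    return weights @ V

# Self-attention on a toy "sentence" of 3 tokens with d_model = 4:
# the embeddings themselves act as queries, keys and values.
X = np.random.randn(3, 4)
print(scaled_dot_product_attention(X, X, X).shape)   # (3, 4)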

Page 20: Self-Attention Mechanism

[Alammar, 2018]

Page 21: Self-Attention Mechanism

[Alammar, 2018]

Page 22: Self-Attention Mechanism in Matrix Notation

[Alammar, 2018]

Page 23: Multi-Headed Self-Attention

[Alammar, 2018]

Page 24: Multi-Headed Self-Attention

[Alammar, 2018]

Page 25: Self-Attention: Putting It All Together

[Alammar, 2018]

Page 26: Attention visualized

Page 27: The Full Encoder Block

The encoder block consists of (a minimal Keras sketch follows below):

● Multi-headed self-attention
● Feedforward NN (2 fully connected layers)
● Skip connections
● Layer normalization - similar to batch normalization, but computed over the features of a single sample (word/token) rather than across the batch

[Alammar, 2018]
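A compact sketch of such an encoder block (my own, not from the slides), using the MultiHeadAttention layer that ships with recent TensorFlow/Keras versions; the dimensions follow the original Transformer (d_model = 512, 8 heads, d_ff = 2048).

import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
    inputs = tf.keras.Input(shape=(None, d_model))
    # Multi-headed self-attention (query = key = value = the input sequence).
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(inputs, inputs)
    # Skip connection + layer normalization.
    x = layers.LayerNormalization()(inputs + layers.Dropout(dropout)(attn))
    # Position-wise feedforward NN: two fully connected layers.
    ffn = layers.Dense(d_ff, activation="relu")(x)
    ffn = layers.Dense(d_model)(ffn)
    # Second skip connection + layer normalization.
    outputs = layers.LayerNormalization()(x + layers.Dropout(dropout)(ffn))
    return tf.keras.Model(inputs, outputs)

block = encoder_block()
print(block(tf.random.normal((2, 10, 512))).shape)   # (2, 10, 512)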

Page 28: Encoder-Decoder Architecture - Small Example

[Alammar, 2018]

Page 29: Positional Encodings

● The attention mechanism has no locality bias - no notion of word order
● Add positional encodings to the input embeddings to let the model learn relative positioning (see the sinusoidal sketch below)

[Alammar, 2018]
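One concrete choice, used in the original paper, is the fixed sinusoidal encoding sketched here (my code, not from the slides): each position gets a unique pattern of sines and cosines that is simply added to the input embeddings.

import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added element-wise to the (sequence_length, d_model) matrix of input embeddings.
print(positional_encoding(50, 512).shape)           # (50, 512)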

Page 30: Positional Encodings

[Kazemnejad, 2019]

Page 31: Let's start the encoding!

[Alammar, 2018]

Page 32: Decoding procedure

[Alammar, 2018]

Page 33: Producing the output text

● The output from the decoder is passed through a final fully connected linear layer with a softmax activation function
● This produces a probability distribution over the pre-defined vocabulary of output words (tokens)
● Greedy decoding picks the word with the highest probability at each time step (see the sketch below)

[Alammar, 2018]
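A schematic sketch of greedy decoding (mine, not from the slides); decoder_step is a hypothetical stand-in for the full encoder-decoder forward pass, assumed to return one logit per vocabulary entry for the next position.

import numpy as np

def greedy_decode(decoder_step, bos_id, eos_id, max_len=50):
    output = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(output)          # (vocab_size,) logits for the next token
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # softmax over the vocabulary
        next_id = int(np.argmax(probs))        # greedy: pick the most probable token
        output.append(next_id)
        if next_id == eos_id:                  # stop once end-of-sequence is produced
            break
    return output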

Page 34: Training Objective

[Alammar, 2018]

Page 35: Complexity Comparison

[Vaswani et al., 2017]

Page 36: 04 BERT

Page 37: BERT

Bidirectional Encoder Representations from Transformers

● Self-supervised pre-training of a Transformer encoder for language understanding
● Fine-tuning for a specific downstream task

Page 38: BERT Training Procedure

[Devlin et al., 2018]

Page 39: BERT Training Objectives

Masked Language Modelling
Next Sentence Prediction

[Devlin et al., 2018]
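The masked-language-modelling objective is easy to poke at interactively; below is a minimal sketch (mine, not from the slides) using the HuggingFace fill-mask pipeline, assuming bert-base-uncased can be downloaded.

from transformers import pipeline

# Masked Language Modelling in action: BERT predicts the [MASK]ed token
# from its bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Stockholm is the [MASK] of Sweden.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))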

Page 40: BERT Fine-Tuning Examples

Sentence Classification
Question Answering
Named Entity Recognition

[Devlin et al., 2018]

Page 41: Exploring the Limits of Transfer Learning (T5)

● Scaling up model size and the amount of training data helps a lot
● The best model has 11B (!!) parameters
● The exact pre-training objective (MLM, NSP, corruption) doesn't matter too much
● SuperGLUE benchmark:

[Raffel et al., 2019]

Page 42: 05 Practical Examples

Page 43: BERT in low-latency production settings

[Devlin, 2020]

Page 44: Distillation

● Modern pre-trained language models are huge and very computationally expensive
● How are these companies applying them to low-latency applications? Distillation!
    ○ Train a SOTA teacher model (pre-training + fine-tuning)
    ○ Train a smaller student model that mimics the teacher's output on a large unlabeled dataset (one possible loss is sketched below)
● Why does it work so well?

[Turc, 2020], [Devlin, 2020]
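One common formulation of the student's training signal (the slides don't prescribe a specific loss, so this is an illustrative sketch following Hinton et al., 2015): train the student to match the teacher's softened output distribution.

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Soften both distributions with a temperature > 1.
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature)
    # Cross-entropy between the softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    return -tf.reduce_mean(
        tf.reduce_sum(teacher_probs * student_log_probs, axis=-1)) * temperature ** 2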

Page 45: Transformers in TensorFlow using HuggingFace 🤗

● The HuggingFace library contains a majority of the recent pre-trained state-of-the-art NLP models, as well as over 4,000 community-uploaded models
● Works with both TensorFlow and PyTorch

Page 46: Transformers in TensorFlow using HuggingFace 🤗

from transformers import BertTokenizerFast, TFBertForSequenceClassification
from datasets import load_dataset
import tensorflow as tf

# Load and shuffle the IMDB dataset, then set up tokenizer and model.
dataset = load_dataset("imdb").shuffle()
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize the training split and wrap it in a tf.data.Dataset.
train_encodings = tokenizer(dataset['train']['text'], truncation=True, padding=True)
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), dataset['train']['label']))
val_dataset = ...  # built analogously from the validation data

# Fine-tune; the batch size is set via .batch(16) on the dataset.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
model.fit(train_dataset.batch(16), epochs=3)

model.evaluate(val_dataset.batch(16), verbose=0)


Page 50: 06 Wrap Up

Page 51: Summary

● Transformers have blown other architectures out of the water for NLP
● They get rid of recurrence and rely on self-attention
● NLP pre-training uses Masked Language Modelling
● The most recent improvements come from larger models and more data
● Distillation can make model serving and inference more tractable

Page 52: Thanks

Karl Fredrik Erliksson, PhD Candidate, KTH and Peltarion
[email protected]

November 25, 2020

Questions?