
Page 1: Lecture 5 Convolutional Neural Networks

Lecture 5: Convolutional Neural Networks
LTAT.01.001 – Natural Language Processing

Kairit Sirts ([email protected])
17.03.2021

Page 2: Lecture 5 Convolutional Neural Networks

Plan for today

● CNN as an analog to the human visual system
● CNN components
  ● Convolutional layers
  ● Pooling layers
  ● Linear and non-linear layers
● An example CNN for text classification
● (Interpreting CNN decisions)

Page 3: Lecture 5 Convolutional Neural Networks

Convolutional neural networks

● Feature extractors
● Analogous to the brain’s visual system
● First applied in the image processing domain
● Now also used in NLP

Page 4: Lecture 5 Convolutional Neural Networks

Feature extraction in image detection


Page 5: Lecture 5 Convolutional Neural Networks

Visual system in the brain


Page 6: Lecture 5 Convolutional Neural Networks

Convolutional NN for image classification


Page 7: Lecture 5 Convolutional Neural Networks

“Artificial” and “natural” CNN


Page 8: Lecture 5 Convolutional Neural Networks

Convolutional layers


Page 9: Lecture 5 Convolutional Neural Networks

Convolutional layers

$I$ – input, $F$ – filter, $O$ – output

$$O_{1,1} = I_{1,1}F_{1,1} + I_{1,2}F_{1,2} + I_{1,3}F_{1,3} + I_{2,1}F_{2,1} + I_{2,2}F_{2,2} + I_{2,3}F_{2,3} + I_{3,1}F_{3,1} + I_{3,2}F_{3,2} + I_{3,3}F_{3,3}$$

In general, for a $k \times k$ filter:

$$O_{i,j} = \sum_{a=1}^{k} \sum_{b=1}^{k} I_{i+a-1,\, j+b-1} \cdot F_{a,b}$$

https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks
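As a minimal NumPy sketch (not from the lecture), the formula above translates directly into code; the input and filter values are arbitrary toy data:

import numpy as np

def conv2d(I, F):
    # Valid 2D convolution: O[i,j] = sum over a,b of I[i+a, j+b] * F[a,b]
    # (0-based indexing here; the slide's formula is 1-based).
    k = F.shape[0]
    n, m = I.shape
    O = np.zeros((n - k + 1, m - k + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            O[i, j] = np.sum(I[i:i + k, j:j + k] * F)
    return O

I = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 input
F = np.ones((3, 3)) / 9.0                     # 3x3 averaging filter
print(conv2d(I, F).shape)                     # (4, 4), i.e. (6-3+1, 6-3+1)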

Page 10: Lecture 5 Convolutional Neural Networks

Convolutional layer for text

(Figure: each word of “The cat sat on the mat” is mapped to its embedding)

Page 11: Lecture 5 Convolutional Neural Networks

Convolutional layer for text

● The convolutional filter is an n-gram detector

● Concatenate the embeddings of the n-gram

● Compute dot-product between the concatenated embedding vector and the filter

● The result is a scalar score for the n-gram

(Figure: the embeddings of “The cat sat” are concatenated and multiplied with a 1D filter)
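A minimal NumPy sketch of this computation; the embedding size, the random embeddings, and the filter values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                                                   # embedding size, n-gram length
emb = {w: rng.normal(size=d) for w in ["The", "cat", "sat"]}  # toy embeddings

x = np.concatenate([emb["The"], emb["cat"], emb["sat"]])      # concatenated trigram, size k*d
u = rng.normal(size=k * d)                                    # one 1D filter of the same size
score = x @ u                                                 # scalar score for "The cat sat"
print(score)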

Page 12: Lecture 5 Convolutional Neural Networks

Convolutional layer for text

(Figure: the filter slides across “The cat sat on the mat”, scoring one n-gram at a time; slides 12–15 animate the successive positions)


Page 16: Lecture 5 Convolutional Neural Networks

Another (but equivalent) formulation

(Figure: the filter, shaped as a matrix, is multiplied element-wise with the stacked n-gram embeddings; the products are then summed up)

Again, apply to every n-gram
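A quick NumPy check with toy data that the two formulations agree: concatenating and taking a dot product gives the same scalar as element-wise multiplication followed by summing:

import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4
X = rng.normal(size=(k, d))   # stacked embeddings of one trigram: k words, d dimensions
U = rng.normal(size=(k, d))   # the same filter, shaped as a k x d matrix

dot_form = X.reshape(-1) @ U.reshape(-1)  # concatenate, then dot product
ew_form = (X * U).sum()                   # element-wise multiply, then sum up
print(np.allclose(dot_form, ew_form))     # True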

Page 17: Lecture 5 Convolutional Neural Networks

Different filters: unigrams, bigrams, trigrams etc

(Figure: unigram, bigram, and trigram filters applied to “The cat sat on the mat”)

Page 18: Lecture 5 Convolutional Neural Networks

Number of filters

● There are typically several filters for each n-gram type
● This is a tunable hyperparameter

(Figure: three filters of the same width applied to “The cat sat on the mat”; number of filters = 3)

Page 19: Lecture 5 Convolutional Neural Networks

Final representation

● After applying 5 different filters to unigrams, bigrams and trigrams

(Figure: 5 unigram, 5 bigram, and 5 trigram filters applied to “The cat sat on the mat”)

Page 20: Lecture 5 Convolutional Neural Networks

Narrow and wide convolution

● Narrow convolution:
  ● Start the n-grams from the first word and end with the last word in the text
  ● The resulting vector has n – k + 1 elements
    ● n – length of the sequence
    ● k – n-gram length

(Figure: narrow convolution over “The cat sat on the mat”)

Page 21: Lecture 5 Convolutional Neural Networks

Narrow and wide convolution

● Wide convolution:
  ● Pad the sequence at the beginning and at the end with a special PAD symbol
  ● The length of the resulting vector is equal to n + k – 1

(Figure: PAD The cat sat on the mat PAD)

Page 22: Lecture 5 Convolutional Neural Networks

Wide convolution

● Trigram filter

(Figure: PAD PAD The cat sat on the mat PAD PAD)
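A minimal PyTorch sketch of the two variants (dimensions are illustrative): Conv1d with padding=0 gives the narrow case, and padding=k-1 the wide case:

import torch
import torch.nn as nn

n, d, k = 6, 4, 3                        # 6 words, embedding dim 4, trigram filter
x = torch.randn(1, d, n)                 # Conv1d expects (batch, channels, length)

narrow = nn.Conv1d(d, 1, kernel_size=k, padding=0)    # no padding
wide = nn.Conv1d(d, 1, kernel_size=k, padding=k - 1)  # k-1 PADs on each side

print(narrow(x).shape)  # torch.Size([1, 1, 4]): n - k + 1 = 4 positions
print(wide(x).shape)    # torch.Size([1, 1, 8]): n + k - 1 = 8 positions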

Page 23: Lecture 5 Convolutional Neural Networks

Pooling layers


Page 24: Lecture 5 Convolutional Neural Networks

Pooling layers

● The goal is to reduce the dimensionality of the feature map (the output of the convolutional layer)
● Brings the representations of inputs of different lengths to the same dimensionality
● Typical pooling operations:
  ● max-pooling (over time)
  ● average pooling

Page 25: Lecture 5 Convolutional Neural Networks

Max-pooling over time

● Retain the maximum value over all values computed by a filter
● The information about the most relevant n-grams is kept

(Figure: max-pooling over the values computed by each of 3 filters on “The cat sat on the mat”)
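In PyTorch terms (a sketch with toy numbers), max-pooling over time is simply a max over the position axis of the feature map:

import torch

scores = torch.randn(1, 3, 4)    # (batch, filters = 3, n-gram positions = 4)
pooled, idx = scores.max(dim=2)  # maximum over the time (position) axis
print(pooled.shape)              # torch.Size([1, 3]): one value per filter
print(idx)                       # which n-gram position won, for each filter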

Page 26: Lecture 5 Convolutional Neural Networks

Max-pooling over time

(Figure: max-pooling applied separately to the outputs of 5 unigram, 5 bigram, and 5 trigram filters over “The cat sat on the mat”)

Page 27: Lecture 5 Convolutional Neural Networks

Max-pooling over time

(Figure, animated over slides 27–30: the max-pooled values of the 5 unigram, 5 bigram, and 5 trigram filters are concatenated into a single feature vector)

Page 31: Lecture 5 Convolutional Neural Networks

Other pooling layers

● Average pooling: compute the average along the filter axis
● K-max pooling: retain the k maximum elements (instead of just one), keeping the original order of the elements (see the sketch below)

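A minimal PyTorch sketch of k-max pooling (the helper name is mine); re-sorting the top-k indices is what preserves the original order:

import torch

def kmax_pool(x, k):
    # Keep the k largest values along the last (time) axis, in their original order.
    idx = x.topk(k, dim=-1).indices.sort(dim=-1).values
    return x.gather(-1, idx)

x = torch.tensor([[1.0, 5.0, 2.0, 4.0, 3.0]])
print(kmax_pool(x, 2))  # tensor([[5., 4.]]): the two maxima, in original order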

Page 32: Lecture 5 Convolutional Neural Networks

Linear and non-linear layers


Page 33: Lecture 5 Convolutional Neural Networks

Linear layer

● A linear layer is just a parameter matrix
● The input vector is multiplied with the linear layer matrix from the left

(Figure: a row vector $\boldsymbol{x}$ of size “input dim” times the matrix $W$ of size “input dim × output dim” gives the output $\boldsymbol{y}$ of size “output dim”)

$$\boldsymbol{y} = \boldsymbol{x} \cdot W$$

Page 34: Lecture 5 Convolutional Neural Networks

Bias vector

● Often a bias vector is part of the linear layer

(Figure: as on the previous slide, but a bias vector $\boldsymbol{b}$ is added to the product)

$$\boldsymbol{y} = \boldsymbol{x} \cdot W + \boldsymbol{b}$$
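In PyTorch this is nn.Linear (a sketch; note that it stores W transposed, as an “output dim × input dim” matrix, so it computes x·Wᵀ + b):

import torch
import torch.nn as nn

lin = nn.Linear(in_features=5, out_features=2)  # holds W (2 x 5) and bias b (2,)
x = torch.randn(5)
y = lin(x)                                      # computes x @ W^T + b
print(y.shape)                                  # torch.Size([2])
print(torch.allclose(y, x @ lin.weight.T + lin.bias))  # True: same computation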

Page 35: Lecture 5 Convolutional Neural Networks

Non-linear activation functions

● Typically applied on top of a linear layer

● Except for the linear output layer – after that typically comes the softmax


Page 36: Lecture 5 Convolutional Neural Networks

An example CNN for text classification


Page 37: Lecture 5 Convolutional Neural Networks

Y. Kim, 2014. Convolutional Neural Networks for Sentence Classification
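A minimal PyTorch sketch in the spirit of Kim (2014); the filter widths, filter counts, and dimensions below are illustrative defaults, not necessarily the paper's exact configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # Parallel convolutions of several widths, max-pooling over time,
    # dropout, and a linear output layer.
    def __init__(self, vocab_size, emb_dim=100, widths=(3, 4, 5),
                 n_filters=100, n_classes=2, dropout=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)      # -> (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.drop(torch.cat(pooled, dim=1))   # concatenated pooled features
        return self.out(h)                        # class logits

model = TextCNN(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 50)))  # batch of 8 texts, 50 tokens each
print(logits.shape)                               # torch.Size([8, 2])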


Page 38: Lecture 5 Convolutional Neural Networks

Things to consider

● Pretrained word embeddings or learn from scratch?
● What kind of n-gram filters?
● How many filters per n-gram?
● Dropout?
● How to model the output layer in the binary case? One-dimensional with sigmoid or two-dimensional with softmax? (see the sketch below)
● Something else?

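On the output-layer question from the list above, both options look like this in PyTorch (a sketch; the feature size of 300 is an arbitrary assumption):

import torch
import torch.nn as nn

h = torch.randn(8, 300)  # pooled feature vectors for a batch of 8 texts

# Option 1: one output unit + sigmoid (BCEWithLogitsLoss applies the sigmoid itself)
binary_head = nn.Linear(300, 1)
loss1 = nn.BCEWithLogitsLoss()(binary_head(h).squeeze(1), torch.ones(8))

# Option 2: two output units + softmax (CrossEntropyLoss applies log-softmax itself)
two_class_head = nn.Linear(300, 2)
loss2 = nn.CrossEntropyLoss()(two_class_head(h), torch.ones(8, dtype=torch.long))

print(loss1.item(), loss2.item())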

Page 39: Lecture 5 Convolutional Neural Networks

Interpreting CNNs


Page 40: Lecture 5 Convolutional Neural Networks

Decision interpretation

● For each test item it is possible to check which n-grams passed the max-pool and thus contributed to the classification decision

Examples from: Jacovi et al., 2018. Understanding Convolutional Neural Networks for Text Classification
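A minimal sketch of this idea (a toy, untrained model; not Jacovi et al.'s code): the argmax of the max-pool tells us, for each filter, which n-gram contributed to the decision:

import torch
import torch.nn as nn
import torch.nn.functional as F

words = "the cat sat on the mat".split()
d, k, n_filters = 8, 3, 4                   # toy dimensions
emb = torch.randn(1, d, len(words))         # stand-in for the word embeddings
conv = nn.Conv1d(d, n_filters, kernel_size=k)

scores = F.relu(conv(emb))                  # (1, n_filters, n - k + 1)
pooled, idx = scores.max(dim=2)             # idx: winning n-gram start per filter
for f, start in enumerate(idx[0].tolist()):
    print(f"filter {f}: {' '.join(words[start:start + k])}")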

Page 41: Lecture 5 Convolutional Neural Networks

Model interpretation

● Data from Reddit
● Model for predicting users with self-reported eating disorder
● Posts extracted from non-mental-health-related subreddits
● The model configuration:
  ● The model architecture of Kim (2014)
  ● n-gram filters: unigrams to 5-grams
  ● 100 filters per n-gram
  ● Inputs: all posts of a user concatenated together
  ● Maximum length of the input: 15,000 words
  ● All words appearing less than 20 times are replaced with the UNK token
● Look at which n-grams most frequently passed through the max-pool of each filter