Hora Informaticae, ÚI AV ČR, Praha, 14 Jan 2019

Rudolf Rosa, [email protected]

Deep Neural Networks in Natural Language Processing

Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

Background check: do you know...

- Machine learning? (ML)
- Artificial neural networks? (NN)
- Deep neural networks? (DNN)
- Convolutional neural networks? (CNN)
- Recurrent neural networks? (RNN)
- Long short-term memory units? (LSTM)
- Gated recurrent units? (GRU)
- Attention mechanism? (Bahdanau+, 2014)
- Self-attentive networks? (SAN, Transformer)
- Word embeddings? (Bengio+, 2003)
- Word2vec? (Mikolov+, 2013)

ML in Natural Language Processing

- Before: complex multistep pipelines
  - Preprocessing, low-level processing, high-level processing, classification, post-processing…
  - Massive feature engineering, linguistic knowledge…
- Now: monolithic end-to-end systems (or nearly)
  - text → deep neural network → output
  - Little or no linguistic knowledge required
  - Little or no feature engineering
  - Little or no dependence on external tools
  - → so now is a good time for anyone to get into NLP!

Neural networks & text processing

- Input to a neuron: fixed-dimension real vector
  - Dimension should be reasonable (< 10³)
- Neural net: fixed-sized network of neurons
- Text input: sequence processing
  - Sentence = sequence of words
  - Words: discrete (but interrelated), massively multi-valued (~10⁶), very sparse (Zipf distribution)
  - Sentences: variable length (~1 to 100), complex and hidden internal structure

Outline of the talk

- Problem 1: Words
  - There are too many
  - They are discrete
  - Representing massively multi-valued discrete data by continuous low-dimensional vectors
- Problem 2: Sentences
  - They have various lengths
  - They have internal structure
  - Handling variable-length input sequences with complex internal relations by fixed-sized neural units

Warnings

- I am not an ML expert, rather an ML user
  - Please excuse any errors and inaccuracies
- Focus of the talk: input representation ("encoding")
  - Key problem in NLP, interesting properties
- Leaving out: generating output ("decoding") – that's also interesting
  - Sequence generation: sequence elements are discrete with a large domain (softmax over 10⁶), and the sequence length is not known a priori
  - Decision at the encoder/decoder boundary (if any)

Problem 1: Words

Representing massively multi-valued discrete data (words)
by continuous low-dimensional vectors (word embeddings)

Simplification

- For now, forget sentences
- 1 word → some output
  - Word is positive/neutral/negative
  - Definition of the word
  - Hyperonym (dog → animal), …
- Situation
  - We have labelled training data for some words (10³)
  - We want to generalize (ideally) to all words (10⁶)

The problem with words

- How many words are there? Too many!
  - Many problems with counting words, cannot be done
  - ~10⁶ (but potentially infinite – new words get created every day)
- Long-standing problem of NLP
- Natural representation: 1-hot vector [0 0 0 0 1 0 0 0 0 …] of dimension ~10⁶, with a single 1 at position i (sketched below)
  - ML with ~10⁶ binary features on input
  - Pair of words: ~10¹²
  - No generalization, meaning of words not captured: dog~puppy, dog~~cat, dog~~~platypus, dog~~~~whiskey
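
A minimal sketch (not from the talk) of the 1-hot representation and why it captures no word similarity; the toy vocabulary is made up, a realistic one has ~10⁶ entries.

```python
# 1-hot word vectors over a toy vocabulary: a single 1 at the word's index.
import numpy as np

vocab = ["dog", "puppy", "cat", "kitten", "platypus", "whiskey"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Every pair of distinct words looks equally unrelated under this encoding:
print(one_hot("dog") @ one_hot("puppy"))  # 0.0 – similarity not captured
print(one_hot("dog") @ one_hot("cat"))    # 0.0
```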

Split the words

- Split into characters ("M O C K")
  - Not that many (~10²)
  - Do not capture meaning: starts with "m-", is it positive or negative?
- Split into subwords/morphemes ("mis class if ied")
  - Word starts with "mis-": it is probably negative (misclassify, mistake, misconception…)
  - Helps, used in practice
  - Potentially infinite word set covered by a finite set of subwords
  - Meaning-capturing subwords still too many (~10⁵)
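
A toy sketch of subword segmentation by greedy longest match; the hand-picked subword vocabulary is purely illustrative (real systems such as byte-pair encoding learn it from data).

```python
# Greedy longest-match segmentation into subwords from a made-up vocabulary.
SUBWORDS = {"mis", "class", "if", "ied", "take", "conception"}

def segment(word):
    """Split a word into known subwords, preferring the longest match."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:                               # fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("misclassified"))   # ['mis', 'class', 'if', 'ied']
print(segment("misconception"))   # ['mis', 'conception']
```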

Distributional hypothesis

- smelt (assume you don't know this word)
  - I had a smelt for lunch. → noun, meal/food
  - My father caught a smelt. → animal/illness
  - Smelts are disappearing from oceans. → plant/fish
  - (it is in fact a fish – koruška in Czech)
- Harris (1954): "Words that occur in the same contexts tend to have similar meanings."

Distributional hypothesis

- Harris (1954): "Words that occur in the same contexts tend to have similar meanings."
- Cooccurrence matrix: number of sentences containing both WORD and CONTEXT (see the sketch below)

  WORD    | lunch | caught | oceans | doctor | green
  --------|-------|--------|--------|--------|------
  smelt   |    10 |     10 |     10 |      1 |     1
  salmon  |   100 |    100 |    100 |      1 |     1
  flu     |     1 |    100 |      1 |    100 |    10
  seaweed |    10 |      1 |    100 |      1 |   100

- Cheap plentiful data (webs, news, books…): ~10⁹
- Matrix size: N×N, N ~ 10⁶
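
A minimal sketch of building such sentence-level cooccurrence counts; the four-sentence corpus is invented, a real one has ~10⁹ sentences.

```python
# Count, for every (word, context word) pair, the number of sentences
# containing both – i.e. the cooccurrence matrix from the table above.
from collections import Counter
from itertools import permutations

corpus = [
    "i had a smelt for lunch",
    "my father caught a smelt",
    "smelts are disappearing from oceans",
    "i had a salmon for lunch",
]

cooc = Counter()
for sentence in corpus:
    words = set(sentence.split())        # count each pair once per sentence
    for w, c in permutations(words, 2):  # w = word, c = context word
        cooc[(w, c)] += 1

print(cooc[("smelt", "lunch")])   # 1
print(cooc[("salmon", "lunch")])  # 1
```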

From cooccurrence to PMI

- Cooccurrence matrix M_C:
  M_C[i, j] = count(word_i & context_j)
- Conditional probability matrix M_P:
  M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
- Conditional log-probability matrix M_LogP:
  M_LogP[i, j] = log P(word_i | context_j) = log M_P[i, j]
- Pointwise mutual information matrix M_PMI:
  M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
  (in general, PMI(A, B) = log [ P(A & B) / (P(A) P(B)) ])
- These are association measures between words and contexts
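
A sketch that turns a cooccurrence count matrix into the PMI matrix using the formulas above; the counts are the toy table from the previous slide.

```python
# From cooccurrence counts to PMI. Rows index words, columns index contexts.
import numpy as np

C = np.array([[ 10,  10,  10,   1,   1],    # smelt
              [100, 100, 100,   1,   1],    # salmon
              [  1, 100,   1, 100,  10],    # flu
              [ 10,   1, 100,   1, 100]],   # seaweed
             dtype=float)

total = C.sum()
p_word = C.sum(axis=1, keepdims=True) / total      # P(word_i)
p_context = C.sum(axis=0, keepdims=True) / total   # P(context_j)
p_joint = C / total                                # P(word_i & context_j)

# PMI[i, j] = log [ P(word_i & context_j) / (P(word_i) P(context_j)) ]
#           = log [ P(word_i | context_j) / P(word_i) ]
PMI = np.log(p_joint / (p_word * p_context))

print(np.round(PMI, 2))
```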

From cooccurrence to PMI

- Word representation still impractically huge: M_PMI[i] ∈ R^N, N ~ 10⁶
- But better than 1-hot: meaningful continuous vectors (e.g. cosine similarity)
- Just need to compress it!
  - Explicitly: matrix factorization (post-hoc, not used)
  - Implicitly: word2vec (widely used)

Matrix factorization

- Levy & Goldberg (2014)
- Take M_LogP or M_PMI
- Shift the matrix to make it positive (subtract its minimum)
- Truncated Singular Value Decomposition: M = U D V^T
  M ∈ R^(N×N) → U ∈ R^(N×d), D ∈ R^(d×d), V ∈ R^(N×d),  with N ~ 10⁶ and d ~ 10²
- Word embedding matrix: W = U D ∈ R^(N×d)
- Embedding vec(word_i) = W[i] ∈ R^d
  - Continuous low-dimensional vector
  - Meaningful (cosine similarity, algebraic operations)
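
A small sketch of this explicit factorization with numpy's SVD; the matrix M below is a random stand-in for a real shifted PMI matrix, and d is kept tiny.

```python
# Truncated SVD of a (shifted) association matrix, keeping W = U_d D_d as the
# word embedding matrix, in the spirit of Levy & Goldberg (2014).
import numpy as np

def svd_embeddings(M, d=2):
    M = M - M.min()                            # shift so the matrix is non-negative
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]                    # row i = d-dimensional vector of word i

M = np.random.rand(6, 5)                       # stand-in for a real PMI matrix
W = svd_embeddings(M, d=2)
print(W.shape)                                 # (6, 2): one 2-d embedding per word
```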

Word embeddings magic

- Word similarity (cosine): vec(dog) ~ vec(puppy), vec(cat) ~ vec(kitten)
- Word meaning algebra: some relations are parallel across words
  - vec(puppy) − vec(dog) ~ vec(kitten) − vec(cat)
  - [Figure: dog→puppy and cat→kitten shown as parallel offsets in the embedding space]
  - ⇒ vec(puppy) − vec(dog) + vec(cat) ~ vec(kitten)
  - vodka – Russia + Mexico, teacher – school + hospital…
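
A sketch of the similarity and analogy operations; the embedding table below is random placeholder data, so the printed answer is meaningless – with real pretrained vectors the nearest neighbour of puppy − dog + cat is indeed kitten.

```python
# Cosine similarity and vector arithmetic over a word embedding table.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["dog", "puppy", "cat", "kitten", "car"]}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(query, exclude=()):
    """Word whose embedding has the highest cosine similarity to `query`."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], query))

query = emb["puppy"] - emb["dog"] + emb["cat"]
print(most_similar(query, exclude={"puppy", "dog", "cat"}))
```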

word2vec (Mikolov+, 2013)

- Predict word w_i from its context (CBOW)
- E.g.: "I had _____ for lunch"
- Sentence: … w_{i−2} w_{i−1} w_i w_{i+1} w_{i+2} …
- [Architecture diagram: the context words w_{i−2}, w_{i−1}, w_{i+1}, w_{i+2} enter as 1-hot vectors,
  go through a shared projection matrix W (N×d) into a "linear hidden layer",
  then through another matrix (d×N) and a (hierarchical) softmax
  to a distribution over the output word w_i; trained with SGD]
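
A minimal PyTorch sketch of this CBOW architecture; a plain softmax (cross-entropy) stands in for the hierarchical softmax mentioned on the slide, and all sizes are illustrative.

```python
# CBOW: the context words share one projection matrix, their projections are
# summed into a linear hidden layer, and a second matrix + softmax predicts
# the centre word.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=10_000, dim=100):
        super().__init__()
        self.proj = nn.Embedding(vocab_size, dim)   # shared projection W (N x d)
        self.out = nn.Linear(dim, vocab_size)       # second matrix (d x N)

    def forward(self, context_ids):                 # (batch, context_size)
        hidden = self.proj(context_ids).sum(dim=1)  # "linear hidden layer"
        return self.out(hidden)                     # logits over the vocabulary

model = CBOW()
context = torch.randint(0, 10_000, (8, 4))           # 4 context words per example
target = torch.randint(0, 10_000, (8,))              # the centre words
loss = nn.functional.cross_entropy(model(context), target)  # train this with SGD
print(loss.item())
```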

word2vec (Mikolov+, 2013)

- Predict the context from a word w_i (skip-gram with negative sampling, SGNS)
- E.g.: "____ _____ smelt _____ _____"
- Sentence: … w_{i−2} w_{i−1} w_i w_{i+1} w_{i+2} …
- [Architecture diagram: the input word w_i enters as a 1-hot vector,
  goes through the projection matrix W (N×d),
  then through another, shared matrix V (d×N) with sigmoid outputs
  to distributions over the context words w_{i−2}, w_{i−1}, w_{i+1}, w_{i+2}]
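
A sketch of one SGNS training step in PyTorch, with the two matrices W (input word vectors) and V (output context vectors) from the diagram; the uniform negative sampling and all hyperparameters are simplifications.

```python
# Skip-gram with negative sampling: a sigmoid scores (word, context) pairs,
# pushing observed pairs up and randomly sampled negative pairs down.
import torch
import torch.nn as nn

N, d = 10_000, 100
W = nn.Embedding(N, d)        # input word vectors (the embeddings we keep)
V = nn.Embedding(N, d)        # output context vectors
opt = torch.optim.SGD(list(W.parameters()) + list(V.parameters()), lr=0.05)

word = torch.randint(0, N, (32,))          # centre words
pos = torch.randint(0, N, (32,))           # observed context words
neg = torch.randint(0, N, (32, 5))         # 5 negative samples each

pos_score = (W(word) * V(pos)).sum(-1)                # (32,)
neg_score = (W(word).unsqueeze(1) * V(neg)).sum(-1)   # (32, 5)

loss = -(nn.functional.logsigmoid(pos_score).mean()
         + nn.functional.logsigmoid(-neg_score).mean())
loss.backward()
opt.step()
print(loss.item())
```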

word2vec ~ implicit factorization

- Word embedding matrix W ∈ R^(N×d); embedding(word_i) = W[i] ∈ R^d
- Levy & Goldberg (2014): word2vec SGNS implicitly factorizes M_PMI
  - M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
  - SGNS: M_PMI = W V,  M_PMI ∈ R^(N×N) → W ∈ R^(N×d), V ∈ R^(d×N)
- [Diagram: the input projection W and output matrix V from the word2vec slides above]

Problem 2: Sentences

Handling variable-length input sequences with long-distance relations between elements (sentences)
by fixed-sized neural units (attention mechanisms)

Processing sentences

- Convolutional neural networks
- Recurrent neural networks
- Attention mechanism
- Self-attentive networks

Convolutional neural networks

- Input: sequence of word embeddings
- Filters (size 3-5), normalization, max-pooling
- Training deep CNNs is hard → residual connections
  - Layer input averaged with its output, skipping the non-linearity
- Problem: capturing long-range dependencies
  - The receptive field of each filter is limited
  - "My computer works, but I have to buy a new mouse."
- Good for word n-gram spotting: sentiment analysis, named entity detection…
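
A minimal PyTorch sketch of such a convolutional sentence encoder (embeddings → filters of width 3-5 → max-pooling → classifier); the sizes and the 3-way output are illustrative.

```python
# CNN over a sentence: filters of width 3-5 slide over the word embeddings,
# max-pooling keeps the strongest n-gram response per filter.
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=100, n_filters=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, n_filters, kernel_size=k) for k in (3, 4, 5)])
        self.out = nn.Linear(3 * n_filters, n_classes)

    def forward(self, word_ids):                    # (batch, sentence_length)
        x = self.emb(word_ids).transpose(1, 2)      # (batch, dim, length)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))   # e.g. sentiment logits

model = CNNEncoder()
print(model(torch.randint(0, 10_000, (8, 20))).shape)  # torch.Size([8, 3])
```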

Recurrent neural networks

- Input: sequence of word embeddings
- Output: final state of the RNN
- Problems
  - Vanishing gradient → memory cells (LSTM, GRU)
  - Long-distance dependencies not perfectly captured
  - Final state is biased ("forgetting"): the sentence end is captured better than the sentence start
  - Bidirectional RNN, output = concatenation of both final states; still may not capture the middle parts well…
  - Using all hidden states as output, not just the final one: we lose the fixed-sized representation
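
A sketch of a bidirectional LSTM encoder whose fixed-size output is the concatenation of the two final states, as described above; sizes are illustrative.

```python
# BiLSTM sentence encoder: returns all hidden states plus a fixed-size summary
# built from the final forward and backward states.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_ids):                     # (batch, length)
        states, (h_n, _) = self.rnn(self.emb(word_ids))
        # h_n: (2, batch, hidden) – the final state of each direction
        sentence_vec = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2*hidden)
        return states, sentence_vec

enc = BiLSTMEncoder()
states, vec = enc(torch.randint(0, 10_000, (8, 20)))
print(states.shape, vec.shape)   # torch.Size([8, 20, 256]) torch.Size([8, 256])
```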

Attention (on top of an RNN)

- Classifier/decoder gets a fixed-size context vector
  - Weighted average of encoder hidden states
- Attention weights computed by a feed-forward subnet
  - weight_i ~ NN(state_i, state_decoder)
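
A sketch of this attention step in the Bahdanau-style additive form: a small feed-forward net scores each encoder state against the decoder state, softmax gives the weights, and the context vector is the weighted average of the encoder states. All dimensions are illustrative.

```python
# Additive attention over encoder states, conditioned on the decoder state.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128):
        super().__init__()
        self.score = nn.Sequential(              # weight_i ~ NN(state_i, state_decoder)
            nn.Linear(enc_dim + dec_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1))

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, length, enc_dim), dec_state: (batch, dec_dim)
        dec = dec_state.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        scores = self.score(torch.cat([enc_states, dec], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                   # (batch, length)
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)
        return context, weights                  # fixed-size context vector

attn = AdditiveAttention()
ctx, w = attn(torch.randn(8, 20, 256), torch.randn(8, 256))
print(ctx.shape, w.shape)   # torch.Size([8, 256]) torch.Size([8, 20])
```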

Advanced attention

- Multi-head attention
  - Multiple attention heads (~8), each with its own attention distribution
  - The resulting context vectors are concatenated
- Self-attentive encoder (SAN, Transformer)
  - A CNN/attention hybrid
  - CNN: a cell gets a small local context via filters
  - SAN: a cell gets the global context via attention heads
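
A sketch of one self-attention head in the Transformer style: every position queries every other position via scaled dot-product attention, so each cell sees the whole sentence rather than a small filter window; multi-head attention runs several such heads in parallel and concatenates their outputs. Sizes are illustrative.

```python
# One scaled dot-product self-attention head over a sequence of word vectors.
import math
import torch
import torch.nn as nn

class SelfAttentionHead(nn.Module):
    def __init__(self, dim=256, head_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, head_dim)   # queries
        self.k = nn.Linear(dim, head_dim)   # keys
        self.v = nn.Linear(dim, head_dim)   # values

    def forward(self, x):                   # x: (batch, length, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, len, len)
        weights = torch.softmax(scores, dim=-1)  # one attention distribution per position
        return weights @ v                       # (batch, length, head_dim)

head = SelfAttentionHead()
print(head(torch.randn(8, 20, 256)).shape)   # torch.Size([8, 20, 64])
```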

Conclusion

- Words → word embeddings
  - Too many, too sparse
  - Word meaning ~ the context in which it appears
  - Cooccurrence matrix, implicit/explicit factorization
- Sentences → attention
  - Variable length, complex internal structure
  - biRNN (LSTM, GRU), CNN + residuals
  - Attention: weighted sum of encoder hidden states
  - Self-attention: à la CNN, with filters replaced by attention

References

Word embeddings:
- Distributional hypothesis: Harris: Distributional structure. Word, 1954
- First: Bengio+: A neural probabilistic language model. JMLR, 2003
- Efficient implicit (word2vec): Mikolov+: Linguistic Regularities in Continuous Space Word Representations. NAACL, 2013
- Explicit (TSVD): Levy & Goldberg: Neural Word Embedding as Implicit Matrix Factorization. NIPS, 2014

Recurrent neural networks and attention:
- LSTM: Hochreiter+: Long short-term memory. Neural Computation, 1997
- Attention: Bahdanau+: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, 2014
- Transformer SAN: Vaswani+: Attention is all you need. NIPS, 2017

Thank you for your attention

http://ufal.mff.cuni.cz/rudolf-rosa/

Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

Deep Neural Networks in Natural Language Processing

Rudolf Rosa, [email protected]