Deep Learning & Natural Language Processing
Embeddings, CNNs, RNNs, etc.
Julius B. Kirkegaard 2019
Snippets at: https://bit.ly/2VWvs3m
Deep Neural Networks & PyTorch
Deep neural networks
Neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Neural networks
Each layer: some matrix (the parameters) multiplies some vector (the input), followed by some non-linear function, the ”activation function”.
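As a sketch of the above (not taken from the slides; sizes are arbitrary), one layer is just a matrix multiplication followed by an activation:

    import torch

    x = torch.randn(100)        # "some vector": the input
    W = torch.randn(50, 100)    # "some matrix": the trainable parameters
    b = torch.randn(50)         # bias term (often included)

    h = torch.relu(W @ x + b)   # non-linear "activation function" applied to the linear map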
Deep neural networks
Requirements for DNN Frameworks
• Optimisation of parameters p
• Take first order derivatives
• Chain rule (backpropagation)
• Process large amounts of data fast
• Exploit GPUs
• Nice to haves:
• Standard functions and operations built-in
• Built-in optimizers
• Spread training across network
• Compile for fast inference
• …
PyTorch
• GPU acceleration
• Automatic Error-Backpropagation
(chain rule through operations)
• Tons of functionality built-in
Hard to play with; not good for new ideas and research (IMO)
Easy to play with, but difficult to implement custom and dynamic architectures
Requirement 1: Calculate gradients
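A minimal sketch of requirement 1 in PyTorch (not the original snippet): mark a tensor with requires_grad and let autograd apply the chain rule.

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x**2 + 3*x        # PyTorch records the operations
    y.backward()          # backpropagation (chain rule)
    print(x.grad)         # dy/dx = 2*x + 3 = tensor(7.)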
Requirement 2: GPU
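And a minimal sketch of requirement 2: the same tensor code runs on a GPU simply by placing the tensors there (assuming CUDA is available).

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(1000, 1000, device=device)   # allocate directly on the GPU
    b = torch.randn(1000, 1000).to(device)       # or move an existing tensor
    c = a @ b                                    # the matrix multiply runs on the GPU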
Simple Neural Network: ”Guess the Mean”
Neural network in three steps:
1. Design the architecture and initialise parameters
2. Calculate the loss
3. Update parameters based on the loss gradient
Warning: this is not the best way to implement it; a better version will follow…
(this version is for understanding)
Snippets at: https://bit.ly/2VWvs3m
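The linked snippet isn't reproduced here; a rough sketch of the ”guess the mean” example in this naive style (all names and sizes are assumptions) could look like:

    import torch

    data = torch.randn(100) + 5.0          # data whose mean the "network" should guess

    # 1. Design the architecture and initialise parameters
    mu = torch.zeros(1, requires_grad=True)

    lr = 0.1
    for step in range(100):
        # 2. Calculate the loss
        loss = ((data - mu)**2).mean()

        # 3. Update parameters based on the loss gradient
        loss.backward()
        with torch.no_grad():
            mu -= lr * mu.grad
            mu.grad.zero_()

    print(mu.item())                       # close to 5.0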
Better optimiser stepping
• What if some gradients are much smaller than others?
• What happens when gradients vanish as the loss becomes small?
Solution → Variable learning rates and momentum
• Many algorithms exist; perhaps the most popular is “Adam”
Better optimiser stepping
SGD (Stochastic gradient descent) Adam (Adaptive Moment Estimation)
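In PyTorch the two are drop-in replacements for each other; a hedged sketch (learning rates are just typical defaults):

    import torch

    params = [torch.randn(10, requires_grad=True)]

    opt = torch.optim.SGD(params, lr=0.01, momentum=0.9)   # plain SGD, optionally with momentum
    # opt = torch.optim.Adam(params, lr=0.001)             # Adam: adaptive per-parameter learning rates

    opt.zero_grad()
    loss = (params[0]**2).sum()
    loss.backward()
    opt.step()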
simple_nn.py module_nn.py
Snippets at: https://bit.ly/2VWvs3m
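The two files above are not reproduced here; as a guess at what the module-based version looks like, a typical nn.Module network is sketched below (layer sizes are assumptions):

    import torch
    from torch import nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(10, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
            )

        def forward(self, x):
            return self.layers(x)

    net = Net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)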
Representing sentences
The Trouble
“Hej med dig” (Danish for “Hi there”)
Bag of Words
“Hej med dig”
“Hej hej”
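As an illustration (not from the slides), a bag-of-words vector just counts how often each dictionary word occurs:

    from collections import Counter

    sentences = ["Hej med dig", "Hej hej"]
    vocab = sorted({w for s in sentences for w in s.lower().split()})   # ['dig', 'hej', 'med']

    for s in sentences:
        counts = Counter(s.lower().split())
        print(s, "->", [counts[w] for w in vocab])
    # "Hej med dig" -> [1, 1, 1]
    # "Hej hej"     -> [0, 2, 0]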
Bag of Words, poor behaviour #1
“I had my car cleaned.”
“I had cleaned my car.” (order ignored)
Bag of Words, poor behaviour #2
“Hej med dig”
“Heej med dig”
“Haj medd dig” (typos)
(semantically similar)
Bag of Words, poor behaviour #3
“Hej med dig”
The idea for a solution
Idea: Represent each word as a vector, with similar words
having vectors that are close
Problem: how to choose the vector representing each word?
The idea for a solution
The country was ruled by a _____
The bishop anointed the ____ with aromatic oils
The crown was put on the ____
”Context defines meaning”:
King/Queen
Continuous Bag of Words
• Input is a ”one-hot” vector
• We force the network to turn each word into a ~200-length vector
• From these vectors we predict the ”focus word”
• When done, keep the ”embeddings”
See e.g. https://github.com/FraLotito/pytorch-continuous-bag-of-words/blob/master/cbow.py
for a simple implementation
The bishop anointed the [queen] with aromatic oils
(context | focus word | context)
Continuous Bag of Words
“I think [therefore] I am”
(context | focus word | context)
Dictionary: [“I”, “think”, “therefore”, “am”]
Context size = 2
Continuous Bag of Words
Very simple version:
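The slide's code isn't reproduced here; a minimal sketch in the spirit of the linked implementation (dimensions and names are assumptions):

    import torch
    from torch import nn

    class CBOW(nn.Module):
        def __init__(self, vocab_size, embedding_dim=200):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)   # kept when training is done
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context):                    # context: indices of the surrounding words
            v = self.embeddings(context).mean(dim=0)   # average the context word vectors
            return self.linear(v)                      # scores over the whole dictionary

    # Dictionary: ["I", "think", "therefore", "am"]; predict "therefore" from its context
    model = CBOW(vocab_size=4)
    scores = model(torch.tensor([0, 1, 0, 3]))         # context "I think … I am"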
Continuous Bag of Words
The output is a probability distribution over all words in the dictionary.
This can be > 1 million words, so smarter training techniques are typically used:
“Negative sampling”
Vectors
Word2Vec Vectors
Word2Vec Vectors
King – Man + Woman = Queen
Pretrained word vectors
• GloVe: https://nlp.stanford.edu/projects/glove/
• FastText: https://fasttext.cc/docs/en/crawl-vectors.html
• ELMo: https://github.com/HIT-SCIR/ELMoForManyLangs
Can be used as-is or further trained on a specific corpus
Trained on Wikipedia and “common crawl”
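As a hedged sketch (the file name depends on which download is chosen), GloVe vectors are plain text and can be loaded directly:

    import numpy as np

    embeddings = {}
    with open("glove.6B.200d.txt", encoding="utf-8") as f:   # one of the files from the GloVe page
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.array(values, dtype=np.float32)

    v = embeddings["king"] - embeddings["man"] + embeddings["woman"]   # should land near "queen"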
Representing sentences
Using word embeddings, sentences become “pictures”:
“I think therefore I am”
5 x 200 matrix
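To illustrate (the indices here are made up), the “picture” is just the word vectors stacked row by row:

    import torch
    from torch import nn

    embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=200)  # e.g. loaded from pretrained vectors
    sentence = torch.tensor([12, 45, 97, 12, 31])   # word indices for "I think therefore I am"
    picture = embedding(sentence)                   # shape: (5, 200)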
Convolutional Neural Networks
CNNs: Convolutional Neural Networks
The kernel is trainable
CNNs: Convolutional Neural Networks
Padded with zeros
CNNs: Convolutional Neural Networks
Padded with zeros,
Stride = 2
CNNs: Convolutional Neural Networks
Kernels = Filters = Features in CNN language
Pooling
Max-pooling 3x3
Pooling = Subsampling in CNN language
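A small sketch (not from the slides) of the pieces named above in PyTorch: zero-padding, stride and max-pooling; the kernels/filters are the trainable parameters.

    import torch
    from torch import nn

    x = torch.randn(1, 1, 28, 28)                              # (batch, channels, height, width)

    conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)          # padded with zeros, stride 1
    conv_strided = nn.Conv2d(1, 16, kernel_size=3, padding=1, stride=2)
    pool = nn.MaxPool2d(kernel_size=3)                         # max-pooling 3x3

    h = pool(torch.relu(conv(x)))                              # 16 trainable kernels, then subsampling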
CNNs: Convolutional Neural Networks
Text Classification
Standard choices:
• Convolutional Neural Networks
• Recurrent Neural Networks (LSTMs)
Classification using CNN
See e.g. https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb
1D convolutions with 2D filters
(embedding size x kernel size)
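A minimal sketch in the spirit of the linked notebook (all sizes are assumptions): the filter spans the full embedding dimension, so it only slides along the sentence.

    import torch
    from torch import nn

    batch = torch.randn(8, 1, 40, 200)               # (batch, 1, sentence length, embedding size)

    conv = nn.Conv2d(1, 100, kernel_size=(3, 200))   # 2D filter: 3 words x full embedding size
    h = torch.relu(conv(batch)).squeeze(3)           # (8, 100, 38): effectively a 1D convolution
    pooled = torch.max(h, dim=2).values              # max-pool over the sentence
    logits = nn.Linear(100, 2)(pooled)               # e.g. positive / negative review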
Recurrent Neural Networks
Language Modelling
Hi mom, I’ll be late for …
Neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Recurrent neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Hidden state
What are recurrent neural networks?
Example: the “classic” RNN:
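The diagram isn't reproduced here; the “classic” RNN update it refers to is the standard one, sketched below with assumed dimensions:

    import torch

    W_h = torch.randn(64, 64)     # hidden-to-hidden weights
    W_x = torch.randn(64, 200)    # input-to-hidden weights
    b = torch.zeros(64)

    h = torch.zeros(64)                           # hidden state
    for x in torch.randn(5, 200):                 # one word vector at a time
        h = torch.tanh(W_h @ h + W_x @ x + b)     # h_t = tanh(W_h h_{t-1} + W_x x_t + b)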
Language Modelling with RNNs
Hi mom, I’ll be late for …
The RNN output can be used to predict the next word
Language Modelling with RNNs
Snippets at: https://bit.ly/2VWvs3m
RNN Design choices
“I grew up in France” “Since my mother tongue is ____”
(the clue, “France”, lies many words back: a long-range dependency)
Standard RNN:
LSTMs: Long Short-Term Memory
Standard RNN:
LSTM:
See https://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs: Long Short-Term Memory
Standard RNN:
LSTM:
LSTM Language Model
“I’ll be late for…”
Sample loop: take the word with the highest probability and repeat
(real models tend to stack many LSTMs)
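A hedged sketch of this sampling loop (the vocabulary, sizes and greedy choice are assumptions, and the model here is untrained): run the prompt through the LSTM, pick the most probable next word, feed it back in, repeat.

    import torch
    from torch import nn

    vocab = ["i'll", "be", "late", "for", "dinner", "<eos>"]   # toy dictionary
    embed = nn.Embedding(len(vocab), 32)
    lstm = nn.LSTM(32, 64, num_layers=2, batch_first=True)     # real models stack many LSTMs
    head = nn.Linear(64, len(vocab))

    tokens = [0, 1, 2, 3]                                      # "I'll be late for …"
    out, state = lstm(embed(torch.tensor([tokens])))           # run the prompt through the LSTM
    for _ in range(5):
        probs = torch.softmax(head(out[0, -1]), dim=0)         # distribution over the dictionary
        nxt = int(probs.argmax())                              # greedy: the most probable word
        tokens.append(nxt)
        out, state = lstm(embed(torch.tensor([[nxt]])), state) # feed it back in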
Sampling
Shakespeare
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
Wikipedia
Naturalism and decision for the majority of Arab countries’ capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham’s sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict.
XML
<revision>
  <id>15900676</id>
  <timestamp>2002-08-03T18:14:12Z</timestamp>
  <contributor>
    <username>Paris</username>
    <id>23</id>
  </contributor>
  <minor />
  <comment>Automated conversion</comment>
</revision>
Obama
Good everybody. Thank you very much. God bless the United States of America, and has already began with the world’s gathering
It’s about hard-earned for our efforts that are not continued.
We are all the assumptionion to the streets of the Americas that we are still for everybody and destruction.
We are doing a lot of this.
I know that someone would be prefered to their children to take a million insurance company. We’re watching their people and continued to find ourselves with Repub
Thank you. God bless you. Good morning, everybody. And May God loss man. Thank you very much. Thank you very much, everybody.
OpenAI Language Model…
Prompt: “Recycling is good for the world, no, you could not be more wrong”
“Recycling is NOT good for the world. It is bad for the
environment, it is bad for our health, and it is bad for our
economy. I’m not kidding. Recycling is not good for the
environment. It is destructive to the earth and it is a major
contributor to global warming. Recycling is not good for our
health. It contributes to obesity and diseases like heart disease
and cancer. Recycling is bad for our economy. It increases the
cost of a product, and in turn, the price of everything that is made
with that product. Recycling is not good for our nation. We pay a
tremendous price for the privilege of having the world’s most
advanced and efficient recycling system. Recycling is a huge,
colossal waste of time, energy, money, and resources.”
Transfer Learning & State-of-the-Art
Example: IMDB Dataset
Task: Guess the rating from review text
Transfer learning
Transfer learning
The Strength of Transfer learning
IMDB: What if only 1 % of reviews included a rating?
Can the remaining 99 % of reviews be used for anything?
Language model!
(and this is a very, very standard situation, in academia and industry)
The Strength of Transfer learning
“… we found that training our approach with only 100 labeled examples (and giving it access to about 50,000 unlabeled examples),
we were able to achieve the same performance as training a model from scratch with 10,000 labeled examples.
Another important insight was that we could use any reasonably general and large…”
- Howard & Ruder (2018)
Transfer learning: Other methods
They all laughed. [NEXT] Frodo felt his spirits reviving.
They all laughed. [NEXT] Bag End seemed sad and gloomy and dishevelled.
Task: Classify if two sentences are next to each other
See e.g. https://arxiv.org/abs/1810.04805
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”
Concepts skipped
• Encoder-Decoders (sequence to sequence)
• Attention
• Transformers
See e.g. paper: “Attention Is All You Need” (2017)