Deep Learning & Natural Language Processing
Embeddings, CNNs, RNNs, etc.
Julius B. Kirkegaard 2019
Snippets at: https://bit.ly/2VWvs3m
Deep Neural Networks & PyTorch
Deep neural networks
Neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Neural networks
Each layer: some matrix (the parameters) multiplies some vector (the input), followed by some non-linear function, the ”activation function”.
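As a sketch of the above (not taken from the slides; sizes are arbitrary), one layer is just a matrix multiplication followed by an activation:

    import torch

    x = torch.randn(100)        # "some vector": the input
    W = torch.randn(50, 100)    # "some matrix": the trainable parameters
    b = torch.randn(50)         # bias term (often included)

    h = torch.relu(W @ x + b)   # non-linear "activation function" applied to the linear map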
Deep neural networks
Requirements for DNN Frameworks
• Optimisation of parameters p
• Take first order derivatives
• Chain rule (backpropagation)
• Process large amounts of data fast
• Exploit GPUs
• Nice to haves:
• Standard functions and operations built-in
• Built-in optimizers
• Spread training across network
• Compile for fast inference
• …
PyTorch
• GPU acceleration
• Automatic Error-Backpropagation
(chain rule through operations)
• Tons of functionality built-in
Hard to play with; not good for new ideas and research (IMO)
Easy to play with, but difficult to implement custom and dynamic architectures
Requirement 1: Calculate gradients
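A minimal sketch of requirement 1 in PyTorch (not the original snippet): mark a tensor with requires_grad and let autograd apply the chain rule.

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x**2 + 3*x        # PyTorch records the operations
    y.backward()          # backpropagation (chain rule)
    print(x.grad)         # dy/dx = 2*x + 3 = tensor(7.)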
Requirement 2: GPU
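And a minimal sketch of requirement 2: the same tensor code runs on a GPU simply by placing the tensors there (assuming CUDA is available).

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(1000, 1000, device=device)   # allocate directly on the GPU
    b = torch.randn(1000, 1000).to(device)       # or move an existing tensor
    c = a @ b                                    # the matrix multiply runs on the GPU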
Simple Neural Network: ”Guess the Mean”
Neural network in three steps:
1. Design the architecture and initialise parameters
2. Calculate the loss
3. Update parameters based on the loss gradient
Warning: this is not the best way to implement it; a better version will follow…
(this version is for understanding)
Snippets at: https://bit.ly/2VWvs3m
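The linked snippet isn't reproduced here; a rough sketch of the ”guess the mean” example in this naive style (all names and sizes are assumptions) could look like:

    import torch

    data = torch.randn(100) + 5.0          # data whose mean the "network" should guess

    # 1. Design the architecture and initialise parameters
    mu = torch.zeros(1, requires_grad=True)

    lr = 0.1
    for step in range(100):
        # 2. Calculate the loss
        loss = ((data - mu)**2).mean()

        # 3. Update parameters based on the loss gradient
        loss.backward()
        with torch.no_grad():
            mu -= lr * mu.grad
            mu.grad.zero_()

    print(mu.item())                       # close to 5.0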
Better optimiser stepping
• What if some gradients are much smaller than others?
• What happens when gradients vanish as the loss becomes small?
Solution → Variable learning rates and momentum
• Many algorithms exist; perhaps the most popular is “Adam”
Better optimiser stepping
SGD (Stochastic gradient descent) Adam (Adaptive Moment Estimation)
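In PyTorch the two are drop-in replacements for each other; a hedged sketch (learning rates are just typical defaults):

    import torch

    params = [torch.randn(10, requires_grad=True)]

    opt = torch.optim.SGD(params, lr=0.01, momentum=0.9)   # plain SGD, optionally with momentum
    # opt = torch.optim.Adam(params, lr=0.001)             # Adam: adaptive per-parameter learning rates

    opt.zero_grad()
    loss = (params[0]**2).sum()
    loss.backward()
    opt.step()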
simple_nn.py module_nn.py
Snippets at: https://bit.ly/2VWvs3m
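The two files above are not reproduced here; as a guess at what the module-based version looks like, a typical nn.Module network is sketched below (layer sizes are assumptions):

    import torch
    from torch import nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(10, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
            )

        def forward(self, x):
            return self.layers(x)

    net = Net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)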
Representing sentences
The Trouble
“Hej med dig” (Danish for “Hi there”)
Bag of Words
“Hej med dig”
“Hej hej”
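As an illustration (not from the slides), a bag-of-words vector just counts how often each dictionary word occurs:

    from collections import Counter

    sentences = ["Hej med dig", "Hej hej"]
    vocab = sorted({w for s in sentences for w in s.lower().split()})   # ['dig', 'hej', 'med']

    for s in sentences:
        counts = Counter(s.lower().split())
        print(s, "->", [counts[w] for w in vocab])
    # "Hej med dig" -> [1, 1, 1]
    # "Hej hej"     -> [0, 2, 0]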
Bag of Words, poor behaviour #1
“I had my car cleaned.”
“I had cleaned my car.” (order ignored)
Bag of Words, poor behaviour #2
“Hej med dig”
“Heej med dig”
“Haj medd dig” (typos)
(semantically similar)
Bag of Words, poor behaviour #3
“Hej med dig”
The idea for a solution
Idea: Represent each word as a vector, with similar words
having vectors that are close
Problem: how to choose the vector representing each word?
The idea for a solution
The country was ruled by a _____
The bishop anointed the ____ with aromatic oils
The crown was put on the ____
”Context defines meaning”:
King/Queen
Continuous Bag of Words
• Input is a ”one-hot” vector
• We force the network to turn each word into a ~200-length vector
• From these vectors we predict the ”focus word”
• When done, keep the ”embeddings”
See e.g. https://github.com/FraLotito/pytorch-continuous-bag-of-words/blob/master/cbow.py
for a simple implementation
The bishop anointed the [queen] with aromatic oils
(context | focus word | context)
Continuous Bag of Words
“I think [therefore] I am”
(context | focus word | context)
Dictionary: [“I”, “think”, “therefore”, “am”]
Context size = 2
Continuous Bag of Words
Very simple version:
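The slide's code isn't reproduced here; a minimal sketch in the spirit of the linked implementation (dimensions and names are assumptions):

    import torch
    from torch import nn

    class CBOW(nn.Module):
        def __init__(self, vocab_size, embedding_dim=200):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)   # kept when training is done
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context):                    # context: indices of the surrounding words
            v = self.embeddings(context).mean(dim=0)   # average the context word vectors
            return self.linear(v)                      # scores over the whole dictionary

    # Dictionary: ["I", "think", "therefore", "am"]; predict "therefore" from its context
    model = CBOW(vocab_size=4)
    scores = model(torch.tensor([0, 1, 0, 3]))         # context "I think … I am"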
Continuous Bag of Words
The output is a probability distribution over all words in the dictionary.
This can be > 1 million words, so smarter training techniques are typically used:
“Negative sampling”
Vectors
Word2Vec Vectors
Word2Vec Vectors
King – Man + Woman = Queen
Pretrained word vectors
• GloVe: https://nlp.stanford.edu/projects/glove/
• FastText: https://fasttext.cc/docs/en/crawl-vectors.html
• ELMo: https://github.com/HIT-SCIR/ELMoForManyLangs
Can be used as-is or further trained on a specific corpus
Trained on Wikipedia and “common crawl”
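As a hedged sketch (the file name depends on which download is chosen), GloVe vectors are plain text and can be loaded directly:

    import numpy as np

    embeddings = {}
    with open("glove.6B.200d.txt", encoding="utf-8") as f:   # one of the files from the GloVe page
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.array(values, dtype=np.float32)

    v = embeddings["king"] - embeddings["man"] + embeddings["woman"]   # should land near "queen"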
Representing sentences
Using word embeddings, sentences become “pictures”:
“I think therefore I am”
5 x 200 matrix
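To illustrate (the indices here are made up), the “picture” is just the word vectors stacked row by row:

    import torch
    from torch import nn

    embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=200)  # e.g. loaded from pretrained vectors
    sentence = torch.tensor([12, 45, 97, 12, 31])   # word indices for "I think therefore I am"
    picture = embedding(sentence)                   # shape: (5, 200)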
Convolutional Neural Networks
CNNs: Convolutional Neural Networks
The kernel is trainable
CNNs: Convolutional Neural Networks
Padded with zeros
CNNs: Convolutional Neural Networks
Padded with zeros,
Stride = 2
CNNs: Convolutional Neural Networks
Kernels = Filters = Features in CNN language
Pooling
Max-pooling 3x3
Pooling = Subsampling in CNN language
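A small sketch (not from the slides) of the pieces named above in PyTorch: zero-padding, stride and max-pooling; the kernels/filters are the trainable parameters.

    import torch
    from torch import nn

    x = torch.randn(1, 1, 28, 28)                              # (batch, channels, height, width)

    conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)          # padded with zeros, stride 1
    conv_strided = nn.Conv2d(1, 16, kernel_size=3, padding=1, stride=2)
    pool = nn.MaxPool2d(kernel_size=3)                         # max-pooling 3x3

    h = pool(torch.relu(conv(x)))                              # 16 trainable kernels, then subsampling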
CNNs: Convolutional Neural Networks
Text Classification
Standard choices:
• Convolutional Neural Networks
• Recurrent Neural Networks (LSTMs)
Classification using CNN
See e.g. https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb
1D convolutions with 2D filters
(embedding size x kernel size)
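A minimal sketch in the spirit of the linked notebook (all sizes are assumptions): the filter spans the full embedding dimension, so it only slides along the sentence.

    import torch
    from torch import nn

    batch = torch.randn(8, 1, 40, 200)               # (batch, 1, sentence length, embedding size)

    conv = nn.Conv2d(1, 100, kernel_size=(3, 200))   # 2D filter: 3 words x full embedding size
    h = torch.relu(conv(batch)).squeeze(3)           # (8, 100, 38): effectively a 1D convolution
    pooled = torch.max(h, dim=2).values              # max-pool over the sentence
    logits = nn.Linear(100, 2)(pooled)               # e.g. positive / negative review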
Recurrent Neural Networks
Language Modelling
Hi mom, I’ll be late for …
Neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Recurrent neural networks
Network architecture
Parameters
Data (perhaps preprocessed)
Hidden state
What are recurrent neural networks?
Example: the “classic” RNN:
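The diagram isn't reproduced here; the “classic” RNN update it refers to is the standard one, sketched below with assumed dimensions:

    import torch

    W_h = torch.randn(64, 64)     # hidden-to-hidden weights
    W_x = torch.randn(64, 200)    # input-to-hidden weights
    b = torch.zeros(64)

    h = torch.zeros(64)                           # hidden state
    for x in torch.randn(5, 200):                 # one word vector at a time
        h = torch.tanh(W_h @ h + W_x @ x + b)     # h_t = tanh(W_h h_{t-1} + W_x x_t + b)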
Language Modelling with RNNs
Hi mom, I’ll be late for …
The RNN output can be used to predict the next word
Language Modelling with RNNs
Snippets at: https://bit.ly/2VWvs3m
RNN Design choices
“I grew up in France” “Since my mother tongue is ____”
(the clue, “France”, lies many words back: a long-range dependency)
Standard RNN:
LSTMs: Long Short-Term Memory
Standard RNN:
LSTM:
See https://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs: Long Short-Term Memory
Standard RNN:
LSTM:
LSTM Language Model
“I’ll be late for…”
Sample loop: take the word with the highest probability and repeat
(real models tend to stack many LSTMs)
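A hedged sketch of this sampling loop (the vocabulary, sizes and greedy choice are assumptions, and the model here is untrained): run the prompt through the LSTM, pick the most probable next word, feed it back in, repeat.

    import torch
    from torch import nn

    vocab = ["i'll", "be", "late", "for", "dinner", "<eos>"]   # toy dictionary
    embed = nn.Embedding(len(vocab), 32)
    lstm = nn.LSTM(32, 64, num_layers=2, batch_first=True)     # real models stack many LSTMs
    head = nn.Linear(64, len(vocab))

    tokens = [0, 1, 2, 3]                                      # "I'll be late for …"
    out, state = lstm(embed(torch.tensor([tokens])))           # run the prompt through the LSTM
    for _ in range(5):
        probs = torch.softmax(head(out[0, -1]), dim=0)         # distribution over the dictionary
        nxt = int(probs.argmax())                              # greedy: the most probable word
        tokens.append(nxt)
        out, state = lstm(embed(torch.tensor([[nxt]])), state) # feed it back in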
Sampling
Shakespeare
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
Wikipedia
Naturalism and decision for the majority of Arab countries’ capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham’s sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict.
XML
<revision>
  <id>15900676</id>
  <timestamp>2002-08-03T18:14:12Z</timestamp>
  <contributor>
    <username>Paris</username>
    <id>23</id>
  </contributor>
  <minor />
  <comment>Automated conversion</comment>
</revision>
Obama
Good everybody. Thank you very much. God bless the United States of America, and has already began with the world’s gathering
It’s about hard-earned for our efforts that are not continued.
We are all the assumptionion to the streets of the Americas that we are still for everybody and destruction.
We are doing a lot of this.
I know that someone would be prefered to their children to take a million insurance company. We’re watching their people and continued to find ourselves with Repub
Thank you. God bless you. Good morning, everybody. And May God loss man. Thank you very much. Thank you very much, everybody.
OpenAI Language Model…
Prompt: “Recycling is good for the world, no, you could not be more wrong”
“Recycling is NOT good for the world. It is bad for the
environment, it is bad for our health, and it is bad for our
economy. I’m not kidding. Recycling is not good for the
environment. It is destructive to the earth and it is a major
contributor to global warming. Recycling is not good for our
health. It contributes to obesity and diseases like heart disease
and cancer. Recycling is bad for our economy. It increases the
cost of a product, and in turn, the price of everything that is made
with that product. Recycling is not good for our nation. We pay a
tremendous price for the privilege of having the world’s most
advanced and efficient recycling system. Recycling is a huge,
colossal waste of time, energy, money, and resources.”
Transfer Learning & State-of-the-Art
Example: IMDB Dataset
Task: Guess the rating from review text
Transfer learning
Transfer learning
The Strength of Transfer learning
IMDB: What if only 1 % of reviews included a rating?
Can the remaining 99 % of reviews be used for anything?
Language model!
(and this is a very, very standard situation, in academia and industry)
The Strength of Transfer learning
“… we found that training our approach with only 100 labeled examples (and giving it access to about 50,000 unlabeled examples),
we were able to achieve the same performance as training a model from scratch with 10,000 labeled examples.
Another important insight was that we could use any reasonably general and large…”
- Howard & Ruder (2018)
Transfer learning: Other methods
They all laughed. [NEXT] Frodo felt his spirits reviving.
They all laughed. [NEXT] Bag End seemed sad and gloomy and dishevelled.
Task: Classify if two sentences are next to each other
See e.g. https://arxiv.org/abs/1810.04805
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”
Concepts skipped
• Encoder-Decoders (sequence to sequence)
• Attention
• Transformers
See e.g. paper: “Attention Is All You Need” (2017)