Neural models in NLP - ut · Neural models in NLP ... CS231n Convolutional Neural Networks for...

Neural models in NLPNatural Language Processing: Lecture 4

28.09.2017

Kairit Sirts

The goal of today’s lecture

• Explain word embeddings

• Explain the recurrent neural models used in NLP

Log-linear language model

y – the next word to predict

x – the context sequence: words, annotations etc

v – model parameters

f(x, y) – feature vector for the input-output pair (x, y)

The problem with log-linear models

Feature engineering

• Developing feature templates

• Which features are relevant to which problems?

• Experiment with subsets of features

• Features can be very complex

What if we could let the model learn the relevant features automatically?

Neural networks

1-hot representation

the girl with flowers is cute are were flower … …

The 1 0 0 0 0 0 0 0 0 0 0

girl 0 1 0 0 0 0 0 0 0 0 0

with 0 0 1 0 0 0 0 0 0 0 0

the 1 0 0 0 0 0 0 0 0 0 0

flowers 0 0 0 1 0 0 0 0 0 0 0

is 0 0 0 0 1 0 0 0 0 0 0

cute 0 0 0 0 0 1 0 0 0 0 0

… … … … … … … … … … … …

flower 0 0 0 0 0 0 0 0 1 0 0

What is the similarity between vectors for flower and flowers?

flowers 0 0 0 1 0 0 0 0 0 0 0

flower 0 0 0 0 0 0 0 0 1 0 0

the girl with flowers is cute are were flower … …

Features as distributed representations

8Deep Learning: What is meant by a distributed representation? https://www.quora.com/Deep-Learning-What-is-meant-by-a-distributed-representation/answer/Rangan-Majumder

Distributed word representations

f1 f2 f3 f4

flower 6 3 0 4

flowers 1 7 2 8

What is the cosine similarity between flower and flowers now?

Learning distributed word representations

The girl with the flowers is cute.

She has the flowers in her hand.

I picked these flowers myself.

The girl with a flower is cute.

She has a flower in her hand.

I picked this flower myself.

with the

has the

picked the

is cute

in her

myself

flowerswith a

picked a

is cute

in her

myself

flower

11http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png

Word2Vec

13Mikolov et al., 2013. Efficient Estimation of Word Representations in Vector Space

CBOW – continuous bag of words

• w(t-2), w(t-1), w(t+1), w(t+2) – one-hot vectors

• – a row in the parameter matrix

• C – the set of context vectors

• c – the size of the context window

• - linear projection• d – embedding size

• 14

Skip-gram model

• Predict the context words

• w(t) – one-hot vector

• Maximize: z

Training word embeddings

• General principle – maximize the probability of the:• Middle word, given the context words (CBOW)

• Context words, given the middle word (skip-gram)

• In case of skip-gram:• Given T training words in context

• Maximize:

• Minimize:

Training word embeddings

• Typically trained with gradient descent• You will learn more sophisticated methods in other courses

• Initialize the parameter vectors/matrices (somehow)

• Repeat until convergence:

• - the set of all trainable parameters

• - learning rate

Softmax vs log-linear model

Softmax is a log-linear model

Log-linear:

Softmax:

The gradient of a log-linear model

Empirical count Expected count

The gradients in skip-gram model

c – context wordw – middle word

The problem with softmax gradients

• Computing is computationally very expensive. Why?

• The gradients always include the sum over the whole vocabulary

• This makes computation very inefficient

Negative sampling

The general idea:

Maximize the probability of the (word, context) pairs that came from the training data (instead of the probability of the context given the word)

Previously: maximize

Now: maximize

Skip-gram objective with negative sampling

Maximize:

• - the set of random negative samples

• In practice, the number of negative samples per each positive sample is between 2-20

Tools for training word embeddings

• Word2vec – Gensim includes both CBOW and skip-gram implementations

• Glove – optimizes the predictions of co-occurrence counts between words

• Polyglot

• Dependency-based word embeddings

Neural models in NLP - ut · Neural models in NLP ... CS231n Convolutional Neural Networks for...

Documents

Practical Neural Networks for NLP (Part 2) - Graham Neubig · Practical Neural Networks for NLP (Part 2) Chris Dyer, Yoav Goldberg, Graham Neubig

Deep Learning for NLP: An Introduction to Neural Word Embeddings

Neural Networks and Lecture 4: Backpropagationvision.stanford.edu/teaching/cs231n/slides/2019/cs231n_2019_lecture04.pdf · Neural Turing Machine Figure reproduced with permission

Thanapat Kangkachit, Ph.D. - dongtanengineer.comdongtanengineer.com/contents/180610/DeepLearning.pdf · • CS231n: Convolutional Neural Networks for Visual Recognition [ Introduction

Lecture 10: Recurrent Neural Networksvision.stanford.edu/teaching/cs231n/slides/2017/cs231n_2017_lectur… · Lecture 10 - 21 May 4, 2017 Recurrent Neural Network x RNN y We can process

CS11-747 Neural Networks for NLP Learning From/For ...phontron.com/class/nn4nlp2017/assets/slides/nn4nlp-21-kb.pdf · CS11-747 Neural Networks for NLP Learning From/For Knowledge

Cs231n 2017 lecture10 Recurrent Neural Networks

CS224d Deep NLP Lecture 8: Recurrent Neural Networkscs224d.stanford.edu/lectures/CS224d-Lecture8.pdf · CS224d Deep NLP Lecture 8: Recurrent Neural Networks Richard Socher richard@metamind.io

Deep neural networks - Carnegie Mellon School of …wcohen/10-601/deep-1.pdf · 2016-04-05 · • Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition

Practical Deep Neural Networks - DGYrt.dgyblog.com/res/dlworkshop/introduction.pdf · Practical Deep Neural Networks ... ' CS231n Convolutional Neural Networks for Visual Recognition

Recurrent Neural Networks · Recurrent Neural Networks LING572 Advanced Statistical Methods for NLP March 5 2020 1. Outline Word representations and MLPs for NLP tasks Recurrent neural

Machine Learning for NLP Lecture 4: Neural networks · · 2016-09-16Machine Learning for NLP Lecture 4: Neural networks UNIVERSITY OF GOTHENBURG Richard Johansson September 16,

CS224d: Deep NLP Lecture 13: Convolutional Neural Networks ... · CS224d: Deep NLP Lecture 13: Convolutional Neural Networks (for NLP) Richard Socher richard@metamind.io

Practical Neural Networks for NLP (Part 1) · 2016-11-17 · Practical Neural Networks for NLP (Part 1) Chris Dyer, Yoav Goldberg, Graham Neubig November 1, 2016 EMNLP ... neural

NLP Programming Tutorial 8 - Recurrent Neural Nets · 7 NLP Programming Tutorial 8 – Recurrent Neural Nets Review: Prediction Problems Given x, predict y A book review Oh, man I

Introductiondsba.korea.ac.kr/wp/wp-content/seminar/CS231n/CS231n... · 2017-01-08 · Linear Classification, Optimization, Neural Networks, Convolutional Neural Network, Visualization,

Post-hoc Interpretability for Neural NLP: A Survey

(Deep) Neural Networks在 NLP 和 Text Mining 总结

CS224d Deep NLP Lecture 6: Neural Tips and Tricks ... · Neural Tips and Tricks + Recurrent Neural Networks ... • First intro to Recurrent Neural Networks ... through the parameters

CS11-747 Neural Networks for NLP Reinforcement Learning ...phontron.com/class/nn4nlp2017/assets/slides/nn4nlp-16-rl.pdf · CS11-747 Neural Networks for NLP Reinforcement Learning