
Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


This lecture

- Learning word vectors (cont.)

- Representation learning in NLP


Recap: Latent Semantic Analysis

- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)

- Factorization
  - Apply SVD to the matrix to find latent components

- Measuring degree of relation
  - Cosine of latent vectors

Recap: Mapping to Latent Space via SVD

- SVD generalizes the original data
- Uncovers relationships not explicit in the thesaurus
- Term vectors projected to a 𝑘-dimensional latent space

- Word similarity: cosine of two column vectors in 𝚺𝐕ᵀ

𝑪 ≈ 𝐔 𝚺 𝐕ᵀ
(𝑑×𝑛)   (𝑑×𝑘) (𝑘×𝑘) (𝑘×𝑛)
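To make the factorization concrete, here is a minimal numpy sketch on a made-up 𝑑×𝑛 matrix; the matrix values and variable names are illustrative, not from the slides:

```python
import numpy as np

# Toy d x n co-occurrence matrix C (rows = terms, columns = contexts/documents).
C = np.array([[2., 0., 1., 0.],
              [1., 3., 0., 0.],
              [0., 1., 0., 2.],
              [0., 0., 2., 1.]])

k = 2                                          # latent dimension
U, s, Vt = np.linalg.svd(C, full_matrices=False)
latent = np.diag(s[:k]) @ Vt[:k, :]            # columns of Sigma V^T: k-dim vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Word similarity = cosine of two column vectors in Sigma V^T.
print(cosine(latent[:, 0], latent[:, 1]))
```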

Low rank approximation

- Frobenius norm: for an 𝑚×𝑛 matrix 𝐶,

  ||C||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} |c_{ij}|² )
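As a quick sanity check (a tiny numpy example, not from the slides), the entry-wise definition matches the built-in Frobenius norm:

```python
import numpy as np

C = np.array([[1., 2.],
              [3., 4.]])
by_hand = np.sqrt((np.abs(C) ** 2).sum())      # sqrt of the sum of squared entries
print(by_hand, np.linalg.norm(C, 'fro'))       # both equal sqrt(30) ~ 5.477
```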

- Rank of a matrix: the number of linearly independent rows (or columns) of the matrix


Low rank approximation

- Low-rank approximation problem:

  min_X ||C − X||_F   s.t.   rank(X) = k

- If I can only use k independent vectors to describe the points in the space, what are the best choices?


Essentially, we minimize the "reconstruction loss" under a low-rank constraint.


Low rank approximation

- Assume the rank of 𝐶 is r
- SVD: C = U Σ Vᵀ,  Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)

- Zero out the r − k trailing values:
  Σ′ = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)

- C_k = U Σ′ Vᵀ is the best rank-k approximation:

  C_k = argmin_X ||C − X||_F   s.t.   rank(X) = k


Σ = diag(σ_1, …, σ_r, 0, …, 0)   (r non-zero singular values on the diagonal)
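A small numpy sketch of this construction (the random matrices are illustrative; by the Eckart–Young theorem, C_k should beat any other rank-k matrix in Frobenius error):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
s_trunc = s.copy()
s_trunc[k:] = 0.0                         # zero out the trailing singular values
C_k = U @ np.diag(s_trunc) @ Vt           # best rank-k approximation of C

# Compare against some other rank-k matrix: its error should be no smaller.
X = rng.normal(size=(6, k)) @ rng.normal(size=(k, 5))
print(np.linalg.norm(C - C_k, 'fro'), np.linalg.norm(C - X, 'fro'))
```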

Word2Vec

- LSA: a compact representation of the co-occurrence matrix

- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)

- Easy to incorporate new words or sentences


Word2Vec

- Similar to a language model, but predicting the next word is not the goal.

- Idea: words that are semantically similar often occur near each other in text
  - Embeddings that are good at predicting neighboring words are also good at representing similarity


Skip-gram vs. continuous bag-of-words (CBOW)

- What are the differences?


[Figure: skip-gram predicts each surrounding context word from the center word; CBOW predicts the center word from its surrounding context words.]

Objective of Word2Vec (Skip-gram)

- Maximize the log-likelihood of the context words
  w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m}
  given the center word w_t

- m is usually 5–10
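Written out, a standard form of this objective is the following (the 1/T averaging over positions is an assumption about the slides' exact notation); we maximize J(θ), or equivalently minimize −J(θ):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)
```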


Objective of Word2Vec (Skip-gram)

- How to model log P(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} ⋅ v_{w_t}) / Σ_{w′} exp(u_{w′} ⋅ v_{w_t})

- The softmax function, again!

- Every word has 2 vectors
  - v_w: when w is the center word
  - u_w: when w is the outside word (context word)
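A minimal numpy sketch of this softmax, with toy matrices U (outside vectors u_w, one row per word) and V (center vectors v_w); all sizes and names are illustrative:

```python
import numpy as np

# Toy skip-gram softmax: p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
U = rng.normal(size=(vocab_size, dim))      # outside-word vectors u_w
V = rng.normal(size=(vocab_size, dim))      # center-word vectors v_w

def p_context_given_center(c, U, V):
    scores = U @ V[c]                       # u_w . v_c for every word w
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

probs = p_context_given_center(c=2, U=U, V=V)
print(probs, probs.sum())                   # a distribution over the vocabulary
```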


How to update?

p(w_{t+j} | w_t) = exp(u_{w_{t+j}} ⋅ v_{w_t}) / Σ_{w′} exp(u_{w′} ⋅ v_{w_t})

- How to minimize J(θ)?
  - Gradient descent!
- How to compute the gradient?


Recap: Calculus


- Gradient: for 𝒙ᵀ = (x_1, x_2, x_3),

  ∇φ(𝒙) = ( ∂φ(𝒙)/∂x_1,  ∂φ(𝒙)/∂x_2,  ∂φ(𝒙)/∂x_3 )ᵀ

- If φ(𝒙) = 𝒂 ⋅ 𝒙 (also written 𝒂ᵀ𝒙), then ∇φ(𝒙) = 𝒂

Recap: Calculus


- Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du) ⋅ (du/dx)

Examples:
1. y = (x^? + 6)^3
2. y = ln(x^2 + 5)
3. y = exp(x^? + 3x + 2)
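For instance, the second example works out as:

```latex
y = \ln(x^2 + 5), \quad u = x^2 + 5
\;\Longrightarrow\;
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = \frac{1}{u} \cdot 2x = \frac{2x}{x^2 + 5}
```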

Other useful formulas

- If y = exp(x), then dy/dx = exp(x)

- If y = log(x), then dy/dx = 1/x


When I say log (in this course), usually I mean ln.


Example

- Assume the vocabulary set is W. We have one center word c and one context word o.

- What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o ⋅ v_c) / Σ_{w′ ∈ W} exp(u_{w′} ⋅ v_c)

- What is the gradient of the log-likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ p(w|c)}[u_w]
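This gradient comes from differentiating the log of the softmax (a standard derivation, spelled out here for completeness):

```latex
\log p(o \mid c) = u_o \cdot v_c - \log \sum_{w'} \exp(u_{w'} \cdot v_c)

\frac{\partial \log p(o \mid c)}{\partial v_c}
 = u_o - \sum_{w'} \frac{\exp(u_{w'} \cdot v_c)}{\sum_{w''} \exp(u_{w''} \cdot v_c)}\, u_{w'}
 = u_o - \sum_{w'} p(w' \mid c)\, u_{w'}
 = u_o - \mathbb{E}_{w \sim p(w \mid c)}[u_w]
```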


Gradient Descent

Goal: min_w J(w)

Update w: 𝑤 ← 𝑤 − 𝜂∇𝐽(𝑤)


Local minimum vs. global minimum


Stochastic gradient descent

- Let J(w) = (1/n) Σ_{i=1}^{n} J_i(w)

- Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^{n} ∇J_i(w)

- Stochastic gradient descent:
  - Approximate (1/n) Σ_{i=1}^{n} ∇J_i(w) by the gradient at a single example, ∇J_i(w) (why?)
  - At each step:


    Randomly pick an example i, then update  w ← w − η ∇J_i(w)
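A minimal SGD sketch on a toy least-squares objective (the data, objective, and names are illustrative, not the word2vec loss):

```python
import numpy as np

# J(w) = (1/n) * sum_i (x_i . w - y_i)^2, with J_i(w) = (x_i . w - y_i)^2
# and grad J_i(w) = 2 * (x_i . w - y_i) * x_i.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.05                                  # learning rate
for step in range(2000):
    i = rng.integers(n)                     # randomly pick an example i
    grad_i = 2.0 * (X[i] @ w - y[i]) * X[i] # gradient of J_i at w
    w = w - eta * grad_i                    # w <- w - eta * grad J_i(w)

print(w)                                    # should end up close to true_w
```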

Negative sampling

- With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ p(w|c)}[u_w]

- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
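A sketch of one such update, in the spirit of skip-gram with negative sampling (SGNS, Mikolov et al.); the exact objective used on these slides may differ, and all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(U, V, c, o, neg_ids, eta=0.05):
    """Gradient-ascent step on log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c)."""
    v_c = V[c]
    grad_vc = np.zeros_like(v_c)
    # Positive (observed) context word o: push u_o . v_c up.
    g_pos = 1.0 - sigmoid(U[o] @ v_c)
    grad_vc += g_pos * U[o]
    U[o] = U[o] + eta * g_pos * v_c
    # A few sampled "negative" words: push u_k . v_c down.
    # (A real implementation would avoid sampling o itself.)
    for k in neg_ids:
        g_neg = -sigmoid(U[k] @ v_c)
        grad_vc += g_neg * U[k]
        U[k] = U[k] + eta * g_neg * v_c
    V[c] = V[c] + eta * grad_vc

rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(10, 4))   # outside-word vectors u_w
V = 0.1 * rng.normal(size=(10, 4))   # center-word vectors v_w
sgns_update(U, V, c=1, o=3, neg_ids=rng.integers(0, 10, size=5))
```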


More about Word2Vec – relation to LSA

- LSA factorizes a matrix of co-occurrence counts

- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

- PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) ⋅ |D| / (#(w) ⋅ #(c)) ]
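A small sketch of building a (shifted) PMI matrix from raw co-occurrence counts; the toy counts and variable names are illustrative, and the shift subtracts log k, with k the number of negative samples (the positive shifted PMI of Levy & Goldberg 2014):

```python
import numpy as np

# counts[w, c] = #(w, c); |D| = total number of observed (w, c) pairs.
counts = np.array([[10., 2., 0.],
                   [ 3., 8., 1.],
                   [ 0., 1., 6.]])
D = counts.sum()
w_counts = counts.sum(axis=1, keepdims=True)   # #(w)
c_counts = counts.sum(axis=0, keepdims=True)   # #(c)

with np.errstate(divide='ignore'):             # log(0) -> -inf for unseen pairs
    pmi = np.log(counts * D / (w_counts * c_counts))

k = 5                                          # number of negative samples
shifted_ppmi = np.maximum(pmi - np.log(k), 0)  # positive shifted PMI matrix
print(shifted_ppmi)
```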


All problems solved?


Continuous Semantic Representations

[Figure: words plotted in a continuous semantic space, e.g., sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling]


Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?


Polarity Inducing LSA [Yih, Zweig, Platt 2012]

- Data representation
  - Encode two opposite relations in a matrix using "polarity"
    - Synonyms & antonyms (e.g., from a thesaurus)

- Factorization
  - Apply SVD to the matrix to find latent components

- Measuring degree of relation
  - Cosine of latent vectors

Encode Synonyms & Antonyms in Matrix

- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden

                         joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"      1         1       -1       -1          0
Group 2: "sad"            -1        -1        1        1          0
Group 3: "affection"       0         0        0        0          1

Target word: row-vector. Inducing polarity: the cosine score is + for synonyms and − for antonyms.
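A numpy sketch of polarity-inducing LSA on the toy matrix above (illustrative only; the SVD setup mirrors the earlier LSA recipe):

```python
import numpy as np

# Rows are thesaurus groups, columns are words; antonyms of a group get -1.
words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1.,  1., -1., -1., 0.],   # Group 1: "joyfulness"
              [-1., -1.,  1.,  1., 0.],   # Group 2: "sad"
              [ 0.,  0.,  0.,  0., 1.]])  # Group 3: "affection"

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
word_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T     # one latent vector per word (column of Sigma V^T)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(word_vecs[words.index("joy")], word_vecs[words.index("gladden")]))  # ~ +1 (synonyms)
print(cosine(word_vecs[words.index("joy")], word_vecs[words.index("sorrow")]))   # ~ -1 (antonyms)
```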

Continuous representations for entities


[Figure: entities Michelle Obama, Democratic Party, George W. Bush, Laura Bush, and Republican Party, with one missing entity marked "?", illustrating relations (e.g., spouse, political party) in a continuous entity space]

Continuous representations for entities


- Useful resources for NLP applications
  - Semantic Parsing & Question Answering
  - Information Extraction
