
Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


This lecture

- Learning word vectors (cont.)

- Representation learning in NLP


Recap: Latent Semantic Analysis

- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)

- Factorization
  - Apply SVD to the matrix to find latent components

- Measuring degree of relation
  - Cosine of latent vectors

Recap: Mapping to Latent Space via SVD

- SVD generalizes the original data
- Uncovers relationships not explicit in the thesaurus
- Term vectors projected to a 𝑘-dimensional latent space

- Word similarity: cosine of two column vectors in 𝚺𝐕ᵀ

𝑪 ≈ 𝐔 𝚺 𝐕ᵀ
(𝑑×𝑛)   (𝑑×𝑘) (𝑘×𝑘) (𝑘×𝑛)
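To make the factorization concrete, here is a minimal numpy sketch on a made-up 𝑑×𝑛 matrix; the matrix values and variable names are illustrative, not from the slides:

```python
import numpy as np

# Toy d x n co-occurrence matrix C (rows = terms, columns = contexts/documents).
C = np.array([[2., 0., 1., 0.],
              [1., 3., 0., 0.],
              [0., 1., 0., 2.],
              [0., 0., 2., 1.]])

k = 2                                          # latent dimension
U, s, Vt = np.linalg.svd(C, full_matrices=False)
latent = np.diag(s[:k]) @ Vt[:k, :]            # columns of Sigma V^T: k-dim vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Word similarity = cosine of two column vectors in Sigma V^T.
print(cosine(latent[:, 0], latent[:, 1]))
```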

Low rank approximation

- Frobenius norm: for an 𝑚×𝑛 matrix 𝐶,

  ||C||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} |c_{ij}|² )
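As a quick sanity check (a tiny numpy example, not from the slides), the entry-wise definition matches the built-in Frobenius norm:

```python
import numpy as np

C = np.array([[1., 2.],
              [3., 4.]])
by_hand = np.sqrt((np.abs(C) ** 2).sum())      # sqrt of the sum of squared entries
print(by_hand, np.linalg.norm(C, 'fro'))       # both equal sqrt(30) ~ 5.477
```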

- Rank of a matrix: the number of linearly independent rows (or columns) of the matrix


Low rank approximation

- Low-rank approximation problem:

  min_X ||C − X||_F   s.t.   rank(X) = k

- If I can only use k independent vectors to describe the points in the space, what are the best choices?


Essentially, we minimize the "reconstruction loss" under a low-rank constraint.


Low rank approximation

- Assume the rank of 𝐶 is r
- SVD: C = U Σ Vᵀ,  Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)

- Zero out the r − k trailing values:
  Σ′ = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)

- C_k = U Σ′ Vᵀ is the best rank-k approximation:

  C_k = argmin_X ||C − X||_F   s.t.   rank(X) = k


Σ = diag(σ_1, …, σ_r, 0, …, 0)   (r non-zero singular values on the diagonal)
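A small numpy sketch of this construction (the random matrices are illustrative; by the Eckart–Young theorem, C_k should beat any other rank-k matrix in Frobenius error):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
s_trunc = s.copy()
s_trunc[k:] = 0.0                         # zero out the trailing singular values
C_k = U @ np.diag(s_trunc) @ Vt           # best rank-k approximation of C

# Compare against some other rank-k matrix: its error should be no smaller.
X = rng.normal(size=(6, k)) @ rng.normal(size=(k, 5))
print(np.linalg.norm(C - C_k, 'fro'), np.linalg.norm(C - X, 'fro'))
```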

Word2Vec

- LSA: a compact representation of the co-occurrence matrix

- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)

- Easy to incorporate new words or sentences


Word2Vec

- Similar to a language model, but predicting the next word is not the goal.

- Idea: words that are semantically similar often occur near each other in text
  - Embeddings that are good at predicting neighboring words are also good at representing similarity


Skip-gram vs. continuous bag-of-words (CBOW)

- What are the differences?


[Figure: skip-gram predicts each surrounding context word from the center word; CBOW predicts the center word from its surrounding context words.]

Objective of Word2Vec (Skip-gram)

- Maximize the log-likelihood of the context words
  w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m}
  given the center word w_t

- m is usually 5–10
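Written out, a standard form of this objective is the following (the 1/T averaging over positions is an assumption about the slides' exact notation); we maximize J(θ), or equivalently minimize −J(θ):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)
```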


Objective of Word2Vec (Skip-gram)

- How to model log P(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} ⋅ v_{w_t}) / Σ_{w′} exp(u_{w′} ⋅ v_{w_t})

- The softmax function, again!

- Every word has 2 vectors
  - v_w: when w is the center word
  - u_w: when w is the outside word (context word)
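A minimal numpy sketch of this softmax, with toy matrices U (outside vectors u_w, one row per word) and V (center vectors v_w); all sizes and names are illustrative:

```python
import numpy as np

# Toy skip-gram softmax: p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
U = rng.normal(size=(vocab_size, dim))      # outside-word vectors u_w
V = rng.normal(size=(vocab_size, dim))      # center-word vectors v_w

def p_context_given_center(c, U, V):
    scores = U @ V[c]                       # u_w . v_c for every word w
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

probs = p_context_given_center(c=2, U=U, V=V)
print(probs, probs.sum())                   # a distribution over the vocabulary
```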


How to update?

p(w_{t+j} | w_t) = exp(u_{w_{t+j}} ⋅ v_{w_t}) / Σ_{w′} exp(u_{w′} ⋅ v_{w_t})

- How to minimize J(θ)?
  - Gradient descent!
- How to compute the gradient?


Recap: Calculus


- Gradient: for 𝒙ᵀ = (x_1, x_2, x_3),

  ∇φ(𝒙) = ( ∂φ(𝒙)/∂x_1,  ∂φ(𝒙)/∂x_2,  ∂φ(𝒙)/∂x_3 )ᵀ

- If φ(𝒙) = 𝒂 ⋅ 𝒙 (also written 𝒂ᵀ𝒙), then ∇φ(𝒙) = 𝒂

Recap: Calculus


- Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du) ⋅ (du/dx)

Examples:
1. y = (x^? + 6)^3
2. y = ln(x^2 + 5)
3. y = exp(x^? + 3x + 2)
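For instance, the second example works out as:

```latex
y = \ln(x^2 + 5), \quad u = x^2 + 5
\;\Longrightarrow\;
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = \frac{1}{u} \cdot 2x = \frac{2x}{x^2 + 5}
```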

Other useful formulas

- If y = exp(x), then dy/dx = exp(x)

- If y = log(x), then dy/dx = 1/x


When I say log (in this course), usually I mean ln.


Example

- Assume the vocabulary set is W. We have one center word c and one context word o.

- What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o ⋅ v_c) / Σ_{w′ ∈ W} exp(u_{w′} ⋅ v_c)

- What is the gradient of the log-likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ p(w|c)}[u_w]
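This gradient comes from differentiating the log of the softmax (a standard derivation, spelled out here for completeness):

```latex
\log p(o \mid c) = u_o \cdot v_c - \log \sum_{w'} \exp(u_{w'} \cdot v_c)

\frac{\partial \log p(o \mid c)}{\partial v_c}
 = u_o - \sum_{w'} \frac{\exp(u_{w'} \cdot v_c)}{\sum_{w''} \exp(u_{w''} \cdot v_c)}\, u_{w'}
 = u_o - \sum_{w'} p(w' \mid c)\, u_{w'}
 = u_o - \mathbb{E}_{w \sim p(w \mid c)}[u_w]
```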


Gradient Descent

Goal: min_w J(w)

Update w: 𝑤 ← 𝑤 − 𝜂∇𝐽(𝑤)


Local minimum vs. global minimum


Stochastic gradient descent

- Let J(w) = (1/n) Σ_{i=1}^{n} J_i(w)

- Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^{n} ∇J_i(w)

- Stochastic gradient descent:
  - Approximate (1/n) Σ_{i=1}^{n} ∇J_i(w) by the gradient at a single example, ∇J_i(w) (why?)
  - At each step:


    Randomly pick an example i, then update  w ← w − η ∇J_i(w)
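A minimal SGD sketch on a toy least-squares objective (the data, objective, and names are illustrative, not the word2vec loss):

```python
import numpy as np

# J(w) = (1/n) * sum_i (x_i . w - y_i)^2, with J_i(w) = (x_i . w - y_i)^2
# and grad J_i(w) = 2 * (x_i . w - y_i) * x_i.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.05                                  # learning rate
for step in range(2000):
    i = rng.integers(n)                     # randomly pick an example i
    grad_i = 2.0 * (X[i] @ w - y[i]) * X[i] # gradient of J_i at w
    w = w - eta * grad_i                    # w <- w - eta * grad J_i(w)

print(w)                                    # should end up close to true_w
```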

Negative sampling

- With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ p(w|c)}[u_w]

- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
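A sketch of one such update, in the spirit of skip-gram with negative sampling (SGNS, Mikolov et al.); the exact objective used on these slides may differ, and all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(U, V, c, o, neg_ids, eta=0.05):
    """Gradient-ascent step on log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c)."""
    v_c = V[c]
    grad_vc = np.zeros_like(v_c)
    # Positive (observed) context word o: push u_o . v_c up.
    g_pos = 1.0 - sigmoid(U[o] @ v_c)
    grad_vc += g_pos * U[o]
    U[o] = U[o] + eta * g_pos * v_c
    # A few sampled "negative" words: push u_k . v_c down.
    # (A real implementation would avoid sampling o itself.)
    for k in neg_ids:
        g_neg = -sigmoid(U[k] @ v_c)
        grad_vc += g_neg * U[k]
        U[k] = U[k] + eta * g_neg * v_c
    V[c] = V[c] + eta * grad_vc

rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(10, 4))   # outside-word vectors u_w
V = 0.1 * rng.normal(size=(10, 4))   # center-word vectors v_w
sgns_update(U, V, c=1, o=3, neg_ids=rng.integers(0, 10, size=5))
```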


More about Word2Vec – relation to LSA

- LSA factorizes a matrix of co-occurrence counts

- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

- PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) ⋅ |D| / (#(w) ⋅ #(c)) ]
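A small sketch of building a (shifted) PMI matrix from raw co-occurrence counts; the toy counts and variable names are illustrative, and the shift subtracts log k, with k the number of negative samples (the positive shifted PMI of Levy & Goldberg 2014):

```python
import numpy as np

# counts[w, c] = #(w, c); |D| = total number of observed (w, c) pairs.
counts = np.array([[10., 2., 0.],
                   [ 3., 8., 1.],
                   [ 0., 1., 6.]])
D = counts.sum()
w_counts = counts.sum(axis=1, keepdims=True)   # #(w)
c_counts = counts.sum(axis=0, keepdims=True)   # #(c)

with np.errstate(divide='ignore'):             # log(0) -> -inf for unseen pairs
    pmi = np.log(counts * D / (w_counts * c_counts))

k = 5                                          # number of negative samples
shifted_ppmi = np.maximum(pmi - np.log(k), 0)  # positive shifted PMI matrix
print(shifted_ppmi)
```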


All problems solved?


Continuous Semantic Representations

[Figure: words plotted in a continuous semantic space, e.g., sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling]


Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?


Polarity Inducing LSA [Yih, Zweig, Platt 2012]

- Data representation
  - Encode two opposite relations in a matrix using "polarity"
    - Synonyms & antonyms (e.g., from a thesaurus)

- Factorization
  - Apply SVD to the matrix to find latent components

- Measuring degree of relation
  - Cosine of latent vectors

Encode Synonyms & Antonyms in Matrix

- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden

                         joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"      1         1       -1       -1          0
Group 2: "sad"            -1        -1        1        1          0
Group 3: "affection"       0         0        0        0          1

Target word: row-vector. Inducing polarity: the cosine score is + for synonyms and − for antonyms.
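A numpy sketch of polarity-inducing LSA on the toy matrix above (illustrative only; the SVD setup mirrors the earlier LSA recipe):

```python
import numpy as np

# Rows are thesaurus groups, columns are words; antonyms of a group get -1.
words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1.,  1., -1., -1., 0.],   # Group 1: "joyfulness"
              [-1., -1.,  1.,  1., 0.],   # Group 2: "sad"
              [ 0.,  0.,  0.,  0., 1.]])  # Group 3: "affection"

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
word_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T     # one latent vector per word (column of Sigma V^T)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(word_vecs[words.index("joy")], word_vecs[words.index("gladden")]))  # ~ +1 (synonyms)
print(cosine(word_vecs[words.index("joy")], word_vecs[words.index("sorrow")]))   # ~ -1 (antonyms)
```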

Continuous representations for entities


[Figure: entities Michelle Obama, Democratic Party, George W. Bush, Laura Bush, and Republican Party, with one missing entity marked "?", illustrating relations (e.g., spouse, political party) in a continuous entity space]

Continuous representations for entities


- Useful resources for NLP applications
  - Semantic Parsing & Question Answering
  - Information Extraction
