Lecture 6: Vector Space Model
Kai-Wei Chang, CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
v How to represent a word, a sentence, or a document?
v How to infer the relationship among words?
v We focus on “semantics”: distributional semantics
v What is the meaning of “life”?
How to represent a word
v Naïve way: represent words as atomic symbols: student, talk, university
v Used in n-gram language models and logical analysis
v Represent each word as a "one-hot" vector: [ 0 0 0 1 0 … 0 ]
v How large is this vector?
v PTB data: ~50k words; Google 1T data: 13M words
v 𝑣 ⋅ 𝑢 =?
(Vocabulary in the illustration: egg, student, talk, university, happy, buy)
Issues?
v Dimensionality is large; the vectors are sparse
v No notion of similarity
v Cannot represent new words
v Any ideas?
𝑣_happy = [ 0 0 0 1 0 … 0 ]
𝑣_sad = [ 0 0 1 0 0 … 0 ]
𝑣_milk = [ 1 0 0 0 0 … 0 ]
𝑣_happy ⋅ 𝑣_sad = 𝑣_happy ⋅ 𝑣_milk = 0
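A minimal sketch of this in NumPy (the toy vocabulary is chosen here for illustration and is not from the slides): every pair of distinct one-hot vectors has dot product 0, so the representation carries no similarity information at all.

import numpy as np

vocab = ["milk", "egg", "sad", "happy", "buy"]   # hypothetical toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

v_happy, v_sad, v_milk = one_hot("happy"), one_hot("sad"), one_hot("milk")
print(v_happy @ v_sad)    # 0.0
print(v_happy @ v_milk)   # 0.0 -- "happy" looks equally (un)related to everything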
Idea 1: Taxonomy (Word category)
What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
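The taxonomy can also be turned into a number: NLTK's path_similarity scores two synsets by the shortest path connecting them in the hypernym hierarchy (a sketch continuing the session above; exact values depend on the WordNet version).

>>> right.path_similarity(minke)     # both baleen whales: relatively close
0.25
>>> right.path_similarity(tortoise)  # common ancestor is only 'vertebrate'
0.076923076923076927
>>> right.path_similarity(novel)     # common ancestor is only 'entity'
0.043478260869565216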
Requires human labor
Taxonomy (Word category)
v Synonym, hypernym (is-a), hyponym
Idea 2: Similarity = Clustering
Cluster n-gram model
v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information
(Implementation of the Brown hierarchical word clustering algorithm, by Percy Liang)
Idea 3: Distributional representation
v Linguistic items with similar distributions have similar meanings
v i.e., words that occur in the same contexts ⇒ similar meaning
"a word is characterized by the company it keeps” --Firth, John, 1957
Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meanings are vectors of "basic concepts"
v What are the "basic concepts"?
v How to assign weights?
v How to define the similarity/distance?
𝑣_king = [0.8 0.9 0.1 0 …]
𝑣_queen = [0.8 0.1 0.8 0 …]
𝑣_apple = [0.1 0.2 0.1 0.8 …]
(basic-concept dimensions: royalty, masculinity, femininity, eatable, …)
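A small sketch, reusing the toy vectors above, of how cosine similarity over these "basic concept" dimensions captures that king is closer to queen than to apple (the numbers are the illustrative ones from the slide, not learned values):

import numpy as np

# toy "basic concept" vectors over (royalty, masculinity, femininity, eatable)
v_king  = np.array([0.8, 0.9, 0.1, 0.0])
v_queen = np.array([0.8, 0.1, 0.8, 0.0])
v_apple = np.array([0.1, 0.2, 0.1, 0.8])

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(v_king, v_queen))   # high: both score high on "royalty"
print(cos(v_king, v_apple))   # low: little overlap in concepts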
An illustration of the vector space model
[Figure: words w1-w5 plotted along the axes Royalty, Masculine, and Eatable, with |D2-D4| marking the distance between two points]
Semantic similarity in 2D
v Home Depot products
Capture the structure of words
v Example from GloVe
How to use word vectors?
Pre-trained word vectors
v Google News (word2vec): https://code.google.com/archive/p/word2vec
v 100 billion tokens, 300 dimensions, 3M words
v GloVe project: http://nlp.stanford.edu/projects/glove/
v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B)
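A minimal sketch of loading such pre-trained vectors with gensim (gensim and the file name are assumptions, not part of the slides; the file is the word2vec download from the link above):

from gensim.models import KeyedVectors

# Google News word2vec vectors: 3M words, 300 dimensions
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(w2v["university"].shape)          # (300,)
print(w2v.similarity("talk", "speak"))  # cosine similarity of two word vectors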
Distance/similarity
v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity: cos(𝑢, 𝑣) = (𝑢 ⋅ 𝑣) / (||𝑢|| ⋅ ||𝑣||)
v Word vectors are normalized by length
v Euclidean distance: ||𝑢 − 𝑣||₂
v Inner product: 𝑢 ⋅ 𝑣
v Same as cosine similarity if the vectors are normalized
(𝑢/||𝑢|| is a unit vector)
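The three measures above are a few lines of NumPy (a sketch with two hypothetical word vectors u and v):

import numpy as np

u = np.array([0.8, 0.9, 0.1, 0.0])   # hypothetical word vectors
v = np.array([0.8, 0.1, 0.8, 0.0])

cosine    = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
euclidean = np.linalg.norm(u - v)
inner     = u @ v

# After length normalization, the inner product equals cosine similarity
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
assert np.isclose(u_hat @ v_hat, cosine)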
Linguistic Regularities in Sparse and Explicit Word Representations, Levy & Goldberg, CoNLL 2014
Choosing the right similarity metric is important
Word similarity DEMO
v http://msrcstr.cloudapp.net/
Word analogy
v 𝑣_man − 𝑣_woman + 𝑣_uncle ∼ 𝑣_aunt
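With pre-trained vectors (loaded with gensim as in the earlier sketch, which is an assumption rather than part of the slides), the analogy can be answered by nearest-neighbor search around the combined vector; gensim's most_similar implements this additive analogy over normalized vectors:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # as in the earlier sketch

# v_man - v_woman + v_uncle ~ v_aunt  <=>  uncle + woman - man ~ aunt
print(w2v.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))
# words like "aunt" are expected near the top of the list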
From words to phrases
Neural Language Models
How to “learn” word vectors?
What are the "basic concepts"? How to assign weights? How to define the similarity/distance? (Cosine similarity)
Back to distributional representation
v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Bag-of-words model: documents (clusters) as the basis for the vector space
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)
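A minimal sketch of building such a word-by-word co-occurrence matrix (the toy corpus and window size are illustrative assumptions, not from the slides):

import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]          # hypothetical toy corpus
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1   # count context words within the window

# Row C[idx["like"]] is now the distributional vector for "like"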
Input: synonyms from a thesaurus (Joyfulness: joy, gladden; Sad: sorrow, sadden)

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1         0        0        0
Group 2: "sad"           0       0         1        1        0
Group 3: "affection"     0       0         0        0        1
Cosine similarity?
Pros and cons?
Problems?
v Number of basic concepts is large
v Basis is not orthogonal (i.e., not linearly independent)
v Some function words are too frequent (e.g., "the")
v Syntax has too much impact
v E.g., TF-IDF weighting can be applied (see the sketch below)
v E.g., skip-grams: scaling counts by distance to the target word
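One common reweighting mentioned above is TF-IDF, which down-weights words that appear in many documents; a sketch over a hypothetical document-term count matrix:

import numpy as np

# rows = documents, columns = terms (hypothetical counts)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 0]], dtype=float)

tf  = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each document
df  = (counts > 0).sum(axis=0)                     # document frequency of each term
idf = np.log(counts.shape[0] / df)                 # frequent terms get low idf
tfidf = tf * idf                                   # "the"-like columns shrink toward zero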
Latent Semantic Analysis (LSA)
v Data representation
v Encode single-relational data in a matrix
v Co-occurrence (e.g., document-term matrix, skip-grams)
v Synonyms (e.g., from a thesaurus)
v Factorization
v Apply SVD to the matrix to find latent components
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
v For an 𝑚×𝑛 matrix 𝐴, there exists a factorization such that
𝐴 = 𝑈Σ𝑉ᵀ
v 𝑈, 𝑉 are orthogonal matrices
Low-rank Approximation
v Idea: store the most important information in a small number of dimensions (e.g., 100-1000)
v SVD can be used to compute the optimal low-rank approximation
v Set the smallest n − r singular values to zero (keep only the largest r)
v Similar words map to similar locations in the low-dimensional space
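A sketch of the truncated SVD with NumPy (the random matrix is only a stand-in for a real co-occurrence matrix):

import numpy as np

A = np.random.rand(50, 20)                  # stand-in for an m x n co-occurrence matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5                                        # keep only the largest r singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation (Frobenius norm)

print(np.linalg.matrix_rank(A_r))            # 5
print(np.linalg.norm(A - A_r))               # reconstruction error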
Latent Semantic Analysis (LSA)
v Factorization
v Apply SVD to the matrix to find latent components
LSA example
v Original matrix C
(Example from Christopher Manning and Pandu Nayak, Introduction to IR)
LSA example
v SVD: 𝐶 = 𝑈Σ𝑉ᵀ
LSA example
v Original matrix C
v Dimension reduction: 𝐶 ∼ 𝑈Σ𝑉ᵀ
LSA example
v Original matrix 𝐶 vs. reconstructed matrix 𝐶₂
v What is the similarity between ship and boat?
Word vectors
𝐶 ∼ 𝑈Σ𝑉ᵀ
𝐶𝐶ᵀ ∼ (𝑈Σ𝑉ᵀ)(𝑈Σ𝑉ᵀ)ᵀ = 𝑈Σ𝑉ᵀ𝑉Σᵀ𝑈ᵀ = 𝑈ΣΣᵀ𝑈ᵀ (why?) = (𝑈Σ)(𝑈Σ)ᵀ
v 𝐶_ship ⋅ 𝐶_boat ∼ (𝑈Σ)_ship ⋅ (𝑈Σ)_boat
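A sketch of how the rows of 𝑈Σ act as word vectors: with the full SVD their inner products equal those of the rows of C, and after truncation they smooth them. The small term-document matrix below is a hypothetical example, not the one from the slides:

import numpy as np

# rows: ship, boat, ocean, wood, tree (hypothetical term-document counts)
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
W = U[:, :2] * s[:2]          # rank-2 word vectors: rows of (U Sigma), truncated

ship, boat = W[0], W[1]
print(C[0] @ C[1])            # 0: ship and boat never co-occur in a document
print(ship @ boat)            # > 0: the latent space relates them through shared contexts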
Why do we need a low-rank approximation?
v Knowledge bases (e.g., thesauri) are never complete
v Noise reduction by dimension reduction
v Intuitively, LSA brings together "related" axes (concepts) in the vector space
v A compact model
All problem solved?
An analogy game
"Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", NIPS 2016
Continuous Semantic Representations
[Figure: clusters of related words in a continuous space, e.g., {sunny, rainy, windy, cloudy}, {car, wheel, cab}, {sad, joy, emotion, feeling}]
Semantics Needs More Than Similarity
Tomorrow will be rainy.
Tomorrow will be sunny.
𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?
𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?
Continuous representations for entities
[Figure: entities Michelle Obama, Democratic Party, George W. Bush, Laura Bush, and Republican Party, with a "?" marking a missing entity/relation]
Continuous representations for entities
• Useful resources for NLP applications
• Semantic Parsing & Question Answering
• Information Extraction
Next lecture: a more flexible framework
v Directly learn word vectors using a neural network model
v More flexible
v Easier to learn new words
v Incorporate other information
v Optimize a task-specific loss
v Review calculus!