Lecture 6: Vector Space Model
Kai-Wei Chang, CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
v How to represent a word, a sentence, or a document?
v How to infer the relationship among words?
v We focus on “semantics”: distributional semantics
v What is the meaning of “life”?
How to represent a word
v Naïve way: represent words as atomic symbols: student, talk, university
v Used in n-gram language models and logical analysis
v Represent each word as a "one-hot" vector: [ 0 0 0 1 0 … 0 ]
v How large is this vector?
v PTB data: ~50k words; Google 1T data: 13M words
v 𝑣 ⋅ 𝑢 =?
(Vocabulary in the illustration: egg, student, talk, university, happy, buy)
Issues?
v Dimensionality is large; the vectors are sparse
v No notion of similarity
v Cannot represent new words
v Any ideas?
𝑣_happy = [ 0 0 0 1 0 … 0 ]
𝑣_sad = [ 0 0 1 0 0 … 0 ]
𝑣_milk = [ 1 0 0 0 0 … 0 ]
𝑣_happy ⋅ 𝑣_sad = 𝑣_happy ⋅ 𝑣_milk = 0
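A minimal sketch of this in NumPy (the toy vocabulary is chosen here for illustration and is not from the slides): every pair of distinct one-hot vectors has dot product 0, so the representation carries no similarity information at all.

import numpy as np

vocab = ["milk", "egg", "sad", "happy", "buy"]   # hypothetical toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

v_happy, v_sad, v_milk = one_hot("happy"), one_hot("sad"), one_hot("milk")
print(v_happy @ v_sad)    # 0.0
print(v_happy @ v_milk)   # 0.0 -- "happy" looks equally (un)related to everything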
Idea 1: Taxonomy (Word category)
What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
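The taxonomy can also be turned into a number: NLTK's path_similarity scores two synsets by the shortest path connecting them in the hypernym hierarchy (a sketch continuing the session above; exact values depend on the WordNet version).

>>> right.path_similarity(minke)     # both baleen whales: relatively close
0.25
>>> right.path_similarity(tortoise)  # common ancestor is only 'vertebrate'
0.076923076923076927
>>> right.path_similarity(novel)     # common ancestor is only 'entity'
0.043478260869565216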
Requires human labor
Taxonomy (Word category)
v Synonym, hypernym (is-a), hyponym
Idea 2: Similarity = Clustering
Cluster n-gram model
v Can be generated from unlabeled corpora
v Based on statistics, e.g., mutual information
(Implementation of the Brown hierarchical word clustering algorithm, by Percy Liang)
Idea 3: Distributional representation
v Linguistic items with similar distributions have similar meanings
v i.e., words that occur in the same contexts ⇒ similar meaning
"a word is characterized by the company it keeps” --Firth, John, 1957
Vector representation (word embeddings)
v Discrete ⇒ distributed representations
v Word meanings are vectors of "basic concepts"
v What are the "basic concepts"?
v How to assign weights?
v How to define the similarity/distance?
𝑣_king = [0.8 0.9 0.1 0 …]
𝑣_queen = [0.8 0.1 0.8 0 …]
𝑣_apple = [0.1 0.2 0.1 0.8 …]
(basic-concept dimensions: royalty, masculinity, femininity, eatable, …)
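A small sketch, reusing the toy vectors above, of how cosine similarity over these "basic concept" dimensions captures that king is closer to queen than to apple (the numbers are the illustrative ones from the slide, not learned values):

import numpy as np

# toy "basic concept" vectors over (royalty, masculinity, femininity, eatable)
v_king  = np.array([0.8, 0.9, 0.1, 0.0])
v_queen = np.array([0.8, 0.1, 0.8, 0.0])
v_apple = np.array([0.1, 0.2, 0.1, 0.8])

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(v_king, v_queen))   # high: both score high on "royalty"
print(cos(v_king, v_apple))   # low: little overlap in concepts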
An illustration of the vector space model
[Figure: words w1-w5 plotted along the axes Royalty, Masculine, and Eatable, with |D2-D4| marking the distance between two points]
Semantic similarity in 2D
v Home Depot products
Capture the structure of words
v Example from GloVe
How to use word vectors?
Pre-trained word vectors
v Google News (word2vec): https://code.google.com/archive/p/word2vec
v 100 billion tokens, 300 dimensions, 3M words
v GloVe project: http://nlp.stanford.edu/projects/glove/
v Pre-trained word vectors from Wikipedia (6B tokens), web crawl data (840B), and Twitter (27B)
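A minimal sketch of loading such pre-trained vectors with gensim (gensim and the file name are assumptions, not part of the slides; the file is the word2vec download from the link above):

from gensim.models import KeyedVectors

# Google News word2vec vectors: 3M words, 300 dimensions
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(w2v["university"].shape)          # (300,)
print(w2v.similarity("talk", "speak"))  # cosine similarity of two word vectors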
Distance/similarity
v Vector similarity measure ⇒ similarity in meaning
v Cosine similarity: cos(𝑢, 𝑣) = (𝑢 ⋅ 𝑣) / (||𝑢|| ⋅ ||𝑣||)
v Word vectors are normalized by length
v Euclidean distance: ||𝑢 − 𝑣||₂
v Inner product: 𝑢 ⋅ 𝑣
v Same as cosine similarity if the vectors are normalized
(𝑢/||𝑢|| is a unit vector)
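The three measures above are a few lines of NumPy (a sketch with two hypothetical word vectors u and v):

import numpy as np

u = np.array([0.8, 0.9, 0.1, 0.0])   # hypothetical word vectors
v = np.array([0.8, 0.1, 0.8, 0.0])

cosine    = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
euclidean = np.linalg.norm(u - v)
inner     = u @ v

# After length normalization, the inner product equals cosine similarity
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
assert np.isclose(u_hat @ v_hat, cosine)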
Linguistic Regularities in Sparse and Explicit Word Representations, Levy & Goldberg, CoNLL 2014
Choosing the right similarity metric is important
Word similarity DEMO
v http://msrcstr.cloudapp.net/
Word analogy
v 𝑣_man − 𝑣_woman + 𝑣_uncle ∼ 𝑣_aunt
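With pre-trained vectors (loaded with gensim as in the earlier sketch, which is an assumption rather than part of the slides), the analogy can be answered by nearest-neighbor search around the combined vector; gensim's most_similar implements this additive analogy over normalized vectors:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # as in the earlier sketch

# v_man - v_woman + v_uncle ~ v_aunt  <=>  uncle + woman - man ~ aunt
print(w2v.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))
# words like "aunt" are expected near the top of the list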
From words to phrases
Neural Language Models
How to “learn” word vectors?
What are the "basic concepts"? How to assign weights? How to define the similarity/distance? (Cosine similarity)
Back to distributional representation
v Encode relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Bag-of-words model: documents (clusters) as the basis for the vector space
v Skip-grams
v From taxonomy (e.g., WordNet, thesaurus)
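A minimal sketch of building such a word-by-word co-occurrence matrix (the toy corpus and window size are illustrative assumptions, not from the slides):

import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]          # hypothetical toy corpus
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1   # count context words within the window

# Row C[idx["like"]] is now the distributional vector for "like"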
Input: synonyms from a thesaurus (Joyfulness: joy, gladden; Sad: sorrow, sadden)

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1         0        0        0
Group 2: "sad"           0       0         1        1        0
Group 3: "affection"     0       0         0        0        1
Cosine similarity?
Pros and cons?
Problems?
v Number of basic concepts is large
v Basis is not orthogonal (i.e., not linearly independent)
v Some function words are too frequent (e.g., "the")
v Syntax has too much impact
v E.g., TF-IDF weighting can be applied (see the sketch below)
v E.g., skip-grams: scaling counts by distance to the target word
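One common reweighting mentioned above is TF-IDF, which down-weights words that appear in many documents; a sketch over a hypothetical document-term count matrix:

import numpy as np

# rows = documents, columns = terms (hypothetical counts)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 0]], dtype=float)

tf  = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each document
df  = (counts > 0).sum(axis=0)                     # document frequency of each term
idf = np.log(counts.shape[0] / df)                 # frequent terms get low idf
tfidf = tf * idf                                   # "the"-like columns shrink toward zero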
Latent Semantic Analysis (LSA)
v Data representation
v Encode single-relational data in a matrix
v Co-occurrence (e.g., document-term matrix, skip-grams)
v Synonyms (e.g., from a thesaurus)
v Factorization
v Apply SVD to the matrix to find latent components
Principal Component Analysis (PCA)
v Decompose the similarity space into a set of orthonormal basis vectors
v For an 𝑚×𝑛 matrix 𝐴, there exists a factorization such that
𝐴 = 𝑈Σ𝑉ᵀ
v 𝑈, 𝑉 are orthogonal matrices
Low-rank Approximation
v Idea: store the most important information in a small number of dimensions (e.g., 100-1000)
v SVD can be used to compute the optimal low-rank approximation
v Set the smallest n − r singular values to zero (keep only the largest r)
v Similar words map to similar locations in the low-dimensional space
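A sketch of the truncated SVD with NumPy (the random matrix is only a stand-in for a real co-occurrence matrix):

import numpy as np

A = np.random.rand(50, 20)                  # stand-in for an m x n co-occurrence matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5                                        # keep only the largest r singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation (Frobenius norm)

print(np.linalg.matrix_rank(A_r))            # 5
print(np.linalg.norm(A - A_r))               # reconstruction error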
Latent Semantic Analysis (LSA)
v Factorization
v Apply SVD to the matrix to find latent components
LSA example
v Original matrix C
(Example from Christopher Manning and Pandu Nayak, Introduction to IR)
LSA example
v SVD: 𝐶 = 𝑈Σ𝑉ᵀ
LSA example
v Original matrix C
v Dimension reduction: 𝐶 ∼ 𝑈Σ𝑉ᵀ
LSA example
v Original matrix 𝐶 vs. reconstructed matrix 𝐶₂
v What is the similarity between ship and boat?
Word vectors
𝐶 ∼ 𝑈Σ𝑉ᵀ
𝐶𝐶ᵀ ∼ (𝑈Σ𝑉ᵀ)(𝑈Σ𝑉ᵀ)ᵀ = 𝑈Σ𝑉ᵀ𝑉Σᵀ𝑈ᵀ = 𝑈ΣΣᵀ𝑈ᵀ (why?) = (𝑈Σ)(𝑈Σ)ᵀ
v 𝐶_ship ⋅ 𝐶_boat ∼ (𝑈Σ)_ship ⋅ (𝑈Σ)_boat
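A sketch of how the rows of 𝑈Σ act as word vectors: with the full SVD their inner products equal those of the rows of C, and after truncation they smooth them. The small term-document matrix below is a hypothetical example, not the one from the slides:

import numpy as np

# rows: ship, boat, ocean, wood, tree (hypothetical term-document counts)
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
W = U[:, :2] * s[:2]          # rank-2 word vectors: rows of (U Sigma), truncated

ship, boat = W[0], W[1]
print(C[0] @ C[1])            # 0: ship and boat never co-occur in a document
print(ship @ boat)            # > 0: the latent space relates them through shared contexts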
Why do we need a low-rank approximation?
v Knowledge bases (e.g., thesauri) are never complete
v Noise reduction by dimension reduction
v Intuitively, LSA brings together "related" axes (concepts) in the vector space
v A compact model
All problem solved?
An analogy game
"Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", NIPS 2016
Continuous Semantic Representations
[Figure: clusters of related words in a continuous space, e.g., {sunny, rainy, windy, cloudy}, {car, wheel, cab}, {sad, joy, emotion, feeling}]
Semantics Needs More Than Similarity
Tomorrow will be rainy.
Tomorrow will be sunny.
𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?
𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?
Continuous representations for entities
[Figure: entities Michelle Obama, Democratic Party, George W. Bush, Laura Bush, and Republican Party, with a "?" marking a missing entity/relation]
Continuous representations for entities
• Useful resources for NLP applications
• Semantic Parsing & Question Answering
• Information Extraction
Next lecture: a more flexible framework
v Directly learn word vectors using a neural network model
v More flexible
v Easier to learn new words
v Incorporate other information
v Optimize a task-specific loss
v Review calculus!