
Identifying Semantic Components From Cross-Language Variation, Structured Lexical Sources, and Corpora: a Review of Current Literature

UC Berkeley Language and Cognition Lab
Jongmin J. Baek

August 5, 2016


Contents

1 Overview
2 Algebraic Manipulations of Distributional Semantic Representations
  2.1 Semantic Compositionality through Recursive Matrix-Vector Spaces (2012, Stanford)
    2.1.1 Formalization
    2.1.2 Performance
    2.1.3 Limitations
  2.2 Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank (2013, Stanford)
    2.2.1 Performance
  2.3 GloVe: Global Vectors for Word Representation (2014, Stanford)
    2.3.1 Formalization
  2.4 Distributional Models of Preposition Semantics (2003, Stanford & U. of Edinburgh)
3 Structured Knowledge Bases and Distributional Semantics
  3.1 RC-NET: A General Framework for Incorporating Knowledge into Word Representations (2014, Microsoft Research)
  3.2 Retrofitting Word Vectors to Semantic Lexicons (2015, CMU)
  3.3 Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet (2016, University of the Basque Country)
  3.4 Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction in Spoken Dialogue Systems (2014, CMU)
  3.5 Event Extraction and Classification by Neural Network Model (2016, National Taiwan Normal U.)
4 Multilingual Distributional Semantics
  4.1 Bilingual Word Embeddings for Phrase-Based Machine Translation (2013, Stanford)
  4.2 Massively Multilingual Word Embeddings (2016, CMU & U. of Washington)
    4.2.1 MultiCluster
    4.2.2 MultiCCA
5 Multimodal Distributional Semantics
  5.1 Multimodal Distributional Semantics (2014, University of Trento & L3S Research Center)
  5.2 Grounded Compositional Semantics for Finding and Describing Images with Sentences (2014, Stanford & Google)
  5.3 Distributional Semantics from Text and Images (2011, Free University of Bolzano & University of Trento)
  5.4 Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception (2015, Cambridge)
  5.5 Grounding Semantics in Olfactory Perception (2015, Cambridge)


Chapter 1

Overview

Distributional semantics makes a compelling case for how meaning arises from statistical distributions of data. So far, most accounts of distributional semantics have focused on English words and on using vectors to represent those words. However, that landscape is quickly changing. Recent efforts in distributional semantics may be divided into at least four clusters: using more complicated mathematics to represent words and compositions of words, leveraging high-quality structured knowledge bases, combining multiple languages, and going beyond corpus training data to visual, auditory, and olfactory training data, that is, giving the machine multiple modalities, so to speak. What emerges is an exciting hodgepodge of ideas that look ripe to advance further and intertwine. In what follows, I have listed papers of interest with a brief summary of their methods and, where particularly promising, an extended discussion of their formalizations and limitations.


Chapter 2

Algebraic Manipulations of Distributional Semantic Representations

2.1 Semantic Compositionality through Recursive Matrix-Vector Spaces (2012, Stanford)

Socher, Huval, Manning, & Ng discuss a method to assign a “vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases.” In this way, the matrix-vector recursive neural network, henceforth called MV-RNN, can learn “compositional vector representations for phrases and sentences of arbitrary syntactic type and length”.

2.1.1 Formalization

First, a sentence or a phrase is parsed into a binary parse tree of arbitrary depth. Then a composition function f : R^{2n} → R^n is applied to pairs of constituents in the parse tree. More specifically, this function f is

f_{A,B}(a, b) = f(Ba, Ab) = g\left( W \begin{bmatrix} Ba \\ Ab \end{bmatrix} \right)

where g is some nonlinearity function such as tanh or sigmoid, A, B are the matrices for the two single words, a, b are the vectors for the two single words, and W ∈ R^{n×2n} is a matrix that maps both transformed words back into R^n. The kicker is that f takes a pair of n-dimensional vectors and maps them into one n-dimensional vector, and therefore each pair of nodes in a parse tree can be collapsed into one vector that can then be recursively recombined. In this way, f can be applied recursively to a parse tree to finally yield one n-dimensional vector that (hopefully) captures the syntactic structure and corresponding semantics of the sentence.

We can do something similar with the word matrices. For computing phrase matrices, we define the function

f_M(A, B) = W_M \begin{bmatrix} A \\ B \end{bmatrix}

where W_M ∈ R^{n×2n}, so the output of the function is another n×n matrix, just like the input matrices.

Now we describe the learning algorithm. We want to train the vector representation of each phrase, and for that we need an objective function. The objective comes from a softmax classifier that predicts a class distribution over some set of classes, such as sentiments or relationships; its weights form a matrix W^{label}. The corresponding error function for each sentence s and its tree t, E(s, t, θ), is the sum of cross-entropy errors at all nodes of the tree, which training minimizes. With a regularization parameter λ, the gradient of the overall objective function J becomes

\frac{\partial J}{\partial \theta} = \frac{1}{N} \sum_{(x,t)} \frac{\partial E(x, t; \theta)}{\partial \theta} + \lambda \theta

where θ = (W, W_M, W^{label}, L, L_M) are our model parameters, and L, L_M are just the sets of all word vectors and word matrices, respectively.
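
To make the composition concrete, here is a minimal numerical sketch of the MV-RNN composition step in Python, assuming n-dimensional word vectors, n×n word matrices, and randomly initialized parameters; the function names and the right-branching fold over the leaves are illustrative choices of mine, not the authors' implementation.

import numpy as np

# Minimal sketch of MV-RNN composition (illustrative, not the authors' code).
n = 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(n, 2 * n))      # vector composition matrix
W_M = rng.normal(scale=0.01, size=(n, 2 * n))    # matrix composition matrix

def compose(a, A, b, B):
    """Combine two child (vector, matrix) pairs into a parent (vector, matrix) pair."""
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))   # parent vector g(W [Ba; Ab])
    P = W_M @ np.vstack([A, B])                       # parent matrix W_M [A; B]
    return p, P

def compose_phrase(leaves):
    """Fold a list of (vector, matrix) leaves as if the parse tree were right-branching."""
    vec, mat = leaves[-1]
    for v, M in reversed(leaves[:-1]):
        vec, mat = compose(v, M, vec, mat)
    return vec, mat

leaves = [(rng.normal(size=n), np.eye(n)) for _ in range(4)]   # a toy four-word phrase
phrase_vec, phrase_mat = compose_phrase(leaves)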

2.1.2 Performance

• MV-RNN outperforms most comparable models in an adverb-adjective pair sentiment classification task. “It shows that the idea of matrix-vector representations for all words and having a nonlinearity are both important. The MV-RNN which combines these two ideas is best able to learn various compositional effects.” The MV-RNN performs especially well in negation tasks, as it allows negation to completely shift the sentiment with respect to an adjective.

• MV-RNN can capture logical compositionality. It can learn propositional operators such as NOT and AND.

• MV-RNN outperforms similar models in sentiment prediction for sentences of arbitrary length. Its accuracy is 79.0%, while the closest competitor's accuracy is 77.7%. However, it is interesting to note that it failed to classify examples that require structured real-life knowledge, of the sort likely to be found in FrameNet. It wrongly classified as positive the review “Director Hoffman, his writer and Kline's agent should serve detention.” It also wrongly classified as negative “A bodice-ripper for intellectuals.”

2.1.3 Limitations

The model has some limitations. It has an ungodly number of parameters, O(nd²), where n is the number of words and d is the dimension of the word vectors; therefore it has a tendency to overfit and consumes a lot of resources. The paper in the following section alleviates this problem by sharing matrices across words, so that words which serve similar functions, e.g. some prepositions, may share matrices.

2.2 Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank (2013, Stanford)

Socher, Perelygin, Wu, Chuang, Manning, Ng & Potts introduce a Sentiment Treebank and, more relevant to our interests, the Recursive Neural Tensor Network. “Recursive Neural Tensor Networks (RNTN) take as input phrases of any length. They represent a phrase through word vectors and a parse tree and then compute vectors for higher nodes in the tree using the same tensor-based composition function.” They solve the problem of the MV-RNN discussed in the previous section, namely that “the number of parameters becomes very large and depends on the size of the vocabulary. It would be cognitively more plausible if there were a single powerful composition function with a fixed number of parameters.”

2.2.1 Performance

• In a scale-of-1-to-5 sentiment classification task, RNTN outperforms MV-RNN by a small margin.

• In a full-sentence binary sentiment classification task, RNTN pushes the state of the art from 80% to 85.4%.

• With sentences that have an ‘X but Y’ structure, RNTN obtains an accuracy of 41%, compared to 37% for MV-RNN.

• RNTN correctly classifies the negation of a positive sentence 71.4% of the time and the negation of a negative sentence 81.8% of the time.

2.3 GloVe: Global Vectors for Word Representation (2014, Stanford)

Jeffrey Pennington, Richard Socher, and Christopher Manning introduce Global Vectors. Global Vectors combine global co-occurrence matrix methods such as LSA with local context window neural embedding models such as Mikolov et al. (2013a), combining the best of both worlds. The LSA-style model uses the corpus statistics more efficiently but is poor at word analogy tasks; the neural embedding model is just the opposite. Global Vectors take advantage of the good parts of each while eschewing the bad parts.


2.3.1 Formalization

The cost function is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where f is a weighting function for each word-word co-occurrence pair, and the squared expression captures the goal of minimizing the difference between the dot product of the two word vectors, plus their biases, and the log of the co-occurrence count. For a discussion of why this form is used, please refer to the paper. f prevents rare co-occurrences from severely distorting the cost function, and the example function used is

f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

where x_{\max} = 100 and α = 3/4 are chosen empirically.
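
As a minimal sketch of how this objective can be evaluated, the following assumes a toy co-occurrence matrix and randomly initialized vectors and biases; the array names, sizes, and the Poisson-sampled counts are illustrative and not taken from the paper or its released code.

import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f: down-weight rare co-occurrences, cap frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares cost summed over nonzero co-occurrence cells."""
    i, j = np.nonzero(X)                                  # log X_ij requires X_ij > 0
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    resid = pred - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * resid ** 2)

V, d = 1000, 50
rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(V, V)).astype(float)          # toy co-occurrence counts
W = rng.normal(scale=0.01, size=(V, d))
W_tilde = rng.normal(scale=0.01, size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_cost(X, W, W_tilde, b, b_tilde))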

2.4 Distributional Models of Preposition Semantics (2003, Stanford & U. of Edinburgh)

Colin Bannard and Timothy Baldwin present a method to compare prepositions modeled with distributional semantics. Their main contribution is to show that prepositions, long considered to be semantically vacuous because of their promiscuity, can feasibly be modeled with distributional semantics. First, they make a distinction between transitive and intransitive prepositions and treat them differently; for example, the token “up” may be used intransitively, as in “time is up”, or transitively, as in “it is up to you”. Then the Pearson correlation is computed between different prepositions to calculate inter-preposition similarity. The paper concludes with the note that preposition semantics cannot be modeled adequately without considering valence.
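
The comparison step lends itself to a small sketch: represent each preposition by a vector of co-occurrence counts over shared context words and correlate those vectors. The context list, the toy counts, and the use of np.corrcoef below are illustrative assumptions of mine; Bannard and Baldwin's actual feature space additionally separates transitive from intransitive uses.

import numpy as np

# Toy co-occurrence counts over a shared list of context words (illustrative numbers).
contexts = ["wall", "table", "hill", "stairs", "time"]
prep_vectors = {
    "on":   np.array([30.0, 42.0,  8.0,  5.0,  9.0]),
    "over": np.array([12.0,  6.0, 25.0,  4.0, 11.0]),
    "up":   np.array([ 2.0,  1.0, 20.0, 33.0, 14.0]),
}

def preposition_similarity(p, q):
    """Pearson correlation between two prepositions' co-occurrence vectors."""
    return np.corrcoef(prep_vectors[p], prep_vectors[q])[0, 1]

print(preposition_similarity("over", "up"))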


Chapter 3

Structured Knowledge Bases and Distributional Semantics

3.1 RC-NET: A General Framework for Incorporating Knowledge into Word Representations (2014, Microsoft Research)

Xu et al. take issue with the fact that statistical distributional semantics does not take into account structured knowledge bases such as WordNet and FrameNet. They propose a “novel framework to take advantage of both relational and categorical knowledge to produce high-quality word representations. This framework is built upon the skip-gram model (Mikolov et al. 2013a), in which we extend its objective function by incorporating the external knowledge as regularization functions.”

• “To leverage relational knowledge, we define a corresponding regularization function by inheriting the similar idea from a recent study on multi-relation model (Bordes et al. 2013), which characterizing the relationships between entities by interpreting them as translations in the low-dimensional embeddings of the entities.”

• “To incorporate the categorical knowledge, we define another regularization function by minimizing the weighted distance between those words with the same attributes.”

• “Then, we combine these two regularization functions with the original objective function of the skip-gram model.”
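
As a rough illustration of how the two regularizers quoted above could be attached to a skip-gram loss, here is a sketch in which the relational term follows the translation idea (head embedding plus relation embedding should land near the tail embedding) and the categorical term pulls words sharing an attribute toward their group mean; the function names, the squared-distance forms, the tiny demo data, and the weights alpha and beta are my own simplifications, not the exact formulation of Xu et al.

import numpy as np

def relational_reg(W, R, triples):
    """Translation-style term: for (head, relation, tail) ids, push W[h] + R[r] toward W[t]."""
    return sum(np.sum((W[h] + R[r] - W[t]) ** 2) for h, r, t in triples)

def categorical_reg(W, groups):
    """Pull words that share an attribute (e.g. a WordNet category) toward their group mean."""
    total = 0.0
    for idx in groups:                      # idx: array of word ids sharing one attribute
        mean = W[idx].mean(axis=0)
        total += np.sum((W[idx] - mean) ** 2)
    return total

def rcnet_loss(skipgram_loss, W, R, triples, groups, alpha=0.01, beta=0.01):
    """Skip-gram objective extended with the two knowledge regularizers."""
    return skipgram_loss + alpha * relational_reg(W, R, triples) + beta * categorical_reg(W, groups)

W = np.random.default_rng(0).normal(size=(10, 5))    # toy word embeddings
R = np.zeros((3, 5))                                 # toy relation embeddings
print(rcnet_loss(0.0, W, R, triples=[(0, 1, 2)], groups=[np.array([0, 2, 5])]))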

3.2 Retrofitting Word Vectors to Semantic Lexicons (2015, CMU)

Manaal Faruqui et al. generalize Xu et al.'s ideas to work with a broader set of knowledge bases.

3.3 Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet (2016, University of the Basque Country)

Josu Goikoetxea, Eneko Agirre, and Aitor Soroa explore a way to learn distributed representations independently from corpora and from structured knowledge bases like WordNet, and then use simple concatenation or averaging to combine them. Notably, such simple methods outperform more sophisticated ones like Canonical Correlation Analysis (CCA) or retrofitting.
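
The combination step itself can be stated in a few lines; the sketch below assumes two pre-trained embedding matrices whose rows are aligned to the same vocabulary, which is my framing of the setup rather than code from the paper.

import numpy as np

def combine(text_emb, wordnet_emb, mode="concat"):
    """Combine two independently learned embedding matrices (rows index the same vocabulary)."""
    if mode == "concat":
        return np.hstack([text_emb, wordnet_emb])    # dimensionalities add up
    if mode == "average":
        return (text_emb + wordnet_emb) / 2.0        # requires equal dimensionality
    raise ValueError(mode)

combined = combine(np.ones((3, 4)), np.zeros((3, 4)), mode="concat")   # shape (3, 8)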

3.4 Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction in Spoken Dialogue Systems (2014, CMU)

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky combine distributional semantics and frame semantics into one, creating a model that can induce frame-semantic slots after hearing human speech in an unsupervised fashion. The pipeline goes this way:

• Generate text from the speech files using a generic automatic speech recognition algorithm.

• Use a FrameNet-trained statistical probabilistic semantic parser to generate initial frame-semantic parses.

• Adapt these initial frame-semantic parses to the semantic slots of the target semantic space, leveraging distributed word representations. This last step “distinguishes between generic semantic concepts and domain-specific concepts.” This matters because we may want to focus on a specific domain, for example a conversation about restaurant recommendations. To do this, we must rank slot candidates and use the ones ranked highest. The ranking algorithm uses “(1) the normalized frequency of each slot candidate in the corpus, since slots with higher frequency may be more important. (2) the coherence of slot-fillers corresponding to the slot... the coherence of the corresponding slot-fillers can help measure the prominence of the slots because they are similar to each other.” (585)
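
The ranking criterion invites a small sketch: score each slot candidate by its normalized frequency plus the average pairwise similarity of its filler embeddings. The linear combination and the equal weights below are illustrative assumptions of mine, not the paper's exact scoring function.

import numpy as np

def coherence(filler_vecs):
    """Average pairwise cosine similarity among a slot's filler embeddings."""
    V = filler_vecs / np.linalg.norm(filler_vecs, axis=1, keepdims=True)
    sims = V @ V.T
    n = len(V)
    return (sims.sum() - n) / (n * (n - 1)) if n > 1 else 0.0

def slot_score(count, total_count, filler_vecs, w_freq=0.5, w_coh=0.5):
    """Combine normalized slot frequency with slot-filler coherence."""
    return w_freq * (count / total_count) + w_coh * coherence(filler_vecs)

fillers = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])   # toy filler embeddings
print(slot_score(count=12, total_count=100, filler_vecs=fillers))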

3.5 Event Extraction and Classification by Neural Network Model (2016, National Taiwan Normal U.)

Bamfa Ceesay and Wen-Juan Hou present a single model to label semantic roles and then predict event types and subtypes using neural embeddings. The pipeline is:

• Transform 1-grams or 2-grams into feature vectors. Features consist of probability distributions over phrase type, grammatical function, position, voice, and FrameNet and VerbNet semantic roles or verb types.

• Train a neural net with the input being a 1-gram or 2-gram and the output being an “event nugget”: event type, event subtype, event realis, and event mention. The paper discusses nine such events: Business, Conflict, Justice, Life, etc.

• After training, the model can predict, with about 70% accuracy, the event nugget given the feature vector of a 1-gram or a 2-gram.
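
At this level of description the model is a feed-forward classifier from engineered feature vectors to event-nugget labels, so a minimal stand-in can be sketched with an off-the-shelf multilayer perceptron; the feature dimensionality, the random placeholder data, and the hyperparameters below are assumptions for illustration, not the authors' configuration.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in: 200-dimensional n-gram feature vectors mapped to one of nine event types.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 200))        # placeholder feature vectors
y_train = rng.integers(0, 9, size=500)       # placeholder event-type labels

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))              # predicted event types for five n-grams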


Chapter 4

Multilingual Distributional Semantics

4.1 Bilingual Word Embeddings for Phrase-Based Machine Translation (2013, Stanford)

Will Zou, Richard Socher, Daniel Cer, and Christopher Manning present a method to embed both English and Mandarin words in one embedding space. They achieve this by using an objective function that is a weighted sum of a monolingual objective function and a translation equivalence objective function.

• Use some off-the-shelf monolingual objective function. The paper uses Collobert et al. (2008)'s formulation, the Context Objective:

J_{CO}^{(c,d)} = \sum_{w_r \in V_R} \max\left(0,\, 1 - f(c_w, d) + f(c_{w_r}, d)\right)

where f is a scoring function defined by the neural network, w_r is a word chosen from a random subset V_R of the vocabulary, and c_{w_r} is the context window with w_r substituted for the original word. So this model contrasts the score of a word placed in its correct context against the score of a random word placed in the same context.

• Align English and Mandarin corpora using the Berkeley Aligner (Liang et al., 2006). Formally, they use the following equation to find starting word embeddings:

W_{t\text{-}init} = \sum_{s=1}^{S} \frac{C_{ts} + 1}{C_t + S}\, W_s

where S is the number of possible target language words that may be aligned with the source word, C_{ts} is the number of times word t in the target and word s in the source are aligned in the training text, and C_t denotes the total number of counts of word t in the target language. The +1 in the numerator and the +S in the denominator apply Laplace smoothing to these alignment counts.

• Form alignment matrices A_{en→ch} and A_{ch→en}. For the former, each row is a Chinese word and each column is an English word; an element a_{ij} denotes the number of times the ith Chinese word was aligned with the jth English word. The latter matrix is defined similarly. Then the Translation Equivalence Objective is

J_{TEO\text{-}en\to ch} = \| V_{ch} - A_{en\to ch} V_{en} \|^2

J_{TEO\text{-}ch\to en} = \| V_{en} - A_{ch\to en} V_{ch} \|^2

• Finally, we optimize for a weighted combined objective during training:

J_{CO\text{-}ch} + \lambda J_{TEO\text{-}en\to ch}

J_{CO\text{-}en} + \lambda J_{TEO\text{-}ch\to en}

• After training, we can visualize the embedding space using t-SNE (van der Maaten 2008).
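
To make the translation equivalence term concrete, here is a small sketch that row-normalizes a count-based alignment matrix and evaluates the two TEO penalties; the normalization choice, the toy sizes, and the zero placeholders for the monolingual objectives are my own reading of the setup, not the authors' released code.

import numpy as np

def teo(V_target, A, V_source):
    """Translation equivalence penalty ||V_target - A V_source||^2, with A row-normalized
    so each target word is a weighted mix of the source words it aligns to."""
    A_norm = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    return np.sum((V_target - A_norm @ V_source) ** 2)

d = 50
rng = np.random.default_rng(0)
V_en = rng.normal(size=(3000, d))                               # toy English embeddings
V_ch = rng.normal(size=(2500, d))                               # toy Chinese embeddings
A_en2ch = rng.poisson(0.01, size=(2500, 3000)).astype(float)    # rows: Chinese, cols: English
A_ch2en = rng.poisson(0.01, size=(3000, 2500)).astype(float)    # rows: English, cols: Chinese

lam = 1.0
combined_ch = 0.0 + lam * teo(V_ch, A_en2ch, V_en)   # J_CO-ch placeholder (0.0) + lambda * J_TEO
combined_en = 0.0 + lam * teo(V_en, A_ch2en, V_ch)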

4.2 Massively Multilingual Word Embeddings (2016, CMU & U. of Washington)

Waleed Ammar, George Mulcaire, et al. use pairwise parallel dictionaries and monolingual corpus data to create a single embedding space for more than fifty languages. They use three methods, multiCluster, multiCCA, and multiQVEC-CCA, of which multiQVEC-CCA performs best.

4.2.1 MultiCluster

• Using the parallel dictionaries, map each word-language pair to a multilingual cluster. More precisely, each cluster is a connected graph in which each node is a word-language pair and each edge is a dictionary entry indicating translational equivalence.

• Give each cluster an arbitrary unique ID. Replace each instance of each word in each corpus with its corresponding cluster ID.

• With the resulting corpus of IDs, use the standard skip-gram model of Mikolov et al. (2013a) to assign a vector to each cluster, just as you would with a normal, monolingual corpus.
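
The clustering step is essentially a connected-components computation over the dictionary graph; the sketch below uses a union-find structure over (language, word) pairs, which is one straightforward way to realize the description above, with a two-entry placeholder dictionary, and is not necessarily the authors' implementation.

# Union-find over (language, word) pairs: each dictionary entry merges two pairs into a cluster.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]     # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Each dictionary entry declares translational equivalence between two (language, word) pairs.
dictionary = [(("en", "dog"), ("fr", "chien")), (("fr", "chien"), ("de", "Hund"))]
for a, b in dictionary:
    union(a, b)

# Replace corpus tokens with their cluster IDs, then train skip-gram on the ID corpus.
cluster_id = {pair: find(pair) for pair in parent}
corpus_en = ["dog", "cat"]
id_corpus = [cluster_id.get(("en", w), ("en", w)) for w in corpus_en]   # OOV words keep their own pair as ID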

4.2.2 MultiCCA

• Using monolingual corpora, learn monolingual embeddings for each language separately. For languages n and m, we write E^n and E^m for the embeddings.

• Use canonical correlation analysis (CCA) to estimate linear projections from the ranges of the monolingual embedding spaces, yielding a bilingual embedding space. More formally, the optimization problem is

\max \; \mathrm{Corr}\left( T_{m\to m,n} E^m(u),\; T_{n\to m,n} E^n(v) \right) \quad \forall (u, v) \in D_{m,n}

where T_{m→m,n}, T_{n→m,n} ∈ R^{d×d} are linear projections from language m's embedding space and from language n's embedding space, respectively, into the shared bilingual embedding space, and D_{m,n} is a bilingual dictionary.


Chapter 5

Multimodal Distributional Semantics

5.1 Multimodal Distributional Semantics (2014, University of Trento & L3S Research Center)

A 47-page survey on the topic. “For all its successes, distributional semantics suffers of the obvious limitation that it represents the meaning of a word entirely in terms of connections to other words... (this is) deeply problematic, an issue that is often referred to as the symbol grounding problem.”

5.2 Grounded Compositional Semantics for Finding and Describing Images with Sentences (2014, Stanford & Google)

Richard Socher, Andrej Karpathy, Quoc Le, Christopher Manning and Andrew Ng introduce the Dependency Tree Recursive Neural Network (DT-RNN) model to “embed sentences into a vector space in order to retrieve images that are described by those sentences.” (207) While normal RNN models, which use binary constituency trees, are sometimes very effective at this task, they inevitably capture a lot of the syntactic information of a sentence, which is not desirable when the goal is to retrieve an image from a sentence description. For example, “the man is on the bike” and “the bike has the man on it” should give roughly similar vector representations, which normal RNN models fail to do. DT-RNNs, on the other hand, naturally focus on a sentence's actions and its agents, because the root usually splits into its NP and VP.


5.3 Distributional Semantics from Text and Images (2011, Free University of Bolzano & University of Trento)

Bruni, Tran, and Baroni present a distributional semantic model that learns not only from words but also from images. This model captures qualitatively different aspects of meaning: while traditional distributional semantic models that learn only from corpus data usually capture abstract meaning, this model captures visual meaning. They go so far as to say “we consider this (very!) preliminary evidence for an integrated view of semantics where the more concrete aspects of meaning derive from perceptual experience, whereas verbal associations mostly account for abstraction.”

5.4 Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception (2015, Cambridge)

Douwe Kiela and Stephen Clark discuss a way to give machines ears. Just as visual multi-modal models use “bag of visual words” (BoVW) representations, Kiela and Clark use “bag of audio words” (BoAW) representations. They use the online search engine Freesound to download audio files tagged with keywords. What results is a model that finds auditorily similar nearest neighbors in an almost poetic way.


5.5 Grounding Semantics in Olfactory Perception (2015, Cambridge)

Douwe Kiela and Stephen Clark discuss a way to give machines nostrils. Just as visual multi-modal models use “bag of visual words” (BoVW) representations, Kiela and Clark use “bag of chemical compounds” (BoCC) representations. They use the Sigma-Aldrich Fine Chemicals flavors and fragrances catalog (SAFC), the largest public olfactory database, to find 137 smells and their 11,152 associated chemical compositions.
