Jointly Learning Word and Phrase Embeddings Using
Neural Networks and Implicit Tensor Factorization
Kazuma Hashimoto
Tsuruoka Laboratory, University of Tokyo
19/06/2015 Talk@UCL Machine Reading Lab.

Self Introduction

• Name
– Kazuma Hashimoto (橋本 和真 in Japanese)
– http://www.logos.t.u-tokyo.ac.jp/~hassy/
• Affiliation
– Tsuruoka Laboratory, University of Tokyo
• April 2015 – present: Ph.D. student
• April 2013 – March 2015: Master's student
– National Centre for Text Mining (NaCTeM)
• Research Interests
– Word/phrase/document embeddings and their applications

Today's Agenda

1. Background
– Word and Phrase Embeddings
2. Jointly Learning Word and Phrase Embeddings
– General Idea
3. Our Methods Focusing on Transitive Verb Phrases
– Word Prediction (EMNLP 2014)
– Implicit Tensor Factorization (CVSC 2015)
4. Experiments and Results
5. Summary

Assigning Vectors to Words

• Word: string → index → vector
• Why vectors?
– Word similarities can be measured using distance metrics on the vectors (e.g., the cosine similarity)

[Figure: words embedded in a vector space; similar words such as cause/trigger, disease/disorder, and animal/mouse/rat cluster together]

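As a concrete illustration, here is a minimal Python sketch of measuring word similarity with the cosine metric; the 3-dimensional example vectors are made up for illustration only.

    import numpy as np

    def cosine_similarity(a, b):
        # cos(a, b) = (a . b) / (|a| |b|)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hypothetical embeddings, chosen only to illustrate the idea
    embeddings = {
        "cause":   np.array([0.9, 0.1, 0.0]),
        "trigger": np.array([0.8, 0.2, 0.1]),
        "mouse":   np.array([0.0, 0.9, 0.3]),
    }

    print(cosine_similarity(embeddings["cause"], embeddings["trigger"]))  # high
    print(cosine_similarity(embeddings["cause"], embeddings["mouse"]))    # low
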
Approaches to Word Representations

• Two approaches using large corpora (see Baroni+ (2014) for a systematic comparison):
– Count-based approach
• e.g., reducing the dimensionality of a word co-occurrence matrix using SVD
– Prediction-based approach
• e.g., predicting words from their contexts using neural networks
• We focus on the prediction-based approach
– Why?

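A minimal sketch of the count-based approach, assuming a toy word-by-context co-occurrence matrix; truncated SVD yields the low-dimensional word vectors.

    import numpy as np

    # Toy co-occurrence counts (rows: target words, columns: context words)
    counts = np.array([[10.0, 2.0, 0.0],
                       [ 8.0, 3.0, 1.0],
                       [ 0.0, 1.0, 9.0]])

    # Truncated SVD: keep the top-k singular vectors as word embeddings
    U, S, Vt = np.linalg.svd(counts, full_matrices=False)
    k = 2
    word_vectors = U[:, :k] * S[:k]  # scale each dimension by its singular value
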
Learning Word Embeddings

• Prediction-based approaches usually
– parameterize the word embeddings
– learn them based on co-occurrence statistics
• Word embeddings appearing in similar contexts get close to each other

[Figure: the SkipGram model (Mikolov+, 2013) in word2vec, where a target word's embedding is used to predict its surrounding words in text data such as "… the prevalence of drunken driving and accidents caused by drinking …"]

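A minimal sketch of one SkipGram training step with negative sampling; the vocabulary size, dimensionality, and learning rate are made-up values, and word2vec's real implementation adds details such as subsampling and a unigram noise distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                             # made-up vocabulary size and dimensionality
    W_in = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings
    W_out = rng.normal(scale=0.1, size=(V, d))  # context-word embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(target, context, negatives, lr=0.025):
        # Push the target embedding toward its observed context word
        # (label 1) and away from randomly sampled words (label 0).
        v = W_in[target].copy()
        grad_v = np.zeros(d)
        for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            g = label - sigmoid(np.dot(v, W_out[w]))
            grad_v += g * W_out[w]
            W_out[w] += lr * g * v
        W_in[target] += lr * grad_v

    sgns_step(target=3, context=17, negatives=[42, 99, 512])
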
Task-Oriented Word Embeddings

• Learning word embeddings for relation classification
– To appear at CoNLL 2015 (just advertising)

Beyond Word Embeddings

• Treating phrases and sentences as well as words
– gaining much attention recently!

[Figure: phrases embedded in a vector space; "make payment" and "pay money" lie close together]

Approaches to Phrase Embeddings

• Element-wise addition/multiplication (Lapata+, 2010)
– $v(\text{sentence}) = \sum_i v(w_i)$
• Recursive autoencoders (Socher+, 2011; Hermann+, 2013)
– Using parse trees
– $v(\text{parent}) = f(v(\text{left child}), v(\text{right child}))$
• Tensor/matrix-based methods
– $v(\text{adj noun}) = M(\text{adj})\, v(\text{noun})$ (Baroni+, 2010)
– $M(\text{verb}) = \sum_i v(\text{subj}_i) \otimes v(\text{obj}_i)$ (Grefenstette+, 2011)
• $M(\text{subj, verb, obj}) = \{v(\text{subj}) \otimes v(\text{obj})\} \odot M(\text{verb})$
• $v(\text{subj, verb, obj}) = \{M(\text{verb})\, v(\text{obj})\} \odot v(\text{subj})$ (Kartsaklis+, 2012)

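A sketch of these composition functions in NumPy, using random vectors as stand-ins for trained embeddings; the dimensionality is arbitrary.

    import numpy as np

    d = 4
    rng = np.random.default_rng(1)
    v_subj, v_verb, v_obj, v_noun = (rng.normal(size=d) for _ in range(4))

    # Element-wise addition (Lapata+, 2010)
    v_phrase = v_subj + v_verb + v_obj

    # Adjective as a matrix acting on the noun vector (Baroni+, 2010)
    M_adj = rng.normal(size=(d, d))
    v_adj_noun = M_adj @ v_noun

    # Verb matrix from outer products of its observed subject/object
    # pairs (Grefenstette+, 2011); summed over corpus occurrences in practice
    M_verb = np.outer(v_subj, v_obj)

    # Copy-subject composition (Kartsaklis+, 2012)
    v_svo = v_subj * (M_verb @ v_obj)
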
Which Word Embeddings are the Best?

• Co-occurrence matrix + SVD
• C&W (Collobert+, 2011)
• RNNLM (Mikolov+, 2013)
• SkipGram/CBOW (Mikolov+, 2013)
• vLBL/ivLBL (Mnih+, 2013)
• Dependency-based SkipGram (Levy+, 2014)
• GloVe (Pennington+, 2014)

Which word embeddings should we use for which composition methods?
→ Joint learning

Co-Occurrence Statistics of Phrases

• Word co-occurrence statistics → word embeddings
• How about phrase embeddings?
– Phrase co-occurrence statistics!

Example: "The importer made payment in his own domestic currency" and "The businessman pays his monthly fee in yen" appear in similar contexts → similar meanings?

How to Identify Phrase-Word Relations?

• Using Predicate-Argument Structures (PAS)
– Enju parser (Miyao+, 2008)
• Analyzes relations between phrases and words

[Figure: a parse of "The importer made payment in his own domestic currency", where the predicates (the verb "made" and the preposition "in") take noun phrases and a verb phrase as arguments]

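A sketch of extracting training tuples from parser output, assuming a deliberately simplified predicate-argument format; Enju's actual output format differs, so the tuple layout and type names here are hypothetical.

    # Hypothetical, simplified predicate-argument structures for one sentence
    pas = [
        ("make", "verb_arg12", "importer", "payment"),  # verb with subject and object
        ("in", "prep_arg12", "make", "currency"),       # preposition linking a verb phrase and a noun
    ]

    def extract_tuples(pas):
        # Collect (subj, verb, obj) tuples and prepositional adjuncts
        svos, adjuncts = [], []
        for pred, ptype, arg1, arg2 in pas:
            if ptype == "verb_arg12":
                svos.append((arg1, pred, arg2))
            elif ptype == "prep_arg12":
                adjuncts.append((pred, arg1, arg2))
        return svos, adjuncts

    print(extract_tuples(pas))
    # ([('importer', 'make', 'payment')], [('in', 'make', 'currency')])
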
Why Transitive Verb Phrases?

• The meanings of transitive verbs are affected by their arguments
– e.g., run, make, etc.
→ A good target for testing composition models

[Figure: "make" shifts its meaning with its object; "make payment" ≈ "pay", "make money" ≈ "earn", "make use (of)" ≈ "use"]

Possible Application: Semantic Search

• Embedding subject-verb-object tuples in a vector space
– Semantic similarities between SVO tuples can be used!

Training Data from Large Corpora

• Focusing on the role of prepositional adjuncts
– Prepositional adjuncts complement the meanings of verb phrases → they should be useful

[Figure: text data (English Wikipedia, BNC, etc.) is parsed, and the resulting predicate-argument structures are simplified into training tuples]

How do we model the relationships between predicates and arguments?

Word Prediction Model (like word2vec): PAS-CLBLM

• Predicting words in predicate-argument tuples
– e.g., given the tuple "[importer make payment] in currency", the word "currency" is predicted using the feature vector
$\mathbf{p} = \tanh(\mathbf{h}_{\text{arg1}}^{\text{prep}} \odot \mathbf{v}_{\text{arg1}} + \mathbf{h}_{\text{pred}}^{\text{prep}} \odot \mathbf{v}_{\text{pred}})$
where $\mathbf{v}_{\text{arg1}}$ embeds "[importer make payment]" and $\mathbf{v}_{\text{pred}}$ embeds "in"
– Cost function: $\max(0,\ 1 - s(\text{currency}) + s(\text{furniture}))$, where "furniture" is a negatively sampled word

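A minimal sketch of the feature vector and cost function above, assuming the score s(w) is the dot product between the feature vector p and the candidate word's embedding; random vectors stand in for trained parameters.

    import numpy as np

    d = 50
    rng = np.random.default_rng(2)
    v_arg1 = rng.normal(size=d)       # embedding of "[importer make payment]"
    v_pred = rng.normal(size=d)       # embedding of the preposition "in"
    h_arg1 = rng.normal(size=d)       # weight vector h_arg1^prep
    h_pred = rng.normal(size=d)       # weight vector h_pred^prep
    v_currency = rng.normal(size=d)   # observed word
    v_furniture = rng.normal(size=d)  # negative sample

    # Feature vector for predicting the missing word
    p = np.tanh(h_arg1 * v_arg1 + h_pred * v_pred)

    def s(v_word):
        # Assumed scoring function: dot product with the feature vector
        return np.dot(p, v_word)

    # Hinge cost: the observed word should outscore the negative
    # sample by a margin of 1
    cost = max(0.0, 1.0 - s(v_currency) + s(v_furniture))
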
How to Compute SVO Embeddings?

• Two methods:
– (a) assigning a parameterized vector to each SVO tuple
– (b) composing SVO embeddings from the subject, verb, and object vectors

[Figure: (a) one vector for the whole tuple "[importer make payment]" vs. (b) subj + verb + obj composed into a vector for "[importer make payment]"]

Weakness of PAS-CLBLM

• Only element-wise vector operations
– Pros: fast training
– Cons: poor interaction between predicates and arguments
• Interactions between predicates and arguments are important for transitive verbs

[Figure: again, "make payment" ≈ "pay", "make money" ≈ "earn", "make use (of)" ≈ "use"]

Focusing on Tensor-Based Approaches

• Tensor/matrix-based approaches (noun: vector)
– Adjective: matrix (Baroni+, 2010)
– Transitive verb: matrix (Grefenstette+, 2011; Van de Cruys+, 2013)

[Figure: a tensor of co-occurrence statistics, e.g. $\mathrm{PMI}(\text{importer}, \text{make}, \text{payment}) = 0.31$, is approximated with a $d \times d$ verb matrix and given, pre-trained subject/object vectors]

Implicit Tensor Factorization (1)

• Parameterizing
– the predicate matrices, and
– the argument embeddings

[Figure: the same factorization, now over (predicate, argument 1, argument 2) tuples, with a $d \times d$ matrix per predicate and a $d$-dimensional vector per argument]

Implicit Tensor Factorization (2)

• Calculating plausibility scores $T(i, j, k)$
– using the predicate matrices and argument embeddings: the score of a tuple $(i, j, k)$ is computed from predicate $i$'s matrix and the embeddings of arguments $j$ and $k$

Implicit Tensor Factorization (3)

• Learning the model parameters
– using a plausibility judgment task
• Observed tuple: (i, j, k)
• Collapsed tuples: (i', j, k), (i, j', k), (i, j, k')
– Negative sampling (Mikolov+, 2013) over the collapsed tuples defines the cost function

Example

• Discriminating between observed and collapsed tuples:
– (i, j, k) = (in, importer make payment, currency)   ← observed
– (i', j, k) = (on, importer make payment, currency)
– (i, j', k) = (in, child eat pizza, currency)
– (i, j, k') = (in, importer make payment, furniture)

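A sketch of the score and cost computation, assuming a bilinear form T(i, j, k) = a1_j^T M_i a2_k; the exact parameterization in the CVSC 2015 paper may differ, and the sizes below are made up.

    import numpy as np

    d, n_pred, n_arg = 50, 100, 1000
    rng = np.random.default_rng(3)
    M = rng.normal(scale=0.1, size=(n_pred, d, d))  # one d x d matrix per predicate
    A1 = rng.normal(scale=0.1, size=(n_arg, d))     # argument-1 embeddings
    A2 = rng.normal(scale=0.1, size=(n_arg, d))     # argument-2 embeddings

    def plausibility(i, j, k):
        # Assumed bilinear score T(i, j, k) of predicate i with arguments j, k
        return A1[j] @ M[i] @ A2[k]

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cost(observed, collapsed):
        # Negative sampling: the observed tuple should score high,
        # the collapsed (corrupted) tuples low
        total = -np.log(sigmoid(plausibility(*observed)))
        for tup in collapsed:
            total -= np.log(sigmoid(-plausibility(*tup)))
        return total
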
How to Compute SVO Embeddings?

• Two methods:
– (a) assigning a parameterized vector to each SVO tuple
– (b) composing SVO embeddings from the word vectors and the parameterized verb matrices, using the copy-subject function (Kartsaklis+, 2012)

Why the Copy-Subject Function?

• The function is presented in Kartsaklis+ (2012)
– using the verb matrices of Grefenstette+ (2011)
• Our verb matrices are related to those of Grefenstette+ (2011)
• The function can compute
– verb-object phrase embeddings
– subject-verb-object phrase embeddings

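A sketch of the copy-subject composition, following the Kartsaklis+ (2012) definition given earlier; the same learned verb matrix yields both phrase types.

    import numpy as np

    def verb_object(M_verb, v_obj):
        # Verb-object phrase embedding from the verb matrix
        return M_verb @ v_obj

    def copy_subject(v_subj, M_verb, v_obj):
        # v(subj, verb, obj) = (M(verb) v(obj)) elementwise-multiplied with v(subj)
        return verb_object(M_verb, v_obj) * v_subj
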
Experimental Settings

• Training corpus: English Wikipedia
– SVO data: 23.6 million instances
– SVO-preposition-noun data: 17.3 million instances
• Parameter initialization: random values
• Optimization: mini-batch AdaGrad (Duchi+, 2011)
• Embedding dimensionality
– PAS-CLBLM: 200
– Tensor method: 50
• The number of model parameters of PAS-CLBLM is slightly larger than that of the tensor method

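A minimal AdaGrad sketch; the learning rate and parameter shapes are illustrative, not the values used in the experiments.

    import numpy as np

    class AdaGrad:
        # Per-dimension learning rates that shrink with the
        # accumulated squared gradients (Duchi+, 2011)
        def __init__(self, shape, lr=0.05, eps=1e-8):
            self.lr = lr
            self.eps = eps
            self.sq_sum = np.zeros(shape)

        def update(self, params, grad):
            self.sq_sum += grad ** 2
            params -= self.lr * grad / (np.sqrt(self.sq_sum) + self.eps)

    # Usage: one optimizer per parameter block, fed mini-batch gradients
    params = np.zeros((1000, 50))
    opt = AdaGrad(shape=params.shape)
    opt.update(params, grad=np.full((1000, 50), 0.01))
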
Examples of Learned SVO Embeddings

• Case 1: assigning a vector to each SVO tuple
– Adjuncts seem to be helpful in learning the meanings of verb phrases
– However, this approach discards information about the individual words!

Examples of Learned SVO Embeddings

• Case 2: composing SVO embeddings
– Tensor method (CVSC 2015): more flexible!
– PAS-CLBLM (EMNLP 2014): strongly enhances the head word

Multiple Meanings in Verb Matrices

• In the latest approach, the learned verb matrices capture multiple meanings

Verb Sense Disambiguation Task

• Measuring semantic similarities of verb pairs taking the same subjects and objects (Grefenstette+, 2011)
– Evaluation: Spearman's rank correlation between similarity scores and human ratings

Verb pair (with subject & object)                 Human rating
student write name / student spell name           7
child show sign / child express sign              6
system meet criterion / system visit criterion    1

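As an illustration, a minimal sketch of computing the evaluation metric with SciPy; the model scores below are made up.

    from scipy.stats import spearmanr

    # Hypothetical system scores for the three verb pairs above,
    # compared against the human ratings
    model_scores = [0.92, 0.75, 0.10]
    human_ratings = [7, 6, 1]

    rho, _ = spearmanr(model_scores, human_ratings)
    print(rho)  # Spearman's rank correlation, the evaluation metric
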
Results

• State-of-the-art results on the disambiguation task
– Prepositional adjuncts improve the results
• How about other kinds of adjuncts?

Method                                Spearman's ρ
Tensor (only verb data)               0.480
Tensor (verb and preposition data)    0.614
PAS-CLBLM (this experiment)           0.374
Milajevs+, 2014                       0.456
Hashimoto+, 2014                      0.422

Future work: improving real-world applications using the method

Summary

• Word and phrase embeddings are jointly learned using large corpora parsed by syntactic parsers
– The tensor-based method is suitable for verb sense disambiguation
– Adjuncts are useful in learning verb phrase embeddings
• Future directions:
– improving the embedding methods
– applying them to real-world NLP applications
• What kind of information should be captured?