Cross-Lingual Word Representations via Spectral Graph Embeddings

2016/08/10 ACL2016 Berlin, Germany

Takamasa Oshikiri, Kazuki Fukui, Hidetoshi Shimodaira

Session 8B: Word Vectors III (short papers)

Project page: http://oshikiri.org/cleigenwords

Cross-Lingual Word Embedding

Represent words as points in a language-independent common space.

● Applications
  – Knowledge transfer
  – Cross-lingual information retrieval
  – Machine translation

PCA projection of countries and their capitals in Spanish and English

Some existing works

① Generate two monolingual word vectors separately, then optimize a cross-lingual objective
  – Based on linear projection [Mikolov+ 2013]
  – Based on CCA [Faruqui&Dyer 2014]

② Jointly optimize “monolingual + cross-lingual” objectives
  – BilBOWA [Gouws+ 2015]
  – Trans-gram [Coulmance+ 2015]

● Our method follows ② (joint optimization) and is a simpler approach.
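To make approach ① concrete, here is a minimal sketch of the linear-projection idea: given vectors for a seed dictionary of translation pairs, fit a matrix that maps source-language vectors onto target-language vectors by least squares. The function name, dimensions, and random toy data are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def fit_linear_projection(src_vecs, tgt_vecs):
    """Least-squares W minimizing ||src_vecs @ W - tgt_vecs||_F^2."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

# Hypothetical toy data: target vectors are an exact linear map of source vectors.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))       # 50 seed-dictionary words, 4-dim source space
true_map = rng.normal(size=(4, 3))   # map into a 3-dim target space
tgt = src @ true_map

W = fit_linear_projection(src, tgt)
projected = src @ W                  # source words placed in the target space
```

In this approach the two monolingual spaces are trained first and only the map is learned afterwards, which is the contrast with the joint optimization of ②.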

Our Main Idea (CL-Eigenwords)

● Leverage sentence-alignment information via a spectral graph embedding based framework

[Figure labels: monolingual × 2; cross-lingual; sentence-alignment information]

Word vectors generated by CL-Eigenwords (1/2)

Translation pairs (e.g. italy & italia) have almost identical representations.

2-dim visualization (by t-SNE) of word vectors of 1000 translation pairs. Words in each pair are connected by a line segment.

Distances between the two words in each pair: > 90% are almost identical.

Word vectors generated by CL-Eigenwords (2/2)

1. Synonyms across languages

They also have vector arithmetic properties.

2. Monolingual Additive Compositionality
3. Cross-lingual Additive Compositionality

[Figure panels: Countries, Languages]

Obtained from English-Spanish parallel corpus
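The properties listed above can be illustrated with a small sketch: a synonym across languages is the nearest neighbor in the shared space, and additive compositionality is a nearest-neighbor search around vec(a) - vec(b) + vec(c). The 3-dim vectors below are hypothetical values made up for the example, not vectors produced by CL-Eigenwords.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(query, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to the query."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

def add_sub(a, b, c, vectors):
    """Word nearest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return nearest(query, vectors, exclude={a, b, c})

# Hypothetical vectors in a shared English/Spanish space.
toy = {
    "italy":   [1.0, 0.0, 0.10],
    "italia":  [1.0, 0.0, 0.12],
    "rome":    [1.0, 1.0, 0.10],
    "roma":    [1.0, 1.0, 0.12],
    "france":  [0.0, 0.0, 1.00],
    "francia": [0.0, 0.0, 1.02],
    "paris":   [0.0, 1.0, 1.00],
}

synonym = nearest(toy["italy"], toy, exclude={"italy"})   # cross-lingual synonym
capital = add_sub("rome", "italy", "francia", toy)        # cross-lingual arithmetic
```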

Proposed method: Cross-Lingual Eigenwords (CL-Eigenwords)

CL-Eigenwords = Eigenwords [Dhillon+ 2015] + CL-LSI [Littman+ 1998]

● Uses CDMCA [Shimodaira 2016] instead of CCA.
  – CDMCA (Cross-Domain Matching Correlation Analysis) is a cross-domain extension of the spectral graph embedding of [Yan+ 2007]
● Requires a sentence-aligned parallel corpus

Example of sentence-aligned parallel corpus

Document id | English | German
Genesis 1-1 | In the beginning God created the heaven and the earth. | Am Anfang schuf Gott Himmel und Erde.
Genesis 2-1 | Thus the heavens and the earth were finished, and all the host of them. | Also ward vollendet Himmel und Erde mit ihrem ganzen Heer.

https://www.wordproject.org/bibles/

Previous works & CL-Eigenwords

Eigenwords (One-Step CCA) [Dhillon+ 2015]

● CCA-based monolingual word embedding method
● Embed v_i (word) and c_i (its context) onto nearby places.

Notations

t_1, ..., t_T: Corpus (each t_i is drawn from a size-V vocabulary)
v_i: 1-of-V encoding of t_i
c_i: Concatenation of the 1-of-V encodings of the h words before and after t_i

Illustrative example

Corpus: to be or not to be that ・・・
Vocabulary: be, is, not, or, problem, that, the, to

[Figure: word matrix V (T × V), whose i-th row is the 1-of-V encoding of the i-th token (e.g. the 1-of-V encoding of “be”), and context matrix C (T × 2hV) with h = 2, whose i-th row holds the encodings of the two tokens before and the two tokens after the i-th token; positions outside the corpus are marked “-”.]
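The construction of the word matrix and the context matrix can be sketched as follows. This is a minimal illustration in plain Python; the function names are hypothetical, the full sentence is an assumption reconstructed from the vocabulary shown on the slide, and out-of-corpus context positions are encoded as all-zero blocks.

```python
def one_hot(index, size):
    """1-of-size encoding; index None gives an all-zero row (no token there)."""
    row = [0] * size
    if index is not None:
        row[index] = 1
    return row

def build_matrices(tokens, h=2):
    vocab = sorted(set(tokens))
    index = {w: k for k, w in enumerate(vocab)}
    size = len(vocab)
    offsets = list(range(-h, 0)) + list(range(1, h + 1))   # e.g. -2, -1, +1, +2
    word_matrix, context_matrix = [], []
    for i, tok in enumerate(tokens):
        word_matrix.append(one_hot(index[tok], size))       # row of the T x V matrix
        ctx = []
        for off in offsets:
            j = i + off
            inside = 0 <= j < len(tokens)
            ctx.extend(one_hot(index[tokens[j]] if inside else None, size))
        context_matrix.append(ctx)                          # row of the T x 2hV matrix
    return vocab, word_matrix, context_matrix

# Assumed toy corpus, consistent with the vocabulary on the slide.
tokens = "to be or not to be that is the problem".split()
vocab, word_matrix, context_matrix = build_matrices(tokens, h=2)
```

Each row of the word matrix has exactly one nonzero entry, so in a real corpus both matrices would be stored in a sparse format rather than as dense lists.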

Formulation of Eigenwords [Dhillon+ 2015]

[Equation: optimization problem]

[…]: Unknown
K: Dimension of vector representation
Output: […], the vector representation of words

This optimization problem is solved by CCA [Hotelling 1936].
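As a rough illustration of the CCA step, the sketch below computes projections for two data matrices by whitening each side and taking an SVD of the whitened cross-product matrix. It is a generic, uncentered CCA under stated assumptions, not the authors' implementation: the names `inv_sqrt_psd` and `cca_embeddings`, the `eps` regularizer, and the dense random toy data are all introduced here for illustration.

```python
import numpy as np

def inv_sqrt_psd(M, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix (eps guards tiny eigenvalues)."""
    w, U = np.linalg.eigh(M)
    return (U / np.sqrt(np.maximum(w, 0) + eps)) @ U.T

def cca_embeddings(X, Y, K, eps=1e-8):
    """A, B maximizing tr(A^T X^T Y B) s.t. A^T X^T X A = B^T Y^T Y B = I_K."""
    Sxx_is = inv_sqrt_psd(X.T @ X, eps)
    Syy_is = inv_sqrt_psd(Y.T @ Y, eps)
    U, s, Vt = np.linalg.svd(Sxx_is @ (X.T @ Y) @ Syy_is)
    A = Sxx_is @ U[:, :K]        # rows of A: K-dim vectors for the columns of X
    B = Syy_is @ Vt[:K].T
    return A, B, s[:K]

# Hypothetical dense toy data standing in for the word and context matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=(5, 6)) + 0.1 * rng.normal(size=(200, 6))
A, B, corr = cca_embeddings(X, Y, K=2)
```

When X is a one-hot word matrix, X^T X is diagonal (it holds the word counts), so the whitening on the word side reduces to a rescaling of rows.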


Cross-Language LSI (CL-LSI) [Littman+ 1998]

● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, and then execute SVD.

[…] is the identity matrix
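The two bullets above can be sketched in a few lines: stack the word-document matrices of the two languages (they share the same document columns because the corpus is aligned) and take a truncated SVD. The function name and the tiny count matrices are assumptions for illustration; the counts are taken from the two Genesis example sentences.

```python
import numpy as np

def cl_lsi(X1, X2, K):
    """Stack two word-document matrices (same D columns) and keep a rank-K SVD."""
    stacked = np.vstack([X1, X2])                    # (V1 + V2) x D
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    word_vecs = U[:, :K] * s[:K]                     # one K-dim row per word
    return word_vecs[: X1.shape[0]], word_vecs[X1.shape[0]:]

# Rows: words, columns: documents (Genesis 1-1, Genesis 2-1).
X_en = np.array([[1, 1],    # earth
                 [0, 1],    # host
                 [1, 0]])   # beginning
X_de = np.array([[1, 1],    # Erde
                 [0, 1],    # Heer
                 [1, 0]])   # Anfang

en_vecs, de_vecs = cl_lsi(X_en, X_de, K=2)
```

Words that occur in exactly the same aligned documents, such as "earth" and "Erde" in this toy example, receive the same vector, which is the mechanism that ties the two languages together.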


Formulation of CL-Eigenwords

[Equation: objective, consisting of a monolingual term and a cross-lingual term, and constraint]

monolingual term: Eigenwords [Dhillon+ 2015] objective for each corpus
cross-lingual term: Cross-Language LSI [Littman+ 1998] based objective

This optimization problem is solved via generalized eigenvalue decomposition.
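For readers unfamiliar with the last step, the sketch below shows one standard way to solve a generalized symmetric eigenvalue problem H a = λ G a with NumPy alone, by reducing it to an ordinary eigenproblem through a Cholesky factor of G. The matrices H and G here are random stand-ins; in CL-Eigenwords they would come from the objective and the constraint, which are not reproduced in this transcript. The function name is hypothetical.

```python
import numpy as np

def generalized_eigh_top_k(H, G, K):
    """Top-K solutions of H a = lambda G a with A^T G A = I.

    H must be symmetric and G symmetric positive definite.
    """
    L = np.linalg.cholesky(G)            # G = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ H @ Linv.T                # equivalent standard symmetric problem
    w, U = np.linalg.eigh(M)             # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:K]      # indices of the K largest eigenvalues
    A = Linv.T @ U[:, order]             # map eigenvectors back: a = L^{-T} u
    return w[order], A

# Random stand-in matrices (assumptions, not the CL-Eigenwords matrices).
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6))
H = R + R.T
S = rng.normal(size=(6, 6))
G = S @ S.T + 6 * np.eye(6)

w, A = generalized_eigh_top_k(H, G, K=3)
```

For large vocabularies a sparse or randomized solver would be needed in practice; the dense version is only meant to show the structure of the computation.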


Lexical translation task [Mikolov+ 2013]

● Word translation using generated word vectors, using Europarl corpus [Koehn 2005]

[Table: scores of the compared methods (higher is better) and their computation times (3 cores)]

● CL-Eigenwords is competitive with other methods.
● Computation times of CL-Eigenwords are as short as those of BilBOWA.

Conclusion

● We proposed a novel cross-lingual word embedding method (CL-Eigenwords).
● Our method is simple but competitive with state-of-the-art methods.
● Our implementation is available on GitHub.

Project page: http://oshikiri.org/cleigenwords

References

Dhillon, P. S., Foster, D. P., & Ungar, L. H. (2015). Eigenwords: Spectral Word Embeddings. Journal of Machine Learning Research, 16, 3035–3078.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51.

Shimodaira, H. (2016). Cross-validation of matching correlation analysis by resampling matching weights. Neural Networks, 75, 126–140.

Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168v1.

Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Proceedings of EACL, 462–471.

Gouws, S., Bengio, Y., & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of ICML 2015, 748–756.

Coulmance, J., Marty, J.-M., Wenzek, G., & Benhalloum, A. (2015). Trans-gram, Fast Cross-lingual Word-embeddings. EMNLP 2015, (September), 1109–1113.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit, 5(June), 79–86.

Littman, M. L., Dumais, S. T., & Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. Cross-Language Information Retrieval, 51–62.

Appendices

3 languages version

● Europarl corpus (es-en & de-en)
● t-SNE (dim=2)

Formulation of Eigenwords (1/2)

[Equation: optimization problem]

[…]: Projection matrices
K: Dimension of vector representation
Output: […], the vector representation of words

The optimal solution of the above problem can be obtained via eigenvalue decomposition.

Formulation of Eigenwords (2/2)

● The previous optimization problem can be rewritten as a trace maximization problem.

[Equation: trace maximization problem]

The optimal solution can be obtained via eigenvalue decomposition of […]

Cross-Language LSI (CL-LSI) [Littman+ 1998]

● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, and then execute SVD.

[Figure: English word-document matrix, with rows for words (in, the, beginning, ・・・, earth, thus, the, heavens) and columns for documents (doc. 1, doc. 2, ・・・, doc. D)]

[…] is the identity matrix

Formulation of CL-Eigenwords (1/2)

Define […] independently from the […]-th corpus.

d_j is the 1-of-D encoding of the j-th document.

[…] is 1 if the i-th token of the […]-th corpus came from the j-th document, 0 otherwise.

[Equation: objective with a monolingual term and a cross-lingual term]
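The token-to-document indicator described above can be built directly from a sentence-aligned corpus. The sketch below flattens each language's documents into a token sequence and records, for every token, the 1-of-D encoding of the document it came from. The function name is hypothetical, and the lowercased, punctuation-free tokenization of the Genesis example is an assumption.

```python
def document_indicator(documents):
    """Flatten tokenized documents and build the token-by-document 0/1 matrix."""
    num_docs = len(documents)
    tokens, rows = [], []
    for j, doc in enumerate(documents):
        for tok in doc:
            tokens.append(tok)
            row = [0] * num_docs
            row[j] = 1            # 1-of-D encoding d_j of the source document
            rows.append(row)
    return tokens, rows

# Two aligned documents per language (Genesis 1-1 and Genesis 2-1).
english = [
    "in the beginning god created the heaven and the earth".split(),
    "thus the heavens and the earth were finished and all the host of them".split(),
]
german = [
    "am anfang schuf gott himmel und erde".split(),
    "also ward vollendet himmel und erde mit ihrem ganzen heer".split(),
]

en_tokens, en_rows = document_indicator(english)
de_tokens, de_rows = document_indicator(german)
```

Because both languages index the same D documents, the columns of the two indicator matrices line up, which is what lets the cross-lingual term link words that never appear in the same monolingual context.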

Formulation of CL-Eigenwords (2/2)

● The previous optimization problem can be rewritten as a trace maximization problem.

[Equation: trace maximization problem with a monolingual term and a cross-lingual term]

The optimal solution can be obtained via eigenvalue decomposition of […]

This problem is formulated as an example of Cross-Domain Matching Correlation Analysis (CDMCA) [Shimodaira 2016].

Difference between Eigenwords & CL-Eigenwords

● Objective functions
[Equations: objective function of Eigenwords; objective function of CL-Eigenwords (monolingual term and cross-lingual term)]

● Matrix
[Equations: matrix for Eigenwords; matrix for CL-Eigenwords]

Lexical translation task [Mikolov+ 2013]

● Word translation using generated word vectors, using Europarl corpus [Koehn 2005]

Procedure:
① Choose a word from the source language.
② Search for the top-n nearest words in the target language.
③ If they contain the translation returned by Google Translate, the answer is counted as correct.
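The procedure above amounts to a precision-at-n evaluation, sketched below. Cosine similarity as the nearness measure, the function names, the 2-dim toy vectors, and the `reference` dictionary (a stand-in for the translations returned by Google Translate) are all assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n(word, src_vecs, tgt_vecs, n):
    """Step 2: the n target-language words nearest to the source word."""
    query = src_vecs[word]
    ranked = sorted(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]), reverse=True)
    return ranked[:n]

def precision_at_n(test_words, src_vecs, tgt_vecs, reference, n):
    """Step 3: fraction of test words whose reference translation is in the top n."""
    hits = sum(reference[w] in top_n(w, src_vecs, tgt_vecs, n) for w in test_words)
    return hits / len(test_words)

# Hypothetical vectors in a shared space and hypothetical reference translations.
src_vecs = {"italy": [1.0, 0.1], "earth": [0.1, 1.0], "heaven": [0.05, 1.0]}
tgt_vecs = {"italia": [1.0, 0.12], "tierra": [0.0, 1.0], "cielo": [0.5, 1.0]}
reference = {"italy": "italia", "earth": "tierra", "heaven": "cielo"}

words = list(src_vecs)            # step 1: the chosen source words
p_at_1 = precision_at_n(words, src_vecs, tgt_vecs, reference, n=1)
p_at_2 = precision_at_n(words, src_vecs, tgt_vecs, reference, n=2)
```

In this toy setup "heaven" is closer to "tierra" than to "cielo", so it counts as a miss at n = 1 and as a hit at n = 2, which shows why the reported score depends on n.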
