Cross-Lingual Word Representations via Spectral Graph Embeddings 2016/08/10 ACL2016 Berlin, Germany Takamasa Oshikiri Kazuki Fukui Hidetoshi Shimodaira Session 8B: Word Vectors III (short papers) Project page: http://oshikiri.org/cleigenwords

Cross-Lingual Word Representations via Spectral Graph Embeddings …oshikiri.org/publications/acl2016/slides.pdf · 2020-03-07 · Word vectors generated by CL-Eigenwords (1/2) 5


Page 1: Cross-Lingual Word Representations via Spectral Graph Embeddings

2016/08/10 ACL2016 Berlin, Germany

Takamasa Oshikiri, Kazuki Fukui, Hidetoshi Shimodaira

Session 8B: Word Vectors III (short papers)

Project page: http://oshikiri.org/cleigenwords

Page 2: Cross-Lingual Word Embedding

Represent words as points in a language-independent common space.

● Applications
  – Knowledge transfer
  – Cross-lingual information retrieval
  – Machine translation

[Figure: PCA projection of countries and their capitals in Spanish and English]

Page 3: Some existing works

① Generate two sets of monolingual word vectors separately, then optimize a cross-lingual objective
  – Based on linear projection [Mikolov+ 2013]
  – Based on CCA [Faruqui&Dyer 2014]

② Jointly optimize "monolingual + cross-lingual" objectives
  – BilBOWA [Gouws+ 2015]
  – Trans-gram [Coulmance+ 2015]

● Our method follows ② (joint optimization) with a simpler approach.

Page 4: Our Main Idea (CL-Eigenwords)

● Leverage sentence-alignment information via a framework based on spectral graph embedding.

[Figure labels: monolingual × 2; cross-lingual; sentence-alignment information]

Page 5: Word vectors generated by CL-Eigenwords (1/2)

Translation pairs (e.g. italy & italia) have almost identical representations.

[Figure: 2-dim visualization (by t-SNE) of word vectors of 1000 translation pairs. Words in each pair are connected by a line segment.]

[Figure: Distances between the two words in each pair. More than 90% are almost identical.]

Page 6: Word vectors generated by CL-Eigenwords (2/2)

1. Synonyms across languages

They also have vector arithmetic properties.

2. Monolingual additive compositionality
3. Cross-lingual additive compositionality

[Figure panels: Countries; Languages. Obtained from an English-Spanish parallel corpus.]

Page 7: Proposed method: Cross-Lingual Eigenwords (CL-Eigenwords)

CL-Eigenwords = Eigenwords [Dhillon+ 2015] + CL-LSI [Littman+ 1998]

● Uses CDMCA [Shimodaira 2016] instead of CCA.
  – CDMCA (Cross-Domain Matching Correlation Analysis) is a cross-domain extension of the spectral graph embedding of [Yan+ 2007].
● Requires a sentence-aligned parallel corpus.

Example of a sentence-aligned parallel corpus (https://www.wordproject.org/bibles/):

Document id | English | German
Genesis 1-1 | In the beginning God created the heaven and the earth. | Am Anfang schuf Gott Himmel und Erde.
Genesis 2-1 | Thus the heavens and the earth were finished, and all the host of them. | Also ward vollendet Himmel und Erde mit ihrem ganzen Heer.

Page 8: Previous works & CL-Eigenwords

Page 9: Previous works & CL-Eigenwords

Page 10: Eigenwords (One-Step CCA) [Dhillon+ 2015]

● CCA-based monolingual word embedding method
● Embed $v_i$ (word) and $c_i$ (its context) at nearby places.

Notations
  – $(t_1, \dots, t_T)$: Corpus ($t_i$ is drawn from a vocabulary of size $V$)
  – $v_i$: 1-of-$V$ encoding of $t_i$
  – $c_i$: Concatenation of $v_{i-h}, \dots, v_{i-1}, v_{i+1}, \dots, v_{i+h}$

Page 11: Illustrative example

[Figure: for the corpus "to be or not to be that ..." and the vocabulary {be, is, not, or, problem, that, the, to}, the word matrix (T × V) stacks the 1-of-V encodings of the tokens (e.g. the 1-of-V encoding of "be"), and the context matrix C (T × 2hV, h = 2) stacks, for each token, the encodings of its h preceding and h following tokens. Positions outside the corpus are shown as "-".]
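
The construction in this illustrative example can be sketched in a few lines of NumPy. This is only a minimal sketch: the function name `build_word_context_matrices`, the variable name `W` for the word matrix, and the choice to leave out-of-corpus context positions as all-zero blocks are assumptions, not details confirmed by the slides.

```python
import numpy as np

def build_word_context_matrices(tokens, vocab, h=2):
    """Return (W, C): W is the T x V one-hot word matrix,
    C is the T x 2hV context matrix (h tokens on each side)."""
    index = {w: k for k, w in enumerate(vocab)}
    T, V = len(tokens), len(vocab)
    W = np.zeros((T, V), dtype=int)
    C = np.zeros((T, 2 * h * V), dtype=int)
    # Context offsets in order: -h, ..., -1, +1, ..., +h
    offsets = [o for o in range(-h, h + 1) if o != 0]
    for i, tok in enumerate(tokens):
        W[i, index[tok]] = 1
        for slot, off in enumerate(offsets):
            j = i + off
            if 0 <= j < T:  # assumption: positions outside the corpus stay zero
                C[i, slot * V + index[tokens[j]]] = 1
    return W, C

tokens = "to be or not to be that".split()
vocab = ["be", "is", "not", "or", "problem", "that", "the", "to"]
W, C = build_word_context_matrices(tokens, vocab, h=2)
print(W.shape, C.shape)
```

Each row of `C` is divided into 2h blocks of width V, one block per context position, so a token in the middle of the corpus has exactly 2h ones in its row.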

Page 12: Formulation of Eigenwords [Dhillon+ 2015]

[Equation: Eigenwords objective]

  – Unknown: [symbol]
  – K: Dimension of the vector representation
  – Output: Vector representation of words

This optimization problem is solved by CCA [Hotelling 1936].
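
For orientation, one standard way to write a CCA problem that embeds each word encoding $v_i$ and its context encoding $c_i$ at nearby places is given below. This is a sketch under assumptions, not a transcription of the slide: the symbols $A$ (a $V \times K$ word projection), $B$ (a $2hV \times K$ context projection), the stacked matrices $\mathbf{V}$ (rows $v_i^{\top}$) and $\mathbf{C}$ (rows $c_i^{\top}$), and the exact form of the constraint are all chosen here for illustration.

```latex
\min_{A,\,B}\ \sum_{i=1}^{T} \bigl\| A^{\top} v_i - B^{\top} c_i \bigr\|_2^2
\quad \text{subject to} \quad
A^{\top} \mathbf{V}^{\top} \mathbf{V} A \;=\; B^{\top} \mathbf{C}^{\top} \mathbf{C} B \;=\; I_K .
```

Under these constraints the sum equals $2K - 2\,\mathrm{tr}\bigl(A^{\top} \mathbf{V}^{\top} \mathbf{C} B\bigr)$, so minimizing it is the same as maximizing the trace term, which is the usual CCA criterion.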

Page 13: Previous works & CL-Eigenwords

Page 14: Cross-Language LSI (CL-LSI) [Littman+ 1998]

● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, then execute SVD.

[Equation: CL-LSI objective and constraint]

$I$ is the identity matrix.
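
The "concatenate, then SVD" recipe can be sketched as follows. This is a minimal illustration under assumptions: the function name `cl_lsi`, the toy count matrices, and the choice to scale the left singular vectors by the singular values are not taken from the slides.

```python
import numpy as np

def cl_lsi(X1, X2, K):
    """X1: (V1 x D) and X2: (V2 x D) word-document matrices that share
    the same D aligned documents. Returns K-dim vectors for both vocabularies."""
    X = np.vstack([X1, X2])                      # concatenate along the word axis
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    word_vecs = U[:, :K] * s[:K]                 # rank-K word representations
    return word_vecs[: X1.shape[0]], word_vecs[X1.shape[0]:]

# Toy counts (made up for illustration): rows are words, columns are documents.
X_en = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])
X_de = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0]])
en_vecs, de_vecs = cl_lsi(X_en, X_de, K=2)
print(np.allclose(en_vecs[0], de_vecs[0]))
```

Because both vocabularies are decomposed jointly, two words that occur in exactly the same aligned documents (the first row of each toy matrix) receive the same vector, which is the property that makes the shared space cross-lingual.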

Page 15: Previous works & CL-Eigenwords

Page 16: Formulation of CL-Eigenwords

[Equation: objective (monolingual term + cross-lingual term) and constraint]

  – monolingual term: Eigenwords [Dhillon+ 2015] objective for each corpus
  – cross-lingual term: objective based on Cross-Language LSI [Littman+ 1998]

This optimization problem is solved via generalized eigenvalue decomposition.
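
As a generic illustration of that last step, the sketch below solves a problem of the form "maximize tr(XᵀHX) subject to XᵀGX = I" with a generalized eigenvalue decomposition, using NumPy only. The matrices `H` and `G` here are random stand-ins, not the actual CL-Eigenwords matrices, and the function name `top_generalized_eigvecs` is hypothetical.

```python
import numpy as np

def top_generalized_eigvecs(H, G, K):
    """Top-K solutions of H x = lambda G x for symmetric H and
    symmetric positive definite G, normalized so that X.T @ G @ X = I."""
    g_vals, g_vecs = np.linalg.eigh(G)
    G_inv_sqrt = (g_vecs / np.sqrt(g_vals)) @ g_vecs.T   # G^(-1/2)
    M = G_inv_sqrt @ H @ G_inv_sqrt                       # ordinary symmetric problem
    vals, vecs = np.linalg.eigh((M + M.T) / 2)            # symmetrize against round-off
    order = np.argsort(vals)[::-1][:K]                    # largest eigenvalues first
    return vals[order], G_inv_sqrt @ vecs[:, order]

rng = np.random.default_rng(0)
R = rng.standard_normal((6, 6))
H = R + R.T                                # symmetric stand-in for the objective matrix
S = rng.standard_normal((6, 6))
G = S @ S.T + 6 * np.eye(6)                # positive definite stand-in for the constraint matrix
vals, X = top_generalized_eigvecs(H, G, K=2)
print(np.allclose(X.T @ G @ X, np.eye(2)))
```

Whitening by G^(-1/2) turns the constrained problem into an ordinary symmetric eigenproblem; for large sparse matrices a truncated or randomized solver would normally replace the dense `eigh` calls.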

Page 17: Formulation of CL-Eigenwords

  – monolingual term: Eigenwords [Dhillon+ 2015] objective for each corpus
  – cross-lingual term: objective based on Cross-Language LSI [Littman+ 1998]

Page 18: Lexical translation task [Mikolov+ 2013]

● Word translation using the generated word vectors

[Table: results on the Europarl corpus [Koehn 2005]; higher is better; label: 3 cores]

● CL-Eigenwords is competitive with other methods.
● Computation times of CL-Eigenwords are as short as those of BilBOWA.

Page 19: Conclusion

● We proposed a novel cross-lingual word embedding method (CL-Eigenwords).
● Our method is simple but competitive with state-of-the-art methods.
● Our implementation is available on GitHub.

Project page: http://oshikiri.org/cleigenwords

Page 20: References

Dhillon, P. S., Foster, D. P., & Ungar, L. H. (2015). Eigenwords: Spectral Word Embeddings. Journal of Machine Learning Research, 16, 3035–3078.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51.

Shimodaira, H. (2016). Cross-validation of matching correlation analysis by resampling matching weights. Neural Networks, 75, 126–140.

Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168v1.

Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Proceedings of EACL, 462–471.

Gouws, S., Bengio, Y., & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of ICML2015, 748–756.

Coulmance, J., Marty, J.-M., Wenzek, G., & Benhalloum, A. (2015). Trans-gram, Fast Cross-lingual Word-embeddings. EMNLP 2015, (September), 1109–1113.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit, 5(June), 79–86.

Littman, M. L., Dumais, S. T., & Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. Cross-Language Information Retrieval, 51–62.

Page 21: Appendices

Page 22: 3 languages version

● Europarl corpus (es-en & de-en)
● t-SNE (dim=2)

Page 23: Formulation of Eigenwords (1/2)

[Equation: Eigenwords objective]

  – Projection matrices: [symbols]
  – K: Dimension of the vector representation
  – Output: Vector representation of words

The optimal solution of the above problem can be obtained via eigenvalue decomposition.

Page 24: Formulation of Eigenwords (2/2)

● The previous optimization problem can be rewritten as a trace maximization problem.

[Equation: trace maximization problem]

The optimal solution can be obtained via eigenvalue decomposition of [matrix expression].
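
A trace maximization of this kind can be solved with a textbook CCA routine: whiten both views, then take an SVD of the whitened cross-product matrix. The sketch below is a generic CCA under assumptions, not necessarily the matrix decomposed on the slide or the authors' implementation; the names `one_step_cca`, `W` (word-side data) and `C` (context-side data) are chosen here, and random data is used only to exercise the function.

```python
import numpy as np

def one_step_cca(W, C, K, eps=1e-8):
    """Maximize tr(A.T @ W.T @ C @ B) subject to
    A.T @ W.T @ W @ A = I and B.T @ C.T @ C @ B = I."""
    def inv_sqrt(S):
        # Pseudo-inverse square root of a symmetric PSD matrix.
        vals, vecs = np.linalg.eigh(S)
        keep = vals > eps
        return (vecs[:, keep] / np.sqrt(vals[keep])) @ vecs[:, keep].T

    Gw = inv_sqrt(W.T @ W)
    Gc = inv_sqrt(C.T @ C)
    U, s, Vt = np.linalg.svd(Gw @ W.T @ C @ Gc, full_matrices=False)
    A = Gw @ U[:, :K]        # projection for the word side
    B = Gc @ Vt[:K].T        # projection for the context side
    return A, B, s[:K]       # s holds the canonical correlations

rng = np.random.default_rng(0)
W = rng.standard_normal((200, 4))
C = W @ rng.standard_normal((4, 6)) + 0.5 * rng.standard_normal((200, 6))
A, B, corr = one_step_cca(W, C, K=2)
print(corr)
```

With a one-hot word matrix, `W.T @ W` is diagonal (word counts), so its inverse square root is cheap; the dense `eigh` on the context side is the part that would need an approximation at real vocabulary sizes.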

Page 25: Cross-Language LSI (CL-LSI) [Littman+ 1998]

● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, then execute SVD.

[Figure: English word-document matrix, indexed by words (in, the, beginning, ..., earth, thus, the, heavens, ...) and by documents (doc. 1, doc. 2, ..., doc. D)]

$I$ is the identity matrix.

Page 26: Formulation of CL-Eigenwords (1/2)

[Equation: objective = monolingual term + cross-lingual term]

  – [symbol] is defined independently of the corpus.
  – $d_j$ is the 1-of-D encoding of the j-th document.
  – [symbol] is 1 if the i-th token of a given corpus comes from the j-th document, and 0 otherwise.
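
The token-to-document indicator described here can be built directly from per-token document ids. The sketch below is illustrative only: the names `token_document_indicator`, `J` and `W`, and the six-token toy corpus, are assumptions rather than notation from the slides.

```python
import numpy as np

def token_document_indicator(doc_ids, D):
    """Row i has a single 1 in column doc_ids[i]:
    the i-th token came from that document."""
    J = np.zeros((len(doc_ids), D), dtype=int)
    J[np.arange(len(doc_ids)), doc_ids] = 1
    return J

# Toy English side: the first three tokens belong to document 0, the rest to document 1.
tokens = ["in", "the", "beginning", "thus", "the", "heavens"]
doc_ids = [0, 0, 0, 1, 1, 1]
vocab = sorted(set(tokens))

# One-hot word matrix (T x V) for the same tokens.
W = np.zeros((len(tokens), len(vocab)), dtype=int)
W[np.arange(len(tokens)), [vocab.index(t) for t in tokens]] = 1

J = token_document_indicator(doc_ids, D=2)
doc_word_counts = J.T @ W      # D x V: how often each word occurs in each document
print(doc_word_counts)
```

Multiplying the transposed indicator by the one-hot word matrix yields document-by-word counts, the kind of word-document matrix that CL-LSI works with, which hints at how a cross-lingual term can be expressed with token-level matrices.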

Page 27: Formulation of CL-Eigenwords (2/2)

● The previous optimization problem can be rewritten as a trace maximization problem.

[Equation: trace maximization problem]

The optimal solution can be obtained via eigenvalue decomposition of [matrix expression].

Page 28

[Equation: objective = monolingual term + cross-lingual term]

This problem can be formulated as an instance of Cross-Domain Matching Correlation Analysis (CDMCA) [Shimodaira 2016].

Page 29: Difference between Eigenwords & CL-Eigenwords

● Objective functions

[Equations: objective of Eigenwords; objective of CL-Eigenwords (monolingual term + cross-lingual term)]

Page 30: Difference between Eigenwords & CL-Eigenwords

● Matrix

[Equations: matrix for Eigenwords; matrix for CL-Eigenwords]

Page 31: Lexical translation task [Mikolov+ 2013]

● Word translation using the generated word vectors, using the Europarl corpus [Koehn 2005]

Procedure:
① Choose a word from the source language.
② Search for the top-n nearest words in the target language.
③ If they contain the translation returned by Google Translate, the answer is counted as correct.
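
The evaluation procedure can be sketched as a top-n nearest-neighbour lookup. Several things here are assumptions for illustration: cosine similarity as the notion of "nearest" (the slides do not name the metric), the function names, the made-up 2-dimensional vectors, and the `gold` dictionary, which stands in for the translations that the slides obtain from Google Translate.

```python
import numpy as np

def top_n_translations(src_vec, tgt_matrix, tgt_words, n=5):
    """Return the n target words whose vectors are most similar to src_vec."""
    tgt_unit = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    query = src_vec / np.linalg.norm(src_vec)
    scores = tgt_unit @ query                  # cosine similarities (assumed metric)
    best = np.argsort(-scores)[:n]
    return [tgt_words[k] for k in best]

def precision_at_n(src_vecs, gold, tgt_matrix, tgt_words, n=5):
    """Fraction of source words whose gold translation is among the top n."""
    hits = sum(
        gold[word] in top_n_translations(vec, tgt_matrix, tgt_words, n)
        for word, vec in src_vecs.items()
    )
    return hits / len(src_vecs)

# Made-up vectors in a shared space, for illustration only.
tgt_words = ["italia", "francia", "perro"]
tgt_matrix = np.array([[1.0, 0.1], [0.9, 0.5], [-1.0, 0.2]])
src_vecs = {
    "italy": np.array([1.0, 0.12]),
    "france": np.array([0.2, 1.0]),
    "dog": np.array([-1.0, 0.25]),
}
gold = {"italy": "italia", "france": "francia", "dog": "perro"}  # stand-in reference

print(precision_at_n(src_vecs, gold, tgt_matrix, tgt_words, n=1))
```

Reporting the score for several values of n is a common way to show how often the correct translation is near the top even when it is not ranked first.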