Cross-Lingual Word Representations via Spectral Graph Embeddings
2016/08/10 ACL2016 Berlin, Germany
Takamasa Oshikiri, Kazuki Fukui, Hidetoshi Shimodaira
Session 8B: Word Vectors III (short papers)
Project page: http://oshikiri.org/cleigenwords
2. Cross-Lingual Word Embedding
Represent words as points in a language-independent common space.
● Applications
– Knowledge transfer
– Cross-lingual information retrieval
– Machine translation
[Figure: PCA projection of countries and their capitals in Spanish and English]
3. Some existing works
① Generate two monolingual word vectors separately, then optimize a cross-lingual objective
– Based on linear projection [Mikolov+ 2013]
– Based on CCA [Faruqui&Dyer 2014]
② Jointly optimize “monolingual + cross-lingual” objectives
– BilBOWA [Gouws+ 2015]
– Trans-gram [Coulmance+ 2015]
● Our method follows ② (joint optimization) and takes a simpler approach.
4. Our Main Idea (CL-Eigenwords)
● Leverage sentence-alignment information via a framework based on spectral graph embedding
[Diagram labels: monolingual × 2, cross-lingual, sentence-alignment information]
5. Word vectors generated by CL-Eigenwords (1/2)
Translation pairs (e.g. italy & italia) have almost identical representations.
[Figure: 2-dim visualization (by t-SNE) of word vectors of 1000 translation pairs. Words in each pair are connected by a line segment.]
[Figure: Distances between two words in pairs. > 90% are almost identical.]
6. Word vectors generated by CL-Eigenwords (2/2)
1. Synonyms across languages
They also have vector arithmetic properties.
2. Monolingual Additive Compositionality
3. Cross-lingual Additive Compositionality
[Table panels: Countries, Languages]
Obtained from an English-Spanish parallel corpus
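The vector arithmetic property can be illustrated with a small sketch. The vectors below are toy values invented for this illustration (they are not output of CL-Eigenwords), and the helper names `cosine` and `analogy` are hypothetical. The query `italian - italy + españa` mixes English and Spanish words, which is what cross-lingual additive compositionality would look like in a shared space.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT vectors produced by CL-Eigenwords.
vectors = {
    "italy":    np.array([1.00, 0.00, 0.0]),
    "italia":   np.array([1.00, 0.05, 0.0]),
    "spain":    np.array([0.00, 1.00, 0.0]),
    "españa":   np.array([0.05, 1.00, 0.0]),
    "italian":  np.array([1.00, 0.00, 1.0]),
    "italiano": np.array([1.00, 0.05, 1.0]),
    "spanish":  np.array([0.00, 1.00, 1.0]),
    "español":  np.array([0.05, 1.00, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Monolingual query: all three input words are English.
mono = analogy("italy", "italian", "spain", vectors)
# Cross-lingual query: the third input word is Spanish.
cross = analogy("italy", "italian", "españa", vectors)
```

With these toy values the monolingual query returns an English word and the cross-lingual query returns a Spanish word, because each query vector coincides with one stored vector.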
7. Proposed method: Cross-Lingual Eigenwords (CL-Eigenwords)
CL-Eigenwords = Eigenwords [Dhillon+ 2015] + CL-LSI [Littman+ 1998]
● Uses CDMCA [Shimodaira 2016] instead of CCA.
– CDMCA (Cross-Domain Matching Correlation Analysis) is a cross-domain extension of the spectral graph embedding of [Yan+ 2007].
● Requires a sentence-aligned parallel corpus.

Example of sentence-aligned parallel corpus (https://www.wordproject.org/bibles/)
Document id | English | German
Genesis 1-1 | In the beginning God created the heaven and the earth. | Am Anfang schuf Gott Himmel und Erde.
Genesis 2-1 | Thus the heavens and the earth were finished, and all the host of them. | Also ward vollendet Himmel und Erde mit ihrem ganzen Heer.
8. Previous works & CL-Eigenwords
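10. Eigenwords (One-Step CCA) [Dhillon+ 2015]
● CCA-based monolingual word embedding method
● Embed $v_i$ (word) and $c_i$ (its context) onto nearby places.
Notations
– $t_1, \dots, t_T$: Corpus (each $t_i$ is drawn from a vocabulary of size $V$)
– $v_i \in \{0,1\}^{V}$: 1-of-V encoding of $t_i$
– $c_i \in \{0,1\}^{2hV}$: Concatenation of $v_{i-h}, \dots, v_{i-1}, v_{i+1}, \dots, v_{i+h}$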
11. Illustrative example
[Figure: a corpus beginning “to be or not to be that ...” over the vocabulary {be, is, not, or, problem, that, the, to}. Panels: the 1-of-V encoding of “be”; the word matrix $\mathbf{V}$ of size $T \times V$; the context matrix $\mathbf{C}$ of size $T \times 2hV$ with $h = 2$, where positions outside the corpus are shown as “-”.]
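The construction of the word matrix (T × V) and the context matrix (T × 2hV) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `word_and_context_matrices` is hypothetical, the vocabulary is simply the sorted set of tokens, and context positions that fall outside the corpus are assumed to be all-zero blocks.

```python
import numpy as np

def word_and_context_matrices(tokens, h=2):
    """Build the T x V word matrix and the T x 2hV context matrix.

    Row i of the word matrix is the 1-of-V encoding of token i.
    Row i of the context matrix concatenates the 1-of-V encodings of the
    h tokens to the left and the h tokens to the right of token i.
    """
    vocab = sorted(set(tokens))
    index = {w: k for k, w in enumerate(vocab)}
    T, V = len(tokens), len(vocab)
    word = np.zeros((T, V), dtype=int)
    context = np.zeros((T, 2 * h * V), dtype=int)
    offsets = [o for o in range(-h, h + 1) if o != 0]  # -h..-1, 1..h
    for i, tok in enumerate(tokens):
        word[i, index[tok]] = 1
        for slot, off in enumerate(offsets):
            j = i + off
            if 0 <= j < T:  # assumption: out-of-corpus positions stay zero
                context[i, slot * V + index[tokens[j]]] = 1
    return vocab, word, context

tokens = "to be or not to be that".split()
vocab, W, C = word_and_context_matrices(tokens, h=2)
print(vocab)             # vocabulary in sorted order
print(W.shape, C.shape)  # (T, V) and (T, 2hV)
```

Each row of `W` contains exactly one 1, and each row of `C` contains at most 2h ones (fewer near the corpus boundaries).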
12. Formulation of Eigenwords [Dhillon+ 2015]
[Equation: Eigenwords optimization problem over the unknown projection matrices]
K: Dimension of vector representation
Output: Vector representation of words
This optimization problem is solved by CCA [Hotelling 1936].
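How CCA yields K-dimensional projections can be sketched with plain NumPy. This is a textbook whitening-plus-SVD formulation of CCA, not the Eigenwords implementation itself, which may use approximations suited to large sparse matrices. The names `inv_sqrt`, `cca_embed`, `A` and `B` are hypothetical, the data are random placeholders, and no centering is applied (an assumption).

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    """Inverse square root of a symmetric positive semi-definite matrix.

    Eigenvalues below eps are clipped to eps for numerical stability.
    """
    w, U = np.linalg.eigh(M)
    return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

def cca_embed(X, Y, K):
    """Return projections A, B and the top-K canonical correlations.

    A and B satisfy A^T X^T X A = I and B^T Y^T Y B = I when
    X^T X and Y^T Y are well conditioned.
    """
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :K]
    B = Wy @ Vt[:K].T
    return A, B, s[:K]

# Random placeholder data standing in for the word and context matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=(5, 4)) + 0.1 * rng.normal(size=(200, 4))
A, B, corr = cca_embed(X, Y, K=2)
```

If `X` were the T × V word matrix, `A` would have one row per vocabulary word, so under this sketch each row of `A` would play the role of a word vector.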
14. Cross-Language LSI (CL-LSI) [Littman+ 1998]
● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, and then execute SVD.
[Equation: SVD formulation, in which one symbol denotes the identity matrix]
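The "concatenate, then SVD" step can be sketched as follows. This is a minimal illustration under stated assumptions: words are rows and aligned documents are columns, the two matrices are stacked vertically, and the left singular vectors are used as word vectors. The function names `count_matrix` and `cl_lsi` and the tiny two-document corpus are hypothetical.

```python
import numpy as np

def count_matrix(docs, vocab):
    """Word-document count matrix: rows are words, columns are documents."""
    M = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.split():
            M[vocab.index(w), j] += 1
    return M

def cl_lsi(D1, D2, K):
    """Stack two word-document matrices over the same aligned documents,
    run SVD, and split the top-K left singular vectors by language."""
    stacked = np.vstack([D1, D2])  # (V1 + V2) x D
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :K]
    return U_k[:D1.shape[0]], U_k[D1.shape[0]:]

# Tiny hypothetical aligned corpus (document j in English matches document j in German).
en_docs = ["god created heaven earth", "heaven earth finished"]
de_docs = ["gott schuf himmel erde", "himmel erde vollendet"]
en_vocab = sorted({w for d in en_docs for w in d.split()})
de_vocab = sorted({w for d in de_docs for w in d.split()})

E_en, E_de = cl_lsi(count_matrix(en_docs, en_vocab),
                    count_matrix(de_docs, de_vocab), K=2)
heaven = E_en[en_vocab.index("heaven")]
himmel = E_de[de_vocab.index("himmel")]
```

In this toy corpus "heaven" and "himmel" occur in exactly the same documents, so their rows in the stacked matrix are equal and they receive the same vector.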
16. Formulation of CL-Eigenwords
[Equation: objective (monolingual term + cross-lingual term) and constraint]
monolingual term: Eigenwords [Dhillon+ 2015] objective for each corpus
cross-lingual term: Cross-Language LSI [Littman+ 1998] based objective
This optimization problem is solved via generalized eigenvalue decomposition.
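For readers unfamiliar with the solver step, a generalized symmetric eigenproblem H x = λ G x can be reduced to a standard one by a Cholesky factorization of G. The sketch below shows only that generic reduction on random placeholder matrices; it does not construct the actual CL-Eigenwords matrices, and the name `generalized_eigh` is hypothetical. SciPy's `scipy.linalg.eigh` also accepts a second matrix for the same kind of problem.

```python
import numpy as np

def generalized_eigh(H, G, K):
    """Top-K solutions of H x = lambda G x.

    H must be symmetric and G symmetric positive-definite.
    Returns eigenvalues in descending order and eigenvectors X
    normalized so that X^T G X = I.
    """
    L = np.linalg.cholesky(G)      # G = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ H @ Linv.T          # equivalent standard symmetric problem
    w, Y = np.linalg.eigh(M)       # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:K]
    X = Linv.T @ Y[:, order]       # map back: x = L^{-T} y
    return w[order], X

# Random placeholder matrices, only to exercise the solver.
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6))
H = R + R.T                        # symmetric
S = rng.normal(size=(6, 6))
G = S @ S.T + 6 * np.eye(6)        # symmetric positive-definite
vals, X = generalized_eigh(H, G, K=3)
```

The normalization X^T G X = I mirrors the role of a constraint in this kind of formulation: the objective matrix plays the part of H and the constraint matrix the part of G.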
18. Lexical translation task [Mikolov+ 2013]
● Word translation using generated word vectors, using the Europarl corpus [Koehn 2005]
[Table: results of the lexical translation task. Higher is better. Annotation: 3 cores]
● CL-Eigenwords is competitive with other methods.
● Computation times of CL-Eigenwords are as short as those of BilBOWA.
19. Conclusion
● We proposed a novel cross-lingual word embedding method (CL-Eigenwords).
● Our method is simple but competitive with state-of-the-art methods.
● Our implementation is available on GitHub.
Project page: http://oshikiri.org/cleigenwords
20. References
Dhillon, P. S., Foster, D. P., & Ungar, L. H. (2015). Eigenwords: Spectral Word Embeddings. Journal of Machine Learning Research, 16, 3035–3078.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.
Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51.
Shimodaira, H. (2016). Cross-validation of matching correlation analysis by resampling matching weights. Neural Networks, 75, 126–140.
Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168v1.
Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Proceedings of EACL, 462–471.
Gouws, S., Bengio, Y., & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of ICML 2015, 748–756.
Coulmance, J., Marty, J.-M., Wenzek, G., & Benhalloum, A. (2015). Trans-gram, Fast Cross-lingual Word-embeddings. EMNLP 2015, (September), 1109–1113.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit, 5(June), 79–86.
Littman, M. L., Dumais, S. T., & Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. Cross-Language Information Retrieval, 51–62.
Appendices
22. 3 languages version
● Europarl corpus (es-en & de-en)
● t-SNE (dim=2)
23. Formulation of Eigenwords (1/2)
[Equation: Eigenwords optimization problem over the projection matrices]
K: Dimension of vector representation
Output: Vector representation of words
The optimal solution of the above problem can be obtained via eigenvalue decomposition.
24. Formulation of Eigenwords (2/2)
● The previous optimization problem can be rewritten as a trace maximization problem.
[Equation: trace maximization problem]
The optimal solution can be obtained via eigenvalue decomposition of [matrix shown on the slide].
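Why a distance-minimization problem of this kind turns into a trace maximization can be sketched with a short expansion. The slide equations did not survive, so the symbols here are assumptions: $A$ and $B$ denote the projection matrices, $\mathbf{V}$ and $\mathbf{C}$ the word and context matrices whose rows are $v_i^\top$ and $c_i^\top$, and the objective and constraints are taken in the usual CCA form.

```latex
\begin{aligned}
\sum_{i=1}^{T} \bigl\| A^\top v_i - B^\top c_i \bigr\|_2^2
  &= \operatorname{Tr}\!\bigl(A^\top \mathbf{V}^\top \mathbf{V} A\bigr)
   + \operatorname{Tr}\!\bigl(B^\top \mathbf{C}^\top \mathbf{C} B\bigr)
   - 2\,\operatorname{Tr}\!\bigl(A^\top \mathbf{V}^\top \mathbf{C} B\bigr) \\
  &= 2K - 2\,\operatorname{Tr}\!\bigl(A^\top \mathbf{V}^\top \mathbf{C} B\bigr)
  \qquad \text{if } A^\top \mathbf{V}^\top \mathbf{V} A
        = B^\top \mathbf{C}^\top \mathbf{C} B = I_K .
\end{aligned}
```

Under these assumed constraints the first two traces are constant, so minimizing the sum of squared distances is the same as maximizing $\operatorname{Tr}(A^\top \mathbf{V}^\top \mathbf{C} B)$.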
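25. Cross-Language LSI (CL-LSI) [Littman+ 1998]
● Cross-lingual extension of Latent Semantic Indexing
● Concatenate two word-document matrices, and then execute SVD.
[Figure: English word-document matrix, with words (in, the, beginning, ..., earth, thus, the, heavens) along one axis and documents (doc. 1, doc. 2, ..., doc. D) along the other]
[Equation: SVD formulation, in which one symbol denotes the identity matrix]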
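26. Formulation of CL-Eigenwords (1/2)
The per-corpus notation is defined independently from each corpus.
$d_j$ is the 1-of-D encoding of the j-th document.
An indicator variable is 1 if the i-th token of a corpus came from the j-th document, and 0 otherwise.
[Equation: objective with a monolingual term and a cross-lingual term]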
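The token-to-document indicator can be sketched as a matrix with one row per token and one column per document. This is only an illustration of the definition above: the name `doc_indicator`, the symbol `J`, and the toy token-to-document assignments are hypothetical and not taken from the slides.

```python
import numpy as np

def doc_indicator(doc_ids, D):
    """Token-by-document indicator matrix.

    Entry (i, j) is 1 if token i came from document j and 0 otherwise,
    so row i is the 1-of-D encoding of the document containing token i.
    """
    J = np.zeros((len(doc_ids), D), dtype=int)
    J[np.arange(len(doc_ids)), doc_ids] = 1
    return J

# Hypothetical toy corpora with D = 2 aligned documents.
# English tokens: "in the beginning" (doc 0), "thus the heavens" (doc 1)
# German tokens:  "am anfang" (doc 0), "also ward vollendet" (doc 1)
J_en = doc_indicator([0, 0, 0, 1, 1, 1], D=2)
J_de = doc_indicator([0, 0, 1, 1, 1], D=2)

# Entry (i, k) is 1 when English token i and German token k
# belong to the same aligned document.
same_doc = J_en @ J_de.T
```

The two corpora may have different numbers of tokens; the product `J_en @ J_de.T` links them only through the shared document index, which is how sentence alignment can enter without word alignment.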
27. Formulation of CL-Eigenwords (2/2)
● The previous optimization problem can be rewritten as a trace maximization problem.
[Equation: trace maximization with a monolingual term and a cross-lingual term]
The optimal solution can be obtained via eigenvalue decomposition of [matrix shown on the slide].
This problem is formulated as an example of Cross-Domain Matching Correlation Analysis (CDMCA) [Shimodaira 2016].
29. Difference between Eigenwords & CL-Eigenwords
● Objective functions
[Equations: objective of Eigenwords; objective of CL-Eigenwords (monolingual term + cross-lingual term)]
30. Difference between Eigenwords & CL-Eigenwords
● Matrix
[Equations: matrix for Eigenwords; matrix for CL-Eigenwords]
31. Lexical translation task [Mikolov+ 2013]
● Word translation using generated word vectors, using the Europarl corpus [Koehn 2005]
Procedure:
① Choose a word from the source language.
② Search for the top-n nearest words in the target language.
③ If they contain the translation returned by Google Translate, the result is counted as a correct answer.
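The three-step procedure can be sketched as a precision-at-n computation. This is a minimal sketch under assumptions: "nearest" is taken to mean highest cosine similarity (the slides do not name the measure), the gold dictionary stands in for the translations returned by Google Translate, and the vectors and function names are toy values invented for illustration.

```python
import numpy as np

def top_n_translations(src_word, src_vecs, tgt_vecs, n):
    """Return the n target words nearest to src_word by cosine similarity."""
    q = src_vecs[src_word]

    def cos(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    ranked = sorted(tgt_vecs, key=lambda w: cos(tgt_vecs[w]), reverse=True)
    return ranked[:n]

def precision_at_n(gold, src_vecs, tgt_vecs, n):
    """Fraction of source words whose gold translation is in the top n."""
    hits = sum(gold[s] in top_n_translations(s, src_vecs, tgt_vecs, n)
               for s in gold)
    return hits / len(gold)

# Toy vectors and gold translations, invented for illustration only.
src_vecs = {"italy": np.array([1.0, 0.0]), "spain": np.array([0.0, 1.0])}
tgt_vecs = {
    "italia":  np.array([0.9, 0.1]),
    "españa":  np.array([0.1, 0.9]),
    "francia": np.array([0.7, 0.7]),
}
gold = {"italy": "italia", "spain": "españa"}

p_at_1 = precision_at_n(gold, src_vecs, tgt_vecs, n=1)
nearest_two = top_n_translations("italy", src_vecs, tgt_vecs, n=2)
```

For a real evaluation, the sorted scan over all target words would typically be replaced by a single matrix product over normalized vectors.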