Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. Estelle Delpech (1), Béatrice Daille (1), Emmanuel Morin (1), Claire Lemaire (2,3). (1) LINA, Université de Nantes; (2) GREMUTS, Université de Grenoble; (3) Lingua et Machina. COLING'12, 10/12/12, Mumbai, India

Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

DESCRIPTION

Material presented at the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India. Paper download at http://hal.archives-ouvertes.fr/hal-00743807. Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts.

Page 1: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Extraction of domain-specific bilingual lexicon from comparable corpora

compositional translation and ranking

Estelle Delpech (1), Béatrice Daille (1), Emmanuel Morin (1), Claire Lemaire (2,3)

(1) LINA, Université de Nantes; (2) GREMUTS, Université de Grenoble; (3) Lingua et Machina

COLING’12 10/12/12 Mumbai, India

Page 2: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Outline

1 Context

2 Translation method

3 Ranking method

4 Results of experiments

5 Future work

Page 4: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Context: comparable corpora for Computer-Aided Translation

Aim: provide domain-specific bilingual lexicons to translators when no parallel data is available

⇒ Comparable corpora:

• Set of texts in languages L1 and L2, which are not translations, but which deal with the same subject matter, so that there is still a possibility to extract translation pairs

Page 7: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Motivations for compositional translation

Usual context-based methods [Fung, 1997]:

• 51% to 88% precision on top 20 candidates with specialized corpora [Daille and Morin, 2005]

⇒ lexicons difficult to use for translators [Delpech, 2011]

Compositional translation:

• 81% to 94% precision on Top 1 [Robitaille et al., 2006, Cartoni, 2009, Morin and Daille, 2009]

• More than 60% of terms in technical and scientific domains are morphologically complex [Namer and Baud, 2007]

• Outperforms context-based approaches for the translation of terms with compositional meaning [Morin and Daille, 2009]

Page 15: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Compositional translation

Compositionality

“the meaning of the whole is a function of the meaning of the parts” [Keenan and Faltz, 1985, 24-25]

Input: "ab"

Decompose: {a, b}

Translate: {α, β}

Reorder: {αβ, βα}

Select: αβ

Output: "αβ"
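
The four steps on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the dictionary, the target-side term list and all function names are invented here, and the "select" step is assumed to be a simple membership test against attested target terms.

```python
from itertools import permutations, product

# Toy resources, invented for illustration only.
DICTIONARY = {"a": ["α"], "b": ["β"]}   # component -> possible translations
TARGET_TERMS = {"αβ"}                    # terms attested on the target side


def decompose(term):
    """Split the toy term into single-character components."""
    return list(term)


def translate(components):
    """Return every combination of component translations."""
    options = [DICTIONARY.get(c, []) for c in components]
    return [list(combo) for combo in product(*options)]


def reorder(components):
    """Return every ordering of the translated components, joined as strings."""
    return ["".join(p) for p in permutations(components)]


def select(candidates):
    """Keep only candidates attested in the target-side term list."""
    return [c for c in candidates if c in TARGET_TERMS]


def compositional_translation(term):
    """Chain the four steps: decompose, translate, reorder, select."""
    candidates = []
    for translated in translate(decompose(term)):
        candidates.extend(reorder(translated))
    return select(candidates)


print(compositional_translation("ab"))
```

If any component has no dictionary entry, `translate` yields no combination and the term is left untranslated, which is one plausible way to handle missing entries.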

Page 22: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Related work

Applied to phrases, decomposed into words [Robitaille et al., 2006, Morin and Daille, 2009]

• rate of evaporation → taux d'évaporation

Applied to words, decomposed into morphemes [Cartoni, 2009, Harastani et al., 2012]

• cardiology → cardiologie

• ricostruire → rebuild

⇒ No approach links bound morphemes to words:

• -cyto- → cellule 'cell'

• cytotoxic → toxique pour les cellules 'toxic to the cells'

Page 26: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Selection and ranking methods

Select translations that occur in target texts / Web [Morin and Daille, 2009]

Select most frequent translation [Grefenstette, 1999]

Compare contexts [Garera and Yarowsky, 2008]

ML: binary classifier [Baldwin and Tanaka, 2004]

⇒ Combination of criteria

⇒ ML: learning-to-rank algorithms (IR)

Page 34: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Translation process overview

Input: "non-cytotoxic"

Decompose: {non, cyto, toxic}

Concatenate: {non, cyto, toxic}, {noncyto, toxic}, {non, cytotoxic}, {noncytotoxic}

Translate: {non, cellule, toxique}, {non, cyto, toxique}, {non, cellule, toxicité}, {non, cyto, toxicité}

Reorder: {non, toxique, cellule}, {non, cellule, toxique}, {cellule, toxique, non}

Concatenate: {non, toxique, cellule}, {nontoxique, cellule}, {non, toxiquecellule}, {nontoxiquecellule}

Match: {non, toxique, cellule}

Output: "non toxique pour les cellules" 'non toxic to the cells'
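
The slide does not say how a candidate such as {non, toxique, cellule} is matched to "non toxique pour les cellules". One hedged reading is sketched below: target terms are assumed to be lemmatized, and function words are assumed to be ignored when comparing. The function-word list, the toy target term and all names are hypothetical. The sketch also tries every permutation, whereas the slide shows only three reorderings.

```python
from itertools import permutations

# Assumed function words that may occur between matched components.
FUNCTION_WORDS = {"pour", "le", "la", "les", "de"}

# Toy target-side term as (surface form, lemmas); invented for illustration.
TARGET_TERMS = [
    ("non toxique pour les cellules", ["non", "toxique", "pour", "le", "cellule"]),
]


def matches(candidate, lemmas):
    """True if the content lemmas equal the candidate, in the same order."""
    content = [lemma for lemma in lemmas if lemma not in FUNCTION_WORDS]
    return content == list(candidate)


def match_candidates(translated):
    """Try every ordering of the translated components against the target terms."""
    found = []
    for ordering in permutations(translated):
        for surface, lemmas in TARGET_TERMS:
            if matches(ordering, lemmas) and surface not in found:
                found.append(surface)
    return found


print(match_candidates(["non", "cellule", "toxique"]))
```

Under these assumptions, only the ordering (non, toxique, cellule) lines up with the attested term, which mirrors the Match line of the slide.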

Page 47: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Decomposition

non-cytotoxic → {non, cyto, toxic}

Split source term into minimal components with heuristic rules:

• split on hyphens

• match substrings of the source term with:

    • a list of morphemes

    • a list of lexical items

• respect some length constraints on the substrings
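
A minimal sketch of these heuristics follows. The morpheme list, the lexical list, the minimum length of 3 and the choice to prefer the segmentation with the most pieces are all assumptions made for illustration. The slide only states that such lists and length constraints exist.

```python
from functools import lru_cache

MORPHEMES = {"cyto"}          # toy list of bound morphemes (assumed)
LEXICON = {"non", "toxic"}    # toy list of lexical items (assumed)
MIN_LEN = 3                   # assumed minimum length of a component


def segment(word):
    """Return the finest segmentation of word into known units, or [word]."""
    units = MORPHEMES | LEXICON

    @lru_cache(maxsize=None)
    def best(start):
        # Finest segmentation of word[start:], or None if there is none.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + MIN_LEN, len(word) + 1):
            piece = word[start:end]
            if piece in units:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return max(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result else [word]


def decompose(term):
    """Split on hyphens, then segment each part into minimal components."""
    components = []
    for part in term.split("-"):
        if part:
            components.extend(segment(part))
    return components


print(decompose("non-cytotoxic"))
```

A part that cannot be fully covered by known units is kept whole, so unknown words pass through unchanged.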

Page 52: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Concatenation

Generate all possible concatenations of the minimal components

Increases the chances of matching the components with entries of the dictionaries

{non, cyto, toxic} → {non, cyto, ∅}

{non, cytotoxic} → {non, cytotoxique}
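
Assuming "all possible concatenations" means every way of merging adjacent components while keeping their order, n components give 2^(n-1) groupings. The function name below is illustrative, not taken from the paper.

```python
from itertools import product


def concatenations(components):
    """Return every grouping obtained by merging adjacent components."""
    if not components:
        return []
    results = []
    # One boolean per gap between components: True means "merge across this gap".
    for joins in product([False, True], repeat=len(components) - 1):
        grouped = [components[0]]
        for component, join in zip(components[1:], joins):
            if join:
                grouped[-1] += component
            else:
                grouped.append(component)
        results.append(grouped)
    return results


for grouping in concatenations(["non", "cyto", "toxic"]):
    print(grouping)
```

For the three components of "non-cytotoxic" this yields the four groupings listed on the overview slide, including {non, cytotoxic}, which is the one that finds a dictionary entry in the example above.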


Translation with direct dictionary look-up

Bilingual dictionary for lexical items:
- toxic → toxique

Morpheme translation table for bound morphemes:
- allow bound-to-free morpheme translation equivalence
- -cyto- → -cyto-, cellule

{-cyto-, toxic} → {-cyto-, toxique}, {cellule, toxique}
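The look-up step amounts to taking the Cartesian product of the translations of each component. The sketch below uses two toy tables holding only the entries shown on the slide; the table names and the function are hypothetical.

```python
from itertools import product

# Toy resources containing only the entries from the slide.
BILINGUAL_DICT = {"toxic": ["toxique"]}
MORPHEME_TABLE = {"-cyto-": ["-cyto-", "cellule"]}


def translate_components(components):
    """Return all combinations of component translations, or [] if one is missing."""
    options = []
    for comp in components:
        translations = MORPHEME_TABLE.get(comp) or BILINGUAL_DICT.get(comp)
        if not translations:
            return []  # a component without translation blocks the candidate
        options.append(translations)
    return [list(combo) for combo in product(*options)]


print(translate_components(["-cyto-", "toxic"]))
```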


Translation with variation

Morphological lexicon
- toxic → toxique → toxicité 'toxicity'

Synonyms
- toxic → toxique → vénéneux 'poisonous'

{-cyto-, toxic} → {-cyto-, toxicité}, {-cyto-, vénéneux}, {cellule, toxicité}, {cellule, vénéneux}
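Variation can be sketched as a second expansion applied to the direct translations: each translated component may be replaced by a morphological relative or a synonym. The two toy tables and the function names below are assumptions holding only the slide's examples.

```python
from itertools import product

# Toy variation resources containing only the entries from the slide.
MORPHOLOGICAL_FAMILIES = {"toxique": ["toxicité"]}
SYNONYMS = {"toxique": ["vénéneux"]}


def variants(word):
    """Morphological relatives followed by synonyms of a target word."""
    return MORPHOLOGICAL_FAMILIES.get(word, []) + SYNONYMS.get(word, [])


def expand_with_variation(translated):
    """Return the candidates that differ from the direct translation."""
    options = [[word] + variants(word) for word in translated]
    direct = tuple(translated)
    return [list(combo) for combo in product(*options) if combo != direct]


direct_candidates = [["-cyto-", "toxique"], ["cellule", "toxique"]]
for candidate in direct_candidates:
    print(expand_with_variation(candidate))
```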


Reordering

No translation patterns or reordering rules

Permute the translated components:

{cellule, toxique} → {cellule, toxique}, {toxique, cellule}
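Since no reordering rules are used, the step reduces to enumerating permutations. A minimal sketch (the function name is hypothetical, and duplicate orderings are dropped as a convenience):

```python
from itertools import permutations


def reorderings(components):
    """Return every distinct ordering of the translated components."""
    seen, result = set(), []
    for perm in permutations(components):
        if perm not in seen:
            seen.add(perm)
            result.append(list(perm))
    return result


print(reorderings(["cellule", "toxique"]))
```

The number of orderings grows factorially with the number of components, which stays manageable for terms with few components.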


Concatenation

Recreate target words by generating all possible concatenations of the components:

{toxique, cellule} → {toxique cellule}, {toxiquecellule}
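Target-side concatenation can be sketched by choosing, at each boundary, between a space and direct attachment. Stripping the hyphens that mark bound morphemes is an assumption of this sketch, not something stated on the slide.

```python
from itertools import product


def target_candidates(components):
    """Join components with either a space or nothing at each boundary."""
    if not components:
        return []
    # Assumption: the hyphens marking bound morphemes are dropped when joining.
    words = [comp.strip("-") for comp in components]
    candidates = []
    for seps in product([" ", ""], repeat=len(words) - 1):
        text = words[0]
        for sep, word in zip(seps, words[1:]):
            text += sep + word
        candidates.append(text)
    return candidates


print(target_candidates(["toxique", "cellule"]))
```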


Selection

Match target words with the words of the target corpus

Allow at most 3 stop words between two words

{toxique cellule} → "toxique pour les cellules" 'toxic to the cells'
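The selection step can be sketched as a scan over the target corpus that tolerates a bounded number of stop words between components. The sketch assumes matching is done on lemmas (so that cellule matches cellules) and uses a hypothetical stop-word list; only the limit of 3 comes from the slide.

```python
# Assumed stop-word lemmas; the limit of 3 comes from the slide.
STOP_LEMMAS = {"pour", "le", "de", "à"}
MAX_GAP = 3


def find_in_corpus(candidate, corpus):
    """Return the first corpus span matching the candidate lemmas, or None.

    candidate: list of lemmas, e.g. ["toxique", "cellule"]
    corpus: list of (surface form, lemma) pairs
    """
    if not candidate:
        return None
    for start in range(len(corpus)):
        pos = start
        for k, lemma in enumerate(candidate):
            if k > 0:
                # Skip at most MAX_GAP stop words before the next component.
                gap = 0
                while (pos < len(corpus) and gap < MAX_GAP
                       and corpus[pos][1] in STOP_LEMMAS):
                    pos += 1
                    gap += 1
            if pos >= len(corpus) or corpus[pos][1] != lemma:
                break
            pos += 1
        else:
            # Every component was found: return the matched surface span.
            return " ".join(surface for surface, _ in corpus[start:pos])
    return None


corpus = [("toxique", "toxique"), ("pour", "pour"),
          ("les", "le"), ("cellules", "cellule")]
print(find_in_corpus(["toxique", "cellule"], corpus))
```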


Target term frequency

Number of occurrences of the target term divided by the total number of occurrences in the target texts

\[ \mathrm{Freq}(t) = \frac{\mathrm{occ}(t)}{N} \]
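As a small illustration of the frequency criterion, the sketch below computes a relative frequency over a list of term occurrences. Treating N as the total count of all term occurrences in the target texts is an assumption; the slide does not say exactly what is counted.

```python
from collections import Counter


def relative_frequency(term, target_occurrences):
    """occ(term) / N, with N the total number of occurrences (assumed)."""
    counts = Counter(target_occurrences)
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


occurrences = ["cytotoxique", "tumeur", "cytotoxique", "dose"]
print(relative_frequency("cytotoxique", occurrences))
```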


Context similarity measure

Corresponds to context-based approaches

Collect words co-occurring with source and target term in a window of 5 words

Normalize co-occurrences with log-likelihood ratio

Compare contexts with weighted Jaccard

\[ \mathrm{Cont}(s, t) = \frac{\sum_{w \in s \cap t} \min(c(s, w), c(t, w))}{\sum_{w \in s \cup t} \max(c(s, w), c(t, w))} \]
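The weighted Jaccard comparison can be sketched directly from the formula. The sketch assumes both context vectors are dictionaries of non-negative association weights over a shared vocabulary (for instance after the source context has been translated, as is usual in context-based approaches); the example weights are invented.

```python
def weighted_jaccard(src_context, tgt_context):
    """Sum of minima over shared words divided by sum of maxima over all words."""
    shared = src_context.keys() & tgt_context.keys()
    union = src_context.keys() | tgt_context.keys()
    numerator = sum(min(src_context[w], tgt_context[w]) for w in shared)
    denominator = sum(
        max(src_context.get(w, 0.0), tgt_context.get(w, 0.0)) for w in union
    )
    return numerator / denominator if denominator else 0.0


# Invented association weights, for illustration only.
source = {"cellule": 2.0, "dose": 1.0}
target = {"cellule": 1.0, "tumeur": 3.0}
print(weighted_jaccard(source, target))
```

In this toy case the numerator is min(2, 1) = 1 and the denominator is 2 + 1 + 3 = 6.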


Part-of-speech translation probability

Probability that a source term with part-of-speech A translates to a target term with part-of-speech B

\[ \mathrm{Pos}(s, t) = P(\mathrm{pos}(t) \mid \mathrm{pos}(s)) = P(B \mid A) \]

Acquired from POS-tagged parallel corpora [Tiedemann, 2009] with word alignment software AnyMalign [Lardrilleux, 2008]
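One plausible way to obtain such probabilities is a relative-frequency estimate over aligned part-of-speech pairs. This is only a sketch: the slides do not say how the probabilities are estimated, and the aligned pairs below are invented rather than taken from the alignment tool's output.

```python
from collections import Counter


def pos_translation_probs(aligned_pos_pairs):
    """Estimate P(target POS | source POS) by relative frequency."""
    pair_counts = Counter(aligned_pos_pairs)
    source_counts = Counter(src for src, _ in aligned_pos_pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


# Invented (source POS, target POS) pairs from hypothetical word alignments.
pairs = [("ADJ", "ADJ"), ("ADJ", "ADJ"), ("ADJ", "NOUN"), ("NOUN", "NOUN")]
probs = pos_translation_probs(pairs)
print(probs[("ADJ", "NOUN")])
```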


Resources reliability score

Some translation resources might give more reliable translations than others

- e.g. bilingual dictionary > synonyms
- score = mean of the reliability of the resources used for translating the components

\[ \mathrm{Reso}(t = \{c_1, \ldots, c_n\}) = \frac{\sum_{i=1}^{n} \mathrm{resource\_reliability}(c_i)}{n} \]

Tuned on training data
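The reliability score is a plain mean, as the sketch below shows. The reliability values are hypothetical placeholders; in the presented work they are tuned on training data and are not given in the slides.

```python
# Hypothetical reliability values; the tuned values are not given in the slides.
RELIABILITY = {
    "bilingual_dictionary": 1.0,
    "morpheme_table": 0.9,
    "synonyms": 0.5,
}


def reso(resources_used):
    """Mean reliability of the resources used to translate each component."""
    if not resources_used:
        return 0.0
    return sum(RELIABILITY[r] for r in resources_used) / len(resources_used)


# One resource per translated component.
print(reso(["bilingual_dictionary", "synonyms"]))
```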


Combination

Linear combination of the 4 criteria: Frequency, Context, Part-of-speech translation probability and Resources reliability

\[ \mathrm{Combi}(t, s) = \mathrm{Freq}(t) + \mathrm{Cont}(s, t) + \mathrm{Pos}(s, t) + \mathrm{Reso}(t) \]
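Ranking with the combined score can be sketched as an unweighted sum followed by a sort. The candidate names and criterion values below are invented for illustration, and the dictionary layout is an assumption.

```python
def combi(scores):
    """Unweighted sum of the four criteria."""
    return scores["freq"] + scores["cont"] + scores["pos"] + scores["reso"]


def rank(candidates):
    """Order candidate translations from highest to lowest combined score."""
    return sorted(candidates, key=lambda name: combi(candidates[name]), reverse=True)


# Invented criterion values for two candidate translations.
candidates = {
    "cellule toxique": {"freq": 0.1, "cont": 0.3, "pos": 0.9, "reso": 0.75},
    "toxique pour les cellules": {"freq": 0.2, "cont": 0.5, "pos": 0.9, "reso": 0.75},
}
print(rank(candidates))
```

Because the sum is unweighted, criteria with a larger numeric range weigh more in the final ranking.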


Machine learning

Learning-to-rank algorithms used in IR for ranking documents

Tried 3 algorithms implemented in the RankLib software¹
- AdaRank [Li and Xu, 2007]
- Coordinate Ascent [Metzler and Croft, 2000]
- LambdaMART [Wu et al., 2010]

Features: Freq, Cont, Pos, Reso

¹ http://people.cs.umass.edu/~vdang/ranklib.html


Corpora

English → French, German

breast cancer

≈ 400k words per language


Lexicons

Morpheme translation table (hand-crafted)

General language dictionary (Xelda)

Synonyms (Xelda)

Domain-specific dictionary: cognates extracted from corpus [Hauer and Kondrak, 2011]

Morphological families [Porter, 1980]


Training and evaluation datasets

EVALUATION ≈ 100 source terms
- source terms in UMLS meta-thesaurus with translation(s) in target texts

TRAINING ≈ 600 source terms
- source terms for which a translation could be generated and whose translation(s) are in the target texts
- generated translations were scored manually

⇒ evaluation and training datasets are disjoint

⇒ source terms are morphologically complex words with no translation in dictionary


Results for translation generation

                                      EN → FR     EN → DE
# source terms                        126         90
# at least 1 translation              86 (68%)    56 (62%)

# at least 1 translation              86          56
1 trans. in UMLS                      68 (79%)    40 (71%)
1 trans. in UMLS or judged correct    81 (94%)    51 (91%)


Results for translation ranking

                 EN → FR   EN → DE   Average
Random           .83       .80       .815
Freq             .92       .84       .88
Cont             .90       .82       .86
Pos              .88       .91       .895
Reso             .92       .82       .87
Combination      .93       .89       .91
ML AdaRank       .90       .84       .87
ML CoordAsc      .93       .89       .91
ML LambdaMart    .86       .88       .87

Table: Top1 translation in UMLS or judged correct


Silence analysis

Missing translation in resources (≈30%)

Target term is not compositional (≈30%)
- breastfeeding → allaitement (FR), stillen (DE)

Lexical divergence (≈20%)
- radiosensitivity → Strahlentoleranz, sensitivity ≠ toleranz

Additional elements (≈13%)
- postpartum → postpartalperiod

Page 122: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Error analysis

Problems in word reordering
- self-examination → untersuchung selbst 'examination self'

Wrong or inappropriate translations
- in-patient → pas malade 'not ill'
  - in → "inside" → inside patient
  - in → "inverse" → not a patient
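
Both error types can be reproduced with a toy version of compositional translation. The sketch below is only an illustration, not the system presented in these slides: the dictionary entries and the function name `candidate_translations` are hypothetical, and it assumes that candidates are built by translating each component and trying every ordering. An ambiguous morpheme such as "in" then yields both the "inside" and the "inverse" readings, and every word order is generated whether or not it is valid in the target language.

```python
from itertools import permutations, product

# Hypothetical toy EN -> FR morpheme dictionary; entries are illustrative only.
MORPHEME_DICT = {
    "in": ["dans", "pas"],             # "inside" reading vs. "inverse" reading
    "patient": ["patient", "malade"],
}


def candidate_translations(components, dictionary):
    """Return every ordering of every combination of component translations."""
    options = [dictionary.get(component, []) for component in components]
    if not all(options):
        # At least one component has no translation: no candidate is produced.
        return []
    candidates = set()
    for combination in product(*options):
        for ordering in permutations(combination):
            candidates.add(" ".join(ordering))
    return sorted(candidates)


candidates = candidate_translations(["in", "patient"], MORPHEME_DICT)
for candidate in candidates:
    print(candidate)
```

The wrong candidate "pas malade" appears in the output next to the other combinations, which is why a ranking step is needed to push such candidates down the list.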

Page 125: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Impact of fertile translations

                     EN → FR   EN → DE
exact translations   21%       10%
wrong translations   50%       80%

Table: % of fertile translations

German, a Germanic language: tendency to agglutination
- oestrogen-independent → Östrogen-unabhängige

French, a Romance language: creates phrases more easily
- oestrogen-independent → indépendant des œstrogènes
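
To make the notion of fertility concrete, here is a minimal sketch. It assumes that a translation counts as fertile when the target term has more words than the source term, and that words can be counted by splitting on whitespace (so a hyphenated compound counts as one word). The function name `is_fertile` is hypothetical.

```python
def is_fertile(source_term: str, target_term: str) -> bool:
    """Assumed definition: the target has more whitespace-separated words."""
    return len(target_term.split()) > len(source_term.split())


# Examples taken from the slides.
pairs = [
    ("oestrogen-independent", "indépendant des œstrogènes"),  # French phrase
    ("oestrogen-independent", "Östrogen-unabhängige"),        # German compound
    ("cardiotoxicity", "toxicité cardiaque"),
    ("overactive", "überaktiv"),
]
for source, target in pairs:
    print(f"{source} -> {target}: fertile={is_fertile(source, target)}")
```

Under this assumption the French phrase is fertile while the German compound is not, which matches the contrast drawn between the two languages.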

Page 128: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Future work

Improve quality of linguistic resources
- morphological derivation rules instead of stemming
- use of a thesaurus

Try translation patterns on top of permutations

Try learning morpheme translation equivalences from
- cognates
- bilingual dictionaries
- out-of-domain parallel data

Page 129: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Thank you for your attention.

Page 130: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

ADDITIONAL SLIDES

Page 131: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Exact translations

Non-fertile:
- pathophysiological → physiopathologique
- overactive → überaktiv

Fertile:
- cardiotoxicity → toxicité cardiaque 'cardiac toxicity'
- mastectomy → ablation der brust 'ablation of the breast'

Page 132: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Morphological variants

Non-fertile:
- dosimetry → dosimétrique 'dosimetric'
- radiosensitivity → strahlenempfindlich 'radiosensitive'

Fertile:
- milk-producing → production de lait 'production of milk'
- self-examination → selbst untersuchen 'self examine'

Page 133: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Inexact but semantically related

Non-fertile:
- oncogene → oncogenèse 'oncogenesis'
- breakthrough → durchbrechen 'break'

Fertile:
- chemoradiotherapy → chemotherapie oder strahlen 'chemotherapy or radiation'
- treatable → pouvoir le traiter 'can treat it'

Page 134: Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Wrong translations

Non-fertile:
- immunoscore → immunomarquer 'immunostain'
- check-in → unkontrollieren 'uncontrolled'

Fertile:
- bloodstream → fliessen mehr blut 'more blood flow'
- risk-reducing → risque de réduire 'risk of reducing'
