RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria Improved Word Alignments Using the Web as a Corpus Preslav Nakov, University of California, Berkeley Svetlin Nakov, Sofia University "St. Kliment Ohridski" Elena Paskaleva, Bulgarian Academy of Sciences International Conference RANLP 2007 (Recent Advances in Natural Language Processing)

Svetlin Nakov - Improved Word Alignments Using the Web as a Corpus


Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007



Improved Word Alignments Using the Web as a Corpus

Preslav Nakov, University of California, Berkeley

Svetlin Nakov, Sofia University "St. Kliment Ohridski"

Elena Paskaleva, Bulgarian Academy of Sciences

International Conference RANLP 2007 (Recent Advances in Natural Language Processing)


Statistical Machine Translation (SMT)

1988 – IBM models 1, 2, 3, 4 and 5

Start with bilingual parallel sentence-aligned corpus

Learn translation probabilities of individual words

2004 – PHARAOH model

Learn translation probabilities for phrases

Alignment template approach – extracts translation phrases from word alignments

Improved word alignments in sentences improve translation quality!


Word Alignments

The word alignment problem

Given a bilingual parallel sentence-aligned corpus, align the words in each sentence with the corresponding words in its translation

Example English sentence:

Try our same day delivery of fresh flowers, roses, and unique gift baskets.

Example Bulgarian sentence:

Опитайте нашите свежи цветя, рози и уникални кошници с подаръци с доставка на същия ден.


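Word Alignments – Example

[Figure: word-by-word alignment of the two example sentences]

English words: try, our, same, day, delivery, of, fresh, flowers, roses, and, unique, gift, baskets

Bulgarian words: опитайте, нашите, свежи, цветя, рози, и, уникални, кошници, с, подаръци, с, доставка, на, същия, ден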


Our Method

Use a combination of:

Orthographic similarity measure

Semantic similarity measure

Competitive linking

Orthographic similarity measure: a modified weighted minimum edit distance

Semantic similarity measure: analyses the co-occurring words in the local contexts of the target words, using the Web as a corpus


Orthographic Similarity

Minimum Edit Distance Ratio (MEDR)

MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 to s2

Longest Common Subsequence Ratio (LCSR)

LCS(s1, s2) = the longest common subsequence of s1 and s2
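The two measures above can be sketched in Python. The slide's formulas for the ratios did not survive extraction, so the normalization used here (dividing by the length of the longer string, with MEDR = 1 − MED / max length and LCSR = |LCS| / max length) is an assumption based on the usual definitions of these ratios, and the function names are illustrative.

```python
def med(s1: str, s2: str) -> int:
    """Minimum number of INSERT / REPLACE / DELETE operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # replace (free if equal)
        prev = cur
    return prev[-1]


def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """Assumed normalization: 1 - MED / length of the longer string."""
    longest = max(len(s1), len(s2))
    return 1.0 - med(s1, s2) / longest if longest else 1.0


def lcsr(s1: str, s2: str) -> float:
    """Assumed normalization: |LCS| / length of the longer string."""
    longest = max(len(s1), len(s2))
    return lcs_len(s1, s2) / longest if longest else 1.0
```

Both ratios lie in [0, 1], with 1 meaning identical strings, so they can be used directly as similarity scores.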


Orthographic Similarity

Modified Minimum Edit Distance Ratio (MMEDR) for Bulgarian / Russian

1. Normalize the strings

2. Assign weights for the edit operations

Normalizing the strings

Hand-crafted rules

Strip the Russian letters "ь" and "ъ"

Remove the Russian "й" at word endings

Remove the definite article in Bulgarian (e.g. "ът", "ят" at word endings)


Orthographic Similarity

Assigning weights for the edit operations

0.5-0.9 for vowel-to-vowel substitutions, e.g. 0.5 for е → о

0.5-0.9 for some consonant-to-consonant substitutions, e.g. с → з

1.0 for all other edit operations

Example: the Bulgarian първият and the Russian первый (first)

Normalization produces първи and перви, thus MMED = 0.5 (weight 0.5 for ъ → е)
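A minimal Python sketch of the two MMED steps (normalize, then weighted edit distance) follows. Only the rules and weights named on the slides are taken from the source. The weight for с/з, the symmetric treatment of substitutions, and the extra rule mapping Russian "ы" to "и" (inferred from the example, where первый becomes перви) are assumptions made for illustration.

```python
# Substitution weights; pairs are treated as symmetric (assumption).
SUBST_WEIGHTS = {
    frozenset(("ъ", "е")): 0.5,  # used in the slide's example
    frozenset(("е", "о")): 0.5,  # from the slide
    frozenset(("с", "з")): 0.8,  # hypothetical value within the 0.5-0.9 range
}


def normalize(word: str, lang: str) -> str:
    """Apply the hand-crafted normalization rules; lang is 'bg' or 'ru'."""
    w = word.lower()
    if lang == "ru":
        w = w.replace("ь", "").replace("ъ", "")   # strip Russian ь and ъ
        if w.endswith("й"):                        # drop a word-final й
            w = w[:-1]
        w = w.replace("ы", "и")                    # assumption, inferred from the example
    elif lang == "bg":
        for article in ("ът", "ят"):               # drop the definite article
            if w.endswith(article):
                w = w[:-len(article)]
                break
    return w


def subst_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUBST_WEIGHTS.get(frozenset((a, b)), 1.0)


def mmed(bg_word: str, ru_word: str) -> float:
    """Weighted edit distance between the normalized Bulgarian and Russian words."""
    s1, s2 = normalize(bg_word, "bg"), normalize(ru_word, "ru")
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [float(i)]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1.0,                    # delete
                           cur[j - 1] + 1.0,                 # insert
                           prev[j - 1] + subst_cost(a, b)))  # weighted substitution
        prev = cur
    return prev[-1]
```

With these rules, `mmed("първият", "первый")` reproduces the value 0.5 from the example. How MMED is then scaled into the ratio MMEDR is not shown on the slides, so that step is left out here.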


Semantic Similarity

What is local context?

A few words before and after the target word

The words in the local context of a given word are semantically related to it

Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.

Stop words appear in all contexts

A sufficiently big corpus is needed

Example text:

Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
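As an illustration of local contexts, the sketch below counts the non-stop words that occur within a fixed window around each occurrence of a target word. The window size of 3, the tiny stop-word list, and the function name are assumptions made for the example; the slides only say "a few words before and after".

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"of", "and", "from", "our", "by", "for", "the", "a"}


def context_counts(text: str, target: str, window: int = 3) -> Counter:
    """Count non-stop words within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        for word in neighbours:
            if word not in STOP_WORDS and word != target:
                counts[word] += 1
    return counts


text = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
        "from our online boutique. Flower delivery online by local florists "
        "for birthday flowers.")
flower_context = context_counts(text, "flowers")
```

No lemmatization is applied in this sketch, so "flower" and "flowers" are treated as different words.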


Semantic Similarity

Web as a corpus

The Web can be used as a corpus to extract the local context for a given word

The Web is the largest possible corpus

Contains big corpora in any language

Searching for a word in Google can return up to 1 000 text excerpts

The target word is given along with its local context: a few words before and after it

Target language can be specified


Semantic Similarity

Web as a corpus

Example: Google query for "flower"

Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...

Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.

Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...

Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS ECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.

Flowers, plants, roses, & gifts. Flowers delivery with fewer ...

Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.


Semantic Similarity

Measuring semantic similarity

For two given words, their local contexts are extracted from the Web

A set of words and their frequencies

Apply lemmatization

Semantic similarity is measured as similarity between these local contexts

Local contexts are represented as frequency vectors for a given set of words

Cosine between the frequency vectors in the Euclidean space is calculated


Semantic Similarity

Example of context word frequencies

word: flower

word        count
fresh       217
order       204
rose        183
delivery    165
gift        124
welcome     98
red         87
...         ...

word: computer

word        count
Internet    291
PC          286
technology  252
order       185
new         174
Web         159
site        146
...         ...


Semantic Similarity

Example of frequency vectors

Similarity = cosine(v1, v2)

v1: flower

#     word       freq.
0     alias      3
1     alligator  2
2     amateur    0
3     apple      5
...   ...        ...
4999  zap        0
5000  zoo        6

v2: computer

#     word       freq.
0     alias      7
1     alligator  0
2     amateur    8
3     apple      133
...   ...        ...
4999  zap        3
5000  zoo        0
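The cosine between two frequency vectors can be sketched as follows. The vectors are stored sparsely as word-to-frequency dictionaries, which is equivalent to fixed-length vectors because missing words count as 0. Only the entries visible on the slide are used, so the resulting number illustrates the computation and is not the similarity of the full vectors.

```python
import math


def cosine(v1: dict, v2: dict) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


# Only the entries shown on the slide; the elided rows are omitted.
v1_flower = {"alias": 3, "alligator": 2, "amateur": 0, "apple": 5, "zap": 0, "zoo": 6}
v2_computer = {"alias": 7, "alligator": 0, "amateur": 8, "apple": 133, "zap": 3, "zoo": 0}

similarity = cosine(v1_flower, v2_computer)
```

Because frequencies are non-negative, the result always lies between 0 (no shared context words) and 1 (proportional vectors).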


Cross-Lingual Semantic Similarity

We are given two words in different languages L1 and L2

We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}

Measuring cross-lingual similarity:

1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2

2. We translate the context C1 into L2 using the glossary G, obtaining C1*

3. We measure similarity between C1* and C2
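The three steps can be sketched in Python. The glossary is modelled as a plain dictionary from L1 words to L2 words, and context words without a glossary entry are simply dropped. Both choices, like the toy glossary entries and counts, are assumptions for illustration and not the authors' implementation.

```python
import math
from collections import Counter


def translate_context(c1: Counter, glossary: dict) -> Counter:
    """Map an L1 context C1 into L2 (C1*), dropping words missing from the glossary."""
    c1_star = Counter()
    for word, freq in c1.items():
        if word in glossary:
            c1_star[glossary[word]] += freq
    return c1_star


def cosine(v1: Counter, v2: Counter) -> float:
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def cross_lingual_similarity(c1: Counter, c2: Counter, glossary: dict) -> float:
    """Similarity between C1* (translated L1 context) and the L2 context C2."""
    return cosine(translate_context(c1, glossary), c2)


# Hypothetical Bulgarian -> Russian glossary entries and context counts.
G = {"роза": "роза", "подарък": "подарок", "доставка": "доставка"}
c1_bg = Counter({"роза": 5, "подарък": 3, "свеж": 2})   # "свеж" is not in G
c2_ru = Counter({"роза": 4, "подарок": 2, "букет": 1})

score = cross_lingual_similarity(c1_bg, c2_ru, G)
```

A larger glossary lets more of C1 survive translation, so the quality of the score depends directly on glossary coverage.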


Competitive Linking

What is competitive linking?

One-to-one bi-directional word alignment algorithm

Greedy "best first" approach

Links the most probable pair first, removes it, and repeats the same for the rest


Applying Competitive Linking

1. Make all words lowercase

2. Remove punctuation

3. Remove the stop words: prepositions, pronouns, conjunctions, etc. (we don't align them)

4. Align the most similar pair of words, using the orthographic similarity combined with the semantic similarity

5. Remove the aligned words

6. Align the rest of the sentences
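The steps above can be sketched as a greedy "best first" loop. Sorting all candidate pairs once by score and skipping pairs whose words are already linked is equivalent to repeatedly picking the best remaining pair. The `similarity` callback, the `threshold` parameter, and the function names are assumptions for illustration; any of the similarity measures discussed earlier could be plugged in.

```python
import string


def preprocess(sentence: str, stop_words: set) -> list:
    """Steps 1-3: lowercase, strip surrounding punctuation, drop stop words."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w and w not in stop_words]


def competitive_linking(src_words, tgt_words, similarity, threshold=0.0):
    """Steps 4-6: greedily link the most similar unlinked pair, one-to-one."""
    candidates = sorted(
        ((similarity(s, t), i, j)
         for i, s in enumerate(src_words)
         for j, t in enumerate(tgt_words)),
        key=lambda c: -c[0],
    )
    used_src, used_tgt, links = set(), set(), []
    for score, i, j in candidates:
        if score <= threshold:
            break                      # remaining pairs are too dissimilar
        if i in used_src or j in used_tgt:
            continue                   # one of the words is already aligned
        links.append((src_words[i], tgt_words[j], score))
        used_src.add(i)
        used_tgt.add(j)
    return links
```

Words left without a partner above the threshold stay unaligned, which matches the one-to-one nature of the method.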


Our Method – Example

Bulgarian sentence:

Процесът на създаването на такива рефлекси е по-сложен, но същността им е еднаква.

Russian sentence:

Процесс создания таких рефлексов сложнее, но существо то же.


Our Method – Example

1. Remove the stop words

Bulgarian: на, на, такива, е, но, им, е

Russian: таких, но, то

2. Align рефлекси and рефлексов (semantic similarity = 0.989)

3. Align по-сложен and сложнее (orthographic similarity = 0.750)

4. Align процесът and процесс (orthographic similarity = 0.714)

5. Align създаването and создания (orthographic similarity = 0.544)

6. Align същността and существо (orthographic similarity = 0.536)

7. Not aligned: еднаква


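Our Method – Example

[Figure: resulting word alignment of the two example sentences]

Bulgarian words: процесът, на, създаването, на, такива, рефлекси, е, по-сложен, но, същността, им, е, еднаква

Russian words: процесс, создания, таких, рефлексов, сложнее, но, существо, то, же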


Evaluation

We evaluated the following algorithms:

BASELINE: the traditional alignment algorithm (IBM model 4)

LCSR, MEDR, MMEDR: orthographic similarity algorithms

WEB-ONLY: semantic similarity algorithm

WEB-AVG: average of WEB-ONLY and MMEDR

WEB-MAX: maximum of WEB-ONLY and MMEDR

WEB-CUT: 1 if MMEDR(s1, s2) >= α (0 < α < 1), or WEB-ONLY(s1, s2) otherwise
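The three combined measures follow directly from their definitions. In this sketch the two inputs are assumed to be precomputed MMEDR and WEB-ONLY scores in [0, 1] for the same word pair; the function names are illustrative, and the default α = 0.62 is the value the slides report for the manual evaluation of WEB-CUT.

```python
def web_avg(mmedr: float, web_only: float) -> float:
    """WEB-AVG: average of the orthographic and the semantic score."""
    return (mmedr + web_only) / 2.0


def web_max(mmedr: float, web_only: float) -> float:
    """WEB-MAX: the larger of the two scores."""
    return max(mmedr, web_only)


def web_cut(mmedr: float, web_only: float, alpha: float = 0.62) -> float:
    """WEB-CUT: 1 if the words are orthographically close enough, else the semantic score."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return 1.0 if mmedr >= alpha else web_only
```

WEB-CUT trusts a strong orthographic match completely and falls back to the Web-based score only for word pairs that do not look alike.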


Testing Data and Experiments

Testing data set

A corpus of 5 827 parallel sentences

Training set: 4 827 sentences

Tuning set: 500 sentences

Testing set: 500 sentences

Experiments

Manual evaluation of WEB-CUT

AER for competitive linking

Translation quality: BLEU / NIST


Manual Evaluation of WEB-CUT

Aligned the texts of the testing data set

Used competitive linking and WEB-CUT for α=0.62

Obtained 14,246 distinct word pairs

Manually evaluated the aligned pairs as:

Correct

Rough (considered incorrect)

Wrong (considered incorrect)

Calculated precision and recall

For the case MMEDR < 0.62


Manual Evaluation of WEB-CUT

[Figure: precision-recall curve]


Evaluation of Alignment Error Rate

Gold standard for alignment

For the first 100 sentences

Created manually by a linguist

Stop words and punctuation were removed

Evaluated the alignment error rate (AER) for competitive linking

Evaluated for all the algorithms

LCSR, MEDR, MMEDR, WEB-ONLY, WEB-AVG, WEB-MAX and WEB-CUT
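The slides do not spell out the AER formula. The sketch below uses the widely used definition with sure (S) and possible (P) gold links, AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links. Whether the gold standard here distinguished sure from possible links is not stated, so that part is an assumption; when only sure links are given, P defaults to S.

```python
def aer(predicted, sure, possible=None) -> float:
    """Alignment error rate; links are hashable pairs such as (src_index, tgt_index)."""
    a = set(predicted)
    s = set(sure)
    p = s | set(possible) if possible is not None else s  # S is a subset of P
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


gold = {(0, 0), (1, 1)}
perfect = aer({(0, 0), (1, 1)}, gold)      # every predicted link is a gold link
half_wrong = aer({(0, 0), (1, 2)}, gold)   # one of two predicted links is wrong
```

Lower is better: 0 means the predicted links coincide with the gold standard, and 1 means no predicted link is correct.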


Evaluation of Alignment Error Rate

AER for competitive linking


Evaluation of Translation Quality

Built a Russian-to-Bulgarian statistical machine translation (SMT) system

Extracted from the training set the distinct word pairs aligned with competitive linking

Added them twice as additional “sentence” pairs to the training corpus

Trained a log-linear model for SMT with standard feature functions

Used minimum error rate training on the tuning set

Evaluated BLEU and NIST scores on the testing set
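The corpus augmentation step can be sketched as follows: each distinct aligned word pair is appended to the training data as a one-word "sentence" pair, repeated twice. The data layout (lists of `(source, target)` string tuples) and the function name are assumptions for illustration; the SMT training itself is outside the scope of this sketch.

```python
def augment_corpus(sentence_pairs, aligned_word_pairs, copies=2):
    """Append each distinct aligned word pair `copies` times as an extra sentence pair."""
    distinct = sorted(set(aligned_word_pairs))   # deduplicate; sort for a stable order
    extra = [pair for pair in distinct for _ in range(copies)]
    return list(sentence_pairs) + extra


# Russian -> Bulgarian toy data taken from the example slides.
corpus = [("Процесс создания таких рефлексов сложнее, но существо то же.",
           "Процесът на създаването на такива рефлекси е по-сложен, но същността им е еднаква.")]
word_pairs = [("процесс", "процесът"), ("рефлексов", "рефлекси"), ("процесс", "процесът")]

augmented = augment_corpus(corpus, word_pairs)
```

Repeating the pairs gives the aligned words extra weight when translation probabilities are estimated from the enlarged corpus.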


Evaluation of Translation Quality

Translation quality: BLEU


Evaluation of Translation Quality

Translation quality: NIST


Resources

We used the following resources:

Bulgarian-Russian parallel corpus: 5 827 sentences

Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words

A list of 599 Bulgarian / 508 Russian stop words

Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata

Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata


Conclusion and Future Work

Conclusion

Semantic similarity extracted from the Web can improve statistical machine translation

For similar languages like Bulgarian and Russian, orthographic similarity is useful

Future Work

Improve MMED with automatically learned rules

Improve the semantic similarity algorithm

Filter parasite words like "site", "click", etc.

Replace competitive linking with maximum weight bipartite matching


Questions?
