Simultaneous Multilingual Search for Translingual Information Retrieval

Kristen Parton¹, Kathleen McKeown¹, James Allan², Enrique Henestroza¹

Page 1: Simultaneous Multilingual Search for Translingual Information Retrieval

Simultaneous Multilingual Search for Translingual Information Retrieval

Kristen Parton¹, Kathleen McKeown¹, James Allan², Enrique Henestroza¹

Page 2: Simultaneous Multilingual Search for Translingual Information Retrieval

Motivation: Cross-Lingual IR

User needs to search documents in other languages

Diagram: Query in User Language → Documents → Search Results in Document Language(s)

Query in User Language: stereotypes of Arabs

Search Results in Document Language(s): الملكة رانيا العبد الله تناقش الصورة النمطية عن العرب

Page 3: Simultaneous Multilingual Search for Translingual Information Retrieval

Task Redefinition: Translingual IR

User needs to search documents in other languages and get back translated results

Diagram: Query in User Language → Documents → Search Results in User Language

Query in User Language: stereotypes of Arabs

Search Results in User Language: Queen Rania Al-Abdullah discusses stereotypes of Arabs

Page 4: Simultaneous Multilingual Search for Translingual Information Retrieval

Task Redefinition: Translingual IR

• User needs to search documents in other languages and get back translated results
• For translingual applications, integrating CLIR and result translation can improve both relevance and translation quality

Page 5: Simultaneous Multilingual Search for Translingual Information Retrieval

Outline

• Approaches to CLIR
• SMLIR for Translingual IR
• Query-Directed MT Post-Editing
• System Evaluation
• Conclusions and Future Work

Page 6: Simultaneous Multilingual Search for Translingual Information Retrieval

Approaches to CLIR

• Map query and/or documents to common representation

Query: Schwarzenegger

Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء

Page 7: Simultaneous Multilingual Search for Translingual Information Retrieval

Approaches to CLIR

• Map query and/or documents to common representation
• Document translation (DT) + pre-translation query expansion

Query (expanded): Schwarzenegger, Schwarznegger, Schwartzenegger, ...

Doc1 (translated): It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]
Doc2 (translated): … besides the star and the governor of the state of California Arnold Schwarznegger .
Doc3 (translated): The failure of all proposals made by Schwarzenegger in a referendum

Page 8: Simultaneous Multilingual Search for Translingual Information Retrieval

Approaches to CLIR

• Map query and/or documents to common representation
• Document translation (DT) + pre-translation query expansion
• Query translation (QT) + post-translation query expansion

Query (expanded): Schwarzenegger, Schwarznegger, Schwartzenegger, ...
Query (translated and expanded): شفارتزنيغر, شوارزنجر, شوارزنيجر, شوارزينيجر

Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء

Page 9: Simultaneous Multilingual Search for Translingual Information Retrieval

Approaches to CLIR

• Map query and/or documents to common representation
• Document translation (DT) + pre-translation query expansion
• Query translation (QT) + post-translation query expansion

Query (expanded): Schwarzenegger, Schwarznegger, Schwartzenegger, ...
Query (translated and expanded): شفارتزنيغر, شوارزنجر, شوارزنيجر, شوارزينيجر

Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء

Page 10: Simultaneous Multilingual Search for Translingual Information Retrieval

Query Translation vs. Document Translation Trade-offs

• Translation resources
• Approximate DT [Oard 00], [Chen 04]
• Translation quality
• Handling synonymy

Hybrid methods
• [McCarley 99], [Chen & Gey 04]: Run QT and DT searches, merge results and rerank
• [Wang & Oard 06]: Use bidirectional word alignments to capture information from QT and DT

Page 11: Simultaneous Multilingual Search for Translingual Information Retrieval

Hybrid Merged Method

• Merge and re-rank results of two searches [McCarley 99]
    ◦ DT: Query + indexed document translations
    ◦ QT: Translated query + indexed source documents
• Problems
    ◦ Different document lengths, query lengths
    ◦ Raw IR scores not comparable across queries
    ◦ Many ways to re-rank, merge searches

DT Score   QT Score   Average   docid
0          0.5        0.25      Doc1
0.9        0.5        0.7       Doc2
0.8        0          0.4       Doc3

Merged Results: Doc2, Doc3, Doc1
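The score table on this slide can be reproduced with a few lines of code. The following is a minimal sketch, assuming the merge is a plain average of the DT and QT scores and that a document missing from one result list scores 0 there. The function name `merge_by_average` is illustrative and does not come from the slides; the slide itself notes that there are many other ways to merge and re-rank.

```python
def merge_by_average(dt_scores, qt_scores):
    """Average the DT and QT score of each document (missing score = 0)."""
    doc_ids = set(dt_scores) | set(qt_scores)
    merged = {
        doc_id: (dt_scores.get(doc_id, 0.0) + qt_scores.get(doc_id, 0.0)) / 2
        for doc_id in doc_ids
    }
    # Highest merged score first; ties broken by docid so output is stable.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Scores copied from the slide's table.
dt_scores = {"Doc1": 0.0, "Doc2": 0.9, "Doc3": 0.8}
qt_scores = {"Doc1": 0.5, "Doc2": 0.5, "Doc3": 0.0}

ranking = merge_by_average(dt_scores, qt_scores)
print([doc_id for doc_id, _ in ranking])
```

Averaging raw scores only makes sense if the two searches produce comparable scores, which is exactly the problem the slide lists for this method.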

Page 12: Simultaneous Multilingual Search for Translingual Information Retrieval

Outline

• Approaches to CLIR
• SMLIR for Translingual IR
• Query-Directed MT Post-Editing
• System Evaluation
• Conclusions and Future Work

Page 13: Simultaneous Multilingual Search for Translingual Information Retrieval

Simultaneous Multilingual IR (SMLIR)

• Indexed document: source + document translation
• Query: original query + query translations (+expansions)

Query (original + expansions): Schwarzenegger, Schwarznegger
Query (translations): شفارتزنيغر, شوارزنجر, شوارزنيجر, شوارزينيجر

Doc1 (source): يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
Doc1 (translation): It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]

Doc2 (source): ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
Doc2 (translation): … besides the star and the governor of the state of California Arnold Schwarznegger .

Doc3 (source): فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
Doc3 (translation): The failure of all proposals made by Schwarzenegger in a referendum
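The effect of indexing source and translation together can be illustrated on the three example documents. This is a toy sketch, not the Indri setup used in the talk: the bag-of-words index, the match-count scoring, and the names `index_document` and `search` are all assumptions made for illustration, and each document is cut down to a few tokens.

```python
from collections import Counter


def index_document(source_tokens, translation_tokens):
    """Index one document as a single bag of source AND translated tokens."""
    return Counter(source_tokens) + Counter(translation_tokens)


def search(index, query_terms):
    """Rank documents by the number of occurrences of any query term."""
    scores = {
        doc_id: sum(bag[term] for term in query_terms)
        for doc_id, bag in index.items()
    }
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Abbreviated token lists taken from the slide's three documents.
index = {
    "Doc1": index_document(["يذكر", "ان", "شوارزنجر"], ["mentioned", "that", "$wArznjr"]),
    "Doc2": index_document(["ارنولد", "شوارزنيجر"], ["Arnold", "Schwarznegger"]),
    "Doc3": index_document(["عرضها", "شوارزينغر"], ["made", "by", "Schwarzenegger"]),
}

english_terms = ["Schwarzenegger", "Schwarznegger", "Schwartzenegger"]
arabic_terms = ["شفارتزنيغر", "شوارزنجر", "شوارزنيجر", "شوارزينيجر"]

print("DT only:", search(index, english_terms))
print("QT only:", search(index, arabic_terms))
print("SMLIR:  ", search(index, english_terms + arabic_terms))
```

In this toy data the English-only query misses Doc1 (the name was not translated), the Arabic-only query misses Doc3 (its spelling is not among the query translations), and the combined query retrieves all three in one search.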

Page 14: Simultaneous Multilingual Search for Translingual Information Retrieval

Simultaneous Multilingual IR (SMLIR)

Multilingual (probabilistic) structured queries
• Treat query term and its translations as synonyms

SMLIR Hybrid vs. Merged Hybrid
• No need for re-ranking or raw score normalization
• Single index, one search
• Query time comparable to Merged in practice

$$FT_j(w) = TF_j(w) + \sum_{x \in trans(w)} TF_j(x)$$

$$FD_j(w) = DF_j(w) + \sum_{x \in trans(w)} DF_j(x)$$
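Treating a query term and its translations as synonyms amounts to pooling their counts. The sketch below assumes the unweighted reading of the slide's formulas: term frequency and document frequency are each summed over the term `w` and every `x` in `trans(w)`. The function names are illustrative, and `union_df` is included only as a contrast; it is a common alternative in structured-query work, not something the slide states.

```python
from collections import Counter


def pooled_tf(doc_tokens, term, translations):
    """TF of the term in one document plus the TF of each translation."""
    counts = Counter(doc_tokens)
    return counts[term] + sum(counts[x] for x in translations)


def pooled_df(collection, term, translations):
    """DF of the term plus the DF of each translation (summed, as assumed)."""
    def df(token):
        return sum(1 for doc in collection if token in doc)
    return df(term) + sum(df(x) for x in translations)


def union_df(collection, term, translations):
    """Alternative: count each document containing ANY variant only once."""
    variants = {term, *translations}
    return sum(1 for doc in collection if variants & set(doc))


# Toy collection: each document holds source and translated tokens together.
collection = [
    ["Schwarzenegger", "شوارزنيجر", "شوارزنيجر"],
    ["شوارزنجر"],
    ["referendum"],
]
term = "Schwarzenegger"
translations = ["شوارزنجر", "شوارزنيجر"]

print(pooled_tf(collection[0], term, translations))  # 1 + 0 + 2
print(pooled_df(collection, term, translations))     # 1 + 1 + 1
print(union_df(collection, term, translations))      # 2 documents match
```

Summed DF counts the first document twice because it contains two variants, which is the main practical difference between the two readings.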

Page 15: Simultaneous Multilingual Search for Translingual Information Retrieval

Outline

• Approaches to CLIR
• SMLIR for Translingual IR
• Query-Directed MT Post-Editing
• System Evaluation
• Conclusions and Future Work

Page 16: Simultaneous Multilingual Search for Translingual Information Retrieval

Relevance: Lost in Translation

• Statistical MT makes mistakes
• Bad translations of relevant documents may be perceived as irrelevant
• Detection: IR match in source language but not in document translation → Bad translation?
• Correction: Replace bad translation with query term

Query: Sajida al-Rishawi
Query translation: ساجدة الريشاوي

Source document: وكانت العراقية ساجدة الريشاوي اوقفت ...
Document translation: It was the Iraqi sajidah Alry$Awy had stopped…

Page 17: Simultaneous Multilingual Search for Translingual Information Retrieval

Query-Directed MT Post-Editing

• Use query translation + word alignments to rewrite incorrect machine translation (MT)
• Considerations: errors in query translation, incorrect word alignments

Query: Sajida al-Rishawi
Query translation: ساجدة الريشاوي

Source document: وكانت العراقية ساجدة الريشاوي اوقفت ...
Translated document with word alignments: It was the Iraqi sajidah Alry$Awy had stopped…
Edited translation: It was the Iraqi Sajida al-Rishawi had stopped…
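The rewrite on this slide can be sketched in code. The following is an illustrative approximation, not the authors' algorithm: it assumes a word alignment given as a mapping from source token positions to target token positions, and it only rewrites when the aligned target span is contiguous. The alignment values, the `synonyms` set, and the names `find_span` and `post_edit` are all hypothetical.

```python
def find_span(tokens, phrase):
    """Return (start, end) of the first occurrence of phrase in tokens."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n
    return None


def post_edit(source_tokens, mt_tokens, alignment, query_translation, query, synonyms):
    """Replace the MT of a matched source-language name with the query term."""
    span = find_span(source_tokens, query_translation)
    if span is None:
        return mt_tokens  # no source-language match, nothing to check
    targets = sorted({t for s in range(*span) for t in alignment.get(s, [])})
    if not targets:
        return mt_tokens  # name was dropped by MT; deletions are not handled
    start, end = targets[0], targets[-1] + 1
    if targets != list(range(start, end)):
        return mt_tokens  # stay conservative on non-contiguous alignments
    current = " ".join(mt_tokens[start:end]).lower()
    if current in {s.lower() for s in synonyms}:
        return mt_tokens  # translation already matches a known variant
    return mt_tokens[:start] + query.split() + mt_tokens[end:]


source_tokens = ["وكانت", "العراقية", "ساجدة", "الريشاوي", "اوقفت"]
mt_tokens = ["It", "was", "the", "Iraqi", "sajidah", "Alry$Awy", "had", "stopped"]
# Hypothetical alignment: source position -> target positions.
alignment = {0: [0, 1], 1: [2, 3], 2: [4], 3: [5], 4: [6, 7]}

edited = post_edit(
    source_tokens, mt_tokens, alignment,
    query_translation=["ساجدة", "الريشاوي"],
    query="Sajida al-Rishawi",
    synonyms={"Sajida al-Rishawi"},
)
print(" ".join(edited))
```

Both considerations named on the slide show up as failure points here: a wrong query translation means `find_span` matches the wrong text or nothing, and a wrong alignment means the wrong target words get replaced.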

Page 18: Simultaneous Multilingual Search for Translingual Information Retrieval

Outline

• Approaches to CLIR
• SMLIR for Translingual IR
• Query-Directed MT Post-Editing
• System Evaluation
• Conclusions and Future Work

Page 19: Simultaneous Multilingual Search for Translingual Information Retrieval

Experiment Setup

• Part of DARPA GALE question-answering task
    ◦ WHERE HAS [UN Secretary General Kofi Annan] BEEN AND WHEN?
    ◦ Multilingual: English, Chinese, Arabic
    ◦ Multimodal: speech, text; Multigenre: formal, informal
• Evaluation Corpus
    ◦ 102,859 Chinese documents
    ◦ Translated into English using RWTH statistical machine translation system
    ◦ Searches run using Indri (Lemur) IR system
• Relevance judgments
    ◦ 145 queries, 8,785 documents judged
    ◦ A document is Relevant or Not Relevant for a query
    ◦ Judgments based on Chinese text, by Chinese native speakers

Page 20: Simultaneous Multilingual Search for Translingual Information Retrieval

Evaluation Points

1. Query Translation Strategies
    ◦ English query → Chinese query
    ◦ Run SMLIR searches, evaluate results

2. Cross-lingual IR Approaches
    ◦ Using Chinese and/or English query, search over Chinese and/or translated documents

3. Machine Translation Post-Editing
    ◦ Detect errors in result translations
    ◦ Rewrite translations

Page 21: Simultaneous Multilingual Search for Translingual Information Retrieval

Query Translation for SMLIR

• GALE queries are name-centric
• Statistical machine translation (SMT) failed to translate many names in corpus
• Wikipedia for name translation [Ferrandez et al. 07]
    ◦ Generated by humans, “edited” by humans
    ◦ Contains slang, name variations, common misspellings
    ◦ Noisy, some intentional spam
    ◦ Large variation in quantity/quality by language

Page 22: Simultaneous Multilingual Search for Translingual Information Retrieval

User-Generated “Synonyms” and Translations

English Query: mahmoud abbas
    English Redirects: abu mazen; mahmud 'abbas; mahmud abbas; abbas, mahmoud
    Cross-Language Links: محمود عباس
    Arabic Redirects: محمود عباس; أبو مازن

English Query: kofi annan
    English Redirects: annan, kofi; kofi; kofi a annan; kofi atta annan; kofi bo bofi; nana maria annan
    Cross-Language Links: كوفي عنان
    Arabic Redirects: كوفي عنان; كوفي انان; كوفي أنان

English Query: arnold schwarzenegger
    English Redirects (49 variants): ahnuld; governator; arnold swarzenager; arnold swarzenneger; arnold swartzeneger
    Cross-Language Links: آرنولد شوارزنيجر
    Arabic Redirects: آرنولد شوارزنيجر
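The way redirects and cross-language links combine into an expanded multilingual query can be sketched as a pair of lookups. This is a minimal illustration using small hand-built dictionaries as stand-ins for tables extracted from Wikipedia; it does not call any Wikipedia API, and the function name `expand_query` and the data layout are assumptions.

```python
def expand_query(query, redirects, cross_language_links):
    """Return (user-language variants, document-language variants) of a name."""
    user_variants = [query] + redirects.get(query, [])
    doc_variants = []
    for title in cross_language_links.get(query, []):
        doc_variants.append(title)
        doc_variants.extend(redirects.get(title, []))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(user_variants)), list(dict.fromkeys(doc_variants))


# Stand-in tables populated with the slide's first example row.
redirects = {
    "mahmoud abbas": ["abu mazen", "mahmud 'abbas", "mahmud abbas", "abbas, mahmoud"],
    "محمود عباس": ["أبو مازن"],
}
cross_language_links = {"mahmoud abbas": ["محمود عباس"]}

english_variants, arabic_variants = expand_query(
    "mahmoud abbas", redirects, cross_language_links
)
print(english_variants)
print(arabic_variants)
```

Because the tables are not probabilistic, every variant is treated as equally likely, which matches the slide's later note that the Wikipedia resource carries no translation probabilities.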

Page 23: Simultaneous Multilingual Search for Translingual Information Retrieval

Query Translation Strategies for SMLIR

[Figure: bar chart of NDCG at 10 (y-axis, 0.30 to 0.60) for three query translation strategies: MT Dictionary, Wikipedia, Wikipedia + MT Dictionary]

MT dictionary: probabilistic translation dictionary derived from word alignments

Wikipedia: for name translations; not probabilistic

Combination did not help?

Page 24: Simultaneous Multilingual Search for Translingual Information Retrieval

CLIR Evaluation

[Figure: bar chart of NDCG at 10 (y-axis, 0.30 to 0.60) for four CLIR approaches: Query Translation, Document Translation, Merged Hybrid, SMLIR Hybrid]

SMLIR significantly outperforms all

DT significantly better than QT

Poor performance of QT degrades Merged
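The charts report NDCG at 10. As a reminder of what that metric measures, here is a minimal sketch for binary relevance judgments (Relevant or Not Relevant, as in this evaluation). The slides do not say which NDCG variant was used, so the log2 discount and the ideal ranking built from all judged-relevant documents are assumptions, as are the function names.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """NDCG@k for binary relevance: 1 if the document is relevant, else 0."""
    gains = [1.0 if d in relevant_doc_ids else 0.0 for d in ranked_doc_ids[:k]]
    # Ideal ranking puts every relevant document (up to k) at the top.
    ideal_dcg = dcg([1.0] * min(k, len(relevant_doc_ids)))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: two relevant documents, one ranked first, one third.
score = ndcg_at_k(["Doc2", "Doc3", "Doc1"], {"Doc1", "Doc2"})
print(round(score, 4))
```

The metric rewards placing relevant documents near the top of the first ten results, which suits a setting where the user sees only a short list of translated results.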

Page 25: Simultaneous Multilingual Search for Translingual Information Retrieval

Results: Query-Directed SMT Post-Editing

Post-Editing
• Detect possible incorrect name translations
• If translated name is not a synonym of query, rewrite name
• Very conservative algorithm; does not handle deletions

Experiment
• 127 queries, top 10 documents
• 28 queries triggered post-editing
• 15% of name matches were rewritten

Evaluation
• 101 rewrites examined; 93% Acceptable, 6% Not Acceptable

Page 26: Simultaneous Multilingual Search for Translingual Information Retrieval

Conclusions

SMLIR: Novel and effective approach for integrating document and query translation in CLIR

Query-directed SMT post-editing shows promise

More sophisticated editing possible, beyond just names

Future work: evaluating whole system for end-to-end question answering

Combining CLIR and machine translation can improve both search relevance and translation accuracy

Page 27: Simultaneous Multilingual Search for Translingual Information Retrieval

Thank you!

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023, in part by an NSF Graduate Research Fellowship, and in part by the Center for Intelligent Information Retrieval at the University of Massachusetts.

Thanks very much to Bob Armstrong for making the annotation happen. Thanks also to Mark Smucker and Giridhar Kumaran for help with the Indri interface and corpus issues, and to Ben Carterette for help with estimated MAP. We would also like to thank the members of the NIGHTINGALE machine translation team for translation data, especially Nizar Habash and Mahmoud Ghoneim.
