Multilingual Pseudo Relevance Feedback: A Way of Query Expansion and Disambiguation

Pushpak Bhattacharyya

Computer Science and Engineering Department

IIT Bombay

www.cse.iitb.ac.in/~pb

Work done with Manoj, Karthik, Arjun and many others

Classical Information Retrieval (Simplified)

[Diagram: the query and the document representation feed the Retrieval Model (a.k.a. ranking algorithm), which returns the relevant documents]

40+ years of work in designing better models (late 1960s to 2010):
• Vector space models
• Binary independence models
• Network models
• Logistic regression models
• Bayesian inference models
• Hyperlink retrieval models

(Courtesy: Dr. Sriram Raghvan, IBM India Research Lab)

The elusive user satisfaction

[Diagram labels: Ranking; Correctness of Query Processing; Coverage; NER; Stemming; MWE; Crawling; Indexing; Output Presentation; Snippet]

How to improve ranking with more meaningful query models

• Set theoretic, Algebraic and Probabilistic Models
• Underlying current of attempts to capture “Query Meaning”
• Started with Karen Spärck Jones’ thesis titled “Synonymy and Semantic Classification” at Cambridge in the 1960s
• The effort continues

Another perspective: Multilinguality

• English still the most dominant language on the web
  – Contributes 72% of the content
• Number of non-English users steadily rising all over the world
• English penetration in India
  – Estimated to be around 3-4%
  – Mostly the urban educated class
• Need to enable access to the above information through local languages

India’s CLIR project

• Enable access to information through local languages
• Query Languages: 9
  – (Assamese, Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, Telugu, Tamil)
• Results in: Source Language + Hindi + English
• Domains: Tourism, Health
• http://www.clia.iitb.ac.in/sandhan
• Public release for select languages planned in Jan 2012

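[Diagram: CLIR system flow]
• Hindi query: तिरुपति यात्रा (“Tirupati travel”)
• CLIR Engine (with Language Resources)
• Target Language Index in English, built from crawled and indexed web pages (target information in English)
• Ranked List of Results
• Result snippets in Hindi:
  – तिरुपति आने के लिए रेल साधन (“Rail options for reaching Tirupati”)
  – तिरुपति पुण्य नगर पहुँचने के लिए बहुत रेल उपलब्ध हैं | अगर मुंबई से यात्रा कर रहे हैं तो मुंबई-चेन्नई एक्सप्रेस गाड़ी से प्रवास कर सकते हैं | (“Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express.”)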

Resource Constrained Languages

• Unlike English, many of the non-English languages are constrained on resources like
  – Large crawl coverage
  – Large annotated corpora
  – Language resources like stemmers, morphological analyzers, word de-compounders etc.

Main Message of the presentation

User Information Need

• Expressed as short query (average length 2.5 words)
• Need query expansion
• Lexical resources based expansion did not deliver (Voorhees 1994)
  – Paradigmatic association (synonyms, antonyms, hyponyms and hypernyms)
  – Introduces severe topic drift through unrelated senses of expansion terms
  – Also through irrelevant senses of query terms

Illustration

Query: “Madrid bomb blast case”

• {case, container}: drifted topic due to inapplicable sense!!!
• {case, suit, lawsuit} → {suit, apparel}: drifted topic due to expanded term!!!
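The drift in this illustration can be reproduced with a toy, sense-unaware expansion. The sketch below is only an illustration: `TOY_SYNSETS` and `expand` are hypothetical names, and the three synonym sets are the ones shown on the slide, not a real lexical resource.

```python
# Toy synonym sets taken from the slide; not a real thesaurus.
TOY_SYNSETS = [
    {"case", "container"},
    {"case", "suit", "lawsuit"},
    {"suit", "apparel"},
]

def expand(terms, synsets, hops=1):
    """Add every word sharing a synonym set with a current term, ignoring word sense."""
    expanded = set(terms)
    for _ in range(hops):
        added = set()
        for synset in synsets:
            if synset & expanded:
                added |= synset
        expanded |= added
    return expanded

# One hop already pulls in the inapplicable sense ("container").
print(sorted(expand({"case"}, TOY_SYNSETS, hops=1)))
# A second hop drifts further through the expansion term "suit" to "apparel".
print(sorted(expand({"case"}, TOY_SYNSETS, hops=2)))
```

Because nothing in the expansion knows which sense of "case" the query intends, both kinds of drift appear without any error in the lexical resource itself.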

Query Expansion: Current Dominant Practice

• Syntagmatic Expansion
• Through Pseudo Relevance Feedback
• We show
  – Multilingual PRF helps
  – Familially related language helps still more
  – Result of insight from linguistics and NLP
  – Disambiguation by leveraging multilinguality

Road map

1. Need for Pseudo Relevance Feedback (PRF)

2. Limitations of PRF

3. Need for MultiPRF

4. MultiPRF using English

5. MultiPRF using other languages

6. Conclusions and future directions

Language Modeling Approach to IR

• Offers a principled approach to IR
• Each document modeled as a probability distribution: Document Language Model
• User information need is modeled as a probability distribution: Query Model

Ranking: a matter of resolving divergence

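Ranking Function – KL Divergence

$$\mathrm{Score}(D) = -KL(\Theta_R, D) \stackrel{\text{rank}}{=} \sum_{w} P(w \mid \Theta_R)\, \log P(w \mid D)$$

• $P(w \mid \Theta_R)$: importance of term in Query
• $P(w \mid D)$: importance of term in Document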

Problem of Retrieval ↔ Problem of Estimating P(w|ΘR) and P(w|D)

[Diagram: query words q1, q2, q3, q4, … qn are compared with document words d1, d2, d3, d4, … dn]
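A minimal sketch of KL-divergence ranking, assuming the rank-equivalent form in which each document is scored by the sum over query-model words of P(w|ΘR) · log P(w|D). The Dirichlet smoothing, the value of `mu`, the add-one collection estimate and all function names are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

def document_model(doc_tokens, collection_counts, collection_len, mu=10.0):
    """Return a function w -> P(w|D), Dirichlet-smoothed with the collection (assumed choice)."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    vocab_size = len(collection_counts)

    def prob(w):
        # Add-one smoothed collection estimate, so unseen words never get zero.
        p_coll = (collection_counts.get(w, 0) + 1) / (collection_len + vocab_size)
        return (counts.get(w, 0) + mu * p_coll) / (doc_len + mu)

    return prob

def kl_score(query_model, doc_prob):
    """Rank-equivalent to -KL(query model, document model)."""
    return sum(p * math.log(doc_prob(w)) for w, p in query_model.items() if p > 0)

def rank(query_model, docs):
    """Return (doc_id, score) pairs sorted from best to worst."""
    collection = Counter(t for tokens in docs.values() for t in tokens)
    total = sum(collection.values())
    scored = [
        (doc_id, kl_score(query_model, document_model(tokens, collection, total)))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": ["olive", "oil", "production"],
    "d2": ["football", "match", "result"],
}
query_model = {"olive": 0.5, "oil": 0.5}
print(rank(query_model, docs))
```

The query-side entropy term is dropped because it is the same for every document and so does not affect the ordering.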

The Challenge - Estimating Query Model ΘR

• Average length of query: 2.5 words
• Relevance Feedback to the rescue
  – User marks some documents from initial ranked list as “relevant”
  – Usually difficult to obtain

Pseudo-Relevance Feedback (PRF)

[Diagram: PRF flow]
1. Query Q is run by the IR Engine over the Document Collection
2. Initial Results (Doc., Score): d1 2.4, d2 2.1, d3 1.8, d4 0.7, …, dm 0.01
3. Assume top ‘k’ as relevant: d1 √, d2 √, d3 √, d4 √, …, dk √
4. Pseudo-Relevance Feedback (PRF): learn Feedback Model from these documents
5. Updated Query Relevance Model
6. Rerank corpus with Updated Query Relevance Model
7. Final Results (Doc., Score): d2 2.3, d1 2.2, d3 1.8, d5 0.6, …, dm 0.01
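The loop in this diagram can be sketched in a few lines. This is a simplified illustration, not the feedback estimator used in the experiments: the feedback model here is just the pooled term distribution of the top-k documents, and `alpha`, `mu`, `k` and all function names are hypothetical.

```python
import math
from collections import Counter

def score(query_model, tokens, background, mu=10.0):
    """Query-likelihood style score with Dirichlet smoothing (assumed choice)."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(
        p * math.log((counts[w] + mu * background.get(w, 1e-9)) / (n + mu))
        for w, p in query_model.items()
    )

def retrieve(query_model, docs):
    """Rank all document ids by score, best first."""
    all_tokens = [t for tokens in docs.values() for t in tokens]
    background = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
    return sorted(docs, key=lambda d: score(query_model, docs[d], background), reverse=True)

def feedback_model(top_docs, docs):
    """Pooled term distribution of the documents assumed relevant."""
    counts = Counter(t for d in top_docs for t in docs[d])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def prf(query_tokens, docs, k=2, alpha=0.5):
    """Initial retrieval, feedback from top k, interpolation, re-ranking."""
    q = {w: c / len(query_tokens) for w, c in Counter(query_tokens).items()}
    initial = retrieve(q, docs)
    fb = feedback_model(initial[:k], docs)
    updated = {w: (1 - alpha) * q.get(w, 0.0) + alpha * fb.get(w, 0.0) for w in set(q) | set(fb)}
    return initial, retrieve(updated, docs)

docs = {
    "d1": ["europe", "union", "access", "member"],
    "d2": ["europe", "union", "membership", "country"],
    "d3": ["member", "country", "membership", "treaty"],
    "d4": ["football", "match", "goal", "score"],
}
initial, final = prf(["access", "europe", "union"], docs)
print(initial, final)
```

In this toy collection d3 shares no word with the query, so the initial run cannot separate it from d4; after feedback the expanded model contains "member", "membership" and "country", which lifts d3 above d4.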

Impact of Coverage on PRF Performance

• If coverage is less, precision at higher ranks decreases
• Experimented on CLEF collections in French
  – Same set of queries run on different collection sizes

Lexical and Semantic Non-Inclusion and Attempts to Solve It

• Query: “Accession to European Union”
• Stemmed query: “access europe union”
• Final expanded query (from the initially retrieved documents): europe, union, access, nation, russia, presid, getti, year, state
• Relevant documents with terms like “Membership”, “Member”, “Country” not ranked higher

Previous Attempts
• Voorhees et al. SIGIR ’94 used WordNet
  – Negative results
• Random walk models
  – Translation Models: Zhai et al. SIGIR ’01
  – Collins-Thompson and Callan CIKM ’05
• Latent Concept Expansion
  – Metzler et al. SIGIR ’07

Limitations of PRF: Lack of Robustness

• Query: “Olive Oil Production in Mediterranean”
• Stemmed query: “oliv oil mediterranean”
• Initial retrieved documents: documents about cooking
• Final expanded query: oil, oliv, mediterranean, produc, cook, salt, pepper, serv, cup
• Causes query drift

Previous Attempts
• Refining top document set
• Refining initial terms obtained through PRF
• Selective query expansion
• TREC Robustness Track: improving robustness

Can both Semantic Non-inclusion and Lack of Robustness be solved?

• Harnesses “Multilinguality”:
  – Take help of a collection in a different language, called the “assisting language”
  – Expectation of increased robustness, since searching in two collections
• An attractive proposition for languages that have poor monolingual performance due to
  – Resource constraints like inadequate coverage
  – Morphological complexity

Related Work

• Gao et al. (2009) use English to improve Chinese language ranking
  – Demonstrate only on a subset of queries
  – Experimentation on a small dataset
  – Uses cross-document similarity

Multilingual PRF: System Flow

[Diagram: MultiPRF system flow]
1. Query in L1 → Initial Retrieval on the L1 Index → Top ‘k’ Results → Get Feedback Model in L1 (θL1)
2. Translate Query into L2 → Initial Retrieval on the L2 Index → Top ‘k’ Results → Get Feedback Model in L2 (θL2)
3. Translate Feedback Model θL2 into L1 (θL1Trans)
4. Interpolate Models
5. Ranking using Final Model

We remember that a set of words is compared with another set of words

[Diagram: reformulated query words q1, q2, q3, q4, … qn are compared with document words d1, d2, d3, d4, … dn]

The reformulated query words consist of:
• Original query words
• Own PRF words
• PRF words from translation

English Lends a Helping Hand!

• English used as assisting language
  – Good monolingual performance
  – Ease of processing
• MultiPRF consistently and significantly outperforms monolingual PRF baseline

(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual PRF: English Lends a Helping Hand, SIGIR 2010, Geneva, Switzerland, July, 2010.)

http://www.cse.iitb.ac.in/~pb/papers/sigir10-multiprf-english.pdf

Why English as the assisting language?

Has the qualities of a good assisting language:

• Resource-rich
  – More than 70% of the web in English
• Morphological ease of processing
  – No complex issues like word compounding etc.
• Good monolingual performance
  – IR issues in English well-studied

Feedback Model Translation

• Feedback model estimated in L2 (ΘFL2) translated back into L1 (ΘTransL1)
  – Using probabilistic bi-lingual dictionary from L2 to L1
  – Learnt from parallel sentence-aligned corpora
• Back Translation Step
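One natural reading of the back translation step is that each L1 word receives the probability mass of all L2 feedback words that translate to it, weighted by the dictionary probability P(f|e). The sketch below assumes exactly that and renormalises the result. The function name and the toy probabilities are hypothetical, chosen only to make the arithmetic visible.

```python
def translate_feedback_model(fb_l2, dictionary):
    """Map an L2 feedback model into L1.

    fb_l2:      {l2_word: P(l2_word | feedback model in L2)}
    dictionary: {l2_word: {l1_word: P(l1_word | l2_word)}}  (probabilistic bilingual dictionary)
    """
    translated = {}
    for e, p_e in fb_l2.items():
        for f, p_f_given_e in dictionary.get(e, {}).items():
            translated[f] = translated.get(f, 0.0) + p_f_given_e * p_e
    total = sum(translated.values())
    # Renormalise, since L2 words missing from the dictionary lose their mass.
    return {f: p / total for f, p in translated.items()} if total > 0 else {}

# Hypothetical English feedback model and English-to-French dictionary entries.
fb_english = {"nation": 0.6, "union": 0.4}
en_to_fr = {
    "nation": {"nation": 0.5, "pays": 0.3, "état": 0.2},
    "union": {"union": 1.0},
}
print(translate_feedback_model(fb_english, en_to_fr))
```

A single English feedback term spreads its weight over several French alternatives, which is how related terms that were absent from the French feedback documents can enter the model.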

Semantically Related Terms through Feedback Model Translation

[Diagram: feedback model translation step]
• English-French word alignments: Nation → Nation, Country, State, UN, United
• German-English word alignments: Flugzeug → Aircraft, Plane, Aeroplane, Air, Flight

Semantically Related Terms through Feedback Model Translation

• Translation alternatives learnt through word-level alignments

• Back translation step acts as a rich source of morphological and semantic relations (Tiedemann 2001)

Linear Model Interpolation

• Original feedback model and translated model interpolated

• Final model also interpolated with query words to retain query focus

• ΘMulti used to finally re-rank documents from corpus based on KLD ranking function
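The interpolation described above can be sketched as a weighted mixture of three distributions. The weights `beta` and `gamma`, the function name, and the toy models are assumptions for illustration; the slides do not give the actual weights.

```python
def interpolate_models(query_model, fb_l1, fb_trans, beta=0.4, gamma=0.4):
    """Mix the L1 feedback model, the translated feedback model and the query model.

    beta  weights the original (L1) feedback model,
    gamma weights the translated feedback model,
    and the remaining 1 - beta - gamma keeps the focus on the query words.
    """
    query_weight = 1.0 - beta - gamma
    if query_weight < 0:
        raise ValueError("beta + gamma must not exceed 1")
    words = set(query_model) | set(fb_l1) | set(fb_trans)
    return {
        w: query_weight * query_model.get(w, 0.0)
        + beta * fb_l1.get(w, 0.0)
        + gamma * fb_trans.get(w, 0.0)
        for w in words
    }

# Hypothetical toy distributions over stemmed words.
query_model = {"access": 0.5, "union": 0.5}
fb_l1 = {"union": 0.5, "russia": 0.5}
fb_trans = {"union": 0.5, "membre": 0.5}
theta_multi = interpolate_models(query_model, fb_l1, fb_trans)
print(theta_multi)
```

Since each input sums to one and the three weights sum to one, the mixture is again a probability distribution and can be used directly as the query model in the KLD ranking function.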

Experimental Setup

• European languages chosen since Europarl freely available
• English chosen as assisting language
• CLEF Standard Dataset for Evaluation
  – Four widely differing source languages used
    • French (Romance family), German (West Germanic)
    • Finnish (Baltic-Finnic), Hungarian (Uralic-Ugric)
  – On more than 600 topics (only Title field)
• Use Google Translate for Query Translation
• Standard Evaluation Metrics (MAP, P@5, P@10, GMAP) used.
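For reference, the four metrics can be computed as follows. This is a minimal sketch with binary relevance; the `eps` floor in GMAP is an assumed convention to avoid taking the log of zero, and the function names are illustrative.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant (P@k)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(aps):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(aps) / len(aps)

def geometric_map(aps, eps=1e-5):
    """GMAP: geometric mean of per-topic average precision; rewards robustness on hard topics."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2), average_precision(ranked, relevant))
```

GMAP is the metric most sensitive to poorly performing topics, which is why it is a useful indicator when the claim is improved robustness rather than only improved average performance.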

Result: MultiPRF Gains over PRF

[Figure: four bar charts (MAP, P@5, P@10, GMAP) comparing MBF and MultiPRF on the FR-01+02, DE-01+02, FI-02+03+04 and HU-05 collections]

Significant Gains in All Collections

Increased Robustness

[Diagram: MultiPRF example for a French query with English as the assisting language]
• Query in French (L1): Oscar honorifique pour des réalisateurs italiens
• Query translated into English (L2): Honorary Oscar for Italian filmmakers
• Initial retrieval on the L1 and L2 indexes, top ‘k’ results from each
• θL1 (French feedback terms): italien, président (president), oscar, gouvern (governor), scalfaro, spadolin
• θL2 (English feedback terms): filmmak, film, movi, tobacco, placement, produc, stallon, studio, italian, oscar, honarari
• θL1Multi (after translate & interpolate): italien, oscar, film, realis, wild, cinem, honorif, president, honorair, cineast

MAP improves from 0.1238 to 0.4324!

[Diagram: MultiPRF example for a German query with English as the assisting language]
• Query in German (L1): Ölunfälle und Vögel
• Query translated into English (L2): Birds and Oil Spills
• Initial retrieval on the L1 and L2 indexes, top ‘k’ results from each
• θL1 (German feedback terms): rhein, ollunfall, fluss, ol, auen, erdreich, heizol, tank, lit, folg, oberrhein, teil
• θL2 (English feedback terms): oil, spill, bird, pipelin, river, offici, fish, lake, cleanup, state, gallon
• θL1Multi (after translate & interpolate): olunfall, vogel, ol, olverschmutz (oil pollution), erdol (petroleum), olp (oil slick), rhein, mcgrath, olivenol, fluss, tier, vergoss, vogelart (bird species), olkatastroph, olpreis


Effect of Varying Query Translation

• Query Translation simpler task than MT.
• We used 3 different QT systems:
  – Google Translate
  – Naïve SMT System
  – Almost “Ideal” Translation

Annotated Performance of Systems (on 3-point scale [0, 0.5, 1])

Corpus      Google Translate   Naïve SMT
FR-01+02    0.93               0.67
FR-03+05    0.88               0.77
DE-01+02    0.93               0.64
DE-03       0.81               0.58

[Figure: Performance of different systems (MBF, SMT, GT, Ideal) on FR-01+02, measured by MAP, P@5, P@10 and GMAP]

• Robust to suboptimal translation
• Ideal translation gives huge gains over MBF

Do we need to go to another language?

Can we do as well using another collection in same language?

[Figure: MAP and GMAP for DE-MultiPRF, DE-SameLang, FR-MultiPRF and FR-SameLang]

How about using Thesaurus Based Expansion as well?

• Since no thesaurus is publicly available, we learn a probabilistic thesaurus as suggested by Xu et al. (where e is an English word)
• Using the model obtained by PRF + assisting collection in the same language, we expand using the thesaurus

Comparison of Thesaurus-based Expansion with Multi-PRF

[Figure: MAP and GMAP for DE-MultiPRF, DE-Same+Thes, FR-MultiPRF and FR-Same+Thes]

• Simply adding a thesaurus to get synonyms does not help
• Thus MultiPRF combines both benefits well.

Can languages other than English help?

(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual Relevance Feedback: One Language Can Help Another, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.)

http://www.cse.iitb.ac.in/~pb/papers/acl2010-multiprf.pdf

Performance Study of Assisting Languages

• Do the results hold for languages other than English?
• What are the characteristics of a good assisting language?
• Can any language be used to improve the PRF performance of another language?
• Can this be extended to multiple assisting languages?

Language Typology

Again use linear Model Interpolation

Use and give weightage to terms from the assisting languages:
• Query terms
• Own PRF terms
• PRF terms from translation

Experimental Setup

• European languages chosen
• Europarl corpora
• CLEF dataset
  – Six languages from different language families
  – French, Spanish (Romance)
  – German, English, Dutch (West Germanic)
  – Finnish (Baltic-Finnic)
  – On more than 600 topics
• Use Google Translate for Query Translation

MultiPRF with Non-English Assisting Languages

[Diagram: MultiPRF example for a German query with Spanish as the assisting language]
• Query (L1, German): Bronchial asthma
• Query translated into Spanish (L2): El asma bronquial
• Initial retrieval on the L1 and L2 indexes, top ‘k’ results from each
• θL1 (German feedback terms): chronisch (chronic), pet, athlet (athlete), erkrank (ill), gesund (healthy), tuberkulos (tuberculosis), patient, reis (rice), person
• θL2 (Spanish feedback terms): asma, bronquial, contamin, ozon, cient, enfermed, alerg, alergi, air
• θL1Multi (after translate & interpolate): asthma, allergi, krankheit (disease), allerg (allergenic), chronisch, hauterkrank (illness of skin), arzt (doctor), erkrank (ill)

[Diagram: MultiPRF example for a French query with Dutch as the assisting language]
• Query in French (L1): Ingénierie Génétique
• Query translated into Dutch (L2): Genetische Manipulatie
• Initial retrieval on the L1 and L2 indexes, top ‘k’ results from each
• θL1 (French feedback terms): développ (developed), évolu (evolved), product, produit (product), moléculair (molecular)
• θL2 (Dutch feedback terms): genetisch, manipulatie, exxon, dier (animal), visser (fisherman), gen
• θL1Multi (after translate & interpolate): génet, ingénier, manipul, animal, pêcheur (fisherman), développ (developed), gen

Results

Dependence on Monolingual Performance

Monolingual MAP

0.4495 0.4033 0.4153 0.4805 0.4356 0.3578

Rank 2 5 4 1 3 6

Back Translation Performance improves within the same family

More than one assisting language

• Tried parallel composition for two assisting languages

• Uniform interpolation weights used

• Exhaustively tried all 60 combinations

• Improvements reported over best performing PRF of L1 or L2
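One reading that yields exactly 60 combinations, assuming the six languages of the second experimental setup: each language in turn is the source, and two assisting languages are chosen from the remaining five (6 × C(5,2) = 60). The sketch below enumerates them. The language list comes from the setup slide; the enumeration itself is an assumption about how the count arises.

```python
from itertools import combinations

# The six languages listed in the second experimental setup.
LANGUAGES = ["French", "Spanish", "German", "English", "Dutch", "Finnish"]

def assisting_configurations(languages):
    """Yield (source, assisting_1, assisting_2) for every source and unordered assisting pair."""
    for source in languages:
        others = [lang for lang in languages if lang != source]
        for first, second in combinations(others, 2):
            yield source, first, second

configs = list(assisting_configurations(LANGUAGES))
print(len(configs))  # 60
```

With uniform interpolation weights the order of the two assisting languages does not matter, which is why unordered pairs are counted here.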

More Languages More Robust

Conclusions

• MultiPRF uses another language to improve robustness and performance
• Can be looked upon as a way of disambiguation
• Need closer study of MultiPRF from languages in the same family
• Need to evolve measures for capturing familial closeness

URLs

• For resources: www.cfilt.iitb.ac.in
• For publications: www.cse.iitb.ac.in/~pb

Thank you

Questions and comments?