Multilingual Pseudo Relevance Feedback:
A way of Query Expansion and Disambiguation
Pushpak Bhattacharyya
Computer Science and Engineering Department
IIT Bombay
www.cse.iitb.ac.in/~pb
Work done with Manoj, Karthik, Arjun and many others
Classical Information Retrieval (Simplified)
Retrieval Model a.k.a
Ranking algorithm
query
relevant documents
40+ years of work in designing better models (late 1960’s to 2010)
• Vector space models
• Binary independence models
• Network models
• Logistic regression models
• Bayesian inference models
• Hyperlink retrieval models
document representation
(Courtesy: Dr. Sriram Raghvan, IBM India Research Lab)
The elusive user satisfaction
• Ranking
• Correctness of Query Processing: NER, Stemming, MWE
• Coverage: Crawling, Indexing
• Output Presentation: Snippet
How to improve ranking with more meaningful query models
• Set theoretic, Algebraic and Probabilistic Models
• Underlying current of attempts to capture “Query Meaning”
• Started with Karen Spärck Jones’ thesis titled “Synonymy and Semantic Classification” in Cambridge in the 1960s
• The effort continues
Another perspective: Multilinguality
• English still the most dominant language on the web
– Contributes 72% of the content
• Number of non-English users steadily rising all over the world
• English penetration in India
– Estimated to be around 3-4%
– Mostly the urban educated class
• Need to enable access to the above information through local languages
India’s CLIR project
• Enable access to information through local languages
• Query Languages: 9
– (Assamese, Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, Telugu, Tamil)
• Results in: Source Language + Hindi + English
• Domains: Tourism, Health
• http://www.clia.iitb.ac.in/sandhan
• Public release for select languages planned in Jan 2012
[Figure: CLIR system flow]
• Hindi Query: तिरुपति यात्रा (“Tirupati travel”)
• CLIR Engine, supported by Language Resources
• Target Language Index in English, built from Crawled and Indexed Web Pages (Target Information in English)
• Ranked List of Results
• Result Snippets in Hindi, for example (translated): “Rail options for reaching Tirupati: Many trains are available to reach the holy town of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express.”
Resource Constrained Languages
• Unlike English, many of the non-English languages are constrained on resources like
– Large crawl coverage
– Large annotated corpora
– Language resources like Stemmers, Morphological analyzers, Word de-compounders etc.
Main Message of the presentation
User Information Need
• Expressed as short query (average length 2.5 words)
• Need query expansion
• Lexical resources based expansion did not deliver (Voorhees 1994)
– Paradigmatic association (synonyms, antonyms, hyponyms and hypernyms)
– Introduces severe topic drift through unrelated senses of expansion terms
– Also through irrelevant senses of query terms
Illustration
Query: “Madrid bomb blast case”
• {case, container}: drifted topic due to an inapplicable sense of the query term “case”!!!
• {case, suit, lawsuit}: the intended sense
• {suit, apparel}: drifted topic due to an unrelated sense of the expansion term “suit”!!!
Query Expansion: Current Dominant Practice
• Syntagmatic Expansion
• Through Pseudo Relevance Feedback
• We show
– Multilingual PRF helps
– A language from the same family helps still more
– Result of insight from linguistics and NLP
– Disambiguation by leveraging multilinguality
Road map
1. Need for Pseudo Relevance Feedback (PRF)
2. Limitations of PRF
3. Need for MultiPRF
4. MultiPRF using English
5. MultiPRF using other languages
6. Conclusions and future directions
Language Modeling Approach to IR
• Offers a principled approach to IR
• Each document modeled as a probability distribution – Document Language Model
• User information need is modeled as a probability distribution – Query Model
Ranking: a matter of resolving divergence
Ranking Function – KL Divergence
$$\mathrm{Score}(D) = -KL(\Theta_R \,\|\, D) \;\overset{rank}{=}\; \sum_{w} P(w \mid \Theta_R)\, \log P(w \mid D)$$
• P(w|ΘR): Importance of term in Query
• P(w|D): Importance of term in Document
Problem of Retrieval ↔ Problem of Estimating P(w|ΘR) and P(w|D)
Query words: q1, q2, q3, q4, … qn
Document words: d1, d2, d3, d4, … dn
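The KL-divergence ranking function on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Dirichlet smoothing, the value of `mu`, the toy documents and the toy query model are all assumptions made for the example.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood P(w|C) over all documents (used only for smoothing)."""
    counts = Counter(w for tokens in docs.values() for w in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def document_model(tokens, p_collection, mu=10.0):
    """Return a function w -> P(w|D) with Dirichlet smoothing (mu is assumed)."""
    counts = Counter(tokens)
    length = len(tokens)

    def p(w):
        # Words unseen in the collection get a small floor so log() stays defined.
        return (counts[w] + mu * p_collection.get(w, 1e-9)) / (length + mu)

    return p


def kl_score(query_model, p_w_given_d):
    """Sum_w P(w|Theta_R) * log P(w|D): rank-equivalent to -KL(Theta_R || D)."""
    return sum(p_q * math.log(p_w_given_d(w))
               for w, p_q in query_model.items() if p_q > 0)


# Toy data (hypothetical).
docs = {
    "d1": "olive oil production mediterranean".split(),
    "d2": "salt pepper cup cooking".split(),
}
p_c = collection_model(docs)
query_model = {"olive": 0.5, "oil": 0.5}  # toy estimate of P(w|Theta_R)
scores = {name: kl_score(query_model, document_model(tokens, p_c))
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The query-entropy term of the KL divergence is the same for every document, which is why the sketch drops it and keeps only the cross term.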
The Challenge - Estimating Query Model ΘR
• Average length of query: 2.5 words
• Relevance Feedback to the rescue
– User marks some documents from initial ranked list as “relevant”
– Usually difficult to obtain
Pseudo-Relevance Feedback (PRF)
[Figure: PRF flow]
1. Query Q is run by the IR Engine over the Document Collection
2. Initial Results (Doc., Score): d1 2.4, d2 2.1, d3 1.8, d4 0.7, …, dm 0.01
3. Assume top ‘k’ as Relevant: d1 √, d2 √, d3 √, d4 √, dk √
4. Pseudo-Relevance Feedback (PRF): Learn Feedback Model from Documents
5. Updated Query Relevance Model
6. Rerank Corpus with Updated Query Relevance Model
7. Final Results (Doc., Score): d2 2.3, d1 2.2, d3 1.8, d5 0.6, …, dm 0.01
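As a rough illustration of the "assume top ‘k’ as relevant, learn a feedback model" step, the sketch below averages the term distributions of the top-k documents. This simple averaging is an assumption chosen for brevity; it is not the model-based feedback (MBF) estimator used as the baseline in the talk, and the stemmed toy documents are hypothetical.

```python
from collections import Counter


def feedback_model(ranked_docs, k):
    """Estimate P(w|feedback) by averaging the top-k documents' term distributions."""
    top = ranked_docs[:k]  # pseudo-relevance: the top k are assumed relevant
    model = Counter()
    for tokens in top:
        for w, c in Counter(tokens).items():
            model[w] += c / len(tokens) / len(top)
    return dict(model)


def expansion_terms(model, n):
    """Return the n highest-probability terms (ties broken alphabetically)."""
    return sorted(model, key=lambda w: (-model[w], w))[:n]


# Toy ranked list (hypothetical), already stemmed.
ranked = [
    "oliv oil cook salt".split(),
    "oliv oil pepper cup".split(),
    "oil produc mediterranean export".split(),
]
fb = feedback_model(ranked, k=2)
top_terms = expansion_terms(fb, 2)
```

Note that cooking terms such as "cook" and "salt" enter the feedback model as soon as the top-ranked documents are off-topic, which is the query drift discussed later in the deck.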
Impact of Coverage on PRF Performance
• If coverage is less, precision at higher ranks decreases
• Experimented on CLEF collections in French
– Same set of queries run on different collection sizes
Lexical and Semantic Non-Inclusion and Attempts to Solve It
Query: “Accession to European Union”
Stemmed Query: “access europe union”
Initial Retrieval Documents → Final Expanded Query: europe, union, access, nation, russia, presid, getti, year, state
Relevant documents with terms like “Membership”, “Member”, “Country” not ranked higher
Previous Attempts
• Voorhees et al. SIGIR ’94 used WordNet
– Negative results
• Random walk models
– Translation Models - Zhai et al. SIGIR ’01
– Collins-Thompson and Callan CIKM ’05
• Latent Concept Expansion
– Metzler et al. SIGIR ’07
Limitations of PRF: Lack of Robustness
Query: “Olive Oil Production in Mediterranean”
Stemmed Query: “oliv oil mediterranean”
Initial Retrieved Documents → Final Expanded Query: Oil, Oliv, Mediterranean, Produc, Cook, Salt, Pepper, Serv, Cup
Causes Query Drift: Documents about Cooking
Previous Attempts
• Refining top document set
• Refining initial terms obtained through PRF
• Selective query expansion
• TREC Robustness Track – improving robustness
Can both Semantic Non-inclusion and Lack of Robustness be solved?
• MultiPRF harnesses “Multilinguality”:
– Take help of a collection in a different language called “assisting language”
– Expectation of increased robustness, since searching in two collections
• An attractive proposition for languages that have poor monolingual performance due to
– Resource constraints like inadequate coverage
– Morphological complexity
Related Work
• Gao et al. (2009) use English to improve Chinese language ranking
– Demonstrate only on a subset of queries
– Experimentation on a small dataset
– Uses cross-document similarity
Multilingual PRF: System Flow
1. Query in L1 → Initial Retrieval on L1 Index → Top ‘k’ Results → Get Feedback Model in L1 (θL1)
2. Translate Query into L2 → Initial Retrieval on L2 Index → Top ‘k’ Results → Get Feedback Model in L2 (θL2)
3. Translate Feedback Model into L1 (θL1Trans)
4. Interpolate Models
5. Ranking using Final Model
We remember that a set of words is compared with another set of words
Reformulated Query words: q1, q2, q3, q4, … qn
– Original Query Words
– Own PRF Words
– PRF Words from Translation
Document words: d1, d2, d3, d4, … dn
English Lends a Helping Hand!
• English used as assisting language
– Good monolingual performance
– Ease of processing
• MultiPRF consistently and significantly outperforms monolingual PRF baseline
(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual PRF: English Lends a Helping Hand, SIGIR 2010, Geneva, Switzerland, July, 2010.)
Why English as the assisting language?
Has the qualities of a good assisting language:
• Resource-rich
– More than 70% of the web in English
• Morphological ease of processing
– No complex issues like word compounding etc.
• Good monolingual performance
– IR issues in English well-studied
Feedback Model Translation
• Feedback model estimated in L2 (Θ^F_L2) translated back into L1 (Θ^Trans_L1)
– Using probabilistic bi-lingual dictionary from L2 to L1
– Learnt from parallel sentence-aligned corpora
• Back Translation Step
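One natural reading of the translation step is to push each L2 term's probability mass through the dictionary: P(f | Θ^Trans_L1) = Σ_e P(f | e) · P(e | Θ^F_L2), where e is an L2 term and f an L1 term. The sketch below implements that reading. The renormalisation, the function name and all dictionary values are assumptions made for illustration; they are not taken from the slides.

```python
def translate_model(fb_l2, dictionary):
    """Map an L2 feedback model into L1 through a probabilistic dictionary.

    fb_l2:      {e: P(e | Theta^F_L2)}
    dictionary: {e: {f: P(f | e)}} with translation probabilities from L2 to L1
    """
    translated = {}
    for e, p_e in fb_l2.items():
        # Terms missing from the dictionary contribute nothing.
        for f, p_f_given_e in dictionary.get(e, {}).items():
            translated[f] = translated.get(f, 0.0) + p_f_given_e * p_e
    total = sum(translated.values())
    # Renormalise so the result is a distribution even if some mass was lost.
    return {f: p / total for f, p in translated.items()} if total else {}


# Hypothetical English feedback model and English-to-French dictionary.
fb_english = {"nation": 0.6, "oscar": 0.4}
en_fr = {
    "nation": {"nation": 0.5, "pays": 0.3, "état": 0.2},
    "oscar": {"oscar": 1.0},
}
fb_translated = translate_model(fb_english, en_fr)
```

Because one L2 term spreads its mass over several L1 alternatives, the translated model picks up related L1 terms that the L1 feedback documents may never have contained.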
Semantically Related Terms through Feedback Model Translation
[Figure: Feedback Model Translation Step]
• English-French Word Alignments: Nation → Nation, Country, State, UN, United
• German-English Word Alignments: Flugzeug → Aircraft, Plane, Aeroplane, Air, Flight
• Translation alternatives learnt through word-level alignments
• Back translation step acts as a rich source of morphological and semantic relations (Tiedemann 2001)
Linear Model Interpolation
• Original feedback model and translated model interpolated
• Final model also interpolated with query words to retain query focus
• ΘMulti used to finally re-rank documents from corpus based on KLD ranking function
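The two interpolation steps can be sketched as two linear mixtures: first the original and translated feedback models, then the result with the query model. The weights `beta` and `alpha` and their default values are hypothetical; the slides do not state the values used.

```python
def mix(a, b, weight_b):
    """Linear interpolation (1 - weight_b) * a + weight_b * b over the joint vocabulary."""
    vocab = set(a) | set(b)
    return {w: (1 - weight_b) * a.get(w, 0.0) + weight_b * b.get(w, 0.0)
            for w in vocab}


def multi_prf_model(query_model, fb_l1, fb_trans, beta=0.5, alpha=0.6):
    """Combine own and translated feedback, then keep query focus.

    beta:  weight of the translated feedback model (assumed value)
    alpha: weight of the combined feedback against the query (assumed value)
    """
    theta_multi = mix(fb_l1, fb_trans, beta)
    return mix(query_model, theta_multi, alpha)


# Toy models (hypothetical probabilities).
query = {"oscar": 1.0}
fb_l1 = {"italien": 0.5, "president": 0.5}
fb_trans = {"film": 0.5, "oscar": 0.5}
final_model = multi_prf_model(query, fb_l1, fb_trans)
```

Since each step is a convex combination, the final model remains a probability distribution whenever its inputs are, so it can be plugged directly into the KL-divergence ranking function.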
Experimental Setup
• European languages chosen since Europarl freely available
• English chosen as assisting language
• CLEF Standard Dataset for Evaluation
– Four widely differing source languages used
• French (Romance Family), German (West Germanic)
• Finnish (Baltic-Finnic), Hungarian (Uralic-Ugric)
– On more than 600 topics (only Title field)
• Use Google Translate for Query Translation
• Standard Evaluation Metrics (MAP, P@5, P@10, GMAP) used.
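For readers unfamiliar with the metrics, here is a small sketch of how P@k, average precision, MAP and GMAP are commonly computed. The epsilon floor in GMAP follows a common convention for handling zero scores and is an assumption here, as are the toy ranking and relevance judgments.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(aps):
    """Arithmetic mean of per-query average precision."""
    return sum(aps) / len(aps)


def gmap(aps, eps=1e-5):
    """Geometric mean of per-query AP; eps (assumed) keeps log() defined for AP = 0."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))


# Toy ranking and judgments (hypothetical).
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

GMAP penalises queries with very low AP much more strongly than MAP does, which is why it is often read as a robustness measure.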
Result: MultiPRF Gains over PRF
[Figure: four bar charts (MAP, P@5, P@10, GMAP) comparing MBF and MultiPRF on FR-01+02, DE-01+02, FI-02+03+04 and HU-05]
Significant Gains in All Collections
Increased Robustness
Example: French query with English as the assisting language
• Query in French: “Oscar honorifique pour des réalisateurs italiens”
• Query translated into English: “Honorary Oscar for Italian filmmakers”
• θL1 (feedback model from top ‘k’ results on the L1 index): italien, président (president), oscar, gouvern (governor), scalfaro, spadolin (molecular)
• θL2 (feedback model from top ‘k’ results on the L2 index): filmmak, film, movi, tobacco, placement, produc, stallon, studio, italian, oscar, honarari
• θL1Multi (after Translate & Interpolate): italien, oscar, film, realis, wild, cinem, honorif, president, honorair, cineast
MAP improves from 0.1238 to 0.4324!
Example: German query with English as the assisting language
• Query in German: “Ölunfälle und Vögel”
• Query translated into English: “Birds and Oil Spills”
• θL1 (feedback model from top ‘k’ results on the L1 index): rhein, ollunfall, fluss, ol, auen, erdreich, heizol, tank, lit, folg, oberrhein, teil
• θL2 (feedback model from top ‘k’ results on the L2 index): oil, spill, bird, pipelin, river, offici, fish, lake, cleanup, state, gallon
• θL1Multi (after Translate & Interpolate): olunfall, vogel, ol, olverschmutz (oil pollution), erdol (petroleum), olp (oil slick), rhein, mcgrath, olivenol, fluss, tier, vergoss, vogelart (bird species), olkatastroph, olpreis
MAP improves from 0.0128 to 0.1184!
Effect of Varying Query Translation
• Query Translation is a simpler task than MT.
• We used 3 different QT systems:
– Google Translate
– Naïve SMT System
– Almost “Ideal” Translation

Annotated Performance of Systems (on 3-point scale [0, 0.5, 1])
Corpus      Google Translate   Naïve SMT
FR-01+02    0.93               0.67
FR-03+05    0.88               0.77
DE-01+02    0.93               0.64
DE-03       0.81               0.58

[Figure: Performance of Different Systems on FR-01+02 (MAP, P@5, P@10, GMAP) for MBF, SMT, GT and Ideal]
• Robust to suboptimal translation
• Ideal translation gives huge gains over MBF
Do we need to go to another language?
Can we do as well using another collection in same language?
[Figure: MAP and GMAP of DE-MultiPRF, DE-SameLang, FR-MultiPRF and FR-SameLang]
How about using Thesaurus Based Expansion as well?
• Since no thesaurus is publicly available, we learn a probabilistic thesaurus as suggested by Xu et al.
( where e is an English word)
• Using model obtained by PRF + Assisting Collection in same language: , we expand using a thesaurus:
Comparison of Thesaurus-based Expansion with Multi-PRF
[Figure: MAP and GMAP of DE-MultiPRF, DE-Same+Thes, FR-MultiPRF and FR-Same+Thes]
• Simply adding a thesaurus to get synonyms does not help
• Thus MultiPRF combines both benefits well.
Can languages other than English help?
Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual Relevance Feedback: One Language Can Help Another, Conference of Association of Computational Linguistics (ACL 2010),
Uppsala, Sweden, July 2010.
Performance Study of Assisting Languages
• Do the results hold for languages other than English?
• What are the characteristics of a good assisting language?
• Can any language be used to improve the PRF performance of another language?
• Can this be extended to multiple assisting languages?
Language Typology
Again use linear Model Interpolation
Use terms from the assisting languages and give them weightage
• Query terms
• Own PRF terms
• PRF terms from Translation
Experimental Setup
• European languages chosen
• Europarl corpora
• CLEF dataset
– Six languages from different language families
– French, Spanish (Romance)
– German, English, Dutch (West Germanic)
– Finnish (Baltic-Finnic)
– On more than 600 topics
• Use Google Translate for Query Translation
MultiPRF with Non-English Assisting Languages
Example: German query with Spanish as the assisting language
• Query in German: “Bronchial asthma”
• Query translated into Spanish: “El asma bronquial”
• θL1 (feedback model from top ‘k’ results on the L1 index): chronisch (chronic), pet, athlet (athlete), ekrank (ill), gesund (healthy), tuberkulos (tuberculosis), patient, reis (rice), person
• θL2 (feedback model from top ‘k’ results on the L2 index): asma, bronquial, contamin, ozon, cient, enfermed, alerg, alergi, air
• θL1Multi (after Translate & Interpolate): asthma, allergi, krankheit (disease), allerg (allergenic), chronisch, hauterkrank (illness of skin), arzt (doctor), erkrank (ill)
MAP improves from 0.062 to 0.636!
Example: French query with Dutch as the assisting language
• Query in French: “Ingénierie Génétique”
• Query translated into Dutch: “Genetische Manipulatie”
• θL1 (feedback model from top ‘k’ results on the L1 index): développ (developed), évolu (evolved), product, produit (product), moléculair (molecular)
• θL2 (feedback model from top ‘k’ results on the L2 index): genetisch, manipulatie, exxon, dier (animal), visser (fisherman), gen
• θL1Multi (after Translate & Interpolate): génet, ingénier, manipul, animal, pêcheur (fisherman), développ (developed), gen
MAP improves from 0.145 to 0.357!
Results
Dependence on Monolingual Performance
Monolingual MAP
0.4495 0.4033 0.4153 0.4805 0.4356 0.3578
Rank 2 5 4 1 3 6
Back Translation Performance improves within the same family
More than one assisting language
• Tried parallel composition for two assisting languages
• Uniform interpolation weights used
• Exhaustively tried all 60 combinations
• Improvements reported over best performing PRF of L1 or L2
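The count of 60 is consistent with choosing one of the six languages as the source and an unordered pair of assisting languages from the remaining five (6 × 10). The sketch below enumerates those configurations and shows one possible reading of "parallel composition with uniform weights", namely an equal-weight mixture of the source feedback model and each translated model. Both readings are assumptions, as are the language codes and toy probabilities.

```python
from itertools import combinations

# The six CLEF languages named in the experimental setup (codes are illustrative).
LANGS = ["EN", "FR", "DE", "ES", "NL", "FI"]


def assisting_configurations(langs):
    """All (source, (assist_1, assist_2)) choices with two distinct assisting languages."""
    return [(src, pair)
            for src in langs
            for pair in combinations([l for l in langs if l != src], 2)]


def parallel_compose(fb_l1, translated_models):
    """Equal-weight mixture of the source model and every translated model (assumed)."""
    models = [fb_l1] + list(translated_models)
    weight = 1.0 / len(models)
    vocab = set().union(*models)
    return {w: sum(weight * m.get(w, 0.0) for m in models) for w in vocab}


configs = assisting_configurations(LANGS)

# Toy models (hypothetical probabilities), all expressed in the source language.
composed = parallel_compose(
    {"asthma": 1.0},
    [{"asthma": 0.5, "allergi": 0.5}, {"krankheit": 1.0}],
)
```

In a full system the composed model would still be interpolated with the query terms before re-ranking, as in the single-assisting-language case.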
More Languages More Robust
Conclusions
• MultiPRF uses another language to improve robustness and performance
• Can be looked upon as a way of disambiguation
• Need closer study of MultiPRF from languages in the same family
• Need to evolve measures for capturing familial closeness
URLs
• For resources: www.cfilt.iitb.ac.in
• For publications: www.cse.iitb.ac.in/~pb
Thank you
Questions and comments?