Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
IntroductiontoInformationRetrieval
Introductionto
InformationRetrieval
BM25,BM25F,andUserBehaviorChrisManningandPanduNayak
IntroductiontoInformationRetrieval
IntroductiontoInformationRetrieval
Summary– BIM[Robertson&Spärck-Jones1976]§ Boilsdownto
where
§ Withconstantpi =0.5,simplifiestoIDFweighting:
RSV = log Nnixi=qi=1
∑
RSV BIM = ciBIM ;
xi=qi=1∑ ci
BIM = log pi (1− ri )(1− pi )ri
document relevant(R=1) notrelevant(R=0)
termpresent xi =1 pi ritermabsent xi =0 (1– pi) (1 – ri)
Logoddsratio
IntroductiontoInformationRetrieval
GraphicalmodelforBIM– BernoulliNB
i ∈ q
Binaryvariablesxi = (tfi ≠ 0)
IntroductiontoInformationRetrieval
AkeylimitationoftheBIM§ BIM– likemuchoforiginalIR– wasdesignedfortitlesorabstracts,andnotformodernfulltextsearch
§ Wewanttopayattentiontotermfrequencyanddocumentlengths,justlikeinothermodelswediscuss
§ Want
§ Wantsomemodelofhowoftentermsoccurindocs
ci = logptf r0p0rtf
IntroductiontoInformationRetrieval
1.OkapiBM25 [Robertsonetal.1994,TRECCityU.]
§ BM25“BestMatch25”(theyhadabunchoftries!)§ DevelopedinthecontextoftheOkapisystem§ StartedtobeincreasinglyadoptedbyotherteamsduringtheTRECcompetitions
§ Itworkswell
§ Goal:besensitivetotermfrequencyanddocumentlengthwhilenotaddingtoomanyparameters§ (RobertsonandZaragoza2009;Spärck Jonesetal.2000)
IntroductiontoInformationRetrieval
§ Wordsaredrawnindependentlyfromthevocabularyusingamultinomialdistribution
Generativemodelfordocuments
... the draft is that each team is given a position in the draft …
basic
team each
thatof
is
the draft
designnfl
football
given
…
annual draftfootball
team
nfl
IntroductiontoInformationRetrieval
§ Distributionoftermfrequencies(tf)followsabinomialdistribution– approximatedbyaPoisson
Generativemodelfordocuments
... the draft is that each team is given a position in the draft …
draft
…
…
IntroductiontoInformationRetrieval
Poissondistribution§ ThePoissondistributionmodelstheprobabilityofk,thenumberofeventsoccurringinafixedintervaloftime/space,withknownaveragerateλ (=cf/T),independentofthelastevent
§ Examples§ Numberofcarsarrivingatthetollboothperminute§ Numberoftyposonapage
p(k) = λk
k!e−λ
IntroductiontoInformationRetrieval
Poissondistribution§ IfTislargeandpissmall,wecanapproximateabinomialdistributionwithaPoissonwhereλ =Tp
§ Mean=Variance=λ =Tp.§ Examplep= 0.08,T=20.Chanceof1occurrenceis:
§ Binomial
§ Poisson…alreadyclose
p(k) = λk
k!e−λ
P(1) = [(20)(.08)]1
1!e−(20)(.08) = 1.6
1e−1.6 = 0.3230
P(1) = 201
!
"#
$
%&(.08)1(.92)19 = .3282
IntroductiontoInformationRetrieval
Poissonmodel§ Assumethattermfrequenciesinadocument(tfi)followaPoissondistribution§ “Fixedinterval”impliesfixeddocumentlength…thinkroughlyconstant-sizeddocumentabstracts§…willfixlater
IntroductiontoInformationRetrieval
Poissondistributions
IntroductiontoInformationRetrieval
(One)PoissonModel§ Isareasonablefitfor“general”words§ Isapoorfitfortopic-specificwords
§ gethigherp(k)thanpredictedtoooften
Documents containingk occurrencesofword(λ =53/650)
Freq Word 0 1 2 3 4 5 6 7 8 9 10 11 12
53 expected 599 49 2
52 based 600 48 2
53 conditions 604 39 7
55 cathexis 619 22 3 2 1 2 0 1
51 comic 642 3 0 1 0 0 0 0 0 0 1 1 2
Harter, “A Probabilistic Approach to Automatic Keyword Indexing”, JASIST, 1975
IntroductiontoInformationRetrieval
Eliteness(“aboutness”)§ Modeltermfrequenciesusingeliteness§ Whatiseliteness?
§ Hiddenvariableforeachdocument-termpair,denotedasEi fortermi
§ Representsaboutness:atermiseliteinadocumentif,insomesense,thedocumentisabouttheconceptdenotedbytheterm
§ Elitenessisbinary§ Termoccurrencesdependonlyoneliteness…§ …butelitenessdependsonrelevance
IntroductiontoInformationRetrieval
ElitetermsTextfromtheWikipediapageontheNFLdraftshowingeliteterms
The National Football League Draft is an annual event in which the National Football League (NFL) teams select eligible college football players. It serves as the league’s most common source of player recruitment. The basic design of the draft is that each team is given a position in the draft order in reverse order relative to its record …
IntroductiontoInformationRetrieval
Graphicalmodelwitheliteness
i ∈ q
Frequencies(notbinary)
Binaryvariables
IntroductiontoInformationRetrieval
RetrievalStatusValue§ SimilartotheBIMderivation,wehave
where
andusingeliteness,wehave:
RSV elite = cielite
i∈q,tfi>0∑ (tfi );
p(TFi = tfi R) = p(TFi = tfi Ei = elite)p(Ei = elite R)+p(TFi = tfi Ei = elite)(1− p(Ei = elite R))
cielite(tfi ) = log
p(TFi = tfi R =1)p(TFi = 0 R = 0)p(TFi = 0 R =1)p(TFi = tfi R = 0)
IntroductiontoInformationRetrieval
2-Poissonmodel§ Theproblemswiththe1-PoissonmodelsuggestsfittingtwoPoissondistributions
§ Inthe“2-Poissonmodel”,thedistributionisdifferentdependingonwhetherthetermiseliteornot
§ whereπisprobabilitythatdocumentiseliteforterm§ but,unfortunately,wedon’tknowπ,λ,μ
p(TFi = ki R) =
IntroductiontoInformationRetrieval
Let’sgetanidea:Graphingfordifferentparametervaluesofthe2-Poisson
cielite(tfi )
IntroductiontoInformationRetrieval
Qualitativeproperties
§
§ increasesmonotonicallywithtfi
§ …butasymptoticallyapproachesamaximumvalueas[nottrueforsimplescalingoftf]
§ … withtheasymptoticlimitbeing
cielite(0) = 0
cielite(tfi )
ciBIM
Weightofelitenessfeature
tfi →∞
IntroductiontoInformationRetrieval
Approximatingthesaturationfunction§ Estimatingparametersforthe2-Poissonmodelisnoteasy
§ …Soapproximateitwithasimpleparametriccurvethathasthesamequalitativeproperties
tfk1 + tf
IntroductiontoInformationRetrieval
Saturationfunction
§ Forhighvaluesofk1,incrementsintfi continuetocontributesignificantlytothescore
§ Contributionstailoffquicklyforlowvaluesofk1
IntroductiontoInformationRetrieval
“Early”versionsofBM25§ Version1:usingthesaturationfunction
§ Version2:BIMsimplificationtoIDF
§ (k1+1) factordoesn’tchangeranking,butmakestermscore1 whentfi = 1
§ Similartotf-idf,buttermscoresarebounded
ciBM 25v1(tfi ) = ci
BIM tfik1 + tfi
ciBM 25v2 (tfi ) = log
Ndfi
×(k1 +1)tfik1 + tfi
IntroductiontoInformationRetrieval
Documentlengthnormalization§ Longerdocumentsarelikelytohavelargertfi values
§ Whymightdocumentsbelonger?§ Verbosity:suggestsobservedtfi toohigh§ Largerscope:suggestsobservedtfi mayberight
§ Arealdocumentcollectionprobablyhasbotheffects§ …soshouldapplysomekindofpartialnormalization
IntroductiontoInformationRetrieval
Documentlengthnormalization§ Documentlength:
§ avdl:Averagedocumentlengthovercollection§ Lengthnormalizationcomponent
§ b = 1 fulldocumentlengthnormalization§ b = 0 nodocumentlengthnormalization
dl = tfii∈V∑
B = (1− b)+ b dlavdl
"
#$
%
&', 0 ≤ b ≤1
IntroductiontoInformationRetrieval
Documentlengthnormalization
IntroductiontoInformationRetrieval
OkapiBM25§ Normalizetf usingdocumentlength
§ BM25rankingfunction
t !fi =tfiB
ciBM 25(tfi ) = log
Ndfi
×(k1 +1)t "fik1 + t "fi
= log Ndfi
×(k1 +1)tfi
k1((1− b)+ bdlavdl
)+ tfi
RSV BM 25 = ciBM 25
i∈q∑ (tfi );
IntroductiontoInformationRetrieval
OkapiBM25
§ k1 controlstermfrequencyscaling§ k1 = 0 isbinarymodel;k1 largeisrawtermfrequency
§ b controlsdocumentlengthnormalization§ b = 0 isnolengthnormalization;b = 1 isrelativefrequency(fullyscalebydocumentlength)
§ Typically,k1 issetaround1.2–2andb around0.75§ IIRsec.11.4.3discussesincorporatingquerytermweightingand(pseudo)relevancefeedback
RSV BM 25 = log Ndfii∈q
∑ ⋅(k1 +1)tfi
k1((1− b)+ bdlavdl
)+ tfi
IntroductiontoInformationRetrieval
WhyisBM25betterthanVSMtf-idf?§ Supposeyourqueryis[machinelearning]§ Supposeyouhave2documentswithtermcounts:
§ doc1:learning1024;machine1§ doc2:learning16;machine8
§ tf-idf:log2 tf *log2 (N/df)§ doc1:11*7+1*10 = 87§ doc2:5*7+4*10=75
§ BM25:k1 =2§ doc1:7*3+10*1=31§ doc2:7*2.67+10*2.4=42.7
IntroductiontoInformationRetrieval
2.Rankingwithfeatures§ Textualfeatures
§ Zones:Title,author,abstract,body,anchors,…§ Proximity§ …
§ Non-textualfeatures§ Filetype§ Fileage§ Pagerank§ …
IntroductiontoInformationRetrieval
Rankingwithzones§ Straightforwardidea:
§ Applyyourfavoriterankingfunction(BM25)toeachzoneseparately
§ Combinezonescoresusingaweightedlinearcombination
§ Butthatseemstoimplythattheelitenesspropertiesofdifferentzonesaredifferentandindependentofeachother§ …whichseemsunreasonable
IntroductiontoInformationRetrieval
Rankingwithzones§ Alternateidea
§ Assumeelitenessisaterm/documentpropertysharedacrosszones
§ …buttherelationshipbetweenelitenessandtermfrequenciesarezone-dependent§ e.g.,denseruseofelitetopicwordsintitle
§ Consequence§ Firstcombineevidenceacrosszonesforeachterm§ Thencombineevidenceacrossterms
IntroductiontoInformationRetrieval
BM25Fwithzones§ Calculateaweightedvariantoftotaltermfrequency§ …andaweightedvariantofdocumentlength
wherevz iszoneweighttfzi istermfrequencyinzonezlenz islengthofzonezZ isthenumberofzones
tfi = vztfziz=1
Z
∑ dl = vzlenzz=1
Z
∑ avdl = Averageacrossalldocuments
dl
IntroductiontoInformationRetrieval
SimpleBM25Fwithzones
§ Simpleinterpretation:zonez is“replicated”vz times
§ Butwemaywantzone-specificparameters(k1, b,IDF)
RSV SimpleBM 25F = log Ndfii∈q
∑ ⋅(k1 +1)tfi
k1((1− b)+ bdlavdl
)+ tfi
IntroductiontoInformationRetrieval
BM25F§ Empirically,zone-specificlengthnormalization(i.e.,zone-specificb)hasbeenfoundtobeuseful
tfi = vztfziBzz=1
Z
∑
Bz = (1− bz )+ bzlenzavlenz
"
#$
%
&', 0 ≤ bz ≤1
RSV BM 25F = log Ndfii∈q
∑ ⋅(k1 +1)tfik1 + tfi
See Robertson and Zaragoza (2009: 364)
IntroductiontoInformationRetrieval
Rankingwithnon-textualfeatures§ Assumptions
§ Usualindependenceassumption§ Independentofeachotherandofthetextualfeatures§ AllowsustofactoroutinBIM-stylederivation
§ Relevanceinformationisqueryindependent§ Usuallytrueforfeatureslikepagerank,age,type,…§ Allowsustokeepallnon-textualfeaturesintheBIM-stylederivationwherewedropnon-queryterms
p(Fj = f j R =1)p(Fj = f j R = 0)
IntroductiontoInformationRetrieval
Rankingwithnon-textualfeatures
where
andisanartificiallyaddedfreeparametertoaccountforrescalings intheapproximations§ CaremustbetakeninselectingVj dependingonFj.E.g.
§ Explainswhyworkswell
RSV = cii∈q∑ (tfi )+ λ jVj ( f j )
j=1
F
∑
Vj ( f j ) = logp(Fj = f j R =1)p(Fj = f j R = 0)
λ j
log( !λ j + f j )f j!λ j + f j
1!λ j + exp(− f j !!λ j )
RSV BM 25 + log(pagerank)
IntroductiontoInformationRetrieval
UserBehavior§ Search Results for “CIKM” (in 2010!)
38
# of clicks received
Taken with slight adaptation from Fan Guo and Chao Liu’s 2009/2010 CIKM tutorial: Statistical Models for Web Search: Click Log Analysis
IntroductiontoInformationRetrieval
UserBehavior§ Adapt ranking to user clicks?
39
# of clicks received
IntroductiontoInformationRetrieval
UserBehavior§ Tools needed for non-trivial cases
40
# of clicks received
IntroductiontoInformationRetrieval
Websearchclicklog
An example
41
IntroductiontoInformationRetrieval
WebSearchClickLog§ How large is the click log?
§ search logs: 10+ TB/day
§ In existing publications:§ [Silverstein+99]: 285M sessions§ [Craswell+08]: 108k sessions§ [Dupret+08] : 4.5M sessions (21 subsets * 216k sessions)§ [Guo +09a] : 8.8M sessions from 110k unique queries§ [Guo+09b]: 8.8M sessions from 110k unique queries§ [Chapelle+09]: 58M sessions from 682k unique queries§ [Liu+09a]: 0.26PB data from 103M unique queries
42
IntroductiontoInformationRetrieval
InterpretClicks:anExample
§ Clicks are good…§ Are these two clicks
equally “good”?
§ Non-clicks may have excuses:§ Not relevant§ Not examined
43
IntroductiontoInformationRetrieval
Eye-trackingUserStudy
44
IntroductiontoInformationRetrieval
§ Higher positions receive more user attention (eye fixation) and clicks than lower positions.
§ This is true even in the extreme setting where the order of positions is reversed.
§ “Clicks are informative but biased”.
45
[Joachims+07]
ClickPosition-bias
NormalPosition
Percen
tage
ReversedImpression
Percen
tage
IntroductiontoInformationRetrieval
Userbehavior§ Userbehaviorisanintriguingsourceofrelevancedata
§ Usersmake(somewhat)informedchoiceswhentheyinteractwithsearchengines
§ Potentiallyalotofdataavailableinsearchlogs
§ Buttherearesignificantcaveats§ Userbehaviordatacanbeverynoisy§ Interpretinguserbehaviorcanbetricky§ Spamcanbeasignificantproblem§ Notallquerieswillhaveuserbehavior
IntroductiontoInformationRetrieval
FeaturesbasedonuserbehaviorFrom[Agichtein,Brill,Dumais 2006;Joachims 2002]§ Click-throughfeatures
§ Clickfrequency,clickprobability,clickdeviation§ Clickonnextresult?previous result?above?below>?
§ Browsingfeatures§ Cumulativeandaveragetimeonpage,ondomain,onURLprefix;deviationfromaveragetimes
§ Browsepathfeatures
§ Query-textfeatures§ Queryoverlapwithtitle,snippet,URL,domain,nextquery§ Querylength
IntroductiontoInformationRetrieval
Incorporatinguserbehaviorintorankingalgorithm§ IncorporateuserbehaviorfeaturesintoarankingfunctionlikeBM25F§ ButrequiresanunderstandingofuserbehaviorfeaturessothatappropriateVj functionsareused
§ Incorporateuserbehaviorfeaturesintolearnedrankingfunction
§ Eitherofthesewaysofincorporatinguserbehaviorsignalsimproveranking
IntroductiontoInformationRetrieval
Resources§ S.E.RobertsonandH.Zaragoza.2009.TheProbabilistic
RelevanceFramework:BM25andBeyond.FoundationsandTrendsinInformationRetrieval 3(4):333-389.
§ K.Spärck Jones,S.Walker,andS.E.Robertson.2000.Aprobabilisticmodelofinformationretrieval:Developmentandcomparativeexperiments.Part1.InformationProcessingandManagement779–808.
§ T.Joachims.OptimizingSearchEnginesusingClickthroughData.2002.SIGKDD.
§ E.Agichtein,E.Brill,S.Dumais.2006.ImprovingWebSearchRankingByIncorporatingUserBehaviorInformation.2006.SIGIR.