Introduction to Information Retrievalweb.stanford.edu/class/cs276/handouts/lecture12-bm25etc.pdf · Introduction to Information Retrieval Elite terms Text from the Wikipedia page

IntroductiontoInformationRetrieval

Introductionto

InformationRetrieval

BM25,BM25F,andUserBehaviorChrisManningandPanduNayak



Summary– BIM[Robertson&Spärck-Jones1976]§ Boilsdownto

where

§ Withconstantpi =0.5,simplifiestoIDFweighting:

RSV = log Nnixi=qi=1

∑

RSV BIM = ciBIM ;

xi=qi=1∑ ci

BIM = log pi (1− ri )(1− pi )ri

document relevant(R=1) notrelevant(R=0)

termpresent xi =1 pi ritermabsent xi =0 (1– pi) (1 – ri)

Logoddsratio


GraphicalmodelforBIM– BernoulliNB

i ∈ q

Binaryvariablesxi = (tfi ≠ 0)


AkeylimitationoftheBIM§ BIM– likemuchoforiginalIR– wasdesignedfortitlesorabstracts,andnotformodernfulltextsearch

§ Wewanttopayattentiontotermfrequencyanddocumentlengths,justlikeinothermodelswediscuss

§ Want

§ Wantsomemodelofhowoftentermsoccurindocs

ci = logptf r0p0rtf


1.OkapiBM25 [Robertsonetal.1994,TRECCityU.]

§ BM25“BestMatch25”(theyhadabunchoftries!)§ DevelopedinthecontextoftheOkapisystem§ StartedtobeincreasinglyadoptedbyotherteamsduringtheTRECcompetitions

§ Itworkswell

§ Goal:besensitivetotermfrequencyanddocumentlengthwhilenotaddingtoomanyparameters§ (RobertsonandZaragoza2009;Spärck Jonesetal.2000)


§ Wordsaredrawnindependentlyfromthevocabularyusingamultinomialdistribution

Generativemodelfordocuments

... the draft is that each team is given a position in the draft …

basic

team each

thatof

is

the draft

designnfl

football

given

…

annual draftfootball

team

nfl


§ Distributionoftermfrequencies(tf)followsabinomialdistribution– approximatedbyaPoisson

Generativemodelfordocuments

... the draft is that each team is given a position in the draft …

draft

…

…


Poissondistribution§ ThePoissondistributionmodelstheprobabilityofk,thenumberofeventsoccurringinafixedintervaloftime/space,withknownaveragerateλ (=cf/T),independentofthelastevent

§ Examples§ Numberofcarsarrivingatthetollboothperminute§ Numberoftyposonapage

p(k) = λk

k!e−λ


Poissondistribution§ IfTislargeandpissmall,wecanapproximateabinomialdistributionwithaPoissonwhereλ =Tp

§ Mean=Variance=λ =Tp.§ Examplep= 0.08,T=20.Chanceof1occurrenceis:

§ Binomial

§ Poisson…alreadyclose

p(k) = λk

k!e−λ

P(1) = [(20)(.08)]1

1!e−(20)(.08) = 1.6

1e−1.6 = 0.3230

P(1) = 201

!

"#

$

%&(.08)1(.92)19 = .3282


Poissonmodel§ Assumethattermfrequenciesinadocument(tfi)followaPoissondistribution§ “Fixedinterval”impliesfixeddocumentlength…thinkroughlyconstant-sizeddocumentabstracts§…willfixlater


Poissondistributions


(One)PoissonModel§ Isareasonablefitfor“general”words§ Isapoorfitfortopic-specificwords

§ gethigherp(k)thanpredictedtoooften

Documents containingk occurrencesofword(λ =53/650)

Freq Word 0 1 2 3 4 5 6 7 8 9 10 11 12

53 expected 599 49 2

52 based 600 48 2

53 conditions 604 39 7

55 cathexis 619 22 3 2 1 2 0 1

51 comic 642 3 0 1 0 0 0 0 0 0 1 1 2

Harter, “A Probabilistic Approach to Automatic Keyword Indexing”, JASIST, 1975


Eliteness(“aboutness”)§ Modeltermfrequenciesusingeliteness§ Whatiseliteness?

§ Hiddenvariableforeachdocument-termpair,denotedasEi fortermi

§ Representsaboutness:atermiseliteinadocumentif,insomesense,thedocumentisabouttheconceptdenotedbytheterm

§ Elitenessisbinary§ Termoccurrencesdependonlyoneliteness…§ …butelitenessdependsonrelevance


ElitetermsTextfromtheWikipediapageontheNFLdraftshowingeliteterms

The National Football League Draft is an annual event in which the National Football League (NFL) teams select eligible college football players. It serves as the league’s most common source of player recruitment. The basic design of the draft is that each team is given a position in the draft order in reverse order relative to its record …


Graphicalmodelwitheliteness

i ∈ q

Frequencies(notbinary)

Binaryvariables


RetrievalStatusValue§ SimilartotheBIMderivation,wehave

where

andusingeliteness,wehave:

RSV elite = cielite

i∈q,tfi>0∑ (tfi );

p(TFi = tfi R) = p(TFi = tfi Ei = elite)p(Ei = elite R)+p(TFi = tfi Ei = elite)(1− p(Ei = elite R))

cielite(tfi ) = log

p(TFi = tfi R =1)p(TFi = 0 R = 0)p(TFi = 0 R =1)p(TFi = tfi R = 0)


2-Poissonmodel§ Theproblemswiththe1-PoissonmodelsuggestsfittingtwoPoissondistributions

§ Inthe“2-Poissonmodel”,thedistributionisdifferentdependingonwhetherthetermiseliteornot

§ whereπisprobabilitythatdocumentiseliteforterm§ but,unfortunately,wedon’tknowπ,λ,μ

p(TFi = ki R) =


Let’sgetanidea:Graphingfordifferentparametervaluesofthe2-Poisson

cielite(tfi )


Qualitativeproperties

§

§ increasesmonotonicallywithtfi

§ …butasymptoticallyapproachesamaximumvalueas[nottrueforsimplescalingoftf]

§ … withtheasymptoticlimitbeing

cielite(0) = 0

cielite(tfi )

ciBIM

Weightofelitenessfeature

tfi →∞


Approximatingthesaturationfunction§ Estimatingparametersforthe2-Poissonmodelisnoteasy

§ …Soapproximateitwithasimpleparametriccurvethathasthesamequalitativeproperties

tfk1 + tf


Saturationfunction

§ Forhighvaluesofk1,incrementsintfi continuetocontributesignificantlytothescore

§ Contributionstailoffquicklyforlowvaluesofk1


“Early”versionsofBM25§ Version1:usingthesaturationfunction

§ Version2:BIMsimplificationtoIDF

§ (k1+1) factordoesn’tchangeranking,butmakestermscore1 whentfi = 1

§ Similartotf-idf,buttermscoresarebounded

ciBM 25v1(tfi ) = ci

BIM tfik1 + tfi

ciBM 25v2 (tfi ) = log

Ndfi

×(k1 +1)tfik1 + tfi


Documentlengthnormalization§ Longerdocumentsarelikelytohavelargertfi values

§ Whymightdocumentsbelonger?§ Verbosity:suggestsobservedtfi toohigh§ Largerscope:suggestsobservedtfi mayberight

§ Arealdocumentcollectionprobablyhasbotheffects§ …soshouldapplysomekindofpartialnormalization


Documentlengthnormalization§ Documentlength:

§ avdl:Averagedocumentlengthovercollection§ Lengthnormalizationcomponent

§ b = 1 fulldocumentlengthnormalization§ b = 0 nodocumentlengthnormalization

dl = tfii∈V∑

B = (1− b)+ b dlavdl

"

#$

%

&', 0 ≤ b ≤1


Documentlengthnormalization


OkapiBM25§ Normalizetf usingdocumentlength

§ BM25rankingfunction

t !fi =tfiB

ciBM 25(tfi ) = log

Ndfi

×(k1 +1)t "fik1 + t "fi

= log Ndfi

×(k1 +1)tfi

k1((1− b)+ bdlavdl

)+ tfi

RSV BM 25 = ciBM 25

i∈q∑ (tfi );


OkapiBM25

§ k1 controlstermfrequencyscaling§ k1 = 0 isbinarymodel;k1 largeisrawtermfrequency

§ b controlsdocumentlengthnormalization§ b = 0 isnolengthnormalization;b = 1 isrelativefrequency(fullyscalebydocumentlength)

§ Typically,k1 issetaround1.2–2andb around0.75§ IIRsec.11.4.3discussesincorporatingquerytermweightingand(pseudo)relevancefeedback

RSV BM 25 = log Ndfii∈q

∑ ⋅(k1 +1)tfi


)+ tfi


WhyisBM25betterthanVSMtf-idf?§ Supposeyourqueryis[machinelearning]§ Supposeyouhave2documentswithtermcounts:

§ doc1:learning1024;machine1§ doc2:learning16;machine8

§ tf-idf:log2 tf *log2 (N/df)§ doc1:11*7+1*10 = 87§ doc2:5*7+4*10=75

§ BM25:k1 =2§ doc1:7*3+10*1=31§ doc2:7*2.67+10*2.4=42.7


2.Rankingwithfeatures§ Textualfeatures

§ Zones:Title,author,abstract,body,anchors,…§ Proximity§ …

§ Non-textualfeatures§ Filetype§ Fileage§ Pagerank§ …


Rankingwithzones§ Straightforwardidea:

§ Applyyourfavoriterankingfunction(BM25)toeachzoneseparately

§ Combinezonescoresusingaweightedlinearcombination

§ Butthatseemstoimplythattheelitenesspropertiesofdifferentzonesaredifferentandindependentofeachother§ …whichseemsunreasonable


Rankingwithzones§ Alternateidea

§ Assumeelitenessisaterm/documentpropertysharedacrosszones

§ …buttherelationshipbetweenelitenessandtermfrequenciesarezone-dependent§ e.g.,denseruseofelitetopicwordsintitle

§ Consequence§ Firstcombineevidenceacrosszonesforeachterm§ Thencombineevidenceacrossterms


BM25Fwithzones§ Calculateaweightedvariantoftotaltermfrequency§ …andaweightedvariantofdocumentlength

wherevz iszoneweighttfzi istermfrequencyinzonezlenz islengthofzonezZ isthenumberofzones

tfi = vztfziz=1

Z

∑ dl = vzlenzz=1

Z

∑ avdl = Averageacrossalldocuments

dl


SimpleBM25Fwithzones

§ Simpleinterpretation:zonez is“replicated”vz times

§ Butwemaywantzone-specificparameters(k1, b,IDF)

RSV SimpleBM 25F = log Ndfii∈q

∑ ⋅(k1 +1)tfi


)+ tfi


BM25F§ Empirically,zone-specificlengthnormalization(i.e.,zone-specificb)hasbeenfoundtobeuseful

tfi = vztfziBzz=1

Z

∑

Bz = (1− bz )+ bzlenzavlenz

"

#$

%

&', 0 ≤ bz ≤1

RSV BM 25F = log Ndfii∈q

∑ ⋅(k1 +1)tfik1 + tfi

See Robertson and Zaragoza (2009: 364)


Rankingwithnon-textualfeatures§ Assumptions

§ Usualindependenceassumption§ Independentofeachotherandofthetextualfeatures§ AllowsustofactoroutinBIM-stylederivation

§ Relevanceinformationisqueryindependent§ Usuallytrueforfeatureslikepagerank,age,type,…§ Allowsustokeepallnon-textualfeaturesintheBIM-stylederivationwherewedropnon-queryterms

p(Fj = f j R =1)p(Fj = f j R = 0)


Rankingwithnon-textualfeatures

where

andisanartificiallyaddedfreeparametertoaccountforrescalings intheapproximations§ CaremustbetakeninselectingVj dependingonFj.E.g.

§ Explainswhyworkswell

RSV = cii∈q∑ (tfi )+ λ jVj ( f j )

j=1

F

∑

Vj ( f j ) = logp(Fj = f j R =1)p(Fj = f j R = 0)

λ j

log( !λ j + f j )f j!λ j + f j

1!λ j + exp(− f j !!λ j )

RSV BM 25 + log(pagerank)


UserBehavior§ Search Results for “CIKM” (in 2010!)

38

# of clicks received

Taken with slight adaptation from Fan Guo and Chao Liu’s 2009/2010 CIKM tutorial: Statistical Models for Web Search: Click Log Analysis


UserBehavior§ Adapt ranking to user clicks?

39



UserBehavior§ Tools needed for non-trivial cases

40



Websearchclicklog

An example

41


WebSearchClickLog§ How large is the click log?

§ search logs: 10+ TB/day

§ In existing publications:§ [Silverstein+99]: 285M sessions§ [Craswell+08]: 108k sessions§ [Dupret+08] : 4.5M sessions (21 subsets * 216k sessions)§ [Guo +09a] : 8.8M sessions from 110k unique queries§ [Guo+09b]: 8.8M sessions from 110k unique queries§ [Chapelle+09]: 58M sessions from 682k unique queries§ [Liu+09a]: 0.26PB data from 103M unique queries

42


InterpretClicks:anExample

§ Clicks are good…§ Are these two clicks

equally “good”?

§ Non-clicks may have excuses:§ Not relevant§ Not examined

43


Eye-trackingUserStudy

44


§ Higher positions receive more user attention (eye fixation) and clicks than lower positions.

§ This is true even in the extreme setting where the order of positions is reversed.

§ “Clicks are informative but biased”.

45

[Joachims+07]

ClickPosition-bias

NormalPosition

Percen

tage

ReversedImpression

Percen

tage


Userbehavior§ Userbehaviorisanintriguingsourceofrelevancedata

§ Usersmake(somewhat)informedchoiceswhentheyinteractwithsearchengines

§ Potentiallyalotofdataavailableinsearchlogs

§ Buttherearesignificantcaveats§ Userbehaviordatacanbeverynoisy§ Interpretinguserbehaviorcanbetricky§ Spamcanbeasignificantproblem§ Notallquerieswillhaveuserbehavior


FeaturesbasedonuserbehaviorFrom[Agichtein,Brill,Dumais 2006;Joachims 2002]§ Click-throughfeatures

§ Clickfrequency,clickprobability,clickdeviation§ Clickonnextresult?previous result?above?below>?

§ Browsingfeatures§ Cumulativeandaveragetimeonpage,ondomain,onURLprefix;deviationfromaveragetimes

§ Browsepathfeatures

§ Query-textfeatures§ Queryoverlapwithtitle,snippet,URL,domain,nextquery§ Querylength


Incorporatinguserbehaviorintorankingalgorithm§ IncorporateuserbehaviorfeaturesintoarankingfunctionlikeBM25F§ ButrequiresanunderstandingofuserbehaviorfeaturessothatappropriateVj functionsareused

§ Incorporateuserbehaviorfeaturesintolearnedrankingfunction

§ Eitherofthesewaysofincorporatinguserbehaviorsignalsimproveranking


Resources§ S.E.RobertsonandH.Zaragoza.2009.TheProbabilistic

RelevanceFramework:BM25andBeyond.FoundationsandTrendsinInformationRetrieval 3(4):333-389.

§ K.Spärck Jones,S.Walker,andS.E.Robertson.2000.Aprobabilisticmodelofinformationretrieval:Developmentandcomparativeexperiments.Part1.InformationProcessingandManagement779–808.

§ T.Joachims.OptimizingSearchEnginesusingClickthroughData.2002.SIGKDD.

§ E.Agichtein,E.Brill,S.Dumais.2006.ImprovingWebSearchRankingByIncorporatingUserBehaviorInformation.2006.SIGIR.