
Information Retrieval: Ranking

Fernando Diaz
Yahoo! Labs
February 7, 2011

Outline

Introduction to Information Retrieval
Vector Space Model
PageRank
Ranking in Practice

Introduction to Information Retrieval

Information Retrieval

given a query and a corpus, find relevant documents.

• query: user's expression of the information need
• corpus: the repository of retrievable items
• relevance: satisfaction of the information need


Examples of Information Retrieval Problems: Web Search

given a keyword and a web crawl, find relevant URLs.

Examples of Information Retrieval Problems: Image Search

given a keyword and an image database, find relevant images.

Examples of Information Retrieval Problems: Question Answering

given a question and available text, rules, and logic, find an answer.

Examples of Information Retrieval Problems: Job Search

given a resume and job advertisements, find relevant jobs.

Examples of Information Retrieval Problems: Applicant Search

given an advertisement and resumes, find good candidates.

History

• 1950s: early information retrieval work on problem definition and metrics.
• 1960s: Gerard Salton begins work on SMART; the Cranfield evaluation method is defined.
• 1970s: an information retrieval research community develops (SIGIR); many fundamental concepts are proposed (e.g. cluster-based retrieval, pseudo-relevance feedback).
• 1980s: development of the first commercial information retrieval systems.
• 1990s: the TREC conferences begin, standardizing evaluation; web search engines are developed, using many fundamental IR techniques.

Text REtrieval Conference (TREC)

• Started in 1992 as a forum to compare IR systems using standard test collections, ensuring reproducibility.
• Initially focused on ad hoc retrieval (keyword search); the scope has since broadened to include multi-lingual retrieval, legal retrieval, and question answering.
• Allowed for accelerated comparison and testing of algorithmic changes across systems.
• Resulted in similar forums in Europe (CLEF), Asia (NTCIR), and India (FIRE).

Page 15: Information Retrieval: Rankingcis.poly.edu/cs6093/lecture03.pdf · Information Retrieval given aqueryand acorpus,findrelevant documents. • query:user’s expression of the information

IR = DB

DB IRdata structured semi-structuredfields clearsemantics freetextqueries structured freetextmatching exact impreciseranking none important

basedonatablebyJamesAllan

12 / 64

Fundamental Problems in Information Retrieval Research

• Effectiveness: how well does the system satisfy the user's information need?
  • algorithms
  • interaction
  • evaluation
• Efficiency: how efficiently does the system satisfy the user's information need?
  • indexing architectures
  • fast score computation
  • evaluation

Effectiveness: algorithms

• term importance: which words are important when ranking a document (e.g. frequent vs. discriminative words)?
• stemming: how to collapse words which are morphologically equivalent (e.g. bicycles → bicycle)
• query expansion: how to collapse words which are semantically equivalent (e.g. bike → bicycle)
• document structure: do matches in different parts of the document matter (e.g. title vs. body match)?
• personalization: can we exploit user information to improve ranking?

Effectiveness: interaction

• relevance feedback: ask a user which documents are relevant
• disambiguation: how to ask a user which words are important

Effectiveness: evaluation

• relevance: how to define a good document
• metrics: how to measure if the ranking is good
• comparison: how to compare two systems

Efficiency: indexing architectures

• parsing: how should a document be split into a set of terms?
• indexing: which words should be kept?
• weighting: what information needs to be stored with terms?
• compression: how to compress the index size?

Efficiency: fast score computation

• inverted indices: fast retrieval and scoring of short queries (see the sketch below).
• tiering: can we tier retrieval and ranking to improve performance?
• branch and bound: how to efficiently prevent scoring unnecessary documents?
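As a concrete illustration of the first bullet, a minimal inverted-index sketch (term → postings list of (document id, term frequency)), so that a short query only touches documents that actually contain its terms; the scoring here is a plain term-frequency sum, a deliberately simplified stand-in for the weighted models discussed later:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: {doc_id: list of terms}; returns {term: [(doc_id, term_frequency), ...]}
        index = defaultdict(list)
        for doc_id, terms in docs.items():
            counts = {}
            for t in terms:
                counts[t] = counts.get(t, 0) + 1
            for t, tf in counts.items():
                index[t].append((doc_id, tf))
        return index

    def score(query_terms, index):
        # accumulate a simple term-frequency score over only the matching postings
        scores = defaultdict(int)
        for t in query_terms:
            for doc_id, tf in index.get(t, []):
                scores[doc_id] += tf
        return sorted(scores.items(), key=lambda kv: -kv[1])

    idx = build_inverted_index({"d1": ["cat", "lazy", "cat"], "d2": ["dog", "lazy"]})
    print(score(["cat", "lazy"], idx))   # [('d1', 3), ('d2', 1)]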

Modeling

• Information retrieval often involves formally modeling the retrieval process in order to optimize performance.
• Modeling tasks:
  • abstractly represent the documents
  • abstractly represent the queries
  • model the relationship between query and document representations

Modeling: Boolean Retrieval Model

• Represent each document as an unweighted bag of words.
• Represent the query as an unweighted bag of words.
• Retrieve an unordered set of documents containing the query words.

Modeling: Simple Ranked Retrieval Model

• Represent each document as a weighted bag of words (based on document frequency).
• Represent the query as an unweighted bag of words.
• Retrieve a ranking of documents containing the query words.

Modeling

• Much of the history of information retrieval effectiveness research involves developing new models or extending existing models.
• As modeling becomes more complicated, mathematics and statistics become necessary.
• New models are still being developed.

Vector Space Model

Vector Space Model

• Represent each document as a vector in a very high dimensional space.
• Represent each query as a vector in the same high dimensional space.
• Rank documents according to some vector similarity measure.

Vector Space Model: Fundamentals

• vector components: what does each dimension represent?
• text abstraction: how to embed a query or document into this space?
• vector similarity: how to compare documents in this space?


Vector Components

• Documents and queries should be represented using linearly independent basis vectors.
• Orthogonal 'concepts' such as topics or genres: ideal but difficult to define.
• Controlled keyword vocabulary: flexible and compact but difficult to maintain; may not be linearly independent.
• Free text vocabulary: easy to generate but grows quickly; definitely not linearly independent.
• In most cases, when someone refers to the vector space model, each component of a vector represents the presence of a unique term in the entire corpus.

Vector Components

[Figure: a document space spanned by three orthogonal term axes, T1, T2, and T3.]

Text Abstraction

(excerpt from Communications of the ACM, Vol. 18, No. 11, November 1975, p. 614)

…simultaneously would appear well separated in the document space. Such a situation is depicted in Figure 2, where the distance between two x's representing two documents is inversely related to the similarity between the corresponding index vectors.

While the document configuration of Figure 2 may indeed represent the best possible situation, assuming that relevant and nonrelevant items with respect to the various queries are separable as shown, no practical way exists for actually producing such a space, because during the indexing process, it is difficult to anticipate what relevance assessments the user population will provide over the course of time. That is, the optimum configuration is difficult to generate in the absence of a priori knowledge of the complete retrieval history for the given collection.

In these circumstances, one might conjecture that the next best thing is to achieve a maximum possible separation between the individual documents in the space, as shown in the example of Figure 3. Specifically, for a collection of n documents, one would want to minimize the function

    F = Σᵢ Σⱼ s(Dᵢ, Dⱼ),   i, j = 1, …, n,   (1)

where s(Dᵢ, Dⱼ) is the similarity between documents i and j. Obviously when the function of eq. (1) is minimized, the average similarity between document pairs is smallest, thus guaranteeing that each given document may be retrieved when located sufficiently close to a user query without also necessarily retrieving its neighbors. This insures a high precision search output, since a given relevant item is then retrievable without also retrieving a number of nonrelevant items in its vicinity. In cases where several different relevant items for a given query are located in the same general area of the space, it may then also be possible to retrieve many of the relevant items while rejecting most of the nonrelevant. This produces both high recall and high precision.³

Two questions then arise: first, is it in fact the case that a separated document space leads to a good retrieval performance, and vice-versa that improved retrieval performance implies a wider separation of the documents in the space; second, is there a practical way of measuring the space separation. In practice, the expression of eq. (1) is difficult to compute, since the number of vector comparisons is proportional to n² for a collection of n documents.

For this reason, a clustered document space is best considered, where the documents are grouped into classes, each class being represented by a class centroid.⁴

³ In practice, the best performance is achieved by obtaining for each user a desired recall level (a specified proportion of the relevant items); at that recall level, one then wants to maximize precision by retrieving as few of the nonrelevant items as possible.

⁴ A number of well-known clustering methods exist for automatically generating a clustered collection from the term vectors representing the individual documents [1].

[Fig. 1. Vector representation of document space: documents D1 = (T1, T2, T3), D2 = (T1', T2', T3'), and D3 = (T1'', T2'', T3'') plotted against the term axes.]

[Fig. 2. Ideal document space: groups of relevant items form well-separated clusters of individual documents.]

[Fig. 3. Space with maximum separation between document pairs: individual documents spread apart uniformly.]

Text Abstraction: Documents

1. Parse the document into a vector of normalized strings.
   • words are split on whitespace, removing punctuation (e.g. 'Cats are lazy.' → ['Cats', 'are', 'lazy'])
   • words are down-cased (e.g. → ['cats', 'are', 'lazy'])
   • stop words are removed (e.g. → ['cats', 'lazy'])
   • words are stemmed (e.g. → ['cat', 'lazy'])
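A minimal sketch of this normalization pipeline; the tiny stop-word list and the suffix-stripping "stemmer" below are illustrative stand-ins, not the components used by any particular system:

    import re

    STOPWORDS = {"are", "the", "a", "is"}   # toy stop-word list

    def stem(word):
        # crude suffix stripping as a stand-in for a real stemmer (e.g. Porter)
        for suffix in ("ies", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def normalize(text):
        tokens = re.findall(r"[a-zA-Z]+", text)             # split, dropping punctuation
        tokens = [t.lower() for t in tokens]                # down-case
        tokens = [t for t in tokens if t not in STOPWORDS]  # remove stop words
        return [stem(t) for t in tokens]                    # stem

    print(normalize("Cats are lazy."))   # ['cat', 'lazy']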

Text Abstraction: Documents

2. Weight each word present in the document.
   • Count the frequency of each word in the document (e.g. → [⟨'cat', 25⟩, ⟨'lazy', 25⟩, ⟨'kitten', 10⟩, …]).
   • Re-weight according to each word's discriminatory power; this is very dependent on the corpus!
     • corpus about lazy animals: → [⟨'cat', 189.70⟩, ⟨'lazy', 4.50⟩, ⟨'kitten', 120.31⟩, …]
     • corpus about cats: → [⟨'cat', 0.12⟩, ⟨'lazy', 53.45⟩, ⟨'kitten', 5.43⟩, …]
   • Alternatively, binary weights can be used (e.g. → [⟨'cat', 1⟩, ⟨'lazy', 1⟩, ⟨'kitten', 1⟩, ⟨'dog', 0⟩, …]).
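A sketch of the counting step and the two weighting variants, operating on an already-normalized token list; the discrimination factors here are made-up placeholders, and the corpus-dependent factor actually used in practice (IDF) is defined on the next slide:

    from collections import Counter

    def term_frequencies(tokens):
        # raw counts, e.g. Counter({'cat': 25, 'lazy': 25, 'kitten': 10})
        return Counter(tokens)

    def binary_weights(tokens, vocabulary):
        present = set(tokens)
        return {term: int(term in present) for term in vocabulary}

    def reweighted(tokens, discrimination):
        # discrimination: {term: corpus-dependent factor}, e.g. IDF (next slide)
        return {term: count * discrimination.get(term, 0.0)
                for term, count in term_frequencies(tokens).items()}

    doc = ["cat", "lazy", "cat", "kitten"]
    print(term_frequencies(doc))
    print(binary_weights(doc, ["cat", "lazy", "kitten", "dog"]))
    print(reweighted(doc, {"cat": 0.12, "lazy": 53.45, "kitten": 5.43}))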

Term Discrimination Weights

• Terms which appear in all or very many documents in the corpus may not be useful for indexing or retrieval.
• Stop words or 'linguistic glue' can be safely detected using corpus-independent lists.
• Other frequent terms may be domain-specific and are detected using corpus statistics.
• Inverse document frequency (IDF) summarizes this. For a corpus of n documents, the IDF of term k is defined as

      IDFk = 1 + log2(n / dk),

  where dk is the document frequency of k.
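A small sketch of this computation over a toy corpus of already-normalized documents; document frequency counts each term at most once per document:

    import math
    from collections import Counter

    def inverse_document_frequencies(documents):
        # documents: list of token lists; returns {term: 1 + log2(n / d_k)}
        n = len(documents)
        df = Counter()
        for tokens in documents:
            df.update(set(tokens))          # each term counted once per document
        return {term: 1.0 + math.log2(n / d) for term, d in df.items()}

    corpus = [["cat", "lazy"], ["kitten", "cat"], ["dog", "bark"]]
    print(inverse_document_frequencies(corpus))
    # 'cat' appears in 2 of 3 documents: IDF = 1 + log2(3/2) ≈ 1.58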

Inverse Document Frequency

[Plot: IDFk as a function of document frequency dfk, shown for corpus sizes n = 100, n = 1000, and n = 10000.]

Text Abstraction: Queries

• Queries are processed using the exact same process.
• Because queries are treated as 'short documents', we can support keyword queries or document-length query-by-example style queries.
• This can be easily generalized to non-English languages.

Vector Similarity

  similarity            binary                              weighted
  inner product         |X ∩ Y|                             Σᵢ xᵢyᵢ
  Dice coefficient      2|X ∩ Y| / (|X| + |Y|)              2 Σᵢ xᵢyᵢ / (Σᵢ xᵢ + Σᵢ yᵢ)
  Cosine coefficient    |X ∩ Y| / (√|X| · √|Y|)             Σᵢ xᵢyᵢ / (√(Σᵢ xᵢ²) · √(Σᵢ yᵢ²))
  Jaccard coefficient   |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)     Σᵢ xᵢyᵢ / (Σᵢ xᵢ² + Σᵢ yᵢ² − Σᵢ xᵢyᵢ)

based on a table by James Allan
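A sketch of the weighted versions of these measures for sparse term-weight dictionaries (one plausible representation; any vector type would do):

    import math

    def inner(x, y):
        return sum(x[t] * y[t] for t in x.keys() & y.keys())

    def dice(x, y):
        return 2 * inner(x, y) / (sum(x.values()) + sum(y.values()))

    def cosine(x, y):
        norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
        return inner(x, y) / norm

    def jaccard(x, y):
        sq = sum(v * v for v in x.values()) + sum(v * v for v in y.values())
        return inner(x, y) / (sq - inner(x, y))

    q = {"cat": 1.0}
    d = {"cat": 0.23, "kitten": 0.12, "dog": 0.48}
    print(inner(q, d), cosine(q, d), jaccard(q, d))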

Vector Space Model: Example

vocabulary
    v = ['fish'  'turtle'  'cat'  'snake'  'kitten'  'dog']ᵀ

vector for document i
    dᵢ = [0  0  .23  0  .12  .48]ᵀ

vector for query
    q = [0  0  1  0  0  0]ᵀ

Vector Space Model: Example

vocabulary
    v = ['fish'  'turtle'  'cat'  'snake'  'kitten'  'dog']ᵀ

score for document i
    ⟨q, dᵢ⟩ = [0  0  1  0  0  0] · [0  0  .23  0  .12  .48]ᵀ = .23

Vector Space Model: Efficiency

• Precompute length-normalized document vectors, so that

      Σᵢ xᵢyᵢ / (√(Σᵢ xᵢ²) · √(Σᵢ yᵢ²)) = Σᵢ (xᵢ / √(Σⱼ xⱼ²)) (yᵢ / √(Σⱼ yⱼ²)) = Σᵢ x̂ᵢ ŷᵢ,

  i.e. the cosine becomes a plain inner product of the normalized vectors x̂ and ŷ.

• Branch and bound to avoid scoring unnecessary documents.
• Locality sensitive hashing can be used to do very fast approximate search [Indyk and Motwani 1998].
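A sketch of the precomputation, assuming the same sparse-dictionary representation as above: every document vector is normalized once at index time, so cosine scoring at query time reduces to an inner product.

    import math

    def unit(x):
        # length-normalize a sparse term-weight vector
        norm = math.sqrt(sum(v * v for v in x.values()))
        return {t: v / norm for t, v in x.items()}

    raw_index = {
        "d1": {"cat": 0.23, "kitten": 0.12, "dog": 0.48},
        "d2": {"cat": 0.80, "lazy": 0.20},
    }
    index = {doc_id: unit(vec) for doc_id, vec in raw_index.items()}

    q = unit({"cat": 1.0, "lazy": 1.0})
    scores = {doc_id: sum(d.get(t, 0.0) * w for t, w in q.items())
              for doc_id, d in index.items()}
    print(sorted(scores.items(), key=lambda kv: -kv[1]))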

Latent Semantic Indexing [Deerwester et al. 1990]

• The vector space model suffers when there is a query term mismatch (e.g. 'bike' instead of 'bicycle').
• This is a by-product of not having independent basis vectors.
• Latent semantic indexing (LSI) attempts to resolve issues with correlated basis vectors:
  1. Use singular value decomposition (SVD) to find orthogonal basis vectors in the corpus.
  2. Project documents (and queries) into the lower dimensional 'concept space'.
  3. Index and retrieve lower dimensional documents as with the classic vector space model.
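A minimal LSI sketch with NumPy on a made-up term-document matrix; the number of concepts k and the standard query fold-in q_k = S_k⁻¹ Uₖᵀ q are the usual choices, not anything prescribed by the slide:

    import numpy as np

    # toy term-document matrix: rows are terms, columns are documents
    terms = ["bike", "bicycle", "cat"]
    A = np.array([
        [2.0, 0.0, 1.0, 0.0],   # 'bike'
        [1.0, 2.0, 0.0, 0.0],   # 'bicycle'
        [0.0, 0.0, 1.0, 2.0],   # 'cat'
    ])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                   # number of latent 'concepts'
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    docs_k = np.diag(sk) @ Vtk              # documents in the k-dimensional concept space

    q = np.array([1.0, 0.0, 0.0])           # query containing only 'bike'
    q_k = np.diag(1.0 / sk) @ Uk.T @ q      # fold the query into the same space

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # scores in concept space; term-mismatched documents can now receive non-zero scores
    print([round(cosine(q_k, docs_k[:, j]), 3) for j in range(docs_k.shape[1])])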

Latent Semantic Indexing [Deerwester et al. 1990]

• Similar to clustering the documents in the corpus (e.g. k-means, PLSI, LDA).
• In practice, it is very difficult to determine the appropriate number of lower dimensions (i.e. concepts).
• Information retrieval needs to support retrieval at all granularities; clustering commits the system to a single granularity.

Summary

• The vector space model is a straightforward, easy-to-implement retrieval model.
• Its principles underlie many modern commercial retrieval systems.
• However, there is more to ranking than just term matches…

PageRank

q  = [⟨cat, 0.50⟩, ⟨lazy, 0.50⟩, ⟨kitten, 0.00⟩, …]

dᵢ = [⟨cat, 0.93⟩, ⟨lazy, 0.82⟩, ⟨kitten, 0.10⟩, …]
dⱼ = [⟨cat, 0.80⟩, ⟨lazy, 0.20⟩, ⟨kitten, 0.05⟩, …]

⟨q, dᵢ⟩ = 0.99
⟨q, dⱼ⟩ = 0.856

What if dᵢ is a poorly written, undesirable document and dⱼ is a document from an encyclopedia?

Beyond Bag of Words

• Document content, especially when represented as an unordered bag of words, has limited expressiveness.
  • does not capture phrases
  • does not capture metadata
  • does not capture quality
• Oftentimes we are interested in how a document is consumed.
  • do people like me think this is relevant?
  • does this document receive a lot of buzz?
  • is this document authoritative?

The Value of Credible Information

• There is a lot of junk on the web (e.g. spam, irrelevant forums).
• Knowing what users are reading is a valuable source for knowing what is not junk.
• Ideally, we would be able to monitor everything the user is reading and use that information for ranking; this is achieved through toolbars, browsers, operating systems, DNS.
• In 1998, no search companies had browsing data. How did they address this lack of data?

Random Surfer Model [Brin and Page 1998]

• Simulate a very large number of users browsing the entire web.
• Let users browse randomly. This is a naïve assumption but works okay in practice.
• Observe how often pages get visited.
• The authoritativeness of a page is a function of its popularity in the simulation.

[Figure: an example web graph of six pages, numbered 0–5, connected by directed links.]

Link Matrix

    W =
        0  1  0  1  1  0
        0  0  1  0  1  1
        0  0  0  0  0  1
        0  0  0  0  1  0
        0  0  0  0  0  1
        0  0  0  0  1  0

(Row i lists the out-links of page i.)

Transition Matrix

    T =
        0    1/3  0    1/3  1/3  0
        0    0    1/3  0    1/3  1/3
        0    0    0    0    0    1
        0    0    0    0    1    0
        0    0    0    0    0    1
        0    0    0    0    1    0

(Each row of W is normalized by the page's out-degree.)

Google Matrix

    G = λ Tᵀ + (1 − λ) ·
        1/6  1/6  1/6  1/6  1/6  1/6
        1/6  1/6  1/6  1/6  1/6  1/6
        1/6  1/6  1/6  1/6  1/6  1/6
        1/6  1/6  1/6  1/6  1/6  1/6
        1/6  1/6  1/6  1/6  1/6  1/6
        1/6  1/6  1/6  1/6  1/6  1/6

where Tᵀ is the transpose of the transition matrix from the previous slide.
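A sketch of this construction in NumPy; the damping value λ = 0.85 is a conventional choice rather than anything specified on the slide, and the transpose of the row-stochastic T makes each column of G sum to one, matching the matrix-vector products on the next slides:

    import numpy as np

    W = np.array([  # link matrix from the example graph
        [0, 1, 0, 1, 1, 0],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0],
    ], dtype=float)

    T = W / W.sum(axis=1, keepdims=True)   # normalize each row by out-degree
    lam, n = 0.85, W.shape[0]              # λ = 0.85 is a conventional damping value
    G = lam * T.T + (1 - lam) * np.full((n, n), 1.0 / n)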

Random Surfer Model [Brin and Page 1998]

• The matrix G defines a transition matrix over the web graph.
• In order to run the simulation, we take the matrix-vector product

      G × [1 1 1 1 1 1]ᵀ

• The result is a distribution over graph nodes representing where users would have gone after a single simulation step.

Random Surfer Model [Brin and Page 1998]

• We can run the simulation for an arbitrary number of steps, t, by taking powers of the matrix:

      Gᵗ × [1 1 1 1 1 1]ᵀ

• The result of this simulation is the PageRank score for every document in the graph.
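Continuing the sketch above, the corresponding power iteration; iterating until the scores stop changing is a common implementation choice (the slide's fixed number of steps t works just as well), and starting from a uniform distribution instead of the all-ones vector only rescales the result:

    v = np.full(n, 1.0 / n)                 # uniform starting distribution
    for _ in range(100):
        nxt = G @ v                         # one simulation step
        if np.abs(nxt - v).sum() < 1e-10:   # stop once the scores have converged
            break
        v = nxt

    pagerank = v
    print(sorted(enumerate(pagerank), key=lambda kv: -kv[1]))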

(excerpt from Brin and Page 1998)

…idea of how a change in the ranking function affects the search results.

5 Results and Performance

The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.

All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.

[Figure 4. Sample Results from Google for the query "bill clinton": several whitehouse.gov pages, an email address recovered from anchor text, and a handful of other Bill Clinton pages, each shown with its PageRank percentage, date, and size.]

PageRank Extensions

• Build a non-random surfer [Meiss et al. 2010].
• Personalized PageRank [Haveliwala 2003].
• PageRank-directed crawling [Cho and Schonfeld 2007].
• PageRank without links [Kurland and Lee 2005].

PageRank Issues

• Needs a web graph stored in an efficient data structure.
• PageRank requires taking powers of a very, very large matrix.
• The PageRank score is an approximation of visitation.

Summary

• At the time, PageRank provided a nice surrogate for real user data.
• Nowadays, search engines have access to toolbar data, click logs, GPS data, IM, email, social networks, ….
• Nonetheless, the random surfer model is important since the size of the web is much larger than even these data.

Ranking in Practice

Ranking in Practice

• The vector space model measures text-based similarity.
• PageRank is claimed to be an important ranking feature; how does it compare?

Retrieval Metrics (more on this in April)

• Retrieval metrics attempt to quantify the quality of a ranked list.
• It is usually assumed that the top of the ranked list is more important than the bottom.
• Metrics used in this study:
  • Normalized Discounted Cumulative Gain (NDCG): measures the amount of relevant information in the top of the ranking; the importance of a rank position drops quickly as you scroll down the list.
  • Reciprocal Rank (RR): measures the rank of the top relevant result; important for navigational queries.
  • Mean Average Precision (MAP): measures the amount of relevant information in the top of the ranking; decays slightly more slowly than NDCG.
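A sketch of reciprocal rank and NDCG over a ranked list of 0/1 relevance labels; binary gains with a log2 discount are one common convention, and graded gains (e.g. 2^rel − 1) are also widely used:

    import math

    def reciprocal_rank(rels):
        # rels: relevance labels in ranked order, e.g. [0, 1, 0, 1]
        for i, r in enumerate(rels, start=1):
            if r > 0:
                return 1.0 / i
        return 0.0

    def dcg(rels, k=10):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

    def ndcg(rels, k=10):
        ideal = dcg(sorted(rels, reverse=True), k)
        return dcg(rels, k) / ideal if ideal > 0 else 0.0

    print(reciprocal_rank([0, 1, 0, 1]))   # 0.5
    print(ndcg([0, 1, 0, 1], k=10))        # ≈ 0.65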

Ranking Signals

• PageRank: static (query-independent) page ranking algorithm.
• HITS: link-analysis ranking algorithm based on Kleinberg's model [Kleinberg 1998].
• BM25F: a classic text-based retrieval function similar to the vector space model [Robertson et al. 2004].

Experiment

• Corpus: 463 billion web pages.
• Relevance: 28k queries, with 17 relevance judgments each.
• Experiment: for each query, for each signal, rank documents and compute the metric; compare average performance.

Isolated Signals

[Figure 2: Effectiveness of different features, with one panel each for NDCG@10, MAP@10, and MRR@10. In every panel bm25f is the strongest isolated feature by a wide margin (NDCG@10 = .221); the incoming-link features (degree-in-*, hits-aut-*, pagerank) form the next group at roughly .09–.11; the outgoing-link features (hits-hub-*, degree-out-*) trail at roughly .03–.04; random is worst (.011).]

…seek times of modern hard disks are too slow to retrieve the links within the time constraints, and the graph does not fit into the main memory of a single machine, even when using the most aggressive compression techniques.

In order to experiment with HITS and other query-dependent link-based ranking algorithms that require non-regular accesses to arbitrary nodes and edges in the web graph, we implemented a system called the Scalable Hyperlink Store, or SHS for short. SHS is a special-purpose database, distributed over an arbitrary number of machines, that keeps a highly compressed version of the web graph in memory and allows very fast lookup of nodes and edges. On our hardware, it takes an average of 2 microseconds to map a URL to a 64-bit integer handle called a UID, 15 microseconds to look up all incoming or outgoing link UIDs associated with a page UID, and 5 microseconds to map a UID back to a URL (the last functionality not being required by HITS). The RPC overhead is about 100 microseconds, but the SHS API allows many lookups to be batched into a single RPC request.

We implemented the HITS algorithm using the SHS infrastructure. We compiled three SHS databases, one containing all 17.6 billion links in our web graph (all), one containing only links between pages that are on different hosts (ih, for "inter-host"), and one containing only links between pages that are on different domains (id). We consider two URLs to belong to different hosts if the host portions of the URLs differ (in other words, we make no attempt to determine whether two distinct symbolic host names refer to the same computer), and we consider a domain to be the name purchased from a registrar (for example, we consider news.bbc.co.uk and www.bbc.co.uk to be different hosts belonging to the same domain). Using each of these databases, we computed HITS authority and hub scores for various parameterizations of the sampling operator S, sampling between 1 and 100 back-links of each page in the root set. Result URLs that were not covered by our web graph automatically received authority and hub scores of 0, since they were not connected to any other nodes in the neighborhood graph and therefore did not receive any endorsements.

We performed forty-five different HITS computations, each combining one of the three link selection predicates (all, ih, and id) with a sampling value. For each combination, we loaded one of the three databases into an SHS system running on six machines (each equipped with 16 GB of RAM), and computed HITS authority and hub scores, one query at a time. The longest-running combination (using the all database and sampling 100 back-links of each root set vertex) required 30,456 seconds to process the entire query set, or about 1.1 seconds per query on average.

7. EXPERIMENTAL RESULTS

For a given query Q, we need to rank the set of documents satisfying Q (the "result set" of Q). Our hypothesis is that good features should be able to rank relevant documents in this set higher than non-relevant ones, and this should result in an increase in each performance measure over the query set. We are specifically interested in evaluating the usefulness of HITS and other link-based features. In principle, we could do this by sorting the documents in each result set by their feature value, and compare the resulting NDCGs. We call this ranking with isolated features.

Let us first examine the relative performance of the different parameterizations of the HITS algorithm we examined. Recall that we computed HITS for each combination of three link selection schemes – all links (all), inter-host links only (ih), and inter-domain links only (id) – with back-link sampling values ranging from 1 to 100. Figure 1 shows the impact of the number of sampled back-links on the retrieval performance of HITS authority scores. Each graph is associated with one performance measure. The horizontal axis of each graph represents the number of sampled back-links, the vertical axis represents performance under the appropriate measure, and each curve depicts a link selection scheme. The id scheme slightly outperforms ih, and both vastly outperform the all scheme – eliminating nepotistic links pays off. The performance of the all scheme increases as more back-links of each root set vertex are sampled, while the performance of the id and ih schemes peaks at between 10 and 25 samples and then plateaus or even declines, depending on the performance measure.

Having compared different parameterizations of HITS, we will now fix the number of sampled back-links at 100 and compare the three link selection schemes against other isolated features: PageRank, in-degree and out-degree counting links of all pages, of different hosts only and of different domains only (all, ih and id datasets respectively), and a text retrieval algorithm exploiting anchor text: BM25F [24]. BM25F is a state-of-the-art ranking function solely based on textual content of the documents and their associated anchor texts. BM25F is a descendant of BM25 that combines the different textual fields of a document, namely title, body and anchor text. This model has been shown to be one of the best-performing web search scoring functions over the last few years [8, 24]. BM25F has a number of free parameters (2 per field, 6 in our case); we used the parameter values described in [24].


Signals Combined with BM25F

[Figure 3: Effectiveness measures for linear combinations of link-based features with BM25F, with one panel each for NDCG@10, MAP@10, and MRR@10. Combining BM25F with any incoming-link feature (degree-in-*, hits-aut-*, pagerank) performs best (NDCG@10 roughly .334–.341); combinations with outgoing-link features (degree-out-*, hits-hub-*) follow at roughly .310–.311; BM25F alone scores .231.]

Figure 2 shows the NDCG, MRR, and MAP measures of these features. Again all performance measures (and for all rank-thresholds we explored) agree. As expected, BM25F outperforms all link-based features by a large margin. The link-based features are divided into two groups, with a noticeable performance drop between the groups. The better-performing group consists of the features that are based on the number and/or quality of incoming links (in-degree, PageRank, and HITS authority scores); and the worse-performing group consists of the features that are based on the number and/or quality of outgoing links (out-degree and HITS hub scores). In the group of features based on incoming links, features that ignore nepotistic links perform better than their counterparts using all links. Moreover, using only inter-domain (id) links seems to be marginally better than using inter-host (ih) links.

The fact that features based on outgoing links underperform those based on incoming links matches our expectations; if anything, it is mildly surprising that outgoing links provide a useful signal for ranking at all. On the other hand, the fact that in-degree features outperform PageRank under all measures is quite surprising. A possible explanation is that link-spammers have been targeting the published PageRank algorithm for many years, and that this has led to anomalies in the web graph that affect PageRank, but not other link-based features that explore only a distance-1 neighborhood of the result set. Likewise, it is surprising that simple query-independent features such as in-degree, which might estimate global quality but cannot capture relevance to a query, would outperform query-dependent features such as HITS authority scores.

However, we cannot investigate the effect of these features in isolation, without regard to the overall ranking function, for several reasons. First, features based on the textual content of documents (as opposed to link-based features) are the best predictors of relevance. Second, link-based features can be strongly correlated with textual features for several reasons, mainly the correlation between in-degree and number of textual anchor matches.

Therefore, one must consider the effect of link-based features in combination with textual features. Otherwise, we may find a link-based feature that is very good in isolation but is strongly correlated with textual features and results in no overall improvement; and vice versa, we may find a link-based feature that is weak in isolation but significantly improves overall performance.

For this reason, we have studied the combination of the link-based features above with BM25F. All feature combinations were done by considering the linear combination of two features as a document score, using the formula

    score(d) = Σᵢ wᵢ Tᵢ(Fᵢ(d)),   i = 1, …, n,

where d is a document (or document-query pair, in the case of BM25F), Fᵢ(d) (for 1 ≤ i ≤ n) is a feature extracted from d, Tᵢ is a transform, and wᵢ is a free scalar weight that needs to be tuned. We chose transform functions that we empirically determined to be well-suited. Table 1 shows the chosen transform functions.

    Feature        Transform function
    bm25f          T(s) = s
    pagerank       T(s) = log(s + 3 · 10⁻¹²)
    degree-in-*    T(s) = log(s + 3 · 10⁻²)
    degree-out-*   T(s) = log(s + 3 · 10³)
    hits-aut-*     T(s) = log(s + 3 · 10⁻⁸)
    hits-hub-*     T(s) = log(s + 3 · 10⁻¹)

    Table 1: Near-optimal feature transform functions.

This type of linear combination is appropriate if we assume features to be independent with respect to relevance and an exponential model for link features, as discussed in [8]. We tuned the weights by selecting a random subset of 5,000 queries from the query set, used an iterative refinement process to find weights that maximized a given performance measure on that training set, and used the remaining 23,043 queries to measure the performance of the thus derived scoring functions.

We explored the pairwise combination of BM25F with every link-based scoring function. Figure 3 shows the NDCG, MRR, and MAP measures of these feature combinations, together with a baseline BM25F score (the right-most bar in each graph), which was computed using the same subset of 23,045 queries that were used as the test set for the feature combinations. Regardless of the performance measure applied, we can make the following general observations:

1. Combining any of the link-based features with BM25F results in a substantial performance improvement over BM25F in isolation.

2. The combination of BM25F with features based on incoming links (PageRank, in-degree, and HITS authority scores) performs substantially better than the combination with features based on outgoing links (HITS hub scores and out-degree).

3. The performance differences between the various combinations of BM25F with features based on incoming links is comparatively small, and the relative ordering of feature combinations is fairly stable across the dif-…
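A sketch of this pairwise linear combination; the transform constants follow Table 1, while the feature values and weights below are made-up placeholders, since the tuned weights are not given here:

    import math

    TRANSFORMS = {                      # from Table 1
        "bm25f":      lambda s: s,
        "pagerank":   lambda s: math.log(s + 3e-12),
        "degree-in":  lambda s: math.log(s + 3e-2),
        "degree-out": lambda s: math.log(s + 3e3),
        "hits-aut":   lambda s: math.log(s + 3e-8),
        "hits-hub":   lambda s: math.log(s + 3e-1),
    }

    def combined_score(features, weights):
        # features: {name: raw feature value F_i(d)}; weights: {name: w_i}
        return sum(weights[name] * TRANSFORMS[name](value)
                   for name, value in features.items())

    # illustrative only: placeholder feature values and weights
    doc = {"bm25f": 12.3, "pagerank": 2.4e-7}
    print(combined_score(doc, {"bm25f": 1.0, "pagerank": 0.5}))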

Summary

• Text-based ranking measures are necessary but not sufficient for high quality retrieval.
• Link-based ranking measures are important but subtle.
• It is extremely important to confirm intuition with experiments.