INFO 4300 / CS4300: Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 6: Ranking

Paul Ginsparg
Cornell University, Ithaca, NY
13 Sep 2011
Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/

Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep

Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11

Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452

Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment

Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email, use cs4300-l@lists.cs.cornell.edu

Course text (at http://informationretrieval.org/): Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

See also: Information Retrieval, S. Büttcher, C. Clarke, G. Cormack
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307
Administrativa

Reread assignment 1 instructions.

The Midterm Examination is on Thu, Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. The topics to be examined are all the lectures and discussion class readings before the midterm break.

According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final examination is Wed 14 Dec 7:00-9:30 pm (location TBD).
Discussion 2, 20 Sep

For this class, read and be prepared to discuss the following:

K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation 28, 11-21, 1972.
http://www.soi.city.ac.uk/~ser/idfpapers/ksj orig.pdf

Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972.
http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf

The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)
Overview
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Outline
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Query Scores: S(q, d) = \sum_{t \in q} w_t^{(idf)} \cdot w_{t,d}^{(tf)}   (ltn.lnn)

1. "A sentence is a document."
2. "A document is a sentence and a sentence is a document."
3. "This document is short."
4. "This document is a sentence."

tf_{t,d}   doc1  doc2  doc3  doc4        w_{t,d}^{(tf)}  doc1  doc2  doc3  doc4
a            2     4     0     1    →    a                1.3   1.6    0     1
and          0     1     0     0         and               0     1     0     0
document     1     2     1     1         document          1    1.3    1     1
is           1     2     1     1         is                1    1.3    1     1
sentence     1     2     0     1         sentence          1    1.3    0     1
short        0     0     1     0         short             0     0     1     0
this         0     0     1     1         this              0     0     1     1

           df   w_t^{(idf)}
a           3     .125
and         1     .6
document    4     0
is          4     0
sentence    2     .3
short       1     .6
this        2     .3

[log(4/4) = 0, log(4/3) ≈ .125, log(4/2) ≈ .3, log(4/1) ≈ .6]

Query: "a sentence"
doc1: .125 ∗ 1.3 + .3 ∗ 1 = .46, doc2: .125 ∗ 1.6 + .3 ∗ 1.3 = .59
doc3: .125 ∗ 0 + .3 ∗ 0 = 0, doc4: .125 ∗ 1 + .3 ∗ 1 = .425

Query: "short sentence"
doc1: .6 ∗ 0 + .3 ∗ 1 = .3, doc2: .6 ∗ 0 + .3 ∗ 1.3 = .39
doc3: .6 ∗ 1 + .3 ∗ 0 = .6, doc4: .6 ∗ 0 + .3 ∗ 1 = .3
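As a cross-check, the ltn.lnn scoring above can be sketched in Python. The idf values are taken directly from the slide's table, and base-10 logarithms are assumed for the tf weighting; the whitespace tokenization is a simplifying assumption:

```python
import math

# Toy collection from the slide (lowercased, punctuation stripped).
docs = {
    "doc1": "a sentence is a document".split(),
    "doc2": "a document is a sentence and a sentence is a document".split(),
    "doc3": "this document is short".split(),
    "doc4": "this document is a sentence".split(),
}

# idf weights as given in the slide's df table (log base 10, N = 4).
idf = {"a": .125, "and": .6, "document": 0, "is": 0,
       "sentence": .3, "short": .6, "this": .3}

def tf_weight(tf):
    """l: logarithmic term frequency, 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query, doc_id):
    """ltn.lnn: sum over query terms of idf times log-weighted tf."""
    words = docs[doc_id]
    return sum(idf[t] * tf_weight(words.count(t)) for t in query.split())

for d in docs:
    print(d, round(score("a sentence", d), 2))
```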
Query Scores: S(q, d) = \sum_{t \in q} w_t^{(idf)} \cdot w_{t,d}^{(tf)}   (ltn.lnc)

1. "A sentence is a document."
2. "A document is a sentence and a sentence is a document."
3. "This document is short."
4. "This document is a sentence."

w_{t,d}^{(tf)}  doc1  doc2  doc3  doc4        normalized  doc1  doc2  doc3  doc4
a                1.3   1.6    0     1    →    a           .60   .54    0    .45
and               0     1     0     0         and          0    .34    0     0
document          1    1.3    1     1         document    .46   .44   .5    .45
is                1    1.3    1     1         is          .46   .44   .5    .45
sentence          1    1.3    0     1         sentence    .46   .44    0    .45
short             0     0     1     0         short        0     0    .5     0
this              0     0     1     1         this         0     0    .5    .45

           df   w_t^{(idf)}
a           3     .125
and         1     .6
document    4     0
is          4     0
sentence    2     .3
short       1     .6
this        2     .3

lengths(doc1, . . ., doc4) = (2.17, 2.94, 2, 2.24)

Query: "a sentence"
doc1: .125 ∗ .6 + .3 ∗ .46 = .21, doc2: .125 ∗ .54 + .3 ∗ .44 = .20
doc3: .125 ∗ 0 + .3 ∗ 0 = 0, doc4: .125 ∗ .45 + .3 ∗ .45 = .19

Query: "short sentence"
doc1: .6 ∗ 0 + .3 ∗ .46 = .14, doc2: .6 ∗ 0 + .3 ∗ .44 = .133
doc3: .6 ∗ .5 + .3 ∗ 0 = .3, doc4: .6 ∗ 0 + .3 ∗ .45 = .134
Cosine similarity between query and document

cos(\vec{q}, \vec{d}) = sim(\vec{q}, \vec{d})
  = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}
  = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \, \sqrt{\sum_{i=1}^{|V|} d_i^2}}

q_i is the tf-idf weight (idf) of term i in the query.
d_i is the tf-idf weight (tf) of term i in the document.
|\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.
\vec{q}/|\vec{q}| and \vec{d}/|\vec{d}| are length-1 vectors (= normalized).
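A minimal sketch of this formula, with query and document represented as sparse {term: weight} dicts; the example weights are illustrative (loosely based on the later ltn.lnc example), not prescribed by the slides:

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"best": 1.3, "car": 2.0, "insurance": 3.0}
d = {"car": 1.0, "insurance": 1.3, "auto": 1.0}
print(round(cosine(q, d), 3))
```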
Cosine similarity illustrated

[Figure: query vector ~v(q) and document vectors ~v(d1), ~v(d2), ~v(d3) plotted in the unit square of a two-term vector space with axes "rich" and "poor"; θ is the angle between the query vector and a document vector.]
Variant tf-idf functions

We've considered sublinear tf scaling (wf_{t,d} = 1 + log tf_{t,d}).

Or normalize instead by the maximum tf in the document, tf_max(d):

ntf_{t,d} = a + (1 − a) \frac{tf_{t,d}}{tf_max(d)}

where a ∈ [0, 1] (e.g., .4) is a smoothing term to avoid large swings in ntf due to small changes in tf.

This eliminates the repeated-content problem (d′ = d + d), but has other issues:
- sensitive to changes in the stop word list
- outlier terms with large tf
- skewed distributions with many nearly-most-frequent terms
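The maximum-tf normalization can be sketched as follows; it also illustrates why the duplicated document d′ = d + d gets the same ntf values as d (both tf and tf_max double, so the ratio is unchanged):

```python
def ntf(tf, tf_max, a=0.4):
    """Maximum-tf normalization with smoothing term a in [0, 1]."""
    return a + (1 - a) * tf / tf_max

# Duplicating a document doubles every tf and tf_max,
# so ntf is unchanged: the repeated-content problem goes away.
print(ntf(3, 10), ntf(6, 20))
```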
Components of tf.idf weighting

Term frequency:
  n (natural)     tf_{t,d}
  l (logarithm)   1 + log(tf_{t,d})
  a (augmented)   0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)     1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)     (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)          1
  t (idf)         log(N / df_t)
  p (prob idf)    max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / \sqrt{w_1^2 + w_2^2 + . . . + w_M^2}
  u (pivoted unique)  1/u
  b (byte size)       1/CharLength^α, α < 1

Best known combination of weighting options
Default: no weighting
tf.idf example

We often use different weightings for queries and documents.

Notation: qqq.ddd (term frequency / document frequency / normalization) for (query.document)

Example: ltn.lnc
- query: logarithmic tf, idf, no normalization
- document: logarithmic tf, no df weighting, cosine normalization

Isn't it bad to not idf-weight the document?

Example query: "best car insurance"
Example document: "car insurance auto insurance"
tf.idf example: ltn.lnc

Query: "best car insurance". Document: "car insurance auto insurance".

                      query                           document               product
word        tf-raw  tf-wght   df     idf  weight   tf-raw  tf-wght  weight  n'lized
auto          0       0      5000    2.3    0        1       1        1      0.52     0
best          1       1     50000    1.3   1.3       0       0        0      0        0
car           1       1     10000    2.0   2.0       1       1        1      0.52     1.04
insurance     1       1      1000    3.0   3.0       2      1.3      1.3     0.68     2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: \sqrt{1^2 + 0^2 + 1^2 + 1.3^2} ≈ 1.92
1/1.92 ≈ 0.52, 1.3/1.92 ≈ 0.68

Final similarity score between query and document:
\sum_i w_{qi} \cdot w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08
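A sketch reproducing this computation; the idf values are the ones in the table, and the final value agrees with the slide's 3.08 up to intermediate rounding:

```python
import math

# idf values and raw tfs from the slide's toy statistics.
idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Document side (lnc): log tf, no idf, cosine-normalize.
doc_w = {t: log_tf(tf) for t, tf in doc_tf.items()}
length = math.sqrt(sum(w * w for w in doc_w.values()))
doc_norm = {t: w / length for t, w in doc_w.items()}

# Query side (ltn): log tf times idf, no normalization.
query_w = {t: log_tf(tf) * idf[t] for t, tf in query_tf.items()}

score = sum(w * doc_norm.get(t, 0.0) for t, w in query_w.items())
print(round(score, 2))
```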
Outline
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Parametric and Zone indices

Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, . . .)

One parametric index for each field.

Fields: take a finite set of values (e.g., dates of authorship)
Zones: arbitrary free text (e.g., titles, abstracts)

Permits searching for documents by Shakespeare written in 1601 containing the phrase "alas poor Yorick", or finding documents with "merchant" in the title and "william" in the author list and the phrase "gentle rain" in the body.

Use separate indexes for each field and zone, or use
william.abstract, william.title, william.author

Permits weighted zone scoring
Weighted Zone Scoring

Given a boolean query q and document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores.

Let g_1, . . . , g_ℓ ∈ [0, 1] such that \sum_{i=1}^{ℓ} g_i = 1.
For 1 ≤ i ≤ ℓ, let s_i be the score between q and the i-th zone.
Then the weighted zone score is defined as \sum_{i=1}^{ℓ} g_i s_i.

Example: three zones: author, title, body
g_1 = .2, g_2 = .5, g_3 = .3 (match in author zone least important)

Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a boolean AND query, now compute a score for each document.
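A sketch of weighted zone scoring, assuming each zone score s_i is a Boolean match (1 if all query terms occur in the zone, else 0); the document fields below are made up, and the weights are the slide's example values:

```python
def weighted_zone_score(query_terms, doc_zones, g):
    """Weighted zone score: sum_i g_i * s_i, where s_i is a Boolean
    match score (1 if every query term occurs in zone i, else 0)."""
    assert abs(sum(g.values()) - 1.0) < 1e-9  # weights must sum to 1
    score = 0.0
    for zone, weight in g.items():
        text = doc_zones.get(zone, "").lower().split()
        s_i = 1.0 if all(t in text for t in query_terms) else 0.0
        score += weight * s_i
    return score

g = {"author": 0.2, "title": 0.5, "body": 0.3}  # weights from the slide
doc = {"author": "william shakespeare",
       "title": "hamlet",
       "body": "alas poor yorick i knew him horatio"}
print(weighted_zone_score(["william"], doc, g))  # matches only in author zone
```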
Learning Weights

How to determine the weights g_i for weighted zone scoring?

A. specified by an expert
B. "learned" using training examples that have been judged editorially (machine-learned relevance):
  1. given a set of training examples [(q, d) plus a relevance judgment (e.g., yes/no)]
  2. set the weights g_i to best approximate the relevance judgments

Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web).

Or use "passive collaborative feedback"? (clickthrough data)
Machine Learned Relevance

Given a table s_T(d, q), s_B(d, q) of Boolean matches, and relevance judgments r(d, q) (also, e.g., binary) of document d relevant to query q (see fig. 6.5 in text), compute a score for each of the training examples

score(d, q) = g s_T(d, q) + (1 − g) s_B(d, q)

and compare with r(d, q) using an error function

ε(g, Φ_j) = (r(d_j, q_j) − score(d_j, q_j))^2

for each training example.

Choose g to minimize the total error \sum_j ε(g, Φ_j) (a quadratic function of g, so elementary algebra in this case; more generally a sophisticated optimization problem).
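Since the total error is quadratic in g, the minimizing g has a closed form; a sketch, with made-up training triples (s_T, s_B, r):

```python
def learn_g(examples):
    """examples: list of (s_T, s_B, r) triples.
    Minimize sum_j (r_j - g*s_T - (1 - g)*s_B)^2.  Writing each term
    as (b_j - g*a_j)^2 with a_j = s_T - s_B and b_j = r_j - s_B gives
    the least-squares optimum g = sum(a_j b_j) / sum(a_j^2),
    clamped here to [0, 1]."""
    num = den = 0.0
    for s_t, s_b, r in examples:
        a = s_t - s_b          # coefficient of g
        b = r - s_b            # residual at g = 0
        num += a * b
        den += a * a
    g = num / den if den else 0.5
    return min(1.0, max(0.0, g))

train = [(1, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 1)]
print(learn_g(train))
```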
Outline
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Why is ranking so important?

Two lectures ago: problems with unranked retrieval
- Users want to look at a few results, not thousands.
- It's very hard to write queries that produce a few results, even for expert searchers.
- → Ranking is important because it effectively reduces a large set of results to a very small one.

Next: more data on "users only look at a few results"
Actually, in the vast majority of cases they only look at 1, 2, or 3 results.
Empirical investigation of the effect of ranking

How can we measure how important ranking is?

Observe what searchers do when they are searching in a controlled setting:
- Videotape them
- Ask them to "think aloud"
- Interview them
- Eye-track them
- Time them
- Record and count their clicks

The following slides are from Dan Russell's JCDL talk 2007.
Dan Russell is the "Uber Tech Lead for Search Quality & User Happiness" at Google.
Interview video
So . . . Did you notice the FTD official site?
To be honest I didn’t even look at that.
At first I saw “from $20” and $20 is what I was looking for.
To be honest, 1800-flowers is what I’m familiar with and why Iwent there next even though I kind of assumed they wouldn’t have$20 flowers.
And you knew they were expensive?
I knew they were expensive but I thought “hey, maybe they’ve gotsome flowers for under $20 here . . .”
But you didn’t notice the FTD?
No I didn’t, actually. . . that’s really funny.
Local work

Granka, L., Joachims, T., and Gay, G. (2004), "Eye-Tracking Analysis of User Behavior in WWW Search", Proceedings of the 28th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR '04).
http://www.cs.cornell.edu/People/tj/publications/granka etal 04a.pdf
Out of date?
Use of top and right margins
“instant” results
only for high bandwidth users?
mobile devices
Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).

Clicking: The distribution is even more skewed for clicking.
In 1 out of 2 cases, users click on the top-ranked page.
Even if the top-ranked page is not relevant, 30% of users will click on it.

→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.
Outline
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
A problem for cosine normalization

Query q: "anti-doping rules Beijing 2008 olympics"

Compare three documents:
- d1: a short document on anti-doping rules at the 2008 Olympics
- d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
- d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?
d2 is likely to be ranked below d3 . . .
. . . but d2 is more relevant than d3.

What can we do about this?
Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).

Adjust cosine normalization by a linear adjustment: "turning" the average normalization on the pivot.

Effect: similarities of short documents with the query decrease; similarities of long documents with the query increase.

This removes the unfair advantage that short documents have.
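One common realization of this idea is the pivoted length normalization of Singhal et al.: replace the cosine length |d| in the denominator by a linear blend of |d| and a pivot length. The pivot and slope values below are illustrative assumptions, not values from the slides:

```python
def pivoted_norm(doc_len, pivot, slope=0.25):
    """Pivoted normalization factor (illustrative slope/pivot):
    documents at the pivot length are normalized exactly as before;
    shorter documents get a larger denominator (lower similarity),
    longer documents a smaller one (higher similarity) than plain
    cosine normalization would give."""
    return (1.0 - slope) * pivot + slope * doc_len

pivot = 100.0                   # e.g., the average document length (assumed)
print(pivoted_norm(20, pivot))  # larger than 20: short doc is penalized
print(pivoted_norm(500, pivot)) # smaller than 500: long doc is boosted
```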
Predicted and true probability of relevance

[Figure; source: Lillian Lee]
Pivot normalization

[Figure; source: Lillian Lee]
Outline
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Now we also need term frequencies in the index

Brutus    → (1,2) (7,3) (83,1) (87,2) . . .
Caesar    → (1,1) (5,1) (13,1) (17,1) . . .
Calpurnia → (7,1) (8,2) (40,1) (97,3)

Each posting is a (docID, term frequency) pair.

We also need positions. Not shown here.
Term frequencies in the inverted index

In each posting, store tf_{t,d} in addition to docID d.
As an integer frequency, not as a (log-)weighted real number . . .
. . . because real numbers are difficult to compress.

Unary code is effective for encoding term frequencies.
Why?

Overall, additional space requirements are small: much less than a byte per posting.
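A sketch of one standard unary convention (a positive integer n encoded as n − 1 one-bits followed by a zero). It suggests why unary suits term frequencies: the common case tf = 1 costs a single bit:

```python
def unary_encode(n):
    """Unary code for a positive integer n: (n - 1) ones then a zero.
    Small values, the overwhelmingly common case for tf, get the
    shortest codes."""
    assert n >= 1
    return "1" * (n - 1) + "0"

print([unary_encode(tf) for tf in (1, 2, 3)])
```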
How do we compute the top k in ranking?

In many applications, we don't need a complete ranking.
We just need the top k for a small k (e.g., k = 100).
If we don't need a complete ranking, is there an efficient way of computing just the top k?

Naive:
- Compute scores for all N documents
- Sort
- Return the top k

What's bad about this?
Alternative?
Use min heap for selecting top k out of N
Use a binary min heap
A binary min heap is a binary tree in which each node’s valueis less than the values of its children.
Takes O(N log k) operations to construct (where N is thenumber of documents) . . .
. . . then read off k winners in O(k log k) steps
Essentially linear in N for small k and large N.
Binary min heap

[Figure: a binary min heap with 0.6 at the root, children 0.85 and 0.7, and leaves 0.9, 0.97, 0.8, 0.95.]
Selecting top k scoring documents in O(N log k)

Goal: keep the top k documents seen so far.
Use a binary min heap.

To process a new document d′ with score s′:
- Get current minimum h_m of heap (O(1))
- If s′ ≤ h_m, skip to next document
- If s′ > h_m, heap-delete-root (O(log k))
- Heap-add d′/s′ (O(log k))
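The procedure above can be sketched with Python's heapq (document IDs and scores are made up):

```python
import heapq

def top_k(scored_docs, k):
    """Keep the k highest-scoring documents using a size-k min heap.
    The heap root is the smallest retained score: a new document
    either beats it (replace root, O(log k)) or is skipped (O(1))."""
    heap = []  # (score, doc_id) pairs; heapq is a min heap on score
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # read off the winners, best first

docs = [("d1", 0.3), ("d2", 0.9), ("d3", 0.1), ("d4", 0.7), ("d5", 0.5)]
print(top_k(docs, 3))
```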
Even more efficient computation of top k?

Ranking has time complexity O(N) where N is the number of documents.
Optimizations reduce the constant factor, but they are still O(N), and 10^10 < N < 10^11!
Are there sublinear algorithms? Ideas?

What we're doing in effect: solving the k-nearest neighbor (kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are sublinear.
We will revisit this when we do kNN classification.
Cluster pruning

- Cluster docs in a preprocessing step
- Pick √N "leaders"
- For non-leaders, find the nearest leader (expect √N followers per leader)
- For query q, find the closest leader L (√N computations)
- Rank L and its followers
- Or generalize: b1 closest leaders, and then the b2 leaders closest to the query
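A toy sketch of the recipe above, using cosine similarity over dense vectors; the vectors and query are made up, leaders are sampled randomly (with a fixed seed for reproducibility), and the search is approximate by design since only one leader's cluster is ranked:

```python
import math
import random

def sim(u, v):
    """Cosine similarity for dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(vectors, seed=0):
    """Preprocessing: pick sqrt(N) random leaders; attach every
    non-leader to its nearest leader (~sqrt(N) followers each)."""
    random.seed(seed)
    n = len(vectors)
    leaders = random.sample(range(n), max(1, int(math.sqrt(n))))
    followers = {l: [] for l in leaders}
    for i in range(n):
        if i not in followers:
            nearest = max(leaders, key=lambda l: sim(vectors[i], vectors[l]))
            followers[nearest].append(i)
    return followers

def cluster_prune_search(q, vectors, followers, k=1):
    """Query time: find the closest leader (~sqrt(N) comparisons),
    then rank only that leader and its followers."""
    best = max(followers, key=lambda l: sim(q, vectors[l]))
    candidates = [best] + followers[best]
    return sorted(candidates, key=lambda i: sim(q, vectors[i]), reverse=True)[:k]

vecs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (0.7, 0.3), (0.2, 0.8)]
clusters = build_clusters(vecs)
print(cluster_prune_search((1, 0.05), vecs, clusters, k=2))
```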
Even more efficient computation of top k

Idea 1: Reorder postings lists
- Instead of ordering according to docID . . .
- . . . order according to some measure of "expected relevance".

Idea 2: Heuristics to prune the search space
- Not guaranteed to be correct . . .
- . . . but fails rarely.
- In practice, close to constant time.
- For this, we'll need the concepts of document-at-a-time processing and term-at-a-time processing.