
INFO 4300 / CS4300

Information Retrieval

slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 8: Evaluation & SVD

Paul Ginsparg

Cornell University, Ithaca, NY

20 Sep 2011


Administrativa

Ass't 2 to be posted 24 Sep, due Sat 8 Oct, 1pm (late submission permitted until Sun 9 Oct at 11 p.m.)

No class Tue 11 Oct (midterm break)

The Midterm Examination is on Thu Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. Topics examined include assignments, lectures, and discussion class readings before the midterm break.


Overview

1 Recap

2 SVD Intuition, cont’d

3 Incremental Numerics

4 Discussion 2


Outline

1 Recap

2 SVD Intuition, cont’d

3 Incremental Numerics

4 Discussion 2


Netflix challenge, 2006–2009

Next 9 slides adapted from ("Simon Funk" = Brandyn Webb)
http://sifter.org/~simon/journal/20061211.html

See also popular article:
http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html

Netflix provided 100M ratings (from 1 to 5) of 17K movies by 500K users.

i.e., 100 million (User,Movie,Rating)'s of the form (105932,14002,3)

Predict (User,Movie,?) not in the database (how would the given User rate the given Movie?)

$50K incentive to the best entry each year, and $1M to the first to beat a set target (10% better than Netflix's own system)


User-Movie Rating Matrix R_um

Visualize as a large sparse 500K × 17K "user-movie" matrix R_um, with the (u,m)th matrix element containing the rating (1–5) by user u for movie m.

About 8.5B entries total, so ratings are given for only 1 in 85 ≈ 1.2% of the entries.

Certain specified '?' elements constitute a quiz: make a best guess P_um at the missing ratings.

Use "mean squared error" (mse) as the measure of accuracy: guess 1.5 when the actual rating is 2, and the "penalty" is (2 − 1.5)^2 = 0.25.

Then sum over the penalties for all guesses (including an optional sqrt):

rmse: E = \sqrt{\sum_{u,m} (R_{um} - P_{um})^2}
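As a quick numeric check of this definition, here is a minimal Python sketch scoring a handful of made-up (User,Movie,Rating) triples against hypothetical predictions (the values are illustrative, not from the Netflix data):

    import math

    # Hypothetical known ratings R_um as (user, movie, rating) triples.
    ratings = [(105932, 14002, 3), (105932, 14005, 4), (211, 14002, 2)]

    # Hypothetical predictions P_um for the same (user, movie) pairs.
    predictions = {(105932, 14002): 3.5, (105932, 14005): 4.2, (211, 14002): 1.5}

    # Per-guess penalties (R_um - P_um)^2, summed as on the slide.
    penalties = [(r - predictions[(u, m)]) ** 2 for u, m, r in ratings]
    E = math.sqrt(sum(penalties))                      # sqrt of the summed penalties
    rmse = math.sqrt(sum(penalties) / len(penalties))  # per-guess rmse used for the quiz score
    print(E, rmse)                                     # ~0.735 and ~0.424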


Linear Dependencies

If one had the full 8.5 billion ratings (and many weary users), they would contain many regularities, i.e., not consist of 8.5B independent and unrelated ratings.

Describe each movie in terms of some basic attributes such as

overall quality

action or comedy

actors

. . .

Describe user preferences in terms of complementary attributes or preferences

they rate high or low

prefer action or comedy

preferred actors

. . .

Model the data

Explain 8.5 billion ratings with far fewer than 8.5 billion numbers (e.g., a single number specifying a movie's action content can explain its attraction for a few million action buffs).

Define a model for the data with a smaller number of parameters and infer those parameters from the data. SVD (= singular value decomposition) reduces in this case to the assumption that a user's overall rating is composed of a sum of preferences over movie features.


Example: Just one Feature

Suppose only 1 feature, overall quality, and 1 corresponding user tendency to rate high/low.

Three users: U_u = (1, 2, 3); five movies: V_m = (1, 1, 3, 2, 1)

Predicted rating matrix:

P_{um} = U_u V_m = \begin{pmatrix} 1 & 1 & 3 & 2 & 1 \\ 2 & 2 & 6 & 4 & 2 \\ 3 & 3 & 9 & 6 & 3 \end{pmatrix}

'Explain' 15 data points with only 7 parameters (the 3 + 5 = 8 values contain one redundant overall scale).
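A two-line Python check of this rank-one example, using only the numbers on the slide:

    # One-feature model: P_um = U_u * V_m, i.e., the outer product of U and V.
    U = [1, 2, 3]        # per-user tendency to rate high/low
    V = [1, 1, 3, 2, 1]  # per-movie overall quality

    P = [[u * v for v in V] for u in U]
    for row in P:
        print(row)
    # [1, 1, 3, 2, 1]
    # [2, 2, 6, 4, 2]
    # [3, 3, 9, 6, 3]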


More Features

Now suppose 40 features: each movie is described by 40 values specifying the degree to which it contains each feature; each user is described by 40 values specifying the degree to which each feature is preferred by that user.

To calculate a rating, sum the products of each user preference multiplied by the corresponding movie feature.

E.g., the movie Terminator might be (action=1.2, chickflick=-1, . . .), and user Joe might be (action=3, chickflick=-1, . . .). Combine to find Joe likes Terminator with rating

3 × 1.2 + (−1) × (−1) + . . . = 4.6 + . . . .

(Negative numbers are OK: Terminator is anti-chickflick and Joe has an aversion to chickflicks, "so Terminator actively scores positive points with Joe for being decidedly un-chickflicky.")
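A minimal sketch of that dot-product rating, truncated to the two features named above (the dictionaries and values are just the slide's illustration, not real model parameters):

    # Hypothetical feature vectors, keeping only the two features from the example.
    terminator = {"action": 1.2, "chickflick": -1.0}
    joe        = {"action": 3.0, "chickflick": -1.0}

    # Predicted rating = sum over features of (user preference) * (movie feature).
    rating = sum(joe[f] * terminator[f] for f in terminator)
    print(round(rating, 1))  # 4.6 = 3*1.2 + (-1)*(-1); the remaining 38 features would add further terms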


Outline

1 Recap

2 SVD Intuition, cont’d

3 Incremental Numerics

4 Discussion 2


Concise Model

The model requires roughly 40 × (500K + 17K) values, or about 20M: less than the original 8.5B by a factor of 400. Predicted ratings:

P_{um} = \sum_{f=1}^{r} U^f_u \cdot V^f_m

U^f_u is the preference of user u for feature f, and V^f_m is the degree to which movie m contains feature f (up to r = 40).

The original matrix has been decomposed into the product of two rectangular matrices: the 500,000 × 40 user preference matrix U^f_u and the 40 × 17,000 movie feature matrix V^f_m.

(Matrix multiplication just performs the products and sums described above, resulting in an approximation to the original 500,000 × 17,000 rating matrix.)
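The shape bookkeeping can be sketched with NumPy on scaled-down stand-ins (the sizes below are placeholders chosen only so the example runs quickly):

    import numpy as np

    n_users, n_movies, r = 500, 170, 40   # stand-ins for 500K users, 17K movies, 40 features
    U = np.random.rand(n_users, r)        # user-preference matrix, U[u, f]
    V = np.random.rand(r, n_movies)       # movie-feature matrix,  V[f, m]

    P = U @ V                             # predicted rating for every (user, movie) pair
    print(P.shape)                        # (500, 170) = n_users x n_movies
    # The factorization stores far fewer numbers than the full rating matrix:
    print(r * (n_users + n_movies), "vs", n_users * n_movies)  # 26800 vs 85000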


P_{um} = U_u V_m =
\underbrace{\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}}_{3 \times 1}
\underbrace{\begin{pmatrix} 1 & 1 & 3 & 2 & 1 \end{pmatrix}}_{1 \times 5}
=
\underbrace{\begin{pmatrix} 1 & 1 & 3 & 2 & 1 \\ 2 & 2 & 6 & 4 & 2 \\ 3 & 3 & 9 & 6 & 3 \end{pmatrix}}_{3 \times 5}

P_{um} = \sum_{f=1}^{r} U^f_u \cdot V^f_m =
\underbrace{\begin{pmatrix} 1 & \cdots & 1 \\ 2 & \cdots & 1 \\ 1 & \cdots & 2 \\ & \vdots & \\ 2 & \cdots & 4 \\ 5 & \cdots & 1 \\ 1 & \cdots & 2 \end{pmatrix}}_{n \times r}
\underbrace{\begin{pmatrix} 1 & 1 & \cdots & 2 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 2 & 5 & \cdots & 1 & 2 \end{pmatrix}}_{r \times m}
=
\underbrace{\begin{pmatrix} 3 & 6 & \cdots & 3 & 3 \\ 4 & 7 & \cdots & 5 & 4 \\ 5 & 11 & \cdots & 4 & 5 \\ & & \vdots & & \\ 10 & 22 & \cdots & 8 & 10 \\ 7 & 10 & \cdots & 11 & 7 \\ 5 & 11 & \cdots & 4 & 5 \end{pmatrix}}_{n \times m}


How to calculate model parameters

Singular value decomposition (SVD) is the mathematical method for finding the two smaller matrices which minimize the resulting approximation error (rmse) to the original matrix.

The rank-40 SVD of the 8.5B-entry matrix gives the best approximation within the framework of the 40-feature user-movie-rating model.

• Difficult to calculate the SVD of a large matrix.
• Moreover, we don't have all 8.5B entries (instead we have 100M entries and 8.4B empty cells).

But we can train the parameters by following the derivative of the approximation error (steepest descent).

(This also means the unknown error on the 8.4B empty matrix elements can be ignored; for a fully known matrix, the end result coincides exactly with the SVD.)
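For a fully known matrix, the rank-k truncation is exactly what numpy.linalg.svd gives. A tiny illustration on the rank-one example from earlier (small enough to hold in memory, unlike the real 500K × 17K problem):

    import numpy as np

    R = np.array([[1., 1., 3., 2., 1.],
                  [2., 2., 6., 4., 2.],
                  [3., 3., 9., 6., 3.]])

    # Keep only the top k singular values/vectors: the best rank-k approximation in rmse.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = 1
    P = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.allclose(P, R))  # True: this matrix is exactly rank one, so k=1 recovers it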


Summary

End result of SVD = list of inferred categories, sorted by relevance.

Each category is expressed by the extent to which each user and movie belong (or anti-belong) to it, as read off from the columns of the user matrix U or the rows of the movie matrix V.

Sorted by value, a category might represent action movies (movies with a lot of action at the top, slow movies at the bottom), and correspondingly users who like action movies (at the top, and those who prefer slow movies at the bottom).

The procedure discovers whatever the data implies: the algorithm itself has no inherent concept of action (it uses neither titles nor descriptions). It uses only a hundred million examples of the form:

user 17538 gives movie 4819 a rating of 3 (and 84 of 85 ratings are missing).


Outline

1 Recap

2 SVD Intuition, cont’d

3 Incremental Numerics

4 Discussion 2


Incremental SVD method

(from http://sifter.org/~simon/journal/20070815.html)

Recall:

R_um = known rating by user u for item m

P_um = predicted rating for user u for item m

Singular vectors indexed by f = 1, . . . , r

U^f_u = element of the f-th singular user vector for the u-th user

V^f_m = element of the f-th singular item vector for the m-th movie

SVD computes the prediction as:

P_{um} = \sum_{f=1}^{r} U^f_u \cdot V^f_m


Error Gradient

The error in the prediction for user u's rating of movie m is

e_{um} = R_{um} - P_{um} ,

and the total rms error E for all predictions is given by

E^2 = \sum_{u',m'} e_{u'm'}^2 .

For gradient descent, take the partial derivative of the squared error with respect to each of the parameters U^f_u and V^f_m:

\frac{\partial E^2}{\partial U^f_u} = \sum_{m'} -2 e_{um'} \frac{\partial P_{um'}}{\partial U^f_u} = -2 \sum_{m'} e_{um'} V^f_{m'} = -2 \sum_{m'} (R_{um'} - P_{um'}) V^f_{m'}

(the derivative for U^f_u is just a sum over all the ratings by user u). Similarly,

\frac{\partial E^2}{\partial V^f_m} = \sum_{u'} -2 e_{u'm} \frac{\partial P_{u'm}}{\partial V^f_m} = -2 \sum_{u'} e_{u'm} U^f_{u'} = -2 \sum_{u'} (R_{u'm} - P_{u'm}) U^f_{u'}
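A quick finite-difference check of the first derivative formula, for a single user and single feature with all other features held at zero (toy numbers, purely illustrative):

    # Check dE^2/dU_u^f = -2 * sum_m e_um * V_m^f against a numerical derivative.
    R = {0: 4.0, 1: 2.0, 2: 5.0}   # user u's known ratings of movies 0, 1, 2
    V = [1.0, 0.5, 2.0]            # movie values V_m^f for the single active feature f

    def sq_err(Uuf):
        # Squared error over user u's ratings when P_um = Uuf * V[m] (other features zero).
        return sum((R[m] - Uuf * V[m]) ** 2 for m in R)

    Uuf = 1.5
    analytic = -2 * sum((R[m] - Uuf * V[m]) * V[m] for m in R)
    h = 1e-6
    numeric = (sq_err(Uuf + h) - sq_err(Uuf - h)) / (2 * h)
    print(analytic, numeric)       # both are -14.25 (up to ~1e-6)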


Gradient Descent
http://mathworld.wolfram.com/MethodofSteepestDescent.html

Starts at a point P_0 and moves from P_i to P_{i+1} by minimizing along the line extending from P_i in the direction of −∇f(P_i), the local downhill gradient.

For a 1-d function f(x), this takes the form of iterating x_i = x_{i-1} − ε f'(x_{i-1}) for small ε > 0, from a starting point x_0 until a fixed point is reached.

Figure: f(x) = x^3 − 2x^2 + 2 with ε = 0.1 and starting points x_0 = 2, 0.01.
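The 1-d iteration is short enough to run directly; a sketch for the function and parameters in the figure caption (both quoted starting points happen to settle at the local minimum x = 4/3, where f'(x) = 3x^2 − 4x vanishes):

    # Steepest descent on f(x) = x^3 - 2x^2 + 2, i.e., x <- x - eps * f'(x).
    def fprime(x):
        return 3 * x**2 - 4 * x

    def descend(x0, eps=0.1, steps=200):
        x = x0
        for _ in range(steps):
            x = x - eps * fprime(x)
        return x

    print(descend(2.0))    # ~1.3333, the local minimum at x = 4/3
    print(descend(0.01))   # also ~1.3333, after creeping away from the local max at x = 0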


Inner Loop

In the simple backpropagation algorithm for gradient descent, use as the parameter step a "learning rate" parameter ℓ = 2ε multiplied by the gradient:

\Delta U^f_u = -\epsilon \frac{\partial E^2}{\partial U^f_u} = \ell \sum_{m'} e_{um'} V^f_{m'}

\Delta V^f_m = -\epsilon \frac{\partial E^2}{\partial V^f_m} = \ell \sum_{u'} e_{u'm} U^f_{u'}

This translates to the inner loop of code as

  // lrate is the learning rate ℓ; err folds it into the prediction error
  real err = lrate * (rating(user,movie) - predictRating(user,movie));
  userValue[f][user]   += err * movieValue[f][movie];
  movieValue[f][movie] += err * userValue[f][user];

(sum the former over movies and the latter over users, and iterate to the minimum)
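Below is a runnable Python toy version of that loop, applying the batch sums above one rating at a time on randomly generated data; the per-rating updates, feature-at-a-time sweep, constants, and array names are illustrative assumptions rather than details fixed by the slide. The two updates are applied in the same order as the snippet above, so the movie update sees the already-updated user value.

    import random

    NUM_FEATURES, LRATE, EPOCHS = 4, 0.01, 50
    n_users, n_movies = 30, 20

    # Toy training data: each user rates 8 random movies with a random 1-5 rating.
    ratings = [(u, m, float(random.randint(1, 5)))
               for u in range(n_users) for m in random.sample(range(n_movies), 8)]

    user_value  = [[0.1] * n_users  for _ in range(NUM_FEATURES)]
    movie_value = [[0.1] * n_movies for _ in range(NUM_FEATURES)]

    def predict(u, m):
        return sum(user_value[f][u] * movie_value[f][m] for f in range(NUM_FEATURES))

    for f in range(NUM_FEATURES):            # train one feature at a time
        for _ in range(EPOCHS):
            for u, m, r in ratings:
                err = LRATE * (r - predict(u, m))
                user_value[f][u]  += err * movie_value[f][m]
                movie_value[f][m] += err * user_value[f][u]

    # Training error after the sweep (on this toy data it drops well below the initial error).
    rmse = (sum((r - predict(u, m)) ** 2 for u, m, r in ratings) / len(ratings)) ** 0.5
    print(rmse)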


Outline

1 Recap

2 SVD Intuition, cont’d

3 Incremental Numerics

4 Discussion 2


Discussion 2

K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation 28, 11-21, 1972.

Letter by Stephen Robertson and reply by Karen Sparck Jones,Journal of Documentation 28, 164-165, 1972.


Exhaustivity and specificity

What are the semantic and statistical interpretations of specificity?

Semantic:

tea, coffee, cocoa (more specific, smaller # of docs)
beverage (less specific, larger # of docs)

Statistical: specificity is a function of term usage; a frequently used term is non-specific (even if it has a specific meaning).

Exhaustivity of a document description is determined by the number of controlled vocabulary terms assigned.

Reject frequently occurring terms?

via conjunction (but according to item C of Table I, the average number of matched terms is smaller than the number of terms in the request, so this would reduce recall)
remove them entirely (again hurts recall; they are needed for many relevant documents)

What is graphed in Figure 1 and what does it illustrate? (Why aren't the axes labelled?)


idf weight

Sparck Jones defines f(n) = m such that 2^{m−1} < n ≤ 2^m

(in other words f(n) = ⌈log2(n)⌉, where ⌈x⌉ denotes the smallest integer not less than x, equivalent to one plus the greatest integer less than x)

and suggests weight = f(N) − f(n) + 1. E.g., for N = 200 documents:

f(N) = 8 (2^8 = 256)

n = 90: f(n) = 7 (2^7 = 128), hence weight = 8 − 7 + 1 = 2

n = 3: f(n) = 2 (2^2 = 4), hence weight = 8 − 2 + 1 = 7

The overall weight for the query is then 2 + 7 = 9.

The +1 is so that terms occurring in more than roughly half the documents in the corpus are not given zero weight (for N = 200, anything occurring in more than 128 documents).
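These weights are easy to reproduce; a small Python sketch of f(n) and the resulting term weights for the numbers above:

    import math

    def f(n):
        # Sparck Jones' f(n) = m with 2^(m-1) < n <= 2^m, i.e., ceil(log2(n)).
        return math.ceil(math.log2(n))

    def weight(N, n):
        return f(N) - f(n) + 1

    N = 200
    print(f(N))                          # 8, since 2^7 < 200 <= 2^8
    print(weight(N, 90))                 # 2: f(90) = 7 because 2^6 < 90 <= 2^7
    print(weight(N, 3))                  # 7: f(3)  = 2 because 2^1 < 3  <= 2^2
    print(weight(N, 90) + weight(N, 3))  # 9, the overall weight for the two-term query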


idf weight, modified

Robertson: Sparck Jones' weight f(N) − f(n) + 1 ≈ log2(N/n) + 1

Note that n/N is the probability that an item chosen at random will contain the term. Suppose an item has terms a, b, c in common with the query, and the corresponding probabilities are p_a, p_b, p_c. Then the weight assigned to the document is

log(1/p_a) + log(1/p_b) + log(1/p_c) = log(1/(p_a p_b p_c))

(the probability that a document will randomly contain all three terms a, b, c, under what assumption?). This quantifies the statement that the less likely a given combination of terms is to occur, the more likely the document is relevant to the query (a theoretical justification for logarithmic idf weights).
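A quick numerical comparison of the step-function weight and Robertson's logarithmic approximation, again for N = 200 (values rounded for display):

    import math

    def sj_weight(N, n):      # Sparck Jones' original step-function weight
        return math.ceil(math.log2(N)) - math.ceil(math.log2(n)) + 1

    def log_weight(N, n):     # Robertson's continuous approximation log2(N/n) + 1
        return math.log2(N / n) + 1

    N = 200
    for n in (3, 90):
        print(n, sj_weight(N, n), round(log_weight(N, n), 2))
    # 3  7  7.06
    # 90 2  2.15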
