
Announcements Milestone 2 – autolab disk quota issues: – classifier implementation issues: Almost any Weka classifier can be used on our datasets … with


Announcements
• Milestone 2
– autolab disk quota issues:
– classifier implementation issues:
• Almost any Weka classifier can be used on our datasets
• … with some work …
– meta.FilteredClassifier, meta.MultiClassClassifier, meta.ClassificationViaRegression
• If you didn’t do that work for MS2 you will for MS3
– meta.GridSearch, meta.CVParameterSelection, …
– So: 48 hr extension
• but TA in charge is out of town Thurs and Fri
• Milestone 3
– is out …
– major change: we’re now trying to get the best single classifier for all 12 problems


Social/Collaborative Filtering


Outline
• What is CF?
• Nearest-neighbor methods for CF
– One old-school paper: BellCore’s movie recommender
– Some general discussion
• CF reduced to classification
• CF reduced to matrix factoring
• Other uses of matrix factoring in ML


WHAT IS COLLABORATIVE FILTERING?
AKA social filtering, recommendation systems, …


What is collaborative filtering?


Other examples of social filtering….


Everyday Examples of Collaborative Filtering...
• Bestseller lists
• Top 40 music lists
• The “recent returns” shelf at the library
• Unmarked but well-used paths thru the woods
• The printer room at work
• “Read any good books lately?”
• ....
• Common insight: personal tastes are correlated:
– If Alice and Bob both like X and Alice likes Y then Bob is more likely to like Y
– especially (perhaps) if Bob knows Alice


SOCIAL/COLLABORATIVE FILTERING: NEAREST-NEIGHBOR METHODS


BellCore’s MovieRecommender
• Recommending And Evaluating Choices In A Virtual Community Of Use. Will Hill, Larry Stead, Mark Rosenstein and George Furnas, Bellcore; CHI 1995

By virtual community we mean "a group of people who share characteristics and interact in essence or effect only". In other words, people in a Virtual Community influence each other as though they interacted but they do not interact. Thus we ask: "Is it possible to arrange for people to share some of the personalized informational benefits of community involvement without the associated communications costs?"


MovieRecommender Goals
Recommendations should:
• simultaneously ease and encourage rather than replace social processes....should make it easy to participate while leaving in hooks for people to pursue more personal relationships if they wish.
• be for sets of people not just individuals...multi-person recommending is often important, for example, when two or more people want to choose a video to watch together.
• be from people not a black box machine or so-called "agent".
• tell how much confidence to place in them, in other words they should include indications of how accurate they are.


BellCore’s MovieRecommender
• Participants sent email to [email protected]
• System replied with a list of 500 movies to rate on a 1-10 scale (250 random, 250 popular)
– Only a subset needs to be rated
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a random sample of) previous users
• Most similar users are used to predict scores for unrated movies (more later)
• System returns recommendations in an email message.


Suggested Videos for: John A. Jamus.

Your must-see list with predicted ratings:

•7.0 "Alien (1979)"

•6.5 "Blade Runner"

•6.2 "Close Encounters Of The Third Kind (1977)"

Your video categories with average ratings:

•6.7 "Action/Adventure"

•6.5 "Science Fiction/Fantasy"

•6.3 "Children/Family"

•6.0 "Mystery/Suspense"

•5.9 "Comedy"

•5.8 "Drama"


The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar. Correlation with target viewer:

•0.59 viewer-130 ([email protected])

•0.55 bullert,jane r ([email protected])

•0.51 jan_arst ([email protected])

•0.46 Ken Cross ([email protected])

•0.42 rskt ([email protected])

•0.41 kkgg ([email protected])

•0.41 bnn ([email protected])

By category, their joint ratings recommend:

•Action/Adventure:

• "Excalibur" 8.0, 4 viewers

• "Apocalypse Now" 7.2, 4 viewers

• "Platoon" 8.3, 3 viewers

•Science Fiction/Fantasy:

• "Total Recall" 7.2, 5 viewers

•Children/Family:

• "Wizard Of Oz, The" 8.5, 4 viewers

• "Mary Poppins" 7.7, 3 viewers

•Mystery/Suspense:

• "Silence Of The Lambs, The" 9.3, 3 viewers

•Comedy:

• "National Lampoon's Animal House" 7.5, 4 viewers

• "Driving Miss Daisy" 7.5, 4 viewers

• "Hannah and Her Sisters" 8.0, 3 viewers

•Drama:

• "It's A Wonderful Life" 8.0, 5 viewers

• "Dead Poets Society" 7.0, 5 viewers

• "Rain Man" 7.5, 4 viewers

Correlation of predicted ratings with your actual ratings is: 0.64 This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.


BellCore’s MovieRecommender
• Evaluation:
– Withhold 10% of the ratings of each user to use as a test set
– Measure correlation between predicted ratings and actual ratings for test-set movie/user pairs
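A minimal sketch of this evaluation protocol, assuming ratings are stored as a dict of per-user dicts. The function names `split_ratings` and `pearson`, the fixed random seed, and the choice to always withhold at least one rating per user are illustrative assumptions, not details from the paper.

```python
import math
import random

def split_ratings(ratings, test_fraction=0.1, seed=0):
    """Withhold a fraction of each user's ratings as a test set.

    ratings: {user: {movie: rating}}. Withholding at least one rating
    per user is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, votes in ratings.items():
        movies = sorted(votes)
        rng.shuffle(movies)
        n_test = max(1, round(test_fraction * len(movies)))
        held_out = set(movies[:n_test])
        train[user] = {m: r for m, r in votes.items() if m not in held_out}
        test[user] = {m: votes[m] for m in held_out}
    return train, test

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

To score a recommender, one would predict a rating for every withheld (user, movie) pair and pass the predicted and actual lists to `pearson`.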


BellCore’s MovieRecommender
• Participants sent email to [email protected]
• System replied with a list of 500 movies to rate
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a random sample of) previous users
• Most similar users are used to predict scores for unrated movies
– Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Breese, Heckerman, Kadie, UAI98
• System returns recommendations in an email message.


recap: k-nearest neighbor learning

Given a test example x:
1. Find the k training-set examples (x1,y1),….,(xk,yk) that are closest to x.
2. Predict the most frequent label in that set.


Breaking it down:
• To train:
– save the data (very fast!)
• To test:
– For each test example x:
1. Find the k training-set examples (x1,y1),….,(xk,yk) that are closest to x.
2. Predict the most frequent label in that set.
– Prediction is relatively slow …
– ... but it doesn’t depend on the number of classes, only the number of neighbors
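The two steps can be sketched in a few lines of Python. Euclidean distance, the function name `knn_predict`, and the toy data are assumptions for illustration; any distance measure could be substituted.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Predict the most frequent label among the k nearest training examples.

    train: list of (feature_tuple, label) pairs; "training" is just keeping
    this list. Distance is Euclidean (an assumption for this sketch).
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical data)
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b"),
]
label = knn_predict(train, (1.0, 0.9), k=3)
```

Each prediction scans the whole training set, which is why testing is the slow part; the cost of the final vote depends only on k.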


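Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)
• $v_{i,j}$ = vote of user i on item j
• $I_i$ = items for which user i has voted
• Mean vote for i is $\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$
• Predicted vote for “active user” a for j is a weighted sum: $p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$
– $\kappa$ is the normalizer
– $w(a,i)$ are the weights of n similar users
• weight is based on similarity between user a and i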
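A small sketch of the weighted-sum prediction, with the weight function passed in as a parameter. The names `mean_vote` and `predict_vote` are illustrative, and summing only over users who actually voted on the item (with the normalizer taken as one over the sum of absolute weights of those users) is an assumption of this sketch.

```python
def mean_vote(votes):
    """Mean vote of one user over the items that user has voted on."""
    return sum(votes.values()) / len(votes)

def predict_vote(ratings, active, item, weight):
    """Predicted vote: mean of the active user plus a normalized weighted
    sum of other users' deviations from their own mean votes.

    ratings: {user: {item: vote}}; weight(a, i) returns w(a, i).
    """
    total, norm = 0.0, 0.0
    for user, votes in ratings.items():
        if user == active or item not in votes:
            continue
        w = weight(active, user)
        total += w * (votes[item] - mean_vote(votes))
        norm += abs(w)
    base = mean_vote(ratings[active])
    return base if norm == 0 else base + total / norm

# Hypothetical votes, with every other user weighted equally
ratings = {
    "a": {"x": 4, "y": 6},
    "b": {"x": 2, "y": 4, "z": 6},
    "c": {"x": 5, "z": 3},
}
p = predict_vote(ratings, "a", "z", lambda a, i: 1.0)
```

Here user b rates z two points above b's own mean and user c rates it one point below, so with equal weights the prediction for a moves half a point above a's mean.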


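Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)
• K-nearest neighbor: $w(a,i) = 1$ if $i \in \text{neighbors}(a)$, else $0$
• Pearson correlation coefficient (Resnick ’94, Grouplens): $w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}$, with sums over the items j that both a and i have voted on
• Cosine distance, etc, …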
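The two weighting schemes might look like this in Python. The function names are illustrative; taking each user's mean over all of that user's votes (rather than only the co-rated items) is one of several conventions and is an assumption here.

```python
import math

def knn_weight(neighbors_of_a, i):
    """K-nearest-neighbor weight: 1 if i is among a's neighbors, else 0."""
    return 1.0 if i in neighbors_of_a else 0.0

def pearson_weight(votes_a, votes_i):
    """Pearson correlation between two users over their co-rated items.

    votes_a, votes_i: {item: vote}. Means are taken over all of each
    user's votes (an assumption of this sketch).
    """
    common = set(votes_a) & set(votes_i)
    if not common:
        return 0.0
    mean_a = sum(votes_a.values()) / len(votes_a)
    mean_i = sum(votes_i.values()) / len(votes_i)
    num = sum((votes_a[j] - mean_a) * (votes_i[j] - mean_i) for j in common)
    den = math.sqrt(
        sum((votes_a[j] - mean_a) ** 2 for j in common)
        * sum((votes_i[j] - mean_i) ** 2 for j in common)
    )
    return num / den if den else 0.0
```

Users whose ratings rise and fall together get a weight near 1, and users with opposite tastes get a weight near -1, so their deviations count against the prediction.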


SOCIAL/COLLABORATIVE FILTERING: TRADITIONAL CLASSIFICATION


What are other ways to formulate the collaborative filtering problem?
• Treat it like ordinary classification or regression


Collaborative + Content Filtering (Basu et al, AAAI98; Condliff et al, AI-STATS99)

Movies (columns): Airplane (comedy, $2M); Matrix (action, $70M); Room with a View (romance, $25M); ...; Hidalgo (action, $30M)

Users (rows), each with their known ratings in column order:
• Joe (27, M, $70k): 9 7 2 7
• Carol (53, F, $20k): 8 9
• ...
• Kumar (25, M, $22k): 9 3 6
• Ua (48, M, $81k): 4 7 ? ? ?


Collaborative + Content Filtering As Classification (Basu, Hirsh, Cohen, AAAI98)

Movies (columns): Airplane (comedy); Matrix (action); Room with a View (romance); ...; Hidalgo (action)

Users (rows), each with their known likes (1) and dislikes (0) in column order:
• Joe (27, M, 70k): 1 1 0 1
• Carol (53, F, 20k): 1 1 0
• ...
• Kumar (25, M, 22k): 1 0 0 1
• Ua (48, M, 81k): 0 1 ? ? ?

Classification task: map (user,movie) pair into {likes,dislikes}

Training data: known likes/dislikes

Test data: active users

Features: any properties of user/movie pair


Collaborative + Content Filtering As Classification (Basu et al, AAAI98)


Features: any properties of user/movie pair (U,M)

Examples: genre(U,M), age(U,M), income(U,M),...

• genre(Carol,Matrix) = action

• income(Kumar,Hidalgo) = 22k/year


Collaborative + Content Filtering As Classification (Basu et al, AAAI98)


Features: any properties of user/movie pair (U,M)

Examples: usersWhoLikedMovie(U,M):

• usersWhoLikedMovie(Carol,Hidalgo) = {Joe,...,Kumar}

• usersWhoLikedMovie(Ua, Matrix) = {Joe,...}


Collaborative + Content Filtering As Classification (Basu et al, AAAI98)


Features: any properties of user/movie pair (U,M)

Examples: moviesLikedByUser(M,U):

• moviesLikedByUser(*,Joe) = {Airplane,Matrix,...,Hidalgo}

• actionMoviesLikedByUser(*,Joe)={Matrix,Hidalgo}


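Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

Movies (columns): Airplane (comedy); Matrix (action); Room with a View (romance); ...; Hidalgo (action)

Users (rows), each with their known likes (1) and dislikes (0) in column order:
• Joe (27, M, 70k): 1 1 0 1
• Carol (53, F, 20k): 1 1 0
• ...
• Kumar (25, M, 22k): 1 0 0 1
• Ua (48, M, 81k): 1 1 ? ? ?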

Features: any properties of user/movie pair (U,M)

genre={romance}, age=48, sex=male, income=81k, usersWhoLikedMovie={Carol}, moviesLikedByUser={Matrix,Airplane}, ...


Collaborative + Content Filtering As Classification (Basu et al, AAAI98)


genre={romance}, age=48, sex=male, income=81k, usersWhoLikedMovie={Carol}, moviesLikedByUser={Matrix,Airplane}, ...

genre={action}, age=48, sex=male, income=81k, usersWhoLikedMovie = {Joe,Kumar}, moviesLikedByUser={Matrix,Airplane},...
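As a rough illustration of how such feature vectors could be built for a (user, movie) pair, here is a sketch in Python. The data structures and the `features` function are hypothetical; the like sets follow the slide's table where it is unambiguous, and Carol's likes are an assumption chosen to match the example features above.

```python
# Hypothetical encoding of the running example.
USERS = {
    "Joe": {"age": 27, "sex": "male", "income": 70_000},
    "Carol": {"age": 53, "sex": "female", "income": 20_000},
    "Kumar": {"age": 25, "sex": "male", "income": 22_000},
    "Ua": {"age": 48, "sex": "male", "income": 81_000},
}
GENRE = {
    "Airplane": "comedy",
    "Matrix": "action",
    "Room with a View": "romance",
    "Hidalgo": "action",
}
LIKES = {
    "Joe": {"Airplane", "Matrix", "Hidalgo"},
    "Carol": {"Airplane", "Room with a View"},  # assumed for illustration
    "Kumar": {"Airplane", "Hidalgo"},
    "Ua": {"Airplane", "Matrix"},
}

def features(user, movie):
    """Features of a (user, movie) pair: content, demographic, collaborative."""
    row = {"genre": GENRE[movie]}
    row.update(USERS[user])
    # Collaborative features are set-valued; the pair being classified is
    # excluded so the label cannot leak into its own features.
    row["usersWhoLikedMovie"] = {
        u for u, liked in LIKES.items() if u != user and movie in liked
    }
    row["moviesLikedByUser"] = LIKES[user] - {movie}
    return row

example = features("Ua", "Hidalgo")
```

A rule learner that accepts set-valued attributes can then test membership conditions such as "Joe is in usersWhoLikedMovie" directly.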


Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

• Classification learning algorithm: rule learning (RIPPER)

• If NakedGun33/13 ∈ moviesLikedByUser and Joe ∈ usersWhoLikedMovie and genre=comedy then predict likes(U,M)

• If age>12 and age<17 and HolyGrail ∈ moviesLikedByUser and director=MelBrooks then predict likes(U,M)

• If Ishtar ∈ moviesLikedByUser then predict likes(U,M)


Basu et al 98 - results

• Evaluation:
– Predict liked(U,M)=“M in top quartile of U’s ranking” from features, evaluate recall and precision
– Features:
• Collaborative: UsersWhoLikedMovie, UsersWhoDislikedMovie, MoviesLikedByUser
• Content: Actors, Directors, Genre, MPAA rating, ...
• Hybrid: ComediesLikedByUser, DramasLikedByUser, UsersWhoLikedFewDramas, ...
• Results: at same level of recall (about 33%)
– Ripper with collaborative features only is worse than the original MovieRecommender (by about 5 pts precision – 73 vs 78)
– Ripper with hybrid features is better than MovieRecommender (by about 5 pts precision)
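For concreteness, the target label and the two metrics can be sketched as follows. The helper names, the tie-breaking by movie name, and the rule of keeping at least one movie per user are assumptions for illustration, not details from the paper.

```python
def top_quartile(votes):
    """Movies in the top quartile of one user's ratings.

    votes: {movie: rating}. Ties are broken by movie name and at least
    one movie is kept (both assumptions of this sketch).
    """
    ranked = sorted(votes, key=lambda m: (-votes[m], m))
    return set(ranked[: max(1, len(ranked) // 4)])

def precision_recall(predicted, actual):
    """Precision and recall of a predicted set of liked items."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical ratings for one user
votes = {"a": 9, "b": 8, "c": 7, "d": 6, "e": 5, "f": 4, "g": 3, "h": 2}
liked = top_quartile(votes)
p, r = precision_recall({"a", "c"}, liked)
```

Comparing systems "at the same level of recall" means tuning each one until it retrieves the same share of the truly liked movies, then comparing precision.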


Matrix Factorization for Collaborative Filtering


Recovering latent factors in a matrix

[Figure: the matrix V with n users (rows) and m movies (columns), entries $v_{11}, \ldots, v_{ij}, \ldots, v_{nm}$]

V[i,j] = user i’s rating of movie j


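Recovering latent factors in a matrix

$$
\begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix}
\begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}
\approx
\begin{pmatrix} v_{11} & \cdots & \\ \vdots & v_{ij} & \vdots \\ & \cdots & v_{nm} \end{pmatrix}
$$

Rows correspond to the n users and columns to the m movies.

V[i,j] = user i’s rating of movie j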


talk pilfered from …..

KDD 2011


Recovering latent factors in a matrix

[Figure: $V \approx W H$. V is n users × m movies; W is n × r with rows $(x_i, y_i)$; H is r × m with columns $(a_j, b_j)$; here r = 2.]

V[i,j] = user i’s rating of movie j


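… is like Linear Regression ….

[Figure: $Y \approx W H$ with m=1 regressors. W is the data matrix with n instances (e.g., 150) and r features (eg 4), rows $(pl_i, pw_i, sl_i, sw_i)$; H is the weight vector $(w_1, w_2, w_3, w_4)$; Y is the column of predictions $y_1, \ldots, y_i, \ldots, y_n$.]

Y[i,1] = instance i’s prediction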


.. for many outputs at once….

[Figure: $Y \approx W H$ with m regressors. W is the data matrix with n instances (e.g., 150) and r features (eg 4), rows $(pl_i, pw_i, sl_i, sw_i)$; H is the r × m weight matrix with entries $w_{11}, w_{12}, \ldots, w_{41}, \ldots$; Y is the n × m matrix of predictions $y_{11}, \ldots, y_{nm}$.]

Y[i,j] = instance i’s prediction for regression task j

… where we also have to find the dataset!


Matrix factorization as SGD

[Equation slide: the SGD update for the factors, with the step size labeled]


Matrix factorization as SGD - why does this work? Here’s the key claim:
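As a rough illustration of matrix factorization by SGD, here is a sketch that assumes plain squared error on the observed ratings and no regularization. The function names, step size, initialization range, and epoch count are illustrative choices, not the talk's settings.

```python
import random

def squared_error(ratings, W, H):
    """Sum of squared errors of W H over the observed (i, j, v) entries."""
    r = len(H)
    return sum(
        (v - sum(W[i][k] * H[k][j] for k in range(r))) ** 2
        for i, j, v in ratings
    )

def sgd_factorize(ratings, n, m, r=2, step=0.01, epochs=200, seed=0):
    """Fit V ~ W H by SGD over observed entries only.

    ratings: list of (i, j, v). Returns W (n x r) and H (r x m).
    Each update touches only row i of W and column j of H.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(0, 0.5) for _ in range(r)] for _ in range(n)]
    H = [[rng.uniform(0, 0.5) for _ in range(m)] for _ in range(r)]
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for i, j, v in data:
            err = v - sum(W[i][k] * H[k][j] for k in range(r))
            for k in range(r):
                w_ik, h_kj = W[i][k], H[k][j]
                # Gradient step on (v - w.h)^2; the factor 2 is folded into step.
                W[i][k] += step * err * h_kj
                H[k][j] += step * err * w_ik
    return W, H
```

Because each rating only changes one row of W and one column of H, updates for ratings that share neither a user nor a movie do not interfere with each other.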


What loss functions are possible?
• N1, N2 - diagonal matrices, sort of like IDF factors for the users/movies
• “generalized” KL-divergence


ALS = alternating least squares
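A minimal sketch of the alternating idea, restricted to a rank-1 factorization so that each least-squares subproblem has a one-line closed form. The function name `als_rank1`, the all-ones initialization, and the absence of regularization are assumptions made for illustration.

```python
def als_rank1(ratings, n, m, iters=50):
    """Rank-1 alternating least squares on observed entries.

    ratings: list of (i, j, v). Finds w (length n) and h (length m) with
    v ~ w[i] * h[j]. With h fixed, each w[i] is an independent 1-D
    least-squares problem with a closed-form solution, and vice versa.
    """
    w = [1.0] * n
    h = [1.0] * m
    for _ in range(iters):
        for i in range(n):  # fix h, solve for each w[i]
            num = sum(v * h[j] for a, j, v in ratings if a == i)
            den = sum(h[j] ** 2 for a, j, v in ratings if a == i)
            if den > 0:
                w[i] = num / den
        for j in range(m):  # fix w, solve for each h[j]
            num = sum(v * w[i] for i, b, v in ratings if b == j)
            den = sum(w[i] ** 2 for i, b, v in ratings if b == j)
            if den > 0:
                h[j] = num / den
    return w, h

# Hypothetical rank-1 ratings with the entry for user 2, movie 1 withheld
observed = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0)]
w, h = als_rank1(observed, n=3, m=2)
missing = w[2] * h[1]
```

For rank r > 1 the same alternation applies, but each subproblem becomes an r-dimensional least-squares solve per user or per movie.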


Matrix Multiplications in Machine Learning


Recovering latent factors in a matrix

[Figure: $V \approx W H$. V is n users × m movies; W is n × r with rows $(x_i, y_i)$; H is r × m with columns $(a_j, b_j)$.]

V[i,j] = user i’s rating of movie j


….. vs k-means

[Figure: $X \approx Z M$. X is the original data set (n examples); Z holds the indicators for r clusters, with rows such as (0 1) and (1 0); M holds the cluster means.]
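To make the analogy concrete, this sketch runs a basic k-means and returns the result in factored form: an indicator matrix Z and a matrix of means M, so that each row of Z M is the mean of that example's cluster. Initializing the means with the first r points and the helper names are assumptions of the sketch.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kmeans_factors(X, r, iters=10):
    """Basic k-means returning (Z, M) with X ~ Z M.

    Z is n x r with a single 1 per row (cluster indicator);
    M is r x d with one cluster mean per row.
    Initial means are the first r points (an assumption; iters >= 1).
    """
    M = [list(x) for x in X[:r]]
    for _ in range(iters):
        assign = [
            min(range(r), key=lambda c: sum((a - b) ** 2 for a, b in zip(x, M[c])))
            for x in X
        ]
        for c in range(r):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                M[c] = [sum(col) / len(members) for col in zip(*members)]
    Z = [[1 if assign[i] == c else 0 for c in range(r)] for i in range(len(X))]
    return Z, M

X = [(0, 0), (0, 1), (10, 10), (10, 11)]  # hypothetical data
Z, M = kmeans_factors(X, r=2)
approx = matmul(Z, M)
```

The difference from general matrix factorization is the constraint on Z: every row must be a one-hot indicator rather than an arbitrary real-valued vector.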


Matrix multiplication - 1 (Gram matrix)

[Figure: $X X^T = V$. X has n instances (e.g., 150) and r features (eg 2), rows $(x_i, y_i)$; $X^T$ is the transpose of X; V is n × n with entries $\langle x_i, x_j \rangle$.]

V[i,j] = inner product of instances i and j (Gram matrix)
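A tiny sketch of the Gram matrix computation in plain Python; the function name `gram` and the data are illustrative.

```python
def gram(X):
    """Gram matrix of X: entry [i][j] is the inner product of rows i and j."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

X = [(1, 2), (3, 4), (0, 1)]  # 3 instances, 2 features (hypothetical data)
G = gram(X)
# G[0][1] is 1*3 + 2*4 = 11
```

The result is always n × n and symmetric, regardless of the number of features.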


Matrix multiplication - 2 (Covariance matrix)

[Figure: $X^T X = C_X$, assuming mean(X)=0. $X^T$ is the transpose of X; X has rows $(x_i, y_i)$; $C_X$ is r × r for r features (eg 2), with entries $v_{11}, v_{12}, v_{21}, v_{22}$.]


Matrix multiplication - 2 (PCA)
• $V = C_X = X^T X$ … variance/covariances
• $E = \text{eigenvectors}(V)$
• $X E^T = \text{PCA}(X) = Z$
• Z(i,j) is similarity of example i to eigenvector j

[Figure: X (rows $(x_i, y_i)$) times $E^T$ (the eigenvectors of $C_X$, entries $e_{11}, e_{12}, e_{21}, e_{22}$) gives Z, one row per example.]

I think of these as a fixed point of a process where we predict each feature value from the others and $C_X$


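Matrix multiplication - 2 (PCA)
• $V = C_X = X^T X$ … variance/covariances
• $E = \text{eigenvectors}(V)$
• $X E_K^T = \text{PCA}(X) = Z_K$ … or use E(1:K, :) instead of E

[Figure: X (rows $(x_i, y_i)$) times $E_K^T$ (the top K eigenvectors of $C_X$, here the single column $e_{11}, e_{21}$) gives $Z_K$ with entries $z_1, z_2, \ldots, z_n$.]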
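A sketch of the K = 1 case in plain Python: center X, form $X^T X$, find its top eigenvector by power iteration, and project. Power iteration is used here only because it needs no libraries; the helper names and the all-ones starting vector (which would fail if it were orthogonal to the top eigenvector) are assumptions.

```python
import math

def center(X):
    """Subtract the column means so that mean(X) = 0."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - mu for x, mu in zip(row, means)] for row in X]

def scatter(X):
    """X^T X for a centered X: the (unnormalized) covariance matrix."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenvector(V, iters=200):
    """Top eigenvector of a symmetric PSD matrix by power iteration."""
    e = [1.0] * len(V)
    for _ in range(iters):
        e = [sum(v * x for v, x in zip(row, e)) for row in V]
        norm = math.sqrt(sum(x * x for x in e))
        e = [x / norm for x in e]
    return e

X = [(1, 1), (2, 2), (3, 3), (4, 4)]  # hypothetical, perfectly correlated features
Xc = center(X)
e1 = top_eigenvector(scatter(Xc))
z = [sum(a * b for a, b in zip(row, e1)) for row in Xc]  # Z_K for K = 1
```

Because the two features here move together, a single component along the diagonal direction captures all of the variance.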


Matrix multiplication - 3 (SVD)
• $V = C_X = X^T X$ … variance/covariances
• $E = \text{eigenvectors}(V)$
• $X E^T = \text{PCA}(X) = Z$
• $X = Z E = Z \Sigma^{-1} \Sigma E = U \Sigma E$ … where $U = Z \Sigma^{-1}$ … i.e., factored version of X
• Usually written as $X = U \Sigma V^T$

[Figure: X (rows $(x_i, y_i)$) times E (the eigenvectors of $C_X$, entries $e_{11}, e_{12}, e_{21}, e_{22}$) gives Z, one row per example.]


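Matrix multiplication - 3 (SVD)
• $V = C_X = X^T X$ … variance/covariances
• $E = \text{eigenvectors}(V)$
• $X E_K^T = \text{PCA}(X) = Z_K$ … $E_K$ = E(1:K, :) instead of E
• $X \approx Z_K E_K \approx Z_K \Sigma^{-1} \Sigma E \approx U_K \Sigma E$ … where $U_K = Z_K \Sigma^{-1}$ … i.e., factored version of X

[Figure: X (rows $(x_i, y_i)$) and E (the eigenvectors of $C_X$, entries $e_{11}, e_{12}, e_{21}, e_{22}$); $U_K$ has entries $z_1, z_2, \ldots, z_n$; Σ has the entry $\Sigma_1$.]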


Matrix multiplication - 2 (SVD)

[Figure: $X \approx U_K \Sigma E$. X is the original matrix (n instances); $U_K$ holds K features with zero covar and unit variances; Σ is diagonal with entries $\Sigma_1, \Sigma_2$; E holds the eigenvectors of $C_X$.]


Recovering latent factors in a matrix

[Figure: $V \approx W H$. V is n users × m movies; W is n × r with rows $(x_i, y_i)$; H is r × m with columns $(a_j, b_j)$.]

V[i,j] = user i’s rating of movie j