Introduction to preference learning
Jose A. Rodriguez-Serrano
June 14, 2016
Introduction to preference learning
Preference learning
Inducing predictive preference models from empirical data.
Caveats:
This presentation: some supervised models only
Human preferences are very complex (and might be inconsistent)
Machine learning is not magic: it infers model parameters from data, and only works when the data can be transformed into a “signal” with sufficient information about the target task
Additional Sources
Preference Learning: A Tutorial Introduction (Fürnkranz and Hüllermeier), http://www.ke.tu-darmstadt.de/events/PL-12/slides/PL-Tutorial-1.pdf
http://www.preference-learning.org/
B. Kulis, Metric Learning: A Survey, 2012, http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf
Where does this come from?
Learning user preferences in search engines using click-through data
Joachims, Optimizing search engines using clickthrough data, SIGKDD 2002
Learning preference models for transport
e.g. Chidlovskii, Improved Trip Planning by Learning from Travelers’ Choices, Mining Urban Data, 2015.
Option 1
Estimated duration = 38 min
# changes = 1
Frequency (waiting time) = 6 min
Walking time = 11 min
Cost = (1.7+2) EUR
Data: x1 = [38, 1, 6, 11, 3.7, . . .], x2 = . . . , x3 = . . . , x4 = . . .
Label: ℓ(x1, x2, x3, x4) = (1, 0, 0, 0)
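In code, this data construction might look as follows. A minimal sketch: the feature values for options 2–4 are invented for illustration, since the slide elides them; only option 1 is taken from the slide.

```python
# Each trip option becomes a feature vector; the chosen option gets label 1.
# Feature order (from the slide): duration (min), # changes, waiting (min),
# walking (min), total cost (EUR).
x1 = [38, 1, 6, 11, 1.7 + 2]   # Option 1, as on the slide (cost 1.7+2 EUR)
x2 = [45, 0, 10, 5, 2.0]       # hypothetical values for illustration
x3 = [30, 2, 4, 15, 4.5]       # hypothetical
x4 = [55, 1, 8, 3, 1.5]        # hypothetical

options = [x1, x2, x3, x4]
labels = [1, 0, 0, 0]          # the traveler chose Option 1
```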
Other examples: question answering, ad selection, . . .
“Semantic” search
A. Gordo, J. A. Rodriguez-Serrano, F. Perronnin, E. Valveny, Leveraging Category-Level Labels for Instance-Level Image Retrieval, CVPR 2012.
Preference learning
Inducing predictive preference models from empirical data.
Learning to rank preferences
Learning to compare items
Applications: model / explain / predict user preferences for products? Compare customers?
Outline
1 Motivation
2 Ranking Preferences
3 Comparing items
4 Perspectives
Ranking preferences
How to do that
(plotted using XKCDify)
SVM for preference learning
Compatibility features of user and product: f(u,p) = [f_up^1, . . . , f_up^D]
Relevance score: R(u,p) = wT f(u,p)
We want: w^T f(u,p+) > w^T f(u,p−) ⇒ w^T (f(u,p+) − f(u,p−)) > 0 (this is a linear classifier!)
w can be learned with a linear classifier:
1 Construct data. Each triplet (u, p1, p2) is a sample (x_i, ℓ_i) with
features: x_i = f(u,p1) − f(u,p2)
label: ℓ_i = +1 if p1 is preferred, ℓ_i = −1 if p2 is preferred
2 Train a binary linear classifier (e.g. SVM) on these data
Alternative: Structured output learning Nowozin and Lampert, Structured Learning and Prediction in Computer Vision, 2011
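A minimal sketch of this recipe in Python (not from the original deck): the triplets are synthetic, w_true is a made-up scoring direction used only to generate preference labels, and the classifier is a plain hinge-loss SGD rather than a full SVM solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): f(u, p) is given directly as a feature vector.
D = 5
w_true = rng.normal(size=D)            # unknown "true" preference direction

# 1) Construct data: each triplet (u, p1, p2) becomes x_i = f(u,p1) - f(u,p2)
#    with label +1 if p1 is preferred, -1 otherwise.
f_p1 = rng.normal(size=(200, D))
f_p2 = rng.normal(size=(200, D))
X = f_p1 - f_p2
y = np.where(X @ w_true > 0, 1.0, -1.0)

# 2) Train a binary linear classifier on (X, y) with hinge loss via SGD.
w = np.zeros(D)
eta, lam = 0.1, 1e-3                   # learning rate, L2 regularization
for epoch in range(50):
    for xi, yi in zip(X, y):
        w *= (1 - eta * lam)           # regularization shrink
        if yi * (w @ xi) < 1:          # margin violated
            w += eta * yi * xi

train_acc = np.mean(np.sign(X @ w) == y)
```

Because the data are separable by construction, the learned w ranks almost all training pairs correctly.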
Special case 1: Multi-class ranking SVM (one model per product)
Preference for product 1: x_u^T w_1
Preference for product 2: x_u^T w_2
Multi-class ranking SVM
Data comes in the form of preference triplets (x, c+, c−)
We want w_{c+}^T x > w_{c−}^T x
Multi-class ranking SVM (2)
Method to learn ω = (w_1, w_2, . . .):
1 Sample (x, c+, c−)
2 If w_{c+}^T x > w_{c−}^T x + 1, then δ_i = 0, else δ_i = 1
3 Update (the data terms act only when the margin is violated):
w_{c+} ← w_{c+}(1 − ηλ) + δ_i λ x_i
w_{c−} ← w_{c−}(1 − ηλ) − δ_i λ x_i
w_k ← w_k(1 − ηλ) for all other k
4 Go to (1) until convergence
We obtain the w’s that represent each product (they can be interpreted as an “encoding” of the products)
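The steps above can be sketched on synthetic data as follows (a sketch, not the original implementation: W_true exists only to generate consistent preference triplets, and the values of η and λ are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)

D, C = 4, 3                        # feature dim, number of products
W = np.zeros((C, D))               # one ranking model w_k per product
W_true = rng.normal(size=(C, D))   # hypothetical, used only to label triplets

eta, lam = 0.01, 0.1               # shrink factor (1 - eta*lam), step lam

for step in range(5000):
    # 1) Sample a preference triplet (x, c+, c-)
    x = rng.normal(size=D)
    cp, cn = rng.choice(C, size=2, replace=False)
    if W_true[cp] @ x < W_true[cn] @ x:
        cp, cn = cn, cp            # make cp the preferred product

    # 2) delta = 0 if the margin already holds, else 1
    delta = 0.0 if W[cp] @ x > W[cn] @ x + 1 else 1.0

    # 3) Shrink all w_k, then push the violated pair apart
    W *= (1 - eta * lam)
    W[cp] += delta * lam * x
    W[cn] -= delta * lam * x

# Sanity check: learned scores should reproduce the true preference order
test_x = rng.normal(size=(500, D))
true_pref = (test_x @ W_true[0]) > (test_x @ W_true[1])
pred_pref = (test_x @ W[0]) > (test_x @ W[1])
agreement = np.mean(true_pref == pred_pref)
```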
Stochastic gradient descent
L(ω) = ∑_{i,p,n} L_{i,p,n} = ∑_{i,p,n} max(0, 1 − w_p^T x_i + w_n^T x_i) + η ∑_j ‖w_j‖²   (1)
Gradient descent: take one step along the negative gradient: ω ← ω − λ ∂L/∂ω
Stochastic gradient descent:
1 Sample i, p, n
2 Compute ω ← ω − λ ∂L_{i,p,n}/∂ω
(This recovers the update rule of the previous slide)
Bottou, Large-Scale Learning with Stochastic Gradient Descent, 2010
Special case 2: Feature embeddings
Compatibility between a user and a product: g(u)^T W h(p)
x = g(u) = [x_1, . . . , x_D]^T (user features)
p = h(p) = [p_1, . . . , p_E]^T (product features)
1 model only; the customer and product sets are open.
MotivationRanking Preferences
Comparing itemsPerspectives
IntroductionSVM for preference learningMulti-class ranking SVMFeature embeddingsRecap
Special case 2: Feature embeddings (2)
1 Sample (x_i, p+, p−)
2 Check whether x^T W p+ > x^T W p− + 1
3 If false, update:
W ← W + λ x (p+ − p−)^T
4 Go to (1) until convergence
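A sketch of this bilinear update on synthetic data (W_true and the step size λ are illustrative assumptions; preferences are generated from W_true only to decide which product is p+).

```python
import numpy as np

rng = np.random.default_rng(2)

D, E = 6, 5                          # user-feature dim, product-feature dim
W = np.zeros((D, E))                 # bilinear compatibility x^T W p
W_true = rng.normal(size=(D, E))     # hypothetical, used only to order pairs

lam = 0.05                           # step size

for step in range(20000):
    # 1) Sample (x_i, p+, p-)
    x = rng.normal(size=D)
    p_pos = rng.normal(size=E)
    p_neg = rng.normal(size=E)
    if x @ W_true @ p_pos < x @ W_true @ p_neg:
        p_pos, p_neg = p_neg, p_pos  # p+ is the preferred product

    # 2) Check x^T W p+ > x^T W p- + 1; 3) if false, rank-1 update
    if not (x @ W @ p_pos > x @ W @ p_neg + 1):
        W += lam * np.outer(x, p_pos - p_neg)

# Sanity check on fresh samples: learned W should order products like W_true
xs = rng.normal(size=(500, D))
ps = rng.normal(size=(500, E))
qs = rng.normal(size=(500, E))
true = np.einsum('id,de,ie->i', xs, W_true, ps) > np.einsum('id,de,ie->i', xs, W_true, qs)
pred = np.einsum('id,de,ie->i', xs, W, ps) > np.einsum('id,de,ie->i', xs, W, qs)
agreement = np.mean(true == pred)
```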
Bai et al, Supervised Semantic Indexing, CIKM 2009
Simplicity matters
Recap & thoughts
Different ways to learn and predict preferences:
SVM for preference learning → when the function f(u,p) is known; 1 model only
Multi-class ranking SVM → closed set of products; 1 model per product
Label embedding → open set of products; 1 model only
Properties
Simple (cost of deploying is small)
Easy to personalize (initialize from a global model)
Potential applications:
Best next product recommender
Understand user preferences for channel
Outline
1 Motivation
2 Ranking Preferences
3 Comparing items
4 Perspectives
Metric Learning
Supervised notion that customer x should be more similar to y than to z.
Express as computing a similarity between customers: a(x,y) = x^T W y
Same solution as before
Example: Find similar customers to customers who have purchased aproduct.
Metric learning (2)
Low-rank decomposition W = U^T U (W is D×D, U is K×D, K < D)
Learning rule becomes U ← U + λ U(x_i (x_i^+ − x_i^−)^T + (x_i^+ − x_i^−) x_i^T) on violated triplets (a subgradient step on the ranking loss)
(Supervised) dimensionality reduction: a(x,y) = (Ux)^T (Uy)
Bai et al., Supervised Semantic Indexing, CIKM 2009
Chechik et al., Large Scale Online Learning of Image Similarity Through Ranking, JMLR 2010
Davis et al., Information Theoretic Metric Learning, ICML 2007
Hu et al., Discriminative deep metric learning for face verification in the wild, CVPR 2014
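A sketch of low-rank metric learning with this multiplicative update (everything here is an illustrative assumption: the supervision signal judges similarity on the first two raw coordinates, and the hyperparameters are made up; the update is applied as an ascent step on the violated margin).

```python
import numpy as np

rng = np.random.default_rng(3)

D, K = 8, 3                              # input dim, reduced dim (K < D)
U = 0.1 * rng.normal(size=(K, D))        # W = U^T U, so a(x, y) = (Ux)^T (Uy)
lam = 0.01

def sim(U, x, y):
    return (U @ x) @ (U @ y)

def positive_first(x, a, b):
    # Hypothetical supervision: similarity judged on the first two coordinates
    if x[:2] @ a[:2] < x[:2] @ b[:2]:
        a, b = b, a
    return a, b

for step in range(5000):
    x, a, b = rng.normal(size=(3, D))
    x_pos, x_neg = positive_first(x, a, b)
    if sim(U, x, x_pos) < sim(U, x, x_neg) + 1:   # ranking margin violated
        d = x_pos - x_neg
        U += lam * U @ (np.outer(x, d) + np.outer(d, x))

# Sanity check on fresh triplets
correct = 0
for _ in range(500):
    x, a, b = rng.normal(size=(3, D))
    x_pos, x_neg = positive_first(x, a, b)
    correct += sim(U, x, x_pos) > sim(U, x, x_neg)
accuracy = correct / 500
```

Since W = U^T U stays positive semi-definite, this step provably does not decrease the violated margin, which is what makes the factored update workable.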
Canonical Correlation Analysis
Multiple views of data:
x_1 = {x_{1,1}, . . . , x_{1,D}}   y_1 = {y_{1,1}, . . . , y_{1,D}}
x_2 = {x_{2,1}, . . . , x_{2,D}}   y_2 = {y_{2,1}, . . . , y_{2,D}}
. . . (e.g. transactionality, socio-demographic variables, etc.)
Canonical correlation analysis (CCA)
Project multiple views of data to a common subspace wherecorrelation is maximized.
C(w_k, u_k) = (w_k^T X^T Y u_k) / √((w_k^T X^T X w_k)(u_k^T Y^T Y u_k))   (2)
s.t. w_k^T X^T X w_k = 1, u_k^T Y^T Y u_k = 1   (3)
Solution: generalized eigenvalue problem X^T Y (Y^T Y + ρI)^{−1} Y^T X w_k = λ² (X^T X + ρI) w_k
H. Hotelling, Relations between two sets of variables, Biometrika, 1936.
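The objective and its eigenvalue solution can be exercised on toy two-view data (a sketch; the data-generating process, with one latent signal shared between the views, is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(4)

# Two views of the same subjects sharing one latent signal (toy data)
n = 500
z = rng.normal(size=n)                               # shared latent variable
X = np.column_stack([z, rng.normal(size=n)]) + 0.1 * rng.normal(size=(n, 2))
Y = np.column_stack([-z, rng.normal(size=n)]) + 0.1 * rng.normal(size=(n, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

rho = 1e-3                                           # regularizer
Cxx = X.T @ X + rho * np.eye(2)
Cyy = Y.T @ Y + rho * np.eye(2)
Cxy = X.T @ Y

# Generalized eigenvalue problem from the slide:
#   X^T Y (Y^T Y + rho I)^{-1} Y^T X w_k = lambda^2 (X^T X + rho I) w_k
A = Cxy @ np.linalg.solve(Cyy, Cxy.T)
evals, evecs = np.linalg.eig(np.linalg.solve(Cxx, A))
w = np.real(evecs[:, np.argmax(np.real(evals))])     # top direction for X
u = np.linalg.solve(Cyy, Cxy.T @ w)                  # paired direction for Y

# The projections of the two views should be strongly correlated
corr = np.corrcoef(X @ w, Y @ u)[0, 1]
```

The top canonical pair recovers the shared latent signal, so the magnitude of the correlation between the two projections is close to 1.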
Recap & thoughts
Different ways to “learn to compare” items:
Metric learning: 1 model only (supervised by proximity)
Canonical correlation analysis (consolidates multiple views)
Potential applications:
Find similar customers
Best next product recommender
Understand user preferences for channel
Improve predictions of similarity-based regression
Find new buyers
Outline
1 Motivation
2 Ranking Preferences
3 Comparing items
4 Perspectives
Relation to deep learning (1)
Projection-based methods are basically 1-layer perceptrons: h = Ux, h_i = ∑_j U_ij x_j
In metric learning, instead of minimizing the squared error, we minimize a ranking loss (a “ranking perceptron”). We can add all the “deep learning” creativity on top.
“Nothing is stronger than an idea whose time has come” (attributed to Victor Hugo)
Relation to deep learning (2)
Learn a metric on top of Restricted Boltzmann Machines, Deep Belief Networks, (Stacked) (Denoising) Autoencoders, etc.