Applied Machine Learning for Search Engine Relevance
Charles H Martin, PhD
Relevance as a Linear Regression
r = Xw + e
x: (tf-idf) bag-of-words vector
r: relevance score (i.e. +1/−1)
w: weight vector
w = (X†X)⁻¹ X†r
x = one query (model*)

form X from the data (i.e. a group of queries)

Solve as a numerical minimization (i.e. iterative methods like SOR, CG, etc.)

min_w ||Xw − r||²   (||w||₂: the 2-norm of w)
*Actually, we will model and predict pairwise relations and not exact rank… stay tuned.
Moore-Penrose Pseudoinverse
Relevance as a Linear Regression: Tikhonov Regularization
w = (X†X)⁻¹ X†r
Problem: the inverse may not exist (numerical instabilities, poles)

Solution: add a constant a to the diagonal of (X†X):

w = (X†X + aI)⁻¹ X†r

a: single, adjustable smoothing parameter
Equivalent minimization problem
min_w ||Xw − r||² + a ||w||²

More generally: form (something like) X†X + G†G + aI, which is a self-adjoint, bounded operator =>

min_w ||Xw − r||² + a ||Gw||², i.e. with G chosen to avoid over-fitting
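A minimal numpy sketch of the regularized solution above; the toy data and the value of the smoothing parameter a are invented for illustration:

```python
import numpy as np

# toy data: 100 query/document rows with 20 tf-idf features, relevance scores r
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
r = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=100))  # +1/-1 scores

a = 0.5  # single, adjustable smoothing parameter (made up for the example)

# Tikhonov-regularized solution: w = (X†X + aI)^-1 X†r
w = np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ r)

# equivalently, w is the minimizer of ||Xw - r||^2 + a ||w||^2
pred = X @ w
```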
The Representer Theorem Revisited: Kernels and Green's Functions
f(x) = Σ_i a_i R(x, x_i),   R := Kernel
Problem: estimate a function f(x) from training data (x_i, y_i)
Solution: solve a general minimization problem
min_f Σ_i Loss[f(x_i), y_i] + a ||Gf||²
Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)
min_a Σ_i Loss[f(x_i), y_i] + aᵀKa,   K_ij = R(x_i, x_j)
Equivalent to: given a linear regularization operator (G: H → L²(x)),

where K is an integral operator: (Kf)(y) = ∫ R(x, y) f(x) dx

so K is the Green's function for G†G, or G = (K^(1/2))⁻¹

in Dirac notation: R(x, y) = ⟨y| (G†G)⁻¹ |x⟩
f(x) = Σ_i a_i R(x, x_i) + Σ_u b_u u(x);  the functions u span the null space of G
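As an illustration of the representer-theorem form above, here is a small kernel ridge sketch with a squared loss and an RBF kernel R; the kernel choice, its width, and the toy data are assumptions for the example, not the talk's actual setup:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # R(x, x') = exp(-gamma ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

a = 0.1                                              # regularization constant
K = rbf_kernel(X, X)                                 # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(X)), y)   # minimizes ||K a - y||^2 + a'ᵀ K a' style objective

# representer form: f(x) = sum_i alpha_i R(x, x_i)
X_new = rng.normal(size=(5, 3))
f_new = rbf_kernel(X_new, X) @ alpha
```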
Personalized Relevance Algorithms: eSelf Personality Subspace
[Diagram: pages (p) the user is reading (e.g. a music site) update learned personality traits (q), e.g. Cars: 0.4, Sports cars: 0.0 => 0.3, Rock-n-roll, Hard rock; the learned traits (Likes cars 0.4, Sports cars 0.3) then determine which pages/ads to present to the user (e.g. a used sports car ad).]
Compute personality traits during the user's visit to the web site; q values = stored, learned "personality traits"
Provide relevance rankings (for pages or ads) which include personality traits
Personalized Relevance Algorithms: eSelf Personality Subspace
model: L [p, q] = [h, u], where L is a square matrix

h: history (observed outputs)
p: output nodes (observables): web pages, classified ads, …
q: hidden nodes (not observed): individualized personality traits
u: user segmentation
Personalized Search: Effective Regression Problem
[p, q](t) = (L_eff[q(t−1)])⁻¹ · [h, u](t) on each time step t

[ PLP  PLQ ] [ p ]   [ h ]
[ QLP  QLQ ] [ q ] = [ u ]

PLP p + PLQ q = h
QLP p + QLQ q = 0

Formal solution =>

L_eff = PLP − PLQ (QLQ)⁻¹ QLP

L_eff p = h

p = (L_eff[q, u])⁻¹ h
Adapts on each visit, finding relevant pages p(t) based on the links L, and the learned personality traits (q(t-1))
Regularization of PLP achieved with “Green’s Function / Resolvent Operator”
i.e. G†G ≈ PLQ (QLQ)⁻¹ QLP
Equivalent to Gaussian Process on a Graph, and/or Bayesian Linear Regression
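A small numpy sketch of folding the hidden q block into an effective operator on the p block, following the elimination step above; the link matrix L, the P/Q split, and the sizes are invented:

```python
import numpy as np

# toy "link" matrix L over 3 observable (P) and 4 hidden (Q) nodes
rng = np.random.default_rng(2)
n_p, n_q = 3, 4
L = rng.normal(size=(n_p + n_q, n_p + n_q))
L += (n_p + n_q) * np.eye(n_p + n_q)          # keep the toy matrix well conditioned

P, Q = np.arange(n_p), np.arange(n_p, n_p + n_q)
PLP, PLQ = L[np.ix_(P, P)], L[np.ix_(P, Q)]
QLP, QLQ = L[np.ix_(Q, P)], L[np.ix_(Q, Q)]

h = rng.normal(size=n_p)                      # observed history on the P nodes

# eliminate the hidden q block: q = -(QLQ)^-1 QLP p  =>  L_eff p = h
L_eff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(L_eff, h)

# check against solving the full block system with [h, 0] on the right-hand side
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))
assert np.allclose(p, full[:n_p])
```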
Related Dimensional Noise Reductions: Rank-(k) Approximations of a Matrix
Latent Semantic Analysis (LSA) / (Truncated) Singular Value Decomposition (SVD): diagonalize the density operator D = A†A and retain a subset of (k) eigenvalues/vectors
Equivalent relations for SVD
Optimal rank-(k) approximation X s.t. min ||D − X||₂²

Decomposition: A = U Σ V†,   A†A = V (Σ†Σ) V†

D = [ PDP  PDQ ]
    [ QDP  QDQ ]
*Variable Latent Semantic Indexing (Yahoo! Research Labs): http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
VLSA* provides a rank-(k) approximation for any query q: min E[ ||qᵀ(D − X)||₂² ]
Can generalize to various noise models: i.e. VLSA*, PLSA**
**Probabilistic Latent Semantic Indexing (Recommind, Inc): http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
PLSA** provides a rank-(k) approximation over classes (z): min D_KL[P || P(data)],   P = U Σ V†

P(d, w) = Σ_z P(d|z) P(z) P(w|z);   D_KL = Kullback–Leibler divergence
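A minimal truncated-SVD (LSA-style) sketch on an invented document-term matrix, showing the rank-k approximation and its relation to the density operator D = A†A:

```python
import numpy as np

# toy document-term count matrix A (6 documents x 8 terms), invented for illustration
rng = np.random.default_rng(3)
A = rng.poisson(1.0, size=(6, 8)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Σ V†

k = 2                                              # keep the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # optimal rank-k approximation of A

# the same subspace diagonalizes the "density operator" D = A†A
D = A.T @ A
evals = np.sort(np.linalg.eigvalsh(D))[::-1]
assert np.allclose(np.sqrt(evals[:k]), s[:k])
```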
Personalized Relevance: Mobile Services and Advertising
France Telecom: Last Inch Relevance Engine
[Diagram: context (time, location) feeds a "suggest" step that proposes services such as play game, send msg, play song, …]
KA for Comm Services
• Based on the Empirical Bayesian score and a Suggestion mapping table, a decision is made to suggest one or more possible Comm services.
• Based on Business Intelligence (BI) data mining and/or Pattern Recognition algorithms (i.e. supervised or unsupervised learning), we compute statistical scores indicating who are the most likely people to Call, or to send an SMS, MMS, or E-Mail.
[Diagram: events plus personal context (q, e.g. Sunday mornings) map to a contextual comm service; learned traits (e.g. on Sunday morning, most likely to call Mom) drive the suggestions for the user (p): Call [who], SMS [who], MMS [who], E-mail, with candidates such as Mom (5), Bob (3), Phone company (1).]
[Diagram: comm/call patterns to different #'s broken out by location (LOC), day of week, and period of day (POD), with the corresponding conditional probabilities p(· | POD), p(· | SUN), etc.]
Bayesian Score Estimation

To estimate p(call | POD):

frequency: p(call | POD) = # of times the user called someone at that POD

Bayesian: p(call | POD) = p(POD | call) p(call) / Σ_q p(POD | q) p(q), where q = call, sms, mms, or email

i.e. a Bayesian Choice Estimator
• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from other PODs, assuming this is less biased, to improve our statistical estimate
Example: 5 days, 3 PODs, 3 choices

frequency estimator: f(choice | POD 1) = 2/5

Bayesian choice estimator:

p(choice | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ] = 6/23 ~ 1/4

Note: the Bayesian estimate is significantly lower because we now expect we might see one of the other choices at POD 1
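A tiny sketch of the Bayesian choice estimator applied to the numbers above; the choices are labeled A, B, C because the original icons did not survive:

```python
# plug the slide's values for POD 1 into the Bayesian choice estimator
p_pod_given_choice = {"A": 2/5, "B": 2/5, "C": 1/5}      # p(POD 1 | choice)
p_choice           = {"A": 3/15, "B": 3/15, "C": 11/15}  # p(choice), pooled over all PODs

def bayes_choice(c):
    # p(c | POD 1) = p(POD 1 | c) p(c) / sum_q p(POD 1 | q) p(q)
    num = p_pod_given_choice[c] * p_choice[c]
    den = sum(p_pod_given_choice[q] * p_choice[q] for q in p_choice)
    return num / den

print(bayes_choice("A"))   # 6/23 ~ 0.26, vs. the raw frequency estimate 2/5 = 0.4
```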
Incorporating Feedback
• It is not enough to simply recognize call patterns in the Event Facts; it is also necessary to incorporate feedback into our suggestion scores
• p( c | user, pod, loc, facts, feedback ) = ?
[Diagram: Event Facts feed the Suggestions; feedback on each suggestion (random / irrelevant / poor / good) flows back into the scores.]
A: Simply factorize:

p(c | user, pod, facts, feedback) = p(c | user, pod, facts) · p(c | user, pod, feedback)

Evaluate the probabilities independently, perhaps using different Bayesian models
Personalized Relevance: Empirical Bayesian Models
Closed form models:
Correct a sample estimate (mean m, variance σ²) with a weighted average of the sample and the complete data set:

m̂ = B m_sample + (1 − B) m_complete

B: shrinkage factor
i.e. combine the individual sample with the user segment

[Diagram: per-service estimates for play game, send msg, play song, ranked 1, 2, 3.]

Can rank-order the mobile services based on the estimated likelihoods (m, σ²)
Personalized Relevance: Empirical Bayesian Models
What is Empirical Bayes modeling?
specify Likelihood L(y|θ) and Prior π(θ) distributions

estimate the posterior: π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ   (the denominator is the marginal)
Combines Bayesianism and frequentism: approximates the marginal (or posterior) using a point estimate (MLE), Monte Carlo, etc.

Estimates the marginal using empirical data

Uses the empirical data to infer the prior, then plugs it into the likelihood to make predictions
Note: Special case of Effective Operator Regression:
P space ~ Q space;  PLQ = I;  u → 0
Q-space defines prior information
Empirical Bayesian Methods: Poisson Gamma Model
Likelihood L(y|λ) = Poisson distribution: (λ^y e^(−λ)) / y!
Conjugate Prior π(λ; a, b) = Gamma distribution: (λ^(a−1) e^(−λ/b)) / (Γ(a) b^a);  λ > 0

posterior π(λ|y) ∝ L(y|λ) π(λ) = ((λ^y e^(−λ)) / y!) · ((λ^(a−1) e^(−λ/b)) / (Γ(a) b^a))

∝ λ^(y+a−1) e^(−λ(1 + 1/b))

also a Gamma distribution Γ(a′, b′):  a′ = y + a;  b′ = (1 + 1/b)⁻¹
Take the MLE estimate of the marginal = the mean (m) of Γ(a, b); obtain a, b from the mean (m = ab) and variance (ab²) of the complete data

The final point estimate E(y) = a′b′ for a sample is a weighted average of the sample mean m_y and the prior mean m:

E(y) = (m_y + a) (1 + 1/b)⁻¹

E(y) = (b/(1+b)) m_y + (1/(1+b)) m
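A short Poisson-Gamma empirical Bayes sketch in numpy; the per-user counts are invented, but the moment matching and the shrinkage formula follow the slide:

```python
import numpy as np

# "complete data set": weekly counts of one comm service across many users (invented)
rng = np.random.default_rng(4)
complete = rng.poisson(3.0, size=1000).astype(float)

# fit the Gamma prior from the complete data: mean m = a*b, variance = a*b^2
m, v = complete.mean(), complete.var()
b = v / m
a = m / b

# posterior for a single user observation y: Gamma(a' = y + a, b' = (1 + 1/b)^-1)
y = 7.0
a_prime, b_prime = y + a, 1.0 / (1.0 + 1.0 / b)

# final point estimate is a weighted average of the sample and the prior mean m
E = a_prime * b_prime
assert np.isclose(E, (b / (1 + b)) * y + (1 / (1 + b)) * m)
print(E)
```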
Linear Personality Matrix
[Diagram: matrix M over suggestions and actions, enumerated per event.]

Linear (or non-linear) matrix transformation: M s = a
Notice: the personality matrix may or may not mix suggestions across events, and can include semantic information
Over time, we can estimate M_{a,s} = prob(a | s), and can then solve for prob(s) using a computational linear solver: s = M⁻¹ a

i.e. s1 = call, s2 = sms, s3 = mms, s4 = email

i.e. for a given time and location, count how many times we suggested a call but the user chose an e-mail instead
Obviously we would like M to be diagonal, or as close to diagonal as possible! Can we devise an algorithm that will learn to give "optimal" suggestions?
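A small numpy sketch of this bookkeeping: estimate M_{a,s} = prob(a | s) from (invented) suggestion/action counts, then solve M s = a with a linear solver:

```python
import numpy as np

# rows = actions taken, columns = suggestions made (call, sms, mms, email),
# counted over some time/location bucket; the numbers are invented
counts = np.array([[30.,  2.,  1.,  5.],
                   [ 3., 20.,  2.,  1.],
                   [ 1.,  1., 10.,  1.],
                   [ 6.,  2.,  2., 25.]])

M = counts / counts.sum(axis=0, keepdims=True)   # M[a, s] ~ prob(action a | suggestion s)

a = np.array([0.4, 0.25, 0.1, 0.25])             # observed action distribution
s = np.linalg.solve(M, a)                        # solve M s = a for the suggestion mix
print(s)
```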
Matrices for Pattern Recognition (Statistical Factor Analysis)
[Matrix A: rows enumerate the choices (Call on Mon @ pod 1, Call on Mon @ pod 2, Call on Mon @ pod 3, …, Sms on Tue @ pod 1, …); columns are weeks 1, 2, 3, 4, 5, …]
We can apply computational linear algebra to remove noise and find patterns in the data. Statisticians call this Factor Analysis; engineers call it Singular Value Decomposition (SVD). It is implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.
1. Enumerate all choices
2. Count the # of times each choice is made each week
3. Form the weekly choice density matrix AᵗA
4. Weekly patterns are collapsed into the density matrix AᵗA; they can be detected using spectral analysis (i.e. the principal eigenvalues)

[Diagram: eigenvalue spectrum of AᵗA separating "all weekly patterns" from "pure noise".]
Similar to (multinomial) Latent Dirichlet Allocation (LDA), but much simpler to implement. Suitable when the number of choices is not too large and the patterns are weekly.
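Oracle Data Mining itself is not shown here, so as a stand-in, a sketch using scikit-learn's NMF on an invented choices-by-weeks matrix, with the eigenvalues of AᵗA for comparison:

```python
import numpy as np
from sklearn.decomposition import NMF

# toy choices-by-weeks count matrix: rows enumerate choices
# ("Call on Mon @ pod 1", ...), columns are weeks; values are invented counts
rng = np.random.default_rng(5)
pattern = np.outer(rng.integers(0, 4, size=12), np.ones(8))     # a repeating weekly pattern
A = pattern + rng.poisson(0.5, size=(12, 8))                    # plus noise

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)        # choices x factors
H = model.components_             # factors x weeks
print(W.shape, H.shape)

# the same low-rank structure shows up as a few dominant eigenvalues of AᵗA
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(evals[:4])
```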
Search Engine Relevance: Listing on Shopping.com

Which 5 items to list at the bottom of the page?
Statistical Machine Learning: Support Vector Machines (SVM)
From Regression to Classification: Maximum Margin Solutions

[Diagram: two classes separated by the hyperplane wᵀx − b = ±1, with margin 2/||w||₂.]
Classification := Find the line that separates the points with the maximum margin
min ½ ||w||₂² subject to constraints

constraint specifications:

"above": w·x_i − b ≥ +1 − ξ_i   (all points "above" the line)
"below": w·x_i − b ≤ −1 + ξ_i   (all points "below" the line)

Simple minimization (regression) becomes a convex optimization (classification), perhaps within some slack (i.e. min ½ ||w||₂² + C Σ_i ξ_i)
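A minimal soft-margin SVM sketch on invented 2-D data; scikit-learn's LinearSVC is used here only as a convenient stand-in for the SVMlight-family tools discussed next:

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy 2-D points with +1/-1 labels, invented for illustration
rng = np.random.default_rng(6)
X_pos = rng.normal(loc=[+2, +2], size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

# soft-margin linear SVM: min 1/2 ||w||^2 + C * sum(slack)
clf = LinearSVC(C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("training accuracy:", clf.score(X, y))
```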
SVM Light: Multivariate Rank Constraints
Multivariate Classification:
min ½ ||w||₂² + C ξ   s.t.

for all y′: wᵀ Ψ(x, y) − wᵀ Ψ(x, y′) ≥ Δ(y, y′) − ξ

let Ψ(x, y′) = Σ_i y′_i x_i be a linear function
[Diagram: documents x1, x2, …, xn are scored by wᵀx (e.g. −0.1, +1.2, …, −0.7); sgn maps the scores to labels y (e.g. −1, +1, …, −1), i.e. maps docs to relevance scores (+1/−1).]
learn weights (w) s.t. the maximizer y′ of wᵀ Ψ(x, y′) is correct for the training set (within a single slack constraint ξ)
Δ(y, y′) is a multivariate loss function (i.e. 1 − AveragePrecision(y, y′))

Ψ(x, y′) is a linear discriminant function (i.e. a sum over ordered pairs: Σ_i Σ_j y′_ij (x_i − x_j))
SvmLight Ranking SVMs
SVMperf: ROC Area, F1 Score, Precision/Recall
SVMmap: Mean Average Precision (warning: buggy!)
SVMrank: Ordinal Regression
Standard classification on pairwise differences
min ½ ||w||₂² + C Σ ξ_ijk   s.t.

for all queries q_k (later, may not be query specific in SVMstruct)

and doc pairs d_i, d_j:  wᵀ Ψ(q_k, d_i) − wᵀ Ψ(q_k, d_j) ≥ 1 − ξ_ijk

Δ_ROCArea = 1 − # swapped pairs

Enforces a directed ordering
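A sketch of the "classification on pairwise differences" idea: form difference vectors for document pairs where one should rank above the other and train a linear classifier on them. The toy features and labels are invented, and a plain linear SVM stands in for SVMrank:

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy query: 6 documents with 4 features and graded relevance labels (invented)
rng = np.random.default_rng(7)
docs = rng.normal(size=(6, 4))
rel  = np.array([3, 2, 2, 1, 0, 0])

# pairwise differences: one example per (d_i, d_j) with rel_i > rel_j
diffs, signs = [], []
for i in range(len(docs)):
    for j in range(len(docs)):
        if rel[i] > rel[j]:
            diffs.append(docs[i] - docs[j]); signs.append(+1)
            diffs.append(docs[j] - docs[i]); signs.append(-1)   # mirrored pair balances the classes

clf = LinearSVC(C=1.0, fit_intercept=False)   # no bias: the constraints are on differences
clf.fit(np.array(diffs), np.array(signs))

scores = docs @ clf.coef_[0]                  # rank documents by wᵀx
print(np.argsort(-scores), rel)
```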
[Example: 8 documents with relevance labels 1 0 0 0 0 1 1 0, ranked in two opposite orders (8 7 6 5 4 3 2 1 vs. 1 2 3 4 5 6 7 8). One ordering scores MAP 0.56 / ROC Area 0.47, the other MAP 0.51 / ROC Area 0.53, so the two metrics prefer opposite rankings.]
A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
Large Scale, Linear SVMs
• Solving the Primal
  – Conjugate Gradient
  – Joachims: cutting plane algorithm
  – Nyogi
• Handling Large Numbers of Constraints
  – Cutting Plane Algorithm
• Open Source Implementations:
  – LibSVM
  – SVMLight
Search Engine Relevance: Listing on Shopping.com

A ranking SVM consistently improves the Shopping.com <click rank> by 12%
Various Sparse Matrix Problems:
Google Page Rank algorithm
M a = a: rank a series of web pages by simulating user browsing patterns (a) based on a probabilistic model (M) of page links (a minimal power-iteration sketch follows this list)
Pattern Recognition, Inference
L p = h: estimate unknown probabilities (p) based on historical observations (h) and a probability model (L) of the links between hidden nodes
Quantum Chemistry
H Ψ = E Ψ: compute the color of dyes and pigments given empirical information on related molecules, and/or by solving massive eigenvalue problems
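For the PageRank entry above, a minimal power-iteration sketch on an invented 4-page link graph; the damping factor 0.85 is a conventional choice, not from the slides:

```python
import numpy as np

# toy column-stochastic link matrix M over 4 pages: M[i, j] = prob(next page i | on page j)
links = np.array([[0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)
M = links / links.sum(axis=0, keepdims=True)

d, n = 0.85, M.shape[0]
G = d * M + (1 - d) / n * np.ones((n, n))     # damped "Google matrix"

a = np.ones(n) / n
for _ in range(100):                          # power iteration toward the fixed point M a = a
    a = G @ a
    a /= a.sum()
print(a)                                      # stationary page-rank scores
```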
Quantum Chemistry: the electronic structure eigenproblem

Solve a massive eigenvalue problem (dimension 10⁹-10¹²)

H Ψ(σ, π, …) = E Ψ(σ, π, …)

H: energy matrix
Ψ: quantum state eigenvector
σ, π, …: electrons
Methods can have general applicability: the Davidson method for dominant eigenvalues / eigenvectors
Motivation for Personalization Technology: from understanding the conceptual foundations of semi-empirical models (noiseless dimensional reduction)
Relations between Quantum Mechanics and Probabilistic Language Models
• Quantum States resemble the states (strings, words, phrases) in probabilistic language models (HMMs, SCFGs), except:
Ψ is a sum* of strings of electrons:

Ψ(σ, π) = c₁ |σ₁ σ₂ π₁ π₂| + c₂ |σ₂ σ₃ π₁ π₂| + …
• Energy Matrix H is known exactly, but large. Models of H can be inferred from empirical data to simplify computations.
• Energies ~= Log [Probabilities], un-normalized
*Not just a single string!
Ab initio (from first principles):
Solve the entire H Ψ(σ, π) = E Ψ(σ, π) … approximately
OR
Semi-empirical:
Assume the (σ, π) electrons are statistically independent:

Ψ(σ, π) = p(π) q(σ)

Treat the π-electrons explicitly, ignore σ (hidden):

PHP p(π) = E p(π): a much smaller problem
Parameterize PHP matrix => Heff with empirical data using a small set of molecules, then apply to others (dyes, pigments)
Dimensional Reduction in Quantum Chemistry: where do semi-empirical Hamiltonians come from?
Effective Hamiltonians: Semi-Empirical Pi-Electron Methods
Heff[E] p(π) = E p(π)

[ PHP  PHQ ] [ p ]       [ p ]
[ QHP  QHQ ] [ q ]  =  E [ q ]

PHP p + PHQ q = E p
QHP p + QHQ q = E q

=>

Heff[E] = PHP + PHQ (E − QHQ)⁻¹ QHP

(q: implicit / hidden)
The final Heff can be solved iteratively (as with the eSelf Leff), or perturbatively in various forms

The solution is formally exact => Dimensional Reduction / "Renormalization"
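A small numpy sketch of the partitioning above on an invented symmetric toy Hamiltonian, checking that an exact eigenvalue of the full H reappears in the much smaller Heff(E):

```python
import numpy as np

# toy symmetric "Hamiltonian"; the matrix and the P/Q split are invented for illustration
rng = np.random.default_rng(8)
H = rng.normal(size=(6, 6)); H = (H + H.T) / 2
P, Q = np.arange(2), np.arange(2, 6)          # explicit (pi-like) vs. hidden (sigma-like) spaces
HPP, HPQ = H[np.ix_(P, P)], H[np.ix_(P, Q)]
HQP, HQQ = H[np.ix_(Q, P)], H[np.ix_(Q, Q)]

def H_eff(E):
    # energy-dependent effective Hamiltonian: PHP + PHQ (E - QHQ)^-1 QHP
    return HPP + HPQ @ np.linalg.solve(E * np.eye(len(Q)) - HQQ, HQP)

# the reduction is formally exact: an exact eigenvalue E of the full H
# reappears as an eigenvalue of the much smaller H_eff(E)
E_exact = np.linalg.eigvalsh(H)[0]
print(np.linalg.eigvalsh(H_eff(E_exact)), E_exact)
assert np.isclose(np.linalg.eigvalsh(H_eff(E_exact)), E_exact).any()
```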
Graphical Methods
V_ij = [diagram] + [diagram] + …

Decompose Heff into effective interactions between electrons

(Expand (E − QHQ)⁻¹ in an infinite series and remove the E dependence.) Represent diagrammatically: ~300 diagrams to evaluate

Precompile using symbolic manipulation: ~35 MB executable; 8-10 hours to compile; run time: 3-4 hours per parameter
Effective Hamiltonians: Numerical Calculations
V_CC (eV):  π-only: 16;  effective: 11.5;  empirical: 11-12
Compute ab initio the empirical parameters: we can test all the basic assumptions of semi-empirical theory, "from first principles"
Also provides highly accurate eigenvalue spectra
Augment commercial packages (i.e. Fujitsu MOPAC) to model spectroscopy of photoactive proteins