Applied Machine Learning for Search Engine Relevance Charles H Martin, PhD


Page 1

Applied Machine Learning for Search Engine Relevance

Charles H Martin, PhD

Page 2

Relevance as a Linear Regression

r = Xw + e

x: (tf-idf) bag-of-words vector

r: relevance score (i.e. +1/−1)

w: weight vector

w = (X†X)⁻¹ X†r

each x = 1 query (model*)

form X from data (i.e. a group of queries)

Solve as a numerical minimization (i.e. iterative methods like SOR, CG, etc.)

min ||Xw − r||²₂    (||w||₂: 2-norm of w)

*Actually we will model and predict pairwise relations, not exact rank… stay tuned.

Moore-Penrose Pseudoinverse

Page 3

Relevance as a Linear Regression: Tikhonov Regularization

w = (X†X)⁻¹ X†r

Problem: the inverse may not exist (numerical instabilities, poles)

Solution: add a constant a to the diagonal of X†X

w = (X†X + aI)⁻¹ X†r    (a: a single, adjustable smoothing parameter)

Equivalent minimization problem

min ||Xw − r||² + a ||w||²

More generally: form (something like) X†X + G†G + aI, which is a self-adjoint, bounded operator =>

min ||Xw − r||² + a ||Gw||²    (i.e. G chosen to avoid over-fitting)
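
A minimal sketch of the regularized solve above, assuming a toy tf-idf matrix X and ±1 relevance labels r (the data and the value of a are illustrative, not from the talk):

```python
# Tikhonov-regularized (ridge) relevance regression on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 50))               # 100 query-document rows, 50 tf-idf features (toy)
r = np.sign(rng.standard_normal(100))   # +1/-1 relevance scores (toy)

a = 0.1                                 # single, adjustable smoothing parameter
d = X.shape[1]
# w = (X†X + aI)^-1 X†r  -- solve the regularized normal equations
w = np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ r)

scores = X @ w                          # predicted relevance scores
```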

Page 4

The Representer Theorem Revisited: Kernels and Green's Functions

f(x) = Σᵢ aᵢ R(x, xᵢ)    (R := kernel)

Problem: estimate a function f(x) from training data (xᵢ, yᵢ)

Solution: solve a general minimization problem

min Σᵢ Loss[f(xᵢ), yᵢ] + a ||Gf||²

Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf, 2006)

min Σᵢ Loss[f(xᵢ), yᵢ] + aᵀKa,    where Kᵢⱼ = R(xᵢ, xⱼ) and a = (aᵢ) is the coefficient vector

Equivalent to: given a linear regularization operator (G: H → L²(x))

where K is an integral operator: (Kf)(y) = ∫ R(x, y) f(x) dx

so K is the Green's function for G†G, or G = (K^(1/2))†

in Dirac notation: R(x, y) = ⟨y| (G†G)⁻¹ |x⟩

f(x) = Σᵢ aᵢ R(x, xᵢ) + Σᵤ bᵤ ψᵤ(x) ;    ψᵤ span the null space of G
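
As an illustration of the representer theorem, the following hedged sketch fits kernel ridge regression, where f(x) = Σᵢ aᵢ R(x, xᵢ) and the coefficients solve (K + aI)a = y; the RBF kernel and all data here are assumptions for the example, not choices from the talk:

```python
# Kernel ridge regression via the representer expansion.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # R(x, y) = exp(-gamma ||x - y||^2), an illustrative kernel choice
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X_train = rng.random((30, 5))
y_train = rng.standard_normal(30)

a_reg = 0.1
K = rbf_kernel(X_train, X_train)                  # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a_reg * np.eye(len(K)), y_train)

X_test = rng.random((4, 5))
f_test = rbf_kernel(X_test, X_train) @ alpha      # f(x) = sum_i alpha_i R(x, x_i)
```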

Page 5

Personalized Relevance Algorithms: eSelf Personality Subspace

[Figure: pages/ads (p) and personality traits (q). Example: while the user reads a music site, learned traits are updated (likes cars 0.4, sports cars 0.0 => 0.3, rock-n-roll, hard rock) and a used sports car ad is presented to the user.]

Compute personality traits during the user's visit to the web site; q values = stored, learned "personality traits"

Provide relevance rankings (for pages or ads) which include personality traits

Page 6

Personalized Relevance Algorithms: eSelf Personality Subspace

Model: L [p, q] = [h, u], where L is a square matrix

h: history (observed outputs)

p: output nodes (observables)

Web pages, Classified Ads, …

q: hidden nodes (not observed)

Individualized Personality Traits

u: user segmentation

Page 7

Personalized Search: Effective Regression Problem

[p, q](t) = (Leff[q(t−1)])⁻¹ · [h, u](t)    on each time step (t)

( PLP  PLQ ) ( p )   ( h )
( QLP  QLQ ) ( q ) = ( u )

i.e.  PLP p + PLQ q = h ;    QLP p + QLQ q = 0

Formal solution:

Leff = PLP + PLQ (QLQ)⁻¹ QLP    =>    Leff p = h ,    p = (Leff[q, u])⁻¹ h

Adapts on each visit, finding relevant pages p(t) based on the links L, and the learned personality traits (q(t-1))

Regularization of PLP achieved with “Green’s Function / Resolvent Operator”

i.e. G†G ≈ PLQ (QLQ)⁻¹ QLP

Equivalent to Gaussian Process on a Graph, and/or Bayesian Linear Regression
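
A toy numpy sketch of the elimination step, using the sign convention written on this slide (Leff = PLP + PLQ (QLQ)⁻¹ QLP); the matrix L and history vector h are made up for illustration and are not the eSelf implementation:

```python
# Eliminate the hidden q-block to get an effective operator on the observed p-block.
import numpy as np

rng = np.random.default_rng(2)
n_p, n_q = 4, 3
L = rng.random((n_p + n_q, n_p + n_q)) + np.eye(n_p + n_q) * (n_p + n_q)  # well-conditioned toy L
PLP, PLQ = L[:n_p, :n_p], L[:n_p, n_p:]
QLP, QLQ = L[n_p:, :n_p], L[n_p:, n_p:]

h = rng.random(n_p)                              # observed history (toy)

# Leff = PLP + PLQ (QLQ)^-1 QLP, as written on the slide
Leff = PLP + PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(Leff, h)                     # p = Leff^-1 h
```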

Page 8

Related Dimensional Noise Reductions: Rank (k) Approximations of a Matrix

Latent Semantic Analysis (LSA) / (Truncated) Singular Value Decomposition (SVD): diagonalize the density operator D = A†A and retain a subset of (k) eigenvalues/vectors

Equivalent relations for SVD

Optimal rank-(k) approximation X s.t. min ||D − X||²₂

Decomposition: A = U Σ V† ,    A†A = V (Σ†Σ) V†

( PDP  PDQ )
( QDP  QDQ )

*Variable Latent Semantic Indexing (Yahoo! Research Labs): http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf

VLSA* provides a rank-(k) approximation for any query q:  min E[ ||qᵀ(D − X)||²₂ ]

Can generalize to various noise models: i.e. VLSA*, PLSA**

**Probabilistic Latent Semantic Indexing (Recommind, Inc): http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

PLSA** provides a rank-(k) approximation over classes (z):  min D_KL[ P || P(data) ],    P = U Σ V†

P(d, w) = Σ_z P(d|z) P(z) P(w|z) ;    D_KL = Kullback–Leibler divergence
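
A short truncated-SVD (LSA) sketch; the term-document matrix and the choice k = 10 are illustrative assumptions:

```python
# Keep the top k singular triplets of A to get the optimal rank-k approximation.
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((200, 80))                       # e.g. 200 documents x 80 terms (toy data)

k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]              # rank-k approximation of A
D_k = Vt[:k].T @ np.diag(s[:k] ** 2) @ Vt[:k]   # corresponding approximation of D = A†A
```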

Page 9

Personalized Relevance: Mobile Services and Advertising

France Telecom: Last Inch Relevance Engine

[Figure: given the time and location, the engine suggests a service: play game, send msg, play song.]

Page 10

KA for Comm Services

• Based on the Empirical Bayesian score and a suggestion mapping table, a decision is made to suggest one or more possible Comm services

• Based on Business Intelligence (BI) data mining and/or pattern recognition algorithms (i.e. supervised or unsupervised learning), we compute statistical scores indicating the people the user is most likely to call or send an SMS, MMS, or e-mail.

[Figure: events (p) and personal context (q, e.g. Sunday mornings) map to a contextual comm service; suggestions for the user (Call [who], SMS [who], MMS [who], E-mail). Learned trait: on Sunday morning, most likely to call Mom. Call counts: Mom (5), Bob (3), phone company (1).]

Page 11

Comm/Call Patterns

[Figure: call patterns by POD (period of day), day of week, and LOC, for calls to different #'s. The probability of a given contact differs by POD vs. Sunday — p(·|POD) > p(·|SUN) for one contact while p(·|POD) < p(·|SUN) for another — and some contact/context combinations have probability 0.]

Page 12

Bayesian Score Estimation: to estimate p(call|POD)

Frequency: p(call|POD) = # of times the user called someone at that POD

Bayesian:

p(call|POD) = p(POD|call) p(call) / Σ_q p(POD|q) p(q)

where q = call, sms, mms, or email

Page 13

i.e. Bayesian Choice Estimator

• We seek to know the probability of a "call" (choice) at a given POD.

• We "borrow information" from other PODs, assuming this is less biased, to improve our statistical estimate

5 days, 3 PODs, 3 choices

Frequency estimator:  f(c | POD 1) = 2/5

Bayesian choice estimator:
p(c | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ] = 6/23 ≈ 1/4

Note: the Bayesian estimate is significantly lower, because we now expect we might see the other choices at POD 1
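
The worked example can be reproduced directly from the numbers on the slide (the overall choice frequencies 3/15, 3/15, 11/15 are taken as given):

```python
# Frequency vs. Bayesian choice estimate for choice 1 at POD 1.
from fractions import Fraction as F

f_pod1  = [F(2, 5), F(2, 5), F(1, 5)]       # per-choice frequency at POD 1
p_prior = [F(3, 15), F(3, 15), F(11, 15)]   # overall choice frequencies ("borrowed" information)

weights = [f * p for f, p in zip(f_pod1, p_prior)]
p_bayes = weights[0] / sum(weights)          # Bayesian choice estimate for choice 1

print(f_pod1[0], p_bayes, float(p_bayes))    # 2/5 (frequency) vs 6/23 ~ 0.26 (Bayesian)
```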

Page 14

Incorporating Feedback

• It is not enough to simply recognize call patterns in the Event Facts—it is also necessary to incorporate feedback into our suggestion scores

• p( c | user, pod, loc, facts, feedback ) = ?

[Figure: Event Facts → Suggestions (possibly random) → user feedback: irrelevant / poor / good.]

A: Simply factorize:

p( c | user, pod, facts, feedback ) = p( c | user, pod, facts ) · p( c | user, pod, feedback )

Evaluate the probabilities independently, perhaps using different Bayesian models

Page 15

Personalized Relevance: Empirical Bayesian Models

Closed form models:

Correct a sample estimate (mean m, variance σ²) with a weighted average of the sample + the complete data set

m̂ = B·m_sample + (1 − B)·m_segment

B: shrinkage factor

i.e. shrink the individual-sample estimate toward the user-segment estimate.

Can rank order mobile services (e.g. 1 play game, 2 send msg, 3 play song) based on estimated likelihood (m, σ²)

Page 16

Personalized Relevance: Empirical Bayesian Models

What is Empirical Bayes modeling?

specify the Likelihood L(y|θ) and Prior π(θ) distributions

estimate the posterior:  π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ    (the denominator is the marginal)

Combines Bayesianism and frequentism: approximates the marginal (or posterior) using a point estimate (MLE), Monte Carlo, etc.

Estimates marginal using empirical data

Uses empirical data to infer the prior, then plugs it into the likelihood to make predictions

Note: Special case of Effective Operator Regression:

P space ~ Q space ;  PLQ = I ;  u → 0

Q-space defines prior information

Page 17

Empirical Bayesian Methods: Poisson Gamma Model

Likelihood L(y|λ) = Poisson distribution:  (λ^y e^(−λ)) / y!

Conjugate prior π(λ; a, b) = Gamma distribution:  (λ^(a−1) e^(−λ/b)) / (Γ(a) bᵃ) ;    λ > 0

posterior π(λ|y) ∝ L(y|λ) π(λ) = ((λ^y e^(−λ)) / y!) · ((λ^(a−1) e^(−λ/b)) / (Γ(a) bᵃ))

∝ λ^(y+a−1) e^(−λ(1 + 1/b))

also a Gamma distribution Γ(a′, b′):    a′ = y + a ;    b′ = (1 + 1/b)⁻¹

Take the MLE estimate of the marginal = mean of the posterior (a′b′). Obtain a, b from the mean (m = ab) and variance (ab²) of the complete data set.

The final point estimate E(y) = a′b′ for a sample is a weighted average of the sample mean ȳ = m_y and the prior mean m

E(y) = (m_y + a) (1 + 1/b)⁻¹

E(y) = (b/(1+b)) m_y + (1/(1+b)) m
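
A small sketch of this Poisson-Gamma shrinkage, with the prior (a, b) obtained by moment matching on a synthetic "complete data set"; all numbers are illustrative assumptions:

```python
# Empirical Bayes point estimate: shrink a single observed count toward the prior mean.
import numpy as np

rng = np.random.default_rng(4)
complete = rng.poisson(4.0, size=10_000)       # complete data set (e.g. all users' counts)

m = complete.mean()                            # prior mean      m = a*b
v = complete.var()                             # prior variance  v = a*b^2
b = v / m                                      # so b = v/m, a = m/b
a = m / b

y = 7                                          # one user's single observed count (toy)
E_y = (b / (1 + b)) * y + (1 / (1 + b)) * m    # weighted average of sample and prior means
```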

Page 18

Linear Personality Matrix

[Figure: events → suggestions (s) → actions (a)]

Linear (or non-linear) matrix transformation:  M s = a

Notice: the personality matrix may or may not mix suggestions across events, and can include semantic information

We can then solve for prob(s) using a computational linear solver: s = M⁻¹ a. Over time, we can estimate M_{a,s} = prob( a | s )

i.e.  s₁ = call, s₂ = sms, s₃ = mms, s₄ = email

i.e. for a given time and location…count how many times we suggested a call but the user chose an email instead

Obviously we would like M to be diagonal… or as good as possible! Can we devise an algorithm that will learn to give "optimal" suggestions?
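
One possible toy realization of this idea (the suggestion→action counts are invented): estimate M by column-normalizing the counts, then solve for s:

```python
# Personality-matrix sketch: M[a, s] ~ prob(action a | suggestion s), s = M^-1 a.
import numpy as np

# rows/cols: call, sms, mms, email (toy counts: suggested s, user chose a)
counts = np.array([[30.,  5.,  2.,  3.],
                   [ 4., 20.,  1.,  5.],
                   [ 1.,  2., 10.,  2.],
                   [ 2.,  6.,  1., 25.]])
M = counts / counts.sum(axis=0, keepdims=True)   # column-normalize: M[a, s] ~ prob(a | s)

a_obs = np.array([0.35, 0.30, 0.10, 0.25])       # observed action distribution (toy)
s = np.linalg.solve(M, a_obs)                    # s = M^-1 a
```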

Page 19

Matrices for Pattern Recognition (Statistical Factor Analysis)

Rows (enumerated choices): Call on Mon @ pod 1, Call on Mon @ pod 2, Call on Mon @ pod 3, …, Sms on Tue @ pod 1, …

Columns: Week 1 2 3 4 5 …

We can apply computational linear algebra to remove noise and find patterns in data. This is called Factor Analysis by statisticians and Singular Value Decomposition (SVD) by engineers; it is implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.

1. Enumerate all choices
2. Count the # of times each choice is made each week
3. Form the weekly choice density matrix AᵗA
4. Weekly patterns are collapsed into the density matrix AᵗA; they can be detected using spectral analysis (i.e. the principal eigenvalues)

[Figure: eigenvalue spectrum separating all weekly patterns from pure noise]

Similar to the Latent (Multinomial) Dirichlet Allocation (LDA) algorithm, but much simpler to implement. Suitable when the number of choices is not too large and the patterns are weekly.
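
A hedged sketch of the weekly-pattern analysis: build a toy choices-by-weeks count matrix A, form the density matrix AᵗA, and inspect its principal eigenvalues (plain numpy here, rather than Oracle Data Mining's NMF):

```python
# Detect a recurring weekly pattern via the spectrum of the density matrix A^T A.
import numpy as np

rng = np.random.default_rng(5)
n_choices, n_weeks = 60, 26                    # 60 enumerated choices, 26 weeks (toy sizes)
pattern = rng.integers(0, 3, size=n_choices)   # a recurring weekly pattern
A = pattern[:, None] + rng.poisson(0.5, size=(n_choices, n_weeks))  # weekly counts + noise

D = A.T @ A                                    # weekly density matrix
evals, evecs = np.linalg.eigh(D)
print(evals[-3:])                              # principal eigenvalues carry the pattern
```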

Page 20

Search Engine Relevance: Listing on

Which 5 items to list at the bottom of the page?

Page 21

Statistical Machine Learning: Support Vector Machines (SVM)

From Regression to Classification: Maximum Margin Solutions

[Figure: separating hyperplane with weight vector w and margin width 2/||w||₂]

Classification := Find the line that separates the points with the maximum margin

min ½ ||w||²₂ subject to constraints

all “above” line

all “below” line

“above”:  w·xᵢ − b ≥ +1

“below”:  w·xᵢ − b ≤ −1

constraint specifications:

Simple minimization (regression) becomes a convex optimization (classification)

perhaps within some slack (i.e. min ½ ||w||²₂ + C Σᵢ ξᵢ)

Page 22

SVM Light: Multivariate Rank Constraints

Multivariate Classification:

min ½ ||w||²₂ + C·ξ    s.t.

for all y′:  wᵀΨ(x, y) − wᵀΨ(x, y′) ≥ D(y, y′) − ξ

let Ψ(x, y′) = Σᵢ y′ᵢ xᵢ be a linear function

[Figure: document features x₁ … xₙ → scores wᵀx (e.g. −0.1, +1.2, …, −0.7) → sgn → labels y (−1, +1, …, −1)]

maps docs to relevance scores (+1/-1)

learn weights w s.t. argmax_y′ wᵀΨ(x, y′) is correct for the training set (within a single slack constraint ξ)

D(y, y′) is a multivariate loss function (i.e. 1 − AveragePrecision(y, y′))

Ψ(x, y′) is a linear discriminant function (i.e. a sum over ordered pairs: Σᵢ Σⱼ y′ᵢⱼ (xᵢ − xⱼ))

Page 23

SVMlight Ranking SVMs

SVMperf : ROC Area, F1 Score, Precision/Recall

SVMmap: Mean Average Precision (warning: buggy!)

SVMrank : Ordinal Regression

Standard classification on pairwise differences

min ½ ||w||²₂ + C Σ ξᵢ,ⱼ,ₖ    s.t.

for all queries qk (later, may not be query specific in SVMstruct)

and doc pairs dᵢ, dⱼ:   wᵀΨ(qₖ, dᵢ) − wᵀΨ(qₖ, dⱼ) ≥ 1 − ξᵢ,ⱼ,ₖ

D_ROCArea = 1 − ROCArea  (∝ # swapped pairs)

Enforces a directed ordering

Example: 8 documents with relevance labels 1 0 0 0 0 1 1 0 at rank positions 1–8, compared with the reversed ordering 8 7 6 5 4 3 2 1:

Ranking            MAP    ROC Area
1 2 3 4 5 6 7 8    0.56   0.47
8 7 6 5 4 3 2 1    0.51   0.53

A Support Vector Method for Optimizing Average Precision (Joachims et al., 2007)
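
A hedged sketch of the pairwise (ordinal-regression) reduction behind SVMrank, using scikit-learn's LinearSVC on document-pair difference vectors instead of the SVMlight binaries; the documents and relevance grades are synthetic:

```python
# Train an ordinary linear SVM on pairwise differences, then rank by w.x.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
X = rng.random((40, 10))                   # documents for one query (toy)
rel = rng.integers(0, 3, size=40)          # graded relevance labels (toy)

pairs, labels = [], []
for i, j in combinations(range(len(X)), 2):
    if rel[i] == rel[j]:
        continue                           # keep only pairs with a preferred document
    pairs.append(X[i] - X[j])
    labels.append(1 if rel[i] > rel[j] else -1)

clf = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
scores = X @ clf.coef_.ravel()             # rank documents by descending score
```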

Page 24

Large Scale, Linear SVMs

• Solving the Primal
– Conjugate Gradient

– Joachims: cutting plane algorithm

– Nyogi

• Handling Large Numbers of Constraints
– Cutting Plane Algorithm

• Open Source Implementations:
– LibSVM

– SVMLight

Page 25

Search Engine Relevance: Listing on

A ranking SVM consistently improves Shopping.com <click rank> by 12%

Page 26

Various Sparse Matrix Problems:

Google PageRank algorithm

M a = a :  rank a series of web pages by simulating user browsing patterns (a), based on a probabilistic model (M) of page links (see the power-iteration sketch after this list)

Pattern Recognition, Inference

L p = h :  estimate unknown probabilities (p) based on historical observations (h) and a probability model (L) of links between hidden nodes

Quantum Chemistry

H Ψ = E Ψ :  compute the color of dyes and pigments given empirical information on related molecules, and/or by solving massive eigenvalue problems
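
A minimal power-iteration sketch for the M a = a eigenproblem; the 4-page link matrix and damping factor 0.85 are illustrative assumptions:

```python
# Power iteration toward the dominant eigenvector of a damped link matrix.
import numpy as np

M = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.3, 0.0, 0.0, 1.0],
              [0.3, 0.0, 0.0, 0.0],
              [0.4, 0.5, 0.5, 0.0]])       # column j: outgoing link probabilities of page j
n = M.shape[0]
d = 0.85
G = d * M + (1 - d) / n * np.ones((n, n))  # damped "Google" matrix

a = np.ones(n) / n
for _ in range(100):
    a = G @ a                              # repeated application converges to M a = a
    a /= a.sum()
```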

Page 27

Quantum Chemistry: the electronic structure eigenproblem

Solve a massive eigenvalue problem (10⁹–10¹²)

H Ψ(π, σ, …) = E Ψ(π, σ, …)

H: energy matrix

Ψ: quantum state (eigenvector)

π, σ, …: electrons

E: energy (eigenvalue)

Methods can have general applicability: Davidson method for dominant eigenvalues / eigenvectors
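
SciPy does not ship a Davidson solver, so this hedged sketch uses the Lanczos-based eigsh as an analogous large sparse symmetric eigensolver; the toy "energy matrix" is random:

```python
# Extremal eigenvalues/eigenvectors of a large sparse symmetric matrix.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 10_000
diag = np.arange(1, n + 1, dtype=float)
H = sp.diags(diag) + sp.random(n, n, density=1e-4, random_state=7)
H = (H + H.T) / 2                          # symmetrize the toy "energy matrix"

evals, evecs = eigsh(H, k=3, which='SA')   # lowest few eigenvalues / eigenvectors
```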

Motivation for Personalization Technology: from understanding the conceptual foundations of semi-empirical models (noiseless dimensional reduction)


Page 28

Relations between Quantum Mechanics and Probabilistic Language Models

• Quantum States resemble the states (strings, words, phrases) in probabilistic language models (HMMs, SCFGs), except:

Ψ is a sum* of strings of electrons:

Ψ(π, σ) = 0.… |π₁ π₂ σ₁ σ₂| + 0.2 |π₂ π₃ σ₁ σ₂| + …

• Energy Matrix H is known exactly, but large. Models of H can be inferred from empirical data to simplify computations.

• Energies ~= Log [Probabilities], un-normalized

*Not just a single string!

Page 29

Ab initio (from first principles):

Solve the entire H Ψ(π, σ) = E Ψ(π, σ) … approximately

OR

Semi-empirical:

Assume the (π, σ) electrons are statistically independent:

Ψ(π, σ) = p(π) q(σ)

Treat π-electrons explicitly, ignore σ (hidden):

PHP p(π) = E p(π)    (a much smaller problem)

Parameterize PHP matrix => Heff with empirical data using a small set of molecules, then apply to others (dyes, pigments)

Dimensional Reduction in Quantum Chemistry: where do semi-empirical Hamiltonians come from?

Page 30

Effective Hamiltonians: Semi-Empirical Pi-Electron Methods

Heff[σ] p(π) = E p(π)

( PHP  PHQ ) ( p )       ( p )
( QHP  QHQ ) ( q )  =  E ( q )

i.e.  PHP p + PHQ q = E p ;    QHP p + QHQ q = E q

=>    Heff[σ] = PHP + PHQ (E − QHQ)⁻¹ QHP

q: implicit / hidden

The final Heff can be solved iteratively (as with the eSelf Leff), or perturbatively in various forms

The solution is formally exact => dimensional reduction / "renormalization"

Page 31

Graphical Methods

[Figure: Vij expressed as a sum of diagrams]

Decompose Heff into effective interactions between electrons

(Expand (E − QHQ)⁻¹ in an infinite series and remove the E dependence.) Represent diagrammatically: ~300 diagrams to evaluate

Precompile using symbolic manipulation: ~35 MB executable; 8–10 hours to compile; run time: 3–4 hours per parameter


Page 32

Effective Hamiltonians: Numerical Calculations

V_CC (eV):    π-only: 16    effective: 11.5    empirical: 11–12

Compute the empirical parameters ab initio: we can test all the basic assumptions of semi-empirical theory,

“from first principles”

Also provides highly accurate eigenvalue spectra

Augment commercial packages (i.e. Fujitsu MOPAC) to model spectroscopy of photoactive proteins
