Applied Machine Learning for Search Engine Relevance
Charles H Martin, PhD
Relevance as a Linear Regression
r = Xw + e
x: (tf-idf) bag-of-words vector
r: relevance score (i.e. +1/−1)
w: weight vector
w = (X†X)⁻¹ X†r
x = one query (model*)

form X from the data (i.e. a group of queries)

Solve as a numerical minimization (i.e. iterative methods like SOR, CG, etc.)

min_w ||Xw − r||²   (||w||₂: the 2-norm of w)
*Actually, we will model and predict pairwise relations and not exact rank… stay tuned.
Moore-Penrose Pseudoinverse
Relevance as a Linear Regression: Tikhonov Regularization
w = (X†X)⁻¹ X†r
Problem: the inverse may not exist (numerical instabilities, poles)

Solution: add a constant a to the diagonal of (X†X):

w = (X†X + aI)⁻¹ X†r

a: single, adjustable smoothing parameter
Equivalent minimization problem
min_w ||Xw − r||² + a ||w||²

More generally: form (something like) X†X + G†G + aI, which is a self-adjoint, bounded operator =>

min_w ||Xw − r||² + a ||Gw||², i.e. with G chosen to avoid over-fitting
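A minimal numpy sketch of the regularized solution above; the toy data and the value of the smoothing parameter a are invented for illustration:

```python
import numpy as np

# toy data: 100 query/document rows with 20 tf-idf features, relevance scores r
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
r = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=100))  # +1/-1 scores

a = 0.5  # single, adjustable smoothing parameter (made up for the example)

# Tikhonov-regularized solution: w = (X†X + aI)^-1 X†r
w = np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ r)

# equivalently, w is the minimizer of ||Xw - r||^2 + a ||w||^2
pred = X @ w
```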
The Representer Theorem Revisited: Kernels and Green's Functions
f(x) = Σ_i a_i R(x, x_i),   R := Kernel
Problem: estimate a function f(x) from training data (x_i, y_i)
Solution: solve a general minimization problem
min_f Σ_i Loss[f(x_i), y_i] + a ||Gf||²
Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)
min_a Σ_i Loss[f(x_i), y_i] + aᵀKa,   K_ij = R(x_i, x_j)
Equivalent to: given a linear regularization operator (G: H → L²(x)),

where K is an integral operator: (Kf)(y) = ∫ R(x, y) f(x) dx

so K is the Green's function for G†G, or G = (K^(1/2))⁻¹

in Dirac notation: R(x, y) = ⟨y| (G†G)⁻¹ |x⟩
f(x) = Σ_i a_i R(x, x_i) + Σ_u b_u u(x);  the functions u span the null space of G
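As an illustration of the representer-theorem form above, here is a small kernel ridge sketch with a squared loss and an RBF kernel R; the kernel choice, its width, and the toy data are assumptions for the example, not the talk's actual setup:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # R(x, x') = exp(-gamma ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

a = 0.1                                              # regularization constant
K = rbf_kernel(X, X)                                 # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(X)), y)   # minimizes ||K a - y||^2 + a'ᵀ K a' style objective

# representer form: f(x) = sum_i alpha_i R(x, x_i)
X_new = rng.normal(size=(5, 3))
f_new = rbf_kernel(X_new, X) @ alpha
```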
Personalized Relevance Algorithms: eSelf Personality Subspace
[Diagram: pages (p) the user is reading (e.g. a music site) update learned personality traits (q), e.g. Cars: 0.4, Sports cars: 0.0 => 0.3, Rock-n-roll, Hard rock; the learned traits (Likes cars 0.4, Sports cars 0.3) then determine which pages/ads to present to the user (e.g. a used sports car ad).]
Compute personality traits during the user's visit to the web site; q values = stored, learned "personality traits"
Provide relevance rankings (for pages or ads) which include personality traits
Personalized Relevance Algorithms: eSelf Personality Subspace
model: L [p, q] = [h, u], where L is a square matrix

h: history (observed outputs)
p: output nodes (observables): web pages, classified ads, …
q: hidden nodes (not observed): individualized personality traits
u: user segmentation
Personalized Search: Effective Regression Problem
[p, q](t) = (L_eff[q(t−1)])⁻¹ · [h, u](t) on each time step t

[ PLP  PLQ ] [ p ]   [ h ]
[ QLP  QLQ ] [ q ] = [ u ]

PLP p + PLQ q = h
QLP p + QLQ q = 0

Formal solution =>

L_eff = PLP − PLQ (QLQ)⁻¹ QLP

L_eff p = h

p = (L_eff[q, u])⁻¹ h
Adapts on each visit, finding relevant pages p(t) based on the links L, and the learned personality traits (q(t-1))
Regularization of PLP achieved with “Green’s Function / Resolvent Operator”
i.e. G†G ≈ PLQ (QLQ)⁻¹ QLP
Equivalent to Gaussian Process on a Graph, and/or Bayesian Linear Regression
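A small numpy sketch of folding the hidden q block into an effective operator on the p block, following the elimination step above; the link matrix L, the P/Q split, and the sizes are invented:

```python
import numpy as np

# toy "link" matrix L over 3 observable (P) and 4 hidden (Q) nodes
rng = np.random.default_rng(2)
n_p, n_q = 3, 4
L = rng.normal(size=(n_p + n_q, n_p + n_q))
L += (n_p + n_q) * np.eye(n_p + n_q)          # keep the toy matrix well conditioned

P, Q = np.arange(n_p), np.arange(n_p, n_p + n_q)
PLP, PLQ = L[np.ix_(P, P)], L[np.ix_(P, Q)]
QLP, QLQ = L[np.ix_(Q, P)], L[np.ix_(Q, Q)]

h = rng.normal(size=n_p)                      # observed history on the P nodes

# eliminate the hidden q block: q = -(QLQ)^-1 QLP p  =>  L_eff p = h
L_eff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(L_eff, h)

# check against solving the full block system with [h, 0] on the right-hand side
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))
assert np.allclose(p, full[:n_p])
```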
Related Dimensional Noise Reductions: Rank-(k) Approximations of a Matrix
Latent Semantic Analysis (LSA) / (Truncated) Singular Value Decomposition (SVD): diagonalize the density operator D = A†A and retain a subset of (k) eigenvalues/vectors
Equivalent relations for SVD
Optimal rank-(k) approximation X s.t. min ||D − X||₂²

Decomposition: A = U Σ V†,   A†A = V (Σ†Σ) V†

D = [ PDP  PDQ ]
    [ QDP  QDQ ]
*Variable Latent Semantic Indexing (Yahoo! Research Labs): http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
VLSA* provides a rank-(k) approximation for any query q: min E[ ||qᵀ(D − X)||₂² ]
Can generalize to various noise models: i.e. VLSA*, PLSA**
**Probabilistic Latent Semantic Indexing (Recommind, Inc): http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
PLSA** provides a rank-(k) approximation over classes (z): min D_KL[P || P(data)],   P = U Σ V†

P(d, w) = Σ_z P(d|z) P(z) P(w|z);   D_KL = Kullback–Leibler divergence
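A minimal truncated-SVD (LSA-style) sketch on an invented document-term matrix, showing the rank-k approximation and its relation to the density operator D = A†A:

```python
import numpy as np

# toy document-term count matrix A (6 documents x 8 terms), invented for illustration
rng = np.random.default_rng(3)
A = rng.poisson(1.0, size=(6, 8)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Σ V†

k = 2                                              # keep the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # optimal rank-k approximation of A

# the same subspace diagonalizes the "density operator" D = A†A
D = A.T @ A
evals = np.sort(np.linalg.eigvalsh(D))[::-1]
assert np.allclose(np.sqrt(evals[:k]), s[:k])
```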
Personalized Relevance: Mobile Services and Advertising
France Telecom: Last Inch Relevance Engine
[Diagram: context (time, location) feeds a "suggest" step that proposes services such as play game, send msg, play song, …]
KA for Comm Services
• Based on the Empirical Bayesian score and a Suggestion mapping table, a decision is made to suggest one or more possible Comm services.
• Based on Business Intelligence (BI) data mining and/or Pattern Recognition algorithms (i.e. supervised or unsupervised learning), we compute statistical scores indicating who are the most likely people to Call, or to send an SMS, MMS, or E-Mail.
[Diagram: events plus personal context (q, e.g. Sunday mornings) map to a contextual comm service; learned traits (e.g. on Sunday morning, most likely to call Mom) drive the suggestions for the user (p): Call [who], SMS [who], MMS [who], E-mail, with candidates such as Mom (5), Bob (3), Phone company (1).]
[Diagram: comm/call patterns to different #'s broken out by location (LOC), day of week, and period of day (POD), with the corresponding conditional probabilities p(· | POD), p(· | SUN), etc.]
Bayesian Score Estimation

To estimate p(call | POD):

frequency: p(call | POD) = # of times the user called someone at that POD

Bayesian: p(call | POD) = p(POD | call) p(call) / Σ_q p(POD | q) p(q), where q = call, sms, mms, or email

i.e. a Bayesian Choice Estimator
• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from other PODs, assuming this is less biased, to improve our statistical estimate
Example: 5 days, 3 PODs, 3 choices

frequency estimator: f(choice | POD 1) = 2/5

Bayesian choice estimator:

p(choice | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ] = 6/23 ~ 1/4

Note: the Bayesian estimate is significantly lower because we now expect we might see one of the other choices at POD 1
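A tiny sketch of the Bayesian choice estimator applied to the numbers above; the choices are labeled A, B, C because the original icons did not survive:

```python
# plug the slide's values for POD 1 into the Bayesian choice estimator
p_pod_given_choice = {"A": 2/5, "B": 2/5, "C": 1/5}      # p(POD 1 | choice)
p_choice           = {"A": 3/15, "B": 3/15, "C": 11/15}  # p(choice), pooled over all PODs

def bayes_choice(c):
    # p(c | POD 1) = p(POD 1 | c) p(c) / sum_q p(POD 1 | q) p(q)
    num = p_pod_given_choice[c] * p_choice[c]
    den = sum(p_pod_given_choice[q] * p_choice[q] for q in p_choice)
    return num / den

print(bayes_choice("A"))   # 6/23 ~ 0.26, vs. the raw frequency estimate 2/5 = 0.4
```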
Incorporating Feedback
• It is not enough to simply recognize call patterns in the Event Facts; it is also necessary to incorporate feedback into our suggestion scores
• p( c | user, pod, loc, facts, feedback ) = ?
[Diagram: Event Facts feed the Suggestions; feedback on each suggestion (random / irrelevant / poor / good) flows back into the scores.]
A: Simply factorize:

p(c | user, pod, facts, feedback) = p(c | user, pod, facts) · p(c | user, pod, feedback)

Evaluate the probabilities independently, perhaps using different Bayesian models
Personalized Relevance: Empirical Bayesian Models
Closed form models:
Correct a sample estimate (mean m, variance σ²) with a weighted average of the sample and the complete data set:

m̂ = B m_sample + (1 − B) m_complete

B: shrinkage factor
i.e. combine the individual sample with the user segment

[Diagram: per-service estimates for play game, send msg, play song, ranked 1, 2, 3.]

Can rank-order the mobile services based on the estimated likelihoods (m, σ²)
Personalized Relevance: Empirical Bayesian Models
What is Empirical Bayes modeling?
specify Likelihood L(y|θ) and Prior π(θ) distributions

estimate the posterior: π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ   (the denominator is the marginal)
Combines Bayesianism and frequentism: approximates the marginal (or posterior) using a point estimate (MLE), Monte Carlo, etc.

Estimates the marginal using empirical data

Uses the empirical data to infer the prior, then plugs it into the likelihood to make predictions
Note: Special case of Effective Operator Regression:
P space ~ Q space;  PLQ = I;  u → 0
Q-space defines prior information
Empirical Bayesian Methods: Poisson Gamma Model
Likelihood L(y|λ) = Poisson distribution: (λ^y e^(−λ)) / y!
Conjugate Prior π(λ; a, b) = Gamma distribution: (λ^(a−1) e^(−λ/b)) / (Γ(a) b^a);  λ > 0

posterior π(λ|y) ∝ L(y|λ) π(λ) = ((λ^y e^(−λ)) / y!) · ((λ^(a−1) e^(−λ/b)) / (Γ(a) b^a))

∝ λ^(y+a−1) e^(−λ(1 + 1/b))

also a Gamma distribution Γ(a′, b′):  a′ = y + a;  b′ = (1 + 1/b)⁻¹
Take the MLE estimate of the marginal = the mean (m) of Γ(a, b); obtain a, b from the mean (m = ab) and variance (ab²) of the complete data

The final point estimate E(y) = a′b′ for a sample is a weighted average of the sample mean m_y and the prior mean m:

E(y) = (m_y + a) (1 + 1/b)⁻¹

E(y) = (b/(1+b)) m_y + (1/(1+b)) m
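A short Poisson-Gamma empirical Bayes sketch in numpy; the per-user counts are invented, but the moment matching and the shrinkage formula follow the slide:

```python
import numpy as np

# "complete data set": weekly counts of one comm service across many users (invented)
rng = np.random.default_rng(4)
complete = rng.poisson(3.0, size=1000).astype(float)

# fit the Gamma prior from the complete data: mean m = a*b, variance = a*b^2
m, v = complete.mean(), complete.var()
b = v / m
a = m / b

# posterior for a single user observation y: Gamma(a' = y + a, b' = (1 + 1/b)^-1)
y = 7.0
a_prime, b_prime = y + a, 1.0 / (1.0 + 1.0 / b)

# final point estimate is a weighted average of the sample and the prior mean m
E = a_prime * b_prime
assert np.isclose(E, (b / (1 + b)) * y + (1 / (1 + b)) * m)
print(E)
```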
Linear Personality Matrix
[Diagram: matrix M over suggestions and actions, enumerated per event.]

Linear (or non-linear) matrix transformation: M s = a
Notice: the personality matrix may or may not mix suggestions across events, and can include semantic information
Over time, we can estimate M_{a,s} = prob(a | s), and can then solve for prob(s) using a computational linear solver: s = M⁻¹ a

i.e. s1 = call, s2 = sms, s3 = mms, s4 = email

i.e. for a given time and location, count how many times we suggested a call but the user chose an e-mail instead
Obviously we would like M to be diagonal, or as close to diagonal as possible! Can we devise an algorithm that will learn to give "optimal" suggestions?
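A small numpy sketch of this bookkeeping: estimate M_{a,s} = prob(a | s) from (invented) suggestion/action counts, then solve M s = a with a linear solver:

```python
import numpy as np

# rows = actions taken, columns = suggestions made (call, sms, mms, email),
# counted over some time/location bucket; the numbers are invented
counts = np.array([[30.,  2.,  1.,  5.],
                   [ 3., 20.,  2.,  1.],
                   [ 1.,  1., 10.,  1.],
                   [ 6.,  2.,  2., 25.]])

M = counts / counts.sum(axis=0, keepdims=True)   # M[a, s] ~ prob(action a | suggestion s)

a = np.array([0.4, 0.25, 0.1, 0.25])             # observed action distribution
s = np.linalg.solve(M, a)                        # solve M s = a for the suggestion mix
print(s)
```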
Matrices for Pattern Recognition (Statistical Factor Analysis)
[Matrix A: rows enumerate the choices (Call on Mon @ pod 1, Call on Mon @ pod 2, Call on Mon @ pod 3, …, Sms on Tue @ pod 1, …); columns are weeks 1, 2, 3, 4, 5, …]
We can apply computational linear algebra to remove noise and find patterns in the data. Statisticians call this Factor Analysis; engineers call it Singular Value Decomposition (SVD). It is implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.
1. Enumerate all choices
2. Count the # of times each choice is made each week
3. Form the weekly choice density matrix AᵗA
4. Weekly patterns are collapsed into the density matrix AᵗA; they can be detected using spectral analysis (i.e. the principal eigenvalues)

[Diagram: eigenvalue spectrum of AᵗA separating "all weekly patterns" from "pure noise".]
Similar to (multinomial) Latent Dirichlet Allocation (LDA), but much simpler to implement. Suitable when the number of choices is not too large and the patterns are weekly.
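Oracle Data Mining itself is not shown here, so as a stand-in, a sketch using scikit-learn's NMF on an invented choices-by-weeks matrix, with the eigenvalues of AᵗA for comparison:

```python
import numpy as np
from sklearn.decomposition import NMF

# toy choices-by-weeks count matrix: rows enumerate choices
# ("Call on Mon @ pod 1", ...), columns are weeks; values are invented counts
rng = np.random.default_rng(5)
pattern = np.outer(rng.integers(0, 4, size=12), np.ones(8))     # a repeating weekly pattern
A = pattern + rng.poisson(0.5, size=(12, 8))                    # plus noise

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)        # choices x factors
H = model.components_             # factors x weeks
print(W.shape, H.shape)

# the same low-rank structure shows up as a few dominant eigenvalues of AᵗA
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(evals[:4])
```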
Search Engine Relevance: Listing on Shopping.com

Which 5 items to list at the bottom of the page?
Statistical Machine Learning: Support Vector Machines (SVM)
From Regression to Classification: Maximum Margin Solutions

[Diagram: two classes separated by the hyperplane wᵀx − b = ±1, with margin 2/||w||₂.]
Classification := Find the line that separates the points with the maximum margin
min ½ ||w||₂² subject to constraints

constraint specifications:

"above": w·x_i − b ≥ +1 − ξ_i   (all points "above" the line)
"below": w·x_i − b ≤ −1 + ξ_i   (all points "below" the line)

Simple minimization (regression) becomes a convex optimization (classification), perhaps within some slack (i.e. min ½ ||w||₂² + C Σ_i ξ_i)
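A minimal soft-margin SVM sketch on invented 2-D data; scikit-learn's LinearSVC is used here only as a convenient stand-in for the SVMlight-family tools discussed next:

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy 2-D points with +1/-1 labels, invented for illustration
rng = np.random.default_rng(6)
X_pos = rng.normal(loc=[+2, +2], size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

# soft-margin linear SVM: min 1/2 ||w||^2 + C * sum(slack)
clf = LinearSVC(C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("training accuracy:", clf.score(X, y))
```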
SVM Light: Multivariate Rank Constraints
Multivariate Classification:
min ½ ||w||₂² + C ξ   s.t.

for all y′: wᵀ Ψ(x, y) − wᵀ Ψ(x, y′) ≥ Δ(y, y′) − ξ

let Ψ(x, y′) = Σ_i y′_i x_i be a linear function
[Diagram: documents x1, x2, …, xn are scored by wᵀx (e.g. −0.1, +1.2, …, −0.7); sgn maps the scores to labels y (e.g. −1, +1, …, −1), i.e. maps docs to relevance scores (+1/−1).]
learn weights (w) s.t. the maximizer y′ of wᵀ Ψ(x, y′) is correct for the training set (within a single slack constraint ξ)
Δ(y, y′) is a multivariate loss function (i.e. 1 − AveragePrecision(y, y′))

Ψ(x, y′) is a linear discriminant function (i.e. a sum over ordered pairs: Σ_i Σ_j y′_ij (x_i − x_j))
SvmLight Ranking SVMs
SVMperf: ROC Area, F1 Score, Precision/Recall
SVMmap: Mean Average Precision (warning: buggy!)
SVMrank: Ordinal Regression
Standard classification on pairwise differences
min ½ ||w||₂² + C Σ ξ_ijk   s.t.

for all queries q_k (later, may not be query specific in SVMstruct)

and doc pairs d_i, d_j:  wᵀ Ψ(q_k, d_i) − wᵀ Ψ(q_k, d_j) ≥ 1 − ξ_ijk

Δ_ROCArea = 1 − # swapped pairs

Enforces a directed ordering
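A sketch of the "classification on pairwise differences" idea: form difference vectors for document pairs where one should rank above the other and train a linear classifier on them. The toy features and labels are invented, and a plain linear SVM stands in for SVMrank:

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy query: 6 documents with 4 features and graded relevance labels (invented)
rng = np.random.default_rng(7)
docs = rng.normal(size=(6, 4))
rel  = np.array([3, 2, 2, 1, 0, 0])

# pairwise differences: one example per (d_i, d_j) with rel_i > rel_j
diffs, signs = [], []
for i in range(len(docs)):
    for j in range(len(docs)):
        if rel[i] > rel[j]:
            diffs.append(docs[i] - docs[j]); signs.append(+1)
            diffs.append(docs[j] - docs[i]); signs.append(-1)   # mirrored pair balances the classes

clf = LinearSVC(C=1.0, fit_intercept=False)   # no bias: the constraints are on differences
clf.fit(np.array(diffs), np.array(signs))

scores = docs @ clf.coef_[0]                  # rank documents by wᵀx
print(np.argsort(-scores), rel)
```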
[Example: 8 documents with relevance labels 1 0 0 0 0 1 1 0, ranked in two opposite orders (8 7 6 5 4 3 2 1 vs. 1 2 3 4 5 6 7 8). One ordering scores MAP 0.56 / ROC Area 0.47, the other MAP 0.51 / ROC Area 0.53, so the two metrics prefer opposite rankings.]
A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
Large Scale, Linear SVMs
• Solving the Primal
  – Conjugate Gradient
  – Joachims: cutting plane algorithm
  – Nyogi
• Handling Large Numbers of Constraints
  – Cutting Plane Algorithm
• Open Source Implementations:
  – LibSVM
  – SVMLight
Search Engine Relevance: Listing on Shopping.com

A ranking SVM consistently improves the Shopping.com <click rank> by 12%
Various Sparse Matrix Problems:
Google Page Rank algorithm
M a = a: rank a series of web pages by simulating user browsing patterns (a) based on a probabilistic model (M) of page links (a minimal power-iteration sketch follows this list)
Pattern Recognition, Inference
L p = h: estimate unknown probabilities (p) based on historical observations (h) and a probability model (L) of the links between hidden nodes
Quantum Chemistry
H Ψ = E Ψ: compute the color of dyes and pigments given empirical information on related molecules, and/or by solving massive eigenvalue problems
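For the PageRank entry above, a minimal power-iteration sketch on an invented 4-page link graph; the damping factor 0.85 is a conventional choice, not from the slides:

```python
import numpy as np

# toy column-stochastic link matrix M over 4 pages: M[i, j] = prob(next page i | on page j)
links = np.array([[0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)
M = links / links.sum(axis=0, keepdims=True)

d, n = 0.85, M.shape[0]
G = d * M + (1 - d) / n * np.ones((n, n))     # damped "Google matrix"

a = np.ones(n) / n
for _ in range(100):                          # power iteration toward the fixed point M a = a
    a = G @ a
    a /= a.sum()
print(a)                                      # stationary page-rank scores
```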
Quantum Chemistry: the electronic structure eigenproblem

Solve a massive eigenvalue problem (dimension 10⁹-10¹²)

H Ψ(σ, π, …) = E Ψ(σ, π, …)

H: energy matrix
Ψ: quantum state eigenvector
σ, π, …: electrons
Methods can have general applicability: the Davidson method for dominant eigenvalues / eigenvectors
Motivation for Personalization Technology: from understanding the conceptual foundations of semi-empirical models (noiseless dimensional reduction)
Relations between Quantum Mechanics and Probabilistic Language Models
• Quantum States resemble the states (strings, words, phrases) in probabilistic language models (HMMs, SCFGs), except:
Ψ is a sum* of strings of electrons:

Ψ(σ, π) = c₁ |σ₁ σ₂ π₁ π₂| + c₂ |σ₂ σ₃ π₁ π₂| + …
• Energy Matrix H is known exactly, but large. Models of H can be inferred from empirical data to simplify computations.
• Energies ~= Log [Probabilities], un-normalized
*Not just a single string!
Ab initio (from first principles):
Solve the entire H Ψ(σ, π) = E Ψ(σ, π) … approximately
OR
Semi-empirical:
Assume the (σ, π) electrons are statistically independent:

Ψ(σ, π) = p(π) q(σ)

Treat the π-electrons explicitly, ignore σ (hidden):

PHP p(π) = E p(π): a much smaller problem
Parameterize PHP matrix => Heff with empirical data using a small set of molecules, then apply to others (dyes, pigments)
Dimensional Reduction in Quantum Chemistry: where do semi-empirical Hamiltonians come from?
Effective Hamiltonians: Semi-Empirical Pi-Electron Methods
Heff[E] p(π) = E p(π)

[ PHP  PHQ ] [ p ]       [ p ]
[ QHP  QHQ ] [ q ]  =  E [ q ]

PHP p + PHQ q = E p
QHP p + QHQ q = E q

=>

Heff[E] = PHP + PHQ (E − QHQ)⁻¹ QHP

(q: implicit / hidden)
The final Heff can be solved iteratively (as with the eSelf Leff), or perturbatively in various forms

The solution is formally exact => Dimensional Reduction / "Renormalization"
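A small numpy sketch of the partitioning above on an invented symmetric toy Hamiltonian, checking that an exact eigenvalue of the full H reappears in the much smaller Heff(E):

```python
import numpy as np

# toy symmetric "Hamiltonian"; the matrix and the P/Q split are invented for illustration
rng = np.random.default_rng(8)
H = rng.normal(size=(6, 6)); H = (H + H.T) / 2
P, Q = np.arange(2), np.arange(2, 6)          # explicit (pi-like) vs. hidden (sigma-like) spaces
HPP, HPQ = H[np.ix_(P, P)], H[np.ix_(P, Q)]
HQP, HQQ = H[np.ix_(Q, P)], H[np.ix_(Q, Q)]

def H_eff(E):
    # energy-dependent effective Hamiltonian: PHP + PHQ (E - QHQ)^-1 QHP
    return HPP + HPQ @ np.linalg.solve(E * np.eye(len(Q)) - HQQ, HQP)

# the reduction is formally exact: an exact eigenvalue E of the full H
# reappears as an eigenvalue of the much smaller H_eff(E)
E_exact = np.linalg.eigvalsh(H)[0]
print(np.linalg.eigvalsh(H_eff(E_exact)), E_exact)
assert np.isclose(np.linalg.eigvalsh(H_eff(E_exact)), E_exact).any()
```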
Graphical Methods
V_ij = [diagram] + [diagram] + …

Decompose Heff into effective interactions between electrons

(Expand (E − QHQ)⁻¹ in an infinite series and remove the E dependence.) Represent diagrammatically: ~300 diagrams to evaluate

Precompile using symbolic manipulation: ~35 MB executable; 8-10 hours to compile; run time: 3-4 hours per parameter
Effective Hamiltonians: Numerical Calculations
V_CC (eV):  π-only: 16;  effective: 11.5;  empirical: 11-12
Compute ab initio the empirical parameters: we can test all the basic assumptions of semi-empirical theory, "from first principles"
Also provides highly accurate eigenvalue spectra
Augment commercial packages (i.e. Fujitsu MOPAC) to model spectroscopy of photoactive proteins