A Generalization of Principal Component Analysis to the Exponential Family
Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Presented by: Sameer Agarwal
Artificial Intelligence Laboratory, UCSD
A Generalization of Principal Component Analysis to the Exponential Family – p.1/??
The shoe size problem
If you had only one variable to describe this data, what would it be?
The shoe size problem
Fit a line to the data and take the projection of the data onto it.
The PCA problem statement
Given a set of data points X = {x1, x2, . . . , xn}, xi ∈ Rd,
and an integer k < d, find a k-dimensional subspace S of Rd
and the corresponding projections Θ = {θ1, θ2, . . . , θn} of X onto S, s.t.

∑i ‖xi − θi‖²     (1)

is minimized, subject to the constraint that ΘΘᵀ is diagonal.
The Solution
The subspace spanned by the eigenvectors corresponding to the k largest eigenvalues of the matrix

C = (X − X̄)(X − X̄)ᵀ

where X̄ is the mean of the data.
The PCA Algorithm
PCA(X, k)
1. [U, D, V] = svd(X − X̄)
2. P = V[:, 1:k]

or,

PCA(X, k)
1. C = (X − X̄)(X − X̄)ᵀ
2. CV = V Λ
3. P = V[:, 1:k]
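The SVD variant above can be sketched in NumPy (my sketch, not part of the talk; it assumes rows of X are data points, so the principal directions are the top right singular vectors of the centered matrix):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix.

    X: (n, d) array with one data point per row; k: target dimension.
    Returns the (d, k) matrix of principal directions and the (n, k)
    projections of the data onto them.
    """
    Xc = X - X.mean(axis=0)                  # center: X - X̄
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                             # top-k right singular vectors
    return P, Xc @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
P, proj = pca(X, 2)

# The same directions arise as the top eigenvectors of C = (X - X̄)ᵀ(X - X̄),
# matching the second variant of the algorithm.
Xc = X - X.mean(axis=0)
w, V = np.linalg.eigh(Xc.T @ Xc)
top = V[:, np.argsort(w)[::-1][:2]]
assert np.allclose(np.abs(P.T @ top), np.eye(2), atol=1e-6)
```

The two variants agree up to the sign of each direction, which is why the comparison takes absolute values.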
PCA in pictures
Assume the data forms an ellipsoid.
PCA in pictures
Find the major and minor axes of the best fitting ellipsoid.
Why PCA?
1. Reduces the dimensionality of the data.
2. Optimal in terms of the L2 norm.
3. Output dimensions have zero correlation.
4. Denoises the data.
The age of the subspace
Assumption: The underlying model generating the observations lives in a low-dimensional subspace.

X = AV     (2)
size(V, 1) < size(X, 2)     (3)
PCA Rows of V are orthonormal.
ICA Rows of V are independent.
LDA Rows of V are such that there is maximum discrimination between classes.
and variants thereof.
A Gaussian view of PCA
Take a Line.
A Gaussian view of PCA
Sample points from it.
A Gaussian view of PCA
Add a sprinkling of unit-variance Gaussian noise.
A Gaussian view of PCA
Remove all trace of the original model.
A Gaussian view of PCA
Recover the best fitting subspace in the MLE sense, andthe projection of the data onto this subspace.
To summarize
Each data point xi is a sample from a probability distribution P(x; θi) of the form

P(x; θi) = (1/√(2π)) e^(−(x − θi)²/2)

Find Θ = {θ1, θ2, . . . , θn}, s.t. the likelihood of the data given the parameters Θ is maximized, subject to the constraint that Θ lies in a k-dimensional subspace.
But what if ..
You knew that all your noise was positive?
Beyond the Gaussian
1. Gaussian error is not suitable for every problem.
2. Integer-valued data ∼ Poisson.
3. Positive-valued data ∼ Exponential.
4. Binary-valued data ∼ Bernoulli.
The Exponential Family
P(x|θ) = P0(x) e^(x·θ − G(θ))

log P(x|θ) = log P0(x) + x·θ − G(θ)

1. θ is the natural parameter.
2. G(θ) = log ∫ e^(x·θ) P0(x) dx is the cumulant function.
3. G(θ) is strictly convex.
Members of the Exponential Family

Gaussian

P(x|µ) = (1/√(2π)) e^(−(x − µ)²/2) = (e^(−x²/2)/√(2π)) e^(xθ − θ²/2),    θ = µ

Bernoulli

P(x|p) = pˣ(1 − p)^(1−x) = e^(xθ − log(1 + e^θ)),    θ = log(p/(1 − p))
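As a quick sanity check (a minimal sketch of mine, not from the slides), the Bernoulli identity above can be verified numerically:

```python
import math

# Natural parameter and cumulant function for the Bernoulli
# distribution, as given on the slide.
def theta_of(p):
    return math.log(p / (1 - p))

def G(theta):
    return math.log(1 + math.exp(theta))

# Check that p^x (1-p)^(1-x) = exp(x*theta - G(theta)) for x in {0, 1}.
for p in (0.1, 0.5, 0.9):
    th = theta_of(p)
    for x in (0, 1):
        direct = p**x * (1 - p)**(1 - x)
        expfam = math.exp(x * th - G(th))
        assert abs(direct - expfam) < 1e-12
```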
Problem Statement
Given data points X = {x1, x2, . . . , xn} and a member of the exponential family PG(x|θ), find a set Θ = {θ1, θ2, . . . , θn} which maximizes the total likelihood of the data:

Θ* = argmax_Θ L(PG, X, Θ) = argmax_Θ ∏i P(xi|θi)
Convex Sets
[Figure: a convex set and a non-convex set, with points x and y joined by a line segment]
S ⊆ Rⁿ is a convex set if

x, y ∈ S, p, q ≥ 0, p + q = 1 ⇒ px + qy ∈ S
Convex Function
[Figure: a convex function; the chord between x and y lies above the graph]

f(px + qy) ≤ pf(x) + qf(y),    p + q = 1,    p, q ≥ 0
Convex Optimization Problems
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b, A ∈ R^(p×n)

f0, f1, . . . , fm are convex.
Affine equality constraints.
The feasible set is convex.
and the big deal is ?
Theorem: For a convex optimization problem, any local solution is also a global solution.

x ∈ C is locally optimal if, for some R > 0,

y ∈ C, ‖y − x‖ ≤ R ⇒ f0(y) ≥ f0(x)

x ∈ C is globally optimal if

y ∈ C ⇒ f0(y) ≥ f0(x)
Proof
Suppose x is locally optimal, but there exists y ∈ C with f0(y) < f0(x).

Let z = εy + (1 − ε)x with a small ε > 0. Then z is near x, with

f0(z) = f0(εy + (1 − ε)x) ≤ ε f0(y) + (1 − ε) f0(x) < f0(x)

A contradiction, since x is locally optimal.
Bregman divergences
[Figure: BF(p, q) is the gap at p between F and its tangent at q]
BF(p, q) = F(p) − F(q) − (p − q)·∇F(q)

F is real-valued, convex, and differentiable.
Properties of Bregman divergences
1. BF(q, p) measures the convexity of F: the increase of F(q) over F(p) beyond linear growth with slope ∇F(p).
2. BF(q, p) ≥ 0.
3. If F is strictly convex, BF(q, p) = 0 ⇐⇒ p = q.
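A small sketch (mine, not from the slides) showing the definition numerically, with two classic choices of F:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """B_F(p, q) = F(p) - F(q) - (p - q) . grad F(q)."""
    return F(p) - F(q) - np.dot(p - q, gradF(q))

# F(x) = ||x||^2 / 2 recovers half the squared Euclidean distance.
F = lambda x: 0.5 * np.dot(x, x)
gradF = lambda x: x
p = np.array([1.0, 2.0])
q = np.array([3.0, 0.0])
assert np.isclose(bregman(F, gradF, p, q), 0.5 * np.sum((p - q) ** 2))

# F(x) = sum x log x (negative entropy) recovers relative entropy
# between probability vectors.
F = lambda x: np.sum(x * np.log(x))
gradF = lambda x: np.log(x) + 1.0
p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q)) + np.sum(q - p)
assert np.isclose(bregman(F, gradF, p, q), kl)
```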
Properties of the Exponential Family
Let

λG(θ; x) = log PG(x; θ) = log P0(x) + x·θ − G(θ)

and assume Eθ[∇θ λ(θ; x)] = 0. Then,

1. ∇θ λ(θ; x) = x − ∇θ G(θ)
2. Eθ[x] = µ = ∇θ G(θ) = g(θ)
3. G strictly convex ⇒ g(θ) is invertible.
Properties of the Exponential Family
Define g⁻¹(µ) = θ; then we can define the dual of G(θ):

F(µ) = θ·µ − G(θ)

G(θ) strictly convex ⇒ F(µ) is strictly convex too. Now,

BG(θ, θ̂) = G(θ) + F(µ̂) − θ·µ̂ = BF(µ̂, µ)

λ(θ; x) = log P0(x) + θ·x − G(θ)
λ(θ; x) = log P0(x) + F(x) − BF(x, g(θ))
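These identities can be spot-checked numerically for the Bernoulli case (my own sketch; G, g, and F are the Bernoulli cumulant, mean function, and its dual from the earlier slide):

```python
import math

G = lambda th: math.log(1 + math.exp(th))                       # cumulant
g = lambda th: 1 / (1 + math.exp(-th))                          # g = ∇G
F = lambda mu: mu * math.log(mu) + (1 - mu) * math.log(1 - mu)  # dual of G
g_inv = lambda mu: math.log(mu / (1 - mu))                      # F' = g⁻¹

def BG(th, th_hat):
    return G(th) - G(th_hat) - (th - th_hat) * g(th_hat)

def BF(p, q):
    return F(p) - F(q) - (p - q) * g_inv(q)

th, th_hat = 0.7, -1.2
mu, mu_hat = g(th), g(th_hat)
assert abs(F(mu) - (th * mu - G(th))) < 1e-12        # F(µ) = θ·µ − G(θ)
assert abs(BG(th, th_hat) - BF(mu_hat, mu)) < 1e-12  # BG(θ, θ̂) = BF(µ̂, µ)
```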
ePCA
Since, for members of the exponential family,

−log P(x|θ) = −log P0(x) − F(x) + BF(x, g(θ))

we have

L(Θ) = −∑ij log PG(xij|θij)
     = C(X) + ∑ij BF(xij, g(θij))
     = C(X) + BF(X, g(Θ))

Θpca = argmin_Θ BF(X, g(Θ))
Subspace rank constraint
We want Θ to be constrained to a subspace of dimension k. Hence,

Θ = V A,    where size(V) = [d, k], size(A) = [k, n]

V is a set of basis vectors for the subspace containing Θ.
A is the set of Bregman projections of X onto V.
The optimization problem is now

argmin_Θ L(Θ) = argmin_{V,A} BF(X, g(V A))
Minimizing L(V, A): idea
1. Start with random V and A.
2. Hold V constant and minimize w.r.t. A.
3. Hold A constant and minimize w.r.t. V.
4. Repeat 2-3 until convergence.
The 1-d Algorithm
1. V = rand(d, 1)
2. A = rand(1, n)
3. until convergence:
4.   a_i^t = argmin_a ∑j BF(xij, g(a v_j^(t−1)))
5.   v_j^t = argmin_v ∑i BF(xij, g(a_i^t v))
6.   t = t + 1
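As an illustration (my own sketch, not from the slides), here is the 1-d algorithm instantiated for Bernoulli data, where g is the logistic function and BF(x, g(θ)) is exactly the negative log-likelihood. Exact grid search stands in for the scalar argmin steps:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(X, Theta):
    """For binary x, BF(x, g(θ)) is exactly the Bernoulli negative
    log-likelihood −x log σ(θ) − (1 − x) log(1 − σ(θ))."""
    P = sigmoid(Theta)
    return float(-np.sum(X * np.log(P) + (1 - X) * np.log(1 - P)))

def epca_1d_bernoulli(X, iters=15):
    """1-d ePCA for binary data. Grid search replaces the scalar argmin
    steps; the grid bounds also act as a crude guard against minimizers
    running off to infinity."""
    rng = np.random.default_rng(1)
    n, d = X.shape
    v = rng.normal(size=d)
    grid = np.linspace(-3.0, 3.0, 601)
    history = []
    for _ in range(iters):
        # a_i = argmin_a Σ_j BF(x_ij, g(a v_j)), evaluated on the grid
        Th = np.outer(grid, v)                               # (601, d)
        L = -(X @ np.log(sigmoid(Th)).T + (1 - X) @ np.log(1 - sigmoid(Th)).T)
        a = grid[np.argmin(L, axis=1)]                       # (n,)
        # v_j = argmin_v Σ_i BF(x_ij, g(a_i v))
        Th = np.outer(a, grid)                               # (n, 601)
        L = -(X.T @ np.log(sigmoid(Th)) + (1 - X).T @ np.log(1 - sigmoid(Th)))
        v = grid[np.argmin(L, axis=1)]                       # (d,)
        history.append(nll(X, np.outer(a, v)))
    return a, v, history

# Binary data whose log-odds matrix is (roughly) rank one.
rng = np.random.default_rng(0)
theta_true = np.outer(rng.normal(size=30), rng.normal(size=8))
X = (rng.random((30, 8)) < sigmoid(theta_true)).astype(float)
a, v, history = epca_1d_bernoulli(X)
# Each sweep is an exact minimization over the grid, so the objective
# is non-increasing across iterations.
assert all(h2 <= h1 + 1e-9 for h1, h2 in zip(history, history[1:]))
```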
An example : Gaussian Noise
a_i^t = argmin_a ∑j BF(xij, g(a v_j^(t−1)))

For the Gaussian, g(x) = x and BF(q, p) = (p − q)²/2, so

a_i^t = argmin_a [∑j (xij − a v_j^(t−1))²/2]

Differentiating and equating to zero we get

∑j (xij − a v_j^(t−1)) v_j^(t−1) = 0
An example (contd.)
Solving for a, we get

a ∑j (v_j^(t−1))² = ∑j xij v_j^(t−1)

a ‖V^(t−1)‖² = Xi · V^(t−1)

→ a_i^t = (Xi · V^(t−1)) / ‖V^(t−1)‖²

→ A^t = (X · V^(t−1)) / ‖V^(t−1)‖²
An example (contd.)
Similarly,

V^t = ((A^t)ᵀ X) / ‖A^t‖²

Combining the two we get

V^t = c V^(t−1) XᵀX

which is equivalent to the power method for calculating the largest eigenvector.
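The equivalence can be checked numerically (a quick sketch of mine; X is uncentered random data here for simplicity, and the constant c is absorbed by the normalization inside the updates):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))       # rows X_i are data points

# Alternating Gaussian updates from the slides:
#   a_i = X_i·v / ‖v‖²  (computed for all i at once),  v = AᵀX / ‖A‖²
v = rng.normal(size=4)
for _ in range(1000):
    A = X @ v / (v @ v)
    v = A @ X / (A @ A)

# The direction of v matches the top eigenvector of XᵀX, i.e. what the
# power method computes.
w, U = np.linalg.eigh(X.T @ X)
top = U[:, np.argmax(w)]
cos = abs(v @ top) / np.linalg.norm(v)
assert cos > 0.999
```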
Comments
1. Each step of the algorithm solves a convex optimization problem.
2. The overall optimization problem is not convex.
3. Convergence to the global optimum is difficult to prove in general.
4. The Gaussian is a special case, where the power method converges to the global optimum.
Avoiding Infinity
Problem: A local optimum may exist at infinity.
Solution: Penalize large movements away from a fixed point in the range of g(θ).

L′(Θ) = ∑ij [BF(xij, g(θij)) + ε BF(µ0, g(θij))]

ε is a small positive constant. µ0 is an arbitrary point in the range of g.

The alternating minimization procedure is guaranteed to find a bounded solution to L′(Θ).
The General Algorithm: code
1. A = 0, V = 0
2. for n = 1, . . . , N and c = 1, . . . , l:
3.   initialize v_c^0 randomly
4.   s_ij = ∑_{k≠c} a_ik v_kj
5.   until convergence:
6.     for i = 1, . . . , n:
7.       a_ic^t = argmin_a ∑j BF(xij, g(a v_cj^(t−1) + s_ij))
8.     for j = 1, . . . , d:
9.       v_cj^t = argmin_v ∑i BF(xij, g(a_ic^t v + s_ij))
Hindsight and GLMs
Generalized Linear Models are extensions of standard regression

y = Ax + ε

where ε is zero-mean, constant-variance normal noise. They generalize the model to

y = g(Ax) + η

Here, g is a real differentiable function known as the link function, and η is distributed according to a fixed member of the exponential family.
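For concreteness (a minimal sketch of mine, not from the talk): logistic regression is the GLM with Bernoulli η and g the logistic function, and can be fit by gradient descent on the negative log-likelihood:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression as a GLM: E[y] = g(Ax) with g the logistic function.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < sigmoid(A @ w_true)).astype(float)

# Gradient descent on the Bernoulli negative log-likelihood.
w = np.zeros(3)
for _ in range(5000):
    grad = A.T @ (sigmoid(A @ w) - y) / len(y)
    w -= 0.5 * grad

# At the maximum-likelihood solution the gradient vanishes.
assert np.linalg.norm(A.T @ (sigmoid(A @ w) - y) / len(y)) < 1e-3
```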
Related Work
Hofmann et al., Probabilistic Latent Semantic Indexing

Lee & Seung, Non-negative Matrix Factorization

These rely on explicit constraints on A and V.
Summary
Interpret PCA using a generative model.
Extend the model to the exponential family.
Convert the Log Likelihood into a Bregmandivergence.
Minimize the divergence using an alternatingminimization procedure.
References
Collins, M., Dasgupta, S. & Schapire, R. E. A Generalization of Principal Component Analysis to the Exponential Family.

Tipping, M. E. & Bishop, C. M. Probabilistic Principal Component Analysis.

Azoury, K. S. & Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions.
Acknowledgements
Sanjoy Dasgupta, for answering last-minute questions.

The music of Nusrat Fateh Ali Khan, for keeping me company.