A Generalization of Principal Component Analysis to the Exponential Family
Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Presented by: Sameer Agarwal
Artificial Intelligence Laboratory, UCSD
A Generalization of Principal Component Analysis to the Exponential Family – p.1/??
The shoe size problem
If you had only one variable to describe this data, what would it be?
The shoe size problem
Fit a line to the data and take the projection of the data onto it.
The PCA problem statement
Given a set of data points X = {x1, x2, . . . , xn}, xi ∈ Rd,
and an integer k < d, find a k-dimensional subspace S of Rd
and the corresponding projections Θ = {θ1, θ2, . . . , θn} of X onto S, s.t.

∑i ‖xi − θi‖²     (1)

is minimized, subject to the constraint that ΘΘᵀ is diagonal.
The Solution
The subspace spanned by the eigenvectors corresponding to the k largest eigenvalues of the matrix

C = (X − X̄)(X − X̄)ᵀ

where X̄ is the mean of the data.
The PCA Algorithm
PCA(X, k)
1. [U, D, V] = svd(X − X̄)
2. P = V[:, 1:k]

or,

PCA(X, k)
1. C = (X − X̄)(X − X̄)ᵀ
2. CV = V Λ
3. P = V[:, 1:k]
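The SVD variant above can be sketched in NumPy (my sketch, not part of the talk; it assumes rows of X are data points, so the principal directions are the top right singular vectors of the centered matrix):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix.

    X: (n, d) array with one data point per row; k: target dimension.
    Returns the (d, k) matrix of principal directions and the (n, k)
    projections of the data onto them.
    """
    Xc = X - X.mean(axis=0)                  # center: X - X̄
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                             # top-k right singular vectors
    return P, Xc @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
P, proj = pca(X, 2)

# The same directions arise as the top eigenvectors of C = (X - X̄)ᵀ(X - X̄),
# matching the second variant of the algorithm.
Xc = X - X.mean(axis=0)
w, V = np.linalg.eigh(Xc.T @ Xc)
top = V[:, np.argsort(w)[::-1][:2]]
assert np.allclose(np.abs(P.T @ top), np.eye(2), atol=1e-6)
```

The two variants agree up to the sign of each direction, which is why the comparison takes absolute values.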
PCA in pictures
Assume the data forms an ellipsoid.
PCA in pictures
Find the major and minor axes of the best fitting ellipsoid.
Why PCA?
1. Reduces the dimensionality of the data.
2. Optimal in terms of the L2 norm.
3. Output dimensions have zero correlation.
4. Denoises the data.
The age of the subspace
Assumption: The underlying model generating the observations lives in a low-dimensional subspace.

X = AV     (2)
size(V, 1) < size(X, 2)     (3)
PCA Rows of V are orthonormal.
ICA Rows of V are independent.
LDA Rows of V are such that there is maximum discrimination between classes.
and variants thereof.
A Gaussian view of PCA
Take a Line.
A Gaussian view of PCA
Sample points from it.
A Gaussian view of PCA
Add a sprinkling of unit-variance Gaussian noise.
A Gaussian view of PCA
Remove all trace of the original model.
A Gaussian view of PCA
Recover the best fitting subspace in the MLE sense, andthe projection of the data onto this subspace.
To summarize
Each data point xi is a sample from a probability distribution P(x; θi) of the form

P(x; θi) = (1/√(2π)) e^(−(x − θi)²/2)

Find Θ = {θ1, θ2, . . . , θn}, s.t. the likelihood of the data given the parameters Θ is maximized, subject to the constraint that Θ lies in a k-dimensional subspace.
But what if ..
You knew that all your noise was positive?
Beyond the Gaussian
1. Gaussian error is not suitable for every problem.
2. Integer-valued data ∼ Poisson.
3. Positive-valued data ∼ Exponential.
4. Binary-valued data ∼ Bernoulli.
The Exponential Family
P(x|θ) = P0(x) e^(x·θ − G(θ))

log P(x|θ) = log P0(x) + x·θ − G(θ)

1. θ is the natural parameter.
2. G(θ) = log ∫ e^(x·θ) P0(x) dx is the cumulant function.
3. G(θ) is strictly convex.
Members of the Exponential Family

Gaussian

P(x|µ) = (1/√(2π)) e^(−(x − µ)²/2) = (e^(−x²/2)/√(2π)) e^(xθ − θ²/2),    θ = µ

Bernoulli

P(x|p) = pˣ(1 − p)^(1−x) = e^(xθ − log(1 + e^θ)),    θ = log(p/(1 − p))
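As a quick sanity check (a minimal sketch of mine, not from the slides), the Bernoulli identity above can be verified numerically:

```python
import math

# Natural parameter and cumulant function for the Bernoulli
# distribution, as given on the slide.
def theta_of(p):
    return math.log(p / (1 - p))

def G(theta):
    return math.log(1 + math.exp(theta))

# Check that p^x (1-p)^(1-x) = exp(x*theta - G(theta)) for x in {0, 1}.
for p in (0.1, 0.5, 0.9):
    th = theta_of(p)
    for x in (0, 1):
        direct = p**x * (1 - p)**(1 - x)
        expfam = math.exp(x * th - G(th))
        assert abs(direct - expfam) < 1e-12
```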
Problem Statement
Given data points X = {x1, x2, . . . , xn} and a member of the exponential family PG(x|θ), find a set Θ = {θ1, θ2, . . . , θn} which maximizes the total likelihood of the data:

Θ* = argmax_Θ L(PG, X, Θ) = argmax_Θ ∏i P(xi|θi)
Convex Sets
[Figure: a convex set and a non-convex set, with points x and y joined by a line segment]
S ⊆ Rⁿ is a convex set if

x, y ∈ S, p, q ≥ 0, p + q = 1 ⇒ px + qy ∈ S
Convex Function
[Figure: a convex function; the chord between x and y lies above the graph]

f(px + qy) ≤ pf(x) + qf(y),    p + q = 1,    p, q ≥ 0
Convex Optimization Problems
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b, A ∈ R^(p×n)

f0, f1, . . . , fm are convex.
Affine equality constraints.
The feasible set is convex.
and the big deal is ?
Theorem: For a convex optimization problem, any local solution is also a global solution.

x ∈ C is locally optimal if, for some R > 0,

y ∈ C, ‖y − x‖ ≤ R ⇒ f0(y) ≥ f0(x)

x ∈ C is globally optimal if

y ∈ C ⇒ f0(y) ≥ f0(x)
Proof
Suppose x is locally optimal, but there exists y ∈ C with f0(y) < f0(x).

Let z = εy + (1 − ε)x with a small ε > 0. Then z is near x, with

f0(z) = f0(εy + (1 − ε)x) ≤ ε f0(y) + (1 − ε) f0(x) < f0(x)

A contradiction, since x is locally optimal.
Bregman divergences
[Figure: BF(p, q) is the gap at p between F and its tangent at q]
BF(p, q) = F(p) − F(q) − (p − q)·∇F(q)

F is real-valued, convex, and differentiable.
Properties of Bregman divergences
1. BF(q, p) measures the convexity of F: the increase of F(q) over F(p) beyond linear growth with slope ∇F(p).
2. BF(q, p) ≥ 0.
3. If F is strictly convex, BF(q, p) = 0 ⇐⇒ p = q.
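A small sketch (mine, not from the slides) showing the definition numerically, with two classic choices of F:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """B_F(p, q) = F(p) - F(q) - (p - q) . grad F(q)."""
    return F(p) - F(q) - np.dot(p - q, gradF(q))

# F(x) = ||x||^2 / 2 recovers half the squared Euclidean distance.
F = lambda x: 0.5 * np.dot(x, x)
gradF = lambda x: x
p = np.array([1.0, 2.0])
q = np.array([3.0, 0.0])
assert np.isclose(bregman(F, gradF, p, q), 0.5 * np.sum((p - q) ** 2))

# F(x) = sum x log x (negative entropy) recovers relative entropy
# between probability vectors.
F = lambda x: np.sum(x * np.log(x))
gradF = lambda x: np.log(x) + 1.0
p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q)) + np.sum(q - p)
assert np.isclose(bregman(F, gradF, p, q), kl)
```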
Properties of the Exponential Family
Let

λG(θ; x) = log PG(x; θ) = log P0(x) + x·θ − G(θ)

and assume Eθ[∇θ λ(θ; x)] = 0. Then,

1. ∇θ λ(θ; x) = x − ∇θ G(θ)
2. Eθ[x] = µ = ∇θ G(θ) = g(θ)
3. G strictly convex ⇒ g(θ) is invertible.
Properties of the Exponential Family
Define g⁻¹(µ) = θ; then we can define the dual of G(θ):

F(µ) = θ·µ − G(θ)

G(θ) strictly convex ⇒ F(µ) is strictly convex too. Now,

BG(θ, θ̂) = G(θ) + F(µ̂) − θ·µ̂ = BF(µ̂, µ)

λ(θ; x) = log P0(x) + θ·x − G(θ)
λ(θ; x) = log P0(x) + F(x) − BF(x, g(θ))
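These identities can be spot-checked numerically for the Bernoulli case (my own sketch; G, g, and F are the Bernoulli cumulant, mean function, and its dual from the earlier slide):

```python
import math

G = lambda th: math.log(1 + math.exp(th))                       # cumulant
g = lambda th: 1 / (1 + math.exp(-th))                          # g = ∇G
F = lambda mu: mu * math.log(mu) + (1 - mu) * math.log(1 - mu)  # dual of G
g_inv = lambda mu: math.log(mu / (1 - mu))                      # F' = g⁻¹

def BG(th, th_hat):
    return G(th) - G(th_hat) - (th - th_hat) * g(th_hat)

def BF(p, q):
    return F(p) - F(q) - (p - q) * g_inv(q)

th, th_hat = 0.7, -1.2
mu, mu_hat = g(th), g(th_hat)
assert abs(F(mu) - (th * mu - G(th))) < 1e-12        # F(µ) = θ·µ − G(θ)
assert abs(BG(th, th_hat) - BF(mu_hat, mu)) < 1e-12  # BG(θ, θ̂) = BF(µ̂, µ)
```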
ePCA
Since, for members of the exponential family,

−log P(x|θ) = −log P0(x) − F(x) + BF(x, g(θ))

we have

L(Θ) = −∑ij log PG(xij|θij)
     = C(X) + ∑ij BF(xij, g(θij))
     = C(X) + BF(X, g(Θ))

Θpca = argmin_Θ BF(X, g(Θ))
Subspace rank constraint
We want Θ to be constrained to a subspace of dimension k. Hence,

Θ = V A,    where size(V) = [d, k], size(A) = [k, n]

V is a set of basis vectors for the subspace containing Θ.
A is the set of Bregman projections of X onto V.
The optimization problem is now

argmin_Θ L(Θ) = argmin_{V,A} BF(X, g(V A))
Minimizing L(V, A): idea
1. Start with random V and A.
2. Hold V constant and minimize w.r.t. A.
3. Hold A constant and minimize w.r.t. V.
4. Repeat 2-3 until convergence.
The 1-d Algorithm
1. V = rand(d, 1)
2. A = rand(1, n)
3. until convergence:
4.   a_i^t = argmin_a ∑j BF(xij, g(a v_j^(t−1)))
5.   v_j^t = argmin_v ∑i BF(xij, g(a_i^t v))
6.   t = t + 1
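As an illustration (my own sketch, not from the slides), here is the 1-d algorithm instantiated for Bernoulli data, where g is the logistic function and BF(x, g(θ)) is exactly the negative log-likelihood. Exact grid search stands in for the scalar argmin steps:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(X, Theta):
    """For binary x, BF(x, g(θ)) is exactly the Bernoulli negative
    log-likelihood −x log σ(θ) − (1 − x) log(1 − σ(θ))."""
    P = sigmoid(Theta)
    return float(-np.sum(X * np.log(P) + (1 - X) * np.log(1 - P)))

def epca_1d_bernoulli(X, iters=15):
    """1-d ePCA for binary data. Grid search replaces the scalar argmin
    steps; the grid bounds also act as a crude guard against minimizers
    running off to infinity."""
    rng = np.random.default_rng(1)
    n, d = X.shape
    v = rng.normal(size=d)
    grid = np.linspace(-3.0, 3.0, 601)
    history = []
    for _ in range(iters):
        # a_i = argmin_a Σ_j BF(x_ij, g(a v_j)), evaluated on the grid
        Th = np.outer(grid, v)                               # (601, d)
        L = -(X @ np.log(sigmoid(Th)).T + (1 - X) @ np.log(1 - sigmoid(Th)).T)
        a = grid[np.argmin(L, axis=1)]                       # (n,)
        # v_j = argmin_v Σ_i BF(x_ij, g(a_i v))
        Th = np.outer(a, grid)                               # (n, 601)
        L = -(X.T @ np.log(sigmoid(Th)) + (1 - X).T @ np.log(1 - sigmoid(Th)))
        v = grid[np.argmin(L, axis=1)]                       # (d,)
        history.append(nll(X, np.outer(a, v)))
    return a, v, history

# Binary data whose log-odds matrix is (roughly) rank one.
rng = np.random.default_rng(0)
theta_true = np.outer(rng.normal(size=30), rng.normal(size=8))
X = (rng.random((30, 8)) < sigmoid(theta_true)).astype(float)
a, v, history = epca_1d_bernoulli(X)
# Each sweep is an exact minimization over the grid, so the objective
# is non-increasing across iterations.
assert all(h2 <= h1 + 1e-9 for h1, h2 in zip(history, history[1:]))
```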
An example : Gaussian Noise
a_i^t = argmin_a ∑j BF(xij, g(a v_j^(t−1)))

For the Gaussian, g(x) = x and BF(q, p) = (p − q)²/2, so

a_i^t = argmin_a [∑j (xij − a v_j^(t−1))²/2]

Differentiating and equating to zero we get

∑j (xij − a v_j^(t−1)) v_j^(t−1) = 0
An example (contd.)
Solving for a, we get

a ∑j (v_j^(t−1))² = ∑j xij v_j^(t−1)

a ‖V^(t−1)‖² = Xi · V^(t−1)

→ a_i^t = (Xi · V^(t−1)) / ‖V^(t−1)‖²

→ A^t = (X · V^(t−1)) / ‖V^(t−1)‖²
An example (contd.)
Similarly,

V^t = ((A^t)ᵀ X) / ‖A^t‖²

Combining the two we get

V^t = c V^(t−1) XᵀX

which is equivalent to the power method for calculating the largest eigenvector.
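The equivalence can be checked numerically (a quick sketch of mine; X is uncentered random data here for simplicity, and the constant c is absorbed by the normalization inside the updates):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))       # rows X_i are data points

# Alternating Gaussian updates from the slides:
#   a_i = X_i·v / ‖v‖²  (computed for all i at once),  v = AᵀX / ‖A‖²
v = rng.normal(size=4)
for _ in range(1000):
    A = X @ v / (v @ v)
    v = A @ X / (A @ A)

# The direction of v matches the top eigenvector of XᵀX, i.e. what the
# power method computes.
w, U = np.linalg.eigh(X.T @ X)
top = U[:, np.argmax(w)]
cos = abs(v @ top) / np.linalg.norm(v)
assert cos > 0.999
```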
Comments
1. Each step of the algorithm solves a convex optimization problem.
2. The overall optimization problem is not convex.
3. Convergence to the global optimum is difficult to prove in general.
4. The Gaussian is a special case, where the power method converges to the global optimum.
Avoiding Infinity
Problem: A local optimum may exist at infinity.
Solution: Penalize large movements away from a fixed point in the range of g(θ).

L′(Θ) = ∑ij [BF(xij, g(θij)) + ε BF(µ0, g(θij))]

ε is a small positive constant. µ0 is an arbitrary point in the range of g.

The alternating minimization procedure is guaranteed to find a bounded solution to L′(Θ).
The General Algorithm: code
1. A = 0, V = 0
2. for n = 1, . . . , N and c = 1, . . . , l:
3.   initialize v_c^0 randomly
4.   s_ij = ∑_{k≠c} a_ik v_kj
5.   until convergence:
6.     for i = 1, . . . , n:
7.       a_ic^t = argmin_a ∑j BF(xij, g(a v_cj^(t−1) + s_ij))
8.     for j = 1, . . . , d:
9.       v_cj^t = argmin_v ∑i BF(xij, g(a_ic^t v + s_ij))
Hindsight and GLMs
Generalized Linear Models are extensions of standard regression

y = Ax + ε

where ε is zero-mean, constant-variance normal noise. They generalize the model to

y = g(Ax) + η

Here, g is a real differentiable function known as the link function, and η is distributed according to a fixed member of the exponential family.
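For concreteness (a minimal sketch of mine, not from the talk): logistic regression is the GLM with Bernoulli η and g the logistic function, and can be fit by gradient descent on the negative log-likelihood:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression as a GLM: E[y] = g(Ax) with g the logistic function.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < sigmoid(A @ w_true)).astype(float)

# Gradient descent on the Bernoulli negative log-likelihood.
w = np.zeros(3)
for _ in range(5000):
    grad = A.T @ (sigmoid(A @ w) - y) / len(y)
    w -= 0.5 * grad

# At the maximum-likelihood solution the gradient vanishes.
assert np.linalg.norm(A.T @ (sigmoid(A @ w) - y) / len(y)) < 1e-3
```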
Related Work
Hofmann et al., Probabilistic Latent Semantic Indexing

Lee & Seung, Non-negative Matrix Factorization

These rely on explicit constraints on A and V.
Summary
Interpret PCA using a generative model.
Extend the model to the exponential family.
Convert the Log Likelihood into a Bregmandivergence.
Minimize the divergence using an alternatingminimization procedure.
References
Collins, M., Dasgupta, S. & Schapire, R. E. A Generalization of Principal Component Analysis to the Exponential Family.

Tipping, M. E. & Bishop, C. M. Probabilistic Principal Component Analysis.

Azoury, K. S. & Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions.
Acknowledgements
Sanjoy Dasgupta, for answering last-minute questions.

The music of Nusrat Fateh Ali Khan, for keeping me company.