45
A Generalization of Principal Component Analysis to the Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire Presented by: Sameer Agarwal Artificial Intelligence Laboratory, UCSD A Generalization of Principal Component Analysis to the Exponential Family – p.1/??

A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A Generalization of PrincipalComponent Analysis to the

Exponential FamilyMichael Collins, Sanjoy Dasgupta, Robert E. Schapire

Presented by: Sameer AgarwalArtificial Intelligence Laboratory, UCSD

A Generalization of Principal Component Analysis to the Exponential Family – p.1/??

Page 2: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The shoe size problem

If you had only one variable to describe this data, whatwould it be ?

A Generalization of Principal Component Analysis to the Exponential Family – p.2/??

Page 3: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The shoe size problem

Fit a line to the data and take the projection of the dataonto it.

A Generalization of Principal Component Analysis to the Exponential Family – p.3/??

Page 4: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The PCA problem statement

Given a set of data points X = {x1,x2, . . . ,xn}, xi ∈ Rn

and an integer k < n. Find a k-dimensional subspace Sof R

n and the corresponding projectionsΘ = {θ1, θ2, . . . , θn} of X onto S, s.t.

i

‖xi − θi‖2 (1)

is minimized. Subject to

ΘΘT is diagonal.

A Generalization of Principal Component Analysis to the Exponential Family – p.4/??

Page 5: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The Solution

The subspace spanned by the k-largest eigenvectors ofthe matrix

C = (X −X)(X −X)T

A Generalization of Principal Component Analysis to the Exponential Family – p.5/??

Page 6: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The PCA Algorithm

PCA (X, k)

1. [U,D, V ] = svd(X −X)

2. P = V [:, 1 : k]

or,

PCA (X, k)

1. C = (X −X)(X −X)T

2. CV = ΛV

3. P = V [:, 1 : k]

A Generalization of Principal Component Analysis to the Exponential Family – p.6/??

Page 7: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

PCA in pictures

Assume that data is an ellipsoid.

A Generalization of Principal Component Analysis to the Exponential Family – p.7/??

Page 8: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

PCA in pictures

Find the major and minor axes of the best fittingellipsoid.

A Generalization of Principal Component Analysis to the Exponential Family – p.8/??

Page 9: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Why PCA?

1. Reduce the dimenionality of the data.

2. Optimal in terms of L2 norm.

3. Output dimensions have zero correlation.

4. Denoises the data.

A Generalization of Principal Component Analysis to the Exponential Family – p.9/??

Page 10: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The age of the subspace

Assumption : The underlying model generating theobservations lives in a low dimensional subspace.

X = AV (2)size(V, 1) < size(X, 2) (3)

PCA Rows of V are orthonormal.

ICA Rows of V are independent.

LDA Rows of V are such that there is maximumdiscrimination between classes.

and variants thereof.

A Generalization of Principal Component Analysis to the Exponential Family – p.10/??

Page 11: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A gaussian view of PCA

Take a Line.

A Generalization of Principal Component Analysis to the Exponential Family – p.11/??

Page 12: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A gaussian view of PCA

Sample points from it.

A Generalization of Principal Component Analysis to the Exponential Family – p.12/??

Page 13: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A gaussian view of PCA

Add a sprinkling of unit variance gaussian noise.

A Generalization of Principal Component Analysis to the Exponential Family – p.13/??

Page 14: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A gaussian view of PCA

Remove all trace of the original model.

A Generalization of Principal Component Analysis to the Exponential Family – p.14/??

Page 15: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

A gaussian view of PCA

Recover the best fitting subspace in the MLE sense, andthe projection of the data onto this subspace.

A Generalization of Principal Component Analysis to the Exponential Family – p.15/??

Page 16: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

to summarize

Each data point xi is a sample from a probabilitydistribution P (x; θi) of the form

P (x; θi) =1√2π

e(x−θi)

2

2

Find Θ = {θ1, θ2, . . . , θn}, s.t. the likelihood of the data

given the parameters Θ is maximized, subject to the con-

straint that Θ lies in a k-dimensional subspace.

A Generalization of Principal Component Analysis to the Exponential Family – p.16/??

Page 17: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

But what if ..

You knew that all your noise was positive ?

A Generalization of Principal Component Analysis to the Exponential Family – p.17/??

Page 18: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Beyond the Gaussian

1. Guassian error is not suitable for every problem.

2. Integer valued data ∼ Poisson.

3. Positive valued data ∼ Exponential.

4. Binary valued data ∼ Bernoulli.

A Generalization of Principal Component Analysis to the Exponential Family – p.18/??

Page 19: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The Exponential Family

P (x|θ) = P0(x)ex·θ−G(θ)

log P (x|θ) = log P0(x) + x · θ −G(θ)

1. θ is the natural parameter.

2. G(θ) = ln∫

ex·θP0(x)dx is the Cumulant Function.

3. G(θ) is strictly convex.

A Generalization of Principal Component Analysis to the Exponential Family – p.19/??

Page 20: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Members of the Exponential family

Gaussian

P (x|µ) =1√2π

e(x−µ)2

2 =e−x2

√2π

exθ− θ2

2

θ = µ

Bernoulli

P (x|p) = px(1− p)(1−x) = 1exθ−log(1+eθ)

θ = logp

1− p

A Generalization of Principal Component Analysis to the Exponential Family – p.20/??

Page 21: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Problem Statement

Given data points X = {x1,x2, . . . ,xn} and a memberof the exponential family PG(x|θ), find a setΘ = {θ1, θ2, . . . , θn}, which maximizes the totallikelihood of the data.

Θ = argmaxΘ

L(PG,X,Θ)

= argmaxΘ

i

P (xi|θi)

A Generalization of Principal Component Analysis to the Exponential Family – p.21/??

Page 22: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Convex Sets

Convex Not Convex

x

y x

y

S ⊆ Rn is a convex set if

x, y ∈ S, p, q ≥ 0, p + q = 1 ⇒ pq + qy ∈ S

A Generalization of Principal Component Analysis to the Exponential Family – p.22/??

Page 23: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Convex Function

yx

p q

f (px + qy) ≤ pf(x) + qf(y)

p + q = 1

p, q ≥ 0

A Generalization of Principal Component Analysis to the Exponential Family – p.23/??

Page 24: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Convex Optimization Problems

minimize f0(x)

subject to fi(x) ≤ 0, i = 1, . . . ,m

Ax = b, A ∈ Rp×n

f0, f1, . . . , fm are convex.

Affine equality constraints.

Feasible set is convex.

A Generalization of Principal Component Analysis to the Exponential Family – p.24/??

Page 25: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

and the big deal is ?

Theorem: For a convex optimization problem, anylocal solution is also a global solution.

x ∈ C, is locally optimal if it satisfies

y ∈ C, ‖y − x‖ ≤ R ⇒ f0(y) ≥ f0(x)

x ∈ C is globally optimal if

y ∈ C ⇒ f0(y) ≥ f0(x)

A Generalization of Principal Component Analysis to the Exponential Family – p.25/??

Page 26: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Proof

Suppose x is locally optimal, buty ∈ C, f0(y) < f0(x)

Let z = εy + (1− ε)x with a small ε > 0.

z is near x, with

f0(z) = f0(εy + (1− ε)x)

≤ εf0(y) + (1− ε)f0(x)

< f0(x)

A contradiction, since x is locally optimal.

A Generalization of Principal Component Analysis to the Exponential Family – p.26/??

Page 27: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Bregman divergences

q p

BF (p, q) = F (p)− F (q)− (p− q) · ∇F (q)

F is real, convex and differentiable.

A Generalization of Principal Component Analysis to the Exponential Family – p.27/??

Page 28: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Properties of Bregman divergences

1. BF (q, p) measures the convexity ofF . It measuresthe increase in F (q) over F (p) above linear growthwith slope ∇F (p).

2. BF (q, p) ≥ 0.

3. BF (q, p) = 0 ⇐⇒ p = q and F (p) is strictlyconvex.

A Generalization of Principal Component Analysis to the Exponential Family – p.28/??

Page 29: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Properties of the Exponential Family

Let ,

λG(θ;x) = log PG(x; θ) = log P0(x) + x · θ −G(θ)

and assume,Eθ[∇θλ(θ;x)] = 0

then,

1. ∇θλ(θ;x) = x−∇θG(θ)

2. Eθ[x] = µ = ∇θG(θ) = g(θ)

3. G stricly convex ⇒ g(θ) is invertible.

A Generalization of Principal Component Analysis to the Exponential Family – p.29/??

Page 30: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Properties of the Exponential Family

Define, g−1(µ) = θ, then we can define the dual of G(θ)

F (µ) = θ · µ−G(θ)

G(θ) strictly convex ⇒ F (µ) is strictly convex too. now,

BG(θ, θ̂) = G(θ) + F (µ̂)− θ · µ̂ = BF (µ̂, µ)

λ(θ;x) = log P0(x) + θ · x−G(θ)

λ(θ;x) = log P0(x) + F (x)−BF (x, g(θ))

A Generalization of Principal Component Analysis to the Exponential Family – p.30/??

Page 31: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

ePCA

since for members of the exponential family,

− log P (x|θ) = − log P0(x)− F (x) + BF (x, g(θ))

we have

L(Θ) = −∑

ij

log PG(xij|θij)

= C(X) +∑

ij

BF (xij, g(θij))

= C(X) + BF (X, g(Θ))

Θpca = argminΘ

BF (X, g(Θ))

A Generalization of Principal Component Analysis to the Exponential Family – p.31/??

Page 32: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Subspace rank constraint

We want Θ to be constrained in a subspace of dimensionk. Hence,

Θ = V A

wheresize(V ) = [d, k]

size(A) = [k, n]

V is a set of basis vectors for the subspace containing Θ.A is the set of bregman projections of X onto V .The optimization problem now is

argminΘ

L(Θ) = argmin{V,A}

BF (X, g(V A))

A Generalization of Principal Component Analysis to the Exponential Family – p.32/??

Page 33: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Minimizing L(V, A): idea

1. Start Randomly

2. Hold V constant and minimize w.r.t A

3. Hold A constant and minimize w.r.t V

4. Repeat 2-3 until convergence.

A Generalization of Principal Component Analysis to the Exponential Family – p.33/??

Page 34: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The 1-d Algorithm

1. V = rand(d, 1)

2. A = rand(1, n)

3. until convergence

4. ati = argmin

a

j BF (xij, g(av(t−1)j ))

5. vtj = argmin

v

i BF (xij, g(ativ))

6. t = t + 1

A Generalization of Principal Component Analysis to the Exponential Family – p.34/??

Page 35: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

An example : Gaussian Noise

ati = argmin

aBF (xij, g(av

(t−1)j ))

g(x) = x

BF (q, p) =(p− q)2

2

→ ati = argmin

a

[

j

(xij − av(t−1)j )2

2

]

Differentiating and equating to zero we get∑

j

(xij − av(t−1)j )v

(t−1)j = 0

A Generalization of Principal Component Analysis to the Exponential Family – p.35/??

Page 36: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

An example (contd.)

Differentiating and equating to zero we get

a∑

j

v(t−1)j

2=

j

xijv(t−1)j

a‖V (t−1)‖2 = Xi · V (t−1)

→ ati =

Xi · V (t−1)

‖V (t−1)‖2

→ At =X · V (t−1)

‖V (t−1)‖2

A Generalization of Principal Component Analysis to the Exponential Family – p.36/??

Page 37: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

An example (contd.)

Similarly

V t =(At)

TX

‖At‖2

combining the two we get

V t = cV (t−1)XTX

equivalent to the power method of calculating the largest

eigenvector.

A Generalization of Principal Component Analysis to the Exponential Family – p.37/??

Page 38: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Comments

1. Each step of the algorithm solves a convexoptimization problem.

2. The over all optimization problem is not convex.

3. Convergence to global optima is difficult to prove ingeneral.

4. Gaussian is a special case, where the power methodconverges to the global optima.

A Generalization of Principal Component Analysis to the Exponential Family – p.38/??

Page 39: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Avoiding Infinity

Problem : A local optima may exist at infnity.Solution : Penalize large movements away from a fixedpoint in the range of g(θ).

L′(Θ) =∑

ij

[BF (xij , g(θij)) + εBF (µ0, g(θij))]

ε is a small positive constant. µ0 is an arbitrary point inthe range of g.

The alternating minimization procedure is guaranteed to

find a bounded solution to L′(Θ).

A Generalization of Principal Component Analysis to the Exponential Family – p.39/??

Page 40: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

The General Algorithm: code

1. A = 0, V = 0

2. For n = 1, . . . , N, c = 1, . . . , l

3. Initialize v0c randomly.

4. sij =∑

k 6=c aikvkj

5. until convergence

6. i = 1, . . . , n

7. atic = argmin

a

j BF (xij, g(av(t−1)cj + sij))

8. j = 1, . . . , d

9. vtcj = argmin

v

j BF (xij, g(aticv + sij) + sij)

A Generalization of Principal Component Analysis to the Exponential Family – p.40/??

Page 41: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Hindsight and GLMs

Generalized Linear Models are extensions ofstandard regression

y = Ax + ε

where ε is a zero mean, constant variance normal.They generalize the model to

y = g(Ax) + η

here, g is a real differentiable function known asthe link function.η is distributed according to a fixed member ofthe exponential family.

A Generalization of Principal Component Analysis to the Exponential Family – p.41/??

Page 42: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Related Work

Hoffman et. al Probabilistc Latent SemanticIndexing

Seung et. al Non-negative Matrix Factorization

Rely on explicit constraints on A and V .

A Generalization of Principal Component Analysis to the Exponential Family – p.42/??

Page 43: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Summary

Interpret PCA using a generative model.

Extend the model to the exponential family.

Convert the Log Likelihood into a Bregmandivergence.

Minimize the divergence using an alternatingminimization procedure.

A Generalization of Principal Component Analysis to the Exponential Family – p.43/??

Page 44: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

References

Collins and Schapire A Generalization of Principal Component

Analsysis to the Exponential Family

Tipping, M. E. & Bishop C. M. Probabilistic Principal

Component Analysis.

Azoury K. S. & Warmuth M. K. Relative loss bounds foron-line density estimation with the exponential family ofdistributions.

A Generalization of Principal Component Analysis to the Exponential Family – p.44/??

Page 45: A Generalization of Principal Component Analysis to the ...cseweb.ucsd.edu/classes/fa02/cse252c/pca.pdf · Exponential Family Michael Collins, Sanjoy Dasgupta, Robert E. Schapire

Acknowledgements

Sanjoy Dasgupta for answering last minutequestions.

The music of Nusrat Fateh Ali Khan for keeping mecompany.

A Generalization of Principal Component Analysis to the Exponential Family – p.45/??