A Generalization of Principal Component Analysis to the...

A Generalization of PrincipalComponent Analysis to the

Exponential FamilyMichael Collins, Sanjoy Dasgupta, Robert E. Schapire

Presented by: Sameer AgarwalArtificial Intelligence Laboratory, UCSD

A Generalization of Principal Component Analysis to the Exponential Family – p.1/??

The shoe size problem

If you had only one variable to describe this data, whatwould it be ?

The shoe size problem

Fit a line to the data and take the projection of the dataonto it.

The PCA problem statement

Given a set of data points X = {x1,x2, . . . ,xn}, xi ∈ Rn

and an integer k < n. Find a k-dimensional subspace Sof R

n and the corresponding projectionsΘ = {θ1, θ2, . . . , θn} of X onto S, s.t.

‖xi − θi‖2 (1)

is minimized. Subject to

ΘΘT is diagonal.

The Solution

The subspace spanned by the k-largest eigenvectors ofthe matrix

C = (X −X)(X −X)T

The PCA Algorithm

PCA (X, k)

1. [U,D, V ] = svd(X −X)

2. P = V [:, 1 : k]

PCA (X, k)

1. C = (X −X)(X −X)T

2. CV = ΛV

3. P = V [:, 1 : k]

PCA in pictures

Assume that data is an ellipsoid.

PCA in pictures

Find the major and minor axes of the best fittingellipsoid.

Why PCA?

1. Reduce the dimenionality of the data.

2. Optimal in terms of L2 norm.

3. Output dimensions have zero correlation.

4. Denoises the data.

The age of the subspace

Assumption : The underlying model generating theobservations lives in a low dimensional subspace.

X = AV (2)size(V, 1) < size(X, 2) (3)

PCA Rows of V are orthonormal.

ICA Rows of V are independent.

LDA Rows of V are such that there is maximumdiscrimination between classes.

and variants thereof.

A gaussian view of PCA

Take a Line.

Sample points from it.

Add a sprinkling of unit variance gaussian noise.

Remove all trace of the original model.

Recover the best fitting subspace in the MLE sense, andthe projection of the data onto this subspace.

to summarize

Each data point xi is a sample from a probabilitydistribution P (x; θi) of the form

P (x; θi) =1√2π

e(x−θi)

Find Θ = {θ1, θ2, . . . , θn}, s.t. the likelihood of the data

given the parameters Θ is maximized, subject to the con-

straint that Θ lies in a k-dimensional subspace.

But what if ..

You knew that all your noise was positive ?

Beyond the Gaussian

1. Guassian error is not suitable for every problem.

2. Integer valued data ∼ Poisson.

3. Positive valued data ∼ Exponential.

4. Binary valued data ∼ Bernoulli.

The Exponential Family

P (x|θ) = P0(x)ex·θ−G(θ)

log P (x|θ) = log P0(x) + x · θ −G(θ)

1. θ is the natural parameter.

2. G(θ) = ln∫

ex·θP0(x)dx is the Cumulant Function.

3. G(θ) is strictly convex.

Members of the Exponential family

Gaussian

P (x|µ) =1√2π

e(x−µ)2

2 =e−x2

√2π

exθ− θ2

θ = µ

Bernoulli

P (x|p) = px(1− p)(1−x) = 1exθ−log(1+eθ)

θ = logp

1− p

Problem Statement

Given data points X = {x1,x2, . . . ,xn} and a memberof the exponential family PG(x|θ), find a setΘ = {θ1, θ2, . . . , θn}, which maximizes the totallikelihood of the data.

Θ = argmaxΘ

L(PG,X,Θ)

= argmaxΘ

P (xi|θi)

Convex Sets

Convex Not Convex

S ⊆ Rn is a convex set if

x, y ∈ S, p, q ≥ 0, p + q = 1 ⇒ pq + qy ∈ S

Convex Function

f (px + qy) ≤ pf(x) + qf(y)

p + q = 1

p, q ≥ 0

Convex Optimization Problems

minimize f0(x)

subject to fi(x) ≤ 0, i = 1, . . . ,m

Ax = b, A ∈ Rp×n

f0, f1, . . . , fm are convex.

Affine equality constraints.

Feasible set is convex.

and the big deal is ?

Theorem: For a convex optimization problem, anylocal solution is also a global solution.

x ∈ C, is locally optimal if it satisfies

y ∈ C, ‖y − x‖ ≤ R ⇒ f0(y) ≥ f0(x)

x ∈ C is globally optimal if

y ∈ C ⇒ f0(y) ≥ f0(x)

Suppose x is locally optimal, buty ∈ C, f0(y) < f0(x)

Let z = εy + (1− ε)x with a small ε > 0.

z is near x, with

f0(z) = f0(εy + (1− ε)x)

≤ εf0(y) + (1− ε)f0(x)

< f0(x)

A contradiction, since x is locally optimal.

Bregman divergences

BF (p, q) = F (p)− F (q)− (p− q) · ∇F (q)

F is real, convex and differentiable.

Properties of Bregman divergences

1. BF (q, p) measures the convexity ofF . It measuresthe increase in F (q) over F (p) above linear growthwith slope ∇F (p).

2. BF (q, p) ≥ 0.

3. BF (q, p) = 0 ⇐⇒ p = q and F (p) is strictlyconvex.

Properties of the Exponential Family

λG(θ;x) = log PG(x; θ) = log P0(x) + x · θ −G(θ)

and assume,Eθ[∇θλ(θ;x)] = 0

1. ∇θλ(θ;x) = x−∇θG(θ)

2. Eθ[x] = µ = ∇θG(θ) = g(θ)

3. G stricly convex ⇒ g(θ) is invertible.

Properties of the Exponential Family

Define, g−1(µ) = θ, then we can define the dual of G(θ)

F (µ) = θ · µ−G(θ)

G(θ) strictly convex ⇒ F (µ) is strictly convex too. now,

BG(θ, θ̂) = G(θ) + F (µ̂)− θ · µ̂ = BF (µ̂, µ)

λ(θ;x) = log P0(x) + θ · x−G(θ)

λ(θ;x) = log P0(x) + F (x)−BF (x, g(θ))

since for members of the exponential family,

− log P (x|θ) = − log P0(x)− F (x) + BF (x, g(θ))

we have

L(Θ) = −∑

log PG(xij|θij)

= C(X) +∑

BF (xij, g(θij))

= C(X) + BF (X, g(Θ))

Θpca = argminΘ

BF (X, g(Θ))

Subspace rank constraint

We want Θ to be constrained in a subspace of dimensionk. Hence,

Θ = V A

wheresize(V ) = [d, k]

size(A) = [k, n]

V is a set of basis vectors for the subspace containing Θ.A is the set of bregman projections of X onto V .The optimization problem now is

argminΘ

L(Θ) = argmin{V,A}

BF (X, g(V A))

Minimizing L(V, A): idea

1. Start Randomly

2. Hold V constant and minimize w.r.t A

3. Hold A constant and minimize w.r.t V

4. Repeat 2-3 until convergence.

The 1-d Algorithm

1. V = rand(d, 1)

2. A = rand(1, n)

3. until convergence

4. ati = argmin

j BF (xij, g(av(t−1)j ))

5. vtj = argmin

i BF (xij, g(ativ))

6. t = t + 1

An example : Gaussian Noise

ati = argmin

aBF (xij, g(av

(t−1)j ))

g(x) = x

BF (q, p) =(p− q)2

→ ati = argmin

(xij − av(t−1)j )2

Differentiating and equating to zero we get∑

(xij − av(t−1)j )v

(t−1)j = 0

An example (contd.)

Differentiating and equating to zero we get

v(t−1)j

xijv(t−1)j

a‖V (t−1)‖2 = Xi · V (t−1)

→ ati =

Xi · V (t−1)

‖V (t−1)‖2

→ At =X · V (t−1)

‖V (t−1)‖2

An example (contd.)

Similarly

V t =(At)

‖At‖2

combining the two we get

V t = cV (t−1)XTX

equivalent to the power method of calculating the largest

eigenvector.

Comments

1. Each step of the algorithm solves a convexoptimization problem.

2. The over all optimization problem is not convex.

3. Convergence to global optima is difficult to prove ingeneral.

4. Gaussian is a special case, where the power methodconverges to the global optima.

Avoiding Infinity

Problem : A local optima may exist at infnity.Solution : Penalize large movements away from a fixedpoint in the range of g(θ).

L′(Θ) =∑

[BF (xij , g(θij)) + εBF (µ0, g(θij))]

ε is a small positive constant. µ0 is an arbitrary point inthe range of g.

The alternating minimization procedure is guaranteed to

find a bounded solution to L′(Θ).

The General Algorithm: code

1. A = 0, V = 0

2. For n = 1, . . . , N, c = 1, . . . , l

3. Initialize v0c randomly.

4. sij =∑

k 6=c aikvkj

5. until convergence

6. i = 1, . . . , n

7. atic = argmin

j BF (xij, g(av(t−1)cj + sij))

8. j = 1, . . . , d

9. vtcj = argmin

j BF (xij, g(aticv + sij) + sij)

Hindsight and GLMs

Generalized Linear Models are extensions ofstandard regression

y = Ax + ε

where ε is a zero mean, constant variance normal.They generalize the model to

y = g(Ax) + η

here, g is a real differentiable function known asthe link function.η is distributed according to a fixed member ofthe exponential family.

Related Work

Hoffman et. al Probabilistc Latent SemanticIndexing

Seung et. al Non-negative Matrix Factorization

Rely on explicit constraints on A and V .

Summary

Interpret PCA using a generative model.

Extend the model to the exponential family.

Convert the Log Likelihood into a Bregmandivergence.

Minimize the divergence using an alternatingminimization procedure.

References

Collins and Schapire A Generalization of Principal Component

Analsysis to the Exponential Family

Tipping, M. E. & Bishop C. M. Probabilistic Principal

Component Analysis.

Azoury K. S. & Warmuth M. K. Relative loss bounds foron-line density estimation with the exponential family ofdistributions.

Acknowledgements

Sanjoy Dasgupta for answering last minutequestions.

The music of Nusrat Fateh Ali Khan for keeping mecompany.

A Generalization of Principal Component Analysis to the...

Documents

Generative Adversarial Networks, and Applicationscseweb.ucsd.edu/classes/sp17/cse252C-a/CSE252C_20170412.pdf · Generative Model: Learns probability distribution of each class of

Edge image description using angular radial partitioningcseweb.ucsd.edu/classes/fa05/cse252c/chalechale.pdf · 2005. 10. 25. · Edge image description using angular radial partitioning

Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Using Color for Object Recognitioncseweb.ucsd.edu/classes/sp11/cse252c/projects/2011/mmoghimi_final.pdfdistinguishing feature for some categories. Usually most of object recognition

DSCANSGP0044 EOS M100 Leaflet 210x297 FA02 · DSCANSGP0044_EOS M100_Leaflet_210x297_FA02 Created Date: 10/24/2017 11:04:25 AM

CSE 252C: Advanced Computer Visioncseweb.ucsd.edu/~mkchandraker/classes/CSE252C/... · • Presentation format (1 slide for each): 1. Motivation and problem description 2. Prior work

Video Google - Home | Computer Science and …cseweb.ucsd.edu/classes/fa05/cse252c/hewitt-2.pdfObject Level Grouping Extends Video Google work. Infers object’s significance from

19 JacobianLinearizations,equilibriumpoints - cds.caltech.edumurray/courses/cds101/fa02/caltech/pph02... · Through much computer simulation, a preplanned input schedule is developed,

Combining Appearance and Topology for Wide Baseline ...cseweb.ucsd.edu/classes/fa02/cse252c/tellCarlssonShow.pdfMethods for Estimating the Fundamental Matrix. In IJCV pages 271-300,

NIPS 2007 tutorial, Theory and Applications of Boosting, …media.nips.cc/Conferences/2007/Tutorials/Slides/schapire-NIPS-07... · Theory and Applications of Boosting ... Outline

Spatial Transformer Networks - Computer Sciencecseweb.ucsd.edu/classes/sp17/cse252C-a/CSE252C_20170522.pdf · Each transformer in the network focuses on discriminative object parts

Machine Learning Rob Schapire Princeton Avrim Blum Carnegie Mellon Tommi Jaakkola MIT

Ensemblelearning Lecture12 - MITpeople.csail.mit.edu/dsontag/courses/ml12/slides/lecture12.pdf · Slides adapted from Luke Zettlemoyer, Vibhav Gogate, Rob Schapire, and Tommi Jaakkola

AdaBoost - University of California, San Diegocseweb.ucsd.edu/classes/fa01/cse291/AdaBoost.pdf · Vision and Learning Freund, Schapire, Singer: AdaBoost 2 ’ & $ % The horse-track

Learning to Order ThingsLearning to Order Things William W. Cohen Robert E. Schapire Yoram Singer AT&T Labs, 180 Park Ave., Florham Park, NJ 07932 { wcohen,schapire,singer} @research.att.com

0 Tutorial: Learning Topics in Game -Theoretic Decision Making Michael Littman - Rutgers University *Thanks to Michael Kearns, Manfred Warmuth, Rob Schapire

CSE 252C: Advanced Computer Visioncseweb.ucsd.edu/~mkchandraker/classes/CSE252C/...CSE 252C, SP20: Manmohan Chandraker [Kanazawa et al., CVPR 2019] Objects and Stuff CSE 252C, SP20:

AdaBoost Robert E. Schapire (Princeton University) Yoav Freund (University of California at San Diego) Presented by Zhi-Hua Zhou (Nanjing University)

Efficient Distribution-Free Learning of Probabilistic ...mkearns/papers/pconcepts.pdf · Efficient Distribution-Free Learning of Probabilistic MICHAEL J. KEARNS AND ROBERT E. SCHAPIRE

AdaBoost - University of California, San Diego › classes › fa01 › cse291 › AdaBoost.pdfVision and Learning Freund, Schapire, Singer: AdaBoost 1 ’ & $ % Resampling for Classi