
Introduction to Machine Learning (67577)
Lecture 11

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Dimensionality Reduction


Dimensionality Reduction

Dimensionality Reduction = taking data in a high-dimensional space and mapping it into a low-dimensional space

Why?

Reduces training (and testing) time
Reduces estimation error
Interpretability of the data, finding meaningful structure in data, illustration

Linear dimensionality reduction: $x \mapsto Wx$ where $W \in \mathbb{R}^{n,d}$ and $n < d$

Outline

1 Principal Component Analysis (PCA)

2 Random Projections

3 Compressed Sensing


Principal Component Analysis (PCA)

$x \mapsto Wx$

What makes $W$ a good matrix for dimensionality reduction?

Natural criterion: we want to be able to approximately recover $x$ from $y = Wx$

PCA:

Linear recovery: $\tilde{x} = Uy = UWx$

Measure "approximate recovery" by the averaged squared norm: given examples $x_1,\dots,x_m$, solve

$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \|x_i - UWx_i\|^2$$


Solving the PCA Problem

$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \|x_i - UWx_i\|^2$$

Theorem

Let $A = \sum_{i=1}^{m} x_i x_i^\top$ and let $u_1,\dots,u_n$ be the $n$ leading eigenvectors of $A$. Then, the solution to the PCA problem is to set the columns of $U$ to be $u_1,\dots,u_n$ and to set $W = U^\top$.
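To make the theorem concrete, here is a minimal NumPy sketch (not part of the original slides; the synthetic data and variable names are illustrative) that computes $U$ from the leading eigenvectors of $A = \sum_i x_i x_i^\top$, sets $W = U^\top$, and reconstructs:

```python
import numpy as np

# Illustrative data: m = 100 examples in d = 5 dimensions, target dimension n = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
n = 2

A = X.T @ X                           # A = sum_i x_i x_i^T  (d x d)
eigvals, eigvecs = np.linalg.eigh(A)  # eigh returns eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :n]           # columns = n leading eigenvectors u_1, ..., u_n
W = U.T                               # the theorem: W = U^T

Y = X @ W.T                           # compressed representations y_i = W x_i
X_tilde = Y @ U.T                     # reconstructions x_tilde_i = U y_i
objective = np.sum((X - X_tilde) ** 2)
```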


Proof main ideas

$UW$ has rank at most $n$, therefore its range is contained in an $n$-dimensional subspace, denoted $S$

The transformation $x \mapsto UWx$ maps $x$ into this subspace

The point in $S$ which is closest to $x$ is $VV^\top x$, where the columns of $V$ form an orthonormal basis of $S$

Therefore, we can assume w.l.o.g. that $W = U^\top$ and that the columns of $U$ are orthonormal


Proof main ideas

Observe:

$$\begin{aligned}
\|x - UU^\top x\|^2 &= \|x\|^2 - 2x^\top UU^\top x + x^\top UU^\top UU^\top x \\
&= \|x\|^2 - x^\top UU^\top x \\
&= \|x\|^2 - \operatorname{trace}(U^\top x x^\top U),
\end{aligned}$$

Therefore, an equivalent PCA problem is

$$\operatorname*{argmax}_{U \in \mathbb{R}^{d,n}:\, U^\top U = I} \operatorname{trace}\!\left(U^\top \left(\sum_{i=1}^{m} x_i x_i^\top\right) U\right).$$

The solution is to set $U$ to be the leading eigenvectors of $A = \sum_{i=1}^{m} x_i x_i^\top$.


Value of the objective

It is easy to see that

$$\min_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \|x_i - UWx_i\|^2 = \sum_{i=n+1}^{d} \lambda_i(A),$$

where $\lambda_1(A) \ge \dots \ge \lambda_d(A)$ are the eigenvalues of $A$.
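A short numerical sanity check of this identity, using the eigenvector solution from the theorem (a sketch with synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))               # m = 200 examples in d = 8 dimensions
n = 3

A = X.T @ X
eigvals, eigvecs = np.linalg.eigh(A)        # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :n]                 # n leading eigenvectors
objective = np.sum((X - X @ U @ U.T) ** 2)  # sum_i ||x_i - U U^T x_i||^2
trailing = eigvals[:-n].sum()               # lambda_{n+1}(A) + ... + lambda_d(A)

assert np.isclose(objective, trailing)
```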


Centering

It is common practice to "center" the examples before applying PCA, namely:

First calculate $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$

Then apply PCA to the vectors $(x_1 - \mu),\dots,(x_m - \mu)$

This is also related to the interpretation of PCA as variance maximization (given as an exercise)


Efficient implementation for $d \gg m$ and kernel PCA

Recall: $A = \sum_{i=1}^{m} x_i x_i^\top = X^\top X$ where $X \in \mathbb{R}^{m,d}$ is the matrix whose $i$'th row is $x_i^\top$

Let $B = X X^\top$. That is, $B_{i,j} = \langle x_i, x_j \rangle$

If $Bu = \lambda u$ then
$$A(X^\top u) = X^\top X X^\top u = X^\top B u = \lambda\, (X^\top u)$$

So, $\frac{X^\top u}{\|X^\top u\|}$ is an eigenvector of $A$ with eigenvalue $\lambda$

We can therefore calculate the PCA solution by computing the eigenvectors of $B$ instead of $A$

The complexity is $O(m^3 + m^2 d)$

And, it can be computed using a kernel function


Pseudo code

PCA

input: a matrix of $m$ examples $X \in \mathbb{R}^{m,d}$; number of components $n$
if $m > d$:
    $A = X^\top X$
    let $u_1,\dots,u_n$ be the eigenvectors of $A$ with largest eigenvalues
else:
    $B = X X^\top$
    let $v_1,\dots,v_n$ be the eigenvectors of $B$ with largest eigenvalues
    for $i = 1,\dots,n$ set $u_i = \frac{1}{\|X^\top v_i\|} X^\top v_i$
output: $u_1,\dots,u_n$
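A runnable NumPy translation of this pseudocode (a sketch; the function name `pca` is my choice). For kernel PCA, $B_{i,j}$ would be computed with a kernel function instead of the inner product, as noted on the previous slide.

```python
import numpy as np

def pca(X, n):
    """Return a d x n matrix whose columns are u_1, ..., u_n, following the pseudocode."""
    m, d = X.shape
    if m > d:
        A = X.T @ X                           # d x d
        _, eigvecs = np.linalg.eigh(A)        # ascending eigenvalue order
        U = eigvecs[:, ::-1][:, :n]           # eigenvectors with largest eigenvalues
    else:
        B = X @ X.T                           # m x m, B_ij = <x_i, x_j>
        _, eigvecs = np.linalg.eigh(B)
        V = eigvecs[:, ::-1][:, :n]
        U = X.T @ V                           # u_i proportional to X^T v_i
        U = U / np.linalg.norm(U, axis=0)     # normalize: u_i = X^T v_i / ||X^T v_i||
    return U

# Compress with y = U.T @ x, reconstruct with x_tilde = U @ y.
```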


Demonstration

[Figure: 2D scatter plot for the PCA demonstration; both axes range from −1.5 to 1.5]


Demonstration

$50 \times 50$ images from the Yale dataset

Before (left) and after reconstruction (right) to 10 dimensions


Demonstration

Before and after


Demonstration

Images after dimensionality reduction to $\mathbb{R}^2$

Different marks indicate different individuals



What is a successful dimensionality reduction?

In PCA, we measured success as the squared distance between $x$ and a reconstruction of $x$ from $y = Wx$

In some cases, we don't care about reconstruction; all we care about is that $y_1,\dots,y_m$ will retain certain properties of $x_1,\dots,x_m$

One option: do not distort distances. That is, we'd like that for all $i,j$, $\|x_i - x_j\| \approx \|y_i - y_j\|$

Equivalently, we'd like that for all $i,j$, $\frac{\|Wx_i - Wx_j\|}{\|x_i - x_j\|} \approx 1$

Equivalently, we'd like that for all $x \in Q$, where $Q = \{x_i - x_j : i, j \in [m]\}$, we'll have $\frac{\|Wx\|}{\|x\|} \approx 1$


Random Projections do not distort norms

Random projection: the transformation $x \mapsto Wx$, where $W$ is a random matrix

We'll analyze the distortion due to $W$ s.t. $W_{i,j} \sim N(0, 1/n)$

Let $w_i$ be the $i$'th row of $W$. Then:

$$\mathbb{E}[\|Wx\|^2] = \sum_{i=1}^{n} \mathbb{E}\big[(\langle w_i, x\rangle)^2\big] = \sum_{i=1}^{n} x^\top\, \mathbb{E}[w_i w_i^\top]\, x = n\, x^\top \left(\tfrac{1}{n} I\right) x = \|x\|^2$$

In fact, $n\|Wx\|^2/\|x\|^2$ has a $\chi^2_n$ distribution, and using a measure concentration inequality it can be shown that

$$\mathbb{P}\left[\,\left|\frac{\|Wx\|^2}{\|x\|^2} - 1\right| > \varepsilon\,\right] \le 2\, e^{-\varepsilon^2 n/6}$$
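A small simulation of this bound (a sketch, not from the slides; the dimensions and $\varepsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, eps = 1000, 100, 2000, 0.5
x = rng.normal(size=d)                                     # any fixed nonzero vector

ratios = np.empty(trials)
for t in range(trials):
    W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, d))    # W_ij ~ N(0, 1/n)
    ratios[t] = np.sum((W @ x) ** 2) / np.sum(x ** 2)      # ||Wx||^2 / ||x||^2

empirical = np.mean(np.abs(ratios - 1) > eps)
bound = 2 * np.exp(-eps ** 2 * n / 6)
print(f"mean ratio {ratios.mean():.3f}, empirical tail {empirical:.4f}, bound {bound:.4f}")
```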


Random Projections do not distort norms

Applying the union bound over all vectors in Q we obtain:

Lemma (Johnson-Lindenstrauss lemma)

Let $Q$ be a finite set of vectors in $\mathbb{R}^d$. Let $\delta \in (0,1)$ and $n$ be an integer such that

$$\varepsilon = \sqrt{\frac{6\log(2|Q|/\delta)}{n}} \le 3\,.$$

Then, with probability of at least $1 - \delta$ over a choice of a random matrix $W \in \mathbb{R}^{n,d}$ with $W_{i,j} \sim N(0, 1/n)$, we have

$$\max_{x \in Q} \left|\frac{\|Wx\|^2}{\|x\|^2} - 1\right| < \varepsilon\,.$$
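To get a feel for the constants, one can invert the lemma to find the projection dimension $n$ needed for a target distortion $\varepsilon$, and check the distortion empirically on a random finite set (a sketch; the helper name `jl_dimension` and all parameter values are mine):

```python
import numpy as np

def jl_dimension(eps, q_size, delta):
    """Smallest n for which the lemma guarantees distortion at most eps (assuming eps <= 3)."""
    return int(np.ceil(6 * np.log(2 * q_size / delta) / eps ** 2))

rng = np.random.default_rng(0)
d, m, delta, eps = 10_000, 100, 0.01, 0.2
Q = rng.normal(size=(m, d))                    # a finite set of m vectors in R^d
n = jl_dimension(eps, m, delta)                # note: n depends on |Q| and delta, not on d

W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, d))
distortion = np.abs(np.sum((Q @ W.T) ** 2, axis=1) / np.sum(Q ** 2, axis=1) - 1)
print(n, distortion.max())                     # max_{x in Q} | ||Wx||^2 / ||x||^2 - 1 |
```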


Compressed Sensing

Prior assumption: $x \approx U\alpha$ where $U$ is orthonormal and $\|\alpha\|_0 \stackrel{\text{def}}{=} |\{i : \alpha_i \neq 0\}| \le s$ for some $s \ll d$

E.g.: natural images are approximately sparse in a wavelet basis

How to "store" $x$?

We can find $\alpha = U^\top x$ and then save the non-zero elements of $\alpha$
Requires order of $s \log(d)$ storage
Why go to so much effort to acquire all the $d$ coordinates of $x$ when most of what we get will be thrown away? Can't we just directly measure the part that won't end up being thrown away?


Compressed Sensing

Informally, the main premise of compressed sensing is the following three "surprising" results:

1 It is possible to fully reconstruct any sparse signal if it was compressed by $x \mapsto Wx$, where $W$ is a matrix which satisfies a condition called the Restricted Isometry Property (RIP). A matrix that satisfies this property is guaranteed to have low distortion of the norm of any sparse-representable vector.

2 The reconstruction can be calculated in polynomial time by solving a linear program.

3 A random $n \times d$ matrix is likely to satisfy the RIP condition provided that $n$ is greater than order of $s \log(d)$.


Restricted Isometry Property (RIP)

A matrix $W \in \mathbb{R}^{n,d}$ is $(\varepsilon, s)$-RIP if for all $x \neq 0$ s.t. $\|x\|_0 \le s$ we have

$$\left|\frac{\|Wx\|_2^2}{\|x\|_2^2} - 1\right| \le \varepsilon\,.$$


RIP matrices yield lossless compression for sparse vectors

Theorem

Let $\varepsilon < 1$ and let $W$ be an $(\varepsilon, 2s)$-RIP matrix. Let $x$ be a vector s.t. $\|x\|_0 \le s$, let $y = Wx$ and let $\tilde{x} \in \operatorname*{argmin}_{v : Wv = y} \|v\|_0$. Then, $\tilde{x} = x$.

Proof.

Assume, by way of contradiction, that $\tilde{x} \neq x$.

Since $x$ satisfies the constraints in the optimization problem that defines $\tilde{x}$, we clearly have that $\|\tilde{x}\|_0 \le \|x\|_0 \le s$.

Therefore, $\|x - \tilde{x}\|_0 \le 2s$.

By RIP applied to $x - \tilde{x}$ we have $\left|\frac{\|W(x - \tilde{x})\|_2^2}{\|x - \tilde{x}\|_2^2} - 1\right| \le \varepsilon$

But, since $W(x - \tilde{x}) = 0$ we get that $|0 - 1| \le \varepsilon$. Contradiction.


Efficient reconstruction

If we further assume that $\varepsilon < \frac{1}{1+\sqrt{2}}$ then

$$x = \operatorname*{argmin}_{v : Wv = y} \|v\|_0 = \operatorname*{argmin}_{v : Wv = y} \|v\|_1\,.$$

The right-hand side is a linear programming problem

Summary: we can reconstruct all sparse vectors efficiently based on $O(s \log(d))$ measurements
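The $\ell_1$ problem on the right-hand side can be written as a linear program over $(v, t)$ with constraints $-t \le v \le t$ and $Wv = y$, minimizing $\sum_i t_i$. A minimal sketch using SciPy's `linprog` (assuming SciPy is available; the dimensions and sparsity level are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, n, s = 200, 60, 5
W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, d))   # random compression matrix

x = np.zeros(d)                                        # an s-sparse signal
support = rng.choice(d, size=s, replace=False)
x[support] = rng.normal(size=s)
y = W @ x                                              # n measurements

# min ||v||_1 s.t. Wv = y, written over z = [v; t] with -t <= v <= t:
c = np.concatenate([np.zeros(d), np.ones(d)])          # objective: sum_i t_i
A_eq = np.hstack([W, np.zeros((n, d))])                # W v = y
I = np.eye(d)
A_ub = np.vstack([np.hstack([I, -I]),                  #  v - t <= 0
                  np.hstack([-I, -I])])                # -v - t <= 0
b_ub = np.zeros(2 * d)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
x_hat = res.x[:d]
print("recovery error:", np.linalg.norm(x_hat - x))    # close to 0 when W behaves like an RIP matrix
```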


PCA vs. Random Projections

Random projections guarantee perfect recovery for all $O(n/\log(d))$-sparse vectors

PCA guarantees perfect recovery if all examples are in an $n$-dimensional subspace

Different prior knowledge:

If the data is $e_1,\dots,e_d$, random projections will be perfect but PCA will fail
If $d$ is very large and the data lies exactly on an $n$-dimensional subspace, then PCA will be perfect but random projections might fail


Summary

Linear dimensionality reduction $x \mapsto Wx$

PCA: optimal if reconstruction is linear and error is squared distance
Random projections: preserve distances
Random projections: exact reconstruction for sparse vectors (but with a non-linear reconstruction)

Not covered: non-linear dimensionality reduction
