Differentiable Sparse Coding
David Bradley and J. Andrew Bagnell, NIPS 2008


Page 1:

Differentiable Sparse Coding

David Bradley and J. Andrew Bagnell, NIPS 2008

Page 2:

Joint Optimization

100,000 ft View

• Complex Systems

[System diagram: Cameras and Ladar feed a Voxel Grid; Voxel Features feed a Voxel Classifier, whose Column Cost output drives a 2-D Planner producing Y (path to goal).]

Initialize with “cheap” data

Page 3:

10,000 ft view

• Sparse Coding = Generative Model

• Semi-supervised learning
• KL-Divergence Regularization
• Implicit Differentiation

[Pipeline diagram: Unlabeled Data and the input x feed an Optimization module that produces a latent variable w; w feeds a Classifier, and the loss gradient flows back through the chain.]

Page 4:

Sparse Coding

Understand X

Page 5:

As a combination of factors: the columns of a basis matrix B

Page 6:

Sparse coding uses optimization

Projection (feed-forward): $w = f(B^T x)$

Optimization: choose $w$ so that $f(Bw)$ reconstructs $x$, i.e. minimize a reconstruction loss function between $x$ and $f(Bw)$.

We want to use the resulting vector $w$ to classify $x$.
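To make the contrast concrete, here is a minimal numpy sketch (not the authors' code) of the two encodings, using the squared-loss-plus-L1 instance that appears a few slides later: a single feed-forward projection versus an iterative optimization (ISTA-style proximal gradient). The soft-threshold nonlinearity, penalty weight, and random basis are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding (the proximal operator of the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def encode_projection(B, x, t=0.1):
    """Feed-forward encoding: a single projection, w = f(B^T x)."""
    return soft_threshold(B.T @ x, t)

def encode_optimization(B, x, lam=1.0, n_iters=500):
    """Optimization-based encoding: minimize 0.5*||x - B w||^2 + lam*||w||_1 by ISTA."""
    w = np.zeros(B.shape[1])
    step = 1.0 / np.linalg.norm(B, 2) ** 2             # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = B.T @ (B @ w - x)                       # gradient of the reconstruction loss
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 256))                     # 64-dim inputs, 256 basis vectors
x = B[:, 3] + 0.5 * B[:, 17]                           # input built from two basis vectors
w_proj = encode_projection(B, x)
w_opt = encode_optimization(B, x)
print(np.sum(np.abs(w_proj) > 0.01))                   # projection: nearly every element active
print(np.sum(np.abs(w_opt) > 0.01))                    # optimization: only a few elements active
```

The optimization-based codes concentrate on the few basis vectors that actually explain the input, which is the sparsity illustrated on the following slides.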

Page 7:

Sparse vectors: Only a few significant elements

Page 8:

Example: X = Handwritten Digits

Basis vectors colored by w

Page 9:

Optimization vs. Projection

[Figure: an input digit, the basis, and its feed-forward projection coefficients.]

Page 10:

Optimization vs. Projection

[Figure: an input digit, the basis, and its KL-regularized optimization coefficients.]

Outputs are sparse for each example.

Page 11:

Generative Model

Examples are independent, and the latent variables within each example are independent.

[Graphical model: a prior over each latent weight and a likelihood for each example.]
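In symbols, the factorization implied by these independence assumptions looks roughly as follows (the notation is assumed here; the slide only shows the graphical model):

```latex
P(X, W \mid B) \;=\; \prod_i \underbrace{P(x_i \mid w_i, B)}_{\text{Likelihood}} \; \underbrace{P(w_i)}_{\text{Prior}},
\qquad
P(w_i) \;=\; \prod_j P(w_{ij}) .
```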

Page 12:

Sparse Approximation

The MAP estimate of the weights under the likelihood and the prior.

Page 13:

Sparse Approximation

$-\log P(x, w) \;\propto\; \mathrm{Loss}\big(x, f(Bw)\big) \;+\; \lambda\,\lVert w - \mathrm{prior} \rVert_p$

Loss term: distance between the reconstruction and the input. Norm term: distance between the weight vector and the prior mean. $\lambda$: regularization constant.

Page 14:

Example: Squared Loss + L1

• Convex + sparse (widely studied in engineering)
• Sparse coding solves for B as well (non-convex, for now…)
• Shown to generate useful features on diverse problems

Tropp, Signal Processing, 2006
Donoho and Elad, Proceedings of the National Academy of Sciences, 2002
Raina, Battle, Lee, Packer, Ng, ICML, 2007

Page 15:

L1 Sparse Coding

Shown to generate useful features on diverse problems

Optimize B over all examples
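Written out, optimizing B over all examples means solving roughly the following problem (the unit-norm constraint on the basis columns is a common convention assumed here, not read off the slide):

```latex
\min_{B,\,\{w_i\}} \;\; \sum_i \Big( \tfrac{1}{2}\,\lVert x_i - B w_i \rVert_2^2 \;+\; \lambda\, \lVert w_i \rVert_1 \Big)
\qquad \text{s.t.} \quad \lVert b_j \rVert_2 \le 1 \;\text{ for every column } b_j \text{ of } B .
```

Each inner problem with B fixed is the convex sparse approximation from the previous slides; the joint problem over B and the codes is the non-convex part.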

Page 16:

Differentiable Sparse Coding

Bradley & Bagnell, “Differentiable Sparse Coding”, NIPS 2008

[Module diagram: Unlabeled Data trains the Optimization Module (B) (sparse coding) through a Reconstruction Loss; for Labeled Data, inputs X are encoded into codes W, with reconstruction f(BW), and W feeds a Learning Module (θ) whose predictions of Y are scored by a Loss Function.]

Raina, Battle, Lee, Packer, Ng, ICML, 2007

Page 17:

L1 Regularization is Not Differentiable

Bradley & Bagnell, “Differentiable Sparse Coding”, NIPS 2008

[Same module diagram as the previous slide: the sparse coding module's codes feed the learning module.]

Page 18:

Why is this unsatisfying?

Page 19:

Problem #1: Instability

L1 MAP estimates are discontinuous, so the outputs are not differentiable.

Instead, use KL-divergence regularization, which is proven to compete with L1 in online learning.
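For nonnegative weights, the KL regularizer measures the divergence of $w$ from the prior $p$; a standard (unnormalized) form, assumed here since the slide does not spell it out, is:

```latex
\mathrm{KL}(w \,\Vert\, p) \;=\; \sum_j \Big( w_j \log \frac{w_j}{p_j} \;-\; w_j \;+\; p_j \Big), \qquad w_j \ge 0 .
```

Unlike the L1 penalty, this is smooth and strictly convex for $w_j > 0$, so the MAP weights vary continuously with the input and the basis.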

Page 20:

Problem #2: No closed-form Equation

At the MAP estimate:

$\hat{w} \;=\; \arg\min_{w} \; \mathrm{Loss}\big(x, f(Bw)\big) + \lambda\,\lVert w - \mathrm{prior} \rVert_p$

$\nabla_{w} \Big[ \mathrm{Loss}\big(x, f(B\hat{w})\big) + \lambda\,\lVert \hat{w} - \mathrm{prior} \rVert_p \Big] \;=\; 0$

Page 21:

Solution: Implicit Differentiation

Differentiate both sides with respect to an element of $B$:

$\dfrac{\partial}{\partial B_{ij}} \nabla_{w} \Big[ \mathrm{Loss}\big(x, f(B\hat{w})\big) + \lambda\,\lVert \hat{w} - \mathrm{prior} \rVert_p \Big] \;=\; 0$

Since $\hat{w}$ is a function of $B$, solve this equation for $\partial \hat{w} / \partial B_{ij}$.
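Spelling out that step with $g(w, B)$ denoting the MAP objective (the symbol $g$ and the chain-rule expansion are assumed here; this is the standard implicit-function-theorem argument):

```latex
0 \;=\; \frac{d}{d B_{ij}}\,\nabla_w g(\hat{w}, B)
  \;=\; \nabla_w^2 g(\hat{w}, B)\,\frac{\partial \hat{w}}{\partial B_{ij}}
        \;+\; \frac{\partial}{\partial B_{ij}}\nabla_w g(\hat{w}, B)
\;\;\Longrightarrow\;\;
\frac{\partial \hat{w}}{\partial B_{ij}}
  \;=\; -\Big[\nabla_w^2 g(\hat{w}, B)\Big]^{-1}
        \frac{\partial}{\partial B_{ij}}\nabla_w g(\hat{w}, B) .
```

The Hessian is invertible whenever the regularizer is strictly convex, which is exactly what the KL term provides and the L1 term does not.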

Page 22:

Example: Squared Loss, KL-Divergence Prior
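For this squared-loss / KL-prior case, a minimal numpy sketch of one way to compute the MAP weights is below. Exponentiated-gradient (multiplicative) updates are a natural fit for the KL term; this is an illustrative solver, not necessarily the one the authors used, and the prior, step size, and iteration count are assumptions.

```python
import numpy as np

def kl_sparse_approx(B, x, lam=0.1, p=None, step=0.01, n_iters=2000):
    """MAP weights for 0.5*||x - B w||^2 + lam*KL(w || p) with w >= 0,
    solved by exponentiated-gradient (multiplicative) updates."""
    k = B.shape[1]
    p = np.full(k, 1.0 / k) if p is None else p         # prior mean (uniform by default)
    w = p.copy()                                        # start at the prior, strictly positive
    for _ in range(n_iters):
        grad = B.T @ (B @ w - x) + lam * np.log(w / p)  # gradient of loss + KL regularizer
        w = w * np.exp(-step * grad)                    # multiplicative step keeps w positive
    return w

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 256))
x = B[:, 5] + 0.3 * B[:, 40]                            # input built from two basis columns
w_hat = kl_sparse_approx(B, x)
print(np.argsort(w_hat)[-3:])                           # largest weights should include 5 and 40
```

Because the multiplicative update never drives a weight exactly to zero, the resulting codes are smooth functions of the input and basis, which is what makes the implicit differentiation above possible.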

Page 23:

Handwritten Digit Recognition

50,000 digit training set
10,000 digit validation set
10,000 digit test set

Page 24:

Handwritten Digit Recognition

Step #1: Unsupervised sparse coding on the training set, with L2 loss and L1 prior.

Raina, Battle, Lee, Packer, Ng, ICML, 2007

Page 25:

Handwritten Digit Recognition

Step #2: Sparse approximation of the training set, then train a Maxent classifier with its loss function on the resulting codes.
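A maxent classifier over the sparse codes is multinomial logistic regression; a minimal scikit-learn sketch is below, with random placeholder arrays standing in for the codes W and digit labels y (stand-ins, not the MNIST data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
W = np.abs(rng.standard_normal((1000, 256)))   # placeholder sparse codes, one row per digit
y = rng.integers(0, 10, size=1000)             # placeholder digit labels, 0-9

# Multinomial logistic regression = maximum-entropy classifier on the codes.
clf = LogisticRegression(max_iter=1000).fit(W, y)
print(clf.score(W, y))                         # training accuracy of the maxent classifier
```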

Page 26:

Handwritten Digit Recognition

Step #3: Supervised sparse coding, training the basis through the Maxent classifier's loss function.

Page 27:

KL Maintains Sparsity

Weight is concentrated in a few elements. [Histogram shown on a log scale.]

Page 28:

KL adds Stability

[Figure: an input X and its codes W under KL, backprop, and L1 coding, illustrating the stability of the KL codes.]

Duda, Hart, Stork. Pattern classification. 2001

Page 29:

Performance vs. Prior

[Chart: misclassification error (%) vs. number of training examples (1,000; 10,000; 50,000) for L1, KL, and KL + Backprop; lower is better.]

Page 30:

Classifier Comparison

[Chart: misclassification error of Maxent, 2-layer NN, SVM (Linear), and SVM (RBF) classifiers trained on PCA, L1, KL, and KL+backprop features; lower is better.]

Page 31:

Comparison to other algorithms

[Chart: misclassification error with 50,000 training examples for L1, KL, KL+backprop, SVM, and 2-layer NN; lower is better.]

Page 32:

Transfer to English Characters

24,000 character training set
12,000 character validation set
12,000 character test set

Page 33:

Transfer to English Characters

Step #1: Sparse approximation of the characters using the digits basis, then train a Maxent classifier with its loss function on the codes.

Raina, Battle, Lee, Packer, Ng, ICML, 2007

Page 34:

Transfer to English Characters

Step #2: Supervised sparse coding, adapting the basis through the Maxent classifier's loss function.

Page 35:

Transfer to English Characters

[Chart: classification accuracy vs. training set size (500; 5,000; 20,000) for Raw, PCA, L1, KL, and KL+backprop features; higher is better.]

Page 36:

Text Application

X: 5,000 movie reviews, each rated on a 10-point sentiment scale (1 = hated it, 10 = loved it)

Step #1: Unsupervised sparse coding with KL loss + KL prior.

Pang, Lee, Proceeding of the ACL, 2005

Page 37:

Text Application

Step #2: Sparse approximation of each review, then linear regression with L2 loss against the 10-point sentiment score (1 = hated it, 10 = loved it), evaluated with 5-fold cross validation.

Page 38:

Text Application

Step #3: Supervised sparse coding, training the basis through the linear regression's L2 loss, evaluated with 5-fold cross validation.

Page 39:

Movie Review Sentiment

[Chart: predictive R² for LDA and KL (unsupervised basis) and for sLDA (a state-of-the-art graphical model) and KL+backprop (supervised basis); higher is better.]

Blei, McAuliffe, NIPS, 2007

Page 40:

Future Work

[System diagram: RGB camera, NIR camera, and Ladar feed sparse coding and engineered features; labeled training data and example paths train a voxel classifier with MMP.]

Page 41:

Future Work: Convex Sparse Coding

• Sparse approximation is convex
• Sparse coding is not, because the fixed-size basis is a non-convex constraint
• Sparse coding ↔ sparse approximation on an infinitely large basis + a non-convex rank constraint
  – Relax to a convex L1 rank constraint
• Use boosting for sparse approximation directly on the infinitely large basis

Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005
Zhao, Yu. Feature Selection for Data Mining, 2005
Rifkin, Lippert. Journal of Machine Learning Research, 2007

Page 42:

Questions?