Page 1

When Dictionary Learning Meets Classification

Bufford, Teresa¹; Chen, Yuxin²; Horning, Mitchell³; Shee, Liberty¹

Mentor: Professor Yohann Tendero

¹UCLA   ²Dalhousie University   ³Harvey Mudd College

August 9, 2013

Page 2

Outline

1. The Method: Dictionary Learning

2. Experiments and Results

2.1 Supervised Dictionary Learning

2.2 Unsupervised Dictionary Learning

2.3 Robustness w.r.t. Noise

3. Conclusion

Page 3

Dictionary Learning

GOAL: Classify discrete image signals x ∈ R^n.

The Dictionary, D ∈ R^{n×K}

$$x \approx D\alpha = \begin{bmatrix} | & & | \\ \text{atom}_1 & \cdots & \text{atom}_K \\ | & & | \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_K \end{bmatrix}$$

• Each dictionary can be represented as a matrix whose columns are atoms in R^n, learned from a set of training data.

• A signal x ∈ R^n can be approximated by a linear combination of atoms in a dictionary, with coefficients α ∈ R^K.

• We seek a sparse signal representation:

$$\arg\min_{\alpha \in \mathbb{R}^K} \; \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1. \tag{1}$$

[1] Sprechmann et al., "Dictionary learning and sparse coding for unsupervised clustering." Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.
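For concreteness, here is a minimal sketch of solving problem (1) for a single signal with scikit-learn's SparseCoder. The random dictionary, λ value, and dimensions are placeholders rather than our experimental setup, and scikit-learn stores atoms as rows (the transpose of the D above).

```python
import numpy as np
from sklearn.decomposition import SparseCoder

n, K, lam = 28 * 28, 500, 0.1          # illustrative dimensions and penalty
rng = np.random.default_rng(0)

D = rng.standard_normal((K, n))        # atoms as rows (transpose of the slides' D)
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms

x = rng.standard_normal((1, n))        # stand-in for one image signal

# Lasso-based sparse coding: min_alpha ||x - alpha D||_2^2 + lam ||alpha||_1
# (up to scikit-learn's internal scaling of the l1 penalty).
coder = SparseCoder(dictionary=D, transform_algorithm="lasso_lars",
                    transform_alpha=lam)
alpha = coder.transform(x)[0]          # sparse code, shape (K,)
print(f"{np.count_nonzero(alpha)} nonzero coefficients out of {K}")
```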

Page 4

Supervised Dictionary Learning Algorithm

Illustrating Example

Training image signals: x1, x2, x3, x4;  X = [x1 x2 x3 x4]
Classes: 0, 1

Group training images according to their labels:

class 0 x1 x4

class 1 x2 x3

Use training images with label i to train dictionary i:

class 0 x1 x4 −→ D0

class 1 x2 x3 −→ D1
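A minimal sketch of this grouping step, assuming the training images are rows of a NumPy array `X` with integer labels `y`; `MiniBatchDictionaryLearning` stands in for whichever dictionary learning solver is preferred, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def train_class_dictionaries(X, y, K=500, lam=0.1, seed=0):
    """Train one dictionary D_i per class, using only the images labeled i."""
    dictionaries = {}
    for label in np.unique(y):
        learner = MiniBatchDictionaryLearning(n_components=K, alpha=lam,
                                              random_state=seed)
        learner.fit(X[y == label])                  # images of this class only
        dictionaries[label] = learner.components_   # atoms as rows, shape (K, n)
    return dictionaries
```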

Page 5

Signal Classification

Illustrating Example

Test image signals: y1, y2, y3;  Y = [y1 y2 y3]
Dictionaries: D0, D1

Take one test image. Compute its optimal sparse representation α for each dictionary, and use α to compute an energy:

$$E_i(y_1) = \min_{\alpha} \|y_1 - D_i\alpha\|_2^2 + \lambda\|\alpha\|_1 \quad\longrightarrow\quad \begin{bmatrix} E_0(y_1) \\ E_1(y_1) \end{bmatrix}$$

Do this for each test signal:

$$E(Y) = \begin{bmatrix} E_0(y_1) & E_0(y_2) & E_0(y_3) \\ E_1(y_1) & E_1(y_2) & E_1(y_3) \end{bmatrix}$$

Classify each test signal y as class $i^* = \arg\min_i E_i(y)$. For example:

$$E(Y) = \begin{bmatrix} 5 & 12 & 8 \\ 24 & 6 & 4 \end{bmatrix} \longrightarrow$$

class 0: y1
class 1: y2, y3
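A sketch of this classification rule, reusing the lasso-based sparse coder from before; `dictionaries` is a list of (K, n) atom matrices, and the λ value is illustrative.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def energy_matrix(Y, dictionaries, lam=0.1):
    """E[i, j] approximates min_a ||y_j - D_i a||_2^2 + lam * ||a||_1."""
    E = np.empty((len(dictionaries), Y.shape[0]))
    for i, D in enumerate(dictionaries):           # D: atoms as rows, shape (K, n)
        coder = SparseCoder(dictionary=D, transform_algorithm="lasso_lars",
                            transform_alpha=lam)
        A = coder.transform(Y)                     # codes, shape (n_test, K)
        residual = Y - A @ D                       # reconstruction errors
        E[i] = np.sum(residual ** 2, axis=1) + lam * np.abs(A).sum(axis=1)
    return E

# predicted = np.argmin(energy_matrix(Y_test, [D0, D1]), axis=0)
```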

Page 6

Supervised Dictionary Learning Results

Table: Supervised Results

Cluster Type   Centered & Normalized   Digits        K     Misclassification rate (avg)
Supervised     False                   {0, ..., 4}   500   1.3431%
Supervised     True                    {0, ..., 4}   500   0.5449%
Supervised     False                   {0, ..., 4}   800   0.7784%
Supervised     True                    {0, ..., 4}   800   0.3892%
Supervised     False                   {0, ..., 9}   800   3.1100%
Supervised     True                    {0, ..., 9}   800   1.6800%

Error rate for digits {0, ..., 9} and K = 800 from [1] is 1.26%.

[1] Sprechmann et al., "Dictionary learning and sparse coding for unsupervised clustering." Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.

Page 7

Supervised Results, ctd.

Here is a confusion matrix for the supervised, centered, normalized case with K = 800 and digits {0, . . . , 4}.

Element c_ij of C is the number of times that an image of digit i was classified as digit j.

         0     1     2     3     4
0      978     0     1     1     0
1        0  1132     2     1     0
2        5     2  1023     0     2
3        0     0     3  1006     1
4        1     0     1     0   980

Page 8

Unsupervised Dictionary Learning Algorithm (Spectral Clustering)

Illustrating Example

Training image signals: x1, x2, x3, x4;  X = [x1 x2 x3 x4]
Classes: 0, 1

Train a dictionary D from all training images:

$$\{x_1, x_2, x_3, x_4\} \longrightarrow D = \begin{bmatrix} | & | & | \\ \text{atom}_1 & \text{atom}_2 & \text{atom}_3 \\ | & | & | \end{bmatrix}$$

Reminder: The number of atoms in the dictionary is a parameter.

For each image, compute its optimal sparse representation α w.r.t. D:

$$A = \begin{bmatrix} | & | & | & | \\ \alpha_1 & \alpha_2 & \alpha_3 & \alpha_4 \\ | & | & | & | \end{bmatrix}$$

(columns indexed by the images x1, ..., x4; rows indexed by atom1, ..., atom3)

Reminder: each column α_j holds the coefficients of the linear combination of atoms in D that approximates x_j.

Page 9

Illustrating Example

Training image signals: x1, x2, x3, x4;  X = [x1 x2 x3 x4]
Classes: 0, 1

Construct a similarity matrix S1 = |A|^t |A|. Perform spectral clustering on the graph G1 = {X, S1}:

G1 = {X, S1}  −−(spectral clustering)−→

class 0 x1 x4

class 1 x2 x3

Use signal clusters to train initial dictionaries:

class 0 x1 x4 −→ D0

class 1 x2 x3 −→ D1
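A sketch of the clustering step, under the assumption that `A` holds the sparse codes with one column per training image; scikit-learn's SpectralClustering accepts the precomputed similarity matrix directly.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_split(A, n_clusters=2, seed=0):
    """Cluster training signals via the similarity S = |A|^t |A|."""
    S = np.abs(A).T @ np.abs(A)        # (N, N) similarity between signals
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=seed)
    return sc.fit_predict(S)           # cluster label per training image

# labels = spectral_split(A); train D0 on the images with label 0, D1 on label 1.
```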

Page 10

Refining Dictionaries

Repeat:

1. Classify the training images using current dictionaries:

x1, x2, x3, x4  −−(classify with D0, D1)−→   class 0: x4;   class 1: x1, x2, x3

2. Use these classifications to train new, refined dictionaries for each cluster:

class 0: x4 −→ D0,new
class 1: x1, x2, x3 −→ D1,new
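A sketch of this refinement loop; `train_class_dictionaries` and `energy_matrix` are the hypothetical helpers sketched earlier, the labels are assumed to be integers 0, ..., C−1, and the early-stopping rule is one plausible choice.

```python
import numpy as np

def refine(X, labels, n_iter=20, K=500, lam=0.1):
    """Alternately retrain per-cluster dictionaries and reclassify the training set."""
    for _ in range(n_iter):
        dicts = train_class_dictionaries(X, labels, K, lam)        # step 2
        D_list = [dicts[c] for c in sorted(dicts)]
        new_labels = np.argmin(energy_matrix(X, D_list, lam), axis=0)  # step 1
        if np.array_equal(new_labels, labels):   # assignments stable: stop early
            break
        labels = new_labels
    return D_list, labels
```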

Page 11

Unsupervised Dictionary Learning: Spectral Clustering Results

Table: Unsupervised Results (spectral clustering)

Cluster   Centered & Normalized   Digits        K     Misclassification rate (avg)
Atoms     True                    {0, ..., 4}   500   24.8881%
Atoms     True                    {0, ..., 4}   800   27.1843%
Signals   True                    {0, ..., 4}   500   27.4372%
Signals   True                    {0, ..., 4}   800   29.5777%

Misclassification rate for digits {0, ..., 4} and K = 500 from [1] is 1.44%.

[1] Sprechmann et al., "Dictionary learning and sparse coding for unsupervised clustering." Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.

Page 12

Why do our results differ from Sprechmann, et al.’s?

Hypothesis

The problem lies in the authors' choice of similarity measure, S = |A|^t |A|.

Possible Problem: Normalization (the columns of A do not have constant l2 norm)

Illustration of Problem

• Assume that A contains only positive entries. Then S = |A|^t |A| = A^t A, and the ij-th entry of S is

$$\langle \alpha_i, \alpha_j \rangle = \|\alpha_i\| \, \|\alpha_j\| \cos(\alpha_i, \alpha_j).$$

• Nearly orthogonal vectors can thus score as "more similar" than collinear ones if the product of their norms is large enough.
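A tiny numeric check of this effect, with made-up vectors (not data from our experiments):

```python
import numpy as np

long_a = 10 * np.array([1.0, 0.1])   # large norm, nearly orthogonal to long_b
long_b = 10 * np.array([0.1, 1.0])
short_c = np.array([0.3, 0.3])       # small norm, exactly collinear with short_d
short_d = np.array([0.1, 0.1])

print(long_a @ long_b)    # 20.0 -> scored as very "similar"
print(short_c @ short_d)  # 0.06 -> scored as barely "similar"
```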

Page 13

Attempted Solution

We normalized each column of A before computing S. However, this caused little change in the results, because the l2 norms of A's columns are nearly constant.

Figure: Normalized histogram of the l2 norms of the columns of A (x-axis: norm interval; y-axis: empirical probability). Note: the l2 norms of most columns of A lie in [3, 4].

Page 14

Possible Problem: Absolute Value (use |A|t |A| instead of |AtA|)

Illustration of Problem

a1 = (1, −1)^t and a2 = (1, 1)^t, which are orthogonal vectors, become collinear when their entries are replaced by their absolute values.

Attempted Solution

In the experiments, changing |A||A|^t to |AA^t| does not significantly improve the results, because among all the entries of A associated with the MNIST data, only ≈ 13.5% are negative.
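The same two-vector example, checked numerically:

```python
import numpy as np

a1 = np.array([1.0, -1.0])
a2 = np.array([1.0, 1.0])

print(a1 @ a2)                   # 0.0: orthogonal before the absolute value
print(np.abs(a1) @ np.abs(a2))   # 2.0: collinear (maximal similarity) after
```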

Page 15

In the last part of Sprechmann et al. [1], the authors state:

“We observed that best results are obtained for all the experiments when the initial dictionaries in the learning stage are constructed by randomly selecting signals from the training set. If the size of the dictionary compared to the dimension of the data is small, [it] is better to first partition the dataset (using for example Euclidean k-means) in order to obtain a more representative sample.”

Algorithm

Use k-means to cluster the training signals, instead of spectral clustering.

Page 16

Unsupervised Dictionary Learning (k-means)

Illustrating Example

Training image signals: x1, x2, x3, x4;  X = [x1 x2 x3 x4]
Classes: 0, 1

Use k-means to cluster the training images:

x1, x2, x3, x4  −−(k-means)−→

class 0 x1 x4

class 1 x2 x3

Note: Must center and normalize after clustering.

Use clusters to train dictionaries:

class 0 x1 x4 −→ D0

class 1 x2 x3 −→ D1

The refinement process is the same as in the spectral clustering case.
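A sketch of the k-means initialization, assuming the images are rows of `X`; centering is done per image here, which is one plausible reading of "center and normalize", and all parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, n_classes=2, seed=0):
    """Cluster raw images with Euclidean k-means, then center and normalize."""
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=seed).fit_predict(X)
    Xc = X - X.mean(axis=1, keepdims=True)            # center each image
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)   # unit l2 norm per image
    return Xc, labels   # train D0, D1, ... on the clusters, then refine as before
```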

Page 17

Unsupervised Dictionary Learning: K-means Results

Table: Unsupervised Results (k-means), without refinement

Cluster   Centered & Normalized   Digits        K     iter   Misclassification rate (avg)
Signals   True                    {0, ..., 4}   500   2      7.1025%

Table: Unsupervised Results (k-means), with refinement

Cluster   Centered & Normalized   Digits        K     iter   Misclassification rate (avg)
Signals   True                    {0, ..., 4}   500   2      1.1286%
Signals   True                    {0, ..., 4}   500   20     0.5449%

Misclassification rate for digits {0, ..., 4} and K = 500 from [1] is 1.44%.

[1] Sprechmann et al., "Dictionary learning and sparse coding for unsupervised clustering." Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.

Page 18

Unsupervised Results, ctd.

Here is a confusion matrix for the k-means, unsupervised, centered, normalized case with K = 500, iter = 20, and digits {0, . . . , 4}.

Element c_ij of C is the number of times that an image of digit i was classified as digit j.

         0     1     2     3     4
0      979     0     1     0     0
1        0  1129     4     1     1
2        4     2  1023     2     1
3        0     0     3  1006     1
4        0     1     3     0   978

Page 19

Gaussian Noise Experiments

Classification of Noisy Images

Goal: To analyze the robustness of our method with respect to noise in the images.

Questions to consider:

• How does adding noise affect misclassification rates?

• How does centering and normalizing the data after adding noise affect the results?
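As a sketch of the noise setup (the variance and the after-noise preprocessing order come from the slides; everything else, including the per-image centering, is an illustrative assumption):

```python
import numpy as np

def add_gaussian_noise(X, var=0.5, renormalize=True, seed=0):
    """Add i.i.d. Gaussian noise to each image; optionally center and
    normalize AFTER the noise is added."""
    rng = np.random.default_rng(seed)
    Xn = X + rng.normal(scale=np.sqrt(var), size=X.shape)
    if renormalize:
        Xn = Xn - Xn.mean(axis=1, keepdims=True)
        Xn /= np.linalg.norm(Xn, axis=1, keepdims=True)
    return Xn
```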

Page 20

Results: Classifying Pure Gaussian Noise
Classification rates for each digit (noise variance = 0.5)

[Two figures: left, dictionaries trained on centered and normalized data; right, dictionaries trained on data that was neither centered nor normalized.]

Page 21

MNIST Images with Gaussian Noise Added

Page 22

Results: Adding Gaussian Noise to MNIST
Noise in test images only (test noise variance = 0.5)

Centered and Normalized:           misclassification rate 8.21%
Not Centered and Not Normalized:   misclassification rate 87.05%

Page 23

Results: Adding Gaussian Noise to MNIST
Same noise variance for test & training images

Misclassification Rates

Noise variance                    0.1      0.5      1.0
Centered & Normalized             3.2%     17.33%   38.92%
Not Centered & Not Normalized     11.08%   57.56%   75.87%

Page 24

Conclusions:

• Unable to reproduce the results reported by Sprechmann et al. [1]

• The k-means method produces better results than spectral clustering

• Centering and normalizing the data consistently improves results

• The method is robust to added noise when the data are centered and normalized

Page 25

Thanks for your attention!

Page 26

Unsupervised Dictionary Learning Algorithm 2 (with split initialization)

Step 1 Train a dictionary D0 ∈ R^{n×K} from all training images.

Step 2 Obtain the matrix A0 and the similarity matrix S0 = |A0||A0|^t as before.

Step 3 Apply spectral clustering to the dictionary D0 to split its atoms into D1 and D2.

Step 4 Obtain matrices A1, S1, A2, and S2 for dictionaries D1 and D2.

Step 5 Apply spectral clustering to dictionaries D1 and D2 to split each into two clusters. Keep the split that results in the lower energy; return the other dictionary to its original form. Now there are three dictionaries.

Step 6 Repeat this process so that after each iteration the number of dictionaries increases by one. Stop when the desired number of clusters is reached (see the sketch after this list).
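A rough sketch of this loop under stated assumptions: `train_dictionary`, `sparse_codes`, `spectral_split_atoms`, and `total_energy` are hypothetical helpers (splitting a dictionary's atoms by spectral clustering and evaluating the energy of a set of dictionaries) that are not shown here.

```python
def split_initialization(X, n_classes, K=500, lam=0.1):
    """Grow the set of dictionaries one split at a time (Algorithm 2 sketch)."""
    dicts = [train_dictionary(X, K, lam)]          # Step 1: one global dictionary
    while len(dicts) < n_classes:                  # Step 6: one new dictionary per pass
        best = None
        for i, D in enumerate(dicts):
            A = sparse_codes(X, D, lam)            # Steps 2 and 4
            halves = spectral_split_atoms(D, A)    # Steps 3 and 5: split the atoms
            candidate = dicts[:i] + list(halves) + dicts[i + 1:]
            energy = total_energy(X, candidate, lam)
            if best is None or energy < best[0]:   # keep the lowest-energy split
                best = (energy, candidate)
        dicts = best[1]                            # the other dictionaries stay intact
    return dicts
```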

Page 27

Semisupervised Dictionary Learning Algorithm

Step 1 Use training images, a percentage of which have known labels. The remaining percentage are randomly labeled (see the sketch after this list).

Step 2 Train dictionaries {D0, . . . ,DN−1} independently.

Step 3 Classify the training images using the current dictionaries. Then use these classifications to train new, refined dictionaries for each cluster.
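A minimal sketch of the Step 1 label perturbation, assuming integer labels `y` and a perturbation percentage like the 40/70/90% used in the results below (all names are illustrative):

```python
import numpy as np

def perturb_labels(y, perturb_percent, n_classes, seed=0):
    """Keep (100 - perturb_percent)% of the known labels; relabel the rest at random."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.random(y.shape[0]) < perturb_percent / 100.0
    y[flip] = rng.integers(0, n_classes, size=flip.sum())   # random class labels
    return y
```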

Page 28

Semisupervised Dictionary Learning Results

Dictionaries for digits {0, . . . , 4}.

Cluster Type     Centered & Normalized   K     Perturb. percent   Misclassification rate (avg)
Semisupervised   True                    200   40                 0.9730%
Semisupervised   True                    200   70                 0.8951%
Semisupervised   True                    200   90                 1.4205%

Page 29

Figure: Misclassification rate average per refinement iteration (x-axis: iteration, where iteration 1 = the first refinement). Left to right: 40%, 70%, 90% perturbation.

Figure: Energy per refinement iteration (x-axis: iteration, where iteration 1 = the first refinement). Left to right: 40%, 70%, 90% perturbation.

Page 30

Figure: Misclassification rate average per refinement iteration (x-axis: iteration, where iteration 1 = the first refinement). Left to right: 40%, 70%, 90% perturbation.

Figure: Number of training images whose classification changed per refinement iteration (x-axis: change to refinement #, where change 1 = the change to the first refinement). Left to right: 40%, 70%, 90% perturbation.

Page 31

Unsupervised - Atoms (change over refinement iterations)

Figure: Unsupervised dictionary learning (K = 500), spectral clustering of centered and normalized signals by atoms. Left to right: (1) energy, (2) misclassification rate average, (3) number of training images whose classification changed per refinement iteration.

Page 32

Unsupervised - Signals (change over refinement iterations)

Figure: Unsupervised dictionary learning (K = 500), spectral clustering of centered and normalized signals by signals. Left to right: (1) energy, (2) misclassification rate average, (3) number of training images whose classification changed per refinement iteration.

Page 33

Unsupervised - k-means (change over refinement iterations)

Figure: Unsupervised dictionary learning (K = 500), k-means clustering of centered and normalized signals. Left to right: (1) energy, (2) misclassification rate average, (3) number of training images whose classification changed per refinement iteration.

Page 34

Unsupervised - k-means (change over refinement iterations)

Figure: Unsupervised dictionary learning (K = 500), k-means clustering of centered and normalized signals; energy at each iteration. Left: 2 dictionary learning refinements. Right: 20 dictionary learning refinements.