
Unsupervised Learning Jointly With Image Clustering

Jianwei Yang, Devi Parikh, Dhruv Batra

Virginia Tech

https://filebox.ece.vt.edu/~jw2yang/

A huge amount of images!
Learning without annotation effort
What do we need to learn?
An open problem
A hot problem
Various methodologies

Learning distribution (structure): Clustering

Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR) 31.3 (1999): 264-323.

K-means (image credit: Jesse Johnson)
Hierarchical clustering
Spectral clustering, Manor et al., NIPS'04
Graph cut, Shi et al., TPAMI'00
DBSCAN, Ester et al., KDD'96 (image credit: Jesse Johnson)
EM algorithm, Dempster et al., JRSS'77
NMF, Xu et al., SIGIR'03 (image credit: Conrad Lee)
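To make the first item on this list concrete, here is a minimal k-means sketch in NumPy; it is illustrative only and not part of the original slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Hierarchical (agglomerative) clustering, by contrast, starts from many small clusters and repeatedly merges the most similar pair; that bottom-up view is what the approach later in this deck builds on.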

Learning distribution (structure): Subspace Analysis

PCA (image credit: Jesse Johnson)
ICA (image credit: Shylaja et al.)
t-SNE, Maaten et al., JMLR'08
Subspace clustering, Vidal et al.
Sparse coding, Olshausen et al., Vision Research'97

Learning representation (feature)

Yoshua Bengio, Aaron Courville, and Pierre Vincent. "Representation learning: A review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.

Autoencoder, Hinton et al., Science'06 (image credit: Jesse Johnson)
DBN, Hinton et al., Science'06
DBM, Salakhutdinov et al., AISTATS'09
Bengio et al., TPAMI'13
VAE, Kingma et al., arXiv'13 (image credit: Fast Forward Labs)
GAN, Goodfellow et al., NIPS'14
DCGAN, Radford et al., arXiv'15 (image credit: Mike Swarbrick Jones)

Most Recent CV Works

Spatial context, Doersch et al., ICCV'15
Temporal context, Wang et al., ICCV'15
Solving jigsaw puzzles, Noroozi et al., ECCV'16
Context encoder, Pathak et al., CVPR'16
Ego-motion, Jayaraman et al., ICCV'15
Visual concept clustering, Huang et al., CVPR'16
Graph constraint, Li et al., ECCV'16
TAGnet, Wang et al., SDM'16
Deep embedding, Xie et al., ICML'16

Our Work

Joint Unsupervised Learning (JULE) of Deep Representations and Image Clusters


Outline

• Intuition

• Approach

• Experiments

• Extensions


Intuition

Meaningful clusters can provide supervisory signals to learn image representations.
Good representations help to obtain meaningful clusters.

Option 1: cluster images first, and then learn representations.
Option 2: learn representations first, and then cluster images.
Ours: cluster images and learn representations progressively.

[Figure: good representations lead to good clusters, which in turn lead to better representations; poor representations lead to poor clusters.]

Approach

• Framework

• Objective

• Algorithm & Implementation


Approach: Framework

Convolutional Neural Network: representation learning
Agglomerative Clustering: image clustering

Overall objective: $\arg\min_{y,\theta} \mathcal{L}(y, \theta \mid I)$

Image clustering (forward pass, CNN parameters $\theta$ fixed): $\arg\min_{y} \mathcal{L}(y \mid \theta, I)$
Representation learning (backward pass, cluster labels $y$ fixed): $\arg\min_{\theta} \mathcal{L}(\theta \mid y, I)$

Approach: Recurrent Framework

The optimization is cast as a recurrent process: at each time-step, clustering is conducted in the forward pass and representation learning in the backward pass.

A backward pass at every time-step is time-consuming and prone to over-fitting.
How about updating once for multiple time-steps?

Partial unrolling: divide all T time-steps into P periods.
In each period, we merge clusters multiple times and update the CNN parameters at the end of the period.
P is determined by a hyper-parameter that will be introduced later.
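A minimal sketch of this partially unrolled loop, assuming hypothetical callables `merge_once` (one agglomerative merge with the CNN fixed) and `train_cnn` (one CNN update from the current clusters); the per-period merge count uses an assumed unrolling-rate rule, not necessarily the paper's exact schedule.

```python
from typing import Callable, List

def partially_unrolled_training(
    clusters: List[List[int]],
    merge_once: Callable[[List[List[int]]], List[List[int]]],
    train_cnn: Callable[[List[List[int]]], None],
    n_target: int,
    unroll_rate: float = 0.9,
) -> List[List[int]]:
    """Partial unrolling: split the T merge steps into P periods. Within a period,
    merge clusters several times with the CNN fixed; update the CNN once at the end."""
    while len(clusters) > n_target:
        # Assumed rule: shrink the number of clusters by a fraction (1 - unroll_rate) per period.
        n_merges = max(1, int(len(clusters) * (1.0 - unroll_rate)))
        n_merges = min(n_merges, len(clusters) - n_target)
        for _ in range(n_merges):
            clusters = merge_once(clusters)   # forward pass: clustering
        train_cnn(clusters)                   # backward pass: representation learning
    return clusters

# Toy usage: fuse the first two clusters each step; the CNN update is a no-op stub.
clusters = [[i] for i in range(40)]
clusters = partially_unrolled_training(
    clusters,
    merge_once=lambda cs: [cs[0] + cs[1]] + cs[2:],
    train_cnn=lambda cs: None,
    n_target=10,
)
```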

Approach: Objective Function

Overall loss: $\arg\min_{y,\theta} \mathcal{L}(y, \theta \mid I)$, optimized by alternating between $\arg\min_{y} \mathcal{L}(y \mid \theta, I)$ (clustering) and $\arg\min_{\theta} \mathcal{L}(\theta \mid y, I)$ (representation learning).

Approach: Objective Function

Loss at time-step t (two variants):

Conventional agglomerative clustering strategy: the loss is based only on the affinity measure between the two clusters, so the pair with the largest affinity is merged.

Proposed agglomerative clustering strategy: for the i-th cluster, consider its K_c nearest neighbor clusters; the loss combines the affinity between the i-th cluster and its nearest neighbor with the differences between this affinity and the affinities to the other neighbor clusters. The two clusters minimizing this loss are merged.
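The slides do not reproduce the exact formula, so the sketch below is only an illustrative neighbor-aware merge score assembled from the annotations above (an affinity matrix A, the i-th cluster, its K_c nearest neighbor clusters, and affinity differences); the weighting `lam` and the averaging are assumptions, not the paper's definition.

```python
import numpy as np

def merge_loss(affinity: np.ndarray, i: int, j: int, K_c: int = 5, lam: float = 1.0) -> float:
    """Illustrative loss for merging clusters i and j (lower is better). It rewards a
    high affinity A(C_i, C_j) and additionally rewards pairs whose affinity stands out
    from C_i's affinities to its other K_c nearest neighbor clusters."""
    a_ij = affinity[i, j]
    # K_c nearest neighbor clusters of cluster i, excluding i itself and j.
    order = np.argsort(-affinity[i])
    neighbors = [k for k in order if k not in (i, j)][:K_c]
    diff = np.mean([a_ij - affinity[i, k] for k in neighbors]) if neighbors else 0.0
    return -(a_ij + lam * diff)
```

The conventional strategy corresponds to dropping the difference term and scoring a pair by `-affinity[i, j]` alone.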

Approach: Objective Function

Loss in the forward pass of period p (merge clusters): the CNN parameters are fixed.
Loss in the backward pass of period p (update the CNN parameters): the cluster labels are fixed.

Approach: Objective Function

Forward pass: a simple greedy algorithm. At each time-step, merge the two clusters that minimize the loss.
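A self-contained sketch of this greedy forward pass over an affinity matrix; here the per-step loss is simply the negative pairwise affinity (the conventional strategy), and the post-merge affinity update is a size-weighted average, both assumptions made for illustration.

```python
import numpy as np

def greedy_forward_pass(affinity: np.ndarray, n_target: int):
    """Greedy forward pass (sketch): at each time-step, merge the pair of clusters that
    minimizes the loss. The loss here is -A(C_i, C_j); the neighbor-aware score sketched
    earlier could be swapped in for the proposed strategy."""
    A = affinity.astype(float).copy()
    np.fill_diagonal(A, -np.inf)
    sizes = np.ones(len(A))
    active = list(range(len(A)))
    merges = []
    while len(active) > n_target:
        sub = A[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmax(sub), sub.shape)   # argmin of -A(C_i, C_j)
        i, j = active[a], active[b]
        # Merge C_j into C_i: size-weighted average affinity to the remaining clusters.
        A[i, :] = (sizes[i] * A[i, :] + sizes[j] * A[j, :]) / (sizes[i] + sizes[j])
        A[:, i] = A[i, :]
        np.fill_diagonal(A, -np.inf)
        sizes[i] += sizes[j]
        active.remove(j)
        merges.append((i, j))
    return merges, active
```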

Approach: Objective Function

Backward pass: consider all previous periods.
A cluster-based loss is not suitable for batch optimization.
Approximation: recall the cluster-based loss and convert it to a sample-based loss built from intra-sample and inter-sample affinity terms, which takes the form of a weighted triplet loss.
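The slides only name the result (a weighted triplet loss), so below is a generic weighted triplet loss written in PyTorch for illustration; the margin, the squared-distance choice, and the per-triplet weights are assumptions, and this is not the authors' code.

```python
import torch

def weighted_triplet_loss(anchor, positive, negative, weights, margin=0.2):
    """Generic weighted triplet loss: pull each anchor toward a sample from its own
    cluster (positive) and push it away from a sample in a neighboring cluster
    (negative), weighting each triplet individually."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance to negative
    return (weights * torch.relu(d_pos - d_neg + margin)).mean()

# Toy usage with random 64-D embeddings for a batch of 8 triplets.
a, p, n = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
loss = weighted_triplet_loss(a, p, n, weights=torch.ones(8))
```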

Approach: Algorithm & Implementation

Input: raw image data. The target number of clusters is assumed to be known.
Randomly initialize the CNN parameters; start from initial clusters with about 4 samples per cluster on average.
Train the CNN for about 20 epochs.
We can go back and retrain the model, but it improves only slightly.
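One simple way to obtain initial clusters with roughly 4 samples each is to over-segment the data, for example with k-means and K = N / 4; this is an illustrative assumption, not necessarily the initialization the authors use.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_over_segmentation(features: np.ndarray, samples_per_cluster: int = 4, seed: int = 0):
    """Over-segment the data so each initial cluster holds ~4 samples on average."""
    n_clusters = max(1, len(features) // samples_per_cluster)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)

# Toy usage: 100 samples with 32-D features -> about 25 initial clusters.
labels = initial_over_segmentation(np.random.rand(100, 32))
```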

Experiments

• Datasets

• Network Architecture

• Image Clustering

• Representation Learning


Experiments: Datasets

Dataset (images, classes, image size):
MNIST (70000, 10, 28x28)
USPS (11000, 10, 16x16)
COIL-20 (1440, 20, 128x128)
COIL-100 (7200, 100, 128x128)
UMist (575, 20, 112x92)
FRGC (2462, 20, 32x32)
CMU-PIE (2856, 68, 32x32)
YouTube Faces (1000, 41, 55x55)

Experiments: Settings

Two important parameters.
Set the number of layers so that the output feature map is about 10x10.
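As a rough illustration of that rule, the snippet below picks the number of stride-2 downsampling stages so the final feature map lands near 10x10, assuming each stage halves the spatial size; the actual per-dataset architectures may differ.

```python
import math

def n_downsampling_stages(input_size: int, target_size: int = 10) -> int:
    """Number of stride-2 stages so that input_size / 2**n is close to target_size."""
    return max(0, round(math.log2(input_size / target_size)))

# e.g. 28x28 (MNIST) -> 1 stage (28 -> 14); 128x128 (COIL) -> 4 stages (128 -> 8).
for size in (16, 28, 32, 55, 128):
    print(size, "->", n_downsampling_stages(size))
```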

Experiments: Clustering Performance

+6.43% NMI over the best performance of existing approaches, averaged over all datasets.
+12.76% AC over the best performance of existing approaches, averaged over all datasets.
Average gains of +21.5% and +25.7% NMI in two further comparisons: our clustering performance vs. that of existing clustering approaches using raw image data, and clustering performance when our representation is fed to existing clustering algorithms.
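For reference, NMI and AC (clustering accuracy under the best label permutation) can be computed as follows; this is a standard sketch using scikit-learn and SciPy, not the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """AC: accuracy under the best one-to-one mapping between predicted cluster IDs
    and ground-truth classes (Hungarian algorithm on the co-occurrence counts)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    rows, cols = linear_sum_assignment(-counts)   # maximize matched samples
    return counts[rows, cols].sum() / len(y_true)

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```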

Experiments: Clustering Visualization

Cluster visualizations shown for COIL-20, COIL-100, USPS, and MNIST-test.

Experiments: Clustering: Ablation Study

Experiments: Clustering: Verification

Experiments: Clustering: Time Cost

Experiments: Representation Learning

Representation transfer: testing the generalization of our learnt (unsupervised) representation to LFW face verification.
Representation learning: evaluation on CIFAR-10 classification.

Extensions: Data Visualization


Conclusion

• A new unsupervised learning method that jointly learns deep representations and image clusters, casting the problem as a recurrent optimization problem;

• In the recurrent framework, clustering is conducted during the forward pass, and representation learning during the backward pass;

• A unified loss function is used in both the forward and backward passes;

• It outperforms the state of the art on a number of datasets;

• It can also learn plausible representations for image recognition.

Thanks!

https://github.com/jwyang/joint-unsupervised-learning
