Transfer Learning for Image Classification. Transfer Learning Approaches Leverage data from related tasks to improve performance: Improve generalization

Transfer Learning for Image Classification

Transfer Learning Approaches

Leverage data from related tasks to improve performance:

Improve generalization.

Reduce the run-time of evaluating a set of classifiers.

Two Main Approaches:

Learning Shared Hidden Representations.

Sharing Features.

Sharing Features: efficient boosting procedures for multiclass object detection

Antonio Torralba, Kevin Murphy, William Freeman

Snapshot of the idea

Goal: Reduce the computational cost of multiclass object recognition.

Improve generalization performance.

Approach: Make boosted classifiers share weak learners

Some Notation:

$x_i \in \mathbb{R}^{d}$

$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$

$y_i = [y_i^{1}, y_i^{2}, \ldots, y_i^{m}]$

$y_i^{k} \in \{-1, +1\}$

Training a single boosted classifier

Consider training a single boosted classifier:

Candidate weak learners

Weighted stumps

Fit an additive model

Training a single boosted classifier

Greedy Approach

Minimize the exponential loss

Gentle Boosting
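The ingredients listed above (weighted stumps as weak learners, an additive model, greedy minimization of the exponential loss) can be sketched as a small GentleBoost loop. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, labels are assumed to be in {-1, +1}, and each round fits a regression stump by weighted least squares with weights exp(-y H(x)).

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted least-squares stump: predicts `right` if x[f] > th, else `left`."""
    best = None
    for f in range(X.shape[1]):
        # Dropping the largest value guarantees both sides of the split are non-empty.
        for th in np.unique(X[:, f])[:-1]:
            mask = X[:, f] > th
            left = np.average(y[~mask], weights=w[~mask])
            right = np.average(y[mask], weights=w[mask])
            err = np.sum(w * (y - np.where(mask, right, left)) ** 2)
            if best is None or err < best[0]:
                best = (err, f, th, left, right)
    if best is None:
        # No feature can split the data: fall back to a constant prediction.
        const = np.average(y, weights=w)
        return (0, np.inf, const, const)
    return best[1:]

def stump_predict(stump, X):
    f, th, left, right = stump
    return np.where(X[:, f] > th, right, left)

def gentle_boost(X, y, rounds=10):
    """Greedily add one stump per round to the additive model H."""
    H = np.zeros(len(y))
    stumps = []
    for _ in range(rounds):
        w = np.exp(-y * H)          # exponential-loss weights
        w /= w.sum()
        stump = fit_stump(X, y, w)
        stumps.append(stump)
        H += stump_predict(stump, X)
    return stumps

def boosted_score(stumps, X):
    """Value of the additive model; its sign is the predicted label."""
    return sum(stump_predict(s, X) for s in stumps)
```

The exhaustive threshold search is quadratic in the number of examples per feature; it is kept here only for readability.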

Standard Multiclass Case: No Sharing

Additive model for each class

Minimize the sum of exponential losses

Each class has its own set of weak learners:

Multiclass Case: Sharing Features

Equation labels: subset of classes; corresponding additive classifier; classifier for the k-th class.

At iteration t add one weak learner to one of the additive models:

Minimize the sum of exponential losses

Naive approach:

Greedy Heuristic:

Sharing features for multiclass object detection

Torralba, Murphy, Freeman. CVPR 2004

Learning efficiency

Sharing features shows sub-linear scaling of features with objects (for area under ROC = 0.9).

Red: shared features Blue: independent features

How the features are shared across objects

Basic features: Filter responses computed at different locations of the image

Uncovering Shared Structures in Multiclass Classification

Yonatan Amit, Michael Fink, Nathan Srebro, Shimon Ullman

Structure Learning Framework

Class Parameters

Structural Parameters

Find optimal parameters:

Linear Classifiers

Linear Transformations

$\arg\min_{W,\theta} \sum_{i=1}^{n} \mathrm{Loss}(W, x_i, y_i) + c\,\Phi_1(W) + c\,\Phi_2(\theta)$

Multiclass Loss Function

Hinge Loss:

$\max(0,\; 1 - y\,w^{T}x)$

Maximal Hinge Loss:

$\mathrm{Loss}(W, x, y) = \max_{j,p\,:\,y_j = 1,\; y_p = -1} \left(1 + w_p^{T}x - w_j^{T}x\right)$
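As a small worked instance of the multiclass loss, the sketch below evaluates the largest margin violation between any class the example belongs to (label +1) and any class it does not belong to (label -1), i.e. the maximum over such pairs (j, p) of 1 + w_p·x - w_j·x. The function name is hypothetical, W is assumed to hold one class weight vector per row, and y is assumed to contain at least one +1 and one -1 entry.

```python
import numpy as np

def maximal_hinge_loss(W, x, y):
    """max over (j, p) with y[j] = +1, y[p] = -1 of 1 + w_p.x - w_j.x."""
    scores = W @ x                      # one score per class
    pos = scores[y == 1]                # classes the example belongs to
    neg = scores[y == -1]               # classes it does not belong to
    margins = 1.0 + neg[:, None] - pos[None, :]
    return float(margins.max())

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, 1.0])
y = np.array([1, -1, -1])
loss = maximal_hinge_loss(W, x, y)      # class scores are [2.0, 1.0, 1.5]
```

Clamping the result at zero, as the binary hinge loss does, would be a one-line variant; it is left out here so the code mirrors the formula as written.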

Snapshot of the idea

Main Idea: Enforce Sharing by finding low rank parameter matrix W

Transformation on x

Transformation on w

Consider the m by d parameter matrix:

Can be factored:

Rows of theta form a basis

$\mathrm{Loss}(W, x, y) = \max_{j,p\,:\,y_j = 1,\; y_p = -1} \left(1 + w_p^{T}x - w_j^{T}x\right)$

Low Rank Regularization Penalty

Rank of a d by m matrix is the smallest z, such that:

A regularization penalty designed to minimize the rank of W’ would tend to produce solutions where a few basis vectors are shared by all classes.

Minimizing the rank would lead to a hard combinatorial problem

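Instead use a trace norm penalty:

$\|W'\|_{\Sigma} = \sum_{i} |\gamma_i|$, where $\gamma_i$ is the i-th eigenvalue of W’ (for a non-square W’, the i-th singular value).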
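To make the relaxation concrete, the sketch below compares the rank of a matrix with its trace norm, computed as the sum of its singular values. The helper name is hypothetical; `np.linalg.svd` and `np.linalg.norm(..., 'nuc')` are standard NumPy calls.

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm: the sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

# A rank-1 parameter matrix: every class vector is a multiple of one shared basis vector.
W_shared = np.outer([1.0, 2.0], [3.0, 4.0, 0.0])

rank = np.linalg.matrix_rank(W_shared)
norm = trace_norm(W_shared)
same = np.isclose(norm, np.linalg.norm(W_shared, "nuc"))
```

The rank is a discontinuous count, whereas the trace norm is a convex function of W, which is why it is the one used as a penalty.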

Putting it all together

No longer in the objective

For optimization they use a gradient based method that minimizes a smooth approximation of the objective

$\arg\min_{W'} \sum_{i=1}^{n} \mathrm{Loss}(W', x_i, y_i) + c\,\|W'\|_{\Sigma}$

Mammals Dataset

Results

Transfer Learning for Image Classification via Sparse Joint Regularization

Ariadna Quattoni, Michael Collins, Trevor Darrell

Training visual classifiers when a few examples are available

Problem:

Image classification from a few examples can be hard.

A good representation of images is crucial.

Solution: We learn a good image representation using:

unlabeled data + labeled data from related problems

Snapshot of the idea:

Use unlabeled dataset + kernel function to compute a new representation:

Complex features, high dimensional space.

Some of them will be very discriminative (hopefully).

Most will be irrelevant.

If we knew the relevant features we could learn from fewer examples.

Related problems may share relevant features.

We can use data from related problems to discover them !!

Semi-supervised learning

Step 1: Learn representation. Unsupervised learning on a large dataset of unlabeled data yields a visual representation $F: I \to \mathbb{R}^{h}$.

Step 2: Train classifier. Use $F: I \to \mathbb{R}^{h}$ to compute an h-dimensional training set from the small training set of labeled images, then train the classifier.

Semi-supervised learning:

Raina et al. [ICML 2007] proposed an approach that learns a sparse set of high level features (i.e. linear combinations of the original features) from unlabeled data using a sparse coding technique.

Balcan et al. [ML 2004] proposed a representation based on computing kernel distances to unlabeled data points.

Learning visual representations using unlabeled data only

Unsupervised learning in data space

Good thing: Lower dimensional representation preserves relevant statistics of the data sample.

Bad thing: The representation might still contain irrelevant features, i.e. features that are useless for classification.

Learning visual representations from unlabeled data + labeled data from related categories

Step 1: Learn a representation

Large dataset of unlabeled images + kernel function $k: X \times X \to \mathbb{R}$: create a new representation.

Labeled images from related categories: select discriminative features of the new representation.

Result: a discriminative representation $F: I \to \mathbb{R}^{h}$.

Our contribution

Main differences with previous approaches:

Our choice of joint regularization norm allows us to express the joint loss minimization as a linear program (i.e. no need for greedy approximations.)

While previous approaches build joint sparse classifiers on the feature space, our method discovers discriminative features in a space derived using the unlabeled data and uses these discriminative features to solve future problems.

Overview of the method

Step I: Use the unlabeled data to compute a new representation space [Kernel SVD].

Step II: Use the labeled data from related problems to discover discriminative features in the new space [Joint Sparse Regularization].

Step III: Compute the new discriminative representation for the samples of the target problem.

Step IV: Train the target classifier using the representation of step III.

Step I: Compute a representation using the unlabeled data

A) Compute the kernel matrix of the unlabeled images:

$K_{ij} = k(x_i, x_j)$

B) Compute a projection matrix A by taking all the eigenvectors of K, i.e. perform Kernel SVD on the unlabeled data.

C) Project the labeled data from related problems to the new space.

Notational shorthand: $x' = z(x)$
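Steps A to C can be sketched as follows. The slides do not spell out the projection formula, so this is a sketch under stated assumptions: an RBF kernel (the kernel the experiments later report using), A taken as the eigenvectors of K, and z(x) assumed to be the vector of kernel values between x and the unlabeled points multiplied by A. All function names and the `gamma` parameter are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of X and Z."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_projection(U, gamma=1.0):
    """Steps A and B: kernel matrix of the unlabeled data and its eigenvectors."""
    K = rbf_kernel(U, U, gamma)
    _, A = np.linalg.eigh(K)        # K is symmetric; columns of A are eigenvectors
    return A

def project(X, U, A, gamma=1.0):
    """Step C (assumed form): z(x) = A^T [k(x, u_1), ..., k(x, u_q)]."""
    return rbf_kernel(X, U, gamma) @ A

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 3))     # 5 unlabeled points with 3 raw features
X = rng.standard_normal((2, 3))     # 2 labeled points from a related problem
A = fit_projection(U)
X_new = project(X, U, A)            # one new feature per unlabeled point
```

Because all eigenvectors are kept, the new space has as many features as there are unlabeled images; selecting among them is left to Step II.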

Sidetrack

Another possible method for learning a representation from the unlabeled data would be to create a projection matrix Q by taking the h eigenvectors of A corresponding to the h largest eigenvalues.

We will call this approach the Low Rank Baseline.

Our method differs significantly from the low rank approach in that we use training data from related problems to select discriminative features in the new space.

Step II: Discover relevant features by joint sparse approximation

A classifier is a function:

$F: X \to Y$

where $X$ is the input space, i.e. the representation learnt from unlabeled data, and $Y = \{-1, +1\}$ is a binary label; in our application it is going to be 1 if an image belongs to a particular topic and -1 otherwise.

A loss function:

$l: \{f(x), y\} \to \mathbb{R}$

Consider learning a single sparse linear classifier (on the space learnt from the unlabeled data) of the form:

$f(x') = w^{t}x'$

A sparse model will have only a few features with non-zero coefficients. A natural choice for parameter estimation would be:

$\min_{w} \sum_{(x,y) \in D} l(f(x'), y) + \sum_{j=1}^{p} |w_j|$

The first term is the classification error; the second term, the L1 norm, penalizes non-sparse solutions.

Donoho [2004] has proven (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.


Goal: Find a subset of features R such that each problem in

$D = \{D_1, D_2, \ldots, D_m\}$

can be well approximated by a sparse classifier whose non-zero coefficients correspond to features in R.

Solution: Regularized joint loss minimization:

$\min_{W = [w_1, w_2, \ldots, w_m]} \sum_{k=1}^{m} \sum_{(x,y) \in D_k} l(f_k(x'), y) + \Phi(W)$

where $f_k(x') = w_k^{t}x'$. The first term is the classification error on training set k; the second term, $\Phi(W)$, penalizes solutions that utilize too many features.


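How do we penalize solutions that use too many features?

$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & & \vdots \\ w_{p1} & w_{p2} & \cdots & w_{pm} \end{bmatrix}$

Each row holds the coefficients for one feature (row 2: coefficients for feature 2); each column holds the coefficients for one classifier (column 2: coefficients for classifier 2).

$\Phi(W) = \#\text{non-zero-rows}(W)$

Problem: this is not a proper norm and would lead to a hard combinatorial problem.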


Instead of using the #non-zero-rows pseudo-norm we will use a convex relaxation [Tropp 2006]:

$\|W\|_{1,\infty} = \sum_{i=1}^{p} \max_{k} |w_{ik}|$

This norm combines:

An L1 norm on the maximum absolute values of the coefficients, which promotes sparsity on the max values (use few features).

The L∞ norm on each row, which promotes non-sparsity on the rows (share features).

The combination of the two norms results in a solution where only a few features are used, but the features used will contribute to solving many classification problems.
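A short numeric sketch of the L1-L∞ norm may help: take the largest absolute coefficient in each row (one row per feature, one column per classifier) and sum those maxima. The function name and the example matrices are illustrative assumptions.

```python
import numpy as np

def l1_inf_norm(W):
    """Sum over features (rows) of the largest absolute coefficient across classifiers."""
    return float(np.abs(W).max(axis=1).sum())

# Two coefficient matrices with identical entries, arranged differently.
W_shared = np.array([[1.0, -3.0],
                     [0.0,  0.0],
                     [2.0,  0.5]])   # classifiers reuse features 1 and 3
W_spread = np.array([[1.0,  0.0],
                     [0.0, -3.0],
                     [2.0,  0.5]])   # the same values spread over all three features

shared_penalty = l1_inf_norm(W_shared)   # 3 + 0 + 2
spread_penalty = l1_inf_norm(W_spread)   # 1 + 3 + 2
```

Both matrices have the same entrywise L1 norm (6.5), but the L1-L∞ penalty is lower when the large coefficients sit in shared rows.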


Using the L1-L∞ norm we can rewrite our objective function as:

$\min_{W = [w_1, w_2, \ldots, w_m]} \sum_{k=1}^{m} \sum_{(x,y) \in D_k} l(f_k(x'), y) + \sum_{i=1}^{p} \max_{k} |w_{ik}|$

For any convex loss this is a convex function. In particular, when considering the hinge loss:

$l(f(x), y) = \max(0,\; 1 - y f(x))$

the optimization problem can be expressed as a linear program.


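Linear program formulation (hinge loss):

Objective:

$\min_{[W, \varepsilon, t]} \sum_{k=1}^{m} \sum_{j=1}^{|D_k|} \varepsilon_j^{k} + \sum_{i=1}^{p} t_i$

Max value constraints:

for $k = 1:m$ and for $i = 1:p$: $-t_i \le w_{ik} \le t_i$

Slack variables constraints:

for $k = 1:m$ and for $j = 1:|D_k|$: $y_j^{k} f_k({x'}_j^{k}) \ge 1 - \varepsilon_j^{k}$ and $\varepsilon_j^{k} \ge 0$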

Step III: Compute the discriminative features representation

Define the set of relevant features to be:

$R = \{r : \max_{k} |w_{rk}| > 0\}$

Create a new representation by taking all the features in x’ corresponding to the indexes in R.
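Step III amounts to a row-wise test on the learned coefficient matrix followed by column selection on the data. The sketch below assumes W has one row per feature and one column per related problem; the function names and the `tol` parameter (a numerical tolerance standing in for the exact zero test) are hypothetical.

```python
import numpy as np

def relevant_features(W, tol=0.0):
    """Indexes r whose largest absolute coefficient across problems exceeds tol."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)

def discriminative_representation(X_new, R):
    """Keep only the columns of the projected data x' that are indexed by R."""
    return X_new[:, R]

W = np.array([[1.0, -3.0],
              [0.0,  0.0],
              [2.0,  0.5]])
R = relevant_features(W)                       # feature 1 (0-based) is unused by every problem
X_new = np.arange(12.0).reshape(4, 3)          # 4 target samples in the 3-dimensional new space
X_target = discriminative_representation(X_new, R)
```

With a solver that returns tiny non-zero values instead of exact zeros, a small positive `tol` would be the natural adjustment.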

Experiments: Dataset

Reuters Dataset

10382 images, 108 topics.

Predict the 10 most frequent topics (binary prediction).

Data Partitions

3000 unlabeled images.

2382 images as testing data.

5000 images as source of supervised training data.

Labeled training sets of sizes: 1, 5, 10, 15, …, 50.

Training set $L_n$ with n positive examples and 2*n negative examples.

Dataset

SuperBowl [341]

Danish Cartoons [178]

Sharon [321]

Australian Open [209]

Trapped coal miners [196]

Golden Globes [167]

Grammys [170]

Figure skating [146]

Academy Awards [135]

Iraq [125]

Baseline Representation

‘Bag of words’ representation that combines: color, texture and raw local image information.

Sampling: sample image patches on a fixed grid. For each image patch compute:

Color features based on HSV color histograms.

Texture features based on mean responses of Gabor filters at different scales and orientations.

Raw features, normalized pixel values.

Create visual dictionary: for each feature type we do vector quantization and create a dictionary V of 2000 visual words.

Compute baseline representation: for every feature type, map each patch to its closest visual word in the corresponding dictionary.

The final representation is given by:

$g(I) = [c_1, c_2, \ldots, c_{|V_c|};\; t_1, t_2, \ldots, t_{|V_t|};\; r_1, r_2, \ldots, r_{|V_r|}]$

where $c_i$ is the number of times that an image patch was mapped to the i-th color word.
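The mapping from patches to word counts can be sketched as below. This is an illustration under assumptions: patch descriptors and dictionaries are plain NumPy arrays, "closest" means smallest Euclidean distance, and the function names are hypothetical. Dictionary construction (vector quantization) is not shown.

```python
import numpy as np

def word_counts(patch_features, dictionary):
    """Count how many patches map to each visual word (nearest word by Euclidean distance)."""
    diffs = patch_features[:, None, :] - dictionary[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    return np.bincount(nearest, minlength=len(dictionary))

def baseline_representation(color, texture, raw, V_c, V_t, V_r):
    """g(I): concatenated color, texture and raw word counts for one image."""
    return np.concatenate([word_counts(color, V_c),
                           word_counts(texture, V_t),
                           word_counts(raw, V_r)])

# Toy example: three patches, a two-word dictionary.
V = np.array([[0.0, 0.0], [10.0, 10.0]])
patches = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]])
counts = word_counts(patches, V)
g = baseline_representation(patches, patches, patches, V, V, V)
```

With the dictionary size quoted on the slide, each of the three count vectors would have 2000 entries.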

Setting

Step 1: Learn a representation using the unlabeled dataset and labeled datasets from 9 topics.

Step 2: Train a classifier for the 10th held out topic using the learnt representation.

As evaluation we use the equal error rate averaged over the 10 topics.
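For readers unfamiliar with the metric, the equal error rate is the error at the decision threshold where the false positive rate and the false negative rate coincide. The sketch below is one simple way to estimate it from classifier scores; the function name is hypothetical, labels are assumed to be in {-1, +1} with both classes present, and tied scores are not treated specially.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the error rate at the threshold where FPR and FNR are closest."""
    order = np.argsort(-scores)                 # sweep thresholds from high to low score
    pos = labels[order] == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Rates after accepting the top-0, top-1, ..., top-n scored examples as positive.
    fnr = np.concatenate(([1.0], 1.0 - np.cumsum(pos) / n_pos))
    fpr = np.concatenate(([0.0], np.cumsum(~pos) / n_neg))
    i = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[i] + fpr[i])

separable = equal_error_rate(np.array([0.9, 0.8, 0.2, 0.1]), np.array([1, 1, -1, -1]))
interleaved = equal_error_rate(np.array([0.9, 0.8, 0.7, 0.6]), np.array([1, -1, 1, -1]))
```

Averaging this value over the 10 held-out topics gives the summary number used in the evaluation.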

Experiments

Three models, all linear SVMs:

Baseline model (RFB): Uses raw representation.

Low Rank baseline (LRB): Uses as a representation $v(x) = Q^{t}\varphi(x)$, where Q consists of the h eigenvectors of the matrix A (computed in the first step of the algorithm) with the largest eigenvalues.

Sparse Transfer Model (SPT): Uses the representation computed by our algorithm.

For both LRB and SPT we used an RBF kernel when computing the representation from unlabeled data.

Results:

[Figure: equal error rate (y-axis, 0.25 to 0.5) versus number of positive training examples (x-axis, 1 to 50) for the RFB, SPT and LRB models.]

Results:

[Figure: average equal error rate (y-axis, 0.25 to 0.5) per topic for models trained with 5 examples, RFB versus SPT.]

Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model. SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; TC: Trapped Coal Miners; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.

Conclusion

Summary: We described a method for learning discriminative sparse image representations from unlabeled images + images from related tasks.

The method is based on learning a representation from the unlabeled data and performing joint sparse approximation on the data from related tasks to find a subset of discriminative features.

The induced representation improves performance when learning with very few examples.

Future work

Develop an efficient algorithm for solving the joint optimization that will scale to very large datasets.

Combine different representations

Joint Sparse Approximation

Discovers image representations which improve learning with few examples

The LP formulation is feasible for small problems but becomes intractable for larger data-sets.

Outline

A Joint Sparse Approximation Model for Multi-task Learning

An Efficient Algorithm

Experiments

Joint Sparse Approximation as a Constrained Convex Optimization Problem

$\min_{W} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y)$ (a convex function)

$\text{s.t.}\;\; \sum_{i=1}^{d} \max_{k} |W_{i,k}| \le Q$ (convex constraints)

We will use a Projected SubGradient method. Main advantages: simple, scalable, guaranteed convergence rates.

Projected SubGradient methods have been recently proposed:

L2 regularization, i.e. SVM [Shalev-Shwartz et al. 2007]

L1 regularization [Duchi et al. 2008]

Ariadna

Computing the subgradients is trivial. So the question is whether the projection can be computed efficiently.

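Euclidean Projection into the L1-∞ ball

Projection:

$\min_{B,\mu} \; \frac{1}{2} \sum_{i,j} (B_{i,j} - A_{i,j})^{2}$

s.t. $\forall i,j:\; B_{i,j} \le \mu_i$

$\sum_{i=1}^{d} \mu_i = Q$

$\forall i,j:\; B_{i,j} \ge 0$

$\forall i:\; \mu_i \ge 0$

Inspecting the Lagrangian shows that it can be reduced to:

find $\mu, \theta$

s.t. $\sum_{i=1}^{d} \mu_i = Q$

$\forall i \text{ s.t. } \mu_i > 0:\; \sum_{j:\,A_{i,j} \ge \mu_i} (A_{i,j} - \mu_i) = \theta$

$\forall i \text{ s.t. } \mu_i = 0:\; \sum_{j} A_{i,j} \le \theta$

$\forall i:\; \mu_i \ge 0$; $\theta \ge 0$

set: $A_{i,j} \ge \mu_i \Rightarrow B_{i,j} = \mu_i$ and $A_{i,j} \le \mu_i \Rightarrow B_{i,j} = A_{i,j}$

We reduced the projection to finding new maximums for each feature across tasks and using them to truncate A. The total mass removed from a feature across tasks should be the same for all features whose coefficients don’t vanish.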

Ariadna

Here is how we can formulate the projection. Assume that we have a matrix A that is outside the L1-inf ball. We wish to find the matrix B in the L1-inf ball that is closest to A. I am going to assume that A is non-negative; it is easy to show that this approach can be generalized to arbitrary matrices.

Let's first look at the formulation on the left. The objective searches for a matrix B that is close in Euclidean distance to A. We also want B to be in the L1-inf ball of radius Q. To impose this constraint we introduce auxiliary variables mu; there will be one mu for every feature. These variables are going to encode the maximum feature values of the new matrix B. Thus, the first constraint bounds the value of the coefficients of a feature i of B to be below its corresponding maximum mu_i. The second constraint makes sure that the new maximums sum to Q.

Inspecting the Lagrangian shows us the structure of this problem and reveals that it can be reformulated as an optimization that can be solved efficiently, which is shown in the second box. In this new formulation we are also going to search for new maximums mu; let's ignore theta for now, we will return to it later. As in the previous formulation, the first constraint enforces that the new maximums should sum to Q. Ignore the second constraint for now. Once we have found the new maximums, the two lines at the end of box 2 tell us how to construct B. They say that every coefficient of A below its corresponding maximum will remain the same, while coefficients above the maximum will be truncated. So essentially we reduced the projection to finding new maximums for each feature that we will use to truncate A.

Now let's go back to the second constraint. The Lagrangian reveals that at the optimal solution the mass that is lost because of the truncation (that is, the sum of the differences between the values of A larger than the new maximum and the new maximum) will be the same across features. Intuitively this makes sense because the L1 norm on the maximums should prefer no feature over another.

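[Figure: worked example with 3 features, 6 problems and Q = 14. For each of Feature I, Feature II and Feature III, the panel shows the sorted coefficients of tasks t1 to t6, the candidate maximum $\mu_i$ used as truncation point, and the mass loss $R_i(\mu_i) = \sum_{j:\,A_{i,j} \ge \mu_i} (A_{i,j} - \mu_i)$ removed by that truncation. The maximums must satisfy $\sum_{i=1}^{3} \mu_i = 14$.]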

Ariadna

Here we have a concrete example. There are 3 features, six problems, and Q is 14, so we want to find maximums that sum to 14. The figure shows the sorted coefficients of each task for each of the 3 features. We see that selecting a new maximum is selecting a new truncation point. Inside each box is the mass that we will lose for the corresponding feature if we truncate it at that value; I call this theta. Again, the Lagrangian tells us that theta should be the same across features. So our algorithm for finding the new maximums is based on lowering the maximums in such a way that theta is constant across features, and we do this until the new maximums sum to Q. Those are the maximums of the new matrix B.



An Efficient Algorithm For Finding μ

Recall that we need to find a vector of maximums µ that sums to Q and a constant θ such that the mass loss for every feature is the same.

Consider the mass loss function and its inverse:

$R_i: \mu_i \to \theta$ (this is a piecewise linear function)

$R_i^{-1}: \theta \to \mu_i$ (so is its inverse)

Consider a function that takes a mass loss and computes the corresponding sum of µ (the norm of the new matrix):

$N(\theta) = \sum_{i=1}^{d} R_i^{-1}(\theta)$

This is also piecewise linear.

We can construct this function and pick the θ for which $N(\theta) = Q$.

Ariadna

(Connection to previous slide.) This can be easily done by incrementally building some functions which are piecewise linear. Consider the function R_i, which takes a maximum mu_i and returns the corresponding loss. Since this is piecewise linear, it is easy to build its inverse, which given a loss computes its corresponding maximum. Using this inverse function we can consider another function N that takes a theta value, computes the maximums for each feature using the inverse of R, and sums them up. Note that this is the L1-inf norm.
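The search for the θ at which the new maximums sum to Q can be sketched as follows. This is not the sort-and-merge construction of the piecewise linear function described on the slide: it is a simpler stand-in that evaluates the inverse mass loss per row from the sorted coefficients and finds θ by bisection, relying on the fact that the sum of maximums shrinks as θ grows. Function names and the `iters` parameter are hypothetical; signs of A are removed before the projection and restored afterwards.

```python
import numpy as np

def maxima_for_loss(S, csum, theta):
    """Per-row inverse mass loss: mu_i with sum_j max(S_ij - mu_i, 0) = theta (0 if the row vanishes)."""
    d, m = S.shape
    cand = (csum - theta) / np.arange(1, m + 1)     # level if exactly the top-k entries are truncated
    valid = S > cand                                 # the k-th largest entry lies above that level
    last = m - 1 - np.argmax(valid[:, ::-1], axis=1) # largest k for which this holds
    mu = cand[np.arange(d), last]
    return np.where(csum[:, -1] <= theta, 0.0, mu)

def project_l1inf(A, Q, iters=100):
    """Euclidean projection of A onto {B : sum_i max_j |B_ij| <= Q}, via bisection on theta."""
    absA = np.abs(A)
    if absA.max(axis=1).sum() <= Q:
        return A.copy()                              # already inside the ball
    S = -np.sort(-absA, axis=1)                      # each row sorted in decreasing order
    csum = np.cumsum(S, axis=1)
    lo, hi = 0.0, csum[:, -1].max()                  # N(lo) > Q and N(hi) = 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if maxima_for_loss(S, csum, mid).sum() > Q:
            lo = mid
        else:
            hi = mid
    mu = maxima_for_loss(S, csum, hi)
    return np.sign(A) * np.minimum(absA, mu[:, None])  # truncate each row at its new maximum

A = np.array([[3.0, 1.0],
              [2.0, 2.0]])
B = project_l1inf(A, Q=3.0)
```

For this 2 by 2 example the conditions can be solved by hand: the new maximums are 5/3 and 4/3, and both rows lose a mass of 4/3. Bisection trades the exact piecewise linear solve for a fixed number of cheap evaluations.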

Complexity

The total cost of the algorithm is dominated by the initial sort of each row of A:

$O(dm \log(m))$

The merge of these rows to build the piecewise linear function:

$O(dm \log(d))$

The total cost is in the order of:

$O(dm \log(dm))$

Notice that we only need to consider non-zero rows of A, so d is really the number of non-zero rows rather than the actual number of features

Outline

A Joint Sparse Approximation Model for Multi-task Learning

An Efficient Algorithm

Experiments

Synthetic Experiments

We used the same projected subgradient method and compared three different types of projection steps:

L1−∞ projection

L2 projection

L1 projection

All tasks consisted of predicting functions of the form:

$\mathrm{sign}(w \cdot x)$

To generate jointly sparse vectors we randomly selected 10% of the features to be the relevant feature set V . Then for each task we randomly selected a subset v and zeroed all parameters outside v.
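The generation procedure described above can be sketched as below. Several details are assumptions because the slide does not give them: Gaussian inputs and weights, the per-task subset v taken as half of V (`task_frac`), and all function and parameter names.

```python
import numpy as np

def make_jointly_sparse_tasks(n_tasks=60, n_features=200, n_examples=40,
                              relevant_frac=0.10, task_frac=0.5, seed=0):
    """Return (W, V, tasks): weights with a shared sparsity pattern and labeled data per task."""
    rng = np.random.default_rng(seed)
    n_relevant = max(1, int(round(relevant_frac * n_features)))
    V = rng.choice(n_features, size=n_relevant, replace=False)   # shared relevant feature set
    n_per_task = max(1, int(round(task_frac * n_relevant)))       # assumed size of each subset v
    W = np.zeros((n_features, n_tasks))
    tasks = []
    for k in range(n_tasks):
        v = rng.choice(V, size=n_per_task, replace=False)         # task-specific subset of V
        W[v, k] = rng.standard_normal(n_per_task)                  # parameters outside v stay zero
        X = rng.standard_normal((n_examples, n_features))
        y = np.sign(X @ W[:, k])
        y[y == 0] = 1.0                                            # break exact ties toward +1
        tasks.append((X, y))
    return W, V, tasks

W, V, tasks = make_jointly_sparse_tasks(n_tasks=5, n_features=50, n_examples=8)
```

Every non-zero row of W falls inside V, which is the structure the L1−∞ projection is meant to recover.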

Synthetic Experiments

[Figure: Synthetic Experiments Results: 60 problems, 200 features, 10% relevant. Test error (y-axis, 15 to 50) versus number of training examples per task (x-axis, 10 to 640) for the L2, L1 and L1-LINF projections.]

[Figure: Feature Selection Performance. Performance on predicting relevant features: precision and recall (y-axis, 10 to 100) of L1 and L1-INF versus number of training examples per task (x-axis, 10 to 640).]

Dataset: News story prediction

Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped coal miners, Golden Globes, Grammys, Figure skating, Academy Awards, Iraq.

40 tasks.

Raw image representation: bag of visual words.

Linear kernel.

3000 dimensions.

Image Classification Experiments

[Figure: Reuters Dataset Results. Mean EER (y-axis, 0.32 to 0.46) for L2, L1 and L1-INF; x-axis ticks at 15, 30, 60, 120 and 240.]

[Figure: Absolute Weights L1 and Absolute Weights L1-INF. Two weight maps with features (1 to 3000) on the vertical axis and tasks (1 to 40) on the horizontal axis.]

Conclusion and Future Work

Presented a simple and effective algorithm for training joint models with L1−∞ constraints.

The algorithm scales linearly with the number of examples and as O(dm log(dm)) with the number of problems and dimensions. The experiments on a real image dataset show that this algorithm can find solutions that are jointly sparse, resulting in lower test error.

We believe our approach can be extended to work on an online multi-task learning setting.

Future Work: Online Multitask Classification [Cavallanti et al. 2008]

There are m binary classification tasks indexed by 1, …, m. At each time step t = 1, 2, …, T the learner receives a task index k and the corresponding instance vector. Based on this information the learner outputs a binary prediction and then observes the correct label $y_k$.

We are interested in comparing the learner’s mistake count to that of the optimal predictors:

$\inf_{w_1, \ldots, w_m \in \mathbb{R}^{d}} \sum_{t=1}^{T} l(w_{i_t}, x_{i_t}, y_{i_t})$

Thanks!

Lagrangian

Lagrangian
