Transfer Learning for Image Classification
Transfer Learning Approaches
Leverage data from related tasks to improve performance:
Improve generalization.
Reduce the run-time of evaluating a set of classifiers.
Two Main Approaches:
Learning Shared Hidden Representations.
Sharing Features.
Sharing Features: efficient boosting procedures for multiclass object detection
Antonio Torralba, Kevin Murphy, William Freeman
Snapshot of the idea
Goal: Reduce the computational cost of multiclass object recognition and improve generalization performance.
Approach: Make boosted classifiers share weak learners
Some Notation:
$x_i \in \mathbb{R}^d$
$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
$y_i = [y_i^1, y_i^2, \ldots, y_i^m]$
$y_i^k \in \{-1, 1\}$
Training a single boosted classifier
Consider training a single boosted classifier:
Candidate weak learners
Weighted stumps
Fit an additive model
Training a single boosted classifier
Greedy Approach
Minimize the exponential loss
Gentle Boosting
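The single-classifier recipe above (weighted stumps as weak learners, greedy fitting of an additive model, exponential-loss reweighting) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the midpoint thresholds and the toy data are assumptions, and the stump form h(x) = a if x[f] > th else b is one common choice of regression stump.

```python
import math

def fit_stump(X, y, w):
    """Weighted least-squares regression stump: h(x) = a if x[f] > th else b."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            th = (lo + hi) / 2.0
            above = [i for i, row in enumerate(X) if row[f] > th]
            below = [i for i, row in enumerate(X) if row[f] <= th]
            # Weighted means of the labels on each side of the threshold.
            a = sum(w[i] * y[i] for i in above) / sum(w[i] for i in above)
            b = sum(w[i] * y[i] for i in below) / sum(w[i] for i in below)
            err = (sum(w[i] * (y[i] - a) ** 2 for i in above)
                   + sum(w[i] * (y[i] - b) ** 2 for i in below))
            if best is None or err < best[0]:
                best = (err, f, th, a, b)
    return best[1:]

def stump_predict(stump, x):
    f, th, a, b = stump
    return a if x[f] > th else b

def gentle_boost(X, y, rounds):
    """Greedily add one stump per round, reweighting by the exponential loss."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        model.append(stump)
        # w_i <- w_i * exp(-y_i * h(x_i)), then renormalise.
        w = [wi * math.exp(-yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Additive model: sum of the weak learners' outputs."""
    return sum(stump_predict(s, x) for s in model)

# Toy 1-D problem (hypothetical data): labels flip between x = 1 and x = 2.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
model = gentle_boost(X, y, rounds=3)
```

The sign of `score(model, x)` is the predicted label.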
Standard Multiclass Case: No Sharing
Additive model for each class
Minimize the sum of exponential losses
Each class has its own set of weak learners:
Multiclass Case: Sharing Features
Subset of classes; corresponding additive classifier; classifier for the k-th class
At iteration t add one weak learner to one of the additive models:
Minimize the sum of exponential losses
Naive approach:
Greedy Heuristic:
Sharing features for multiclass object detection
Torralba, Murphy, Freeman. CVPR 2004
Learning efficiency
Sharing features shows sub-linear scaling of features with objects (for area under ROC = 0.9).
Red: shared features Blue: independent features
How the features are shared across objects
Basic features: Filter responses computed at different locations of the image
Uncovering Shared Structures in Multiclass Classification
Yonatan Amit, Michael Fink, Nathan Srebro, Shimon Ullman
Structure Learning Framework
Class Parameters
Structural Parameters
Find optimal parameters:
Linear Classifiers
Linear Transformations
$$\arg\min_{W,\Theta}\ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(W, \mathbf{x}_i, \mathbf{y}_i) + c\,\Phi_1(W) + c\,\Phi_2(\Theta)$$
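Multiclass Loss Function
Hinge Loss:
$\max(0,\ 1 - y\,\mathbf{w}^t\mathbf{x})$
Maximal Hinge Loss:
$$\mathrm{Loss}(W, \mathbf{x}, \mathbf{y}) = \max_{j,p\,:\,y_j = 1,\ y_p = -1} \left(1 + \mathbf{w}_p^T\mathbf{x} - \mathbf{w}_j^T\mathbf{x}\right)$$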
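A small sketch of the maximal hinge loss may help. It assumes the loss takes the worst margin violation over all pairs of a positive class j and a negative class p, i.e. the maximum of 1 + w_p·x − w_j·x. The function name and toy weights are hypothetical, and clamping the result at zero is a common variant that is not applied here.

```python
def maximal_hinge_loss(W, x, y):
    """W: list of m weight vectors; x: feature vector; y: m labels in {-1, +1}.

    Returns max over (j, p) with y[j] == 1 and y[p] == -1 of
    1 + w_p . x - w_j . x  (assumed form of the loss).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    pos = [j for j, label in enumerate(y) if label == 1]
    neg = [p for p, label in enumerate(y) if label == -1]
    return max(1.0 + scores[p] - scores[j] for j in pos for p in neg)

# Three classes, two features (hypothetical numbers).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.5, 1.0]
y = [1, -1, -1]
loss = maximal_hinge_loss(W, x, y)  # scores are [0.5, 1.0, -0.5]
```

Here the negative class 1 outscores the positive class 0, so the loss exceeds 1.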
Snapshot of the idea
Main Idea: Enforce Sharing by finding low rank parameter matrix W
Transformation on x
Transformation on w
Consider the m by d parameter matrix W:
It can be factored as $W = W'\Theta$:
The rows of $\Theta$ form a basis
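$$\mathrm{Loss}(W, \mathbf{x}, \mathbf{y}) = \max_{j,p\,:\,y_j = 1,\ y_p = -1} \left(1 + \mathbf{w}_p^T\mathbf{x} - \mathbf{w}_j^T\mathbf{x}\right)$$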
Low Rank Regularization Penalty
Rank of a d by m matrix is the smallest z, such that:
A regularization penalty designed to minimize the rank of W' would tend to produce solutions where a few basis vectors are shared by all classes.
Minimizing the rank would lead to a hard combinatorial problem.
Instead use a trace norm penalty:
$\|W'\|_{\Sigma} = \sum_i |\gamma_i|$, where $\gamma_i$ is the i-th singular value of W'
Putting it all together
No longer in the objective
$$\arg\min_{W'}\ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(W', \mathbf{x}_i, \mathbf{y}_i) + c\,\|W'\|_{\Sigma}$$
For optimization they use a gradient-based method that minimizes a smooth approximation of the objective.
Mammals Dataset
Results
Transfer Learning for Image Classification via Sparse Joint Regularization
Ariadna Quattoni, Michael Collins, Trevor Darrell
Training visual classifiers when a few examples are available
Problem:
Image classification from a few examples can be hard.
A good representation of images is crucial.
Solution: We learn a good image representation using:
unlabeled data + labeled data from related problems
Snapshot of the idea:
Use unlabeled dataset + kernel function to compute a new representation:
Complex features, high dimensional space
Some of them will be very discriminative (hopefully)
Most will be irrelevant
If we knew the relevant features we could learn from fewer examples.
Related problems may share relevant features.
We can use data from related problems to discover them!
Semi-supervised learning
Step 1: Learn representation. Unsupervised learning on a large dataset of unlabeled data produces a visual representation $F: I \to \mathbb{R}^h$.
Step 2: Train classifier. Use $F$ to compute an h-dimensional training set from a small training set of labeled images, then train the classifier.
Semi-supervised learning:
Raina et al. [ICML 2007] proposed an approach that learns a sparse set of high level features (i.e. linear combinations of the original features) from unlabeled data using a sparse coding technique.
Balcan et al. [ML 2004] proposed a representation based on computing kernel distances to unlabeled data points.
Learning visual representations using unlabeled data only
Unsupervised learning in data space
Good thing: Lower dimensional representation preserves relevant statistics of the data sample.
Bad thing: The representation might still contain irrelevant features, i.e. features that are useless for classification.
Learning visual representations from unlabeled data + labeled data from related categories
Step 1: Learn a representation
Create a new representation from a large dataset of unlabeled images and a kernel function $k: X \times X \to \mathbb{R}$.
Select discriminative features of the new representation using labeled images from related categories.
The result is a discriminative representation $F: I \to \mathbb{R}^h$.
Our contribution
Main differences with previous approaches:
Our choice of joint regularization norm allows us to express the joint loss minimization as a linear program (i.e. no need for greedy approximations).
While previous approaches build joint sparse classifiers on the feature space, our method discovers discriminative features in a space derived using the unlabeled data and uses these discriminative features to solve future problems.
Overview of the method
Step I: Use the unlabeled data to compute a new representation space [Kernel SVD].
Step II: Use the labeled data from related problems to discover discriminative features in the new space [Joint Sparse Regularization].
Step III: Compute the new discriminative representation for the samples of the target problem.
Step IV: Train the target classifier using the representation of step III.
Step I: Compute a representation using the unlabeled data
A) Compute kernel matrix of unlabeled images: $K_{ij} = k(x_i, x_j)$
B) Compute a projection matrix A by taking all the eigenvectors of K (i.e. perform Kernel SVD on the unlabeled data).
C) Project labeled data from related problems to the new space.
Notational shorthand: $x' = z(x)$
Sidetrack
Another possible method for learning a representation from the unlabeled data would be to create a projection matrix Q by taking the h eigenvectors of A corresponding to the h largest eigenvalues.
We will call this approach the Low Rank Baseline.
Our method differs significantly from the low rank approach in that we use training data from related problems to select discriminative features in the new space.
Step II: Discover relevant features by joint sparse approximation
A classifier is a function $F: X \to Y$, where:
$X$ is the input space, i.e. the representation learnt from unlabeled data.
$Y = \{-1, 1\}$ is a binary label; in our application it is going to be 1 if an image belongs to a particular topic and -1 otherwise.
A loss function: $l: \{(f(x), y)\} \to \mathbb{R}$
Step II: Discover relevant features by joint sparse approximation
Consider learning a single sparse linear classifier (on the space learnt from the unlabeled data) of the form:
$f(x') = w^t x'$
A sparse model will have only a few features with non-zero coefficients. A natural choice for parameter estimation would be:
$$\min_{w} \sum_{(x,y) \in D} l(f(x'), y) + \sum_{j=1}^{p} |w_j|$$
The first term is the classification error; the second term is the L1 penalty, which penalizes non-sparse solutions.
Donoho [2004] has proven (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.
Step II: Discover relevant features by joint sparse approximation
Goal: Find a subset of features R such that each problem in $D = \{D_1, D_2, \ldots, D_m\}$ can be well approximated by a sparse classifier $f_k(x') = w_k^t x'$ whose non-zero coefficients correspond to features in R.
Solution: Regularized joint loss minimization:
$$\min_{W = [w_1, w_2, \ldots, w_m]} \sum_{k=1}^{m} \sum_{(x,y) \in D_k} l(f_k(x'), y) + \Phi(W)$$
The inner sum is the classification error on training set k; $\Phi(W)$ penalizes solutions that utilize too many features.
Step II: Discover relevant features by joint sparse approximation
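How do we penalize solutions that use too many features?
$$W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,m} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,m} \\ \vdots & \vdots & & \vdots \\ w_{p,1} & w_{p,2} & \cdots & w_{p,m} \end{pmatrix}$$
Row i holds the coefficients for feature i (e.g. row 2: coefficients for feature 2); column k holds the coefficients for classifier k (e.g. column 2: coefficients for classifier 2).
$\Phi(W) = \#\,\text{non-zero rows of } W$
Problem: this is not a proper norm and would lead to a hard combinatorial problem.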
Step II: Discover relevant features by joint sparse approximation
Instead of using the #non-zero-rows pseudo-norm we will use a convex relaxation [Tropp 2006]:
$$\Phi(W) = \sum_{i=1}^{p} \max_k |w_{i,k}|$$
This norm combines:
An L1 norm on the maximum absolute values of the coefficients, which promotes sparsity on the max values (use few features).
An L∞ norm on each row, which promotes non-sparsity on the rows (share features).
The combination of the two norms results in a solution where only a few features are used but the features used will contribute in solving many classification problems.
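To make the contrast between the two penalties concrete, the sketch below computes the non-zero-row count and the L1-L∞ relaxation (sum over features of the largest absolute coefficient across classifiers) for two toy matrices. The matrices and function names are illustrative assumptions; rows are features and columns are classifiers.

```python
def nonzero_rows(W):
    """Number of features used by at least one classifier."""
    return sum(1 for row in W if any(v != 0 for v in row))

def l1_inf_norm(W):
    """Sum over rows (features) of the max absolute coefficient in the row."""
    return sum(max(abs(v) for v in row) for row in W)

def l1_norm(W):
    """Plain elementwise L1 norm, for comparison."""
    return sum(abs(v) for row in W for v in row)

# Three classifiers sharing two features (row 1 is unused).
W_shared = [[1.0, -2.0, 1.5],
            [0.0, 0.0, 0.0],
            [0.5, 0.5, -0.5]]

# Three classifiers, each using a different feature.
W_scattered = [[1.0, 0.0, 0.0],
               [0.0, -2.0, 0.0],
               [0.0, 0.0, 1.5]]

shared_stats = (nonzero_rows(W_shared), l1_inf_norm(W_shared), l1_norm(W_shared))
scattered_stats = (nonzero_rows(W_scattered), l1_inf_norm(W_scattered), l1_norm(W_scattered))
```

In this toy case the shared matrix has the larger elementwise L1 norm but the smaller L1-L∞ norm, which is the behaviour the relaxation is meant to reward.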
Step II: Discover relevant features by joint sparse approximation
Using the L1-L∞ norm we can rewrite our objective function as:
$$\min_{W = [w_1, w_2, \ldots, w_m]} \sum_{k=1}^{m} \sum_{(x,y) \in D_k} l(f_k(x'), y) + \sum_{i=1}^{p} \max_k |w_{i,k}|$$
For any convex loss this is a convex function; in particular, when considering the hinge loss:
$l(f(x), y) = \max(0,\ 1 - y f(x))$
the optimization problem can be expressed as a linear program.
Step II: Discover relevant features by joint sparse approximation
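Linear program formulation (hinge loss):
Objective:
$$\min_{[W, \varepsilon, t]} \sum_{k=1}^{m} \sum_{j=1}^{|D_k|} \varepsilon_j^k + \sum_{i=1}^{p} t_i$$
Max value constraints:
for k = 1:m and for i = 1:p: $-t_i \le w_{i,k} \le t_i$
Slack variables constraints:
for k = 1:m and for j = 1:|D_k|: $y_j^k f_k(x'^k_j) \ge 1 - \varepsilon_j^k$ and $\varepsilon_j^k \ge 0$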
Step III: Compute the discriminative features representation
Define the set of relevant features to be:
$R = \{ r : \max_k |w_{r,k}| > 0 \}$
Create a new representation by taking all the features in x' corresponding to the indexes in R.
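Step III amounts to keeping the rows of the coefficient matrix that are non-zero for at least one related problem. A minimal sketch, assuming W is stored as a list of rows (one per feature) and using a hypothetical tolerance parameter `tol` to absorb numerical noise:

```python
def relevant_features(W, tol=0.0):
    """Indexes r whose largest absolute coefficient across problems exceeds tol."""
    return [r for r, row in enumerate(W) if max(abs(v) for v in row) > tol]

def select_features(x_prime, R):
    """New representation: the entries of x' at the indexes in R."""
    return [x_prime[r] for r in R]

# Three features, two related problems (hypothetical coefficients).
W = [[0.0, 0.0],
     [0.3, -0.1],
     [0.0, 0.7]]
R = relevant_features(W)
v = select_features([5.0, 6.0, 7.0], R)
```

Feature 0 is unused by both problems, so it is dropped from the new representation.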
Experiments: Dataset
Reuters Dataset
10382 images, 108 topics.
Predict the 10 most frequent topics (binary prediction).
Data Partitions
3000 unlabeled images.
2382 images as testing data.
5000 images as source of supervised training data.
Labeled training sets of sizes: 1, 5, 10, 15, …, 50.
Training set $L_n$ with n positive examples and 2*n negative examples.
Dataset (topic [number of images])
SuperBowl [341]
Danish Cartoons [178]
Sharon [321]
Australian Open [209]
Trapped coal miners [196]
Golden Globes [167]
Grammys [170]
Figure skating [146]
Academy Awards [135]
Iraq [125]
Baseline Representation
‘Bag of words’ representation that combines: color, texture and raw local image information
Sampling:
Sample image patches on a fixed grid. For each image patch compute:
Color features based on HSV color histograms
Texture features based on mean responses of Gabor filters at different scales and orientations
Raw features, normalized pixel values
Create visual dictionary: for each feature type we do vector quantization and create a dictionary V of 2000 visual words.
Compute baseline representation:
For every feature type map each patch to its closest visual word in the corresponding dictionary.
The final representation is given by:
$g(I) = [c_1, c_2, \ldots, c_{|V_c|};\ t_1, t_2, \ldots, t_{|V_t|};\ r_1, r_2, \ldots, r_{|V_r|}]$
where $c_i$ is the number of times that an image patch was mapped to the i-th color word
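The mapping from patches to visual-word counts can be sketched as below. The tiny dictionaries, the squared Euclidean distance and all names are assumptions for illustration; the slides use dictionaries of 2000 words per feature type and do not specify the distance.

```python
def nearest_word(patch, dictionary):
    """Index of the visual word closest to the patch (squared Euclidean distance)."""
    return min(range(len(dictionary)),
               key=lambda i: sum((p - q) ** 2 for p, q in zip(patch, dictionary[i])))

def histogram(patches, dictionary):
    """counts[i] = number of patches mapped to the i-th visual word."""
    counts = [0] * len(dictionary)
    for patch in patches:
        counts[nearest_word(patch, dictionary)] += 1
    return counts

def baseline_representation(features, dictionaries):
    """Concatenate the color, texture and raw histograms: g(I) = [c; t; r]."""
    g = []
    for feature_type in ("color", "texture", "raw"):
        g.extend(histogram(features[feature_type], dictionaries[feature_type]))
    return g

# Hypothetical two-word dictionaries and per-patch features for one image.
dictionaries = {"color": [[0.0], [1.0]],
                "texture": [[0.0, 0.0], [1.0, 1.0]],
                "raw": [[0.0], [10.0]]}
features = {"color": [[0.1], [0.9], [0.8]],
            "texture": [[0.2, 0.1]],
            "raw": [[9.0], [8.0]]}
g = baseline_representation(features, dictionaries)
```

Each block of `g` sums to the number of patches sampled for that feature type.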
Setting
Step 1: Learn a representation using the unlabeled dataset and labeled datasets from 9 topics.
Step 2: Train a classifier for the 10th held out topic using the learnt representation.
As evaluation we use the equal error rate averaged over the 10 topics.
Experiments
Three models, all linear SVMs:
Baseline model (RFB): Uses raw representation.
Low Rank baseline (LRB): Uses as a representation $v(x) = Q^t \varphi(x)$, where Q consists of the h eigenvectors with the highest eigenvalues of the matrix A computed in the first step of the algorithm.
Sparse Transfer Model (SPT): Uses the representation computed by our algorithm.
For both LRB and SPT we used an RBF kernel when computing the representation from unlabeled data.
Results:
[Figure: Equal error rate vs. # positive training examples (1 to 50) for the RFB, SPT and LRB models.]
Results:
[Figure: Average equal error rate for models trained with 5 examples, per topic, for RFB and SPT.]
Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model. SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; TC: Trapped Coal Miners; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.
Results
Conclusion
Summary: We described a method for learning discriminative sparse image representations from unlabeled images + images from related tasks.
The method is based on learning a representation from the unlabeled data and performing joint sparse approximation on the data from related tasks to find a subset of discriminative features.
The induced representation improves performance when learning with very few examples.
Future work
Develop an efficient algorithm for solving the joint optimization that will scale to very large datasets.
Combine different representations
Joint Sparse Approximation
Discovers image representations which improve learning with few examples
The LP formulation is feasible for small problems but becomes intractable for larger data-sets.
Outline
A Joint Sparse Approximation Model for Multi-task Learning
An Efficient Algorithm
Experiments
Joint Sparse Approximation as a Constrained Convex Optimization Problem
$$\min_{W} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \qquad \text{(a convex function)}$$
$$\text{s.t.}\ \sum_{i=1}^{d} \max_k |W_{i,k}| \le Q \qquad \text{(convex constraints)}$$
We will use a Projected SubGradient method. Main advantages: simple, scalable, guaranteed convergence rates.
Projected SubGradient methods have been recently proposed:
L2 regularization, i.e. SVM [Shalev-Shwartz et al. 2007]
L1 regularization [Duchi et al. 2008]
Euclidean Projection into the L1-∞ ball
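Projection:
$$\min_{B, \mu}\ \frac{1}{2} \sum_{i,j} (B_{i,j} - A_{i,j})^2$$
s.t. $\forall i,j:\ B_{i,j} \le \mu_i$; $\sum_{i=1}^{d} \mu_i = Q$; $\forall i,j:\ B_{i,j} \ge 0$; $\forall i:\ \mu_i \ge 0$
Inspecting the Lagrangian shows that it can be reduced to:
find $\mu, \theta$
s.t. $\forall i \text{ with } \mu_i > 0:\ \sum_{j:\, A_{i,j} \ge \mu_i} (A_{i,j} - \mu_i) = \theta$; $\sum_{i=1}^{d} \mu_i = Q$; $\forall i:\ \mu_i \ge 0$; $\theta \ge 0$
set: $B_{i,j} = A_{i,j}$ if $A_{i,j} < \mu_i$, and $B_{i,j} = \mu_i$ if $A_{i,j} \ge \mu_i$
We reduced the projection to finding new maximums for each feature across tasks and using them to truncate A. The total mass removed from a feature across tasks should be the same for all features whose coefficients don’t vanish.
[Figure: coefficients of Feature I, Feature II and Feature III across tasks t1 to t6, each truncated at the new maximum of its feature.]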
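Euclidean Projection into the L1-∞ ball
Example: 3 features, 6 problems, Q = 14, so $\sum_{i=1}^{3} \mu_i = 14$.
Mass removed from feature 1: $R_1(\mu_1) = \sum_{j:\, A_{1,j} \ge \mu_1} (A_{1,j} - \mu_1)$
[Figure: worked example of truncating the three features at their new maximums.]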
An Efficient Algorithm For Finding μ
Recall that we need to find a vector of maximums µ that sums to Q and a constant θ such that the mass loss for every feature is the same.
Consider the mass loss function and its inverse:
$R_i: \mu_i \mapsto \theta$ and $R_i^{-1}: \theta \mapsto \mu_i$
This is a piecewise linear function. So is its inverse.
Consider a function that takes a mass loss θ and computes the corresponding sum of µ (the norm of the new matrix):
$$N(\theta) = \sum_{i=1}^{d} R_i^{-1}(\theta)$$
This is also piecewise linear.
We can construct this function and pick the θ for which $N(\theta) = Q$.
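The search for θ can be illustrated with a short sketch. Note the swap: instead of building the piecewise linear function N exactly by sorting and merging, this version evaluates N(θ) directly and finds the θ with N(θ) = Q by bisection, which is simpler but only approximate. Function names and the toy matrix are assumptions; signs of A are handled by projecting absolute values and restoring the signs afterwards.

```python
import math

def mu_for_theta(row, theta):
    """Inverse mass loss: the cap mu >= 0 with sum_j max(row[j] - mu, 0) == theta.

    `row` holds non-negative values; returns 0 if the whole row carries
    no more than theta mass (the feature vanishes).
    """
    if sum(row) <= theta:
        return 0.0
    a = sorted(row, reverse=True)
    prefix = 0.0
    for k, val in enumerate(a, start=1):
        prefix += val
        mu = (prefix - theta) / k      # cap if exactly the top k entries are cut
        nxt = a[k] if k < len(a) else 0.0
        if mu >= nxt:                  # consistent: no further entry exceeds mu
            return mu
    return 0.0

def project_l1_inf(A, Q, iters=100):
    """Approximate Euclidean projection of A onto {B : sum_i max_j |B_ij| <= Q}."""
    absA = [[abs(v) for v in row] for row in A]
    if sum(max(row) for row in absA) <= Q:
        return [row[:] for row in A]   # already inside the ball
    lo, hi = 0.0, max(sum(row) for row in absA)
    for _ in range(iters):
        theta = (lo + hi) / 2.0
        norm = sum(mu_for_theta(row, theta) for row in absA)  # N(theta)
        if norm > Q:
            lo = theta                 # too little mass removed
        else:
            hi = theta
    mus = [mu_for_theta(row, hi) for row in absA]
    return [[math.copysign(min(abs(v), mu), v) for v in row]
            for row, mu in zip(A, mus)]

# Hypothetical 3-feature, 3-task matrix with L1-inf norm 8 + 4 + 2 = 14.
A = [[8.0, 5.0, 3.0],
     [4.0, 1.0, 0.0],
     [2.0, 2.0, 2.0]]
B = project_l1_inf(A, Q=7.0)
```

For this matrix every row loses the same mass (θ = 3), and the row maximums of B sum to Q.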
Complexity
The total cost of the algorithm is dominated by the initial sort of each row of A: $O(dm \log(m))$
The merge of these rows to build the piecewise linear function: $O(dm \log(d))$
The total cost is in the order of: $O(dm \log(dm))$
Notice that we only need to consider non-zero rows of A, so d is really the number of non-zero rows rather than the actual number of features.
Outline
A Joint Sparse Approximation Model for Multi-task Learning
An Efficient Algorithm
Experiments
Synthetic Experiments
We use the same projected subgradient method, and compare three different types of projection steps:
L1−∞ projection
L2 projection
L1 projection
All tasks consisted of predicting functions of the form: $\mathrm{sign}(w \cdot x)$
To generate jointly sparse vectors we randomly selected 10% of the features to be the relevant feature set V. Then for each task we randomly selected a subset v of V and zeroed all parameters outside v.
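The generation procedure can be sketched as follows. The Gaussian weights and inputs, the per-task subset size, the seed and the function names are assumptions; the slides only specify the 10% relevant set V, the per-task subset v and the sign(w · x) labels.

```python
import random

def make_tasks(num_tasks, num_features, relevant_fraction, per_task, rng):
    """Draw a shared relevant set V, then one sparse weight vector per task."""
    V = rng.sample(range(num_features), round(relevant_fraction * num_features))
    W = []
    for _ in range(num_tasks):
        v = set(rng.sample(V, per_task))   # task-specific subset of V
        # Parameters outside v are zeroed; those inside are Gaussian (assumed).
        W.append([rng.gauss(0.0, 1.0) if i in v else 0.0
                  for i in range(num_features)])
    return V, W

def sample_examples(w, n, rng):
    """Draw n Gaussian inputs (assumed) and label them with sign(w . x)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w]
        y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        data.append((x, y))
    return data

rng = random.Random(0)
V, W = make_tasks(num_tasks=60, num_features=200, relevant_fraction=0.1,
                  per_task=5, rng=rng)
train = sample_examples(W[0], n=10, rng=rng)
```

Because every task draws its support from the same set V, the stacked weight matrix has at most |V| non-zero rows.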
Synthetic Experiments
Results: 60 problems, 200 features, 10% relevant
[Figure: Test error vs. # training examples per task (10 to 640) for L2, L1 and L1-INF.]
[Figure: Performance on predicting relevant features (feature selection performance): precision and recall vs. # training examples per task (10 to 640) for L1 and L1-INF.]
Dataset: News story prediction
Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped coal miners, Golden Globes, Grammys, Figure skating, Academy Awards, Iraq
40 tasks
Raw image representation: Bag of visual words
Linear kernel
3000 dimensions
Image Classification Experiments
[Figure: Reuters Dataset Results. Mean EER vs. # training examples per task (15 to 240) for L2, L1 and L1-INF.]
[Figure: Absolute Weights L1. Heat map of absolute weights, feature (1 to 3000) by task (1 to 40).]
[Figure: Absolute Weights L1-INF. Heat map of absolute weights, feature (1 to 3000) by task (1 to 40).]
Conclusion and Future Work
Presented a simple and effective algorithm for training joint models with L1−∞ constraints.
The algorithm scales linearly with the number of examples and as O(dm log(dm)) with the number of problems and dimensions.
The experiments on a real image dataset show that this algorithm can find solutions that are jointly sparse, resulting in lower test error.
We believe our approach can be extended to work on an online multi-task learning setting.
Future Work: Online Multitask Classification [Cavallanti et al. 2008]
There are m binary classification tasks indexed by 1, …, m. At each time step t = 1, 2, …, T the learner receives a task index k and the corresponding instance vector. Based on this information the learner outputs a binary prediction and then observes the correct label $y_k$.
We are interested in comparing the learner’s mistake count to that of the optimal predictors:
$$\inf_{w_1, \ldots, w_m \in \mathbb{R}^d} \sum_{t=1}^{T} l(w_{i_t}, x_{i_t}, y_{i_t})$$
where $i_t$ denotes the task index received at time step t.
Thanks!
Lagrangian