ICIAP 2007 - Tutorial
Advances of statistical learning and
Applications to Computer Vision
Ernesto De Vito and Francesca Odone
- PART 2 -
http://slipguru.disi.unige.it
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
Learning in everyday life
- Security and video-surveillance
- OCR systems
- Robot control
- Biometrics
- Speech recognition
- Early diagnosis from medical data
- Knowledge discovery in big datasets of heterogeneous data (including the Internet)
- Microarray analysis and classification
- Stock market prediction
- Regression applications in computer graphics
Statistical Learning in Computer Vision

Statistical Learning in Computer Vision
Detection problems

Statistical Learning in Computer Vision
More in general: Image annotation
car, tree, building, sky, pavement, pedestrian, ...

How difficult is image understanding?
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
Regularized face detection
Main steps towards a complete classifier:
- Choosing the representation
- Feature selection
- Classification
Joint work with A. Destrero, C. De Mol, A. Verri
Problem setting: find one or more occurrences of a (~frontal) human face, possibly at different resolutions, in a digital image
Application scenario (the data)
- 2000+2000 training
- 1000+1000 validation
- 3400 test
19 x 19 images
Initial representation (the dictionary)
Overcomplete, general purpose sets of features are effective for modeling visual information
Many object classes have a peculiar intrinsic structure that can be better appreciated if one looks for symmetries or local geometry
Examples of features: wavelets, curvelets, ranklets, chirplets, rectangle features, ...
Examples of problems: face detection (Heisele et al., Viola & Jones, ...), pedestrian detection (Oren et al., ...), car detection (Papageorgiou & Poggio)
Initial representation (the dictionary)
The approach is inspired by biological systems. See, for instance, B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", 1997.
Usually this approach is coupled with learning from examples
The prior knowledge is embedded in the choice of an appropriate training set
Problem: usually these sets are very big
Initial representation (the dictionary)
Rectangle features (Viola & Jones)
... about 64000 features per image patch!
Most of them are correlated:
- Short range correlation of natural images
- Long range correlation relative to the object of interest
What's wrong with this?
- Measurements are noisy
- Features are correlated
- The number of features is higher than the number of examples
=> The problem is ill conditioned
Feature selection
Extracting features relevant for a given problem
What is relevant?
Often related to dimensionality reduction, but the two problems are different
A possible way to address the problem is to resort to regularization methods
Elastic net penalty (PART 1)
Let us revise the basic algorithm
We assume a linear dependence between input and output
φ = {φ_ij} is the measurement matrix: i = 1,...,n examples/data; j = 1,...,p dictionary elements
β = (β_1,..., β_p)^T is the vector of unknown weights to be estimated
f = (f_1,..., f_n)^T collects the output values, {-1,1} labels in binary classification problems

    \[ \varphi \beta = f \]
Choosing the appropriate algorithm
What sort of penalty suits our problem best? In other words: how do we choose ε in the elastic net penalty (PART 1)

    \[ \min_{\beta \in \mathbb{R}^p} \left\{ \|\varphi\beta - f\|_2^2 + \lambda \left( |\beta|_1 + \varepsilon \|\beta\|_2^2 \right) \right\} \; ? \]

The choice is driven by the application domain:
- What can we say about image correlation?
- Is there any reason to prefer feature A to feature B?
- Do we want them both?
[Figure: two example features, A and B]
Peculiarity of images
Given a group of short range correlated features, each element is a good representative of the group
As for long range correlated features, it would be interesting to keep them all, but it's difficult to distinguish them at this stage
Notice that in other applications (e.g., microarray analysis) each feature is important per se
L1 penalty
A purely L1 penalty automatically enforces the presence of many zeros in β
The L1 norm is convex, therefore providing feasible algorithms

    \[ \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|\varphi\beta - f\|_2^2 + \lambda |\beta|_1 \right\} \qquad \text{(PROB L1)} \]

(PROB L1) is the Lagrangian formulation of the so-called LASSO problem
L1 penalty
The regularization parameter λ regulates the balance between misfit of the data and penalty
It also allows us to vary the degree of sparsity
How do we solve it?
The solution is not unique
A number of numerical strategies have been proposed
We adopt the iterated soft-thresholded Landweber scheme (ALG L):

    \[ \beta^{(t+1)} = S_\lambda \left[ \beta^{(t)} + \varphi^T \left( f - \varphi \beta^{(t)} \right) \right] \]

where the soft-thresholder S_λ is defined componentwise as

    \[ (S_\lambda h)_j = \begin{cases} \operatorname{sign}(h_j)\,\left( |h_j| - \lambda/2 \right) & \text{if } |h_j| \ge \lambda/2 \\ 0 & \text{otherwise} \end{cases} \]

This algorithm converges to a minimizer of (PROB L1) if ‖φ‖ < 1
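As an illustration, a minimal MATLAB sketch of ALG L (the function name and the fixed iteration count T are ours; it assumes φ has been rescaled so that ‖φ‖ < 1):

function beta = thresh_landweber(phi, f, lambda, T)
  % iterated soft-thresholded Landweber (ALG L), assuming norm(phi) < 1
  p = size(phi, 2);
  beta = zeros(p, 1);
  for t = 1:T
    h = beta + phi' * (f - phi * beta);             % Landweber step
    beta = sign(h) .* max(abs(h) - lambda/2, 0);    % soft-thresholder S_lambda
  end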
Thresholded Landweber and our problem

    \[ \varphi \beta = f \]

φ is the measurement matrix: one row per image, one column per feature
f is the vector of labels: +1 for faces, -1 for negative examples
In our experiments φ has size 4000 x 64000 (about 1 GB!), and we iterate

    \[ \beta^{(t+1)} = S_\lambda \left[ \beta^{(t)} + \varphi^T \left( f - \varphi \beta^{(t)} \right) \right] \]
A sampled version of Thresholded Landweber
We build S feature subsets, each time extracting with replacement m features, m ≪ p
We compute S sub-problems

    \[ f = \varphi_s \beta_s, \qquad s = 1, \dots, S \]

Then we keep the features that were selected each time they appeared in a sub-set
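A hypothetical sketch of the sampling scheme, reusing the thresh_landweber sketch above (all variable names are ours):

m = round(0.1 * p);                        % each subset: ~10% of the dictionary
appeared = zeros(p, 1); selected = zeros(p, 1);
for s = 1:S
  idx = unique(randi(p, m, 1));            % extract m features with replacement
  beta_s = thresh_landweber(phi(:, idx), f, lambda, T);
  appeared(idx) = appeared(idx) + 1;
  hit = idx(beta_s ~= 0);                  % features selected in sub-problem s
  selected(hit) = selected(hit) + 1;
end
% keep the features selected every time they appeared in a sub-set
S1 = find(appeared > 0 & selected == appeared);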
A sampled version of Thresholded Landweber
In our experiments:
- Each sub-set is 10% of the original size
- S = 200 (the probability of extracting each feature at least 10 times is high)
[Figure: histogram of the number of times each feature is extracted over the S sub-problems]
Structure of the method (I)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1
Choosing λ
A few words on parameter tuning
A classical choice is cross validation, but in this case it is too heavy (because of the number of sub-problems)
Thus, at this stage, we fix the number of zeros to be reached in a given number of iterations
Cross validation
A standard technique for parameter estimation:
try different parameters and choose the one that performs (generalizes) best
K-fold cross validation (sketched below):
- Divide the training set into K chunks
- Keep K-1 for training and 1 for validating
- Repeat for the K different validation sets
- Compute an average classification rate
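A minimal sketch of K-fold cross validation for tuning λ in RLS (function and variable names are ours; the RLS rule itself appears later in these slides):

function best_lambda = kfold_cv(K, y, lambdas, nfold)
  n = length(y);
  fold = mod(randperm(n), nfold) + 1;      % random fold assignment
  err = zeros(size(lambdas));
  for li = 1:numel(lambdas)
    for k = 1:nfold
      tr = (fold ~= k); va = (fold == k);
      c = (K(tr,tr) + sum(tr) * lambdas(li) * eye(sum(tr))) \ y(tr);
      pred = sign(K(va,tr) * c);           % predict on the held-out fold
      err(li) = err(li) + mean(pred ~= y(va)) / nfold;
    end
  end
  [dummy, best] = min(err);
  best_lambda = lambdas(best);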
Classification
Two reasons to train a classifier on the selected features:
- Obtain an effective face detector
- Speculate on the quality of the selected features
Face detection is a fairly standard binary classification problem:
- Regularized Least Squares
- Support Vector Machines (Vapnik, 1995)
... with some nice kernel
In the following experiments we start using linear SVMs
Setting 90% of zeros
We get 4636 features... too many
What about increasing the number of zeros in the solution?
[Figure: hit rate (0.7-1) vs false positive rate (0-0.02) curves for one stage feature selection, one stage feature selection (cross validation), and one stage feature selection (on the entire set of features)]
A refinement of the solution
Setting 99% of zeros: 345 features (good), but generalization performance drops by about 3% (bad)
IDEA: we apply the Thresholded Landweber once again (on the S1 = 4636 features)
This time we tune λ with cross validation
We obtain 247 features
Structure of the method (II)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1 -> Alg L -> S2
Comparative analysis
Comparison with Adaboost feature selection (Viola & Jones):
[Figure: hit rate vs false positive rate curves for 2 stages feature selection, 2 stages feature selection + correlation, Viola+Jones feature selection using our same data, and the Viola+Jones cascade performance]
Comparison with PCA:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection and PCA]
How compact is the solution?
The 247 features are still redundant
For real-time processing we may want to try and reduce the set further
Linear vs polynomial kernel:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection with linear and polynomial kernels]
A third optimization stage
Starting from S2, we choose one delegate for each group of short range correlated features (a hypothetical sketch follows)
Our correlation analysis discards features that are:
- Of the same type
- Correlated according to Spearman's test
- Spatially close
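A hypothetical sketch of this stage (X is the n-by-q matrix of responses of the q features in S2; ftype, fpos, dmax and rhomax are assumed metadata and thresholds; corr with the 'Spearman' option is from the Statistics Toolbox):

rho = corr(X, 'type', 'Spearman');         % q-by-q rank correlation matrix
keep = true(1, q);
for i = 1:q
  if ~keep(i), continue; end
  for j = i+1:q
    if ftype(i) == ftype(j) && ...                    % same type
       norm(fpos(i,:) - fpos(j,:)) < dmax && ...      % spatially close
       abs(rho(i,j)) > rhomax                         % correlated
      keep(j) = false;                     % discard j, keep i as the delegate
    end
  end
end
S3 = find(keep);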
Structure of the method (III)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1 -> Alg L -> S2 -> Corr -> S3
What do we get?
Linear vs polynomial kernel:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection + correlation with linear and polynomial kernels]
With and without the 3rd stage:
[Figure: hit rate vs false positive rate curves (false positive rate up to 0.06) for two stages feature selection and two stages + correlation analysis]
A fully trainable system for detecting faces
Peculiarity of object detectors:
- For each image, many tests
- Very few positive examples
- Very many negative examples
A fully trainable system for detecting faces
Coarse-to-fine methods deal with this, devising multiple classifiers of increasing difficulty
Many approaches (focus-of-attention, cascades, ...)
Our cascade of classifiers
Starting from a set of features, say S3, we build many small linear SVM classifiers, each based on at least 3 distant features, that are able to reach a fixed target performance on a validation set
The target performance is chosen so that each classifier is not likely to miss faces:
- Minimum hit rate 99.5%
- Maximum false positive rate 50%
The overall hit rate H and false positive rate F of the cascade are products over its layers:

    \[ F = \prod_i f_i, \qquad H = \prod_i h_i \]

For 10 layers, H ≈ 0.995^10 ≈ 95% while F ≈ 0.5^10 ≈ 10^-3
Our cascade of classifiers
Finding faces in images
[Figures: face detection results on example images]
Finding faces in video frames
46
Finding eye regions...
The beauty of data driven approachesSame approachDifferent dataset: we extracted eye regions from a subset of the Feret dataset
A few results (faces and eyes)
Online examples (video)
A few words on the choice of the classifier
SVMs are very popular for their effectiveness and their generalization ability
Other algorithms can perform in a similar way and have other attractive properties
Filter methods are very simple to implement and allow us to obtain very interesting performance
In particular, iterative methods are very useful when parameter tuning is needed
Joint work with L. Lo Gerfo, L. Rosasco, E. De Vito, A. Verri
Experiments on face detection
Results (mean ± std, with the selected parameter values) carried out on a portion of the previously mentioned faces dataset:

    Size of the training set   800                          700                          600
    ν method                   1.48 ±0.34 (σ=300, t=59)     1.53 ±0.33 (σ=341, t=89)     1.63 ±0.32 (σ=341, t=95)
    RBF-SVM                    1.60 ±0.71 (σ=1000, C=0.9)   1.99 ±0.82 (σ=1000, C=1)     2.41 ±1.39 (σ=800, C=1)
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
On the classifier choice: filter methods
Starting from RLS we have seen (PART 1) how a large class of methods known as spectral regularization gives rise to regularized learning algorithms
These methods were originally proposed to solve inverse problems
The crucial intuition is that the same principle that allows us to numerically stabilize a matrix inversion is also what avoids overfitting
They are worth investigating for their simplicity and effectiveness
Filter methods
All these algorithms are consistent and can be easily implemented
They have a common derivation (and similar implementations) but differ in:
- Theoretical properties (PART 1)
- Computational burden
Filter methods: computational issues
Non iterative:
- Tikhonov (RLS)
- Truncated SVD
Iterative:
- Landweber
- ν method
- Iterated Tikhonov
Filter methods: computational issues
RLS
Training (for a fixed lambda):

function [c] = rls(K, lambda, y)
  % K: n-by-n kernel matrix on the training set, y: training labels
  n = length(K);
  c = (K + n*lambda*eye(n)) \ y;   % solve (K + n*lambda*I) c = y

Test:

function [y_new] = rls_test(x, x_new, c)
  K_new = kernel(x_new, x);        % kernel between new points and training points
  y_new = K_new * c;
  y_new = sign(y_new);             % for classification

Be careful in choosing the matrix inversion function
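Usage sketch, assuming a linear kernel (x is the n-by-d training matrix; all names are ours):

K = x * x';                   % linear kernel: n-by-n Gram matrix
c = rls(K, 0.01, y);          % train with lambda = 0.01
K_new = x_new * x';           % kernel between new points and training points
y_new = sign(K_new * c);      % predicted labels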
Filter methods: computational issues
RLS
The computational cost of RLS is the cost of inverting the matrix K: O(n^3)

    \[ c = (K + n\lambda I)^{-1} y \]

In case parameter tuning is needed, resorting to an eigendecomposition of the matrix K saves time:

    \[ K = Q \Lambda Q^T, \qquad c(\lambda) = Q \left( \Lambda + n\lambda I \right)^{-1} Q^T y \]
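A sketch of the trick (names are ours): factor K once, then each value of λ costs O(n^2) instead of a fresh O(n^3) solve:

[Q, L] = eig(K);  d = diag(L);             % K = Q * diag(d) * Q', done once
Qty = Q' * y;
for li = 1:numel(lambdas)
  % c(lambda) = Q * (Lambda + n*lambda*I)^(-1) * Q' * y
  c = Q * (Qty ./ (d + n * lambdas(li)));
  % ... evaluate the validation error for this c ...
end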
Filter methods: computational issues
ν method
The number of iterations t plays the role of the regularization parameter (more iterations mean less regularization)
Computational cost: O(tn^2)
The iterative procedure allows us to compute all solutions from 0 to t (the regularization path)
This is convenient if parameter tuning is needed: with an appropriate choice of the maximum number of iterations, the computational cost does not change
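To illustrate the regularization path idea, a sketch of the plain Landweber iteration on the kernel problem (the ν method follows the same pattern with accelerated step coefficients, omitted here; names are ours):

tau = n / norm(K);                  % step size ensuring convergence
c = zeros(n, 1);
path = zeros(n, tmax);
for t = 1:tmax
  c = c + (tau / n) * (y - K * c);  % one Landweber step
  path(:, t) = c;                   % the t-th point of the regularization path
end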
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
How difficult is image understanding?
Problem setting (general): assign one or more labels (from a finite but possibly large set of known classes) to a digital image according to its content
This general problem is very complex
Many better defined domains have been studied:
- Image categorization
- Object detection
- Object recognition
Usually the trick is in defining the boundaries of the problem of interest
Joint work(s) with A. Barla, E. Delponte, A. Verri
Object identification/recognition
Nevertheless, the problem is not that simple
Image annotation
Problem setting: assign one or more labels (from a finite set of known classes) to a digital image according to its content
Assumption: we look for global descriptions
- Indoor/outdoor
- Drawing/picture
- Day/night
- Cityscape/not
It usually leads to supervised problems (binary classifiers)
Low level descriptions are often applied
Image annotation from low-level global descriptions
The problem: capture a global description of the image using simple features
The procedure:
- Build a suitable training set of data
- Find an appropriate representation
- Choose a classification algorithm and a kernel
- Tune the parameters
Computer vision ingredients
We represent whole images with low level descriptions of color, shape or texture
Color: color histograms
Shape: orientation and strength edge histograms; histograms of the lengths of edge chains
Texture: wavelets, co-occurrence matrices
A few comments
Histograms appear quite often
We need a simple example to discuss kernel engineering: designing ad hoc kernels for the problem/data at hand, with the right properties:
- Symmetry
- Positive definiteness
=> Let us go through the histogram intersection example
Histogram Intersection (HI)
Since (Swain and Ballard, 1991) it is known that histogram intersection is a powerful similarity measure for color indexing
Given two images, A and B, of N pixels, if we represent them as histograms with M bins A_i and B_i, histogram intersection is defined as

    \[ K(A, B) = \sum_{i=1}^{M} \min\{A_i, B_i\} \]
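A minimal MATLAB sketch of the HI kernel matrix between two sets of histograms stored row-wise (function name is ours):

function K = hi_kernel(A, B)
  % A: nA-by-M histograms, B: nB-by-M histograms
  nA = size(A, 1); nB = size(B, 1);
  K = zeros(nA, nB);
  for i = 1:nA
    for j = 1:nB
      K(i, j) = sum(min(A(i,:), B(j,:)));   % sum of bin-wise minima
    end
  end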
Histogram Intersection (HI)
[Figure: two 8-bin histograms and, below, their bin-wise minimum; the intersection K(A,B) is the sum over bins of the minima]
HI is a Kernel
If we build the M·N-dimensional binary vector

    \[ \bar{A} = \Big( \underbrace{1,\dots,1}_{A_1}, \underbrace{0,\dots,0}_{N-A_1},\; \underbrace{1,\dots,1}_{A_2}, \underbrace{0,\dots,0}_{N-A_2},\; \dots,\; \underbrace{1,\dots,1}_{A_M}, \underbrace{0,\dots,0}_{N-A_M} \Big) \]

that is, each bin A_i encoded as A_i ones followed by N - A_i zeros, it can immediately be seen that

    \[ K(A, B) = \langle \bar{A}, \bar{B} \rangle \]

a dot product (linear kernel)
NOTICE: the proof is based on finding an explicit mapping
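A toy numeric check of the proof (entirely our own illustration): encoding each bin as a unary binary string turns the intersection into a dot product:

unary = @(h, N) cell2mat(arrayfun(@(c) [ones(1,c) zeros(1,N-c)], h, ...
                                  'UniformOutput', false));
A = [3 0 2]; B = [1 4 2]; N = 5;        % toy histograms, M = 3 bins
K_hi  = sum(min(A, B));                 % histogram intersection = 3
K_dot = unary(A, N) * unary(B, N)';     % dot product of the M*N-dim vectors = 3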
Histogram intersection: applications
HI has been applied with success to a variety of classification problems, both global and local:
- Indoor/outdoor, day/night, cityscape/landscape classification
- Object detection from local features (SIFT)
In all those cases it outperformed RBF-based classifiers
Also, HI does not depend on any parameter
Local approaches
Global approaches have limits:
- Often objects of interest occupy only a (small) portion of the image
- In a simplified setting all the rest of the image can be defined as background (or context)
- Depending on the application domain, context can help recognition or make it more difficult
Local approaches
We may represent the image content as a set of local features (f1, ..., fn) --- corners, DoG features, ...
We immediately see that this is a variable length description
How to deal with variable length:
- Vocabulary approach
- Local kernels (or kernels on sets)
[Figure: local features in scale-space]
Local approaches: features vocabulary
It is reminiscent of text categorization
We define a vocabulary of local features and represent our images based on how often a given feature appears in the image
One implementation of this paradigm is the bag of keypoints approach (sketched below)
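A hedged sketch of that pipeline (kmeans and pdist2 are from the Statistics Toolbox; V, allDescriptors and imgDescriptors are assumed inputs):

[dummy, vocab] = kmeans(allDescriptors, V);  % vocabulary of V visual words
D = pdist2(imgDescriptors, vocab);           % distances to each visual word
[dummy, idx] = min(D, [], 2);                % nearest word for each descriptor
h = histc(idx, 1:V);                         % word occurrence counts
h = h / sum(h);                              % normalized bag-of-keypoints vector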
Local approaches: features vocabulary
[Figure: the bag of keypoints pipeline of Csurka et al., 2004]
Local approaches: kernels on sets
Image descriptions based on local features can be seen as sets:
- Variable length
- No internal ordering

    \[ X = \{x_1, \dots, x_n\}, \qquad Y = \{y_1, \dots, y_m\} \]

A common approach to define a global similarity between feature sets is to combine the local similarities between (possibly all) pairs of elements:

    \[ K(X, Y) = ℑ\big( K_L(x_i, y_j) \big), \qquad i = 1,\dots,n, \; j = 1,\dots,m \]

for some combination rule ℑ
Summation kernel [Haussler, 1999]
The simplest kernel for sets is the summation kernel

    \[ K_S(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} K_L(x_i, y_j) \]

K_S is a kernel if K_L is a kernel
K_S is not so useful in practice:
- Computationally heavy
- It mixes good and bad correspondences
Matching kernel [Wallraven et al., 2003]
Among the many kernels for sets that have been proposed, the matching kernel received a lot of attention for image data:

    \[ K_M(X, Y) = \frac{1}{2} \left( \hat{K}(X, Y) + \hat{K}(Y, X) \right) \]

where

    \[ \hat{K}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \max_{j = 1,\dots,m} K_L(x_i, y_j) \]
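A sketch of the matching kernel (function names are ours; KL is a handle to the local kernel):

function k = matching_kernel(X, Y, KL)
  % X: n-by-d, Y: m-by-d local feature sets
  k = 0.5 * (one_way(X, Y, KL) + one_way(Y, X, KL));

function k = one_way(X, Y, KL)
  n = size(X, 1);
  best = -inf(n, 1);
  for i = 1:n
    for j = 1:size(Y, 1)
      best(i) = max(best(i), KL(X(i,:), Y(j,:)));  % best match for x_i in Y
    end
  end
  k = mean(best);                 % (1/n) sum_i max_j KL(x_i, y_j)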
Matching kernel [Wallraven et al., 2003]
The matching kernel led to promising results on object recognition problems
Nevertheless it has been shown that it is not a Mercer kernel (because of the max operator)
Intermediate matching kernel [Boughorbel et al., 2004]
Let us consider two feature sets

    \[ X = \{x_1, \dots, x_n\}, \qquad Y = \{y_1, \dots, y_m\} \]

The two feature sets are compared through an auxiliary set of virtual features

    \[ V = \{v_1, \dots, v_p\} \]

The intermediate matching kernel is defined as

    \[ K(X, Y) = \sum_{v_i \in V} K_{v_i}(X, Y), \qquad K_{v_i}(X, Y) = K_L(x^*, y^*) \]

where x* and y* are the elements of X and Y closest to v_i
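A sketch of the intermediate matching kernel (names are ours; V holds the virtual features row-wise, e.g. k-means centroids of training features):

function k = imk(X, Y, V, KL)
  k = 0;
  for i = 1:size(V, 1)
    xs = closest(X, V(i,:));      % element of X closest to v_i
    ys = closest(Y, V(i,:));      % element of Y closest to v_i
    k = k + KL(xs, ys);
  end

function z = closest(Z, v)
  d = sum((Z - repmat(v, size(Z,1), 1)).^2, 2);   % squared distances to v
  [dummy, j] = min(d);
  z = Z(j,:);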
Intermediate matching kernel: how to choose the virtual features
The intuition behind the virtual features is to find representatives of the feature points extracted in the training set
Simply, the training set features are grouped in N clusters
The authors show that the choice of N is not crucial (the bigger the better, but beware of the computational complexity)
It is better to cluster features within each class
Conclusions
Understanding the image content is difficult
Statistical learning can help a lot
Don't forget computer vision! Appropriate descriptions and similarity measures allow us to achieve good results and obtain effective solutions
That's all!
How to contact us:
Ernesto: [email protected]
Francesca: [email protected]
http://slipguru.disi.unige.it
where you will find updated versions of the slides
Selected (and very incomplete) biblio
A. Destrero, C. De Mol, F. Odone, A. Verri. A regularized approach to feature selection for face detection. DISI-TR-2007-01.
A. Mohan, C. Papageorgiou, T. Poggio. Example-based object detection in images by components. IEEE Trans. PAMI, 23(4), 2001.
F. Odone, A. Barla, A. Verri. Building kernels from binary strings for image matching. IEEE Transactions on Image Processing, 14(2):169-180, 2005.
P. Viola, M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2), 2004.
C. Wallraven, B. Caputo, A. Graf. Recognition with local features: the kernel recipe. ICCV 2003.