ICIAP 2007 - Tutorial
Advances of statistical learning and
Applications to Computer Vision
Ernesto De Vito and Francesca Odone
- PART 2 -
http://slipguru.disi.unige.it
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
Learning in everyday life
- Security and video-surveillance
- OCR systems
- Robot control
- Biometrics
- Speech recognition
- Early diagnosis from medical data
- Knowledge discovery in big datasets of heterogeneous data (including the Internet)
- Microarray analysis and classification
- Stock market prediction
- Regression applications in computer graphics
Statistical Learning in Computer Vision

Statistical Learning in Computer Vision
Detection problems

Statistical Learning in Computer Vision
More in general: Image annotation
car, tree, building, sky, pavement, pedestrian, ...

How difficult is image understanding?
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
Regularized face detection
Main steps towards a complete classifier:
- Choosing the representation
- Feature selection
- Classification
Joint work with A. Destrero, C. De Mol, A. Verri
Problem setting: find one or more occurrences of a (~frontal) human face, possibly at different resolutions, in a digital image
Application scenario (the data)
- 2000+2000 training
- 1000+1000 validation
- 3400 test
19 x 19 images
Initial representation (the dictionary)
Overcomplete, general purpose sets of features are effective for modeling visual information
Many object classes have a peculiar intrinsic structure that can be better appreciated if one looks for symmetries or local geometry
Examples of features: wavelets, curvelets, ranklets, chirplets, rectangle features, ...
Examples of problems: face detection (Heisele et al., Viola & Jones, ...), pedestrian detection (Oren et al., ...), car detection (Papageorgiou & Poggio)
Initial representation (the dictionary)
The approach is inspired by biological systems. See, for instance, B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", 1997.
Usually this approach is coupled with learning from examples
The prior knowledge is embedded in the choice of an appropriate training set
Problem: usually these sets are very big
Initial representation (the dictionary)
Rectangle features (Viola & Jones)
... about 64000 features per image patch!
Most of them are correlated:
- Short range correlation of natural images
- Long range correlation relative to the object of interest
What's wrong with this?
- Measurements are noisy
- Features are correlated
- The number of features is higher than the number of examples
=> The problem is ill conditioned
Feature selection
Extracting features relevant for a given problem
What is relevant?
Often related to dimensionality reduction, but the two problems are different
A possible way to address the problem is to resort to regularization methods
Elastic net penalty (PART 1)
Let us revise the basic algorithm
We assume a linear dependence between input and output
φ = {φ_ij} is the measurement matrix: i = 1,...,n examples/data; j = 1,...,p dictionary elements
β = (β_1,..., β_p)^T is the vector of unknown weights to be estimated
f = (f_1,..., f_n)^T collects the output values, {-1,1} labels in binary classification problems

    \[ \varphi \beta = f \]
Choosing the appropriate algorithm
What sort of penalty suits our problem best? In other words: how do we choose ε in the elastic net penalty (PART 1)

    \[ \min_{\beta \in \mathbb{R}^p} \left\{ \|\varphi\beta - f\|_2^2 + \lambda \left( |\beta|_1 + \varepsilon \|\beta\|_2^2 \right) \right\} \; ? \]

The choice is driven by the application domain:
- What can we say about image correlation?
- Is there any reason to prefer feature A to feature B?
- Do we want them both?
[Figure: two example features, A and B]
Peculiarity of images
Given a group of short range correlated features, each element is a good representative of the group
As for long range correlated features, it would be interesting to keep them all, but it's difficult to distinguish them at this stage
Notice that in other applications (e.g., microarray analysis) each feature is important per se
L1 penalty
A purely L1 penalty automatically enforces the presence of many zeros in β
The L1 norm is convex, therefore providing feasible algorithms

    \[ \beta^* = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|\varphi\beta - f\|_2^2 + \lambda |\beta|_1 \right\} \qquad \text{(PROB L1)} \]

(PROB L1) is the Lagrangian formulation of the so-called LASSO problem
L1 penalty
The regularization parameter λ regulates the balance between misfit of the data and penalty
It also allows us to vary the degree of sparsity
How do we solve it?
The solution is not unique
A number of numerical strategies have been proposed
We adopt the iterated soft-thresholded Landweber scheme (ALG L):

    \[ \beta^{(t+1)} = S_\lambda \left[ \beta^{(t)} + \varphi^T \left( f - \varphi \beta^{(t)} \right) \right] \]

where the soft-thresholder S_λ is defined componentwise as

    \[ (S_\lambda h)_j = \begin{cases} \operatorname{sign}(h_j)\,\left( |h_j| - \lambda/2 \right) & \text{if } |h_j| \ge \lambda/2 \\ 0 & \text{otherwise} \end{cases} \]

This algorithm converges to a minimizer of (PROB L1) if ‖φ‖ < 1
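As an illustration, a minimal MATLAB sketch of ALG L (the function name and the fixed iteration count T are ours; it assumes φ has been rescaled so that ‖φ‖ < 1):

function beta = thresh_landweber(phi, f, lambda, T)
  % iterated soft-thresholded Landweber (ALG L), assuming norm(phi) < 1
  p = size(phi, 2);
  beta = zeros(p, 1);
  for t = 1:T
    h = beta + phi' * (f - phi * beta);             % Landweber step
    beta = sign(h) .* max(abs(h) - lambda/2, 0);    % soft-thresholder S_lambda
  end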
Thresholded Landweber and our problem

    \[ \varphi \beta = f \]

φ is the measurement matrix: one row per image, one column per feature
f is the vector of labels: +1 for faces, -1 for negative examples
In our experiments φ has size 4000 x 64000 (about 1 GB!), and we iterate

    \[ \beta^{(t+1)} = S_\lambda \left[ \beta^{(t)} + \varphi^T \left( f - \varphi \beta^{(t)} \right) \right] \]
A sampled version of Thresholded Landweber
We build S feature subsets, each time extracting with replacement m features, m ≪ p
We compute S sub-problems

    \[ f = \varphi_s \beta_s, \qquad s = 1, \dots, S \]

Then we keep the features that were selected each time they appeared in a sub-set
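A hypothetical sketch of the sampling scheme, reusing the thresh_landweber sketch above (all variable names are ours):

m = round(0.1 * p);                        % each subset: ~10% of the dictionary
appeared = zeros(p, 1); selected = zeros(p, 1);
for s = 1:S
  idx = unique(randi(p, m, 1));            % extract m features with replacement
  beta_s = thresh_landweber(phi(:, idx), f, lambda, T);
  appeared(idx) = appeared(idx) + 1;
  hit = idx(beta_s ~= 0);                  % features selected in sub-problem s
  selected(hit) = selected(hit) + 1;
end
% keep the features selected every time they appeared in a sub-set
S1 = find(appeared > 0 & selected == appeared);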
A sampled version of Thresholded Landweber
In our experiments:
- Each sub-set is 10% of the original size
- S = 200 (the probability of extracting each feature at least 10 times is high)
[Figure: histogram of the number of times each feature is extracted over the S sub-problems]
Structure of the method (I)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1
Choosing λ
A few words on parameter tuning
A classical choice is cross validation, but in this case it is too heavy (because of the number of sub-problems)
Thus, at this stage, we fix the number of zeros to be reached in a given number of iterations
Cross validation
A standard technique for parameter estimation:
try different parameters and choose the one that performs (generalizes) best
K-fold cross validation (sketched below):
- Divide the training set into K chunks
- Keep K-1 for training and 1 for validating
- Repeat for the K different validation sets
- Compute an average classification rate
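A minimal sketch of K-fold cross validation for tuning λ in RLS (function and variable names are ours; the RLS rule itself appears later in these slides):

function best_lambda = kfold_cv(K, y, lambdas, nfold)
  n = length(y);
  fold = mod(randperm(n), nfold) + 1;      % random fold assignment
  err = zeros(size(lambdas));
  for li = 1:numel(lambdas)
    for k = 1:nfold
      tr = (fold ~= k); va = (fold == k);
      c = (K(tr,tr) + sum(tr) * lambdas(li) * eye(sum(tr))) \ y(tr);
      pred = sign(K(va,tr) * c);           % predict on the held-out fold
      err(li) = err(li) + mean(pred ~= y(va)) / nfold;
    end
  end
  [dummy, best] = min(err);
  best_lambda = lambdas(best);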
Classification
Two reasons to train a classifier on the selected features:
- Obtain an effective face detector
- Speculate on the quality of the selected features
Face detection is a fairly standard binary classification problem:
- Regularized Least Squares
- Support Vector Machines (Vapnik, 1995)
... with some nice kernel
In the following experiments we start using linear SVMs
Setting 90% of zeros
We get 4636 features... too many
What about increasing the number of zeros in the solution?
[Figure: hit rate (0.7-1) vs false positive rate (0-0.02) curves for one stage feature selection, one stage feature selection (cross validation), and one stage feature selection (on the entire set of features)]
A refinement of the solution
Setting 99% of zeros: 345 features (good), but generalization performance drops by about 3% (bad)
IDEA: we apply the Thresholded Landweber once again (on the S1 = 4636 features)
This time we tune λ with cross validation
We obtain 247 features
Structure of the method (II)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1 -> Alg L -> S2
Comparative analysis
Comparison with Adaboost feature selection (Viola & Jones):
[Figure: hit rate vs false positive rate curves for 2 stages feature selection, 2 stages feature selection + correlation, Viola+Jones feature selection using our same data, and the Viola+Jones cascade performance]
Comparison with PCA:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection and PCA]
How compact is the solution?
The 247 features are still redundant
For real-time processing we may want to try and reduce the set further
Linear vs polynomial kernel:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection with linear and polynomial kernels]
A third optimization stage
Starting from S2, we choose one delegate for each group of short range correlated features (a hypothetical sketch follows)
Our correlation analysis discards features that are:
- Of the same type
- Correlated according to Spearman's test
- Spatially close
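A hypothetical sketch of this stage (X is the n-by-q matrix of responses of the q features in S2; ftype, fpos, dmax and rhomax are assumed metadata and thresholds; corr with the 'Spearman' option is from the Statistics Toolbox):

rho = corr(X, 'type', 'Spearman');         % q-by-q rank correlation matrix
keep = true(1, q);
for i = 1:q
  if ~keep(i), continue; end
  for j = i+1:q
    if ftype(i) == ftype(j) && ...                    % same type
       norm(fpos(i,:) - fpos(j,:)) < dmax && ...      % spatially close
       abs(rho(i,j)) > rhomax                         % correlated
      keep(j) = false;                     % discard j, keep i as the delegate
    end
  end
end
S3 = find(keep);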
Structure of the method (III)

    S0 -> {sub1, sub2, ..., subS} -> Alg L on each subset -> combine (+) -> S1 -> Alg L -> S2 -> Corr -> S3
What do we get?
Linear vs polynomial kernel:
[Figure: hit rate vs false positive rate curves for 2 stages feature selection + correlation with linear and polynomial kernels]
With and without the 3rd stage:
[Figure: hit rate vs false positive rate curves (false positive rate up to 0.06) for two stages feature selection and two stages + correlation analysis]
A fully trainable system for detecting faces
Peculiarity of object detectors:
- For each image, many tests
- Very few positive examples
- Very many negative examples
A fully trainable system for detecting faces
Coarse-to-fine methods deal with this, devising multiple classifiers of increasing difficulty
Many approaches (focus-of-attention, cascades, ...)
Our cascade of classifiers
Starting from a set of features, say S3, we build many small linear SVM classifiers, each based on at least 3 distant features, that are able to reach a fixed target performance on a validation set
The target performance is chosen so that each classifier is not likely to miss faces:
- Minimum hit rate 99.5%
- Maximum false positive rate 50%
The overall hit rate H and false positive rate F of the cascade are products over its layers:

    \[ F = \prod_i f_i, \qquad H = \prod_i h_i \]

For 10 layers, H ≈ 0.995^10 ≈ 95% while F ≈ 0.5^10 ≈ 10^-3
Our cascade of classifiers
Finding faces in images
[Figures: face detection results on example images]
Finding faces in video frames
46
Finding eye regions...
The beauty of data driven approachesSame approachDifferent dataset: we extracted eye regions from a subset of the Feret dataset
A few results (faces and eyes)
Online examples (video)
A few words on the choice of the classifier
SVMs are very popular for their effectiveness and their generalization ability
Other algorithms can perform in a similar way and have other attractive properties
Filter methods are very simple to implement and allow us to obtain very interesting performance
In particular, iterative methods are very useful when parameter tuning is needed
Joint work with L. Lo Gerfo, L. Rosasco, E. De Vito, A. Verri
Experiments on face detection
Results (mean ± std, with the selected parameter values) carried out on a portion of the previously mentioned faces dataset:

    Size of the training set   800                          700                          600
    ν method                   1.48 ±0.34 (σ=300, t=59)     1.53 ±0.33 (σ=341, t=89)     1.63 ±0.32 (σ=341, t=95)
    RBF-SVM                    1.60 ±0.71 (σ=1000, C=0.9)   1.99 ±0.82 (σ=1000, C=1)     2.41 ±1.39 (σ=800, C=1)
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
On the classifier choice: filter methods
Starting from RLS we have seen (PART 1) how a large class of methods known as spectral regularization gives rise to regularized learning algorithms
These methods were originally proposed to solve inverse problems
The crucial intuition is that the same principle that allows us to numerically stabilize a matrix inversion is also what avoids overfitting
They are worth investigating for their simplicity and effectiveness
Filter methods
All these algorithms are consistent and can be easily implemented
They have a common derivation (and similar implementations) but differ in:
- Theoretical properties (PART 1)
- Computational burden
Filter methods: computational issues
Non iterative:
- Tikhonov (RLS)
- Truncated SVD
Iterative:
- Landweber
- ν method
- Iterated Tikhonov
Filter methods: computational issues
RLS
Training (for a fixed lambda):

function [c] = rls(K, lambda, y)
  % K: n-by-n kernel matrix on the training set, y: training labels
  n = length(K);
  c = (K + n*lambda*eye(n)) \ y;   % solve (K + n*lambda*I) c = y

Test:

function [y_new] = rls_test(x, x_new, c)
  K_new = kernel(x_new, x);        % kernel between new points and training points
  y_new = K_new * c;
  y_new = sign(y_new);             % for classification

Be careful in choosing the matrix inversion function
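Usage sketch, assuming a linear kernel (x is the n-by-d training matrix; all names are ours):

K = x * x';                   % linear kernel: n-by-n Gram matrix
c = rls(K, 0.01, y);          % train with lambda = 0.01
K_new = x_new * x';           % kernel between new points and training points
y_new = sign(K_new * c);      % predicted labels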
Filter methods: computational issues
RLS
The computational cost of RLS is the cost of inverting the matrix K: O(n^3)

    \[ c = (K + n\lambda I)^{-1} y \]

In case parameter tuning is needed, resorting to an eigendecomposition of the matrix K saves time:

    \[ K = Q \Lambda Q^T, \qquad c(\lambda) = Q \left( \Lambda + n\lambda I \right)^{-1} Q^T y \]
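A sketch of the trick (names are ours): factor K once, then each value of λ costs O(n^2) instead of a fresh O(n^3) solve:

[Q, L] = eig(K);  d = diag(L);             % K = Q * diag(d) * Q', done once
Qty = Q' * y;
for li = 1:numel(lambdas)
  % c(lambda) = Q * (Lambda + n*lambda*I)^(-1) * Q' * y
  c = Q * (Qty ./ (d + n * lambdas(li)));
  % ... evaluate the validation error for this c ...
end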
Filter methods: computational issues
ν method
The number of iterations t plays the role of the regularization parameter (more iterations mean less regularization)
Computational cost: O(tn^2)
The iterative procedure allows us to compute all solutions from 0 to t (the regularization path)
This is convenient if parameter tuning is needed: with an appropriate choice of the maximum number of iterations, the computational cost does not change
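To illustrate the regularization path idea, a sketch of the plain Landweber iteration on the kernel problem (the ν method follows the same pattern with accelerated step coefficients, omitted here; names are ours):

tau = n / norm(K);                  % step size ensuring convergence
c = zeros(n, 1);
path = zeros(n, tmax);
for t = 1:tmax
  c = c + (tau / n) * (y - K * c);  % one Landweber step
  path(:, t) = c;                   % the t-th point of the regularization path
end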
Plan of the second part
Brief intro through a set of applications
One problem in detail (face detection):
- Choosing the representation
- Feature selection
- Classification
On the choice of the classifier (filter methods)
Spotlights on other interesting issues:
- Image annotation
- Kernel engineering
- Global vs local
How difficult is image understanding?
Problem setting (general): assign one or more labels (from a finite but possibly large set of known classes) to a digital image according to its content
This general problem is very complex
Many better defined domains have been studied:
- Image categorization
- Object detection
- Object recognition
Usually the trick is in defining the boundaries of the problem of interest
Joint work(s) with A. Barla, E. Delponte, A. Verri
Object identification/recognition
Nevertheless, the problem is not that simple
Image annotation
Problem setting: assign one or more labels (from a finite set of known classes) to a digital image according to its content
Assumption: we look for global descriptions
- Indoor/outdoor
- Drawing/picture
- Day/night
- Cityscape/not
It usually leads to supervised problems (binary classifiers)
Low level descriptions are often applied
Image annotation from low-level global descriptions
The problem: capture a global description of the image using simple features
The procedure:
- Build a suitable training set of data
- Find an appropriate representation
- Choose a classification algorithm and a kernel
- Tune the parameters
Computer vision ingredients
We represent whole images with low level descriptions of color, shape or texture
Color: color histograms
Shape: orientation and strength edge histograms; histograms of the lengths of edge chains
Texture: wavelets, co-occurrence matrices
A few comments
Histograms appear quite often
We need a simple example to discuss kernel engineering: designing ad hoc kernels for the problem/data at hand, with the right properties:
- Symmetry
- Positive definiteness
=> Let us go through the histogram intersection example
Histogram Intersection (HI)
Since (Swain and Ballard, 1991) it is known that histogram intersection is a powerful similarity measure for color indexing
Given two images, A and B, of N pixels, if we represent them as histograms with M bins A_i and B_i, histogram intersection is defined as

    \[ K(A, B) = \sum_{i=1}^{M} \min\{A_i, B_i\} \]
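A minimal MATLAB sketch of the HI kernel matrix between two sets of histograms stored row-wise (function name is ours):

function K = hi_kernel(A, B)
  % A: nA-by-M histograms, B: nB-by-M histograms
  nA = size(A, 1); nB = size(B, 1);
  K = zeros(nA, nB);
  for i = 1:nA
    for j = 1:nB
      K(i, j) = sum(min(A(i,:), B(j,:)));   % sum of bin-wise minima
    end
  end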
Histogram Intersection (HI)
[Figure: two 8-bin histograms and, below, their bin-wise minimum; the intersection K(A,B) is the sum over bins of the minima]
HI is a Kernel
If we build the M·N-dimensional binary vector

    \[ \bar{A} = \Big( \underbrace{1,\dots,1}_{A_1}, \underbrace{0,\dots,0}_{N-A_1},\; \underbrace{1,\dots,1}_{A_2}, \underbrace{0,\dots,0}_{N-A_2},\; \dots,\; \underbrace{1,\dots,1}_{A_M}, \underbrace{0,\dots,0}_{N-A_M} \Big) \]

that is, each bin A_i encoded as A_i ones followed by N - A_i zeros, it can immediately be seen that

    \[ K(A, B) = \langle \bar{A}, \bar{B} \rangle \]

a dot product (linear kernel)
NOTICE: the proof is based on finding an explicit mapping
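A toy numeric check of the proof (entirely our own illustration): encoding each bin as a unary binary string turns the intersection into a dot product:

unary = @(h, N) cell2mat(arrayfun(@(c) [ones(1,c) zeros(1,N-c)], h, ...
                                  'UniformOutput', false));
A = [3 0 2]; B = [1 4 2]; N = 5;        % toy histograms, M = 3 bins
K_hi  = sum(min(A, B));                 % histogram intersection = 3
K_dot = unary(A, N) * unary(B, N)';     % dot product of the M*N-dim vectors = 3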
Histogram intersection: applications
HI has been applied with success to a variety of classification problems, both global and local:
- Indoor/outdoor, day/night, cityscape/landscape classification
- Object detection from local features (SIFT)
In all those cases it outperformed RBF-based classifiers
Also, HI does not depend on any parameter
Local approaches
Global approaches have limits:
- Often objects of interest occupy only a (small) portion of the image
- In a simplified setting all the rest of the image can be defined as background (or context)
- Depending on the application domain, context can help recognition or make it more difficult
Local approaches
We may represent the image content as a set of local features (f1, ..., fn) --- corners, DoG features, ...
We immediately see that this is a variable length description
How to deal with variable length:
- Vocabulary approach
- Local kernels (or kernels on sets)
[Figure: local features in scale-space]
Local approaches: features vocabulary
It is reminiscent of text categorization
We define a vocabulary of local features and represent our images based on how often a given feature appears in the image
One implementation of this paradigm is the bag of keypoints approach (sketched below)
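A hedged sketch of that pipeline (kmeans and pdist2 are from the Statistics Toolbox; V, allDescriptors and imgDescriptors are assumed inputs):

[dummy, vocab] = kmeans(allDescriptors, V);  % vocabulary of V visual words
D = pdist2(imgDescriptors, vocab);           % distances to each visual word
[dummy, idx] = min(D, [], 2);                % nearest word for each descriptor
h = histc(idx, 1:V);                         % word occurrence counts
h = h / sum(h);                              % normalized bag-of-keypoints vector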
Local approaches: features vocabulary
[Figure: the bag of keypoints pipeline of Csurka et al., 2004]
Local approaches: kernels on sets
Image descriptions based on local features can be seen as sets:
- Variable length
- No internal ordering

    \[ X = \{x_1, \dots, x_n\}, \qquad Y = \{y_1, \dots, y_m\} \]

A common approach to define a global similarity between feature sets is to combine the local similarities between (possibly all) pairs of elements:

    \[ K(X, Y) = ℑ\big( K_L(x_i, y_j) \big), \qquad i = 1,\dots,n, \; j = 1,\dots,m \]

for some combination rule ℑ
Summation kernel [Haussler, 1999]
The simplest kernel for sets is the summation kernel

    \[ K_S(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} K_L(x_i, y_j) \]

K_S is a kernel if K_L is a kernel
K_S is not so useful in practice:
- Computationally heavy
- It mixes good and bad correspondences
Matching kernel [Wallraven et al., 2003]
Among the many kernels for sets that have been proposed, the matching kernel received a lot of attention for image data:

    \[ K_M(X, Y) = \frac{1}{2} \left( \hat{K}(X, Y) + \hat{K}(Y, X) \right) \]

where

    \[ \hat{K}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \max_{j = 1,\dots,m} K_L(x_i, y_j) \]
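A sketch of the matching kernel (function names are ours; KL is a handle to the local kernel):

function k = matching_kernel(X, Y, KL)
  % X: n-by-d, Y: m-by-d local feature sets
  k = 0.5 * (one_way(X, Y, KL) + one_way(Y, X, KL));

function k = one_way(X, Y, KL)
  n = size(X, 1);
  best = -inf(n, 1);
  for i = 1:n
    for j = 1:size(Y, 1)
      best(i) = max(best(i), KL(X(i,:), Y(j,:)));  % best match for x_i in Y
    end
  end
  k = mean(best);                 % (1/n) sum_i max_j KL(x_i, y_j)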
Matching kernel [Wallraven et al., 2003]
The matching kernel led to promising results on object recognition problems
Nevertheless it has been shown that it is not a Mercer kernel (because of the max operator)
Intermediate matching kernel [Boughorbel et al., 2004]
Let us consider two feature sets

    \[ X = \{x_1, \dots, x_n\}, \qquad Y = \{y_1, \dots, y_m\} \]

The two feature sets are compared through an auxiliary set of virtual features

    \[ V = \{v_1, \dots, v_p\} \]

The intermediate matching kernel is defined as

    \[ K(X, Y) = \sum_{v_i \in V} K_{v_i}(X, Y), \qquad K_{v_i}(X, Y) = K_L(x^*, y^*) \]

where x* and y* are the elements of X and Y closest to v_i
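A sketch of the intermediate matching kernel (names are ours; V holds the virtual features row-wise, e.g. k-means centroids of training features):

function k = imk(X, Y, V, KL)
  k = 0;
  for i = 1:size(V, 1)
    xs = closest(X, V(i,:));      % element of X closest to v_i
    ys = closest(Y, V(i,:));      % element of Y closest to v_i
    k = k + KL(xs, ys);
  end

function z = closest(Z, v)
  d = sum((Z - repmat(v, size(Z,1), 1)).^2, 2);   % squared distances to v
  [dummy, j] = min(d);
  z = Z(j,:);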
Intermediate matching kernel: how to choose the virtual features
The intuition behind the virtual features is to find representatives of the feature points extracted in the training set
Simply, the training set features are grouped in N clusters
The authors show that the choice of N is not crucial (the bigger the better, but beware of the computational complexity)
It is better to cluster features within each class
Conclusions
Understanding the image content is difficult
Statistical learning can help a lot
Don't forget computer vision! Appropriate descriptions and similarity measures allow us to achieve good results and obtain effective solutions
That's all!
How to contact us:
Ernesto: [email protected]
Francesca: [email protected]
http://slipguru.disi.unige.it
where you will find updated versions of the slides
Selected (and very incomplete) biblio
A. Destrero, C. De Mol, F. Odone, A. Verri. A regularized approach to feature selection for face detection. DISI-TR-2007-01.
A. Mohan, C. Papageorgiou, T. Poggio. Example-based object detection in images by components. IEEE Trans. PAMI, 23(4), 2001.
F. Odone, A. Barla, A. Verri. Building kernels from binary strings for image matching. IEEE Transactions on Image Processing, 14(2):169-180, 2005.
P. Viola, M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2), 2004.
C. Wallraven, B. Caputo, A. Graf. Recognition with local features: the kernel recipe. ICCV 2003.