Image and Video Analysis - 2
Natalia Vasilyeva
[email protected] HP Labs Russia
November 23, 2012, Computer Science Center
Face Detection and Recognition
2 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
What can we do with faces in images?
Face detection
Face recognition
Face modeling
Person / head-pose tracking in video
Emotion recognition / emotion tracking
Attribute recognition (gender, age, ...)
Churchill, Roosevelt, Stalin
What are the main difficulties?
Michelangelo 1475-1564
Difficulties: varying viewpoint
Difficulties: varying illumination
slide credit: S. Ullman
Difficulties: varying scale
Today's lecture
• Face detection
● Skin detection
● The Viola/Jones Face Detector
● Large-scale unsupervised learning
• Face recognition
● EigenFaces
● Local Binary Patterns
Face detection
Is there a face in the image or not?
Face localization
Skin detection
Skin pixels have a distinctive range of colors
– Corresponds to region(s) in RGB color space

Skin classifier
– A pixel X = (R,G,B) is skin if it is in the skin (color) region
– How to find this region?
Skin detection
Learn the skin region from examples
• Manually label skin/non-skin pixels in one or more "training images"
• Plot the training data in RGB space
– skin pixels shown in orange, non-skin pixels shown in gray
– some skin pixels may be outside the region, non-skin pixels inside
Skin classifier
Given X = (R,G,B): how to determine if it is skin or not?
• Nearest neighbor
– find labeled pixel closest to X
• Find plane/curve that separates the two classes
– popular approach: Support Vector Machines (SVM)
• Data modeling
– fit a probability density/distribution model to each class
Traditional skin detection framework
1. Collecting a database of skin patches from different images. Such a database typically contains skin-colored patches from a variety of people under different illumination conditions.
2. Choosing a suitable color space.
3. Learning the parameters of a skin classifier.
Given a trained skin detector, identifying skin pixels in a given image or video frame involves:
1. Converting the image into the same color space that was used in the training phase.
2. Classifying each pixel as either skin or non-skin using the skin classifier.
3. Post-processing, typically with morphological operations, to impose spatial homogeneity on the detected regions.
Probability
• X is a random variable
• P(X) is the probability that X achieves a certain value

(figure: continuous X vs. discrete X)

This is called a PDF:
– probability distribution/density function
– a 2D PDF is a surface
– a 3D PDF is a volume
Probabilistic skin classification
Model PDF / uncertainty
• Each pixel has a probability of being skin or not skin

Skin classifier
• Given X = (R,G,B): how to determine if it is skin or not?
• Choose the interpretation of highest probability

Where do we get P(skin | R) and P(~skin | R)?
Learning conditional PDF’s
We can calculate P(R | skin) from a set of training images
• It is simply a histogram over the pixels in the training images
• each bin Ri contains the proportion of skin pixels with color Ri
Learning conditional PDF’s
We can calculate P(R | skin) from a set of training images
But this isn't quite what we want
• Why not? How to determine if a pixel is skin?
• We want P(skin | R), not P(R | skin)
• How can we get it?
Bayes rule
In terms of our problem:

P(skin | R) = P(R | skin) P(skin) / P(R)

– P(skin | R): what we want (posterior)
– P(R | skin): what we measure (likelihood)
– P(skin): domain knowledge (prior)
– P(R): normalization term
What can we use for the prior P(skin)?• Domain knowledge:
– P(skin) may be larger if we know the image contains a person– For a portrait, P(skin) may be higher for pixels in the center
• Learn the prior from the training set. How?– P(skin) is proportion of skin pixels in training set
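The histogram likelihoods, the prior from the training set, and the Bayes-rule posterior can be sketched as follows (NumPy assumed; the training data here is synthetic and purely illustrative):

```python
import numpy as np

# Hypothetical training data: R-channel values quantized to 8 bins,
# with manual skin / non-skin labels.
rng = np.random.default_rng(0)
r_skin = rng.integers(4, 8, size=1000)      # skin pixels tend to be red-heavy
r_nonskin = rng.integers(0, 8, size=9000)   # background spans all bins

# Likelihoods P(R | skin) and P(R | ~skin) as normalized histograms.
p_r_given_skin = np.bincount(r_skin, minlength=8) / len(r_skin)
p_r_given_nonskin = np.bincount(r_nonskin, minlength=8) / len(r_nonskin)

# Prior P(skin) = proportion of skin pixels in the training set.
p_skin = len(r_skin) / (len(r_skin) + len(r_nonskin))

def posterior_skin(r_bin):
    """P(skin | R) via Bayes rule; the denominator is the normalization term."""
    num = p_r_given_skin[r_bin] * p_skin
    den = num + p_r_given_nonskin[r_bin] * (1 - p_skin)
    return num / den if den > 0 else 0.0
```

A MAP decision then labels a pixel skin whenever P(skin | R) exceeds P(~skin | R), i.e. when the posterior is above 0.5.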
Bayesian estimation
• Goal is to choose the label (skin or ~skin) that maximizes the posterior ↔ minimizes the probability of misclassification
– this is called Maximum A Posteriori (MAP) estimation
(figure: likelihood P(R | skin) vs. unnormalized posterior P(R | skin) P(skin))
Density plots of Asian skin in different color spaces
Asian, African and Caucasian
Skin detection results
Face detection
• Basic idea: slide a window across the image and evaluate a face model at every location
Scan classifier over locations and scales
Challenges of face detection
• Sliding window detector must evaluate tens of thousands of location/scale combinations
• Faces are rare: 0-10 per image
– For computational efficiency, we should try to spend as little time as possible on the non-face windows
– A megapixel image has ~10^6 pixels and a comparable number of candidate face locations
– To avoid having a false positive in every image, our false positive rate has to be less than 10^-6
The Viola-Jones detector
Key ideas:
• "Fast", "simple" object features
• Integral images; convolution with an approximate Haar basis
• Adaptive boosting (AdaBoost) to select the most informative features
• A cascade of classifiers to quickly reject non-objects
The Viola/Jones Face Detector
• A seminal approach to real-time object detection
• Training is slow, but detection is very fast
• Key ideas
– "Fast", "simple" object features
o Integral images; convolution with an approximate Haar basis
– Adaptive boosting (AdaBoost) to select the most informative features
– A cascade of classifiers to quickly reject non-objects
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
Fastest known face detector for gray scale images
Image Features
"Rectangle filters"
Value = ∑ (pixels in white area) – ∑ (pixels in black area)
Similar to Haar wavelets
Example
Source
Result
Fast computation with integral images
• The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive
• This can quickly be computed in one pass through the image
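The one-pass computation can be sketched as follows (NumPy assumed; the explicit loop mirrors the row-sum recurrence, though two cumulative sums give the same result):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0..y, 0..x] inclusive, computed in one pass."""
    ii = np.zeros_like(img, dtype=np.int64)
    h, w = img.shape
    for y in range(h):
        row_sum = 0                      # cumulative sum along the current row
        for x in range(w):
            row_sum += int(img[y, x])
            # add the integral value directly above (zero for the first row)
            ii[y, x] = row_sum + (ii[y - 1, x] if y > 0 else 0)
    return ii
```

This is equivalent to `img.cumsum(axis=0).cumsum(axis=1)`.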
Computing the integral image
Computing sum within a rectangle
•Let A,B,C,D be the values of the integral image at the corners of a rectangle
•Then the sum of original image values within the rectangle can be computed as:
sum = A – B – C + D
•Only 3 additions are required for any size of rectangle!
(corner layout: D = top-left, B = top-right, C = bottom-left, A = bottom-right)
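The four-lookup rectangle sum can be sketched as follows (a sketch, using the corner convention above; bounds assume the rectangle lies inside the image):

```python
import numpy as np

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top..bottom, left..right] from the integral image ii,
    using four corner lookups: sum = A - B - C + D in the slide's notation."""
    A = ii[bottom, right]                                    # bottom-right
    B = ii[top - 1, right] if top > 0 else 0                 # above the rectangle
    C = ii[bottom, left - 1] if left > 0 else 0              # left of the rectangle
    D = ii[top - 1, left - 1] if top > 0 and left > 0 else 0 # double-subtracted corner
    return A - B - C + D
```

Each rectangle feature is then just a handful of such constant-time sums, regardless of rectangle size.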
Example
(figure: a rectangle feature evaluated on the integral image via corner coefficients -1, +1, +2, -1, -2, +1)
Feature selection
•For a 24x24 detection region, the number of possible rectangle features is ~160,000!
Feature selection
•For a 24x24 detection region, the number of possible rectangle features is ~160,000!
•At test time, it is impractical to evaluate the entire feature set
•Can we create a good classifier using just a small subset of all possible features?
•How to select such a subset?
Boosting
•Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier
•A weak learner need only do better than chance
• Training consists of multiple boosting rounds
•During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners
•“Hardness” is captured by weights attached to training examples
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.
Training procedure
• Initially, weight each training example equally
• In each boosting round:
• Find the weak learner that achieves the lowest weighted training error
• Raise the weights of training examples misclassified by current weak learner
• Compute final classifier as linear combination of all weak learners (weight of each learner is directly proportional to its accuracy)
• Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g., AdaBoost)
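The procedure above corresponds to discrete AdaBoost; a minimal sketch with one-feature decision stumps as weak learners (illustrative only, not the Viola-Jones feature set):

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """Minimal discrete AdaBoost sketch. X: (N, D) features, y in {-1, +1}."""
    N, D = X.shape
    w = np.full(N, 1.0 / N)                # start with equal example weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Weak learner = threshold on one feature, with either polarity.
        for d in range(D):
            for thr in np.unique(X[:, d]):
                for pol in (1, -1):
                    pred = np.where(pol * X[:, d] > pol * thr, 1, -1)
                    err = w[pred != y].sum()       # weighted training error
                    if best is None or err < best[0]:
                        best = (err, d, thr, pol, pred)
        err, d, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # accurate learners weigh more
        w *= np.exp(-alpha * y * pred)         # raise weights of mistakes
        w /= w.sum()
        ensemble.append((alpha, d, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Final classifier: sign of the weighted vote of all weak learners."""
    score = sum(a * np.where(p * X[:, d] > p * t, 1, -1)
                for a, d, t, p in ensemble)
    return np.sign(score)
```

The exhaustive stump search is quadratic here for clarity; the next slides show the sorting trick that makes it efficient.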
Boosting illustration
Weak Classifier 1
Boosting illustration
Weights increased
Boosting illustration
Weak Classifier 2
Boosting illustration
Weights increased
Boosting illustration
Weak Classifier 3
Boosting illustration
Final classifier is a combination of weak classifiers
Boosting vs. SVM
Advantages of boosting
• Integrates classification with feature selection
• Complexity of training is linear instead of quadratic in the number of training examples
• Flexibility in the choice of weak learners, boosting scheme
• Testing is fast
• Easy to implement

Disadvantages
• Needs many training examples
• Often doesn't work as well as SVM (especially for many-class problems)
Boosting for face detection
•Define weak learners based on rectangle features
h_t(x) = 1 if p_t f_t(x) > p_t θ_t, and 0 otherwise

where x is a 24x24 image sub-window, f_t(x) is the value of the rectangle feature, p_t ∈ {+1, -1} is the polarity, and θ_t is the threshold
Boosting for face detection
• Define weak learners based on rectangle features
• For each round of boosting:
– Evaluate each rectangle filter on each example
– Sort examples by filter values
– Select best threshold for each filter (min error)
o Use sorting to quickly scan for the optimal threshold
– Select best filter/threshold combination
– Weight is a simple function of error rate
– Reweight examples
• (There are many tricks to make this more efficient.)
• Computational complexity of learning: O(MNK)
– M rounds, N examples, K features
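The sort-and-scan threshold selection can be sketched as follows (a hypothetical helper; cumulative weight sums make the scan linear once the responses are sorted):

```python
import numpy as np

def best_stump_threshold(values, labels, weights):
    """Find the threshold and polarity minimizing weighted error for one
    rectangle filter. values: filter responses; labels in {+1, -1};
    weights sum to 1."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    w_pos = np.where(y == 1, w, 0.0)
    w_neg = np.where(y == -1, w, 0.0)
    total_pos, total_neg = w_pos.sum(), w_neg.sum()
    # Cumulative weight of positives / negatives at or below each value.
    cum_pos, cum_neg = np.cumsum(w_pos), np.cumsum(w_neg)
    # Polarity +1 (predict face when value > thr): errors are positives at
    # or below the threshold plus negatives above it.
    err_plus = cum_pos + (total_neg - cum_neg)
    # Polarity -1 (predict face when value <= thr): the complement.
    err_minus = cum_neg + (total_pos - cum_pos)
    i_plus, i_minus = err_plus.argmin(), err_minus.argmin()
    if err_plus[i_plus] <= err_minus[i_minus]:
        return v[i_plus], +1, err_plus[i_plus]
    return v[i_minus], -1, err_minus[i_minus]
```

One such O(N) scan per filter, after a single O(N log N) sort, replaces the naive O(N^2) threshold search.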
Boosting for face detection
• First two features selected by boosting:
• This feature combination can yield 100% detection rate and 50% false positive rate
Boosting for face detection
• A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084
Not good enough!
Receiver operating characteristic (ROC) curve
Slide credit: M. Everingham
Attentional cascade
•We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows
•Positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on
•A negative outcome at any point leads to the immediate rejection of the sub-window
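The early-rejection logic is simply short-circuit evaluation; a minimal sketch (the stage interface here is hypothetical: each stage is a predicate returning True for "may still be a face"):

```python
def cascade_classify(window, stages):
    """Evaluate a sub-window against a cascade of increasingly complex
    stages. Any stage that rejects ends the computation immediately."""
    for stage in stages:
        if not stage(window):
            return False          # NON-FACE: rejected early, later stages skipped
    return True                   # survived all stages: FACE
```

Because most sub-windows fail an early, cheap stage, the average cost per window stays far below the cost of the full classifier.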
(figure: cascade diagram: an image sub-window passes Classifier 1, Classifier 2, Classifier 3 in turn; T advances to the next classifier and finally to FACE, while F at any stage exits immediately to NON-FACE)
Attentional cascade
Chain classifiers that are progressively more complex and have lower false positive rates:
(figure: per-stage receiver operating characteristic: % detection vs. % false positives, with the detection / false-negative trade-off set per stage)
(figure: the same cascade diagram of Classifiers 1-3 with T/F branches to FACE / NON-FACE)
Attentional cascade
•The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages
•A detection rate of 0.9 and a false positive rate on the order of 10^-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.99^10 ≈ 0.9) and a false positive rate of about 0.30 (0.3^10 ≈ 6×10^-6)
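The slide's arithmetic is easy to check directly:

```python
# Overall cascade rates are products of the per-stage rates
# (10-stage example: per-stage detection 0.99, false positive 0.30).
stages = 10
detection = 0.99 ** stages   # overall detection rate of the cascade
false_pos = 0.30 ** stages   # overall false positive rate of the cascade
```

This multiplicative structure is why each stage can afford a seemingly poor 30% false positive rate.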
Training the cascade
• Set target detection and false positive rates for each stage
• Keep adding features to the current stage until its target rates have been met
– Need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)
– Test on a validation set
• If the overall false positive rate is not low enough, then add another stage
• Use false positives from the current stage as the negative training examples for the next stage
Slide credit: M. Everingham
Non-maximal suppression (NMS)
Many detections above threshold.
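A generic greedy NMS sketch (the now-standard IoU-based variant; Viola-Jones itself merged overlapping detections more simply):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximal suppression. boxes: [x1, y1, x2, y2] lists;
    returns indices of the detections kept, highest score first."""
    def iou(a, b):
        # intersection-over-union of two axis-aligned boxes
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a detection only if it doesn't heavily overlap a kept one
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Overlapping responses around one face collapse to the single highest-scoring window.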
Non-maximal suppression (NMS)
The implemented system
• Training data
– 5000 faces
o All frontal, rescaled to 24x24 pixels
– 300 million non-faces
o 9500 non-face images
– Faces are normalized
o Scale, translation
• Many variations
– Across individuals
– Illumination
– Pose
System performance
•Training time: “weeks” on 466 MHz Sun workstation
•38 layers, total of 6061 features
•Average of 10 features evaluated per window on test set
•“On a 700 Mhz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds”
•15 times faster than previous detector of comparable accuracy (Rowley et al., 1998)
Output of Face Detector on Test Images
Viola-Jones Face Detector: Summary
Train with 5K positives, 350M negatives
Real-time detector using 38 layer cascade
6061 features in the final layer
[Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/]
K. Grauman, B. Leibe
(figure: training and detection pipeline: Faces + Non-faces → Train cascade of classifiers with AdaBoost → Selected features, thresholds, and weights; then for a new image, apply the cascade to each subwindow)
Viola-Jones Face Detector: Summary
• Fastest known face detector for gray images
• Three contributions with broad applicability:
– Rectangle features + integral image can be used for rapid image analysis
– AdaBoost as an extremely efficient feature selector
– Cascaded classifier yields rapid classification
Other detection tasks
Facial Feature Localization
Male vs. female
Profile Detection
Profile Detection
Building features with unsupervised "deep learning": face detection
Building a feature hierarchy with a multi-layer neural network
Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng. Building high-level features using large scale unsupervised learning.
Results
Training set: 10,000,000 still frames of 200x200 pixels from YouTube videos (<3% faces in a random sample of 100,000 60x60 patches)
Test set: 37,000 images: 13,026 faces from Labeled Faces in the Wild, ~24,000 non-faces from ImageNet
Face recognition
Subtasks:
• Verification: given two face images, decide whether they show the same person or not
• Identification: find in an image database the person who appears in the test image
• "Watch list": is the person in the test photo present in a database of "persons of interest"?
Face recognition processing
Face recognition
Eigenfaces for recognition
Matthew Turk and Alex Pentland
J. Cognitive Neuroscience
1991
Linear subspaces
Classification can be expensive:
• Big search problem (e.g., nearest neighbors) or store large PDF's

Suppose the data points are arranged as above
• Idea: fit a line; the classifier measures distance to the line
• convert x into v1, v2 coordinates

What does the v1 coordinate measure?
– position along the line
– use it to specify which orange point it is

What does the v2 coordinate measure?
– distance to the line
– use it for classification: near 0 for orange pts
Dimensionality reduction
• We can represent the orange points with only their v1 coordinates (since v2 coordinates are all essentially 0)
• This makes it much cheaper to store and compare points
• A bigger deal for higher dimensional problems
Linear subspaces
Consider the variation along direction v among all of the orange points:
What unit vector v minimizes var?
What unit vector v maximizes var?
Solution: v1 is the eigenvector of A with the largest eigenvalue; v2 is the eigenvector of A with the smallest eigenvalue (here A is the covariance matrix of the data points)
Principal component analysis
Suppose each data point is N-dimensional
• The same procedure applies:
• The eigenvectors of A define a new coordinate system
– eigenvector with largest eigenvalue captures the most variation among training vectors x
– eigenvector with smallest eigenvalue has least variation
• We can compress the data using the top few eigenvectors
– corresponds to choosing a "linear subspace"
o represent points on a line, plane, or "hyper-plane"
– these eigenvectors are known as the principal components
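A minimal PCA sketch via the covariance matrix (NumPy assumed; for eigenfaces, where the dimension far exceeds the number of images, one would use the smaller Gram-matrix trick instead):

```python
import numpy as np

def pca(X, k):
    """PCA sketch. X: (n_samples, dim) data; returns the mean and the
    top-k principal components (unit eigenvectors of the covariance)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)                 # the matrix A from the slides
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return mean, eigvecs[:, order[:k]]
```

Projecting centered data onto the returned k columns gives the compressed coordinates.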
The space of faces
An image is a point in a high dimensional space
• An N x M image is a point in R^NM
• We can define vectors in this space as we did in the 2D case
Dimensionality reduction
The set of faces is a "subspace" of the set of images
• We can find the best subspace using PCA
Eigenfaces
PCA extracts the eigenvectors of A• Gives a set of vectors v1, v2, v3, ...• Each vector is a direction in face space
– what do these look like?
Projecting onto the eigenfaces
The eigenfaces v1, ..., vK span the space of faces
• A face x is converted to eigenface coordinates by projecting onto each eigenface: a_i = v_i · (x - x̄), i = 1, ..., K, where x̄ is the mean face
Recognition with eigenfaces
Algorithm
1. Process the image database (set of images with labels)
• Run PCA: compute eigenfaces
• Calculate the K coefficients for each image
2. Given a new image (to be recognized) x, calculate its K coefficients
3. Detect if x is a face
4. If it is a face, who is it?
– Find closest labeled face in database
o nearest-neighbor in K-dimensional space
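Steps 1-4 reduce to projection plus nearest neighbor; a sketch with illustrative names (NumPy assumed; the face-detection step is omitted):

```python
import numpy as np

def project(x, mean, eigenfaces):
    """Eigenface coordinates of a face vector: one coefficient per eigenface."""
    return eigenfaces.T @ (x - mean)

def recognize(x, mean, eigenfaces, gallery_coeffs, gallery_labels):
    """Nearest neighbor in K-dimensional eigenface space.
    gallery_coeffs: precomputed projections of the labeled database."""
    coeffs = project(x, mean, eigenfaces)
    dists = np.linalg.norm(gallery_coeffs - coeffs, axis=1)
    return gallery_labels[int(dists.argmin())]
```

The gallery is projected once offline, so each query costs only one K-dimensional projection and a linear scan of distances.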
Choosing the dimension K
How many eigenfaces to use?

Look at the decay of the eigenvalues λ_i, i = 1, ..., NM
• the eigenvalue tells you the amount of variance "in the direction" of that eigenface
• ignore eigenfaces with low variance
View-Based and Modular Eigenspaces for Face Recognition
Alex Pentland, Baback Moghaddam and Thad StarnerCVPR’94
Part-based eigenfeatures
Learn a separate eigenspace for each face feature

Boosts performance of regular eigenfaces
One Difficulty: Lighting

The same person with the same facial expression, and seen from the same viewpoint, can appear dramatically different when light sources illuminate the face from different directions.
Another Difficulty: Facial Expression

Changes in facial expression also create variations that PCA will capture, yet these variations may confuse recognition.
Local Binary Patterns
T. Ojala, M. Pietikäinen, and D. Harwood (1996), "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, vol. 29, pp. 51-59
LBP feature vector
The LBP feature vector, in its simplest form, is created in the following manner:
• Divide the examined window into cells (e.g. 16x16 pixels for each cell).
• For each pixel in a cell, compare the pixel to each of its 8 neighbors (on its left-top, left-middle, left-bottom, right-top, etc.). Follow the pixels along a circle, i.e. clockwise or counter-clockwise.
• Where the center pixel's value is greater than the neighbor's value, write "1". Otherwise, write "0". This gives an 8-digit binary number (which is usually converted to decimal for convenience).
• Compute the histogram, over the cell, of the frequency of each "number" occurring (i.e., each combination of which pixels are smaller and which are greater than the center).
• Optionally normalize the histogram.
• Concatenate (normalized) histograms of all cells. This gives the feature vector for the window.
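The per-cell steps above can be sketched as follows (a basic 3x3 LBP; the bit convention follows the slide, writing 1 where the center exceeds the neighbor):

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP code for each interior pixel."""
    h, w = img.shape
    # 8 neighbors visited clockwise starting from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if c > img[y + dy, x + dx]:   # slide's convention: center > neighbor -> 1
                    code |= 1 << bit
            codes[y - 1, x - 1] = code        # 8 bits -> value in 0..255
    return codes

def lbp_histogram(codes):
    """256-bin normalized histogram of LBP codes for one cell."""
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

The window's feature vector is then the concatenation of these per-cell histograms.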
Applying local binary patterns
Morphable Face Models
Rowland and Perrett ’95Lanitis, Cootes, and Taylor ’95, ’97
Blanz and Vetter ’99Matthews and Baker ’04, ‘07
Morphable Face Model
Use subspace to model elastic 2D or 3D shape variation (vertex positions), in addition to appearance variation
Shape S
Appearance T
Summary
• Face detection
• Skin detection
• The Viola/Jones Face Detector
• Integral images; many simple and fast features
• Boosting
• Attentional cascade
• Large-scale unsupervised learning
• Face recognition
• EigenFaces (PCA)
• Local Binary Patterns (describe texture)
• Morphable face models