Machine Learning for Computer Vision – Lecture 5 1 Machine Learning for Computer Visionvision.mas.ecp.fr/Personnel/iasonas/course/Kokkinos_le… · · 2012-10-22Machine Learning

Machine Learning for Computer Vision – Lecture 5

Iasonas Kokkinos

22 October, 2012 MVA – ENS Cachan

Machine Learning for Computer Vision

Galen Group INRIA-Saclay

Center for Visual Computing Ecole Centrale Paris

Lecture 5: Introduction to generative models



Lecture outline

Bayes’ rule and generative models

Density estimation

Parametric deformable models


Decision Theory

•  What is an optimal decision rule?

•  Consider loss matrix

•  Consider underlying joint distribution of data-label pairs

•  Find decision rule f that minimizes expected loss:


Optimal Classifier

•  Consider zero-one loss function:

•  Form `Expected Prediction Error’:

•  Optimal decision at any x: choose the class with the largest posterior P(y|x)

–  `Bayes-optimal classifier’
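
The decision rule above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `bayes_decision`, the convention `loss[k, j]` (cost of predicting j when the truth is k) and the posterior values are assumptions made for the example.

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Pick the prediction with the smallest expected loss.

    posterior[k] = P(y = k | x); loss[k, j] = cost of predicting j when the truth is k.
    """
    expected_loss = posterior @ loss          # one expected loss per candidate prediction j
    return int(np.argmin(expected_loss)), expected_loss

posterior = np.array([0.2, 0.5, 0.3])         # made-up posterior at some x
zero_one = 1.0 - np.eye(3)                    # zero-one loss matrix
decision, risk = bayes_decision(posterior, zero_one)
# Under zero-one loss the expected loss of predicting j is 1 - P(y = j | x),
# so the minimiser coincides with the most probable class.
```

With an asymmetric loss matrix the same function can prefer a less probable class, which is why the loss matrix is part of the problem statement.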


Bayes’ theorem

P(Y|X) = P(X|Y) P(Y) / P(X)

•  P(X|Y): Likelihood of observations X, given class Y
•  P(Y): Prior probability of class Y
•  P(Y|X): Posterior probability of class Y, given observations X

Why is this identity important?


Two approaches to pattern recognition

•  Generative (Bayesian vision, `inverse graphics’): model-based likelihood of the observations X given Y, turned into a posterior on Y

•  Discriminative (discriminative vision, function approximation): map the observations X directly to Y


Generative or discriminative?

–  Discriminative Models (lectures 1-4): Skip density estimation

•  More robust to wrong distribution assumptions (e.g. outliers)
•  V. Vapnik: `one should solve the classification problem directly and never solve a more general problem as an intermediate step’

[Figure: Training Set → Density Estimation (e.g. ML) → Class Distributions → Bayes’ Rule → Class posteriors]

–  Generative Models (Lectures 5-7): Core task: density estimation

•  If we know the distributions, requires smaller training sets
•  Dealing with missing/corrupt data
•  Explicit modelling of sources of variation (e.g. translation)
•  Conceptual clarity, ability for `visual debugging’


Lecture outline


Density estimation

Gaussian distributions

Mixture-of-Gaussian models

Hidden Variables and Expectation-Maximization algorithm

Factor Analysis & PCA


Mixed Discrete/Continuous hidden variable models


Density Estimation

•  Training set:

•  Examples corresponding to class k:

•  Training data for class k:

•  One density estimate per class: for short:


Parametric Distributions: Gaussian

–  1 D

–  N D

•  e.g. 2D:
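
For reference, the standard Gaussian densities that this slide points to can be written as follows. The symbols (mean \mu, variance \sigma^2, mean vector \boldsymbol{\mu}, covariance \boldsymbol{\Sigma}, dimension N) are the usual textbook notation and are assumed here, since the slide's own formulas did not survive extraction.

```latex
% 1-D Gaussian with mean \mu and variance \sigma^2
\mathcal{N}(x;\mu,\sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% N-D Gaussian with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}
\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{N/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}
      (\mathbf{x}-\boldsymbol{\mu})^{\top}
      \boldsymbol{\Sigma}^{-1}
      (\mathbf{x}-\boldsymbol{\mu})\right)
```

The 2D case is the second formula with N = 2 and a 2×2 covariance matrix.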


•  Covariance matrix:

•  Uncorrelated coordinates: diagonal covariance

Covariance matrix reminder

Height, Income Height, Weight


Density Estimation for a Gaussian distribution

•  Given:

•  Notation:

•  Maximum Likelihood Estimation for class k:
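
A minimal sketch of per-class maximum likelihood estimation for Gaussians, assuming the data come as an array `X` (one example per row) with integer labels `y`. The function names are hypothetical; the estimates are the standard ML ones (sample mean, and sample covariance normalised by N rather than N-1).

```python
import numpy as np

def fit_gaussian_ml(X):
    """ML estimates of a Gaussian: sample mean and covariance normalised by N."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]   # ML divides by N, not N - 1
    return mu, sigma

def fit_class_gaussians(X, y):
    """One (mean, covariance) pair per class k, estimated from the examples with y == k."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return {int(k): fit_gaussian_ml(X[y == k]) for k in np.unique(y)}
```

Each class is fitted independently, using only the training examples that carry its label.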



Classification task: Ferrari or Fiat?

•  Consider placing a personalized ad. Which car will you try?

•  New client:

•  Classification problem: new client is likely to buy Fiat/Ferrari.
•  Build class-specific probability distributions


Ferrari or Fiat, continued


Can we proceed to classification?

Need to estimate:

ML estimate:

Parameter estimation: Maximum Likelihood (ML)

Class-specific Gaussian Distributions

Bayes’ rule:


Bayes rule, 1D


Classifier form for Gaussian Distributions

•  Choose class

•  Decision boundary for the binary case:

Quadratic Decision Boundaries

•  Special case (the classes share the same covariance matrix):

Linear Decision Boundaries
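
The classifier described above (class-specific Gaussians combined through Bayes' rule) can be sketched as follows. The names `log_gaussian` and `classify`, and the dictionaries `params` and `priors`, are assumptions made for illustration.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log-density of a multivariate Gaussian at x."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)     # squared Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def classify(x, params, priors):
    """Choose the class maximising log P(y) + log P(x | y).

    params[k] = (mean, covariance) of class k; priors[k] = P(y = k).
    P(x) is the same for every class, so it can be dropped from the comparison.
    """
    scores = {k: np.log(priors[k]) + log_gaussian(x, *params[k]) for k in params}
    return max(scores, key=scores.get)
```

The score of each class is quadratic in x, hence the quadratic decision boundaries. When all classes share one covariance matrix the quadratic terms cancel in the comparison and the boundaries become linear.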


Lecture outline


Density estimation



Expectation-Maximization algorithm and hidden variables





Mixture of Gaussians model

[Figure labels: P(x) = .2, P(x) = .5, P(x) = .1]

Main challenge: parameter estimation. Which points go with which cluster?


K-Means algorithm

–  Coordinate descent on distortion cost:

–  Local minima (multiple initializations to find better solution)
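
A minimal k-means sketch, written as coordinate descent on the distortion (sum of squared distances to the assigned centre). The function signature, the random initialisation from data points and the stopping rule are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning points to centres and re-estimating centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: closest centre in squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and distortion for the returned centres
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    distortion = d2[np.arange(len(X)), labels].sum()
    return centers, labels, distortion
```

Because the result depends on the initial centres, a common remedy for local minima is to run the function with several seeds and keep the run with the lowest distortion.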










K-Means algorithm


Adaptation for Gaussian distributions


Expectation Maximization algorithm

M-step

E-step
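
The E-step and M-step for a mixture of Gaussians can be sketched as below. This is an illustrative implementation under stated assumptions, not the lecture's code: initial means are passed in by the caller (for instance from k-means), all components start from the overall data covariance, and a small ridge `reg` is added to each covariance for numerical stability. It assumes data of dimension 2 or more.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Density of N(mu, sigma) evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(sigma, diff.T).T
    maha = np.sum(diff * sol, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + maha))

def em_mog(X, init_means, n_iter=50, reg=1e-6):
    N, d = X.shape
    mus = np.array(init_means, dtype=float)
    K = len(mus)
    sigmas = np.array([np.cov(X.T, bias=True) + reg * np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iter):
        # E-step: soft assignments R[i, k] = P(component k | x_i)
        weighted = np.stack(
            [pis[k] * gaussian_pdf(X, mus[k], sigmas[k]) for k in range(K)], axis=1)
        log_liks.append(np.log(weighted.sum(axis=1)).sum())
        R = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted ML estimates of mixing weights, means and covariances
        Nk = R.sum(axis=0)
        pis = Nk / N
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pis, mus, sigmas, R, log_liks
```

Replacing the soft assignments `R` by a one-hot indicator of the closest centre, and fixing every covariance to the identity, recovers the k-means updates.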


K-means vs. EM

k-means | EM
Closest center’s index | Soft assignment, R
Isotropic distance (Euclidean) | Anisotropic likelihood (covariance-based, `Mahalanobis’)
Fast (e.g. kd-trees) | Accurate & more flexible
More robust to initialization | Prone to local minima
Coordinate descent on distortion cost | Coordinate descent on?

Typical usage: initialize EM with k-means results


Mixture of Gaussians

• 

•  Maximum Likelihood Estimation:



Hidden Variables:

•  Criterion:

–  Problem: Summation inside logarithm
–  We do not know which component generated each point
–  What if we knew?
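
To make the "summation inside logarithm" problem concrete, the maximum likelihood criterion for a mixture of K Gaussians can be written as below. The notation (mixing weights \pi_k, means \boldsymbol{\mu}_k, covariances \boldsymbol{\Sigma}_k, parameter set \theta) is the standard one and is assumed here, since the slide's formula was lost in extraction.

```latex
\log P(\mathbf{x}_1,\dots,\mathbf{x}_N;\theta)
  = \sum_{i=1}^{N} \log \sum_{k=1}^{K}
      \pi_k\, \mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K}\pi_k = 1
```

The inner sum over components sits inside the logarithm, so the parameters of different components are coupled and no closed-form maximiser exists.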


Plato’s cave

Observations: B&W Images Models: 3D surfaces

Hidden variables: positions


Hidden Variables:


•  Hidden variable
–  Indicate which component is responsible for each point
–  Multinomially distributed variable


Rewriting the MoG distribution

•  Marginalization

•  Chain rule

•  We have
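
The two steps named on this slide can be spelled out as follows. The hidden variable h (index of the responsible component) and the notation for the mixture parameters are assumptions in standard form, not a transcription of the slide.

```latex
% Marginalization over the hidden variable h, then the chain rule
P(\mathbf{x})
  = \sum_{h} P(\mathbf{x}, h)
  = \sum_{h} P(h)\, P(\mathbf{x} \mid h)

% With P(h = k) = \pi_k and P(x | h = k) Gaussian, this is the mixture density
P(\mathbf{x})
  = \sum_{k=1}^{K} \pi_k\,
    \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)
```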


Complete Log-Likelihood

•  Assume hidden variables are given
•  Data + hidden variables = complete observations
•  Complete log-likelihood

•  Summation falls outside the logarithm!
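
A sketch of the complete log-likelihood in standard notation, assuming binary indicators h_{ik} that equal 1 when component k generated point i. These symbols are assumptions, as the slide's formula is missing from the extracted text.

```latex
\log P(X, H;\theta)
  = \sum_{i=1}^{N} \sum_{k=1}^{K} h_{ik}
    \left[ \log \pi_k
         + \log \mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \right],
\qquad h_{ik}\in\{0,1\},\quad \sum_{k=1}^{K} h_{ik} = 1
```

Each point contributes the log of a single Gaussian term, so the criterion decouples across components.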


Full Observation Log Likelihood

•  Given: Hidden Variables

•  Maximize w.r.t. parameters


Expected Complete Log-Likelihood

•  We do not know the hidden variables (`missing data’)
•  Complete log-likelihood is a random quantity.
•  Form its expectation, using a distribution q(h) on hidden variables:

•  Expected complete log-likelihood




Expected Log-Likelihood

•  Given: Probability of assignment

•  Maximize w.r.t. parameters
•  M-step!










P(Grades|MVA)

•  10 students, 20 courses

–  How can we model the distribution of the grades?
–  Consider a Gaussian distribution...

•  20×21/2 = 210 parameters in a full covariance (20×19/2 = 190 of them off-diagonal), 10 measurements
–  Could we `summarize’ performance in a more compact way?

•  3 `hidden’ causes
–  Math skills, CS skills, Effort
–  Different skills per student
–  Different effects of skills on grade per course

[Figure labels: Observed grades; Influence of skills on grade; Skills per student]


Generative Model: Factor Analysis

•  Hidden variables (skills)
•  Observations

–  `factor loading’ matrix Λ (course-specific effect of skills on grade)
–  noise covariance matrix Ψ (performance on exam)

•  Linear model
•  Distribution of x (see end of slides)

•  Density estimation: recover optimal µ, Λ, Ψ, for a set of data X
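
The factor analysis model in its standard textbook form, given here as a sketch since the slide's equations did not survive extraction. The assumptions are the usual ones: a standard normal prior on the hidden variables h, and noise \boldsymbol{\epsilon} that is independent of h with diagonal covariance \boldsymbol{\Psi}.

```latex
% Generative model
\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad
\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Lambda}\mathbf{h} + \boldsymbol{\epsilon}, \qquad
\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})

% Marginal distribution of the observations
\mathbf{x} \sim \mathcal{N}\!\left(\boldsymbol{\mu},\;
  \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}\right)
```

In the grades example, a 20×3 matrix \boldsymbol{\Lambda} plus 20 diagonal noise variances replace the full 20×20 covariance.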


Continuous Hidden Variables: Factor Analysis

•  Find low-dimensional subspace (`skills’) explaining data
•  Hidden variables: coordinates on subspace

–  E-step: posterior on coordinates
–  M-step: subspace


EM for Factor Analysis

•  E-step: distribution on h(skills), conditioned on x (grades)

•  M-step: plug in distribution on h, and maximize w.r.t. parameters
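
For the E-step, the posterior on h given x is Gaussian. Under the standard factor analysis assumptions (h ~ N(0, I), x | h ~ N(µ + Λh, Ψ)), which are assumed here rather than quoted from the slide, conditioning the joint Gaussian of (h, x) gives:

```latex
\mathbb{E}[\mathbf{h}\mid\mathbf{x}]
  = \boldsymbol{\Lambda}^{\top}
    \left(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}\right)^{-1}
    (\mathbf{x} - \boldsymbol{\mu})

\mathrm{Cov}[\mathbf{h}\mid\mathbf{x}]
  = \mathbf{I}
  - \boldsymbol{\Lambda}^{\top}
    \left(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}\right)^{-1}
    \boldsymbol{\Lambda}
```

The posterior covariance does not depend on x, so it is computed once per iteration and shared by all data points.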


Principal Component Analysis (PCA)

•  Find a low-dimensional subspace to reconstruct high-dimensional data
•  Reconstruction on orthogonal basis
•  Approximation with K terms


Relation with Factor Analysis?

•  PCA criterion:

•  Regularize solution

•  Equivalently:

•  Difference from FA:

•  What we gain: no EM, factorization-based estimate of Λ, h
•  What we lose: proper probabilistic framework.


Principal component analysis

•  The k orthogonal directions that capture most of the data variance are the k leading (largest-eigenvalue) covariance eigenvectors

Factor Analysis | PCA
Λ matrix | Leading K eigenvectors of covariance
Hidden variables | Inner product of data with eigenvectors
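
A minimal PCA sketch following the description above: the basis is made of the leading covariance eigenvectors, and the hidden variables are inner products of the centred data with those eigenvectors. The function names are hypothetical.

```python
import numpy as np

def pca(X, k):
    """Return the data mean, the k leading covariance eigenvectors and their eigenvalues."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest eigenvalues
    return mu, eigvecs[:, order], eigvals[order]

def project(X, mu, U):
    """Hidden variables: inner products of the centred data with the eigenvectors."""
    return (X - mu) @ U

def reconstruct(W, mu, U):
    """Approximate the data from its k coordinates on the subspace."""
    return mu + W @ U.T
```

`np.linalg.eigh` is used because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.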


PCA: decorrelation/dimensionality reduction

•  `Hidden variables’: projection onto eigenvectors of covariance matrix

•  Dimensionality reduction by using only leading eigenvectors

Grades in 60 courses -> Good in math, computer science










Continuous Hidden Variables: Factor Analysis

•  Also known as Dimensionality Reduction


Discrete hidden variables: Mixture of Gaussians

•  Also known as Clustering


Transformation-resilient image averaging

•  Consider shift as a hidden variable, l
•  Estimate model with EM

[Figure labels: Observed Image; Deformation-free image; Shift. Panels: Input; Plain mean & std; With transformation & EM]


Transformed Components Analysis

•  Latent variables for synthesis (continuous)
•  Latent variables for shift (discrete)

•  Estimate mean basis using EM

[Figure panels: Input; Plain mean & PCA; With offset; Samples of model]


Transformed Mixture of Gaussians

•  Latent variables for cluster (discrete)
•  Latent variables for shift (discrete)

[Figure panels: Input; Plain Mixture-of-Gaussians; With offset]


Transformed Mixture of Gaussians

[Figure panels: Input; Plain Mixture-of-Gaussians; With offset]


Mixture of Transformed Components

•  Latent variables for cluster
•  Latent variables for components
•  Latent variables for shift


Lecture outline


Density estimation


Eigenfaces

Active appearance models

3D Morphable models

Statistical active shape models


Example: bone contours


Task: localize anatomical structures


Task: Analyze a hand radiograph

Assume: we are looking for proximal phalanx 2 (PP2)

[Figure: hand radiograph with labelled bones PP2–PP5, MP2–MP5, MC2–MC5]


Analyzing a hand radiograph

We have a priori knowledge about the typical appearance: e.g. bone shapes and texture


How can we represent this knowledge? How can we exploit it?


Statistical Shape Models

Each example is represented by a vector containing the coordinates of the landmarks.

Learning: Model Acquisition

Inference: Model Fitting


•  Bone shapes: vectors in

•  Goal: project data onto a low-dimensional linear subspace that best explains their variation.

The space of all bone shapes

New subspace: `better’ coordinate system

•  New coordinates reflect the distribution of the data.
•  Few coordinates suffice to represent a high-dimensional vector.
•  They can be viewed as parameters of a model.

[Figure label: Mean]


Using PCA to model shape



Active shape models (ASM)

•  A set of training examples (images)
•  A set of landmarks that are present on all images
•  Build a statistical model of shape variation (PCA)
•  Build a statistical model of the local texture (PCA)
•  Use the model for the search in a new image


ASM search

Initialize, then alternate: adjust to texture, fit to shape model




Lecture outline


Density estimation



3D Morphable models


x̂ = µ + w1u1 + w2u2 + w3u3 + w4u4 + …


Appearance modelling for faces

•  When viewed as vectors of pixel values, face images are extremely high-dimensional
–  100x100 image = 10,000 dimensions
•  Very few vectors correspond to valid face images
•  Original coordinates are not revealing about face properties
•  We want to model the subspace (`manifold’) of face images


Continuous Hidden Variables: Appearance Manifolds

[Figure: appearance manifold in image space with axes x1, x2, …, xn, parameterized by lighting × pose (Murase and Nayar 1993)]


Eigenfaces (Murase & Nayar, 91)

•  Training images x1,…,xN


Eigenfaces

Top eigenvectors: u1,…,uk

Mean: µ


Eigenfaces

Principal component (eigenvector) uk

µ + 3σkuk

µ – 3σkuk


Eigenfaces example

•  Face x in “face space” coordinates:

•  Reconstruction:

x̂ = µ + w1u1 + w2u2 + w3u3 + w4u4 + …
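
The face-space coordinates and the reconstruction can be sketched as below. This is an illustrative version under stated assumptions: images are flattened into the rows of an array, and the eigenvectors are obtained from a thin SVD of the centred data, a common choice when the number of pixels far exceeds the number of images. The function names are hypothetical.

```python
import numpy as np

def fit_eigenfaces(images, k):
    """images: (N, d) array, one flattened image per row. Returns the mean and a (d, k) basis."""
    mu = images.mean(axis=0)
    # Rows of Vt are unit-norm eigenvectors of the covariance of the centred data;
    # the thin SVD avoids forming the d x d covariance matrix when d >> N.
    _, _, Vt = np.linalg.svd(images - mu, full_matrices=False)
    return mu, Vt[:k].T

def encode(x, mu, U):
    """Face-space coordinates w_j = u_j^T (x - mu)."""
    return U.T @ (x - mu)

def decode(w, mu, U):
    """Reconstruction mu + w_1 u_1 + ... + w_k u_k."""
    return mu + U @ w
```

Using more eigenvectors can only lower the reconstruction error of a training image, because the bases are nested and orthonormal.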


Limitations

•  Global appearance method: not robust to misalignment, background variation


Lecture outline


Density estimation



3D Morphable models



Active Appearance Models (AAMs)

Shape:

Appearance:

Synthesis:

I(S(x)) = T(x)

[Figure: template point x mapped to instance point S(x)]


Playing with the AAM parameters

First two modes of shape variation

First two modes of gray-level variation

First four modes of appearance variation


Active Appearance Model Search (Results)


AAM Search


Lecture outline


Density estimation



3D Morphable models



3-D surface acquisition

•  Laser Range Scanners
•  Stereo Cameras
•  Structured Light (Kinect)
•  Photometric Stereo


What can we do with 3d shape models?

[Blanz and Vetter 1999, 2003]


Building a Morphable Face Model



3-D Morphable Models



3D Morphable models

Recover Shape

Synthesize new views

Synthesize new expressions


3-D Morphable Model fitting

•  Rough manual initialization
•  Gradient descent to minimize reconstruction error functional

•  And then


3D AAM for face tracking

CMU group: I. Matthews, S. Baker, R. Gross (230 Frames per second, 2004)




Playing with Facial Attributes

Several classes of attributes are modeled:
•  Facial expressions (smile, frown)
•  Individual characteristics (double chin, hooked nose, ‘maleness’)
•  Distinctiveness


Manipulating Facial Attributes via Deformations

•  For each face in the database, two scans are recorded: S_neutral and S_expression.
•  The difference vector ΔS = S_expression - S_neutral is saved and later on simply added to the 3D reconstruction of the input image.





APPENDIX



Factor Analysis: Generative Model

•  Hidden variables
•  Observations
–  noise covariance matrix
•  Linear model

•  Distribution of x


Full observation distribution

•  Consider covariance of x, h:

•  Full observations

•  Distribution

•  We will need to write
•  Problem: non-diagonal matrix


Block matrix diagonalization

•  Schur Complement


Factorizing a Gaussian distribution


PCA criterion

•  Minimize reconstruction error of training set


Spectral Decomposition of a matrix


Principal Component Analysis

•  Given: N data points x1, …, xN in R^d

•  We want to find a new set of features that are linear combinations of original ones:

u(xi) = u^T (xi – µ)

(µ: mean of data points)

•  What unit vector u in R^d captures the most variance of the data?


Principal Component Analysis

•  Variance of projection on u:
–  Projection of data point
–  Covariance matrix of data
–  Direction: unit-norm vector

•  The direction that maximizes the variance: the eigenvector associated with the largest eigenvalue of Σ
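
The argument behind this slide can be sketched in three lines. The notation follows the surrounding text (data points \mathbf{x}_i, mean \boldsymbol{\mu}, covariance \boldsymbol{\Sigma}, unit vector \mathbf{u}); the Lagrange multiplier \lambda is introduced here for the derivation and is an assumption about how the slide proceeds.

```latex
% Variance of the projection onto a unit-norm direction u
\mathrm{var}(\mathbf{u})
  = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{u}^{\top}(\mathbf{x}_i-\boldsymbol{\mu})\right)^{2}
  = \mathbf{u}^{\top}\boldsymbol{\Sigma}\,\mathbf{u},
\qquad
\boldsymbol{\Sigma}
  = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^{\top}

% Maximize subject to u^T u = 1 using a Lagrange multiplier
L(\mathbf{u},\lambda)
  = \mathbf{u}^{\top}\boldsymbol{\Sigma}\,\mathbf{u}
  - \lambda\left(\mathbf{u}^{\top}\mathbf{u}-1\right),
\qquad
\frac{\partial L}{\partial \mathbf{u}} = \mathbf{0}
\;\Rightarrow\;
\boldsymbol{\Sigma}\,\mathbf{u} = \lambda\,\mathbf{u}

% At a stationary point the variance equals the eigenvalue
\mathrm{var}(\mathbf{u}) = \mathbf{u}^{\top}\boldsymbol{\Sigma}\,\mathbf{u} = \lambda
```

Every stationary point is an eigenvector of \boldsymbol{\Sigma} whose projected variance equals its eigenvalue, so the maximum is reached at the eigenvector with the largest eigenvalue.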
