Algorithms booklet

December 10, 2012
Copyright © 2012 by Simon Prince. The latest version of this document can be downloaded from http://www.computervisionmodels.com.
Algorithms booklet
This document accompanies the book "Computer vision: models, learning, and inference" by Simon J.D. Prince. It contains concise descriptions of almost all of the models and algorithms in the book. The goal is to provide sufficient information to implement a naive version of each method. This information was published separately from the main book because (i) it would have impeded the clarity of the main text and (ii) on-line publishing means that I can update the text periodically and eliminate any mistakes.
In the main, this document uses the same notation as the main book (see Appendix A for a summary). In addition, we also use the following conventions:
• When two matrices are concatenated horizontally, we write C = [A,B].
• When two matrices are concatenated vertically, we write C = [A; B].
• The function argmin_x f[x] returns the value of the argument x that minimizes f[x]. If x is discrete then this should be done by exhaustive search. If x is continuous, then it should be done by gradient descent, and I usually supply the gradient and Hessian of the function to help with this.

• The function δ[x] for discrete x returns 1 when the argument x is 0 and returns 0 otherwise.

• The function diag[A] returns a column vector containing the elements on the diagonal of matrix A.
• The function zeros[I, J ] creates an I × J matrix that is full of zeros.
As a final note, I should point out that this document has not yet been checked very carefully. I'm looking for volunteers to help me with this. There are two main ways you can help. First, please mail me at [email protected] if you manage to successfully implement one of these methods. That way I can be sure that the description is sufficient. Secondly, please also mail me if you have problems getting any of these methods to work. It's possible that I can help, and it will help me to identify ambiguities and errors in the descriptions.
Simon Prince
List of Algorithms
4.1 Maximum likelihood learning for normal distribution
4.2 MAP learning for normal distribution with conjugate prior
4.3 Bayesian approach to normal distribution
4.4 Maximum likelihood learning for categorical distribution
4.5 MAP learning for categorical distribution with conjugate prior
4.6 Bayesian approach to categorical distribution
6.1 Basic generative classifier
7.1 Maximum likelihood learning for mixtures of Gaussians
7.2 Maximum likelihood learning for t-distribution
7.3 Maximum likelihood learning for factor analyzer
8.1 Maximum likelihood learning for linear regression
8.2 Bayesian formulation of linear regression
8.3 Gaussian process regression
8.4 Sparse linear regression
8.5 Dual formulation of linear regression
8.6 Dual Gaussian process regression
8.7 Relevance vector regression
9.1 Cost and derivatives for MAP logistic regression
9.2 Bayesian logistic regression
9.3 Cost and derivatives for MAP dual logistic regression
9.4 Dual Bayesian logistic regression
9.5 Relevance vector classification
9.6 Incremental logistic regression
9.7 Logitboost
9.8 Cost function, derivative and Hessian for multi-class logistic regression
9.9 Multiclass classification tree
10.1 Gibbs' sampling from undirected model
10.2 Contrastive divergence learning of undirected model
11.1 Dynamic programming in chain
11.2 Dynamic programming in tree
11.3 Forward backward algorithm
11.4 Sum product: distribute
11.4b Sum product: collate and compute marginal distributions
12.1 Binary graph cuts
12.2 Reparameterization for binary graph cut
12.3 Multilabel graph cuts
12.4 Alpha expansion algorithm (main loop)
12.4b Alpha expansion (expand)
13.1 Principal components analysis (dual)
13.2 K-means algorithm
14.1 ML learning of extrinsic parameters
14.2 ML learning of intrinsic parameters
14.3 Inferring 3D world position
15.1 Maximum likelihood learning of Euclidean transformation
15.2 Maximum likelihood learning of similarity transformation
15.3 Maximum likelihood learning of affine transformation
15.4 Maximum likelihood learning of projective transformation
15.5 Maximum likelihood inference for transformation models
15.6 ML learning of extrinsic parameters (planar scene)
15.7 ML learning of intrinsic parameters (planar scene)
15.8 Robust ML learning of homography
15.9 Robust sequential learning of homographies
15.10 PEaRL learning of homographies
16.1 Extracting relative camera position from point matches
16.2 Eight point algorithm for fundamental matrix
16.3 Robust ML fitting of fundamental matrix
16.4 Planar rectification
17.1 Generalized Procrustes analysis
17.2 ML learning of PPCA model
18.1 Maximum likelihood learning for identity subspace model
18.2 Maximum likelihood learning for PLDA model
18.3 Maximum likelihood learning for asymmetric bilinear model
18.4 Style translation with asymmetric bilinear model
19.1 The Kalman filter
19.2 Fixed interval Kalman smoother
19.3 The extended Kalman filter
19.4 The iterated extended Kalman filter
19.5 The unscented Kalman filter
19.6 The condensation algorithm
20.1 Learn bag of words model
20.2 Learn latent Dirichlet allocation model
20.2b MCMC Sampling for LDA
Fitting probability distributions
Algorithm 4.1: Maximum likelihood learning of normal distribution
The univariate normal distribution is a probability density model suitable for describing continuous data x in one dimension. It has pdf

Pr(x) = (1/√(2πσ²)) exp[−0.5(x − µ)²/σ²],
where the parameter µ denotes the mean and σ² denotes the variance.
Algorithm 4.1: Maximum likelihood learning for normal distribution
Input : Training data {x_i}, i = 1…I
Output: Maximum likelihood estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = ∑_{i=1}^I x_i / I
  // Set variance
  σ² = ∑_{i=1}^I (x_i − µ)² / I
end
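The two updates above translate directly into code. Here is a minimal sketch in Python (the function name and plain-list interface are my own choices, not from the book):

```python
def fit_normal_ml(x):
    """Algorithm 4.1: ML estimates (mu, sigma^2) of a univariate normal."""
    I = len(x)
    mu = sum(x) / I                               # set mean parameter
    var = sum((xi - mu) ** 2 for xi in x) / I     # set variance (ML, biased)
    return mu, var
```

For example, `fit_normal_ml([1.0, 2.0, 3.0, 4.0])` gives a mean of 2.5 and a variance of 1.25.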
Algorithm 4.2: MAP learning of univariate normal parameters
The conjugate prior to the normal distribution is the normal-scaled inverse gamma, which has pdf
Pr(µ, σ²) = (√γ / (σ√(2π))) · (β^α / Γ(α)) · (1/σ²)^(α+1) · exp[−(2β + γ(δ − µ)²)/(2σ²)],

with hyperparameters α, β, γ > 0 and δ ∈ (−∞, ∞).
Algorithm 4.2: MAP learning for normal distribution with conjugate prior
Input : Training data {x_i}, i = 1…I; hyperparameters α, β, γ, δ
Output: MAP estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = (∑_{i=1}^I x_i + γδ) / (I + γ)
  // Set variance
  σ² = (∑_{i=1}^I (x_i − µ)² + 2β + γ(δ − µ)²) / (I + 3 + 2α)
end
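The MAP updates can be sketched as follows (a minimal Python version; names are illustrative):

```python
def fit_normal_map(x, alpha, beta, gamma, delta):
    """Algorithm 4.2: MAP estimates of a univariate normal under a
    normal-scaled inverse gamma prior (hyperparameters alpha, beta, gamma, delta)."""
    I = len(x)
    mu = (sum(x) + gamma * delta) / (I + gamma)
    var = (sum((xi - mu) ** 2 for xi in x)
           + 2 * beta + gamma * (delta - mu) ** 2) / (I + 3 + 2 * alpha)
    return mu, var
```

Note that as gamma, alpha and beta shrink toward zero the estimates approach the ML solution of Algorithm 4.1.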
Copyright c©2012 by Simon Prince. This latest version of this document can be downloaded fromhttp://www.computervisionmodels.com.
8 Fitting probability distributions
Algorithm 4.3: Bayesian approach to univariate normal distribution
In the Bayesian approach to fitting the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a normal inverse gamma distribution over the mean and variance parameters. The predictive distribution for a new datum is computed by integrating the predictions for a given set of parameters weighted by the probability of those parameters being present.
Algorithm 4.3: Bayesian approach to normal distribution
Input : Training data {x_i}, i = 1…I; hyperparameters α, β, γ, δ; test data x*
Output: Posterior parameters α′, β′, γ′, δ′; predictive distribution Pr(x*|x_{1…I})
begin
  // Compute normal inverse gamma posterior over normal parameters
  α′ = α + I/2
  β′ = ∑_i x_i²/2 + β + γδ²/2 − (γδ + ∑_i x_i)²/(2γ + 2I)
  γ′ = γ + I
  δ′ = (γδ + ∑_i x_i)/(γ + I)
  // Compute intermediate parameters
  α″ = α′ + 1/2
  β″ = x*²/2 + β′ + γ′δ′²/2 − (γ′δ′ + x*)²/(2γ′ + 2)
  γ″ = γ′ + 1
  // Evaluate new datapoint under predictive distribution
  Pr(x*|x_{1…I}) = (√γ′ · β′^{α′} · Γ[α″]) / (√(2π) · √γ″ · β″^{α″} · Γ[α′])
end
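Both stages can be sketched in Python; the primed posterior parameters become `alpha_p` etc., and the intermediate parameters are local variables (a minimal sketch, not an optimized implementation):

```python
import math

def posterior_nig(x, alpha, beta, gamma, delta):
    """First half of Algorithm 4.3: posterior normal inverse gamma parameters."""
    I, sx = len(x), sum(x)
    alpha_p = alpha + I / 2
    beta_p = (sum(xi ** 2 for xi in x) / 2 + beta + gamma * delta ** 2 / 2
              - (gamma * delta + sx) ** 2 / (2 * gamma + 2 * I))
    gamma_p = gamma + I
    delta_p = (gamma * delta + sx) / (gamma + I)
    return alpha_p, beta_p, gamma_p, delta_p

def predictive_density(x_star, alpha_p, beta_p, gamma_p, delta_p):
    """Second half: density of x_star under the predictive distribution."""
    # intermediate parameters obtained by absorbing x* into the posterior
    a = alpha_p + 0.5
    b = (x_star ** 2 / 2 + beta_p + gamma_p * delta_p ** 2 / 2
         - (gamma_p * delta_p + x_star) ** 2 / (2 * gamma_p + 2))
    g = gamma_p + 1
    return (math.sqrt(gamma_p) * beta_p ** alpha_p * math.gamma(a)
            / (math.sqrt(2 * math.pi) * math.sqrt(g) * b ** a
               * math.gamma(alpha_p)))
```

The predictive is a scaled and shifted t-distribution centred on the posterior mean parameter, so it is symmetric about that point and heavier-tailed than a normal.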
Algorithm 4.4: ML learning of categorical parameters
The categorical distribution is a probability density model suitable for describing discrete multivalued data x ∈ {1, 2, …, K}. It has pdf

Pr(x = k) = λ_k,

where the parameter λ_k denotes the probability of observing category k.
Algorithm 4.4: Maximum likelihood learning for categorical distribution
Input : Multi-valued training data {x_i}, i = 1…I
Output: ML estimates of categorical parameters θ = {λ_1 … λ_K}
begin
  for k = 1 to K do
    λ_k = ∑_{i=1}^I δ[x_i − k] / I
  end
end
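In code this is just category counting (a minimal Python sketch with illustrative names):

```python
def fit_categorical_ml(x, K):
    """Algorithm 4.4: lambda_k is the fraction of observations equal to k."""
    I = len(x)
    return [sum(1 for xi in x if xi == k) / I for k in range(1, K + 1)]
```

The returned parameters are non-negative and sum to one by construction.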
Copyright c©2012 by Simon Prince. This latest version of this document can be downloaded fromhttp://www.computervisionmodels.com.
Fitting probability distributions 9
Algorithm 4.5: MAP learning of categorical parameters
For MAP learning of the categorical parameters, we need to define a prior and to this end, we choose the Dirichlet distribution:

Pr(λ_1 … λ_K) = (Γ[∑_{k=1}^K α_k] / ∏_{k=1}^K Γ[α_k]) · ∏_{k=1}^K λ_k^{α_k − 1},

where Γ[•] is the Gamma function and {α_k}, k = 1…K are hyperparameters.
Algorithm 4.5: MAP learning for categorical distribution with conjugate prior
Input : Categorical training data {x_i}, i = 1…I; hyperparameters {α_k}, k = 1…K
Output: MAP estimates of parameters θ = {λ_k}, k = 1…K
begin
  for k = 1 to K do
    N_k = ∑_{i=1}^I δ[x_i − k]
    λ_k = (N_k − 1 + α_k) / (I − K + ∑_{m=1}^K α_m)
  end
end
Algorithm 4.6: Bayesian approach to categorical distribution
In the Bayesian approach to fitting the categorical distribution we again use a Dirichlet prior. In the learning stage we compute a probability distribution over the K categorical parameters, which is also a Dirichlet distribution. The predictive distribution for a new datum is based on a weighted sum of the predictions for all possible parameter values, where the weights are based on the Dirichlet distribution computed in the learning stage.
Algorithm 4.6: Bayesian approach to categorical distribution
Input : Categorical training data {x_i}, i = 1…I; hyperparameters {α_k}, k = 1…K
Output: Posterior parameters {α_k}, k = 1…K; predictive distribution Pr(x*|x_{1…I})
begin
  // Compute categorical posterior over λ
  for k = 1 to K do
    α_k = α_k + ∑_{i=1}^I δ[x_i − k]
  end
  // Evaluate new datapoint under predictive distribution
  for k = 1 to K do
    Pr(x* = k|x_{1…I}) = α_k / (∑_{m=1}^K α_m)
  end
end
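Both the posterior update and the predictive distribution fit in a few lines (an illustrative Python sketch):

```python
def bayes_categorical(x, alphas):
    """Algorithm 4.6: posterior Dirichlet parameters and the predictive
    distribution Pr(x* = k | x_1...I)."""
    post = [a + sum(1 for xi in x if xi == k + 1) for k, a in enumerate(alphas)]
    total = sum(post)
    return post, [a / total for a in post]
```

Unlike the MAP estimate, the predictive never assigns zero probability to a category that was unobserved in training, provided the hyperparameters are positive.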
Learning and inference in vision
Algorithm 6.1: Basic generative classifier
Consider the situation where we wish to assign a label w ∈ {1, 2, …, K} based on an observed multivariate measurement vector x. We model the class conditional density functions as normal distributions so that
Pr(x_i|w_i = k) = Norm_{x_i}[µ_k, Σ_k],

with prior probabilities over the world state defined by

Pr(w_i) = Cat_{w_i}[λ].
In the learning phase, we fit the parameters µ_k and Σ_k of the kth class conditional density function Pr(x_i|w_i = k) from just the subset of data S_k = {x_i : w_i = k} where the kth state was observed. We learn the prior parameter λ from the training world states {w_i}, i = 1…I. Here we have used the maximum likelihood approach in both cases.

The inference algorithm takes a new datum x* and returns the posterior Pr(w*|x*, θ) over the world state w* using Bayes' rule,

Pr(w*|x*) = Pr(x*|w*)Pr(w*) / (∑_{w*=1}^K Pr(x*|w*)Pr(w*)).
Algorithm 6.1: Basic Generative classifier
Input : Training data {x_i, w_i}, i = 1…I; new data example x*
Output: ML parameters θ = {λ_{1…K}, µ_{1…K}, Σ_{1…K}}; posterior probability Pr(w*|x*)
begin
  // For each training class
  for k = 1 to K do
    // Set mean
    µ_k = (∑_{i=1}^I x_i δ[w_i − k]) / (∑_{i=1}^I δ[w_i − k])
    // Set covariance
    Σ_k = (∑_{i=1}^I (x_i − µ_k)(x_i − µ_k)^T δ[w_i − k]) / (∑_{i=1}^I δ[w_i − k])
    // Set prior
    λ_k = ∑_{i=1}^I δ[w_i − k] / I
  end
  // Compute likelihoods for each class for the new datapoint
  for k = 1 to K do
    l_k = Norm_{x*}[µ_k, Σ_k]
  end
  // Classify new datapoint using Bayes' rule
  for k = 1 to K do
    Pr(w* = k|x*) = l_k λ_k / (∑_{m=1}^K l_m λ_m)
  end
end
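For concreteness, here is the classifier restricted to one-dimensional data, where each class-conditional covariance Σ_k reduces to a scalar variance (a minimal Python sketch; names are illustrative):

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def fit_generative(xs, ws, K):
    """Learning phase of Algorithm 6.1: per-class mean, variance and prior."""
    I = len(xs)
    params = []
    for k in range(1, K + 1):
        sk = [x for x, w in zip(xs, ws) if w == k]   # subset with w_i = k
        mu = sum(sk) / len(sk)
        var = sum((x - mu) ** 2 for x in sk) / len(sk)
        params.append((mu, var, len(sk) / I))
    return params

def classify(x_star, params):
    """Inference phase: posterior over classes via Bayes' rule."""
    joint = [lam * norm_pdf(x_star, mu, var) for mu, var, lam in params]
    z = sum(joint)
    return [j / z for j in joint]
```

With well-separated classes the posterior is close to one-hot for data near either class mean.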
Modelling complex densities
Algorithm 7.1: Fitting mixture of Gaussians
The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data is described as a weighted sum of K normal distributions

Pr(x|θ) = ∑_{k=1}^K λ_k Norm_x[µ_k, Σ_k],

where µ_{1…K} and Σ_{1…K} are the means and covariances of the normal distributions and λ_{1…K} are positive valued weights that sum to one.

The MoG is fit using the EM algorithm. In the E-step, we compute the posterior distribution over a hidden variable h_i for each observed data point x_i. In the M-step, we iterate through the K components, updating the mean µ_k and covariance Σ_k for each and also updating the weights {λ_k}, k = 1…K.
Algorithm 7.1: Maximum likelihood learning for mixtures of Gaussians
Input : Training data {x_i}, i = 1…I; number of clusters K
Output: ML estimates of parameters θ = {λ_{1…K}, µ_{1…K}, Σ_{1…K}}
begin
  Initialize θ = θ⁰ (a)
  repeat
    // Expectation step
    for i = 1 to I do
      for k = 1 to K do
        l_ik = λ_k Norm_{x_i}[µ_k, Σ_k]   // numerator of Bayes' rule
      end
      // Compute posterior (responsibilities) by normalizing
      for k = 1 to K do
        r_ik = l_ik / (∑_{m=1}^K l_im)
      end
    end
    // Maximization step (b)
    for k = 1 to K do
      λ_k^[t+1] = (∑_{i=1}^I r_ik) / (∑_{m=1}^K ∑_{i=1}^I r_im)
      µ_k^[t+1] = (∑_{i=1}^I r_ik x_i) / (∑_{i=1}^I r_ik)
      Σ_k^[t+1] = (∑_{i=1}^I r_ik (x_i − µ_k^[t+1])(x_i − µ_k^[t+1])^T) / (∑_{i=1}^I r_ik)
    end
    // Compute data log likelihood and EM bound
    L = ∑_{i=1}^I log[∑_{k=1}^K λ_k Norm_{x_i}[µ_k, Σ_k]]
    B = ∑_{i=1}^I ∑_{k=1}^K r_ik log[λ_k Norm_{x_i}[µ_k, Σ_k] / r_ik]
  until no further improvement in L
end

(a) One possibility is to set the weights λ• = 1/K, the means µ• to the values of K randomly chosen datapoints, and the covariances Σ• to the covariance of the whole dataset.
(b) For a diagonal covariance, retain only the diagonal of the Σ_k update.
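The EM loop above can be sketched for univariate data, where each Σ_k is a scalar variance. In this sketch the caller supplies the initial means (the footnote's random-datapoint initialization would work equally well); all names are illustrative:

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def fit_mog_1d(xs, init_means, n_iter=100):
    """EM for a univariate mixture of Gaussians (Algorithm 7.1 with D = 1)."""
    K, I = len(init_means), len(xs)
    lam = [1.0 / K] * K
    mu = list(init_means)
    m = sum(xs) / I
    var = [sum((x - m) ** 2 for x in xs) / I] * K    # whole-data variance
    for _ in range(n_iter):
        # E-step: responsibilities r[i][k] by normalizing lam_k * Norm(x_i)
        r = []
        for x in xs:
            l = [lam[k] * norm_pdf(x, mu[k], var[k]) for k in range(K)]
            s = sum(l)
            r.append([lk / s for lk in l])
        # M-step: closed-form updates of weights, means and variances
        for k in range(K):
            rk = sum(ri[k] for ri in r)
            lam[k] = rk / I
            mu[k] = sum(ri[k] * x for ri, x in zip(r, xs)) / rk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / rk
    return lam, mu, var
```

A fixed iteration count stands in for the log-likelihood convergence test of the pseudocode, purely to keep the sketch short.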
Algorithm 7.2: Fitting the t-distribution
The t-distribution is a robust (long-tailed) distribution with pdf

Pr(x) = (Γ[(ν + D)/2] / ((νπ)^{D/2} |Σ|^{1/2} Γ[ν/2])) · (1 + (x − µ)^T Σ^{−1} (x − µ)/ν)^{−(ν+D)/2},

where µ is the mean of the distribution, Σ is a matrix that controls the spread, ν is the degrees of freedom, and D is the dimensionality of the input data.

We use the EM algorithm to fit the parameters θ = {µ, Σ, ν}. In the E-step, we compute the gamma-distributed posterior over the hidden variable h_i for each observed data point x_i. In the M-step we update the parameters µ and Σ in closed form, but must perform an explicit line search to update ν using the criterion:

tCost[ν, {E[h_i], E[log h_i]}, i = 1…I] = ∑_{i=1}^I (−(ν/2) log[ν/2] + log[Γ[ν/2]] − (ν/2 − 1) E[log h_i] + (ν/2) E[h_i]).
Algorithm 7.2: Maximum likelihood learning for t-distribution
Input : Training data {x_i}, i = 1…I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ, ν}
begin
  Initialize θ = θ⁰ (a)
  repeat
    // Expectation step
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^{−1} (x_i − µ)
      E[h_i] = (ν + D) / (ν + δ_i)
      E[log h_i] = Ψ[ν/2 + D/2] − log[ν/2 + δ_i/2]
    end
    // Maximization step
    µ = (∑_{i=1}^I E[h_i] x_i) / (∑_{i=1}^I E[h_i])
    Σ = (∑_{i=1}^I E[h_i] (x_i − µ)(x_i − µ)^T) / (∑_{i=1}^I E[h_i])
    ν = argmin_ν [tCost[ν, {E[h_i], E[log h_i]}, i = 1…I]]
    // Compute data log likelihood
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^{−1} (x_i − µ)
    end
    L = I log[Γ[(ν + D)/2]] − ID log[νπ]/2 − I log[|Σ|]/2 − I log[Γ[ν/2]]
    L = L − (ν + D) ∑_{i=1}^I log[1 + δ_i/ν]/2
  until no further improvement in L
end

(a) One possibility is to initialize the parameters µ and Σ to the mean and covariance of the data and to set the initial degrees of freedom to a large value, say ν = 1000. Here Ψ[•] denotes the digamma function.
Algorithm 7.3: Fitting a factor analyzer
The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

Pr(x_i|θ) = Norm_{x_i}[µ, ΦΦ^T + Σ],

where µ is a D × 1 mean vector, Φ is a D × K matrix containing the K factors {φ_k}, k = 1…K in its columns and Σ is a diagonal matrix of size D × D.

The factor analyzer is fit using the EM algorithm. In the E-step, we compute the posterior distribution over the hidden variable h_i for each data example x_i and extract the expectations E[h_i] and E[h_i h_i^T]. In the M-step, we use these distributions in closed-form updates for the basis function matrix Φ and the diagonal noise term Σ.
Algorithm 7.3: Maximum likelihood learning for factor analyzer
Input : Training data {x_i}, i = 1…I; number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
  Initialize θ = θ⁰ (a)
  // Set mean
  µ = ∑_{i=1}^I x_i / I
  repeat
    // Expectation step
    for i = 1 to I do
      E[h_i] = (Φ^T Σ^{−1} Φ + I)^{−1} Φ^T Σ^{−1} (x_i − µ)
      E[h_i h_i^T] = (Φ^T Σ^{−1} Φ + I)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    Φ = (∑_{i=1}^I (x_i − µ) E[h_i]^T) (∑_{i=1}^I E[h_i h_i^T])^{−1}
    Σ = diag[∑_{i=1}^I ((x_i − µ)(x_i − µ)^T − Φ E[h_i](x_i − µ)^T)] / I
    // Compute data log likelihood (b)
    L = ∑_{i=1}^I log[Norm_{x_i}[µ, ΦΦ^T + Σ]]
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating the covariance of this distribution using the Sherman-Morrison-Woodbury relation (matrix inversion lemma).
Models for regression
Algorithm 8.1: ML fitting of linear regression model
The linear regression model describes the world w as a normal distribution. The mean of this distribution is a linear function φ_0 + φ^T x and the variance is constant. In practice we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and attach the y-intercept φ_0 to the start of the gradient vector φ ← [φ_0 φ^T]^T and write

Pr(w_i|x_i, θ) = Norm_{w_i}[φ^T x_i, σ²].

In the learning algorithm, we work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states.
Algorithm 8.1: Maximum likelihood learning for linear regression
Input : (D+1)×I data matrix X; I×1 world vector w
Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
begin
  // Set gradient parameter
  φ = (XX^T)^{−1} X w
  // Set variance parameter
  σ² = (w − X^T φ)^T (w − X^T φ) / I
end
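For a single input dimension the normal equations are 2×2 and can be solved by hand, which makes a compact sketch (names and the list-based interface are illustrative):

```python
def fit_linreg_ml(xs, ws):
    """Algorithm 8.1 for one input dimension: each x_i is augmented to [1, x_i],
    so phi = (X X^T)^{-1} X w reduces to solving 2x2 normal equations."""
    I = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sw, sxw = sum(ws), sum(x * w for x, w in zip(xs, ws))
    det = I * sxx - sx * sx
    phi0 = (sxx * sw - sx * sxw) / det       # intercept
    phi1 = (I * sxw - sx * sw) / det         # slope
    var = sum((w - phi0 - phi1 * x) ** 2 for x, w in zip(xs, ws)) / I
    return phi0, phi1, var
```

On noise-free data generated from a line the recovered parameters match the generating line and the fitted variance is essentially zero.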
Algorithm 8.2: Bayesian linear regression
In Bayesian linear regression we define a normal prior over the parameters φ

Pr(φ) = Norm_φ[0, σ_p² I],

which contains one hyperparameter σ_p² which determines the prior variance. We compute a distribution over possible parameters φ and use this to evaluate the mean µ_{w*|x*} and variance σ²_{w*|x*} of the predictive distribution for new data x*.

As in the previous algorithm, we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and then work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states.

The choice of approach depends on whether the number of data examples I is greater or less than the dimensionality D of the data. Depending on which situation we are in, we invert either the (D+1) × (D+1) matrix XX^T or the I × I matrix X^T X.
Algorithm 8.2: Bayesian formulation of linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // If dimension D is less than the number of data examples I
  if D < I then
    // Fit variance parameter σ² with line search
    σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X + σ² I]]] (a)
    // Compute inverse variance of posterior distribution over φ
    A^{−1} = (XX^T/σ² + I/σ_p²)^{−1}
  else
    // Fit variance parameter σ² with line search
    σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X + σ² I]]]
    // Compute inverse variance of posterior distribution over φ
    A^{−1} = σ_p² I − σ_p² X (X^T X + (σ²/σ_p²) I)^{−1} X^T
  end
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T A^{−1} X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T A^{−1} x* + σ²
end

(a) To compute this cost function when the dimension D < I, we need to compute both the inverse and determinant of the covariance matrix. It is inefficient to implement this directly as the covariance is I × I. To compute the inverse, the covariance should be reformulated using the matrix inversion lemma, and the determinant calculated using the matrix determinant lemma.
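The D < I branch can be sketched for one input dimension, where A is 2×2 and can be inverted directly. To keep the sketch short, the noise variance σ² is assumed known rather than fitted by line search (all names are illustrative):

```python
def bayes_linreg_predict(xs, ws, x_star, sigma_sq, sigma_p_sq):
    """Algorithm 8.2 for one input dimension (D = 1 < I), with the noise
    variance sigma^2 assumed known instead of fitted by line search."""
    I = len(xs)
    s2, sp2 = sigma_sq, sigma_p_sq
    # A = X X^T / sigma^2 + I / sigma_p^2 for augmented vectors [1, x_i]
    a = I / s2 + 1 / sp2
    b = sum(xs) / s2
    d = sum(x * x for x in xs) / s2 + 1 / sp2
    det = a * d - b * b
    Ainv = [[d / det, -b / det], [-b / det, a / det]]
    xw = [sum(ws) / s2, sum(x * w for x, w in zip(xs, ws)) / s2]  # X w / sigma^2
    phi = [Ainv[0][0] * xw[0] + Ainv[0][1] * xw[1],
           Ainv[1][0] * xw[0] + Ainv[1][1] * xw[1]]               # posterior mean of phi
    xv = [1.0, x_star]
    mu = phi[0] * xv[0] + phi[1] * xv[1]
    var = (sum(xv[i] * Ainv[i][j] * xv[j] for i in range(2) for j in range(2))
           + s2)
    return mu, var
```

With a broad prior and small noise the predictive mean approaches the ML fit; the predictive variance always exceeds the noise variance σ².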
Algorithm 8.3: Gaussian process regression
To compute a non-linear fit to a set of data, we first transform the data x by a non-linear function f[•] to create a new variable z = f[x]. We then proceed as normal with the Bayesian approach, but using the transformed data.

In practice, we exploit the fact that the Bayesian non-linear regression fitting and prediction algorithms can be described in terms of inner products z^T z of the transformed data. We hence directly define a single kernel function k[x_i, x_j] as a replacement for the operation f[x_i]^T f[x_j]. For many transformations f[•] it is more efficient to evaluate the kernel function directly than to transform the variables separately and then compute the dot product. It is further possible to choose kernel functions that correspond to projection to very high or even infinite dimensional spaces without ever having to explicitly compute this transformation.

As usual we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and then work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states. In this algorithm, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.
Algorithm 8.3: Gaussian process regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Normal distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² K[X,X] + σ² I]]]
  // Compute inverse term
  A^{−1} = (K[X,X] + (σ²/σ_p²) I)^{−1}
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = (σ_p²/σ²) K[x*,X] w − (σ_p²/σ²) K[x*,X] A^{−1} K[X,X] w
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = σ_p² K[x*,x*] − σ_p² K[x*,X] A^{−1} K[X,x*] + σ²
end
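The prediction step can be sketched in Python for scalar inputs. The RBF kernel and the fixed noise variance are my own choices for the sketch (the algorithm leaves the kernel open and fits σ² by line search), and a small Gaussian-elimination routine stands in for a library solver:

```python
import math

def rbf(a, b, length=1.0):
    return math.exp(-0.5 * (a - b) ** 2 / length ** 2)

def solve(M, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x

def gp_predict(xs, ws, x_star, sigma_sq, sigma_p_sq):
    """Predictive mean and variance of GP regression at x_star (fixed sigma^2)."""
    I = len(xs)
    K = [[rbf(xi, xj) for xj in xs] for xi in xs]
    A = [[K[i][j] + (sigma_sq / sigma_p_sq) * (i == j) for j in range(I)]
         for i in range(I)]
    k_star = [rbf(x_star, xi) for xi in xs]
    Kw = [sum(K[i][j] * ws[j] for j in range(I)) for i in range(I)]
    AinvKw = solve(A, Kw)       # A^{-1} K[X,X] w
    AinvKs = solve(A, k_star)   # A^{-1} K[X,x*]
    mu = (sigma_p_sq / sigma_sq) * (
        sum(ks * w for ks, w in zip(k_star, ws))
        - sum(ks * y for ks, y in zip(k_star, AinvKw)))
    var = (sigma_p_sq * rbf(x_star, x_star)
           - sigma_p_sq * sum(ks * y for ks, y in zip(k_star, AinvKs))
           + sigma_sq)
    return mu, var
```

A little algebra shows the mean expression simplifies to K[x*,X](K[X,X] + (σ²/σ_p²)I)^{−1}w, the familiar Gaussian process predictive mean; the sketch keeps the two-term form of the algorithm box.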
Algorithm 8.4: Sparse linear regression
In the sparse linear regression model, we replace the normal prior over the parameters with a prior that is a product of t-distributions. This favours solutions where most of the regression parameters are effectively zero. In practice, the t-distribution corresponding to the dth dimension of the data is represented as a marginalization of a joint distribution with a hidden variable h_d.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 8.4: Sparse linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, … 1]
  repeat
    // Maximize marginal likelihood w.r.t. variance parameter
    σ² = argmin_{σ²} [−log[Norm_w[0, X^T H^{−1} X + σ² I]]]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = (XX^T/σ² + H)^{−1}
    µ = Σ X w / σ²
    // For each dimension except the first (the constant)
    for d = 2 to D+1 do
      // Update the diagonal entry of H
      h_dd = (1 − h_dd Σ_dd + ν) / (µ_d² + ν)
    end
  until no further improvement
  // Remove columns of X, rows of w, and rows and columns of H where the value h_dd on the diagonal of H is large
  [H, X, w] = prune[H, X, w]
  // Compute variance of posterior over φ
  A^{−1} = H^{−1} − H^{−1} X (X^T H^{−1} X + σ² I)^{−1} X^T H^{−1}
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T A^{−1} X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T A^{−1} x* + σ²
end
Algorithm 8.5: Dual Bayesian linear regression
In dual linear regression, we formulate the weight vector as a sum of the observed data examples X so that

φ = Xψ

and then solve for the dual parameters ψ. To this end we place a normally distributed prior on ψ with a spherical covariance of magnitude σ_p².
Algorithm 8.5: Dual formulation of linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X X^T X + σ² I]]]
  // Compute inverse variance of posterior over ψ
  A = X^T X X^T X / σ² + I / σ_p²
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T X A^{−1} X^T X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T X A^{−1} X^T x* + σ²
end
Algorithm 8.6: Dual Gaussian process regression
The dual algorithm relies only on inner products of the form x^T x and so can be kernelized to form a non-linear regression method. As previously, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.
Algorithm 8.6: Dual Gaussian process regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²; kernel function K[•,•]
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² K[X,X] K[X,X] + σ² I]]]
  // Compute inverse term
  A = K[X,X] K[X,X] / σ² + I / σ_p²
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = K[x*,X] A^{−1} K[X,X] w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = K[x*,X] A^{−1} K[X,x*] + σ²
end
Algorithm 8.7: Relevance vector regression
Relevance vector regression is simply sparse linear regression applied in the dual situation; we encourage the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables h_i which control the tendency to be zero for each dimension.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 8.7: Relevance vector regression.
Input : (D+1)×I data matrix X; I×1 world vector w; kernel K[•,•]; degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, … 1]
  repeat
    // Maximize marginal likelihood w.r.t. variance parameter σ²
    σ² = argmin_{σ²} [−log[Norm_w[0, K[X,X] H^{−1} K[X,X] + σ² I]]]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = (K[X,X] K[X,X] / σ² + H)^{−1}
    µ = Σ K[X,X] w / σ²
    // For each dual parameter
    for i = 1 to I do
      // Update the diagonal entry of H
      h_ii = (1 − h_ii Σ_ii + ν) / (µ_i² + ν)
    end
  until no further improvement
  // Remove columns of X, rows of w, and rows and columns of H where h_ii is large
  [H, X, w] = prune[H, X, w]
  // Compute inverse term
  A = K[X,X] K[X,X] / σ² + H
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = K[x*,X] A^{−1} K[X,X] w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = K[x*,X] A^{−1} K[X,x*] + σ²
end
Models for classification
Algorithm 9.1: MAP Logistic regression
The logistic regression model is defined as

Pr(w|x, φ) = Bern_w[1/(1 + exp[−φ^T x])],

where as usual, we have attached a 1 to the start of each data example x_i. We now perform a non-linear minimization over the negative log binomial probability with respect to the parameter vector φ:

φ = argmin_φ [−∑_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−φ^T x_i])]] − log[Norm_φ[0, σ_p² I]]],
where we have also added a prior over the parameters φ. The MAP solution is superior tothe maximum likelihood approach in that it encourages the function to be smooth even whenthe classes are completely separable. A typical approach would be to use a second orderoptimization method such as the Newton method (e.g., using Matlab’s fminunc function).The optimization method will need to compute the cost function and it’s derivative andHessian with respect to the parameter φ.
Algorithm 9.1: Cost and derivatives for MAP logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian with the terms due to the prior
  L = (D + 1) log[2πσ_p²]/2 + φᵀφ/(2σ_p²)
  g = φ/σ_p²
  H = I/σ_p²  // (D+1)×(D+1) identity matrix scaled by 1/σ_p²
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−φᵀx_i])
    // Add term to negative log likelihood
    if w_i == 1 then
      L = L − log[y_i]
    else
      L = L − log[1 − y_i]
    end
    // Add term to gradient
    g = g + (y_i − w_i)x_i
    // Add term to Hessian
    H = H + y_i(1 − y_i)x_i x_iᵀ
  end
end
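The computations above can be sketched in Python/NumPy as follows. This is a naive sketch rather than a reference implementation; the function and variable names are our own choices, and the data matrix is assumed to already have a 1 prepended to each example.

```python
import numpy as np

def map_logistic_cost(phi, X, w, sigma_p2):
    """Negative log posterior, gradient, and Hessian for MAP logistic
    regression.  X is (D+1) x I with a 1 prepended to each example,
    w is a length-I binary vector, sigma_p2 is the prior variance."""
    D1, I = X.shape
    a = phi @ X                          # activations, length I
    y = 1.0 / (1.0 + np.exp(-a))         # predictions
    # Terms due to the prior (regularization)
    L = D1 * np.log(2 * np.pi * sigma_p2) / 2 + phi @ phi / (2 * sigma_p2)
    g = phi / sigma_p2
    H = np.eye(D1) / sigma_p2
    # Data terms: negative Bernoulli log likelihood and its derivatives
    L -= np.sum(w * np.log(y) + (1 - w) * np.log(1 - y))
    g += X @ (y - w)
    H += (X * (y * (1 - y))) @ X.T
    return L, g, H
```

The returned triple can be fed directly to a second order optimizer; a finite-difference check of the gradient is a useful sanity test.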
Algorithm 9.2: Bayesian logistic regression
In Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).

The method works by first finding the MAP solution (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation, we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution.

As usual, we assume that we have added a one to the start of every data vector, x_i ← [1, x_iᵀ]ᵀ, to model the offset parameter elegantly.
Algorithm 9.2: Bayesian logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Predictive distribution Pr(w*|x*)

begin
  // Optimization using cost function of algorithm 9.1
  φ̂ = argmin_φ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−φᵀx_i])]] − log[Norm_φ[0, σ_p²I]] ]
  // Compute Hessian of the negative log posterior at peak
  H = I/σ_p²
  for i = 1 to I do
    y_i = 1/(1 + exp[−φ̂ᵀx_i])  // Compute prediction y
    H = H + y_i(1 − y_i)x_i x_iᵀ  // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  μ = φ̂
  Σ = H⁻¹
  // Compute mean and variance of activation
  μ_a = μᵀx*
  σ_a² = x*ᵀΣx*
  // Approximate integral to get Bernoulli parameter
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
  // Compute predictive distribution
  Pr(w*|x*) = Bern_{w*}[λ*]
end
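The prediction stage can be sketched in Python/NumPy as below. This assumes the convention that H is the Hessian of the negative log posterior (so the Laplace covariance is its inverse); the function name and arguments are our own.

```python
import numpy as np

def laplace_predict(phi_map, H, x_star):
    """Predictive Bernoulli parameter for Bayesian logistic regression,
    given MAP weights phi_map and Hessian H of the negative log
    posterior at the MAP solution.  A naive sketch."""
    Sigma = np.linalg.inv(H)                 # Laplace covariance
    mu_a = phi_map @ x_star                  # mean of activation
    var_a = x_star @ Sigma @ x_star          # variance of activation
    # Approximate the integral of the sigmoid over the activation
    return 1.0 / (1.0 + np.exp(-mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0)))
```

Note the qualitative behaviour: a larger activation variance pulls the prediction toward 0.5, reflecting greater posterior uncertainty.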
Algorithm 9.3: MAP dual logistic regression
The dual logistic regression model is the same as the logistic regression model, but now we represent the parameters φ as a weighted sum φ = Xψ of the original data points, where X is a matrix containing all of the training data, giving the prediction:

  Pr(w|ψ, x) = Bern_w[1/(1 + exp[−ψᵀXᵀx])].

We place a normal prior on the dual parameters ψ and optimize them using the criterion:

  ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀXᵀx_i])]] − log[Norm_ψ[0, σ_p²I]] ].

A typical approach would be to use a second order optimization method such as the Newton method (e.g., using Matlab's fminunc function). The optimization method will need to compute the cost function and its derivative and Hessian with respect to the parameter ψ; the calculations for these are given in the algorithm below.
Algorithm 9.3: Cost and derivatives for MAP dual logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters ψ
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian with the terms due to the prior
  L = I log[2πσ_p²]/2 + ψᵀψ/(2σ_p²)
  g = ψ/σ_p²
  H = I/σ_p²  // I×I identity matrix scaled by 1/σ_p²
  // Form compound data matrix
  X = [x_1, x_2, ..., x_I]
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−ψᵀXᵀx_i])
    // Add term to negative log likelihood
    if w_i == 1 then
      L = L − log[y_i]
    else
      L = L − log[1 − y_i]
    end
    // Add terms to gradient and Hessian
    g = g + (y_i − w_i)Xᵀx_i
    H = H + y_i(1 − y_i)Xᵀx_i x_iᵀX
  end
end
Algorithm 9.4: Dual Bayesian logistic regression
In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).

The method works by first finding the MAP solution to the dual problem (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation, we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution.

As usual, we assume that we have added a one to the start of every data vector, x_i ← [1, x_iᵀ]ᵀ, to model the offset parameter elegantly.
Algorithm 9.4: Dual Bayesian logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*

begin
  // Optimization using cost function of algorithm 9.3
  ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀXᵀx_i])]] − log[Norm_ψ[0, σ_p²I]] ]
  // Compute Hessian of the negative log posterior at peak
  H = I/σ_p²
  for i = 1 to I do
    y_i = 1/(1 + exp[−ψ̂ᵀXᵀx_i])  // Compute prediction y
    H = H + y_i(1 − y_i)Xᵀx_i x_iᵀX  // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  μ = ψ̂
  Σ = H⁻¹
  // Compute mean and variance of activation
  μ_a = μᵀXᵀx*
  σ_a² = x*ᵀXΣXᵀx*
  // Compute approximate prediction
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
end
Algorithm 9.4b: Gaussian process classification
Notice that algorithm 9.4 and algorithm 9.3, which it uses, are defined entirely in terms of inner products of the form x_iᵀx_j, which usually occur in matrix multiplications like Xᵀx*. This means they are amenable to kernelization. When we replace all of the inner products in algorithm 9.4 with a kernel function K[•,•], the resulting algorithm is called Gaussian process classification or kernel logistic regression.
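To make the substitution concrete, here is one common kernel choice sketched in Python/NumPy. The RBF kernel and its length-scale parameter are standard choices rather than something prescribed by the text; K[A,B] below plays the role of the kernelized inner products.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel matrix K[A,B] between the columns of A ((D+1) x I)
    and B ((D+1) x J).  One common kernel for Gaussian process
    classification; the length scale is a free parameter."""
    # Squared Euclidean distances between all column pairs
    d2 = (np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-0.5 * d2 / length_scale**2)
```

Wherever the dual algorithms compute Xᵀx_i or Xᵀx*, the kernelized versions would use `rbf_kernel(X, x_i)` and `rbf_kernel(X, x_star)` instead.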
Algorithm 9.5: Relevance vector classification
Relevance vector classification is a version of kernel logistic regression (Gaussian process classification) that encourages the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables h_i which control the tendency of each dimension to be zero.

The algorithm is iterative and alternates between updating the hidden variables in closed form and finding the resulting MAP solutions. After the system has converged, we prune the model to remove dimensions where the hidden variable is large (> 1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 9.5: Relevance vector classification

Input: (D+1)×I data X, I×1 binary world vector w, degrees of freedom ν, kernel K[•,•]
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*

begin
  // Initialize I hidden variables to reasonable values
  H = diag[1, 1, ..., 1]
  repeat
    // Find MAP solution using kernelized version of algorithm 9.3
    ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀK[X,x_i]])]] − log[Norm_ψ[0, H⁻¹]] ]
    // Compute Hessian S of the negative log posterior at peak (*)
    S = H
    for i = 1 to I do
      y_i = 1/(1 + exp[−ψ̂ᵀK[X,x_i]])  // Compute prediction y
      S = S + y_i(1 − y_i)K[X,x_i]K[x_i,X]  // Add term to Hessian
    end
    // Set mean and variance of Laplace approximation
    μ = ψ̂
    Σ = S⁻¹
    // For each data example, update the diagonal entry of H
    for i = 1 to I do
      h_ii = (1 − h_ii Σ_ii + ν)/(μ_i² + ν)
    end
  until no further improvement
  // Remove rows of μ, cols of X, rows and cols of Σ where h_ii is large
  [μ, Σ, X] = prune[μ, Σ, X]
  // Compute mean and variance of activation
  μ_a = μᵀK[X,x*]
  σ_a² = K[x*,X]ΣK[X,x*]
  // Compute approximate prediction
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
end

(*) Notice that we use S to represent the Hessian here, so that it is not confused with the diagonal matrix H containing the hidden variables.
Algorithm 9.6: Incremental fitting for logistic regression
The incremental fitting approach applies to the non-linear model

  Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k f[x, ξ_k]])].

The method initializes the weights {φ_k}_{k=1}^K to zero and then optimizes them one by one. At the first stage we optimize φ_0, φ_1 and ξ_1. Then we optimize φ_0, φ_2 and ξ_2, and so on.
Algorithm 9.6: Incremental logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I
Output: ML parameters φ_0, {φ_k, ξ_k}_{k=1}^K

begin
  // Initialize parameters
  φ_0 = 0
  // Initialize activation for each data point (sum of first k−1 functions)
  for i = 1 to I do
    a_i = 0
  end
  for k = 1 to K do
    // Remove offset parameter φ_0 from activations
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    // Optimize the offset, weight, and function parameters for this stage
    [φ_0, φ_k, ξ_k] = argmin_{φ_0,φ_k,ξ_k} [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]])]] ]
    // Update activations
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_k]
    end
  end
end
Obviously, the derivatives for the optimization algorithm depend on the choice of non-linear function. For example, if we use the function f[x_i, ξ_k] = arctan[ξ_kᵀx_i], where we have added a 1 to the start of each data vector x_i, then the first derivatives of the cost function L are:

  ∂L/∂φ_0 = Σ_{i=1}^I (y_i − w_i)
  ∂L/∂φ_k = Σ_{i=1}^I (y_i − w_i) arctan[ξ_kᵀx_i]
  ∂L/∂ξ_k = Σ_{i=1}^I (y_i − w_i) φ_k (1/(1 + (ξ_kᵀx_i)²)) x_i,

where y_i = 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]]) is the current prediction for the i-th data point.
Algorithm 9.7: Logitboost
Logitboost is a special case of non-linear logistic regression with Heaviside step functions:

  Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k heaviside[f[x, ξ_{c_k}]]])].

One interpretation is that we are combining a set of 'weak classifiers', each of which decides on the class based on whether the data is to the left or the right of the step in the step function.

The step functions do not have smooth derivatives, so at the k-th stage the algorithm exhaustively considers a set of possible step functions {heaviside[f[x, ξ_m]]}_{m=1}^M, choosing the index c_k ∈ {1, 2, ..., M} that is best, and simultaneously optimizes the weights φ_0 and φ_k.
Algorithm 9.7: Logitboost

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, functions {f_m[x, ξ_m]}_{m=1}^M
Output: ML parameters φ_0, {φ_k}_{k=1}^K, c_k ∈ {1 ... M}

begin
  // Initialize activations and parameters
  for i = 1 to I do
    a_i = 0
  end
  φ_0 = 0
  for k = 1 to K do
    // Find best weak classifier by looking at magnitude of gradient,
    // where y_i = 1/(1 + exp[−a_i − φ_0]) is the current prediction
    c_k = argmax_m [(Σ_{i=1}^I (y_i − w_i)f[x_i, ξ_m])²]
    // Remove effect of offset parameter
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    φ_0 = 0
    // Perform optimization
    [φ_0, φ_k] = argmin_{φ_0,φ_k} [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]])]] ]
    // Compute new activation
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_{c_k}]
    end
  end
end
The derivatives for the optimization are given by

  ∂L/∂φ_0 = Σ_{i=1}^I (y_i − w_i)
  ∂L/∂φ_k = Σ_{i=1}^I (y_i − w_i)f[x_i, ξ_{c_k}],

where y_i = 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]]) is the current prediction for the i-th data point.
Algorithm 9.8: Multi-class logistic regression
The multi-class logistic regression model is defined as

  Pr(w|φ, x) = Cat_w[softmax[φ_1ᵀx, φ_2ᵀx, ..., φ_Nᵀx]],

where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the negative log probability with respect to the parameter vector φ = [φ_1; φ_2; ...; φ_N]. We need to compute this cost, and its derivative and Hessian with respect to the parameters φ_n.
Algorithm 9.8: Cost function, derivative and Hessian for multi-class logistic regression

Input: World states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters {φ_n}_{n=1}^N
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian
  L = 0
  for n = 1 to N do
    g_n = 0  // Part of gradient relating to φ_n
    for m = 1 to N do
      H_mn = 0  // Portion of Hessian relating φ_n and φ_m
    end
  end
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = softmax[φ_1ᵀx_i, φ_2ᵀx_i, ..., φ_Nᵀx_i]
    // Update negative log likelihood (take the w_i-th element of y_i)
    L = L − log[y_{i,w_i}]
    // Update gradient and Hessian
    for n = 1 to N do
      g_n = g_n + (y_{in} − δ[w_i − n])x_i
      for m = 1 to N do
        H_mn = H_mn + y_{im}(δ[m − n] − y_{in})x_i x_iᵀ
      end
    end
  end
  // Assemble final gradient vector
  g = [g_1; g_2; ...; g_N]
  // Assemble final Hessian
  for n = 1 to N do
    H_n = [H_n1, H_n2, ..., H_nN]
  end
  H = [H_1; H_2; ...; H_N]
end
Algorithm 9.9: Multi-class logistic classification tree
Here, we present a deterministic multi-class classification tree. At the j-th branching point, it selects the index c_j ∈ {1, 2, ..., M} indicating which of a pre-determined set of classifiers {g[x, ω_m]}_{m=1}^M should be chosen.
Algorithm 9.9: Multiclass classification tree

Input: World states {w_i}_{i=1}^I, data {x_i}_{i=1}^I, classifiers {g[x, ω_m]}_{m=1}^M
Output: Categorical parameters at leaves {λ_p}_{p=1}^{J+1}, classifier indices {c_j}_{j=1}^J

begin
  enqueue[x_{1...I}, w_{1...I}]  // Store data and class labels
  // For each node in tree
  for j = 1 to J do
    [x_{1...I}, w_{1...I}] = dequeue[ ]  // Retrieve data and class labels
    for m = 1 to M do
      // Count frequency of the k-th class in left and right branches
      for k = 1 to K do
        n_k^(l) = Σ_{i=1}^I δ[g[x_i, ω_m] − 0]δ[w_i − k]
        n_k^(r) = Σ_{i=1}^I δ[g[x_i, ω_m] − 1]δ[w_i − k]
      end
      // Compute log likelihood
      l_m = Σ_{k=1}^K n_k^(l) log[n_k^(l)/Σ_{q=1}^K n_q^(l)]  // Contribution from left branch
      l_m = l_m + Σ_{k=1}^K n_k^(r) log[n_k^(r)/Σ_{q=1}^K n_q^(r)]  // Contribution from right branch
    end
    // Store index of best classifier
    c_j = argmax_m[l_m]
    // Partition into two sets
    S_l = {}; S_r = {}
    for i = 1 to I do
      if g[x_i, ω_{c_j}] == 0 then
        S_l = S_l ∪ i
      else
        S_r = S_r ∪ i
      end
    end
    // Add to queue of nodes to process next
    enqueue[x_{S_l}, w_{S_l}]
    enqueue[x_{S_r}, w_{S_r}]
  end
  // Recover categorical parameters at J + 1 leaves
  for p = 1 to J + 1 do
    [x_{1...I}, w_{1...I}] = dequeue[ ]
    for k = 1 to K do
      n_k = Σ_{i=1}^I δ[w_i − k]  // Frequency of class k at the p-th leaf
    end
    λ_p = n/Σ_{k=1}^K n_k  // ML solution for categorical parameter
  end
end
Graphical models
Algorithm 10.1: Gibbs’ Sampling from a discrete undirected model
This algorithm generates samples from an undirected model with distribution
Pr(x1...N ) =1
Z
C∏c=1
φc[Sc],
where the cth function φc[Sc] operates on a subset of Sc ⊂ x1, x2, . . . , xD of the D variablesand returns a positive number. For this algorithm, we assume that each variable xddd=1 isdiscrete and takes values xd ∈ 1, 2, . . . ,K
In Gibbs’ sampling, we choose each variable in turn and update by sampling from itsmarginal posterior distribution. Since, the variables are discrete, the marginal distribution isa categorical distribution (a histogram), so we can sample from it by partitioning the range0 to 1 according to the probabilities, drawing a uniform sample between 0 and 1, and seeingwhich partition it falls into.
Algorithm 10.1: Gibbs’ sampling from undirected model
Input : Potential functions φc[Sc]Cc=1
Output: Samples xtT1begin
// Initialize first sample in chain
x0 = x(0)
// For each time sample
for t=1 to T doxt = xt−1
// For each dimension
for d=1 to D do// For each possible value of the dth variable
for k=1 to K do// Set the variable to kxtd = k// Compute the unnormalized marginal probability
λk = 1for c s.t. xd ∈ Sc do
λk = λk · φc[Sc]end
end// Normalize the probabilities
λ = λ/∑Kk=1 λk
// Draw from categorical distribution
xtd = Sample [Catxtd [λ]]
end
end
end
It is normal to discard the first few thousand entries so that the initial conditions are forgotten.Then entries are chosen that are spaced apart to avoid correlation between the samples.
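The sampler above can be sketched in Python/NumPy as follows. The representation of each potential as a (variable-indices, table) pair, the 0-based values, and the fixed burn-in length are our own choices for the sketch.

```python
import numpy as np

def gibbs_sample(potentials, D, K, T, rng, burn_in=500):
    """Gibbs sampling from a discrete undirected model.  potentials is
    a list of (dims, table) pairs: a tuple of variable indices and a
    positive array giving phi over those variables.  Returns T samples
    taken after the burn-in sweeps."""
    x = np.zeros(D, dtype=int)
    samples = []
    for t in range(T + burn_in):
        for d in range(D):
            lam = np.ones(K)
            for k in range(K):
                x[d] = k                      # tentatively set variable
                for dims, table in potentials:
                    if d in dims:             # only touching potentials
                        lam[k] *= table[tuple(x[list(dims)])]
            lam /= lam.sum()                  # normalize
            x[d] = rng.choice(K, p=lam)       # draw from categorical
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)
```

On a tiny model the empirical statistics can be checked against the exact distribution obtained by enumerating the normalizing constant Z.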
Algorithm 10.2: Contrastive divergence for learning undirected models
The contrastive divergence algorithm is used to learn the parameters θ of an undirected model of the form

  Pr(x_{1...D}, θ) = (1/Z[θ]) f[x, θ] = (1/Z[θ]) Π_{c=1}^C φ_c[S_c, θ],

where the c-th function φ_c[S_c, θ] operates on a subset S_c ⊂ {x_1, x_2, ..., x_D} of the D variables and returns a positive number. It is generally not possible to maximize the log likelihood either in closed form or via a non-linear optimization algorithm, because we cannot compute the denominator Z[θ] that normalizes the distribution and which also depends on the parameters.

The contrastive divergence algorithm gets around this problem by computing an approximate gradient by means of generating J samples {x*_j}_{j=1}^J and then using this approximate gradient to perform gradient ascent. The approximate gradient is computed as

  ∂L/∂θ ≈ −(I/J) Σ_{j=1}^J ∂log[f[x*_j, θ]]/∂θ + Σ_{i=1}^I ∂log[f[x_i, θ]]/∂θ.

In the algorithm below, the function gradient[x, θ] represents the derivative of the unnormalized log likelihood (i.e., the two terms on the right hand side). We have also made the simplifying assumption that there is one sample x*_i for each training example x_i (i.e., I = J). In practice, computing valid samples is a burden, so in this algorithm we generate the i-th sample x*_i by taking a single Gibbs' sampling step from the i-th training example.
Algorithm 10.2: Contrastive divergence learning of undirected model

Input: Data {x_i}_{i=1}^I, learning rate α
Output: ML parameters θ

begin
  // Initialize parameters
  θ = θ^(0)
  repeat
    for i = 1 to I do
      // Take a single Gibbs' sampling step from the i-th data point
      x*_i = Sample[x_i, θ]
    end
    // Update parameters; the function gradient[•, •] returns the
    // derivative of the log of the unnormalized probability
    θ = θ + α Σ_{i=1}^I (gradient[x_i, θ] − gradient[x*_i, θ])
  until no further average change in θ
end
Models for chains and trees
Algorithm 11.1: Dynamic programming for chain model
This algorithm computes the maximum a posteriori solution for a chain model. The directed chain model has a likelihood and prior that factorize as

  Pr(x|w) = Π_{n=1}^N Pr(x_n|w_n)
  Pr(w) = Π_{n=2}^N Pr(w_n|w_{n−1}),

respectively. To find the MAP solution, we minimize the negative log posterior:

  ŵ_{1...N} = argmin_{w_{1...N}} [ −Σ_{n=1}^N log[Pr(x_n|w_n)] − Σ_{n=2}^N log[Pr(w_n|w_{n−1})] ]
            = argmin_{w_{1...N}} [ Σ_{n=1}^N U_n(w_n) + Σ_{n=2}^N P_n(w_n, w_{n−1}) ].

This cost function can be optimized using dynamic programming. We pass through the variables from w_1 to w_N, computing the minimum cost to reach each state and caching the route. We find the overall minimum at w_N and retrieve the cached route. Here, we denote the unary cost U_n(w_n = k) for the n-th variable taking value k by U_{n,k}, and the pairwise cost P_n(w_n = k, w_{n−1} = l) for the n-th variable taking value k and the (n−1)-th variable taking value l by P_{n,k,l}.
Algorithm 11.1: Dynamic programming in chain

Input: Unary costs {U_{n,k}}, pairwise costs {P_{n,k,l}}
Output: Minimum cost path {w_n}_{n=1}^N

begin
  // Initialize cumulative sums S_{n,k}
  for k = 1 to K do
    S_{1,k} = U_{1,k}
  end
  // Work forward through chain
  for n = 2 to N do
    for k = 1 to K do
      // Find minimum cost to get to this node
      S_{n,k} = U_{n,k} + min_l[S_{n−1,l} + P_{n,k,l}]
      // Store route by which we got here
      R_{n,k} = argmin_l[S_{n−1,l} + P_{n,k,l}]
    end
  end
  // Find state of w_N with overall minimum cost
  w_N = argmin_k[S_{N,k}]
  // Trace back to retrieve route
  for n = N to 2 do
    w_{n−1} = R_{n,w_n}
  end
end
Algorithm 11.2: Dynamic programming for tree model
This algorithm can be used to compute the MAP solution for a directed or undirected model which has the form of a tree. As such, it generalizes algorithm 11.1, which is specialized for chains. As in the simpler case, the algorithm proceeds by working through the nodes, computing the minimum possible cost to reach each position and caching the route by which we reached it. At the last node we compute the overall minimum cost and then trace back the route using the cached information.

Here, we denote the unary cost U_n(w_n = k) for the n-th variable taking value k by U_{n,k}. We denote the higher order cost for assigning value k to the n-th variable based on its children ch[n] as H_{n,k}[ch[n]]. This might consist of pairwise, three-wise, or higher costs depending on the number of children in the graph.
Algorithm 11.2: Dynamic programming in tree

Input: Unary costs {U_{n,k}}, higher order cost functions {H_{n,k}[ch[n]]}
Output: Minimum cost path {w_n}_{n=1}^N

begin
  repeat
    // Retrieve nodes in an order so children always come before parents
    n = GetNextNode[ ]
    // For each possible value of this node
    for k = 1 to K do
      // Compute the minimum cost for reaching here (*)
      S_{n,k} = U_{n,k} + min_{ch[n]}[S_{ch[n]} + H_{n,k}[ch[n]]]
      // Cache the route for reaching here (store |ch[n]| values)
      R_{n,k} = argmin_{ch[n]}[S_{ch[n]} + H_{n,k}[ch[n]]]
    end
    // Push node index onto stack
    push[n]
  until no more parents
  // Find the state of the root node with overall minimum cost
  w_n = argmin_k[S_{n,k}]
  // Trace back to retrieve route
  for c = 1 to N do
    n = pop[ ]
    if ch[n] ≠ {} then
      w_{ch[n]} = R_{n,w_n}
    end
  end
end

(*) This minimization is done over the values of all of the children variables. With a pairwise term, this would be a single minimization over the single previous variable that fed into this one. With a three-wise term it would be a joint minimization over both children variables, and so on.
Algorithm 11.3: Forward-backward algorithm
This algorithm computes the marginal posterior distributions Pr(w_n|x_{1...N}) for a chain model. To find the marginal posteriors we perform a forward recursion and a backward recursion and multiply the two resulting quantities together.

Here, we use the term u_{n,k} to represent the likelihood Pr(x_n|w_n = k) of the data x_n when the n-th node takes label k, and the term p_{n,k,l} to represent the prior term Pr(w_n = k|w_{n−1} = l) when the n-th variable takes value k and the (n−1)-th variable takes value l. Note that u_{n,k} and p_{n,k,l} are probabilities; they are not the same as the unary and pairwise costs in the dynamic programming algorithms.
Algorithm 11.3: Forward-backward algorithm

Input: Likelihoods {u_{n,k}}, prior terms {p_{n,k,l}}
Output: Marginal probability distributions {q_n[w_n]}_{n=1}^N

begin
  // Initialize forward vector to likelihood of first variable
  for k = 1 to K do
    f_{1,k} = u_{1,k}
  end
  // For each state of each subsequent variable
  for n = 2 to N do
    for k = 1 to K do
      // Forward recursion
      f_{n,k} = u_{n,k} Σ_{l=1}^K p_{n,k,l} f_{n−1,l}
    end
  end
  // Initialize vector for backward pass
  for k = 1 to K do
    b_{N,k} = 1/K
  end
  // For each state of each previous variable
  for n = N to 2 do
    for k = 1 to K do
      // Backward recursion
      b_{n−1,k} = Σ_{l=1}^K u_{n,l} p_{n,l,k} b_{n,l}
    end
  end
  // Compute marginal posteriors
  for n = 1 to N do
    for k = 1 to K do
      // Take product of forward and backward messages and normalize
      q_n[w_n = k] = f_{n,k} b_{n,k}/(Σ_{l=1}^K f_{n,l} b_{n,l})
    end
  end
end
Algorithm 11.4: Sum product belief propagation
The sum product algorithm proceeds in two phases: a forward pass and a backward pass. The forward pass distributes evidence through the graph and the backward pass collates this evidence. Both the distribution and collation of evidence are accomplished by passing messages from node to node in the factor graph. Every edge in the graph is connected to exactly one variable node, and each message is defined over the domain of this variable.

In the description of the algorithm below, we denote the edges by {e_n}_{n=1}^N, where edge e_n joins node e_{n1} to node e_{n2}. The edges are processed in such an order that all incoming messages to a node are computed before the outgoing message m_n is passed. We first discuss the distribute phase.
Algorithm 11.4: Sum product: distribute

Input: Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K, edges {e_n}_{n=1}^N
Output: Forward messages m_n on each of the N edges e_n

begin
  repeat
    // Retrieve edges in any valid order
    e_n = GetNextEdge[ ]
    // Test for type of edge - returns 1 if e_{n2} is a function, else 0
    t = isEdgeToFunction[e_n]
    if t then
      // If this data was observed
      if e_{n1} ∈ S_obs then
        m_n = δ[z*_{e_{n1}}]
      else
        // Find set of edges that are incoming to the start of this edge
        S = {k : e_{n1} == e_{k2}}
        // Take product of messages
        m_n = Π_{k∈S} m_k
      end
    else
      // Find set of edges incoming to the start of this edge
      S = {k : e_{n1} == e_{k2}}
      // Find all variables connected to this function
      V = e_{S1} ∪ e_{n2}
      // Sum the product of the function and incoming messages over
      // the states of the incoming variables
      m_n = Σ_{y_{e_{S1}}} φ[y_V] Π_{k∈S} m_k
    end
    // Add edge to stack
    push[n]
  until all edges have been processed
end

This algorithm continues overleaf...
Algorithm 11.4b: Collate and compute marginal distributions

After the distribute stage is complete (one message has been passed along each edge in the graph), we commence the second pass through the edges. This happens in the opposite order to the first stage (accomplished by popping edges off the stack). Now we collate the evidence and compute the normalized distributions at each node.

Algorithm 11.4b: Sum product: collate and compute marginal distributions

Input: Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K, edges {e_n}_{n=1}^N
Output: Marginal probability distributions {q_n[y_n]}_{n=1}^N

begin
  // Collate evidence
  repeat
    // Retrieve edges in opposite order
    n = pop[ ]
    // Test for type of edge - returns 1 if e_{n2} is a function, else 0
    t = isEdgeToFunction[e_n]
    if t then
      // Find set of edges incoming to the function node
      S = {k : e_{n2} == e_{k1}}
      // Find all variables connected to this function
      V = e_{S2} ∪ e_{n1}
      // Sum the product of the function and incoming messages over
      // the states of the incoming variables
      b_n = Σ_{y_{e_{S2}}} φ[y_V] Π_{k∈S} b_k
    else
      // Find set of edges that are incoming to the variable node
      S = {k : e_{n2} == e_{k1}}
      // Take product of messages
      b_n = Π_{k∈S} b_k
    end
  until stack empty
  // Compute distributions at the variable nodes (normalizing each)
  for k = 1 to K do
    // Find sets of edges incoming to and outgoing from this node
    S_1 = {n : e_{n2} == k}
    S_2 = {n : e_{n1} == k}
    q_k ∝ Π_{n∈S_1} m_n Π_{n∈S_2} b_n
  end
end
Models for grids
Algorithm 12.1: Binary graph cuts
This algorithm assumes that we have N variables, each of which takes a binary value. Their connections are indicated by a series of flags {e_{mn}}_{m,n=1}^N which are set to one if the variables are connected (and have an associated pairwise term) or zero otherwise. This algorithm sets up the graph but does not find the min-cut solution; consult a standard algorithms text for details of how to do this.
Algorithm 12.1: Binary graph cuts

Input: Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k,l)}, flags {e_{mn}}
Output: Label assignments {w_n}

begin
  // Initialize graph to empty
  G = {}
  for n = 1 to N do
    // Create edges from source and to sink and set capacity to zero
    G = G ∪ {s, n}; c_{sn} = 0
    G = G ∪ {n, t}; c_{nt} = 0
    for m = 1 to n − 1 do
      // If edge between m and n is desired
      if e_{mn} == 1 then
        G = G ∪ {m, n}; c_{nm} = 0
        G = G ∪ {n, m}; c_{mn} = 0
      end
    end
  end
  // Add costs to edges
  for n = 1 to N do
    c_{sn} = c_{sn} + U_n(0)
    c_{nt} = c_{nt} + U_n(1)
    for m = 1 to n − 1 do
      if e_{mn} == 1 then
        c_{nm} = c_{nm} + P_{mn}(1,0) − P_{mn}(1,1) − P_{mn}(0,0)
        c_{mn} = c_{mn} + P_{mn}(1,0)
        c_{sm} = c_{sm} + P_{mn}(0,0)
        c_{nt} = c_{nt} + P_{mn}(1,1)
      end
    end
  end
  C = Reparameterize[C]  // Ensures all capacities are positive (see overleaf)
  G = ComputeMinCut[G, C]  // Augmenting paths or similar
  // Read off world state values based on new (cut) graph
  for n = 1 to N do
    if {s, n} ∈ G then
      w_n = 1
    else
      w_n = 0
    end
  end
end
Algorithm 12.2: Reparameterization for graph cuts
The previous algorithm relies on a max-flow / min-cut algorithm such as augmenting paths or push-relabel. For these algorithms to converge, it is critical that all of the capacities are non-negative. The process of making them non-negative is called reparameterization. It is only possible in certain special cases; problems where it is possible are known as submodular. Cost functions in vision tend to encourage smoothing and are usually submodular.
Algorithm 12.2: Reparameterization for binary graph cut

Input: Edge flags {e_{mn}}, capacities {c_{mn} : e_{mn} = 1}
Output: Modified graph with non-negative capacities

begin
  // For each node pair
  for n = 1 to N do
    for m = 1 to n − 1 do
      // If an edge between the nodes exists
      if e_{mn} == 1 then
        // Test if submodular and return error code if not
        if c_{nm} < 0 && c_{mn} < −c_{nm} then
          return[−1]
        end
        if c_{mn} < 0 && c_{nm} < −c_{mn} then
          return[−1]
        end
        // Shift any negative capacity onto the source and sink links
        β = 0
        if c_{nm} < 0 then
          β = c_{nm}
        end
        if c_{mn} < 0 then
          β = −c_{mn}
        end
        c_{nm} = c_{nm} − β
        c_{mn} = c_{mn} + β
        c_{sm} = c_{sm} + β
        c_{mt} = c_{mt} + β
      end
    end
    // Handle links between source and sink
    α = min[c_{sn}, c_{nt}]
    c_{sn} = c_{sn} − α
    c_{nt} = c_{nt} − α
  end
end
Algorithm 12.3: Multi-label graph cuts
This algorithm assumes that we have N variables, each of which takes one of K values. Their connections are indicated by a set of flags {e_{mn}}_{m,n=1}^N which are set to one if the variables are connected (and have an associated pairwise term) or zero otherwise. We construct a graph that has N·(K+1) nodes, where the first K+1 nodes pertain to the first variable, and so on.
Algorithm 12.3: Multilabel graph cuts

Input: Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k,l)}, flags {e_{mn}}
Output: Label assignments {w_n}

begin
  G = {}  // Initialize graph to empty
  for n = 1 to N do
    // Create edges from source and to sink and set costs
    G = G ∪ {s, (n−1)(K+1)+1}; c_{s,(n−1)(K+1)+1} = ∞
    G = G ∪ {n(K+1), t}; c_{n(K+1),t} = ∞
    // Create edges within columns and set costs
    for k = 1 to K do
      G = G ∪ {(n−1)(K+1)+k, (n−1)(K+1)+k+1}
      c_{(n−1)(K+1)+k,(n−1)(K+1)+k+1} = U_n(k)
      G = G ∪ {(n−1)(K+1)+k+1, (n−1)(K+1)+k}
      c_{(n−1)(K+1)+k+1,(n−1)(K+1)+k} = ∞
    end
    // Create edges between columns and set costs
    for m = 1 to n − 1 do
      if e_{mn} == 1 then
        for k = 1 to K do
          for l = 2 to K + 1 do
            G = G ∪ {(n−1)(K+1)+k, (m−1)(K+1)+l}
            c_{(n−1)(K+1)+k,(m−1)(K+1)+l} = P_{n,m}(k, l−1) + P_{n,m}(k−1, l) − P_{n,m}(k, l) − P_{n,m}(k−1, l−1)
          end
        end
      end
    end
  end
  C = Reparameterize[C]  // Ensures all capacities are positive (see book)
  G = ComputeMinCut[G, C]  // Augmenting paths or similar
  // Read off values
  for n = 1 to N do
    w_n = 1
    for k = 1 to K do
      if {(n−1)(K+1)+k, (n−1)(K+1)+k+1} ∈ G then
        w_n = w_n + 1
      end
    end
  end
end
Algorithm 12.4: Alpha-expansion algorithm
The alpha-expansion algorithm works by breaking the solution down into a series of binary problems, each of which can be solved exactly. At each iteration, we choose one of the K label values α, and for each pixel we consider either retaining the current label or switching it to α. The name alpha-expansion derives from the fact that the space occupied by label α in the solution expands at each iteration. The process is iterated until no choice of α causes any change. Each expansion move is guaranteed not to increase the overall objective function, although the final result is not guaranteed to be the global minimum.
Algorithm 12.4a: Alpha expansion algorithm (main loop)

Input : Unary costs {U_n(k)}_{n=1,k=1}^{N,K}, pairwise costs {P_mn(k,l)}_{m,n=1,k,l=1}^{N,N,K,K}, flags {e_mn}_{m,n=1}^{N,N}
Output: Label assignments {w_n}_{n=1}^{N}

begin
    // Initialize labels in some way - perhaps to minimize unary costs
    w = w_0
    // Compute total cost
    L = Σ_{n=1}^{N} U_n(w_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} e_mn P_nm(w_n, w_m)
    repeat
        // Store initial cost
        L_0 = L
        // For each label in turn
        for k = 1 to K do
            // Try to expand this label (see algorithm 12.4b)
            w = AlphaExpand[w, k]
        end
        // Compute new cost
        L = Σ_{n=1}^{N} U_n(w_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} e_mn P_nm(w_n, w_m)
    until L = L_0
end
In the alpha-expansion graph construction, there is one vertex associated with each pixel. Each of these vertices is connected to the source (representing keeping the original label) and to the sink (representing the label α). To separate source from sink, we must cut one of these two edges at each pixel; the choice of edge determines whether the pixel keeps its original label or is set to α. Accordingly, we associate the unary costs of the two outcomes with the two links from each pixel. If the pixel already has label α, then we set the cost of being set to α to ∞.
The remaining structure of the graph is dynamic: it changes at each iteration dependingon the choice of α and the current labels. There are four possible relationships betweenadjacent pixels:
• Both can already be set to α.

• One can be set to α and the other to another value β.

• Both can be set to the same other value β.

• They can be set to two different values β and γ.
Algorithm 12.4b: Alpha expansion (expand)

Input : Costs {U_n(k)}_{n=1,k=1}^{N,K} and {P_mn(k,l)}_{m,n=1,k,l=1}^{N,N,K,K}, expansion label k, current states {w_n}_{n=1}^{N}
Output: New label assignments {w_n}_{n=1}^{N}

begin
    G = {}                                   // Initialize graph to empty
    z = N                                    // Counter for new nodes added to graph
    for n = 1 to N do
        // Connect pixel node to source and set cost
        G = G ∪ {s, n} ;  c_sn = U_n(k)
        // Connect pixel node to sink and set cost
        if w_n = k then
            G = G ∪ {n, t} ;  c_nt = ∞
        else
            G = G ∪ {n, t} ;  c_nt = U_n(w_n)
        end
        for m = 1 to n do
            if e_mn = 1 then
                if w_n = k or w_m = k then
                    if w_n ≠ k then
                        G = G ∪ {n, m} ;  c_nm = P_nm(w_m, w_n)               // Case 2a
                    end
                    if w_m ≠ k then
                        G = G ∪ {m, n} ;  c_mn = P_nm(w_n, w_m)               // Case 2b
                    end
                else if w_n = w_m then
                    G = G ∪ {n, m} ;  c_nm = P_nm(k, w_n)                     // Case 3
                    G = G ∪ {m, n} ;  c_mn = P_nm(w_n, k)
                else
                    z = z + 1                                                 // Increment new node counter
                    G = G ∪ {n, z} ;  c_nz = P_nm(k, w_n) ;  c_zn = ∞         // Case 4
                    G = G ∪ {m, z} ;  c_mz = P_nm(w_m, k) ;  c_zm = ∞
                    G = G ∪ {z, t} ;  c_zt = P_nm(w_m, w_n)
                end
            end
        end
    end
    C = Reparameterize[C]                    // Ensures all capacities are positive
    G = ComputeMinCut[G, C]                  // Augmenting paths or similar
    // Read off values
    for n = 1 to N do
        if {n, t} ∈ G then
            w_n = k
        end
    end
end
Preprocessing
Algorithm 13.1: Principal components analysis
The goal of PCA is to approximate a set of multivariate data {x_i}_{i=1}^{I} with a second set of variables {h_i}_{i=1}^{I} of reduced dimension, so that

    x_i ≈ µ + Φ h_i,

where Φ is a rectangular matrix whose columns are unit length and orthogonal to one another, so that Φ^T Φ = I.

This formulation assumes that the number of original data dimensions D is higher than the number of training examples I, and so it works by taking the singular value decomposition of the I×I matrix X^T X to compute the dual principal components Ψ before recovering the original principal components Φ.
Algorithm 13.1: Principal components analysis (dual)

Input : Training data {x_i}_{i=1}^{I}, number of components K
Output: Mean µ, PCA basis functions Φ, low dimensional data {h_i}_{i=1}^{I}

begin
    // Estimate mean
    µ = Σ_{i=1}^{I} x_i / I
    // Form mean-zero data matrix
    X = [x_1 - µ, x_2 - µ, ..., x_I - µ]
    // Do spectral decomposition and compute dual components
    [Ψ, L, Ψ] = svd[X^T X]
    // Compute principal components
    Φ = X Ψ L^{-1/2}
    // Retain only the first K columns
    Φ = [φ_1, φ_2, ..., φ_K]
    // Convert data to low dimensional representation
    for i = 1 to I do
        h_i = Φ^T (x_i - µ)
    end
    // Reconstruct data
    for i = 1 to I do
        x_i = µ + Φ h_i
    end
end
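The dual computation above can be sketched in NumPy. This is a minimal illustration rather than the book's code; the function name pca_dual and the column-wise data layout are my own choices:

```python
import numpy as np

def pca_dual(X, K):
    """Dual PCA: X is a D x I data matrix with D >> I.
    Returns mean, K principal components, and low-dimensional coordinates."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Eigendecompose the small I x I matrix X^T X instead of the D x D covariance
    evals, Psi = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(evals)[::-1][:K]      # largest K eigenvalues first
    L, Psi = evals[order], Psi[:, order]
    Phi = Xc @ Psi / np.sqrt(L)              # D x K principal components (orthonormal columns)
    H = Phi.T @ Xc                           # K x I low-dimensional representation
    return mu, Phi, H
```

Note that np.linalg.eigh returns eigenvalues in ascending order, so the columns must be reordered before truncating to K components.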
Algorithm 13.2: k-means algorithm
The goal of the k-means algorithm is to partition a set of data {x_i}_{i=1}^{I} into K clusters. It can be thought of as approximating each data point by the associated cluster mean µ_{h_i}, so that

    x_i ≈ µ_{h_i},

where h_i ∈ {1, 2, ..., K} is a discrete variable that indicates which cluster the ith point belongs to. The algorithm works by alternately (i) assigning data points to the nearest cluster center and (ii) recomputing each cluster center as the mean of the data points assigned to it.
Algorithm 13.2: K-means algorithm

Input : Data {x_i}_{i=1}^{I}, number of clusters K, data dimension D
Output: Cluster means {µ_k}_{k=1}^{K}, cluster assignment indices {h_i}_{i=1}^{I}

begin
    // Initialize cluster means (one of many heuristics)
    µ = Σ_{i=1}^{I} x_i / I                              // Compute overall mean
    Σ = Σ_{i=1}^{I} (x_i - µ)(x_i - µ)^T / I             // Compute overall covariance
    for k = 1 to K do
        µ_k = µ + Σ^{1/2} randn[D, 1]                    // Randomly draw from normal model
    end
    // Main loop
    repeat
        // Compute distance from data points to cluster means
        for i = 1 to I do
            for k = 1 to K do
                d_ik = (x_i - µ_k)^T (x_i - µ_k)
            end
            // Update cluster assignment based on closest cluster
            h_i = argmin_k [d_ik]
        end
        // Update cluster means from data assigned to each cluster
        for k = 1 to K do
            µ_k = (Σ_{i=1}^{I} δ[h_i - k] x_i) / (Σ_{i=1}^{I} δ[h_i - k])
        end
    until no further change in {µ_k}_{k=1}^{K}
end
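The two alternating steps can be sketched in NumPy as follows. This is an illustrative implementation, not the book's: for simplicity it initializes the means at K randomly chosen data points rather than sampling from a fitted normal model.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: I x D data array. Returns cluster means (K x D) and assignments (I,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initialize at K data points
    h = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: nearest cluster mean for each point
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        h_new = d.argmin(axis=1)
        if it > 0 and np.array_equal(h_new, h):
            break                                        # assignments stable: converged
        h = h_new
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(h == k):
                mu[k] = X[h == k].mean(axis=0)
    return mu, h
```

The result depends on the initialization, so in practice the algorithm is often restarted several times and the solution with the lowest total within-cluster distance is kept.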
The pinhole camera
Algorithm 14.1: ML learning of camera extrinsic parameters
Given a known object with I distinct three-dimensional points {w_i}_{i=1}^{I}, their corresponding projections in the image {x_i}_{i=1}^{I}, and known intrinsic parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ.

The solution to this problem is to minimize

    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ],

where pinhole[w_i, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 in the book). The bulk of this algorithm consists of finding a good initial starting point for this minimization. The optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
Algorithm 14.1: ML learning of extrinsic parameters

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
    for i = 1 to I do
        // Convert to normalized camera coordinates
        x'_i = Λ^{-1} [x_i; y_i; 1]
        // Compute linear constraints (with w_i = [u_i, v_i, w_i]^T)
        a_1i = [u_i, v_i, w_i, 1, 0, 0, 0, 0, -u_i x'_i, -v_i x'_i, -w_i x'_i, -x'_i]
        a_2i = [0, 0, 0, 0, u_i, v_i, w_i, 1, -u_i y'_i, -v_i y'_i, -w_i y'_i, -y'_i]
    end
    // Stack linear constraints
    A = [a_11; a_21; a_12; a_22; ...; a_1I; a_2I]
    // Solve with SVD
    [U, L, V] = svd[A]
    b = v_12                                 // Extract last column of V
    // Extract estimates up to unknown scale
    Ω = [b_1, b_2, b_3; b_5, b_6, b_7; b_9, b_10, b_11]
    τ = [b_4; b_8; b_12]
    // Find closest rotation using Procrustes method
    [U, L, V] = svd[Ω]
    Ω̂ = U V^T
    // Rescale translation
    τ̂ = τ (Σ_{i=1}^{3} Σ_{j=1}^{3} Ω̂_ij / Ω_ij) / 9
    // Use these estimates as initial conditions for non-linear optimization
    [Ω, τ] = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ]
end
Algorithm 14.2: ML learning of intrinsic parameters (camera calibration)
Given a known object with I distinct 3D points {w_i}_{i=1}^{I} and their corresponding projections in the image {x_i}_{i=1}^{I}, establish the camera parameters Λ. In order to do this, we also need to estimate the extrinsic parameters. We use the criterion

    Λ̂ = argmin_Λ [ min_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ] ],

where pinhole[w_i, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 in the book).

This algorithm takes an alternating approach in which the extrinsic parameters are found using the previous algorithm and then the intrinsic parameters are found in closed form. Finally, these estimates form the starting point for a non-linear optimization over all of the unknown parameters.
Algorithm 14.2: ML learning of intrinsic parameters

Input : World points {w_i}_{i=1}^{I}, image points {x_i}_{i=1}^{I}, initial Λ
Output: Intrinsic parameters Λ

begin
    // Main loop for alternating optimization
    for t = 1 to T do
        // Compute extrinsic parameters (algorithm 14.1)
        [Ω, τ] = calcExtrinsic[Λ, {w_i, x_i}_{i=1}^{I}]
        // Compute intrinsic parameters
        for i = 1 to I do
            // Compute matrix A_i (ω_k• is the kth row of Ω)
            a_i = (ω_1• w_i + τ_x) / (ω_3• w_i + τ_z)
            b_i = (ω_2• w_i + τ_y) / (ω_3• w_i + τ_z)
            A_i = [a_i, b_i, 1, 0, 0; 0, 0, 0, b_i, 1]
        end
        // Concatenate matrices and data points
        x = [x_1; x_2; ...; x_I]
        A = [A_1; A_2; ...; A_I]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine parameters with non-linear optimization
    Λ̂ = argmin_Λ [ min_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ] ]
end
Algorithm 14.3: Inferring 3D world points (reconstruction)
Given J calibrated cameras in known positions (i.e., cameras with known Λ, Ω, τ) viewing the same three-dimensional point w, and knowing the corresponding projections {x_j}_{j=1}^{J} in the images, establish the position of the point in the world.

As for the previous algorithms, the final solution depends on a non-linear minimization of the reprojection error between w and the observed data x_j:

    ŵ = argmin_w [ Σ_{j=1}^{J} (x_j - pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j - pinhole[w, Λ_j, Ω_j, τ_j]) ].

The algorithm below finds a good approximate initial condition for this minimization using a closed-form least-squares solution.
Algorithm 14.3: Inferring 3D world position

Input : Image points {x_j}_{j=1}^{J}, camera parameters {Λ_j, Ω_j, τ_j}_{j=1}^{J}
Output: 3D world point w

begin
    for j = 1 to J do
        // Convert to normalized camera coordinates
        x'_j = Λ_j^{-1} [x_j; y_j; 1]
        // Compute linear constraints (ω_kl,j is element k,l of Ω_j)
        a_1j = [ω_31,j x'_j - ω_11,j,  ω_32,j x'_j - ω_12,j,  ω_33,j x'_j - ω_13,j]
        a_2j = [ω_31,j y'_j - ω_21,j,  ω_32,j y'_j - ω_22,j,  ω_33,j y'_j - ω_23,j]
        b_j = [τ_xj - τ_zj x'_j;  τ_yj - τ_zj y'_j]
    end
    // Stack linear constraints
    A = [a_11; a_21; a_12; a_22; ...; a_1J; a_2J]
    b = [b_1; b_2; ...; b_J]
    // Least-squares solution for parameters
    w = (A^T A)^{-1} A^T b
    // Refine parameters with non-linear optimization
    ŵ = argmin_w [ Σ_{j=1}^{J} (x_j - pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j - pinhole[w, Λ_j, Ω_j, τ_j]) ]
end
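The linear initialization above can be sketched in NumPy as follows; the non-linear refinement stage is omitted, and the function names are illustrative rather than the book's:

```python
import numpy as np

def triangulate_linear(xs, Lambdas, Omegas, taus):
    """Linear least-squares estimate of a 3D point w from J pinhole views.
    xs: list of 2D image points; Lambdas, Omegas, taus: per-camera parameters."""
    A, b = [], []
    for x, Lam, Om, tau in zip(xs, Lambdas, Omegas, taus):
        xh = np.linalg.solve(Lam, np.array([x[0], x[1], 1.0]))  # normalized coordinates
        u, v = xh[0] / xh[2], xh[1] / xh[2]
        # Each view contributes two linear constraints on w
        A.append(u * Om[2] - Om[0])
        A.append(v * Om[2] - Om[1])
        b.append(tau[0] - tau[2] * u)
        b.append(tau[1] - tau[2] * v)
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w
```

Each constraint row comes from cross-multiplying the projection equation, e.g. x'(ω_3·w + τ_z) = ω_1·w + τ_x, so with exact data the stacked system is consistent and the least-squares solution is exact.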
Models for transformations
Algorithm 15.1: ML learning of Euclidean transformation
The Euclidean transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω and a translation τ. To recover these parameters we use the criterion

    Ω̂, τ̂ = argmin_{Ω,τ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[Ω w_i + τ, σ² I] ] ],

where Ω is constrained to be a rotation matrix, so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 15.1: Maximum likelihood learning of Euclidean transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, variance σ²

begin
    // Compute means of the two data sets
    µ_w = Σ_{i=1}^{I} w_i / I
    µ_x = Σ_{i=1}^{I} x_i / I
    // Concatenate data into matrix form
    W = [w_1 - µ_w, w_2 - µ_w, ..., w_I - µ_w]
    X = [x_1 - µ_x, x_2 - µ_x, ..., x_I - µ_x]
    // Solve for rotation
    [U, L, V] = svd[W X^T]
    Ω = V U^T
    // Solve for translation
    τ = Σ_{i=1}^{I} (x_i - Ω w_i) / I
end
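A NumPy sketch of this closed-form solution follows. One caution: the SVD construction above can return a reflection rather than a rotation for some data; the determinant guard below, which enforces det[Ω] = 1, is an addition of mine and is not part of the booklet's pseudocode.

```python
import numpy as np

def fit_euclidean(W, X):
    """W, X: 2 x I matched point sets; returns Omega, tau with X ≈ Omega @ W + tau."""
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    # Rotation from the SVD of the cross-covariance of the centered sets
    U, _, Vt = np.linalg.svd((W - mw) @ (X - mx).T)
    Omega = Vt.T @ U.T
    if np.linalg.det(Omega) < 0:         # guard against a reflection solution
        Vt[-1] *= -1
        Omega = Vt.T @ U.T
    tau = mx - Omega @ mw                # translation from the means
    return Omega, tau
```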
Algorithm 15.2: ML learning of similarity transformation
The similarity transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω, a translation τ, and a scaling factor ρ. To recover these parameters we use the criterion

    Ω̂, τ̂, ρ̂ = argmin_{Ω,τ,ρ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[ρ Ω w_i + τ, σ² I] ] ],

where Ω is constrained to be a rotation matrix, so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 15.2: Maximum likelihood learning of similarity transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, scale ρ, variance σ²

begin
    // Compute means of the two data sets
    µ_w = Σ_{i=1}^{I} w_i / I
    µ_x = Σ_{i=1}^{I} x_i / I
    // Concatenate data into matrix form
    W = [w_1 - µ_w, w_2 - µ_w, ..., w_I - µ_w]
    X = [x_1 - µ_x, x_2 - µ_x, ..., x_I - µ_x]
    // Solve for rotation
    [U, L, V] = svd[W X^T]
    Ω = V U^T
    // Solve for scaling
    ρ = (Σ_{i=1}^{I} (x_i - µ_x)^T Ω (w_i - µ_w)) / (Σ_{i=1}^{I} (w_i - µ_w)^T (w_i - µ_w))
    // Solve for translation
    τ = Σ_{i=1}^{I} (x_i - ρ Ω w_i) / I
end
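The similarity case extends the Euclidean sketch with the scale estimate. As before, the determinant guard against reflections is my addition, not part of the booklet's pseudocode, and the function name is illustrative:

```python
import numpy as np

def fit_similarity(W, X):
    """W, X: 2 x I matched point sets; returns Omega, rho, tau with X ≈ rho*Omega@W + tau."""
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    Wc, Xc = W - mw, X - mx
    # Rotation from the SVD of the cross-covariance
    U, _, Vt = np.linalg.svd(Wc @ Xc.T)
    Omega = Vt.T @ U.T
    if np.linalg.det(Omega) < 0:         # guard against a reflection solution
        Vt[-1] *= -1
        Omega = Vt.T @ U.T
    # Scale: ratio of projected cross terms to the squared norm of the w's
    rho = np.sum(Xc * (Omega @ Wc)) / np.sum(Wc * Wc)
    tau = mx - rho * Omega @ mw
    return Omega, rho, tau
```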
Algorithm 15.3: ML learning of affine transformation
The affine transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a linear transformation Φ and an offset τ. To recover these parameters we use the criterion

    Φ̂, τ̂ = argmin_{Φ,τ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[Φ w_i + τ, σ² I] ] ].
Algorithm 15.3: Maximum likelihood learning of affine transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Linear transformation Φ, offset τ, variance σ²

begin
    // Compute intermediate 2×6 matrices A_i
    for i = 1 to I do
        A_i = [w_i^T, 1, 0^T; 0^T, w_i^T, 1]
    end
    // Concatenate matrices A_i into 2I×6 matrix A
    A = [A_1; A_2; ...; A_I]
    // Concatenate output points into 2I×1 vector c
    c = [x_1; x_2; ...; x_I]
    // Solve for transformation parameters
    φ = (A^T A)^{-1} A^T c
    // Extract parameters
    Φ = [φ_1, φ_2; φ_4, φ_5]
    τ = [φ_3; φ_6]
    // Solve for variance
    σ² = Σ_{i=1}^{I} (x_i - Φ w_i - τ)^T (x_i - Φ w_i - τ) / (2I)
end
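The least-squares construction above translates directly into NumPy; this sketch (with an illustrative function name) builds the stacked system row by row:

```python
import numpy as np

def fit_affine(W, X):
    """W, X: 2 x I matched point sets; returns Phi (2x2), tau (2x1) with X ≈ Phi@W + tau."""
    I = W.shape[1]
    A = np.zeros((2 * I, 6))
    c = X.T.reshape(-1)                      # [x1, y1, x2, y2, ...]
    for i in range(I):
        u, v = W[:, i]
        A[2 * i]     = [u, v, 1, 0, 0, 0]    # constraint from x-coordinate
        A[2 * i + 1] = [0, 0, 0, u, v, 1]    # constraint from y-coordinate
    phi, *_ = np.linalg.lstsq(A, c, rcond=None)
    Phi = np.array([[phi[0], phi[1]], [phi[3], phi[4]]])
    tau = np.array([[phi[2]], [phi[5]]])
    return Phi, tau
```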
Algorithm 15.4: ML learning of projective transformation (homography)
The projective transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a non-linear transformation defined by a 3×3 parameter matrix Φ. To recover this matrix we use the criterion

    Φ̂ = argmin_Φ [ -Σ_{i=1}^{I} log[ Norm_{x_i}[proj[w_i, Φ], σ² I] ] ],

where the function proj[w_i, Φ] applies the homography to the point w_i = [u, v]^T and is defined as

    proj[w_i, Φ] = [ (φ_11 u + φ_12 v + φ_13) / (φ_31 u + φ_32 v + φ_33),
                     (φ_21 u + φ_22 v + φ_23) / (φ_31 u + φ_32 v + φ_33) ]^T.

Unlike the previous three transformations, it is not possible to minimize this criterion in closed form. The best we can do is to compute an approximate solution and use it to start a non-linear minimization.
Algorithm 15.4: Maximum likelihood learning of projective transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Parameter matrix Φ, variance σ²

begin
    // Convert data to homogeneous representation
    for i = 1 to I do
        x̃_i = [x_i; 1] ;  w̃_i = [w_i; 1]
    end
    // Compute intermediate 2×9 matrices A_i
    for i = 1 to I do
        A_i = [0^T, -w̃_i^T, y_i w̃_i^T; w̃_i^T, 0^T, -x_i w̃_i^T]
    end
    // Concatenate matrices A_i into 2I×9 matrix A
    A = [A_1; A_2; ...; A_I]
    // Solve for approximate parameters
    [U, L, V] = svd[A]
    Φ_0 = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
    // Refine parameters with non-linear optimization
    Φ̂ = argmin_Φ [ -Σ_{i=1}^{I} log[ Norm_{x_i}[proj[w_i, Φ], σ² I] ] ]
end
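The approximate (direct linear transform) stage can be sketched in NumPy as follows; the non-linear refinement is omitted, and the function names are my own:

```python
import numpy as np

def fit_homography(W, X):
    """DLT estimate of the 3x3 matrix Phi mapping W to X; W, X are 2 x I with I >= 4."""
    rows = []
    for (u, v), (x, y) in zip(W.T, X.T):
        wh = np.array([u, v, 1.0])
        # Two constraint rows per correspondence
        rows.append(np.concatenate([np.zeros(3), -wh, y * wh]))
        rows.append(np.concatenate([wh, np.zeros(3), -x * wh]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)              # null-space solution, defined up to scale

def proj(Phi, w):
    """Apply homography Phi to a 2D point w."""
    a = Phi @ np.array([w[0], w[1], 1.0])
    return a[:2] / a[2]
```

Since the solution is only defined up to scale, results are usually compared after normalizing, e.g. by dividing by φ_33.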
Algorithm 15.5: ML Inference for transformation models
Consider a transformation model that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} so that

    Pr(x_i | w_i, Φ) = Norm_{x_i}[ trans[w_i, Φ], σ² I ].

In inference we are given a new data point x = [x, y]^T and wish to compute the most likely point w = [u, v]^T that was responsible for it. To make progress, we write the transformation model trans[w_i, Φ] in homogeneous form

    λ [x; y; 1] = [φ_11, φ_12, φ_13; φ_21, φ_22, φ_23; φ_31, φ_32, φ_33] [u; v; 1],

or x̃ = Φ w̃. The Euclidean, similarity, affine, and projective transformations can all be expressed as a 3×3 matrix of this kind.
Algorithm 15.5: Maximum likelihood inference for transformation models

Input : Transformation parameters Φ, new point x
Output: Point w

begin
    // Convert data to homogeneous representation
    x̃ = [x; 1]
    // Apply inverse transformation
    a = Φ^{-1} x̃
    // Convert back to Cartesian coordinates
    w = [a_1/a_3; a_2/a_3]
end
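This inference step is a one-liner in NumPy; solving the linear system is preferable to forming the inverse explicitly (the function name is illustrative):

```python
import numpy as np

def invert_transform(Phi, x):
    """Most likely pre-image of 2D point x under the homogeneous 3x3 transform Phi."""
    a = np.linalg.solve(Phi, np.array([x[0], x[1], 1.0]))   # apply inverse transform
    return a[:2] / a[2]                                     # back to Cartesian coordinates
```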
Algorithm 15.6: Learning extrinsic parameters (planar scene)
Consider a calibrated camera with known intrinsic parameters Λ viewing a planar scene. We are given a set of 2D positions {w_i}_{i=1}^{I} on the plane (measured in real-world units such as cm) and their corresponding 2D pixel positions {x_i}_{i=1}^{I}. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point w = [u, v, w]^T in the frame of reference of the plane (where w = 0 on the plane) into the frame of reference of the camera.

This goal is accomplished by minimizing the criterion

    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ].

The optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix. The bulk of this algorithm consists of computing a good initialization for this minimization.
Algorithm 15.6: ML learning of extrinsic parameters (planar scene)

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
    // Compute homography between pairs of points (algorithm 15.4)
    Φ = LearnHomography[{x_i}_{i=1}^{I}, {w_i}_{i=1}^{I}]
    // Eliminate effect of intrinsic parameters
    Φ = Λ^{-1} Φ
    // Compute SVD of first two columns of Φ
    [U, L, V] = svd[[φ_1, φ_2]]
    // Estimate first two columns of rotation matrix
    [ω_1, ω_2] = [u_1, u_2] V^T
    // Estimate third column by taking cross product
    ω_3 = ω_1 × ω_2
    Ω = [ω_1, ω_2, ω_3]
    // Check that determinant is not minus one
    if |Ω| < 0 then
        Ω = [ω_1, ω_2, -ω_3]
    end
    // Compute scaling factor for translation vector
    λ = (Σ_{i=1}^{3} Σ_{j=1}^{2} ω_ij / φ_ij) / 6
    // Compute translation
    τ = λ φ_3
    // Refine parameters with non-linear optimization
    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ]
end
Algorithm 15.7: Learning intrinsic parameters (planar scene)
This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown poses {Ω_j, τ_j}_{j=1}^{J}. For each image we know I points {w_i}_{i=1}^{I}, where w_i = [u_i, v_i, 0]^T, and we know their imaged positions {x_ij}_{i=1,j=1}^{I,J} in each of the J scenes. The goal is to compute the intrinsic matrix Λ. To this end, we use the criterion

    Λ̂ = argmin_Λ [ Σ_{j=1}^{J} min_{Ω_j,τ_j} [ Σ_{i=1}^{I} (x_ij - pinhole[w_i, Λ, Ω_j, τ_j])^T (x_ij - pinhole[w_i, Λ, Ω_j, τ_j]) ] ],

where again the minimization must be carried out while ensuring that each Ω_j is a valid rotation matrix. The strategy is to alternately estimate the extrinsic parameters using the previous algorithm and then compute the intrinsic parameters in closed form. After several iterations, we use the resulting solution as the initial condition for a non-linear optimization.
Algorithm 15.7: ML learning of intrinsic parameters (planar scene)

Input : World points {w_i}_{i=1}^{I}, image points {x_ij}_{i=1,j=1}^{I,J}, initial Λ
Output: Intrinsic parameters Λ

begin
    // Main loop for alternating optimization
    for k = 1 to K do
        // Compute extrinsic parameters for each image (algorithm 15.6)
        for j = 1 to J do
            [Ω_j, τ_j] = calcExtrinsic[Λ, {w_i, x_ij}_{i=1}^{I}]
        end
        // Compute intrinsic parameters
        for i = 1 to I do
            for j = 1 to J do
                // Compute matrix A_ij (ω_k•,j is the kth row of Ω_j; τ_zj is the z-component of τ_j)
                a_ij = (ω_1•,j w_i + τ_xj) / (ω_3•,j w_i + τ_zj)
                b_ij = (ω_2•,j w_i + τ_yj) / (ω_3•,j w_i + τ_zj)
                A_ij = [a_ij, b_ij, 1, 0, 0; 0, 0, 0, b_ij, 1]
            end
        end
        // Concatenate matrices and data points
        x = [x_11; x_12; ...; x_IJ]
        A = [A_11; A_12; ...; A_IJ]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine parameters with non-linear optimization
    Λ̂ = argmin_Λ [ Σ_{j=1}^{J} min_{Ω_j,τ_j} [ Σ_{i=1}^{I} (x_ij - pinhole[w_i, Λ, Ω_j, τ_j])^T (x_ij - pinhole[w_i, Λ, Ω_j, τ_j]) ] ]
end
Algorithm 15.8: Robust learning of projective transformation with RANSAC
The goal of this algorithm is to fit a homography that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches and the outliers.

The algorithm uses the RANSAC procedure: it repeatedly computes the homography from a minimal subset of matches. Since there are 8 unknowns in the 3×3 matrix that defines the homography, and each match provides two linear constraints (one each from the x- and y-coordinates), we need a minimum of four matches to compute the homography. The RANSAC procedure chooses these four matches randomly, computes the homography, and then measures the amount of agreement in the rest of the dataset. After many iterations of this procedure, we recompute the homography from the randomly chosen matches with the best agreement together with the points that agreed with them (the inliers).
Algorithm 15.8: Robust ML learning of homography

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices B

begin
    // Initialize best inlier set to empty
    B = {}
    for n = 1 to N do
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 4]
        // Compute homography (algorithm 15.4)
        Φ_n = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute squared distance
            d = (x_i - proj[w_i, Φ_n])^T (x_i - proj[w_i, Φ_n])
            // If small enough then add to inliers
            if d < τ² then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |B| then
            B = S_n
        end
    end
    // Recompute homography from all inliers
    Φ = LearnHomography[{x_i}_{i∈B}, {w_i}_{i∈B}]
end
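The RANSAC loop itself is model-agnostic: the same structure is reused for the fundamental matrix in algorithm 16.3. The sketch below therefore abstracts the model-fitting and distance functions into parameters; all names are illustrative, and the trivial 2D-translation model used in testing stands in for the homography estimator.

```python
import numpy as np

def ransac(pairs, fit, sq_dist, n_min, n_steps, thresh, seed=0):
    """Generic RANSAC: returns a model refit on the best inlier set, plus inlier indices.
    fit maps a list of pairs to a model; sq_dist gives squared error of a pair under a model."""
    rng = np.random.default_rng(seed)
    best = []
    for _ in range(n_steps):
        sample = rng.choice(len(pairs), size=n_min, replace=False)  # minimal subset
        model = fit([pairs[i] for i in sample])
        inliers = [i for i, p in enumerate(pairs) if sq_dist(model, p) < thresh ** 2]
        if len(inliers) > len(best):
            best = inliers                                          # keep largest consensus set
    return fit([pairs[i] for i in best]), best
```

For homographies one would pass n_min=4, fit = the DLT estimator, and sq_dist = the squared reprojection error (x_i - proj[w_i, Φ])^T (x_i - proj[w_i, Φ]).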
Algorithm 15.9: Sequential RANSAC for fitting homographies
Sequential RANSAC fits K homographies to disjoint subsets of the point pairs {w_i, x_i}_{i=1}^{I}. The procedure is greedy: the algorithm fits the first homography, removes its inliers from the set of point pairs, and then tries to fit a second homography to the remaining points. In principle, this algorithm can find a set of matching planes between two images. However, in practice it often makes mistakes: it does not exploit information about the spatial coherence of the matches, and it cannot recover from errors made in the greedy matching procedure.
Algorithm 15.9: Robust sequential learning of homographies

Input : Points {x_i, w_i}_{i=1}^{I}, RANSAC steps N, inlier threshold τ, number of homographies K
Output: K homographies {Φ_k} and associated inlier indices {I_k}

begin
    // Initialize set of indices of remaining point pairs
    S = {1 ... I}
    for k = 1 to K do
        // Compute homography using RANSAC (algorithm 15.8)
        [Φ_k, I_k] = LearnHomographyRobust[{x_i}_{i∈S}, {w_i}_{i∈S}, N, τ]
        // Remove inliers from remaining points
        S = S \ I_k
        // Check that there are enough remaining points
        if |S| < 4 then
            break
        end
    end
end
Algorithm 15.10: PEaRL for fitting homographies
The propose, expand, and re-learn (PEaRL) algorithm attempts to make up for the deficiencies of sequential RANSAC for fitting homographies. It first proposes a large number of possible homographies relating the point pairs {w_i, x_i}_{i=1}^{I}. These then compete for the point pairs to be assigned to them, and they are re-learnt based on these assignments. The algorithm has a spatial component that encourages nearby points to belong to the same model, and it is iterative rather than greedy and so can recover from errors.
Algorithm 15.10: PEaRL learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {N_i}_{i=1}^{I}, pairwise cost P
Output: Set of homographies {Φ_m} and associated inlier indices {I_m}

begin
    // Propose step: generate M hypotheses
    m = 1                                            // Hypothesis number
    repeat
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 4]
        // Compute homography (algorithm 15.4)
        Φ_m = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        I_m = {}                                     // Initialize inlier set to empty
        for i = 1 to I do
            d_im = (x_i - proj[w_i, Φ_m])^T (x_i - proj[w_i, Φ_m])
            if d_im < τ² then                        // If distance small, add to inliers
                I_m = I_m ∪ {i}
            end
        end
        if |I_m| ≥ l then                            // If enough inliers, move to next hypothesis
            m = m + 1
        end
    until m > M
    for j = 1 to J do
        // Expand step: returns I×1 label vector l (D contains the distances d_im)
        l = AlphaExpand[D, P, {N_i}_{i=1}^{I}]
        // Re-learn step: re-estimate homographies with support
        for m = 1 to M do
            I_m = find[l == m]                       // Extract points with label m
            // If enough support then re-learn and update distances
            if |I_m| ≥ 4 then
                Φ_m = LearnHomography[{x_i}_{i∈I_m}, {w_i}_{i∈I_m}]
                for i = 1 to I do
                    d_im = (x_i - proj[w_i, Φ_m])^T (x_i - proj[w_i, Φ_m])
                end
            end
        end
    end
end
Multiple cameras
Algorithm 16.1: Camera geometry from point matches
This algorithm finds approximate estimates of the rotation and translation (up to scale) between two cameras given a set of I point matches {x_i1, x_i2}_{i=1}^{I} between two images. More precisely, the algorithm assumes that the first camera is at the world origin and recovers the extrinsic parameters of the second camera.

There is a fourfold ambiguity in the possible solution due to the symmetry of the camera model, which allows points behind the camera to be imaged even though this is clearly not possible in the real world. The algorithm distinguishes between the four solutions by reconstructing all of the points with each one and choosing the solution for which the largest number of points lies in front of both cameras.
Algorithm 16.1: Extracting relative camera position from point matches

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, intrinsic matrices Λ_1, Λ_2
Output: Rotation Ω and translation τ between the cameras

begin
    // Compute fundamental matrix (algorithm 16.2)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^{I}]
    // Compute essential matrix
    E = Λ_2^T F Λ_1
    // Extract four possible rotations and translations from E
    W = [0, -1, 0; 1, 0, 0; 0, 0, -1]
    [U, L, V] = svd[E]
    τ_1 = U L W U^T ;     Ω_1 = U W^{-1} V^T
    τ_2 = U L W^{-1} U^T ; Ω_2 = U W V^T
    τ_3 = -τ_1 ;          Ω_3 = Ω_1
    τ_4 = -τ_2 ;          Ω_4 = Ω_2
    // For each possible solution
    for k = 1 to 4 do
        t_k = 0                                  // Number of points in front of cameras for kth solution
        // For each point
        for i = 1 to I do
            // Reconstruct point (algorithm 14.3); first camera has Ω = I, τ = 0
            w = Reconstruct[x_i1, x_i2, Λ_1, Λ_2, I, 0, Ω_k, τ_k]
            // Compute point in frame of reference of second camera
            w' = Ω_k w + τ_k
            // Test whether point is reconstructed in front of both cameras
            if w_3 > 0 and w'_3 > 0 then
                t_k = t_k + 1
            end
        end
    end
    // Choose solution with most support
    k̂ = argmax_k [t_k]
    Ω = Ω_k̂ ;  τ = τ_k̂
end
Algorithm 16.2: Eight point algorithm for fundamental matrix
This algorithm takes a set of I ≥ 8 point correspondences {x_i1, x_i2}_{i=1}^{I} between two images and computes the fundamental matrix using the eight-point algorithm. To improve numerical stability, the point positions are first transformed to have zero mean and spherical covariance; the resulting fundamental matrix is then modified to compensate for this transformation. This algorithm is usually used to compute an initial estimate for a subsequent non-linear optimization of the symmetric epipolar distance.
Algorithm 16.2: Eight point algorithm for fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}
Output: Fundamental matrix F

begin
    // Compute statistics of data
    µ_1 = Σ_{i=1}^{I} x_i1 / I
    Σ_1 = Σ_{i=1}^{I} (x_i1 - µ_1)(x_i1 - µ_1)^T / I
    µ_2 = Σ_{i=1}^{I} x_i2 / I
    Σ_2 = Σ_{i=1}^{I} (x_i2 - µ_2)(x_i2 - µ_2)^T / I
    for i = 1 to I do
        // Compute transformed coordinates
        x̃_i1 = Σ_1^{-1/2} (x_i1 - µ_1)
        x̃_i2 = Σ_2^{-1/2} (x_i2 - µ_2)
        // Compute constraint
        A_i = [x̃_i2 x̃_i1, x̃_i2 ỹ_i1, x̃_i2, ỹ_i2 x̃_i1, ỹ_i2 ỹ_i1, ỹ_i2, x̃_i1, ỹ_i1, 1]
    end
    // Append constraints and solve
    A = [A_1; A_2; ...; A_I]
    [U, L, V] = svd[A]
    F = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
    // Compensate for transformation
    T_1 = [Σ_1^{-1/2}, -Σ_1^{-1/2} µ_1; 0, 0, 1]
    T_2 = [Σ_2^{-1/2}, -Σ_2^{-1/2} µ_2; 0, 0, 1]
    F = T_2^T F T_1
    // Ensure that matrix has rank 2
    [U, L, V] = svd[F]
    l_33 = 0
    F = U L V^T
end
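A minimal NumPy sketch of the core estimation follows. For brevity it omits the normalizing transformation described above (which matters for real, noisy pixel coordinates) and keeps only the null-space solution and the rank-2 projection; the function name is my own.

```python
import numpy as np

def eight_point(X1, X2):
    """Fundamental matrix from I >= 8 matches; X1, X2 are 2 x I arrays.
    The normalizing transformation of the full algorithm is omitted for brevity."""
    rows = [[x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, 1.0]
            for (x1, y1), (x2, y2) in zip(X1.T, X2.T)]
    _, _, Vt = np.linalg.svd(np.array(rows))
    F = Vt[-1].reshape(3, 3)                 # null-space solution, up to scale
    # Project to the nearest rank-2 matrix by zeroing the smallest singular value
    U, L, Vt = np.linalg.svd(F)
    L[2] = 0.0
    return U @ np.diag(L) @ Vt
```

With exact correspondences the recovered F satisfies the epipolar constraint x̃_2^T F x̃_1 = 0 for every match.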
Algorithm 16.3: Robust computation of fundamental matrix with RANSAC
The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {x_i1, x_i2}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). Robustness is achieved by applying the RANSAC algorithm. Since the fundamental matrix has eight unknown quantities, we randomly select eight point pairs at each stage of the algorithm (each pair contributes one constraint). The algorithm also returns the true matches.
Algorithm 16.3: Robust ML fitting of fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, set of inlier indices B

begin
    // Initialize best inlier set to empty
    B = {}
    for n = 1 to N do
        // Draw 8 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 8]
        // Compute fundamental matrix (algorithm 16.2)
        F_n = ComputeFundamental[{x_i1}_{i∈R}, {x_i2}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute epipolar line in first image
            x̃_i2 = [x_i2; 1]
            l_1 = x̃_i2^T F_n
            // Compute squared distance to epipolar line
            d_1 = (l_11 x_i1 + l_12 y_i1 + l_13)² / (l_11² + l_12²)
            // Compute epipolar line in second image
            x̃_i1 = [x_i1; 1]
            l_2 = F_n x̃_i1
            // Compute squared distance to epipolar line
            d_2 = (l_21 x_i2 + l_22 y_i2 + l_23)² / (l_21² + l_22²)
            // If both small enough then add to inliers
            if d_1 < τ² and d_2 < τ² then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |B| then
            B = S_n
        end
    end
    // Recompute fundamental matrix from all inliers
    F = ComputeFundamental[{x_i1}_{i∈B}, {x_i2}_{i∈B}]
end
Algorithm 16.4: Planar rectification
This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity along the x-axis. The homography for the first image is chosen so that corresponding points lie on the same horizontal lines as in the rectified second image and the distance between the matches is smallest in a least-squares sense (i.e., the disparity is smallest).
Algorithm 16.4: Planar rectification

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}
Output: Homographies Φ_1, Φ_2 to transform first and second images

begin
    // Compute fundamental matrix (algorithm 16.2)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^{I}]
    // Compute epipole in image 2
    [U, L, V] = svd[F]
    e = [u_13, u_23, u_33]^T
    // Compute three transformation matrices ([δ_x, δ_y] is the image center)
    T_1 = [1, 0, -δ_x; 0, 1, -δ_y; 0, 0, 1]
    θ = atan2[e_y - δ_y, e_x - δ_x]
    T_2 = [cos[θ], sin[θ], 0; -sin[θ], cos[θ], 0; 0, 0, 1]
    T_3 = [1, 0, 0; 0, 1, 0; -1/f, 0, 1]     // f is the x-coordinate of the transformed epipole
    // Compute homography for second image
    Φ_2 = T_3 T_2 T_1
    // Compute factorization of fundamental matrix
    L = diag[l_11, l_22, (l_11 + l_22)/2]
    W = [0, -1, 0; 1, 0, 0; 0, 0, 1]
    M = U L W V^T
    // Prepare matrices for solution for Φ_1
    for i = 1 to I do
        // Transform points
        x'_i1 = hom[x_i1, Φ_2 M]
        x'_i2 = hom[x_i2, Φ_2]
        // Create elements of A and b
        A_i = [x'_i1, y'_i1, 1]
        b_i = x'_i2
    end
    // Concatenate elements of A and b
    A = [A_1; A_2; ...; A_I]
    b = [b_1; b_2; ...; b_I]
    // Solve for α
    α = (A^T A)^{-1} A^T b
    // Calculate homography for first image
    Φ_1 = (I + [1, 0, 0]^T α^T) Φ_2 M
end
Models for shape
Algorithm 17.1: Generalized Procrustes analysis
The goal of generalized Procrustes analysis is to align a set of shape vectors {w_i}_{i=1}^{I} with respect to a given transformation family (Euclidean, similarity, affine, etc.). Each shape vector consists of a set of N 2D points, w_i = [w_i1^T, w_i2^T, ..., w_iN^T]^T. The algorithm below uses the example of registering with respect to a similarity transformation, which consists of a rotation Ω, a scaling ρ, and a translation τ.
Algorithm 17.1: Generalized Procrustes analysis

Input : Shape vectors {w_i}_{i=1}^I, number of iterations K
Output: Template w̄, transformations {Ω_i, ρ_i, τ_i}_{i=1}^I
begin
  Initialize w̄ = w_1
  // Main iteration loop
  for k=1 to K do
    // Compute transformation from template to each shape (algorithm 15.2)
    for i=1 to I do
      [Ω_i, ρ_i, τ_i] = EstimateSimilarity[{w̄_n}_{n=1}^N, {w_in}_{n=1}^N]
    end
    // Update template points (average of inverse-transformed shapes)
    w̄_n = Σ_{i=1}^I Ω_i^T (w_in − τ_i)/(I ρ_i)
    // Normalize template
    w̄ = w̄/|w̄|
  end
end
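As a concrete sketch, the loop above translates into a few lines of NumPy. The book's EstimateSimilarity routine (algorithm 15.2) is not reproduced in this booklet, so a standard least-squares similarity fit via the SVD stands in for it here; shapes are stored as N×2 arrays rather than stacked 2N×1 vectors, and the function names are my own.

```python
import numpy as np

def estimate_similarity(tpl, W):
    """Least-squares fit of W ≈ rho * Omega @ tpl + tau (points stored as rows)."""
    mu_t, mu_w = tpl.mean(axis=0), W.mean(axis=0)
    A, B = tpl - mu_t, W - mu_w
    U, S, Vt = np.linalg.svd(B.T @ A)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])   # exclude reflections
    Omega = U @ D @ Vt
    rho = (S * np.diag(D)).sum() / (A ** 2).sum()
    tau = mu_w - rho * Omega @ mu_t
    return Omega, rho, tau

def generalized_procrustes(shapes, n_iter=20):
    """Align a list of N x 2 shapes; returns the normalized template."""
    w_bar = shapes[0].copy()
    for _ in range(n_iter):
        acc = np.zeros_like(w_bar)
        for W in shapes:
            Omega, rho, tau = estimate_similarity(w_bar, W)
            acc += (W - tau) @ Omega / rho    # inverse similarity transform
        w_bar = acc / len(shapes)             # average of inverse transforms
        w_bar /= np.linalg.norm(w_bar)        # normalize template
    return w_bar
```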
Algorithm 17.2: Probabilistic principal components analysis
The probabilistic principal components analysis model describes a set of I D×1 data examples {x_i}_{i=1}^I with the model

Pr(x_i) = Norm_{x_i}[μ, ΦΦ^T + σ²I],

where μ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K-dimensional subspace, and the parameter σ² explains the variation of the data around this subspace.

Notice that this model is very similar to factor analysis (see algorithm 6.3). The only difference is that here we have spherical additive noise σ²I rather than a diagonal noise component Σ. This small change has important ramifications for the learning algorithm: we no longer need an iterative learning procedure based on the EM algorithm, and can instead learn the parameters in closed form.
Algorithm 17.2: ML learning of PPCA model

Input : Training data {x_i}_{i=1}^I, number of principal components K
Output: Parameters μ, Φ, σ²
begin
  // Estimate mean parameter
  μ = Σ_{i=1}^I x_i/I
  // Form matrix of zero-mean data
  X = [x_1 − μ, x_2 − μ, . . . , x_I − μ]
  // Compute eigenvectors V and eigenvalues L of the scatter matrix
  [V, L, V^T] = svd[X^T X]
  U = XVL^{−1/2}
  // Estimate noise parameter
  σ² = Σ_{j=K+1}^D l_jj/(D − K)
  // Estimate principal components
  U_K = [u_1, u_2, . . . , u_K]
  L_K = diag[l_11, l_22, . . . , l_KK]
  Φ = U_K (L_K − σ²I)^{1/2}
end
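A direct NumPy translation is sketched below. One caveat: the SVD of the centered data yields eigenvalues of the scatter matrix, so this sketch divides by I to obtain the sample-covariance eigenvalues used in the standard closed-form solution; the function name is my own.

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML fit of PPCA; X is D x I with one example per column."""
    D, I = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Left singular vectors of Xc; eigenvalues of the sample covariance
    U, s, _ = np.linalg.svd(Xc, full_matrices=True)
    lam = np.zeros(D)
    lam[:min(D, I)] = s ** 2 / I
    # Discarded variance becomes the spherical noise estimate
    sigma2 = lam[K:].sum() / (D - K)
    # Principal-component matrix Phi = U_K (L_K - sigma^2 I)^{1/2}
    Phi = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma2))
    return mu, Phi, sigma2
```

A useful sanity check: the top K eigenvalues of the model covariance ΦΦ^T + σ²I coincide with the top K eigenvalues of the sample covariance, and the remaining ones all equal σ².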
Models for style and identity
Algorithm 18.1: ML learning of subspace identity model
This model describes the jth of J data examples from the ith of I identities as

x_ij = μ + Φh_i + ε_ij,

where x_ij is the D×1 observed data vector, μ is the D×1 mean vector, Φ is the D×K factor matrix, h_i is the K×1 hidden variable representing the identity, and ε_ij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.
Algorithm 18.1: Maximum likelihood learning for identity subspace model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, number of factors K
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Σ}
begin
  Initialize θ = θ_0 (a)
  // Set mean
  μ = Σ_{i=1}^I Σ_{j=1}^J x_ij/(IJ)
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (JΦ^TΣ^{−1}Φ + I)^{−1} Φ^TΣ^{−1} Σ_{j=1}^J (x_ij − μ)
      E[h_i h_i^T] = (JΦ^TΣ^{−1}Φ + I)^{−1} + E[h_i]E[h_i]^T
    end
    // Maximization step
    Φ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − μ)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − μ)(x_ij − μ)^T − ΦE[h_i](x_ij − μ)^T]
    // Compute data log likelihood
    for i=1 to I do
      x′_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T   // compound data vector, JD×1
    end
    μ′ = [μ^T, μ^T, . . . , μ^T]^T                // compound mean vector, JD×1
    Φ′ = [Φ^T, Φ^T, . . . , Φ^T]^T                // compound factor matrix, JD×K
    Σ′ = diag[Σ, Σ, . . . , Σ]                    // compound covariance, JD×JD
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]] (b)
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating the inverse of this covariance matrix using the matrix inversion lemma.
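For small problems, the E- and M-steps vectorize neatly when the data are stored as an I×J×D array. The sketch below assumes the same number J of examples per identity and evaluates the compound log likelihood naively to monitor convergence (footnote b's matrix-inversion-lemma trick is omitted); the function name is my own.

```python
import numpy as np

def fit_identity_subspace(X, K, n_iter=15, seed=0):
    """EM for x_ij = mu + Phi h_i + eps_ij; X has shape (I, J, D)."""
    I, J, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=(0, 1))
    Xc = X - mu
    Phi = rng.standard_normal((D, K))
    Sigma = X.reshape(-1, D).var(axis=0)        # diagonal, kept as a vector
    hist = []
    for _ in range(n_iter):
        # E-step: shared posterior precision, per-identity moments
        PtSi = Phi.T / Sigma                     # Phi^T Sigma^{-1}
        Cov = np.linalg.inv(J * PtSi @ Phi + np.eye(K))
        Eh = Xc.sum(axis=1) @ PtSi.T @ Cov.T     # (I, K) matrix of E[h_i]
        Shh = I * Cov + Eh.T @ Eh                # sum_i E[h_i h_i^T]
        # M-step: joint closed-form update of Phi then Sigma
        Phi = np.einsum('ijd,ik->dk', Xc, Eh) @ np.linalg.inv(J * Shh)
        Sigma = (np.einsum('ijd,ijd->d', Xc, Xc)
                 - np.einsum('id,ijd->d', Eh @ Phi.T, Xc)) / (I * J)
        # Compound log likelihood (naive, small problems only)
        C = np.tile(Phi, (J, 1)) @ np.tile(Phi, (J, 1)).T + np.diag(np.tile(Sigma, J))
        Ci = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        Xf = Xc.reshape(I, J * D)
        hist.append(-0.5 * (I * (J * D * np.log(2 * np.pi) + logdet)
                            + np.einsum('id,de,ie->', Xf, Ci, Xf)))
    return mu, Phi, Sigma, hist
```

The returned history should be non-decreasing, which is a convenient check on any implementation of this EM procedure.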
Algorithm 18.2: ML learning of PLDA model
PLDA describes the jth of J data examples from the ith of I identities as

x_ij = μ + Φh_i + Ψs_ij + ε_ij,

where all terms are as in the subspace identity model, but now we add Ψ, the D×L within-individual factor matrix, and s_ij, the L×1 style variable.
Algorithm 18.2: Maximum likelihood learning for PLDA model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, numbers of factors K, L
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Ψ, Σ}
begin
  Initialize θ = θ_0 (a)
  // Set mean
  μ = Σ_{i=1}^I Σ_{j=1}^J x_ij/(IJ)
  repeat
    μ′ = [μ^T, μ^T, . . . , μ^T]^T    // compound mean vector, JD×1
    Φ′ = [Φ^T, Φ^T, . . . , Φ^T]^T    // compound factor matrix 1, JD×K
    Ψ′ = diag[Ψ, Ψ, . . . , Ψ]        // compound factor matrix 2, JD×JL
    Φ′ = [Φ′, Ψ′]                     // concatenate matrices, JD×(K+JL)
    Σ′ = diag[Σ, Σ, . . . , Σ]        // compound covariance, JD×JD
    // Expectation step
    for i=1 to I do
      x′_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T   // compound data vector, JD×1
      μ_{h′_i} = (Φ′^TΣ′^{−1}Φ′ + I)^{−1} Φ′^TΣ′^{−1}(x′_i − μ′)
      Σ_{h′_i} = (Φ′^TΣ′^{−1}Φ′ + I)^{−1} + μ_{h′_i} μ_{h′_i}^T
      for j=1 to J do
        S_ij = [1 . . . K, K+(j−1)L+1 . . . K+jL]   // indices of h_i and s_ij
        E[h′′_ij] = μ_{h′_i}(S_ij)                  // extract subvector of mean
        E[h′′_ij h′′_ij^T] = Σ_{h′_i}(S_ij, S_ij)   // extract submatrix from second moment
      end
    end
    // Maximization step
    Φ′′ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − μ)E[h′′_ij]^T)(Σ_{i=1}^I Σ_{j=1}^J E[h′′_ij h′′_ij^T])^{−1}
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − μ)(x_ij − μ)^T − Φ′′E[h′′_ij](x_ij − μ)^T]
    Φ = Φ′′(:, 1:K)         // extract original factor matrix
    Ψ = Φ′′(:, K+1:K+L)     // extract other factor matrix
    // Compute data log likelihood
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]]
  until no further improvement in L
end

(a) Initialize Ψ to random values; initialize the other parameters as in the identity subspace model.
Algorithm 18.3: ML learning of asymmetric bilinear model
This model describes the jth of J data examples from the ith of I identities, observed in the sth of S styles, as

x_ijs = μ_s + Φ_s h_i + ε_ijs,

where the terms have the same interpretation as for the subspace identity model, except that there is now one set of parameters θ_s = {μ_s, Φ_s, Σ_s} per style s.
Algorithm 18.3: Maximum likelihood learning for asymmetric bilinear model

Input : Training data {x_ijs}_{i=1,j=1,s=1}^{I,J,S}, number of factors K
Output: ML estimates of parameters θ = {μ_{1...S}, Φ_{1...S}, Σ_{1...S}}
begin
  Initialize θ = θ_0
  for s=1 to S do
    // Set mean
    μ_s = Σ_{i=1}^I Σ_{j=1}^J x_ijs/(IJ)
  end
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (I + J Σ_{s=1}^S Φ_s^TΣ_s^{−1}Φ_s)^{−1} Σ_{s=1}^S Φ_s^TΣ_s^{−1} Σ_{j=1}^J (x_ijs − μ_s)
      E[h_i h_i^T] = (I + J Σ_{s=1}^S Φ_s^TΣ_s^{−1}Φ_s)^{−1} + E[h_i]E[h_i]^T
    end
    // Maximization step
    for s=1 to S do
      Φ_s = (Σ_{i=1}^I Σ_{j=1}^J (x_ijs − μ_s)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
      Σ_s = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ijs − μ_s)(x_ijs − μ_s)^T − Φ_s E[h_i](x_ijs − μ_s)^T]
    end
    // Compute data log likelihood
    for s=1 to S do
      μ′_s = [μ_s^T, μ_s^T, . . . , μ_s^T]^T    // J copies, JD×1
      Φ′_s = [Φ_s^T, Φ_s^T, . . . , Φ_s^T]^T    // J copies, JD×K
      Σ′_s = diag[Σ_s, Σ_s, . . . , Σ_s]        // J copies, JD×JD
      for i=1 to I do
        x′_is = [x_i1s^T, x_i2s^T, . . . , x_iJs^T]^T
      end
    end
    for i=1 to I do
      x′_i = [x′_i1^T, x′_i2^T, . . . , x′_iS^T]^T   // compound data vector, JSD×1
    end
    μ′ = [μ′_1^T, μ′_2^T, . . . , μ′_S^T]^T          // compound mean vector, JSD×1
    Φ′ = [Φ′_1^T, Φ′_2^T, . . . , Φ′_S^T]^T          // compound factor matrix, JSD×K
    Σ′ = diag[Σ′_1, Σ′_2, . . . , Σ′_S]              // compound covariance, JSD×JSD
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]]
  until no further improvement in L
end
Algorithm 18.4: Style translation with asymmetric bilinear model
To translate a data example from one style to another, we first estimate the hidden identity variable associated with the example, and then use the generative equation to simulate the new style. We cannot know the hidden variable for certain, but we can compute its posterior distribution, which has a Gaussian form, and choose the MAP solution, which is the mean of this Gaussian.
Algorithm 18.4: Style translation with asymmetric bilinear model

Input : Example x in style s1, model parameters θ
Output: Prediction x* for the data in style s2
begin
  // Estimate hidden variable
  E[h] = (I + Φ_{s1}^T Σ_{s1}^{−1} Φ_{s1})^{−1} Φ_{s1}^T Σ_{s1}^{−1}(x − μ_{s1})
  // Predict in different style
  x* = μ_{s2} + Φ_{s2} E[h]
end
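The two steps translate directly into code; this sketch assumes the style-specific covariance is supplied as a full matrix, and the function name is my own.

```python
import numpy as np

def translate_style(x, mu1, Phi1, Sigma1, mu2, Phi2):
    """MAP identity estimate under style s1, regenerated in style s2."""
    Si = np.linalg.inv(Sigma1)
    K = Phi1.shape[1]
    # E[h] = (I + Phi1^T Sigma1^{-1} Phi1)^{-1} Phi1^T Sigma1^{-1} (x - mu1)
    h = np.linalg.solve(np.eye(K) + Phi1.T @ Si @ Phi1, Phi1.T @ Si @ (x - mu1))
    # Regenerate with the second style's parameters
    return mu2 + Phi2 @ h
```

When the noise is small, the hidden variable is recovered almost exactly and the translation is near-deterministic.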
Temporal models
Algorithm 19.1: Kalman filter
To define the Kalman filter, we must specify the temporal and measurement models. First, the temporal model relates the states w at times t−1 and t and is given by

Pr(w_t|w_{t−1}) = Norm_{w_t}[μ_p + Ψw_{t−1}, Σ_p],

where μ_p is a D_w×1 vector that represents the mean change in the state, and Ψ is a D_w×D_w matrix that relates the mean of the state at time t to the state at time t−1; this is known as the transition matrix. The transition noise Σ_p determines how closely related the states are at times t and t−1.

Second, the measurement model relates the data x_t at time t to the state w_t:

Pr(x_t|w_t) = Norm_{x_t}[μ_m + Φw_t, Σ_m],

where μ_m is a D_x×1 mean vector and Φ is a D_x×D_w matrix relating the D_x×1 measurement vector to the D_w×1 state. The measurement noise Σ_m defines additional uncertainty in the measurements that cannot be explained by the state.

The Kalman filter is a set of rules for computing the marginal posterior probability Pr(w_t|x_{1...t}) from a normally distributed estimate of the marginal posterior Pr(w_{t−1}|x_{1...t−1}) at the previous time step and a new measurement x_t. In the algorithm we denote the mean and variance of the marginal posterior at time t−1 by μ_{t−1} and Σ_{t−1}.
Algorithm 19.1: The Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal parameters μ_p, Ψ, Σ_p, measurement parameters μ_m, Φ, Σ_m
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0    // typically set to a large multiple of the identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = μ_p + Ψμ_{t−1}
    // Covariance prediction
    Σ_+ = Σ_p + ΨΣ_{t−1}Ψ^T
    // Compute Kalman gain
    K = Σ_+Φ^T (Σ_m + ΦΣ_+Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K(x_t − μ_m − Φμ_+)
    // Covariance update
    Σ_t = (I − KΦ)Σ_+
  end
end
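The five update equations map one-to-one onto NumPy operations; this sketch stores the per-step moments in Python lists and is not optimized.

```python
import numpy as np

def kalman_filter(xs, mu_p, Psi, Sigma_p, mu_m, Phi, Sigma_m, mu0, Sigma0):
    """Forward recursion; returns posterior means and covariances per step."""
    mu, Sigma = mu0, Sigma0
    means, covs = [], []
    for x in xs:
        # State and covariance prediction
        mu_plus = mu_p + Psi @ mu
        Sigma_plus = Sigma_p + Psi @ Sigma @ Psi.T
        # Kalman gain
        K = Sigma_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ Sigma_plus @ Phi.T)
        # State and covariance update
        mu = mu_plus + K @ (x - mu_m - Phi @ mu_plus)
        Sigma = (np.eye(len(mu0)) - K @ Phi) @ Sigma_plus
        means.append(mu)
        covs.append(Sigma)
    return means, covs
```

For a static 1D state observed repeatedly, the posterior mean should converge to the measurement value and the posterior variance should shrink.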
Algorithm 19.2: Fixed interval Kalman smoother
The fixed interval smoother consists of a backward set of recursions that estimate the marginal posterior distributions Pr(w_t|x_{1...T}) of the state at each time step, taking into account all of the measurements x_{1...T}. In these recursions, the marginal posterior Pr(w_t|x_{1...T}) of the state at time t is updated and, based on this result, the marginal posterior Pr(w_{t−1}|x_{1...T}) at time t−1 is updated, and so on.

In the algorithm, we denote the mean and variance of the marginal posterior Pr(w_t|x_{1...T}) at time t by μ_{t|T} and Σ_{t|T}, respectively. The notation μ_{+|t} and Σ_{+|t} denotes the mean and variance of the predictive distribution Pr(w_t|x_{1...t−1}) of the state at time t based on the measurements up to time t−1 (i.e., what we denoted by μ_+ and Σ_+ during the forward Kalman filter recursions), and μ_{t|t} and Σ_{t|t} denote the filtered moments.
Algorithm 19.2: Fixed interval Kalman smoother

Input : Means and variances {μ_{t|t}, Σ_{t|t}, μ_{+|t}, Σ_{+|t}}_{t=1}^T, temporal parameter Ψ
Output: Means {μ_{t|T}}_{t=1}^T and covariances {Σ_{t|T}}_{t=1}^T of marginal posterior distributions
begin
  // For each time step, working backward
  for t=T−1 to 1 do
    // Compute gain matrix
    C_t = Σ_{t|t} Ψ^T Σ_{+|t+1}^{−1}
    // Compute mean
    μ_{t|T} = μ_{t|t} + C_t(μ_{t+1|T} − μ_{+|t+1})
    // Compute variance
    Σ_{t|T} = Σ_{t|t} + C_t(Σ_{t+1|T} − Σ_{+|t+1})C_t^T
  end
end
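The backward pass needs both the filtered and the predicted moments from the forward pass, so the sketch below stores all four sequences; the offsets μ_p and μ_m are dropped for brevity, and the function names are my own.

```python
import numpy as np

def kalman_forward(xs, Psi, Sigma_p, Phi, Sigma_m, mu0, Sigma0):
    """Forward pass storing filtered (mu_t, Sigma_t) and predicted (mu_+, Sigma_+) moments."""
    mus, Sigs, mu_pred, Sig_pred = [], [], [], []
    mu, Sig = mu0, Sigma0
    for x in xs:
        m_plus = Psi @ mu
        S_plus = Sigma_p + Psi @ Sig @ Psi.T
        K = S_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ S_plus @ Phi.T)
        mu = m_plus + K @ (x - Phi @ m_plus)
        Sig = (np.eye(len(mu0)) - K @ Phi) @ S_plus
        mus.append(mu); Sigs.append(Sig)
        mu_pred.append(m_plus); Sig_pred.append(S_plus)
    return mus, Sigs, mu_pred, Sig_pred

def rts_smooth(mus, Sigs, mu_pred, Sig_pred, Psi):
    """Backward recursion refining each filtered estimate with future data."""
    mu_s, Sig_s = list(mus), list(Sigs)
    for t in range(len(mus) - 2, -1, -1):
        C = Sigs[t] @ Psi.T @ np.linalg.inv(Sig_pred[t + 1])
        mu_s[t] = mus[t] + C @ (mu_s[t + 1] - mu_pred[t + 1])
        Sig_s[t] = Sigs[t] + C @ (Sig_s[t + 1] - Sig_pred[t + 1]) @ C.T
    return mu_s, Sig_s
```

A quick check: the smoothed variance at any time step is never larger than the corresponding filtered variance, since it conditions on strictly more data.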
Algorithm 19.3: Extended Kalman filter
The extended Kalman filter (EKF) is designed to cope with more general temporal models, where the relationship between the states at times t−1 and t is an arbitrary nonlinear function f[•, •] of the state at the previous time step and a stochastic contribution ε_p:

w_t = f[w_{t−1}, ε_p],

where the covariance of the noise term ε_p is Σ_p as before. Similarly, it can cope with a nonlinear relationship g[•, •] between the state and the measurements:

x_t = g[w_t, ε_m],

where the covariance of ε_m is Σ_m.

The extended Kalman filter works by taking linear approximations to the nonlinear functions at the peak of the current estimate using the Taylor expansion. We define the Jacobian matrices

Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0},    Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0},
Φ = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0},                Υ_m = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0},

where |_{μ_+,0} denotes that the derivative is evaluated at the position w = μ_+ and ε = 0.
Algorithm 19.3: The extended Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal function f[•, •], measurement function g[•, •]
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0    // typically set to a large multiple of the identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = f[μ_{t−1}, 0]
    // Covariance prediction
    Σ_+ = ΨΣ_{t−1}Ψ^T + Υ_p Σ_p Υ_p^T
    // Compute Kalman gain
    K = Σ_+Φ^T (Υ_m Σ_m Υ_m^T + ΦΣ_+Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K(x_t − g[μ_+, 0])
    // Covariance update
    Σ_t = (I − KΦ)Σ_+
  end
end
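A sketch for the common additive-noise special case w_t = f[w_{t−1}] + ε_p and x_t = g[w_t] + ε_m, in which the noise Jacobians Υ_p and Υ_m are identity matrices; the caller supplies the state Jacobians, and the function names are my own.

```python
import numpy as np

def ekf(xs, f, F_jac, g, G_jac, Sigma_p, Sigma_m, mu0, Sigma0):
    """EKF for additive-noise models; F_jac/G_jac return the state Jacobians."""
    mu, Sigma = mu0, Sigma0
    means, covs = [], []
    for x in xs:
        # Linearize the temporal model at the previous posterior mean
        Psi = F_jac(mu)
        mu_plus = f(mu)
        Sigma_plus = Psi @ Sigma @ Psi.T + Sigma_p
        # Linearize the measurement model at the predicted mean
        Phi = G_jac(mu_plus)
        K = Sigma_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ Sigma_plus @ Phi.T)
        mu = mu_plus + K @ (x - g(mu_plus))
        Sigma = (np.eye(len(mu0)) - K @ Phi) @ Sigma_plus
        means.append(mu); covs.append(Sigma)
    return means, covs
```

When f and g are linear, this reduces exactly to the ordinary Kalman filter, which is a convenient correctness check.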
Algorithm 19.4: Iterated extended Kalman filter
The iterated extended Kalman filter passes Q times through the dataset, repeating the computations of the extended Kalman filter. At each iteration it linearizes around the previous estimate of the state, with the idea that the linear approximation will get better and better. We define the initial Jacobian matrices as before:

Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0},    Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0},
Φ^0 = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0},              Υ_m^0 = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0}.

However, on the qth iteration, we use the Jacobians

Φ^q = ∂g[w_t, ε_m]/∂w_t |_{μ_t^{q−1},0},    Υ_m^q = ∂g[w_t, ε_m]/∂ε_m |_{μ_t^{q−1},0},

where μ_t^{q−1} is the estimate of the state at the tth time step on the (q−1)th iteration.
Algorithm 19.4: The iterated extended Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal function f[•, •], measurement function g[•, •]
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // For each iteration
  for q=0 to Q do
    // Initialize mean and covariance
    μ_0 = 0
    Σ_0 = Σ_0    // typically set to a large multiple of the identity
    // For each time step
    for t=1 to T do
      // State prediction
      μ_+ = f[μ_{t−1}, 0]
      // Covariance prediction
      Σ_+ = ΨΣ_{t−1}Ψ^T + Υ_p Σ_p Υ_p^T
      // Compute Kalman gain
      K = Σ_+Φ^{qT} (Υ_m^q Σ_m Υ_m^{qT} + Φ^qΣ_+Φ^{qT})^{−1}
      // State update
      μ_t^q = μ_+ + K(x_t − g[μ_+, 0])
      // Covariance update
      Σ_t^q = (I − KΦ^q)Σ_+
    end
  end
end

This algorithm can be improved by running the fixed interval smoother in between each iteration and re-linearizing around the smoothed estimates.
Algorithm 19.5: Unscented Kalman filter
The unscented Kalman filter is an alternative to the extended Kalman filter. It works by approximating the Gaussian state distribution as a set of particles with the same mean and covariance, passing these particles through the nonlinear temporal and measurement equations, and then recomputing the mean and covariance from the new positions of the particles. In the algorithm below, we assume that the state has dimension D_w and use 2D_w + 1 particles to approximate the world state.
Algorithm 19.5: The unscented Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal and measurement functions f[•, •], g[•, •], weight a_0
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // For each time step
  for t=1 to T do
    // Approximate state with 2D_w+1 particles
    w^[0] = μ_{t−1}
    for j=1 to D_w do
      w^[j] = μ_{t−1} + sqrt(D_w/(1−a_0)) Σ_{t−1}^{1/2} e_j
      w^[D_w+j] = μ_{t−1} − sqrt(D_w/(1−a_0)) Σ_{t−1}^{1/2} e_j
      a_j = (1 − a_0)/(2D_w)
    end
    // Pass through temporal equation and compute predicted mean and covariance
    μ_+ = Σ_{j=0}^{2D_w} a_j f[w^[j]]
    Σ_+ = Σ_{j=0}^{2D_w} a_j (f[w^[j]] − μ_+)(f[w^[j]] − μ_+)^T + Σ_p
    // Approximate predicted state with particles
    w^[0] = μ_+
    for j=1 to D_w do
      w^[j] = μ_+ + sqrt(D_w/(1−a_0)) Σ_+^{1/2} e_j
      w^[D_w+j] = μ_+ − sqrt(D_w/(1−a_0)) Σ_+^{1/2} e_j
    end
    // Pass through measurement equation
    for j=0 to 2D_w do
      x^[j] = g[w^[j]]
    end
    // Compute predicted measurement mean and covariance
    μ_x = Σ_{j=0}^{2D_w} a_j x^[j]
    Σ_x = Σ_{j=0}^{2D_w} a_j (x^[j] − μ_x)(x^[j] − μ_x)^T + Σ_m
    // Compute new state mean and covariance
    K = (Σ_{j=0}^{2D_w} a_j (w^[j] − μ_+)(x^[j] − μ_x)^T) Σ_x^{−1}
    μ_t = μ_+ + K(x_t − μ_x)
    Σ_t = Σ_+ − KΣ_xK^T
  end
end
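A sketch of one reading of the algorithm, using a symmetric matrix square root for Σ^{1/2} (any factor with Σ^{1/2}Σ^{1/2T} = Σ works); the function names are my own, and the noise is assumed additive so f and g take only the state.

```python
import numpy as np

def sigma_points(mu, Sigma, a0):
    """2*Dw+1 particles matching (mu, Sigma), with weights a_j as in the text."""
    Dw = len(mu)
    vals, V = np.linalg.eigh(Sigma)
    root = V @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ V.T  # symmetric sqrt
    c = np.sqrt(Dw / (1.0 - a0))
    pts = np.vstack([mu, mu + c * root.T, mu - c * root.T])
    wts = np.full(2 * Dw + 1, (1.0 - a0) / (2 * Dw))
    wts[0] = a0
    return pts, wts

def ukf(xs, f, g, Sigma_p, Sigma_m, mu0, Sigma0, a0=0.5):
    mu, Sigma = mu0, Sigma0
    means = []
    for x in xs:
        # Prediction: push sigma points through the temporal model
        P, a = sigma_points(mu, Sigma, a0)
        F = np.array([f(p) for p in P])
        mu_plus = a @ F
        Sigma_plus = (F - mu_plus).T @ ((F - mu_plus) * a[:, None]) + Sigma_p
        # Measurement: fresh sigma points around the prediction
        P, a = sigma_points(mu_plus, Sigma_plus, a0)
        G = np.array([g(p) for p in P])
        mu_x = a @ G
        Sigma_x = (G - mu_x).T @ ((G - mu_x) * a[:, None]) + Sigma_m
        # Gain from the state/measurement cross-covariance
        K = (P - mu_plus).T @ ((G - mu_x) * a[:, None]) @ np.linalg.inv(Sigma_x)
        mu = mu_plus + K @ (x - mu_x)
        Sigma = Sigma_plus - K @ Sigma_x @ K.T
        means.append(mu)
    return means
```

For linear f and g the unscented transform is exact, so the filter behaves like the ordinary Kalman filter.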
Algorithm 19.6: Condensation algorithm
The condensation algorithm completely does away with the Gaussian representation and represents the distributions entirely as sets of weighted particles, where each particle can be interpreted as a hypothesis about the world state, and its weight as the probability of this hypothesis being true.
Algorithm 19.6: The condensation algorithm

Input : Measurements {x_t}_{t=1}^T, temporal model Pr(w_t|w_{t−1}), measurement model Pr(x_t|w_t)
Output: Weights {a_t^[j]}_{t=1}^T and hypotheses {w_t^[j]}_{t=1}^T
begin
  // Initialize weights to equal values
  a_0 = [1/J, 1/J, . . . , 1/J]
  // Initialize hypotheses to plausible values for the state
  for j=1 to J do
    w_0^[j] = Initialize[]
  end
  // For each time step
  for t=1 to T do
    // For each particle
    for j=1 to J do
      // Sample index n from {1 . . . J} according to probabilities a_{t−1}^[1] . . . a_{t−1}^[J]
      n = sampleFromCategorical[a_{t−1}]
      // Draw sample from temporal update model
      w_t^[j] = sample[Pr(w_t|w_{t−1} = w_{t−1}^[n])]
      // Set weight for particle according to measurement model
      a_t^[j] = Pr(x_t|w_t^[j])
    end
    // Normalize weights
    a_t = a_t/(Σ_{j=1}^J a_t^[j])
  end
end
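The algorithm is agnostic about the model, so this sketch takes the initialization, temporal sampler, and likelihood as callbacks and tracks scalar states; the function name and callback signatures are my own.

```python
import numpy as np

def condensation(xs, init, temporal_sample, likelihood, J=1000, seed=0):
    """Particle filter; init/temporal_sample/likelihood are model callbacks."""
    rng = np.random.default_rng(seed)
    w = init(rng, J)                       # J scalar hypotheses
    a = np.full(J, 1.0 / J)
    means = []
    for x in xs:
        # Resample ancestors in proportion to their weights
        idx = rng.choice(J, size=J, p=a)
        # Propagate each survivor through the temporal model
        w = temporal_sample(rng, w[idx])
        # Reweight by the measurement model and normalize
        a = likelihood(x, w)
        a = a / a.sum()
        means.append(float(a @ w))
    return means
```

On a nearly static state with repeated identical measurements, the weighted particle mean should settle close to the measurement value.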
Models for visual words
Algorithm 20.1: Bag of features model
The bag of features model treats each object class as a distribution over discrete features f, regardless of their position in the image. Assume that there are I images, with J_i features in the ith image, and denote the jth feature in the ith image by f_ij. Then we have

Pr(X_i|w = n) = Π_{j=1}^{J_i} Cat_{f_ij}[λ_n].
Algorithm 20.1: Learn bag of words model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, class labels {w_i}_{i=1}^I, Dirichlet parameter α
Output: Model parameters {λ_n}_{n=1}^N
begin
  // For each object class
  for n=1 to N do
    // For each feature type
    for k=1 to K do
      // Compute number of times feature k is observed for class n
      N^f_nk = Σ_{i=1}^I Σ_{j=1}^{J_i} δ[w_i − n]δ[f_ij − k]
    end
    // Compute MAP parameter estimate
    λ_nk = (N^f_nk + α − 1)/(Σ_{k=1}^K N^f_nk + Kα − K)
  end
end
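The counting step is a few lines of NumPy; this sketch assumes 0-indexed feature and class labels and returns the MAP estimates for all N classes at once, with the function name my own.

```python
import numpy as np

def learn_bag_of_features(features, labels, N, K, alpha=1.0):
    """MAP per-class feature histograms under a Dirichlet(alpha) prior.
    features[i] lists the feature indices of image i; labels[i] is its class."""
    counts = np.zeros((N, K))
    for f_img, w in zip(features, labels):
        for f in f_img:
            counts[w, f] += 1
    lam = counts + alpha - 1.0            # MAP numerator N_nk + alpha - 1
    return lam / lam.sum(axis=1, keepdims=True)
```

With alpha = 1 the estimate reduces to the normalized raw counts per class.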
Algorithm 20.2: Latent Dirichlet Allocation
The latent Dirichlet allocation model describes a discrete set of features f_ij ∈ {1 . . . K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared across images, but the mixture weights π_i differ from image to image.
Algorithm 20.2: Learn latent Dirichlet allocation model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, Dirichlet parameters α, β
Output: Model parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I
begin
  // Initialize categorical parameters
  θ = θ_0 (a)
  // Initialize count parameters
  N^(f) = 0
  N^(p) = 0
  for i=1 to I do
    for j=1 to J_i do
      // Initialize hidden part labels
      p_ij = randInt[M]
      // Update count parameters
      N^(f)_{p_ij, f_ij} = N^(f)_{p_ij, f_ij} + 1
      N^(p)_{i, p_ij} = N^(p)_{i, p_ij} + 1
    end
  end
  // Main MCMC loop
  for t=1 to T do
    p^[t] = MCMCSample[p, f, N^(f), N^(p), M, K]
  end
  // Choose samples to use for parameter estimates
  S_t = [BurnInTime : SkipTime : LastSample]
  for i=1 to I do
    for m=1 to M do
      π_im = Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p_ij^[t] − m] + α
    end
    π_i = π_i/Σ_{m=1}^M π_im
  end
  for m=1 to M do
    for k=1 to K do
      λ_mk = Σ_{i=1}^I Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p_ij^[t] − m]δ[f_ij − k] + β
    end
    λ_m = λ_m/Σ_{k=1}^K λ_mk
  end
end

(a) One way to do this is to set the categorical parameters {λ_m}_{m=1}^M and {π_i}_{i=1}^I to random values by generating positive random vectors and normalizing them to sum to one.
Algorithm 20.2b: Gibbs sampling for LDA

The preceding algorithm relies on Gibbs sampling from the posterior distribution over the part labels. This can be implemented efficiently using the following method.
Algorithm 20.2b: MCMC sampling for LDA

Input : Part labels p, features f, counts N^(f), N^(p), numbers of parts and features M, K
Output: Part sample p
begin
  repeat
    // Choose next feature
    (a, b) = ChooseFeature[J_1, J_2, . . . , J_I]
    // Remove feature from count statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} − 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} − 1
    // Compute conditional distribution over the part label
    for m=1 to M do
      q_m = (N^(f)_{m, f_ab} + β)(N^(p)_{a,m} + α)
      q_m = q_m/(Σ_{k=1}^K (N^(f)_{m,k} + β) Σ_{m′=1}^M (N^(p)_{a,m′} + α))
    end
    // Normalize
    q = q/(Σ_{m=1}^M q_m)
    // Draw new part label
    p_ab = DrawCategorical[q]
    // Replace feature in count statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} + 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} + 1
  until all parts p_ij updated
end
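One sweep of the sampler can be written compactly by vectorizing the loop over parts m; the constant factor Σ_{m′}(N^(p)_{a,m′} + α) cancels in the normalization and is omitted here, and the function name is my own.

```python
import numpy as np

def gibbs_sweep(p, f, Nf, Np, alpha, beta, rng):
    """One Gibbs sweep over every feature token; updates p, Nf, Np in place.
    p[i][j]: part label of token j in image i; f[i][j]: its feature index."""
    M, K = Nf.shape
    for i in range(len(f)):
        for j in range(len(f[i])):
            m_old, k = p[i][j], f[i][j]
            # Remove the token from the count statistics
            Nf[m_old, k] -= 1
            Np[i, m_old] -= 1
            # Conditional over part labels (image-level constant cancels)
            q = (Nf[:, k] + beta) * (Np[i] + alpha) / (Nf.sum(axis=1) + K * beta)
            q = q / q.sum()
            # Draw a new label and restore the counts
            m_new = rng.choice(M, p=q)
            p[i][j] = m_new
            Nf[m_new, k] += 1
            Np[i, m_new] += 1
    return p
```

A basic invariant: after any number of sweeps, the count matrices must still agree with the labels, so their totals never change.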