Algorithms booklet

December 10, 2012
Copyright © 2012 by Simon Prince. The latest version of this document can be downloaded from http://www.computervisionmodels.com.
Algorithms booklet
This document accompanies the book "Computer vision: models, learning, and inference" by Simon J.D. Prince. It contains concise descriptions of almost all of the models and algorithms in the book. The goal is to provide sufficient information to implement a naive version of each method. This information was published separately from the main book because (i) it would have impeded the clarity of the main text and (ii) on-line publishing means that I can update the text periodically and eliminate any mistakes.
In the main, this document uses the same notation as the main book (see Appendix A for a summary). In addition, we also use the following conventions:
• When two matrices are concatenated horizontally, we write C = [A,B].
• When two matrices are concatenated vertically, we write C = [A; B].
• The function argmin_x f[x] returns the value of the argument x that minimizes f[x]. If x is discrete then this should be done by exhaustive search. If x is continuous, then it should be done by gradient descent, and I usually supply the gradient and Hessian of the function to help with this.

• The function δ[x] for discrete x returns 1 when the argument x is 0 and returns 0 otherwise.

• The function diag[A] returns a column vector containing the elements on the diagonal of matrix A.
• The function zeros[I, J ] creates an I × J matrix that is full of zeros.
As a final note, I should point out that this document has not yet been checked very carefully. I'm looking for volunteers to help me with this. There are two main ways you can help. First, please mail me at [email protected] if you manage to successfully implement one of these methods. That way I can be sure that the description is sufficient. Secondly, please also mail me if you have problems getting any of these methods to work. It's possible that I can help, and it will help me to identify ambiguities and errors in the descriptions.
Simon Prince
List of Algorithms
4.1 Maximum likelihood learning for normal distribution
4.2 MAP learning for normal distribution with conjugate prior
4.3 Bayesian approach to normal distribution
4.4 Maximum likelihood learning for categorical distribution
4.5 MAP learning for categorical distribution with conjugate prior
4.6 Bayesian approach to categorical distribution
6.1 Basic generative classifier
7.1 Maximum likelihood learning for mixtures of Gaussians
7.2 Maximum likelihood learning for t-distribution
7.3 Maximum likelihood learning for factor analyzer
8.1 Maximum likelihood learning for linear regression
8.2 Bayesian formulation of linear regression
8.3 Gaussian process regression
8.4 Sparse linear regression
8.5 Dual formulation of linear regression
8.6 Dual Gaussian process regression
8.7 Relevance vector regression
9.1 Cost and derivatives for MAP logistic regression
9.2 Bayesian logistic regression
9.3 Cost and derivatives for MAP dual logistic regression
9.4 Dual Bayesian logistic regression
9.5 Relevance vector classification
9.6 Incremental logistic regression
9.7 Logitboost
9.8 Cost function, derivative and Hessian for multi-class logistic regression
9.9 Multiclass classification tree
10.1 Gibbs' sampling from undirected model
10.2 Contrastive divergence learning of undirected model
11.1 Dynamic programming in chain
11.2 Dynamic programming in tree
11.3 Forward backward algorithm
11.4 Sum product: distribute
11.4b Sum product: collate and compute marginal distributions
12.1 Binary graph cuts
12.2 Reparameterization for binary graph cut
12.3 Multilabel graph cuts
12.4 Alpha expansion algorithm (main loop)
12.4b Alpha expansion (expand)
13.1 Principal components analysis (dual)
13.2 K-means algorithm
14.1 ML learning of extrinsic parameters
14.2 ML learning of intrinsic parameters
14.3 Inferring 3D world position
15.1 Maximum likelihood learning of Euclidean transformation
15.2 Maximum likelihood learning of similarity transformation
15.3 Maximum likelihood learning of affine transformation
15.4 Maximum likelihood learning of projective transformation
15.5 Maximum likelihood inference for transformation models
15.6 ML learning of extrinsic parameters (planar scene)
15.7 ML learning of intrinsic parameters (planar scene)
15.8 Robust ML learning of homography
15.9 Robust sequential learning of homographies
15.10 PEaRL learning of homographies
16.1 Extracting relative camera position from point matches
16.2 Eight point algorithm for fundamental matrix
16.3 Robust ML fitting of fundamental matrix
16.4 Planar rectification
17.1 Generalized Procrustes analysis
17.2 ML learning of PPCA model
18.1 Maximum likelihood learning for identity subspace model
18.2 Maximum likelihood learning for PLDA model
18.3 Maximum likelihood learning for asymmetric bilinear model
18.4 Style translation with asymmetric bilinear model
19.1 The Kalman filter
19.2 Fixed interval Kalman smoother
19.3 The extended Kalman filter
19.4 The iterated extended Kalman filter
19.5 The unscented Kalman filter
19.6 The condensation algorithm
20.1 Learn bag of words model
20.2 Learn latent Dirichlet allocation model
20.2b MCMC Sampling for LDA
Fitting probability distributions
Algorithm 4.1: Maximum likelihood learning of normal distribution
The univariate normal distribution is a probability density model suitable for describing continuous data x in one dimension. It has pdf

Pr(x) = (1/√(2πσ²)) exp[−0.5(x − µ)²/σ²],
where the parameter µ denotes the mean and σ² denotes the variance.
Algorithm 4.1: Maximum likelihood learning for normal distribution
Input : Training data {x_i}, i = 1…I
Output: Maximum likelihood estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = ∑_{i=1}^I x_i / I
  // Set variance
  σ² = ∑_{i=1}^I (x_i − µ)² / I
end
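The two updates above translate directly into code. Here is a minimal sketch in Python (the function name and plain-list interface are my own choices, not from the book):

```python
def fit_normal_ml(x):
    """Algorithm 4.1: ML estimates (mu, sigma^2) of a univariate normal."""
    I = len(x)
    mu = sum(x) / I                               # set mean parameter
    var = sum((xi - mu) ** 2 for xi in x) / I     # set variance (ML, biased)
    return mu, var
```

For example, `fit_normal_ml([1.0, 2.0, 3.0, 4.0])` gives a mean of 2.5 and a variance of 1.25.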
Algorithm 4.2: MAP learning of univariate normal parameters
The conjugate prior to the normal distribution is the normal-scaled inverse gamma, which has pdf
Pr(µ, σ²) = (√γ / (σ√(2π))) · (β^α / Γ(α)) · (1/σ²)^(α+1) · exp[−(2β + γ(δ − µ)²)/(2σ²)],

with hyperparameters α, β, γ > 0 and δ ∈ (−∞, ∞).
Algorithm 4.2: MAP learning for normal distribution with conjugate prior
Input : Training data {x_i}, i = 1…I; hyperparameters α, β, γ, δ
Output: MAP estimates of parameters θ = {µ, σ²}
begin
  // Set mean parameter
  µ = (∑_{i=1}^I x_i + γδ) / (I + γ)
  // Set variance
  σ² = (∑_{i=1}^I (x_i − µ)² + 2β + γ(δ − µ)²) / (I + 3 + 2α)
end
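The MAP updates can be sketched as follows (a minimal Python version; names are illustrative):

```python
def fit_normal_map(x, alpha, beta, gamma, delta):
    """Algorithm 4.2: MAP estimates of a univariate normal under a
    normal-scaled inverse gamma prior (hyperparameters alpha, beta, gamma, delta)."""
    I = len(x)
    mu = (sum(x) + gamma * delta) / (I + gamma)
    var = (sum((xi - mu) ** 2 for xi in x)
           + 2 * beta + gamma * (delta - mu) ** 2) / (I + 3 + 2 * alpha)
    return mu, var
```

Note that as gamma, alpha and beta shrink toward zero the estimates approach the ML solution of Algorithm 4.1.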
Copyright c©2012 by Simon Prince. This latest version of this document can be downloaded fromhttp://www.computervisionmodels.com.
8 Fitting probability distributions
Algorithm 4.3: Bayesian approach to univariate normal distribution
In the Bayesian approach to fitting the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a normal inverse gamma distribution over the mean and variance parameters. The predictive distribution for a new datum is computed by integrating the predictions for a given set of parameters weighted by the probability of those parameters being present.
Algorithm 4.3: Bayesian approach to normal distribution
Input : Training data {x_i}, i = 1…I; hyperparameters α, β, γ, δ; test data x*
Output: Posterior parameters α′, β′, γ′, δ′; predictive distribution Pr(x*|x_{1…I})
begin
  // Compute normal inverse gamma posterior over normal parameters
  α′ = α + I/2
  β′ = ∑_i x_i²/2 + β + γδ²/2 − (γδ + ∑_i x_i)²/(2γ + 2I)
  γ′ = γ + I
  δ′ = (γδ + ∑_i x_i)/(γ + I)
  // Compute intermediate parameters
  α″ = α′ + 1/2
  β″ = x*²/2 + β′ + γ′δ′²/2 − (γ′δ′ + x*)²/(2γ′ + 2)
  γ″ = γ′ + 1
  // Evaluate new datapoint under predictive distribution
  Pr(x*|x_{1…I}) = (√γ′ · β′^{α′} · Γ[α″]) / (√(2π) · √γ″ · β″^{α″} · Γ[α′])
end
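Both stages can be sketched in Python; the primed posterior parameters become `alpha_p` etc., and the intermediate parameters are local variables (a minimal sketch, not an optimized implementation):

```python
import math

def posterior_nig(x, alpha, beta, gamma, delta):
    """First half of Algorithm 4.3: posterior normal inverse gamma parameters."""
    I, sx = len(x), sum(x)
    alpha_p = alpha + I / 2
    beta_p = (sum(xi ** 2 for xi in x) / 2 + beta + gamma * delta ** 2 / 2
              - (gamma * delta + sx) ** 2 / (2 * gamma + 2 * I))
    gamma_p = gamma + I
    delta_p = (gamma * delta + sx) / (gamma + I)
    return alpha_p, beta_p, gamma_p, delta_p

def predictive_density(x_star, alpha_p, beta_p, gamma_p, delta_p):
    """Second half: density of x_star under the predictive distribution."""
    # intermediate parameters obtained by absorbing x* into the posterior
    a = alpha_p + 0.5
    b = (x_star ** 2 / 2 + beta_p + gamma_p * delta_p ** 2 / 2
         - (gamma_p * delta_p + x_star) ** 2 / (2 * gamma_p + 2))
    g = gamma_p + 1
    return (math.sqrt(gamma_p) * beta_p ** alpha_p * math.gamma(a)
            / (math.sqrt(2 * math.pi) * math.sqrt(g) * b ** a
               * math.gamma(alpha_p)))
```

The predictive is a scaled and shifted t-distribution centred on the posterior mean parameter, so it is symmetric about that point and heavier-tailed than a normal.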
Algorithm 4.4: ML learning of categorical parameters
The categorical distribution is a probability density model suitable for describing discrete multivalued data x ∈ {1, 2, …, K}. It has pdf

Pr(x = k) = λ_k,

where the parameter λ_k denotes the probability of observing category k.
Algorithm 4.4: Maximum likelihood learning for categorical distribution
Input : Multi-valued training data {x_i}, i = 1…I
Output: ML estimates of categorical parameters θ = {λ_1 … λ_K}
begin
  for k = 1 to K do
    λ_k = ∑_{i=1}^I δ[x_i − k] / I
  end
end
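In code this is just category counting (a minimal Python sketch with illustrative names):

```python
def fit_categorical_ml(x, K):
    """Algorithm 4.4: lambda_k is the fraction of observations equal to k."""
    I = len(x)
    return [sum(1 for xi in x if xi == k) / I for k in range(1, K + 1)]
```

The returned parameters are non-negative and sum to one by construction.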
Copyright c©2012 by Simon Prince. This latest version of this document can be downloaded fromhttp://www.computervisionmodels.com.
Fitting probability distributions 9
Algorithm 4.5: MAP learning of categorical parameters
For MAP learning of the categorical parameters, we need to define a prior and to this end, we choose the Dirichlet distribution:

Pr(λ_1 … λ_K) = (Γ[∑_{k=1}^K α_k] / ∏_{k=1}^K Γ[α_k]) · ∏_{k=1}^K λ_k^{α_k − 1},

where Γ[•] is the Gamma function and {α_k}, k = 1…K are hyperparameters.
Algorithm 4.5: MAP learning for categorical distribution with conjugate prior
Input : Categorical training data {x_i}, i = 1…I; hyperparameters {α_k}, k = 1…K
Output: MAP estimates of parameters θ = {λ_k}, k = 1…K
begin
  for k = 1 to K do
    N_k = ∑_{i=1}^I δ[x_i − k]
    λ_k = (N_k − 1 + α_k) / (I − K + ∑_{m=1}^K α_m)
  end
end
Algorithm 4.6: Bayesian approach to categorical distribution
In the Bayesian approach to fitting the categorical distribution we again use a Dirichlet prior. In the learning stage we compute a probability distribution over the K categorical parameters, which is also a Dirichlet distribution. The predictive distribution for a new datum is based on a weighted sum of the predictions for all possible parameter values, where the weights are based on the Dirichlet distribution computed in the learning stage.
Algorithm 4.6: Bayesian approach to categorical distribution
Input : Categorical training data {x_i}, i = 1…I; hyperparameters {α_k}, k = 1…K
Output: Posterior parameters {α_k}, k = 1…K; predictive distribution Pr(x*|x_{1…I})
begin
  // Compute categorical posterior over λ
  for k = 1 to K do
    α_k = α_k + ∑_{i=1}^I δ[x_i − k]
  end
  // Evaluate new datapoint under predictive distribution
  for k = 1 to K do
    Pr(x* = k|x_{1…I}) = α_k / (∑_{m=1}^K α_m)
  end
end
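Both the posterior update and the predictive distribution fit in a few lines (an illustrative Python sketch):

```python
def bayes_categorical(x, alphas):
    """Algorithm 4.6: posterior Dirichlet parameters and the predictive
    distribution Pr(x* = k | x_1...I)."""
    post = [a + sum(1 for xi in x if xi == k + 1) for k, a in enumerate(alphas)]
    total = sum(post)
    return post, [a / total for a in post]
```

Unlike the MAP estimate, the predictive never assigns zero probability to a category that was unobserved in training, provided the hyperparameters are positive.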
Learning and inference in vision
Algorithm 6.1: Basic generative classifier
Consider the situation where we wish to assign a label w ∈ {1, 2, …, K} based on an observed multivariate measurement vector x. We model the class conditional density functions as normal distributions so that
Pr(x_i|w_i = k) = Norm_{x_i}[µ_k, Σ_k],

with prior probabilities over the world state defined by

Pr(w_i) = Cat_{w_i}[λ].
In the learning phase, we fit the parameters µ_k and Σ_k of the kth class conditional density function Pr(x_i|w_i = k) from just the subset of data S_k = {x_i : w_i = k} where the kth state was observed. We learn the prior parameter λ from the training world states {w_i}, i = 1…I. Here we have used the maximum likelihood approach in both cases.

The inference algorithm takes a new datum x* and returns the posterior Pr(w*|x*, θ) over the world state w* using Bayes' rule,

Pr(w*|x*) = Pr(x*|w*)Pr(w*) / (∑_{w*=1}^K Pr(x*|w*)Pr(w*)).
Algorithm 6.1: Basic Generative classifier
Input : Training data {x_i, w_i}, i = 1…I; new data example x*
Output: ML parameters θ = {λ_{1…K}, µ_{1…K}, Σ_{1…K}}; posterior probability Pr(w*|x*)
begin
  // For each training class
  for k = 1 to K do
    // Set mean
    µ_k = (∑_{i=1}^I x_i δ[w_i − k]) / (∑_{i=1}^I δ[w_i − k])
    // Set covariance
    Σ_k = (∑_{i=1}^I (x_i − µ_k)(x_i − µ_k)^T δ[w_i − k]) / (∑_{i=1}^I δ[w_i − k])
    // Set prior
    λ_k = ∑_{i=1}^I δ[w_i − k] / I
  end
  // Compute likelihoods for each class for the new datapoint
  for k = 1 to K do
    l_k = Norm_{x*}[µ_k, Σ_k]
  end
  // Classify new datapoint using Bayes' rule
  for k = 1 to K do
    Pr(w* = k|x*) = l_k λ_k / (∑_{m=1}^K l_m λ_m)
  end
end
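For concreteness, here is the classifier restricted to one-dimensional data, where each class-conditional covariance Σ_k reduces to a scalar variance (a minimal Python sketch; names are illustrative):

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def fit_generative(xs, ws, K):
    """Learning phase of Algorithm 6.1: per-class mean, variance and prior."""
    I = len(xs)
    params = []
    for k in range(1, K + 1):
        sk = [x for x, w in zip(xs, ws) if w == k]   # subset with w_i = k
        mu = sum(sk) / len(sk)
        var = sum((x - mu) ** 2 for x in sk) / len(sk)
        params.append((mu, var, len(sk) / I))
    return params

def classify(x_star, params):
    """Inference phase: posterior over classes via Bayes' rule."""
    joint = [lam * norm_pdf(x_star, mu, var) for mu, var, lam in params]
    z = sum(joint)
    return [j / z for j in joint]
```

With well-separated classes the posterior is close to one-hot for data near either class mean.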
Modelling complex densities
Algorithm 7.1: Fitting mixture of Gaussians
The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data is described as a weighted sum of K normal distributions

Pr(x|θ) = ∑_{k=1}^K λ_k Norm_x[µ_k, Σ_k],

where µ_{1…K} and Σ_{1…K} are the means and covariances of the normal distributions and λ_{1…K} are positive valued weights that sum to one.

The MoG is fit using the EM algorithm. In the E-step, we compute the posterior distribution over a hidden variable h_i for each observed data point x_i. In the M-step, we iterate through the K components, updating the mean µ_k and covariance Σ_k for each and also updating the weights {λ_k}, k = 1…K.
Algorithm 7.1: Maximum likelihood learning for mixtures of Gaussians
Input : Training data {x_i}, i = 1…I; number of clusters K
Output: ML estimates of parameters θ = {λ_{1…K}, µ_{1…K}, Σ_{1…K}}
begin
  Initialize θ = θ⁰ (a)
  repeat
    // Expectation step
    for i = 1 to I do
      for k = 1 to K do
        l_ik = λ_k Norm_{x_i}[µ_k, Σ_k]   // numerator of Bayes' rule
      end
      // Compute posterior (responsibilities) by normalizing
      for k = 1 to K do
        r_ik = l_ik / (∑_{m=1}^K l_im)
      end
    end
    // Maximization step (b)
    for k = 1 to K do
      λ_k^[t+1] = (∑_{i=1}^I r_ik) / (∑_{m=1}^K ∑_{i=1}^I r_im)
      µ_k^[t+1] = (∑_{i=1}^I r_ik x_i) / (∑_{i=1}^I r_ik)
      Σ_k^[t+1] = (∑_{i=1}^I r_ik (x_i − µ_k^[t+1])(x_i − µ_k^[t+1])^T) / (∑_{i=1}^I r_ik)
    end
    // Compute data log likelihood and EM bound
    L = ∑_{i=1}^I log[∑_{k=1}^K λ_k Norm_{x_i}[µ_k, Σ_k]]
    B = ∑_{i=1}^I ∑_{k=1}^K r_ik log[λ_k Norm_{x_i}[µ_k, Σ_k] / r_ik]
  until no further improvement in L
end

(a) One possibility is to set the weights λ• = 1/K, the means µ• to the values of K randomly chosen datapoints, and the covariances Σ• to the covariance of the whole dataset.
(b) For a diagonal covariance, retain only the diagonal of the Σ_k update.
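The EM loop above can be sketched for univariate data, where each Σ_k is a scalar variance. In this sketch the caller supplies the initial means (the footnote's random-datapoint initialization would work equally well); all names are illustrative:

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def fit_mog_1d(xs, init_means, n_iter=100):
    """EM for a univariate mixture of Gaussians (Algorithm 7.1 with D = 1)."""
    K, I = len(init_means), len(xs)
    lam = [1.0 / K] * K
    mu = list(init_means)
    m = sum(xs) / I
    var = [sum((x - m) ** 2 for x in xs) / I] * K    # whole-data variance
    for _ in range(n_iter):
        # E-step: responsibilities r[i][k] by normalizing lam_k * Norm(x_i)
        r = []
        for x in xs:
            l = [lam[k] * norm_pdf(x, mu[k], var[k]) for k in range(K)]
            s = sum(l)
            r.append([lk / s for lk in l])
        # M-step: closed-form updates of weights, means and variances
        for k in range(K):
            rk = sum(ri[k] for ri in r)
            lam[k] = rk / I
            mu[k] = sum(ri[k] * x for ri, x in zip(r, xs)) / rk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / rk
    return lam, mu, var
```

A fixed iteration count stands in for the log-likelihood convergence test of the pseudocode, purely to keep the sketch short.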
Algorithm 7.2: Fitting the t-distribution
The t-distribution is a robust (long-tailed) distribution with pdf

Pr(x) = (Γ[(ν + D)/2] / ((νπ)^{D/2} |Σ|^{1/2} Γ[ν/2])) · (1 + (x − µ)^T Σ^{−1} (x − µ)/ν)^{−(ν+D)/2},

where µ is the mean of the distribution, Σ is a matrix that controls the spread, ν is the degrees of freedom, and D is the dimensionality of the input data.

We use the EM algorithm to fit the parameters θ = {µ, Σ, ν}. In the E-step, we compute the gamma-distributed posterior over the hidden variable h_i for each observed data point x_i. In the M-step we update the parameters µ and Σ in closed form, but must perform an explicit line search to update ν using the criterion:

tCost[ν, {E[h_i], E[log h_i]}, i = 1…I] = ∑_{i=1}^I (−(ν/2) log[ν/2] + log[Γ[ν/2]] − (ν/2 − 1) E[log h_i] + (ν/2) E[h_i]).
Algorithm 7.2: Maximum likelihood learning for t-distribution
Input : Training data {x_i}, i = 1…I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ, ν}
begin
  Initialize θ = θ⁰ (a)
  repeat
    // Expectation step
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^{−1} (x_i − µ)
      E[h_i] = (ν + D) / (ν + δ_i)
      E[log h_i] = Ψ[ν/2 + D/2] − log[ν/2 + δ_i/2]
    end
    // Maximization step
    µ = (∑_{i=1}^I E[h_i] x_i) / (∑_{i=1}^I E[h_i])
    Σ = (∑_{i=1}^I E[h_i] (x_i − µ)(x_i − µ)^T) / (∑_{i=1}^I E[h_i])
    ν = argmin_ν [tCost[ν, {E[h_i], E[log h_i]}, i = 1…I]]
    // Compute data log likelihood
    for i = 1 to I do
      δ_i = (x_i − µ)^T Σ^{−1} (x_i − µ)
    end
    L = I log[Γ[(ν + D)/2]] − ID log[νπ]/2 − I log[|Σ|]/2 − I log[Γ[ν/2]]
    L = L − (ν + D) ∑_{i=1}^I log[1 + δ_i/ν]/2
  until no further improvement in L
end

(a) One possibility is to initialize the parameters µ and Σ to the mean and covariance of the data and to set the initial degrees of freedom to a large value, say ν = 1000. Here Ψ[•] denotes the digamma function.
Algorithm 7.3: Fitting a factor analyzer
The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

Pr(x_i|θ) = Norm_{x_i}[µ, ΦΦ^T + Σ],

where µ is a D × 1 mean vector, Φ is a D × K matrix containing the K factors {φ_k}, k = 1…K in its columns and Σ is a diagonal matrix of size D × D.

The factor analyzer is fit using the EM algorithm. In the E-step, we compute the posterior distribution over the hidden variable h_i for each data example x_i and extract the expectations E[h_i] and E[h_i h_i^T]. In the M-step, we use these distributions in closed-form updates for the basis function matrix Φ and the diagonal noise term Σ.
Algorithm 7.3: Maximum likelihood learning for factor analyzer
Input : Training data {x_i}, i = 1…I; number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
  Initialize θ = θ⁰ (a)
  // Set mean
  µ = ∑_{i=1}^I x_i / I
  repeat
    // Expectation step
    for i = 1 to I do
      E[h_i] = (Φ^T Σ^{−1} Φ + I)^{−1} Φ^T Σ^{−1} (x_i − µ)
      E[h_i h_i^T] = (Φ^T Σ^{−1} Φ + I)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    Φ = (∑_{i=1}^I (x_i − µ) E[h_i]^T) (∑_{i=1}^I E[h_i h_i^T])^{−1}
    Σ = diag[∑_{i=1}^I ((x_i − µ)(x_i − µ)^T − Φ E[h_i](x_i − µ)^T)] / I
    // Compute data log likelihood (b)
    L = ∑_{i=1}^I log[Norm_{x_i}[µ, ΦΦ^T + Σ]]
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating the covariance of this distribution using the Sherman-Morrison-Woodbury relation (matrix inversion lemma).
Models for regression
Algorithm 8.1: ML fitting of linear regression model
The linear regression model describes the world w as a normal distribution. The mean of this distribution is a linear function φ_0 + φ^T x and the variance is constant. In practice we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and attach the y-intercept φ_0 to the start of the gradient vector φ ← [φ_0 φ^T]^T and write

Pr(w_i|x_i, θ) = Norm_{w_i}[φ^T x_i, σ²].

In the learning algorithm, we work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states.
Algorithm 8.1: Maximum likelihood learning for linear regression
Input : (D+1)×I data matrix X; I×1 world vector w
Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
begin
  // Set gradient parameter
  φ = (XX^T)^{−1} X w
  // Set variance parameter
  σ² = (w − X^T φ)^T (w − X^T φ) / I
end
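For a single input dimension the normal equations are 2×2 and can be solved by hand, which makes a compact sketch (names and the list-based interface are illustrative):

```python
def fit_linreg_ml(xs, ws):
    """Algorithm 8.1 for one input dimension: each x_i is augmented to [1, x_i],
    so phi = (X X^T)^{-1} X w reduces to solving 2x2 normal equations."""
    I = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sw, sxw = sum(ws), sum(x * w for x, w in zip(xs, ws))
    det = I * sxx - sx * sx
    phi0 = (sxx * sw - sx * sxw) / det       # intercept
    phi1 = (I * sxw - sx * sw) / det         # slope
    var = sum((w - phi0 - phi1 * x) ** 2 for x, w in zip(xs, ws)) / I
    return phi0, phi1, var
```

On noise-free data generated from a line the recovered parameters match the generating line and the fitted variance is essentially zero.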
Algorithm 8.2: Bayesian linear regression
In Bayesian linear regression we define a normal prior over the parameters φ

Pr(φ) = Norm_φ[0, σ_p² I],

which contains one hyperparameter σ_p² which determines the prior variance. We compute a distribution over possible parameters φ and use this to evaluate the mean µ_{w*|x*} and variance σ²_{w*|x*} of the predictive distribution for new data x*.

As in the previous algorithm, we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and then work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states.

The choice of approach depends on whether the number of data examples I is greater or less than the dimensionality D of the data. Depending on which situation we are in, we invert either the (D+1) × (D+1) matrix XX^T or the I × I matrix X^T X.
Algorithm 8.2: Bayesian formulation of linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // If dimension D is less than the number of data examples I
  if D < I then
    // Fit variance parameter σ² with line search
    σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X + σ² I]]] (a)
    // Compute inverse variance of posterior distribution over φ
    A^{−1} = (XX^T/σ² + I/σ_p²)^{−1}
  else
    // Fit variance parameter σ² with line search
    σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X + σ² I]]]
    // Compute inverse variance of posterior distribution over φ
    A^{−1} = σ_p² I − σ_p² X (X^T X + (σ²/σ_p²) I)^{−1} X^T
  end
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T A^{−1} X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T A^{−1} x* + σ²
end

(a) To compute this cost function when the dimension D < I, we need to compute both the inverse and determinant of the covariance matrix. It is inefficient to implement this directly as the covariance is I × I. To compute the inverse, the covariance should be reformulated using the matrix inversion lemma, and the determinant calculated using the matrix determinant lemma.
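The D < I branch can be sketched for one input dimension, where A is 2×2 and can be inverted directly. To keep the sketch short, the noise variance σ² is assumed known rather than fitted by line search (all names are illustrative):

```python
def bayes_linreg_predict(xs, ws, x_star, sigma_sq, sigma_p_sq):
    """Algorithm 8.2 for one input dimension (D = 1 < I), with the noise
    variance sigma^2 assumed known instead of fitted by line search."""
    I = len(xs)
    s2, sp2 = sigma_sq, sigma_p_sq
    # A = X X^T / sigma^2 + I / sigma_p^2 for augmented vectors [1, x_i]
    a = I / s2 + 1 / sp2
    b = sum(xs) / s2
    d = sum(x * x for x in xs) / s2 + 1 / sp2
    det = a * d - b * b
    Ainv = [[d / det, -b / det], [-b / det, a / det]]
    xw = [sum(ws) / s2, sum(x * w for x, w in zip(xs, ws)) / s2]  # X w / sigma^2
    phi = [Ainv[0][0] * xw[0] + Ainv[0][1] * xw[1],
           Ainv[1][0] * xw[0] + Ainv[1][1] * xw[1]]               # posterior mean of phi
    xv = [1.0, x_star]
    mu = phi[0] * xv[0] + phi[1] * xv[1]
    var = (sum(xv[i] * Ainv[i][j] * xv[j] for i in range(2) for j in range(2))
           + s2)
    return mu, var
```

With a broad prior and small noise the predictive mean approaches the ML fit; the predictive variance always exceeds the noise variance σ².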
Algorithm 8.3: Gaussian process regression
To compute a non-linear fit to a set of data, we first transform the data x by a non-linear function f[•] to create a new variable z = f[x]. We then proceed as normal with the Bayesian approach, but using the transformed data.

In practice, we exploit the fact that the Bayesian non-linear regression fitting and prediction algorithms can be described in terms of inner products z^T z of the transformed data. We hence directly define a single kernel function k[x_i, x_j] as a replacement for the operation f[x_i]^T f[x_j]. For many transformations f[•] it is more efficient to evaluate the kernel function directly than to transform the variables separately and then compute the dot product. It is further possible to choose kernel functions that correspond to projection to very high or even infinite dimensional spaces without ever having to explicitly compute this transformation.

As usual we add a 1 to the start of every data vector x_i ← [1 x_i^T]^T and then work with the matrix X = [x_1, x_2 … x_I] which contains all of the training data examples in its columns and the world vector w = [w_1, w_2 … w_I]^T which contains the training world states. In this algorithm, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.
Algorithm 8.3: Gaussian process regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Normal distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² K[X,X] + σ² I]]]
  // Compute inverse term
  A^{−1} = (K[X,X] + (σ²/σ_p²) I)^{−1}
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = (σ_p²/σ²) K[x*,X] w − (σ_p²/σ²) K[x*,X] A^{−1} K[X,X] w
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = σ_p² K[x*,x*] − σ_p² K[x*,X] A^{−1} K[X,x*] + σ²
end
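The prediction step can be sketched in Python for scalar inputs. The RBF kernel and the fixed noise variance are my own choices for the sketch (the algorithm leaves the kernel open and fits σ² by line search), and a small Gaussian-elimination routine stands in for a library solver:

```python
import math

def rbf(a, b, length=1.0):
    return math.exp(-0.5 * (a - b) ** 2 / length ** 2)

def solve(M, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x

def gp_predict(xs, ws, x_star, sigma_sq, sigma_p_sq):
    """Predictive mean and variance of GP regression at x_star (fixed sigma^2)."""
    I = len(xs)
    K = [[rbf(xi, xj) for xj in xs] for xi in xs]
    A = [[K[i][j] + (sigma_sq / sigma_p_sq) * (i == j) for j in range(I)]
         for i in range(I)]
    k_star = [rbf(x_star, xi) for xi in xs]
    Kw = [sum(K[i][j] * ws[j] for j in range(I)) for i in range(I)]
    AinvKw = solve(A, Kw)       # A^{-1} K[X,X] w
    AinvKs = solve(A, k_star)   # A^{-1} K[X,x*]
    mu = (sigma_p_sq / sigma_sq) * (
        sum(ks * w for ks, w in zip(k_star, ws))
        - sum(ks * y for ks, y in zip(k_star, AinvKw)))
    var = (sigma_p_sq * rbf(x_star, x_star)
           - sigma_p_sq * sum(ks * y for ks, y in zip(k_star, AinvKs))
           + sigma_sq)
    return mu, var
```

A little algebra shows the mean expression simplifies to K[x*,X](K[X,X] + (σ²/σ_p²)I)^{−1}w, the familiar Gaussian process predictive mean; the sketch keeps the two-term form of the algorithm box.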
Algorithm 8.4: Sparse linear regression
In the sparse linear regression model, we replace the normal prior over the parameters with a prior that is a product of t-distributions. This favours solutions where most of the regression parameters are effectively zero. In practice, the t-distribution corresponding to the dth dimension of the data is represented as a marginalization of a joint distribution with a hidden variable h_d.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 8.4: Sparse linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, … 1]
  repeat
    // Maximize marginal likelihood w.r.t. variance parameter
    σ² = argmin_{σ²} [−log[Norm_w[0, X^T H^{−1} X + σ² I]]]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = (XX^T/σ² + H)^{−1}
    µ = Σ X w / σ²
    // For each dimension except the first (the constant)
    for d = 2 to D+1 do
      // Update the diagonal entry of H
      h_dd = (1 − h_dd Σ_dd + ν) / (µ_d² + ν)
    end
  until no further improvement
  // Remove columns of X, rows of w, and rows and columns of H where the value h_dd on the diagonal of H is large
  [H, X, w] = prune[H, X, w]
  // Compute variance of posterior over φ
  A^{−1} = H^{−1} − H^{−1} X (X^T H^{−1} X + σ² I)^{−1} X^T H^{−1}
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T A^{−1} X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T A^{−1} x* + σ²
end
Algorithm 8.5: Dual Bayesian linear regression
In dual linear regression, we formulate the weight vector as a sum of the observed data examples X so that

φ = Xψ

and then solve for the dual parameters ψ. To this end we place a normally distributed prior on ψ with a spherical covariance of magnitude σ_p².
Algorithm 8.5: Dual formulation of linear regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² X^T X X^T X + σ² I]]]
  // Compute inverse variance of posterior over ψ
  A = X^T X X^T X / σ² + I / σ_p²
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = x*^T X A^{−1} X^T X w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = x*^T X A^{−1} X^T x* + σ²
end
Algorithm 8.6: Dual Gaussian process regression
The dual algorithm relies only on inner products of the form x^T x and so can be kernelized to form a non-linear regression method. As previously, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.
Algorithm 8.6: Dual Gaussian process regression.
Input : (D+1)×I data matrix X; I×1 world vector w; hyperparameter σ_p²; kernel function K[•,•]
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Fit variance parameter σ² with line search
  σ² = argmin_{σ²} [−log[Norm_w[0, σ_p² K[X,X] K[X,X] + σ² I]]]
  // Compute inverse term
  A = K[X,X] K[X,X] / σ² + I / σ_p²
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = K[x*,X] A^{−1} K[X,X] w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = K[x*,X] A^{−1} K[X,x*] + σ²
end
Algorithm 8.7: Relevance vector regression
Relevance vector regression is simply sparse linear regression applied in the dual situation; we encourage the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables h_i which control the tendency to be zero for each dimension.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 8.7: Relevance vector regression.
Input : (D+1)×I data matrix X; I×1 world vector w; kernel K[•,•]; degrees of freedom ν
Output: Distribution Pr(w*|x*) over world given new data example x*
begin
  // Initialize variables
  H = diag[1, 1, … 1]
  repeat
    // Maximize marginal likelihood w.r.t. variance parameter σ²
    σ² = argmin_{σ²} [−log[Norm_w[0, K[X,X] H^{−1} K[X,X] + σ² I]]]
    // Maximize marginal likelihood w.r.t. relevance parameters H
    Σ = (K[X,X] K[X,X] / σ² + H)^{−1}
    µ = Σ K[X,X] w / σ²
    // For each dual parameter
    for i = 1 to I do
      // Update the diagonal entry of H
      h_ii = (1 − h_ii Σ_ii + ν) / (µ_i² + ν)
    end
  until no further improvement
  // Remove columns of X, rows of w, and rows and columns of H where h_ii is large
  [H, X, w] = prune[H, X, w]
  // Compute inverse term
  A = K[X,X] K[X,X] / σ² + H
  // Compute mean of prediction for new example x*
  µ_{w*|x*} = K[x*,X] A^{−1} K[X,X] w / σ²
  // Compute variance of prediction for new example x*
  σ²_{w*|x*} = K[x*,X] A^{−1} K[X,x*] + σ²
end
Models for classification
Algorithm 9.1: MAP Logistic regression
The logistic regression model is defined as

Pr(w|x, φ) = Bern_w[1/(1 + exp[−φ^T x])],

where as usual, we have attached a 1 to the start of each data example x_i. We now perform a non-linear minimization over the negative log binomial probability with respect to the parameter vector φ:

φ = argmin_φ [−∑_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−φ^T x_i])]] − log[Norm_φ[0, σ_p² I]]],
where we have also added a prior over the parameters φ. The MAP solution is superior tothe maximum likelihood approach in that it encourages the function to be smooth even whenthe classes are completely separable. A typical approach would be to use a second orderoptimization method such as the Newton method (e.g., using Matlab’s fminunc function).The optimization method will need to compute the cost function and it’s derivative andHessian with respect to the parameter φ.
Algorithm 9.1: Cost and derivatives for MAP logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian with the terms due to the prior
  L = (D + 1) log[2πσ_p²]/2 + φᵀφ/(2σ_p²)
  g = φ/σ_p²
  H = I/σ_p²  // (D+1)×(D+1) identity matrix scaled by 1/σ_p²
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−φᵀx_i])
    // Add term to negative log likelihood
    if w_i == 1 then
      L = L − log[y_i]
    else
      L = L − log[1 − y_i]
    end
    // Add term to gradient
    g = g + (y_i − w_i)x_i
    // Add term to Hessian
    H = H + y_i(1 − y_i)x_i x_iᵀ
  end
end
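The computations above can be sketched in Python/NumPy as follows. This is a naive sketch rather than a reference implementation; the function and variable names are our own choices, and the data matrix is assumed to already have a 1 prepended to each example.

```python
import numpy as np

def map_logistic_cost(phi, X, w, sigma_p2):
    """Negative log posterior, gradient, and Hessian for MAP logistic
    regression.  X is (D+1) x I with a 1 prepended to each example,
    w is a length-I binary vector, sigma_p2 is the prior variance."""
    D1, I = X.shape
    a = phi @ X                          # activations, length I
    y = 1.0 / (1.0 + np.exp(-a))         # predictions
    # Terms due to the prior (regularization)
    L = D1 * np.log(2 * np.pi * sigma_p2) / 2 + phi @ phi / (2 * sigma_p2)
    g = phi / sigma_p2
    H = np.eye(D1) / sigma_p2
    # Data terms: negative Bernoulli log likelihood and its derivatives
    L -= np.sum(w * np.log(y) + (1 - w) * np.log(1 - y))
    g += X @ (y - w)
    H += (X * (y * (1 - y))) @ X.T
    return L, g, H
```

The returned triple can be fed directly to a second order optimizer; a finite-difference check of the gradient is a useful sanity test.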
Algorithm 9.2: Bayesian logistic regression
In Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).

The method works by first finding the MAP solution (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation, we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution.

As usual, we assume that we have added a one to the start of every data vector, x_i ← [1, x_iᵀ]ᵀ, to model the offset parameter elegantly.
Algorithm 9.2: Bayesian logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Predictive distribution Pr(w*|x*)

begin
  // Optimization using cost function of algorithm 9.1
  φ̂ = argmin_φ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−φᵀx_i])]] − log[Norm_φ[0, σ_p²I]] ]
  // Compute Hessian of the negative log posterior at peak
  H = I/σ_p²
  for i = 1 to I do
    y_i = 1/(1 + exp[−φ̂ᵀx_i])  // Compute prediction y
    H = H + y_i(1 − y_i)x_i x_iᵀ  // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  μ = φ̂
  Σ = H⁻¹
  // Compute mean and variance of activation
  μ_a = μᵀx*
  σ_a² = x*ᵀΣx*
  // Approximate integral to get Bernoulli parameter
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
  // Compute predictive distribution
  Pr(w*|x*) = Bern_{w*}[λ*]
end
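The prediction stage can be sketched in Python/NumPy as below. This assumes the convention that H is the Hessian of the negative log posterior (so the Laplace covariance is its inverse); the function name and arguments are our own.

```python
import numpy as np

def laplace_predict(phi_map, H, x_star):
    """Predictive Bernoulli parameter for Bayesian logistic regression,
    given MAP weights phi_map and Hessian H of the negative log
    posterior at the MAP solution.  A naive sketch."""
    Sigma = np.linalg.inv(H)                 # Laplace covariance
    mu_a = phi_map @ x_star                  # mean of activation
    var_a = x_star @ Sigma @ x_star          # variance of activation
    # Approximate the integral of the sigmoid over the activation
    return 1.0 / (1.0 + np.exp(-mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0)))
```

Note the qualitative behaviour: a larger activation variance pulls the prediction toward 0.5, reflecting greater posterior uncertainty.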
Algorithm 9.3: MAP dual logistic regression
The dual logistic regression model is the same as the logistic regression model, but now we represent the parameters φ as a weighted sum φ = Xψ of the original data points, where X is a matrix containing all of the training data, giving the prediction:

  Pr(w|ψ, x) = Bern_w[1/(1 + exp[−ψᵀXᵀx])].

We place a normal prior on the dual parameters ψ and optimize them using the criterion:

  ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀXᵀx_i])]] − log[Norm_ψ[0, σ_p²I]] ].

A typical approach would be to use a second order optimization method such as the Newton method (e.g., using Matlab's fminunc function). The optimization method will need to compute the cost function and its derivative and Hessian with respect to the parameter ψ; the calculations for these are given in the algorithm below.
Algorithm 9.3: Cost and derivatives for MAP dual logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters ψ
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian with the terms due to the prior
  L = I log[2πσ_p²]/2 + ψᵀψ/(2σ_p²)
  g = ψ/σ_p²
  H = I/σ_p²  // I×I identity matrix scaled by 1/σ_p²
  // Form compound data matrix
  X = [x_1, x_2, ..., x_I]
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = 1/(1 + exp[−ψᵀXᵀx_i])
    // Add term to negative log likelihood
    if w_i == 1 then
      L = L − log[y_i]
    else
      L = L − log[1 − y_i]
    end
    // Add terms to gradient and Hessian
    g = g + (y_i − w_i)Xᵀx_i
    H = H + y_i(1 − y_i)Xᵀx_i x_iᵀX
  end
end
Algorithm 9.4: Dual Bayesian logistic regression
In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ* = Pr(w* = 1|x*).

The method works by first finding the MAP solution to the dual problem (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation, we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution.

As usual, we assume that we have added a one to the start of every data vector, x_i ← [1, x_iᵀ]ᵀ, to model the offset parameter elegantly.
Algorithm 9.4: Dual Bayesian logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*

begin
  // Optimization using cost function of algorithm 9.3
  ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀXᵀx_i])]] − log[Norm_ψ[0, σ_p²I]] ]
  // Compute Hessian of the negative log posterior at peak
  H = I/σ_p²
  for i = 1 to I do
    y_i = 1/(1 + exp[−ψ̂ᵀXᵀx_i])  // Compute prediction y
    H = H + y_i(1 − y_i)Xᵀx_i x_iᵀX  // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  μ = ψ̂
  Σ = H⁻¹
  // Compute mean and variance of activation
  μ_a = μᵀXᵀx*
  σ_a² = x*ᵀXΣXᵀx*
  // Compute approximate prediction
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
end
Algorithm 9.4b: Gaussian process classification
Notice that algorithm 9.4 and algorithm 9.3, which it uses, are defined entirely in terms of inner products of the form x_iᵀx_j, which usually occur in matrix multiplications like Xᵀx*. This means they are amenable to kernelization. When we replace all of the inner products in algorithm 9.4 with a kernel function K[•,•], the resulting algorithm is called Gaussian process classification or kernel logistic regression.
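To make the substitution concrete, here is one common kernel choice sketched in Python/NumPy. The RBF kernel and its length-scale parameter are standard choices rather than something prescribed by the text; K[A,B] below plays the role of the kernelized inner products.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel matrix K[A,B] between the columns of A ((D+1) x I)
    and B ((D+1) x J).  One common kernel for Gaussian process
    classification; the length scale is a free parameter."""
    # Squared Euclidean distances between all column pairs
    d2 = (np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-0.5 * d2 / length_scale**2)
```

Wherever the dual algorithms compute Xᵀx_i or Xᵀx*, the kernelized versions would use `rbf_kernel(X, x_i)` and `rbf_kernel(X, x_star)` instead.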
Algorithm 9.5: Relevance vector classification
Relevance vector classification is a version of kernel logistic regression (Gaussian process classification) that encourages the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables h_i which control the tendency of each dimension to be zero.

The algorithm is iterative and alternates between updating the hidden variables in closed form and finding the resulting MAP solutions. After the system has converged, we prune the model to remove dimensions where the hidden variable is large (> 1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.
Algorithm 9.5: Relevance vector classification

Input: (D+1)×I data X, I×1 binary world vector w, degrees of freedom ν, kernel K[•,•]
Output: Bernoulli parameter λ* from Pr(w*|x*) for new data x*

begin
  // Initialize I hidden variables to reasonable values
  H = diag[1, 1, ..., 1]
  repeat
    // Find MAP solution using kernelized version of algorithm 9.3
    ψ̂ = argmin_ψ [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−ψᵀK[X,x_i]])]] − log[Norm_ψ[0, H⁻¹]] ]
    // Compute Hessian S of the negative log posterior at peak (*)
    S = H
    for i = 1 to I do
      y_i = 1/(1 + exp[−ψ̂ᵀK[X,x_i]])  // Compute prediction y
      S = S + y_i(1 − y_i)K[X,x_i]K[x_i,X]  // Add term to Hessian
    end
    // Set mean and variance of Laplace approximation
    μ = ψ̂
    Σ = S⁻¹
    // For each data example, update the diagonal entry of H
    for i = 1 to I do
      h_ii = (1 − h_ii Σ_ii + ν)/(μ_i² + ν)
    end
  until no further improvement
  // Remove rows of μ, cols of X, rows and cols of Σ where h_ii is large
  [μ, Σ, X] = prune[μ, Σ, X]
  // Compute mean and variance of activation
  μ_a = μᵀK[X,x*]
  σ_a² = K[x*,X]ΣK[X,x*]
  // Compute approximate prediction
  λ* = 1/(1 + exp[−μ_a/√(1 + πσ_a²/8)])
end

(*) Notice that we use S to represent the Hessian here, so that it is not confused with the diagonal matrix H containing the hidden variables.
Algorithm 9.6: Incremental fitting for logistic regression
The incremental fitting approach applies to the non-linear model

  Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k f[x, ξ_k]])].

The method initializes the weights {φ_k}_{k=1}^K to zero and then optimizes them one by one. At the first stage we optimize φ_0, φ_1 and ξ_1. Then we optimize φ_0, φ_2 and ξ_2, and so on.
Algorithm 9.6: Incremental logistic regression

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I
Output: ML parameters φ_0, {φ_k, ξ_k}_{k=1}^K

begin
  // Initialize parameters
  φ_0 = 0
  // Initialize activation for each data point (sum of first k−1 functions)
  for i = 1 to I do
    a_i = 0
  end
  for k = 1 to K do
    // Remove offset parameter φ_0 from activations
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    // Optimize the offset, weight, and function parameters for this stage
    [φ_0, φ_k, ξ_k] = argmin_{φ_0,φ_k,ξ_k} [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]])]] ]
    // Update activations
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_k]
    end
  end
end
Obviously, the derivatives for the optimization algorithm depend on the choice of non-linear function. For example, if we use the function f[x_i, ξ_k] = arctan[ξ_kᵀx_i], where we have added a 1 to the start of each data vector x_i, then the first derivatives of the cost function L are:

  ∂L/∂φ_0 = Σ_{i=1}^I (y_i − w_i)
  ∂L/∂φ_k = Σ_{i=1}^I (y_i − w_i) arctan[ξ_kᵀx_i]
  ∂L/∂ξ_k = Σ_{i=1}^I (y_i − w_i) φ_k (1/(1 + (ξ_kᵀx_i)²)) x_i,

where y_i = 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]]) is the current prediction for the i-th data point.
Algorithm 9.7: Logitboost
Logitboost is a special case of non-linear logistic regression with Heaviside step functions:

  Pr(w|φ, x) = Bern_w[1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k heaviside[f[x, ξ_{c_k}]]])].

One interpretation is that we are combining a set of 'weak classifiers', each of which decides on the class based on whether the data is to the left or the right of the step in the step function.

The step functions do not have smooth derivatives, so at the k-th stage the algorithm exhaustively considers a set of possible step functions {heaviside[f[x, ξ_m]]}_{m=1}^M, choosing the index c_k ∈ {1, 2, ..., M} that is best, and simultaneously optimizes the weights φ_0 and φ_k.
Algorithm 9.7: Logitboost

Input: Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, functions {f_m[x, ξ_m]}_{m=1}^M
Output: ML parameters φ_0, {φ_k}_{k=1}^K, c_k ∈ {1 ... M}

begin
  // Initialize activations and parameters
  for i = 1 to I do
    a_i = 0
  end
  φ_0 = 0
  for k = 1 to K do
    // Find best weak classifier by looking at magnitude of gradient,
    // where y_i = 1/(1 + exp[−a_i − φ_0]) is the current prediction
    c_k = argmax_m [(Σ_{i=1}^I (y_i − w_i)f[x_i, ξ_m])²]
    // Remove effect of offset parameter
    for i = 1 to I do
      a_i = a_i − φ_0
    end
    φ_0 = 0
    // Perform optimization
    [φ_0, φ_k] = argmin_{φ_0,φ_k} [ −Σ_{i=1}^I log[Bern_{w_i}[1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]])]] ]
    // Compute new activation
    for i = 1 to I do
      a_i = a_i + φ_0 + φ_k f[x_i, ξ_{c_k}]
    end
  end
end
The derivatives for the optimization are given by

  ∂L/∂φ_0 = Σ_{i=1}^I (y_i − w_i)
  ∂L/∂φ_k = Σ_{i=1}^I (y_i − w_i)f[x_i, ξ_{c_k}],

where y_i = 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]]) is the current prediction for the i-th data point.
Algorithm 9.8: Multi-class logistic regression
The multi-class logistic regression model is defined as

  Pr(w|φ, x) = Cat_w[softmax[φ_1ᵀx, φ_2ᵀx, ..., φ_Nᵀx]],

where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the negative log probability with respect to the parameter vector φ = [φ_1; φ_2; ...; φ_N]. We need to compute this cost, and its derivative and Hessian with respect to the parameters φ_n.
Algorithm 9.8: Cost function, derivative and Hessian for multi-class logistic regression

Input: World states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters {φ_n}_{n=1}^N
Output: cost L, gradient g, Hessian H

begin
  // Initialize cost, gradient, Hessian
  L = 0
  for n = 1 to N do
    g_n = 0  // Part of gradient relating to φ_n
    for m = 1 to N do
      H_mn = 0  // Portion of Hessian relating φ_n and φ_m
    end
  end
  // For each data point
  for i = 1 to I do
    // Compute prediction y
    y_i = softmax[φ_1ᵀx_i, φ_2ᵀx_i, ..., φ_Nᵀx_i]
    // Update negative log likelihood (take the w_i-th element of y_i)
    L = L − log[y_{i,w_i}]
    // Update gradient and Hessian
    for n = 1 to N do
      g_n = g_n + (y_{in} − δ[w_i − n])x_i
      for m = 1 to N do
        H_mn = H_mn + y_{im}(δ[m − n] − y_{in})x_i x_iᵀ
      end
    end
  end
  // Assemble final gradient vector
  g = [g_1; g_2; ...; g_N]
  // Assemble final Hessian
  for n = 1 to N do
    H_n = [H_n1, H_n2, ..., H_nN]
  end
  H = [H_1; H_2; ...; H_N]
end
Algorithm 9.9: Multi-class logistic classification tree
Here, we present a deterministic multi-class classification tree. At the j-th branching point, it selects the index c_j ∈ {1, 2, ..., M} indicating which of a pre-determined set of classifiers {g[x, ω_m]}_{m=1}^M should be chosen.
Algorithm 9.9: Multiclass classification tree

Input: World states {w_i}_{i=1}^I, data {x_i}_{i=1}^I, classifiers {g[x, ω_m]}_{m=1}^M
Output: Categorical parameters at leaves {λ_p}_{p=1}^{J+1}, classifier indices {c_j}_{j=1}^J

begin
  enqueue[x_{1...I}, w_{1...I}]  // Store data and class labels
  // For each node in tree
  for j = 1 to J do
    [x_{1...I}, w_{1...I}] = dequeue[ ]  // Retrieve data and class labels
    for m = 1 to M do
      // Count frequency of the k-th class in left and right branches
      for k = 1 to K do
        n_k^(l) = Σ_{i=1}^I δ[g[x_i, ω_m] − 0]δ[w_i − k]
        n_k^(r) = Σ_{i=1}^I δ[g[x_i, ω_m] − 1]δ[w_i − k]
      end
      // Compute log likelihood
      l_m = Σ_{k=1}^K n_k^(l) log[n_k^(l)/Σ_{q=1}^K n_q^(l)]  // Contribution from left branch
      l_m = l_m + Σ_{k=1}^K n_k^(r) log[n_k^(r)/Σ_{q=1}^K n_q^(r)]  // Contribution from right branch
    end
    // Store index of best classifier
    c_j = argmax_m[l_m]
    // Partition into two sets
    S_l = {}; S_r = {}
    for i = 1 to I do
      if g[x_i, ω_{c_j}] == 0 then
        S_l = S_l ∪ i
      else
        S_r = S_r ∪ i
      end
    end
    // Add to queue of nodes to process next
    enqueue[x_{S_l}, w_{S_l}]
    enqueue[x_{S_r}, w_{S_r}]
  end
  // Recover categorical parameters at J + 1 leaves
  for p = 1 to J + 1 do
    [x_{1...I}, w_{1...I}] = dequeue[ ]
    for k = 1 to K do
      n_k = Σ_{i=1}^I δ[w_i − k]  // Frequency of class k at the p-th leaf
    end
    λ_p = n/Σ_{k=1}^K n_k  // ML solution for categorical parameter
  end
end
Graphical models
Algorithm 10.1: Gibbs’ Sampling from a discrete undirected model
This algorithm generates samples from an undirected model with distribution
Pr(x1...N ) =1
Z
C∏c=1
φc[Sc],
where the cth function φc[Sc] operates on a subset of Sc ⊂ x1, x2, . . . , xD of the D variablesand returns a positive number. For this algorithm, we assume that each variable xddd=1 isdiscrete and takes values xd ∈ 1, 2, . . . ,K
In Gibbs’ sampling, we choose each variable in turn and update by sampling from itsmarginal posterior distribution. Since, the variables are discrete, the marginal distribution isa categorical distribution (a histogram), so we can sample from it by partitioning the range0 to 1 according to the probabilities, drawing a uniform sample between 0 and 1, and seeingwhich partition it falls into.
Algorithm 10.1: Gibbs’ sampling from undirected model
Input : Potential functions φc[Sc]Cc=1
Output: Samples xtT1begin
// Initialize first sample in chain
x0 = x(0)
// For each time sample
for t=1 to T doxt = xt−1
// For each dimension
for d=1 to D do// For each possible value of the dth variable
for k=1 to K do// Set the variable to kxtd = k// Compute the unnormalized marginal probability
λk = 1for c s.t. xd ∈ Sc do
λk = λk · φc[Sc]end
end// Normalize the probabilities
λ = λ/∑Kk=1 λk
// Draw from categorical distribution
xtd = Sample [Catxtd [λ]]
end
end
end
It is normal to discard the first few thousand entries so that the initial conditions are forgotten.Then entries are chosen that are spaced apart to avoid correlation between the samples.
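The sampler above can be sketched in Python/NumPy as follows. The representation of each potential as a (variable-indices, table) pair, the 0-based values, and the fixed burn-in length are our own choices for the sketch.

```python
import numpy as np

def gibbs_sample(potentials, D, K, T, rng, burn_in=500):
    """Gibbs sampling from a discrete undirected model.  potentials is
    a list of (dims, table) pairs: a tuple of variable indices and a
    positive array giving phi over those variables.  Returns T samples
    taken after the burn-in sweeps."""
    x = np.zeros(D, dtype=int)
    samples = []
    for t in range(T + burn_in):
        for d in range(D):
            lam = np.ones(K)
            for k in range(K):
                x[d] = k                      # tentatively set variable
                for dims, table in potentials:
                    if d in dims:             # only touching potentials
                        lam[k] *= table[tuple(x[list(dims)])]
            lam /= lam.sum()                  # normalize
            x[d] = rng.choice(K, p=lam)       # draw from categorical
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)
```

On a tiny model the empirical statistics can be checked against the exact distribution obtained by enumerating the normalizing constant Z.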
Algorithm 10.2: Contrastive divergence for learning undirected models
The contrastive divergence algorithm is used to learn the parameters θ of an undirected model of the form

  Pr(x_{1...D}, θ) = (1/Z[θ]) f[x, θ] = (1/Z[θ]) Π_{c=1}^C φ_c[S_c, θ],

where the c-th function φ_c[S_c, θ] operates on a subset S_c ⊂ {x_1, x_2, ..., x_D} of the D variables and returns a positive number. It is generally not possible to maximize the log likelihood either in closed form or via a non-linear optimization algorithm, because we cannot compute the denominator Z[θ] that normalizes the distribution and which also depends on the parameters.

The contrastive divergence algorithm gets around this problem by computing an approximate gradient by means of generating J samples {x*_j}_{j=1}^J and then using this approximate gradient to perform gradient ascent. The approximate gradient is computed as

  ∂L/∂θ ≈ −(I/J) Σ_{j=1}^J ∂log[f[x*_j, θ]]/∂θ + Σ_{i=1}^I ∂log[f[x_i, θ]]/∂θ.

In the algorithm below, the function gradient[x, θ] represents the derivative of the unnormalized log likelihood (i.e., the two terms on the right hand side). We have also made the simplifying assumption that there is one sample x*_i for each training example x_i (i.e., I = J). In practice, computing valid samples is a burden, so in this algorithm we generate the i-th sample x*_i by taking a single Gibbs' sampling step from the i-th training example.
Algorithm 10.2: Contrastive divergence learning of undirected model

Input: Data {x_i}_{i=1}^I, learning rate α
Output: ML parameters θ

begin
  // Initialize parameters
  θ = θ^(0)
  repeat
    for i = 1 to I do
      // Take a single Gibbs' sampling step from the i-th data point
      x*_i = Sample[x_i, θ]
    end
    // Update parameters; the function gradient[•, •] returns the
    // derivative of the log of the unnormalized probability
    θ = θ + α Σ_{i=1}^I (gradient[x_i, θ] − gradient[x*_i, θ])
  until no further average change in θ
end
Models for chains and trees
Algorithm 11.1: Dynamic programming for chain model
This algorithm computes the maximum a posteriori solution for a chain model. The directed chain model has a likelihood and prior that factorize as

  Pr(x|w) = Π_{n=1}^N Pr(x_n|w_n)
  Pr(w) = Π_{n=2}^N Pr(w_n|w_{n−1}),

respectively. To find the MAP solution, we minimize the negative log posterior:

  ŵ_{1...N} = argmin_{w_{1...N}} [ −Σ_{n=1}^N log[Pr(x_n|w_n)] − Σ_{n=2}^N log[Pr(w_n|w_{n−1})] ]
            = argmin_{w_{1...N}} [ Σ_{n=1}^N U_n(w_n) + Σ_{n=2}^N P_n(w_n, w_{n−1}) ].

This cost function can be optimized using dynamic programming. We pass through the variables from w_1 to w_N, computing the minimum cost to reach each state and caching the route. We find the overall minimum at w_N and retrieve the cached route. Here, we denote the unary cost U_n(w_n = k) for the n-th variable taking value k by U_{n,k}, and the pairwise cost P_n(w_n = k, w_{n−1} = l) for the n-th variable taking value k and the (n−1)-th variable taking value l by P_{n,k,l}.
Algorithm 11.1: Dynamic programming in chain

Input: Unary costs {U_{n,k}}, pairwise costs {P_{n,k,l}}
Output: Minimum cost path {w_n}_{n=1}^N

begin
  // Initialize cumulative sums S_{n,k}
  for k = 1 to K do
    S_{1,k} = U_{1,k}
  end
  // Work forward through chain
  for n = 2 to N do
    for k = 1 to K do
      // Find minimum cost to get to this node
      S_{n,k} = U_{n,k} + min_l[S_{n−1,l} + P_{n,k,l}]
      // Store route by which we got here
      R_{n,k} = argmin_l[S_{n−1,l} + P_{n,k,l}]
    end
  end
  // Find state of w_N with overall minimum cost
  w_N = argmin_k[S_{N,k}]
  // Trace back to retrieve route
  for n = N to 2 do
    w_{n−1} = R_{n,w_n}
  end
end
Algorithm 11.2: Dynamic programming for tree model
This algorithm can be used to compute the MAP solution for a directed or undirected model which has the form of a tree. As such, it generalizes algorithm 11.1, which is specialized for chains. As in the simpler case, the algorithm proceeds by working through the nodes, computing the minimum possible cost to reach each position and caching the route by which we reached it. At the last node we compute the overall minimum cost and then trace back the route using the cached information.

Here, we denote the unary cost U_n(w_n = k) for the n-th variable taking value k by U_{n,k}. We denote the higher order cost for assigning value k to the n-th variable based on its children ch[n] as H_{n,k}[ch[n]]. This might consist of pairwise, three-wise, or higher costs depending on the number of children in the graph.
Algorithm 11.2: Dynamic programming in tree

Input: Unary costs {U_{n,k}}, higher order cost functions {H_{n,k}[ch[n]]}
Output: Minimum cost path {w_n}_{n=1}^N

begin
  repeat
    // Retrieve nodes in an order so children always come before parents
    n = GetNextNode[ ]
    // For each possible value of this node
    for k = 1 to K do
      // Compute the minimum cost for reaching here (*)
      S_{n,k} = U_{n,k} + min_{ch[n]}[S_{ch[n]} + H_{n,k}[ch[n]]]
      // Cache the route for reaching here (store |ch[n]| values)
      R_{n,k} = argmin_{ch[n]}[S_{ch[n]} + H_{n,k}[ch[n]]]
    end
    // Push node index onto stack
    push[n]
  until no more parents
  // Find the state of the root node with overall minimum cost
  w_n = argmin_k[S_{n,k}]
  // Trace back to retrieve route
  for c = 1 to N do
    n = pop[ ]
    if ch[n] ≠ {} then
      w_{ch[n]} = R_{n,w_n}
    end
  end
end

(*) This minimization is done over the values of all of the children variables. With a pairwise term, this would be a single minimization over the single previous variable that fed into this one. With a three-wise term it would be a joint minimization over both children variables, and so on.
Algorithm 11.3: Forward-backward algorithm
This algorithm computes the marginal posterior distributions Pr(w_n|x_{1...N}) for a chain model. To find the marginal posteriors we perform a forward recursion and a backward recursion and multiply the two resulting quantities together.

Here, we use the term u_{n,k} to represent the likelihood Pr(x_n|w_n = k) of the data x_n when the n-th node takes label k, and the term p_{n,k,l} to represent the prior term Pr(w_n = k|w_{n−1} = l) when the n-th variable takes value k and the (n−1)-th variable takes value l. Note that u_{n,k} and p_{n,k,l} are probabilities; they are not the same as the unary and pairwise costs in the dynamic programming algorithms.
Algorithm 11.3: Forward-backward algorithm

Input: Likelihoods {u_{n,k}}, prior terms {p_{n,k,l}}
Output: Marginal probability distributions {q_n[w_n]}_{n=1}^N

begin
  // Initialize forward vector to likelihood of first variable
  for k = 1 to K do
    f_{1,k} = u_{1,k}
  end
  // For each state of each subsequent variable
  for n = 2 to N do
    for k = 1 to K do
      // Forward recursion
      f_{n,k} = u_{n,k} Σ_{l=1}^K p_{n,k,l} f_{n−1,l}
    end
  end
  // Initialize vector for backward pass
  for k = 1 to K do
    b_{N,k} = 1/K
  end
  // For each state of each previous variable
  for n = N to 2 do
    for k = 1 to K do
      // Backward recursion
      b_{n−1,k} = Σ_{l=1}^K u_{n,l} p_{n,l,k} b_{n,l}
    end
  end
  // Compute marginal posteriors
  for n = 1 to N do
    for k = 1 to K do
      // Take product of forward and backward messages and normalize
      q_n[w_n = k] = f_{n,k} b_{n,k}/(Σ_{l=1}^K f_{n,l} b_{n,l})
    end
  end
end
Algorithm 11.4: Sum product belief propagation
The sum product algorithm proceeds in two phases: a forward pass and a backward pass. The forward pass distributes evidence through the graph and the backward pass collates this evidence. Both the distribution and collation of evidence are accomplished by passing messages from node to node in the factor graph. Every edge in the graph is connected to exactly one variable node, and each message is defined over the domain of this variable.

In the description of the algorithm below, we denote the edges by {e_n}_{n=1}^N, where edge e_n joins node e_{n1} to node e_{n2}. The edges are processed in such an order that all incoming messages to a node are computed before the outgoing message m_n is passed. We first discuss the distribute phase.
Algorithm 11.4: Sum product: distribute

Input: Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K, edges {e_n}_{n=1}^N
Output: Forward messages m_n on each of the N edges e_n

begin
  repeat
    // Retrieve edges in any valid order
    e_n = GetNextEdge[ ]
    // Test for type of edge - returns 1 if e_{n2} is a function, else 0
    t = isEdgeToFunction[e_n]
    if t then
      // If this data was observed
      if e_{n1} ∈ S_obs then
        m_n = δ[z*_{e_{n1}}]
      else
        // Find set of edges that are incoming to the start of this edge
        S = {k : e_{n1} == e_{k2}}
        // Take product of messages
        m_n = Π_{k∈S} m_k
      end
    else
      // Find set of edges incoming to the start of this edge
      S = {k : e_{n1} == e_{k2}}
      // Find all variables connected to this function
      V = e_{S1} ∪ e_{n2}
      // Sum the product of the function and incoming messages over
      // the states of the incoming variables
      m_n = Σ_{y_{e_{S1}}} φ[y_V] Π_{k∈S} m_k
    end
    // Add edge to stack
    push[n]
  until all edges have been processed
end

This algorithm continues overleaf...
Algorithm 11.4b: Collate and compute marginal distributions

After the distribute stage is complete (one message has been passed along each edge in the graph), we commence the second pass through the edges. This happens in the opposite order to the first stage (accomplished by popping edges off the stack). Now we collate the evidence and compute the normalized distributions at each node.

Algorithm 11.4b: Sum product: collate and compute marginal distributions

Input: Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K, edges {e_n}_{n=1}^N
Output: Marginal probability distributions {q_n[y_n]}_{n=1}^N

begin
  // Collate evidence
  repeat
    // Retrieve edges in opposite order
    n = pop[ ]
    // Test for type of edge - returns 1 if e_{n2} is a function, else 0
    t = isEdgeToFunction[e_n]
    if t then
      // Find set of edges incoming to the function node
      S = {k : e_{n2} == e_{k1}}
      // Find all variables connected to this function
      V = e_{S2} ∪ e_{n1}
      // Sum the product of the function and incoming messages over
      // the states of the incoming variables
      b_n = Σ_{y_{e_{S2}}} φ[y_V] Π_{k∈S} b_k
    else
      // Find set of edges that are incoming to the variable node
      S = {k : e_{n2} == e_{k1}}
      // Take product of messages
      b_n = Π_{k∈S} b_k
    end
  until stack empty
  // Compute distributions at the variable nodes (normalizing each)
  for k = 1 to K do
    // Find sets of edges incoming to and outgoing from this node
    S_1 = {n : e_{n2} == k}
    S_2 = {n : e_{n1} == k}
    q_k ∝ Π_{n∈S_1} m_n Π_{n∈S_2} b_n
  end
end
Models for grids
Algorithm 12.1: Binary graph cuts
This algorithm assumes that we have N variables, each of which takes a binary value. Their connections are indicated by a series of flags {e_{mn}}_{m,n=1}^N which are set to one if the variables are connected (and have an associated pairwise term) or zero otherwise. This algorithm sets up the graph but does not find the min-cut solution; consult a standard algorithms text for details of how to do this.
Algorithm 12.1: Binary graph cuts

Input: Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k,l)}, flags {e_{mn}}
Output: Label assignments {w_n}

begin
  // Initialize graph to empty
  G = {}
  for n = 1 to N do
    // Create edges from source and to sink and set capacity to zero
    G = G ∪ {s, n}; c_{sn} = 0
    G = G ∪ {n, t}; c_{nt} = 0
    for m = 1 to n − 1 do
      // If edge between m and n is desired
      if e_{mn} == 1 then
        G = G ∪ {m, n}; c_{nm} = 0
        G = G ∪ {n, m}; c_{mn} = 0
      end
    end
  end
  // Add costs to edges
  for n = 1 to N do
    c_{sn} = c_{sn} + U_n(0)
    c_{nt} = c_{nt} + U_n(1)
    for m = 1 to n − 1 do
      if e_{mn} == 1 then
        c_{nm} = c_{nm} + P_{mn}(1,0) − P_{mn}(1,1) − P_{mn}(0,0)
        c_{mn} = c_{mn} + P_{mn}(1,0)
        c_{sm} = c_{sm} + P_{mn}(0,0)
        c_{nt} = c_{nt} + P_{mn}(1,1)
      end
    end
  end
  C = Reparameterize[C]  // Ensures all capacities are positive (see overleaf)
  G = ComputeMinCut[G, C]  // Augmenting paths or similar
  // Read off world state values based on new (cut) graph
  for n = 1 to N do
    if {s, n} ∈ G then
      w_n = 1
    else
      w_n = 0
    end
  end
end
Algorithm 12.2: Reparameterization for graph cuts
The previous algorithm relies on a max-flow / min-cut algorithm such as augmenting paths or push-relabel. For these algorithms to converge, it is critical that all of the capacities are non-negative. The process of making them non-negative is called reparameterization. It is only possible in certain special cases; problems where it is possible are known as submodular. Cost functions in vision tend to encourage smoothing and are usually submodular.
Algorithm 12.2: Reparameterization for binary graph cut

Input: Edge flags {e_{mn}}, capacities {c_{mn} : e_{mn} = 1}
Output: Modified graph with non-negative capacities

begin
  // For each node pair
  for n = 1 to N do
    for m = 1 to n − 1 do
      // If an edge between the nodes exists
      if e_{mn} == 1 then
        // Test if submodular and return error code if not
        if c_{nm} < 0 && c_{mn} < −c_{nm} then
          return[−1]
        end
        if c_{mn} < 0 && c_{nm} < −c_{mn} then
          return[−1]
        end
        // Shift any negative capacity onto the source and sink links
        β = 0
        if c_{nm} < 0 then
          β = c_{nm}
        end
        if c_{mn} < 0 then
          β = −c_{mn}
        end
        c_{nm} = c_{nm} − β
        c_{mn} = c_{mn} + β
        c_{sm} = c_{sm} + β
        c_{mt} = c_{mt} + β
      end
    end
    // Handle links between source and sink
    α = min[c_{sn}, c_{nt}]
    c_{sn} = c_{sn} − α
    c_{nt} = c_{nt} − α
  end
end
Algorithm 12.3: Multi-label graph cuts
This algorithm assumes that we have N variables, each of which takes one of K values. Their connections are indicated by a set of flags {e_{mn}}_{m,n=1}^N which are set to one if the variables are connected (and have an associated pairwise term) or zero otherwise. We construct a graph that has N·(K+1) nodes, where the first K+1 nodes pertain to the first variable, and so on.
Algorithm 12.3: Multilabel graph cuts

Input: Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k,l)}, flags {e_{mn}}
Output: Label assignments {w_n}

begin
  G = {}  // Initialize graph to empty
  for n = 1 to N do
    // Create edges from source and to sink and set costs
    G = G ∪ {s, (n−1)(K+1)+1}; c_{s,(n−1)(K+1)+1} = ∞
    G = G ∪ {n(K+1), t}; c_{n(K+1),t} = ∞
    // Create edges within columns and set costs
    for k = 1 to K do
      G = G ∪ {(n−1)(K+1)+k, (n−1)(K+1)+k+1}
      c_{(n−1)(K+1)+k,(n−1)(K+1)+k+1} = U_n(k)
      G = G ∪ {(n−1)(K+1)+k+1, (n−1)(K+1)+k}
      c_{(n−1)(K+1)+k+1,(n−1)(K+1)+k} = ∞
    end
    // Create edges between columns and set costs
    for m = 1 to n − 1 do
      if e_{mn} == 1 then
        for k = 1 to K do
          for l = 2 to K + 1 do
            G = G ∪ {(n−1)(K+1)+k, (m−1)(K+1)+l}
            c_{(n−1)(K+1)+k,(m−1)(K+1)+l} = P_{n,m}(k, l−1) + P_{n,m}(k−1, l) − P_{n,m}(k, l) − P_{n,m}(k−1, l−1)
          end
        end
      end
    end
  end
  C = Reparameterize[C]  // Ensures all capacities are positive (see book)
  G = ComputeMinCut[G, C]  // Augmenting paths or similar
  // Read off values
  for n = 1 to N do
    w_n = 1
    for k = 1 to K do
      if {(n−1)(K+1)+k, (n−1)(K+1)+k+1} ∈ G then
        w_n = w_n + 1
      end
    end
  end
end
Algorithm 12.4: Alpha-expansion algorithm
The alpha-expansion algorithm works by breaking the solution down into a series of binary problems, each of which can be solved exactly. At each iteration, we choose one of the K label values α, and for each pixel we consider either retaining the current label or switching it to α. The name alpha-expansion derives from the fact that the space occupied by label α in the solution expands at each iteration. The process is iterated until no choice of α causes any change. Each expansion move is guaranteed not to increase the overall objective function, although the final result is not guaranteed to be the global minimum.
Algorithm 12.4a: Alpha expansion algorithm (main loop)

Input : Unary costs {U_n(k)}_{n=1,k=1}^{N,K}, pairwise costs {P_mn(k,l)}_{m,n=1,k,l=1}^{N,N,K,K}, flags {e_mn}_{m,n=1}^{N,N}
Output: Label assignments {w_n}_{n=1}^{N}

begin
    // Initialize labels in some way - perhaps to minimize unary costs
    w = w_0
    // Compute total cost
    L = Σ_{n=1}^{N} U_n(w_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} e_mn P_nm(w_n, w_m)
    repeat
        // Store initial cost
        L_0 = L
        // For each label in turn
        for k = 1 to K do
            // Try to expand this label (see algorithm 12.4b)
            w = AlphaExpand[w, k]
        end
        // Compute new cost
        L = Σ_{n=1}^{N} U_n(w_n) + Σ_{n=1}^{N} Σ_{m=1}^{N} e_mn P_nm(w_n, w_m)
    until L = L_0
end
In the alpha-expansion graph construction, there is one vertex associated with each pixel. Each of these vertices is connected to the source (representing keeping the original label) and to the sink (representing the label α). To separate source from sink, we must cut one of these two edges at each pixel; the choice of edge determines whether the pixel keeps its original label or is set to α. Accordingly, we associate the unary costs of the two outcomes with the two links from each pixel. If the pixel already has label α, then we set the cost of being set to α to ∞.
The remaining structure of the graph is dynamic: it changes at each iteration dependingon the choice of α and the current labels. There are four possible relationships betweenadjacent pixels:
• Both can already be set to α.

• One can be set to α and the other to another value β.

• Both can be set to the same other value β.

• They can be set to two different values β and γ.
Algorithm 12.4b: Alpha expansion (expand)

Input : Costs {U_n(k)}_{n=1,k=1}^{N,K} and {P_mn(k,l)}_{m,n=1,k,l=1}^{N,N,K,K}, expansion label k, current states {w_n}_{n=1}^{N}
Output: New label assignments {w_n}_{n=1}^{N}

begin
    G = {}                                   // Initialize graph to empty
    z = N                                    // Counter for new nodes added to graph
    for n = 1 to N do
        // Connect pixel node to source and set cost
        G = G ∪ {s, n} ;  c_sn = U_n(k)
        // Connect pixel node to sink and set cost
        if w_n = k then
            G = G ∪ {n, t} ;  c_nt = ∞
        else
            G = G ∪ {n, t} ;  c_nt = U_n(w_n)
        end
        for m = 1 to n do
            if e_mn = 1 then
                if w_n = k or w_m = k then
                    if w_n ≠ k then
                        G = G ∪ {n, m} ;  c_nm = P_nm(w_m, w_n)               // Case 2a
                    end
                    if w_m ≠ k then
                        G = G ∪ {m, n} ;  c_mn = P_nm(w_n, w_m)               // Case 2b
                    end
                else if w_n = w_m then
                    G = G ∪ {n, m} ;  c_nm = P_nm(k, w_n)                     // Case 3
                    G = G ∪ {m, n} ;  c_mn = P_nm(w_n, k)
                else
                    z = z + 1                                                 // Increment new node counter
                    G = G ∪ {n, z} ;  c_nz = P_nm(k, w_n) ;  c_zn = ∞         // Case 4
                    G = G ∪ {m, z} ;  c_mz = P_nm(w_m, k) ;  c_zm = ∞
                    G = G ∪ {z, t} ;  c_zt = P_nm(w_m, w_n)
                end
            end
        end
    end
    C = Reparameterize[C]                    // Ensures all capacities are positive
    G = ComputeMinCut[G, C]                  // Augmenting paths or similar
    // Read off values
    for n = 1 to N do
        if {n, t} ∈ G then
            w_n = k
        end
    end
end
Preprocessing
Algorithm 13.1: Principal components analysis
The goal of PCA is to approximate a set of multivariate data {x_i}_{i=1}^{I} with a second set of variables {h_i}_{i=1}^{I} of reduced dimension, so that

    x_i ≈ µ + Φ h_i,

where Φ is a rectangular matrix whose columns are unit length and orthogonal to one another, so that Φ^T Φ = I.

This formulation assumes that the number of original data dimensions D is higher than the number of training examples I, and so it works by taking the singular value decomposition of the I×I matrix X^T X to compute the dual principal components Ψ before recovering the original principal components Φ.
Algorithm 13.1: Principal components analysis (dual)

Input : Training data {x_i}_{i=1}^{I}, number of components K
Output: Mean µ, PCA basis functions Φ, low dimensional data {h_i}_{i=1}^{I}

begin
    // Estimate mean
    µ = Σ_{i=1}^{I} x_i / I
    // Form mean-zero data matrix
    X = [x_1 - µ, x_2 - µ, ..., x_I - µ]
    // Do spectral decomposition and compute dual components
    [Ψ, L, Ψ] = svd[X^T X]
    // Compute principal components
    Φ = X Ψ L^{-1/2}
    // Retain only the first K columns
    Φ = [φ_1, φ_2, ..., φ_K]
    // Convert data to low dimensional representation
    for i = 1 to I do
        h_i = Φ^T (x_i - µ)
    end
    // Reconstruct data
    for i = 1 to I do
        x_i = µ + Φ h_i
    end
end
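The dual computation above can be sketched in NumPy. This is a minimal illustration rather than the book's code; the function name pca_dual and the column-wise data layout are my own choices:

```python
import numpy as np

def pca_dual(X, K):
    """Dual PCA: X is a D x I data matrix with D >> I.
    Returns mean, K principal components, and low-dimensional coordinates."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Eigendecompose the small I x I matrix X^T X instead of the D x D covariance
    evals, Psi = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(evals)[::-1][:K]      # largest K eigenvalues first
    L, Psi = evals[order], Psi[:, order]
    Phi = Xc @ Psi / np.sqrt(L)              # D x K principal components (orthonormal columns)
    H = Phi.T @ Xc                           # K x I low-dimensional representation
    return mu, Phi, H
```

Note that np.linalg.eigh returns eigenvalues in ascending order, so the columns must be reordered before truncating to K components.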
Algorithm 13.2: k-means algorithm
The goal of the k-means algorithm is to partition a set of data {x_i}_{i=1}^{I} into K clusters. It can be thought of as approximating each data point by the associated cluster mean µ_{h_i}, so that

    x_i ≈ µ_{h_i},

where h_i ∈ {1, 2, ..., K} is a discrete variable that indicates which cluster the ith point belongs to. The algorithm works by alternately (i) assigning data points to the nearest cluster center and (ii) recomputing each cluster center as the mean of the data points assigned to it.
Algorithm 13.2: K-means algorithm

Input : Data {x_i}_{i=1}^{I}, number of clusters K, data dimension D
Output: Cluster means {µ_k}_{k=1}^{K}, cluster assignment indices {h_i}_{i=1}^{I}

begin
    // Initialize cluster means (one of many heuristics)
    µ = Σ_{i=1}^{I} x_i / I                              // Compute overall mean
    Σ = Σ_{i=1}^{I} (x_i - µ)(x_i - µ)^T / I             // Compute overall covariance
    for k = 1 to K do
        µ_k = µ + Σ^{1/2} randn[D, 1]                    // Randomly draw from normal model
    end
    // Main loop
    repeat
        // Compute distance from data points to cluster means
        for i = 1 to I do
            for k = 1 to K do
                d_ik = (x_i - µ_k)^T (x_i - µ_k)
            end
            // Update cluster assignment based on closest cluster
            h_i = argmin_k [d_ik]
        end
        // Update cluster means from data assigned to each cluster
        for k = 1 to K do
            µ_k = (Σ_{i=1}^{I} δ[h_i - k] x_i) / (Σ_{i=1}^{I} δ[h_i - k])
        end
    until no further change in {µ_k}_{k=1}^{K}
end
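The two alternating steps can be sketched in NumPy as follows. This is an illustrative implementation, not the book's: for simplicity it initializes the means at K randomly chosen data points rather than sampling from a fitted normal model.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: I x D data array. Returns cluster means (K x D) and assignments (I,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initialize at K data points
    h = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: nearest cluster mean for each point
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        h_new = d.argmin(axis=1)
        if it > 0 and np.array_equal(h_new, h):
            break                                        # assignments stable: converged
        h = h_new
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(h == k):
                mu[k] = X[h == k].mean(axis=0)
    return mu, h
```

The result depends on the initialization, so in practice the algorithm is often restarted several times and the solution with the lowest total within-cluster distance is kept.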
The pinhole camera
Algorithm 14.1: ML learning of camera extrinsic parameters
Given a known object with I distinct three-dimensional points {w_i}_{i=1}^{I}, their corresponding projections in the image {x_i}_{i=1}^{I}, and known intrinsic parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ.

The solution to this problem is to minimize

    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ],

where pinhole[w_i, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 in the book). The bulk of this algorithm consists of finding a good initial starting point for this minimization. The optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
Algorithm 14.1: ML learning of extrinsic parameters

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
    for i = 1 to I do
        // Convert to normalized camera coordinates
        x'_i = Λ^{-1} [x_i; y_i; 1]
        // Compute linear constraints (with w_i = [u_i, v_i, w_i]^T)
        a_1i = [u_i, v_i, w_i, 1, 0, 0, 0, 0, -u_i x'_i, -v_i x'_i, -w_i x'_i, -x'_i]
        a_2i = [0, 0, 0, 0, u_i, v_i, w_i, 1, -u_i y'_i, -v_i y'_i, -w_i y'_i, -y'_i]
    end
    // Stack linear constraints
    A = [a_11; a_21; a_12; a_22; ...; a_1I; a_2I]
    // Solve with SVD
    [U, L, V] = svd[A]
    b = v_12                                 // Extract last column of V
    // Extract estimates up to unknown scale
    Ω = [b_1, b_2, b_3; b_5, b_6, b_7; b_9, b_10, b_11]
    τ = [b_4; b_8; b_12]
    // Find closest rotation using Procrustes method
    [U, L, V] = svd[Ω]
    Ω̂ = U V^T
    // Rescale translation
    τ̂ = τ (Σ_{i=1}^{3} Σ_{j=1}^{3} Ω̂_ij / Ω_ij) / 9
    // Use these estimates as initial conditions for non-linear optimization
    [Ω, τ] = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ]
end
Algorithm 14.2: ML learning of intrinsic parameters (camera calibration)
Given a known object with I distinct 3D points {w_i}_{i=1}^{I} and their corresponding projections in the image {x_i}_{i=1}^{I}, establish the camera parameters Λ. In order to do this, we also need to estimate the extrinsic parameters. We use the criterion

    Λ̂ = argmin_Λ [ min_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ] ],

where pinhole[w_i, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 in the book).

This algorithm takes an alternating approach in which the extrinsic parameters are found using the previous algorithm and then the intrinsic parameters are found in closed form. Finally, these estimates form the starting point for a non-linear optimization over all of the unknown parameters.
Algorithm 14.2: ML learning of intrinsic parameters

Input : World points {w_i}_{i=1}^{I}, image points {x_i}_{i=1}^{I}, initial Λ
Output: Intrinsic parameters Λ

begin
    // Main loop for alternating optimization
    for t = 1 to T do
        // Compute extrinsic parameters (algorithm 14.1)
        [Ω, τ] = calcExtrinsic[Λ, {w_i, x_i}_{i=1}^{I}]
        // Compute intrinsic parameters
        for i = 1 to I do
            // Compute matrix A_i (ω_k• is the kth row of Ω)
            a_i = (ω_1• w_i + τ_x) / (ω_3• w_i + τ_z)
            b_i = (ω_2• w_i + τ_y) / (ω_3• w_i + τ_z)
            A_i = [a_i, b_i, 1, 0, 0; 0, 0, 0, b_i, 1]
        end
        // Concatenate matrices and data points
        x = [x_1; x_2; ...; x_I]
        A = [A_1; A_2; ...; A_I]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine parameters with non-linear optimization
    Λ̂ = argmin_Λ [ min_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ] ]
end
Algorithm 14.3: Inferring 3D world points (reconstruction)
Given J calibrated cameras in known positions (i.e., cameras with known Λ, Ω, τ) viewing the same three-dimensional point w, and knowing the corresponding projections {x_j}_{j=1}^{J} in the images, establish the position of the point in the world.

As for the previous algorithms, the final solution depends on a non-linear minimization of the reprojection error between w and the observed data x_j:

    ŵ = argmin_w [ Σ_{j=1}^{J} (x_j - pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j - pinhole[w, Λ_j, Ω_j, τ_j]) ].

The algorithm below finds a good approximate initial condition for this minimization using a closed-form least-squares solution.
Algorithm 14.3: Inferring 3D world position

Input : Image points {x_j}_{j=1}^{J}, camera parameters {Λ_j, Ω_j, τ_j}_{j=1}^{J}
Output: 3D world point w

begin
    for j = 1 to J do
        // Convert to normalized camera coordinates
        x'_j = Λ_j^{-1} [x_j; y_j; 1]
        // Compute linear constraints (ω_kl,j is element k,l of Ω_j)
        a_1j = [ω_31,j x'_j - ω_11,j,  ω_32,j x'_j - ω_12,j,  ω_33,j x'_j - ω_13,j]
        a_2j = [ω_31,j y'_j - ω_21,j,  ω_32,j y'_j - ω_22,j,  ω_33,j y'_j - ω_23,j]
        b_j = [τ_xj - τ_zj x'_j;  τ_yj - τ_zj y'_j]
    end
    // Stack linear constraints
    A = [a_11; a_21; a_12; a_22; ...; a_1J; a_2J]
    b = [b_1; b_2; ...; b_J]
    // Least-squares solution for parameters
    w = (A^T A)^{-1} A^T b
    // Refine parameters with non-linear optimization
    ŵ = argmin_w [ Σ_{j=1}^{J} (x_j - pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j - pinhole[w, Λ_j, Ω_j, τ_j]) ]
end
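The linear initialization above can be sketched in NumPy as follows; the non-linear refinement stage is omitted, and the function names are illustrative rather than the book's:

```python
import numpy as np

def triangulate_linear(xs, Lambdas, Omegas, taus):
    """Linear least-squares estimate of a 3D point w from J pinhole views.
    xs: list of 2D image points; Lambdas, Omegas, taus: per-camera parameters."""
    A, b = [], []
    for x, Lam, Om, tau in zip(xs, Lambdas, Omegas, taus):
        xh = np.linalg.solve(Lam, np.array([x[0], x[1], 1.0]))  # normalized coordinates
        u, v = xh[0] / xh[2], xh[1] / xh[2]
        # Each view contributes two linear constraints on w
        A.append(u * Om[2] - Om[0])
        A.append(v * Om[2] - Om[1])
        b.append(tau[0] - tau[2] * u)
        b.append(tau[1] - tau[2] * v)
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w
```

Each constraint row comes from cross-multiplying the projection equation, e.g. x'(ω_3·w + τ_z) = ω_1·w + τ_x, so with exact data the stacked system is consistent and the least-squares solution is exact.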
Models for transformations
Algorithm 15.1: ML learning of Euclidean transformation
The Euclidean transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω and a translation τ. To recover these parameters we use the criterion

    Ω̂, τ̂ = argmin_{Ω,τ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[Ω w_i + τ, σ² I] ] ],

where Ω is constrained to be a rotation matrix, so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 15.1: Maximum likelihood learning of Euclidean transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, variance σ²

begin
    // Compute means of the two data sets
    µ_w = Σ_{i=1}^{I} w_i / I
    µ_x = Σ_{i=1}^{I} x_i / I
    // Concatenate data into matrix form
    W = [w_1 - µ_w, w_2 - µ_w, ..., w_I - µ_w]
    X = [x_1 - µ_x, x_2 - µ_x, ..., x_I - µ_x]
    // Solve for rotation
    [U, L, V] = svd[W X^T]
    Ω = V U^T
    // Solve for translation
    τ = Σ_{i=1}^{I} (x_i - Ω w_i) / I
end
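A NumPy sketch of this closed-form solution follows. One caution: the SVD construction above can return a reflection rather than a rotation for some data; the determinant guard below, which enforces det[Ω] = 1, is an addition of mine and is not part of the booklet's pseudocode.

```python
import numpy as np

def fit_euclidean(W, X):
    """W, X: 2 x I matched point sets; returns Omega, tau with X ≈ Omega @ W + tau."""
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    # Rotation from the SVD of the cross-covariance of the centered sets
    U, _, Vt = np.linalg.svd((W - mw) @ (X - mx).T)
    Omega = Vt.T @ U.T
    if np.linalg.det(Omega) < 0:         # guard against a reflection solution
        Vt[-1] *= -1
        Omega = Vt.T @ U.T
    tau = mx - Omega @ mw                # translation from the means
    return Omega, tau
```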
Algorithm 15.2: ML learning of similarity transformation
The similarity transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a rotation Ω, a translation τ, and a scaling factor ρ. To recover these parameters we use the criterion

    Ω̂, τ̂, ρ̂ = argmin_{Ω,τ,ρ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[ρ Ω w_i + τ, σ² I] ] ],

where Ω is constrained to be a rotation matrix, so that Ω^T Ω = I and det[Ω] = 1.
Algorithm 15.2: Maximum likelihood learning of similarity transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Rotation Ω, translation τ, scale ρ, variance σ²

begin
    // Compute means of the two data sets
    µ_w = Σ_{i=1}^{I} w_i / I
    µ_x = Σ_{i=1}^{I} x_i / I
    // Concatenate data into matrix form
    W = [w_1 - µ_w, w_2 - µ_w, ..., w_I - µ_w]
    X = [x_1 - µ_x, x_2 - µ_x, ..., x_I - µ_x]
    // Solve for rotation
    [U, L, V] = svd[W X^T]
    Ω = V U^T
    // Solve for scaling
    ρ = (Σ_{i=1}^{I} (x_i - µ_x)^T Ω (w_i - µ_w)) / (Σ_{i=1}^{I} (w_i - µ_w)^T (w_i - µ_w))
    // Solve for translation
    τ = Σ_{i=1}^{I} (x_i - ρ Ω w_i) / I
end
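The similarity case extends the Euclidean sketch with the scale estimate. As before, the determinant guard against reflections is my addition, not part of the booklet's pseudocode, and the function name is illustrative:

```python
import numpy as np

def fit_similarity(W, X):
    """W, X: 2 x I matched point sets; returns Omega, rho, tau with X ≈ rho*Omega@W + tau."""
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    Wc, Xc = W - mw, X - mx
    # Rotation from the SVD of the cross-covariance
    U, _, Vt = np.linalg.svd(Wc @ Xc.T)
    Omega = Vt.T @ U.T
    if np.linalg.det(Omega) < 0:         # guard against a reflection solution
        Vt[-1] *= -1
        Omega = Vt.T @ U.T
    # Scale: ratio of projected cross terms to the squared norm of the w's
    rho = np.sum(Xc * (Omega @ Wc)) / np.sum(Wc * Wc)
    tau = mx - rho * Omega @ mw
    return Omega, rho, tau
```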
Algorithm 15.3: ML learning of affine transformation
The affine transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a linear transformation Φ and an offset τ. To recover these parameters we use the criterion

    Φ̂, τ̂ = argmin_{Φ,τ} [ -Σ_{i=1}^{I} log[ Norm_{x_i}[Φ w_i + τ, σ² I] ] ].
Algorithm 15.3: Maximum likelihood learning of affine transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Linear transformation Φ, offset τ, variance σ²

begin
    // Compute intermediate 2×6 matrices A_i
    for i = 1 to I do
        A_i = [w_i^T, 1, 0^T; 0^T, w_i^T, 1]
    end
    // Concatenate matrices A_i into 2I×6 matrix A
    A = [A_1; A_2; ...; A_I]
    // Concatenate output points into 2I×1 vector c
    c = [x_1; x_2; ...; x_I]
    // Solve for transformation parameters
    φ = (A^T A)^{-1} A^T c
    // Extract parameters
    Φ = [φ_1, φ_2; φ_4, φ_5]
    τ = [φ_3; φ_6]
    // Solve for variance
    σ² = Σ_{i=1}^{I} (x_i - Φ w_i - τ)^T (x_i - Φ w_i - τ) / (2I)
end
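The least-squares construction above translates directly into NumPy; this sketch (with an illustrative function name) builds the stacked system row by row:

```python
import numpy as np

def fit_affine(W, X):
    """W, X: 2 x I matched point sets; returns Phi (2x2), tau (2x1) with X ≈ Phi@W + tau."""
    I = W.shape[1]
    A = np.zeros((2 * I, 6))
    c = X.T.reshape(-1)                      # [x1, y1, x2, y2, ...]
    for i in range(I):
        u, v = W[:, i]
        A[2 * i]     = [u, v, 1, 0, 0, 0]    # constraint from x-coordinate
        A[2 * i + 1] = [0, 0, 0, u, v, 1]    # constraint from y-coordinate
    phi, *_ = np.linalg.lstsq(A, c, rcond=None)
    Phi = np.array([[phi[0], phi[1]], [phi[3], phi[4]]])
    tau = np.array([[phi[2]], [phi[5]]])
    return Phi, tau
```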
Algorithm 15.4: ML learning of projective transformation (homography)
The projective transformation model maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} with a non-linear transformation defined by a 3×3 parameter matrix Φ. To recover this matrix we use the criterion

    Φ̂ = argmin_Φ [ -Σ_{i=1}^{I} log[ Norm_{x_i}[proj[w_i, Φ], σ² I] ] ],

where the function proj[w_i, Φ] applies the homography to the point w_i = [u, v]^T and is defined as

    proj[w_i, Φ] = [ (φ_11 u + φ_12 v + φ_13) / (φ_31 u + φ_32 v + φ_33),
                     (φ_21 u + φ_22 v + φ_23) / (φ_31 u + φ_32 v + φ_33) ]^T.

Unlike the previous three transformations, it is not possible to minimize this criterion in closed form. The best we can do is to compute an approximate solution and use it to start a non-linear minimization.
Algorithm 15.4: Maximum likelihood learning of projective transformation

Input : Training data pairs {x_i, w_i}_{i=1}^{I}
Output: Parameter matrix Φ, variance σ²

begin
    // Convert data to homogeneous representation
    for i = 1 to I do
        x̃_i = [x_i; 1] ;  w̃_i = [w_i; 1]
    end
    // Compute intermediate 2×9 matrices A_i
    for i = 1 to I do
        A_i = [0^T, -w̃_i^T, y_i w̃_i^T; w̃_i^T, 0^T, -x_i w̃_i^T]
    end
    // Concatenate matrices A_i into 2I×9 matrix A
    A = [A_1; A_2; ...; A_I]
    // Solve for approximate parameters
    [U, L, V] = svd[A]
    Φ_0 = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
    // Refine parameters with non-linear optimization
    Φ̂ = argmin_Φ [ -Σ_{i=1}^{I} log[ Norm_{x_i}[proj[w_i, Φ], σ² I] ] ]
end
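The approximate (direct linear transform) stage can be sketched in NumPy as follows; the non-linear refinement is omitted, and the function names are my own:

```python
import numpy as np

def fit_homography(W, X):
    """DLT estimate of the 3x3 matrix Phi mapping W to X; W, X are 2 x I with I >= 4."""
    rows = []
    for (u, v), (x, y) in zip(W.T, X.T):
        wh = np.array([u, v, 1.0])
        # Two constraint rows per correspondence
        rows.append(np.concatenate([np.zeros(3), -wh, y * wh]))
        rows.append(np.concatenate([wh, np.zeros(3), -x * wh]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)              # null-space solution, defined up to scale

def proj(Phi, w):
    """Apply homography Phi to a 2D point w."""
    a = Phi @ np.array([w[0], w[1], 1.0])
    return a[:2] / a[2]
```

Since the solution is only defined up to scale, results are usually compared after normalizing, e.g. by dividing by φ_33.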
Algorithm 15.5: ML Inference for transformation models
Consider a transformation model that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} so that

    Pr(x_i | w_i, Φ) = Norm_{x_i}[ trans[w_i, Φ], σ² I ].

In inference we are given a new data point x = [x, y]^T and wish to compute the most likely point w = [u, v]^T that was responsible for it. To make progress, we write the transformation model trans[w_i, Φ] in homogeneous form

    λ [x; y; 1] = [φ_11, φ_12, φ_13; φ_21, φ_22, φ_23; φ_31, φ_32, φ_33] [u; v; 1],

or x̃ = Φ w̃. The Euclidean, similarity, affine, and projective transformations can all be expressed as a 3×3 matrix of this kind.
Algorithm 15.5: Maximum likelihood inference for transformation models

Input : Transformation parameters Φ, new point x
Output: Point w

begin
    // Convert data to homogeneous representation
    x̃ = [x; 1]
    // Apply inverse transformation
    a = Φ^{-1} x̃
    // Convert back to Cartesian coordinates
    w = [a_1/a_3; a_2/a_3]
end
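This inference step is a one-liner in NumPy; solving the linear system is preferable to forming the inverse explicitly (the function name is illustrative):

```python
import numpy as np

def invert_transform(Phi, x):
    """Most likely pre-image of 2D point x under the homogeneous 3x3 transform Phi."""
    a = np.linalg.solve(Phi, np.array([x[0], x[1], 1.0]))   # apply inverse transform
    return a[:2] / a[2]                                     # back to Cartesian coordinates
```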
Algorithm 15.6: Learning extrinsic parameters (planar scene)
Consider a calibrated camera with known intrinsic parameters Λ viewing a planar scene. We are given a set of 2D positions {w_i}_{i=1}^{I} on the plane (measured in real-world units such as cm) and their corresponding 2D pixel positions {x_i}_{i=1}^{I}. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point w = [u, v, w]^T in the frame of reference of the plane (where w = 0 on the plane) into the frame of reference of the camera.

This goal is accomplished by minimizing the criterion

    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ].

The optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix. The bulk of this algorithm consists of computing a good initialization for this minimization.
Algorithm 15.6: ML learning of extrinsic parameters (planar scene)

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^{I}
Output: Extrinsic parameters: rotation Ω and translation τ

begin
    // Compute homography between pairs of points (algorithm 15.4)
    Φ = LearnHomography[{x_i}_{i=1}^{I}, {w_i}_{i=1}^{I}]
    // Eliminate effect of intrinsic parameters
    Φ = Λ^{-1} Φ
    // Compute SVD of first two columns of Φ
    [U, L, V] = svd[[φ_1, φ_2]]
    // Estimate first two columns of rotation matrix
    [ω_1, ω_2] = [u_1, u_2] V^T
    // Estimate third column by taking cross product
    ω_3 = ω_1 × ω_2
    Ω = [ω_1, ω_2, ω_3]
    // Check that determinant is not minus one
    if |Ω| < 0 then
        Ω = [ω_1, ω_2, -ω_3]
    end
    // Compute scaling factor for translation vector
    λ = (Σ_{i=1}^{3} Σ_{j=1}^{2} ω_ij / φ_ij) / 6
    // Compute translation
    τ = λ φ_3
    // Refine parameters with non-linear optimization
    Ω̂, τ̂ = argmin_{Ω,τ} [ Σ_{i=1}^{I} (x_i - pinhole[w_i, Λ, Ω, τ])^T (x_i - pinhole[w_i, Λ, Ω, τ]) ]
end
Algorithm 15.7: Learning intrinsic parameters (planar scene)
This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown poses {Ω_j, τ_j}_{j=1}^{J}. For each image we know I points {w_i}_{i=1}^{I}, where w_i = [u_i, v_i, 0]^T, and we know their imaged positions {x_ij}_{i=1,j=1}^{I,J} in each of the J scenes. The goal is to compute the intrinsic matrix Λ. To this end, we use the criterion

    Λ̂ = argmin_Λ [ Σ_{j=1}^{J} min_{Ω_j,τ_j} [ Σ_{i=1}^{I} (x_ij - pinhole[w_i, Λ, Ω_j, τ_j])^T (x_ij - pinhole[w_i, Λ, Ω_j, τ_j]) ] ],

where again the minimization must be carried out while ensuring that each Ω_j is a valid rotation matrix. The strategy is to alternately estimate the extrinsic parameters using the previous algorithm and then compute the intrinsic parameters in closed form. After several iterations, we use the resulting solution as the initial condition for a non-linear optimization.
Algorithm 15.7: ML learning of intrinsic parameters (planar scene)

Input : World points {w_i}_{i=1}^{I}, image points {x_ij}_{i=1,j=1}^{I,J}, initial Λ
Output: Intrinsic parameters Λ

begin
    // Main loop for alternating optimization
    for k = 1 to K do
        // Compute extrinsic parameters for each image (algorithm 15.6)
        for j = 1 to J do
            [Ω_j, τ_j] = calcExtrinsic[Λ, {w_i, x_ij}_{i=1}^{I}]
        end
        // Compute intrinsic parameters
        for i = 1 to I do
            for j = 1 to J do
                // Compute matrix A_ij (ω_k•,j is the kth row of Ω_j; τ_zj is the z-component of τ_j)
                a_ij = (ω_1•,j w_i + τ_xj) / (ω_3•,j w_i + τ_zj)
                b_ij = (ω_2•,j w_i + τ_yj) / (ω_3•,j w_i + τ_zj)
                A_ij = [a_ij, b_ij, 1, 0, 0; 0, 0, 0, b_ij, 1]
            end
        end
        // Concatenate matrices and data points
        x = [x_11; x_12; ...; x_IJ]
        A = [A_11; A_12; ...; A_IJ]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine parameters with non-linear optimization
    Λ̂ = argmin_Λ [ Σ_{j=1}^{J} min_{Ω_j,τ_j} [ Σ_{i=1}^{I} (x_ij - pinhole[w_i, Λ, Ω_j, τ_j])^T (x_ij - pinhole[w_i, Λ, Ω_j, τ_j]) ] ]
end
Algorithm 15.8: Robust learning of projective transformation with RANSAC
The goal of this algorithm is to fit a homography that maps one set of 2D points {w_i}_{i=1}^{I} to another set {x_i}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches and the outliers.

The algorithm uses the RANSAC procedure: it repeatedly computes the homography from a minimal subset of matches. Since there are 8 unknowns in the 3×3 matrix that defines the homography, and each match provides two linear constraints (one each from the x- and y-coordinates), we need a minimum of four matches to compute the homography. The RANSAC procedure chooses these four matches randomly, computes the homography, and then measures the amount of agreement in the rest of the dataset. After many iterations of this procedure, we recompute the homography from the randomly chosen matches with the best agreement together with the points that agreed with them (the inliers).
Algorithm 15.8: Robust ML learning of homography

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices B

begin
    // Initialize best inlier set to empty
    B = {}
    for n = 1 to N do
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 4]
        // Compute homography (algorithm 15.4)
        Φ_n = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute squared distance
            d = (x_i - proj[w_i, Φ_n])^T (x_i - proj[w_i, Φ_n])
            // If small enough then add to inliers
            if d < τ² then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |B| then
            B = S_n
        end
    end
    // Recompute homography from all inliers
    Φ = LearnHomography[{x_i}_{i∈B}, {w_i}_{i∈B}]
end
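The RANSAC loop itself is model-agnostic: the same structure is reused for the fundamental matrix in algorithm 16.3. The sketch below therefore abstracts the model-fitting and distance functions into parameters; all names are illustrative, and the trivial 2D-translation model used in testing stands in for the homography estimator.

```python
import numpy as np

def ransac(pairs, fit, sq_dist, n_min, n_steps, thresh, seed=0):
    """Generic RANSAC: returns a model refit on the best inlier set, plus inlier indices.
    fit maps a list of pairs to a model; sq_dist gives squared error of a pair under a model."""
    rng = np.random.default_rng(seed)
    best = []
    for _ in range(n_steps):
        sample = rng.choice(len(pairs), size=n_min, replace=False)  # minimal subset
        model = fit([pairs[i] for i in sample])
        inliers = [i for i, p in enumerate(pairs) if sq_dist(model, p) < thresh ** 2]
        if len(inliers) > len(best):
            best = inliers                                          # keep largest consensus set
    return fit([pairs[i] for i in best]), best
```

For homographies one would pass n_min=4, fit = the DLT estimator, and sq_dist = the squared reprojection error (x_i - proj[w_i, Φ])^T (x_i - proj[w_i, Φ]).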
Algorithm 15.9: Sequential RANSAC for fitting homographies
Sequential RANSAC fits K homographies to disjoint subsets of the point pairs {w_i, x_i}_{i=1}^{I}. The procedure is greedy: the algorithm fits the first homography, removes its inliers from the set of point pairs, and then tries to fit a second homography to the remaining points. In principle, this algorithm can find a set of matching planes between two images. However, in practice it often makes mistakes: it does not exploit information about the spatial coherence of the matches, and it cannot recover from errors made in the greedy matching procedure.
Algorithm 15.9: Robust sequential learning of homographies

Input : Points {x_i, w_i}_{i=1}^{I}, RANSAC steps N, inlier threshold τ, number of homographies K
Output: K homographies {Φ_k} and associated inlier indices {I_k}

begin
    // Initialize set of indices of remaining point pairs
    S = {1 ... I}
    for k = 1 to K do
        // Compute homography using RANSAC (algorithm 15.8)
        [Φ_k, I_k] = LearnHomographyRobust[{x_i}_{i∈S}, {w_i}_{i∈S}, N, τ]
        // Remove inliers from remaining points
        S = S \ I_k
        // Check that there are enough remaining points
        if |S| < 4 then
            break
        end
    end
end
Algorithm 15.10: PEaRL for fitting homographies
The propose, expand, and re-learn (PEaRL) algorithm attempts to make up for the deficiencies of sequential RANSAC for fitting homographies. It first proposes a large number of possible homographies relating the point pairs {w_i, x_i}_{i=1}^{I}. These then compete for the point pairs to be assigned to them, and they are re-learnt based on these assignments. The algorithm has a spatial component that encourages nearby points to belong to the same model, and it is iterative rather than greedy and so can recover from errors.
Algorithm 15.10: PEaRL learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^{I}, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {N_i}_{i=1}^{I}, pairwise cost P
Output: Set of homographies {Φ_m} and associated inlier indices {I_m}

begin
    // Propose step: generate M hypotheses
    m = 1                                            // Hypothesis number
    repeat
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 4]
        // Compute homography (algorithm 15.4)
        Φ_m = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        I_m = {}                                     // Initialize inlier set to empty
        for i = 1 to I do
            d_im = (x_i - proj[w_i, Φ_m])^T (x_i - proj[w_i, Φ_m])
            if d_im < τ² then                        // If distance small, add to inliers
                I_m = I_m ∪ {i}
            end
        end
        if |I_m| ≥ l then                            // If enough inliers, move to next hypothesis
            m = m + 1
        end
    until m > M
    for j = 1 to J do
        // Expand step: returns I×1 label vector l (D contains the distances d_im)
        l = AlphaExpand[D, P, {N_i}_{i=1}^{I}]
        // Re-learn step: re-estimate homographies with support
        for m = 1 to M do
            I_m = find[l == m]                       // Extract points with label m
            // If enough support then re-learn and update distances
            if |I_m| ≥ 4 then
                Φ_m = LearnHomography[{x_i}_{i∈I_m}, {w_i}_{i∈I_m}]
                for i = 1 to I do
                    d_im = (x_i - proj[w_i, Φ_m])^T (x_i - proj[w_i, Φ_m])
                end
            end
        end
    end
end
Multiple cameras
Algorithm 16.1: Camera geometry from point matches
This algorithm finds approximate estimates of the rotation and translation (up to scale) between two cameras given a set of I point matches {x_i1, x_i2}_{i=1}^{I} between two images. More precisely, the algorithm assumes that the first camera is at the world origin and recovers the extrinsic parameters of the second camera.

There is a fourfold ambiguity in the possible solution due to the symmetry of the camera model, which allows points behind the camera to be imaged even though this is clearly not possible in the real world. The algorithm distinguishes between the four solutions by reconstructing all of the points with each one and choosing the solution for which the largest number of points lies in front of both cameras.
Algorithm 16.1: Extracting relative camera position from point matches

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, intrinsic matrices Λ_1, Λ_2
Output: Rotation Ω and translation τ between the cameras

begin
    // Compute fundamental matrix (algorithm 16.2)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^{I}]
    // Compute essential matrix
    E = Λ_2^T F Λ_1
    // Extract four possible rotations and translations from E
    W = [0, -1, 0; 1, 0, 0; 0, 0, -1]
    [U, L, V] = svd[E]
    τ_1 = U L W U^T ;     Ω_1 = U W^{-1} V^T
    τ_2 = U L W^{-1} U^T ; Ω_2 = U W V^T
    τ_3 = -τ_1 ;          Ω_3 = Ω_1
    τ_4 = -τ_2 ;          Ω_4 = Ω_2
    // For each possible solution
    for k = 1 to 4 do
        t_k = 0                                  // Number of points in front of cameras for kth solution
        // For each point
        for i = 1 to I do
            // Reconstruct point (algorithm 14.3); first camera has Ω = I, τ = 0
            w = Reconstruct[x_i1, x_i2, Λ_1, Λ_2, I, 0, Ω_k, τ_k]
            // Compute point in frame of reference of second camera
            w' = Ω_k w + τ_k
            // Test whether point is reconstructed in front of both cameras
            if w_3 > 0 and w'_3 > 0 then
                t_k = t_k + 1
            end
        end
    end
    // Choose solution with most support
    k̂ = argmax_k [t_k]
    Ω = Ω_k̂ ;  τ = τ_k̂
end
Algorithm 16.2: Eight point algorithm for fundamental matrix
This algorithm takes a set of I ≥ 8 point correspondences {x_i1, x_i2}_{i=1}^{I} between two images and computes the fundamental matrix using the eight-point algorithm. To improve numerical stability, the point positions are first transformed to have zero mean and spherical covariance; the resulting fundamental matrix is then modified to compensate for this transformation. This algorithm is usually used to compute an initial estimate for a subsequent non-linear optimization of the symmetric epipolar distance.
Algorithm 16.2: Eight point algorithm for fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}
Output: Fundamental matrix F

begin
    // Compute statistics of data
    µ_1 = Σ_{i=1}^{I} x_i1 / I
    Σ_1 = Σ_{i=1}^{I} (x_i1 - µ_1)(x_i1 - µ_1)^T / I
    µ_2 = Σ_{i=1}^{I} x_i2 / I
    Σ_2 = Σ_{i=1}^{I} (x_i2 - µ_2)(x_i2 - µ_2)^T / I
    for i = 1 to I do
        // Compute transformed coordinates
        x̃_i1 = Σ_1^{-1/2} (x_i1 - µ_1)
        x̃_i2 = Σ_2^{-1/2} (x_i2 - µ_2)
        // Compute constraint
        A_i = [x̃_i2 x̃_i1, x̃_i2 ỹ_i1, x̃_i2, ỹ_i2 x̃_i1, ỹ_i2 ỹ_i1, ỹ_i2, x̃_i1, ỹ_i1, 1]
    end
    // Append constraints and solve
    A = [A_1; A_2; ...; A_I]
    [U, L, V] = svd[A]
    F = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
    // Compensate for transformation
    T_1 = [Σ_1^{-1/2}, -Σ_1^{-1/2} µ_1; 0, 0, 1]
    T_2 = [Σ_2^{-1/2}, -Σ_2^{-1/2} µ_2; 0, 0, 1]
    F = T_2^T F T_1
    // Ensure that matrix has rank 2
    [U, L, V] = svd[F]
    l_33 = 0
    F = U L V^T
end
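A minimal NumPy sketch of the core estimation follows. For brevity it omits the normalizing transformation described above (which matters for real, noisy pixel coordinates) and keeps only the null-space solution and the rank-2 projection; the function name is my own.

```python
import numpy as np

def eight_point(X1, X2):
    """Fundamental matrix from I >= 8 matches; X1, X2 are 2 x I arrays.
    The normalizing transformation of the full algorithm is omitted for brevity."""
    rows = [[x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, 1.0]
            for (x1, y1), (x2, y2) in zip(X1.T, X2.T)]
    _, _, Vt = np.linalg.svd(np.array(rows))
    F = Vt[-1].reshape(3, 3)                 # null-space solution, up to scale
    # Project to the nearest rank-2 matrix by zeroing the smallest singular value
    U, L, Vt = np.linalg.svd(F)
    L[2] = 0.0
    return U @ np.diag(L) @ Vt
```

With exact correspondences the recovered F satisfies the epipolar constraint x̃_2^T F x̃_1 = 0 for every match.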
Algorithm 16.3: Robust computation of fundamental matrix with RANSAC
The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {x_i1, x_i2}_{i=1}^{I} in the case where some of the point matches are known to be wrong (outliers). Robustness is achieved by applying the RANSAC algorithm. Since the fundamental matrix has eight unknown quantities, we randomly select eight point pairs at each stage of the algorithm (each pair contributes one constraint). The algorithm also returns the true matches.
Algorithm 16.3: Robust ML fitting of fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, set of inlier indices B

begin
    // Initialize best inlier set to empty
    B = {}
    for n = 1 to N do
        // Draw 8 different random integers between 1 and I
        R = RandomSubset[{1 ... I}, 8]
        // Compute fundamental matrix (algorithm 16.2)
        F_n = ComputeFundamental[{x_i1}_{i∈R}, {x_i2}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute epipolar line in first image
            x̃_i2 = [x_i2; 1]
            l_1 = x̃_i2^T F_n
            // Compute squared distance to epipolar line
            d_1 = (l_11 x_i1 + l_12 y_i1 + l_13)² / (l_11² + l_12²)
            // Compute epipolar line in second image
            x̃_i1 = [x_i1; 1]
            l_2 = F_n x̃_i1
            // Compute squared distance to epipolar line
            d_2 = (l_21 x_i2 + l_22 y_i2 + l_23)² / (l_21² + l_22²)
            // If both small enough then add to inliers
            if d_1 < τ² and d_2 < τ² then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |B| then
            B = S_n
        end
    end
    // Recompute fundamental matrix from all inliers
    F = ComputeFundamental[{x_i1}_{i∈B}, {x_i2}_{i∈B}]
end
Algorithm 16.4: Planar rectification
This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity along the x-axis. The homography for the first image is chosen so that corresponding points lie on the same horizontal lines as in the rectified second image and the distance between the matches is smallest in a least-squares sense (i.e., the disparity is smallest).
Algorithm 16.4: Planar rectification

Input : Point pairs {x_i1, x_i2}_{i=1}^{I}
Output: Homographies Φ_1, Φ_2 to transform first and second images

begin
    // Compute fundamental matrix (algorithm 16.2)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^{I}]
    // Compute epipole in image 2
    [U, L, V] = svd[F]
    e = [u_13, u_23, u_33]^T
    // Compute three transformation matrices ([δ_x, δ_y] is the image center)
    T_1 = [1, 0, -δ_x; 0, 1, -δ_y; 0, 0, 1]
    θ = atan2[e_y - δ_y, e_x - δ_x]
    T_2 = [cos[θ], sin[θ], 0; -sin[θ], cos[θ], 0; 0, 0, 1]
    T_3 = [1, 0, 0; 0, 1, 0; -1/f, 0, 1]     // f is the x-coordinate of the transformed epipole
    // Compute homography for second image
    Φ_2 = T_3 T_2 T_1
    // Compute factorization of fundamental matrix
    L = diag[l_11, l_22, (l_11 + l_22)/2]
    W = [0, -1, 0; 1, 0, 0; 0, 0, 1]
    M = U L W V^T
    // Prepare matrices for solution for Φ_1
    for i = 1 to I do
        // Transform points
        x'_i1 = hom[x_i1, Φ_2 M]
        x'_i2 = hom[x_i2, Φ_2]
        // Create elements of A and b
        A_i = [x'_i1, y'_i1, 1]
        b_i = x'_i2
    end
    // Concatenate elements of A and b
    A = [A_1; A_2; ...; A_I]
    b = [b_1; b_2; ...; b_I]
    // Solve for α
    α = (A^T A)^{-1} A^T b
    // Calculate homography for first image
    Φ_1 = (I + [1, 0, 0]^T α^T) Φ_2 M
end
Models for shape
Algorithm 17.1: Generalized Procrustes analysis
The goal of generalized Procrustes analysis is to align a set of shape vectors {w_i}_{i=1}^{I} with respect to a given transformation family (Euclidean, similarity, affine, etc.). Each shape vector consists of a set of N 2D points, w_i = [w_i1^T, w_i2^T, ..., w_iN^T]^T. The algorithm below uses the example of registering with respect to a similarity transformation, which consists of a rotation Ω, a scaling ρ, and a translation τ.
Algorithm 17.1: Generalized Procrustes analysis

Input : Shape vectors {w_i}_{i=1}^I, number of iterations K
Output: Template w̄, transformations {Ω_i, ρ_i, τ_i}_{i=1}^I
begin
  Initialize w̄ = w_1
  // Main iteration loop
  for k=1 to K do
    // Compute transformation from template to each shape (algorithm 15.2)
    for i=1 to I do
      [Ω_i, ρ_i, τ_i] = EstimateSimilarity[{w̄_n}_{n=1}^N, {w_in}_{n=1}^N]
    end
    // Update template points (average of inverse-transformed shapes)
    w̄_n = Σ_{i=1}^I Ω_i^T (w_in − τ_i)/(I ρ_i)
    // Normalize template
    w̄ = w̄/|w̄|
  end
end
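As a concrete sketch, the loop above translates into a few lines of NumPy. The book's EstimateSimilarity routine (algorithm 15.2) is not reproduced in this booklet, so a standard least-squares similarity fit via the SVD stands in for it here; shapes are stored as N×2 arrays rather than stacked 2N×1 vectors, and the function names are my own.

```python
import numpy as np

def estimate_similarity(tpl, W):
    """Least-squares fit of W ≈ rho * Omega @ tpl + tau (points stored as rows)."""
    mu_t, mu_w = tpl.mean(axis=0), W.mean(axis=0)
    A, B = tpl - mu_t, W - mu_w
    U, S, Vt = np.linalg.svd(B.T @ A)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])   # exclude reflections
    Omega = U @ D @ Vt
    rho = (S * np.diag(D)).sum() / (A ** 2).sum()
    tau = mu_w - rho * Omega @ mu_t
    return Omega, rho, tau

def generalized_procrustes(shapes, n_iter=20):
    """Align a list of N x 2 shapes; returns the normalized template."""
    w_bar = shapes[0].copy()
    for _ in range(n_iter):
        acc = np.zeros_like(w_bar)
        for W in shapes:
            Omega, rho, tau = estimate_similarity(w_bar, W)
            acc += (W - tau) @ Omega / rho    # inverse similarity transform
        w_bar = acc / len(shapes)             # average of inverse transforms
        w_bar /= np.linalg.norm(w_bar)        # normalize template
    return w_bar
```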
Algorithm 17.2: Probabilistic principal components analysis
The probabilistic principal components analysis model describes a set of I D×1 data examples {x_i}_{i=1}^I with the model

Pr(x_i) = Norm_{x_i}[μ, ΦΦ^T + σ²I],

where μ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K-dimensional subspace, and the parameter σ² explains the variation of the data around this subspace.

Notice that this model is very similar to factor analysis (see algorithm 6.3). The only difference is that here we have spherical additive noise σ²I rather than a diagonal noise component Σ. This small change has important ramifications for the learning algorithm: we no longer need an iterative learning procedure based on the EM algorithm, and can instead learn the parameters in closed form.
Algorithm 17.2: ML learning of PPCA model

Input : Training data {x_i}_{i=1}^I, number of principal components K
Output: Parameters μ, Φ, σ²
begin
  // Estimate mean parameter
  μ = Σ_{i=1}^I x_i/I
  // Form matrix of zero-mean data
  X = [x_1 − μ, x_2 − μ, . . . , x_I − μ]
  // Compute eigenvectors V and eigenvalues L of the scatter matrix
  [V, L, V^T] = svd[X^T X]
  U = XVL^{−1/2}
  // Estimate noise parameter
  σ² = Σ_{j=K+1}^D l_jj/(D − K)
  // Estimate principal components
  U_K = [u_1, u_2, . . . , u_K]
  L_K = diag[l_11, l_22, . . . , l_KK]
  Φ = U_K (L_K − σ²I)^{1/2}
end
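A direct NumPy translation is sketched below. One caveat: the SVD of the centered data yields eigenvalues of the scatter matrix, so this sketch divides by I to obtain the sample-covariance eigenvalues used in the standard closed-form solution; the function name is my own.

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML fit of PPCA; X is D x I with one example per column."""
    D, I = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Left singular vectors of Xc; eigenvalues of the sample covariance
    U, s, _ = np.linalg.svd(Xc, full_matrices=True)
    lam = np.zeros(D)
    lam[:min(D, I)] = s ** 2 / I
    # Discarded variance becomes the spherical noise estimate
    sigma2 = lam[K:].sum() / (D - K)
    # Principal-component matrix Phi = U_K (L_K - sigma^2 I)^{1/2}
    Phi = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma2))
    return mu, Phi, sigma2
```

A useful sanity check: the top K eigenvalues of the model covariance ΦΦ^T + σ²I coincide with the top K eigenvalues of the sample covariance, and the remaining ones all equal σ².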
Models for style and identity
Algorithm 18.1: ML learning of subspace identity model
This model describes the jth of J data examples from the ith of I identities as

x_ij = μ + Φh_i + ε_ij,

where x_ij is the D×1 observed data vector, μ is the D×1 mean vector, Φ is the D×K factor matrix, h_i is the K×1 hidden variable representing the identity, and ε_ij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.
Algorithm 18.1: Maximum likelihood learning for identity subspace model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, number of factors K
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Σ}
begin
  Initialize θ = θ_0 (a)
  // Set mean
  μ = Σ_{i=1}^I Σ_{j=1}^J x_ij/(IJ)
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (JΦ^TΣ^{−1}Φ + I)^{−1} Φ^TΣ^{−1} Σ_{j=1}^J (x_ij − μ)
      E[h_i h_i^T] = (JΦ^TΣ^{−1}Φ + I)^{−1} + E[h_i]E[h_i]^T
    end
    // Maximization step
    Φ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − μ)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − μ)(x_ij − μ)^T − ΦE[h_i](x_ij − μ)^T]
    // Compute data log likelihood
    for i=1 to I do
      x′_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T   // compound data vector, JD×1
    end
    μ′ = [μ^T, μ^T, . . . , μ^T]^T                // compound mean vector, JD×1
    Φ′ = [Φ^T, Φ^T, . . . , Φ^T]^T                // compound factor matrix, JD×K
    Σ′ = diag[Σ, Σ, . . . , Σ]                    // compound covariance, JD×JD
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]] (b)
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating the inverse of this covariance matrix using the matrix inversion lemma.
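For small problems, the E- and M-steps vectorize neatly when the data are stored as an I×J×D array. The sketch below assumes the same number J of examples per identity and evaluates the compound log likelihood naively to monitor convergence (footnote b's matrix-inversion-lemma trick is omitted); the function name is my own.

```python
import numpy as np

def fit_identity_subspace(X, K, n_iter=15, seed=0):
    """EM for x_ij = mu + Phi h_i + eps_ij; X has shape (I, J, D)."""
    I, J, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=(0, 1))
    Xc = X - mu
    Phi = rng.standard_normal((D, K))
    Sigma = X.reshape(-1, D).var(axis=0)        # diagonal, kept as a vector
    hist = []
    for _ in range(n_iter):
        # E-step: shared posterior precision, per-identity moments
        PtSi = Phi.T / Sigma                     # Phi^T Sigma^{-1}
        Cov = np.linalg.inv(J * PtSi @ Phi + np.eye(K))
        Eh = Xc.sum(axis=1) @ PtSi.T @ Cov.T     # (I, K) matrix of E[h_i]
        Shh = I * Cov + Eh.T @ Eh                # sum_i E[h_i h_i^T]
        # M-step: joint closed-form update of Phi then Sigma
        Phi = np.einsum('ijd,ik->dk', Xc, Eh) @ np.linalg.inv(J * Shh)
        Sigma = (np.einsum('ijd,ijd->d', Xc, Xc)
                 - np.einsum('id,ijd->d', Eh @ Phi.T, Xc)) / (I * J)
        # Compound log likelihood (naive, small problems only)
        C = np.tile(Phi, (J, 1)) @ np.tile(Phi, (J, 1)).T + np.diag(np.tile(Sigma, J))
        Ci = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        Xf = Xc.reshape(I, J * D)
        hist.append(-0.5 * (I * (J * D * np.log(2 * np.pi) + logdet)
                            + np.einsum('id,de,ie->', Xf, Ci, Xf)))
    return mu, Phi, Sigma, hist
```

The returned history should be non-decreasing, which is a convenient check on any implementation of this EM procedure.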
Algorithm 18.2: ML learning of PLDA model
PLDA describes the jth of J data examples from the ith of I identities as

x_ij = μ + Φh_i + Ψs_ij + ε_ij,

where all terms are as in the subspace identity model, but now we add Ψ, the D×L within-individual factor matrix, and s_ij, the L×1 style variable.
Algorithm 18.2: Maximum likelihood learning for PLDA model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, numbers of factors K, L
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Ψ, Σ}
begin
  Initialize θ = θ_0 (a)
  // Set mean
  μ = Σ_{i=1}^I Σ_{j=1}^J x_ij/(IJ)
  repeat
    μ′ = [μ^T, μ^T, . . . , μ^T]^T    // compound mean vector, JD×1
    Φ′ = [Φ^T, Φ^T, . . . , Φ^T]^T    // compound factor matrix 1, JD×K
    Ψ′ = diag[Ψ, Ψ, . . . , Ψ]        // compound factor matrix 2, JD×JL
    Φ′ = [Φ′, Ψ′]                     // concatenate matrices, JD×(K+JL)
    Σ′ = diag[Σ, Σ, . . . , Σ]        // compound covariance, JD×JD
    // Expectation step
    for i=1 to I do
      x′_i = [x_i1^T, x_i2^T, . . . , x_iJ^T]^T   // compound data vector, JD×1
      μ_{h′_i} = (Φ′^TΣ′^{−1}Φ′ + I)^{−1} Φ′^TΣ′^{−1}(x′_i − μ′)
      Σ_{h′_i} = (Φ′^TΣ′^{−1}Φ′ + I)^{−1} + μ_{h′_i} μ_{h′_i}^T
      for j=1 to J do
        S_ij = [1 . . . K, K+(j−1)L+1 . . . K+jL]   // indices of h_i and s_ij
        E[h′′_ij] = μ_{h′_i}(S_ij)                  // extract subvector of mean
        E[h′′_ij h′′_ij^T] = Σ_{h′_i}(S_ij, S_ij)   // extract submatrix from second moment
      end
    end
    // Maximization step
    Φ′′ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − μ)E[h′′_ij]^T)(Σ_{i=1}^I Σ_{j=1}^J E[h′′_ij h′′_ij^T])^{−1}
    Σ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − μ)(x_ij − μ)^T − Φ′′E[h′′_ij](x_ij − μ)^T]
    Φ = Φ′′(:, 1:K)         // extract original factor matrix
    Ψ = Φ′′(:, K+1:K+L)     // extract other factor matrix
    // Compute data log likelihood
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]]
  until no further improvement in L
end

(a) Initialize Ψ to random values; initialize the other parameters as in the identity subspace model.
Algorithm 18.3: ML learning of asymmetric bilinear model
This model describes the jth of J data examples from the ith of I identities, observed in the sth of S styles, as

x_ijs = μ_s + Φ_s h_i + ε_ijs,

where the terms have the same interpretation as for the subspace identity model, except that there is now one set of parameters θ_s = {μ_s, Φ_s, Σ_s} per style s.
Algorithm 18.3: Maximum likelihood learning for asymmetric bilinear model

Input : Training data {x_ijs}_{i=1,j=1,s=1}^{I,J,S}, number of factors K
Output: ML estimates of parameters θ = {μ_{1...S}, Φ_{1...S}, Σ_{1...S}}
begin
  Initialize θ = θ_0
  for s=1 to S do
    // Set mean
    μ_s = Σ_{i=1}^I Σ_{j=1}^J x_ijs/(IJ)
  end
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (I + J Σ_{s=1}^S Φ_s^TΣ_s^{−1}Φ_s)^{−1} Σ_{s=1}^S Φ_s^TΣ_s^{−1} Σ_{j=1}^J (x_ijs − μ_s)
      E[h_i h_i^T] = (I + J Σ_{s=1}^S Φ_s^TΣ_s^{−1}Φ_s)^{−1} + E[h_i]E[h_i]^T
    end
    // Maximization step
    for s=1 to S do
      Φ_s = (Σ_{i=1}^I Σ_{j=1}^J (x_ijs − μ_s)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{−1}
      Σ_s = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ijs − μ_s)(x_ijs − μ_s)^T − Φ_s E[h_i](x_ijs − μ_s)^T]
    end
    // Compute data log likelihood
    for s=1 to S do
      μ′_s = [μ_s^T, μ_s^T, . . . , μ_s^T]^T    // J copies, JD×1
      Φ′_s = [Φ_s^T, Φ_s^T, . . . , Φ_s^T]^T    // J copies, JD×K
      Σ′_s = diag[Σ_s, Σ_s, . . . , Σ_s]        // J copies, JD×JD
      for i=1 to I do
        x′_is = [x_i1s^T, x_i2s^T, . . . , x_iJs^T]^T
      end
    end
    for i=1 to I do
      x′_i = [x′_i1^T, x′_i2^T, . . . , x′_iS^T]^T   // compound data vector, JSD×1
    end
    μ′ = [μ′_1^T, μ′_2^T, . . . , μ′_S^T]^T          // compound mean vector, JSD×1
    Φ′ = [Φ′_1^T, Φ′_2^T, . . . , Φ′_S^T]^T          // compound factor matrix, JSD×K
    Σ′ = diag[Σ′_1, Σ′_2, . . . , Σ′_S]              // compound covariance, JSD×JSD
    L = Σ_{i=1}^I log[Norm_{x′_i}[μ′, Φ′Φ′^T + Σ′]]
  until no further improvement in L
end
Algorithm 18.4: Style translation with asymmetric bilinear model
To translate a data example from one style to another, we first estimate the hidden identity variable associated with the example, and then use the generative equation to simulate the new style. We cannot know the hidden variable for certain, but we can compute its posterior distribution, which has a Gaussian form, and choose the MAP solution, which is the mean of this Gaussian.
Algorithm 18.4: Style translation with asymmetric bilinear model

Input : Example x in style s1, model parameters θ
Output: Prediction x* for the data in style s2
begin
  // Estimate hidden variable
  E[h] = (I + Φ_{s1}^T Σ_{s1}^{−1} Φ_{s1})^{−1} Φ_{s1}^T Σ_{s1}^{−1}(x − μ_{s1})
  // Predict in different style
  x* = μ_{s2} + Φ_{s2} E[h]
end
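The two steps translate directly into code; this sketch assumes the style-specific covariance is supplied as a full matrix, and the function name is my own.

```python
import numpy as np

def translate_style(x, mu1, Phi1, Sigma1, mu2, Phi2):
    """MAP identity estimate under style s1, regenerated in style s2."""
    Si = np.linalg.inv(Sigma1)
    K = Phi1.shape[1]
    # E[h] = (I + Phi1^T Sigma1^{-1} Phi1)^{-1} Phi1^T Sigma1^{-1} (x - mu1)
    h = np.linalg.solve(np.eye(K) + Phi1.T @ Si @ Phi1, Phi1.T @ Si @ (x - mu1))
    # Regenerate with the second style's parameters
    return mu2 + Phi2 @ h
```

When the noise is small, the hidden variable is recovered almost exactly and the translation is near-deterministic.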
Temporal models
Algorithm 19.1: Kalman filter
To define the Kalman filter, we must specify the temporal and measurement models. First, the temporal model relates the states w at times t−1 and t and is given by

Pr(w_t|w_{t−1}) = Norm_{w_t}[μ_p + Ψw_{t−1}, Σ_p],

where μ_p is a D_w×1 vector that represents the mean change in the state, and Ψ is a D_w×D_w matrix that relates the mean of the state at time t to the state at time t−1; this is known as the transition matrix. The transition noise Σ_p determines how closely related the states are at times t and t−1.

Second, the measurement model relates the data x_t at time t to the state w_t:

Pr(x_t|w_t) = Norm_{x_t}[μ_m + Φw_t, Σ_m],

where μ_m is a D_x×1 mean vector and Φ is a D_x×D_w matrix relating the D_x×1 measurement vector to the D_w×1 state. The measurement noise Σ_m defines additional uncertainty in the measurements that cannot be explained by the state.

The Kalman filter is a set of rules for computing the marginal posterior probability Pr(w_t|x_{1...t}) from a normally distributed estimate of the marginal posterior Pr(w_{t−1}|x_{1...t−1}) at the previous time step and a new measurement x_t. In the algorithm we denote the mean and variance of the marginal posterior at time t−1 by μ_{t−1} and Σ_{t−1}.
Algorithm 19.1: The Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal parameters μ_p, Ψ, Σ_p, measurement parameters μ_m, Φ, Σ_m
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0    // typically set to a large multiple of the identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = μ_p + Ψμ_{t−1}
    // Covariance prediction
    Σ_+ = Σ_p + ΨΣ_{t−1}Ψ^T
    // Compute Kalman gain
    K = Σ_+Φ^T (Σ_m + ΦΣ_+Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K(x_t − μ_m − Φμ_+)
    // Covariance update
    Σ_t = (I − KΦ)Σ_+
  end
end
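The five update equations map one-to-one onto NumPy operations; this sketch stores the per-step moments in Python lists and is not optimized.

```python
import numpy as np

def kalman_filter(xs, mu_p, Psi, Sigma_p, mu_m, Phi, Sigma_m, mu0, Sigma0):
    """Forward recursion; returns posterior means and covariances per step."""
    mu, Sigma = mu0, Sigma0
    means, covs = [], []
    for x in xs:
        # State and covariance prediction
        mu_plus = mu_p + Psi @ mu
        Sigma_plus = Sigma_p + Psi @ Sigma @ Psi.T
        # Kalman gain
        K = Sigma_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ Sigma_plus @ Phi.T)
        # State and covariance update
        mu = mu_plus + K @ (x - mu_m - Phi @ mu_plus)
        Sigma = (np.eye(len(mu0)) - K @ Phi) @ Sigma_plus
        means.append(mu)
        covs.append(Sigma)
    return means, covs
```

For a static 1D state observed repeatedly, the posterior mean should converge to the measurement value and the posterior variance should shrink.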
Algorithm 19.2: Fixed interval Kalman smoother
The fixed interval smoother consists of a backward set of recursions that estimate the marginal posterior distributions Pr(w_t|x_{1...T}) of the state at each time step, taking into account all of the measurements x_{1...T}. In these recursions, the marginal posterior Pr(w_t|x_{1...T}) of the state at time t is updated and, based on this result, the marginal posterior Pr(w_{t−1}|x_{1...T}) at time t−1 is updated, and so on.

In the algorithm, we denote the mean and variance of the marginal posterior Pr(w_t|x_{1...T}) at time t by μ_{t|T} and Σ_{t|T}, respectively. The notation μ_{+|t} and Σ_{+|t} denotes the mean and variance of the predictive distribution Pr(w_t|x_{1...t−1}) of the state at time t based on the measurements up to time t−1 (i.e., what we denoted by μ_+ and Σ_+ during the forward Kalman filter recursions), and μ_{t|t} and Σ_{t|t} denote the filtered moments.
Algorithm 19.2: Fixed interval Kalman smoother

Input : Means and variances {μ_{t|t}, Σ_{t|t}, μ_{+|t}, Σ_{+|t}}_{t=1}^T, temporal parameter Ψ
Output: Means {μ_{t|T}}_{t=1}^T and covariances {Σ_{t|T}}_{t=1}^T of marginal posterior distributions
begin
  // For each time step, working backward
  for t=T−1 to 1 do
    // Compute gain matrix
    C_t = Σ_{t|t} Ψ^T Σ_{+|t+1}^{−1}
    // Compute mean
    μ_{t|T} = μ_{t|t} + C_t(μ_{t+1|T} − μ_{+|t+1})
    // Compute variance
    Σ_{t|T} = Σ_{t|t} + C_t(Σ_{t+1|T} − Σ_{+|t+1})C_t^T
  end
end
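The backward pass needs both the filtered and the predicted moments from the forward pass, so the sketch below stores all four sequences; the offsets μ_p and μ_m are dropped for brevity, and the function names are my own.

```python
import numpy as np

def kalman_forward(xs, Psi, Sigma_p, Phi, Sigma_m, mu0, Sigma0):
    """Forward pass storing filtered (mu_t, Sigma_t) and predicted (mu_+, Sigma_+) moments."""
    mus, Sigs, mu_pred, Sig_pred = [], [], [], []
    mu, Sig = mu0, Sigma0
    for x in xs:
        m_plus = Psi @ mu
        S_plus = Sigma_p + Psi @ Sig @ Psi.T
        K = S_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ S_plus @ Phi.T)
        mu = m_plus + K @ (x - Phi @ m_plus)
        Sig = (np.eye(len(mu0)) - K @ Phi) @ S_plus
        mus.append(mu); Sigs.append(Sig)
        mu_pred.append(m_plus); Sig_pred.append(S_plus)
    return mus, Sigs, mu_pred, Sig_pred

def rts_smooth(mus, Sigs, mu_pred, Sig_pred, Psi):
    """Backward recursion refining each filtered estimate with future data."""
    mu_s, Sig_s = list(mus), list(Sigs)
    for t in range(len(mus) - 2, -1, -1):
        C = Sigs[t] @ Psi.T @ np.linalg.inv(Sig_pred[t + 1])
        mu_s[t] = mus[t] + C @ (mu_s[t + 1] - mu_pred[t + 1])
        Sig_s[t] = Sigs[t] + C @ (Sig_s[t + 1] - Sig_pred[t + 1]) @ C.T
    return mu_s, Sig_s
```

A quick check: the smoothed variance at any time step is never larger than the corresponding filtered variance, since it conditions on strictly more data.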
Algorithm 19.3: Extended Kalman filter
The extended Kalman filter (EKF) is designed to cope with more general temporal models, where the relationship between the states at times t−1 and t is an arbitrary nonlinear function f[•, •] of the state at the previous time step and a stochastic contribution ε_p:

w_t = f[w_{t−1}, ε_p],

where the covariance of the noise term ε_p is Σ_p as before. Similarly, it can cope with a nonlinear relationship g[•, •] between the state and the measurements:

x_t = g[w_t, ε_m],

where the covariance of ε_m is Σ_m.

The extended Kalman filter works by taking linear approximations to the nonlinear functions at the peak of the current estimate using the Taylor expansion. We define the Jacobian matrices

Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0},    Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0},
Φ = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0},                Υ_m = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0},

where |_{μ_+,0} denotes that the derivative is evaluated at the position w = μ_+ and ε = 0.
Algorithm 19.3: The extended Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal function f[•, •], measurement function g[•, •]
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0    // typically set to a large multiple of the identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = f[μ_{t−1}, 0]
    // Covariance prediction
    Σ_+ = ΨΣ_{t−1}Ψ^T + Υ_p Σ_p Υ_p^T
    // Compute Kalman gain
    K = Σ_+Φ^T (Υ_m Σ_m Υ_m^T + ΦΣ_+Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K(x_t − g[μ_+, 0])
    // Covariance update
    Σ_t = (I − KΦ)Σ_+
  end
end
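A sketch for the common additive-noise special case w_t = f[w_{t−1}] + ε_p and x_t = g[w_t] + ε_m, in which the noise Jacobians Υ_p and Υ_m are identity matrices; the caller supplies the state Jacobians, and the function names are my own.

```python
import numpy as np

def ekf(xs, f, F_jac, g, G_jac, Sigma_p, Sigma_m, mu0, Sigma0):
    """EKF for additive-noise models; F_jac/G_jac return the state Jacobians."""
    mu, Sigma = mu0, Sigma0
    means, covs = [], []
    for x in xs:
        # Linearize the temporal model at the previous posterior mean
        Psi = F_jac(mu)
        mu_plus = f(mu)
        Sigma_plus = Psi @ Sigma @ Psi.T + Sigma_p
        # Linearize the measurement model at the predicted mean
        Phi = G_jac(mu_plus)
        K = Sigma_plus @ Phi.T @ np.linalg.inv(Sigma_m + Phi @ Sigma_plus @ Phi.T)
        mu = mu_plus + K @ (x - g(mu_plus))
        Sigma = (np.eye(len(mu0)) - K @ Phi) @ Sigma_plus
        means.append(mu); covs.append(Sigma)
    return means, covs
```

When f and g are linear, this reduces exactly to the ordinary Kalman filter, which is a convenient correctness check.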
Algorithm 19.4: Iterated extended Kalman filter
The iterated extended Kalman filter passes Q times through the dataset, repeating the computations of the extended Kalman filter. At each iteration it linearizes around the previous estimate of the state, with the idea that the linear approximation will get better and better. We define the initial Jacobian matrices as before:

Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0},    Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0},
Φ^0 = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0},              Υ_m^0 = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0}.

However, on the qth iteration, we use the Jacobians

Φ^q = ∂g[w_t, ε_m]/∂w_t |_{μ_t^{q−1},0},    Υ_m^q = ∂g[w_t, ε_m]/∂ε_m |_{μ_t^{q−1},0},

where μ_t^{q−1} is the estimate of the state at the tth time step on the (q−1)th iteration.
Algorithm 19.4: The iterated extended Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal function f[•, •], measurement function g[•, •]
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // For each iteration
  for q=0 to Q do
    // Initialize mean and covariance
    μ_0 = 0
    Σ_0 = Σ_0    // typically set to a large multiple of the identity
    // For each time step
    for t=1 to T do
      // State prediction
      μ_+ = f[μ_{t−1}, 0]
      // Covariance prediction
      Σ_+ = ΨΣ_{t−1}Ψ^T + Υ_p Σ_p Υ_p^T
      // Compute Kalman gain
      K = Σ_+Φ^{qT} (Υ_m^q Σ_m Υ_m^{qT} + Φ^qΣ_+Φ^{qT})^{−1}
      // State update
      μ_t^q = μ_+ + K(x_t − g[μ_+, 0])
      // Covariance update
      Σ_t^q = (I − KΦ^q)Σ_+
    end
  end
end

This algorithm can be improved by running the fixed interval smoother in between each iteration and re-linearizing around the smoothed estimates.
Algorithm 19.5: Unscented Kalman filter
The unscented Kalman filter is an alternative to the extended Kalman filter. It works by approximating the Gaussian state distribution as a set of particles with the same mean and covariance, passing these particles through the nonlinear temporal and measurement equations, and then recomputing the mean and covariance from the new positions of the particles. In the algorithm below, we assume that the state has dimension D_w and use 2D_w + 1 particles to approximate the world state.
Algorithm 19.5: The unscented Kalman filter

Input : Measurements {x_t}_{t=1}^T, temporal and measurement functions f[•, •], g[•, •], weight a_0
Output: Means {μ_t}_{t=1}^T and covariances {Σ_t}_{t=1}^T of marginal posterior distributions
begin
  // For each time step
  for t=1 to T do
    // Approximate state with 2D_w+1 particles
    w^[0] = μ_{t−1}
    for j=1 to D_w do
      w^[j] = μ_{t−1} + sqrt(D_w/(1−a_0)) Σ_{t−1}^{1/2} e_j
      w^[D_w+j] = μ_{t−1} − sqrt(D_w/(1−a_0)) Σ_{t−1}^{1/2} e_j
      a_j = (1 − a_0)/(2D_w)
    end
    // Pass through temporal equation and compute predicted mean and covariance
    μ_+ = Σ_{j=0}^{2D_w} a_j f[w^[j]]
    Σ_+ = Σ_{j=0}^{2D_w} a_j (f[w^[j]] − μ_+)(f[w^[j]] − μ_+)^T + Σ_p
    // Approximate predicted state with particles
    w^[0] = μ_+
    for j=1 to D_w do
      w^[j] = μ_+ + sqrt(D_w/(1−a_0)) Σ_+^{1/2} e_j
      w^[D_w+j] = μ_+ − sqrt(D_w/(1−a_0)) Σ_+^{1/2} e_j
    end
    // Pass through measurement equation
    for j=0 to 2D_w do
      x^[j] = g[w^[j]]
    end
    // Compute predicted measurement mean and covariance
    μ_x = Σ_{j=0}^{2D_w} a_j x^[j]
    Σ_x = Σ_{j=0}^{2D_w} a_j (x^[j] − μ_x)(x^[j] − μ_x)^T + Σ_m
    // Compute new state mean and covariance
    K = (Σ_{j=0}^{2D_w} a_j (w^[j] − μ_+)(x^[j] − μ_x)^T) Σ_x^{−1}
    μ_t = μ_+ + K(x_t − μ_x)
    Σ_t = Σ_+ − KΣ_xK^T
  end
end
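A sketch of one reading of the algorithm, using a symmetric matrix square root for Σ^{1/2} (any factor with Σ^{1/2}Σ^{1/2T} = Σ works); the function names are my own, and the noise is assumed additive so f and g take only the state.

```python
import numpy as np

def sigma_points(mu, Sigma, a0):
    """2*Dw+1 particles matching (mu, Sigma), with weights a_j as in the text."""
    Dw = len(mu)
    vals, V = np.linalg.eigh(Sigma)
    root = V @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ V.T  # symmetric sqrt
    c = np.sqrt(Dw / (1.0 - a0))
    pts = np.vstack([mu, mu + c * root.T, mu - c * root.T])
    wts = np.full(2 * Dw + 1, (1.0 - a0) / (2 * Dw))
    wts[0] = a0
    return pts, wts

def ukf(xs, f, g, Sigma_p, Sigma_m, mu0, Sigma0, a0=0.5):
    mu, Sigma = mu0, Sigma0
    means = []
    for x in xs:
        # Prediction: push sigma points through the temporal model
        P, a = sigma_points(mu, Sigma, a0)
        F = np.array([f(p) for p in P])
        mu_plus = a @ F
        Sigma_plus = (F - mu_plus).T @ ((F - mu_plus) * a[:, None]) + Sigma_p
        # Measurement: fresh sigma points around the prediction
        P, a = sigma_points(mu_plus, Sigma_plus, a0)
        G = np.array([g(p) for p in P])
        mu_x = a @ G
        Sigma_x = (G - mu_x).T @ ((G - mu_x) * a[:, None]) + Sigma_m
        # Gain from the state/measurement cross-covariance
        K = (P - mu_plus).T @ ((G - mu_x) * a[:, None]) @ np.linalg.inv(Sigma_x)
        mu = mu_plus + K @ (x - mu_x)
        Sigma = Sigma_plus - K @ Sigma_x @ K.T
        means.append(mu)
    return means
```

For linear f and g the unscented transform is exact, so the filter behaves like the ordinary Kalman filter.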
Algorithm 19.6: Condensation algorithm
The condensation algorithm completely does away with the Gaussian representation and represents the distributions entirely as sets of weighted particles, where each particle can be interpreted as a hypothesis about the world state, and its weight as the probability of this hypothesis being true.
Algorithm 19.6: The condensation algorithm

Input : Measurements {x_t}_{t=1}^T, temporal model Pr(w_t|w_{t−1}), measurement model Pr(x_t|w_t)
Output: Weights {a_t^[j]}_{t=1}^T and hypotheses {w_t^[j]}_{t=1}^T
begin
  // Initialize weights to equal values
  a_0 = [1/J, 1/J, . . . , 1/J]
  // Initialize hypotheses to plausible values for the state
  for j=1 to J do
    w_0^[j] = Initialize[]
  end
  // For each time step
  for t=1 to T do
    // For each particle
    for j=1 to J do
      // Sample index n from {1 . . . J} according to probabilities a_{t−1}^[1] . . . a_{t−1}^[J]
      n = sampleFromCategorical[a_{t−1}]
      // Draw sample from temporal update model
      w_t^[j] = sample[Pr(w_t|w_{t−1} = w_{t−1}^[n])]
      // Set weight for particle according to measurement model
      a_t^[j] = Pr(x_t|w_t^[j])
    end
    // Normalize weights
    a_t = a_t/(Σ_{j=1}^J a_t^[j])
  end
end
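The algorithm is agnostic about the model, so this sketch takes the initialization, temporal sampler, and likelihood as callbacks and tracks scalar states; the function name and callback signatures are my own.

```python
import numpy as np

def condensation(xs, init, temporal_sample, likelihood, J=1000, seed=0):
    """Particle filter; init/temporal_sample/likelihood are model callbacks."""
    rng = np.random.default_rng(seed)
    w = init(rng, J)                       # J scalar hypotheses
    a = np.full(J, 1.0 / J)
    means = []
    for x in xs:
        # Resample ancestors in proportion to their weights
        idx = rng.choice(J, size=J, p=a)
        # Propagate each survivor through the temporal model
        w = temporal_sample(rng, w[idx])
        # Reweight by the measurement model and normalize
        a = likelihood(x, w)
        a = a / a.sum()
        means.append(float(a @ w))
    return means
```

On a nearly static state with repeated identical measurements, the weighted particle mean should settle close to the measurement value.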
Models for visual words
Algorithm 20.1: Bag of features model
The bag of features model treats each object class as a distribution over discrete features f, regardless of their position in the image. Assume that there are I images, with J_i features in the ith image, and denote the jth feature in the ith image by f_ij. Then we have

Pr(X_i|w = n) = Π_{j=1}^{J_i} Cat_{f_ij}[λ_n].
Algorithm 20.1: Learn bag of words model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, class labels {w_i}_{i=1}^I, Dirichlet parameter α
Output: Model parameters {λ_n}_{n=1}^N
begin
  // For each object class
  for n=1 to N do
    // For each feature type
    for k=1 to K do
      // Compute number of times feature k is observed for class n
      N^f_nk = Σ_{i=1}^I Σ_{j=1}^{J_i} δ[w_i − n]δ[f_ij − k]
    end
    // Compute MAP parameter estimate
    λ_nk = (N^f_nk + α − 1)/(Σ_{k=1}^K N^f_nk + Kα − K)
  end
end
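The counting step is a few lines of NumPy; this sketch assumes 0-indexed feature and class labels and returns the MAP estimates for all N classes at once, with the function name my own.

```python
import numpy as np

def learn_bag_of_features(features, labels, N, K, alpha=1.0):
    """MAP per-class feature histograms under a Dirichlet(alpha) prior.
    features[i] lists the feature indices of image i; labels[i] is its class."""
    counts = np.zeros((N, K))
    for f_img, w in zip(features, labels):
        for f in f_img:
            counts[w, f] += 1
    lam = counts + alpha - 1.0            # MAP numerator N_nk + alpha - 1
    return lam / lam.sum(axis=1, keepdims=True)
```

With alpha = 1 the estimate reduces to the normalized raw counts per class.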
Algorithm 20.2: Latent Dirichlet Allocation
The latent Dirichlet allocation model describes a discrete set of features f_ij ∈ {1 . . . K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared across images, but the mixture weights π_i differ from image to image.
Algorithm 20.2: Learn latent Dirichlet allocation model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, Dirichlet parameters α, β
Output: Model parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I
begin
  // Initialize categorical parameters
  θ = θ_0 (a)
  // Initialize count parameters
  N^(f) = 0
  N^(p) = 0
  for i=1 to I do
    for j=1 to J_i do
      // Initialize hidden part labels
      p_ij = randInt[M]
      // Update count parameters
      N^(f)_{p_ij, f_ij} = N^(f)_{p_ij, f_ij} + 1
      N^(p)_{i, p_ij} = N^(p)_{i, p_ij} + 1
    end
  end
  // Main MCMC loop
  for t=1 to T do
    p^[t] = MCMCSample[p, f, N^(f), N^(p), M, K]
  end
  // Choose samples to use for parameter estimates
  S_t = [BurnInTime : SkipTime : LastSample]
  for i=1 to I do
    for m=1 to M do
      π_im = Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p_ij^[t] − m] + α
    end
    π_i = π_i/Σ_{m=1}^M π_im
  end
  for m=1 to M do
    for k=1 to K do
      λ_mk = Σ_{i=1}^I Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p_ij^[t] − m]δ[f_ij − k] + β
    end
    λ_m = λ_m/Σ_{k=1}^K λ_mk
  end
end

(a) One way to do this is to set the categorical parameters {λ_m}_{m=1}^M and {π_i}_{i=1}^I to random values by generating positive random vectors and normalizing them to sum to one.
Algorithm 20.2b: Gibbs sampling for LDA

The preceding algorithm relies on Gibbs sampling from the posterior distribution over the part labels. This can be implemented efficiently using the following method.
Algorithm 20.2b: MCMC sampling for LDA

Input : Part labels p, features f, counts N^(f), N^(p), numbers of parts and features M, K
Output: Part sample p
begin
  repeat
    // Choose next feature
    (a, b) = ChooseFeature[J_1, J_2, . . . , J_I]
    // Remove feature from count statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} − 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} − 1
    // Compute conditional distribution over the part label
    for m=1 to M do
      q_m = (N^(f)_{m, f_ab} + β)(N^(p)_{a,m} + α)
      q_m = q_m/(Σ_{k=1}^K (N^(f)_{m,k} + β) Σ_{m′=1}^M (N^(p)_{a,m′} + α))
    end
    // Normalize
    q = q/(Σ_{m=1}^M q_m)
    // Draw new part label
    p_ab = DrawCategorical[q]
    // Replace feature in count statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} + 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} + 1
  until all parts p_ij updated
end
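One sweep of the sampler can be written compactly by vectorizing the loop over parts m; the constant factor Σ_{m′}(N^(p)_{a,m′} + α) cancels in the normalization and is omitted here, and the function name is my own.

```python
import numpy as np

def gibbs_sweep(p, f, Nf, Np, alpha, beta, rng):
    """One Gibbs sweep over every feature token; updates p, Nf, Np in place.
    p[i][j]: part label of token j in image i; f[i][j]: its feature index."""
    M, K = Nf.shape
    for i in range(len(f)):
        for j in range(len(f[i])):
            m_old, k = p[i][j], f[i][j]
            # Remove the token from the count statistics
            Nf[m_old, k] -= 1
            Np[i, m_old] -= 1
            # Conditional over part labels (image-level constant cancels)
            q = (Nf[:, k] + beta) * (Np[i] + alpha) / (Nf.sum(axis=1) + K * beta)
            q = q / q.sum()
            # Draw a new label and restore the counts
            m_new = rng.choice(M, p=q)
            p[i][j] = m_new
            Nf[m_new, k] += 1
            Np[i, m_new] += 1
    return p
```

A basic invariant: after any number of sweeps, the count matrices must still agree with the labels, so their totals never change.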