Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
1
A Diffusion Process on Riemannian Manifoldfor Visual Tracking
Marcus Chen, Cham Tat Jen, Pang Sze Kim, Alvina Goh
Abstract—Robust visual tracking for long video sequences is a research area that has many important applications. The mainchallenges include how the target image can be modeled and how this model can be updated. In this paper, we model the targetusing a covariance descriptor, as this descriptor is robust to problems such as pixel-pixel misalignment, pose and illuminationchanges, that commonly occur in visual tracking. We model the changes in the template using a generative process. Weintroduce a new dynamical model for the template update using a random walk on the Riemannian manifold where the covariancedescriptors lie in. This is done using log-transformed space of the manifold to free the constraints imposed inherently by positivesemidefinite matrices. Modeling template variations and poses kinetics together in the state space enables us to jointly quantifythe uncertainties relating to the kinematic states and the template in a principled way. Finally, the sequential inference of theposterior distribution of the kinematic states and the template is done using a particle filter. Our results shows that this principledapproach can be robust to changes in illumination, poses and spatial affine transformation. In the experiments, our methodoutperformed the current state-of-the-art algorithm - the incremental Principal Component Analysis method [34], particularlywhen a target underwent fast poses changes and also maintained a comparable performance in stable target tracking cases.
Index Terms—Tracking, Particle filtering, Template update, Generative Template Model, Riemannian manifolds, log-transformedspace.
F
1 INTRODUCTION1
Visual tracking is an important vision research topic2
that has many applications, ranging from motion-3
based recognition [7], surveillance [18], human-4
computer interaction [10], etc. It also covers many5
aspects of computer vision problems, such as target6
feature representation [46], feature selection [9], and7
feature learning [15]. Even though it has been actively8
researched for decades, many challenges remain espe-9
cially with changes in target poses and appearance,10
and illumination in a long video sequence. Figure 111
shows two simple examples of how a target can vary12
over a short time interval. Often these challenges are13
common and require a good solution in order for14
long stable tracking in many real life tasks. There15
are generally three common approaches to deal with16
target appearance variations. First is to use robust or17
invariant target features such as scale invariant feature18
transformation and color histogram [3]. However, as19
shown by Figure 1, target appearance can change20
significantly over time, and end up totally different21
from the starting frame due to variations in target22
poses and image illumination. The second approach23
is to employ a complete set of possible target mod-24
els [4], aiming to model possible target variations.25
• Marcus Chen and Cham Tat Jen are with the Department of Schoolof Computer Engineering, Nanyang Technological University, is withthe Department.
• Pang Sze Kim and Alvina Goh are with DSO National Laboratories,Singapore.
Fig. 1. Target patches for successive 871 frames, from#1, 31, ...871 from 2 video sequences. Target changesin both illumination, poses, appearances even afterbeing affine warped to a standard size.
However, this requires learning of the target model 26
in advance and can hardly be scalable. Finally, the 27
last approach is to update the template gradually 28
as it evolves. Note that in this paper, we loosely 29
use the term template for target representation, and 30
do not strictly limit to the image patches. There are 31
arX
iv:1
303.
5913
v1 [
cs.C
V]
24
Mar
201
3
2
several choices for a target template found in the32
literature. For example, [38] uses the histogram of33
oriented gradients, while [3] uses the color histogram,34
[45] L1 sparse representation, [23] active appearance35
model, [34] principal subspace of image patches, and36
[31] features covariance.37
The template update problem can be expressed38
mathematically as Eqn. (1).39
Tt = f(Tt, Tt−1
)(1)40
41
where Tt, Tt, t ∈ [1, 2, ...] are the estimated and up-42
dated templates respectively at time t. However, as43
shown in [23], target template updating is a challeng-44
ing task. According to [23], if the template was not45
updated at all, the template would become outdated46
shortly and cannot be used for matching as the target47
appearance would have undergone changes tempo-48
rally. On the other hand, update at every frame would49
result in accumulation of small errors, and eventually50
a template drift and loss target information.51
Recognizing the importance of template update,52
many methods have been proposed. One common53
and intuitive approach is to use linear updating func-54
tion in the respective feature spaces, such as [31]55
on the covariance manifold. This will smoothen the56
changes between the estimated Template and updated57
template. Similarly, Kalman filter has also been used58
in [25] to track template features variables, but not59
target trajectory. On the other hand, there are three60
well-known template update algorithms in the litera-61
ture, namely template alignment [23], Online Expec-62
tation and Maximization (EM) [19], and incremental63
subspace method [34]. Here, we briefly survey these64
three algorithms.65
In template alignment method, [23] proposes a66
heuristic but robust criteria to decide whether to67
update the template at time t. The basic idea is to68
keep the starting template to correct the drift of the69
estimated template. The latest estimated template is70
first matched to the previous updated template. It is71
then warped before checking with the first template.72
For a small template displacement, this method works73
very well. However, by imposing alignment between74
the latest template and the first template, this method75
inherently limit target poses changes to a warping76
model.77
The online EM method [19] employs a mixture of78
three template distributions to account for template79
variations, namely, long term stable template, interframe80
variational template, and outlier template. These tem-81
plates model stable appearance of target, interframe82
changes in appearance of poses, and occlusion or83
outliers respectively. Employing a Gaussian mixture84
model, parameters and membership are estimated on85
the fly using online EM. In this framework, each pixel86
in the target patches is assumed to be independent87
and consequently more stable pixels tend to gain88
more weights in the similarity measure. This could 89
gradually drift the template in the presence of more 90
stable background pixels. 91
The third algorithm is to represent the target in 92
its eigenspace, proposed by [34]. The posterior es- 93
timates of the template are collected over an inter- 94
val, and these estimates are then analyzed online 95
through an Incremental Principal Component Anal- 96
ysis method(IPCA). This method can capture changes 97
in template variation in eigenbases. The mean of the 98
posterior estimates are also kept as stable templates. 99
The authors have tested IPCA with various video 100
sequences, and demonstrated its great robustness to 101
the template variations due to pose changes and illu- 102
mination changes. Figure 2 illustrates an incremental 103
update of eigenbases and means. The images in the 104
3rd row show how the eigenbases evolve over time. 105
It has been shown in the paper that the updated 106
templates could almost reconstruct the original im- 107
age samples over the sequence, reflecting the ability 108
of the eigenbases to model temporal variations. Al- 109
though IPCA is often very robust and can track target 110
very accurately even in noisy, low contrast image 111
sequences, IPCA falls short when the target undergoes 112
fast pose changes and dramatic illumination changes 113
as stated on the paper. This may be because PCA 114
inherently assumes that the target templates over time 115
are from a Gaussian distribution. In abrupt changes 116
in poses and illumination, this assumption does not 117
hold. The unimodal distribution also requires good 118
pixel-wise alignment between the posterior estimate 119
and eigenbases, otherwise uncertainties in template 120
alignment would contribute to template variance and 121
may lead to non-informative basis. A good example 122
from the paper is shown in Figure 2. One can see 123
that from frames #600 to #636, the eigenbases are not 124
representative anymore and the tracker loses track of 125
the target.
(a) Representativeeigenbases
(b) Eigenbases ofmisaligned targetregions
(c) Eigenbases are notrepresentative
Fig. 2. Results of incremental subspace method onthe Sylvester sequence. Pixel-wise misalignment couldrender eigenbasis non-representative. The 1st row arethe sample frames. The 2nd row images are the currentsample mean, tracked region, reconstructed image,and the reconstruction error respectively. The 3rd and4th rows are the top 10 principal eigenbases.
126
3
So far, most of the current state-of-the-art algo-127
rithms update templates in an out-of-chain manner,128
by assuming the posterior estimate is “good enough”129
for template update with pixel-wise alignment. If the130
target poses posterior estimate is inaccurate or there131
is a mis-alignment between the estimated and last132
updated template, the update methods will gradually133
drift. On the other hand, if the template update is not134
good, then the posterior estimate of target poses is135
unlikely to be accurate. These coupled dual problems136
often render these methods unable to track well when137
the targets undergo fast changes in poses or non-rigid138
transformation. However, robustness to fast target139
poses has many real life applications such as human140
tracking, maritime target tracking, etc.141
To solve these dual problems faced by the exist-142
ing state-of-the-art algorithms, [8] introduces a novel143
approach to simultaneously quantify these two uncer-144
tainties by including both of them into the state space145
of a Bayesian framework, instead of just target poses146
in the exist methods. In this manner, no posterior147
estimate is used for updating, instead better matched148
multiple hypothesized templates are propagated au-149
tomatically.150
Paper contributions. To the best of our knowledge,151
almost all the state-of-art algorithms use out-of-chain152
template updating methods. That is to say, the up-153
dating of template model is done after obtaining the154
posterior estimate of the targets position. In this paper,155
we propose a method to update target model in156
tandem with the target kinematics. In other words,157
we model the target template as a part of the state158
space. We choose the covariance descriptor for the159
target descriptor as it is more robust to problems160
such as pixel-pixel misalignment and changes in pose161
and illumination. Since positive definite covariance162
matrices form a Riemannian manifold, we model the163
target template model variation by a random walk164
on the covariance Riemannian manifold. We propose165
a novel superior template propagation mechanism in166
the log-transformed space of the manifold to free the167
constraints imposed inherently by positive semidefi-168
nite matrices, leading to a greater ability in dealing169
with template variations. Our resultant method out-170
performs the state-of-the-art Incremental PCA algo-171
rithm [34] in dealing with fast moving and changing172
targets, as will be clearly shown in the experiments173
section.174
The paper is organized as follows: Section 2 gives175
a brief introduction to both covariance descriptors176
and Riemannian manifold, Section 3 gives a Bayesian177
formulation of simultaneous inference of both target178
kinetics and template posterior distribution, Section 4179
analyzes the template generative process. In Section180
5, we empirically compare our results with IPCA and181
give a short discussion. Finally, section 6 concludes182
this paper. 183
2 TARGET COVARIANCE DESCRIPTOR 184
In this section, we explain the motivation of using co- 185
variance descriptor and its operation on Riemannian 186
manifold. 187
2.1 Covariance Descriptor 188
A covariance descriptor is defined as follows: 189
C =1
N − 1
N∑i=1
(f(i)− f
) (f(i)− f
)T (2) 190
where f is a feature vector, f = 1N
∑Ni=1(f(i)) is 191
the mean of the feature vector over N pixels in the 192
target region. In this paper, we use the following 9- 193
dimensional feature vector: 194
f(i) =[xw, yw, I(xw, yw), |Ixw
|, |Iyw|,√I2xw
+ I2yw, 195
arctan|Ixw ||Iyw|, |Ixxw |, |Iyyw |
]. (3) 196
197
They are x, y coordinates, pixel intensity, x, y direc- 198
tional intensity gradients, gradient magnitude and 199
angle, and second order gradients respectively. w 200
denotes that these features are extracted after warping 201
image patches to a standard size. 202
Since its proposed use in human detection [40], 203
covariance descriptor has gained popularity for many 204
applications, such as face recognition [26], license 205
plate detection [30], and tracking [31], [45]. Some main 206
advantages of choosing the covariance descriptor [42] 207
to model the template include its lower dimensional- 208
ity of 12 (d2 + d) (45 in this paper as d = 9), compared 209
to its number of target pixels (32 × 32 = 1024 in this 210
paper), its ability to fuse multiple possibly correlated 211
features, and its robustness to match targets in differ- 212
ent views and poses. 213
By its definition, covariance matrix is clearly a posi- 214
tive semi-definite matrix, which lies on a Riemannian 215
manifold. We will now briefly explain some basic 216
operations on the Riemannian manifold. 217
2.2 Riemannian Manifold218
Fig. 3. The geodesic distance is the norm of a vectoron the tangent space TC1
M of at point C1 on themanifoldM
4
TABLE 1Operations in Euclidean and Riemannian spaces
Euclidean space Riemannian manifoldCj = Ci +
−−−→CiCj Cj = expCi
(−−−→CiCj)−−−→
CiCj = Cj − Ci−−−→CiCj = logCi
(Cj)
dist(Ci, C − j) = ‖Cj − Ci‖ dist(Ci, Cj) = ‖−−−→CiCj‖Ci
A Riemannian manifold M is a differential mani-219
fold and each of its tangent space TCiM has a metric220
function g which defines the dot products between221
any two tangent vectors yk, yl. The covariance descrip-222
tor is a point on the manifold M, the following oper-223
ations can be applied to it. The Riemannian metric:224
〈yk, yl〉Ci= trace
(C
12i ykC
−1i ylC
− 12
i
). (4)225
226
The exponential map expCi: TCi
M → M, takes a227
tangent vector at point Ci and maps to another point228
Cj :229
Cj = expCi(y) = Ci
1/2 exp(Ci− 1
2 yCi− 1
2
)Ci
12 , (5)230
231
The inverse of the exponential map is the logarithm232
map, which takes a starting point Ci and destination233
Cj , maps to the tangent vector y at point Ci.234
y = logCiCj = Ci
12 log
(Ci− 1
2CjCi− 1
2
)Ci
12 . (6)235
236
Finally, the distance between two covariance matrices237
Ci and Cj is given as:238
d(Ci, Cj) =
√√√√ d∑k=1
ln2 λk (Ci, Cj), (7)239
240
where λk (Ci, Cj) are the generalized eigenvalues of241
Ci and Cj . That is, λkCivk − Cjvk = 0, and d is the242
dimension of the covariance matrices.243
Note that expCi(·) and logCi
(·) are maps on the244
Riemannian manifold, whereas exp(·) and log(·) de-245
note the normal matrix exponential and logarithmic246
operations. Both expCi(y) and tangent vector y are247
both d× d matrices in this paper.248
2.3 Motivation of Manifold Modeling249
High dimensional image data often lies in low di-250
mensional manifold. For an example, a collection of251
rotated handwriting zeros in Figure 4 lie in a dimen-252
sion of 28× 28 = 784 using vectorized representation,253
but have only one rotational parameter. Popular di-254
mensional reduction methods such as ISOMAP [39],255
eigenmap [2], LLE [35] model data using manifold256
structures. In visual tracking, the target patches in257
the image sequence are implicitly bounded by the258
target’s degree of freedom captured by images, such259
as rotation, translation, scaling etc. These implicit260
parameters modeled using low dimensional manifold261
could capture image distance. The simplest and yet 262
Fig. 4. A collection of rotated handwriting zeros by theangle of 450 each time.
(a) Targets (b) Backgrounds (c) SVM re-sults
(d) Euclidean distance (e) Manifold distance
Fig. 5. An illustration of distance between imagescan be better modeled on a manifold, a sequenceof dancing penguin from youtube. In (d), Euclideandistance between targets is larger than the distancebetween target and background; and not on the man-ifold space. Furthermore, in (e), SVM cannot linearlyseparate the targets from backgrounds in vectorizedEuclidean space.
most popular distance measure between images is 263
Euclidean distance between the vectorized images. 264
A simple example of the head of dancing penguin 265
from youtube is adopted to illustrate in Figure 5 that 266
the manifold of covariance descriptor can model the 267
image distance better and can separate target patches 268
from the background better. Using Euclidean distance, 269
the distance between target patches and first target270
5
patch could be larger than the one between the back-271
ground and the first target patch. Furthermore, a test272
of support vector machine using linear kernel showed273
that some background patches are classified into the274
target patches. On the other hand, the separation275
between target patches and background patches were276
separated shown in Figure 5e.277
3 BAYESIAN FRAMEWORK278
In this section, we use a standard Bayesian framework279
[33] to formulate tracking of both template and kinet-280
ics as follows:281
P (Ct, st|z1:t) ∝ P (zt|Ct, st)
∫P (st, Ct|st−1, Ct−1)282
P (Ct−1, st−1|z1:t−1)dst−1dCt−1, (8)283284
where zt is the measurement, st is the kinetic state285
variables, Ct is the covariance descriptor, P (Ct, st|z1:t)286
is the posterior probability of target template and pose287
given the measurement, P (zt|Ct, st) is the observa-288
tional model, and P (st, Ct|st−1, Ct−1) is the dynamical289
model. They are further elaborated in the following290
subsections.291
3.1 Dynamical Model292
The state space in our paper includes both target293
kinetic variables st and template covariance descriptor294
Ct. The state variables are defined in Eqns. (9) and295
(10), and we would like to estimate them through the296
Bayesian framework in Eqn.(8). These state variables297
are propagated from time t−1 to t through a dynam-298
ical model P (st, Ct|st−1, Ct−1) .299
st = [xt, yt, xt, yt, ht, θt], (9)300
Ct = cov (xw, yw, I(xw, yw), |Ixw|, |Iyw
|,301
arctan|Iyw|
|Ixw|,√I2xw
+ I2yw, |Ixxw |, |Iyyw |
), (10)302
303
where xt, yt are the spatial coordinates of the target304
center position at time t, xt, yt are the velocities, ht305
is the scaling factor, and θt is the orientation. xw, yw306
are the coordinates of a pixel on the standard target307
patch warped from xt, yt, I(xw, yw) is the pixel inten-308
sity and {Ixw, Iyw} are the patch intensity gradients,309
{Ixxw, Iyyw
} are the second order gradients. Assuming310
independence between kinetic variables and covari-311
ance, we model the joint dynamics as follows:312
P (st, Ct|st−1, Ct−1) = P (st|st−1)P (Ct|Ct−1), (11)313
st = k(st−1) + ut, (12)314
Ct = expCt−1(nt). (13)315
316
k is the kinetic model and we use a near constant317
velocity linear model k(st−1) = Ast−1. ut is generated318
with an interacting Gaussian models with a jump-319
ing probability of [0.9, 0.1] to model sudden changes320
in target poses. As for template dynamical model, 321
nt ∈ TCiM is a random process on the tangent plane 322
of manifold M. An example of this could be the 323
Brownian motion process as described by [17]. In this 324
paper, we choose to model the template dynamical 325
model in log-transformed space of the manifold as 326
follows: 327
Ct = exp(log(Ct−1) + wt) (14) 328
where wt ∼ N(0,Σ), Σi,j = Σj,i ∼ N(0, σ2
i,j
)329
P (log(Ct)) ∝ exp
−1
2
∑i≤j,i,j∈[1,d]
wt(i, j)2
σ2i,j
(15) 330
331
where wt is simply a random symmetric matrices and 332
N(0, σ2i,j), i, j ∈ [1, d] are normal distributions. Ac- 333
cording to [1], the matrix exponential function maps a 334
symmetric matrix to its corresponding positive semi- 335
definite, exp : Sym(d) → Sym+(d), and it is one-to 336
one mapping. As such, the generated samples of Ct 337
is always a positive semi-definite (PSD) matrix. This 338
frees the inherent constraints of positive eigenvalues 339
in a PSD matrix. This distribution may be considered 340
as a log-normal distribution of the PSD matrices as 341
defined in [36]. 342
C−1t Ct−1
= exp (− log(Ct−1)− wt) exp (log(Ct−1))
= exp(−wt)
(16) 343
344
Generalized Eigenvalues: 345
λkCtv − Ct−1v = 0
C−1t Ct−1v = λkv
exp(−wt)v = λkv
d(Ct−1, Ct) =
[d∑
k=1
[ln2 λk (exp(−wt))
]]1/2
=
[d∑
k=1
λ2k(wt)
]1/2, if ∃w−1t
(17) 346
347
In this paper, for d = 9, wt’s eigenvalues λ1(wt) ≥ 348
λ2(wt) ≥ ... ≥ λ9(wt) can be bounded according 349
to [47], assuming the entries of the noise matrix are 350
bounded by [a, b], i.e. a ≤ wt(i, j) ≤ b: 351
λ9(wt) ≥
{12
(9a−
√a2 + 80b2
)|a| < b
9a otherwise.(18) 352
λ1(wt) ≤
{12
(9b+
√a2 + 80b2
)|a| < b
9b otherwise.(19) 353
354
In other words, the eigenvalues are roughly within 355
an order of magnitude of max(σi,j) for this random 356
process. In this way, the template diffusion spread on 357
the manifold can be easily managed by choosing an 358
appropriate max(σi,j) in wt.359
6
Fig. 6. Evolution of template and generated templates,frame #1, 101, 201, 501, 801, Green cross: top 10similar generated templates, Red: ground truth, Bluecross: the background.
3.2 Observation Model360
The observation model P (zt|Ct, st) measures the like-361
lihood of a target given target poses and template362
values, it is modeled as follows:363
P (zt|Ct, st) ∼ N(0, σ2), (20)364
zt = d(Ct, C∗t ),365
C∗t = g(st, Image),366
P (zt|Ct, st) ∝ exp(− 1
2σ2d2) (21)367
368
Here, d(Ct, C∗t ) is given by Eqn. (7). g is the covariance369
computation operator; g takes the kinetic value st370
of each particle at time t, warps the region to a371
standard size (in this paper, 32×32) before computing372
covariance. 373
3.3 Overall Framework 374
We use a standard particle filter to do sequential 375
inference. The particle filter [5], [33], [16] represents 376
the distribution of state variables by a collection of 377
samples and their weights. The advantage of using 378
a particle filter is that it can deal with non-linear 379
system and multi-modal posterior. The algorithm of 380
the particle filter is as follows: 381
1) Initialization.The particle filter is initialized 382
with a known realization of target state vari- 383
ables. This includes the target initial state values. 384
Covariance of the target C0, i.e. initial template 385
is extracted for comparison later. The parameters 386
of covariance generative process. i.e. template 387
dynamical model are also determined. 388
2) Propagation. Each particle is propagated accord- 389
ing to the propagation model in Eqns. (12) and 390
(14). Both kinetic variables and template are 391
generated through these random processes. 392
3) Measure the likelihood. At each particle i, the 393
covariance descriptor C∗t (i) extracted is com- 394
pared to its corresponding template Ct(i). The 395
likelihood of the particle is then estimated as 396
given in Eqn. (21). 397
4) Posterior estimation. The posterior estimate 398
gives the estimate of the current target state, 399
given all its previous information and measure- 400
ments. This could be maximum a posteriori 401
probability estimate or minimum mean square 402
error estimate (MMSE). In this paper, we use 403
MMSE. 404
5) Resampling. To avoid any degeneracies, resam- 405
pling is conducted to redistribute the weight of 406
particles. 407
6) Loop. Repeat the process from step 2 to 5 as time 408
progresses. 409
4 ANALYSIS OF TEMPLATE GENERATIVE 410
PROCESS411
Fig. 7. Visualization of target in“soccer” sequenceson the covariance manifold, Red:target patches, Blue:background patches.
7
In this section, we show that the covariance descrip-412
tor is a good representation of the target as well as413
the motivation behind performing a random walk as414
given in Section 3.1. Two reasonable criteria for a good415
target representation are as follows:416
• the representation evolves gradually as the target417
undergoes changes in poses, appearance etc,418
• there is clear separation of target and back-419
ground.420
To help visualize the distribution of target covariance421
matrices on the manifold, we use multidimensional422
scaling [21] to construct a visualization of the dis-423
tribution of the covariance matrices. The distance424
matrix is constructed using Riemannian distance as425
given in Eqn.(7). The visualization shows the relative426
positions of targets (red) and backgrounds(blue). Vi-427
sualizing the PETS 2003 Soccer sequence and Dudek428
Face sequence in Figures 7 and 6, we noticed that429
our representation of the targets tended to cluster430
together as they evolved gradually. This evolution is431
smoother and easier to model on the manifold as com-432
pared to the evolution of its original feature values at433
each pixel. This observation motivated us to model434
the template variations by using a random walk on435
the Riemannian manifold. Based on Eqns.(12) and436
(14), Figure 6 illustrates a realization of the random437
walk. This shows that our template dynamical model438
can model the actual target appearance variations.439
Changes in facial expression and face poses cause440
covariance template (shown as red points) to evolve441
slowly on the manifold, and they are well modeled442
by the generated covariances on the manifold (shown443
as green points).444
5 EXPERIMENTS AND RESULTS445
5.1 Experimental data446
We tested our algorithm on some of popular tracking447
datasets, David Ross’s sequences including plush toy448
(toy Sylv), toy dog, david, car 4sequences from his449
website, Dudek Face sequences, and vehicle track-450
ing sequences from PETS2001, soccer sequence from451
PETS2003. The test data information is tabulated in452
Table 2.453
5.2 Performance measure454
As spelled out in [22], a good measure should include455
both overall tracking and goodness of track. This456
paper uses the ratio between on-track length and457
sequence length to capture the performance of overall458
tracking, and on-track accuracy for goodness of track.459
Define tracking errors as: ex(t) = ‖gx(t)−x(t)‖, ey(t) =460
‖gy(t) − y(t)‖, where ex(t), ey(t), gx(t), gy(t) are the461
errors in x, y and ground truth in x, y at time t 462
respectively. 463
γontrack =1
2
(ex(t)
Hx(t)+
ey(t)
Hy(t)
)≤ 1 (22) 464
rontrack =γontrack
l(23) 465
rmsontrack =
√(ex(t)
Hx(t)
)1/2
+
(ey(t)
Hy(t)
)1/2
(24) 466
467
Hx(t), Hy(t) are the ground truth target size at time 468
t. In this work, ground truth on the target center is 469
manually annotated, the target size is assumed to as 470
those of the first frame (this may not be applicable to 471
frames with a large change in target size). 472
5.3 Results and discussion 473
We compared our method with the current state- 474
of-the-art algorithm, the incremental PCA (IPCA) 475
method by David et al [34]. Our results are shown 476
in red and the IPCA in green from Figures 9 to 15. 477
In PLUSH TOY SYLV sequences shown in Figure 478
9,the IPCA failed to recover tracking from frame #609 479
when it locked onto the background, which looks 480
more similar to the upright SYLV. Fast poses changes 481
around frame #609 caused the IPCA eigenbases non- 482
representative as shown in Figure 4. 483
Similarly, in Figure 10, the IPCA failed to follow 484
through when target underwent a fast motion towards 485
the frame #1351. This shortcoming of the IPCA is 486
better reflected in Soccer Sequences of PETS2003. the 487
IPCA started to drift off from frame #628 shown in 488
Figure 11 when the player moved his legs fast, and 489
lost track shortly. In the same sequence in Figure 12, 490
the IPCA found it hard to track the opposite team 491
players who wore dark clothes after a short occlusion 492
at frame #285. 493
In Figure 13, Dudek Face sequences, both methods 494
perform well despite of his rich facial expressions, 495
which have more effects on our covariance descriptor. 496
In the more stable vehicle sequence from PETS2001 497
in Figure 14, again both methods could track well. 498
Figure 15 shows an example of a car sequence, in 499
which our method did not perform satisfactorily. Our 500
method locked onto the background whereas the 501
IPCA showed robustness to the illumination changes. 502
The possible explanation is that our template dynam- 503
ics was unable to account for this dramatic and non- 504
smooth transition of the template when the car went 505
into a shadowed region. Also, a closer look showed 506
that the IPCA eigenbasis looked similar to the target 507
template in shadows. 508
The overall tracking performance on the test cases 509
is summarized in Figure 8. Note that images se- 510
quences of Sylv, PETS2001 and soccer player 4 511
have targets out of the images, this explained the 512
small track duration performance. Nevertheless, our 513
method shown in red generally had longer track 514
length. On the hand, given frames that were on515
8
TABLE 2Test Sequences
Test sequences Source No. of frames Characteristics
Plush Toy (Toy Sylv) David Ross 1344 fast changing, 3D Rotation, Scaling, Clutter, large movementToy dog David Ross 1390 fast changing, 3D Rotation, Scaling, Clutter, large movement
Soccer player 1
PETS 2003 1000 Fast changing, white team, good contrast with background, occlusionSoccer player 2Soccer player 3Soccer player 4 Fast changing, gray(red) team,poor contrast with background, occlusion
Dudek Face Sequence A.D. Jepson 1145 Relatively stable, occlusion, 3D rotationTruck PETS 2001 200 relatively stable, scalingDavid David Ross 503 relatively stable 2D rotationCar 4 David Ross 640 Relatively stable, scaling, shadow, specular effects
(a) Track duration rate rontrack (b) Track accuracy rmsontrack
Fig. 8. The results statistics, our results in blue, IPCA in red.
track for both trackers, IPCA showed better track516
accuracy shown in Figure 8b. For the sequences with517
frequent changes in target appearance such as soc-518
cer sequences, the track goodness was comparable.519
The video sequences may be found on the website,520
http://www.youtube.com/watch?v=KaSrVbGyvq4.521
Discussion. In stable tracking cases, good pixel-wise522
alignment enabled the IPCA to track very well. The523
IPCA was generally very robust to blurring, even illu-524
mination changes, as eigenbasis tended to encompass525
these changes. In other words, some eigenbasis looked526
similar to blurred or illumination-changed templates.527
The distance measure in the IPCA uses a norm of all528
corresponding pixels difference; as such, it tends to be529
very stable and well aligned in the stable target cases.530
On the other hand, it is likely to favor the relatively531
stable regions in the target. When such regions are too532
similar to the background and target poses changes533
at the same time, then the IPCA may lose track very534
quickly in the Soccer Sequence in Figure 12. On the535
other hand, our method uses covariance of gradients536
and intensity; the template feature descriptor is much537
smaller in dimension. This may cause our method538
slightly less precise than the IPCA shown in Figure539
10, which our method did not match to pixel accuracy.540
Figure15, our method lost track when the vehicle541
entered the shadowed region, because the both gra- 542
dients and intensity changed significantly and for an 543
interval. 544
Although our method was slightly not as precise 545
in the stable cases, it gain much more flexibility in 546
the non-stable tracking scenarios. In the cases of non- 547
rigid or fast motion of targets, mis-alignment in the 548
posterior estimate (the new template sample to add 549
to the eigen space in the IPCA) and eigenbases may 550
accumulate over a short interval and consequently 551
render eigenbases non-representative at all. This in- 552
evitably leads to loss in tracking. Our method could 553
deal with these scenarios a lot better for two rea- 554
sons. Firstly, the template descriptor did not require 555
pixel-wise alignment and is robust to mis-alignment. 556
Secondly, the generative process could accommodate 557
multiple hypothesis of the template on the covariance 558
Riemannian manifold, and it automatically selects the 559
better hypothesis as the target template evolves as 560
shown in Figure 6. 561
However, there are some limitations in our algo- 562
rithm. One of them is to the need to careful choose a 563
suitable region for tracking. Since we used the pub- 564
lished features such as intensity and gradients, and 565
second order gradients for covariance, these features 566
are sensitive to specular effects, dark shadows as 567
shown in Figure 15. It is also important to choose 568
a target region with fairly good gradients varia-569
9
Fig. 9. Tracking results on PLUSH TOY SYLVsequences, frame #133, 594, 609, 613, 957, 1338,Green: IPCA, Red: our results. The IPCA failed torecover track from frame # 609.
tions, otherwise the covariance descriptor may be ill-570
conditioned consequently affecting both eigenvalues571
estimation and distance measurements in Equation572
(7).573
6 CONCLUSION574
In this paper, we have proposed a new method to575
update target model in tandem with the target kine-576
matics. More precisely, we have developed a gener-577
ative template model in a principled way within a578
Bayesian framework. A novel template propagation579
mechanism in the log-transformed space of the co-580
variance manifold to free the constraints inherently581
imposed by positive definite matrices. We have shown582
that the simple generative process can allow template583
to evolve naturally with target appearance variation.584
It is hoped that by jointly quantifying the uncertainties585
of the target kinematics and template, we are able to586
achieve more robust visual tracking. We have chosen587
the covariance descriptor as the target representation.588
We have modeled the target template model dynamic589
using a random walk on the covariance Riemannian590
manifold. Our template dynamic model is an example591
of a diffusion process on the covariance Riemannian592
manifold. In the experiments, our algorithm outper-593
formed with the current state-of-the-art algorithm594
IPCA particularly when the target underwent a fast595
and non-rigid poses changes, and also maintained a596
comparable performance when the target was more597
stable. Some future work includes automatic selection598
of covariance features that are more robust to a sud-599
den dramatic change in illuminations.600
Future work includes addressing a number of ques-601
tions such as how should the diffusion speed be602
adjusted and can the diffusion process be better con-603
strained. Another area of work is to deal with illu-604
mination changes in the manifold generative process.605
In order to improve the goodness of track, a more606
discriminative target descriptor is to be explored.607
Fig. 10. Tracking results on toy dog sequences, frame#1, 450, 715, 1014, 1271, 1351, Green: IPCA, Red:our results. The IPCA was slightly more localized instable case, but failed to follow through when the targetunderwent a fast motion towards frame #1351.
Fig. 11. Tracking results on soccer sequences, frame#246, 628, 630, 661, 686, 996, Green: IPCA, Red:our results. The IPCA started to drift off from frame#628 when the player’s legs moved fast, and lost trackshortly.
Fig. 12. Tracking results on soccer sequences, frame#10, 15, 122, 248, 285, 360, Green: IPCA, Red: ourresults. The IPCA started to drift off from frame #15 dueto low contrast between the target and the background.
ACKNOWLEDGMENTS608
We would like to thank DSO National Laboratories, 609
Singapore for partially sponsoring this work, and 610
David Ross for sharing the test sequences. 611
REFERENCES 612
[1] V. Arsigny, P. Fillard, X. Pennec, N. Ayache, et al. Geometric 613
means in a novel vector space structure on symmetric positive- 614
definite matrices. SIAM Journal on Matrix Analysis and Appli- 615
cations, 29(1):328, 2008. 616
10
Fig. 13. Tracking results on DUDEK FACE sequences,frame #1, 361, 459, 605, 795, 1095, Green: IPCA, Red:our results. Both results were comparable despite ofhis rich facial expressions, which had more effects onour covariance descriptor.
Fig. 14. Tracking results on PETS2001 vehicle se-quences, frame #1, 25, 50, 75, 100, 125, Green: IPCA,Red: our results. Both results were comparable.
Fig. 15. Tracking results on car sequences, frame #1,132, 150, 168, 184, 227, Green: IPCA, Red: our re-sults. The IPCA performed better, was robust to illumi-nation changes, but our method mainly used templategradients, which changed dramatically due to shadowand lack of reflection of the car plate. At frame #227,the arrow sign might look too similar to the target ingradients.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimension- 617
ality reduction and data representation. Neural computation, 618
15(6):1373–1396, 2003. 619
[3] S. Birchfield. Elliptical head tracking using intensity gradients620
and color histograms. In CVPR, pages 232–, 1998.621
[4] M. Black and A. Jepson. Eigentracking: Robust matching and622
tracking of articulated objects using a view-based representa-623
tion. IJCV, 26(1):63–84, 1998.624
[5] O. Cappe, S. Godsill, and E. Moulines. An overview of625
existing methods and recent advances in sequential monte626
carlo. Proceedings of the IEEE, 95(5):899 –924, May 2007.627
[6] P. C. Cargill, C. U. Rius, D. M. Quiroz, and A. Soto. Per-628
formance evaluation of the covariance descriptor for target629
detection. In Chilean Computer Science Society (SCCC), 2009630
International Conference of the, pages 133 –141, nov 2009.631
[7] C. Cedras and M. Shah. Motion-based recognition a survey.632
Image and Vision Computing, 13(2):129–155, 1995.633
[8] M. Chen, S. Pang, T. Cham, and A. Goh. Visual tracking with634
generative template model based on riemannian manifold of 635
covariances. In Information Fusion (FUSION), 2011 Proceedings 636
of the 14th International Conference on, pages 1–8. IEEE, 2011. 637
[9] R. Collins, Y. Liu, and M. Leordeanu. Online selection of 638
discriminative tracking features. PAMI, 27(10):1631–1643, 2005. 639
[10] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kol- 640
lias, W. Fellenz, and J. Taylor. Emotion recognition in 641
human-computer interaction. Signal Processing Magazine, IEEE, 642
18(1):32–80, 2001. 643
[11] N. Dalal and B. Triggs. Histograms of oriented gradients for 644
human detection. In CVPR, volume 1, pages 886 –893 vol. 1, 645
June 2005. 646
[12] P. T. Fletcher and S. Joshi. Riemannian geometry for the 647
statistical analysis of diffusion tensor data. Signal Processing, 648
87(2):250 – 262, 2007. 649
[13] W. Forstner and B. Moonen. A metric for covariance matri- 650
ces. Technical report, Dept. of Geodesy and Geoinformatics, 651
Stuttgart University, 1999. 652
[14] A. Goh, C. Lenglet, P. Thompson, and R. Vidal. A non- 653
parametric riemannian framework for processing high angular 654
resolution diffusion images (hardi). In CVPR, pages 2496 – 655
2503, june 2009. 656
[15] M. Grabner, H. Grabner, and H. Bischof. Learning features for 657
tracking. In CVPR, pages 1–8. IEEE, 2007. 658
[16] M. Isard and A. Blake. CONDENSATION - conditional density 659
propagation for visual tracking. IJCV, 29:5–28, 1998. 660
[17] K. Ito. The brownian motion and tensor fields on riemannian 661
manifold. Kiyosi Ito selected papers, page 298, 1987. 662
[18] O. Javed and M. Shah. Tracking and object classification for 663
automated surveillance. ECCV 2002, pages 439–443, 2006. 664
[19] A. Jepson, D. Fleet, and T. El-Maraghi. Robust online appear- 665
ance models for visual tracking. PAMI, 25(10):1296 – 1311, oct 666
2003. 667
[20] T. Kaneko and O. Hori. Template update criterion for template 668
matching of image sequences. In Pattern Recognition, 2002. 669
Proceedings. 16th International Conference on, volume 2, pages 670
1–5. IEEE, 2002. 671
[21] J. Kruskal. Multidimensional scaling by optimizing goodness 672
of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 673
1964. 674
[22] V. Manohar, P. Soundararajan, H. Raju, D. Goldgof, R. Kasturi, 675
and J. Garofolo. Performance evaluation of object detection 676
and tracking in video. Computer Vision–ACCV 2006, pages 151– 677
161, 2006. 678
[23] I. Matthews, T. Ishikawa, and S. Baker. The template update 679
problem. PAMI, 26:810–815, 2004. 680
[24] D. McKeown Jr and J. Denlinger. Cooperative methods for 681
road tracking in aerial imagery. In Computer Vision and Pat- 682
tern Recognition, 1988. Proceedings CVPR’88., Computer Society 683
Conference on, pages 662–672. IEEE, 1988. 684
[25] H. Nguyen, M. Worring, and R. Van Den Boomgaard. Occlu- 685
sion robust adaptive template tracking. In ICCV, volume 1, 686
pages 678–683. IEEE, 2001. 687
[26] Y. Pang, Y. Yuan, and X. Li. Gabor-based region covariance 688
matrices for face recognition. Circuits and Systems for Video 689
Technology, IEEE Transactions on, 18(7):989–993, 2008. 690
[27] X. Pennec. Intrinsic statistics on riemannian manifolds: Basic 691
tools for geometric measurements. J. Math. Imaging Vis., 692
25:127–154, July 2006. 693
[28] X. Pennec, P. Fillard, and N. Ayache. A riemannian framework 694
for tensor computing. IJCV, 66:41–66, 2006. 695
[29] F. Porikli. Integral histogram: a fast way to extract histograms 696
in cartesian spaces. In CVPR, volume 1, pages 829 – 836 vol. 697
1, 2005. 698
[30] F. Porikli and T. Kocak. Robust license plate detection using 699
covariance descriptor in a neural network framework. In Video 700
and Signal Based Surveillance, 2006. AVSS’06. IEEE International 701
Conference on, pages 107–107. IEEE, 2006. 702
[31] F. Porikli, O. Tuzel, and P. Meer. Covariance tracking using 703
model update based on lie algebra. In CVPR, volume 1, pages 704
728 – 735, june 2006. 705
[32] Y. Rathi, A. Tannenbaum, and O. Michailovich. Segmenting 706
images on the tensor manifold. In CVPR, pages 1–8, june 2007. 707
[33] B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman 708
Filter: Particle Filters for Tracking Applications. Artech House, 709
2004. 710
[34] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental 711
11
Learning for Robust Visual Tracking. IJCV, 77(1):125–141, 2008. 712
[35] S. Roweis and L. Saul. Nonlinear dimensionality reduction by 713
locally linear embedding. Science, 290(5500):2323–2326, 2000. 714
[36] A. Schwartzman. Random ellipsoids and false discovery rates:715
Statistics for diffusion tensor imaging data. PhD thesis, Stanford716
University, 2006.717
[37] M. B. Stegmann. Object tracking using active appearance718
models. Proc. 10th Danish Conference on Pattern Recognition and719
Image Analysis, pages 54–60, 2001.720
[38] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi.721
Pedestrian detection using infrared images and histograms of722
oriented gradients. In Intelligent Vehicles Symposium, 2006 IEEE,723
pages 206–212. Ieee, 2006.724
[39] J. Tenenbaum, V. De Silva, and J. Langford. A global geometric725
framework for nonlinear dimensionality reduction. Science,726
290(5500):2319–2323, 2000.727
[40] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast728
descriptor for detection and classification. ECCV, pages 589–729
600, 2006.730
[41] O. Tuzel, F. Porikli, and P. Meer. Region Covariance: A Fast731
Descriptor for Detection and Classification. In A. Leonardis,732
H. Bischof, and A. Pinz, editors, ECCV 2006, volume 3952733
of Lecture Notes in Computer Science, pages 589–600. Springer734
Berlin / Heidelberg, 2006.735
[42] O. Tuzel, F. Porikli, and P. Meer. Human detection via736
classification on riemannian manifolds. In CVPR, pages 1 –737
8, june 2007.738
[43] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via739
classification on riemannian manifolds. PAMI, 30(10):1713 –740
1727, oct. 2008.741
[44] G. Wang, Y. Liu, and H. Shi. Covariance tracking via geometric742
particle filtering. In Intelligent Computation Technology and743
Automation, 2009. ICICTA ’09. Second International Conference744
on, volume 1, pages 250 –254, 2009.745
[45] Y. Wu, H. Ling, E. Blasch, L. Bai, and G. Chen. Visual track-746
ing based on log-euclidean riemannian sparse representation.747
Advances in Visual Computing, pages 738–747, 2011.748
[46] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey.749
ACM Comput. Surv., 38(4), 2006.750
[47] X. Zhan. Extremal eigenvalues of real symmetric matrices751
with entries in an interval. SIAM journal on matrix analysis752
and applications, 27(3):851–860, 2006.753
Marcus Chen is a PhD student in the School of Computer Engi-754
neering, Nanyang Technological University, and member of technical755
staff in DSO National Laboratories. He received his B.S. in Electrical756
and Computer Engineering with university honours in 2007 from757
Carnegie Mellon University, and M.S. in Eletrical Engineering from758
Stanford University in 2008.759
Dr. Cham Tat Jen is an Associate Professor in the School of Com-760
puter Engineering, Nanyang Technological University, and Director761
of the Centre for Multimedia & Network Technology (CeMNet). He762
received his BA in Engineering with triple first class honours in 1993763
and his PhD in 1996, both from the University of Cambridge, during764
which he was awarded the Loke Cheng-Kim Foundation Scholarship,765
the Alexandria Prize, the Engineering Members Prize as well as the766
St Catharines College Senior and Research Scholarships. Tat-Jen767
was subsequently conferred a Jesus College Research Fellowship in768
Science in 1996-97. From 1998 to 2001, he was a research scientist769
at DEC/Compaq CRL in Cambridge, MA, USA, where his experience770
included technology transfer to product groups and showcasing771
research work to Hollywood studios. After joining NTU in 2002, he772
was concurrently a Faculty Fellow in the Singapore-MIT Alliance773
Computer Science Program in 2003-2006.774
Dr. Pang Sze Kim is a principal member of technical staff and775
currently is the head of Signal Processing Laboratorie at DSO776
National Laboratories. Sze Kim obtained his Ph.D. from University 777
of Cambridge. 778
DR. Alvina Goh obtained her Ph.D. from the Department of Biomed- 779
ical Engineering at the Johns Hopkins University in May 2010. 780
Currently, she is a researcher at the DSO National Labs and an 781
adjunct assistant professor in the Department of Mathematics at 782
the National University of Singapore. Her research interests include 783
machine learning, computer vision, and medical imaging. 784