
Object Tracking and Segmentation in a Closed Loop

Konstantinos E. Papoutsakis and Antonis A. Argyros

Institute of Computer Science, FORTH and
Computer Science Department, University of Crete
{papoutsa,argyros}@ics.forth.gr
http://www.ics.forth.gr/cvrl/

Abstract. We introduce a new method for the integrated tracking and segmentation of a single non-rigid object in a monocular video, captured by a possibly moving camera. A closed-loop interaction between EM-like color-histogram-based tracking and Random Walker-based image segmentation is proposed, which results in reduced tracking drift and in fine object segmentation. More specifically, pixel-wise spatial and color image cues are fused using Bayesian inference to guide object segmentation. The spatial properties and the appearance of the segmented objects are exploited to initialize the tracking algorithm in the next step, closing the loop between tracking and segmentation. As confirmed by experimental results on a variety of image sequences, the proposed approach efficiently tracks and segments previously unseen objects of varying appearance and shape, under challenging environmental conditions.

1 Introduction

The vision-based tracking and segmentation of an object of interest in an image sequence are two challenging computer vision problems. Each has its own importance and challenges and, together, they can be considered a “chicken-and-egg” problem. By solving the segmentation problem, a solution to the tracking problem can easily be obtained. At the same time, tracking provides important input to segmentation.

In a recent and thorough review of state-of-the-art tracking techniques [1], tracking methods are divided into three categories: point tracking, silhouette tracking and kernel tracking. Silhouette-based tracking methods usually evolve an initial contour to its new position in the current frame. This can be done using a state space model [2] defined in terms of the shape and motion parameters [3] of the contour, or by the minimization of a contour-based energy function [4, 5], providing an accurate representation of the tracked object. Point-tracking algorithms [6, 7] can also combine tracking and fine object segmentation using multiple image cues. Towards more reliable and drift-free tracking, some point-tracking algorithms utilize energy minimization techniques, such as Graph-Cuts or Belief Propagation, on a Markov Random Field (MRF) [8] or on a Conditional Random Field (CRF) [9, 10]. Most kernel-based tracking algorithms [11–13] provide a coarse representation of the tracked object based on a bounding box or an ellipsoid region.

Despite the many important research efforts devoted to the problem, the development of algorithms for tracking objects in unconstrained videos remains an open research problem. Moving cameras, appearance and shape variability of the tracked objects, varying illumination conditions and cluttered backgrounds are some of the challenges that a robust tracking algorithm needs to cope with. To this end, in this work we consider the combined tracking and segmentation of previously unseen objects in monocular videos captured by a possibly moving camera. No strong constraints are imposed regarding the appearance and the texture of the target object or the rigidity of its shape. All of the above may dynamically vary over time, under challenging illumination conditions and changing background appearance. The basic aim of this work is to preclude tracking failures by enhancing target localization through fine object segmentation that is appropriately integrated with tracking in a closed-loop algorithmic scheme. A kernel-based tracking algorithm [14], a natural extension of the popular mean-shift tracker [11, 15], is efficiently combined with Random Walker-based image segmentation [16, 17]. Explicit segmentation of the target region of interest in an image sequence enables reliable tracking and reduces drifting, by exploiting static image cues and temporal coherence.

The key benefits of the proposed method are (i) the closed-loop interaction between tracking and segmentation, (ii) enhanced tracking performance under challenging conditions, (iii) fine object segmentation, (iv) the capability to track objects from a moving camera, (v) increased tolerance to extensive changes of the object's appearance and shape and (vi) the continual refinement of both the object and the background appearance models.

The rest of the paper is organized as follows. The proposed method is presented in Sec. 2. Experimental results are presented in Sec. 3. Finally, Sec. 4 summarizes the main conclusions of this work and future work perspectives.

2 Proposed Method

For each input video frame, the proposed framework encompasses a number of algorithmic steps, tightly interconnected in a closed loop, which is illustrated schematically in Fig. 1. To further ease understanding, Fig. 2 provides sample intermediate results of the most important algorithmic steps.

The method assumes that at a certain moment t in time, a new image frame I_t becomes available and that a fine object segmentation mask M_{t-1} is available as the result of the previous time step t-1. For time t = 0, M_{t-1} should be provided for initialization purposes. Essentially, M_{t-1} is a binary image where foreground/background pixels have a value of 1/0, respectively (see Fig. 2). The goal of the method is to produce the current object segmentation mask M_t. Towards this end, the spatial mean and covariance matrix of the foreground region of M_{t-1} are computed, thus defining an ellipsoid region coarsely representing the object at t-1.
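To make this step concrete, here is a minimal Python/NumPy sketch (our own illustration, not the authors' code; the function name is hypothetical) of fitting the ellipsoid parameters to the foreground of a binary mask:

```python
import numpy as np

def mask_to_ellipse(mask):
    """Spatial mean and covariance of the foreground (nonzero) pixels
    of a binary mask, defining the coarse ellipsoid representation."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    theta = pts.mean(axis=0)            # ellipse center
    sigma = np.cov(pts, rowvar=False)   # 2 x 2 covariance matrix
    return theta, sigma
```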


[Figure: flowchart of the closed loop. Object tracking (left): EM-like color-based object tracking; affine propagation of object shape. Object segmentation (right): pixel-wise computation of prior image cues for the foreground object and background labels (object shape band calculation, Distance Transform-based spatial image cues, color histogram-based image cues); probabilistic fusion of image cues using Bayesian inference; Random Walker-based fine object segmentation. Inputs: image frame I_t and segmented object mask M_{t-1}; output: segmented object mask M_t.]

Fig. 1. Outline of the proposed method

Additionally, a color-histogram-based appearance model of the segmented object (i.e., the one corresponding to the foreground of M_{t-1}) is computed using a Gaussian weighting kernel function. The iterative (EM-like) tracking algorithm in [14] is initialized based on the computed ellipsoid and appearance models. The tracking thus performed results in a prediction of the position and covariance of the ellipsoid representing the tracked object. Based on the transformation parameters of the ellipsoid between t-1 and t, a 2D spatial affine transformation of the foreground object mask M_{t-1} is performed. The propagated object mask M'_t indicates the predicted position and shape of the object in the new frame I_t. The Hausdorff distance between the contour points of the M_{t-1} and M'_t masks is then computed, and a shape band, as in [4, 9], around the M'_t contour points is determined, denoted as B_t. The width of B_t is equal to the computed Hausdorff distance of the two contours. This guarantees that the shape band contains the actual contour pixels of the tracked object in the new frame. Additionally, the pixel-wise Distance Transform likelihoods for the object and background areas are computed, together with the pixel-wise color likelihoods based on region-based color histograms. Pixel-wise Bayesian inference is applied to fuse spatial and color image cues, in order to compute the probability distribution for the object and the background regions. Given the estimated Probability Density Functions (PDFs) for each region, a Random Walker-based segmentation algorithm is finally employed to obtain M_t in I_t.

In the following sections, the components of the proposed method are described in more detail.


[Figure: sample intermediate results on one frame: input frame I_t and previous mask M_{t-1}; object tracking; affine shape propagation yielding M'_t; shape band B_t; spatial prior cue P(L_o|x); color prior cue P(c|L_o); probabilistic fusion map P(L_o|x, c); color-contrast-based Gaussian weights; RW seeds and priors; Random Walker segmentation producing M_t.]

Fig. 2. Sample intermediate results of the proposed method. To avoid clutter, results related to the processing of the scene background are omitted.

2.1 Object Tracking

This section presents the tracking part of the proposed combined tracking and segmentation method (see the bottom-left part of Fig. 1).

EM-Like Color-Based Object Tracking: The tracking method [14] used in this work is closely related to the widely used mean-shift tracking method [11, 15]. More specifically, this algorithm coarsely represents the object's shape by a 2D ellipsoid region, modeled by its center θ and covariance matrix Σ. The appearance model of the tracked object is represented by the color histogram of the image pixels under the 2D ellipsoid region corresponding to θ and Σ, and is computed using a Gaussian weighting kernel function. Given M_{t-1}, I_{t-1}, θ_{t-1} and Σ_{t-1}, the object appearance model can be computed for time t-1. Given a new image frame I_t in which the tracked object is to be localized, the tracking algorithm evolves the ellipsoid region in order to determine the image area in I_t that best matches the appearance of the tracked object, in terms of a Bhattacharyya coefficient-based color similarity measure. This gives rise to the parameters θ_t and Σ_t that represent the predicted object position and covariance in I_t.
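As a point of reference, the Bhattacharyya coefficient underlying this similarity measure can be sketched as follows (assuming normalized histograms stored as NumPy arrays; this is the textbook definition, not necessarily the authors' exact implementation):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized color
    histograms; values close to 1 indicate similar distributions."""
    return np.sum(np.sqrt(p * q))
```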

Affine Propagation of Object Shape: The tracking algorithm presented above assumes that the shape of an object can be accurately represented by an ellipse. In the general case, this is a quite limiting assumption; as a consequence, the object's appearance model is forced to include background pixels, causing tracking to drift. The goal of this work is to prevent tracking drifts by integrating tracking with fine object segmentation.


To accomplish that, the contour C_{t-1} of the object mask in M_{t-1} is propagated to the current frame I_t based on the transformation suggested by the parameters θ_{t-1}, θ_t, Σ_{t-1} and Σ_t. A 2D spatial affine transformation is defined between the corresponding ellipses. Exploiting the obtained Σ_{t-1} and Σ_t covariance matrices, a linear 2 × 2 affine transformation matrix A_t can be computed based on the square root (Σ^{1/2}) of each of these matrices. It is known that a covariance matrix is a square, symmetric and positive semidefinite matrix. The square root of any 2 × 2 covariance matrix Σ can be calculated by diagonalization as

Σ^{1/2} = Q Λ^{1/2} Q^{-1},    (1)

where Q is the square 2 × 2 matrix whose i-th column is the eigenvector q_i of Σ, and Λ^{1/2} is the diagonal matrix whose diagonal elements are the square roots of the corresponding eigenvalues. Since Σ is a covariance matrix, the inverse of its Q matrix is equal to the transposed matrix Q^T, therefore Σ^{1/2} = Q Λ^{1/2} Q^T. Accordingly, we compute the transformation matrix A_t by:

A_t = Q_t Λ_t^{1/2} Λ_{t-1}^{-1/2} Q_{t-1}^T.    (2)

Finally, C'_t is derived from C_{t-1} based on the following transformation:

C'_t = A_t (C_{t-1} - θ_{t-1}) + θ_t.    (3)

The result indicates a propagated contour C'_t and, practically, a propagated object mask M'_t that serves as a prediction of the position and the shape of the tracked object in the new frame I_t.
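For illustration, a minimal NumPy sketch of Eqs. (1)-(3) (our own code, not the authors'; contours are assumed to be N × 2 point arrays, and the eigendecompositions of the two covariances are assumed to pair corresponding axes):

```python
import numpy as np

def affine_from_covariances(sigma_prev, sigma):
    """A_t = Q_t Lambda_t^(1/2) Lambda_{t-1}^(-1/2) Q_{t-1}^T  (Eq. 2).
    np.linalg.eigh returns eigenvalues in ascending order, so the two
    decompositions are assumed to list corresponding axes in the same order."""
    vals_prev, Q_prev = np.linalg.eigh(sigma_prev)
    vals_curr, Q_curr = np.linalg.eigh(sigma)
    return Q_curr @ np.diag(np.sqrt(vals_curr / vals_prev)) @ Q_prev.T

def propagate_contour(contour, theta_prev, theta_curr, A):
    """C'_t = A_t (C_{t-1} - theta_{t-1}) + theta_t  (Eq. 3),
    applied row-wise to an N x 2 array of contour points."""
    return (contour - theta_prev) @ A.T + theta_curr
```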

2.2 Object Segmentation

This section presents how the pixel-wise posterior values of the spatial and color image cues are computed and fused using Bayesian inference, in order to guide the segmentation of the tracked foreground object (see the right part of Fig. 1).

Object Shape Band: An object shape band B_t is determined around the predicted object contour C'_t. Our notion of a shape band is similar to those used in [4, 9]. B_t can be regarded as an area of uncertainty within which the true object contour lies in image I_t. The width of B_t is determined by the Euclidean 2D Hausdorff distance between the contours C_{t-1} and C'_t, regarded as two point sets.
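A possible sketch of this width computation (our illustration, relying on SciPy's directed Hausdorff distance; contour_prev and contour_pred are assumed N × 2 point arrays):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def band_width(contour_prev, contour_pred):
    """Symmetric Hausdorff distance between two contours,
    each given as an N x 2 array of 2D points."""
    d_ab = directed_hausdorff(contour_prev, contour_pred)[0]
    d_ba = directed_hausdorff(contour_pred, contour_prev)[0]
    return max(d_ab, d_ba)
```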

Spatial Image Cues: We use the Euclidean 2D Distance Transform to compute the probability of a pixel x_i in image I_t to belong to either the object region L_o or the background region L_b, based on its 2D location x_i = (x, y) on the image plane. As a first step, the shape band B_t of the propagated object contour C'_t is considered and its inner and outer contours are extracted. The Distance Transform is then computed starting from the outer contour of B_t towards the inner part of the object. The probability P(L_o|x_i) of a pixel to belong to the object given its image location is set proportional to its normalized distance from the outer contour of the shape band. For pixels that lie outside the outer contour of B_t, it holds that P(L_o|x_i) = ε, where ε is a small constant.

Similarly, we compute the Euclidean Distance Transform starting from the inner contour of B_t towards the exterior of the object. The probability P(L_b|x_i) of a pixel to belong to the background given its image location is set proportional to its normalized distance from the inner contour of the shape band. For pixels that lie inside the inner contour of B_t, it holds that P(L_b|x_i) = ε.
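As an illustration, the object prior of this step could be sketched as follows (a simplification under our own assumptions: SciPy is used for the distance transform, and EPS stands in for the unspecified constant ε):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

EPS = 1e-3  # stand-in for the small constant epsilon (value assumed)

def spatial_object_prior(outside_outer):
    """P(L_o | x_i): normalized distance from the outer band contour.
    `outside_outer` is True for pixels on or beyond the outer contour
    of the shape band B_t; the prior grows towards the object center."""
    dist = distance_transform_edt(~outside_outer)  # 0 outside, grows inward
    prior = dist / max(dist.max(), 1.0)            # normalize to [0, 1]
    prior[outside_outer] = EPS
    return prior
```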

Color-Based Image Cues: Based on the segmentation M_{t-1} of the image frame I_{t-1}, we define a partition of the image pixels Ω into the sets Ω_o and Ω_b of object and background pixels, respectively. The appearance model of the tracked object is represented by the color histogram H_o computed on the Ω_o set of pixels. The normalized value in a histogram bin c encodes the conditional probability P(c|L_o). Similarly, the appearance model of the background region is represented by the color histogram H_b, computed over the pixels in Ω_b and encoding the conditional probability P(c|L_b).
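A minimal sketch of this color cue (our illustration; the bin count follows the 32-bin histograms reported in Sec. 3, and the function returns the per-pixel likelihood map P(c | L)):

```python
import numpy as np

BINS = 32  # bins per channel, following the histograms reported in Sec. 3

def color_likelihood(image, mask):
    """Per-pixel color cue P(c | L): a normalized 3D histogram of the
    pixels under `mask` (uint8 RGB image), read back for every pixel."""
    idx = (image // (256 // BINS)).reshape(-1, 3)   # bin index per pixel
    hist = np.zeros((BINS, BINS, BINS))
    sel = idx[mask.ravel()]                         # masked pixels only
    np.add.at(hist, (sel[:, 0], sel[:, 1], sel[:, 2]), 1)
    hist /= max(hist.sum(), 1.0)
    return hist[idx[:, 0], idx[:, 1], idx[:, 2]].reshape(mask.shape)
```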

Probabilistic Fusion of Image Cues: Image segmentation can be considered as a pixel-wise classification problem for a number of classes/labels. Our goal is to generate the posterior probability distribution for each of the labels L_o and L_b, which is further utilized to guide the Random Walker-based image segmentation. Using Bayesian inference, we formulate a probabilistic framework to efficiently fuse the available prior image cues based on the pixel color and position information, as described earlier. Considering the pixel color c as the evidence and conditioning on the pixel position x_i in image frame I_t, the posterior probability distribution for label L_l is given by

P(L_l | c, x_i) = P(c | L_l, x_i) P(L_l | x_i) / Σ_{l=0}^{N} P(c | L_l, x_i) P(L_l | x_i),    (4)

where N = 2 in our case. The probability distribution P(c | L_l, x_i) encodes the conditional probability of color c, taking the pixel label L_l as the evidence and conditioning on its location x_i. We assume that knowing the pixel position x_i does not affect our belief about its color c. Thus, the probability of color c is only conditioned on the prior knowledge of its class L_l, so that P(c | L_l, x_i) = P(c | L_l). Given this, Eq. (4) becomes

P(L_l | c, x_i) = P(c | L_l) P(L_l | x_i) / Σ_{l=0}^{N} P(c | L_l) P(L_l | x_i).    (5)

The conditional color probability P(c | L_l) for the class L_l is obtained from the color histogram H_l. The conditional spatial probability P(L_l | x_i) is obtained from the Distance Transform computation. Both of these calculations have been presented earlier in this section.
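Putting the two cues together, Eq. (5) amounts to the following pixel-wise computation (our sketch; color_o, color_b, spatial_o and spatial_b denote the per-pixel maps P(c | L_o), P(c | L_b), P(L_o | x_i) and P(L_b | x_i) from the previous steps):

```python
import numpy as np

def fuse_posteriors(color_o, color_b, spatial_o, spatial_b, eps=1e-12):
    """Pixel-wise Bayesian fusion of Eq. (5): posterior maps for the
    object and background labels, normalized to sum to 1 per pixel."""
    num_o = color_o * spatial_o
    num_b = color_b * spatial_b
    z = num_o + num_b + eps   # per-pixel normalization constant
    return num_o / z, num_b / z
```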


Random Walker-Based Fine Object Segmentation: The resulting posterior distributions P(L_l | c, x_i) for the two labels L_o and L_b on the pixels x_i guide the Random Walker-based image segmentation towards an explicit and accurate segmentation of the tracked object in I_t.

Random Walks for image segmentation were introduced in [18] as a method to perform K-way graph-based image segmentation, given a number of pixels with user-defined (or automatically defined) labels indicating the K disjoint regions of a new image that is to be segmented. The principal idea behind the method is that one can analytically determine the real-valued probability that a random walker starting at each unlabeled image pixel will first reach one of the pre-labeled pixels. The Random Walker framework bears some resemblance to the popular graph-cuts framework for image segmentation, as they are both related to the spectral clustering family of algorithms [19], but the two exhibit significant differences concerning their properties, as described in [17].

The algorithm is formulated on a discrete weighted undirected graph G = (V, E), where the nodes u ∈ V represent the image pixels and the positive-weighted edges e ∈ E ⊆ V × V indicate their local connectivity. The solution is calculated analytically by solving K-1 sparse, symmetric, positive-definite linear systems of equations for K labels. For each graph node, the resulting probabilities of the potential labels sum up to 1.

In order to represent the image structure by random walker biases, we map the edge weights to positive weighting scores computed by a Gaussian weighting function on the normalized Euclidean distance of the color intensities between two adjacent pixels, i.e., practically, the color contrast. The Gaussian weighting function is

w_{i,j} = exp(-(β/ρ) ‖c_i - c_j‖²) + ε,    (6)

where c_i stands for the vector containing the color channel values of pixel/node i, ε is a small constant (e.g., ε = 10⁻⁶) and ρ is a normalizing scalar, ρ = max ‖c_i - c_j‖ over all edges (i, j) ∈ E. The parameter β is user-defined and modulates the spatial random walker biases in terms of image edgeness. The posterior probability distribution P(L_l | c, x_i) computed over the pixels x_i of the current image I_t suggests the probability of each pixel to be assigned the label L_l. Therefore, we consider the pixels with the highest posterior probability values for the label L_l as pre-labeled (seed) nodes of that label in the formulated graph.
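For illustration, the edge weighting of Eq. (6) for horizontally adjacent pixels can be sketched as follows (our code; vertical edges are handled symmetrically, and the β default is only a placeholder within the range reported in Sec. 3):

```python
import numpy as np

def edge_weights(image, beta=30.0, eps=1e-6):
    """Gaussian color-contrast weights of Eq. (6) for horizontally
    adjacent pixels (vertical edges are computed symmetrically)."""
    c = image.astype(float)
    diff = np.linalg.norm(c[:, 1:] - c[:, :-1], axis=-1)  # color contrast
    rho = max(diff.max(), 1e-12)                          # normalizing scalar
    return np.exp(-(beta / rho) * diff ** 2) + eps
```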

An alternative formulation of the Random Walker-based image segmentation method is presented in [16]. This method incorporates non-parametric probability models, that is, prior beliefs on label assignments. In [16], the sparse linear systems of equations that need to be solved to obtain a real-valued, density-based multilabel image segmentation are also presented. The two modalities of this alternative formulation allow for using either only prior knowledge on the belief of a graph node towards each of the potential labels, or prior knowledge in conjunction with pre-labeled (seed) graph nodes. A scalar weight parameter γ is introduced in these formulations, controlling the influence of the prior belief values relative to the belief information obtained by the random walks. This extended formulation, using both seeds and prior beliefs on graph nodes,


is compatible with our approach, considering the obtained posterior probability distributions P(L_l | c, x_i) for the two segmentation labels. The two Random Walker formulations that use prior models suggest a graph construction similar to that of the graph-cut algorithm [20], where the edge weights of the constructed graph can be seen as the N-links (link terms) and the prior belief values of the graph nodes for any of the potential labels can be considered as the T-links (data terms), in graph-cuts terminology.

Regardless of the exact formulation used, the primary output of the algorithm consists of K probability maps, that is, a soft image segmentation per label. By assigning each pixel to the label for which the greatest probability is calculated, a K-way segmentation is obtained. This process gives rise to the object mask M_t for image frame I_t.
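In code, this final step is a per-pixel argmax over the label probability maps (a trivial sketch, for completeness):

```python
import numpy as np

def hard_segmentation(prob_maps):
    """Per-pixel argmax over the K label probability maps (K x H x W),
    turning the soft Random Walker output into a label image."""
    return np.argmax(prob_maps, axis=0)
```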

3 Experimental Results and Implementation Issues

The proposed method was extensively tested on a variety of image sequences. Due to space limitations, results on eight representative image sequences are presented in this paper. The objects tracked in these sequences go through extensive appearance, shape and pose changes. Additionally, these sequences differ with respect to the camera motion and to the lighting conditions during image acquisition, which affect the appearance of the tracked objects.

We compare the proposed joint tracking and segmentation method with the tracking-only approach of [14]. The parameters of this algorithm were kept identical in the stand-alone run and in the run within the proposed framework. It is important to note that stand-alone tracking based on [14] is initialized with the appearance model extracted in the first frame of the sequence and that this appearance model is not updated over time. This is done because, in all the challenging sequences we used as the basis of our evaluation, updating the appearance model based on the results of tracking soon causes tracking drifts and total loss of the tracked object.

Figure 3 illustrates representative tracking results (i.e., five frames for each of the eight sequences). In the first sequence, a human hand undergoes complex articulations, whereas the lighting conditions significantly affect its skin color tone. In the second sequence, a human head is tracked despite its abrupt scale changes and the lighting variations. In the third sequence, the articulations of a human hand are observed by a moving camera in the context of a continuously varying cluttered background. The green book tracked in the fourth sequence undergoes significant changes regarding its pose and shape, whereas light reflections on its glossy surface significantly affect its appearance. The fifth sequence is an example of a low-quality video captured by a moving camera, illustrating the inherently deformable body of a caterpillar in motion. The sixth and seventh sequences show a human head and hand, respectively, which both go through extended pose variations in front of a complex background. Finally, the last, low-resolution sequence has been captured by a medical endoscope. In this sequence, a target object is successfully tracked within a low-contrast background.


Each of the image sequences in Fig. 3 showing human hands or faces, as well as the green book sequence, consists of 400 frames of resolution 640 × 480 pixels, captured at a frame rate of 5-10 fps. The resolution of each frame of the image sequences illustrated in the second and fifth rows is 320 × 240 pixels. The last image sequence depicted in Fig. 3, captured by a medical endoscope, consists of 20 image frames of size 256 × 256 pixels each.

The reported experiments were based on a Matlab implementation, running on a PC equipped with an Intel i7 CPU and 4 GB of RAM. The runtime of the current implementation varies between 4 and 6 seconds per frame for 640 × 480 images. Near real-time performance is feasible by optimizing both the EM-like component of the tracking method and the solution of the large, sparse linear system of equations in the Random Walker formulation of the segmentation procedure.

Each frame shown in Fig. 3 is annotated with the results of the proposed algorithm and the results of the tracking method proposed in [14]. More specifically, the blue solid ellipse shows the expected position and coarse orientation of the tracked object as this results from the tracking part of the proposed methodology. The green solid object contour is the main result of the proposed algorithm, which shows the fine object segmentation. Finally, the result of [14] is shown for comparison as a red dotted ellipse. Experimental results on the full video datasets are available online¹.

In all sequences, the appearance models of the tracked objects have been built based on the RGB color space. The object and background appearance models used to compute the prior color cues are color histograms with 32 bins per histogram for both the object and the background. Preserving the parameter configuration of the object tracking algorithm as described in [14], the target appearance model of the tracker is implemented by a color histogram with 8 bins per dimension.

The Random Walker segmentation method involves three different formulations to obtain the probability of each pixel to belong to each of the labels of the segmentation problem, as described in Sec. 2.2. The three options refer to the usage of seed pixels (pre-labeled graph nodes), of prior values (probabilities/beliefs on label assignments for some graph nodes), or of a combination of the two. For the last option, the edge weights of the graph are computed by Eq. (6), where the scalar parameter β controls the scale of the edgeness (color contrast) between adjacent graph nodes. The pixel-wise posterior values are computed using Bayesian inference, as described in Sec. 2.2, and are exploited to guide segmentation as seed and prior values, in Random Walker terminology. Each pixel x_i with posterior value P(L_l | x_i) greater than or equal to 0.9 is considered a seed pixel for the label L_l, and thus a seed node of the graph G. Any other pixel with posterior value P(L_l | x_i) less than 0.9 is considered a prior value for label L_l. In the case of prior values, the γ parameter is introduced to adjust the degree of authority of the prior beliefs towards the definite label assignments expressed by the seed nodes of the graph.
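A sketch of this seed/prior split (our illustration; the 0.9 threshold is the one quoted above):

```python
import numpy as np

SEED_THRESHOLD = 0.9  # posterior level above which a pixel becomes a seed

def seeds_and_priors(posterior):
    """Split a per-pixel posterior map for one label into definite seeds
    (pre-labeled graph nodes) and soft priors for the remaining pixels."""
    seeds = posterior >= SEED_THRESHOLD
    priors = np.where(seeds, 0.0, posterior)
    return seeds, priors
```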

¹ http://www.ics.forth.gr/~argyros/research/trackingsegmentation.htm


Fig. 3. Experimental results and qualitative comparison between the proposed framework, providing tracking and segmentation results (blue solid ellipse and green solid object contour, respectively), and the tracking algorithm of [14] (red dotted ellipse). See text for details.


Table 1. Quantitative assessment of segmentation accuracy. See text for details.

Segmentation option    Precision    Recall    F-measure
Priors                 93.5%        92.9%     93.1%
Seeds                  97.5%        99.1%     98.3%
Priors and Seeds       97.5%        99.1%     98.3%

In our experiments, the β parameter was selected within the interval [10, 50], whereas the γ parameter ranges within [0.05, 0.5].

In order to quantitatively assess the influence of the three different options regarding the operation of the Random Walker on the quality of the segmentation results, the three variants have been tested independently on an image sequence consisting of 1,000 video frames. For each of these frames, ground truth information is available in the form of a manually segmented foreground object mask. Table 1 summarizes the average (per-frame) precision, recall and F-measure performance of the proposed algorithm compared to the ground truth. As can be verified, although all three options perform satisfactorily, the use of seeds improves the segmentation performance.
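For reference, these per-frame scores can be computed from binary masks as follows (a minimal sketch, assuming NumPy arrays; this is not the authors' evaluation code):

```python
import numpy as np

def precision_recall_f(pred, gt):
    """Per-frame precision, recall and F-measure of a predicted binary
    mask `pred` against a ground-truth binary mask `gt`."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f
```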

4 Summary

In this paper we presented a method for the online, joint tracking and segmentation of a non-rigid object in a monocular video, captured by a possibly moving camera. The proposed approach aspires to relax several limiting assumptions regarding the appearance and shape of the tracked object, the motion of the camera and the lighting conditions. The key contribution of the proposed framework is the efficient combination of appearance-based tracking with Random Walker-based segmentation, which jointly enables enhanced tracking performance and fine segmentation of the target object. A 2D affine transformation is computed to propagate the segmented object shape of the previous frame to the new frame, exploiting the information provided by the ellipse region capturing the segmented object and the ellipse region predicted by the tracker in the new frame. A shape-band area is computed, indicating an area of uncertainty where the true object boundaries lie in the new frame. Static image cues, including pixel-wise color and spatial likelihoods, are fused using Bayesian inference to guide the Random Walker-based object segmentation, in conjunction with the color-contrast (edgeness) likelihoods between neighboring pixels. The performance of the proposed method is demonstrated on a series of challenging videos and in comparison with the results of the tracking method presented in [14].

Acknowledgments

This work was partially supported by the IST-FP7-IP-215821 project GRASP. The contributions of FORTH-ICS members I. Oikonomidis and N. Kyriazis to the development of the proposed method are gratefully acknowledged.


References

1. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38, 13 (2006)

2. Isard, M., Blake, A.: Condensation: Conditional density propagation for visual tracking. International Journal of Computer Vision 29, 5–28 (1998)

3. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Transactions on PAMI 22, 266–280 (2000)

4. Yilmaz, A., Li, X., Shah, M.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on PAMI 26, 1531–1536 (2004)

5. Bibby, C., Reid, I.: Robust real-time visual tracking using pixel-wise posteriors. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 831–844. Springer, Heidelberg (2008)

6. Khan, S., Shah, M.: Object based segmentation of video using color, motion and spatial information. In: IEEE Computer Society Conference on CVPR, vol. 2, p. 746 (2001)

7. Baltzakis, H., Argyros, A.A.: Propagation of pixel hypotheses for multiple objects tracking. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Kuno, Y., Wang, J., Pajarola, R., Lindstrom, P., Hinkenjann, A., Encarnacao, M.L., Silva, C.T., Coming, D. (eds.) ISVC 2009. LNCS, vol. 5876, pp. 140–149. Springer, Heidelberg (2009)

8. Yu, T., Zhang, C., Cohen, M., Rui, Y., Wu, Y.: Monocular video foreground/background segmentation by tracking spatial-color Gaussian mixture models. In: IEEE Workshop on Motion and Video Computing (2007)

9. Yin, Z., Collins, R.T.: Shape constrained figure-ground segmentation and tracking. In: IEEE Computer Society Conference on CVPR, pp. 731–738 (2009)

10. Ren, X., Malik, J.: Tracking as repeated figure/ground segmentation. In: IEEE Computer Society Conference on CVPR, pp. 1–8 (2007)

11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on PAMI 25, 564–577 (2003)

12. Tao, H., Sawhney, H., Kumar, R.: Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on PAMI 24, 75–89 (2002)

13. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. IEEE Transactions on PAMI 25, 1296–1311 (2003)

14. Zivkovic, Z., Krose, B.: An EM-like algorithm for color-histogram-based object tracking. In: IEEE Computer Society Conference on CVPR, vol. 1, pp. 798–803 (2004)

15. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Computer Society Conference on CVPR, vol. 2, p. 2142 (2000)

16. Grady, L.: Multilabel random walker image segmentation using prior models. In: IEEE Computer Society Conference on CVPR, vol. 1, pp. 763–770 (2005)

17. Grady, L.: Random walks for image segmentation. IEEE Transactions on PAMI 28, 1768–1783 (2006)

18. Grady, L., Funka-Lea, G.: Multi-label image segmentation for medical applications based on graph-theoretic electrical potentials. In: Sonka, M., Kakadiaris, I.A., Kybic, J. (eds.) CVAMIA/MMBIA 2004. LNCS, vol. 3117, pp. 230–245. Springer, Heidelberg (2004)

19. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17, 395–416 (2007)

20. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. International Journal of Computer Vision 70, 109–131 (2006)