
Video Object Segmentation with Occlusion Map

Hao Xiong∗, Zhiyong Wang∗, Renjie He†∗ and David Dagan Feng∗

∗School of Information Technologies, The University of Sydney, Australia

†Northwestern Polytechnical University, Xi'an 710072, China
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Extracting foreground objects from a video captured by a hand-held camera has been a new challenge in video segmentation, since most of the existing approaches generally work well only when certain assumptions on the background scene or camera motion (e.g. still surveillance cameras) hold. While some approaches exploit clues such as depth and motion to extract the foreground layer from hand-held camera videos, we propose to leverage the advances in high quality interactive image segmentation. That is, we treat each video frame as an individual image and segment foreground objects with interactive image segmentation algorithms. In order to simulate user interactions, we derive a reliable occlusion map for foreground objects and use it as the “seeding” interactive input to an interactive image segmentation approach. In this paper, we employ an optical flow based occlusion detection approach for extracting the occlusion map and a Geodesic star convexity based interactive image segmentation approach. In order to obtain accurate “seeding” user interactions, both forward and backward occlusion maps are computed and utilized. As a result, our approach is able to extract whole objects that have only partial movements, which overcomes a limitation of the state-of-the-art algorithm. Experimental results demonstrate both the effectiveness and efficiency of our proposed approach.

I. INTRODUCTION

Video object segmentation has been attracting considerable attention due to its essential role in a wide range of applications such as intelligent video editing (e.g. inserting a segmented object into a new scene). While most of the existing algorithms address the problem in specific domains such as video surveillance where a camera is fixed and still [1], new video data demands more advanced solutions [2].

With the advancement of imaging technologies, it has never been easier for users to produce video content with portable devices such as camcorders and smart phones. Such videos captured by hand-held cameras impose great challenges for video object segmentation, since the camera motion is arbitrary, which makes conventional background modelling more difficult.

Recently, Zhang et al. [2] proposed a robust bilayer segmentation algorithm to extract moving objects from videos captured by hand-held cameras. It exploits clues from depth and motion to estimate camera configurations and warp one frame against its reference frame; the difference between the warped frame and the reference frame then serves as the initial segmentation mask. In order to improve robustness, it also employs a simple voting-like strategy to balance the set of terms and automatically reject occasional outliers. However, due to its dependence on motion information, a foreground object with only partial movement will not be fully segmented.

Fig. 1. Illustration of occlusion map with a sample video from [2]. The occluded area between two temporally adjacent frames ((a) and (b)) is highlighted in red in (c).

As a result, it will fail to segment the whole body of a person who is only partially waving his/her arm (shown in Fig. 7).

Meanwhile, interactive image segmentation has been able to achieve high quality object segmentation, provided that appropriate user interaction is available. For example, in [3] users are allowed to label both foreground and background seeds beforehand, such that object segmentation is achieved by incorporating the shape prior into graph cut based segmentation algorithms. Hence, it is anticipated that foreground objects can be accurately segmented from each frame if appropriate user input is available. That is, the key to automatic high quality video object segmentation is to simulate user interactions by deriving “seeding” user inputs from videos.

Based on the above observation, in this paper we take a different perspective on object segmentation from hand-held camera videos by leveraging the advances in interactive image segmentation. Instead of modelling the scene background or tracking moving objects, we aim to automatically identify clues of background and foreground and feed such “seeding” clues as user inputs to an interactive image segmentation algorithm. We notice that arbitrary movement of a hand-held camera often changes the camera's viewpoint, which contributes to the occlusions between two temporally adjacent frames. As a result, there exists an occlusion map which indicates the areas visible in one frame but not in the other adjacent frame.

In our work, we employ the optical flow based occlusion detection algorithm [4] to identify the occlusion map of foreground objects and the Geodesic star convexity based interactive image segmentation approach [3] for object segmentation, since these two approaches are the state-of-the-art in their fields. As shown in Fig. 1(c), the occluded area generated by camera movement is highlighted in red and serves as a reliable “seeding” input for the interactive image segmentation algorithm. Such an automatically derived occlusion map can replace the manually labelled seeds of interactive image segmentation. As observed, the occlusion map highlights both the moving part and the static part of the human body (shown in Fig. 6), which overcomes the limitation of [2] in handling partial movement (shown in Fig. 7).

The rest of the paper is organized as follows. In Section II we review related work on video object segmentation. In Section III we explain the key technical components of our proposed approach, including occlusion map extraction and refinement, and Geodesic star convexity based image segmentation. In Section IV we present experimental results and discussions. Finally, we conclude the paper in Section V.

II. RELATED WORK

Video object segmentation has been a very challenging and popular research issue, and many algorithms have been proposed. Some of them are fully automatic, while others require user interaction [5]. In this section, we briefly review automatic video object segmentation algorithms. Generally speaking, there are two categories of approaches: one focuses on static background and the other on dynamic background.

A. Segmentation with Static Background

Videos with static background are generally captured with physically fixed cameras. The key to object segmentation from such videos is to model the background, so that foreground objects can be extracted by subtracting the background model from the current frame [6][7][8][9][10]. These approaches often assume that the background is known beforehand (e.g. being stationary in advance) such that Gaussian models can be built for background and foreground respectively.

However, these approaches may fail when a background object “wakes up” and moves. In [1] Sun et al. proposed to deal with potential waking objects in a video by preserving the contrasts across foreground/background boundaries. In addition, an adaptive mixture model of global and per-pixel background colors was proposed to improve the robustness of their system under various background changes.

Some researchers also work on a more challenging issue: multiple object segmentation. In [11] Shao et al. proposed to segment multiple moving objects through spatio-temporal energy modeling. In [12], Zhao et al. proposed to handle multiple human segmentation in crowded environments by integrating various knowledge including human shape, camera model, and image cues.

B. Segmentation with Dynamic Background

Object segmentation from videos with dynamic background is more challenging, since it is very difficult (if not impossible) to model the background scene. One group of studies focuses on exploiting the difference in motion patterns between foreground and background, since the movement of foreground objects is generally different from the motion of background objects [13][14]. However, these methods assume that background objects move consistently. Observing that the foreground may move arbitrarily, Mitzel et al. [15] proposed to detect objects of interest by adopting the HOG detector [16] and exploiting the incompatibility of the background geometry. In [17], Zhang et al. proposed to differentiate foreground objects from the background by incorporating the depth map of each frame.

Fig. 2. Illustration of occlusion map extraction. (a) The color-coded motion estimates. (b) The residual $I(x, t) - I(w(x, t), t + dt)$ before the re-weighting stage. (c) The residual after re-weighting. (d) The sparse error term $e_1$.

In [2], Zhang et al. for the first time proposed to segment foreground objects from hand-held camera videos, where background dynamics is contributed by both camera movement and moving background objects. They presented a comprehensive system by combining different clues such as depth and motion information. However, the approach is not able to segment a whole object of which only a part is moving. The aim of our work is to overcome this limitation of their approach.

III. OUR PROPOSED SEGMENTATION APPROACH

Our proposed approach consists of three major steps: occlusion map extraction, occlusion map refinement, and Geodesic star convexity based image object segmentation. The first two steps obtain accurate “seeding” inputs, and the last step extracts foreground objects by treating each video frame as an individual image.

A. Occlusion Map Extraction

Due to the temporal continuity of video content, the majority of pixels in one frame have corresponding placements in adjacent frames. Meanwhile, the motion from either moving objects or the camera makes some content occluded between two adjacent frames. That is, for some pixels in one frame, there are no corresponding placements in the adjacent frames. Such occluded pixels form an occlusion map between two adjacent frames. In this paper, the sparse occlusion detection method proposed in [4] is employed due to its superiority in occlusion detection and its ability to handle multiple occlusion layers.

In [4], Ayvaci et al. formulated occlusion detection and optical flow estimation in a joint framework by assembling a cost function that penalizes the optical flow residual in the co-visible regions as well as the occluded regions. As a result, the optimization problem is solved jointly with respect to the unknown optical flow field and the indicator function of the occluded region.

Let $I(x, t)$ be a grayscale time-varying image defined on a domain $D$. The relation between two consecutive frames in a video $\{I(x, t)\}_{t=0}^{T}$ is given by

$$I(x, t) = \begin{cases} I(w(x, t), t + dt) + n(x, t), & x \in D \setminus \Omega(t; dt), \\ \rho(x, t), & x \in \Omega(t; dt), \end{cases} \qquad (1)$$

where $x \mapsto w(x, t) = x + v(x, t)$ is the domain deformation mapping $I(x, t)$ onto $I(x, t + dt)$ everywhere except at the occluded region $\Omega$, and $n(x, t)$ is the uncertainty term.

For any $x \in D$, Equation (1) can be rewritten as

$$I(x, t) = I(w(x, t), t + dt) + e_1(x, t; dt) + e_2(x, t; dt), \qquad (2)$$

where the terms $e_1$ and $e_2$ are defined as

$$\begin{cases} e_1(x, t; dt) = \rho(x, t) - I(w(x, t), t + dt), & x \in \Omega, \\ e_2(x, t; dt) = n(x, t), & x \in D \setminus \Omega. \end{cases} \qquad (3)$$

For a sufficiently small $dt$, we can approximate $I(x, t + dt)$ for any $x \in D \setminus \Omega$ as

$$I(x, t + dt) = I(x, t) + \nabla I(x, t) \cdot v(x, t) + n(x, t), \qquad (4)$$

where the linearisation error has been incorporated into the uncertainty term $n(x, t)$.

Since the residual term $e_1$ is large but sparse and $e_2$ is small but dense, the goal is to optimize the following data fidelity term, which minimizes the number of non-zero elements of $e_1$ and the negative log-likelihood of $n$:

$$\psi_{data}(v, e_1) = \|\nabla I \, v + I_t - e_1\|_{L^2(D)} + \lambda \|e_1\|_{L^0(D)}. \qquad (5)$$

In addition to the data term, because the unknown $v$ is infinite-dimensional, regularization is imposed by requiring that the total variation (TV) be small:

$$\psi_{reg}(v) = \mu \|v_1\|_{TV} + \mu \|v_2\|_{TV}, \qquad (6)$$

where $v_1$ and $v_2$ are the first and second components of the optical flow $v$, and $\mu$ is a multiplier factor weighting the strength of the regularizer.

Hence, the overall problem can be written as the minimization of the cost function $\psi = \psi_{data} + \psi_{reg}$:

$$\hat{v}_1, \hat{v}_2, \hat{e}_1 = \arg\min_{v_1, v_2, e_1} \frac{1}{2} \left\| A [v_1, v_2, e_1]^T + b \right\|_{\ell^2}^2 + \lambda \|e_1\|_{\ell^0} + \mu \|v_1\|_{TV} + \mu \|v_2\|_{TV}, \qquad (7)$$

where $A$ is the spatial derivative matrix, $e_1$ is the vector obtained by stacking the values of $e_1(x, t)$ on the lattice $\Lambda$ (i.e. a digital image on the domain $D$ quantized into an $M \times N$ lattice $\Lambda$) on top of one another, the vector field components $\{v_1(x, t)\}_{x \in \Lambda}$ and $\{v_2(x, t)\}_{x \in \Lambda}$ are similarly stacked into $MN$-dimensional vectors $v_1$ and $v_2$, and the temporal derivative values $\{I_t(x, t)\}_{x \in \Lambda}$ are stacked into $b$.

Fig. 3. Illustration of forward and backward occlusion maps. (a) Forward occlusion map. (b) Backward occlusion map.

Equation (7) is an NP-hard problem; a convex relaxation would simply replace the $\ell^0$ norm with $\ell^1$. In order to avoid penalizing “bright” occluded regions more than “dim” ones, a weighted $\ell^1$ norm is introduced. Therefore, Equation (7) can be rewritten as

$$\hat{v}_1, \hat{v}_2, \hat{e}_1 = \arg\min_{v_1, v_2, e_1} \frac{1}{2} \left\| A [v_1, v_2, e_1]^T + b \right\|_{\ell^2}^2 + \lambda \|W e_1\|_{\ell^1} + \mu \|v_1\|_{TV} + \mu \|v_2\|_{TV}, \qquad (8)$$

where $W$ is a diagonal matrix initialized with the identity matrix, whose weights are adapted with an iterative procedure called reweighted $\ell^1$ (proposed in [18]) so as to better approximate the $\ell^0$ norm. Each iteration has a globally optimal solution that can be reached efficiently from any initial condition.
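To make the reweighting step concrete, the following minimal sketch shows the weight update in the style of [18]. The outer solver loop is only indicated in comments: `solve_weighted_l1` is a hypothetical placeholder for one pass of Equation (8) with $W$ held fixed (e.g. via Nesterov's or the Split-Bregman algorithm), and the tolerance value is an assumption.

```python
import numpy as np

def reweighted_l1_weights(e1: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Diagonal weights for the next reweighted-l1 pass (Candes et al. [18]):
    large residuals receive small weights, so they are penalized less and
    the weighted l1 norm better approximates the l0 norm."""
    return 1.0 / (np.abs(e1) + eps)

# Sketch of the outer loop (solve_weighted_l1 is a placeholder, not a real API):
# W = np.ones_like(e1)                                   # W starts as the identity
# for _ in range(5):
#     v1, v2, e1 = solve_weighted_l1(A, b, W, lam, mu)   # one pass of Eq. (8)
#     W = reweighted_l1_weights(e1)
```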

In [4], two minimization algorithms, Nesterov's algorithm and the Split-Bregman algorithm, were proposed. Fig. 2 illustrates the occlusion map extraction process.

Though the occlusion map indicates the contours of foreground objects, it lacks information on “seeding” foreground and background regions. Therefore, we obtain a second occlusion map by reversing the temporal order of the two adjacent frames. That is, the occlusion map in Fig. 2(d) is for the target frame Fig. 1(b) by referring to the reference frame Fig. 1(a), which is named the forward occlusion map; and the occlusion map of Fig. 1(a) by referring to Fig. 1(b) is named the backward occlusion map of Fig. 1(b). As observed from Fig. 3, the forward and backward occlusion maps of Fig. 1(b) are able to delimit the regions of foreground objects between the two maps.
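For intuition, forward and backward occlusion maps can be approximated without the sparse solver of [4] by a forward-backward optical flow consistency check, a common proxy for occlusion detection. The sketch below uses OpenCV's Farneback flow purely as an illustrative stand-in for [4]; the threshold and flow parameters are assumptions.

```python
import cv2
import numpy as np

def occlusion_map(frame_a: np.ndarray, frame_b: np.ndarray, thresh: float = 1.0) -> np.ndarray:
    """Approximate the occlusion map of frame_a w.r.t. frame_b: pixels whose
    forward flow is not undone by the backward flow have no corresponding
    placement in frame_b and are likely occluded. Inputs: 8-bit grayscale."""
    fwd = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    bwd = cv2.calcOpticalFlowFarneback(frame_b, frame_a, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = frame_a.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Sample the backward flow at the positions the forward flow points to.
    bwd_warped = cv2.remap(bwd, xs + fwd[..., 0], ys + fwd[..., 1], cv2.INTER_LINEAR)
    # Forward-backward residual: near zero for co-visible pixels, large where occluded.
    residual = np.linalg.norm(fwd + bwd_warped, axis=2)
    return residual > thresh

# Per Section III-A: the forward map is the occlusion map of the target frame
# w.r.t. its reference; reversing the temporal order gives the backward map.
# fwd_occ = occlusion_map(target_gray, reference_gray)
# bwd_occ = occlusion_map(reference_gray, target_gray)
```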

B. Refining Occlusion Map

Due to self occlusions (e.g. hair occluding the face) and the noise in occlusion maps, we have to perform a refinement to ensure that the segmentation algorithm receives high quality “seeding” inputs. Our refinement approach consists of two steps: coarse contour extraction of the occlusion map and noise removal.

We extract the coarse contour of the occlusion map by horizontally scanning the occlusion map. In order to decide the scanning direction (from the left side or the right side), we obtain the average positions of both the forward and backward occlusion maps and scan from the average position closer to the origin toward the further one. The coarse contour of the forward occlusion map Fig. 3(a) is shown in Fig. 4(a), where noise remains. Note that the assumption of a horizontal scan is valid, since the general directions of the occlusion maps can be utilized to rotate them.
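A minimal sketch of this horizontal scan, assuming the occlusion map is a binary array; the scan keeps the first occluded pixel of each row from the chosen side (function and variable names are ours).

```python
import numpy as np

def coarse_contour(occ: np.ndarray, from_left: bool = True) -> np.ndarray:
    """Scan each row of a binary occlusion map and keep only the first
    occluded pixel encountered, yielding a one-pixel-wide coarse contour."""
    occ = occ.astype(bool)
    scan = occ if from_left else occ[:, ::-1]
    first = scan.argmax(axis=1)            # column of the first True in each row
    has_occ = scan.any(axis=1)             # skip rows with no occluded pixels
    cols = first if from_left else occ.shape[1] - 1 - first
    contour = np.zeros_like(occ)
    rows = np.arange(occ.shape[0])
    contour[rows[has_occ], cols[has_occ]] = True
    return contour
```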

Fig. 4. Illustration of occlusion map refinement with the forward occlusion map. (a) Contour of the occlusion map. (b) Shifted contour. (c) Contour projection. (d) Performing the image closing operator. (e) Final refined occlusion contour.

Fig. 5. Brief illustration of refining the backward occlusion map. (a) Contour of the occlusion map. (b) Final refined occlusion contour.

In order to remove noise, we employ the closing operator of morphological image analysis. However, direct application of the closing operator may dilate the contour, resulting in inaccurate identification of the contour. Therefore, we first shift the coarse contour along the scanning direction (the same as the scanning direction in the coarse contour extraction step). Both the original and the shifted contour are shown in Fig. 4(b).

Projecting only the shifted contour toward the shift direction yields Fig. 4(c), which contains spiky horizontal peaks. After applying the closing operator to the projected contour, we obtain a refined shifted contour as shown in Fig. 4(d). Using this refined shifted contour as a mask (where the white region indicates the regions to be masked) over the coarse contour of Fig. 4(a), we obtain the final refined contour of the forward occlusion map as shown in Fig. 4(e). Similarly, a brief illustration of refining the backward occlusion map is shown in Fig. 5.

In our experiments, we empirically set the shift distance to 30 pixels and the closing structuring element to a 25 × 25 square, which provides good results.
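The shift-close-mask refinement can be sketched with standard OpenCV morphology as below. The 30-pixel shift and 25 × 25 structuring element follow the empirical settings above; the projection step is simplified to a horizontal fill (our own reading of Fig. 4(c)), and a left-to-right scanning direction is assumed.

```python
import cv2
import numpy as np

def refine_contour(contour: np.ndarray, shift: int = 30, k: int = 25) -> np.ndarray:
    """Refine a coarse occlusion contour (Section III-B), assuming a
    left-to-right scan: 1) shift the contour along the scanning direction,
    2) project the shifted contour toward the shift direction, 3) close the
    projection to remove spiky peaks, 4) mask the coarse contour with it."""
    shifted = np.zeros_like(contour, dtype=np.uint8)
    shifted[:, shift:] = contour[:, :-shift]                         # step 1
    projected = (np.cumsum(shifted, axis=1) > 0).astype(np.uint8)    # step 2
    kernel = np.ones((k, k), np.uint8)                               # 25x25 square element
    closed = cv2.morphologyEx(projected, cv2.MORPH_CLOSE, kernel)    # step 3
    return contour.astype(bool) & ~closed.astype(bool)               # step 4
```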

C. Geodesic Star Convexity based Object Segmentation

We utilize the Geodesic star convexity based image segmentation algorithm [3]. Essentially, it extends graph cut based segmentation with a Geodesic star convexity shape constraint. Graph cut [19] has been widely used in image segmentation by clustering image pixels into background and foreground regions. In order to achieve more meaningful segmentation, prior knowledge is usually desirable. In [3], Gulshan et al. proposed to model the object of interest with multiple convex stars in the graph cut segmentation approach and to replace the Euclidean distance between two pixels with the geodesic path length. Therefore, we explain this algorithm in three components: graph cut based image segmentation, the Geodesic star convexity prior, and object segmentation.

1) Graph cut based image segmentation: Segmenting an object from its background is generally formulated as a binary labelling problem. That is, each pixel in an image has to be assigned a label from the label set $L = \{0, 1\}$, where 0 and 1 stand for the background and the object respectively. Let $P$ be the set of all pixels in the image, and $N$ be the standard 4- or 8-connected neighbourhood on $P$, consisting of ordered pixel pairs $(p, q)$ where $p < q$. Let $f_p \in L$ be the label assigned to pixel $p$, and $f = \{f_p \mid p \in P\}$ be the collection of all label assignments. The energy function commonly used for segmentation is formulated as follows:

$$E(f) = \sum_{p \in P} D_p(f_p) + \lambda \sum_{(p, q) \in N} V_{pq}(f_p, f_q). \qquad (9)$$

The first term in Equation (9) is referred to as the regional or data term, which measures how well pixels match the object or background models. $D_p(f_p)$ is essentially the penalty for assigning label $f_p$ to pixel $p$: the more likely label $f_p$ is for $p$, the smaller $D_p(f_p)$ is.

The second term is called the boundary term because it incorporates the boundary constraints. A segmentation boundary occurs whenever two neighbouring pixels are assigned different labels. $V_{pq}(f_p, f_q)$ is the penalty for assigning labels $f_p$ and $f_q$ to neighbouring pixels.

The parameter $\lambda \geq 0$ weights the relative importance of the regional and boundary terms; a smaller $\lambda$ makes the regional terms more important.

Therefore, the goal of image segmentation is to optimize the energy function (Equation (9)). In the graph cut algorithm, an image is represented with a graph $G = (V, E)$, where the vertices $V$ denote pixels in the image and the edges $E$ connect each pixel with its neighbouring pixels. Each edge $e \in E$ connecting pixels $p$ and $q$ has a non-negative cost $w_e(p, q)$. A cut $C \subseteq E$ is a subset of edges such that, if $C$ is removed from $G$, then $V$ is partitioned into two disjoint sets $S$ and $T = V - S$, with the source terminal $s \in S$ and the sink terminal $t \in T$. The cost of cut $C$ is the sum of its edge weights, $|C| = \sum_{e \in C} w_e$. The minimum cut is the cut $C$ with the smallest cost. As indicated in [19], a graph can be constructed so that the labelling corresponding to the minimum cut is the one optimizing the energy function of Equation (9).
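As a concrete illustration of minimizing Equation (9) by min-cut, the sketch below uses the third-party PyMaxflow library (a Python wrapper around a Boykov-Kolmogorov-style max-flow solver). The unary arrays are assumed to be precomputed penalties (e.g. negative log-likelihoods under foreground/background color models), and the constant Potts pairwise weight is a simplification of $V_{pq}$, not the specific model of [3].

```python
import maxflow  # third-party PyMaxflow package
import numpy as np

def graph_cut_segment(d_fg: np.ndarray, d_bg: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Minimize E(f) = sum_p D_p(f_p) + lam * sum_(p,q) V_pq(f_p, f_q) by min-cut
    on a 4-connected grid with a constant boundary penalty.
    d_fg[p] = D_p(1): penalty for labelling p foreground; d_bg[p] = D_p(0)."""
    g = maxflow.Graph[float]()
    node_ids = g.add_grid_nodes(d_fg.shape)
    g.add_grid_edges(node_ids, lam)          # boundary term V_pq (Potts weight)
    g.add_grid_tedges(node_ids, d_bg, d_fg)  # terminal edges carry the data terms
    g.maxflow()
    # A pixel left on the source side pays its d_fg edge in the cut,
    # so the source segment corresponds to the foreground labelling.
    return ~g.get_grid_segments(node_ids)
```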

Fig. 6. Illustration of interactive image segmentation. (a) “Seeding” inputs, where the background seeds are highlighted in red and the foreground seeds in white. (b) Segmentation results.

2) Geodesic star convexity prior: The Geodesic star convexity based image segmentation approach [3] is a direct extension of Veksler's work [20], which incorporated only single-star convexity. It adds the following shape constraint term $S_{pq}$ ($(p, q) \in N$) as a third term to Equation (9):

$$S_{pq}(f_p, f_q) = \begin{cases} 0 & \text{if } f_p = f_q, \\ \infty & \text{if } f_p = 1 \text{ and } f_q = 0, \\ \beta & \text{if } f_p = 0 \text{ and } f_q = 1, \end{cases} \qquad (10)$$

where $\beta$ is a bias parameter.
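For clarity, here is Equation (10) transcribed as a function (a minimal sketch, not the authors' code); following [3], pixel p is taken to precede q on the geodesic path from a star centre, and the function name is ours.

```python
import math

def shape_constraint(fp: int, fq: int, beta: float) -> float:
    """S_pq from Eq. (10); labels: 1 = foreground, 0 = background;
    p precedes q on the path from the star centre."""
    if fp == fq:
        return 0.0           # consistent labels cost nothing
    if fp == 1 and fq == 0:
        return math.inf      # foreground -> background along the path breaks star convexity
    return beta              # background -> foreground incurs the bias beta
```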

In [3], while extending the single-star convexity, Gulshan et al. also replaced the Euclidean distance with the geodesic distance for the shortest path between two points $a$ and $b$:

$$d_g(a, b) = \min_{\Gamma \in \mathcal{P}_{a,b}} L(\Gamma), \qquad (11)$$

$$\Gamma_{a,b} = \arg\min_{\Gamma \in \mathcal{P}_{a,b}} L(\Gamma), \qquad (12)$$

where $\mathcal{P}_{a,b}$ denotes the set of all discrete paths between two grid points $a$ and $b$, and $L(\Gamma)$ is the length of a discrete path $\Gamma$ with $n$ pixels given by $\{\Gamma_1, \Gamma_2, \ldots, \Gamma_n\}$.

The path length $L(\Gamma)$ is defined as follows:

$$L(\Gamma) = \sum_{i=1}^{n-1} \sqrt{(1 - \gamma_g)\, d(\Gamma_i, \Gamma_{i+1})^2 + \gamma_g \|\nabla I(\Gamma_i)\|^2}, \qquad (13)$$

where $d(\Gamma_i, \Gamma_{i+1})$ is the Euclidean distance between successive pixels, and $\|\nabla I(\Gamma_i)\|^2$ is a finite difference approximation of the image gradient between the points $(\Gamma_i, \Gamma_{i+1})$. The parameter $\gamma_g$ weights the Euclidean distance against the geodesic (gradient) component.
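The geodesic distances of Equations (11)-(13) can be computed with Dijkstra's algorithm on the pixel grid, weighting each step by the integrand of Equation (13). Below is a minimal sketch on a 4-connected grid (so $d(\Gamma_i, \Gamma_{i+1}) = 1$); the precomputed gradient magnitudes and the default $\gamma_g$ are assumptions.

```python
import heapq
import numpy as np

def geodesic_distance(grad_mag: np.ndarray, seeds, gamma_g: float = 0.5) -> np.ndarray:
    """Dijkstra on the pixel grid with the step cost of Eq. (13):
    sqrt((1 - gamma_g) * d^2 + gamma_g * |grad I|^2), d = 1 on a 4-connected grid.
    seeds: iterable of (row, col) seed pixels with distance zero."""
    h, w = grad_mag.shape
    dist = np.full((h, w), np.inf)
    heap = []
    for (y, x) in seeds:
        dist[y, x] = 0.0
        heapq.heappush(heap, (0.0, y, x))
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                       # stale heap entry
        step = np.sqrt((1 - gamma_g) + gamma_g * grad_mag[y, x] ** 2)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and d + step < dist[ny, nx]:
                dist[ny, nx] = d + step
                heapq.heappush(heap, (d + step, ny, nx))
    return dist
```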

Finally, a global minimum of the revised version of Equation (9) (with the shape constraint of Equation (10)) can be obtained in a way similar to the solution of [19].

3) Object segmentation: Since Geodesic star convexity based interactive image object segmentation requires user inputs indicating foreground and background regions, we need to derive such “seeding” inputs from the refined forward and backward contours. In general, foreground objects reside between the two contours. Therefore, we obtain foreground seeding inputs by extending the refined contours of both the forward and backward occlusion maps toward the inner region between the two contours. Similarly, the background seeding inputs are obtained by extending the refined contours of the two occlusion maps in the opposite direction, as sketched below.
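A minimal sketch of this seed derivation, under the assumption that the forward contour lies on the left side of the object and the backward contour on the right (the scan directions give this orientation); the extension distance k and all names are ours.

```python
import numpy as np

def shift_cols(mask: np.ndarray, k: int) -> np.ndarray:
    """Shift a binary mask k columns right (k > 0) or left (k < 0), zero-filled."""
    out = np.zeros_like(mask)
    if k > 0:
        out[:, k:] = mask[:, :-k]
    elif k < 0:
        out[:, :k] = mask[:, -k:]
    else:
        out = mask.copy()
    return out

# Assuming the forward contour bounds the object on the left and the
# backward contour on the right:
# fg_seeds = shift_cols(fwd_contour, +k) | shift_cols(bwd_contour, -k)  # toward the interior
# bg_seeds = shift_cols(fwd_contour, -k) | shift_cols(bwd_contour, +k)  # away from the object
```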

As shown in Fig. 6(a), red regions indicate the background regions and the foreground objects are highlighted in white. The final segmentation result is shown in Fig. 6(b).

Fig. 7. “Waving-Arm” example with the method proposed in [2]. (a) and (b) are the segmentation results of the images shown in Figs. 1(a) and (b), respectively.

IV. EXPERIMENTS AND DISCUSSIONS

We conducted experiments with five challenging videos, as shown in Table I, to evaluate our proposed approach. The first two are from [2] and the other three were downloaded from video sharing websites¹.

The experimental results are organized to demonstrate the performance of our proposed approach on segmenting partially moving objects and fully moving objects, as well as to discover its limitations.

TABLE I
THE STATISTICS OF THE TESTED SEQUENCES

Sequence   | Waving-Arm | Walking Man | Talking Man | Talking Woman | Singing Boy
Frames     | 71         | 150         | 130         | 120           | 82
Resolution | 960×540    | 720×576     | 638×360     | 512×288       | 512×288

A. Partially Moving Object

First, we compare with Zhang's method (i.e. the method proposed in [2]). As shown in Fig. 7, Zhang's method is able to segment only the moving arm, instead of the whole person, since it relies on motion information. By using the occlusion map, our method is able to segment the whole person, as shown in Fig. 6(b).

It is also noticeable that our proposed approach is faster than Zhang's method. On a desktop PC with a 2.1 GHz CPU and 4096 MB memory, our approach takes about 200 seconds to complete object segmentation for one frame of the video “Waving-Arm”, while Zhang's method takes about 180 seconds for depth estimation alone. In addition, Zhang's method requires a color model prior of the objects to be segmented.

We also performed segmentation on more challenging web videos where the object motion is not very significant. As shown in Fig. 8, where a talking man hardly moves his body except for talking and the motion is generated by the camera, both the backward and forward occlusion maps (the middle row) are accurately identified. Therefore, the whole person is well segmented. Similarly, as shown in Fig. 9, the woman is well segmented, though she only slightly turns her head. Note that, due to very small camera and body motion, the occlusion map does not cover the shoulder part very well. However, the Geodesic star convexity based segmentation method [3] is still capable of achieving very good segmentation results with incomplete “seeding” inputs.

¹http://www.tudou.com/programs/view/Uj0u6AxTSck/
http://www.tudou.com/programs/view/A4Q4gYzEPFQ/
http://v.youku.com/v_show/id_XMzEyNzk5MDQ0.html

Fig. 8. Segmentation result of the partially moving “Talking Man”.

Fig. 9. Segmentation result of the partially moving “Talking Woman”.

B. Fully Moving Object

We also conducted experiments on videos with a fully moving object in order to demonstrate that our approach is generally applicable to hand-held camera videos. As shown in Fig. 10, the walking person is well segmented.

C. Limitations

However, our approach also has the following limitations due to its dependence on the occlusion map. First, if no occlusion regions are generated by camera motion or object motion, our approach will not work. For example, in the “Talking Man” example, if the camera did not move, there would be no occlusion regions at all, since the body does not move much either. Similarly, in the “Talking Woman” example, our approach would fail if she did not turn her head. Second, if the foreground object has a similar color to the background, some background regions will be included in the foreground objects, owing to the limited discriminative capability of the Geodesic star convexity based segmentation algorithm. As shown in Fig. 11, where a teenager poses in front of a camera, the white region of the chair is also extracted due to its color similarity to the white T-shirt he wears, even though the occlusion maps (shown in the middle row) are well extracted. Such a problem could be remedied if temporal consistency were exploited, since the motion pattern of the person differs from that of the chair.

Fig. 10. Segmentation result of a fully moving object with the “Walking Man” video used in [2].

Fig. 11. Illustration of the limitations with the “Singing Boy” video.

V. CONCLUSION AND FUTURE WORK

Taking a different perspective on video object segmentation, we have presented a novel method that exploits the occlusion map generated by camera motion and leverages the advances in interactive image segmentation. As a result, our proposed algorithm is able to fully segment partially moving objects in a video captured by a hand-held camera. In order to obtain accurate “seeding” inputs for the interactive image segmentation algorithm, both backward and forward occlusion maps are extracted and utilized, and morphological operators are employed to refine the extracted occlusion maps. Experimental results on a number of challenging scenarios demonstrate the effectiveness and superiority of our method over the state-of-the-art.

ACKNOWLEDGMENT

This research was supported by the ARC (Australian Research Council). The authors would like to thank Dr. G. Zhang for his help.

REFERENCES

[1] J. Sun, W. Zhang, X. Tang, and H. Shum, “Background cut,” in European Conference on Computer Vision, 2006, pp. 628–641.

[2] G. Zhang, J. Jia, W. Hua, and H. Bao, “Robust bilayer segmentation and motion/depth estimation with a handheld camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 603–617, 2011.

[3] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman, “Geodesic star convexity for interactive image segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 3129–3136.

[4] A. Ayvaci, M. Raptis, and S. Soatto, “Sparse occlusion detection with optical flow,” International Journal of Computer Vision, vol. 97, pp. 322–338, 2012.

[5] Y. Li, J. Sun, and H.-Y. Shum, “Video object cut and paste,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 595–600, 2005.

[6] M. Leung and Y. Yang, “Human body motion segmentation in a complex scene,” Pattern Recognition, vol. 20, no. 1, pp. 55–64, 1987.

[7] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” in European Conference on Computer Vision, 2000, pp. 751–767.

[8] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, “Bilayer segmentation of binocular stereo video,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2005, pp. 407–414.

[9] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov, “Bilayer segmentation of live video,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2006, pp. 53–60.

[10] P. Yin, A. Criminisi, J. Winn, and I. Essa, “Tree-based classifiers for bilayer video segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2007.

[11] J. Shao, Z. Jia, Z. Li, F. Liu, J. Zhao, and P. Peng, “Spatiotemporal energy modeling for foreground segmentation in multiple object tracking,” in IEEE International Conference on Robotics and Automation, 2011.

[12] T. Zhao, R. Nevatia, and B. Wu, “Segmentation and tracking of multiple humans in crowded environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1198–1211, 2008.

[13] J. Zhong and S. Sclaroff, “Segmenting foreground objects from a dynamic textured background via a robust Kalman filter,” in IEEE International Conference on Computer Vision, 2003, pp. 44–50.

[14] V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 171–177, 2010.

[15] D. Mitzel, E. Horbert, A. Ess, and B. Leibe, “Multi-person tracking with sparse detection and continuous segmentation,” in European Conference on Computer Vision, 2010, pp. 397–410.

[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 886–893.

[17] G. Zhang, J. Jia, T. Wong, and H. Bao, “Consistent depth maps recovery from a video sequence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 6, pp. 974–988, 2009.

[18] E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted l1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008.

[19] Y. Boykov and M. P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation,” in IEEE International Conference on Computer Vision, vol. 1, 2001, pp. 105–112.

[20] O. Veksler, “Star shape prior for graph-cut image segmentation,” in European Conference on Computer Vision, 2008, pp. 454–467.