
Video Object Segmentation by Hierarchical Localized Classification of Regions

Chenguang Zhang, Haizhou Ai
Dept. of Computer Science and Technology

Tsinghua University, Beijing, P.R. China

[email protected], [email protected]

Abstract—Video Object Segmentation (VOS) aims to cut out a selected object from video sequences, where the main difficulties are shape deformation, appearance variations and background clutter. To cope with these difficulties, we propose a novel method, named Hierarchical Localized Classification of Regions (HLCR). We suggest that appearance models, as well as the spatial and temporal coherence between frames, are the keys to breaking through this bottleneck. Locally, in order to identify foreground regions, we propose Hierarchical Localized Classifiers, which organize regional features as decision trees. Globally, we adopt Gaussian Mixture Color Models (GMMs). After integrating the local and global results into a probability mask, we achieve the final segmentation result by graph cut. Experiments on various challenging video sequences demonstrate the efficiency and adaptability of the proposed method.

Index Terms—video object segmentation, classification, tracking, graph cut

I. INTRODUCTION

In computer vision, Video Object Segmentation (VOS) is an attractive task with many applications, such as video editing, video composition, object recognition, etc. Generally, a VOS system faces two basic problems in computer vision: object tracking and segmentation. There are numerous algorithms for object tracking [1], such as mean shift [2], particle filter [3], online boosting [4], random forest [5], etc. There is also a great deal of work on object segmentation, such as level set methods [6], graph cut [7] and grab cut [8].

It is well known that, for a VOS system, dealing with general video sequences is an extremely challenging objective, due to appearance variations, irregular motion and background clutter. On the basis of object tracking and segmentation, various approaches have been proposed for VOS in recent years. Li et al. [9] directly extend the traditional graph cut [7] algorithm from 2D images to 3D image sequences, and optimize a global energy function to yield the segmentation result. Apart from its heavy reliance on Gaussian mixture color models, this 3D graph cut method is quite time-consuming and does not allow user interaction. Afterward, localized color and shape models were introduced by Bai et al. [10] in the Video SnapCut system, which shows increased discriminative ability and proves to be more efficient. However, due to unexpected optical flow errors when the object is occluded by itself or by others, it is not reliable to perform classification on the object boundary and shift local windows. An alternative method by Brendel et al. [11], which focuses on tracking regions across frames, is attractive for its computational benefit and spatial-temporal coherence. However, because it can fail to match the contours of regions, this method lacks the ability to deal with the complex deformations of non-rigid objects. Meanwhile, Niebles et al. [12] demonstrate how to combine model-based information (e.g. part-based detection results for humans) and appearance approaches to extract human body regions. Nevertheless, for general objects, high performance detectors are usually not available, which limits the generalization of that method.

Inspired by previous work on localized windows [10] and tracking regions [11], we propose a novel method, named Hierarchical Localized Classification of Regions (HLCR), for video object segmentation. The main contribution of our approach is to overcome the limitations of directly shifting local windows and of unreliable region tracking, by taking the spatial-temporal relationship between corresponding regions in neighboring frames as the inference strategy.

The rest of this paper is organized as follows. In Section II, we first give a formulation and then a brief overview of our system. Section III introduces the whole pipeline of our approach. Experimental results on different video sequences are presented in Section IV. Finally, in Section V, we offer a conclusion, followed by a discussion of future work.

II. PROBLEM FORMULATION AND SYSTEM OVERVIEW

Given an input video sequence I = {I_0, I_1, ..., I_{N-1}}, the VOS system is initialized by a selected key frame I_k with known foreground mask F(I_k). The output of a typical VOS system is the foreground mask M(I_t) for each frame I_t.

Taking the foreground mask in a particular frame as input, as illustrated in Fig. 1, our system is designed to generate the foreground mask in the next frame. With the help of the Regional Back-Track Method for motion estimation, we assign regions to a series of Hierarchical Localized Classifiers to predict potential foreground and background regions locally. Combining the classification result with Gaussian Mixture Color Models (GMMs), we produce a probability mask, followed by an optimization based on the mask to yield the final segmentation result with the graph cut [7] algorithm.

Fig. 1. Outline of our approach

III. OUR APPROACH

First of all, the initial foreground mask F(I_k) is provided by the user. Since video frames are spatial-temporally cohesive, we can propagate the foreground mask between neighboring frames. From the reference frame (Fig. 2(a)) to the target frame (Fig. 2(b)), propagation is feasible in both directions. Without loss of generality, the following analysis only explains the forward direction, from frame I_t to frame I_{t+1}. Naturally, using the selected key frame I_k as the first reference frame and repeatedly applying this propagation procedure, we can obtain foreground masks in all frames.

For computational benefit as well as distinctiveness and robustness, each frame is over-segmented into SLIC superpixels [13], which converts the original pixel-connected graph G_P (Fig. 2(b)) into a region-connected graph G_R (Fig. 2(c)).
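As an illustration of this step, the sketch below over-segments a frame with the SLIC implementation from scikit-image, used here as a stand-in for the code of [13]; the file name and parameter values are assumptions, not the paper's settings.

```python
# A minimal sketch of the over-segmentation step using scikit-image's SLIC.
import numpy as np
from skimage import io
from skimage.segmentation import slic

frame = io.imread("frame_t.png")  # hypothetical H x W x 3 RGB frame
labels = slic(frame, n_segments=400, compactness=10.0, start_label=0)

# Each superpixel becomes a node of the region-connected graph G_R; the mean
# color per region is a cheap regional descriptor for the later stages.
num_regions = int(labels.max()) + 1
mean_colors = np.array([frame[labels == r].mean(axis=0) for r in range(num_regions)])
```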

Fig. 2. An example of the procedure of processing a single frame: (a) Frame I_t; (b) Frame I_{t+1}; (c) SLIC Regions; (d) Optical Flow; (e) Classification; (f) GMMs Prob.; (g) Graph Prob.; (h) Seg Result.

 A. Regional Back-Track Method 

For a region in frame t+1, the Regional Back-Track Method is introduced to find the best matching region in frame t, and to determine whether the two regions essentially correspond.

There is no doubt that pixel-level optical flow (Fig. 2(d)) is not reliable when heavy occlusion happens. Although it is claimed in [10] that the flow-averaging approach in a local window can generate more robust results, it still produces meaningless motion vectors when there are no truly “matched” regions. Based on this observation, we suggest that a reliable region tracking method should not only be insensitive to minor optical flow errors, but also judge whether the matched regions essentially correspond. For an arbitrary region R_a in frame t+1, the Regional Back-Track Method is defined as

$$\mathrm{BackTrack}(R_a) = \min_{\|c_{R_a} - v_{R_a} - c_{R_b}\| \le \delta} \mathrm{Diff}(R_a, R_b) \qquad (1)$$

where R_b is a region in frame t, c_{R_a} denotes the center of region R_a, v_{R_a} denotes the motion vector averaged over all pixels in region R_a, and Diff(R_a, R_b) denotes the difference between regions R_a and R_b. Obviously, a larger δ is more robust to optical flow errors but more likely to introduce mistaken regions. On the other hand, δ is highly related to the radius r_{R_a}, since the centers of large regions drift more easily than those of small ones. Consequently, in our experiments, δ is set to r_{R_a} and Diff(R_a, R_b) is set to the Euclidean distance between the mean colors of the two regions.

A key issue of the Regional Back-Track Method is how to convert Diff(R_a, R_b) into a binary decision. Traditional methods, such as selecting a global threshold or using a Chi-square test, are tricky and unstable. Here, inspired by the Statistical Region Merging method [14], we choose the independent bounded difference inequality as the decision function (treating each pixel in R_a as a bounded independent random variable). The resulting predicate is

$$B(R_a, R_b) = \begin{cases} 1 & \text{if } |\bar{R}_a - \bar{R}_b| \le \sqrt{b^2(R_a) + b^2(R_b)} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where \bar{R} denotes the mean color of region R and b(R) is the per-region bound given by the inequality [14].

To summarize, for an arbitrary region R_a in frame t+1, the Regional Back-Track Method provides the best matching region R_b in frame t if the two essentially correspond. Otherwise, the method marks R_a as a “mismatched” region.
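For concreteness, the sketch below implements this matching under the assumption that each region is summarized by a dict with its center, radius, mean color, and averaged motion vector; b(R) abstracts the per-region bound of the inequality [14], and all names are illustrative.

```python
# A minimal sketch of the Regional Back-Track Method (Eqs. (1)-(2)).
import numpy as np

def back_track(Ra, regions_t, b):
    """Return the best matching region of Ra in frame t, or None if mismatched."""
    target = Ra["center"] - Ra["motion"]   # back-tracked center c_Ra - v_Ra
    delta = Ra["radius"]                   # delta is set to the region radius r_Ra
    best, best_diff = None, np.inf
    for Rb in regions_t:
        if np.linalg.norm(target - Rb["center"]) <= delta:
            # Diff(Ra, Rb): Euclidean distance between mean colors
            diff = np.linalg.norm(Ra["mean_color"] - Rb["mean_color"])
            if diff < best_diff:
                best, best_diff = Rb, diff
    # Predicate (2): accept only if the difference is within the bound.
    if best is not None and best_diff <= np.sqrt(b(Ra) ** 2 + b(best) ** 2):
        return best
    return None                            # Ra is marked as "mismatched"
```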

 B. Hierarchical Localized Classifiers

In this section, we introduce Hierarchical Localized Classifiers to evaluate the probability that a region in frame t+1 belongs to the foreground.

Localized classifiers for a VOS system were introduced in the Video SnapCut system [10], in which a series of overlapping local windows of fixed size is created along the foreground boundary and then propagated through frames. However, due to large boundary variations and local window drift, that method is limited when facing topology changes. In addition, since the size of the local windows is fixed, it sacrifices the ability to benefit from multi-scale space. To overcome these limitations, we propose a new solution called Hierarchical Localized Classifiers.

Given a foreground mask M(I_t) and the corresponding foreground bounding box B(I_t) in reference frame t, we define a potential searching box S(I_t) by extending B(I_t) by a fixed ratio β (β = 0.3 in our experiments), using the following equations:

$$\begin{aligned} \mathrm{center}(S(I_t)) &= \mathrm{center}(B(I_t)) \\ \mathrm{height}(S(I_t)) &= (1 + \beta)\,\mathrm{height}(B(I_t)) \\ \mathrm{width}(S(I_t)) &= (1 + \beta)\,\mathrm{width}(B(I_t)) \end{aligned} \qquad (3)$$
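In code, Eq. (3) is a one-line expansion of the bounding box; the (cx, cy, w, h) box representation below is an assumption.

```python
# A small sketch of Eq. (3): keep the center, scale width and height by (1 + beta).
def searching_box(bbox, beta=0.3):
    cx, cy, w, h = bbox
    return (cx, cy, (1.0 + beta) * w, (1.0 + beta) * h)
```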

Next, we build a hierarchical quad-tree structure by splitting the searching box S(I_t), in which each tree node corresponds to a local window. The partition rules are shown in Fig. 3. Then, we generate a localized classifier L(W_i) for each window W_i, trained on all inner regions that have already been labeled as foreground or background according to the foreground mask M(I_t). Here, we build a multi-dimensional feature vector f(R) = (r, g, b, y, u, v, c_x, c_y) for region R, where (r, g, b, y, u, v) denotes the average value of all pixels in region R in the RGB and YUV color spaces and (c_x, c_y) denotes the center of region R. If W_i contains both foreground and background regions, we use a decision tree for classification. Otherwise, the localized classifier L(W_i) degenerates into a constant function (returning 1 if the window contains only foreground regions, and 0 otherwise).

Fig. 3. Hierarchical Localized Classifiers based on quad-tree partition. If a local window is larger than a fixed size λ and contains both foreground and background regions, e.g. W_i, we split it into four sub-windows. Otherwise, the partition terminates and the window becomes a leaf node, e.g. W_j. For each window W_i, a localized classifier L(W_i) is trained on all the inside regions.
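To make the construction concrete, the following sketch builds the quad-tree of Fig. 3 under some assumptions: regions is a list of (feature_vector, label, center) tuples for the labeled regions of the reference frame, scikit-learn's DecisionTreeClassifier stands in for the paper's decision trees, and lam is the minimum window size λ.

```python
# A sketch under the assumptions stated above; hierarchy and training only.
from sklearn.tree import DecisionTreeClassifier

class Window:
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.children = []
        self.clf = None    # decision tree L(W_i) for mixed windows
        self.const = None  # constant response for pure windows

    def contains(self, cx, cy):
        return self.x <= cx < self.x + self.w and self.y <= cy < self.y + self.h

def build(win, regions, lam):
    inside = [(f, l) for (f, l, c) in regions if win.contains(*c)]
    labels = {l for _, l in inside}
    if len(labels) <= 1:  # pure (or empty) window: degenerate constant classifier
        win.const = labels.pop() if labels else 0
        return win
    X = [f for f, _ in inside]
    y = [l for _, l in inside]
    win.clf = DecisionTreeClassifier().fit(X, y)  # localized classifier L(W_i)
    if win.w > lam and win.h > lam:               # split large, mixed windows
        hw, hh = win.w / 2.0, win.h / 2.0
        for dx in (0.0, hw):
            for dy in (0.0, hh):
                win.children.append(build(Window(win.x + dx, win.y + dy, hw, hh),
                                          regions, lam))
    return win
```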

As for prediction, instead of shifting local windows, we prefer to assign each region R_a in frame t+1 to a series of windows {W_{i_0}, W_{i_1}, ..., W_{i_{n-1}}} in frame t. Recall the Regional Back-Track Method introduced in Section III-A: assuming we have found the best matching region R_b in frame t (if not, we discuss how to handle the mismatched R_a later, in Section III-C), R_b is covered by a unique leaf node of the quad-tree partition. Tracing back through all the ancestor nodes in the quad-tree, we obtain the series of windows {W_{i_0}, W_{i_1}, ..., W_{i_{n-1}}}. For each window W_{i_k}, we use the pre-trained localized classifier L(W_{i_k}) to predict whether R_a belongs to the foreground. (Note that here we use (r, g, b, y, u, v, c_x - v^x_{R_a}, c_y - v^y_{R_a}) as the feature vector, where (v^x_{R_a}, v^y_{R_a}) is the averaged motion vector of R_a.)

To produce the final classification result q_{R_a}, we integrate the localized classifiers using

$$q_{R_a} = \frac{\sum_{k=0}^{n-1} \omega_k q_k}{\sum_{k=0}^{n-1} \omega_k} \qquad (4)$$

where q_k denotes the binary prediction of L(W_{i_k}) and ω_k denotes the weight of classifier L(W_{i_k}). Obviously, classifiers with high confidence should be weighted more than those with low confidence. Therefore, in our experiments, the classification ratio on the training set is used as ω_k.

In summary, for an arbitrary region R_a in frame t+1 that finds a corresponding region R_b in frame t, the Hierarchical Localized Classifiers make an integrated prediction of the probability that R_a will be included in the foreground mask.
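Continuing the previous sketch, Eq. (4) then amounts to a weighted vote over the windows on the path from the root to the leaf covering R_b. Here score_on_train is a hypothetical attribute holding each classifier's classification ratio on its training set (e.g. assigned via clf.score(X, y) at training time), and pure windows are given a weight of 1 by assumption.

```python
# A sketch of Eq. (4), assuming `path_windows` is the root-to-leaf chain of
# Window objects from the previous sketch and `feature` is the motion-
# compensated feature vector of R_a.
def integrated_prediction(path_windows, feature):
    num, den = 0.0, 0.0
    for win in path_windows:
        if win.const is not None:           # degenerate constant classifier
            qk, wk = float(win.const), 1.0  # assumed weight for pure windows
        else:
            qk = float(win.clf.predict([feature])[0])  # binary prediction q_k
            wk = win.clf.score_on_train                # hypothetical weight w_k
        num += wk * qk
        den += wk
    return num / den                         # q_Ra in [0, 1]
```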

C. Combined Probability Mask and Iterative Refinement 

The Combined Probability Mask is introduced to integrate the localized classification result with the global GMMs. As a result, we can use the graph cut algorithm to optimize the segmentation result.

For the graph cut method, we need to optimize the following energy function:

$$E = \lambda \sum_i E_d(R_i) + \sum_{i \ne j} E_c(R_i, R_j) \qquad (5)$$

where E_d(R_i) is the data energy and E_c(R_i, R_j) is the regional connection energy. In our framework, E_c(R_i, R_j) is the color difference between regions R_i and R_j, the same as in the traditional graph cut method [7], and E_d(R_i) is the combined probability of the global Gaussian Mixture Color Models (GMMs) and the Hierarchical Localized Classifier predictions, as described below.

GMMs are widely used in segmentation and tracking tasks and turn out to be quite effective. In our system, both the foreground and background GMMs are acquired by clustering regions in the reference frame t according to the given mask. Note that directly updating the foreground GMMs is very risky. Considering that the initial foreground mask provided by the user in the key frame is extremely important, we suggest that a combination of the foreground in the initial key frame and in the reference frame is necessary. In general, although the discriminative ability of the Hierarchical Localized Classifiers is better than that of the GMMs, they may suffer from over-fitting and are incapable of handling the mismatched regions of Section III-A. Consequently, we combine these two responses to generate a more reliable foreground probability p(R_a), using the formulas shown below.

1) If R_a has a corresponding region R_b in frame t, then

$$p(R_a) = \frac{q_{fg}(R_a) \cdot q_{R_a}}{q_{fg}(R_a) \cdot q_{R_a} + q_{bg}(R_a) \cdot (1 - q_{R_a})} \qquad (6)$$

2) Otherwise, R_a is mismatched. Since q_{R_a} is not available, we have

$$p(R_a) = \frac{q_{fg}(R_a)}{q_{fg}(R_a) + q_{bg}(R_a)} \qquad (7)$$

where q_{fg}(R_a) is the probability that R_a belongs to the foreground GMMs, q_{bg}(R_a) is the probability that R_a belongs to the background GMMs, and q_{R_a} is the classification response from Section III-B.
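A direct transcription of Eqs. (6) and (7) is short; here q_fg and q_bg are the foreground and background GMM likelihoods of R_a (e.g. from sklearn.mixture.GaussianMixture), and q_Ra is None for mismatched regions.

```python
# A minimal sketch of the combined foreground probability p(R_a).
def combined_probability(q_fg, q_bg, q_Ra=None, eps=1e-12):
    if q_Ra is None:                       # Eq. (7): mismatched region
        return q_fg / max(q_fg + q_bg, eps)
    num = q_fg * q_Ra                      # Eq. (6): fuse local and global cues
    return num / max(num + q_bg * (1.0 - q_Ra), eps)
```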

Given the combined probability p(R_a) as the data energy E_d(R_i), we can solve this two-label graph cut problem through the max-flow method. However, since complex videos often contain unexpected noise, the combined probability p(R_a) may drift in a few regions. Therefore, we apply an iterative refinement to the graph cut result, as follows (see the sketch after this list).
1) Perform graph cut based on the combined probability p(R_a) to get the foreground regions.
2) Perform max-connected-component detection on the foreground regions to filter out falsely alarmed regions.
3) Update the foreground and background GMMs and the combined probability p(R_a). Repeat steps 1) and 2) until convergence.
In our experiments, repeating only 2 or 3 times, the iterative refinement produces a convincing result.
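A minimal sketch of this refinement loop is shown below, using the PyMaxflow library as the max-flow solver. The negative-log conversion of p(R_a) into t-link capacities, the edge-list format, and the update_gmms_and_p helper are illustrative assumptions rather than the paper's exact implementation, and the Step 2 filtering is only indicated by a comment.

```python
# A sketch under the assumptions stated above. `p` is an array of combined
# probabilities p(R_a), `edges` lists (i, j, cost) region adjacencies whose
# cost is the color difference E_c, and `update_gmms_and_p` stands in for
# Step 3 (re-fitting the GMMs and recomputing p).
import numpy as np
import maxflow  # PyMaxflow

def refine(p, edges, update_gmms_and_p, lam=1.0, iterations=3):
    for _ in range(iterations):
        g = maxflow.Graph[float]()
        nodes = g.add_nodes(len(p))
        eps = 1e-6
        for i, pi in enumerate(p):
            # Data term E_d as t-links: source = foreground, sink = background.
            g.add_tedge(nodes[i], -lam * np.log(max(1.0 - pi, eps)),
                        -lam * np.log(max(pi, eps)))
        for i, j, cost in edges:
            # Connection term E_c as symmetric n-links between adjacent regions.
            g.add_edge(nodes[i], nodes[j], cost, cost)
        g.maxflow()
        fg = np.array([g.get_segment(nodes[i]) == 0 for i in range(len(p))])
        # Step 2 (max-connected-component filtering of `fg`) would go here.
        p = update_gmms_and_p(fg)  # Step 3: refresh GMMs and recompute p(R_a)
    return fg
```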

IV. EXPERIMENTS

Currently, since there are no standard datasets for video segmentation, the testing datasets in our experiments are collected from [15] and [12]. The first video clip is water-skiing from [15] (97 frames, 544 × 280). The second is diving from [15] (179 frames, 880 × 488). The third is skating from [15] (573 frames, 552 × 310). The fourth is dancing from [12] (138 frames, 320 × 240). Note that these videos are very challenging in terms of dynamic camera, background clutter, motion blur, object shadows, etc.

We quantitatively analyze our approach on these test datasets. We randomly select 10 frames from each video clip for evaluation and manually label the true foreground. The metric is the standard F-Measure, defined as

$$F\text{-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$

where Precision is the probability that an auto-segmented foreground pixel is a true foreground pixel and Recall is the probability that a true foreground pixel is detected.
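For completeness, Eq. (8) can be computed from binary masks as follows (a small sketch; pred and truth are boolean NumPy arrays of the same shape).

```python
# A small sketch of Eq. (8) on binary masks.
import numpy as np

def f_measure(pred, truth):
    tp = np.logical_and(pred, truth).sum()
    precision = tp / max(pred.sum(), 1)   # auto-segmented fg pixels that are true fg
    recall = tp / max(truth.sum(), 1)     # true fg pixels that are detected
    return 2.0 * precision * recall / max(precision + recall, 1e-12)
```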

Since there is no available source code or executable binary for current VOS methods such as [10] and [11], we choose the Grab Cut [8] algorithm for comparison, where we draw foreground bounding boxes several times and select the best result for each frame. Table I summarizes the comparisons, from which we can see that our approach is much better than Grab Cut. Note that our method works very well when handling visually similar foreground and background (such as the dark legs against the black background in Fig. 4(d)), improving the F-Measure by as much as twenty percentage points. Some examples are shown in Fig. 4, which demonstrate that our method significantly improves the subjective quality of segmentation.

TABLE I
EXPERIMENTAL RESULTS

Video Clip     Method       Precision   Recall   F-Measure
Water-skiing   Grab Cut       0.753      0.911     0.836
               Our Method     0.938      0.849     0.891
               +/-            0.185     -0.062     0.067
Diving         Grab Cut       0.823      0.849     0.836
               Our Method     0.914      0.950     0.931
               +/-            0.091      0.101     0.096
Skating        Grab Cut       0.956      0.905     0.930
               Our Method     0.973      0.919     0.945
               +/-            0.017      0.014     0.015
Dancing        Grab Cut       0.873      0.620     0.725
               Our Method     0.946      0.947     0.947
               +/-            0.073      0.327     0.221

In terms of complexity, our method takes only about 300 milliseconds per frame on an Intel Core Quad 2.40 GHz CPU with 3 GB of memory. With the help of the initial labeled foreground mask and a reliable frame-by-frame inference strategy, our method can deal with very complex videos. Nevertheless, it fails when an unexpected sudden change of foreground appearance occurs.

V. CONCLUSION

In this paper, we propose a novel method that regards VOS as a problem of tracking and classifying regions in local windows. The Regional Back-Track Method, which is based on optical flow, is applied to track regions across frames. Hierarchical Localized Classifiers are introduced for the prediction of potential foreground regions. A combined probability mask based on the classification results and GMMs is used in the graph cut algorithm with iterative refinement, which produces reliable segmentation results. Experiments on various videos demonstrate its strong performance.

In the current version, we use only single-frame propagation, which may lead to unexpected drift in certain extreme scenarios. Although the foreground GMMs of the initial key frame are used as global constraints, which enhances the stability of our method, we believe that multi-frame propagation would benefit more from the spatial-temporal space. Another potential direction is extending this work to multi-object cutout, which has broader application prospects. We expect to investigate these issues in our future work.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation of China under grant No. 61075026.

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[2] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1197–1203.
[3] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, “An adaptive color-based particle filter,” Image Vision Comput., vol. 21, no. 1, pp. 99–110, 2003.
[4] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 260–267.
[5] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, “On-line random forests,” in IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393–1400.
[6] A.-R. Mansouri and J. Konrad, “Motion segmentation with level sets,” in IEEE International Conference on Image Processing, 1999, pp. 126–130.
[7] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images,” in IEEE International Conference on Computer Vision, 2001, pp. 105–112.
[8] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, pp. 309–314, 2004.
[9] Y. Li, J. Sun, and H.-Y. Shum, “Video object cut and paste,” ACM Transactions on Graphics, vol. 24, pp. 595–600, 2005.
[10] X. Bai, J. Wang, D. Simons, and G. Sapiro, “Video SnapCut: robust video object cutout using localized classifiers,” ACM Transactions on Graphics, vol. 28, 2009.
[11] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in IEEE International Conference on Computer Vision, 2009, pp. 833–840.
[12] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei, “Extracting moving people from internet videos,” in European Conference on Computer Vision, 2008, pp. 527–540.
[13] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels,” EPFL, Tech. Rep., Jun. 2010.
[14] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1452–1458, 2004.
[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph-based video segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141–2148.

Fig. 4. Experimental Results: (a) Water-skiing Sequence on Frames 27, 48, 57, 67; (b) Diving Sequence on Frames 35, 64, 83, 122; (c) Skating Sequence on Frames 12, 18, 63, 111; (d) Dancing Sequence on Frames 5, 20, 101, 130. From left to right, 1st row: Original Key Frame Image, Segmentation Results of Our Approach; 2nd row: Initial Labeled Foreground Mask, Segmentation Results of Grab Cut [8]. Please zoom in to check for more segmentation details.