
Towards Unsupervised Whole-Object Segmentation: Combining Automated Matting with Boundary Detection

Andrew N. Stein∗ Thomas S. Stepleton Martial Hebert
The Robotics Institute, Carnegie Mellon University

Pittsburgh, Pennsylvania 15213, USA
[email protected]

    Abstract

We propose a novel step toward the unsupervised segmentation of whole objects by combining "hints" of partial scene segmentation offered by multiple soft, binary mattes. These mattes are implied by a set of hypothesized object boundary fragments in the scene. Rather than trying to find or define a single "best" segmentation, we generate multiple segmentations of an image. This reflects contemporary methods for unsupervised object discovery from groups of images, and it allows us to define intuitive evaluation metrics for our sets of segmentations based on the accurate and parsimonious delineation of scene objects. Our proposed approach builds on recent advances in spectral clustering, image matting, and boundary detection. It is demonstrated qualitatively and quantitatively on a dataset of scenes and is suitable for current work in unsupervised object discovery without top-down knowledge.

    1. Introduction

It is well known that general scene segmentation is an ill-posed problem whose "correct" solution is largely dependent on application, if not completely subjective. Objective evaluation of segmentations is itself the subject of significant research (see [31] for a recent review). Here we consider instead the more specific problem of whole-object segmentation; i.e., our goal is to accurately and concisely segment the foreground objects or "things" without necessarily worrying about the background or "stuff" [1], and without the use of top-down object knowledge. As we will explain, we use hypothesized boundary fragments to suggest partial segmentation "hints" to achieve this goal. Once the objects of interest are defined (which admittedly could itself involve some subjectivity), it becomes somewhat easier to define natural and intuitive measures of segmentation quality on a per-object basis.

∗Partial support provided by National Science Foundation (NSF) Grant IIS-0713406.

[Figure 1 panels: Input Image → High-Probability Boundary Fragments → Segmentation "Hints" → Pairwise Affinities → Object Segmentation]

Figure 1. Starting with an image and hypothesized boundary fragments (only the highest-probability fragments are shown for clarity, though many more are used), we generate a large set of segmentation "hints" using automated matting. Combining information from those mattes into an affinity matrix, we can then generate a segmentation for each of the foreground objects in the scene.

Furthermore, segmentation results are intimately tied to the selection of the parameter controlling the granularity of the segmentation (e.g., the number of segments/clusters, or a bandwidth for kernel-based methods). Instead of seeking a single perfect segmentation, the integration of multiple segmentations obtained over a range of parameter settings has become common [9, 12, 17, 21, 29]. In such approaches, segmentation is treated as a mid-level processing step rather than an end goal. This reduces the pressure to obtain, or even define, the one "best" result, while also side-stepping the problem of parameter selection.

We are motivated by approaches such as [21], in which Russell et al. showed that it is possible to discover objects in an unsupervised fashion from a collection of images by relying on multiple segmentations. Systems using segmentation in this way, effectively as a proposal mechanism, should benefit from an underlying segmentation method which accurately and frequently delineates whole objects. Thus we will present a novel segmentation strategy designed to outperform existing methods in terms of two interrelated and intuitive measures of object segmentation quality.


First, we take into account accuracy in terms of pixel-wise, per-object agreement between segments. Second, we also consider conciseness in terms of the number of segments needed to capture an object.

Following the classical Gestalt view as well as Marr's theories, we recognize that a key cue in differentiating objects from each other and their background lies in the discovery of their boundaries. Therefore we use hypothesized fragments of boundaries as input for our segmentations. Certainly there exists substantial prior work in directly exploiting boundary reasoning for object/scene segmentation and figure-ground determination, e.g. [18] and references therein. Recently, however, Stein et al. [26] have argued for the importance of detecting physical boundaries due to object pose and 3D shape, rather than relying on more typical, purely appearance-based edges, which may arise due to other phenomena such as lighting or surface markings. They have demonstrated improved boundary detection using cues derived from a combination of appearance and motion, where the latter helps incorporate local occlusion information due to parallax from a dynamic scene/camera. We employ this method for generating the necessary boundary information for our approach.

Boundaries not only indicate the spatial extent of objects; they also suggest natural regions to use in modeling those objects' appearances. The pixels on either side of a boundary provide evidence which, though only local and imperfect, can offer a "hint" of the correct segmentation by indicating discriminative appearance information.

In practice, we use the output offered by recent α-matting approaches as an approximate "classification" which realizes these segmentation hints. And though the individual mattes only suggest binary discrimination (usually "foreground" vs. "background"), we can nevertheless segment arbitrary numbers of objects in the scene by utilizing the collective evidence offered by a large set of mattes.

Recently, there has been a surge of interest and impressive results in interactive matting [2, 4, 6, 11, 22, 28]. Using only a sparse set of user-specified constraints, usually in the form of "scribbles" or a "tri-map", these methods produce a soft foreground/background matte of the entire image. As initially demonstrated in [3], such methods can potentially be automated by replacing user-specified "scribbles" with constraints indicated by local occlusion information.

In the proposed approach, we use each hypothesized boundary fragment to provide the matting constraints. Based on the combination of the set of mattes suggested by all of the fragments, we then derive an affinity measure which is suitable for use with any subsequent clustering algorithm, e.g. Normalized Cuts (NCuts) [24]. Through this combination of individually-weaker sources of information (i.e. fragmented boundaries and impoverished, binary mattes), we relieve the pressure to extract the one "perfect" matte or the "true" set of extended boundary contours, and yet we are still able to segment multiple objects in the scene accurately and concisely.

Though beyond the scope of this work, it is our hope that such improved segmentation strategies will benefit methods which rely on multiple segmentations to generate, for example, object models from unlabeled data.

    2. Segmentation “Hints” via Multiple Mattes

As discussed in Section 1, we will use image matting to produce segmentation "hints" for generating affinities useful in segmentation via clustering. After first providing an overview of matting, we will explain how we use boundary fragments to imply multiple mattes for estimating these affinities.

2.1. α-Matting

In a standard matting approach, one assumes that each observed pixel in an image, I(x, y), is explained by the convex combination of two unknown colors, F and B. A soft weight, α, controls this combination:

I(x, y) = α(x, y) F(x, y) + (1 − α(x, y)) B(x, y)   (1)

We will use this model as a proxy for a classifier in our work. In this formulation, then, pixels with an α-value near one are likely part of the F "class", while those with an α-value near zero are likely part of the B "class". Values near 0.5 indicate mixed pixels whose "membership" may be considered unknown. Typically, F and B correspond to notions of "foreground" and "background", but we will explain in the next section that these semantic assignments are not necessary for our approach.

Simultaneously solving for F, B, and α in (1) is of course not feasible. In general, a user specifies a small set of pixels, often referred to as "scribbles" or a "tri-map", which are then constrained to belong to one class or the other. These hard constraints are then combined with assumptions about α (e.g., smoothness) to find a solution at the unspecified pixels [2, 4, 6, 11, 22, 28]. We have adopted the approach proposed by Levin et al. [11], which offers excellent results via a closed-form solution for α based on reasonable assumptions about the local distributions of color in natural images.

Note that methods also exist for constrained "hard" segmentations, e.g. [5], potentially with some soft matting at the boundaries enforced in post-processing [20], but as we will see, using fully-soft α-mattes allows us to exploit the uncertainty of mixed pixels (i.e. those with α values near 0.5) rather than arbitrarily (and erroneously) assigning them to one class or the other. In fact, we follow the conventional wisdom of avoiding early commitment throughout our approach. Hard thresholds or grouping decisions are avoided in favor of maintaining soft weights until the final segmentation procedure. In addition to retaining the maximum amount of information for the entire process, this methodology also avoids the many extra parameters required for typical ad hoc decisions or thresholds.
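To make the constrained-matting step concrete, the following is a minimal sketch, not the closed-form method of Levin et al. [11] itself: it substitutes a simple color-weighted graph Laplacian for their matting Laplacian as the smoothness prior and solves for α subject to hard scribble constraints. The function and parameter names (solve_matte, lam, sigma) are our own illustration.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_matte(image, fg_mask, bg_mask, lam=100.0, sigma=0.1):
    # Solve for a soft alpha-matte given hard F/B constraints. A 4-connected
    # graph Laplacian with color-dependent edge weights plays the role of
    # the smoothness prior, so alpha varies smoothly except across strong
    # color edges.
    h, w, _ = image.shape
    n = h * w
    colors = image.reshape(n, 3).astype(float)
    idx = np.arange(n).reshape(h, w)
    rows, cols, vals = [], [], []
    for i0, j0 in [(idx[:, :-1], idx[:, 1:]),    # horizontal neighbor pairs
                   (idx[:-1, :], idx[1:, :])]:   # vertical neighbor pairs
        i0, j0 = i0.ravel(), j0.ravel()
        wgt = np.exp(-((colors[i0] - colors[j0]) ** 2).sum(1) / (2 * sigma**2))
        rows += [i0, j0]; cols += [j0, i0]; vals += [wgt, wgt]
    W = sp.csr_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))), (n, n))
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # graph Laplacian
    # Hard constraints: alpha = 1 on F scribbles, 0 on B scribbles,
    # enforced by a large penalty lam.
    constrained = (fg_mask | bg_mask).ravel().astype(float)
    target = fg_mask.ravel().astype(float)
    D = sp.diags(constrained)
    alpha = spsolve((L + lam * D).tocsc(), lam * (D @ target))
    return alpha.reshape(h, w)

In our setting, fg_mask and bg_mask would correspond to the super-pixel regions flanking a boundary fragment, as described in Section 3.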

2.2. Multiple Mattes → Affinities

In an automated matting approach, the goal is to provide a good set of constraints without requiring human intervention. Since object boundaries separate two different objects by definition, they are natural indicators of potential constraint locations: F on one side and B on the other. In [3], T-junctions were used to suggest sparse constraints in a similar manner. The benefit of matting techniques here is their ability to propagate throughout the image the appearance information offered by such local, sparse constraints. The middle of Figure 1 depicts a sampling of mattes generated from the differing constraints (or "scribbles") implied by various boundary fragments.

A remaining problem is that the approach described thus far only offers a binary (albeit soft) decision about a pixel's membership: it must belong to either F or B, which are usually understood to represent foreground and background. How then can we deal with multi-layered scenes?

We recognize that the actual class membership of a particular pixel, as indicated by its α value in a single matte, is rather meaningless in isolation. What we wish to capture, however, is that locations with similar α values across many different mattes (whether both zero or one) are more likely to belong to the same object, while locations with consistently different values are more likely to be part of different objects. While existing methods, such as Intervening Contours [7, 8, 10], may attempt to perform this type of reasoning over short and medium ranges within the image using standard edge maps, our use of mattes simultaneously factors in the boundaries themselves as well as the global appearance discrimination they imply. Furthermore, matte values near 0.5 carry a notion of uncertainty about the relationship between locations.

Using each of the NF potential boundary fragments in the scene to generate an image-wide matte yields an NF-length vector, vi, of α-values at each pixel i. If we scale the α-values to be between +1 and −1 (instead of 0 and 1), such that zero now represents "don't know", then the agreement, or affinity, between two pixels i and j can be written in terms of the normalized correlation between their two scaled matting vectors:

Aij = viᵀ W vj / (|vi| |vj|),   (2)

where W is a diagonal matrix of weights corresponding to the confidence of each fragment actually being an object boundary. Thus mattes derived from fragments less likely to be object boundaries will not significantly affect the final affinity. Note that this combines all hypothesized fragments in a soft manner, avoiding the need to choose some ideal subset, e.g., via thresholding. Figure 2 provides an example schematic describing the overall process of employing boundary fragments to suggest mattes, which in turn generate an affinity matrix.

The value of Aij will be maximized when the matting vectors for a pair of pixels usually put them in the same class. When a pair's vectors usually put the two pixels in opposite classes, the affinity will be minimized. The normalization effectively discounts the "don't know" (near-zero) entries which arise from mattes that do not provide strong evidence for one or both pixels of the pair.
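As a concrete sketch of (2), assuming the mattes have been sampled at each of the P image elements (pixels or super-pixels) and stacked into an NF × P array, the full affinity matrix can be computed at once; the names below (matting_affinities, boundary_probs) are illustrative, not from the paper.

import numpy as np

def matting_affinities(mattes, boundary_probs):
    # mattes: (N_F, P) alpha values in [0, 1], one row per fragment's matte.
    # boundary_probs: (N_F,) confidence that each fragment is a true object
    # boundary, forming the diagonal weight matrix W in Eq. (2).
    V = 2.0 * mattes - 1.0                    # rescale to [-1, +1]; 0 = "don't know"
    norms = np.linalg.norm(V, axis=0)         # |v_i| for each element i
    norms = np.maximum(norms, 1e-12)          # guard against all-zero vectors
    A = (V * boundary_probs[:, None]).T @ V   # V^T W V, all pairs at once
    return A / np.outer(norms, norms)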

We have now defined a novel matting-based affinity measure which can be used with any off-the-shelf clustering technique. In addition to incorporating boundary knowledge and global appearance reasoning, a unique and noteworthy aspect of our affinities is that they are defined based on feature vectors in a space whose dimension is a function of the image content, i.e. the number of detected fragments, NF, rather than an arbitrarily pre-defined feature space of fixed dimension.

    3. Detecting Boundary Fragments

In the previous section, we constructed mattes based on hypothesized boundary fragments. We will now explain the source of these hypotheses. While one could use a purely appearance-based edge detector, such as Pb [13], as a proxy for suggesting locations of object boundaries in a scene, this approach could also return many edges corresponding to non-boundaries, such as surface markings, yielding extra erroneous and misleading mattes. While it would be naïve to expect or require perfect boundary hypotheses from any method, we still wish to maximize the fraction that do indeed correspond to true object boundaries.

Recently, Stein et al. demonstrated improved detection of object/occlusion boundaries by incorporating local, instantaneous motion estimates when short video clips are available [26, 27]. Their method first over-segments a scene into a few hundred "super-pixels" [19] using a watershed-based approach. All super-pixel boundaries are used as potential object boundary fragments (where each fragment begins and ends at the intersection of three or more super-pixels). Using learned classifiers and inference on a graphical model, they estimate the probability that each fragment is an object boundary based on appearance and motion cues extracted along the fragment and from within each of the neighboring super-pixels. The resulting boundary probabilities provide the weights for W in (2). An example input image and its boundary fragments (shown in differing colors) can be found on the left side of Figure 1. For clarity, only the high-probability fragments are displayed, though we emphasize that all are used.

For each fragment, we can now generate an image-wide matte according to [11] as described in Section 2. The super-pixels on either side of a fragment naturally designate spatial support for the required labeling constraints. Since super-pixels generally capture fairly homogeneous regions, however, we have found it better to expand the constraint set for a fragment by using a triplet of super-pixels formed by also incorporating the super-pixels of the two neighboring fragments most likely to be boundaries. This process is illustrated in Figure 3.
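A sketch of this expansion step follows, under an assumed (hypothetical) data layout in which each fragment records its two flanking super-pixels with consistent F/B orientation and we can look up the fragments meeting it at its endpoints; none of these names come from the paper.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    F_side: int   # id of the super-pixel on the fragment's F side
    B_side: int   # id of the super-pixel on the fragment's B side

def expand_constraints(frag, endpoint_neighbors, boundary_prob):
    # Seed the F/B constraint sets with the fragment's own flanking
    # super-pixels, then add those of its two most boundary-like neighbors,
    # yielding a triplet of super-pixels on each side (cf. Figure 3).
    # (Orienting each neighbor's F/B sides consistently with frag's is
    # assumed to have been done already.)
    F, B = {frag.F_side}, {frag.B_side}
    neighbors = sorted(endpoint_neighbors[frag],
                       key=lambda g: boundary_prob[g], reverse=True)
    for g in neighbors[:2]:
        F.add(g.F_side)
        B.add(g.B_side)
    return F, B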

[Figure 2 panels: Boundary Fragments → Matting Constraints (F, B); Segmentation "Hints" → Feature Vectors (vi, vj); Pairwise Comparisons → Affinity Matrix (Aij)]

Figure 2. Each potential boundary fragment found in the original image (three shown here) implies a set of constraints (F and B) used to generate a matte. Vectors of matting results at each location in the image (vi and vj) can be compared in a pairwise fashion to generate an affinity matrix, A, suitable for clustering/segmentation.

Note also that our approach avoids any need to choose the "right" boundary fragments (e.g. by thresholding), nor does it attempt to extract extended contours using morphological techniques or an ad hoc chaining procedure, both of which are quite brittle in practice. Instead we consider only the individual short fragments, with an average length of 18 pixels in our experiments.

We employ the technique proposed in [26, 27] in order to improve our performance by offering better boundary detection and, in turn, better mattes. As mentioned above, that approach utilizes instantaneous motion estimates near boundaries. We are not, however, incorporating motion estimates (e.g. optical/normal flow) directly into our segmentation affinities [23, 32], nor is our approach fundamentally tied to the use of motion.¹

[Figure 3 annotations: neighboring fragment boundary probabilities (Pr = 0.1 to 0.8) and the F/B constraint super-pixels before and after expansion]

Figure 3. The F and B constraint sets for a given fragment are initialized to its two neighboring super-pixels (left). To improve the mattes, these constraint sets are expanded to also include the super-pixels of the fragment's immediate neighbors with the highest probability of also being object boundaries (right).

    4. Image Segmentation by NCuts

Given our pairwise affinity matrix A, which defines a fully-connected, weighted graph over the elements in our image, we can use spectral clustering according to the Normalized Cut criterion to produce an image segmentation with K segments [24].² To obtain a set of multiple segmentations, we can simply vary this parameter K.

¹ Separate experiments using Pb alone to supply the weights in W yielded reasonable but slightly worse results than those presented in Section 6. This suggests that both the better boundaries from [26, 27] and the matting-based affinities presented here are useful.

² Certainly other techniques exist, but NCuts is one popular technique that also offers mature, publicly-available segmentation methods, facilitating the comparisons in Section 6. Our work is not exclusively tied to NCuts.

Using the boundary-detection approach described above, we not only obtain a better set of hypothesized boundary fragments and probabilities of those fragments corresponding to physical object boundaries (i.e. for use in W from (2)), but we can also use the computed super-pixels as our basic image elements instead of relying on individual pixels. In addition to offering improved, data-driven spatial support, this drastically reduces the computational burden of constructing Aij, making it possible to compute all pairwise affinities between super-pixels instead of having to sample pixelwise affinities very sparsely.

The benefit here is more than reduced computation, however, particularly when using the popular NCuts technique. Other authors have noted that the non-intuitive segmentations typical of NCuts stem from an inability to fully populate the affinity matrix and/or the topology of a simple, four-connected, pixel-based graph [30]. By computing affinities between all pairs of super-pixels, we somewhat alleviate both of these problems while also improving speed.

We will not review here all the details of spectral clustering via NCuts given a matrix of affinities; see [14] for specifics of the method we follow. We compare the segmentations obtained using NCuts on our matting-based affinities to two popular approaches from the literature, each of which also uses NCuts and offers an implementation online. First is the recent multiscale approach of Cour et al. [7], in which affinities are based on Intervening Contours [10]. Second, we compare to the Berkeley Segmentation Engine (BSE) [8, 13], which relies on a combined set of patch-based and gradient-based cues (including Intervening Contours). Note that different methods exist for the final step of discretizing the eigenvectors of the graph Laplacian. While the BSE uses k-means (as do we, see [14]), the multiscale NCuts method attempts to find an optimal rotation of the normalized eigenspace [34]. Furthermore, both methods compute a sparsely-sampled, pixelwise affinity matrix.
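For reference, a minimal sketch of the clustering step in the style of [14] follows: embed the elements via the top-K eigenvectors of the normalized affinity matrix, then discretize with k-means. Clipping negative correlations to zero is our own simplification, since (2) can produce values in [−1, 1] while this embedding expects non-negative affinities; spectral_segment is an illustrative name.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_segment(A, K):
    # Normalized spectral clustering in the style of Ng, Jordan & Weiss [14].
    A = np.clip(A, 0.0, None)                    # non-negative affinities
    d = np.maximum(A.sum(axis=1), 1e-12)         # node degrees
    M = A / np.sqrt(np.outer(d, d))              # D^{-1/2} A D^{-1/2}
    _, vecs = eigh(M)                            # eigenvalues in ascending order
    X = vecs[:, -K:]                             # top-K eigenvectors
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(X)

# A set of multiple segmentations is then just a sweep over K:
# segmentations = {K: spectral_segment(A, K) for K in range(2, 21)}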

    5. Evaluating Object Segmentations

In this section, we will present an intuitive approach for determining whether one set of segmentations is "better" than another generated using a different method.

For the eventual goal of object segmentation and discovery, we propose that the best set of segmentations will contain at least one result for each object in the scene in which that object can be extracted accurately and with as few segments as possible. Ideally, each object would be represented by a single segment which is perfectly consistent with the shape of that object's ground truth segmentation. We leave the problem of identifying the object(s) from sets of multiple segmentations to future work and methods such as [21].

Thus, we define two measures to capture these concepts: consistency and efficiency. For consistency, we adopt a common metric for comparing segmentations, which conveniently lies on the interval [0, 1] and captures the degree to which two sets of pixels, R and G, agree based on their intersection and union:

c(R, G) = |R ∩ G| / |R ∪ G|.   (3)

Here, R = {A, B, C, ...} ⊆ S is a subset of segments from a given (over-)segmentation S, and G is the ground truth object segmentation.

At one extreme, if our segmentation were to suggest a single segment for each pixel in the image, we could always reconstruct the object perfectly by selecting those segments (or in this case, pixels) which corresponded exactly to the ground-truth object's segmentation. But this nearly-perfect consistency would come at the cost of an unacceptably high number of constituent segments, as indicated in the rightmost example in Figure 4. At the opposite extreme, our segmentation could offer a single segment which covers the entire object, as shown in the leftmost example. In this case, we would achieve the desired minimum of one segment to capture the whole object, but with very poor consistency.

[Figure 4: examples requiring 1, 3, 6, and 250+ segments, arranged in order of increasing segmentation consistency]

Figure 4. The tradeoff between segmentation consistency (or accuracy) and efficiency (or the number of segments required to achieve that consistency). As desired consistency increases, so too does the number of segments required to achieve it.

Thus we see that there exists a fundamental tradeoff between consistency and the number of segments needed to cover the object, or efficiency.³

³ It may be helpful to think of these measures and their tradeoff as being roughly analogous to the common notions of precision and recall, e.g., in object recognition.

Therefore, when evaluating the quality of a scene's (over-)segmentation S, we must take into account both measures. We define the efficiency as the size of the minimal subset of segments, R, required to achieve a specified desired consistency, cd, according to the ground truth object segmentation, G:

ecd(S, G) = min |R|, such that c(R, G) ≥ cd.   (4)

Note that with this definition, a lower value of e(S, G) implies a more efficient (or parsimonious) segmentation.

We can now specify cd and compute the corresponding ecd, which is equivalent to asking, "What is the minimum number of segments required to achieve the desired consistency for each object in this scene?" By asking this question for a variety of consistency levels, we can evaluate the quality of a particular method and compare it to other methods' performance at equivalent operating points.

Note that we can avoid a combinatorial search over all subsets R possibly required to achieve a particular cd by considering only those segments which overlap the ground truth object and by imposing a practical limit on the number of segments we are willing to consider (selected in order of their individual consistency measures) [27].
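A greedy sketch of this computation is given below; the limited search actually used in [27] may differ in detail, and the names (efficiency, max_segments) are our own.

import numpy as np

def consistency(R, G):
    # Eq. (3): intersection over union of two boolean pixel masks.
    return (R & G).sum() / (R | G).sum()

def efficiency(segment_masks, gt_mask, c_desired, max_segments=10):
    # Greedy approximation of Eq. (4): consider only segments overlapping
    # the ground truth, ordered by individual consistency, and grow the
    # union R until c(R, G) >= c_desired.
    candidates = [m for m in segment_masks if (m & gt_mask).any()]
    candidates.sort(key=lambda m: consistency(m, gt_mask), reverse=True)
    union = np.zeros_like(gt_mask, dtype=bool)
    for count, mask in enumerate(candidates[:max_segments], start=1):
        union |= mask
        if consistency(union, gt_mask) >= c_desired:
            return count
    return None   # "not possible" for this segmentation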

Referring once again to Figure 4, the middle two examples indicate that we can achieve a reasonable level of consistency with only three segments, and if we raise the desired consistency a bit higher (perhaps in order to capture the top of the mug), it will require us to use a different segmentation from our set which covers the mug with six segments. In general, the best method would be capable of providing at least one segmentation which yields a desired high level of consistency with the minimum degree of over-segmentation of any particular object.

    6. Experiments

For each of a set of test scenes, we have labeled the ground truth segmentation for foreground objects of interest, which we roughly defined as those objects for which the majority of their boundaries are visible. Since we employ the method of [26] for boundary information, we also use their online dataset of 30 scenes. From those scenes, we have labeled ground truth segmentations for 50 foreground objects on which we will evaluate our approach.

We generate a set of segmentations for each scene by varying K between 2 and 20 while using either our matting-based approach, multiscale NCuts [7], or the BSE approach [8, 13]. Recall that the two latter methods compute affinities in a pixelwise fashion. To verify that any improvement offered by our approach is not solely due to our use of super-pixels, we also implemented a baseline approach which computes affinities from pairwise comparisons of L*a*b* color distributions within each super-pixel, using the χ²-distance.
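A sketch of this baseline follows; the paper specifies the χ²-distance between L*a*b* color distributions but not the distance-to-affinity mapping, so the exponential kernel and its bandwidth sigma below are our assumptions.

import numpy as np

def chi2_color_affinity(histograms, sigma=0.5):
    # histograms: (P, bins) array, one normalized L*a*b* color histogram
    # per super-pixel. Affinity decays with the chi-squared distance.
    h = histograms
    num = (h[:, None, :] - h[None, :, :]) ** 2
    den = h[:, None, :] + h[None, :, :] + 1e-12
    chi2 = 0.5 * (num / den).sum(axis=2)   # pairwise chi-squared distances
    return np.exp(-chi2 / sigma)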

For our proposed approach, we use each of the individual boundary fragments from [26] to suggest constraints for mattes as described in Section 2.2.

In practice, we ignore those fragments with an extremely low probability (< 0.01) of being boundaries, since the resulting mattes in those cases would have almost no effect on computed affinities anyway. From an initial set of 350-1000, this yields approximately 90-350 fragments (and mattes) per image for computing pairwise, super-pixel affinities according to (2).

We first selected a set of ten desired consistency levels, from 0.5 to 0.95. Then for each labeled object in a scene and for each segmentation method, we pose the question, "What is the minimum number of segments required to achieve each desired consistency in segmenting this object?" We can then graph and compare the methods' best-case efficiencies as a function of the desired consistencies.

A typical graph is provided at the top of Figure 5, in which bars of different colors correspond to different segmentation methods. Each group of bars corresponds to a certain consistency level, and the height of the bars indicates the minimum efficiency (i.e. number of segments required) to achieve that consistency. Thus, lower bars are better. Bars extend to the top of the graph when a method could not achieve a desired consistency with any number of segments. Thus we see that our approach is able to achieve similar consistency with fewer segments, until cd reaches 0.85, at which point all methods fail on this image. Not surprisingly, the relatively simplistic appearance model of the color-only baseline tends to over-segment objects the most.

We can also examine the actual segmentations produced by each method at corresponding desired consistency levels, as shown at the bottom of Figure 5. For the input image and selected ground truth object shown, we provide for each method the segmentations which use the minimum number of segments and achieved at least the desired consistency indicated to the left of the row. Also shown are the super-pixels used for our method and the color-distribution approach, along with a high-probability subset of the boundary fragments used in the proposed approach. (We display only a subset of the fragments actually used, simply for clarity.) Below each segmentation are the actual consistencies and efficiencies (i.e. number of segments) obtained. Note how our method achieves comparable consistencies with fewer segments, even when other methods may not be able to achieve that consistency at all. More results are provided in Figures 6-7 and in the supplemental material.

Not surprisingly, when the desired consistency is low, any method is usually capable of capturing an object with few segments. But as the desired consistency increases, it becomes more difficult, or even impossible, to find a small number of segments to recover that object so accurately. Finally, as the desired consistency becomes too high, all methods begin to fail. We find, however, that for many objects our matting-based approach tends to maintain better efficiency (i.e. decreased over-segmentation) into a higher-consistency regime than the other methods.

To capture this more quantitatively over all objects, we can compute the difference between the number of segments our method requires at each consistency level and the number required by the other methods.

[Figure 5 graph: "Segments Required to Achieve Specified Consistency", # Segments Required (0-10, or Not Possible) vs. Desired Object Segmentation Consistency (0.5-0.95), for Color Distribution Affinity, Multiscale NCuts, Berkeley Segmentation Engine (BSE), and the Proposed Matting-Based Affinity. Bottom rows: Input Image, Ground Truth Object, Super-Pixels, High-Probability Boundary Fragments. At cd ≥ 0.70: Color Distribution c=0.71, ec=5; Multiscale NCuts c=0.79, ec=3; BSE c=0.71, ec=2; Proposed c=0.71, ec=1. At cd ≥ 0.80: Color Distribution c=0.81, ec=9; Multiscale NCuts and BSE not possible; Proposed c=0.81, ec=3.]

Figure 5. Top: A typical graph of segmentation efficiency vs. consistency for a set of desired consistency levels. Our method is able to achieve similar object segmentation consistency with fewer segments. Bottom: The corresponding input data (first row) and the resulting segmentations at two consistency levels (2nd, 3rd rows), as indicated by cd = {0.70, 0.80} to the left of each row. For clarity, only high-probability boundary fragments used by our approach are shown, using a different color for each fragment. For visualization, the individual segments corresponding to the object, i.e. those used to compute the c and e values displayed below each segmentation, have been colored with a red-yellow colormap, while background segments are colored blue-green.

We would like to see that our method often requires significantly fewer segments to achieve the same consistency. Certainly, there are some "easier" objects for which the choice of segmentation method may not matter, so we would also expect to find that our method regularly performs only as well as other methods. But we also must ensure that we do better much more often than worse. Furthermore, we expect the potential benefit of our approach to be most obvious within a reasonable range of desired consistency: all methods will tend to perform equally well at low consistencies, and all methods will tend to fail equally often at very high consistencies.

Figure 8 offers evidence that our approach does indeed outperform the other methods as desired. As expected, we perform just as well (or poorly) as other methods for many of the objects, particularly at very low or very high desired consistency. But we require several fewer segments per object in a significant number of cases. Furthermore, our approach rarely hurts; we do not often require more segments than the other methods.

[Figure 6 graph: "Segments Required to Achieve Specified Consistency", # Segments Required (0-9, or Not Possible) vs. Desired Object Segmentation Consistency (0.5-0.95), same four methods as Figure 5. Bottom rows: Input Image, Ground Truth Object, Super-Pixels, High-Probability Boundary Fragments. At cd ≥ 0.70: Color Distribution c=0.74, ec=5; Multiscale NCuts c=0.72, ec=6; BSE c=0.71, ec=2; Proposed c=0.72, ec=1. At cd ≥ 0.80: Color Distribution c=0.83, ec=7; Multiscale NCuts not possible; BSE c=0.82, ec=4; Proposed c=0.83, ec=3.]

Figure 6. Another example showing our method's ability to achieve comparable consistency with fewer segments.

7. Discussion & Conclusion

Debate continues over whether high-level reasoning, e.g. object recognition, should precede figure-ground perception [15], or if purely bottom-up, local reasoning can account for this process [16, 25]. Our work takes the more bottom-up perspective, since knowledge of specific objects is not required (unlike [18]), but our use of matting incorporates more powerful, semi-global reasoning as compared to purely-local methods. Our experiments indicate that by maintaining "soft" reasoning throughout, and by combining multiple, individually-imperfect sources of information in the form of fragmented boundary hypotheses and oft-uncertain mattes, our novel method for addressing object segmentation yields promising results for use in subsequent work on unsupervised object discovery or scene understanding. While here we have evaluated our affinities in isolation, as compared to existing methods, it is certainly possible that combining multiple types of affinities would offer further improvement.

Using boundaries and mattes as described simultaneously implies the grouping and separation of super-pixels on the same or opposite sides of a given boundary, respectively. We performed preliminary experiments with the technique described in [33] to incorporate separately such "attraction" and "repulsion" evidence via spectral clustering, rather than simply treating the two as equivalent sources of information with opposite signs.

[Figure 7 graph: "Segments Required to Achieve Specified Consistency", # Segments Required (0-10, or Not Possible) vs. Desired Object Segmentation Consistency (0.5-0.95), same four methods. Bottom rows: Input Image, Ground Truth Object, Super-Pixels, High-Probability Boundary Fragments. At cd ≥ 0.65: Color Distribution c=0.69, ec=2; Multiscale NCuts c=0.65, ec=2; BSE c=0.72, ec=3; Proposed c=0.84, ec=1. At cd ≥ 0.80: Color Distribution c=0.76, ec=10; Multiscale NCuts not possible; BSE c=0.81, ec=4; Proposed c=0.84, ec=1.]

Figure 7. Another example like those in Figures 5-6. Note that only one of the three objects in the scene is shown here.

This often yielded very good results, but it was fickle: when it failed, it often failed completely. Future research in this direction is certainly warranted, however.

    References

[1] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early vision, chapter 1. The MIT Press, 1991.

[2] N. E. Apostoloff and A. W. Fitzgibbon. Bayesian video matting using learnt image priors. In CVPR, 2004.

[3] N. E. Apostoloff and A. W. Fitzgibbon. Automatic video segmentation using spatiotemporal T-junctions. In BMVC, 2006.

[4] X. Bai and G. Sapiro. A geodesic framework for fast interactive image and video segmentation and matting. In ICCV, 2007.

[5] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.

[6] Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In CVPR, 2001.

[7] T. Cour, F. Bénézit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In CVPR, 2005.

[8] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In CVPR, 2003.

[9] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1), October 2007.

[10] T. Leung and J. Malik. Contour continuity in region based image segmentation. In ECCV, June 1998.

[Figure 8 histograms: "Performance of 'Matting: Simple NCuts' vs. Other Approaches", fraction of objects vs. relative number of segments required by the other method, as compared to Color Distributions, Multiscale NCuts, and BSE. Insets: at desired consistency 0.60, for 27% of the objects we require an average of 2.5 fewer segments, while for 12% we require an average of 1.6 more; at 0.70, 42% / 3.4 fewer vs. 11% / 1.1 more; at 0.80, 58% / 4.0 fewer vs. 7% / 1.7 more.]

        Improvement      Degradation
 cd     ∆e    % Obj.     −∆e   % Obj.
0.50    2.0    21%       1.3    3%
0.55    2.3    26%       1.4    6%
0.60    2.5    27%       1.6   12%
0.65    2.7    34%       1.2   10%
0.70    3.4    42%       1.1   11%
0.75    3.8    47%       1.9   11%
0.80    4.0    58%       1.7    7%
0.85    3.6    39%       2.1   12%
0.90    4.2    30%       3.5    9%
0.95    4.7    11%       3.4    5%

For each desired consistency level cd, we indicate the percentage of objects for which we offer improvement over other methods and how many fewer segments we require on average (∆e); higher values are better for both. Similarly, for cases where we do worse, we indicate how many more segments we require (−∆e) and how often; lower values are better here.

Figure 8. Overall Performance. At left are histograms of the relative number of segments required by our approach as compared to the other methods. The height of the bars corresponds to the fraction of the total number of objects for which we achieve the specified relative efficiency on the x-axis. Thus, bars at zero, in the center of the graph, correspond to cases when we perform just as well as the other approaches. Bars to the right (left) correspond to cases where we perform better (resp., worse), using fewer (resp., more) segments than the competition. The table and the insets on the graphs provide summary statistics at the specified consistency levels.

[11] A. Levin, D. Lischinski, and Y. Weiss. A closed form solution to natural image matting. In CVPR, 2006.

[12] T. Malisiewicz and A. A. Efros. Improving spatial support for objects via multiple segmentations. In BMVC, 2007.

[13] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530-549, May 2004.

[14] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.

[15] M. A. Peterson and B. S. Gibson. Must figure-ground organization precede object recognition? An assumption in peril. Psychological Science, 5(5):253-259, September 1994.

[16] F. T. Qiu and R. von der Heydt. Figure and ground in the visual cortex: V2 combines stereoscopic cues with Gestalt rules. Neuron, 47:155-166, July 2005.

[17] A. Rabinovich, A. Vedaldi, and S. Belongie. Does image segmentation improve object categorization? Technical Report CS2007-0908, University of California San Diego, 2007.

[18] X. Ren, C. C. Fowlkes, and J. Malik. Figure/ground assignment in natural images. In ECCV, 2006.

[19] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, volume 1, pages 10-17, 2003.

[20] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309-314, 2004.

[21] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.

[22] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In CVPR, 2000.

[23] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154-1160, 1998.

[24] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888-905, August 2000.

[25] L. Spillmann and W. H. Ehrenstein. Gestalt factors in the visual neurosciences. In L. M. Chalupa and J. S. Werner, editors, The Visual Neurosciences, pages 1573-1589. MIT Press, Cambridge, MA, November 2003.

[26] A. Stein, D. Hoiem, and M. Hebert. Learning to find object boundaries using motion cues. In ICCV, 2007.

[27] A. N. Stein. Occlusion Boundaries: Low-Level Processing to High-Level Reasoning. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University, February 2008.

[28] J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. SIGGRAPH, 23(3):315-321, 2004.

[29] J. Tang and P. Lewis. Using multiple segmentations for image auto-annotation. In ACM International Conference on Image and Video Retrieval, 2007.

[30] D. A. Tolliver and G. L. Miller. Graph partitioning by spectral rounding: Applications in image segmentation and clustering. In CVPR, 2006.

[31] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. PAMI, 29(6):929-944, June 2007.

[32] J. Xiao and M. Shah. Accurate motion layer segmentation and matting. In CVPR, 2005.

[33] S. Yu and J. Shi. Segmentation with pairwise attraction and repulsion. In ICCV, July 2001.

[34] S. X. Yu and J. Shi. Multiclass spectral clustering. In ICCV, 2003.