
A Novel Framework for Automatic Trimap Generation Using

The Gestalt Laws of Grouping

Ahmad Al-Kabbany and Eric Dubois

School of Electrical Engineering and Computer Science, University of Ottawa, K1N 6N5, Ottawa, Canada

ABSTRACT

In this paper, we are concerned with unsupervised natural image matting. Due to the under-constrained nature of the problem, image matting algorithms are usually provided with user interactions, such as scribbles or trimaps. This is a very tedious task and may even become impractical for some applications. For unsupervised matte calculation, we can either adopt a technique that supports an unsupervised mode for alpha map calculation, or we may automate the process of acquiring the user interactions provided to a matting algorithm. Our proposed technique contributes to both approaches and is based on spectral matting. The latter is the only technique in the literature that supports automatic matting, but it suffers from critical limitations, among which is its unreliable unsupervised operation. Stressing that drawback, spectral matting may produce erroneous mattes in the absence of guiding scribbles or trimaps. Using the Gestalt laws of grouping, we propose a method that automatically produces more truthful mattes than spectral matting. In addition, it can be used to generate trimaps, eliminating the required user interactions and making it possible to harness the power of matting techniques that outperform spectral matting but do not support unsupervised operation. The main contribution of this research is the introduction of the Gestalt laws of grouping to the matting problem.

Keywords: Matting, Gestalt, grouping, segmentation

1. INTRODUCTION

Natural image matting is the problem of estimating the fractional foreground opacity in images and video. This involves the calculation of an alpha map, where the value at each pixel indicates whether that pixel is a foreground pixel (α = 1), a background pixel (α = 0), or a linear combination of the colors of a foreground/background (FG/BG) pair.1

A linear convex image model is thus used, expressed by the compositing equation I = αF + (1 − α)B.

Finding α, F and B from a single equation is not possible, and thus some additional information is usually provided to the algorithm in the form of sparse scribbles or a dense trimap. In that map, some pixels are labeled as FG (white), some as BG (black), while the rest of the pixels are considered of unknown alpha value (gray).
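As a concrete illustration, the compositing equation can be exercised in a few lines of NumPy; the alpha values and images below are toy data, not from any matting dataset:

```python
import numpy as np

def composite(alpha, fg, bg):
    """Composite a foreground over a background: I = alpha*F + (1 - alpha)*B.

    alpha: (H, W) array in [0, 1]; fg, bg: (H, W, 3) color images."""
    a = alpha[..., None]          # broadcast alpha over the color channels
    return a * fg + (1.0 - a) * bg

# A 2x2 toy example: left column pure FG (alpha = 1), right column mixed (alpha = 0.5).
alpha = np.array([[1.0, 0.5],
                  [1.0, 0.5]])
fg = np.ones((2, 2, 3))           # white foreground
bg = np.zeros((2, 2, 3))          # black background
I = composite(alpha, fg, bg)
```

In a trimap, only the gray pixels require this per-pixel α to be estimated; the white and black pixels fix α to 1 and 0 respectively.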

Matting can play a key role in applications where a soft FG/BG segmentation would considerably enhance the results, like novel-view synthesis. However, it is an open problem how matting can be incorporated in the pipeline of such applications, where acquiring user interaction for every frame is either impossible or highly inconvenient, especially if real-time operation is a constraint. A recent publication2 on image matting featured real-time matte extraction, but it assumes the availability of dense trimaps, which is not a realistic assumption. In video matting, off-line operation is often acceptable. However, two of its three main challenges lie in the 'how-to' of allowing the user to specify the foreground object spatio-temporally, and how to propagate such interaction through the whole sequence in a temporally coherent way.1,3

Thus, it is extremely useful to find a robust matting technique that can operate automatically, either intrinsically or through the automation of trimap generation. The existing techniques for image matting include only one algorithm that is able to extract mattes automatically, namely spectral matting.1 It is still a state-of-the-art technique and is among the top techniques in the online benchmark.4,5 Unfortunately, it suffers from many

Further author information:

A.A.: E-mail: [email protected]

E.D.: E-mail: [email protected]


Figure 1: The results of the unsupervised spectral matting (column 3), the alpha mattes produced automatically by our algorithm (column 4), and the trimaps generated from them (column 5). Column 1 shows the original images and column 2 shows the ground truth. The results in column 3 were calculated using RGB affinities. Using DC1C2 affinities gave virtually identical (or similarly unsatisfactory, for some images) mattes. Please see text for more details.

disadvantages, including the high computational demands and the unreliable operation in the case where no user interaction is provided. Nowadays, leveraging the power of cloud and GPU computing is common practice in numerous fields, and the latter's computational power doubles every year. Hence, while heading towards the goal of automatic matte extraction, we have been more concerned about the unreliable operation of spectral matting, and this research is devoted to handling it.

Although the Gestalt laws of grouping play an important role in the perception of figural objects, they haven't been used before in the matting problem. We propose a novel technique that adopts them to guide the grouping process of matting components in spectral matting. It can also automatically generate trimaps if more accurate alpha maps are to be calculated at a later stage. This is a step towards robust automatic matte extraction. We also envisage their future usage as high-level cues in several matting algorithms, especially in video matting. The rest of the paper is organized as follows: in section 2, a review of spectral matting is presented, section 3 is an overview of the related work, and section 4 discusses the proposed technique. Finally, we present the results in section 5, followed by the conclusion in section 6.

2. REVIEW OF SPECTRAL MATTING

In Ref. 6, Levin et al. proposed a matting technique that is inspired by spectral clustering. In spectral image segmentation, the smallest eigenvectors of the image's graph Laplacian are used to automatically extract the hard segments in an image. Similarly, the smallest eigenvectors of the matting Laplacian defined in Ref. 7 were shown to span, usually quite well, the matting components of an image. Those fuzzy components can thus be


obtained by linear transformations of the smallest eigenvectors, and then act as building blocks for a meaningful complete foreground matte. An example of the matting components of an image can be found in Fig. 2. Those components were calculated using the publicly available spectral matting code.8

Figure 2: Matting components of the 6th training image in Ref. 4,5. The first two images in the first row are the original and the ground truth, respectively.

The algorithm proceeds as follows: given an input image with N pixels, it calculates the N × N sparse Laplacian matrix L and its P smallest eigenvectors. K matting components are then obtained from the P eigenvectors by minimizing a non-convex energy function that respects the sparsity of the resulting components and their linear convexity at each pixel. Newton's method was used to minimize that energy function, and the process was initialized by applying unsupervised k-means clustering on the smallest eigenvectors.
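A rough sketch of this pipeline is given below. Note the stand-ins: the affinity matrix is random (the actual algorithm builds the matting Laplacian of Ref. 7 from local color statistics), the eigen-decomposition is dense rather than sparse, and SciPy's k-means replaces the Newton refinement:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)
N, P, K = 200, 20, 10                    # pixels, smallest eigenvectors, components

# Stand-in symmetric affinity matrix; a real system would use matting affinities.
W = np.random.rand(N, N)
W = (W + W.T) / 2.0
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W

# P smallest eigenvectors (dense here for simplicity; the real L is sparse).
eigvals, eigvecs = np.linalg.eigh(L)     # eigh returns eigenvalues in ascending order
E = eigvecs[:, :P]

# k-means on the eigenvector rows initializes the K matting components;
# spectral matting then refines them with Newton's method (not shown).
centroids, labels = kmeans2(E, K, minit='++')
```

The rows of E embed each pixel in the spectral space; clustering those rows gives the initial hard assignment that the non-convex energy minimization subsequently softens into fuzzy matting components.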

Starting with K matting components, the last step in the algorithm aims at grouping those components to extract a complete matte, by choosing the grouping which minimizes the quadratic cost given by

J(α) = αᵀ L α,   (1)

where L is the matting Laplacian and α is a particular component grouping (alpha map), computed as the summation of particular matting components. Specifically, they test the 2^K grouping hypotheses in which each of the K matting components is either high or low (active or inactive). To minimize eqn. 1, the correlations between the K matting components are precomputed via L and saved in a K × K matrix Γ, where Γ(k, l) = α_kᵀ L α_l. Then, the score of a hypothesis is calculated as

J(α) = bᵀ Γ b,   (2)

where b is a K-dimensional binary vector indicating whether each component is active or not. Such a cost is biased towards nominating groupings with a small number of fuzzy pixels, and the authors of Ref. 6 resorted to balanced cuts, putting a constraint on the size of the FG object, which negatively impacts the generality of their grouping approach. Our proposed cost function doesn't make any presumptions about the object's size but instead favors the hypotheses that adhere to the Gestalt laws of visual grouping.
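The exhaustive search over the 2^K hypotheses of eqn. 2 can be sketched as follows; the 4 × 4 correlation matrix Γ here is hypothetical toy data standing in for the precomputed one:

```python
import itertools
import numpy as np

def best_grouping(Gamma):
    """Exhaustively score the 2^K activation hypotheses b' Gamma b and
    return the binary vector with the smallest cost (skipping the empty grouping)."""
    K = Gamma.shape[0]
    best_b, best_cost = None, np.inf
    for bits in itertools.product([0, 1], repeat=K):
        b = np.array(bits, dtype=float)
        if b.sum() == 0:          # an empty foreground is not a valid matte
            continue
        cost = b @ Gamma @ b
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b, best_cost

# Toy correlation matrix for K = 4: components 0 and 1 group cheaply together.
Gamma = np.array([[ 1.0, -0.9, 0.5, 0.5],
                  [-0.9,  1.0, 0.5, 0.5],
                  [ 0.5,  0.5, 1.0, 0.5],
                  [ 0.5,  0.5, 0.5, 1.0]])
b, cost = best_grouping(Gamma)
```

Because Γ is precomputed, each hypothesis costs only a K × K quadratic form rather than a full N-pixel evaluation, which is what makes enumerating all 2^K hypotheses feasible for small K (K = 10 gives 1024 candidates).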

3. RELATED WORK

This research intersects with two areas in the literature, namely, the usage of Gestalt laws for scene analysis, and the inter-related problems of image/video matting, improving spectral matting, and automatic trimap generation.


Thus, to better clarify our contribution and its impact on matting, we will focus on the literature of the latter set of problems, which are all related to matting.

To the best of our knowledge, the matting problem hadn't benefited from the Gestalt laws before our preliminary work in Ref. 9. However, the approach in Ref. 9 counted on fewer Gestalt cues than the method at hand; only concavity and connectivity were used. In addition, the distance functions designed to quantify the cues failed to deal with the variety of challenges in the datasets equally well.

From the video matting literature, we highlight three of the most recent, state-of-the-art techniques.10–12

The authors of Ref. 10 proposed an interactive system for object cut-out from video. The performance is a function of the amount of user assistance, and in some cases the user has to supply frame-by-frame control points. The method in Ref. 12 capitalizes on the non-local principle. Their technique is not just more efficient than Ref. 11 (which is also based on the non-local principle); it also often suffices to provide a single trimap on the first frame of a video sequence. However, the algorithm fails in the presence of motion blur.

Automatic trimap generation has also attracted researchers' attention. Starting with the most recent, the algorithm in Ref. 13 uses the feature map of the RGB image, morphological dilation and the region-growing algorithm to generate a trimap. A learning-based approach was given in Ref. 14, in which GMMs for the FG and the BG (estimated static) are initially learnt, and then a per-pixel classification is carried out to generate a trimap of the scene. In Ref. 15, the authors used a range sensor to acquire depth information from which they generate the trimaps.

4. PROPOSED TECHNIQUE

The proposed technique starts with calculating the Laplacian matrix, L, and then follows the same pipeline mentioned in section 2, reaching a set of K matting components (K = 10 in our experiments). For calculating the affinities of L, using features besides color will intuitively result in better clustering. In our algorithm, the affinities were calculated from defocus+chroma maps (DC1C2) rather than the RGB values of the original image. Although it doesn't guarantee an acceptable final matte on its own, this step represents a reliable start. For a fair comparison with Ref. 6, we adopted their pipeline while using the DC1C2-based Laplacian; the results were virtually identical to those obtained using the RGB-based Laplacian (shown in Fig. 1), or just equally unsatisfactory. The chroma information was obtained by the standard conversion from RGB to LC1C2 space. For the defocus maps, the literature is rich with defocus estimation approaches; the latest to our knowledge is in Ref. 16. Our defocus maps, though, were obtained with the publicly available code in Ref. 17. For the calculated matting components, the algorithm tests all the possible 2^K groupings; for each of them, it calculates three costs that indicate its adherence to the Gestalt laws according to certain cues. A total cost function is then used to nominate the grouping with the least cost. A discussion of those cues and the cost calculation is given in the following sub-sections.

4.1 The concavity cue

Motivated by the study in Ref. 18, we have incorporated the concavity cue to penalize (and eventually rule out) erroneous groupings where, for example, an area bounded by a concave arc is assigned to the background. To cut back the algorithm's complexity, in contrast to Ref. 18, we didn't fit splines to contours. Instead, we analyzed the contours of the groupings themselves and used a simple dictionary of rules (two entries are shown in Table 1) to decide whether a certain pair of direction changes is concave or convex.

For a concave section of the contour, we formulated the concavity score as

CAV = Σ_{S_j^cv} C × L_f × (1 + δ × P),   (3)

where C is the curvature, L_f is the lifetime of a curve section, P is a constant which contributes to the concavity (or convexity) score if a curve persists in a certain status (either concave or convex), and δ is a binary indicator which takes the value 1 if persistency exists and 0 otherwise. The overall concavity score of a particular contour, CAV, is then the summation of the concavity scores of the set of concave curve sections {S_j^cv} in the contour. Figure 3(a) depicts a typical example of a concave section in a contour, with the different variables of eqn. 3 shown on it.

Table 1: Two entries from the dictionary used for concavity/convexity calculation

  Direction Change              Indication
  North-east to South-east      Concavity
  North-east to North-west      Convexity

Fig. 3(b) illustrates how we quantified the curvature of a contour section, which is formulated as follows:

C = sup_{a∈A} inf_{b∈B} D(a, b),   (4)

where 'sup' is the supremum, 'inf' is the infimum, A is the set of all the contour pixels of the curve whose curvature is to be quantified, B is the set of all points on the chord subtending A, and D(a, b) is the Euclidean distance from a to b. In particular, the curvature is computed as the largest minimum distance from the points on the contour to the chord subtending them. If the contour at hand is a perfect circular arc, its maximum curvature is given by its sagitta. An equation identical to eqn. 3 is used to calculate the convexity score.

Figure 3: (a) An illustration of a concave arc with the variables of eqn. 3 shown on it, and (b) depicts how the curvature was measured.
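The sup-inf curvature of eqn. 4 can be sketched numerically as follows; this is a minimal illustration in which a densely sampled chord stands in for the continuous chord B, not the paper's implementation:

```python
import numpy as np

def curvature(contour_points):
    """Curvature of a contour section per eqn. 4: the largest, over the contour
    points, of the minimum distance to the chord joining the section's endpoints.

    contour_points: (n, 2) array of (x, y) samples along the curve section."""
    pts = np.asarray(contour_points, dtype=float)
    a, b = pts[0], pts[-1]                               # chord endpoints
    t = np.linspace(0.0, 1.0, 200)
    chord = a[None, :] + t[:, None] * (b - a)[None, :]   # dense chord samples (set B)
    # inf over B of D(a, b) for each contour point, then sup over the contour
    d = np.linalg.norm(pts[:, None, :] - chord[None, :, :], axis=2).min(axis=1)
    return d.max()

# A semicircular arc of radius 1: its sagitta (hence its curvature score) is 1.
theta = np.linspace(0.0, np.pi, 100)
arc = np.stack([np.cos(theta), np.sin(theta)], axis=1)
C = curvature(arc)
```

For the semicircle, the farthest contour point from the chord is the apex, and its distance to the chord is the sagitta, matching the circular-arc remark after eqn. 4.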

Having the overall concavity and convexity scores calculated, the concavity cost of a particular grouping is then computed as

J_cav(α) = 2 × CVX / (CVX + CAV),   (5)

where CVX and CAV are the convexity and the concavity scores, respectively.

4.2 The symmetry and closure cues

According to the Gestalt laws, we tend to perceive objects as symmetric figural entities. Even though explicit symmetry is an over-simplification of the scenes encountered in matting tasks, the spatial extent of the BG in an image/scene is usually greater than that of the FG, which creates a form of non-trivial symmetry. This is also closely related to the FG object being observed as a closed surface.19 We have designed a cost function that is expected to have low values (leading to a low cost) in the case of a symmetric grouping. This function is given by eqn. 6.

The first term of this cost capitalizes on a notion that is related to symmetry, namely the centre of mass (CoM). To single out an overall concave, connected and enclosed grouping, we have incorporated the CoM twice in our objective function, once as a prior and once as a constraint. The prior favors groupings which minimize the absolute horizontal distance between the CoM and the apsis of the alpha matte (the FG pixel with the largest


vertical coordinate in the image plane), while the constraint rejects groupings in which the CoM lies outside the FG. The latter constraint is backed by the observation that objects like a horseshoe, for example, are not usual in matting scenes. However, it could easily be reformulated as a prior, depending on the application.

Inspired by Ref. 19, the second and third terms of eqn. 6 capitalize on the closed-boundary characteristic of an object to infer symmetry. In the case of a symmetric grouping, the bounding box W of the grouping's edge map should enclose most of the figural object, and most of it should be occupied by that object as well. The significance of that rule is illustrated in Fig. 4, and more examples can be found in Ref. 20. The symmetry cost is thus calculated as

J_sym(α) = (D_H / I_W) + (|FG ∖ W| / N_FG) + (1 − min(T, |FG ∩ W| / |W|)),   (6)

where D_H is the absolute horizontal distance between the CoM and the matte's apsis, I_W is the width of the image, |FG ∩ W| is the number of foreground pixels in W, |W| is the area of W, |FG ∖ W| is the number of foreground pixels outside W, N_FG is the total number of foreground pixels in the grouping, and T is a constant (T = 0.5 in our experiments).

Figure 4: An erroneous grouping (a) and its Canny edge map (b), featuring its W as a green rectangle, and the best map (c) with its edge map (d). The figure illustrates the significance of the second and the third terms in eqn. 6.
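For a binary foreground mask, the three terms of eqn. 6 can be sketched as below; the gradient-based edge map is a simplification of the Canny detector used in Fig. 4, and the tie-break that picks the first bottom-row pixel as the apsis is our own assumption:

```python
import numpy as np

def j_sym(alpha_mask, T=0.5):
    """Sketch of the symmetry cost of eqn. 6 for a binary foreground mask."""
    fg = alpha_mask.astype(bool)
    img_w = fg.shape[1]
    ys, xs = np.nonzero(fg)
    n_fg = len(xs)

    # Centre of mass and apsis (FG pixel with the largest vertical coordinate;
    # ties broken by the first such pixel in row-major order).
    d_h = abs(xs.mean() - xs[np.argmax(ys)])

    # Simple gradient-based edge map and its bounding box W.
    edges = np.zeros_like(fg)
    edges[:-1, :] |= fg[:-1, :] ^ fg[1:, :]
    edges[:, :-1] |= fg[:, :-1] ^ fg[:, 1:]
    ey, ex = np.nonzero(edges)
    box = np.zeros_like(fg)
    box[ey.min():ey.max() + 1, ex.min():ex.max() + 1] = True

    fg_in_box = np.count_nonzero(fg & box)     # |FG intersect W|
    fg_out_box = n_fg - fg_in_box              # |FG outside W|
    box_area = np.count_nonzero(box)           # |W|

    return (d_h / img_w) + (fg_out_box / n_fg) + (1.0 - min(T, fg_in_box / box_area))

# A centred square FG: no FG pixels fall outside W, so the second term vanishes.
mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True
cost = j_sym(mask)
```

Lower values indicate a more symmetric, better-enclosed grouping; the min(T, ·) cap in the third term keeps a tightly filled bounding box from dominating the other two terms.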

4.3 The proximity and connectivity cues

We assume that the foreground object is spatially connected, which is true for all trimap-based matting.21

Consequently, the graph whose nodes are the matting components has to be connected, which is satisfied if every pair of components is connected. This can be verified from the components' adjacency matrix using the Dulmage-Mendelsohn decomposition.22 For the adjacency matrix A with entries a_{i,j}:

a_{i,j} = 1 ⟺ ∃ p ∈ C_i : D_min(p, C_j) ≤ D,   (7)

where C_i and C_j are the contour pixels of components i and j respectively, D_min(p, C_j) is the minimum Euclidean distance between pixel p and the set of contour pixels C_j, and D is a constant (D = 4 in our experiments). We used the connectivity as a constraint and defined J_conn as

J_conn(α) = { 1, if T(A_DM) = 1;  ∞, otherwise },   (8)

where A_DM is the Dulmage-Mendelsohn decomposition of the adjacency matrix A, and T(A_DM) is an indicator function which returns '1' if the grouping α is connected and '0' otherwise. Our final cost function is then formulated as the product of the costs corresponding to the three cues discussed in this section, and is given by

J(α) = J_cav(α) × J_sym(α) × J_conn(α),   (9)

where J_cav(α) is given by eqn. 5, J_sym(α) is given by eqn. 6 and J_conn(α) is given by eqn. 8.
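The connectivity test of eqns. 7 and 8 can be sketched as follows; a breadth-first search stands in for the Dulmage-Mendelsohn-based check, and the contour point sets are toy data:

```python
from collections import deque
import numpy as np

def adjacency(contours, D=4.0):
    """Adjacency matrix per eqn. 7: components i and j are adjacent when some
    contour pixel of i lies within distance D of the contour of j.

    contours: list of (n_i, 2) arrays of contour pixel coordinates."""
    K = len(contours)
    A = np.zeros((K, K), dtype=int)
    for i in range(K):
        for j in range(i + 1, K):
            d = np.linalg.norm(
                contours[i][:, None, :] - contours[j][None, :, :], axis=2)
            if d.min() <= D:
                A[i, j] = A[j, i] = 1
    return A

def is_connected(A):
    """Breadth-first search over the component graph, a simple stand-in for
    the Dulmage-Mendelsohn-based connectivity test used in the paper."""
    K = A.shape[0]
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in np.nonzero(A[u])[0]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == K

# Three components: the first two touch (within D = 4), the third is far away.
c0 = np.array([[0.0, 0.0], [1.0, 0.0]])
c1 = np.array([[3.0, 0.0], [4.0, 0.0]])
c2 = np.array([[50.0, 50.0]])
A = adjacency([c0, c1, c2])
ok = is_connected(A)
```

A grouping whose component graph fails this test receives J_conn = ∞, so the product in eqn. 9 rules it out regardless of its concavity and symmetry costs.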


4.4 A Probabilistic View

The problem of picking the best grouping can be seen as a set of optimum decisions, each of which determines whether a sub-component in that grouping will be set to high or low. A single decision for one sub-component can be modelled in a Bayesian fashion as follows:

P(l_c | I) ∝ P(I | l_c) · P(l_c | l_N),   (10)

where P(l_c | I) is the MAP estimate that a certain matting component c will take the binary label l_c (l_c = 1 means the component is active). The first term on the right-hand side, P(I | l_c), is the observed-data (likelihood) term, which can be the probability that component c contains a FG object or mixed pixels with fractional alpha values. Embarking on the basics of the image formation process, we argue that getting a fuzzy-structured FG object in focus entails an unmistakable defocus of the background. Hence, the defocus feature, together with the image's binary segmentation, can give us an insight into the likelihood of having a FG object or mixed pixels in a particular matting component. The second term in this formulation, P(l_c | l_N), can encode our prior assumptions about the final matte. In particular, given the status of the neighbors of c, this term will consider the label l_c = 1 highly probable if the overall grouping is more concave, symmetric and connected than if l_c = 0. The best grouping is thus the one which maximizes the equation given by:

argmax_j Σ_{c=1}^{N} P_j(I | l_c) · P_j(l_c | l_N),   j := {l_c, l_N},   (11)

where P_j(I | l_c) is the likelihood of component c having label l_c in grouping j, and P_j(l_c | l_N) is the probability of component c having label l_c in grouping j, given the labels of the neighboring components in that grouping.

5. RESULTS AND DISCUSSION

We have tested our algorithm on the standard image matting dataset4 in addition to the most widely used video matting dataset.23 Figure 1 shows the results of our algorithm together with the results of Ref. 6. The proposed framework has succeeded in automatically extracting mattes that are better than those of unsupervised spectral matting. Even though spectral matting is a state-of-the-art technique, there are better, yet supervised, matting techniques.4 Using the groupings nominated by our algorithm, and a few morphological operations, we can automatically generate the trimap needed to calculate more accurate mattes at a later stage. Interestingly, almost all the top performers in Ref. 4 calculate the Laplacian either within the algorithm or in a refinement post-processing step. If one is calculating the Laplacian matrix anyway, using our algorithm, the trimap can be acquired automatically and for free. Figure 1 shows examples of the generated trimaps.

We have further examined the usage of an objective function that incorporates one cue, rather than the proposed three cues, e.g. the concavity cue alone or the symmetry cue alone. This strategy failed to properly guide the component-grouping process. In Fig. 5, we show some examples of highly symmetric, yet erroneous, groupings. We also show a few instances of mattes that are overall concave but still have a bad symmetry score.

We have also investigated the automatic acquisition of trimaps from motion-blurred datasets. We couldn't access the blurred datasets in Ref. 12, so we generated synthetically motion-blurred images using the Matlab® function fspecial, as depicted in Fig. 6. By capitalizing on such high-level cues, the proposed method shows good immunity to motion blur. We argue that incorporating those cues augments the robustness of video matting techniques with respect to motion blur. Cases of failure of our algorithm, together with the results shown in this paper, can all be found in Ref. 20.
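A minimal stand-in for this blurring step is sketched below; it is a plain horizontal box filter in NumPy, not Matlab's actual fspecial('motion') kernel, which also supports arbitrary blur angles:

```python
import numpy as np

def motion_blur_kernel(length=9):
    """Horizontal motion-blur kernel: a normalized 1 x length box filter
    (a simplified stand-in for fspecial('motion', length, 0))."""
    return np.ones((1, length)) / length

def convolve2d_same(img, kernel):
    """Naive 'same'-size 2-D convolution with zero padding (no SciPy needed)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel[::-1, ::-1])
    return out

# Blurring an impulse spreads its unit mass uniformly along the blur direction.
img = np.zeros((5, 9))
img[2, 4] = 1.0
blurred = convolve2d_same(img, motion_blur_kernel(9))
```

Applying such a kernel to each frame of a sharp sequence yields a synthetically motion-blurred test set of the kind used in Fig. 6.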

6. CONCLUSION AND FUTURE WORK

Inspired by Ref. 6, we have proposed a novel framework that introduces the Gestalt laws to the matting problem. It makes reliable and totally automatic matte extraction and trimap generation possible. This is expected to impact image and video matting by paving the way to harnessing the quality of other state-of-the-art techniques


Figure 5: Examples of erroneous groupings calculated by adopting a single cue in the objective function, instead of using three cues for inferring the correct matte. The first column is the original of every image. The upper two rows show instances of highly symmetric, yet erroneous, groupings. The lower two rows show overall concave groupings with a bad symmetry score.

Figure 6: Synthetically motion-blurred version (b) of a video matting dataset frame (a), shown beside the extracted matte (c). The ten matting components are arranged as a 2×5 array on the right of the figure. Please see text for more details.

that don't support unsupervised operation. By capitalizing on a high-level cue, the proposed method has also shown good resistance to motion blur, from which many video matting techniques have suffered. How multiple FG objects can be handled, and the 'how to' of generating trimaps with an adaptive matting band,3 are points of future research.

ACKNOWLEDGMENTS

This research was partially funded by NSERC and Magor Corp.®


REFERENCES

1. J. Wang and M. Cohen, "Image and video matting: A survey," Foundations and Trends in Computer Graphics and Vision, vol. 3, no. 2, pp. 97–175, 2007.

2. E. Gastal and M. Oliveira, "Shared sampling for real-time alpha matting," Eurographics Computer Graphics Forum, vol. 29, no. 2, 2010.

3. X. Bai, J. Wang, and D. Simons, "Towards temporally-coherent video matting," in Computer Vision/Computer Graphics Collaboration Techniques, 2011, vol. 6930.

4. Alpha matting online benchmark. [Online]. Available: http://alphamatting.com

5. C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott, "A perceptually motivated online benchmark for image matting," in CVPR, 2009.

6. A. Levin, A. Rav-Acha, and D. Lischinski, "Spectral matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1699–1712, 2008.

7. A. Levin, D. Lischinski, and Y. Weiss, "A closed-form solution to natural image matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228–242, 2008.

8. A. Levin, R. Alex, and D. Lischinski. Spectral matting project page. [Online]. Available: http://www.vision.huji.ac.il/SpectralMatting/

9. A. Al-Kabbany and E. Dubois, "Improved global-sampling matting using local and non-local color/texture pair-selection strategy," School of Electrical Engineering and Computer Science, University of Ottawa, ON, Canada, K1N 6N5, Tech. Rep. 00701-2013, March 2013.

10. X. Bai, J. Wang, D. Simons, and G. Sapiro, "Video snapcut: Robust video object cutout using localized classifiers," in ACM SIGGRAPH, 2009.

11. I. Choi, M. Lee, and Y.-W. Tai, "Video matting using multi-frame nonlocal matting laplacian," in ECCV, 2012.

12. D. Li, Q. Chen, and C.-K. Tang, "Motion-aware KNN Laplacian for video matting," in ICCV, 2013.

13. S. Singh, A. Jalal, and C. Bhatanagar, "Automatic trimap and alpha-matte generation for digital image matting," in Sixth International Conference on Contemporary Computing, 2013.

14. K. Lee, "Learning-based trimap generation for video matting," Master's thesis, University of California, San Diego, 2010.

15. O. Wang, J. Finger, Q. Yang, J. Davis, and R. Yang, "Automatic natural video matting with depth," in 15th Pacific Conference on Computer Graphics and Applications, 2007.

16. D. Morgan-Mar and M. R. Arnison, "Depth from defocus using the mean spectral ratio," in Digital Photography X, SPIE, 2014.

17. S. Zhuo and T. Sim, "Defocus map estimation from a single image," Journal of Pattern Recognition, vol. 44, no. 9, pp. 1852–1858, 2011.

18. Y. Lu, W. Zhang, H. Lu, and X. Xue, "Salient object detection using concavity context," in ICCV, 2011.

19. B. Alexe, T. Deselares, and V. Ferrari, "What is an object?" in CVPR, 2010.

20. A. Al-Kabbany. Project webpage: Automatic generation of trimaps using the gestalt cues. [Online]. Available: http://www.site.uottawa.ca/%7Eaalka046/spie2015matting/index-spie15matting.html

21. C. Rhemann, C. Rother, and M. Gelautz, "Improving color modeling for alpha matting," in BMVC, 2008.

22. A. L. Dulmage and N. S. Mendelsohn, "Coverings of bipartite graphs," Canadian Journal of Mathematics, vol. 10, pp. 517–534, 1958.

23. Y. Chuang, A. Agarwala, B. Curless, D. Salesin, and R. Szeliski. Video matting of complex scenes. [Online]. Available: http://grail.cs.washington.edu/projects/digital-matting/video-matting/