
Chapter 6

Previous Work on Perceptual Grouping

The complexity of natural images limits the utility of many visual processing algorithms in everyday environments: the robustness and performance of tasks such as model-based recognition, 3-D reconstruction, motion analysis, and appearance-based feature matching degrade quickly in unconstrained natural environments. This has motivated a vast research effort aimed at identifying regularities in the input data that allow subsets of features corresponding to single objects to be identified. These subsets can then be processed separately from the remaining features in the image, which reduces the computational burden on high-level modules and increases robustness. However, if such grouping is to be useful in unconstrained environments, the chosen grouping cues must be general and context-independent, and the grouping must be performed efficiently.

In the early 20th century, Gestalt psychologists [112], [56] proposed that human observers organize visually perceived stimuli using general cues such as proximity, similarity, good continuation, and closure. These grouping principles describe regularities in the perceived stimuli which suggest that certain subsets of features have a common origin. The Gestalt laws of perceptual organization, as they are commonly known, have become a cornerstone of feature grouping in computational vision, and are present in some form in most perceptual grouping algorithms developed to date.


Perceptual grouping is the process of clustering together perceived stimuli that exhibit certain regularities which generally indicate a common origin [115]. This definition leaves ample room in choosing the nature of the perceived stimuli, and indeed perceptual grouping spans a wide range of problems: from grouping edgels into primitives such as curves and lines, and detecting feature groups that can bootstrap model-based recognition algorithms, to extracting object contours from images and fitting 3-D surfaces to range measurements. Sarkar and Boyer [88] proposed a classification scheme for perceptual grouping that organizes algorithms by the dimensionality of the problem (e.g. 2-D vs. 3-D, and the presence or absence of a time dimension) and by the level of the input primitives used (e.g. edgels and filter responses at the lowest level, followed by lines and curve segments, with primitives increasing in complexity at each succeeding level). The resulting classification hierarchy illustrates the broad spectrum of perceptual organization tasks that have been explored by the vision community.

Given the breadth of the grouping task, a review that spans the complete field would be a monumental undertaking. Here, we limit the scope of our review to algorithms that are directly related to the problem of contour extraction from single images. Readers interested in a wider sample of current research in perceptual grouping are referred to [10].

6.1 The Search for Salient Features

David Lowe [58], [59] introduced perceptual grouping as a tool to reduce the complexity of the object recognition task. In his work, a set of lines extracted from an image is searched for groups of lines that correspond to the projection of known 3-D wire-frame models of simple objects. Matching involves computing the model transformation (3-D rotation, translation, and scaling) that yields the smallest error between the observed features and the predicted locations of the model's projected edges.

Searching the complete space of possible correspondences quickly becomes hopeless for large line sets. However, Lowe noted that particular relationships between lines, such as collinearity, parallelism, and endpoint proximity, cause certain groups of lines to stand out from otherwise random clutter (in other words, these groups of lines are perceptually salient). He also noted that such relationships mirror the behavior of projected object boundaries. In particular, under orthographic projection, collinearity and parallelism in 3-D are preserved in the 2-D image, so collinear or parallel lines in the image are likely to reflect the same relationships in 3-D; similarly, 2-D junctions are likely to result from the projection of 3-D object vertices.
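The pairwise relations Lowe relied on can be tested with simple geometry. The following sketch is illustrative only: it represents a segment as a pair of endpoints and uses fixed, hypothetical tolerances (`max_angle`, `max_gap`, `max_offset`); these thresholds are simplifications and not the criteria used in Lowe's system.

```python
import math

def orientation(seg):
    """Undirected orientation of a segment ((x1, y1), (x2, y2)), in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def angle_between(a, b):
    """Smallest difference between two undirected orientations."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def distance_to_line(p, seg):
    """Perpendicular distance from point p to the infinite line through seg."""
    (x1, y1), (x2, y2) = seg
    num = abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1))
    return num / math.hypot(x2 - x1, y2 - y1)

def pair_relations(s1, s2, max_angle=math.radians(5), max_gap=5.0, max_offset=2.0):
    """Lowe-style relations (hypothetical thresholds) between two segments."""
    rel = set()
    if angle_between(orientation(s1), orientation(s2)) <= max_angle:
        rel.add("parallel")
        # Collinear: parallel, and both endpoints of s2 lie close to the line of s1.
        if all(distance_to_line(p, s1) <= max_offset for p in s2):
            rel.add("collinear")
    if min(math.dist(p, q) for p in s1 for q in s2) <= max_gap:
        rel.add("endpoint-proximity")
    return rel

s1 = ((0, 0), (10, 0))
print(sorted(pair_relations(s1, ((12, 0.5), (20, 0.5)))))  # ['collinear', 'endpoint-proximity', 'parallel']
print(sorted(pair_relations(s1, ((0, 10), (10, 10)))))     # ['parallel']
```

In a Lowe-style system, pairs that satisfy such relations would be the candidate groups used to index into the model database.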

Lowe’s system used groups of features that possessed one of the above properties to index into the model database; an initial match was determined from a pair of parallel lines, a group of collinear segments, or the hypothesized projection of a vertex. This initial correspondence was then used to search for additional features that either refine or reject the initial match. Lowe’s work demonstrated that perceptual grouping increases recognition performance and robustness (Grimson [39] and Clemens and Jacobs [20] later investigated the potential savings in computation time and storage that can be obtained by using feature groups for recognition). More importantly, it started a vast research effort aimed at detecting perceptually salient groups of features in images.

Ullman and Sha’ashua [104] proposed assigning a measure of saliency to each feature in the image based on a combination of length and smoothness at a particular scale. They identified a property that has since gained importance in contour grouping, namely, that the saliency of a particular segment depends on the saliency of the contour passing through it. Since in general many contours can pass through any particular feature, the saliency of a feature may vary depending on the contour selected.

Ullman and Sha’ashua define a saliency measure Φ that applies to points along a curve; Φ grows as the curve becomes longer and/or smoother. Φ is computed using a locally connected network of processing elements. Each element in the grid corresponds to a pixel and has a predefined number of input and output connections to its neighbors. Each of these connections responds to an image by signaling the presence or absence of a line segment joining the two neighbors at a particular orientation.

Finding the saliency at a pixel $p_i$ then becomes a problem of maximizing $\Phi_N$ over all contours of length $N$ that pass through $p_i$. For each contour, saliency is measured as a weighted sum of the saliencies of the individual elements that form the curve. Ullman and Sha’ashua show that if the saliency measure has certain properties, this optimization problem can be solved in time linear in $N$. The optimization is iterative: each iteration updates the estimated saliency of every link in the saliency network, and at the end of the procedure every link in the network holds its saliency. Experimental results demonstrate that this network can identify salient groups of segments in images with significant background clutter.
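The iterative update can be illustrated with a small dynamic-programming sketch. It assumes a simplified recurrence in which each element's saliency $E_i$ is its local saliency $\sigma_i$ plus an attenuated contribution from its best continuation, $E_i \leftarrow \sigma_i + \rho \max_j f_{ij} E_j$; the names `sigma`, `rho`, and `coupling` (standing in for a smoothness-dependent factor $f_{ij}$) are illustrative assumptions rather than the exact quantities defined by Ullman and Sha’ashua. Each of the $N$ iterations touches every link once, which is where the linear dependence on $N$ comes from.

```python
def saliency_network(sigma, neighbors, coupling, rho, n_iters):
    """Simplified Ullman-Sha'ashua-style saliency iteration (illustrative).

    sigma:     {element: local saliency, e.g. 1.0 if an edge is present}
    neighbors: {element: [elements it can continue into]}
    coupling:  {(i, j): factor in [0, 1], lower for sharper turns}
    rho:       attenuation applied at each step along the curve
    """
    E = dict(sigma)  # iteration 0: local saliency only
    for _ in range(n_iters):
        # Synchronous update: the right-hand side reads the previous iteration's E.
        E = {
            i: sigma[i] + rho * max(
                (coupling[(i, j)] * E[j] for j in neighbors.get(i, [])), default=0.0
            )
            for i in sigma
        }
    return E

# A three-element chain a -> b -> c and an isolated element d.
sigma = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
neighbors = {"a": ["b"], "b": ["c"]}
coupling = {("a", "b"): 1.0, ("b", "c"): 1.0}
E = saliency_network(sigma, neighbors, coupling, rho=0.5, n_iters=2)
print(E)  # {'a': 1.75, 'b': 1.5, 'c': 1.0, 'd': 1.0}
```

Elements at the start of longer, smoother chains accumulate more saliency than isolated ones, which is the behavior the saliency network is designed to produce.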

Mohan and Nevatia [74], [75] proposed a grouping scheme that organizes line segments into progressively more complex structures. First, proximity and continuity are used to group lines into curve fragments, and curve fragments into contours. Every pair of curves or contours is then checked for symmetry, and each symmetric pair gives rise to a hypothetical closed group called a ribbon. Together, curves, contours, symmetries (pairs of symmetric curves), and ribbons define a constraint satisfaction network in which nodes correspond to features and arcs encode the relationships between them. Given a set of functions that evaluate the amount of support in favour of each relationship, the network determines the set of hypotheses with maximal support.

Sarkar and Boyer [88], [89], [90] describe a system that uses Bayesian networks (extended for perceptual grouping, and called perceptual inference networks in their work) to organize features hierarchically into progressively more complex primitives. Each level of the hierarchy is built using voting methods and graph operations, and the voting itself is guided by the principles of proximity, similarity, smooth continuity, and closure. Each of these Gestalt principles is used to generate a separate graph whose nodes are image features at some level of the hierarchy, and whose arcs indicate that the corresponding relationship holds between two nodes.


Figure 6.1: Extension fields for a) an oriented edgel (red) and b) a point (red). The length of the lines indicates the magnitude of the extension field at a particular distance and orientation.

A voting scheme is used to determine whether a given relationship holds. Given a compatibility relationship R that is a function of N input tokens (features at some level), an N-dimensional voting space is constructed; in this space, each token votes for all points satisfying the compatibility relationship R. If two tokens vote for a common location, the tokens are associated. A perceptual inference network encodes the particular dependencies between features that indicate the presence of simple shapes such as circles, ellipses, rectangles, and polygons. The output of this network is a set of hypotheses of salient structures, along with the estimated probability of each hypothesis.

Guy and Medioni [40] propose a different voting method known as Tensor Voting. The input consists of a set of points and oriented edgels. Each of these features votes within a tensor field that spans its local neighborhood; the votes follow the feature's extension field, which incorporates the Gestalt principles of co-curvilinearity, constancy of curvature, and proximity, together with a bias toward smaller curvature. In other words, each feature votes for the set of orientations that neighboring features should have in order to form a group with the desired Gestalt properties. The extension fields for points and oriented edgels are shown in Figure 6.1.

At each pixel, votes are accumulated from all features whose extension fields overlap the pixel. The orientation and weight of each feature's vote are accumulated into a covariance matrix, which is then decomposed into its eigenvectors and eigenvalues. This eigendecomposition summarizes the statistics of the votes received at each location.

The results of the vector vote at each location describe an ellipse $(\lambda_{\max}, \lambda_{\min}, \theta)$ from which several observations can be made. First, $\lambda_{\max}$ is related to the number of votes collected, so it can be used as a raw saliency measure. Second, the difference between the eigenvalues indicates the eccentricity of the ellipse. Features such as points and junctions are expected to have low eccentricity, since votes accumulate over many directions. Curves, on the other hand, are locally represented by an ellipse with high eccentricity, indicating a strongly preferred orientation. Based on these observations, Guy and Medioni define the saliency map for curves as $\lambda_{\max} - \lambda_{\min}$ at each pixel in the image.
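The per-pixel analysis can be sketched as follows. The sketch assumes the votes arriving at one pixel are already available as (weight, orientation) pairs; it does not model the extension fields themselves. Each vote is added to a 2×2 symmetric matrix as a weighted outer product of its unit direction, and the eigenvalues of that matrix give the curve saliency $\lambda_{\max} - \lambda_{\min}$ and the junction saliency $\lambda_{\min}$. The function names are hypothetical, and the closed-form 2×2 eigen-analysis is simply a convenient choice.

```python
import math

def accumulate_votes(votes):
    """Sum weighted outer products of unit vote directions into a 2x2 matrix.

    votes: iterable of (weight, angle) pairs received at one pixel, angle in radians.
    Returns (a, b, c) for the symmetric matrix [[a, b], [b, c]].
    """
    a = b = c = 0.0
    for w, ang in votes:
        dx, dy = math.cos(ang), math.sin(ang)
        a += w * dx * dx
        b += w * dx * dy
        c += w * dy * dy
    return a, b, c

def eig_sym2(a, b, c):
    """Closed-form eigenvalues and major-axis angle of [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return mean + radius, mean - radius, theta

def pixel_saliency(votes):
    """Curve saliency (lambda_max - lambda_min), junction saliency (lambda_min),
    and preferred orientation at one pixel."""
    lam_max, lam_min, theta = eig_sym2(*accumulate_votes(votes))
    return {"curve": lam_max - lam_min, "junction": lam_min, "orientation": theta}

print(pixel_saliency([(1.0, 0.1), (2.0, 0.12)]))         # votes agree: large "curve" value
print(pixel_saliency([(1.0, 0.0), (1.0, math.pi / 2)]))  # orthogonal votes: large "junction" value
```

Because each vote enters as an outer product, orientations θ and θ + π contribute identically, so the accumulated matrix captures orientation without polarity.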

Figures 6.2 and 6.3 show the saliency maps for curves (local maxima of $\lambda_{\max} - \lambda_{\min}$) and junctions (local maxima of $\lambda_{\min}$), as well as the extracted curves, for two sets of input edgels. The results are instructive. On the first image, where most of the objects of interest are nearly circular, the algorithm recovers many of the boundary contours, but also introduces curves that join together edgels from texture elements. The saliency map correctly assigns high saliency to contours along many of the object boundaries.

On the second image, the abundance of straight lines rather than curves has a noticeable effect on the saliency map: most of the structure in the image receives similar saliency values, and junction identification becomes problematic (most likely because junctions occur all over the image; see the junction saliency map in Figure 6.3). Both figures were generated using the software provided by the authors at [70].

For both curves and junctions, it is difficult to select a saliency threshold that keeps most of the interesting image structure while rejecting responses from texture or noise. Also, depending on the parameters of the extension fields, the algorithm may join separate structures that happen to lie along smooth, curved paths. Even so, the scheme demonstrates that voting can be used to obtain a local estimate of the saliency of image features. It is worth mentioning that this voting scheme has been demonstrated in both 2-D and 3-D [41], and has also been applied to motion analysis and tracking [57].

Figure 6.2: Tensor voting results. a) Input features, b) saliency map for curves (yellow indicates high saliency), c) saliency map for junctions, d) extracted contours (green) and extracted junctions (white).

Figure 6.3: Tensor voting results. a) Input features, b) saliency map for curves (yellow indicates high saliency), c) saliency map for junctions, d) extracted contours (green) and extracted junctions (white).

Several grouping methods based on graph partitioning have been proposed. Sarkar and Boyer [91] build a relation graph (encoded in matrix form) whose nodes are image features and whose arcs encode Gestalt relations, and then use the top eigenvectors and eigenvalues of this matrix to obtain statistical information about the feature groups present in the image. They show that this information can be used to estimate the degree of construction development in aerial photographs of a particular site: a highly developed site will have many tight clusters of features corresponding to separate artificial structures such as buildings and houses, which has a measurable effect on the eigenvalues of the graph.

Perona and Freeman [84] propose an algorithm that groups input features into two sets, corresponding to foreground and background. Their algorithm computes an affinity matrix that contains the similarities between feature pairs, and then reduces these pairwise similarities to a point-wise saliency measure using the top eigenvector of the affinity matrix. They show that certain structures, such as groups of segments that exhibit smooth continuation, can be extracted from random clutter in this fashion. Amir and Lindenbaum [1] proposed a grouping method that finds clusters of features in a graph using maximum likelihood estimation; evidence accumulation is also used to increase the reliability of the selected grouping cue. Their algorithm is demonstrated on features from different domains.
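The reduction from pairwise affinities to a point-wise saliency can be sketched with power iteration, assuming a symmetric affinity matrix with non-negative entries (so that, by the Perron-Frobenius theorem, the top eigenvector can be chosen with non-negative components). How Perona and Freeman build their affinities and separate foreground from background is not reproduced here; the `threshold` parameter and the toy matrix are hypothetical.

```python
import math

def top_eigenvector(A, iters=200):
    """Power iteration for the dominant eigenvector of a symmetric,
    non-negative affinity matrix A (a list of rows)."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n  # positive starting vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def foreground(A, threshold):
    """Indices of features whose eigenvector component exceeds `threshold`."""
    return [i for i, x in enumerate(top_eigenvector(A)) if x > threshold]

# Features 0, 1, 2 are strongly similar to each other; features 3 and 4 are clutter.
A = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 1.0],
]
print(foreground(A, threshold=0.3))  # [0, 1, 2]
```

Features that are mutually consistent reinforce each other through the matrix products, so their components dominate the eigenvector while isolated clutter receives small values.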

More recently, Sarkar and Soundararajan [92] proposed a method that builds a graph encoding parallelism, perpendicularity, continuity, and proximity. The appropriate link to join two nodes in the graph is selected using a Bayesian network. Once the graph has been constructed, graph partitioning techniques are used to identify groups of features that are weakly connected to the remaining nodes in the graph (their formulation is related to the Normalized Cuts graph partitioning algorithm). Additionally, they propose using supervised learning to adapt the parameters of their algorithm and to determine the relative importance of different grouping cues in a particular domain.


Fischer and Buhmann [31] proposed a greedy algorithm that starts with single-feature groups and grows clusters of features by repeatedly merging smaller groups so as to minimize an objective function. The objective function is based on pairwise similarities between features; in the case of edgels, it is influenced by proximity, co-circularity, and smoothness. Ren and Malik [86] propose an algorithm based on image statistics obtained from human segmentations. The statistics of human-extracted boundaries are used to estimate the prior probabilities of contour shapes, as well as the likelihood of particular image measurements. High-order Markov models are used to encode contour shape, and the algorithm incorporates multi-scale information by using contours extracted at coarser scales to condition the prior probabilities at finer scales.

A different class of algorithms uses biologically motivated methods for grouping features. Field et al. [30] introduced the concept of an association field along which perceptual interactions between contour segments tend to occur. This association field determines the contribution of a particular activation unit to other activation units within its neighborhood. Figure 6.4 shows this association field (usually called the bipole icon), as well as the variables that determine its strength at every point within a unit's neighborhood.

The bipole icon forms the basis of the Boundary Contour System (BCS) [73]. BCS is a neural network that performs contour completion iteratively. Layers of contrast-sensitive cells that respond to image edges interact with layers of bipole units; the bipole units provide positive feedback that activates edge-sensitive cells spanning the gaps between detected image edges. This process is illustrated in Figure 6.5. For a thorough review of BCS and related schemes, the reader is referred to [77].
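The feedback loop illustrated in Figure 6.5 can be mimicked with a deliberately crude 1-D toy: binary contrast-sensitive cells along a line, and a bipole unit at each position that fires only when both of its lobes see an active cell. This is not the BCS model itself, whose units have graded activations and richer interactions; the `reach` parameter (the lobe length) is a hypothetical stand-in for the range of the bipole interactions.

```python
def bcs_fill(cells, reach, max_iters=100):
    """Toy 1-D gap filling: iterate until no new cell switches on.

    cells: 0/1 activations of contrast-sensitive units along a line.
    A bipole unit at position k fires when both of its lobes, the `reach`
    cells on either side of k, contain an active cell; it then activates
    the contrast-sensitive cell at k (positive feedback).
    """
    cells = list(cells)
    n = len(cells)
    for _ in range(max_iters):
        fire = [any(cells[max(0, k - reach):k]) and any(cells[k + 1:k + 1 + reach])
                for k in range(n)]
        updated = [int(c or f) for c, f in zip(cells, fire)]
        if updated == cells:
            break  # no change: the network has settled
        cells = updated
    return cells

print(bcs_fill([1, 0, 0, 0, 1], reach=2))     # [1, 1, 1, 1, 1] after two rounds of feedback
print(bcs_fill([1, 0, 0, 0, 0, 1], reach=2))  # [1, 0, 0, 0, 0, 1]: the gap exceeds the lobe range
```

As in the figure, the centre cell switches on first, and its activation then lets the neighbouring bipole units fire in the next round.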

The algorithms discussed so far do not require that the resulting groups form closed contours; therefore, in general, they do not provide a means for carrying out figure-ground segmentation. The following section presents current research in closed contour extraction; the algorithms presented there are particularly relevant to our own research, and are representative of the state of the art in contour grouping.


Figure 6.4: Activation field for the unit shown in red; this field is also known as the 'bipole icon'. The contribution of the red unit to the activation of the purple unit depends on r, the distance between the units; θ, the angle between the orientation of the first unit and the line joining both units; and φ, the difference in orientation between the units. Notice also that this field is quite similar to the extension field for oriented edgels used in Tensor Voting (see Fig. 6.1).

6.2 Closure and Contour Extraction

Elder and Zucker [24], [26] examined the nature of the closure cue. Their results showed that shape processing is rapid for closed boundaries, and that response speed is a monotonic function of the degree of closure. They also proposed that the degree of closure $C$ of a contour with $n$ gaps can be measured using the $L_2$ norm of the gap lengths, normalized by the contour length:

$$C = 1 - \frac{1}{l}\sqrt{\sum_{i=1}^{n} g_i^2}, \qquad (6.1)$$

where $g_i$ is the length of the $i$th gap in the contour, and $l$ is the contour length.
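Eq. (6.1) is straightforward to evaluate. The short sketch below (the function name is hypothetical) also makes one consequence of the $L_2$ form visible: for a fixed total gap length, splitting it into several small gaps yields a higher degree of closure than leaving it as one large gap.

```python
import math

def degree_of_closure(gaps, contour_length):
    """Degree of closure C = 1 - sqrt(sum_i g_i^2) / l, Eq. (6.1)."""
    return 1.0 - math.sqrt(sum(g * g for g in gaps)) / contour_length

print(degree_of_closure([], 50.0))          # 1.0: a fully closed contour
print(degree_of_closure([5.0], 50.0))       # 0.9: one gap of length 5
print(degree_of_closure([1.0] * 5, 50.0))   # about 0.955: five gaps of length 1
```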

In further research [27], Elder and Zucker presented an algorithm for smooth, closed contour extraction. The input to the algorithm is a set of extended tangents, which represent contours as sequences of disjoint line segments. Contours are composed of sequences of extended tangents, and closed contours correspond to cyclic tangent sequences. To find the closed contours, they develop a measure of the likelihood of a tangent sequence based on the geometric and photometric properties of neighboring tangents in the curve. Figure 6.6 shows the quantities that are used to estimate the likelihood of a particular link: the lengths $l_1$ and $l_2$ of the extended tangents, the length $r$ of the straight line connecting the tangents, the angles $\theta_a$ and $\theta_b$ between the tangents and the connecting line, and the brightness differences $\Delta i_h$ and $\Delta i_l$ on the bright and dark sides of the tangents, respectively.

Figure 6.5: Gap filling in BCS. The red units are contrast-sensitive and respond to image edges. a) Initially, only two contrast-sensitive units are active because of a gap in the contour. These units feed bipole units at a higher level in the network; bipole unit 1 is active since both of its lobes are being fed by contrast-sensitive cells, while unit 2 is inactive. b) Bipole unit 1 feeds back to the contrast-sensitive layer and causes an additional cell to become active; this cell in turn causes bipole unit 2 to become active. c) Bipole unit 2 causes yet another contrast-sensitive cell to become active. The process is iterative, and as a result gaps in the input are filled quickly. The range and strength of these interactions are controlled by the parameters of the algorithm.

Figure 6.6: Two extended tangents and the quantities that determine the likelihood of a link between them. $l_1$ and $l_2$ are the tangent lengths; $r$ is the length of the line connecting them, which also defines $\theta_a$ and $\theta_b$. The brightness on each side of the tangents is used to calculate the brightness differences on the bright and dark sides of the tangents.

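Setting $o = \{l_1, l_2, r, \theta_a, \theta_b, \Delta i_l, \Delta i_h\}$, Bayes' theorem is used to determine the posterior probability of a link between two tangents:

$$p(t_1 \to t_2 \mid o) = \frac{p(o \mid t_1 \to t_2)\, p(t_1 \to t_2)}{p(o)}, \qquad (6.2)$$

which, after expanding $p(o) = p(o \mid t_1 \to t_2)\, p(t_1 \to t_2) + p(o \mid t_1 \nrightarrow t_2)\, p(t_1 \nrightarrow t_2)$, can be rewritten as

$$p(t_1 \to t_2 \mid o) = \frac{1}{1 + LP}, \qquad (6.3)$$

where

$$L = \frac{p(o \mid t_1 \nrightarrow t_2)}{p(o \mid t_1 \to t_2)}, \qquad P = \frac{p(t_1 \nrightarrow t_2)}{p(t_1 \to t_2)}, \qquad (6.4)$$

and $t_1 \nrightarrow t_2$ represents the hypothesis that there is no link between the tangents. Elder and Zucker propose appropriate formulas to evaluate the likelihood and prior ratios $L$ and $P$, and use them to generate a graph in which each tangent is joined to its $m$ most likely continuants.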
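Eqs. (6.3) and (6.4) translate directly into code. The sketch below assumes that the two hypotheses are exhaustive, so $p(t_1 \nrightarrow t_2) = 1 - p(t_1 \to t_2)$, and that the likelihoods have already been evaluated by some model of the observations $o$ (Elder and Zucker's specific formulas are not reproduced). The helper that keeps the $m$ most likely continuants per tangent is likewise a hypothetical illustration of the graph construction step.

```python
def link_posterior(lik_link, lik_no_link, prior_link):
    """p(t1 -> t2 | o) = 1 / (1 + L P), Eqs. (6.3)-(6.4).

    lik_link    = p(o | t1 -> t2)
    lik_no_link = p(o | t1 -/-> t2)
    prior_link  = p(t1 -> t2); the no-link prior is taken as 1 - prior_link.
    """
    L = lik_no_link / lik_link            # likelihood ratio
    P = (1.0 - prior_link) / prior_link   # prior ratio
    return 1.0 / (1.0 + L * P)

def most_likely_continuants(posteriors, m):
    """posteriors: {(i, j): p(i -> j | o)}. Keep the m best continuants j of each i."""
    candidates = {}
    for (i, j), p in posteriors.items():
        candidates.setdefault(i, []).append((p, j))
    return {i: [j for _, j in sorted(c, reverse=True)[:m]] for i, c in candidates.items()}

# Agrees with direct Bayes: 0.8 * 0.1 / (0.8 * 0.1 + 0.2 * 0.9)
print(round(link_posterior(0.8, 0.2, 0.1), 4))  # 0.3077
```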

Defining the likelihood of a tangent sequence as the product of the likelihoods of all links in the curve, the algorithm finds the maximum likelihood cycle through each tangent. Since every link likelihood is at most one, its logarithm is non-positive, and maximizing the product is equivalent to minimizing the sum of the absolute values of the link log-likelihoods. This is a shortest-path problem with non-negative edge weights, which is solved efficiently using Dijkstra's algorithm. The paper shows sample contours that demonstrate that the method is indeed capable of extracting closed boundaries.
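A minimal sketch of the maximum likelihood cycle search, assuming the link probabilities are given as a dictionary of dictionaries with values in (0, 1]: Dijkstra's algorithm runs from the start tangent with edge weights $-\log p$, and the cycle is closed with the cheapest link back to the start. The data layout and function name are assumptions for illustration.

```python
import heapq
import math

def max_likelihood_cycle(links, start):
    """Most likely cycle through `start` in a graph of link probabilities.

    links: {u: {v: p_uv}} with 0 < p_uv <= 1. Returns (likelihood, cycle)
    or None if no cycle passes through `start`.
    """
    dist, prev = {start: 0.0}, {}
    heap, done = [(0.0, start)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, p in links.get(u, {}).items():
            if v == start:
                continue  # links back to the start are used only to close the cycle
            nd = d - math.log(p)  # -log p >= 0, so Dijkstra applies
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Close the cycle with the cheapest link back to the start tangent.
    best, last = math.inf, None
    for u in done:
        p = links.get(u, {}).get(start)
        if p is not None and dist[u] - math.log(p) < best:
            best, last = dist[u] - math.log(p), u
    if last is None:
        return None
    cycle = [last]
    while cycle[-1] != start:
        cycle.append(prev[cycle[-1]])
    return math.exp(-best), cycle[::-1]

links = {"a": {"b": 0.9, "c": 0.5}, "b": {"c": 0.9}, "c": {"a": 0.9}}
likelihood, cycle = max_likelihood_cycle(links, "a")
print(cycle, round(likelihood, 3))  # ['a', 'b', 'c'] 0.729
```

Because every extra link multiplies the likelihood by a factor of at most one, this formulation favours short cycles, which is the bias toward small boundaries noted for this algorithm.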

The above algorithm finds optimal closed paths; however, it is biased toward small boundaries (each additional link can only lower the product of link likelihoods), and global constraints such as simplicity (contours should not self-intersect) cannot be incorporated into the framework. Elder et al. [25] extend the above algorithm in several important ways. First, instead of using a shortest-path algorithm to find optimal contours, they propose a constructive algorithm that extends contours iteratively. Second, they incorporate domain knowledge in the form of priors, computed from image statistics, for several of the probabilistic components of the algorithm.

In this new algorithm, contours are still treated as ordered sequences of extended tangents, but instead of finding paths starting at each tangent in turn, the algorithm builds a set $S_2 = \{s_2^1, \ldots, s_2^{n_2}\}$ of all contours of length 2 that can be formed in the image. Next, the posterior probability that each of these groups corresponds to an actual contour is estimated, and only the $N_2$ groups with the highest posterior are kept. From these, a set $S_3 = \{s_3^1, \ldots, s_3^{n_3}\}$ is computed containing all contours of length 3 that can be generated from the retained length-2 groups. This procedure is repeated until a maximum contour length is reached; its constructive nature allows global constraints to be tested and enforced at each step.
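The constructive procedure is essentially a beam search over tangent sequences. The sketch below is illustrative only: it scores a sequence by the product of its link probabilities as a stand-in for the posterior estimated by Elder et al., uses a single hypothetical `keep` value in place of the per-length limits $N_2, N_3, \ldots$, and enforces simplicity (no tangent used twice) plus an optional user-supplied `is_valid` constraint at every step.

```python
import heapq

def constructive_search(links, keep, max_len, is_valid=lambda seq: True):
    """Grow tangent sequences one link at a time, keeping only the `keep`
    best-scoring sequences of each length. Returns the closed sequences
    found, best first, as (score, sequence) pairs.

    links: {i: {j: p_ij}} link probabilities between tangents.
    """
    # S_2: every sequence made of one link between two distinct tangents.
    level = [(p, (i, j)) for i, nbrs in links.items() for j, p in nbrs.items() if i != j]
    closed = []
    for _ in range(2, max_len + 1):
        level = heapq.nlargest(keep, level)  # prune to the best groups of this length
        extended = []
        for score, seq in level:
            for j, p in links.get(seq[-1], {}).items():
                if j == seq[0]:
                    closed.append((score * p, seq))  # this link closes the contour
                elif j not in seq and len(seq) < max_len and is_valid(seq + (j,)):
                    extended.append((score * p, seq + (j,)))
        level = extended
    return sorted(closed, reverse=True)

links = {"a": {"b": 0.9, "c": 0.5}, "b": {"c": 0.9}, "c": {"a": 0.9}}
best_score, best_seq = constructive_search(links, keep=10, max_len=3)[0]
# The same cycle is found once per rotation; sorting shows which tangents it uses.
print(sorted(best_seq), round(best_score, 3))  # ['a', 'b', 'c'] 0.729
```

Unlike the shortest-path formulation, the explicit sequence is available at every step, so constraints such as simplicity can be checked before a sequence is extended.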

Elder et al. evaluate this algorithm on lake boundary extraction from GIS (Geographic Information Systems) images. Statistics from several lake images are used to determine the various prior probabilities used in the algorithm; the method is then evaluated on a separate set of lake images and compared against boundaries traced by human observers, as well as those obtained with an active contour method. Their results indicate that for this task the algorithm yields accurate boundary estimates, in one sense even outperforming human observers: the distance between the contour traced by the algorithm and the contours traced by the human observers is in general smaller than the distance between contours traced by different humans. Johnston and Elder [50] have further enhanced the constructive algorithm so that, instead of making local, greedy choices at each step of the construction process, it weights contour continuations according to an estimated probability that they will lead to a closed boundary. The extended algorithm uses Hidden Markov Models to guide the contour growing process.

Wang et al. [107] propose a graph-based algorithm that addresses the bias toward short boundaries in the original shortest-path algorithm of Elder and Zucker. They also formulate contour extraction as finding an optimal cycle in a graph, but in their formulation the length of the contour is used as a normalization factor, giving the cycle ratio of a contour $C$:

$$CR(C) = \frac{\sum_{e \in C} w(e)}{\sum_{e \in C} l(e)}, \qquad (6.5)$$

where $w(e)$ is the weight associated with edge $e$ in a graph that encodes closure, proximity, and continuity, and $l(e)$ is the length of the edge. Wang et al. propose an algorithm for finding minimum cycle-ratio contours in polynomial time, and show experimental results on a few images.
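The effect of the normalization in Eq. (6.5) can be checked by brute force on a toy graph. The sketch enumerates the simple cycles of a small directed graph and compares the cycle with the smallest total weight against the one with the smallest cycle ratio; it is not the polynomial-time algorithm of Wang et al., and the graph, weights, and lengths are made up for illustration.

```python
def cycle_edges(cycle):
    """Directed edges of a cycle given as a node list, e.g. [0, 1] -> [(0, 1), (1, 0)]."""
    return list(zip(cycle, cycle[1:] + cycle[:1]))

def cycle_ratio(cycle, weight, length):
    """CR(C) = sum of edge weights / sum of edge lengths, Eq. (6.5)."""
    edges = cycle_edges(cycle)
    return sum(weight[e] for e in edges) / sum(length[e] for e in edges)

def simple_cycles(adj):
    """All simple directed cycles, each reported once, starting at its smallest node."""
    found = []

    def extend(root, path):
        for nxt in adj.get(path[-1], []):
            if nxt == root:
                found.append(list(path))
            elif nxt > root and nxt not in path:
                extend(root, path + [nxt])

    for root in sorted(adj):
        extend(root, [root])
    return found

# A short cycle (0-1) with costly edges, and a longer cycle (0-2-3-4) with cheaper edges.
adj = {0: [1, 2], 1: [0], 2: [3], 3: [4], 4: [0]}
weight = {(0, 1): 2.0, (1, 0): 2.0, (0, 2): 1.5, (2, 3): 1.5, (3, 4): 1.5, (4, 0): 1.5}
length = {e: 1.0 for e in weight}

cycles = simple_cycles(adj)
print(min(cycles, key=lambda c: sum(weight[e] for e in cycle_edges(c))))  # [0, 1]
print(min(cycles, key=lambda c: cycle_ratio(c, weight, length)))          # [0, 2, 3, 4]
```

Minimizing the total weight picks the short cycle simply because it has fewer edges, while the cycle ratio prefers the longer cycle whose edges are cheaper on average.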

Williams and Jacobs [113] proposed a contour completion algorithm motivated by the work on elastica for computer vision by Mumford [76] and by Nitzberg and Mumford [79]. In this algorithm, the probability that two features $t_i$ and $t_j$ (oriented edgels, as in Tensor Voting) form a continuous contour is estimated as the probability that a particle undergoing a random walk starting at $t_i$ will visit $t_j$. The random walks used for this purpose are biased toward generating short, smooth paths. Particles in this formulation have both a position and an orientation, and the random walk is defined by a motion equation and a decay constant which determines the particle's average lifetime. A particle's position and orientation are updated according to the following differential equations:

$$\dot{x} = \cos\theta, \qquad \dot{y} = \sin\theta, \qquad \dot{\theta} = \kappa(0, \sigma^2; t),$$

where $\dot{x}$ and $\dot{y}$ represent the change in position, $\dot{\theta}$ represents the change in orientation, and $\kappa(0, \sigma^2; t)$ is a normally distributed random variable with zero mean and variance $\sigma^2$. The effect of these equations is that the particle tends to travel in a straight line, but its orientation drifts over time, with the variance of the drift proportional to $\sigma^2$. Additionally, a particle decays at a rate controlled by $1 - \exp(-1/\tau)$, so that longer walks are less likely. Figure 6.7 shows multiple random walks for different values of $\sigma$.

Figure 6.7: Each image shows 1000 simulated random walks for the indicated value of σ. Color represents the probability that the particle will visit a particular point within the image; yellow is most likely, followed by purple and blue. a) σ = π/64, b) σ = π/32, c) σ = π/16. The particle starts at the center of the red circle, and its initial orientation is along the positive x axis; the effects of particle decay are not shown. Notice the similarity of these figures to the bipole icon and the Tensor Voting extension field (see Figs. 6.4 and 6.1).
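A Monte Carlo sketch in the spirit of Figure 6.7 discretizes the motion equations with unit time steps (a simple Euler scheme, which is an assumption of this sketch) and lets a particle survive each step with probability $\exp(-1/\tau)$, i.e. it decays with probability $1 - \exp(-1/\tau)$ per step. Visits are counted on an integer grid; the function name, the grid, and the step size are illustrative choices rather than details of Williams and Jacobs' formulation.

```python
import math
import random
from collections import Counter

def simulate_walks(n_walks, sigma, tau, max_steps, seed=0):
    """Monte Carlo visit counts for particles leaving an edgel at the origin
    with initial orientation along +x. Returns a Counter keyed by integer cell."""
    rng = random.Random(seed)
    survive = math.exp(-1.0 / tau)  # per-step survival probability
    visits = Counter()
    for _ in range(n_walks):
        x = y = theta = 0.0
        for _ in range(max_steps):
            if rng.random() > survive:
                break  # the particle decays
            # Unit Euler step of x' = cos(theta), y' = sin(theta), theta' ~ N(0, sigma^2).
            x += math.cos(theta)
            y += math.sin(theta)
            theta += rng.gauss(0.0, sigma)
            visits[(round(x), round(y))] += 1
    return visits

counts = simulate_walks(n_walks=1000, sigma=math.pi / 32, tau=20.0, max_steps=60)
print(counts.most_common(5))  # the most visited cells lie close to the positive x axis
```

Normalizing the counts by the number of walks gives a rough estimate of the source field of a single edgel; smaller values of `sigma` produce a narrower, straighter field, as in the figure.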

Williams and Jacobs use the above formulation to define two types of fields: source fields, which correspond to particles leaving an image edge as shown in Figure 6.7, and sink fields, which indicate the probability that a particle starting at a particular position and orientation within an edgel's neighborhood will reach that edgel. The stochastic completion field is simply the product of the source and sink fields; it is large along smooth, continuous contours, including paths that cross gaps between image edgels. In their experiments, Williams and Jacobs show that completion fields can reproduce certain perceptual phenomena such as illusory contours.

Mahamud et al. [60], [61] developed a contour extraction algorithm based on stochastic completion fields. They use the random walk formulation to estimate an affinity between two directed edgels: the transition probability from edgel $i$ to edgel $j$, denoted $P_{ji}$, is the sum of the probabilities of all the paths that the particle can take from $i$ to $j$ (a closed-form computation of this probability was developed by Thornber and Williams [101]). Each edgel in the input gives rise to a pair of directed edgels, one for each direction, and the transition probabilities between all pairs of directed edgels are computed and stored in a transition matrix $P$. Note that $P$ is not a Markov matrix, since its columns are not normalized to represent probability distributions.

The saliencies of edges and contours are then estimated from the eigenvectors and eigenvalues of $P$ as follows. The saliency of a directed edgel $i$ (and of its opposite-direction complement $\bar{i}$) is given by

$$c_i = c_{\bar{i}} = s_i s_{\bar{i}}, \qquad (6.6)$$

where $s_i$ and $s_{\bar{i}}$ are the components corresponding to directed edgels $i$ and $\bar{i}$ in the eigenvector $s$ of $P$ with the largest positive real eigenvalue $\lambda$. The saliency of the link from edge $j$ to edge $i$ is given by

$$C_{ij} = \frac{s_{\bar{i}} P_{ij} s_j}{\lambda}. \qquad (6.7)$$

Since $\sum_j P_{ij} s_j = \lambda s_i$, summing the link saliencies $C_{ij}$ over $j$ recovers the edge saliency $c_i$.

Figure 6.8 shows the edge and link saliencies computed for a set of input edges.
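Eqs. (6.6) and (6.7) can be computed with power iteration. The sketch assumes a hypothetical indexing in which directed edgels $2k$ and $2k+1$ are the two directions of edgel $k$ (so $\bar{i}$ is `i ^ 1`), and iterates with $P + I$ rather than $P$: the shift leaves the eigenvectors unchanged but prevents the oscillation that purely cyclic transition patterns cause in plain power iteration. The toy matrix encodes one closed three-edgel contour (in both directions) plus a clutter edgel with a weak link; all values are made up.

```python
import math

def power_iteration(P, iters=500):
    """Leading eigenpair (lambda, s) of a non-negative matrix P, where
    P[i][j] is the transition weight from directed edgel j to directed edgel i.
    Iterating with P + I keeps the eigenvectors of P but damps the
    oscillation that purely cyclic patterns cause in plain power iteration."""
    n = len(P)
    s = [1.0] * n
    for _ in range(iters):
        t = [s[i] + sum(P[i][j] * s[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in t))
        s = [x / norm for x in t]
    Ps = [sum(P[i][j] * s[j] for j in range(n)) for i in range(n)]
    return sum(Ps) / sum(s), s  # eigenvalue of P (not of P + I)

def saliencies(P):
    """Edge saliencies c (Eq. 6.6) and link saliencies C (Eq. 6.7).
    Assumes directed edgels 2k and 2k + 1 are opposite directions, so ibar = i ^ 1."""
    lam, s = power_iteration(P)
    n = len(P)
    c = [s[i] * s[i ^ 1] for i in range(n)]
    C = [[s[i ^ 1] * P[i][j] * s[j] / lam for j in range(n)] for i in range(n)]
    return c, C

# Edgels A, B, C (directed edgels 0-5) form a closed contour; edgel D (6, 7) is clutter.
P = [[0.0] * 8 for _ in range(8)]
for dst, src in [(2, 0), (4, 2), (0, 4), (1, 3), (3, 5), (5, 1)]:
    P[dst][src] = 0.8  # cycle 0 -> 2 -> 4 -> 0 and its reverse 1 -> 5 -> 3 -> 1
P[6][0] = 0.1          # weak link from the contour into the clutter edgel ...
P[1][7] = 0.1          # ... and its reverse-direction counterpart
c, C = saliencies(P)
print([round(x, 3) for x in c])  # contour edgels score high; the clutter edgel is near zero
```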

Given the edge and link saliencies, contour extraction is performed by finding strongly connected components in the matrix of link saliencies. Starting with the highest-saliency edge $j$, the algorithm constructs the set of all edges that are reachable from $j$. An edge $i$ is reachable from $j$ if there is a path from $j$ to $i$ (passing, in general, through other edges) such that the saliency of every link along the path is greater than some threshold. Once a contour has been extracted, the links between its edgels are deflated by setting $P_{ij} = 0$. The process is then repeated to find the next salient contour. Mahamud et al. show that their algorithm is capable of recovering smooth, closed contours in the presence of random clutter.

Figure 6.8: Edge and link saliencies. a) Input set of edges. b) Estimated edge saliency (the length of each line is proportional to the edge's saliency). c) Most salient links (shades of gray) with overlaid edge saliency; in this case, darker links are more salient. As expected, salient links usually originate from (and link to) salient edges.

The last class of algorithms we discuss treats contour extraction as a constrained search problem. Grimson [39] studied constrained search for feature grouping in the context of model-based object recognition. An initial group (consisting of a single line segment) is expanded by adding other segments one at a time until a stopping criterion is met. In general, there are many possible choices of segments to add to a particular group at any given time, and each of these choices must be explored in turn. The complete set of groups explored for a given initial segment constitutes the interpretation tree for that segment.

Since exploring the full space of possible feature combinations is intractable (the interpretation tree grows exponentially with the size of the groups), objects in a model database are used to constrain the search to groups of segments that are consistent with one of the models (after suitable translation, rotation, and scaling). The resulting interpretation tree contains all combinations of features that are consistent with at least one of the models in the object database.

Grimson noted the importance of using every possible means to constrain the search so as to reduce the size of the interpretation tree. Since contour extraction algorithms are meant to be as general as possible, object-specific constraints are usually not an option. Instead, general contour extraction algorithms must use geometric constraints, Gestalt principles, and other weak priors to prune search paths in the interpretation tree. Perhaps because of the inherent computational complexity of this problem, few methods use constrained search for contour extraction. However, the existing methods have demonstrated that careful use of geometric constraints and perceptual grouping cues can lead to efficient algorithms.

Besides the common Gestalt principles of proximity and smooth continuation, convexity has received a significant amount of attention [81] as a useful cue for constraining group formation. Huttenlocher and Wayner [45] used convexity to constrain the formation of feature groups; they proposed a greedy algorithm that determines the best possible convex group for every segment in the image using proximity and orientation similarity, and extracts convex (not necessarily closed) contours efficiently (the algorithm is linear in the number of input segments). Its greedy nature, however, prevents it from detecting many groups that are present in the image but whose links receive lower scores.
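The greedy strategy can be illustrated with a minimal Python sketch, assuming oriented segments given as `(start, end)` point pairs, a fixed proximity radius `max_gap`, and a hypothetical score that combines gap length with orientation change. The weighting, the convexity test, and all names are our own illustrative choices, not Huttenlocher and Wayner’s formulation. Because the sketch rescans every segment at each step, it also does not reproduce the linear running time reported for the original algorithm.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # oriented segment: (start, end)


def heading(p: Point, q: Point) -> float:
    return math.atan2(q[1] - p[1], q[0] - p[0])


def turn(a: float, b: float) -> float:
    """Signed turning angle in (-pi, pi] from heading a to heading b."""
    d = (b - a) % (2 * math.pi)
    return d - 2 * math.pi if d > math.pi else d


def chain_points(segs: Sequence[Segment], eps: float = 1e-12) -> List[Point]:
    """Polyline running through the segments and the gaps between them."""
    pts: List[Point] = []
    for start, end in segs:
        for p in (start, end):
            if not pts or math.dist(pts[-1], p) > eps:  # skip zero-length gaps
                pts.append(p)
    return pts


def is_convex_chain(pts: Sequence[Point]) -> bool:
    """True if the polyline only turns left (counter-clockwise) and its
    total turning does not exceed one full revolution."""
    total = 0.0
    for a, b, c in zip(pts, pts[1:], pts[2:]):
        t = turn(heading(a, b), heading(b, c))
        if t < -1e-9:
            return False
        total += t
    return total <= 2 * math.pi + 1e-9


def greedy_convex_group(segs: Sequence[Segment], seed: int, max_gap: float) -> List[int]:
    """Grow a group from `seed`, always taking the best-scoring candidate
    that keeps the chain convex; never backtracks."""
    group = [seed]
    while True:
        last_start, last_end = segs[group[-1]]
        best: Optional[int] = None
        best_score = -math.inf
        for j, cand in enumerate(segs):
            if j in group:
                continue
            gap = math.dist(last_end, cand[0])
            if gap > max_gap:
                continue  # proximity: ignore distant segments
            if not is_convex_chain(chain_points([segs[i] for i in group] + [cand])):
                continue  # adding this segment would break convexity
            # Hypothetical score: shorter gaps and smaller orientation changes win.
            bend = abs(turn(heading(last_start, last_end), heading(*cand)))
            score = -gap - bend
            if score > best_score:
                best, best_score = j, score
        if best is None:
            return group
        group.append(best)
```

For example, four sides of a square with unit gaps plus a nearby distractor that turns the wrong way, `[((0, 0), (2, 0)), ((3, 0), (3, 2)), ((3, 3), (1, 3)), ((0, 3), (0, 1)), ((2.2, 0), (2.2, -1))]`, yields the group `[0, 1, 2, 3]` from seed `0` with `max_gap=1.5`. Once a segment is committed it is never reconsidered, which is exactly why lower-scoring alternatives can be missed.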

Jacobs [47] proposed a search-based algorithm for convex group extraction that combines the convexity constraint with the Gestalt principle of proximity. A search is initiated at each line segment in turn, and groups are formed by adding one segment at a time, subject to convexity, for as long as the resulting group satisfies a threshold on boundary coverage:

$$\frac{L_{1,n}}{L_{1,n} + G_{1,n}} > k, \qquad (6.8)$$

where $L_{1,n}$ is the sum of the lengths of the $n$ segments that form the group, and $G_{1,n}$ is the sum of the lengths of any gaps between the group’s segments. This coverage fraction is also used as an indicator of the group’s saliency: groups with fewer and smaller gaps are considered more salient.
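The coverage test of Equation (6.8) is simple to compute. The following Python sketch assumes the group is an ordered list of oriented `(start, end)` segments, that each gap runs from the end of one segment to the start of the next, and that for a closed group the gap back to the first segment is also counted. This is our reading of the definitions above, not necessarily Jacobs’ exact bookkeeping, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # oriented segment: (start, end)


def coverage_fraction(segs: Sequence[Segment], closed: bool = True) -> float:
    """L / (L + G): total segment length over segment length plus gap length."""
    n = len(segs)
    L = sum(math.dist(start, end) for start, end in segs)
    # For a closed group, also count the gap from the last segment back to the first.
    links = range(n) if closed else range(n - 1)
    G = sum(math.dist(segs[i][1], segs[(i + 1) % n][0]) for i in links)
    return L / (L + G)


def passes_coverage(segs: Sequence[Segment], k: float, closed: bool = True) -> bool:
    """Threshold test of Equation (6.8)."""
    return coverage_fraction(segs, closed) > k
```

For four sides of length 2 separated by four unit gaps, $L_{1,4} = 8$ and $G_{1,4} = 4$, so the closed group has coverage $8/12 = 2/3$ and passes a threshold of $k = 0.6$ but not $k = 0.7$.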

The search algorithm uses oriented segments and produces as output a set of closed, convex polygons whose coverage fraction is above the user-defined threshold. Jacobs shows analytically that on random line sets the algorithm has an expected complexity of $O(NM + N^2)$, where $N$ is the number of oriented segments in the input and $M$ is the number of output groups. Thus, by keeping the number of output groups small, the algorithm can be made very efficient. In practice, this amounts to choosing a value of the coverage threshold $k$ high enough that only a small number of groups are formed in the image.

Jacobs demonstrates that convex groups obtained in this fashion can be used to index into a model database for object recognition, and that this leads to a significant increase in object recognition performance. This algorithm is particularly important for our own research, since it constituted the state of the art in convex group extraction. In the next chapter we will explore why a boundary coverage threshold is not sufficient to robustly and efficiently extract convex groups in complex images, and we will compare the results obtained by our convex contour extraction algorithm with those obtained using a boundary coverage based method.

More recently, Saund [93] presented a contour extraction algorithm that is not limited to convex shapes. The algorithm uses the principles of proximity, smooth continuation, and shape compactness. Saund’s algorithm extracts contours by growing an initial group in both directions; search order is determined by the geometric configuration of neighboring edges, according to pre-computed tables derived empirically from the observation of closed contours in sketches. Figure 6.9a shows a sample preference table.

The algorithm extracts contours using two strategies. The first is to choose the maximally turning path, which leads to smaller contours. The second strategy attempts to maximize smoothness (see Figure 6.9b). Shapes extracted by the algorithm are evaluated according to several criteria for figural goodness: convexity, which is the ratio of the figure’s area to the area of its convex hull; endpoint distance, which depends on the distance between the endpoints of the curve and penalizes open contours; and a non-end nearest approach term, which decreases if one of the endpoints of an open figure lies near some part of the shape’s boundary (away from the other end of the contour).
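The convexity criterion can be sketched in Python as follows, assuming the contour is closed, simple, and given as an ordered list of vertices. The area uses the shoelace formula and the hull uses Andrew’s monotone chain algorithm; both are standard choices of ours, not necessarily what Saund used, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(pts: Sequence[Point]) -> float:
    """Shoelace formula for a simple polygon given as ordered vertices."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(pts: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    p = sorted(set(pts))
    if len(p) <= 2:
        return p
    lower: List[Point] = []
    for q in p:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], q) <= 0:
            lower.pop()
        lower.append(q)
    upper: List[Point] = []
    for q in reversed(p):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], q) <= 0:
            upper.pop()
        upper.append(q)
    return lower[:-1] + upper[:-1]  # drop endpoints shared by both chains


def convexity(pts: Sequence[Point]) -> float:
    """Area of the figure divided by the area of its convex hull (1.0 when convex)."""
    return polygon_area(pts) / polygon_area(convex_hull(pts))
```

A unit square scores 1.0, while an L-shaped hexagon `[(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]` has area 3 and hull area 3.5, giving $6/7 \approx 0.857$; the measure therefore favours compact, nearly convex figures without forbidding concavities.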

Search is controlled in two ways. The first is to limit the number of groups a segment can form, which is achieved by considering only segments within a small distance $d$ of each other; $d$ is chosen adaptively to limit the number of possible grouping choices. The second technique depends on the preference scores along a contour. A cumulative preference score is calculated simply by multiplying the preference scores for each junction in the curve. Two thresholds on cumulative preference are used to prune search paths, one for maximally turning contours and one for smooth continuation contours. Search along a contour is terminated when its cumulative preference falls below the corresponding threshold. Additionally, any contour that achieves a certain figural goodness score stops the search from its particular initial group. The algorithm is demonstrated on several sketches with good results.
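The pruning rule based on cumulative preference can be sketched in Python, assuming junction preference scores in $[0, 1]$ have already been looked up from a table such as the one in Figure 6.9a. The threshold values and function names below are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional, Sequence

# Hypothetical threshold values, one per search strategy.
THRESHOLDS = {"max_turning": 0.3, "smooth": 0.35}


def cumulative_preference(scores: Sequence[float]) -> float:
    """Product of the junction preference scores along a contour."""
    return math.prod(scores)


def pruned_at(scores: Sequence[float], strategy: str) -> Optional[int]:
    """Index of the junction at which the running product first drops
    below the strategy's threshold, or None if the contour survives."""
    cumulative = 1.0
    for i, s in enumerate(scores):
        cumulative *= s
        if cumulative < THRESHOLDS[strategy]:
            return i
    return None
```

With junction scores `[0.9, 0.8, 0.5, 0.9]` the running product is 0.9, 0.72, 0.36, and 0.324, so the path survives under the 0.3 threshold but is cut at the fourth junction (index 3) under the 0.35 threshold. Because the score is a product, a single poor junction or a long run of mediocre ones is enough to stop a path.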


Figure 6.9: a) Junction preference table; CW and CCW indicate clockwise and counter-clockwise winding contours, respectively. b) Maximally turning and smooth continuation paths.


6.3 Discussion

Despite the variety of algorithms that have been proposed for perceptual grouping and contour extraction, fundamental problems still limit the usefulness of perceptual organization schemes. In general, the performance of grouping algorithms (both in terms of run-time and the quality of the extracted groups) degrades very quickly in the presence of texture, clutter, occlusion, contrast variations, and other common image artifacts. As a result, many perceptual grouping tasks are still not feasible in unconstrained scenes; in particular, the problem of extracting the contour of an arbitrary object in an unconstrained environment, in a reasonably efficient manner, remains unsolved.

At the same time, while the grouping task itself has received significant attention, determining the saliency of the extracted groups has received comparatively little. Most grouping algorithms define some measure of saliency that is usually related to the Gestalt cues used during the grouping phase. Many of these saliency measures work well in particular domains; however, unconstrained environments pose a greater challenge. A general measure of perceptual saliency that is applicable to a wide variety of images from different domains is required if contour extraction is to be of use under such conditions.

Finally, while the quantitative evaluation of image segmentation techniques has seen significant developments in recent years, the evaluation of grouping algorithms has not received much attention. Borra and Sarkar [9] proposed that grouping should be evaluated in the context of a particular task, and introduced a framework for comparing grouping algorithms with regard to the increase in performance that an object recognition module derives from using each particular grouping technique. Their results were compelling. They observed that the quality of the input had a strong effect on grouping performance, and that the use of more global cues (such as convexity, used by Jacobs’ convex grouper) resulted in better recognition performance. They also pointed out several interesting properties of the algorithms considered in their study.

This framework, however, did not become a standard for testing grouping modules. There are a few reasons why this form of evaluation may not be suitable for general grouping algorithms. It is not clear at present how to define the recognition task in an unconstrained environment; choosing a particular domain could bias the results in favour of particular algorithms or algorithm classes. It is also unclear how to evaluate algorithms that index differently into model databases. Jacobs’ algorithm, for example, uses convex groups to match directly against convex polygons present in the objects within the database, whereas a general contour extraction algorithm would (by definition) be required to match against the boundary of an object (or a part thereof). The two indexing schemes are unlikely to be directly comparable.

Additionally, one of the main postulates of Gestalt psychology is that a significant component of the grouping task is pre-attentive; that is, some form of perceptual organization takes place independently of context, recognition, and other high-level visual processes. Most grouping schemes rely on such grouping principles and make little or no use of domain- or object-specific knowledge. This indicates that at least part of the evaluation of grouping methods should be purely perceptual. However, there is no equivalent of the Berkeley Segmentation Database for perceptual grouping, and techniques and metrics to evaluate the perceptual component of the grouping task are yet to be developed.

In light of the above, it is perhaps not surprising that few other examples of grouping evaluation are to be found in the literature. Williams and Thornber [114] propose evaluating algorithms in terms of the saliency they assign to each edge in the image. Given a single affinity measure, they compare the saliencies computed by several methods for a set of input edges, and conclude that their contour extraction method performs better than the grouping algorithms of Guy and Medioni [40] and Sha’ashua and Ullman [104].

More recently, Wang et al. [110] provide a comparative study of the contour grouping algorithms of Elder and Zucker [27], Mahamud et al. [60], and Wang et al. [107]. The algorithms are compared in terms of their similarity functions, the graph structure used by each method, and the results obtained on several images. The results, though qualitative, are interesting, and they show some of the problems encountered by grouping methods on less constrained imagery. The evaluation presented by Elder et al. [25] is perhaps closer in spirit to the techniques proposed for comparing segmentation results: they evaluate their contour extraction algorithm against contours extracted by human experts on the same images. Other than the above papers, perceptual grouping algorithms continue to be evaluated qualitatively, and this is likely to remain the case in the near future.

This ends our review of perceptual grouping, and in particular of contour extraction techniques. In the chapters that follow, we will present our work on contour extraction. Starting with the well-constrained problem of finding convex groups, we will present a framework that achieves a very significant (three orders of magnitude) reduction in search complexity with regard to the previous state of the art in convex group extraction (see Jacobs [47]). We will also show that a previously developed probabilistic framework for evidence accumulation can be used to robustly evaluate and rank the groups generated by our method independently of the domain of the image.

We will then extend the convex grouping scheme to deal with arbitrary shapes, and we will show that the resulting method retains the efficiency of the convex grouper while allowing for the extraction of contours of arbitrary shape. Finally, we will introduce extensions to the framework that allow for the use of colour and other cues during grouping for increased efficiency and robustness, and discuss how these additional cues can improve the estimation of the saliency of the groups output by our algorithm.
