Shape using volumetric primitives

Roger C Munck-Fairwood* and Li Dut

This paper traces the evolution of thought and techniques in shape recognition using volumetric primitives. We attempt to draw lessons from a relatively long history of studies in this area, and also focus on the recent developments that are inspired by Biederman's scheme of Recognition-By-Components. We identify a number of issues as a basis for assessing these achievements, and propose some methodological considerations for choosing a way forward.

Keywords: object recognition, volumetric primitives, shape representation

THE PURPOSE OF VOLUMETRIC PRIMITIVES

A truly remarkable phenomenon of human vision is the ability to perform rapid object recognition from a single image, when there are many possible objects in memory and when the object may be a new exemplar viewed from a new viewpoint and possibly partly occluded or degraded by other extraneous noise.

Although psychologists have studied many cases of normal and abnormal human vision, we are still a long way from a full explanation of the above phenomenon. Therefore, workers in computer vision have been forced to adopt functionally explicit approaches and heuristic representations in terms of 3D shape and 2D cues.

One such common approach is to assume that a single, precise 3D model is known a priori; recognition then consists of searching for a pose of that object in the image. Even this very restricted approach is fraught with difficulties when uncontrolled scenes are encountered. This approach concentrates on localisation, i.e. the 'where' function of vision, rather than the 'what' function that actually distinguishes object recognition from spatial description. In terms of modularization, it

Department of Electronic and Electrical Engineering, University of Surrey, Guildford, Surrey GU2 5XH, UK (email:r.munck- [email protected])

*Formerly R. C. Fairwood.

Paper received: 14 July 1992; revised paper received: 15 December 1992

is useful to separate these two issues in machine vision. Moreover, there is some neurological evidence for their separation in human vision (see, for example, ‘Neurophysiology of shape processing’ in this issue). In this paper, we focus on the ‘what’ function, i.e. the need to achieve model invocation, or access to a generic 3D model category. This corresponds to the ‘primal access’ function in human vision.

The use of volumetric primitives for object recognition by machine was initially an intuitive practical choice, after it became evident that an intermediate representation was required to handle a large number of objects of different shapes. The latter facilitates the organization of a large model database, and enables data grouping at the level of 3D entities to curb combinatorics.

The use of a fixed number of primitives means that the addition of new objects to the database does not require new primitives, so that the number of possible ‘aspects’ (i.e. topologically equivalent views) of the primitives comprising all the objects does not increase. This greatly facilitates data-driven extraction, as these aspects can be used as more discriminating indices into a database than simple 2D features. This in turn reduces the need for verification of model hypotheses (many of which may be spurious) where the precise model geometry and pose would be required. Nothing is free, though: the extraction of 3D primitives is obviously more difficult than that of simple 2D features.

A fundamental and recent contribution of Biederman1 is in providing strong psychological evidence that volumetric components do play an important role in human recognition of objects. By experiments on the visual priming of manually contour-deleted images, it was found that deletion which preserves the components does not weaken facilitation by priming, while deletion which completely removes some of the components weakens the facilitation. This indicates that the visual priming of an image of an object may operate by the activation of a representation of its components. The significance of this discovery to machine vision is that the use of volumetric components should no longer be regarded as only a practical choice, but has good correlation with human vision.


In addition, Biederman proposed the Recognition-By-Components (RBC) theory which unites the attempt to construct machine vision systems for large model databases and the attempt to explain object recognition in human vision. As part of this theory, a significant contribution2 is the proposal that the parts of objects mediating recognition be not merely components (i.e. particular to each object) but primitive components. A small vocabulary of 'geons' (geometric primitives) was invented which are expected to be, on the one hand, rich enough when combined to represent many realistic objects, while on the other hand extractable in principle from 2D images by virtue of their viewpoint-invariant qualitative features. When applied to machine vision, this small set of primitives has the potential of facilitating both database indexing and data-driven extraction (through feature grouping).

NOTABLE FOOTSTEPS

Important footsteps in object recognition trace back to the 1960s, when computer vision arose out of pattern recognition with the consideration of the third dimension.

Polyhedral world (1960s-70s)

Object recognition using volumetric primitives dates back to the 1960s when attempts were made to interpret line drawings of a polyhedral world. The most noted studies of that period include: Roberts (1965)3 who assumed a world made up of three classes of parts (cuboids, wedges and hexagonal prisms) occurring singly or merged to form compound objects; Guzman (1968)4 whose SEE program interpreted a line drawing by segmenting image regions and grouping those regions belonging to the same object using a junction dictionary; Huffman5 and Clowes6 (1971) who furthered the junction dictionary approach by exploiting the consistency constraint on junction labelling; Mackworth (1973)7 who introduced the concept of coherence with the viewpoint (i.e. the viewpoint consistency constraint) to improve on Guzman's junction dictionary approach; and Waltz (1975)8 and Shirai (1975)9 who extended line-drawing interpretation to cope with shadows. Systems during this period mainly used edge-based image features or simply line-drawings, and there was a strong sense of 'intentionality'10 in the design of these systems, in the sense that explicit, functional relations were invented, which do not necessarily correspond to human vision.

Marr et al. (late 1970s)

Marr’s computational theory of vision” represents a stage in the progress in the study of vision along with a number of noted contemporaries (Poggio, Hildreth, Barrow, Tenenbaum, etc.). Unlike the highly inten- tional emphases in earlier studies mentioned above, this theory drew on the results of the psychophysical discovery of domain independent depth recovery, such as that of Julesz’ random dot stereogram experiment ‘*. It proposed intrinsic surface description irrespective of the object and viewpoint, which leads to a domain- independent volumetric scene description.

Marr’s computational theory is, however, somewhat intentional in the levels which succeed early vision, particularly in shape representation for recognition. In order to fulfil his least-commitment principle (which facilitates data-driven processes for recognition), he concluded that a shape representation must be:

(9

(ii)

(iii)

accessible - to enable the model description to be accessible from the image; unique and of wide scope - to capture many objects without redundancy; and stable and sensitive - to accommodate noise as well as object variations.

In practice, this leads to the choice of an object-centred coordinate system together with volumetric primitives. A hierarchical structure for shape representation, covering coarse to fine details, was also suggested. To some extent, Marr's scheme of shape representation is an infant form of the RBC scheme, but it is intentional in the sense that no psychological evidence was presented for the use of volumetric components in human vision.

Brooks (1980)

Brooks’36 ACRONYM system was the first implemen- tation of object recognition using an object represen- tation similar to that suggested by Marr, but using edge-based image features instead of an intermediate stage of depth recovery. This system used primitives in the form of generalized cones. Such a cone is defined by a planar cross-section, a space curve spine and a sweeping rule. By taking combinations of variations in these attributes, Brooks arrived at a number of different sub-categories of primitives similar to Biederman’s set of geons.

Brooks’ object representation was a graph containing two main branches at the very top: (i) a ‘subpart sub- tree’ which described the object from coarse to fine details, and (ii) ‘an affixment arc sub-graph’ which described spatial relations among parts, carrying rota- tion and translation vectors. Parameters were intro- duced in this object description, which had associated ranges of possible values; this was thus a step towards qualitative components and implied some measure of viewpoint invariance.

Although the theoretical framework aimed to be wide-ranging, the practical implementation was much more restricted in scope and in the possible range of viewpoints. An example of grounded aeroplane recognition was given, with the viewpoint well above the ground and using ribbons (parallel pairs of curves) to cue for aeroplane fuselage and wings. This example unfortunately nearly reduced the problem to one of 2D pattern recognition. Nevertheless, the basis of this representation scheme is a clear forerunner to RBC.

Lowe et al. (1980s)

The part of Lowe’sI work that is of relevance to RBC is perceptual organization by non-accidental relations among edge-based image features. The listed relations, such as parallelism, co-linearity and co-termination (as later adopted by Biederman) provide a very powerful way of cuing intermediate shape components.


Throughout the 1980s a number of studies on recognition from monocular grey-level images13-15 were committed to recognition of single known objects. The relevance of these studies to our discussion is that it became evident that only such a commitment enables the use of domain knowledge to cope with the enormous ambiguity posed by natural complex scenes. In other words, it was found that it is extremely difficult to extract a stable volumetric representation of the scene from complex images.

This period also saw many systems for object recognition, but many of them were actually concerned with recognition from depth images (e.g. Faugeras16, Grimson and Lozano-Perez17, Jain and Hoffman18), which is outside the scope of our consideration.

Depth-from-‘X’ (late 1970s onwards)

Depth-from-multiple-channels (stereo, shading, etc.)19-21 has received much attention due to its role in providing intrinsic surface description in the framework of Marr's computational theory. However, most of the recent systems which are inspired by RBC use only edge-based features to extract geons, making depth recovery for geon extraction seem irrelevant. It is worth noting that both perceptual organization of edge-based features and depth-from-'X' are alternative routes towards extracting geons from grey level images, although one may be preferable to the other in practice.

Biederman (1985)

We have already outlined Biederman's contribution in terms of psychological evidence for the use of volumetric components, as well as a recognition scheme based on a small number of primitives. Biederman revitalized effort towards object recognition from a large database, and brought together two classical approaches in machine vision: (i) the generalized cylinder (or cone), which combines volumetric (object-centred) and surface (viewer-centred, contour-based) representations, and (ii) viewpoint invariant features, obtained by classifying generalized cylinders into qualitative categories, each of which exhibits qualitative features in the image independent of viewpoint. This scheme also includes a few qualitative constructors for combining geons into composite objects such that the attachments are also extractable from the image. We use Biederman's terms Recognition-By-Components (RBC) and 'geons' in referring to this scheme.

Recent systems of RBC (1985 onwards)

The final footsteps which we trace are recent approaches to object recognition using volumetric primitives. The contributions are summarized in very roughly chronological order. It is worth noting that most of these were presented recently at what was apparently the first international meeting22 dealing specifically with implementations of recognition systems inspired by Biederman's geon scheme. Further assessment of their technical merits and limitations is presented in the next section.

Bergevin and Levine23 use hand-input line drawings of isolated composite objects. The lines are segmented into those belonging to individual geons by using T-junction pairs occurring in the outline of the object. Faces are used as an intermediate representation between lines and geons. Geons are attributed with the usual Biederman shape parameters and also with coarse aspect ratios (stick/plate/blob, reminiscent of Shapiro et al.24). A scheme is suggested where a complete object is represented by an attributed graph in which each node represents a geon and each arc a connection type (one of five points along a geon) with an estimate of relative size. The suggested recognition scheme uses a precompiled object database with a kind of cross-indexing, so that all objects containing a particular geon are listed together. The graph derived from the image is compared with a graph in the object database by means of an arbitrarily chosen weighted sum of matching attributes. The results show the identification of geons and their connection modes.
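
The following minimal sketch (Python; the attribute names and weights are hypothetical, not those of the cited system) illustrates the idea of scoring a geon extracted from the image against a database geon by a weighted sum of matching qualitative attributes.

```python
# Each geon node carries qualitative attributes; graph arcs would carry connection types.
image_geon = {'axis': 'curved', 'cross_section': 'round', 'size_class': 'plate'}
model_geon = {'axis': 'curved', 'cross_section': 'round', 'size_class': 'stick'}

# Hypothetical weights expressing how much each attribute contributes to a match.
weights = {'axis': 0.4, 'cross_section': 0.4, 'size_class': 0.2}


def match_score(candidate: dict, model: dict, w: dict) -> float:
    """Weighted sum of agreeing attributes (1 = all attributes match, 0 = none)."""
    return sum(w[k] for k in w if candidate.get(k) == model.get(k))


print(match_score(image_geon, model_geon, weights))   # 0.8: two of three attributes agree
```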

Munck-Fairwood25’26 uses hand-input line drawing features of complete isolated geons. One part of the work26 emphasizes an approach which presents a logical description of the scene (a geon), the projection relationships (formalizing and extending Biederman’s suggestions), and the image (a unified geon ‘aspect’). Recognition of geons is achieved by embodying the above logic formulations in a logic program (using the Prolog language). The other part of the work27 emphasizes an approach which uses a causal probabi- listic network in order to handle uncertainty, at one level of reasoning (the zrojection

Dickinson et al.28-3 model’ part).

used both hand-input line drawings and real images of isolated specially made composite objects (consisting of two or three idealized geon shapes) under laboratory lighting conditions. The major contribution is in defining an ‘aspect hierarchy’ which introduces intermediate levels between lines and geons, corresponding to boundary groups, faces and aspects. The levels are linked by probabilistically weighted arcs so that, for example, a particular face suggests various aspects in a probabilistic ranking. The probabilities are derived from statistical simulation experiments. Hashing is used as an efficient form of model indexing. The hash function is based on the labels of the primitives contained in a subgraph consisting of ‘strongly connected’ primitives, i.e. primi- tives which have a visible connection. The multiple representation levels are used to reduce the combina- torial complexity of the search and to accommodate missing or spurious features. The authors point out that the scheme is not dependent on geons as such, but that any set of primitives which exhibit ‘distinct faces’ could be used.
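
A toy illustration of such probabilistically weighted links is sketched below (Python; the face and aspect labels and the probabilities are invented for illustration, not taken from the cited work): an observed face indexes a ranked list of candidate aspects.

```python
# Hypothetical conditional weights of the form P(aspect | face); in the cited work
# such weights were estimated by statistical simulation over the primitive vocabulary.
face_to_aspects = {
    'curved_face':     [('cylinder_side_view', 0.55), ('bent_cylinder_view', 0.30), ('cone_view', 0.15)],
    'triangular_face': [('wedge_view', 0.70), ('pyramid_view', 0.30)],
}


def rank_aspects(face_label: str):
    """Return candidate aspects for an observed face, most probable first."""
    return sorted(face_to_aspects.get(face_label, []), key=lambda pa: pa[1], reverse=True)


print(rank_aspects('curved_face'))
```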

Hummel and Biederman31 present apparently the first neural network approach to RBC. This ambitious prototype system aims at recognizing isolated objects consisting of two or three geons (out of eight geon classes), from idealized line drawing data. The scheme uses seven layers. The lower layers are proposed to have fixed weights and to detect, first, junctions, axes and blobs (only the junctions module is implemented), then geon features (axis curvature, etc.), and geon unary 'relations' (above/beside/below something, oblique/perpendicular to something, and larger/smaller than something). The two upper layers are trainable and detect 'geon feature assemblies' (type of geon, its attributes and its spatial relation to other geons). In this scheme, object recognition corresponds to activating a particular cell corresponding to a particular object. The variation of recognition time with image rotation angle (both in the plane and in depth) seems to resemble that of humans.

Raja and Jain32*‘3 deviate from the original RBC scheme in using range data and superquadrics34. In one approach32 range data is used to first estimate super- quadric parameters, which are then partitioned into ranges to represent different geons. Alternatively33, the range data is segmented into surfaces which are then classified. These surface types and their adjacen- ties are used to match to a ‘surface adjacency graph’ and hence achieve identification of the geon. Much input data was needed (20 views of range data) and the superquadric fitting was unreliable, especially if the object surface was rough, and was high1 dependent on viewpoint. Jacot-Descombes and Pun Z examine the probabilistic interpretation of certain 2D features of primitives - cross-section ellipticity and cone vertex angle. Again, this deviates from the original RBC scheme in using metrics (rather than qualitative, viewpoint-invariant features); e.g. a primitive is dis- tinguished by a precise value of cross-section ellipticity. The probabilistic distribution over the view sphere is considered; simulation experiments showed that the most probable 3D parameters can be discriminated.

ISSUES AND CURRENT STATE OF KNOWLEDGE

Having presented a retrospective of the evolution of shape recognition using volumetric primitives, we take a novel angle to re-assess the studies listed above. We identify a number of issues as a basis for assessing what has already been achieved, what has been speculated and what remains untouched. In Figure 1, we illustrate relationships between a number of ingredients which we consider as crucial to satisfactory theoretical and practical advances in the field.

Establishing a vocabulary of primitives

In the construction of an RBC system, the basic vocabulary of primitives is the first important issue. In practice, the principle of accessibility (see Marr above) has almost entirely determined the geon vocabulary adopted in existing systems.

Figure 1. Relationships between issues in RBC (object categories; large model base indexing; the choice of primitives - Biederman's geons, superquadrics, others - and of primitive relations, on both pragmatic and theoretical grounds; and the extraction of primitives and relations, considered against data and model complexity)

An intentional geon vocabulary

Biederman suggested a small vocabulary of geons on the basis that they can be detected through non-accidental image edge properties, as inspired by Lowe. Although the role of components in human primal access was supported by his psychological experiments, the choice of geon vocabulary is largely intentional, on the grounds of accessibility and, to a lesser degree, stability.

Even so, this practical choice is not unprecedented, because it was Brooks who first proposed primitives (generalized cones) with variations of the cross-section, space curve spine and sweeping rule. These variations leave only one of Biederman's variations - symmetry - unaccounted for, and this variation is much more difficult to extract from an image.

The implementors of all the above-cited systems have pushed the practical choice of geons even further by using fewer than the initial 36 geons. This is justified in a prototype system where the number of objects to be discriminated is relatively small.

A theory of geon vocabulary

As far as the authors are aware, there is no theory to guide the choice of a geon vocabulary for either machine or human vision. A theory of geons must involve a study of the morphology of the objects concerned. It must also obey the principle of accessibility, although we may consider means of accessing components other than the existing ones. Accessibility (and, to a lesser extent, stability) has been considered in the various choices of sets of geons, but the other Marr criteria of uniqueness, scope and sensitivity have not been addressed, except that Biederman argued for the scope of his geon (and relation) set in terms of combinatorics.

Perhaps psychological experiments or theory could be developed to determine the necessary and sufficient set of primitives and constructors from a perceptual point of view. For example, does it matter, perceptually, whether a geon's cross-section is symmetrical, or how elliptical it is? Some studies have been made35 on the role of unary shape and pairwise relational geometry in a geon-based model, and on the process of inference from 2D contour to geons.

Geon extraction from 2D images

Geon extraction from real or idealized data has received considerable attention. Existing strategies can be categorized into two classes.

From 2D edge-based features

Component extraction from edge-based features (or even line-drawings) is usually based on some qualitative viewpoint invariance. These invariances are strong cues to corresponding 3D relations (e.g. curvilinearity), and are thus thought to be the reason for the phenomenon of 'perceptual grouping' in 2D. Such techniques began with the early systems of polyhedron interpretation by Roberts, who treated a polygon as an invariant cue to a face of a cuboid, a triangle as a cue to a wedge, etc.

Perceptual grouping of image features has been extensively used for cuing 3D parts. The categories of non-accidental feature organizations summarized by Lowe are the best known guide to geon extraction, not surprisingly, as geons were derived with these feature organizations in mind. These perceptual organizations include: (i) co-linearity of points or lines, (ii) curvilinearity of points or arcs, (iii) symmetry, (iv) parallel curves, and (v) vertices. However, in each geon extraction strategy the particular mechanism of reasoning with these organizations has been very different.
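
As an illustration of how such non-accidental relations might be tested computationally, the sketch below (Python; the tolerance values and function names are our own, hypothetical choices) checks two of the listed relations, parallelism and co-termination, between 2D line segments.

```python
import math


def _angle(seg):
    """Orientation of a segment ((x1, y1), (x2, y2)), ignoring its sense."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def nearly_parallel(seg_a, seg_b, tol_deg=5.0):
    """Non-accidental parallelism: orientations agree within a small tolerance."""
    diff = abs(_angle(seg_a) - _angle(seg_b))
    diff = min(diff, math.pi - diff)
    return math.degrees(diff) <= tol_deg


def co_terminate(seg_a, seg_b, tol=2.0):
    """Non-accidental co-termination: some pair of endpoints (nearly) coincide."""
    return any(math.dist(p, q) <= tol for p in seg_a for q in seg_b)


a = ((0, 0), (100, 2))
b = ((0, 10), (100, 12))
print(nearly_parallel(a, b), co_terminate(a, b))   # True False
```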

Brooks36 mainly used ribbons and ellipses to cue for generalized cones. These features have qualitative viewpoint invariance.

Bergevin and Levine23 first used corners and T-junctions to segment line drawings into parts and then assigned several symbolic attributes (coarse aspect ratio, axis curvature, etc.) based on viewpoint invariances to classify parts. Hypothesis verification was employed to resolve ambiguities.

Hummel and Biederman3’ proposed that the first six layers of their neural-net model could extract geons by reasoning with the invariances of vertices, axes and b!obs, creating a ‘geon feature assembly’ as the complete description of an individual geon. T-junctions were used to segment the drawings into parts.

Munck-Fairwood26,27 used explicit image input in terms of the invariants: vertices, curvature and parallelism. The projection relationships for geons, based on Lowe's list, were formalized using logic and shown to contain ambiguous inferences - more than one geon characteristic could give rise to a certain image feature, such as curvature. Some of this ambiguity could be resolved using a causal probabilistic network. The latter also demonstrates how the acquisition of some data (e.g. part of an image) can influence the expectation of other data (e.g. other parts or properties of image curves). This kind of phenomenon is evident in certain psychophysical experiments (e.g. illusory contours), and may be exploited in machine vision by the 'tuning' of feature detectors according to already-detected data.

Dickinson~sza3’ geon extraction process adopted two intermediate levels of abstraction between pecep- tual feature groups and geons: faces and aspects. Probabilistic biases were used to curb the combina- torics, and hypothesis verification was also used to resolve ambuiguities.

A potentially useful strategy in extracting geons from edges is the apparently human strategy of grouping junctions based on proximity and geometric consistency, rather than connectivity, as demonstrated in the perception of nonsense drawings (see 'From line drawings to impressions of 3D objects: developing a model to account for the shapes that people see' in this issue). This strategy could use 3D fragments to bridge between junctions and geons, as a complement to the other intermediates such as an aspect hierarchy.

From depth information

An alternative strategy for geon extraction is through surface description derived from depth information existing in a grey-level image. This depth information could be derived from one of the great number of 'shape-from-X' methods. For example, it is possible to estimate some superquadric parameters using shape-from-shading21 and in turn relate these to some geon parameters. Many of these methods give useful information in principle but are sensitive to noise, particularly when used alone. We note in passing that superquadrics are an alternative, but metric, representation of volume. The use of superquadrics to represent geons is theoretically limited in that not all geons can be modelled by superquadrics. Nevertheless, this approach does provide a partial bridge between two shape component representation schemes which are very different in terms of scope and accessibility.

The use of explicit depth information seems to contradict Biederman’s (and Lowe’s) approach, where the Marr depth map is bypassed in favour of image edge features themselves mediating object recognition. However, Biederman’s input has always been in manual line drawing form; deriving this abstraction from real images may require fusion of the output of more than the single edge-detector module, including domain knowledge.

Describing and extracting geon relations

Geon relations are concerned with the spatial relationships and relative sizes among object components. As in the case of the geon vocabulary itself, the choice of relation vocabulary for existing systems has been on the basis of accessibility or, in some cases, scope (i.e. what was thought to be a necessary set to describe adequately the objects to be modelled). Many of the above comments on a theory for a geon vocabulary could be applied to a theory for a relation vocabulary.

Brooksf6 used ‘affixment arcs’ to express the 3D spatial relationships among parts. Each of the arcs had semi-quantitative rotation and translation transfor- mation specifications constrained within a numerical range. A number of simple rules were suggested for inferring 3D affixment parameters from 2D relations of parts in the image. However, these rules are rather dependent on the viewpoint, although they were satisfactory for the restricted range of viewpoints used in this work.

Dickinson et al.28-30 used a small geon relational vocabulary containing only a 'strong' and a 'weak' connection between pairs of primitives, defined on the basis of connection visibility, i.e. whether two faces belonging to a pair of geons share a contour.

Biederman’s early suggestion for a relation voca- bulary consisted of: relative sizes of geons (in three steps); relative position (above~elow) of geons; nature of join (end-end or end-side); and surface of join (long/ short surface). Later, Hummel and Biederman” used a similar vocabulary in their neural network. The positional relations (‘above’, etc.) have unavoidable ambiguity when there are more than two geons present, as they are in fact unary (one-argument) predicates (e.g. a geon can have the property ‘above’). Con- nectedness between geons is not explicit but is implied by relative spatial relations.

Bergevin and Levine23 used a scheme of geon relational representation which is similar to that of Biederman's in the sense that the vocabulary was qualitative and largely viewpoint invariant. But this scheme has a larger geon relational vocabulary, referred to as five connection modes, which are all end-to-side and describe the position of attachment of the end of one geon along the side of another, in five quantized positions. The relative sizes of geons are also estimated.

Large model base indexing

One of the ultimate objectives of using volumetric primitives is to achieve efficient access to a database of a large number of object models, using object descriptions in terms of geons and their spatial relations. This issue is essentially concerned with how to organise such a large database. We consider here two major approaches.

Structural indexing

Marr and Nishihara11 proposed a hierarchical representation scheme which would allow object indexing to proceed in three different manners: (i) specificity indexing: from coarse to fine scale; (ii) adjunct indexing: using details of spatial relationships among components; and (iii) parent indexing: top-down access by using a focus feature (such as an identified horse head). Little specific study has followed this proposal, and the problem remains largely unaddressed.

The indexing schemes used in the implementations of RBC cited above are all in the 'adjunct indexing' category, in that they use the spatial relationships between geons, derived after the geons have been extracted. Technically, Bergevin and Levine23 and Dickinson et al.28-30 use hashing schemes, whereas Hummel and Biederman31 use self-organizing links between their upper two layers so that, after training, particular combinations of 'geon feature assemblies' in Layer 6 activate single complete-object cells in Layer 7.
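
A minimal sketch of this kind of adjunct indexing by hashing is given below (Python; the object names, geon labels and the use of a simple inverted index are our own illustrative assumptions, not the cited authors' data structures).

```python
from collections import defaultdict

# Hypothetical pre-compiled models: object name -> labels of the geons it contains.
models = {
    'mug':    ('cylinder', 'curved_handle'),
    'lamp':   ('cone', 'cylinder', 'cylinder'),
    'hammer': ('cylinder', 'block'),
}

# Build an inverted index (a simple hash table) from a geon-label key to candidate objects.
index = defaultdict(set)
for name, geons in models.items():
    index[frozenset(geons)].add(name)       # order-free key over connected geon labels
    for g in geons:                         # also index single geons for partial evidence
        index[frozenset([g])].add(name)


def candidates(observed_geons):
    """Look up models consistent with the geon labels extracted from the image."""
    return index.get(frozenset(observed_geons), set())


print(candidates(['cylinder']))                     # every object containing a cylinder
print(candidates(['cylinder', 'curved_handle']))    # {'mug'}
```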

Invariant signatures

The majority of those who specifically address the problem of indexing use invariant signatures37,38. According to invariant theory, there are a number of feature groups which can provide invariants. They include the cross-ratio of four co-linear points, five coplanar points, a conic with two points, and a pair of conics. Such invariants can be pre-computed and recorded as the signatures of the objects.
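
For concreteness, the cross-ratio of four co-linear points mentioned above can be computed as in the sketch below (Python; the point values are arbitrary) and is unchanged by any projective map of the line.

```python
def cross_ratio(a, b, c, d):
    """Cross-ratio (AC * BD) / (BC * AD) of four co-linear points, given by their
    signed positions along the common line."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


pts = [0.0, 1.0, 2.0, 4.0]
projected = [(2.0 * p + 1.0) / (p + 3.0) for p in pts]   # a projective map of the line
print(cross_ratio(*pts), cross_ratio(*projected))        # both ~1.5 (up to rounding)
```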

In practice, invariant signatures are applicable to situations where it is possible to identify the relevant feature groups. The main implementation problems of this indexing scheme are to establish the feature groups automatically, and to devise an algorithm to extract them from the image with sufficient reliability. These processes may be very sensitive to noise. It has been suggested39 that such an indexing scheme can handle a large model database by indexing each object with an invariant vector that combines various feature groups of the object.

Geometric hashing40 seems to suggest the idea of using a hash-table on the basis of some form of object representation. However, the particular forms of representations are based on so-called 'interest points', which makes this approach very much reminiscent of invariant signatures.

Object categories

All of the works cited above have assumed that the final object classification is in terms of a particular spatial arrangement of specific geons. Let us assume that this defines the 'shape' of an object. In human vision, objects of slightly different - or perhaps grossly different - shape are intuitively classified in the same category, at least by name. This is so even if small-scale shape details are ignored. Some shape variations could be accommodated in an ad hoc way by permitting multiple geon labels (e.g. it does not matter whether a table's legs have curved or polygonal cross-sections), but the general problem leads us into new territories such as the relationship between shape and name, or shape and function41. These are outside the scope of this article. In another article42, we address these and other wider issues in generic object recognition.

Coping with model and data complexity

The complexity of the model and the data are two important issues of both practical and theoretical relevance.

Model complexity

Model complexity refers to the extent to which geometrical details of objects should be made explicit and how to cope with insignificant variations in the shape. This poses a significant problem, because objects in the real world are hardly ever composed of simple, ideal geons, but often contain small details of shape.

The significance of small shape variations is only meaningful with respect to a particular class of object and the scale of focal attention. It seems that human vision copes with this problem by a complex mechanism of focal attention in early vision12 and of focal attention guided by cognitive process (e.g. Marr's scale-space representation of objects). However, no study has been made to account for this issue in RBC. All the works cited above use drawings of idealised geon-based objects, or specifically constructed real objects (blocks, cylinders, etc.).

Data complexity

We use the term 'data complexity' to mean the extent to which confounding factors in the data can mislead the interpretation process, rather than how complex the image appears to the human eye. For example, if two cubes are well separated in the image they do not pose a more complex interpretation problem than a single cube; if they are close together, features of one cube may be considered as extraneous features to the other one and may mislead (complicate) the interpretation process. Alternatively, if a simple object occludes a complex object the image will certainly appear simpler whereas the interpretation of the complex object is much more difficult. Thus, the measure of 'data complexity' is related to the form of the interpretation process, and is not merely intrinsic to the image.

In the case of object recognition using edge-based features, data complexity includes (i) noise contamination of the features that correspond to geometric discontinuities of the object, (ii) extraneous features that do not correspond to geometrical discontinuities of the object (markings, highlights, shadows and background), and (iii) occlusion of features among objects.

Of the RBC-inspired recent works cited above, only one28-30 attempts to use edge-based features derived from real images, albeit under highly controlled lighting conditions and with no background. All the others use manually derived data. In machine vision generally, demonstration of system performance on a few real images is a common way of showing consideration for data complexity, but such an unsystematic approach is limited in generality. A systematic approach to data complexity has been taken in model-based vision43; a similar approach could be taken in RBC.

THE WAY FORWARD

Although studies have brought significant insight into the problem of object recognition and created many useful algorithms, they have also made it evident that object recognition is much more difficult than it was thought to be 30 years ago. Incidentally, the RBC scheme was never expected to solve the problem of recognition of arbitrary shapes. For example, even humans find difficulty in categorizing irregular objects (such as crumpled paper) from unknown views, or in recognizing common objects from highly degenerate views. In particular, the recognition of faces [see 'Describing the shapes of faces using surface primitives' in this issue] seems to require surface, rather than volumetric, primitives.

It is clear that progress in RBC needs joint and step-by-step effort, through accumulation of knowledge concerning each of the issues above. There are a number of strategic considerations in choosing a possible way forward.

Restricted versus open-ended issues

Comparatively speaking, the extraction of geons and their spatial relations (given a geon vocabulary and relation vocabulary) are better defined problems in computational terms, and have attracted most of the effort in machine vision studies. The other issues are rather open-ended in computational terms. For instance, the establishment of a satisfactory theory for both geon vocabulary and geon relation vocabulary is beyond the traditional centre of computer vision study. It may involve more investigation into mathematical morphology, psychology, linguistics or even representational art44.

This, perhaps, is the reason why the majority of RBC studies have centred on geon and geon relation extraction by making assumptions about the other issues, whereas work on the other issues is sparse and largely speculative. However, continued neglect of the other issues stifles the generality of the whole structure of RBC. Specific studies may give rise to better definitions to restrict those open-ended issues. In another article42, we propose that many of these issues can only be rationalized under the notion of an externally supplied ‘purpose’.

Real images versus short-cuts

An RBC strategy must ultimately be able to interpret real images, but this does not make interpreting line drawings trivial. On the contrary, the use of line drawings serves as an assumption to simplify an enormously complex problem and enables the attention to be concentrated on certain aspects of the recognition problem, so that (i) higher-level processes can be studied in isolation and (ii) objectives can be set for efforts in lower-level vision. Range images can also be used as another short-cut input to replace the assumption of a good shape recovery from 2D monocular images.

However, if real images are used, then treating a few images taken in a carefully controlled laboratory environment does not necessarily give much insight into data complexity or object complexity. Serious advance in the consideration of data complexity posed by real images of real objects needs a large number of real images and objects and a data complexity model which allows the simulation of complication factors43.

Implementation versus theoretical studies

It is generally desirable to implement an object recognition system and demonstrate that the program works by examples using a number of images or line-drawings. However, a number of RBC-based systems have been developed, and the implementations have come to such a stage that it is equally desirable to develop further theory, particularly (i) working definitions concerning the rather open-ended issues, and then (ii) better heuristics given these better defined issues.

This should be a fruitful way of dealing with the enormous combinatorial problem of interpreting image features, instead of shifting the computational burden around without bringing better understanding to the problem.

ACKNOWLEDGEMENTS

This review was carried out with the support of the Science and Engineering Research Council, grant GRG53255. We thank Roddy Cowie for inspiring this review and Sven Dickinson for his constructive criticism of the manuscript.

REFERENCES

1 Biederman, I and Cooper, E 'Priming contour-deleted images: evidence for intermediate representations in visual object recognition', Cognitive Psychol., Vol 23 (1991) pp 394-419

2 Biederman, I 'Human image understanding: recent research and a theory', Comput. Vision, Graph. Image Process., Vol 32 (1985) pp 29-73

3 Roberts, L 'Machine perception of three-dimensional solids', in Optical and Electro-Optical Information Processing (Tippet, J T et al., eds), MIT Press, Cambridge, MA (1965)

4 Guzman, A 'Decomposition of a visual scene into 3-dimensional bodies', AFIPS Proc. Fall Joint Comp. Conf., Vol 33 (1968) pp 291-304

5 Huffman, D 'Impossible objects as nonsense sentences', in Machine Intelligence 6 (Meltzer and Michie, eds), Elsevier, New York (1971) pp 295-323

6 Clowes, M 'On seeing things', Artif. Intell., Vol 2 (1971) pp 79-116

7 Mackworth, A 'Interpreting pictures of polyhedral scenes', Artif. Intell., Vol 4 (1973) pp 121-137

8 Waltz, D 'Generating semantic descriptions from drawings of scenes with shadows', in Psychology of Computer Vision (Winston, P H, ed), McGraw-Hill, New York (1975) pp 19-92

9 Shirai, Y 'A context sensitive line finder for recognition of polyhedra', in Psychology of Computer Vision (Winston, P H, ed), McGraw-Hill, New York (1975) pp 93-114

10 Boden, M Artificial Intelligence and Natural Man, The Harvester Press, Brighton (1981)

11 Marr, D Vision, W H Freeman, San Francisco (1982)

12 Julesz, B 'Early vision and focal attention', Rev. Modern Physics, Vol 63 No 3 (July 1991)

13 Lowe, D 'Three-dimensional object recognition from single two-dimensional images', Artif. Intell., Vol 31 (1987) pp 355-395

14 Goad, C 'Special purpose automatic programming for 3D model-based vision', Proc. ARPA Image Understanding Workshop, Arlington, VA (1983)

15 Sullivan, G 'Alvey MMI-007 Vehicle Exemplar: performance and limitations', Alvey Vision Conference, Cambridge, UK (1987) pp 39-45

16 Faugeras, O and Hebert, M 'The representation, recognition, and location of 3-D objects', Int. J. Robotics Res., Vol 5 No 3 (1986) pp 27-52

17 Grimson, W E L and Lozano-Perez, T 'Model-based recognition and localization from sparse range or tactile data', Int. J. Robotics Res., Vol 3 No 3 (1984) pp 382-414

18 Jain, A and Hoffman, R 'Evidence-based recognition of 3-D objects', IEEE Trans. PAMI, Vol 10 No 6 (1988) pp 783-800

19 Horn, B 'Obtaining shape from shading information', in The Psychology of Computer Vision (Winston, P H, ed), McGraw-Hill, New York (1975)

20 Mayhew, J and Frisby, J 'Towards a computational and psychophysical theory of stereopsis', Artif. Intell., Vol 17 (1981) pp 349-385

21 Pentland, A 'Shape information from shading: a theory about human perception', Proc. 2nd Int. Conf. on Comput. Vision, Tampa, FL (1988) pp 404-413

22 Bowyer, K W (ed) Sessions I and II on 'Exploration of Recognition by Components', Applications of Artificial Intelligence X: Machine Vision and Robotics, Proc. SPIE 1708, Orlando, FL (April 1992) pp 569-627

23 Bergevin, R and Levine, M D 'Generic object recognition: building coarse 3D descriptions from line drawings', Proc. IEEE Workshop on Interpretation of 3D Scenes, Austin, TX (November 1989) pp 68-74

24 Shapiro, L G, Moriarty, J D, Haralick, R M and Mulgaonkar, P G 'Matching three-dimensional objects using a relational paradigm', Patt. Recogn., Vol 17 No 4 (1984) pp 385-405

25 Fairwood, R (now Munck-Fairwood, R) 'Recognition-by-components and probabilistic reasoning in computational vision', Sabbatical Report, Dept. of Electronic and Electrical Eng., University of Surrey, UK (July 1988)

26 Fairwood, R (now Munck-Fairwood, R) 'Recognition of generic components using logic-program relations of image contours', Image & Vision Comput., Vol 9 (1991) pp 113-122

27 Fairwood, R (now Munck-Fairwood, R) and Barreau, G 'A belief network for the recognition of 3D geometric primitives', 6th Int. Conf. on Image Analysis and Process., Como, Italy (1991)

28 Dickinson, S J, Pentland, A P and Rosenfeld, A Qualitative 3-D Shape Recovery Using Distributed Aspect Matching, Tech. Rep. CAR-TR-453, Cent. Automat. Res., Univ. Maryland, MD (June 1990)

29 Dickinson, S J, Pentland, A P and Rosenfeld, A '3-D shape recovery using distributed aspect matching', IEEE Trans. PAMI, Vol 14 No 2 (February 1992) pp 174-198

30 Dickinson, S J, Pentland, A P and Rosenfeld, A 'From volumes to views: an approach to 3-D object recognition', Comput. Vision, Graph. Image Process.: Image Understanding, Vol 55 No 2 (March 1992) pp 130-154

31 Hummel, J and Biederman, I 'Dynamic binding in a neural network for shape recognition', Psychol. Rev., Vol 99 (1992) pp 480-517

32 Raja, N S and Jain, A K 'Recognizing geons from superquadrics fitted to range data', Image & Vision Comput., Vol 10 No 3 (April 1992) pp 179-190

33 Raja, N S and Jain, A K 'Obtaining generic parts from range data using a multi-view representation', Applications of Artificial Intelligence X: Machine Vision and Robotics, Proc. SPIE 1708, Orlando, FL (April 1992) pp 602-613

34 Pentland, A Automatic Extraction of Deformable Part Models, MIT Media Lab Vision Science Technical Report 104 (1989)

35 Wallace, A and Brodie, E 'Volumetric, surface and contour based models of object recognition', Proc. 2nd Int. Conf. on Visual Search, Durham, UK (1990)

36 Brooks, R Model-Based Computer Vision, UMI Research Press, MI, USA (1984)

37 Clemens, D and Jacobs, D 'Model group indexing for recognition', Proc. Conf. on Comput. Vision and Patt. Recogn., Maui, HI (1991) pp 4-10

38 Rothwell, C, Zisserman, A, Forsyth, D and Mundy, J 'Canonical frames for planar object recognition', Proc. Euro. Conf. on Comput. Vision, Santa Margherita Ligure, Italy (1992) pp 557-572

39 Rothwell, C, Zisserman, A, Forsyth, D and Mundy, J 'Using projective invariants for constant time library indexing in model-based vision', Proc. BMVC, Glasgow, UK (1991) pp 62-70

40 Lamdan, Y and Wolfson, H 'Geometric hashing: a general and efficient model-based recognition scheme', Proc. 2nd Int. Conf. on Comput. Vision, Tampa, FL (1988) pp 238-249

41 Stark, L and Bowyer, K 'Achieving generalised object recognition through reasoning about association of function to structure', IEEE Trans. PAMI, Vol 13 No 10 (1991) pp 1097-1104

42 Du, L and Munck-Fairwood, R 'A formal definition and framework for generic object recognition', 8th Scandinavian Conf. Image Analysis, Tromso, Norway (May 1993)

43 Du, L, Sullivan, G and Baker, K 'Modelling data complexity for model-based vision', Proc. Br. Machine Vision Conf., Springer-Verlag, Berlin (1992)

44 Hagen, M Varieties of Realism: Geometries of Representational Art, Cambridge University Press, UK (1986)