October 26, 2009
Compositional Hierarchy for 3D Object Recognition
Maria Isabel Restrepo

Progress review1


Page 1: Progress review1

October 26, 2009

Compositional Hierarchy for 3D Object Recognition
Maria Isabel Restrepo

Page 2: Progress review1


Goal:

Page 3: Progress review1


Goal

Geometry Expected Appearance

Renderings obtained by Dan Crispell

Page 4: Progress review1


Goal: Recognition in a 3D World

Page 5: Progress review1


Compelling Characteristics

POWERFUL GEOMETRIC AND PHOTOMETRIC REPRESENTATION* OF SCENES

✤ It is a 3D, geometric representation that supports discovery of spatial relations

✤ Its appearance is modeled by a Mixture of Gaussians (MoG) to handle illumination variations

✤ Appearance and geometry are automatically learned from multiple images with calibrated cameras

✤ It is faithful to the scenes: there are no prior assumptions about the model

THESE CHARACTERISTICS ARE IDEAL FOR OBJECT RECOGNITION

* [Pollard and Mundy, CVPR 2007] [Crispell]

Page 6: Progress review1


Outline

✤ Volumetric appearance model - The Voxel World

✤ Insights on classical recognition methods

✤ Compositional hierarchies
  ✤ Bienenstock, Geman, Potter, 97; Geman, Chi, 2002; Geman, Jin, CVPR 2006
  ✤ Fidler & Leonardis, CVPR’07; Fidler, Boben & Leonardis, CVPR 2008
  ✤ Mundy & Ozcanli, SPIE ’09

✤ Experimental work: Proof of concept

✤ Future work

Page 7: Progress review1


The Voxel World

Probabilistic representation of 3-D scenes based on volumetric units (voxels).

[Plot: a voxel's appearance density p(intensity) as a function of intensity.]

Appearance is modeled as a Mixture of Gaussians:

p(I) = \sum_{k=1}^{3} \frac{w_k}{W} \frac{1}{\sqrt{2\pi\sigma^2_k}} \, e^{-\frac{(I-\mu_k)^2}{2\sigma^2_k}}

Surface probability is given by incremental learning:

P_{N+1}(X \in S) = P_N(X \in S) \, \frac{p_N(I^{N+1}_x \mid X \in S)}{p_N(I^{N+1}_x)}
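The two formulas above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the non-surface appearance is assumed uniform over [0, 1] intensities (so the marginal p(I) can be expanded over the surface / non-surface hypotheses), and mixture weights are normalized by their sum, the W in the slide's formula.

```python
import numpy as np

def mog_pdf(I, weights, means, sigmas):
    """Appearance density of a voxel: a mixture of Gaussians.
    Weights are normalized by their sum (the 'W' in the slide's formula)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = np.asarray(means, dtype=float)
    s = np.asarray(sigmas, dtype=float)
    comp = w / np.sqrt(2.0 * np.pi * s**2) * np.exp(-(I - mu)**2 / (2.0 * s**2))
    return comp.sum()

def update_surface_prob(p_surf, I_obs, weights, means, sigmas, bg_density=1.0):
    """One incremental Bayesian update of a voxel's surface probability:
        P_{N+1}(X in S) = P_N(X in S) * p(I | X in S) / p(I).
    bg_density (uniform appearance for the non-surface hypothesis) is an
    assumption of this sketch, not a detail given on the slide."""
    p_fg = mog_pdf(I_obs, weights, means, sigmas)
    p_I = p_surf * p_fg + (1.0 - p_surf) * bg_density  # marginal p(I)
    return p_surf * p_fg / p_I
```

An observation that fits the voxel's appearance mixture raises its surface probability; one that does not fit lowers it, which is the behavior the incremental learning relies on.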

Page 8: Progress review1


Outline

✤ Volumetric appearance model - The Voxel World

✤ Insights on classical recognition methods

✤ Compositional hierarchies
  ✤ Jin & Geman
  ✤ Fidler & Leonardis, CVPR’07; Fidler, Boben & Leonardis, CVPR 2008
  ✤ Mundy & Ozcanli, SPIE ’09

✤ Experimental work: Proof of concept

✤ Future work

Page 9: Progress review1

Classical Recognition: Bag of Features

Pipeline: feature descriptor (e.g. SIFT [Lowe], HOG [Dalal]) → codewords / codebook → feature space → classify (e.g. SVM, Naive Bayes, NN)

Drawbacks:
✤ Disregards spatial information
✤ A large number of features is needed

Many have proposed more complex representations of spatial object structure:

✤ Constellation Models [Weber and Welling et al., Fergus et al.] - complex, few parts

✤ Probabilistic voting [Leibe, Schiele] - large codebook, complex matching

✤ Hierarchical representations
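The descriptor → codebook → histogram part of this pipeline can be sketched as follows. The codebook is taken as a given array of codeword centers (in practice it would come from k-means over training descriptors); that, and the Euclidean assignment, are assumptions of this sketch.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (Euclidean distance).
    This is where all spatial information is discarded -- the drawback
    noted above."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :])**2).sum(axis=2)
    return d2.argmin(axis=1)

def bof_histogram(descriptors, codebook):
    """Image representation: normalized histogram of codeword occurrences,
    ready to feed a classifier (SVM, Naive Bayes, NN)."""
    idx = quantize(descriptors, codebook)
    hist = np.bincount(idx, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```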

Page 10: Progress review1

Hierarchical Representations

✤ Sudderth, Torralba, Freeman, Willsky (MIT, ICCV 2005): "Learning Hierarchical Models of Scenes, Objects, and Parts" - a hierarchical probabilistic model in which parts are shared between object categories; parameters are learned via Gibbs sampling, and part sharing improves detection when few training examples are available.

✤ Mikolajczyk, Leibe, Schiele (UK, Switzerland, Germany, 2006): a codebook tree built by agglomerative clustering of appearance clusters, each carrying per-class geometric distributions; recognition scores hypotheses with a Bayesian likelihood ratio against background, and each feature may contribute to only one hypothesis.

✤ Fidler, Boben, Leonardis (U. of Ljubljana, Slovenia, CVPR 2007, 2008): an unsupervised hierarchical compositional framework; parts are learned layer by layer as loose spatial compositions of subparts, starting from oriented filters, with similarity connections within and across layers for shape variability and scale normalization.

✤ Jin and Geman (Brown University, CVPR 2006): compositional hierarchy for object recognition.

✤ Address the need for a representation that incorporates geometric coherence

✤ Allow for a more efficient representation

✤ Consistent with biological systems

Page 11: Progress review1

Prior work by Geman: Efficient Discrimination
[Bienenstock, Geman, Potter, 97], [Geman, Chi, 2002], [Geman, Jin, CVPR 2006]

A COMPOSITIONAL MACHINE:

Test set: 385 images, mostly from Logan Airport
Courtesy of Visics Corporation

Hierarchy levels: parts of characters and plate sides; characters, plate sides; generic letter, generic number; plate boundary; license numbers; license plates

✤ Probabilistic framework
✤ Hierarchy and reusability
✤ It does not exclude the sharing of subparts
✤ Parts are everywhere, compositions are rare
✤ Need to model relative geometry of parts

Markovian distribution: Basic structures
Compositional distribution: Composition vs. Coincidence

[Excerpt from Jin & Geman, CVPR 2006, Figure 3: samples from the Markov backbone (upper panel, '4850') and from the compositional distribution (lower panel, '8502'). The compositional distribution reweights the Markov backbone by the ratio of a "composed" attribute distribution p^c to a "null" attribute distribution p^0 over each active brick's attributes; the conditional data model gains photometric invariance from a rank-sum statistic that is invariant to monotone transformations of the image histogram.]

Sampling

Efficient discrimination: Markov versus Content-Sensitive distribution
(Original image; zoomed license region; top object under the Markov distribution; top object under the content-sensitive distribution)

Detection

Page 12: Progress review1

Prior Work by Fidler and Leonardis
[Fidler, Berginc, Leonardis CVPR 2006], [Fidler, Leonardis, CVPR 2007], [Fidler, Boben, Leonardis CVPR 2008]

Example of learned whole-object shape models. Fidler, Boben, Leonardis. "Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation." Submitted to a journal. Images from Fidler's webpage.

Compositionality and bottom-up learning:
✤ Computational efficiency - scalable
✤ Bottom-up learning: all classes share the early layers, then layers become class-specific
✤ Models are both general and discriminative
✤ Sharing of parts

Have learned complete objects from simple edges.

Page 13: Progress review1

Work by Mundy and Ozcanli
[Mundy, Ozcanli, SPIE 2009]

✤ Combine Geman's and Leonardis' work into a unified Bayesian framework

✤ Classification of foreground objects: vehicles

✤ Domain: low-resolution satellite images

[Figure 6: example vehicle extrema-operator responses; spatial resolution around 0.7 m, with about 25 pixels on a vehicle. Figure 7: composition of an anisotropic dark operator with a bright peak operator, characterized by distance d' and relative orientation θ'. Figure 8: three primitive extrema operators composed in a Layer 1 node.]

Probabilistic Score:

p(c^i_{\alpha\alpha'} \mid d_{\alpha\alpha'}, \theta_{\alpha\alpha'}) = \frac{p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid c^i_{\alpha\alpha'})\,P(c^i_{\alpha\alpha'})}{p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'})}

p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'}) = p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid \bar{c}_{\alpha\alpha'})\,P(\bar{c}_{\alpha\alpha'}) + \sum_{j=0}^{k-1} p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid c^j_{\alpha\alpha'})\,P(c^j_{\alpha\alpha'})

Composition of Parts
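The probabilistic score above is a standard Bayes posterior over composition classes, with the evidence summed over a background class and the k composition classes. A minimal sketch (the class-conditional densities are assumed to be already evaluated at the measured distance and orientation; this is not the authors' code):

```python
import numpy as np

def composition_posterior(class_lik, class_priors, bg_lik, bg_prior):
    """Posterior of composition class c^i for a part pair (alpha, alpha'),
    given the measured distance d and relative orientation theta.
    class_lik[i] = p(d, theta | c^i) at the measurement;
    bg_lik = p(d, theta | c-bar), the non-composition (background) density."""
    lik = np.asarray(class_lik, dtype=float)
    pri = np.asarray(class_priors, dtype=float)
    evidence = bg_lik * bg_prior + (lik * pri).sum()   # p(d, theta)
    return lik * pri / evidence
```

By construction, the class posteriors plus the background posterior bg_lik * bg_prior / p(d, theta) sum to one.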

Page 14: Progress review1

Hierarchical Composition for 3D Objects

Simple primitives, e.g. edges

Junctions, curves...

Windows, street lines, roofs, leaves...

Buildings, streets, trees, rivers...

Learn bottom-up

Page 15: Progress review1

Outline

✤ Volumetric appearance model - The Voxel World

✤ Insights on classical recognition methods

✤ Compositional hierarchies
  ✤ Jin & Geman
  ✤ Fidler & Leonardis, CVPR’07; Fidler, Boben & Leonardis, CVPR 2008
  ✤ Mundy & Ozcanli, SPIE ’09

✤ Proof of concept: Construction of a simple hierarchy to find windows in the voxel world

✤ Future Work

Page 16: Progress review1

Data and Algorithm

Algorithm Steps

1. For each orientation
✤ Apply corner kernel on appearance and occupancy grids
✤ Perform non-maxima suppression on kernel-specific region
2. Build a hierarchy to find windows

Top: mean appearance near the wall surface. Bottom: occupancy.

Each voxel's appearance mixture f(x) = \sum_{k=1}^{K} w_k f_k(x) is approximated by the single Gaussian \tilde{f}(x) \sim N(\tilde{\mu}_f, \tilde{\sigma}^2_f) that minimizes D_{KL}(f(x) \,\|\, \tilde{f}(x)), or simply by the first component f_1(x).
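For a single Gaussian, minimizing D_KL(f ‖ f̃) reduces to matching the mixture's first two moments (a standard result, assumed here; this is an illustrative sketch, not the talk's implementation):

```python
import numpy as np

def moment_match(weights, means, sigmas):
    """Return (mu, sigma^2) of the single Gaussian minimizing
    KL(f || f_tilde) for the mixture f = sum_k w_k N(mu_k, sigma_k^2):
    the optimum matches the mixture's mean and variance."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu_k = np.asarray(means, dtype=float)
    s2_k = np.asarray(sigmas, dtype=float)**2
    mu = (w * mu_k).sum()                        # mixture mean
    s2 = (w * (s2_k + mu_k**2)).sum() - mu**2    # mixture variance
    return mu, s2
```

The fallback mentioned on the slide, using f_1(x) alone, simply takes the first component's (mu_1, sigma_1^2) instead.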

Page 17: Progress review1

The Primitives: Corner Kernel

Corner kernel in 2D: every pixel has a label/weight
Corner kernel in 3D: every voxel has a label/weight

PLUS (+)REGION

MINUS (-)REGION

WIDTH

HEIGHT

DEPTH

Page 18: Progress review1

The Primitives: Corner Kernel

PLUS (+) REGION - WHITE VOXELS

MINUS (-) REGION - BLACK VOXELS

Page 19: Progress review1

The Primitives: Corner Kernel

Coordinate system of a corner kernel: axes x, y, z; rotation angles θ, φ, ψ

Rotate the kernel to create the layer of primitives

Layer 1: Primitives - 3D Corners

Page 20: Progress review1


Applying the Kernel

Corresponding voxels

Page 21: Progress review1


Applying the Kernel

“Convolve” kernel with appearance grid

Page 22: Progress review1

Operator Response and Simplifications

I_{x_i}: intensity at voxel x_i
K: kernel response

K = \sum_{i: x_i \in R^+} I_{x_i} - \sum_{j: x_j \in R^-} I_{x_j}

Distribution of the response: K \sim N_k(\mu_k, \sigma^2_k), with

\mu_k = \sum_{i: x_i \in R^+} \mu_{x_i} - \sum_{j: x_j \in R^-} \mu_{x_j}, \qquad \sigma^2_k = \sum_{i: x_i \in R^+} \sigma^2_{x_i} + \sum_{j: x_j \in R^-} \sigma^2_{x_j}

r_\alpha = \begin{cases} \mu_k, & \text{if } \frac{1}{|R^+|} \sum_{i: x_i \in R^+} P(x_i \in S) > t \text{ and } \mu_k > 0 \\ 0, & \text{otherwise} \end{cases}

This may be the first feature detector based on the spatial arrangement of appearance distributions.
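The response computation above can be sketched directly from the formulas, with each voxel's appearance summarized by a Gaussian N(μ_x, σ_x²). This is a minimal illustration; the argument layout is an assumption, not the talk's actual interface.

```python
import numpy as np

def kernel_response(mu_plus, var_plus, mu_minus, var_minus, p_surf_plus, t):
    """Corner-kernel response over voxels with Gaussian appearance summaries:
        mu_k     = sum_{R+} mu_x  -  sum_{R-} mu_x
        sigma2_k = sum_{R+} s2_x  +  sum_{R-} s2_x
    r_alpha equals mu_k only when the mean surface probability over R+
    exceeds the threshold t and mu_k > 0; otherwise 0."""
    mu_k = np.sum(mu_plus) - np.sum(mu_minus)
    sigma2_k = np.sum(var_plus) + np.sum(var_minus)
    occupancy = np.mean(p_surf_plus)     # (1/|R+|) sum P(x_i in S)
    r_alpha = mu_k if (occupancy > t and mu_k > 0) else 0.0
    return r_alpha, mu_k, sigma2_k
```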

Page 23: Progress review1


Experiment Setup:

Layer 1: Corner primitives

Layer 2: Pairs of corners

Layer 3: Triplets of corners

Object Layer: Window

1. Demonstrate Hierarchy on a small region

2. Show some results on the full grid

Experimental hierarchy

Page 24: Progress review1


Algorithm Steps

1. For each orientation

✤ Run a corner kernel

Algorithm Steps

Page 25: Progress review1


Layer 1: Simple Features

1. For each orientation

✤ Run a corner kernel

✤ Perform non-maxima suppression on kernel-specific region

Algorithm Steps

Page 26: Progress review1

Layer 2

Algorithm Steps

2. Build a hierarchy
2.1 Pair corners (90°) → Pairs

p(c^i_{\alpha_i,\alpha_j} \mid d_{\alpha_i,\alpha_j}, \theta_{\alpha_i,\alpha_j}) = \begin{cases} \frac{1}{|\{r_{\alpha_i}, r_{\alpha_j} > 0\}|}, & \text{for } r_{\alpha_i}, r_{\alpha_j} > 0 \\ 0, & \text{otherwise} \end{cases}
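Read as a uniform distribution over the candidate corner pairs whose primitive responses are both positive (one interpretation of the normalizer |{r_αi, r_αj > 0}|, which is an assumption of this sketch), the pairing score could look like:

```python
def pair_scores(candidate_pairs):
    """candidate_pairs: list of (r_i, r_j) primitive responses for each
    candidate corner pair. Each pair with both responses positive gets
    probability 1/N, where N is the number of such pairs; the rest get 0."""
    positive = [(a, b) for (a, b) in candidate_pairs if a > 0 and b > 0]
    n = len(positive)
    return [1.0 / n if (a > 0 and b > 0) else 0.0
            for (a, b) in candidate_pairs]
```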

Page 27: Progress review1

Layer 3

1. ...
2. Build a hierarchy
2.1 Pair coplanar corners (90°) → Pairs
2.2 Pair corner pairs → Triplets

Algorithm Steps

Page 28: Progress review1

Object Layer: Windows

Algorithm Steps

1. ...
2. Build a hierarchy
2.1 Pair corners (90°) → Pair
2.2 Pair corner pairs → L-shape
2.3 Pair Triplets → Window

Page 29: Progress review1

Full Grid: Occupancy Probabilities

Page 30: Progress review1

Full Grid Results: Corners

Page 31: Progress review1

Full Grid Results: Windows

Page 32: Progress review1

Summary

✤ Appealing characteristics of the Voxel World and compositional hierarchies

✤ Introduced volumetric feature detectors that operate on distribution functions of appearance

✤ Demonstrated the efficiency of such a representation using a very simple instance of a compositional hierarchy

✤ Localized a large number of windows

Page 33: Progress review1

Future Work

✤ Include other extrema operators in the hierarchy (e.g. edges)

✤ Use occupancy information

✤ Learn prior distributions to fully explain the probability density of compositions

✤ Optimize source code: search and storage of parts (e.g. octree)

✤ Learn parts automatically

✤ Learn whole-object hierarchies

Page 34: Progress review1


The meaning of a complex expression is determined by its structure and the meanings of its constituents.

Stanford Encyclopedia of Philosophy

The Principle of Compositionality

Questions?