October 26, 2009
Compositional Hierarchy for 3D Object Recognition
Maria Isabel Restrepo
Goal: Recognition in a 3D World

Geometry and Expected Appearance
Renderings obtained by Dan Crispell
Compelling Characteristics
POWERFUL GEOMETRIC AND PHOTOMETRIC REPRESENTATION* OF SCENES
✤ It is a 3D, geometric representation that supports discovery of spatial relations
✤ Its appearance is modeled by a mixture of Gaussians (MOG) to handle illumination variations
✤ Appearance and geometry are automatically learned from multiple images with calibrated cameras
✤ It is faithful to the scene: there are no prior assumptions about the model
THESE CHARACTERISTICS ARE IDEAL FOR OBJECT RECOGNITION
* [Pollard and Mundy, CVPR 2007] [Crispell]
Outline
✤ Volumetric appearance model - The Voxel World
✤ Insights on classical recognition methods
✤ Compositional hierarchies
    ✤ Bienenstock, Geman, Potter, '97; Geman, Chi, 2002; Geman, Jin, CVPR 2006
    ✤ Fidler & Leonardis, CVPR '07; Fidler, Boben & Leonardis, CVPR 2008
    ✤ Mundy & Ozcanli, SPIE '09
✤ Experimental work: Proof of concept
✤ Future work
The Voxel World

(Plot: appearance distribution p(intensity) at a voxel)
The voxel world is a probabilistic representation of 3D scenes based on volumetric units (voxels).

Appearance at each voxel is modeled by a mixture of Gaussians:

$$p(I) = \sum_{k=1}^{3} \frac{w_k}{W}\,\frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(I-\mu_k)^2}{2\sigma_k^2}}$$

Surface probability is given by incremental learning:

$$P_{N+1}(X \in S) = P_N(X \in S)\,\frac{p_N(I_x^{N+1} \mid X \in S)}{p_N(I_x^{N+1})}$$
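As a rough sketch of the two pieces of the voxel model described above (function names are illustrative, not from the actual voxel-world code): the mixture-of-Gaussians appearance density, and one incremental Bayesian update of a voxel's surface probability.

```python
import math

def mog_pdf(I, weights, means, variances):
    """Evaluate the voxel's mixture-of-Gaussians appearance density p(I).
    Weights are normalized by their sum W, as in the slide's formula."""
    W = sum(weights)
    return sum(
        (w / W) * math.exp(-(I - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for w, mu, var in zip(weights, means, variances)
    )

def update_surface_prob(P_N, p_obs_given_surface, p_obs):
    """One incremental update: P_{N+1}(X in S) =
    P_N(X in S) * p_N(I | X in S) / p_N(I)."""
    return P_N * p_obs_given_surface / p_obs
```

In practice the two likelihood terms would themselves be evaluated from the voxel's current mixture; here they are passed in directly to keep the update step explicit.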
Outline
✤ Volumetric appearance model - The Voxel World
✤ Insights on classical recognition methods
✤ Compositional hierarchies
    ✤ Jin & Geman
    ✤ Fidler & Leonardis, CVPR '07; Fidler, Boben & Leonardis, CVPR 2008
    ✤ Mundy & Ozcanli, SPIE '09
✤ Experimental work: Proof of concept
✤ Future work
Classical Recognition: Bag of Features
Drawbacks:
✤ Disregards spatial information
✤ A large number of features is needed
Many have proposed more complex representations of spatial object structure.
✤ Constellation models [Weber, Welling et al.; Fergus et al.]: complex, few parts
✤ Probabilistic voting [Leibe, Schiele]: large codebook, complex matching
✤ Hierarchical representations
Pipeline: feature descriptor (e.g. SIFT [Lowe], HOG [Dalal]) → codewords/codebook → feature space → classifier (e.g. SVM, Naive Bayes, NN)
Hierarchical Representations
Fidler, Boben & Leonardis, U. of Ljubljana, Slovenia
Learning Hierarchical Models of Scenes, Objects, and Parts
Erik B. Sudderth, Antonio Torralba, William T. Freeman, and Alan S. Willsky. Electrical Engineering & Computer Science, Massachusetts Institute of Technology
TO APPEAR AT THE 2005 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION
Abstract
We describe a hierarchical probabilistic model for the detection and recognition of objects in cluttered, natural scenes. The model is based on a set of parts which describe the expected appearance and position, in an object-centered coordinate frame, of features detected by a low-level interest operator. Each object category then has its own distribution over these parts, which are shared between objects. We learn the parameters of this model via a Gibbs sampler which uses the graphical model's structure to analytically average over many parameters. Applied to a database of images of isolated objects, the sharing of parts among objects improves detection accuracy when few training examples are available. We also extend this hierarchical framework to scenes containing multiple objects.
1. Introduction
In this paper, we develop methods for the visual detection and recognition of object categories. We argue that multi-object recognition systems should be based on models which consider the relationships between different object categories during the training process. This approach provides several benefits. At the lowest level, significant computational savings can be achieved if different categories share a common set of features. More importantly, jointly trained recognition systems can use similarities between object categories to their advantage by learning features which lead to better generalization [4, 18]. This inter-category regularization is particularly important in the common case where few training examples are available.

In complex, natural scenes, object recognition systems can be further improved by using contextual knowledge about the objects likely to be found in a given scene, and common spatial relationships between those objects [7, 19, 20]. In this paper, we propose a hierarchical generative model for objects, the parts composing them, and the scenes surrounding them. The model, which is summarized in Figs. 1 and 5, shares information between object categories in three distinct ways. First, parts define distributions over a common low-level feature vocabulary, leading to computational savings when analyzing new images. In addition, and more unusually, objects are defined using a common set of parts. This structure leads to the discovery of parts
Figure 1. Graphical model describing how latent parts z generate the appearance w and position x, relative to an image-specific reference location r, of the features detected in an image of object o. Boxes denote replication of the corresponding random variables: there are M images, with N_m observed features in image m.
with interesting semantic interpretations, and can improve performance when few training examples are available. Finally, object appearance information is shared between the many scenes in which that object is found.

We begin in Sec. 2 by describing our generative model for objects and parts, including a discussion of related work in the machine vision and text analysis literature. Sec. 3 then describes parameter estimation methods which combine Gibbs sampling with efficient variational approximations. In Sec. 4, we provide simulations demonstrating the potential benefits of feature sharing. We conclude in Sec. 5 with preliminary extensions of the object hierarchy to scenes containing multiple objects.
2. A Generative Model for Object Features
Our generative model for objects is summarized in the graphical model (a directed Bayesian network) of Fig. 1. The nodes of this graph represent random variables, where shaded nodes are observed during training, and rounded boxes are fixed hyperparameters. Edges encode the conditional densities underlying the generative process [12].

2.1. From Images to Features
Following [17], we represent each of our M grayscale training images by a set of SIFT descriptors [13] computed on affine covariant regions. We use K-means clustering to
Sudderth, Torralba, Freeman & Willsky. MIT, 2006
group the closest clusters together. Each node is therefore a hyperball in the feature space represented by a centroid and a radius. The centroid is a mean of all features within a ball and the radius is a distance from the center to the furthest feature in the cluster. The radii of nodes increase towards the top of the tree. Each of the appearance clusters has several geometric distributions g_j corresponding to the object classes. The distributions contain information about the geometric relations between the object center and local appearance. Figure 2(b) shows appearance clusters and their distributions for objects. All object classes are therefore represented in one tree by a set of appearance clusters and geometric distributions.
Figure 2. (a) Hierarchical structure. (b) Codebook representation. Appearance clusters (left column) and their geometric distributions for different object classes. For visualization the distributions are in a 2D Cartesian coordinate system.
Building the tree. To build the tree we use a metric (Euclidean distance) to group the appearance clusters that lie in a hyperball of a given radius regardless of the object class or part they belong to. To build the tree structure we apply agglomerative clustering. This bottom-up method starts with the number of clusters equal to the number of features and merges the two closest clusters at each iteration. We record the indices of merged clusters and the similarity distance at which the clusters are merged at each iteration. The procedure continues until the last two clusters are merged. The resulting clustering trace is then used to construct the tree. The only parameters to provide are the size of the bottom nodes (radius of appearance clusters) and the number of tree levels. The radii for intermediate levels are equally distributed between the bottom node radius and the radius of the top node. These radii are only indicators for maximum cluster size at a given level. If there are no clusters between two levels that can be merged within the prescribed radius, the level is not created. Therefore, the actual number of levels and the number of nodes depends on the distribution of data points (features).
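The bottom-up tree-building step just described (merge the two closest clusters, record the merge trace) can be sketched minimally as follows. This is a toy illustration, not the authors' codebook code: real vocabularies use thousands of descriptors and the radius-limited levels described above.

```python
import itertools
import math

def agglomerative_trace(points):
    """Start with one cluster per feature; repeatedly merge the two clusters
    with the closest centroids (Euclidean distance) and record the merge
    trace (ids of merged clusters and the distance at which they merged)."""
    clusters = {i: (p, 1) for i, p in enumerate(points)}  # id -> (centroid, size)
    next_id, trace = len(points), []
    while len(clusters) > 1:
        (a, b), d = min(
            (((i, j), math.dist(clusters[i][0], clusters[j][0]))
             for i, j in itertools.combinations(clusters, 2)),
            key=lambda t: t[1])
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # size-weighted centroid of the merged cluster
        centroid = tuple((x * na + y * nb) / (na + nb) for x, y in zip(ca, cb))
        clusters[next_id] = (centroid, na + nb)
        trace.append((a, b, d))
        next_id += 1
    return trace
```

The recorded trace is exactly what the excerpt says is used afterwards to cut the tree into levels by radius.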
Discussion. This tree representation has several advantages. The appearance clusters are shared within one image as well as among different classes and object parts. Despite massive amounts of features extracted from images, the representation is compact. The tree can be used to represent features extracted from individual images as well as to represent all object classes. The difference is that a single image representation contains only one geometric distribution per cluster, and the object class representation contains as many geometric distributions as there are categories in the model. This representation can also be seen as a metric tree similar to those used in fast data structure algorithms [17], thus allowing for efficient search when matching features to the appearance clusters (cf. section 3.3). Finally, one can also interpret it as a tree of classifiers. A classifier (node) gives a positive response if a query feature lies within the radius of the node, negative otherwise.
2.3. Recognition
The approach relies on robustly estimated probability distributions of the model parameters. We apply Bayesian rules and use a formulation similar to [8]. Given features F detected in a query image, appearance clusters A, and geometric distributions G, we make a decision:

$$\frac{p(O_m \mid G, A)}{p(B \mid G, A)} = \frac{p(G, A \mid O_m)\,p(O_m)}{p(G, A \mid B)\,p(B)} \qquad (1)$$

p(O_m) and p(B) are priors of an object and background. In the following we explain only the object-related terms, but the background terms are estimated in a similar way. The likelihood p(G, A | O_m) is given by

$$p(G, A \mid O_m) = \prod_i \sum_j p(g_j, a_j \mid f_i, O_m)\, p(f_i \mid O_m) \qquad (2)$$

where p(f_i | O_m) is a feature probability and p(g_j, a_j | O_m) is a joint appearance-geometry distribution of each cluster. To simplify the notation we write g_j for the fully parameterized g_j(d, ·, ·). Each feature likelihood is then modeled by a mixture of distributions p(g_j, a_j | f_i, O_m) from appearance clusters a_j which match the query feature f_i.
To find initial hypotheses for object locations we search for local maxima of the likelihood function given by equation 1. To improve detection precision, overlapping bounding boxes of detected objects are frequently removed. However, we cannot follow this strategy since we want to deal with multiple objects which can partially occlude each other; that is, the hypotheses can overlap. We allow several objects to be simultaneously present at similar positions and scales. However, we impose a condition that each feature can contribute only to one hypothesis, since it is very unlikely that similar parts belonging to different objects are visible at the same position in the image. We therefore remove from the weaker hypothesis H all p(g_j, a_j | f_i, O_n) that contribute to both overlapping hypotheses:

$$p(G, A \mid O_m) = p(G_H, A_H \mid O_m) \setminus p(G_S, A_S \mid O_n) \qquad (3)$$

where S ⊂ H is a set of appearance-geometry likelihoods that also contribute to another stronger hypothesis.
Mikolajczyk, Leibe, Schiele. UK, Switzerland, Germany, 2006
Fidler, Boben & Leonardis, U. of Ljubljana, Slovenia. CVPR 2007, 2008
We thus need the means of finding the similarities among different hierarchical nodes in a geometrical sense. We propose to create similarity connections between hierarchical nodes within layers to achieve invariance for high variability in object shape and draw similarities across layers to achieve a proper scale normalization of features. We show how a layer-independent description of objects defined by the so-called shape-terminals, i.e. shapinals, can be passed to the higher-level, the category-specific representation. If performed in this manner, the problem of terminal nodes within the hierarchical "library" is solved in a natural way. There is no need to by-pass or float features to the top-most layer and thus unnecessarily load the complexity of representation, which may prevent the unsupervised creation of higher layers (the problem arising in [7]). Instead, at each hierarchical stage of learning, only a subset of the layer's statistically most repeatable features can be combined further, yet the final, cross-layered description of objects will retain its descriptive power.
Figure 1. Cross-layered, scale independent representation.
3.1. The base model: hierarchical compositional framework [7]

We build on our previously proposed approach [7], where we proposed an unsupervised learning framework to obtain a hierarchical compositional representation of object categories. Starting with simple oriented filters the approach learns the first three layers of optimally sharable features, defined as loose spatial compositions, i.e. parts. Upon the third layer, a higher-layer categorical representation is derived with minimal supervision. The model is in essence composed of two recursively iterated steps: 1.) a layer-learning process that statistically extracts parts by sequentially increasing the number of subparts contained in local image neighborhoods, and 2.) a part detection step that finds the learned compositions in images with an efficient and robust indexing and matching scheme. The advantage of the proposed representation lies in the capability to model the exponential variability present in images, yet still retaining computational efficiency by keeping the number of indexing links per part approximately constant across layers.

Figure 2. For greater generalization we establish similarities between hierarchical nodes within layers.
We adopt the terminology and the notation from [7]. The rotation invariance is dropped from our implementation, however, since we believe it can be incorporated into the model along with other invariances at a later stage of learning.

Let L_n denote the n-th layer. Each element of L_n, i.e. part, is envisioned to model spatial relations between its subparts, which furthermore model the spatial relations between their constituent subcomponents, etc. Parts are thus defined recursively in the following way. Each part P_i^n in L_n is characterized by the center of mass, and a list of subparts (parts of the previous layer) with their respective positions relative to the center of P_i^n. One subpart is the so-called central part that indexes into P_i^n from the lower, (n−1)-th layer. Specifically, a P_i^n that is centered at (0, 0) encompasses a list {(P_j^{n−1}, (x_j, y_j), (σ_{1j}, σ_{2j}))}_j, where (x_j, y_j) denotes the relative position of subpart P_j^{n−1}, while σ_{1j} and σ_{2j} denote the principal axes of an elliptical Gaussian encoding the variance of its position around (x_j, y_j). The hierarchy starts with a fixed L_1 composed of a set of oriented Gabor filters, {filter_i}.
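The recursive part definition above can be captured in a tiny data structure. This is a hypothetical sketch (the class name and fields are illustrative, not from the authors' library): a layer-n part lists layer-(n−1) subparts with their relative positions and positional variances.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """A part P_i^n on layer n: a list of subparts from layer n-1, each with
    a relative position (x_j, y_j) and the principal axes (sigma_1j, sigma_2j)
    of an elliptical Gaussian over that position."""
    layer: int
    subparts: list = field(default_factory=list)  # entries: (Part, (x, y), (s1, s2))

    def add_subpart(self, sub, xy, sigmas):
        # subparts must come from the immediately lower layer
        assert sub.layer == self.layer - 1, "subpart must be a layer n-1 part"
        self.subparts.append((sub, xy, sigmas))
```

For example, a layer-2 composition would collect layer-1 Gabor-filter primitives with their relative offsets; the recursion bottoms out at the fixed L_1 filters.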
Along the lines of how the parts are formed [7], the relative positions with variances {(x_j, y_j), (σ_{1j}, σ_{2j})} may preferably be replaced with segmented spatial maps, {map_j}_j, which capture the variability of subparts more accurately than the fitted Gaussians. A spatial map is a two-dimensional map that contains the learned disposition of locations of each subpart relative to the central part, upon which the parameters of the approximately Gaussian dis-
Jin and Geman, 2006
Hierarchical Object Recognition
Jin and Geman, Brown University
CVPR 2006
✤ Address the need for a representation that incorporates geometric coherence
✤ Allow for a more efficient representation
✤ Consistent with biological systems
Prior work by Geman: Efficient Discrimination [Bienenstock, Geman, Potter, '97], [Geman, Chi, 2002], [Geman, Jin, CVPR 2006]
A COMPOSITIONAL MACHINE:
Test set: 385 images, mostly from Logan Airport
Courtesy of Visics Corporation
Hierarchy (bottom to top): parts of characters and plate sides → characters, plate sides → generic letter, generic number → plate boundary, license numbers → license plates

✤ Probabilistic framework
✤ Hierarchy and reusability
✤ It does not exclude the sharing of subparts
✤ Parts are everywhere, compositions are rare
✤ Need to model relative geometry of parts
Markovian distribution: Basic structures
Figure 3. Samples from Markov backbone (upper panel, '4850') and compositional distribution (lower panel, '8502').
a_β(I) returns the relative coordinates of the four numerals that instantiate β in the interpretation I. Similarly, each character brick, and each numeral in particular, has an associated attribute function that computes the relative coordinates of the particular parts that are composed into that character in a particular interpretation. A "compositional distribution" is built from a Markov backbone (Equation 1) and a pair of probability distributions, p^c_β ("composed") and p^0_β ("null"), on each attribute a_β. The former, composed distribution, captures regularities of the arrangements (i.e. instantiations) of the children bricks, given that they are parts of the object represented by β; the latter, null distribution, is the attribute distribution in the absence of the non-Markovian term. The set of relative coordinates of the three parts that make up the '0' in the upper panel of Figure 3 is an example of an attribute, and the particular arrangement of the parts in the figure is a sample from the corresponding null distribution.

In a compositional distribution, the null attribute distributions are compared to their composed counterparts: given an interpretation I,

$$P(I) \propto \prod_{\beta \in B(I)} \epsilon_{\beta, x_\beta} \prod_{\beta \notin B(I)} (1 - \epsilon_{\beta, 0}) \prod_{\beta \in A(I)} \frac{p^c_\beta(a_\beta(I))}{p^0_\beta(a_\beta(I))} \qquad (2)$$

where A(I), the "above set", is the set of non-terminal on (active) bricks. The proportionality sign (∝) can be replaced with equality (=) if, at the introduction of each attribute function a_β, care is taken to ensure that p^0_β(a_β) is exactly the current ("unperturbed") conditional distribution on a_β given x_β > 0. In general, it is not practical to compute an exact null distribution and P must be re-normalized.

The effect on coverage of the perturbation can be seen by comparing the upper and lower panels in Figure 3. For each non-terminal brick β, the denominator, p^0_β(a_β), was approximated by assuming that in the absence of an explicit constraint, the prior distribution on a_β is the one consistent with independent instantiations of the children. The numerator, p^c_β(a_β), was constructed to encourage regularity in the relative positions of character parts, and of characters, in composing characters and strings, respectively. The upper panel is a sample instantiation from the Markov backbone; the lower panel is a sample instantiation from the full compositional distribution. Samples from the full compositional distribution can be computed (at considerable computational cost) through a variant of importance sampling.

Conditional Data Models. The data model connects interpretations to the grey-level image, and completes the Bayesian framework. In the license-plate-reading demonstration system, we have assumed that the data distribution, conditioned on an interpretation, is a function only of the states of the terminal bricks:

$$P(\vec{y} \mid I) = P(\vec{y} \mid \{x_\beta : \beta \in T\})$$

where T ⊆ B is the set of terminal, or bottom-row, bricks.

Good performance in most image analysis applications requires some degree of photometric invariance. In the context of a probability model, the notion of invariance is closely connected to the statistical notion of sufficiency. The following data model, employed in the demonstration system, is an example of the application of sufficiency to invariance. As remarked earlier, the terminal bricks in the demonstration system represent reusable parts of alpha-numeric characters. The states of the terminal bricks code the local position of the represented part. Some of the parts can be more-or-less clearly discerned from the upper (Markov) panel in Figure 3. The zero and the eight are each made of three parts whereas the four and the five are each made of two parts. The black portion of each "part filter" represents image locations that are expected to be dark, relative to the locations represented by the white portion of the filter. The rank sum R (cf. Lehmann [26]) of the intensities of the corresponding "black" pixels, among the union of intensities of black and white pixels, is a convenient statistic that is demonstrably invariant to all monotone transformations of the image histogram. We model pixel grey levels by assuming that their distribution depends only on R (R is sufficient), and we model R with an exponential probability distribution, thereby promoting small rank sums cor-
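The rank-sum statistic used in the excerpt's data model is easy to sketch. The function below is illustrative (ties are not handled, unlike a full Wilcoxon rank sum): it sums the ranks of the "black" pixel intensities within the pooled black-plus-white sample, and because ranks depend only on the ordering of intensities, it is unchanged by any monotone transformation of the image histogram.

```python
def rank_sum(black, white):
    """Rank sum R of the black-pixel intensities among the union of black and
    white pixel intensities. Ranks are 1-based positions in the sorted pooled
    sample; this sketch assumes all intensities are distinct (no ties)."""
    pooled = sorted(black + white)
    return sum(pooled.index(v) + 1 for v in black)
```

A quick check of the invariance claim: squaring every intensity (a monotone map on non-negative values) leaves R unchanged.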
Compositional distribution: Composition vs. Coincidence
Sampling
Original image Zoomed license region
Top object under Markov distribution
Top object under content-sensitive distribution
Efficient discrimination: Markov versus content-sensitive distributions
Detection
Prior Work by Fidler and Leonardis [Fidler, Berginc, Leonardis, CVPR 2006], [Fidler, Leonardis, CVPR 2007], [Fidler, Boben, Leonardis, CVPR 2008]
Example of learned whole-object shape models. S. Fidler, M. Boben, A. Leonardis. Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation. Submitted to a journal.
Images from Fidler webpage
Compositionality and bottom-up learning
✤ Computational efficiency: scalable
✤ Bottom-up learning: all classes share the early layers, then layers become class-specific
✤ Models are both general and discriminative
✤ Sharing of parts
Complete objects are learned from simple edges.
Work by Mundy and Ozcanli [Mundy, Ozcanli, SPIE 2009]
✤ Combines Geman's and Leonardis' work into a unified Bayesian framework
✤ Classification of foreground objects: vehicles
✤ Domain: low-resolution satellite images
Figure 6. An example of vehicle extrema operator responses (dark operator). The spatial resolution is around 0.7 meters, with about 25 pixels on a vehicle. The operator response is indicated by the cyan dot. The operator kernel extent is indicated in blue. The original grey-scale intensity is in the red channel.

Figure 7. The composition of extrema operators. The anisotropic dark operator is composed with a bright peak operator. The composition is characterized by the distance d' and relative orientation θ'.

Figure 8. Three primitive extrema operators compose in a Layer 1 node. The central part is a bright operator and the second primitive part is a dark operator. The peak responses of the operators are indicated by cyan pixels. The operator kernel is indicated in blue. The vehicle intensity is in the red channel.
Probabilistic Score:

$$p(c^i_{\alpha\alpha'} \mid d_{\alpha\alpha'}, \theta_{\alpha\alpha'}) = \frac{p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid c^i_{\alpha\alpha'})\,P(c^i_{\alpha\alpha'})}{p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'})}$$

$$p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'}) = p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid \bar{c}_{\alpha\alpha'})\,P(\bar{c}_{\alpha\alpha'}) + \sum_{j=0}^{k-1} p(d_{\alpha\alpha'}, \theta_{\alpha\alpha'} \mid c^j_{\alpha\alpha'})\,P(c^j_{\alpha\alpha'})$$
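Numerically, this probabilistic score is ordinary Bayes' rule with the evidence expanded over the k composition classes plus the no-composition (background) term c̄. A minimal sketch, with argument names that are illustrative rather than taken from the original implementation:

```python
def composition_posterior(lik, priors, lik_bg, prior_bg, i):
    """Posterior p(c_i | d, theta) for composition class c_i.
    lik[j]   : p(d, theta | c_j), the class-conditional likelihood
    priors[j]: P(c_j)
    lik_bg, prior_bg: the background (no-composition) terms for c-bar."""
    evidence = lik_bg * prior_bg + sum(l * p for l, p in zip(lik, priors))
    return lik[i] * priors[i] / evidence
```

The background term in the denominator is what lets an observed (distance, orientation) pair score low for every class when it looks like a coincidence rather than a composition.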
Composition of Parts
Hierarchical Composition for 3D Objects
Simple primitives, e.g. edges
Junctions, curves...
Windows, street lines, roofs, leaves...
Buildings, streets, trees, rivers...
Learn bottom-up
Outline
✤ Volumetric appearance model - The Voxel World
✤ Insights on classical recognition methods
✤ Compositional hierarchies
    ✤ Jin & Geman
    ✤ Fidler & Leonardis, CVPR '07; Fidler, Boben & Leonardis, CVPR 2008
    ✤ Mundy & Ozcanli, SPIE '09
✤ Proof of concept: Construction of a simple hierarchy to find windows in the voxel world
✤ Future Work
Data and Algorithm
Algorithm Steps
1. For each orientation:
✤ Apply the corner kernel on the appearance and occupancy grids
✤ Perform non-maxima suppression on the kernel-specific region
2. Build a hierarchy to find windows

Top: mean appearance near the wall surface. Bottom: occupancy.
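The suppression in step 1 can be sketched as follows. The function name, the dict-based sparse grid, and the cubic neighborhood are simplifying assumptions for illustration, not the thesis implementation (which suppresses within a kernel-specific region):

```python
def non_max_suppression(responses, radius):
    """Keep a voxel's response only if it is the maximum within `radius`
    voxels along each axis. `responses` maps (x, y, z) -> kernel response."""
    kept = {}
    for (x, y, z), r in responses.items():
        neighborhood = (
            v for (i, j, k), v in responses.items()
            if abs(i - x) <= radius and abs(j - y) <= radius and abs(k - z) <= radius
        )
        if r >= max(neighborhood):  # includes the voxel itself
            kept[(x, y, z)] = r
    return kept
```

Running this once per kernel orientation yields one sparse set of peak corner responses per primitive, which is what the hierarchy in step 2 consumes.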
The mixture at each voxel, $f(x) = \sum_{k=1}^{K_1} w_k f_k(x)$, is approximated by a single Gaussian $\tilde{f}(x) \sim N(\tilde{\mu}_f, \tilde{\sigma}_f^2)$ (or by its first component $f_1(x)$) by minimizing $D_{KL}(f(x)\,\|\,\tilde{f}(x))$.
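For a single-Gaussian approximation, the KL minimizer $D_{KL}(f \| \tilde{f})$ has a closed form: moment matching, i.e. the Gaussian takes the mixture's overall mean and variance. A sketch (assuming normalized weights; this is the standard moment-matching result, not code from the thesis):

```python
def moment_matched_gaussian(weights, means, variances):
    """Single Gaussian N(mu, var) minimizing KL(f || f_tilde) for a 1-D
    mixture f with the given (normalized) weights, component means, and
    component variances: mu and var match the mixture's first two moments."""
    mu = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (v + (m - mu) ** 2) for w, m, v in zip(weights, means, variances))
    return mu, var
```

Note the variance picks up both the within-component variances and the spread of the component means, so a bimodal appearance mixture collapses to one broad Gaussian.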
The Primitives: Corner Kernel
Corner kernel in 2D: every pixel has a label/weight
Corner kernel in 3D: every voxel has a label/weight
(Kernel regions: plus (+) region and minus (−) region; dimensions: width, height, depth)
The Primitives: Corner Kernel
Plus (+) region: white voxels. Minus (−) region: black voxels.
The Primitives: Corner Kernel
Coordinate system of a corner kernel: axes x, y, z; rotation angles θ, φ, ψ.
Rotate the kernel to create the layer of primitives.
Layer 1: Primitives - 3D Corners
Applying the Kernel
Corresponding voxels: "convolve" the kernel with the appearance grid
Operator Response and Simplifications
I_{x_i}: intensity at voxel x_i. K: kernel response.

$$K = \sum_{i: x_i \in R^+} I_{x_i} - \sum_{j: x_j \in R^-} I_{x_j}$$

The distribution of the response is Gaussian, $K \sim N(\mu_k, \sigma_k^2)$, with

$$\mu_k = \sum_{i: x_i \in R^+} \mu_{x_i} - \sum_{j: x_j \in R^-} \mu_{x_j}, \qquad \sigma_k^2 = \sum_{i: x_i \in R^+} \sigma_{x_i}^2 + \sum_{j: x_j \in R^-} \sigma_{x_j}^2$$
This may be the first feature detector based on the spatial arrangement of appearance distributions
$$r_\alpha = \begin{cases} \mu_k, & \text{if } \dfrac{1}{|R^+|}\displaystyle\sum_{i: x_i \in R^+} P(x_i \in S) > t \ \text{ and } \ \mu_k > 0 \\[4pt] 0, & \text{otherwise} \end{cases}$$
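The thresholded operator response above is straightforward to sketch. Argument names here are hypothetical: per-voxel mean intensities stand in for the appearance distributions, and the surface probabilities cover the plus region only.

```python
def corner_response(plus_means, minus_means, plus_surface_probs, t=0.5):
    """Response r_alpha of one oriented corner kernel: the expected response
    mu_k (mean intensities summed over the plus region minus the minus
    region) is kept only when the average surface probability over the plus
    region exceeds t and mu_k is positive; otherwise the response is 0."""
    mu_k = sum(plus_means) - sum(minus_means)
    occupancy = sum(plus_surface_probs) / len(plus_surface_probs)
    return mu_k if (occupancy > t and mu_k > 0) else 0.0
```

The occupancy gate is what makes this a detector over the voxel world rather than over raw images: a bright/dark contrast only fires where the grid believes a surface actually exists.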
Experiment Setup

Experimental hierarchy:
Layer 1: Corner primitives
Layer 2: Pairs of corners
Layer 3: Triplets of corners
Object Layer: Window

1. Demonstrate the hierarchy on a small region
2. Show some results on the full grid
Algorithm Steps
1. For each orientation
✤ Run a corner kernel
Layer 1: Simple Features
1. For each orientation
✤ Run a corner kernel
✤ Perform non-maxima suppression on kernel-specific region
Layer 2
2. Build a hierarchy
2.1 Pair corners (90°) → Pairs
$$p(c^i_{\alpha_i,\alpha_j} \mid d_{\alpha_i,\alpha_j}, \theta_{\alpha_i,\alpha_j}) = \begin{cases} \dfrac{1}{|\{r_{\alpha_i}, r_{\alpha_j} > 0\}|}, & \text{for } r_{\alpha_i}, r_{\alpha_j} > 0 \\[4pt] 0, & \text{otherwise} \end{cases}$$
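This pairing rule assigns a uniform score over the active response pairs. A one-line sketch (names illustrative; the count of active pairs is passed in rather than recomputed, as a simplifying assumption):

```python
def pair_score(r_i, r_j, n_active_pairs):
    """Layer-2 composition score: a pair of corner responses contributes
    1 / (number of active response pairs) when both responses are positive,
    and 0 otherwise."""
    return 1.0 / n_active_pairs if (r_i > 0 and r_j > 0) else 0.0
```

Because the distribution over active pairs is uniform, this layer effectively just gates on co-occurring positive responses; the future-work slide notes that learned prior distributions would replace it.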
Layer 3
1. ...
2. Build a hierarchy
2.1 Pair coplanar corners (90°) → Pairs
2.2 Pair corner pairs → Triplets
Object Layer: Windows
1. ...
2. Build a hierarchy
2.1 Pair corners (90°) → Pairs
2.2 Pair corner pairs → L-shapes
2.3 Pair triplets → Window
Full Grid: Occupancy Probabilities
Full Grid Results: Corners
Full Grid Results: Windows
Summary
✤ Appealing characteristics of the Voxel World and compositional hierarchies
✤ Introduced volumetric feature detectors that operate on distribution functions of appearance
✤ Demonstrated, using a very simple instance of a compositional hierarchy, the efficiency of such a representation
✤ Localized a large number of windows
Future Work
✤ Include other extrema operators in the hierarchy (e.g. edges)
✤ Use occupancy information
✤ Learn prior distributions to fully explain the probability density of compositions
✤ Optimize source code: search and storage of parts (e.g. octree)
✤ Learn parts automatically
✤ Learn whole-object hierarchies
The Principle of Compositionality

"The meaning of a complex expression is determined by its structure and the meanings of its constituents."
- Stanford Encyclopedia of Philosophy
Questions?