MIT6870_ORSU_lecture4: Explicit and implicit 3D object models


Lecture 4: Explicit and implicit 3D object models

6.870 Object Recognition and Scene Understanding

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm



Monday

Recognition of 3D objects

Presenter: Alec Rivers

Evaluator:



2D frontal face detection

Amazing how far they have gotten with so little...



People have the bad taste of not being

rotationally symmetric

Examples of un-collaborative subjects



Objects are not flat*

*In the old days, some toy makers and a few people working on face detection suggested that flat objects could be a good approximation to real objects.



Solution to deal with 3D variations: "do not deal with it"

"Not"-dealing with rotations and pose:

Train a different model for each view.

The combined detector is invariant to pose variations without an explicit 3D model.
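A minimal sketch of such a combined detector, assuming each view-specific detector returns a score for an image window and that the combination is a simple maximum over views. The detector functions and feature names here are hypothetical stand-ins, not trained models:

```python
def combined_detector(window, view_detectors):
    """Score a window with every view-specific detector and keep the best view."""
    scores = {view: detect(window) for view, detect in view_detectors.items()}
    best_view = max(scores, key=scores.get)
    return scores[best_view], best_view

# Hypothetical stand-ins for trained detectors: each maps a feature dict to a score.
view_detectors = {
    "frontal": lambda w: w["two_eyes"] - 0.5,
    "profile": lambda w: w["one_eye_and_ear"] - 0.5,
}

score, view = combined_detector({"two_eyes": 0.2, "one_eye_and_ear": 0.9}, view_detectors)
print(view)  # profile
```

No 3D model appears anywhere: pose invariance comes only from covering the views with separate 2D models.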



Need to detect Nclasses * Nviews * Nstyles, in clutter.

Lots of variability within classes, and across viewpoints.

[Figure axes: object classes, viewpoints]

And why should we stop with pose?

Let's do the same with styles, lighting conditions, etc., etc., etc...

So, how many classifiers?
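A back-of-the-envelope count, assuming one classifier per combination of class, view, style and lighting condition. All the numbers below are made up for illustration:

```python
def num_classifiers(n_classes, n_views, n_styles=1, n_lightings=1):
    """One classifier per (class, view, style, lighting) combination."""
    return n_classes * n_views * n_styles * n_lightings

# Hypothetical numbers: 20 classes, 8 views each.
print(num_classifiers(20, 8))                             # 160
# Adding 5 styles and 4 lighting conditions multiplies the count again.
print(num_classifiers(20, 8, n_styles=5, n_lightings=4))  # 3200
```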



Depth without objects

Random dot stereograms (Bela Julesz)

Julesz, 1971
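A random dot stereogram can be sketched in a few lines. This is a minimal version, assuming binary dots and a central square shifted horizontally in the right image; the image size, square size and disparity are arbitrary choices:

```python
import numpy as np

def random_dot_stereogram(size=128, square=48, disparity=4, seed=0):
    """Return (left, right) binary images; only the disparity reveals the square."""
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, size=(size, size), dtype=np.uint8)
    right = left.copy()
    lo = (size - square) // 2
    hi = lo + square
    # Shift the central square to the left by `disparity` pixels in the right image.
    right[lo:hi, lo - disparity:hi - disparity] = left[lo:hi, lo:hi]
    # Fill the strip uncovered by the shift with fresh random dots.
    right[lo:hi, hi - disparity:hi] = rng.integers(
        0, 2, size=(square, disparity), dtype=np.uint8)
    return left, right

left, right = random_dot_stereogram()
```

Each image alone is pure noise with no object contours; the square appears only when the two images are fused.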

3D is so important for humans that we decided to grow two eyes in front of the face instead of having one looking to the front and another to the back.

(this is not something that Julesz said... but he could have, maybe he did)



Objects 3D shape priors

by H Bülthoff Max-Planck-Institut für biologische Kybernetik in Tübingen

Video taken from http://www.michaelbach.de/ot/fcs_hollow-face/index.html



3D drives perception of important

object attributes

by Roger Shepard ("Turning the Tables")

Depth processing is automatic, and we cannot shut it down...



3D drives perception of important

object attributes

Frederick Kingdom, Ali Yoonessi and Elena Gheorghiu of McGill Vision Research unit.

The two Towers of Pisa



It is not all about objects

The 3D percept is driven by the scene, which imposes its rules on the objects






Class experiment

Experiment 1: draw a horse (the entire body, not just the head) on a white piece of paper.

Do not look at your neighbor! You already know what a horse looks like... no need to cheat.



Class experiment

Experiment 2: draw a horse (the entire body, not just the head) but this time choose a viewpoint as weird as possible.



Anonymous participant



3D object categorization

Wait: object categorization in humans is not

invariant to 3D pose



3D object categorization

by Greg Robbins

Although we can categorize all three pictures as views of a horse, they do not look like equally typical views of horses. And they do not seem to be recognizable with the same ease.



Observations about pose invariance in humans

Two main families of effects have been observed:

- Canonical perspective

- Priming effects



Canonical Perspective

From Vision Science, Palmer

Experiment (Palmer, Rosch & Chase 81): participants are shown views of an object and are asked to rate "how much each one looked like the objects they depict" (scale: 1 = very much like, 7 = very unlike)



Canonical Perspective


Examples of canonical perspective:

In a recognition task, reaction time

correlated with the ratings.

Canonical views are recognized faster

at the entry level.

Why?



Canonical Viewpoint

Frequency hypothesis

Maximal information hypothesis



Canonical Viewpoint

Frequency hypothesis: ease of recognition is related to the number of times we have seen the objects from each viewpoint.

For a computer, using its Google memory, a horse looks like:

It is not a uniform sampling of viewpoints

(some artificial datasets might contain non-natural statistics)



Canonical Viewpoint

Frequency hypothesis: ease of recognition is related to the number of times we have seen the objects from each viewpoint.

Can you think of some examples in which this hypothesis might be wrong?



Canonical Viewpoint

Maximal information hypothesis: Some views provide more information than others about the objects.

Best views tend to show multiple sides of the object.

Can you think of some examples in which this hypothesis might be wrong?



Canonical Viewpoint

Maximal information hypothesis:

Clocks are preferred as purely frontal



Canonical Viewpoint

Frequency hypothesis

Maximal information hypothesis

Probably both are correct.

Edelman & Bülthoff 92: created new objects to control familiarity.

1- When all viewpoints were presented with the same frequency, observers still preferred specific viewpoints.

2- When few viewpoints were presented, recognition was better for previously seen viewpoints.



Observations about pose invariance in humans

Two main families of effects have been observed:

- Canonical perspective

- Priming effects



Priming effects

Priming paradigm: recognition of an object is

faster the second time that you see it.

Biederman & Gerhardstein 93



Priming effects

Same exemplars

Different exemplars








Object representations

Explicit 3D models: use volumetric representation. Have an explicit model of the 3D geometry of the object.

Appealing but hard to get it to work...

Implicit 3D models: matching the input 2D view to view-specific representations.

Not very appealing but somewhat easy to get it to work*...

* we all know what I mean by "work"




Implicit 3D models: matching the input 2D view to view-specific representations.

The object is represented as a collection of 2D views (maybe the most frequent views seen in the past).

Tarr & Pinker (89) show people are faster at recognizing previously seen views, as if they were storing them. People were also able to recognize unseen views, so they also generalize to new views. It is not just template matching.



Why do I explain all this?

As we build systems and develop algorithms it is good to:

- Get inspiration from what others have thought

- Get intuitions about what can work, and how things can fail.



Explicit 3D model

Object Recognition in the Geometric Era: a Retrospective, Joseph L. Mundy



Explicit 3D model

Not all explicit 3D models were disappointing.

For some object classes, with accurate geometric and appearance models, it is possible to get remarkable results.



A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, SIGGRAPH 99








We have not yet achieved the same level of description for other object classes



Implicit 3D models



Aspect Graphs

"The nodes of the graph represent object views that are adjacent to each other on the unit sphere of viewing directions but differ in some significant way. The most common view relationship in aspect graphs is based on the topological structure of the view, i.e., edges in the aspect graph arise from transitions in the graph structure relating vertices, edges and faces of the projected object." Joseph L. Mundy
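As a toy illustration of the data structure, here is a sketch of the aspect graph of a cube. It assumes perspective viewing from outside the cube, so each aspect is a region of viewpoint space bounded by the six face planes and is identified by the set of visible faces. The sign-vector encoding and function names are choices made for this example:

```python
from itertools import product

AXES = "xyz"

def visible_faces(signs):
    """Names of the faces seen from a viewing region, e.g. (1, 0, -1) -> ['+x', '-z']."""
    return [("+" if s > 0 else "-") + AXES[i] for i, s in enumerate(signs) if s != 0]

def cube_aspect_graph():
    # Per axis, the viewpoint is below (-1), between (0), or above (+1) the two
    # parallel face planes. (0, 0, 0) is the inside of the cube and is skipped.
    nodes = [s for s in product((-1, 0, 1), repeat=3) if any(s)]
    # Two aspects are adjacent when exactly one face appears or disappears,
    # i.e. one coordinate switches between 0 and +1 or -1.
    edges = set()
    for a in nodes:
        for b in nodes:
            changed = [i for i in range(3) if a[i] != b[i]]
            if len(changed) == 1 and 0 in (a[changed[0]], b[changed[0]]):
                edges.add(frozenset((a, b)))
    return nodes, edges

nodes, edges = cube_aspect_graph()
print(len(nodes), len(edges))  # 26 48
```

Even for a cube the graph has 26 nodes (6 one-face, 12 two-face and 8 three-face aspects), which hints at how quickly aspect graphs grow for realistic shapes.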






Affine patches

Revisit invariants as a local description of 3D objects: indeed, although smooth surfaces are almost never planar in the large, they are always planar in the small.

3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints. F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, IJCV 2006



Affine patches

Two steps:

1. Detection of salient image regions

2. Extraction of a descriptor around the

detected locations



Affine patches

Two steps:

1. Detection of salient image regions

(Garding and Lindeberg, 96; Mikolajczyk and Schmid, 02)

a) an elliptical image region is deformed to maximize the isotropy of the corresponding brightness pattern.

b) its characteristic scale is determined as a local extremum of the normalized Laplacian in scale space.

c) the Harris (1988) operator is used to refine the position of the ellipse's center.

The elliptical region obtained at convergence can be shown to be covariant under affine transformations.
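A rough sketch of one iteration of step (a), assuming isotropy is measured on the second-moment matrix of image gradients, as is common in affine shape adaptation. The full detector iterates this together with steps (b) and (c) and actually warps the image, which is omitted here. The synthetic patch and all function names are illustrative:

```python
import numpy as np

def second_moment_matrix(patch):
    """Average outer product of the image gradient over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([[np.mean(gx * gx), np.mean(gx * gy)],
                     [np.mean(gx * gy), np.mean(gy * gy)]])

def isotropy(M):
    """Ratio of smallest to largest eigenvalue; 1 means perfectly isotropic."""
    w = np.linalg.eigvalsh(M)
    return w[0] / w[1]

def normalizing_transform(M):
    """B = M^(-1/2): resampling the patch at x = B x' gives B.T @ M @ B = identity."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

# Synthetic anisotropic blob: wide in x, narrow in y.
ys, xs = np.mgrid[-20:21, -20:21]
patch = np.exp(-(xs ** 2 / (2 * 8.0 ** 2) + ys ** 2 / (2 * 3.0 ** 2)))

M = second_moment_matrix(patch)
B = normalizing_transform(M)
M_normalized = B.T @ M @ B
print(isotropy(M) < isotropy(M_normalized))  # True
```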






Affine patches

Each region is represented with the SIFT descriptor.



Affine patches

A coherent 3D interpretation of all the matches is obtained using a formulation derived from structure-from-motion and RANSAC to deal with outliers.
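To illustrate how RANSAC copes with outliers, here is a sketch on a much simpler model than the one in the paper: a single 2D affine map between matched point locations, rather than a structure-from-motion formulation. The thresholds, iteration count and synthetic matches are arbitrary assumptions:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map: dst is approximated by [src, 1] @ A (A is 3x2)."""
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A

def ransac_affine(src, dst, n_iters=200, tol=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([src, np.ones((len(src), 1))])
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # Three matches are the minimal sample that determines an affine map.
        sample = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[sample], dst[sample])
        inliers = np.linalg.norm(X @ A - dst, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all matches that agree with the best hypothesis.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers

# Synthetic matches: 40 points, the first 10 corrupted into gross outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(40, 2))
A_true = np.array([[0.9, 0.2], [-0.1, 1.1], [5.0, -3.0]])
dst = np.hstack([src, np.ones((40, 1))]) @ A_true
dst[:10] += rng.uniform(20, 50, size=(10, 2))

A_est, inliers = ransac_affine(src, dst)
print(int(inliers.sum()))  # 30
```

The same hypothesize-and-verify loop applies when the model is a 3D interpretation of the matches: only the minimal sample size and the residual change.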




Patch-based single view detector

Car model, Screen model

Vidal-Naquet, Ullman (2003)



For a single view

First we collect a set of part templates from a set of training objects.

Vidal-Naquet, Ullman (2003)



Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments

E. Bart, E. Byvatov, & S. Ullman


Extended fragments

Extended patches are extracted using short sequences.

Use Lucas-Kanade motion estimation to track patches across the sequence.
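A minimal sketch of a single Lucas-Kanade step for one patch, assuming pure translation and small motion. A practical tracker would iterate, use image pyramids and handle patches near the border. The synthetic texture and all names are illustrative:

```python
import numpy as np

def lucas_kanade_translation(frame0, frame1, center, half=7):
    """Estimate the (dx, dy) motion of the patch around `center` (row, col)."""
    r, c = center
    win = (slice(r - half, r + half + 1), slice(c - half, c + half + 1))
    Iy, Ix = np.gradient(frame0.astype(float))           # spatial derivatives
    It = frame1.astype(float) - frame0.astype(float)     # temporal derivative
    # Brightness constancy: Ix*dx + Iy*dy + It = 0 for every pixel in the window.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy

def texture(x, y):
    """Smooth synthetic image with structure along both axes."""
    return np.sin(0.2 * x) + np.sin(0.15 * y)

ys, xs = np.mgrid[0:64, 0:64].astype(float)
frame0 = texture(xs, ys)
frame1 = texture(xs - 0.3, ys - 0.2)   # content moved by (dx, dy) = (0.3, 0.2)

dx, dy = lucas_kanade_translation(frame0, frame1, center=(32, 32))
```

Chaining such estimates from frame to frame follows a patch through the sequence, so its appearance can be collected from several viewpoints.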




Learning

Once a large pool of extended fragments is created, there is a training stage to select the most informative fragments.

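For each fragment, evaluate the mutual information I(C, F) between the class label C and the fragment indicator F (present/absent).

Select the fragment B with [equation not recovered]

In the subsequent rounds, use [equation not recovered]

All these operations are easy to compute. It is just counting.

If C and F are independent, then I(C, F) = 0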



C: 1 0 1 1 0 0 0 0 0 1

F: 1 1 1 1 1 0 0 0 0 0

P(C=1, F=1) = 3 / 10

P(C=1, F=0) =

P(C=0, F=1) =

P(C=0, F=0) =
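The counting can be sketched as follows, assuming the first column of ten values on the slide is C and the second is F, and measuring I(C, F) in bits. The function names are illustrative:

```python
from collections import Counter
from math import log2

# Assumed reading of the slide's two columns (one entry per training image).
C = [1, 0, 1, 1, 0, 0, 0, 0, 0, 1]   # class label
F = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # fragment present/absent

def joint_probabilities(c, f):
    """Estimate P(C=ci, F=fi) by counting co-occurrences."""
    n = len(c)
    counts = Counter(zip(c, f))
    return {(ci, fi): counts[(ci, fi)] / n for ci in (0, 1) for fi in (0, 1)}

def mutual_information(c, f):
    """I(C, F) in bits from the empirical joint distribution."""
    joint = joint_probabilities(c, f)
    p_c = {v: joint[(v, 0)] + joint[(v, 1)] for v in (0, 1)}
    p_f = {v: joint[(0, v)] + joint[(1, v)] for v in (0, 1)}
    return sum(p * log2(p / (p_c[ci] * p_f[fi]))
               for (ci, fi), p in joint.items() if p > 0)

joint = joint_probabilities(C, F)
mi = mutual_information(C, F)
print(joint[(1, 1)])  # 0.3
```

Calling `mutual_information([0, 0, 1, 1], [0, 1, 0, 1])` on independent variables returns 0, matching the independence property stated on the slide.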




Training without sequences

Challenges:

- We do not know which fragments are in correspondence (we cannot use motion estimation due to the strong transformation)

Fragments that are in correspondence will have

detections that are correlated across viewpoints.
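One way to picture this is a sketch that assumes each training object is seen once in view A and once in view B, and that we record which fragments fire on which object. Fragments are then paired by the correlation of their detection patterns. The data layout and the toy detections are assumptions made for illustration, not the procedure from the paper:

```python
import numpy as np

def match_fragments(det_a, det_b):
    """det_a[i, j] = 1 if view-A fragment j fires on object i (det_b likewise).

    Returns, for each view-A fragment, the index of the best-correlated
    view-B fragment, plus the full correlation matrix.
    """
    a = det_a - det_a.mean(axis=0)
    b = det_b - det_b.mean(axis=0)
    norm = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(b, axis=0))
    corr = (a.T @ b) / np.where(norm > 0, norm, 1.0)
    return corr.argmax(axis=1), corr

# Toy detections on 6 objects for 2 fragments per view.
det_a = np.array([[1, 0], [1, 0], [0, 1], [1, 1], [0, 0], [0, 1]], dtype=float)
det_b = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]], dtype=float)

matches, corr = match_fragments(det_a, det_b)
print(matches.tolist())  # [1, 0]
```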

The same approach can be used for

arbitrary transformations

Bart & Ullman




Shared features for Multi-view object detection

View-invariant features

View-specific features

Training does not require having different views of the same object.

Torralba, Murphy, Freeman. PAMI 07




Shared features for Multi-view object detection

Sharing is not a tree. Depends also on 3D symmetries.

Torralba, Murphy, Freeman. PAMI 07



Multi-view object detection

Strong learner H response for car as function of assumed view angle

Torralba, Murphy, Freeman. PAMI 07



Voting schemes

Towards Multi-View Object Class Detection

Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, Bernt Schiele, Luc Van Gool



Viewpoint-Independent Object Class Detection using 3D Feature Maps

Training dataset: synthetic objects

Features

Voting scheme and detection

Each cluster casts votes for the voting bins of the discrete poses contained in its internal list.
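The voting step can be sketched as follows, assuming each feature cluster stores the list of discrete poses in which it was observed during training and that every matched cluster casts one unweighted vote per listed pose. The cluster names and pose angles are hypothetical:

```python
from collections import Counter

def vote_for_pose(matched_clusters, cluster_poses):
    """Accumulate one vote per (matched cluster, pose in its internal list)."""
    votes = Counter()
    for cluster in matched_clusters:
        for pose in cluster_poses.get(cluster, ()):
            votes[pose] += 1
    return votes

# Hypothetical internal lists: azimuth bins (degrees) where each cluster was seen.
cluster_poses = {
    "wheel": [0, 45, 315],
    "headlight": [0, 45],
    "grille": [0],
    "door": [90, 270],
}

votes = vote_for_pose(["wheel", "headlight", "grille"], cluster_poses)
best_pose, n_votes = votes.most_common(1)[0]
print(best_pose, n_votes)  # 0 3
```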

Liebelt, Schmid, Schertler. CVPR 2008




Monday

Recognition of 3D objects

Presenter: Alec Rivers

Evaluator:
