
Last Week

• Recognized that the 2D image is a representation of a 3D scene and thus contains a consistent interpretation
  – Labeled edges
  – Labeled vertices

• Matching techniques for object recognition
  – Graph theoretic
  – Relaxation
  – Perceptual organization (neural networks)

This Week

• Look at direct measurement of 3D attributes via stereo cameras

• Look at other uses of matching
  – Stereo correspondence
  – Motion correspondence

Stereo Vision

• Goal is to extract scene depth via multiple monocular images with a passive sensor
  – Note that this can also be done by other, “active” means such as LIDAR (LIght Detection And Ranging)

Stereo Vision

• Humans do it well from a single image and very, very well through stereo images

• Not well understood what the mechanism is
  – We understand the biological design, but not the exact algorithm

• Goal of computer vision is not to mimic the mechanics of the biological system, but to mimic the functionality of the system

Stereo Vision

• Depth information will be used to…
  – Differentiate objects from background
  – Differentiate objects from one another
  – Expose camouflaged objects

• Basic method is to take advantage of the lateral displacement of the image of a 3D object in two cameras with different, but overlapping, views
  – Lateral displacement is also known as disparity

Stereo Vision

• Two sub-problems
  – Correspondence problem
    • The problem of measuring the disparity of each point in the two eye (camera) projections
  – Interpretation problem
    • The use of disparity information to recover the orientation and distance of surfaces in the scene

Stereo Algorithmic Steps

• Basic steps to be performed in any stereo imaging system
  – Image Acquisition
  – Camera Modeling
  – Feature Extraction
  – Image Matching
  – Depth Determination
  – Depth Interpolation

Image Acquisition

• Just as the name implies

• Capturing two images with a very specific camera geometry

Camera Modeling

• Related to Image Acquisition

• For accurate depth results the camera parameters must be known

• Also, the relationship between the two cameras must be known

Stereo Imaging Geometry

[Figure: stereo imaging geometry. Left and right cameras with parallel axes, each with focal length f, are separated by the stereo baseline B and project the scene onto the left and right image planes.]

• The result is two images that are slightly different

Feature Extraction

• These are the image objects that will be matched between the left and right images
  – Gray level (pixel) based
  – Edge based
  – Line based
  – Region based
  – Hybrid approaches

• All techniques have been tried
  – All provide some degree of success
  – All have drawbacks

Image Matching

• By far the most difficult part of the stereo problem

• Also called the “stereo correspondence problem”

• When people “study” stereo imaging, this is generally what they are looking at

• The question is: Which parts (pixels, edges, lines, etc.) of the left image correspond to which parts of the right image?

Image Matching

• Gray level based
  – Take a section of one image and use it as a convolution mask over the other (a sketch of this block-matching idea follows below)
• Edge based
  – Extract edges, then take a section of one edge image and use it as a convolution mask over the other
• Line based
  – Extract edges, form line segments, then match using a relaxation technique
• Region based
  – Extract regions, then match using a relaxation technique
• Hybrid approach
  – Use matched regions (or lines) as guides to further pixel level matches
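As a concrete illustration of the gray level approach, here is a minimal Python sketch assuming grayscale images stored as NumPy arrays. It scores candidates with normalized cross-correlation rather than a literal convolution, and the function and parameter names are illustrative, not from the lecture.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_patch(left, right, row, col, half=3, search=8):
    """Slide the patch centered at (row, col) in `left` over a
    (2*search+1)^2 window of `right`; return the best offset."""
    t = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best, best_score = (0, 0), -np.inf
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            r, c = row + dr, col + dc
            if (r - half < 0 or c - half < 0 or
                    r + half + 1 > right.shape[0] or c + half + 1 > right.shape[1]):
                continue  # candidate patch would fall off the image
            cand = right[r - half:r + half + 1, c - half:c + half + 1].astype(float)
            score = ncc(t, cand)
            if score > best_score:
                best_score, best = score, (dr, dc)
    return best, best_score
```

The epipolar constraint discussed later collapses this 2D search to a single row.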

Image Matching Issues

• Density of depth map
  – Would like to have a depth measurement at every image pixel
    • This means a correspondence between every pixel in each image must be made
  – Clearly difficult (if not impossible) to do
    • Gray level matching is the only real hope
    • All other approaches will not provide a dense map, especially the region based approach
    • Thus the study of hybrid algorithms

Depth Map


Image Matching Issues

• Photometric variation
  – The two cameras image the scene from two different viewpoints, by definition
  – Thus the lighting on the scene differs for the two cameras
    • Shadows, reflectance, etc.
  – Affects all matching and feature extraction techniques

Image Matching Issues

• Occlusion
  – When the image of one object is blocked by another in one of the two cameras
    • It’s a 3D scene, so this will happen!
  – Some features will show up in one image and not the other, thus making matching impossible
  – Affects all matching and feature extraction techniques

Image Matching Issues

• Repetitive texture
  – e.g. a brick wall (or any other regular, repeated pattern texture)
  – Makes the matching process very difficult, although some sort of relaxation algorithm may address the issue
  – Region based matching may also be used to address this issue

Image Matching Issues

• Lack of texture
  – e.g. smooth, featureless objects
  – If there are no features, there is no way to match
  – Region based matching may be used to address this issue

Depth Determination

• It’s all math!
  – And relatively simple math at that.


Depth Determination

[Figure: depth determination geometry. A world point Pw(Xw,Yw,Zw) projects to Pl(Xl,Yl) in the left image and Pr(Xr,Yr) in the right image; the camera axes are parallel, separated by the stereo baseline B, with focal length f.]

Depth Determination

• Depth (the distance from the imaged scene point to the baseline) can be determined through simple algebraic and geometric relationships

• The quantity (xl - xr) is referred to as the stereo disparity
  – i.e. the difference in where the two cameras saw an object

By similar triangles, with the left camera at the world origin and the right camera displaced along Xw by the baseline B:

  xl / f = Xw / Zw, so xl Zw = f Xw
  xr / f = (Xw - B) / Zw, so xr Zw = f Xw - f B

Subtracting the second equation from the first:

  xl Zw - xr Zw = f B
  Zw (xl - xr) = f B
  Zw = f B / (xl - xr)
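A minimal numeric check of Zw = f B / (xl - xr) in Python; the focal length, baseline, and image coordinates below are made-up illustrative values.

```python
def depth_from_disparity(f, B, xl, xr):
    """Zw = f*B / (xl - xr) for a rectified stereo pair.
    f: focal length (pixels), B: baseline (same units as the result)."""
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("a point in front of the cameras has positive disparity")
    return f * B / disparity

# Hypothetical values: f = 700 px, B = 0.12 m, disparity = 350 - 336 = 14 px
print(depth_from_disparity(700, 0.12, 350, 336))  # 6.0 meters
```

Note the inverse relationship: the smaller the disparity, the farther away the point.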

Depth Interpolation

• We want to describe surfaces, not individual points

• In the event that we don’t get a dense depth map (and we rarely do) we must interpolate the missing points
  – What we get is called a sparse depth map

Depth Interpolation

• Three basic methods
  – Relaxation – surface fitting with constraints
    • Similar in nature to relaxation labeling
  – Analytic – surface fitting to a specified model (equation), as sketched below
  – Heuristic – use of local neighborhoods and predetermined rules
    • Use of “educated guesses” and “higher level scene knowledge” – an AI technique
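One possible sketch of the analytic idea, assuming SciPy is available: fit a piecewise-linear surface through the scattered depth measurements. The function name and signature are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def densify(rows, cols, depths, shape):
    """Interpolate a dense depth map of the given shape from scattered
    (row, col, depth) measurements; pixels outside the convex hull of
    the measurements come back as NaN."""
    grid_r, grid_c = np.mgrid[0:shape[0], 0:shape[1]]
    points = np.column_stack([rows, cols])
    return griddata(points, depths, (grid_r, grid_c), method="linear")
```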

Assumptions To Make Life Easier

• From psychological studies…
  – In light of ambiguities in the matching problem, matches which preserve “figural continuity” are to be preferred
  – That is, we prefer smooth surfaces over sharp changes
  – This isn’t really a problem since the sharp changes [in all likelihood] won’t result in ambiguities

Assumptions To Make Life Easier

• Epipolarity (epipolar lines)
  – The camera geometry can be defined such that a point feature in one image must lie on a specific line in the other image
  – This constrains the search to multiple 1D problems (see the sketch below)
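A minimal sketch of what the constraint buys us, assuming a rectified pair so that epipolar lines are image rows; the SSD score and parameter names are illustrative choices.

```python
import numpy as np

def scanline_disparity(left, right, row, col, half=3, max_disp=32):
    """The match for left[row, col] must lie on the same row of `right`,
    so the earlier 2D search collapses to a 1D sweep over disparities."""
    t = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best_d, best_ssd = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d                      # positive disparity shifts left
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        ssd = np.sum((t - cand) ** 2)    # sum of squared differences
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d
```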

Epipolar Lines

[Figure: the same stereo geometry as before, illustrating that the projections Pl(Xl,Yl) and Pr(Xr,Yr) of a world point Pw(Xw,Yw,Zw) lie on corresponding epipolar lines.]

Stereo Pair Images

[Figure: a stereo pair of images, one from the left camera and one from the right camera]

Depth Map Rendering

[Figure: gray level rendering of the depth map]

Final Thoughts

• Yes, it can be done with more than two cameras
  – This improves the accuracy of (removes ambiguity from) the match

• Yes, it can be done with one camera
  – Simply move the camera along the baseline, snapping pictures as it goes

Motion Processing

• Whereas stereo processing worked on two (or more) frames taken at the same time, motion processing works on two (or more) frames taken at different times

Motion Processing

• Uses for motion processing
  – Scene segmentation
  – Motion detection (is something moving?)
    • Security applications
  – Motion estimation (how is the object moving?)
    • MPEG uses this to predict future frames
  – 3D structure determination
    • Multiple views of an object as it moves
  – Object tracking
    • Defense industry makes great use of this
  – Separate camera motion from object motion
    • Camera stabilization

Motion Processing

• Approaches range from simple…
  – Frame-to-frame subtraction
• …to intermediate…
  – Frame-to-frame correspondence
• …to difficult…
  – Statistical based processing for tracking

Correspondence

• The frame-to-frame correspondence problem is essentially the same as that for stereo processing
  – But it may be more difficult since…
    • objects may be moving towards the camera (they get larger)
    • objects may be moving away from the camera (they get smaller)
    • objects may be rotating (they change shape)

Frame Subtraction

• Avoids the correspondence operation altogether
• Problems arise in that objects lacking texture do not get detected
• We also must address the threshold selection problem
• Assumes that the scene changes will be small due to the short time duration between frames
• Variations include learning the background (static scene) and subtracting it from the live (dynamic) scene, as sketched below
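A minimal sketch of both variants, assuming grayscale frames as NumPy arrays; the threshold and learning rate are arbitrary placeholders, not values from the lecture.

```python
import numpy as np

def motion_mask(frame_a, frame_b, threshold=25):
    """Simple variant: threshold the absolute frame-to-frame difference."""
    diff = np.abs(frame_a.astype(int) - frame_b.astype(int))
    return diff > threshold

def background_subtract(frames, threshold=25, alpha=0.05):
    """Learned-background variant: keep a running average of the static
    scene and subtract it from each live frame."""
    background = frames[0].astype(float)
    masks = []
    for f in frames[1:]:
        masks.append(np.abs(f - background) > threshold)
        background = (1 - alpha) * background + alpha * f  # slow update
    return masks
```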

Frame Subtraction

[Figure: Frame(n), Frame(n + 1), the difference Frame(n) - Frame(n + 1), and an enhanced version of the difference image]

Optical Flow

• Apparent motion of the brightness patterns within an image

• You end up with pictures as shown
  – In this case the camera was moving towards the object

Another Example

Optical Flow

• It’s basically frame-to-frame subtraction with a lot more information

• From the optical flow field various parameters can be measured (a minimal example of computing such a field follows below)
  – Object shape
  – Object segmentation
  – Camera motion
  – Multiple object motions
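For reference, dense optical flow is available off the shelf. A minimal sketch assuming OpenCV is installed; Farneback’s method is one of several dense flow algorithms, named here as a stand-in rather than the specific technique the slides used.

```python
import cv2  # assumes opencv-python is installed

def dense_flow(prev_gray, next_gray):
    """Per-pixel apparent motion between two grayscale frames.
    Returns an (H, W, 2) array of (dx, dy) displacements."""
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```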

Motion Estimation in MPEG

• Select an image block from frame fn

• Select a larger image block from frame fn+1

• Center the fn block on the fn+1 block

• Compute correlation between the two blocks

• Spiral the fn block outward on the fn+1 block until the correlation yields a suitable response

[Figure: a small image block from frame fn centered within a larger search block from frame fn+1]
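A minimal sketch of the spiral search, with mean absolute difference (a common block-matching score in video encoders) standing in for correlation; the stopping threshold and search radius are arbitrary placeholders.

```python
import numpy as np

def spiral_offsets(radius):
    """Offsets ordered ring by ring from (0, 0): an outward spiral."""
    yield (0, 0)
    for r in range(1, radius + 1):
        for dr in range(-r, r + 1):
            for dc in range(-r, r + 1):
                if max(abs(dr), abs(dc)) == r:
                    yield (dr, dc)

def estimate_motion(block, search_area, center, good_enough=1.0, radius=7):
    """Slide `block` over `search_area` starting at `center` and moving
    outward; stop early once the match is good enough."""
    bh, bw = block.shape
    best, best_mad = (0, 0), np.inf
    for dr, dc in spiral_offsets(radius):
        r, c = center[0] + dr, center[1] + dc
        if r < 0 or c < 0:
            continue  # offset runs off the search area
        cand = search_area[r:r + bh, c:c + bw]
        if cand.shape != block.shape:
            continue
        mad = np.mean(np.abs(block.astype(float) - cand))
        if mad < best_mad:
            best, best_mad = (dr, dc), mad
        if best_mad < good_enough:
            break     # suitable response found; stop spiraling
    return best
```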

Motion Estimation in MPEG

• The basic scheme using gray level correlation (matching) works because the premise is that there will be very small motions between frames

• In the event of large motions or illumination changes (or any other “drastic” changes) the system reinitializes and doesn’t try to use any motion information

Object Tracking

• This is essentially motion prediction

• After observing a moving object can we predict where it will appear in the next frame?

Object Tracking

• Can be as simple as a low pass filter
  – A weighted average of the object’s position in previous frames
  – Heavily weight the newest frames

• Can be a complex statistical model taking into account noisy measurements
  – Kalman Filter

• As your confidence in the prediction increases, the window in which you must perform the correspondence decreases in size
  – Basically, you’re trying to reduce the time to search
  – A sketch of the simple weighted-average predictor follows below
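A minimal sketch of the low pass filter idea, using exponential weights as one illustrative choice of weighting; a Kalman filter would additionally model measurement noise.

```python
def predict_next(positions, decay=0.6):
    """Predict the next (x, y) from a weighted average of recent
    frame-to-frame velocities, weighting the newest most heavily."""
    if len(positions) < 2:
        return positions[-1]  # no velocity history yet
    velocities = [(b[0] - a[0], b[1] - a[1])
                  for a, b in zip(positions, positions[1:])]
    vx = vy = wsum = 0.0
    w = 1.0
    for dx, dy in reversed(velocities):  # newest velocity first
        vx += w * dx
        vy += w * dy
        wsum += w
        w *= decay                       # older frames count for less
    x, y = positions[-1]
    return (x + vx / wsum, y + vy / wsum)

# e.g. a target seen at (0, 0), (2, 1), (4, 2) is predicted near (6.0, 3.0)
```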

Summary

• We have merely touched on the basics of Computer Vision

• There is much, much more

• Hopefully, with this introduction you will be able to pursue other topic areas on your own


Things To Do

• Final Exam due next week

• Course evaluation this week (online)