A Tutorial on using SIFT Presented by Jimmy Huff (Slightly modified by Josiah Yoder for Winter 2014-2015)


Introduction

• The Scale-Invariant Feature Transform by David Lowe is useful in many applications of object recognition.

• Our objective in this presentation is to understand how to extract SIFT descriptors from an image


• To extract SIFT keypoints, we use a cascaded filtering algorithm with the following four steps of filtering:
  – Scale-Space Extrema Detection
  – Keypoint Localization
  – Orientation Assignment
  – Keypoint Descriptor

• This algorithm is efficient because its more expensive operations are applied only at locations that pass the earlier, cheaper tests, which are a small subset of the initial image input.

Scale-Space Extrema Detection – Get the Points!

• In order to have scale-invariant features, we must have a way to extract features from an image across all scales.

• This can be done using a continuous function known as scale-space (Witkin, 1983)

• Under a variety of reasonable assumptions, the only possible scale-space kernel is the Gaussian function.

• Lowe proposed to use Difference of Gaussians (DOG) in order to collect extrema as interest points.
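For reference, the scale space and the DOG described in these bullets can be written as follows. The symbols L, G, I, D and the constant factor k do not appear on the slides; they follow the notation of Lowe's paper and are given here only as a sketch of the definitions.

```latex
\begin{aligned}
G(x, y, \sigma) &= \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)} \\
L(x, y, \sigma) &= G(x, y, \sigma) * I(x, y) \\
D(x, y, \sigma) &= L(x, y, k\sigma) - L(x, y, \sigma)
\end{aligned}
```

Here I is the input image, * is convolution in x and y, L is the smoothed image at scale σ, and D is the difference of two smoothed images whose scales differ by the factor k.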


• Scale-space is divided into octaves, each with S levels.

• The smoothing is done incrementally, with a constant factor k = 2^(1/S) between adjacent images, such that σ of image S + 1 in the octave is twice that of the first image.
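The blur schedule of one octave can be sketched in a few lines of Python. This is only an illustration: the function name `octave_sigmas` is hypothetical, and the starting value 1.6 is an assumed example value, not something stated on the slides.

```python
def octave_sigmas(sigma0, num_intervals):
    """Return the blur levels of one octave: sigma0 * k**i for i = 0..S."""
    # Constant ratio between adjacent levels, chosen so that S steps double sigma.
    k = 2.0 ** (1.0 / num_intervals)
    return [sigma0 * k ** i for i in range(num_intervals + 1)]


# Assumed example: S = 3 levels, starting at sigma = 1.6.
sigmas = octave_sigmas(1.6, 3)
print(sigmas)
```

The last entry is twice the first, which is the σ at which the next octave starts.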



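• DOG is used for its efficiency: each DOG image requires only a subtraction of two adjacent smoothed images, which must be computed anyway.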

• Using the images to the right, we may now find the extrema for this octave.


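• If a point is greater than all, or less than all, of its 26 neighbors (8 in its own DOG image and 9 in each of the DOG images above and below), it is regarded as an extreme point.

• This is a relatively inexpensive step, as most points are ruled out after the first few comparisons and are never compared to every neighbor.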

• Note that this comparison cannot be done on the boundaries of an image or on the top and bottom DOG.
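The 26-neighbor test above might look like the following in plain Python. It is a minimal sketch, assuming the DOG images of one octave are stored as a nested list `dog[s][y][x]` (a hypothetical layout) and that the caller never passes a point on an image border or in the top or bottom DOG image.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly greater than, or strictly less than,
    all 26 neighbors in the 3 x 3 x 3 cube around it."""
    center = dog[s][y][x]
    is_max = is_min = True
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == dy == dx == 0:
                    continue  # skip the center point itself
                neighbor = dog[s + ds][y + dy][x + dx]
                if neighbor >= center:
                    is_max = False
                if neighbor <= center:
                    is_min = False
                if not (is_max or is_min):
                    return False  # early exit: most points stop here
    return True


# Tiny 3 x 3 x 3 examples: a single peak, and a completely flat cube.
peak = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
peak[1][1][1] = 1.0
flat = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
print(is_extremum(peak, 1, 1, 1), is_extremum(flat, 1, 1, 1))
```

The early return is what makes the step cheap: a typical point fails both the "maximum" and "minimum" conditions within the first few neighbors.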


Each octave is processed separately. Each octave starts with σ twice the starting σ of the previous octave, and σ continues to increase within the octave.

As sample points are collected, they are stored as a three-vector

p = (x, y, σ) [σ being scale in this case]

Refine the Points!

If we were to stop after the first steps, we would have too many interest points to be effective. In this second step, we eliminate points of low contrast. [Ignoring localization of “real” SIFT here…]

Can you see the truck??


Only keep points where |DOG| > some threshold (e.g. 3% of the maximum intensity in the original image). The absolute value is used so that strong minima are kept as well as strong maxima.
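A minimal sketch of this contrast filter is shown below. The function name, the `(s, y, x)` index tuples, and the nested-list layout `dog[s][y][x]` are all hypothetical choices for illustration. Comparing the absolute value of the DOG response is an assumption made so that strong minima survive as well as strong maxima.

```python
def keep_high_contrast(points, dog, max_intensity, fraction=0.03):
    """Keep candidate points whose DOG response exceeds a fraction of the
    maximum image intensity (3% by default, as in the example above)."""
    threshold = fraction * max_intensity
    return [(s, y, x) for (s, y, x) in points
            if abs(dog[s][y][x]) > threshold]


# One tiny 2 x 2 DOG image with intensities scaled to [0, 1].
dog = [[[0.001, -0.2], [0.05, 0.0]]]
candidates = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
kept = keep_high_contrast(candidates, dog, max_intensity=1.0)
print(kept)
```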


• By applying this to our previous image, with 8714 sample points…

• We reduce the number of sample points to 362

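Further Refine the Points!

We may further refine the sample points by removing those that lie on edges. First, we take the 2 x 2 Hessian matrix of D computed at the location and scale of the keypoint:

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$$

The eigenvalues of the matrix H are proportional to the principal curvatures of D. If a point is on an edge, its ratio of eigenvalues will be very high (recall the Harris Corner Detector). Since we are only concerned with ratios, we do not need the eigenvalues themselves. Let α be the larger eigenvalue and β the smaller one. We may set a threshold r, where α = rβ and

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta} = \frac{(r\beta + \beta)^2}{r\beta^2} = \frac{(r + 1)^2}{r}$$

Therefore, if

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} > \frac{(r + 1)^2}{r}$$

the point is ignored.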
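The edge test can be sketched with finite differences on a single DOG image. The function name `passes_edge_test`, the list-of-rows layout `d[y][x]`, and the default r = 10 are assumptions for illustration (the slides do not give a value for r). Rejecting points with a non-positive determinant is also an added assumption, since the ratio is not meaningful there.

```python
def passes_edge_test(d, y, x, r=10.0):
    """Return True if the point at (y, x) is NOT edge-like.
    d is one DOG image as a list of rows; (y, x) must not be on the border."""
    # Second derivatives estimated from neighboring samples.
    dxx = d[y][x + 1] - 2.0 * d[y][x] + d[y][x - 1]
    dyy = d[y + 1][x] - 2.0 * d[y][x] + d[y - 1][x]
    dxy = (d[y + 1][x + 1] - d[y + 1][x - 1]
           - d[y - 1][x + 1] + d[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # curvatures of opposite sign, or one of them is zero
    # Keep the point only when the eigenvalue ratio is below the threshold r.
    return trace * trace / det < (r + 1.0) ** 2 / r


blob = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
ridge = [[0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0]]
print(passes_edge_test(blob, 1, 1), passes_edge_test(ridge, 1, 1))
```

The isolated blob has equal curvature in both directions and is kept, while the vertical ridge has curvature in only one direction and is discarded.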


• By applying this to our previous image, with 362 sample points…

• We reduce the number of sample points to 240

Orientation Assignment

• In order to be rotation invariant, each point must have a reference angle based on its neighbor points.

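• We find the gradient magnitude m and angle θ of every pixel in the smoothed scale-space image L by the following equations:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$

$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$$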

• We are concerned with the points in the region of the keypoint.

• The magnitudes are weighted according to a Gaussian function centered at the keypoint.
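The per-pixel gradient computation can be sketched as follows. The function name `gradient` and the list-of-rows layout `L[y][x]` are hypothetical. The sketch uses `atan2` so that the angle covers the full 0 to 360 degree range, and it leaves out the Gaussian weighting of the magnitudes.

```python
import math


def gradient(L, y, x):
    """Gradient magnitude and angle (degrees in [0, 360)) at pixel (y, x)
    of a smoothed image L, using differences of the four direct neighbors."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, angle


# Brightness increases to the right, so the gradient points along +x.
ramp = [[0.0, 1.0, 2.0],
        [0.0, 1.0, 2.0],
        [0.0, 1.0, 2.0]]
print(gradient(ramp, 1, 1))
```

Note that y here is the row index, so a positive angle turns toward increasing row number; which visual direction that corresponds to depends on the image coordinate convention in use.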

We then use the magnitudes to populate a histogram of 36 bins

[Figure: orientation histogram with 36 bins (horizontal axis, bins 1 to 36; vertical axis, 0 to 80)]

A parabola is fit to the maximum value of the histogram and the values of its two neighboring bins. The maximum of this parabola gives us the angle θ. The point now has four components:

p = (x, y, σ, θ)
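The histogram and parabola fit can be sketched as below. The function name `dominant_orientation` and its inputs (lists of gradient angles in degrees and their already-weighted magnitudes) are hypothetical. The sketch returns only the single highest peak and places each bin's value at the center of the bin, which is an assumption about the convention.

```python
def dominant_orientation(angles_deg, weights, num_bins=36):
    """Build a weighted orientation histogram and return the angle (degrees)
    at the maximum of a parabola fit through the peak bin and its neighbors."""
    bin_width = 360.0 / num_bins
    hist = [0.0] * num_bins
    for angle, w in zip(angles_deg, weights):
        hist[int(angle % 360.0 // bin_width) % num_bins] += w

    peak = max(range(num_bins), key=hist.__getitem__)
    left = hist[(peak - 1) % num_bins]    # the histogram wraps around
    right = hist[(peak + 1) % num_bins]
    center = hist[peak]

    # Vertex of the parabola through (-1, left), (0, center), (1, right).
    denom = left - 2.0 * center + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    return ((peak + 0.5 + offset) * bin_width) % 360.0


theta = dominant_orientation([95.0, 85.0, 105.0], [4.0, 1.0, 2.0])
print(theta)
```

Because the bin to the right of the peak is heavier than the bin to the left, the fitted maximum moves slightly to the right of the peak bin's center.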


Keypoint Descriptor

We now assign a descriptor to the sample point. The two points above represent sample points, with the red arrow being the point's orientation assignment. By assigning a keypoint descriptor, we will know whether these two are alike or not.


We again use the gradients of neighboring pixels to determine the descriptor. The gradients in the region are weighted by a Gaussian window whose size is proportional to the scale of the keypoint.


We first must rotate the gradient vectors of the neighboring pixels relative to the keypoint’s angle θ.

Notice that, once this step has been done to ensure rotation invariance, these two are (most likely) a match!


We then group the vectors from step 3 into a 2 x 2 set of histograms with 8 orientation bins each. However, experimentation has shown it is best to use a 4 x 4 set with 8 bins each for maximum effectiveness and efficiency. This gives a feature vector with 4 x 4 x 8 = 128 elements.
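The 4 x 4 x 8 grouping can be sketched as follows. This is a deliberately simplified illustration: the function name `build_descriptor` and its inputs are hypothetical, it rotates only the gradient angles (not the sampling grid), and it omits the Gaussian weighting and any interpolation between bins. The patch side is assumed to be divisible by the grid size.

```python
def build_descriptor(magnitudes, angles_deg, theta_deg, grid=4, bins=8):
    """Accumulate gradient magnitudes of a square patch into grid x grid
    cells with `bins` orientation bins each (4 x 4 x 8 = 128 by default)."""
    size = len(magnitudes)   # e.g. 16 for a 16 x 16 patch
    cell = size // grid      # pixels per cell side
    descriptor = [0.0] * (grid * grid * bins)
    for y in range(size):
        for x in range(size):
            # Express the gradient angle relative to the keypoint angle.
            relative = (angles_deg[y][x] - theta_deg) % 360.0
            b = int(relative // (360.0 / bins)) % bins
            cy, cx = y // cell, x // cell
            descriptor[(cy * grid + cx) * bins + b] += magnitudes[y][x]
    return descriptor


# A 16 x 16 patch whose gradients all point along the keypoint orientation.
mags = [[1.0] * 16 for _ in range(16)]
angs = [[30.0] * 16 for _ in range(16)]
desc = build_descriptor(mags, angs, theta_deg=30.0)
print(len(desc), desc[:8])
```

Each of the 16 cells covers 4 x 4 pixels here, so every cell contributes its full magnitude to its first orientation bin.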


By summarizing the gradient vectors of the neighboring pixels into 8 orientation bins per cell, the descriptor tolerates small shifts in gradient position and becomes resilient to moderate changes in 3D perspective.


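In order to be resilient to differences in illumination, we normalize the feature vector to unit length. A change in contrast multiplies every gradient by the same constant, which the normalization cancels. A change in brightness adds a constant to every pixel, which does not affect the gradients at all. The descriptor is therefore invariant to changes in contrast or brightness.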

In order to be resilient to non-linear changes in illumination, such as camera saturation, we reduce the effect of large gradient vectors by setting a threshold in the feature vector such that no value is larger than 0.2. We then re-normalize.
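The normalize, clamp, and re-normalize sequence can be sketched in a few lines. The function name `normalize_descriptor` is hypothetical, and the handling of an all-zero vector (returned unchanged) is an added assumption.

```python
import math


def normalize_descriptor(vec, clamp=0.2):
    """Normalize to unit length, limit each entry to `clamp`, re-normalize."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v] if norm > 0 else list(v)

    v = unit(vec)                      # cancels a global change in contrast
    v = [min(c, clamp) for c in v]     # reduce the effect of large gradients
    return unit(v)


# One dominant entry among 127 small ones.
raw = [10.0] + [1.0] * 127
out = normalize_descriptor(raw)
print(out[0], out[1])
```

Multiplying `raw` by any positive constant gives the same result, and the dominant entry ends up with less weight than plain normalization alone would give it.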

[Example image slides: Rotation Invariance, Scale Invariance, 3D Perspective Resilience, Occlusion (with outliers), Occlusion, Tracking]
