CS7015 (Deep Learning) : Lecture 12
Object Detection: R-CNN, Fast R-CNN, Faster R-CNN, You Only Look Once (YOLO)

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras


Acknowledgements

Some images borrowed from Ross Girshick’s original slides on RCNN, Fast RCNN, etc.

Some ideas borrowed from the presentation of Kaustav Kundu∗

∗ Deep Object Detection


Module 12.1 : Introduction to object detection


So far we have looked at Image Classification

We will now move on to another Image Processing Task - Object Detection


Task      Image classification    Object Detection
Output    Car                     Car, exact bounding box containing car


[Figure: object detection pipeline. Region proposals → feature extraction (x1, x2, ..., xd) → classifier (person / flag / ball / none)]

Let us see a typical pipeline for object detection

It starts with a region proposal stage where we identify potential regions which may contain objects


We could think of these regions as mini-images

We extract features (SIFT, HOG, CNNs) from these mini-images


Pass these through a standard image classifier to determine the class
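The three stages described above can be wired together as in the following minimal sketch. It is only an illustration: the function names (`detect`, `crop`) and the toy stand-ins for each stage are assumptions introduced here, not something taken from the lecture.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]          # (x, y, w, h), top-left corner plus size
Image = List[List[float]]                # grayscale image as rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Cut the mini-image described by box out of the full image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def detect(image: Image,
           propose: Callable[[Image], Sequence[Box]],
           extract: Callable[[Image], List[float]],
           classify: Callable[[List[float]], str]) -> List[Tuple[Box, str]]:
    """Run region proposal -> feature extraction -> classification."""
    detections = []
    for box in propose(image):
        features = extract(crop(image, box))   # x1, x2, ..., xd
        label = classify(features)
        if label != "none":                    # drop background regions
            detections.append((box, label))
    return detections


# Toy stand-ins (hypothetical): two fixed proposals, mean brightness as the
# only feature, and a threshold rule playing the role of the classifier.
def toy_propose(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 0, 2, 2)]


def toy_extract(mini: Image) -> List[float]:
    pixels = [p for row in mini for p in row]
    return [sum(pixels) / len(pixels)]


def toy_classify(features: List[float]) -> str:
    return "ball" if features[0] > 0.5 else "none"


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
result = detect(image, toy_propose, toy_extract, toy_classify)
```

Here only the bright right-hand region survives, so `result` holds a single detection. Keeping the stages as interchangeable functions mirrors how the later slides swap out one component at a time.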


[Figure: each region proposal has width w and height h; after feature extraction (x1, x2, ..., xd), bounding box regression outputs a corrected width w∗ and height h∗]

In addition, we would also like to correct the proposed bounding boxes

This is posed as a regression problem (for example, we would like to predict w∗, h∗ from the proposed w and h)


Let us see how these three components have evolved over time

Pre 2012
  Region proposals: propose all possible regions in the image of varying sizes (almost brute force)
  Feature extraction: use handcrafted features (SIFT, HOG)
  Classifier: train a linear classifier using these features

We will now see three algorithms (RCNN, Fast RCNN, Faster RCNN) that progressively improve these components
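To give a feel for the "almost brute force" proposal stage, here is a small sketch of a sliding-window generator. The function name, the window sizes and the stride are illustrative assumptions; the slides do not specify how the pre-2012 systems enumerated regions.

```python
from typing import Iterator, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), top-left corner plus size


def sliding_windows(img_w: int, img_h: int,
                    sizes: Sequence[Tuple[int, int]],
                    stride: int) -> Iterator[Box]:
    """Yield every window of the given sizes that fits inside the image."""
    for w, h in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


# An 8x6 image, three window sizes, stride 2.
proposals: List[Box] = list(
    sliding_windows(8, 6, sizes=[(2, 2), (4, 4), (8, 6)], stride=2)
)
```

Even this tiny 8x6 image yields 19 windows. With realistic image sizes, many scales and a fine stride the count grows very quickly, which is the cost the later methods try to reduce.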


Module 12.2 : RCNN model for object detection


[Figure: RCNN pipeline. Input → region proposals → feature extraction → classifier and bounding box regression]

Selective Search for region proposals

Does hierarchical clustering at different scales

For example, the figures from left to right show clusters of increasing sizes

Such a hierarchical clustering is important as we may find different objects at different scales
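The idea of collecting proposals at every level of a hierarchical clustering can be sketched as below. This is a toy and not the actual Selective Search algorithm: it assumes the image has already been split into small regions, and it uses mean intensity as the only similarity measure, whereas the real method starts from a proper segmentation and combines several similarity cues. All names here are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive
Region = Tuple[Box, float, int]  # (bounding box, mean intensity, pixel count)


def touches(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge(a: Region, b: Region) -> Region:
    """Union of the two boxes with a pixel-weighted mean intensity."""
    (ba, ma, na), (bb, mb, nb) = a, b
    box = (min(ba[0], bb[0]), min(ba[1], bb[1]),
           max(ba[2], bb[2]), max(ba[3], bb[3]))
    return (box, (ma * na + mb * nb) / (na + nb), na + nb)


def hierarchical_proposals(regions: List[Region]) -> List[Box]:
    """Greedily merge the most similar touching regions, keeping every box."""
    regions = list(regions)
    proposals = [r[0] for r in regions]          # finest scale
    while len(regions) > 1:
        pairs = [(abs(regions[i][1] - regions[j][1]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))
                 if touches(regions[i][0], regions[j][0])]
        if not pairs:
            break                                # nothing left to merge
        _, i, j = min(pairs)                     # most similar pair
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])              # coarser scale
    return proposals


# Four 2x2 regions in a row: two dark ones followed by two bright ones.
regions = [((0, 0, 2, 2), 0.10, 4), ((2, 0, 4, 2), 0.15, 4),
           ((4, 0, 6, 2), 0.90, 4), ((6, 0, 8, 2), 0.80, 4)]
boxes = hierarchical_proposals(regions)
```

The two dark regions merge first, then the two bright ones, and finally everything merges into one box, so `boxes` contains proposals at three scales.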


Proposed regions are cropped to form mini images

Each mini image is scaled to match the input size of the CNN (feature extractor)
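A minimal sketch of the crop-and-scale step, assuming a grayscale image stored as a list of rows and nearest-neighbour resampling (the slides do not say which interpolation is used, and the 4x4 target size is purely illustrative):

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x, y, w, h), top-left corner plus size


def crop(image: Image, box: Box) -> Image:
    """Cut the proposed region out of the image to form a mini image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def resize_nearest(image: Image, out_w: int, out_h: int) -> Image:
    """Scale to out_w x out_h by nearest-neighbour sampling.

    The aspect ratio is not preserved: the mini image is simply stretched
    to the fixed size expected by the feature extractor.
    """
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


image = [[float(10 * r + c) for c in range(6)] for r in range(4)]
mini = crop(image, (1, 1, 2, 3))          # 2 pixels wide, 3 pixels tall
warped = resize_nearest(mini, 4, 4)       # hypothetical 4x4 network input
```

Every proposal, whatever its shape, ends up with the same dimensions, so a single CNN can process all of them.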


For feature extraction any CNN trained for Image Classification can be used (AlexNet, VGGNet, etc.)

Outputs from the fc7 layer are taken as features

The CNN is fine-tuned using ground truth (cropped) object images


Linear models (SVMs) are used for classification (1 model per class)
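As a rough sketch of what "1 model per class" means at prediction time: each class has its own weight vector and bias, every model scores the feature vector, and the best-scoring class is reported. The weights below are hand-set for illustration (in practice they would come from SVM training, which is not shown), and the rule of returning "none" when no score is positive is an assumption made here.

```python
from typing import Dict, List, Tuple

LinearModel = Tuple[List[float], float]   # (weight vector, bias)


def score(model: LinearModel, z: List[float]) -> float:
    """Linear score w . z + b for one class."""
    weights, bias = model
    return sum(w * f for w, f in zip(weights, z)) + bias


def classify(models: Dict[str, LinearModel], z: List[float]) -> str:
    """Score z with every per-class model; 'none' if no model fires."""
    scores = {label: score(m, z) for label, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "none"


# Hypothetical hand-set weights for a 2-dimensional feature vector.
models = {
    "person": ([1.0, -1.0], -0.5),
    "ball":   ([-1.0, 1.0], -0.5),
}

label_a = classify(models, [2.0, 0.0])
label_b = classify(models, [0.0, 2.0])
label_c = classify(models, [1.0, 1.0])
```

With these weights the three feature vectors are labelled "person", "ball" and "none" respectively.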


[Figure: proposed box with position (x, y), width w and height h; true box with position (x∗, y∗), width w∗ and height h∗]

z : features from pool5 layer of the network


The proposed regions may not be perfect

We want to learn four regression models which will learn to predict x∗, y∗, w∗, h∗

We will see their respective objective functions


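$$\min \sum_{i=1}^{N} \left( \frac{x^* - x}{w} - w_1^T z \right)^2$$

$\frac{x^* - x}{w}$ is the normalized difference between the proposed $x$ and the true $x^*$ (here $w$ is the width of the proposed box, while $w_1$ is the weight vector of the first regression model)

If we can predict this difference we can compute $x^*$

The model predicts $w_1^T z$ as this difference

The above objective function minimizes the difference between the predicted and the actual value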
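To spell out why predicting the normalized difference is enough, the relation can be rearranged for the corrected coordinate. The symbol $\hat{x}^*$ for the estimate is introduced here for illustration and does not appear on the slides:

```latex
\frac{x^* - x}{w} \approx w_1^T z
\quad\Longrightarrow\quad
\hat{x}^* = x + w \, (w_1^T z)
```

For example, with a proposed $x = 50$, a box width $w = 20$ and a predicted difference of $0.1$, the corrected value is $50 + 20 \times 0.1 = 52$.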


Similarly for y

$$\min \sum_{i=1}^{N} \left( \frac{y^* - y}{h} - w_2^T z \right)^2$$

Similarly for w

$$\min \sum_{i=1}^{N} \left( \ln\left(\frac{w^*}{w}\right) - w_3^T z \right)^2$$

Similarly for h

$$\min \sum_{i=1}^{N} \left( \ln\left(\frac{h^*}{h}\right) - w_4^T z \right)^2$$
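The four regression targets, and the way a predicted set of offsets is turned back into a box, can be sketched as follows. The function names and the example boxes are assumptions made for illustration; the targets themselves follow the four objectives given for x, y, w and h.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]   # (x, y, w, h)


def regression_targets(proposed: Box, true: Box) -> Box:
    """Targets the four linear models are trained to predict."""
    x, y, w, h = proposed
    xs, ys, ws, hs = true
    return ((xs - x) / w,          # normalized shift in x
            (ys - y) / h,          # normalized shift in y
            math.log(ws / w),      # log scale change in width
            math.log(hs / h))      # log scale change in height


def apply_offsets(proposed: Box, offsets: Box) -> Box:
    """Invert the targets: turn predicted offsets into a corrected box."""
    x, y, w, h = proposed
    tx, ty, tw, th = offsets
    return (x + w * tx, y + h * ty, w * math.exp(tw), h * math.exp(th))


proposed = (50.0, 40.0, 20.0, 10.0)
true = (52.0, 39.0, 30.0, 5.0)
targets = regression_targets(proposed, true)
recovered = apply_offsets(proposed, targets)
```

Applying the exact targets to the proposed box recovers the true box up to floating-point error. At test time the offsets would instead come from the four learned models applied to z.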


[Figure: RCNN pipeline annotated with its parameters: W_CONV (feature extraction), W_classifier (classifier) and W_regression (bounding box regression)]

What are the parameters of this model?

W_CONV is taken as it is from a CNN trained for Image classification (say on ImageNet)

W_CONV is then fine-tuned using ground truth (cropped) object images

W_classifier is learned using ground truth (cropped) object images

W_regression is learned using ground truth bounding boxes
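As one illustration of how a single regression weight vector might be learned from ground truth boxes, the sketch below minimizes the squared-error objective by plain gradient descent. The solver, learning rate, step count and toy data are all assumptions chosen here; the slides do not say which optimizer is used.

```python
from typing import List


def fit_linear(zs: List[List[float]], targets: List[float],
               lr: float = 0.1, steps: int = 2000) -> List[float]:
    """Minimise sum_i (t_i - w^T z_i)^2 over w by gradient descent."""
    w = [0.0] * len(zs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for z, t in zip(zs, targets):
            err = sum(wk * zk for wk, zk in zip(w, z)) - t
            for k, zk in enumerate(z):
                grad[k] += 2.0 * err * zk      # d/dw_k of the squared error
        # Average the gradient over the examples before taking a step.
        w = [wk - lr * gk / len(zs) for wk, gk in zip(w, grad)]
    return w


# Toy features z_i and normalized x-offsets t_i = (x* - x) / w.
# The targets were generated from the weights [0.1, -0.2].
zs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0.1, -0.2, -0.1]
w1 = fit_linear(zs, targets)
```

On this noise-free toy data the fitted weights converge to the generating values. The same routine would be run four times, once per target, to obtain the four weight vectors that make up W_regression.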

Figure: R-CNN pipeline. Input → Region Proposals → Feature Extraction → Classifier and Bounding Box Regression

What is the computational cost for processing one image at test time?

Inference Time = Proposal Time + (# Proposals × Convolution Time) + (# Proposals × Classification Time) + (# Proposals × Regression Time)
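
The cost formula above can be turned into a small calculator. This is only a sketch: the function name and all timing values are hypothetical placeholders chosen to show the arithmetic, not measurements; only the figure of roughly 2000 proposals comes from these slides.

```python
def rcnn_inference_time(proposal_time, num_proposals, conv_time, cls_time, reg_time):
    """Per-image test-time cost in seconds, following the formula above."""
    # Every proposal pays for its own CNN pass, classification and regression.
    per_proposal = conv_time + cls_time + reg_time
    return proposal_time + num_proposals * per_proposal

# Hypothetical timings (seconds), used only to illustrate the arithmetic.
total = rcnn_inference_time(
    proposal_time=1.0, num_proposals=2000,
    conv_time=0.01, cls_time=0.002, reg_time=0.001,
)
print(f"{total:.1f} s per image")
```

The per-proposal terms are multiplied by the number of proposals, so they dominate the total even when each one is small.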

Source: Ross Girshick

On average, Selective Search gives about 2K region proposals per image

Each of these is passed through the CNN for feature extraction

Followed by classification and regression

No joint learning

Use ad hoc training objectives:

Fine-tune the network with a softmax classifier (log loss)

Train post-hoc linear SVMs (hinge loss)

Train post-hoc bounding-box regressors (squared loss)

Training (≈ 3 days) and testing (47 s per image) are slow (using VGG-Net; source: Ross Girshick)

Takes a lot of disk space

Region proposals, feature extraction, classifier: Pre 2012 vs. RCNN

RCNN

Region Proposals: Selective Search

Feature Extraction: CNNs

Classifier: Linear

Module 12.3 : Fast RCNN model for object detection

Suppose we apply a 3 × 3 kernel on an image (stride 1)

What is the region of influence of each input pixel in the resulting output?

Each pixel contributes to a 3 × 3 region of the output

Suppose we again apply a 3 × 3 kernel on this output

What is the region of influence of the original input pixel now? (a 5 × 5 region)
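
Under the usual stride-1 convolution, a pixel's region of influence after n stacked 3 × 3 kernels is (2n + 1) × (2n + 1), i.e. 3 × 3 after one kernel and 5 × 5 after two. The sketch below checks this numerically by pushing a single "on" pixel through all-ones kernels. The stride of 1, the zero padding, the all-ones kernel and the helper names are assumptions made for illustration.

```python
import numpy as np

def conv3x3_same(x):
    """Stride-1 3x3 convolution with an all-ones kernel and zero padding."""
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += padded[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def influence_size(num_layers, size=15):
    """Side length of the output region touched by one input pixel."""
    x = np.zeros((size, size))
    x[size // 2, size // 2] = 1.0          # impulse: a single "on" pixel in the centre
    for _ in range(num_layers):
        x = conv3x3_same(x)
    rows = np.flatnonzero(x.sum(axis=1))   # rows that received any contribution
    return int(rows[-1] - rows[0] + 1)

print(influence_size(1), influence_size(2))
```

Running the same projection in the other direction (from a box in the image to a layer's feature map) is what lets a region proposal be located on a feature map.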

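Figure: VGGNet-16 architecture. Input (224 × 224) → Conv (224 × 224 × 64) → maxpool (112 × 112 × 64) → Conv (112 × 112 × 128) → maxpool (56 × 56 × 128) → Conv (56 × 56 × 256) → maxpool (28 × 28 × 256) → Conv (28 × 28 × 512) → maxpool (14 × 14 × 512) → Conv (14 × 14 × 512) → maxpool (7 × 7 × 512) → fc (4096) → fc (4096) → softmax (1000)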

Source: Ross Girshick

Using this idea, we can find a bounding box's region of influence on any layer of the CNN

The projected Regions of Interest (RoIs) may be of different sizes

Divide each RoI into an H × W grid of k = H · W (roughly) equally sized regions and do max pooling in each of those regions to construct a k-dimensional vector

Connect the k-dimensional vector to a fully connected layer

This max pooling operation is called RoI pooling
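
The RoI pooling step described above can be sketched in NumPy as follows. This is a minimal illustration, not the Fast RCNN implementation: the function name, the (r0, c0, r1, c1) box format and the floor/ceil rounding of bin edges are assumptions, and the RoI is assumed to be already projected onto the feature map.

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """Max-pool one RoI of a (C, h, w) feature map into a (C, H, W) grid.

    roi = (r0, c0, r1, c1): row/column bounds on the feature map,
    start inclusive, end exclusive.
    """
    r0, c0, r1, c1 = roi
    # Bin edges may fall between cells; floor/ceil makes neighbouring bins
    # overlap slightly instead of leaving gaps.
    row_edges = np.linspace(r0, r1, H + 1)
    col_edges = np.linspace(c0, c1, W + 1)
    out = np.empty((feature_map.shape[0], H, W), dtype=feature_map.dtype)
    for i in range(H):
        ra = int(np.floor(row_edges[i]))
        rb = max(int(np.ceil(row_edges[i + 1])), ra + 1)   # at least one row per bin
        for j in range(W):
            ca = int(np.floor(col_edges[j]))
            cb = max(int(np.ceil(col_edges[j + 1])), ca + 1)
            out[:, i, j] = feature_map[:, ra:rb, ca:cb].max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 14, 20))
pooled_a = roi_max_pool(fmap, (0, 0, 14, 14))   # a 14 x 14 RoI
pooled_b = roi_max_pool(fmap, (2, 5, 11, 19))   # a 9 x 14 RoI
print(pooled_a.shape, pooled_b.shape)
```

Both RoIs come out with the same shape even though they differ in size, which is what allows them to be fed to the same fully connected layer.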

Source: Ross Girshick

Once we have the FC layer, it gives us the representation of this region proposal

We can then add a softmax layer on top of it to compute a probability distribution over the possible object classes

Similarly, we can add a regression layer on top of it to predict the new bounding box (w∗, h∗, x∗, y∗)
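
The two output layers described above can be sketched as two linear maps on the same FC representation. The names, the random weights, the 21-way class count and the single class-agnostic box are assumptions made to keep the example short; they are not taken from the slides.

```python
import numpy as np

def detection_heads(fc_features, W_cls, b_cls, W_reg, b_reg):
    """Class probabilities and a predicted box from one FC representation."""
    logits = fc_features @ W_cls + b_cls
    logits = logits - logits.max()           # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    box = fc_features @ W_reg + b_reg        # (w*, h*, x*, y*)
    return probs, box

rng = np.random.default_rng(0)
d, num_classes = 4096, 21                    # 21 = 20 classes + background (assumed)
feats = rng.standard_normal(d)
probs, box = detection_heads(
    feats,
    rng.standard_normal((d, num_classes)) * 0.01, np.zeros(num_classes),
    rng.standard_normal((d, 4)) * 0.01, np.zeros(4),
)
print(probs.shape, box.shape)
```

Because both heads read the same features, their losses can be added and trained together.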

Figure: network diagram with Input, Conv, Max-pool, RoI pooling, and the weight matrix W

Recall that the last pooling layer of VGGNet-16 results in an output of size 512 × 7 × 7

We replace the last max pooling layer by an RoI pooling layer

We set H = W = 7 and divide each of these RoIs into k = 49 regions

We do this for every feature map, resulting in an output of size 512 × 49

This output is of the same size as the output of the original max pooling layer

It is thus compatible with the dimensions of the weight matrix connecting the original pooling layer to the first FC layer

We can just retain that weight matrix and fine-tune it
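
The shape argument above can be checked directly. In this sketch the arrays are zero-filled stand-ins, and the FC width of 16 is a hypothetical value used to keep the example small (VGGNet-16 uses 4096).

```python
import numpy as np

pooled = np.zeros((512, 7, 7), dtype=np.float32)   # stand-in for one RoI-pooled output
fc_width = 16                                       # hypothetical; VGGNet-16 uses 4096
W_fc = np.zeros((512 * 7 * 7, fc_width), dtype=np.float32)

flat = pooled.reshape(-1)    # 512 * 49 = 25088 values
fc_out = flat @ W_fc         # valid because 25088 matches the first dimension of W_fc
print(flat.shape, fc_out.shape)
```

Any RoI pooled to 512 × 7 × 7 flattens to the same 25088 values, so the pretrained weight matrix can be reused unchanged.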

Region proposals, feature extraction, classifier: Pre 2012 vs. RCNN vs. Fast RCNN

Fast RCNN

Region Proposals: Selective Search

Feature Extraction: CNN

Classifier: CNN

Module 12.4 : Faster RCNN model for object detection

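Figure: Faster RCNN overview (image, conv layers, feature maps, Region Proposal Network, proposals, RoI pooling, classifier)

So far, the region proposals were being made using the Selective Search algorithm

Idea: Can we use a CNN for making region proposals also?

How? Well, it’s slightly tricky

We will illustrate this using VGGNet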

Figure: a w × h × 512 feature volume; each of the w × h positions gets a 512-dimensional vector (x1, x2, ..., x512)

Consider the output of the last convolutional layer of VGGNet

Now consider one cell in one of the 512 feature maps

If we apply a 3 × 3 kernel around this cell, then we will get a single (scalar) value for this cell

If we repeat this for all the 512 feature maps, then we will get a 512-dimensional representation for this position

We use this process to get a 512-dimensional representation for each of the w × h positions
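
The per-position representation described above can be sketched in NumPy. This follows the slide's simplified description literally, with one 3 × 3 kernel per feature map; the function name, the random kernels and the zero padding at the borders are assumptions. In the Faster RCNN paper this step is a 3 × 3 convolutional layer, where each output channel combines all input channels.

```python
import numpy as np

def per_position_features(fmaps, kernels):
    """fmaps: (C, h, w); kernels: (C, 3, 3), one 3x3 kernel per feature map.

    Returns (h, w, C): a C-dimensional vector for every spatial position.
    """
    C, h, w = fmaps.shape
    # Zero-pad spatially so that border cells also have a full 3x3 window.
    padded = np.pad(fmaps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C, h, w), dtype=fmaps.dtype)
    for di in range(3):
        for dj in range(3):
            out += kernels[:, di, dj, None, None] * padded[:, di:di + h, dj:dj + w]
    return np.transpose(out, (1, 2, 0))

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((512, 14, 14))   # h = w = 14 for a 224 x 224 input to VGGNet
kernels = rng.standard_normal((512, 3, 3))
features = per_position_features(fmaps, kernels)
print(features.shape)
```

Each of the 14 × 14 positions now carries its own 512-dimensional vector.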

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 12

Page 114: CS7015 (Deep Learning) : Lecture 12miteshk/CS7015/Slides/Teaching/pdf/Lecture12.pdf · Some images borrowed from Ross Girshick’s original slides on RCNN, Fast RCNN, etc. Some ideas

29/47

w

512

h

x1 x2 ·······x512

x1 x2 ·······x512

x1 x2 ·······x512

Consider the output of the last con-volutional layer of VGGNet

Now consider one cell in one of the512 feature maps

If we apply a 3× 3 kernel around thiscell then we will get a 1D representa-tion for this cell

If we repeat this for all the 512 featuremaps then we will get a 512 dimen-sional representation for this position

We use this process to get a 512 di-mensional representation for each ofthe w × h positions

[Figure: Input → Conv → Max-pool → 512-dimensional vector (x1, x2, ..., x512) at each position]

We now consider k bounding boxes (called anchor boxes) of different sizes and aspect ratios

We are interested in the following two questions:

Given the 512-dimensional representation of a position, what is the probability that a given anchor box centered at this position contains an object? (Classification)

How do we predict the true bounding box from this anchor box? (Regression)
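A minimal sketch of how k anchor boxes of different sizes and aspect ratios could be generated around one position. The helper `make_anchors` is hypothetical, and the scale and ratio values (giving k = 9) are an assumption taken from the commonly cited Faster RCNN setup; the slides do not state them.

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes as (x1, y1, x2, y2),
    all centered at (cx, cy). Each box has area scale**2 and
    height / width equal to ratio. Scales and ratios are assumed values."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = make_anchors(cx=300.0, cy=200.0)
print(anchors.shape)  # (9, 4): k = 9 anchor boxes for this position
```

The same k boxes are reused at every one of the w × h positions; only the center changes.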

We train a classification model and a regression model to address these two questions

How do we get the ground truth data?

What is the objective function used for training?

[Figure: Input → Conv → Max-pool → x1, x2, ... → Classification and Regression heads]

Consider a ground truth object and its corresponding bounding box

Consider the projection of this image onto the conv5 layer

Consider one such cell in the output

This cell corresponds to a patch in the original image

Consider the center of this patch

We consider anchor boxes of different sizes centered at this point

For each of these anchor boxes, we would want the classifier to predict 1 if this anchor box has a reasonable overlap (IoU > 0.7) with the true bounding box

Similarly, we would want the regression model to predict the true box (red) from the anchor box (pink)
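The ground truth for both models can be derived from the anchor boxes and the true box. The sketch below computes IoU, assigns the classification label with the 0.7 threshold from the slide, and builds a regression target. The box format (x1, y1, x2, y2), the function names and the example boxes are assumptions for illustration; the offset parameterization (center shift scaled by anchor size, log of the size ratio) is the one usually given for Faster RCNN and is not spelled out in the slides.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchor_labels(anchors, gt_box, threshold=0.7):
    """Label 1 if the anchor overlaps the true box with IoU > threshold, else 0."""
    return np.array([1 if iou(a, gt_box) > threshold else 0 for a in anchors])

def regression_target(anchor, gt_box):
    """Offsets (tx, ty, tw, th) that map the anchor onto the true box
    (assumed parameterization, see the note above)."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + wa / 2, anchor[1] + ha / 2
    w, h = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    x, y = gt_box[0] + w / 2, gt_box[1] + h / 2
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

gt_box = np.array([100.0, 100.0, 200.0, 200.0])
anchors = np.array([
    [100.0, 100.0, 200.0, 200.0],  # identical to the true box
    [110.0, 100.0, 210.0, 200.0],  # shifted by 10 pixels
    [150.0, 150.0, 250.0, 250.0],  # small overlap
    [300.0, 300.0, 400.0, 400.0],  # no overlap
])
labels = anchor_labels(anchors, gt_box)
print(labels)                                  # [1 1 0 0]
print(regression_target(anchors[1], gt_box))   # tx = -0.1, others 0
```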

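The full network is trained using the following objective:

$$\mathcal{L}(p_i, t_i) = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \frac{\lambda}{N_{reg}} \sum_i p_i^* \, \mathcal{L}_{reg}(t_i, t_i^*)$$

$p_i^*$ = 1 if anchor box contains ground truth object, 0 otherwise

$p_i$ = predicted probability of anchor box containing an object

$t_i$, $t_i^*$ = predicted and true bounding box coordinates for anchor box $i$

$\lambda$ = weight balancing the classification and regression terms

$N_{cls}$ = batch-size

$N_{reg}$ = batch-size × k

k = number of anchor boxes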
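A small numerical sketch of this objective. The slides do not say which losses are used for the two terms; the sketch assumes binary cross-entropy for the classification term and smooth L1 for the regression term (the choices usually associated with Faster RCNN). The function names, the default value of `lam` and the toy inputs are illustrative only.

```python
import numpy as np

def smooth_l1(diff):
    """Smooth L1 penalty applied elementwise (assumed choice for L_reg)."""
    a = np.abs(diff)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0, eps=1e-12):
    """p: (n,) predicted probabilities, p_star: (n,) 0/1 labels,
    t, t_star: (n, 4) predicted and true box offsets."""
    # Classification term: binary cross-entropy (assumed choice for L_cls).
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # Regression term: summed over the 4 coordinates of each box.
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # p_star switches the regression loss off for anchors without an object.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.9, 0.2])
p_star = np.array([1.0, 0.0])
t = np.array([[0.1, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star, n_cls=2, n_reg=2)
print(loss)
```

Note that the second anchor has a large regression error but contributes nothing to the regression term, because its label is 0.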

[Figure: Input → Conv → Max-pool → x1, x2, ..., x512 → Classification and Regression heads → Region Proposals → Fast RCNN]

So far we have seen a CNN-based approach for region proposals instead of using selective search

We can now take these region proposals and then add fast RCNN on top of them to predict the class of the object

And regress the proposed bounding box

But the fast RCNN would again use a VGG Net

Can’t we use a single VGG Net and share the parameters of RPN and RCNN?

Yes, we can

In practice, we use a 4-step alternating training process

Faster RCNN: Training

1. Fine-tune RPN using a pre-trained ImageNet network

2. Fine-tune fast RCNN from a pre-trained ImageNet network using bounding boxes from step 1

3. Keeping common convolutional layer parameters fixed from step 2, fine-tune RPN (post conv5 layers)

4. Keeping common convolutional layer parameters fixed from step 3, fine-tune fc layers of fast RCNN
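The four steps can be summarised as a small schedule recording which parameter groups are updated at each step. This is a reading of the steps above, not training code: the group names "conv", "rpn" and "fc" and the helper `updated` are hypothetical, and it is assumed that the convolutional layers are updated in steps 1 and 2 since they are only described as fixed from step 3 onwards.

```python
# Parameter groups (hypothetical names):
#   "conv" = common VGG convolutional layers
#   "rpn"  = layers specific to the RPN (post conv5)
#   "fc"   = fully connected layers of fast RCNN
SCHEDULE = {
    1: {"model": "RPN", "init": "pre-trained ImageNet network", "updated": {"conv", "rpn"}},
    2: {"model": "fast RCNN", "init": "pre-trained ImageNet network", "updated": {"conv", "fc"}},
    3: {"model": "RPN", "init": "conv layers from step 2", "updated": {"rpn"}},
    4: {"model": "fast RCNN", "init": "conv layers from step 3", "updated": {"fc"}},
}

def updated(step, group):
    """True if the given parameter group is fine-tuned in the given step."""
    return group in SCHEDULE[step]["updated"]

for step, info in SCHEDULE.items():
    print(step, info["model"], sorted(info["updated"]))
```

After step 4 both networks use the same convolutional layers, which is what allows a single VGG Net to be shared.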

Faster RCNN and RPN are the basis of several 1st place entries in the ILSVRC and COCO tracks on:

ImageNet detection

COCO segmentation

ImageNet localization

COCO detection

[Table: Pre 2012, RCNN, Fast RCNN and Faster RCNN compared by stage: region proposals, feature extraction, classifier]

In Faster RCNN all three stages use a CNN:

Region Proposals: CNN

Feature Extraction: CNN

Classifier: CNN

Object Detection Performance

Source: Ross Girshick

Module 12.5 : YOLO model for object detection

[Figure: Faster RCNN pipeline: image → conv layers → feature maps → Region Proposal Network → proposals → RoI pooling → classifier]

The approaches that we have seen so far are two-stage approaches

They involve a region proposal stage and then a classification stage

Can we have an end-to-end architecture which does both proposal and classification simultaneously?

This is the idea behind YOLO (You Only Look Once).

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 12


[Figure: YOLO pipeline: S × S grid on input → bounding boxes + confidence and class probability map → final detections. Each cell outputs the vector (c, w, h, x, y, P(cow), P(dog), ..., P(truck)).]

Divide an image into an S × S grid of cells (S = 7)

For each such cell we are interested in predicting 5 + k quantities

Probability (confidence) c that this cell is indeed contained in a true bounding box

Width w of the bounding box

Height h of the bounding box

Center (x, y) of the bounding box

Probability of the object in the bounding box belonging to the i-th class, for each of the k classes (k values)

The output layer thus contains S × S × (5 + k) elements
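
The S × S × (5 + k) layout can be sketched in a few lines of Python. This is only an illustration: the per-cell ordering (c, w, h, x, y, then the class probabilities) is taken from the figure, while k = 20 and the helper names `output_size` and `split_cell` are assumptions made for the example, since the slide does not fix k.

```python
# Sketch of the per-cell output layout described above.
# Assumptions (not from the slides): K = 20 classes and the helper names.
S = 7    # grid size used on the slide
K = 20   # hypothetical number of classes


def output_size(s: int, k: int) -> int:
    """Number of values in the output layer: S * S * (5 + k)."""
    return s * s * (5 + k)


def split_cell(cell: list) -> dict:
    """Split one cell's (5 + k)-vector into named parts.

    Assumed order: confidence, width, height, center x, center y,
    followed by the k class probabilities.
    """
    c, w, h, x, y = cell[:5]
    return {"c": c, "w": w, "h": h, "x": x, "y": y, "class_probs": cell[5:]}


total = output_size(S, K)  # 7 * 7 * (5 + 20)
example_cell = [0.9, 0.4, 0.3, 0.5, 0.5] + [0.0] * K
parts = split_cell(example_cell)
```

With S = 7 and the assumed K = 20, the output layer would hold 7 × 7 × 25 = 1225 values.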


[Figure: input image → S × S grid on input → bounding boxes & confidence, class probability map → final detections]

How do we interpret this S × S × (5 + k) dimensional output?

For each cell, we are computing a bounding box, its confidence and the object in it

We then retain the most confident bounding boxes and the corresponding object labels
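
A minimal sketch of this interpretation step, assuming each cell stores [c, w, h, x, y, p_1, ..., p_k] in that order. The 0.5 confidence threshold, the three class names, and the function name `decode` are hypothetical choices for the example and do not come from the slides.

```python
# Sketch: turn a grid of per-cell predictions into final detections.
# Assumptions (not from the slides): the 0.5 threshold, the class names,
# and the function name `decode`.
CLASSES = ["cow", "dog", "truck"]  # hypothetical k = 3 classes


def decode(grid, threshold=0.5):
    """Keep cells whose confidence exceeds `threshold`.

    `grid` is a nested list of cells; each cell is
    [c, w, h, x, y, p_1, ..., p_k].
    Returns (confidence, (x, y, w, h), label) tuples, most confident first.
    """
    detections = []
    for row in grid:
        for cell in row:
            c, w, h, x, y = cell[:5]
            probs = cell[5:]
            if c > threshold:
                # The label is the class with the highest probability.
                best = max(range(len(probs)), key=probs.__getitem__)
                detections.append((c, (x, y, w, h), CLASSES[best]))
    return sorted(detections, key=lambda d: d[0], reverse=True)


# A 2 x 2 toy grid: two confident cells and two background cells.
background = [0.1, 0.0, 0.0, 0.0, 0.0, 0.3, 0.3, 0.4]
toy_grid = [
    [[0.8, 0.5, 0.4, 0.3, 0.3, 0.7, 0.2, 0.1], background],
    [background, [0.9, 0.6, 0.3, 0.7, 0.7, 0.1, 0.1, 0.8]],
]
results = decode(toy_grid)
```

In practice, overlapping boxes are usually also pruned with non-maximum suppression, which this sketch omits.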


[Figure: YOLO pipeline (S × S grid on input → bounding boxes + confidence, class probability map → final detections) with the predicted vector ($\hat{c}, \hat{w}, \hat{h}, \hat{x}, \hat{y}, \hat{\ell}_1, \hat{\ell}_2, \dots, \hat{\ell}_k$) for one cell and the true class labels P(cow), P(dog), ..., P(truck)]

How do we train this network?

Consider a cell such that the center of the true bounding box lies in it

The network is initialized randomly and it will predict some values for c, w, h, x, y and ℓ (denoted $\hat{c}, \hat{w}, \hat{h}, \hat{x}, \hat{y}$ and $\hat{\ell}$)

We can then compute the following losses

$(1 - \hat{c})^2$

$(\sqrt{w} - \sqrt{\hat{w}})^2$

$(\sqrt{h} - \sqrt{\hat{h}})^2$

$(x - \hat{x})^2$

$(y - \hat{y})^2$

$\sum_{i=1}^{k} (\ell_i - \hat{\ell}_i)^2$

And train the network to minimize the sum of these losses
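
Written out in full, the quantity being minimized for a cell that contains the center of a true bounding box would look as follows. This is a sketch that adds the six terms with equal weights, as the slide suggests; the symbol $L_{\text{obj}}$ and the hat notation for predictions are assumptions introduced here for readability.

```latex
L_{\text{obj}} = (1 - \hat{c})^2
  + (\sqrt{w} - \sqrt{\hat{w}})^2
  + (\sqrt{h} - \sqrt{\hat{h}})^2
  + (x - \hat{x})^2
  + (y - \hat{y})^2
  + \sum_{i=1}^{k} (\ell_i - \hat{\ell}_i)^2
```

Taking square roots of the width and height means that the same absolute error counts for more on a small box than on a large one.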

[Figure: YOLO pipeline with the predicted vector ($\hat{c}, \hat{w}, \hat{h}, \hat{x}, \hat{y}, \hat{\ell}_1, \hat{\ell}_2, \dots, \hat{\ell}_k$) for one cell]

Now consider a grid cell which does not contain any object

For this cell we do not care about the predictions w, h, x, y and ℓ

But we want the confidence to be low

So we minimize only the following loss, where $\hat{c}$ is the predicted confidence

$(0 - \hat{c})^2$
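
The two training cases (a cell with an object center and a cell without one) can be combined into a single per-cell loss. The sketch below assumes equal weighting of all terms, non-negative predicted widths and heights so that the square roots are defined, and a cell layout of [c, w, h, x, y, ℓ_1, ..., ℓ_k]. The names `cell_loss` and `grid_loss` are hypothetical.

```python
import math


def cell_loss(pred, target=None):
    """Squared-error loss for one grid cell.

    `pred` and `target` are [c, w, h, x, y, l_1, ..., l_k].
    If `target` is None the cell contains no object center and only
    the confidence term (0 - c_hat)^2 is used.
    """
    c_hat, w_hat, h_hat, x_hat, y_hat = pred[:5]
    if target is None:
        return (0.0 - c_hat) ** 2
    _, w, h, x, y = target[:5]
    loss = (1.0 - c_hat) ** 2                        # confidence term
    loss += (math.sqrt(w) - math.sqrt(w_hat)) ** 2   # width term
    loss += (math.sqrt(h) - math.sqrt(h_hat)) ** 2   # height term
    loss += (x - x_hat) ** 2 + (y - y_hat) ** 2      # center terms
    # Class terms: squared error over the k class entries.
    loss += sum((l - l_hat) ** 2 for l, l_hat in zip(target[5:], pred[5:]))
    return loss


def grid_loss(preds, targets):
    """Sum the per-cell losses over flat lists of cells."""
    return sum(cell_loss(p, t) for p, t in zip(preds, targets))


# One object cell (perfect box and class, confidence 0.5) and one empty cell.
pred_obj = [0.5, 0.25, 0.25, 0.5, 0.5, 1.0, 0.0]
target_obj = [1.0, 0.25, 0.25, 0.5, 0.5, 1.0, 0.0]
pred_bg = [0.5, 0.1, 0.1, 0.2, 0.2, 0.3, 0.7]
total = grid_loss([pred_obj, pred_bg], [target_obj, None])
```

Here each cell contributes 0.25 from its confidence term alone, so `total` is 0.5. Note that the box and class predictions of the empty cell do not affect the result.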


Method        Pascal 2007 mAP   Speed (FPS)   Time per image
DPM v5        33.7              0.07          14 sec
RCNN          66.0              0.05          20 sec
Fast RCNN     70.0              0.5           2 sec
Faster RCNN   73.2              7             140 msec
YOLO          69.0              45            22 msec
