Region Based CNN Detection & Segmentation · 2018-05-07

Page 1:

Region Based CNN

Detection & Segmentation

Page 2:

Today’s program – 1st hour

• Classification & Localization

• Object detection

• Region proposals
  • Selective search

• Region proposal algorithms
  • RCNN
  • Fast RCNN
  • Faster RCNN

• Detection without proposals
  • YOLO
  • SSD

Page 3:

References

Some slides were taken from the following resources:

1. Stanford University course CS231n, 2017 – Lecture 11
2. www.coursera.org – Convolutional Neural Networks – Object detection

Article references:

• Efficient Graph-Based Image Segmentation. Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology
• Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (tech report v5). Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, UC Berkeley
• Fast R-CNN. Ross Girshick, Microsoft Research
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
• You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi; University of Washington, Allen Institute for AI, Facebook AI Research
• SSD: Single Shot MultiBox Detector. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg; UNC Chapel Hill, Zoox Inc., Google Inc., University of Michigan, Ann Arbor

Page 4:

Already covered: Classification

Page 5:

Some definitions…

Single object: Classification / Classification with localization

Multiple objects: Object detection

[Figure: example images for each task, each car labeled “Car”]

Page 6:

Classification + Localization

Page 7:

Classification + Localization task

Page 8:

Classification + Localization

Page 9:

Classification + Localization

Page 10:

Evaluation metric – IoU (Intersection over Union)

To evaluate how good our bounding box prediction is, we use Intersection over Union:

IoU = Intersection(P.Bbox, G.Bbox) / Union(P.Bbox, G.Bbox)

where P.Bbox is the predicted Bbox and G.Bbox is the ground-truth Bbox.

In practice, the Bbox prediction is considered “correct” if IoU > ~0.5-0.6.
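The formula above can be sketched as a small function; the (x0, y0, w, h) box format and the function name are assumptions for illustration.

```python
# A minimal IoU sketch for axis-aligned boxes given as (x0, y0, w, h).
def iou(box_a, box_b):
    """Intersection over Union of two (x0, y0, w, h) boxes."""
    ax0, ay0, aw, ah = box_a
    bx0, by0, bw, bh = box_b
    # Intersection rectangle (clamped to zero when the boxes don't overlap)
    ix = max(0.0, min(ax0 + aw, bx0 + bw) - max(ax0, bx0))
    iy = max(0.0, min(ay0 + ah, by0 + bh) - max(ay0, by0))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 2, 2)))  # overlap 1, union 7 -> ~0.143
```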

Page 11:

Localization as Regression

• For example, for 3 classes – pedestrian, car & motorcycle – the ConvNet output vector is:

Pc – probability that an object exists in the image

Bx – x0 coordinate of the Bbox

By – y0 coordinate of the Bbox

Bh – height of the Bbox

Bw – width of the Bbox

Pc1 – probability of class C1

Pc2 – probability of class C2

Pc3 – probability of class C3

Coordinates are normalized to the image, from (0,0) at the top-left corner to (1,1) at the bottom-right.

Page 12:


Multi Task Loss

Page 13:

Multi Task Loss

• How do we train the network with two completely different loss functions?

• The solution – a hyperparameter that weights each loss.

• Be careful, it’s tricky: this hyperparameter is special because it changes the absolute value of the loss, so loss values are not comparable across different settings of it.
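The weighting above can be sketched in one line; the loss values and the weight are made up for illustration. Because the weight rescales the total loss, models should be compared on a validation metric (e.g. mAP), not on the raw loss value.

```python
# Minimal sketch: combine classification and localization losses with a
# weighting hyperparameter (values here are illustrative).
def multi_task_loss(cls_loss, loc_loss, lam=1.0):
    # lam rescales the total loss, so tune it against a validation metric.
    return cls_loss + lam * loc_loss

print(multi_task_loss(0.7, 0.2, lam=10.0))  # 0.7 + 10 * 0.2
```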

Page 14:

Side note – Landmark detection

Treat each landmark as a class.

For each landmark, output its (Lx, Ly) coordinates.

For example: face recognition (64 facial landmarks), pose estimation (14 joints).

Application example – virtual makeover.

Page 15:

Object Detection

Page 16:

Object detection reminder…

• We are looking for all the objects in the input image.

• The output should give the class and the bounding box of each object.

Main challenge:

The image has a varying number of objects – we don’t know how many outputs to expect.

Page 17:
Page 18:

It’s not a good paradigm for this kind of problem…

Page 19:

Sliding window

• Small patches cropped from the input are fed into the ConvNet, which classifies each patch as true/false for each class, or as background if there is no object.

Is there any problem with this method?

Page 20:

Sliding window main pitfall

• How do we choose the crops?

• Each object can appear at any location / size / aspect ratio.

• We need to tackle thousands of crops.

• Before ConvNets, with linear classifiers, this was feasible to compute.

• With ConvNets it is very inefficient computationally.

• It is possible to improve the computation using “OverFeat”, which implements the sliding window convolutionally.

Page 21:

Region proposals

Page 22:

Region proposals

Goal – find image regions that are likely to contain an object.

An example of a fast method widely used in practice – “Selective search”.

This algorithm gives ~2000 region proposals in a few seconds on a CPU, finds regions at any scale, considers multiple grouping criteria (e.g. a cup on a table), and is not limited to rectangular ROIs.

Idea: use bottom-up grouping of image regions to generate a hierarchy of small to large regions.

Page 23:

Efficient Graph-Based Image Segmentation – Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology

We define a graph G = (V, E) such that:

Vertices vi ∈ V – the elements we want to segment.

Edges (vi, vj) ∈ E – two neighboring vertices share an edge.

Each edge has a non-negative weight w(e) that describes the dissimilarity between the two elements.

In image segmentation, V = the pixels in the image, and w(e) = intensity difference (or any other measurable pixel attribute).

Page 24:

Efficient Graph-Based Image Segmentation (cont.)

We define S as a segmentation of V into components (regions in the image) such that pixels in the same component are similar, and pixels from disjoint components are dissimilar.

We define the predicate D, which evaluates the dissimilarity between two components, as follows:

D(C1, C2) = true   if Dif(C1, C2) > MInt(C1, C2)
            false  otherwise

Dif(C1, C2) = min over vi ∈ C1, vj ∈ C2, (vi, vj) ∈ E of w(vi, vj)

MInt(C1, C2) = min( Int(C1) + γ(C1), Int(C2) + γ(C2) )

Int(C) = max over e ∈ MST(C, E) of w(e)

Page 25:

Efficient Graph-Based Image Segmentation – Algorithm (Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology)

0. Sort E by non-decreasing edge weight into (o1, …, om).
1. Start with a segmentation S0, where each vertex vi is in its own component.
2. Repeat step 3 for q = 1, …, m.
3. Construct Sq given Sq−1 as follows. Let vi and vj denote the vertices connected by the q-th edge in the ordering, i.e. oq = (vi, vj). If vi and vj are in disjoint components of Sq−1 and w(oq) is small compared to the internal difference of both those components, then merge the two components; otherwise do nothing.
4. Return S = Sm.
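The merge rule can be sketched with a union-find structure over an explicit weighted graph. The threshold function is the paper’s τ(C) = k/|C| (written γ on the previous slide); the edge list, k value, and tiny example graph are illustrative assumptions, not the paper’s implementation.

```python
# Compact sketch of the Felzenszwalb merge rule on an explicit edge list.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # Int(C): max MST edge weight so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        # Edges arrive in sorted order, so w is the component's new Int(C).
        self.internal[a] = w

def segment(n_vertices, edges, k):
    """edges: list of (w, i, j); k controls the threshold tau(C) = k/|C|."""
    uf = UnionFind(n_vertices)
    for w, i, j in sorted(edges):
        a, b = uf.find(i), uf.find(j)
        if a == b:
            continue
        # MInt(C1, C2) = min(Int(C1) + k/|C1|, Int(C2) + k/|C2|)
        mint = min(uf.internal[a] + k / uf.size[a],
                   uf.internal[b] + k / uf.size[b])
        if w <= mint:  # merge only if the joining edge is "small enough"
            uf.union(a, b, w)
    return [uf.find(v) for v in range(n_vertices)]

# 4 vertices in a line: 0-1 and 2-3 are similar, 1-2 is a strong boundary.
print(segment(4, [(0.1, 0, 1), (0.2, 2, 3), (5.0, 1, 2)], k=1.0))
```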

Back to “Selective Search”

Page 26:

Selective search

1. Generate an initial over-segmentation R (using the method of Felzenszwalb et al.).

2. Initialize S = ∅. Recursively combine similar regions into larger ones:

   a. From the set of regions R, choose the two that are most similar.

   b. Combine them into a single, larger region.

   c. Repeat until only one region remains.

3. Use the generated regions to produce candidate object locations.
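The greedy grouping loop above can be sketched on a toy problem. Here regions are 1-D intervals and the similarity function is a made-up stand-in (negative distance between interval centers); every intermediate merged region is recorded as a candidate, mirroring step 3.

```python
# Toy sketch of selective search's greedy merging loop.
def merge(a, b):
    # combine two intervals into the smallest interval covering both
    return (min(a[0], b[0]), max(a[1], b[1]))

def sim(a, b):
    # toy similarity: closer interval centers are "more similar"
    return -abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)

def selective_search(regions):
    proposals = list(regions)
    while len(regions) > 1:
        # a. choose the two most similar regions
        a, b = max(((x, y) for i, x in enumerate(regions)
                    for y in regions[i + 1:]), key=lambda p: sim(*p))
        # b. combine them; every merged region is also a candidate location
        regions = [r for r in regions if r not in (a, b)] + [merge(a, b)]
        proposals.append(merge(a, b))
    return proposals

print(selective_search([(0, 1), (2, 3), (9, 10)]))
```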

Page 27:

RCNN

Page 28:

Problem: each ROI may be of a different size.

Page 29:

Warp = transform the pixel locations I(u, v) according to a mapping function; in this case, scaling to a fixed input size.

Page 30:
Page 31:
Page 32:
Page 33:

RCNN Problems

• ~2k regions, each region computed independently – computationally expensive.

• Training is slow (84h) and takes a lot of disk space.

• Each training image has all ROIs labeled with a category and a bounding box – assumes full supervision of all objects.

• Inference (detection) is slow: 47s / image with VGG16 [Simonyan & Zisserman, ICLR15].

Page 34:

Fast & Faster RCNN – Summary

Fast RCNN

1. Instead of running each region separately, we first run the whole image through a convolutional network, which extracts a full-resolution feature map with respect to the whole image.

2. We still use “Selective search”, only we crop the ROI projected onto the feature map instead of cropping the image pixels. This allows us to share the heavy computation between regions.

3. ROI pooling layer (instead of warp) – quite similar to max pooling.

4. Multi-task loss: log loss (classification), smooth L1 (regressing the offsets).

Page 35:

Fast RCNN – Multi task loss

L(p, k*, t, t*) = Lcls(p, k*) + λ [k* ≥ 1] Lreg(t, t*)

Where:

k* is the true class label; [k* ≥ 1] is 1 for non-background classes, so the regression loss is applied only when the true class is not background.

Lcls(p, k*) = −log p_k* : standard cross entropy / log loss.

t* = (t*x, t*y, t*w, t*h) : true Bbox regression target.

t = (tx, ty, tw, th) : predicted tuple for class k*.

Lreg(t, t*) = Σ over i ∈ {x, y, w, h} of smoothL1(ti − t*i)

smoothL1(x) = 0.5x²      if |x| < 1
              |x| − 0.5  otherwise
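The smooth L1 (Huber-like) regression loss above can be sketched directly; the function names are assumptions for illustration.

```python
# Smooth L1 loss applied elementwise to the offsets t - t*.
def smooth_l1(x):
    ax = abs(x)
    # quadratic near zero, linear in the tails (robust to outliers)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def l_reg(t, t_star):
    return sum(smooth_l1(ti - si) for ti, si in zip(t, t_star))

print(smooth_l1(0.5))   # 0.5 * 0.5^2 = 0.125
print(smooth_l1(2.0))   # 2.0 - 0.5 = 1.5
print(l_reg((0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))  # 0.125
```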

Page 36:

Fast & Faster RCNN – Summary

Faster RCNN

1. Instead of selective search, we do the following:

• Produce a high-resolution feature map corresponding to the whole image.

• RPN – Region Proposal Network – a CNN that extracts the region proposals from the feature map. The RPN classifies each candidate as binary {object / no object} and regresses a bounding box.

2. Using the feature map and the region proposals, continue the same as Fast RCNN.

Page 37:

Region Proposal Network (RPN)

Input – an image of any size.

Output – rectangular regions proposed as object locations, each with a confidence score.

1. The input image goes through a CNN to generate a feature map.

Page 38:

Region Proposal Network (RPN) cont.

2. Slide a 3×3 window over the feature map. For each window, a set of 9 anchors is generated with the same center (xa, ya) but with 3 different aspect ratios and 3 different scales.

3. The window is passed to 2 FC layers:

• Classify object / no object (binary softmax)

• Bounding box regression

4. For each window, k = 9 predictions are made, one per anchor box. The output is 4k (36) numbers for the Bbox regression and 2k (18) numbers for the binary softmax classifier.
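Anchor generation at one window position can be sketched as 3 scales × 3 aspect ratios around a shared center; the scale and ratio values below are commonly used choices, taken as assumptions here.

```python
# 9 anchors per sliding-window position: 3 scales x 3 aspect ratios.
def make_anchors(xa, ya, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the anchor's area ~ s*s while varying aspect ratio r = w/h
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((xa - w / 2, ya - h / 2, w, h))  # (x0, y0, w, h)
    return anchors

print(len(make_anchors(0, 0)))  # 9 anchors per location
```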

Page 39:

Region Proposal Network (RPN) cont.

RPN loss function:

Positive label – the anchor has the highest IoU with a ground-truth box, or has IoU > 0.7 with any ground-truth box.

Negative label – IoU < 0.3 for all ground-truth boxes.

p*i = 1    if IoU > 0.7 (or highest IoU with a ground-truth box)
      −1   if IoU < 0.3
      0    otherwise (ignored in training)

Objective function with multi-task loss, similar to Fast R-CNN:

L(pi, ti) = Lcls(pi, p*i) + λ p*i Lreg(ti, t*i)

where p*i is 1 if the anchor is labeled positive and 0 if negative, so the regression term is active only for positive anchors.

λ = 10 biases towards better box location.
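The labeling rule above can be sketched as follows; the IoU matrix layout and thresholds are assumptions for illustration (+1 positive, −1 negative, 0 ignored).

```python
# Label anchors from a precomputed IoU matrix: iou_matrix[a][g] is the IoU
# of anchor a with ground-truth box g.
def label_anchors(iou_matrix, hi=0.7, lo=0.3):
    labels = []
    # the anchor with the best IoU for some GT box is positive even if < hi
    best_per_gt = [max(row[g] for row in iou_matrix)
                   for g in range(len(iou_matrix[0]))]
    for row in iou_matrix:
        best = max(row)
        if best > hi or any(row[g] == best_per_gt[g] and best_per_gt[g] > 0
                            for g in range(len(row))):
            labels.append(1)
        elif best < lo:
            labels.append(-1)
        else:
            labels.append(0)
    return labels

# 3 anchors, 1 ground-truth box
print(label_anchors([[0.8], [0.1], [0.5]]))  # [1, -1, 0]
```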

Page 40:

Fast & Faster RCNN – Summary

Faster RCNN multi-task loss:

1. RPN – classify object yes/no

2. RPN – regress box coordinates

3. Final classification score (object classes)

4. Final box coordinates

Page 41:
Page 42:

Detection without proposals

Page 43:

Detection without region proposals

Page 44:

YOLO

Page 45:

YOLO Architecture

• A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

• 24 convolutional layers

• 2 FC layers

Page 46:

Yolo – You Only Look Once

• The image is divided into grid cells, and classification + localization is performed per cell. For each grid cell:

Output = (pc, bx, by, bh, bw, c1, c2, c3)

Classes: 1 – pedestrian, 2 – car, 3 – motorcycle.

For a 3×3 grid, the output size is 3 × 3 × 8 (grid cells × (offset + classes)).

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]

Page 47:

Example – 2 anchor boxes

With 2 anchor boxes, each grid cell outputs one (pc, bx, by, bh, bw, c1, c2, c3) tuple per anchor box:

Output = (pc, bx, by, bh, bw, c1, c2, c3, pc, bx, by, bh, bw, c1, c2, c3)

where the first 8 entries belong to anchor box 1 and the last 8 to anchor box 2.

Classes: 1 – pedestrian, 2 – car, 3 – motorcycle.

Output size is 3 × 3 × 2 × 8 (grid cells × anchor boxes × (offset + classes)).

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]
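The output sizes quoted on the last two slides can be sanity-checked with a one-line calculation; the function name is an assumption for illustration.

```python
# Output tensor size: grid cells x anchors x (pc + 4 box offsets + classes).
def yolo_output_size(grid, anchors, num_classes):
    per_anchor = 1 + 4 + num_classes  # pc, (bx, by, bh, bw), class scores
    return grid * grid * anchors * per_anchor

print(yolo_output_size(grid=3, anchors=2, num_classes=3))  # 3*3*2*8 = 144
print(yolo_output_size(grid=3, anchors=1, num_classes=3))  # 3*3*8 = 72
```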

Page 48:

YOLO Loss function

The loss is a sum of five terms:

• Bbox coordinate error

• Bbox width/height error

• Class error when an object is found

• Class error when an object is not found

• Objectness (“is there an object”) error

Where:

1_i^obj = 1 if there is an object in the i-th cell

1_ij^obj = 1 if there is an object in the i-th cell and anchor box j is responsible for predicting it

λ_coord, λ_noobj are optimization parameters that give some loss terms a higher/lower weight.

Page 49:

What to do with overlapping objects?

Page 50:

Overlapping objects:

[Figure: two overlapping objects with two candidate anchor boxes, anchor box 1 and anchor box 2]

Previously:

Each object in the training image is assigned to the grid cell that contains that object’s midpoint.

With two anchor boxes:

Each object in the training image is assigned to the grid cell that contains the object’s midpoint, and to the anchor box of that grid cell with the highest IoU.

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]

Page 51:

How to choose the anchor boxes?

Page 52:

YOLO – Choosing the anchor boxes

• Ad hoc – define default anchor boxes (in practice 9: 3 scales, 3 aspect ratios).

• Using k-means clustering – use the data set to derive the anchor boxes:

IoU is used as the distance metric for k-means.

Using the boxes’ widths and heights as features, we compute the cluster centers.

The calculation is done for various numbers of clusters, choosing the number that gives the best trade-off between mean IoU and anchor box overlap.

• Example taken from: Vivek Yadav, Jul 10, 2017.
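The clustering step can be sketched as k-means on (w, h) pairs with d = 1 − IoU as the distance, as in YOLOv2-style anchor selection; the toy data, initial centers, and mean-based center update are illustrative assumptions.

```python
# k-means on box shapes (w, h) using d = 1 - IoU as the distance.
def wh_iou(a, b):
    # IoU of two boxes aligned at the same corner, each given as (w, h)
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, centers, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for box in boxes:
            # assign each box to the nearest center under d = 1 - IoU
            i = max(range(len(centers)), key=lambda i: wh_iou(box, centers[i]))
            clusters[i].append(box)
        # recompute centers as the mean shape of each cluster
        centers = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

boxes = [(10, 10), (12, 11), (50, 20), (48, 22)]  # two natural shape groups
print(sorted(kmeans_anchors(boxes, centers=[(10, 10), (50, 20)])))
```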

Page 53:

How to deal with detecting the same object more than once?

Page 54:

Non-max suppression example

Page 55:

Non-max suppression example

[Figure: candidate boxes with confidence scores 0.8, 0.5, 0.6, 0.9 and 0.3 around two objects’ midpoints]

Page 56:

Non-max suppression example

[Figure: candidate boxes with confidence scores 0.8, 0.7, 0.6, 0.9 and 0.7]

Page 57:

Non-max suppression

Each output prediction is: (pc, bx, by, bh, bw)

• Discard all boxes with pc ≤ 0.6.

• While there are any remaining boxes:

  • Pick the box with the largest pc and output it as a prediction.

  • Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
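The procedure above can be sketched directly; the (pc, x0, y0, w, h) detection format and the corner-based `iou` helper are assumptions for illustration.

```python
# Greedy non-max suppression over (pc, x0, y0, w, h) detections.
def iou(a, b):
    ax0, ay0, aw, ah = a
    bx0, by0, bw, bh = b
    ix = max(0.0, min(ax0 + aw, bx0 + bw) - max(ax0, bx0))
    iy = max(0.0, min(ay0 + ah, by0 + bh) - max(ay0, by0))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter) if inter else 0.0

def nms(boxes, pc_thresh=0.6, iou_thresh=0.5):
    # 1. discard low-confidence boxes
    boxes = [b for b in boxes if b[0] > pc_thresh]
    kept = []
    while boxes:
        # 2. pick the box with the largest pc and output it
        best = max(boxes, key=lambda b: b[0])
        kept.append(best)
        # 3. discard remaining boxes that overlap it too much
        boxes = [b for b in boxes
                 if b is not best and iou(b[1:], best[1:]) < iou_thresh]
    return kept

dets = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 10, 10), (0.7, 50, 50, 10, 10)]
print(nms(dets))  # keeps the 0.9 box and the far-away 0.7 box
```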

Page 58:

Outputting the non-max suppressed outputs

• For each grid cell, get the 2 predicted bounding boxes.

• Get rid of low-probability predictions.

• For each class (pedestrian, car, motorcycle), run non-max suppression to generate the final predictions.

Page 59:

SSD

Page 60:

SSD – Single Shot Detector

Main differences from YOLO

Page 61: Region Based CNN Detection & Segmentation · 2018-05-07 · Yolo –You Only Look Once 𝑂 P L Q P= L ℎ 1 2 3 [Redmon et al., 2015, You Only Look Once: Unified real-time object