Region Based CNN Detection & Segmentation · 2018-05-07

Page 1:

Region Based CNN

Detection & Segmentation

Page 2:

Today’s program – 1st hour

• Classification & Localization

• Object detection

• Region proposals
  • Selective search

• Region proposal algorithms
  • RCNN
  • Fast RCNN
  • Faster RCNN

• Detection without proposals
  • YOLO
  • SSD

Page 3:

References

Some slides were taken from the following resources:

1. Stanford University course CS231n, 2017 – Lecture 11
2. www.coursera.org – Convolutional Neural Networks – Object detection

Article references:

• Efficient Graph-Based Image Segmentation. Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology
• Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (tech report v5). Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, UC Berkeley
• Fast R-CNN. Ross Girshick, Microsoft Research
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
• You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi; University of Washington, Allen Institute for AI, Facebook AI Research
• SSD: Single Shot MultiBox Detector. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg; UNC Chapel Hill, Zoox Inc., Google Inc., University of Michigan, Ann Arbor

Page 4:

Already covered: Classification

Page 5:

Some definitions…

Single object: Classification / Classification with localization

Multiple objects: Object detection

[Figure: example images for each task, each car labeled “Car”]

Page 6:

Classification + Localization

Page 7:

Classification + Localization task

Page 8:

Classification + Localization

Page 9:

Classification + Localization

Page 10:

Evaluation metric – IoU (Intersection over Union)

To evaluate how good our bounding box prediction is, we use Intersection over Union:

IoU = Intersection(P.Bbox, G.Bbox) / Union(P.Bbox, G.Bbox)

where P.Bbox is the predicted Bbox and G.Bbox is the ground-truth Bbox.

In practice, the Bbox prediction is considered “correct” if IoU > ~0.5-0.6.
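The formula above can be sketched as a small function; the (x0, y0, w, h) box format and the function name are assumptions for illustration.

```python
# A minimal IoU sketch for axis-aligned boxes given as (x0, y0, w, h).
def iou(box_a, box_b):
    """Intersection over Union of two (x0, y0, w, h) boxes."""
    ax0, ay0, aw, ah = box_a
    bx0, by0, bw, bh = box_b
    # Intersection rectangle (clamped to zero when the boxes don't overlap)
    ix = max(0.0, min(ax0 + aw, bx0 + bw) - max(ax0, bx0))
    iy = max(0.0, min(ay0 + ah, by0 + bh) - max(ay0, by0))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 2, 2)))  # overlap 1, union 7 -> ~0.143
```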

Page 11:

Localization as Regression

• For example, for 3 classes – pedestrian, car & motorcycle – the ConvNet output vector is:

Pc – probability that an object exists in the image

Bx – x0 coordinate of the Bbox

By – y0 coordinate of the Bbox

Bh – height of the Bbox

Bw – width of the Bbox

Pc1 – probability of class C1

Pc2 – probability of class C2

Pc3 – probability of class C3

Coordinates are normalized to the image, from (0,0) at the top-left corner to (1,1) at the bottom-right.

Page 12:


Multi Task Loss

Page 13:

Multi Task Loss

• How do we train the network with two completely different loss functions?

• The solution – a hyperparameter that weights each loss.

• Be careful, it’s tricky: this hyperparameter is special because it changes the absolute value of the loss, so loss values are not comparable across different settings of it.
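The weighting above can be sketched in one line; the loss values and the weight are made up for illustration. Because the weight rescales the total loss, models should be compared on a validation metric (e.g. mAP), not on the raw loss value.

```python
# Minimal sketch: combine classification and localization losses with a
# weighting hyperparameter (values here are illustrative).
def multi_task_loss(cls_loss, loc_loss, lam=1.0):
    # lam rescales the total loss, so tune it against a validation metric.
    return cls_loss + lam * loc_loss

print(multi_task_loss(0.7, 0.2, lam=10.0))  # 0.7 + 10 * 0.2
```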

Page 14:

Side note – Landmark detection

Treat each landmark as a class.

For each landmark, output its (Lx, Ly) coordinates.

For example: face recognition (64 facial landmarks), pose estimation (14 joints).

Application example – virtual makeover.

Page 15:

Object Detection

Page 16:

Object detection reminder…

• We are looking for all the objects in the input image.

• The output should give the class and the bounding box of each object.

Main challenge:

The image has a varying number of objects – we don’t know how many outputs to expect.

Page 17:
Page 18:

It’s not a good paradigm for this kind of problem…

Page 19:

Sliding window

• Small patches cropped from the input are fed into the ConvNet, which classifies each patch as true/false for each class, or as background if there is no object.

Is there any problem with this method?

Page 20:

Sliding window main pitfall

• How do we choose the crops?

• Each object can appear at any location / size / aspect ratio.

• We need to tackle thousands of crops.

• Before ConvNets, with linear classifiers, this was feasible to compute.

• With ConvNets it is very inefficient computationally.

• It is possible to improve the computation using “OverFeat”, which implements the sliding window convolutionally.

Page 21:

Region proposals

Page 22:

Region proposals

Goal – find image regions that are likely to contain an object.

An example of a fast method widely used in practice – “Selective search”.

This algorithm gives ~2000 region proposals in a few seconds on a CPU, finds regions at any scale, considers multiple grouping criteria (e.g. a cup on a table), and is not limited to rectangular ROIs.

Idea: use bottom-up grouping of image regions to generate a hierarchy of small to large regions.

Page 23:

Efficient Graph-Based Image Segmentation – Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology

We define a graph G = (V, E) such that:

Vertices vi ∈ V – the elements we want to segment.

Edges (vi, vj) ∈ E – two neighboring vertices share an edge.

Each edge has a non-negative weight w(e) that describes the dissimilarity between the two elements.

In image segmentation, V = the pixels in the image, and w(e) = intensity difference (or any other measurable pixel attribute).

Page 24:

Efficient Graph-Based Image Segmentation (cont.)

We define S as a segmentation of V into components (regions in the image) such that pixels in the same component are similar, and pixels from disjoint components are dissimilar.

We define the predicate D, which evaluates the dissimilarity between two components, as follows:

D(C1, C2) = true   if Dif(C1, C2) > MInt(C1, C2)
            false  otherwise

Dif(C1, C2) = min over vi ∈ C1, vj ∈ C2, (vi, vj) ∈ E of w(vi, vj)

MInt(C1, C2) = min( Int(C1) + γ(C1), Int(C2) + γ(C2) )

Int(C) = max over e ∈ MST(C, E) of w(e)

Page 25:

Efficient Graph-Based Image Segmentation – Algorithm (Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology)

0. Sort E by non-decreasing edge weight into (o1, …, om).
1. Start with a segmentation S0, where each vertex vi is in its own component.
2. Repeat step 3 for q = 1, …, m.
3. Construct Sq given Sq−1 as follows. Let vi and vj denote the vertices connected by the q-th edge in the ordering, i.e. oq = (vi, vj). If vi and vj are in disjoint components of Sq−1 and w(oq) is small compared to the internal difference of both those components, then merge the two components; otherwise do nothing.
4. Return S = Sm.
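The merge rule can be sketched with a union-find structure over an explicit weighted graph. The threshold function is the paper’s τ(C) = k/|C| (written γ on the previous slide); the edge list, k value, and tiny example graph are illustrative assumptions, not the paper’s implementation.

```python
# Compact sketch of the Felzenszwalb merge rule on an explicit edge list.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # Int(C): max MST edge weight so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        # Edges arrive in sorted order, so w is the component's new Int(C).
        self.internal[a] = w

def segment(n_vertices, edges, k):
    """edges: list of (w, i, j); k controls the threshold tau(C) = k/|C|."""
    uf = UnionFind(n_vertices)
    for w, i, j in sorted(edges):
        a, b = uf.find(i), uf.find(j)
        if a == b:
            continue
        # MInt(C1, C2) = min(Int(C1) + k/|C1|, Int(C2) + k/|C2|)
        mint = min(uf.internal[a] + k / uf.size[a],
                   uf.internal[b] + k / uf.size[b])
        if w <= mint:  # merge only if the joining edge is "small enough"
            uf.union(a, b, w)
    return [uf.find(v) for v in range(n_vertices)]

# 4 vertices in a line: 0-1 and 2-3 are similar, 1-2 is a strong boundary.
print(segment(4, [(0.1, 0, 1), (0.2, 2, 3), (5.0, 1, 2)], k=1.0))
```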

Back to “Selective Search”

Page 26:

Selective search

1. Generate an initial over-segmentation R (using the method of Felzenszwalb et al.).

2. Initialize S = ∅. Recursively combine similar regions into larger ones:

   a. From the set of regions R, choose the two that are most similar.

   b. Combine them into a single, larger region.

   c. Repeat until only one region remains.

3. Use the generated regions to produce candidate object locations.
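The greedy grouping loop above can be sketched on a toy problem. Here regions are 1-D intervals and the similarity function is a made-up stand-in (negative distance between interval centers); every intermediate merged region is recorded as a candidate, mirroring step 3.

```python
# Toy sketch of selective search's greedy merging loop.
def merge(a, b):
    # combine two intervals into the smallest interval covering both
    return (min(a[0], b[0]), max(a[1], b[1]))

def sim(a, b):
    # toy similarity: closer interval centers are "more similar"
    return -abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)

def selective_search(regions):
    proposals = list(regions)
    while len(regions) > 1:
        # a. choose the two most similar regions
        a, b = max(((x, y) for i, x in enumerate(regions)
                    for y in regions[i + 1:]), key=lambda p: sim(*p))
        # b. combine them; every merged region is also a candidate location
        regions = [r for r in regions if r not in (a, b)] + [merge(a, b)]
        proposals.append(merge(a, b))
    return proposals

print(selective_search([(0, 1), (2, 3), (9, 10)]))
```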

Page 27:

RCNN

Page 28:

Problem: each ROI may be of a different size.

Page 29:

Warp = transform the pixel locations I(u, v) according to a mapping function; in this case, scaling to a fixed input size.

Page 30:
Page 31:
Page 32:
Page 33:

RCNN Problems

• ~2k regions, each region computed independently – computationally expensive.

• Training is slow (84h) and takes a lot of disk space.

• Each training image has all ROIs labeled with a category and a bounding box – assumes full supervision of all objects.

• Inference (detection) is slow: 47s / image with VGG16 [Simonyan & Zisserman, ICLR15].

Page 34:

Fast & Faster RCNN – Summary

Fast RCNN

1. Instead of running each region separately, we first run the whole image through a convolutional network, which extracts a full-resolution feature map with respect to the whole image.

2. We still use “Selective search”, only we crop the ROI projected onto the feature map instead of cropping the image pixels. This allows us to share the heavy computation between regions.

3. ROI pooling layer (instead of warp) – quite similar to max pooling.

4. Multi-task loss: log loss (classification), smooth L1 (regressing the offsets).

Page 35:

Fast RCNN – Multi task loss

L(p, k*, t, t*) = Lcls(p, k*) + λ [k* ≥ 1] Lreg(t, t*)

Where:

k* is the true class label; [k* ≥ 1] is 1 for non-background classes, so the regression loss is applied only when the true class is not background.

Lcls(p, k*) = −log p_k* : standard cross entropy / log loss.

t* = (t*x, t*y, t*w, t*h) : true Bbox regression target.

t = (tx, ty, tw, th) : predicted tuple for class k*.

Lreg(t, t*) = Σ over i ∈ {x, y, w, h} of smoothL1(ti − t*i)

smoothL1(x) = 0.5x²      if |x| < 1
              |x| − 0.5  otherwise
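The smooth L1 (Huber-like) regression loss above can be sketched directly; the function names are assumptions for illustration.

```python
# Smooth L1 loss applied elementwise to the offsets t - t*.
def smooth_l1(x):
    ax = abs(x)
    # quadratic near zero, linear in the tails (robust to outliers)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def l_reg(t, t_star):
    return sum(smooth_l1(ti - si) for ti, si in zip(t, t_star))

print(smooth_l1(0.5))   # 0.5 * 0.5^2 = 0.125
print(smooth_l1(2.0))   # 2.0 - 0.5 = 1.5
print(l_reg((0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))  # 0.125
```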

Page 36:

Fast & Faster RCNN – Summary

Faster RCNN

1. Instead of selective search, we do the following:

• Produce a high-resolution feature map corresponding to the whole image.

• RPN – Region Proposal Network – a CNN that extracts the region proposals from the feature map. The RPN classifies each candidate as binary {object / no object} and regresses a bounding box.

2. Using the feature map and the region proposals, continue the same as Fast RCNN.

Page 37:

Region Proposal Network (RPN)

Input – an image of any size.

Output – rectangular regions proposed as object locations, each with a confidence score.

1. The input image goes through a CNN to generate a feature map.

Page 38:

Region Proposal Network (RPN) cont.

2. Slide a 3×3 window over the feature map. For each window, a set of 9 anchors is generated with the same center (xa, ya) but with 3 different aspect ratios and 3 different scales.

3. The window is passed to 2 FC layers:

• Classify object / no object (binary softmax)

• Bounding box regression

4. For each window, k = 9 predictions are made, one per anchor box. The output is 4k (36) numbers for the Bbox regression and 2k (18) numbers for the binary softmax classifier.
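Anchor generation at one window position can be sketched as 3 scales × 3 aspect ratios around a shared center; the scale and ratio values below are commonly used choices, taken as assumptions here.

```python
# 9 anchors per sliding-window position: 3 scales x 3 aspect ratios.
def make_anchors(xa, ya, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the anchor's area ~ s*s while varying aspect ratio r = w/h
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((xa - w / 2, ya - h / 2, w, h))  # (x0, y0, w, h)
    return anchors

print(len(make_anchors(0, 0)))  # 9 anchors per location
```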

Page 39:

Region Proposal Network (RPN) cont.

RPN loss function:

Positive label – the anchor has the highest IoU with a ground-truth box, or has IoU > 0.7 with any ground-truth box.

Negative label – IoU < 0.3 for all ground-truth boxes.

p*i = 1    if IoU > 0.7 (or highest IoU with a ground-truth box)
      −1   if IoU < 0.3
      0    otherwise (ignored in training)

Objective function with multi-task loss, similar to Fast R-CNN:

L(pi, ti) = Lcls(pi, p*i) + λ p*i Lreg(ti, t*i)

where p*i is 1 if the anchor is labeled positive and 0 if negative, so the regression term is active only for positive anchors.

λ = 10 biases towards better box location.
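The labeling rule above can be sketched as follows; the IoU matrix layout and thresholds are assumptions for illustration (+1 positive, −1 negative, 0 ignored).

```python
# Label anchors from a precomputed IoU matrix: iou_matrix[a][g] is the IoU
# of anchor a with ground-truth box g.
def label_anchors(iou_matrix, hi=0.7, lo=0.3):
    labels = []
    # the anchor with the best IoU for some GT box is positive even if < hi
    best_per_gt = [max(row[g] for row in iou_matrix)
                   for g in range(len(iou_matrix[0]))]
    for row in iou_matrix:
        best = max(row)
        if best > hi or any(row[g] == best_per_gt[g] and best_per_gt[g] > 0
                            for g in range(len(row))):
            labels.append(1)
        elif best < lo:
            labels.append(-1)
        else:
            labels.append(0)
    return labels

# 3 anchors, 1 ground-truth box
print(label_anchors([[0.8], [0.1], [0.5]]))  # [1, -1, 0]
```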

Page 40:

Fast & Faster RCNN – Summary

Faster RCNN multi-task loss:

1. RPN – classify object yes/no

2. RPN – regress box coordinates

3. Final classification score (object classes)

4. Final box coordinates

Page 41:
Page 42:

Detection without proposals

Page 43:

Detection without region proposals

Page 44:

YOLO

Page 45:

YOLO Architecture

• A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

• 24 convolutional layers

• 2 FC layers

Page 46:

Yolo – You Only Look Once

• The image is divided into grid cells, and classification + localization is performed per cell. For each grid cell:

Output = (pc, bx, by, bh, bw, c1, c2, c3)

Classes: 1 – pedestrian, 2 – car, 3 – motorcycle.

For a 3×3 grid, the output size is 3 × 3 × 8 (grid cells × (offset + classes)).

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]

Page 47:

Example – 2 anchor boxes

With 2 anchor boxes, each grid cell outputs one (pc, bx, by, bh, bw, c1, c2, c3) tuple per anchor box:

Output = (pc, bx, by, bh, bw, c1, c2, c3, pc, bx, by, bh, bw, c1, c2, c3)

where the first 8 entries belong to anchor box 1 and the last 8 to anchor box 2.

Classes: 1 – pedestrian, 2 – car, 3 – motorcycle.

Output size is 3 × 3 × 2 × 8 (grid cells × anchor boxes × (offset + classes)).

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]
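The output sizes quoted on the last two slides can be sanity-checked with a one-line calculation; the function name is an assumption for illustration.

```python
# Output tensor size: grid cells x anchors x (pc + 4 box offsets + classes).
def yolo_output_size(grid, anchors, num_classes):
    per_anchor = 1 + 4 + num_classes  # pc, (bx, by, bh, bw), class scores
    return grid * grid * anchors * per_anchor

print(yolo_output_size(grid=3, anchors=2, num_classes=3))  # 3*3*2*8 = 144
print(yolo_output_size(grid=3, anchors=1, num_classes=3))  # 3*3*8 = 72
```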

Page 48:

YOLO Loss function

The loss is a sum of five terms:

• Bbox coordinate error

• Bbox width/height error

• Class error when an object is found

• Class error when an object is not found

• Objectness (“is there an object”) error

Where:

1_i^obj = 1 if there is an object in the i-th cell

1_ij^obj = 1 if there is an object in the i-th cell and anchor box j is responsible for predicting it

λ_coord, λ_noobj are optimization parameters that give some loss terms a higher/lower weight.

Page 49:

What to do with overlapping objects?

Page 50:

Overlapping objects:

[Figure: two overlapping objects with two candidate anchor boxes, anchor box 1 and anchor box 2]

Previously:

Each object in the training image is assigned to the grid cell that contains that object’s midpoint.

With two anchor boxes:

Each object in the training image is assigned to the grid cell that contains the object’s midpoint, and to the anchor box of that grid cell with the highest IoU.

[Redmon et al., 2015, You Only Look Once: Unified, real-time object detection]

Page 51:

How to choose the anchor boxes?

Page 52:

YOLO – Choosing the anchor boxes

• Ad hoc – define default anchor boxes (in practice 9: 3 scales, 3 aspect ratios).

• Using k-means clustering – use the data set to derive the anchor boxes:

IoU is used as the distance metric for k-means.

Using the boxes’ widths and heights as features, we compute the cluster centers.

The calculation is done for various numbers of clusters, choosing the number that gives the best trade-off between mean IoU and anchor box overlap.

• Example taken from: Vivek Yadav, Jul 10, 2017.
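The clustering step can be sketched as k-means on (w, h) pairs with d = 1 − IoU as the distance, as in YOLOv2-style anchor selection; the toy data, initial centers, and mean-based center update are illustrative assumptions.

```python
# k-means on box shapes (w, h) using d = 1 - IoU as the distance.
def wh_iou(a, b):
    # IoU of two boxes aligned at the same corner, each given as (w, h)
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, centers, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for box in boxes:
            # assign each box to the nearest center under d = 1 - IoU
            i = max(range(len(centers)), key=lambda i: wh_iou(box, centers[i]))
            clusters[i].append(box)
        # recompute centers as the mean shape of each cluster
        centers = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

boxes = [(10, 10), (12, 11), (50, 20), (48, 22)]  # two natural shape groups
print(sorted(kmeans_anchors(boxes, centers=[(10, 10), (50, 20)])))
```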

Page 53:

How to deal with detecting the same object more than once?

Page 54:

Non-max suppression example

Page 55:

Non-max suppression example

[Figure: candidate boxes with confidence scores 0.8, 0.5, 0.6, 0.9 and 0.3 around two objects’ midpoints]

Page 56:

Non-max suppression example

[Figure: candidate boxes with confidence scores 0.8, 0.7, 0.6, 0.9 and 0.7]

Page 57:

Non-max suppression

Each output prediction is: (pc, bx, by, bh, bw)

• Discard all boxes with pc ≤ 0.6.

• While there are any remaining boxes:

  • Pick the box with the largest pc and output it as a prediction.

  • Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
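The procedure above can be sketched directly; the (pc, x0, y0, w, h) detection format and the corner-based `iou` helper are assumptions for illustration.

```python
# Greedy non-max suppression over (pc, x0, y0, w, h) detections.
def iou(a, b):
    ax0, ay0, aw, ah = a
    bx0, by0, bw, bh = b
    ix = max(0.0, min(ax0 + aw, bx0 + bw) - max(ax0, bx0))
    iy = max(0.0, min(ay0 + ah, by0 + bh) - max(ay0, by0))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter) if inter else 0.0

def nms(boxes, pc_thresh=0.6, iou_thresh=0.5):
    # 1. discard low-confidence boxes
    boxes = [b for b in boxes if b[0] > pc_thresh]
    kept = []
    while boxes:
        # 2. pick the box with the largest pc and output it
        best = max(boxes, key=lambda b: b[0])
        kept.append(best)
        # 3. discard remaining boxes that overlap it too much
        boxes = [b for b in boxes
                 if b is not best and iou(b[1:], best[1:]) < iou_thresh]
    return kept

dets = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 10, 10), (0.7, 50, 50, 10, 10)]
print(nms(dets))  # keeps the 0.9 box and the far-away 0.7 box
```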

Page 58:

Outputting the non-max suppressed outputs

• For each grid cell, get the 2 predicted bounding boxes.

• Get rid of low-probability predictions.

• For each class (pedestrian, car, motorcycle), run non-max suppression to generate the final predictions.

Page 59:

SSD

Page 60:

SSD – Single Shot Detector

Main differences from YOLO

Page 61: Region Based CNN Detection & Segmentation · 2018-05-07 · Yolo –You Only Look Once 𝑂 P L Q P= L ℎ 1 2 3 [Redmon et al., 2015, You Only Look Once: Unified real-time object