UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES
CONVNETS FOR OBJECT DETECTION, SEGMENTATION AND STRUCTURED OUTPUTS

Lecture 6: Convnets for object detection and segmentation
Deep Learning @ UvA


Previous Lecture

o What do convolutions look like?
o Build on the visual intuition behind Convnets
o Deep Learning Feature maps
o Transfer Learning


Lecture Overview

o What are object detection, segmentation and structured outputs?
o How to use a convnet to localize objects?
o What is the relation between convnets and Conditional Random Fields (CRFs)?


Object localization


o The task of discovering the location of one or more objects in an image
o Images
  ◦ Detect spatial object location
o Videos
  ◦ Detect spatial object location
  ◦ Optionally, detect object temporal location
o Generic
  ◦ Define the same model for all types of categories
o Specific
  ◦ Define specialized models for particular categories (“face”, “pedestrian”)


Why is localization important?

o Robotics
o Self-driving cars
o Better classification
  ◦ Isolates the object signal from the background signal
o Huge pictures
  ◦ E.g. in astronomy, pictures have ultra-high resolution
  ◦ Detecting a new star or galaxy goes beyond image classification


Localization granularity

o Box level, aka object detection
  ◦ Predict a bounding box surrounding the objects
o Pixel level, aka semantic segmentation
  ◦ Predict the category label for each pixel
o Pixel and instance level
  ◦ Predict a free-form delineation around object instances and predict their category
  ◦ Separate between instances of the same category (much more difficult!)

[Figure panels: image classification, object detection, pixel classification, pixel and instance classification]


Localization supervision

o Supervised (category specific)
  ◦ Examples for “bicycle”, “dog”, “person”, “face”
o Unsupervised (category agnostic)
  ◦ aka bounding box/segment proposal algorithms
  ◦ The method typically returns hundreds of possible object locations


Agnostic to category-specific detection

o Training
  ◦ Crop objects from the images
  ◦ Train a classifier (SVM) for each of the categories
o Inference
  ◦ Detect possible, category-agnostic object locations
  ◦ Crop all of them from the image
  ◦ Assign a per-category score to each of them
  ◦ Keep the ones with a score above a threshold
  ◦ Success: > 50% overlap with ground truth
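The success criterion above can be sketched in code. The slide does not say how overlap is measured; this sketch assumes intersection-over-union (IoU) on boxes given as `(x1, y1, x2, y2)`. The helper names `iou` and `is_correct_detection` are illustrative, not taken from the lecture.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, gt_box, threshold=0.5):
    """Assumed criterion: a detection succeeds if IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold
```

For example, two 2x2 boxes shifted by one pixel in each direction share a 1x1 area out of a union of 7, so the detection would be rejected at the 50% threshold.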


Sliding window and direct classification

o Training
  ◦ Crop objects from the images
  ◦ Train a classifier for each of the categories
o Inference
  ◦ Parse the image at several locations and scales
  ◦ For each location, compute a category score
  ◦ Keep boxes with a high enough score
  ◦ More thorough search
  ◦ Potentially more expensive (or maybe not)
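A minimal sketch of the sliding-window inference loop follows. The window size, scales and stride fraction are hypothetical defaults chosen for illustration, and `score_fn` stands in for whatever per-category classifier was trained; none of these names come from the lecture.

```python
def sliding_windows(img_w, img_h, base=64, scales=(1.0, 2.0), stride_frac=0.5):
    """Yield square (x1, y1, x2, y2) windows at several locations and scales."""
    for s in scales:
        size = int(round(base * s))
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, x + size, y + size)


def detect(img_w, img_h, score_fn, threshold, **window_args):
    """Score every window and keep the boxes whose score is high enough."""
    detections = []
    for box in sliding_windows(img_w, img_h, **window_args):
        score = score_fn(box)
        if score > threshold:
            detections.append((box, score))
    return detections
```

The number of windows grows with the number of scales and shrinks with the stride, which is where the "more thorough but potentially more expensive" trade-off comes from.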


Sliding window and direct classification

o Training
  ◦ Crop objects from the images
  ◦ Train a classifier for each of the categories
o Inference
  ◦ Regress bounding box coordinates
  ◦ Cheap
  ◦ Not accurate enough


Convnets and object localization


R-CNN [Girshick2013]

o Simple pipeline!
o Use the convnet as a feature extractor
o Given a bounding box, compute a “deep” feature
o Run an SVM (one per category) on the deep features
o Warp cropped images to make them “square”
o Add some background context
o Very accurate!!!
o Unfortunately slow
  ◦ More than 60 seconds per image
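The "add context, then warp to a square" steps can be sketched as below. This is an illustration under stated assumptions: the image is a plain list of rows, the warp uses nearest-neighbour sampling, and the padding and output size are hypothetical parameters rather than the values used in the paper.

```python
def add_context(box, pad, img_w, img_h):
    """Grow a (x1, y1, x2, y2) box by `pad` pixels, clipped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))


def warp_crop(image, box, out_size):
    """Nearest-neighbour warp of the boxed region to out_size x out_size.

    The aspect ratio is not preserved: a wide box is squeezed horizontally,
    which is what makes the crop "square" for a fixed-input convnet.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return [[image[y1 + (i * h) // out_size][x1 + (j * w) // out_size]
             for j in range(out_size)]
            for i in range(out_size)]
```

Each proposal goes through this crop and a full convnet forward pass separately, which is consistent with the slow per-image runtime noted above.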


R-CNN results

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv 2013

+25%!!!!!


R-CNN examples


Fast R-CNN

o Observation: a convnet is convolutional!!!
o Convolutions are location specific
  ◦ The same filter is applied at different locations, so each response stays tied to its image location
o R-CNN treats the image regions of different bounding boxes as if they were unrelated
o However, because of the convolutions, the convolved image locations are shared between box proposals
o Can we reuse the convolutional feature maps?


Fast R-CNN

o Do not compute the forward propagation from scratch for each box
o Compute the feature maps once
o Reuse the convolutions for different boxes
o Region-of-Interest pooling
  ◦ Don’t define the stride absolutely (e.g. sample every 4 pixels)
  ◦ Define the stride relatively (box width divided by a predefined number of “poolings” T)
  ◦ Fixed-length vector
o End-to-end training!
o More accurate
o Much faster
  ◦ Less than a second per image
o Still, external box proposals are needed

[Figure label: T=5]
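Region-of-Interest pooling with a relative stride can be sketched as follows. The sketch assumes max pooling inside each bin (the slide does not name the pooling operator), a single-channel feature map stored as a list of rows, and a box in feature-map coordinates with exclusive right and bottom edges. The function name `roi_max_pool` is illustrative.

```python
def roi_max_pool(feature_map, box, T):
    """Pool the boxed region into T x T bins and return a flat vector.

    Bin edges are box_size * i / T, so the stride scales with the box and
    the output always has T * T entries, whatever the box size.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    pooled = []
    for i in range(T):
        ys = y1 + (i * h) // T                              # floor of bin start
        ye = max(ys + 1, y1 + ((i + 1) * h + T - 1) // T)   # ceil of bin end
        for j in range(T):
            xs = x1 + (j * w) // T
            xe = max(xs + 1, x1 + ((j + 1) * w + T - 1) // T)
            pooled.append(max(feature_map[y][x]
                              for y in range(ys, ye)
                              for x in range(xs, xe)))
    return pooled
```

Because every box yields a vector of the same length, the layers after the pooling step can be shared across all proposals of an image.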


Fast R-CNN results

R. Girshick, Fast R-CNN, ICCV, 2015

+4%!!!!!


Fully Convolutional Networks [LongCVPR2014]

o Not a single-number output (“car” vs. “not car”)
  ◦ Pixel-wise output (as many outputs as inputs)
o A fully connected layer can be viewed as a convolution
  ◦ The size of the convolution filter equals the size of the whole layer
o Upsampling
  ◦ Deconvolution filters learnt with backpropagation
  ◦ Deconvolution filters return feature maps larger than the input
  ◦ Outputs get progressively bigger until they reach the input size

[Figure label: “Slices” of the 1-of-1000 classification layer]
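The claim that a fully connected layer can be viewed as a convolution can be checked on a toy example. The sketch below uses a single output unit, a single input channel and plain Python lists; `fc` and `conv_valid` are illustrative helpers, and the "convolution" is the cross-correlation commonly used in convnets.

```python
def fc(x_flat, weights, bias):
    """One fully connected output unit on a flattened input."""
    return sum(w * x for w, x in zip(weights, x_flat)) + bias


def conv_valid(image, kernel, bias):
    """Valid 2D cross-correlation of a square kernel over an image."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(k) for b in range(k)) + bias
             for j in range(width - k + 1)]
            for i in range(height - k + 1)]


# The same weights, once as a flat vector and once reshaped to a 2x2 filter.
weights = [1, -1, 2, 1]
kernel = [[1, -1], [2, 1]]
patch = [[1, 2], [3, 4]]
bigger = [[1, 2, 0], [3, 4, 0], [0, 0, 0]]

score = fc([1, 2, 3, 4], weights, 1)      # one number for the 2x2 input
same = conv_valid(patch, kernel, 1)       # a 1x1 map holding that number
score_map = conv_valid(bigger, kernel, 1) # a 2x2 map on the larger input
```

On an input of the original size the filter fits exactly once and reproduces the fully connected output; on a larger input the same weights produce a spatial map of scores, which is what enables the pixel-wise output.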


Fully Convolutional Networks architecture

o Connect intermediate layers to output
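The upsampling used to bring coarse score maps back to a larger resolution can be sketched as a transposed convolution ("deconvolution"). This is a single-channel illustration in plain Python with a fixed kernel; in the network the kernel values would be learnt with backpropagation. The function name is illustrative.

```python
def transposed_conv2d(x, kernel, stride):
    """Scatter each input value, weighted by the kernel, into a larger map.

    For an n x n input, k x k kernel and stride s, the output side is
    (n - 1) * s + k, so the output is larger than the input when s > 1.
    """
    n_h, n_w = len(x), len(x[0])
    k = len(kernel)
    out_h, out_w = (n_h - 1) * stride + k, (n_w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out
```

With a 2x2 kernel of ones and stride 2, each input value is copied into a 2x2 block, so this particular setting reduces to nearest-neighbour upsampling by a factor of two.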


Fully Convolutional Networks results


Deep Face Recognition

o Identity verification
  ◦ Find whether two pictures belong to the same person
o Essentially similar to metric learning
  ◦ Find a projection matrix (metric) with which distances between faces of the same person are smaller than distances between faces of different persons
o Triplet “Euclidean” loss

$$E(W') = \sum_{(a,p,n) \in T} \max\left\{0,\; \alpha - \lVert x_a - x_n \rVert_2^2 + \lVert x_a - x_p \rVert_2^2\right\}, \qquad x_i = W' \frac{\phi(\ell_i)}{\lVert \phi(\ell_i) \rVert_2}$$

o Start from a very deep architecture
o More similar to classification, but face detection is already very accurate
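The triplet loss can be sketched directly from its definition. The sketch assumes that each triplet holds an anchor, a positive (same person) and a negative (different person) embedding, that the constant inside the max is a margin, and that embeddings are plain Python sequences. `embed`, `sq_dist` and `triplet_loss` are illustrative names.

```python
import math


def embed(W, phi):
    """x = W' * phi / ||phi||_2: L2-normalise the feature, then project it."""
    norm = math.sqrt(sum(v * v for v in phi))
    unit = [v / norm for v in phi]
    return [sum(w * u for w, u in zip(row, unit)) for row in W]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, margin):
    """Sum over (anchor, positive, negative) of the hinge on the distance gap."""
    return sum(max(0.0, margin - sq_dist(a, n) + sq_dist(a, p))
               for a, p, n in triplets)
```

A triplet contributes zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets drive the update of the projection.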


Deep Face Recognition results

Image Segmentation using Conditional Random Fields

Deep Learning @ UvA Patrick Putzky

[Shotton et al., 2009]

Outlook

[Krähenbühl & Koltun, 2011]

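Conditional Random Fields

$$p(y|x) = \frac{1}{Z(x)} \exp\left(-E(y|x)\right)$$

[Figure: graphical model with nodes y and x]

General form

$$E(y|x) = \sum_{c \in \mathcal{C}_G} \psi_c(y|x)$$

Unary form

$$E(y|x) = \sum_i \psi_u(y_i|x)$$

Pairwise CRF

$$E(y|x) = \sum_i \psi_u(y_i|x) + \sum_{i<j} \psi_p(y_i, y_j|x)$$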
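To make the pairwise form concrete, here is a brute-force sketch on a three-pixel toy problem: it evaluates the energy of every labelling and normalises exp(-E) by Z(x). The function names, the toy unary energies and the choice of a Potts pairwise term with unit weights are assumptions for illustration; enumeration is only feasible for tiny problems.

```python
import itertools
import numpy as np

def energy(y, unary, pairwise_weight):
    """E(y|x) = sum_i psi_u(y_i|x) + sum_{i<j} psi_p(y_i, y_j|x), where the
    pairwise term (an assumed Potts form) costs pairwise_weight[i, j]
    whenever the two labels differ."""
    n = len(y)
    e = sum(unary[i, y[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            e += pairwise_weight[i, j] * (y[i] != y[j])
    return e

def gibbs_distribution(unary, pairwise_weight):
    """Enumerate all labellings and return them with p(y|x) = exp(-E) / Z(x)."""
    n, num_labels = unary.shape
    labelings = list(itertools.product(range(num_labels), repeat=n))
    energies = np.array([energy(y, unary, pairwise_weight) for y in labelings])
    unnormalised = np.exp(-energies)
    return labelings, unnormalised / unnormalised.sum()

# The middle pixel weakly prefers label 1, its neighbours strongly prefer label 0.
unary = np.array([[0.0, 2.0], [1.0, 0.8], [0.0, 2.0]])
labelings, probs = gibbs_distribution(unary, np.ones((3, 3)))
map_labeling = labelings[int(np.argmax(probs))]
```

In this toy case the pairwise term overrules the weak unary preference of the middle pixel, so the most probable labelling is uniform.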

Unary models

$$E(y|x) = \sum_i \psi_u(y_i|x)$$

- Per pixel predictions
- No interactions between neighbouring class predictions

Example: Fully Convolutional Networks

[Long et al., 2015]

[Figure: input x and prediction y]
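Because a unary energy is a sum of independent per-pixel terms, the best labelling is simply the per-pixel minimum. The small check below compares that shortcut against brute-force search; the toy energies are made up for illustration.

```python
import itertools
import numpy as np

# psi_u(y_i = l | x) for 3 pixels and 2 labels (hypothetical values)
unary = np.array([[0.2, 1.5], [1.0, 0.8], [0.1, 2.0]])

# Per-pixel decision: each pixel picks its own cheapest label.
per_pixel = tuple(unary.argmin(axis=1).tolist())

# Brute force over all 2^3 labellings of the summed energy.
brute_force = min(
    itertools.product(range(2), repeat=3),
    key=lambda y: sum(unary[i, y[i]] for i in range(3)),
)
```

Both give the same labelling, which is why a unary model cannot enforce agreement between neighbouring pixels.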

Unary models

- Per pixel predictions
- No interactions between neighbouring class predictions
- No object coherency

Fully Convolutional Networks: Fail Case

[Zheng et al., 2015]

Adjacency CRFs

$$E(y|x) = \sum_i \underbrace{\psi_u(y_i|x)}_{\text{unary potential}} + \sum_{j \in \mathcal{N}(i)} \underbrace{\psi_p(y_i, y_j|x)}_{\text{pairwise potential}}$$

- define a neighbourhood graph
- arises naturally in images
- efficient inference
- only local interactions
- graph is not trainable

Adjacency CRFs

Popular: Potts model

$$\psi_p(y_i, y_j|x) = 1_{[y_i \neq y_j]} \left( w_1 \exp\left(-\beta \|x_i - x_j\|^2\right) + w_2 \right)$$

- No distinction between classes
- Shrinking bias
- Limited edge-awareness
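A small sketch of the contrast-sensitive Potts potential: it is zero when two neighbours share a label, and otherwise charges a penalty that shrinks when the pixel values differ strongly (a likely image edge). The default values of `w1`, `w2` and `beta` are arbitrary assumptions.

```python
import numpy as np

def potts_pairwise(y_i, y_j, x_i, x_j, w1=1.0, w2=0.1, beta=0.5):
    """1[y_i != y_j] * (w1 * exp(-beta * ||x_i - x_j||^2) + w2)."""
    if y_i == y_j:
        return 0.0
    contrast = np.sum((np.asarray(x_i, float) - np.asarray(x_j, float)) ** 2)
    return w1 * np.exp(-beta * contrast) + w2

same_label = potts_pairwise(1, 1, [0, 0, 0], [10, 10, 10])
cut_in_flat_region = potts_pairwise(0, 1, [5, 5, 5], [5, 5, 5])
cut_across_edge = potts_pairwise(0, 1, [0, 0, 0], [10, 10, 10])
```

The penalty depends only on whether the labels differ, not on which classes are involved, which is the "no distinction between classes" limitation.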

Fully connected CRFs

- Edges between all pairs of nodes
- Long-range interactions

$$E(y|x) = \sum_i \underbrace{\psi_u(y_i|x)}_{\text{unary potential}} + \sum_{j \neq i} \underbrace{\psi_p(y_i, y_j|x)}_{\text{pairwise potential}}$$

Fully connected CRFs

- Strength of edges defines connectivity
- No more shrinking bias
- $N^2$ edges -> computationally expensive

$$E(y|x) = \sum_i \underbrace{\psi_u(y_i|x)}_{\text{unary potential}} + \sum_{j \neq i} \underbrace{\psi_p(y_i, y_j|x)}_{\text{pairwise potential}}$$
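To get a feeling for the cost, the sketch below counts edges in a 4-connected adjacency CRF and in a fully connected CRF. The 500 x 375 image size is only an assumed example.

```python
def grid_edges(height, width):
    """Edges of a 4-connected pixel grid (horizontal plus vertical neighbours)."""
    return height * (width - 1) + width * (height - 1)

def dense_edges(height, width):
    """Edges of a fully connected graph: one per unordered pixel pair."""
    n = height * width
    return n * (n - 1) // 2

local = grid_edges(500, 375)    # 374,125 edges
dense = dense_edges(500, 375)   # 17,578,031,250 edges
print(local, dense, dense // local)
```

The dense model has tens of thousands of times more edges for this image size, so evaluating every pairwise term explicitly is impractical.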

Gaussian Edge Potentials

Potts model

$$\psi_p(y_i, y_j|x) = 1_{[y_i \neq y_j]} \left( w_1 \exp\left(-\beta \|x_i - x_j\|^2\right) + w_2 \right)$$

Generalisation

$$\psi_p(y_i, y_j|x) = \mu(y_i, y_j) \sum_{m=1}^{M} w_m k_m(f_i, f_j)$$

With Gaussian kernels

$$k_m(f_i, f_j) = \exp\left(-\frac{1}{2} (f_i - f_j)^T Q_m (f_i - f_j)\right)$$

- Potts: $\mu(y_i, y_j) = 1_{[y_i \neq y_j]}$
- Label compatibility $\mu(y_i, y_j)$ models interactions between classes
- trainable parameters
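The generalised pairwise potential can be written down directly. In this sketch the function names, the two-dimensional features and the single kernel with identity precision matrix are assumptions chosen to keep the example small.

```python
import numpy as np

def gaussian_kernel(f_i, f_j, Q):
    """k_m(f_i, f_j) = exp(-1/2 (f_i - f_j)^T Q_m (f_i - f_j))."""
    d = f_i - f_j
    return np.exp(-0.5 * d @ Q @ d)

def pairwise_potential(y_i, y_j, f_i, f_j, mu, weights, Qs):
    """psi_p = mu(y_i, y_j) * sum_m w_m k_m(f_i, f_j)."""
    k = sum(w * gaussian_kernel(f_i, f_j, Q) for w, Q in zip(weights, Qs))
    return mu[y_i, y_j] * k

f_i, f_j = np.array([0.0, 0.0]), np.array([1.0, 0.0])
potts_mu = 1.0 - np.eye(2)             # Potts compatibility: 1 if labels differ
weights, Qs = [1.0], [np.eye(2)]       # one kernel, M = 1

different = pairwise_potential(0, 1, f_i, f_j, potts_mu, weights, Qs)
same = pairwise_potential(0, 0, f_i, f_j, potts_mu, weights, Qs)
```

Replacing `potts_mu` with a full matrix lets the model penalise some class pairs more than others.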

Gaussian Edge Potentials

$$k_m(f_i, f_j) = \exp\left(-\frac{1}{2} (f_i - f_j)^T Q_m (f_i - f_j)\right)$$

[Figure: image x, features f, kernel k_m]

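Edge potentials as convolutions

$$
\begin{aligned}
E_p(y|x) &= \sum_i \sum_{j \neq i} \psi_p(y_i, y_j|x) \\
&= \sum_i \sum_{j \neq i} \mu(y_i, y_j) \sum_{m=1}^{M} w_m k_m(f_i, f_j) \\
&= \sum_{m=1}^{M} w_m \sum_i \sum_{j \neq i} \mu(y_i, y_j)\, k_m(f_i, f_j) \\
&= \sum_{m=1}^{M} w_m \sum_i \left[ \sum_{j=1}^{N} \mu(y_i, y_j)\, k_m(f_i, f_j) - \mu(y_i, y_i)\, k_m(f_i, f_i) \right] \\
&= \sum_{m=1}^{M} w_m \sum_i \left[ \sum_{l=i-1}^{i-N} \mu(y_i, y_{i-l})\, k_m(f_i, f_{i-l}) - \mu(y_i, y_i)\, k_m(f_i, f_i) \right]
\end{aligned}
$$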

- If $k_m$ is a symmetric kernel, the pairwise potential can be implemented as a convolution
- A Gaussian has most of its probability mass in a small region around its mean
- Approximate this using a truncated Gaussian

[Wikipedia]
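A minimal 1-D check of the convolution idea, assuming the features are just pixel positions on a regular grid: the dense sum over all j != i equals a convolution with a Gaussian filter minus the j = i term, and a truncated filter gives nearly the same result. Function names, `sigma` and the truncation radius are assumptions.

```python
import numpy as np

def dense_messages(q, sigma):
    """Explicit O(N^2) sum: out_i = sum_{j != i} k(i, j) * q_j."""
    idx = np.arange(len(q))
    K = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / sigma ** 2)
    np.fill_diagonal(K, 0.0)
    return K @ q

def conv_messages(q, sigma, radius):
    """Same quantity via convolution with a Gaussian truncated at +-radius.
    Requires 2 * radius + 1 <= len(q)."""
    offsets = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * offsets ** 2 / sigma ** 2)
    full = np.convolve(q, g, mode="same")
    return full - q          # remove the j == i term, since k(f_i, f_i) = 1

rng = np.random.default_rng(0)
q = rng.random(50)
exact = dense_messages(q, sigma=2.0)
approx = conv_messages(q, sigma=2.0, radius=8)   # truncate at 4 sigma
max_error = np.abs(exact - approx).max()
```

The truncated version touches only 2 * radius + 1 neighbours per pixel instead of all N.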

Gaussian Edge Potentials

$$k_m(f_i, f_j) = \exp\left(-\frac{1}{2} (f_i - f_j)^T Q_m (f_i - f_j)\right)$$

[Figure: image x, features f, kernel k_m, truncated kernel \tilde{k}_m]

- Convolution in a high dimensional space

Permutohedral Lattices for convolution in high dimensional space

Problem:
- $f_i$ can be high dimensional
- as a result the feature space will be sparsely populated

[Adams et al., 2010]
- Convolutions in $O(d^2 n)$, and in $O(dn)$ for separable filters
- Naive convolution on a grid is $O(2^d n)$
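A quick look at how the per-point factors in these bounds grow with the feature dimension d. These are asymptotic factors only and ignore constants, so they indicate trends rather than runtimes; the function name and the chosen dimensions are assumptions.

```python
def cost_factors(d):
    """Per-point factor in each bound: 2^d (naive grid), d^2 (permutohedral
    lattice), d (separable filters)."""
    return {"naive_grid": 2 ** d, "permutohedral": d ** 2, "separable": d}

for d in (2, 5, 16):
    print(d, cost_factors(d))
```

For d = 5 (for example two position and three colour coordinates) the factors are still comparable, but the exponential term dominates quickly as d grows.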

Efficient inference in dense CRFs

Exact distribution (hard to evaluate):

$$p(y|x) = \frac{1}{Z(x)} \exp\left(-\sum_i \psi_u(y_i|x) - \sum_{j \neq i} \psi_p(y_i, y_j|x)\right)$$

Mean-field approximation:

$$P(y|x) \approx Q(y|x) = \prod_i Q(y_i|x)$$

Mean field updates:

$$Q_i(y_i = l) = \frac{1}{Z_i} \exp\left(-\psi_u(y_i|x) - \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m=1}^{M} w_m \sum_{j \neq i} k_m(f_i, f_j)\, Q_j(l')\right)$$

[Krähenbühl & Koltun, 2011]

- Naively implemented, an update over all $Q_i$ costs $O(N^2)$
- With the approximations from before it is $O(N)$
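The mean field update can be sketched naively in NumPy. This version builds the full N x N kernel matrix, so it is the O(N^2) variant and only suitable for toy inputs; the function names, the parallel update schedule, the initialisation from the unaries and the toy data are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, features, mu, weights, Qs, num_iters=10):
    """unary: (N, L) energies psi_u, features: (N, d), mu: (L, L) compatibility.
    Returns the approximate marginals Q_i(y_i = l) as an (N, L) array."""
    diff = features[:, None, :] - features[None, :, :]            # (N, N, d)
    K = sum(w * np.exp(-0.5 * np.einsum("ijd,de,ije->ij", diff, Q, diff))
            for w, Q in zip(weights, Qs))                         # sum_m w_m k_m
    np.fill_diagonal(K, 0.0)                                      # enforce j != i
    q = softmax(-unary)                                           # initialisation
    for _ in range(num_iters):
        messages = K @ q                  # (N, L): sum_{j != i} k(f_i, f_j) Q_j(l')
        pairwise = messages @ mu.T        # (N, L): sum_{l'} mu(l, l') * messages
        q = softmax(-unary - pairwise)    # the softmax supplies 1 / Z_i
    return q

# Five pixels on a line; the middle one weakly prefers label 1.
unary = np.array([[0, 2], [0, 2], [0.5, 0.3], [0, 2], [0, 2]], dtype=float)
features = np.arange(5, dtype=float)[:, None]
q = mean_field(unary, features, 1.0 - np.eye(2), [1.0], [np.eye(1)])
labels = q.argmax(axis=1)
```

In this toy example the messages from the neighbours outweigh the weak unary preference of the middle pixel, so all pixels end up with label 0.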

Efficient inference in dense CRFs

Mean field updates:

$$Q_i(y_i = l) = \frac{1}{Z_i} \exp\left(-\psi_u(y_i|x) - \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m=1}^{M} w_m \sum_{j \neq i} k_m(f_i, f_j)\, Q_j(l')\right)$$

[Krähenbühl & Koltun, 2011]

10 iterations: 0.2 seconds

Efficient inference in dense CRFs

[Krähenbühl & Koltun, 2011]

What are these features?

$$k_m(f_i, f_j) = \exp\left(-\frac{1}{2} (f_i - f_j)^T Q_m (f_i - f_j)\right)$$

$$f_i = \begin{pmatrix} p_i \\ x_i \end{pmatrix}$$

$p_i$: position vector
$x_i$: vector of pixel values

- This is known as bilateral filtering
- It is an edge-preserving approach to filtering

[Adams et al., 2010]
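A small 1-D sketch of why stacking position and pixel value into one feature vector gives edge-preserving filtering. The function name, the diagonal choice of Q and the bandwidth values are assumptions; the kernel matrix is built explicitly, so this is for illustration only.

```python
import numpy as np

def bilateral_filter_1d(signal, sigma_p, sigma_x):
    """Filter a 1-D signal with a Gaussian kernel on f_i = (p_i, x_i),
    using Q = diag(1 / sigma_p^2, 1 / sigma_x^2)."""
    pos = np.arange(len(signal), dtype=float)
    f = np.stack([pos, signal], axis=1)                  # one feature row per pixel
    Q = np.diag([1.0 / sigma_p ** 2, 1.0 / sigma_x ** 2])
    diff = f[:, None, :] - f[None, :, :]
    K = np.exp(-0.5 * np.einsum("ijd,de,ije->ij", diff, Q, diff))
    return (K @ signal) / K.sum(axis=1)                  # normalised weighted average

step = np.concatenate([np.zeros(10), np.ones(10)])       # a sharp edge
bilateral = bilateral_filter_1d(step, sigma_p=2.0, sigma_x=0.1)
# A huge sigma_x switches the intensity term off: plain Gaussian blur.
blurred = bilateral_filter_1d(step, sigma_p=2.0, sigma_x=1e6)
```

Pixels on opposite sides of the edge are far apart in feature space, so they barely influence each other and the step survives, whereas the position-only kernel smears it.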

CRFs as RNNs

Demo: http://www.robots.ox.ac.uk/~szheng/crfasrnndemo

[Zheng et al., 2015]

- Mean field iterations as steps in an RNN
- End-to-end training

Learning the filters

[Kiefel et al., 2015]

Summary

- We can improve performance of segmentation algorithms by adding pairwise interactions between pixel predictions
- Approximate inference in dense CRFs can be done efficiently
- End-to-end learning can improve performance
- Pairwise kernels can be learned and lead to potential improvements
- These methods could have an impact in other problem domains

References

(1) Adams A, Baek J, Davis MA. Fast high-dimensional filtering using the permutohedral lattice. Comput Graph Forum. 2010;29(2):753-762.

(2) Krähenbühl P, Koltun V. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In: Advances in Neural Information Processing Systems. 2011:109-117. http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials. Accessed January 8, 2016.

(3) Kiefel M, Jampani V, Gehler PV. Sparse Convolutional Networks using the Permutohedral Lattice. 2015. http://arxiv.org/abs/1503.04949. Accessed January 8, 2016.

(4) Shotton J, Winn J, Rother C, Criminisi A. TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Int J Comput Vis. 2009. http://research.microsoft.com/apps/pubs/default.aspx?id=117885.

(5) Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional Random Fields as Recurrent Neural Networks. In: International Conference on Computer Vision (ICCV). 2015:16. http://arxiv.org/abs/1502.03240. Accessed January 8, 2016.