Deep Learning Lab: Computer Vision
Christian Zimmermann and Silvio Galesso
Source: ais.informatik.uni-freiburg.de/teaching/ss19/deep_learning_lab/presentation_lectureCV.pdf


1 / 40

Deep Learning Lab: Computer Vision

Christian Zimmermann and Silvio Galesso

2 / 40

Outline

● Introduction to Human Pose Estimation (HPE)
● Single HPE
● Multi HPE

– Top down
– Bottom up

● Semantic Segmentation
● Exercise

3 / 40

Intro: What is HPE?

● Given a single color image, infer the body pose:

[Figure: input image → HPE → estimated 2D pose]

4 / 40

Intro: What is HPE?

● Given a single color image, infer the body pose:

● 2D pose P is defined as:


$$P = \begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix} \in \mathbb{R}^{n \times 2}$$

5 / 40

$$P = \begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix} \in \mathbb{R}^{n \times 2}$$

Intro: What is HPE?

● Given a single color image, infer the body pose:

● 2D pose P is defined as:


(each $p_i$ names a keypoint, e.g. “nose”)

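In code, such a pose is just an n × 2 array. A minimal numpy sketch (the keypoint names and coordinate values are made up for illustration):

```python
import numpy as np

# A 2D pose with n keypoints, stored as an (n, 2) array of (x, y) pixels.
# The keypoint order (nose, left shoulder, right shoulder) is illustrative.
P = np.array([
    [128.0,  64.0],   # p_1: "nose"
    [100.0, 120.0],   # p_2: "left shoulder"
    [156.0, 121.0],   # p_3: "right shoulder"
])

n = P.shape[0]        # number of keypoints
assert P.shape == (n, 2)
```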

6 / 40

Intro: Why HPE?

● Human machine interaction:

– Autonomous driving: infer people, their heading direction, and their intentions

From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019

7 / 40

Intro: Why HPE?

● Human machine interaction:

– Autonomous driving: infer people, their heading direction, and their intentions

– Pose based gaming: Microsoft Xbox Kinect

From: xbox.com

8 / 40

Intro: Why HPE?

● Human machine interaction:

– Autonomous driving: infer people, their heading direction, and their intentions

– Pose based gaming: Microsoft Xbox Kinect

– In robotics: learning from demonstration

From: Zimmermann et al., 3D Human Pose Estimation in RGBD Images for Robotic Task Learning, ICRA 2018

9 / 40

Intro: Why HPE?

● Human machine interaction
● Quantify movement:

– Sport action analysis
● Track players during sports

– Medicine
● What is the stage of the ALS disease?

10 / 40

Intro: What makes it hard?

● Large variation in appearance
● Ambiguities

11 / 40

Intro: What makes it hard?

● Large variation in appearance
● Ambiguities
● Occlusions
● Crowding

12 / 40

Single HPE: Regression

● Directly regress Cartesian image coordinates
● Network outputs one vector of coordinates

From: Toshev et al., DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR 2014

$$L = \sum_i \left\lVert P_i - \hat{P}_i \right\rVert_2^2$$

($P_i$: ground truth, $\hat{P}_i$: prediction)
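A minimal numpy sketch of this regression loss (the coordinate values are made up; a real training loop would compute it inside a deep learning framework so it can be backpropagated):

```python
import numpy as np

def regression_loss(P_pred, P_gt):
    """Sum of squared Euclidean distances between predicted and
    ground-truth keypoints: L = sum_i ||P_i - P_hat_i||_2^2."""
    return np.sum(np.sum((P_gt - P_pred) ** 2, axis=1))

# Two keypoints; the first prediction is off by the (3, 4) vector.
P_gt   = np.array([[10.0, 20.0], [30.0, 40.0]])
P_pred = np.array([[13.0, 24.0], [30.0, 40.0]])
loss = regression_loss(P_pred, P_gt)   # 3^2 + 4^2 = 25
```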

13 / 40

Single HPE: Scoremap

● Network estimates per pixel keypoint likelihood

● For each keypoint there is one map

● Ground truth maps are created from point annotations

From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016

$I \in \mathbb{R}^{N \times N \times 3}$ (input image), $S \in \mathbb{R}^{M \times M \times K}$ (output scoremaps)

14 / 40

Single HPE: Scoremap

● Network estimates per pixel keypoint likelihood

● For each keypoint there is one map

● Ground truth maps are created from point annotations

$I \in \mathbb{R}^{N \times N \times 3}$ (input image), $S \in \mathbb{R}^{M \times M \times K}$ (output scoremaps)

From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016

$$L = \sum_{i=1}^{K} \left\lVert S_i - \hat{S}_i \right\rVert_2^2$$

($S_i$: ground truth, $\hat{S}_i$: prediction)

15 / 40

Single HPE: Scoremap

● Network estimates per pixel keypoint likelihood

● For each keypoint there is one map

● Ground truth maps are created from point annotations

$I \in \mathbb{R}^{N \times N \times 3}$ (input image), $S \in \mathbb{R}^{M \times M \times K}$ (output scoremaps)

From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016

$$S_i(\tilde{p}) = \exp\left(-\frac{\lVert \tilde{p} - p_i \rVert^2}{\sigma^2}\right), \qquad L = \sum_{i=1}^{K} \left\lVert S_i - \hat{S}_i \right\rVert_2^2$$

16 / 40

Single HPE: Softargmax

● Heat maps are learned implicitly

● Softmax squashes the map into a probability distribution

● Elementwise multiplication with coordinate maps and summation reduces the distribution to the predicted coordinate

From: Luvizon, 2d/3d pose estimation and action recognition using multitask deep learning, CVPR 2018
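The reduction above can be sketched in numpy: a 2D softmax over the map, followed by an expectation over coordinate grids (the map contents are made up):

```python
import numpy as np

def soft_argmax(scoremap):
    """Soft-argmax: softmax squashes the map into a probability
    distribution, then the expected (x, y) coordinate is returned.
    Differentiable, unlike a hard argmax."""
    h, w = scoremap.shape
    probs = np.exp(scoremap - scoremap.max())   # stable softmax
    probs /= probs.sum()
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return np.array([np.sum(probs * xs), np.sum(probs * ys)])

m = np.zeros((32, 32))
m[10, 20] = 50.0          # sharp peak at x=20, y=10
coord = soft_argmax(m)    # close to (20, 10)
```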

17 / 40

Multi HPE

18 / 40

Multi HPE

● Top-Down

– Detects persons first

– Estimates pose for each person independently

– Suffers from early commitment

– Runtime scales linearly with the number of people

– Struggles in crowded scenes

● Bottom-Up

– Detects keypoints first

– Subsequently groups keypoints into individuals

– Keypoint detection is harder because less prior knowledge is available

– Exhaustive grouping is an NP-hard problem

19 / 40

Top-down approach

● Example for top-down: Mask R-CNN

– First part of the network detects bounding boxes

– Then pool features from each bounding box and apply a sub-network (‘head’) on them

– There are heads for classification, segmentation, and pose estimation

From: He et al., Mask R-CNN, ICCV 2017

20 / 40

Bottom-up approach

● Single network that estimates two entities:

– Keypoint locations (Part Intensity Fields, also called scoremaps) → gives joint estimates

– Association scores between keypoints that should form a limb (Part Affinity Fields) → enables grouping

From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019

21 / 40

Bottom-up approach

● Part Intensity Fields

– Likelihood at each location that the keypoint is present

– One scoremap per keypoint is needed

From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019

22 / 40

Bottom-up approach

● Part Affinity Fields

– Vector field pointing in the direction from the ‘start’ to the ‘end’ of a limb

– Two maps per limb (one for each vector component)

From: Cao et al., Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, CVPR 2017

23 / 40

Bottom-up approach

● Grouping keypoint candidates into instances

– Finding the optimal parse from the detected keypoint candidates is NP-hard.

– Therefore, relax the complete matching to greedy bipartite graph matching, i.e. match one limb at a time

● Practical implementation:

– Start with the most confident keypoint locations

– Greedily grow the person instance using the PAF-based score
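The PAF-based limb score used during greedy grouping can be sketched as follows: average the alignment between the PAF vectors sampled along a candidate limb and the limb's direction. This is a simplified version of the weighted line integral in Cao et al.; the function and its sampling scheme are illustrative, not the paper's exact implementation:

```python
import numpy as np

def paf_limb_score(paf_x, paf_y, start, end, num_samples=10):
    """Score a candidate limb (start -> end, both (x, y)) by averaging the
    dot product between PAF vectors sampled along the segment and the
    unit vector pointing from start to end."""
    start, end = np.asarray(start, float), np.asarray(end, float)
    v = end - start
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    u = v / norm                             # limb direction (unit vector)
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (start + t * v).round().astype(int)
        score += paf_x[y, x] * u[0] + paf_y[y, x] * u[1]
    return score / num_samples

# Toy PAF pointing everywhere in +x direction:
paf_x, paf_y = np.ones((20, 20)), np.zeros((20, 20))
s_fwd = paf_limb_score(paf_x, paf_y, (2, 5), (12, 5))   # aligned limb
s_bwd = paf_limb_score(paf_x, paf_y, (12, 5), (2, 5))   # opposed limb
```

An aligned candidate scores +1, the reversed one −1, which is what lets the greedy matcher prefer correct start/end pairings.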

24 / 40

Evaluation

● Common measure: Mean per joint position error (MPJPE)

$$\mathrm{MPJPE} = \frac{1}{|V|} \sum_{i \in V} \left\lVert p_i - \hat{p}_i \right\rVert_2$$

($p_i$: ground truth, $\hat{p}_i$: prediction, $V$: set of visible keypoints)
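MPJPE can be sketched directly from the formula (the coordinates and visibility flags are made up):

```python
import numpy as np

def mpjpe(P_pred, P_gt, visible):
    """Mean per joint position error over the set V of visible keypoints."""
    visible = np.asarray(visible, bool)
    dists = np.linalg.norm(P_gt[visible] - P_pred[visible], axis=1)
    return dists.mean()

P_gt   = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
P_pred = np.array([[3.0, 4.0], [10.0, 0.0], [99.0, 99.0]])
# Third keypoint is invisible, so its (large) error is ignored:
err = mpjpe(P_pred, P_gt, visible=[True, True, False])   # (5 + 0) / 2 = 2.5
```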

25 / 40

Semantic Segmentation

26 / 40

Image Segmentation

Definition: partitioning the image into coherent regions/subsets of pixels

[Figure: input image and its segmentation mask]

Source: COCO dataset, http://cocodataset.org/#explore

27 / 40

Segmentation Tasks

● Binary segmentation
– Assign a class to each pixel
– 2 classes: foreground/background

● Semantic segmentation
– Assign a class to each pixel
– Multiple classes with semantic meaning: person, dog, sheep, pig, background, ...

● Instance segmentation
– Predict a segmentation mask for foreground objects
– Instance specific (usually coupled with detection)

Sources: Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015; He et al., Mask R-CNN, ICCV 2017

28 / 40

Segmentation with CNNs

Encoder-Decoder architecture:

Input image → Encoder → Bottleneck → Decoder → Prediction: segmentation mask

29 / 40

Encoder Network

Built from conv. layers (+ReLU) and pooling layers.

Input:
● Original resolution
● Low-level representation (RGB)

Output/bottleneck:
● Low resolution
● Large receptive field
● High-level feature representation (high number of channels)

30 / 40

Decoder Network

Built from transposed conv. layers (+ReLU) and upsampling layers.

● Uses the feature representation to solve the task

● Increases resolution via upsampling operations and/or transposed convolutions

31 / 40

Transposed Convolutions

● Also known as upconvolutions or deconvolutions
● They map the input to a higher resolution output
● Can be seen as “learned upsampling” operations

“Transposed” because:

[Figure: a convolutional filter and its equivalent convolutional matrix]

32 / 40

Transposed Convolutions

● Also known as upconvolutions or (wrongly) deconvolutions
● They map the input to a higher resolution output
● Can be seen as “learned upsampling” operations

“Transposed” because a transposed convolution can be computed using the transposed matrix of some convolution.
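This correspondence can be made concrete in numpy for the 1-D case: build the dense matrix C of a 'valid' convolution, which reduces resolution, then apply its transpose to go back up in resolution (the kernel and input are arbitrary example values):

```python
import numpy as np

def conv_matrix(kernel, in_len):
    """Dense matrix C such that C @ x equals a 'valid' 1-D convolution
    (cross-correlation) of x with the kernel."""
    k = len(kernel)
    out_len = in_len - k + 1
    C = np.zeros((out_len, in_len))
    for i in range(out_len):
        C[i, i:i + k] = kernel   # kernel slides one step per output row
    return C

kernel = np.array([1.0, 2.0, 3.0])
C = conv_matrix(kernel, in_len=5)   # shape (3, 5): maps length 5 -> 3
x = np.arange(5.0)
y = C @ x                           # convolution output, length 3
up = C.T @ y                        # transposed convolution: length 3 -> 5
```

A learned transposed convolution layer does the same thing, except the kernel weights are trained rather than fixed.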

33 / 40

Skip Connections

● “Shortcuts” from encoder activations to corresponding decoder stages
● Preserve high-resolution information, useful for refinement
● Improve the sharpness of the output

34 / 40

Skip Connections

[Figure: a 64 × 64 encoder feature map is concatenated with the corresponding 64 × 64 decoder feature map before being passed to the rest of the network]
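The concatenation itself is a one-liner along the channel axis; a numpy sketch with hypothetical channel counts:

```python
import numpy as np

# Skip connection by channel-wise concatenation: encoder features and
# decoder features at the same 64 x 64 resolution are stacked along the
# channel axis before the next layer. Channel counts are made up.
enc = np.random.rand(64, 64, 48)   # encoder activations, 48 channels
dec = np.random.rand(64, 64, 48)   # decoder activations, 48 channels
merged = np.concatenate([enc, dec], axis=-1)   # shape (64, 64, 96)
```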

35 / 40

Example: U-Net

Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015

36 / 40

Example: ECRU

Source: Robail Yasrab, “ECRU: An Encoder-Decoder Based Convolution Neural Network (CNN) for Road-Scene Understanding”, Journal of Imaging 2018

37 / 40

Training for Segmentation

[Figure: the last decoder layer is followed by a convolution with sigmoid activation, producing a mask prediction that is compared to the ground-truth mask via a cross entropy loss]
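For the binary case, the pixel-wise cross entropy between the sigmoid mask prediction and the ground-truth mask can be sketched as follows (the example mask values are made up):

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross entropy between a sigmoid mask
    prediction in (0, 1) and a binary ground-truth mask."""
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

target = np.array([[1.0, 0.0], [1.0, 0.0]])
good = binary_cross_entropy(np.array([[0.9, 0.1], [0.9, 0.1]]), target)
bad  = binary_cross_entropy(np.array([[0.1, 0.9], [0.1, 0.9]]), target)
# A confident correct mask yields a much smaller loss than a wrong one.
```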

38 / 40

Evaluating Segmentation

Common evaluation metric for detection and segmentation: Intersection over Union (IoU)

E.g. in object detection:

$$\mathrm{IoU} = \frac{|\text{prediction} \cap \text{ground truth}|}{|\text{prediction} \cup \text{ground truth}|}$$
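For segmentation masks, IoU can be computed directly on binary arrays (the example boxes are arbitrary):

```python
import numpy as np

def iou(mask_pred, mask_gt):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(mask_pred, mask_gt).sum()
    union = np.logical_or(mask_pred, mask_gt).sum()
    return inter / union if union > 0 else 0.0

a = np.zeros((10, 10), bool); a[2:6, 2:6] = True   # 4x4 box
b = np.zeros((10, 10), bool); b[4:8, 4:8] = True   # overlapping 4x4 box
score = iou(a, b)   # intersection 4, union 28 -> 1/7
```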

39 / 40

A Few References

● Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015

● Robail Yasrab, “ECRU: An Encoder-Decoder Based Convolution Neural Network (CNN) for Road-Scene Understanding”, Journal of Imaging 2018

● He et al., “Mask R-CNN”, ICCV 2017

● He et al., “Deep Residual Learning for Image Recognition”, CVPR 2016

● Dumoulin et al., “A guide to convolution arithmetic for deep learning”, arXiv:1603.07285

40 / 40

Exercise

● Set up and train a model for pose estimation using the direct scalar regression approach.

● Implement the Softargmax loss and use it to train a second model. Compare it with the previous approach.

● Implement different encoder-decoder networks for segmentation and compare their performance.