#6 PyData Warsaw: Deep learning for image segmentation

Deep Learning for image segmentation

Michael Jamroz & Matthew Opala

AGENDA

Deep Learning methods for image segmentation

Case study - clothing parsing

Segmentation in Computer Vision

1. Segmentation in Computer Vision

Computer Vision tasks

[Figure: example photo annotated for three tasks (Classification, Detection, Segmentation) with the labels DRESS, HEELS, BAG]

Semantic Segmentation

◦ Annotate each pixel

◦ Doesn’t differentiate instances

◦ Classic computer vision task

Instance Aware Segmentation

◦ Detect instances

◦ Annotate each pixel

◦ Simultaneous detection and segmentation

◦ Recent challenge in MS-COCO

Traditional methods

Kota Yamaguchi, M Hadi Kiapour, Tamara L Berg, "Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items", ICCV 2013

● Multi-stage pipeline with image features engineered by hand (HoGs, MR8 etc.)

● Segmentation -> classification of every pixel with logistic regression

2. Deep Learning methods for image segmentation

Convolutional neural networks

● First used successfully for the classification task

● Three basic operations: convolution, pooling, nonlinearity function

Semantic segmentation with CNN

Input -> Extract patch -> CNN -> Classify center pixel (e.g. DRESS)

Repeat for each pixel
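The patch-wise procedure above can be sketched in a few lines. This is a minimal illustration, not the speakers' code: `classify_patch` is a hypothetical stand-in for a trained CNN (here a trivial brightness rule), and the patch size and reflect padding are assumptions.

```python
import numpy as np

def classify_patch(patch):
    """Hypothetical stand-in for a CNN: returns 1 if the center pixel
    is brighter than the patch mean, otherwise 0."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return int(center > patch.mean())

def segment_by_patches(image, patch_size=5, classifier=classify_patch):
    """Label every pixel by classifying the patch centered on it."""
    assert patch_size % 2 == 1, "patch size must be odd to have a center pixel"
    half = patch_size // 2
    # Pad the image so that border pixels also get a full patch.
    padded = np.pad(image, half, mode="reflect")
    height, width = image.shape
    labels = np.zeros((height, width), dtype=np.int64)
    for row in range(height):
        for col in range(width):
            patch = padded[row:row + patch_size, col:col + patch_size]
            labels[row, col] = classifier(patch)
    return labels

# Toy input: a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0
labels = segment_by_patches(image)
```

One classifier call per pixel is what makes this approach slow on real images: neighbouring patches overlap almost entirely, so most of the computation is repeated.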

Semantic segmentation with CNN

Input -> CNN -> Smaller output due to pooling
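How much smaller the output gets can be estimated with the standard output-size formula for convolution and pooling layers. The sketch below assumes a simplified VGG-16-like stack (five stages, each reduced to one 3 x 3 convolution with stride 1 and pad 1 followed by 2 x 2 pooling with stride 2) and a 224 x 224 input; the helper name `output_size` is illustrative.

```python
def output_size(input_size, kernel, stride=1, pad=0):
    """Spatial size after a convolution or pooling layer."""
    return (input_size + 2 * pad - kernel) // stride + 1

size = 224  # assumed input resolution
for _ in range(5):
    size = output_size(size, kernel=3, stride=1, pad=1)  # 3 x 3 conv keeps the size
    size = output_size(size, kernel=2, stride=2)         # 2 x 2 pooling halves it

print(size)  # 7, i.e. 32 times smaller than the input along each side
```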

Fully Convolutional Neural Networks

Long, Shelhamer and Darrell, “Fully Convolutional Networks For Semantic Segmentation”, CVPR 2015

Fully Convolutional Neural Networks

Learnable upsampling: deconvolution

Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

3 x 3 “deconvolution”, stride 2 pad 1

Input gives weight for filter

Sum where output overlaps
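The two steps above (the input value weights a copy of the filter, and overlapping copies are summed) can be written out directly. A minimal NumPy sketch, assuming a single channel and a square filter; the function name `transposed_conv2d` and the toy values are illustrative, and real frameworks implement the same operation far more efficiently.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2, pad=1):
    """Single-channel "deconvolution" (transposed convolution)."""
    k = kernel.shape[0]  # assumes a square k x k filter
    in_h, in_w = x.shape
    full_h = (in_h - 1) * stride + k
    full_w = (in_w - 1) * stride + k
    full = np.zeros((full_h, full_w))
    for i in range(in_h):
        for j in range(in_w):
            # Each input value scales a copy of the filter;
            # copies that overlap in the output are summed.
            full[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    # Cropping `pad` pixels per side mirrors the padding of the forward convolution.
    return full[pad:full_h - pad, pad:full_w - pad]

x = np.ones((2, 2))
kernel = np.ones((3, 3))
out = transposed_conv2d(x, kernel, stride=2, pad=1)
print(out)
```

With this definition the output side length is (input - 1) * stride - 2 * pad + kernel, so the 2 x 2 toy input yields a 3 x 3 output; frameworks usually expose an extra output-padding option to reach exact multiples of the input size.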

Deconvolution Network for Semantic Segmentation

Normal VGG “Upside down” VGG

Noh, Hong and Han, “Learning Deconvolution Network for Semantic Segmentation”, arXiv 2015

Deconvolution Network: Pooling

Input

Pooled map

Switch Variables

Deconvolution Network: Unpooling

Input

Pooled map

Switch Variables
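Pooling with switch variables and the matching unpooling step can be sketched as follows. This is an illustrative single-channel NumPy version with non-overlapping windows; the function names are hypothetical and not taken from the DeconvNet code.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Max pooling that also records where each maximum came from."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0
    out_h, out_w = h // size, w // size
    pooled = np.zeros((out_h, out_w), dtype=x.dtype)
    switches = np.zeros((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            flat_index = int(np.argmax(window))  # position inside the window
            switches[i, j] = flat_index
            pooled[i, j] = window.flat[flat_index]
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out_h, out_w = pooled.shape
    restored = np.zeros((out_h * size, out_w * size), dtype=pooled.dtype)
    for i in range(out_h):
        for j in range(out_w):
            di, dj = divmod(int(switches[i, j]), size)
            restored[i * size + di, j * size + dj] = pooled[i, j]
    return restored

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 0, 5, 6],
              [1, 0, 7, 8]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
```

The restored map is sparse: only the positions of the original maxima are filled, which is why unpooling is followed by learned deconvolution layers that densify it.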

DeconvNet vs. FCN

[Figure: qualitative comparison. Panels: Input, Ground truth, FCN, DeconvNet, EDeconvNet, EDeconvNet + CRF]

DeepLab: Atrous Convolution and Fully Connected CRFs

Chen, Papandreou, Kokkinos, Murphy, Yuille, “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”, ICLR 2015

● Conditional random field used as a post-processing step

Conditional Random Field

Atrous convolution

● Convolution “with holes”

● Enlarges the receptive field without adding parameters or reducing the resolution of the feature map

Atrous convolution

● Performing convolution on downsampled input, later upsampling the result to original resolution

● Performing convolution with holes on originally-sized input
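Convolution with holes can be illustrated by sampling the input with gaps between the filter taps. A minimal single-channel NumPy sketch, assuming an odd square filter, zero padding and no filter flip (cross-correlation, as is usual in deep learning frameworks); the name `atrous_conv2d` is illustrative.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    """Same-size convolution whose filter taps are `rate` pixels apart."""
    k = kernel.shape[0]          # assumes an odd, square k x k filter
    span = (k - 1) * rate + 1    # effective extent of the filter on the input
    pad = span // 2
    padded = np.pad(x, pad, mode="constant")
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            # Strided slicing skips rate - 1 pixels between taps (the "holes").
            window = padded[i:i + span:rate, j:j + span:rate]
            out[i, j] = np.sum(window * kernel)
    return out

# An impulse input shows which positions the filter reaches.
impulse = np.zeros((7, 7))
impulse[3, 3] = 1.0
dense = atrous_conv2d(impulse, np.ones((3, 3)), rate=1)
holes = atrous_conv2d(impulse, np.ones((3, 3)), rate=2)
```

Both calls use the same nine weights, but with `rate=2` the filter covers a 5 x 5 area of the full-resolution input instead of 3 x 3, so no downsampling and upsampling round trip is needed.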

3. Case study - clothing parsing

Clothing parsing

◦ Goal: detect and segment basic clothing categories (dresses, bags, shoes, trousers, etc.) worn by people

◦ We need precise clothing masks for further processing (image search, color detection)

◦ The biggest publicly available dataset contains 7.7k images

ATR Dataset

◦ Images with ground-truth labels, 7.7k examples

◦ 18 clothing categories

◦ https://github.com/lemondan/HumanParsing-Dataset

ATR Dataset

Clothing parsing with general segmentation

◦ DeepLab model based on the VGG-16 architecture

◦ Both variants: with and without CRF post-processing

◦ Fine-tuning from VGG-16 trained on the ImageNet classification challenge

◦ Images resized to 513 x 513 resolution

◦ Training details

▫ Batch size: 8

▫ 20k iterations (10 epochs)

▫ Dataset split into train/test with ratio 0.9 (90% for training)

Clothing parsing with general segmentation: results

[Figure panels: Input, DeepLab, DeepLab + CRF, Ground truth]

Clothing parsing with general segmentation: results

[Figure panels: Input, DeepLab, DeepLab + CRF, Ground truth]

Clothing parsing with general segmentation: metrics

Bags:

model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9903    0.64       0.51    0.54      0.45
DeepLab + CRF  0.9908    0.664      0.525   0.553     0.48

Dresses:

model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9586    0.481      0.39    0.399     0.349
DeepLab + CRF  0.9558    0.506      0.436   0.438     0.397
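For reference, the five reported metrics can all be derived from pixel counts of a predicted mask against a ground-truth mask for one category. The sketch below is an assumption about how such numbers are typically computed; the slides do not say whether the values were averaged per image or over the whole test set, and the function name is illustrative.

```python
import numpy as np

def binary_mask_metrics(predicted, target):
    """Pixel-wise metrics for one category (True = pixel belongs to it)."""
    predicted = np.asarray(predicted, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(predicted & target))    # category pixels found
    fp = int(np.sum(predicted & ~target))   # background labelled as category
    fn = int(np.sum(~predicted & target))   # category pixels missed
    tn = int(np.sum(~predicted & ~target))  # background correctly ignored

    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1-score": safe_div(2 * precision * recall, precision + recall),
        "IoU": safe_div(tp, tp + fp + fn),
    }

target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]])
predicted = np.array([[1, 0, 0, 0],
                      [1, 1, 1, 0]])
metrics = binary_mask_metrics(predicted, target)
print(metrics)
```

Accuracy counts the correctly ignored background, while IoU does not. When an item such as a bag covers only a small part of the image, this may explain accuracy values near 0.99 next to much lower IoU.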

Clothing parsing with detection and segmentation

● Detecting category with object detector like R-CNN, SSD, YOLO etc.

● Segmenting the object inside bounding box with models like DeepLab, DeepCut etc.

● Motivation: it’s much faster to gather bounding box level annotations than pixel-wise annotations

● Hypothesis: given a correct bounding box, it’s easier to segment a clothing item inside the box than on the whole image
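The two-stage idea can be sketched as glue code around two models. Everything model-specific here is hypothetical: `detector` stands in for an object detector such as SSD and is assumed to return `(label, x0, y0, x1, y1)` tuples in pixel coordinates, and `segmenter` stands in for a model such as DeepLab and is assumed to return a boolean mask of the same size as the crop.

```python
import numpy as np

def detect_and_segment(image, detector, segmenter):
    """Segment each detected box separately and paste the masks back."""
    height, width = image.shape[:2]
    label_map = np.zeros((height, width), dtype=np.int64)  # 0 = background
    for label, x0, y0, x1, y1 in detector(image):
        # Clip the box to the image and skip empty boxes.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue
        crop = image[y0:y1, x0:x1]
        mask = np.asarray(segmenter(crop), dtype=bool)
        # Basic slicing returns a view, so this writes into label_map.
        region = label_map[y0:y1, x0:x1]
        region[mask] = label
    return label_map

# Toy stand-ins: one fixed box, and "segmentation" by thresholding brightness.
def toy_detector(image):
    return [(1, 2, 1, 6, 5)]  # label 1, box x in [2, 6), y in [1, 5)

def toy_segmenter(crop):
    return crop > 0.5

image = np.zeros((8, 8))
image[2:4, 3:5] = 1.0   # object inside the box
image[7, 7] = 1.0       # bright pixel outside any box
label_map = detect_and_segment(image, toy_detector, toy_segmenter)
```

Pixels outside every detected box stay background, so a missed detection cannot be recovered by the segmentation stage; the quality of the detector bounds the quality of the final masks.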

Single Shot Multibox Detector (SSD)

Wei Liu et al., "SSD: Single Shot MultiBox Detector", 2016

Bags train/test size: 4135/360

Dresses train/test size: 11740/3990

Bags mAP: 0.93

Dresses mAP: 0.7


Clothing parsing with detection and segmentation: bags metrics

model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9903    0.64       0.51    0.54      0.45
DeepLab + CRF  0.9908    0.664      0.525   0.553     0.48
D&S            0.993     0.765      0.709   0.731     0.64

Clothing parsing with detection and segmentation: dresses metrics

model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9586    0.481      0.39    0.399     0.349
DeepLab + CRF  0.9558    0.506      0.436   0.438     0.397
D&S            0.931     0.416      0.409   0.407     0.378

Visualisations of Detection & Segmentation approach



What have we used?

◦ Caffe & Python

◦ https://github.com/weiliu89/caffe/tree/ssd

◦ https://bitbucket.org/aquariusjay/deeplab-public-ver2

Thanks!

Q&A

You can contact us at:

[email protected]

[email protected]
