Deep Learning for image segmentation
Michael Jamroz & Matthew Opala
AGENDA
Deep Learning methods for image segmentation
Case study - clothing parsing
Segmentation in Computer Vision
Computer Vision tasks
[Figure: one image shown for three tasks (Classification, Detection, Segmentation), each labelled DRESS, HEELS, BAG]
Semantic Segmentation
◦ Annotate each pixel
◦ Doesn’t differentiate instances
◦ Classic computer vision task
Instance Aware Segmentation
◦ Detect instances
◦ Annotate each pixel
◦ Simultaneous detection and segmentation
◦ Recent challenge in MS-COCO
Traditional methods
Kota Yamaguchi, M Hadi Kiapour, Tamara L Berg, "Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items", ICCV 2013
● Multi-stage pipeline with hand-engineered image features (HOG, MR8, etc.)
● Segmentation is cast as classification of every pixel with logistic regression
Deep Learning methods for image segmentation
Convolutional neural networks
● First used successfully for classification tasks
● Three basic operations: convolution, pooling, and a nonlinearity function
Semantic segmentation with CNN
Input -> Extract Patch -> CNN -> Classify center pixel (DRESS)
Repeat for each pixel
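
The patch-based scheme above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `classify_every_pixel` and `toy_classifier` are hypothetical names, the image is a single-channel list of lists, borders are zero-padded, and the toy classifier is a stand-in for a trained CNN.

```python
def classify_every_pixel(image, classify_patch, patch=3):
    """Label each pixel by classifying the patch centred on it."""
    half = patch // 2
    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Zero padding outside the image (an assumption of this sketch).
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0

    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [[pixel(r + dr, c + dc) for dc in range(-half, half + 1)]
                      for dr in range(-half, half + 1)]
            row.append(classify_patch(window))  # one classifier call per pixel
        labels.append(row)
    return labels


def toy_classifier(window):
    """Hypothetical stand-in for a CNN: thresholds the mean patch intensity."""
    values = [v for row in window for v in row]
    return "DRESS" if sum(values) / len(values) > 0.5 else "BACKGROUND"


labels = classify_every_pixel([[1, 1, 1], [1, 1, 1], [1, 1, 1]], toy_classifier)
```

The classifier runs once per pixel and neighbouring patches overlap almost entirely, so most of the computation is repeated.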
Semantic segmentation with CNN
Input -> CNN -> Smaller output due to pooling
Fully Convolutional Neural Networks
Long, Shelhamer and Darrell, “Fully Convolutional Networks For Semantic Segmentation”, CVPR 2015
Learnable upsampling: deconvolution
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
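
The two convolution settings above (stride 1 and stride 2, both with pad 1) can be checked with a small sketch. It assumes a square single-channel input and a square filter; `conv_output_size` and `conv2d` are illustrative names, not part of any library.

```python
def conv_output_size(n, k, stride, pad):
    """Spatial output size of a convolution on an n x n input with a k x k filter."""
    return (n + 2 * pad - k) // stride + 1


def conv2d(x, kernel, stride=1, pad=1):
    """Plain convolution: each output value is a dot product of filter and input window."""
    n, k = len(x), len(kernel)
    size = n + 2 * pad
    padded = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            padded[i + pad][j + pad] = x[i][j]

    out_n = conv_output_size(n, k, stride, pad)
    out = []
    for i in range(out_n):
        row = []
        for j in range(out_n):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


ones4 = [[1.0] * 4 for _ in range(4)]
ones3 = [[1.0] * 3 for _ in range(3)]
same_size = conv2d(ones4, ones3, stride=1, pad=1)   # 4 x 4 output
downsampled = conv2d(ones4, ones3, stride=2, pad=1)  # 2 x 2 output
```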
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
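
A minimal sketch of the "deconvolution" (transposed convolution) step described above: every input value scales a copy of the filter, the copies are placed `stride` pixels apart, and overlapping contributions are summed. The function name and the `output_pad` argument are assumptions of this sketch. With a 3 x 3 filter, stride 2 and pad 1, a 2 x 2 input gives a 3 x 3 result unless one extra row and column are kept on the bottom/right, so `output_pad=1` is used here to reproduce the 4 x 4 output stated on the slide.

```python
def transposed_conv2d(x, kernel, stride=2, pad=1, output_pad=1):
    """Transposed convolution on a square input with a square filter."""
    n_in, k = len(x), len(kernel)
    full = (n_in - 1) * stride + k          # size before cropping the padding
    buf = [[0.0] * full for _ in range(full)]

    for i in range(n_in):
        for j in range(n_in):
            for a in range(k):
                for b in range(k):
                    # The input value weights the filter; overlaps accumulate.
                    buf[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]

    # Crop `pad` from the top/left and `pad - output_pad` from the bottom/right.
    end = full - (pad - output_pad)
    return [row[pad:end] for row in buf[pad:end]]


upsampled = transposed_conv2d([[1.0, 1.0], [1.0, 1.0]],
                              [[1.0] * 3 for _ in range(3)])
```

In the result, positions reached by more than one filter copy hold larger sums, which is the "sum where output overlaps" effect.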
Deconvolution Network for Semantic Segmentation
Normal VGG “Upside down” VGG
Noh, Hong and Han, “Learning Deconvolution Network for Semantic Segmentation”, arXiv 2015
Deconvolution Network: Pooling
Input
Pooled map
Switch Variables
Deconvolution Network: Unpooling
Input
Pooled map
Switch Variables
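
The pooling and unpooling slides can be illustrated with a short sketch: max pooling records where each maximum came from (the switch variables), and unpooling writes each pooled value back to that location, leaving zeros elsewhere. The function names are illustrative, and the sketch assumes non-overlapping windows and input sides divisible by the window size.

```python
def max_pool_with_switches(x, size=2):
    """Max pooling that also returns the (row, col) position of each maximum."""
    pooled, switches = [], []
    for i in range(0, len(x), size):
        pooled_row, switch_row = [], []
        for j in range(0, len(x[0]), size):
            cells = [(x[i + a][j + b], (i + a, j + b))
                     for a in range(size) for b in range(size)]
            value, position = max(cells, key=lambda cell: cell[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value at its recorded position; all other cells stay zero."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out


feature_map = [[1, 2, 0, 1],
               [3, 4, 1, 0],
               [0, 0, 5, 6],
               [1, 0, 7, 8]]
pooled, switches = max_pool_with_switches(feature_map)
restored = unpool(pooled, switches, 4, 4)
```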
DeconvNet vs. FCN
[Figure: Input, Ground truth, FCN, DeconvNet, EDeconvNet, EDeconvNet + CRF]
DeepLab: Atrous Convolution and Fully Connected CRFs
Chen, Papandreou, Kokkinos, Murphy, Yuille, “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”, ICLR 2015
● Conditional random field used as a post-processing step
Conditional Random Field
Atrous convolution
● Convolution “with holes”
● Enlarges the receptive field without adding filter parameters or reducing output resolution
Atrous convolution
● Performing convolution on downsampled input, later upsampling the result to original resolution
● Performing convolution with holes on originally-sized input
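
A one-dimensional sketch of convolution "with holes", assuming no padding and a hypothetical `rate` argument for the spacing between filter taps. The same three weights cover a wider stretch of the full-resolution input as the rate grows.

```python
def receptive_field(kernel_size, rate):
    """Input span covered by one output value of an atrous filter."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d(signal, kernel, rate=1):
    """1-D convolution whose taps are `rate` samples apart (rate=1 is ordinary)."""
    k = len(kernel)
    span = (k - 1) * rate
    return [sum(kernel[a] * signal[i + a * rate] for a in range(k))
            for i in range(len(signal) - span)]


signal = [1, 2, 3, 4, 5, 6, 7]
dense = atrous_conv1d(signal, [1, 0, -1], rate=1)   # taps at i, i+1, i+2
holes = atrous_conv1d(signal, [1, 0, -1], rate=2)   # taps at i, i+2, i+4
```

The filter is applied directly to the originally sized input, so no downsampling and later upsampling step is involved.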
Case study - clothing parsing
Clothing parsing
◦ Goal: detect and segment basic clothing categories (dresses, bags, shoes, trousers, etc.) on people
◦ We need precise clothing masks for further processing (image search, color detection)
◦ The biggest publicly available dataset contains 7.7k images
ATR Dataset
◦ Images with ground-truth labels, 7.7k examples
◦ 18 clothing categories
◦ https://github.com/lemondan/HumanParsing-Dataset
Clothing parsing with general segmentation
◦ DeepLab model based on the VGG-16 architecture
◦ Both variants: with and without CRF post-processing
◦ Fine-tuned from VGG-16 trained on the ImageNet classification challenge
◦ Images resized to 513 x 513 resolution
◦ Training details
▫ Batch size: 8
▫ 20k iterations - 10 epochs
▫ Dataset split into train/test with a train ratio of 0.9
Clothing parsing with general segmentation: results
[Figures: example results showing Input, DeepLab, DeepLab + CRF, Ground truth]
Clothing parsing with general segmentation: metrics
Bags:
model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9903    0.64       0.51    0.54      0.45
DeepLab + CRF  0.9908    0.664      0.525   0.553     0.48

Dresses:
model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9586    0.481      0.39    0.399     0.349
DeepLab + CRF  0.9558    0.506      0.436   0.438     0.397
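
For reference, the five metrics in these tables can be computed from a predicted and a ground-truth binary mask roughly as follows. This is a sketch for a single class on a single image; the slides do not state how values were aggregated over images, so `binary_mask_metrics` is an illustrative helper rather than the evaluation code behind the reported numbers.

```python
def binary_mask_metrics(pred, truth):
    """Pixel-wise accuracy, precision, recall, F1 and IoU for one class."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "iou": iou}


metrics = binary_mask_metrics([[1, 1, 0], [0, 1, 0]],
                              [[1, 0, 0], [0, 1, 1]])
```

Accuracy also counts correctly predicted background pixels, which is one reason it can stay close to 1 while IoU is much lower for a class that covers a small part of the image.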
Clothing parsing with detection and segmentation
● Detecting category with object detector like R-CNN, SSD, YOLO etc.
● Segmenting the object inside bounding box with models like DeepLab, DeepCut etc.
● Motivation: it’s much faster to gather bounding box level annotations than pixel-wise annotations
● Hypothesis: given correct bounding box it’s easier to segment clothing item than on whole image
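
The two-stage idea above can be sketched as: crop the detected box, segment only the crop, and paste the result back into a full-size mask. Everything here is hypothetical scaffolding: `segment_in_box` is an illustrative name, the box format `(top, left, bottom, right)` is an assumption, and `threshold_segmenter` stands in for a real segmentation model such as DeepLab.

```python
def segment_in_box(image, box, segment_crop):
    """Run a segmenter inside a bounding box and return a full-size mask."""
    top, left, bottom, right = box  # assumed format, bottom/right exclusive
    crop = [row[left:right] for row in image[top:bottom]]
    crop_mask = segment_crop(crop)

    mask = [[0] * len(image[0]) for _ in image]
    for i, mask_row in enumerate(crop_mask):
        for j, value in enumerate(mask_row):
            mask[top + i][left + j] = value
    return mask


def threshold_segmenter(crop):
    """Hypothetical stand-in for a segmentation network."""
    return [[1 if v > 0.5 else 0 for v in row] for row in crop]


image = [[0.9, 0.1, 0.0, 0.0],
         [0.0, 0.8, 0.9, 0.0],
         [0.0, 0.7, 0.2, 0.0],
         [0.9, 0.0, 0.0, 0.0]]
mask = segment_in_box(image, (1, 1, 3, 3), threshold_segmenter)
```

Pixels outside the box are never labelled, so bright pixels elsewhere in the image cannot leak into the mask, but a box that misses part of the item also limits the mask.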
Single Shot Multibox Detector (SSD)
Wei Liu et al., "SSD: Single Shot MultiBox Detector", 2016
Bags train/test size: 4135/360
Dresses train/test size: 11740/3990
Bags mAP: 0.93
Dresses mAP: 0.7
Clothing parsing with detection and segmentation: bags metrics
model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9903    0.64       0.51    0.54      0.45
DeepLab + CRF  0.9908    0.664      0.525   0.553     0.48
D&S            0.993     0.765      0.709   0.731     0.64
Clothing parsing with detection and segmentation: dresses metrics
model          accuracy  precision  recall  f1-score  IoU
DeepLab        0.9586    0.481      0.39    0.399     0.349
DeepLab + CRF  0.9558    0.506      0.436   0.438     0.397
D&S            0.931     0.416      0.409   0.407     0.378
Visualisations of Detection & Segmentation approach
What have we used?
◦ Caffe & Python
◦ https://github.com/weiliu89/caffe/tree/ssd
◦ https://bitbucket.org/aquariusjay/deeplab-public-ver2
Thanks!
Q&A
You can contact us at:
michaljamroz@craftinity.com
mateuszopala@craftinity.com