Presented by Shashank Shastry - cvrr.ucsd.edu/ece285sp18/files/shashank_shastry_ece285.pdf

Presented by Shashank Shastry

OBJECTIVE

1. Semantic image segmentation in street scenes

2. Current state-of-the-art approaches have good recognition performance but lack localization accuracy

3. To achieve high-quality semantic segmentation with precise boundary adherence

Semantic Image Segmentation

Semantic segmentation - Assigning each image pixel one label from a predefined set of classes
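As a minimal illustration of this definition, the sketch below turns per-pixel class scores into a label map by picking the highest-scoring class at each pixel. The class list and the score values are made up for the example; in practice a network would produce the scores.

```python
# Hypothetical class list and scores, for illustration only.
CLASSES = ["road", "sidewalk", "car"]

def label_map(scores):
    """scores[y][x] holds one score per class; returns the winning class index per pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]

scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.6, 0.1, 0.3], [0.1, 0.1, 0.8]],
]
labels = label_map(scores)
names = [[CLASSES[c] for c in row] for row in labels]
```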

Semantic segmentation in intelligent vehicles

1. Important tool for modeling the complex relationships of the semantic entities usually found in street scenes, such as cars, pedestrians, road, or sidewalks

2. Pre-processing step to discard image regions that are unlikely to contain objects of interest

3. To improve object detection

4. To obtain 3-D scene geometry in multiple view scenarios

RELATED WORK

1. Current state-of-the-art approaches use fully convolutional networks, generally built from nets pre-trained for image classification

2. Pooling operations in such nets significantly deteriorate localization performance when applied to semantic image segmentation

3. Approaches to overcome this problem – Dilated convolution, Multi-scale architectures, Post-processing.
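To make the first of these approaches concrete: a dilated convolution spaces out the kernel taps, so the receptive field grows without any pooling step. The 1-D sketch below is only an illustration; the function name, the toy signal and the edge-detector kernel are assumptions, not taken from the presented work.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D cross-correlation with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1  # receptive field of one output sample
    return [
        sum(k * signal[start + j * dilation] for j, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]

signal = [0, 0, 0, 1, 1, 1, 1, 1]
kernel = [-1, 0, 1]  # simple edge detector

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

With the same three weights, the dilated version covers a wider context per output sample, which is the property these approaches exploit instead of pooling.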

C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning Hierarchical Features for Scene Labeling. PAMI, 35(8), 2013

Performs pixel-wise classification using CNN features originating from multiple scales, followed by aggregation of these noisy pixel predictions

J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015

Reformulated the popular VGG architecture as a fully convolutional network (FCN), enabling the use of pretrained models for this architecture. To improve segmentation performance at object boundaries, skip connections were added which allow information to propagate directly from early, high-resolution layers to deeper layers.

METHODOLOGY

1. Deepening traditional feedforward networks often results in an increased training loss

2. ResNets have superior training properties over traditional feedforward networks due to improved gradient flow within the network. Gradient flow is improved because the gradient contains both a depth-dependent term and a depth-independent term, so part of the signal reaches every unit regardless of how many layers lie in between.

3. Full-resolution residual networks (FRRNs) exhibit the same superior training properties as ResNets but have two processing streams. The features on one stream, the residual stream, are computed by adding successive residuals, while the features on the other stream, the pooling stream, are the direct result of a sequence of convolution and pooling operations applied to the input.

4. Design is motivated by the need to have networks that can jointly compute good high-level features for recognition and good low-level features for localization.
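The gradient argument in item 2 and the two-stream structure in item 3 can be written out as follows. The notation is an assumption made here for illustration and follows the usual ResNet presentation: x_n is the output of the n-th residual unit, F its residual branch with parameters W_n, l the loss, z_n and y_n the residual and pooling streams, and H, G the functions that update them.

```latex
% Residual unit and its unrolled form for any deeper unit m > n
x_n = x_{n-1} + \mathcal{F}(x_{n-1}; W_n), \qquad
x_m = x_n + \sum_{i=n}^{m-1} \mathcal{F}(x_i; W_{i+1})

% Gradient of the loss with respect to the parameters of unit n
\frac{\partial l}{\partial W_n}
  = \frac{\partial x_n}{\partial W_n}
    \left(
      \underbrace{\frac{\partial l}{\partial x_m}}_{\text{independent of depth}}
      + \underbrace{\frac{\partial l}{\partial x_m}
        \sum_{i=n}^{m-1} \frac{\partial \mathcal{F}(x_i; W_{i+1})}{\partial x_n}}_{\text{depends on depth } m-n}
    \right)

% Two-stream unit: residual stream z (full resolution), pooling stream y
z_n = z_{n-1} + \mathcal{H}(y_{n-1}, z_{n-1}; W_n), \qquad
y_n = \mathcal{G}(y_{n-1}, z_{n-1}; W_n)
```

Because z_n is updated additively, the same unrolling applies to the residual stream, which is why the two-stream design keeps the depth-independent gradient term.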

Encoder/decoder formulation.

In the encoder, we reduce the size of the pooling stream using max pooling operations.

The pooled feature maps are then successively upscaled using bilinear interpolation in the decoder
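A small sketch of what pooling followed by interpolation does to a sharp boundary. This is a 1-D simplification with hypothetical function names; an actual network applies 2-D max pooling and bilinear interpolation to feature maps, with convolutions in between.

```python
def max_pool(signal, size=2):
    """Non-overlapping max pooling; halves the length for size=2."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

def upsample_linear(signal, factor=2):
    """Linear interpolation (1-D analogue of bilinear) to `factor` times the length."""
    n_in = len(signal)
    out = []
    for j in range(n_in * factor):
        # Map the centre of output sample j back into input coordinates.
        pos = (j + 0.5) / factor - 0.5
        pos = min(max(pos, 0.0), n_in - 1.0)  # clamp at the borders
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out

edge = [0, 0, 0, 0, 0, 1, 1, 1]            # sharp boundary between index 4 and 5
encoded = max_pool(max_pool(edge))          # two pooling stages: length 8 -> 2
decoded = upsample_linear(upsample_linear(encoded))  # back to length 8
```

The decoded signal has the original length, but the step has become a ramp, so the exact boundary position is no longer recoverable from the pooled path alone. This is the localization loss that a stream kept at full resolution is meant to avoid.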

TRAINING DETAILS

Loss function – Bootstrapped cross entropy

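$$l = -\frac{1}{K} \sum_{i=1}^{N} \mathbb{1}\left[p_{i,y_i} < t_K\right] \log p_{i,y_i}$$

where $\mathbb{1}[x] = 1$ iff $x$ is true, $p_{i,y_i}$ is the predicted probability of the true class $y_i$ at pixel $i$, and $t_K \in \mathbb{R}$ is chosen such that $|\{i \in \{1,\dots,N\} : p_{i,y_i} < t_K\}| = K$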

Basically, the K worst errors are backpropagated
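A minimal sketch of this loss, assuming the predictions are already softmax probabilities stored as plain lists. The function name and the toy numbers are illustrative only; instead of computing the threshold explicitly, the sketch sorts the true-class probabilities and keeps the K smallest, which selects the same pixels when there are no ties.

```python
import math

def bootstrapped_cross_entropy(probs, targets, k):
    """Average the cross-entropy over the k pixels with the lowest true-class probability.

    probs[i][c] is the predicted probability of class c at pixel i,
    targets[i] is the ground-truth class index of pixel i.
    """
    true_probs = [p[y] for p, y in zip(probs, targets)]
    hardest = sorted(true_probs)[:k]  # the k worst-predicted pixels
    return -sum(math.log(p) for p in hardest) / k

probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
targets = [0, 0, 0, 1]
loss = bootstrapped_cross_entropy(probs, targets, k=2)
```

Setting k to the number of pixels recovers the ordinary mean cross-entropy; smaller values of k focus the gradient on the hardest pixels, which tend to lie near object boundaries.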

Data augmentation - translation augmentation and gamma augmentation

Translation augmentation randomly translates an image and its annotations

Gamma augmentation is an augmentation method that varies the image contrast and brightness
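Both augmentations can be sketched in a few lines. The shift range, the gamma range, the plain power-law gamma curve and the ignore label 255 are all assumptions chosen for this example; the slides do not give the exact sampling scheme used by the authors.

```python
import random

IGNORE = 255  # assumed "void" label for pixels shifted in from outside the image

def translate(grid, dx, dy, fill):
    """Shift a 2-D grid by (dx, dy) pixels; uncovered cells get `fill`."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
         for x in range(w)]
        for y in range(h)
    ]

def gamma_adjust(image, gamma):
    """Apply v -> v ** gamma to intensities in [0, 1]."""
    return [[v ** gamma for v in row] for row in image]

def augment(image, labels, rng, max_shift=1, gamma_range=(0.8, 1.25)):
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    gamma = rng.uniform(*gamma_range)
    # The same offset is applied to image and annotation so they stay aligned.
    image = translate(image, dx, dy, fill=0.0)
    labels = translate(labels, dx, dy, fill=IGNORE)
    return gamma_adjust(image, gamma), labels

image = [[0.1, 0.5], [0.9, 1.0]]
labels = [[0, 0], [1, 1]]
aug_image, aug_labels = augment(image, labels, random.Random(0))
```

Gamma is applied to the image only, since the annotation holds class indices rather than intensities.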

ANALYSIS AND RESULTS

Cityscapes dataset - 5000 images with dense pixel annotation spanning 30 classes, collected in 50 cities

FRRN A and B trained at quarter and half resolution respectively

1. Baseline – All full-resolution residual units (FRRUs) replaced by residual units (RUs) with skip connections that connect the input of each pooling layer to the output of the corresponding unpooling layer. FRRN A reached a validation set mean IoU score of 65.7% while the baseline only achieved 62.8%, showing a clear advantage of FRRNs

2. The authors report being the first to show that it is possible to obtain state-of-the-art results even without pre-training. This shows how network architectures can have a crucial effect on a system’s overall performance.
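For reference, the mean IoU score quoted above is the intersection-over-union per class, averaged over classes. The sketch below computes it from flat label lists; the function name and the toy labels are illustrative, and the official benchmark accumulates the counts over the whole dataset rather than over a single image.

```python
def mean_iou(pred, truth, num_classes, ignore=None):
    """Mean over classes of |pred AND truth| / |pred OR truth|, from flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for p, t in zip(pred, truth):
            if t == ignore:
                continue  # unlabeled pixels do not count
            inter += (p == c and t == c)
            union += (p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
score = mean_iou(pred, truth, num_classes=3)
```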

Since inaccurate boundaries are often not apparent from the standard evaluation metric scores, a typical approach is a trimap evaluation in order to quantify detailed boundary adherence. During trimap evaluation, only predictions that fall within a certain radius r of a ground truth label boundary are scored; all others are ignored
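A rough sketch of the idea, using plain pixel accuracy inside the band as a stand-in for the actual metric. The boundary definition (4-neighbours with a different label), the Euclidean distance and the function names are assumptions made for this example.

```python
def boundary_pixels(labels):
    """Pixels with at least one 4-neighbour carrying a different label."""
    h, w = len(labels), len(labels[0])
    out = set()
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < h and nx < w and labels[ny][nx] != labels[y][x]:
                    out.update({(y, x), (ny, nx)})
    return out

def trimap_accuracy(pred, truth, r):
    """Pixel accuracy restricted to pixels within distance r of a ground-truth boundary."""
    edges = boundary_pixels(truth)
    hits = total = 0
    for y in range(len(truth)):
        for x in range(len(truth[0])):
            near = any((y - by) ** 2 + (x - bx) ** 2 <= r * r for by, bx in edges)
            if near:
                total += 1
                hits += pred[y][x] == truth[y][x]
    return hits / total if total else float("nan")

truth = [[0, 0, 0, 1, 1, 1]] * 2
pred = [[0, 0, 1, 1, 1, 1]] * 2  # boundary predicted one pixel too far left

narrow = trimap_accuracy(pred, truth, r=0)
wide = trimap_accuracy(pred, truth, r=10)
```

In this toy case the score in the narrow band is clearly lower than the score over the wide band, which is the effect the trimap evaluation is designed to expose.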

STRENGTHS OF APPROACH

1. Clean and powerful architecture, proven to be effective

2. No pre-training

3. No post-processing

4. Superior boundary adherence

5. State-of-the-art results

6. Architecture can be used in other applications such as optical flow

WEAKNESSES OF APPROACH

1. Semantic segmentation not instance aware

2. Full-resolution stream throughout the layers – heavy computation, more time. The authors had to create a custom procedure to handle memory issues while training. They were also unable to train at the full resolution of the dataset and thus report results of nets trained only at quarter and half resolution.

3. Runtime is not reported in the paper or on the Cityscapes leaderboard – probably not real time

TAKEAWAY

1. Well justified, targeted design of network architecture can lead to great gains.

2. The authors identified a problem (boundary adherence), came up with a specialized architecture, justified the design and obtained state of the art results

DeepLab v3+

Presently at the top of the Cityscapes leaderboard – 82% mean IoU

QUESTIONS?
