



Authors: Castrejon (Dept. of CS, University of Toronto)

Presented by Mandar Pradhan


● To annotate object instances in an image as quickly as possible


● To produce annotations as close to the ground truth as possible (POLYGON


● To leave scope for human intervention to correct automated annotations



● More data means more annotation, which is time-consuming and laborious (if done by manual polygon annotation)

● Weaker forms of annotation (image tags, bounding boxes, scribbles, single points) are an easier way to obtain ground truth, but the resulting models are not as accurate as fully supervised ones

● Human intervention is needed to correct automated annotations and prevent the model from breaking down


● Semi-automatic annotations:

○ Scribbles / multi-scribbles - segmentation using graph cut by combining appearance cues and a smoothness term (additional layer of training examples, not accurate)

○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using the EM algorithm (idea extended to 3D bounding boxes + point clouds)

[Figure: examples of Scribbles and GrabCut]




- Hard to incorporate shape priors

- Labellings with holes

- Hard to correct (not ideal)


● Semi-automatic annotations:

- Done at superpixel level

- May merge small objects or parts



● Object instance segmentation (**USED IN THIS PAPER)

- CNN applied to a box / patch for labelling

- Detect edges and link them to obtain a coherent region

- Combine small polygons into object regions to label images



Polygon - RNN (High level overview)

● Performs automated annotation using a CNN followed by an RNN

● The CNN extracts image features from the crop inside the instance's bounding box

● RNN input: CNN features of the image crop inside the bounding box + vertices at times t-1 and t-2 + initial point (details in subsequent slide)

● RNN output: a “polygon object” outlining the instance inside the bounding box (a polygon is a list of 2D vertices)

● Trained end to end

● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object shapes


Polygon - RNN (Some more details)

● “Polygon object” : List of vertices of bounding polygon

● A given polygon has multiple parameterizations (any vertex can be chosen as the starting point, and the remaining vertices can be traversed in either orientation)

● Convention: Any starting point, Clockwise orientation


● Why are the vertices from both t-1 and t-2 fed into the RNN input???

○ To account for the orientation of the polygon

● Why is the initial point of the polygon fed into the RNN input???

○ To decide when to close the polygon
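The parameterization convention above (any starting vertex, clockwise orientation) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names `signed_area`, `to_clockwise` and `rotate_start` are hypothetical, and it assumes image coordinates where y grows downward.

```python
def signed_area(vertices):
    """Shoelace formula for a polygon given as (x, y) pairs.

    Assumption: image coordinates (y grows downward), so a polygon that
    looks clockwise on screen has a positive signed area.
    """
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def to_clockwise(vertices):
    """Return the same polygon traversed clockwise (as seen on screen)."""
    if signed_area(vertices) > 0:
        return list(vertices)
    return list(reversed(vertices))


def rotate_start(vertices, k):
    """Same polygon, different starting vertex (another valid parameterization)."""
    return list(vertices[k:]) + list(vertices[:k])


# A unit square listed counter-clockwise on screen.
square_ccw = [(0, 0), (0, 1), (1, 1), (1, 0)]
square_cw = to_clockwise(square_ccw)
```

Any rotation of `square_cw` via `rotate_start` describes the same polygon, which is why the starting vertex can be picked freely.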

CNN Module - CNN + Skip connects

● Based on the VGG16 architecture, with the fully connected layers and the last max-pooling layer removed and replaced by additional convolutional layers with skip connections

● Skip connections from the lower layers each pass through a 3×3 convolutional layer + ReLU, are rescaled to 28 × 28, and are then stacked

● Output is downsampled by a factor of 16


● Why skip connections??? - To pull out both low-level features (like edges and corners) and the semantics of the instance

● How to handle skip connections with different spatial dimensions???

- Bilinear upsampling after the additional convolution at conv5

- 2×2 max-pooling before the additional convolution at pool2
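The rescaling rule for the skip connections can be sketched as follows. This is only an illustration under stated assumptions: the input crop is assumed to be 224 × 224, so that pool2 is 56 × 56 and conv5 is 14 × 14 in VGG16, and the names `resize_plan` and `max_pool_2x2` are hypothetical helpers, not part of any library.

```python
def resize_plan(size, target=28):
    """Name the operation that brings a square feature map to the target size."""
    if size == target:
        return "none"
    if size == 2 * target:
        return "2x2 max-pool"
    if 2 * size == target:
        return "2x bilinear upsample"
    raise ValueError("unsupported feature map size: %d" % size)


def max_pool_2x2(fmap):
    """2x2 max-pooling with stride 2 on a 2D list with even height and width."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]), 2)
        ]
        for r in range(0, len(fmap), 2)
    ]


# Assumed spatial sizes for a 224 x 224 input crop (not stated on the slide).
assumed_sizes = {"pool2": 56, "conv5": 14}
plan = {name: resize_plan(size) for name, size in assumed_sizes.items()}

pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 1, 2, 3],
                       [4, 5, 6, 7]])
```

Once every skip feature map is 28 × 28, the maps can be concatenated along the channel axis.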

RNN Module for vertex prediction

● Aim of the RNN - capture history (previous edges) and predict the future (next edges of the polygon).

● Makes coherent predictions in ambiguous cases (occlusion, shadows)

● Units: convolutional LSTMs - they operate in 2D, preserve the spatial information from the CNN, and reduce the number of parameters

The overall network architecture is presented in the diagram below

RNN Module for vertex prediction

● 2-layer ConvLSTM with 16 channels and 3×3 kernels

● Representation of the output vertex - one-hot encoding of size D × D + 1

● The D × D entries represent the possible 2D grid positions of the vertex

● The additional entry denotes the end-of-sequence token (polygon is complete)

● At the input, apart from the CNN representation of the image, we have the one-hot encodings of the vertices at t-1 and t-2, along with the initial vertex.
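The one-hot vertex representation with an extra end-of-sequence entry can be sketched as below. It assumes D = 28 (matching the 28 × 28 feature resolution mentioned earlier) and a row-major layout; both are assumptions for illustration, and the function names are hypothetical.

```python
D = 28               # assumed grid size
EOS_INDEX = D * D    # the one extra entry, used as the end-of-sequence token


def encode_vertex(row, col):
    """One-hot vector of length D*D + 1 for the grid cell (row, col)."""
    vec = [0] * (D * D + 1)
    vec[row * D + col] = 1
    return vec


def encode_eos():
    """One-hot vector marking that the polygon is complete."""
    vec = [0] * (D * D + 1)
    vec[EOS_INDEX] = 1
    return vec


def decode(scores):
    """Pick the highest-scoring entry; None stands for end of sequence."""
    idx = scores.index(max(scores))
    if idx == EOS_INDEX:
        return None
    return divmod(idx, D)  # (row, col)
```

`decode` also works on a vector of predicted probabilities or log-probabilities, which is how the best vertex would be chosen at each time step.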

RNN Module for vertex prediction

● Prediction of the starting point

- Reuse the CNN architecture with 2 additional layers (branches)

- The first branch predicts object boundaries

- The second branch takes the output of the first branch as well as the image features as inputs and predicts the polygon vertices

- Both of the above are binary classification problems

Training Details

● Loss - Cross Entropy

● Smoothing of the target distribution (the target over the D × D + 1 outputs is non-binary)

- To avoid over-penalising incorrect predictions that are close to the target.

- Assigning non-zero probability to locations within a distance of 2 from the target in


● Optimizer - Adam

● Batch size - 8

● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs

● β1 = 0.9, β2 = 0.999 (Adam momentum constants)

● Use logistic loss for the two binary classification tasks (boundary and vertex prediction)

● Ground truth for object boundaries - edges of the ground-truth polygon

● Ground truth for the vertex layer - vertices of the ground-truth polygon
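The target smoothing used with the cross-entropy loss can be illustrated with a small sketch. The slide only says that locations within a distance of 2 get non-zero probability; the uniform weighting and the use of a square (chessboard) neighbourhood here are assumptions, as are the function names.

```python
import math


def smoothed_target(row, col, D=28, radius=2):
    """Spread the target mass uniformly over grid cells near (row, col).

    Assumptions: uniform weights and a chessboard-distance neighbourhood.
    The last entry (index D*D) is the end-of-sequence token and stays 0.
    """
    cells = [
        (r, c)
        for r in range(max(0, row - radius), min(D, row + radius + 1))
        for c in range(max(0, col - radius), min(D, col + radius + 1))
    ]
    weight = 1.0 / len(cells)
    target = [0.0] * (D * D + 1)
    for r, c in cells:
        target[r * D + c] = weight
    return target


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a (smoothed) target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


interior = smoothed_target(10, 10)
corner = smoothed_target(0, 0)
```

With this target, a prediction that lands one cell away from the ground-truth vertex is penalised far less than one on the other side of the grid.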


Implementational details

● How to choose the best vertex at each time step of the RNN?? - pick the one with the highest log-probability

● How does correction of a vertex take place?? - the annotator's corrected vertex is fed to the model at the next time step

● Inference time - 250 ms

● Polygon simplification

- Eliminate the middle vertex when 3 vertices lie on the same line, and keep only one of 2 vertices that fall in the same grid cell

● Data augmentation:

- Flip the image crop and annotation at random

- Randomly increase context (10-20% of the bounding box)

- Randomly pick the starting vertex
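The polygon simplification step described above can be sketched as follows. This is one possible reading of the rule, not the authors' implementation: vertices are assumed to be integer grid coordinates, and `simplify` is a hypothetical helper.

```python
def simplify(vertices):
    """Simplify a closed polygon given as a list of integer (x, y) vertices.

    Step 1: keep only one of consecutive vertices in the same grid cell.
    Step 2: drop a vertex when it lies on the line through its neighbours.
    """
    dedup = []
    for v in vertices:
        if not dedup or dedup[-1] != v:
            dedup.append(v)
    # The polygon is cyclic, so the last vertex may repeat the first one.
    if len(dedup) > 1 and dedup[0] == dedup[-1]:
        dedup.pop()
    if len(dedup) < 3:
        return dedup

    kept = []
    n = len(dedup)
    for i in range(n):
        p, q, r = dedup[i - 1], dedup[i], dedup[(i + 1) % n]
        # Zero cross product means p, q, r are collinear.
        cross = (q[0] - p[0]) * (r[1] - q[1]) - (q[1] - p[1]) * (r[0] - q[0])
        if cross != 0:
            kept.append(q)
    return kept


raw = [(0, 0), (2, 0), (2, 0), (4, 0), (4, 4), (0, 4)]
clean = simplify(raw)
```

Here the duplicate `(2, 0)` is merged first, and the remaining `(2, 0)` is then removed because it lies on the edge from `(0, 0)` to `(4, 0)`.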


● Datasets: KITTI, Cityscapes

● Goals of the model:

- Polygon must be as accurate as possible

- Minimal number of clicks

● Yardsticks to gauge performance:

- Intersection over Union measure

- Number of vertex corrections needed to obtain the polygon

● Annotation of polygon done by in-house detector; bounding boxes are easy to obtain using AMT

Results : Cityscapes

● What is in this dataset ?? - 27 cities, 2975 train images, 500 validation, 1525 test

● Issue faced - the test set has no ground-truth instances

● Solution - the 500 validation images are used as test images

- The images from Weimar and Zurich form the validation set

● Labels - Person, Car, Rider, Truck, Bus, Train, Motorcycle and Bicycle

● Size of instances - 28 to 1792 pixels


● Inbuilt instance segmentation is both in terms of pixel labelling as well as


● New problem - polygons in Cityscapes also cover the occluded portion

● Solution - Depth ordering to remove the occluded part (we want only visible


Results : Cityscapes

● What do we do about objects with multiple components due to occlusion???

● The authors have treated each component as a single object

● So what happens if the RNN keeps adding new vertices without reaching a


● The authors set a hard limit of 70 vertices for the RNN (GPU constraints)
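The effect of the hard vertex limit can be sketched with a simple decoding loop. The `step_fn` callable stands in for one RNN step and is purely hypothetical; whether the first vertex counts towards the limit of 70 is also an assumption made here.

```python
MAX_VERTICES = 70  # hard limit mentioned above


def decode_polygon(step_fn, first_vertex, max_vertices=MAX_VERTICES):
    """Greedy decoding loop with an end-of-sequence check and a hard cap.

    step_fn(prev2, prev1, first) is a hypothetical stand-in for one RNN step:
    it returns the next vertex, or None for the end-of-sequence token.
    """
    polygon = [first_vertex]
    prev2, prev1 = None, first_vertex
    while len(polygon) < max_vertices:
        nxt = step_fn(prev2, prev1, first_vertex)
        if nxt is None:  # polygon closed
            break
        polygon.append(nxt)
        prev2, prev1 = prev1, nxt
    return polygon


# Toy step functions for illustration only.
SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]


def square_step(prev2, prev1, first):
    i = SQUARE.index(prev1)
    return SQUARE[i + 1] if i + 1 < len(SQUARE) else None


def never_stop(prev2, prev1, first):
    return (prev1[0] + 1, prev1[1])
```

With `never_stop`, which never emits the end-of-sequence token, the loop still terminates once the cap is reached.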

Results : Evaluation Metric

● Intersection over Union : Obtained prediction vs Ground Truth (Average over all


● How to evaluate the human action (corrections of vertices)??? - simulate an annotator who corrects a vertex each time the predicted vertex deviates too far from the ground truth

● Testing game plan : first do a sanity check in PREDICTION mode (no annotator interaction to correct vertices), then evaluate the amount of human intervention needed
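For reference, the Intersection over Union measure between a predicted mask and a ground-truth mask can be computed as in the sketch below. Masks are assumed to be equally sized 2D lists of 0/1 values; the function names and the convention that two empty masks score 1.0 are assumptions.

```python
def iou(pred, gt):
    """Intersection over Union of two binary masks (2D lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)


pred_mask = [[1, 1, 0],
             [1, 1, 0]]
gt_mask = [[0, 1, 1],
           [0, 1, 1]]
score = iou(pred_mask, gt_mask)  # 2 shared pixels out of 6 covered pixels
```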

Results : Baselines

● DeepMask : Uses a CNN to output pixel labels; class-agnostic

● SharpMask : Improves on the DeepMask idea by upsampling the output to obtain a higher resolution

● Performance is reported based on ground-truth boxes

● Network structure: 50-layer ResNet architecture trained on the COCO dataset

● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs and the upsampling part of SharpMask is trained for 70 epochs

Results : Baselines

● SquareBox: The object mask is a bounding box (of reduced dimensions), with an individual box for each component of the object

● Dilation10: Uses a semantic segmentation model; pixels assigned to the object class are grouped into instance masks


Results : Baselines

● Verdict

- Baselines are hard to correct

- Polygon-RNN has a better overall average and tops the charts in 6 / 8 categories

- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6% and 7% respectively

- Why is the previous point worth noting - SharpMask uses a ResNet architecture, which is much more powerful than VGG


- Baselines have an advantage on larger objects like bus and train due to better output resolution

Results : Annotators in the loop

● How are the quality of annotation and the amount of human intervention quantified??? - No. of mouse clicks needed to get different levels of


● What do they mean by different “levels” of segmentation accuracy ??? - thresholds on the chessboard distance of the errors

● Also, show the resulting IoU to compare


● Methodology in a nutshell

- In the first method, pick 10 images per annotator and ask them to annotate freely without any cues or hints.

- In the second method, crop the images and place blue markers on the instances to be annotated (to remove ambiguity)
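The click-counting idea behind the simulated annotator, with accuracy levels defined by a chessboard distance threshold, can be sketched as below. This is a simplified illustration: `predict_next` is a hypothetical stand-in for the model, the first vertex is assumed to be given, and each prediction is compared with the ground-truth vertex at the same position in the sequence.

```python
def chessboard_distance(a, b):
    """Chebyshev (chessboard) distance between two grid points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def count_clicks(predict_next, gt_polygon, threshold):
    """Simulate an annotator who corrects any vertex farther than `threshold`.

    predict_next(history) is a hypothetical model call returning the next
    vertex given the vertices accepted so far. A corrected vertex replaces
    the prediction and is fed back at the next step.
    """
    history = [gt_polygon[0]]  # assumption: the first vertex is given
    clicks = 0
    for gt_vertex in gt_polygon[1:]:
        pred = predict_next(history)
        if chessboard_distance(pred, gt_vertex) > threshold:
            clicks += 1
            pred = gt_vertex
        history.append(pred)
    return clicks, history


# Toy model for illustration: always move 3 cells to the right.
def toy_model(history):
    last = history[-1]
    return (last[0] + 3, last[1])


gt = [(0, 0), (3, 0), (6, 1), (10, 1)]
strict_clicks, _ = count_clicks(toy_model, gt, threshold=0)
loose_clicks, _ = count_clicks(toy_model, gt, threshold=1)
```

A tighter threshold forces more corrections, which is the trade-off between the number of clicks and the resulting IoU.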


Results : Annotators in the loop

● Verdict

- Human annotator IoU: 69.5% in the free-viewing method and 78.6% for cropped images

- Indicates the need to collect multiple annotations to reduce variation and bias across annotators

Results : Annotators in the loop

● Comparison with GrabCut:

- 54 randomly chosen instances

● GrabCut stats: 42.2 s and 17.5 clicks per instance, 70.7% IoU

● Polygon-RNN stats: 5-9.6 clicks per instance, 77.6% IoU

● Verdict - Polygon-RNN is faster as it needs fewer clicks for a comparable inference time


Results : Final Verdict


● Polygon-RNN provides plausible annotations with relatively low latency

● Performance is good on smaller objects. This is visible both across instances of varying sizes within the same dataset (Cityscapes) and between the 2 datasets (smaller objects in KITTI vs larger objects in Cityscapes)

● Competes well with SharpMask, which has a ResNet-based architecture

● Reduces annotation cost for IoU comparable to human annotation

● Introduction of human intervention adds scope to avoid extremely bad


Results : Final Verdict


● Lower output resolution and the associated quantization error show up in the segmentation of larger instances.

● Memory intensive - a polygon has more vertices to predict than a single bounding box, which may add latency in return for more accuracy.

● Cannot exploit the Velodyne point clouds in the KITTI dataset as other methods do, which puts it at a disadvantage

Results : Final Verdict


● Tries to address the issues of speed and accuracy of annotations

● The novelty of allowing human intervention allows it to not give very bad


● Performance is good for smaller objects but drops for larger objects

● Scope for future work: improving the resolution and exploiting Velodyne point cloud data to address the performance issues on the KITTI dataset


[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.