
Semantic Segmentation, Urban Navigation, and Research Directions

Qasim Nadeem
Princeton University

[email protected]

Abstract

We introduce and give a detailed review of semantic segmentation. We begin with some key early methods, and our discussion makes its way to the state of the art. We also discuss the use of semantic segmentation for navigation in outdoor urban environments. Finally, we touch on two fascinating research directions that we believe hold a lot of potential.

Semantic segmentation is the task of taking an input image and producing dense, pixel-level semantic predictions. Accurate and rich semantic segmentation is a driver of visual understanding and reasoning, and has a direct impact on many real-world applications.

1. Introduction

A semantic segmentation system takes an image or video frame as input, and outputs a heat-map classifying each pixel into one of k pre-defined categories. Normally, the system is trained on data where the ground-truth labels are known. Then, given a new unlabeled test image, the learner predicts one of the k semantic classes for each pixel. The most popular evaluation criterion is the intuitive metric mean Intersection-over-Union (mIoU).
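For concreteness, mIoU can be computed from a k × k confusion matrix; below is a minimal NumPy sketch (the function name and interface are ours, not from any benchmark toolkit).

import numpy as np

def mean_iou(pred, gt, k):
    # pred, gt: integer label maps of identical shape, values in [0, k)
    # Build the k x k confusion matrix in a single pass.
    conf = np.bincount(k * gt.reshape(-1) + pred.reshape(-1),
                       minlength=k * k).reshape(k, k)
    inter = np.diag(conf)                      # correctly labeled pixels per class
    union = conf.sum(0) + conf.sum(1) - inter  # predicted + actual - intersection
    valid = union > 0                          # skip classes absent from both maps
    return (inter[valid] / union[valid]).mean()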

There are numerous applications that justify the usefulness of semantic segmentation. Robotic navigation and autonomous driving benefit immediately; obstacle detection, path planning, and recognizing traversable terrain are some uses. Two smart-phone features that rely on semantic segmentation are (i) portrait mode, where background pixels are blurred to sharpen the person's silhouette, and (ii) several of the clever filters on Snapchat and Instagram. A third area of interest is biomedical image segmentation; many processes in radiology and lab testing can be made more efficient and/or accurate with automatic semantic segmentation.

Instance segmentation (or instance-aware segmentation) is a stronger task where separate instances of the same class (say 'bicycle') are identified distinctly. We do not cover systems that achieve this, but many important ideas naturally overlap.

We start with a quick discussion of significant datasets for semantic segmentation. We then explore papers in semantic segmentation, starting from traditional methods and making our way to the state of the art. As much as possible, we explore ideas in chronological order.

2. Datasets

PASCAL VOC 2012 [10] has been the most heavily benchmarked dataset for semantic segmentation, although that is bound to change since the state of the art has reached 89.0 mIoU on its test set. MS COCO [21] is another large-scale, popular segmentation dataset. Other significant datasets include SiftFlow [22], NYUDv2 [36], and Stanford Background [13]. DAVIS [26] is a significant video object segmentation dataset.

Several important datasets are geared towards urban-scene understanding, especially with autonomous vehicles' perception in mind. The most significant ones are Cityscapes [8], KITTI [12], and CamVid [3]. A couple of others are Urban LabelMe [30] and CBCL StreetScenes [2]. A special mention must be made of the large-scale ApolloScape dataset (2018) [15], released by Baidu to further autonomous driving research.

2.1. PASCAL VOC 2012

The dataset is composed of images from the image-hosting website Flickr. The images are hand-annotated. There are several detection-related challenges for PASCAL VOC 2012, one of which is a semantic segmentation challenge. There are 21 classes (airplane, bicycle, background...). The public dataset has 1464 training and 1449 validation images. The test set is privately held. Although formally no new competition has been held since 2012, algorithms continue to be evaluated on the 2012 segmentation challenge (for example, DeepLab v3, at the top of the leaderboard, is a 2018 submission).


2.2. MS COCO

MS COCO is a popular and very challenging large-scale detection dataset, which includes pixel segmentations. State-of-the-art mIoU on MS COCO is around 48.0. The dataset contains high-resolution images, and features 80 object categories and 91 stuff categories. The semantic segmentation challenge provides 82000 training images and 40500 validation images, and evaluation is carried out on subsets of 80000 test images. MS COCO has received continued attention due to its quality annotations and large size. Several detection challenges are hosted at ECCV each year.

2.3. Urban-Scene understanding datasets

Several of the datasets are targeted towards urban scene understanding. Commonly, they have been gathered from cars with high-resolution cameras mounted, some kind of automatic segmentation/detection run on them, and then manually cleaned by human volunteers.

Cityscapes contains 2D semantic, instance-wise, dense pixel annotations for 30 classes. It has 5000 finely annotated images and 20000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have a large number of dynamic objects and varying scene layout and background.

CamVid is a road-scene understanding database captured as video sequences with a 960×720 resolution camera mounted on a car's dashboard. The sequences were sampled to collect 701 frames, which were manually annotated with 32 classes. A popular evaluation partition divides this dataset into train/val/test sets of 367/100/233 images, and uses only 11 of the class labels.

KITTI is a dataset for different computer vision tasks such as stereo, optical flow, and 2D/3D object detection and tracking. There are 7481 training and 7518 test images annotated with 2D and 3D bounding boxes for object detection and orientation estimation, with up to 15 cars and 30 pedestrians in each image. However, pixel-level annotations (roughly 7000 images) were only provided later by various third-party researchers with differing quality controls.

ApolloScape is a large-scale comprehensive dataset for urban street views. The eventual dataset will include RGB videos with over 1 million high-resolution images with per-pixel semantic labels, survey-grade dense 3D points with semantic segmentation, stereoscopic video with rare events, and night-vision sensor data. The collection is also careful to cover a wide range of environment, weather, and traffic conditions. The initial release has 143906 video frames and corresponding annotations, with 25 class labels and 28 lane-marking labels, for the semantic segmentation task (see Figure 1). In addition, 89430 instance-level annotations for movable objects are provided, to evaluate instance-level video object segmentation. ApolloScape is thus orders of magnitude larger than other urban datasets.

Figure 1. ApolloScape: sample frame and semantic labeling. [15]

3. Traditional, pre-neural-net approaches

Traditional methods for semantic segmentation used hand-crafted, carefully engineered features such as SIFT or HoG, along with learning algorithms like SVMs and Random Decision Forests. The advent of neural networks has coalesced these traditionally separate parts of the pipeline; NNs learn a suitable feature representation as well as a classifier. Below we discuss different feature extractors, learning models successfully used along with them, and a few significant papers that proposed complete pipelines for semantic segmentation.

Image segmentation has often been modeled as an energy minimization problem. CRFs are rich probabilistic graphical models; the pixels of an image can be viewed as variable nodes in a CRF. CRFs generally model only local interactions, and global effects arise only as an indirect consequence. The seminal paper by P. Krähenbühl and V. Koltun (2011) [17] gave an extremely efficient approximate inference algorithm for fully-connected CRFs whose pairwise edge potentials are defined by a linear combination of Gaussian kernels. These continue to be used to the current day as a post-processing step to boost segmentation accuracy.
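Concretely, the model of [17] minimizes the Gibbs energy

E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j),

where the unary potentials ψ_u come from a per-pixel classifier, and the dense pairwise potentials are a contrast-sensitive combination of two Gaussian kernels,

ψ_p(x_i, x_j) = μ(x_i, x_j) [ w(1) exp(−|p_i − p_j|²/2θ_α² − |I_i − I_j|²/2θ_β²) + w(2) exp(−|p_i − p_j|²/2θ_γ²) ],

with p_i and I_i the position and color of pixel i, and μ a label-compatibility function. The first (appearance) kernel encourages nearby pixels of similar color to take the same label; the second (smoothness) kernel removes small isolated regions.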

Moving on, the choice of feature descriptors and the incorporation of domain knowledge have been very significant in computer vision. They remain significant even now in many applications of machine learning, especially when a large amount of data is not available. We assume that the reader is already familiar with SIFT and HoG feature descriptors.

Bag of Visual Words (BOV) features are histogram-based counts. As the name suggests, the idea is an extension of 'Bag of Words' for text documents, which is a vector of counts over the vocabulary. For BOV, quantization is done on local image features to build a vocabulary of visual words (representative vectors chosen, for example, using k-means clustering). Then, the BOV feature is a histogram of occurrences of these visual words.

Textons (described neatly in [42]) computed on d × d patches of images can be thought of as a BOV whose visual words are vector-quantized exemplar responses of a linear filter bank. These features are thought capable of modeling object and class shape, appearance, and context, thus capturing the semantics of pixels.
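A minimal sketch of the BOV pipeline, assuming a placeholder local_descriptors(img) that returns an (n, d) array of local features (e.g., SIFT vectors) and a collection train_images; both names are ours, for illustration only.

import numpy as np
from sklearn.cluster import KMeans

# Step 1: build the visual vocabulary by clustering descriptors
# pooled over the training images (256 visual words here).
vocab = KMeans(n_clusters=256).fit(
    np.vstack([local_descriptors(img) for img in train_images]))

def bov_histogram(img):
    # Step 2: assign each local descriptor to its nearest visual word...
    words = vocab.predict(local_descriptors(img))
    # ...and count occurrences to form the (normalized) BOV feature.
    hist = np.bincount(words, minlength=256).astype(float)
    return hist / hist.sum()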

3.1. Some significant papers

J Shotton et al. (2008) [35] introduced the powerful Semantic Texton Forests (STFs) and achieved state-of-the-art results on the PASCAL VOC 2007 challenge for object detection and segmentation, while increasing execution speed by at least 5×. STFs are a type of random decision forest that can efficiently compute powerful low-level features. Each decision tree acts directly on image pixels, thereby bypassing the expensive computation of filter-bank responses or local descriptors. STFs are extremely fast to both train and test, especially compared with the k-means clustering and nearest-neighbor assignment of feature descriptors needed for traditional textons and BOV models. The nodes in the decision trees provide (i) an implicit hierarchical clustering into semantic textons, and (ii) an explicit local classification estimate. The actual decision rules for nodes in the trees are simple functions of raw image pixels within a d × d patch: either the raw value of one pixel, or the sum/difference/absolute difference of a pair of pixels. The bag of semantic textons combines a histogram of semantic textons over an image region. Image-level object detection priors are then considered to allow coherent, refined segmentations. See Figure 2.

Figure 2. (a) Test image with ground truth. STFs efficiently compute (b) a set of semantic textons per pixel and (c) a rough local segmentation prior. The algorithm combines both textons and priors as features to give a coherent semantic segmentation (d). [35]
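The node tests are simple enough to sketch directly. Below is our illustration (not the authors' code) of the four candidate split functions on a single-channel d × d patch P, each thresholded to route a pixel left or right:

def stf_split(P, kind, p1, p2, threshold):
    # P: d x d raw-pixel patch; p1, p2: (row, col) positions within P.
    a, b = P[p1], P[p2]
    value = {"unary": a,                # raw value of one pixel
             "sum": a + b,              # sum of a pair of pixels
             "diff": a - b,             # difference of a pair
             "absdiff": abs(a - b)}[kind]
    return value > threshold            # True routes to the right child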

G Brostow et al. (2008) [4] estimated 3D point clouds from videos (e.g., from a driving car) and introduced features that project those 3D cues back to the 2D image plane while modeling spatial layout and context. They then trained a randomized decision forest that combines many such features to achieve 2D semantic segmentation of urban scenes they collected (11 classes). The most important point to note is that the only features they use are motion- and structure-based (derived from the point cloud). This supports the idea that segmentation for videos can and should leverage the incredibly valuable spatio-temporal information available.

P Sturgess et al. (2009) [37] advanced the state of the art, improving segmentation accuracy on CamVid from 69% to 84%. They integrate motion- and appearance-based features. The motion-based features are extracted from 3D point clouds (nearly the same way as [4]), and the appearance-based features consist of textons, colour, location, and HOG. They (i) formulate the problem in a CRF framework in order to probabilistically model the label likelihoods and priors; (ii) use a novel boosting approach to combine motion- and appearance-based features, which selects discriminative features for each class to generate likelihood terms; and (iii) incorporate higher-order potentials in their CRF model.

L Ladický et al. (2010) [18] released a paper on scene understanding (an umbrella term encompassing object recognition/detection, segmentation, and 3D scene recovery). They achieved competitive results on PASCAL VOC and CamVid. Their learning model is a CRF defined on pixels, segments, and objects. They define a global energy function for the model which combines results from sliding-window detectors (any object detector can be plugged into their model) with low-level pixel-based unary and pairwise relations. A major contribution is showing that their proposed global energy function is efficiently solvable using a graph-cut based algorithm.

4. Modern approaches (2015 onwards)

Successful deep neural net architectures for image-level classification like AlexNet, VGG net, GoogLeNet, and ResNet are a natural precursor to, and often a direct part of, semantic segmentation architectures. We assume familiarity with those architectures.

It is helpful to note that pixel-level classification involves two simultaneous tasks: classification (the correct semantic concept needs to be picked) and localization (each pixel label must be aligned to the correct coordinate in the output heat-map). These two tasks are naturally in conflict: classification requires models to be invariant to transformations of various kinds, but localization should be sensitive to them, because pinning down the precise location is the point. Since classification CNNs were designed with the first task in mind, adapting them introduces the second, disagreeing task, and as we will see, most papers work to resolve this friction.

In this section, we discuss how the current state of the art developed. All state-of-the-art models involve convolutional neural nets. In our discussion below, it proves helpful to arrange modern semantic segmentation architectures into two divisions. The first division contains architectures primarily influenced by the 2015 Fully Convolutional Networks (FCN) paper; these can also be called encoder-decoder architectures. The second division contains architectures that are also influenced by the FCN paper but additionally employ dilated convolutions, which we will learn about later. Note that down-sampling and up-sampling still occur, as in encoder-decoder architectures, but not nearly as severely.

4.1. FCNs and other encoder-decoder architectures

CNNs had previously been adapted, albeit awkwardly, for segmentation. Those earlier ideas had intricacies like patch-wise processing, small models restricting receptive fields, various post-processing ideas, multi-scale pyramid processing, and ensembles. In 2014, J Long et al. introduced a seminal CNN architecture named Fully Convolutional Networks [23]. It offered simplicity, end-to-end training, and efficient learning and inference, and it significantly improved the state of the art on VOC 2012. They took pre-trained classification networks (VGG-16, AlexNet, GoogLeNet) and replaced fully connected layers with convolutional layers (thus the name fully convolutional) to output spatial maps instead of image-level classification scores. The spatial maps are still low resolution because of earlier pooling layers, so they need to be up-sampled. The up-sampling is done in a learnable manner, as opposed to, say, bilinear interpolation, using de-convolutions (also called fractionally strided convolutions) to produce pixel-level classification scores. This already gave them state-of-the-art results, but the segmentation is coarse because early pooling layers naturally lose spatial information; this is the key weakness of FCNs that later papers tried to address. The FCN paper offered skip connections from earlier, higher-resolution feature maps with finer strides to the final prediction layer to improve granularity and accuracy. Figure 3 provides an overview.
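A minimal PyTorch sketch of this conversion for the stride-32 VGG-16 variant (our illustration, not the authors' code): the fc6/fc7 classifier layers become convolutions over the 512-channel pool5 map, and a learned transposed convolution restores the input resolution.

import torch.nn as nn

class FCNHead(nn.Module):
    def __init__(self, k):  # k = number of classes
        super().__init__()
        # fc6/fc7 of VGG-16 re-cast as convolutions ("convolutionalized").
        self.fconv = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, k, kernel_size=1))   # per-class score map
        # Learned up-sampling back to input resolution (32x in one step).
        self.up = nn.ConvTranspose2d(k, k, kernel_size=64, stride=32, padding=16)

    def forward(self, pool5):
        return self.up(self.fconv(pool5))

The skip-connection variants (FCN-16s, FCN-8s) instead fuse scores from pool4 and pool3 before smaller up-sampling steps.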

The part of an FCN taken from a classification architecture is called an encoder, since it encodes the input image as a low-resolution feature map, and the part after is called a decoder, since it gradually increases the resolution back to that of the original image. SegNet (2015) [1] is an encoder-decoder architecture that gave a novel way to up-sample in the decoder, leading to competitive inference time and accuracy, and state-of-the-art memory efficiency. In particular, it copies max-pooling indices from encoding layers to corresponding decoding layers for a non-linear up-sampling. Thus, the up-sampling does not depend on learned parameters. Figure 4 provides an overview.
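The index-copying trick maps directly onto standard framework primitives; a minimal PyTorch sketch:

import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # encoder: remember argmax locations
unpool = nn.MaxUnpool2d(2, stride=2)                   # decoder: parameter-free up-sampling

def encode_decode(x):
    y, indices = pool(x)       # down-sample, keeping where each max came from
    # ... encoder/decoder convolutions would sit here ...
    return unpool(y, indices)  # place values back at remembered positions; the rest are zero

The decoder's convolutions then densify the sparse unpooled map.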

Figure 3. How FCNs re-purpose classification networks for segmentation. [23]

Figure 4. SegNet's architecture. [1]

U-Net (2015) [28], introduced for biomedical image segmentation, is an important architecture that also works really well in practice. As there was little training data for the biomedical task, the authors used extensive data augmentation, applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations. This is particularly apt for biomedical segmentation, as deformation is a common variation in tissue and realistic deformations can be simulated efficiently. The architecture is quite simple. The encoder consists of 3 × 3 convolutions, ReLUs, and 2 × 2 max pooling for down-sampling. In each down-sampling step, they double the number of feature channels, increasing the "what" information and decreasing the "where". A step in the decoder consists of a de-convolution (a 2 × 2 up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the encoder, a 3 × 3 convolution, and a ReLU. They achieved the state of the art on several biomedical image segmentation challenges. Figure 5 provides an overview; each blue rectangle is a feature map, and the number written above it gives the number of channels.
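One decoder step, as described, in PyTorch (a sketch; the channel counts are illustrative, and the two 3 × 3 convolutions follow the paper's figure):

import torch
import torch.nn as nn

class UNetUpStep(nn.Module):
    def __init__(self, ch):  # ch: channels entering from below (e.g. 1024)
        super().__init__()
        self.up = nn.ConvTranspose2d(ch, ch // 2, kernel_size=2, stride=2)  # halve channels
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch // 2, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 2, ch // 2, kernel_size=3), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)
        # Crop the encoder feature map to match (U-Net uses unpadded convolutions).
        dh, dw = skip.size(2) - x.size(2), skip.size(3) - x.size(3)
        skip = skip[:, :, dh // 2: dh // 2 + x.size(2), dw // 2: dw // 2 + x.size(3)]
        return self.conv(torch.cat([skip, x], dim=1))  # concatenate, then 3x3 convs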

Figure 5. U-Net's architecture. [28]

RefineNet (2016) [20] achieved state-of-the-art results on six datasets, including PASCAL VOC 2012 and Cityscapes. The encoder comes from a pre-trained ResNet divided into blocks. The decoder consists of intricate RefineNet blocks which iteratively fuse features across several resolutions: low-resolution features from the previous RefineNet block and high-resolution ones from the encoder blocks. They introduce chained residual pooling, which essentially captures background context from a large image region by gathering features from several scales and fusing them in a learnable manner. Residual connections with identity mappings found throughout the network accommodate gradient propagation to allow end-to-end training. Figure 6 provides an overview.

Figure 6. RefineNet’s architecture. [20]

Large Kernel Matters (2017) [25] achieved state of the art on VOC 2012 and Cityscapes. Their keen insight was that stacked small convolution filters (3 × 3, etc.) are in vogue because they are much more efficient to compute than larger kernel filters (k × k for large k), but they argue that for the segmentation task, a large kernel with a larger effective receptive field can assist the simultaneous classification and localization tasks. So they approximate k × k filters using a novel Global Convolutional Network (GCN) that uses combinations of (1 × k) + (k × 1) and (k × 1) + (1 × k) convolutions, which mimic dense connections within a k × k region of the feature map. The encoder comes from ResNet followed by GCNs, and the decoder is along FCN lines. They also present a learnable Boundary Refinement module which improves pixel classification at class boundaries, substituting for CRF post-processing. Figure 7 provides an overview.

Figure 7. Large Kernel Matters’ pipeline. [25]
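The factorization is straightforward to express; a PyTorch sketch (ours) of one GCN block:

import torch.nn as nn

class GCN(nn.Module):
    # Approximates a dense k x k convolution with two separable branches.
    def __init__(self, c_in, c_out, k=15):
        super().__init__()
        p = k // 2
        # Branch A: (1 x k) followed by (k x 1).
        self.a = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, k), padding=(0, p)),
            nn.Conv2d(c_out, c_out, (k, 1), padding=(p, 0)))
        # Branch B: (k x 1) followed by (1 x k).
        self.b = nn.Sequential(
            nn.Conv2d(c_in, c_out, (k, 1), padding=(p, 0)),
            nn.Conv2d(c_out, c_out, (1, k), padding=(0, p)))

    def forward(self, x):
        return self.a(x) + self.b(x)  # summed branches mimic a dense k x k kernel

Each branch costs O(k) weights per channel pair instead of the O(k²) of a dense k × k kernel, which is what makes large k affordable.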

4.2. The Dilated Convolutions paper and inspired architectures

In 2015, F Yu and V Koltun [39] offered an alternative CNN architecture to the encoder-decoder paradigm and achieved state of the art on VOC 2012. They use dilated convolutions, where a dilation factor l determines how spread out the convolution filter is when applied; standard convolution is the special case l = 1. The insight here is that dilated convolutions allow an exponential increase (with l) in the receptive field without losing spatial resolution (unlike pooling layers, for example). They take a pre-trained VGG net, remove the last two pooling/striding layers, and add dilated convolution layers to produce dense output. They also propose a context module composed of a cascade of multi-scale dilated convolution layers, which takes C feature maps as input and produces same-sized C feature maps. Intuitively, the context module increases the contextual or global 'awareness' of each feature. Figure 8 shows how stacked dilated convolutions work.
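A sketch along the lines of the basic context module (the dilation schedule follows [39]; the scaffolding is ours):

import torch.nn as nn

def context_module(C, dilations=(1, 1, 2, 4, 8, 16, 1)):
    # Each 3x3 layer keeps the resolution (padding = dilation) while the
    # receptive field grows roughly exponentially along the schedule.
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(C, C, 3, padding=d, dilation=d), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(C, C, 1)]  # final 1x1 layer to mix features
    return nn.Sequential(*layers)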

L-C Chen, G Papandreou, et al. have published a series of three papers (2014, 2016, 2017) on a semantic segmentation project named DeepLab [5]. They chop off the final layers of ResNet/VGG-16 and use dilated (or atrous) convolution layers similar to [39]. A very nice visual of the effectiveness of dilated convolution versus encoding-decoding for the purpose of dense pixel classification is given in Figure 9.

They also try two approaches for handling scale variability in semantic segmentation. The first, simpler idea was to extract DCNN feature maps at three image scales using parallel net branches, and then bi-linearly interpolate and fuse them using max pooling. The second approach, which worked really well, is inspired by the successful R-CNN (regional CNN) spatial pyramid pooling approach; they use several parallel dilated convolution layers with different sampling rates. The features are separately processed and then fused to generate the final result. They name this atrous spatial pyramid pooling (ASPP); see Figure 10.

Figure 8. Systematic dilation supports exponential expansion of the receptive field without loss of resolution. Consider a 3 × 3 filter applied in succession with increasing dilation. (a) 1-dilated convolution applied; each element has a receptive field of 3 × 3. (b) Produced from (a) by a 2-dilated convolution; each element now has a receptive field of 7 × 7. (c) Produced from (b) by a 4-dilated convolution; each element now has a receptive field of 15 × 15. The number of parameters in each layer is equal. The receptive field grows exponentially while the number of parameters grows linearly. [39]

Figure 9. Sparse feature extraction with standard convolutional layers versus dense feature extraction with dilated convolutions. [6]

Figure 10. Atrous Spatial Pyramid Pooling: to classify the centerorange pixel. [6]
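ASPP amounts to a few parallel dilated convolutions over the same feature map; a PyTorch sketch (sampling rates as in DeepLab v2; the fusion shown here, concatenation followed by a 1 × 1 convolution, follows the later DeepLab variant, while v2 fuses by summation):

import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, c_in, c_out, rates=(6, 12, 18, 24)):
        super().__init__()
        # Parallel 3x3 convolutions, identical except for their sampling rate.
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates])
        self.fuse = nn.Conv2d(c_out * len(rates), c_out, 1)  # fuse concatenated branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))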

The final segmentation map is still coarse, but it is not severely down-sampled (only 8× compared to 32× in a standard encoder), so they simply use bi-linear interpolation to enlarge the feature map to the original image resolution (no fancy up-sampling ideas), followed by a fully-connected CRF to refine the segmentation result, particularly at class boundaries. Their 2016 paper [6] achieved state of the art on VOC 2012 and Cityscapes. Figure 11 gives an overview.

Figure 11. DeepLab pipeline. [6]

Their 2017 paper [7] worked on better ways to incorporate information from various scales. One line of effort was devoted to improving the effectiveness of the ASPP module; the two main changes were that batch normalization parameters are fine-tuned, and that image-level features are added to the pyramid. The second line of effort took cascading ResNet blocks and replaced the last four layers with a cascade of dilated convolutions; these blocks are applied to intermediate feature maps. It is worth noting that CRF post-processing was no longer needed. Both of these efforts individually led to improvements over their earlier papers, and they performed competitively with the state of the art.

Pyramid Scene Parsing (PSP) Net (2017) [41] achieved state of the art on VOC 2012 and Cityscapes. Their main contribution is a Pyramid Pooling Module (PPM) that aggregates global information in a novel manner. The intuition is that in urban scene datasets, which have high-resolution images with some categories that cover a lot of pixels (like road, sky, etc.), the global scene category strongly influences the distribution of segmentation classes and boundaries. Dilated convolution layers are used generously, modifying ResNet similarly to DeepLab. The feature maps are fed to the PPM, which has parallel pooling layers with kernels covering increasing portions of the feature map; the idea is that this harvests both local and global context information per pixel. Finally, the output of the PPM is up-sampled and concatenated with the input to the PPM, followed by dilated convolution layers to get the final predictions. Figure 12 provides an overview.

Figure 12. PSPNet’s pipeline. [41]
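A PyTorch sketch of the PPM (bin sizes from the paper; the scaffolding is ours):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    def __init__(self, c_in, bins=(1, 2, 3, 6)):
        super().__init__()
        c_mid = c_in // len(bins)
        # One branch per bin size: pool to n x n, then reduce channels with a 1x1 conv.
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.AdaptiveAvgPool2d(n), nn.Conv2d(c_in, c_mid, 1))
             for n in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        # Up-sample each pooled map back to the input size and concatenate with it.
        pooled = [F.interpolate(b(x), size=(h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        return torch.cat([x] + pooled, dim=1)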


4.3. Research Direction 1: Synthetic Datasets and Dataset Augmentation

Annotated data for dense pixel-level prediction is difficult and costly to gather. There is ongoing research in the community to improve the data efficiency of learning systems. CNNs (and NNs in general) are data-hungry models; almost every segmentation architecture we saw had a pre-trained component (on ImageNet and/or MS COCO). One line of research that I find particularly fascinating is to augment real-world, human-annotated datasets in different ways. A large set of approaches involves taking the training data and applying different levels of transformation to introduce variance and richness; e.g., several papers have pursued cut-and-paste ideas, where object instances are cut from one training image and pasted into a different image. Another set of approaches involves creating synthetic datasets which have automatic annotations. Some examples follow; they are not all focused only on segmentation, but the ideas are similar.

SYNTHIA (CVPR 2016) [29] was created by rendering a photo-realistic virtual city with fine-grained pixel-level annotations for 11 classes (void, road, car, sign, pedestrian, cyclist...), using the Unity game engine. Its purpose is urban scene understanding, especially in the context of driving. It contains both images and image sequences, featuring more than 213400 synthetic images in total. The authors also characterize the data by its diversity in terms of scenes (towns, cities, roads), dynamic objects, seasons, and weather.

The authors experimented with two segmentation networks: the first was an FCN with a VGG-16 encoder, and the second was a T-Net with a VGG-F encoder. They trained and evaluated these nets on four real-world urban scene segmentation datasets, and critically showed that augmenting training with SYNTHIA improved test accuracy by a significant margin, nearly across the board. Figure 13 shows an example frame.

Figure 13. SYNTHIA: A frame and its semantic segmentation.[29]

The 'Driving in the Matrix' paper (2017) [16] from a group at the University of Michigan is another example showing the utility of synthetic data. They look at object detection (with bounding boxes), not pixel-level prediction, but I feel the same idea could be extended to the latter. They generated synthetic urban scenes and pixel annotations from the video game GTA V. They evaluated a Faster R-CNN object detector (VGG-16 pre-trained on ImageNet) on the task of detecting cars on the entire KITTI dataset (roughly 7500 images). They trained one instance of this detector on Cityscapes' 3000 images (a dataset that "resembles" KITTI, the authors say, since both come from roads in Germany), and a second instance on large volumes of synthetic data. The second instance, with 50000 synthetic images, out-performed the first. As expected, the training value of a synthetic image is much lower than that of a real image; however, a synthetic image is arguably free of human labor cost. Results are in Figure 14.

Figure 14. A Faster R-CNN detector trained on GTA V imagery outperforms one trained on Cityscapes when evaluated on KITTI for detecting cars (partitioned into three levels of detection difficulty). [16]

There are many other papers showing the promise of synthetic datasets for various low- and high-level computer vision tasks. We mention a few more below:

(i) House3D (Y Wu et al., 2018) [38], by FAIR, sources from the earlier SunCG dataset (Song et al., 2017) and provides 45000 diverse indoor 3D scenes of visually realistic houses, with dense 3D annotations of all objects into 80 categories. The uses of this dataset, as imagined by the authors, are object and scene understanding, 3D navigation, and embodied question answering.

(ii) 'Playing for Data: Ground Truth from Computer Games' (2016) [27] showed that although the source code of a game may be unavailable, the communication between the game and the graphics hardware can be used to deduce region and pixel annotations. This allowed them to extract 25000 pixel-annotated frames from a photo-realistic game (GTA V) in only 49 hours. They showed that training on this data plus only 1/3 of CamVid outperforms a model trained entirely on CamVid.

(iii) H Hattori, V N Boddeti, et al. (2015) [14] showed a way to learn scene-specific pedestrian detectors in the absence of prior data for a novel location, for example when a new security camera is installed somewhere. They infer the geometry of the scene, and a pedestrian rendering system then creates a scene simulation of pedestrian motion respecting the geometry (walls, obstacles, walkable regions) of the scene. This data is then used to train a pedestrian detector. They show that augmenting with real-world data improves accuracy.


Synthetic datasets have room for improvement. Computer graphics and video game visuals are improving, and the fidelity of synthetic data can thus be improved. It seems to me that real-world data provides much stronger low-level visual knowledge (texture, colors, and edges are all clearer and more realistic), but for semantic richness and variance (which objects appear close to or in context with each other, e.g., people sit in cars, people walk pets, etc.), synthetic datasets can come close to real-world data, because it is a matter of rendering varied scenarios. It may be interesting to experimentally judge the training value of subsets of synthetic data, to understand how to better create synthetic data. It might also be interesting to experiment with interactive learning, where the learner decides what kind of training data it requires, which can then be synthesized.

4.4. Research Direction 2: Video Semantic Segmentation, and Faster Inference

Recent lines of work have moved towards achieving faster inference times; this is useful for real-time applications and mobile hardware. A common theme among these papers is to find ways to cleverly condense existing segmentation networks like PSPNet and SegNet, trading accuracy for speed at a favorable rate. Some examples are ENet [24], ICNet [40], and ShuffleSeg [11].

Turning to video, the first question that should come to mind is what is special about video, and why not just segment each frame separately. That is certainly possible, and a strategy often used in practice, but it leaves a lot to be desired. Some reasons are: (i) frames close in time are strongly dependent, so each markedly informs the segmentation of the others, and this information should be leveraged; (ii) motion and structural features in videos can inform segmentation; (iii) it is desirable to have smooth segmentation changes from frame to frame (often called temporal continuity in the literature), which is nearly impossible if frames are independently segmented; (iv) there is orders of magnitude more unlabeled or weakly labeled data available in video format, but annotated data is rare and prohibitively expensive to get, so weakly supervised and unsupervised methods become vastly more important to work on.

Two interesting examples are Clockwork FCNs (2016) [33] and the CVPR 2018 paper Low-Latency Video Semantic Segmentation [19]. The latter has a feature propagation module that fuses features over time via spatially variant convolutions, saving computation cost per frame, and an adaptive scheduler that dynamically allocates computation based on its estimate of the current segmentation accuracy. The two components work together to ensure low latency and high segmentation quality. They achieve competitive performance on Cityscapes and CamVid, and reduce latency from 360 ms to 119 ms (see Figure 15). A summary of the pipeline is provided in Figure 16.

Figure 15. Latency vs. accuracy on the Cityscapes dataset. [19]

Figure 16. Overall pipeline: at each time step t, the lower part of the CNN, S_l, first computes the low-level features F^l_t. Based on both F^l_k (the low-level features of the previous key frame) and F^l_t, the framework decides whether to set frame t as a new key frame. If yes, the high-level features F^h_t are computed by the expensive higher part S_h; otherwise, they are derived by propagating from F^h_k using spatially variant convolution. The high-level features, obtained either way, are then used to predict semantic labels. [19]
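Stripped of the learned components, the scheduling logic reduces to a small loop. In the sketch below every argument (low_net, high_net, propagate, deviation, tau) is a placeholder of ours standing in for the paper's corresponding component:

def segment_stream(frames, low_net, high_net, propagate, deviation, tau):
    # low_net: cheap lower layers, run on every frame.
    # high_net: expensive upper layers, run only on key frames.
    # propagate: spatially variant convolution from key-frame features.
    # deviation: learned estimate of drift from the key frame; tau: threshold.
    key_low = key_high = None
    for frame in frames:
        f_low = low_net(frame)                           # computed for every frame
        if key_high is None or deviation(key_low, f_low) > tau:
            key_low, key_high = f_low, high_net(f_low)   # re-key: run the expensive part
            f_high = key_high
        else:
            f_high = propagate(key_high, key_low, f_low) # cheap cross-frame propagation
        yield f_high                                     # features used to predict labels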


References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2017.

[2] S. M. Bileschi. StreetScenes: Towards scene understanding in still images. Massachusetts Institute of Technology, 2006.

[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2), 2009.

[4] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. ECCV, 2008.

[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015.

[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint, 2016.

[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint, 2017.

[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, 2016.

[9] G. Csurka and F. Perronnin. A simple high performance approach to semantic segmentation. BMVC, 2008.

[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.

[11] M. Gamal, M. Siam, and M. Abdel-Razek. ShuffleSeg: Real-time semantic segmentation network. arXiv preprint, 2018.

[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 2013.

[13] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. ICCV, pp. 1-8, 2009.

[14] H. Hattori, V. N. Boddeti, K. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. CVPR, 2015.

[15] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The ApolloScape dataset for autonomous driving. CVPR 2018 Workshop on Autonomous Driving Challenge, 2018.

[16] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan. Driving in the Matrix: Can virtual worlds replace human-generated annotations for real world tasks? ICRA, 2017.

[17] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS, pp. 109-117, 2011.

[18] L. Ladický, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr. What, where and how many? Combining object detectors and CRFs. ECCV, 2010.

[19] Y. Li, J. Shi, and D. Lin. Low-latency video semantic segmentation. CVPR, 2018.

[20] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. CVPR, 2017.

[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.

[22] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. CVPR, 2009.

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.

[24] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint, 2016.

[25] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters -- improve semantic segmentation by global convolutional network. CVPR, 2017.

[26] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. CVPR, 2016.

[27] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. ECCV, 2016.

[28] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, pp. 234-241, Springer, 2015.

[29] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. CVPR, 2016.

[30] B. C. Russell and A. Torralba. LabelMe. IJCV, 2008.

[31] F. Schroff, A. Criminisi, and A. Zisserman. Object class segmentation using random forests. BMVC, 2008.

[32] S. Nowozin and C. H. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision 6(3-4), 2011.

[33] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. ECCV, 2016.

[34] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. CVPR, 2011.

[35] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. CVPR, 2008.

[36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. ECCV, 2012.

[37] P. Sturgess, K. Alahari, L. Ladický, and P. H. Torr. Combining appearance and structure from motion features for road scene understanding. BMVC, 2009.

[38] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint, 2018.

[39] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[40] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. arXiv preprint, 2017.

[41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. CVPR, 2017.

[42] S.-C. Zhu, C.-E. Guo, Y. Wang, and Z. Xu. What are textons? IJCV, 2005.
