

VLASE: Vehicle Localization by Aggregating Semantic Edges

Xin Yu1∗, Sagar Chaturvedi1∗, Chen Feng2, Yuichi Taguchi2, Teng-Yok Lee2, Clinton Fernandes1, Srikumar Ramalingam1

Abstract— In this paper, we propose VLASE, a framework to use semantic edge features from images to achieve on-road localization. Semantic edge features denote edge contours that separate pairs of distinct objects such as building-sky, road-sidewalk, and building-ground. While prior work has shown promising results by utilizing the boundary between prominent classes such as sky and building using skylines, we generalize this approach to consider semantic edge features that arise from 19 different classes. Our localization algorithm is simple, yet very powerful. We extract semantic edge features using the recently introduced CASENet architecture and utilize the VLAD framework to perform image retrieval. Our experiments show that we achieve improvement over some of the state-of-the-art localization algorithms such as SIFT-VLAD and its deep variant NetVLAD. We use an ablation study to analyze the importance of different semantic classes, and show that our unified approach achieves better performance compared to individual prominent features such as skylines.

I. INTRODUCTION

In the pre-GPS era, we did not describe a location using latitude-longitude coordinates. The typical description of a location was based on certain semantic proximity, such as a tall building, a traffic light, or an intersection. While recent successful image-based localization methods rely on either complex hand-crafted features like SIFT [1] or automatically learnt features using CNNs, we would like to take a step back and ask the following question: How powerful are simple semantic cues for the task of localization? There is a general consensus that the salient features for localization are not always human-understandable, and it is important to capture special visual signatures imperceptible to the eye. Surprisingly, this paper shows that simple human-understandable semantic features, although extracted using CNNs, provide accurate localization in urban scenes, and they compare favorably to some of the state-of-the-art localization methods that employ SIFT features in a VLAD [2] framework.

Fig. 1 illustrates the basic idea of this paper. Given an image from a vehicle, we first detect semantic boundaries, the pixels between different object classes. In this paper, we use the recently introduced CASENet [4] architecture to extract semantic boundaries. The CASENet architecture not only produces state-of-the-art semantic edge performance on standard datasets such as SBD [5] and Cityscapes [6] but also provides a multi-label framework where edge pixels are associated with more than one object class. For example, a pixel lying on the edge between sky and buildings will

∗ indicates equal contribution.

1 University of Utah, Salt Lake City, UT 84112, USA {xin.yu,sagar.chaturvedi,srikumar}@utah.edu

2 Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA {cfeng,taguchi,tlee}@merl.com

Color legend: building+pole, road+sidewalk, road, sidewalk+building, building+traffic sign, building+car, road+car, building, building+vegetation, road+pole, building+sky, pole+car, building+person, pole+vegetation.

Fig. 1. Illustration of VLASE. Given images (left) from a vehicle, we extract semantic edge features (middle). Different colors indicate different combinations of object classes. The extracted semantic features are compared to the features from geo-tagged images in a database to estimate the location. In this example, the red and yellow circles on the map (right) indicate the locations of the two given images. (The images are from the KAIST WEST sequences captured at 9 AM [3].)

be associated with both sky and building class labels. This allows our method to unify multiple semantic classes as localization features. The middle column of Fig. 1 shows the semantic edge features. By matching the semantic edge features between a query image and geo-tagged images in a database, which is achieved using VLAD in this paper, we can estimate the location of the query image, as illustrated on the Google map in the right of Fig. 1.

Besides the matching between semantic edge features, we also observed that in the context of on-road vehicles, appending 2D spatial location information to the extracted features (SIFT or CASENet) boosts the localization performance by a large margin. In this paper, we heavily rely on the prior that the images are captured from a vehicle-mounted camera, and exploit edge features that are typical in urban scenes. In addition, we sample only a very limited set of poses for on-road vehicles. The motion is near-planar and the orientation is usually aligned with the direction of the road. It is common for many recent methods to make this assumption since the primary application is accurate vehicle localization in urban canyons, where GPS suffers from multi-path effects.

We briefly summarize our contributions as follows:
• We propose VLASE, a simple method that uses semantic edge features for the task of vehicle localization. The idea of simple semantic cues for localization is not completely new, as individual features such as horizon, road maps, and skylines [7]–[10] have been shown to be beneficial. In contrast to these methods, our method is a unified framework that allows the incorporation of multiple semantic classes.


• We show that it is beneficial to augment semantic features with 2D spatial coordinates. This is counter-intuitive to prior methods that utilize invariant features in a bag-of-words paradigm. In particular, we show that even standard SIFT-VLAD can be significantly improved by embedding additional keypoint locations.

• We show compelling experimental results on two different datasets, including the public KAIST [3] and a route collected by us in Salt Lake City. We outperform competing localization methods such as standard SIFT-VLAD [2], pre-trained NetVLAD [11], and the coarse localization in [12], even with smaller descriptor dimensions. Our results are comparable to, and arguably slightly better than, the improved SIFT-VLAD that incorporates keypoint locations in the features.

II. RELATED WORK

The vision [13] and robotics [14] communities have witnessed the rise of accurate and efficient image-based localization techniques that can be complementary to GPS, which is prone to error due to multi-path effects. The techniques can be classified into regression-based methods and retrieval-based ones. Regression-based methods [15]–[17] directly obtain the location coordinates from a given image using techniques such as CNNs. Retrieval-based methods match a given query image to thousands of geo-tagged images in a database, and predict the location estimates for the query image based on the nearest or k-nearest neighbors in the database. Regression-based methods provide the best advantage in both memory and speed. For example, methods like PoseNet [15] do not require a huge database with millions of images, and the location estimation can be done faster than real time (e.g., 200 Hz). On the contrary, retrieval-based ones are usually slower and have a large memory requirement for storing images or their descriptors for an entire city or the globe. However, the retrieval-based methods typically provide higher accuracy and robustness [13].

A. Features

In this paper, we will focus on the retrieval-based approach, which essentially finds the distance between a pair of images using extracted localization features. Based on human understandability, we broadly classify the localization features into the following two categories:
Simple Features: We refer to simple features as the ones that are human-understandable: line segments, horizon, road maps, and skylines. Skylines or the horizon separating sky from buildings or mountains can be used for localization [7]–[10]. Several existing methods use 3D models and/or omni-directional cameras for geolocalization [8], [18]–[25]. Line segments have been shown to be very useful for localization. The localization can be achieved by registering an image with a 3D model or a geo-tagged image. By directly aligning the lines from query images to the ones in a line-based 3D model we can achieve localization [18], [26], [27]. Semantic segmentation of buildings has been used for registering images to 2.5D models [28].

We can also use other human-understandable simple features such as road maps or weather patterns to obtain localization. Visual odometry can provide the trajectory of a vehicle in motion, and by comparing this with road maps, we can compute the location of the vehicle [29], [30]. It is intriguing to see that even weather patterns can act as signatures for localizing an image [31].

Complex Features: The complex ones are visual patterns extracted through hand-crafted feature descriptors or automatically extracted using CNNs. This class of features is referred to as complex since they are not human-understandable, i.e., not easily perceptible to the human eye. Some of the earlier methods used SIFT or SURF descriptors to match a query image with a database of images [32]–[34]. It is possible to achieve localization on a global scale using GPS-tagged images from the web and matching the query image using a wide variety of image features such as color and texton histograms, gist descriptors, geometric context, and even timestamps [35], [36].

The use of neural networks for localization is an old idea. RATSLAM [37] is a classical SLAM algorithm that uses a neural network with local view cells to denote locations and pose cells to denote heading directions. The algorithm produces a "very coarse" trajectory in comparison to existing SLAM techniques that employ filtering methods or bundle-adjustment machinery. Kendall et al. [15] presented PoseNet, a 23-layer deep convolutional neural network based on GoogLeNet [38], to compute the pose in a large region at 200 Hz. CNNs can also be applied to learn the distance metric to match two images. As one can achieve localization by matching an image taken at ground level to a reference database of geo-tagged bird's-eye, aerial, or even satellite images [39]–[42], such cross-matching is typically done using siamese networks [43]. Recently, it was shown that LSTMs can be used to achieve accurate localization in challenging lighting conditions [44]. A survey of different state-of-the-art localization techniques is given in [14], and many newer datasets have been released [13], [45]. The idea of dominant set clustering is powerful for localization tasks [46]. Many existing methods formulate the localization problem in a manner similar to per-exemplar SVMs in object recognition. To handle the limitation of having very few positive training examples, a new approach that calibrates all the per-location SVM classifiers using only the negative examples has been proposed [47].

In this paper, we combine the above two categories by localizing from human-interpretable semantic edge features learnt from a state-of-the-art CNN [4]. Note that very recently semantic segmentation has also been used with either a sparse 3D model [12] or depth images [48] for long-term 3D localization. We show by experiments that VLASE improves on the semantic-histogram-based coarse localization in [12].


B. Vocabulary tree

In retrieval-based methods, we match a query image to millions of images in a database. The computational efficiency is largely addressed by the bag-of-words (BOW) representation that aggregates local descriptors into a global descriptor and enables fast large-scale image search [49]–[51]. Recently, extensions of BOW including the Fisher vector and the Vector of Locally Aggregated Descriptors (VLAD) showed state-of-the-art performance [2]. Experimental results demonstrate that VLAD significantly outperforms BOW for the same descriptor size. It is cheaper to compute and its dimensionality can be reduced to a few hundred components by PCA without noticeably impacting its accuracy.

The logical extension to VLAD is NetVLAD, where Arandjelovic et al. propose to mimic VLAD in a CNN framework and design a trainable generalized VLAD layer, NetVLAD, for the place recognition task [11]. This layer can be used in any CNN architecture and allows training via back-propagation. NetVLAD was shown to outperform non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over state-of-the-art compact image representations on standard image retrieval benchmarks.

III. SEMANTIC EDGES FOR LOCALIZATION

This section explains our main algorithm of using semantic edge features for localization. The main idea is very simple. Similar to the use of SIFT features in a VLAD framework, we use CASENet multi-label semantic edge class probabilities as compact, low-dimensional, and interpretable features. Similar to standard BOW, VLAD constructs a codebook from a database of feature descriptors (SIFT or CASENet) by performing a simple K-means clustering algorithm on those descriptors. Here we denote the M clusters as C = {c1, . . . , cM}. Given a query image, each of its feature descriptors xi is associated with the nearest cluster cj in the codebook. The main idea in VLAD is to accumulate the difference vector xi − cj for every xi that is associated with cj. VLAD is considered superior to traditional BOW methods mainly because this residual statistic provides more information and enables better discrimination.
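To make the aggregation step concrete, the following minimal sketch (in Python with NumPy; the function and variable names are ours, not from any released code) accumulates the residuals xi − cj for a set of descriptors against a given codebook:

    import numpy as np

    def vlad_aggregate(features, centers):
        """Accumulate VLAD residuals: for each descriptor x_i, add (x_i - c_j)
        to the row of its nearest codeword c_j.
        features: (N, D) descriptors; centers: (M, D) codebook."""
        M, D = centers.shape
        # assign each descriptor to its nearest cluster center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assignments = np.argmin(dists, axis=1)
        vlad = np.zeros((M, D))
        for j in range(M):
            members = features[assignments == j]
            if len(members) > 0:
                vlad[j] = (members - centers[j]).sum(axis=0)  # residual statistic
        return vlad  # M x D matrix, later flattened and normalized

The resulting M × D matrix is subsequently flattened and normalized to form the final VLAD descriptor.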

To detect the semantic edges, we use the recently introduced CASENet architecture, whose code is publicly available1. Given an input image I, we first apply a pretrained CASENet to compute the multi-label semantic edge probabilities Y(p) = [Y1(p), · · · , YK(p)] for each pixel p ∈ I. Here K is the number of object classes. Then we select all edge pixels {q ∈ I | Yk(q) ≥ Te, ∃k ∈ [1, · · · ,K]}, i.e., pixels that have at least one semantic edge label probability exceeding a given threshold Te. Thus, for any image, we can compute a set of K-dimensional CASENet edge features (for the Cityscapes dataset, K = 19). We further augment this K-dimensional feature by appending to its end a 2-dimensional normalized-pixel-position feature [qx/W, qy/H], where W

1http://www.merl.com/research/?research=license-request&sw=CASENet

Color legend: building+pole, road+sidewalk, road, sidewalk+building, building+traffic sign, building+car, road+car, building, building+vegetation, road+pole, building+sky, pole+car, building+person, pole+vegetation.

Fig. 2. CASENet edge feature and VLAD. Top: An example input image (left) and its CASENet features (right). Each color corresponds to an object class. Bottom: Visualization of a CASENet-VLAD vocabulary of M = 256 codewords/cluster centers, shown as color-coded dots. For the dot of each cluster, its x-y position corresponds to Y20 and Y21, and its color is computed from the CASENet features Yk. The Voronoi graph (black edges with small green nodes) shows the CASENet-VLAD division of the x-y image space. The background of the bottom image is an average of the CASENet feature visualizations from all images used to train the codebook. As the background shows an averaged semantic on-road driving scene, it can be seen that the colors of the dots at the cluster centers are distributed similarly to the colors of this average scene.

and H are the fixed image width and height, and qx and qy are the column and row indices, respectively, for a pixel q. We will refer to such a K + 2 dimensional feature Y as an augmented CASENet edge feature.
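As an illustration, the following sketch builds the augmented K + 2 dimensional features described above, assuming the CASENet output is available as a K × H × W probability array (the threshold value and function name are ours):

    import numpy as np

    def augmented_casenet_features(prob_map, threshold=0.5):
        """prob_map: (K, H, W) semantic edge probabilities from CASENet.
        Returns an (N, K+2) array: class probabilities plus normalized (x, y)."""
        K, H, W = prob_map.shape
        # keep pixels where at least one class probability exceeds the threshold
        mask = (prob_map >= threshold).any(axis=0)
        rows, cols = np.nonzero(mask)
        probs = prob_map[:, rows, cols].T             # (N, K) per-pixel class probabilities
        xy = np.stack([cols / W, rows / H], axis=1)   # normalized column (x) and row (y)
        return np.hstack([probs, xy])                 # augmented K+2 dimensional features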

Due to the often much larger number of edge pixels compared to SIFT/SURF features in an image, to build a visual codebook or vocabulary following the VLAD framework, we run a sequential instead of a full K-means algorithm (MiniBatchKMeans, implemented in the Python package scikit-learn [52]), using all the augmented CASENet edge features of one training image as a mini-batch. This is iterated over the whole training image set for multiple epochs until it converges to M centers [C1, · · · , CM], each in the K + 2 dimensional space, forming the trained CASENet-VLAD codebook. An example is visualized in Figure 2.
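A minimal sketch of this sequential codebook training, assuming an extract_fn such as the augmented-feature sketch above and using scikit-learn's MiniBatchKMeans with its partial_fit interface (the epoch count shown is illustrative, not necessarily the value used in our experiments):

    from sklearn.cluster import MiniBatchKMeans

    def train_codebook(training_images, extract_fn, n_clusters=64, n_epochs=5):
        """Sequentially fit a K-means codebook, treating the augmented CASENet
        features of one training image as one mini-batch."""
        kmeans = MiniBatchKMeans(n_clusters=n_clusters)
        for _ in range(n_epochs):            # iterate over the whole set for several epochs
            for image in training_images:
                feats = extract_fn(image)    # (N, K+2) augmented edge features
                kmeans.partial_fit(feats)    # one image = one mini-batch
        return kmeans.cluster_centers_       # (M, K+2) codebook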

To perform on-road place recognition, we first need to process a sequence of images serving as the visual map, i.e., the mapping sequence. This is done by extracting all augmented CASENet edge features on each image and computing a corresponding M × (K+2) CASENet-VLAD descriptor D using the trained codebook, with power-normalization followed by L2-normalization. The CASENet-VLAD descriptors for the mapping sequence are then stored in a database. During place recognition, we repeat this process for the current query image to get its CASENet-VLAD descriptor and search in the mapping database for the top-N most similar descriptors using cosine distance. This pipeline is further illustrated in Figure 3.
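The normalization and retrieval steps can be sketched as follows (a simplified illustration; it takes the M × (K+2) residual matrix produced by the aggregation sketch above, and the function names are ours):

    import numpy as np

    def normalize_vlad(vlad_matrix):
        """Flatten the M x (K+2) residual matrix, then apply power- (signed
        square-root) normalization followed by L2 normalization."""
        d = np.asarray(vlad_matrix).ravel()
        d = np.sign(d) * np.sqrt(np.abs(d))   # power normalization
        norm = np.linalg.norm(d)
        return d / norm if norm > 0 else d    # L2 normalization

    def retrieve_top_n(query_desc, database_descs, n=5):
        """Rank the mapping-sequence descriptors by cosine distance to the query.
        For unit-norm vectors, cosine similarity reduces to a dot product."""
        db = np.asarray(database_descs)       # (num_map_images, M*(K+2)), normalized
        similarity = db @ query_desc
        return np.argsort(-similarity)[:n]    # indices of the top-N most similar map images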


[Figure 3 diagram: the query image and the mapping images are converted to CASENet-VLAD descriptors using the CASENet-VLAD codebook; the map database entries are ranked from near to far by cosine distance.]

Fig. 3. VLASE pipeline. All mapping images are first processed by CASENet, from which we build a VLAD codebook using all CASENet features. We then compute each image's CASENet-VLAD descriptor D (the last two dimensions of each residual vector, i.e., D(:, 20) and D(:, 21), are visualized as 2D vectors originating at the corresponding codeword/cluster center, i.e., Cm). During localization, we similarly compute the currently observed image's CASENet-VLAD descriptor, and query the database for the top-N closest descriptors in terms of cosine distance. Note that while the geometric shapes of the three CASENet edges in column two are visually similar to each other, their corresponding CASENet-VLAD descriptors in the last column are more discriminative, even when visualized using only the last two dimensions.

Fig. 4. The testing routes of our experiments: SLC (left) and KAIST West (right).

IV. DATASETS

We have experimented on two visual place recognition datasets. The first is called SLC, which was captured in downtown Salt Lake City. The second is called KAIST, which is one of the routes from the KAIST All-Day Visual Place Recognition dataset [53].

A. SLC

We created our own dataset by capturing two video sequences in the downtown of Salt Lake City. The length of our route is about 15 km, as shown in Figure 4. The two sequences were captured at different times, and thus they have adequate lighting variations for the same locations, with an abundance of objects belonging to the classes in the Cityscapes dataset. We used a Garmin dash-cam to collect videos of the scenes in front of the vehicle. This dash-cam stored the videos at 30 FPS, and the two sequences have 98513 and 89633 frames. We resized the images from the original resolution of 1920×1080 to 640×360 pixels. A special feature of this dash-cam is that it also encodes the GPS coordinates in latitude and longitude, which provide the ground truth for our video frames. Since the frame rate of the SLC sequences is 30 FPS but only the first frame within every second has a GPS coordinate, we sampled every 30 frames from the SLC sequences. We use the longer SLC sequence (98513 frames) as the database of 3284 images and for computing the VLAD codebook; it is denoted as loop1 hereafter. The other sequence is denoted as loop2, which has 2988 sampled frames for querying.

B. KAIST

The KAIST dataset was captured by Choi et al. [53] on the campus of the Korea Advanced Institute of Science and Technology (KAIST). They captured 42 km of sequences at 15–100 Hz using multiple sensor modalities such as fully aligned visible and thermal devices, high-resolution stereo visible cameras, and a high-accuracy GPS/IMU inertial navigation system. The sequences cover 3 routes on the campus, which are denoted as west, east, and north. Each route has 6 sequences recorded at different times of a day, including day (9 AM, 2 PM), night (10 PM, 2 AM), sunset (7 PM), and sunrise (5 AM). As these sequences capture various illumination conditions, this dataset is helpful for benchmarking under lighting variations.

We used two sequences captured on the west route, as shown in Figure 4. The two sequences were captured at 5 AM and 9 AM, under sunrise and daylight conditions, respectively. The sequence at 9 AM contains more dynamic class objects than that at 5 AM. We resized the images from their original size of 1280 × 960 to 640 × 480 pixels. The images were captured at 15 FPS while the GPS coordinates were measured at 10 Hz. Similar to SLC, we sampled the


TABLE I
ABLATION STUDY RESULTS FOR THE SLC DATASET.

                    Top-1 Accuracy       Top-5 Accuracy
Removed            5m    10m   20m      5m    10m   20m
Road               53    79    91       89    96    98
Sidewalk           54    80    91       87    94    96
Building           50    75    86       81    90    92
Wall               50    76    87       85    92    94
Fence              54    80    90       87    94    96
Pole               51    75    88       85    93    95
Light              51    75    87       84    92    95
Sign               51    76    87       85    93    95
Veg                50    74    85       83    92    95
Terrain            51    77    88       84    91    94
Sky                50    75    85       82    91    93
Person             52    79    90       87    94    96
Rider              51    78    89       87    94    96
Car                54    82    93       89    97    98
Truck              53    80    91       88    95    97
Bus                51    77    88       84    92    95
Train              54    79    91       87    95    97
Motorcycle         51    77    88       85    92    95
Bicycle            52    77    89       86    94    96

Combinations       5m    10m   20m      5m    10m   20m
All                52    78    90       87    94    96
Static             56    82    94       91    98    99
Bld-Sky            49    73    85       77    91    94
Veg-Sky            57    83    95       89    96    98
Veg-Bld-Sky        55    80    91       86    94    96
All w/o (x,y)      44    67    76       77    86    89

Baselines          5m    10m   20m      5m    10m   20m
SIFT+(x,y)         32    47    60       48    61    66
SIFT               22    36    43       32    45    48
Toft [12]          32    55    63       57    73    79

route captured at 9 AM as the database of 3254 images and for computing the VLAD codebook, and the route captured at 5 AM for querying (2207 images).

V. EXPERIMENTS

A. Settings

CASENet: We use the CASENet model pre-trained on the Cityscapes dataset [6]. It contains 19 object classes, which are also seen in our testing video sequences. We used NVIDIA Titan Xp GPUs to extract CASENet features, which can process around 1.25 images per second using the original CASENet code. We did not retrain CASENet for our datasets, since getting ground-truth semantic edges is a tedious manual task. We observed that the pre-trained model was sufficient to provide qualitatively accurate semantic edge features.
VLAD: We compared the CASENet-based semantic edge features to SIFT [2], and used VLAD to aggregate both into descriptors for image retrieval. To decide the number of clusters for VLAD, we searched for the optimal cluster number among 32, 64, and 256 by experiments, with MiniBatchKMeans of at most 10,000 iterations. Our experiments showed that 64 clusters for CASENet features and 32 for SIFT are optimal, and thus we applied these cluster numbers in further experiments. Note that although the CASENet feature dimension is much smaller than SIFT's (19 vs. 128), there are more CASENet features per image as we get them for each edge pixel. As a result, CASENet works better with more clusters than SIFT. The VLAD codebooks of both were trained on CPUs. With an Intel Xeon E5-2640 CPU and 125 GB of usable memory, the training for 3000 images took about 30 minutes.

TABLE II
ABLATION STUDY RESULTS FOR THE KAIST DATASET.

                    Top-1 Accuracy       Top-5 Accuracy
Removed            5m    10m   20m      5m    10m   20m
Road               72    84    90       88    91    94
Sidewalk           71    84    91       88    92    95
Building           71    84    90       88    91    94
Wall               73    85    90       87    91    94
Fence              73    86    92       90    93    96
Pole               70    84    89       87    91    94
Light              73    86    91       88    93    95
Sign               71    84    90       88    92    95
Veg                69    82    87       87    91    93
Terrain            72    84    90       88    91    94
Sky                73    85    91       88    93    95
Person             74    86    91       89    92    95
Rider              72    85    90       88    92    95
Car                77    88    93       91    94    96
Truck              72    86    90       89    93    94
Bus                74    86    90       89    92    94
Train              74    85    91       88    92    95
Motorcycle         72    85    90       88    92    95
Bicycle            73    85    90       88    92    95

Combinations       5m    10m   20m      5m    10m   20m
All                73    85    91       89    92    95
Static             77    88    92       91    94    96
Bld-Sky            62    74    83       82    87    91
Veg-Sky            73    83    88       87    90    93
Veg-Bld-Sky        73    84    89       87    91    93
All w/o (x,y)      64    78    85       83    88    91

Baselines          5m    10m   20m      5m    10m   20m
SIFT+(x,y)         84    89    91       90    92    93
SIFT               81    86    88       88    89    90
Toft [12]          60    73    80       78    85    88

Evaluation criteria: We measured both top-1 and top-5 retrieval accuracy under different distance thresholds (5, 10, 15, and 20 meters). If any of the top-k retrieved images is within the distance threshold of the query image, we counted it as a successful localization.
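This criterion can be sketched as follows (assuming the GPS coordinates have already been converted to metric coordinates; the array and function names are ours):

    import numpy as np

    def topk_accuracy(retrieved_xy, query_xy, threshold_m):
        """retrieved_xy: (num_queries, k, 2) metric coordinates of the top-k matches;
        query_xy: (num_queries, 2) ground-truth coordinates of the queries.
        A query counts as localized if any top-k match lies within the threshold."""
        dists = np.linalg.norm(retrieved_xy - query_xy[:, None, :], axis=2)
        success = dists.min(axis=1) <= threshold_m
        return 100.0 * success.mean()   # accuracy in percent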

B. Results and Ablation Studies

Figure 5 shows our main results compared with several baselines. Fig. 8 presents several best and worst matching examples by our method. We also performed ablation studies on the importance of 1) object classes and 2) spatial coordinates used for feature augmentation, with results listed in Tables I and II.
Object classes: We first investigated the importance of different subsets of the 19 Cityscapes classes for localization (all augmented by 2D spatial coordinates) with two goals. The first is to evaluate individual class contributions to the accuracy. The second is to compare our approach with existing methods that also use semantic boundaries but with far fewer classes. For example, one of the popular localization cues is skylines (edges between building and sky, or vegetation and sky) [7]–[10].

For SLC and in most cases, removing dynamic classes (listed in the second half of the first block of Table I) yields better accuracy than using all classes, e.g., removing cars improves the accuracy by 2%. Note that in some cases, removal of some dynamic classes causes minor drops in accuracy, e.g., removing motorcycle and bus, which we believe is insignificant and mainly due to the lack of those classes in our dataset. As per our expectation, using only the static classes (11 out of the 19 classes) of CASENet performs better than using all classes, for both datasets. Specifically, building, sky


[Figure 5 panels: SLC-Top1, SLC-Top5, KAIST-Top1, KAIST-Top5. x-axis: distance threshold (m); y-axis: accuracy (%). Curves: SIFT + (x,y), Toft, Pretrained VGG+NetVLAD (SLC only), All classes, Car class removed, Static classes, Sky + Building, Sky + Vegetation.]

Fig. 5. Localization accuracies. (a) and (b) show the results for the SLC dataset, while (c) and (d) show the results for the KAIST dataset. The x-axis represents the distance threshold and the y-axis represents the accuracy. Non-CASENet results are shown using dashed lines. No weighting of features is applied. Note that for KAIST, the pretrained VGG-NetVLAD performance is very low (even with retraining), thus we do not include it here. Note that CASENet is not retrained either.

and wall are the top 3 individual contributors, as removing them causes the highest drops in accuracy. Also, using only vegetation and sky is comparable to using all static classes.

For KAIST, vegetation seems to be the most important individual class; removing it causes the highest drop in accuracy. The building and sky classes individually do not seem very significant. Again, using only static CASENet features performs better than any other feature combination.
Spatial coordinates: Besides object classes and their probabilities, we also tried removing the 2D-image-coordinate augmentation from the feature descriptors for both CASENet and SIFT. Surprisingly, this augmentation boosted the performance of both SIFT and CASENet by a large margin: SIFT+(x,y) vs. SIFT, and All vs. All w/o (x,y) in Tables I and II. While this result seems counter-intuitive due to the loss of invariance in the feature descriptors, on-road vehicle localization is a more restricted setup, and such constraints lead to high-accuracy localization.

A natural concern for such direct augmentation is the weighting of the spatial coordinates compared with the object class probabilities or SIFT features, which have many more dimensions. Thus we investigate the effect of a weighted feature augmentation as Y = [αY1, · · · , αYK, (1−α)Yx, (1−α)Yy], where K = 19 for CASENet and K = 128 for SIFT, and Yx, Yy denote the normalized 2D spatial coordinates. In Figures 6 and 7, we show that a weighted combination of the two indeed achieves the best performance, and that higher weights should be given to the spatial coordinates due to their smaller number of dimensions.
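The weighting itself is a one-line operation; a sketch (with α = 0.1, the value found optimal below; the function name is ours) is:

    import numpy as np

    def weighted_augmentation(class_feats, xy, alpha=0.1):
        """class_feats: (N, K) class probabilities (or SIFT values); xy: (N, 2)
        normalized coordinates. Returns [alpha*Y_1..alpha*Y_K, (1-alpha)*Y_x, (1-alpha)*Y_y]."""
        return np.hstack([alpha * class_feats, (1.0 - alpha) * xy])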

In summary, CASENet-VLAD generally performs better than SIFT-VLAD (and also than augmented SIFT-VLAD for SLC), although the augmentation sometimes makes SIFT comparable to CASENet. For example, augmented SIFT features performed better than CASENet on KAIST, since without augmentation CASENet already performed worse than SIFT (Figure 5). We conjecture the main reason to be the different data distributions between KAIST and Cityscapes, leading to degraded quality of CASENet features without domain adaptation. Note that another deep baseline [12], pretrained on Cityscapes, also performs worse than SIFT on KAIST.

Other deep baselines: We also compare with three deep baselines: 1) Toft et al.'s method [12], which performs semantic segmentation using a pre-trained network [54] and computes a descriptor by combining histograms of static semantic classes as well as gradient histograms of building and vegetation masks in six different regions of the top half of the image; 2) VGG-NetVLAD [11]; and 3) PoseNet [15], a convolutional neural network that regresses the 6-DOF camera pose from a given RGB image. The results of the first deep baseline (our own implementation) and VGG-NetVLAD (the best pre-trained weights from the Pittsburgh dataset provided in [11]) are shown to be worse than CASENet in Figure 5. Note that for KAIST, the pretrained VGG-NetVLAD performance is very low, and even with retraining the performance is still below 30%, thus we exclude it from Figure 5. For the application of PoseNet in this paper, instead of the 6-DOF output, we only regress 3 values from an image: the x- and y-location, and the orientation of the vehicle. Based on our initial experiments, we observed that the performance of PoseNet is less than 50%. This is much lower than the other methods tested in this paper (Figure 5). We plan to investigate this further, but the high error could be due to the fact that the restricted pose parameters of the on-road vehicles (mostly straight lines and occasional turns) are insufficient to train the network.

VI. DISCUSSION

We proposed and validated a simple method to achieve high-accuracy localization using recently introduced semantic edge features [4]. While SIFT is one of the earliest feature descriptors used for localization, SIFT-VLAD is still considered a state-of-the-art localization algorithm. We show significant improvement over the standard SIFT-VLAD, and we perform favorably against the augmented SIFT-VLAD method. While the CASENet features are trained only on the Cityscapes dataset, the pretrained model was sufficient for achieving state-of-the-art localization accuracy.

Another interesting result that came out of our analysis is that the skyline (either between building and sky, or between vegetation and sky) is a very powerful localization cue. In some of the datasets where there is too much lighting


[Figure 6 panels: SIFT-VLAD on SLC, CASENet-VLAD on SLC, SIFT-VLAD on KAIST, CASENet-VLAD on KAIST. x-axis: α from 0 to 1; y-axis: accuracy (%). Curves: 5m-top1, 5m-top5, 20m-top1, 20m-top5.]

Fig. 6. Effect of weighted spatial coordinate augmentation on SLC (left) and KAIST (right). At the optimal α = 0.1, CASENet is still better than SIFT.

[Figure 7 panels: SLC-Top1 with α = 0.1, SLC-Top5 with α = 0.1, KAIST-Top1 with α = 0.1, KAIST-Top5 with α = 0.1. x-axis: distance threshold (m); y-axis: accuracy (%). Curves: SIFT + (x,y), Toft, Pretrained VGG+NetVLAD (SLC only), All classes, Car class removed, Static classes, Sky + Building, Sky + Vegetation.]

Fig. 7. Localization accuracies using weighted augmentation, with α = 0.1 found to be optimal for both SIFT and CASENet. Other settings are the same as in Figure 5. Note that Toft [12] and NetVLAD are not weighted.

Color legend: building+pole, road+sidewalk, road, sidewalk+building, building+traffic sign, building+car, road+car, building, building+vegetation, road+pole, building+sky, pole+car, building+person, pole+vegetation.

Fig. 8. Successful and failed matches of CASENet+VLAD. The top 2 rows show good matches. The bottom 2 rows show two of the worst results, where the true distance is greater than 2 km. In the 3rd row, the presence of a dynamic object such as the train might lead to the high error.

variation, the feature descriptor that uses just the skylines produces results that are only marginally inferior to using all the CASENet features.

While the main localization idea is simple, we believe that this work unifies several ideas in the community. Furthermore, it has already been shown that semantic segmentation and depth estimation are closely related to each other [55], [56]. This paper takes a step towards showing that semantic segmentation and localization are also closely related, making one more argument towards holistic scene understanding.

In the future, we plan to consider retraining CASENet for images under bad lighting conditions. While this work was primarily about understanding how useful semantic edges are, we plan to explore more CNN-based VLAD techniques [11]. We will release the SLC dataset and code for research purposes.

REFERENCES

[1] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.

[2] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," PAMI, vol. 34, no. 9, pp. 1704–1716, Sept. 2012.


[3] Y. Choi, N. Kim, K. Park, S. Hwang, J. Yoon, and I. Kweon, "All-day visual place recognition: Benchmark dataset and baseline," in CVPR 2015 Workshop on Visual Place Recognition in Changing Environments, 2015, pp. 8–10.

[4] Z. Yu, C. Feng, M. Y. Liu, and S. Ramalingam, "CASENet: Deep category-aware semantic edge detection," arXiv preprint arXiv:1705.09759, 2017.

[5] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in ICCV, 2011.

[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.

[7] M. Bansal and K. Daniilidis, "Geometric urban geo-localization," in CVPR, 2014.

[8] J. Meguro, T. Murata, H. Nishimura, Y. Amano, T. Hasizume, and J. Takiguchi, "Development of positioning technique using omni-directional IR camera and aerial survey data," in Advanced Intelligent Mechatronics, 2007.

[9] S. Ramalingam, S. Bouaziz, P. Sturm, and M. Brand, "Localization in urban canyons using omni-skylines," in IROS, 2010.

[10] O. Saurer, G. Baatz, K. Koeser, L. Ladicky, and M. Pollefeys, "Image based geo-localization in the Alps," IJCV, 2015.

[11] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in CVPR, 2016.

[12] C. Toft, C. Olsson, and F. Kahl, "Long-term 3D localization and pose from semantic labellings," in ICCV, 2017, pp. 650–659.

[13] T. Sattler, W. Maddern, A. Torii, J. Sivic, T. Pajdla, M. Pollefeys, and M. Okutomi, "Benchmarking 6DOF urban visual localization in changing conditions," arXiv preprint arXiv:1707.09092, 2017.

[14] S. Lowry, N. Sunderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, "Visual place recognition: A survey," T-RO, 2016.

[15] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in ICCV, 2015.

[16] T. Weyand, I. Kostrikov, and J. Philbin, "PlaNet - photo geolocation with convolutional neural networks," in ECCV. Springer, 2016, pp. 37–55.

[17] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, "DSAC - differentiable RANSAC for camera localization," in CVPR, vol. 3, 2017.

[18] O. Koch and S. Teller, "Wide-area egomotion estimation from known 3D structure," in CVPR, 2007.

[19] F. Stein and G. Medioni, "Map-based localization using the panoramic horizon," in IEEE Transactions on Robotics and Automation, 1995.

[20] J. Tardif, Y. Pavlidis, and K. Daniilidis, "Monocular visual odometry in urban environments using an omnidirectional camera," in IROS, 2008.

[21] B. Zeisl, T. Sattler, and M. Pollefeys, "Camera pose voting for large-scale image-based localization," in ICCV, 2015.

[22] T. Sattler, B. Leibe, and L. Kobbelt, "Improving image-based localization by active correspondence search," in ECCV, 2012.

[23] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, "Worldwide pose estimation using 3D point clouds," in ECCV, 2012.

[24] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, "Visual place recognition with repetitive structures," in CVPR, 2013.

[25] A. Majdik, D. Verda, Y. Albers-Schoenberg, and D. Scaramuzza, "Air-ground matching: Appearance-based GPS-denied urban localization of micro aerial vehicles," J. Field Robot., 2015.

[26] S. Ramalingam, S. Bouaziz, and P. Sturm, "Pose estimation using both points and lines for geo-localization," in ICRA, 2011.

[27] B. Micusik and H. Wildenauer, "Descriptor free visual indoor localization with line segments," in CVPR, 2015.

[28] A. Armagan, M. Hirzer, P. M. Roth, and V. Lepetit, "Learning to align semantic segmentation and 2.5D maps for geolocalization," in CVPR, 2017.

[29] H. Badino, D. Huber, Y. Park, and T. Kanade, "Real-time topometric localization," in ICRA, 2012.

[30] M. A. Brubaker, A. Geiger, and R. Urtasun, "Lost! leveraging the crowd for probabilistic visual self-localization," in CVPR, 2013, pp. 3057–3064.

[31] N. Jacobs, S. Satkin, N. Roman, R. Speyer, and R. Pless, "Geolocating static cameras," in ICCV, 2007.

[32] D. Robertson and R. Cipolla, "An image-based system for urban navigation," in BMVC, 2004.

[33] W. Zhang and J. Kosecka, "Image based localization in urban environments," in 3DPVT, 2006, pp. 33–40.

[34] M. Cummins and P. Newman, "Appearance-only SLAM at large scale with FAB-MAP 2.0," International Journal of Robotics Research, vol. 30, no. 9, pp. 1100–1123, 2011.

[35] J. Hays and A. Efros, "IM2GPS: estimating geographic information from a single image," in CVPR, 2008.

[36] E. Kalogerakis, O. Vesselova, J. Hays, A. Efros, and A. Hertzmann, "Image sequence geolocation with human travel priors," in ICCV, 2009.

[37] M. J. Milford, G. Wyeth, and D. Prasser, "RatSLAM: A hippocampal model for simultaneous localization and mapping," in ICRA, 2004.

[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.

[39] Y. Tian, C. Chen, and M. Shah, "Cross-view image matching for geo-localization in urban environments," arXiv preprint arXiv:1703.07815, 2017.

[40] N. N. Vo and J. Hays, "Localizing and orienting street views using overhead imagery," in ECCV, 2016.

[41] T. Y. Lin, Y. Cui, S. Belongie, and J. Hays, "Learning deep representations for ground-to-aerial geolocalization," in CVPR, 2015.

[42] S. Pillai and J. Leonard, "Self-supervised place recognition in mobile robots," in Learning for Localization and Mapping Workshop, IROS, 2017.

[43] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah, "Signature verification using a 'siamese' time delay neural network," in NIPS, 1994.

[44] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-based localization with spatial LSTMs," CoRR, vol. abs/1611.07890, 2016.

[45] X. Sun, Y. Xie, P. Luo, and L. Wang, "A dataset for benchmarking image-based localization," in CVPR, 2017.

[46] E. Zemene, Y. Tariku, H. Idrees, A. Prati, M. Pelillo, and M. Shah, "Large-scale image geo-localization using dominant sets," arXiv preprint arXiv:1702.01238, 2017.

[47] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla, "Learning and calibrating per-location classifiers for visual place recognition," in CVPR, 2013.

[48] J. L. Schonberger, M. Pollefeys, A. Geiger, and T. Sattler, "Semantic visual localization," arXiv preprint arXiv:1712.05773, 2017.

[49] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in CVPR, 2006.

[50] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in CVPR, 2007, pp. 1–7.

[51] J. Lee, S. Lee, G. Zhang, J. Lim, I. H. Suh, and W. K. Chung, "Outdoor place recognition in urban environments using straight lines," in ICRA, 2014.

[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[53] Y. Choi, N. Kim, K. Park, S. Hwang, J. Yoon, and I. Kweon, "All-day visual place recognition: Benchmark dataset and baseline," in CVPR, 2015.

[54] G. Ghiasi and C. C. Fowlkes, "Laplacian pyramid reconstruction and refinement for semantic segmentation," in ECCV. Springer, 2016, pp. 519–534.

[55] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, "Towards unified depth and semantic prediction from a single image," in CVPR, 2015.

[56] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in ICCV, 2015.