
Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for Unsupervised Domain Adaptation in Segmentation

Kaihong Wang, Boston University, [email protected]

Chenhongyi Yang, Boston University, [email protected]

Margrit Betke, Boston University, [email protected]

Abstract

Unsupervised domain adaptation for semantic segmentation has been intensively studied due to the low cost of pixel-level annotation for synthetic data. The most common approaches try to generate images or features mimicking the distribution in the target domain while preserving the semantic contents in the source domain so that a model can be trained with annotations from the latter. However, such methods highly rely on an image translator or feature extractor trained in an elaborate mechanism including adversarial training, which brings in extra complexity and instability in the adaptation process. Furthermore, these methods mainly focus on taking advantage of the labeled source dataset, leaving the unlabeled target dataset not fully utilized. In this paper, we propose a bidirectional style-induced domain adaptation method, called BiSIDA, that employs consistency regularization to efficiently exploit information from the unlabeled target domain dataset, requiring only a simple neural style transfer model. BiSIDA aligns domains by not only transferring source images into the style of target images but also transferring target images into the style of source images to perform high-dimensional perturbation on the unlabeled target images, which is crucial to the success of applying consistency regularization in segmentation tasks. Extensive experiments show that our BiSIDA achieves new state-of-the-art results on two commonly-used synthetic-to-real domain adaptation benchmarks: GTA5-to-CityScapes and SYNTHIA-to-CityScapes.

1. Introduction

Deep learning methods for semantic segmentation [21], the problem of dividing the pixels in an image into mutually exclusive and collectively exhaustive sets of class-labeled regions, have gained increasing attention. Research progress is hindered by the difficulty of creating large training datasets with accurate pixel-level annotations of these regions. As a consequence, the use of synthetic datasets has become popular because pixel-level ground truth annotations can be generated along with the images. Unfortunately, when deep models that were trained on synthetic data are used to segment real-world images, their performance is typically limited due to the domain gap between the training and testing data. Domain adaptation methods seek to bridge the gap between the source domain training data and the target domain testing data. We here focus on unsupervised domain adaptation (UDA), the problem of adapting a model that was trained with a labeled source domain dataset, by using an unlabeled target domain dataset and optimizing its performance on the target domain.

To perform domain alignment on a pixel-level or feature-level basis, existing methods [29, 11, 32, 22, 18, 4] typically use adversarial training [9], and training with the aligned data is then supervised by a loss computed with the annotation of the source domain dataset. However, the use of adversarial training typically comes with extra complexity and instability in training. Alternative approaches [39, 32, 18, 4] seek to exploit information from the unlabeled target dataset by performing semi-supervised learning, including entropy minimization [10], pseudo-labeling [16], and consistency regularization. However, these approaches either play only an auxiliary role alongside supervised learning or fail to take full advantage of the target dataset.

In this paper, we propose Bidirectional Style-induced Domain Adaptation (BiSIDA), which takes better advantage of the unlabeled dataset and optimizes the performance of a segmentation model on the target dataset. Our pipeline includes a supervised learning phase that provides supervision using annotations in the source dataset and an unsupervised phase for learning from the unlabeled target dataset without requiring its annotation. To perform domain adaptation, we construct a non-adversarial yet effective pre-trained style-induced image generator that translates images through style transfer.

1

arX

iv:2

009.

0861

0v1

[cs

.CV

] 1

8 Se

p 20

20

In the supervised learning phase, the style-induced image generator translates images with different styles to align the source domain towards the target domain. In the unsupervised phase, it performs high-dimensional perturbations on target domain images for consistency regularization. Consequently, the unlabeled target dataset is utilized efficiently while the domain gap is simultaneously reduced from the other direction through a self-supervised approach.

Our model performs image translation from the source to the target domain using an image generator in the supervised phase, similar to existing methods. However, in order to facilitate generalization, our model synthesizes images with semantic content from the source domain and a style that is defined by a continuous parameter representing a "mix" of source and target domain styles, instead of transferring the style directly to the target domain. Consequently, the stochasticity of the whole process facilitates not only the training on the original images but also the gradual adaptation towards the target domain. The resulting image is then sent along with its corresponding pixel-level annotation to compute a supervised cross-entropy loss to train the segmentation model.

BiSIDA employs consistency regularization in the unsupervised phase to encourage consistent predictions on randomly perturbed inputs without requiring their annotations. We apply our style-induced image generator as an augmentation method and transfer each target domain image together with a number of randomly sampled source domain images, just as in the supervised phase but in the opposite direction. A series of images with identical content but different styles from source domain images is generated. Given that supervised learning is performed on source images transferred with combined styles of source and target images, our model is better adapted and more likely to produce correct predictions when target domain images are transferred towards the direction of the source domain images. Meanwhile, our style-induced image generator provides a high-dimensional perturbation that keeps the semantic content, as indicated in [7], for consistency regularization in a computationally affordable way. To further improve the quality of predictions, the transferred images are passed through the self-ensemble of the segmentation model, i.e., the exponential moving average of its weights, and the outputs are gathered to obtain a pseudo-label for the unlabeled target domain image. The training of the segmentation model on the original target domain image, augmented with only color space perturbations, is guided by this pseudo-label. During this process, the information contained in the unlabeled target images is learned through the consistency regularization framework, and the model is finally adapted to the target domain.

Combining our supervised and unsupervised learning methods, we are able to utilize annotations from the labeled source dataset, exploit knowledge from the unlabeled target dataset, and perform gradual adaptation between the source and the target domain from both sides. In conclusion, our key contributions include:

1. A Bidirectional Style-induced Domain Adaptation (BiSIDA) framework that incorporates both target-guided supervised and source-guided unsupervised learning. We also show that domain adaptation is achievable in a bidirectional way through a continuous parameterization of the two domains, without requiring adversarial training;

2. A non-adversarial style-induced image generator that performs a high-dimensional source-guided perturbation on target images for consistency regularization;

3. Extensive experiments showing that our BiSIDA achieves new state-of-the-art results on two commonly-used synthetic-to-real domain adaptation benchmarks: GTA5-to-CityScapes and SYNTHIA-to-CityScapes.

2. Related Works

2.1. Image-to-image Translation

Recent progress in image-to-image translation, which transfers the style of an image while preserving its semantic content, has inspired research in various related areas, including image synthesis and reducing domain discrepancy. Typical image-to-image translation approaches include CycleGAN [38] and DualGAN [36], which keep cycle-consistency in adversarial training to preserve the semantic content of images when transferring their style. UNIT [20] and MUNIT [14] address the problem by mapping images into a common latent content space. Neural style transfer offers an alternative way to perform image-to-image translation [8], but its optimization process is computationally impractical. Several works [15, 17, 30, 31, 6] proposed improvements, but these methods are restricted in that the style to be transferred is either fixed or drawn from a limited set of styles.

2.2. Semi-supervised Learning

When the gap between source and target domains becomes small, the problem of unsupervised domain adaptation intriguingly degenerates to semi-supervised learning. Pseudo-labeling [16], a commonly-used semi-supervised learning method, takes predictions on the unlabeled dataset with high confidence as one-hot labels guiding further training. Entropy minimization [10] can be seen as a soft assignment of the pseudo-label on the unlabeled dataset. Recently, consistency regularization has gained attention due to its outstanding performance as a semi-supervised learning method.

Figure 1: The proposed training framework BiSIDA. It consists of two branches: 1) The supervised branch (top) augments a source-domain image through our style-induced image generator with the style of a target-domain image. A supervised segmentation loss L_s is computed with the corresponding annotation of the source image. 2) In the unsupervised learning branch (bottom), a target-domain image and a series of source-domain images are used to produce the corresponding translated images {x̃}. Each of these images passes through the teacher network to generate a set of probability maps {ỹ}. We compute an average of these maps to generate a pseudo-label, which is then used to compute L_u for consistency regularization.

The Mean-Teacher [28] approach minimizes a consistency loss on an unlabeled image between the output of a student network and the ensemble of itself, a teacher network. FixMatch [27] further outperforms Mean-Teacher by performing pseudo-labeling and consistency regularization between images with different degrees of perturbation, and achieves state-of-the-art performance on several semi-supervised learning benchmarks.

2.3. UDA for Semantic Segmentation

Current methods in UDA for segmentation can be categorized into adversarial and non-adversarial methods. "FCN in the wild" [12] was the first to perform a segmentation task under UDA settings and align both global and local features between domains through adversarial training. Other works [11, 29, 32] tried to align features at one or multiple feature levels. The adversarial alignment process of each category between domains can be treated adaptively [22, 33]. [4] train an image translator in an adversarial way and take its output to perform consistency regularization. [18] applied bidirectional learning, in which an image translator and a segmentation model guide each other's training in a mutual way; pseudo-labeling is also performed to enhance performance.

Non-adversarial methods include a variety of techniques. Curriculum DA [37] and PyCDA [19], for example, adopt the concept of curriculum learning and align label distributions over images, landmark superpixels, or regions. CBST [39] utilizes self-training to exploit information from the target domain images. DCAN [34] applies channel-wise alignment to bridge the domain gap at both the pixel and feature levels. Recently, [35] proposed to align pixel-level discrepancy by performing a Fourier transformation. Combined with entropy minimization, pseudo-labeling, and model ensembling, their method achieves the current state-of-the-art performance.

The work that perhaps most resembles ours is by [4]. However, our method does not rely on a strong image translator that needs to be trained in an adversarial way. Furthermore, our way of adopting consistency regularization exploits information from target images more efficiently by virtue of our high-dimensional perturbation method.

3. Background

BiSIDA uses Adaptive Instance Normalization, or AdaIN [13], which consists of an encoder that extracts a feature map from a given input image and a decoder that upsamples a feature map back to the original input size. Given a content image c and a style image s from another domain, an image that mimics the style of s while retaining the content of c is synthesized. In practice, the encoder is taken from the first few layers of a fixed, pretrained VGG-19 [26], while the decoder mirrors the architecture of the encoder and is trained as proposed in [13]. Formally, the feature maps of a content image c and a style image s obtained through an encoder f can be represented as t_c = f(c) and t_s = f(s). We normalize the mean and the standard deviation of each channel of t_c to those of t_s and produce the target feature map t:

t = AdaIN(t_c, t_s) = \sigma(t_s) \left( \frac{t_c - \mu(t_c)}{\sigma(t_c)} \right) + \mu(t_s),    (1)

where \mu(\cdot) and \sigma(\cdot) denote the channel-wise mean and standard deviation of a feature map.
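For reference, the alignment in Eq. (1) can be sketched in a few lines of PyTorch; the function name adain and the assumption of N×C×H×W feature tensors are ours, not part of [13]:

```python
import torch

def adain(t_c: torch.Tensor, t_s: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the channel-wise mean/std of content features t_c to those of style features t_s.

    Both tensors are assumed to have shape (N, C, H, W); statistics are computed
    per sample and per channel over the spatial dimensions, as in Eq. (1).
    """
    mu_c = t_c.mean(dim=(2, 3), keepdim=True)
    std_c = t_c.std(dim=(2, 3), keepdim=True) + eps   # eps avoids division by zero
    mu_s = t_s.mean(dim=(2, 3), keepdim=True)
    std_s = t_s.std(dim=(2, 3), keepdim=True)
    return std_s * (t_c - mu_c) / std_c + mu_s
```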

A typical problem of training a model with pseudo-labels is instability caused by the uncertain quality of the pseudo-labels, which may lead to oscillating predictions or a bias towards easier classes. To stabilize the generation of pseudo-labels, we employ self-ensembling [28], which consists of a segmentation network acting as the student network F^s and a teacher network F^t with the same architecture. The teacher network is essentially the temporal ensemble of the student network, so that radical changes in the teacher's weights are alleviated and more informed predictions can be made. The weights θ^t_i of the teacher network F^t at the i-th iteration are updated as the exponential moving average of the student weights θ^s_i, i.e., θ^t_i = η θ^t_{i-1} + (1 − η) θ^s_i, given an exponential moving average decay η.
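A minimal sketch of this exponential-moving-average update in PyTorch (the function name update_teacher is ours; in practice, buffers such as BatchNorm statistics would need a similar treatment):

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module, eta: float = 0.999) -> None:
    """EMA update: theta_t <- eta * theta_t + (1 - eta) * theta_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(eta).add_(p_s, alpha=1.0 - eta)
```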

4. Method

In the UDA setting, the dataset from the source domain S includes images denoted by X^S with their corresponding pixel-level annotations denoted by Y^S, and the dataset from the target domain T contains images X^T without annotation. The task is to optimize a segmentation model using the source dataset {(x^S_i, y^S_i)}_{i=1,...,N_S} and the target images {x^T_i}_{i=1,...,N_T} with C common categories. The student network is denoted by F^s, the teacher network by F^t. The architecture of our model is shown in Figure 1.

4.1. Continuous Style-induced Image Generator

To better utilize AdaIN for image augmentation in our framework, we control its output through a content-style trade-off. Once the target feature map t is obtained, we can synthesize an image whose style combines the styles of a source image and a target image, controlled by a content-style trade-off parameter α varying from 0 to 1, through our image generator G:

G(c, s, \alpha) = g(\alpha t + (1 - \alpha) t_c),    (2)

where g is the decoder. When α = 0, the content image is reconstructed with its own style kept; when α = 1, the output image combines the style of the style image s with the content of the content image c. Finally, we rectify the output by clipping it to the range [0, 255].
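A sketch of the resulting generator, reusing the adain helper from Section 3; the encoder and decoder arguments stand for the fixed VGG-19 encoder f and the trained decoder g, and the function name style_generator is ours:

```python
import torch

def style_generator(content: torch.Tensor, style: torch.Tensor, alpha: float,
                    encoder, decoder) -> torch.Tensor:
    """Eq. (2): decode a convex combination of AdaIN-aligned and content features."""
    t_c = encoder(content)              # content feature map f(c)
    t_s = encoder(style)                # style feature map f(s)
    t = adain(t_c, t_s)                 # Eq. (1)
    out = decoder(alpha * t + (1.0 - alpha) * t_c)
    return out.clamp(0.0, 255.0)        # rectify the output to the valid pixel range
```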

4.2. Target-guided Supervised Learning

Given a source domain dataset {(x^S_i, y^S_i)}_{i=1,...,N_S} and a target domain dataset {x^T_i}, we first perform a random color space perturbation A on a source domain image x_s to obtain A(x_s) and enhance randomness. Images with color space perturbation are then passed through our style-induced image generator G, using a target domain image x_t as the style image, as a stronger augmentation method. In the process, a content-style trade-off parameter α is randomly sampled from a uniform distribution U(0, 1) to control the style of the translated image x̃_s = G(A(x_s), x_t, α). Because the translation process loses resolution, it is enabled only with probability p_{s→t}, so the segmentation model is also trained on the details of the original images; otherwise, we simply set x̃_s = A(x_s). Finally, we compute the supervised loss L_s as the cross-entropy between the probability map p_s = F^s(x̃_s) and the pixel-level annotation y_s:

L_s = -\frac{1}{HW} \sum_{m=1}^{H \times W} \sum_{c=1}^{C} y_s^{mc} \log(p_s^{mc}).    (3)

Augmented by this strong and directed augmentation method, our framework facilitates generalization of the model to images with different styles and further enables gradual adaptation towards the target domain.
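The supervised phase can then be sketched as follows; color_jitter stands for the random color-space perturbation A, style_generator for the generator of Eq. (2), and the signature is our own illustration rather than the exact implementation:

```python
import random
import torch
import torch.nn.functional as F

def supervised_step(x_s, y_s, x_t, student, style_generator, color_jitter,
                    encoder, decoder, p_s2t: float = 0.5) -> torch.Tensor:
    """Target-guided supervised loss L_s (Eq. 3).

    x_s: source images (N, 3, H, W), y_s: pixel-level labels (N, H, W, long),
    x_t: a target image batch used only as a style reference.
    """
    x_aug = color_jitter(x_s)                      # random color-space perturbation A
    if random.random() < p_s2t:                    # translate only with probability p_{s->t}
        alpha = random.uniform(0.0, 1.0)           # content-style trade-off ~ U(0, 1)
        x_aug = style_generator(x_aug, x_t, alpha, encoder, decoder)
    logits = student(x_aug)                        # (N, C, H, W)
    return F.cross_entropy(logits, y_s)            # pixel-wise cross entropy, averaged
```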

4.3. Source-guided Unsupervised Learning

To start with, we introduce the generation of the pseudo-labels that guide self-learning on the target dataset. Given that our model is more adapted to the source domain, where supervised learning is performed, the quality of the produced pseudo-labels is generally higher when target images are translated towards the appearance of the source domain; pseudo-labels are therefore computed from such translated images in our framework. Similar to the supervised phase, we first perform a random color space perturbation A on a target domain image x_t to obtain A(x_t). Then, with probability p_{t→s} (again to account for the loss of resolution), we augment A(x_t) using k randomly sampled source images {x^i_s}_{i=1}^{k} as style images through our style-induced image generator G, so that x_t is transferred to a set of images {x̃^i_t}_{i=1}^{k} with x̃^i_t = G(A(x_t), x^i_s, α). Otherwise, we simply set {x̃^i_t}_{i=1}^{k} = {A(x_t)} with k = 1. The stochastic sampling of k source images makes the augmentation performed on the target images stronger while preserving their semantic content. After augmentation, the transformed images {x̃^i_t}_{i=1}^{k} are passed through the teacher model F^t individually to acquire more stable predictions y^i = F^t(x̃^i_t). We then average these predictions to obtain the probability map p_l for the pseudo-label, p_l = \frac{1}{k} \sum_{i=1}^{k} y^i. Before generating the pseudo-label, we employ a sharpening function, widely adopted in semi-supervised learning [1], to re-arrange the distribution of the probability map as follows, given a temperature T:

\mathrm{Sharpening}(p, T)_i := \frac{p_i^{1/T}}{\sum_{j=1}^{C} p_j^{1/T}}.    (4)

Finally, we acquire the pseudo-label as q = argmax(p_l), which can be used to compute the loss of our model on the target images in a supervised manner. Concretely, we augment the same target image x_t using the random color space augmentation A and pass it through the student network F^s to obtain the probability map p_t = F^s(A(x_t)).
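A sketch of the pseudo-label generation just described; generator is assumed to wrap the style-induced generator of Eq. (2), color_jitter the color-space perturbation A, and all helper names are illustrative (the per-view sampling of α is our assumption):

```python
import random
import torch

@torch.no_grad()
def make_pseudo_label(x_t, source_styles, teacher, generator, color_jitter,
                      p_t2s: float = 0.5, temperature: float = 0.25):
    """Source-guided pseudo-label generation for one target image x_t (1, 3, H, W).

    source_styles is a list of k randomly sampled source images used as style references.
    Returns the per-pixel pseudo-label q and the confidence max(p_l).
    """
    x_aug = color_jitter(x_t)
    if random.random() < p_t2s:
        views = [generator(x_aug, s, random.uniform(0.0, 1.0)) for s in source_styles]
    else:
        views = [x_aug]                                    # k = 1, no style transfer
    p_l = torch.stack([teacher(v).softmax(dim=1) for v in views]).mean(dim=0)
    p_l = p_l ** (1.0 / temperature)                       # sharpening, Eq. (4)
    p_l = p_l / p_l.sum(dim=1, keepdim=True)
    confidence, q = p_l.max(dim=1)                         # per-pixel confidence and label q
    return q, confidence
```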

In practice, the imbalance and varying complexity among categories in training datasets bias the model towards popular or easier categories, especially when it is trained in a semi-supervised manner that relies on pseudo-labels. To address this problem, we employ a class-balanced reweighting mechanism that guides the unsupervised loss with a prior distribution of categories. We first compute the class prior distribution d_c as the fraction of pixels belonging to class c over the source training dataset. Then the reweighting factor w_c for each class is computed as:

w_c = \frac{1}{\lambda \, d_c^{\gamma}},    (5)

where λ and γ are hyper-parameters. The final unsupervised loss is then:

L_u = -\frac{1}{HW} \sum_{m=1}^{H \times W} \mathbb{1}\big(\max(p_l^m) \geq \tau\big) \sum_{c=1}^{C} w_c \, q^{mc} \log(p_t^{mc}),    (6)

where 1(·) is the indicator function and τ is a confidence threshold.
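A sketch of the class-balanced, confidence-masked loss of Eqs. (5) and (6); the helper names and argument layout are ours:

```python
import torch
import torch.nn.functional as F

def class_weights(class_pixel_counts: torch.Tensor, lam: float, gamma: float) -> torch.Tensor:
    """Eq. (5): w_c = 1 / (lambda * d_c ** gamma), with d_c the source-set class pixel fraction."""
    d = class_pixel_counts.float() / class_pixel_counts.sum()
    return 1.0 / (lam * d ** gamma)

def unsupervised_loss(student_logits, q, confidence, w, tau: float = 0.9) -> torch.Tensor:
    """Eq. (6) on student predictions for the color-jittered target image.

    student_logits: (N, C, H, W); q: pseudo-labels (N, H, W, long);
    confidence: max of the averaged teacher map (N, H, W); w: class weights (C,).
    """
    log_p = F.log_softmax(student_logits, dim=1)
    nll = F.nll_loss(log_p, q, weight=w, reduction='none')   # per-pixel w_c * (-log p_t^{mc})
    mask = (confidence >= tau).float()                       # indicator 1(max(p_l) >= tau)
    return (mask * nll).sum() / mask.numel()
```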

4.4. Optimization

To summarize, our framework comprises a supervised learning process performed on the labeled source dataset as well as an unsupervised learning process performed on the unlabeled target dataset via consistency regularization and pseudo-labeling. As a result, we can compute the final loss

L, given the weight λ_u of the unsupervised loss, in a multi-task learning manner as follows:

L = L_s + \lambda_u L_u.    (7)

During the training process, the weights of the student network F^s are updated along the gradient computed via back-propagation of the loss L, while the weights of the teacher network are updated as the exponential moving average of the student network.
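Putting the pieces together, one training iteration could look like the following sketch (again using the hypothetical helpers introduced above):

```python
import torch

def train_iteration(L_s: torch.Tensor, L_u: torch.Tensor, lambda_u: float,
                    student: torch.nn.Module, teacher: torch.nn.Module,
                    optimizer: torch.optim.Optimizer, eta: float = 0.999) -> None:
    """One step on L = L_s + lambda_u * L_u (Eq. 7), followed by the EMA teacher update."""
    loss = L_s + lambda_u * L_u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                         # teacher follows the student by EMA
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(eta).add_(p_s, alpha=1.0 - eta)
```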

5. Experiments

Extensive experiments are conducted on two commonly used synthetic-to-real segmentation benchmarks. Comparisons with other SOTA methods and ablation studies are presented to show the effectiveness of our BiSIDA framework. We visualize some segmentation results in Figure 2.

5.1. Datasets

We used two synthetic-to-real benchmarks, GTA5-to-CityScapes and SYNTHIA-to-CityScapes. The CityScapes dataset [5] consists of images of real street scenes with a spatial resolution of 2048×1024 pixels. It includes 2,975 images for training, 500 images for validation, and 1,525 images for testing. In our experiments, we used the 500 validation images as a test set. The GTA5 dataset [23] includes 24,966 synthetic images with a resolution of 1914×1052 pixels that are obtained from the video game GTA5, along with pixel-level annotations that share all 19 common categories of CityScapes. For the SYNTHIA dataset [24], we used the SYNTHIA-RAND-CITYSCAPES subset, which contains 9,400 rendered images of size 1280×760 and shares 16 common categories with the CityScapes dataset.

5.2. Network Architecture

Image generator: To keep our continuous style-induced image generator light-weight and computationally affordable, we adopted the first several layers, up to relu4_1, of a fixed pre-trained VGG-19 network as the encoder in our experiments. For the decoder, we reversed the order of layers in the encoder and replaced the pooling layers with nearest-neighbor up-sampling [13].

Segmentation network: We chose FCN-8s [21] with a VGG-16 backbone network, pre-trained on ImageNet.

5.3. Training Protocol

The continuous style-induced image generator was trained using randomly-cropped 640×320 images and a batch size of 4. The ADAM optimizer was used with a learning rate of 1×10^-5 and momentum parameters of 0.9 and 0.999. To balance the reconstruction of the content image and the extraction from the style image, we optimized the generator loss in [13] with a style weight of 0.1.

Method | road | sidewalk | building | wall | fence | pole | light | sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU
Curriculum [37] | 74.9 | 22.0 | 71.7 | 6.0 | 11.9 | 8.4 | 16.3 | 11.1 | 75.7 | 13.3 | 66.5 | 38.0 | 9.3 | 55.2 | 18.8 | 18.9 | 0.0 | 16.8 | 16.6 | 29.0
CBST [39] | 66.7 | 26.8 | 73.7 | 14.8 | 9.5 | 28.3 | 25.9 | 10.1 | 75.5 | 15.7 | 51.6 | 47.2 | 6.2 | 71.9 | 3.7 | 2.2 | 5.4 | 18.9 | 32.4 | 30.9
AdaSeg [29] | 87.3 | 29.8 | 78.6 | 21.1 | 18.2 | 22.5 | 21.5 | 11.0 | 79.7 | 29.6 | 71.3 | 46.8 | 6.5 | 80.1 | 23.0 | 26.9 | 0.0 | 10.6 | 0.3 | 35.0
Cycada [11] | 85.2 | 37.2 | 76.5 | 21.8 | 15.0 | 23.8 | 22.9 | 21.5 | 80.5 | 31.3 | 60.7 | 50.5 | 9.0 | 76.9 | 17.1 | 28.2 | 4.5 | 9.8 | 0.0 | 35.4
AdvEnt [32] | 86.9 | 28.7 | 78.7 | 28.5 | 25.2 | 17.1 | 20.3 | 10.9 | 80.0 | 26.4 | 70.2 | 47.1 | 8.4 | 81.5 | 26.0 | 17.2 | 18.9 | 11.7 | 1.6 | 36.1
DCAN [34] | 82.3 | 26.7 | 77.4 | 23.7 | 20.5 | 20.4 | 30.3 | 15.9 | 80.9 | 25.4 | 69.5 | 52.6 | 11.1 | 79.6 | 24.9 | 21.2 | 1.3 | 17.0 | 6.7 | 36.2
CLAN [22] | 88.0 | 30.6 | 79.2 | 23.4 | 20.5 | 26.1 | 23.0 | 14.8 | 81.6 | 34.5 | 72.0 | 45.8 | 7.9 | 80.5 | 26.6 | 29.9 | 0.0 | 10.7 | 0.0 | 36.6
LSD [25] | 88.0 | 30.5 | 78.6 | 25.2 | 23.5 | 16.7 | 23.5 | 11.6 | 78.7 | 27.2 | 71.9 | 51.3 | 19.5 | 80.4 | 19.8 | 18.3 | 0.9 | 20.8 | 18.4 | 37.1
BDL [18] | 89.2 | 40.9 | 81.2 | 29.1 | 19.2 | 14.2 | 29.0 | 19.6 | 83.7 | 35.9 | 80.7 | 54.7 | 23.3 | 82.7 | 25.8 | 28.0 | 2.3 | 25.7 | 19.9 | 41.3
FDA [35] | 86.1 | 35.1 | 80.6 | 30.8 | 20.4 | 27.5 | 30.0 | 26.0 | 82.1 | 30.3 | 73.6 | 52.5 | 21.7 | 81.7 | 24.0 | 30.5 | 29.9 | 14.6 | 24.0 | 42.2
Stuff&things [33] | 88.1 | 35.8 | 83.1 | 25.8 | 23.9 | 29.2 | 28.8 | 28.6 | 83.0 | 36.7 | 82.3 | 53.7 | 22.8 | 82.3 | 26.4 | 38.6 | 0.0 | 19.6 | 17.1 | 42.4
TGCF-DA+SE [4] | 90.2 | 51.5 | 81.1 | 15.0 | 10.7 | 37.5 | 35.2 | 28.9 | 84.1 | 32.7 | 75.9 | 62.7 | 19.9 | 82.6 | 22.9 | 28.3 | 0.0 | 23.0 | 25.4 | 42.5
Ours | 89.3 | 40.9 | 82.5 | 30.9 | 24.7 | 20.9 | 26.9 | 32.1 | 81.8 | 33.1 | 81.6 | 53.4 | 20.3 | 83.0 | 24.8 | 29.4 | 0.0 | 28.6 | 36.6 | 43.2

Table 1: Comparison of our model with other methods on the GTA5-to-CityScapes benchmark using models with VGG-16 as backbone. The mIoU column reports the mean of the per-class IoUs over all 19 categories shared between GTA5 and CityScapes.

Method | road | sidewalk | building | wall | fence | pole | light | sign | vegetation | sky | person | rider | car | bus | motorcycle | bicycle | mIoU | mIoU*
Curriculum [37] | 65.2 | 26.1 | 74.9 | 0.1 | 0.5 | 10.7 | 3.5 | 3.0 | 76.1 | 70.6 | 47.1 | 8.2 | 43.2 | 20.7 | 0.7 | 13.1 | 29.0 | 34.8
AdvEnt [32] | 67.9 | 29.4 | 71.9 | 6.3 | 0.3 | 19.9 | 0.6 | 2.6 | 74.9 | 74.9 | 35.4 | 9.6 | 67.8 | 21.4 | 4.1 | 15.5 | 31.4 | 36.6
AdaSeg [29] | 78.9 | 29.2 | 75.5 | - | - | - | 0.1 | 4.8 | 72.6 | 76.7 | 43.4 | 8.8 | 71.1 | 16.0 | 3.6 | 8.4 | - | 37.6
CLAN [22] | 80.4 | 30.7 | 74.7 | - | - | - | 1.4 | 8.0 | 77.1 | 79.0 | 46.5 | 8.9 | 73.8 | 18.2 | 2.2 | 9.9 | - | 39.3
CBST [39] | 69.6 | 28.7 | 69.5 | 12.1 | 0.1 | 25.4 | 11.9 | 13.6 | 82.0 | 81.9 | 49.1 | 14.5 | 66.0 | 6.6 | 3.7 | 32.4 | 35.4 | 40.7
DCAN [34] | 79.9 | 30.4 | 70.8 | 1.6 | 0.6 | 22.3 | 6.7 | 23.0 | 76.9 | 73.9 | 41.9 | 16.7 | 61.7 | 11.5 | 10.3 | 38.6 | 35.4 | 41.7
LSD [25] | 80.1 | 29.1 | 77.5 | 2.8 | 0.4 | 26.8 | 11.1 | 18.0 | 78.1 | 76.7 | 48.2 | 15.2 | 70.5 | 17.4 | 8.7 | 16.7 | 36.1 | 42.1
ROAD [3] | 77.7 | 30.0 | 77.5 | 9.6 | 0.3 | 25.8 | 10.3 | 15.6 | 77.6 | 79.8 | 44.5 | 16.6 | 67.8 | 14.5 | 7.0 | 23.8 | 36.2 | 41.7
GIO-Ada [2] | 78.3 | 29.2 | 76.9 | 11.4 | 0.3 | 26.5 | 10.8 | 17.2 | 81.7 | 81.9 | 45.8 | 15.4 | 68.0 | 15.9 | 7.5 | 30.4 | 37.3 | 43.0
TGCF-DA+SE [4] | 90.1 | 48.6 | 80.7 | 2.2 | 0.2 | 27.2 | 3.2 | 14.3 | 82.1 | 78.4 | 54.4 | 16.4 | 82.5 | 12.3 | 1.7 | 21.8 | 38.5 | 46.6
BDL [18] | 72.0 | 30.3 | 74.5 | 0.1 | 0.3 | 24.6 | 10.2 | 25.2 | 80.5 | 80.0 | 54.7 | 23.2 | 72.7 | 24.0 | 7.5 | 44.9 | 39.0 | 46.1
FDA [35] | 84.2 | 35.1 | 78.0 | 6.1 | 0.4 | 27.0 | 8.5 | 22.1 | 77.2 | 79.6 | 55.5 | 19.9 | 74.8 | 24.9 | 14.3 | 40.7 | 40.5 | 47.3
Ours | 87.4 | 42.4 | 79.0 | 17.0 | 0.1 | 23.9 | 2.8 | 22.9 | 82.0 | 80.4 | 51.1 | 19.1 | 76.7 | 33.3 | 14.4 | 41.2 | 42.1 | 48.7

Table 2: Comparison of our framework with other methods on the SYNTHIA-to-CityScapes benchmark using models with VGG-16 as backbone. The mIoU column reports the mean of the per-class IoUs over all 16 categories shared between SYNTHIA and CityScapes, while mIoU* reports the mean over the 13 common categories excluding wall, fence, and pole.

The segmentation model was trained on images randomly cropped to 960×480 with a batch size of 1. On the GTA5 dataset, we applied the ADAM optimizer with a learning rate of 1×10^-5, weight decay of 5×10^-4, and momentum parameters of 0.9 and 0.999. For the SYNTHIA dataset, we adopted the SGD optimizer with a learning rate of 1×10^-5, momentum of 0.99, and weight decay of 5×10^-4. We set the exponential moving average decay for the teacher model to 0.999 and the confidence threshold τ in the pseudo-label generation process to 0.9. The probabilities of performing target-guided image translation p_{s→t} and source-guided image translation p_{t→s} are both 0.5. The unsupervised weight λ_u and the sharpening temperature T are set to 1 and 0.25, respectively. Both models are trained on an NVIDIA Tesla V100 GPU.
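For convenience, the hyper-parameters stated above can be collected in a single configuration; the dictionary below is our own summary and naming, not code from the authors:

```python
# Hyper-parameters of the segmentation training as reported in Section 5.3 (names are ours).
TRAIN_CONFIG = {
    "crop_size": (960, 480),
    "batch_size": 1,
    "learning_rate": 1e-5,
    "weight_decay": 5e-4,
    "optimizer": {"GTA5": "ADAM (betas 0.9, 0.999)", "SYNTHIA": "SGD (momentum 0.99)"},
    "ema_decay": 0.999,              # teacher exponential moving average
    "confidence_threshold": 0.9,     # tau in Eq. (6)
    "p_s2t": 0.5,                    # probability of target-guided translation
    "p_t2s": 0.5,                    # probability of source-guided translation
    "lambda_u": 1.0,                 # unsupervised loss weight
    "sharpening_T": 0.25,            # sharpening temperature
}
```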

5.4. Comparisons with SOTA Methods

We first compare the performance of our BiSIDA on the GTA5-to-CityScapes benchmark with that of other methods using models with VGG-16 as backbone (Table 1). The results reveal that our method outperforms the most competitive methods, in particular TGCF-DA+SE, which employs adversarial training as augmentation, and achieves new state-of-the-art performance by 1.6%.

We present the performance of our method and others on the SYNTHIA-to-CityScapes benchmark using two metrics (Table 2). Due to the less realistic appearance and the smaller amount of training data, this task is more difficult than the previous one. Nevertheless, our framework outperforms the current state-of-the-art method by a significant margin of 3.8%.

5.5. Ablation Studies

Style-induced image translation and unsupervised learning: We validate the effectiveness of our continuous style-induced image generator as well as our self-supervised learning modules through an ablation study, and explore how they contribute to achieving unsupervised domain adaptation. Results are presented in Table 3.

S2T | T2S | PL | SE | GTA | SYN
    |     |    |    | 29.3 | 28.9
 ✓  |     |    |    | 34.7 | 32.0
    |  ✓  |    |    | 31.8 | 31.4
 ✓  |  ✓  |    |    | 35.1 | 40.2
 ✓  |  ✓  | ✓  |    | 35.4 | 40.8
 ✓  |  ✓  |    | ✓  | 39.4 | 41.8
 ✓  |  ✓  | ✓  | ✓  | 43.2 | 42.1

Table 3: Ablation study on the style-induced image translation and unsupervised modules. S2T stands for source-to-target image translation, T2S for target-to-source image translation, PL for pseudo-labeling, and SE for self-ensembling. GTA reports the mIoU (16 classes) on the GTA5-to-CityScapes benchmark and SYN the mIoU on the SYNTHIA-to-CityScapes benchmark.

Since our continuous style-induced image generator is used in both the supervised and the unsupervised learning phase, to perform target-guided and source-guided image translation respectively, we conduct experiments on each of them. Additionally, given that our self-supervised learning paradigm is based on the source-guided image translation, we deactivate self-supervised learning whenever the source-guided image translation is suppressed in this experiment. As we can observe from the results, the target-guided and the source-guided image translation improve the performance on both benchmarks when applied separately. It is also worth noting that the improvement brought by the target-guided image translation is slightly larger, since target domain images translated with styles from the source domain cannot provide better self-guidance without the source domain having been aligned to the intermediate continuous space. A more significant performance leap is observed when the two translations are performed simultaneously, especially on the SYNTHIA-to-CityScapes benchmark, where the domain gap is larger, showing the advantage of our bidirectional style-induced image translation method.

As for the modules in the unsupervised learning phase, we explore the contribution of pseudo-labeling and self-ensembling. When pseudo-labeling is disabled, we use the probability maps to compute the self-supervised loss, and the problem is transformed into entropy minimization. Likewise, the probability maps are generated by the segmentation model itself if self-ensembling is disabled. From the results, we find that pseudo-labeling and self-ensembling contribute a similar degree of enhancement to the performance.

cAUG | T2S | mIoU | mIoU*
     |     | 32.2 | 38.6
  ✓  |     | 32.1 | 38.5
     |  ✓  | 41.8 | 48.3
  ✓  |  ✓  | 42.1 | 48.7

Table 4: Experiments on augmentation methods on the SYNTHIA-to-CityScapes benchmark. cAUG denotes color perturbation and T2S denotes source-guided image translation performed on the target domain. mIoU is averaged over 16 classes and mIoU* over the 13 common classes.

weight λ_u | 0.1 | 0.5 | 1.0 | 5.0 | 10.0
mIoU       | 37.8 | 41.9 | 42.1 | 39.8 | 38.6
mIoU*      | 44.3 | 48.1 | 48.7 | 46.3 | 45.2

Table 5: Comparison of different unsupervised loss weights λ_u. mIoU is averaged over 16 classes and mIoU* over the 13 common classes.

Additionally, we observe that most of the improvement on GTA5-to-CityScapes comes from the self-supervised learning modules, while that on SYNTHIA-to-CityScapes comes from the style-induced image translation process. Based on this observation, we infer that the main challenge in the GTA5-to-CityScapes benchmark is feature-level alignment, whereas for SYNTHIA-to-CityScapes it is pixel-level alignment.

Source-guided image translation: In the previous experiment, unsupervised learning was suppressed when source-guided image translation was not performed. To learn more about the effectiveness of the color-space perturbation and the source-guided image translation performed on target images in the unsupervised learning phase, we conducted an ablation study on the SYNTHIA-to-CityScapes benchmark, where pixel-level alignment plays a more important role. We tested these two perturbation methods with all other settings fixed. From the results, shown in Table 4, we find that the introduction of source-guided image translation improves performance by a large margin. The color space perturbation, on the other hand, only helps when the source-guided image translation is applied, since it enhances the stochasticity of the high-dimensional perturbation process; on its own, it is not a sufficiently strong perturbation for consistency regularization.

5.6. Discussion

Unsupervised learning weight: In our BiSIDA, the unsupervised loss weight λ_u is a crucial hyperparameter that balances the focus of our model between supervised learning on the labeled source dataset and self-supervised learning on the unlabeled target dataset.

Figure 2: Sample results (column "BiSIDA"). Target-domain testing images from the CityScapes dataset (column "Images") were segmented with a model trained by our BiSIDA framework on the GTA5 dataset through an FCN-8s with a VGG-16 backbone. The ground truth segmentation ("GT") and results of a model trained only with source domain images ("Source only") are also shown. Note that our method is capable of capturing rare and difficult categories, such as traffic lights and signs.

# images | 1 | 2 | 4 | 6 | 8
mIoU     | 41.0 | 41.4 | 42.1 | 41.8 | 42.0
mIoU*    | 47.3 | 47.6 | 48.7 | 48.1 | 48.6

Table 6: Number of source images used to perturb a target image. mIoU is averaged over 16 classes and mIoU* over the 13 common classes.

To investigate the effect of different unsupervised loss weights on our method, we conducted an experiment on the SYNTHIA-to-CityScapes benchmark with five different unsupervised loss weights. The results in Table 5 reveal that when the weight is too small, the benefit of unsupervised learning is limited and consistency regularization cannot be performed effectively. When the weight is too large, the model fails to achieve satisfying performance; a reason may be that the model becomes prone to bias and prefers easier categories in the early stage of training. Our model reaches its peak performance when the weight is set to 1.

Number of style images used in source-guided image translation: Since we gather the predictions over k images translated from a target domain image with styles from k different source domain style images, the number of style images used in the image translation process is another important hyperparameter in our BiSIDA framework. We hereby conduct experiments on the SYNTHIA-to-CityScapes benchmark with k values of 1, 2, 4, 6, and 8. The results are presented in Table 6. As we can see from the table, when the number of style images is small, the model cannot achieve a good performance, since the stochasticity in the perturbation process is undermined and the quality of the generated pseudo-label is limited. On the other hand, increasing the number of style images is not necessarily beneficial either, since it does not significantly improve the performance once performance starts to saturate, despite the increase in computational cost.

6. Conclusion

We proposed a Bidirectional Style-induced Domain Adaptation (BiSIDA) framework that optimizes a segmentation model via target-guided supervised learning and source-guided unsupervised learning. With our continuous style-induced image generator, we showed the effectiveness of learning from the unlabeled target dataset by providing high-dimensional perturbations for consistency regularization. Furthermore, we showed that alignment between the source and the target domain from both directions is achievable without requiring adversarial training.

References

[1] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, pages 5050-5060, 2019.

[2] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, pages 1841-1850, 2019.

[3] Yuhua Chen, Wen Li, and Luc Van Gool. ROAD: Reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, pages 7892-7901, 2018.

[4] Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. In ICCV, pages 6829-6839, 2019.

[5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213-3223, 2016.

[6] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.

[7] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. arXiv preprint arXiv:1906.01916, 2019.

[8] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414-2423, 2016.

[9] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.

[10] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, pages 529-536, 2004.

[11] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, pages 1994-2003, 2018.

[12] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. CoRR, abs/1612.02649, 2016.

[13] Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1510-1519, 2017.

[14] Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, pages 179-196, 2018.

[15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694-711, 2016.

[16] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.

[17] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, pages 702-716, 2016.

[18] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, pages 6936-6945, 2019.

[19] Qing Lian, Lixin Duan, Fengmao Lv, and Boqing Gong. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In ICCV, pages 6757-6766, 2019.

[20] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, pages 700-708, 2017.

[21] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.

[22] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In CVPR, pages 2507-2516, 2019.

[23] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, pages 102-118, 2016.

[24] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pages 3234-3243, 2016.

[25] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser-Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR, pages 3752-3761, 2018.

[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[27] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.

[28] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195-1204, 2017.

[29] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, pages 7472-7481, 2018.

[30] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349-1357, 2016.

[31] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, pages 4105-4113, 2017.

[32] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Perez. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, pages 2517-2526, 2019.

[33] Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S. Huang, and Honghui Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In CVPR, pages 12635-12644, 2020.

[34] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gokhan Uzunbas, Tom Goldstein, Ser-Nam Lim, and Larry S. Davis. DCAN: Dual channel-wise alignment networks for unsupervised scene adaptation. In ECCV, pages 535-552, 2018.

[35] Yanchao Yang and Stefano Soatto. FDA: Fourier domain adaptation for semantic segmentation. In CVPR, pages 4085-4095, 2020.

[36] Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868-2876, 2017.

[37] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, pages 2039-2049, 2017.

[38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242-2251, 2017.

[39] Yang Zou, Zhiding Yu, B. V. K. Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pages 297-313, 2018.