14
Adjustable Real-time Style Transfer Mohammad Babaeizadeh 1,2 and Golnaz Ghiasi 2 1 University of Illinois at Urbana-Champaign 2 Google Brain Abstract Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style sim- ilar to another. Although recent feed-forward neural net- works can generate stylized images in real-time, these mod- els produce a single stylization given a pair of style/content images, and the user doesn’t have control over the synthe- sized output. Moreover, the style transfer depends on the hyper-parameters of the model with varying “optimum” for different input images. Therefore, if the stylized output is not appealing to the user, she/he has to try multiple models or retrain one with different hyper-parameters to get a fa- vorite stylization. In this paper, we address these issues by proposing a novel method which allows adjustment of cru- cial hyper-parameters, after the training and in real-time, through a set of manually adjustable parameters. These parameters enable the user to modify the synthesized out- puts from the same pair of style/content images, in search of a favorite stylized image. Our quantitative and qualita- tive experiments indicate how adjusting these parameters is comparable to retraining the model with different hyper- parameters. We also demonstrate how these parameters can be randomized to generate results which are diverse but still very similar in style and content. 1. Introduction Style transfer is a long-standing problem in computer vision with the goal of synthesizing new images by com- bining the content of one image with the style of an- other [9, 13, 1]. Recently, neural style transfer tech- niques [10, 11, 16, 12, 20, 19] showed that the correlation between the features extracted from the trained deep neu- ral networks is quite effective on capturing the visual styles and content that can be used for generating images similar in style and content. However, since the definition of sim- ilarity is inherently vague, the objective of style transfer is not well defined [8] and one can imagine multiple stylized images from the same pair of content/style images. Style Content (Fixed) Figure 1. Adjusting the output of the synthesized stylized images in real-time. Each column shows a different stylized image for the same content and style image. Note how each row still resembles the same content and style while being widely different in details. Existing real-time style transfer methods generate only one stylization for a given content/style pair and while the stylizations of different methods usually look distinct [28, 14], it is not possible to say that one stylization is better in all contexts since people react differently to images based on their background and situation. Hence, to get favored stylizations users must try different methods that is not sat- isfactory. It is more desirable to have a single model which can generate diverse results, but still similar in style and content, in real-time, by adjusting some input parameters. One other issue with the current methods is their high 1 arXiv:1811.08560v1 [cs.CV] 21 Nov 2018

Adjustable Real-time Style Transfer - arXiv

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Adjustable Real-time Style Transfer

Mohammad Babaeizadeh1,2 and Golnaz Ghiasi2

1University of Illinois at Urbana-Champaign2Google Brain

Abstract

Artistic style transfer is the problem of synthesizing animage with content similar to a given image and style sim-ilar to another. Although recent feed-forward neural net-works can generate stylized images in real-time, these mod-els produce a single stylization given a pair of style/contentimages, and the user doesn’t have control over the synthe-sized output. Moreover, the style transfer depends on thehyper-parameters of the model with varying “optimum” fordifferent input images. Therefore, if the stylized output isnot appealing to the user, she/he has to try multiple modelsor retrain one with different hyper-parameters to get a fa-vorite stylization. In this paper, we address these issues byproposing a novel method which allows adjustment of cru-cial hyper-parameters, after the training and in real-time,through a set of manually adjustable parameters. Theseparameters enable the user to modify the synthesized out-puts from the same pair of style/content images, in searchof a favorite stylized image. Our quantitative and qualita-tive experiments indicate how adjusting these parametersis comparable to retraining the model with different hyper-parameters. We also demonstrate how these parameters canbe randomized to generate results which are diverse but stillvery similar in style and content.

1. Introduction

Style transfer is a long-standing problem in computervision with the goal of synthesizing new images by com-bining the content of one image with the style of an-other [9, 13, 1]. Recently, neural style transfer tech-niques [10, 11, 16, 12, 20, 19] showed that the correlationbetween the features extracted from the trained deep neu-ral networks is quite effective on capturing the visual stylesand content that can be used for generating images similarin style and content. However, since the definition of sim-ilarity is inherently vague, the objective of style transfer isnot well defined [8] and one can imagine multiple stylizedimages from the same pair of content/style images.

Style

Content

(Fixed)

Figure 1. Adjusting the output of the synthesized stylized imagesin real-time. Each column shows a different stylized image for thesame content and style image. Note how each row still resemblesthe same content and style while being widely different in details.

Existing real-time style transfer methods generate onlyone stylization for a given content/style pair and while thestylizations of different methods usually look distinct [28,14], it is not possible to say that one stylization is better inall contexts since people react differently to images basedon their background and situation. Hence, to get favoredstylizations users must try different methods that is not sat-isfactory. It is more desirable to have a single model whichcan generate diverse results, but still similar in style andcontent, in real-time, by adjusting some input parameters.

One other issue with the current methods is their high

1

arX

iv:1

811.

0856

0v1

[cs

.CV

] 2

1 N

ov 2

018

Figure 2. Architecture of the proposed model. The loss adjustment parameters αααc and αααs is passed to the network Λ which will predictactivation normalizers γααα and βααα that normalize activation of main stylizing network T . The stylized image is passed to a trained imageclassifier where its intermediate representation is used to calculate the style loss Ls and content loss Lc. Then the loss from each layeris multiplied by the corresponding input adjustment parameter. Models Λ and T are trained jointly by minimizing this weighted sum. Atgeneration time, values for αααc and αααs can be adjusted manually or randomly sampled to generate varied stylizations.

sensitivity to the hyper-parameters. More specifically, cur-rent real-time style transfer methods minimize a weightedsum of losses from different layers of a pre-trained im-age classification model [16, 14] (check Sec 3 for de-tails) and different weight sets can result into very differentstyles (Figure 7). However, one can only observe the effectof these weights in the final stylization by fully retrainingthe model with the new set of weights. Considering the factthat the “optimal” set of weights can be different for anypair of style/content (Figure 3) and also the fact that this“optimal” truly doesn’t exist (since the goodness of the out-put is a personal choice) retraining the models over and overuntil the desired result is generated is not practical.

The primary goal of this paper is to address these issuesby providing a novel mechanism which allows for adjust-ment of the stylized image, in real-time and after train-ing. To achieve this, we use an auxiliary network which ac-cepts additional parameters as inputs and changes the styletransfer process by adjusting the weights between multiplelosses. We show that changing these parameters at inferencetime results to stylizations similar to the ones achievable byretraining the model with different hyper-parameters. Wealso show that a random selection of these parameters atrun-time can generate a random stylization. These solu-tions, enable the end user to be in full control of how thestylized image is being formed as well as having the ca-pability of generating multiple stochastic stylized imagesfrom a fixed pair of style/content. The stochastic natureof our proposed method is most apparent when viewingthe transition between random generations. Therefore, wehighly encourage the reader to check the project websitehttps://goo.gl/PVWQ9K to view the generated stylizations.

2. Related Work

The strength of deep networks in style transfer was firstdemonstrated by Gatys et al. [11]. While this method gen-erates impressive results, it is too slow for real-time appli-cations due to its optimization loop. Follow up works speedup this process by training feed-forward networks that cantransfer style of a single style image [16, 30] or multiplestyles [8]. Other works introduced real-time methods totransfer style of arbitrary style image to an arbitrary con-tent image [12, 14]. These methods can generate differentstylizations from different style images; however, they onlyproduce one stylization for a single pair of content/style im-age which is different from our proposed method.

Generating diverse results have been studied in multipledomains such as colorizations [7, 3], image synthesis [4],video prediction [2, 18], and domain transfer [15, 33]. Do-main transfer is the most similar problem to the style trans-fer. Although we can generate multiple outputs from a giveninput image [15], we need a collection of target or style im-ages for training. Therefore we can not use it when we donot have a collection of similar styles.

Style loss function is a crucial part of style transfer whichaffects the output stylization significantly. The most com-mon style loss is Gram matrix which computes the second-order statistics of the feature activations [11], however manyalternative losses have been introduced to measure distancesbetween feature statistics of the style and stylized imagessuch as correlation alignment loss [25], histogram loss [26],and MMD loss [21]. More recent work [23] has used depthsimilarity of style and stylized images as a part of the loss.We demonstrate the success of our method using only Gram

2

1e-2 1e-3 1e-4Style

Figure 3. Effect of adjusting the style weight in style transfer net-work from [16]. Each column demonstrates the result of a separatetraining with all wls set to the printed value. As can be seen, the“optimal” weight is different from one style image to another andthere can be multiple “good” stylizations depending on ones’ per-sonal choice. Check supplementary materials for more examples.

matrix; however, our approach can be expanded to utilizeother losses as well.

To the best of our knowledge, the closest work to thispaper is [32] in which the authors utilized Julesz ensembleto encourage diversity in stylizations explicitly. Althoughthis method generates different stylizations, they are verysimilar in style, and they only differ in minor details. Aqualitative comparison in Figure 8 shows that our proposedmethod is more effective in diverse stylization.

3. Background3.1. Style transfer using deep networks

Style transfer can be formulated as generating a stylizedimage p which its content is similar to a given content im-age c and its style is close to another given style image s.

p = Ψ(c, s)

The similarity in style can be vaguely defined as shar-ing the same spatial statistics in low-level features, whilesimilarity in content is roughly having a close Euclideandistance in high-level features [12]. These features are typ-ically extracted from a pre-trained image classification net-work, commonly VGG-19 [29]. The main idea here is thatthe features obtained by the image classifier contain infor-mation about the content of the input image while the cor-relation between these features represents its style.

In order to increase the similarity between two images,Gatys et al. [11] minimize the following distances betweentheir extracted features:

Llc(p) =∣∣∣∣φl(p)− φl(s)

∣∣∣∣22

(1)

Lls(p) =∣∣∣∣G(φl(p))−G(φl(s))

∣∣∣∣2F

(2)

where φl(x) is activation of a pre-trained classification net-work at layer l given the input image x, while Llc(p) and

Lls(p) are content and style loss at layer l respectively.G(φl(p)) denotes the Gram matrix associated with φl(p).

The total loss is calculated as a weighted sum of lossesacross a set of content layers C and style layers S :

Lc(p) =∑l∈C

wlcLlc(p) and Ls(p) =∑l∈S

wlsLls(p) (3)

where wlc, wls are hyper-parameters to adjust the contribu-

tion of each layer to the loss. Layers can be shared betweenC and S . These hyper-parameters have to be manually finetuned through try and error and usually vary for differentstyle images (Figure 3). Finally, the objective of style trans-fer can be defined as:

minp

(Lc(p) + Ls(p)

)(4)

This objective can be minimized by iterative gradient-basedoptimization methods starting from an initial p which usu-ally is random noise or the content image itself.

3.2. Real-time feed-forward style transfer

Solving the objective in Equation 4 using an iterativemethod can be very slow and has to be repeated for anygiven pair of style/content image. A much faster methodis to directly train a deep network T which maps a givencontent image c to a stylized image p [16]. T is usuallya feed-forward convolutional network (parameterized by θ)with residual connections between down-sampling and up-sampling layers [27] and is trained on many content imagesusing Equation 4 as the loss function:

minθ

(Lc(T (c)) + Ls(T (c))

)(5)

The style image is assumed to be fixed and therefore a dif-ferent network should be trained per style image. However,for a fixed style image, this method can generate stylizedimages in real-time [16]. Recent methods [8, 12, 14] intro-duced real-time style transfer methods for multiple styles.But, these methods still generate only one stylization for apair of style and content images.

4. Proposed Method4.1. Problem Statement

In this paper we address the following issues in real-timefeed-forward style transfer methods:1. The output of these models is sensitive to the hyper-parameters wlc and wls and different weights significantlyaffect the generated stylized image as demonstrated in Fig-ure 7. Moreover, the “optimal” weights vary from one styleimage to another (Figure 3) and therefore finding a good setof weights should be repeated for each style image. Pleasenote that for each set of wlc and wls the model has to be fully

3

Content

1.00.0

Style

conv2_3

conv3_3

conv4_3

Figure 4. Effect of adjusting the input parameters αααs on stylization. Each row shows the stylized output when a single αls increasedgradually from zero to one while other αααs are fixed to zero. Notice how the details of each stylization is different specially at the lastcolumn where the value is maximum. Also note how deeper layers use bigger features of style image to stylize the content.

retrained that limits the practicality of style transfer models.2. Current methods generate a single stylized image given acontent/style pair. While the stylizations of different meth-ods usually look very distinct [28], it is not possible to saywhich stylization is better for every context since it is a mat-ter of personal taste. To get a favored stylization, users mayneed to try different methods or train a network with differ-ent hyper-parameters which is not satisfactory and, ideally,the user should have the capability of getting different styl-izations in real-time.

We address these issues by conditioning the generatedstylized image on additional input parameters where eachparameter controls the share of the loss from a correspond-ing layer. This solves the problem (1) since one can adjustthe contribution of each layer to adjust the final stylized re-sult after the training and in real-time. Secondly, we addressthe problem (2) by randomizing these parameters which re-sult in different stylizations.

4.2. Style transfer with adjustable loss

We enable the users to adjust wlc,wls without retraining

the model by replacing them with input parameters and con-ditioning the generated style images on these parameters:

p = Ψ(c, s,αααc,αααs)

αααc and αααs are vectors of parameters where each elementcorresponds to a different layer in content layers C andstyle layers S respectively. αlc and αls replace the hyper-parameters wlc and wls in the objective Equation 3:

Lc(p) =∑l∈C

αlcLlc(p) and Ls(p) =∑l∈S

αlsLls(p) (6)

To learn the effect of αααc and αααs on the objective, we usea technique called conditional instance normalization [31].This method transforms the activations of a layer x in thefeed-forward network T to a normalized activation z which

is conditioned on additional inputs ααα = [αααc,αααs]:

z = γααα(x− µ

σ

)+ βααα (7)

where µ and σ are mean and standard deviation of activa-tions at layer x across spatial axes [12] and γααα, βααα are thelearned mean and standard deviation of this transformation.These parameters can be approximated using a second neu-ral network which will be trained end-to-end with T :

γααα, βααα = Λ(αααc,αααs) (8)

Since Ll can be very different in scale, one loss term maydominate the others which will fail the training. To balancethe losses, we normalize them using their exponential mov-ing average as a normalizing factor, i.e. each Ll will benormalized to:

Ll(p) =

∑i∈C∪S Li(p)

Ll(p)∗ Ll(p) (9)

where Ll(p) is the exponential moving average of Ll(p).

5. ExperimentsIn this section, first we study the effect of adjusting the

input parameters in our method. Then we demonstrate thatwe can use our method to generate random stylizations andfinally, we compare our method with a few baselines interms of generating random stylizations.

5.1. Implementation details

We implemented Λ as a multilayer fully connected neu-ral network. We used the same architecture as [16, 8, 12]for T and only increased number of residual blocks by3 (look at supplementary materials for details) which im-proved stylization results. We trained T and Λ jointly bysampling random values for ααα from U(0, 1). We trained

4

Rand

omiz

ed

Para

met

ers

Rand

omiz

ed

Con

tent

Noi

seBo

th

Figure 5. Effect of randomizingααα and additive Gaussian noise on stylization. Top row demonstrates that randomizingααα results to differentstylizations however the style features appear in the same spatial position (e.g., look at the swirl effect on the left eye). Middle row visualizesthe effect of adding random noise to the content image in moving these features with fixed ααα. Combination of these two randomizationtechniques can generate highly versatile outputs which can be seen in the bottom row. Notice how each image in this row differs in bothstyle and the spatial position of style elements. Look at Figure 10 for more randomized results.

our model on ImageNet [6] as content images while usingpaintings from Kaggle Painter by Numbers [17] and tex-tures from Descibable Texture Dataset [5] as style images.We selected random images form ImageNet test set, MS-

0.2 0.4 0.6 0.8 1.0Input parameter value

0.0

0.2

0.4

0.6

0.8

1.0

Nor

mal

ized

sty

le lo

ss

0.2 0.4 0.6 0.8 1.0Input parameter value

0.0

0.2

0.4

0.6

0.8

1.0

0.00 0.25 0.50 0.75 1.000.0

0.2

0.4

0.6

0.8

1.0

Nor

mal

ized

sty

le lo

ss

Other parameters are fixed at 0.0

0.00 0.25 0.50 0.75 1.000.0

0.2

0.4

0.6

0.8

1.0Other parameters are fixed at 1.0

Loss at conv2_3 Loss at conv3_3 Loss at conv4_3

Figure 6. Effect of adjusting the input parameters αααs on style lossfrom different layers across single style image of Figure 4 (top)or 25 different style images (bottom). In each curve, one of theinput parameters αls has been increased from zero to one whileothers are fixed at to zero (left) and to one (right). Then the styleloss has been calculated across 100 different content images. Ascan be seen, increasing αls decreases the loss of the correspondinglayer. Note that the losses is normalized in each layer for bettervisualization.

COCO [22] and faces from CelebA dataset [24] as our con-tent test images. Similar to [12, 8], we used the last featureset of conv3 as content layer C . We used last feature setof conv2, conv3 and conv4 layers from VGG-19 networkas style layers S . Since there is only one content layer, wefix αααc = 1. Our implementation can process 47.5 fps ona NVIDIA GeForce 1080, compared to 52.0 for the basemodel without Λ sub-network.

5.2. Effect of adjusting the input parameters

The primary goal of introducing the adjustable parame-ters ααα was to modify the loss of each separate layer man-ually. Qualitatively, this is demonstrable by increasing oneof the input parameters from zero to one while fixing therest of them to zero. Figure 4 shows one example of suchtransition. Each row in this figure is corresponding to a dif-ferent style layer, and therefore the stylizations at each rowwould be different. Notice how deeper layers stylize the im-age with bigger stylization elements from the style imagebut all of them still apply the coloring. We also visualizethe effect of increasing two of the input parameters at thesame time in Figure 9. However, these transitions are bestdemonstrated interactively which is accessible at the projectwebsite https://goo.gl/PVWQ9K.

To quantitatively demonstrate the change in losses withadjustment of the input parameters, we rerun the same ex-periment of assigning a fixed value to all of the input pa-rameters while gradually increasing one of them from zeroto one, this time across 100 different content images. Thenwe calculate the median loss at each style loss layer S . Ascan be seen in Figure 6-(top), increasing αls decreases themeasured loss corresponding to that parameter. To showthe generalization of our method across style images, we

5

conv4_3 conv3_3 conv2_3conv4_3conv3_3

conv4_3conv2_3

conv3_3conv2_3

conv4_3conv3_3conv2_3

Style

Style

Content

baseours

baseours

Figure 7. Qualitative comparison between the base model from [16] with our proposed method. For the base model, each column hasbeen retrained with all the weights set to zero except for the mentioned layers which has been set to 1e−3. For our model, the respectiveparameters αls has been adjusted. Note how close the stylizations are and how the combination of layers stays the same in both models.

trained 25 models with different style images and then mea-sured median of the loss at any of the S layers for 100 dif-ferent content images (Figure 6)-(bottom). We exhibit thesame drop trends as before which means the model can gen-erate stylizations conditioned on the input parameters.

Finally, we verify that modifying the input parametersαααsgenerates visually similar stylizations to the retrained basemodel with different loss weights wls. To do so, we trainthe base model [16] multiple times with different wls andthen compare the generated results with the output of ourmodel when ∀l ∈ S , αls = wls. Figure 7 demonstrates thiscomparison. Note how the proposed stylizations in test timeand without retraining match the output of the base model.

5.3. Generating randomized stylizations

One application of our proposed method is to generatemultiple stylizations given a fixed pair of content/style im-age. To do so, we randomizeααα to generate randomized styl-ization (top row of Figure 5). Changing values of ααα usuallydo not randomize the position of the “elements” of the style.We can enforce this kind of randomness by adding somenoise with the small magnitude to the content image. Forthis purpose, we multiply the content image with a maskwhich is computed by applying an inverse Gaussian filter

on a white image with a handful (< 10) random zeros. Thismasking can shadow sensitive parts of the image which willchange the spatial locations of the “elements” of style. Mid-dle row in Figure 5 demonstrates the effect of this random-ization. Finally, we combine these two randomizations tomaximizes the diversity of the output which is shown in thebottom row of Figure 5. More randomized stylizations canbe seen in Figure 10 and at https://goo.gl/PVWQ9K.

5.3.1 Comparison with other methods

To the best of our knowledge, generating diverse styliza-tions at real-time is only have been studied at [32] before.In this section, we qualitatively compare our method withthis baseline. Also, we compare our method with a simplebaseline where we add noise to the style parameters.

The simplest baseline for getting diverse stylizations isto add noises to some parameters or the inputs of the style-transfer network. In the last section, we demonstrate thatwe can move the locations of elements of style by addingnoise to the content input image. To answer the questionthat if we can get different stylizations by adding noise tothe style input of the network, we utilize the model of [8]which uses conditional instance normalization for transfer-

6

Train+Run-time N

oiseStyleN

etO

urs

Content

Style

Run-time N

oiseFigure 8. Diversity comparison of our method and baselines. Firstrow shows results for a baseline that adds random noises to thestyle parameters at run-time. While we get diverse stylizations,the results are not similar to the input style image. Second rowcontains results for a baseline that adds random noises to the styleparameters at both training time and run-time. Model is robust tothe noise and it does not generate diverse results. Third row showsstylization results of StyleNet [32]. Our method generates diversestylizations while StyleNet results mostly differ in minor details.

ring style. We train this model with only one style imageand to get different stylizations, we add random noise to thestyle parameters (γααα and βααα parameters of equation 7) atrun-time. The stylization results for this baseline are shownon the top row of Figure 8. While we get different styl-izations by adding random noises, the stylizations are nolonger similar to the input style image.

To enforce similar stylizations, we trained the same base-line while we add random noises at the training phase aswell. The stylization results are shown in the second rowof Figure 8. As it can be seen, adding noise at the trainingtime makes the model robust to the noise and the styliza-tion results are similar. This indicates that a loss term thatencourages diversity is necessary.

We also compare the results of our model withStyleNet [32]. As visible in Figure 8, although StyleNet’sstylizations are different, they vary in minor details and allcarry the same level of stylization elements. In contrast,our model synthesizes stylized images with varying levelsof stylization and more randomization.

6. ConclusionOur main contribution in this paper is a novel method

which allows adjustment of each loss layer’s contribution infeed-forward style transfer networks, in real-time and aftertraining. This capability allows the users to adjust the styl-ized output to find the favorite stylization by changing in-put parameters and without retraining the stylization model.We also show how randomizing these parameters plus somenoise added to the content image can result in very differentstylizations from the same pair of style/content image.

Our method can be expanded in numerous ways e.g. ap-plying it to multi-style transfer methods such as [8, 12],applying the same parametrization technique to random-ize the correlation loss between the features of each layerand finally using different loss functions and pre-trainednetworks for computing the loss to randomize the outputseven further. One other interesting future direction is to ap-ply the same “loss adjustment after training” technique forother classic computer vision and deep learning tasks. Styletransfer is not the only task in which modifying the hyper-parameters can greatly affect the predicted results and itwould be rather interesting to try this method for adjustingthe hyper-parameters in similar problems.

References[1] M. Ashikhmin. Synthesizing natural textures. In Proceed-

ings of the 2001 symposium on Interactive 3D graphics,pages 217–226. ACM, 2001.

[2] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, andS. Levine. Stochastic variational video prediction. arXivpreprint arXiv:1710.11252, 2017.

[3] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diversecolorization via generative adversarial networks. In JointEuropean Conference on Machine Learning and KnowledgeDiscovery in Databases. Springer, 2017.

[4] Q. Chen and V. Koltun. Photographic image synthesis withcascaded refinement networks. In ICCV, 2017.

[5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , andA. Vedaldi. Describing textures in the wild. In Proceedingsof the IEEE Conf. on Computer Vision and Pattern Recogni-tion (CVPR), 2014.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database.In Computer Vision and Pattern Recognition, 2009. CVPR2009. IEEE Conference on, pages 248–255. Ieee, 2009.

[7] A. Deshpande, J. Lu, M.-C. Yeh, M. J. Chong, and D. A.Forsyth. Learning diverse image colorization. In CVPR,2017.

[8] V. Dumoulin, J. Shlens, and M. Kudlur. A learned represen-tation for artistic style. Proc. of ICLR, 2017.

[9] A. A. Efros and W. T. Freeman. Image quilting for tex-ture synthesis and transfer. In Proceedings of the 28th an-nual conference on Computer graphics and interactive tech-niques, pages 341–346. ACM, 2001.

7

0.01.0

1.0

conv4_3

conv3_3

0.01.0

1.0

conv4_3

conv3_3

0.01.0

1.0

conv4_3

conv3_3

0.01.0

1.0

conv4_3

conv3_3

Figure 9. More results for adjusting the input parameters in real-time and after training. In each block the style/content pair is fixed while theparameters corresponding to conv3 and conv4 increases vertically and horizontally from zero to one. Notice how the details are differentfrom one layer to another and how the combination of layers may result to more favored stylizations. For an interactive presentation pleasevisit https://goo.gl/PVWQ9K.

8

Figure 10. More results of stochastic stylization from the same pair of content/style. Each block represents randomized stylized outputsgiven the fix style/content image demonstrated at the top. Notice how stylized images vary in style granularity, the spatial position of styleelements while maintaining similarity to the original style and content image. For more results please visit https://goo.gl/PVWQ9K.

9

[10] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesisusing convolutional neural networks. In Advances in NeuralInformation Processing Systems, pages 262–270, 2015.

[11] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transferusing convolutional neural networks. In CVPR, pages 2414–2423, 2016.

[12] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens.Exploring the structure of a real-time, arbitrary neural artisticstylization network. arXiv preprint arXiv:1705.06830, 2017.

[13] A. Hertzmann. Painterly rendering with curved brush strokesof multiple sizes. In Proceedings of the 25th annual con-ference on Computer graphics and interactive techniques,pages 453–460. ACM, 1998.

[14] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.

[15] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodalunsupervised image-to-image translation. arXiv preprintarXiv:1804.04732, 2018.

[16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses forreal-time style transfer and super-resolution. In EuropeanConference on Computer Vision, pages 694–711. Springer,2016.

[17] Kaggle. Kaggle Painter by numbers kernel description.www.kaggle.com/c/painter-by-numbers. 2016.

[18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, andS. Levine. Stochastic adversarial video prediction. arXivpreprint arXiv:1804.01523, 2018.

[19] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang.Universal style transfer via feature transforms. In Advancesin Neural Information Processing Systems, pages 386–396,2017.

[20] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXivpreprint arXiv:1802.06474, 2018.

[21] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neuralstyle transfer. arXiv preprint arXiv:1701.01036, 2017.

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-manan, P. Dollar, and C. L. Zitnick. Microsoft coco: Com-mon objects in context. In European conference on computervision, pages 740–755. Springer, 2014.

[23] X.-C. Liu, M.-M. Cheng, Y.-K. Lai, and P. L. Rosin. Depth-aware neural style transfer. In Proceedings of the Symposiumon Non-Photorealistic Animation and Rendering, 2017.

[24] Z. Liu, P. Luo, X. Wang, and X. Tang. Large-scale celebfacesattributes (celeba) dataset. Retrieved August, 15:2018, 2018.

[25] X. Peng and K. Saenko. Synthetic to real adaptation withgenerative correlation alignment networks. In WACV, 2018.

[26] E. Risser, P. Wilmot, and C. Barnes. Stable and controllableneural texture synthesis and style transfer using histogramlosses. arXiv preprint arXiv:1701.08893, 2017.

[27] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transferfor videos and spherical images. International Journal ofComputer Vision, pages 1–21, 2018.

[28] A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. Astyle-aware content loss for real-time hd style transfer. arXivpreprint arXiv:1807.10201, 2018.

[29] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. arXiv preprintarXiv:1409.1556, 2014.

[30] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky.Texture networks: Feed-forward synthesis of textures andstylized images. In ICML, pages 1349–1357, 2016.

[31] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance nor-malization: the missing ingredient for fast stylization. corrabs/1607.0 (2016).

[32] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Improvedtexture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.

[33] Y. Zhang. Xogan: One-to-many unsupervised image-to-image translation. arXiv preprint arXiv:1805.07277, 2018.

10

Operation input dimensions output dimensions

input parameters ααα 3 100010×Dense 1000 1000

Dense 1000 2(γααα, βααα)

Optimizer Adam (α = 0.001, β1 = 0.9, β2 = 0.999)Training iterations 200K

Batch size 8Weight initialization Isotropic gaussian (µ = 0, σ = 0.01)

Table 1. Network architecture and hyper-parameters of Λ.

Operation Kernel size Stride Feature maps Padding Nonlinearity

Network – 256× 256× 3 inputConvolution 9 1 32 SAME ReLUConvolution 3 2 64 SAME ReLUConvolution 3 2 128 SAME ReLU

Residual block 128Residual block 128Residual block 128Residual block 128Residual block 128Residual block 128Residual block 128

Upsampling 64Upsampling 32Convolution 9 1 3 SAME Sigmoid

Residual block – C feature mapsConvolution 3 1 C SAME ReLUConvolution 3 1 C SAME Linear

Add the input and the outputUpsampling – C feature maps

Nearest-neighbor interpolation, factor 2Convolution 3 1 C SAME ReLU

Normalization Conditional instance normalization after every convolutionOptimizer Adam (α = 0.001, β1 = 0.9, β2 = 0.999)

Training iterations 200KBatch size 8

Weight initialization Isotropic gaussian (µ = 0, σ = 0.01)

Table 2. Network architecture and hyper-parameters of T .

11

1.00.0

conv2_3

conv3_3

conv4_3

1.00.0

conv2_3

conv3_3

conv4_3

1.00.0

conv2_3

conv3_3

conv4_3

1.00.0

conv2_3

conv3_3

conv4_3

Content

Style

Content

Style

Content

Style

Content

Style

Figure 11. More examples for effect of adjusting the input parameters αααs in real-time. Each row shows the stylized output when a singleαls increased gradually from zero to one while other αααs are fixed to zero. Notice how the details of each stylization is different specially atthe last column where the weight is maximum. Also how deeper layers use bigger features of style image to stylize the content.

12

3e-2 1e-2 3e-3 1e-3 3e-4 1e-4Style

Figure 12. More examples for effect of adjusting the style weight in style transfer network from [16]. Each column demonstrates the resultof a separate training. As can be seen, the “optimal” weight is different from one style image to another and there can be more than one“good” stylization depending on ones personal choice.

13

conv2_3 conv3_3 conv4_3conv2_3conv3_3

conv2_3conv4_3

conv3_3conv4_3

Figure 13. Results of combining losses from different layers at generation time by adjusting their corresponding parameters. The firstcolumn is the style image which is fixed for each row. The content image is the same for all of the outputs. The corresponding parameterfor each one of the losses is zero except for the one(s) mentioned in the title of each column. Notice how each layer enforces a differenttype of stylization and how the combinations vary as well. Also note how a single combination of layers cannot be the “optimal” stylizationfor any style image and one may prefer the results from another column.

14