
Interpreting Adversarially Trained Convolutional Neural Networks

Tianyuan Zhang 1 Zhanxing Zhu 2 3 4

Abstract

We attempt to interpret how adversarially trained convolutional neural networks (AT-CNNs) recognize objects. We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. Surprisingly, we find that adversarial training alleviates the texture bias of standard CNNs when trained on object recognition tasks, and helps CNNs learn a more shape-biased representation. We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and images under different transformations. The comparison could visually show that the prediction of the two types of CNNs is sensitive to dramatically different types of features. Second, to achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred versions of clean data, saturated images and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some light on why AT-CNNs are more robust than those normally trained ones and contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.

1. Introduction

Convolutional neural networks (CNNs) have achieved great success in a variety of visual recognition tasks (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015) with their stacked local connections. A crucial issue is to understand what is being learned after training over thousands or even millions of images. This involves interpreting CNNs.

1 School of EECS, Peking University, China. 2 School of Mathematical Sciences, Peking University, China. 3 Center for Data Science, Peking University. 4 Beijing Institute of Big Data Research. Correspondence to: Zhanxing Zhu <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Along this line, some recent works showed that standard CNNs trained on ImageNet make their predictions rely on local textures rather than long-range dependencies encoded in the shape of objects (Geirhos et al., 2019; Brendel & Bethge, 2019; Ballester & de Araujo, 2016). Consequently, this texture bias prevents the trained CNNs from generalizing well on images with distorted textures but maintained shape information. Geirhos et al. (2019) also showed that using a combination of Stylized-ImageNet and ImageNet can alleviate the texture bias of standard CNNs. It naturally raises an intriguing question:

Are there any other trained CNNs that are more biased towards shapes?

Recently, normally trained neural networks were found to be easily fooled by maliciously perturbed examples, i.e., adversarial examples (Goodfellow et al., 2014; Kurakin et al., 2016). To defend against adversarial examples, adversarial training was proposed; that is, instead of minimizing the loss function over the clean examples, it minimizes an almost worst-case loss over slightly perturbed examples (Madry et al., 2018). We name these adversarially trained networks AT-CNNs. They were extensively shown to enhance robustness, i.e., to improve the classification accuracy on adversarial examples. Then,

What is learned by adversarially trained CNNs to make them more robust?

In this work, in order to answer the above questions, we systematically design various experiments to interpret AT-CNNs and compare them with normally trained models. We find that AT-CNNs are better at capturing long-range correlations such as shapes, and are less biased towards textures than normally trained CNNs on popular object recognition datasets. This finding partially explains why AT-CNNs tend to be more robust than standard CNNs.

We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and those under different transformations. The comparison could visually show that the predictions of the two CNNs are sensitive to dramatically different types of features. Second, we construct additional test datasets that destroy either textures or shapes, such as the style-transferred version of clean data, saturated images and patch-shuffled images, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets.


These sophisticatedly designed experiments provide a quantitative comparison between the two CNNs and demonstrate their biases when making predictions.

To the best of our knowledge, we are the first to carry out a systematic investigation on interpreting adversarially trained CNNs, both visually and quantitatively. Our findings shed some light on why AT-CNNs are more robust than those normally trained ones and also contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.¹

The remainder of the paper is structured as follows. We introduce background knowledge on adversarial training and salience methods in Section 2. The methods for interpreting AT-CNNs are described in Section 3. Then we present the experimental results to support our findings in Section 4. The related works and discussions are presented in Section 5. Section 6 concludes the paper.

2. Preliminary

2.1. Adversarial training

This training method was first proposed by Goodfellow et al. (2014), and it is so far the most successful approach for building robust models that defend against adversarial examples (Madry et al., 2018; Sinha et al., 2018; Athalye et al., 2018; Zhang et al., 2019b;a). It can be formulated as solving a robust optimization problem (Shaham et al., 2015):

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim D} \left[ \max_{\delta \in S} \ell\big(f(x+\delta;\theta),\, y\big) \right] \tag{1}$$

where f(x; θ) represents the neural network parameterized by weights θ, the input-output pair (x, y) is sampled from the training set D, δ denotes the adversarial perturbation, and ℓ(·, ·) is the chosen loss function, e.g., the cross-entropy loss. S denotes a certain norm constraint, such as ℓ∞ or ℓ2.

The inner maximization is approximated by adversarial examples generated by various attack methods. Training against a projected gradient descent (PGD; Madry et al. (2018)) adversary leads to state-of-the-art white-box robustness. We use PGD-based adversarial training with bounded ℓ∞ and ℓ2 norm constraints. We also investigate FGSM (Goodfellow et al., 2014) based adversarial training.
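For concreteness, the sketch below (our own illustrative PyTorch code, not the authors' released implementation) shows one common way the inner maximization of Eq. (1) is approximated with an ℓ∞-bounded PGD adversary during training; `model`, the step sizes and iteration counts are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, n_steps=20):
    """Approximate the inner max of Eq. (1) with an l_inf-bounded PGD adversary."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the loss, then project back onto the l_inf ball of radius eps.
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return torch.clamp(x + delta, 0, 1).detach()

def adversarial_training_step(model, optimizer, x, y):
    model.eval()                      # generate adversarial examples for the current weights
    x_adv = pgd_attack(model, x, y)
    model.train()                     # minimize the (approximate) worst-case loss of Eq. (1)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

FGSM-based adversarial training follows the same recipe with a single gradient-sign step of size ε in place of the PGD loop.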

2.2. Salience maps

Given a trained neural network, visualizing the salience maps aims at assigning a sensitivity value, sometimes also called "attribution", to show the sensitivity of the output to each pixel of an input image. Salience methods can mainly be divided into (Ancona et al., 2018) perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and gradient-based methods (Erhan et al., 2009; Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Selvaraju et al., 2017; Zhou et al., 2016; Smilkov et al., 2017; Bach et al., 2015). Recently, Adebayo et al. (2018) carried out a systematic test of many of the gradient-based salience methods, and only variants of Grad and GradCAM (Selvaraju et al., 2017) pass the proposed sanity checks. We thus choose Grad and its smoothed version, SmoothGrad (Smilkov et al., 2017), for visualization.

¹ Our codes are available at https://github.com/PKUAI26/AT-CNN

Formally, let x ∈ R^d denote the input image; a trained network is a function f : R^d → R^K, where K is the total number of classes. Let S_c denote the class activation function for each class c. We seek to obtain a salience map E ∈ R^d. The Grad explanation is the gradient of the class activation with respect to the input image x:

$$E = \frac{\partial S_c(x)}{\partial x}. \tag{2}$$

SmoothGrad (Smilkov et al., 2017) was proposed to alleviate noise in the gradient explanation by averaging over the gradients of noisy copies of an input. Thus, for an input x, the smoothed variant of Grad, SmoothGrad, can be written as

$$E = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x_i)}{\partial x_i}, \tag{3}$$

where x_i = x + g_i, and the g_i are noise vectors drawn i.i.d. from a Gaussian distribution N(0, σ²). In all our experiments, we set n = 100 and the noise level σ/(x_max − x_min) = 0.1. We choose S_c(x) = log p_c(x), where p_c(x) is the probability of class c assigned by the classifier to input x.
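As an illustration, the following sketch (our own hypothetical code, assuming a PyTorch classifier `model` that returns logits for a single image tensor batch) computes the SmoothGrad salience map of Eq. (3) with S_c(x) = log p_c(x).

```python
import torch
import torch.nn.functional as F

def smoothgrad(model, x, target_class, n=100, noise_level=0.1):
    """SmoothGrad salience map, Eq. (3): average gradient of log p_c over noisy copies of x."""
    sigma = noise_level * (x.max() - x.min())   # noise scale relative to the image value range
    grads = torch.zeros_like(x)
    for _ in range(n):
        x_i = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        log_prob = F.log_softmax(model(x_i.unsqueeze(0)), dim=1)[0, target_class]
        grads += torch.autograd.grad(log_prob, x_i)[0]
    return (grads / n).detach()
```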

3. Methods

In this section, we elaborate our method for interpreting the adversarially trained CNNs and comparing them with normally trained ones. Three image datasets are considered, including Tiny ImageNet², Caltech-256 (Griffin et al., 2007) and CIFAR-10.

We first visualize the salience maps of AT-CNNs and normal CNNs to demonstrate that the two models trained in different ways are sensitive to different kinds of features. Besides this qualitative comparison, we also test the two kinds of CNNs on different transformed datasets to distinguish the difference in their preferred features.

3.1. Visualizing the salience maps

A straightforward way of investigating the difference between AT-CNNs and CNNs is to visualize which group of pixels the network outputs are most sensitive to.

² https://tiny-imagenet.herokuapp.com


(a) Original  (b) Stylized  (c) Saturated 8  (d) Saturated 1024  (e) Patch-shuffle 2  (f) Patch-shuffle 4

Figure 1. Visualization of the three transformations. Original images are from Caltech-256. From left to right: original, stylized, saturation level 8, saturation level 1024, 2×2 patch-shuffling, 4×4 patch-shuffling.

Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates to show what features a model is sensitive to. We compare the salience maps of AT-CNNs and CNNs on clean images and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.

As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, could reduce visual noise by averaging the gradient over Gaussian-perturbed copies of the image. Thus, we mainly report the salience maps produced by SmoothGrad; the Grad visualization results are provided in the appendix. Note that the two visualization methods lead us to a consistent conclusion on the difference between the two trained CNNs.

3.2. Generalization on shape/texture-preserving distortions

Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating the performance degradation under several distortions that preserve either shapes or textures. Intuitively, if one model relies heavily on textures, its performance would degrade severely if we destroy most of the textures while preserving other information such as the shapes and other features. However, a perfect disentanglement of texture, shape and other feature information is impossible (Gatys et al., 2015). In this work, we mainly construct three kinds of image transformations to achieve shape or texture distortion: style transfer, saturation and the patch-shuffling operation. Some of the image samples are shown in Figure 1. We also add three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.

Note that we conduct normal training or adversarial training on the original training sets and then evaluate generalizability over the transformed data. During training, we never use the transformed datasets.

Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, with similar generalization error, models capturing shapes better should also perform better on stylized test images than those biased towards textures. The style-transferred image samples are shown in Figure 1(b).

Saturation. Similar to Ding et al. (2019), we denote the saturation of the image x by x^p, where p indicates the saturation level ranging from 0 to ∞. When p = 2, the saturation operation does not change the image. When p ≥ 2, increasing the saturation level pushes the pixel values towards binarized ones, and p = ∞ leads to pure binarization. Specifically, for each pixel of image x with value v ∈ [0, 1], its corresponding saturated pixel of x^p is defined as $\operatorname{sign}(2v-1)\,|2v-1|^{2/p}/2 + 1/2$. One can observe from Figure 1(c) and (d) that increasing the saturation level can gradually destroy some texture information while preserving most parts of the contour structures.
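The saturation operation is a simple per-pixel transform; the sketch below (our own illustrative NumPy code, assuming the image is a float array in [0, 1]) is one plausible implementation of the definition above.

```python
import numpy as np

def saturate(image, p):
    """Saturation transform: sign(2v-1) * |2v-1|^(2/p) / 2 + 1/2 applied per pixel.

    p = 2 leaves the image unchanged; p -> inf binarizes it;
    p -> 0 pushes every pixel towards 0.5 (a constant gray image).
    """
    v = 2.0 * image - 1.0                      # map [0, 1] to [-1, 1]
    return np.sign(v) * np.abs(v) ** (2.0 / p) / 2.0 + 0.5
```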

Patch-Shuffling. To destroy long-range shape information, we split images into k × k small patches and randomly rearrange the order of these patches, with k ∈ {2, 4, 8}. Favorably, this operation preserves most of the texture information and destroys most of the shape information. The patch-shuffled image samples are shown in Figure 1(e), (f). Note that as k increases, more information of the original image is lost, especially for images with low resolution.
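A minimal sketch of this operation (our own illustrative NumPy code, assuming the image height and width are divisible by k) could look as follows.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an HxWxC image into a k x k grid of patches and randomly shuffle their order."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // k, image.shape[1] // k
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[order[r * k + c]] for c in range(k)], axis=1)
            for r in range(k)]
    return np.concatenate(rows, axis=0)
```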


Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded ℓ∞ norm. Details are listed in the appendix. Note that underfitting CNNs have similar generalization performance with some of the AT-CNNs on clean images.

Model      | CIFAR-10 Acc. | CIFAR-10 Rob. | TinyImageNet Acc. | TinyImageNet Rob. | Caltech-256 Acc. | Caltech-256 Rob.
PGD-inf 8  | 86.27 | 44.81 | 54.42 | 14.25 | 66.41 | 31.16
PGD-inf 4  | 89.17 | 30.85 | 61.85 |  6.87 | 72.22 | 20.10
PGD-inf 2  | 91.4  | 39.11 | 67.06 |  1.66 | 76.51 |  7.51
PGD-inf 1  | 93.40 |  7.53 | 69.42 |  0.18 | 79.11 |  1.70
PGD-L2 12  | 85.79 | 34.61 | 53.44 | 14.80 | 65.54 | 31.36
PGD-L2 8   | 88.01 | 26.88 | 58.21 | 10.03 | 69.75 | 26.19
PGD-L2 4   | 90.77 | 13.19 | 64.24 |  3.61 | 74.12 | 14.33
FGSM 8     | 84.90 | 34.25 | 66.21 |  0.01 | 70.88 | 20.02
FGSM 4     | 88.13 | 25.08 | 63.43 |  0.13 | 73.91 | 15.16
Normal     | 94.52 |  0    | 72.02 |  0.01 | 83.32 |  0
Underfit   | 86.79 |  0    | 60.05 |  0.01 | 69.04 |  0

4. Experiments and analysis

Experiments setup. We describe the experimental setup for evaluating the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet and Caltech-256 (Griffin et al., 2007). Note that we do not create the style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.

When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b); for data augmentation, we perform zero padding with width 4, horizontal flip and random crop.

Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images and 50 test images. All images from Tiny ImageNet are of size 64 × 64. We re-scale them to 224 × 224 and perform random horizontal flip and per-image standardization as data augmentation.

Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech is much higher than that of the above two datasets. We manually split 20% of the images as the test set. We perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256, we use the ResNet-18 model as the network architecture.
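For reference, a torchvision-style training pipeline matching the preprocessing described above might look like the sketch below (our own illustrative code; the exact crop sizes and normalization statistics are assumptions, not values taken from the paper).

```python
from torchvision import transforms

# CIFAR-10: 4-pixel zero padding, random crop back to 32x32, horizontal flip.
cifar_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Tiny ImageNet / Caltech-256: resize to 224x224, horizontal flip,
# with per-image standardization approximated here by channel normalization.
imagenet_like_train = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```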

Compared models, their generalization and robustness. For all of the above three datasets, we train three types of AT-CNNs; they mainly differ in the way of generating adversarial examples: FGSM, PGD with bounded ℓ∞ norm, and PGD with bounded ℓ2 norm, and for each attack method we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained over clean data) with similar generalization performance to the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and robustness measured by the PGD attack are shown in Table 1.

4.1. Visualization results

To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images and stylized images. The visualization results are presented in Figure 2.

We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated and stylized ones. In contrast, the sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in Geirhos et al. (2019).

Particularly, in the second row of Figure 2, the sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated image is nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.

For stylized images, shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original objects,


(a) Images from Caltech-256  (b) Images from Tiny ImageNet

Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. From top to bottom: original, saturation 1024, and stylizing. For each group of images, from left to right: original image, sensitivity maps of the standard CNN, the underfitting CNN, and the PGD-ℓ∞ AT-CNN.

while standard CNNs totally fail.

Due to the limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.

4.2. Generalization performance on transformed data

In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shapes or textures. This could help us understand, in a quantitative way, how differently the two types of models are biased.

For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy of the transformed versions of these correctly classified images.
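In other words, the metric conditions on the clean-image predictions. A hypothetical sketch of the computation (our own code, assuming NumPy arrays of predicted and ground-truth labels) is given below.

```python
import numpy as np

def accuracy_on_correctly_classified(pred_clean, pred_transformed, labels):
    """Accuracy on transformed images, restricted to images the model classified correctly on clean data."""
    correct_on_clean = pred_clean == labels
    if correct_on_clean.sum() == 0:
        return 0.0
    return float((pred_transformed[correct_on_clean] == labels[correct_on_clean]).mean())
```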

4.2.1. Stylizing

Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.

We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, though with a lower accuracy on original test images, AT-CNNs achieve higher accuracy on stylized ones whose textures are dramatically changed. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.

4.2.2. Saturation

We use the saturation operation to manipulate the images and show how increasing saturation levels affects the accuracy of models trained in different ways.

In Figure 4, we visualize images with varying saturation levels. It can easily be observed that increasing the saturation level pushes images towards being more "binarized", where some textures are wiped out, but sharper edges are produced and shape information is preserved. When the saturation level is smaller than 2 (p = 2 being the clean image), all the pixels are pushed towards 1/2 and nearly all the information is lost; p = 0 leads to a totally gray image with constant pixel value.

We measure the "accuracy on correctly classified images" for all the trained models and show the results in Figure 5. We can observe that with an increasing saturation level, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss, still obtaining a high classification accuracy. The results indicate that AT-CNNs are more robust to the "saturation" or "binarizing" operation, which may demonstrate that the prediction capability of AT-CNNs relies less on textures and more on shapes. Results on CIFAR-10 tell the same story; they are presented in the appendix due to the limited space.

Additionally, in our experiments, for each adversarial training approach, either PGD- or FGSM-based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to the increase of the saturation level and to texture loss. On the other hand, adversarial training with higher robustness typically ruins generalization over the clean dataset. Our finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).


Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.

Table 2. "Accuracy on correctly classified images" for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of the different models on the clean test sets.

Model      | Caltech-256 | Stylized Caltech-256 | TinyImageNet | Stylized TinyImageNet
Standard   | 83.32 | 16.83 | 72.02 |  7.25
Underfit   | 69.04 |  9.75 | 60.35 |  7.16
PGD-ℓ∞ 8   | 66.41 | 19.75 | 54.42 | 18.81
PGD-ℓ∞ 4   | 72.22 | 21.10 | 61.85 | 20.51
PGD-ℓ∞ 2   | 76.51 | 21.89 | 67.06 | 19.25
PGD-ℓ∞ 1   | 79.11 | 22.07 | 69.42 | 18.31
PGD-ℓ2 12  | 65.24 | 20.14 | 53.44 | 19.33
PGD-ℓ2 8   | 69.75 | 21.62 | 58.21 | 20.42
PGD-ℓ2 4   | 74.12 | 22.53 | 64.24 | 21.05
FGSM 8     | 70.88 | 21.23 | 66.21 | 15.07
FGSM 4     | 73.91 | 21.99 | 63.43 | 20.22

Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right: saturation level 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels to 1/2.

(a) Caltech-256  (b) Tiny ImageNet

Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels (x-axis: saturation level; y-axis: accuracy on correctly classified images; curves: PGD AT with ℓ∞ norm, PGD AT with ℓ2 norm, FGSM AT, standard training, and underfitting). Note that there are several curves with the same color and line type shown for each adversarial training method (PGD- and FGSM-based); those with larger perturbation achieve better robustness in most cases. Detailed results are listed in the appendix.


(a) Original Image  (b) Patch-Shuffle 2  (c) Patch-Shuffle 4  (d) Patch-Shuffle 8

Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of the "cake" class assigned by different models (Normal, Underfit, PGD-inf 8, PGD-L2 12, FGSM 8): (a) original image: 0.750, 0.952, 0.738, 0.769, 0.932; (b) Patch-Shuffle 2: 0.550, 0.877, 0.012, 0.043, 0.028; (c) Patch-Shuffle 4: 0.541, 0.913, 0.002, 0.012, 0.012; (d) Patch-Shuffle 8: 0.005, 0.305, 0.002, 0.002, 0.003.

(a) Caltech-256  (b) Tiny ImageNet

Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Caltech-256 and Tiny ImageNet with different splitting numbers (x-axis: clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8; y-axis: accuracy on correctly classified images; curves: PGD AT with ℓ∞ norm, PGD AT with ℓ2 norm, FGSM AT, standard training, and underfitting). Detailed results are listed in the appendix.

When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions; they tend to be more robust only to particular types of distortions. We leave further investigation of this issue as future work.

4.2.3. Patch-Shuffling

The stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. In order to test the different biases of AT-CNNs and standard CNNs in the other direction, we shatter the shape and edge information by splitting the images into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.

Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. Obviously, after random shuffling, the shapes and edge features are destroyed dramatically; the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence on the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.

Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.

Note that under the "Patch-shuffle 8" operation, all models have similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). That is, under "Patch-shuffle 2", the normally trained CNN has an accuracy


of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the patch-shuffle operation on low-resolution images destroys more useful features than it does on images with higher resolution.

5. Related work and discussion

Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that the sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class, appearing semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that can largely change the textures while preserving shape information (i.e., stylizing and saturation), or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.

Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question of whether any other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples, even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works have partially answered this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.

Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs trained against norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may highly depend on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, yet their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs trained against these generalized types of attacks as future work.

Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of the standard CNN, the PGD-ℓ∞ AT-CNN, and the ST-AT-CNN.

6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. Through constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two models on the three constructed datasets: stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortions and focus more on shape information, while normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI) and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.


A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units each.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture, using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an ℓ∞ projected gradient descent adversary with ε = 8/255, step size 2, and 40 iterations.

A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. Training against a projected gradient descent (PGD) adversary

We list the value of ε used for adversarial training on each dataset and ℓp norm below. In all settings, PGD runs 20 iterations.

• ℓ∞-norm bounded adversary: For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• ℓ2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224; we train three adversarially trained CNNs with ε ∈ {4, 8, 12}, and these three models are denoted as PGD-l2 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.

A.2.2. Training against an FGSM adversary

The ε values for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM 4, 8, respectively.

B. Style-transferred test set

Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, with the style of a randomly selected painting from Kaggle's Painter by Numbers dataset.³ We used the source code provided by Geirhos et al. (2019).

C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set (a rough code sketch of these operations follows the list):

• The low-frequency filtered version: we use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high-frequency filtered version: we use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The randomly filtered version: we use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during testing.
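As a rough illustration of these masks (our own NumPy sketch; the cutoff radius and drop probability are assumed parameters, not values from the paper):

```python
import numpy as np

def fourier_filter(image, mode="low", radius=16, p=0.5, rng=None):
    """Filter a single-channel image in the Fourier domain.

    mode="low"   : keep only modes within `radius` of the spectrum center (low-pass).
    mode="high"  : keep only modes outside `radius` (high-pass).
    mode="random": zero out each mode independently with probability `p`.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    if mode == "low":
        mask = dist <= radius
    elif mode == "high":
        mask = dist > radius
    else:  # random mask, regenerated on the fly for every image
        mask = rng.random((h, w)) >= p
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)
```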

C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered versions of Caltech-256; the results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.

D. Detailed results

We present the detailed results of our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on the

³ https://www.kaggle.com/c/painter-by-numbers


Table 3. "Accuracy on correctly classified images" for different models on the three Fourier-filtered Caltech-256 test sets.

Model    | Low-frequency filtered | High-frequency filtered | Randomly filtered
Standard | 15.8 | 16.5 | 73.5
Underfit | 14.5 | 17.6 | 62.2
PGD-ℓ∞   | 71.1 |  3.6 | 73.4

test sets with different saturation levels. Tables 7 and 8 list all the results of each model on the test sets after different patch-shuffling operations.

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps produced by Grad and SmoothGrad in Figure 10.


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

Saturation level | 0.25 | 0.5 | 1 | 4 | 8 | 16 | 64 | 1024
Standard  | 28.62 | 57.45 | 85.20 | 90.13 | 65.37 | 42.37 | 23.45 | 20.03
Underfit  | 31.84 | 63.36 | 90.96 | 84.51 | 57.51 | 38.58 | 26.00 | 23.08
PGD-ℓ∞ 8  | 32.84 | 53.47 | 82.72 | 86.45 | 70.33 | 61.09 | 53.76 | 51.91
PGD-ℓ∞ 4  | 31.99 | 57.74 | 85.18 | 87.95 | 70.33 | 58.38 | 48.16 | 45.45
PGD-ℓ∞ 2  | 32.99 | 60.75 | 87.75 | 89.35 | 68.78 | 51.99 | 40.69 | 37.83
PGD-ℓ∞ 1  | 32.67 | 61.85 | 89.36 | 90.18 | 69.07 | 50.05 | 37.98 | 34.80
PGD-ℓ2 12 | 31.38 | 53.07 | 82.10 | 83.89 | 67.06 | 58.51 | 52.45 | 50.75
PGD-ℓ2 8  | 32.82 | 56.65 | 85.01 | 86.09 | 68.90 | 58.75 | 51.59 | 49.30
PGD-ℓ2 4  | 32.82 | 58.77 | 86.30 | 86.36 | 67.94 | 53.68 | 44.43 | 41.98
FGSM 8    | 29.53 | 55.46 | 85.10 | 86.65 | 69.01 | 55.64 | 45.92 | 43.42
FGSM 4    | 32.68 | 59.37 | 87.22 | 87.90 | 66.71 | 51.13 | 41.66 | 38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

Saturation level | 0.25 | 0.5 | 1 | 4 | 8 | 16 | 64 | 1024
Standard  |  7.24 | 25.88 | 72.52 | 72.73 | 25.38 |  8.24 |  2.62 |  1.93
Underfit  |  7.34 | 25.44 | 69.80 | 60.67 | 18.01 |  6.72 |  3.16 |  2.65
PGD-ℓ∞ 8  | 11.07 | 29.08 | 67.11 | 74.53 | 49.8  | 40.16 | 35.44 | 33.96
PGD-ℓ∞ 4  | 12.44 | 33.53 | 72.94 | 75.75 | 46.38 | 32.12 | 24.92 | 22.65
PGD-ℓ∞ 2  | 12.09 | 34.85 | 75.77 | 76.15 | 41.35 | 25.20 | 16.93 | 14.52
PGD-ℓ∞ 1  | 11.30 | 35.03 | 76.85 | 78.63 | 40.48 | 21.37 | 12.70 | 10.81
PGD-ℓ2 12 | 11.30 | 29.48 | 66.94 | 75.22 | 52.26 | 42.11 | 37.20 | 35.85
PGD-ℓ2 8  | 12.42 | 32.78 | 71.94 | 75.15 | 47.92 | 35.66 | 29.55 | 27.90
PGD-ℓ2 4  | 12.63 | 34.10 | 74.06 | 77.32 | 45.00 | 28.73 | 20.16 | 18.04
FGSM 8    | 12.59 | 32.66 | 70.55 | 81.53 | 41.83 | 17.52 |  7.29 |  5.82
FGSM 4    | 12.63 | 34.10 | 74.06 | 75.05 | 42.91 | 29.09 | 22.15 | 20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

Saturation level | 0.25 | 0.5 | 1 | 4 | 8 | 16 | 64 | 1024
Standard  | 27.36 | 55.95 | 91.03 | 93.12 | 69.98 | 48.30 | 34.39 | 31.06
Underfit  | 21.43 | 50.28 | 87.71 | 89.89 | 66.09 | 43.35 | 29.10 | 26.13
PGD-ℓ∞ 8  | 26.05 | 46.96 | 80.97 | 89.16 | 75.46 | 69.08 | 58.98 | 64.64
PGD-ℓ∞ 4  | 27.22 | 49.81 | 84.16 | 89.79 | 73.89 | 65.35 | 59.99 | 58.47
PGD-ℓ∞ 2  | 28.32 | 53.12 | 86.93 | 91.37 | 74.02 | 62.82 | 55.25 | 52.60
PGD-ℓ∞ 1  | 27.18 | 53.59 | 88.54 | 91.77 | 72.67 | 58.39 | 47.25 | 41.75
PGD-ℓ2 12 | 25.99 | 46.92 | 81.72 | 88.44 | 73.92 | 66.03 | 60.98 | 59.41
PGD-ℓ2 8  | 27.75 | 50.29 | 83.76 | 80.92 | 73.17 | 64.83 | 58.64 | 46.94
PGD-ℓ2 4  | 27.26 | 51.17 | 85.78 | 90.08 | 73.12 | 61.50 | 52.04 | 48.79
FGSM 8    | 25.50 | 46.11 | 81.72 | 87.67 | 74.22 | 67.12 | 62.51 | 61.32
FGSM 4    | 26.39 | 58.93 | 84.30 | 89.02 | 73.47 | 64.43 | 58.80 | 56.82


Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. The results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

Model     | 2×2 | 4×4 | 8×8
Standard  | 84.76 | 51.50 | 10.84
Underfit  | 75.59 | 33.41 |  6.03
PGD-ℓ∞ 8  | 58.13 | 20.14 |  7.70
PGD-ℓ∞ 4  | 68.54 | 26.45 |  8.18
PGD-ℓ∞ 2  | 74.25 | 30.77 |  9.00
PGD-ℓ∞ 1  | 78.11 | 35.03 |  8.42
PGD-ℓ2 12 | 58.25 | 21.03 |  7.85
PGD-ℓ2 8  | 63.36 | 22.19 |  8.48
PGD-ℓ2 4  | 69.65 | 28.21 |  7.72
FGSM 8    | 64.48 | 22.94 |  8.07
FGSM 4    | 70.50 | 28.41 |  6.03

Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. The results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

Model     | 2×2 | 4×4 | 8×8
Standard  | 66.73 | 24.87 | 4.48
Underfit  | 59.22 | 23.62 | 4.38
PGD-ℓ∞ 8  | 41.08 | 16.05 | 6.83
PGD-ℓ∞ 4  | 49.54 | 18.23 | 6.30
PGD-ℓ∞ 2  | 55.96 | 19.95 | 5.61
PGD-ℓ∞ 1  | 60.19 | 23.24 | 6.08
PGD-ℓ2 12 | 42.23 | 16.95 | 7.66
PGD-ℓ2 8  | 47.67 | 16.28 | 6.50
PGD-ℓ2 4  | 51.94 | 17.79 | 5.89
FGSM 8    | 57.42 | 20.70 | 4.73
FGSM 4    | 50.68 | 16.84 | 5.98

Figure 9. Visualization of salience maps generated by SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4.


Figure 10. Visualization of salience maps generated by Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4. It is easily observed that sensitivity maps generated by Grad are noisier than those of its smoothed variant SmoothGrad, especially for the standard CNN and the underfitting CNN.

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 2: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

shuffled images, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. These carefully designed experiments provide a quantitative comparison between the two kinds of CNNs and demonstrate their different biases when making predictions.

To the best of our knowledge, we are the first to carry out a systematic investigation on interpreting adversarially trained CNNs, both visually and quantitatively. Our findings shed some light on why AT-CNNs are more robust than normally trained ones and also contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.¹

The remainder of the paper is structured as follows. We introduce background knowledge on adversarial training and salience methods in Section 2. The methods for interpreting AT-CNNs are described in Section 3. We then present the experimental results supporting our findings in Section 4. Related work and discussions are presented in Section 5, and Section 6 concludes the paper.

2. Preliminary

2.1. Adversarial training

This training method was first proposed by Goodfellow et al. (2014) and is so far the most successful approach for building robust models that defend against adversarial examples (Madry et al., 2018; Sinha et al., 2018; Athalye et al., 2018; Zhang et al., 2019b;a). It can be formulated as solving a robust optimization problem (Shaham et al., 2015),

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim \mathcal{D}} \left[ \max_{\delta \in S} \ell\big(f(x+\delta;\, \theta),\, y\big) \right], \qquad (1)$$

where f(x; θ) represents the neural network parameterized by weights θ, the input-output pair (x, y) is sampled from the training set D, δ denotes the adversarial perturbation, and ℓ(·, ·) is the chosen loss function, e.g., the cross-entropy loss. S denotes a certain norm constraint, such as ℓ∞ or ℓ2.

The inner maximization is approximated by adversarial examples generated by various attack methods. Training against a projected gradient descent (PGD; Madry et al., 2018) adversary leads to state-of-the-art white-box robustness. We use PGD-based adversarial training with bounded ℓ∞ and ℓ2 norm constraints, and we also investigate FGSM-based adversarial training (Goodfellow et al., 2014).
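To make the robust optimization in Eq. (1) concrete, the following is a minimal PyTorch-style sketch of PGD-based adversarial training with an ℓ∞-bounded adversary. The function names, hyper-parameter defaults and training loop are illustrative assumptions, not the exact code used for our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=20):
    """Approximate the inner maximization of Eq. (1) with an l_inf-bounded PGD adversary."""
    # random start inside the eps-ball, clipped back to the valid pixel range [0, 1]
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # ascent step on the loss, then projection onto the eps-ball around x and onto [0, 1]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of the outer minimization: update the weights on adversarial examples only."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

An FGSM-based variant simply replaces the iterative loop with a single signed-gradient step of size ε.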

2.2. Salience maps

Given a trained neural network, visualizing the salience maps aims at assigning each pixel of an input image a sensitivity value, sometimes also called an "attribution", showing how sensitive the output is to that pixel. Salience methods can mainly be divided into (Ancona et al., 2018) perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and gradient-based methods (Erhan et al., 2009; Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Selvaraju et al., 2017; Zhou et al., 2016; Smilkov et al., 2017; Bach et al., 2015). Recently, Adebayo et al. (2018) carried out a systematic test of many gradient-based salience methods, and only variants of Grad and GradCAM (Selvaraju et al., 2017) pass the proposed sanity checks. We thus choose Grad and its smoothed version SmoothGrad (Smilkov et al., 2017) for visualization.

¹ Our codes are available at https://github.com/PKUAI26/AT-CNN

Formally, let x ∈ R^d denote the input image; a trained network is a function f : R^d → R^K, where K is the total number of classes. Let S_c denote the class activation function for class c. We seek to obtain a salience map E ∈ R^d. The Grad explanation is the gradient of the class activation with respect to the input image x,

$$E = \frac{\partial S_c(x)}{\partial x}. \qquad (2)$$

SmoothGrad (Smilkov et al., 2017) was proposed to alleviate the noise in the gradient explanation by averaging the gradients of noisy copies of an input. Thus, for an input x, the smoothed variant of Grad, SmoothGrad, can be written as

$$E = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x_i)}{\partial x_i}, \qquad (3)$$

where x_i = x + g_i and the g_i are noise vectors drawn i.i.d. from a Gaussian distribution N(0, σ²). In all our experiments we set n = 100 and the noise level σ/(x_max − x_min) = 0.1. We choose S_c(x) = log p_c(x), where p_c(x) is the probability of class c assigned by the classifier to input x.
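Below is a minimal PyTorch sketch of the Grad and SmoothGrad computations in Eq. (2) and Eq. (3); n and the noise level follow the settings stated above, while the function names and the batching are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def grad_saliency(model, x, c):
    """Eq. (2): gradient of the class activation S_c(x) = log p_c(x) w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    log_prob = F.log_softmax(model(x), dim=1)[:, c].sum()
    log_prob.backward()
    return x.grad.detach()

def smoothgrad_saliency(model, x, c, n=100, sigma_frac=0.1):
    """Eq. (3): average Grad over n Gaussian-perturbed copies of the input."""
    sigma = sigma_frac * (x.max() - x.min())
    total = torch.zeros_like(x)
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)
        total += grad_saliency(model, noisy, c)
    return total / n
```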

3. Methods

In this section we elaborate our methods for interpreting the adversarially trained CNNs and comparing them with normally trained ones. Three image datasets are considered: Tiny ImageNet², Caltech-256 (Griffin et al., 2007) and CIFAR-10.

We first visualize the salience maps of AT-CNNs and normal CNNs to demonstrate that the two kinds of models, trained in different ways, are sensitive to different kinds of features. Besides this qualitative comparison, we also test the two kinds of CNNs on differently transformed datasets to distinguish the differences in their preferred features.

3.1. Visualizing the salience maps

A straightforward way of investigating the difference between AT-CNNs and CNNs is to visualize which group of pixels the network outputs are most sensitive to. Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates for showing what features a model is sensitive to. We compare the salience maps of AT-CNNs and CNNs on clean images and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.

² https://tiny-imagenet.herokuapp.com

Figure 1. Visualization of the three transformations. Original images are from Caltech-256. From left to right: (a) original, (b) stylized, (c) saturation level 8, (d) saturation level 1024, (e) 2×2 patch-shuffling, (f) 4×4 patch-shuffling.

As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, reduces visual noise by averaging the gradient over Gaussian-perturbed copies of the image. We therefore mainly report the salience maps produced by SmoothGrad and provide the Grad visualization results in the appendix. Note that the two visualization methods lead to a consistent conclusion on the difference between the two kinds of trained CNNs.

3.2. Generalization on shape/texture-preserving distortions

Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating their performance degradation under several distortions that preserve either shapes or textures. Intuitively, if a model relies heavily on textures, its performance should degrade severely when we destroy most of the textures while preserving other information such as shapes. A perfect disentanglement of texture, shape and other feature information is, however, impossible (Gatys et al., 2015). In this work we mainly construct three kinds of image transformations to achieve shape or texture distortion: style transfer, saturation and patch-shuffling. Some image samples are shown in Figure 1; we also add three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.

Note that we conduct normal or adversarial training on the original training sets and then evaluate generalizability over the transformed data; the transformed datasets are never used during training.

Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information in order to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, among models with similar generalization error, those capturing shapes better should also perform better on stylized test images than those biased towards textures. Style-transferred image samples are shown in Figure 1(b).

Saturation. Similar to Ding et al. (2019), we denote the saturation of the image x by x^p, where p indicates the saturation level ranging from 0 to ∞. When p = 2, the saturation operation does not change the image; when p ≥ 2, increasing the saturation level pushes the pixel values towards binarized ones, and p = ∞ leads to pure binarization. Specifically, for each pixel of image x with value v ∈ [0, 1], the corresponding saturated pixel of x^p is defined as

$$\mathrm{sign}(2v-1)\,\frac{|2v-1|^{\frac{2}{p}}}{2} + \frac{1}{2}.$$

One can observe from Figure 1(c) and (d) that increasing the saturation level gradually destroys some texture information while preserving most of the contour structure.
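For reference, a small NumPy sketch of the saturation operation defined above is given below; it assumes pixel values in [0, 1] and is meant as an illustration rather than the exact preprocessing code we used.

```python
import numpy as np

def saturate(image, p):
    """Saturation transform: p = 2 is the identity, p -> inf binarizes the image,
    and p -> 0 pushes every pixel towards 0.5. Pixel values are assumed in [0, 1]."""
    v = 2.0 * image - 1.0
    return np.sign(v) * np.abs(v) ** (2.0 / p) / 2.0 + 0.5
```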

Patch-Shuffling. To destroy long-range shape information, we split each image into k × k small patches and randomly rearrange the order of these patches, with k ∈ {2, 4, 8}. Favorably, this operation preserves most of the texture information while destroying most of the shape information. Patch-shuffled image samples are shown in Figure 1(e), (f). Note that as k increases, more information of the original image is lost, especially for images with low resolution.
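A minimal NumPy sketch of the patch-shuffling operation is given below; it assumes the image height and width are divisible by k, and the function name is our own.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an HxWxC image into k x k patches and randomly permute them."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // k, image.shape[1] // k
    patches = [image[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    # stitch the permuted patches back into a full image, row by row
    rows = [np.concatenate([patches[order[i*k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```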

Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded ℓ∞ norm; details are listed in the appendix. Note that underfitting CNNs have similar generalization performance to some of the AT-CNNs on clean images.

             CIFAR-10               Tiny ImageNet          Caltech-256
             Accuracy  Robustness   Accuracy  Robustness   Accuracy  Robustness
PGD-inf 8    86.27     44.81        54.42     14.25        66.41     31.16
PGD-inf 4    89.17     30.85        61.85      6.87        72.22     20.10
PGD-inf 2    91.43      9.11        67.06      1.66        76.51      7.51
PGD-inf 1    93.40      7.53        69.42      0.18        79.11      1.70
PGD-L2 12    85.79     34.61        53.44     14.80        65.54     31.36
PGD-L2 8     88.01     26.88        58.21     10.03        69.75     26.19
PGD-L2 4     90.77     13.19        64.24      3.61        74.12     14.33
FGSM 8       84.90     34.25        66.21      0.01        70.88     20.02
FGSM 4       88.13     25.08        63.43      0.13        73.91     15.16
Normal       94.52      0           72.02      0.01        83.32      0
Underfit     86.79      0           60.05      0.01        69.04      0

4. Experiments and analysis

Experiment setup. We describe the experimental setup for evaluating the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet and Caltech-256 (Griffin et al., 2007). Note that we do not create style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.

When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b). For data augmentation we perform zero padding with width 4, horizontal flips and random crops.

Tiny ImageNet has 200 object classes. Each class has 500 training images, 50 validation images and 50 test images. All images from Tiny ImageNet are of size 64 × 64; we re-scale them to 224 × 224 and perform random horizontal flips and per-image standardization as data augmentation.
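As an illustration, the Tiny ImageNet preprocessing described above could be expressed with torchvision transforms roughly as follows; the exact pipeline, including the form of per-image standardization, is an assumption on our part.

```python
import torchvision.transforms as T

def per_image_standardize(x):
    """Subtract the per-image mean and divide by the per-image standard deviation."""
    return (x - x.mean()) / x.std().clamp_min(1e-6)

tiny_imagenet_train_transform = T.Compose([
    T.Resize(224),               # re-scale the 64x64 images to 224x224
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Lambda(per_image_standardize),
])
```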

Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech-256 is much higher than in the above two datasets. We manually split off 20% of the images as the test set, and perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256 we use the ResNet-18 model as the network architecture.

Compared models, their generalization and robustness. For each of the three datasets, we train three types of AT-CNNs; they mainly differ in the way of generating adversarial examples: FGSM, PGD with bounded ℓ∞ norm, and PGD with bounded ℓ2 norm. For each attack method we train several models under different attack strengths; details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization of adversarial training (Schmidt et al., 2018; Tsipras et al., 2018), we also compare the AT-CNNs with an underfitting CNN (trained on clean data) whose generalization performance is similar to that of the AT-CNNs. In total we train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by the PGD attack are shown in Table 1.

4.1. Visualization results

To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean, saturated and stylized images. The visualization results are presented in Figure 2.

We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object for all kinds of images, including the clean, saturated and stylized ones. In contrast, the sensitivity maps of standard CNNs are noisier and less biased towards object shapes. This is consistent with the findings of Geirhos et al. (2019).

In particular, in the second row of Figure 2, the sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated images are nearly binarized. On the other hand, after adversarial training the models successfully capture the shape information of the object, providing a more interpretable prediction.

For the stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original objects, while standard CNNs totally fail.

Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. Panels: (a) images from Caltech-256, (b) images from Tiny ImageNet. From top to bottom: original, saturation 1024, and stylized. For each group of images, from left to right: original image, sensitivity maps of the standard CNN, the underfitting CNN, and the PGD-ℓ∞ AT-CNN.

Due to limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.

4.2. Generalization performance on transformed data

In this part we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shapes or textures. This helps us understand, in a quantitative way, how differently the two types of models are biased.

For all experimental results below, besides the top-1 accuracy we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
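A minimal PyTorch sketch of this metric on a single batch is shown below; in practice one would accumulate it over the whole test loader, and the function name is ours.

```python
import torch

@torch.no_grad()
def accuracy_on_correctly_classified(model, clean_x, transformed_x, y):
    """Select the clean images the model classifies correctly, then report the
    model's accuracy on the transformed versions of exactly those images."""
    model.eval()
    clean_pred = model(clean_x).argmax(dim=1)
    mask = clean_pred == y
    if mask.sum() == 0:
        return float("nan")
    trans_pred = model(transformed_x[mask]).argmax(dim=1)
    return (trans_pred == y[mask]).float().mean().item()
```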

4.2.1. STYLIZING

Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.

We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs, though with lower accuracy on the original test images, achieve higher accuracy on the stylized ones, whose textures have been dramatically changed. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.

4.2.2. SATURATION

We use the saturation operation to manipulate the images and show how increasing the saturation level affects the accuracy of models trained in different ways.

In Figure 4 we visualize images with varying saturation levels. It can easily be observed that increasing the saturation level pushes images towards "binarized" ones: some textures are wiped out, but edges become sharper and shape information is preserved. When the saturation level is decreased below 2 (that of the clean image), all pixels are pushed towards 1/2 and nearly all information is lost; p = 0 yields a completely gray image with constant pixel value.

We measure the "accuracy on correctly classified images" for all the trained models and show the results in Figure 5. We can observe that as the saturation level increases, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss, still obtaining high classification accuracy. The results indicate that AT-CNNs are more robust to "saturation" or "binarizing" operations, suggesting that the prediction capability of AT-CNNs relies less on textures and more on shapes. Results on CIFAR-10 tell the same story; they are presented in the appendix due to limited space.

Additionally, in our experiments, for each adversarial training approach (either PGD- or FGSM-based), the AT-CNNs with higher robustness against the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. This finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).

Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row images from Tiny ImageNet.

Table 2. "Accuracy on correctly classified images" for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of the different models on the clean test sets.

             Caltech-256   Stylized Caltech-256   TinyImageNet   Stylized TinyImageNet
Standard     83.32         16.83                  72.02           7.25
Underfit     69.04          9.75                  60.35           7.16
PGD-linf 8   66.41         19.75                  54.42          18.81
PGD-linf 4   72.22         21.10                  61.85          20.51
PGD-linf 2   76.51         21.89                  67.06          19.25
PGD-linf 1   79.11         22.07                  69.42          18.31
PGD-l2 12    65.24         20.14                  53.44          19.33
PGD-l2 8     69.75         21.62                  58.21          20.42
PGD-l2 4     74.12         22.53                  64.24          21.05
FGSM 8       70.88         21.23                  66.21          15.07
FGSM 4       73.91         21.99                  63.43          20.22

Figure 4. Illustration of how varying saturation changes the appearance of an image. From left to right: saturation level 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.

Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 (panel a) and Tiny ImageNet (panel b) as a function of the saturation level (x-axis: saturation level from 2^-2 to 2^10; y-axis: accuracy on correctly classified images; curves: PGD AT with ℓ∞ norm, PGD AT with ℓ2 norm, FGSM AT, standard training, and underfitting). Note that several curves with the same color and line type are shown for each adversarial training method (PGD- and FGSM-based); among them, those with larger perturbation achieve better robustness in most cases. Detailed results are listed in the appendix.

Figure 6. Visualization of the patch-shuffling transformation: (a) original image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. The first row shows the probability of the "cake" class assigned by the different models (Normal, Underfit, PGD-inf 8, PGD-L2 12, FGSM 8) to each image.

Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Caltech-256 (panel a) and Tiny ImageNet (panel b) with different splitting numbers (x-axis: clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8; y-axis: accuracy on correctly classified images; curves: PGD AT with ℓ∞ norm, PGD AT with ℓ2 norm, FGSM AT, standard training, and underfitting). Detailed results are listed in the appendix.

When decreasing the saturation level, all models suffer a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions; they tend to be more robust only to certain types of distortion. We leave further investigation of this issue as future work.

4.2.3. PATCH-SHUFFLING

The stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving shape and edge features. To test the different biases of AT-CNNs and standard CNNs in the other direction, we shatter shape and edge information by splitting the images into k × k patches and then randomly shuffling them. This operation largely maintains the local textures as long as k is not too large.

Figure 6 shows an example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. Obviously, after random shuffling the shape and edge features are destroyed dramatically; the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence on the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.

Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.

Note that under the "Patch-shuffle 8" operation all models have a similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy for all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256, shown in Figure 7(a). For instance, under "Patch-shuffle 2" the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the patch-shuffle operation destroys more useful features in low-resolution images than in higher-resolution ones.

5. Related work and discussion

Interpreting AT-CNNs. Recently there have been several findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that the sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, one can observe that the adversarial examples capture salient characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we conduct a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their stronger shape bias compared with normal CNNs.

Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide insights for designing new network architectures or new strategies that enhance the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks, inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features; with this design, they achieved improved black-box robustness.

Adversarial training with other types of attacks. In this work we mainly interpret AT-CNNs trained against norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, yet their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs trained against these generalized types of attacks as future work.

Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of the standard CNN, the PGD-ℓ∞ AT-CNN, and the ST-AT-CNN.

6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two kinds of models on the three constructed datasets: stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.

A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units per group.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture with the code from PyTorch (Paszke et al., 2017). Note that models on Caltech-256 & Tiny ImageNet are initialized with the ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an ℓ∞ projected gradient descent adversary with ε = 8/255, step size 2, and 40 iterations.

A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. Seven of the 9 are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε used for adversarial training on each dataset and ℓp norm. In all settings, PGD runs 20 iterations.

• ℓ∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• ℓ2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size of our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}, denoted as PGD-l2 4, 8, 12 respectively; the step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and use the same step sizes as those for Caltech-256 & Tiny ImageNet.

A.2.2. TRAIN AGAINST AN FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM 4, 8 respectively.

B. Style-transferred test set

Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by Geirhos et al. (2019).

C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set (a small code sketch of these filters follows the list below):

• The low-frequency filtered version: we use a radial mask in the Fourier domain to set the higher-frequency modes to zero (low-pass filtering).

• The high-frequency filtered version: we use a radial mask in the Fourier domain to preserve only the higher-frequency modes (high-pass filtering).

• The random filtered version: we use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly at test time.
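The following NumPy sketch illustrates the three filters for a single-channel image; the mask radius and the drop probability p are illustrative assumptions, and color images would be filtered per channel.

```python
import numpy as np

def fourier_filter(image, mode="low", radius=16, p=0.5, rng=None):
    """Filter a single-channel HxW image in the Fourier domain.
    mode='low' keeps modes inside a radial mask, 'high' keeps those outside,
    and 'random' zeroes each mode independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    if mode == "low":
        mask = dist <= radius
    elif mode == "high":
        mask = dist > radius
    else:  # 'random'
        mask = rng.random((h, w)) >= p
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
```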

C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered versions of the Caltech-256 test set; the results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are usually high-frequency information, while shapes and contours are more low-frequency.

³ https://www.kaggle.com/c/painter-by-numbers

Table 3. "Accuracy on correctly classified images" for different models on the three Fourier-filtered Caltech-256 test sets.

            Low-frequency filtered   High-frequency filtered   Random filtered
Standard    15.8                     16.5                      73.5
Underfit    14.5                     17.6                      62.2
PGD-linf    71.1                      3.6                      73.4

D. Detailed results

We present the detailed results of our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on the saturated test sets with different saturation levels; Tables 7 and 8 list the results of each model on the test sets after the different patch-shuffling operations.

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps produced by Grad and SmoothGrad in Figure 10.


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

Saturation level  0.25   0.5    1      4      8      16     64     1024
Standard          28.62  57.45  85.20  90.13  65.37  42.37  23.45  20.03
Underfit          31.84  63.36  90.96  84.51  57.51  38.58  26.00  23.08
PGD-linf 8        32.84  53.47  82.72  86.45  70.33  61.09  53.76  51.91
PGD-linf 4        31.99  57.74  85.18  87.95  70.33  58.38  48.16  45.45
PGD-linf 2        32.99  60.75  87.75  89.35  68.78  51.99  40.69  37.83
PGD-linf 1        32.67  61.85  89.36  90.18  69.07  50.05  37.98  34.80
PGD-l2 12         31.38  53.07  82.10  83.89  67.06  58.51  52.45  50.75
PGD-l2 8          32.82  56.65  85.01  86.09  68.90  58.75  51.59  49.30
PGD-l2 4          32.82  58.77  86.30  86.36  67.94  53.68  44.43  41.98
FGSM 8            29.53  55.46  85.10  86.65  69.01  55.64  45.92  43.42
FGSM 4            32.68  59.37  87.22  87.90  66.71  51.13  41.66  38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

Saturation level  0.25   0.5    1      4      8      16     64     1024
Standard           7.24  25.88  72.52  72.73  25.38   8.24   2.62   1.93
Underfit           7.34  25.44  69.80  60.67  18.01   6.72   3.16   2.65
PGD-linf 8        11.07  29.08  67.11  74.53  49.8   40.16  35.44  33.96
PGD-linf 4        12.44  33.53  72.94  75.75  46.38  32.12  24.92  22.65
PGD-linf 2        12.09  34.85  75.77  76.15  41.35  25.20  16.93  14.52
PGD-linf 1        11.30  35.03  76.85  78.63  40.48  21.37  12.70  10.81
PGD-l2 12         11.30  29.48  66.94  75.22  52.26  42.11  37.20  35.85
PGD-l2 8          12.42  32.78  71.94  75.15  47.92  35.66  29.55  27.90
PGD-l2 4          12.63  34.10  74.06  77.32  45.00  28.73  20.16  18.04
FGSM 8            12.59  32.66  70.55  81.53  41.83  17.52   7.29   5.82
FGSM 4            12.63  34.10  74.06  75.05  42.91  29.09  22.15  20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

Saturation level  0.25   0.5    1      4      8      16     64     1024
Standard          27.36  55.95  91.03  93.12  69.98  48.30  34.39  31.06
Underfit          21.43  50.28  87.71  89.89  66.09  43.35  29.10  26.13
PGD-linf 8        26.05  46.96  80.97  89.16  75.46  69.08  58.98  64.64
PGD-linf 4        27.22  49.81  84.16  89.79  73.89  65.35  59.99  58.47
PGD-linf 2        28.32  53.12  86.93  91.37  74.02  62.82  55.25  52.60
PGD-linf 1        27.18  53.59  88.54  91.77  72.67  58.39  47.25  41.75
PGD-l2 12         25.99  46.92  81.72  88.44  73.92  66.03  60.98  59.41
PGD-l2 8          27.75  50.29  83.76  80.92  73.17  64.83  58.64  46.94
PGD-l2 4          27.26  51.17  85.78  90.08  73.12  61.50  52.04  48.79
FGSM 8            25.50  46.11  81.72  87.67  74.22  67.12  62.51  61.32
FGSM 4            26.39  58.93  84.30  89.02  73.47  64.43  58.80  56.82


Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

Patch-Shuffle   2×2    4×4    8×8
Standard        84.76  51.50  10.84
Underfit        75.59  33.41   6.03
PGD-linf 8      58.13  20.14   7.70
PGD-linf 4      68.54  26.45   8.18
PGD-linf 2      74.25  30.77   9.00
PGD-linf 1      78.11  35.03   8.42
PGD-l2 12       58.25  21.03   7.85
PGD-l2 8        63.36  22.19   8.48
PGD-l2 4        69.65  28.21   7.72
FGSM 8          64.48  22.94   8.07
FGSM 4          70.50  28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

Patch-Shuffle   2×2    4×4    8×8
Standard        66.73  24.87   4.48
Underfit        59.22  23.62   4.38
PGD-linf 8      41.08  16.05   6.83
PGD-linf 4      49.54  18.23   6.30
PGD-linf 2      55.96  19.95   5.61
PGD-linf 1      60.19  23.24   6.08
PGD-l2 12       42.23  16.95   7.66
PGD-l2 8        47.67  16.28   6.50
PGD-l2 4        51.94  17.79   5.89
FGSM 8          57.42  20.70   4.73
FGSM 4          50.68  16.84   5.98

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4.

Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4. It is easily observed that the sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for the standard and underfitting CNNs.


Page 3: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

(a) Original (b) Stylized (c) Saturated 8 (d) Saturated 1024 (e) patch-shuffle 2 (f) patch-shuffle 4

Figure 1 Visualization of three transformations Original images are from Caltech-256 From left to right original stylized saturationlevel as 8 1024 2times 2 patch-shuffling 4times 4 patch-shuffling

pixels the network outputs are most sensitive to Saliencemaps generated by Grad and its smoothed variant Smooth-Grad are good candidates to show what features a modelis sensitive to We compare the salience maps between AT-CNNs and CNNs on clean images and images under texturepreserving and shape preserving distortions Extensive re-sults can been seen in Section 41

As pointed by Smilkov et al (2017) sensitivity maps basedon Grad method are often visually noisy highlighting thatsome pixels to a human eye seem randomly selectedSmoothGrad in Eq (3) on the other hand could reducevisual noise by averaging the gradient over the Gaussianperturbed images Thus we mainly report the salience mapsproduced by SmoothGrad and the Grad visualization resultsare provided in the appendix Note that the two visualizationmethods could help us draw a consistent conclusion on thedifference between the two trained CNNs

32 Generalization on shapetexture preservingdistortions

Besides visual inspection of sensitivity maps we proposeto measure the sensitivity of AT-CNNs and CNNs to dif-ferent features by evaluating the performance degradationunder several distortions that either preserves shapes or tex-tures Intuitively if one model relies on textures a lot theperformance would degrade severely if we destroy mostof the textures while preserving other information such asthe shapes and other features However a perfect disentan-glement of texture shape and other feature information isimpossible (Gatys et al 2015) In this work we mainlyconstruct three kinds of image translations to achieve theshape or texture distortion style-transfer saturating andpatch-shuffling operation Some of the image samples areshown in Figure 1 We also added three Fourier-filteredtest set in the appendix We now describe each of thesetransformations and their properties

Note that we conduct normal training or adversarial training

on the original training sets and then evaluate their general-izability over the transformed data During the training wenever use the transformed datasets

Stylizing Geirhos et al (2019) utilized style trans-fer (Huang amp Belongie 2017) to generate images with con-flicting shape and texture information to demonstrate thetexture bias of ImageNet-trained standard CNNs Followingthe same rationale we utilize style transfer to destroy mostof the textures while preserving the global shape structuresin images and build a stylized test dataset Therefore withsimilar generalization error models capturing shapes bet-ter should also perform better on stylized test images thanthose biased towards textures The style-transferred imagesamples are shown in Figure 1(b)

Saturation Similar to (Ding et al 2019) we denote thesaturation of the image x by xp where p indicates the sat-uration level ranging from 0 toinfin When p = 2 the satu-ration operation does not change the image When p ge 2increasing the saturation level will push the pixel valuestowards binarized ones and p = infin leads to the pure bi-narization Specifically for each pixel of image x withvalue v isin [0 1] its corresponding saturated pixel of xp isdefined as sign(2v minus 1)|2v minus 1|

2p 2 + 12 One can ob-

serve that from Figure 1(c) and (d) increasing saturationlevel can gradually destroy some texture information whilepreserving most parts of the contour structures

Patch-Shuffling. To destroy long-range shape information, we split images into k × k small patches and randomly rearrange the order of these patches, with k ∈ {2, 4, 8}. Favorably, this operation preserves most of the texture information while destroying most of the shape information. Patch-shuffled image samples are shown in Figure 1(e) and (f). Note that as k increases, more information of the original image is lost, especially for images with low resolution.
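A minimal sketch of the k × k patch-shuffling operation (assuming the image height and width are divisible by k) could look as follows.

    import numpy as np

    def patch_shuffle(image, k, rng=None):
        """Split an H x W x C image into k x k patches and randomly permute them,
        destroying global shape while keeping local textures."""
        rng = np.random.default_rng() if rng is None else rng
        h, w = image.shape[0] // k, image.shape[1] // k
        patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
                   for i in range(k) for j in range(k)]
        order = rng.permutation(len(patches))
        rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
                for i in range(k)]
        return np.concatenate(rows, axis=0)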


Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded l∞ norm. Details are listed in the appendix. Note that underfitting CNNs have similar generalization performance to some of the AT-CNNs on clean images.

Model        CIFAR-10              Tiny ImageNet         Caltech-256
             Acc.    Robustness    Acc.    Robustness    Acc.    Robustness
PGD-inf 8    86.27   44.81         54.42   14.25         66.41   31.16
PGD-inf 4    89.17   30.85         61.85   6.87          72.22   20.10
PGD-inf 2    91.4    39.11         67.06   1.66          76.51   7.51
PGD-inf 1    93.40   7.53          69.42   0.18          79.11   1.70
PGD-L2 12    85.79   34.61         53.44   14.80         65.54   31.36
PGD-L2 8     88.01   26.88         58.21   10.03         69.75   26.19
PGD-L2 4     90.77   13.19         64.24   3.61          74.12   14.33
FGSM 8       84.90   34.25         66.21   0.01          70.88   20.02
FGSM 4       88.13   25.08         63.43   0.13          73.91   15.16
Normal       94.52   0             72.02   0.01          83.32   0
Underfit     86.79   0             60.05   0.01          69.04   0

4. Experiments and analysis

Experiment setup. We describe the experiment setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet, and Caltech-256 (Griffin et al., 2007). Note that we do not create the style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.

When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b); for data augmentation, we perform zero padding with width 4, horizontal flip, and random crop.
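For concreteness, this augmentation pipeline corresponds to roughly the following torchvision sketch; the ordering of the transforms is an assumption.

    from torchvision import transforms

    # Assumed CIFAR-10 training augmentation: 4-pixel zero padding with a random
    # 32x32 crop, followed by a random horizontal flip.
    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])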

Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images, and 50 test images. All images from Tiny ImageNet are of size 64 × 64. We re-scale them to 224 × 224 and perform random horizontal flip and per-image standardization as data augmentation.

Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech is much higher compared with the above two datasets. We manually split 20% of the images as the test set. We perform re-scaling and random cropping following (He et al., 2016a). For both Tiny ImageNet and Caltech-256, we use the ResNet-18 model as the network architecture.

Compared models, their generalization and robustness. For all of the above three datasets, we train three types of AT-CNNs; they mainly differ in the way of generating adversarial examples: FGSM, PGD with bounded l∞ norm, and PGD with bounded l2 norm. For each attack method we train several models under different attack strengths; details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained over clean data) with generalization performance similar to the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by the PGD attack are shown in Table 1.

4.1. Visualization results

To investigate which features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.

We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated, and stylized ones. In contrast, the sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in (Geirhos et al., 2019).

Particularly, in the second row of Figure 2, the sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated image is nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.

For stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original objects,


(a) Images from Caltech-256 (b) Images from Tiny ImageNet

Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. From top to bottom: original, saturation 1024, and stylized. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN, and PGD-l∞ AT-CNN.

while standard CNNs totally fail.

Due to the limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.

4.2. Generalization performance on transformed data

In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shape or texture. This helps us understand, in a quantitative way, how differently the two types of models are biased.

For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of those correctly classified images.
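To make this metric unambiguous, the following is a minimal sketch of how "accuracy on correctly classified images" can be computed, assuming paired clean and transformed test batches; the function name is illustrative.

    import torch

    @torch.no_grad()
    def accuracy_on_correctly_classified(model, clean_images, transformed_images, labels):
        """Keep only the samples the model classifies correctly on clean data,
        then measure accuracy on the transformed versions of those samples."""
        model.eval()
        clean_pred = model(clean_images).argmax(dim=1)
        mask = clean_pred == labels                    # correctly classified clean samples
        if mask.sum() == 0:
            return 0.0
        transformed_pred = model(transformed_images[mask]).argmax(dim=1)
        return (transformed_pred == labels[mask]).float().mean().item()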

4.2.1. STYLIZING

Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.

We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, though with a lower accuracy on original test images, AT-CNNs achieve higher accuracy on stylized ones whose textures have been dramatically changed. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.

4.2.2. SATURATION

We use the saturation operation to manipulate the images and show how increasing the saturation level affects the accuracy of models trained in different ways.

In Figure 4, we visualize images with varying saturation levels. It can easily be observed that increasing the saturation level pushes images to be more "binarized": some textures are wiped out, but sharper edges are produced and shape information is preserved. When the saturation level is smaller than 2 (i.e., the clean image), decreasing it pushes all pixels towards 1/2 and nearly all the information is lost; p = 0 leads to a totally gray image with constant pixel value.

We measure the "accuracy on correctly classified images" for all the trained models and show it in Figure 5. We can observe that, with increasing saturation level, more texture information is lost. Favorably, adversarially trained models exhibit much lower sensitivity to this texture loss, still obtaining a high classification accuracy. The results indicate that AT-CNNs are more robust to "saturation" or "binarizing" operations, which may demonstrate that the prediction capability of AT-CNNs relies less on textures and more on shapes. Results on CIFAR-10 tell the same story; they are presented in the appendix due to limited space.

Additionally, in our experiments, for each adversarial training approach, either PGD- or FGSM-based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to the increasing saturation level and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. Our finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).


Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.

Table 2. "Accuracy on correctly classified images" for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of different models on the clean test sets.

DATASET       CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD      83.32         16.83                  72.02          7.25
UNDERFIT      69.04         9.75                   60.35          7.16
PGD-l∞ 8      66.41         19.75                  54.42          18.81
PGD-l∞ 4      72.22         21.10                  61.85          20.51
PGD-l∞ 2      76.51         21.89                  67.06          19.25
PGD-l∞ 1      79.11         22.07                  69.42          18.31
PGD-l2 12     65.24         20.14                  53.44          19.33
PGD-l2 8      69.75         21.62                  58.21          20.42
PGD-l2 4      74.12         22.53                  64.24          21.05
FGSM 8        70.88         21.23                  66.21          15.07
FGSM 4        73.91         21.99                  63.43          20.22

Figure 4. Illustration of how varying saturation changes the appearance of an image. From left to right: saturation level 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.

(a) Caltech-256 (b) Tiny ImageNet

Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels (x-axis: saturation level from 2^-2 to 2^10; y-axis: accuracy on correctly classified images; curves: PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting). Note that in the plot there are several curves with the same color and line type for each adversarial training method (PGD- and FGSM-based); those with larger perturbation achieve better robustness in most cases. Detailed results are listed in the appendix.


(a) Original Image (b) Patch-Shuffle 2 (c) Patch-Shuffle 4 (d) Patch-Shuffle 8

Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of the "cake" class assigned by different models (Normal, Underfit, PGD-inf 8, PGD-L2 12, FGSM 8): 0.750 / 0.952 / 0.738 / 0.769 / 0.932 on the original image, 0.550 / 0.877 / 0.012 / 0.043 / 0.028 under Patch-Shuffle 2, 0.541 / 0.913 / 0.002 / 0.012 / 0.012 under Patch-Shuffle 4, and 0.005 / 0.305 / 0.002 / 0.002 / 0.003 under Patch-Shuffle 8.

(a) Caltech-256 (b) Tiny ImageNet

Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Caltech-256 and Tiny ImageNet with different splitting numbers (x-axis: clean, patch-shuffle 2, 4, 8; y-axis: accuracy on correctly classified images; curves: PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting). Detailed results are listed in the appendix.

When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions; they tend to be more robust only to specific types of distortions. We leave further investigation of this issue as future work.

4.2.3. PATCH-SHUFFLING

The stylizing and saturation operations aim at changing or removing the texture information of original images while preserving shape and edge features. To test the bias of AT-CNNs and standard CNNs in the other direction, we shatter the shape and edge information by splitting the images into k × k patches and then randomly shuffling them. This operation still maintains most of the local textures if k is not too large.

Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. Obviously, after random shuffling, the shape and edge features are destroyed dramatically; the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.

Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the "Patch-shuffled" test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to "Patch-shuffling" operations in most of our experiments.

Note that under the "Patch-shuffle 8" operation, all models have a similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). For instance, under "Patch-shuffle 2", the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the "Patch-Shuffle" operation destroys more useful features in low-resolution images than in those with higher resolution.

5. Related work and discussion

Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.

Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question of whether other models that can capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide insights for designing new network architectures or new strategies that enhance the bias towards long-range features. Some recent works partially answer this question: Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by (Wang et al., 2018; Vaswani et al., 2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features; with this design, they achieved improved black-box robustness.

Adversarial training with other types of attacks. In this work, we mainly interpret the AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may highly depend on the type of adversary. Models trained against a spatially-transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to standard models, yet their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting the AT-CNNs based on these generalized types of attacks as future work.

Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.

6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two kinds of models on the three constructed datasets, i.e., stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525-9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of GoogLeNet and AlexNet applied to sketches. In AAAI, pp. 1124-1128, 2016.
Brendel, W. and Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pp. 248-255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510-1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618-626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322-8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921-2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.


A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units per group.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture with the implementation from PyTorch (Paszke et al., 2017). Note that models on Caltech-256 & Tiny ImageNet are initialized with ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
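A minimal sketch of such an l∞ PGD adversary is given below; it assumes inputs in [0, 1], expresses ε and the step size on the [0, 1] scale, and uses a random start inside the ε-ball. It is an illustrative re-implementation rather than our exact evaluation code.

    import torch
    import torch.nn.functional as F

    def pgd_linf_attack(model, images, labels, eps=8/255, step=2/255, iters=40):
        """Projected gradient descent with an l-infinity budget."""
        adv = images.clone().detach()
        adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)
        for _ in range(iters):
            adv.requires_grad_(True)
            loss = F.cross_entropy(model(adv), labels)
            grad = torch.autograd.grad(loss, adv)[0]
            adv = adv.detach() + step * grad.sign()                         # ascend the loss
            adv = torch.min(torch.max(adv, images - eps), images + eps).clamp(0, 1)  # project
        return adv.detach()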

A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε used for adversarial training on each dataset and lp-norm. In all settings, PGD runs 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size of our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as for Caltech-256 & Tiny ImageNet.

A.2.2. TRAIN AGAINST AN FGSM ADVERSARY

The value of ε for these two adversarially trained CNNs is ε ∈ {4, 8}, and they are denoted as FGSM 4 and FGSM 8 respectively.

B. Style-transferred test set

Following (Geirhos et al., 2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We use the source code provided by (Geirhos et al., 2019).

C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following (Jo & Bengio, 2017), we construct three types of Fourier-filtered versions of the test set (a minimal code sketch of the low-pass variant is given after the list):

• The low-frequency filtered version: we use a radial mask in the Fourier domain to set higher-frequency modes to zero (low-pass filtering).

• The high-frequency filtered version: we use a radial mask in the Fourier domain to preserve only the higher-frequency modes (high-pass filtering).

• The random filtered version: we use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during testing.
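The following is a minimal NumPy sketch of the low-pass variant using a radial mask in the Fourier domain; the cutoff radius is an illustrative parameter, and the high-pass and random variants follow by changing the mask.

    import numpy as np

    def low_pass_filter(image, radius):
        """Keep only Fourier modes within `radius` of the zero frequency,
        applied independently to each channel of an H x W x C image in [0, 1]."""
        h, w = image.shape[:2]
        yy, xx = np.ogrid[:h, :w]
        # Radial mask centered on the (shifted) zero-frequency component.
        mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
        filtered = np.empty_like(image, dtype=np.float64)
        for c in range(image.shape[2]):
            spectrum = np.fft.fftshift(np.fft.fft2(image[:, :, c]))
            filtered[:, :, c] = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
        return np.clip(filtered, 0.0, 1.0)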

C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets derived from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.

D. Detailed results

We list the detailed results of our quantitative experiments here. Tables 5, 4, and 6 show the results of each model on test sets with different saturation levels.

³ https://www.kaggle.com/c/painter-by-numbers


Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

MODEL       LOW-FREQUENCY FILTERED   HIGH-FREQUENCY FILTERED   RANDOM FILTERED
STANDARD    15.8                     16.5                      73.5
UNDERFIT    14.5                     17.6                      62.2
PGD-l∞      71.1                     3.6                       73.4

Tables 8 and 7 list the results of each model on test sets after different patch-shuffling operations.

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated by Grad and SmoothGrad in Figure 10.


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82


Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

MODEL       2×2     4×4     8×8
STANDARD    84.76   51.50   10.84
UNDERFIT    75.59   33.41   6.03
PGD-l∞ 8    58.13   20.14   7.70
PGD-l∞ 4    68.54   26.45   8.18
PGD-l∞ 2    74.25   30.77   9.00
PGD-l∞ 1    78.11   35.03   8.42
PGD-l2 12   58.25   21.03   7.85
PGD-l2 8    63.36   22.19   8.48
PGD-l2 4    69.65   28.21   7.72
FGSM 8      64.48   22.94   8.07
FGSM 4      70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

MODEL       2×2     4×4     8×8
STANDARD    66.73   24.87   4.48
UNDERFIT    59.22   23.62   4.38
PGD-l∞ 8    41.08   16.05   6.83
PGD-l∞ 4    49.54   18.23   6.30
PGD-l∞ 2    55.96   19.95   5.61
PGD-l∞ 1    60.19   23.24   6.08
PGD-l2 12   42.23   16.95   7.66
PGD-l2 8    47.67   16.28   6.50
PGD-l2 4    51.94   17.79   5.89
FGSM 8      57.42   20.70   4.73
FGSM 4      50.68   16.84   5.98

Figure 9. Visualization of salience maps generated by SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4.


Figure 10. Visualization of salience maps generated by Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4. It is easily observed that sensitivity maps generated by Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 4: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

Table 1 Accuracy and robustness of all the trained models Robustness is measured against the PGD attack with bounded linfin normDetails are listed in the appendix Note that underfitting CNNs have similar generalization performance with some of the AT-CNNs onclean images

CIFAR10 TinyImageNet Caltech 256Accuracy Robustness Accuracy Robustness Accuracy Robustness

PGD-inf 8 8627 4481 5442 1425 6641 3116PGD-inf 4 8917 3085 6185 687 7222 2010PGD-inf 2 914 3911 6706 166 7651 751PGD-inf 1 9340 753 6942 018 7911 170

PGD-L2 12 8579 3461 5344 1480 6554 3136PGD-L2 8 8801 2688 5821 1003 6975 2619PGD-L2 4 9077 1319 6424 361 7412 1433FGSM 8 8490 3425 6621 001 7088 2002FGSM 4 8813 2508 6343 013 7391 1516Normal 9452 0 7202 001 8332 0Underfit 8679 0 6005 001 6904 0

4 Experiments and analysisExperiments setup We describe the experiment setup toevaluate the performance of AT-CNNs and standard CNNsin data distributions manipulated by above-mentioned opera-tions We conduct experiments on three datasets CIFAR-10Tiny ImageNet and Caltech-256 (Griffin et al 2007) Notethat we do not create the style-transferred and patch-shuffledtest set for CIFAR-10 due to its limited resolution

When training on CIFAR-10 we use the ResNet-18model (He et al 2016ab) for data augmentation we per-form zero paddings with width as 4 horizontal flip andrandom crop

Tiny ImageNet has 200 classes of objects Each class has500 training images 50 validation images and 50 test im-ages All images from Tiny ImageNet are of size 64times 64We re-scale them to 224times224 and perform random horizon-tal flip and per-image standardization as data augmentation

Caltech-256 (Griffin et al 2007) consists of 257 objectcategories containing a total of 30607 images Resolutionof images from Caltech is much higher compared with theabove two datasets We manually split 20 of images asthe test set We perform re-scaling and random croppingfollowing (He et al 2016a) For both Tiny ImageNet andCaltech-256 we use ResNet-18 model as the network archi-tecture

Compared models their generalization and robustnessFor all above three datasets we train three types of AT-CNNs they mainly differ in the way of generating adver-sarial examples FGSM PGD with bounded linfin norm andPGD with bounded l2 norm and for each attack method wetrain several models under different attack strengths Details

are listed in the appendix To understand whether the differ-ence of performance degradation for AT-CNNs and standardCNNs is due to the poor generalization (Schmidt et al 2018Tsipras et al 2018) of adversarial training we also comparethe AT-CNNs with an underfitting CNN (trained over cleandata) with similar generalization performance as AT-CNNsWe train 11 models on each dataset Their generalizationperformance on clean data and robustness measured byPGD attack are shown in Table 1

41 Visualization results

To investigate what features of an input image AT-CNNsand normal CNNs are most sensitive to we generate sen-sitivity maps using SmoothGrad (Smilkov et al 2017) onclean images saturated images and stylized images Thevisualization results are presented in Figure 2

We can easily observe that the salience maps of AT-CNNsare much more sparse and mainly focus on contours of eachobject on all kinds of images including the clean saturatedand stylized ones Differently sensitivity maps of standardCNNs are more noisy and less biased towards the shapesof objects This is consistent with the findings in (Geirhoset al 2019)

Particularly in the second row of Figure 2 sensitivity mapsof normal CNNs of the ldquodogrdquo class are still noisy even whenthe input saturated image are nearly binarized On the otherhand after adversarial training the models successfullycapture the shape information of the object providing amore interpretable prediction

For stylized images shown in the third row of Figure 2 evenwith dramatically changed textures after style transfer AT-CNNs can still be able to focus the shapes of original object

Interpreting Adversarially Trained Convolutional Neural Networks

(a) Images from Caltech-256 (b) Images from Tiny ImageNet

Figure 2 Sensitivity maps based on SmoothGrad (Smilkov et al 2017) of three models on images under saturation and stylizing Fromtop to bottom Original Saturation 1024 and Stylizing For each group of images from left to right original image sensitivity maps ofstandard CNN underfitting CNN and PGD-linfin AT-CNN

while standard CNNs totally fail

Due to the limited space we provide more visualizationresults (including the sensitivity maps generated by Gradmethod) in appendix

42 Generalization performance on transformed data

In this part we mainly show generalization performanceof AT-CCNs and normal CNNs on either shape or texturepreserving distorted image datasets This could help us tounderstand how different that the two types of models arebiased in a quantitative way

For all experimental results below besides the top-1 accu-racy we also report an ldquoaccuracy on correctly classifiedimagesrdquo This accuracy is measured by first selecting theimages from the clean test set that is being correctly clas-sified then measuring the accuracy of transformed imagesfrom these correctly classified ones

421 STYLIZING

Following Geirhos et al (2019) we generate stylized ver-sion of test set for Caltech-256 and Tiny ImageNet

We report the ldquoaccuracy on correctly classified imagesrdquo ofall the trained models on stylized test set in Table 2 Com-pared with standard CNNs though with a lower accuracyon original test images AT-CNNs achieve higher accuracyon stylized ones with textures being dramatically changedThe comparison quantitatively shows that AT-CNNs tend tobe more invariant with respect to local textures

422 SATURATION

We use the saturation operation to manipulate the imagesand show the how increasing saturation levels affects theaccuracy of models trained in different ways

In Figure 4 we visualize images with varying saturation lev-els It can be easily observed that increasing saturation levelspushes images more ldquobinnarizedrdquo where some textures arewiped out but produces sharper edges and preserving shapeinformation When saturation level is smaller than 2 ieclean image it pushes all the pixels towards 12 and nearlyall the information is lost and p = 0 leads to a totally grayimage with constant pixel value

We measure the ldquoaccuracy on correctly classified imagesrdquofor all the trained models and show them in Figure 5 Wecan observe that with the increasing level of saturation moretexture information is lost Favorably adversarially trainedmodels exhibit a much less sensitivity to this texture lossstill obtaining a high classification accuracy The resultsindicate that AT-CNNs are more robust to ldquosaturationrdquo orldquobinarizingrdquo operations which may demonstrate that theprediction capability of AT-CNNs relies less on texture andmore on shapes Results on CIFAR-10 tells the same storyas presented in appendix due to the limited space

Additionally in our experiments for each adversarial train-ing approach either PGD or FGSM based AT-CNNs withhigher robustness towards PGD adversary are more invariantto the increasing of the saturation level and texture loss Onthe other hand adversarial training with higher robustnesstypically ruin the generalization over the clean dataset Ourfinding also supports the claim ldquorobustness maybe at oddswith accuracyrdquo (Tsipras et al 2018)

Interpreting Adversarially Trained Convolutional Neural Networks

Figure 3 Visualization of images from style-transferred test set Applying AdaIn (Huang amp Belongie 2017) style transfer distorts localtextures of original images while the global shape structure is retained The first row are images from Caltech-256 and the second roware images from Tiny ImageNet

Table 2 ldquoAccuracy on correctly classified imagesrdquo for different models on stylized test set The columns named ldquoCaltech-256rdquo andldquoTinyImageNetrdquo show the generalization of different models on the clean test set

DATASET CALTECH-256 STYLIZED CALTECH-256 TINYIMAGENET STYLIZED TINYIMAGENET

STANDARD 8332 1683 7202 725UNDERFIT 6904 975 6035 716PGD-linfin 8 6641 1975 5442 1881PGD-linfin 4 7222 2110 6185 2051PGD-linfin 2 7651 2189 6706 1925PGD-linfin 1 7911 2207 6942 1831PGD-l2 12 6524 2014 5344 1933PGD-l2 8 6975 2162 5821 2042PGD-l2 4 7412 2253 6424 2105FGSM 8 7088 2123 6621 1507FGSM 4 7391 2199 6343 2022

Figure 4 Illustration of how varying saturation changes the appearance of the image From left to right saturation level 025 05 1 2(original image) 4 8 16 64 1024 Increasing saturation level pushes pixels towards 0 or 1 which preserves most of the shape whilewiping most of the textures Decreasing saturation level pushes all pixels to 12

2minus 2 20 22 24 26 28 210

Saturation Level

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

cle

an im

age

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

2minus 2 20 22 24 26 28 210

Saturation Level

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

cle

an im

age

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

(a) Caltech-256 (b) Tiny ImageNetFigure 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 and Tiny ImageNet with respect todifferent saturation levels Note that in the plot there are several curves with same color and line type shown for each adversarial trainingmethod PGD and FGSM-based those of which with larger perturbation achieves better robustness for most of the cases Detailed resultsare list in the appendix

Interpreting Adversarially Trained Convolutional Neural Networks

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0750

0952

0738 0769

0932

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0550

0877

0012 0043 0028

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0541

0913

0002 0012 0012Norma

lUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0005

0305

0002 0002 0003

(a) Original Image (b) Patch-Shuffle 2 (c) Patch-Shuffle 4 (d) Patch-Shuffle 8Figure 6 Visualization of patch-shuffling transformation The first row shows probability of ldquocakerdquo assigned by different models

clean patch-shuffle 2 patch-shuffle 4 patch-shuffle 8

Patch-Shuffle

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

clean patch-shuffle 2 patch-shuffle 4 patch-shuffle 8

Patch-Shuffle

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

(a) Caltech-256 (b) Tiny ImageNetFigure 7 ldquoAccuracy on correctly classified imagesrdquo for different models on patch-shuffled Tiny ImageNet and Caltech-256 with differentsplitting numbers Detailed results are listed in the appendix

When decreasing the saturation level all models have simi-lar degree of performance degradation indicating that AT-CNNs are not robust to all kinds of image distortions Theytend to be more robust for fixed types of distortions Weleave the further investigation regarding this issue as futurework

423 PATCH-SHUFFLING

Stylizing and saturation operation aim at changing or re-moving the texture information of original images whilepreserving the features of shapes and edges In order to testthe different bias of AT-CNN and standard CNN in the otherway around we shatter the shape and edge information bysplitting the images into k times k patches and then randomlyshuffling them This operation could still maintains the localtextures if k is not too large

Figure 6 shows one example of patch-shuffled images underdifferent numbers of splitting The first row shows the proba-bilities assigned by different models to the ground truth class

of the original image Obviously after random shufflingthe shapes and edge features are destroyed dramaticallythe prediction probability of the adverarially trained CNNsdrops significantly while the normal CNNs still maintainsa high confidence over the ground truth class This revealsAT-CNNs are more baised towards shapes and edges thannormally trained ones

Moreover Figure 7 depicts the ldquo accuracy of correctly classi-fied imagesrdquo for all the models measured on ldquoPatch-shuffledrdquotest set with increasing number of splitting pieces AT-CNNs especially trained against with a stronger attack aremore sensitive to ldquoPatch-shufflingrdquo operations in most ofour experiments

Note that under ldquoPatch-shuffle 8rdquo operation all models havesimilar ldquo accuracy of correctly classified imagesrdquo which islargely due to the severe information loss Also note that thisaccuracy of all models on Tiny ImageNet shown in 7(a) ismush lower than that on Caltech-256 in 7(b) That is underldquoPatch-shuffle 1rdquo normally trained CNN has an accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

of 8476 on Caltech-256 while only 6673 on TinyImageNet This mainly origins from the limited resolutionof Tiny ImageNet since ldquoPatch-Shufflerdquo operation on low-resolution images destroys more useful features than thosewith higher resolution

5 Related work and discussionInterpreting AT-CNNs Recently there are some relevantfindings indicating that AT-CNNs learn fundamentally differ-ent feature representations than standard classifiers Tsipraset al (2018) showed that sensitivity maps of AT-CNNs inthe input space align well with human perception Addi-tionally by visualizing large-ε adversarial examples againstAT-CNNs it can be observed that the adversarial examplescould capture salient data characteristics of a different classwhich appear semantically similar to the images of the differ-ent class Dong et al (2017) leveraged adversarial trainingto produce a more interpretable representation by visualiz-ing active neurons Compared with Tsipras et al (2018) andDong et al (2017) we have conducted a more systematicalinvestigation for interpreting AT-CNNs We construct threetypes of image transformation that can largely change thetextures while preserving shape information (ie stylizingand saturation) or shatter the shapeedge features whilekeeping the local textures (ie patch-shuffling) Evaluatingthe generalization of AT-CNNs over these designed datasetsprovides a quantitative way to verify and interpret theirstrong shape-bias compared with normal CNNs

Insights for defensing adversarial examples Based onour investigation over the AT-CNNs we find that the ro-bustness towards adversarial examples is correlated withthe capability of capturing long-range features like shapesor contours This naturally raises the question whetherany other models that can capture more global features orwith more texture invariance could lead to more robustnessto adversarial examples even without adversarial train-ing This might provide us some insights on designing newnetwork architecture or new strategies for enhancing thebias towards long-range features Some recent works turnout partially answering this question (Xie et al 2018) en-hanced standard CNNs with non-local blocks inspired from(Wang et al 2018 Vaswani et al 2017) which capture long-range dependencies in a data-dependent manner and whencombined with adversarial training their networks achievedstate-of-the-art adversarial robustness on ImageNet (Luoet al 2018) destroyed some of the local connection of stan-dard CNNs by randomly select a set of neurons and removethem from the network before training and thus forcingthe CNNs to less focus on local texture features With thisdesign they achieved improved black-box robustness

Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs trained against norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may highly depend on the type of adversary. Models trained against a spatially-transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, yet their salience maps are still quite different, as shown in Figure 8. Also, the average distance between their salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks x_adv = G(x; w) parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs trained against these generalized types of attacks as future work.

Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of the standard CNN, the PGD-l∞ AT-CNN, and the ST-AT-CNN.

6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two kinds of models on the three constructed datasets, i.e., stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525-9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124-1128, 2016.
Brendel, W. and Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248-255, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510-1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618-626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322-8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921-2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.


A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units per group.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture, using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them with the ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch (a minimal loading sketch follows this list).
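As referenced above, one way such a setup can be constructed with torchvision is sketched below. The class counts (257 for Caltech-256 including the clutter category, 200 for Tiny ImageNet) and the use of torchvision.models.resnet18 are our illustrative assumptions, not a description of the released training code.

import torch.nn as nn
from torchvision import models

def build_resnet18(num_classes, imagenet_init=True):
    """ResNet-18 backbone with the final fully-connected layer resized
    to the target number of classes."""
    model = models.resnet18(pretrained=imagenet_init)   # ImageNet weights from torchvision
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Illustrative class counts (assumptions, see the note above):
caltech_model = build_resnet18(num_classes=257)                     # Caltech-256 (+ clutter)
tiny_model = build_resnet18(num_classes=200)                        # Tiny ImageNet
cifar_model = build_resnet18(num_classes=10, imagenet_init=False)   # CIFAR-10, trained from scratch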

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size 2, and 40 iterations.
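For reference, a minimal PyTorch sketch of such an l∞ PGD adversary is given below. Interpreting the step size of 2 as 2/255 on [0, 1]-scaled inputs is our assumption; the sketch is illustrative rather than the exact evaluation script.

import torch

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=40):
    """l-infinity PGD attack: random start, `steps` signed-gradient ascent
    steps, projecting back onto the eps-ball and the valid range [0, 1]."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1).detach()                     # keep pixels valid
    return x_adv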

A.2. Adversarial Training

We perform 9 types of adversarial training on each of the datasets: 7 of the 9 are against a projected gradient descent (PGD) adversary (Madry et al., 2018), and the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. Train against a projected gradient descent (PGD) adversary

We list the value of ε used for adversarial training for each dataset and lp-norm. In all settings, PGD runs for 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224×224; we train three adversarially trained CNNs with ε ∈ {4, 8, 12}, denoted as PGD-l2 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32×32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and use the same step sizes as for Caltech-256 & Tiny ImageNet.

A.2.2. Train against an FGSM adversary

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM 4 and FGSM 8, respectively.
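For completeness, the overall adversarial training loop, whether the adversary is PGD- or FGSM-based, follows the standard min-max recipe (Madry et al., 2018). The PyTorch-style sketch below is illustrative only and reuses the pgd_linf helper sketched above; it is not the released training code.

import torch

def adversarial_train_epoch(model, loader, optimizer, attack, device="cuda"):
    """One epoch of adversarial training: each clean batch is replaced by
    adversarial examples crafted against the current model, and the network
    is updated on the (approximate) worst-case loss."""
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(model, x, y)        # e.g. attack = lambda m, a, b: pgd_linf(m, a, b, steps=20)
        optimizer.zero_grad()
        loss = loss_fn(model(x_adv), y)    # loss on the perturbed examples only
        loss.backward()
        optimizer.step()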

B. Style-transferred test set

Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017), with a stylization coefficient of α = 1.0, to every test image, using the style of a randomly selected painting from Kaggle's "Painter by Numbers" dataset (https://www.kaggle.com/c/painter-by-numbers). We used the source code provided by Geirhos et al. (2019).

C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set (a minimal sketch is given after this list):

• The low-frequency filtered version: we use a radial mask in the Fourier domain to set the higher-frequency modes to zero (low-pass filtering).

• The high-frequency filtered version: we use a radial mask in the Fourier domain to preserve only the higher-frequency modes (high-pass filtering).

• The random filtered version: we use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during testing.
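The following NumPy sketch, referenced above, illustrates the three filters on a single grayscale image; the radius threshold r and drop probability p are illustrative placeholders rather than the exact settings used in our experiments.

import numpy as np

def fourier_filter(img, mode="low", r=20, p=0.5, rng=None):
    """Filter a 2-D grayscale image in the Fourier domain.

    mode="low":    keep modes within radius r of the zero frequency (low-pass)
    mode="high":   keep modes outside radius r (high-pass)
    mode="random": zero out each mode independently with probability p
    """
    rng = np.random.default_rng() if rng is None else rng
    f = np.fft.fftshift(np.fft.fft2(img))            # center the zero-frequency mode
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    if mode == "low":
        mask = dist <= r
    elif mode == "high":
        mask = dist > r
    else:  # "random"
        mask = rng.random((h, w)) >= p
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask))
    return np.real(filtered)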

C.2. Results

We measure the generalization performance ("accuracy on correctly classified images") of each model on these three filtered versions of the Caltech-256 test set; the results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are usually regarded as high-frequency information, whereas shapes and contours are closer to low-frequency information.
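As a reminder of how this metric is computed: we first select the clean test images that a model classifies correctly and then measure its accuracy on the transformed versions of exactly those images. The snippet below is an illustrative PyTorch sketch of this procedure, not our released evaluation code.

import torch

@torch.no_grad()
def accuracy_on_correctly_classified(model, clean_images, transformed_images, labels):
    """Accuracy on transformed images, restricted to the subset of images
    whose clean versions the model already classifies correctly."""
    model.eval()
    clean_pred = model(clean_images).argmax(dim=1)
    keep = clean_pred == labels                         # correctly classified clean images
    if keep.sum() == 0:
        return float("nan")
    trans_pred = model(transformed_images[keep]).argmax(dim=1)
    return (trans_pred == labels[keep]).float().mean().item()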

Table 3. "Accuracy on correctly classified images" for different models on the three Fourier-filtered Caltech-256 test sets.

MODEL       LOW-FREQUENCY FILTERED   HIGH-FREQUENCY FILTERED   RANDOM FILTERED
STANDARD    15.8                     16.5                      73.5
UNDERFIT    14.5                     17.6                      62.2
PGD-l∞      71.1                     3.6                       73.4

D. Detailed results

We present the detailed results of our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on the test sets with different saturation levels. Tables 7 and 8 list the results of each model on the test sets after different patch-shuffling operations.

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated by Grad and SmoothGrad in Figure 10.
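Both figures are produced with gradient-based sensitivity maps; for readers unfamiliar with SmoothGrad (Smilkov et al., 2017), the sketch below illustrates the idea of averaging input gradients over noisy copies of an image. The noise level and sample count shown are illustrative assumptions, not the exact settings used for our figures.

import torch

def smoothgrad_map(model, x, y, n_samples=50, sigma=0.15):
    """SmoothGrad sensitivity map: average the input gradient of the target
    class score over several noisy copies of the image.

    x: input batch of shape (1, C, H, W) in [0, 1]; y: target class index.
    """
    model.eval()
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, y]                    # class score for the target label
        grads += torch.autograd.grad(score, noisy)[0]
    saliency = (grads / n_samples).abs().max(dim=1)[0]  # max over color channels
    return saliency                                      # shape (1, H, W)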


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82


Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

PATCH-SHUFFLE   2×2     4×4     8×8
STANDARD        84.76   51.50   10.84
UNDERFIT        75.59   33.41   6.03
PGD-l∞ 8        58.13   20.14   7.70
PGD-l∞ 4        68.54   26.45   8.18
PGD-l∞ 2        74.25   30.77   9.00
PGD-l∞ 1        78.11   35.03   8.42
PGD-l2 12       58.25   21.03   7.85
PGD-l2 8        63.36   22.19   8.48
PGD-l2 4        69.65   28.21   7.72
FGSM 8          64.48   22.94   8.07
FGSM 4          70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

PATCH-SHUFFLE   2×2     4×4     8×8
STANDARD        66.73   24.87   4.48
UNDERFIT        59.22   23.62   4.38
PGD-l∞ 8        41.08   16.05   6.83
PGD-l∞ 4        49.54   18.23   6.30
PGD-l∞ 2        55.96   19.95   5.61
PGD-l∞ 1        60.19   23.24   6.08
PGD-l2 12       42.23   16.95   7.66
PGD-l2 8        47.67   16.28   6.50
PGD-l2 4        51.94   17.79   5.89
FGSM 8          57.42   20.70   4.73
FGSM 4          50.68   16.84   5.98

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4.


Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4. It is easily observed that the sensitivity maps generated from Grad are noisier than those from its smoothed variant, SmoothGrad, especially for the standard and underfitting CNNs.


of the original image Obviously after random shufflingthe shapes and edge features are destroyed dramaticallythe prediction probability of the adverarially trained CNNsdrops significantly while the normal CNNs still maintainsa high confidence over the ground truth class This revealsAT-CNNs are more baised towards shapes and edges thannormally trained ones

Moreover Figure 7 depicts the ldquo accuracy of correctly classi-fied imagesrdquo for all the models measured on ldquoPatch-shuffledrdquotest set with increasing number of splitting pieces AT-CNNs especially trained against with a stronger attack aremore sensitive to ldquoPatch-shufflingrdquo operations in most ofour experiments

Note that under ldquoPatch-shuffle 8rdquo operation all models havesimilar ldquo accuracy of correctly classified imagesrdquo which islargely due to the severe information loss Also note that thisaccuracy of all models on Tiny ImageNet shown in 7(a) ismush lower than that on Caltech-256 in 7(b) That is underldquoPatch-shuffle 1rdquo normally trained CNN has an accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

of 8476 on Caltech-256 while only 6673 on TinyImageNet This mainly origins from the limited resolutionof Tiny ImageNet since ldquoPatch-Shufflerdquo operation on low-resolution images destroys more useful features than thosewith higher resolution

5 Related work and discussionInterpreting AT-CNNs Recently there are some relevantfindings indicating that AT-CNNs learn fundamentally differ-ent feature representations than standard classifiers Tsipraset al (2018) showed that sensitivity maps of AT-CNNs inthe input space align well with human perception Addi-tionally by visualizing large-ε adversarial examples againstAT-CNNs it can be observed that the adversarial examplescould capture salient data characteristics of a different classwhich appear semantically similar to the images of the differ-ent class Dong et al (2017) leveraged adversarial trainingto produce a more interpretable representation by visualiz-ing active neurons Compared with Tsipras et al (2018) andDong et al (2017) we have conducted a more systematicalinvestigation for interpreting AT-CNNs We construct threetypes of image transformation that can largely change thetextures while preserving shape information (ie stylizingand saturation) or shatter the shapeedge features whilekeeping the local textures (ie patch-shuffling) Evaluatingthe generalization of AT-CNNs over these designed datasetsprovides a quantitative way to verify and interpret theirstrong shape-bias compared with normal CNNs

Insights for defensing adversarial examples Based onour investigation over the AT-CNNs we find that the ro-bustness towards adversarial examples is correlated withthe capability of capturing long-range features like shapesor contours This naturally raises the question whetherany other models that can capture more global features orwith more texture invariance could lead to more robustnessto adversarial examples even without adversarial train-ing This might provide us some insights on designing newnetwork architecture or new strategies for enhancing thebias towards long-range features Some recent works turnout partially answering this question (Xie et al 2018) en-hanced standard CNNs with non-local blocks inspired from(Wang et al 2018 Vaswani et al 2017) which capture long-range dependencies in a data-dependent manner and whencombined with adversarial training their networks achievedstate-of-the-art adversarial robustness on ImageNet (Luoet al 2018) destroyed some of the local connection of stan-dard CNNs by randomly select a set of neurons and removethem from the network before training and thus forcingthe CNNs to less focus on local texture features With thisdesign they achieved improved black-box robustness

Adversarial training with other types of attacks In thiswork we mainly interpret the AT-CNNs based on norm-

constrained perturbation over the original images It is wor-thy of noting that the difference between normally trainedand adversarially trained CNNs may highly depends onthe type of adversaries Models trained against spatially-transformed adversary (Xiao et al 2018) denoted as ST-ST-CNNs have similar robustness towards PGD attack withstandard models and their salience maps are still quite dif-ferent as shown in Figure 8 Also the average distancebetween salience maps is close to that of standard CNNwhich is much higher than that of PGD-AT-CNN There ex-ists a variety of generalized types of attacks xadv = G(xw)parameterized by w such as spatially transformed (Xiaoet al 2018) and GAN-based adversarial examples (Songet al 2018) We leave interpreting the AT-CNNs based onthese generalized types of attacks as future work

Figure 8 Sensitivity maps based on SmoothGrad (Smilkov et al2017) of three models From left to right original image sensitiv-ity maps of standard CNN PGD-linfin AT-CNN and ST-AT-CNN

6 ConclusionFrom both qualitative and quantitative perspectives we haveimplemented a systematic study on interpreting the adversar-ially trained convolutional neural networks Through con-structing distorted test sets either preserving shapes or localtextures we compare the sensitivity maps of AT-CNNs andnormal CNNs on the clean stylized and saturated imageswhich visually demonstrates that AT-CNNs are more biasedtowards global structures such as shapes and edges Moreimportantly we evaluate the generalization performance ofthe two models on the three constructed datasets stylizedsaturated and patch-shuffled ones The results clearly indi-cate that AT-CNNs are less sensitive to the texture distortionand focus more on shape information while the normallytrained CNNs the other way around

Understanding what a model has learned is an essentialtopic in both machine learning and computer vision Thestrategies we propose can also be extended to interpret otherneural networks such as models for object detection andsemantic segmentation

AcknowledgementThis work is supported by National Natural Science Foun-dation of China (No61806009) Beijing Natural Sci-ence Foundation (No4184090) Beijing Academy of Ar-tificial Intelligence (BAAI) and Intelligent Manufactur-

Interpreting Adversarially Trained Convolutional Neural Networks

ing Action Plan of Industrial Solid Foundation Program(NoJCKY2018204C004) We also appreciate insightfuldiscussions with Dinghuai Zhang and Dr Lei Wu

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input datadistributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutionalnetworks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant toadversarial attacks In International Conference on Learn-ing Representations 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Interpreting Adversarially Trained Convolutional Neural Networks

A Experiment SetupA1 Models

bull CIFAR-10 We train a standard ResNet-18 (He et al2016a) architecture it has 4 groups of residual layerswith filter sizes (64 128 256 512) and 2 residual units

bull Caltech-256 amp Tiny ImageNet We use a ResNet-18architecture using the code from pytorch(Paszke et al2017) Note that for models on Caltech-256 amp TinyImageNet we initialize them using ImageNet(Denget al 2009) pre-trained weighs provided by pytorch

We evaluate the robustness of all our models using a linfinprojected gradient descent adversary with ε = 8255 stepsize = 2 and number of iterations as 40

A2 Adversarial Training

We perform 9 types of adversarial training on each of thedataset 7 of the 9 kinds of adversarial training are againsta projected gradient descent (PGD) adversary(Madry et al2018) the other 2 are against FGSM adversary(Goodfellowet al 2014)

A21 TRAIN AGAINST A PROJECTED GRADIENTDESCENT (PGD) ADVERSARY

We list value of ε for adversarial training of each dataset andlp-norm In all settings PGD runs 20 iterations

bull linfin-norm bounded adversary For all of thethree data set pixel vaules range from 0 1 wetrain 4 adversarially trained CNNs with ε isin1255 2255 4255 8255 these four models aredenoted as PGD-inf1 2 4 8 respectively and stepssize as 1255 1255 2255 4255

bull l2-norm bounded adversary For Caltech-256 ampTiny ImageNet the input size for our model is 224times224 we train three adversarially trained CNNs withε isin 4 8 12 and these four models are denoted asPGD-l2 4 8 12 respectively Step sizes for thesethree models are 2255 4255 6255 For CIFAR-10where images are of size 32times 32 the three adversari-ally trained CNNs have ε isin 410 810 1210 butthey are denoted in the same way and have the samestep size as that in Caltech-256 amp Tiny ImageNet

A22 TRAIN AGAINST A FGSM ADVERSARY

ε for these two adversarially trained CNNs are ε isin4 8 and they are denoted as FGSM 4 8 respectively

B Style-transferred test setFollowing (Geirhos et al 2019) we construct stylized testset for Caltech-256 and Tiny ImageNet by applying theAdaIn style transfer(Huang amp Belongie 2017) with a styl-ization coefficient of α = 10 to every test image withthe style of a randomly selected painting from 3KagglersquosPainter by numbers dataset we used source code providedby(Geirhos et al 2019)

C Experiments on Fourier-filtered datasets(Jo amp Bengio 2017) showed deep neural networks tendto learn surface statistical regularities as opposed to high-level abstractions Following them we test the performanceof different trained CNNs on the high-pass and low-passfiltered dataset to show their tendencies

C1 Fourier filtering setup

Following (Jo amp Bengio 2017) We construct three types ofFourier filtered version of test set

bull The low frequency filtered version We use a radialmask in the Fourier domain to set higher frequencymodes to zero(low-pass filtering)

bull The high frequency filtered version We use a radialmask in the Fourier domain to preserve only the higherfrequency modes(high-pass filtering)

bull The random filtered version We use a random maskin the Fourier domain to set each mode to 0 with prob-ability p uniformly The random mask is generated onthe fly during the test

C2 Results

We measure generalization performance (accuracy on cor-rectly classified images) of each model on these three fil-tered datasets from Caltech-256 results are listed in Ta-ble 3 AT-CNNs performs better on Low-pass filtered datasetand worse on High-pass filtered dataset Results indicatethat AT-CNNs make their predictions depend more on low-frequency information This finding is consistent with ourconclusions since local features such as textures are oftenconsidered as high-frequency information and shapes andcontours are more like low-frequency

D Detailed resultsWe the detailed results for our quantitative experimentshere Table 5 4 6 show the results of each models on

3httpswwwkagglecomcpainter-by-numbers

Interpreting Adversarially Trained Convolutional Neural Networks

Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION

STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734

test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations

E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data

distributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014

Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682


Table 7. "Accuracy on correctly classified images" (%) for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

DATA SET       2×2     4×4     8×8
STANDARD      84.76   51.50   10.84
UNDERFIT      75.59   33.41    6.03
PGD-l∞ 8      58.13   20.14    7.70
PGD-l∞ 4      68.54   26.45    8.18
PGD-l∞ 2      74.25   30.77    9.00
PGD-l∞ 1      78.11   35.03    8.42
PGD-l2 12     58.25   21.03    7.85
PGD-l2 8      63.36   22.19    8.48
PGD-l2 4      69.65   28.21    7.72
FGSM 8        64.48   22.94    8.07
FGSM 4        70.50   28.41    6.03

Table 8. "Accuracy on correctly classified images" (%) for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

DATA SET       2×2     4×4     8×8
STANDARD      66.73   24.87    4.48
UNDERFIT      59.22   23.62    4.38
PGD-l∞ 8      41.08   16.05    6.83
PGD-l∞ 4      49.54   18.23    6.30
PGD-l∞ 2      55.96   19.95    5.61
PGD-l∞ 1      60.19   23.24    6.08
PGD-l2 12     42.23   16.95    7.66
PGD-l2 8      47.67   16.28    6.50
PGD-l2 4      51.94   17.79    5.89
FGSM 8        57.42   20.70    4.73
FGSM 4        50.68   16.84    5.98

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4.

Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-L2 12, 8, 4, and FGSM 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.


Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.

Table 2. "Accuracy on correctly classified images" (%) for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of different models on the clean test sets.

DATASET       CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD         83.32             16.83               72.02              7.25
UNDERFIT         69.04              9.75               60.35              7.16
PGD-l∞ 8         66.41             19.75               54.42             18.81
PGD-l∞ 4         72.22             21.10               61.85             20.51
PGD-l∞ 2         76.51             21.89               67.06             19.25
PGD-l∞ 1         79.11             22.07               69.42             18.31
PGD-l2 12        65.24             20.14               53.44             19.33
PGD-l2 8         69.75             21.62               58.21             20.42
PGD-l2 4         74.12             22.53               64.24             21.05
FGSM 8           70.88             21.23               66.21             15.07
FGSM 4           73.91             21.99               63.43             20.22

Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right: saturation levels 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.
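The exact saturation mapping is defined in the main text; as a rough illustration, the Python sketch below implements one mapping consistent with the caption (level 2 is the identity, larger levels push pixels towards 0 or 1, smaller levels push them towards 1/2). The specific exponent form is our assumption, not necessarily the authors' exact formula.

import numpy as np

def saturate(image, level):
    # image: float array with pixel values in [0, 1]. level == 2 leaves the image
    # unchanged; larger levels push pixels towards 0 or 1 (removing texture detail),
    # smaller levels push them towards 0.5, matching the behaviour in Figure 4.
    centered = 2.0 * image - 1.0                       # map [0, 1] -> [-1, 1]
    saturated = np.sign(centered) * np.abs(centered) ** (2.0 / level)
    return (saturated + 1.0) / 2.0                     # map back to [0, 1]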

[Figure 5: line plots of "Accuracy on correctly classified images" (y-axis, 0-100) versus saturation level (x-axis, 2^-2 to 2^10, log scale) for PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, and underfitting models; a vertical marker indicates the clean image (saturation level 2). Panels: (a) Caltech-256, (b) Tiny ImageNet.]

Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that for each adversarial training method (PGD- and FGSM-based) several curves share the same color and line type; among them, the models trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.

[Figure 6: example image under patch-shuffling, with bar charts of the probability of the ground-truth class "cake" assigned by the Normal, Underfit, PGD-inf 8, PGD-L2 12, and FGSM 8 models. (a) Original image: 0.750, 0.952, 0.738, 0.769, 0.932. (b) Patch-Shuffle 2: 0.550, 0.877, 0.012, 0.043, 0.028. (c) Patch-Shuffle 4: 0.541, 0.913, 0.002, 0.012, 0.012. (d) Patch-Shuffle 8: 0.005, 0.305, 0.002, 0.002, 0.003.]

Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of "cake" assigned by different models.

[Figure 7: line plots of "Accuracy on correctly classified images" (y-axis, 0-100) versus the patch-shuffle setting (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8) for PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, and underfitting models. Panels: (a) Caltech-256, (b) Tiny ImageNet.]

Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Caltech-256 and Tiny ImageNet with different splitting numbers. Detailed results are listed in the appendix.

When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions; they tend to be more robust only to specific types of distortion. We leave further investigation of this issue as future work.

4.2.3. PATCH-SHUFFLING

Stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the shape and edge features. To test the bias of AT-CNNs and standard CNNs in the other direction, we shatter the shape and edge information by splitting each image into k × k patches and then randomly shuffling them. This operation still largely maintains the local textures if k is not too large; a minimal sketch of the transformation is given below.
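The following Python sketch is one way to realize the transformation just described; the function name and array layout are our own choices, not code from the paper.

import numpy as np

def patch_shuffle(image, k, rng=None):
    # image: H x W x C array with H and W divisible by k. Returns an image whose
    # k x k grid of patches has been randomly permuted: local textures inside each
    # patch are preserved, while global shape and edge structure is destroyed.
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // k, image.shape[1] // k
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[p] for p in order[r * k:(r + 1) * k]], axis=1)
            for r in range(k)]
    return np.concatenate(rows, axis=0)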

Figure 6 shows one example of patch-shuffled images under different numbers of splits; the first row shows the probabilities assigned by different models to the ground-truth class of the original image. Obviously, after random shuffling, the shape and edge features are destroyed dramatically: the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence on the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.

Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.

Note that under the "Patch-shuffle 8" operation, all models have a similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy is much lower for all models on Tiny ImageNet, shown in Figure 7(b), than on Caltech-256, shown in Figure 7(a). That is, under "Patch-shuffle 2", the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly stems from the limited resolution of Tiny ImageNet, since the patch-shuffle operation destroys more useful features in low-resolution images than in higher-resolution ones.

5. Related work and discussion

Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.

Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide insights on designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by (Wang et al., 2018; Vaswani et al., 2017), which capture long-range dependencies in a data-dependent manner; combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features; with this design they achieved improved black-box robustness.

Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs trained against norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend highly on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, yet their salience maps are still quite different, as shown in Figure 8. Also, the average distance between their salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs trained against these generalized types of attacks as future work.

Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) for three models. From left to right: original image, sensitivity maps of a standard CNN, a PGD-l∞ AT-CNN, and an ST-AT-CNN.

6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. Through constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two kinds of models on the three constructed datasets: stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while the normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araújo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009 (CVPR 2009), pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.

A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units in each group.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture, using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them with the ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent (PGD) adversary with ε = 8/255, a step size of 2, and 40 iterations.
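For concreteness, the following PyTorch sketch shows a minimal l∞ PGD attack matching the stated evaluation setup (ε = 8/255, 40 iterations); interpreting the step size as 2/255 is our assumption, and the random start follows common practice rather than a detail stated here.

import torch
import torch.nn.functional as F

def pgd_linf_attack(model, x, y, eps=8/255, step_size=2/255, steps=40):
    # x: batch of images in [0, 1]; untargeted l-inf PGD with a random start,
    # in the spirit of Madry et al. (2018). The 2/255 step size is our reading
    # of "step size = 2" and may differ from the authors' exact setting.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()        # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                  # keep pixel values valid
    return x_adv.detach()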

A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAINING AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the values of ε used for adversarial training for each dataset and lp-norm. In all settings, PGD runs for 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range in [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224; we train three adversarially trained CNNs with ε ∈ {4, 8, 12}, and these three models are denoted as PGD-l2 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and use the same step sizes as for Caltech-256 & Tiny ImageNet.

A.2.2. TRAINING AGAINST AN FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM 4 and FGSM 8, respectively. A minimal sketch of the adversarial-training loop is given below.
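The sketch below reuses the pgd_linf_attack function from the sketch in A.1; treating FGSM training as a single-step special case is our simplification, and the hyperparameter defaults are placeholders rather than the authors' exact values.

import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255,
                               step_size=4/255, steps=20, device="cuda"):
    # One epoch of PGD adversarial training: each clean batch is replaced by its
    # PGD perturbation before the usual cross-entropy update. A single step with
    # step_size = eps roughly corresponds to FGSM training (up to the random start).
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_linf_attack(model, x, y, eps=eps,
                                step_size=step_size, steps=steps)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()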

B. Style-transferred test set

Following (Geirhos et al., 2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by (Geirhos et al., 2019). A construction sketch is given below.
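As an illustration of this construction, the sketch below loops over test images and applies a hypothetical adain_stylize helper standing in for the AdaIN implementation of Huang & Belongie (2017); the helper name and the directory layout are assumptions, not the authors' code.

import os
import random
from PIL import Image

def build_stylized_test_set(content_dir, style_dir, out_dir, alpha=1.0):
    # Sketch only: adain_stylize is a hypothetical wrapper around an AdaIN
    # style-transfer implementation; directory paths are placeholders.
    style_paths = [os.path.join(style_dir, f) for f in os.listdir(style_dir)]
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(content_dir):
        content = Image.open(os.path.join(content_dir, name)).convert("RGB")
        style = Image.open(random.choice(style_paths)).convert("RGB")
        stylized = adain_stylize(content, style, alpha=alpha)  # hypothetical helper
        stylized.save(os.path.join(out_dir, name))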

C. Experiments on Fourier-filtered datasets

(Jo & Bengio, 2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following (Jo & Bengio, 2017), we construct three types of Fourier-filtered versions of the test set; see the sketch after this list for an illustration.

• The low-frequency filtered version: we use a radial mask in the Fourier domain to set the higher-frequency modes to zero (low-pass filtering).

• The high-frequency filtered version: we use a radial mask in the Fourier domain to preserve only the higher-frequency modes (high-pass filtering).

• The random filtered version: we use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during testing.

C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets constructed from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. These results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.

D. Detailed results

We list the detailed results for our quantitative experiments here.

³ https://www.kaggle.com/c/painter-by-numbers

Table 3. "Accuracy on correctly classified images" (%) for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET     LOW-FREQUENCY FILTERED   HIGH-FREQUENCY FILTERED   RANDOM FILTERED
STANDARD              15.8                     16.5                  73.5
UNDERFIT              14.5                     17.6                  62.2
PGD-l∞                71.1                      3.6                  73.4

Tables 4, 5 and 6 show the results of each model on the test sets with different saturation levels. Tables 7 and 8 list the results of each model on the test sets after different patch-shuffling operations.

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps obtained with Grad and SmoothGrad in Figure 10.
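For reference, a minimal PyTorch sketch of the SmoothGrad computation (Smilkov et al., 2017) is given below; plain Grad corresponds to a single sample with no added noise. The sample count and noise level are illustrative defaults, not necessarily those used in the paper.

import torch

def smoothgrad_saliency(model, x, y, n_samples=50, noise_std=0.15):
    # x: a single image of shape 1 x C x H x W with values in [0, 1]; y: target
    # class index. Averages the input gradient of the class logit over noisy
    # copies of x; n_samples and noise_std are illustrative defaults.
    model.eval()
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + noise_std * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, y]                      # logit of the target class
        total += torch.autograd.grad(score, noisy)[0]
    return (total / n_samples).abs().max(dim=1)[0]      # max over color channels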

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data

distributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014

Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutional

networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant to

Interpreting Adversarially Trained Convolutional Neural Networks

Table 7 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Caltech-256 test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256

DATA SET 2times 2 4times 4 8times 8T

STANDARD 8476 5150 1084UNDERFIT 7559 3341 603PGD-linfin 8 5813 2014 770PGD-linfin 4 6854 2645 818PGD-linfin 2 7425 3077 900PGD-linfin 1 7811 3503 842PGD-l2 12 5825 2103 785PGD-l2 8 6336 2219 848PGD-l2 4 6965 2821 772FGSM 8 6448 2294 807FGSM 4 7050 2841 603

Table 8 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Tiny ImageNet test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet

DATA SET 2times 2 4times 4 8times 8T

STANDARD 6673 2487 448UNDERFIT 5922 2362 438PGD-linfin 8 4108 1605 683PGD-linfin 4 4954 1823 630PGD-linfin 2 5596 1995 561PGD-linfin 1 6019 2324 608PGD-l2 12 4223 1695 766PGD-l2 8 4767 1628 650PGD-l2 4 5194 1779 589FGSM 8 5742 2070 473FGSM 4 5068 1684 598

Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4

Interpreting Adversarially Trained Convolutional Neural Networks

Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 7: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0750

0952

0738 0769

0932

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0550

0877

0012 0043 0028

NormalUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0541

0913

0002 0012 0012Norma

lUnderf

itPGD-in

f8PGD-L2

12 FGSM800

02

04

06

08

10

0005

0305

0002 0002 0003

(a) Original Image (b) Patch-Shuffle 2 (c) Patch-Shuffle 4 (d) Patch-Shuffle 8Figure 6 Visualization of patch-shuffling transformation The first row shows probability of ldquocakerdquo assigned by different models

clean patch-shuffle 2 patch-shuffle 4 patch-shuffle 8

Patch-Shuffle

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

clean patch-shuffle 2 patch-shuffle 4 patch-shuffle 8

Patch-Shuffle

0

20

40

60

80

100

Accura

cy o

n c

orr

ectl

y c

lassifie

d im

ages

PGD AT with inf norm

PGD AT with l2 norm

FGSM AT

Stardard Training

Underfitting

(a) Caltech-256 (b) Tiny ImageNetFigure 7 ldquoAccuracy on correctly classified imagesrdquo for different models on patch-shuffled Tiny ImageNet and Caltech-256 with differentsplitting numbers Detailed results are listed in the appendix

When decreasing the saturation level all models have simi-lar degree of performance degradation indicating that AT-CNNs are not robust to all kinds of image distortions Theytend to be more robust for fixed types of distortions Weleave the further investigation regarding this issue as futurework

423 PATCH-SHUFFLING

Stylizing and saturation operation aim at changing or re-moving the texture information of original images whilepreserving the features of shapes and edges In order to testthe different bias of AT-CNN and standard CNN in the otherway around we shatter the shape and edge information bysplitting the images into k times k patches and then randomlyshuffling them This operation could still maintains the localtextures if k is not too large

Figure 6 shows one example of patch-shuffled images underdifferent numbers of splitting The first row shows the proba-bilities assigned by different models to the ground truth class

of the original image Obviously after random shufflingthe shapes and edge features are destroyed dramaticallythe prediction probability of the adverarially trained CNNsdrops significantly while the normal CNNs still maintainsa high confidence over the ground truth class This revealsAT-CNNs are more baised towards shapes and edges thannormally trained ones

Moreover Figure 7 depicts the ldquo accuracy of correctly classi-fied imagesrdquo for all the models measured on ldquoPatch-shuffledrdquotest set with increasing number of splitting pieces AT-CNNs especially trained against with a stronger attack aremore sensitive to ldquoPatch-shufflingrdquo operations in most ofour experiments

Note that under ldquoPatch-shuffle 8rdquo operation all models havesimilar ldquo accuracy of correctly classified imagesrdquo which islargely due to the severe information loss Also note that thisaccuracy of all models on Tiny ImageNet shown in 7(a) ismush lower than that on Caltech-256 in 7(b) That is underldquoPatch-shuffle 1rdquo normally trained CNN has an accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

of 8476 on Caltech-256 while only 6673 on TinyImageNet This mainly origins from the limited resolutionof Tiny ImageNet since ldquoPatch-Shufflerdquo operation on low-resolution images destroys more useful features than thosewith higher resolution

5 Related work and discussionInterpreting AT-CNNs Recently there are some relevantfindings indicating that AT-CNNs learn fundamentally differ-ent feature representations than standard classifiers Tsipraset al (2018) showed that sensitivity maps of AT-CNNs inthe input space align well with human perception Addi-tionally by visualizing large-ε adversarial examples againstAT-CNNs it can be observed that the adversarial examplescould capture salient data characteristics of a different classwhich appear semantically similar to the images of the differ-ent class Dong et al (2017) leveraged adversarial trainingto produce a more interpretable representation by visualiz-ing active neurons Compared with Tsipras et al (2018) andDong et al (2017) we have conducted a more systematicalinvestigation for interpreting AT-CNNs We construct threetypes of image transformation that can largely change thetextures while preserving shape information (ie stylizingand saturation) or shatter the shapeedge features whilekeeping the local textures (ie patch-shuffling) Evaluatingthe generalization of AT-CNNs over these designed datasetsprovides a quantitative way to verify and interpret theirstrong shape-bias compared with normal CNNs

Insights for defensing adversarial examples Based onour investigation over the AT-CNNs we find that the ro-bustness towards adversarial examples is correlated withthe capability of capturing long-range features like shapesor contours This naturally raises the question whetherany other models that can capture more global features orwith more texture invariance could lead to more robustnessto adversarial examples even without adversarial train-ing This might provide us some insights on designing newnetwork architecture or new strategies for enhancing thebias towards long-range features Some recent works turnout partially answering this question (Xie et al 2018) en-hanced standard CNNs with non-local blocks inspired from(Wang et al 2018 Vaswani et al 2017) which capture long-range dependencies in a data-dependent manner and whencombined with adversarial training their networks achievedstate-of-the-art adversarial robustness on ImageNet (Luoet al 2018) destroyed some of the local connection of stan-dard CNNs by randomly select a set of neurons and removethem from the network before training and thus forcingthe CNNs to less focus on local texture features With thisdesign they achieved improved black-box robustness

Adversarial training with other types of attacks In thiswork we mainly interpret the AT-CNNs based on norm-

constrained perturbation over the original images It is wor-thy of noting that the difference between normally trainedand adversarially trained CNNs may highly depends onthe type of adversaries Models trained against spatially-transformed adversary (Xiao et al 2018) denoted as ST-ST-CNNs have similar robustness towards PGD attack withstandard models and their salience maps are still quite dif-ferent as shown in Figure 8 Also the average distancebetween salience maps is close to that of standard CNNwhich is much higher than that of PGD-AT-CNN There ex-ists a variety of generalized types of attacks xadv = G(xw)parameterized by w such as spatially transformed (Xiaoet al 2018) and GAN-based adversarial examples (Songet al 2018) We leave interpreting the AT-CNNs based onthese generalized types of attacks as future work

Figure 8 Sensitivity maps based on SmoothGrad (Smilkov et al2017) of three models From left to right original image sensitiv-ity maps of standard CNN PGD-linfin AT-CNN and ST-AT-CNN

6 ConclusionFrom both qualitative and quantitative perspectives we haveimplemented a systematic study on interpreting the adversar-ially trained convolutional neural networks Through con-structing distorted test sets either preserving shapes or localtextures we compare the sensitivity maps of AT-CNNs andnormal CNNs on the clean stylized and saturated imageswhich visually demonstrates that AT-CNNs are more biasedtowards global structures such as shapes and edges Moreimportantly we evaluate the generalization performance ofthe two models on the three constructed datasets stylizedsaturated and patch-shuffled ones The results clearly indi-cate that AT-CNNs are less sensitive to the texture distortionand focus more on shape information while the normallytrained CNNs the other way around

Understanding what a model has learned is an essentialtopic in both machine learning and computer vision Thestrategies we propose can also be extended to interpret otherneural networks such as models for object detection andsemantic segmentation

AcknowledgementThis work is supported by National Natural Science Foun-dation of China (No61806009) Beijing Natural Sci-ence Foundation (No4184090) Beijing Academy of Ar-tificial Intelligence (BAAI) and Intelligent Manufactur-

Interpreting Adversarially Trained Convolutional Neural Networks

ing Action Plan of Industrial Solid Foundation Program(NoJCKY2018204C004) We also appreciate insightfuldiscussions with Dinghuai Zhang and Dr Lei Wu

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input datadistributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutionalnetworks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant toadversarial attacks In International Conference on Learn-ing Representations 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Interpreting Adversarially Trained Convolutional Neural Networks

A Experiment SetupA1 Models

bull CIFAR-10 We train a standard ResNet-18 (He et al2016a) architecture it has 4 groups of residual layerswith filter sizes (64 128 256 512) and 2 residual units

bull Caltech-256 amp Tiny ImageNet We use a ResNet-18architecture using the code from pytorch(Paszke et al2017) Note that for models on Caltech-256 amp TinyImageNet we initialize them using ImageNet(Denget al 2009) pre-trained weighs provided by pytorch

We evaluate the robustness of all our models using a linfinprojected gradient descent adversary with ε = 8255 stepsize = 2 and number of iterations as 40

A2 Adversarial Training

We perform 9 types of adversarial training on each of thedataset 7 of the 9 kinds of adversarial training are againsta projected gradient descent (PGD) adversary(Madry et al2018) the other 2 are against FGSM adversary(Goodfellowet al 2014)

A21 TRAIN AGAINST A PROJECTED GRADIENTDESCENT (PGD) ADVERSARY

We list value of ε for adversarial training of each dataset andlp-norm In all settings PGD runs 20 iterations

bull linfin-norm bounded adversary For all of thethree data set pixel vaules range from 0 1 wetrain 4 adversarially trained CNNs with ε isin1255 2255 4255 8255 these four models aredenoted as PGD-inf1 2 4 8 respectively and stepssize as 1255 1255 2255 4255

• l2-norm bounded adversary. For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224; we train three adversarially trained CNNs with ε ∈ {4, 8, 12}, and these three models are denoted PGD-l2 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255 and 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and use the same step sizes as those on Caltech-256 & Tiny ImageNet.

A.2.2. Train against an FGSM adversary

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted FGSM 4 and FGSM 8, respectively.
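As a rough illustration of how these models are trained, the sketch below runs one epoch of PGD adversarial training, reusing the pgd_linf_attack helper from above; model, loader and optimizer are assumed to exist, and the defaults correspond to the PGD-inf 8 setting (ε = 8/255, step size 4/255, 20 iterations). For FGSM training, the inner PGD loop would be replaced by a single signed-gradient step of size ε.

import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255, step_size=4/255, steps=20):
    """One epoch of PGD adversarial training: minimize the loss on worst-case perturbed inputs."""
    model.train()
    for images, labels in loader:
        # Inner maximization: approximate the worst-case perturbation with 20-step PGD.
        x_adv = pgd_linf_attack(model, images, labels, eps=eps, step_size=step_size, steps=steps)
        # Outer minimization: a standard cross-entropy update on the perturbed batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), labels)
        loss.backward()
        optimizer.step()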

B. Style-transferred test set

Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, with the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (https://www.kaggle.com/c/painter-by-numbers). We used the source code provided by Geirhos et al. (2019).
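The core of AdaIN is a channel-wise re-normalization of content features to the style features' statistics; a minimal sketch of this feature transform is shown below (the pre-trained VGG encoder and the decoder from Huang & Belongie (2017) are omitted, and the tensor names are our own). With α = 1.0, as used above, the stylized features replace the content features entirely.

import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Re-normalize content features to the channel-wise mean/std of the style features."""
    # Features are (N, C, H, W); statistics are computed per sample and per channel.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

def stylize_features(content_feat, style_feat, alpha=1.0):
    """Blend the AdaIN-transformed features with the original ones; alpha = 1 is full stylization."""
    return alpha * adain(content_feat, style_feat) + (1.0 - alpha) * content_feat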

C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set (a small sketch of the filtering is given after the list):

• The low-frequency filtered version. We use a radial mask in the Fourier domain to set the higher-frequency modes to zero (low-pass filtering).

• The high-frequency filtered version. We use a radial mask in the Fourier domain to preserve only the higher-frequency modes (high-pass filtering).

• The random filtered version. We use a random mask in the Fourier domain that sets each mode to 0 independently with probability p. The random mask is generated on the fly during testing.
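A minimal NumPy sketch of the low- and high-pass variants is given below; the mask radius and the exact mask shape are illustrative choices of ours, not the settings of Jo & Bengio (2017). The random filtered version can be obtained analogously by replacing the radial mask with a Boolean mask whose entries are drawn independently with probability p.

import numpy as np

def radial_mask(h, w, radius):
    """Boolean mask that is True inside a centered disk in the (shifted) frequency plane."""
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2)
    return dist <= radius

def fourier_filter(image, radius=12, keep="low"):
    """Low-pass (keep='low') or high-pass (keep='high') filter an (H, W, C) image."""
    img = image.astype(np.float64)
    mask = radial_mask(img.shape[0], img.shape[1], radius)
    out = np.empty_like(img)
    for c in range(img.shape[2]):
        freq = np.fft.fftshift(np.fft.fft2(img[:, :, c]))
        kept = freq * mask if keep == "low" else freq * ~mask
        out[:, :, c] = np.real(np.fft.ifft2(np.fft.ifftshift(kept)))
    return out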

C.2. Results

We measure the generalization performance ("accuracy on correctly classified images") of each model on these three filtered datasets constructed from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are usually regarded as high-frequency information, while shapes and contours are closer to low-frequency information.

D. Detailed results

We present the detailed results for our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on test sets with different saturation levels, and Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.



Table 3. "Accuracy on correctly classified images" for different models on the three Fourier-filtered Caltech-256 test sets.

MODEL        LOW-FREQUENCY FILTERED   HIGH-FREQUENCY FILTERED   RANDOM FILTERED
STANDARD     15.8                     16.5                      73.5
UNDERFIT     14.5                     17.6                      62.2
PGD-l∞       71.1                     3.6                       73.4

E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps computed with Grad and SmoothGrad in Figure 10.


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
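The exact definition of the saturation operation is given in the main text. Purely as an illustration of how such a saturation sweep can be implemented (the mapping below is an assumption of ours, not necessarily the paper's exact formula), one common pixel-wise choice is S_p(x) = (sign(2x − 1) · |2x − 1|^(2/p) + 1) / 2, which leaves an image in [0, 1] unchanged at p = 2, washes it out towards gray for smaller p, and pushes pixels towards 0/1 for large p:

import numpy as np

def saturate(image, p):
    """Illustrative pixel-wise saturation sweep on an image with values in [0, 1]."""
    centered = 2.0 * image - 1.0                                   # map [0, 1] -> [-1, 1]
    saturated = np.sign(centered) * np.abs(centered) ** (2.0 / p)  # p = 2 is the identity
    return (saturated + 1.0) / 2.0                                 # map back to [0, 1]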


Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

MODEL        2×2     4×4     8×8
STANDARD     84.76   51.50   10.84
UNDERFIT     75.59   33.41   6.03
PGD-l∞ 8     58.13   20.14   7.70
PGD-l∞ 4     68.54   26.45   8.18
PGD-l∞ 2     74.25   30.77   9.00
PGD-l∞ 1     78.11   35.03   8.42
PGD-l2 12    58.25   21.03   7.85
PGD-l2 8     63.36   22.19   8.48
PGD-l2 4     69.65   28.21   7.72
FGSM 8       64.48   22.94   8.07
FGSM 4       70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

MODEL        2×2     4×4     8×8
STANDARD     66.73   24.87   4.48
UNDERFIT     59.22   23.62   4.38
PGD-l∞ 8     41.08   16.05   6.83
PGD-l∞ 4     49.54   18.23   6.30
PGD-l∞ 2     55.96   19.95   5.61
PGD-l∞ 1     60.19   23.24   6.08
PGD-l2 12    42.23   16.95   7.66
PGD-l2 8     47.67   16.28   6.50
PGD-l2 4     51.94   17.79   5.89
FGSM 8       57.42   20.70   4.73
FGSM 4       50.68   16.84   5.98
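For reference, a minimal sketch of a k × k patch-shuffle operation behind these tables is given below; it assumes an image whose height and width are divisible by k, and the exact cropping and resizing conventions of the paper are not restated here.

import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an (H, W, C) image into a k x k grid of patches and randomly permute them."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0], image.shape[1]
    ph, pw = h // k, w // k
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    # Reassemble the permuted patches row by row.
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)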

Figure 9. Visualization of salience maps generated with SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-l2 12, 8, 4, and FGSM 8, 4.


Figure 10. Visualization of salience maps generated with Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-inf 8, 4, 2, 1, PGD-l2 12, 8, 4, and FGSM 8, 4. It is easily observed that the sensitivity maps generated with Grad are noisier than those from its smoothed variant SmoothGrad, especially for the standard and underfitting CNNs.
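A minimal sketch of the SmoothGrad sensitivity map used in these figures is given below: the input gradient of the class score is averaged over Gaussian-perturbed copies of the image. The sample count and noise level are illustrative, not the paper's exact settings; the plain Grad maps of Figure 10 correspond to n_samples = 1 and noise_std = 0.

import torch

def smoothgrad(model, image, label, n_samples=50, noise_std=0.1):
    """SmoothGrad (Smilkov et al., 2017): average the input gradient over noisy copies of an image."""
    total = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + noise_std * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy.unsqueeze(0))[0, label]      # class score for the target label
        total += torch.autograd.grad(score, noisy)[0]
    # Aggregate over color channels to obtain an (H, W) sensitivity map.
    return (total / n_samples).abs().sum(dim=0)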




Page 9: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

ing Action Plan of Industrial Solid Foundation Program(NoJCKY2018204C004) We also appreciate insightfuldiscussions with Dinghuai Zhang and Dr Lei Wu

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input datadistributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutionalnetworks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant toadversarial attacks In International Conference on Learn-ing Representations 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Interpreting Adversarially Trained Convolutional Neural Networks

A Experiment SetupA1 Models

bull CIFAR-10 We train a standard ResNet-18 (He et al2016a) architecture it has 4 groups of residual layerswith filter sizes (64 128 256 512) and 2 residual units

bull Caltech-256 amp Tiny ImageNet We use a ResNet-18architecture using the code from pytorch(Paszke et al2017) Note that for models on Caltech-256 amp TinyImageNet we initialize them using ImageNet(Denget al 2009) pre-trained weighs provided by pytorch

We evaluate the robustness of all our models using a linfinprojected gradient descent adversary with ε = 8255 stepsize = 2 and number of iterations as 40

A2 Adversarial Training

We perform 9 types of adversarial training on each of thedataset 7 of the 9 kinds of adversarial training are againsta projected gradient descent (PGD) adversary(Madry et al2018) the other 2 are against FGSM adversary(Goodfellowet al 2014)

A21 TRAIN AGAINST A PROJECTED GRADIENTDESCENT (PGD) ADVERSARY

We list value of ε for adversarial training of each dataset andlp-norm In all settings PGD runs 20 iterations

bull linfin-norm bounded adversary For all of thethree data set pixel vaules range from 0 1 wetrain 4 adversarially trained CNNs with ε isin1255 2255 4255 8255 these four models aredenoted as PGD-inf1 2 4 8 respectively and stepssize as 1255 1255 2255 4255

bull l2-norm bounded adversary For Caltech-256 ampTiny ImageNet the input size for our model is 224times224 we train three adversarially trained CNNs withε isin 4 8 12 and these four models are denoted asPGD-l2 4 8 12 respectively Step sizes for thesethree models are 2255 4255 6255 For CIFAR-10where images are of size 32times 32 the three adversari-ally trained CNNs have ε isin 410 810 1210 butthey are denoted in the same way and have the samestep size as that in Caltech-256 amp Tiny ImageNet

A22 TRAIN AGAINST A FGSM ADVERSARY

ε for these two adversarially trained CNNs are ε isin4 8 and they are denoted as FGSM 4 8 respectively

B Style-transferred test setFollowing (Geirhos et al 2019) we construct stylized testset for Caltech-256 and Tiny ImageNet by applying theAdaIn style transfer(Huang amp Belongie 2017) with a styl-ization coefficient of α = 10 to every test image withthe style of a randomly selected painting from 3KagglersquosPainter by numbers dataset we used source code providedby(Geirhos et al 2019)

C Experiments on Fourier-filtered datasets(Jo amp Bengio 2017) showed deep neural networks tendto learn surface statistical regularities as opposed to high-level abstractions Following them we test the performanceof different trained CNNs on the high-pass and low-passfiltered dataset to show their tendencies

C1 Fourier filtering setup

Following (Jo amp Bengio 2017) We construct three types ofFourier filtered version of test set

bull The low frequency filtered version We use a radialmask in the Fourier domain to set higher frequencymodes to zero(low-pass filtering)

bull The high frequency filtered version We use a radialmask in the Fourier domain to preserve only the higherfrequency modes(high-pass filtering)

bull The random filtered version We use a random maskin the Fourier domain to set each mode to 0 with prob-ability p uniformly The random mask is generated onthe fly during the test

C2 Results

We measure generalization performance (accuracy on cor-rectly classified images) of each model on these three fil-tered datasets from Caltech-256 results are listed in Ta-ble 3 AT-CNNs performs better on Low-pass filtered datasetand worse on High-pass filtered dataset Results indicatethat AT-CNNs make their predictions depend more on low-frequency information This finding is consistent with ourconclusions since local features such as textures are oftenconsidered as high-frequency information and shapes andcontours are more like low-frequency

D Detailed resultsWe the detailed results for our quantitative experimentshere Table 5 4 6 show the results of each models on

3httpswwwkagglecomcpainter-by-numbers

Interpreting Adversarially Trained Convolutional Neural Networks

Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION

STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734

test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations

E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araújo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture: increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.


Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
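As a rough illustration of how saturated test images of this kind can be generated, the sketch below applies a pixel-wise mapping v -> sign(2v - 1) * |2v - 1|^(2/p) / 2 + 1/2 at saturation level p, so that p = 2 leaves the image unchanged, large p pushes pixels towards 0 or 1, and small p washes the image out towards gray. This functional form, the helper name, and the image placeholder are assumptions made here for illustration and may differ from the exact definition given in the main text.

import numpy as np

def saturate(img, p):
    # Assumed pixel-wise saturation mapping: p = 2 is the identity,
    # large p binarizes pixels, small p pulls pixels towards 0.5.
    v = 2.0 * img - 1.0                      # map [0, 1] to [-1, 1]
    v = np.sign(v) * np.abs(v) ** (2.0 / p)  # sharpen or flatten the contrast
    return np.clip((v + 1.0) / 2.0, 0.0, 1.0)

image = np.random.rand(224, 224, 3)          # stand-in for a test image in [0, 1]
saturated_versions = {level: saturate(image, level)
                      for level in (0.25, 0.5, 1, 4, 8, 16, 64, 1024)}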

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.


Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

DATA SET    2×2     4×4     8×8
STANDARD    84.76   51.50   10.84
UNDERFIT    75.59   33.41   6.03
PGD-l∞ 8    58.13   20.14   7.70
PGD-l∞ 4    68.54   26.45   8.18
PGD-l∞ 2    74.25   30.77   9.00
PGD-l∞ 1    78.11   35.03   8.42
PGD-l2 12   58.25   21.03   7.85
PGD-l2 8    63.36   22.19   8.48
PGD-l2 4    69.65   28.21   7.72
FGSM 8      64.48   22.94   8.07
FGSM 4      70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

DATA SET    2×2     4×4     8×8
STANDARD    66.73   24.87   4.48
UNDERFIT    59.22   23.62   4.38
PGD-l∞ 8    41.08   16.05   6.83
PGD-l∞ 4    49.54   18.23   6.30
PGD-l∞ 2    55.96   19.95   5.61
PGD-l∞ 1    60.19   23.24   6.08
PGD-l2 12   42.23   16.95   7.66
PGD-l2 8    47.67   16.28   6.50
PGD-l2 4    51.94   17.79   5.89
FGSM 8      57.42   20.70   4.73
FGSM 4      50.68   16.84   5.98
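For reference, below is a minimal sketch of a k × k patch-shuffle transform consistent with the 2×2, 4×4 and 8×8 columns above, assuming patches are permuted uniformly at random; the function name and the NumPy-based implementation are illustrative rather than the paper's actual preprocessing code.

import numpy as np

def patch_shuffle(img, k, rng=None):
    # Split the image into a k x k grid of patches and permute the patches
    # uniformly at random, destroying global shape while keeping local texture.
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    ph, pw = h // k, w // k
    img = img[:ph * k, :pw * k]                       # crop so the grid divides evenly
    patches = [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)

image = np.random.rand(224, 224, 3)                   # stand-in for a test image
shuffled_2x2 = patch_shuffle(image, k=2)
shuffled_8x8 = patch_shuffle(image, k=8)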

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-l∞: 8, 4, 2, 1, PGD-l2: 12, 8, 4, and FGSM: 8, 4.


Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNN, underfitting CNN, PGD-l∞: 8, 4, 2, 1, PGD-l2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that the sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for the standard CNN and the underfitting CNN.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.



Interpreting Adversarially Trained Convolutional Neural Networks

Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION

STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734

test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations

E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data

distributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014

Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutional

networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant to

Interpreting Adversarially Trained Convolutional Neural Networks

Table 7 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Caltech-256 test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256

DATA SET 2times 2 4times 4 8times 8T

STANDARD 8476 5150 1084UNDERFIT 7559 3341 603PGD-linfin 8 5813 2014 770PGD-linfin 4 6854 2645 818PGD-linfin 2 7425 3077 900PGD-linfin 1 7811 3503 842PGD-l2 12 5825 2103 785PGD-l2 8 6336 2219 848PGD-l2 4 6965 2821 772FGSM 8 6448 2294 807FGSM 4 7050 2841 603

Table 8 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Tiny ImageNet test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet

DATA SET 2times 2 4times 4 8times 8T

STANDARD 6673 2487 448UNDERFIT 5922 2362 438PGD-linfin 8 4108 1605 683PGD-linfin 4 4954 1823 630PGD-linfin 2 5596 1995 561PGD-linfin 1 6019 2324 608PGD-l2 12 4223 1695 766PGD-l2 8 4767 1628 650PGD-l2 4 5194 1779 589FGSM 8 5742 2070 473FGSM 4 5068 1684 598

Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4

Interpreting Adversarially Trained Convolutional Neural Networks

Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 11: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

A Experiment SetupA1 Models

bull CIFAR-10 We train a standard ResNet-18 (He et al2016a) architecture it has 4 groups of residual layerswith filter sizes (64 128 256 512) and 2 residual units

bull Caltech-256 amp Tiny ImageNet We use a ResNet-18architecture using the code from pytorch(Paszke et al2017) Note that for models on Caltech-256 amp TinyImageNet we initialize them using ImageNet(Denget al 2009) pre-trained weighs provided by pytorch

We evaluate the robustness of all our models using a linfinprojected gradient descent adversary with ε = 8255 stepsize = 2 and number of iterations as 40

A2 Adversarial Training

We perform 9 types of adversarial training on each of thedataset 7 of the 9 kinds of adversarial training are againsta projected gradient descent (PGD) adversary(Madry et al2018) the other 2 are against FGSM adversary(Goodfellowet al 2014)

A21 TRAIN AGAINST A PROJECTED GRADIENTDESCENT (PGD) ADVERSARY

We list value of ε for adversarial training of each dataset andlp-norm In all settings PGD runs 20 iterations

bull linfin-norm bounded adversary For all of thethree data set pixel vaules range from 0 1 wetrain 4 adversarially trained CNNs with ε isin1255 2255 4255 8255 these four models aredenoted as PGD-inf1 2 4 8 respectively and stepssize as 1255 1255 2255 4255

bull l2-norm bounded adversary For Caltech-256 ampTiny ImageNet the input size for our model is 224times224 we train three adversarially trained CNNs withε isin 4 8 12 and these four models are denoted asPGD-l2 4 8 12 respectively Step sizes for thesethree models are 2255 4255 6255 For CIFAR-10where images are of size 32times 32 the three adversari-ally trained CNNs have ε isin 410 810 1210 butthey are denoted in the same way and have the samestep size as that in Caltech-256 amp Tiny ImageNet

A22 TRAIN AGAINST A FGSM ADVERSARY

ε for these two adversarially trained CNNs are ε isin4 8 and they are denoted as FGSM 4 8 respectively

B Style-transferred test setFollowing (Geirhos et al 2019) we construct stylized testset for Caltech-256 and Tiny ImageNet by applying theAdaIn style transfer(Huang amp Belongie 2017) with a styl-ization coefficient of α = 10 to every test image withthe style of a randomly selected painting from 3KagglersquosPainter by numbers dataset we used source code providedby(Geirhos et al 2019)

C Experiments on Fourier-filtered datasets(Jo amp Bengio 2017) showed deep neural networks tendto learn surface statistical regularities as opposed to high-level abstractions Following them we test the performanceof different trained CNNs on the high-pass and low-passfiltered dataset to show their tendencies

C1 Fourier filtering setup

Following (Jo amp Bengio 2017) We construct three types ofFourier filtered version of test set

bull The low frequency filtered version We use a radialmask in the Fourier domain to set higher frequencymodes to zero(low-pass filtering)

bull The high frequency filtered version We use a radialmask in the Fourier domain to preserve only the higherfrequency modes(high-pass filtering)

bull The random filtered version We use a random maskin the Fourier domain to set each mode to 0 with prob-ability p uniformly The random mask is generated onthe fly during the test

C2 Results

We measure generalization performance (accuracy on cor-rectly classified images) of each model on these three fil-tered datasets from Caltech-256 results are listed in Ta-ble 3 AT-CNNs performs better on Low-pass filtered datasetand worse on High-pass filtered dataset Results indicatethat AT-CNNs make their predictions depend more on low-frequency information This finding is consistent with ourconclusions since local features such as textures are oftenconsidered as high-frequency information and shapes andcontours are more like low-frequency

D Detailed resultsWe the detailed results for our quantitative experimentshere Table 5 4 6 show the results of each models on

3httpswwwkagglecomcpainter-by-numbers

Interpreting Adversarially Trained Convolutional Neural Networks

Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION

STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734

test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations

E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data

distributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014

Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutional

networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant to

Interpreting Adversarially Trained Convolutional Neural Networks

Table 7 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Caltech-256 test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256

DATA SET 2times 2 4times 4 8times 8T

STANDARD 8476 5150 1084UNDERFIT 7559 3341 603PGD-linfin 8 5813 2014 770PGD-linfin 4 6854 2645 818PGD-linfin 2 7425 3077 900PGD-linfin 1 7811 3503 842PGD-l2 12 5825 2103 785PGD-l2 8 6336 2219 848PGD-l2 4 6965 2821 772FGSM 8 6448 2294 807FGSM 4 7050 2841 603

Table 8 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Tiny ImageNet test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet

DATA SET 2times 2 4times 4 8times 8T

STANDARD 6673 2487 448UNDERFIT 5922 2362 438PGD-linfin 8 4108 1605 683PGD-linfin 4 4954 1823 630PGD-linfin 2 5596 1995 561PGD-linfin 1 6019 2324 608PGD-l2 12 4223 1695 766PGD-l2 8 4767 1628 650PGD-l2 4 5194 1779 589FGSM 8 5742 2070 473FGSM 4 5068 1684 598

Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4

Interpreting Adversarially Trained Convolutional Neural Networks

Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 12: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION

STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734

test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations

E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10

ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt

M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018

Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018

Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018

Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015

Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016

Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009

Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data

distributions In International Conference on LearningRepresentations 2019

Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017

Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009

Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015

Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019

Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014

Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014

Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007

He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a

He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b

Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017

Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014

Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012

Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016

Long J Shelhamer E and Darrell T Fully convolutional

networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015

Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018

Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant to

Interpreting Adversarially Trained Convolutional Neural Networks

Table 7 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Caltech-256 test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256

DATA SET 2times 2 4times 4 8times 8T

STANDARD 8476 5150 1084UNDERFIT 7559 3341 603PGD-linfin 8 5813 2014 770PGD-linfin 4 6854 2645 818PGD-linfin 2 7425 3077 900PGD-linfin 1 7811 3503 842PGD-l2 12 5825 2103 785PGD-l2 8 6336 2219 848PGD-l2 4 6965 2821 772FGSM 8 6448 2294 807FGSM 4 7050 2841 603

Table 8 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Tiny ImageNet test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet

DATA SET 2times 2 4times 4 8times 8T

STANDARD 6673 2487 448UNDERFIT 5922 2362 438PGD-linfin 8 4108 1605 683PGD-linfin 4 4954 1823 630PGD-linfin 2 5596 1995 561PGD-linfin 1 6019 2324 608PGD-l2 12 4223 1695 766PGD-l2 8 4767 1628 650PGD-l2 4 5194 1779 589FGSM 8 5742 2070 473FGSM 4 5068 1684 598

Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4

Interpreting Adversarially Trained Convolutional Neural Networks

Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs

adversarial attacks In International Conference on Learn-ing Representations 2018

Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017

Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018

Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017

Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015

Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017

Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013

Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018

Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017

Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018

Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017

Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017

Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018

Interpreting Adversarially Trained Convolutional Neural Networks

Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018

Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018

Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014

Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a

Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b

Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016

Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017

Page 13: Interpreting Adversarially Trained Convolutional Neural ...Interpreting Adversarially Trained Convolutional Neural Networks shuffled images, then evaluate the classification accuracy

Interpreting Adversarially Trained Convolutional Neural Networks

Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256

SATURAION LEVEL 025 05 1 4 8 16 64 1024

STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878

Table 5. “Accuracy on correctly classified images” for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL     0.25    0.5     1       4       8       16      64      1024

STANDARD             7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT             7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-ℓ∞ 8             11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-ℓ∞ 4             12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-ℓ∞ 2             12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-ℓ∞ 1             11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-ℓ2 12            11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-ℓ2 8             12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-ℓ2 4             12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8               12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4               12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
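The row labels shared by Tables 4–8 name the attack used during adversarial training and its perturbation budget; for instance, PGD-ℓ∞ 8 presumably denotes training against PGD attacks constrained to an ℓ∞ ball of radius 8 on the 0–255 pixel scale. As a reference point only, a generic PGD-ℓ∞ attack can be sketched as below (PyTorch); the step size, iteration count, and projection details are illustrative and not the settings used to train the models in these tables.

    import torch

    def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
        # Craft PGD l-infinity adversarial examples for images x in [0, 1]
        # with labels y; eps is the budget, alpha/steps are illustrative.
        loss_fn = torch.nn.CrossEntropyLoss()
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
                x_adv = x_adv.clamp(0, 1).detach()
        return x_adv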

Table 6. “Accuracy on correctly classified images” for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL     0.25    0.5     1       4       8       16      64      1024

STANDARD             27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT             21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-ℓ∞ 8             26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-ℓ∞ 4             27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-ℓ∞ 2             28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-ℓ∞ 1             27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-ℓ2 12            25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-ℓ2 8             27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-ℓ2 4             27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8               25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4               26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
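For reference, the saturation levels in Tables 4–6 can be realised by a pixel-wise mapping such as the one sketched below (Python/NumPy). This is a minimal sketch assuming the mapping v ↦ sign(2v − 1)·|2v − 1|^(2/p)/2 + 1/2 with saturation level p, so that p = 2 leaves the image unchanged, large p pushes pixels towards 0 or 1, and small p washes the image towards mid-gray; the exact operation used in the paper is defined in the main text and may differ in detail.

    import numpy as np

    def saturate(img, p):
        # Pixel-wise saturation of an image with values in [0, 1];
        # p is the saturation level (assumed mapping, see above).
        v = 2.0 * img - 1.0
        return np.sign(v) * np.abs(v) ** (2.0 / p) / 2.0 + 0.5

    # Sweep the levels reported in Tables 4-6 on a random "image".
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))
    for p in [0.25, 0.5, 1, 4, 8, 16, 64, 1024]:
        out = saturate(img, p)
        print(p, round(float(out.min()), 3), round(float(out.max()), 3))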



Table 7. “Accuracy on correctly classified images” for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

DATA SET             2×2     4×4     8×8

STANDARD             84.76   51.50   10.84
UNDERFIT             75.59   33.41   6.03
PGD-ℓ∞ 8             58.13   20.14   7.70
PGD-ℓ∞ 4             68.54   26.45   8.18
PGD-ℓ∞ 2             74.25   30.77   9.00
PGD-ℓ∞ 1             78.11   35.03   8.42
PGD-ℓ2 12            58.25   21.03   7.85
PGD-ℓ2 8             63.36   22.19   8.48
PGD-ℓ2 4             69.65   28.21   7.72
FGSM 8               64.48   22.94   8.07
FGSM 4               70.50   28.41   6.03

Table 8. “Accuracy on correctly classified images” for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

DATA SET             2×2     4×4     8×8

STANDARD             66.73   24.87   4.48
UNDERFIT             59.22   23.62   4.38
PGD-ℓ∞ 8             41.08   16.05   6.83
PGD-ℓ∞ 4             49.54   18.23   6.30
PGD-ℓ∞ 2             55.96   19.95   5.61
PGD-ℓ∞ 1             60.19   23.24   6.08
PGD-ℓ2 12            42.23   16.95   7.66
PGD-ℓ2 8             47.67   16.28   6.50
PGD-ℓ2 4             51.94   17.79   5.89
FGSM 8               57.42   20.70   4.73
FGSM 4               50.68   16.84   5.98
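The k × k Patch-shuffle behind Tables 7 and 8 splits each image into a k × k grid of equal patches and randomly permutes them. A minimal sketch is given below (Python/NumPy); it assumes the image height and width are divisible by k, and the random-seed handling is illustrative.

    import numpy as np

    def patch_shuffle(img, k, rng=None):
        # Split an H x W x C image into a k x k grid of patches and
        # randomly permute the patches (assumes H and W divisible by k).
        rng = np.random.default_rng() if rng is None else rng
        h, w, _ = img.shape
        ph, pw = h // k, w // k
        patches = [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                   for i in range(k) for j in range(k)]
        order = rng.permutation(len(patches))
        out = np.empty_like(img)
        for idx, src in enumerate(order):
            i, j = divmod(idx, k)
            out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = patches[src]
        return out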

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-ℓ∞ (8, 4, 2, 1), PGD-ℓ2 (12, 8, 4), and FGSM (8, 4).
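As a pointer for Figures 9 and 10, SmoothGrad (Smilkov et al., 2017) averages the input gradient of the predicted class score over several Gaussian-noised copies of the image, while the plain Grad map corresponds to the single, noise-free gradient. A minimal PyTorch sketch is shown below; the noise scale and sample count are illustrative rather than the settings used for the figures.

    import torch

    def smoothgrad(model, x, target, n_samples=50, sigma=0.15):
        # SmoothGrad sensitivity map: mean input gradient of the target
        # logit over noisy copies of x (shape (1, C, H, W), values in [0, 1]).
        model.eval()
        grads = torch.zeros_like(x)
        for _ in range(n_samples):
            noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
            logit = model(noisy)[0, target]
            grad, = torch.autograd.grad(logit, noisy)
            grads += grad
        # n_samples = 1 and sigma = 0 recovers the plain Grad map of Figure 10.
        return (grads / n_samples).abs().max(dim=1)[0]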


Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-ℓ∞ (8, 4, 2, 1), PGD-ℓ2 (12, 8, 4), and FGSM (8, 4). It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.



