11
History-based anomaly detector: an adversarial approach to anomaly detection Pierrick CHATILLON ´ Ecole normale sup´ erieure Paris-Saclay [email protected] Coloma BALLESTER Universitat Pompeu Fabra [email protected] March 17, 2020 Abstract Anomaly detection is a difficult problem in many areas and has recently been subject to a lot of attention. Clas- sifying unseen data as anomalous is a challenging mat- ter. Latest proposed methods rely on Generative Adver- sarial Networks (GANs) to estimate the normal data dis- tribution, and produce an anomaly score prediction for any given data. In this article, we propose a simple yet new adversarial method to tackle this problem, denoted as History-based anomaly detector (HistoryAD). It con- sists of a self-supervised model, trained to recognize ’nor- mal’ samples by comparing them to samples based on the training history of a previously trained GAN. Quantitative and qualitative results are presented evaluating its perfor- mance. We also present a comparison to several state-of- the-art methods for anomaly detection showing that our proposal achieves top-tier results on several datasets. 1 Introduction Anomaly detection usually refers to the identification of unusual patterns that do not conform to expected be- haviour of data, be it visual data such as images and videos, or other modalities such as acoustics or nat- ural language. Its applications are numerous and in- clude the detection of anomalies in medical or biologi- cal imaging such as failure of neurocognitive functions in damaged brains [15, 40, 38], real-life image forgery resulting in fake news or even fraud [19, 51, 48, 35], anomaly detection in image or video for autonomous Approximate density of anomaly score distribution for each dataset. test split CIFAR-10 CelebA Tiny ImageNet AUPRC 0.941 0.976 0.949 Table 1. AUPRC for SVHN compared to other datasets. Figure 1: Method trained on SVHN and evaluated on sev- eral datasets. navigation, driver assistance systems or surveillance sys- tems for, e.g., violence alerting or evidence investiga- tion [27, 30, 50, 34, 20, 11], or for detection of violation and foul in sports analysis, detection of defective samples in manufacturing industry [44, 52, 4], sea mines in side- scan sonar images [29] or extrange aerial objects in aerial images that may produce collisions [36], anomalies from multi-modal data including visual data, audio data or nat- ural language [25], to name but a few of its applications. The precise definition of anomalous data is inherently difficult as, in practice, an unexpected anomaly can be detected only against the ground of a pattern regularity. 1 arXiv:1912.11843v2 [cs.CV] 14 Mar 2020

anomaly detection - arXiv.org e-Print archive

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: anomaly detection - arXiv.org e-Print archive

History-based anomaly detector: an adversarial approach toanomaly detection

Pierrick CHATILLONEcole normale superieure Paris-Saclay

[email protected]

Coloma BALLESTERUniversitat Pompeu [email protected]

March 17, 2020

AbstractAnomaly detection is a difficult problem in many areasand has recently been subject to a lot of attention. Clas-sifying unseen data as anomalous is a challenging mat-ter. Latest proposed methods rely on Generative Adver-sarial Networks (GANs) to estimate the normal data dis-tribution, and produce an anomaly score prediction forany given data. In this article, we propose a simple yetnew adversarial method to tackle this problem, denotedas History-based anomaly detector (HistoryAD). It con-sists of a self-supervised model, trained to recognize ’nor-mal’ samples by comparing them to samples based on thetraining history of a previously trained GAN. Quantitativeand qualitative results are presented evaluating its perfor-mance. We also present a comparison to several state-of-the-art methods for anomaly detection showing that ourproposal achieves top-tier results on several datasets.

1 IntroductionAnomaly detection usually refers to the identification ofunusual patterns that do not conform to expected be-haviour of data, be it visual data such as images andvideos, or other modalities such as acoustics or nat-ural language. Its applications are numerous and in-clude the detection of anomalies in medical or biologi-cal imaging such as failure of neurocognitive functionsin damaged brains [15, 40, 38], real-life image forgeryresulting in fake news or even fraud [19, 51, 48, 35],anomaly detection in image or video for autonomous

Approximate density of anomaly score distribution foreach dataset.

test split CIFAR-10 CelebA Tiny ImageNetAUPRC 0.941 0.976 0.949

Table 1. AUPRC for SVHN compared to other datasets.

Figure 1: Method trained on SVHN and evaluated on sev-eral datasets.

navigation, driver assistance systems or surveillance sys-tems for, e.g., violence alerting or evidence investiga-tion [27, 30, 50, 34, 20, 11], or for detection of violationand foul in sports analysis, detection of defective samplesin manufacturing industry [44, 52, 4], sea mines in side-scan sonar images [29] or extrange aerial objects in aerialimages that may produce collisions [36], anomalies frommulti-modal data including visual data, audio data or nat-ural language [25], to name but a few of its applications.

The precise definition of anomalous data is inherentlydifficult as, in practice, an unexpected anomaly can bedetected only against the ground of a pattern regularity.

1

arX

iv:1

912.

1184

3v2

[cs

.CV

] 1

4 M

ar 2

020

Page 2: anomaly detection - arXiv.org e-Print archive

This is one of the reasons that anomaly detection is fre-quently approached as out-of-distribution or outlier detec-tion. A detailed account of the many existing methods toapproach this problem can be found in [6, 37, 12, 5].

This paper proposes a new method for anomaly detec-tion in the context of image processing that is based on theunsupervised learning of the underlying probability dis-tribution of normal data through appropriate GANs andthe proposal of a new anomaly score for the detectionof abnormal images. Our anomaly detector leverages arecorded history of the normal data generator to fully dis-criminate regions where true data points are more denseand use this learning to successfully detect anomalies. Itresults in a general anomaly detector that is free of as-sumptions on the data and thus it can be applied in anycontext and data modality. Fig. 1 illustrates an example ofthe performance of our anomaly detector on structurallydifferent datasets. In this experiment, the distribution ofthe Street View House Numbers (SVHN) dataset [32] isfirst learned (details in Section 3) and considered as nor-mal data. Then, our anomaly score is computed on sam-ples of it and also on samples from the CIFAR-10 [24],CelebA [26] and ImageNet [9] datasets. The approxi-mated density of the anomaly score distribution for eachdataset is shown at the top of Fig. 1. Let us notice that, forthe normal data, its anomaly values are around −1 whilefor all the ’anomalous’ datasets, the anomaly scores arearound +1. On the other hand, Table 1 in Fig. 1 bottomshows the area under the precision-recall curve (AUPRC).

The outline of this paper is as follows. Section 2 re-views related research. Section 3 details our proposal. InSection 4, the model architecture and implementation de-tails are provided while the experimental results are pre-sented in Section 5. Finally, the paper is concluded inSection 6.

2 Related WorkThe automatic identification of abnormal or manipulateddata is crucial in many contexts [40, 42, 19, 48, 35, 50, 4].Anomaly detection, has been a topic in statistics for cen-turies (see [6, 37, 12, 5] and references therein). The au-thors of [12] classify the methods in the literature by thestructural assumption made on the normality. Other workschallenge anomaly detection with unsupervised or self-

supervised learning strategies by taking advantage of thehuge amount of data frequently at our disposal [2, 41].Some of them use generative models to learn the (nor-mal) data distribution. Generative models are methodsthat produce novel samples from high-dimensional datadistributions, such as images or videos. Currently themost prominent approaches include autoregressive mod-els [45], variational autoencoders (VAE) [23], and gen-erative adversarial networks (GANs) [14]. GANs are of-ten credited for producing less burry outputs when usedfor image generation. In the anomaly detection context,several approaches tackle it using autoencoders [13] orGANs [41, 49, 8, 39, 17, 1, 21, 33] (we refer to [28]for a summary of those GAN-based anomaly detectionmethods). Some works focus on the implicit inversionof the generator in order to detect anomalous data thatdo not fall in the learned model [10, 17], while othersdirectly infer likelihoods with, for instance, normalizingflows [22, 7, 18, 31, 43]. On the other hand, the recentpaper [13] uses a memory-augmented autoencoder whichlearns and records a fixed number of prototypical normalencoded vectors. Given an input sample, it is encoded andthe memory is then accessed with an attention-based mod-ule to express this encoding by a sparse combination ofthe stored normal prototypes that used to reconstruct theinput data via a decoder. The l2 distance between the in-put and its reconstruction is used as anomaly score. Verydifferently, our anomaly detector leverages the recordedhistory of the normal data generator to fully discriminateregions where true data points are more dense and usethis learning to successfully detect anomalies. The idea ofproducing an anomaly score prediction for any given datahas also been investigated [17, 41, 1]. Our proposal fitsin the class of self-supervised approaches and it is trainedonly on normal (non-anomalous) samples. The proposedmethod is general, efficient and simple as it uses the richinformation of the training process in the construction theanomaly detector.

3 Proposed methodWe will attribute an anomaly score to any image. Inspiredby some ideas in [21] and [33], this score consists of theoutput of a network.

More precisely, let Pdata be the probability distribu-

2

Page 3: anomaly detection - arXiv.org e-Print archive

tion of a given ’normal images’ dataset. Our proposal isgrounded on, first, the learning of the probability distribu-tion Pdata using a GAN learning strategy while simultane-ously keeping track of the states of the associated genera-tor and discriminator during training. Secondly, we createa probability distribution (denoted as PGhist ) that combinesdifferent states of the previous generator’s history. We fi-nally train our anomaly detector by computing the TotalVariation distance between the real data distribution Pdataand PGhist . Fig. 2 displays an outline of the whole method,and in the following Sections 3.1, 3.2, and 3.3, these stepsof our approach are detailed and justified.

3.1 Learning to generate training-like dataAs mentioned, a GAN-based adversarial strategy is fol-lowed. Let us recall that the GAN strategy [14] is basedon a game theory scenario between two networks, thegenerator and the discriminator, having adversarial objec-tives. The generator maps a noise vector (of density PZ)from the latent space to the image space trying to trickthe discriminator, while the discriminator receives eithera generated or a real image and must distinguish betweenboth. This procedure leads the probability distribution ofthe generated data to be as close as possible, for some dis-tance, to the one of the real data. For the Vanilla GAN[14], the minimized distance is the Jensen-Shannon Di-vergence, which has arguably bad properties (see Section2 of [3] for details). The authors of [3] introduced theidea of minimizing the Wasserstein-1 distance (denotedas W1) instead. They proved several of its nice proper-ties, including its continuity and differentiability almosteverywhere (under certain hypotheses). The Wasserstein-1 distance can be computed with the Kantorovich dualityproperty: if P1 and P2 are two probability distributions,then

W1(P1,P2) = supD∈D

Ex∼P1[D(x)]− Ey∼P2

[D(y)], (1)

where D is set of 1-Lipschitz functions, i.e., in the no-tations of [46], the set of c-convex functions for the costfunction c(x, y) =| x− y |. Let G and D be the generatorand the discriminator learned by optimizing the adversar-ial Wasserstein GAN loss (WGAN),

infG

supD∈D

Ex∼PG[D(x)]− Ex∼Pdata [D(x)] . (2)

Notice that the optimal dual variable D∗ obtained fromthe optimization of (2) will be negative on real data sam-ples and positive on generated ones. In this paper, weuse the learning strategy of [16] which is based on ap-proximating the class D by neural networks D subject toa gradient penalty (forcing the L2 norm of the gradient ofthe discriminator with respect to its input to be close to1). The choice of WGAN instead of other GAN lossesfavours nice properties such as avoiding vanishing gradi-ents and mode collapse, and achieves more stable training.

3.2 Generator’s history probability distri-bution

Let us start by presenting the underlying idea. Duringtraining (equation (2) with algorithm of [16]), the discrim-inator D will indicate regions that may contain real data,and G learns to produce samples in that zone. If thesezones do not contain real data, then the discriminator willact as a critic and indicate it to the generator and point atother regions. This way, screenshots of the generator dur-ing training keep track of data points surrounding the realdata manifold. In this paper, we merge the screenshots ofthe generator during training to form:

PGhist ,∫ nepochs

α

c ·Gt(PZ) · e−βtdt, (3)

(see Fig. 2) where Gt denotes the state of the genera-tor at training time t and PZ is the latent space distribu-tion (parameters α and β are discussed in Section 4). Asa weighted mean of training generated distributions, wemay assume by construction that PGhist covers Pdata. Thatis,

Hypothesis: supp (Pdata) ⊂ supp (PGhist) .(4)

To illustrate our hypothesis and our whole method,we present a proof of concept by creating a toy one-dimensional dataset of points sampled from the normallaw. We then train the WGAN, with the generator initial-ized with an offset so that it does not match training data.As previously explained (details in Section 4) we save thestates of the generator. Fig. 3(a) displays empirical gener-ated distributions of some of these states.

3

Page 4: anomaly detection - arXiv.org e-Print archive

��0��2

��0��1

��2

training time

ℙ�ℎ���

��1

ℙ����

��0��1

��2

ℙ��0

ℙ��1

ℙ��2

Weighted sum

ℙ�ℎ���ℙ����

���

-1+1

a) b) c)

Figure 2: Outline of the proposed method: a) Some states of the generator are saved during GAN training (Gti andDti represent the states of of the generator and discriminator at training time ti ). b) these networks are used to form anew distribution PGhist rich in ’anomalous’ samples. c) We use this distribution as negative class for a classifier. DTV

(a) Empirical generateddistributions; color fromblue to red indicates theprogression of training.

(b) Comparison of Pdata andPGhist and profile of optimalD∗TV and trained DTV .

Figure 3: Method illustration on a toy one-dimensionaldataset of points sampled from a normal law.

In order to satisfy Hypothesis (4), we use momentumbased optimizers, so that PG oscillates around Pdata (seeFig. 3(a)), making the support of PGhist cover the one ofPdata better (see Fig. 3(b), where Pdata and PGhist are, re-spectively, plotted in black and red).

This empirically confirms the hypothesis in this toycase.

(a) Discriminator scoreoutput during training

(from blue to red).

(b) Average outputs ofdiscriminators versusoutput of a average

coefficient discriminator.

Figure 4: Justification of DTV initialization on the toyexample.

3.3 Training our anomaly detector DTV

As announced above, we will compute the Total Variationdistance between Pdata and PGhist . Let us explain how andwhy. Firstly, we recall the Total Variation distance defini-tion:

δ(P1,P2) = supA Borel subset

|P1(A)− P2(A)| , (5)

which represents the choice c(x, y) = 1x 6=y in the op-timal transport problem, as stated in [46] (where 1A de-notes the indicator function of a set A, as usual). As no-

4

Page 5: anomaly detection - arXiv.org e-Print archive

ticed by several authors (see, e.g., [3]), the topology in-duced by the Total Variation distance is stronger that theone induced by the Wasserstein-1 distance. Let us remarkthat δ(P1,P2) = 1

2‖P1 − P2‖TV , where ‖ · ‖TV denotesthe Total Variation norm. The Kantorovich duality yields:

2 δ(P1,P2) = sup−1≤D≤1

(Ex∼P1 [D(x)]− Ey∼P2 [D(y)]) .

(6)From this equation (6), we infer our ideal training objec-tive:

sup−1≤D≤1

Ex∼PGhist[D(x)]− Ex∼Pdata [D(x)] . (7)

Let us notice that the optimal stateD∗ in (6) is completelyunderstood: Paraphrasing [3], take µ = P1 − P2, whichis a signed measure, and (P,N) its Hahn decomposition(P = {dP1 > dP2}). Then, we can define D∗ := 1P −1N , we have −1 ≤ D∗ ≤ 1, and

Ex∼P1 [D∗(x)]− Ex∼P2 [D∗(x)] =

∫D∗dµ

= µ(P )− µ(N)

= ‖µ‖TV= 2 δ (P1,P2)

(8)

which closes the duality gap with the Kantorovitch opti-mal transport primal problem, hence the optimality of thedual variable D∗.

Now, we can approximate the Total Variation distancebetween Pdata and PGhist by optimizing (7) over DTV , ourneural network approximation of the dual variable D.

Several authors (e.g., [8]) have pointed out that theoutput of a discriminator obtained in the framework ofadversarial training is not fitted for anomaly detection.Nevertheless, notice that our discriminator DTV dealswith two fixed distributions, Pdata and PGhist . Here, thepurpose of computing Total Variation distance is onlyused to reach the optimal D∗TV in this well-posed prob-lem, assuming Hypothesis (4). DTV should converge toD∗TV = 1P − 1N where (P,N) is the Hahn decompo-sition of dPGhist − dPdata (see Fig. 3(b), blue curve). Im-portantly, we hope that thanks to the structure of the data(PGhist covering Pdata),DTV will be able to generalize highanomaly scores on unseen data. Again, this seems to holdtrue in our simple case: The orange curve in Fig. 3 keepsincreasing outside of supp(PGhist).

To avoid vanishing gradient issues, we enforce the’boundedness’ condition on DTV not by a tanh non-linearity (for instance), but by applying a smooth loss(weighted by λ > 0) to its output:

λ · d(DTV (x), [−1, 1])2 (9)

where d(v, [−1, 1]) denotes the distance of a real valuev ∈ R to the set [−1, 1].

Our final training loss, to be minimized, reads:

L(D) =Ex∼Pdata [D(x)]− Ex∼PGhist[D(x)] (10)

+ λEx∼

Pdata+PGhist2

[d(D(x), [−1, 1])2]

As proved in the Appendix, the optimal D for this problemis D∗ = D∗TV + ∆∗, with

∆∗(x) =dPGhist(x)− dPdata(x)

λ(dPGhist(x) + dPdata(x))(11)

and the minimum loss is

−2 ·δ(Pdata,PGhist)−1

∫(dPGhist(x)− dPdata(x))2

(dPGhist(x) + dPdata(x))dx.

(12)By letting λ −→ ∞, the second term in (10) becomesan infinite well regularization term which enforces −1 ≤D ≤ 1, approaching the solution of (7). This explainswhy the second term in (12) vanishes when λ −→ ∞.In practice, for better results, we allow a small trade-offbetween the two objectives with λ = 10.

From now on, we will use the output of DTV asanomaly score. In a nutshell, our method should fully dis-criminate regions where data points are more dense thansynthetic anomalous points from PGhist . This yields a fastfeed-forward anomaly detector that ideally assigns −1 tonormal data and 1 to anomalous data. Fig. 1 shows anexample.

Our method also has the advantage of staying true tothe training objective, not modifying it as in [33]. Indeed,the authors of [33] implement a similar method but usinga non-converged state of the generator as an anomaly gen-erator. In order to achieve this, they add a term to the logloss that prevents the model from converging all the way.

5

Page 6: anomaly detection - arXiv.org e-Print archive

4 Model architecture and imple-mentation details

In this section, the architecture and implementation ofeach of the three steps detailed in Section 3 is described.

4.1 Architecture description.G uses transpose convolution to upscale the random fea-tures, Leaky ReLU non-linearities and BatchNorm layers.D is a classic Convolutionnal Neural Network for classi-fication, that uses pooling downscaling, Leaky ReLU be-fore passing the obtained features through fully connectedlayers. Both G and D roughly have 5M parameters. DTV

has the same architecture as D.

4.2 Learning to generate training-like data.We first train until convergence of G and D accordingto the WGAN-GP objective of Section 3.1 for a total ofnepochs epochs, and save the network states at regular in-tervals (50 times per epoch). We optimize our objectiveloss using Adam optimizers, with decreasing learning rateinitially equal to 5 · 10−4.

4.3 Generator’s history probability distri-bution

As announced in the previous section, if the training pro-cess were to be continuous we would arbitrarily definePGhist by (3), that is, PGhist =

∫ nepochs

αc · Gt(PZ) · e−βtdt,

where c is a normalization constant that makes PGhist sumto 1. We avoid the first α epochs to avoid heavily biasingPGhist in favour of the initial random state of the generator.The exponential decay gives less importance to higher fi-delity samples at the end of the training. In practice, weapproximate PGhist by sampling data from PGt

where t isa random variable of density of probability:

c · 1[α,nepochs] · e−βt (13)

4.4 Training our anomaly detector DTV.DTV is also optimized using Adam algorithm. Since ithas the same architecture as the discriminator used in the

Approximate density of score distribution among eachclass of datasets (blue shades: MNIST, red shades:

KMNIST).

Red curve: Total Variationdistance minimisation.

Blue curve: AUPRC of themodel during training.

Histograms of anomalyscores for the test splits of

MNIST and KMNIST.

One sample from each class of MNIST and KMNIST.

Figure 5: Method trained on MNIST (normal) and evalu-ated on KMNIST (anomalous).

previous WGAN training, its weights, denoted as WDTV,

can be initialized as:

WDTV=

∫ nepochs

α

c ·WDt · e−βtdt, (14)

whereDt is the state of the WGAN discriminator at train-ing time t. Let us comment on the reason of such ini-tialization. Fig. 4(a) seems to indicate that averaging thediscriminators’ outputs is a good initialization. The ob-

6

Page 7: anomaly detection - arXiv.org e-Print archive

Figure 6: Method trained on MNIST (normal) and evalu-ated on modified MNIST images for different levels σ ofgaussian additive noise.

tained average is indeed shaped as a ’v’ centered on thereal distribution in our toy example. Fig. 4(b) empiricallyshows that the initialization of the discriminator with aver-age coefficients is somewhat close to the average of saiddiscriminators. As explored in [47], this Deep NetworkInterpolation is justified by the strong correlation of thedifferent states of a network during training. To furtherdiscuss how good is exactly this initialization, Fig. 7 (blueerror bars in the figure) shows a comparison of area underthe precision-recall curve (AUPRC) with other methodson the MNIST dataset. The x-axis indicates the MNISTdigit chosen as anomalous.

The following experimental cases are tested:

• Experimental case 1: The training explained aboveis implemented.

• Experimental case 2: The same process is applied,only modifying PGhist , corrupting half generated im-ages, by sampling the latent variable with a widerdistribution PZ′ :PGhist =

∫ nepochs

αc · Gt(PZ)+Gt(PZ′ )

2 · e−βtdt.The idea behind this is encouraging PGhist to spreadits mass further away from Pdata.

Fig. 1 was computed on experimental case 1 withnepochs = 10, α = 1 and β = 5. Fig. 5, 6 and 7 werecomputed with nepochs = 5, α = 1, β = 3.

5 Experimental Results and Discus-sion

This section presents quantitative and qualitative experi-mental results.

Our model behaves as one would expect when pre-sented normal images modified with increasing levels ofnoise, which is attributing an increasing anomaly score tothem. This is illustrated in Fig. 6 where a clear corre-lation is seen between high values of the standard devia-tion of the added Gaussian noise and high density of highanomaly scores.

As a sanity check, we take the final state of the gener-ator, trained on MNIST with (2) and [16] algorithm, andverify that our method is able to detect generated samplethat do not belong to the normal MNIST distribution. InFigure 8, we randomly select two latent variable (z1 andz2) which are confidently classified as normal, then lin-early interpolate all latent variables between them, givenby (1 − t)z1 + tz2,∀t ∈ [0, 1]. Finally, we evaluate theanomaly score of each generated image.

Finally we check the influence of the number of savesper epoch on the performance of the model. Figure 9 dis-plays the AUPRC of normal data (MNIST) against KM-NIST dataset for different values of saving frequency dur-ing WGAN training. For low values, the information car-ried by PGhist starts at a ’early stopping of GANs’ ([21])level, and gets richer as the number of saves per epochincreases; hence the increase in AUPRC. We do not havean explanation for the small decay in performance for bigvalues.

Fig. 7 compares six state-of-the-art anomaly detectionmethods with the presented method with both experimen-tal cases and with our DTV initialization (denoted asno training). Apart from a few digits, HistoryAD chal-lenges state of the art anomaly detection methods. Fig. 5shows the anomaly detection results when our methodwas trained on MNIST dataset and evaluated on KM-NIST. Notice that most of the histogram mass of normaland anomalous data is located around -1 and 1, respec-tively. This figure empirically proves the robustness ofthe method to anomalous data structurally close to train-ing data. On the other hand, Fig. 1 shows how wellthe method performs on structurally different data. Ourmethod was trained on Street View House Numbers [32],and reached high AUPRC results. Both the approximatedensity and the AUPRC comparison show that the pre-sented method is able to discriminate anomalous fromnormal data.

7

Page 8: anomaly detection - arXiv.org e-Print archive

Figure 7: Comparison of AUPRC with other methods (x-axis denotes the MNIST digit chosen as anomalous).

Figure 8: Method trained on MNIST and evaluatingscores on images generated from interpolated latent vari-ables (1− t)z1 + tz2, for t ∈ [0, 1].

6 Conclusions and future work

In this paper, we presented our new anomaly detection ap-proach, HistoryAD, and estimated its performance. Un-like many GAN-based methods, we do not try to invertthe generator’s mapping, but use the rich information ofthe whole training process, yielding an efficient and gen-eral anomaly detector. Further can be done in exploitingthe training process of GANs, for instance, using multiple

Figure 9: Influence of the number of checkpoints pertraining epoch

training histories to improve the adversarial complexity ofPGhist .

Appendix

The goal of this appendix is to obtain a solution of theminimization problem

minDL(D) (15)

8

Page 9: anomaly detection - arXiv.org e-Print archive

where L(D) is given by (10). Assuming that the probabil-ity distributions Pdata and PGhist admit densities dPdata(x)and dPGhist(x), respectively, the loss can be written as in-tegral of the point-wise loss l defined below in (17):

L(D) =

∫l(D(x))dx (16)

where

l(D(x)) =(dPdata(x)− dPGhist(x))D(x)

+ λdPdata(x) + dPGhist(x)

2d(D(x), [−1, 1])2.

(17)

Let us recall that D∗TV = 1P − 1N where (P,N) isthe Hahn decomposition of dPGhist − dPdata (therefore,sign(D∗TV ) = D∗TV ).

We notice that for all x and for all ε > 0,

l[(1− ε)D∗TV (x)] ≥ (dPdata − dPGhist)(x)(1− ε)D∗TV (x)(18)

> (dPdata − dPGhist)(x)D∗TV (x) (19)i.e. > l[D∗TV (x)] (20)

Indeed, inequality (18) comes from the positivity ofthe distance d(·, [−1, 1]). On the other hand, inequal-ity (19) comes from the definition of D∗TV . Indeed, ifdPdata(x) − dPGhist(x) < 0, then D∗TV (x) = 1; the othercase dPdata(x) − dPGhist(x) > 0 gives D∗TV (x) = −1.Either way, we obtain that (dPdata(x) − dPGhist(x))(1 −ε)D∗TV (x) > (dPdata(x) − dPGhist(x))D∗TV (x). Finally,inequality (20) is obtained from d(D∗TV (x), [−1, 1]) = 0.

We can always write a real function D as D = D∗TV +∆, where ∆ is a certain function. We just proved that ifsign(∆(x)) = −sign(D∗TV (x)) on a non-negligible set,then D cannot minimize (10), since D∗TV (x) achieveslower value than D(x) on this set.

Hence all minimizer D∗ of (10) must be of the formD∗(x) = (D∗TV + ∆)(x), where sign(∆) = sign(D∗TV )almost everywhere. We can now re-write the point-wise

loss formula (17) as

l(D(x)) = (dPdata(x)− dPGhist(x)) · (D∗TV (x) + ∆(x))

(21)

+ λdPdata(x) + dPGhist(x)

2∆(x)2(x) (22)

= −2 · δ(Pdata,PGhist) (23)

+

∫(dPdata − dPGhist) ·∆ + λ

dPdata + dPGhist

2∆2

(24)

Minimizing this point-wise second order equation in ∆,we obtain

∆∗(x) =dPGhist(x)− dPdata(x)

λ(dPGhist(x) + dPdata(x)). (25)

Finally, the minimum loss is

−2 ·δ(Pdata,PGhist)−1

∫(dPGhist(x)− dPdata(x))2

(dPGhist(x) + dPdata(x))dx.

(26)

References[1] Samet Akcay, Amir Atapour-Abarghouei, and Toby P

Breckon. Ganomaly: Semi-supervised anomaly detectionvia adversarial training. In Asian Conference on ComputerVision, pages 622–637. Springer, 2018. 2

[2] Jinwon An and Sungzoon Cho. Variational autoencoderbased anomaly detection using reconstruction probability.Special Lecture on IE, 2(1), 2015. 2

[3] Martin Arjovsky, Soumith Chintala, and Leon Bottou.Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.3, 5

[4] Paul Bergmann, Michael Fauser, David Sattlegger, andCarsten Steger. Mvtec ad – a comprehensive real-worlddataset for unsupervised anomaly detection. In The IEEEConference on Computer Vision and Pattern Recognition(CVPR), June 2019. 1, 2

[5] Raghavendra Chalapathy and Sanjay Chawla. Deep learn-ing for anomaly detection: A survey. arXiv preprintarXiv:1901.03407, 2019. 2

[6] Varun Chandola, Arindam Banerjee, and Vipin Kumar.Anomaly detection: A survey. ACM computing surveys(CSUR), 41(3):15, 2009. 2

9

Page 10: anomaly detection - arXiv.org e-Print archive

[7] Hyunsun Choi, Eric Jang, and Alexander A Alemi. Waic,but why? generative ensembles for robust anomaly detec-tion. arXiv preprint arXiv:1810.01392, 2018. 2

[8] Lucas Deecke, Robert Vandermeulen, Lukas Ruff, StephanMandt, and Marius Kloft. Image anomaly detection withgenerative adversarial networks. In Joint European Con-ference on Machine Learning and Knowledge Discoveryin Databases, pages 3–17. Springer, 2018. 2, 5

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L.Fei-Fei. ImageNet: A Large-Scale Hierarchical ImageDatabase. In CVPR09, 2009. 2

[10] Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell.Adversarial feature learning. CoRR, abs/1605.09782,2016. 2

[11] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson,Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition, pages1801–1810, 2019. 1

[12] Thibaud Ehret, Axel Davy, Jean-Michel Morel, and Mauri-cio Delbracio. Image anomalies: A review and synthesisof detection methods. Journal of Mathematical Imagingand Vision, pages 1–34, 2019. 2

[13] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha,Moussa Reda Mansour, Svetha Venkatesh, and Antonvan den Hengel. Memorizing normality to detect anomaly:Memory-augmented deep autoencoder for unsupervisedanomaly detection. In The IEEE International Conferenceon Computer Vision (ICCV), October 2019. 2

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, BingXu, David Warde-Farley, Sherjil Ozair, Aaron Courville,and Yoshua Bengio. Generative adversarial nets. In Adv inneural inf processing systems, pages 2672–2680, 2014. 2,3

[15] Benedicte Grosjean and Lionel Moisan. A-contrario de-tectability of spots in textured backgrounds. Journal ofMathematical Imaging and Vision, 33(3):313, 2009. 1

[16] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, VincentDumoulin, and Aaron C Courville. Improved training ofwasserstein gans. In Adv Neural Inf Proces Sys, pages5769–5779, 2017. 3, 7

[17] Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard.Anomaly detection with wasserstein gan. arXiv preprintarXiv:1812.02463, 2018. 2

[18] Dan Hendrycks, Mantas Mazeika, and Thomas G Diet-terich. Deep anomaly detection with outlier exposure.arXiv preprint arXiv:1812.04606, 2018. 2

[19] Minyoung Huh, Andrew Liu, Andrew Owens, andAlexei A Efros. Fighting fake news: Image splice detec-tion via learned self-consistency. In Proceedings of the

European Conference on Computer Vision (ECCV), pages101–117, 2018. 1, 2

[20] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detec-tion in video. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), June 2019. 1

[21] Volker Tresp Jindong Gu, Matthias Schubert. Semi-supervised outlier detection using a generative and adver-sary framework. E& T Int Journal Computer and Informa-tion Engineering, 12(10), 2018. 2, 7

[22] Durk P Kingma and Prafulla Dhariwal. Glow: Generativeflow with invertible 1x1 convolutions. In Adv. Neural In-formation Processing Systems, pages 10215–10224, 2018.2

[23] Diederik P Kingma and Max Welling. Auto-encoding vari-ational bayes. arXiv:1312.6114, 2013. 2

[24] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multi-ple layers of features from tiny images. Technical report,Citeseer, 2009. 2

[25] Yuening Li, Ninghao Liu, Jundong Li, Mengnan Du, andXia Hu. Deep structured cross-modal anomaly detection.arXiv preprint arXiv:1908.03848, 2019. 1

[26] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.Deep learning face attributes in the wild. In ICCV, 2015.2

[27] Weixin Luo, Wen Liu, and Shenghua Gao. A revisitof sparse coding based anomaly detection in stacked rnnframework. In Proceedings of the IEEE International Con-ference on Computer Vision, pages 341–349, 2017. 1

[28] Federico Di Mattia, Paolo Galeone, Michele De Simoni,and Emanuele Ghelfi. A survey on gans for anomaly de-tection. CoRR, abs/1906.11632, 2019. 2

[29] Gal Mishne and Israel Cohen. Multiscale anomaly detec-tion using diffusion maps. IEEE Journal of selected topicsin signal processing, 7(1):111–123, 2013. 1

[30] Romero Morais, Vuong Le, Truyen Tran, BudhadityaSaha, Moussa Mansour, and Svetha Venkatesh. Learningregularity in skeleton trajectories for anomaly detection invideos. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), June 2019. 1

[31] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Di-lan Gorur, and Balaji Lakshminarayanan. Do deep gener-ative models know what they don’t know? arXiv preprintarXiv:1810.09136, 2018. 2

[32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bis-sacco, Bo Wu, and Andrew Y Ng. Reading digits in naturalimages with unsupervised feature learning. In NIPS work-shop on deep learning and unsupervised feature learning,2011. 2, 7

10

Page 11: anomaly detection - arXiv.org e-Print archive

[33] Cuong Phuc Ngo, Amadeus Aristo Winarto, Con-nie Khor Li Kou, Sojeong Park, Farhan Akram, andHwee Kuan Lee. Fence GAN: towards better anomaly de-tection. CoRR, abs/1904.01209, 2019. 2, 5

[34] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detec-tion in video sequence with appearance-motion correspon-dence. In The IEEE International Conference on Com-puter Vision (ICCV), October 2019. 1

[35] T Nikoukhah, J Anger, T Ehret, M Colom, JM Morel, andR Grompone von Gioi. Jpeg grid detection based on thenumber of dct zeros and its application to automatic and lo-calized forgery detection. In CVPR, pages 110–118, 2019.1, 2

[36] Andreas Nussberger, Helmut Grabner, and Luc Van Gool.Robust aerial object tracking from an airborne plat-form. IEEE Aerospace and Electronic Systems Magazine,31(7):38–46, 2016. 1

[37] Marco AF Pimentel, David A Clifton, Lei Clifton, and Li-onel Tarassenko. A review of novelty detection. SignalProcessing, 99:215–249, 2014. 2

[38] Kristina Prokopetc and Adrien Bartoli. Slim (slit lamp im-age mosaicing): handling reflection artifacts. InternationalJournal of Computer Assisted Radiology and Surgery,pages 1–10, 2017. 1

[39] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lu-cio Marcenaro, Carlo Regazzoni, and Nicu Sebe. Abnor-mal event detection in videos using generative adversarialnets. In ICIP, pages 1577–1581. IEEE, 2017. 2

[40] Thomas Schlegl, Philipp Seebock, Sebastian M Waldstein,Ursula Schmidt-Erfurth, and Georg Langs. Unsupervisedanomaly detection with generative adversarial networks toguide marker discovery. In International Conference onInformation Processing in Medical Imaging, pages 146–157. Springer, 2017. 1, 2

[41] Thomas Schlegl, Philipp Seebock, Sebastian M. Wald-stein, Ursula Schmidt-Erfurth, and Georg Langs. Unsuper-vised anomaly detection with generative adversarial net-works to guide marker discovery. CoRR, abs/1703.05921,2017. 2

[42] Susan M Schweizer and Jose MF Moura. Hyperspectralimagery: Clutter adaptation in anomaly detection. IEEETrans. on Information Theory, 46(5):1855–1871, 2000. 2

[43] Joan Serra, David Alvarez, Vicenc Gomez, Olga Sli-zovskaia, Jose F Nunez, and Jordi Luque. Input complex-ity and out-of-distribution detection with likelihood-basedgenerative models. arXiv preprint arXiv:1909.11480,2019. 2

[44] Karim Tout, Florent Retraint, and Remi Cogranne. Auto-matic vision system for wheel surface inspection and mon-

itoring. In ASNT Annual Conference 2017, pages 207–216,2017. 1

[45] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt,Oriol Vinyals, Alex Graves, et al. Conditional image gen-eration with pixelcnn decoders. In Advances in neural in-formation processing systems, pages 4790–4798, 2016. 2

[46] Cedric Villani. Optimal transport: old and new, volume338. Springer Science & Business Media, 2008. 3, 4

[47] Xintao Wang, Ke Yu, Chao Dong, Xiaoou Tang, andChen Change Loy. Deep network interpolation for con-tinuous imagery effect transition. CoRR, abs/1811.10515,2018. 7

[48] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan.Mantra-net: Manipulation tracing network for detectionand localization of image forgeries with anomalous fea-tures. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), June 2019. 1, 2

[49] Houssam Zenati, Chuan Sheng Foo, Bruno Lecouat, Gau-rav Manek, and Vijay Ramaseshan Chandrasekhar. Ef-ficient gan-based anomaly detection. arXiv preprintarXiv:1802.06222, 2018. 2

[50] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu,Thomas H. Li, and Ge Li. Graph convolutional labelnoise cleaner: Train a plug-and-play action classifier foranomaly detection. In The IEEE Conference on ComputerVision and Pattern Recognition (CVPR), June 2019. 1, 2

[51] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry SDavis. Learning rich features for image manipulation de-tection. In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, pages 1053–1061,2018. 1

[52] Maria Zontak and Israel Cohen. Defect detection in pat-terned wafers using anisotropic kernels. Machine Visionand Applications, 21(2):129–141, 2010. 1

11