GEE: A Gradient-based Explainable Variational Autoencoder ... · Anomalies in network can arise due to attacks and threats in the cyberspace, such as different kinds of DDoS attacks

GEE A Gradient-based Explainable VariationalAutoencoder for Network Anomaly Detection

Quoc Phong NguyenNational University of Singapore

qphongcompnusedusg

Kian Hsiang LowNational University of Singapore

lowkhcompnusedusg

Kar Wai LimNational University of Singapore

karwailimnusedusg

Mun Choon ChanNational University of Singapore

chanmccompnusedusg

Dinil Mon DivakaranTrustwave

dinildivakarantrustwavecom

AbstractmdashThis paper looks into the problem of detectingnetwork anomalies by analyzing NetFlow records While manyprevious works have used statistical models and machine learningtechniques in a supervised way such solutions have the limitationsthat they require large amount of labeled data for training andare unlikely to detect zero-day attacks Existing anomaly detectionsolutions also do not provide an easy way to explain or identifyattacks in the anomalous traffic To address these limitationswe develop and present GEE a framework for detecting andexplaining anomalies in network traffic GEE comprises of twocomponents (i) Variational Autoencoder (VAE) mdash an unsuper-vised deep-learning technique for detecting anomalies and (ii) agradient-based fingerprinting technique for explaining anomaliesEvaluation of GEE on the recent UGR dataset demonstrates thatour approach is effective in detecting different anomalies as wellas identifying fingerprints that are good representations of thesevarious attacks

Index TermsmdashAnomaly Detection NetFlow Records Gradient-based Fingerprinting

I INTRODUCTION

Anomalies in network can arise due to attacks and threatsin the cyberspace such as different kinds of DDoS attacks(eg TCP SYN flooding DNS amplification attacks etc)brute force attempts botnet communications spam campaignnetworkport scans etc Network anomalies may also manifestdue to non-malicious causes such as faults in network mis-configurations BGP policy updates changes in user behaviorsetc Detection of anomalies possibly in real time is of utmostimportance in the context of network security in particular aswe are currently witnessing a continuous change in the threatlandscape leading to increase in the intensity sophisticationand types of attacks [1] This trend is expected to continueas the IoT market keeps expanding Indeed cyber criminalsnow target to harness more resources by exploiting IoT devicesthat are both likely to be much more vulnerable than typicalcomputers and huge in number (eg Mirai attack [2])

In this paper we look into the problem of detecting anoma-lies in large-scale networks like that of an Internet ServiceProvider (ISP) While the problem of network anomaly detec-tion has been of keen interest to the research community formany years now it still remains a challenge for a number ofreasons First the characteristics of network data depend on a

number of factors such as end-user behavior customer busi-nesses (eg banking retail) applications location time of theday and are expected to evolve with time Such diversity anddynamism limits the utility of rule-based detection systems

Next as capturing storing and processing raw traffic fromsuch high capacity networks is not practical Internet routerstoday have the capability to extract and export meta data suchas NetFlow records [3] With NetFlow the amount of infor-mation captured is brought down by orders of magnitude (incomparison to raw packet capture) not only because a NetFlowrecord represents meta data of a set of related packets but alsobecause NetFlow records are often generated from sampledpackets Yet NetFlow records collected from a modest edgerouter with 24x10Gbps links for a 5-minute interval can easilyreach a few GBs However with NetFlow useful informationsuch as suspicious keywords in payloads TCP state transitionsTLS connection handshake etc are lost indeed with sampledNetFlow even sizes of each packet time between consecutivepackets etc are unavailable Therefore anomaly detectionsolutions have to deal with lossy information

Finally the SOC (security operation center) analysts have alimited budget within which they have to analyze the alertsraised by an anomaly detector for purposes such as alertescalation threat and attack mitigation intelligence gatheringforensic analysis etc Therefore anomaly detectors should gobeyond merely indicating the presence of anomalies the timeof anomaly its type and the corresponding set of suspiciousflows are in particular very useful for further analysis Ingeneral the more the information that can be passed (alongwith the alerts) to the analysts the easier is the task of analysisand quicker the decision process

One way to address the above challenges is to apply sta-tistical models and machine learning algorithms Consideringanomaly detection as a binary classification problem a super-vised machine learning model can be built using normal andanomalous data for classifying anomalies However existingapproaches have the following limitations First many of themexploit only a small number of features (eg traffic volumeflow rates or entropy) [4] [5] [6] Such approaches requirethe users to apply domain knowledge to select the right set

of features which may not always be feasible and optimalSecond supervised approaches require large sets of data withground truth for training Note that as the network character-istics and attacks evolve models have to be retrained and thecorresponding labeled datasets have to be made available Thisrequires costly and laborious manual efforts and yet giventhe size of traffic flowing through backbone network it ishighly impractical to assume all data records to be correctlylabeled [7] Besides supervised approaches are unlikely todetect unknown and zero-day attack traffic

To address these limitations in this paper we developand present GEE a gradient-based explainable variationalautoencoder for detecting as well as explaining anomalies inlarge-scale networks GEE comprises of two important com-ponents (i) a variational autoencoder (VAE) an unsuperviseddeep-learning technique for detecting anomalies in networktraffic and (ii) a gradient-based fingerprinting technique forexplaining threats and attacks in anomalous traffic GEE ad-dresses the limitations of existing approaches in the followingway First modeling with VAE allows the use of a largenumber of features from the dataset thus relieving the need toemploy domain knowledge to select the ldquorightrdquo set of featuresSecond GEE works with unlabeled NetFlow traces which isthe norm in ISP networks for data collection Finally GEEprovides explanation as to why an anomaly is detected

To summarize the contributions of this paper are as followbull We present a VAE based framework for network anomaly

detection that is scalable in both the size of data and thefeature dimension

bull We develop a gradient-based framework to explain the de-tected anomalies and identify the main features that causethe anomalies This is of great impact in practice as deeplearning techniques are notorious for non-interpretability

bull Our framework GEE makes it possible to identify networkattacks using gradient-based fingerprints generated by theVAE algorithm To the best of our knowledge this is thefirst attempt to explain network anomalies using gradientsgenerated from a VAE model

Evaluation using the UGR dataset [8] shows that GEE iseffective in identifying the network attacks present such asSpam Botnet (low-rate) DoS and port scans Using GEEwe can identify the features that define these attacks and thefingerprints derived (when ground truth is available) can beutilized to further improve the detection accuracy

We provide an overview of the deep learning model in Sec-tions III and in Section IV we present our anomaly detectionframework GEE Finally we evaluate GEE in Section V

II RELATED WORK

Detection of network anomalies (such as threats intrusionsattacks etc) is a widely and deeply studied research prob-lem Past works have developed solutions based on varyingapproches for example rule-based systems [9] informationtheoretic techniques [10] [11] signal analysis [12] statisticalmodels and hypothesis testing [13] [14] as well as data mining

and machine learning algorithms [15] [16] As computationalresources becoming cheaper there has been an increasinginterest in applying machine learning techniques for detectinganomalies in network In the following we present an overviewof machine learning techniques applied on NetFlow data (orother forms of aggregate data usually collected from routers inISP networks) for the purpose of detecting anomalies

PCA (Principal Component Analysis) has been used to sep-arate traffic space into two different subspaces (lsquonormalrsquo andlsquoanomalousrsquo) for anomaly detection Aggregate byte counts oflinks in an ISP network was initially used as a feature [17]and in Lakhina et al [4] the solution was extended to workon more granular data An inherent disadvantage of PCA isthat the components in the reduced subspace do not providean interpretation of features in the original space Besides aslater studies have shown [18] [19] there are other challengesan important one being that PCA-based detection solution ishighly sensitive to the dimension of normal subspace and thedetection threshold both of which are not easily determined

Another approach to detect anomalies is to first model thenormal traffic and then use a statistical decision theoreticframework to detect deviation from normal data For exampleSimmross-Wattenberg et al [13] used α-stable distributionsto model 30-minute time bins of aggregate traffic rate andgeneralized likelihood ratio test for classifying whether thetime windows are anomalous Since a number of parametershave to be learned large amounts of labeled data have to becontinuously made available for the model to adapt with time

In Bilge et al [16] a Random Forests classifier was trainedto detect CampC servers in a supervised way using features ex-tracted from labeled NetFlow records To reduce false positiverate however the system relies on reputation systems such asmalicious domain classification system Google Safe Browsingetc This affects detection accuracy since most recent botnetsand evasive botnets may not have a low reputation score

More recently deep neural network models (also generallyreferred to as deep learning models) have been embraced bythe research community to tackle anomaly detection problemin networking setting Existing supervised deep learning ap-proaches include work by Tang et al [20] that utilizes a classi-cal deep neural network for flow-based anomalies classificationin a Software Defined Networking (SDN) environment and useof the recurrent neural network (RNN) models for developingintrusion detection solution [21]

There are recent works that use unsupervised deep learningmodels to transform the data into lower rank features before ap-plying supervised machine learning Several prior approachesfirst employ autoencoders [22] and their variants to extract thecompressed latent representation as features and subsequentlyuse these features for anomaly detection by training standardclassifiers such as Random Forests [23]

Anomaly detection methods that are solely based on unsu-pervised deep learning models have also been experimentedThese models do not require labeled information and insteadexploit the fact that anomalous behaviors tend to differ greatly

from the standard or normal behavior of the network Fioreet al [24] made use of discriminative restricted Boltzmannmachine (RBM) for anomaly detection on network data whileMirsky et al [25] proposed an ensembles of light-weight au-toencoders for real time network intrusion detection althoughtheir focus is on scalability of the system Further An andCho [26] demonstrated that the VAE performs much better thanAE and PCA on handwritten digit recognition and networkintrusion detection However the VAE model was trained usingdata labeled as normal ie the anomalies are removed fromtraining which is difficult to do in practice The above [24][25] [26] are also known as semi-supervised learning

Different from the above we develop an anomaly detectorusing an unsupervised deep learning technique without usinglabeled information While existing works (eg [26]) stop atthe task of anomaly detection in our work we also providea gradient-based technique for explaining why the anomaliesare detected together with their relevant features

III UNSUPERVISED DEEP LEARNING MODELS

Given large amount of data being constantly collected by thenetwork routers existing anomaly detection frameworks tendto employ simple methods such as threshold-based approachesor simply PCA for scalability reason Within the past decadehowever we see a rise of application of deep learning modelsdue to their ability to handle big datasets as well as to train real-time in a streaming manner This is while retaining their state-of-the-art performance in various tasks like real time objectrecognition [27] and fraud detection [28]

Additionally deep learning models like the variational au-toencoder (VAE) are shown to be robust to noisy data [29]and thus especially suitable for modeling network flows whichare very noisy in nature Although deep learning models areoften criticized for their lack of interpretability recent advanceshave brought forward better understanding of these models inparticular attributing the causes to the results [30]

We focus on the VAE a probabilistic generalization of theAE for anomaly detection on network data Note that theVAE has been shown to be more flexible and robust [26]compared to the AE Further we demonstrate how we can usegradient information from the VAE for interpretation purposeFor instance it can be used to analyze how a certain set offeatures is more likely to explain a certain anomaly In thefollowing we first describe the AE model since the VAE hasthe same deep architecture as the AE

A Autoencoder (AE)

An AE is made of three main layers which correspondto (i) the input layer to take in the features (ii) the latentrepresentation layer of the features and (iii) the output layerwhich is the reconstruction of the features The AE consistsof two parts called encoder and decoder respectively Theencoder maps the input into its latent representation whilethe decoder attempts to reconstruct the features back fromthe latent representation The encoder may be deep in thesense that information from the input is passed through several

inputlayer outputlayer

latentrepresentation

times features

times 512 times 512

times 1024

times 512times 512

times features

encoder decoder

times 1024

times 100

Fig 1 The autoencoder (AE and VAE) architecture

mappings (and hidden layers) similar to the deep architecturein a supervised deep learning model likewise for the decoderIn this paper we set the size of the latent representationlayer to be 100 In addition the encoder and the decodereach have three hidden layers with size 512 512 and 1024respectively as illustrated in Fig 1 In the figure nodes thatare shaded represent the observed data (used as both inputsand outputs) while the unshaded nodes are unobserved latentvariables corresponding to the hidden layers The exact size ordimension of the layers is shown above the nodes

The links between the layers show how the values of the nextlayer can be computed Commonly the value of one hiddenlayer ~hi can be computed as

~hi = g(Wi~himinus1 +~bi) (1)

where ~himinus1 is a vector of values for the previous layer Wi isa matrix of weights that signifies the relationship from theprevious layer and ~bi is a vector of bias terms Both Wi

and ~bi are parameters to be learned through model trainingHere g() is known as the activation function that transformsthe computation in a non-linear way and allows complexrelationship to be learned Popularly used activation functionsinclude the sigmoid function g(x) = (1 + eminusx)minus1 and therectified linear unit (ReLU) g(x) = max(0 x) which will beused in this paper The learning of the parameters are generallyachieved by minimizing the reconstruction errors (eg meansquare errors) via backpropagation with random initializationand can be optimized with a variety of optimizers such asstochastic gradient descent We refer the readers to Bengio [31]and references therein for details on optimization

In essence we can view the AE as a deterministic model thatmaps a set of inputs (features in our case) into their reconstruc-tion This is in contrast to the generative model variants of deepneural network such as the VAE and generative adversarialnetwork (GAN) which is capable of generating new data basedon the distribution of the training data

B Variational Autoencoder (VAE)

Unlike AE that deterministically encodes the inputs intotheir latent representation and subsequently produce a recon-struction the VAE [32] is a generative model that treats the la-tent representation layer as random variables conditional on theinputs Although the encoder and decoder in the VAE follows

the same computation model as the AE as in Equation (1) theencoding process is instead used to compute the parameters forthe conditional distributions of the latent representation Theparameters can then be used to generate or sample the latentrepresentation for decoding The conditional distributions aregenerally assumed to be Gaussian for real-valued nodes Forexample denote ~zi as the value of the latent representationlayer then it can be written as

~zi sim N (~microi diag(~σ2i )) (2)

where diag denotes a function that transforms a vector into adiagonal matrix and ~microi and ~σ2

i are the mean and variance forthe conditional Gaussian distribution obtained from the outputof the encoder

~microi = Wmicroi~himinus1 +~b

microi (3)

log ~σ2i = Wσ

i~himinus1 +~b

σi (4)

The parameters Wmicroi Wσ

i ~bmicroi and ~bσi are interpreted in thesame way as in the autoencoder above Treatment on the hiddenlayers is identical to that of autoencoder

The probabilistic nature of the VAE also means that wecannot simply employ the usual learning algorithm on standardobjective function (eg mean square error) to train the modelInstead a class of approximate statistical inference method areused which is called the Variational Bayes (thus gives rise tothe name VAE) As the discussion on the inference methods arerather technically involved we refer the interested readers toKingma and Welling [32] for details Put simply an alternativeobjective function known as the variational lower bound isoptimized and stochastic sampling is used for approximation

In terms of architecture VAE is similar to AE and isillustrated in Fig 1 The ReLU activation function is used bythe encoder and the decoder in all of the intermediate layersand the linear activation g(x) = x will be used for the output

IV GEE ANOMALY DETECTION FRAMEWORK

Our anomaly detection framework GEE consists of thefollowing main steps First important information from theNetFlow data such as the average packet size entropy ofdestination ports etc are extracted to obtain a set of featuresThen the features are fed into the VAE for learning thenormal behaviour of the network Subsequently anomaliesare identified based on their reconstruction errors Finally wedescribe how we can employ the gradient information availablefrom the VAE for explanation and for fingerprinting attacks

A Feature Extraction

A NetFlow record is a set of packets that has the samefive-tuple of source and destination IP addresses source anddestination ports and protocol In addition to the abovesome of the important fields that are commonly available inNetFlow records are start time of the flow (based on the firstsampled packet) duration number of packets number of bytesand TCP flag We group the NetFlow records into 3-minutesliding windows based on the source IP addresses to form

aggregated features This means that each data point in thispaper corresponds to the network statistics of a single sourceIP address within a 3-minute period Note that such extractionallows us to identify the offending IP address and also thetime window an anomaly belongs to which are important forfurther analysis and decision making The period of 3 minutesis chosen to balance between the practicality and quality of theaggregated statistics where the statistics will be insignificant ifthe period is too short while using a long time window meanswe cannot perform real time analysis

Overall we extract 53 aggregated features which includebull mean and standard deviation of flow durations number

of packets number of bytes packet rate and byte ratebull entropy of protocol type destination IP addresses source

ports destination ports and TCP flags andbull proportion of ports used for common applications (eg

WinRPC Telnet DNS SSH HTTP FTP and POP3)To ensure that meaningful statistics are captured data pointthat contains too few flows (less than 10 in this case) areremoved from the dataset This reduces noise in the trainingdata Finally the statistics are either scaled to between 0 and 1or normalized into Z-score [33] as input features for the VAE

B Unsupervised Learning of VAE

Training algorithm of the VAE is implemented using Ten-sorFlow [34] which provides powerful tools of automaticdifferentiation and comes with built-in optimization routinesAs pointed out in Section III the training of the VAE is fairlycomplex and beyond the scope of this paper so we provideonly a brief outline here Before starting the training procedurethe parameters in the VAE are randomly initialized This sub-sequently allows us to perform a forward pass on the encoderby computing the distribution of the latent representation layervia Equation (2) With this several samples can be generatedfrom the Gaussian distribution which are used to compute thevariational lower bound which consists of a KL divergenceterm and an expectation term

L = minusDKL[q(~z | ~x) || p(~z)] + Eq[log p(~x |~z)] (5)

where ~z is the latent representation of the input features ~xHere the distribution p(middot) corresponds to the Gaussian priorand conditional distribution of the VAE model while q(middot) isa variational approximation [35] of p(middot) generally chosen tobe Gaussian as well Refer to Kingma and Welling [32] fordetails Fortunately this objective function can be maximizedwith stochastic optimization techniques since the gradients arereadily available via automatic differentiation [36]

Here we employ Adam [37] as the optimization algorithmwhich enables training in minibatches Generally real-timetraining can be achieved by choosing a small minibatch sizeand discarding the data after one epoch We like to highlightthat label information is not used at all during training

C Anomaly Detection

Once the parameters are optimized after training the VAEmodel is used for anomaly detection where an IP address

and its time window is recognized as abnormal when thereconstruction error of its input features is high Here thereconstruction error is the mean square difference between theobserved features and the expectation of their reconstructionas given by the VAE A high reconstruction error is generallyobserved when the network behavior differs greatly from thenormal behavior that was learned by the VAE The thresholdis usually selected such that we treat a small percentage (say5) of the data as anomalies Otherwise we can make useof the labeled information to select a threshold to maximizethe detection rate while having small false positive rate Anillustration of how this can be done is suggested in thenext section with the aid of Fig 2 Note the anomalies areassociated with unusual network behaviours from a particularsource IP address and may not necessarily be malicious

D Gradient-based Explanation for Anomalies

While most existing anomaly detection works in the liter-ature consider only the evaluation on detection accuracy wego beyond and provide an explanation on why a data point isflagged as abnormal This is significant since it challenges thepopular belief that a deep learning model functions as a blackbox that cannot be interpreted a VAE model can in fact beused to explain why an IP address is treated as an anomalyThis is done by analyzing the gradients lsquocontributedrsquo by eachfeature of the data point which is obtainable from the VAEthrough automatic differentiation in TensorFlow

The key question to ask that leads to our approach isHow does the VAErsquos objective function vary if a feature inthe anomaly data point increases or decreases by a smallamount Intuitively given the trained VAE and an anomalousdata point if the function (reconstruction error) changes quitea lot when a particular feature of the anomalous data pointis varied by a small amount then this feature at its currentvalue is significantly abnormal since it would like to perturbthe VAE model (through optimization) to fit itself better

Gradients or more technically the derivative of the varia-tional lower bound partLpartfij are computed for each featurefij from each data point i Two applications of the gradientcan be immediately derived Firstly even without having theground truth labels the flagged anomalies can be clusteredbased on their gradients into groups that share similar behaviormaking it easier for analysts to investigate Secondly if wehave the labeled information on certain types of attacks thenwe can derive gradient-based fingerprints that associate withthe attacks These fingerprints can be used to identify specificattacks from another day Of course the anomalies that areidentified through the fingerprints are more accurate since la-beled information was indirectly used in a way similar to semi-supervised learning The anomalies are detected through the L2distance computed from the normalized gradient vectors Therationale of using such formulae is presented next

V DATASET AND EVALUATION

For evaluation we use the recently published UGR16dataset [8] which contains anonymized NetFlow traces cap-

TABLE I Volume of the NetFlow records (in thousands)

Date (2016) Total DoS Bnet Sc11 Sc44 Spam Blist

Training SetMar 19 (Sat) 110M - - - - 795 352Jul 30 (Sat) 110M 779 152 98 373 - 293

Test SetMar 18 (Fri) 40M - - - - 13 194

Mar 20 (Sun) 110M - - - - 795 352July 31 (Sun) 105M 784 152 72 369 - 225

tured from a real network of a Tier 3 ISP The ISP providescloud services and is used by many client companies ofdifferent sizes and markets The UGR trace is a fairly recentand large-scale data trace that contains real background trafficfrom a wide range of Internet users rather than specific trafficpatterns from synthetically generated data (eg DARPArsquo98 andDARPArsquo99 [38] UNB ISCX 2012 [39] UNSW-NB15 [40]CTU13 [41]) Another publicly available Internet traffic datais from the MAWI Working Group [42] but the labeled dataconsists of only 15-minute of traffic per day On the other handUGR contains traffic for the whole day over a 4-month periodFurthermore UGR attack traffic data is a mixture of generatedattacks labeled real attacks and botnet attacks from controlledenvironment Specifically the labeled attacks consist of

bull Low-rate DoS TCP SYN packets are sent to victims withpacket of size 1280 bits and of rate 100 packetss to port80 The rate of the attack is sufficiently low such that thenormal operation of the network is not affected

bull Port scanning a continuous SYN scanning to commonports of victims There are two kinds of scanning one-to-one scan attack (Scan11) and four-to-four (Scan44)

bull Botnet a simulated botnet traffic obtained from the ex-ecution of the Neris malware This data comes from theCTU13 trace [41]

bull Spam peaks of SMTP traffic forming a spam campaignbull Blacklist flows with IP addresses published in the public

blacklists As emphasized in UGR [8] not all traffic flowsinvolving blacklisted IP addresses are related to attacksHowever we include it for completeness

Other attack labels that are available in the dataset are ignoreddue to their low number of occurrence as they appear inless than 10 flows in total Also we like to caution that thebackground traffic should not be treated as fully free of attackssince it is likely that some attacks have avoided detection

We select a total of five days of UGR data for our exper-iments Two Saturdays are used as training data while threeother days on Friday and Sundays are chosen for testing Thestatistics for the data are presented in Table I NetFlow recordswithout any labels are the background data Note that the dataon March 18 is collected from around 10am thus smaller

After applying feature extraction as discussed in Section IVwe obtain a training dataset of 5990295 data points Thedata are trained via stochastic optimization with 50 epochs andminibatches of size 300 The weight decay is set to 001 Abrief outline of the training procedure was given in Section IV

0 20 40 60Reconstruction error

05 backgroundspam

(a) Spam

04 backgroundnerisbotnet

(b) Botnet

04 backgrounddos

(c) DoS

0 25 50Reconstruction error

04 backgroundscan11

(d) Scan11

04 backgroundscan44

(e) Scan44

05 backgroundblacklist

(f) Blacklist

Fig 2 Distribution for reconstruction error on training data

For the test set we processed a total of 1957711 data points onMarch 18 2954983 data points on March 20 and 2878422data points on July 31 For the purpose of evaluation we saythat a data point belongs to an attack type if more than halfof the flows are labeled with such attack within the 3-minuteaggregation otherwise it is treated as benign data

We present the distribution of the reconstruction errors forthe training data on Fig 2 The ground truth labels are usedto separate the anomalies from the background flows whichlet us examine whether the anomalies behave differently fromthe normal behaviors Overall there is some overlap with thebackground for spam botnet DoS and scanning activitiesbut we can find a cut off point to roughly separate them Forblacklists however their behaviors are indistinguishable fromthe background traffic

A Baseline

We compare our proposed anomaly detection frameworkGEE that uses VAE against that of AE and also a GaussianBased Thresholding (GBT) approach For a fair comparisonthe baseline AE shares the same architecture as the VAEand as illustrated in Fig 1 The AE is implemented usingKeras a high level open source neural network library Weuse the same TensorFlow backend for training the AE Thisis by minimizing the reconstruction error (mean square error)

00 05 10False positive rate

AEVAEGBT

(a) Spam

AEVAEGBT

(b) Botnet

AEVAEGBT

(c) DoS

AEVAEGBT

(d) Scan11

AEVAEGBT

(e) Scan44

AEVAEGBT

(f) Blacklist

Fig 3 ROC for VAE AE and GBT on training data

using the stochastic optimizer Adam [37] with minibatch sizechosen to be 256 Similar to the VAE data points that havelarge reconstruction error are flagged as anomalies

For the GBT we fit independent but non-identical Gaus-sian distribution models to the features to learn the standardbehaviors of the data Then we compute the Z-score for allfeatures in the testing dataset and use the product of theaverage standard deviation and the maximum of the Z-scoresas final score for anomaly detection The data point with scorethat exceeds a certain threshold is considered as anomaly Weemphasize that both the AE and GBT are trained on the sameextracted features as in VAE this is to ensure the comparisonis performed over the models rather than feature engineering

B ROC Performance on Anomaly Detection

We evaluate the performance of these three methods (VAEAE and GBT) using Receiver Operating Characteristic (ROC)curves where the true positive rates are plotted against thefalse positive rates by varying the sensitivity or threshold foranomaly detection A good performance is where the truepositive is high while the false positive is low

We plot the ROC results on the training dataset in Fig 3 andon the testing dataset in Fig 5 The results on the training dataare presented because unlike supervised learning the labeleddata is not used for training hence the results indicate howgood the algorithms are in finding anomalies from statistics

02 00 02 04anomaly-spam

Entropy of dest IPEntropy of dport

SMTP dportHTTP dport

Average durationAverage no of pkt

Std no of pktAverage pkt rate

Average no of bytesAverage byte rate

Std byte rateEntropy of flag

IPv6 protocolIPIP protocol

WinRPC sportWinRPC dport

FTP dport02 00 02

scan1102 00 02

scan44

Fig 4 Normalized gradients of spam scan11 and scan44 on selected features

AEVAEGBT

(a) Spam on March 20

AEVAEGBT

(b) Botnet on July 31

AEVAEGBT

(c) DoS on July 31

AEVAEGBT

(d) Scan44 on July 31

Fig 5 ROC for VAE AE and GBT on test dataset

of the same day On the other hand results for the testingdata evaluate on how the anomaly detection generalizes to anew day based on statistics from another day in the trainingdata For reference we include the results for blacklistedIP addresses which clearly cannot be identified through thestatistics from the NetFlow records see Fig 3f

From the results we make the following observationsFirst threshold based approaches such as GBT work well forattacks that increase the volume of certain categories of trafficsignificantly such as spam and port scanning However suchapproaches do not work as well for botnet and low rate DoSOn the other hand AE does not work well for spam andlow rate DoS For these two attacks AE may be unable todifferentiate spam from regular email traffic because of thehigh volume and unable to detect DoS due to the low rateOverall VAE is the most robust and has good performanceon all attack types Table II summarizes the individual andaverage area under the ROC curve (AUC) for various attacks(except blacklist) and detection models Clearly VAE has the

TABLE II Area under the curve (AUC) of the ROC

Model DoS BotNet Scan11 Scan44 Spam Average

TrainGBT 0892 0735 0989 0989 0961 0913

AE 0917 0931 0965 0948 0741 0900VAE 0986 0940 0966 0961 0927 0956Test

GBT 0875 0722 0980 0985 0970 0906AE 0898 0915 0961 0945 0791 0902

VAE 0983 0936 0949 0958 0908 0947Fprint 0980 0914 0976 0977 0965 0962

best overall performance in terms of AUC Note that the AUCresults on the test set are aggregated over the different daysie spam attacks are from March 18 and 20 while DoS botnetand port scanning attacks are originated from July 31

C Identifying and Using Gradient Fingerprints

In this section we compute and identify the gradients forall the features for various attacks Fig 4 shows the gradientsof the features for spam scan11 and scan44 Recall that 53features are used as input to VAE and significant features withsufficient variation in gradients are shown in the figure Theblack bars reflect one standard error for the gradients whichare useful for assessing the significance of the gradients iewhether it is due to noise or not We also present the featuresthat are not significant for contrast Note that similar gradient-based fingerprints are available for the other labeled attacksnot shown here due to space

Based on the results we can make the following observa-tions First only a small subset of the features have largegradients Second these features with the greatest absolutegradients provide an explanation for why these flows of an IPare detected as anomalies For example in the case of spamattacks (which includes sending spam emails) five featureshave more positive gradient (higher than the learned normal)while four features have much negative gradient (lower thanlearned normal) Next these combination of gradient andfeatures can be used as a fingerprint to identify or cluster

0 1False positive rate

(c) DoS on July 31

Fig 6 ROC for anomaly detection using fingerprints

similar attacks For example it can be observed from Fig 4that Scan11 and Scan44 have similar gradient fingerprints

To further validate our claim that these gradient fingerprintsare useful in identifying similar attacks we plot the ROC forvarious attacks detected in the following way First let L2

denote the Euclidean distance between the average gradientfingerprint obtained from data with labeled attacks and thegradients of the VAErsquos objective function wrt each test datapoint Similarly define L2

n as the same distance metric butcomputed on normalized gradient vectors The ROC is thenproduced by varying the threshold on L2 or L2

n distance If thegradient fingerprint is a good measure of an attack we expectthe ROC to have high true positive and low false positive

The results are shown in Fig 6 It is encouraging to seethat with the use of L2

n the gradient fingerprints learned areindeed good representations of these attacks In fact it is animprovement upon using the reconstruction error for anomalydetection see the last row on Table II for the its AUC We alsonote that the result is slightly worse for botnet see Fig 6b Thiscould be because there are many patterns behind the botnetanomalies Hence simply averaging all normalized gradientsfor botnet might not be the best approach

Finally we may also conclude that we have discovered anew attack if an unseen gradient fingerprint is identified

D Clustering of Anomalies

Another possible use of the VAE generated gradients is thatthey can be used to cluster or group the different attacks Theidea is that if the clustering is effective attacks should belimited to a relatively small number of clusters

We performed k-mean clustering with random initial seed onthe training dataset with k = 100 Fig 7 illustrates the clustersassociated with the attacks discussed above We find that 924of the DoS attacks appear only in two clusters (c82 and c84)with the other 76 appearing in four other clusters For spam

c15269

11 others

(a) Spam

c82360c84 564

4 others

(b) DoS

c39278

3 others

(c) Scan11

c39116

c56116

6 others

(d) Scan44

Fig 7 Distribution of clusters for each attack type

attacks 743 of them appear in two clusters (c11 and c15)while 257 appears in another 11 clusters This suggests thatthe attacks generally exhibit small number of main behavioralpatterns and thus the analysts can focus on a small subset ofclusters to study a particular type of attacks

E Processing Overhead of VAE

Training of the VAE is performed on a single graphicalprocessing unit (GPU) GTX 1080Ti with Intel Xeon CPU E5-2683 (16 cores 21GHz) and 256Gb of RAM which enablesreal time online training on streaming data Since the featuresare aggregated into a 3-minute sliding window the runningstatistics from the NetFlow records are kept for three minutesbefore discarded On average feature extraction took about 25seconds for every 3-minute window of the UGR data whichcorresponds to about 200k NetFlow records Note that ourimplementation did not optimize for such operation and speedimprovement can easily be achieved

Training using the TensorFlow framework is highly paral-lelizable and thus runs very quickly For every minibatch of300 data points of aggregated features only 10ms is neededfor training while testing the data points for abnormalitytakes about 20ms Hence almost all the processing is due tothe feature extraction With the current setup even withoutspeeding up the feature extraction further our algorithm caneasily perform anomaly detection in real-time

VI CONCLUSION

We proposed GEE a VAE-based framework that is robust indetecting a variety of network based anomalies We observedthat the VAE performs better overall compared to the AEand a Gaussian based thresholding (GBT) method Furtherwe demonstrated how to use the gradient information from theVAE to provide an explanation for the flagged anomalies Alsothe gradient-based fingerprints when used directly for anomalydetection was shown to achieve an overall better performance

A potential future research is to use the conditional VAE(CVAE) [43] for anomaly detection The CVAE was developedby the deep learning community as an extension of the VAEto allow for additional auxiliary labels that are available WithCVAE one can consider training an anomaly detection systemfrom multiple data sources that have different behaviors

ACKNOWLEDGMENT

This research is supported by the National Research Foun-dation Prime Ministerrsquos Office Singapore under its CorporateLaboratoryUniversity Scheme National University of Singa-pore and Singapore Telecommunications Ltd

REFERENCES

[1] Symantec ldquoInternet Security Threat Reportrdquo 2018 httpswwwsymanteccomcontentdamsymantecdocsreportsistr-23-2018-enpdfaccessed July 2018

[2] M Antonakakis T April M Bailey M Bernhard E Bursztein et alldquoUnderstanding the Mirai botnetrdquo in Proc 26th USENIX SecuritySymposium 2017 pp 1093ndash1110

[3] B Claise B Trammell and P Aitken ldquoSpecification of the IPflow information export (IPFIX) protocol for the exchange of flowinformationrdquo Internet Requests for Comments RFC Editor STD 772013 [Online] Available httpwwwrfc-editororgrfcrfc7011txt

[4] A Lakhina M Crovella and C Diot ldquoCharacterization of network-wideanomalies in traffic flowsrdquo in Proc ACM SIGCOMM Conf on InternetMeasurement (IMC) 2004 pp 201ndash206

[5] F Silveira C Diot N Taft and R Govindan ldquoASTUTE Detecting adifferent class of traffic anomaliesrdquo in Proc ACM SIGCOMM 2010 pp267ndash278

[6] J Wang and I C Paschalidis ldquoStatistical traffic anomaly detectionin time-varying communication networksrdquo IEEE Trans on Control ofNetwork Systems vol 2 no 2 pp 100ndash111 2015

[7] R Sommer and V Paxson ldquoOutside the closed world On using machinelearning for network intrusion detectionrdquo in Proc IEEE Symposium onSecurity and Privacy 2010 pp 305ndash316

[8] G Macia-Fernandez J Camacho R Magan-Carrion P Garcıa-Teodoroand R Theron ldquoUGRrsquo16 A new dataset for the evaluation ofcyclostationarity-based network IDSsrdquo Computers amp Security vol 73pp 411ndash424 2018

[9] N Duffield P Haffner B Krishnamurthy and H Ringberg ldquoRule-basedanomaly detection on IP flowsrdquo in Proc IEEE INFOCOM 2009 pp424ndash432

[10] Y Gu A McCallum and D Towsley ldquoDetecting anomalies in networktraffic using maximum entropy estimationrdquo in Proc ACM IMC 2005pp 345ndash350

[11] G Nychis V Sekar D G Andersen H Kim and H Zhang ldquoAnempirical evaluation of entropy-based traffic anomaly detectionrdquo in ProcACM IMC 2008 pp 151ndash156

[12] P Barford J Kline D Plonka and A Ron ldquoA signal analysis ofnetwork traffic anomaliesrdquo in Proc ACM SIGCOMM Workshop onInternet Measurement (IMW) 2002 pp 71ndash82

[13] F Simmross-Wattenberg J I Asensio-Perez P Casaseca-de-la Higueraet al ldquoAnomaly detection in network traffic based on statistical inferenceand alpha-stable modelingrdquo IEEE Trans on Dependable and SecureComputing vol 8 no 4 pp 494ndash509 2011

[14] I Nevat D M Divakaran S G Nagarajan P Zhang L Su L L Koand V L L Thing ldquoAnomaly detection and attribution in networks withtemporally correlated trafficrdquo IEEEACM Trans on Networking vol 26no 1 pp 131ndash144 2018

[15] I Paredes-Oliva I Castell-Uroz P Barlet-Ros X Dimitropoulos andJ Sole-Pareta ldquoPractical anomaly detection based on classifying frequenttraffic patternsrdquo in Proc IEEE INFOCOM Workshops 2012 pp 49ndash54

[16] L Bilge D Balzarotti W Robertson E Kirda and C KruegelldquoDisclosure Detecting botnet command and control servers throughlarge-scale NetFlow analysisrdquo in Proc ACSAC 2012 pp 129ndash138

[17] A Lakhina M Crovella and C Diot ldquoDiagnosing network-wide trafficanomaliesrdquo SIGCOMM CCR vol 34 no 4 pp 219ndash230 2004

[18] H Ringberg A Soule J Rexford and C Diot ldquoSensitivity of PCA fortraffic anomaly detectionrdquo SIGMETRICS Perform Eval Rev vol 35no 1 pp 109ndash120 2007

[19] D Brauckhoff K Salamatian and M May ldquoApplying PCA for trafficanomaly detection Problems and solutionsrdquo in Proc IEEE INFOCOM2009 pp 2866ndash2870

[20] T A Tang L Mhamdi D McLernon S A R Zaidi and M GhogholdquoDeep learning approach for network intrusion detection in softwaredefined networkingrdquo in Wireless Networks and Mobile Communications(WINCOM) IEEE 2016 pp 258ndash263

[21] P Torres C Catania S Garcia and C G Garino ldquoAn analysis ofrecurrent neural networks for botnet detection behaviorrdquo in BiennialCongress of Argentina (ARGENCON) IEEE 2016 pp 1ndash6

[22] P Baldi ldquoAutoencoders unsupervised learning and deep architecturesrdquoin Proc ICML Workshop on Unsupervised and Transfer Learning 2012pp 37ndash49

[23] N Shone T N Ngoc V D Phai and Q Shi ldquoA deep learning approachto network intrusion detectionrdquo IEEE Transactions on Emerging Topicsin Computational Intelligence vol 2 no 1 pp 41ndash50 2018

[24] U Fiore F Palmieri A Castiglione and A De Santis ldquoNetworkanomaly detection with the restricted Boltzmann machinerdquo Neurocom-puting vol 122 pp 13ndash23 2013

[25] Y Mirsky T Doitshman Y Elovici and A Shabtai ldquoKitsune Anensemble of autoencoders for online network intrusion detectionrdquo arXivpreprint arXiv180209089 2018

[26] J An and S Cho ldquoVariational autoencoder based anomaly detectionusing reconstruction probabilityrdquo Special Lecture on IE vol 2 pp 1ndash18 2015

[27] S Ren K He R Girshick and J Sun ldquoFaster R-CNN Towards real-time object detection with region proposal networksrdquo in Advances inneural information processing systems 2015 pp 91ndash99

[28] A Roy J Sun R Mahoney L Alonzi S Adams and P Beling ldquoDeeplearning detecting fraud in credit card transactionsrdquo in Systems andInformation Engineering Design Symposium 2018 pp 129ndash134

[29] W-N Hsu Y Zhang and J Glass ldquoUnsupervised domain adaptationfor robust speech recognition via variational autoencoder-based dataaugmentationrdquo in Automatic Speech Recognition and UnderstandingWorkshop IEEE 2017 pp 16ndash23

[30] M Ancona E Ceolini C Oztireli and M Gross ldquoA unified viewof gradient-based attribution methods for deep neural networksrdquo inInternational Conference on Learning Representations 2018

[31] Y Bengio ldquoPractical recommendations for gradient-based training ofdeep architecturesrdquo in Neural Networks Tricks of the Trade Springer2012 pp 437ndash478

[32] D P Kingma and M Welling ldquoAuto-encoding variational Bayesrdquo arXivpreprint arXiv13126114 2013

[33] A Jain K Nandakumar and A Ross ldquoScore normalization in mul-timodal biometric systemsrdquo Pattern Recognition vol 38 no 12 pp2270ndash2285 2005

[34] M Abadi P Barham J Chen Z Chen A Davis et al ldquoTensorFlowA system for large-scale machine learningrdquo in Proc Conference onOperating Systems Design and Implementation 2016 pp 265ndash283

[35] D M Blei A Kucukelbir and J D McAuliffe ldquoVariational inferenceA review for statisticiansrdquo J Amer Statist Assoc vol 112 no 518 pp859ndash877 2017

[36] A G Baydin B A Pearlmutter A A Radul and J M SiskindldquoAutomatic differentiation in machine learning A surveyrdquo Journal ofMachine Learning Research vol 18 no 153 pp 1ndash153 2017

[37] D P Kingma and J Ba ldquoAdam A method for stochastic optimizationrdquoin Third International Conference on Learning Representations 2015

[38] DARPA Datasets httpswwwllmiteduidevaldocsindexhtml[39] A Shiravi H Shiravi M Tavallaee and A A Ghorbani ldquoToward

developing a systematic approach to generate benchmark datasets forintrusion detectionrdquo Comp amp Sec vol 31 no 3 pp 357ndash374 2012

[40] N Moustafa and J Slay ldquoUNSW-NB15 A comprehensive data set fornetwork intrusion detection systems (UNSW-NB15 network data set)rdquoin Military Communications and Info Sys Conf IEEE 2015 pp 1ndash6

[41] S Garcia M Grill J Stiborek and A Zunino ldquoAn empirical comparisonof botnet detection methodsrdquo Comp amp Sec vol 45 pp 100ndash123 2014

[42] MAWI Working Group Traffic Archive httpmawiwideadjpmawi[43] K Sohn H Lee and X Yan ldquoLearning structured output representation

using deep conditional generative modelsrdquo in Advances in NeuralInformation Processing Systems 2015 pp 3483ndash3491

I Introduction

II Related Work

III Unsupervised Deep Learning Models

III-A Autoencoder (AE)

III-B Variational Autoencoder (VAE)

IV GEE Anomaly Detection Framework

IV-A Feature Extraction

IV-B Unsupervised Learning of VAE

IV-C Anomaly Detection

IV-D Gradient-based Explanation for Anomalies

V Dataset and Evaluation

V-A Baseline

V-B ROC Performance on Anomaly Detection

V-C Identifying and Using Gradient Fingerprints

V-D Clustering of Anomalies

V-E Processing Overhead of VAE

VI Conclusion

References