
A comparative evaluation of outlier detection algorithms: experiments and analyses

Rémi Domingues^a, Maurizio Filippone^a, Pietro Michiardi^a, Jihane Zouaoui^b

^a Department of Data Science, EURECOM, Sophia Antipolis, France
^b Amadeus, Sophia Antipolis, France

Abstract

We survey unsupervised machine learning algorithms in the context of outlier detection. This task challenges state-of-the-art methods from a variety of research fields to applications including fraud detection, intrusion detection, medical diagnoses and data cleaning. The selected methods are benchmarked on publicly available datasets and novel industrial datasets. Each method is then submitted to extensive scalability, memory consumption and robustness tests in order to build a full overview of the algorithms' characteristics.

Keywords: Outlier detection; fraud detection; novelty detection; variational inference; isolation forest

1. Introduction

Numerous applications nowadays require thorough data analyses to filter out outliers and ensure system reliability. Such techniques are especially useful for fraud detection, where malicious attempts often differ from most nominal cases and can thus be prevented by identifying outlying data. Those anomalies can be defined as observations which deviate sufficiently from most observations to consider that they were generated by a different generative process. These observations are called outliers when their number is significantly smaller than the proportion of nominal cases, typically lower than 5%. Outlier detection techniques have proven to be efficient for applications such as network intrusions, credit card fraud detection and telecommunications fraud detection [10].

Excluding outliers from a dataset is also a task from which most data mining algorithms can benefit. An outlier-free dataset allows for accurate modelling tasks, making outlier detection methods extremely valuable for data cleaning [18]. These techniques have also been successfully applied to fault detection on critical systems, resulting in improved damage control and component failure prediction [31]. Medical diagnoses may also profit from the identification of outliers in applications such as brain tumor detection [24] and cancerous mass recognition in mammograms [28].

Preprint submitted to Elsevier August 20, 2018


Numerous machine learning methods are suitable for anomaly detection. However, supervised algorithms are more constraining than unsupervised methods as they need to be provided with a labeled dataset. This requirement is particularly expensive when the labeling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to anomaly detection, can also affect the efficiency of supervised algorithms [12]. This paper focuses on unsupervised machine learning algorithms to isolate outliers from nominal samples. In previous reviews, such methods were shown to be proficient for outlier and novelty detection [23, 33]. Unsupervised algorithms make use of unlabeled data to assign a score to each sample, representing the observation's normality. Binary segmentations can further be obtained by thresholding the scores.

Outlier detection is a notoriously hard task: detecting anomalies can be difficult when they overlap with nominal clusters, and these clusters should be dense enough to build a reliable model. The problem of contamination, i.e. using an input dataset contaminated by outliers, makes this task even trickier, as anomalies may degrade the final model if the training algorithm lacks robustness. These issues, which are present in many real-world datasets, are not always addressed in works describing new unsupervised methods, as these algorithms may target a different use case. These reasons motivate the need for a thorough benchmark bringing together distinctive techniques on complex datasets.

Our paper extends previous works [7, 33] by using 12 publicly available labeled datasets, most of which are recommended for outlier detection in [7], in addition to 3 novel industrial datasets from the production environments of a major company in the travel industry. The benchmark made in [7] used fewer datasets and solely numerical features, while we benchmark and address ways to handle categorical data. While the previous works perform outlier detection on training data, our study tests the generalization ability of all methods by predicting outliers on unseen testing data. The selected parametric and nonparametric algorithms come from a variety of approaches including probabilistic algorithms, nearest-neighbor based methods, neural networks, information theoretic and isolation methods. The performance on labeled datasets is compared with the area under the ROC and precision-recall curves.

In order to give a full overview of these methods, we also benchmark the training time, prediction time, memory usage and robustness of each method when increasing the number of samples, the number of features and the background noise. These scalability measurements allow us to compare algorithms not only based on their outlier detection performance but also on their scalability, robustness and suitability for high-dimensional problems.

The paper is organized as follows: section 2 introduces the research fields and methods targeted by the benchmark; section 3 details the experimental setup, the public and proprietary datasets used and the method implemented to generate synthetic datasets; section 4 presents the results of our benchmark and section 5 summarizes our conclusions.


2. Outlier detection methods

The algorithms described in this section belong to a wide range of approaches. These methods build a model representing the nominal classes, i.e. dense clusters of similar data points, during a training phase. Online or batch predictions can thereafter be applied to new data based on the trained model to assign an anomaly score to the new observations. Applying a threshold on the returned scores provides a decision boundary separating nominal samples from outliers.

The evaluation presented in this study contains both parametric and nonparametric machine learning algorithms. While parametric approaches model the underlying data with a fixed number of parameters, the number of parameters of nonparametric methods is potentially infinite and can increase with the complexity of the data. If the former are often computationally faster, they require assumptions about the data distribution, e.g. the number of clusters, and may result in a flawed model if based on erroneous assumptions. The latter make fewer assumptions about the data distribution and may thus generalize better while requiring less knowledge about the data.

2.1. Probabilistic methods

Probabilistic algorithms estimate the probability density function of a dataset X by inferring the model parameters θ. Data points having the smallest likelihood P(X|θ) are identified as outliers. Most probabilistic methods described in this section can be trained incrementally, i.e. presenting new input data to an existing model will cause the model to adapt to the new data while remembering past ones.

The first algorithm used in our benchmark is the Gaussian Mixture Model (GMM), which fits a given number of Gaussian distributions to a dataset. The model is trained using the Expectation-Maximization (EM) algorithm [6], which iteratively maximizes a lower bound of the likelihood. This method has been successfully applied to identify suspicious and possibly cancerous masses in mammograms by novelty detection in [28]. However, assessing the number of components of the mixture by data exploration can be complex and motivates the use of the nonparametric alternatives described hereafter.
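As an illustration of likelihood-based scoring with a Gaussian mixture, the sketch below uses scikit-learn's GaussianMixture; the toy arrays X_train and X_test stand in for a preprocessed dataset and are not taken from the benchmark code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))   # toy stand-in for a preprocessed dataset
X_test = rng.normal(size=(200, 5))

# Fit a single-component GMM (the parametric baseline of this benchmark)
# on possibly contaminated training data.
gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gmm.fit(X_train)

# score_samples returns the per-sample log-likelihood log P(x | theta);
# a low likelihood flags a likely outlier, so we negate it as an anomaly score.
anomaly_score = -gmm.score_samples(X_test)

# A binary segmentation is obtained by thresholding, e.g. flagging the top 5%.
outliers = anomaly_score > np.percentile(anomaly_score, 95)
```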

In [3], Blei et al. describe the Dirichlet Process Mixture Model (DPMM), a nonparametric Bayesian algorithm which optimizes the model parameters and tests for convergence by monitoring a non-decreasing lower bound on the log-marginal likelihood. The result is a mixture model where each component is a product of exponential-family distributions. Most likelihoods, e.g. Gaussian or Categorical, can be written in exponential-family form; in this formulation, it is possible to obtain suitable conjugate priors to facilitate inference.

The number of components is inferred during the training and requires an upper bound K. Weights πi are represented by a Dirichlet Process modelled as


a truncated stick-breaking process (equation 1). The variable v_i follows a Beta distribution, where α_k and β_k are variational parameters optimized during the training for each component.

\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j), \qquad q_{\alpha,\beta}(v) = \prod_{k=1}^{K-1} \mathrm{Beta}(\alpha_k, \beta_k) \qquad (1)

The training optimizes the parameters of the base distributions, e.g. the parameters of a Gaussian-Wishart for a multivariate Gaussian likelihood. The scoring is then made by averaging the log-likelihood computed from each mixture of likelihoods sampled from the conjugate priors. The algorithm is applied to intrusion detection on the kdd 99 dataset in [8] and outperforms svm and knn algorithms. The cluster centroids provided by the model can also be valuable to an end-user as they represent the average nominal data points.

Kernel density estimators (KDE), also called Parzen window estimators, approximate the density function of a dataset by assigning a kernel function to each data point, then summing the local contributions of the kernels. A bandwidth parameter acts as a smoothing parameter on the density shape and can be estimated by methods such as Least-Squares Cross-Validation (LSCV). As shown in [28], kde methods are efficient when applied to novelty detection problems. However, these approaches are sensitive to outliers and struggle to find a good estimate of the nominal data density in datasets contaminated by outliers. This issue is shown by Kim et al. in [13], where the authors describe a Robust Kernel Density Estimator (RKDE) algorithm which uses M-estimation methods, such as robust loss functions, to overcome the limitations of plain kde.
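For reference, the sketch below shows plain (non-robust) KDE scoring with scikit-learn and a Scott's rule-of-thumb bandwidth; RKDE replaces the implicit squared loss with a robust M-estimation loss, which is not available out of the box in scikit-learn, so this is only the baseline estimator.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 6))   # toy nominal data
X_test = rng.normal(size=(200, 6))

# Scott's rule-of-thumb bandwidth h = n^(-1/(d+4)), the fallback used in
# this benchmark when a cross-validated bandwidth is too costly to compute.
n, d = X_train.shape
h = n ** (-1.0 / (d + 4))

kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X_train)
anomaly_score = -kde.score_samples(X_test)   # low density -> high anomaly score
```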

Probabilistic principal component analysis (PPCA) [29] is a latent variable model which estimates the principal components of the data. It allows for the projection of a d-dimensional observation vector Y to a k-dimensional vector of latent variables X, with k the number of components of the model. The relationship Y = WX + µ + ε is trained by expectation-maximization. The authors suggest using the log-likelihood as a degree of novelty for new data points.
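scikit-learn's PCA exposes the probabilistic PCA log-likelihood of Tipping and Bishop through score_samples; the minimal sketch below mirrors the PPCA configuration listed later in Table 2 (components = mle, full SVD), on toy data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))
X_test = rng.normal(size=(200, 10))

# 'mle' selects the number of components automatically (requires a full SVD).
ppca = PCA(n_components="mle", svd_solver="full").fit(X_train)

# score_samples returns the log-likelihood of each sample under the
# probabilistic PCA model; negate it to obtain an anomaly score.
anomaly_score = -ppca.score_samples(X_test)
```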

More recently, Least-squares anomaly detection (LSA) [25], developed by Quinn et al., extends a multi-class probabilistic classifier to a one-class problem. The approach is compared against knn and One-class SVM using the area under the ROC curve.

2.2. Distance-based methods

This class of methods uses solely the distance space to flag outliers. As such, the Mahalanobis distance is suitable for anomaly detection tasks targeting multivariate datasets composed of a single Gaussian-shaped cluster [2]. The model parameters are the mean and inverse covariance matrix of the data, thus


similar to a one-component gmm with a full covariance matrix.
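A minimal NumPy sketch of this detector follows: the mean and inverse covariance are estimated on training data and each test point is scored by its squared Mahalanobis distance (toy arrays, not the benchmark code).

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

# Model parameters: mean vector and inverse covariance of the training data.
mu = X_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

# Squared Mahalanobis distance of each test point to the training mean.
diff = X_test - mu
anomaly_score = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
```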

2.3. Neighbor-based methods

Neighbor-based methods study the neighborhood of each data point to identify outliers. Local outlier factor (LOF), described in [4], is a well-known distance-based approach corresponding to this description. For a given data point x, lof computes its degree dk(x) of being an outlier based on the Euclidean distance d between x and its kth closest neighbor nk, which gives dk(x) = d(x, nk). The scoring of x also takes into account, for each of its neighbors ni, the maximum between dk(ni) and d(x, ni). As shown in [7], lof outperforms Angle-Based Outlier Detection [16] and One-class SVM [26] when applied to real-world datasets for outlier detection, which makes it a good candidate for this benchmark.
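The sketch below scores unseen points with scikit-learn's LocalOutlierFactor, assuming a version recent enough to support novelty=True; the neighbor count is illustrative and the arrays are toy data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

# novelty=True enables scoring of unseen data with score_samples/predict,
# matching the train/test protocol used in this benchmark.
lof = LocalOutlierFactor(n_neighbors=50, novelty=True).fit(X_train)

# score_samples returns the opposite of the local outlier factor:
# large negative values indicate outliers, so we negate it.
anomaly_score = -lof.score_samples(X_test)
```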

Angle-Based Outlier Detection (ABOD) [16] uses the radius and variance of angles measured at each input vector, instead of distances, to identify outliers. The motivation is to remain efficient in high-dimensional space and to be less sensitive to the curse of dimensionality. Given an input point x, abod samples several pairs of points and computes the corresponding angles at x and their variance. Broad angles imply that x is located inside a major cluster as it is surrounded by many data points, while small angles denote that x is positioned far from most points in the dataset. Similarly, a higher variance will be observed for points inside or at the border of a cluster than for outliers. The authors show that their method outperforms lof on synthetic datasets containing more than 50 features. According to the authors, the pairs of vectors can be built from the entire dataset, a random subset or the k-nearest neighbors in order to speed up the computation, at the cost of lower outlier detection performance.

The Subspace outlier detection (SOD) [15] algorithm finds, for each point p, the set of m neighbors shared between p and its k-nearest neighbors. The outlier score is then the standard deviation of p from the mean of a given subspace, which is composed of a subset of dimensions. The attributes having a small variance for the set of m points are selected to be part of the subspace.

2.4. Information theory

The Kullback-Leibler (KL) divergence was used as an information-theoretic measure for novelty detection in [9]. The method first trains a Gaussian mixture model on a training set, then estimates the information content of new data points by measuring the kl divergence between the density estimated on the training set and the density estimated on the training set augmented with the new point. This reduces to an F-test in the case of a single Gaussian.


2.5. Neural networks

In [19], Marsland et al. propose a reconstruction-based nonparametric neural network called the Grow When Required (GWR) network. This method is based on Kohonen networks, also called Self-Organizing Maps (som) [14], and fits a graph of adaptive topology lying in the input space to a dataset. While training the network, nodes and edges are added or removed in order to best fit the data, the objective being to end up with nodes positioned in all dense data regions while edges propagate the displacement of neighboring nodes.

Outliers from artificial datasets are detected using a fixed-topology som in an experimental work [21]. The paper uses two thresholds t1 and t2 to identify data points having their closest node further than t1, or projecting on an outlying node, i.e. a neuron having a median interneuron distance (mid) higher than t2. The mid of each neuron is computed by taking the median of the distances between a neuron and its 8 neighbors in a network following a 2-dimensional grid topology. Severe outliers and dense clouds of outliers are correctly identified with this technique, though some nominal samples can be mistaken for mild outliers.

2.6. Domain-based methods

Additional methods for outlying data identification rely on the construction of a boundary separating the nominal data from the rest of the input space, thus estimating the domain of the nominal class. Any data point falling outside of the delimited boundary is thus flagged as an outlier.

One-class SVM [26], an application of support vector machine (svm) algorithms to one-class problems, belongs to this class of algorithms. The method computes a separating hyperplane in a high-dimensional space induced by kernels performing dot products between points from the input space mapped into that high-dimensional space. The boundary is fitted to the input data by maximizing the margin between the data and the origin in the high-dimensional space. The algorithm prevents overfitting by allowing a percentage ν of data points to fall outside the boundary. This percentage ν acts as a regularization parameter; it is used as a lower bound on the fraction of support vectors delimiting the boundary and as an upper bound on the fraction of margin errors, i.e. training points remaining outside the boundary.

The experiment in the original paper mostly targets novelty detection, i.e. anomaly detection using a model trained on a dataset free of anomalies. Our benchmark uses contaminated datasets to assess the algorithm's robustness, with a regularization parameter significantly higher than the expected proportion of outliers.
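A minimal sketch with scikit-learn's OneClassSVM follows, using the loose ν = 0.5 adopted in this benchmark; the RBF kernel and gamma setting are our assumptions for illustration, and the arrays are toy data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

# nu upper-bounds the fraction of margin errors and lower-bounds the fraction
# of support vectors; 0.5 is the deliberately loose value used in this benchmark.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.5).fit(X_train)

# decision_function is positive inside the boundary and negative outside;
# negate it so that larger values mean more anomalous.
anomaly_score = -ocsvm.decision_function(X_test)
```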

2.7. Isolation methods

We include an isolation algorithm in this study, which focuses on separating outliers from the rest of the data points. This method differs from the previous ones as it isolates anomalies instead of profiling normal points.


The concept of Isolation forest was introduced by Liu in [17] and uses random forests to compute an isolation score for each data point. The model is built by performing recursive random splits on attribute values, hence generating trees able to isolate any data point from the rest of the data. The score of a point is then the average path length from the root of the tree to the node containing the single point, a short path denoting a point easy to isolate due to attribute values significantly different from nominal values.

The author states that this algorithm provides linear time complexity and demonstrates outlier detection performance significantly better than lof on real-world datasets.
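The sketch below scores unseen points with scikit-learn's IsolationForest on toy data; the number of trees is illustrative and not taken from the benchmark configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples returns the opposite of the isolation-based anomaly score:
# points with short average path lengths get low values, so we negate it.
anomaly_score = -forest.score_samples(X_test)
```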

3. Experimental evaluation

3.1. Metrics

We evaluate the performance of the outlier detection algorithms by comparing two metrics based on the real-world labeled datasets detailed in section 3.2. For this purpose, we use the receiver operating characteristic (ROC) curve (true positive rate against false positive rate) and the precision-recall (PR) curve (equations 2 and 3), where the positive class represents the anomalies and the negative class represents the nominal samples. The comparison is then based on the area under the curve (AUC) of both metrics, respectively the ROC AUC and the average precision (AP).

\mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (2)

\mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3)
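Both summary metrics are available in scikit-learn; the sketch below shows how the ROC AUC and the average precision are computed from anomaly scores, using random toy labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)    # 1 = anomaly (positive class)
anomaly_score = rng.normal(size=200)     # higher = more anomalous

roc_auc = roc_auc_score(y_true, anomaly_score)
ap = average_precision_score(y_true, anomaly_score)   # area under the PR curve
print(f"ROC AUC = {roc_auc:.3f}, AP = {ap:.3f}")
```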

3.2. Datasets

Our evaluation uses 15 datasets ranging from 723 to 20,000 samples and containing from 6 to 107 features. Of those datasets, 12 are publicly available on the UCI [1] or OpenML [30] repositories while the 3 remaining datasets are novel proprietary datasets containing production data from the company Amadeus. Table 1 gives an overview of the dataset characteristics. Our study assesses whether the models are able to generalize to future datasets, which is a novel approach in outlier detection works. This requires that algorithms support unseen testing data, and is achieved by performing a Monte Carlo cross validation of 5 iterations, using 80% of the data for the training phase and 20% for the prediction. Training and testing datasets contain the same proportion of outliers, and ROC AUC and AP are measured based on the predictions made. For 7 of the publicly available datasets, the outlier classes are selected according to the recommendations made in [7], which are based on extensive dataset comparisons. However, the cited experiment discards all categorical data, while we keep those features and performed one-hot encoding to binarize them, keeping


Table 1: UCI, OpenML and proprietary datasets benchmarked - (# categ. dims) is the average number of binarized features obtained after transformation of the categoricals

Dataset            Nominal class      Anomaly class    Numeric dims  Categ. dims  Samples    Anomalies
abalone            8, 9, 10           3, 21            7             1 (3)        1,920      29 (1.51%)
ann-thyroid        3                  1                21            0 (0)        3,251      73 (2.25%)
car                unacc, acc, good   vgood            0             6 (21)       1,728      65 (3.76%)
covtype            2                  4                54            0 (0)        10,000 ¹   95 (0.95%)
german-sub         1                  2                7             13 (54)      723        23 (3.18%) ²
kdd-sub            normal             u2r, probe       34            7 (42)       10,000 ¹   385 (3.85%)
magic-gamma-sub    g                  h                10            0 (0)        12,332     408 (3.20%) ²
mammography        -1                 1                6             0 (0)        11,183     260 (2.32%)
mushroom-sub       e                  p                0             22 (107)     4,368      139 (3.20%) ²
shuttle            1                  2, 3, 5, 6, 7    9             0 (0)        12,345     867 (7.02%)
wine-quality       4, 5, 6, 7, 8      3, 9             11            0 (0)        4,898      25 (0.51%)
yeast ³            CYT, NUC, MIT      ERL, POX, VAC    8             0 (0)        1,191      55 (4.62%)
pnr                0                  1, 2, 3, 4, 5    82            0 (0)        20,000     121 (0.61%)
shared-access      0                  1                49            0 (0)        18,722     37 (0.20%)
transactions       0                  1                41            1 (9)        10,000 ¹   21 (0.21%)

¹ Subsets of the original datasets are used, with the same proportion of outliers.
² Anomalies are sampled from the corresponding class, using the average percentage of outliers depicted in [7].
³ The first feature, corresponding to the protein name, was discarded.

all information from the dataset at the cost of a higher dimensionality. Normalization is further achieved by centering numerical features to the mean and scaling them to unit variance.
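This preprocessing (one-hot encoding of categoricals plus standardization of numericals) can be expressed with scikit-learn as in the sketch below; the column names and toy frame are illustrative and not taken from the benchmark datasets.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical column (illustrative names).
df = pd.DataFrame({"duration": [1.2, 3.4, 0.7, 5.1],
                   "payment": ["card", "cash", "card", "voucher"]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["duration"]),                       # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["payment"]),  # binarize categories
])
X = preprocess.fit_transform(df)
```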

The three following datasets contain production data collected by Amadeus, a Global Distribution System (GDS) providing online platforms to connect the travel industry. This company manages almost half of the flight bookings worldwide and is targeted by fraud attempts leading to revenue losses and indemnifications. The datasets do not contain information traceable to any specific individuals.

The Passenger Name Records (pnr) dataset contains booking records, mostly flight and train bookings, with 5 types of frauds labeled by fraud experts. The features in this dataset describe the most important changes applied to a booking from its creation to its deletion. Features include time-based information, e.g. age of a pnr, percentage of cancelled flight segments or passengers, and means and standard deviations of counters, e.g. number of passenger modifications, frequent traveller cards, special service requests (additional luggage, special seat or meal), or forms of payment.

The transactions dataset is extracted from a Web application used to perform bookings. It focuses on user sessions, which are compared to identify bots and malicious users. Examples of features are the number of distinct IPs and organizational offices used by a user, the session duration, and means and standard deviations applied to the number of bookings and number of actions. The most critical actions are also monitored.

The shared-access dataset was generated by a backend application used to manage shared rights between different entities. It enables an entity to grant specific reading (e.g. booking retrieval, seat map display) or writing (e.g. cruise distribution) rights to another entity. Monitoring the actions made with this application should prevent data leaks and sensitive rights transfers. For each user account, features include the average number of actions performed per session and time unit, the average and standard deviation for some critical actions per


session, and the targeted rights modified.

3.3. Datasets for scalability tests

As the choice of an outlier detection algorithm is not only driven by its accuracy but is often subject to computational constraints, our experiment includes the training time, prediction time, memory usage and noise resistance (through precision-recall measurements) of each algorithm on synthetic datasets.

For these scalability tests, we generate datasets of different sizes containing a fixed proportion of background noise. The datasets range from 10 samples to 10 million samples and from 2 to 10,000 numerical features. We also keep the number of features and samples fixed while increasing the proportion of background noise from 0% to 90% to perform robustness measurements. The experiment is repeated 5 times, using the same dataset for training and testing. We allow up to 24 hours for training or prediction steps to be completed and stop the computation after this period of time. Experiments reaching a timeout or requiring more than 256 GB of RAM do not report memory usage nor robustness in section 4.

In order to avoid advantaging some algorithms over others, the datasets are generated using two Student's T distributions. The distributions are respectively centered in (0, 0, ..., 0) and (5, 5, ..., 5), while the covariance matrices are computed using c_ij = ρ^|i−j|, where c_ij is entry (i, j) of the matrix. The parameter ρ follows a uniform distribution ρ ∼ U(0, 1) and the degrees of freedom parameter follows a Gamma distribution ν ∼ Γ(1, 5). We then add outliers uniformly sampled from a hypercube 7 times bigger than the standard deviation of the nominal data.
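A minimal NumPy sketch of this generation procedure follows. It is our reading of the text, not the benchmark code: the Gamma parameters are interpreted as shape = 1 and scale = 5, and the outlier hypercube is scaled to 7 times the per-feature standard deviation of the nominal points.

```python
import numpy as np

def multivariate_t(mean, cov, df, size, rng):
    # Sample a multivariate Student's t as Gaussian / sqrt(chi2 / df).
    g = rng.multivariate_normal(np.zeros(len(mean)), cov, size=size)
    u = rng.chisquare(df, size=(size, 1))
    return mean + g / np.sqrt(u / df)

def make_dataset(n_samples=1000, n_features=2, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    rho = rng.uniform(0, 1)
    df = rng.gamma(shape=1.0, scale=5.0)
    idx = np.arange(n_features)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])      # c_ij = rho^|i-j|

    n_out = int(noise * n_samples)
    n_nom = n_samples - n_out
    nominal = np.vstack([
        multivariate_t(np.zeros(n_features), cov, df, n_nom // 2, rng),
        multivariate_t(np.full(n_features, 5.0), cov, df, n_nom - n_nom // 2, rng),
    ])
    # Outliers: uniform samples from a hypercube 7 times the nominal std.
    half_width = 7 * nominal.std(axis=0)
    center = nominal.mean(axis=0)
    outliers = rng.uniform(center - half_width, center + half_width,
                           size=(n_out, n_features))
    X = np.vstack([nominal, outliers])
    y = np.r_[np.zeros(n_nom), np.ones(n_out)]             # 1 = outlier
    return X, y
```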

3.4. Algorithms implementations and parameters

Most implementations used in this experiment are publicly available. Table 2 details the programming languages and initialization parameters selected. A majority of methods have flexible parameters and perform very well without extensive tuning. The Matlab Engine API for Python and the rpy2 library allow us to call Matlab and R code from Python.

dpgmm is our own implementation of a Dirichlet Process Mixture Model and follows the guidelines given in [3], where we place a Gamma prior on the scaling parameter β. Making our own implementation of this algorithm is motivated by its capability of handling a wide range of probability distributions, including categorical distributions. We thus benchmark dpgmm, which uses only Gaussian distributions to model data and thus uses continuous and binarized features as all other algorithms do, and dpmm, which uses a mixture of multivariate Gaussian / Categorical distributions, hence requiring fewer data transformations and working on a smaller number of features. This algorithm is the only one capable of using the raw string features from the datasets.

Note that dpgmm and dpmm converge to the same results when applied to non-categorical data, and that our dpgmm performs similarly to the corresponding scikit-learn implementation called Bayesian Gaussian Mixture (BGM).


Table 2: Implementations and parameters selected for the evaluation

Algorithm     Language  Parameters
gmm ¹         Python    components = 1
dpgmm, dpmm   Python    Γθ = (α = 1, β = 0), kmax = 10, µθ = mean(data), Σθ = var(data)
rkde ²³       Matlab    bandwidth = LKCV, loss = Huber
ppca          Python    components = mle, svd = full
lsa           Python    σ = 1.7, ρ = 100
maha          Python    n/a
lof           Python    k = max(n ∗ 0.1, 50)
abod ²        R         k = max(n ∗ 0.01, 50)
sod ²         R         k = max(n ∗ 0.05, 50), kshared = k/2
kl ¹²         R         components = 1
gwr ²         Matlab    it = 15, thab = 0.1, tinsert = 0.7
ocsvm         Python    ν = 0.5
iforest       Python    contamination = 0.5

¹ Parameter tuning is required to maximize the mean average precision (MAP).
² We extend these algorithms to add support for predictions on unseen data points.
³ We use Scott's rule-of-thumb h = n^{−1/(d+4)} [27] to estimate the bandwidth when it cannot be computed due to a high number of features.

However, we did not optimize our implementation, which uses a more general exponential-family representation for probability distributions. This greatly increases the computational cost and results in a much higher training and prediction time.

gwr has a nonparametric topology. Identifying outlying neurons as described in [21] to detect outliers from a 2D grid topology may thus not be applicable to the present algorithm. The node connectivity can indeed differ significantly from one node to another, and we need a ranking of the outliers for our performance measurements more than a two-parameter binary classification. Therefore, the score assigned to each observation is here the squared distance between an observation and the closest node in the network. Note that regions of outliers sufficiently dense to attract neurons may not be detected with this technique.
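A minimal sketch of this scoring rule follows; the array `nodes` is a hypothetical placeholder for the node positions learned by the trained GWR network (in our setup the training itself is done in Matlab).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
nodes = rng.normal(size=(50, 5))    # stand-in for the trained GWR node positions
X_test = rng.normal(size=(200, 5))

# Anomaly score = squared Euclidean distance to the closest node of the network.
nn = NearestNeighbors(n_neighbors=1).fit(nodes)
dist, _ = nn.kneighbors(X_test)
anomaly_score = dist[:, 0] ** 2
```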

4. Results

For each dataset, the methods described in section 2 are applied to the 5 training and testing subsets sampled by Monte Carlo cross validation. We report here the average and standard deviation over the runs. The programming language and optimizations applied to the implementations may affect the training time, prediction time and memory usage measured in sections 4.3 and 4.4. For this reason, our analysis focuses more on the slope of the curves and the complexity of the algorithms than on the measured values. The experiments are performed on a VMware virtual platform running Ubuntu 14.04 LTS and powered by an Intel Xeon E5-4627 v4 CPU (10 cores at 2.6GHz) and 256GB of RAM. We use the Intel distribution of Python 3.5.2, R 3.3.2 and Matlab R2016b.


Table 3: Rank aggregation through Cross-Entropy Monte Carlo

Algorithm  gmm  dpgmm  dpmm  rkde  ppca  lsa  maha  lof  abod  sod  kl  gwr  ocsvm  iforest
PR AUC       6     12     9     2     5   14     7   11    10    4   8   13      3        1
ROC AUC     11      4     5     1     7    9     6   10    13    8  12   14      3        2

4.1. ROC and precision-recall

The area under the ROC curve is estimated using the trapezoidal rule, while the area under the precision-recall curve is computed as the average precision (AP). Figure 2 shows the mean and standard deviation of the AP per algorithm and dataset, while figure 3 reports the ROC AUC. For the clarity of presentation, figure 1 shows the global mean and standard deviation of both metrics, sorting algorithms by decreasing mean average precision (MAP).

Figure 1: Mean area under the ROC and PR curve per algorithm (descending PR)

Since averaging these metrics may induce a bias in the final ranking caused by extreme values on some specific datasets (e.g. the AP of iforest on kdd-sub), we also rank the algorithms per dataset and aggregate the ranking lists without considering the measured values. The Cross-Entropy Monte Carlo algorithm and the Spearman distance are used for the aggregation [22]. The resulting rankings presented in Table 3 are similar to the rankings given in figure 1 and confirm the previously observed trend. The rest of this paper will thus refer to the rankings introduced in figure 1.

Figure 2: Mean and std area under the precision-recall curve per dataset and algorithm (5 runs)

Note that we are dealing with heavily imbalanced class distributions where the anomaly class is the positive class. For this kind of problem, where the positive class is more interesting than the negative class though underweighted due to the high number of negative samples, precision-recall curves prove to be particularly useful. Indeed, the use of precision strongly penalizes methods raising false positives, even if those represent only a small proportion of the negative samples [5].

Looking at the mean average precision, iforest, rkde, ppca and ocsvm show excellent performance and achieve the best outlier detection results of our benchmark. Outperforming all other algorithms on several datasets (e.g. kdd-sub, abalone or thyroid), iforest shows good average performance which makes it a reliable choice for outlier detection. rkde comes in second position and also shows excellent performance on most datasets, especially when applied to high-dimensionality problems.

One-class SVM achieves good performance without requiring significant tuning. We note that the algorithm performs best on datasets containing a small proportion of outliers, which seems to confirm that the method is well-suited to novelty detection.

Figure 3: Mean and std area under the ROC curve per dataset and algorithm (5 runs)

The parametric mixture model gmm uses only a single Gaussian, while dpmm and dpgmm often use between 5 and 10 components to model these complex datasets of variable density and shape. The number of components for parametric models, e.g. gmm, was selected to maximize the MAP. Regarding most public datasets, note that anomalies are not sparse background noise, but actual samples from one or more classes that are part of classification datasets. The model resulting from nonparametric methods may thus be much more accurate as it could cluster dense clouds of outliers, remnants of former classes. This is confirmed by the results observed for most one-class datasets contaminated by outliers, which are the ones where dpmm and dpgmm outperform gmm, e.g. magic-gamma-sub, mushroom-sub and german-sub. In addition, the scores and ranking of iforest and gmm are much higher for the datasets selected in [7] than for the other datasets, where nonparametric methods prevail. Since the selection process described in their study makes use of the performance of several outlier detection methods, this process may benefit algorithms having a behavior similar to the chosen methods.

The good ranking of ppca, kl and the Mahalanobis distance is also explained by precision-recall measurements very similar to those of gmm. Probabilistic pca can indeed be regarded as a gmm with one component, kl is a gmm with a different scoring function, while the model of the Mahalanobis distance is closely related to the multivariate Gaussian distribution. If these simple models perform well on average, they are not suitable for more complex datasets, e.g. the proprietary datasets from Amadeus, where nonparametric methods able to handle clusters of arbitrary shape, such as rkde, sod or even gwr, prevail.

We indeed observe that sod outperforms the other methods on the Amadeus datasets, and more generally performs very well on large datasets composed of more than 10,000 samples. This method is the best neighbor-based algorithm of our benchmark, but requires a sufficient number of features to infer suitable subspaces through feature selection.

dpmm performs better than dpgmm. As the two methods should converge identically when applied to numerical data, it is the way they handle categorical features that explains this difference. Looking at the detailed average precisions highlighted in Figure 2 for datasets containing categorical features, we notice that dpmm outperforms dpgmm on four datasets (kdd-sub, mushroom-sub, car and transactions), while it is outperformed on two others (abalone and


german-sub). This gain of performance for dpgmm is likely to be caused by categorical features strongly correlated with the true class distribution and having a high number of distinct values, resulting in several binarized features and thus a higher weight for the corresponding categorical in the final class prediction. However, when the class distribution is heavily unbalanced, the Chi-Square test we performed on contingency tables did not allow us to confirm this hypothesis.

The results of dpmm are reached in a smaller training and prediction time due to the smaller dimensionality of the non-binarized input data. Yet, we speculate that dpgmm would reach a smaller computation time if it made all its computations with Normal-Wishart distributions instead of exponential-family representations. This was confirmed by measuring the computation time of the bgm implementation in scikit-learn on the same datasets.

lof and abod do not stand out, with unexpected drops of performance observed for lof on transactions and mammography that cannot be explained solely based on the dataset characteristics. Using a sufficiently high number of neighbors is important when dealing with large datasets containing a higher number of outliers. Increasing the sample size for abod may lead to slightly better performance at the cost of a much higher computation time, for a method which is already slow. We also benchmarked the k-nearest-neighbors approach described in the original paper, which showed reduced computation time for k = 15, though this did not improve performance. Despite the use of angles instead of distances, this algorithm performs worse than lof on 4 of the 7 datasets containing more than 40 features. It is however one of the best outlier detection methods on the pnr and transactions datasets.

Although gwr achieves lower performance than other methods in our benchmark, in the case of a low proportion of anomalies, e.g. with pnr, shared-access and transactions, the algorithm reaches excellent precision-recall scores as the density of outliers is not sufficient to attract any neuron. som can thus be useful for novelty detection targeting datasets free of outliers, or when combined with a manual analysis of the quantization errors and mid matrix as described in [21]. Similarly to gwr, lsa reaches low average performance, especially for large or high-dimensional datasets, but it achieves good results for small datasets.

4.2. Robustness

As our synthetic datasets contain only numerical data and since dpmm and dpgmm are the same method when the dataset does not contain categorical features, we exclude dpmm from our results and add bgm, the scikit-learn optimized equivalent of our dpgmm implementation. Figures 4, 5 and 6 measure the area under the precision-recall curve, the positive class being the background noise for the first two figures, and the nominal samples generated by the mixture of Student's T distributions in the last figure.


Figure 4: Robustness for increasing number of features (average precision on outliers; 1,000 float samples, 10.0% outliers)

The resistance of each algorithm to the curse of dimensionality is highlighted in figure 4, where we keep a fixed level of background noise while increasing the dataset dimensionality. The results show good performance on average and unexpectedly good results for gwr for more than 10 features. Surprisingly, abod, which is supposed to efficiently handle high dimensionality, performs poorly here. Similarly, lsa and Mahalanobis do not perform well. The difference in results between bgm and dpgmm is likely due to a different cluster responsibility initialization, as bgm uses K-means to assign data points to cluster centroids scattered among dense regions of the dataset.

Figure 5 shows decreasing outlier detection performance for lsa and gwr when increasing the number of samples in the dataset. sod and kl do not perform well either, though their results are correlated with the variations of better methods. We thus assume that the current experiment is not well-suited to these methods, as sod applies feature selection and has demonstrated better performance in higher dimensionalities. Increasing the number of samples resulted in an overall increasing precision. The results given for less than 100 data points show a high entropy, as the corresponding datasets contain very sparse data in which dense regions cannot be easily identified.

Increasing the proportion of background noise in figure 6 shows a lack of robustness for lsa, sod, ocsvm, gwr and abod. While the first two methods are very sensitive to background noise, the three others maintain good performance up to 50% of outliers. Neighbor-based methods can only cope with a restricted amount of noise, though increasing the number of samples used to compute the scores could lead to better results. Similarly, most neurons of gwr were attracted by surrounding outliers.


Figure 5: Robustness for increasing number of samples (average precision on outliers; 2 float features, 10.0% outliers)

Figure 6: Robustness for increasing noise density (average precision on nominal samples; 1,000 samples, 2 float features)

In order to avoid penalizing ocsvm, this specific evaluation uses an adaptive ν which increases with the proportion of outliers, with a minimum value of 0.5. In spite of this measure, ocsvm also shows very poor results above 50% noise. Mahalanobis, kl, lof and gmm do not perform well either in noisy environments, despite the use of 2 components by gmm and kl. For this experimental setting, the best candidates for a dataset highly contaminated by sparse outliers are iforest, dpgmm, bgm, rkde and ppca.

To conclude, the robustness measures on synthetic datasets confirm the poor


performance of abod and gwr shown in our previous ranking. Good average results were observed for iforest, ocsvm, lof, rkde, dpgmm and gmm. The nearest-neighbor-based methods showed difficulties in handling datasets with high background noise.

4.3. Complexity

We now focus on the computation and prediction time required by the different methods when increasing the dataset size and dimensionality. The running environment and the amount of optimizations applied to the implementations strongly impact those measures. For this reason, we focus on the evolution of the curves more than on the actual values recorded. Comparing the measurements of bgm and dpgmm, which implement the same algorithm, is a good illustration of this statement.

Increasing the number of features in figures 7 and 8 shows an excellent scalability for lsa, gwr and iforest, with a stable training time and a good prediction time evolution. dpgmm has here the worst training and prediction scaling, reaching the 24 hours timeout for more than 3,000 features. This scaling is confirmed by the increases observed for bgm and gmm. rkde also performs poorly, with a timeout caused by a high bandwidth when the number of features becomes higher than the number of samples. High-dimensionality datasets do not strongly affect distance-based and neighbor-based methods, though probabilistic algorithms suffer from the increasing number of dimensions. The use of computationally expensive matrix operations whose complexity depends on the data dimensionality, e.g. matrix factorizations and multiplications, is a major cause of the poor scalability observed. Maximum likelihood estimation fails to estimate a suitable number of components for ppca for more than 1,000 features. We then keep enough components to explain at least 90% of the variance, which explains the decrease of training time.

The number of samples has a strong impact on the training and prediction time of rkde, sod, ocsvm, lof and abod, which scale very poorly in figures 9 and 10. Those five algorithms would reach the timeout of 24 hours for less than one million samples, though rkde and lof exceed the available memory first (section 4.4). All the other algorithms show good and similar scalability, despite a higher base computation time for gwr and dpgmm due to the lack of code optimizations. The additional exponential-family computations do not seem to impact the complexity of this algorithm. Training abod consists only in making a copy of the training dataset, which explains the low training time reported. Its prediction time is however the least scalable, the true slope being observed for more than 5,000 samples.

In summary and looking at the overall measures, iforest and lsa show a very good training and prediction time scaling for both increasing numbers of features and samples, along with a very small base computation time.


Figure 7: Training time for increasing number of features (1,000 float samples, 10.0% outliers)

Figure 8: Prediction time for increasing number of features (1,000 float samples, 10.0% outliers)

dpgmm, gmm and bgm scale well on datasets with a large number of samples and could thus be suitable for systems where fast predictions matter. The base computation time of dpgmm is however an important issue when the number of features grows beyond a hundred. rkde, ocsvm and sod, which have good outlier detection performance on real datasets, are thus computationally expensive, which adds interest to iforest, dpgmm and simpler models such as gmm, kl, ppca or Mahalanobis.


Figure 9: Training time for increasing number of samples (2 float features, 10.0% outliers)

Figure 10: Prediction time for increasing number of samples (2 float features, 10.0% outliers)

4.4. Memory usage

We report in figures 11 and 12 the highest amount of memory required by each algorithm when applied to our synthetic datasets during the training or prediction phase. We clear the Matlab objects and make explicit collect calls to the Python and R garbage collectors before running the algorithms. We then measure the memory used by the corresponding running process before starting the algorithm and subtract it from the memory peak observed while running it. This way, our measurements ignore the memory consumption caused by the environment and the dataset, which reduces the measurement differences due


to the running environment, e.g. Matlab versus Python. Measures are taken at intervals of 10^-4 seconds using the memory profiler library for Python and R¹, and the UNIX ps command for Matlab. Small variations can be observed for measures smaller than 1 MB and are not meaningful.
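A simplified sketch of this protocol for the Python case is given below, using memory_profiler's memory_usage; run_training is a placeholder for an actual fit or predict call, and the sampling details are reduced compared to our measurement harness.

```python
from memory_profiler import memory_usage

def run_training():
    # Placeholder for a training call, e.g. model.fit(X_train).
    return [i * i for i in range(1_000_000)]

# Memory of the current process before the algorithm starts (MiB),
# sampled for a short period.
baseline = max(memory_usage(-1, interval=1e-4, timeout=0.1))

# Memory samples taken every 1e-4 seconds while the function runs.
samples = memory_usage((run_training, (), {}), interval=1e-4)
peak = max(samples)

print(f"Memory attributed to the algorithm: {peak - baseline:.1f} MiB")
```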

As depicted in figure 11, most algorithms consume little memory, an amount which does not significantly increase with the number of features and should not impact the running system. iforest, ocsvm, lof and lsa have a constant memory usage below 1 MB while rkde remains near-constant. gwr, abod and sod also have a good scalability. The other algorithms may require too much memory for high-dimensional problems, with Mahalanobis requiring about 4.5GB to store the mean and covariance matrices of 10,000 features. Allowing many more clusters and storing temporary data structures, dpgmm requires up to 80GB when applied to 2,600 features while bgm stores only 14GB of data for 10,000 features.

Figure 11: Memory usage for increasing number of features (1,000 float samples, 10.0% outliers)

The increasing number of samples has a higher impact on the RAM consumption depicted in figure 12. rkde and lof both run out of memory before the completion of the benchmark and reach, respectively, 158GB and 118GB of memory usage for 72,000 and 193,000 samples. sod consumes about as much memory as rkde, though it reaches the timeout with fewer samples. This amount of memory is mostly caused by the use of a pairwise distance matrix by these algorithms, which requires 76GB of RAM for 100,000 samples using 8 bytes per double-precision distance. The other methods scale much better and do not exceed 5GB for 10 million samples, except iforest and lsa which allocate 60GB and 38GB of RAM.

¹ rpy2 stores R objects in the running Python process. In addition, R prevents concurrent accesses, which does not allow us to use dedicated R commands to measure memory.


Table 4: Resistance to the curse of dimensionality, runtime scalability and memory scalability on datasets of increasing size and dimensionality. Performance on background noise detection is also reported for datasets of increasing noise proportion.

            Training/prediction time       Memory usage           Robustness
Algorithm   → Samples       → Features     → Samples  → Features  → Noise  High dim.  Stability
gmm         Low/Low         Medium/Medium  Low        Medium      High     Medium     Medium
bgm         Low/Low         Medium/Medium  Low        Medium      High     Medium     High
dpgmm       Medium/Low      High/High      Low        High        High     High       High
rkde        High/High       High/High      High       Low         High     High       High
ppca        Low/Low         High/Low       Low        Low         High     Medium     Medium
lsa         Low/Medium      Low/Low        Medium     Low         Low      Low        Medium
maha        Low/Medium      Medium/Low     Low        Medium      Medium   Low        High
lof         High/High       Low/Low        High       Low         Medium   High       High
abod        Low/High        Low/Medium     Low        Low         Medium   Low        Medium
sod         High/High       Low/Medium     High       Low         Low      High       Medium
kl          Low/Medium      Low/Medium     Low        Medium      High     Medium     High
gwr         Medium/Medium   Medium/Low     Low        Low         Low      High       Medium
ocsvm       High/High       Low/Low        Low        Low         Low      High       High
iforest     Low/Medium      Low/Low        Medium     Low         High     High       Medium

For the sake of completeness, we performed the same experiment with the k-nearest neighbors implementation of abod with k = 15, and observed a scalability and memory usage similar to rkde.

Figure 12: Memory usage for increasing number of samples (2 float features, 10.0% outliers)

We have seen that several algorithms have important memory requirements which must be carefully considered depending on the available hardware. Algorithms relying on multivariate covariance matrices will be heavily impacted by the growing number of features, while methods storing a pairwise distance matrix are not suitable for a large number of samples. Our implementation of dpgmm scales as well as other mixture models, though it comes with a much higher memory usage on high-dimensional datasets. ocsvm, gwr and abod have the best memory requirements and scalability and never exceed 250MB of RAM, at the cost of a higher computation time, since these three methods reach our 24 hours timeout.


4.5. Segmentation contours

Figure 13 shows the normalized density returned by each algorithm when applied to the scaled old faithful dataset. Warm colors depict high anomaly scores. The density was interpolated from the predicted scores of 2,500 points distributed on a 50x50 mesh grid.

Figure 13: Normalized anomaly scores predicted - Models are trained on the scaled old faithful dataset

The first two plots are the result of Gaussian mixture models, using 2 components for gmm and an upper bound of 10 components for dpgmm. While the plots are very similar, we note a slightly higher density area between the two clusters for dpgmm. This difference is caused by the remaining mixture components, for which the weight is close to 0 and the covariance matrix is based on the entire dataset. The information-theoretic algorithm based on the kl divergence predicts scores based on a gmm, which explains the similarity between the two.

The contribution of each observation to the overall density is clearly visible for rkde, which finds a density estimate tightly fitting the dataset. In contrast, the models used by ppca and the Mahalanobis distance are much more constrained and fail to identify the two clusters. lsa and lof perform much better, though they also assign very low anomaly scores to the sparse area located between the two clusters, even though these methods should be able to handle clusters of arbitrary shape.
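To make this contrast concrete, the following sketch (synthetic two-cluster data, not one of our benchmark datasets) scores the midpoint between two clusters with a single-Gaussian Mahalanobis distance and with lof: the former considers the midpoint perfectly nominal because it coincides with the global mean, while the latter flags the surrounding sparse region.

    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from sklearn.neighbors import LocalOutlierFactor

    # Two well-separated clusters.
    X = np.vstack([np.random.randn(150, 2) - 3, np.random.randn(150, 2) + 3])
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

    mid = np.array([[0.0, 0.0]])      # sparse area between the two clusters
    center = np.array([[3.0, 3.0]])   # dense area near a cluster centroid

    print(mahalanobis(mid[0], mu, cov_inv), mahalanobis(center[0], mu, cov_inv))
    # ~0 vs ~1: the single Gaussian ranks the empty midpoint as the most nominal point.
    print(-lof.score_samples(mid)[0], -lof.score_samples(center)[0])
    # lof instead assigns a clearly higher anomaly score to the midpoint.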

The density of abod is of great interest as it highlights some limitations of the method. In the case of a data distribution composed of several clusters, the lowest anomaly scores are located in the inter-cluster area instead of at the cluster centroids, due to the sole use of the variance and values of angles. This is caused by the large angles measured when an angle targets two points belonging to distinct


clusters, the small angles measured when both points belong to the same cluster, and thus a high overall variance for the inter-cluster area. In contrast, the dense areas surrounding the cluster centroids are assigned high anomaly scores, since many angles are directed toward the other clusters, suggesting data points that are isolated and far from a major cluster. The same issue arises for the inter-cluster area when the anomaly scores are computed with the alternative k-nearest neighbors approach instead of random sampling from the dataset.
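A simplified, unweighted sketch of the angle-based outlier factor reproduces this behaviour on synthetic two-cluster data (the actual abod formulation additionally weights each pair of points by the inverse of their distances):

    import numpy as np
    from itertools import combinations

    def abof(p, data):
        # Unweighted angle-based outlier factor: variance of the cosines of the angles
        # formed by p with every pair of other points. Low variance suggests an outlier.
        cosines = []
        for a, b in combinations(data, 2):
            u, v = a - p, b - p
            nu, nv = np.linalg.norm(u), np.linalg.norm(v)
            if nu > 1e-12 and nv > 1e-12:
                cosines.append(np.dot(u, v) / (nu * nv))
        return np.var(cosines)

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

    print(abof(np.array([0.0, 0.0]), data))    # inter-cluster point: highest variance, lowest anomaly
    print(abof(np.array([-3.0, -3.0]), data))  # cluster centroid: lower variance than the midpoint
    print(abof(np.array([8.0, 8.0]), data))    # distant outlier: lowest variance of the three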

sod is able to estimate a very accurate density, despite some low scores between the clusters. The neurons belonging to the gwr network result in circular blue areas highlighting their positions. The presence of a neuron at the center of the plot once again results in very low scores for the inter-cluster area, as low as those of the theoretical cluster centroids. Using an additional threshold to detect outlying neurons, as suggested in [21], would solve this issue.

A high density is also assigned by ocsvm to this region, due to the mapping of the continuous decision boundary from the high-dimensional space. The segmentation made by iforest is much tighter than that of the previous methods and seems less prone to overfitting than rkde. Light trails emerging from the clusters can however be observed and may result in anomalous observations receiving a lower score than patterns slightly deviating from the mean.

5. Conclusions

In the context of outlier detection, we benchmarked the average precision, robustness, computation time and memory usage of 14 algorithms on synthetic and real datasets. Our study demonstrates that iforest is an excellent method to efficiently identify outliers, showing very good scalability on large datasets along with an acceptable memory usage for datasets of up to one million samples. The results suggest that this algorithm is more suitable than rkde in a production environment, as the latter is much more computationally expensive and memory-consuming. ocsvm is a good candidate in this benchmark, but it is not suitable for large datasets either.

Sampling a small proportion of outliers from classification datasets, as suggested in [7], resulted in dense clouds of outliers which made simple methods such as gmm with one component, kl, ppca and the Mahalanobis distance perform well. However, these scalable solutions are not generally optimal for datasets where the nominal class has a complex shape or is distributed across several distinct clusters. In these cases, the results indicate that nonparametric clustering alternatives are superior.
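For reference, a minimal sketch of this benchmark construction is given below; the function and variable names are ours, not taken from [7].

    import numpy as np

    def make_outlier_benchmark(X, y, nominal_label, outlier_label, outlier_ratio=0.02, seed=0):
        # Keep one class as nominal data and subsample another class as outliers.
        rng = np.random.default_rng(seed)
        nominal = X[y == nominal_label]
        candidates = X[y == outlier_label]
        n_outliers = min(int(outlier_ratio * len(nominal)), len(candidates))
        picked = candidates[rng.choice(len(candidates), size=n_outliers, replace=False)]
        data = np.vstack([nominal, picked])
        labels = np.r_[np.zeros(len(nominal)), np.ones(n_outliers)]   # 1 = outlier
        return data, labels

Because the sampled outliers all originate from a single, coherent class, they tend to form the dense clouds mentioned above.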

sod showed good outlier detection performance and efficiently handled high-dimensional datasets, at the cost of poor scalability. Exponential-family representations for dpmm proved to be extremely time-consuming without substantially improving the detection of outliers compared with Gaussian-based approaches such as dpgmm. Nonetheless, the use of categorical distributions in dpmm resulted in a reduced computation time and better outlier detection performance. lof, abod, gwr, kl and lsa reached the lowest performance, while the three


first methods also showed poor scalability. The ability to accurately shape nominal data was assessed for each method, and a borderline case was highlighted for abod by studying the distribution of the anomaly scores on datasets composed of several clusters.

While the coverage of this study should suffice to tackle most outlier detection problems, specific algorithms may be preferred in constrained environments. Distributed implementations, streaming or mini-batch training are prerequisites for dealing with very large datasets, and several methods have been extended to support these features, e.g. dpmm, k-means, svm or gmm on Spark MLlib [20]. Other promising directions for research on outlier detection focus on ensemble learning [32] and on detecting outliers from multi-view data [11].
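As an illustration of the mini-batch strategy (using scikit-learn's MiniBatchKMeans as a stand-in; the distributed Spark MLlib estimators cited above follow the same fit-then-score pattern):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    model = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)

    for _ in range(100):                      # stream of mini-batches
        batch = np.random.randn(1024, 10)     # stand-in for a chunk read from disk
        model.partial_fit(batch)

    # A simple anomaly score: distance to the closest centroid.
    new_data = np.random.randn(5, 10)
    scores = model.transform(new_data).min(axis=1)
    print(scores)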

6. Acknowledgments

The authors wish to thank the Amadeus Middleware Fraud Detection team, directed by Virginie Amar and Florent Maupay, led by the product owner Christophe Allexandre and composed of Jihane Zouaoui, Jean-Blas Imbert, Jiang Wu and Damien Fontanes, for building and labeling the transactions and shared-access datasets. The authors are also grateful to the Amadeus PNR-ADI team, especially to Francesco Buonora, for their work in supplying a usable dataset.

7. References

[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.

[2] I. Ben-Gal. Outlier detection. In Data Mining and Knowledge Discovery Handbook, pages 131–146. Springer, 2005.

[3] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, Mar. 2006.

[4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Record, 29(2):93–104, May 2000.

[5] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240. ACM, 2006.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[7] A. Emmott, S. Das, T. Dietterich, A. Fern, and W.-K. Wong. Anomaly detection meta-analysis benchmarks. 2016. arXiv:1503.01158v2.


[8] W. Fan, N. Bouguila, and D. Ziou. Unsupervised anomaly intrusion detection via localized Bayesian feature selection. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM '11, pages 1032–1037. IEEE Computer Society, 2011.

[9] M. Filippone and G. Sanguinetti. Information theoretic novelty detection. Pattern Recognition, 43(3):805–814, 2010.

[10] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, Oct. 2004.

[11] T. Iwata and M. Yamada. Multi-view anomaly detection via robust probabilistic latent variable models. In Advances in Neural Information Processing Systems 29, NIPS '16, pages 1136–1144. Curran Associates, Inc., 2016.

[12] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, Oct. 2002.

[13] J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, Sep. 2012.

[14] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1–6, 1998.

[15] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27-30, 2009, Proceedings, PAKDD '09, pages 831–838. Springer, 2009.

[16] H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 444–452. ACM, 2008.

[17] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 413–422. IEEE Computer Society, 2008.

[18] H. Liu, S. Shah, and W. Jiang. On-line outlier detection and data cleaning. Computers & Chemical Engineering, 28(9):1635–1647, 2004.

[19] S. Marsland, J. Shapiro, and U. Nehmzow. A self-organising network that grows when required. Neural Networks, 15(8-9):1041–1058, Oct. 2002.

[20] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(1):1235–1241, Jan. 2016.


[21] A. Munoz and J. Muruzabal. Self-organizing maps for outlier detection. Neurocomputing, 18(1):33–60, 1998.

[22] V. Pihur, S. Datta, and S. Datta. RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10(1):62, Feb. 2009.

[23] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, June 2014.

[24] M. Prastawa, E. Bullitt, S. Ho, and G. Gerig. A brain tumor segmentation framework based on outlier detection. Medical Image Analysis, 8(3):275–283, 2004.

[25] J. A. Quinn and M. Sugiyama. A least-squares approach to anomaly detection in static and sequential data. Pattern Recognition Letters, 40:36–40, Apr. 2014.

[26] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, NIPS '00, pages 582–588. MIT Press, 2000.

[27] D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 1992.

[28] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the identification of masses in mammograms. In 4th International Conference on Artificial Neural Networks, pages 442–447. IET, 1995.

[29] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society. Series B (Methodological), 61(3):611–622, 1999.

[30] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations Newsletter, 15(2):49–60, June 2014.

[31] K. Worden, G. Manson, and N. Fieller. Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3):647–667, 2000.

[32] A. Zimek, R. J. Campello, and J. Sander. Ensembles for unsupervised outlier detection: Challenges and research questions (a position paper). SIGKDD Explorations Newsletter, 15(1):11–22, Mar. 2014.

[33] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.
