Chapter 4
A MULTI-LAYERED FRAMEWORK TO DETECT OUTLIERS AS NIDS
4.1 Introduction to Outliers
A network intrusion detection system serves as a second line of defense
behind intrusion prevention. The anomaly detection approach is important
for detecting new attacks. In my work, statistical and distance-based
approaches have been adapted to detect anomalous behavior for network
intrusion detection. This model has been evaluated with the kddcup99
data set. The results show that the connectivity-based scheme
outperforms current intrusion detection approaches in its capability to
detect attacks and its low false alarm rate.
Definition 1: An outlier is an observation that lies an abnormal distance
from other values in a random sample from a population. In a sense, this
definition leaves it up to the analyst (or a consensus process) to decide
what will be considered abnormal. Before abnormal observations can be
singled out, it is necessary to characterize normal observations.
Definition 2: An anomaly is a pattern in the data that does not conform
to the expected behaviour; anomalies are also referred to as outliers,
exceptions, peculiarities, surprises, etc. Anomalies translate to
significant real-life entities such as network intrusions. Outlier
detection is an active research problem in machine learning that aims to
find objects that are dissimilar, exceptional and inconsistent with
respect to the majority of normal objects in an input database. This is
one of the approaches used in intrusion detection systems.
A simple example of outlier objects is shown in Figure 4.1. Points o1
and o2 are outliers, points in region O3 are anomalies, and the regions
N1 and N2 contain normal objects. The outliers are far away from the
normal objects; most of the objects are normal, and only a few objects,
a small percentage of the total number, are outliers.
First, outliers can be classified as point outliers and collective
outliers, based on the number of data instances involved in the concept
of outliers.
4.1.1 Point Outliers
In a given set of data instances, an individual outlying instance is
termed a point outlier. This is the simplest type of outlier and is the
focus of the majority of existing outlier detection schemes [33,34]. A
data point is detected as a point outlier because it displays
outlier-ness in its own right, rather than together with other data
points. In most cases, data are represented as vectors, as in relational
databases. Each tuple contains a specific number of attributes.
Figure 4.1 Examples of outliers (regions N1 and N2 contain normal objects; o1, o2 and region O3 are outliers)
The principled method for detecting point outliers in vector-type
data sets is to quantify, through some outlier-ness metric, the extent
to which each single data point deviates from the other data in the data set.
4.1.2 Collective Outliers
A collective outlier represents a collection of data instances that is
outlying with respect to the entire data set [35]. The individual data
instances in a collective outlier may not be outliers by themselves, but
their joint occurrence as a collection is anomalous [33]. Usually, the
data instances in a collective outlier are related to each other. A
typical type of collective outlier is the sequence outlier [49,50],
where the data are in the format of an ordered sequence.
Outliers should be investigated carefully. They may contain
valuable information about the process under investigation or the data
gathering and recording process. Before eliminating these points from
the data, one should try to understand why they appeared and whether
similar values are likely to continue to appear. Of course, outliers are
often simply bad data points.
4.2 Challenges in Detecting Outlier Objects
• Defining a representative normal region is challenging
• The boundary between normal and outlying behaviour is often
not precise
• The exact notion of an outlier is different for different
application domains
• Availability of labelled data for training/validation
• Malicious adversaries
• Normal behaviour keeps evolving
4.3 Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, structural
• Output of anomaly detection
• Evaluation of anomaly detection techniques
A key observation that motivates this research is that outliers
existing in high-dimensional data streams [86,87] are embedded in some
lower-dimensional subspaces [90,91]. The existence of projected outliers
is due to the fact that, as the dimensionality of the data goes up, data
points tend to become equally distant from each other. As a result, the
differences in the data points' outlier-ness [121] become increasingly
weak and thus indistinguishable.
4.4 Multistage Framework to Detect Outliers (MFDO)
This framework has four layers for the IDS. In this framework both
supervised and unsupervised learning have been used; they are explained
here.
Significant outlier-ness of data can be observed only in moderate-
or low-dimensional subspaces. This phenomenon is commonly referred to as
the curse of dimensionality. In my research a framework has been
designed for detecting outliers, shown in Figure 4.2.
The architecture consists of four layers, and in each layer some
processing is done towards detecting the outliers. These layers are
explained below:
4.4.1 Layer One Description
The first step in this layer is reading the dataset; the next step is to
preprocess it. The preprocessing steps are given below:
i) sample the dataset if it is large
ii) remove any missing values
iii) normalize the samples
iv) select a feature subset
4.4.1.1 Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample, or
subset, of the data. Suppose that a large dataset, D, contains N tuples.
A simple random sample without replacement (SRSWOR) method has been
used for sampling the records.
An SRSWOR of size s is created by drawing s of the N tuples from
D (s < N), where the probability of drawing any tuple in D is 1/N; that
is, all tuples are equally likely to be sampled.
An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the size of the data set. Hence, sampling complexity is
potentially sublinear in the size of the data. Other data reduction
techniques can require at least one complete pass through D.
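As a concrete sketch of the SRSWOR step (illustrative only; the function and data names are mine, not from this work):

```python
import random

def srswor(dataset, s):
    """Simple random sample without replacement: draw s of the N tuples,
    with every tuple equally likely to be chosen and none drawn twice."""
    if s >= len(dataset):
        return list(dataset)
    # random.sample already samples without replacement
    return random.sample(dataset, s)

# Example: reduce a "large" dataset of 1000 tuples to a sample of 50
D = [(i, i % 7) for i in range(1000)]
sample = srswor(D, 50)
print(len(sample))             # 50
print(len(set(sample)) == 50)  # True: no tuple drawn twice
```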
Figure 4.2 Multistage Framework for IDS. (Layer-1: read and preprocess the dataset (sample, normalize, select a feature subset) and calculate the z-score for each continuous/categorical attribute. Layer-2: compare each z-score with a threshold value to separate normal objects from outlier objects. Layer-3: classify the outliers with a multi-class Bayesian Network classifier into types O_Type1 ... O_TypeN and compute the mean for each class. Layer-4: assign each test record to a class using the k-nearest-neighbor distance-based method.)
For a fixed sample size, sampling complexity increases only
linearly as the number of data dimensions, n, increases, whereas
techniques using histograms, for example, increase exponentially in n.
When applied to data reduction, sampling is most commonly used
to estimate the answer to an aggregate query.
It is possible to determine a sufficient sample size for estimating a
given function within a specified degree of error. This sample size, s,
may be extremely small in comparison to N. Sampling is a natural choice
for the progressive refinement of a reduced dataset; such a set can be
further refined by simply increasing the sample size.
4.4.1.2 Removing Missing Values
A missing value can signify a number of different things in data. Perhaps
the field was not applicable, the event did not happen, or the data was
not available. It could be that the person who entered the data did not
know the right value, or did not care whether a field was filled in.
Analysis Services therefore provides two distinctly different mechanisms
for managing and calculating these missing values, also known as null
values.
If classification modeling specifies that a column must never have
missing values, the NOT_NULL modeling flag should be used when defining
the mining structure. This ensures that processing will fail if a case
does not have an appropriate value. If an error occurs when processing a
model, the error can be logged and steps taken to correct the data
supplied to the model.
There are a variety of tools that can be used to infer and fill in
appropriate values, such as the lookup transformation or the Data
Profiler task in SQL Server Integration Services, or the Fill by Example
tool provided in the Data Mining Add-Ins for Excel. In this work, the
rows which have missing values were simply removed.
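The row-removal strategy adopted here can be sketched in plain Python (the field values are invented for illustration; any record containing a None or empty field is dropped):

```python
def drop_rows_with_missing(rows):
    """Remove any record that contains a missing (None or empty) field,
    mirroring the simple strategy used in this work."""
    return [r for r in rows if all(v is not None and v != "" for v in r)]

records = [
    ("tcp", "http", 215, 45076),
    ("udp", None, 146, 0),       # missing service -> dropped
    ("tcp", "smtp", "", 1337),   # empty field     -> dropped
    ("icmp", "ecr_i", 1032, 0),
]
clean = drop_rows_with_missing(records)
print(len(clean))  # 2
```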
4.4.1.3 Normalize samples
An attribute is normalized by scaling its values so that they fall within
a small specified range, such as 0.0 to 1.0. Standardization is
particularly useful for classification algorithms involving neural
networks, or for distance measurements such as nearest-neighbor
classification and clustering. For distance-based methods, data
standardization prevents attributes with initially large ranges from
outweighing attributes with initially smaller ranges.
There are many methods for data normalization. Here, z-score
normalization has been implemented. In z-score normalization (or
zero-mean normalization), the values for an attribute, A, are normalized
based on the mean and standard deviation of A. A value, v, of A is
normalized to v' by computing

v' = (v − mean(A)) / σ_A,   (4.1)

where mean(A) and σ_A are the mean and standard deviation, respectively,
of attribute A. This method of normalization is useful when the actual
minimum and maximum of attribute A are unknown, or when there are
outliers that dominate min-max normalization.
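Equation (4.1) can be sketched directly with the standard library (a minimal illustration using the population standard deviation; the variable names are mine):

```python
import statistics

def zscore_normalize(values):
    """z-score (zero-mean) normalization: v' = (v - mean) / stdev."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

src_bytes = [100, 200, 300, 400, 500]
normalized = zscore_normalize(src_bytes)
print(normalized[2])  # 0.0  (the mean maps to zero)
```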
4.4.1.4 Select features subset
Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant. Leaving out
relevant attributes or keeping irrelevant attributes may be detrimental,
causing confusion for the mining algorithm employed. This can result in
discovered patterns of poor quality. In addition, the added volume of
irrelevant or redundant attributes can slow down the mining process.
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (dimensions). The goal of attribute
subset selection is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: it
reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
For n attributes, there are 2^n possible subsets. An exhaustive search
for the optimal subset of attributes can be prohibitively expensive,
especially as n and the number of data classes increase. Therefore,
heuristic methods that explore a reduced search space are commonly
used for attribute subset selection.
These methods are typically greedy in that, while searching
through attribute space, they always make what looks to be the best
choice at the time; their strategy is to make a locally optimal choice
in the hope that this will lead to a globally optimal solution. Such
greedy methods are effective in practice and may come close to
estimating an optimal solution.
The best (and worst) attributes are typically determined using
tests of statistical significance, which assume that the attributes are
independent of one another. Many other attribute evaluation measures
can be used, such as the information gain measure used in building
decision trees for classification.
Most previous work on feature selection emphasized only the
reduction of the high dimensionality of the feature space. But in cases
where many features are highly redundant with each other, we must
utilize other means, for example, more complex dependence models such as
Bayesian network classifiers.
In my work, a new information gain and divergence-based feature
selection method has been introduced. The feature selection method
strives to reduce redundancy between features while maintaining
information gain in selecting appropriate features for classification.
Entropy is one of the greedy feature selection methods, and
conventional information gain is commonly used in feature selection for
classification or clustering models [113]. Moreover, our feature
selection method sometimes yields greater improvements for conventional
machine learning algorithms than for support vector machines, which are
known to give the best classification accuracy. Entropy is defined as

Entropy(t) = − Σ_j p(j|t) log2 p(j|t)

4.4.2 Layer Two Description
Here the separation of objects into normal objects and outliers is
done using the z-score measure, which is explained below.
There are many statistical methods to detect outlier data objects;
the z-score method has been used here. The z-score measure can be
defined as follows:

z-score = (v − mean(A)) / σ_A,

where mean(A) and σ_A are the mean and standard deviation, respectively,
of attribute A. The z-score is calculated for the selected feature
subset. For each attribute a threshold value is set to determine whether
a particular instance is an outlier. If the majority of the z-scores
indicate that an instance is an outlier, then that object is an outlier.
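The majority-vote separation described above can be sketched as follows (a minimal illustration; the per-attribute threshold of 1.5 and all column statistics are assumed values for the example):

```python
def is_outlier(record, means, stdevs, threshold=1.5):
    """Majority vote over per-attribute z-scores: the record is flagged
    as an outlier if most attributes satisfy |z| >= threshold."""
    votes = 0
    for v, m, s in zip(record, means, stdevs):
        z = (v - m) / s if s else 0.0
        if abs(z) >= threshold:
            votes += 1
    return votes > len(record) / 2

# Toy column means/stdevs, as if computed from a training set
means = [300.0, 50.0]
stdevs = [100.0, 10.0]
print(is_outlier([310.0, 52.0], means, stdevs))  # False: z = 0.1 and 0.2
print(is_outlier([900.0, 95.0], means, stdevs))  # True: z = 6.0 and 4.5
```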
In this research, the problem of detecting outliers from high-
dimensional data streams has been studied. This problem can be
formulated as follows: given a dataset D with a potentially unbounded
size of n-dimensional data points, for each data point xi = {xi1, xi2, . . . ,
xin} in D, outlier detection method performs a mapping as
f : xi → (b, Si, Z-Scorei)
Each data point xi is mapped to a triplet (b, Si, Z-Scorei), where b is
a Boolean variable indicating whether or not xi is a projected outlier.
If xi is a projected outlier, then b is true, Si is the set of outlying
subspaces of xi, and Z-Scorei is the corresponding outlier-ness score of
xi in each subspace of Si. If xi is a normal data point, then b is
false, Si = ∅ and Z-Scorei is not applicable.
In my work the z-score threshold has been set to ±1.5 for
separating the objects. Once the objects are separated into normal
objects and outliers, the outliers can be classified further using a
multi-class classifier.
Here a Bayesian classifier [60,61,62] has been used for classifying
the outliers. After classifying the outliers, the mean value for each
class group is calculated. The k-nearest neighbor method is then applied
between the class mean values and the test tuples to determine the class
of unseen test records. The Bayesian network is briefly explained below.
4.4.3 Layer Three: Bayesian Network Classifier
These networks are directed acyclic graphs that allow an efficient and
effective representation of the joint probability distribution over a
set of random variables. Each vertex in the graph represents a random
variable, and edges represent direct correlations between the variables.
More precisely, the network encodes the following conditional
independence statements: each variable is independent of its
non-descendants in the graph, given the state of its parents.
These independencies are then exploited to reduce the number of
parameters needed to characterize a probability distribution, and to
efficiently compute posterior probabilities given evidence. Probabilistic
parameters are encoded in a set of tables, one for each variable, in the
form of local conditional distributions of a variable given its parents.
Using the independence statements encoded in the network, the joint
distribution is uniquely determined by these local conditional
distributions.
Let U = {x1, x2, . . . , xn}, n ≥ 1, be a set of variables. A Bayesian
network B over the set of variables U is a network structure BS, which
is a directed acyclic graph (DAG) over U, together with a set of
probability tables for BS.
To use a Bayesian network as a classifier, one simply calculates
argmax_y P(y|x) using the distribution P(U) represented by the Bayesian
network:

P(y|x) = P(U)/P(x) ∝ P(U) = Π_{u∈U} p(u | pa(u))

Since all variables in x are known, we do not need complicated inference
algorithms; we can simply evaluate this expression for all class values.
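As a deliberately simplified illustration of this classification rule, the naive Bayes classifier is the special case of a Bayesian network in which the class node is the only parent of every attribute, so P(U) factorizes into P(y) Π_i P(x_i | y). A minimal sketch on toy categorical data (no smoothing; all names and values are invented for the example):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Learn P(y) and P(x_i | y) from categorical data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond, len(labels)

def classify_nb(prior, cond, n, row):
    """argmax_y P(y) * prod_i P(x_i | y)."""
    best, best_p = None, -1.0
    for y, cy in prior.items():
        p = cy / n
        for i, v in enumerate(row):
            p *= cond[(i, y)][v] / cy
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("tcp", "S0"), ("tcp", "SF"), ("udp", "SF"), ("tcp", "S0")]
labels = ["attack", "normal", "normal", "attack"]
prior, cond, n = train_nb(rows, labels)
print(classify_nb(prior, cond, n, ("tcp", "S0")))  # attack
```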
4.4.3.1 Bayesian Network learning algorithm
The dual nature of a Bayesian network makes learning it a natural
two-stage process: first learn the network structure, and then learn the
probability tables.
There are various approaches to structure learning; here, conditional
independence testing has been used as the learning method, which is
briefly described below.
4.4.3.2 Learning Network Structure
Learning a BN from data is unsupervised learning, in the sense that
the learner does not distinguish the class variable from the attribute
variables in the data. The objective is to induce a network that best
describes the probability distribution over the training data. This
optimization process is implemented by using heuristic search techniques
to find the best candidate over the space of possible networks. The
search process relies on a scoring function that assesses the merits of
each candidate network.
4.4.3.3 Learning based on Conditional Independence Tests
This method stems mainly from the goal of uncovering causal structure.
The assumption is that there is a network structure that exactly
represents the independencies in the distribution that generated the
data. It then follows that if a (conditional) independency between two
variables can be identified in the data, there is no arrow between those
two variables.
Once the locations of the edges are identified, the direction of the
edges is assigned such that the conditional independencies in the data
are properly represented. The results of the model at this stage are
shown below:
Correctly Classified Instances 21969 98.6972 %
Incorrectly Classified Instances 290 1.3028 %
Mean absolute error 0.0062
Root mean squared error 0.0618
Relative absolute error 3.0386 %
Root relative squared error 19.3229 %
Total Number of Instances 22259
4.5 Layer Four Task Description
Here the classification of new records is done using distance-based
measures, which are explained below.
4.5.1 kNN Distance Based Method
After separating the records into normal and attacks, the test tuples
can be predicted using the k-NN distance-based method. There have been a
number of different ways of defining outliers from the perspective of
distance-related metrics.
Most existing metrics used in distance-based outlier detection
techniques are defined based upon the concepts of the local neighborhood
or k nearest neighbors (kNN) of the data points. The notion of
distance-based outliers does not assume any underlying data distribution
and generalizes many concepts from distribution-based methods. Moreover,
distance-based methods scale better to multi-dimensional space and can
be computed much more efficiently than statistical methods.
In distance-based methods [93,94], the distance between data points
needs to be computed. We can use any of the Lp metrics, such as the
Manhattan or Euclidean distance metrics [95], for measuring the distance
between a pair of points.
Alternatively, for some application domains with categorical data
(e.g., text documents), non-metric distance functions can also be used,
making the distance-based definition of outliers very general. Data
normalization is normally carried out to normalize the different scales
of the data features before outlier detection is performed.
4.5.2 Local Neighborhood Methods
The first notion of distance-based outliers, called the DB(k, λ)-Outlier
[89], is defined as follows. A point p in a data set is a
DB(k, λ)-Outlier [96], with respect to the parameters k and λ, if no
more than k points in the data set are at a distance λ or less (i.e.,
within the λ-neighborhood) from p. This definition of outliers is
intuitively simple and straightforward. The major disadvantage of this
method, however, is its sensitivity to the parameter λ, which is
difficult to specify a priori.
As we know, when the data dimensionality increases, it becomes
increasingly difficult to specify an appropriate circular local
neighborhood (delimited by λ) for the outlier-ness evaluation of each
point, since most of the points are likely to lie in a thin shell about
any point [24].
Thus, a very small λ will cause the algorithm to detect all points as
outliers, whereas no point will be detected as an outlier if too large a
λ is picked. In other words, one needs to choose an appropriate λ with a
very high degree of accuracy in order to find a modest number of points
that can then be defined as outliers.
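A brute-force sketch of the DB(k, λ)-Outlier definition (quadratic in the number of points, fine for illustration but not for large datasets; the toy points are my own):

```python
import math

def db_outliers(points, k, lam):
    """DB(k, lam)-Outliers: p is an outlier if no more than k points lie
    within distance lam of p (p itself excluded)."""
    outliers = []
    for p in points:
        neighbours = sum(
            1 for q in points
            if q is not p and math.dist(p, q) <= lam
        )
        if neighbours <= k:
            outliers.append(p)
    return outliers

# A tight cluster around the origin plus one distant point
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (10, 10)]
print(db_outliers(pts, k=1, lam=1.0))  # [(10, 10)]
```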
To facilitate the choice of parameter values, this first
neighborhood distance-based outlier definition was extended and the
so-called DB(pct, dmin)-Outlier was proposed, which defines an object in
a dataset as a DB(pct, dmin)-Outlier if at least pct% of the objects in
the dataset are at a distance larger than dmin from this object.
Similar to the DB(k, λ)-Outlier, this method essentially delimits the
neighborhood of data points using the parameter dmin and measures the
outlier-ness of a data point based on the percentage, instead of the
absolute number, of data points falling into this specified local
neighborhood. DB(pct, dmin) is quite general and is able to unify the
existing statistical detection methods using discordancy tests for
outlier detection. The specification of pct is obviously more intuitive
and easier than the specification of k.
4.5.3 Extension to the kNN-distance Method: e-kNN
An extension from the k nearest objects to the k nearest dense regions
has been proposed. This method employs the sum of the distances between
each data point and its k nearest dense regions to rank data points.
This enables the algorithm to measure the outlier-ness of data points
from a more global perspective.
From the local perspective, humans examine the point's immediate
neighborhood and consider it an outlier if its neighborhood density is
low. The global observation considers the dense regions where the data
points are densely populated in the data space.
Specifically, the neighboring density of the point serves as a good
indicator of its outlying degree from the local perspective. In the left
sub-figure of Figure 4.3, two square boxes of equal size are used to
delimit the neighborhoods of points p1 and p2.
Because the neighboring density of p1 is less than that of p2, the
outlying degree of p1 is larger than that of p2. On the other hand, the
distance between a point and the dense regions reflects the similarity
between this point and the dense regions. Intuitively, the larger such a
distance is, the more remarkably the point deviates from the main
population of the data points and therefore the higher its outlying
degree is; otherwise it is not an outlier. In the right sub-figure of
Figure 4.3, we can see a dense region and two outlying points, p1 and
p2. Because the distance between p1 and the dense region is larger than
that between p2 and the dense region, the outlying degree of p1 is
larger than that of p2.

Figure 4.3: Local and global perspectives of outlier-ness

4.5.4 Global Outlier-ness Metrics
The global outlier-ness metrics refer to those metrics that measure the
outlier-ness of data points from the perspective of how far away they
are from the dense regions (or clusters) of data in the data space
[135,136]. The most intuitive global outlier-ness metric would be the
distance between a data point and the centroid of its nearest cluster.
This metric can easily be extended to consider the k nearest
clusters (k > 1) [142]. For easy referral, we term these metrics the kNN
Clusters metric. The Largest Cluster metric used in [142] is also a
global outlier-ness metric.
4.6 Validation Methods
Different validation measures are used to check the
performance of the model; they are explained here.
4.6.1 k-Relative Distance (kRD)
In this study, we use the metric of k-Relative Distance (kRD) in the
validation method for identifying anomalies. Given parameter k, the kRD
of a point p in subspace s is defined as

kRD(k, p, s) = k_dist(p, s) / average_k_dist(D(p), s),

where k_dist(p, s) is the sum of distances between p and its k nearest
neighbors, and average_k_dist(D(p), s) is the average of such distance
sums for the data points that arrive before p. Using kRD, we can
overcome the limitations of the kNN_Clusters and Largest Cluster
metrics. Specifically, we can achieve a better outlier-ness measurement
than the kNN_Clusters metrics when the shapes of clusters are rather
irregular.
In addition, we can relax the rigid assumption of a single-mode
distribution of normal data and lessen the parameter sensitivity of the
Largest Cluster method. The advantages of using kRD are presented in
detail as follows:
1. Firstly, multiple cluster representative points are used in
calculating kRD. This enables kRD to deliver a more accurate
measurement when the cluster shapes are irregular.
The representative points, being well scattered within the cluster,
are more robust in capturing the overall geometric shape of the
clusters, so that the distance metric can deliver a more accurate
measurement of the distance between a data point and the dense regions
of the data set, while using the centroid is incapable of reliably
achieving this.
For instance, in the case of an elongated cluster, as illustrated in
Figure 4.4, it is obvious that data point o is an outlier while p is a
normal data point. However, the distance between p and the cluster
centroid is even larger than that for o (see Figure 4.4(a)).
This will not happen if multiple representative points are used
instead. In Figure 4.4(b), the representative points in this cluster are
marked in red, and we can see that the average distance between p and
its nearest representative points is less than that of o, which is
consistent with human perception;
2. Secondly, we do not limit the normal data to reside only in the
largest cluster, as in the Largest Cluster method. kRD provides the
flexibility to deal with both single and multiple modes for the normal
data distribution.
3. Finally, we employ the concept of a distance ratio, instead of the
absolute distance value, in kRD to measure the outlier-ness of data
points. Also, specifying a threshold parameter based on the distance
ratio is more intuitive and easier than a threshold based on the
absolute distance.
Figure 4.4 Using Centroid and Representative Points to measure Outlierness of Data Points
The steps for computing kRD in the validation method for a data point p
in subspace s are as follows:
1. k-means clustering is performed using D(p) in s to generate k clusters;
2. Find Nr representative points from all the clusters using the method
proposed in [93]. The exact number of representative points for
different clusters can be proportional to the size of the clusters;
3. Compute the average distance between each data point in the horizon
before p and its k respective nearest representative points;
4. Compute the distance between p and its k nearest cluster
representative points;
5. Compute kRD as

kRD(k, p, s) = k_dist(p, s) / average_k_dist(D(p), s)
To study the quality of the anomalies identified by different candidate
validation methods, we need objective quality metrics. Two major metrics
are commonly used to measure the quality of clusters: Mean Square Error
(MSE) and Silhouette Coefficient (SC) [142].
4.6.2 Mean Square Error (MSE)
MSE [76] is a measure of the compactness of clusters. The MSE of a
cluster c is defined as the intra-cluster sum of distance of c in
subspace s, that is, the sum of distances between each data point in
cluster c and its centroid, i.e.,

MSE(c, s) = Σ_i dist(p_i, c_0, s), where c_0 = (Σ_i p_i) / |c|

where |c|, p_i and c_0 correspond to the number of data points within
cluster c in s, the ith data point within cluster c, and the centroid of
cluster c in s, respectively. The overall quality of the whole
clustering result C in s is quantified by the total intra-cluster sum of
distance of all the clusters we obtain in s. That is,

MSE(C, s) = Σ MSE(c, s), for c є C

A lower value of MSE(C, s) indicates a better clustering quality in s. A
major problem with MSE is that it is sensitive to the selection of the
cluster number; MSE will decrease monotonically as the number of
clusters increases.
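A direct transcription of these formulas (note that, as defined here, "MSE" is an intra-cluster sum of distances rather than the averaged squared error the name usually suggests; the toy clusters are my own):

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def cluster_mse(cluster):
    """Intra-cluster sum of distances between each point and the centroid."""
    c0 = centroid(cluster)
    return sum(math.dist(p, c0) for p in cluster)

def total_mse(clustering):
    """MSE(C, s): total intra-cluster sum of distance over all clusters."""
    return sum(cluster_mse(c) for c in clustering)

tight = [(0, 0), (0, 1), (1, 0), (1, 1)]
loose = [(0, 0), (0, 4), (4, 0), (4, 4)]
print(total_mse([tight]) < total_mse([loose]))  # True: tighter is better
```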
4.6.3 Silhouette Coefficient (SC)
SC is another useful measurement used to judge the quality of clusters
[115,139]. The underlying rationale of SC is that, in a good clustering,
the intra-cluster data distance [137] should be smaller than the
inter-cluster data distance. The calculation of the SC of a clustering
result C in subspace s takes the following steps:
1. Calculate the average distance between p and all other data points in
its cluster in s, denoted by a(p, s), i.e.,
a(p, s) = Σ dist(p, pi, s)/|c|, for c є C and pi є c;
2. Calculate the average distance between p and the data points in each
other cluster that does not contain p in s. Find the minimum such
distance amongst all these clusters, denoted by b(p, s), i.e.,
b(p, s) = min(Σ dist(p, pi, s)/|cj|), for cj є C, p ∉ cj and pi є cj;
3. The SC of p in s, denoted by SC(p, s), can be computed by

SC(p, s) = (b(p, s) − a(p, s)) / max(a(p, s), b(p, s))
The value of SC can vary between -1 and 1. It is desirable for SC to be
positive and as close to 1 as possible. The major advantage of SC over
MSE is that SC is much less sensitive to the number of clusters. Given
this advantage, we will use SC as the anomaly quality metric in
selecting the validation method.
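The three steps can be sketched for a single point p as follows (a minimal illustration with invented clusters; in a full computation each point's own cluster would exclude p when averaging):

```python
import math

def silhouette(p, own_cluster, other_clusters):
    """SC(p) = (b - a) / max(a, b): a = average distance to the other
    points of p's cluster, b = smallest average distance to any other
    cluster."""
    a = sum(math.dist(p, q) for q in own_cluster if q != p) / (len(own_cluster) - 1)
    b = min(
        sum(math.dist(p, q) for q in c) / len(c)
        for c in other_clusters
    )
    return (b - a) / max(a, b)

c1 = [(0, 0), (0, 1), (1, 0)]
c2 = [(10, 10), (10, 11), (11, 10)]
sc = silhouette((0, 0), c1, [c2])
print(0 < sc <= 1)  # True: well-separated clusters give SC close to 1
```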
4.6.3.1 Steps for Anomaly Classification
Once constructed, the signature subspace lookup table can be used to
classify anomalies in the data stream [116,117,118]. Each anomaly is
classified into one or more possible attack classes, or the
false-positive class, based on the class membership probabilities of the
anomaly.
The class membership probability of an anomaly o with respect to
class ci of C is computed as

pr(o, ci) = sim(o, ci)/Σi sim(o, ci) × 100%, where ci є C
The higher the probability for a class, the greater the chance
that the anomaly falls into this class.
An anomaly o can be classified into a class ci if pr(o, ci) ≥ τ, where
τ is the similarity threshold [119]. As a result, under a given τ, it is
possible that an anomaly o is classified into a few classes, instead of
one, if the similarities are high enough. The set of classes that o may
fall into, denoted as class(o), is defined as

class(o) = {ci, where pr(o, ci) ≥ τ, ci є C}
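The membership computation and thresholding can be sketched as follows (the similarity values are invented for illustration, and sim(o, ci) itself is assumed to be given by the signature subspace lookup):

```python
def class_memberships(sims, tau):
    """pr(o, ci) = sim(o, ci) / sum_i sim(o, ci); o is assigned every
    class whose membership probability reaches the threshold tau."""
    total = sum(sims.values())
    probs = {c: s / total for c, s in sims.items()}
    assigned = {c for c, p in probs.items() if p >= tau}
    return probs, assigned

# Similarities of one anomaly o to each signature class (toy values)
sims = {"DoS": 0.6, "Probe": 0.3, "R2L": 0.1}
probs, classes = class_memberships(sims, tau=0.25)
print(sorted(classes))  # ['DoS', 'Probe']: both reach the threshold
```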
4.6.3.2 Handling False Positives
False positives, also called false alarms, are anomalies that are
erroneously reported as attacks by the system [117,138]. Even though
they are benign in nature and not harmful compared to real attacks,
false positives consume a fair amount of the human effort spent on
investigating them, making it almost impossible for security officers to
concentrate only on the real attacks.
Generally, among all the anomalies detected, up to 90% may be false
positives. It is highly desirable to quickly screen out these false
positives in order to allow closer attention to be paid to the real,
harmful attacks.
4.7 Experimental Results
The real-life kddcup99 dataset has been used for performance evaluation
in our experiments. All the experimental evaluations were carried out on
a Pentium IV PC with 1 GB of RAM.
Many experiments have been conducted on the kddcup'99 dataset to
evaluate the performance of the proposed model. The results are shown in
Figures 4.5 and 4.6, which show the assignment of objects to clusters.
In Figure 4.5 the number of clusters is two, and the records are clearly
separated with small error.
The first cluster holds the normal-type records and the second
cluster the outlier-type records, with a sum of squared errors (SSE) of
6468.92.
Figure 4.6 shows the assignment of objects to five clusters, where the
first cluster is for the normal type and the remaining clusters include
the DoS, Probe, R2L and U2R outlier records respectively, with an SSE of
4534.6. These experimental summary results are shown in Table 4.1.
Table 4.1 The results after assigning the objects to clusters

Attribute  | Full Data      | Cluster-0 | Cluster-1
-----------|----------------|-----------|----------
Service    | Private        | http      | Private
Flag       | S0             | SF        | S0
Src_bytes  | 2704           | 6765      | 28
Dst_bytes  | 5350           | 13471     | 0.696
Logged_in  | 0.3778         | 0.9514    | 0
Count      | 118            | 8         | 190
Srv_count  | 11             | 10        | 12
Class      | Normal/outlier | Normal    | Outliers
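The two-cluster separation summarized in Table 4.1 can be illustrated with a minimal k-means sketch. This is not the thesis implementation: the helper names are invented and the toy (src_bytes, count) feature vectors merely mimic the contrast between normal and outlier records.

```python
# Minimal k-means sketch (k=2) separating normal-like from outlier-like
# records; data and names are illustrative, not the kddcup99 values.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iters=20, seed=1):
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    # sum of squared errors: distance of every point to its nearest center
    sse = sum(min(dist2(p, c) for c in centers) for p in points)
    return clusters, sse

normal = [(6700, 8), (6800, 9), (6750, 10)]    # high src_bytes, low count
outlier = [(30, 190), (25, 185), (28, 195)]    # low src_bytes, high count
clusters, sse = kmeans(normal + outlier, k=2)
sizes = sorted(len(c) for c in clusters)
print(sizes)  # [3, 3]: normals and outliers land in separate clusters
```

With k=5, the same loop would further split the outlier cluster, which is how the per-attack-type clusters of figure 4.6 arise.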
Number of iterations: 7
Within cluster sum of squared errors: 4534.46865

Figure 4.5 Assignment of Objects to Clusters with k=2

Table 4.2 Percentage of Objects in the clusters

Cluster No | Number of Instances | % of instances
-----------|---------------------|---------------
0          | 8840                | 40
1          | 13419               | 60
Figure 4.6 Assignment of Objects to Clusters with k=5

4.8 Scalability Study of the Model

The scalability study investigates the scalability of both the learning
and detection processes with respect to the length N and dimensionality
ϕ of the data set. Experiments with different values of N and ϕ were
conducted in all scalability experiments.
4.8.1 Scalability of Learning Process with respect to N
Figure 4.7 shows the scalability of the unsupervised learning process
with respect to the number of training data N. The major tasks involved
in the unsupervised learning process are multi-run clustering of the
training data and selection of the top training data that have the
highest outlying degree. The lead clustering method we use requires
only a single scan of the training data, and the number of top training
data we choose is usually linearly dependent on N. Therefore, the
overall complexity of the unsupervised learning process scales linearly
with respect to N.
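The single-scan clustering mentioned above can be sketched as a leader-style pass: each record either joins the first existing cluster representative within a distance threshold or becomes a new representative. This is one plausible reading of the method, shown with invented names, 1-D points, and an assumed threshold parameter.

```python
# Single-scan "leader"-style clustering sketch: one pass over the N training
# records, so cost is O(N * number_of_leaders). Illustrative only.

def leader_cluster(points, threshold):
    """Each point joins the first leader within `threshold`, else it
    becomes a new leader; the data is scanned exactly once."""
    leaders = []   # representative points
    members = []   # members[i] = points assigned to leaders[i]
    for p in points:
        for i, q in enumerate(leaders):
            if abs(p - q) <= threshold:   # 1-D distance for simplicity
                members[i].append(p)
                break
        else:
            leaders.append(p)
            members.append([p])
    return leaders, members

leaders, members = leader_cluster([1.0, 1.2, 9.0, 9.3, 0.9], threshold=1.0)
print(leaders)  # [1.0, 9.0]
```

Because no record is revisited, the running time grows linearly with N whenever the number of leaders stays small, which matches the linear trend reported in figure 4.7.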
Figure 4.7 Learning Time in seconds versus Number of Records
4.8.2 Scalability of Learning Process with respect to ϕ
Figure 4.8 shows the results, which indicate an exponential growth in
the execution time of the learning process when ϕ is increased from 20
to 40 under a fixed search workload.
Figure 4.8 Learning Time in seconds versus Number of Dimensions

4.8.3 Scalability of Detection Process with respect to N
In Figure 4.10, we present the scalability result of the detection
process with respect to stream length N. In this experiment, the stream
length N is set much larger than the number of training data in order to
study the performance of the model in coping with large data streams.
Figure 4.10 shows a promising linear behavior of the detection process
when handling an increasing amount of streaming data. This is because
the detection process needs only one scan of the streaming data.
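The one-scan detection loop can be sketched as follows: each arriving record is scored once against a fixed model produced by the learning phase and never revisited, so the cost is linear in N. The z-score rule, threshold, and statistics here are assumed for illustration, not taken from the thesis code.

```python
# One-pass stream detection sketch: O(N) because every record is scored
# exactly once against fixed model statistics. Values are hypothetical.

def detect_stream(stream, mean, std, z_threshold=3.0):
    """Yield records whose |z-score| exceeds the threshold."""
    for x in stream:                  # single scan of the stream
        z = (x - mean) / std
        if abs(z) > z_threshold:
            yield x

model_mean, model_std = 100.0, 10.0   # assumed statistics from training
outliers = list(detect_stream([98, 105, 250, 101, 12],
                              model_mean, model_std))
print(outliers)  # [250, 12]
```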
Figure 4.9 Learning Time in seconds versus Number of Dimensions
Figure 4.10 Detection Time in seconds versus Number of Records
4.8.4 Scalability of Detection Process with respect to ϕ
The dimensionality ϕ of a data stream affects the size of the data used
in the detection process. When MaxDimension is fixed, the size of the
data is in an exponential order of ϕ, but the execution time of the
detection process is expected to grow linearly with respect to ϕ. In
this experiment, we used different dimensions with a fixed number of
records, and we can see a linear behavior of the detection process. We
repeated the same experiment with the KDDCUP'99 data set with
dimensions of 8, 15, 20, 25, 30 and 42. The results are presented in
Figure 4.11, with dimensions on the x-axis and detection time on the
y-axis.
Figure 4.11 Detection Time in seconds versus Number of Dimensions
4.9 Chapter Summary
Even though the problem of outlier detection has been studied
intensively in the past few years, the outlier detection problem in
high-dimensional data streams has rarely been investigated. With the
current state-of-the-art methods it is difficult to find outliers in
the context of data streams, because these methods are either
constrained to low-dimensional data or unable to incrementally update
the detection model for real-time data streams. To the best of our
knowledge, there has been little research work in the literature that
particularly targets this problem.
This research work is motivated by the unique characteristics of data
in high-dimensional space. Owing to the curse of dimensionality, the
outliers existing in high-dimensional data sets (including data
streams) are embedded in lower-dimensional subspaces. Such outliers are
termed projected outliers. In this thesis, the problem of projected
outlier detection for multi- and high-dimensional data streams has been
introduced and formally formulated.

To solve this problem, a new framework, called the Multistage Framework
for Detecting Outliers (MFDO), has been presented. MFDO utilizes a
compact data synopsis to capture the necessary statistical information,
using z-score values for outlier detection. The result is better
performance with simple calculations, using a nearest-neighbor-based
approach that is easy to compute.
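The compact-synopsis idea in the summary can be illustrated with a running count/mean/variance accumulator (Welford's update), which yields z-scores without storing the stream. This sketch is an assumed interpretation for illustration, not the MFDO implementation itself.

```python
# Compact synopsis sketch: O(1) memory and O(1) work per record are enough
# to compute z-scores over a stream. Illustrative, not the thesis code.
import math

class Synopsis:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        """Welford's incremental update of count, mean, and sum of squares."""
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def zscore(self, x):
        """Standardize x against the statistics seen so far (n >= 2)."""
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std

s = Synopsis()
for v in [10, 12, 11, 9, 10, 48]:
    s.update(v)
print(s.zscore(48) > 2.0)  # True: 48 is far above the running mean
```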