Chapter 4
A MULTI-LAYERED FRAMEWORK TO DETECT OUTLIERS AS NIDS
4.1 Introduction to Outliers
A network intrusion detection system serves as a second line of defense
behind intrusion prevention. The anomaly detection approach is important
for detecting new attacks. In my work, statistical and distance-based
approaches have been adapted to detect anomalous behavior for network
intrusion detection. This model has been evaluated with the kddcup99
data set. The results show that the connectivity-based scheme
outperforms current intrusion detection approaches in its capability to
detect attacks and its low false alarm rate.
Definition 1: An outlier is an observation that lies an abnormal distance
from other values in a random sample from a population. In a sense, this
definition leaves it up to the analyst (or a consensus process) to decide
what will be considered abnormal. Before abnormal observations can be
singled out, it is necessary to characterize normal observations.
Definition 2: An anomaly is a pattern in the data that does not conform
to the expected behaviour; anomalies are also referred to as outliers,
exceptions, peculiarities, surprises, etc. Anomalies translate to
significant real-life entities such as network intrusions. Outlier
detection is an active research problem in machine learning that aims to
find objects that are dissimilar, exceptional and inconsistent with
respect to the majority of normal objects in an input database. This is
one of the approaches used in intrusion detection systems.
A simple example of outlier objects is shown in Figure 4.1. Points o1
and o2 are outliers, points in region O3 are anomalies, and the regions
N1 and N2 contain normal objects. The outliers are far away from the
normal objects; most of the objects are normal, and only a few objects,
a small percentage of the total number, are outliers.
First, outliers can be classified as point outliers and collective
outliers, based on the number of data instances involved in the concept
of outliers.
4.1.1 Point Outliers
In a given set of data instances, an individual outlying instance is
termed a point outlier. This is the simplest type of outlier and is the
focus of the majority of existing outlier detection schemes [33,34]. A
data point is detected as a point outlier because it displays
outlier-ness in its own right, rather than together with other data
points. In most cases, data are represented as vectors, as in relational
databases. Each tuple contains a specific number of attributes.
Figure 4.1 Examples of outliers (regions N1 and N2 contain normal objects; o1, o2 and region O3 are outliers)
The principled method for detecting point outliers in vector-type
data sets is to quantify, through some outlier-ness metric, the extent
to which each single data point deviates from the other data in the data set.
4.1.2 Collective Outliers
A collective outlier represents a collection of data instances that is
outlying with respect to the entire data set [35]. The individual data
instances in a collective outlier may not be outliers by themselves, but
their joint occurrence as a collection is anomalous [33]. Usually, the
data instances in a collective outlier are related to each other. A
typical type of collective outlier is the sequence outlier [49,50],
where the data are in the format of an ordered sequence.
Outliers should be investigated carefully. They may contain
valuable information about the process under investigation or the data
gathering and recording process. Before eliminating these points from
the data, one should try to understand why they appeared and whether
similar values are likely to continue to appear. Of course, outliers are
often simply bad data points.
4.2 Challenges in Detecting Outlier Objects
• Defining a representative normal region is challenging
• The boundary between normal and outlying behaviour is often
not precise
• The exact notion of an outlier is different for different
application domains
• Availability of labelled data for training/validation
• Malicious adversaries
• Normal behaviour keeps evolving
4.3 Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, structural
• Output of anomaly detection
• Evaluation of anomaly detection techniques
A key observation that motivates this research is that outliers
existing in high-dimensional data streams [86,87] are embedded in some
lower-dimensional subspaces [90,91]. The existence of projected outliers
is due to the fact that, as the dimensionality of the data goes up, data
points tend to become equally distant from each other. As a result, the
differences in the data points' outlier-ness [121] become increasingly
weak and thus indistinguishable.
4.4 Multistage Framework to Detect Outliers (MFDO)
This framework has four layers for the IDS. In this framework both
supervised and unsupervised learning have been used; they are explained
here.
Significant outlier-ness of data can be observed only in moderate-
or low-dimensional subspaces. This phenomenon is commonly referred to as
the curse of dimensionality. In my research a framework has been
designed for detecting outliers, shown in Figure 4.2.
The architecture consists of four layers, and in each layer some
processing is done towards detecting the outliers. These layers are
explained below:
4.4.1 Layer One Description
The first step in this layer is reading the dataset; the next step is to
preprocess it. The preprocessing steps are given below:
i) sample the dataset if it is large
ii) remove any missing values
iii) normalize the samples
iv) select a feature subset
4.4.1.1 Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample, or
subset, of the data. Suppose that a large dataset, D, contains N tuples.
A simple random sample without replacement (SRSWOR) method has been
used for sampling the records.
An SRSWOR of size s is created by drawing s of the N tuples from
D (s < N), where the probability of drawing any tuple in D is 1/N; that
is, all tuples are equally likely to be sampled.
An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the size of the data set. Hence, sampling complexity is
potentially sublinear in the size of the data. Other data reduction
techniques can require at least one complete pass through D.
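As a concrete sketch of the SRSWOR step (illustrative only; the function and data names are mine, not from this work):

```python
import random

def srswor(dataset, s):
    """Simple random sample without replacement: draw s of the N tuples,
    with every tuple equally likely to be chosen and none drawn twice."""
    if s >= len(dataset):
        return list(dataset)
    # random.sample already samples without replacement
    return random.sample(dataset, s)

# Example: reduce a "large" dataset of 1000 tuples to a sample of 50
D = [(i, i % 7) for i in range(1000)]
sample = srswor(D, 50)
print(len(sample))             # 50
print(len(set(sample)) == 50)  # True: no tuple drawn twice
```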
Figure 4.2 Multistage Framework for IDS. (Layer-1: read and preprocess the dataset (sample, normalize, select a feature subset) and calculate the z-score for each continuous/categorical attribute. Layer-2: compare each z-score with a threshold value to separate normal objects from outlier objects. Layer-3: classify the outliers with a multi-class Bayesian Network classifier into types O_Type1 ... O_TypeN and compute the mean for each class. Layer-4: assign each test record to a class using the k-nearest-neighbor distance-based method.)
For a fixed sample size, sampling complexity increases only
linearly as the number of data dimensions, n, increases, whereas
techniques using histograms, for example, increase exponentially in n.
When applied to data reduction, sampling is most commonly used
to estimate the answer to an aggregate query.
It is possible to determine a sufficient sample size for estimating a
given function within a specified degree of error. This sample size, s,
may be extremely small in comparison to N. Sampling is a natural choice
for the progressive refinement of a reduced dataset; such a set can be
further refined by simply increasing the sample size.
4.4.1.2 Removing Missing Values
A missing value can signify a number of different things in data. Perhaps
the field was not applicable, the event did not happen, or the data was
not available. It could be that the person who entered the data did not
know the right value, or did not care whether a field was filled in.
Analysis Services therefore provides two distinctly different mechanisms
for managing and calculating these missing values, also known as null
values.
If classification modeling specifies that a column must never have
missing values, the NOT_NULL modeling flag should be used when defining
the mining structure. This ensures that processing will fail if a case
does not have an appropriate value. If an error occurs when processing a
model, the error can be logged and steps taken to correct the data
supplied to the model.
There are a variety of tools that can be used to infer and fill in
appropriate values, such as the lookup transformation or the Data
Profiler task in SQL Server Integration Services, or the Fill by Example
tool provided in the Data Mining Add-Ins for Excel. In this work, the
rows which have missing values were simply removed.
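The row-removal strategy adopted here can be sketched in plain Python (the field values are invented for illustration; any record containing a None or empty field is dropped):

```python
def drop_rows_with_missing(rows):
    """Remove any record that contains a missing (None or empty) field,
    mirroring the simple strategy used in this work."""
    return [r for r in rows if all(v is not None and v != "" for v in r)]

records = [
    ("tcp", "http", 215, 45076),
    ("udp", None, 146, 0),       # missing service -> dropped
    ("tcp", "smtp", "", 1337),   # empty field     -> dropped
    ("icmp", "ecr_i", 1032, 0),
]
clean = drop_rows_with_missing(records)
print(len(clean))  # 2
```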
4.4.1.3 Normalize samples
An attribute is normalized by scaling its values so that they fall within
a small specified range, such as 0.0 to 1.0. Standardization is
particularly useful for classification algorithms involving neural
networks, or for distance measurements such as nearest-neighbor
classification and clustering. For distance-based methods, data
standardization prevents attributes with initially large ranges from
outweighing attributes with initially smaller ranges.
There are many methods for data normalization. Here, z-score
normalization has been implemented. In z-score normalization (or
zero-mean normalization), the values for an attribute, A, are normalized
based on the mean and standard deviation of A. A value, v, of A is
normalized to v' by computing

v' = (v − mean(A)) / σ_A,   (4.1)

where mean(A) and σ_A are the mean and standard deviation, respectively,
of attribute A. This method of normalization is useful when the actual
minimum and maximum of attribute A are unknown, or when there are
outliers that dominate min-max normalization.
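Equation (4.1) can be sketched directly with the standard library (a minimal illustration using the population standard deviation; the variable names are mine):

```python
import statistics

def zscore_normalize(values):
    """z-score (zero-mean) normalization: v' = (v - mean) / stdev."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

src_bytes = [100, 200, 300, 400, 500]
normalized = zscore_normalize(src_bytes)
print(normalized[2])  # 0.0  (the mean maps to zero)
```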
4.4.1.4 Select features subset
Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant. Leaving out
relevant attributes or keeping irrelevant attributes may be detrimental,
causing confusion for the mining algorithm employed. This can result in
discovered patterns of poor quality. In addition, the added volume of
irrelevant or redundant attributes can slow down the mining process.
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (dimensions). The goal of attribute
subset selection is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: it
reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
For n attributes, there are 2^n possible subsets. An exhaustive search
for the optimal subset of attributes can be prohibitively expensive,
especially as n and the number of data classes increase. Therefore,
heuristic methods that explore a reduced search space are commonly
used for attribute subset selection.
These methods are typically greedy in that, while searching
through attribute space, they always make what looks to be the best
choice at the time; their strategy is to make a locally optimal choice
in the hope that this will lead to a globally optimal solution. Such
greedy methods are effective in practice and may come close to
estimating an optimal solution.
The best (and worst) attributes are typically determined using
tests of statistical significance, which assume that the attributes are
independent of one another. Many other attribute evaluation measures
can be used, such as the information gain measure used in building
decision trees for classification.
Most previous work on feature selection emphasized only the
reduction of the high dimensionality of the feature space. But in cases
where many features are highly redundant with each other, we must
utilize other means, for example, more complex dependence models such as
Bayesian network classifiers.
In my work, a new information gain and divergence-based feature
selection method has been introduced. The feature selection method
strives to reduce redundancy between features while maintaining
information gain in selecting appropriate features for classification.
Entropy is one of the greedy feature selection methods, and
conventional information gain is commonly used in feature selection for
classification or clustering models [113]. Moreover, our feature
selection method sometimes yields greater improvements for conventional
machine learning algorithms than for support vector machines, which are
known to give the best classification accuracy. Entropy is defined as

Entropy(t) = − Σ_j p(j|t) log2 p(j|t)

4.4.2 Layer Two Description
Here the separation of objects into normal objects and outliers is
done using the z-score measure, which is explained below.
There are many statistical methods to detect outlier data objects;
the z-score method has been used here. The z-score measure can be
defined as follows:

z-score = (v − mean(A)) / σ_A,

where mean(A) and σ_A are the mean and standard deviation, respectively,
of attribute A. The z-score is calculated for the selected feature
subset. For each attribute a threshold value is set to determine whether
a particular instance is an outlier. If the majority of the z-scores
indicate that an instance is an outlier, then that object is an outlier.
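The majority-vote separation described above can be sketched as follows (a minimal illustration; the per-attribute threshold of 1.5 and all column statistics are assumed values for the example):

```python
def is_outlier(record, means, stdevs, threshold=1.5):
    """Majority vote over per-attribute z-scores: the record is flagged
    as an outlier if most attributes satisfy |z| >= threshold."""
    votes = 0
    for v, m, s in zip(record, means, stdevs):
        z = (v - m) / s if s else 0.0
        if abs(z) >= threshold:
            votes += 1
    return votes > len(record) / 2

# Toy column means/stdevs, as if computed from a training set
means = [300.0, 50.0]
stdevs = [100.0, 10.0]
print(is_outlier([310.0, 52.0], means, stdevs))  # False: z = 0.1 and 0.2
print(is_outlier([900.0, 95.0], means, stdevs))  # True: z = 6.0 and 4.5
```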
In this research, the problem of detecting outliers from high-
dimensional data streams has been studied. This problem can be
formulated as follows: given a dataset D with a potentially unbounded
size of n-dimensional data points, for each data point xi = {xi1, xi2, . . . ,
xin} in D, outlier detection method performs a mapping as
f : xi → (b, Si, Z-Scorei)
Each data point xi is mapped to a triplet (b, Si, Z-Scorei), where b is
a Boolean variable indicating whether or not xi is a projected outlier.
If xi is a projected outlier, then b is true, Si is the set of outlying
subspaces of xi, and Z-Scorei is the corresponding outlier-ness score of
xi in each subspace of Si. If xi is a normal data point, then b is
false, Si = ∅ and Z-Scorei is not applicable.
In my work the z-score threshold has been set to ±1.5 for
separating the objects. Once the objects are separated into normal
objects and outliers, the outliers can be classified further using a
multi-class classifier.
Here a Bayesian classifier [60,61,62] has been used for classifying
the outliers. After classifying the outliers, the mean value for each
class group is calculated. The k-nearest neighbor method is then applied
between the class mean values and the test tuples to determine the class
of unseen test records. The Bayesian network is briefly explained below.
4.4.3 Layer Three: Bayesian Network Classifier
These networks are directed acyclic graphs that allow an efficient and
effective representation of the joint probability distribution over a
set of random variables. Each vertex in the graph represents a random
variable, and edges represent direct correlations between the variables.
More precisely, the network encodes the following conditional
independence statements: each variable is independent of its
non-descendants in the graph, given the state of its parents.
These independencies are then exploited to reduce the number of
parameters needed to characterize a probability distribution, and to
efficiently compute posterior probabilities given evidence. Probabilistic
parameters are encoded in a set of tables, one for each variable, in the
form of local conditional distributions of a variable given its parents.
Using the independence statements encoded in the network, the joint
distribution is uniquely determined by these local conditional
distributions.
Let U = {x1, x2, . . . , xn}, n ≥ 1, be a set of variables. A Bayesian
network B over the set of variables U is a network structure BS, which
is a directed acyclic graph (DAG) over U, together with a set of
probability tables for BS.
To use a Bayesian network as a classifier, one simply calculates
argmax_y P(y|x) using the distribution P(U) represented by the Bayesian
network:

P(y|x) = P(U)/P(x) ∝ P(U) = Π_{u∈U} p(u | pa(u))

Since all variables in x are known, we do not need complicated inference
algorithms; we can simply evaluate this expression for all class values.
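As a deliberately simplified illustration of this classification rule, the naive Bayes classifier is the special case of a Bayesian network in which the class node is the only parent of every attribute, so P(U) factorizes into P(y) Π_i P(x_i | y). A minimal sketch on toy categorical data (no smoothing; all names and values are invented for the example):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Learn P(y) and P(x_i | y) from categorical data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond, len(labels)

def classify_nb(prior, cond, n, row):
    """argmax_y P(y) * prod_i P(x_i | y)."""
    best, best_p = None, -1.0
    for y, cy in prior.items():
        p = cy / n
        for i, v in enumerate(row):
            p *= cond[(i, y)][v] / cy
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("tcp", "S0"), ("tcp", "SF"), ("udp", "SF"), ("tcp", "S0")]
labels = ["attack", "normal", "normal", "attack"]
prior, cond, n = train_nb(rows, labels)
print(classify_nb(prior, cond, n, ("tcp", "S0")))  # attack
```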
4.4.3.1 Bayesian Network learning algorithm
The dual nature of a Bayesian network makes learning it a natural
two-stage process: first learn the network structure, and then learn the
probability tables.
There are various approaches to structure learning; here, conditional
independence testing has been used as the learning method, which is
briefly described below.
4.4.3.2 Learning Network Structure
Learning a BN from data is unsupervised learning, in the sense that
the learner does not distinguish the class variable from the attribute
variables in the data. The objective is to induce a network that best
describes the probability distribution over the training data. This
optimization process is implemented by using heuristic search techniques
to find the best candidate over the space of possible networks. The
search process relies on a scoring function that assesses the merits of
each candidate network.
4.4.3.3 Learning based on Conditional Independence Tests
This method stems mainly from the goal of uncovering causal structure.
The assumption is that there is a network structure that exactly
represents the independencies in the distribution that generated the
data. It then follows that if a (conditional) independency between two
variables can be identified in the data, there is no arrow between those
two variables.
Once the locations of the edges are identified, the direction of the
edges is assigned such that the conditional independencies in the data
are properly represented. The results of the model at this stage are
shown below:
Correctly Classified Instances 21969 98.6972 %
Incorrectly Classified Instances 290 1.3028 %
Mean absolute error 0.0062
Root mean squared error 0.0618
Relative absolute error 3.0386 %
Root relative squared error 19.3229 %
Total Number of Instances 22259
4.5 Layer Four Task Description
Here the classification of new records is done using distance-based
measures, which are explained below.
4.5.1 kNN Distance Based Method
After separating the records into normal and attacks, the test tuples
can be predicted using the k-NN distance-based method. There have been a
number of different ways of defining outliers from the perspective of
distance-related metrics.
Most existing metrics used in distance-based outlier detection
techniques are defined based upon the concepts of the local neighborhood
or k nearest neighbors (kNN) of the data points. The notion of
distance-based outliers does not assume any underlying data distribution
and generalizes many concepts from distribution-based methods. Moreover,
distance-based methods scale better to multi-dimensional space and can
be computed much more efficiently than statistical methods.
In distance-based methods [93,94], the distance between data points
needs to be computed. We can use any of the Lp metrics, such as the
Manhattan or Euclidean distance metrics [95], for measuring the distance
between a pair of points.
Alternatively, for some application domains with categorical data
(e.g., text documents), non-metric distance functions can also be used,
making the distance-based definition of outliers very general. Data
normalization is normally carried out to normalize the different scales
of the data features before outlier detection is performed.
4.5.2 Local Neighborhood Methods
The first notion of distance-based outliers, called the DB(k, λ)-Outlier
[89], is defined as follows. A point p in a data set is a
DB(k, λ)-Outlier [96], with respect to the parameters k and λ, if no
more than k points in the data set are at a distance λ or less (i.e.,
within the λ-neighborhood) from p. This definition of outliers is
intuitively simple and straightforward. The major disadvantage of this
method, however, is its sensitivity to the parameter λ, which is
difficult to specify a priori.
As we know, when the data dimensionality increases, it becomes
increasingly difficult to specify an appropriate circular local
neighborhood (delimited by λ) for the outlier-ness evaluation of each
point, since most of the points are likely to lie in a thin shell about
any point [24].
Thus, a very small λ will cause the algorithm to detect all points as
outliers, whereas no point will be detected as an outlier if too large a
λ is picked. In other words, one needs to choose an appropriate λ with a
very high degree of accuracy in order to find a modest number of points
that can then be defined as outliers.
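A brute-force sketch of the DB(k, λ)-Outlier definition (quadratic in the number of points, fine for illustration but not for large datasets; the toy points are my own):

```python
import math

def db_outliers(points, k, lam):
    """DB(k, lam)-Outliers: p is an outlier if no more than k points lie
    within distance lam of p (p itself excluded)."""
    outliers = []
    for p in points:
        neighbours = sum(
            1 for q in points
            if q is not p and math.dist(p, q) <= lam
        )
        if neighbours <= k:
            outliers.append(p)
    return outliers

# A tight cluster around the origin plus one distant point
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (10, 10)]
print(db_outliers(pts, k=1, lam=1.0))  # [(10, 10)]
```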
To facilitate the choice of parameter values, this first
neighborhood distance-based outlier definition was extended and the
so-called DB(pct, dmin)-Outlier was proposed, which defines an object in
a dataset as a DB(pct, dmin)-Outlier if at least pct% of the objects in
the dataset are at a distance larger than dmin from this object.
Similar to the DB(k, λ)-Outlier, this method essentially delimits the
neighborhood of data points using the parameter dmin and measures the
outlier-ness of a data point based on the percentage, instead of the
absolute number, of data points falling into this specified local
neighborhood. DB(pct, dmin) is quite general and is able to unify the
existing statistical detection methods using discordancy tests for
outlier detection. The specification of pct is obviously more intuitive
and easier than the specification of k.
4.5.3 Extension to the kNN-distance Method: e-kNN
An extension from the k nearest objects to the k nearest dense regions
has been proposed. This method employs the sum of the distances between
each data point and its k nearest dense regions to rank data points.
This enables the algorithm to measure the outlier-ness of data points
from a more global perspective.
From the local perspective, humans examine the point's immediate
neighborhood and consider it an outlier if its neighborhood density is
low. The global observation considers the dense regions where the data
points are densely populated in the data space.
Specifically, the neighboring density of the point serves as a good
indicator of its outlying degree from the local perspective. In the left
sub-figure of Figure 4.3, two square boxes of equal size are used to
delimit the neighborhoods of points p1 and p2.
Because the neighboring density of p1 is less than that of p2, the
outlying degree of p1 is larger than that of p2. On the other hand, the
distance between a point and the dense regions reflects the similarity
between this point and the dense regions. Intuitively, the larger such a
distance is, the more remarkably the point deviates from the main
population of the data points and therefore the higher its outlying
degree is; otherwise it is not an outlier. In the right sub-figure of
Figure 4.3, we can see a dense region and two outlying points, p1 and
p2. Because the distance between p1 and the dense region is larger than
that between p2 and the dense region, the outlying degree of p1 is
larger than that of p2.

Figure 4.3: Local and global perspectives of outlier-ness

4.5.4 Global Outlier-ness Metrics
The global outlier-ness metrics refer to those metrics that measure the
outlier-ness of data points from the perspective of how far away they
are from the dense regions (or clusters) of data in the data space
[135,136]. The most intuitive global outlier-ness metric would be the
distance between a data point and the centroid of its nearest cluster.
This metric can easily be extended to consider the k nearest
clusters (k > 1) [142]. For easy referral, we term these metrics the kNN
Clusters metric. The Largest Cluster metric used in [142] is also a
global outlier-ness metric.
4.6 Validation Methods
Different validation measures are used to check the
performance of the model; they are explained here.
4.6.1 k-Relative Distance (kRD)
In this study, we use the metric of k-Relative Distance (kRD) in the
validation method for identifying anomalies. Given parameter k, the kRD
of a point p in subspace s is defined as

kRD(k, p, s) = k_dist(p, s) / average_k_dist(D(p), s),

where k_dist(p, s) is the sum of distances between p and its k nearest
neighbors, and average_k_dist(D(p), s) is the average of such distance
sums for the data points that arrive before p. Using kRD, we can
overcome the limitations of the kNN_Clusters and Largest Cluster
metrics. Specifically, we can achieve a better outlier-ness measurement
than the kNN_Clusters metrics when the shapes of clusters are rather
irregular.
In addition, we can relax the rigid assumption of a single-mode
distribution of normal data and lessen the parameter sensitivity of the
Largest Cluster method. The advantages of using kRD are presented in
detail as follows:
1. Firstly, multiple cluster representative points are used in
calculating kRD. This enables kRD to deliver a more accurate
measurement when the cluster shapes are irregular.
The representative points, being well scattered within the cluster,
are more robust in capturing the overall geometric shape of the
clusters, so that the distance metric can deliver a more accurate
measurement of the distance between a data point and the dense regions
of the data set, while using the centroid is incapable of reliably
achieving this.
For instance, in the case of an elongated cluster, as illustrated in
Figure 4.4, it is obvious that data point o is an outlier while p is a
normal data point. However, the distance between p and the cluster
centroid is even larger than that for o (see Figure 4.4(a)).
This will not happen if multiple representative points are used
instead. In Figure 4.4(b), the representative points in this cluster are
marked in red, and we can see that the average distance between p and
its nearest representative points is less than that of o, which is
consistent with human perception;
2. Secondly, we do not limit the normal data to reside only in the
largest cluster, as in the Largest Cluster method. kRD provides the
flexibility to deal with both single and multiple modes for the normal
data distribution.
3. Finally, we employ the concept of a distance ratio, instead of the
absolute distance value, in kRD to measure the outlier-ness of data
points. Also, specifying a threshold parameter based on the distance
ratio is more intuitive and easier than a threshold based on the
absolute distance.
Figure 4.4 Using Centroid and Representative Points to measure Outlierness of Data Points
The steps for computing kRD in the validation method for a data point p
in subspace s are as follows:
1. k-means clustering is performed using D(p) in s to generate k clusters;
2. Find Nr representative points from all the clusters using the method
proposed in [93]. The exact number of representative points for
different clusters can be proportional to the size of the clusters;
3. Compute the average distance between each data point in the horizon
before p and its k respective nearest representative points;
4. Compute the distance between p and its k nearest cluster
representative points;
5. Compute kRD as

kRD(k, p, s) = k_dist(p, s) / average_k_dist(D(p), s)
To study the quality of the anomalies identified by different candidate
validation methods, we need objective quality metrics. Two major metrics
are commonly used to measure the quality of clusters: Mean Square Error
(MSE) and Silhouette Coefficient (SC) [142].
4.6.2 Mean Square Error (MSE)
MSE [76] is a measure of the compactness of clusters. The MSE of a
cluster c is defined as the intra-cluster sum of distance of c in
subspace s, that is, the sum of distances between each data point in
cluster c and its centroid, i.e.,

MSE(c, s) = Σ_i dist(p_i, c_0, s), where c_0 = (Σ_i p_i) / |c|

where |c|, p_i and c_0 correspond to the number of data points within
cluster c in s, the ith data point within cluster c, and the centroid of
cluster c in s, respectively. The overall quality of the whole
clustering result C in s is quantified by the total intra-cluster sum of
distance of all the clusters we obtain in s. That is,

MSE(C, s) = Σ MSE(c, s), for c є C

A lower value of MSE(C, s) indicates a better clustering quality in s. A
major problem with MSE is that it is sensitive to the selection of the
cluster number; MSE will decrease monotonically as the number of
clusters increases.
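A direct transcription of these formulas (note that, as defined here, "MSE" is an intra-cluster sum of distances rather than the averaged squared error the name usually suggests; the toy clusters are my own):

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def cluster_mse(cluster):
    """Intra-cluster sum of distances between each point and the centroid."""
    c0 = centroid(cluster)
    return sum(math.dist(p, c0) for p in cluster)

def total_mse(clustering):
    """MSE(C, s): total intra-cluster sum of distance over all clusters."""
    return sum(cluster_mse(c) for c in clustering)

tight = [(0, 0), (0, 1), (1, 0), (1, 1)]
loose = [(0, 0), (0, 4), (4, 0), (4, 4)]
print(total_mse([tight]) < total_mse([loose]))  # True: tighter is better
```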
4.6.3 Silhouette Coefficient (SC)
SC is another useful measurement used to judge the quality of clusters
[115,139]. The underlying rationale of SC is that, in a good clustering,
the intra-cluster data distance [137] should be smaller than the
inter-cluster data distance. The calculation of the SC of a clustering
result C in subspace s takes the following steps:
1. Calculate the average distance between p and all other data points in
its cluster in s, denoted by a(p, s), i.e.,
a(p, s) = Σ dist(p, pi, s)/|c|, for c є C and pi є c;
2. Calculate the average distance between p and the data points in each
other cluster that does not contain p in s. Find the minimum such
distance amongst all these clusters, denoted by b(p, s), i.e.,
b(p, s) = min(Σ dist(p, pi, s)/|cj|), for cj є C, p ∉ cj and pi є cj;
3. The SC of p in s, denoted by SC(p, s), can be computed by

SC(p, s) = (b(p, s) − a(p, s)) / max(a(p, s), b(p, s))
The value of SC can vary between -1 and 1. It is desirable for SC to be
positive and as close to 1 as possible. The major advantage of SC over
MSE is that SC is much less sensitive to the number of clusters. Given
this advantage, we will use SC as the anomaly quality metric in
selecting the validation method.
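The three steps can be sketched for a single point p as follows (a minimal illustration with invented clusters; in a full computation each point's own cluster would exclude p when averaging):

```python
import math

def silhouette(p, own_cluster, other_clusters):
    """SC(p) = (b - a) / max(a, b): a = average distance to the other
    points of p's cluster, b = smallest average distance to any other
    cluster."""
    a = sum(math.dist(p, q) for q in own_cluster if q != p) / (len(own_cluster) - 1)
    b = min(
        sum(math.dist(p, q) for q in c) / len(c)
        for c in other_clusters
    )
    return (b - a) / max(a, b)

c1 = [(0, 0), (0, 1), (1, 0)]
c2 = [(10, 10), (10, 11), (11, 10)]
sc = silhouette((0, 0), c1, [c2])
print(0 < sc <= 1)  # True: well-separated clusters give SC close to 1
```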
4.6.3.1 Steps for Anomaly Classification
Once constructed, the signature subspace lookup table can be used to
classify anomalies in the data stream [116,117,118]. Each anomaly is
classified into one or more possible attack classes, or the
false-positive class, based on the class membership probabilities of the
anomaly.
The class membership probability of an anomaly o with respect to
class ci of C is computed as

pr(o, ci) = sim(o, ci)/Σi sim(o, ci) × 100%, where ci є C
The higher the probability for a class, the greater the chance
that the anomaly falls into this class.
An anomaly o can be classified into a class ci if pr(o, ci) ≥ τ, where
τ is the similarity threshold [119]. As a result, under a given τ, it is
possible that an anomaly o is classified into a few classes, instead of
one, if the similarities are high enough. The set of classes that o may
fall into, denoted as class(o), is defined as

class(o) = {ci, where pr(o, ci) ≥ τ, ci є C}
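The membership computation and thresholding can be sketched as follows (the similarity values are invented for illustration, and sim(o, ci) itself is assumed to be given by the signature subspace lookup):

```python
def class_memberships(sims, tau):
    """pr(o, ci) = sim(o, ci) / sum_i sim(o, ci); o is assigned every
    class whose membership probability reaches the threshold tau."""
    total = sum(sims.values())
    probs = {c: s / total for c, s in sims.items()}
    assigned = {c for c, p in probs.items() if p >= tau}
    return probs, assigned

# Similarities of one anomaly o to each signature class (toy values)
sims = {"DoS": 0.6, "Probe": 0.3, "R2L": 0.1}
probs, classes = class_memberships(sims, tau=0.25)
print(sorted(classes))  # ['DoS', 'Probe']: both reach the threshold
```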
4.6.3.2 Handling False Positives
False positives, also called false alarms, are anomalies that are
erroneously reported as attacks by the system [117,138]. Even though
they are benign in nature and not harmful compared to real attacks,
false positives consume a fair amount of the human effort spent on
investigating them, making it almost impossible for security officers to
concentrate only on the real attacks.
Generally, among all the anomalies detected, up to 90% may be false
positives. It is highly desirable to quickly screen out these false
positives in order to allow closer attention to be paid to the real,
harmful attacks.
4.7 Experimental Results
The real-life kddcup99 dataset has been used for performance evaluation
in our experiments. All the experimental evaluations were carried out on
a Pentium IV PC with 1 GB of RAM.
Many experiments have been conducted on the kddcup'99 dataset to
evaluate the performance of the proposed model. The results are shown in
Figures 4.5 and 4.6, which show the assignment of objects to clusters.
In Figure 4.5 the number of clusters is two, and the records are clearly
separated with small error.
The first cluster holds the normal-type records and the second
cluster the outlier-type records, with a sum of squared errors (SSE) of
6468.92.
Figure 4.6 shows the assignment of objects to five clusters, where the
first cluster is for the normal type and the remaining clusters include
the DoS, Probe, R2L and U2R outlier records respectively, with an SSE of
4534.6. These experimental summary results are shown in Table 4.1.
Table 4.1 The results after assigning the objects to clusters

Attribute  | Full Data      | Cluster-0 | Cluster-1
-----------|----------------|-----------|----------
Service    | Private        | http      | Private
Flag       | S0             | SF        | S0
Src_bytes  | 2704           | 6765      | 28
Dst_bytes  | 5350           | 13471     | 0.696
Logged_in  | 0.3778         | 0.9514    | 0
Count      | 118            | 8         | 190
Srv_count  | 11             | 10        | 12
Class      | Normal/outlier | Normal    | Outliers
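The two-cluster separation summarized in Table 4.1 can be illustrated with a minimal k-means sketch. This is not the thesis implementation: the helper names are invented and the toy (src_bytes, count) feature vectors merely mimic the contrast between normal and outlier records.

```python
# Minimal k-means sketch (k=2) separating normal-like from outlier-like
# records; data and names are illustrative, not the kddcup99 values.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iters=20, seed=1):
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    # sum of squared errors: distance of every point to its nearest center
    sse = sum(min(dist2(p, c) for c in centers) for p in points)
    return clusters, sse

normal = [(6700, 8), (6800, 9), (6750, 10)]    # high src_bytes, low count
outlier = [(30, 190), (25, 185), (28, 195)]    # low src_bytes, high count
clusters, sse = kmeans(normal + outlier, k=2)
sizes = sorted(len(c) for c in clusters)
print(sizes)  # [3, 3]: normals and outliers land in separate clusters
```

With k=5, the same loop would further split the outlier cluster, which is how the per-attack-type clusters of figure 4.6 arise.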
Number of iterations: 7
Within cluster sum of squared errors: 4534.46865

Figure 4.5 Assignment of Objects to Clusters with k=2

Table 4.2 Percentage of Objects in the clusters

Cluster No | Number of Instances | % of instances
-----------|---------------------|---------------
0          | 8840                | 40
1          | 13419               | 60
Figure 4.6 Assignment of Objects to Clusters with k=5

4.8 Scalability Study of the Model

The scalability study investigates the scalability of both the learning
and detection processes with respect to the length N and dimensionality
ϕ of the data set. Experiments with different values of N and ϕ were
conducted in all scalability experiments.
4.8.1 Scalability of Learning Process with respect to N
Figure 4.7 shows the scalability of the unsupervised learning process
with respect to the number of training data N. The major tasks involved
in the unsupervised learning process are multi-run clustering of the
training data and selection of the top training data that have the
highest outlying degree. The lead clustering method we use requires
only a single scan of the training data, and the number of top training
data we choose is usually linearly dependent on N. Therefore, the
overall complexity of the unsupervised learning process scales linearly
with respect to N.
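The single-scan clustering mentioned above can be sketched as a leader-style pass: each record either joins the first existing cluster representative within a distance threshold or becomes a new representative. This is one plausible reading of the method, shown with invented names, 1-D points, and an assumed threshold parameter.

```python
# Single-scan "leader"-style clustering sketch: one pass over the N training
# records, so cost is O(N * number_of_leaders). Illustrative only.

def leader_cluster(points, threshold):
    """Each point joins the first leader within `threshold`, else it
    becomes a new leader; the data is scanned exactly once."""
    leaders = []   # representative points
    members = []   # members[i] = points assigned to leaders[i]
    for p in points:
        for i, q in enumerate(leaders):
            if abs(p - q) <= threshold:   # 1-D distance for simplicity
                members[i].append(p)
                break
        else:
            leaders.append(p)
            members.append([p])
    return leaders, members

leaders, members = leader_cluster([1.0, 1.2, 9.0, 9.3, 0.9], threshold=1.0)
print(leaders)  # [1.0, 9.0]
```

Because no record is revisited, the running time grows linearly with N whenever the number of leaders stays small, which matches the linear trend reported in figure 4.7.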
Figure 4.7 Learning Time in seconds versus Number of Records
4.8.2 Scalability of Learning Process with respect to ϕ
Figure 4.8 shows the results, which indicate an exponential growth in
the execution time of the learning process when ϕ is increased from 20
to 40 under a fixed search workload.
Figure 4.8 Learning Time in seconds versus Number of Dimensions

4.8.3 Scalability of Detection Process with respect to N
In Figure 4.10, we present the scalability result of the detection
process with respect to stream length N. In this experiment, the stream
length N is set much larger than the number of training data in order to
study the performance of the model in coping with large data streams.
Figure 4.10 shows a promising linear behavior of the detection process
when handling an increasing amount of streaming data. This is because
the detection process needs only one scan of the streaming data.
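The one-scan detection loop can be sketched as follows: each arriving record is scored once against a fixed model produced by the learning phase and never revisited, so the cost is linear in N. The z-score rule, threshold, and statistics here are assumed for illustration, not taken from the thesis code.

```python
# One-pass stream detection sketch: O(N) because every record is scored
# exactly once against fixed model statistics. Values are hypothetical.

def detect_stream(stream, mean, std, z_threshold=3.0):
    """Yield records whose |z-score| exceeds the threshold."""
    for x in stream:                  # single scan of the stream
        z = (x - mean) / std
        if abs(z) > z_threshold:
            yield x

model_mean, model_std = 100.0, 10.0   # assumed statistics from training
outliers = list(detect_stream([98, 105, 250, 101, 12],
                              model_mean, model_std))
print(outliers)  # [250, 12]
```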
Figure 4.9 Learning Time in seconds versus Number of Dimensions
Figure 4.10 Detection Time in seconds versus Number of Records
4.8.4 Scalability of Detection Process with respect to ϕ
The dimensionality ϕ of a data stream affects the size of the data used
in the detection process. When MaxDimension is fixed, the size of the
data is in an exponential order of ϕ, but the execution time of the
detection process is expected to grow linearly with respect to ϕ. In
this experiment, we used different dimensions with a fixed number of
records, and we can see a linear behavior of the detection process. We
repeated the same experiment with the KDDCUP'99 data set with
dimensions of 8, 15, 20, 25, 30 and 42. The results are presented in
Figure 4.11, with dimensions on the x-axis and detection time on the
y-axis.
Figure 4.11 Detection Time in seconds versus Number of Dimensions
4.9 Chapter Summary
Even though the problem of outlier detection has been studied
intensively in the past few years, the outlier detection problem in
high-dimensional data streams has rarely been investigated. With the
current state-of-the-art methods it is difficult to find outliers in
the context of data streams, because these methods are either
constrained to low-dimensional data or unable to incrementally update
the detection model for real-time data streams. To the best of our
knowledge, there has been little research work in the literature that
particularly targets this problem.
This research work is motivated by the unique characteristics of data
in high-dimensional space. Owing to the curse of dimensionality, the
outliers existing in high-dimensional data sets (including data
streams) are embedded in lower-dimensional subspaces. Such outliers are
termed projected outliers. In this thesis, the problem of projected
outlier detection for multi- and high-dimensional data streams has been
introduced and formally formulated.

To solve this problem, a new framework, called the Multistage Framework
for Detecting Outliers (MFDO), has been presented. MFDO utilizes a
compact data synopsis to capture the necessary statistical information,
using z-score values for outlier detection. The result is better
performance with simple calculations, using a nearest-neighbor-based
approach that is easy to compute.
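The compact-synopsis idea in the summary can be illustrated with a running count/mean/variance accumulator (Welford's update), which yields z-scores without storing the stream. This sketch is an assumed interpretation for illustration, not the MFDO implementation itself.

```python
# Compact synopsis sketch: O(1) memory and O(1) work per record are enough
# to compute z-scores over a stream. Illustrative, not the thesis code.
import math

class Synopsis:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        """Welford's incremental update of count, mean, and sum of squares."""
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def zscore(self, x):
        """Standardize x against the statistics seen so far (n >= 2)."""
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std

s = Synopsis()
for v in [10, 12, 11, 9, 10, 48]:
    s.update(v)
print(s.zscore(48) > 2.0)  # True: 48 is far above the running mean
```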