
A

PRELIMINARY PROJECT REPORT ON

An Automatically Tuning Intrusion Detection System

SUBMITTED

TO

PUNE UNIVERSITY, PUNE FOR THE DEGREE

OF

BACHELOR OF COMPUTER ENGINEERING BY

Ritesh Kumar Sinha, Ankush Verma, Manikant Ojha, Amit Kumar

UNDER THE GUIDANCE

OF

Prof. S.R.Patil

DEPARTMENT OF COMPUTER ENGINEERING MAHARASHTRA ACADEMY OF ENGINEERING

ALANDI (D), PUNE-412105


2010-2011

MAHARASHTRA ACADEMY OF ENGINEERING, ALANDI (D), PUNE-412105

2010-11

Certificate

This is to certify that the Project Report entitled

“An Automatically Tuning Intrusion Detection System”

Has been submitted by

Mr. Ritesh Kumar Sinha

Mr. Ankush Verma

Mr. Manikant Ojha

Mr. Amit Kumar

In partial fulfillment of the Bachelor's Degree in Computer Engineering awarded by

UNIVERSITY OF PUNE, PUNE

2010-11

Prof. S. R. Patil                          Dr. S. J. Wagh
Project Guide                              Head of Department, Computer Engineering

Principal
MAHARASHTRA ACADEMY OF ENGINEERING

ALANDI(D), PUNE-412105


Acknowledgement

We would like to thank our guide, Prof. S. R. Patil, for his complete support towards our project. Without his help at every step, this project would not have been successful. We would also like to thank our HOD, Prof. S. J. Wagh, for his support towards our project. We would also like to thank the MAE staff, including the library staff, for their help during our research.

Mr. Ritesh Kumar Sinha, Mr. Ankush Verma,

Mr. Manikant Ojha, Mr. Amit kumar


Abstract

THEME/PURPOSE:

An intrusion detection system (IDS) is a monitoring system used to identify abnormal activities in a computer system. The IDS raises an alarm to the system operator when it detects any abnormal condition. An IDS works in a dynamically changing environment. Traditionally, the working of an IDS depends on security experts and requires manual tuning. Since the system works in a dynamically changing environment, we develop a system which reduces this dependence by tuning itself automatically. Basically, an IDS consists of a prediction engine which analyzes data and outputs predictions on the data. By examining the predictions, the system operator is able to know whether a data record is normal or is affected by an attack. The prediction engine is therefore the heart of an intrusion detection system.

In an automatically tuning intrusion detection system (ATIDS), the system operator analyzes the predictions obtained from the detection model; of these results, only the false predictions are considered. ATIDS consists of three major components: the prediction model, the prediction engine and the model tuner. First, we create the prediction model. The prediction engine then analyzes data according to the prediction model. The system operator verifies the results and marks false predictions. Only the false predictions are fed back to the model tuner to tune the model automatically.

METHODOLOGY:

Our project is based on two important aspects, and the whole procedure is designed so that these aspects are implemented in a correct manner. The aspects are given below:

Attack detection model:

Here we are going to use the SLIPPER learning algorithm, a rule-based learning system, for detecting intrusions. The system is evaluated using the KDD Cup '99 intrusion detection dataset.


Prediction engine:

A binary learning algorithm can only build a binary classifier. We will group attacks into categories such as denial-of-service, probing, remote-to-local, and user-to-root. Correspondingly, we will construct five binary classifiers from the training dataset. One binary classifier predicts whether the input data record is normal; the other four binary classifiers predict whether the input data record constitutes a particular category of attack.

HARDWARE AND SOFTWARE REQUIREMENT:

SOFTWARE REQUIRED

Java 1.3 or higher.

Java Swing.

HARDWARE REQUIREMENT

Hard Disk (40 GB).

RAM (128 MB).

Processor (Pentium).

APPLICATIONS: Basically, an IDS analyzes data and outputs predictions on the data. By examining the predictions, the system operator is able to know whether a data record is normal or is affected by an attack.


TABLE OF CONTENTS

CHAPTER NO.  TITLE                                      PAGE NO.

FRONT PAGE                                              I
CERTIFICATE                                             II
ACKNOWLEDGEMENT                                         III
ABSTRACT                                                IV
LIST OF FIGURES                                         VII
LIST OF TABLES                                          VIII

Chapter 1.  INTRODUCTION                                1
            1.1 Introduction

Chapter 2.  PLATFORM CHOICE                             2
            2.1 Java Swing

Chapter 3.  LITERATURE SURVEY                           3
            3.1 Basic Structure of IDS                  4
                3.1.1 Data Sampling                     5
                3.1.2 Data Preprocessing                5
                3.1.3 Classifier System                 6
                3.1.4 Types of IDS                      7
            3.2 System Overview                         8
                3.2.1 Beginning                         8
                3.2.2 Types of IDS                      9
                    3.2.2.1 Host-Based                  9
                    3.2.2.2 Network-Based               10


Chapter 4.  REQUIREMENT ANALYSIS                        12
            4.1 Data Set                                12
                4.1.1 KDD CUP 99 Set Description        12
            4.2 Arbitral Strategy by Neural Network     14
            4.3 Multi-Class SLIPPER                     15
            4.4 Hardware and Software Requirement       16
                4.4.1 Hardware Requirement              16
                4.4.2 Software Requirement              16
            4.5 Project Plan                            16

Chapter 5.  SYSTEM DESIGN                               17
            5.1 UML Diagrams                            17
                5.1.1 Activity Diagram                  17
                5.1.2 Use Case Diagram                  18
                5.1.3 Component Diagram                 19
                5.1.4 Class Diagram                     20
                5.1.5 Deployment Diagram                21
                5.1.6 Sequence Diagram                  22

Chapter 6.  CONCLUSION AND FUTURE SCOPE                 23

REFERENCES                                              IX


LIST OF FIGURES

Sr. No.  Figure Number  Name of Figure                               Page Number

1.       Fig 1          Basic architecture of IDS                    5
2.       Fig 2          A classifier system consists of four parts   6
3.       Fig 3          Multi-class SLIPPER                          15
4.       Fig 4          Optimized preprocess algorithm               15
5.       Fig 5          Activity Diagram                             17
6.       Fig 6          Use case Diagram                             18
7.       Fig 7          Component Diagram                            19
8.       Fig 8          Class Diagram                                20
9.       Fig 9          Deployment Diagram                           21
10.      Fig 10         Sequence Diagram                             22


LIST OF TABLES

Sr. No. Table Number Name of Table Page Number

1. Table: 1 PROJECT PLAN 16



Chapter 1.

Introduction

With the expansion of the Internet, the importance of information security has been on the rise. There is no standard definition of intrusion detection as such. Usually, intrusion detection is understood as the discovery of network behaviors that abuse or endanger network security. Intrusion detection can be treated as a pattern recognition problem which distinguishes between network attacks and normal network behaviors, or further distinguishes between different categories of attacks.

Any set of events that tries to compromise the accessibility, reliability or privacy of resources is called an intrusion. An intruder is a person or group of persons who initiates these events. The intruder can be from within the system, that is, someone with permission to use the computer with normal user privileges, or someone who uses a hole in some operating system to escalate their privilege level; or the intruder can be from outside the system, that is, someone on another network, or perhaps even in another country, who exploits a vulnerability or weakness in an unprotected network service on the computer to gain unauthorized entry and control.

An intrusion detection system is in fact a security layer used to detect ongoing intrusive activities in information systems. Conventionally, intrusion detection heavily depends on the extensive knowledge of security experts, in particular on their familiarity with the computer system that is to be protected. To reduce this dependency, a variety of machine learning and data mining techniques have been deployed for intrusion detection. Most often an IDS works in a dynamically changing environment, which necessitates constant tuning of the intrusion detection model in order to maintain sufficient performance. The manual tuning process required by current systems depends on the system operators working out the tuning solution and integrating it into the detection model. Moreover, network intrusion detection aims at separating attacks on the Internet from normal use of the Internet. It is a very important and essential piece of the information security system. Due to the diversity in network behaviors and the rapid evolution of attack fashions, it is of prime importance to develop fast machine-learning-based intrusion detection algorithms with low false-alarm rates and high detection rates.


Chapter 2.

Platform choice

Java Swing :

Swing is a GUI toolkit for Java. It is part of Sun Microsystems' Java Foundation Classes (JFC), an API for building graphical user interfaces (GUIs). Swing was developed to provide a more sophisticated set of GUI components than the Abstract Window Toolkit (AWT). Swing provides a pluggable look and feel that can emulate the look and feel of several platforms.

Using Swing, we will develop the user interface of our intrusion detection system, which will expose all the functionalities of the system, such as create rule, prediction and tuning.

The most important advantage of Java Swing is its cross-platform support, which allows developers to build applications that run on Windows, Mac and Linux. Swing also provides a very rich set of components and features that can easily satisfy the requirements of many different kinds of applications, such as development tools, administration consoles and business applications.


Chapter 3.

Literature Survey

Protection forms an important aspect of any computing system. Protection encompasses the accessibility, reliability and privacy of the resources provided by a computing system. Three aspects of network systems make these systems more susceptible to attack than stand-alone machines:

• Networks typically provide more resources than independent machines.

• Network systems are normally configured to facilitate resource sharing.

• Global protection policies that can be applied to all of the machines in a network are rare.

As discussed earlier, in order to reduce the dependency on security experts found in traditional systems, a great deal of research effort was invested in different projects, which led to the rise of various data mining and machine learning methods that could easily be incorporated into intrusion detection systems.

Audit data analysis and mining was one such technique, combining the logic of mining association rules with classification in order to identify and detect intrusions from the network traffic, whereas the ISA (Information Systems Assurance) laboratory utilized a technique based on statistics, along with chi-square tests and exponentially weighted moving averages, for statistical analysis of audit data.

Information security on the Internet consists of the following:

1) Protection: The information system is automatically protected to avoid security violations

that are called intrusions.

2) Detection: Security violations are detected as soon as they occur.

3) Reaction: Reactions, such as the pursuit of hackers or automatic alarms, are performed when the system is intruded upon.

4) Recovery: The information system automatically repairs the damages caused by an

intrusion.


Intrusion detection forms a crucial part of information security. Only if intrusions are correctly detected can the subsequent reaction and recovery be successfully implemented. An intrusion detection system is based on the premise that an intrusion will be revealed by a change in the 'normal' patterns of resource usage. Intrusion detection is a methodology by which any undesirable or abnormal activity can be detected. An intrusion detection system is a monitoring system which raises an alert to the system operator whenever it infers an intrusion from its detection model. An intrusion detection system (IDS) is software, hardware or a mixture of both that helps to detect intruder activity. An IDS may have different capabilities depending upon how sophisticated and complex its mechanisms are. IDS appliances that are a mixture of software and hardware are available from many organizations. An IDS may apply anomaly-based techniques, signature-based techniques, or both. Alerts are any kind of user notification of an intruder action. When the IDS detects an intruder, it informs the security supervisor by means of alerts. These alerts may take the form of logging to a console, pop-up windows, sending e-mail, and so on. Intrusion detection is an unrelenting, active attempt at discovering or detecting the presence of intrusive activities. As intrusion detection (ID) relates to computers and network communications, it encompasses a far broader range: all processes used in discovering or detecting unauthorized uses of network or computer devices. This is achieved through purpose-built software whose sole aim is to detect abnormal or irregular activity. Depending upon the network topology, we can place intrusion detection systems at one or more locations. Placement also depends upon the type of intrusion behavior we want to detect: internal, external or both. For instance, if we wish to detect only external intrusion behavior, and we have only a single router linking to the Internet, then the best position for an intrusion detection system may be just inside the firewall or the router. If numerous paths to the Internet exist, then we would want to position one IDS at every entry point. But if we want to detect internal threats as well, then an IDS should be placed in every network segment.

3.1 Basic architecture of IDS

One approach to developing network security is to describe patterns of network behavior that indicate offensive use of the network and then look for the occurrence of those patterns. While such an approach may be capable of detecting various types of known intrusive actions, it would allow new or undocumented types of attacks to go undetected. As a result, this leads to a system which monitors and learns normal network behavior and then detects deviations from that normal behavior.

Fig 1: Basic architecture of IDS

3.1.1 Data Sampling:

The first step in collecting data is to determine exactly what type of data should be collected. Because the objective of this project is intrusion detection at the network level, a natural choice of data is the network transmission packet. The network gives two types of information to study, transport information and user information, but here only the transport information is selected. Transport data contains a structured source and destination pair. It also contains some type of checksum by which the integrity of a packet is determined. Transport information is added to the packet as part of the network transmission protocol. Transport information, which cannot be falsified by a fraudulent user, is called unbiased data. The user information contains the information that is to be transferred from one machine to another. This can easily be modified by a fraudulent user and hence we call it biased data. The next step in collecting data is to design a device for monitoring network packets. Since finding an intrusion is not reliant on any particular method used to capture packets, any method that is capable of obtaining a suitable data sample is acceptable. The last step in data collection is to process the data in such a way that it is transformed into a format which is acceptable to the classifier system.

3.1.2 Data Preprocessing:


There are some values which are important to the classifier. These are:

Packet size value,
Timestamp value,
Ethernet source-destination ordered pair.

There are two reasons for preprocessing the data:

1) In the case of packet sizes and source and destination addresses, the raw data can be compacted without loss of relevant information. This results in data which is easier for the classifier system to manipulate. This data also requires less disk storage space.

2) In the case of the timestamp information, the basic second count provided is expanded so as to include the relative information of day of week and hour of day. This allows for the modelling of network behavior which depends on human temporal patterns. A sketch of this step is given below.
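A minimal sketch of this preprocessing step; the class and field names, and the bucket width used for compacting packet sizes, are assumptions made for illustration. Only the expansion of the raw second count into day-of-week and hour-of-day follows the description above.

import java.util.Calendar;
import java.util.Date;

public class PacketPreprocessor {

    static class PreprocessedPacket {
        int sizeBucket;      // compacted packet size
        int dayOfWeek;       // 1 = Sunday ... 7 = Saturday
        int hourOfDay;       // 0..23
        String srcDstPair;   // Ethernet source-destination ordered pair
    }

    static PreprocessedPacket preprocess(long timestampSeconds, int packetSize,
                                         String source, String destination) {
        PreprocessedPacket p = new PreprocessedPacket();

        // Compact the raw size into coarse buckets (assumed width of 64 bytes) so the
        // data is easier to manipulate and needs less disk space.
        p.sizeBucket = packetSize / 64;

        // Expand the basic second count into day-of-week and hour-of-day so the
        // classifier can model human temporal patterns.
        Calendar cal = Calendar.getInstance();
        cal.setTime(new Date(timestampSeconds * 1000L));
        p.dayOfWeek = cal.get(Calendar.DAY_OF_WEEK);
        p.hourOfDay = cal.get(Calendar.HOUR_OF_DAY);

        p.srcDstPair = source + "->" + destination;
        return p;
    }
}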

3.1.3 Classifier system:

The classifier system is a parallel, rule-based, message-passing system. All rules are of the condition-action form: the condition is the receipt of messages and the action is the sending of messages when the rule is satisfied. All messages carry a tag specifying their source and an additional information field.

Fig 2: A classifier system consists of four parts:

1) The input interface: in this case, a message that contains information taken from a 4-tuple describing an individual packet.

2) The classifiers: the rules which describe the behavior in which the system operates and creates messages.

3) The message list: a directory of all messages yet to be considered by the classifier rules. The messages may come from fulfilled rules or from the input interface.

4) The output interface: a message signifying whether the recent network behavior is believed to be normal or abnormal.

Consider a simple example of how the classifier system works. Suppose that the transmission of packets, each of size 100, is being considered as an indicator of normal network behavior, and that we are interested in the number of packets of size 100 over a one-second period, with 5, 50 and 150 evaluated as possible thresholds of abnormality.

Then there are the following classifier rules:

1) Rule 1 examines all messages from the input interface. It uses the size and time values in those messages to maintain a count of packets of size 100 over a sliding time window of one second. After processing an input message, it puts on the message list a message of its own with the updated count for the last second.

2) Rule 2 observes all messages put on the message list by Rule 1. If the current count of packets of size 100 over the last second exceeds 5, then Rule 2 in turn puts a message on the message list notifying that its threshold has been crossed.

3) Rules 3 and 4 read all messages coming from Rule 1, and if the current count exceeds their particular thresholds of 50 and 150, they too put messages on the message list.

The output interface attends to all messages from Rules 2, 3 and 4. When any of those rules has fired and put a message on the list indicating that its threshold has been exceeded, the output interface informs the environment that the rule is predicting the occurrence of abnormal behavior. A sketch of these rules is given below.
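A minimal sketch of these rules, with assumed class and method names and with the output-interface messages reduced to console output for brevity.

import java.util.ArrayDeque;
import java.util.Deque;

public class ThresholdClassifier {
    private final Deque<Long> timesOfSize100 = new ArrayDeque<Long>();
    private final int[] thresholds = {5, 50, 150};   // Rules 2, 3 and 4

    // Rule 1: consume an input-interface message and update the one-second count.
    public void onPacket(long timestampMillis, int size) {
        if (size != 100) {
            return;
        }
        timesOfSize100.addLast(timestampMillis);
        // Drop packets that fall outside the one-second sliding window.
        while (!timesOfSize100.isEmpty()
                && timestampMillis - timesOfSize100.peekFirst() > 1000) {
            timesOfSize100.removeFirst();
        }
        int count = timesOfSize100.size();
        // Rules 2-4: post a message whenever a threshold is crossed, which the
        // output interface reports as predicted abnormal behavior.
        for (int threshold : thresholds) {
            if (count > threshold) {
                System.out.println("Threshold " + threshold
                        + " exceeded: predicting abnormal behavior");
            }
        }
    }
}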

3.1.4 Types of IDS:

Intrusion detection systems can be broken down into three major categories:

1. Host-based Systems: when an IDS examines data that comes straight from individual systems, or computers (hosts), it is host-based. Examples of data sources include event logs of the operating system and applications (Web servers, database products, etc.).

2. Network-based Systems: when an IDS observes data as it moves across the network, such as TCP/IP traffic, it is called network-based.

3. Hybrid Systems: a hybrid system is simply an IDS that has features of both network-based and host-based systems.

3.2 System Overview

Since the introduction of the Internet, intrusion attempts on network systems have increased to a great extent. With the increase in security measures, there have been cleverer attacks by much more sophisticated attackers. Because of this, Network Intrusion Detection Systems (NIDS) have become increasingly necessary in today's scenario: if you have an Internet connection, then a firewall as well as a network intrusion detection system is essential.

A number of "ready to run" software options are already available which try to provide some measure of network security. An intrusion, in computer networking terms, is defined as someone (a hacker or cracker) trying to bypass security protocols and infiltrate a network system. The motive behind this could be something as small as misusing e-mail for spam, stealing confidential data, or any number of things for which a system administrator could be held responsible. Evidence has shown that these attacks are becoming more intelligent, subversive and harmful. It has become clear that anyone accountable for a network with an Internet presence is now a potential target, and intrusion detection systems are quickly becoming an essential necessity.

3.2.1 Beginning

A USAF paper published in October 1972, written by James P. Anderson, explained that the USAF had become increasingly aware of computer security problems, a difficulty felt in practically every part of USAF operations and administration. During that period, the USAF had to perform the daunting task of providing shared use of their computer systems, which held various levels of classification in a need-to-know environment, with a user base possessing various levels of security clearance.

Thirty years on, this remains a serious problem: how to safely protect separate classification domains on the same network without any compromise in security? The first task was to find

and define the threats that existed. Before designing an IDS, it was necessary to understand the types of threats and attacks that could be mounted against computer systems and how to recognize them in audit data. In effect, this called for a risk evaluation plan to understand the threat (what the risks or vulnerabilities are, and what the attacks or means of penetration might be), followed by the creation of a security policy to protect the systems in place. Between 1984 and 1986, Dorothy Denning and Peter Neumann examined and designed the first model of a real-time IDS. This prototype was named the Intrusion Detection Expert System (IDES). IDES was originally a rule-based expert system trained to detect known malicious activity. This same system has been developed and improved to form what is known today as the Next Generation Intrusion Detection Expert System (NIDES). The report published by James P. Anderson and the work on IDES were the start of much of the research on IDS throughout the 1980s and 1990s. An intrusion detection system (IDS) is a system designed to systematically detect attacks on the hosts of a network. These systems provide a secondary, passive level of security by providing the administrator with critical information about intrusion attempts. Datagrams are simply the packet bundles of information that computer systems use to communicate with each other over the network. Typically an IDS is not intended to block or actively counter attacks, but some newer systems have an active capacity for dealing with threats. Indeed, a very knowledgeable human being should be watching and making value judgments on the 'alerts' that the IDS presents. While firewalls can be thought of as a border or security perimeter, an IDS should detect whether that border has been breached. Under no circumstances does an IDS guarantee security, but with proper policies, authentication and access control, some measure of security can be attained.

3.2.2 Types of IDS

3.2.2.1 Host-Based

Host-based approaches detect intrusions using audit data collected from the target host machine. As the information given by the audit data can be extremely comprehensive and detailed, host-based approaches can achieve high detection rates and low false-alarm rates. However, there are disadvantages to host-based approaches, which include the following:

1) Host-based approaches cannot easily prevent attacks: when an intrusion is detected, the attack has already partially occurred.

2) Audit data may be altered by attackers, affecting the reliability of the audit data.


The data from a single host is used to detect signs of intrusion as packets enter or exit that host. Host-based systems are becoming more and more popular due to their effectiveness at handling insider misuse. This is mostly because the IDS gathers data (log files) from each critical machine within the network, whereas network-based systems can only analyse the data that passes a particular network node.

Host-based systems stand out at stopping the following:

Data Access/Modification: The make-up of mission-critical data is different for every organization, but it includes things like the Web site, customer or member databases, proposal information, and personnel records. By keeping an eye on access to this data and taking note of changes, host-based IDSs are better at knowing when something was altered that should not have been.

Abuse of Privilege: This is probably one of the most serious problems in most organizations, and an area where host-based IDSs excel. By keeping track of changes to permissions, the host-based system can inform security personnel when the doors are swinging too wide. In addition, most host-based systems allow security administrators to get a rapid view of the privileges that exist across their organization, and can ensure that people such as past employees are removed from all systems.

3.2.2.2 Network-Based

Network-based approaches detect intrusions using the IP packet information collected by network hardware such as switches and routers. Such information is not as plentiful as the audit data of the target host machine. Nevertheless, there are advantages to network-based approaches, which include the following:

1) Network-based approaches can detect the so-called "distributed" intrusions over the whole network and thus lighten the burden on each individual host machine for detecting intrusions.

2) Network-based approaches can defend the machine against attack, as detection occurs before the data arrive at the machine.

The information from the network is scrutinized against a database and the system flags traffic that looks suspicious. Audit data from one or more hosts may be used as well to detect signs of intrusion. Network-based systems focus on observing the network packets by sniffing them, which means that they record traffic as it goes by. Some IDSs of this type can be installed in more than one location, which is usually referred to as a Distributed IDS.


Network-based IDSs tend to be less expensive than their host-based cousins, as they typically only need to be installed near the entry/exit points of the network.

Network-based systems do extremely well against outsider attacks, and focus on catching people before they are authenticated. Areas where they are good at stopping attacks include the following:

DoS & Packet Manipulation: A denial-of-service (DoS) attack is when someone sends an overload of network packets to a single resource, causing it to either crash or become so slow as to be unresponsive. A more advanced version is the Distributed Denial of Service attack, in which multiple computers all attack the resource simultaneously. Many network attacks involve sending network packets that are of incorrect size or configuration, which often causes the targeted resource to crash. Network-based IDSs, because they can process huge amounts of network traffic and sit in an optimal location, are excellent for blocking such attacks. However, note that they can also be a prime target for these attacks.

Unauthorized Use: This is the most common attack type that people think of when they hear about IT security. Network-based IDSs are ideal for tracking unauthorized access, meaning intruders that are attempting to log in to a machine without the proper credentials, compromise a machine to create a jump-off point, or grab passwords or data.


Chapter 4

Requirement Analysis

4.1 Data Set

With the enormous growth of computer network usage and the huge increase in the number of applications running on top of it, network security is becoming increasingly important. As shown in [1], all computer systems suffer from security vulnerabilities which are both technically difficult and economically costly for manufacturers to solve. Therefore, the role of Intrusion Detection Systems (IDSs), as special-purpose devices to detect anomalies and attacks in the network, is becoming more important. Research in the intrusion detection field has for a long time been mostly focused on anomaly-based and misuse-based detection techniques. While misuse-based detection is generally favoured in commercial products due to its predictability and high accuracy, in academic research anomaly detection is typically conceived as a more powerful method due to its theoretical potential for addressing novel attacks.

Conducting a thorough analysis of the recent research trend in anomaly detection, one will encounter several machine learning methods reported to have a very high detection rate of 98% while keeping the false alarm rate at 1% [2]. However, when we look at the state-of-the-art IDS solutions and commercial tools, there are few products using anomaly detection approaches, and practitioners still think that it is not yet a mature technology. To find the reason for this contrast, we studied the details of the research done in anomaly detection and considered various aspects such as learning and detection approaches, training data sets, testing data sets, and evaluation methods. Our study shows that there are some inherent problems in the KDDCUP'99 data set [3], which is widely used as one of the few publicly available data sets for network-based anomaly detection systems.

4.1.1 KDD CUP 99 data set description

Since 1999, KDD'99 [3] has been the most widely used data set for the evaluation of anomaly detection methods. This data set was prepared by Stolfo et al. [5] and is built from the data captured in the DARPA'98 IDS evaluation program [6]. DARPA'98 comprises about 4 gigabytes of compressed raw (binary) tcpdump data covering 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data contain around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or an attack, with exactly one specific attack type. The simulated attacks fall into one of the following four categories (a small mapping sketch follows the list):

1) Denial of Service Attack (DoS): is an attack in which the attacker makes some

computing or memory resource too busy or too full to handle legitimate requests, or

denies legitimate users access to a machine.

2) User to Root Attack (U2R): is a class of exploit in which the attacker starts out

with access to a normal user account on the system (perhaps gained by sniffing

passwords, a dictionary attack, or social engineering) and is able to exploit some

vulnerability to gain root access to the system.

3) Remote to Local Attack (R2L): occurs when an attacker who has the ability to send

packets to a machine over a network but who does not have an account on that

machine exploits some vulnerability to gain local access as a user of that machine.

4) Probing Attack: is an attempt to gather information about a network of computers

for the apparent purpose of circumventing its security controls.
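As a small illustration of this grouping, the sketch below maps a handful of well-known KDD'99 attack labels to the four categories; the list is deliberately incomplete (the full set of training attack types is listed in [7]) and the class name is hypothetical.

import java.util.HashMap;
import java.util.Map;

public class AttackCategories {
    static final Map<String, String> CATEGORY = new HashMap<String, String>();
    static {
        // A few representative labels only; the real mapping covers all attack types.
        CATEGORY.put("smurf", "DoS");
        CATEGORY.put("neptune", "DoS");
        CATEGORY.put("buffer_overflow", "U2R");
        CATEGORY.put("rootkit", "U2R");
        CATEGORY.put("guess_passwd", "R2L");
        CATEGORY.put("warezclient", "R2L");
        CATEGORY.put("portsweep", "Probe");
        CATEGORY.put("nmap", "Probe");
        CATEGORY.put("normal", "Normal");
    }

    static String categoryOf(String label) {
        String category = CATEGORY.get(label);
        return category != null ? category : "Unknown";
    }
}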

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not present in the training data, which makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and that the signatures of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types appearing in the test data only. The names and detailed descriptions of the training attack types are listed in [7].

KDD’99 features can be classified into three groups:

1) Basic features: this category encapsulates all the attributes that can be extracted from a TCP/IP connection. Most of these features lead to an implicit delay in detection.

2) Traffic features: this category includes features that are computed with respect to a window interval, and it is divided into two groups (a sketch of both windows is given after this list):

a) "same host" features: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behaviour, service, etc.

b) "same service" features: examine only the connections in the past 2 seconds that have the same service as the current connection. The two aforementioned types of "traffic" features are called time-based. However, there are several slow probing attacks that scan the hosts (or ports) using a much larger time interval than 2 seconds, for example one in every minute. As a result, these attacks do not produce intrusion patterns within a time window of 2 seconds. To solve this problem, the "same host" and "same service" features are re-calculated, but based on a connection window of 100 connections rather than a time window of 2 seconds. These features are called connection-based traffic features.

3) Content features: unlike most of the DoS and Probing attacks, the R2L and U2R attacks do not have any frequent sequential intrusion patterns. This is because the DoS and Probing attacks involve many connections to some host(s) in a very short period of time, whereas the R2L and U2R attacks are embedded in the data portions of the packets and normally involve only a single connection. To detect these kinds of attacks, we need some features that are able to look for suspicious behaviour in the data portion, e.g., the number of failed login attempts. These features are called content features.
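The sketch referenced in the traffic-features item above shows, under assumed class and method names, how a single "same host" count could be computed both over the 2-second time window and over the window of the last 100 connections.

import java.util.ArrayDeque;
import java.util.Deque;

public class SameHostFeatures {

    private static class Conn {
        final long timestampMillis;
        final String destinationHost;
        Conn(long t, String d) { timestampMillis = t; destinationHost = d; }
    }

    // History limited to the last 100 connections (the connection-based window).
    private final Deque<Conn> history = new ArrayDeque<Conn>();

    /** Record a connection and return {timeBasedCount, connectionBasedCount}. */
    int[] update(long timestampMillis, String destinationHost) {
        history.addLast(new Conn(timestampMillis, destinationHost));
        while (history.size() > 100) {
            history.removeFirst();
        }
        int timeBased = 0;         // same destination host within the past 2 seconds
        int connectionBased = 0;   // same destination host within the last 100 connections
        for (Conn c : history) {
            if (c.destinationHost.equals(destinationHost)) {
                connectionBased++;
                if (timestampMillis - c.timestampMillis <= 2000) {
                    timeBased++;
                }
            }
        }
        return new int[] {timeBased, connectionBased};
    }
}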

4.2 Arbitral Strategy by Neural Network

An artificial neural network is a powerful tool for solving complex classification problems. We do not need to impose many assumptions on the problem; we only need to prepare a set of inputs and targets to train the network, and let it learn a model. The most popular neural network is the error back-propagation (BP) neural network. A conventional BP network is a three-layer feed-forward network. We choose to build a conventional BP network as our final arbiter because of its simplicity and popularity. The inputs of the BP network are the prediction confidence ratios from each binary classifier. The output with the maximal value is interpreted as the final class.

The number of nodes in the input layer and the output layer equals the number of binary classifiers in our MC-SLIPPER. However, it is difficult to choose the best number of nodes for the hidden layer, because it depends on many factors, such as the numbers of nodes in the input and output layers, the number of training examples, the type of hidden-node activation function, and so on. We choose the number of nodes for the hidden layer according to some rules of thumb. This completes the arbitral strategy of our Multi-class SLIPPER framework.
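A minimal sketch of this arbitral step; the class name is hypothetical, the trained weights are supplied by the caller, and bias terms are omitted for brevity. It only illustrates the forward pass that takes one confidence ratio per binary classifier and interprets the output node with the maximal value as the final class.

public class BpArbiter {
    private final double[][] hiddenWeights;   // [hidden node][input node]
    private final double[][] outputWeights;   // [output node][hidden node]

    BpArbiter(double[][] hiddenWeights, double[][] outputWeights) {
        this.hiddenWeights = hiddenWeights;
        this.outputWeights = outputWeights;
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // One feed-forward layer: weighted sum followed by the sigmoid activation.
    private static double[] layer(double[] in, double[][] weights) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < in.length; j++) {
                sum += weights[i][j] * in[j];
            }
            out[i] = sigmoid(sum);
        }
        return out;
    }

    /** confidences: one prediction confidence ratio per binary classifier; returns the final class index. */
    int arbitrate(double[] confidences) {
        double[] hidden = layer(confidences, hiddenWeights);
        double[] output = layer(hidden, outputWeights);
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;   // e.g. index of Normal/DoS/Probe/R2L/U2R
    }
}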


4.3 Framework for Multi-Class SLIPPER

The current version of SLIPPER is a binary classifier. However, the intrusion detection problem is a five-class classification problem. To handle the multi-class problem, we build a framework (Fig 3) using the binary SLIPPER as a basic module. The basic idea is to translate the multi-class problem into multiple binary classification problems, with a final arbiter adopting a certain strategy to make the final decision. Below we give the details of this framework.

4.3.1 Train Multiple Binary Classifiers

For a multi-class problem, the training dataset contains examples with multiple class labels. However, the binary SLIPPER classifier only accepts training examples with two class labels. To build a binary classifier for each class, we pre-process the training data to generate proper training data for each class. An optimized pre-process procedure that reduces disk reads is shown in Fig 4. For each training example, if the label is not the target class name, then we change it to an unused class name, such as "other"; otherwise, we keep the label the same. While pre-processing the training dataset, we can obtain the frequency of the target class, which can be used to ensure that the positive class is our target class for each binary classifier. A sketch of this pre-processing step is given after Fig 4.

Once we have obtained a binary classifier model for each class, we can predict an unseen data example using all the models. Each classifier will output its predicted class with a confidence. Obviously, the results might be conflicting. To resolve conflicting outputs, we propose different arbitral strategies.


Fig 3: Multi-class SLIPPER

Fig 4: Optimized preprocess algorithm
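A minimal sketch of the single-pass pre-processing described above. The file names are hypothetical, and the training labels are assumed to have already been grouped into the five category names; only the relabelling of every non-target class to "other" and the class-frequency count follow the text.

import java.io.*;
import java.util.*;

public class BinaryPreprocessor {
    public static void main(String[] args) throws IOException {
        String[] targets = {"normal", "dos", "probe", "r2l", "u2r"};
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        Map<String, Integer> frequency = new HashMap<String, Integer>();
        for (String t : targets) {
            writers.put(t, new PrintWriter(new FileWriter(t + "_train.data")));
            frequency.put(t, 0);
        }

        // Single pass over the training data: one relabelled output file per target class.
        BufferedReader in = new BufferedReader(new FileReader("train.data"));
        String line;
        while ((line = in.readLine()) != null) {
            int comma = line.lastIndexOf(',');
            String prefix = line.substring(0, comma + 1);
            String label = line.substring(comma + 1).trim();
            for (String t : targets) {
                // Keep the label for the target class, relabel everything else as "other".
                String out = t.equals(label) ? label : "other";
                writers.get(t).println(prefix + out);
            }
            if (frequency.containsKey(label)) {
                frequency.put(label, frequency.get(label) + 1);
            }
        }
        in.close();
        for (PrintWriter w : writers.values()) {
            w.close();
        }
        System.out.println("Class frequencies: " + frequency);
    }
}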


4.4 HARDWARE AND SOFTWARE REQUIREMENT:

SOFTWARE REQUIRED

Java 1.3 or higher.

Java Swing.

HARDWARE REQUIREMENT

Hard Disk (40 GB).

RAM (128 MB).

Processor (Pentium).

4.5 PROJECT PLAN

Task                                              Effort (weeks)   Deliverables

Analysis of existing systems & comparison
with the proposed one                             4 weeks
Literature survey                                 1 week
Designing & planning                              1+2 weeks
    o System flow                                 1 week
    o Designing modules & their deliverables      2 weeks          Module design document
Implementation                                    9 weeks          Primary system
Testing                                           3 weeks          Test reports
Documentation                                     1 week           Complete project report

Table 1: PROJECT PLAN


Chapter 5.

System Design

[Activity diagram: the labeled standard KDD data set is preprocessed into a training data set; Multi-class SLIPPER builds rules and feeds prediction engines for Normal, DoS, Probe, R2L and U2R; the sign of the summed prediction confidence yields a positive or negative prediction; false positive and false negative predictions identified by the operator trigger tuning of the rule set.]

Fig 5: Activity Diagram


[Use case diagram: actors System Operator and User; use cases include loading the standard KDD Cup 99 dataset, preprocessing the data, building the training dataset and rule set, running the prediction engine, detecting attacks, and tuning the initial rules into modified tuned rules. Implementation classes: filediscrepto.java, Atids.java, DOSAttack.java, R2LAttack.java, U2RAttack.java, ProbeAttack.java, PredictConfidence.java, Tunning.java.]

Fig 6: Use case Diagram


Fig 7: Component Diagram


[Class diagram: classes Atids, Filedescriptor, Predictorconfidence, Tunning, DOSAttack, ProbeAttack, R2LAttack and U2RAttack. Atids analyses the dataset, creates rules and calculates confidences and weights; Filedescriptor reads the KDD input file and extracts features; each attack class calculates Zt and grow/prune rules from its summed confidence; Predictorconfidence computes confidence ratios; Tunning adjusts rules on false positive and false negative predictions.]

Fig 8: Class Diagram



[Deployment diagram: the ATIDS system connected to the Internet and monitoring machines 1 to n.]

Fig 9: Deployment Diagram


[Sequence diagram: the System Operator, via the MainUI, loads the standard KDD dataset, which is preprocessed and labeled into a training dataset; machine learning with SLIPPER produces the initial rule set and weights; the prediction engine applies the rule set to input packets from the network; false predictions and their confidences are fed back for tuning of the rule set.]

Fig 10: Sequence Diagram



Chapter 6.

Conclusion and Future Scope

Because computer networks are continuously changing, it is difficult to collect high-quality training data to build intrusion detection models. In this report, rather than focusing on building a highly effective initial detection model, we propose to improve a detection model dynamically after the model is deployed, when it is exposed to new data. In our approach, the detection performance is fed back into the detection model, and the model is adaptively tuned. To simplify the tuning procedure, we represent the detection model in the form of rule sets, which are easily understood and controlled; tuning amounts to adjusting the confidence values associated with each rule. This approach is simple yet effective. Our experimental results show that the TMC of ATIDS with full and instant tuning drops about 35% from the cost of the MC-SLIPPER system with a fixed detection model. If only 10% of the false predictions are used to tune the model, the system still achieves about a 30% performance improvement. When tuning is delayed by only a short time, the system achieves a 20% improvement when only 1.3% of the false predictions are used to tune the model. ATIDS imposes a relatively small burden on the system operator: operators need only mark the false alarms after they identify them. These results are encouraging. We plan to extend this system by tuning each rule independently. Another direction is to adopt more flexible rule adjustments beyond the constant factors relied on in these experiments. We have further noticed that if system behaviour changes drastically, or if the tuning is delayed too long, the benefit of model tuning might be diminished or even negative. In the former case, new rules could be trained and added to the detection model. If it takes too much time to identify a false prediction, tuning on that particular false prediction is easily prevented, as long as the prediction result is not fed back to the model tuner. A sketch of the confidence-adjustment step is given below.
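A minimal sketch of the confidence-adjustment step, under stated assumptions: the class and method names and the constant adjustment factor are hypothetical, and the exact update used in the experiments may differ. It only illustrates scaling the confidence of the rules that fired on a marked false prediction.

import java.util.HashMap;
import java.util.Map;

public class ModelTuner {
    private final Map<String, Double> ruleConfidence = new HashMap<String, Double>();
    private final double factor = 0.9;   // assumed constant adjustment factor

    void setConfidence(String ruleId, double confidence) {
        ruleConfidence.put(ruleId, confidence);
    }

    // Feed back one false prediction marked by the system operator. For a false positive
    // the rules that fired were too confident, so their confidence is reduced; for a
    // false negative it is raised.
    void tune(Iterable<String> firedRules, boolean falsePositive) {
        for (String ruleId : firedRules) {
            Double c = ruleConfidence.get(ruleId);
            if (c == null) {
                continue;
            }
            ruleConfidence.put(ruleId, falsePositive ? c * factor : c / factor);
        }
    }
}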

References

[1] W. Cohen and Y. Singer, "A simple, fast, and effective rule learner," in Proc. Annu. Conf. Amer. Assoc. Artif. Intell., 1999, pp. 335-342.

[2] W. Lee and S. Stolfo, "A framework for constructing features and models for intrusion detection systems," ACM Trans. Inf. Syst. Secur., vol. 3, no. 4, pp. 227-261, Nov. 2000.

[3] L. Ertoz, E. Eilertson, A. Lazarevic, P. Tan, J. Srivastava, V. Kumar, and P. Dokas, The MINDS—Minnesota Intrusion Detection System: Next Generation Data Mining. Cambridge, MA: MIT Press, 2004.

[4] K. Julisch, "Data mining for intrusion detection: A critical review," IBM, Res. Rep. RZ 3398 (No. 93450), Feb. 2002.

[5] I. Dubrawsky and R. Saville, SAFE: IDS Deployment, Tuning, and Logging in Depth, Cisco SAFE White Paper. [Online]. Available: http://www.cisco.com/go/safe

[6] W. Lee, S. Stolfo, and P. Chan, "Real time data mining-based intrusion detection," in Proc. DISCEX II, Jun. 2001, pp. 89-100.

[7] E. Eskin, M. Miller, Z. Zhong, G. Yi, W. Lee, and S. Stolfo, "Adaptive model generation for intrusion detection systems," in Proc. 7th ACM Conf. Comput. Security Workshop on Intrusion Detection and Prevention, Nov. 2000. [Online]. Available: http://www1.cs.columbia.edu/ids/publications/adaptive-ccsids00.pdf

[8] A. Honig, A. Howard, E. Eskin, and S. Stolfo, "Adaptive model generation: An architecture for the deployment of data mining-based intrusion detection systems," in Data Mining for Security Applications. Norwell, MA: Kluwer, 2002.

[9] M. Hossain and S. Bridges, "A framework for an adaptive intrusion detection system with data mining," in Proc. 13th Annu. CITSS, Jun. 2001. [Online]. Available: http://www.cs.msstate.edu/~bridges/papers/citss-2001.pdf

[10] X. Li and N. Ye, "Decision tree classifiers for computer intrusion detection," J. Parallel Distrib. Comput. Pract., vol. 4, no. 2, pp. 179-180, 2003.

[11] J. Ryan, M. Lin, and R. Miikkulainen, "Intrusion detection with neural networks," in Proc. Advances in NIPS 10, Denver, CO, 1997, pp. 943-949.

[12] S. Kumar and E. Spafford, "A pattern matching model for misuse intrusion detection," in Proc. 17th Nat. Comput. Security Conf., 1994, pp. 11-21.

[13] Z. Yu and J. Tsai, "A multi-class SLIPPER system for intrusion detection," in Proc. 28th IEEE Annu. Int. COMPSAC, Sep. 2004, pp. 212-217.

[14] W. Cohen and Y. Singer, "A simple, fast, and effective rule learner," in Proc. Annu. Conf. Amer. Assoc. Artif. Intell., 1999, pp. 335-342.

[15] R. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Mach. Learn., vol. 37, no. 3, pp. 297-336, Dec. 1999.

[16] L. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1994.
