Network Intrusion Classification
Using Data Mining Techniques
By
Amneh H. Alamleh
Supervisor
Prof. Alaa F. Sheta
This Thesis was Submitted in Partial Fulfilment of the
Requirements for Masters Degree in Computer Science
Faculty of Graduate Studies
Zarqa University - Jordan
August, 2015
Zarqa University
Authorization Statement
I, Amneh Hussein Alamleh, authorize Zarqa University to supply a copy of my thesis/dissertation to
libraries, institutions, bodies, or individuals upon request, in accordance with the regulations in force at the University.
Signature:
Date: 7 / 9 / 2015
Zarqa University
Authorization Statement
I, Amneh Hussein Alamleh, authorize Zarqa University to supply copies of my thesis to
libraries, establishments or individuals on request, according to the University regulations.
Signature:
Date: 7 / 9 / 2015
Acknowledgment
"In the Name of Allah the Most Gracious the Most Merciful". First and foremost, I
would like to thank Allah for giving me the strength and the patience to complete
this work. Then I would like to thank my brothers and sisters for their continuous
support. Also, many thanks to all of my instructors, especially my supervisor Prof.
Alaa Sheta, for his guidance, assistance and comments. May Allah bless you all.
Contents
List of Figures
List of Tables
List of Acronyms
Abstract in Arabic
Abstract in English
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Contributions
1.4 Methodology
1.5 Thesis outline
2 Intrusion Detection System
2.1 Introduction
2.2 Firewalls
2.2.1 Firewalls limitations
2.3 IDS Definition
2.4 IDS Classification
2.4.1 Detection Methods
2.4.2 IDS Architecture
3 Data Mining Techniques
3.1 What is Data Mining
3.2 Data Mining and IDS
3.3 Decision Tree
3.3.1 How to develop a Decision Tree
3.3.2 How to Select Tree Root?
3.4 Artificial Neural Network
3.4.1 Perceptron
3.4.2 Multi-layer Perceptron (MLP)
3.5 Support Vector Machines
3.5.1 How SVM Works
3.6 Summary
4 Related Work
4.1 IDS Using Artificial Neural Network
4.2 IDS Using Support Vector Machine
4.3 IDS Using Decision Tree
4.4 IDS Using Feature Selection
5 Experimental Setup and Results
5.1 Experimental Data Set
5.1.1 KDDCUP'99 Data Set
5.1.2 NSL-KDD
5.1.3 Class Distribution
5.2 Classification Models Setup
5.2.1 C4.5
5.2.2 ANN (MLP)
5.2.3 SVM
5.3 10-fold Cross Validation
5.4 Feature Selection
5.5 Search Space Complexity
5.5.1 Best First Search
5.5.2 Genetic Search
5.6 Model Evaluation
5.7 Results
5.7.1 C4.5
5.7.2 MLP
5.7.3 SVM
5.8 Results Analysis
5.9 Summary
6 Conclusions and Future Work
A Features of NSL-KDD
Bibliography
List of Figures
2.1 IDSs Classification Dimensions (Pathan, 2014)
2.2 Signature based IDS deployment (Gadbois, 2011)
2.3 Anomaly based IDS deployment (Gadbois, 2011)
2.4 Network based IDS (Gadbois, 2011)
3.1 Data Mining methods taxonomy (Maimon and Rokach, 2010)
3.2 Simple Tree Structure
3.3 The simple perceptron architecture
3.4 Proposed MLP architecture
3.5 Left: the margin for a decision boundary is the distance to the nearest data point. Right: In SVMs, we find the boundary with maximum margin. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
3.6 The slack variables ζ ≥ 1 for misclassified points, and 0 < ζ < 1 for points close to the decision boundary. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
5.1 C4.5 classification model structure
5.2 Weka Setup for the MLP classification model
5.3 Weka Setup for SVM classification model
5.4 Main steps of feature selection process (Megha and Amrita, 2013)
5.5 Block diagram for proposed methodology
5.6 Correctly Classified Instances for C4.5, MLP and SVM with the original data, selected features of BF, and selected features of GS
List of Tables
3.1 Example data set
5.1 Distribution of attack records per attack category of the NSL-KDD
5.2 Experimental data
5.3 MLP parameters and their meaning
5.4 SVM parameters and their meaning
5.5 BFS Selected Features
5.6 GS Selected Features
5.7 Confusion matrix
5.8 Confusion matrix for the C4.5 model
5.9 Confusion matrix for the MLP model
5.10 Confusion matrix for the SVM model
5.11 Performance evaluation based C4.5, ANN and SVM models
A.1 NSL-KDD Intrusion Detection Data set Features (Kayacik et al., 2005)
List of Acronyms
ANN Artificial Neural Networks
BFS Best First Search
CART Classification and Regression Trees
CSE Consistency Subset Evaluator
DARPA Defense Advanced Research Projects Agency
DM Data Mining
DoS Denial of Service
DT Decision Tree
FS Feature Selection
GA Genetic Algorithm
GP Genetic Programming
HIDS Host Based Intrusion Detection System
IDS Intrusion Detection System
IG Information Gain
IDEP Intrusion Detection Evaluation Program
KDD Knowledge Discovery of Data
LR Logistic Regression
MARS Multivariate Adaptive Regression Splines
NB Naïve Bayes
NIDS Network Based Intrusion Detection System
R2L Remote-to-Local
RBF Radial Basis Function
RF Random Forest
ROC Receiver Operating Characteristic
RST Rough Set Theory
SVM Support Vector Machine
U2R User-to-Root
VP Voted Perceptron
WEKA Waikato Environment for Knowledge Analysis
Network Intrusion Classification Using Data Mining Techniques
Prepared by
Amneh Hussein Alamleh
Supervised by
Prof. Alaa Fathi Sheta
Abstract
The volume of intrusions on computer networks is continuously increasing and evolving over time, forcing
organizations to renew and develop their network protection systems so that they are safe from financial and
information loss. The Intrusion Detection System (IDS) is a very important element of protection systems.
Its importance lies in detecting attempts at unauthorized access to the computer system and the network,
which result in: unauthorized persons gaining access to data and systems, loss of data integrity, and
unavailability of systems and data to authorized persons. Classifying the types of attack is one of the
main steps in the intrusion detection process.
In this thesis, three data mining techniques were explored to deal with the problem of classifying attack
types. These techniques are: Decision Tree, Artificial Neural Network, and Support Vector Machine. They were
chosen because they have achieved success in several applications, including network security. The second
goal of this research is to find the best way to reduce the computational load (complexity reduction) of
these models in the first phase of building them by reducing the number of features (feature reduction) of
the data. Finding the best set of features that represents the class from among a large feature set is a
complex process because of the large number of possible choices. Finding the best set of features that
represents the class is also necessary and important. Therefore, an attempt was made to solve this problem
using two methods (Best First Search and Genetic Search). The classification process was repeated with the
selected feature set using DT, ANN and SVM.
The previous models were then rebuilt using the selected features. The results showed that the C4.5
algorithm achieved the highest accuracy compared with the other methods. The overall performance of the DT
and MLP algorithms also improved slightly after selecting a subset of the features.
The Weka software and an experimental dataset (NSL-KDD) were used to implement these models and obtain the
results, since Weka offers various capabilities for both data processing and working with classification
algorithms.
Abstract
The volume of targeted network attacks is steadily increasing and evolving, forcing
businesses to revamp their network security systems because of possible data and
financial losses. An Intrusion Detection System (IDS) is an essential component of
any security system. The main function of an IDS is to identify unauthorized access that
attempts to compromise the confidentiality, integrity, or availability of computers or
computer networks. One of the major steps in addressing the problem of
intrusion detection is classifying the types of attacks. In this research, we explore
the use of three data mining approaches to solve the attack classification problem:
the Decision Tree (DT) based C4.5 algorithm, Artificial Neural Networks
(ANN), and the Support Vector Machine (SVM). These techniques have shown successful
outcomes in a variety of applications, including network security. Another goal of
this research is to provide a suitable way to reduce the complexity of the
classification models developed in the first phase by reducing the feature domain.
It was found that selecting the best set of features from a larger set is a complex
problem because of the large number of possible choices. A combination of features
that best represents the class(es) of attacks is urgently needed. Therefore, we explored
the use of both the Best First Search (BFS) and the Genetic Search (GS) algorithms to
handle this problem.
The classification process was repeated with the reduced feature set using DT,
ANN, and SVM. The performance of the decision tree was superior to that of
the other two approaches. The performance of the DT and ANN improved slightly
with feature selection. To develop our results, we used the WEKA software and the
NSL-KDD data set. WEKA accommodates the various changes that need to be
made to both data pre-processing and its embedded algorithms.
Chapter 1
Introduction
1.1 Motivation
Information technology, networking, and connectivity are ever-changing and evolving.
Individuals and organizations keep vast amounts of critical and sensitive information
on the web. At the same time, we need to preserve our privacy and the confidentiality,
integrity, and availability of our information. Without appropriate implementation of security
controls, this information is at great risk. Kessel and Allan, in their survey titled "Get
Ahead of Cybercrime", stated that "Every organization is at risk of a cyber attack" (Kessel
and Allan, 2014). Multiple solutions have been proposed to deal with the issue of
information and systems security, such as encryption, security policies, program controls,
and firewalls. These are primary security techniques, but they are not enough to
provide secure systems (Muhammad-Imran et al., 2008). On their own, these security
solutions prove insufficient; with the addition of Intrusion Detection Systems (IDSs),
however, a more robust and reliable security system can be implemented. An IDS is
a very important component in protecting computers and network systems because it
detects any new attempt at system abuse. Because intruder attacks vary in type,
traditional IDSs require a huge amount of human effort to maintain, extend, and
improve their performance (Ooi et al., 2013). In this thesis, we continue to work
towards the goal of implementing optimal data mining techniques in order to reduce
the human effort needed to manage ever-changing intruder attacks.
1.2 Problem Statement
• In the past, rule-based analysis relied on sets of predefined rules that were provided
by an administrator or created by the system (Moradi and Zulkernine, 2004).
• Rule-based systems (i.e., expert systems) cannot adapt to the evolving nature of attacks,
resulting in an inflexible detection system.
• Attacks are continuously increasing and evolving.
• Detection methods should have the same adaptive nature to be able to detect new attacks.
• Data mining techniques have the ability to deal with evolving and changing
attacks.
• Many DM techniques have been proposed. The main objective of this research is to find
the most suitable method for IDS implementation, one that can detect attacks
with the highest accuracy in the least time.
1.3 Contributions
The following contributions were achieved:
1. Studying and analyzing the nature of the KDD dataset and its defects. As a result,
two points were proposed:
- An equal class distribution dataset.
- Minimizing the number of its features.
2. Two methods that can be used to reduce the search space of features have been
proposed. This problem is essential for network engineers because solving it reduces
the time to locate an attack (i.e., reduces delay), the effort of monitoring the most
significant attributes that could be a source of attack, and the damage to
network resources. Best First Search (BFS) and Genetic Search
(GS) were used for this purpose.
3. Building three IDS models based on Decision Tree (DT) with pruning, Multi-Layer
Perceptron (MLP), which is a type of Artificial Neural Network (ANN), and
Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. These methods
have advantages over other methods, as concluded from the literature.
1.4 Methodology
To perform this research, the following steps will be followed:
• A survey of various methods for handling the intrusion detection problem.
• Preprocessing: in this stage, the NSL-KDD dataset is analyzed and understood, and the
necessary preprocessing is applied to it.
• The C4.5 classification tree, Artificial Neural Network, and Support Vector Machine
classification algorithms will be tested on the NSL-KDD data set.
• Feature selection will be executed using the Best First and Genetic Search
algorithms.
• ANN, C4.5, and SVM will be tested again after applying feature
selection.
• Results will be analyzed and conclusions will be stated.
1.5 Thesis outline
The rest of this thesis is organized as follows. Chapter two explains what an IDS is,
its role in computer security, its architecture, and its taxonomy. Chapter three
describes the DT, ANN, and SVM techniques. Chapter four reviews related work
in the field of intrusion detection using various techniques. The research experiments, model
structures, performance of the developed models, and analysis of the results are presented in chapter five.
Finally, the conclusions and suggested future work are given in chapter six.
Chapter 2
Intrusion Detection System
2.1 Introduction
Computer security is defined as the protection of computing systems against threats
to confidentiality, integrity, and availability (Summers, 2010; Pfleeger and Pfleeger,
2006).
• Confidentiality: computer-related assets are revealed only to authorized people
with pre-defined rights.
• Integrity: no one except the authorized parties can apply any type of modifica-
tion to systems including: writing, deleting, or creating.
• Availability: the system is capable of providing the services at any given time
to authorized parties.
There are three main categories of security mechanisms: attack prevention, attack
avoidance, and attack detection (Kruegel et al., 2005).
• Attack prevention: ways of preventing certain attacks before they reach the
target. Access control is an important element in this category. A firewall is an
important access control system at the network layer.
• Attack avoidance: in this category, an intruder may access the targeted resource,
but the information is modified in a way that makes it unusable for the attacker.
Cryptography is the most important element in this category.
• Attack detection: assumes that an attacker can obtain access to the desired targets
and can successfully violate a given security policy. If an attack happens,
attack detection has to report that something wrong is going on and react
in an appropriate way. Intrusion detection systems are the most important
element of this class.
2.2 Firewalls
Firewalls have been designed to protect a network from outside threats as a first line of
defense. A firewall provides a controlled connection from one network to another (Das and Sarkar,
2014). Typically, firewalls come in hardware, software, or a combination of both, creating
a checkpoint outside of the network. Providing protection in both
directions, firewalls keep outsiders from breaking in and prevent those
inside the network from revealing valuable data. Furthermore, firewalls can proxy an
internet service and block problematic services.
There are three main types of firewall technologies: packet filtering, application-based
firewalls or proxy servers, and stateful packet filtering. Packet filtering does not look at
packet contents; it filters on IP address and port to determine whether a packet
is accepted. A proxy server sits between the service requester and the
service provider, hiding the requester's real IP address from the other party.
Proxy servers also perform logging and access control and prevent direct traffic between
networks. Lastly, stateful packet filtering provides more security checks by combining
the functionality of packet filtering and proxy firewalls. It inspects the first
packet of a connection, then adds an entry to a state table (Firewalls, 2015).
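The difference between plain and stateful packet filtering can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the names `Packet`, `BLOCKED_SOURCES`, `ALLOWED_PORTS`, and `StatefulFilter`, as well as the rule set itself, are hypothetical and do not describe any particular firewall product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical rule set: blocked source addresses and permitted destination ports.
BLOCKED_SOURCES = {"203.0.113.9"}
ALLOWED_PORTS = {80, 443}


def packet_filter(pkt: Packet) -> bool:
    """Stateless filtering: decide from IP address and port only, never the payload."""
    return pkt.src_ip not in BLOCKED_SOURCES and pkt.dst_port in ALLOWED_PORTS


class StatefulFilter:
    """Inspect the first packet of a flow, then remember the flow in a state table."""

    def __init__(self):
        self.state_table = set()

    def accept(self, pkt: Packet) -> bool:
        flow = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
        reply_of = (pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port)
        # Packets of a known flow, or replies to one, pass without re-inspection.
        if flow in self.state_table or reply_of in self.state_table:
            return True
        # First packet of a new flow: apply the rules, then record the flow.
        if packet_filter(pkt):
            self.state_table.add(flow)
            return True
        return False
```

In this sketch a reply addressed to a high client port is rejected by the stateless filter alone, but it is accepted by the stateful filter once the matching outgoing connection has been recorded in the state table.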
2.2.1 Firewalls limitations
Even though firewalls are necessary for security, they have some limitations. The
following are some limitations of firewalls (Stallings, 2010):
• If an attack bypasses the firewall, the firewall's protection is void.
• Firewalls do not provide adequate protection from threats that can occur internally,
for instance from an employee with malicious intent or an employee who
unknowingly aids an external attacker via social media.
• A firewall cannot protect against wireless communication between local systems
on different sides of the internal firewall. This is a concern when a
poorly secured LAN can be compromised from outside the organization.
• An infected device (employee devices, laptops, portable devices, etc.) used on
the corporate network completely bypasses firewall security.
An Intrusion Detection System kicks in if and when the firewall fails. An IDS will
evaluate an intrusion once it happens. An IDS is specifically programmed to prevent and
find attacks that are missed by firewall filters (Firewalls, 2015).
2.3 IDS Definition
Intrusion Detection Systems (IDS) are designed to inspect all network activity and
identify suspicious incoming and outgoing patterns. As a type of packet scanner,
an IDS scans all packets on the network and classifies inbound and outbound traffic as
intrusive or not intrusive. Denial of Service (DoS) attacks, disclosures,
manipulations, and masquerading are some examples of intrusions.
Cyber attacks on systems can either fail or succeed. Intrusion detection systems are
designed to monitor targeted systems and collect audit trails, analyze the gathered
information for signs of unusual activity and misuse, automatically respond to
detected activity and mitigate damage, generate reports about questionable activity
and send out notifications, and discover and diagnose problems. Intrusions are broken
down into three types: host intrusions, network intrusions, and application
intrusions (Liu, 2012).
Figure 2.1: IDSs Classification Dimensions (Pathan, 2014).
2.4 IDS Classification
IDSs can be classified along multiple dimensions based on detection method, architecture,
and post-detection action (Pathan, 2014). A complete categorization can be
seen in Figure 2.1. In this research we are interested in the network anomaly detection
method.
2.4.1 Detection Methods
A. Misuse IDS
Misuse (Signature) Intrusion Detection Systems work like a virus scanner.
Relying on rules, a signature detection system tries to correlate likely patterns
with intrusion attempts. In order to gain access to a system, viruses try a number
of steps in a particular pattern. These specific steps are made into a customized
rule, and when the Intrusion Detection System compares the collected observations
against the rule, it decides whether the match is positive or negative. Figure 2.2 shows the
deployment of a signature based IDS.
Figure 2.2: Signature based IDS deployment (Gadbois, 2011).
Figure 2.3: Anomaly based IDS deployment (Gadbois, 2011).
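The rule matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the matching engine of any real IDS: the signature names and the event strings in `SIGNATURES` are hypothetical, and each signature is assumed to be an ordered sequence of steps.

```python
# Hypothetical signatures: each is an ordered sequence of steps an attack performs.
SIGNATURES = {
    "brute_force_login": ["login_failed", "login_failed", "login_failed", "login_ok"],
    "port_scan_then_exploit": ["syn_scan", "service_probe", "exploit_attempt"],
}


def contains_in_order(events, pattern):
    """Return True if the steps of `pattern` occur in `events` in the same order."""
    remaining = iter(events)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later events.
    return all(step in remaining for step in pattern)


def match_signatures(events):
    """Return the sorted names of all signatures found in the observed events."""
    return sorted(
        name for name, pattern in SIGNATURES.items()
        if contains_in_order(events, pattern)
    )
```

A sequence of events that matches no stored pattern yields an empty result, which reflects the main weakness of this approach: an attack without a signature in the database goes unnoticed.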
B. Anomaly IDS
Anomaly detection consists of a baseline profile that is set by the IDS or a
network administrator. This established baseline informs the system of normal
network traffic and can flag any deviation as an attack (Lokesak, 2008). Figure
2.3 shows the deployment of anomaly based IDS.
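As a rough sketch of the baseline idea, the following Python code summarizes normal traffic by its mean and standard deviation and flags values that deviate too far from it. The choice of a single numeric measurement (for example, connections per minute) and the three-standard-deviation `threshold` are assumptions made for illustration; they are not taken from the cited sources.

```python
from statistics import mean, stdev


def build_baseline(normal_samples):
    """Summarize normal traffic measurements as (mean, standard deviation)."""
    return mean(normal_samples), stdev(normal_samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that deviates from the baseline mean by more than
    `threshold` standard deviations (the threshold is an assumed tuning knob)."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any difference from the mean is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

A lower threshold catches more attacks but raises more false alarms, which is the trade-off that makes anomaly detection prone to false positives.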
C. Anomaly vs. Signature IDS
Signature based detection offers more accuracy, time savings, and detailed log
files (Lokesak, 2008). In identifying intrusion attempts, signature
based detection is more accurate. Furthermore, because of this increased accuracy,
administrators spend far less time on false positives (Lokesak, 2008).
Because this form of detection has detailed log files, it is easier to identify the
cause of an alarm. However, there are some downsides. Signature based detection
systems respond only to what is in their database and require constant updates
(Lokesak, 2008). Additionally, when new viruses are discovered, it may take
hours to days until updates are implemented. Systems can also become sluggish
if hardware isn’t updated and managed. On the other hand, anomaly based de-
tection detects new threats without an administrator’s updates (Lokesak, 2008).
It also learns about network activity and creates profiles on an ongoing basis. So,
the longer this system is implemented the more accurate it becomes. However,
this advantage creates the disadvantage of being unprotected during its profile
building. Furthermore, if an intrusion or attack looks like normal activity to the
system, an alarm is not triggered. Anomaly based detection is also more prone
to sending out false positives (Lokesak, 2008).
2.4.2 IDS Architecture
A. Host Based IDS
Host Based Intrusion Detection Systems (HIDS) focus on collecting and ana-
lyzing information on a specific host or system. While relying heavily on audit
trails and system logs for identifying unauthorized access, HIDS checks and
collects system data from file systems, network events and system calls (Scar-
fone and Mell, 2007). There are two types of HIDS: anomaly detection and
signature based detection which are described above.
B. Network Based IDS
Network Based Intrusion Detection Systems (NIDS) provide real-time monitoring
of networks (Scarfone and Mell, 2007). If an intrusion occurs,
this system can detect the attack as it happens. A NIDS allows direct analysis of
network traffic, in which all the network traffic is seen at all levels of the operating
system, and it does not degrade network or host performance. A NIDS has two jobs:
• Its first job is to record each incoming or outgoing packet, leaving a packet
trace.
• Its second job is to analyze each packet trace to identify a matching attack
signature (Scarfone and Mell, 2007).
Figure 2.4 shows a network architecture with NIDS.
Figure 2.4: Network based IDS (Gadbois, 2011).
C. Network Based vs. Host Based IDS
A NIDS has some advantages over a HIDS. It is physically a separate unit and
it does not consume any of the host system's resources. Overall, a NIDS is good for detecting
unauthorized access, bandwidth theft, or DoS attacks (Jessica, 2007). A
disadvantage of NIDS is that administrators may fail to plan appropriately
for traffic growth, causing the NIDS to overload and drop packets, thus
defeating its purpose. NIDS is also susceptible to slow attacks (Jessica, 2007).
HIDS has the ability to use logs, system services, registry events, and other information
on the system. However, a HIDS may detect an attack too late. A HIDS also
uses system resources because it runs on the host (Jessica,
2007). Ideally, NIDS and HIDS complement each other and should both be
implemented.
Chapter 3
Data Mining Techniques
As the world grows in complexity, overwhelming us with the data it produces, data
mining becomes our only hope for discovering hidden knowledge. DM is defined as
the process of discovering patterns in data (Witten et al., 2011).
3.1 What is Data Mining
Data Mining is defined as the process of extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns or knowledge from huge amounts
of data (Han et al., 2012). Data Mining is the core of the Knowledge Discovery of Data
(KDD) process, involving the inferring of algorithms that explore the data, develop the
model and discover unknown patterns (Maimon and Rokach, 2010). There are many
methods of Data Mining used for different purposes and goals. Figure 3.1 presents
DM methods taxonomy (Maimon and Rokach, 2010).
3.2 Data Mining and IDS
Recently, there has been great interest in the application of Data Mining techniques to
intrusion detection systems. The problem of intrusion detection can be reduced to a Data
Mining task of classifying data. Briefly, one is given a set of data points belonging to
different classes (normal activity, different attacks) and aims to separate them as
accurately as possible by means of a model (Maimon and Rokach, 2010). Many different
data mining techniques exist for intrusion detection classification, and researchers have tried
distinct methods to improve classification accuracy. In this research
we use three data mining techniques: Decision Tree, Artificial Neural Network,
and Support Vector Machine.
Figure 3.1: Data Mining methods taxonomy (Maimon and Rokach, 2010).
3.3 Decision Tree
Decision tree is one of the best known and most widely used classification algorithms:
• The decision tree algorithm known as ID3 (Iterative Dichotomiser) has been known
since the 1970s.
• Classification and Regression Trees (CART), which generate binary
decision trees, were presented in (Breiman et al., 1984).
• The C4.5 algorithm was presented later by Quinlan (Quinlan, 1993; Han et al.,
2012). C4.5 became a benchmark to which newer supervised learning algorithms
are often compared.
ID3, CART, and C4.5 adopt a greedy approach in which decision trees are constructed
in a top-down recursive divide-and-conquer way (Han et al., 2012). Unlike ID3, C4.5
can deal with continuous attributes and handles missing values, but it is a little slower than
the other DT algorithms (Ooi et al., 2013).
3.3.1 How to develop a Decision Tree
A decision tree is a directed tree whose structure is formed by recursively partitioning the set
of observations. It consists of a root with no incoming edges, internal (test) nodes,
each with exactly one incoming edge and with outgoing edges, and leaves, which represent the decision
nodes and have no outgoing edges (Maimon and Rokach, 2010). The decision tree
development algorithm is a greedy algorithm that is top-down, recursive, and
divide-and-conquer in nature. The algorithm can be summarized as follows (Kargupta et al.,
2008):
Algorithm 1: Generate-Decision-Tree(samples, att-list)
Input:
    samples: training samples
    att-list: set of candidate attributes
Create a node N   // represents the training samples
If samples are all of the same class C then
    return N as a leaf node labeled with class C;
If att-list is empty then
    return N as a leaf node labeled with the most common class in samples;
Select test-attribute, the attribute in att-list with the highest information gain based on the entropy;
Label node N with test-attribute;
for each known value ai of test-attribute do
    Let si be the set of samples for which test-attribute = ai;
    If si is empty then
        attach a leaf labeled with the most common class in samples;
    else
        attach the node returned by Generate-Decision-Tree(si, att-list)
    end if
end for
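The recursive procedure of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the Weka implementation used in this thesis. The function names, the dictionary-based tree representation, and the "default" entry are assumptions made for illustration. The sketch removes the chosen attribute from the candidate list before recursing, as is usual for ID3 with categorical attributes, and it only branches on attribute values that actually occur in the samples, so the empty-subset case is covered by the stored majority class.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Parent entropy minus the weighted average entropy of the children."""
    n = len(rows)
    children_entropy = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        children_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - children_entropy


def generate_decision_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict describing an internal node."""
    if len(set(labels)) == 1:          # all samples share one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:                      # no candidate attributes left
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]   # assumption: drop used attribute
    node = {"attr": best, "default": majority, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = generate_decision_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node


# Toy data: four samples with three binary attributes and two classes.
rows = [
    {"f1": 1, "f2": 1, "f3": 1},
    {"f1": 1, "f2": 1, "f3": 0},
    {"f1": 0, "f2": 0, "f3": 1},
    {"f1": 1, "f2": 0, "f3": 0},
]
labels = ["A", "A", "B", "B"]
tree = generate_decision_tree(rows, labels, ["f1", "f2", "f3"])
```

On this toy data the attribute f2 separates the two classes perfectly, so the sketch returns a tree with a single test node and two leaves.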
To reduce tree complexity, pruning algorithms were introduced. Pruning is a general technique for countering overfitting: it has a large effect on the tree size and only a slight effect on the accuracy, which often improves as a result, as reported in (Witten et al., 2011). Using a Decision Tree, network connections can be classified as normal, anomaly, or other predefined types of attack.
3.3.2 How to Select Tree Root?
Given a set of training feature vectors, we want to determine which attribute should serve as the root of the tree. Information gain (IG) measures how important a given attribute of the feature vectors is, and it helps decide the ordering of attributes in the nodes of a decision tree. Equations 3.1 and 3.2 show how information gain and entropy are calculated (Han et al., 2012).
$$IG = E(\text{Parent}) - AE(\text{Children}) \tag{3.1}$$
$$\text{Entropy} = \sum_{i} -p_i \log_2 p_i \tag{3.2}$$
E and AE are the entropy and the average entropy, respectively, and p_i is the probability of class i. Entropy comes from information theory: the higher the entropy, the more the information content. For example, consider the training data set in Table 3.1. The table has three features f1, f2 and f3 and two classes A and B. Assuming that f1 is chosen as the split attribute, the resulting impure child node would have to be split further.
Table 3.1: Example data set
f1   f2   f3   Class
1    1    1    A
1    1    0    A
0    0    1    B
1    0    0    B
Thus, for a split on f1, the entropy of the children and the gain can be computed as follows:
$$E_{child1} = -\frac{1}{3}\log_2\left(\frac{1}{3}\right) - \frac{2}{3}\log_2\left(\frac{2}{3}\right) = 0.5284 + 0.39 = 0.9184$$
$$E_{child2} = 0$$
$$E_{parent} = 1$$
$$IG = 1 - \frac{3}{4}\times(0.9184) - \frac{1}{4}\times(0) = 0.3112$$
If we split using the feature f2, we get the following:
$$E_{child1} = 0$$
$$E_{child2} = 0$$
$$E_{parent} = 1$$
$$IG = 1 - \frac{1}{2}\times(0) - \frac{1}{2}\times(0) = 1$$
Splitting on feature f2 therefore produces the best gain. The resulting tree structure is presented in Figure 3.2. This tree was developed using the Weka software (Hall et al., 2009).
Figure 3.2: Simple Tree Structure
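The arithmetic of this example can be checked with a short Python sketch of Equations 3.1 and 3.2. The function and variable names are chosen here for illustration and are not part of the thesis experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Equation 3.2: sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Equation 3.1: parent entropy minus the average entropy of the children."""
    n = len(labels)
    average_child_entropy = 0.0
    for v in set(feature_values):
        child = [y for x, y in zip(feature_values, labels) if x == v]
        average_child_entropy += len(child) / n * entropy(child)
    return entropy(labels) - average_child_entropy


# Columns of Table 3.1.
f1 = [1, 1, 0, 1]
f2 = [1, 1, 0, 0]
f3 = [1, 0, 1, 0]
classes = ["A", "A", "B", "B"]

gains = {
    "f1": information_gain(f1, classes),
    "f2": information_gain(f2, classes),
    "f3": information_gain(f3, classes),
}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

The gain for f1 comes out at about 0.311 and the gain for f2 at 1, in line with the hand calculation, so f2 is selected as the root.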
3.4 Artificial Neural Network
Classification is one of the most active research and application areas of neural networks. A classification problem arises when an object needs to be allocated to a predefined group or class based on a number of observed attributes associated with that object. ANNs have been used successfully for multi-class pattern classification (Zhang, 2000; Ou and Murphey, 2007), medical diagnosis (Brause, 2001), bankruptcy prediction (du Jardin, 2010), handwritten character recognition (Singh et al., 2009; Chaturvedi et al., 2014), and speech recognition (Krol and Szlachetko, 2010).
ANN usually consists of many hundreds of simple processing units which are
connected together in a complex communication network. Each unit or node is a
simplified model of a real neuron which fires (sends off a new signal) if it receives
a sufficiently strong input signal from the other nodes to which it is connected. The
strength of these connections may be varied in order to make the network perform
different tasks corresponding to different patterns of node firing activity. ANN model
consists of a set of synapses each of which is characterized by a weight or strength of
its own.
3.4.1 Perceptron
The neuron is the basic processing unit in an ANN. Each neuron has a number of inputs and a single output. Each input has an assigned factor or parameter called the weight. A neuron works as follows: each input signal is multiplied by the corresponding weight, the products are summed, and the sum is passed through a transfer function. This transfer function is most commonly a sigmoid function (see Equation 3.3) (Quinlan, 1993). The simplest neural network unit is called the "Perceptron" (see Figure 3.3) (Quinlan, 1993). If the result of the summation is over a certain threshold, the neuron output is activated; otherwise it is not.
$$f(x) = \frac{1}{1 + e^{-x}} \tag{3.3}$$
For example, given a set of inputs x_j and a set of corresponding weights w_j, the output of the neuron is calculated by the following function:
$$y_i = f\left(\sum_{j=1}^{n} w_j x_j + w_0\right) \tag{3.4}$$
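Equations 3.3 and 3.4 can be written as a few lines of Python. This is only an illustrative sketch. The names sigmoid and neuron_output are chosen here, and the bias argument plays the role of w_0.

```python
import math


def sigmoid(x):
    """Equation 3.3: logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias):
    """Equation 3.4: weighted sum of the inputs plus bias, passed through f."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


# Two inputs whose weighted sum is exactly zero: the sigmoid returns 0.5.
y = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
```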
Figure 3.3: The simple perceptron architecture
Figure 3.4: Proposed MLP architecture
3.4.2 Multi-layer Perceptron (MLP)
An ANN consists of three layers: an input layer, a hidden layer, and an output layer. Neurons are grouped together to form a layer and are usually fully connected. Each connection is characterized by a weight, which is computed by a learning algorithm.
MLP is a fully connected network because all units in one layer are connected to all units in the following layer. The input layer receives the initial data, and the hidden layer calculates several interim values which are used to calculate the output values in the output layer. The MLP can be represented mathematically as given in Equation 3.5 (Norgaard et al., 2000; Al-Hiary et al., 2008):
$$y_i = g_i[\Phi, \theta] = F_i\left[\sum_{j=1}^{n_h} W_{i,j}\, f_j\left(\sum_{l=1}^{n_\Phi} w_{j,l}\,\Phi_l + w_{j,0}\right) + W_{i,0}\right] \tag{3.5}$$
where y_i is the output signal, g_i is the function realized by the neural network, and θ is the parameter vector, which contains all the adjustable parameters of the network (weights w_{j,l} and W_{i,j}, and biases w_{j,0} and W_{i,0}); n_h is the number of nodes in the hidden layer. The MLP is trained using the backpropagation (BP) learning algorithm. Training means adjusting the network weights such that the objective criterion is minimized (i.e. minimizing the error between the network output y and the desired output).
The ANN achieves a good match when the Mean Square Error (MSE) is minimized (see Equation 3.6) (Tim, 2015). Figure 3.4 shows the architecture of the MLP with 41 inputs, which are the features of NSL-KDD, and six outputs, one for each type in our data samples. We used the MLP to detect the six types available in our data samples.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{3.6}$$
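As an illustration of Equations 3.5 and 3.6, the following Python sketch computes the forward pass of a one-hidden-layer MLP and the MSE between two vectors. It assumes, for simplicity, that both the hidden functions f_j and the output functions F_i are the sigmoid of Equation 3.3. The layout of the weight lists (bias stored first in each row) and all names are choices made for this sketch, not the Weka implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(phi, w_hidden, w_output):
    """Equation 3.5 with sigmoid f_j and F_i (an assumption of this sketch).

    phi      : list of n_phi inputs
    w_hidden : one row per hidden node, [w_j0, w_j1, ..., w_jn_phi]
    w_output : one row per output node, [W_i0, W_i1, ..., W_in_h]
    """
    hidden = [
        sigmoid(row[0] + sum(w * p for w, p in zip(row[1:], phi)))
        for row in w_hidden
    ]
    return [
        sigmoid(row[0] + sum(w * h for w, h in zip(row[1:], hidden)))
        for row in w_output
    ]


def mse(targets, outputs):
    """Equation 3.6: mean of the squared differences."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n


# A tiny network: 3 inputs, 2 hidden nodes, 2 outputs, all weights zero.
phi = [0.2, 0.7, 1.0]
w_hidden = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
w_output = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
outputs = mlp_forward(phi, w_hidden, w_output)   # every output is sigmoid(0)
error = mse([1.0, 0.0], outputs)
```

With all weights at zero every node outputs 0.5, which makes the result easy to verify by hand. Backpropagation would then adjust the weights to reduce the error.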
3.5 Support Vector Machines
Support Vector Machines (SVMs) are one of the more recent developments among supervised machine learning techniques (Ng, 2014). A survey of SVMs can be found in (Burges, 1998; Cristianini and Shawe-Taylor, 2000). Although SVMs have been known since the late seventies (Burges, 1998; Vapnik, 1982), they started to receive attention in the late nineties (Burges, 1998). They were applied mainly to pattern recognition and to pattern classification problems such as image recognition, text recognition, and face detection (Pradhan, 2012). Many researchers have also applied SVM techniques to the intrusion detection problem, such as in (Khan et al., 2007; Jiang et al., 2011; Sujatha et al., 2012; Jha and Ragha, 2013). SVMs work mainly by deriving a hyperplane that maximizes the separating margin between two classes (Hu et al., 2003). The feature vectors that lie on the boundary of the margin are called support vectors (Hu et al., 2003). SVMs are attractive because they are very resilient to overfitting (Witten et al., 2011).
Figure 3.5: Left: the margin for a decision boundary is the distance to the nearest data point. Right: In SVMs, we find the boundary with maximum margin. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
3.5.1 How SVM Works
To see how SVM works, assume we have a set of training examples in pair format (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y ∈ {1, −1}^l. Our objective is to learn a classifier:
$$f(x) = w^T\phi(x) + b \tag{3.7}$$
The classifier's output for a new x is sign(f(x)). If the training data are linearly separable in the feature space of φ(x) (see Figure 3.5), the two classes of training examples are sufficiently well separated that one can draw a hyperplane between them. SVM maps the training vectors x_i into a higher-dimensional space using the function φ and finds the linear separating hyperplane with the maximum margin in that space. The variables ζ_i ≥ 0 account for the error term when the data are not separable. We need to maximize the margin, i.e. the distance from the hyperplane to the closest data point in either class.
Many data sets are not linearly separable, which means that no solution can satisfy all the constraints. One way to handle this problem is to relax some of the constraints by introducing slack variables. Slack variables permit certain constraints to be violated, so that some training points may lie within the margin. Our objective is to keep the number of points within the margin as small as possible. In this case, the SVM (Boser et al., 1992; Cortes and Vapnik, 1995) requires the solution of the following optimization problem:
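$$\min_{w,b,\zeta}\ \sum_{i=1}^{N}\zeta_i + \frac{1}{2}\|w\|^2$$
$$\text{subject to}\quad \forall i:\ y_i\left(w^T\phi(x_i) + b\right) \ge 1 - \zeta_i, \qquad \zeta_i \ge 0 \tag{3.8}$$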
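The role of the slack variables in Equation 3.8 can be illustrated with a small Python sketch. It assumes the identity mapping φ(x) = x and a fixed, hand-picked hyperplane (w, b); it does not solve the optimization problem. For a fixed hyperplane, the smallest slack that satisfies both constraints for a point is max(0, 1 − y_i f(x_i)), which is what the hypothetical helper slack computes.

```python
def decision_value(w, b, x):
    """f(x) = w^T x + b, i.e. Equation 3.7 with phi taken as the identity."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest zeta_i >= 0 with y_i * f(x_i) >= 1 - zeta_i."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def soft_margin_objective(w, b, points, labels):
    """Objective of Equation 3.8 for a given (w, b): sum of slacks + 0.5 * ||w||^2."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return total_slack + 0.5 * sum(wi * wi for wi in w)


w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels = [1, 1, 1]
slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
objective = soft_margin_objective(w, b, points, labels)
```

The first point lies outside the margin (slack 0), the second lies inside the margin on the correct side (slack between 0 and 1), and the third is misclassified (slack of at least 1).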
K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is called the kernel function. Many kernels have been proposed for the SVM. Some are listed below:
• linear: K(x_i, x_j) = x_i^T x_j
• polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
• radial basis function (RBF): K(x_i, x_j) = exp(−γ ||x_i − x_j||^2), γ > 0
• sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)
where γ, r, and d are kernel parameters. The characteristics of slack variables with various values are shown in Figure 3.6.
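The four kernels listed above translate directly into code. The following Python sketch is for illustration only; the function names and default parameter values are assumptions and do not correspond to the Weka settings used later.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(xi, xj) + r) ** d


def rbf_kernel(xi, xj, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(xi, xj) + r)


a, b = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(a, b)                         # 1*3 + 2*4
k_poly = polynomial_kernel(a, b, gamma=1.0, r=1.0, d=2)
k_rbf_same = rbf_kernel(a, a)                       # distance 0, so exp(0)
```

Note that the RBF kernel depends only on the distance between the two vectors, so it equals 1 for identical vectors and decays towards 0 as they move apart.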
3.6 Summary
In this chapter, some definitions of data mining were introduced, and the way the three algorithms C4.5, MLP, and SVM work was explained in detail.
Figure 3.6: The slack variables: ζ ≥ 1 for misclassified points, and 0 < ζ < 1 for points close to the decision boundary. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
Chapter 4
Related Work
In the past, various aspects of anomaly-based intrusion detection in computer security using machine learning were explored (Liao, 2005). A review of intrusion detection solutions using machine learning was presented in (Tsai et al., 2009). This work reviewed 55 related research studies between 2000 and 2007 focusing on developing single, hybrid, and ensemble classifiers. Classification-based unsupervised and supervised ML techniques for detecting intrusions using network audit trails were presented in (Mukkamala et al., 2006). The authors investigated the well-known Frequent Pattern Tree mining (FP-tree), classification and regression trees (CART), multivariate regression splines (MARS) and TreeNet for solving the ID problem. Classification accuracy based on Receiver Operating Characteristic (ROC) curve analysis was used to measure the performance of each developed model. The results show that classification accuracies are better in the cases of SVM and ANN.
Recently, ten machine learning approaches were used to detect network intrusions using the NSL-KDD data set (Panda et al., 2011). They include Decision Tree J48, Bayesian Belief Network, Hybrid Naïve Bayes with Decision Tree, Rotation Forest, Hybrid J48 with Lazy Locally weighted learning, Discriminative multinomial Naïve Bayes, Combining random Forest with Naïve Bayes, and finally an ensemble of classifiers using J48 and NB with AdaBoost (AB). Intrusion detection on mobile ad hoc networks (MANETs) is a challenging process because of their dynamic nature and their highly resource-constrained nodes. In (Sen and Clark, 2011), the authors explored the use of Evolutionary Computation (EC) techniques, specifically Genetic Programming (GP) and Grammatical Evolution (GE), to evolve intrusion detection programs. In (Giray and Polat, 2013) the authors made a comparison using three data set variations: KDD99, NSL-KDD, and data sets with added noise. They used WEKA to compare the performance of eleven classification algorithms including Decision Trees (DT), Random Forest (RF), Multi-layer Perceptron (MLP), Voted Perceptron (VP), Bayesian Networks (BN), Naive Bayes (NB), etc. The conclusions, for the most part, show that the performance of the various algorithms without noise is not the same as in a real noisy environment.
4.1 IDS Using Artificial Neural Network
In 1998, Cannady (Cannady, 1998) used a multi-layer perceptron (MLP) with four fully connected layers, nine inputs representing the data stream features, and two outputs (0 for normal, 1 for the attack class). The objective was to test the ability of the MLP to detect potential misuse in the data stream. The model was trained using 9,462 records, and 1,000 records were selected for testing. The results were measured using root mean square error and correlation, and they showed that the MLP can be used in an IDS for misuse detection.
ANNs were used to deal with the intrusion detection problem in (Mohammed et al., 2007). The proposed model was able to identify three classes: normal and two attack types. The developed ANN model achieved high accuracy. The authors suggested including more attack scenarios in the data set, and also suggested reducing the number of records in an attempt to minimize the complexity of the system.
Another ANN model was proposed in (Barman and Khataniar, 2012). The authors defined the output of the ANN to be either 1 or 0 depending on whether or not the packet is infected with an intrusion. They explored reducing the feature set using rough set theory, applied to just one type of attack. The authors claimed that their model was 20.5 times faster than previous ones. They suggested applying their method to other classes of attack as future work.
In (Sahilpreet and Meenakshi, 2013), the authors presented four different algorithms to develop intrusion detection models. They include the MLP, Radial Basis Function (RBF), Logistic Regression (LR) and Voted Perceptron (VP). All these algorithms were implemented in WEKA (Hall et al., 2009), a software package for data mining, to evaluate their performance. The NSL-KDD data set was used. To enhance the results, feature reduction techniques were applied. The results showed that the MLP network algorithm provided more accurate results than the other algorithms. As future work, integrating the MLP network with fuzzy inference rules to improve the performance was suggested.
4.2 IDS Using Support Vector Machine
Yao et al. (Yao et al., 2006) proposed an enhanced SVM model for intrusion detection. They used rough set theory to reduce the number of features by removing the less weighted ones. They evaluated the proposed model using the KDD99 and UMN data sets against precision, recall, false positive, and false negative criteria. The results showed that their model was more accurate and needed less time to run.
Chen et al. (Chen et al., 2009) proposed an SVM-based IDS model built on Rough Set Theory (RST). RST was used to reduce the number of features from 41 to 29. The authors compared the RST-based SVM with SVMs using the full feature set and Entropy. Their proposed RST-SVM model resulted in better accuracy compared to the other two methods.
An integrated SVM and DT model for multiclass classification was proposed in (Mulay et al., 2010). First, the classes were separated by a binary tree structure; then each class was fed to a number of SVMs equal to the number of classes. The authors supposed that combining the two models would give more accurate results and a faster classification process than the individual models, but they did not prove or simulate their model.
A comparison between three types of Support Vector Machine (SVM) kernel functions, the Gaussian (RBF) kernel, the polynomial kernel, and the sigmoid kernel, was carried out in (Bhavsar and Waghmare, 2013). Cross validation test mode was used. The results showed that the RBF kernel function can overcome the drawback of SVM, i.e. the extensive time needed for model building.
4.3 IDS Using Decision Tree
Farid et al. (Farid et al., 2010) proposed a new learning algorithm for anomaly-based IDS using DT. Their method modified the splitting weights of the data set, changing the weights relative to posterior probabilities. Their results show better performance than the traditional DT algorithm.
An ensemble neural decision tree was used in (Sivatha Sindhu et al., 2012) for
feature selection and model reduction. The proposed model was compared to 6 types
of decision trees. They used specificity and sensitivity as evaluation metrics. The
results showed that the proposed model performed better than other methods.
In (Ooi et al., 2013), three types of decision trees, ID3, C4.5, and BFS, were tested on the NSL-KDD network intrusion data set. Feature selection was performed using the Consistency Subset Evaluator (CSE), and 10-fold cross validation test mode was used to train and test the three DT algorithms. The analysis of the results concluded that C4.5 performs better than BFS and ID3 in terms of prediction accuracy. The authors also used the ROC curve as an evaluation criterion: higher values of the area under the ROC curve denote that the classifier has a higher ability to classify a randomly chosen instance correctly.
Nadiammai et al. (Nadiammai and Hemalatha, 2014) proposed four solutions for different IDS problems: data classification, the high level of human effort, unlabeled data, and the effectiveness of distributed denial of service attacks. They solved the first problem (classification of data) using the Efficient Data Adapted Decision Tree (EDADT). The objective of this method was to minimize the dimensionality of the model by extracting the features relevant to every type of attack. The authors compared the proposed algorithm to other methods such as C4.5 and SVM. The results they obtained show that their algorithm achieved the highest accuracy rate.
4.4 IDS Using Feature Selection
Using too many features results in a huge feature space, which slows down the model learning process and may decrease accuracy. Usually there are many redundant or irrelevant features, so feature selection is a good way to remove these redundant or less discriminative features (Han et al., 2012).
Sivatha Sindhu et al. (Sivatha Sindhu et al., 2012) improved the genetic algorithm by formulating a new fitness function to search for the most relevant features among the 41 KDDcup'99 features. The objective of feature selection was to reduce the computational complexity of the classifier. The proposed algorithm was compared to various feature selection algorithms: Genetic Search, Greedy Stepwise, Ranker and RankSearch. The accuracy percentages were close, but the number of features selected by the proposed algorithm was smaller, so its detection time is lower than that of the other algorithms.
The relevance between the 41 features and the attack types was studied in (Kayacik et al., 2005). The authors concluded that not all of the 41 features are needed to classify types of attacks. They recommended further studies in the scope of machine learning algorithms.
Chapter 5
Experimental Setup and Results
In this thesis, we adopted three classification algorithms to develop a set of models for intrusion detection. MLP, C4.5, and SVM classifiers were trained and tested using the Waikato Environment for Knowledge Analysis (Weka) (Hall et al., 2009).
Weka is a collection of machine learning algorithms used for data mining tasks. It is open source software that contains tools for data pre-processing, regression, classification, clustering, association rules, and visualization (Hall et al., 2009).
NSL-KDD, the improved version of the KDDCUP'99 data set, was used to form the experimental data set. For all experiments we used 10-fold cross validation test mode, which is preferred since it reduces the variance of the estimate (Witten et al., 2011). The experimental scenario is explained in detail in the following subsections.
5.1 Experimental Data Set
The Intrusion Detection Evaluation Program (IDEP), administered by the Lincoln Laboratory at the Massachusetts Institute of Technology, was funded by the United States Defense Advanced Research Projects Agency (DARPA) in 1998. The program's main objective was to build a data set that would help evaluate different intrusion detection systems (IDSs). The KDDCUP'99 data set was a result of the seven weeks of training data and two weeks of testing data of this program (Sabhnani and Serpen, 2004).
5.1.1 KDDCUP’99 Data Set
KDDCUP'99 is the most widely used data set for ID research and is publicly available at (Lichman, 2013). It contains about 4,900,000 connection records. Each record consists of 41 features.
There are four major categories of attacks in the KDDCUP’99 data set:
1. Probing: information gathering attacks.
2. Denial of Service (DoS): deny legitimate requests to a system.
3. User-to-Root (U2R): unauthorized access to local super-user or root.
4. Remote-to-Local (R2L): unauthorized local access from a remote machine.
A statistical analysis of this data set was conducted by Tavallaee et al. (Tavallaee et al., 2009). Some important problems that greatly affected the performance of evaluated systems were found. For example, the data set contains a very large number of redundant records, and the difficulty level of the different records was not inversely proportional to the percentage of records in the original KDDCUP'99 data set. These deficiencies result in a very poor evaluation of the different proposed ID techniques.
Many machine learning and pattern classification algorithms that were applied to the intrusion detection problem based on the KDDCUP'99 data set failed to identify most of the user-to-root and remote-to-local attacks. The authors in (Sabhnani and Serpen, 2004) described the deficiencies and limitations of the KDDCUP'99 data set and argued that it should not be used to train pattern recognition or machine learning algorithms for misuse detection for these two attack categories. Their experiments showed that it is not possible for any trainable pattern classification or machine learning algorithm to reach an acceptable level of misuse detection performance on the KDD testing data subset if classifier models are built using the KDD training data subset for these categories (Sabhnani and Serpen, 2004).
5.1.2 NSL-KDD
The NSL-KDD data set was suggested to solve some of the inherent problems of the KDDCUP'99 data set. The new data set (NSL-KDD) consists of selected records of the complete KDDCUP'99 data set and remedies these problems (Tavallaee et al., 2009). The following are some of the advantages of NSL-KDD over the original KDDCUP'99 data set:
• Redundant records in the training and testing sets were removed.
• The number of selected records from each difficulty level group is inversely proportional to the percentage of records.
• It now contains a reasonable number of instances in the training and testing sets, so it is affordable to use the NSL-KDD data set for experiments.
The NSL-KDD data contain 125,973 records, each consisting of 41 features. The features, their descriptions and types are shown in Appendix A.1. The records belong to 23 classes: normal and 22 types of attacks: neptune, warezclient, ipsweep, portsweep, teardrop, nmap, satan, smurf, pod, back, guesspasswd, ftpwrite, multihop, rootkit, bufferoverflow, imap, warezmaster, phf, land, loadmodule, spy and perl. These types fall into the 4 main categories mentioned in Section 5.1.1.
5.1.3 Class Distribution
As mentioned in Section 5.1.1, there are 4 main attack categories. The number of attack records (i.e. the class distribution) in each attack category varies widely, as shown in Table 5.1. This distribution has an effect on classifier learning (Weiss and Provost, 2001). In this work, we selected an equal number of records per type. We randomly selected 6000 records from the NSL-KDD data. The selected set contains 5 types of attack and the normal type, with 1000 records for each type. Table 5.2 shows the types of data used and the number of samples for each type.
Table 5.1: Distribution of attack records per attack category of the NSL-KDD.
Attack Category   Attack Name       Number of Records   Percentage of total (%)
DoS               Back                    956
                  Land                     18
                  Neptune               41214
                  Pod                     201
                  Smurf                  2646
                  Teardrop                892
                  DoS total             45927            36.46
Probe             Satan                  3633
                  Ipsweep                3599
                  Nmap                   1493
                  Portsweep              2931
                  Probe total           11656             9.25
R2L               GuessPassword            53
                  Ftp write                 8
                  Imap                     11
                  Phf                       4
                  Multihop                  7
                  Warezmaster              20
                  Warezclient             890
                  Spy                       2
                  R2L total               995             0.79
U2R               Buffer overflow          30
                  Loadmodule                9
                  Rootkit                  10
                  Perl                      3
                  U2R total                52             0.04
Normal                                  67343            53.46
Total                                  125973
Table 5.2: Experimental data
Attack type   No. of records
normal        1000
ipsweep       1000
neptune       1000
nmap          1000
smurf         1000
satan         1000
Sum           6000
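The balanced selection summarized in Table 5.2 can be sketched in Python as follows. This is not the procedure actually used to build the experimental set, whose details are not given here. The function name balanced_sample, the label_of accessor, and the fixed seed are assumptions introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(records, label_of, classes, per_class, seed=0):
    """Randomly draw `per_class` records for each requested class.

    records  : iterable of records (any type)
    label_of : function mapping a record to its class label (hypothetical accessor)
    classes  : class labels to keep
    """
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    sample = []
    for c in classes:
        # random.sample raises ValueError if a class has too few records
        sample.extend(rng.sample(by_class[c], per_class))
    return sample


# Toy stand-in for the data set: (record id, label) pairs with skewed classes.
toy = [(i, "normal") for i in range(50)] \
    + [(i, "smurf") for i in range(50, 80)] \
    + [(i, "land") for i in range(80, 85)]
subset = balanced_sample(toy, lambda rec: rec[1], ["normal", "smurf"], per_class=10)
counts = Counter(label for _, label in subset)
```

With the six types of Table 5.2 and per_class=1000, the same call would yield a 6000-record set with an equal share for every type.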
5.2 Classification Models Setup
5.2.1 C4.5
C4.5 (called J48 in Weka) is a very popular machine learning algorithm. It is a newer variant of the ID3 algorithm. The output of this classification algorithm is an understandable tree. To keep the tree as small as possible, information gain is used while building the tree. Pruning, the process of reducing the size of the tree, was also used to obtain a smaller tree. Pruning also reduces the classifier complexity and improves the prediction accuracy (Witten et al., 2011). Without pruning we obtained a tree of 456 nodes and 400 leaves, with a classification accuracy of 99%. With pruning we obtained a tree of 229 nodes and 188 leaves, with a classification accuracy of 99.05%. A confidence factor of 0.25 was used for pruning (smaller values mean more pruning). The Weka setup for building the C4.5 model is shown in Figure 5.1. Using the data flow environment, the model setup goes through the following steps:
1. Load the data file by selecting the file loader component.
2. Specify the class attribute using the class assigner component.
3. Select the training and testing mode by choosing the cross validation fold maker.
4. Attach those components to the C4.5 classifier, called J48 in WEKA.
5. Assign the evaluation component, called the classifier performance evaluator.
6. Show the performance results by selecting the text viewer.
5.2.2 ANN (MLP)
MLP is a feed-forward artificial neural network. It consists of 3 or more layers: 1) an input layer, 2) one or more hidden layers, and 3) an output layer. In MLP we need to find the best number of hidden layers and the best number of neurons in each hidden layer. The best efficiency was found when using one hidden layer, with the number of nodes in the hidden layer equal to the average of the numbers of input and output nodes (Witten et al., 2011). The MLP algorithm was run with a learning rate of 0.3, momentum of 0.2, 500 epochs, and a validation threshold of 20. Table 5.3 shows some of the MLP parameters and their meaning. The default Weka parameter values were used. The structure of the MLP was shown in Figure 3.4. The Weka setup for the MLP model is shown in Figure 5.2.
Figure 5.1: C4.5 classification model structure.
Table 5.3: MLP parameters and their meaning
Parameter              Value   Explanation
Learning rate          0.3     The amount the weights are updated
Momentum               0.2     Momentum applied to the weights during updating
No. of epochs          500     The number of epochs to train through
Validation threshold   20      Used to terminate validation testing
5.2.3 SVM
An SVM produces a linear decision boundary, but nonlinear and more complex boundaries can be obtained by replacing the dot product in the support vector formulation with a kernel function. We explored the Gaussian kernel (Radial Basis Function), the polynomial kernel, and the sigmoid kernel. The Radial Basis Function (RBF) was selected because it overcomes the drawback of SVM, i.e. the extensive time needed for model building (Bhavsar and Waghmare, 2013). The SVM algorithm with the RBF kernel function was trained with cache size 40.0, cost parameter C 1.0, and Eps 0.001. Table 5.4 shows some of the SVM parameters and their meaning. The default Weka parameter values were used. The Weka setup for the SVM model is shown in Figure 5.3.
Figure 5.2: Weka Setup for the MLP classification model.
Table 5.4: SVM parameters and their meaning
Parameter        Value   Explanation
Cache size       40      The cache size in MB; the size of the kernel cache has a strong impact on run times for larger problems
Cost parameter   1       The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example
Eps              0.001   The tolerance of the termination criterion
5.3 10-fold Cross Validation
In 10-fold cross validation training and testing mode, the data is randomly divided into 10 parts in which each class is represented in approximately the same proportions as in the full data set. Each part is held out in turn and the learning scheme is trained on the remaining nine parts; then its error rate is calculated on the held-out part. Thus, the learning procedure is executed a total of 10 times on different training sets (each of which has a lot in common with the others). Finally, the 10 error estimates are averaged to obtain an overall error estimate. Why 10? Extensive tests on numerous data sets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error, and there is also some theoretical evidence that supports this (Witten et al., 2011).
For all experiments in this thesis, 10-fold cross validation training and testing mode was used because it reduces the variance of the estimate (Witten et al., 2011).
Figure 5.3: Weka Setup for SVM classification model.
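The procedure described above can be sketched in Python as follows. This is an illustrative stand-in for Weka's cross validation, not its implementation. The names stratified_folds and cross_validate, and the train_and_score callback (which is assumed to train a model on the training indices and return its error rate on the held-out indices), are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)   # deal each class out round-robin
    return folds


def cross_validate(labels, train_and_score, k=10, seed=0):
    """Hold out each fold in turn and average the k error estimates."""
    folds = stratified_folds(labels, k, seed)
    errors = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        errors.append(train_and_score(train, held_out))
    return sum(errors) / k


labels = ["normal"] * 100 + ["smurf"] * 50
folds = stratified_folds(labels, k=10)
per_fold = [Counter(labels[i] for i in fold) for fold in folds]
# Dummy scorer that ignores the data and always reports a 10% error rate.
average_error = cross_validate(labels, lambda train, test: 0.1)
```

In this toy run every fold receives 10 "normal" and 5 "smurf" samples, mirroring the class proportions of the full set.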
5.4 Feature Selection
Feature selection has been used successfully to enhance the modeling process for input-output systems (Papadakis et al., 2005). In many modeling cases, various attributes are gathered during the data collection process although they might not be significant. Irrelevant data may increase the model complexity and the time needed to converge to the best model structure (Witten et al., 2011).
Feature selection is defined as the process of selecting a subset of the originally defined features based on pre-defined evaluation criteria (Han et al., 2012; Hall, 1999). Feature selection is frequently used for model dimension reduction: it reduces the feature domain and removes redundant features, which helps speed up the learning/modeling process (Han et al., 2012; Hall, 1999).
The relevance between the 41 features and the attack types was studied in (Kayacik et al., 2005). The authors concluded that not all of the 41 features are needed to classify types of attacks. They recommended further studies based on machine learning algorithms. In (Ooi et al., 2013), three types of decision trees, ID3, C4.5, and BFS, were tested on the NSL-KDD network intrusion data set. Feature selection was performed using the Consistency Subset Evaluator (CSE). The analysis of the results concluded that C4.5 performs better than BFS and ID3 in terms of detection accuracy.
Main steps for feature selection process can be summarized as follows (See Figure
5.4):
1. Generation procedure to generate the next candidate subset.
2. Evaluation function to evaluate the subset.
3. Stopping criterion to decide when to stop.
4. Validation procedure to check whether the subset is valid.
Different methods for attribute search and evaluation were analyzed in (Megha and
Amrita, 2013). We selected the Best First and Genetic Search algorithms with the
Correlation-based Feature Selection evaluator because they performed better than the
other methods in the Aggarwal study (Megha and Amrita, 2013).
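The Correlation-based Feature Selection evaluator scores a subset by rewarding features that correlate with the class and penalizing features that correlate with each other. The sketch below uses the merit heuristic as it is usually stated for CFS (Hall, 1999); the function name and the correlation values are illustrative assumptions, not values measured in this work.

```python
import math


def cfs_merit(k: int, mean_feature_class_corr: float, mean_feature_feature_corr: float) -> float:
    """CFS merit of a k-feature subset.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    r_cf: average feature-class correlation (higher is better)
    r_ff: average feature-feature correlation (higher means more redundancy)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Hypothetical values: four equally relevant features, first independent, then fully redundant.
independent = cfs_merit(4, 0.5, 0.0)
redundant = cfs_merit(4, 0.5, 1.0)
print(independent, redundant)  # 1.0 0.5
```

With the same relevance, the redundant subset scores no better than a single feature, which is the behavior that makes the evaluator prefer small, non-redundant subsets.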
[Flowchart with the blocks Original Set, Subset Generation, Subset Evaluation, Stopping Criterion (Yes/No branches), and Result Validation.]
Figure 5.4: Main steps of feature selection process (Megha and Amrita, 2013).
5.5 Search Space Complexity
Because of the benefits of feature selection mentioned in Section 5.4, it is a primary
component of the classification models in this work. Regarding the size of the search
space, for $n$ features there are $2^n$ subsets that can be formed and tested (Petra, 2012),
i.e. $2^{41}$, which is equal to 2,199,023,255,552 combinations. In addition, before using
feature selection (FS) techniques we do not know which features are the most relevant,
or how many of them there are. Without feature selection, it would therefore be
necessary to test all $2^n$ possible feature subsets when training the models (Petra,
2012). The following calculations show how much feature selection reduces
our search domain. The original search space is equal to:
\[
\binom{n}{1} + \binom{n}{2} + \dots + \binom{n}{n}
\]
For example, selecting 7 features out of 41 will reduce the domain of search to
\[
\binom{41}{7} = 22{,}481{,}940
\]
so the reduction ratio is
\[
\frac{22{,}481{,}940}{2{,}199{,}023{,}255{,}552} \approx 1.022 \times 10^{-5}
\]
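The figures above can be reproduced with a few lines of Python. This is only an arithmetic check of the numbers quoted in this section; the variable names are illustrative.

```python
import math

n_features = 41    # features in NSL-KDD
subset_size = 7    # size of the subset selected by BFS

all_subsets = 2 ** n_features  # every possible subset, including the empty one
non_empty = sum(math.comb(n_features, k) for k in range(1, n_features + 1))
fixed_size = math.comb(n_features, subset_size)  # subsets of exactly 7 features
ratio = fixed_size / all_subsets

print(all_subsets)      # 2199023255552
print(fixed_size)       # 22481940
print(f"{ratio:.3e}")   # 1.022e-05
```

The sum of binomial coefficients from 1 to n counts every non-empty subset, so it is one less than $2^n$; the difference is negligible at this scale.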
5.5.1 Best First Search
The best first search strategy allows backtracking along the search path. It moves through
the search space by making local changes to the current feature subset. If the path
being explored begins to look less promising, best first search can backtrack to a
more promising previous subset and continue the search from there. The best first search
algorithm works as follows:
Algorithm 2: Best first search algorithm (Hall, 1999).
1: Begin with the OPEN list containing the start state, the CLOSED list empty,
2: and BEST ← start state.
3: Let s = arg max e(x) (get the state from OPEN with the highest evaluation).
4: Remove s from OPEN and add to CLOSED.
5: If e(s) ≥ e(BEST), then BEST ← s.
6: For each child t of s that is not in the OPEN or CLOSED list, evaluate and add to OPEN.
7: If BEST changed in the last set of expansions, goto 3.
8: Return BEST.
The features selected by the BFS algorithm are shown in Table 5.5.
Table 5.5: BFS Selected Features
No.  Description                  Type
3    service                      symbolic
5    src_bytes                    continuous
6    dst_bytes                    continuous
23   count                        continuous
30   diff_srv_rate                continuous
37   dst_host_srv_diff_host_rate  continuous
38   dst_host_serror_rate         continuous
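The best first search procedure can be sketched in Python as follows. This is a minimal forward-search illustration under stated assumptions, not the implementation used in the experiments: the start state is the empty subset, children add one feature, the search stops after `max_stale` consecutive expansions without improvement (a hypothetical parameter), and `toy_score` is a made-up evaluation function standing in for a real subset evaluator.

```python
from typing import Callable, FrozenSet


def best_first_search(
    n_features: int,
    evaluate: Callable[[FrozenSet[int]], float],
    max_stale: int = 5,
) -> FrozenSet[int]:
    """Forward best first search over feature subsets."""
    start: FrozenSet[int] = frozenset()      # start state: the empty subset
    open_list = {start: evaluate(start)}     # OPEN: candidate subsets and their scores
    closed = set()                           # CLOSED: subsets already expanded
    best, best_score = start, open_list[start]
    stale = 0                                # expansions since BEST last improved

    while open_list and stale < max_stale:
        # Take the state with the highest evaluation from OPEN and move it to CLOSED.
        s = max(open_list, key=open_list.get)
        s_score = open_list.pop(s)
        closed.add(s)
        # A strict comparison is used here so that ties cannot postpone termination.
        if s_score > best_score:
            best, best_score = s, s_score
            stale = 0
        else:
            stale += 1
        # Expand s: each child adds one feature that s does not contain yet.
        for f in range(n_features):
            child = s | {f}
            if f not in s and child not in open_list and child not in closed:
                open_list[child] = evaluate(child)
    return best


# Toy evaluator (assumption): features 0, 2 and 5 are useful, every other feature costs 0.5.
RELEVANT = frozenset({0, 2, 5})


def toy_score(subset: FrozenSet[int]) -> float:
    return len(subset & RELEVANT) - 0.5 * len(subset - RELEVANT)


selected = best_first_search(n_features=8, evaluate=toy_score)
print(sorted(selected))  # [0, 2, 5]
```

Keeping the OPEN list around is what allows backtracking: when the current path stops improving, the next state taken from OPEN may belong to an earlier, more promising branch.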
5.5.2 Genetic Search
Genetic Algorithms (GA) are search algorithms that adopt the principle of natural
selection (Hall, 1999; Sharma et al., 2014). Using GA, robust and adaptable systems can
be developed (Sharma et al., 2014; Kumar and Punia, 2013). A GA works on individuals
called chromosomes. The initial population is a set of randomly created chromosomes,
each of which represents a possible solution to the problem (Sazzadul Hoque
et al., 2012; Sharma et al., 2014). The generated solutions evolve over time in an
iterative process to approach an optimal solution. In the feature selection problem, a
solution is usually a fixed-length binary string representing a feature subset. Each
position in the string represents the presence or absence of a particular feature (Hall,
1999). The initial subsets are selected randomly from the full feature set. Successive
generations are produced by applying genetic operators, called crossover and mutation,
to the currently selected subsets. The newly generated subsets are evaluated using
a fitness function according to defined fitness criteria. The better subsets
have a stronger chance of being selected to form new subsets. In this way, newer
evolved subsets potentially have higher quality. Generally, the genetic search strategy
works as follows:
Algorithm 3: Genetic search strategy (Hall, 1999).
1: Begin by randomly generating an initial population P.
2: Calculate e(x) for each member x ∈ P.
3: Define a probability distribution p over the members of P where p(x) ∝ e(x).
4: Select two population members x and y with respect to p.
5: Apply crossover to x and y to produce new population members x′ and y′.
6: Apply mutation to x′ and y′.
7: Insert x′ and y′ into P′ (the next generation).
8: If |P′| < |P|, go to 4.
9: Let P ← P′.
10: If there are more generations to process, goto 2.
11: Return x ∈ P for which e(x) is highest.
The features selected by the genetic search (GS) algorithm are shown in Table 5.6.
Table 5.6: GS Selected Features
No.  Description                  Type
2    protocol_type                symbolic
3    service                      symbolic
5    src_bytes                    continuous
6    dst_bytes                    continuous
23   count                        continuous
24   srv_count                    continuous
25   serror_rate                  continuous
30   diff_srv_rate                continuous
36   dst_host_same_src_port_rate  continuous
37   dst_host_srv_diff_host_rate  continuous
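The genetic search strategy can be sketched in Python as follows. This is an illustration under stated assumptions rather than the configuration used in the experiments: the population size, number of generations, crossover and mutation rates, and the `toy_fitness` function are all hypothetical, and the fitness values are shifted so that selection weights stay positive.

```python
import random
from typing import Callable, List, Tuple

Chromosome = Tuple[int, ...]  # fixed-length bit string: 1 means the feature is present


def genetic_search(
    n_features: int,
    fitness: Callable[[Chromosome], float],
    pop_size: int = 20,
    generations: int = 30,
    crossover_rate: float = 0.6,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Chromosome:
    """Return the fittest chromosome seen during the search (needs n_features >= 2)."""
    rng = random.Random(seed)

    def mutate(c: Chromosome) -> Chromosome:
        # Flip each bit independently with probability mutation_rate.
        return tuple(1 - bit if rng.random() < mutation_rate else bit for bit in c)

    # Random initial population.
    population: List[Chromosome] = [
        tuple(rng.randint(0, 1) for _ in range(n_features)) for _ in range(pop_size)
    ]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Selection probabilities proportional to (shifted) fitness.
        scores = [fitness(c) for c in population]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]

        next_population: List[Chromosome] = []
        while len(next_population) < pop_size:
            x, y = rng.choices(population, weights=weights, k=2)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                x, y = x[:cut] + y[cut:], y[:cut] + x[cut:]
            next_population.extend([mutate(x), mutate(y)])

        population = next_population[:pop_size]
        best = max(population + [best], key=fitness)
    return best


# Toy fitness (assumption): features 0, 2 and 5 are useful, every other feature costs 0.5.
RELEVANT = {0, 2, 5}


def toy_fitness(c: Chromosome) -> float:
    return sum(1.0 if i in RELEVANT else -0.5 for i, bit in enumerate(c) if bit)


selected = genetic_search(n_features=8, fitness=toy_fitness)
print([i for i, bit in enumerate(selected) if bit])
```

Because the search is stochastic, the returned subset depends on the seed; a fixed seed is used here only to make runs repeatable.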
Figure 5.5 shows the block diagram of the proposed methodology:
• The prepared sample dataset, which was illustrated in Table 5.2, was used for building
the three models.
• The most relevant features were selected using the BFS and GA feature selection
algorithms.
• The data with the selected features was used as input to the three types of
classifiers: SVM, DT, and ANN. Each classifier was trained and tested in a separate
experiment.
• The results of the classifiers were illustrated and analyzed using the evaluation
criteria specified in Section 5.6.
Figure 5.5: Block diagram for proposed methodology.
5.6 Model Evaluation
In order to check the performance of the developed models, we explored a set of
performance evaluation functions: Correctly Classified Instances (CCI), Incorrectly
Classified Instances (ICI), Mean Absolute Error (MAE), Root Mean Square Error
(RMSE), and Relative Absolute Error (RAE). These performance evaluation functions
measure how close the intrusion types predicted by the learned models are to the
actual intrusion types. They are computed as follows:
\[
CCI = \frac{TP + TN}{TP + TN + FP + FN} \tag{5.1}
\]
\[
ICI = \frac{FP + FN}{TP + TN + FP + FN} \tag{5.2}
\]
where TP is the number of positive instances correctly classified as positives, TN is the
number of negative instances correctly classified as negatives, FP is the number of
negative instances that were incorrectly classified as positives, and FN is the number
of positive instances that were incorrectly classified as negatives. The confusion
matrix shown in Table 5.7 is used to evaluate the performance of the classification system.
Table 5.7: Confusion matrix.
                  Predicted Positive  Predicted Negative
Actual Positive   TP                  FN
Actual Negative   FP                  TN
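\[
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{5.3}
\]
\[
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{5.4}
\]
\[
RAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{\sum_{i=1}^{n} |y_i - \bar{y}|} \tag{5.5}
\]
In Equations 5.3, 5.4, and 5.5, $y_i$ is the actual class of connection $i$, $\hat{y}_i$ is the
predicted type, and $\bar{y}$ is the mean of the type $y$ using $n$ instances (Tim, 2015).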
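The evaluation functions in Equations 5.1 to 5.5 can be written directly in Python. The sketch below is illustrative only: the function names and the eight-instance binary example are assumptions made for the demonstration and are not data from the experiments.

```python
import math
from typing import Sequence


def cci(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correctly Classified Instances ratio (Equation 5.1)."""
    return (tp + tn) / (tp + tn + fp + fn)


def ici(tp: int, tn: int, fp: int, fn: int) -> float:
    """Incorrectly Classified Instances ratio (Equation 5.2)."""
    return (fp + fn) / (tp + tn + fp + fn)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error (Equation 5.3)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error (Equation 5.4)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Relative Absolute Error (Equation 5.5): error relative to predicting the mean."""
    mean = sum(actual) / len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(
        abs(a - mean) for a in actual
    )


# Hypothetical example: 1 = attack (positive), 0 = normal (negative).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

print(cci(tp, tn, fp, fn), ici(tp, tn, fp, fn))                         # 0.75 0.25
print(mae(actual, predicted), rmse(actual, predicted), rae(actual, predicted))  # 0.25 0.5 0.5
```

An RAE below 1 (100%) means the model does better than always predicting the mean of the actual values.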
5.7 Results
5.7.1 C4.5
The confusion matrix developed based on the C4.5 model is given in Table 5.8. This
matrix is a result of training and testing the model with Weka. The average ratio of
correctly classified instances is shown in the last row of the table. It is the
average of the correctly classified instances over all the classes.
There were 992 correctly classified instances of the first type of attack (ipsweep),
which is equal to 99.2%; 0.30% were classified as nmap, 0.30% as normal, and 0.20%
as satan. These are called false negatives. Conversely, 0.70% of the nmap records,
0.60% of the normal records, and 0.10% of the satan records were classified as
ipsweep. These are called false positives. Using the confusion matrix in Table 5.8 and
Equation 5.1, the CCI ratio is calculated as follows:
Table 5.8: Confusion matrix for the C4.5 model (rows: actual class; columns: predicted class)
Actual \ Pred.  ipsweep %  neptune %  nmap %  normal %  satan %  smurf %
ipsweep         99.20      0.00       0.30    0.30      0.20     0.00
neptune         0.00       99.8       0.00    0.20      0.00     0.00
nmap            0.70       0.00       99.0    0.20      0.10     0.00
normal          0.60       0.10       0.60    97.40     1.30     0.00
satan           0.10       0.10       0.00    0.90      98.9     0.00
smurf           0.00       0.00       0.00    0.00      0.00     100.0
Average of correctly classified instances = 99.05 %
\[
CCI = \frac{99.20 + 99.8 + 99.0 + 97.40 + 98.9 + 100.0}{99.20 + 99.8 + 99.0 + 97.40 + 98.9 + 100.0 + 5.7} = 99.05\%
\]
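The same calculation can be checked programmatically from the entries of Table 5.8. The sketch below simply sums the diagonal (correct classifications) and divides by the sum of all entries; the variable names are illustrative.

```python
classes = ["ipsweep", "neptune", "nmap", "normal", "satan", "smurf"]

# Table 5.8: rows are actual classes, columns are predicted classes (values in %).
c45_matrix = [
    [99.20, 0.00, 0.30, 0.30, 0.20, 0.00],
    [0.00, 99.8, 0.00, 0.20, 0.00, 0.00],
    [0.70, 0.00, 99.0, 0.20, 0.10, 0.00],
    [0.60, 0.10, 0.60, 97.40, 1.30, 0.00],
    [0.10, 0.10, 0.00, 0.90, 98.9, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 100.0],
]

correct = sum(c45_matrix[i][i] for i in range(len(classes)))   # diagonal entries
total = sum(sum(row) for row in c45_matrix)                     # all entries
off_diagonal = total - correct                                  # misclassified share

cci_percent = 100 * correct / total
print(round(off_diagonal, 2))   # 5.7
print(round(cci_percent, 2))    # 99.05
```

Because every class has the same number of test records in this table, the result equals the plain average of the six diagonal percentages.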
The highest accuracy was achieved in detecting the smurf attack, with 100%, which
means that this attack was fully represented by the training data. The normal records
were detected with the lowest accuracy, 97.40%, which corresponds to a false positive
rate of 2.6%.
5.7.2 MLP
The confusion matrix developed based on the MLP model is given in Table 5.9.
Table 5.9: Confusion matrix for the MLP model (rows: actual class; columns: predicted class)
Actual \ Pred.  ipsweep %  neptune %  nmap %  normal %  satan %  smurf %
ipsweep         98.70      0.00       1.10    0.10      0.10     0.00
neptune         0.00       99.90      0.00    0.10      0.00     0.00
nmap            1.60       0.00       98.0    0.20      0.20     0.00
normal          0.60       0.00       0.30    98.50     0.50     0.10
satan           0.30       0.00       0.80    1.30      97.40    0.20
smurf           0.20       0.00       0.00    0.00      0.00     99.80
Average of correctly classified instances = 98.72 %
Here, the highest detection rate was 99.90% for the neptune attack, and the lowest was
97.40% for satan. The false positive rate is 1.50%.
5.7.3 SVM
The confusion matrix developed based on the SVM model is given in Table 5.10.
Table 5.10: Confusion matrix for the SVM model (rows: actual class; columns: predicted class)
Actual \ Pred.  ipsweep %  neptune %  nmap %  normal %  satan %  smurf %
ipsweep         84.30      0.00       15.00   0.70      0.00     0.00
neptune         0.10       92.90      0.30    3.70      3.00     0.00
nmap            25.60      0.00       73.90   0.50      0.00     0.00
normal          0.50       0.00       0.10    99.30     0.00     0.10
satan           0.40       3.40       0.50    4.20      91.50    0.00
smurf           0.00       0.00       0.00    4.40      0.00     95.60
Average of correctly classified instances = 89.58 %
The highest detection accuracy here is 99.30%, for the normal type, which means
that SVM achieved the lowest false positive rate of 0.70%.
5.8 Results Analysis
- The performance of each of the three models built using C4.5, MLP, and SVM was
tested before and after feature selection. The obtained results are shown in Table 5.11
and Figure 5.6.
- C4.5 achieved 99.05% accuracy with all 41 features and a building time of 0.47
second, and 98.80% with the 10 features selected by GA. The accuracy decreased
slightly, but the model building time dropped to 0.11 second with the GA features
(and to 0.06 second with the 7 BFS features, at 97.35% accuracy), which is a
substantial gain in time efficiency.
- The lowest false positive rate was achieved by SVM. This is because SVM works by
maximizing the margin between the negative class and the core of the positive
class.
- After applying feature selection, the building time of all models dropped to less
than 50% of the original time: with the BFS features it was about 13% for C4.5, 28%
for SVM, and 45% for MLP (Table 5.11). This also means a reduction in the amount of
computation, i.e. less computational complexity. This reduction is a result of the
smaller number of selected features.
- The number of features selected using BFS was 7, and 10 features were selected by
GA.
- Six of the seven features selected by BFS were also selected by GA (Tables 5.5 and
5.6). They are: service, src_bytes, dst_bytes, count, diff_srv_rate, and
dst_host_srv_diff_host_rate. Since these features have been selected by both
algorithms, this implies that they are the most important ones for discovering the
different types of attack.
- SVM performed better with the 7 features selected by BFS (93.80% versus 89.58% with
all features, Table 5.11), which suggests that the classifier accuracy was negatively
affected by the extra irrelevant features.
- Finally, the complexity of the models was reduced considerably: the process of
finding the class was a function of 41 features, and it dropped to a function of
7 features in the case of the BFS method. This reduced the search space to about
$1 \times 10^{-5}$ of its original size.
Table 5.11: Performance evaluation based on C4.5, ANN and SVM models
ALGORITHM            CCI     ICI     MAE     RMSE    RAE     Time Taken (s)
C4.5 (J48)           99.05%  0.95%   0.0039  0.0534  1.39%   0.47
C4.5+BestFirst       97.35%  2.65%   0.0122  0.0903  4.41%   0.06
C4.5+Genetic Search  98.80%  1.20%   0.005   0.0573  1.80%   0.11
MLP                  98.72%  1.28%   0.0061  0.0619  2.18%   485.68
MLP+BestFirst        93.05%  6.95%   0.0302  0.1299  10.86%  218.2
MLP+Genetic Search   94.77%  5.23%   0.0218  0.1151  7.83%   235.3
SVM                  89.58%  10.42%  0.0347  0.1863  12.50%  14.66
SVM+Best First       93.80%  6.20%   0.0207  0.1438  7.44%   4.08
SVM+Genetic Search   86.77%  13.23%  0.0441  0.21    15.88%  5.32
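The feature overlap between Tables 5.5 and 5.6 and the relative building times in Table 5.11 can be cross-checked with a short script. The sketch below only re-uses numbers already listed in those tables; the variable names are illustrative.

```python
# Feature numbers selected by each search method (Tables 5.5 and 5.6).
bfs_features = {3, 5, 6, 23, 30, 37, 38}
gs_features = {2, 3, 5, 6, 23, 24, 25, 30, 36, 37}

common = sorted(bfs_features & gs_features)
only_bfs = sorted(bfs_features - gs_features)
only_gs = sorted(gs_features - bfs_features)
print(common)    # [3, 5, 6, 23, 30, 37]
print(only_bfs)  # [38]
print(only_gs)   # [2, 24, 25, 36]

# Model building times in seconds (Table 5.11): all features, BFS subset, GS subset.
build_time = {
    "C4.5": (0.47, 0.06, 0.11),
    "MLP": (485.68, 218.2, 235.3),
    "SVM": (14.66, 4.08, 5.32),
}

# Building time after feature selection as a percentage of the original time.
relative = {
    name: (round(100 * bfs / full), round(100 * gs / full))
    for name, (full, bfs, gs) in build_time.items()
}
print(relative)  # {'C4.5': (13, 23), 'MLP': (45, 48), 'SVM': (28, 36)}
```

In every case the reduced feature sets bring the building time below half of the original, with the 7-feature BFS subset giving the larger saving.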
5.9 Summary
In this chapter we gave a detailed analysis of the KDDCUP99 and NSL-KDD data sets.
The data features and the distribution of attack types were shown. The limitations of
KDDCUP99 were discussed, and the advantages of NSL-KDD were explained. The setup of
the experiments on the three classification algorithms, C4.5, ANN, and SVM, was shown.
The feature selection methods, best first and genetic search, and the evaluation
criteria were explained, and finally the results were illustrated and analyzed.
[Bar chart titled "Correctly Classified Instances for C4.5, MLP and SVM"; horizontal axis groups: Original Data, BF Search, Genetic Search; vertical axis from 0 to 100.]
Figure 5.6: Correctly Classified Instances for C4.5, MLP and SVM with the original data, selected features of BF, and selected features of GS.
Chapter 6
Conclusions and Future Work
In this research, we developed three models to solve the intrusion detection problem
using the decision tree based C4.5 algorithm, the Multi-Layer Perceptron, and the
Support Vector Machine. A number of attacks were classified using the three methods.
To enhance the performance of the proposed models and speed up the detection process,
a set of features was selected using the Best First Search and Genetic Search methods.
A comparison between the developed models before and after feature selection was
provided. The developed models were capable of reducing the complexity while keeping
acceptable detection accuracy. The decision tree based C4.5 algorithm achieved the
highest classification accuracy compared to the other classification techniques
explored in this work, while the SVM achieved the lowest false positive rate. As
future work, more research is needed on how to implement the designed models in a
real network environment. Other data mining techniques could be explored, and new
data could be collected that would be more useful for the two attack categories
U2R and R2L.
Appendix A
Features of NSL-KDD
Table A.1: NSL-KDD Intrusion Detection Data set Features (Kayacik et al., 2005).
No.  Feature name                 Type        Description
1    duration                     continuous  Duration of the connection
2    protocol_type                symbolic    Connection protocol (e.g. tcp, udp)
3    service                      symbolic    Destination service (e.g. telnet, ftp)
4    flag                         symbolic    Status flag of the connection
5    src_bytes                    continuous  Bytes sent from source to destination
6    dst_bytes                    continuous  Bytes sent from destination to source
7    land                         symbolic    1 if connection is from/to the same host/port; 0 otherwise
8    wrong_fragment               continuous  number of wrong fragments
9    urgent                       continuous  number of urgent packets
10   hot                          continuous  number of "hot" indicators
11   num_failed_logins            continuous  number of failed logins
12   logged_in                    symbolic    1 if successfully logged in; 0 otherwise
13   num_compromised              continuous  number of "compromised" conditions
14   root_shell                   continuous  1 if root shell is obtained; 0 otherwise
15   su_attempted                 continuous  1 if "su root" command attempted; 0 otherwise
16   num_root                     continuous  number of "root" accesses
17   num_file_creations           continuous  number of file creation operations
18   num_shells                   continuous  number of shell prompts
19   num_access_files             continuous  number of operations on access control files
20   num_outbound_cmds            continuous  number of outbound commands in an ftp session
21   is_host_login                symbolic    1 if the login belongs to the "hot" list; 0 otherwise
22   is_guest_login               symbolic    1 if the login is a guest login; 0 otherwise
23   count                        continuous  number of connections to the same host as the current connection in the past two seconds
24   srv_count                    continuous  number of connections to the same host as the current connection in the past two seconds
25   serror_rate                  continuous  % of connections that have "SYN" errors
26   srv_serror_rate              continuous  % of connections that have "SYN" errors
27   rerror_rate                  continuous  % of connections that have "REJ" errors
28   srv_rerror_rate              continuous  % of connections that have "REJ" errors
29   same_srv_rate                continuous  % of connections to the same service
30   diff_srv_rate                continuous  % of connections to different services
31   srv_diff_host_rate           continuous  % of connections to different hosts
32   dst_host_count               continuous  count of connections having the same destination host
33   dst_host_srv_count           continuous  count of connections having the same destination host and using the same service
34   dst_host_same_srv_rate       continuous  count of connections having the same destination host and using the same
35   dst_host_diff_srv_rate       continuous  % of different services on the current host
36   dst_host_same_src_port_rate  continuous  % of connections to the current host having the same src port
37   dst_host_srv_diff_host_rate  continuous  % of connections to the current host having the same src port
38   dst_host_serror_rate         continuous  % of connections to the current host that have an S0 error
39   dst_host_srv_serror_rate     continuous  % of connections to the current host and specified service that have an S0 error
40   dst_host_rerror_rate         continuous  % of connections to the current host that have an RST error
41   dst_host_srv_rerror_rate     continuous  % of connections to the current host and specified service that have an RST error
42   Label                        symbolic    normal/abnormal
Bibliography
Al-Hiary, H., A. Sheta, and A. Ayesh (2008). Identification of a chemical process
reactor using soft computing techniques. In Proceedings of the 2008 International
Conference on Fuzzy Systems (FUZZ2008) within the 2008 IEEE World Congress
on Computational Intelligence (WCCI2008), Hong Kong, 1-6 June, pp. 845–653.
Barman, D. K. and G. Khataniar (2012). Design of intrusion detection system based
on artificial neural network and application of rough set. International Journal of
Computer Science and Communication Networks, 548–552.
Bhavsar, Y. B. and K. C. Waghmare (2013). Intrusion detection system using data
mining technique: Support vector machine. International Journal of Emerging
Technology and Advanced Engineering 3(3), 581–586.
Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992). A training algorithm for optimal
margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational
Learning Theory, pp. 144–152. ACM.
Brause, R. W. (2001). Medical analysis and diagnosis by neural networks. In Proceedings
of the Second International Symposium on Medical Data Analysis, ISMDA
’01, London, UK, pp. 1–13. Springer-Verlag.
Breiman, L., J. Friedman, C. J. Stone, and R. Olshen (1984). Classification and
Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series.
Taylor and Francis.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery 2(2).
Cannady, J. (1998). Artificial neural networks for misuse detection. In National
Information Systems Security Conference, pp. 443–456.
Chaturvedi, S., R. N. Titre, and N. Sondhiya (2014). Review of handwritten pattern
recognition of digits and special characters using feed forward neural network and
izhikevich neural model. In Proceedings of the 2014 International Conference on
Electronic Systems, Signal Processing and Computing Technologies, Washington,
DC, USA, pp. 425–428. IEEE.
Chen, R.-C., K.-F. Cheng, Y.-H. Chen, and C.-F. Hsieh (2009, April). Using rough set
and support vector machine for network intrusion detection system. In Intelligent
Information and Database Systems, 2009. ACIIDS 2009. First Asian Conference
on, pp. 465–470.
Cortes, C. and V. Vapnik (1995, September). Support-vector networks. Machine
Learning 20(3), 273–297.
Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines:
And Other Kernel-based Learning Methods. New York, NY, USA: Cambridge
University Press.
Das, N. and T. Sarkar (2014, September). Survey on host and network based intrusion
detection system. International Journal of Advanced Networking and Applications
6(2), 2266–2269.
Du Jardin, P. (2010, June). Predicting bankruptcy using neural networks and other
classification methods: The influence of variable selection techniques on model
accuracy. Neurocomput. 73(10-12), 2047–2060.
Farid, D. M., N. Harbi, E. Bahri, M. Z. Rahman, and C. M. Rahman (2010, March).
Attacks classification in adaptive intrusion detection using decision tree. In
International Conference on Computer Science (ICCS’10), Rio De Janeiro, Brazil.
Firewalls (2015). Firewall definition from pc magazine encyclopedia. Retrieved from
http://www.pcmag.com/encyclopedia/term/43218/firewall; accessed June 18, 2015.
Gadbois, P. (2011, 10). Trainsignal’s comptia security course.
https://www.youtube.com/watch?v=O2Gz-v8WswQ. accessed july 2015.
Giray, S. and A. Polat (2013, Dec). Evaluation and comparison of classification
techniques for network intrusion detection. In Data Mining Workshops (ICDMW), 2013
IEEE 13th International Conference on, pp. 335–342.
Hall, M. (1999). Correlation-based Feature Selection for Machine Learning. Ph. D.
thesis, University of Waikato.
Hall, M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009,
November). The weka data mining software: An update. Special Interest Group on
Knowledge Discovery and Data Mining (SIGKDD) 11(1), 10–18.
Han, J., M. Kamber, and J. Pei (2012). Data Mining: Concepts and Techniques (3rd
ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Hu, W., Y. Liao, and R. Vemuri (2003). Robust anomaly detection using support
vector machines. In Proceedings of the International Conference on Machine
Learning. Morgan Kaufmann Publishers Inc.
Jessica (2007, April). Intrusions Detection Systems (HIDS vs. NIDS).
Retrieved from http://nforcingsecurity.blogspot.com/2007/04/intrusions-detection-systems-hids-vs.html;
accessed 14-June-2015.
Jha, J. and L. Ragha (2013, June). Intrusion detection system using support vector
machine. IJAIS Proceedings on International Conference and workshop on Advanced
Computing 2013 ICWAC(3), 25–30. Published by Foundation of Computer
Science, New York, USA.
Jiang, J., R. Li, T. Zheng, F. Su, and H. Li (2011). A new intrusion detection system
using class and sample weighted c-support vector machine. In Proceedings of the
2011 Third International Conference on Communications and Mobile Computing,
CMC ’11, Washington, DC, USA, pp. 51–54. IEEE Computer Society.
Kargupta, H., J. Han, P. S. Yu, R. Motwani, and V. Kumar (2008). Next Generation of
Data Mining (1st ed.). Chapman & Hall/CRC.
Kayacik, H. G., A. N. Zincir-Heywood, and M. I. Heywood (2005). Selecting features
for intrusion detection: A feature relevance analysis on kdd 99 intrusion detection
datasets. In Proceedings of the Third Annual Conference on Privacy, Security and
Trust.
Kessel, P. v. and K. Allan (2014, 10). Get ahead of cybercrime.
Khan, L., M. Awad, and B. Thuraisingham (2007). A new intrusion detection system
using support vector machines and hierarchical clustering. The VLDB Journal
16(4), 507–521.
Krol, D. and B. Szlachetko (2010, April). Automatic image and speech recognition
based on neural network. Journal of Information Technology Research (JITR) 3(2),
1–17.
Kruegel, C., F. Valeur, and G. Vigna (2005). Intrusion Detection and Correlation:
Challenges and Solutions. Springer Science + Business Media, Inc.
Kumar, K. and R. Punia (2013). Improving the performance of ids using genetic
algorithm. International Journal of Computer Science and Communication 4(2).
Liao, Y. (2005). Machine Learning in Intrusion Detection. Ph. D. thesis, Davis, CA,
USA.
Lichman, M. (2013). UCI machine learning repository.
http://archive.ics.uci.edu/ml. accessed july 2015.
Lokesak, B. (2008). A comparison between signature based and
anomaly based intrusion detection systems. Retrieved from
http://www.iup.edu/WorkArea/DownloadAsset.aspx?id=81109; June 8, 2015.
Maimon, O. and L. Rokach (Eds.) (2010). Data Mining and Knowledge Discovery
Handbook, 2nd ed. Springer.
Megha, A. and Amrita (2013). Performance analysis of different feature selection
methods in intrusion detection. International Journal of Scientific and Technology
Research 2(6).
Mohammed, S., S. Marwa, E.-b. Mohammed, and S. Imane (2007). Artificial neural
networks architecture for intrusion detection systems and classification of attacks.
In Faculty of Computers and Information Cairo University.
Moradi, M. and M. Zulkernine (2004). A neural network based system for intrusion
detection and classification of attacks. Retrieved June 14, 2015.
Muhammad-Imran, H., A. Bin-Abdullah, M. Hussain, S. Palaniappan, and I. Ahmad
(2008). Intrusions detection based on optimum features subset and efficient
dataset selection. International Journal of Engineering and Innovative Technology
(IJEIT) 2(6).
Mukkamala, S., D. Xu, and A. H. Sung (2006). Intrusion detection based on behavior
mining and machine learning techniques. In Proceedings of the 19th International
Conference on Advances in Applied Artificial Intelligence: Industrial, Engineering
and Other Applications of Applied Intelligent Systems, IEA/AIE’06, pp. 619–628.
Springer-Verlag.
Mulay, S. A., P. Devale, and G. Garje (2010, 6). Intrusion detection system using
support vector machine and decision tree. International Journal of Computer
Applications 3(3), 40–43. Published By Foundation of Computer Science.
Nadiammai, G. and M. Hemalatha (2014). Effective approach toward intrusion detection
system using data mining techniques. Egyptian Informatics Journal 15(1),
37–50.
Ng, A. (2014, Autumn). Cs229 lecture notes.
Norgaard, M., O. Ravn, Poulsen, and L. K. Hansen (2000). Neural Networks for
Modelling and Control of Dynamic Systems. Springer, London.
Ooi, S. Y., Y. M. Leong, M. F. Lim, H. K. Tiew, and Y. H. Pang (2013). Network
intrusion data analysis via consistency subset evaluator with ID3, C4.5 and
best-first trees. IJCSNS 13(2), 7.
Ou, G. and Y. L. Murphey (2007, January). Multi-class pattern classification using
neural networks. Pattern Recogn. 40(1), 4–18.
Panda, M., A. Abraham, S. Das, and M. R. Patra (2011, October). Network intrusion
detection system: A machine learning approach. Int. Dec. Tech. 5(4), 347–356.
Papadakis, S. E., P. Tzionas, V. G. Kaburlasos, and J. B. Theocharis (2005). A genetic
based approach to the type i structure identification problem. Informatica,
Lithuanian Academy of Sciences 16(3), 365–382.
Pathan, A.-S. K. (2014). The State of the Art in Intrusion Prevention and Detection.
CRC press.
Petra, P. (Ed.) (2012). Machine Learning and Data Mining in Pattern Recognition.
Pfleeger, C. P. and S. L. Pfleeger (2006). Security in Computing (4th Edition). Upper
Saddle River, NJ, USA: Prentice Hall PTR.
Pradhan, A. (2012). Support vector machines - a survey. International Journal of
Emerging Technology and Advanced Engineering 2(8).
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning.
Sabhnani, M. and G. Serpen (2004, September). Why machine learning algorithms
fail in misuse detection on KDD intrusion detection data set. Intell. Data Anal. 8(4),
403–415.
Sahilpreet, S. and B. Meenakshi (2013). Improvement of intrusion detection system
in data mining using neural network. International Journal of Advanced Research
in Computer Science and Software Engineering.
Sazzadul Hoque, M., M. Abdul Mukit, and M. Bikas (2012). An implementation
of intrusion detection system using genetic algorithm. International Journal of
Network Security & Its Applications 4(2).
Scarfone, K. and P. Mell (2007). Guide to intrusion detection and prevention systems
(idps).
Sen, S. and J. A. Clark (2011, October). Evolutionary computation techniques for
intrusion detection in mobile ad hoc networks. Comput. Netw. 55(15), 3441–3457.
Sharma, S., S. Kumar, and M. Kaur (2014). Recent trend in intrusion detection using
fuzzy-genetic algorithm. International Journal of Advanced Research in Computer
and Communication Engineering 3(5).
Singh, D., M. Dutta, and S. H. Singh (2009). Neural network based handwritten
hindi character recognition system. In Proceedings of the 2nd Bangalore Annual
Compute Conference, New York, NY, USA. ACM.
Sivatha Sindhu, S. S., S. Geetha, and A. Kannan (2012). Decision tree based light
weight intrusion detection using a wrapper approach. Expert Syst. Appl. 39(1),
129–141.
Stallings, W. (2010). Cryptography and Network Security: Principles and Practice
(5th ed.). Upper Saddle River, NJ, USA: Prentice Hall Press.
Sujatha, P. K., C. S. Priya, and A. Kannan (2012). Network intrusion detection system
using genetic network programming with support vector machine. In Proceedings
of the International Conference on Advances in Computing, Communications and
Informatics, New York, NY, USA, pp. 645–649. ACM.
Summers, R. C. (2010). Secure computing: Threats and safe-guards. McGraw Hill,
New York.
Tavallaee, M., E. Bagheri, W. Lu, and A. A. Ghorbani (2009). A detailed analysis
of the kdd cup 99 data set. In Proceedings of the Second IEEE International
Conference on Computational Intelligence for Security and Defense Applications,
CISDA’09, Piscataway, NJ, USA, pp. 53–58. IEEE Press.
Tim (2015, 1). How to interpret error measures in weka output?
http://stats.stackexchange.com/questions/131267/how-to-interpret-error-measures-in-weka-output.
accessed july 2015.
Tsai, C.-F., Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin (2009, December). Intrusion detection
by machine learning: A review. Expert Systems Applications 36(10), 11994–12000.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data: Springer
Series in Statistics (Springer Series in Statistics). Springer-Verlag New York, Inc.
Weiss, G. and F. Provost (2001). The effect of class distribution on classifier learning:
An empirical study. Technical report.
Witten, I. H., E. Frank, and M. A. Hall (2011). Data Mining: Practical Machine
Learning Tools and Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
Yao, J., S. Zhao, and L. Fan (2006). An enhanced support vector machine model for
intrusion detection. In Proceedings of the First International Conference on Rough
Sets and Knowledge Technology, pp. 538–543. Springer-Verlag.
Zhang, G. P. (2000, November). Neural networks for classification: A survey. Trans.
Sys. Man Cyber Part C 30(4), 451–462.