Intrusion Detection using Sequential Hybrid Model · in data mining. E.g. saint, portsweep, mscan, nmap etc. C. Importance of Intrusion Detection Intrusion Detection is important

Intrusion Detection using Sequential Hybrid Model

Abhishek SinhaDepartment of Computer Science

PES UniversityBengaluru, India - 560085

Aditya PandeyDepartment of Computer Science


Aishwarya PSDepartment of Computer Science


Abstract—A large amount of work has been done on theKDD 99 dataset, most of which includes the use of a hybridanomaly and misuse detection model done in parallel witheach other. In order to further classify the intrusions, ourapproach to network intrusion detection includes use oftwo different anomaly detection models followed by misusedetection applied on the combined output abtained fromthe previous step.The end goal of this is to verify the anomalies detectedby the anomaly detection algorithm and clarify whetherthey are actually intrusions or random outliers from thetrained normal (and thus to try and reduce the numberof false positives). We aim to detect a pattern in this novelintrusion technique itself, and not the handling of suchintrusions. The intrusions were detected to a very highdegree of accuracy.

I. INTRODUCTION

A. Context

A network intrusion could be any unauthorizedactivity on a computer network. Virus attacks, unau-thorized access, theft of information and denial-of-service attacks were the greatest contributors tocomputer crime. Detecting an intrusion depends onthe defenders having a clear understanding of howattacks work. Detecting an intrusion is the first stepto create any sort of counteractive security measure.Hence, it is very important to accurately determinewhether a connection is an intrusion or not.

B. Categories

There are four broad categories of intrusions in anetwork of systems:

• DOS:In computing, a denial-of-service attack is acyber-attack in which the perpetrator seeks tomake a machine or network resource unavail-able to its intended users by temporarily or

indefinitely disrupting services of a host con-nected to the Internet. E.g. back, land, neptune.

• U2R:These attacks are exploitations in which thehacker starts off on the system with a normaluser account and attempts to abuse vulnerabil-ities in the system in order to gain super userprivileges. E.g. perl, xterm.

• R2L:Remote to local is an unauthorized access froma remote machine, by maybe guessing thepassword of the local system and accessingfiles within the local system. E.g. ftp write,guess passwd, imap.

• Probe:Probing is an attack in which the hacker scansa machine or a networking device in orderto determine weaknesses or vulnerabilities thatmay later be exploited so as to compromisethe system. This technique is commonly usedin data mining. E.g. saint, portsweep, mscan,nmap etc.

C. Importance of Intrusion Detection

Intrusion Detection is important for both Militaryas well as commercial sectors for the sake of theirInformation Security, which is the most importanttopic of research for the future networks. It is criticalto maintain a high level of security to ensure safeand trusted communication of information betweenvarious organizations.Intrusion Detection System is a new safeguard tech-nology for system security after traditional technolo-gies, such as firewall, message encryption and so on.An intrusion detection system (IDS) is a device orsoftware application that monitors network system

activities for malicious activities or policy violationsand produces reports to a Management Station.

II. PREVIOUS WORK

The two approaches to an Intrusion DetectionSystem are Misuse Detection and Anomaly Detec-tion. A key advantage of misuse detection tech-niques is its high degree of accuracy in detectingknown attacks and their variations. Their obviousdrawback is the inability to detect attacks whoseinstances have not yet been observed. Anomalydetection schemes on the other hand, suffer froma high rate of false alarms. This occurs primarilybecause previously unseen (yet legitimate) systembehaviours are also recognized as anomalies, and arehence flagged as potential intrusions. Researchershave used various different algorithms to solve thisspecific problem.A data mining algorithm such as Random Forestcan be applied for misuse, anomaly and hybridnetwork based intrusion detection systems. [7] usesa hybrid detection technique where they employmisuse detection followed by anomaly detection.This approach however has multiple limitations:

• Intrusions need to be much lesser than thenormal data. Outlier detection will only workwhen majority of the data is normal

• A novel intrusion producing a large number ofconnections that are not filtered out by the mis-use detection could decrease the performanceof the anomaly detection or even the hybridsystem as a whole.

• Some of the intrusions with a high degree ofsimilarity cannot be detected by this anomalydetection approach.

The paper ’Artificial Neural Networks for MisuseDetection’ [8] talks about the advantages of usingan artificial neural network approach over a rulebased expert system approach. It also tells us aboutthe various approaches with which these neural net-works are applied to get a high accuracy. A reviewpaper on misuse detections [9] sheds light onthe most common techniques to implement misusedetection. These are:

• Expert Systems, which code knowledge aboutattacks as ’if-then’ implication rules.

• Model Based Reasoning Systems, which com-bine models of misuse with evidential reason-

ing to support conclusions about the occurrenceof a misuse.

• State Transition Analysis, which represents at-tacks as a sequence of state transitions of themonitored system.

• Key Stroke Monitoring, which uses user keystrokes to determine the occurrence of an at-tack.

[9] introduces a pattern matching model alongwith an Artificial Neural Network model for theirimplementation of a Misuse Detection System. Fur-ther [9] performed Intrusion Detection using DataMining techniques along with fuzzy logic and basicgenetic algorithms. [1], [2], [3], [4], [5] are variedapproaches but their accuracy is not as good as theapproach in [7]. The approach specified in [6] has agood accuracy and also supports unsupervised learn-ing algorithms but it requires a very complicatedstate machine specification model.

III. PROBLEM STATEMENT

Here, our goal is to detect network intrusionsand to create a predictive model that is able todifferentiate between ’good’ or ’bad’ connections(intrusions) and classify those intrusions into knowncategories. To that end, we used the kddCup datasetfrom 1999 which was created by simulating variousintrusions over a network over the course of severalweeks in a military environment.The dataset itself consists of 42 attributes thatare used in the various models according to rel-evance and importance. It contains approximately1,500,000 datapoints. For the purposes of thisproject, we use a 10% representative dataset, sinceour machines did not have the processing powerto handle the larger dataset. The dataset as is, isunbalanced across the various result categories, andhence, we balance it by upsampling and downsam-pling. The dataset is also cleaned. Certain categor-ical variables like the ’protocol type’ column areone-hot encoded. We found out that the ’flag’ and’services’ attributes are not of much value as theyare nominal variables and hence they are dropped.

IV. APPROACH

Due to the various drawbacks of the individualanomaly detection models as well as the individual

TABLE INUMBER OF DATA POINTS

Label: dos normal probe rtl u2rBefore Sampling: 54572 87832 2131 999 52After Sampling: 27285 39524 2131 999 86

misuse detection models, we use a combined ap-proach i.e. a hybrid of the two. We combine themodels serially such that the anomaly detection isfollowed by the misuse detection. This approachprovides us with multiple advantages:

• The misuse detection acts as a verificationmodel where it verifies whether an anomalyis actually an intrusion or not. This helps usreduce the false positives coming from theanomaly detection (Primary drawback)

• The misuse detection also helps us classifythe intrusions detected into various categoriesbased on the intrusions ’signature’ (centroid ofsample data from the training dataset)

Fig. 1. Block Diagram of Approach

Figure 1 shows the approach for the Intrusion De-tection System.

A. ModellingFor the anomaly detection, we use two models:• Random Forest Model• Neural Network Model

After running both these classification models paral-lely, we take the combined output of the anomaliesdetected by either of the two. Doing so helps usreduce the rate of false negatives. The connectionswhich are identified as anomalies by either of thetwo are then passed on to the misuse detectionmodel. A false positive from the anomaly detectionmodel classified as ’normal’ in the misuse detectionmodel reduces the number of false positives.For Misuse detection, we use:

• KMeans based Clustering Model.Therefore, the use of the misuse detection model onthe anomalies helps trim down the number of falsepositives.

V. COMPONENTS OF THE INTRUSIONDETECTION SYSTEM

Our Intrusion Detection System consists of 3components which have been combined to createone complete robust system.

The components of the Anomaly DetectionSystem are -

A. Neural Network:

A neural network is a system of hardware and/orsoftware patterned after the operations of neuronsin the human brain.For the neural network, we created a sequentialneural network with two hidden layers. The neuralnetwork consists of 41 input nodes and 5 outputnodes. The 41 input nodes are the numerical at-tributes which are obtained after preprocessing andthe 5 output nodes are used to classify the data pointinto one of the 5 classes of connections that we have(u2r, rtl, normal, dos, probe).The model is validated by a K fold cross validationwith k as 2. An average accuracy of 99.57% wasobtained. In an attempt to reduce processing timewithout much drop in accuracy, we reduced thenumber of splits in the k fold cross validation aswell as used a lower number of hidden layers in thenetwork with minimal loss of accuracy.

Figure 2 shows the number of nodes in each layerof the Neural Network.

B. Random Forest:

A random forest is essentially a multitude ofdecision trees. The output obtained from a randomforest model is a combination of the ouputs obtainedfrom all the decision trees.In our case, we use the mode of all the decision treeoutputs as the final output from the model. Basedon trial and error, the number of decision trees inour random forest was taken as 100. We use thisRandom Forest Classifier to classify the connectionamong the 5 main classes.Based on the importance value of variables, all

Fig. 2. Neural Network Model

the non essential variables are dropped before themodel is retrained. This helps in removing all theredundant and usuless attributes in the model. Usingthe random forest model, an average accuracy of99.78% is obtained.In order to reduce the processing time, the numberof decision trees was reduced from an initial valueof 1000 to 100 without resulting in any significantloss in accuracy of the model.

Fig. 3. Part of the RF Model

Figure 3 shows a part of the large Random ForestModel created

C. Misuse Clustering:

We used a K-Means based clustering algorithmas the miuse detection model. This clustering algo-rithm is used to further classify the five classes of

connections into 24 classes. During training, clustersare made for each known class and their centroidsare stored. When the model receives a data pointduring testing, the minimum distance to any of thecentroids is calculated to assign the point to therelevant class.If a connection arrives during testing that is falselytagged as an attack, then the cluster it is assigned toshould theoretically be ’normal’, thereby reducingfalse positives.

D. Combination of Models:

The output from both the models in the anomalydetection stage is verified with each other andthe connections labeled as some kind of attackor connections not having the same labels arepassed on to the misuse clustering model whichwill further classify the attack into its subclass.Another version of our final model is to use theMisuse clustering model to reduce the number offalse positives, thereby increasing the precision andaccuracy.

VI. RESULTS

TABLE IINEURAL NETWORK:

Label: normal rtl dos u2r probePrecision: 98.391 97.560 99.955 79.710 97.468

Recall: 99.817 80.080 99.146 63.953 90.333Accuracy: 98.976 99.687 99.650 99.935 99.634

TABLE IIIRANDOM FOREST:

Label: normal rtl dos u2r probePrecision: 99.820 99.382 99.985 87.500 99.669

Recall: 99.949 96.596 99.978 81.395 98.967Accuracy: 99.870 99.942 99.985 99.962 99.958

Note: All values are with respective to individualclasses vs the entire dataset.

TABLE IVMISUSE:

Type Of Classification: 5 Class 24 ClassAccuracy: 99.651 91.31

Fig. 4. Confusion Matrix for 5 classes

Figure 4 is the confusion matrix of the 5 classmisuse classification (the 24 class classification isdifficult to visualize).

A. Discussion of Results

Since we are using a synthetic dataset, we obtainaccuracies that are quite high. However the pre-cision and recall for u2r in specific, in both themodels is very low compared to the rest of ournumbers. This can be attributed to the fact that thenumber of connections causing u2r attacks itself isvery low, taking up only around 70 of the 70,000connections. As a result, the models have not beentrained enough to be able to detect these attacks.Hence a good proportion of the u2r attacks havebeen misclassified. This problem can be rectified ifa more balanced dataset is provided.From the tabular columns we can see that most ofthe attacks have not only been detected but alsocorrectly classified in the anomaly detection model.In an attempt to further classify the attacks into itssubcategories, we lose a bit of accuracy but it isstill at an acceptable 91.3%. A further classificationis necessary from the perspective of dealing withthese intrusions. For example, A and B might bothbelong to ’dos’ but they might have different se-curity requirements to stop these attacks. We canalso use the misuse clustering to cut down on falsepositives and that will only make our accuracy and

precision better, if in case the drop in accuracy isnot preferable.From the confusion matrix, we can also see thatthe number of ’normal’ and ’dos’ connections aresignificantly higher than the number of all the otherconnections even after downsampling these cate-gories. Therefore, in order to achieve a balance thatis acceptable, we downsampled these 2 categoriesand upsampled the ’u2r’ category.What can also be seen is that the number of misclas-sifications is fairly less. The only issue which canbe seen is that a fraction of the misclassification isclassifying one of the kinds of attacks as normalconnections. The most probable reason for this isdue to the fact that the number of ’normal’ connec-tions make up more than 55% of the total dataset.However, considering the number of false negativesversus the total number of connections, this error isalmost insignificant.

VII. CONCLUSION

Even though technology has advanced leaps andbounds over the past few decades, Intrusion detec-tion continues to be an active research field. Thispaper outlines our approach to solving this problemby using a combination of Anomaly and MisuseDetection models. The techniques used to performthe same have been explained and illustrated. Webelieve that this approach is quite successful in clas-sifying connections into their respective categories.

VIII. FUTURE WORK

The current approach is able to spot and classifynew intrusions as such. However, the number offalse positives for this are high. Furthermore, theconnections detected as new intrusions could befurther categorized into subclasses based on similar-ity(clustering) and dealt with accordingly. Differentapproaches could also be tried in order to obtainmore accurate signatures for the different categoriesof intrusions which would improve the accuracy ofthe misuse detection model.

IX. CONTRIBUTIONS

This project was done by Abhishek SInha, AdityaPandey and Aishwarya PS under the guidance of Dr.Gowri Srinivasa.Parts of the pre-processing of the data was carried

out by all of the team members. Abhishek Sinha wasresponsible for one hot encoding and dropping ofcertain columns, Aditya Pandey for the upsampling,downsampling, removal of duplicate data points andcleaning of the data, and Aishwarya PS adding acolumn for labelling of various intrusion classes astheir parent intrusion domain and for N-Fold crossvalidation.From here onwards,Aishwarya PS carried out the Random ForestModel. Aditya Pandey carried out the Neural Net-work Model. Abhishek Sinha carried out the MisuseDetection Model and then combined the approaches.

REFERENCES

[1] Aleksandar Lazarevic, Levent Ertoz, Vipin Kumar, Aysel Ozgurand Jaideep Srivastava, ’A Comparative Study of AnomalyDetection Schemes in Network Intrusion Detection’, 2002

[2] D. Barbara, N. Wu and S. Jajodia, ’Detecting Novel NetworkIntrusions Using Bayes Estimators’, 2001

[3] S. A. Hofmeyr, S. Forrest and A. Somayaji, ’Intrusion DetectionUsing Sequences of System Calls’, 1998

[4] A. Ghosh and A. Schwartzbard, ’A Study in Using NeuralNetworks for Anomaly and Misuse Detection’, 1999

[5] E. Eskin, W. Lee and S. J. Stolfo, ’Modeling System Calls forIntrusion Detection with Dynamic Window Sizes’, 2001

[6] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yangand S. Zhou, ’Specification-based Anomaly Detection: A NewApproach for Detecting Network Intrusions’, 2002

[7] Jiong Zhang, Mohammad Zulkernine and Anwar Haque,’Random-Forests-Based Network Intrusion Detection Systems’,2008

[8] James Cannady, ’Artificial Neural Networks for Misuse Detec-tion’, 1998

[9] Ms.R.S.Landge* and Mr.A.P.Wadhe, ’Misuse Detection SystemUsing Various Techniques: A Review’, 2013

Documents

Intrusion Detection using Sequential Hybrid Model · in data mining. E.g. saint, portsweep, mscan, nmap etc. C. Importance of Intrusion Detection Intrusion Detection is important