4. layered approach using conditional random fields for intrusion detection

Layered Approach Using ConditionalRandom Fields for Intrusion Detection

Kapil Kumar Gupta, Baikunth Nath, Senior Member, IEEE, and

Ramamohanarao Kotagiri, Member, IEEE

Abstract—Intrusion detection faces a number of challenges; an intrusion detection system must reliably detect malicious activities

in a network and must perform efficiently to cope with the large amount of network traffic. In this paper, we address these two issues

of Accuracy and Efficiency using Conditional Random Fields and Layered Approach. We demonstrate that high attack detection

accuracy can be achieved by using Conditional Random Fields and high efficiency by implementing the Layered Approach.

Experimental results on the benchmark KDD ’99 intrusion data set show that our proposed system based on Layered Conditional

Random Fields outperforms other well-known methods such as the decision trees and the naive Bayes. The improvement in attack

detection accuracy is very high, particularly, for the U2R attacks (34.8 percent improvement) and the R2L attacks (34.5 percent

improvement). Statistical Tests also demonstrate higher confidence in detection accuracy for our method. Finally, we show that our

system is robust and is able to handle noisy data without compromising performance.

Index Terms—Intrusion detection, Layered Approach, Conditional Random Fields, network security, decision trees, naive Bayes.

Ç

1 INTRODUCTION

INTRUSION detection as defined by the SysAdmin, Audit,Networking, and Security (SANS) Institute is the art of

detecting inappropriate, inaccurate, or anomalous activity[6]. Today, intrusion detection is one of the high priority andchallenging tasks for network administrators and securityprofessionals. More sophisticated security tools mean that theattackers come up with newer and more advanced penetra-tion methods to defeat the installed security systems [4] and[24]. Thus, there is a need to safeguard the networks fromknown vulnerabilities and at the same time take steps todetect new and unseen, but possible, system abuses bydeveloping more reliable and efficient intrusion detectionsystems. Any intrusion detection system has some inherentrequirements. Its prime purpose is to detect as many attacksas possible with minimum number of false alarms, i.e., thesystem must be accurate in detecting attacks. However, anaccurate system that cannot handle large amount of networktraffic and is slow in decision making will not fulfill thepurpose of an intrusion detection system. We desire a systemthat detects most of the attacks, gives very few false alarms,copes with large amount of data, and is fast enough to makereal-time decisions.

Intrusion detection started in around 1980s after theinfluential paper from Anderson [10]. Intrusion detectionsystems are classified as network based, host based, orapplication based depending on their mode of deploymentand data used for analysis [11]. Additionally, intrusiondetection systems can also be classified as signature based oranomaly based depending upon the attack detection method.

Thesignature-based systemsare trained byextracting specificpatterns (or signatures) from previously known attacks whilethe anomaly-based systems learn from the normal datacollected when there is no anomalous activity [11].

Another approach for detecting intrusions is to considerboth the normal and the known anomalous patterns fortraining a system and then performing classification on thetest data. Such a system incorporates the advantages of boththe signature-based and the anomaly-based systems and isknown as the Hybrid System. Hybrid systems can be veryefficient, subject to the classification method used, and canalso be used to label unseen or new instances as they assignone of the known classes to every test instance. This ispossible because during training the system learns featuresfrom all the classes. The only concern with the hybrid methodis the availability of labeled data. However, data requirementis also a concern for the signature- and the anomaly-basedsystems as they require completely anomalous and attack-free data, respectively, which are not easy to ensure.

The rest of this paper is organized as follows: In Section 2,we discuss the related work with emphasis on variousmethods and frameworks used for intrusion detection. Wedescribe the use of Conditional Random Fields (CRFs) forintrusion detection [23] in Section 3 and the LayeredApproach [22] in Section 4. We then describe how tointegrate the Layered Approach and the CRFs in Section 5.In Section 6, we give our experimental results and compareour method with other approaches that are known toperform well. We observe that our proposed system,Layered CRFs, performs significantly better than othersystems. We study the robustness of our method in Section 7by introducing noise in the system. We discuss featureselection in Section 8 and draw conclusions in Section 9.

2 RELATED WORK

The field of intrusion detection and network security hasbeen around since late 1980s. Since then, a number of

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010 35

. The authors are with the Department of Computer Science and SoftwareEngineering, and NICTA Victoria Research Laboratory, The University ofMelbourne, Parkville 3010, Australia.E-mail: {kgupta, bnath, rao}@csse.unimelb.edu.au.

Manuscript received 6 Mar. 2007; revised 11 Dec. 2007; accepted 28 Jan.2008; published online 12 Mar. 2008.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TDSC-2007-03-0031.Digital Object Identifier no. 10.1109/TDSC.2008.20.

1545-5971/10/$26.00 � 2010 IEEE Published by the IEEE Computer Society

Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on July 19,2010 at 00:12:39 UTC from IEEE Xplore. Restrictions apply.

methods and frameworks have been proposed and manysystems have been built to detect intrusions. Varioustechniques such as association rules, clustering, naive Bayesclassifier, support vector machines, genetic algorithms,artificial neural networks, and others have been applied todetect intrusions. In this section, we briefly discuss thesetechniques and frameworks.

Lee et al. introduced data mining approaches fordetecting intrusions in [30], [31], and [32]. Data miningapproaches for intrusion detection include association rulesand frequent episodes, which are based on buildingclassifiers by discovering relevant patterns of programand user behavior. Association rules [8] and frequentepisodes are used to learn the record patterns that describeuser behavior. These methods can deal with symbolic data,and the features can be defined in the form of packet andconnection details. However, mining of features is limitedto entry level of the packet and requires the number ofrecords to be large and sparsely populated; otherwise, theytend to produce a large number of rules that increase thecomplexity of the system [7].

Data clustering methods such as the k-means and the fuzzyc-means have also been applied extensively for intrusiondetection [36] and [39]. One of the main drawbacks of theclustering technique is that it is based on calculating numericdistance between the observations, and hence, the observa-tions must be numeric. Observations with symbolic featurescannot be easily used for clustering, resulting in inaccuracy.In addition, the clustering methods consider the featuresindependently and are unable to capture the relationshipbetween different features of a single record, which furtherdegrades attack detection accuracy.

Naive Bayes classifiers have also been used for intrusiondetection [9]. However, they make strict independenceassumption between the features in an observation result-ing in lower attack detection accuracy when the features arecorrelated, which is often the case for intrusion detection.Bayesian network can also be used for intrusion detection[28]. However, they tend to be attack specific and build adecision network based on special characteristics ofindividual attacks. Thus, the size of a Bayesian networkincreases rapidly as the number of features and the type ofattacks modeled by a Bayesian network increases.

To detect anomalous traces of system calls in privilegedprocesses [20], hidden Markov models (HMMs) have beenapplied in [17], [42], and [43]. However, modeling the systemcalls alone may not always provide accurate classification asin such cases various connection level features are ignored.Further, HMMs are generative systems and fail to modellong-range dependencies between the observations [29]. Wefurther discuss this in detail in Section 3.

Decision trees have also been used for intrusiondetection [9]. The decision trees select the best features foreach decision node during the construction of the tree basedon some well-defined criteria. One such criterion is to usethe information gain ratio, which is used in C4.5. Decisiontrees generally have very high speed of operation and high-attack detection accuracy.

Debar et al. [14] and Zhang et al. [46] discuss the use ofartificial neural networks for network intrusion detection.Though the neural networks can work effectively with noisydata, they require large amount of data for training and it is

often hard to select the best possible architecture for a neuralnetwork. Support vector machines have also been used fordetecting intrusions [26]. Support vector machines map real-valued input feature vector to a higher dimensional featurespace through nonlinear mapping and can provide real-timedetection capability, deal with large dimensionality ofdata, and can be used for binary-class as well as multiclassclassification. Other approaches for detecting intrusioninclude the use of genetic algorithm and autonomous andprobabilistic agents for intrusion detection [1] and [5]. Thesemethods are generally aimed at developing a distributedintrusion detection system.

To overcome the weakness of a single intrusion detectionsystem, a number of frameworks have been proposed, whichdescribe the collaborative use of network-based and host-based systems [45]. Systems that employ both signature-based and behavior-based techniques are discussed in [19]and [41]. In [32], the authors describe a data mining frame-work for building adaptive intrusion detection models. Adistributed intrusion detection framework based on mobileagents is discussed in [12].

The most closely related work, to our work, is of Lee et al.[30], [31], and [32]. They, however, consider a data miningapproach for mining association rules and finding frequentepisodes in order to calculate the support and confidence ofthe rules separately. Instead, in our work, we define featuresfrom the observations as well as from the observations andthe previous labels and perform sequence labeling via theCRFs to label every feature in the observation. This setting issufficient for modeling the correlation between differentfeatures of an observation. We also compare our work with[21], which describes the use of maximum entropy principlefor detecting anomalies in the network traffic. The keydifference between [21] and our work is that the authors in[21] use only the normal data during training and build abaseline system, i.e., a behavior-based system, while we trainour system with both the normal and the anomalous data,i.e., we build a hybrid system. Second, the system in [21] failsto model long-range dependencies in the observations,which can be easily represented in our model. We alsointegrate the Layered Approach with the CRFs to gain thebenefits of computational efficiency and high accuracy ofdetection in a single system.

We compare the Layered Approach with the works in [18],[25], and [41]. The authors in [18] describe the combination of“strong” classifiers using stacking, where the decision tress,naive Bayes, and a number of other classification methods areused as base classifiers. The authors show that the outputfrom these classifiers can be combined to generate a betterclassifier rather than selecting the best one. In [25], the authorsuse a combination of “weak” classifiers. The individualclassification power of weak classifiers is slightly better thanrandom guessing. The authors show that a number of suchclassifiers when combined using simple majority votingmechanism, provide good classification. In [41], the authorsapply a combination of anomaly and misuse detectors forbetter qualification of analyzed events. However, our work isnot based upon classifier combination. Combination ofclassifiers is expensive with regard to the processing timeand decision making. The purpose of classifier combination isto improve accuracy. Rather, our system is based upon serial

36 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010


layering of multiple hybrid detectors. From our experimentsin Section 6, we show that the Layered CRFs perform betterthan individual classifiers and they are more efficient andaccurate than a system based on classifier combination. Theresults from individual classifiers at a layer are not combinedat any later stage in the Layered Approach, and hence, anattack can be blocked at the layer where it is detected. There isno communication overhead among the layers and the centraldecision-maker. In addition, since the layers are independentthey can be trained separately and deployed at criticallocations in a network depending upon the specific require-ments of a network. Using a stacked system will not give usthe advantage of reduced processing when an attack isdetected at the initial layers in the sequential model.

In this paper, we show the effectiveness of CRFs forintrusion detection. Motivated by our results in [23], weperform detailed analysis and show that CRFs are a strongcandidate for building robust intrusion detection systems.We then show that high efficiency can be achieved byimplementing the Layered Approach. Finally, we integratethe Layered Approach and the CRFs to develop a systemthat is accurate and performs efficiently.

3 CONDITIONAL RANDOM FIELDS FOR INTRUSION

DETECTION

Conditional models are probabilistic systems that are usedto model the conditional distribution over a set of randomvariables. Such models have been extensively used in thenatural language processing tasks. Conditional models offera better framework as they do not make any unwarrantedassumptions on the observations and can be used to modelrich overlapping features among the visible observations.Maxent classifiers [37], maximum entropy Markov models[34], and CRFs [29] are such conditional models. Theadvantage of CRFs is that they are undirected and are, thus,free from the Label Bias and the Observation Bias [27]. Thesimplest conditional classifier is the Maxent classifier basedupon maximum entropy classification, which estimates theconditional distribution of every class given the observations[37]. The training data is used to constrain this conditionaldistribution while ensuring maximum entropy and hencemaximum uniformity. We now give a brief description of theCRFs, which is motivated from the work in [29].

Let X be the random variable over data sequence to belabeled and Y the corresponding label sequence. Inaddition, let G ¼ ðV ;EÞ be a graph such that Y ¼ ðYvÞv2ðV Þ,so that Y is indexed by the vertices of G. Then, ðX;Y Þ is aCRF, when conditioned on X, the random variables Yvobey the Markov property with respect to the graph:pðYvjX;Yw; w 6¼ vÞ ¼ pðYvjX;Yw; w � vÞ, where w � v meansthat w and v are neighbors in G, i.e., a CRF is a random fieldglobally conditioned on X. For a simple sequence (or chain)modeling, as in our case, the joint distribution over the labelsequence Y given X has the following form:

p�ðyjxÞ/expXe2E;k

�kfkðe; yje; xÞ þXv2V ;k

�kgkðv; yjv; xÞ !

; ð1Þ

where x is the data sequence, y is a label sequence, and yjs isthe set of components of y associated with the vertices or

edges in subgraph S. In addition, the features fk and gkare assumed to be given and fixed. For example, aBoolean edge feature fk might be true if the observation Xi

is “protocol ¼ tcp,” tag Yi�1 is “normal,” and tag Yi is“normal.” Similarly, a Boolean vertex feature gk might be trueif the observation Xi is “service ¼ ftp” and tag Yi is “attack.”Further, the parameter estimation problem is to find theparameters � ¼ ð�1; �2; . . . ;�1; �2; . . .Þ from the training dataD ¼ ðxi; yiÞNi¼1 with the empirical distribution ~pðx; yÞ [29].

CRFs are undirected graphical models used for sequencetagging. The prime difference between CRF and othergraphical models such as the HMM is that the HMM, beinggenerative, models the joint distribution pðy; xÞ, whereas theCRF are discriminative models and directly model theconditional distribution pðyjxÞ, which is the distribution ofinterest for the task of classification and sequence labeling.Similar to HMM, the naive Bayes is also generative andmodels the joint distribution. Modeling the joint distributionhas two disadvantages. First, it is not the distribution ofinterest, since the observations are completely visible and theinterest is in finding the correct class for the observations,which is the conditional distribution pðyjxÞ. Second, inferringthe conditional probability pðyjxÞ from the modeled jointdistribution, using the Bayes rule, requires the marginaldistribution pðxÞ. To estimate this marginal distribution isdifficult since the amount of training data is often limited andthe observation x contains highly dependent features that aredifficult to model and therefore strong independenceassumptions are made among the features of an observation.This results in reduced accuracy [40]. CRFs, however, predictthe label sequence y given the observation sequence x. Thisallows them to model arbitrary relationship among differentfeatures in an observation x [15]. CRFs also avoid theobservation bias and the label bias problem, which arepresent in other discriminative models, such as the maximumentropy Markov models. This is because the maximumentropy Markov models have a per-state exponential modelfor the conditional probabilities of the next state given thecurrent state and the observation, whereas the CRFs have asingle exponential model for the joint probability of the entiresequence of labels given the observation sequence [29].

The task of intrusion detection can be compared to manyproblems in machine learning, natural language processing,and bioinformatics. The CRFs have proven to be verysuccessful in such tasks, as they do not make any unwar-ranted assumptions about the data. Hence, we explore thesuitability of CRFs for intrusion detection.

3.1 Motivating Example

The data analyzed by the intrusion detection system forclassification often has a number of features that are highlycorrelated and complex relationships exist between them.For example, when classifying network connections aseither normal or as attack, a system may consider featuressuch as “logged in” and “number of file creations.” Whenthese features are analyzed individually, they do notprovide any information that can aid in detecting attacks.However, when these features are analyzed together, theycan provide meaningful information, which can be helpfulfor the classification task. Taking another example, theconnection level feature such as the “service invoked” at the

GUPTA ET AL.: LAYERED APPROACH USING CONDITIONAL RANDOM FIELDS FOR INTRUSION DETECTION 37


destination provides some information about the class label(in case an attacker sends request to a service that is notavailable). This information becomes more concrete andaids in classification when analyzed with other featuressuch as “protocol type” and “amount of data transferred”between source and destination (in case the client connectsto an available service such as the ftp and performs datatransfer). These relationships, between different features inthe observed data, if considered during classification cansignificantly decrease classification error. The CRFs do notconsider features to be independent and hence performbetter when compared with other methods.

The data set used in our experiments represents featuresof every session in relational form with only one label forthe entire record. In this case, using a conditional modelwould result in a simple maximum entropy classifier [40].However, we represent the data in the form of a sequenceand assign a label to every feature in the sequence using thefirst-order Markov assumption instead of assigning a singlelabel to the entire observation. Though, this increases thecomplexity but it also increases the attack detection accuracy.

Each record represents a separate connection, and hence,we consider every record as a separate sequence. We aim tomodel the relationships among features of individualconnections using a CRF, as shown in Fig. 1. In the figure,features such as duration, protocol, service, flag, andsrc_bytes take some possible value for every connection.During training, feature weights are learnt, and duringtesting, features are evaluated for the given observation,which is then labeled accordingly.

As it is evident from the figure, every label is connectedto every input feature, which indicates that all the featuresin an observation help in labeling, and thus, a CRF canmodel dependencies among the features in an observation.Present intrusion detection systems do not consider suchrelationships among the features in the observations. Theyeither consider only one feature, such as in the case ofsystem call modeling, or assume conditional independenceamong different features in the observation as in the case ofa naive Bayes classifier. As we will show from ourexperimental results, the CRFs can effectively model suchrelationships among different features of an observationresulting in higher attack detection accuracy. Anotheradvantage of using CRFs is that every element in thesequence is labeled such that the probability of the entirelabeling is maximized, i.e., all the features in the observa-tion collectively determine the final labels. Hence, even ifsome data is missing, the observation sequence can still belabeled with less number of features.

Our first goal is to improve the attack detection accuracy.We first compare the accuracy of CRFs for detecting attackswith other methods in Section 6. We consider all the41 features in the data set for each of the four attack groupsseparately. As we shall observe, the CRFs outperform othermethods for detecting “Unauthorized access to Root” (U2R)attacks. They are also effective in detecting the Probe,“Remote to Local” (R2L), and “Denial of Service” (DoS)attacks. However, CRFs can be expensive during trainingand testing. For a simple linear chain structure, the timecomplexity for training a CRF is OðTL2NIÞ, where T is thelength of the sequence, L is the number of labels, N is thenumber of training instances, and I is the number ofiterations. During inference, the Viterbi algorithm isemployed, which has a complexity of OðTL2Þ. The quad-ratic complexity is significant when the number of labels islarge as in language tasks. However, for intrusion detection,there are only two labels “normal” and “attack,” and thus,the system is very efficient. We further improve the overallsystem performance by using the Layered Approach, whichdecreases T , i.e., the length of the sequence. The LayeredApproach is described next.

4 LAYERED APPROACH FOR INTRUSION DETECTION

We now describe the Layer-based Intrusion DetectionSystem (LIDS) in detail. The LIDS draws its motivationfrom what we call as the Airport Security model, where anumber of security checks are performed one after the otherin a sequence. Similar to this model, the LIDS represents asequential Layered Approach and is based on ensuringavailability, confidentiality, and integrity of data and (or)services over a network. Fig. 2 gives a generic representa-tion of the framework.

The goal of using a layered model is to reduce computationand the overall time required to detect anomalous events. Thetime required to detect an intrusive event is significant andcan be reduced by eliminating the communication overheadamong different layers. This can be achieved by making thelayers autonomous and self-sufficient to block an attackwithout the need of a central decision-maker. Every layer inthe LIDS framework is trained separately and then deployedsequentially. We define four layers that correspond to thefour attack groups mentioned in the data set. They are Probelayer, DoS layer, R2L layer, and U2R layer. Each layer is thenseparately trained with a small set of relevant features.Feature selection is significant for Layered Approach anddiscussed in the next section. In order to make the layersindependent, some features may be present in more than onelayer. The layers essentially act as filters that block anyanomalous connection, thereby eliminating the need offurther processing at subsequent layers enabling quick


Fig. 1. Graphical representation of a CRF.

Fig. 2. Layered representation.


response to intrusion. The effect of such a sequence of layers isthat the anomalous events are identified and blocked as soonas they are detected.

Our second goal is to improve the speed of operation ofthe system. Hence, we implement the LIDS and select asmall set of features for every layer rather than using all the41 features. This results in significant performance im-provement during both the training and the testing of thesystem. In many situations, there is a trade-off betweenefficiency and accuracy of the system and there can bevarious avenues to improve system performance. Methodssuch as naive Bayes assume independence among theobserved data. This certainly increases system efficiency,but it may severely affect the accuracy. To balance thistrade-off, we use the CRFs that are more accurate, thoughexpensive, but we implement the Layered Approach toimprove overall system performance. The performance ofour proposed system, Layered CRFs, is comparable to thatof the decision trees and the naive Bayes, and our systemhas higher attack detection accuracy.

5 INTEGRATING LAYERED APPROACH WITH

CONDITIONAL RANDOM FIELD

In Section 1, we discussed two main requirements for anintrusion detection system; accuracy of detection andefficiency in operation. As discussed in Sections 3 and 4,respectively, the CRFs can be effective in improving theattack detection accuracy by reducing the number of falsealarms, while the Layered Approach can be implemented toimprove the overall system efficiency. Hence, a naturalchoice is to integrate them to build a single system that isaccurate in detecting attacks and efficient in operation. Giventhe data, we first select four layers corresponding to the fourattack groups (Probe, DoS, R2L, and U2R) and performfeature selection for each layer, which is described next.

5.1 Feature Selection

Ideally, we would like to perform feature selection auto-matically. However, as will be discussed later in Section 8,the methods for automatic feature selection were not foundto be effective. In this section, we describe our approach forselecting features for every layer and why some featureswere chosen over others. In our system, every layer isseparately trained to detect a single type of attack category.We observe that the attack groups are different in theirimpact, and hence, it becomes necessary to treat themdifferently. Hence, we select features for each layer basedupon the type of attacks that the layer is trained to detect.

5.1.1 Probe Layer

The probe attacks are aimed at acquiring information aboutthe target network from a source that is often external to thenetwork. Hence, basic connection level features such as the“duration of connection” and “source bytes” are significantwhile features like “number of files creations” and “numberof files accessed” are not expected to provide informationfor detecting probes.

5.1.2 DoS Layer

The DoS attacks are meant to force the target to stop theservice(s) that is (are) provided by flooding it with

illegitimate requests. Hence, for the DoS layer, trafficfeatures such as the “percentage of connections having samedestination host and same service” and packet level featuressuch as the “source bytes” and “percentage of packets witherrors” are significant. To detect DoS attacks, it may not beimportant to know whether a user is “logged in or not.”

5.1.3 R2L Layer

The R2L attacks are one of the most difficult to detect asthey involve the network level and the host level features.We therefore selected both the network level features suchas the “duration of connection” and “service requested”and the host level features such as the “number of failedlogin attempts” among others for detecting R2L attacks.

5.1.4 U2R Layer

The U2R attacks involve the semantic details that are verydifficult to capture at an early stage. Such attacks are oftencontent based and target an application. Hence, for U2Rattacks, we selected features such as “number of filecreations” and “number of shell prompts invoked,” whilewe ignored features such as “protocol” and “source bytes.”

We used domain knowledge together with the practicalsignificance and the feasibility of each feature beforeselecting it for a particular layer. Thus, from the total41 features, we selected only 5 features for Probe layer,9 features for DoS layer, 14 features for R2L layer, and8 features for U2R layer. Since each layer is independent ofevery other layer, the feature set for the layers is notdisjoint. The selected features for all the four layers arepresented in Appendix A. We then use the CRFs for attackdetection as discussed in Section 3. However, the differenceis that we use only the selected features for each layer ratherthan using all the 41 features. We now give the algorithmfor integrating CRFs with the Layered Approach.

Algorithm

Training

Step 1: Select the number of layers, n, for the complete

system.Step 2: Separately perform features selection for each layer.

Step 3: Train a separate model with CRFs for each layer

using the features selected from Step 2.

Step 4: Plug in the trained models sequentially such that

only the connections labeled as normal are passed

to the next layer.

Testing

Step 5: For each (next) test instance perform Steps 6through 9.

Step 6: Test the instance and label it either as attack or

normal.

Step 7: If the instance is labeled as attack, block it and

identify it as an attack represented by the layer

name at which it is detected and go to Step 5. Else

pass the sequence to the next layer.

Step 8: If the current layer is not the last layer in the system,test the instance and go to Step 7. Else go to Step 9.

Step 9: Test the instance and label it either as normal or as

an attack. If the instance is labeled as an attack,

block it and identify it as an attack corresponding

to the layer name.



Our final goal is to improve both the attack detectionaccuracy and the efficiency of the system. Hence, weintegrate the CRFs and the Layered Approach to build asingle system. We perform detailed experiments and showthat our integrated system has dual advantage. First, asexpected, the efficiency of the system increases signifi-cantly. Second, since we select significant features for eachlayer, the accuracy of the system further increases. This isbecause all the 41 features are not required for detectingattacks belonging to a particular attack group. Using morefeatures than required can result in fitting irregularities inthe data, which has a negative effect on the attack detectionaccuracy of the system.

6 EXPERIMENTS

For our experiments, we use the benchmark KDD ’99intrusion data set [3]. This data set is a version of the original1998 DARPA intrusion detection evaluation program, whichis prepared and managed by the MIT Lincoln Laboratory.The data set contains about five million connection recordsas the training data and about two million connectionrecords as the test data. In our experiments, we use10 percent of the total training data and 10 percent of thetest data (with corrected labels), which are providedseparately. This leads to 494,020 training and 311,029 testinstances. Each record in the data set represents a connectionbetween two IP addresses, starting and ending at some well-defined times with a well-defined protocol. Further, everyrecord is represented by 41 different features. Each recordrepresents a separate connection and is hence considered tobe independent of any other record.

The training data is either labeled as normal or as one ofthe 24 different kinds of attack. These 24 attacks can begrouped into four classes; Probing, DoS, R2L, and U2R.Similarly, the test data is also labeled as either normal or asone of the attacks belonging to the four attack groups. It isimportant to note that the test data is not from the sameprobability distribution as the training data, and it includesspecific attack types not present in the training data. Thismakes the intrusion detection task more realistic [3]. Table 1gives the number of instances for each group of attack in thedata set.

For our experiments with CRFs, we use the CRF toolkit,CRF++ [2]. We use the Weka tool [44] to perform experimentswith the decision trees and the naive Bayes classifier. Wedevelop python and shell scripts for data formatting andimplementing the Layered Approach. For all our experi-ments, we perform hybrid detection, as discussed in Section 1,and use both the normal and the anomalous connections fortraining the model. We perform our experiments on a

desktop running with Intel(R) Core(TM) 2, CPU 2.4 GHz,and 2-Gbyte RAM under exactly the same conditions. We aremainly interested in the test time efficiency and not in thetime required for training of the model as the real-timeperformance of the system depend upon the test efficiencyalone. We note that our system is very efficient during testing.When we considered all the 41 features, the time taken to testall the 250,436 attacks was 57 seconds, which reduced to17 seconds when we performed feature selection andimplemented the Layered Approach. More details will bepresented when we give the detailed results for theexperiments.

For our results, we give the Precision, Recall, and F -Valueand not the accuracy alone as with the given data set, it iseasy to achieve very high accuracy by carefully selecting thesample size. From Table 1, we note that the number ofinstances for the U2R, Probes, and R2L attacks is very low.Hence, if we use accuracy as a measure for testing theperformance of the system, the system can be biased and canattain an accuracy of more than 99 percent for U2R attacks[16]. However, Precision, Recall, and F -Value are notdependent on the size of the training and the test samples.They are defined as follows:

Precision ¼ TP

TPþ FP

Recall ¼ TP

TPþ FN

F -Value ¼ ð1þ �2Þ � Recall � Precision

�2 � ðRecallþ PrecisionÞ ;

where TP, FP, and FN are the number of True Positives,False Positives, and False Negatives, respectively, and �corresponds to the relative importance of precision versusrecall and is usually set to 1.

We divide the training data into different groups;Normal, Probe, DoS, R2L, and U2R. Similarly, we dividethe test data. We perform 10 experiments for each attackclass by randomly selecting data corresponding to thatattack class and normal data only. For example, to detectProbe attacks, we train and test the system with Probeattacks and normal data only. We do not add the DoS, R2L,and U2R data when detecting Probes. Not including theseattacks while training allows the system to better learn thefeatures for Probe attacks and normal events. When such asystem is deployed online, other attacks such as DoS caneither be seen as normal or as Probes. If DoS attacks aredetected as normal, we expect them to be detected as attackat other layers in the system. However, if the DoS attacks aredetected as Probe, it must be considered as an advantagesince the attack is detected at an early stage. Similarly, ifsome Probe attacks are not detected at the Probe layer, theymay be detected at subsequent layers. Hence, for four attackclasses, we have four independent models, which aretrained separately with specific features to detect attacksbelonging to that particular group. For our experiments, wereport the best, the average, and the worst cases.

We represent a single layer, for example the Probe layer,in Fig. 3. Other layers can be constructed similarly.

In Section 6.1, we perform experiments with individuallayers in the system. In Section 6.2, we represent how to


TABLE 1Data Set


implement the system in real scenario and compare ourresults with other systems in Section 6.3. In Section 6.4, wediscuss the significance of our results.

6.1 Building Individual Layers of the System

We perform two sets of experiments. From the firstexperiment, we wish to examine the accuracy of CRFs forintrusion detection. The objective is to see how CRFscompare with other techniques, which are known toperform well. We do not consider feature selection, andthe systems are trained using all the 41 features. From thisexperiment, we observe that CRFs perform much better forU2R attacks while the decision trees achieve higher attackdetection for Probes and R2L. The difference in attackdetection accuracy for DoS is not significant. We note thatthe reason for better performance of decision trees is thatthey perform feature selection. This motivates us to performour second experiment where we perform feature selectionby selecting a small set of features for every attack groupinstead of using all the 41 features. We perform the sameexperiment with decision trees and naive Bayes andcompare the results. We call the integrated models asLayered CRFs, layered decision trees, and layered naiveBayes, respectively. For better comparison and readability,we give the results for both the experiments together.

6.1.1 Detecting Probe Attacks with All 41 Features

We randomly select about 10,000 normal records and all theProbe records from the training data as the training data fordetecting Probe attacks. We then use all the normal andProbe records from the test data for testing. Hence, we have15,000 training instances and 64,759 test instances. Table 2gives the results for the experiments.

In Table 2, the testing time of 14.53 seconds representsthe total time taken to label 64,759 test instances. The resultsshow that the decision trees are more efficient than the CRFsand the naive Bayes. This is because they have a small treestructure, often with very few decision nodes, which is veryefficient. The attack detection accuracy is also higher for thedecision trees as they select the best possible features duringtree construction. However, as we show next, once we

perform feature selection, our system achieves much higheraccuracy and there is significant improvement in efficiency.

6.1.2 Detecting Probe Attacks with Feature Selection

We used the same set of instances for this experiment asused in the previous experiment. However, we performfeature selection for this experiment. Table 3 gives theresults for this experiment.

We observe that the system takes only 2.04 seconds tolabel all the 64,759 test instances. The Layered CRFsperform better and faster than our previous experimentand are the best choice for detecting Probes. We also notethat there is no significant advantage with respect to timefor the layered decision trees as the number of features usedin normal decision trees and in the layered decision trees isapproximately the same, resulting in similar efficiency. Wefurther note that the Recall and hence the F -Value for thelayered naive Bayes decreases drastically. This can beexplained as follows: The classification accuracy with naiveBayes generally improves as the number of featuresincreases. However, if the number of features increases toa very large extent, the estimation tends to becomeunreliable. As a result, when we use all the 41 features,the naive Bayes performs well but when we decrease thenumber of features to five, its classification accuracydecreases. From this experiment, we conclude that theLayered CRFs are a better choice for detecting Probeattacks.

6.1.3 Detecting DoS Attacks with All 41 Features

We randomly select about 20,000 normal records and about4,000 DoS records from the training data as the training datafor detecting DoS attacks. We then use all the normal andDoS records from the test data for testing. Hence, we have24,000 training instances and 290,446 test instances. Table 4gives the results for the experiments.

In Table 4, the testing time of 64.42 seconds representsthe time taken to label all the 290,446 test instances. Theresults show that all the three methods considered havesimilar attack detection accuracy; however, decision treesgive a slight advantage with regard to test time efficiency.

6.1.4 Detecting DoS Attacks with Feature Selection

We used the same data for this experiment as used in theprevious experiment. However, we perform feature selec-tion. Table 5 gives the results.

We observe that the system now takes only 15.17 secondsto label all the 290,446 test instances. The results follow the


Fig. 3. Representation of a single layer (e.g., probe layer).

TABLE 2Normal and Probes (All 41 Features)

TABLE 3Normal and Probes (with Feature Selection)


same trend as in the previous experiment with only a slightimprovement. However, if we consider the testing time wefind that layered decision trees are a better choice. We alsonote that there is slight increase in the detection accuracywhen we perform feature selection, but this increase is notsignificant. The real advantage is seen in the reduced timefor testing, which decreases four folds.

6.1.5 Detecting R2L Attacks with All 41 Features

We randomly select about 1,000 normal records and all theR2L records from the training data as the training data fordetecting R2L attacks. We then use all the normal andR2L records from the test data for testing. Hence, we have2,000 training instances and 76,942 test instances. Table 6 givesthe results.

In Table 6, the testing time of 17.16 seconds representsthe time taken to label all the 76,942 test instances. Weobserve that the decision trees have a higher F -Value, but ifwe look at the number of false alarms, we find that the CRFsperform better and have high Precision compared to thedecision trees and the naive Bayes.

6.1.6 Detecting R2L Attacks with Feature Selection

Table 7 gives the results when we performed feature

selection for detecting R2L attacks.We observe that the time taken to test all the

76,942 instances is only 5.96 seconds. Further, the

Layered CRFs perform much better than the CRFs (an

increase of about 60 percent), layered decision trees (an

increase of about 125 percent), decision trees (an increase

of about 17 percent), layered naive Bayes (an increase of

about 250 percent), and naive Bayes (an increase of

about 250 percent) and are the best choice for detecting

the R2L attacks. The Layered CRFs take slightly more

time, which is acceptable since we achieve much higher

detection accuracy.

6.1.7 Detecting U2R Attacks with All 41 Features

We randomly select about 1,000 normal records and all the

U2R records from the training data as the training data for

detecting the User to Root attacks. We then use all the

normal and U2R records from the test data for testing.

Hence, we have 1,000 training instances and 60,661 test

instances. Table 8 gives the results.In Table 8, the testing time of 13.45 seconds represents

the time taken to label all the 60,661 test instances. In this

experiment, we find that the CRFs are far better than the

other two methods. The F -Value for CRFs is more than

150 percent with respect to the decision trees and more than

600 percent with respect to the naive Bayes. The U2R attacks

are very difficult to detect and most of the present intrusion

detection systems fail to detect such attacks with acceptable

reliability. We find that the CRFs can be used to reliably

detect such attacks.


TABLE 5Normal and DoS (with Feature Selection)

TABLE 6Normal and R2L (All 41 Features)

TABLE 7Normal and R2L (with Feature Selection)

TABLE 4Normal and DoS (All 41 Features)

TABLE 8Normal and U2R (All 41 Features)


6.1.8 Detecting U2R Attacks with Feature Selection

In this experiment, we used exactly the same set of instancesas we used in the previous experiment. We also performfeature selection. Table 9 gives the results for this experiment.

We observe that the system takes only 2.67 seconds to labelall the 60,661 test instances. The Layered CRFs are the bestchoice for detecting the U2R attacks and are far better thanCRFs (an increase of about 8 percent), layered decision trees(an increase of about 30 percent), decision trees (an increaseof about 184 percent), layered naive Bayes (an increase ofabout 38 percent), and naive Bayes (an increase of about675 percent). We observe that the attack detection capabilityalso increases for the decision trees and the naive Bayes.

It is evident from the results that the accuracy of LayeredCRFs is significantly higher for the U2R, R2L, and the Probeattacks. The difference in accuracy is, however, not sig-nificant for the DoS attacks. Further, regardless of themethod considered and particularly for the CRFs, the timerequired for training and testing the system is drasticallyreduced once we perform feature selection. We also notethat the increase in detection accuracy is not significant forthe layered decision trees and the layered naive Bayes forthe DoS group of attack. Their accuracy of detectiondecreases for the Probe and R2L attacks while it increasesfor the U2R attacks. However, we find that in all the cases,the Layered CRFs perform significantly better and can betterlearn a model when we use a small set of specific features fortraining.

6.2 Implementing the System in Real Life

In real scenario, we are not aware of the category of anattack. Rather, we are interested in identifying the attack

category once the system detects an event as anomalous.Layered Approach not only improves the attack detection,but it also helps identify the type of attack once detectedbecause every layer is trained to detect only a particularcategory of attack. Hence, if an attack is detected at the U2Rlayer, it is very likely that the attack is of “U2R” type. Thisenables to perform quick recovery and prevent similarattacks. Fig. 4 gives the real-time system representation.

We integrate the four models (with feature selection) fromSection 6.1 to develop the final system. In this experiment,we use the same data for training the individual models asused in our previous experiments. However, the data in thetest set is relabeled either as normal or as attack and all thedata from the test set is passed though the system startingfrom the first layer. If layer 1 detects any connection as anattack, it is blocked and labeled as “Probe.” Only the eventslabeled as “Normal” are allowed to go to the next layer. Thesame process is repeated at the next layers where an attack isblocked and labeled as “DoS,” “R2L,” or “U2R” at layer 2,layer 3, and layer 4, respectively. We perform all theexperiments 10 times and report their average. We give theresults for this experiment in Tables 10, 11, and 12. Table 10gives the confusion matrix where the values represent thepercent detection with respect to each of the five classes.

From the table, we observe that our system can detectmost of the “Probe” (98.62 percent), “DoS” (97.40 percent),and “U2R” (86.33 percent) attacks while giving very fewfalse alarms at each layer. The system can also detect “R2L”attacks with much higher reliability (29.62 percent) whencompared with the previously reported systems, as we willdiscuss later in Section 6.3. The confusion matrix shows that


TABLE 9Normal and U2R (with Feature Selection)

Fig. 4. Real-time representation of the system.

TABLE 10Confusion Matrix

TABLE 11Attack Detection at Each Layer (Case 1)

TABLE 12Attack Detection at Each Layer (Case 2)


only 71.90 percent of DoS attacks are labeled as DoS duringtesting. However, it is very important to note that theaccuracy for detecting DoS attacks is not 71.90 percent, butit is 25:50þ 71:90þ 0:00þ 0:00 ¼ 97:40 percent. This isbecause 25.50 percent of the DoS attacks are alreadydetected at the first layer, though our system identifiesthem as probes since they are detected at the first layer. Thisis acceptable because in the real environment it is critical todetect an attack as early as possible to minimize its impact.

It is also important to note that most of the “U2R” attacksare detected in the third layer itself and hence labeled as“R2L.” However, if we remove the third layer, the fourthlayer can detect these attacks with similar accuracy. Further,looking at the “R2L” and “U2R” columns in Table 10, it isnatural to think that the two layers can be merged. However,this has two disadvantages. First, merging the two layersresults in increasing the number of features, which reducesefficiency. The merged layer performs poorly with regard tothe total time taken when compared with both theunmerged layers together. Second, when the layers aremerged, the “U2R” attacks are not detected effectively andtheir individual attack detection accuracy decreases. This isbecause the number of “U2R” instances is very low in thetraining data and the system simply learns the features thatare specific to the “R2L” attacks. Hence, we prefer separatelayers for the two attack groups. Using our approach, we canhope that any attack, even though its category is unknown,can be detected at any one of the layers in the system. Wecan also increase or decrease the number of layers depend-ing upon the environment where the system is deployed.

We evaluate the performance of each layer in thesystem in Table 11. From the table, we observe that out ofall the 250,436 attack instances in the test data set, morethan 25 percent of the attacks are blocked at layer 1, andmore than 90 percent of all the attacks have been blockedby the end of layer 2. Thus, the Layered Approach is veryeffective in reducing the attack traffic at each layer in thesystem. The configuration takes 21 seconds to classify allthe 250,436 attacks.

We can do further optimization by putting the DoS layerbefore the Probe layer. We can do this because the data isrelational and each layer in our system is independent.Putting the DoS layer before the Probe layer serves dualadvantage as most of the attacks are detected at the firstlayer itself and the overall system performs efficiently. Thisoptimization becomes significant in severe attack situationswhen the target is overwhelmed with illegitimate connec-tions. The results are presented in Table 12.

We observe that the Layered Approach can be veryeffective in restricting the attack traffic to the initial layers inthe system. We also performed experiments when we donot implement the Layered Approach, i.e., we consider onlya single system that is trained with two classes (normal andattack). In this system, all the Probes, DoS, R2L, and U2Rattacks are labeled as “attack.” We perform experimentsboth with and without feature selection. For featureselection, we consider 21 features, which are selected byapplying the union operation on the feature sets of all thefour attack types. We compare these results with theLayered Approach in Table 13.

We observe that a system implementing the LayeredApproach with feature selection is more efficient and more

accurate in detecting attacks particularly the U2R, the R2L,and the Probes.

It is important to note that the time should be read inrelative terms rather than absolute, as for ease of experi-ments we used scripts for implementation. In real environ-ment, high speed can be achieved by implementing thecomplete system in languages with efficient compilers suchas the “C Language.” Further, pipelining can be implemen-ted in multicore processors, where each core may represent asingle layer, and due to pipelining, multiple I/O operationscan be replaced by a single I/O operation providing veryhigh speed of operation.

6.3 Comparison of Results

In this section, we compare our work with other well-knownmethods based on the anomaly intrusion detection principle.The anomaly-based systems primarily detect deviationsfrom the learnt normal data by using statistical methods,machine learning, or data mining approaches. Standardtechniques such as the decision trees and naive Bayes areknown to perform well. However, our experiments showthat the Layered CRFs perform far better than thesetechniques. The main reason for this is that the CRFs do notconsider the observation features to be independent. In [38],the authors present a comparative study of various classifierswhen applied to the KDD ’99 data set, and in [13], the authorspropose the use of Principle Component Analysis (PCA)before applying a machine learning algorithm. Use ofsupport vector machines is discussed in [26]. We compareour results from the results presented in these papers inTable 14. The table represents the Probability of Detection(PD) and False Alarm Rate (FAR) in percent for variousmethods including the KDD ’99 cup winners.

From the table, we observe that the Layered CRFsperform significantly better than the previously reportedresults including the winner of the KDD ’99 cup andvarious other methods applied to this data set. The mostimpressive part of the Layered CRFs is the margin of improve-ment as compared with other methods. Layered CRFs have veryhigh attack detection of 98.6 percent for Probes (5.8 percentimprovement) and 97.40 percent detection for DoS. Theyoutperform by a significant percentage for the R2L (34.5 percentimprovement) and the U2R (34.8 percent improvement) attacks.

6.4 Discussion and Issues

From our experiments and the comparison in Table 14, weconclude that the Layered CRFs can be very effective indetecting the Probe, the U2R, and the R2L attacks as well as


TABLE 13Layered versus Nonlayered Approach


the DoS attacks. However, if we consider all the 41 featuresgiven in the data set, we find that the time required to trainand test the model is high. To address this, we performedexperiments with our integrated system by implementing afour-layer system. The four layers correspond to Probe,DoS, R2L, and U2R. For each layer, we then selected a set offeatures that is sufficient to detect attacks at that particularlayer. Feature selection for each layer enhances theperformance of the entire system. The runtime (testing)performance of our model is comparable with othermethods; however, the time required to train the model isslightly higher. We also observe that feature selection notonly decreases the time required to test an instance, but italso increases the accuracy of attack detection. This isbecause using more features than required can generatesuperfluous rules often resulting in fitting irregularities inthe data, which can misguide classification. From ourexperimental results, we conclude that the main strengthof our method lies in detecting the R2L and the U2R attacks,which are not satisfactorily detected by other methods. Ourmethod gives slight improvement for detecting Probeattacks and was similar in accuracy when compared withother methods for detecting the DoS attacks.

The prime reason for better detection accuracy for theCRFs is that they do not consider the observation features tobe independent. CRFs evaluate all the rules together, whichare applicable for a given observation. This results incapturing the correlation among different features of theobservation resulting in higher accuracy. Considering boththe accuracy and the time required for testing, our system

scores better. Our integrated system also has the advantagethat any method can be used in the layers of the system. Thisgives flexibility to the user to decide between the time andaccuracy trade-off. Furthermore, we can increase or decreasethe number of layers in the system depending upon the taskrequirement. Finally, our system can be used for performinganalysis on attacks because the attack category can beinferred from the layer at which the attack is detected.

To determine the statistical significance of our results, wecompare our proposed method (Layered CRFs) with othersfor detecting Probes, DoS, R2L, and U2R attacks. We use theWilcoxon sum rank test with 95 percent confidence intervalto discriminate the performance of these methods. Table 15gives the ranking for various methods compared, where asystem with rank 1 is the best.

The results of the Wilcoxon test indicate that the LayeredCRFs are much better (or equal) for detecting attacks. Thus,we conclude that the Layered CRFs are a strong candidatefor building robust and efficient intrusion detection systems.

7 EFFECT OF NOISE

Ideally, we would like to perform similar experiments witha large number of data sets. However, given the domain ofthe problem, there are no other data sets that are freelyavailable, which can be used for our experiments. Toameliorate this problem to some extent, we add substantialamount of noise in the training data and perform similarexperiments to study the robustness of these systems. Byexperimenting with noisy data, we want to determine thesensitivity of the proposed scheme with respect to noise. Ifthe system performs poorly with noisy data, the resultscould be an artifact of the data set.

7.1 Addition of Noise to Data

Addition of noise was controlled by two parameters, theprobability of adding noise to a feature, p, and the scalingfactor, s, for a feature. We performed four sets of experi-ments with noisy data, separately, one for each layer. Foreach layer, we varied the parameter p between 0 and 0.95 (bykeeping it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90, and 0.95)and varied the parameter s between �1,000 and þ1,000. Incase when the original feature was “0,” noise was added toany feature by using an additive function (random valuebetween �1,000 and þ1,000) instead of scaling. Figs. 5, 6, 7,and 8 represent the effect of noise on each layer separately.

We find that our integrated system is robust to noise inthe training data and performs better than other methodsfor all of the four attack groups.


TABLE 14Comparison of Results

TABLE 15Ranking for the Six Methods


8 AUTOMATIC FEATURE SELECTION

From our experiments in the previous sections, we showedthe advantages of performing feature selection and im-plementing the Layered Approach for attack detection. Weperformed our experiments by manually selecting featuresfor different layers. However, we want to compare theresults of manual feature selection with the results ofautomatic feature selection (and no feature selection) for allthe layers. Hence, we investigated various methods forautomatic feature selection.

We experimented with a feed forward neural network todetermine the weights for all the 41 features. Features withweights close to zero were discarded. As a result, only asmall set of features was selected for each layer. However,when we performed the experiments on the reduced set offeatures and compared the results, there was no significantimprovement in the detection accuracy, though there wasreduction in training and testing time.

We then used PCA for dimensionality reduction [13].However, the main drawback of using PCA in our task isthat PCA transforms a large number of possibly correlatedfeatures into a small number of uncorrelated features knownas the principle components. Hence, when we applied PCAfollowed by CRFs in the newly transformed feature space,the method did not provide significant advantage as thestrength of our approach is to model correlation amongfeatures and the features in the new space are independent.

We also note that, to construct a decision tree, theC4.5 algorithm performs feature selection. We selected thesame features as selected by the C4.5 algorithm and thenperformed experiments with only those features. However,there was no significant improvement in the results.

We also performed experiments with the methodproposed in [33] for efficiently inducing features for aCRF. The method is based upon iteratively constructingfeature conjunctions that would significantly increase theconditional log-likelihood if added to the model. We usedthe Mallet tool [35] for performing these experiments andcompare the results with our previous results based onLayered CRFs with manual feature selection in Table 16. Weobserved that both the systems, with automatic and manualfeature selections, had similar test time performance, butthe accuracy of detection when features were inducedautomatically was significantly lower than our systembased upon manual feature selection.

It was not surprising that manual feature selectionperformed better than automatic feature selection. How-ever, we note that automatic feature selection for LayeredCRFs performed better than the decision trees, particularlyfor the R2L and the U2R attacks. This suggests that Layered


Fig. 5. Effect of noise on Probe layer.

Fig. 6. Effect of noise on DoS layer.

Fig. 7. Effect of noise on R2L layer.

Fig. 8. Effect of noise on U2R layer.

TABLE 16Feature Selection


Approach using CRFs with automated feature selection is afeasible scheme for building reliable intrusion detectionsystems.

9 CONCLUSIONS

In this paper, we have addressed the dual problem ofAccuracy and Efficiency for building robust and efficientintrusion detection systems. Our experimental results inSection 6 show that CRFs are very effective in improvingthe attack detection rate and decreasing the FAR. Having alow FAR is very important for any intrusion detectionsystem. Further, feature selection and implementing theLayered Approach significantly reduce the time required totrain and test the model. Even though we used a relationaldata set for our experiments, we showed that the sequencelabeling methods such as the CRFs can be very effective indetecting attacks and they outperform other methods thatare known to work well with the relational data. Wecompared our approach with some well-known methodsand found that most of the present methods for intrusiondetection fail to reliably detect R2L and U2R attacks, whileour integrated system can effectively and efficiently detectsuch attacks giving an improvement of 34.5 percent for theR2L and 34.8 percent for the U2R attacks. We also discussedhow our system is implemented in real life. Our system canhelp in identifying an attack once it is detected at aparticular layer, which expedites the intrusion responsemechanism, thus minimizing the impact of an attack. Weshowed that our system is robust to noise and performsbetter than any other compared system even when thetraining data is noisy. Finally, our system has the advantagethat the number of layers can be increased or decreaseddepending upon the environment in which the system isdeployed, giving flexibility to the network administrators.

The areas for future research include the use of ourmethod for extracting features that can aid in the develop-ment of signatures for signature-based systems. Thesignature-based systems can be deployed at the peripheryof a network to filter out attacks that are frequent andpreviously known, leaving the detection of new unknownattacks for anomaly and hybrid systems. Sequence analysismethods such as the CRFs when applied to relational datagive us the opportunity to employ the Layered Approach,as shown in this paper. This can further be extended toimplement pipelining of layers in multicore processors,which is likely to result in very high performance.

APPENDIX A

FEATURE SELECTION

A.1 Features Selected for Probe Layer

A.2 Features Selected for DoS Layer

A.3 Features Selected for R2L Layer

A.4 Features Selected for U2R Layer

ACKNOWLEDGMENTS

The authors sincerely thank the anonymous reviewers

whose comments have greatly helped clarify and improve

this paper.

REFERENCES

[1] Autonomous Agents for Intrusion Detection, http://www.cerias.purdue.edu/research/aafid/, 2010.

[2] CRF++: Yet Another CRF Toolkit, http://crfpp.sourceforge.net/,2010.

[3] KDD Cup 1999 Intrusion Detection Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2010.

[4] Overview of Attack Trends, http://www.cert.org/archive/pdf/attack_trends.pdf, 2002.

[5] Probabilistic Agent Based Intrusion Detection, http://www.cse.sc.edu/research/isl/agentIDS.shtml, 2010.



[6] SANS Institute—Intrusion Detection FAQ, http://www.sans.org/resources/idfaq/, 2010.

[7] T. Abraham, IDDM: Intrusion Detection Using Data MiningTechniques, http://www.dsto.defence./gov.au/publications/2345/DSTO-GD-0286.pdf, 2008.

[8] R. Agrawal, T. Imielinski, and A. Swami, “Mining AssociationRules between Sets of Items in Large Databases,” Proc. ACMSIGMOD, vol. 22, no. 2, pp. 207-216, 1993.

[9] N.B. Amor, S. Benferhat, and Z. Elouedi, “Naive Bayes vs.Decision Trees in Intrusion Detection Systems,” Proc. ACM Symp.Applied Computing (SAC ’04), pp. 420-424, 2004.

[10] J.P. Anderson, Computer Security Threat Monitoring and Surveillance,http://csrc.nist.gov/publications/history/ande80.pdf, 2010.

[11] R. Bace and P. Mell, Intrusion Detection Systems, ComputerSecurity Division, Information Technology Laboratory, Nat’l Inst.of Standards and Technology, 2001.

[12] D. Boughaci, H. Drias, A. Bendib, Y. Bouznit, and B. Benhamou,“Distributed Intrusion Detection Framework Based on MobileAgents,” Proc. Int’l Conf. Dependability of Computer Systems(DepCoS-RELCOMEX ’06), pp. 248-255, 2006.

[13] Y. Bouzida and S. Gombault, “Eigenconnections to IntrusionDetection,” Security and Protection in Information Processing Systems,pp. 241-258, 2004.

[14] H. Debar, M. Becke, and D. Siboni, “A Neural NetworkComponent for an Intrusion Detection System,” Proc. IEEE Symp.Research in Security and Privacy (RSP ’92), pp. 240-250, 1992.

[15] T.G. Dietterich, “Machine Learning for Sequential Data: AReview,” Proc. Joint IAPR Int’l Workshop Structural, Syntactic,and Statistical Pattern Recognition (SSPR/SPR ’02), LNCS 2396,pp. 15-30, 2002.

[16] P. Dokas, L. Ertoz, A. Lazarevic, J. Srivastava, and P.-N. Tan, “DataMining for Network Intrusion Detection,” Proc. NSF Workshop NextGeneration Data Mining (NGDM ’02), pp. 21-30, 2002.

[17] Y. Du, H. Wang, and Y. Pang, “A Hidden Markov Models-BasedAnomaly Intrusion Detection Method,” Proc. Fifth World Congresson Intelligent Control and Automation (WCICA ’04), vol. 5,pp. 4348-4351, 2004.

[18] S. Dzeroski and B. Zenko, “Is Combining Classifiers Better thanSelecting the Best One,” Proc. 19th Int’l Conf. Machine Learning(ICML ’02), pp. 123-129, 2002.

[19] L. Ertoz, A. Lazarevic, E. Eilertson, P.-N. Tan, P. Dokas, V. Kumar,and J. Srivastava, “Protecting against Cyber Threats in NetworkedInformation Systems,” Proc. SPIE Battlespace Digitization andNetwork Centric Systems III, pp. 51-56, 2003.

[20] S. Forrest, S.A. Hofmeyr, A. Somayaji, and T.A. Longstaff,“A Sense of Self for Unix Processes,” Proc. IEEE Symp.Research in Security and Privacy (RSP ’96), pp. 120-128, 1996.

[21] Y. Gu, A. McCallum, and D. Towsley, “Detecting Anomalies inNetwork Traffic Using Maximum Entropy Estimation,” Proc.Internet Measurement Conf. (IMC ’05), pp. 345-350, USENIX Assoc.,2005.

[22] K.K. Gupta, B. Nath, and R. Kotagiri, “Network Security Frame-work,” Int’l J. Computer Science and Network Security, vol. 6, no. 7B,pp. 151-157, 2006.

[23] K.K. Gupta, B. Nath, and R. Kotagiri, “Conditional Random Fieldsfor Intrusion Detection,” Proc. 21st Int’l Conf. Advanced InformationNetworking and Applications Workshops (AINAW ’07), pp. 203-208,2007.

[24] K.K. Gupta, B. Nath, R. Kotagiri, and A. Kazi, “AttackingConfidentiality: An Agent Based Approach,” Proc. IEEE Int’l Conf.Intelligence and Security Informatics (ISI ’06), vol. 3975, pp. 285-296,2006.

[25] C. Ji and S. Ma, “Combinations of Weak Classifiers,” IEEE Trans.Neural Networks, vol. 8, no. 1, pp. 32-42, 1997.

[26] D.S. Kim and J.S. Park, “Network-Based Intrusion Detection withSupport Vector Machines,” Proc. Information Networking, Network-ing Technologies for Enhanced Internet Services Int’l Conf. (ICOIN ’03),pp. 747-756, 2003.

[27] D. Klein and C.D. Manning, “Conditional Structure versusConditional Estimation in NLP Models,” Proc. ACL Conf.Empirical Methods in Natural Language Processing (EMNLP ’02),vol. 10, pp. 9-16, Assoc. for Computational Linguistics, 2002.

[28] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur, “BayesianEvent Classification for Intrusion Detection,” Proc. 19th Ann.Computer Security Applications Conf. (ACSAC ’03), pp. 14-23, 2003.

[29] J. Lafferty, A. McCallum, and F. Pereira, “Conditional RandomFields: Probabilistic Models for Segmenting and LabelingSequence Data,” Proc. 18th Int’l Conf. Machine Learning(ICML ’01), pp. 282-289, 2001.

[30] W. Lee and S. Stolfo, “Data Mining Approaches for IntrusionDetection,” Proc. Seventh USENIX Security Symp. (Security ’98),pp. 79-94, 1998.

[31] W. Lee, S. Stolfo, and K. Mok, “Mining Audit Data to BuildIntrusion Detection Models,” Proc. Fourth Int’l Conf. KnowledgeDiscovery and Data Mining (KDD ’98), pp. 66-72, 1998.

[32] W. Lee, S. Stolfo, and K. Mok, “A Data Mining Framework forBuilding Intrusion Detection Model,” Proc. IEEE Symp. Securityand Privacy (SP ’99), pp. 120-132, 1999.

[33] A. McCallum, “Efficiently Inducing Features of ConditionalRandom Fields,” Proc. 19th Ann. Conf. Uncertainty in ArtificialIntelligence (UAI ’03), pp. 403-410, 2003.

[34] A. McCallum, D. Freitag, and F. Pereira, “Maximum EntropyMarkov Models for Information Extraction and Segmentation,”Proc. 17th Int’l Conf. Machine Learning (ICML ’00), pp. 591-598,2000.

[35] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit,http://mallet.cs.umass.edu, 2010.

[36] L. Portnoy, E. Eskin, and S. Stolfo, “Intrusion Detection withUnlabeled Data Using Clustering,” Proc. ACM Workshop DataMining Applied to Security (DMSA), 2001.

[37] A. Ratnaparkhi, “A Maximum Entropy Model for Part-of-SpeechTagging,” Proc. Conf. Empirical Methods in Natural LanguageProcessing (EMNLP ’96), pp. 133-142, Assoc. for ComputationalLinguistics, 1996.

[38] M. Sabhnani and G. Serpen, “Application of Machine LearningAlgorithms to KDD Intrusion Detection Dataset within MisuseDetection Context,” Proc. Int’l Conf. Machine Learning, Models,Technologies and Applications (MLMTA ’03), pp. 209-215, 2003.

[39] H. Shah, J. Undercoffer, and A. Joshi, “Fuzzy Clustering forIntrusion Detection,” Proc. 12th IEEE Int’l Conf. Fuzzy Systems(FUZZ-IEEE ’03), vol. 2, pp. 1274-1278, 2003.

[40] C. Sutton and A. McCallum, “An Introduction to ConditionalRandom Fields for Relational Learning,” Introduction to StatisticalRelational Learning, 2006.

[41] E. Tombini, H. Debar, L. Me, and M. Ducasse, “A SerialCombination of Anomaly and Misuse IDSes Applied to HTTPTraffic,” Proc. 20th Ann. Computer Security Applications Conf.(ACSAC ’04), pp. 428-437, 2004.

[42] W. Wang, X.H. Guan, and X.L. Zhang, “Modeling ProgramBehaviors by Hidden Markov Models for Intrusion Detection,”Proc. Int’l Conf. Machine Learning and Cybernetics (ICMLC ’04),vol. 5, pp. 2830-2835, 2004.

[43] C. Warrender, S. Forrest, and B. Pearlmutter, “Detecting Intru-sions Using System Calls: Alternative Data Models,” Proc. IEEESymp. Security and Privacy (SP ’99), pp. 133-145, 1999.

[44] I.H. Witten and E. Frank, Data Mining: Practical Machine LearningTools and Techniques. Morgan Kaufmann, 2005.

[45] Y.-S. Wu, B. Foo, Y. Mei, and S. Bagchi, “CollaborativeIntrusion Detection System (CIDS): A Framework for Accurateand Efficient IDS,” Proc. 19th Ann. Computer Security ApplicationsConf. (ACSAC ’03), pp. 234-244, 2003.

[46] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, and J. Ucles,“HIDE: A Hierarchical Network Intrusion Detection System UsingStatistical Preprocessing and Neural Network Classification,”Proc. IEEE Workshop Information Assurance and Security (IAW ’01),pp. 85-90, 2001.

Kapil Kumar Gupta received the BTech degreein computer science and engineering from theGuru Gobind Singh Indraprastha (GGSIP) Uni-versity, Delhi, India, in 2004. He worked for ayear at HCL Technologies, Noida, India. He iscurrently a PhD student in the Department ofComputer Science and Software Engineering,The University of Melbourne, Parkville, Australia.His research interests include intrusion detec-tion, network security, data security and data

privacy, machine learning, data mining, and artificial intelligence.



Baikunth Nath received the MA degree fromPunjab University, Chandigarh, India and thePhD degree from the University of Queensland,Brisbane, Australia. He was with Monash Uni-versity for more than 25 years in various seniorpositions including the director of research in theGippsland School of IT. In 2001, he joined theDepartment of Computer Science and SoftwareEngineering, The University of Melbourne, Park-ville, Australia, as an associate professor and

the director of postgraduate studies. His research interests includeimage processing, intrusion detection, scheduling, optimization, datamining, evolutionary computing, neural networks, financial forecasting,and operations research. He is the author of numerous researchpublications in various well-reputed international journals and con-ference proceedings. He is a senior member of the IEEE.

Ramamohanarao (Rao) Kotagiri received theBE degree from Andhra University, the MEdegree from the Indian Institute of Science(IISc), Bangalore, India, and the PhD degreefrom Monash University. He was awarded theAlexander von Humboldt Fellowship in 1983. Hejoined the University of Melbourne in 1980 andwas appointed as a professor in computerscience in 1989. He has held several seniorpositions including head of Computer Science

and Software Engineering, head of the School of Electrical Engineeringand Computer Science, deputy director of the Centre for UltraBroadband Information Networks, codirector of the Key Centre forKnowledge-Based Systems, and research director for the CooperativeResearch Centre for Intelligent Decision Systems, The University ofMelbourne. He served as a member of the ARC Information TechnologyPanel. He also served on the Prime Minister’s Science, Engineering andInnovation Council Working Party on Data for Scientists. He is currentlythe associate dean for research in the Faculty of Engineering, TheUniversity of Melbourne. He is on the editorial boards of the UniversalComputer Science, Journal of Knowledge and Information Systems,IEEE Transactions on Knowledge and Data Engineering (TKDE),Journal of Statistical Analysis and Data Mining, and Very Large DataBases (VLDB) Journal. He served as a program committee member ofnumerous international conferences including the International Con-ference on Management of Data (SIGMOD), International Conferenceon Very Large Data Bases (VLDB), International Conference on LogicProgramming (ICLP), and International Conference on Data Engineering(ICDE). He is a steering committee member of the IEEE InternationalConference on Data Mining (ICDM), Pacific-Asia Conference onKnowledge Discovery and Data Mining (PAKDD), and InternationalConference on Database Systems for Advanced Applications(DASFAA). He was the program cochair for VLDB, PAKDD, DASFAA,and the International Conference on Deductive and Object-OrientedDatabases (DOOD). His research interests include database systems,logic-based systems, agent-oriented systems, information retrieval, datamining, intrusion detection, and machine learning. He has publishedwidely in conference proceedings and international journals. He is afellow of the Institute of Engineers Australia, Australian Academy ofTechnological Sciences and Engineering, and Australian Academy ofScience. He is a member of the IEEE.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.