Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm


Int. J. Inf. Secur. (2009) 8:25–35
DOI 10.1007/s10207-008-0061-2

REGULAR CONTRIBUTION

Large-scale network intrusion detection based on distributed learning algorithm

Daxin Tian · Yanheng Liu · Yang Xiang

Published online: 14 November 2008
© Springer-Verlag 2008

Abstract As network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks just by increasing the speed of processors. Moreover, increasingly complex intrusion detection methods only add further to the pressure on network intrusion detection system (NIDS) platforms, so the continuously increasing speed and throughput of networks pose new challenges to NIDS. To make NIDS usable in Gigabit Ethernet, the ideal policy is to use a load balancer to split the traffic data and forward it to different detection sensors, which can analyze the slices in parallel. To ensure that each slice contains all the evidence necessary to detect a specific attack, the load balancer design must be complicated, and it becomes a new bottleneck of the NIDS. To simplify the load balancer, this paper puts forward a distributed neural network learning algorithm (DNNL). Using DNNL, a large data set can be split randomly and each slice of data is presented to an independent neural network; these networks can be trained in distribution, each one in parallel. Completeness analysis shows that DNNL's learning algorithm is equivalent to training one neural network that uses the technique of regularization. Experiments to check the completeness and efficiency of DNNL are performed on the KDD99 data set, which is a standard intrusion detection benchmark. Compared with other approaches on the same benchmark, DNNL achieves a high detection rate and a low false alarm rate.

D. Tian · Y. Liu (corresponding author)
College of Computer Science and Technology,
Jilin University, 130012 Changchun, China
e-mail: [email protected]

Y. Xiang
School of Management and Information Systems,
Central Queensland University,
Rockhampton, QLD 4702, Australia

Keywords Intrusion detection system · Distributed learning · Neural network · Network behavior

    1 Introduction

With the widespread use of networked computers for critical systems, computer security is attracting increasing attention, and intrusions have become a significant threat in recent years. As a second line of defense for computer and network systems, intrusion detection systems (IDS) have been deployed more and more widely along with network security techniques such as firewalls. Intrusion detection techniques can be classified into two categories: misuse detection and anomaly detection. Misuse detection looks for signatures of known attacks, and any matched activity is considered an attack; anomaly detection models a user's behaviors, and any significant deviation from the normal behaviors is considered the result of an attack. The main shortcoming of IDS is false alarms, which are caused by misinterpreting normal packets as an attack or misclassifying an intrusion as normal behavior. This problem is more severe under fast Ethernet, with the result that network IDS (NIDS) cannot be adapted to protect the backbone network. Since network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks just by increasing the speed of processors.

To resolve this problem and make NIDS usable in Gigabit Ethernet, one approach is to improve detection speed by moving the matching away from the processor and onto an FPGA [1–4], by using high-performance string matching algorithms [5–7], and by reducing the dimensionality of the data, thereby minimizing computational time [8,9]. Another approach is to use distributed and parallel detection methods; this is the best way to make NIDS keep up with the speed of networks. The main idea of distributed NIDS is splitting the traffic data and forwarding it to detection sensors, so that these sensors can analyze the data in parallel.


Paper [10] presents an approach that allows for meaningful slicing of the network traffic into portions of manageable size; however, it uses a simple round-robin algorithm for load balancing. The splitting algorithm of [11] ensures that a single slice contains all the evidence necessary to detect a specific attack, making sensor-to-sensor interaction unnecessary. Although the algorithm can dynamically balance the sensors' loads by choosing the sensor with the lightest load to process a new connection's packets, it may still lead to a sensor losing packets if the traffic of one connection is heavy. Paper [12] designs a flow-based dynamic load-balancing algorithm, which divides the data stream based on the current value of each analyzer's load function; the incoming packets that belong to a new session are forwarded to the analyzer that currently has the least load. Paper [13] presents an active splitter architecture and three methods for improving performance: the first is early filtering/forwarding, where a fraction of the packets is processed on the splitter instead of the sensors; the second is the use of locality buffering, where the splitter reorders packets in a way that improves memory access locality on the sensors; the third is the use of cumulative acknowledgments, a method that optimizes the coordination between the traffic splitter and the sensors. The load balancer of SPANIDS [14] employs multiple levels of hashing and incorporates feedback from the sensor nodes to distribute network traffic over the sensors without overloading any of them. Although the methods of [12–14] reduce the load on the sensors, they complicate the splitting algorithm and make the splitter become the bottleneck of the system.

The traffic splitter is the key component of a distributed intrusion detection system. An ideal splitting algorithm should satisfy three requirements: (1) the algorithm divides the whole traffic into slices of equal size; (2) each slice contains all the evidence necessary to detect a specific attack; (3) the algorithm is simple and efficient [11]. From the above analysis we can see that the primary goal of a NIDS load balancer is to distribute network packets across a set of sensor hosts, thus reducing the load on each sensor to a level that the sensor can handle without dropping packets. However, the connection-oriented characteristic of network traffic makes the load balancer of a NIDS different from those in other environments such as web servers, distributed systems, or clusters. In order to satisfy requirement (2), existing distributed intrusion detection systems concentrate their effort on the load balancer and thus cannot satisfy requirements (1) and (3). In this paper a distributed neural network learning algorithm (DNNL) is presented which can be used in a distributed anomaly detection system. The idea of DNNL differs from that of common distributed intrusion detection systems: while the usual methods try to satisfy requirement (2) by weakening requirements (1) and (3), DNNL takes the opposite approach, first satisfying requirements (1) and (3) and then satisfying requirement (2) through the learning algorithm.

Two important characteristics of neural networks are that they are distributed, in that knowledge representation is spread across many processing units, and parallel, in that computations take place in parallel across these distributed representations. Although each neural network can run in parallel, a group of neural networks cannot ordinarily run in distribution to cope with one problem cooperatively, because the learning algorithm requires all the training data to be submitted to the network one by one until the network is stable after one or more epochs. This requirement becomes untenable when the amount of data exceeds the size of main memory, which is obviously possible for any realistic database, such as astronomy data [15], biomedical data [16], bioinformatics data [17], etc. DNNL is not only a parallel but also a distributed learning algorithm, which uses independent neural networks to process parts of the training data. These independent neural networks can run in distribution, and each one processes in parallel; thus DNNL can not only take advantage of the neural network's parallel character but also overcome the drawback of concentrated training. DNNL can also be used in mobile agents [18], distributed data mining [19], distributed monitoring [20] and ensemble systems [21].

The rest of this paper is organized as follows. Section 2 describes the main idea of DNNL and details the basic learning algorithm. Section 3 presents a metric embedding method and a dissimilarity measure algorithm to make DNNL suit data that contains categorical and numerical features. The experimental results on the KDDCUP99 data set are given in Sect. 4, and conclusions are drawn in Sect. 5.

    2 DNNL

    2.1 The process of DNNL

The main process of DNNL is as follows: first, split the large sample data into small subsets and forward these slices to distributed sensors; second, train each sensor's neural network on its sliced data in parallel until all of them are stable; third, rebuild a new training data set from the training results of each neural network (the amount of new training data is much less than the total amount of all the sliced data); last, carry out a concentrated learning on the new training data. The process is shown in Fig. 1.

Fig. 1 The process of DNNL

DNNL therefore involves two phases of learning. In the first phase (distributed learning), the large data set is split randomly and sent to independent neural networks; all of these networks learn the knowledge of their own slice in distribution, and each one in parallel. In the second phase (concentrated learning), the training data is built from the training results of the distributed neural networks. Since the new data set is much smaller than the original training data, it can be learned by one neural network in finite time and memory.
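As a high-level illustration, the two-phase process could be organized as in the following sketch; the helper functions (train_slice, sample_from_weights, train_concentrated) are hypothetical placeholders for the operations detailed in the rest of this section.

#include <stddef.h>

typedef struct { double *w; size_t rows, dim; } WeightMatrix;

/* Hypothetical helpers: per-sensor training, rebuilding of the small data set
   by sampling around each weight row (Sect. 2.2), and concentrated training. */
WeightMatrix train_slice(const double *slice, size_t n_samples, size_t dim);
size_t sample_from_weights(const WeightMatrix *w, double *out, size_t per_row);
WeightMatrix train_concentrated(const double *data, size_t n_samples, size_t dim);

WeightMatrix dnnl(const double **slices, const size_t *slice_sizes,
                  size_t n_slices, size_t dim, double *scratch, size_t per_row)
{
    size_t total = 0;
    for (size_t i = 0; i < n_slices; i++) {
        /* Phase 1: each slice is learned by an independent network;
           in a real deployment these calls run on different sensors in parallel. */
        WeightMatrix wi = train_slice(slices[i], slice_sizes[i], dim);
        /* Rebuild a much smaller training set from the stable weights. */
        total += sample_from_weights(&wi, scratch + total * dim, per_row);
    }
    /* Phase 2: concentrated learning on the rebuilt data set. */
    return train_concentrated(scratch, total, dim);
}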

2.2 Analysis of DNNL's completeness

The key issue in DNNL is how to build the new data so as to ensure that the training is complete, that is, that the result is equal to training on the whole data with one neural network. We first present the method for building the new data and then analyze the completeness of DNNL.

A stable neural network maintains the knowledge learned from the sample data in its weight matrix W (m × n), where m is the number of neurons and n is the dimension of each neuron. In DNNL, the dimension of each neuron of a distributed neural network is equal to the dimension of the sample vector x (1 × n). After a distributed neural network is stable, each row of W can be regarded as one clustering center of the sliced data. The new data are generated from a Gaussian distribution around each point (row) of W. For example, suppose the whole original data set X has p·q samples, and X is split into p slices, each slice ${}^{(i)}X$ (i = 1, ..., p) containing q samples. After the i-th neural network has been trained on the i-th slice ${}^{(i)}X$ and is stable, its weight matrix ${}^{(i)}W$ is composed of r rows (r ≪ q). The i-th slice of the new data set, ${}^{(i)}X'$, is generated from the Gaussian distribution around each point of ${}^{(i)}W$; after generation, ${}^{(i)}X'$ contains t (r ≤ t ≪ q) samples. Since t is also much less than q, the whole new data set X' is much smaller than the whole original data set X. From the above discussion we can see that, after training, each row of the i-th neural network's weight matrix ${}^{(i)}W$ represents some samples of ${}^{(i)}X$, namely those whose distance to the corresponding row is lower than a threshold value. So we can write

$${}^{(i)}W_j = {}^{(i)}X_k + A_k \quad (1)$$

where k indexes the samples whose clustering center is the j-th row of ${}^{(i)}W$, and $A_k$ is the vector representing the difference between ${}^{(i)}W_j$ and ${}^{(i)}X_k$. The i-th slice of the new data set ${}^{(i)}X'$ is generated from the i-th neural network as

$${}^{(i)}X'_l = {}^{(i)}W_j + B_m \quad (2)$$

where $B_m$ is a random vector whose components are generated from a Gaussian distribution. Substituting Eq. (1) into Eq. (2) gives

$${}^{(i)}X'_l = {}^{(i)}X_k + A_k + B_m \quad (3)$$
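As an illustration (not the authors' code), the new data of Eq. (2) could be generated by adding Gaussian noise to each weight row; the sketch below uses the Box–Muller transform, and the noise standard deviation sigma is an assumed parameter.

#include <math.h>
#include <stdlib.h>
#include <stddef.h>

/* One zero-mean Gaussian sample with standard deviation sigma (Box-Muller). */
static double gauss(double sigma)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sigma * sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
}

/* Eq. (2): for every row W_j generate `per_row` samples x' = W_j + B,
   with B drawn component-wise from N(0, sigma^2).
   `out` must hold rows * per_row * dim doubles. */
void generate_new_data(const double *W, size_t rows, size_t dim,
                       size_t per_row, double sigma, double *out)
{
    for (size_t j = 0; j < rows; j++)
        for (size_t s = 0; s < per_row; s++)
            for (size_t d = 0; d < dim; d++)
                out[(j * per_row + s) * dim + d] = W[j * dim + d] + gauss(sigma);
}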

The neural network can be represented by a function f(W, x). After training, given an unlabelled input x, the output of f(·) is close or equal to the true result y. For a feedforward network, training is a process of finding W:

$$W = \arg\min_{W} \sum_{i=1}^{m}\sum_{j=1}^{n} \bigl\| y_{ij} - f(W, x_i)_j \bigr\| \quad (4)$$

A common choice of the error function is the least mean square error of the form

$$C(x) = \sum_{i=1}^{m} \bigl\| y_i - f(W, x_i) \bigr\|^2 \quad (5)$$

whose expected value is

$$E(C(x)) = \int C(x)\, f_d(x, y)\, dx\, dy \quad (6)$$

where the function $f_d(x, y)$ represents the probability density of the training data. Substituting Eq. (5) into Eq. (6) gives

$$E(C(x)) = \int \sum_{i=1}^{m}\sum_{j=1}^{n} \bigl( y_{ij} - f(W, x_i)_j \bigr)^2 f_d(x_i, y_i)\, dx_i\, dy_i \quad (7)$$

When training with the new data set X', the function f(·) becomes f(W, x + a + b), which can be expanded into a Taylor series:

$$f(W, x+a+b) = f(W, x) + \nabla f(W, x)^{T}(a+b) + \tfrac{1}{2}(a+b)^{T} \nabla^{2} f(W, x)\,(a+b) + \cdots = f(W, x) + h(x) \quad (8)$$

where $\nabla f(\cdot)$ is the gradient and $\nabla^{2} f(\cdot)$ is the Hessian matrix. The expected error when training with the new data set X' can then be written in the form

$$E(C(x')) = E(C(x)) + \Omega(f(W, x)) \quad (9)$$


where $\Omega(f(W, x))$ is

$$\Omega(f(W, x)) = \int \sum_{i=1}^{m}\sum_{j=1}^{n} \Bigl[ -2\bigl( y_{ij} - f(W, x_i)_j \bigr) h(x_i) + h(x_i)^2 \Bigr] f_d(x_i, y_i)\, f_d(a_i)\, f_d(b_i)\, dx_i\, dy_i\, da_i\, db_i \quad (10)$$

From Eq. (9) we can see that training with the new data set X' is equivalent to the technique of regularization, which adds a penalty term to the error function for controlling the bias and variance of a neural network [22].

The neural network learning rule can be considered a gradient optimization process: when an appropriate energy function E(w) is selected, the gradient direction is

$$\frac{dw}{dt} = -\frac{\partial E(w)}{\partial w} \quad (11)$$

and the synaptic weights are adjusted in the gradient direction

$$w(k+1) = w(k) - \eta \frac{\partial E(w)}{\partial w} \quad (12)$$

where $\eta > 0$ is the learning rate. If the data set S in Fig. 2 is trained following this process, then after the neural network is stable, that is, after the energy function E has reached a minimum (local or global), the synaptic weights are the black points in Fig. 2.

If the data set S is randomly split (that is, not following the partition boundary) into two data sets S1 and S2, shown in Figs. 3 and 4, distributed learning first trains on S1 and S2 independently. After both networks are stable, S1's energy function E1 and S2's energy function E2 have both reached a minimum, and the concentrated learning is carried out on the learning results of data sets S1 and S2. During the concentrated learning, the triangle class and the plus-sign class have already been generalized very well by their synaptic weights, so these weights will not be adjusted to any great degree to minimize the energy function; however, the cross class and the six-pointed-star class have not reached the optimal state, and therefore their synaptic weights generated during the distributed learning will continue to be adjusted until a minimum is reached. Although the training results (the number and values of the synaptic weights) on the split data sets and on the whole data set may differ, they have the same generalization ability, because they all aim to make S's energy function E reach a minimum.

Fig. 2 The data set S and its training result
Fig. 3 The data set S1 and its training result
Fig. 4 The data set S2 and its training result

2.3 Competitive learning algorithm based on kernel function

In order to gain the advantage of being able to learn from new data, a neural network must be adaptive or exhibit plasticity, possibly allowing the creation of new neurons. On the other hand, if the training data structures are unstable and the most recently acquired piece of information can cause a major reorganization, then it is difficult to ascribe much significance to any particular clustering description. This problem is even more serious in distributed data training. SOM [23], dART [24], RPCL [25], etc. have presented methods to overcome this problem; this paper introduces a competitive mechanism which absorbs the ideas of the above methods. The learning algorithm is based on Hebb learning and a kernel function. To prevent the knowledge included in different slices from being ignored, DNNL adopts the resonance mechanism of ART and adds neurons whenever the network in its current state does not sufficiently match the input.

    mechanism of ART and adds neurons whenever the network

    123

  • 7/31/2019 Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm

    5/12

    Large-scale network intrusion detection based on distributed learning algorithm 29

Thus the learning results of the sensors contain complete or partial knowledge, and the whole knowledge can be learned by the concentrated learning.

    2.3.1 Hebb learning

In DNNL the learning algorithm is based on the Hebbian postulate, which states that "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

The learning rule for a single neuron can be derived from an energy function defined as

$$E(w) = -\varphi\bigl(w^{T}x\bigr) + \frac{\beta}{2} \|w\|_2^2 \quad (13)$$

where w is the synaptic weight vector (including a bias or threshold), x is the input to the neuron, $\varphi(\cdot)$ is a differentiable function, and $\beta \geq 0$ is the forgetting factor. Also,

$$y = \frac{d\varphi(v)}{dv} = f(v) \quad (14)$$

is the output of the neuron, where $v = w^{T}x$ is the activity level of the neuron. Taking the steepest descent approach to derive the continuous-time learning rule,

$$\frac{dw}{dt} = -\eta\, \nabla_w E(w) \quad (15)$$

where $\eta > 0$ is the learning rate parameter, the gradient of the energy function in Eq. (13) must be computed with respect to the synaptic weight vector, that is, $\nabla_w E(w) = \partial E(w)/\partial w$. The gradient of Eq. (13) is

$$\nabla_w E(w) = -f(v)\frac{\partial v}{\partial w} + \beta w = -yx + \beta w \quad (16)$$

Therefore, by using the result in Eq. (16) along with that of Eq. (15), the continuous-time learning rule for a single neuron is

$$\frac{dw}{dt} = \eta\, [\, yx - \beta w\, ] \quad (17)$$

and the discrete-time learning rule (in vector form) is

$$w(t+1) = w(t) + \eta\, [\, y(t+1)\, x(t+1) - \beta\, w(t)\, ] \quad (18)$$
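A direct transcription of the discrete-time rule of Eq. (18) might look like the following sketch; eta and beta stand for the learning rate and forgetting factor, and all names are illustrative.

#include <stddef.h>

/* One discrete-time Hebbian update, Eq. (18): w <- w + eta * (y*x - beta*w). */
void hebb_update(double *w, const double *x, double y,
                 double eta, double beta, size_t dim)
{
    for (size_t d = 0; d < dim; d++)
        w[d] += eta * (y * x[d] - beta * w[d]);
}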

    2.3.2 Competitive mechanism based on kernel function

To overcome the problem induced by the traffic splitter, an inverse distance kernel function is used in the Hebb learning. The basic idea is that not only is the winner rewarded, but all the losers are also penalized, at different rates calculated by the inverse distance function, whose input is the dissimilarity between the sample data and the neuron.

The dissimilarity measure function is the Minkowski metric:

$$d_p(x, y) = \left( \sum_{i=1}^{l} w_i\, |x_i - y_i|^{p} \right)^{1/p} \quad (19)$$

where $x_i$, $y_i$ are the i-th coordinates of x and y, i = 1, ..., l, and $w_i \geq 0$ is the i-th weight coefficient.

When the j-th neuron is the one most similar to the sample, the learning rule for the i-th neuron is

$$W_i(t+1) = W_i(t) + \eta\, \alpha_i\, [\, x(t+1) - W_i(t)\, ] \quad (20)$$

where

$$\alpha_i = \begin{cases} 1, & \text{winner: } i = j \\ -K(d_i), & \text{others: } i = 1, \ldots, m,\ i \neq j \end{cases} \quad (21)$$

and $K(d_i)$ is the inverse distance kernel,

$$K(d_i) = \frac{1}{1 + d_i^{\,p}} \quad (22)$$

If the winner's dissimilarity measure satisfies $d < \lambda$ ($\lambda$ is the threshold of dissimilarity), then the synaptic weights are updated by the learning rule of Eq. (20); otherwise a new neuron is added and its synaptic weight is set to w = x.
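The following sketch (illustrative, not the authors' code) combines Eqs. (19)–(22) into one competitive update step; the Minkowski weight coefficients are taken as 1 and p is a parameter.

#include <math.h>
#include <stddef.h>

/* Minkowski metric of Eq. (19) with unit weight coefficients. */
double minkowski(const double *x, const double *y, size_t dim, double p)
{
    double sum = 0.0;
    for (size_t i = 0; i < dim; i++)
        sum += pow(fabs(x[i] - y[i]), p);
    return pow(sum, 1.0 / p);
}

/* Inverse distance kernel of Eq. (22). */
double inverse_distance_kernel(double d, double p)
{
    return 1.0 / (1.0 + pow(d, p));
}

/* One competitive update, Eqs. (20)-(21): the winner (index `winner`) is
   moved toward x, every other neuron is penalized (pushed away). */
void competitive_update(double *W, size_t m, size_t dim,
                        const double *x, size_t winner, double eta, double p)
{
    for (size_t i = 0; i < m; i++) {
        double alpha;
        if (i == winner) {
            alpha = 1.0;
        } else {
            double d = minkowski(x, W + i * dim, dim, p);
            alpha = -inverse_distance_kernel(d, p);
        }
        for (size_t k = 0; k < dim; k++)
            W[i * dim + k] += eta * alpha * (x[k] - W[i * dim + k]);
    }
}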

    2.4 Post-prune algorithm

One of the central issues in network training is finding the optimal model f(·). Judging the quality of f(·) can be broken into two fundamental aspects: bias and variance. Bias measures the expected value of the estimator relative to the true value, and variance measures the variability of the estimator about that expected value. Since DNNL determines the network size by adding neurons incrementally, it may model noisy data into f(·) and lead to high variance (the phenomenon of overfitting). To prevent overfitting, DNNL uses a post-prune method whose strategy is based on a distance threshold: if two weights are too similar, they are substituted by a new weight. The new weight is calculated as

$$W_{new} = (W_{old1}\, t_1 + W_{old2}\, t_2)/(t_1 + t_2) \quad (23)$$

where $t_1$ is the number of times $W_{old1}$ has been trained and $t_2$ is the number of times $W_{old2}$ has been trained.

The pruning process is illustrated in Fig. 5; after pruning, E and F are aggregated to EF, and A and B are aggregated to AB. The prune algorithm is shown below:

Fig. 5 The pruning process

Step 0: If the old weights set (oldW) is empty, the algorithm is over; else proceed;
Step 1: Calculate the distance between the first weight (fw) and the other weights;
Step 2: Find the weight (sw) which is most similar to fw;
Step 3: If the distance between sw and fw is larger than the pruning threshold, then delete fw from oldW, add fw into the new weights set (newW) and go to Step 0; else continue;
Step 4: Get fw's training-times value (ft) and sw's training-times value (st);
Step 5: Calculate the new weight (nw) and nw's training-times value (nt): nw = (fw · ft + sw · st)/(ft + st), nt = ft + st;
Step 6: Delete fw and sw from oldW and add nw into newW; go to Step 0.
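A possible implementation of this post-prune procedure is sketched below (illustrative only; the distance function and pruning threshold are passed as parameters, and the weight list is held in flat arrays).

#include <stddef.h>

/* Prune the weight list in place: the first unprocessed weight is either kept
   (Step 3) or merged with its nearest neighbour via Eq. (23) (Steps 4-6).
   W holds n rows of `dim` doubles, `times` holds the training counts.
   Returns the number of weights kept; `dist` is the dissimilarity measure. */
size_t post_prune(double *W, unsigned *times, size_t n, size_t dim,
                  double threshold,
                  double (*dist)(const double *, const double *, size_t))
{
    size_t kept = 0;                          /* weights already moved to newW */
    while (kept < n) {
        size_t best = kept;
        double best_d = 1e300;
        for (size_t i = kept + 1; i < n; i++) {   /* find sw: nearest to fw */
            double d = dist(W + kept * dim, W + i * dim, dim);
            if (d < best_d) { best_d = d; best = i; }
        }
        if (kept + 1 >= n || best_d > threshold) {
            kept++;                           /* keep fw unchanged (Step 3) */
            continue;
        }
        /* merge fw and sw into one weight, Eq. (23) (Steps 4-6) */
        unsigned ft = times[kept], st = times[best];
        for (size_t d = 0; d < dim; d++)
            W[kept * dim + d] = (W[kept * dim + d] * ft + W[best * dim + d] * st)
                                / (double)(ft + st);
        times[kept] = ft + st;
        /* remove sw by overwriting it with the last weight */
        n--;
        for (size_t d = 0; d < dim; d++) W[best * dim + d] = W[n * dim + d];
        times[best] = times[n];
        kept++;                               /* merged weight goes to newW */
    }
    return kept;
}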

    2.5 Learning algorithm of DNNL

The main learning process of DNNL is:

Step 0: Initialize the learning rate parameter η and the threshold of dissimilarity λ;
Step 1: Obtain the first input x and set w0 = x as the initial weight;
Step 2: If training is not over, randomly take a feature vector x from the feature sample set X and compute the dissimilarity measure between x and each synaptic weight using Eq. (19);
Step 3: Determine the winner neuron j and test the tolerance: if dj ≥ λ, add a new neuron, set its synaptic weight w = x, and go to Step 2; else continue;
Step 4: Compute αi using the inverse distance kernel K(di);
Step 5: Update the synaptic weights as in Eq. (20), and go to Step 2.
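Putting the pieces together, a per-sensor training loop corresponding to Steps 0–5 could be sketched as follows (illustrative; it reuses the minkowski and competitive_update sketches from Sect. 2.3.2, samples are presented in order rather than randomly for simplicity, and max_neurons bounds the growth of the network).

#include <string.h>
#include <stddef.h>

/* From the sketches in Sect. 2.3.2 */
double minkowski(const double *x, const double *y, size_t dim, double p);
void   competitive_update(double *W, size_t m, size_t dim,
                          const double *x, size_t winner, double eta, double p);

/* Train one sensor's network on its slice of data (Steps 0-5).
   X: n samples of `dim` features; W: weight matrix with room for max_neurons rows.
   Returns the number of neurons created. */
size_t dnnl_train_slice(const double *X, size_t n, size_t dim,
                        double *W, size_t max_neurons,
                        double eta, double lambda, double p)
{
    size_t m = 1;
    memcpy(W, X, dim * sizeof(double));             /* Step 1: w0 = first input */

    for (size_t s = 1; s < n; s++) {                /* Step 2: next sample      */
        const double *x = X + s * dim;
        size_t winner = 0;
        double d_min = minkowski(x, W, dim, p);
        for (size_t i = 1; i < m; i++) {            /* dissimilarity to neurons */
            double d = minkowski(x, W + i * dim, dim, p);
            if (d < d_min) { d_min = d; winner = i; }
        }
        if (d_min >= lambda && m < max_neurons) {   /* Step 3: tolerance test   */
            memcpy(W + m * dim, x, dim * sizeof(double));
            m++;
        } else {                                    /* Steps 4-5: update weights */
            competitive_update(W, m, dim, x, winner, eta, p);
        }
    }
    return m;
}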

    3 Data preprocessing

The KDDCUP 99 data was collected through a simulation on a U.S. military network by the 1998 DARPA Intrusion Detection Evaluation Program, with the aim of obtaining a benchmark data set for the field of intrusion detection. The full data set contains training data consisting of 7 weeks of network-based intrusions inserted into normal data, and 2 weeks of network-based intrusions in normal data, for a total of 4,999,000 connection records described by 41 features. The records are mainly divided into four types of attack: probe, denial of service (DoS), user-to-root (U2R) and remote-to-local (R2L).

    3.1 Metric embedding

The set of features presented in the KDD Cup data set contains categorical and numerical features of different sources and scales. An essential step in handling such data is metric embedding, which transforms the data into a metric space. In this paper the categorical features are represented by A; each categorical feature $A_i$ expressing g possible categorical values is defined as $A_i = \{A_i^1, A_i^2, \ldots, A_i^g\}$. The numerical features are represented by B. The metric space X can then be defined as $X = \{A_1, \ldots, A_m, B_1, \ldots, B_{n-m}\}$, which means each sample is described by n features, of which m are categorical.
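As a concrete illustration (not from the paper), a sample in this metric space could be stored as small integer codes for the categorical features plus normalized values for the numerical ones; the field names below are illustrative.

#include <stddef.h>

/* One embedded sample: m categorical codes (the index k of the value A_i^k)
   followed by n-m normalized numerical features. */
typedef struct {
    size_t  m;            /* number of categorical features                    */
    size_t  n;            /* total number of features                          */
    int    *categorical;  /* categorical[i] in {0, ..., g_i - 1}, i < m        */
    double *numerical;    /* numerical[j] in [0,1] after normalization, j < n-m */
} EmbeddedSample;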

    3.2 Dissimilarity measure

For numerical features, the value $|x_i - y_i|$ of the Minkowski metric can be calculated directly after normalization. For categorical features, however, we need to define a new calculation method. The Hamming distance is often used to quantify the extent to which two strings of the same dimension differ; an early application was in the theory of error-correcting codes, where the Hamming distance measured the error introduced by noise over a channel when a message, typically a sequence of bits, is sent between its source and destination. In DNNL the calculation of $|x_i - y_i|$ for categorical features is similar to the Hamming distance. If $x_i$ and $y_i$ are categorical features, $x_i$ is the feature of the sample data, $y_i$ is the corresponding feature of one trained neuron N, and $x_i = A_i^k$ ($k \in \{1, \ldots, g\}$), then

$$|x_i - y_i| = 1 - \frac{c_k}{C} \quad (24)$$


where $c_k$ is the number of times $A_i^k$ has been learned by neuron N,

$$c_k = \mathrm{num}\bigl(A_i^k\bigr) \quad (25)$$

and C is the total number of all the categorical feature values that have been learned by neuron N,

$$C = \sum_{i=1}^{m} \sum_{j=1}^{g} \mathrm{num}\bigl(A_i^j\bigr) \quad (26)$$

If neuron N is the winner of a training epoch, its value of $c_k$ is incremented by 1. Using this method to calculate the Minkowski metric of categorical features, if the value of one sample's categorical feature $A_i$ is $A_i^k$, then the neurons with a larger $\mathrm{num}(A_i^k)$ are more similar to this sample with regard to this categorical feature, that is, the value $|x_i - y_i|$ is much smaller.
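The categorical part of the dissimilarity measure could be sketched as follows (illustrative; the per-neuron counter layout is an assumption):

#include <stddef.h>

/* Per-neuron counters for one categorical feature with g possible values:
   count[k] = num(A_i^k), i.e. how often value k has been learned by the neuron. */
typedef struct { unsigned *count; size_t g; } CategoricalStats;

/* |x_i - y_i| for a categorical feature, Eq. (24): 1 - c_k / C, where C sums
   the counters of ALL categorical features of the neuron, Eq. (26). */
double categorical_distance(const CategoricalStats *stats, size_t m,
                            size_t feature, unsigned value)
{
    unsigned long C = 0;
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < stats[i].g; j++)
            C += stats[i].count[j];
    if (C == 0) return 1.0;                    /* nothing learned yet */
    return 1.0 - (double)stats[feature].count[value] / (double)C;
}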

    4 Experiments

    4.1 Benchmark test

In the KDDCUP 99 data set, a smaller data set consisting of 10% of the overall data is generally used to evaluate algorithm performance. This smaller data set contains 22 kinds of intrusion behaviors and 494,019 records, among which 97,276 are normal connection records. The test set is a separate data set which contains 37 kinds of intrusion behaviors and 311,029 records, among which 60,593 are normal.

    4.1.1 Performance measures

The recording format of the test results is shown in Table 1. False alarms are partitioned into False Positives (FP, normal behavior detected as an intrusion) and False Negatives (FN, an intrusion not detected). True detections are likewise partitioned into True Positives (TP, an intrusion detected correctly) and True Negatives (TN, normal behavior detected correctly).

Table 1 Recording format of test results

                             Detection results
  Actual behaviors   Normal   Intrusion-1   Intrusion-2   ...   Intrusion-n
  Normal             TN00     FP01          FP02          ...   FP0n
  Intrusion-1        FN10     TP11          FP12          ...   FP1n
  Intrusion-2        FN20     FP21          TP22          ...   FP2n
  ...                ...      ...           ...           ...   ...
  Intrusion-n        FNn0     FPn1          FPn2          ...   TPnn

Definition 1 The right detection rate of the i-th behavior (TR) is

$$TR_i = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ij}} \quad (27)$$

where $T_{ii}$ is the value in the i-th row and i-th column of Table 1, and $R_{ij}$ is the value in the i-th row and j-th column of Table 1.

Definition 2 The right prediction rate of the i-th behavior (PR) is

$$PR_i = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ji}} \quad (28)$$

where $T_{ii}$ is the value in the i-th row and i-th column of Table 1, and $R_{ji}$ is the value in the j-th row and i-th column of Table 1.

Definition 3 The detection rate (DR) is computed as the ratio between the number of correctly detected intrusions and the total number of intrusions. If we regard Table 1's record as an (n+1) × (n+1) matrix R, then

$$DR = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} R_{ij}}{\sum_{i=1}^{n}\sum_{j=0}^{n} R_{ij}} \quad (29)$$

Definition 4 The false positive rate (FPR) is computed as the ratio between the number of normal behaviors that are incorrectly classified as intrusions and the total number of normal connections; according to Table 1's record,

$$FPR = \frac{\sum_{i=1}^{n} FP_{0i}}{\sum_{i=1}^{n} FP_{0i} + TN_{00}} \quad (30)$$
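For reference, all four measures can be computed directly from the (n+1) × (n+1) matrix R of Table 1; the helper below is an illustrative sketch, not part of the original evaluation code.

#include <stddef.h>

/* R is an (n+1) x (n+1) confusion matrix, row = actual class, column = detected
   class, with class 0 = normal.  tr[i] and pr[i] receive TR_i (Eq. 27) and
   PR_i (Eq. 28); *dr and *fpr receive DR (Eq. 29) and FPR (Eq. 30). */
void compute_measures(const double *R, size_t n1, /* n1 = n + 1 */
                      double *tr, double *pr, double *dr, double *fpr)
{
    /* Per-class rates, Eqs. (27) and (28). */
    for (size_t i = 0; i < n1; i++) {
        double row = 0.0, col = 0.0;
        for (size_t j = 0; j < n1; j++) {
            row += R[i * n1 + j];
            col += R[j * n1 + i];
        }
        tr[i] = row > 0 ? R[i * n1 + i] / row : 0.0;
        pr[i] = col > 0 ? R[i * n1 + i] / col : 0.0;
    }
    /* Detection rate, Eq. (29): intrusion rows i >= 1, detected as any intrusion. */
    double detected = 0.0, intrusions = 0.0;
    for (size_t i = 1; i < n1; i++)
        for (size_t j = 0; j < n1; j++) {
            intrusions += R[i * n1 + j];
            if (j >= 1) detected += R[i * n1 + j];
        }
    *dr = intrusions > 0 ? detected / intrusions : 0.0;
    /* False positive rate, Eq. (30): normal row, intrusion columns. */
    double fp = 0.0;
    for (size_t j = 1; j < n1; j++) fp += R[j];   /* R[0*n1 + j] */
    *fpr = (fp + R[0]) > 0 ? fp / (fp + R[0]) : 0.0;
}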

    4.1.2 Experiment results

To test the performance of DNNL, we first divide the 494,019 records into 50 slices; each slice contains 10,000 records except the last, which contains 4,019 records. With learning rate $\eta = 0.1$ and threshold of dissimilarity $\lambda = 1.5$, the learning results are shown in Figs. 6 and 7. In Fig. 6, the X-axis indexes the slices and the Y-axis records the corresponding number of neurons after the neural networks are stable. In Fig. 7, the Y-axis records the corresponding number of behaviors included in each slice. From the results we can see that the distributions of neurons and behaviors are similar, which indicates that the sensors have learned the knowledge. Since the behaviors recorded in the 18th–36th and 43rd–46th slices are all smurf intrusions, the number of behaviors there is 1 and the number of neurons is 51. After training on the distributed learning results, the knowledge is represented by 368 neurons.

Fig. 6 The number of neurons for each corresponding slice
Fig. 7 The number of behaviors for each corresponding slice

There are 37 kinds of intrusion behaviors in the test set. We first separate them into four kinds of attacks, as listed below:

Probe: {portsweep, mscan, saint, satan, ipsweep, nmap}
DOS: {udpstorm, smurf, pod, land, processtable, warezmaster, apache2, mailbomb, neptune, back, teardrop}
U2R: {httptunnel, ftp_write, sqlattack, xterm, multihop, buffer_overflow, perl, loadmodule, rootkit, ps}
R2L: {guess_passwd, phf, snmpguess, named, imap, snmpgetattack, xlock, sendmail, xsnoop, worm}

The test results are summarized in Table 2. Comparing the result with that of the first winner of KDD CUP 99, we see that the TR of DNNL is almost equal to the first winner's. There are two reasons for the low TR on U2R and R2L: first, the number of attack instances belonging to U2R and R2L is much smaller than that of the other types of attack; second, U2R and R2L are host-based attacks which exploit vulnerabilities of the operating system rather than of the network protocol, and they are therefore very similar to normal data. Table 3 shows the DR and FPR of the first and second winners of the KDD CUP 99 competition, other approaches [21] and DNNL. From the comparison we can see that DNNL provides superior performance.

    4.2 Prototype system test

    4.2.1 Test environment

In order to address the problem of intrusion detection analysis in high-speed networks, the data stream on the high-speed network link is divided into several smaller slices that are fed into a number of distributed neural networks. To evaluate the effectiveness of DNNL, we developed a prototype IDS using Libpcap. The test environment is shown in Fig. 8. We used 12 PCs with 100 Mbps Ethernet cards as background traffic generators, which together could generate more than 1,000 Mbps of TCP and UDP streams with an average packet size of 1,024 bytes. One IBM server running web services is the attack target, and one attacker sends attack packets to the server. All 14 computers are connected to 100 Mbps ports on a Huawei Quidway S3526C switch. All the packets through these ports are mirrored to a defined mirror port and then distributed to the neural networks.

    4.2.2 Packet capture

Every station on a LAN hears every packet transmission, so there is a destination field and a source field in each packet. The Ethernet card can be in promiscuous mode or normal mode. Under promiscuous mode, the card will receive and deliver every packet. Under normal mode, if the packet's destination address is identical to the station's address, the card will receive the packet and pass it up to the software; if not, the card will simply drop (filter) the packet. An IDS can run with the Ethernet card in promiscuous mode to analyze every packet passing through the LAN. Libpcap is the library we use to grab packets from the network card directly. The main functions used are:

pcap_open_live() is used to obtain a packet capture descriptor to look at packets on the network.
pcap_lookupnet() is used to determine the network number and mask associated with the network device.
pcap_lookupdev() returns a pointer to a network device suitable for use with pcap_open_live() and pcap_lookupnet().
pcap_loop() is used to collect and process packets. Each captured packet is parsed to form the network behavior vector.
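A minimal capture loop built from these Libpcap calls might look as follows (an illustrative sketch; the packet-parsing step is only indicated by a comment and is described next).

#include <pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user; (void)bytes;
    /* here the packet would be parsed into the behavior structures defined below */
    printf("captured %u bytes\n", h->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    char *dev = pcap_lookupdev(errbuf);               /* pick a capture device   */
    if (!dev) { fprintf(stderr, "%s\n", errbuf); return 1; }

    bpf_u_int32 net = 0, mask = 0;
    pcap_lookupnet(dev, &net, &mask, errbuf);         /* network number and mask */

    /* snaplen BUFSIZ, promiscuous mode on, 1000 ms read timeout */
    pcap_t *handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
    if (!handle) { fprintf(stderr, "%s\n", errbuf); return 1; }

    pcap_loop(handle, -1, on_packet, NULL);           /* process packets forever */
    pcap_close(handle);
    return 0;
}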

Our method of parsing is based on the layered structure of network software. In the TCP/IP reference model, the Internet layer defines an official packet format and protocol called IP (Internet Protocol); the layer above the Internet layer is the transport layer, where two end-to-end protocols have been defined.


Table 2 Testing results

                            Detection results
  Actual behaviors   Normal   Probe   DoS      U2R   R2L    TR (%)
  Normal             58120    927     649      64    833    96.0
  Probe              357      3546    174      21    118    85.1
  DoS                256      5092    223518   52    435    97.2
  U2R                143      39      0        23    23     10.1
  R2L                14443    14      1        271   1460   9.0
  PR (%)             79.3     36.9    99.6     5.3   50.9

Table 3 Comparison with other approaches

  Algorithm                  Detection rate (DR) (%)   False positive rate (FPR) (%)
  Winning entry              91.9                      0.5
  Second place               91.5                      0.6
  Best linear GP-FP rate     89.4                      0.7
  Best GEdIDS-FP rate        91.0                      0.4
  DNNL                       93.9                      0.4

    Fig. 8 Test environment

The first one, TCP (Transmission Control Protocol), is a reliable connection-oriented protocol that allows a byte stream originating on one machine to be delivered without error to any other machine on the Internet. The second protocol in this layer, UDP (User Datagram Protocol), is an unreliable, connectionless protocol. In the experiments we use the packet headers to define the data structures of network behavior:

typedef struct _EthernetBehavior
{
    u_int8_t  ethernet_dest[6];   /* destination ethernet (MAC) address */
    u_int8_t  ethernet_sour[6];   /* source ethernet (MAC) address */
    u_int16_t ethernet_type;      /* packet type ID field */
} EthernetBehavior;

The field ethernet_type indicates the protocol header nested inside the frame; it may be IP, ARP, or some other protocol. For instance, an IP header can be defined as:

typedef struct _IPBehavior
{
    unsigned int header_len:4;        /* header length (in 32-bit words) */
    unsigned int version:4;           /* version of the protocol */
    u_int8_t  tos;                    /* type of service */
    u_short   total_len;              /* total length of datagram */
    u_short   identification;        /* identification */
    u_int16_t flag_off;               /* flags and fragment offset */
    u_int8_t  time_live;              /* the limit of packet lifetimes (TTL) */
    u_int8_t  protocol;               /* TCP or UDP */
    u_int16_t checksum;               /* header checksum */
    struct in_addr source_addr;       /* source address */
    struct in_addr destination_addr;  /* destination address */
} IPBehavior;

The variable protocol tells what type of protocol is used in the upper layer; it can be TCP, UDP, ICMP, etc. For example, the definition for TCP is:

typedef struct _TCPBehavior
{
    u_int16_t sour_port;     /* source port */
    u_int16_t dest_port;     /* destination port */
    tcp_seq   seq_num;       /* sequence number */
    tcp_seq   ack_num;       /* acknowledgement number */
    u_int16_t flag;          /* flags */
    u_int16_t win_size;      /* window size */
    u_int16_t check_sum;     /* header checksum */
    u_int16_t urg_pointer;   /* urgent pointer */
} TCPBehavior;


    Fig. 9 Test result

During the training period, the IDS uses the behavior variables of normal packets to form the binary behavior matrix. At detection time, if an intrusion is detected, the IDS raises an alarm and displays detailed information about the intruder, which is parsed from the behavior variables.

    4.2.3 Experiment result

In this experiment we evaluate the proposed method. We first train the neural network with different normal features, then use the stable neural network to monitor the system while some abnormal behaviors are taking place in the same environment. A series of experiments is conducted to analyze the effect of varying the intrusion threshold on the system errors. The test results are graphically represented in Fig. 9.

We can see that the performance of the IDS is sensitive to the intrusion threshold. As the threshold value increases, false positive errors increase while false negative errors decrease. Since a false negative error is more serious in an IDS, we need to concentrate on the decrease of false negative errors as the threshold value changes. The optimal threshold value is 1.5–1.6.

    5 Conclusions

The bandwidth of networks increases faster than the speed of processors, so it is impossible to keep up with the speed of networks just by increasing the processor speed of a NIDS. To resolve this problem, this paper presents DNNL, which can be used in anomaly detection methods. Completeness analysis shows that DNNL's learning algorithm is equivalent to training one neural network with a penalty term added to the error function for controlling the bias and variance of the network. The main contributions of this approach are: reducing the complexity of load balancing while still maintaining the completeness of the network behavior, putting forward a dissimilarity measure method for categorical and numerical features, and increasing the speed of the whole system. In the experiments, the KDD data set is used, which is the common data set used in IDS research papers. Training with one neural network takes 6–7 h, whereas DNNL takes less than 1 h. Comparisons with other approaches on the same benchmark show that DNNL's false alarm rate is very low.

Acknowledgments This research is supported by the National Natural Science Foundation of China under Grant No. 60573128 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20060183043.

    References

1. Song, H.Y., Lockwood, J.W.: Efficient packet classification for network intrusion detection using FPGA. In: Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays, pp. 238–245. Monterey (2005)
2. Yang, W., Fang, B.X., Liu, B., Zhang, H.L.: Intrusion detection system for high-speed network. Comput. Commun. 27, 1288–1294 (2004)
3. Baker, Z.K., Prasanna, V.K.: Automatic synthesis of efficient intrusion detection systems on FPGAs. In: Proceedings of the 14th Field Programmable Logic and Application, pp. 311–321. Leuven, Belgium (2004)
4. Baker, Z.K., Prasanna, V.K.: A methodology for synthesis of efficient intrusion detection systems on FPGAs. In: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM04), pp. 135–144. Napa (2004)
5. McAlerney, J., Coit, C., Staniford, S.: Towards faster string matching for intrusion detection or exceeding the speed of snort. In: Proceedings of DARPA Information Survivability Conference and Exposition, pp. 367–373. Anaheim (2001)

6. Tuck, N., Sherwood, T., Calder, B., Varghese, G.: Deterministic memory-efficient string matching algorithms for intrusion detection. In: Proceedings of the 23rd Conference of the IEEE Communications Society, pp. 2628–2639. Hong Kong (2004)
7. Tan, L., Sherwood, T.: A high throughput string matching architecture for intrusion detection and prevention. In: Proceedings of the 32nd International Symposium on Computer Architecture, pp. 112–122. Madison, Wisconsin (2005)
8. Aggarwal, C., Yu, S.: An effective and efficient algorithm for high-dimensional outlier detection. Int. J. Very Large Data Bases 14, 211–221 (2005)
9. Rawat, S., Pujari, A.K., Gulati, V.P.: On the use of singular value decomposition for a fast intrusion detection system. Electronic Notes Theor. Comput. Sci. 142, 215–228 (2006)
10. Kruegel, C., Valeur, F., Vigna, G., Kemmerer, R.: Stateful intrusion detection for high-speed networks. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 285–294. California (2002)


11. Lai, H.G., Cai, S.W., Huang, H., Xie, J.Y., Li, H.: A parallel intrusion detection system for high-speed networks. In: Proceedings of Applied Cryptography and Network Security: Second International Conference, ACNS 2004, pp. 439–451. Yellow Mountain (2004)
12. Jiang, W.B., Song, H., Dai, Y.Q.: Real-time intrusion detection for high-speed networks. Comput. Secur. 24, 287–294 (2005)
13. Xinidis, K., Charitakis, I., Antonatos, S., Anagnostakis, K.G., Markatos, E.P.: An active splitter architecture for intrusion detection and prevention. IEEE Trans. Dependable Secure Comput. 3, 31–44 (2006)
14. Schaelicke, L., Wheeler, K., Freeland, C.: SPANIDS: a scalable network intrusion detection load balancer. In: Proceedings of the 2nd Conference on Computing Frontiers, pp. 315–322. Ischia (2005)
15. Szalay, A., Gray, J.: The world-wide telescope. Science 293, 2037–2040 (2001)

16. Martone, M.E., Gupta, A., Ellisman, M.H.: E-neuroscience: challenges and triumphs in integrating distributed data from molecules to brains. Nature Neurosci. 7, 467–472 (2004)
17. Wroe, C., Goble, C., Greenwood, M., Lord, P., Miles, S., Papay, J., Payne, T., Moreau, L.: Automating experiments using semantic data on a bioinformatics grid. IEEE Intell. Syst. 19, 48–55 (2004)
18. Wang, Y.X., Behera, S.R., Wong, J., Helmer, G., Honavar, V., Miller, L., Lutz, R., Slagell, M.: Towards the automatic generation of mobile agents for distributed intrusion detection systems. J. Syst. Softw. 79, 1–14 (2006)
19. Bala, J., Weng, Y., Williams, A., Gogia, B.K., Lesser, H.K.: Applications of distributed mining techniques for knowledge discovery in dispersed sensory data. In: Proceedings of the 7th Joint Conference on Information Sciences, pp. 1–4. Cary (2003)
20. Kourai, K., Chiba, S.: HyperSpector: virtual distributed monitoring environments for secure intrusion detection. In: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pp. 197–207. Chicago (2005)

21. Folino, G., Pizzuti, C., Spezzano, G.: GP ensemble for distributed intrusion detection systems. In: Proceedings of the 3rd International Conference on Advances in Pattern Recognition, pp. 54–62. Bath, UK (2005)
22. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992)
23. Kuo, R.J., An, Y.L., Wang, H.S., Chung, W.J.: Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. Expert Syst. Appl. 30, 313–324 (2006)
24. Carpenter, G.A., Milenova, B.L., Noeske, B.W.: Distributed ARTMAP: a neural network for fast distributed supervised learning. Neural Networks 11, 793–813 (1998)
25. Nair, T.M., Zheng, C.L., Fink, J.L., Stuart, R.O., Gribskov, M.: Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data. Comput. Biol. Chem. 27, 565–574 (2003)
