
A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data

Abien Fred Agarap
Department of Computer Science

Adamson University

June 1, 2019

arXiv:1709.03082v1 [cs.NE] 10 Sep 2017


Abstract

Gated Recurrent Unit (GRU) is a recently published variant of the Long Short-Term Memory (LSTM) network, designed to solve the vanishing gradient and exploding gradient problems. However, its main objective is to solve the long-term dependency problem in Recurrent Neural Networks (RNNs), which prevents the network from connecting information from a previous iteration with the current iteration. This study proposes a modification of the GRU model, using Support Vector Machine (SVM) as its classifier instead of the Softmax function. The classifier is responsible for the output of a network in a classification problem. SVM was chosen over Softmax for its computational efficiency. To evaluate the proposed model, it will be used for intrusion detection, with the 2013 dataset from Kyoto University's honeypot system serving as both its training and testing data.

Page 3: arXiv:1709.03082v1 [cs.NE] 10 Sep 2017 June 1, 2019Figure 3: The proposed ESN-SVM model by Alalshekmubarak & Smith[2] for time series classi cation. This research presents a modi ed

Contents

1 Introduction
  1.1 Background of the Study
  1.2 Significance of the Study
  1.3 Scope and Limitation

2 Literature Review

3 Statement of the Problem

4 Methodology
  4.1 Machine Intelligence Library
  4.2 The Dataset
  4.3 Data Preprocessing
  4.4 The GRU-SVM Neural Network Architecture
  4.5 Data Analysis

5 Results

1 Introduction

1.1 Background of the Study

The annual cost to the global economy due to cybercrime could be as high as $575 billion, which includes both the gain to criminals and the costs to companies for defense and recovery[10]. It is even projected that the said cost will reach $2 trillion by 2019[20, 33].

Figure 1: Image from [46]. The SVM algorithm outputs a hyperplane which categorizes the data, usually into two classes.

Figure 2: Image from [37], showing a neural network that classifies an image as "cat". The image data is converted to a series of numbers, which then gets processed by the hidden layers (violet circles in the image), and the network outputs the probability of the class to which the image belongs.


Among the contributory felonies to cybercrime are intrusions, which are defined as illegal or unauthorized use, misuse, or exploitation by either authorized users or external attackers[42]. To identify intrusions in a computer system, an intrusion detection system (IDS) is used[19, 42]. The most common method used for uncovering intrusions is the examination of patterns of user activities[4, 8, 12, 19, 24, 34, 42].

It could be argued that the aforementioned method is quite laborious when done manually, since the data on user activities is massive in nature[25], e.g. 3–35 megabytes of data in an eight-hour period in the 1990s[11]. Hence, a number of researchers have studied and proposed the use of machine learning techniques to address the said problem[3, 34, 40, 35]. Among the said techniques is the use of the support vector machine (SVM) and the artificial neural network (ANN)[3, 5, 26, 34, 40, 35].

In machine learning, SVM is a classification technique that separates two classes of points in a dataset using a "maximum margin" line, simply, a line "in the middle" (see Figure 1)[2, 7, 34, 35]. On the other hand, an ANN is a computational model that represents the human brain, and shows how brain cells (or neurons) pass information to one another (see Figure 2)[14, 18, 27, 32, 36].

Figure 3: The proposed ESN-SVM model by Alalshekmubarak & Smith[2] for time series classification.

This research presents a modified version of the novel approach proposed by Alalshekmubarak & Smith[2], which combines the echo state network (ESN, a type of recurrent neural network, or RNN) and the SVM (see Figure 3), and uses it for intrusion detection instead of time series classification.

The proposed model will use recurrent neural networks (RNNs) with gated recurrent units (GRUs) in place of the ESN. RNNs are used for analyzing and/or predicting sequential data, which makes them a viable candidate for intrusion detection[36, 38]. In addition, the RNN architecture has already been used in a study for detecting credit card fraud through fraudulent behavior (by analyzing user logs)[3], but with the conventional Softmax as the final layer of the neural network.

Lastly, the data used in this study will be the 2013 network traffic data obtained by the honeypot systems of Kyoto University[43], which is sequential in nature, hence the use of RNNs.


1.2 Significance of the Study

Machine learning algorithms such as ANNs require a tremendous amount of computational resources to accomplish their task. This study is significant for the potential savings in computational cost and time required in a machine learning problem. The saved resources may then be used for other purposes.

For the particular problem domain addressed in the study, intrusion detection, the detection of unauthorized or illegal access to network systems may increase in speed, providing more time for tracing and eliminating the intrusion.

1.3 Scope and Limitation

Artificial Neural Network (ANN) is a kind of machine learning algorithm commonly used for classification problems. For ANNs to be able to categorize data into their classes, the conventional classification function used is the Softmax regression function. However, Softmax is built to provide a probability distribution over a range of classes, which is most beneficial in the task of multi-class classification. But there are problems that do not necessarily require a probabilistic approach, e.g. binary classification. A probabilistic approach incurs a relatively higher computational cost than its non-probabilistic counterpart, since it must satisfy the probability distribution, thus requiring a computational cost of O(n). On the other hand, a binary classifier such as SVM would only require a computational cost of O(1), as it does not need to satisfy a probability distribution.
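As a toy illustration of the non-probabilistic route, a binary SVM prediction reduces to the sign of a single linear score; the weights, bias, and input below are made-up values, not anything from this study:

import numpy as np

w = np.array([0.7, -0.2])   # learned weight vector (illustrative)
b = 0.1                     # learned bias (illustrative)
x = np.array([1.0, 2.0])    # one input sample

# the decision is the sign of one linear score, with no normalization over classes
decision = np.sign(w @ x + b)
print(decision)  # 1.0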

With efficiency as the primary concern, this study proposes a neural network architecture combining the Gated Recurrent Unit (GRU) RNN and the Support Vector Machine (SVM) for the purpose of binary/non-probabilistic classification.

To evaluate the proposed computational model, the problem domain considered in the study is intrusion detection. For the model to be able to identify intrusions, it will be trained using the 2013 Kyoto University network traffic data[43]. To measure the effectiveness and efficiency of the proposed model, the conventional GRU-Softmax model shall be used as a comparator on the same problem domain.

2 Literature Review

The ESN is a type of reservoir computing (RC) proposed by Maass (2002)[30], which uses the output of the network to train a simple linear read-out function. The read-out function is simply the classifier of the network, i.e. it determines the class of an input based on the output probability over possible classes (see Figure 4 for an example).

The way an ESN works resembles how an SVM works, but with a difference in the computational complexity issue: an ESN must have an extremely large reservoir (or simply, a collection of neurons) in order to achieve state-of-the-art performance[47]. Additionally, using a linear read-out function implies that a bigger data sample size would be needed, and more advanced regularization techniques must be employed, to the point that the size of the reservoir will be even bigger than the number of training samples[2, 47]. On the other hand, an SVM can deal with infinite space without suffering from the same issue[2].

Figure 4: Image from [22]. The red line shows all points in the space that get a score of zero for the "car" class. All points to the right of the red line have positive (and linearly increasing) scores, and all points to the left have negative (and linearly decreasing) scores.

The ESN model may be described in the following way[2, 7, 28]: First, an m by n matrix, denoted by W_in, is initialized randomly. Second, an n by n matrix, denoted by W_res, is initialized randomly as well, and scaled to obtain the desired dynamics. Then, the ESN model updates its state using the following equations[28]:

$$\tilde{x}(n) = f\left(W_{in}\,[1; u(n)] + W_{res}\,x(n-1)\right) \qquad (1)$$

$$x(n) = (1 - \alpha)\,x(n-1) + \alpha\,\tilde{x}(n) \qquad (2)$$

where u(n) is the input at time n, and f is a nonlinear transfer function, commonly the logistic sigmoid (Equation 3) or tanh (Equation 4):

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$

$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (4)$$

Lastly, the response of the reservoir and the class labels of the training data are used to train a linear read-out function, resulting in the learned weights of the output layer, W_out. This is accomplished using the following pseudo-inverse equation:

$$W_{out} = (W^T W)^{-1} W^T Y \qquad (5)$$
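A minimal NumPy sketch of this state update and read-out training follows; the reservoir size, spectral-radius scaling, leaking rate alpha, and random toy data are illustrative assumptions, not values from [2] or [28]:

import numpy as np

n_inputs, n_reservoir, n_outputs = 3, 100, 2
rng = np.random.default_rng(0)

W_in = rng.uniform(-0.5, 0.5, (n_reservoir, 1 + n_inputs))  # random input weights
W_res = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))  # random reservoir weights
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))     # scale to obtain the desired dynamics

def run_reservoir(U, alpha=0.3):
    """Collect reservoir states for an input sequence U of shape (T, n_inputs)."""
    x = np.zeros(n_reservoir)
    states = []
    for u in U:
        x_new = np.tanh(W_in @ np.concatenate(([1.0], u)) + W_res @ x)  # Eq. (1)
        x = (1 - alpha) * x + alpha * x_new                             # Eq. (2)
        states.append(x.copy())
    return np.array(states)

U = rng.standard_normal((200, n_inputs))   # toy input sequence
Y = rng.standard_normal((200, n_outputs))  # toy read-out targets
W = run_reservoir(U)                       # collected reservoir responses
W_out = np.linalg.pinv(W) @ Y              # Eq. (5), via the pseudo-inverse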


Figure 5: Image from [2]. The input signal is fed into the reservoir through the fixed weights W_in. Then, the reservoir processes these, and the output is used to train a linear read-out function.

The structure of the ESN model is graphically represented in Figure 5. Due to the abovementioned issue with the linear read-out function of the ESN, Alalshekmubarak & Smith (2013)[2] developed a model using SVM for the output layer of the ESN.

The SVM was developed by Vapnik[7], and it has been used in many real-world applications since then. Its main concept is to find the optimal hyperplane which separates two classes in a dataset by maximizing the margin[2, 7, 34, 35]. Only a subset of the training data points is used for estimating the generalization performance, hence the name support vectors. This is done by the main equation of SVM, which is used to estimate the decision function from a dataset D[15, 39]:

$$D = \left\{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\; y_i \in \{-1, +1\} \right\}_{i=1}^{n} \qquad (6)$$

$$f(x) = \mathrm{sign}\left( \sum_{n=1}^{l} y_n \alpha_n \cdot k(x, x_n) + b \right) \qquad (7)$$

where l is the number of support vectors, y_n ∈ {−1, +1} is the class sign to which the support vector belongs, k is a kernel function, and b is the bias term.
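A minimal sketch of evaluating the decision function of Eq. 7 follows; the support vectors, coefficients alpha, bias, and the RBF kernel choice are illustrative assumptions, not values learned from any dataset:

import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """One common choice of kernel function k(a, b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

support_vectors = np.array([[1.0, 2.0], [-1.0, -0.5]])
y_sv = np.array([1.0, -1.0])   # class signs of the support vectors
alpha = np.array([0.8, 0.6])   # coefficients from the optimization problem
bias = 0.1

def decision(x):
    # Eq. (7): sign of the kernel-weighted sum over the support vectors, plus the bias
    score = sum(y_sv[n] * alpha[n] * rbf_kernel(x, support_vectors[n])
                for n in range(len(alpha)))
    return np.sign(score + bias)

print(decision(np.array([0.5, 1.0])))  # +1.0 or -1.0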

α is obtained as the solution of the following optimization problem:

$$\min\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} \xi_i \qquad (8)$$

$$\text{s.t.}\quad y_i'(w \cdot \phi(x_i) + b) \geq 1 - \xi_i \qquad (9)$$

$$\xi_i \geq 0,\quad i = 1, \ldots, p \qquad (10)$$

where the ξ_i are slack variables penalizing margin violations, and C is the regularization parameter (either an arbitrary value or one selected using hyper-parameter tuning). The corresponding unconstrained optimization problem of Eq. 8 is the following:

$$\min\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} \max\left(0,\, 1 - y_i'(w^T x_i + b_i)\right) \qquad (11)$$


where w^T x_i + b_i is the function that returns the vector of scores for each class (i.e. the predicted classes). The objective in Eq. 11 is known as the primal form problem of the L1-SVM, with the standard hinge loss.
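A minimal NumPy sketch of evaluating the unconstrained L1-SVM objective of Eq. 11 follows; the data, weights, and C are illustrative values:

import numpy as np

def l1_svm_loss(w, b, X, y, C=0.5):
    """Eq. (11): 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(X_i.w + b))."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w + b))  # standard hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(margins)

X = np.array([[1.0, 2.0], [-1.0, -1.5], [0.5, -0.2]])
y = np.array([1.0, -1.0, 1.0])  # labels in {-1, +1}
print(l1_svm_loss(np.array([0.1, -0.3]), 0.0, X, y))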

The researchers replaced the simple linear read-out function of the ESN with the L1-SVM to solve the reservoir problem, and their approach is summarized as follows:

1. Map the input through W_in and pass it to the reservoir W_res.

2. Repeat the same procedure until the end of the input, and collect the response of the reservoir in X_collected.

3. Use X_collected and the target labels y to train a single SVM classifier in the binary classification problem (two classes only), or multiple SVM classifiers in the multi-classification problem.

4. Predict a new data point by using the mapping procedures described in the first and second steps, then applying the learned SVM classifier on the response of the network to determine the label of the new sample.

By using SVM in place of the read-out function, Alalshekmubarak & Smith (2013)[2] found that their proposed model, ESN-SVM, performs better in terms of accuracy (97.45%) than the ESN alone (96.91%). Furthermore, with a smaller reservoir size, ESN-SVM achieved higher accuracy than the ESN by more than 15%.

In another study, Tang (2013)[45] also proposed to use SVM as a classifier, the difference being that he used a Convolutional Neural Network (CNN) and the L2-SVM. Using his CNN-SVM model, he found excellent results when it was trained on the MNIST dataset (one of the standard datasets for image classification): an error of 0.87%, against an error of 0.99% when using Softmax as the final layer. He also used the same model (but with different hyper-parameters) on the CIFAR-10 dataset (another standard dataset for image classification), and found that his model was 2.1% more accurate than the conventional CNN-Softmax model.

The ESN is a class of RNN proposed in an attempt to solve the problems of training a traditional RNN, like the vanishing gradient problem[17, 28, 30, 47]. However, it has the caveat of the computational complexity issue, as explained before. Among other proposals to solve the problems of traditional RNNs is the use of Long Short-Term Memory (LSTM) units.

Figure 6: Image from [38]. An unrolled traditional recurrent neural network (RNN). The chain-like nature of the RNN implies that it is related to sequences and lists, i.e. data of a sequential nature.

The LSTM was proposed not only to solve the vanishing gradient and exploding gradient problems, but also to solve the long-term dependency problem of RNNs. Simply put, it is cumbersome for a traditional RNN to "remember" information when there is a considerable gap in between, e.g. "I grew up in France ... I speak fluent French."[9, 38]. In the aforementioned example, the next word is probably the name of the language of the speaker, but to predict that, the neural network has to discover the context of France from further back[9, 38].

Figure 7: Image from [38]. The traditional RNN has a single neural network layer (tanh).

Figure 8: Image from [38]. The LSTM network has four neural network layers.

Consequently, as the gap increases, RNNs are unable to learn to connect the information[9, 38]. This is where LSTMs come in. The LSTM is a special kind of RNN, which is capable of learning long-term dependencies[3, 9, 38]. It was explicitly designed to solve the long-term dependency problem using four neural network layers (see Figure 8), instead of a single neural network layer as in a traditional RNN (see Figure 7).

Let Figure 9 be the notation guide for the walkthrough on LSTM networks. The key idea in LSTMs is the cell state, which acts like a conveyor belt, carrying information with minimal transformation as it goes through each node (see Figure 10)[38].

The LSTM has the ability to manipulate the information in a cell state, i.e. add or remove information, by using gates. The following gates are used in an LSTM network[9, 38]: (1) forget gate, (2) input gate, (3) update gate, and (4) output gate.

The first step in an LSTM network is to decide which information is going to be kept in the cell state. This is accomplished using a sigmoid σ layer called the "forget gate layer" (see Figure 11); it looks at h_{t−1} (the internal cell state) and x_t (the real input), then outputs a number between 0 and 1 for each number in the cell state C_{t−1}. A 1 represents "completely keep this", while a 0 represents "completely forget this".

Figure 9: Image from [38]. As per Figure 8, each line carries a vector from the output of one node to the inputs of others. The pink circles represent pointwise operations (e.g. vector addition), while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while lines forking mean the content is being copied, and the copies go to different locations.

Figure 10: Image from [38]. Only minor linear transformations are done on the information passing through a cell state.

Figure 11: Image from [38]. A "forget" gate determines whether to completely forget information (output of 0) or to completely keep it (output of 1).

The next step is to determine what new information shall be stored in the cell state, which is divided into two parts (see Figure 12): (1) a sigmoid σ layer called the "input gate layer", which decides what values must be updated; and (2) a tanh layer that creates a vector of new candidate values C̃_t that may be added to the cell state.


Figure 12: Image from [38]. The output of the "input" gate is combined with the candidate value C̃_t.

Figure 13: Image from [38]. Update the cell state by "forgetting" the decided information (C_{t−1} ∗ f_t), then add the result to the candidate value i_t ∗ C̃_t.

To update the old cell state C_{t−1}, it is multiplied by f_t, forgetting the information that was decided to be forgotten in the first step. Then, the scaled candidate values i_t ∗ C̃_t are added to the result (see Figure 13).

Lastly, to determine the output of a cell state (see Figure 14), the sigmoid σ layer decides which parts of the cell state should be the output. The cell state then runs through a tanh gate, so that the values are transformed to the range [−1, +1], and is multiplied by the output of the sigmoid σ gate, so that only the decided parts become the cell output at time t. The resulting value is then passed on to the next cell, to go through the same procedure again while considering the new input x_t at the current time t.

Figure 14: Image from [38]. The output is based on a filtered cell state using tanh (to transform the values to the range [−1, +1]), multiplied by the output of the sigmoid σ gate.
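Gathering the walkthrough into one place, the following is a minimal single-step LSTM sketch in NumPy; the weight shapes, the concatenation [h_{t−1}; x_t], and the toy dimensions are illustrative assumptions:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])   # combined [h_{t-1}; x_t]
    f = sigmoid(W_f @ z + b_f)          # forget gate: 0 = forget, 1 = keep
    i = sigmoid(W_i @ z + b_i)          # input gate: which values to update
    C_tilde = np.tanh(W_c @ z + b_c)    # candidate values
    C = f * C_prev + i * C_tilde        # updated cell state
    o = sigmoid(W_o @ z + b_o)          # output gate
    h = o * np.tanh(C)                  # filtered output
    return h, C

h_dim, x_dim = 4, 3
rng = np.random.default_rng(1)
W = [rng.standard_normal((h_dim, h_dim + x_dim)) for _ in range(4)]
b = [np.zeros(h_dim) for _ in range(4)]
h, C = lstm_step(rng.standard_normal(x_dim), np.zeros(h_dim), np.zeros(h_dim),
                 W[0], W[1], W[2], W[3], b[0], b[1], b[2], b[3])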

Despite the effectiveness of the LSTM[9, 21, 38], a recent variation of it was developed, called the Gated Recurrent Unit (GRU)[6]. The GRU model is said to be more efficient than the LSTM model[6, 9, 38], since it combines the "forget" gate and "input" gate into a single "update" gate. In addition, it merges the cell state C_t and hidden state h_t, and the result is a simpler model than the standard LSTM[38] (see Figure 15).

Figure 15: Image from [38]. The GRU model combines the "forget" gate and "input" gate into a single "update" gate, making it simpler than the LSTM model.
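A minimal single-step GRU sketch in NumPy follows, in which the update gate z plays the combined role of the LSTM "forget" and "input" gates, and there is no separate cell state; the weight shapes and toy dimensions are illustrative assumptions:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde              # merged hidden state

h_dim, x_dim = 4, 3
rng = np.random.default_rng(2)
mats = [rng.standard_normal(s) for s in [(h_dim, x_dim), (h_dim, h_dim)] * 3]
h = gru_step(rng.standard_normal(x_dim), np.zeros(h_dim), *mats)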

One of the conventional uses of such RNNs is the classification task, where the operation is on sequences of vectors, i.e. non-fixed-size vectors, unlike in vanilla neural networks and convolutional neural networks[21]. A few examples were laid down by [21] (see Figure 16). For this study, the task of interest is sequence input (the 3rd case in Figure 16), where the sequential data input contains features which will give the machine the information to determine the output (i.e. the state of the network, whether "under attack" or "normal").

Figure 16: Image from [21]. Rectangles represent vectors, and arrows represent functions. Input vectors are red, output vectors are blue, and green vectors represent the RNN state. From left to right: (1) vanilla processing without an RNN, from fixed-size input to fixed-size output (e.g. image classification); (2) sequence output (e.g. image captioning, where an image is the input and the output is a sentence); (3) sequence input (e.g. sentiment analysis, where a sentence is classified as expressing either a positive or negative remark); (4) sequence input and output (e.g. an RNN reads a sentence in English, then outputs a sentence in French); and (5) synced sequence input and output (e.g. video classification, where the aim is to label each frame of the video).

Figure 17: Image from [13]. The GRU model, which uses a Softmax classifier for its output.

Similar to other neural networks, an RNN has to pass its output through an activation function. The widely-used activation nowadays is the Softmax activation function[9, 22] (see Figure 17). The present study proposes to replace the Softmax classifier with an SVM classifier, as [2] did with the linear read-out function of the ESN.

The main motivation behind this amendment to the RNN model is the performance of an SVM classifier over a Softmax classifier, i.e. not only does the SVM provide stable results, but it also trains faster than Softmax[22]. Additionally, if a new class is introduced into the classification task, the Softmax classifier will recalculate everything, which is beneficial only for a multi-classification problem. However, in this study, the problem of concern is binary classification, i.e. whether there is an intrusion in the system or none, two classes only. Hence, the use of SVM.

3 Statement of the Problem

Artificial neural networks (ANNs) commonly use the Softmax activation function as their classifier. However, in a binary classification problem, there is a classifier that is practically better than the Softmax activation function: the SVM classifier. The SVM is better than Softmax in such a problem because once it reaches the optimal hyperplane (a decision function in the form of a linear function), it is already satisfied. In contrast, the Softmax[21] classifier needs to satisfy a probability distribution, iterating through the N-dimensional vector. Thus, SVM computes and trains faster than Softmax.

Since the primary concern of intrusion detection is to determine whether there is an attack (intrusion) in the system or none[19, 42], there will only be two classes: (1) the "under attack" state, and (2) the "normal" state.

The present paper then proposes an approach combining an RNN (with GRU) and SVM (as the classifier) for intrusion detection systems (IDS). SVM is arguably the most suitable machine learning technique for a binary classification problem, considering that it was developed primarily for that problem[7].

4 Methodology

4.1 Machine Intelligence Library

To implement the neural network models in this study, both the proposed model and its comparator, the open-source machine intelligence library by Google, TensorFlow[1], was used. TensorFlow enables scientists and engineers to design, develop, and deploy computational models through the use of data flow graphs (see Figure 18). The nodes in the graph represent mathematical operations, while the edges represent the tensors communicated between them.

Figure 18: TensorFlow graph of GRU+SVM for MNIST classification.
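As a minimal illustration (not code from the study), the following builds and runs a tiny TensorFlow 1.x data flow graph, in which the nodes are operations and the edges carry tensors:

import tensorflow as tf

# two constant nodes and an operation node; nothing is computed
# until the graph is run in a session
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')
total = tf.add(a, b, name='total')

with tf.Session() as sess:
    print(sess.run(total))  # 7.0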

4.2 The Dataset

The 2013 network traffic data obtained by the honeypot systems of Kyoto University[43] (one of the updates on their 2006 dataset) was used for the training, validation, and testing of the proposed GRU-SVM model for intrusion detection.

The said dataset has 24 statistical features[43]: (1) 14 features from the KDD Cup 1999 dataset[44], and (2) 10 additional features which, according to Song, Takakura, & Okabe (2006)[43], might be pivotal in a more effective investigation of the occurrences in computer system networks. The following is a summary of the different connection states (dataset feature #14, flag) in the dataset[43]:

1. S0: Connection attempt seen, no reply.

2. S1: Connection established, not terminated.

3. SF: Normal establishment and termination.

4. REJ: Connection attempt rejected.

5. S2: Connection established and close attempt by originator seen (but no reply from responder).

6. S3: Connection established and close attempt by responder seen (but no reply from originator).

7. RSTO: Connection established, originator aborted (sent a RST).

8. RSTR: Established, responder aborted.

9. RSTOS0: Originator sent a SYN followed by a RST; we never saw a SYN ACK from the responder.

10. RSTRH: Responder sent a SYN ACK followed by a RST; we never saw a SYN from the (purported) originator.

11. SH: Originator sent a SYN followed by a FIN; we never saw a SYN ACK from the responder (hence the connection was half open).

12. SHR: Responder sent a SYN ACK followed by a FIN, we never saw a SYN from the originator.

13. OTH: No SYN seen, just midstream traffic (a partial connection that was not later closed).

14 features as per the KDD Cup 1999 dataset[25, 44, 43]:

1. Duration: the length of the connection (in seconds)

2. Service: service type of the connection, e.g., http, telnet, etc.

3. Source bytes: the number of data bytes sent by the source IP address

4. Destination bytes: the number of data bytes sent by the destination IP address

5. Count: the number of connections whose source IP address and destination IP address are the same as those of the current connection in the past two seconds

6. Same srv rate: % of connections to the same service in Count feature

7. Serror rate: % of connections that have SYN errors in Count feature


8. Srv serror rate: % of connections that have SYN errors in the Srv count feature (the number of connections whose service type is the same as that of the current connection in the past two seconds)

9. Dst host count: among the past 100 connections whose destination IP address is the same as that of the current connection, the number of connections whose source IP address is also the same as that of the current connection

10. Dst host srv count: among the past 100 connections whose destination IP address is the same as that of the current connection, the number of connections whose service type is also the same as that of the current connection

11. Dst host same src port rate: % of connections whose source port is the same as that of the current connection in the Dst host count feature

12. Dst host serror rate: % of connections that have SYN errors in the Dst host count feature

13. Dst host srv serror rate: % of connections that have SYN errors in the Dst host srv count feature

14. Flag: the state of the connection at the time the summary was written (which is usually when the connection terminated).

10 additional features as per [43]:

1. IDS detection: reflects whether an IDS triggered an alert for the connection; 0 means no alerts were triggered, and an Arabic numeral (except 0) indicates the kind of alert. Parentheses indicate the number of times the same alert was observed during the connection.

2. Malware detection: indicates whether malware, also known as malicious software, was observed in the connection; 0 means no malware was observed, and a string indicates the corresponding malware observed in the connection. Parentheses indicate the number of times the same malware was observed during the connection.

3. Ashula detection: means whether shellcodes and exploit codes were used in the connection; 0 means no shellcodes and exploit codes were observed, and an Arabic numeral (except 0) indicates the kind of shellcode or exploit code. Parentheses indicate the number of times the same shellcode or exploit code was observed during the connection.

4. Label: indicates whether the session was an attack or not; 1 means the session was normal, -1 means a known attack was observed in the session, and -2 means an unknown attack was observed in the session.

5. Source IP Address: indicates the source IP address used in the session (in IPv6).

6. Source Port Number: indicates the source port number used in the session.

7. Destination IP Address: indicates the destination IP address used in the session (in IPv6).

8. Destination Port Number: indicates the destination port number used in the session.

9. Start Time: indicates when the session was started.

10. Duration: indicates how long the session was established.


A sample from the dataset to be used is given below:

Listing 1: Sample data from 2013 Kyoto University network traffic data

[1] 0.000000 other 0 0 0 0.00 0.00 0.00 0 0 0.00 0.00 0.00 S0 0 0 0 -1 fd75:41fb:cf76:20be:183a:1595:1ea6:5922 2461 fd75:41fb:cf76:39ef:7d8b:279c:615c:0d4d 445 00:00:00 tcp
[2] 6.127728 smtp 1304 781 0 0.00 0.00 0.00 0 0 0.00 0.00 0.00 SF 0 0 0 1 fd75:41fb:cf76:cb0f:4eb9:31ed:35d3:199a 44040 fd75:41fb:cf76:dc4c:7d2c:2705:07b2:0f45 25 00:00:00 tcp
[3] 653.116991 other 54073 4664 0 0.00 0.00 0.00 0 0 0.00 0.00 0.00 SH 0 0 0 1 fd75:41fb:cf76:301f:3494:6eec:0025:17d9 3879 fd75:41fb:cf76:dc4c:7d2c:2705:07b2:0f45 25 00:00:00 tcp

Since feature #4 of the 10 additional features refers to the label of a log (a row of record), i.e. 1 means a normal session, -1 means a known attack was observed, and -2 means an unknown attack was observed, the label was converted to a modified unit vector in i dimensions, e_i ∈ {−1, +1}, instead of the conventional e_i ∈ {0, 1}. More to the point:

Label                       e_0   e_1
+1 (normal)                 +1    -1
-1, -2 (attack observed)    -1    +1

This adjustment is due to the SVM classes being y ∈ {−1, +1}, and it was done programmatically, first using the DataFrame.apply() function of pandas[31] (see Listing 2), then TensorFlow's[1] one-hot encoder, tf.one_hot() (see Listing 3).

Listing 2: Using the DataFrame().apply() function of pandas[31] to convert the labels to 0, 1

..
# there is an attack if label == -1 or -2 (replace with 1), otherwise normal (replace with 0)
df['label'] = df['label'].apply(lambda label: 1 if label == -1 or label == -2 else 0)

Listing 3: Code snippet for one-hot encoding using TensorFlow[1]

import tensorflow as tf
...
num_classes = 2
...
tf.one_hot(tf.cast(label_batch, tf.uint8), num_classes, 1.0, -1.0,
           name="label_batch_onehot")

4.3 Data Preprocessing

The dataset contains logs for 360 days of the year 2013, with 16.2 GB of data in total. Only the logs for the following dates are missing: (1) March 2–4, and (2) October 13–14, totalling 5 days. The reason the data for those days are unavailable was not stated.

However, for the initial experiment, only 25% of the whole Kyoto University 2013 network traffic dataset was used, i.e. 4.1 GB of data (from January 1, 2013 to June 1, 2013). Before using the dataset for neural network training, it was normalized first: standardization (for continuous data) and indexing (for categorical data). To determine which continuous features in the dataset must be standardized, the summary of the dataset was extracted using the DataFrame().describe() function of pandas[31] (see Listing 4, reproduced after Table 1). As for which features in the dataset were categorical, the data descriptions from [34] and [43] were used as a reference (see Table 1).

Table 1: Summary of Selected Dataset Features[34, 43]

Feature Name                   Sample Unique Data                                                  Number of Unique Data   Type
duration                       [ 0. 10.328423 2.996646 ..., 3.174734 0.433614 3.473215 ]          5206384                 Continuous
service                        ['other' 'smtp' 'http' 'sip' 'rdp' 'ssh' 'dns' 'smtp,ssl' 'ssl'
                               'snmp' 'ftp' 'http,socks' 'ftp-data' 'krb_tcp']                    14                      Categorical
src bytes                      [ 0 895 824 ..., 26283 26321 24999 ]                                30263                   Continuous
dest bytes                     [ 0 385 262 ..., 2438007 5759 7642 ]                                11917                   Continuous
count                          [ 0 1 2 3 ..., 98 99 100 ]                                          101                     Continuous
same srv rate                  [ 1. 0. 0.33 0.5 ..., 0.66 0.32 0.37 ]                              98                      Continuous
serror rate                    [ 1. 0. 0.5 0.33 ..., 0.49 0.48 0.46 ]                              101                     Continuous
srv serror rate                [ 0.33 0.5 0.4 0. ..., 0.28 0.26 0.23 ]                             101                     Continuous
dst host count                 [ 26 0 1 27 ..., 97 98 99 ]                                         101                     Continuous
dst host srv count             [ 28 14 15 ..., 96 97 98 ]                                          101                     Continuous
dst host same src port rate    [ 0.04 0. ..., 0.39 0.37 ]                                          100                     Continuous
dst host serror rate           [ 0.92 0. 0.93 ..., 0.49 0.02 0.16 ]                                101                     Continuous
dst host srv serror rate       [ 0.86 0. ..., 0.01 0.99 ]                                          101                     Continuous
flag                           ['S0' 'SF' 'RSTO' 'SH' 'RSTOS0' 'REJ' 'OTH' 'RSTRH' 'S1' 'S2'
                               'SHR' 'RSTR' 'S3']                                                  13                      Categorical
ids detection                  ['0' '22114-1-5(1)' '14782-1-15(1)' ...]                            490                     Categorical
malware detection              ['0' 'Email.Trojan.Trojan-805(1)' 'Win.Worm.Kido-113(1)' ...]      372                     Categorical
ashula detection               ['0' '349(2)' '349(1)' ...]                                         39                      Categorical
label                          [-1, 1, -2]                                                         3                       Categorical
src port num                   [ 4969 1327 ..., 995 469 ]                                          65535                   Continuous
dst port num                   [ 445 25 ..., 16015 39720 ]                                         60935                   Continuous
start time                     ['00:00:00' '00:00:01' ..., '01:22:31' '03:31:46']                  86399                   Continuous
protocol                       ['tcp' 'udp' 'icmp']                                                3                       Categorical

Listing 4: Features Summary extracted using pandas[31]

duration src_bytes dest_bytes count same_srv_rate

count 4.656124e+06 4.656124e+06 4.656124e+06 4.656124e+06 4.656124e+06

mean 5.629711e+00 7.620859e+03 8.519262e+03 1.819146e+00 2.604244e-01

std 1.747110e+02 3.446973e+06 3.346247e+06 8.298907e+00 4.228922e-01

min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

25% 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

50% 2.866777e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

75% 3.419435e+00 1.280000e+02 2.090000e+02 1.000000e+00 5.000000e-01

max 8.097288e+04 2.133443e+09 2.116371e+09 1.000000e+02 1.000000e+00

serror_rate srv_serror_rate dst_host_count dst_host_srv_count

count 4.656124e+06 4.656124e+06 4.656124e+06 4.656124e+06

mean 5.863585e-02 4.008782e-01 1.056284e+01 2.819324e+01

std 2.289534e-01 4.238607e-01 2.233580e+01 2.824031e+01

min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

25% 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

50% 0.000000e+00 3.300000e-01 0.000000e+00 3.000000e+01

75% 0.000000e+00 9.500000e-01 5.000000e+00 5.000000e+01

max 1.000000e+00 1.000000e+00 1.000000e+02 1.000000e+02

dst_host_same_src_port_rate dst_host_serror_rate

count 4.656124e+06 4.656124e+06

mean 3.256177e-02 1.405558e-01

std 1.718919e-01 3.280353e-01

min 0.000000e+00 0.000000e+00

25% 0.000000e+00 0.000000e+00

50% 0.000000e+00 0.000000e+00

75% 0.000000e+00 0.000000e+00

max 1.000000e+00 1.000000e+00

dst_host_srv_serror_rate label src_port_num dst_port_num

count 4.656124e+06 4.656124e+06 4.656124e+06 4.656124e+06

mean 2.122936e-01 -3.552655e-01 2.292569e+04 1.648552e+03

std 3.864984e-01 9.440645e-01 2.250753e+04 6.820971e+03

min 0.000000e+00 -2.000000e+00 0.000000e+00 0.000000e+00

25% 0.000000e+00 -1.000000e+00 3.028000e+03 2.500000e+01

50% 0.000000e+00 -1.000000e+00 6.000000e+03 8.000000e+01

75% 0.000000e+00 1.000000e+00 4.522200e+04 4.450000e+02

max 1.000000e+00 1.000000e+00 6.553500e+04 6.553500e+04

start_time

count 4.656124e+06

mean 1.216933e+01

std 7.080260e+00

min 0.000000e+00

25% 5.916667e+00

50% 1.228806e+01

75% 1.844000e+01

max 2.399944e+01

The features not included in the features summary (Listing 4) were the categorical features (except for label), i.e. service, flag, ids detection, malware detection, and ashula detection. Meanwhile, the entity features source IP address and destination IP address were removed from the features used in the study.

As can be noticed in Listing 4, almost every feature has a standard deviation (denoted by std) not equal to 1. Thus, the variability among the features is not balanced. Consequently, the machine learning algorithm would improperly assign larger relevance to features that have larger variability (i.e. standard deviation) than the others. In other words, some features may be seen as more important than others[16]. Hence, every feature must be standardized.

Standardization is done using the standard score formula

$$z = \frac{X - \mu}{\sigma}$$

where X is the value to be standardized, i.e. a feature value such as duration_1 = 10.328423 (from Listing 1), μ is the mean value of the said feature, i.e. 5.629711e+00 (from Listing 4), and σ is the standard deviation of the said feature, i.e. 1.747110e+02 (from Listing 4), or 174.7110 in decimal notation. Evaluating the said values using the standard score formula,

$$z = \frac{10.328423 - 5.629711}{174.711} = 0.02689419670198213$$

However, for efficiency and simplicity, the preprocessing.StandardScaler().fit_transform() function of Scikit-learn[41] was used for the data standardization in this study. That is,

Listing 5: Code snippet for data standardization using Scikit-learn

from sklearn import preprocessing

..

df[cols_to_std] = preprocessing.StandardScaler().fit_transform(df[cols_to_std])

As for the feature indexing, the categories are mapped to [0, n − 1], e.g. the data below are converted to their respective indices, making them numerical instead of symbolic values.

Table 2: Indexing of protocol feature values

Feature   Index
tcp       0
udp       1
icmp      2

From the vector protocol = [tcp, udp, icmp] → protocol = [0, 1, 2]. Similar to how the data standardization was done, the preprocessing.LabelEncoder().fit_transform() function of Scikit-learn[41] was used for the data indexing in this study. That is,

Listing 6: Code snippet for data indexing using Scikit-learn

from sklearn import preprocessing
..
# LabelEncoder works on a single column at a time, so each
# categorical column is indexed in turn
for col in cols_to_index:
    df[col] = preprocessing.LabelEncoder().fit_transform(df[col])

Lastly, there are some features in the dataset that cannot be directly normalized using standardization or indexing due to data type conflicts. Those features were start time, malware detection, ashula detection, and ids detection.

First, it can be noticed that the values of the start time feature are in the conventional HH:MM:SS format. Based on the said time format, the index for HH would be [0], then [1] for MM, and [2] for SS. Before standardization, those values were parsed to their floating-point number equivalent using the code snippet in Listing 7.

Listing 7: Using the DataFrame().apply() function of pandas[31] to convert time data to its floating-point number equivalent

import pandas as pd
..
df = pd.DataFrame()
..
# HH:MM:SS -> hours as a float: HH + MM/60 + SS/3600
df['start_time'] = df['start_time'].apply(lambda time: int(time.split(':')[0]) +
    (int(time.split(':')[1]) * (1 / 60)) + (int(time.split(':')[2]) * (1 / 3600)))

Second, the following features, start time, malware detection, ashula detection, and ids detection, were of a mixed data type, i.e. integer and string (as described in [43], and as shown in Table 1). The said features were discretized using a conditional statement, that is,

$$f(x) = \begin{cases} 1 & x \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

where x represents the feature value for each of start time, malware detection, ashula detection, and ids detection. The conditional statement was also implemented using the DataFrame().apply() function of pandas[31] (see Listing 8).

Listing 8: Using the DataFrame().apply() function of pandas[31] to discretize mixed integer-string data

..
df['malware_detection'] = df['malware_detection'].apply(lambda malware_detection: 1 if malware_detection != '0' else 0)
df['ashula_detection'] = df['ashula_detection'].apply(lambda ashula_detection: 1 if ashula_detection != '0' else 0)
df['ids_detection'] = df['ids_detection'].apply(lambda ids_detection: 1 if ids_detection != '0' else 0)

Using the data from Listing 1 for a sample normalization (standardization and indexing) will yield the following normalized data:

Listing 9: Normalized version of the sample data from Listing 1

[1] -0.7170812036204817,0,-0.7328336381153805,-0.8898704745742892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,-0.7435712019264632,1.414213562373095,0.0,0
[2] -0.6970852365248442,1,-0.6810640371563657,-0.5069565127877769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,1.4135755993550505,-0.7071067811865475,0.0,0
[3] 1.4141664401453258,0,1.413897675271746,1.3968269873620662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,0,-0.6700043974285872,-0.7071067811865475,0.0,0

The normalized dataset, particularly the continuous features, was in the form of floating-point numbers. Hence, training any neural network model using such data would be computationally intensive. So, to reduce the intensive computational-resource requirement, the continuous features were binned (decile binning, a quantization/discretization technique). Not only does it reduce the required computational cost of a model, but it also improves the classification performance on datasets[29]. To bin the normalized dataset with continuous features, the 10th, 20th, ..., 90th, and 100th quantiles are taken, and their indices serve as the bin number or bin label. This process was done using the qcut() function of pandas[31] (see Listing 10). Taking the first feature from the normalized data as an example,

[ -0.7170812036204817, -0.6970852365248442, 1.4141664401453258]

Listing 10: Quantile-based discretization using qcut() of pandas[31]

import pandas as pd

..

data = [ -0.7170812036204817, -0.6970852365248442, 1.4141664401453258]

# quantization

pd.qcut(data, 10)

# yields the following result

# [(-0.718, -0.713], (-0.701, -0.697], (0.992, 1.414]]

# binning

pd.qcut(data, 10, labels=False)

# yields the following result (the indices in the quantile distribution)

# array([0, 4, 9])

After binning, the continuous features were one-hot encoded (with 10 as the depth, since it was decile binning) for use in the neural network models. The binned data from Listing 10, array([0, 4, 9]), will result in the following one-hot encoded data:

Listing 11: One-hot encoded data of the result from Listing 10

import tensorflow as tf
..
# depth of 10, since decile binning was used; on_value 1.0, off_value 0.0
tf.one_hot(pd.qcut(data, 10, labels=False), 10, 1.0, 0.0)
..
# the first vector represents the one-hot encoded bin 0
# the second vector represents the one-hot encoded bin 4
# the third vector represents the one-hot encoded bin 9
[
  [ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
  [ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]
]

4.4 The GRU-SVM Neural Network Architecture

Similar to the work done by Alalshekmubarak & Smith (2013)[2], the present paper proposes to use SVM as the classification function in an RNN. The difference is that instead of an ESN, the RNN class used in this study is the GRU model (see Figure 19).

Figure 19: The proposed GRU-SVM architecture model, with 24 GRU cells and SVM as the classification function.

Figure 20: Image from [3]. The standard RNN structure.

The RNN consists of three layers: (1) an input layer, (2) a hidden layer, and (3) an output layer. The number of units in the input layer is the number of features used in the classification problem, i.e. twenty-one, as enumerated in the previous subsection. The hidden layer processes the units from the input layer, and in a regular RNN (see Figure 20), its parameters (weights and biases) are learned using either backpropagation through time (BPTT)[48] or real-time recurrent learning (RTRL)[49]. However, the BPTT learning algorithm is the one commonly used, for its simple structure and fast learning[3]. But since the RNN used in this study is the GRU, the network parameters are learned using its gating mechanism. Lastly, the output layer has two units, each representing a classification: "under attack" or "normal".

The goal in training RNNs is to minimize the error (also known as the loss), which is usually computed using cross-entropy (see Equation 12):


$$E_{y'}(y) = -\sum_i y_i' \log(y_i) \qquad (12)$$

where y′ is the actual label (also called the true distribution), and y is the predicted label (also called the predicted probability distribution), both in one-hot encoded form. The said equation is the one used for determining the error of a neural network with the Softmax function (see Eq. 13) as its activation function.

$$\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}} \qquad (13)$$

Since this proposal is about combining GRU and SVM, the error of the neural network shall be computed using the loss function of SVM, i.e. recalling Eq. 11:

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} \max\left(0,\, 1 - y_i'(w^T x_i + b_i)\right)$$

As with cross-entropy, the goal is to minimize the error computed using the above equation; hence the min in Eq. 11. However, the L1-SVM is not differentiable. A popular variation known as the L2-SVM is differentiable and is more stable than the L1-SVM:

$$\min\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} \max\left(0,\, 1 - y_i'(w^T x_i + b_i)\right)^2 \qquad (14)$$

For the proposed neural network architecture, GRU-SVM, the L2-SVM shall be used. The result of each forward computation is a vector of scores over the classes, so, to predict the class label y of a data point x, the argmax function shall be used:

$$y = \mathrm{argmax}(\text{scores})$$

The argmax function will return the index of the largest value across the vector (containing the predicted class scores).
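As a rough illustration of how this loss can be wired up, the following is a minimal TensorFlow 1.x sketch of the L2-SVM loss of Eq. 14 over the predictor wx + b, together with the argmax prediction; the shapes (8-dimensional GRU outputs, two classes), the variable names, and C = 0.5 are illustrative assumptions, not the study's actual implementation:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 8])    # e.g. last outputs of the GRU
y_ = tf.placeholder(tf.float32, [None, 2])   # actual labels in {-1, +1}
w = tf.Variable(tf.random_normal([8, 2]))
b = tf.Variable(tf.constant(0.1, shape=[2]))
C = 0.5

scores = tf.matmul(x, w) + b                           # predictor wx + b
l2_reg = 0.5 * tf.reduce_sum(tf.square(w))             # (1/2)||w||^2
hinge = tf.square(tf.maximum(0.0, 1.0 - y_ * scores))  # squared hinge loss
loss = l2_reg + C * tf.reduce_sum(hinge)               # Eq. (14)
predicted_class = tf.argmax(scores, axis=1)            # argmax over class scores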

The previously-described RNN model has the caveat of difficulty in learning due to problems like the vanishing gradient and exploding gradient, caused by a large amount of data which leads to many epochs during training. This is why the proposed RNN model is the GRU, an improvement on the LSTM network which solves the said problems, as described in the Literature Review section.

As there will only be two states in the classification problem, it is computationally practical to use the SVM classifier over the Softmax classifier[22]. The output of the GRU model shall be passed to the SVM classifier to learn the two states of the computer network. The proposed GRU-SVM model shall be implemented in Python using Google's TensorFlow library[1] (see Figure 21 for the graph of the proposed model).

The following is a set of sample data from the training dataset used in this study, together with weights and biases initialized with arbitrary values:

Listing 12: Sample data consisting of actual training labels and initialized weights and biases

y_ = [[-1.  1.] [-1.  1.] [-1.  1.] [ 1. -1.] [-1.  1.] [-1.  1.] [ 1. -1.] [ 1. -1.]]
weights = [[-0.53593349 -0.38771236]
           [ 0.34040833 -0.57094181]
           [ 0.4464798   0.30200231]
           [ 0.06995993  0.20755154]
           [-0.03920161 -0.02265304]
           [ 0.1920733   0.54813659]
           [-0.13325268 -0.0935626 ]
           [ 0.21997619 -0.70638591]]
biases = [ 0.09  0.11]


Figure 21: TensorFlow graph of the proposed GRU-SVM neural network architecture.

For this study, instead of implementing the formulas for the GRU by hand, the tensorflow.contrib.rnn.GRUCell class of TensorFlow[1] was used. The GRU model is then implemented using tensorflow.nn.dynamic_rnn, which returns outputs (the output of the RNN) and states (the "memory" values of the RNN at a given time step t); a minimal wiring sketch is given after Listing 13. Assume that the following values are the output of an RNN (GRU):

Listing 13: Sample RNN output data which will be used as the predictor variable x in wx_i + b_i

last = [[ 0.14924656  0.03880657  0.29684058 -0.16149738 -0.13976324 -0.19014949 -0.02387631 -0.26251662]
        [ 0.28059918 -0.03701306  0.24274327  0.18435474 -0.18067953 -0.06389535 -0.15113145 -0.3013874 ]
        [ 0.13199142  0.01253928  0.29239678 -0.16836613 -0.16880587 -0.14568888 -0.077148   -0.21887848]
        [ 0.30739477 -0.01279039  0.31378055  0.24965775 -0.21046136 -0.12470452 -0.16175538 -0.32501441]
        [ 0.3757236  -0.05380303  0.24592593  0.20109028  0.05041061 -0.16579902 -0.07312123 -0.19584617]
        [-0.04504196 -0.0084012   0.21902309 -0.27880833 -0.05458357 -0.11695549 -0.07514973 -0.05324463]
        [ 0.09624991  0.0265259   0.42690384 -0.17606544 -0.10266212 -0.2437247  -0.19277431 -0.17095077]
        [ 0.27413425 -0.0088871   0.38139969  0.14770164 -0.16405088 -0.19591859 -0.25730479 -0.35412788]]
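A minimal sketch of the GRUCell/dynamic_rnn wiring described above (TensorFlow 1.x) follows; the batch, sequence-length, and feature dimensions are illustrative assumptions, with the cell size of 8 chosen only to match the 8-dimensional outputs in Listing 13:

import tensorflow as tf

seq_len, n_features, cell_size = 21, 10, 8
x_input = tf.placeholder(tf.float32, [None, seq_len, n_features])

cell = tf.contrib.rnn.GRUCell(cell_size)
# outputs holds the GRU output at every time step; states the final "memory"
outputs, states = tf.nn.dynamic_rnn(cell, x_input, dtype=tf.float32)

# output at the last time step, used as the predictor variable x in wx + b
last = outputs[:, -1, :]   # shape: (batch, cell_size)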


Using the predictor function wx_i + b_i, the data from Listings 12 and 13 are evaluated to produce the predicted labels y; here x is the 8 × 8 matrix of GRU outputs (Listing 13), w is the 8 × 2 weight matrix (Listing 12), and b is the bias vector:

$$y = xw + b = \begin{bmatrix} 0.14924656 & 0.03880657 & \cdots & -0.26251662 \\ 0.28059918 & -0.03701306 & \cdots & -0.3013874 \\ \vdots & \vdots & \ddots & \vdots \\ 0.27413425 & -0.0088871 & \cdots & -0.35412788 \end{bmatrix} \begin{bmatrix} -0.53593349 & -0.38771236 \\ 0.34040833 & -0.57094181 \\ \vdots & \vdots \\ 0.21997619 & -0.70638591 \end{bmatrix} + \begin{bmatrix} 0.09 & 0.11 \end{bmatrix}$$

$$y = \begin{bmatrix} 0.05884931 & 0.17271662 \\ -0.00305368 & 0.3300183 \\ 0.08306687 & 0.19082335 \\ 0.01282253 & 0.32583345 \\ -0.07296827 & 0.16421444 \\ 0.16754072 & 0.12230966 \\ 0.17102661 & 0.15744613 \\ 0.04586467 & 0.32517922 \end{bmatrix}$$

Then, taking the argmax over each row of the predicted labels y gives the following result:

$$y = \mathrm{argmax}\begin{bmatrix} 0.05884931 & 0.17271662 \\ -0.00305368 & 0.3300183 \\ 0.08306687 & 0.19082335 \\ 0.01282253 & 0.32583345 \\ -0.07296827 & 0.16421444 \\ 0.16754072 & 0.12230966 \\ 0.17102661 & 0.15744613 \\ 0.04586467 & 0.32517922 \end{bmatrix} = [1, 1, 1, 1, 1, 0, 0, 1]$$

Comparing the predicted labels y with the actual labels y′:

$$\begin{aligned} \text{correct\_prediction} &= \mathrm{argmax}(y) \stackrel{?}{=} \mathrm{argmax}(y') \\ &= [1, 1, 1, 1, 1, 0, 0, 1] \stackrel{?}{=} [1, 1, 1, 0, 1, 1, 0, 0] \\ &= [\mathrm{True}, \mathrm{True}, \mathrm{True}, \mathrm{False}, \mathrm{True}, \mathrm{False}, \mathrm{True}, \mathrm{False}] \\ &= [1, 1, 1, 0, 1, 0, 1, 0] \end{aligned}$$

To determine the percentage of accuracy, the mean value of correct_prediction shall be calculated:


$$\text{accuracy} = \frac{\sum_i \text{correct\_prediction}_i}{\|\text{correct\_prediction}\|} = \frac{5}{8} = 0.625 \text{ or } 62.5\%$$
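The same computation can be checked with a minimal TensorFlow 1.x snippet; the two constant tensors reproduce the worked example's predicted and actual label indices:

import tensorflow as tf

predicted = tf.constant([1, 1, 1, 1, 1, 0, 0, 1])   # argmax of predicted labels
actual = tf.constant([1, 1, 1, 0, 1, 1, 0, 0])      # argmax of actual labels

correct_prediction = tf.cast(tf.equal(predicted, actual), tf.float32)
accuracy = tf.reduce_mean(correct_prediction)

with tf.Session() as sess:
    print(sess.run(accuracy))  # 0.625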

The accuracy of the model improves over training iterations as the weights and biases (the parameters) are learned using an optimization algorithm. For this study, the optimization algorithm used was Adam[23], implemented using tensorflow.train.AdamOptimizer()[1]. The optimization algorithm minimizes the loss of the network, in this case computed using Eq. 14. Hence, the loss for this sample training iteration is computed as follows:

Assume C = 0.5. Then

$$\begin{aligned} \text{loss} &= \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} \max\left(0,\, 1 - y_i'(wx_i + b_i)\right)^2 \\ &= \frac{1}{2} \times 2.1311962171321057 + 0.5 \times 16.1932601928711 \\ &= 9.1622282050016004 \end{aligned}$$
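For illustration, a minimal sketch of driving a loss toward its minimum with tf.train.AdamOptimizer() in TensorFlow 1.x follows; the toy quadratic loss stands in for the SVM loss of Eq. 14, and the 1e-5 learning rate is only an assumed value:

import tensorflow as tf

w = tf.Variable(5.0)
loss = tf.square(w - 3.0)   # toy stand-in for the SVM loss
train_step = tf.train.AdamOptimizer(learning_rate=1e-5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_step)   # each run updates w to reduce the loss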

4.5 Data Analysis

To determine the effectiveness of the proposed GRU-SVM model, two experiments were conducted: (1) a training phase, and (2) a test phase.

The first experiment involves the use of 80% of the total data points from the 25% subset of the 2013 Kyoto University honeypot dataset for training. Two trainings were conducted in the said experiment: (1) GRU-SVM model training, and (2) GRU-Softmax model training. Primarily, a plain SVM was supposed to be a comparator in the experiments; however, that would defeat the purpose of the study, which is a proposal of a neural network architecture (as SVM is not a neural network model). The second experiment is the evaluation of the two trained models using the remaining 20% of the data points from the 25% subset.
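A minimal sketch of such an 80%/20% split using scikit-learn[41] follows; the random arrays stand in for the preprocessed features and labels, and whether the study used this particular utility is an assumption:

import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.rand(1000, 21)          # 21 features, as in Section 4.4
labels = np.random.randint(0, 2, size=1000)  # binary: normal vs. under attack
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=42)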

The parameters to be considered in the training and testing results are as follows:

Training parameters

1. Number of data points

2. Architecture

3. Epochs (for neural networks)

4. Kernel used (for SVM classifier)

5. CPU run time

6. Number of misclassifications

7. Number of iterations


8. Loss

Testing parameters

1. Number of data points

2. Accuracy

3. CPU run time

4. Number of misclassifications

5. Number of false positives

6. Number of false negatives

The parameters above are based on those considered by Mukkamala, Janoski, & Sung (2002)[34] in their study comparing SVM and a feed-forward neural network for intrusion detection.

5 Results

The proposed neural network architecture, GRU-SVM, and its comparator, GRU-Softmax, were trained to detect intrusions using 25% of the Kyoto University 2013 network traffic data[43]. Note that the results presented here are only the initial results for the study. The following were the hyper-parameters (assigned by hand, not through optimization) used for the initial experiment:

Table 3: Hyper-parameters used in both neural network models

Hyper-parameter    GRU-SVM    GRU-Softmax
Batch Size         256        256
Cell Size          256        256
Dropout Rate       0.85       0.8
Epochs             2          2
Learning Rate      1e-5       1e-6
SVM C              0.5        N/A

Both models were trained using tensorflow.train.AdamOptimizer()[23, 1], with a total of 116,064 iterations (the floor of 14,856,316 / 256 batches per epoch, times 2 epochs), since there are 14,856,316 lines of data in the preprocessed 25% subset of the Kyoto University dataset.
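For reference, the iteration count can be reproduced with the following arithmetic:

# batches per epoch = floor(lines of data / batch size); run for 2 epochs
batches_per_epoch = 14856316 // 256   # = 58032
iterations = batches_per_epoch * 2    # = 116064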

Table 4: Initial results on training and validation accuracy of both neural network models

Accuracy     GRU-SVM             GRU-Softmax
Training     Average of 93.29%   Average of 71.39%
Validation   Average of 80.53%   Average of 71.41%

With a considerable gap between the training and validation accuracy of the proposed GRU-SVM architecture, the researcher posits that the hyper-parameters used were sub-optimal; otherwise, there might have been a mishap in the preprocessing of the dataset. On the other hand, GRU-Softmax had a small gap between its training and validation accuracy, so it is perhaps safe to assume that its hyper-parameters are almost optimal. However, the problem lies in its low accuracy, which contradicts the hypothesis that the dataset had a problem with its preprocessing.

An admitted mishap in the experiment concerns the SVM used for the proposed model: instead of the L2-SVM, the L1-SVM was used in training. This mistake was only revealed after a review of the programmatic implementation of both neural network models.
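
The difference between the two amounts to a single squaring of the hinge term, as in this sketch (TensorFlow 1.x API; the placeholders are illustrative):

    import tensorflow as tf

    y_true = tf.placeholder(tf.float32, [None, 1])   # labels in {-1, +1}
    score = tf.placeholder(tf.float32, [None, 1])    # wx + b from the SVM layer
    margin = tf.maximum(0.0, 1.0 - y_true * score)
    l1_svm_loss = tf.reduce_sum(margin)              # L1-SVM: plain hinge (used by mistake)
    l2_svm_loss = tf.reduce_sum(tf.square(margin))   # L2-SVM: squared hinge (intended)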

Since Table 4 provides only the average training and validation accuracy of both models, the accuracy graphs (recorded during the model runs) of both neural networks are also attached:

Figure 22: Accuracy of the proposed GRU-SVM model during its run: the orange line represents training accuracy, and the green line represents validation accuracy.

Figure 23: Accuracy of the comparator GRU-Softmax model during its run: the blue line represents training accuracy, and the green line represents validation accuracy.

The proposed GRU-SVM model was able to finish its run (2 epochs on 58032 batches of data) in only 5 hours, 9 minutes, and 40 seconds. On the other hand, its comparator, the GRU-Softmax model, finished its run (also 2 epochs on 58032 batches of data) in 5 hours, 49 minutes, and 3 seconds. These initial results on running time indicate that the proposed model is faster by 39 minutes and 23 seconds. This may be attributed to the fact that the algorithmic complexity of the Softmax function is O(n), since it has to run through all the elements of the probability distribution vector, while the complexity of the SVM decision function is only O(1), since its prediction is done using a single linear function, i.e. y = wx + b.
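
To make the claimed difference concrete, a rough sketch of the two prediction rules in Python (NumPy; the function names are illustrative, and this illustrates the argument above rather than the study's benchmark code):

    import numpy as np

    def softmax_predict(class_scores):
        # Exponentiates and normalizes every entry of the score vector,
        # so the work grows linearly with the number of classes: O(n).
        exps = np.exp(class_scores - np.max(class_scores))
        return np.argmax(exps / np.sum(exps))

    def svm_predict(w, x, b):
        # Thresholds one linear score, y = wx + b; no per-class normalization.
        return 1 if np.dot(w, x) + b >= 0 else -1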

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] A. Alalshekmubarak and L.S. Smith. A novel approach combining recurrent neural network and support vector machines for time series classification. In Innovations in Information Technology (IIT), 2013 9th International Conference on, pages 42–47. IEEE, 2013.

[3] Y. Ando, H. Gomi, and H. Tanaka. Detecting fraudulent behavior using recurrent neural networks. 2016.

[4] James Cannady. Artificial neural networks for misuse detection. In National Information Systems Security Conference, pages 368–81, 1998.

[5] Rung-Ching Chen, Kai-Fan Cheng, Ying-Hao Chen, and Chia-Fen Hsieh. Using rough set and support vector machine for network intrusion detection system. In Intelligent Information and Database Systems, 2009. ACIIDS 2009. First Asian Conference on, pages 465–470. IEEE, 2009.

[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20.3:273–297, 1995.

[8] Dorothy E Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, (2):222–232, 1987.

[9] Devoxx. Tensorflow and deep learning - without phd by martin gorner. https://www.youtube.com/watch?v=vq2nnJ4g6N0, November 8, 2016. Accessed: March 12, 2017.

[10] Center for Strategic and International Studies. Net losses: Estimating the global cost of cybercrime. McAfee, 2014.

[11] Jeremy Frank. Artificial intelligence and intrusion detection: Current and future directions. In Proceedings of the 17th National Computer Security Conference, volume 10, pages 1–12. Baltimore, USA, 1994.

[12] Anup K Ghosh, Aaron Schwartzbard, and Michael Schatz. Learning program behavior profiles for intrusion detection. In Workshop on Intrusion Detection and Network Monitoring, volume 51462, pages 1–13, 1999.

[13] Martin Gorner. Learn TensorFlow and deep learning, without a Ph.D. https://cloud.google.com/blog/big-data/2017/01/learn-tensorflow-and-deep-learning-without-a-phd, January 19, 2017.

[14] Neha Gupta. Artificial neural networks. Network and Complex Systems, 3.1:24–28, 2013.

[15] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998.

[16] Abien Fred Agarap (https://stats.stackexchange.com/users/168251/abien-fred-agarap). Standardization or feature scaling? CrossValidated.


[17] H. Jaeger. Echo state network. Scholarpedia, 2(9):2330, 2007. revision #151757.

[18] Anil K. Jain, Jianchang Mao, and K.M. Mohiuddin. Artificial neural networks: A tutorial. Computer, 29.3:31–44, March 1996.

[19] Juniper. What is intrusion detection and prevention (IDS/IDP)? http://www.juniper.net/us/en/products-services/what-is/ids-ips/.

[20] Juniper. Cybercrime will cost businesses over $2 trillion by 2019. https://www.juniperresearch.com/press/press-releases/cybercrime-cost-businesses-over-2trillion, May 12, 2015. Accessed: May 6, 2017.

[21] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, May 21, 2015.

[22] Andrej Karpathy. CS231n convolutional neural networks for visual recognition. http://cs231n.github.io/.

[23] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Sandeep Kumar and Eugene H Spafford. An application of pattern matching in intrusion detection. 1994.

[25] MIT Lincoln Laboratory. 1999 DARPA intrusion detection evaluation data set. https://www.ll.mit.edu/ideval/data/1999data.html, 1999.

[26] Pavel Laskov, Christin Schäfer, Igor Kotenko, and K.-R. Müller. Intrusion detection in unlabeled data with quarter-sphere support vector machines. Praxis der Informationsverarbeitung und Kommunikation, 27(4):228–236, 2004.

[27] S. Lewandowsky and S. Farrell. Computational Modeling in Cognition: Principles and Practice. Sage Publications, 2011.

[28] Mantas Lukoševičius. A practical guide to applying echo state networks. In Neural Networks: Tricks of the Trade, pages 659–686. Springer, 2012.

[29] Jonathan L Lustgarten, Vanathi Gopalakrishnan, Himanshu Grover, and Shyam Visweswaran. Improving classification performance with discretization on biomedical datasets. In AMIA Annual Symposium Proceedings, volume 2008, page 445. American Medical Informatics Association, 2008.

[30] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.

[31] Wes McKinney. Data structures for statistical computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.

[32] M. Molaie, R. Falahian, S. Gharibzadeh, S. Jafari, and J.C. Sprott. Artificial neural networks: Powerful tools for modeling chaotic behavior in the nervous system. Front. Comput. Neurosci., 8, 2014.


[33] S. Morgan. Cyber crime costs projected to reach $2 trillion by 2019. https://www.forbes.com/sites/stevemorgan/2016/01/17/cyber-crime-costs-projected-to-reach-2-trillion-by-2019, January 17, 2016. Accessed: May 6, 2017.

[34] Srinivas Mukkamala, Guadalupe Janoski, and Andrew Sung. Intrusion detection: support vector machines and neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks (ANNIE), St. Louis, MO, pages 1702–1707, 2002.

[35] Srinivas Mukkamala and Andrew Sung. Feature selection for intrusion detection with neural networks and support vector machines. Transportation Research Record: Journal of the Transportation Research Board, (1822):33–39, 2003.

[36] M. Negnevitsky. Artificial Intelligence: A Guide to Intelligent Systems. Pearson Education Ltd., Essex, England, 3rd edition, 2011.

[37] “Domain of Science”. The map of mathematics. https://www.youtube.com/watch?v=OmJ-4B-mS-Y, February 1, 2017.

[38] Christopher Olah. Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/, August 27, 2015. Accessed: April 20, 2017.

[39] MIT OpenCourseWare. 16. Learning: Support vector machines. https://www.youtube.com/watch?v=_PwhiWxHK8o, January 10, 2014.

[40] R. Patidar and L. Sharma. Credit card fraud detection using neural network. International Journal of Soft Computing and Engineering (IJSCE), 1(32-38), 2011.

[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[42] Nicholas J. Puketza, Kui Zhang, Mandy Chung, Biswanath Mukherjee, and Ronald A. Olsson. A methodology for testing intrusion detection systems. IEEE Transactions on Software Engineering, 22(10):719–729, 1996.

[43] Jungsuk Song, Hiroki Takakura, and Yasuo Okabe. Description of Kyoto University benchmark data. Available at: http://www.takakura.com/Kyoto_data/BenchmarkData-Description-v5.pdf [Accessed on 15 March 2016], 2006.

[44] Salvatore J Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K Chan. Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection: Results from the JAM project, 2000.

[45] Yichuan Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.

[46] OpenCV Dev Team. Introduction to support vector machines. http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html, May 4, 2017.

[47] David Verstraeten. Reservoir computing: computation with dynamical systems. PhD thesis, Ghent University, 2009.


[48] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[49] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
