TR-IDS: Anomaly-Based Intrusion Detection through Text

Research ArticleTR-IDS Anomaly-Based Intrusion Detection throughText-Convolutional Neural Network and Random Forest

Erxue Min 1 Jun Long 1 Qiang Liu 1 Jianjing Cui1 and Wei Chen2

1College of Computer National University of Defense Technology Changsha 410073 China2School of Computer Science University of Birmingham Birmingham British B15 2TT UK

Correspondence should be addressed to Erxue Min mex338qqcom

Received 3 May 2018 Accepted 24 June 2018 Published 5 July 2018

Academic Editor Zhaoqing Pan

Copyright copy 2018 Erxue Min et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

As we head towards the IoT (Internet of Things) era protecting network infrastructures and information security has becomeincreasingly crucial In recent years Anomaly-Based Network Intrusion Detection Systems (ANIDSs) have gained extensiveattention for their capability of detecting novel attacks However most ANIDSs focus on packet header information and omitthe valuable information in payloads despite the fact that payload-based attacks have become ubiquitous In this paper we proposea novel intrusion detection system named TR-IDS which takes advantage of both statistical features and payload features Wordembedding and text-convolutional neural network (Text-CNN) are applied to extract effective information from payloads Afterthat the sophisticated random forest algorithm is performed on the combination of statistical features and payload featuresExtensive experimental evaluations demonstrate the effectiveness of the proposed methods

1 Introduction

Due to the advancements in Internet cyberspace securityhas gained increasing attention [1 2] which has encour-aged many researchers to design effective defense systemscalled Network Intrusion Detection Systems (NIDSs) Cur-rently existing intrusion detection techniques fall into twomain categories misuse-based detection (also known assignature-based detection or knowledge-based detection)and anomaly-based detection (also known as behavior-based detection) Misuse-based detection systems extract thediscriminative features and patterns from known attacks andhand-code them into the system These rules are comparedwith the traffic to detect attacks They are effective andefficient for detecting known type of attacks and have avery low False Alarm Therefore Misuse-based detectionsystems are currently the mainstream NIDSs and somesophisticated ones have been deposited in real scenarioseg snort [3] However misuse detection systems requireupdating the rules and signatures frequently and they areincapable to identify any novel or unknown attacks In recentyears anomaly-based network intrusion detection systems

(ANIDSs) have attracted much attention for their capabilityof detecting zero-day attacks They adopt statistical methodsmachine learning algorithms or data mining algorithms tomodel the pattern of normal network behavior and detectanomalies as deviations from normal behavior

Various algorithmshave beenproposed tomodel networkbehavior and detect anomaly flows including artificial neuralnetworks [4] fuzzy association rules [5] Bayesian network[6] clustering [7] decision trees [8] ensemble learning [9]support vector machine [10] and so on [11 12] Howeverthese methods mostly exploit the information in packetheaders or the statistical information of entire flows and failto detect the malicious content (eg SQL injection cross-site scripting and shellcode) in packet payloads Classicprocessing methods for payloads can be divided into twocategories The first category requires prior knowledge ofprotocol formats which cannot be applied to unknown pro-tocols The second category does not require expert domainknowledge instead they calculate some statistical features orconduct N-gram analysis but they usually suffer from a highfalse positive rate In recent years deep learning algorithms[13 14] have achieved remarkable results in many fields eg

HindawiSecurity and Communication NetworksVolume 2018 Article ID 4943509 9 pageshttpsdoiorg10115520184943509

2 Security and Communication Networks

Computer Vision (CV) [15] Natural Language Processing(NLP) [16] and Automatic Speech Recognition (ASR) [17]They are proven to be capable to extract salient features fromunstructured data Considering the fact that the payloads ofnetwork traffic are sequence data similar to texts we canapply modern deep learning techniques in NLP to the featureextraction of network payloads

In this paper we adopt word embedding [18] and text-convolutional neural network (Text-CNN) [19] to extractfeatures from the payloads in network traffic We combinethe statistical features with payload features and then runrandom forest [20] for the final classification The rest ofthis paper is organized as follows In Section 2 we describethe related work In Section 3 we describe the design andimplementation of our methods In Section 4 we showextensive experimental results to show the effectiveness of ourmethods Finally in Section 5 we conclude this paper

2 Related Work

21 Payload-Based Intrusion Detection In these dayspayload-based attacks have become more prevalent whileolder attacks such as network Probe DoS DDoS andnetwork worm attacks have become less popular Manyattacks place the exploit codes inside the payload of networkpackets thus header-based approaches cannot detect themIn this case many payload-based detection techniqueshave been proposed The first class of these methods iscreating protocol parsers or decoders for different kindsof application Snort [3] includes a number of protocolparsers for protocol anomaly detection For example thehttp inspect preprocessor parses and normalizes HTTPfields making them available to detect oversized headerfields non-RFC characters or Unicode encoding ALAD [21]builds models of allowed keywords in text-based applicationprotocols such as FTP HTTP and SMTP The anomaly scoreis increased when a rare keyword is used for a particularservice These parser-based methods have a high detectionrate for known protocols However these methods requiremanually specified by experts and cannot deal with unknownprotocols The second class applies NLP techniques egN-gram analysis [22] to network traffic payloads PAYL[23] uses 1-grams and unsupervised learning to build a bytefrequency distribution model of payloads McPAD [24]creates 2]-grams and applies a sliding window to coverall sets of 2 bytes ] positions apart in each network trafficpayload They require no expert domain knowledge andcan detect zero-day worms because payloads with exploitcodes generally have an unusual byte frequency distributionThe drawbacks of them are unsatisfactory detection rateand relatively high computational overhead compared withparser-based methods

22 Deep Learning for Intrusion Detection Many deep learn-ing techniques have been used for developing ANIDS Ma etal [25] evaluated deep neural network on the KDDCUP99dataset and Niyaz et al [26] applied deep belief networks tointrusion detection on the NSL-KDD dataset However theyonly tested deep learning techniques on manually designed

features while their powerful ability to learn features fromraw data has not been exploited Recently several attemptsto learn effective features from raw packets have emergedYu et al [27 28] and Mahmood et al [29] used autoencoderto detect anomaly traffic Wang et al [30] applied CNNto learn the spatial features of network traffic and usedthe image classification method to classify malware trafficdespite the fact that network payloads are more similar todocuments Torres et al [31] transformed network trafficfeatures into character sequence and used RNN to learnthe temporal features while Wang et al [32] combinedCNN and LSTM together to learn both spatial and temporalfeatures These methods are of great insights yet have evidentweaknesses Firstly some time-based traffic features such asflow duration packet frequency and average packet lengthcannot be learned automatically by both CNN and LSTMBesides they ignore the semantic relation between each bytewhich is a critical factor in NLP In this paper we remedyboth problems by taking advantage of both expert domainknowledge and deep neural networks The statistical featuresare manually designed and the payload features are extractedby deep learning techniques in NLP To the best of ourknowledge no studies have made use of the advantages ofboth

3 TR-IDS

TR-IDS aims at automatically extracting features from pay-loads of raw network packets to improve the accuracy ofIDS Since random forest has superior performance onstructured data while convolutional network is suitable tohandle unstructured data [33] we combine the advantagesof both It performs classification on bidirectional networkflows (Biflow) which contains more temporal informationthan packet level datasets The implementation schemes areillustrated in Figure 1 and the different stages of TR-IDS aredescribed as follows

(i) Statistical features extraction we extract some crit-ical statistical features from each network flow Thesefeatures include fields in packet headers and statisticalattributes of the entire flow

(ii) Payload features extraction we map each byte inpayloads into a word vector using word embeddingand then extract salient features of payloads usingtext-convolutional neural network

(iii) Classification through random forest the statisti-cal features and payload features are concatenatedtogether and then the random forest algorithm isapplied to classify the generated new dataset

31 Statistical Features Extraction In this section we man-ually extract some discriminative features from the bidirec-tional network flows where the first packet in each flowdetermines the forward (source to destination) and backward(destination to source) direction We extract 44 statisticalfeatures from each flow and most of them are calculatedseparately in both forward and backward direction To be

Security and Communication Networks 3

Header Payload Header Payload Header Payload

Payload Payload Payload

Payload features

Output

Word Embedding

Text-CNN

Random Forest

Statistical features

Figure 1 The general architecture of TR-IDS

more specific we first extract some basic features such asprotocol source port and destination port while ip addressesare not included because they vary in different networksand thus cannot generalize the characteristic of attacksThen some statistical attributes such as packet number bytesnumber and tcp flag number are calculated After that sometime-based statistical measures are also extracted such as thespeed of transmission and time interval between two packetsThese features are vital signatures for detecting attacks suchas Probe DoS DDoS Scan U2R and U2L which havedistinctive traffic patterns We list all these features in Table 1

32 Payload Features Extraction In this section we intro-duce our deep-learning-based method of extracting featuresfrom network payloads Word embedding technique is usedto transfer one-hot representation of each byte to continuousvector representation Then text-convolutional network isutilized to extract themost salient features fromeach payload

Byte-Level Word Embedding The effective representation ofeach byte in payloads is a critical step Yu et al [27] tookthe decimal value of each byte as a feature This methodis not suitable as it introduces order relation to each byteWang et al [32] adopted one-hot encoding to each byteand consider each sample as a picture then a conventionalCNN is applied to extract features However this methodneglects the similarity in semantics and syntax of differentbytes and the worse is that it significantly increases thecomputation complexity To remedy this problem we utilizeword embedding to map each byte into a low dimensionalvector preserving the semantic information and consumingmuch less computational cost By now the most well-known

method of word embedding is word2vec [34] which isconvenient to implement and has superior performance Twopopular kinds of implementation of word2vec are CBoWand Skip-Gram [35] Since Skip-Gram generally has a betterperformance [35] in this paper we apply Skip-Gram to ourbyte-embedding task

The task of Skip-Gram is given one word predicting thesurrounding wordsThe trained model does not perform anynew task instead we just need the projection matrix whichcontains the vector representation of each word We definetwo parameter matrices119882 isin R119889times|119881| and1198821015840 isin R|119881|times119889 where119889 is the embedding dimensionwhich can be set as an arbitrarysize Note that119881 is the vocabulary set and |119881| is the size of119881Each word in 119881 is represented as a |119881| times 1 one-hot vectorThe architecture of Skip-Gram is illustrated in Figure 2 andSkip-Gram works in the following 4 steps

Step 1 Generate the one-hot input vector 119909119894 isin R|119881| of thecenter word

Step 2 Get the embedded vector of the center word V119894 =119882119909119894 isin R119889

Step 3 For each surrounding word generate a score vec-tor 119911 = 1198821015840V119894 and then turn it into probabilities119910 = 119904119900119891119905119898119886119909(119911) Thus we obtain 2119898 softmax outputs119910119894minus119898 119910119894minus1 119910119894+1 119910119894+119898 where 119898 denotes the windowsize

Step 4 Match the generated probability vectors with thetrue probabilities which are the one-hot vectors of theactual output 119910119888minus119898 119910119888minus1 119910119888+1 119910119888+119898 The divergence


Table 1 Statistical features of the network flow

Feature Descriptionprotocol Protocol of the flowsrc port Source portdst port Destination portf(b) urg num Number URG flags in the forward(backward) direction (0 for UDP)f(b) ack num Number ACK flags in the forward(backward) direction (0 for UDP)f(b) psh num Number PSH flags in the forward(backward) direction (0 for UDP)f(b) rst num Number RST flags in the forward(backward) direction (0 for UDP)f(b) syn num Number SYN flags in the forward(backward) direction (0 for UDP)f(b) fin num Number FIN flags in the forward(backward) direction (0 for UDP)pkts num Total packets in the flowbytes num Total bytes in the flowf(b) pkts num Total packets in the forward(backward) directionf(b) bytes num Total bytes in the forward(backward) directionf(b) len min Minimum length of packet in the forward(backward) directionf(b) len max Maximum length of packet in the forward(backward) directionf(b) len mean Mean length of packet in the forward(backward) directionf(b) len std Standard deviation length of packet in the forward(backward) directionduration Duration of the flowpkts psec Number of packets per secondbytes psec Number of packets per secondf(b) pkts psec Number of forward(backward) packets per secondf(b) bytes psec Number of forward(backward) bytes per secondf(b) intv min Minimum time interval between two packets sent in the forward(backward) directionf(b) intv max Maximum time interval between two packets sent in the forward(backward) directionf(b) intv mean Mean time interval between two packets sent in the forward(backward) directionf(b) intv std Standard deviation time interval between two packets sent in the forward(backward) direction

Parametersto learn

SoftmaxOutput

Embeddinglayer

Inputlayer

D-dim

Dtimes|V|

Dtimes|V|

Dtimes|V|

2mtimes|V|-dim

|V|timesD

|V|-dim

Figure 2 This figure illustrates how Skip-Gram works


nxd embeddedrepresentation of

payloads

convolutionkernels of

different size 1-max pooling

concatenateall vectors

feature layer

output kclasses

(static or non-static)

Figure 3 This figure illustrates how Text-CNN works

between generated probabilities and true probabilities is theloss function for optimizing the parameters

When it comes to the byte-level word embedding in ouralgorithms each byte is considered as a word and representedas a one-hot vectorWe first extract the payloads of all packetsin each flow and concatenate them together as a flow payloadEach flow payload can be analogized as a sentence and theyare composed of a text corpus ie a training dataset Theembedding size can be set as a relatively small value (eg10) After the training of Skip-Gram we obtain the embeddedrepresentation of each byte

Extract Payload Features through Text-CNN We apply Text-CNN to extract features from the embedded payloads Text-CNN is a slight variant of the CNN architecture and achievesexcellent results on many benchmarks of sentence classifica-tion (or document classification) [19] Text-CNN adopts theone-dimensional convolution operation to extract featuresfrom the embedded sentences In Text-CNN filters have afixedwidth of embedding size but have varying heights in theone layer while in conventional CNNs the sizes of filters inone layer are usually the sameThe architecture of Text-CNNis illustrated in Figure 3

Let x119894 isin R119889 be a 119889-dimensional word vector correspond-ing to the embedded representation of 119894th word in a sentence(in our task each byte corresponds to a word thus eachpayload is considered as a sentence) A sentence of length 119899(padded if the length is smaller than 119899) is denoted as

x1119899 = x1 oplus x2 oplus sdot sdot sdot oplus x119899 (1)

Note that oplus is the concatenation operator When executinga convolution operation a convolution filter w isin Rℎtimes119889 is

applied to a window of ℎ words in the sentence to generate anew feature To be specific a feature 119888119894 is calculated as follows

119888119894 = 119891 (w sdot x119894119894+ℎminus1 + 119887) (2)

where x119894119894+ℎminus1 is a window of words 119887 is a bias and 119891 isa nonlinear function This filter is applied to each possiblewindow [x1ℎ x2ℎ+1 x119899minusℎ+1119899] to generate a new featuremap c = [1198881 1198882 119888119899minusℎminus1] and c isin R119899minusℎ+1 Then a max-pooling operation is applied to the feature map to obtain themaximum value 119888max = max(c) which is the most importantfeature of each feature map

The process of extracting one feature by one filter isdescribed above and we have multiple filters with varyingwindow size to extract multiple features Note that in theoriginal Text-CNN the features are concatenated and directlypassed to a fully-connected 119904119900119891119905119898119886119909 layer to output theprobabilities of different classes But in our implementationwe insert a feature layer between the concatenated layer andoutput layer After the supervised training of the model weextract features of each payload from this layer

Classification through Random Forest The Random Forest(RF) [20] is an ensemble algorithm consisting of a collectionof tree-structured classifiers Each tree is constructed by adifferent bootstrap sample from the original data using adecision tree algorithm and each node of trees only selectsa small subset of features for the split The learning samplesnot selected with bootstrap are used for evaluation of the treecalled out-of-bag (OOB) evaluation which is an unbiasedestimator of generalization error After the construction ofthe forest once a new sample needs to be classified it is fedinto each tree in the forest and each tree casts a unit voteto certain class which indicates the decision of the tree The


forest chooses the class with the most votes for the inputsample

RF has the following advantages

(i) It has excellent performance in accuracy on struc-tured data

(ii) It is robust against noise and does not over-fit in mostcases

(iii) It is computational efficient and can run on large-scaledatasets with high dimensions

(iv) It can handle unbalanced datasets

(v) It can output the importance weight of each feature

These merits of RF encourage us to choose it as our final clas-sification In this step we concatenate the statistical featuresand payload features to generate the final representation ofthe network flows Then this new dataset is fed into the RFalgorithm for training and validation

4 Performance Evaluation

41 Datasets and Preprocessing We evaluate the performanceof our method on ISCX2012 dataset [36] It is an intrusiondetection dataset generated by the Information SecurityCenter of Excellence (ISCX) of the University of NewBrunswick (UNB) in Canada in 2012 This dataset consists of7 days of network activity including normal traffic and fourtypes of attack traffic ie Infiltrating HttpDoS DDoS andBruteForce SSH AlthoughKDDCUP99 dataset [37] is widelyused to evaluate IDS techniques it is really old-fashionedand cannot actually reflect the behavior of modern attacksIn contrast ISCX2012 is much more updated and closer toreality This dataset consists of seven raw pcap files and a listof label files The label files record the basic information ofeach network flow eg label ip address port start time andstop time We have to split the network flows in the pcap filesand label them using records in the label files Note that thelabeled files contain a few problems For example the packetnumbers recorded in them are not identical to the actualpacket number in pcap files Besides the time records inthemdo not exactly correspond to the timestamps in the pcapfilesTherefore we have to remove all incorrect and confusedrecords We chose most attack samples and randomly chosea small subset of legitimate ones to generate a relativelybalanced dataset Then we divided the preprocessed datasetinto training and testing set using a ratio of 70 and 30respectively Our preprocessing results are shown in Table 2

42 Evaluation Metrics Three metrics are used to evaluatethe performance of TR-IDS Accuracy (ACC) DetectionRate(DR) and False Alarm Rate (FAR) which are frequentlyused in the evaluation of intrusion detection ACC is a goodmetric to evaluate the overall performance of a system DRis used to evaluate the attack detection rate FAR is used to

evaluate misclassification of normal trafficThe three metricsare formulated as follows

119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

where TP is the number of instances correctly classified as ATN is the number of instances correctly classified as Not-AFP is the number of instances incorrectly classified as A andFN is the number of instances incorrectly classified as Not-A

43 Experimental Setup Scapy Pytorch and Scikit-learnare the software frameworks for our implementationThe operating system is CentOS 72 64bit OS Server isPR4712GWX10DRFF-iG with 2 Xeon e5 CPUs with 10 coresand 64GB memory Four Nvidia Tesla K80 GPUs are used toaccelerate the training of CNN In our all experiments theText-CNN contains convolution filters with three differentsize ie 3 4 and 5 and there are 100 channels for eachThe stride is 1 and no padding is used The mini-batch sizeis 100 and optimizer is Adam with default parameters Theparameters of RF are set by default except the number oftrees which is set as 200

44 Experimental Results In this section we show theexperimental results of our methods We set the number ofextracted payload features as 50 and the truncated lengthof bytes in each payload as 1000 Table 3 shows the resultof 5-class classification on ISCX2012 and Table 4 shows theconfusion matrix of the classification It is obvious that ourmethod can nearly identify all attacks of Infiltration BFSSHand HttpDoS but confuses a few DDoS attacks with thenormal trafficThe reason is that somenetwork flows ofDDoSare really similar to normal traffic thus it is unrealistic toidentify each flow in a DDoS attack

Since ISCX2012 dataset was published much later thanDARPA1998 there are much fewer available correspondingexperimental results Although some existing methods areevaluated on it they have different preprocessing proceduresand even use different proportions of the dataset Thus itis unfair to compare our methods with these methods Inthis case in order to demonstrate the effectiveness of ourmethod we implemented five other methods The first fourones are support vector machine (SVM) fully-connectednetwork (NN) convolutional neural network (CNN) andrandom forest (RF-1) and their inputs are statistical featurescombined with 1000 raw bytes The fifth one is runningrandom forest on just statistical features (RF-2) Table 5compares TR-IDS with the five methods Note that the per-formance of RF-1 is inferior to that of RF-2 which means thefeatures of raw bytes may even deteriorate the performanceof intrusion detection The superior performance of TF-IDS demonstrates the effectiveness of the proposed featureextraction techniques


Table 2 Preprocessing results of the ISCX2012 dataset

Category Count Percentage Training count Testing countNormal 10000 2828 6971 3029Infiltration 9925 2807 6930 2995BFSSH 7042 1992 4911 2131DDoS 4963 1404 3513 1450HttpDoS 3427 969 2424 1003Total 35357 100 24749 10608

Table 3 Performance of TR-IDS on ISCX2012 ()

Type ACC DR FARInfiltration 9987 9977 006BFSSH 9999 9995 000DDoS 9809 9593 040HttpDoS 9990 9970 007Total 9913 9926 118

Table 4 Confusion matrix of the 5-class classification task

Normal Infiltration BFSSH DDoS HttpDoSNormal 2993 1 0 33 2Infiltration 0 2988 0 4 3BFSSH 0 1 2130 0 0DDoS 56 1 0 1391 2HttpDoS 0 2 0 1 1000

Table 5 Comparison with other algorithms ()

Type ACC DR FARSVM 8616 8148 195NN 9099 9117 945CNN 9575 9661 641RF-1 9721 9681 174RF-2 9859 9824 267TR-IDS 9913 9926 118

45 Sensitivity Analysis In this section we show the resultsof sensitivity tests on the two important hyperparametersie the truncated length of payloads for feature extractionand the number of features extracted from payloads Wefirst fixed the truncated length of payloads as 1000 and thenvaried the extracted feature number from 5 to 100 Afterthat we fixed the extracted feature number and varied thetruncated length from 500 to 3000 As we can see in Table 6our methods are not sensitive to the two hype-parametersThe best value of feature number locates at the middle of 5and 100 ie 50 The reason is that too few features cannotcontain enough information of the entire payload and toomany features bring noise to the final classification algorithmFor the truncated length of payloads we find that large lengthcontributes to a better performance it is because a smalllength may result in the loss of information of payloadsNevertheless a large length also leads to a high computationalcost

Table 6 Influence of the payload length and feature number ()

Payload length feature number ACC DR FAR1000 5 9868 9892 1911000 10 9871 9894 1881000 20 9884 9920 2011000 50 9913 9926 1181000 100 9909 9924 148500 50 9902 9935 1811000 50 9913 9926 1181500 50 9914 9825 1172000 50 9918 9836 1253000 50 9921 9940 112

5 Conclusion

In this paper we propose a novel intrusion detection frame-work ie TR-IDS which utilizes both manually designedfeatures and payload features to improve the performanceIt adopts two modern NLP techniques ie word embeddingand Text-CNN to extract salient features from payloadsTheword embedding technique retains the semantic relationsbetween each byte and reduces the feature dimension andthen Text-CNN is used to extract features from each payloadWe also apply the sophisticated random forest algorithm forthe final classification Finally extensive experiments showthe superior performance of our method

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This paper was previously presented at the 4th InternationalConference onCloudComputing and Security (ICCCS 2018)This work is supported by the National Natural ScienceFoundation of China (Grants nos 61105050 61702539 and60970034)


References

[1] J Cui Y Zhang Z Cai A Liu and Y Li ldquoSecuring displaypath for security-sensitive applications on mobile devicesrdquoComputer Materials amp Continua vol 55 no 1 pp 17ndash35 2018

[2] A Pradeep S Mridula and P Mohanan ldquoHigh securityidentity tags using spiral resonatorsrdquo Computers Materials andContinua vol 52 no 3 pp 185ndash195 2016

[3] M Roesch et al ldquoLightweight intrusion detection for net-worksrdquo Lisa vol 99 pp 229ndash238 1999

[4] J Cannady ldquoArtificial neural networks for misuse detectionrdquo inNational Information Systems Security Conference vol 26 1998

[5] H Brahmi I Brahmi and S Ben Yahia ldquoOMC-IDS At thecross-roads of OLAP mining and intrusion detectionrdquo LectureNotes in Computer Science (including subseries Lecture Notesin Artificial Intelligence and Lecture Notes in Bioinformatics)Preface vol 7301 no 2 pp 13ndash24 2012

[6] F Jemili M Zaghdoud andM B Ahmed ldquoA framework for anadaptive intrusion detection systemusing Bayesian networkrdquo inProceedings of the ISI 2007 2007 IEEE Intelligence and SecurityInformatics pp 66ndash70 May 2007

[7] M Blowers and J Williams ldquoMachine Learning Applied toCyberOperationsrdquo inNetwork Science andCybersecurity vol 55of Advances in Information Security pp 155ndash175 Springer NewYork New York NY 2014

[8] C Kruegel and T Toth ldquoUsing Decision Trees to ImproveSignature-Based Intrusion Detectionrdquo in Recent Advances inIntrusion Detection vol 2820 of Lecture Notes in ComputerScience pp 173ndash191 Springer Berlin Heidelberg Berlin Heidel-berg 2003

[9] J Zhang M Zulkernine and A Haque ldquoRandom-forests-based network intrusion detection systemsrdquo IEEE Transactionson Systems Man and Cybernetics Part C Applications andReviews vol 38 no 5 pp 649ndash659 2008

[10] Y Li J Xia S Zhang J Yan X Ai and K Dai ldquoAn efficientintrusion detection system based on support vector machinesand gradually feature removal methodrdquo Expert Systems withApplications vol 39 no 1 pp 424ndash430 2012

[11] E Min Y Zhao J Long C Wu K Li and J Yin ldquoSVRG withadaptive epoch sizerdquo in Proceedings of the 2017 InternationalJoint Conference on Neural Networks IJCNN 2017 pp 2935ndash2942 USA May 2017

[12] E Min J Cui and J Long ldquoVariance Reduced StochasticOptimization for PCA and PLSrdquo in Proceedings of the 201710th International SymposiumonComputational Intelligence andDesign (ISCID) pp 383ndash388 Hangzhou December 2017

[13] Y LeCun Y Bengio and G Hinton ldquoDeep learningrdquo Naturevol 521 no 7553 pp 436ndash444 2015

[14] G Cheng C Yang X Yao L Guo and J Han ldquoWhenDeep Learning Meets Metric Learning Remote Sensing ImageScene Classification via Learning Discriminative CNNsrdquo IEEETransactions on Geoscience and Remote Sensing pp 1ndash11 2018

[15] Z Pan J Lei Y Zhang and F L Wang ldquoAdaptive Fractional-PixelMotion Estimation SkippedAlgorithm for EfficientHEVCMotionEstimationrdquoACMTransactions onMultimediaComput-ing Communications and Applications (TOMM) vol 14 no 1pp 1ndash19 2018

[16] G G Chowdhury ldquoNatural language processingrdquo AnnualReview of Information Science and Technology vol 37 pp 51ndash892003

[17] B Chigier ldquoAutomatic speech recognitionrdquo The Journal of theAcoustical Society of America vol 103 no 1 p 19 1997

[18] D Tang F Wei N Yang M Zhou T Liu and B Qin ldquoLearn-ing sentiment-specific word embedding for twitter sentimentclassificationrdquo in Proceedings of the 52nd Annual Meeting of theAssociation for Computational Linguistics ACL 2014 pp 1555ndash1565 USA June 2014

[19] Y Kim ldquoConvolutional Neural Networks for Sentence Clas-sificationrdquo in Proceedings of the 2014 Conference on EmpiricalMethods in Natural Language Processing (EMNLP) pp 1746ndash1751 Doha Qatar October 2014

[20] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquoThe R Journal vol 2 no 3 pp 18ndash22 2002

[21] V Matthew Mahoney and K Philip Chan ldquoLearning models ofnetwork traffic for detecting novel attacksrdquo Tech Rep 2002

[22] F Peter Brown V Peter Desouza L Robert Mercer J VincentDella Pietra and C Jenifer Lai ldquoClass-based n-gram models ofnatural languagerdquo Computational Linguistics vol 18 no 4 pp467ndash479 1992

[23] K Wang and S J Stolfo ldquoAnomalous payload-based networkintrusion detectionrdquo in Recent Advances in Intrusion Detectionvol 3224 of Lecture Notes in Computer Science pp 203ndash222Springer Berlin Germany 2004

[24] R Perdisci D Ariu P Fogla G Giacinto andW Lee ldquoMcPADa multiple classifier system for accurate payload-based anomalydetectionrdquoComputer Networks vol 53 no 6 pp 864ndash881 2009

[25] T Ma FWang J Cheng Y Yu and X Chen ldquoA hybrid spectralclustering and deep neural network ensemble algorithm forintrusion detection in sensor networksrdquo Sensors vol 16 no 102016

[26] A Javaid Q Niyaz W Sun and M Alam ldquoA Deep LearningApproach for Network Intrusion Detection Systemrdquo in Pro-ceedings of the 9th EAI International Conference on Bio-inspiredInformation and Communications Technologies (formerly BIO-NETICS) New York NY USA December 2015

[27] Y Yu J Long and Z Cai ldquoSession-Based Network IntrusionDetection Using a Deep Learning Architecturerdquo in ModelingDecisions for Artificial Intelligence vol 10571 of Lecture Notes inComputer Science pp 144ndash155 Springer International Publish-ing Cham Germany 2017

[28] YangYu Jun Long and ZhipingCai ldquoNetwork IntrusionDetec-tion through Stacking Dilated Convolutional AutoencodersrdquoSecurity and Communication Networks vol 2017 pp 1ndash10 2017

[29] M Yousefi-Azar V Varadharajan L Hamey and U TupakulaldquoAutoencoder-based feature learning for cyber security applica-tionsrdquo in Proceedings of the 2017 International Joint ConferenceonNeuralNetworks IJCNN2017 pp 3854ndash3861USAMay 2017

[30] Z Tan A Jamdagni X He P Nanda R P Liu and J HuldquoDetection on denial-of-service attacks based on computervision techniquesrdquo Institute of Electrical and Electronics Engi-neers Transactions on Computers vol 64 no 9 pp 2519ndash25332015

[31] P Torres C Catania S Garcia and C G Garino ldquoAn analysisof Recurrent Neural Networks for Botnet detection behaviorrdquoin Proceedings of the 2016 IEEE Biennial Congress of ArgentinaARGENCON 2016 Argentina June 2016

[32] WWang Y Sheng JWang et al ldquoHAST-IDS LearningHierar-chical Spatial-Temporal Features using Deep Neural Networksto Improve Intrusion Detectionrdquo IEEE Access 2017

[33] httpswwwimportioposthow-to-win-a-kaggle-competition[34] Y Goldberg and O Levy ldquoword2vec explained Deriving

mikolov et alrsquos negative-sampling word-embedding method2014rdquo httpsarxivorgabs14023722


[35] M Tomas K Chen G Corrado and D Jeffrey ldquoEfficient esti-mation of word representations in vector spacerdquo 2013 httpsarxivorgabs13013781

[36] A Shiravi H Shiravi M Tavallaee and A A GhorbanildquoToward developing a systematic approach to generate bench-mark datasets for intrusion detectionrdquo Computers amp Securityvol 31 no 3 pp 357ndash374 2012

[37] M Tavallaee E BagheriW Lu and A A Ghorbani ldquoA detailedanalysis of the KDD CUP 99 data setrdquo in Proceedings of the 2ndIEEE Symposium on Computational Intelligence for Security andDefence Applications pp 1ndash6 IEEE July 2009

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018


Active and Passive Electronic Components

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of



Journal ofEngineeringVolume 2018

SensorsJournal of



RotatingMachinery


Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom


Computer Vision (CV) [15] Natural Language Processing(NLP) [16] and Automatic Speech Recognition (ASR) [17]They are proven to be capable to extract salient features fromunstructured data Considering the fact that the payloads ofnetwork traffic are sequence data similar to texts we canapply modern deep learning techniques in NLP to the featureextraction of network payloads

In this paper we adopt word embedding [18] and text-convolutional neural network (Text-CNN) [19] to extractfeatures from the payloads in network traffic We combinethe statistical features with payload features and then runrandom forest [20] for the final classification The rest ofthis paper is organized as follows In Section 2 we describethe related work In Section 3 we describe the design andimplementation of our methods In Section 4 we showextensive experimental results to show the effectiveness of ourmethods Finally in Section 5 we conclude this paper

2 Related Work

21 Payload-Based Intrusion Detection In these dayspayload-based attacks have become more prevalent whileolder attacks such as network Probe DoS DDoS andnetwork worm attacks have become less popular Manyattacks place the exploit codes inside the payload of networkpackets thus header-based approaches cannot detect themIn this case many payload-based detection techniqueshave been proposed The first class of these methods iscreating protocol parsers or decoders for different kindsof application Snort [3] includes a number of protocolparsers for protocol anomaly detection For example thehttp inspect preprocessor parses and normalizes HTTPfields making them available to detect oversized headerfields non-RFC characters or Unicode encoding ALAD [21]builds models of allowed keywords in text-based applicationprotocols such as FTP HTTP and SMTP The anomaly scoreis increased when a rare keyword is used for a particularservice These parser-based methods have a high detectionrate for known protocols However these methods requiremanually specified by experts and cannot deal with unknownprotocols The second class applies NLP techniques egN-gram analysis [22] to network traffic payloads PAYL[23] uses 1-grams and unsupervised learning to build a bytefrequency distribution model of payloads McPAD [24]creates 2]-grams and applies a sliding window to coverall sets of 2 bytes ] positions apart in each network trafficpayload They require no expert domain knowledge andcan detect zero-day worms because payloads with exploitcodes generally have an unusual byte frequency distributionThe drawbacks of them are unsatisfactory detection rateand relatively high computational overhead compared withparser-based methods

22 Deep Learning for Intrusion Detection Many deep learn-ing techniques have been used for developing ANIDS Ma etal [25] evaluated deep neural network on the KDDCUP99dataset and Niyaz et al [26] applied deep belief networks tointrusion detection on the NSL-KDD dataset However theyonly tested deep learning techniques on manually designed

features while their powerful ability to learn features fromraw data has not been exploited Recently several attemptsto learn effective features from raw packets have emergedYu et al [27 28] and Mahmood et al [29] used autoencoderto detect anomaly traffic Wang et al [30] applied CNNto learn the spatial features of network traffic and usedthe image classification method to classify malware trafficdespite the fact that network payloads are more similar todocuments Torres et al [31] transformed network trafficfeatures into character sequence and used RNN to learnthe temporal features while Wang et al [32] combinedCNN and LSTM together to learn both spatial and temporalfeatures These methods are of great insights yet have evidentweaknesses Firstly some time-based traffic features such asflow duration packet frequency and average packet lengthcannot be learned automatically by both CNN and LSTMBesides they ignore the semantic relation between each bytewhich is a critical factor in NLP In this paper we remedyboth problems by taking advantage of both expert domainknowledge and deep neural networks The statistical featuresare manually designed and the payload features are extractedby deep learning techniques in NLP To the best of ourknowledge no studies have made use of the advantages ofboth

3 TR-IDS

TR-IDS aims at automatically extracting features from pay-loads of raw network packets to improve the accuracy ofIDS Since random forest has superior performance onstructured data while convolutional network is suitable tohandle unstructured data [33] we combine the advantagesof both It performs classification on bidirectional networkflows (Biflow) which contains more temporal informationthan packet level datasets The implementation schemes areillustrated in Figure 1 and the different stages of TR-IDS aredescribed as follows

(i) Statistical features extraction we extract some crit-ical statistical features from each network flow Thesefeatures include fields in packet headers and statisticalattributes of the entire flow

(ii) Payload features extraction we map each byte inpayloads into a word vector using word embeddingand then extract salient features of payloads usingtext-convolutional neural network

(iii) Classification through random forest the statisti-cal features and payload features are concatenatedtogether and then the random forest algorithm isapplied to classify the generated new dataset

31 Statistical Features Extraction In this section we man-ually extract some discriminative features from the bidirec-tional network flows where the first packet in each flowdetermines the forward (source to destination) and backward(destination to source) direction We extract 44 statisticalfeatures from each flow and most of them are calculatedseparately in both forward and backward direction To be




Payload features

Output

Word Embedding

Text-CNN

Random Forest















Parametersto learn

SoftmaxOutput

Embeddinglayer

Inputlayer

D-dim

Dtimes|V|

Dtimes|V|

Dtimes|V|

2mtimes|V|-dim

|V|timesD

|V|-dim




payloads




feature layer

output kclasses










119888119894 = 119891 (w sdot x119894119894+ℎminus1 + 119887) (2)

















119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

















5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia





Payload features

Output

Word Embedding

Text-CNN

Random Forest















Parametersto learn

SoftmaxOutput

Embeddinglayer

Inputlayer

D-dim

Dtimes|V|

Dtimes|V|

Dtimes|V|

2mtimes|V|-dim

|V|timesD

|V|-dim




payloads




feature layer

output kclasses










119888119894 = 119891 (w sdot x119894119894+ℎminus1 + 119887) (2)

















119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

















5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia





Parametersto learn

SoftmaxOutput

Embeddinglayer

Inputlayer

D-dim

Dtimes|V|

Dtimes|V|

Dtimes|V|

2mtimes|V|-dim

|V|timesD

|V|-dim




payloads




feature layer

output kclasses










119888119894 = 119891 (w sdot x119894119894+ℎminus1 + 119887) (2)

















119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

















5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia




payloads




feature layer

output kclasses










119888119894 = 119891 (w sdot x119894119894+ℎminus1 + 119887) (2)

















119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

















5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia















119860119888119888119906119903119886119888119910 (119860119862119862) =119879119875 + 119879119873

119879119875 + 119865119875 + 119865119873 + 119879119873

119863119890119905119890119888119905119894119900119899119877119886119905119890 (119863119877) =119879119875

119879119875 + 119865119873

119865119886119897119904119890119860119897119886119903119898119877119886119905119890 (119865119860119877) =119865119875

119865119875 + 119879119873

(3)

















5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia














5 Conclusion


Data Availability




Acknowledgments



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



References









































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia








RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia




RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


Documents

TR-IDS: Anomaly-Based Intrusion Detection through Text