Defending Malicious Script Attacks Using Machine Learning ...downloads.hindawi.com/journals/wcmc/2017/5360472.pdf · Defending Malicious Script Attacks Using Machine Learning Classifiers

Research ArticleDefending Malicious Script Attacks Using MachineLearning Classifiers

Nayeem Khan Johari Abdullah and Adnan Shahid Khan

Department of Computer Systems amp Communication Technologies Faculty of Computer Science amp Information TechnologyUniversiti Malaysia Sarawak 94300 Kota Samarahan Sarawak Malaysia

Correspondence should be addressed to Nayeem Khan 15010049siswaunimasmy

Received 27 October 2016 Accepted 29 December 2016 Published 7 February 2017

Academic Editor Paul Honeine

Copyright copy 2017 Nayeem Khan et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Theweb application has become a primary target for cyber criminals by injectingmalware especially JavaScript to performmaliciousactivities for impersonation Thus it becomes an imperative to detect such malicious code in real time before any maliciousactivity is performed This study proposes an efficient method of detecting previously unknown malicious java scripts using aninterceptor at the client side by classifying the key features of the malicious code Feature subset was obtained by using wrappermethod for dimensionality reduction Supervisedmachine learning classifiers were used on the dataset for achieving high accuracyExperimental results show that our method can efficiently classify malicious code from benign code with promising results

1 Introduction

Due to the rapid advancement of modern computers andnetwork infrastructure and our dependence on the Internetand its services which are increasing web applications whichare accessed through web browsers have become a primarytarget for cyber criminals As per a report released bySymantec in 2016 [1] about 430 million unique pieces ofmalware were discovered showing a growth of 36 againstthe year 2014 Cyber criminals use the malicious code andmalicious URLs to attack individuals and organizations Thesole purpose of such attacks is for personal financial andpolitical gains through damaging or disrupting normal oper-ation of computer and systems performing phishing attacksdisplaying unwanted advertisements and extorting moneyThus detection of suchmalicious code attacks becomes a top-most security challenge

Open Web Application Security Project (OWASP) hasranked cross-site scripting (XSS) as the 2nd most dangerousvulnerability among top ten vulnerabilities Currently XSSholds a share of 43 among all the reported vulnerabilitiesXSS is a type of injection attack in which malicious scriptsare injected around benign code in a legitimate webpage inorder to access cookie session and other secret information

Web applications are used to transport malicious scripts toperform the attack The target of XSS attack is a client sidewhereas SQL injections target server-side [2] XSS attack isa vulnerability at the application layer of network hierarchywhich occurs by injecting malicious scripts to break securitymechanism About 70 attacks are reported to occur atapplication layer Web browsers are the most susceptibleapplication layer software for attacks The purpose of theweb browser is to get the requested web resource from theserver and displayed in browserrsquos windows The format ofthe supplied resources is not restricted to HyperText MarkupLanguage (HTML) but can also be portable document format(PDF) image and so on Attackers run malicious JavaScriptin a web browser to target users Malicious and obfuscatedURLs also serve as a carrier for XSS attacks [3]

JavaScript is a programmingscripting language used inweb programming for making web pages more interactiveadds more features and improves the end user experienceJavaScript also helps in reducing the server-side load andhelps in shifting some computation to end user side TheJavaScript is commonly used for client-side scripting to beused in web browsers When a user requests for a certainweb page through a web browser the server responds tothe request and sends back web page which might have

HindawiWireless Communications and Mobile ComputingVolume 2017 Article ID 5360472 9 pageshttpsdoiorg10115520175360472

2 Wireless Communications and Mobile Computing

JavaScript embedded into it Then the JavaScript interpreterinside the web browser at the userrsquos side starts executing itdirectly and automatically unless explicitly disabled by theuser Unfortunately JavaScriptrsquos code provides a strong basefor conducting attacks especially drive-by-download attackswhich unnoticeably attack clients amid the visit of a webpage In contrast to different sorts of network-based attacksmalicious JavaScripts are difficult to detect [4] JavaScriptattacks are performed by looking into the vulnerabilityand exploiting by using JavaScript obfuscation techniquesto evade the detection Thus detecting of such maliciousJavaScriptrsquos attacks at the real time becomes imperative

Security solution providers who play a vital role in pro-viding tools for defending attacks are fundamentally basedon two correlative approaches signature-based and heuristic-based detection approaches The signature-based approachis based upon the detection of unique string patterns in thebinary code Signature-based attack detection approach failswhen a new type of attack occurs In this approach theinstance of new attack needs to be captured first in orderto create a signature from it and then subsequently updatethe definitions of the detection tool on the client side Asthe time gap between the capturing of new attack instanceand the updating at the client side is high it may makemillions of client-side machines vulnerable to attacks Cybercriminals take advantage of this time gap and vulnerabilityand simultaneously launch the attack Consequently thisapproach is unacceptable in an environment where newattacks are expected to come Another tradition approach forattack detection is heuristic-based detection Heuristic-baseddetection is based upon the set of expert decision rules todetect the attackThe advantage of using this attack approachis that it cannot only detect modified or variant existingmalware but can also detect the previous unknown attackThe downside of using this approach is that it takes a longtime in performing scanning and analysis which drasticallyslows down the security performance Another problem ofthe approach is that it introduces many false positives Falsepositive occurs when a system wrongly identifies code or afile as malicious when actually it is not Some of the heuristicapproaches include file emulation file analysis and genericsignature detection

Researchers recently have enlisted machine learning inthe detection of malware The advantage of using machinelearning is that it can determine whether a code or a file ismalicious or not in a very small time without the need ofisolating it in a sandbox to perform the analysis MachineLearning is able to detect previously unknown malware withpredictive capabilities and especially useful for the detectionof increasingly polymorphic nature of malware To performthe classification of a benign and a malicious code machinelearning classifiers are used in detection through learningfrom patterns

Keeping in view the impact that an attack can cause onthe client side detection of JavaScript at a real time thusbecame necessary Several approaches have been proposedfor classification and detection of malicious JavaScript codefrom the benign code of a website such as those of Rieck et al[5] Curtsinger et al [6] and Fraiwan et al [7] the problem

of these approaches is time overhead in detection Otherapproaches such as GATEKEEPER [8] and Google Caga [9]use a method of executing random JavaScripts from the codein a protected environment but can detect the limited type ofattacks

Antivirus software whether online or offline usessignature-based techniques for detection of malware thusrequiring a constant update for newmalware signatures fromvendors for detection The time gap between the detectionof new malware and updating the virus signature at theclient side may take a long time It thus leaves the systemsopen for vulnerabilities until the virus definition has notbeen released by the vendor and updated by the client Thusantiviruses are ineffective in current scenario for defendingzero-day attacks where new malware variants are put in wildevery day In this paper we introduced an interceptor fordetection of malicious JavaScript attacks coming towardsthe client side based on machine learning for detection ofmalware especially previously unknown malware variantsTheproposed approach is light weight in naturewithminimalruntime overheadsDetection is based on the static analysis ofa code for extracting features from given JavaScript to be fedinto classifier for the classification process Experiments wereconducted by using a dataset of 1924 instances of JavaScriptwith 409 as malicious and 1515 as benign Machine languageclassifiers such as Naıve Bayes SVM J48 and 119896-NN wereindividually tested for their performance of the selectedfeatures The proposed approach is browser and platformindependent and is depicted in Figure 1 This study is acontinuation of our previous study [10] The rest of the paperis organized as follows Section 2 discusses the architectureof the proposed study Section 3 explains the classificationmodel Section 4 provides details about experiments andanalyzes their results Section 5 deals with the related workSection 6 provides conclusion and future work

2 Architecture of Proposed Approach

Keeping in view the need for the real-time detection ofmalicious JavaScript code we proposed an approach fordetection of such attacks the proposed approach consists ofan interceptor between browser and server for detection ofmalicious JavaScript code In this approach all the trafficbetween server and client is exchanged through the inter-ceptor to check for possible attacks in the source code to beexecuted by the browser There are no direct communicationchannels between browser and server This interceptor isproposed to protect the client side from being attacked ordelivered with the malicious mode The interceptor uses astatic analysis for detection of malicious JavaScript attackThe interceptor will be installed as a plugin on the browserNormally when a client intends to visit awebsite by typing theURL into the address bar anHTTPGET request is sent to theweb server for lookup of the requested web page and if foundresponse is generated and the cookie is set in the browser Inthis approach the response from the server is passed via aninterceptor to find a malicious code Features are extractedfrom the source code of the web page which is fed into the

Wireless Communications and Mobile Computing 3

Web browser

XSS

vuln

erab

ility

in

terc

epto

r

Server

HTTP request HTTP request

HTTP response HTTP response

Client side

Figure 1 System figure of interceptor

MaliciousBenign

RejectAccept

Feature extraction

Loading

Lexical analysis

Classification

Result

Figure 2 Schematic diagram of interceptor

machine learning classification for decision If any maliciousJavaScript found the page is blocked before it is interpretedby the browser

The process of detection of malicious JavaScripts isdone in few steps as shown in Figure 2 At the first stepthe web browser sends a request to a web server for aspecific page using its URL all the subsequent web pagesare cached by the loader into memory for the purpose tomake succeeding requests for the same web domain fasterwhile the background check of the whole domain has alreadybeen done in the background by interceptor in advanceOnly the web pages up to level two will be loaded from theseed URL into the memory on a visit Once the user startsbrowsing and diving deep into the website the progressionis further extended automatically to next two levels witha maximum limit of 500 web pages per level which sumsup 7MB of JavaScript code per level with the average ofJavaScript code found on each web page being 15 KB as per[11]The source code of the web page needs to be a breakdownby performing lexical analysis to generate a sequence of thetokens to be parsed In this method we used Yacc parser (yetanother compiler-compiler) which is an LALR (Look Aheadleft Right) algorithm based parser to generate tokens fromJavaScripts [12 13] The parser reads and analyzes an inputstream of JavaScripts which breaks them into componentpieces Each component piece obtained is a token which may

be a keyword string number or punctuation which cannotbe further broken The tokens obtained after parsing arearranged according to the input structure of JavaScript codeobtained from the web page and grammar rules of the parserThe benefit of using LALR type of parser is to generate a largeparsing table from the input stream of JavaScripts fromwherea larger set of tokens can be drawnThe tokens obtained fromparsing are stored in a file to be used in further processing Toevade the detection attackers employ obfuscation techniquein an attempt to infuse malicious code A wide range ofobfuscation approaches are used such as encoding escapedASCII string split and hidden iFrame to avoid detection Asthe obfuscation is becoming more and more sophisticatedany such obfuscated code will be detected with a high degreeof accuracy and precision during lexical analysis Extractionof features is a key step to perform the transformation ofunstructured and semistructured contents into a structuredform which is used for classification The final step is toperform classification which will accurately predict to whichclass a new script belongs based upon the observations on thetraining set whose class is already known Once a maliciousscript is detected the page will be blocked

3 Building the Classification Model

31 Dataset In order to perform classification with highaccuracy the quality of dataset is very important whichmustcontain the instances of benign and malicious JavaScriptsso that the distribution of samples emulates distribution inthe web Dataset used for conducting experiments consistsof total 1924 instances with 1515 instances of benign and 409malicious instancesThe dataset has been obtained from [14]

32 Feature Selection A number of features can be extractedfrom JavaScripts but not all of them will be helpful inaccurate detection of malicious JavaScripts In the featureselection process a classification algorithm is encounteredwith a problem of selecting a relevant set of features asa primary focus while ignoring others in order to achievehigh accuracy To obtain high accuracy with a particularclassification algorithm on a specific dataset feature subsetselection plays a crucial role in deciding how a classification


Set of all features Performance

Selecting the best subset

Generate a Learningsubset algorithm

Figure 3 Feature selection wrapper method

Table 1 Feature subset obtained by applying wrapper method

Classifier Feature subsetNaıve Bayes 7 9 15 1725 26 36 70J48 3 7 17 69 70SVM 9 12 16 17 20 53 70119870-NN 7 11 69 70

and the training set interact Taking this into considerationwe performed feature subset selection using wrapper methodin order to generate a feature subset which is likely to bemore predictive with a specific classification algorithm Wehave extracted 70 features of JavaScripts as shown in theAppendix In wrapper method the feature selection is doneby an induction algorithm as a black box where the algorithmconducts a search for a good subset itself as a part of theevaluation process as shown in Figure 3 [15] The accuracy ofthe induced classifier is estimated using accuracy estimationtechniques [16] In this study we employed classifier subsetevaluator along with the best first search engine to obtain anoptimal feature subset Classifier subset evaluator evaluatesattribute subsets on training data or performs a seperateholdout test on the testing set and uses a classifier to estimatethe ldquomeritrdquo of a set of attributes [17] We obtained subsetsby using a classifier subset evaluator along with the best firstsearch engineThe subsets obtained for different classifiers aregiven in Table 1

33 Classifiers Researchers in the field of machine learninghave proposed many algorithms for classification in the pastSome of the popular known algorithms are SVM NaıveBayes decision tree119870-Nearest Neighbour (119870-NN) RandomForests and so on The selection of algorithm depends onthe size of the training set and also on the basis of accuracytraining time linearity the number of parameters and thenumber of features If the training set is small in size thenhigh variance low bias classifiers are used and if the trainingset is large for the low variance high bias classifiers areused Machine learning algorithms are divided into threecategories based on their learning style They are supervisedunsupervised and semisupervised algorithmsThe proposedapproach will use 4 supervised machine learning classifiersand their results will suggest which classifier to be used in theinterceptor for the real-time detection malicious code

331 Naıve Bayes Classifier Naıve Bayes [18] classifier tech-nique is based on Bayes theorem with independent assump-tion between predictors Naıve Bayes model is easy to build

when the dimensionality of the dataset is very large Inthe context of learning the process the classifier tokenizestraining data to some tokens 119909119894 (119894 = 1 119899) and countsthe number of occurrences of 119909119894 in each class Based on thisprocess the likelihood of each class is computed with sometest data and classifies that test data for the class which hasthe highest likelihood

Despite its simplicity Naıve Bayes classifier is widely usedas it surpasses more sophisticated classification methodsBayesian classifier is based on Bayes theorem which says

119901 (119888119895 | 119889) = 119901 (119889 | 119888119895) 119901 (119888119895)119901 (119889) (1)

where 119901(119888119895 | 119889) is probability of instance 119889 being in class 119888119895119901(119889 | 119888119895) is probability of generating instance 119889 given class119888119895 119901(119888119895) is probability of occurrence of class 119888119895 and 119901(119889) isprobability of instance 119889 occurring

In our study there are only two classes malicious andbenign for classification of the range of 119895 is from 1 to 2 only

332 Support Vector Machines Support Vector Machines(SVM) were developed by Boser et al in 1992 [19] Howeverthe original optimal hyperplane algorithm was introducedby Vapnik and Lerner in 1963 [20] SVM is considered asone of the most effective models for binary classification ofhigh-dimensional data An SVM performs classification byconstructing an 119873-dimensional hyperplane that optimallyseparates the data into two categories to maximize thedistance of the hyperplane SVM can also be extended to thedata that is not linearly separable by applying kernel to mapthe data into a higher-dimensional space and to separate thedata on the mapped dimension

333 K-Nearest Neighbour The119870-Nearest Neighbour Algo-rithm (KNN) [21] is the simplestmachine learning algorithmTo determine the category of the test data 119870-NN performsa test to check the degree of similarity between documentsand 119896 training data to store a certain amount of classifieddata Since 119896-NN classifies instances in our research it will bemalicious and benign code instances nearest to the trainingspace The classification of unknown instances is performedby measuring the distance between the training instance andunknown instance Since instances are classified based uponthe majority vote of neighbour the most common neighbouris measured by a distance function If 119896 = 1 then theinstance is assigned for the class of its nearest neighbour In119899-dimensional space distance between two points 119909 and 119910 isachieved by using any distance function as follows

Euclidean distance function

radic 119896sum119894=1

(119909119894 minus 119910119894)2 (2)

Manhattan distance function119896sum119894=1

1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816 (3)


Minkowski distance function

( 119896sum119894=1

(1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816)119902)1119902 (4)

334 Decision Trees Decision tree learners are a non-parametric supervised method used for classification andregression In a decision tree classifier is represented as atree whose internal nodes represent the condition of thevariable and final nodes or leaves are the final decisionof the algorithm In the process of classification a well-formed decision tree can efficiently classify a document byrunning a query from the root not until it reaches a certainnode The main advantage of using decision tree is that itis simple and easy to understand and interpret for naıveusers The risk associated with decision tree is overfittingwhich occurs when a tree is fully grown and it may losesome generalization capabilities Some common reasons ofoverfitting are the presence of noise lack of representationinstance and multiple comparison procedures Overfittingcan be avoided by severing approaches such as prepruningand postpruning

34 Performance Evaluation Performance evaluation acts asa multipurpose tool which is used to measure the actual val-ues within the system against expected values Our purposeof evaluation is to study and measure the performance of aclassifier in detectingmalicious code In order to achieve highresults from the proposed approach we are deeply concernedwith the accuracy which is defined as

Accuracy = No of Classified Benign ScriptsTotal Benign Samples

times 100 (5)

A false positive scenario occurs when the attack detectionapproach mistakenly treats a normal code as a maliciouscode In a given approach a false negative occurs when amalicious code is not detected despite its illegal behaviourDetection rate is measured by using confusion matrix orerror matrix for the assessment of false positives and falsenegatives false positive and false negative detection rate iscalculated by

FPR = FP119873 = FPFP + TN (6)

And the false negative rate is calculated by

FNR = FN119877 = FNFN + TP (7)

where FPR is false positive rate FNR is false negative rate FNis false negative TN is true negative and TP is true positive

True negative shows a number of negative samples cor-rectly identified false negative implies a number of malicioussamples identified as negative false positive indicates thenumber of negative samples identified as malicious andtrue positive shows a number of malicious samples correctlyidentified [22] The performance of the proposed detection

approach will be the rate at which the malicious scripts areprocessed The performance will be calculated by latencytime which is taken in presence and absence of interceptorto display a page and another by calculating the systemresource consumption in both scenarios Achieving real-timedetection is not possible where the detection system is poorReceiver operating characteristic (ROC) had also been usedfor the evaluation process ROC is a graphical plot generatedby using a fraction TPR against the fraction of FPR for abinary classifier at various thresholds

4 Experimental Results and Analysis

In our experiments we used a dataset of total 1924 instanceswith 1515 instances of benign and 409 malicious instancesThe aim of performance evaluation is to study and analyzethe performance of classifiers in correctly classifying theinstance by using the evaluation metrics such as accuracytrue positive rate and false positive rate In this study wehave conducted three experiments In Experiment I we used100 labels as training In Experiment II the entire sampleset is split into two parts the training 80 and testing 20In Experiment III each classifier was evaluated using 10-fold cross-validation 119870-fold cross-validation is a techniqueto evaluate predictive models by partitioning the originalsample into a training set to train the model and a test setto evaluate it where 119896 ranges from 1 to 119896 minus 1 In 10-foldcross-validation the data is portioned into ten subsets foreach portioned subset the remaining nine subsets are used totrain the classifier and the final subset is considered as a testset In all the three conducted experiments four classifierssuch as Naıve Bayes J48 SVM and 119870-NN were used ForSVM RBF kernel was used In order to determine theparameter regularization we used grid-search to determine119862 and gamma using cross-validation Various pairs of (119862 120574)values were tested and the one with the best cross-validationaccuracy was picked RBF kernel 119862 = 250007 and 120574 =001 were found optimal For KNN the Euclidean distancefunction was used as given in (2) The best value for 119870 waschosen automatically by WEKA which uses cross-validationand 119870 = 10 was found optimal The features used for theclassification process were chosen by using wrapper methodThe feature set obtained fromwrappermethodwith respect toindividual classifiers is given in Table 1 The results obtainedfrom three experiments are listed in Tables 2 3 and 4respectively

As shown in Tables 2 3 and 4 the overall performancesof all the four classifiers are good but certain classifierperforms better in each of the experiments SVM in all thethree experiments achieved low accuracy as low as 9455in Experiment II KNN in Experiment I outperformedwith 100 accuracy rate However for KNN a decreasein performance was observed within Experiments II andIII respectively J48 achieved an accuracy of 9709 inExperiment I and it increased in Experiment II to 9922TheROCcurve obtained show thatKNNachievedROC= 1 inExperiment I followed by the J48 classifier in Experiment IIIwith ROC = 0983 Comparing the results obtained in all the


Table 2 Results obtained from four classifiers by using 100 training

Classifier Accuracy TP rate FP rate Precision Recall 119865-measure ROC Class

Naıve Bayes 9725 0875 0001 0994 0875 0931 0943 Malicious0999 0125 0967 0999 0983 0943 Nonmalicious

J48 9709 0966 0003 099 0966 0978 0982 Malicious0997 0034 0991 0997 0994 0982 Nonmalicious

SVM 9651 0912 002 0923 0912 0918 0946 Malicious098 0088 0976 098 0978 0946 Nonmalicious

KNN 10000 1 0 1 1 1 1 Malicious1 0 1 1 1 1 Nonmalicious

Table 3 Results obtained from four classifiers by performing 10-fold cross-validation






Table 4 Results obtained from four classifiers for 80 training and 20 testing






three experiments it can be concluded that KNN performedvery well and has achieved 100 accuracy with ROC = 1 Allthe experiments were run on a PC with Quad core 36GHZprocessor with 16GB primary memory In order to know thecomputation costs the same PC was used with the Internetspeed of 202Mbps as download speed and 390Mbps asupload speed The time between the first requests made bythe user till the display of first page was recorded with andwithout interceptor It was found that the average time todisplay a web page without interceptor was 055 secondswhereas the average time for displaying a singlewebpage afterclassification was 097 seconds WEKA was used to obtainthe classification results WEKA is a collection of machinelearning algorithms for data mining tasks developed by theUniversity of Waikato New Zealand [23]

5 Related Work

Malicious JavaScript attacks have become an essential busi-ness proposition due to the seriousness of their threatand impact Detection of malicious JavaScript attacks thusbecame a vital field of research Despite a number of tech-niques for detection of JavaScript attack have been proposedJavaScript attacks still remain a threat to users Thus anefficient approach to mitigate XSS is demanded Researcherstook the services of machine learning to find the accuratedetection approach for detection of previously unknownattacks In 2005 Kam andThi [24] were the first to explore thepossibility of using machine learning for detection maliciousweb page Their work limited up to the detection of amalicious web page based on URL by performing lexical


Table 5 Complete feature list

(1) Number Of Redirections(2) Difference In Redirections(3) Number Of Instantiated Objects(4) Difference In Instantiated Objects(5) Length Of Longest Single Evaluated Code(6) Total Bytes Of Evaluated Code(7) Total Bytes Allocated(8) Number Of Executions(9) Length Of Longest Unprintable String(10) Number Of Unprintable Strings(11) Ratio Of Definitions To Uses(12) Length Of Method Call Parameter(13) Total Number Of Method Calls(14) Memory Overflow Occured(15) Unprintable String Used For Instantiated Object Parameter(16) Unprintable Strings ANDOR Redirects(17) Method Cal(18) open(19) run(20) Savetofile(21) Send(22) Setattribute(23) Write(24) timespanformat(25) shellexecute(26) playerproperty(27) uploadlogs(28) hgs startnotify(29) downloadandinstall(30) getvariable(31) linksbicons(32) openurl(33) getresponseheader(34) keyframe(35) setrequestheader(36) Setslice(37) buildpath(38) getspecialfolder(39) environment(40) rawparse(41) iestartnative(42) setformatlikesample(43) createobject(44) specialfolders(45) replace(46) createnewfoldername(47) addevent(48) isversionsupported(49) split(50) new(51) evaluate(52) msdatasourceobject

Table 5 Continued

(53) allowscriptaccess(54) concat(55) url(56) mode(57) type(58) Import(59) close(60) allowcontextmenu(61) cachefolder(62) compressedpath(63) console(64) printsnapshot(65) shownavigationbuttons(66) snapshotpath(67) wkspictureinterface(68) zoom(69) script Size(70) intent

analysis This study does not check the code on the webpage for any malicious code In 2006 Kolter and Maloof [25]used 4-gram sequence for feature extraction of features fromtwo 500 119899-grams by using a measure of information gainClassifiers such as Naıve Bayes SVM and decision trees wereused for classification Results show that only J48 classifiercould achieve an AUC of 0996 In 2007 Garera et al [26]used regression analysis by using about 18 selected featuresfor detection of a phishing website An accuracy of 973was achieved by using a small data set of about 2500 URLsIn 2008 McGrath and Gupta [27] try to detect phishingURLs by performing a comparative analysis of phishing andnonphishing URLS Their study does not use any classifierIn 2008 Polychronakis and Provos [28] carried out a studyfor drive-by exploit of URLs by using machine learningclassifiers prefilters The limitation to this approach is that itis time-consuming as it employs a heavyweight classifier forclassification Based on dynamic analysis approach Ahmed etal in 2009 [29] proposed an approach for malware detectionby extracting features in combined from spatial and temporalinformation from API at runtime In 2012 Abbasi et al [30]used a very small dataset for classification of fake medicalwebsites This study is restricted only up to the detection ofthe fakemedical website and cannot detect other types of fakeor malicious websites Huda et al in 2016 [31] proposed anapproach for malware detection based on API call statisticsas a feature for the SVM classifier Soska et al proposedan approach for detection of malicious websites based onfeatures extracted from network traffic statistics file systemand contents on the web pages using C45 decision treebased algorithm Another study conducted by Alazab in 2015[32] proposed an approach for detection of malware basedon API call sequence quite similar to [31] by using 119870-NNclassifier Al-Taharwa et al in 2015 [33] proposed an approachfor detection of deobfuscation of JavaScripts for detectionof malicious JavaScripts Our work is based on extracting


features based on their quality which is precisely analyzed tobe used during the classification processThese features act aspivotal in initiating a JavaScript attack

6 Conclusion and Future Work

In this paper we proposed an interceptor for detection ofmalicious JavaScripts and conducted experiments The pro-posed approach achieved an accuracy of 100 in detectionfor previously unknown malicious JavaScriptrsquos attacks basedupon the learning Experimental results show that ROC =1 was achieved by KNN classifier with no false positiveThe wrapper method played an important role in featureselection which leads to high accuracy compared to otherstudied static approaches

Appendix

See Table 5

Competing Interests

The authors declare that there is no conflict of interestsbetween the publication of this paper and the funding agencyUniversiti Malaysia Sarawak

Acknowledgments

The research conducted is part of the DPP (PhD ResearchGrant) under Universiti Malaysia Sarawak (Research Grantno F08(DPP28)12172015(03))

References

[1] ldquoInternet SecurityThreat Reportrdquo Symantec Vol 21 April 2016[2] M Van Gundy and H Chen ldquoNoncespaces using randomiza-

tion to enforce information ow tracking and thwart cross-sitescripting attacksrdquo in Proceedings of the Network and DistributedSystem Security Symposium (NDSS rsquo09) pp 1ndash18 San DiegoCalifo USA February 2009

[3] A E Nunan E Souto E M dos Santos and E Feitosa ldquoAuto-matic classification of cross-site scripting in web pages usingdocument-based and URL-based featuresrdquo in Proceedings of the17th IEEE Symposium on Computers and Communication (ISCCrsquo12) pp 702ndash707 IEEE Cappadocia Turkey July 2012

[4] K SchuttM Kloft A Bikadorov andK Rieck ldquoEarly detectionof malicious behavior in javascript coderdquo in Proceedings of the5th ACMWorkshop on Artificial Intelligence and Security (AISecrsquo12) pp 15ndash24 Raleigh NC USA October 2012

[5] K Rieck T Krueger and A Dewald ldquoCujo efficient detectionand prevention of drive-by-download attacksrdquo in Proceedingsof the 26th Annual Computer Security Applications Conference(ACSAC rsquo10) pp 31ndash39 Austin Tex USA December 2010

[6] C Curtsinger B Livshits B Zorn and C Seifert ldquoZOZZLEfast and precise in-browser JavaScript malware detectionrdquo inProceedings of the 20thUSENIXConference on Security (SEC rsquo11)San Francisco Calif USA 2011

[7] M Fraiwan R Al-Salman N Khasawneh and S ConradldquoAnalysis and identification of malicious JavaScript coderdquo Infor-mation Security Journal vol 21 no 1 pp 1ndash11 2012

[8] S Guarnieri and V B Livshits ldquoGATEKEEPER mostly staticenforcement of security and reliability policies for javascriptcoderdquo in Proceedings of the 18th Conference on USENIX SecuritySymposium (SSYM rsquo09) vol 10Montreal Canada August 2009

[9] M S Miller M Samuel B Laurie I Awad and M Stay ldquoSafeactive content in sanitized JavaScriptrdquo Tech Rep Google 2008

[10] N Khan J Abdullah and A S Khan ldquoTowards vulnerabilityprevention model for web browser using interceptor approachrdquoin Proceedings of the 9th International Conference on IT in Asia(CITA rsquo15) pp 1ndash5 Seoul Korea August 2015

[11] T Johnson and P Seeling ldquoDesktop and mobile web page com-parison Characteristics trends and implicationsrdquo IEEE Com-munications Magazine vol 52 no 9 pp 144ndash151 2014

[12] S C Y Johnson Yet Another Compiler-Compiler vol 32 BellLaboratories Murray Hill NJ USA 1975

[13] DeRemer Franklin L Practical translators for LR (k) languagesDiss MIT 1969

[14] leakiEst School of Computer Science University of BirminghamBirmingham UK

[15] R Kohavi and G H John ldquoThe wrapper approachrdquo in FeatureExtraction Construction and Selection pp 33ndash50 Springer NewYork NY USA 1998

[16] B E Boser I M Guyon and V N Vapnik ldquoTraining algorithmfor optimal margin classifiersrdquo in Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory pp 144ndash152 ACM July 1992

[17] J Zurada ldquoDoes feature reduction help improve the classifica-tion accuracy rates a credit scoring case using a german datasetrdquo Review of Business Information Systems (RBIS) vol 14 no2 2010

[18] D Lewis David ldquoNaive (Bayes) at forty the independenceassumption in information retrievalrdquo in Machine LearningECML-98 10th European Conference on Machine LearningChemnitz Germany April 21ndash23 1998 Proceedings vol 1398 ofLecture Notes in Computer Science pp 4ndash15 Springer BerlinGermany 1998

[19] B E Boser I M Guyon and V N Vapnik ldquoTraining algorithmfor optimalmargin classifiersrdquo inProceedings of the FifthAnnualACM Workshop on Computational Learning Theory pp 144ndash152 Pittsburgh Pa USA July 1992

[20] V Vapnik and A Lerner ldquoPattern recognition using generalizedportrait methodrdquo Automation and Remote Control vol 24 no6 pp 774ndash780 1963

[21] N S Altman ldquoAn introduction to kernel and nearest-neighbornonparametric regressionrdquo The American Statistician vol 46no 3 pp 175ndash185 1992

[22] A E Nunan E Souto E M Dos Santos and E FeitosaldquoAutomatic classification of cross-site scripting in web pagesusing document-based andURL-based featuresrdquo in Proceedingsof the 17th IEEE Symposium on Computers and Communication(ISCC rsquo12) pp 000702ndash000707 Cappadocia Turkey July 2012

[23] Weka University of Waikato New Zealand[24] M-Y Kan and H O N Thi ldquoFast webpage classification using

URL featuresrdquo in Proceedings of the 14th ACM InternationalConference on Information and Knowledge Management (CIKMrsquo05) pp 325ndash326 Bremen Germany November 2005

[25] J Z Kolter and M A Maloof ldquoLearning to detect and classifymalicious executables in the wildrdquo Journal of Machine LearningResearch vol 7 pp 2721ndash2744 2006


[26] S Garera N Provos M Chew and A D Rubin ldquoA frameworkfor detection andmeasurement of phishing attacksrdquo in Proceed-ings of the ACMWorkshop on Recurring Malcode (WORM rsquo07)pp 1ndash8 ACM Alexandria Va USA November 2007

[27] D K McGrath and M Gupta ldquoBehind phishing an examina-tion of phisher modi operandirdquo LEET vol 8 p 4 2008

[28] M Polychronakis and N Provos ldquoGhost turns zombie explor-ing the life cycle of web-based malwarerdquo LEET vol 2 pp 1ndash82008

[29] F AhmedHHameedM Zubair Shafiq andM Farooq ldquoUsingspatio-temporal information inAPI calls withmachine learningalgorithms for malware detectionrdquo in Proceedings of the 2ndACM Workshop on Security and Artificial Intelligence (AISecrsquo09) Chicago Ill USA November 2009

[30] A Abbasi F M Zahedi and S Kaza ldquoDetecting fake medicalweb sites using recursive trust labelingrdquo ACM Transactions onInformation Systems vol 30 no 4 article 22 2012

[31] S Huda J Abawajy M Alazab M Abdollalihian R Islamand J Yearwood ldquoHybrids of support vector machine wrapperand filter based framework for malware detectionrdquo FutureGeneration Computer Systems vol 55 pp 376ndash390 2016

[32] M Alazab ldquoProfiling and classifying the behavior of maliciouscodesrdquo Journal of Systems and Software vol 100 pp 91ndash1022015

[33] I A Al-Taharwa H-M Lee A B Jeng K-PWu C-S Ho andS-M Chen ldquoJSOD JavaScript obfuscation detectorrdquo Securityand Communication Networks vol 8 no 6 pp 1092ndash1107 2015

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpswwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



JavaScript embedded into it Then the JavaScript interpreterinside the web browser at the userrsquos side starts executing itdirectly and automatically unless explicitly disabled by theuser Unfortunately JavaScriptrsquos code provides a strong basefor conducting attacks especially drive-by-download attackswhich unnoticeably attack clients amid the visit of a webpage In contrast to different sorts of network-based attacksmalicious JavaScripts are difficult to detect [4] JavaScriptattacks are performed by looking into the vulnerabilityand exploiting by using JavaScript obfuscation techniquesto evade the detection Thus detecting of such maliciousJavaScriptrsquos attacks at the real time becomes imperative

Security solution providers who play a vital role in pro-viding tools for defending attacks are fundamentally basedon two correlative approaches signature-based and heuristic-based detection approaches The signature-based approachis based upon the detection of unique string patterns in thebinary code Signature-based attack detection approach failswhen a new type of attack occurs In this approach theinstance of new attack needs to be captured first in orderto create a signature from it and then subsequently updatethe definitions of the detection tool on the client side Asthe time gap between the capturing of new attack instanceand the updating at the client side is high it may makemillions of client-side machines vulnerable to attacks Cybercriminals take advantage of this time gap and vulnerabilityand simultaneously launch the attack Consequently thisapproach is unacceptable in an environment where newattacks are expected to come Another tradition approach forattack detection is heuristic-based detection Heuristic-baseddetection is based upon the set of expert decision rules todetect the attackThe advantage of using this attack approachis that it cannot only detect modified or variant existingmalware but can also detect the previous unknown attackThe downside of using this approach is that it takes a longtime in performing scanning and analysis which drasticallyslows down the security performance Another problem ofthe approach is that it introduces many false positives Falsepositive occurs when a system wrongly identifies code or afile as malicious when actually it is not Some of the heuristicapproaches include file emulation file analysis and genericsignature detection

Researchers recently have enlisted machine learning inthe detection of malware The advantage of using machinelearning is that it can determine whether a code or a file ismalicious or not in a very small time without the need ofisolating it in a sandbox to perform the analysis MachineLearning is able to detect previously unknown malware withpredictive capabilities and especially useful for the detectionof increasingly polymorphic nature of malware To performthe classification of a benign and a malicious code machinelearning classifiers are used in detection through learningfrom patterns

Keeping in view the impact that an attack can cause onthe client side detection of JavaScript at a real time thusbecame necessary Several approaches have been proposedfor classification and detection of malicious JavaScript codefrom the benign code of a website such as those of Rieck et al[5] Curtsinger et al [6] and Fraiwan et al [7] the problem

of these approaches is time overhead in detection Otherapproaches such as GATEKEEPER [8] and Google Caga [9]use a method of executing random JavaScripts from the codein a protected environment but can detect the limited type ofattacks

Antivirus software whether online or offline usessignature-based techniques for detection of malware thusrequiring a constant update for newmalware signatures fromvendors for detection The time gap between the detectionof new malware and updating the virus signature at theclient side may take a long time It thus leaves the systemsopen for vulnerabilities until the virus definition has notbeen released by the vendor and updated by the client Thusantiviruses are ineffective in current scenario for defendingzero-day attacks where new malware variants are put in wildevery day In this paper we introduced an interceptor fordetection of malicious JavaScript attacks coming towardsthe client side based on machine learning for detection ofmalware especially previously unknown malware variantsTheproposed approach is light weight in naturewithminimalruntime overheadsDetection is based on the static analysis ofa code for extracting features from given JavaScript to be fedinto classifier for the classification process Experiments wereconducted by using a dataset of 1924 instances of JavaScriptwith 409 as malicious and 1515 as benign Machine languageclassifiers such as Naıve Bayes SVM J48 and 119896-NN wereindividually tested for their performance of the selectedfeatures The proposed approach is browser and platformindependent and is depicted in Figure 1 This study is acontinuation of our previous study [10] The rest of the paperis organized as follows Section 2 discusses the architectureof the proposed study Section 3 explains the classificationmodel Section 4 provides details about experiments andanalyzes their results Section 5 deals with the related workSection 6 provides conclusion and future work

2 Architecture of Proposed Approach

Keeping in view the need for the real-time detection ofmalicious JavaScript code we proposed an approach fordetection of such attacks the proposed approach consists ofan interceptor between browser and server for detection ofmalicious JavaScript code In this approach all the trafficbetween server and client is exchanged through the inter-ceptor to check for possible attacks in the source code to beexecuted by the browser There are no direct communicationchannels between browser and server This interceptor isproposed to protect the client side from being attacked ordelivered with the malicious mode The interceptor uses astatic analysis for detection of malicious JavaScript attackThe interceptor will be installed as a plugin on the browserNormally when a client intends to visit awebsite by typing theURL into the address bar anHTTPGET request is sent to theweb server for lookup of the requested web page and if foundresponse is generated and the cookie is set in the browser Inthis approach the response from the server is passed via aninterceptor to find a malicious code Features are extractedfrom the source code of the web page which is fed into the


Web browser

XSS

vuln

erab

ility

in

terc

epto

r

Server



Client side


MaliciousBenign

RejectAccept

Feature extraction

Loading

Lexical analysis

Classification

Result




















119901 (119888119895 | 119889) = 119901 (119889 | 119888119895) 119901 (119888119895)119901 (119889) (1)






radic 119896sum119894=1

(119909119894 minus 119910119894)2 (2)


1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816 (3)



( 119896sum119894=1

(1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816)119902)1119902 (4)




times 100 (5)


FPR = FP119873 = FPFP + TN (6)


FNR = FN119877 = FNFN + TP (7)



























5 Related Work





Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Web browser

XSS

vuln

erab

ility

in

terc

epto

r

Server



Client side


MaliciousBenign

RejectAccept

Feature extraction

Loading

Lexical analysis

Classification

Result




















119901 (119888119895 | 119889) = 119901 (119889 | 119888119895) 119901 (119888119895)119901 (119889) (1)






radic 119896sum119894=1

(119909119894 minus 119910119894)2 (2)


1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816 (3)



( 119896sum119894=1

(1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816)119902)1119902 (4)




times 100 (5)


FPR = FP119873 = FPFP + TN (6)


FNR = FN119877 = FNFN + TP (7)



























5 Related Work





Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation





















119901 (119888119895 | 119889) = 119901 (119889 | 119888119895) 119901 (119888119895)119901 (119889) (1)






radic 119896sum119894=1

(119909119894 minus 119910119894)2 (2)


1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816 (3)



( 119896sum119894=1

(1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816)119902)1119902 (4)




times 100 (5)


FPR = FP119873 = FPFP + TN (6)


FNR = FN119877 = FNFN + TP (7)



























5 Related Work





Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation











( 119896sum119894=1

(1003816100381610038161003816119909119894 minus 1199101198941003816100381610038161003816)119902)1119902 (4)




times 100 (5)


FPR = FP119873 = FPFP + TN (6)


FNR = FN119877 = FNFN + TP (7)



























5 Related Work





Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation





























5 Related Work





Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation












Table 5 Continued







Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation













Appendix

See Table 5

Competing Interests


Acknowledgments


References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation




















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation











RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









Documents

Defending Malicious Script Attacks Using Machine Learning ...downloads.hindawi.com/journals/wcmc/2017/5360472.pdf · Defending Malicious Script Attacks Using Machine Learning Classifiers