IT MANAGEMENT TITLES FROM AUERBACH PUBLICATIONS AND CRC PRESS

.NET 4 for Enterprise Architects and Developers
Sudhanshu Hate and Suchi Paharia
ISBN 978-1-4398-6293-3

A Tale of Two Transformations: Bringing Lean and Agile Software Development to Life
Michael K. Levine
ISBN 978-1-4398-7975-7

Antipatterns: Managing Software Organizations and People, Second Edition
Colin J. Neill, Philip A. Laplante, and Joanna F. DeFranco
ISBN 978-1-4398-6186-8

Asset Protection through Security Awareness
Tyler Justin Speed
ISBN 978-1-4398-0982-2

Beyond Knowledge Management: What Every Leader Should Know
Edited by Jay Liebowitz
ISBN 978-1-4398-6250-6

CISO's Guide to Penetration Testing: A Framework to Plan, Manage, and Maximize Benefits
James S. Tiller
ISBN 978-1-4398-8027-2

Cybersecurity: Public Sector Threats and Responses
Edited by Kim J. Andreasson
ISBN 978-1-4398-4663-6

Cybersecurity for Industrial Control Systems: SCADA, DCS, PLC, HMI, and SIS
Tyson Macaulay and Bryan Singer
ISBN 978-1-4398-0196-3

Data Warehouse Designs: Achieving ROI with Market Basket Analysis and Time Variance
Fon Silvers
ISBN 978-1-4398-7076-1

Emerging Wireless Networks: Concepts, Techniques, and Applications
Edited by Christian Makaya and Samuel Pierre
ISBN 978-1-4398-2135-0

Information and Communication Technologies in Healthcare
Edited by Stephan Jones and Frank M. Groom
ISBN 978-1-4398-5413-6

Information Security Governance Simplified: From the Boardroom to the Keyboard
Todd Fitzgerald
ISBN 978-1-4398-1163-4

IP Telephony Interconnection Reference: Challenges, Models, and Engineering
Mohamed Boucadair, Isabel Borges, Pedro Miguel Neves, and Olafur Pall Einarsson
ISBN 978-1-4398-5178-4

IT's All about the People: Technology Management That Overcomes Disaffected People, Stupid Processes, and Deranged Corporate Cultures
Stephen J. Andriole
ISBN 978-1-4398-7658-9

IT Best Practices: Management, Teams, Quality, Performance, and Projects
Tom C. Witt
ISBN 978-1-4398-6854-6

Maximizing Benefits from IT Project Management: From Requirements to Value Delivery
José López Soriano
ISBN 978-1-4398-4156-3

Secure and Resilient Software: Requirements, Test Cases, and Testing Methods
Mark S. Merkow and Lakshmikanth Raghavan
ISBN 978-1-4398-6621-4

Security De-engineering: Solving the Problems in Information Risk Management
Ian Tibble
ISBN 978-1-4398-6834-8

Software Maintenance Success Recipes
Donald J. Reifer
ISBN 978-1-4398-5166-1

Software Project Management: A Process-Driven Approach
Ashfaque Ahmed
ISBN 978-1-4398-4655-1

Web-Based and Traditional Outsourcing
Vivek Sharma, Varun Sharma, and K.S. Rajasekaran, Infosys Technologies Ltd., Bangalore, India
ISBN 978-1-4398-1055-2
Data Mining Tools for Malware Detection

Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20120111

International Standard Book Number-13: 978-1-4665-1648-9 (eBook - ePub)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Dedication

We dedicate this book to our respective families for their support that enabled us to write this book.
Contents
PREFACE
Introductory Remarks
Background on Data Mining
Data Mining for Cyber Security
Organization of This Book
Concluding Remarks
ACKNOWLEDGMENTS
THE AUTHORS
COPYRIGHT PERMISSIONS
CHAPTER 1 INTRODUCTION
1.1 Trends
1.2 Data Mining and Security Technologies
1.3 Data Mining for Email Worm Detection
1.4 Data Mining for Malicious Code Detection
1.5 Data Mining for Detecting Remote Exploits
1.6 Data Mining for Botnet Detection
1.7 Stream Data Mining
1.8 Emerging Data Mining Tools for Cyber Security Applications
1.9 Organization of This Book
1.10 Next Steps
PART I DATA MINING AND SECURITY
Introduction to Part I Data Mining and Security
CHAPTER 2 DATA MINING TECHNIQUES
2.1 Introduction
2.2 Overview of Data Mining Tasks and Techniques
2.3 Artificial Neural Network
2.4 Support Vector Machines
2.5 Markov Model
2.6 Association Rule Mining (ARM)
2.7 Multi-Class Problem
2.7.1 One-vs-One
2.7.2 One-vs-All
2.8 Image Mining
2.8.1 Feature Selection
2.8.2 Automatic Image Annotation
2.8.3 Image Classification
2.9 Summary
References
CHAPTER 3 MALWARE
3.1 Introduction
3.2 Viruses
3.3 Worms
3.4 Trojan Horses
3.5 Time and Logic Bombs
3.6 Botnet
3.7 Spyware
3.8 Summary
References
CHAPTER 4 DATA MINING FOR SECURITY APPLICATIONS
4.1 Introduction
4.2 Data Mining for Cyber Security
4.2.1 Overview
4.2.2 Cyber-Terrorism, Insider Threats, and External Attacks
4.2.3 Malicious Intrusions
4.2.4 Credit Card Fraud and Identity Theft
4.2.5 Attacks on Critical Infrastructures
4.2.6 Data Mining for Cyber Security
4.3 Current Research and Development
4.4 Summary
References
CHAPTER 5 DESIGN AND IMPLEMENTATION OF DATA MINING TOOLS
5.1 Introduction
5.2 Intrusion Detection
5.3 Web Page Surfing Prediction
5.4 Image Classification
5.5 Summary
References
CONCLUSION TO PART I
PART II DATA MINING FOR EMAIL WORM DETECTION
Introduction to Part II
CHAPTER 6 EMAIL WORM DETECTION
6.1 Introduction
6.2 Architecture
6.3 Related Work
6.4 Overview of Our Approach
6.5 Summary
References
CHAPTER 7 DESIGN OF THE DATA MINING TOOL
7.1 Introduction
7.2 Architecture
7.3 Feature Description
7.3.1 Per-Email Features
7.3.2 Per-Window Features
7.4 Feature Reduction Techniques
7.4.1 Dimension Reduction
7.4.2 Two-Phase Feature Selection (TPS)
7.4.2.1 Phase I
7.4.2.2 Phase II
7.5 Classification Techniques
7.6 Summary
References
CHAPTER 8 EVALUATION AND RESULTS
8.1 Introduction
8.2 Dataset
8.3 Experimental Setup
8.4 Results
8.4.1 Results from Unreduced Data
8.4.2 Results from PCA-Reduced Data
8.4.3 Results from Two-Phase Selection
8.5 Summary
References
CONCLUSION TO PART II
PART III DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
Introduction to Part III
CHAPTER 9 MALICIOUS EXECUTABLES
9.1 Introduction
9.2 Architecture
9.3 Related Work
9.4 Hybrid Feature Retrieval (HFR) Model
9.5 Summary
References
CHAPTER 10 DESIGN OF THE DATA MINING TOOL
10.1 Introduction
10.2 Feature Extraction Using n-Gram Analysis
10.2.1 Binary n-Gram Feature
10.2.2 Feature Collection
10.2.3 Feature Selection
10.2.4 Assembly n-Gram Feature
10.2.5 DLL Function Call Feature
10.3 The Hybrid Feature Retrieval Model
10.3.1 Description of the Model
10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
10.3.3 Feature Vector Computation and Classification
10.4 Summary
References
CHAPTER 11 EVALUATION AND RESULTS
11.1 Introduction
11.2 Experiments
11.3 Dataset
11.4 Experimental Setup
11.5 Results
11.5.1 Accuracy
11.5.1.1 Dataset1
11.5.1.2 Dataset2
11.5.1.3 Statistical Significance Test
11.5.1.4 DLL Call Feature
11.5.2 ROC Curves
11.5.3 False Positive and False Negative
11.5.4 Running Time
11.5.5 Training and Testing with Boosted J48
11.6 Example Run
11.7 Summary
References
CONCLUSION TO PART III
PART IV DATA MINING FOR DETECTING REMOTE EXPLOITS
Introduction to Part IV
CHAPTER 12 DETECTING REMOTE EXPLOITS
12.1 Introduction
12.2 Architecture
12.3 Related Work
12.4 Overview of Our Approach
12.5 Summary
References
CHAPTER 13 DESIGN OF THE DATA MINING TOOL
13.1 Introduction
13.2 DExtor Architecture
13.3 Disassembly
13.4 Feature Extraction
13.4.1 Useful Instruction Count (UIC)
13.4.2 Instruction Usage Frequencies (IUF)
13.4.3 Code vs. Data Length (CDL)
13.5 Combining Features and Computing the Combined Feature Vector
13.6 Classification
13.7 Summary
References
CHAPTER 14 EVALUATION AND RESULTS
14.1 Introduction
14.2 Dataset
14.3 Experimental Setup
14.3.1 Parameter Settings
14.3.2 Baseline Techniques
14.4 Results
14.4.1 Running Time
14.5 Analysis
14.6 Robustness and Limitations
14.6.1 Robustness against Obfuscations
14.6.2 Limitations
14.7 Summary
References
CONCLUSION TO PART IV
PART V DATA MINING FOR DETECTING BOTNETS
Introduction to Part V
CHAPTER 15 DETECTING BOTNETS
15.1 Introduction
15.2 Botnet Architecture
15.3 Related Work
15.4 Our Approach
15.5 Summary
References
CHAPTER 16 DESIGN OF THE DATA MINING TOOL
16.1 Introduction
16.2 Architecture
16.3 System Setup
16.4 Data Collection
16.5 Bot Command Categorization
16.6 Feature Extraction
16.6.1 Packet-Level Features
16.6.2 Flow-Level Features
16.7 Log File Correlation
16.8 Classification
16.9 Packet Filtering
16.10 Summary
References
CHAPTER 17 EVALUATION AND RESULTS
17.1 Introduction
17.1.1 Baseline Techniques
17.1.2 Classifiers
17.2 Performance on Different Datasets
17.3 Comparison with Other Techniques
17.4 Further Analysis
17.5 Summary
CONCLUSION TO PART V
PART VI STREAM MINING FOR SECURITY APPLICATIONS
Introduction to Part VI
CHAPTER 18 STREAM MINING
18.1 Introduction
18.2 Architecture
18.3 Related Work
18.4 Our Approach
18.5 Overview of the Novel Class Detection Algorithm
18.6 Classifiers Used
18.7 Security Applications
18.8 Summary
References
CHAPTER 19 DESIGN OF THE DATA MINING TOOL
19.1 Introduction
19.2 Definitions
19.3 Novel Class Detection
19.3.1 Saving the Inventory of Used Spaces during Training
19.3.1.1 Clustering
19.3.1.2 Storing the Cluster Summary Information
19.3.2 Outlier Detection and Filtering
19.3.2.1 Filtering
19.3.3 Detecting Novel Class
19.3.3.1 Computing the Set of Novel Class Instances
19.3.3.2 Speeding up the Computation
19.3.3.3 Time Complexity
19.3.3.4 Impact of Evolving Class Labels on Ensemble Classification
19.4 Security Applications
19.5 Summary
References
CHAPTER 20 EVALUATION AND RESULTS
20.1 Introduction
20.2 Datasets
20.2.1 Synthetic Data with Only Concept-Drift (SynC)
20.2.2 Synthetic Data with Concept-Drift and Novel Class (SynCN)
20.2.3 Real Data: KDD Cup 99 Network Intrusion Detection
20.2.4 Real Data: Forest Cover (UCI Repository)
20.3 Experimental Setup
20.3.1 Baseline Method
20.4 Performance Study
20.4.1 Evaluation Approach
20.4.2 Results
20.4.3 Running Time
20.5 Summary
References
CONCLUSION TO PART VI
PART VII EMERGING APPLICATIONS
Introduction to Part VII
CHAPTER 21 DATA MINING FOR ACTIVE DEFENSE
21.1 Introduction
21.2 Related Work
21.3 Architecture
21.4 A Data Mining-Based Malware Detection Model
21.4.1 Our Framework
21.4.2 Feature Extraction
21.4.2.1 Binary n-Gram Feature Extraction
21.4.2.2 Feature Selection
21.4.2.3 Feature Vector Computation
21.4.3 Training
21.4.4 Testing
21.5 Model-Reversing Obfuscations
21.5.1 Path Selection
21.5.2 Feature Insertion
21.5.3 Feature Removal
21.6 Experiments
21.7 Summary
References
CHAPTER 22 DATA MINING FOR INSIDER THREAT DETECTION
22.1 Introduction
22.2 The Challenges, Related Work, and Our Approach
22.3 Data Mining for Insider Threat Detection
22.3.1 Our Solution Architecture
22.3.2 Feature Extraction and Compact Representation
22.3.3 RDF Repository Architecture
22.3.4 Data Storage
22.3.4.1 File Organization
22.3.4.2 Predicate Split (PS)
22.3.4.3 Predicate Object Split (POS)
22.3.5 Answering Queries Using Hadoop MapReduce
22.3.6 Data Mining Applications
22.4 Comprehensive Framework
22.5 Summary
References
CHAPTER 23 DEPENDABLE REAL-TIME DATA MINING
23.1 Introduction
23.2 Issues in Real-Time Data Mining
23.3 Real-Time Data Mining Techniques
23.4 Parallel, Distributed, Real-Time Data Mining
23.5 Dependable Data Mining
23.6 Mining Data Streams
23.7 Summary
References
CHAPTER 24 FIREWALL POLICY ANALYSIS
24.1 Introduction
24.2 Related Work
24.3 Firewall Concepts
24.3.1 Representation of Rules
24.3.2 Relationship between Two Rules
24.3.3 Possible Anomalies between Two Rules
24.4 Anomaly Resolution Algorithms
24.4.1 Algorithms for Finding and Resolving Anomalies
24.4.1.1 Illustrative Example
24.4.2 Algorithms for Merging Rules
24.4.2.1 Illustrative Example of the Merge Algorithm
24.5 Summary
References
CONCLUSION TO PART VII
CHAPTER 25 SUMMARY AND DIRECTIONS
25.1 Introduction
25.2 Summary of This Book
25.3 Directions for Data Mining Tools for Malware Detection
25.4 Where Do We Go from Here?
APPENDIX A DATA MANAGEMENT SYSTEMS: DEVELOPMENTS AND TRENDS
A.1 Introduction
A.2 Developments in Database Systems
A.3 Status, Vision, and Issues
A.4 Data Management Systems Framework
A.5 Building Information Systems from the Framework
A.6 Relationship between the Texts
A.7 Summary
References
APPENDIX B TRUSTWORTHY SYSTEMS
B.1 Introduction
B.2 Secure Systems
B.2.1 Introduction
B.2.2 Access Control and Other Security Concepts
B.2.3 Types of Secure Systems
B.2.4 Secure Operating Systems
B.2.5 Secure Database Systems
B.2.6 Secure Networks
B.2.7 Emerging Trends
B.2.8 Impact of the Web
B.2.9 Steps to Building Secure Systems
B.3 Web Security
B.4 Building Trusted Systems from Untrusted Components
B.5 Dependable Systems
B.5.1 Introduction
B.5.2 Trust Management
B.5.3 Digital Rights Management
B.5.4 Privacy
B.5.5 Integrity, Data Quality, and High Assurance
B.6 Other Security Concerns
B.6.1 Risk Analysis
B.6.2 Biometrics, Forensics, and Other Solutions
B.7 Summary
References
APPENDIX C SECURE DATA, INFORMATION, AND KNOWLEDGE MANAGEMENT
C.1 Introduction
C.2 Secure Data Management
C.2.1 Introduction
C.2.2 Database Management
C.2.2.1 Data Model
C.2.2.2 Functions
C.2.2.3 Data Distribution
C.2.3 Heterogeneous Data Integration
C.2.4 Data Warehousing and Data Mining
C.2.5 Web Data Management
C.2.6 Security Impact
C.3 Secure Information Management
C.3.1 Introduction
C.3.2 Information Retrieval
C.3.3 Multimedia Information Management
C.3.4 Collaboration and Data Management
C.3.5 Digital Libraries
C.3.6 E-Business
C.3.7 Security Impact
C.4 Secure Knowledge Management
C.4.1 Knowledge Management
C.4.2 Security Impact
C.5 Summary
References
APPENDIX D SEMANTIC WEB
D.1 Introduction
D.2 Layered Technology Stack
D.3 XML
D.3.1 XML Statement and Elements
D.3.2 XML Attributes
D.3.3 XML DTDs
D.3.4 XML Schemas
D.3.5 XML Namespaces
D.3.6 XML Federations/Distribution
D.3.7 XML-QL, XQuery, XPath, XSLT
D.4 RDF
D.4.1 RDF Basics
D.4.2 RDF Container Model
D.4.3 RDF Specification
D.4.4 RDF Schemas
D.4.5 RDF Axiomatic Semantics
D.4.6 RDF Inferencing
D.4.7 RDF Query
D.4.8 SPARQL
D.5 Ontologies
D.6 Web Rules and SWRL
D.6.1 Web Rules
D.6.2 SWRL
D.7 Semantic Web Services
D.8 Summary
References
INDEX
Preface
Introductory Remarks

Data mining is the process of posing queries to large quantities of data and extracting information, often previously unknown, using mathematical, statistical, and machine learning techniques. Data mining has many applications in a number of areas, including marketing and sales, web and e-commerce, medicine, law, manufacturing, and, more recently, national and cyber security. For example, using data mining, one can uncover hidden dependencies between terrorist groups, as well as possibly predict terrorist events based on past experience. Furthermore, one can apply data mining techniques for targeted markets to improve e-commerce. Data mining can be applied to multimedia, including video analysis and image classification. Finally, data mining can be used in security applications, such as suspicious event detection and malicious software detection. Our previous book focused on data mining tools for applications in intrusion detection, image classification, and web surfing. In this book we focus entirely on the data mining tools we have developed for cyber security applications. In particular, it extends the work we presented in our previous book on data mining for intrusion detection. The cyber security applications we discuss are email worm detection, malicious code detection, remote exploit detection, and botnet detection. In addition, some other tools for stream mining, insider threat detection, adaptable malware detection, real-time data mining, and firewall policy analysis are discussed.
We are writing two series of books related to data management, data mining, and data security. This book is the second in our second series of books, which describes techniques and tools in detail and is co-authored with faculty and students at the University of Texas at Dallas. It has evolved from the first series of books (by single author Bhavani Thuraisingham), which currently consists of ten books. These ten books are the following: Book 1 (Data Management Systems: Evolution and Interoperation) discussed data management systems and interoperability. Book 2 (Data Mining) provided an overview of data mining concepts. Book 3 (Web Data Management and E-Commerce) discussed concepts in web databases and e-commerce. Book 4 (Managing and Mining Multimedia Databases) discussed concepts in multimedia data management, as well as text, image, and video mining. Book 5 (XML Databases and the Semantic Web) discussed high-level concepts relating to the semantic web. Book 6 (Web Data Mining and Applications in Counter-Terrorism) discussed how data mining may be applied to national security. Book 7 (Database and Applications Security), which is a textbook, discussed details of data security. Book 8 (Building Trustworthy Semantic Webs), also a textbook, discussed how semantic webs may be made secure. Book 9 (Secure Semantic Service-Oriented Systems) is on secure web services. Book 10, to be published in early 2012, is titled Building and Securing the Cloud. Our first book in Series 2 is Design and Implementation of Data Mining Tools. Our current book (which is the second book of Series 2) has evolved from Books 3, 4, 6, and 7 of Series 1 and Book 1 of Series 2. It is mainly based on the research work carried out at The University of Texas at Dallas by Dr. Mehedy Masud for his PhD thesis, with his advisor Professor Latifur Khan, and supported by the Air Force Office of Scientific Research from 2005 until now.
Background on Data Mining

Data mining is the process of posing various queries and extracting useful information, patterns, and trends, often previously unknown, from large quantities of data, possibly stored in databases. Essentially, for many organizations, the goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future based on past experiences and current trends. There is clearly a need for this technology. There are large amounts of current and historical data being stored. Therefore, as databases become larger, it becomes increasingly difficult to support decision making. In addition, the data could be from multiple sources and multiple domains. There is a clear need to analyze the data to support planning and other functions of an enterprise.
Some of the data mining techniques include those based on statistical reasoning techniques, inductive logic programming, machine learning, fuzzy sets, and neural networks, among others. The data mining problems include classification (finding rules to partition data into groups), association (finding rules to make associations between data), and sequencing (finding rules to order data). Essentially, one arrives at some hypothesis, which is the information extracted from examples and patterns observed. These patterns are observed from posing a series of queries; each query may depend on the responses obtained from the previous queries posed.

Data mining is an integration of multiple technologies. These include data management such as database management, data warehousing, statistics, machine learning, decision support, and others such as visualization and parallel computing. There is a series of steps involved in data mining. These include getting the data organized for mining, determining the desired outcomes to mining, selecting tools for mining, carrying out the mining process, pruning the results so that only the useful ones are considered further, taking actions from the mining, and evaluating the actions to determine benefits. There are various types of data mining. By this we do not mean the actual techniques used to mine the data, but what the outcomes will be. These outcomes have also been referred to as data mining tasks. These include clustering, classification, anomaly detection, and forming associations.
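To make the association task concrete, here is a minimal sketch (not from the book; the toy transactions, item names, and thresholds are invented for illustration) that mines pairwise association rules by computing support (how often a pair of items occurs together) and confidence (how often the consequent occurs given the antecedent):

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def association_rules(db, min_support=0.4, min_confidence=0.6):
    """Enumerate rules {lhs} -> {rhs} over item pairs that clear both thresholds."""
    items = set().union(*db)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            pair_support = support({lhs, rhs}, db)
            if pair_support < min_support:
                continue
            confidence = pair_support / support({lhs}, db)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

for lhs, rhs, s, c in association_rules(transactions):
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

Real association rule miners (e.g., Apriori, covered later under ARM in Chapter 2) avoid this brute-force enumeration by pruning itemsets whose subsets already fail the support threshold.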
Although several developments have been made, many challenges remain. For example, because of the large volumes of data, how can the algorithms determine which technique to select and what type of data mining to do? Furthermore, the data may be incomplete, inaccurate, or both. At times there may be redundant information, and at times there may not be sufficient information. It is also desirable to have data mining tools that can switch to multiple techniques and support multiple outcomes. Some of the current trends in data mining include mining web data, mining distributed and heterogeneous databases, and privacy-preserving data mining, where one ensures that one can get useful results from mining and at the same time maintain the privacy of the individuals.
Data Mining for Cyber Security

Data mining has applications in cyber security, which involves protecting the data in computers and networks. The most prominent application is in intrusion detection. For example, our computers and networks are being intruded on by unauthorized individuals. Data mining techniques, such as those for classification and anomaly detection, are being used extensively to detect such unauthorized intrusions. For example, data about normal behavior is gathered, and when something occurs out of the ordinary, it is flagged as an unauthorized intrusion. Normal behavior could be that John's computer is never used between 2 am and 5 am. When John's computer is in use, say at 3 am, this is flagged as an unusual pattern.
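The example above can be sketched as a minimal profile-based anomaly detector; the training log, function names, and the "never seen" rule below are hypothetical simplifications for illustration:

```python
from collections import Counter

# Hypothetical training log: hour of day (0-23) of each observed login session.
normal_logins = [8, 9, 9, 10, 13, 14, 17, 18, 19, 21, 9, 10, 14, 18]

def build_profile(hours, min_count=1):
    """Profile of normal behavior: hours seen at least `min_count` times."""
    counts = Counter(hours)
    return {h for h, c in counts.items() if c >= min_count}

def is_anomalous(hour, profile):
    """Flag an event whose hour never occurred during the training period."""
    return hour not in profile

profile = build_profile(normal_logins)
print(is_anomalous(3, profile))   # 3 am login, never seen in training: True
print(is_anomalous(9, profile))   # 9 am login, a normal working hour: False
```

A deployed detector would of course use richer features and statistical thresholds rather than simple set membership, but the structure (learn a profile of normal activity, flag departures from it) is the same.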
Data mining is also being applied for other applications in cyber security, such as auditing, email worm detection, botnet detection, and malware detection. Here again, data on normal database access is gathered, and when something unusual happens, this is flagged as a possible access violation. Data mining is also being used for biometrics. Here, pattern recognition and other machine learning techniques are being used to learn the features of a person and then to authenticate the person based on the features.
However, one of the limitations of using data mining for malware detection is that the malware may change patterns. Therefore, we need tools that can detect adaptable malware. We also discuss this aspect in our book.
Organization of This Book

This book is divided into seven parts. Part I, which consists of four chapters, provides some background information on the data mining techniques and applications that have influenced our tools; these chapters also provide an overview of malware. Parts II, III, IV, and V describe our tools for email worm detection, malicious code detection, remote exploit detection, and botnet detection, respectively. Part VI describes our tools for stream data mining. In Part VII, we discuss data mining for emerging applications, including adaptable malware detection, insider threat detection, and firewall policy analysis, as well as real-time data mining. We have four appendices that provide some of the background knowledge in data management, secure systems, and the semantic web.
Concluding Remarks

Data mining applications are exploding. Yet many books, including some of the authors' own books, have discussed concepts at a high level. Some books have made the topic very theoretical. However, data mining approaches depend on nondeterministic reasoning as well as heuristic approaches. Our first book on the design and implementation of data mining tools provided step-by-step information on how data mining tools are developed. This book continues with this approach in describing our data mining tools.
For each of the tools we have developed, we describe the system architecture, the algorithms, and the performance results, as well as the limitations of the tools. We believe that this is one of the few books that will help tool developers as well as technologists and managers. It describes algorithms as well as the practical aspects. For example, technologists can decide on the tools to select for a particular application. Developers can focus on alternative designs if an approach is not suitable. Managers can decide whether to proceed with a data mining project. This book will be a very valuable reference guide for those in industry, government, and academia, as it focuses on both concepts and practical techniques. Experimental results are also given. The book will also be used as a textbook at The University of Texas at Dallas in courses on data mining and data security.
Acknowledgments
We are especially grateful to the Air Force Office of Scientific Research for funding our research on malware detection. In particular, we would like to thank Dr. Robert Herklotz for his encouragement and support for our work. Without his support for our research, this book would not have been possible.
We are also grateful to the National Aeronautics and Space Administration for funding our research on stream mining. In particular, we would like to thank Dr. Ashok Agrawal for his encouragement and support.
We thank our colleagues and collaborators who have worked with us on data mining tools for malware detection. Our special thanks go to the following colleagues:
Prof. Peng Liu and his team at Penn State University, for collaborating with us on Data Mining for Remote Exploits (Part IV)
Prof. Jiawei Han and his team at the University of Illinois, for collaborating with us on Stream Data Mining (Part VI)
Prof. Kevin Hamlen at the University of Texas at Dallas, for collaborating with us on Data Mining for Active Defense (Chapter 21)
Our student Dr. M. Farhan Husain, for collaborating with us on Insider Threat Detection (Chapter 22)
Our colleagues Prof. Chris Clifton (Purdue University), Dr. Marion Ceruti (Department of the Navy), and Mr. John Maurer (MITRE), for collaborating with us on Real-Time Data Mining (Chapter 23)
Our students Muhammad Abedin and Syeda Nessa, for collaborating with us on Firewall Policy Analysis (Chapter 24)
The Authors
Mehedy Masud is a postdoctoral fellow at The University of Texas at Dallas (UTD), where he earned his PhD in computer science in December 2009. He has published in premier journals and conferences, including IEEE Transactions on Knowledge and Data Engineering and the IEEE International Conference on Data Mining. He will be appointed as a research assistant professor at UTD in Fall 2012. Masud's research projects include reactively adaptive malware; data mining for detecting malicious executables, botnets, and remote exploits; and cloud data mining. He has a patent pending on stream mining for novel class detection.
Latifur Khan is an associate professor in the computer science department at The University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He received his PhD and MS degrees in computer science from the University of Southern California in August 2000 and December 1996, respectively. Khan is (or has been) supported by grants from NASA, the National Science Foundation (NSF), the Air Force Office of Scientific Research (AFOSR), Raytheon, NGA, IARPA, Tektronix, Nokia Research Center, Alcatel, and the SUN academic equipment grant program. In addition, Khan is the director of the state-of-the-art DML@UTD (UTD Data Mining/Database Laboratory), which is the primary center of research related to data mining, the semantic web, and image/video annotation at The University of Texas at Dallas. Khan has published more than 100 papers, including articles in several IEEE Transactions journals, the Journal of Web Semantics, and the VLDB Journal, and conference proceedings such as IEEE ICDM and PKDD. He is a senior member of IEEE.
Bhavani Thuraisingham joined The University of Texas at Dallas (UTD) in October 2004 as a professor of computer science and director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science, and is currently the Louis Beecherl, Jr. Distinguished Professor. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science), and the BCS (British Computer Society), for her work in data security. She received the IEEE Computer Society's prestigious 1997 Technical Achievement Award for "outstanding and innovative contributions to secure data management." Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) assignment at the National Science Foundation as Program Director for Data and Applications Security. Her work in information security and information management has resulted in more than 100 journal articles, more than 200 refereed conference papers, more than 90 keynote addresses, and 3 US patents. She is the author of ten books in data management, data mining, and data security.
Copyright Permissions
Figure 2.12, Figure 2.13

B. Thuraisingham, K. Hamlen, V. Mohan, M. Masud, L. Khan, Exploiting an antivirus interface, in Computer Standards & Interfaces, Vol. 31, No. 6, pp. 1182–1189, 2009, with permission from Elsevier.
Figure 7.4, Table 8.1, Table 8.2, Table 8.3, Table 8.4, Figure 8.2, Table 8.5, Table 8.6, Table 8.7, Table 8.8

B. Thuraisingham, M. Masud, L. Khan, Email worm detection using data mining, International Journal of Information Security and Privacy, 1(4), 47–61. Copyright 2007, IGI Global, www.igi-global.com.
Figure 23.2, Figure 23.3, Figure 23.4, Figure 23.5, Figure 23.6, Figure 23.7

L. Khan, C. Clifton, J. Maurer, M. Ceruti, Dependable real-time data mining, Proceedings ISORC 2005, pp. 158–165. © 2005 IEEE.
Figure 22.3

M. Farhan Husain, L. Khan, M. Kantarcioglu, Data intensive query processing for large RDF graphs using cloud computing tools, IEEE Cloud Computing, Miami, FL, July 2010, pp. 1–10. © 2005 IEEE.
Figure 15.2, Table 16.1, Figure 16.2, Table 16.2, Figure 16.4, Table 17.1, Table 17.2, Figure 17.2, Figure 17.3

M. Masud, T. Al-khateeb, L. Khan, K. Hamlen, Flow-based identification of botnet traffic by mining multiple log files, in Proceedings of the International Conference on Distributed Frameworks & Applications (DFMA), Penang, Malaysia, Oct. 2008, pp. 200–206. © 2005 IEEE.
Figure 10.2, Figure 10.3, Table 11.1, Table 11.2, Figure 11.2, Table 11.3, Table 11.4, Table 11.5, Table 11.6, Table 11.7, Table 11.8, Table 11.9

M. Masud, L. Khan, A scalable multi-level feature extraction technique to detect malicious executables, Information Systems Frontiers (Springer Netherlands), 10(1), 33–45, March 2008. © 2008 Springer. With kind permission of Springer Science+Business Media.
Figure 13.2, Figure 13.3, Table 14.1, Figure 14.2, Table 14.2, Figure 14.3

M. Masud, L. Khan, X. Wang, P. Liu, S. Zhu, Detecting remote exploits using data mining, Proceedings IFIP Digital Forensics Conference, Kyoto, January 2008, pp. 177–189. © 2008 Springer. With kind permission of Springer Science+Business Media.
Figure 19.2, Figure 19.3, Figure 20.2, Table 20.1, Figure 20.3, Table 20.2

M. Masud, J. Gao, L. Khan, J. Han, Integrating novel class detection with classification for concept-drifting data streams, ECML PKDD '09: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, September 2009, pp. 79–94, Springer-Verlag, Berlin, Heidelberg. © 2009. With kind permission of Springer Science+Business Media.
1
INTRODUCTION
1.1 Trends

Data mining is the process of posing various queries and extracting useful, and often previously unknown and unexpected, information, patterns, and trends from large quantities of data, generally stored in databases. These data could be accumulated over a long period of time, or they could be large datasets accumulated simultaneously from heterogeneous sources such as different sensor types. The goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future based on past experiences and current trends. There is clearly a need for this technology for many applications in government and industry. For example, a marketing organization may need to determine who their potential customers are. There are large amounts of current and historical data being stored. Therefore, as databases become larger, it becomes increasingly difficult to support decision-making. In addition, the data could be from multiple sources and multiple domains. There is a clear need to analyze the data to support planning and other functions of an enterprise.
Data mining has evolved from multiple technologies, including data management, data warehousing, machine learning, and statistical reasoning. One of the major challenges in the development of data mining tools is to eliminate false positives and false negatives. Much progress has also been made on building data mining tools based on a variety of techniques for numerous applications. These applications include those for marketing and sales, healthcare, medical, financial, e-commerce, multimedia, and, more recently, security.
Our previous books have discussed various data mining technologies, techniques, tools, and trends. In a recent book, our main focus was on the design and development of, as well as the results obtained from, the three tools that we developed between 2004 and 2006. These tools include one for intrusion detection, one for web page surfing prediction, and one for image classification. In this book we continue with the descriptions of data mining tools we have developed over the past five years for cyber security. In particular, we discuss our tools for malware detection.
Malware, also known as malicious software, is developed by hackers to steal data and identity, cause harm to computers, and deny legitimate services to users, among other things. Malware has plagued society and the software industry for almost four decades. Malware includes viruses, worms, Trojan horses, time and logic bombs, botnets, and spyware. In this book we describe our data mining tools for malware detection.
The organization of this chapter is as follows. Supporting technologies are discussed in Section 1.2; these supporting technologies are elaborated in Part I. The tools that we discuss in this book are summarized in Sections 1.3 through 1.8. These tools include data mining for email worm detection, remote exploits detection, malicious code detection, and botnet detection. In addition, we discuss our stream data mining tool, as well as our approaches for insider threat detection, adaptable malware detection, real-time data mining for suspicious event detection, and firewall policy management. Each of these tools and approaches is discussed in Parts II through VII. The contents of this book are summarized in Section 1.9 of this chapter, and next steps are discussed in Section 1.10.
1.2 Data Mining and Security Technologies

Data mining techniques have exploded over the past decade, and we now have tools and products for a variety of applications. In Part I, we discuss the data mining techniques that we describe in this book, as well as provide an overview of the applications we discuss. Data mining techniques include those based on machine learning, statistical reasoning, and mathematics. Some of the popular techniques include association rule mining, decision trees, and K-means clustering. Figure 1.1 illustrates the data mining techniques.
Data mining has been used for numerous applications in several fields, including healthcare, e-commerce, and security. We focus on data mining for cyber security applications.
Figure 1.1 Data mining techniques

Figure 1.2 Malware
While data mining technologies have exploded over the past two decades, the developments in information technologies have resulted in an increasing need for security. As a result, there is now an urgent need to develop secure systems. However, as systems are being secured, malware technologies have also exploded. Therefore, it is critical that we develop tools for detecting and preventing malware. Various types of malware are illustrated in Figure 1.2.
In this book we discuss data mining for malware detection. In particular, we discuss techniques such as support vector machines, clustering, and classification for cyber security applications. The tools we have developed are illustrated in Figure 1.3.
1.3 Data Mining for Email Worm Detection

An email worm spreads through infected email messages. The worm may be carried by an attachment, or the email may contain links to an infected website. When the user opens the attachment or clicks the link, the host gets infected immediately. The worm exploits the vulnerable email software in the host machine to send infected emails to addresses stored in the address book; thus, new machines get infected. Worms damage computers and users in various ways: they may clog network traffic, damage the system, and make the system unstable or even unusable.
Figure 1.3 Data mining tools for malware detection
We have developed tools that apply data mining techniques for email worm detection. We use both Support Vector Machine (SVM) and Naïve Bayes (NB) data mining techniques. Our tools are described in Part II of the book.
1.4 Data Mining for Malicious Code Detection

Malicious code is a great threat to computers and society. Numerous kinds of malicious code wander in the wild. Some of them are mobile, such as worms, and spread through the Internet, causing damage to millions of computers worldwide. Other kinds of malicious code are static, such as viruses, but are sometimes deadlier than their mobile counterparts.
One popular technique followed by the antivirus community to detect malicious code is "signature detection." This technique matches executables against a unique telltale string or byte pattern called a signature, which is used as an identifier for a particular malicious code. However, such techniques are not effective against "zero-day" attacks. A zero-day attack is an attack whose pattern is previously unknown. We are developing a number of data mining tools for malicious code detection that do not depend on the signature of the malware. Our hybrid feature retrieval model is described in Part III of this book.
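As a sketch of the idea (the byte patterns and detection names below are invented for illustration, not real signatures), a signature scanner simply searches an executable's bytes for known telltale patterns:

```python
# Minimal sketch of signature detection. The signature database below is
# hypothetical; real antivirus engines use large, curated signature sets.
SIGNATURES = {
    b"\xde\xad\xbe\xef\x13\x37": "Example.Trojan-A",
    b"EVIL_PAYLOAD_MARKER": "Example.Worm-B",
}

def scan(executable_bytes: bytes) -> list:
    """Return the names of all known signatures found in the byte stream."""
    return [name for sig, name in SIGNATURES.items() if sig in executable_bytes]

print(scan(b"header" + b"EVIL_PAYLOAD_MARKER" + b"rest"))  # ['Example.Worm-B']
print(scan(b"previously unseen malicious bytes"))          # []
```

The second call shows the limitation described above: a zero-day sample matches nothing in the database and therefore goes undetected.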
1.5 Data Mining for Detecting Remote Exploits

Remote exploits are a popular means for attackers to gain control of hosts that run vulnerable services or software. Typically, a remote exploit is provided as an input to a remote vulnerable service to hijack the control-flow of machine-instruction execution. Sometimes the attackers inject executable code in the exploit that is executed after a successful hijacking attempt. We refer to these code-carrying remote exploits as exploit code.
We are developing a number of data mining tools for detecting remote exploits. Our tools use different classification models, such as Support Vector Machine (SVM), Naïve Bayes (NB), and decision trees. These tools are described in Part IV of this book.
1.6 Data Mining for Botnet Detection

Botnets are a serious threat because of their volume and power. Botnets containing thousands of bots (compromised hosts) are controlled from a Command and Control (C&C) center, operated by a human botmaster or botherder. The botmaster can instruct these bots to recruit new bots, launch coordinated distributed denial of service (DDoS) attacks against specific hosts, steal sensitive information from infected machines, send mass spam emails, and so on.
We have developed data mining tools for botnet detection. Our tools use Support Vector Machine (SVM), Bayes Net, decision tree (J48), Naïve Bayes, and Boosted decision tree (Boosted J48) for the classification task. These tools are described in Part V of this book.
1.7 Stream Data Mining

Stream data are quite common. They include video data, surveillance data, and financial data that arrive continuously. There are several problems related to stream data classification. First, it is impractical to store and use all the historical data for training, because that would require infinite storage and running time. Second, there may be concept-drift in the data, meaning the underlying concept of the data may change over time. Third, novel classes may evolve in the stream.
We have developed stream mining techniques for detecting novel classes. We believe that these techniques could be used for detecting novel malware. Our tools for stream mining are described in Part VI of this book.
1.8 Emerging Data Mining Tools for Cyber Security Applications

In addition to the tools described in Sections 1.3 through 1.7, we are also exploring techniques for (a) detecting malware that reacts and adapts to the environment, (b) insider threat detection, (c) real-time data mining, and (d) firewall policy management.
For malware that adapts, we are exploring stream mining techniques. For insider threat detection, we are applying graph mining techniques. We are exploring real-time data mining to detect malware in real time. Finally, we are exploring the use of association rule mining techniques for ensuring that the numerous firewall policies are consistent. These techniques are described in Part VII of this book.
1.9 Organization of This Book

This book is divided into seven parts. Part I consists of this introductory chapter and four additional chapters. Chapter 2 provides some background information on the data mining techniques and applications that have influenced our research and tools. Chapter 3 describes types of malware. In Chapter 4 we provide an overview of data mining for security applications. The tools we described in our previous book are discussed in Chapter 5. We discuss these three tools because many of the tools we discuss in this current book have been influenced by our early tools.
Part II consists of three chapters, 6, 7, and 8, which describe our tool for email worm detection. An overview of email worm detection is given in Chapter 6. Our tool is discussed in Chapter 7. Evaluation and results are discussed in Chapter 8. Part III consists of three chapters, 9, 10, and 11, and describes our tool for malicious code detection. An overview of malicious code detection is given in Chapter 9. Our tool is discussed in Chapter 10. Evaluation and results are discussed in Chapter 11. Part IV consists of three chapters, 12, 13, and 14, and describes our tool for detecting remote exploits. An overview of detecting remote exploits is given in Chapter 12. Our tool is discussed in Chapter 13. Evaluation and results are discussed in Chapter 14. Part V consists of three chapters, 15, 16, and 17, and describes our tool for botnet detection. An overview of botnet detection is given in Chapter 15. Our tool is discussed in Chapter 16. Evaluation and results are discussed in Chapter 17. Part VI consists of three chapters, 18, 19, and 20, and describes our tool for stream mining. An overview of stream mining is given in Chapter 18. Our tool is discussed in Chapter 19. Evaluation and results are discussed in Chapter 20. Part VII consists of four chapters, 21, 22, 23, and 24, and describes our tools for emerging applications. Our approach to detecting adaptive malware is discussed in Chapter 21. Our approach to insider threat detection is discussed in Chapter 22. Real-time data mining is discussed in Chapter 23. Our firewall policy management tool is discussed in Chapter 24.
The book concludes with Chapter 25. Appendix A provides an overview of data management and describes the relationship between our books. Appendix B describes trustworthy systems. Appendix C describes secure data, information, and knowledge management, and Appendix D describes semantic web technologies. The appendices, together with the supporting technologies described in Part I, provide the necessary background to understand the content of this book.
We have essentially developed a three-layer framework to explain the concepts in this book. This framework is illustrated in Figure 1.4. Layer 1 is the data mining techniques layer. Layer 2 is our tools layer. Layer 3 is the applications layer. Figure 1.5 illustrates how Chapters 2 through 24 of this book are placed in the framework.
1.10 Next Steps

This book provides the information for a reader to get familiar with data mining concepts and to understand how the techniques are applied, step by step, to some real-world applications in malware detection. One of the main contributions of this book is raising awareness of the importance of data mining for a variety of applications in cyber security. This book could be used as a guide to build data mining tools for cyber security applications.
Figure 1.4 Framework for data mining tools
We provide many references that can help the reader understand the details of the problems we are investigating. Our advice to the reader is to keep up with developments in data mining, get familiar with the tools and products, and apply them to a variety of applications. The reader will then have a better understanding of the limitations of the tools and be able to determine when new tools have to be developed.
Figure 1.5 Contents of the book with respect to the framework
PART I
DATA MINING AND SECURITY
Introduction to Part I: Data Mining and Security

Supporting technologies for data mining for malware detection include data mining and malware technologies. Data mining is the process of analyzing data and uncovering hidden dependencies. The outcomes of data mining include classification, clustering, forming associations, as well as detecting anomalies. Malware technologies are being developed at a rapid pace; these include worms, viruses, and Trojan horses.
Part I, consisting of five chapters, discusses supporting technologies for data mining for malware detection. Chapter 1 provides a brief overview of data mining and malware. In Chapter 2, we discuss the data mining techniques we have utilized in our tools; specifically, we present the Markov model, support vector machines, artificial neural networks, and association rule mining. In Chapter 3, we discuss various types of malware, including worms, viruses, and Trojan horses. In Chapter 4, we discuss data mining for security applications; in particular, we discuss the threats to computers and networks and describe the applications of data mining to detect such threats and attacks. Some of our current research at The University of Texas at Dallas is also discussed. In Chapter 5, we discuss the three applications we considered in our previous book on the design and implementation of data mining tools. These tools have influenced the work discussed in this book a great deal. In particular, we discuss intrusion detection, web surfing prediction, and image classification tools.
2
DATA MINING TECHNIQUES
2.1 Introduction

Data mining outcomes (also called tasks) include classification, clustering, forming associations, as well as detecting anomalies. Our tools have mainly focused on classification as the outcome, and we have developed classification tools. The classification problem is also referred to as supervised learning, in which a set of labeled examples is learned by a model, and then a new example with an unknown label is presented to the model for prediction.
Many prediction models have been used, such as the Markov model, decision trees, artificial neural networks, support vector machines, association rule mining, and many others. Each of these models has strengths and weaknesses. However, they share a common weakness: no single technique suits all applications. The reason that there is no ideal or perfect classifier is that each of these techniques was initially designed to solve a specific problem under certain assumptions.
In this chapter we discuss the data mining techniques we have utilized in our tools. Specifically, we present the Markov model, support vector machines, artificial neural networks, association rule mining, and the problem of multi-classification, as well as image classification, which is an aspect of image mining. These techniques are also used in developing and comparing results in Parts II, III, and IV. In our research and development, we propose hybrid models to improve the prediction accuracy of data mining algorithms in various applications, namely intrusion detection, WWW prediction, and image classification.
The organization of this chapter is as follows. In Section 2.2 we provide an overview of various data mining tasks and techniques. The techniques that are relevant to the contents of this book are discussed in Sections 2.2 through 2.7; in particular, neural networks, support vector machines, Markov models, and association rule mining, as well as some other classification techniques, are described. The chapter is summarized in Section 2.8.
2.2 Overview of Data Mining Tasks and Techniques

Before we discuss data mining techniques, we provide an overview of some of the data mining tasks (also known as data mining outcomes); then we discuss the techniques. In general, data mining tasks can be grouped into two categories: predictive and descriptive. Predictive tasks essentially predict whether an item belongs to a class or not. Descriptive tasks, in general, extract patterns from the examples. One of the most prominent predictive tasks is classification. In some cases, other tasks, such as anomaly detection, can be reduced to a predictive task, such as determining whether a particular situation is an anomaly or not. Descriptive tasks, in general, include making associations and forming clusters. Therefore, classification, anomaly detection, making associations, and forming clusters are all thought of as data mining tasks.

Next, the data mining techniques can be predictive, descriptive, or both. For example, neural networks can perform classification as well as clustering. Classification techniques include decision trees and support vector machines, as well as memory-based reasoning. Association rule mining techniques are used, in general, to make associations. Link analysis, which analyzes links, can also make associations between links and predict new links. Clustering techniques include K-means clustering. An overview of the data mining tasks (i.e., the outcomes of data mining) is illustrated in Figure 2.1. The techniques discussed in this book (e.g., neural networks, support vector machines) are illustrated in Figure 2.2.
2.3 Artificial Neural Network
Figure 2.1 Data mining tasks
Figure 2.2 Data mining techniques
An artificial neural network (ANN) is a well-known, powerful, and robust classification technique that has been used to approximate real-valued, discrete-valued, and vector-valued functions from examples. ANNs have been used in many areas, such as interpreting visual scenes, speech recognition, and learning robot control strategies. An artificial neural network simulates the biological nervous system in the human brain. Such a nervous system is composed of a large number of highly interconnected processing units (neurons) working together to produce our feelings and reactions. ANNs, like people, learn by example. The learning process in a human brain involves adjustments to the synaptic connections between neurons; similarly, the learning process of an ANN involves adjustments to the node weights. Figure 2.3 presents a simple neuron unit, which is called a perceptron. The perceptron input x is a vector of real-valued inputs, and w is the weight vector, whose values are determined by training. The perceptron computes a linear combination of the input vector x as follows (Eq. 2.1).
Figure 2.3 The perceptron
Notice that wi corresponds to the contribution of the input vector component xi to the perceptron output. Also, in order for the perceptron to output a 1, the weighted combination of the inputs must be greater than the threshold w0.

Learning the perceptron involves choosing values for the weights w0, w1, …, wn. Initially, random weight values are given to the perceptron. Then the perceptron is applied to each training example, updating the weights whenever an example is misclassified. This process is repeated many times until all training examples are correctly classified. The weights are updated according to the following rule (Eq. 2.2):

wi ← wi + η(t − o)xi

where η is a learning constant, o is the output computed by the perceptron, and t is the target output for the current training example.
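This rule can be implemented in a few lines of plain Python; the AND dataset, learning rate, and epoch limit below are our own illustrative choices, not values from the text:

```python
# Minimal perceptron trained with the update rule of Eq. 2.2:
#   wi <- wi + eta * (t - o) * xi, applied on every misclassified example.

def perceptron_train(examples, eta=0.1, max_epochs=100):
    """examples: list of (x, t) pairs with targets t in {0, 1}.
    Returns weights [w0, w1, ..., wn], where w0 plays the threshold role."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                    # w[0] is the bias/threshold weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0
            if o != t:                     # update only on a misclassification
                w[0] += eta * (t - o)
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
                errors += 1
        if errors == 0:                    # all training examples classified
            break
    return w

def predict(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0

# Learn the logical AND function, which is linearly separable
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(data)
print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop terminates with all four examples correctly classified, illustrating the convergence behavior described above.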
The computational power of a single perceptron is limited to linear decisions. However, the perceptron can be used as a building block to compose powerful multi-layer networks. In this case, a more complicated updating rule is needed to train the network weights. In this work we employ an artificial neural network of two layers, where each layer is composed of three building blocks (see Figure 2.4). We use the back propagation algorithm for learning the weights; the back propagation algorithm attempts to minimize the squared error function.
Figure 2.4 Artificial neural network

Figure 2.5 The design of the ANN used in our implementation
A typical training example in WWW prediction is ⟨[kt−τ+1, …, kt−1, kt]T, d⟩, where [kt−τ+1, …, kt−1, kt]T is the input to the ANN and d is the target web page. Notice that the input units of the ANN in Figure 2.5 are the τ previous pages that the user has recently visited, where k is a web page id. The output of the network is a boolean value, not a probability. We will see later how to approximate the probability of the output by fitting a sigmoid function to the ANN output. The approximated probabilistic output becomes o′ = f(o(I)) = pt+1, where I is an input session and pt+1 = p(d|kt−τ+1, …, kt). We choose the sigmoid function (Eq. 2.3) as a transfer function so that the ANN can handle a non-linearly separable dataset [Mitchell 1997]. Notice that in our ANN design (Figure 2.5), we use the sigmoid transfer function of Eq. 2.3 in each building block. In Eq. 2.3, I is the input to the network, O is the output of the network, W is the matrix of weights, and σ is the sigmoid function.
We implement the back propagation algorithm for training the weights. The back propagation algorithm employs gradient descent to attempt to minimize the squared error between the network output values and the target values of these outputs. The sum of the error over all of the network output units is defined in Eq. 2.4, where outputs is the set of output units in the network, D is the training set, and tik and oik are the target and output values associated with the ith output unit and training example k. For a specific weight wji in the network, it is updated for each training example as in Eq. 2.5, where η is the learning rate and wji is the weight associated with the ith input to network unit j (for details, see [Mitchell 1997]). As we can see from Eq. 2.5, the search direction δw is computed using gradient descent, which guarantees convergence only toward a local minimum. To mitigate this, we add a momentum term to the weight update rule, such that the weight update direction δwji(n) depends partially on the update direction of the previous iteration, δwji(n − 1). The new weight update direction is shown in Eq. 2.6, where n is the current iteration and α is the momentum constant. Notice that in Eq. 2.6 the step size is slightly larger than in Eq. 2.5; this contributes to a smooth convergence of the search in regions where the gradient is unchanging [Mitchell 1997].

In our implementation, we set the step size η dynamically, based on the distribution of the classes in the dataset. Specifically, we set the step size to large values when updating on training examples that belong to low-distribution classes, and vice versa. This is because when the distribution of the classes in the dataset varies widely (e.g., a dataset might have 5% positive examples and 95% negative examples), the network weights converge toward the examples from the class with the larger distribution, which causes slow convergence. Furthermore, we adjust the learning rates slightly by applying the momentum constant (Eq. 2.6) to speed up the convergence of the network [Mitchell 1997].
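The following sketch illustrates the momentum idea on a single linear unit, a deliberate simplification of the full multi-layer update of Eqs. 2.5 and 2.6; the values of η, α, and the toy data are illustrative assumptions:

```python
# Gradient descent with momentum on a single linear unit o = w * x,
# minimizing squared error. The previous update direction is folded into
# the current one: delta_w(n) = eta * (t - o) * x + alpha * delta_w(n - 1).

def train_linear_unit(samples, eta=0.05, alpha=0.5, epochs=500):
    w, prev_dw = 0.0, 0.0
    for _ in range(epochs):
        for x, t in samples:
            o = w * x
            dw = eta * (t - o) * x + alpha * prev_dw  # momentum term reuses
            w += dw                                   # the previous direction
            prev_dw = dw
    return w

# The samples follow t = 2x, so the weight should settle close to 2
w = train_linear_unit([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The momentum term smooths the trajectory where the gradient is unchanging, as the text notes, at the cost of one extra stored value per weight.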
2.4 Support Vector Machines

Support vector machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory. This learning strategy, introduced by Vapnik [1995, 1998, 1999; see also Cristianini and Shawe-Taylor 2000], is a very powerful method that has been applied in a wide variety of applications. The basic concept in SVM is the hyper-plane classifier, or linear separability. To achieve linear separability, SVM applies two basic ideas: margin maximization and kernels, that is, mapping the input space to a higher-dimensional feature space.
For binary classification, the SVM problem can be formalized as in Eq. 2.7. Suppose we have N training data points (x1, y1), (x2, y2), …, (xN, yN), where xi ∈ Rd and yi ∈ {+1, −1}. We would like to find a linear separating hyper-plane classifier as in Eq. 2.8. Furthermore, we want this hyper-plane to have the maximum separating margin with respect to the two classes (see Figure 2.6). The functional margin, or the margin for short, is defined geometrically as the Euclidean distance of the closest point from the decision boundary in the input space. Figure 2.7 gives an intuitive explanation of why margin maximization gives the best solution for separation. In part (a) of Figure 2.7, we can find an infinite number of separators for a specific dataset; there is no specific or clear reason to favor one separator over another. In part (b), we see that maximizing the margin provides only one thick separator. Such a solution achieves the best generalization accuracy, that is, prediction for the unseen [Vapnik 1995, 1998, 1999].
Figure 2.6 Linear separation in SVM

Figure 2.7 The SVM separator that achieves the maximum margin
Notice that Eq. 2.8 computes the sign of the functional margin of point x in addition to the prediction label of x; that is, the functional margin of x equals w·x − b.
The SVM optimization problem is a convex quadratic programming problem (in w, b) in a convex set (Eq. 2.7). We can solve the Wolfe dual instead, as in Eq. 2.9, with respect to α, subject to the constraints that the gradient of L(w, b, α) with respect to the primal variables w and b vanishes and that αi ≥ 0. The primal variables are eliminated from L(w, b, α) (see [Cristianini and Shawe-Taylor 2000] for more details). When we solve for αi, we obtain w = Σi αi yi xi, and we can classify a new object x using Eq. 2.10. Note that the training vectors occur only in the form of a dot product, and that there is a Lagrangian multiplier αi for each training point, which reflects the importance of the data point. When the maximal margin hyper-plane is found, only the points that lie closest to the hyper-plane will have αi > 0, and these points are called support vectors. All other points will have αi = 0 (see Figure 2.8a). This means that only those points that lie closest to the hyper-plane give the representation of the hypothesis classifier. These most important data points serve as support vectors; their values can also be used to give an independent bound on the reliability of the hypothesis classifier [Bartlett and Shawe-Taylor 1999].

Figure 2.8a shows two classes and their boundaries, that is, margins. The support vectors are represented by solid objects, while the empty objects are non-support vectors. Notice that the margins are affected only by the support vectors; that is, if we remove or add empty objects, the margins will not change, whereas any change in the solid objects, either adding or removing objects, could change the margins. Figure 2.8b shows the effect of adding objects in the margin area. As we can see, adding or removing objects far from the margins, for example data point 1 or −2, does not change the margins. However, adding and/or removing objects near the margins, for example data point 2 and/or −1, creates new margins.
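To make the dual classifier of Eq. 2.10 concrete, consider a hand-worked toy problem of our own (not from the text) with just two training points: (0, 0) labeled −1 and (2, 2) labeled +1. Both points are support vectors, and the optimal multipliers work out to α1 = α2 = 0.25 with b = 1:

```python
# Classify with f(x) = sign( sum_i alpha_i * y_i * <x_i, x> - b ),
# the form of Eq. 2.10, using analytically known alphas for a toy problem.
support_vectors = [((0.0, 0.0), -1), ((2.0, 2.0), +1)]  # (x_i, y_i)
alphas = [0.25, 0.25]   # both training points are support vectors (alpha > 0)
b = 1.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def svm_classify(x):
    score = sum(a * y * dot(sv, x)
                for a, (sv, y) in zip(alphas, support_vectors)) - b
    return +1 if score >= 0 else -1

print(svm_classify((3.0, 3.0)))   # 1, on the positive side
print(svm_classify((-1.0, 0.0)))  # -1, on the negative side
```

Here w = Σi αi yi xi = (0.5, 0.5), so the decision boundary is x1 + x2 = 2, the perpendicular bisector of the two support vectors, and each lies at functional margin exactly ±1.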
Figure 2.8 (a) The α values of support vectors and non-support vectors. (b) The effect of adding new data points on the margins.
2.5 Markov Model

Some recent and advanced predictive methods for web surfing have been developed using Markov models [Pirolli et al. 1996], [Yang et al. 2001]. For these predictive models, the sequences of web pages visited by surfers are typically considered as Markov chains, which are then fed as input. The basic concept of the Markov model is that it predicts the next action depending on the result of the previous action or actions. Actions can mean different things for different applications; for the purpose of illustration, we will consider actions specific to the WWW prediction application. In WWW prediction, the next action corresponds to predicting the next page to be traversed, and the previous actions correspond to the previous web pages already visited. Based on the number of previous actions considered, the Markov model can have different orders.
The zeroth-order Markov model is the unconditional probability of the state (or web page) (Eq. 2.11). In Eq. 2.11, Pk is a web page and Sk is the corresponding state. The first-order Markov model (Eq. 2.12) can be computed by taking the page-to-page transitional probabilities, or the n-gram probabilities, of ⟨P1, P2⟩, ⟨P2, P3⟩, …, ⟨Pk−1, Pk⟩.
In the following, we present an illustrative example of the different orders of the Markov model and how each is used for prediction.
Example: Imagine a web site of six web pages: P1, P2, P3, P4, P5, and P6. Suppose we have user sessions as in Table 2.1, which depicts the navigation of many users of that web site. Figure 2.9 shows the first-order Markov model, where the next action is predicted based only on the last action performed, i.e., the last page traversed by the user. States S and F correspond to the initial and final states, respectively. The probability of each transition is estimated by the ratio of the number of times the sequence of states was traversed to the number of times the anchor state was visited. Next to each arc in Figure 2.9, the first number is the frequency of that transition, and the second number is the transition probability. For example, the transition probability of the transition (P2 to P3) is 0.2, because the number of times users traverse from page 2 to page 3 is 3 and the number of times page 2 is visited is 15 (i.e., 0.2 = 3/15).

Notice that the transition probability is used to resolve prediction. For example, given that a user has already visited P2, the most probable page she visits next is P6, because the transition probability from P2 to P6 is the highest.
Table 2.1 Collection of User Sessions and Their Frequencies

SESSION       FREQUENCY
P1, P2, P4    5
P1, P2, P6    1
P5, P2, P6    6
P5, P2, P3    3

Figure 2.9 First-order Markov model
Notice that the transition probability might not be available for some pages. For example, the transition probability from P2 to P5 is not available because no user has visited P5 after P2; hence, these transition probabilities are set to zero. Similarly, the Kth-order Markov model is one where the prediction is computed after considering the last K actions performed by the users (Eq. 2.13). In WWW prediction, the Kth-order Markov model is the probability of a user's visit to the kth page given her previous k − 1 page visits.
Figure 2.10 Second-order Markov model
Figure 2.10 shows the second-order Markov model that corresponds to Table 2.1. In the second-order model, we consider the last two pages. The transition probability is computed in a similar fashion. For example, the transition probability of the transition (P1, P2) to (P2, P6) is 1/6 ≈ 0.16, because the number of times users traverse from state (P1, P2) to state (P2, P6) is 1, and the number of times state (P1, P2) is visited is 6. The transition probability is used for prediction. For example, given that a user has visited P1 and P2, she most probably visits P4 next, because the transition probability from state (P1, P2) to state (P2, P4) is greater than the transition probability from state (P1, P2) to state (P2, P6).
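These estimates are easy to reproduce programmatically. The following sketch builds the first-order transition probabilities directly from the sessions of Table 2.1 (the variable names are our own):

```python
from collections import defaultdict

# User sessions and their frequencies, from Table 2.1
sessions = [(["P1", "P2", "P4"], 5), (["P1", "P2", "P6"], 1),
            (["P5", "P2", "P6"], 6), (["P5", "P2", "P3"], 3)]

counts = defaultdict(int)  # (page, next_page) -> weighted transition count
visits = defaultdict(int)  # page -> weighted count as a transition anchor
for pages, freq in sessions:
    for cur, nxt in zip(pages, pages[1:]):
        counts[(cur, nxt)] += freq
        visits[cur] += freq

def transition_prob(cur, nxt):
    return counts[(cur, nxt)] / visits[cur] if visits[cur] else 0.0

def predict_next(cur):
    """Most probable next page given the last page traversed."""
    options = {nxt: transition_prob(c, nxt) for (c, nxt) in counts if c == cur}
    return max(options, key=options.get)

print(transition_prob("P2", "P3"))  # 0.2, i.e., 3/15 as in the text
print(predict_next("P2"))           # 'P6' (probability 7/15, the highest)
```

Transitions never observed, such as P2 to P5, naturally get probability zero, matching the discussion above.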
The order of the Markov model is related to the sliding window: the Kth-order Markov model corresponds to a sliding window of size K − 1.

Notice that there is another concept, the number of hops, that is similar to the sliding window concept. In this book, we use the terms number of hops and sliding window interchangeably.
In WWW prediction, Markov models are built based on the concept of the n-gram. An n-gram can be represented as a tuple of the form ⟨x1, x2, …, xn⟩, depicting a sequence of page clicks by a population of users surfing a web site. Each component of the n-gram takes a specific page id value that reflects the surfing path of a specific user surfing a web page. For example, the n-gram ⟨P10, P21, P4, P12⟩ for some user U states that the user U has visited pages 10, 21, 4, and finally page 12, in sequence.
26 Association Rule Mining(ARM)Association rule is a data mining technique that has beenapplied successfully to discover related transactions Theassociation rule technique finds the relationships amongitemsets based on their co-occurrence in the transactionsSpecifically association rule mining discovers the frequentpatterns (regularities) among those itemsets for examplewhat the items purchased together in a super store are In thefollowing we briefly introduce association rule mining For
85
more details see [Agrawal et al 1993] [Agrawal andSrikant 1994]
Assume we have m items in our database, and define I = {i1, i2, …, im} as the set of all items. A transaction T is a set of items such that T ⊆ I. Let D be the set of all transactions in the database. A transaction T contains X if X ⊆ T and X ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. There are two parameters to consider for a rule: confidence and support. A rule R = X → Y holds with confidence c if c% of the transactions of D that contain X also contain Y (i.e., c = pr(Y|X)). The rule R holds with support s if s% of the transactions in D contain both X and Y (i.e., s = pr(X,Y)). The problem of mining association rules is defined as follows: given a set of transactions D, we would like to generate all rules that satisfy a confidence and a support greater than a minimum confidence (σ), minconf, and a minimum support (ϑ), minsup. Several efficient algorithms have been proposed to find association rules, for example, the AIS algorithm [Agrawal et al., 1993], [Agrawal and Srikant, 1994], the SETM algorithm [Houtsma and Swami, 1995], and AprioriTid [Agrawal and Srikant, 1994].
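The support and confidence definitions translate directly into counts over D. A minimal sketch, using a made-up transaction database for illustration:

```python
def support(itemset, D):
    """Fraction of transactions in D containing every item in itemset."""
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(X, Y, D):
    """pr(Y | X): among transactions containing X, the fraction also containing Y."""
    return support(X | Y, D) / support(X, D)

# Hypothetical transaction database
D = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk", "eggs"}]
print(support({"milk", "bread"}, D))       # 0.5
print(confidence({"milk"}, {"bread"}, D))  # 2/3
```

The rule {milk} → {bread} holds in two of the three transactions containing milk, giving confidence 2/3, while {milk, bread} appears in half of all transactions, giving support 0.5.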
In the case of web transactions, we use association rules to discover navigational patterns among users. This helps to cache a page in advance and reduce the loading time of a page. Also, discovering a pattern of navigation helps in personalization. Transactions are captured from the clickstream data recorded in web server logs.
In many applications there is one main problem in using association rule mining: with a global minimum support (minsup), rare hits (i.e., web pages that are rarely visited) will not be included in the frequent sets because they will not achieve enough support. One solution is to use a very small support threshold; however, we then end up with very large frequent itemsets, which are computationally hard to handle. [Liu et al., 1999] propose a mining technique that uses different support thresholds for different items. Specifying multiple thresholds allows rare transactions, which might be very important, to be included in the frequent itemsets. Other issues might arise depending on the application itself. For example, in the case of WWW prediction, a session is recorded for each user. The session might have tens of clickstreams (and sometimes hundreds, depending on the duration of the session). Using each session as a transaction will not work because it is rare to find two sessions that are frequently repeated (i.e., identical); hence, sessions will not achieve even a very low support threshold, minsup. There is a need to break each session into many subsequences. One common method is to use a sliding window of size w. For example, suppose we use a sliding window w = 3 to break the session S = ⟨A, B, C, D, E, F⟩; then we end up with the subsequences S′ = ⟨A,B,C⟩, ⟨B,C,D⟩, ⟨C,D,E⟩, ⟨D,E,F⟩. The total number of subsequences of a session S using window w is length(S) − w + 1. To predict the next page in an active user session, we use a sliding window of the active session and ignore the previous pages. For example, if the current session is ⟨A,B,C⟩ and the user references page D, then the new active session becomes ⟨B,C,D⟩ using a sliding window of size 3. Notice that page A is dropped, and ⟨B,C,D⟩ will be used for prediction. The rationale behind this is that most users go back and forth while surfing the web, trying to find the desired information, and it may be most appropriate to use the recent portions of the user history to generate recommendations/predictions [Mobasher et al., 2001].
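The session-splitting and active-window steps above can be sketched as:

```python
def subsequences(session, w):
    """Break a session into sliding-window subsequences of size w."""
    return [tuple(session[i:i + w]) for i in range(len(session) - w + 1)]

def slide(active, page, w):
    """Append the newly referenced page and keep only the w most recent pages."""
    return (list(active) + [page])[-w:]

S = ["A", "B", "C", "D", "E", "F"]
print(subsequences(S, 3))
# [('A', 'B', 'C'), ('B', 'C', 'D'), ('C', 'D', 'E'), ('D', 'E', 'F')]
print(slide(["A", "B", "C"], "D", 3))  # ['B', 'C', 'D']
```

With length(S) = 6 and w = 3, the session yields 6 − 3 + 1 = 4 subsequences, matching the example in the text.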
[Mobasher et al., 2001] propose a recommendation engine that matches an active user session with the frequent itemsets in the database and predicts the next page the user will most probably visit. The engine works as follows. Given an active session of size w, the engine finds all the frequent itemsets of length w + 1 satisfying some minimum support, minsup, and containing the current active session. Prediction for the active session A is based on the confidence (ψ) of the corresponding association rule. The confidence (ψ) of an association rule X → z is defined as ψ(X → z) = σ(X ∪ z)/σ(X), where the length of z is 1. Page p is recommended/predicted for an active session A if the rule A → p has the highest confidence among the candidate rules.
The engine uses a cyclic graph called the Frequent Itemset Graph. The graph is an extension of the lexicographic tree used in the tree projection algorithm of [Agrawal et al., 2001]. The graph is organized in levels. The nodes in level l have itemsets of size l. For example, the sizes of the nodes (i.e., the sizes of the itemsets corresponding to these nodes) in levels 1 and 2 are 1 and 2, respectively. The root of the graph, level 0, is an empty node corresponding to an empty itemset. A node X in level l is linked to a node Y in level l + 1 if X ⊂ Y. To further explain the process, suppose we have the following sample web transactions involving pages 1, 2, 3, 4, and 5, as in Table 2.2. The Apriori algorithm produces the itemsets shown in Table 2.3, using a minsup = 0.49. The frequent itemset graph is shown in Figure 2.11.
Table 2.2 Sample Web Transactions
TRANSACTION ID   ITEMS
T1               1, 2, 4, 5
T2               1, 2, 5, 3, 4
T3               1, 2, 5, 3
T4               2, 5, 2, 1, 3
T5               4, 1, 2, 5, 3
T6               1, 2, 3, 4
T7               4, 5
T8               4, 5, 3, 1
Table 2.3 Frequent Itemsets Generated by the Apriori Algorithm
Suppose we are using a sliding window of size 2, and the current active session is A = ⟨2,3⟩. To predict/recommend the next page, we first start at level 2 in the frequent itemset graph and extract all the itemsets in level 3 linked to A. From Figure 2.11, the node {2,3} is linked to the {1,2,3} and {2,3,5} nodes, with confidences
and the recommended page is 1 because its confidence is larger. Notice that in recommendation engines the order of the clickstream is not considered; that is, there is no distinction between a session ⟨1,2,4⟩ and ⟨1,4,2⟩. This is a disadvantage of such systems because the order of pages visited might bear important information about the navigation patterns of users.
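These two confidences can be checked directly against the transactions of Table 2.2 with simple set counting (a sketch that bypasses the frequent itemset graph itself; `sigma` and `psi` follow the definitions above):

```python
# Transactions from Table 2.2, as sets (order and repeats are ignored)
D = [{1, 2, 4, 5}, {1, 2, 5, 3, 4}, {1, 2, 5, 3}, {2, 5, 1, 3},
     {4, 1, 2, 5, 3}, {1, 2, 3, 4}, {4, 5}, {4, 5, 3, 1}]

def sigma(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for T in D if itemset <= T)

def psi(X, z):
    """Confidence of the rule X -> {z}."""
    return sigma(X | {z}) / sigma(X)

A = {2, 3}  # active session
candidates = {p: psi(A, p) for p in (1, 5)}
print(candidates)                            # {1: 1.0, 5: 0.8}
print(max(candidates, key=candidates.get))   # 1
```

Every transaction containing {2,3} also contains 1 (confidence 1.0), while only four of the five contain 5 (confidence 0.8), so page 1 is recommended, as in the text.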
Figure 2.11 Frequent Itemset Graph
2.7 Multi-Class Problem

Most classification techniques solve the binary classification problem. Binary classifiers are combined to generalize to the multi-class problem. There are two basic schemes for this generalization, namely, one-vs-one and one-vs-all. To avoid redundancy, we present this generalization only for SVM.
2.7.1 One-vs-One
The one-vs-one approach creates a classifier for each pair of classes. The training set for each pair classifier (i, j) includes only those instances that belong to either class i or class j. A new instance x belongs to the class upon which most pair classifiers agree; the prediction decision follows the majority vote technique. There are n(n − 1)/2 classifiers to be computed, where n is the number of classes in the dataset. It is evident that the disadvantage of this scheme is that we need to generate a large number of classifiers, especially if there are a large number of classes in the training set. For example, if we have a training set of 1,000 classes, we need 499,500 classifiers. On the other hand, the size of the training set for each classifier is small because we exclude all instances that do not belong to that pair of classes.
2.7.2 One-vs-All
One-vs-all creates a classifier for each class in the dataset. The training set is preprocessed such that, for a classifier j, instances that belong to class j are marked as class (+1) and instances that do not belong to class j are marked as class (−1). In the one-vs-all scheme, we compute n classifiers, where n is the number of pages that users have visited (at the end of each session). A new instance x is predicted by assigning it to the class whose classifier outputs the largest positive value (i.e., maximal margin), as in Eq. 2.15. We can compute the margin of point x as in Eq. 2.14. Notice that the recommended/predicted page is the sign of the margin value of that page (see Eq. 2.10).
In Eq. 2.15, M is the number of classes, x = ⟨x1, x2, …, xn⟩ is the user session, and fi is the classifier that separates class i from the rest of the classes. The prediction decision in Eq. 2.15 resolves to the classifier fc that is the most distant from the testing example x. This might be explained as fc having the most separating power, among all other classifiers, for separating x from the rest of the classes.
The advantage of this scheme (one-vs-all) compared to the one-vs-one scheme is that it has fewer classifiers. On the other hand, the size of the training set is larger for one-vs-all than for one-vs-one because we use the whole original training set to compute each classifier.
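The one-vs-all decision of Eq. 2.15 amounts to an argmax over the per-class classifier outputs. A sketch with stand-in linear decision functions (the weights below are hypothetical, not fitted):

```python
def one_vs_all_predict(x, classifiers):
    """Assign x to the class whose classifier f_i outputs the largest value."""
    return max(classifiers, key=lambda c: classifiers[c](x))

# Hypothetical linear decision functions f_i(x) = w_i . x + b_i
classifiers = {
    "P4": lambda x: 2.0 * x[0] - 1.0,
    "P6": lambda x: -1.0 * x[0] + 0.5,
}
print(one_vs_all_predict([1.0], classifiers))  # P4
```

For x = [1.0], f_P4 outputs 1.0 and f_P6 outputs −0.5, so the instance is assigned to the class whose classifier is "most distant" on the positive side, here P4.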
2.8 Image Mining

Along with the development of digital images and computer storage technologies, huge amounts of digital images are generated and saved every day. Applications of digital images have rapidly penetrated many domains and markets, including commercial and news media photo libraries, scientific and non-photographic image databases, and medical image databases. As a consequence, we face a daunting problem of organizing and accessing these huge amounts of available images. An efficient image retrieval system is highly desired to find images of specific entities from a database. The system is expected to manage a huge collection of images efficiently, respond to users' queries with high speed, and deliver a minimum of irrelevant information (high precision), as well as ensure that relevant information is not overlooked (high recall).
To build such systems, people have tried many different approaches. In the early 1990s, because of the emergence of large image collections, content-based image retrieval (CBIR) was proposed. CBIR computes relevance based on the similarity of visual content/low-level image features such as color histograms, textures, shapes, and spatial layout. However, the problem is that visual similarity is not semantic similarity; there is a gap between low-level visual features and semantic meanings. This so-called semantic gap is the major problem that needs to be solved for most CBIR approaches. For example, a CBIR system may answer a query request for a "red ball" with an image of a "red rose." If we undertake the annotation of images with keywords, a typical way to publish an image data repository is to create a keyword-based query interface addressed to an image database. If all images came with a detailed and accurate description, image retrieval would be convenient based on current powerful pure text search techniques. These search techniques would retrieve the images if their descriptions/annotations contained some combination of the keywords specified by the user. However, the major problem is that most images are not annotated. It is a laborious, error-prone, and subjective process to manually annotate a large collection of images, and many images contain the desired semantic information even though they do not contain the user-specified keywords. Furthermore, keyword-based search is useful especially to a user who knows what keywords are used to index the images and who can therefore easily formulate queries. This approach is problematic, however, when the user does not have a clear goal in mind, does not know what is in the database, and does not know what kind of semantic concepts are involved in the domain.
Image mining is a more challenging research problem than retrieving relevant images in CBIR systems. The goal of image mining is to find an image pattern that is significant for a given set of images and helpful for understanding the relationships between high-level semantic concepts/descriptions and low-level visual features. Our focus is on aspects such as feature selection and image classification.
2.8.1 Feature Selection
Usually, data saved in databases has well-defined semantics, such as numbers or structured data entries. In comparison, data with ill-defined semantics is unstructured data; for example, images, audio, and video are data with ill-defined semantics. In the domain of image processing, images are represented by derived data or features such as color, texture, and shape. Many of these features have multiple values (e.g., color histogram, moment description). When people generate these derived data or features, they generally generate as many features as possible, since they are not aware of which features are more relevant. Therefore, the dimensionality of derived image data is usually very high. Some of the selected features might be duplicated or may not even be relevant to the problem. Including irrelevant or duplicated information is referred to as "noise." Such problems are referred to as the "curse of dimensionality." Feature selection is the research topic of finding an optimal subset of features. In this section, we discuss this curse and feature selection in detail.
We developed a wrapper-based simultaneous feature weighting and clustering algorithm. The clustering algorithm bundles similar image segments together and generates a finite set of visual symbols (i.e., blob-tokens). Based on histogram analysis and chi-square values, we assign the features of image segments different weights instead of removing some of them. Feature weight evaluation is wrapped in a clustering algorithm: in each iteration of the algorithm, the feature weights of image segments are reevaluated based on the clustering result, and the reevaluated feature weights affect the clustering results in the next iteration.
2.8.2 Automatic Image Annotation
Automatic image annotation is research concerned with object recognition, where the effort is directed at recognizing objects in an image and generating descriptions for the image according to the semantics of the objects. If it were possible to produce accurate and complete semantic descriptions for an image, we could store the descriptions in an image database. Based on a textual description, more functionality (e.g., browse, search, and query) of an image DBMS could be implemented easily and efficiently by applying many existing text-based search techniques. Unfortunately, the automatic image annotation problem has not been solved in general, and perhaps this problem is impossible to solve.
However, in certain subdomains it is still possible to obtain some interesting results. Many statistical models have been published for image annotation. Some of these models took feature dimensionality into account and applied singular value decomposition (SVD) or principal component analysis (PCA) to reduce dimension, but none of them considered feature selection or feature weighting. We proposed a new framework for image annotation based on a translation model (TM). In our approach, we applied our weighted feature selection algorithm and embedded it in the image annotation framework. Our weighted feature selection algorithm improves the quality of visual tokens and generates better image annotations.
2.8.3 Image Classification
Image classification is an important area, especially in the medical domain, because it helps manage large medical image databases and has great potential as a diagnostic aid in a real-world clinical setting. We describe our experiments for the ImageCLEF medical image retrieval task. The sizes of the classes in the ImageCLEF medical image datasets are not balanced, which is a really serious problem for all classification algorithms. To solve this problem, we re-sample the data by generating subwindows. The k nearest neighbor (kNN) algorithm, distance-weighted kNN, fuzzy kNN, the nearest prototype classifier, and evidence theory-based kNN are implemented and studied. Results show that evidence-based kNN has the best performance based on classification accuracy.
2.9 Summary

In this chapter, we first provided an overview of the various data mining tasks and techniques and then discussed some of the techniques that we will utilize in this book. These include neural networks, support vector machines, and association rule mining.
Numerous data mining techniques have been designed and developed, and many of them are being utilized in commercial tools. Several of these techniques are variations of some of the basic classification, clustering, and association rule mining techniques. One of the major challenges today is to determine the appropriate techniques for various applications. We still need more benchmarks and performance studies. In addition, the techniques should result in fewer false positives and negatives. Although there is still much to be done, the progress over the past decade is extremely promising.
References

[Agrawal et al., 1993] Agrawal, R., T. Imielinski, A. Swami, Mining Association Rules between Sets of Items in Large Databases, in Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993, pp. 207–216.

[Agrawal et al., 2001] Agrawal, R., C. Aggarwal, V. Prasad, A Tree Projection Algorithm for Generation of Frequent Item Sets, Journal of Parallel and Distributed Computing, Vol. 61, No. 3, 2001, pp. 350–371.

[Agrawal and Srikant, 1994] Agrawal, R. and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, 1994, pp. 487–499.

[Bartlett and Shawe-Taylor, 1999] Bartlett, P. and J. Shawe-Taylor, Generalization Performance of Support Vector Machines and Other Pattern Classifiers, in Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 43–54.

[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000, pp. 93–122.

[Houtsma and Swami, 1995] Houtsma, M. and A. Swami, Set-Oriented Mining of Association Rules in Relational Databases, in Proceedings of the Eleventh International Conference on Data Engineering, Washington, DC, 1995, pp. 25–33.

[Liu et al., 1999] Liu, B., W. Hsu, Y. Ma, Mining Association Rules with Multiple Minimum Supports, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999, pp. 337–341.

[Mitchell, 1997] Mitchell, T. M., Machine Learning, McGraw-Hill, 1997, chap. 4.

[Mobasher et al., 2001] Mobasher, B., H. Dai, T. Luo, M. Nakagawa, Effective Personalization Based on Association Rule Discovery from Web Usage Data, in Proceedings of the ACM Workshop on Web Information and Data Management (WIDM '01), 2001, pp. 9–15.

[Pirolli et al., 1996] Pirolli, P., J. Pitkow, R. Rao, Silk from a Sow's Ear: Extracting Usable Structures from the Web, in Proceedings of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996, pp. 118–125.

[Vapnik, 1995] Vapnik, V. N., The Nature of Statistical Learning Theory, Springer, 1995.

[Vapnik, 1998] Vapnik, V. N., Statistical Learning Theory, Wiley, 1998.

[Vapnik, 1999] Vapnik, V. N., The Nature of Statistical Learning Theory, 2nd Ed., Springer, 1999.

[Yang et al., 2001] Yang, Q., H. Zhang, T. Li, Mining Web Logs for Prediction Models in WWW Caching and Prefetching, in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 26–29, 2001, pp. 473–478.
3
MALWARE
3.1 Introduction

Malware is the term used for malicious software. Malicious software is developed by hackers to steal data and identities, cause harm to computers, and deny legitimate services to users, among other things. Malware has plagued society and the software industry for almost four decades. Some of the early malware includes the Creeper virus of 1970 and the Morris worm of 1988.
As computers became interconnected, the amount of malware developed increased at an alarming rate in the 1990s. Today, with the World Wide Web and so many transactions and activities being carried out on the Internet, the malware problem is causing chaos among computer and network users.
There are various types of malware, including viruses, worms, time and logic bombs, Trojan horses, and spyware. Preliminary results from Symantec published in 2008 suggest that "the release rate of malicious code and other unwanted programs may be exceeding that of legitimate software applications" [Malware, 2011]. CME (Common Malware Enumeration) was "created to provide single, common identifiers to new virus threats and to the most prevalent virus threats in the wild to reduce public confusion during malware incidents" [CME, 2011].
In this chapter, we discuss various types of malware. In this book, we describe the data mining tools we have developed to handle some types of malware. The organization of this chapter is as follows. In Section 3.2, we discuss viruses. In Section 3.3, we discuss worms. Trojan horses are discussed in Section 3.4. Time and logic bombs are discussed in Section 3.5. Botnets are discussed in Section 3.6. Spyware is discussed in Section 3.7. The chapter is summarized in Section 3.8. Figure 3.1 illustrates the concepts discussed in this chapter.
Figure 3.1 Concepts discussed in this chapter
3.2 Viruses

Computer viruses are malware that piggyback onto other executables and are capable of replicating. Viruses can exhibit a wide range of malicious behaviors, ranging from simple annoyance (such as displaying messages) to widespread destruction, such as wiping all the data on the hard drive (e.g., the CIH virus). Viruses are not independent programs; rather, they are code fragments that exist on other binary files. A virus can infect a host machine by replicating itself when it is brought into contact with that machine, such as via a shared network drive, removable media, or email attachment. The replication is done when the virus code is executed and it is permitted to write in memory.
There are two types of viruses based on their replication strategy: nonresident and resident. The nonresident virus does not store itself on the hard drive of the infected computer; it is only attached to an executable file that infects a computer. The virus is activated each time the infected executable is accessed and run. When activated, the virus looks for other victims (e.g., other executables) and infects them. In contrast, resident viruses allocate memory in the computer hard drive, such as the boot sector. These viruses become active every time the infected machine starts.
The earliest computer virus dates back to 1970, with the advent of the Creeper virus detected on ARPANET [SecureList, 2011]. Since then, hundreds of thousands of different viruses have been written, and corresponding antiviruses have been devised to detect and eliminate the viruses from computer systems. Most commercial antivirus products apply a signature matching technique to detect a virus. A virus signature is a unique bit pattern in the virus binary that can accurately identify the virus [Signature, 2011]. Traditionally, virus signatures are generated manually; however, automated signature generation techniques based on data mining have been proposed recently [Masud et al., 2007, 2008].
3.3 Worms

Computer worms are malware, but unlike viruses, they need not attach themselves to other binaries. Worms are capable of propagating themselves to other hosts through network connections. Worms also exhibit a wide range of malicious behavior, such as spamming, phishing, harvesting and sending sensitive information to the worm writer, jamming or slowing down network connections, deleting data from the hard drive, and so on. Worms are independent programs, and they reside in the infected machine by camouflage. Some worms open a backdoor in the infected machine, allowing the worm writer to control the machine and making it a zombie (or bot) for his malicious activities (see Section 3.6).
The earliest computer worm dates back to 1988, programmed by Robert Morris, who unleashed the Morris worm. It infected 10% of the then Internet, and his act resulted in the first conviction in the United States under the Computer Fraud and Abuse Act [Dressler, 2007]. One of the three authors of this book was working in computer security at Honeywell Inc. in Minneapolis at that time and vividly remembers what happened that November day.
Other infamous worms since then include the Melissa worm, unleashed in 1999, which crashed servers; the Mydoom worm, released in 2004, which was the fastest-spreading email worm; and the SQL Slammer worm, released in 2003, which caused a global Internet slowdown.
Commercial antivirus products also detect worms by scanning for worm signatures against the signature database. Although this technique is very effective against regular worms, it is usually not effective against zero-day attacks [Frei et al., 2008] or polymorphic and metamorphic worms. However, recent techniques for worm detection address these problems through automatic signature generation [Kim and Karp, 2004], [Newsome et al., 2005]. Several data mining techniques also exist for detecting different types of worms [Masud et al., 2007, 2008].
3.4 Trojan Horses

Trojan horses have been studied within the context of multilevel databases. They covertly pass information from a high-level process to a low-level process. A good example of a Trojan horse is the manipulation of file locks. According to the Bell and La Padula security policy (discussed in Appendix B), a Secret process cannot directly send data to an Unclassified process, as this would constitute a write-down. However, a malicious Secret process can covertly pass data to an Unclassified process by manipulating file locks as follows. Suppose both processes want to access an unclassified file: the Secret process wants to read from the file, while the Unclassified process can write into the file. However, both processes cannot hold the read and write locks at the same time. Therefore, at time T1, let's assume that the Secret process has the read lock while the Unclassified process attempts to get a write lock. The Unclassified process cannot obtain this lock. This means one bit of information, say 0, is passed to the Unclassified process. At time T2, let's assume the situation does not change; this means another bit of 0 is passed. However, at time T3, let's assume the Secret process does not have the read lock, in which case the Unclassified process can obtain the write lock. This time, one bit of information, 1, is passed. Over time, a classified string, say 0011000011101, could be passed from the Secret process to the Unclassified process.
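The lock-based covert channel can be illustrated with a toy simulation; this is purely illustrative (real lock managers involve timing and contention that this sketch ignores):

```python
def covert_send(bits):
    """Secret process: hold the read lock to signal 0, release it to signal 1.
    Returns the lock-held schedule, one boolean per time tick."""
    return [bit == 0 for bit in bits]

def covert_receive(lock_held_schedule):
    """Unclassified process: attempt the write lock at each tick.
    Failing to acquire it decodes as 0; succeeding decodes as 1."""
    return [0 if held else 1 for held in lock_held_schedule]

secret = [0, 0, 1, 1, 0, 1]
schedule = covert_send(secret)
print(covert_receive(schedule))  # [0, 0, 1, 1, 0, 1]
```

The receiver recovers the classified bit string without any direct write-down, which is exactly why such covert channels are a concern in multilevel systems.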
As stated in [Trojan Horse, 2011], a Trojan horse is software that appears to perform a desirable function for the user but actually carries out a malicious activity. In the previous example, the Trojan horse does have read access to the data object; it is reading from the object on behalf of the user. However, it also carries out a malicious activity by manipulating the locks and sending data covertly to the unclassified user.
3.5 Time and Logic Bombs

In the software paradigm, a time bomb refers to a computer program that stops functioning after a prespecified time or date is reached. This is usually imposed by software companies in beta versions of software so that the software stops functioning after a certain date. An example is Windows Vista Beta 2, which stopped functioning on May 31, 2007 [Vista, 2007].
A logic bomb is a computer program that is intended to perform malicious activities when certain predefined conditions are met. This technique is sometimes injected into viruses or worms to increase their chances of surviving and spreading before getting caught.
An example of a logic bomb is the Fannie Mae bomb of 2008 [Claburn, 2009]. A logic bomb was discovered at the mortgage company Fannie Mae in October 2008. An Indian citizen and IT contractor, Rajendrasinh Babubhai Makwana, who worked in Fannie Mae's Urbana, Maryland, facility, allegedly planted it, and it was set to activate on January 31, 2009, to wipe all of Fannie Mae's 4,000 servers. As stated in [Claburn, 2009], Makwana had been terminated around 1:00 p.m. on October 24, 2008, and planted the bomb while he still had network access. He was indicted in a Maryland court on January 27, 2009, for unauthorized computer access.
3.6 Botnet

A botnet is a network of compromised hosts, or bots, under the control of a human attacker known as the botmaster. The botmaster can issue commands to the bots to perform malicious actions, such as recruiting new bots, launching coordinated DDoS attacks against some hosts, stealing sensitive information from the bot machine, sending mass spam emails, and so on. Thus, botnets have emerged as an enormous threat to the Internet community.
According to [Messmer, 2009], more than 12 million computers in the United States are compromised and controlled by the top 10 notorious botnets. Among them, the highest number of compromised machines is due to the Zeus botnet. Zeus is a kind of Trojan (a malware) whose main purpose is to apply key-logging techniques to steal sensitive data such as login information (passwords, etc.), bank account numbers, and credit card numbers. One of its key-logging techniques is to inject fake HTML forms into online banking login pages to steal login information.
The most prevalent botnets are the IRC botnets [Saha and Gairola, 2005], which have a centralized architecture. These botnets are usually very large and powerful, consisting of thousands of bots [Rajab et al., 2006]. However, their enormous size and centralized architecture also make them vulnerable to detection and demolition. Many approaches for detecting IRC botnets have been proposed recently ([Goebel and Holz, 2007], [Karasaridis et al., 2007], [Livadas et al., 2006], [Rajab et al., 2006]). Another type of botnet is the peer-to-peer (P2P) botnet. These botnets are distributed and much smaller than IRC botnets, so they are more difficult to locate and destroy. Many recent works on P2P botnets analyze their characteristics ([Grizzard et al., 2007], [Group, 2004], [Lemos, 2006]).
3.7 Spyware

As stated in [Spyware, 2011], spyware is a type of malware that can be installed on computers and collects information about users without their knowledge. For example, spyware observes the web sites visited by the user, the emails sent by the user, and, in general, the activities carried out by the user on his or her computer. Spyware is usually hidden from the user. However, sometimes employers install spyware to find out about the computer activities of their employees.
An example of spyware is keylogger (also called keystroke logging) software. As stated in [Keylogger, 2011], keylogging is the action of tracking the keys struck on a keyboard, usually in a covert manner, so that the person using the keyboard is unaware that their actions are being monitored. Another example of spyware is adware, where advertisements pop up on the computer while the person is doing some usually unrelated activity. In this case, the spyware monitors the web sites surfed by the user and carries out targeted marketing using adware.
3.8 Summary

In this chapter, we have provided an overview of malware (also known as malicious software). We discussed various types of malware, such as viruses, worms, time and logic bombs, Trojan horses, botnets, and spyware. As we have stated, malware is causing chaos in society and in the software industry. Malware technology is getting more and more sophisticated. Developers of malware are continuously changing patterns so as not to get caught. Therefore, developing solutions to detect and/or prevent malware has become an urgent need.
In this book, we discuss the tools we have developed to detect malware. In particular, we discuss tools for email worm detection, remote exploit detection, and botnet detection. We also discuss our stream mining tool, which could potentially detect changing malware. These tools are discussed in Parts III through VII of this book. In Chapter 4, we summarize the data mining tools we discussed in our previous book [Awad et al., 2009]. The tools discussed in our current book have been influenced by the tools discussed in [Awad et al., 2009].
References[Awad et al 2009] Awad M L Khan B Thuraisingham LWang Design and Implementation of Data Mining ToolsCRC Press 2009
[CME 2011] httpcmemitreorg
[Claburn 2009] Claburn T Fannie Mae Contractor Indictedfor Logic BombInformationWeek httpwwwinformationweekcomnewssecuritymanagementshowArticlejhtmlarticleID=212903521
[Dressler 2007] Dressler J ldquoUnited States v Morrisrdquo Casesand Materials on Criminal Law St Paul MN ThomsonWest 2007
[Frei et al 2008] Frei S B Tellenbach B Plattner 0-DayPatchmdashExposing Vendors(In)security Performancetechzoomnet Publications httpwwwtechzoomnetpublications0-day-patchindexen
110
[Goebel and Holz 2007] Goebel J and T Holz RishiIdentify Bot Contaminated Hosts by IRC NicknameEvaluation in USENIXHotbots rsquo07 Workshop 2007
[Grizzard et al 2007] Grizzard J B V Sharma CNunnery B B Kang D Dagon Peer-to-Peer BotnetsOverview and Case Study in USENIXHotbots rsquo07Workshop 2007
[Group 2004] LURHQ Threat Intelligence Group Sinit p2pTrojan Analysis LURHQ httpwwwlurhqcomsinithtml
[Karasaridis et al., 2007] Karasaridis, A., B. Rexroad, D. Hoeflin, Wide-Scale Botnet Detection and Characterization, in USENIX HotBots '07 Workshop, 2007.
[Keylogger, 2011] http://en.wikipedia.org/wiki/Keystroke_logging
[Kim and Karp, 2004] Kim, H. A. and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), pp. 271–286.
[Lemos, 2006] Lemos, R., Bot Software Looks to Improve Peerage, http://www.securityfocus.com/news/11390
[Livadas et al., 2006] Livadas, C., B. Walsh, D. Lapsley, T. Strayer, Using Machine Learning Techniques to Identify Botnet Traffic, in 2nd IEEE LCN Workshop on Network Security (WoNS 2006), November 2006.
[Malware, 2011] http://en.wikipedia.org/wiki/Malware
[Masud et al., 2007] Masud, M., L. Khan, B. Thuraisingham, E-mail Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, 2007, pp. 47–61.
[Masud et al., 2008] Masud, M., L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, Information Systems Frontiers, Vol. 10, No. 1, 2008, pp. 33–45.
[Messmer, 2009] Messmer, E., America's 10 Most Wanted Botnets, Network World, July 22, 2009, http://www.networkworld.com/news/2009/072209-botnets.html
[Newsome et al., 2005] Newsome, J., B. Karp, D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, 2005, pp. 226–241.
[Rajab et al., 2006] Rajab, M. A., J. Zarfoss, F. Monrose, A. Terzis, A Multifaceted Approach to Understanding the Botnet Phenomenon, in Proceedings of the 6th ACM SIGCOMM Internet Measurement Conference (IMC), 2006, pp. 41–52.
[Saha and Gairola, 2005] Saha, B. and A. Gairola, Botnet: An Overview, CERT-In White Paper CIWP-2005-05, 2005.
[SecureList, 2011] Securelist.com, Threat Analysis and Information, Kaspersky Labs, http://www.securelist.com/en/threats/detect
[Signature, 2011] Virus Signature, PC Magazine Encyclopedia, http://www.pcmag.com/encyclopedia_term/0,2542,t=virus+signature&i=53969,00.asp
[Spyware, 2011] http://en.wikipedia.org/wiki/Spyware
[Trojan Horse, 2011] http://en.wikipedia.org/wiki/Trojan_horse_(computing)
[Vista, 2007] Windows Vista, http://windows.microsoft.com/en-us/windows-vista/products/home
4
DATA MINING FOR SECURITY APPLICATIONS
4.1 Introduction

Ensuring the integrity of computer networks, both in relation to security and with regard to the institutional life of the nation in general, is a growing concern. Security and defense networks, proprietary research, intellectual property, and data-based market mechanisms that depend on unimpeded and undistorted access can all be severely compromised by malicious intrusions. We need to find the best way to protect these systems. In addition, we need techniques to detect security breaches.
Data mining has many applications in security, including in national security (e.g., surveillance) as well as in cyber security (e.g., virus detection). The threats to national security include attacking buildings and destroying critical infrastructures such as power grids and telecommunication systems [Bolz et al., 2005]. Data mining techniques are being investigated to find out who the suspicious people are and who is capable of carrying out terrorist activities. Cyber security is involved with protecting computer and network systems against corruption due to Trojan horses and viruses. Data mining is also being applied to provide solutions such as intrusion detection and auditing. In this chapter we will focus mainly on data mining for cyber security applications.
To understand the mechanisms to be applied to safeguard the nation and its computers and networks, we need to understand the types of threats. In [Thuraisingham, 2003] we described real-time threats as well as non-real-time threats. A real-time threat is a threat that must be acted upon within a certain time to prevent some catastrophic situation. Note that a non-real-time threat could become a real-time threat over time. For example, one could suspect that a group of terrorists will eventually perform some act of terrorism. However, when we set time bounds, such as that a threat will likely occur, say, before July 1, 2004, then it becomes a real-time threat and we have to take action immediately. If the time bounds are tighter, such as "a threat will occur within two days," then we cannot afford to make any mistakes in our response.
Figure 4.1 Data mining applications in security
There has been a lot of work on applying data mining for both national security and cyber security. Much of the focus of our previous book was on applying data mining for national security [Thuraisingham, 2003]. In this part of the book we discuss data mining for cyber security. In Section 4.2 we discuss data mining for cyber security applications. In particular, we discuss the threats to computers and networks and describe the applications of data mining to detect such threats and attacks. Some of our current research at the University of Texas at Dallas is discussed in Section 4.3. The chapter is summarized in Section 4.4. Figure 4.1 illustrates data mining applications in security.
4.2 Data Mining for Cyber Security

4.2.1 Overview
This section discusses information-related terrorism. By information-related terrorism we mean cyber-terrorism as well as security violations through access control and other means. Trojan horses and viruses are also information-related security violations, which we group into information-related terrorism activities.
Figure 4.2 Cyber security threats
In the next few subsections we discuss various information-related terrorist attacks. In Section 4.2.2 we give an overview of cyber-terrorism and then discuss insider threats and external attacks. Malicious intrusions are the subject of Section 4.2.3. Credit card and identity theft are discussed in Section 4.2.4. Attacks on critical infrastructures are discussed in Section 4.2.5, and data mining for cyber security is discussed in Section 4.2.6. Figure 4.2 illustrates cyber security threats.
4.2.2 Cyber-Terrorism, Insider Threats, and External Attacks
Cyber-terrorism is one of the major terrorist threats posed to our nation today. As we have mentioned earlier, there is now a great deal of information available electronically and on the web. Attacks on our computers as well as networks, databases, and the Internet could be devastating to businesses. It is estimated that cyber-terrorism could cost businesses billions of dollars. For example, consider a banking information system. If terrorists attack such a system and deplete accounts of their funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, which also equates to money in the end. Even a simple power outage at work, through some accident, could cause several hours of productivity loss and, as a result, a major financial loss. Therefore it is critical that our information systems be secure. We discuss various types of cyber-terrorist attacks. One is spreading viruses and Trojan horses that can wipe away files and other important documents; another is intruding upon computer networks.
Note that threats can occur from outside or from inside an organization. Outside attacks are attacks on computers from someone outside the organization. We hear of hackers breaking into computer systems and causing havoc within an organization; there are hackers who spread viruses that cause great damage to the files in various computer systems. But a more sinister problem is the insider threat. Just as with non-information-related attacks, there is an insider threat with information-related attacks. There are people inside an organization who have studied the business practices and who develop schemes to cripple the organization's information assets. These people could be regular employees or even those working at computer centers. The problem is quite serious, as someone may be masquerading as someone else and causing all kinds of damage. In the next few sections we examine how data mining could detect and perhaps prevent such attacks.
4.2.3 Malicious Intrusions
Malicious intrusions may include intruding upon networks, web clients, servers, databases, and operating systems. Many cyber-terrorism attacks are due to malicious intrusions. We hear much about network intrusions; here, intruders try to tap into the networks and get the information that is being transmitted. These intruders may be human intruders or Trojan horses set up by humans. Intrusions can also happen on files. For example, one can masquerade as someone else, log into someone else's computer system, and access the files. Intrusions can also occur on databases. Intruders posing as legitimate users can pose queries, such as SQL queries, and access data that they are not authorized to know.
Essentially, cyber-terrorism includes malicious intrusions as well as sabotage, through malicious intrusions or otherwise. Cyber security consists of security mechanisms that attempt to provide solutions to cyber attacks or cyber-terrorism. When we discuss malicious intrusions or cyber attacks, it may help to think about the non-cyber world, that is, non-information-related terrorism, and then translate those attacks to attacks on computers and networks. For example, a thief could enter a building through a trap door. In the same way, a computer intruder could enter the computer or network through some sort of trap door that has been intentionally built by a malicious insider and left unattended, perhaps through careless design. Another example is a thief entering a bank with a mask and stealing the money. The analogy here is an intruder who masquerades as someone else, legitimately enters the system, and takes all of the information assets; money in the real world translates to information assets in the cyber world. That is, there are many parallels between non-information-related attacks and information-related attacks, and we can proceed to develop countermeasures for both types.
4.2.4 Credit Card Fraud and Identity Theft
We are hearing a lot these days about credit card fraud and identity theft. In the case of credit card fraud, others get hold of a person's credit card and make purchases; by the time the owner of the card finds out, it may be too late. The thief may have left the country by then. A similar problem occurs with telephone calling cards. In fact, this type of attack happened to one of the authors once. Phone calls were being made on her company calling card at airports; someone must have observed, say, the dial tones and then used the card. Fortunately, the telephone company detected the problem and informed the company, and the problem was dealt with immediately.
A more serious theft is identity theft. Here one assumes the identity of another person, for example, by getting hold of the social security number, and essentially carries out all transactions under the other person's name. This could even include selling houses and depositing the income in a fraudulent bank account. By the time the owner finds out, it will be too late; the owner may have lost millions of dollars due to the identity theft.
We need to explore the use of data mining both for credit card fraud detection and for identity theft. There have been some efforts on detecting credit card fraud [Chan, 1999]. We need to start working actively on detecting and preventing identity theft.
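A minimal sketch of the anomaly-detection idea behind such fraud detection (not the distributed classifier approach of [Chan, 1999]): flag a purchase whose amount deviates sharply from the cardholder's spending history. The history values and threshold below are invented for illustration.

```python
from statistics import mean, stdev

def flag_suspicious(history, new_amount, threshold=3.0):
    """Flag a transaction whose amount deviates from the cardholder's
    spending history by more than `threshold` standard deviations
    (a simple z-score test)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical spending history in dollars.
history = [35.0, 42.5, 28.0, 51.0, 39.5, 44.0, 31.5, 47.0]
print(flag_suspicious(history, 45.0))   # typical purchase -> False
print(flag_suspicious(history, 950.0))  # large deviation -> True
```

A deployed system would of course use richer features (merchant, location, time) and a learned model rather than a single z-score.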
Figure 4.3 Attacks on critical infrastructures
4.2.5 Attacks on Critical Infrastructures
Attacks on critical infrastructures could cripple a nation and its economy. Infrastructure attacks include attacks on telecommunication lines; electric power and gas; reservoirs and water supplies; food supplies; and other basic entities that are critical for the operation of a nation.
Attacks on critical infrastructures could occur during any type of attack, whether non-information-related, information-related, or bio-terrorist. For example, one could attack the software that runs the telecommunications industry and close down all the telecommunication lines. Similarly, software that runs the power and gas supplies could be attacked. Attacks could also occur through bombs and explosives; for example, telecommunication lines could be attacked with bombs. Attacking transportation lines such as highways and railway tracks is also an attack on infrastructure.
Infrastructures could also be attacked by natural disasters such as hurricanes and earthquakes. Our main interest here is attacks on infrastructures through malicious means, both information-related and non-information-related. Our goal is to examine data mining and related data management technologies to detect and prevent such infrastructure attacks. Figure 4.3 illustrates attacks on critical infrastructures.
4.2.6 Data Mining for Cyber Security
Data mining is being applied to problems such as intrusion detection and auditing. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace viruses to their perpetrators. Classification may be used to group various cyber attacks and then use the profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks, depending on information learned about terrorists through email and phone conversations. Also, for some threats non-real-time data mining may suffice, whereas for certain other threats, such as network intrusions, we may need real-time data mining. Many researchers are investigating the use of data mining for intrusion detection. Although we need some form of real-time data mining (that is, the results have to be generated in real time), we also need to build models in real time. For example, credit card fraud detection is a form of real-time processing; however, the models there are usually built ahead of time. Building models in real time remains a challenge. Data mining can also be used for analyzing web logs as well as audit trails. Based on the results of the data mining tool, one can then determine whether any unauthorized intrusions have occurred and/or whether any unauthorized queries have been posed.
Other applications of data mining for cyber security include analyzing audit data. One could build a repository or warehouse containing the audit data and then conduct an analysis using various data mining tools to see if there are potential anomalies. For example, there could be a situation where a certain user group accesses the database between 3 a.m. and 5 a.m. It could be that this group is working the night shift, in which case there may be a valid explanation. However, if this group works between 9 a.m. and 5 p.m., then this may be an unusual occurrence. Another example is when a person who always accesses the databases between 1 p.m. and 2 p.m. has, for the past two days, been accessing the database between 1 a.m. and 2 a.m. This could then be flagged as an unusual pattern requiring further investigation.
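The kind of audit-log check just described can be sketched in a few lines. The (user, hour) records below are hypothetical; a real warehouse analysis would mine far richer audit attributes.

```python
from collections import defaultdict

def build_profiles(audit_log):
    """Build each user's set of habitual access hours from
    historical audit records of the form (user, hour)."""
    profiles = defaultdict(set)
    for user, hour in audit_log:
        profiles[user].add(hour)
    return profiles

def flag_unusual(profiles, user, hour):
    """Flag an access that falls outside the user's profile."""
    return hour not in profiles.get(user, set())

# Hypothetical history: 'alice' normally queries between 1 p.m. and 2 p.m.
history = [("alice", 13), ("alice", 13), ("alice", 14), ("bob", 3), ("bob", 4)]
profiles = build_profiles(history)
print(flag_unusual(profiles, "alice", 13))  # habitual hour -> False
print(flag_unusual(profiles, "alice", 1))   # 1 a.m. access -> True
```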
Insider threat analysis is also a problem from a national security as well as a cyber security perspective. That is, those working in a corporation who are considered trusted could commit espionage. Similarly, those with proper access to the computer system could plant Trojan horses and viruses. Catching such terrorists is far more difficult than catching terrorists outside an organization. One may need to monitor the access patterns of all the individuals of a corporation, even system administrators, to see whether they are carrying out cyber-terrorism activities. There is now some research by various groups on applying data mining for such applications.
Figure 4.4 Data mining for cyber security
Although data mining can be used to detect and prevent cyber attacks, data mining also exacerbates some security problems, such as the inference and privacy problems. With data mining techniques, one could infer sensitive associations from legitimate responses. Figure 4.4 illustrates data mining for cyber security. For a more detailed high-level overview, we refer the reader to [Thuraisingham, 2005a] and [Thuraisingham, 2005b].
4.3 Current Research and Development

We are developing a number of tools on data mining for cyber security applications at The University of Texas at Dallas. In our previous book we discussed one such tool for intrusion detection [Awad et al., 2009]. An intrusion can be defined as any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource. As systems become more complex, there are always exploitable weaknesses, whether as a result of design and programming errors or through the use of various "socially engineered" penetration techniques. Computer attacks are split into two categories: host-based attacks and network-based attacks. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines that obtain system call data from an audit process that tracks all system calls made on behalf of each user.
Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, overloading network hosts, and so forth. Network-based attack detection uses network traffic data (i.e., tcpdump) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems.
Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity; these known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks. However, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. The focus of the current state of the art is to reduce false negative and false positive rates.
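The two detector styles can be contrasted with a toy sketch. The signature set, record fields, and thresholds below are all invented for illustration, not drawn from any real IDS.

```python
# Misuse detection relies on known-bad patterns (signatures);
# here, hypothetical (protocol, port) pairs.
SIGNATURES = {("tcp", 31337), ("udp", 6667)}

def misuse_detect(record):
    """Misuse detection: flag traffic matching a known signature."""
    return (record["proto"], record["port"]) in SIGNATURES

def anomaly_detect(record, baseline_rate, threshold=5.0):
    """Anomaly detection: flag traffic whose packet rate deviates
    far from the established baseline for that service."""
    return record["pkts_per_sec"] > threshold * baseline_rate

known_attack = {"proto": "tcp", "port": 31337, "pkts_per_sec": 10}
novel_flood  = {"proto": "tcp", "port": 80,    "pkts_per_sec": 900}
print(misuse_detect(known_attack))        # True: matches a signature
print(misuse_detect(novel_flood))         # False: no signature exists yet
print(anomaly_detect(novel_flood, 20.0))  # True: 900 >> 5 * 20
```

The sketch shows the tradeoff in miniature: the misuse detector misses the novel flood entirely, while the anomaly detector catches it but would also flag any legitimate burst of traffic (a false positive).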
Our current tools discussed in this book include those for email worm detection, malicious code detection, buffer overflow detection, and botnet detection, as well as for analyzing firewall policy rules. Figure 4.5 illustrates the various tools we have developed. Some of these tools are discussed in Parts II through VII of this book. For example, for email worm detection, we examine emails and extract features such as "number of attachments," train data mining tools with techniques such as SVM (support vector machine) or Naïve Bayesian classifiers, and develop a model. Then we test the model and determine whether the email contains a virus/worm or not. We use training and testing datasets posted on various web sites. Similarly, for malicious code detection, we extract n-gram features from both assembly code and binary code. We first train the data mining tool using the SVM technique and then test the model; the classifier determines whether the code is malicious or not. For buffer overflow detection, we assume that malicious messages contain code whereas normal messages contain data. We train SVM and then test to see whether the message contains code or data.
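As a rough sketch of the n-gram extraction step used in malicious code detection, the following counts byte bigrams in a binary and keeps the most frequent ones as candidate features for a classifier such as SVM. The byte sequence is hypothetical, not taken from any malware sample.

```python
from collections import Counter

def ngram_features(code_bytes, n=2, top_k=5):
    """Slide a window of n bytes over the binary and count each
    n-gram; the most frequent n-grams serve as candidate features
    for training a classifier."""
    grams = Counter(code_bytes[i:i + n] for i in range(len(code_bytes) - n + 1))
    return grams.most_common(top_k)

# Hypothetical byte sequence with a repeating pattern.
sample = b"\x90\x90\x31\xc0\x90\x90\x31\xc0\x90\x90"
print(ngram_features(sample, n=2))  # b"\x90\x90" is the most frequent bigram
```

In practice, feature selection (e.g., by information gain) would then pick the most discriminative n-grams across a corpus of benign and malicious executables.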
Figure 4.5 Data mining tools at UT Dallas
4.4 Summary

This chapter has discussed data mining for security applications. We started with a discussion of data mining for cyber security applications and then provided a brief overview of the tools we are developing. We describe some of these tools in Parts II through VII of this book. Note that we will focus mainly on malware detection; however, in Part VII we also discuss tools for insider threat detection, active defense, and real-time data mining.
Data mining for national security as well as for cyber security is a very active research area. Various data mining techniques, including link analysis and association rule mining, are being explored to detect abnormal patterns. Because of data mining, users can now make all kinds of correlations, which also raises privacy concerns. More details on privacy can be found in [Thuraisingham, 2002].
References

[Awad et al., 2009] Awad, M., L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools, CRC Press, 2009.
[Bolz et al., 2005] Bolz, F., K. Dudonis, D. Schulz, The Counterterrorism Handbook: Tactics, Procedures, and Techniques, Third Edition, CRC Press, 2005.
[Chan, 1999] Chan, P., W. Fan, A. Prodromidis, S. Stolfo, Distributed Data Mining in Credit Card Fraud Detection, IEEE Intelligent Systems, Vol. 14, No. 6, 1999, pp. 67–74.
[Thuraisingham, 2002] Thuraisingham, B., Data Mining, National Security, Privacy and Civil Liberties, SIGKDD Explorations, Vol. 4, No. 2, 2002, pp. 1–5.
[Thuraisingham, 2003] Thuraisingham, B., Web Data Mining Technologies and Their Applications in Business Intelligence and Counter-Terrorism, CRC Press, 2003.
[Thuraisingham, 2005a] Thuraisingham, B., Managing Threats to Web Databases and Cyber Systems: Issues, Solutions and Challenges, Kluwer, 2004 (Editors: V. Kumar, J. Srivastava, A. Lazarevic).
[Thuraisingham, 2005b] Thuraisingham, B., Database and Applications Security, CRC Press, 2005.
5
DESIGN AND IMPLEMENTATION OF DATA MINING TOOLS
5.1 Introduction

Data mining is an important process that has been integrated into many industrial, governmental, and academic applications. It is defined as the process of analyzing and summarizing data to uncover new knowledge. Data mining maturity depends on other areas such as data management, artificial intelligence, statistics, and machine learning.
In our previous book [Awad et al., 2009] we concentrated mainly on the classification problem. We applied classification in three critical applications, namely, intrusion detection, WWW prediction, and image classification. Specifically, we strove to improve performance (time and accuracy) by incorporating multiple (two or more) learning models. In intrusion detection we tried to improve the training time, whereas in WWW prediction we studied hybrid models to improve the prediction accuracy. The classification problem is also sometimes referred to as "supervised learning," in which a set of labeled examples is learned by a model, and then a new example with an unknown label is presented to the model for prediction.
Many prediction models have been used, such as Markov models, decision trees, artificial neural networks, support vector machines, association rule mining, and many others. Each of these models has strengths and weaknesses. However, there is a common weakness among all of these techniques: the inability to suit all applications. The reason there is no ideal or perfect classifier is that each of these techniques was initially designed to solve specific problems under certain assumptions.
There are two directions in designing data mining techniques: model complexity and model performance. In model complexity, new data structures, training set reduction techniques, and/or small numbers of adaptable parameters are proposed to simplify computations during learning without compromising the prediction accuracy. In model performance, the goal is to improve the prediction accuracy at the cost of some complication of the design or model. It is evident that there is a tradeoff between prediction performance and model complexity. In this book we present studies of hybrid models to improve the prediction accuracy of data mining algorithms in two important applications, namely, intrusion detection and WWW prediction.
Intrusion detection involves processing and learning a large number of examples to detect intrusions. Such a process becomes computationally costly and impractical when the number of records to train against grows dramatically. Eventually this limits our choice of data mining technique: powerful techniques such as support vector machines (SVMs) will be avoided because of their algorithmic complexity. We propose a hybrid model, based on SVMs and clustering analysis, to overcome this problem. The idea is to apply a reduction technique using clustering analysis to approximate support vectors and thereby speed up the training process of SVMs. We propose a method, namely, clustering trees-based SVM (CT-SVM), to reduce the training set and approximate support vectors. We exploit clustering analysis to generate support vectors to improve the accuracy of the classifier.
Surfing prediction is another important research area upon which many application improvements depend. Applications such as latency reduction, web search, and recommendation systems utilize surfing prediction to improve their performance. Several challenges are present in this area: a low accuracy rate [Pitkow and Pirolli, 1999]; sparsity of the data [Burke, 2002], [Grcar et al., 2005]; a large number of labels, which makes it a complex multi-class problem [Chung et al., 2004]; and underutilization of the domain knowledge. Our goal is to improve the predictive accuracy by combining several powerful classification techniques, namely, SVMs, artificial neural networks (ANNs), and the Markov model. The Markov model is a powerful technique for predicting seen data; however, it cannot predict unseen data. On the other hand, techniques such as SVM and ANN are powerful predictors that can predict not only the seen data but also the unseen data. However, when dealing with large numbers of classes/labels, or when one instance may belong to many classes, their predictive power may decrease. We use Dempster's rule to fuse the prediction outcomes of these models. Such fusion combines the best of the different models and has achieved better accuracy than the individual models.
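The text does not give the exact mass assignments used, so the following is only a minimal sketch of Dempster's rule of combination, fusing hypothetical belief masses from an "SVM" and an "ANN" over two candidate next pages (the frame of discernment also carries a residual mass representing ignorance).

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions over the
    same frame of discernment. Keys are frozensets of hypotheses;
    values are belief masses summing to 1."""
    combined, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass assigned to incompatible pairs
    # Normalize by 1 - K, where K is the total conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

A, B = frozenset({"page_A"}), frozenset({"page_B"})
theta = A | B                                # the whole frame (ignorance)
svm_masses = {A: 0.7, B: 0.2, theta: 0.1}    # hypothetical SVM output
ann_masses = {A: 0.6, B: 0.3, theta: 0.1}    # hypothetical ANN output
fused = combine(svm_masses, ann_masses)
print(fused)  # agreement on page_A is reinforced after fusion
```

Because both classifiers lean toward page_A, the fused mass on page_A exceeds either individual mass, which is the behavior the fusion step relies on.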
Figure 5.1 Data mining applications
In this chapter we discuss the three applications we considered in our previous book, Design and Implementation of Data Mining Tools [Awad et al., 2009]. That book is a useful reference and provides some background information for our current book. The applications are illustrated in Figure 5.1. In Section 5.2 we discuss intrusion detection. WWW surfing prediction is discussed in Section 5.3. Image classification is discussed in Section 5.4. More details on broader applications of data mining, such as data mining for security applications, web data mining, and image/multimedia data mining, can be found in [Awad et al., 2009].
5.2 Intrusion Detection

Security and defense networks, proprietary research, intellectual property, and data-based market mechanisms, which depend on unimpeded and undistorted access, can all be severely compromised by intrusions. We need to find the best way to protect these systems.
An intrusion can be defined as "any set of actions that attempts to compromise the integrity, confidentiality, or availability of a resource" [Heady et al., 1990], [Axelsson, 1999], [Debar et al., 2000]. User authentication (e.g., using passwords or biometrics), avoiding programming errors, and information protection (e.g., encryption) have all been used to protect computer systems. As systems become more complex, there are always exploitable weaknesses due to design and programming errors, or through the use of various "socially engineered" penetration techniques. For example, exploitable "buffer overflow" flaws still exist in some recent system software as a result of programming errors. Elements central to intrusion detection are: resources to be protected in a target system, i.e., user accounts, file systems, and system kernels; models that characterize the "normal" or "legitimate" behavior of these resources; and techniques that compare the actual system activities with the established models, identifying those that are "abnormal" or "intrusive." In pursuit of a secure system, different measures of system behavior have been proposed, based on an ad hoc presumption that normalcy and anomaly (or illegitimacy) will be accurately manifested in the chosen set of system features.
Intrusion detection attempts to detect computer attacks by examining various data records observed through processes on the same network. These attacks are split into two categories: host-based attacks [Anderson et al., 1995], [Axelsson, 1999], [Freeman et al., 2002] and network-based attacks [Ilgun et al., 1995], [Marchette, 1999]. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines that obtain system call data from an audit process that tracks all system calls made on behalf of each user.
Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, and overloading network hosts. Network-based attack detection uses network traffic data (i.e., tcpdump) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems. Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns [McCanne et al., 1989], [Mukkamala et al., 2002]. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity [Ilgun et al., 1995], [Marchette, 1999]; these known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks; however, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. Our research focuses on network-level systems. A significant challenge in data mining is to reduce the false negative and false positive rates; at the same time, we also need to develop a realistic intrusion detection system.
SVM is one of the most successful classification algorithms in the data mining area, but its long training time limits its use. Many applications, such as data mining for bioinformatics and geoinformatics, require the processing of huge datasets, and the training time of SVM is a serious obstacle in processing them. According to [Yu et al., 2003], it would take years to train SVM on a dataset consisting of one million records. Many proposals have been made to enhance SVM training performance [Agarwal, 2002], [Cauwenberghs and Poggio, 2000], either through random selection or through approximation of the marginal classifier [Feng and Mangasarian, 2001]. However, such approaches are still not feasible with large datasets, where even multiple scans of the entire dataset are too expensive to perform, or they lose, through oversimplification, any benefit to be gained through the use of SVM [Yu et al., 2003].
In Part II of this book we propose a new approach for enhancing the training process of SVM when dealing with large training datasets. It is based on the combination of SVM and clustering analysis. The idea is as follows: SVM computes the maximal margin separating data points; hence, only those patterns closest to the margin can affect the computations of that margin, while other points can be discarded without affecting the final result. The points lying close to the margin are called support vectors. We try to approximate these points by applying clustering analysis.
In general, hierarchical clustering analysis based on a dynamically growing self-organizing tree (DGSOT) involves expensive computations, especially if the set of training data is large. However, in our approach we control the growth of the hierarchical tree by allowing tree nodes (support vector nodes) close to the marginal area to grow while halting distant ones. Therefore, the computations of SVM and further clustering analysis are reduced dramatically. Also, to avoid the cost of computations involved in clustering analysis, we train SVM on the nodes of the tree after each phase or iteration, in which a few nodes are added to the tree. Each iteration involves growing the hierarchical tree by adding new children nodes. This could cause a degradation of the accuracy of the resulting classifier. However, we use the support vector set as a priori knowledge to instruct the clustering algorithm to grow support vector nodes and to stop growing non-support vector nodes. By applying this procedure, the accuracy of the classifier improves and the size of the training set is kept to a minimum.
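The CT-SVM algorithm itself is presented in Part II. As a loose illustration of the underlying intuition only (reduce the training set to points likely to be support vectors, i.e., points near the class boundary), consider this crude sketch, which keeps the fraction of each class closest to the opposite class centroid; the 2-D points are hypothetical and the method is far simpler than DGSOT-based clustering.

```python
def reduce_training_set(points, labels, keep_ratio=0.5):
    """Keep only the fraction of each class lying closest to the
    opposite class's centroid, a crude proxy for 'near the margin,'
    where support vectors live. Assumes a binary problem."""
    def centroid(cls):
        pts = [p for p, y in zip(points, labels) if y == cls]
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    cents = {c: centroid(c) for c in set(labels)}
    reduced = []
    for cls in cents:
        other = cents[[c for c in cents if c != cls][0]]
        members = [p for p, y in zip(points, labels) if y == cls]
        members.sort(key=lambda p: dist2(p, other))  # nearest to other class first
        keep = max(1, int(len(members) * keep_ratio))
        reduced.extend((p, cls) for p in members[:keep])
    return reduced

# Two hypothetical 2-D classes; half the points survive reduction.
pts = [(0, 0), (0, 1), (1, 0), (2, 2), (5, 5), (5, 6), (6, 5), (4, 4)]
lbl = [0, 0, 0, 0, 1, 1, 1, 1]
print(len(reduce_training_set(pts, lbl)))  # 4 points remain
```

The boundary-adjacent points (2, 2) and (4, 4) survive while the interior points are discarded, mirroring the idea that only near-margin points influence the SVM solution.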
We report results here with one benchmark dataset, the 1998 DARPA dataset [Lippmann et al., 1998]. Also, we compare our approach with the Rocchio bundling algorithm, proposed for classifying documents by reducing the number of data points [Shih et al., 2003]. Note that the Rocchio bundling method reduces the number of data points before feeding those data points as support vectors to SVM for training. On the other hand, our clustering approach is intertwined with SVM. We have observed that our approach outperforms pure SVM and the Rocchio bundling technique in terms of accuracy, false positive (FP) rate, false negative (FN) rate, and processing time.
The contributions of our work to intrusion detection are as follows:
1. We propose a new support vector selection technique using clustering analysis to reduce the training time of SVM. Here we combine the clustering analysis and SVM training phases.
2. We show analytically the degree to which our approach is asymptotically quicker than pure SVM, and we validate this claim with experimental results.
3. We compare our approach with random selection and Rocchio bundling on a benchmark dataset and demonstrate impressive results in terms of training time, FP (false positive) rate, FN (false negative) rate, and accuracy.
5.3 Web Page Surfing Prediction
Surfing prediction is an important research area upon which many application improvements depend. Applications such as latency reduction, web search, and personalization systems utilize surfing prediction to improve their performance.
Latency of viewing with regard to web documents is an early application of surfing prediction. Web caching and pre-fetching methods have been developed to pre-fetch multiple pages to improve the performance of World Wide Web systems. The fundamental concept behind all these caching algorithms is the ordering of various web documents using some ranking factors, such as the popularity and the size of the document, according to existing knowledge. Pre-fetching the highest-ranking documents results in a significant reduction of latency during document viewing [Chinen and Yamaguchi, 1997], [Duchamp, 1999], [Griffioen and Appleton, 1994], [Teng et al., 2005], [Yang et al., 2001].
Improvements in web search engines can also be achieved using predictive models. Surfers can be viewed as having walked over the entire WWW link structure. The distribution of visits over all WWW pages is computed and used for re-weighting and re-ranking results. Surfer path information is considered more important than the text keywords entered by the surfers; hence, the more accurate the predictive models are, the better the search results will be [Brin and Page, 1998].
In recommendation systems, collaborative filtering (CF) has been applied successfully to find the top k users having the same tastes or interests based on a given target user's records [Yu et al., 2003]. The k-Nearest-Neighbor (kNN) approach is used to compare a user's historical profile and records with the profiles of other users to find the top k similar users. Using Association Rule Mining (ARM), [Mobasher et al., 2001] propose a method that matches an active user session with frequent itemsets and predicts the next page the user is likely to visit. These CF-based techniques suffer from well-known limitations, including scalability and efficiency [Mobasher et al., 2001], [Sarwar et al., 2000]. [Pitkow and Pirolli, 1999] explore pattern extraction and pattern matching based on a Markov model that predicts future surfing paths. Longest Repeating Subsequences (LRS) is proposed to reduce the model complexity (not the predictive accuracy) by focusing on significant surfing patterns.
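The kNN user-matching step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not any cited system's implementation: cosine similarity over page-visit count vectors is one common choice of similarity measure, and the profiles are made-up data.

```python
import math

def top_k_similar(target, profiles, k=2):
    """Rank users by cosine similarity of their page-visit count
    vectors to the target user's vector; return the k most similar."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a)) *
               math.sqrt(sum(y * y for y in b)))
        return num / den if den else 0.0
    ranked = sorted(profiles, key=lambda u: cos(target, profiles[u]),
                    reverse=True)
    return ranked[:k]

# hypothetical page-visit counts over four pages
profiles = {"u1": [5, 0, 3, 0], "u2": [0, 4, 0, 4], "u3": [4, 1, 2, 0]}
print(top_k_similar([6, 0, 2, 0], profiles, k=2))  # ['u1', 'u3']
```

A CF recommender would then predict the target user's next page from the pages these k neighbors visited.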
There are several problems with the current state-of-the-art solutions. First, the predictive accuracy using a proposed solution such as a Markov model is low; for example, the maximum training accuracy is 41% [Pitkow and Pirolli, 1999]. Second, prediction using Association Rule Mining and LRS pattern extraction is done based on choosing the path with the highest probability in the training set; hence, any new surfing path is misclassified, because the probability of such a path occurring in the training set is zero. Third, the sparse nature of the user sessions used in training can result in unreliable predictors [Burke, 2002], [Grcar et al., 2005]. Finally, many of the previous methods have ignored domain knowledge as a means for improving prediction. Domain knowledge plays a key role in improving the predictive accuracy because it can be used to eliminate irrelevant classifiers during prediction or to reduce their effectiveness by assigning them lower weights.
WWW prediction is a multi-class problem, and prediction can resolve into many classes. Most multi-class techniques, such as one-vs-one and one-vs-all, are based on binary classification. Prediction is required to check any new instance against all classes. In WWW prediction, the number of classes is very large (11,700 classes in our experiments). Hence, prediction accuracy is very low [Chung et al., 2004] because it fails to choose the right class. For a given instance, domain knowledge can be used to eliminate irrelevant classes.
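The scale problem is easy to quantify: one-vs-all trains one binary classifier per class, while one-vs-one trains one per pair of classes, so the count grows quadratically. A quick arithmetic sketch for the class count cited above:

```python
def num_binary_classifiers(n_classes):
    """Binary classifiers required by the two standard multi-class schemes."""
    return {"one_vs_all": n_classes,
            "one_vs_one": n_classes * (n_classes - 1) // 2}

counts = num_binary_classifiers(11700)
print(counts["one_vs_all"])  # 11700
print(counts["one_vs_one"])  # 68439150
```

Consulting tens of millions of pairwise classifiers per prediction is impractical, which is why pruning irrelevant classes with domain knowledge pays off.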
We use several classification techniques, namely Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Association Rule Mining (ARM), and the Markov model, in WWW prediction. We propose a hybrid prediction model by combining two or more of them using Dempster's rule. The Markov model is a powerful technique for predicting seen data; however, it cannot predict unseen data. On the other hand, SVM is a powerful technique that can predict not only seen data but also unseen data. However, when dealing with too many classes, or when there is a possibility that one instance may belong to many classes (e.g., a user, after visiting the web pages 1, 2, 3, might go to page 10, while another might go to page 100), SVM predictive power may decrease, because such examples confuse the training process. To overcome these drawbacks of SVM, we extract domain knowledge from the training set and incorporate this knowledge in the testing set to improve the prediction accuracy of SVM by reducing the number of classifiers consulted during prediction.
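Dempster's rule of combination, used above to fuse classifier outputs, can be sketched as follows. The mass assignments for the two hypothetical sources ("Markov" and "SVM") are invented for illustration; the combination logic itself is the standard rule: multiply masses of intersecting focal elements and renormalize by the non-conflicting mass.

```python
def dempster_combine(m1, m2):
    """Combine two basic probability assignments (mass functions) over
    frozenset focal elements using Dempster's rule of combination."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass on contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# hypothetical beliefs about the next page being p10 or p100
P10, P100 = frozenset({"p10"}), frozenset({"p100"})
EITHER = P10 | P100
m_markov = {P10: 0.6, EITHER: 0.4}
m_svm = {P10: 0.5, P100: 0.3, EITHER: 0.2}
fused = dempster_combine(m_markov, m_svm)
print(round(fused[P10], 3))  # 0.756
```

Both sources lean toward p10, so the fused mass on p10 (0.756) exceeds either individual belief, which is the behavior that makes the rule attractive for fusing agreeing classifiers.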
ANN is also a powerful technique that can predict not only seen data but also unseen data. Nonetheless, ANN has shortcomings similar to those of SVM when dealing with too many classes or when one instance may belong to many classes. Furthermore, the design of an ANN becomes complex with a large number of input and output nodes. To overcome these drawbacks of ANN, we employ domain knowledge from the training set and incorporate this knowledge in the testing set by reducing the number of classifiers to consult during prediction. This improves the prediction accuracy and reduces the prediction time.
Our contributions to WWW prediction are as follows:
1. We overcome the drawbacks of SVM and ANN in WWW prediction by extracting domain knowledge and incorporating it in prediction to improve accuracy and prediction time.
2. We propose a hybrid approach for prediction in the WWW. Our approach fuses different combinations of prediction techniques, namely SVM, ANN, and Markov, using Dempster's rule [Lalmas, 1997] to improve the accuracy.
3. We compare our hybrid model with different approaches, namely the Markov model, Association Rule Mining (ARM), Artificial Neural Networks (ANNs), and Support Vector Machines (SVMs), on a standard benchmark dataset and demonstrate the superiority of our method.
5.4 Image Classification
Image classification is about determining the class to which an image belongs. It is an aspect of image data mining. Other image data mining outcomes include determining anomalies in images in the form of change detection, as well as clustering images. In some situations, making links between images may also be useful. One key aspect of image classification is image annotation. Here, the system understands raw images and automatically annotates them. The annotation is essentially a description of the images.
Our contributions to image classification include the following:
• We present a new framework for automatic image annotation.
• We propose a dynamic feature weighting algorithm based on histogram analysis and Chi-square.
• We present an image re-sampling method to solve the imbalanced data problem.
• We present a modified kNN algorithm based on evidence theory.
In our approach, we first annotate images automatically. In particular, we utilize K-means clustering algorithms to cluster image blobs and then make a correlation between the blobs and words. This results in annotated images. Our research has also focused on classifying images using ontologies for geospatial data. Here we classify images using a region growing algorithm and then use high-level concepts in the form of ontologies to classify the regions. Our research on image classification is given in [Awad et al., 2009].
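The blob-word correlation step can be sketched as a simple co-occurrence count: each blob cluster accumulates the caption words it appears with, and the normalized counts act as P(word | blob) for annotating new images. This is an illustrative stand-in for our annotation model, and the cluster ids and caption words are toy data.

```python
from collections import Counter, defaultdict

def annotation_table(training):
    """Count how often each blob cluster co-occurs with each caption
    word; normalized counts approximate P(word | blob)."""
    cooc = defaultdict(Counter)
    for blobs, words in training:
        for blob in blobs:
            cooc[blob].update(words)
    return {b: {w: c / sum(ws.values()) for w, c in ws.items()}
            for b, ws in cooc.items()}

def annotate(blobs, table):
    """Label each blob of a new image with its most likely word."""
    return [max(table[b], key=table[b].get) for b in blobs if b in table]

# toy training pairs: (cluster ids of image blobs, caption words)
train = [([1, 2], ["sky", "grass"]),
         ([1, 3], ["sky", "water"]),
         ([2], ["grass"])]
table = annotation_table(train)
print(annotate([1, 2], table))  # ['sky', 'grass']
```

In the actual system, the cluster ids would come from K-means over blob features rather than being given directly.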
5.5 Summary
In this chapter, we have discussed three applications that were described in [Awad et al., 2009]. We have developed data mining tools for these three applications: intrusion detection, web page surfing prediction, and image classification. They are part of the broader classes of applications: cyber security, web information management, and multimedia/image information management, respectively. In this book, we have taken one topic discussed in our prior book and elaborated on it. In particular, we have described data mining for cyber security and have focused on malware detection.
Future directions will focus on two aspects. One is enhancing the data mining algorithms to address limitations such as false positives and false negatives, as well as to reason with uncertainty. The other is to expand on applying data mining to the broader classes of applications, such as cyber security, multimedia information management, and web information management.
References
[Agarwal, 2002] Agarwal, D. K., Shrinkage Estimator Generalizations of Proximal Support Vector Machines, in Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, 2002, pp. 173–182.
[Anderson et al., 1995] Anderson, D., T. Frivold, A. Valdes, Next-Generation Intrusion Detection Expert System (NIDES): A Summary, Technical Report SRI-CSL-95-07, Computer Science Laboratory, SRI International, Menlo Park, California, May 1995.
[Awad et al., 2009] Awad, M., L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools, CRC Press, 2009.
[Axelsson, 1999] Axelsson, S., Research in Intrusion Detection Systems: A Survey, Technical Report TR 98-17 (revised in 1999), Chalmers University of Technology, Goteborg, Sweden, 1999.
[Brin and Page, 1998] Brin, S., and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in Proceedings of the 7th International WWW Conference, Brisbane, Australia, 1998, pp. 107–117.
[Burke, 2002] Burke, R., Hybrid Recommender Systems: Survey and Experiments, User Modeling and User-Adapted Interaction, Vol. 12, No. 4, 2002, pp. 331–370.
[Cauwenberghs and Poggio, 2000] Cauwenberghs, G., and T. Poggio, Incremental and Decremental Support Vector Machine Learning, in Advances in Neural Information Processing Systems 13: Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, T. K. Leen, T. G. Dietterich, V. Tresp (Eds.), MIT Press, 2001.
[Chinen and Yamaguchi, 1997] Chinen, K., and S. Yamaguchi, An Interactive Prefetching Proxy Server for Improvement of WWW Latency, in Proceedings of the Seventh Annual Conference of the Internet Society (INET '97), Kuala Lumpur, June 1997.
[Chung et al., 2004] Chung, V., C. H. Li, J. Kwok, Dissimilarity Learning for Nominal Data, Pattern Recognition, Vol. 37, No. 7, 2004, pp. 1471–1477.
[Debar et al., 2000] Debar, H., M. Dacier, A. Wespi, A Revised Taxonomy for Intrusion Detection Systems, Annales des Telecommunications, Vol. 55, No. 7–8, 2000, pp. 361–378.
[Duchamp, 1999] Duchamp, D., Prefetching Hyperlinks, in Proceedings of the Second USENIX Symposium on Internet Technologies and Systems (USITS), Boulder, CO, 1999, pp. 127–138.
[Feng and Mangasarian, 2001] Feng, G., and O. L. Mangasarian, Semi-supervised Support Vector Machines for Unlabeled Data Classification, Optimization Methods and Software, Vol. 15, 2001, pp. 29–44.
[Freeman et al., 2002] Freeman, S., A. Bivens, J. Branch, B. Szymanski, Host-Based Intrusion Detection Using User Signatures, in Proceedings of Research Conference, RPI, Troy, NY, October 2002.
[Grcar et al., 2005] Grcar, M., B. Fortuna, D. Mladenic, kNN versus SVM in the Collaborative Filtering Framework, WebKDD '05, August 21, 2005, Chicago, Illinois.
[Griffioen and Appleton, 1994] Griffioen, J., and R. Appleton, Reducing File System Latency Using a Predictive Approach, in Proceedings of the 1994 Summer USENIX Technical Conference, Cambridge, MA.
[Heady et al., 1990] Heady, R., G. Luger, A. Maccabe, M. Servilla, The Architecture of a Network Level Intrusion Detection System, Technical Report TR-CS-1990-20, University of New Mexico, 1990.
[Ilgun et al., 1995] Ilgun, K., R. A. Kemmerer, P. A. Porras, State Transition Analysis: A Rule-Based Intrusion Detection Approach, IEEE Transactions on Software Engineering, Vol. 21, No. 3, 1995, pp. 181–199.
[Lalmas, 1997] Lalmas, M., Dempster-Shafer's Theory of Evidence Applied to Structured Documents: Modelling Uncertainty, in Proceedings of the 20th Annual International ACM SIGIR, Philadelphia, PA, 1997, pp. 110–118.
[Lippmann et al., 1998] Lippmann, R. P., I. Graf, D. Wyschogrod, S. E. Webster, D. J. Weber, S. Gorton, The 1998 DARPA/AFRL Off-Line Intrusion Detection Evaluation, First International Workshop on Recent Advances in Intrusion Detection (RAID), Louvain-la-Neuve, Belgium, 1998.
[Marchette, 1999] Marchette, D., A Statistical Method for Profiling Network Traffic, First USENIX Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, 1999, pp. 119–128.
[Mobasher et al., 2001] Mobasher, B., H. Dai, T. Luo, M. Nakagawa, Effective Personalization Based on Association Rule Discovery from Web Usage Data, in Proceedings of the ACM Workshop on Web Information and Data Management (WIDM '01), 2001, pp. 9–15.
[Mukkamala et al., 2002] Mukkamala, S., G. Janoski, A. Sung, Intrusion Detection: Support Vector Machines and Neural Networks, in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Honolulu, HI, 2002, pp. 1702–1707.
[Pitkow and Pirolli, 1999] Pitkow, J., and P. Pirolli, Mining Longest Repeating Subsequences to Predict World Wide Web Surfing, in Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS '99), Boulder, CO, October 1999, pp. 139–150.
[Sarwar et al., 2000] Sarwar, B. M., G. Karypis, J. Konstan, J. Riedl, Analysis of Recommender Algorithms for E-Commerce, in Proceedings of the 2nd ACM E-Commerce Conference (EC '00), October 2000, Minneapolis, Minnesota, pp. 158–167.
[Shih et al., 2003] Shih, L., J. D. M. Rennie, Y. Chang, D. R. Karger, Text Bundling: Statistics-Based Data Reduction, in Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003, Washington, DC, pp. 696–703.
[Teng et al., 2005] Teng, W.-G., C.-Y. Chang, M.-S. Chen, Integrating Web Caching and Web Prefetching in Client-Side Proxies, IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 5, May 2005, pp. 444–455.
[Yang et al., 2001] Yang, Q., H. Zhang, T. Li, Mining Web Logs for Prediction Models in WWW Caching and Prefetching, in The 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 26–29, 2001, pp. 475–478.
[Yu et al., 2003] Yu, H., J. Yang, J. Han, Classifying Large Data Sets Using SVM with Hierarchical Clusters, in SIGKDD 2003, August 24–27, 2003, Washington, DC, pp. 306–315.
Conclusion to Part I
We have presented various supporting technologies for data mining for malware detection. These include data mining technologies, malware technologies, as well as data mining applications. First, we provided an overview of data mining techniques. Next, we discussed various types of malware. This was followed by a discussion of data mining for security applications. Finally, we provided a summary of the data mining tools we discussed in our previous book, Design and Implementation of Data Mining Tools.
Now that we have provided an overview of supporting technologies, we can discuss the various types of data mining tools we have developed for malware detection. In Part II, we discuss email worm detection tools. In Part III, we discuss data mining tools for detecting malicious executables. In Part IV, we discuss data mining for detecting remote exploits. In Part V, we discuss data mining for botnet detection. In Part VI, we discuss stream mining tools. Finally, in Part VII, we discuss some of the emerging tools, including data mining for insider threat detection and firewall policy analysis.
PART II
DATA MINING FOR EMAIL WORM DETECTION
Introduction to Part II
In this part, we will discuss data mining techniques to detect email worms. Email messages contain a number of different features, such as the total number of words in the message body/subject, the presence/absence of binary attachments, the type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature selection and dimension reduction. This step is necessary to reduce noise and redundancy in the data. The feature selection technique is called Two-Phase Selection (TPS), which is a novel combination of a decision tree and a greedy selection algorithm. The dimension reduction is performed by Principal Component Analysis. Second, the reduced data are used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes, and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. These results have been compared with published results. It is found that the proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.
Part II consists of three chapters: 6, 7, and 8. In Chapter 6, we provide an overview of email worm detection, including a discussion of related work. In Chapter 7, we discuss our tool for email worm detection. In Chapter 8, we analyze the results we have obtained by using our tool.
6
EMAIL WORM DETECTION
6.1 Introduction
An email worm spreads through infected email messages. The worm may be carried by an attachment, or the email may contain links to an infected web site. When the user opens the attachment or clicks the link, the host gets infected immediately. The worm exploits the vulnerable email software in the host machine to send infected emails to addresses stored in the address book. Thus, new machines get infected. Worms bring damage to computers and people in various ways. They may clog the network traffic, cause damage to the system, and make the system unstable or even unusable.
The traditional method of worm detection is signature based. A signature is a unique pattern in the worm body that can identify it as a particular type of worm. Thus, a worm can be detected from its signature. But the problem with this approach is that it involves a significant amount of human intervention and may take a long time (from days to weeks) to discover the signature. Thus, this approach is not useful against "zero-day" attacks of computer worms. Also, signature matching is not effective against polymorphism.
Thus, there is a growing need for a fast and effective detection mechanism that requires no manual intervention. Our work is directed toward automatic and efficient detection of email worms. In our approach, we have developed a two-phase feature selection technique for email worm detection. In this approach, we apply TPS to select the best features using a decision tree and a greedy algorithm. We compare our approach with two baseline techniques. The first baseline approach does not apply any feature reduction; it trains a classifier with the unreduced dataset. The second baseline approach reduces data dimension using principal component analysis (PCA) and trains a classifier with the reduced dataset. It is shown empirically that our TPS approach outperforms the baseline techniques. We also report the feature set that achieves this performance. For the base learning algorithm (i.e., classifier), we use both support vector machine (SVM) and Naïve Bayes (NB). We observe relatively better performance with SVM. Thus, we strongly recommend applying SVM with our TPS process for detecting novel email worms in a feature-based paradigm.
Figure 6.1 Concepts in this chapter. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
The organization of this chapter is as follows. Section 6.2 describes our architecture. Section 6.3 describes related work in automatic email worm detection. Our approach is briefly discussed in Section 6.4. The chapter is summarized in Section 6.5. Figure 6.1 illustrates the concepts in this chapter.
6.2 Architecture
Figure 6.2 Architecture for email worm detection. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
Figure 6.2 illustrates our architecture at a high level. At first, we build a classifier from training data containing both benign and infected emails. Then, unknown emails are tested with the classifier to predict whether they are infected or clean.
The training data consist of both benign and malicious (infected) emails. These emails are called training instances. The training instances go through the feature selection module, where features are extracted and the best features are selected (see Sections 7.3, 7.4). The output of the feature selection module is a feature vector for each training instance. These feature vectors are then sent to the training module to train a classification model (classifier module). We use different classification models, such as support vector machine (SVM), Naïve Bayes (NB), and their combination (see Section 7.5). A new email arriving at the host machine first undergoes the feature extraction module, where the same features selected in the feature selection module are extracted and a feature vector is produced. This feature vector is given as input to the classifier, and the classifier predicts the class (i.e., benign/infected) of the email.
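As an illustration of the feature extraction module, the sketch below maps a raw email to a small numeric feature vector. The dict schema and the particular features (word counts, attachment indicators) are hypothetical, chosen only to mirror the kinds of per-email features discussed in Chapter 7.

```python
def email_features(email):
    """Map a raw email (hypothetical dict schema) to the kind of numeric
    feature vector a classifier consumes: word counts in body/subject,
    attachment count, and a binary executable-attachment flag."""
    body_words = len(email.get("body", "").split())
    subject_words = len(email.get("subject", "").split())
    attachments = email.get("attachments", [])
    has_binary = int(any(a.lower().endswith((".exe", ".scr", ".pif"))
                         for a in attachments))
    return [body_words, subject_words, len(attachments), has_binary]

msg = {"subject": "Re: your document",
       "body": "see the attached file for details",
       "attachments": ["details.pif"]}
print(email_features(msg))  # [6, 3, 1, 1]
```

The same function is applied identically at training time and at prediction time, so a new email is projected into exactly the feature space the classifier was trained on.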
6.3 Related Work
There are different approaches to automating the detection of worms. These approaches are mainly of two types: behavioral and content based. Behavioral approaches analyze the behavior of messages, such as source-destination addresses, attachment types, message frequency, and so forth. Content-based approaches look into the content of the message and try to detect the signature automatically. There are also combined methods that take advantage of both techniques.
An example of behavioral detection is social network analysis [Golbeck and Hendler, 2004], [Newman et al., 2002]. It detects worm-infected emails by creating graphs of a network, where users are represented as nodes and communications between users are represented as edges. A social network is a group of nodes among which there exist edges. Emails that propagate beyond the group boundary are considered to be infected. The drawback of this system is that worms can easily bypass social networks by intelligently choosing recipient lists, by looking at recent emails in the user's outbox.
Another example of a behavioral approach is the application of the Email Mining Toolkit (EMT) [Stolfo et al., 2006]. The EMT computes behavior profiles of user email accounts by analyzing email logs. It uses modeling techniques to achieve high detection rates with very low false positive rates. Statistical analysis of outgoing emails is another behavioral approach [Schultz et al., 2001], [Symantec, 2005]. Statistics collected from the frequency of communication between clients and their mail server, byte sequences in the attachment, and so on are used to predict anomalies in emails, and thus worms are detected.
An example of the content-based approach is the EarlyBird System [Singh et al., 2003]. In this system, statistics on highly repetitive packet contents are gathered. These statistics are analyzed to detect possible infection of host or server machines. This method generates the content signature of a worm without any human intervention. Results reported by this system indicated a very low false positive rate of detection. Other examples are Autograph [Kim and Karp, 2004] and Polygraph [Newsome et al., 2005], developed at Carnegie Mellon University.
There are other approaches to detect early spreading of worms, such as employing a "honeypot." A honeypot [Honeypot, 2006] is a closely monitored decoy computer that attracts attacks for early detection and in-depth adversary analysis. Honeypots are designed not to send out email in normal situations. If a honeypot begins to send out emails after running the attachment of an email, it is determined that this email is an email worm.
Another approach, by [Sidiroglou et al., 2005], employs behavior-based anomaly detection, which is different from signature-based or statistical approaches. Their approach is to open all suspicious attachments inside an instrumented virtual machine, looking for dangerous actions such as writing to the Windows registry, and to flag suspicious messages.
Our work is related to that of [Martin et al., 2005-a]. They report an experiment with email data in which they apply a statistical approach to find an optimum subset of a large set of features to facilitate the classification of outgoing emails and eventually detect novel email worms. However, our approach differs from theirs in that we apply PCA and TPS to reduce noise and redundancy in the data.
6.4 Overview of Our Approach
We apply a feature-based approach to worm detection. A number of features of email messages have been identified in [Martin et al., 2005-a] and are discussed in this chapter. The total number of features is large, and some of them may be redundant or noisy, so we apply two different feature-reduction techniques: a dimension-reduction technique called PCA, and our novel feature-selection technique called TPS, which applies a decision tree and greedy elimination. These features are used to train a classifier to obtain a classification model. We use three different classifiers for this task: SVM, NB, and a combination of SVM and NB, mentioned henceforth as the Series classifier. The Series approach was first proposed by [Martin et al., 2005-b].
We use the dataset of [Martin et al., 2005-a] for evaluation purposes. The original data distribution was unbalanced, so we balance it by rearranging. We divide the dataset into two disjoint subsets: the known worms set, or K-Set, and the novel worms set, or N-Set. The K-Set contains some clean emails and emails infected by five different types of worms. The N-Set contains emails infected by a sixth type of worm but no clean emails. We run a threefold cross validation on the K-Set. At each iteration of the cross validation, we test the accuracy of the trained classifiers on the N-Set. Thus, we obtain two different measures of accuracy, namely, the accuracy of the threefold cross validation on the K-Set and the average accuracy of novel worm detection on the N-Set.
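The evaluation scheme above can be sketched as follows. The nearest-centroid classifier is a deliberately trivial stand-in for SVM/NB, and the toy K-Set and N-Set are invented; what the sketch shows is the split logic: threefold cross validation on the K-Set, with each fold's model also scored on the held-out novel-worm N-Set.

```python
import statistics

def centroid_classifier(train):
    """Train a nearest-centroid stand-in for SVM/NB (illustration only)."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0] * len(x)), x)]
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(x):
        return min(cents, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(cents[y], x)))
    return predict

def threefold_eval(k_set, n_set):
    """3-fold CV on the K-Set; after each fold, also score the
    novel-worm N-Set with the fold's model, then average both."""
    cv_acc, novel_acc = [], []
    for fold in range(3):
        test = k_set[fold::3]
        train = [xy for i, xy in enumerate(k_set) if i % 3 != fold]
        predict = centroid_classifier(train)
        cv_acc.append(sum(predict(x) == y for x, y in test) / len(test))
        novel_acc.append(sum(predict(x) == y for x, y in n_set) / len(n_set))
    return statistics.mean(cv_acc), statistics.mean(novel_acc)

# toy 2-feature data: label 0 = clean, 1 = infected
k_set = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0),
         ([9, 9], 1), ([8, 9], 1), ([9, 8], 1)]
n_set = [([7, 8], 1), ([8, 7], 1)]  # a "sixth", unseen worm type
print(threefold_eval(k_set, n_set))  # (1.0, 1.0)
```

The two returned numbers correspond to the two accuracy measures described in the text: known-worm CV accuracy and average novel-worm detection accuracy.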
Our contributions to this work are as follows. First, we apply two special feature-reduction techniques to remove redundancy and noise from the data. One technique is PCA, and the other is our novel TPS algorithm. PCA is commonly used to extract patterns from high-dimensional data, especially when the data are noisy. It is a simple and nonparametric method. TPS applies the decision tree C4.5 [Quinlan, 1993] for initial selection, and thereafter it applies a greedy elimination technique (see Section 7.4.2, "Two-Phase Feature Selection (TPS)"). Second, we create a balanced dataset, as explained earlier. Finally, we compare the individual performances of NB, SVM, and Series and show empirically that the Series approach proposed by [Martin et al., 2005-b] performs worse than either NB or SVM. Our approach is illustrated in Figure 6.3.
6.5 Summary
In this chapter, we have argued that feature-based approaches for worm detection are superior to the traditional signature-based approaches. Next, we described some related work on email worm detection and then briefly discussed our approach, which uses feature reduction and classification based on PCA, SVM, and NB.
Figure 6.3 Email worm detection using data mining. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
In the future, we are planning to detect worms by combining the feature-based approach with the content-based approach to make detection more robust and efficient. We are also focusing on the statistical properties of the contents of the messages for possible contamination by worms. Our approach is discussed in Chapter 7. Analysis of the results of our approach is given in Chapter 8.
References
[Golbeck and Hendler, 2004] Golbeck, J., and J. Hendler, Reputation Network Analysis for Email Filtering, in Proceedings of CEAS 2004, First Conference on Email and Anti-Spam.
[Honeypot, 2006] Intrusion Detection, Honeypots and Incident Handling Resources, Honeypots.net, http://www.honeypots.net
[Kim and Karp, 2004] Kim, H.-A., and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), San Diego, CA, August 2004, pp. 271–286.
[Martin et al., 2005-a] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21–22, Stanford University, CA.
[Martin et al., 2005-b] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, A Two-Layer Approach for Novel Email Worm Detection, submitted to USENIX Steps on Reducing Unwanted Traffic on the Internet (SRUTI).
[Newman et al., 2002] Newman, M. E. J., S. Forrest, J. Balthrop, Email Networks and the Spread of Computer Viruses, Physical Review E 66, 035101, 2002.
[Newsome et al., 2005] Newsome, J., B. Karp, D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, May 2005.
[Quinlan, 1993] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[Schultz et al., 2001] Schultz, M., E. Eskin, E. Zadok, MEF: Malicious Email Filter, A UNIX Mail Filter That Detects Malicious Windows Executables, in USENIX Annual Technical Conference, FREENIX Track, June 2001.
[Sidiroglou et al., 2005] Sidiroglou, S., J. Ioannidis, A. D. Keromytis, S. J. Stolfo, An Email Worm Vaccine Architecture, in Proceedings of the First International Conference on Information Security Practice and Experience (ISPEC 2005), Singapore, April 11–14, 2005, pp. 97–108.
[Singh et al., 2003] Singh, S., C. Estan, G. Varghese, S. Savage, The EarlyBird System for Real-Time Detection of Unknown Worms, Technical Report CS2003-0761, University of California, San Diego, August 4, 2003.
[Stolfo et al., 2006] Stolfo, S. J., S. Hershkop, C. W. Hu, W. Li, O. Nimeskern, K. Wang, Behavior-Based Modeling and Its Application to Email Analysis, ACM Transactions on Internet Technology (TOIT), February 2006.
[Symantec, 2005] W32.Beagle.BG@mm, http://www.sarc.com/avcenter/venc/data/w32.beagle.bg@mm.html
7
DESIGN OF THE DATA MINING TOOL
7.1 Introduction
As we discussed in Chapter 6, feature-based approaches for worm detection are superior to the traditional signature-based approaches. Our approach for worm detection carries out feature reduction and classification using principal component analysis (PCA), support vector machine (SVM), and Naïve Bayes (NB). In this chapter, we first discuss the features that are used to train classifiers for detecting email worms. Second, we describe our dimension reduction and feature selection techniques. Our proposed two-phase feature selection technique utilizes information gain and a decision tree induction algorithm for feature selection. In the first phase, we build a decision tree using the training data on the whole feature set. The decision tree selects a subset of features, which we call the minimal subset of features. In the second phase, we greedily select additional features and add them to the minimal subset. Finally, we describe the classification techniques, namely Naïve Bayes (NB), support vector machine (SVM), and a combination of NB and SVM.
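The two phases can be sketched as follows. This is an illustrative simplification, not the C4.5-based algorithm of Section 7.4: phase 1 keeps features with positive information gain as a stand-in for the features an induced decision tree would actually test, and phase 2 greedily adds remaining features while a caller-supplied score improves. All data and the scoring function are toy examples.

```python
import math

def info_gain(xs, ys, f):
    """Information gain of binary feature f for the labels ys."""
    def H(labels):
        n, out = len(labels), 0.0
        for c in set(labels):
            p = labels.count(c) / n
            out -= p * math.log2(p)
        return out
    split = {0: [], 1: []}
    for x, y in zip(xs, ys):
        split[x[f]].append(y)
    cond = sum(len(s) / len(ys) * H(s) for s in split.values() if s)
    return H(ys) - cond

def two_phase_select(xs, ys, score, n_extra=1):
    """Phase 1: minimal subset = features with positive gain.
    Phase 2: greedily add remaining features while `score` improves."""
    n = len(xs[0])
    minimal = [f for f in range(n) if info_gain(xs, ys, f) > 1e-12]
    rest = [f for f in range(n) if f not in minimal]
    selected, best = list(minimal), score(minimal)
    for _ in range(n_extra):
        gains = [(score(selected + [f]), f) for f in rest]
        if not gains:
            break
        s, f = max(gains)
        if s > best:
            selected.append(f)
            rest.remove(f)
            best = s
    return sorted(selected)

# toy binary data: feature 0 predicts the label, feature 1 is noise
xs = [[0, 0], [0, 1], [1, 0], [1, 1]]
ys = [0, 0, 1, 1]
subset_score = lambda fs: len(fs) and max(info_gain(xs, ys, f) for f in fs)
print(two_phase_select(xs, ys, score=subset_score))  # [0]
```

On this toy data, the noisy feature 1 has zero gain, so phase 1 selects only feature 0 and phase 2 finds nothing worth adding; the actual algorithm uses classifier accuracy rather than this simple subset score.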
The organization of this chapter is as follows. Our architecture is discussed in Section 7.2. Feature descriptions are given in Section 7.3. Section 7.4 describes feature reduction techniques. Classification techniques are described in Section 7.5. In particular, we provide an overview of the feature selection, dimension reduction, and classification techniques we have used in our tool. The chapter is summarized in Section 7.6. Figure 7.1 illustrates the concepts in this chapter.
Figure 7.1 Concepts in this chapter.
7.2 Architecture

Figure 7.2 illustrates our system architecture, which includes components for feature reduction and classification. The process has two stages: training and classification. Training is performed with collected samples of benign and infected emails, that is, the training data. The training samples are first analyzed, and a set of features is identified (Section 7.3). To reduce the number of features, we apply a feature selection technique called "two-phase feature selection" (Section 7.4). Using the selected set of features, we generate feature vectors for each training sample, and the feature vectors are used to train a classifier (Section 7.5). When a new email needs to be tested, it first goes through a feature extraction module that generates a feature vector. This feature vector is used by the classifier to predict the class of the email, that is, to predict whether the email is clean or infected.
Figure 7.2 Architecture.
7.3 Feature Description

The features are extracted from a repository of outgoing emails collected over a period of two years [Martin et al 2005-a]. These features are categorized into two different groups: per-email features and per-window features. Per-email features are features of a single email, whereas per-window features are features of a collection of emails sent/received within a window of time.
For a detailed description of the features, please refer to [Martin et al 2005-a]. Each of these features is either continuous valued or binary. The value of a binary feature is either 0 or 1, depending on the presence or absence of this feature in a data point. There are a total of 94 features. Here we describe some of them.
7.3.1 Per-Email Features
HTML in body: Whether there is HTML in the email body. This feature is used because a bug in the HTML parser of the email client is a vulnerability that may be exploited by worm writers. It is a binary feature.
Embedded image: Whether there is any embedded image. This is used because a buggy image processor of the email client is also vulnerable to attacks.
Hyperlinks: Whether there are hyperlinks in the email body. Clicking an infected link causes the host to be infected. It is also a binary feature.
Binary attachment: Whether there are any binary attachments. Worms are mainly propagated by binary attachments. This is also a binary feature.
Multipurpose Internet Mail Extension (MIME) type of attachments: There are different MIME types, for example, "application/msword," "application/pdf," "image/gif," "text/plain," and others. Each of these types is used as a binary feature (27 in total).
UNIX "magic number" of file attachments: Sometimes a different MIME type is assigned by the worm writers to evade detection. Magic numbers can accurately detect the MIME type. Each of these types is used as a binary feature (43 in total).
Number of attachments: It is a continuous feature.
Number of words/characters in subject/body: These features are continuous. Most worms choose random text, whereas a user may have certain writing characteristics. Thus these features are sometimes useful for detecting infected emails.
7.3.2 Per-Window Features
Number of emails sent in window: An infected host is supposed to send emails at a faster rate. This is a continuous feature.
Number of unique email recipients/senders: These are also important criteria to distinguish between normal and infected hosts. These are continuous features too.
Average number of words/characters per subject/body, average word length: These features are also useful in distinguishing between normal and viral activity.
Variance in number of words/characters per subject/body, variance in word length: These are also useful properties of email worms.
Ratio of emails to attachments: Usually normal emails do not contain attachments, whereas most infected emails do contain them.
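The per-window features above are simple aggregates over the emails in a window. The sketch below is illustrative only; the dictionary keys ('recipients', 'n_attachments', 'n_body_words') are hypothetical names, not the actual field names used by the tool.

```python
from statistics import mean, pvariance

def per_window_features(emails):
    """Compute a few of the per-window features described above from
    the emails sent within one window of time. Each email is a dict
    with hypothetical keys; the real tool extracts many more fields."""
    n = len(emails)
    words = [e["n_body_words"] for e in emails]
    recipients = {r for e in emails for r in e["recipients"]}
    return {
        "emails_in_window": n,
        "unique_recipients": len(recipients),
        "avg_words_in_body": mean(words),
        "var_words_in_body": pvariance(words),
        "ratio_with_attachment": sum(1 for e in emails if e["n_attachments"] > 0) / n,
    }
```

A classifier would consume these values alongside the per-email features as one combined feature vector.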
7.4 Feature Reduction Techniques

7.4.1 Dimension Reduction
The high dimensionality of data always appears to be a major problem for classification tasks because (a) it increases the running time of the classification algorithms, (b) it increases the chance of overfitting, and (c) a large number of instances is required for learning tasks. We apply principal component analysis (PCA) to obtain a reduced dimensionality of the data in an attempt to eliminate these problems.
PCA finds a reduced set of attributes by projecting the original dimension into a lower dimension. PCA is also capable of discovering hidden patterns in data, thereby increasing classification accuracy. As high-dimensional data contain redundancies and noise, it is much harder for the learning algorithms to find a hypothesis consistent with the training instances. The learned hypothesis is likely to be too complex and susceptible to overfitting. PCA reduces the dimension without losing much information and thus allows the learning algorithms to find a simpler hypothesis that is consistent with the training examples, thereby reducing the chance of overfitting. It should be noted, however, that PCA projects data into a lower dimension in the direction of maximum dispersion. Maximum dispersion of the data does not necessarily imply maximum separation of between-class data and/or maximum concentration of within-class data. If this is the case, then PCA reduction may result in poor performance.
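As an illustration of the projection step, the following minimal PCA sketch (written with NumPy rather than the MATLAB implementation mentioned in Chapter 8) projects feature vectors onto their top-k principal components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project data X (n_samples x n_features) onto its top-k
    principal components; a minimal PCA sketch."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions
    return Xc @ top                          # reduced representation
```

For example, reducing the 94-dimensional worm feature vectors to 25 dimensions would be `pca_reduce(X, 25)`; the first component always carries the most variance, which is exactly the "maximum dispersion" property discussed above.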
7.4.2 Two-Phase Feature Selection (TPS)
Feature selection is different from dimension reduction because it selects a subset of the feature set rather than projecting a combination of features onto a lower dimension. We apply a two-phase feature selection (TPS) process. In Phase I, we build a decision tree from the training data and select the features found at the internal nodes of the tree. In Phase II, we apply a greedy selection algorithm. We combine these two selection processes for the following reasons. Decision tree selection is fast, but the selected features may not be a good choice for a novel dataset; that is, the selected features may not perform well on novel data because the novel data may have a different set of important features. We observed this when we applied a decision tree to the Mydoom.M and VBS.BubbleBoy datasets. That is why we apply another phase of selection, the greedy selection, on top of decision tree selection. Our goal is to determine whether there is a more general feature set that covers all important features. In our experiments, we are able to find such a feature set using greedy selection. There are two reasons why we do not apply only greedy selection. First, it is very slow compared to decision tree selection, because at each iteration we have to modify the data to keep only the selected features and run the classifiers to compute the accuracy. Second, the greedy elimination process may lead to a set of features that is inferior to the decision tree-selected set of features. That is why we keep the decision tree-selected features as the minimal feature set.
7.4.2.1 Phase I. We apply a decision tree as a feature selection tool in Phase I. The main reason for applying a decision tree is that it selects the best attributes according to information gain. Information gain is a very effective metric in selecting features. Information gain can be defined as a measure of the effectiveness of an attribute (i.e., feature) in classifying the training data [Mitchell 1997]. If we split the training data on these attribute values, then information gain gives the measurement of the expected reduction in entropy after the split. The more an attribute can reduce entropy in the training data, the better the attribute is in classifying the data. The information gain of a binary attribute A on a collection of examples S is given by Eq. 7.1:

Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)   (7.1)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v. In our case, each binary attribute has only two possible values (0, 1). The entropy of a subset S is computed using the following equation:

Entropy(S) = −(p(S) / |S|) log2(p(S) / |S|) − (n(S) / |S|) log2(n(S) / |S|)

where p(S) is the number of positive examples in S, n(S) is the total number of negative examples in S, and |S| = p(S) + n(S). Computation of the information gain of a continuous attribute is a little tricky because it has an infinite number of possible values. One approach, followed by [Quinlan 1993], is to find an optimal threshold and split the data into two halves. The optimal threshold is found by searching for the threshold value with the highest information gain within the range of values of this attribute in the dataset.
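The entropy and information gain computations for a binary attribute can be sketched as follows; this is an illustrative implementation of Eq. 7.1, not the J48 code used in the tool:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def info_gain(split_counts):
    """Information gain of a binary attribute (Eq. 7.1).

    `split_counts` maps each attribute value (0 or 1) to a (pos, neg)
    tuple of example counts in that branch of the split."""
    total = sum(p + n for p, n in split_counts.values())
    pos = sum(p for p, _ in split_counts.values())
    neg = sum(n for _, n in split_counts.values())
    gain = entropy(pos, neg)
    for p, n in split_counts.values():
        gain -= (p + n) / total * entropy(p, n)
    return gain
```

A split that perfectly separates the classes has gain equal to the parent entropy, while an uninformative split has gain 0, which is why the tree induction prefers high-gain attributes at each node.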
We use J48, an implementation of C4.5, for building the decision tree. Decision tree algorithms choose the best attribute based on the information gain criterion at each level of recursion. Thus, the final tree actually consists of the most important attributes that can distinguish between the positive and negative instances. The tree is further pruned to reduce the chance of overfitting. Thus, we are able to identify the features that are necessary and the features that are redundant, and use only the necessary features. Surprisingly enough, in our experiments we find that on average only 4.5 features are selected by the decision tree algorithm, and the total number of nodes in the tree is only 11. This indicates that only a few features are important. We have six different datasets for six different worm types. Each dataset is again divided into two subsets: the known worm set, or K-Set, and the novel worm set, or N-Set. We apply threefold cross validation on the K-Set.
7.4.2.2 Phase II. In the second phase, we apply a greedy algorithm to select the best subset of features. We use the feature subset selected in Phase I as the minimal subset (MS). At the beginning of the algorithm, we select all the features from the original set and call it the potential feature set (PFS). At each iteration of the algorithm, we compute the average novel detection accuracy over the six datasets using PFS as the feature set. Then we pick a feature at random from the PFS that is not in MS and eliminate it from the PFS if the elimination does not reduce the novel detection accuracy of any classifier (NB, SVM, Series). If the accuracy drops after elimination, then we do not eliminate the feature, and we add it to MS. In this way, we reduce the PFS and continue until no further elimination is possible. The PFS then contains the most effective subset of features. Although this process is time consuming, we finally come up with a subset of features that can outperform the original set.
Algorithm 7.1 sketches the two-phase feature selection process. At line 2, the decision tree is built using the original feature set FS and the unreduced dataset DFS. At line 3, the set of features selected by the decision tree is stored in the minimal subset MS. Then the potential subset PFS is initialized to the original set FS. Line 5 computes the average novel detection accuracy of the three classifiers. The functions NB-Acc(PFS, DPFS), SVM-Acc(PFS, DPFS), and Series-Acc(PFS, DPFS) return the average novel detection accuracy of NB, SVM, and Series, respectively, using PFS as the feature set.
Algorithm 7.1 Two-Phase Feature Selection

1. Two-Phase-Selection (FS, DFS) returns FeatureSet
   // FS: original set of features
   // DFS: original dataset with FS as the feature set
2. T ← Build-Decision-Tree(FS, DFS)
3. MS ← Feature-Set(T) // minimal subset of features
4. PFS ← FS // potential subset of features
   // compute novel detection accuracy of FS
5. pavg ← (NB-Acc(PFS, DPFS) + SVM-Acc(PFS, DPFS) + Series-Acc(PFS, DPFS)) / 3
6. while PFS ≠ MS do
7.   X ← a randomly chosen feature from PFS that is not in MS
8.   PFS ← PFS − {X}
     // compute novel detection accuracy of PFS
9.   Cavg ← (NB-Acc(PFS, DPFS) + SVM-Acc(PFS, DPFS) + Series-Acc(PFS, DPFS)) / 3
10.  if Cavg ≥ pavg
11.    pavg ← Cavg
12.  else
13.    PFS ← PFS ∪ {X}
14.    MS ← MS ∪ {X}
15.  end if
16. end while
17. return PFS
In the while loop, we randomly choose a feature X such that X ∈ PFS but X ∉ MS and delete it from PFS. The accuracy of the new PFS is calculated. If, after deletion, the accuracy increases or remains the same, then X is redundant, so we remove this feature permanently. Otherwise, if the accuracy drops after deletion, then this feature is essential, so we add it to the minimal set MS (lines 13 and 14). In this way, we either delete a redundant feature or add it to the minimal selection. This is repeated until we have nothing more to select (i.e., MS equals PFS). We return the PFS as the best feature set.
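The greedy elimination loop above can be sketched in Python as follows. The `avg_novel_acc` callable is a stand-in for the expensive train-and-evaluate step over NB, SVM, and Series on the six datasets; its name and signature are hypothetical.

```python
import random

def two_phase_selection(all_features, tree_features, avg_novel_acc, seed=0):
    """Phase II greedy elimination, a sketch of Algorithm 7.1.

    `tree_features`: features chosen by the Phase I decision tree (MS).
    `avg_novel_acc(features)`: returns the average novel detection
    accuracy of the classifiers for the given feature set."""
    rnd = random.Random(seed)
    ms = set(tree_features)        # minimal subset
    pfs = set(all_features)        # potential feature set
    p_avg = avg_novel_acc(pfs)
    while pfs != ms:
        x = rnd.choice(sorted(pfs - ms))   # candidate to eliminate
        pfs.discard(x)
        c_avg = avg_novel_acc(pfs)
        if c_avg >= p_avg:
            p_avg = c_avg          # x was redundant: keep it out
        else:
            pfs.add(x)             # x is essential: restore it
            ms.add(x)              # and pin it in the minimal set
    return pfs
```

Since every iteration either discards a feature permanently or pins it into MS, the loop terminates after at most |FS − MS| evaluations of `avg_novel_acc`.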
7.5 Classification Techniques

Classification is a supervised data mining technique in which a data mining model is first trained with some "ground truth," that is, training data. Each instance (or data point) in the training data is represented as a vector of features, and each training instance is associated with a "class label." The data mining model trained from the training data is called a "classification model," which can be represented as a function f(x): feature vector → class label. This function approximates the feature vector-class label mapping from the training data.
When a test instance with an unknown class label is passed to the classification model, it predicts (i.e., outputs) a class label for the test instance. The accuracy of a classifier is determined by how many unknown instances (instances that were not in the training data) it can classify correctly.
We apply the NB [John and Langley 1995], SVM [Boser et al 1992], and C4.5 decision tree [Quinlan 1993] classifiers in our experiments. We also apply our implementation of the Series classifier [Martin et al 2005-b] to compare its performance with the other classifiers. We briefly describe the Series approach here for the purpose of self-containment.
NB assumes that features are independent of one another. With this assumption, the probability that an instance x = (x1, x2, …, xn) is in class c (c ∈ {1, …, C}) is

P(c | x) ∝ P(c) Πj P(Xj = xj | c)

where xj is the value of the j-th feature of the instance x, P(c) is the prior probability of class c, and P(Xj = xj | c) is the conditional probability that the j-th attribute has the value xj given class c.

So the NB classifier outputs the following class:

c* = argmaxc P(c) Πj P(Xj = xj | c)
NB treats discrete and continuous attributes differently. For each discrete attribute, P(X = x | c) is modeled by a single real number between 0 and 1, which represents the probability that the attribute X will take on the particular value x when the class is c. In contrast, each numeric (or real) attribute is modeled by some continuous probability distribution over the range of that attribute's values. A common assumption, not intrinsic to the NB approach but often made nevertheless, is that within each class the values of numeric attributes are normally distributed. One can represent such a distribution in terms of its mean and standard deviation, and one can efficiently compute the probability of an observed value from such estimates. For continuous attributes we can write

P(X = x | c) = (1 / (√(2π) σc)) exp(−(x − μc)² / (2σc²))

where μc and σc are the mean and standard deviation of the attribute's values within class c.
Smoothing (the m-estimate) is used in NB. We have used the values m = 100 and p = 0.5 while calculating the probability

P(X = x | c) = (nc + m·p) / (n + m)

where nc is the total number of instances for which X = x given class c, and n is the total number of instances for which X = x.
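The two NB likelihood estimates described above (the normal-density model for continuous attributes and the m-estimate for discrete ones) can be sketched as:

```python
from math import exp, pi, sqrt

def gaussian_likelihood(x, mu, sigma):
    """P(X = x | c) for a continuous attribute, assuming the values of
    the attribute within class c are normally distributed with mean
    `mu` and standard deviation `sigma`."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def m_estimate(nc, n, m=100, p=0.5):
    """m-estimate smoothing for a discrete attribute, with the values
    m = 100 and p = 0.5 used in this chapter; nc and n are the counts
    defined in the text."""
    return (nc + m * p) / (n + m)
```

Note how the m-estimate falls back to the prior p when there are no observed instances at all, which prevents zero probabilities from annihilating the NB product.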
SVM can perform either linear or non-linear classification. The linear classifier proposed by [Boser et al 1992] creates a hyperplane that separates the data into two classes with the maximum margin. Given positive and negative training examples, a maximum-margin hyperplane is identified that splits the training examples such that the distance between the hyperplane and the closest examples is maximized. The non-linear SVM is implemented by applying the kernel trick to maximum-margin hyperplanes. The feature space is transformed into a higher dimensional space where the maximum-margin hyperplane is found. This hyperplane may be non-linear in the original feature space. A linear SVM is illustrated in Figure 7.3. The circles are negative instances, and the squares are positive instances. A hyperplane (the bold line) separates the positive instances from the negative ones. All of the instances are at least at a minimal distance (the margin) from the hyperplane. The points that are at a distance exactly equal to the margin from the hyperplane are called the support vectors. As mentioned earlier, the SVM finds the hyperplane that has the maximum margin among all hyperplanes that can separate the instances.
Figure 7.3 Illustration of support vectors and margin of a linear SVM.
In our experiments, we have used the SVM implementation provided at [Chang and Lin 2006]. We also implement the Series, or "two-layer approach," proposed by [Martin et al 2005-b] as a baseline technique. The Series approach works as follows. In the first layer, SVM is applied as a novelty detector. The parameters of SVM are chosen such that it produces almost zero false positives. This means that if SVM classifies an email as infected, then with (almost) 100% probability it is an infected email. If, on the other hand, SVM classifies an email as clean, then it is sent to the second layer for further verification. This is because, with the previously mentioned parameter settings, while SVM reduces the false positive rate, it also increases the false negative rate; so any email classified as negative must be further verified. In the second layer, the NB classifier is applied to confirm whether the suspected emails are really infected. If NB classifies an email as infected, then it is marked as infected; otherwise, it is marked as clean. Figure 7.4 illustrates the Series approach.
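The two-layer decision logic can be sketched independently of any particular SVM or NB implementation. The `predict` interface below is a hypothetical simplification: each layer is any object whose `predict` returns 1 (infected) or 0 (clean).

```python
class SeriesClassifier:
    """Sketch of the Series approach: an SVM tuned for a near-zero
    false positive rate screens first; anything it calls clean is
    re-checked by an NB classifier."""

    def __init__(self, svm, nb):
        self.svm = svm
        self.nb = nb

    def predict(self, x):
        if self.svm.predict(x) == 1:
            return 1                    # layer 1 positives are trusted
        return self.nb.predict(x)       # layer 2 verifies the negatives
```

An email is marked clean only when both layers agree it is clean, which is how the cascade recovers the false negatives that the conservatively tuned SVM would otherwise let through.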
7.6 Summary

In this chapter, we have described the design and implementation of the data mining tools for email worm detection. As we have stated, feature-based methods are superior to signature-based methods for worm detection. Our approach is based on feature extraction. We reduce the dimension of the features by using PCA and then use classification techniques based on SVM and NB for detecting worms. In Chapter 8, we discuss the experiments we carried out and analyze the results obtained.
Figure 7.4 Series combination of SVM and NB classifiers for email worm detection. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
As stated in Chapter 6, as future work we are planning to detect worms by combining the feature-based approach with the content-based approach to make detection more robust and efficient. We will also focus on the statistical properties of the contents of the messages for possible contamination by worms. In addition, we will apply other classification techniques and compare the performance and accuracy of the results.
References

[Boser et al 1992] Boser, B. E., I. M. Guyon, V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler, editor, 5th Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, 1992, pp. 144–152.

[Chang and Lin 2006] Chang, C.-C., and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm

[John and Langley 1995] John, G. H., and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, 1995, pp. 338–345.

[Martin et al 2005-a] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21–22, Stanford University, CA.

[Martin et al 2005-b] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, A Two-Layer Approach for Novel Email Worm Detection, submitted to USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI).

[Mitchell 1997] Mitchell, T., Machine Learning, McGraw-Hill, 1997.

[Quinlan 1993] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
8
EVALUATION AND RESULTS
8.1 Introduction

In Chapter 6 we described email worm detection, and in Chapter 7 we described our data mining tool for email worm detection. In this chapter, we describe the datasets, the experimental setup, and the results of our proposed approach and other baseline techniques.
The dataset contains a collection of 1600 clean and 1200 viral emails, which are divided into six different evaluation sets (Section 8.2). The original feature set contains 94 features. The evaluation compares our two-phase feature selection technique with two other approaches, namely, dimension reduction using PCA and no feature selection or reduction. The performance of three different classifiers has been evaluated on these feature spaces, namely, NB, SVM, and the Series approach (see Table 8.8 for a summary). Therefore, there are nine different combinations of feature set-classifier pairs, such as two-phase feature selection + NB, no feature selection + NB, two-phase feature selection + SVM, and so on. In addition, we compute three different metrics on these datasets for each feature set-classifier pair: classification accuracy, false positive rate, and accuracy in detecting a new type of worm.
The organization of this chapter is as follows. In Section 8.2, we describe the distribution of the datasets used. In Section 8.3, we discuss the experimental setup, including hardware, software, and system parameters. In Section 8.4, we discuss the results obtained from the experiments. The chapter is summarized in Section 8.5. Concepts in this chapter are illustrated in Figure 8.1.
Figure 8.1 Concepts in this chapter.
8.2 Dataset

We have used the worm dataset collected by [Martin et al 2005]. They accumulated several hundred clean and worm emails over a period of two years. All of these emails are outgoing emails. Several features are extracted from these emails, as explained in Section 7.3 ("Feature Description").
There are six types of worms contained in the dataset: VBS.BubbleBoy, W32.Mydoom.M, W32.Sobig.F, W32.Netsky.D, W32.Mydoom.U, and W32.Bagle.F. However, the classification task is binary: clean/infected. The original dataset contains six training and six test sets. Each training set is made up of 400 clean emails and 1000 infected emails, consisting of 200 samples from each of five different worms. The sixth virus is then included in the test set, which contains 1200 clean emails and 200 infected messages. Table 8.1 clarifies this distribution. For ease of representation, we abbreviate the worm names as follows:
• B: VBS.BubbleBoy
• F: W32.Bagle.F
• M: W32.Mydoom.M
• N: W32.Netsky.D
• S: W32.Sobig.F
• U: W32.Mydoom.U
NB, SVM, and the Series classifiers are applied to the original data, the PCA-reduced data, and the TPS-selected data. The decision tree is applied to the original data only.
Table 8.1 Data Distribution from the Original Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
We can easily notice that the original dataset is unbalanced, because the ratio of clean emails to infected emails is 2:5 in the training set, whereas it is 5:1 in the test set. So the results obtained from this dataset may not be reliable. We make it balanced by redistributing the examples. In our distribution, each balanced set contains two subsets. The known worm set, or K-Set, contains 1600 clean email messages, which are the combination of all the clean messages in the original dataset (400 from the training set, 1200 from the test set). The K-Set also contains 1000 infected messages with five types of worms, marked as the "known worms." The N-Set contains 200 infected messages of a sixth type of worm, marked as the "novel worm." Then we apply cross validation on the K-Set. The cross validation is done as follows. We randomly divide the set of 2600 (1600 clean + 1000 viral) messages into three equal-sized subsets such that the ratio of clean messages to viral messages remains the same in all subsets. We take two subsets as the training set and the remaining subset as the test set. This is done three times by rotating the testing and training sets, and we take the average accuracy of the three runs. This accuracy is shown under the column "Acc" in Tables 8.3, 8.5, and 8.6. In addition to testing the accuracy on the test set, we also test the detection accuracy of each of the three learned classifiers on the N-Set and take the average. This accuracy is also averaged over all runs and shown as the novel detection accuracy. Table 8.2 displays the data distribution of our dataset.
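The stratified threefold split described above can be sketched as follows; a simplified illustration that operates on message identifiers rather than the feature vectors the actual tool uses:

```python
import random

def stratified_threefold(clean, viral, seed=0):
    """Divide clean and viral messages into three (nearly) equal folds
    so that the clean:viral ratio is the same in every fold, as in the
    cross validation described above. Returns three lists of
    (message, is_viral) pairs."""
    rnd = random.Random(seed)
    folds = [[] for _ in range(3)]
    for items, is_viral in ((list(clean), False), (list(viral), True)):
        rnd.shuffle(items)                    # shuffle within each class
        for i, msg in enumerate(items):
            folds[i % 3].append((msg, is_viral))  # deal round-robin
    return folds
```

Each run then takes two folds as training data and the remaining fold as test data, rotating three times and averaging the resulting accuracies.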
Table 8.2 Data Distribution from the Redistributed Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
8.3 Experimental Setup

In this section, we describe the experimental setup, including a discussion of the hardware and software utilized. We ran all our experiments on a Windows XP machine with Java version 1.5 installed. For running SVM, we use the LIBSVM package [Chang and Lin 2006].
We use our own C++ implementation of NB. We implement PCA with MATLAB. We use the WEKA machine learning tool [Weka 2006] for the decision tree, with pruning applied.
Parameter settings: The parameter settings for LIBSVM are as follows: the classifier type is C-Support Vector Classification (C-SVC); the kernel is chosen to be the radial basis function (RBF); the values "gamma" = 0.2 and "C" = 1 are chosen.
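For readers reproducing these settings, an equivalent configuration can be written with scikit-learn, whose SVC class wraps LIBSVM's C-SVC implementation; the synthetic data below is purely illustrative and not the worm dataset:

```python
# Sketch of the LIBSVM parameter settings above, using scikit-learn's
# sklearn.svm.SVC (a wrapper around LIBSVM's C-SVC) as a stand-in.
import numpy as np
from sklearn.svm import SVC

# C-SVC with an RBF kernel, gamma = 0.2, C = 1.
clf = SVC(C=1.0, kernel="rbf", gamma=0.2)

# Tiny synthetic two-class problem, standing in for the feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)),   # "clean" cluster
               rng.normal(3.0, 1.0, (20, 5))])  # "infected" cluster
y = np.array([0] * 20 + [1] * 20)
clf.fit(X, y)
```

The equivalent LIBSVM command-line flags would be `-s 0 -t 2 -g 0.2 -c 1`.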
Baseline techniques: We compare our TPS technique with two different feature selection/reduction techniques. Therefore, the competing techniques are the following:
TPS: This is our two-phase feature selection technique.
PCA: Here we reduce the dimension using PCA. With PCA, we reduce the dimension size to 5, 10, 15, …, 90, 94. That is, we vary the target dimension from 5 to 94 with step 5 increments.
No reduction (unreduced): Here the full feature set is used.
Each of these feature sets is used to train three different classifiers, namely NB, SVM, and Series. The decision tree is also trained with the unreduced feature set.
8.4 Results

We discuss the results in three separate subsections. In subsection 8.4.1, we discuss the results obtained from the unreduced data, that is, the data before any reduction or selection is applied. In subsection 8.4.2, we discuss the results obtained from the PCA-reduced data, and in subsection 8.4.3, we discuss the results obtained using the TPS-reduced data.
8.4.1 Results from Unreduced Data
Table 8.3 reports the cross validation accuracy and false positive rate for each set. The cross validation accuracy is shown under the column Acc, and the false positive rate is shown under the column FP. The set names in the row headings are the abbreviated names explained in the "Dataset" section. From the results reported in Table 8.3, we see that SVM achieves the best accuracy among all classifiers, although the difference from the other classifiers is small.
Table 8.4 reports the accuracy of detecting novel worms. We see that SVM is very consistent over all sets, but NB, Series, and the decision tree perform significantly worse on the Mydoom.M dataset.
8.4.2 Results from PCA-Reduced Data
Figure 8.2 shows the results of applying PCA to the original data. The X axis denotes the dimension of the reduced-dimensional data, which has been varied from 5 to 90 with step 5 increments. The last point on the X axis is the unreduced, or original, dimension. Figure 8.2 shows the cross validation accuracy for different dimensions. The chart should be read as follows: a point (x, y) on a given line, say the line for SVM, indicates the cross validation accuracy y of SVM, averaged over all six datasets, where each dataset has been reduced to x dimensions using PCA.
Table 8.3 Comparison of Accuracy (%) and False Positive Rate (%) of Different Classifiers on the Worm Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Table 8.4 Comparison of Novel Detection Accuracy (%) of Different Classifiers on the Worm Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Figure 8.2 Average cross validation accuracy of the three classifiers on lower dimensional data reduced by PCA. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
Figure 8.2 indicates that at lower dimensions, cross validation accuracy is lower for each of the three classifiers. However, SVM achieves its near-maximum accuracy at dimension 30, and NB and Series reach within 2% of their maximum accuracy at dimension 30 and onward. All classifiers attain their maximum at the highest dimension, 94, which is actually the unreduced data. From this observation we may conclude that PCA is not effective on this dataset in terms of cross validation accuracy. The reason behind this poorer performance on the reduced-dimensional data is possibly the one we mentioned earlier in the subsection "Dimension Reduction": the reduction by PCA is not producing lower dimensional data in which dissimilar class instances are maximally dispersed and similar class instances are maximally concentrated. So the classification accuracy is lower at lower dimensions.
We now present the results at dimension 25, similar to the results presented in the previous subsection. Table 8.5 compares the novel detection accuracy and cross validation accuracy of the different classifiers. We chose this particular dimension because at this dimension all the classifiers seem to be the most balanced in all aspects: cross validation accuracy, false positive and false negative rates, and novel detection accuracy. We conclude that this dimension is the optimal dimension for projection by PCA. From Table 8.5, it is evident that the accuracies of all three classifiers on the PCA-reduced data are lower than their accuracies on the unreduced data. It is possible that some information useful for classification was lost during projection onto a lower dimension.
Table 8.5 Comparison of Cross Validation Accuracy (Acc) and Novel Detection Accuracy (NAcc) among Different Classifiers on the PCA-Reduced Worm Dataset at Dimension 25
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
We see in Table 8.5 that both the accuracy and the novel detection accuracy of NB have dropped significantly from the original dataset. The novel detection accuracy of NB on the Mydoom.M dataset has become 0%, compared to 17.4% in the original set. The novel detection accuracy of SVM on the same dataset has dropped to 30%, compared to 92.4% in the original dataset. So we can conclude that PCA reduction does not help in novel detection.
8.4.3 Results from Two-Phase Selection
Our TPS selects the following features (in no particular order):
Attachment type: binary
MIME (magic) type of attachment: application/msdownload
MIME (magic) type of attachment: application/x-ms-dos-executable
Frequency of emails sent in window
Mean words in body
Mean characters in subject
Number of attachments
Number of From Addresses in window
Ratio of emails with attachments
Variance of attachment size
Variance of words in body
Number of HTML in email
Number of links in email
Number of To Addresses in window
Variance of characters in subject
The first three features actually reflect important characteristics of an infected email. Usually infected emails have a binary attachment, which is a DOS/Windows executable.
Mean/variance of words in body and characters in subject are also considered important symptoms, because infected emails usually contain a random subject or body and thus have an irregular size of body or subject. Number of attachments, ratio of emails with attachments, and number of links in email are usually higher for infected emails. Frequency of emails sent in window and number of To Addresses in window are higher for an infected host, as a compromised host sends infected emails to many addresses and more frequently. Thus, most of the features selected by our algorithm are really practical and useful.
Table 8.6 reports the cross validation accuracy (%) and false positive rate (%) of the three classifiers on the TPS-reduced dataset. We see that both the accuracy and false positive rates are almost the same as for the unreduced dataset. The accuracy on the Mydoom.M dataset (shown at row M) is 99.3% for NB, 99.5% for SVM, and 99.4% for Series. Table 8.7 reports the novel detection accuracy (%) of the three classifiers on the TPS-reduced dataset. We find that the average novel detection accuracy on the TPS-reduced dataset is higher than that of the unreduced dataset. The main reason behind this improvement is the higher accuracy on the Mydoom.M set by NB and Series. The accuracy of NB on this dataset is 37.1% (row M), compared to 17.4% on the unreduced dataset (see Table 8.4, row M). Also, the accuracy of Series on the same is 36.0%, compared to 16.6% on the unreduced dataset (as shown in Table 8.4, row M). However, the accuracy of SVM remains almost the same: 91.7%, compared to 92.4% on the unreduced dataset. In Table 8.8 we summarize the averages from Tables 8.3 through 8.7.
Table 8.6 Cross Validation Accuracy (%) and False Positive (%) of Three Different Classifiers on the TPS-Reduced Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Table 8.7 Comparison of Novel Detection Accuracy (%) of Different Classifiers on the TPS-Reduced Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
The first three rows (after the header row) report the cross validation accuracy of all four classifiers that we have used in our experiments. Each row reports the average accuracy on a particular dataset. The first row reports the average accuracy for the unreduced dataset, the second row reports the same for the PCA-reduced dataset, and the third row for the TPS-reduced dataset. We see that the average accuracies are almost the same for the TPS-reduced and the unreduced sets. For example, the average accuracy of NB (shown under column NB) is the same for both, which is 99.2%; the accuracy of SVM (shown under column SVM) is also the same, 99.5%. The average accuracies of these classifiers on the PCA-reduced dataset are 1% to 2% lower. There is no entry under the decision tree column for the PCA-reduced and TPS-reduced datasets because we only test the decision tree on the unreduced dataset.
Table 8.8 Summary of Results (Averages) Obtained from Different Feature-Based Approaches
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
The middle three rows report the average false positive values, and the last three rows report the average novel detection accuracies. We see that the average novel detection accuracy on the TPS-reduced dataset is the highest of all. The average novel detection accuracy of NB on this dataset is 86.7%, compared to 83.6% on the unreduced dataset, which is a 3.1% improvement on average. Also, Series has a novel detection accuracy of 86.3% on the TPS-reduced dataset, compared to 83.1% on the unreduced dataset. Again, it is a 3.2% improvement on average. However, the average accuracy of SVM remains almost the same (only 0.1% difference) on these two datasets. Thus, on average, we have an improvement in novel detection accuracy across
different classifiers on the TPS-reduced dataset. While the TPS-reduced dataset is the best among the three, the best classifier among the four is SVM. It has the highest average accuracy and novel detection accuracy on all datasets, as well as a very low average false positive rate.
8.5 Summary

In this chapter we have discussed the results obtained from testing our data mining tool for email worm detection. We first discussed the datasets we used and the experimental setup. Then we described the results we obtained. We have two important findings from our experiments. First, SVM has the best performance among all four different classifiers: NB, SVM, Series, and decision tree. Second, feature selection using our TPS algorithm achieves the best accuracy, especially in detecting novel worms. Combining these two findings, we conclude that SVM with TPS reduction should work as the best novel worm detection tool on a feature-based dataset.
In the future we would like to extend our work to content-based detection of the email worm by extracting binary-level features from the emails. We would also like to apply other classifiers to the detection task.
References

[Chang and Lin 2006] Chang, C.-C., and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Martin et al. 2005] Martin, S., A. Sewani, B. Nelson, K. Chen, and A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21 & 22, Stanford University, CA.
[Weka 2006] Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/~ml/weka
Conclusion to Part II
In this part we discussed our proposed data mining technique to detect email worms. Different features, such as total number of words in message body/subject, presence/absence of binary attachments, types of attachments, and others, are extracted from the emails. Then the number of features is reduced using a Two-Phase Selection (TPS) technique, which is a novel combination of a decision tree and a greedy selection algorithm. We have used different classification techniques, such as Support Vector Machine (SVM), Naïve Bayes, and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. Compared to the baseline approaches, our proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.
In the future we would like to apply our technique to a larger corpus of emails and optimize the feature extraction and selection techniques to make them more scalable to large datasets.
PART III
DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
Introduction to Part III

We present a scalable and multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three different kinds of features at different levels of abstraction. These are binary n-grams, assembly instruction sequences, and dynamic link library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply this technique to a large corpus of real benign and malicious executables. The previously mentioned features are extracted from the corpus data, and a classifier is trained, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge based for several reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. Our model is compared against other feature-based approaches for malicious code detection and found to be
more efficient in terms of detection accuracy and false alarm rate.
Part III consists of three chapters: 9, 10, and 11. Chapter 9 describes our approach to detecting malicious executables. Chapter 10 describes the design and implementation of our data mining tools. Chapter 11 describes our evaluation and results.
9
MALICIOUS EXECUTABLES
9.1 Introduction

Malicious code is a great threat to computers and computer society. Numerous kinds of malicious code wander in the wild. Some of them are mobile, such as worms, and spread through the Internet, causing damage to millions of computers worldwide. Other kinds of malicious code are static, such as viruses, but sometimes deadlier than their mobile counterparts. Malicious code writers usually exploit software vulnerabilities to attack host machines. A number of techniques have been devised by researchers to counter these attacks. Unfortunately, the more successful the researchers become in detecting and preventing the attacks, the more sophisticated the malicious code in the wild appears. Thus, the battle between malicious code writers and researchers is virtually never ending.
One popular technique followed by the antivirus community to detect malicious code is "signature detection." This technique matches the executables against a unique telltale string or byte pattern called a signature, which is used as an identifier for a particular malicious code. Although signature detection techniques are widely used, they are not effective against zero-day attacks (new malicious code), polymorphic attacks (different encryptions of the same
binary), or metamorphic attacks (different code for the same functionality). So there has been a growing need for fast, automated, and efficient detection techniques that are robust to these attacks. As a result, many automated systems [Golbeck and Hendler 2004], [Kolter and Maloof 2004], [Newman et al. 2002], [Newsome et al. 2005] have been developed.
In this chapter we describe our novel hybrid feature retrieval (HFR) model that can detect malicious executables efficiently [Masud et al. 2007-a], [Masud et al. 2007-b]. The organization of this chapter is as follows. Our architecture is discussed in Section 9.2. Related work is given in Section 9.3. Our approach is discussed in Section 9.4. The chapter is summarized in Section 9.5. Figure 9.1 illustrates the concepts in this chapter.
Figure 9.1 Concepts in this chapter.
9.2 Architecture

Figure 9.2 illustrates our architecture for detecting malicious executables. The training data consist of a collection of benign and malicious executables. We extract three different kinds of features (to be explained shortly) from each executable. These extracted features are then analyzed, and only the best discriminative features are selected. Feature vectors are generated from each training instance using the selected feature set. The feature vectors are used to train a classifier. When a new executable needs to be tested, at first the features selected during training are extracted from the executable and a feature vector is generated. This feature vector is classified using the classifier to predict whether it is a benign or malicious executable.
Figure 9.2 Architecture.
In our approach, we extract three different kinds of features from the executables at different levels of abstraction and combine them into one feature set called the hybrid feature set (HFS). These features are used to train a classifier (e.g.,
support vector machine [SVM], decision tree, etc.), which is applied to detect malicious executables. These features are (a) binary n-gram features, (b) derived assembly features (DAFs), and (c) dynamic link library (DLL) call features. Each binary n-gram feature is actually a sequence of n consecutive bytes in a binary executable, extracted using a technique explained in Chapter 10. Binary n-grams reveal the distinguishing byte patterns between benign and malicious executables. Each DAF is a sequence of assembly instructions in an executable and corresponds to one binary n-gram feature. DAFs reveal the distinctive instruction usage patterns between benign and malicious executables. They are extracted from the disassembled executables using our assembly feature retrieval (AFR) algorithm. It should be noted that DAF is different from the assembly n-gram features mentioned in Chapter 10. Assembly n-gram features are not used in HFS because of our finding that DAF performs better than they do. Each DLL call feature actually corresponds to a DLL function call in an executable, extracted from the executable header. These features reveal the distinguishing DLL call patterns between benign and malicious executables. We show empirically that the combination of these three features is always better than any single feature in terms of classification accuracy.
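Conceptually, an HFS feature vector is just a concatenation of presence/absence indicators over the three selected feature lists. The following sketch is illustrative only; the feature values are made up, and the real features are selected by information gain as described in Chapter 10:

```python
def hybrid_feature_vector(exe_ngrams, exe_dafs, exe_dlls,
                          sel_ngrams, sel_dafs, sel_dlls):
    """Build one 0/1 feature vector by concatenating the three feature kinds:
    binary n-grams, derived assembly features, and DLL call features."""
    vector = []
    for selected, present in ((sel_ngrams, exe_ngrams),
                              (sel_dafs, exe_dafs),
                              (sel_dlls, exe_dlls)):
        vector += [1 if f in present else 0 for f in selected]
    return vector

# Hypothetical selected features and one executable's extracted feature sets
sel_ngrams = ["FF210890", "21089000"]
sel_dafs = [("push ebp", "mov ebp,esp")]
sel_dlls = ["KERNEL32.CreateFileA", "WSOCK32.send"]
exe = hybrid_feature_vector({"FF210890"}, set(), {"WSOCK32.send"},
                            sel_ngrams, sel_dafs, sel_dlls)
print(exe)  # [1, 0, 0, 0, 1]
```

The resulting vectors are what the classifier (SVM, decision tree, etc.) is trained on.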
Our work focuses on expanding features at different levels of abstraction rather than using more features at a single level of abstraction. There are two main reasons for this. First, the number of features at a given level of abstraction (e.g., binary) is overwhelmingly large. For example, from our larger dataset we obtain 200 million binary n-gram features. Training with this large number of features is far beyond the capabilities of any practical classifier. That is why we limit
the number of features at a given level of abstraction to an applicable range. Second, we empirically observe the benefit of adding more levels of abstraction to the combined feature set (i.e., HFS). HFS combines features at three levels of abstraction, namely, binary executables, assembly programs, and system API calls. We show that this combination has higher detection accuracy and a lower false alarm rate than the features at any single level of abstraction.
Our technique is related to knowledge management for several reasons. First, we apply our knowledge of binary n-gram features to obtain DAFs. Second, we apply the knowledge obtained from the feature extraction process to select the best features. This is accomplished by extracting all possible binary n-grams from the training data, applying the statistical knowledge corresponding to each n-gram (i.e., its frequency in malicious and benign executables) to compute its information gain [Mitchell 1997], and selecting the best S of them. Finally, we apply another kind of statistical knowledge (presence/absence of a feature in an executable) obtained from the feature extraction process to train classifiers.
Our research contributions are as follows. First, we propose and implement our HFR model, which combines the three kinds of features previously mentioned. Second, we apply a novel idea to extract assembly instruction features using binary n-gram features, implemented with the AFR algorithm. Third, we propose and implement a scalable solution to the n-gram feature extraction and selection problem in general. Our solution works well with limited memory and significantly reduces running time by applying efficient and powerful data structures and algorithms. Thus, it is scalable to a large collection of executables (in the order of thousands)
even with limited main memory and processor speed. Finally, we compare our results against the results of [Kolter and Maloof 2004], who used only the binary n-gram feature, and show that our method achieves better accuracy. We also report the performance/cost trade-off of our method against the method of [Kolter and Maloof 2004]. It should be pointed out here that our main contribution is an efficient feature extraction technique, not a classification technique. We empirically show that the combined feature set (i.e., HFS) extracted using our algorithm performs better than other individual feature sets (such as binary n-grams), regardless of the classifier (e.g., SVM or decision tree) used.
9.3 Related Work

There has been significant research in recent years on detecting malicious executables. There are two mainstream techniques to automate the detection process: behavioral and content based. The behavioral approach is primarily applied to detect mobile malicious code. This technique analyzes network traffic characteristics, such as source-destination ports/IP addresses and various packet-level/flow-level statistics, and application-level characteristics, such as email attachment type and attachment size. Examples of behavioral approaches include social network analysis [Golbeck and Hendler 2004], [Newman et al. 2002] and statistical analysis [Schultz et al. 2001-a]. A data mining-based behavioral approach for detecting email worms has been proposed by [Masud et al. 2007-a]. [Garg et al. 2006] apply the feature extraction technique along with machine learning for masquerade detection. They extract features from user behavior in
GUI-based systems, such as mouse speed, number of clicks per session, and so on. Then the problem is modeled as a binary classification problem and trained and tested with SVM. Our approach is content based rather than behavioral.
The content-based approach analyzes the content of the executable. Some such techniques try to automatically generate signatures from network packet payloads. Examples are EarlyBird [Singh et al. 2003], Autograph [Kim and Karp 2004], and Polygraph [Newsome et al. 2005]. In contrast, our method does not require signature generation or signature matching. Some other content-based techniques extract features from the executables and apply machine learning to detect malicious executables. Examples are given in [Schultz et al. 2001-b] and [Kolter and Maloof 2004]. The work in [Schultz et al. 2001-b] extracts DLL call information using GNU Bin-Utils and character strings using GNU strings from the header of Windows PE executables [Cygnus 1999]. They also use byte sequences as features. We also use byte sequences and DLL call information, but we additionally apply disassembly and use assembly instructions as features. We also extract byte patterns of various lengths (from 2 to 10 bytes), whereas they extract only 2-byte patterns. Similar work is done by [Kolter and Maloof 2004]. They extract binary n-gram features from the binary executables, apply them to different classification methods, and report accuracy. Our model is different from [Kolter and Maloof 2004] in that we extract not only the binary n-grams but also assembly instruction sequences from the disassembled executables, and we gather DLL call information from the program headers. We compare our model's performance only with [Kolter and Maloof 2004] because they report higher accuracy than that given in [Schultz et al. 2001-b].
9.4 Hybrid Feature Retrieval (HFR) Model

Our HFR model is a novel idea in malicious code detection. It extracts useful features from disassembled executables using the information obtained from binary executables. It then combines the assembly features with other features like DLL function calls and binary n-gram features. We have addressed a number of difficult implementation issues and provided efficient, scalable, and practical solutions. The difficulties that we faced during implementation are related to memory limitations and long running times. By using efficient data structures, algorithms, and disk I/O, we are able to implement a fast, scalable, and robust system for malicious code detection. We run our experiments on two datasets with different class distributions and show that a more realistic distribution improves the performance of our model.
Our model also has a few limitations. First, it does not directly handle obfuscated DLL calls or encrypted/packed binaries. There are techniques available for detecting obfuscated DLL calls in the binary [Lakhotia et al. 2005] and for unpacking packed binaries automatically. We may apply these tools for de-obfuscation/decryption and use their output in our model. Although this is not implemented yet, we look forward to integrating these tools with our model in future versions. Second, the current implementation is an offline detection mechanism, which means it cannot be directly deployed on a network to detect malicious code. However, it can detect malicious code in near real time.
We address these issues in our future work and vow to solve these problems. We also propose several modifications to our model. For example, we would like to combine our features with run-time characteristics of the executables. We also propose building a feature database that would store all the features and be updated incrementally. This would save a large amount of training time and memory. Our approach is illustrated in Figure 9.3.
Figure 9.3 Our approach to detecting malicious executables.
9.5 Summary

In this work we have proposed a data mining-based model for malicious code detection. Our technique extracts three different levels of features from executables, namely, binary-level, assembly-level, and API function call-level features. These features then go through a feature selection phase to reduce noise and redundancy in the feature set and generate a manageable-sized set of features. These feature sets are then used to build feature vectors for each training data point. Then a classification model is trained using the training data.
This classification model classifies future instances (i.e., executables) to detect whether they are benign or malicious.
In the future we would like to extend our work in two directions. First, we would like to extract and utilize behavioral features for malware detection. This is because obfuscation against binary patterns may be achieved by polymorphism and metamorphism, but it will be difficult for the malware to obfuscate its behavioral pattern. Second, we would like to make the feature extraction and classification more scalable by applying the cloud computing framework.
References

[Cygnus 1999] GNU Binutils Cygwin, http://sourceware.cygnus.com/cygwin
[Freund and Schapire 1996] Freund, Y., and R. Schapire, Experiments with a New Boosting Algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, 1996, pp. 148–156.
[Garg et al. 2006] Garg, A., R. Rahalkar, S. Upadhyaya, and K. Kwiat, Profiling Users in GUI Based Systems for Masquerade Detection, in Proceedings of the 7th IEEE Information Assurance Workshop (IAWorkshop 2006), IEEE, 2006, pp. 48–54.
[Golbeck and Hendler 2004] Golbeck, J., and J. Hendler, Reputation Network Analysis for Email Filtering, in Proceedings of CEAS 2004, First Conference on Email and Anti-Spam.
[Kim and Karp 2004] Kim, H. A., and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), San Diego, CA, August 2004, pp. 271–286.
[Kolter and Maloof 2004] Kolter, J. Z., and M. A. Maloof, Learning to Detect Malicious Executables in the Wild, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 470–478.
[Lakhotia et al. 2005] Lakhotia, A., E. U. Kumar, and M. Venable, A Method for Detecting Obfuscated Calls in Malicious Binaries, IEEE Transactions on Software Engineering 31(11): 955–968.
[Masud et al. 2007-a] Masud, M. M., L. Khan, and B. Thuraisingham, Feature-Based Techniques for Auto-Detection of Novel Email Worms, in Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'07), Lecture Notes in Computer Science 4426, Springer, 2007, Bangkok, Thailand, pp. 205–216.
[Masud et al. 2007-b] Masud, M. M., L. Khan, and B. Thuraisingham, A Hybrid Model to Detect Malicious Executables, in Proceedings of the IEEE International Conference on Communication (ICC'07), pp. 1443–1448.
[Mitchell 1997] Mitchell, T., Machine Learning, McGraw-Hill.
[Newman et al. 2002] Newman, M. E. J., S. Forrest, and J. Balthrop, Email Networks and the Spread of Computer Viruses, Physical Review E 66(3): 035101-1–035101-4.
[Newsome et al. 2005] Newsome, J., B. Karp, and D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, May 2005, Oakland, CA, pp. 226–241.
[Schultz et al. 2001-a] Schultz, M., E. Eskin, and E. Zadok, MEF: Malicious Email Filter, a UNIX Mail Filter That Detects Malicious Windows Executables, in Proceedings of the FREENIX Track, USENIX Annual Technical Conference, June 2001, Boston, MA, pp. 245–252.
[Schultz et al. 2001-b] Schultz, M., E. Eskin, E. Zadok, and S. Stolfo, Data Mining Methods for Detection of New Malicious Executables, in Proceedings of the IEEE Symposium on Security and Privacy, May 2001, Oakland, CA, pp. 38–49.
[Singh et al. 2003] Singh, S., C. Estan, G. Varghese, and S. Savage, The EarlyBird System for Real-Time Detection of Unknown Worms, Technical Report CS2003-0761, University of California at San Diego (UCSD), August 2003.
10
DESIGN OF THE DATA MINING TOOL
10.1 Introduction

In this chapter we describe our data mining tool for detecting malicious executables. It utilizes a feature extraction technique based on n-gram analysis. We first discuss how we extract binary n-gram features from the executables, and then show how we select the best features using information gain. We also discuss the memory and scalability problems associated with n-gram extraction and selection, and how we solve them. Then we describe how the assembly features and dynamic link library (DLL) call features are extracted. Finally, we describe how we combine these three kinds of features and train a classifier using them.
The organization of this chapter is as follows. Feature extraction using n-gram analysis is given in Section 10.2. The hybrid feature retrieval model is discussed in Section 10.3. The chapter is summarized in Section 10.4. Figure 10.1 illustrates the concepts in this chapter.
10.2 Feature Extraction Using n-Gram Analysis

Before going into the details of the process, we illustrate a code snippet in Figure 10.2 from the email worm "Win32.Ainjoe" and use it as a running example throughout the chapter.
Feature extraction using n-gram analysis involves extracting all possible n-grams from the given dataset (training set) and selecting the best n-grams among them. Each such n-gram is a feature. We extend the notion of n-gram from bytes to assembly instructions and DLL function calls. That is, an n-gram may be a sequence of n bytes, n assembly instructions, or n DLL function calls, depending on whether we extract features from binary executables, assembly programs, or DLL call sequences, respectively. Before extracting n-grams, we preprocess the binary executables by converting them to hexdump files and assembly program files, as explained shortly.
Figure 10.1 Concepts in this chapter.
Figure 10.2 Code snippet and DLL call info from the Email-Worm "Win32.Ainjoe". (From M. Masud, L. Khan, and B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
10.2.1 Binary n-Gram Feature
Here the granularity level is a byte. We apply the UNIX hexdump utility to convert the binary executable files into text files, mentioned henceforth as hexdump files, containing the hexadecimal numbers corresponding to each byte of the binary. This process is performed to ensure safe and easy portability of the binary executables. The feature extraction
process consists of two phases: (1) feature collection and (2) feature selection, both of which are explained in the following subsections.
10.2.2 Feature Collection
We collect binary n-grams from the hexdump files. This is illustrated in Example-I.
Example-I
The 4-grams corresponding to the first 6-byte sequence (FF2108900027) from the executable in Figure 10.2 are the 4-byte sliding windows FF210890, 21089000, and 08900027.
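Example-I can be reproduced with a simple sliding window over the byte sequence of the hexdump. This sketch is ours, not the book's code:

```python
def byte_ngrams(hex_string, n):
    """Slide an n-byte window over a hex string (2 hex digits per byte)."""
    byte_list = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    return ["".join(byte_list[i:i + n]) for i in range(len(byte_list) - n + 1)]

print(byte_ngrams("FF2108900027", 4))
# ['FF210890', '21089000', '08900027']
```

A 6-byte sequence yields exactly 6 − 4 + 1 = 3 such 4-grams, matching the example.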
The basic feature collection process runs as follows. At first, we initialize a list L of n-grams to empty. Then we scan each hexdump file by sliding an n-byte window. Each such n-byte sequence is an n-gram. Each n-gram g is associated with two values, p1 and n1, denoting the total number of positive instances (i.e., malicious executables) and negative instances (i.e., benign executables), respectively, that contain g. If g is not found in L, then g is added to L, and p1 and n1 are updated as necessary. If g is already in L, then only p1 and n1 are updated. When all hexdump files have been scanned, L contains all the unique n-grams in the dataset, along with their frequencies in the positive and negative instances. There are several implementation issues related to this basic approach. First, the total number of n-grams may be very large. For example, the total number of 10-grams in our second dataset is 200 million. It may not be possible to store all of them in the computer's main memory. To solve this problem, we store
the n-grams in a disk file F. Second, if L is not sorted, then a linear search is required for each scanned n-gram to test whether it is already in L. If N is the total number of n-grams in the dataset, then the time for collecting all the n-grams would be O(N^2), an impractical amount of time when N = 200 million.
To solve the second problem, we use a data structure called an Adelson-Velsky Landis (AVL) tree [Goodrich and Tamassia 2006] to store the n-grams in memory. An AVL tree is a height-balanced binary search tree. This tree has the property that the absolute difference between the heights of the left subtree and the right subtree of any node is at most 1. If this property is violated during insertion or deletion, a balancing operation is performed, and the tree regains its height-balanced property. It is guaranteed that insertions and deletions are performed in logarithmic time. So, to insert an n-gram in memory, we now need only O(log2 N) searches. Thus, the total running time is reduced to O(N log2 N), making the overall running time about 5 million times faster for N as large as 200 million. Our feature collection algorithm, Extract_Feature, implements these two solutions. It is illustrated in Algorithm 10.1.
Description of the algorithm: the for loop at line 3 runs for each hexdump file in the training set. The inner while loop at line 4 gathers all the n-grams of a file and adds each one to the AVL tree if it is not already there. At line 8, a test is performed to see whether the tree size has exceeded the memory limit (a threshold value). If it has and F is empty, then we save the contents of the tree in F (line 9). If F is not empty, then we merge the contents of the tree with F (line 10). Finally, we delete all the nodes from the tree (line 12).
Algorithm 10.1 The n-Gram Feature Collection Algorithm

Procedure Extract_Feature (B)
B = {B1, B2, …, BK}: all hexdump files
1.  T ← empty tree                          // initialize AVL tree
2.  F ← new file                            // initialize disk file
3.  for each Bi ∈ B do
4.    while not EOF(Bi) do                  // while not end of file
5.      g ← next_ngram(Bi)                  // read next n-gram
6.      T.insert(g)                         // insert into tree and/or update frequencies as necessary
7.    end while
8.    if T.size > Threshold then            // save or merge
9.      if F is empty then F ← T.inorder()  // save tree data in sorted order
10.     else F ← merge(T.inorder(), F)      // merge tree data with file data and save
11.     end if
12.     T ← empty tree                      // release memory
13.   end if
14. end for
The time complexity of Algorithm 10.1 is T = time(n-gram reading and inserting in tree) + time(merging with disk) = O(B log2 K) + O(N), where B is the total size of the training data in bytes, K is the maximum number of nodes of the tree (i.e., the threshold), and N is the total number of n-grams collected. The space complexity is O(K), where K is defined as the maximum number of nodes of the tree.
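A minimal runnable sketch of the counting logic follows. It uses a Python dict in place of the AVL tree (both give keyed lookup; the AVL tree additionally yields sorted in-order output for the disk merge, approximated here with sorted()), and it omits the threshold/disk-merge step, so it illustrates the per-file counting rather than the scalable implementation. The sample data are made up:

```python
def collect_ngrams(hexdump_files, n):
    """For each byte n-gram, count how many malicious (p1) and benign (n1)
    instances contain it. hexdump_files: list of (hex_string, is_malicious)."""
    counts = {}  # n-gram -> [p1, n1]; stands in for the AVL tree T
    for hex_string, is_malicious in hexdump_files:
        byte_list = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
        # Unique n-grams of this file: an instance is counted once per n-gram.
        seen = {"".join(byte_list[i:i + n])
                for i in range(len(byte_list) - n + 1)}
        for g in seen:
            entry = counts.setdefault(g, [0, 0])
            entry[0 if is_malicious else 1] += 1
    return dict(sorted(counts.items()))  # sorted, as an in-order traversal would give

files = [("FF2108900027", True), ("FF210890AA", False)]  # hypothetical corpus
L = collect_ngrams(files, 4)
print(L["FF210890"])  # [1, 1] -- appears in one malicious and one benign file
```

The real implementation periodically flushes these sorted counts to a disk file and merges, keeping memory bounded by the threshold K.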
10.2.3 Feature Selection
If the total number of extracted features is very large, it may not be possible to use all of them for training, for several reasons. First, the memory requirement may be impractical. Second, training may be too slow. Third, a classifier may become confused by a large number of features, because most of them would be noisy, redundant, or irrelevant. So we must choose a small, relevant, and useful subset of features. We choose information gain (IG) as the selection criterion because it is one of the best criteria used in the literature for selecting the best features.
IG can be defined as a measure of the effectiveness of an attribute (i.e., feature) in classifying a training data point [Mitchell 1997]. If we split the training data based on the values of this
222
attribute then IG gives the measurement of the expectedreduction in entropy after the split The more an attribute canreduce entropy in the training data the better the attribute isin classifying the data IG of an attribute A on a collection ofinstances I is given by Eq 101
where
• values(A) is the set of all possible values for attribute A,
• I_v is the subset of I in which all instances have the value A = v,
• p and n are the total numbers of positive and negative instances in I, and
• p_v and n_v are the total numbers of positive and negative instances in I_v.
In our case, each attribute has only two possible values, that is, v ∈ {0, 1}. If an attribute A (i.e., an n-gram) is present in an instance X, then X_A = 1; otherwise it is 0. The entropy of I is computed using Equation 10.2:

Entropy(I) = −(p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))    (10.2)

where I, p, and n are as defined above. Substituting (10.2) into (10.1) and letting t = n + p, we get Equation 10.3:

IG = Entropy(p, n) − ((p0 + n0) / t) Entropy(p0, n0) − ((p1 + n1) / t) Entropy(p1, n1)    (10.3)

where Entropy(x, y) denotes the entropy of a collection containing x positive and y negative instances, and p_v, n_v (v ∈ {0, 1}) are the positive and negative counts in the two subsets.
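The entropy and gain computation described above (the Gain routine invoked by the selection algorithm) can be sketched as a hypothetical helper:

```python
from math import log2

def entropy(p, n):
    """Entropy of a collection with p positive and n negative instances."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0          # a pure (or empty) collection has zero entropy
    pp, pn = p / total, n / total
    return -pp * log2(pp) - pn * log2(pn)

def gain(p0, n0, p1, n1, p, n):
    """Information gain of a binary attribute: the entropy before the split
    minus the weighted entropy of the two subsets (v = 0 and v = 1)."""
    t = p + n
    return (entropy(p, n)
            - ((p0 + n0) / t) * entropy(p0, n0)
            - ((p1 + n1) / t) * entropy(p1, n1))
```

For a perfectly predictive attribute the gain equals the full entropy of the collection; for an attribute that splits positives and negatives evenly, it is zero.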
The next problem is to select the best S features (i.e., n-grams) according to IG. One naïve approach is to sort the n-grams in non-increasing order of IG and select the top S of them, which requires O(N log2 N) time and O(N) main memory. This selection can be accomplished more efficiently using a heap, which requires O(N log2 S) time and O(S) main memory. For S = 500 and N = 200 million, the heap approach is more than 3 times faster and requires about 400,000 times less main memory. A heap is a balanced binary tree with the property that the root of any subtree contains the minimum (or maximum) element in that subtree. We use a min-heap, which always has the minimum value at its root. Algorithm 10.2 sketches the feature selection algorithm. At first the heap is initialized to empty. Then the n-grams (along with their frequencies) are read from disk (line 2) and inserted into the heap (line 5) until the heap size reaches S. After that, we compare the IG of each subsequent n-gram g against the IG of the root. If IG(root) ≥ IG(g), then g is discarded (line 6), since the root has the minimum IG among the kept features. Otherwise, the root is replaced with g (line 7), and the heap property is restored (line 9). The process terminates when there are no more n-grams on disk, at which point the S best n-grams are in the heap.
Algorithm 10.2 The n-Gram Feature Selection Algorithm

Procedure Select_Feature (F, H, p, n)
• F: a disk file containing all n-grams
• H: empty heap
• p: total number of positive examples
• n: total number of negative examples

1. while not EOF(F) do
2.   ⟨g, p1, n1⟩ ← next_ngram(F) // read n-gram with frequency counts
3.   p0 ← p − p1, n0 ← n − n1 // numbers of positive and negative examples not containing g
4.   IG ← Gain(p0, n0, p1, n1, p, n) // using Equation 10.3
5.   if H.size() < S then H.insert(g, IG)
6.   else if IG ≤ H.root.IG then continue // discard lower-gain n-grams
7.   else H.root ← ⟨g, IG⟩ // replace root
8.   end if
9.   H.restore() // restore the heap property
10. end while
The insertion and restoration operations take only O(log2 S) time each, so the total time required is O(N log2 S), with only O(S) main memory. We denote the best S binary features selected using the IG criterion as the binary feature set (BFS).
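The heap-based selection above maps directly onto Python's heapq module; a minimal sketch (the input is assumed to be an iterable of already-scored (IG, n-gram) pairs, e.g. streamed from disk):

```python
import heapq

def select_top_features(scored_ngrams, S):
    """Keep the S n-grams with the highest information gain using a min-heap,
    in O(N log S) time and O(S) memory. `scored_ngrams` yields (ig, ngram)
    pairs; the heap root always holds the smallest IG among those kept."""
    heap = []
    for ig, g in scored_ngrams:
        if len(heap) < S:
            heapq.heappush(heap, (ig, g))
        elif ig > heap[0][0]:              # better than the worst kept feature
            heapq.heapreplace(heap, (ig, g))
        # otherwise discard g: its gain is too low to make the top S
    return sorted(heap, reverse=True)      # best features first
```

heapreplace pops the root and pushes the new element in one O(log S) operation, which is exactly the replace-then-restore step of lines 7 and 9.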
10.2.4 Assembly n-Gram Feature
In this case, the level of granularity is an assembly instruction. First we disassemble all the binary files using a disassembly tool called PEDisassem, which disassembles Windows Portable Executable (PE) files. Besides generating the assembly instructions with opcode and address information, PEDisassem provides useful information such as the list of resources (e.g., cursors) used, the list of DLL functions called, the list of exported functions, and the list of strings inside the code block. To extract assembly n-gram features, we follow a method similar to binary n-gram feature extraction: first we collect all possible n-grams, that is, sequences of n consecutive assembly instructions, and then select the best S of them according to IG. Henceforth we refer to this selected set of features as the assembly feature set (AFS). We face the same difficulties as in binary n-gram extraction, such as limited memory and slow running time, and solve them in the same way. Example-II illustrates the assembly n-gram features.
Example-II
The 2-grams corresponding to the first four assembly instructions in Figure 10.1 are the two-instruction sliding windows:

(jmp dword [ecx]; or byte [eax+14002700], dl)
(or byte [eax+14002700], dl; add byte [esi+1E], dh)
(add byte [esi+1E], dh; inc ebp)
We adopt a standard representation of assembly instructions with the following format: name,param1,param2. Here name is the instruction name (e.g., mov), param1 is the first parameter, and param2 is the second parameter. A parameter may be one of {register, memory, constant}. So the second instruction above, "or byte [eax+14002700], dl", becomes "or,memory,register" in our representation.
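This normalization can be illustrated with a small sketch; the register table and operand-classification rules below are simplified assumptions (a real disassembler listing would need the full x86 operand grammar):

```python
# Hypothetical, abbreviated x86 register set for illustration only.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
             "ax", "bx", "cx", "dx", "al", "bl", "cl", "dl",
             "ah", "bh", "ch", "dh"}

def classify_param(p):
    """Map an operand to one of the three abstract classes used in the text."""
    p = p.strip().lower()
    if not p:
        return ""
    if p in REGISTERS:
        return "register"
    if "[" in p or p.startswith(("byte", "word", "dword")):
        return "memory"        # any bracketed or size-prefixed operand
    return "constant"          # anything else: immediate value, label, etc.

def normalize(instr):
    """'or byte [eax+14002700], dl' -> 'or,memory,register'."""
    name, _, rest = instr.strip().partition(" ")
    params = [classify_param(p) for p in rest.split(",")] if rest else []
    return ",".join([name.lower()] + [p for p in params if p])
```

Normalizing operands this way makes two instructions with different addresses or constants map to the same feature, which is what lets assembly n-grams generalize across executables.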
10.2.5 DLL Function Call Feature
Here the granularity level is a DLL function call. An n-gram of DLL function calls is a sequence of n DLL function calls (possibly with other instructions between two successive calls) in an executable. We extract the information about DLL function calls made by a program from the header of the disassembled file, as illustrated in Figure 10.2. In our experiments we use only 1-grams of DLL calls because the higher grams have poorer performance. We enumerate all the DLL function names used by each of the benign and malicious executables and select the best S of them using information gain. We refer to this feature set as the DLL-call feature set (DFS).
10.3 The Hybrid Feature Retrieval Model

The hybrid feature retrieval (HFR) model extracts and combines three different kinds of features. HFR consists of different phases and components. The feature extraction components have already been discussed in detail. This section gives a brief description of the model.
10.3.1 Description of the Model
The HFR model consists of two phases: a training phase and a test phase. The training phase is shown in Figure 10.3a and the test phase in Figure 10.3b. In the training phase, we extract binary n-gram features (BFS) and DLL call features (DFS) using the approaches explained in this chapter. We then apply the AFR algorithm (explained shortly) to retrieve the derived assembly features (DAFs) that represent the selected binary n-gram features. These three kinds of features are combined into the hybrid feature set (HFS). Please note that DAFs are different from the assembly n-gram features (i.e., the AFS).
Figure 10.3 The Hybrid Feature Retrieval Model: (a) training phase; (b) test phase. (From M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
The AFS is not used in the HFS because of our finding that DAF performs better. We compute the binary feature vector corresponding to the HFS using the technique explained in this chapter and train a classifier using SVM, boosted decision tree, and other classification methods. In the test phase, we scan each test instance and compute the feature vector corresponding to the HFS. This vector is tested against the classifier, which outputs the class prediction (benign or malicious) of the test file.
10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
The AFR algorithm extracts the assembly instruction sequences (i.e., DAFs) corresponding to the binary n-gram features. The main idea is to obtain the complete assembly instruction sequence for a given binary n-gram feature. The rationale behind using DAFs is as follows: a binary n-gram may represent partial information, such as part(s) of one or more assembly instructions, or a string inside the code block. We apply the AFR algorithm to obtain the complete instruction or instruction sequence (i.e., a DAF) corresponding to the partial one. Thus a DAF represents more complete information, which should be more useful in distinguishing malicious from benign executables. However, binary n-grams are still required because they also contain other information, such as string data or important bytes in the program header. The AFR algorithm consists of several steps. In the first step, a linear address matching technique is applied as follows: the offset address of the n-gram in the hexdump file is used to find the instructions at the same offset in the corresponding assembly program file. Based on the offset value, one of three situations may occur:
1. The offset is before the program entry point, so there is no corresponding assembly code for the n-gram. We refer to this address as address before entry point (ABEP).
2. There are some data but no code at that offset. We refer to this address as DATA.
3. There is some code at that offset. We refer to this address as CODE. If the offset is in the middle of an instruction, then we take the whole instruction along with the consecutive instructions within n bytes of it.
In the second step, the best CODE instance is selected from among all CODE instances. We apply a heuristic, called the most distinguishing instruction sequence (MDIS) heuristic, to find the best sequence: we choose the instruction sequence with the highest IG. The AFR algorithm is sketched in Algorithm 10.3. A comprehensive example of the algorithm is given in Appendix A.
Description of the algorithm: line 1 initializes the lists that will contain the assembly sequences. The for loop at line 2 runs for each hexdump file. Each hexdump file is scanned and n-grams are extracted (lines 4 and 5). If an n-gram is in the BFS (lines 6 and 7), then we read the instruction sequence from the corresponding assembly program file at the corresponding address (lines 8 through 10). This sequence is added to the appropriate list (line 12). In this way we collect all the sequences corresponding to each n-gram in the BFS. In phase II, we select the best sequence in each n-gram's list using IG (lines 18 through 21). Finally, we return the best sequences, that is, the DAFs.
Algorithm 10.3 Assembly Feature Retrieval
Procedure Assembly_Feature_Retrieval (G, A, B)
• G = {g1, g2, …, gS}: the selected n-gram features (BFS)
• A = {A1, A2, …, AL}: all assembly files
• B = {B1, B2, …, BL}: all hexdump files
• S = size of the BFS
• L = number of training files
• Qi = a list containing the possible instruction sequences for gi

1. for i = 1 to S do Qi ← empty end for // initialize sequence lists
2. for each Bi ∈ B do // phase I: sequence collection
3.   offset ← 0 // current offset in file
4.   while not EOF(Bi) do // read the whole file
5.     g ← next_ngram(Bi) // read next n-gram
6.     ⟨index, found⟩ ← BinarySearch(G, g) // search for g in G
7.     if found then
8.       q ← an empty sequence
9.       for each instruction r in Ai with address(r) ∈ [offset, offset + n] do
10.        q ← q ∪ r
11.      end for
12.      Qindex ← Qindex ∪ q // add to the sequence list
13.    end if
14.    offset ← offset + 1
15.  end while
16. end for
17. V ← empty list // phase II: sequence selection
18. for i = 1 to S do // for each Qi
19.   q ← t ∈ Qi such that ∀u ∈ Qi, IG(t) ≥ IG(u) // the sequence with the highest IG
20.   V ← V ∪ q
21. end for
22. return V // the DAF sequences
The time complexity of this algorithm is O(nB log2 S), where B is the total size of the training set in bytes, S is the total number of selected binary n-grams, and n is the size of each n-gram in bytes. The space complexity is O(SC), where S is again the total number of selected binary n-grams and C is the average number of assembly sequences found per binary n-gram. The running times and memory requirements of all three algorithms in this chapter are given in Chapter 11.
10.3.3 Feature Vector Computation and Classification
Each feature in a feature set (e.g., HFS, BFS) is a binary feature, meaning its value is either 1 or 0. If the feature is present in an instance (i.e., an executable), then its value is 1; otherwise its value is 0. For each training (or testing) instance, we compute a feature vector, which is a bit vector consisting of the feature values of the corresponding feature set. For example, to compute the feature vector VBFS corresponding to the BFS of a particular instance I, we search for each feature f ∈ BFS in I. If f is found in I, we set VBFS[f] (i.e., the bit corresponding to f) to 1; otherwise we set it to 0. In this way we set or reset each bit in the feature vector. These feature vectors are used by the classifiers for training and testing.
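The bit-vector computation amounts to one set-membership test per selected feature; a minimal sketch (the function and argument names are illustrative):

```python
def feature_vector(feature_set, instance_features):
    """Binary feature vector for one executable: bit i is 1 iff the i-th
    selected feature occurs in the instance. `feature_set` is the ordered
    list of selected features (e.g. the HFS); `instance_features` is the
    collection of n-grams / calls extracted from the instance."""
    present = set(instance_features)   # O(1) membership tests
    return [1 if f in present else 0 for f in feature_set]
```

Keeping `feature_set` in a fixed order is essential: it is what makes bit i mean the same feature in every training and testing vector.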
We apply SVM, Naïve Bayes (NB), boosted decision tree, and other classifiers to the classification task. SVM can perform either linear or non-linear classification. The linear classifier proposed by Vladimir Vapnik creates a hyperplane that separates the data points of the two classes with the maximum margin: a maximum-margin hyperplane splits the training examples into two subsets such that the distance between the hyperplane and its closest data point(s) is maximized. A non-linear SVM [Boser et al., 1992] is implemented by applying the kernel trick to maximum-margin hyperplanes: the feature space is transformed into a higher-dimensional space in which the maximum-margin hyperplane is found. A decision tree contains attribute tests at each internal node and a decision at each leaf node. It classifies an instance
by performing attribute tests from the root down to a decision node. A decision tree is a rule-based classifier, meaning that we can obtain human-readable classification rules from the tree. J48 is an implementation of the C4.5 decision tree algorithm; C4.5 is an extension of the ID3 algorithm invented by Quinlan. A boosting technique called AdaBoost combines multiple classifiers by assigning weights to each of them according to their classification performance. The algorithm starts by assigning equal weights to all training samples, and a model is obtained from these training data. Then each misclassified example's weight is increased, and another model is obtained from the reweighted training data. This is iterated a specified number of times. During classification, each of these models is applied to the test data, and a weighted vote determines the class of the test instance. We use the AdaBoost.M1 algorithm [Freund and Schapire, 1996] on NB and J48. We report only the SVM and Boosted J48 results because they are the best. It should be noted that we do not have a preference for one classifier over another. We report these accuracies in the results in Chapter 11.
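The boosting loop just described can be illustrated with a self-contained sketch that uses one-feature decision stumps as a stand-in base learner (the chapter's actual base learners are NB and J48); labels are assumed to be in {−1, +1}:

```python
from math import exp, log

def stump_predict(feature_idx, x):
    """Decision stump on a binary feature: predict +1 iff the bit is set."""
    return 1 if x[feature_idx] == 1 else -1

def adaboost(X, y, rounds):
    """Binary AdaBoost sketch. X: list of 0/1 feature vectors,
    y: labels in {-1, +1}. Returns a list of (alpha, feature_idx) pairs."""
    m = len(X)
    w = [1.0 / m] * m                      # equal weights to start
    ensemble = []
    for _ in range(rounds):
        # fit the base learner: pick the stump with lowest weighted error
        best_f, best_err = None, float("inf")
        for f in range(len(X[0])):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(f, xi) != yi)
            if err < best_err:
                best_f, best_err = f, err
        if best_err >= 0.5:                # no better than chance: stop
            break
        alpha = 0.5 * log((1 - best_err) / max(best_err, 1e-12))
        ensemble.append((alpha, best_f))
        # reweight: increase the weight of misclassified examples
        w = [wi * exp(-alpha * yi * stump_predict(best_f, xi))
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all models in the ensemble."""
    score = sum(alpha * stump_predict(f, x) for alpha, f in ensemble)
    return 1 if score >= 0 else -1
```

Each round's weight alpha grows as the round's error shrinks, so accurate models dominate the final vote, exactly the weighted-voting behavior described above.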
10.4 Summary

In this chapter we have shown how to efficiently extract features from the training data. We also showed how scalability can be achieved using disk access. We explained the algorithms for feature extraction and feature selection and analyzed their time complexity. Finally, we showed how to combine the feature sets and build the feature vectors. We applied different machine learning techniques, such as SVM, J48, and AdaBoost, for building the
classification model. In the next chapter, we will show how our approach performs on different datasets compared to several baseline techniques.
In the future, we would like to enhance the scalability of our approach by applying a cloud computing framework to the feature extraction and selection tasks. Cloud computing offers a cheap alternative for obtaining more CPU power and much larger disk space, which could be utilized for a much faster feature extraction and selection process. We are also interested in extracting behavioral features from the executables to overcome the problem of binary obfuscation by polymorphic malware.
References

[Boser et al., 1992] Boser, B. E., I. M. Guyon, and V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler, Editor, 5th Annual ACM Workshop on COLT, ACM Press, 1992, pp. 144–152.
[Freund and Schapire, 1996] Freund, Y., and R. E. Schapire, Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the 13th International Conference (ICML), Bari, Italy, 1996, pp. 148–156.
[Goodrich and Tamassia, 2006] Goodrich, M. T., and R. Tamassia, Data Structures and Algorithms in Java, Fourth Edition, John Wiley & Sons, 2006.
[Mitchell, 1997] Mitchell, T., Machine Learning, McGraw-Hill, 1997.
11
EVALUATION AND RESULTS
11.1 Introduction

In this chapter we discuss the experiments and the evaluation process in detail. We use two different datasets with different numbers of instances and class distributions. We compare the features extracted with our approach, namely the hybrid feature set (HFS), with two baseline approaches: (1) the binary feature set (BFS) and (2) the derived assembly feature set (DAF). For classification, we compare the performance of several classifiers on each of these feature sets: Support Vector Machine (SVM), Naïve Bayes (NB), Bayes Net, decision tree, and boosted decision tree. We show the classification accuracy, false positive, and false negative rates for our approach and each of the baseline techniques. We also compare the running times and performance/cost tradeoff of our approach against the baselines.
The organization of this chapter is as follows. Section 11.2 describes the experiments. The datasets are given in Section 11.3. The experimental setup is discussed in Section 11.4. Results are given in Section 11.5. An example run is given in Section 11.6. The chapter is summarized in Section 11.7. Figure 11.1 illustrates the concepts in this chapter.
11.2 Experiments

We design our experiments to run on two different datasets. Each dataset has a different size and distribution of benign and malicious executables. We generate all kinds of n-gram features (e.g., BFS, AFS, DFS) using the techniques explained in Chapter 10. Notice that the BFS corresponds to the features extracted by the method of [Kolter and Maloof 2004]. We also generate the DAF and HFS using our model, as explained in Chapter 10. We test the accuracy of each of the feature sets by applying threefold cross validation, using classifiers such as SVM, decision tree, Naïve Bayes, Bayes Net, and boosted decision tree. Among these classifiers, we obtain the best results with SVM and boosted decision tree, reported in the results section of this chapter. We do not report the other classifiers' results because of space limitations. In addition, we compute the average accuracy, false positive and false negative rates, and receiver operating characteristic (ROC) graphs (using the techniques in [Fawcett 2003]). We also compare the running time and performance/cost tradeoff between HFS and BFS.
Figure 11.1 Concepts in this chapter.
11.3 Dataset

We have two non-disjoint datasets. The first dataset (dataset1) contains a collection of 1,435 executables, 597 of which are benign and 838 malicious. The second dataset (dataset2) contains 2,452 executables: 1,370 benign and 1,082 malicious. So the distribution of dataset1 is 41.6% benign and 58.4% malicious, and that of dataset2 is 55.9% benign and 44.1% malicious. These distributions were chosen intentionally to evaluate the performance of the feature sets in different scenarios. We collected the benign executables from different Windows XP and Windows 2000 machines and the malicious executables from [VX Heavens], which hosts a large collection of malicious executables. The benign executables comprise various applications found in the Windows installation folder (e.g., "C:\Windows") as well as other executables in the default program installation directory (e.g., "C:\Program Files"). The malicious executables comprise viruses, worms, Trojan horses, and back-doors. We select only Win32 Portable Executables in both cases. We would like to experiment with ELF executables in the future.
11.4 Experimental Setup

Our implementation is developed in Java with JDK 1.5. We use the LIBSVM library [Chang and Lin 2006] for running SVM and the Weka ML toolbox [Weka] for running boosted decision tree and the other classifiers. For SVM, we run C-SVC with a polynomial kernel, using gamma = 0.1 and epsilon = 1.0E-12. For boosted decision tree, we run 10 iterations of the AdaBoost algorithm on the C4.5 decision tree algorithm called J48.
We set the parameter S (the number of selected features) to 500 because it is the best value found in our experiments. Most of our experiments are run on two machines: a Sun Solaris machine with 4GB main memory and a 2GHz clock speed, and a Linux machine with 2GB main memory and a 1.8GHz clock speed. The reported running times are based on the latter machine. The disassembly and hex-dump are done only once for all machine executables and the resulting files are stored; we then run our experiments on the stored files.
11.5 Results

In this section we first report and analyze the results obtained by running SVM on the datasets. Later we show the accuracies of Boosted J48. Because the results from Boosted J48 are almost the same as those of SVM, we do not report the analyses based on Boosted J48.
11.5.1 Accuracy
Table 11.1 shows the accuracy of SVM on different feature sets. The columns headed HFS, BFS, and AFS represent the accuracies of the hybrid feature set (our method), the binary feature set (Kolter and Maloof's feature set), and the assembly feature set, respectively. Note that the AFS is different from the DAF (i.e., the derived assembly features) used in the HFS (see Chapter 10 for details). Table 11.1 shows that the classification accuracy of HFS is always better than the other models on both datasets. It is interesting to note that the accuracies for 1-gram BFS are very low in both datasets. This is because a 1-gram is only a 1-byte-long pattern with only 256 different possibilities, so this pattern is not at all useful in distinguishing the malicious executables from the normal ones and would not be used in a practical application. We therefore exclude the 1-gram accuracies when computing the average accuracies (i.e., the last row).
Table 11.1 Classification Accuracy (%) of SVM on Different Feature Sets

Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Average accuracy excluding 1-gram.
11.5.1.1 Dataset1 Here the best accuracy of the hybrid model is for n = 6, which is 97.4% and is the highest among all feature sets. On average, the accuracy of HFS is 1.68% higher than that of BFS and 11.36% higher than that of AFS. The accuracies of AFS are always the lowest. One possible reason for this poor performance is that AFS considers only the CODE part of the executables, so AFS misses any distinguishing pattern carried by the ABEP or DATA parts, and as a result the extracted features perform worse. Moreover, the accuracy of AFS deteriorates greatly for n ≥ 10. This is because longer sequences of instructions are rarer in either class of executables (malicious or benign), so these sequences have less distinguishing power. BFS, on the other hand, considers all parts of the executable, achieving higher accuracy. Finally, HFS considers DLL calls as well as BFS and DAF, so HFS performs better than BFS.
11.5.1.2 Dataset2 Here the differences between the accuracies of HFS and BFS are greater than in dataset1. The average accuracy of HFS is 4.2% higher than that of BFS. The accuracies of AFS are again the lowest. It is interesting to note that HFS improves over BFS (and AFS) in dataset2. Two important conclusions may be drawn from this observation. First, dataset2 is much larger than dataset1 and has a more diverse set of examples. Here HFS performs better than in dataset1, whereas BFS performs worse than in dataset1. This implies that HFS is more robust than BFS on a larger and more diverse set of instances, and thus more applicable to a large, diverse corpus of executables. Second, dataset2 has more benign executables than malicious ones, whereas dataset1 has fewer benign executables. The distribution of dataset2 is more likely in the real world, where benign executables outnumber malicious executables. This implies that HFS is likely to perform better than BFS in a real-world scenario, with a larger number of benign executables in the dataset.
11.5.1.3 Statistical Significance Test We also perform a pair-wise two-tailed t-test on the HFS and BFS accuracies to test whether the differences between them are statistically significant. We exclude the 1-gram accuracies from this test for the reason previously explained. The result of the t-test is summarized in Table 11.2. The t-value shown in the table is the value of t obtained from the accuracies. There are (5 + 5 − 2) = 8 degrees of freedom, since we have five observations in each group and there are two groups (i.e., HFS and BFS). Probability denotes the probability of rejecting the NULL hypothesis (that there is no difference between the HFS and BFS accuracies), while the p-value denotes the probability of accepting the NULL hypothesis. For dataset1 the probability is 99.65% and for dataset2 it is 100.0%. Thus we conclude that the average accuracy of HFS is significantly higher than that of BFS.
Table 11.2 Pair-Wise Two-Tailed t-Test Results Comparing HFS and BFS

                      DATASET1    DATASET2
t-value               8.9         14.6
Degrees of freedom    8           8
Probability           0.9965      1.00
p-value               0.0035      0.0000
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.
11.5.1.4 DLL Call Feature Here we report the accuracies of the DLL function call features (DFS). The 1-gram accuracies are 92.8% for dataset1 and 91.9% for dataset2. The accuracies for higher grams are less than 75%, so we do not report them. The reason behind this poor performance is possibly that there are no distinguishing call sequences that identify the executables as malicious or benign.
11.5.2 ROC Curves
ROC curves plot the true positive rate against the false positive rate of a classifier. Figure 11.2 shows the ROC curves of dataset1 for n = 6 and dataset2 for n = 4, based on SVM testing. The ROC curves for other values of n show similar trends, except for n = 1, where AFS performs better than BFS. It is evident from the curves that HFS always dominates (i.e., has a larger area under the curve than) the other two, and it is more dominant in dataset2. Table 11.3 reports the area under the curve (AUC) for the ROC curves of each of the feature sets. A higher AUC indicates a higher probability that a classifier will predict correctly. Table 11.3 shows that the AUC of HFS is the highest, and that it improves (relative to the other two) in dataset2. This also supports our hypothesis that our model will perform better in the more likely real-world scenario, where benign executables occur more frequently.
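The AUC can be computed directly from classifier scores via its rank-statistic interpretation: the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one. A minimal sketch:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from raw classifier scores.
    scores: real-valued outputs; labels: 1 (positive) or 0 (negative).
    Counts each positive/negative pair once, with ties worth half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

This O(|pos|·|neg|) pairwise form is the simplest statement of AUC; sorting-based implementations achieve the same result in O(N log N).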
Figure 11.2 ROC curves for different feature sets in dataset1 (left) and dataset2 (right). (From M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
Table 11.3 Area under the ROC Curve on Different Feature Sets
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Average value excluding 1-gram.
11.5.3 False Positive and False Negative
Table 11.4 reports the false positive and false negative rates (in percent) for each feature set, based on the SVM output. The last row reports the averages; again we exclude the 1-gram values. In dataset1 the average false positive rate of HFS is 4.9%, which is the lowest; in dataset2 this rate is even lower (3.2%). The false positive rate is a measure of the false alarm rate, so our model has the lowest false alarm rate. We also observe that this rate decreases as we increase the number of benign examples, because the classifier becomes more familiar with benign executables and misclassifies fewer of them as malicious. We believe that a large training set with a larger portion of benign executables would eventually push the false positive rate toward zero. The false negative rate is also the lowest for HFS, as reported in Table 11.4.
Table 11.4 False Positive and False Negative Rates on Different Feature Sets
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.
a Average value excluding 1-gram
11.5.4 Running Time
We compare in Table 11.5 the running times (feature extraction, training, and testing) of the different kinds of features (HFS, BFS, AFS) for different values of n. The feature extraction time for HFS and AFS includes the disassembly time, which is 465 seconds (in total) for dataset1 and 865 seconds (in total) for dataset2. Training time is the sum of the feature extraction time, the feature-vector computation time, and the SVM training time. Testing time is the sum of the disassembly time (except for BFS), the feature-vector computation time, and the SVM classification time. Training and testing times based on Boosted J48 have almost the same characteristics, so we do not report them. Table 11.5 also reports the cost factor, the ratio of the time required for HFS relative to BFS.
The column Cost Factor shows this comparison. The average feature extraction times are computed excluding the 1-grams and 2-grams, because these grams are unlikely to be used in practical applications. The boldface cells in the table are of particular interest. From the table we see that the running times for HFS training and testing on dataset1 are 1.17 and 4.87 times higher than those of BFS, respectively. For dataset2 these factors are 1.08 and 4.5, respectively. The average throughput for HFS is 0.6 MB/sec (in both datasets), which may be considered near real-time performance. Finally, we summarize the cost/performance tradeoff in Table 11.6. The column Performance Improvement reports the accuracy improvement of HFS over BFS; the cost factors are shown in the next two columns. If we drop the disassembly time from the testing time (assuming that disassembly is done offline), then the testing cost factor diminishes to 1.0 for both datasets. It is evident from Table 11.6 that the performance/cost tradeoff is better for dataset2 than for dataset1. Again we may infer that our model is likely to perform better on a larger and more realistic dataset. The main bottleneck of our system is the disassembly cost; the testing cost factor is higher because a larger proportion of the testing time is spent on disassembly. We believe this factor can be greatly reduced by optimizing the disassembler, and by noting that disassembly can be done offline.
Table 11.5 Running Times (in seconds)
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Ratio of time required for HFS to time required for BFS.

b Average feature extraction times, excluding 1-gram and 2-gram.

c Average training/testing times, excluding 1-gram and 2-gram.
Table 11.6 Performance/Cost Tradeoff between HFS and BFS