IT MANAGEMENT TITLES FROM AUERBACH PUBLICATIONS AND CRC PRESS

.NET 4 for Enterprise Architects and Developers
Sudhanshu Hate and Suchi Paharia
ISBN 978-1-4398-6293-3

A Tale of Two Transformations: Bringing Lean and Agile Software Development to Life
Michael K. Levine
ISBN 978-1-4398-7975-7

Antipatterns: Managing Software Organizations and People, Second Edition
Colin J. Neill, Philip A. Laplante, and Joanna F. DeFranco
ISBN 978-1-4398-6186-8

Asset Protection through Security Awareness
Tyler Justin Speed
ISBN 978-1-4398-0982-2

Beyond Knowledge Management: What Every Leader Should Know
Edited by Jay Liebowitz
ISBN 978-1-4398-6250-6

CISO's Guide to Penetration Testing: A Framework to Plan, Manage, and Maximize Benefits
James S. Tiller
ISBN 978-1-4398-8027-2

Cybersecurity: Public Sector Threats and Responses
Edited by Kim J. Andreasson
ISBN 978-1-4398-4663-6

Cybersecurity for Industrial Control Systems: SCADA, DCS, PLC, HMI, and SIS
Tyson Macaulay and Bryan Singer
ISBN 978-1-4398-0196-3

Data Warehouse Designs: Achieving ROI with Market Basket Analysis and Time Variance
Fon Silvers
ISBN 978-1-4398-7076-1

Emerging Wireless Networks: Concepts, Techniques, and Applications
Edited by Christian Makaya and Samuel Pierre
ISBN 978-1-4398-2135-0

Information and Communication Technologies in Healthcare
Edited by Stephan Jones and Frank M. Groom
ISBN 978-1-4398-5413-6

Information Security Governance Simplified: From the Boardroom to the Keyboard
Todd Fitzgerald
ISBN 978-1-4398-1163-4

IP Telephony Interconnection Reference: Challenges, Models, and Engineering
Mohamed Boucadair, Isabel Borges, Pedro Miguel Neves, and Olafur Pall Einarsson
ISBN 978-1-4398-5178-4

IT's All about the People: Technology Management That Overcomes Disaffected People, Stupid Processes, and Deranged Corporate Cultures
Stephen J. Andriole
ISBN 978-1-4398-7658-9

IT Best Practices: Management, Teams, Quality, Performance, and Projects
Tom C. Witt
ISBN 978-1-4398-6854-6

Maximizing Benefits from IT Project Management: From Requirements to Value Delivery
José López Soriano
ISBN 978-1-4398-4156-3

Secure and Resilient Software: Requirements, Test Cases, and Testing Methods
Mark S. Merkow and Lakshmikanth Raghavan
ISBN 978-1-4398-6621-4

Security De-engineering: Solving the Problems in Information Risk Management
Ian Tibble
ISBN 978-1-4398-6834-8

Software Maintenance Success Recipes
Donald J. Reifer
ISBN 978-1-4398-5166-1

Software Project Management: A Process-Driven Approach
Ashfaque Ahmed
ISBN 978-1-4398-4655-1

Web-Based and Traditional Outsourcing
Vivek Sharma, Varun Sharma, and K.S. Rajasekaran, Infosys Technologies Ltd., Bangalore, India
ISBN 978-1-4398-1055-2
Data Mining Tools for Malware Detection

Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20120111

International Standard Book Number-13: 978-1-4665-1648-9 (eBook - ePub)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Dedication

We dedicate this book to our respective families for their support that enabled us to write this book.
Contents
PREFACE
Introductory Remarks
Background on Data Mining
Data Mining for Cyber Security
Organization of This Book
Concluding Remarks
ACKNOWLEDGMENTS
THE AUTHORS
COPYRIGHT PERMISSIONS
CHAPTER 1 INTRODUCTION
1.1 Trends
1.2 Data Mining and Security Technologies
1.3 Data Mining for Email Worm Detection
1.4 Data Mining for Malicious Code Detection
1.5 Data Mining for Detecting Remote Exploits
1.6 Data Mining for Botnet Detection
1.7 Stream Data Mining
1.8 Emerging Data Mining Tools for Cyber Security Applications
1.9 Organization of This Book
1.10 Next Steps
PART I DATA MINING AND SECURITY
Introduction to Part I Data Mining and Security
CHAPTER 2 DATA MINING TECHNIQUES
2.1 Introduction
2.2 Overview of Data Mining Tasks and Techniques
2.3 Artificial Neural Network
2.4 Support Vector Machines
2.5 Markov Model
2.6 Association Rule Mining (ARM)
2.7 Multi-Class Problem
2.7.1 One-vs-One
2.7.2 One-vs-All
2.8 Image Mining
2.8.1 Feature Selection
2.8.2 Automatic Image Annotation
2.8.3 Image Classification
2.9 Summary
References
CHAPTER 3 MALWARE
3.1 Introduction
3.2 Viruses
3.3 Worms
3.4 Trojan Horses
3.5 Time and Logic Bombs
3.6 Botnet
3.7 Spyware
3.8 Summary
References
CHAPTER 4 DATA MINING FOR SECURITY APPLICATIONS
4.1 Introduction
4.2 Data Mining for Cyber Security
4.2.1 Overview
4.2.2 Cyber-Terrorism, Insider Threats, and External Attacks
4.2.3 Malicious Intrusions
4.2.4 Credit Card Fraud and Identity Theft
4.2.5 Attacks on Critical Infrastructures
4.2.6 Data Mining for Cyber Security
4.3 Current Research and Development
4.4 Summary
References
CHAPTER 5 DESIGN AND IMPLEMENTATION OF DATA MINING TOOLS
5.1 Introduction
5.2 Intrusion Detection
5.3 Web Page Surfing Prediction
5.4 Image Classification
5.5 Summary
References
CONCLUSION TO PART I
PART II DATA MINING FOR EMAIL WORM DETECTION
Introduction to Part II
CHAPTER 6 EMAIL WORM DETECTION
6.1 Introduction
6.2 Architecture
6.3 Related Work
6.4 Overview of Our Approach
6.5 Summary
References
CHAPTER 7 DESIGN OF THE DATA MINING TOOL
7.1 Introduction
7.2 Architecture
7.3 Feature Description
7.3.1 Per-Email Features
7.3.2 Per-Window Features
7.4 Feature Reduction Techniques
7.4.1 Dimension Reduction
7.4.2 Two-Phase Feature Selection (TPS)
7.4.2.1 Phase I
7.4.2.2 Phase II
7.5 Classification Techniques
7.6 Summary
References
CHAPTER 8 EVALUATION AND RESULTS
8.1 Introduction
8.2 Dataset
8.3 Experimental Setup
8.4 Results
8.4.1 Results from Unreduced Data
8.4.2 Results from PCA-Reduced Data
8.4.3 Results from Two-Phase Selection
8.5 Summary
References
CONCLUSION TO PART II
PART III DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
Introduction to Part III
CHAPTER 9 MALICIOUS EXECUTABLES
9.1 Introduction
9.2 Architecture
9.3 Related Work
9.4 Hybrid Feature Retrieval (HFR) Model
9.5 Summary
References
CHAPTER 10 DESIGN OF THE DATA MINING TOOL
10.1 Introduction
10.2 Feature Extraction Using n-Gram Analysis
10.2.1 Binary n-Gram Feature
10.2.2 Feature Collection
10.2.3 Feature Selection
10.2.4 Assembly n-Gram Feature
10.2.5 DLL Function Call Feature
10.3 The Hybrid Feature Retrieval Model
10.3.1 Description of the Model
10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
10.3.3 Feature Vector Computation and Classification
10.4 Summary
References
CHAPTER 11 EVALUATION AND RESULTS
11.1 Introduction
11.2 Experiments
11.3 Dataset
11.4 Experimental Setup
11.5 Results
11.5.1 Accuracy
11.5.1.1 Dataset1
11.5.1.2 Dataset2
11.5.1.3 Statistical Significance Test
11.5.1.4 DLL Call Feature
11.5.2 ROC Curves
11.5.3 False Positive and False Negative
11.5.4 Running Time
11.5.5 Training and Testing with Boosted J48
11.6 Example Run
11.7 Summary
References
CONCLUSION TO PART III
PART IV DATA MINING FOR DETECTING REMOTE EXPLOITS
Introduction to Part IV
CHAPTER 12 DETECTING REMOTE EXPLOITS
12.1 Introduction
12.2 Architecture
12.3 Related Work
12.4 Overview of Our Approach
12.5 Summary
References
CHAPTER 13 DESIGN OF THE DATA MINING TOOL
13.1 Introduction
13.2 DExtor Architecture
13.3 Disassembly
13.4 Feature Extraction
13.4.1 Useful Instruction Count (UIC)
13.4.2 Instruction Usage Frequencies (IUF)
13.4.3 Code vs. Data Length (CDL)
13.5 Combining Features and Computing the Combined Feature Vector
13.6 Classification
13.7 Summary
References
CHAPTER 14 EVALUATION AND RESULTS
14.1 Introduction
14.2 Dataset
14.3 Experimental Setup
14.3.1 Parameter Settings
14.3.2 Baseline Techniques
14.4 Results
14.4.1 Running Time
14.5 Analysis
14.6 Robustness and Limitations
14.6.1 Robustness against Obfuscations
14.6.2 Limitations
14.7 Summary
References
CONCLUSION TO PART IV
PART V DATA MINING FOR DETECTING BOTNETS
Introduction to Part V
CHAPTER 15 DETECTING BOTNETS
15.1 Introduction
15.2 Botnet Architecture
15.3 Related Work
15.4 Our Approach
15.5 Summary
References
CHAPTER 16 DESIGN OF THE DATA MINING TOOL
16.1 Introduction
16.2 Architecture
16.3 System Setup
16.4 Data Collection
16.5 Bot Command Categorization
16.6 Feature Extraction
16.6.1 Packet-Level Features
16.6.2 Flow-Level Features
16.7 Log File Correlation
16.8 Classification
16.9 Packet Filtering
16.10 Summary
References
CHAPTER 17 EVALUATION AND RESULTS
17.1 Introduction
17.1.1 Baseline Techniques
17.1.2 Classifiers
17.2 Performance on Different Datasets
17.3 Comparison with Other Techniques
17.4 Further Analysis
17.5 Summary
CONCLUSION TO PART V
PART VI STREAM MINING FOR SECURITY APPLICATIONS
Introduction to Part VI
CHAPTER 18 STREAM MINING
18.1 Introduction
18.2 Architecture
18.3 Related Work
18.4 Our Approach
18.5 Overview of the Novel Class Detection Algorithm
18.6 Classifiers Used
18.7 Security Applications
18.8 Summary
References
CHAPTER 19 DESIGN OF THE DATA MINING TOOL
19.1 Introduction
19.2 Definitions
19.3 Novel Class Detection
19.3.1 Saving the Inventory of Used Spaces during Training
19.3.1.1 Clustering
19.3.1.2 Storing the Cluster Summary Information
19.3.2 Outlier Detection and Filtering
19.3.2.1 Filtering
19.3.3 Detecting Novel Class
19.3.3.1 Computing the Set of Novel Class Instances
19.3.3.2 Speeding up the Computation
19.3.3.3 Time Complexity
19.3.3.4 Impact of Evolving Class Labels on Ensemble Classification
19.4 Security Applications
19.5 Summary
References
CHAPTER 20 EVALUATION AND RESULTS
20.1 Introduction
20.2 Datasets
20.2.1 Synthetic Data with Only Concept-Drift (SynC)
20.2.2 Synthetic Data with Concept-Drift and Novel Class (SynCN)
20.2.3 Real Data: KDD Cup 99 Network Intrusion Detection
20.2.4 Real Data: Forest Cover (UCI Repository)
20.3 Experimental Setup
20.3.1 Baseline Method
20.4 Performance Study
20.4.1 Evaluation Approach
20.4.2 Results
20.4.3 Running Time
20.5 Summary
References
CONCLUSION TO PART VI
PART VII EMERGING APPLICATIONS
Introduction to Part VII
CHAPTER 21 DATA MINING FOR ACTIVE DEFENSE
21.1 Introduction
21.2 Related Work
21.3 Architecture
21.4 A Data Mining-Based Malware Detection Model
21.4.1 Our Framework
21.4.2 Feature Extraction
21.4.2.1 Binary n-Gram Feature Extraction
21.4.2.2 Feature Selection
21.4.2.3 Feature Vector Computation
21.4.3 Training
21.4.4 Testing
21.5 Model-Reversing Obfuscations
21.5.1 Path Selection
21.5.2 Feature Insertion
21.5.3 Feature Removal
21.6 Experiments
21.7 Summary
References
CHAPTER 22 DATA MINING FOR INSIDER THREAT DETECTION
22.1 Introduction
22.2 The Challenges, Related Work, and Our Approach
22.3 Data Mining for Insider Threat Detection
22.3.1 Our Solution Architecture
22.3.2 Feature Extraction and Compact Representation
22.3.3 RDF Repository Architecture
22.3.4 Data Storage
22.3.4.1 File Organization
22.3.4.2 Predicate Split (PS)
22.3.4.3 Predicate Object Split (POS)
22.3.5 Answering Queries Using Hadoop MapReduce
22.3.6 Data Mining Applications
22.4 Comprehensive Framework
22.5 Summary
References
CHAPTER 23 DEPENDABLE REAL-TIME DATA MINING
23.1 Introduction
23.2 Issues in Real-Time Data Mining
23.3 Real-Time Data Mining Techniques
23.4 Parallel, Distributed, Real-Time Data Mining
23.5 Dependable Data Mining
23.6 Mining Data Streams
23.7 Summary
References
CHAPTER 24 FIREWALL POLICY ANALYSIS
24.1 Introduction
24.2 Related Work
24.3 Firewall Concepts
24.3.1 Representation of Rules
24.3.2 Relationship between Two Rules
24.3.3 Possible Anomalies between Two Rules
24.4 Anomaly Resolution Algorithms
24.4.1 Algorithms for Finding and Resolving Anomalies
24.4.1.1 Illustrative Example
24.4.2 Algorithms for Merging Rules
24.4.2.1 Illustrative Example of the Merge Algorithm
24.5 Summary
References
CONCLUSION TO PART VII
CHAPTER 25 SUMMARY AND DIRECTIONS
25.1 Introduction
25.2 Summary of This Book
25.3 Directions for Data Mining Tools for Malware Detection
25.4 Where Do We Go from Here?
APPENDIX A DATA MANAGEMENT SYSTEMS: DEVELOPMENTS AND TRENDS
A.1 Introduction
A.2 Developments in Database Systems
A.3 Status, Vision, and Issues
A.4 Data Management Systems Framework
A.5 Building Information Systems from the Framework
A.6 Relationship between the Texts
A.7 Summary
References
APPENDIX B TRUSTWORTHY SYSTEMS
B.1 Introduction
B.2 Secure Systems
B.2.1 Introduction
B.2.2 Access Control and Other Security Concepts
B.2.3 Types of Secure Systems
B.2.4 Secure Operating Systems
B.2.5 Secure Database Systems
B.2.6 Secure Networks
B.2.7 Emerging Trends
B.2.8 Impact of the Web
B.2.9 Steps to Building Secure Systems
B.3 Web Security
B.4 Building Trusted Systems from Untrusted Components
B.5 Dependable Systems
B.5.1 Introduction
B.5.2 Trust Management
B.5.3 Digital Rights Management
B.5.4 Privacy
B.5.5 Integrity, Data Quality, and High Assurance
B.6 Other Security Concerns
B.6.1 Risk Analysis
B.6.2 Biometrics, Forensics, and Other Solutions
B.7 Summary
References
APPENDIX C SECURE DATA, INFORMATION, AND KNOWLEDGE MANAGEMENT
C.1 Introduction
C.2 Secure Data Management
C.2.1 Introduction
C.2.2 Database Management
C.2.2.1 Data Model
C.2.2.2 Functions
C.2.2.3 Data Distribution
C.2.3 Heterogeneous Data Integration
C.2.4 Data Warehousing and Data Mining
C.2.5 Web Data Management
C.2.6 Security Impact
C.3 Secure Information Management
C.3.1 Introduction
C.3.2 Information Retrieval
C.3.3 Multimedia Information Management
C.3.4 Collaboration and Data Management
C.3.5 Digital Libraries
C.3.6 E-Business
C.3.7 Security Impact
C.4 Secure Knowledge Management
C.4.1 Knowledge Management
C.4.2 Security Impact
C.5 Summary
References
APPENDIX D SEMANTIC WEB
D.1 Introduction
D.2 Layered Technology Stack
D.3 XML
D.3.1 XML Statement and Elements
D.3.2 XML Attributes
D.3.3 XML DTDs
D.3.4 XML Schemas
D.3.5 XML Namespaces
D.3.6 XML Federations/Distribution
D.3.7 XML-QL, XQuery, XPath, XSLT
D.4 RDF
D.4.1 RDF Basics
D.4.2 RDF Container Model
D.4.3 RDF Specification
D.4.4 RDF Schemas
D.4.5 RDF Axiomatic Semantics
D.4.6 RDF Inferencing
D.4.7 RDF Query
D.4.8 SPARQL
D.5 Ontologies
D.6 Web Rules and SWRL
D.6.1 Web Rules
D.6.2 SWRL
D.7 Semantic Web Services
D.8 Summary
References
INDEX
Preface
Introductory Remarks

Data mining is the process of posing queries to large quantities of data and extracting information, often previously unknown, using mathematical, statistical, and machine learning techniques. Data mining has many applications in a number of areas, including marketing and sales, web and e-commerce, medicine, law, manufacturing, and, more recently, national and cyber security. For example, using data mining, one can uncover hidden dependencies between terrorist groups, as well as possibly predict terrorist events based on past experience. Furthermore, one can apply data mining techniques for targeted markets to improve e-commerce. Data mining can be applied to multimedia, including video analysis and image classification. Finally, data mining can be used in security applications, such as suspicious event detection and malicious software detection. Our previous book focused on data mining tools for applications in intrusion detection, image classification, and web surfing. In this book we focus entirely on the data mining tools we have developed for cyber security applications. In particular, it extends the work we presented in our previous book on data mining for intrusion detection. The cyber security applications we discuss are email worm detection, malicious code detection, remote exploit detection, and botnet detection. In addition, some other tools for stream mining, insider threat detection, adaptable malware detection, real-time data mining, and firewall policy analysis are discussed.
We are writing two series of books related to data management, data mining, and data security. This book is the second in our second series of books, which describes techniques and tools in detail and is co-authored with faculty and students at the University of Texas at Dallas. It has evolved from the first series of books (by single author Bhavani Thuraisingham), which currently consists of ten books. These ten books are the following: Book 1 (Data Management Systems: Evolution and Interoperation) discussed data management systems and interoperability. Book 2 (Data Mining) provided an overview of data mining concepts. Book 3 (Web Data Management and E-Commerce) discussed concepts in web databases and e-commerce. Book 4 (Managing and Mining Multimedia Databases) discussed concepts in multimedia data management, as well as text, image, and video mining. Book 5 (XML Databases and the Semantic Web) discussed high-level concepts relating to the semantic web. Book 6 (Web Data Mining and Applications in Counter-Terrorism) discussed how data mining may be applied to national security. Book 7 (Database and Applications Security), which is a textbook, discussed details of data security. Book 8 (Building Trustworthy Semantic Webs), also a textbook, discussed how semantic webs may be made secure. Book 9 (Secure Semantic Service-Oriented Systems) is on secure web services. Book 10, to be published in early 2012, is titled Building and Securing the Cloud. Our first book in Series 2 is Design and Implementation of Data Mining Tools. Our current book (which is the second book of Series 2) has evolved from Books 3, 4, 6, and 7 of Series 1 and Book 1 of Series 2. It is mainly based on the research work carried out at The University of Texas at Dallas by Dr. Mehedy Masud for his PhD thesis, with his advisor Professor Latifur Khan, and supported by the Air Force Office of Scientific Research from 2005 until now.
Background on Data Mining

Data mining is the process of posing various queries and extracting useful information, patterns, and trends, often previously unknown, from large quantities of data, possibly stored in databases. Essentially, for many organizations, the goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future based on past experiences and current trends. There is clearly a need for this technology. There are large amounts of current and historical data being stored. Therefore, as databases become larger, it becomes increasingly difficult to support decision making. In addition, the data could be from multiple sources and multiple domains. There is a clear need to analyze the data to support planning and other functions of an enterprise.
Some of the data mining techniques include those based on statistical reasoning techniques, inductive logic programming, machine learning, fuzzy sets, and neural networks, among others. The data mining problems include classification (finding rules to partition data into groups), association (finding rules to make associations between data), and sequencing (finding rules to order data). Essentially, one arrives at some hypothesis, which is the information extracted from examples and patterns observed. These patterns are observed from posing a series of queries; each query may depend on the responses obtained from the previous queries posed.

Data mining is an integration of multiple technologies. These include data management such as database management, data warehousing, statistics, machine learning, decision support, and others such as visualization and parallel computing. There is a series of steps involved in data mining. These include getting the data organized for mining, determining the desired outcomes to mining, selecting tools for mining, carrying out the mining process, pruning the results so that only the useful ones are considered further, taking actions from the mining, and evaluating the actions to determine benefits. There are various types of data mining. By this we do not mean the actual techniques used to mine the data, but what the outcomes will be. These outcomes have also been referred to as data mining tasks. These include clustering, classification, anomaly detection, and forming associations.
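To make the association task concrete, here is a minimal sketch (not from the book; the toy transactions, item names, and thresholds are invented for illustration) that mines pairwise association rules by computing support (how often a pair of items occurs together) and confidence (how often the consequent occurs given the antecedent):

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def association_rules(db, min_support=0.4, min_confidence=0.6):
    """Enumerate rules {lhs} -> {rhs} over item pairs that clear both thresholds."""
    items = set().union(*db)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            pair_support = support({lhs, rhs}, db)
            if pair_support < min_support:
                continue
            confidence = pair_support / support({lhs}, db)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

for lhs, rhs, s, c in association_rules(transactions):
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

Real association rule miners (e.g., Apriori, covered later under ARM in Chapter 2) avoid this brute-force enumeration by pruning itemsets whose subsets already fail the support threshold.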
Although several developments have been made, many challenges remain. For example, because of the large volumes of data, how can the algorithms determine which technique to select and what type of data mining to do? Furthermore, the data may be incomplete, inaccurate, or both. At times there may be redundant information, and at times there may not be sufficient information. It is also desirable to have data mining tools that can switch to multiple techniques and support multiple outcomes. Some of the current trends in data mining include mining web data, mining distributed and heterogeneous databases, and privacy-preserving data mining, where one ensures that one can get useful results from mining and at the same time maintain the privacy of the individuals.
Data Mining for Cyber Security

Data mining has applications in cyber security, which involves protecting the data in computers and networks. The most prominent application is in intrusion detection. For example, our computers and networks are being intruded on by unauthorized individuals. Data mining techniques, such as those for classification and anomaly detection, are being used extensively to detect such unauthorized intrusions. For example, data about normal behavior is gathered, and when something occurs out of the ordinary, it is flagged as an unauthorized intrusion. Normal behavior could be that John's computer is never used between 2 am and 5 am. When John's computer is in use, say at 3 am, this is flagged as an unusual pattern.
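The example above can be sketched as a minimal profile-based anomaly detector; the training log, function names, and the "never seen" rule below are hypothetical simplifications for illustration:

```python
from collections import Counter

# Hypothetical training log: hour of day (0-23) of each observed login session.
normal_logins = [8, 9, 9, 10, 13, 14, 17, 18, 19, 21, 9, 10, 14, 18]

def build_profile(hours, min_count=1):
    """Profile of normal behavior: hours seen at least `min_count` times."""
    counts = Counter(hours)
    return {h for h, c in counts.items() if c >= min_count}

def is_anomalous(hour, profile):
    """Flag an event whose hour never occurred during the training period."""
    return hour not in profile

profile = build_profile(normal_logins)
print(is_anomalous(3, profile))   # 3 am login, never seen in training: True
print(is_anomalous(9, profile))   # 9 am login, a normal working hour: False
```

A deployed detector would of course use richer features and statistical thresholds rather than simple set membership, but the structure (learn a profile of normal activity, flag departures from it) is the same.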
Data mining is also being applied for other applications in cyber security, such as auditing, email worm detection, botnet detection, and malware detection. Here again, data on normal database access is gathered, and when something unusual happens, this is flagged as a possible access violation. Data mining is also being used for biometrics. Here, pattern recognition and other machine learning techniques are being used to learn the features of a person and then to authenticate the person based on the features.
However, one of the limitations of using data mining for malware detection is that the malware may change patterns. Therefore, we need tools that can detect adaptable malware. We also discuss this aspect in our book.
Organization of This Book

This book is divided into seven parts. Part I, which consists of four chapters, provides some background information on the data mining techniques and applications that have influenced our tools; these chapters also provide an overview of malware. Parts II, III, IV, and V describe our tools for email worm detection, malicious code detection, remote exploit detection, and botnet detection, respectively. Part VI describes our tools for stream data mining. In Part VII, we discuss data mining for emerging applications, including adaptable malware detection, insider threat detection, and firewall policy analysis, as well as real-time data mining. We have four appendices that provide some of the background knowledge in data management, secure systems, and the semantic web.
Concluding Remarks

Data mining applications are exploding. Yet many books, including some of the authors' own books, have discussed concepts at a high level. Some books have made the topic very theoretical. However, data mining approaches depend on nondeterministic reasoning as well as heuristic approaches. Our first book on the design and implementation of data mining tools provided step-by-step information on how data mining tools are developed. This book continues with this approach in describing our data mining tools.
For each of the tools we have developed, we describe the system architecture, the algorithms, and the performance results, as well as the limitations of the tools. We believe that this is one of the few books that will help tool developers as well as technologists and managers. It describes algorithms as well as the practical aspects. For example, technologists can decide on the tools to select for a particular application. Developers can focus on alternative designs if an approach is not suitable. Managers can decide whether to proceed with a data mining project. This book will be a very valuable reference guide for those in industry, government, and academia, as it focuses on both concepts and practical techniques. Experimental results are also given. The book will also be used as a textbook at The University of Texas at Dallas in courses on data mining and data security.
Acknowledgments
We are especially grateful to the Air Force Office of Scientific Research for funding our research on malware detection. In particular, we would like to thank Dr. Robert Herklotz for his encouragement and support for our work. Without his support for our research, this book would not have been possible.
We are also grateful to the National Aeronautics and Space Administration for funding our research on stream mining. In particular, we would like to thank Dr. Ashok Agrawal for his encouragement and support.
We thank our colleagues and collaborators who have worked with us on data mining tools for malware detection. Our special thanks go to the following colleagues:
Prof. Peng Liu and his team at Penn State University, for collaborating with us on Data Mining for Remote Exploits (Part IV)
Prof. Jiawei Han and his team at the University of Illinois, for collaborating with us on Stream Data Mining (Part VI)
Prof. Kevin Hamlen at the University of Texas at Dallas, for collaborating with us on Data Mining for Active Defense (Chapter 21)
Our student Dr. M. Farhan Husain, for collaborating with us on Insider Threat Detection (Chapter 22)
Our colleagues Prof. Chris Clifton (Purdue University), Dr. Marion Ceruti (Department of the Navy), and Mr. John Maurer (MITRE), for collaborating with us on Real-Time Data Mining (Chapter 23)
Our students Muhammad Abedin and Syeda Nessa, for collaborating with us on Firewall Policy Analysis (Chapter 24)
The Authors
Mehedy Masud is a postdoctoral fellow at The University of Texas at Dallas (UTD), where he earned his PhD in computer science in December 2009. He has published in premier journals and conferences, including IEEE Transactions on Knowledge and Data Engineering and the IEEE International Conference on Data Mining. He will be appointed as a research assistant professor at UTD in Fall 2012. Masud's research projects include reactively adaptive malware; data mining for detecting malicious executables, botnets, and remote exploits; and cloud data mining. He has a patent pending on stream mining for novel class detection.
Latifur Khan is an associate professor in the computer science department at The University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He received his PhD and MS degrees in computer science from the University of Southern California in August 2000 and December 1996, respectively. Khan is (or has been) supported by grants from NASA, the National Science Foundation (NSF), the Air Force Office of Scientific Research (AFOSR), Raytheon, NGA, IARPA, Tektronix, Nokia Research Center, Alcatel, and the SUN academic equipment grant program. In addition, Khan is the director of the state-of-the-art DML@UTD (UTD Data Mining/Database Laboratory), which is the primary center of research related to data mining, the semantic web, and image/video annotation at The University of Texas at Dallas. Khan has published more than 100 papers, including articles in several IEEE Transactions journals, the Journal of Web Semantics, and the VLDB Journal, and conference proceedings such as IEEE ICDM and PKDD. He is a senior member of IEEE.
Bhavani Thuraisingham joined The University of Texas at Dallas (UTD) in October 2004 as a professor of computer science and director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science, and is currently the Louis Beecherl, Jr. Distinguished Professor. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science), and the BCS (British Computer Society), for her work in data security. She received the IEEE Computer Society's prestigious 1997 Technical Achievement Award for "outstanding and innovative contributions to secure data management." Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) assignment at the National Science Foundation as Program Director for Data and Applications Security. Her work in information security and information management has resulted in more than 100 journal articles, more than 200 refereed conference papers, more than 90 keynote addresses, and 3 US patents. She is the author of ten books in data management, data mining, and data security.
Copyright Permissions
Figure 2.12, Figure 2.13

B. Thuraisingham, K. Hamlen, V. Mohan, M. Masud, L. Khan, Exploiting an antivirus interface, in Computer Standards & Interfaces, Vol. 31, No. 6, pp. 1182–1189, 2009, with permission from Elsevier.
Figure 7.4, Table 8.1, Table 8.2, Table 8.3, Table 8.4, Figure 8.2, Table 8.5, Table 8.6, Table 8.7, Table 8.8

B. Thuraisingham, M. Masud, L. Khan, Email worm detection using data mining, International Journal of Information Security and Privacy, 1(4), 47–61. Copyright 2007, IGI Global, www.igi-global.com.
Figure 23.2, Figure 23.3, Figure 23.4, Figure 23.5, Figure 23.6, Figure 23.7

L. Khan, C. Clifton, J. Maurer, M. Ceruti, Dependable real-time data mining, Proceedings ISORC 2005, pp. 158–165. © 2005 IEEE.
Figure 22.3

M. Farhan Husain, L. Khan, M. Kantarcioglu, Data intensive query processing for large RDF graphs using cloud computing tools, IEEE Cloud Computing, Miami, FL, July 2010, pp. 1–10. © 2005 IEEE.
Figure 15.2, Table 16.1, Figure 16.2, Table 16.2, Figure 16.4, Table 17.1, Table 17.2, Figure 17.2, Figure 17.3

M. Masud, T. Al-khateeb, L. Khan, K. Hamlen, Flow-based identification of botnet traffic by mining multiple log files, in Proceedings of the International Conference on Distributed Frameworks & Applications (DFMA), Penang, Malaysia, Oct. 2008, pp. 200–206. © 2005 IEEE.
Figure 10.2, Figure 10.3, Table 11.1, Table 11.2, Figure 11.2, Table 11.3, Table 11.4, Table 11.5, Table 11.6, Table 11.7, Table 11.8, Table 11.9

M. Masud, L. Khan, A scalable multi-level feature extraction technique to detect malicious executables, Information Systems Frontiers (Springer Netherlands), 10(1), 33–45, March 2008. © 2008 Springer. With kind permission of Springer Science+Business Media.
Figure 13.2, Figure 13.3, Table 14.1, Figure 14.2, Table 14.2, Figure 14.3

M. Masud, L. Khan, X. Wang, P. Liu, S. Zhu, Detecting remote exploits using data mining, Proceedings IFIP Digital Forensics Conference, Kyoto, January 2008, pp. 177–189. © 2008 Springer. With kind permission of Springer Science+Business Media.
Figure 19.2, Figure 19.3, Figure 20.2, Table 20.1, Figure 20.3, Table 20.2

M. Masud, J. Gao, L. Khan, J. Han, Integrating novel class detection with classification for concept-drifting data streams, ECML PKDD '09: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, September 2009, pp. 79–94, Springer-Verlag, Berlin, Heidelberg. © 2009. With kind permission of Springer Science+Business Media.
1
INTRODUCTION
1.1 Trends

Data mining is the process of posing various queries and extracting useful, and often previously unknown and unexpected, information, patterns, and trends from large quantities of data, generally stored in databases. These data could be accumulated over a long period of time, or they could be large datasets accumulated simultaneously from heterogeneous sources such as different sensor types. The goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future based on past experiences and current trends. There is clearly a need for this technology for many applications in government and industry. For example, a marketing organization may need to determine who their potential customers are. There are large amounts of current and historical data being stored. Therefore, as databases become larger, it becomes increasingly difficult to support decision-making. In addition, the data could be from multiple sources and multiple domains. There is a clear need to analyze the data to support planning and other functions of an enterprise.
Data mining has evolved from multiple technologies, including data management, data warehousing, machine learning, and statistical reasoning. One of the major challenges in the development of data mining tools is to eliminate false positives and false negatives. Much progress has also been made on building data mining tools based on a variety of techniques for numerous applications. These applications include those for marketing and sales, healthcare, medical, financial, e-commerce, multimedia, and, more recently, security.
Our previous books have discussed various data mining technologies, techniques, tools, and trends. In a recent book, our main focus was on the design and development of, as well as the results obtained from, the three tools that we developed between 2004 and 2006. These tools include one for intrusion detection, one for web page surfing prediction, and one for image classification. In this book we continue with the descriptions of data mining tools we have developed over the past five years for cyber security. In particular, we discuss our tools for malware detection.
Malware, also known as malicious software, is developed by hackers to steal data and identity, cause harm to computers, and deny legitimate services to users, among other things. Malware has plagued society and the software industry for almost four decades. Malware includes viruses, worms, Trojan horses, time and logic bombs, botnets, and spyware. In this book we describe our data mining tools for malware detection.
The organization of this chapter is as follows. Supporting technologies are discussed in Section 1.2; these supporting technologies are elaborated in Part I. The tools that we discuss in this book are summarized in Sections 1.3 through 1.8. These tools include data mining for email worm detection, remote exploits detection, malicious code detection, and botnet detection. In addition, we discuss our stream data mining tool, as well as our approaches for insider threat detection, adaptable malware detection, real-time data mining for suspicious event detection, and firewall policy management. Each of these tools and approaches is discussed in Parts II through VII. The contents of this book are summarized in Section 1.9 of this chapter, and next steps are discussed in Section 1.10.
1.2 Data Mining and Security Technologies

Data mining techniques have exploded over the past decade, and we now have tools and products for a variety of applications. In Part I, we discuss the data mining techniques that we describe in this book, as well as provide an overview of the applications we discuss. Data mining techniques include those based on machine learning, statistical reasoning, and mathematics. Some of the popular techniques include association rule mining, decision trees, and K-means clustering. Figure 1.1 illustrates the data mining techniques.
Data mining has been used for numerous applications in several fields, including healthcare, e-commerce, and security. We focus on data mining for cyber security applications.
Figure 1.1 Data mining techniques

Figure 1.2 Malware
While data mining technologies have exploded over the past two decades, the developments in information technologies have resulted in an increasing need for security. As a result, there is now an urgent need to develop secure systems. However, as systems are being secured, malware technologies have also exploded. Therefore, it is critical that we develop tools for detecting and preventing malware. Various types of malware are illustrated in Figure 1.2.
In this book we discuss data mining for malware detection. In particular, we discuss techniques such as support vector machines, clustering, and classification for cyber security applications. The tools we have developed are illustrated in Figure 1.3.
1.3 Data Mining for Email Worm Detection

An email worm spreads through infected email messages. The worm may be carried by an attachment, or the email may contain links to an infected website. When the user opens the attachment or clicks the link, the host gets infected immediately. The worm exploits the vulnerable email software in the host machine to send infected emails to addresses stored in the address book; thus, new machines get infected. Worms damage computers and users in various ways: they may clog network traffic, damage the system, and make the system unstable or even unusable.
Figure 1.3 Data mining tools for malware detection
We have developed tools that apply data mining techniques for email worm detection. We use both Support Vector Machine (SVM) and Naïve Bayes (NB) data mining techniques. Our tools are described in Part II of the book.
1.4 Data Mining for Malicious Code Detection

Malicious code is a great threat to computers and society. Numerous kinds of malicious code wander in the wild. Some of them are mobile, such as worms, and spread through the Internet, causing damage to millions of computers worldwide. Other kinds of malicious code are static, such as viruses, but are sometimes deadlier than their mobile counterparts.
One popular technique followed by the antivirus community to detect malicious code is "signature detection." This technique matches executables against a unique telltale string or byte pattern called a signature, which is used as an identifier for a particular malicious code. However, such techniques are not effective against "zero-day" attacks. A zero-day attack is an attack whose pattern is previously unknown. We are developing a number of data mining tools for malicious code detection that do not depend on the signature of the malware. Our hybrid feature retrieval model is described in Part III of this book.
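As a sketch of the idea (the byte patterns and detection names below are invented for illustration, not real signatures), a signature scanner simply searches an executable's bytes for known telltale patterns:

```python
# Minimal sketch of signature detection. The signature database below is
# hypothetical; real antivirus engines use large, curated signature sets.
SIGNATURES = {
    b"\xde\xad\xbe\xef\x13\x37": "Example.Trojan-A",
    b"EVIL_PAYLOAD_MARKER": "Example.Worm-B",
}

def scan(executable_bytes: bytes) -> list:
    """Return the names of all known signatures found in the byte stream."""
    return [name for sig, name in SIGNATURES.items() if sig in executable_bytes]

print(scan(b"header" + b"EVIL_PAYLOAD_MARKER" + b"rest"))  # ['Example.Worm-B']
print(scan(b"previously unseen malicious bytes"))          # []
```

The second call shows the limitation described above: a zero-day sample matches nothing in the database and therefore goes undetected.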
1.5 Data Mining for Detecting Remote Exploits

Remote exploits are a popular means for attackers to gain control of hosts that run vulnerable services or software. Typically, a remote exploit is provided as an input to a remote vulnerable service to hijack the control-flow of machine-instruction execution. Sometimes the attackers inject executable code in the exploit that is executed after a successful hijacking attempt. We refer to these code-carrying remote exploits as exploit code.
We are developing a number of data mining tools for detecting remote exploits. Our tools use different classification models, such as Support Vector Machine (SVM), Naïve Bayes (NB), and decision trees. These tools are described in Part IV of this book.
1.6 Data Mining for Botnet Detection

Botnets are a serious threat because of their volume and power. Botnets containing thousands of bots (compromised hosts) are controlled from a Command and Control (C&C) center, operated by a human botmaster or botherder. The botmaster can instruct these bots to recruit new bots, launch coordinated distributed denial of service (DDoS) attacks against specific hosts, steal sensitive information from infected machines, send mass spam emails, and so on.
We have developed data mining tools for botnet detection. Our tools use Support Vector Machine (SVM), Bayes Net, decision tree (J48), Naïve Bayes, and Boosted decision tree (Boosted J48) for the classification task. These tools are described in Part V of this book.
1.7 Stream Data Mining

Stream data are quite common. They include video data, surveillance data, and financial data that arrive continuously. There are several problems related to stream data classification. First, it is impractical to store and use all the historical data for training, because that would require infinite storage and running time. Second, there may be concept-drift in the data, meaning the underlying concept of the data may change over time. Third, novel classes may evolve in the stream.
We have developed stream mining techniques for detecting novel classes. We believe that these techniques could be used for detecting novel malware. Our tools for stream mining are described in Part VI of this book.
1.8 Emerging Data Mining Tools for Cyber Security Applications

In addition to the tools described in Sections 1.3 through 1.7, we are also exploring techniques for (a) detecting malware that reacts and adapts to the environment, (b) insider threat detection, (c) real-time data mining, and (d) firewall policy management.
For malware that adapts, we are exploring stream mining techniques. For insider threat detection, we are applying graph mining techniques. We are exploring real-time data mining to detect malware in real time. Finally, we are exploring the use of association rule mining techniques for ensuring that the numerous firewall policies are consistent. These techniques are described in Part VII of this book.
1.9 Organization of This Book

This book is divided into seven parts. Part I consists of this introductory chapter and four additional chapters. Chapter 2 provides some background information on the data mining techniques and applications that have influenced our research and tools. Chapter 3 describes types of malware. In Chapter 4 we provide an overview of data mining for security applications. The tools we described in our previous book are discussed in Chapter 5. We discuss these three tools because many of the tools we discuss in this current book have been influenced by our early tools.
Part II consists of three chapters, 6, 7, and 8, which describe our tool for email worm detection. An overview of email worm detection is given in Chapter 6. Our tool is discussed in Chapter 7. Evaluation and results are discussed in Chapter 8. Part III consists of three chapters, 9, 10, and 11, and describes our tool for malicious code detection. An overview of malicious code detection is given in Chapter 9. Our tool is discussed in Chapter 10. Evaluation and results are discussed in Chapter 11. Part IV consists of three chapters, 12, 13, and 14, and describes our tool for detecting remote exploits. An overview of detecting remote exploits is given in Chapter 12. Our tool is discussed in Chapter 13. Evaluation and results are discussed in Chapter 14. Part V consists of three chapters, 15, 16, and 17, and describes our tool for botnet detection. An overview of botnet detection is given in Chapter 15. Our tool is discussed in Chapter 16. Evaluation and results are discussed in Chapter 17. Part VI consists of three chapters, 18, 19, and 20, and describes our tool for stream mining. An overview of stream mining is given in Chapter 18. Our tool is discussed in Chapter 19. Evaluation and results are discussed in Chapter 20. Part VII consists of four chapters, 21, 22, 23, and 24, and describes our tools for emerging applications. Our approach to detecting adaptive malware is discussed in Chapter 21. Our approach to insider threat detection is discussed in Chapter 22. Real-time data mining is discussed in Chapter 23. Our firewall policy management tool is discussed in Chapter 24.
The book concludes with Chapter 25. Appendix A provides an overview of data management and describes the relationship between our books. Appendix B describes trustworthy systems. Appendix C describes secure data, information, and knowledge management, and Appendix D describes semantic web technologies. The appendices, together with the supporting technologies described in Part I, provide the necessary background to understand the content of this book.
We have essentially developed a three-layer framework to explain the concepts in this book. This framework is illustrated in Figure 1.4. Layer 1 is the data mining techniques layer. Layer 2 is our tools layer. Layer 3 is the applications layer. Figure 1.5 illustrates how Chapters 2 through 24 of this book are placed in the framework.
1.10 Next Steps

This book provides the information for a reader to get familiar with data mining concepts and to understand how the techniques are applied, step by step, to some real-world applications in malware detection. One of the main contributions of this book is raising awareness of the importance of data mining for a variety of applications in cyber security. This book could be used as a guide to build data mining tools for cyber security applications.
Figure 1.4 Framework for data mining tools
We provide many references that can help the reader understand the details of the problems we are investigating. Our advice to the reader is to keep up with developments in data mining, get familiar with the tools and products, and apply them to a variety of applications. The reader will then have a better understanding of the limitations of the tools and be able to determine when new tools have to be developed.
Figure 1.5 Contents of the book with respect to the framework
PART I
DATA MINING AND SECURITY
Introduction to Part I: Data Mining and Security

Supporting technologies for data mining for malware detection include data mining and malware technologies. Data mining is the process of analyzing data and uncovering hidden dependencies. The outcomes of data mining include classification, clustering, forming associations, as well as detecting anomalies. Malware technologies are being developed at a rapid pace; these include worms, viruses, and Trojan horses.
Part I, consisting of five chapters, discusses supporting technologies for data mining for malware detection. Chapter 1 provides a brief overview of data mining and malware. In Chapter 2, we discuss the data mining techniques we have utilized in our tools; specifically, we present the Markov model, support vector machines, artificial neural networks, and association rule mining. In Chapter 3, we discuss various types of malware, including worms, viruses, and Trojan horses. In Chapter 4, we discuss data mining for security applications; in particular, we discuss the threats to computers and networks and describe the applications of data mining to detect such threats and attacks. Some of our current research at The University of Texas at Dallas is also discussed. In Chapter 5, we discuss the three applications we considered in our previous book on the design and implementation of data mining tools. These tools have influenced the work discussed in this book a great deal. In particular, we discuss intrusion detection, web surfing prediction, and image classification tools.
2
DATA MINING TECHNIQUES
2.1 Introduction

Data mining outcomes (also called tasks) include classification, clustering, forming associations, as well as detecting anomalies. Our tools have mainly focused on classification as the outcome, and we have developed classification tools. The classification problem is also referred to as supervised learning, in which a set of labeled examples is learned by a model, and then a new example with an unknown label is presented to the model for prediction.
Many prediction models have been used, such as the Markov model, decision trees, artificial neural networks, support vector machines, association rule mining, and many others. Each of these models has strengths and weaknesses. However, they share a common weakness: no single technique suits all applications. The reason that there is no ideal or perfect classifier is that each of these techniques was initially designed to solve a specific problem under certain assumptions.
In this chapter we discuss the data mining techniques we have utilized in our tools. Specifically, we present the Markov model, support vector machines, artificial neural networks, association rule mining, and the problem of multi-classification, as well as image classification, which is an aspect of image mining. These techniques are also used in developing and comparing results in Parts II, III, and IV. In our research and development, we propose hybrid models to improve the prediction accuracy of data mining algorithms in various applications, namely intrusion detection, WWW prediction, and image classification.
The organization of this chapter is as follows. In Section 2.2 we provide an overview of various data mining tasks and techniques. The techniques that are relevant to the contents of this book are discussed in Sections 2.2 through 2.7; in particular, neural networks, support vector machines, Markov models, and association rule mining, as well as some other classification techniques, are described. The chapter is summarized in Section 2.8.
2.2 Overview of Data Mining Tasks and Techniques

Before we discuss data mining techniques, we provide an overview of some of the data mining tasks (also known as data mining outcomes); then we discuss the techniques. In general, data mining tasks can be grouped into two categories: predictive and descriptive. Predictive tasks essentially predict whether an item belongs to a class or not. Descriptive tasks, in general, extract patterns from the examples. One of the most prominent predictive tasks is classification. In some cases, other tasks, such as anomaly detection, can be reduced to a predictive task, such as determining whether a particular situation is an anomaly or not. Descriptive tasks, in general, include making associations and forming clusters. Therefore, classification, anomaly detection, making associations, and forming clusters are all thought of as data mining tasks.

Next, the data mining techniques can be predictive, descriptive, or both. For example, neural networks can perform classification as well as clustering. Classification techniques include decision trees and support vector machines, as well as memory-based reasoning. Association rule mining techniques are used, in general, to make associations. Link analysis, which analyzes links, can also make associations between links and predict new links. Clustering techniques include K-means clustering. An overview of the data mining tasks (i.e., the outcomes of data mining) is illustrated in Figure 2.1. The techniques discussed in this book (e.g., neural networks, support vector machines) are illustrated in Figure 2.2.
2.3 Artificial Neural Network
Figure 2.1 Data mining tasks
Figure 2.2 Data mining techniques
An artificial neural network (ANN) is a well-known, powerful, and robust classification technique that has been used to approximate real-valued, discrete-valued, and vector-valued functions from examples. ANNs have been used in many areas, such as interpreting visual scenes, speech recognition, and learning robot control strategies. An artificial neural network simulates the biological nervous system in the human brain. Such a nervous system is composed of a large number of highly interconnected processing units (neurons) working together to produce our feelings and reactions. ANNs, like people, learn by example. The learning process in a human brain involves adjustments to the synaptic connections between neurons; similarly, the learning process of an ANN involves adjustments to the node weights. Figure 2.3 presents a simple neuron unit, which is called a perceptron. The perceptron input x is a vector of real-valued inputs, and w is the weight vector, whose values are determined by training. The perceptron computes a linear combination of the input vector x as follows (Eq. 2.1).
Figure 2.3 The perceptron
Notice that wi corresponds to the contribution of the input vector component xi to the perceptron output. Also, in order for the perceptron to output a 1, the weighted combination of the inputs must be greater than the threshold w0.

Learning the perceptron involves choosing values for the weights w0, w1, …, wn. Initially, random weight values are given to the perceptron. Then the perceptron is applied to each training example, updating the weights whenever an example is misclassified. This process is repeated many times until all training examples are correctly classified. The weights are updated according to the following rule (Eq. 2.2):

wi ← wi + η(t − o)xi

where η is a learning constant, o is the output computed by the perceptron, and t is the target output for the current training example.
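This rule can be implemented in a few lines of plain Python; the AND dataset, learning rate, and epoch limit below are our own illustrative choices, not values from the text:

```python
# Minimal perceptron trained with the update rule of Eq. 2.2:
#   wi <- wi + eta * (t - o) * xi, applied on every misclassified example.

def perceptron_train(examples, eta=0.1, max_epochs=100):
    """examples: list of (x, t) pairs with targets t in {0, 1}.
    Returns weights [w0, w1, ..., wn], where w0 plays the threshold role."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                    # w[0] is the bias/threshold weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0
            if o != t:                     # update only on a misclassification
                w[0] += eta * (t - o)
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
                errors += 1
        if errors == 0:                    # all training examples classified
            break
    return w

def predict(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0

# Learn the logical AND function, which is linearly separable
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(data)
print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop terminates with all four examples correctly classified, illustrating the convergence behavior described above.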
The computational power of a single perceptron is limited to linear decisions. However, the perceptron can be used as a building block to compose powerful multi-layer networks. In this case, a more complicated updating rule is needed to train the network weights. In this work we employ an artificial neural network of two layers, where each layer is composed of three building blocks (see Figure 2.4). We use the back propagation algorithm for learning the weights; the back propagation algorithm attempts to minimize the squared error function.
Figure 2.4 Artificial neural network

Figure 2.5 The design of the ANN used in our implementation
A typical training example in WWW prediction is ⟨[kt−τ+1, …, kt−1, kt]T, d⟩, where [kt−τ+1, …, kt−1, kt]T is the input to the ANN and d is the target web page. Notice that the input units of the ANN in Figure 2.5 are the τ previous pages that the user has recently visited, where k is a web page id. The output of the network is a boolean value, not a probability. We will see later how to approximate the probability of the output by fitting a sigmoid function to the ANN output. The approximated probabilistic output becomes o′ = f(o(I)) = pt+1, where I is an input session and pt+1 = p(d|kt−τ+1, …, kt). We choose the sigmoid function (Eq. 2.3) as a transfer function so that the ANN can handle a non-linearly separable dataset [Mitchell 1997]. Notice that in our ANN design (Figure 2.5), we use the sigmoid transfer function of Eq. 2.3 in each building block. In Eq. 2.3, I is the input to the network, O is the output of the network, W is the matrix of weights, and σ is the sigmoid function.
We implement the back propagation algorithm for training the weights. The back propagation algorithm employs gradient descent to attempt to minimize the squared error between the network output values and the target values of these outputs. The sum of the error over all of the network output units is defined in Eq. 2.4, where outputs is the set of output units in the network, D is the training set, and tik and oik are the target and output values associated with the ith output unit and training example k. For a specific weight wji in the network, it is updated for each training example as in Eq. 2.5, where η is the learning rate and wji is the weight associated with the ith input to network unit j (for details, see [Mitchell 1997]). As we can see from Eq. 2.5, the search direction δw is computed using gradient descent, which guarantees convergence only toward a local minimum. To mitigate this, we add a momentum term to the weight update rule, such that the weight update direction δwji(n) depends partially on the update direction of the previous iteration, δwji(n − 1). The new weight update direction is shown in Eq. 2.6, where n is the current iteration and α is the momentum constant. Notice that in Eq. 2.6 the step size is slightly larger than in Eq. 2.5; this contributes to a smooth convergence of the search in regions where the gradient is unchanging [Mitchell 1997].

In our implementation, we set the step size η dynamically, based on the distribution of the classes in the dataset. Specifically, we set the step size to large values when updating on training examples that belong to low-distribution classes, and vice versa. This is because when the distribution of the classes in the dataset varies widely (e.g., a dataset might have 5% positive examples and 95% negative examples), the network weights converge toward the examples from the class with the larger distribution, which causes slow convergence. Furthermore, we adjust the learning rates slightly by applying the momentum constant (Eq. 2.6) to speed up the convergence of the network [Mitchell 1997].
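The following sketch illustrates the momentum idea on a single linear unit, a deliberate simplification of the full multi-layer update of Eqs. 2.5 and 2.6; the values of η, α, and the toy data are illustrative assumptions:

```python
# Gradient descent with momentum on a single linear unit o = w * x,
# minimizing squared error. The previous update direction is folded into
# the current one: delta_w(n) = eta * (t - o) * x + alpha * delta_w(n - 1).

def train_linear_unit(samples, eta=0.05, alpha=0.5, epochs=500):
    w, prev_dw = 0.0, 0.0
    for _ in range(epochs):
        for x, t in samples:
            o = w * x
            dw = eta * (t - o) * x + alpha * prev_dw  # momentum term reuses
            w += dw                                   # the previous direction
            prev_dw = dw
    return w

# The samples follow t = 2x, so the weight should settle close to 2
w = train_linear_unit([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The momentum term smooths the trajectory where the gradient is unchanging, as the text notes, at the cost of one extra stored value per weight.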
2.4 Support Vector Machines

Support vector machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory. This learning strategy, introduced by Vapnik [1995, 1998, 1999; see also Cristianini and Shawe-Taylor 2000], is a very powerful method that has been applied in a wide variety of applications. The basic concept in SVM is the hyper-plane classifier, or linear separability. To achieve linear separability, SVM applies two basic ideas: margin maximization and kernels, that is, mapping the input space to a higher-dimensional feature space.
For binary classification, the SVM problem can be formalized as in Eq. 2.7. Suppose we have N training data points (x1, y1), (x2, y2), …, (xN, yN), where xi ∈ Rd and yi ∈ {+1, −1}. We would like to find a linear separating hyper-plane classifier as in Eq. 2.8. Furthermore, we want this hyper-plane to have the maximum separating margin with respect to the two classes (see Figure 2.6). The functional margin, or the margin for short, is defined geometrically as the Euclidean distance of the closest point from the decision boundary in the input space. Figure 2.7 gives an intuitive explanation of why margin maximization gives the best solution for separation. In part (a) of Figure 2.7, we can find an infinite number of separators for a specific dataset; there is no specific or clear reason to favor one separator over another. In part (b), we see that maximizing the margin provides only one thick separator. Such a solution achieves the best generalization accuracy, that is, prediction for the unseen [Vapnik 1995, 1998, 1999].
Figure 2.6 Linear separation in SVM

Figure 2.7 The SVM separator that achieves the maximum margin
Notice that Eq. 2.8 computes the sign of the functional margin of point x in addition to the prediction label of x; that is, the functional margin of x equals w·x − b.
The SVM optimization problem is a convex quadratic programming problem (in w, b) in a convex set (Eq. 2.7). We can solve the Wolfe dual instead, as in Eq. 2.9, with respect to α, subject to the constraints that the gradient of L(w, b, α) with respect to the primal variables w and b vanishes and that αi ≥ 0. The primal variables are eliminated from L(w, b, α) (see [Cristianini and Shawe-Taylor 2000] for more details). When we solve for αi, we obtain w = Σi αi yi xi, and we can classify a new object x using Eq. 2.10. Note that the training vectors occur only in the form of a dot product, and that there is a Lagrangian multiplier αi for each training point, which reflects the importance of the data point. When the maximal margin hyper-plane is found, only the points that lie closest to the hyper-plane will have αi > 0, and these points are called support vectors. All other points will have αi = 0 (see Figure 2.8a). This means that only those points that lie closest to the hyper-plane give the representation of the hypothesis classifier. These most important data points serve as support vectors; their values can also be used to give an independent bound on the reliability of the hypothesis classifier [Bartlett and Shawe-Taylor 1999].

Figure 2.8a shows two classes and their boundaries, that is, margins. The support vectors are represented by solid objects, while the empty objects are non-support vectors. Notice that the margins are affected only by the support vectors; that is, if we remove or add empty objects, the margins will not change, whereas any change in the solid objects, either adding or removing objects, could change the margins. Figure 2.8b shows the effect of adding objects in the margin area. As we can see, adding or removing objects far from the margins, for example data point 1 or −2, does not change the margins. However, adding and/or removing objects near the margins, for example data point 2 and/or −1, creates new margins.
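To make the dual classifier of Eq. 2.10 concrete, consider a hand-worked toy problem of our own (not from the text) with just two training points: (0, 0) labeled −1 and (2, 2) labeled +1. Both points are support vectors, and the optimal multipliers work out to α1 = α2 = 0.25 with b = 1:

```python
# Classify with f(x) = sign( sum_i alpha_i * y_i * <x_i, x> - b ),
# the form of Eq. 2.10, using analytically known alphas for a toy problem.
support_vectors = [((0.0, 0.0), -1), ((2.0, 2.0), +1)]  # (x_i, y_i)
alphas = [0.25, 0.25]   # both training points are support vectors (alpha > 0)
b = 1.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def svm_classify(x):
    score = sum(a * y * dot(sv, x)
                for a, (sv, y) in zip(alphas, support_vectors)) - b
    return +1 if score >= 0 else -1

print(svm_classify((3.0, 3.0)))   # 1, on the positive side
print(svm_classify((-1.0, 0.0)))  # -1, on the negative side
```

Here w = Σi αi yi xi = (0.5, 0.5), so the decision boundary is x1 + x2 = 2, the perpendicular bisector of the two support vectors, and each lies at functional margin exactly ±1.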
Figure 2.8 (a) The α values of support vectors and non-support vectors. (b) The effect of adding new data points on the margins.
2.5 Markov Model

Some recent and advanced predictive methods for web surfing have been developed using Markov models [Pirolli et al. 1996], [Yang et al. 2001]. For these predictive models, the sequences of web pages visited by surfers are typically considered as Markov chains, which are then fed as input. The basic concept of the Markov model is that it predicts the next action depending on the result of the previous action or actions. Actions can mean different things for different applications; for the purpose of illustration, we will consider actions specific to the WWW prediction application. In WWW prediction, the next action corresponds to predicting the next page to be traversed, and the previous actions correspond to the previous web pages already visited. Based on the number of previous actions considered, the Markov model can have different orders.
The zeroth-order Markov model is the unconditional probability of the state (or web page) (Eq. 2.11). In Eq. 2.11, Pk is a web page and Sk is the corresponding state. The first-order Markov model (Eq. 2.12) can be computed by taking the page-to-page transitional probabilities, or the n-gram probabilities, of ⟨P1, P2⟩, ⟨P2, P3⟩, …, ⟨Pk−1, Pk⟩.
In the following, we present an illustrative example of the different orders of the Markov model and how each is used for prediction.
Example: Imagine a web site of six web pages: P1, P2, P3, P4, P5, and P6. Suppose we have user sessions as in Table 2.1, which depicts the navigation of many users of that web site. Figure 2.9 shows the first-order Markov model, where the next action is predicted based only on the last action performed, i.e., the last page traversed by the user. States S and F correspond to the initial and final states, respectively. The probability of each transition is estimated by the ratio of the number of times the sequence of states was traversed to the number of times the anchor state was visited. Next to each arc in Figure 2.9, the first number is the frequency of that transition, and the second number is the transition probability. For example, the transition probability of the transition (P2 to P3) is 0.2, because the number of times users traverse from page 2 to page 3 is 3 and the number of times page 2 is visited is 15 (i.e., 0.2 = 3/15).

Notice that the transition probability is used to resolve prediction. For example, given that a user has already visited P2, the most probable page she visits next is P6, because the transition probability from P2 to P6 is the highest.
Table 2.1 Collection of User Sessions and Their Frequencies

SESSION       FREQUENCY
P1, P2, P4    5
P1, P2, P6    1
P5, P2, P6    6
P5, P2, P3    3

Figure 2.9 First-order Markov model
Notice that the transition probability might not be available for some pages. For example, the transition probability from P2 to P5 is not available because no user has visited P5 after P2; hence, these transition probabilities are set to zero. Similarly, the Kth-order Markov model is one where the prediction is computed after considering the last K actions performed by the users (Eq. 2.13). In WWW prediction, the Kth-order Markov model is the probability of a user's visit to the kth page given her previous k − 1 page visits.
Figure 2.10 Second-order Markov model
Figure 2.10 shows the second-order Markov model that corresponds to Table 2.1. In the second-order model, we consider the last two pages. The transition probability is computed in a similar fashion. For example, the transition probability of the transition (P1, P2) to (P2, P6) is 1/6 ≈ 0.16, because the number of times users traverse from state (P1, P2) to state (P2, P6) is 1, and the number of times state (P1, P2) is visited is 6. The transition probability is used for prediction. For example, given that a user has visited P1 and P2, she most probably visits P4 next, because the transition probability from state (P1, P2) to state (P2, P4) is greater than the transition probability from state (P1, P2) to state (P2, P6).
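These estimates are easy to reproduce programmatically. The following sketch builds the first-order transition probabilities directly from the sessions of Table 2.1 (the variable names are our own):

```python
from collections import defaultdict

# User sessions and their frequencies, from Table 2.1
sessions = [(["P1", "P2", "P4"], 5), (["P1", "P2", "P6"], 1),
            (["P5", "P2", "P6"], 6), (["P5", "P2", "P3"], 3)]

counts = defaultdict(int)  # (page, next_page) -> weighted transition count
visits = defaultdict(int)  # page -> weighted count as a transition anchor
for pages, freq in sessions:
    for cur, nxt in zip(pages, pages[1:]):
        counts[(cur, nxt)] += freq
        visits[cur] += freq

def transition_prob(cur, nxt):
    return counts[(cur, nxt)] / visits[cur] if visits[cur] else 0.0

def predict_next(cur):
    """Most probable next page given the last page traversed."""
    options = {nxt: transition_prob(c, nxt) for (c, nxt) in counts if c == cur}
    return max(options, key=options.get)

print(transition_prob("P2", "P3"))  # 0.2, i.e., 3/15 as in the text
print(predict_next("P2"))           # 'P6' (probability 7/15, the highest)
```

Transitions never observed, such as P2 to P5, naturally get probability zero, matching the discussion above.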
The order of the Markov model is related to the sliding window: the Kth-order Markov model corresponds to a sliding window of size K − 1.

Notice that there is another concept, the number of hops, that is similar to the sliding window concept. In this book, we use the terms number of hops and sliding window interchangeably.
In WWW prediction, Markov models are built based on the concept of the n-gram. An n-gram can be represented as a tuple of the form ⟨x1, x2, …, xn⟩, depicting a sequence of page clicks by a population of users surfing a web site. Each component of the n-gram takes a specific page id value that reflects the surfing path of a specific user surfing a web page. For example, the n-gram ⟨P10, P21, P4, P12⟩ for some user U states that the user U has visited pages 10, 21, 4, and finally page 12, in sequence.
26 Association Rule Mining(ARM)Association rule is a data mining technique that has beenapplied successfully to discover related transactions Theassociation rule technique finds the relationships amongitemsets based on their co-occurrence in the transactionsSpecifically association rule mining discovers the frequentpatterns (regularities) among those itemsets for examplewhat the items purchased together in a super store are In thefollowing we briefly introduce association rule mining For
85
more details see [Agrawal et al 1993] [Agrawal andSrikant 1994]
Assume we have m items in our database, and define I = {i1, i2, …, im} as the set of all items. A transaction T is a set of items such that T ⊆ I. Let D be the set of all transactions in the database. A transaction T contains X if X ⊆ T and X ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. There are two parameters to consider for a rule: confidence and support. A rule R = X → Y holds with confidence c if c% of the transactions of D that contain X also contain Y (i.e., c = pr(Y|X)). The rule R holds with support s if s% of the transactions in D contain both X and Y (i.e., s = pr(X,Y)). The problem of mining association rules is defined as follows: given a set of transactions D, we would like to generate all rules that satisfy a confidence and a support greater than a minimum confidence (σ), minconf, and a minimum support (ϑ), minsup. Several efficient algorithms have been proposed to find association rules, for example, the AIS algorithm [Agrawal et al., 1993], [Agrawal and Srikant, 1994], the SETM algorithm [Houtsma and Swami, 1995], and AprioriTid [Agrawal and Srikant, 1994].
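The support and confidence definitions translate directly into counts over D. A minimal sketch, using a made-up transaction database for illustration:

```python
def support(itemset, D):
    """Fraction of transactions in D containing every item in itemset."""
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(X, Y, D):
    """pr(Y | X): among transactions containing X, the fraction also containing Y."""
    return support(X | Y, D) / support(X, D)

# Hypothetical transaction database
D = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk", "eggs"}]
print(support({"milk", "bread"}, D))       # 0.5
print(confidence({"milk"}, {"bread"}, D))  # 2/3
```

The rule {milk} → {bread} holds in two of the three transactions containing milk, giving confidence 2/3, while {milk, bread} appears in half of all transactions, giving support 0.5.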
In the case of web transactions, we use association rules to discover navigational patterns among users. This helps to cache a page in advance and reduce the loading time of a page. Also, discovering a pattern of navigation helps in personalization. Transactions are captured from the clickstream data recorded in web server logs.
In many applications there is one main problem in using association rule mining: with a global minimum support (minsup), rare hits (i.e., web pages that are rarely visited) will not be included in the frequent sets because they will not achieve enough support. One solution is to use a very small support threshold; however, we then end up with very large frequent itemsets, which are computationally hard to handle. [Liu et al., 1999] propose a mining technique that uses different support thresholds for different items. Specifying multiple thresholds allows rare transactions, which might be very important, to be included in the frequent itemsets. Other issues might arise depending on the application itself. For example, in the case of WWW prediction, a session is recorded for each user. The session might have tens of clickstreams (and sometimes hundreds, depending on the duration of the session). Using each session as a transaction will not work because it is rare to find two sessions that are frequently repeated (i.e., identical); hence, sessions will not achieve even a very low support threshold, minsup. There is a need to break each session into many subsequences. One common method is to use a sliding window of size w. For example, suppose we use a sliding window w = 3 to break the session S = ⟨A, B, C, D, E, F⟩; then we end up with the subsequences S′ = ⟨A,B,C⟩, ⟨B,C,D⟩, ⟨C,D,E⟩, ⟨D,E,F⟩. The total number of subsequences of a session S using window w is length(S) − w + 1. To predict the next page in an active user session, we use a sliding window of the active session and ignore the previous pages. For example, if the current session is ⟨A,B,C⟩ and the user references page D, then the new active session becomes ⟨B,C,D⟩ using a sliding window of size 3. Notice that page A is dropped, and ⟨B,C,D⟩ will be used for prediction. The rationale behind this is that most users go back and forth while surfing the web, trying to find the desired information, and it may be most appropriate to use the recent portions of the user history to generate recommendations/predictions [Mobasher et al., 2001].
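The session-splitting and active-window steps above can be sketched as:

```python
def subsequences(session, w):
    """Break a session into sliding-window subsequences of size w."""
    return [tuple(session[i:i + w]) for i in range(len(session) - w + 1)]

def slide(active, page, w):
    """Append the newly referenced page and keep only the w most recent pages."""
    return (list(active) + [page])[-w:]

S = ["A", "B", "C", "D", "E", "F"]
print(subsequences(S, 3))
# [('A', 'B', 'C'), ('B', 'C', 'D'), ('C', 'D', 'E'), ('D', 'E', 'F')]
print(slide(["A", "B", "C"], "D", 3))  # ['B', 'C', 'D']
```

With length(S) = 6 and w = 3, the session yields 6 − 3 + 1 = 4 subsequences, matching the example in the text.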
[Mobasher et al., 2001] propose a recommendation engine that matches an active user session with the frequent itemsets in the database and predicts the next page the user will most probably visit. The engine works as follows. Given an active session of size w, the engine finds all the frequent itemsets of length w + 1 satisfying some minimum support, minsup, and containing the current active session. Prediction for the active session A is based on the confidence (ψ) of the corresponding association rule. The confidence (ψ) of an association rule X → z is defined as ψ(X → z) = σ(X ∪ z)/σ(X), where the length of z is 1. Page p is recommended/predicted for an active session A if the rule A → p has the highest confidence among the candidate rules.
The engine uses a cyclic graph called the Frequent Itemset Graph. The graph is an extension of the lexicographic tree used in the tree projection algorithm of [Agrawal et al., 2001]. The graph is organized in levels. The nodes in level l have itemsets of size l. For example, the sizes of the nodes (i.e., the sizes of the itemsets corresponding to these nodes) in levels 1 and 2 are 1 and 2, respectively. The root of the graph, level 0, is an empty node corresponding to an empty itemset. A node X in level l is linked to a node Y in level l + 1 if X ⊂ Y. To further explain the process, suppose we have the following sample web transactions involving pages 1, 2, 3, 4, and 5, as in Table 2.2. The Apriori algorithm produces the itemsets shown in Table 2.3, using a minsup = 0.49. The frequent itemset graph is shown in Figure 2.11.
Table 2.2 Sample Web Transactions
TRANSACTION ID   ITEMS
T1               1, 2, 4, 5
T2               1, 2, 5, 3, 4
T3               1, 2, 5, 3
T4               2, 5, 2, 1, 3
T5               4, 1, 2, 5, 3
T6               1, 2, 3, 4
T7               4, 5
T8               4, 5, 3, 1
Table 2.3 Frequent Itemsets Generated by the Apriori Algorithm
Suppose we are using a sliding window of size 2, and the current active session is A = ⟨2,3⟩. To predict/recommend the next page, we first start at level 2 in the frequent itemset graph and extract all the itemsets in level 3 linked to A. From Figure 2.11, the node {2,3} is linked to the {1,2,3} and {2,3,5} nodes, with confidences
and the recommended page is 1 because its confidence is larger. Notice that in recommendation engines the order of the clickstream is not considered; that is, there is no distinction between a session ⟨1,2,4⟩ and ⟨1,4,2⟩. This is a disadvantage of such systems because the order of pages visited might bear important information about the navigation patterns of users.
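These two confidences can be checked directly against the transactions of Table 2.2 with simple set counting (a sketch that bypasses the frequent itemset graph itself; `sigma` and `psi` follow the definitions above):

```python
# Transactions from Table 2.2, as sets (order and repeats are ignored)
D = [{1, 2, 4, 5}, {1, 2, 5, 3, 4}, {1, 2, 5, 3}, {2, 5, 1, 3},
     {4, 1, 2, 5, 3}, {1, 2, 3, 4}, {4, 5}, {4, 5, 3, 1}]

def sigma(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for T in D if itemset <= T)

def psi(X, z):
    """Confidence of the rule X -> {z}."""
    return sigma(X | {z}) / sigma(X)

A = {2, 3}  # active session
candidates = {p: psi(A, p) for p in (1, 5)}
print(candidates)                            # {1: 1.0, 5: 0.8}
print(max(candidates, key=candidates.get))   # 1
```

Every transaction containing {2,3} also contains 1 (confidence 1.0), while only four of the five contain 5 (confidence 0.8), so page 1 is recommended, as in the text.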
Figure 2.11 Frequent Itemset Graph
2.7 Multi-Class Problem

Most classification techniques solve the binary classification problem. Binary classifiers are combined to generalize to the multi-class problem. There are two basic schemes for this generalization, namely, one-vs-one and one-vs-all. To avoid redundancy, we present this generalization only for SVM.
2.7.1 One-vs-One
The one-vs-one approach creates a classifier for each pair of classes. The training set for each pair classifier (i, j) includes only those instances that belong to either class i or class j. A new instance x belongs to the class upon which most pair classifiers agree; the prediction decision follows the majority vote technique. There are n(n − 1)/2 classifiers to be computed, where n is the number of classes in the dataset. It is evident that the disadvantage of this scheme is that we need to generate a large number of classifiers, especially if there are a large number of classes in the training set. For example, if we have a training set of 1,000 classes, we need 499,500 classifiers. On the other hand, the size of the training set for each classifier is small because we exclude all instances that do not belong to that pair of classes.
2.7.2 One-vs-All
One-vs-all creates a classifier for each class in the dataset. The training set is preprocessed such that, for a classifier j, instances that belong to class j are marked as class (+1) and instances that do not belong to class j are marked as class (−1). In the one-vs-all scheme, we compute n classifiers, where n is the number of pages that users have visited (at the end of each session). A new instance x is predicted by assigning it to the class whose classifier outputs the largest positive value (i.e., maximal margin), as in Eq. 2.15. We can compute the margin of point x as in Eq. 2.14. Notice that the recommended/predicted page is the sign of the margin value of that page (see Eq. 2.10).
In Eq. 2.15, M is the number of classes, x = ⟨x1, x2, …, xn⟩ is the user session, and fi is the classifier that separates class i from the rest of the classes. The prediction decision in Eq. 2.15 resolves to the classifier fc that is the most distant from the testing example x. This might be explained as fc having the most separating power, among all other classifiers, for separating x from the rest of the classes.
The advantage of this scheme (one-vs-all) compared to the one-vs-one scheme is that it has fewer classifiers. On the other hand, the size of the training set is larger for one-vs-all than for one-vs-one because we use the whole original training set to compute each classifier.
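The one-vs-all decision of Eq. 2.15 amounts to an argmax over the per-class classifier outputs. A sketch with stand-in linear decision functions (the weights below are hypothetical, not fitted):

```python
def one_vs_all_predict(x, classifiers):
    """Assign x to the class whose classifier f_i outputs the largest value."""
    return max(classifiers, key=lambda c: classifiers[c](x))

# Hypothetical linear decision functions f_i(x) = w_i . x + b_i
classifiers = {
    "P4": lambda x: 2.0 * x[0] - 1.0,
    "P6": lambda x: -1.0 * x[0] + 0.5,
}
print(one_vs_all_predict([1.0], classifiers))  # P4
```

For x = [1.0], f_P4 outputs 1.0 and f_P6 outputs −0.5, so the instance is assigned to the class whose classifier is "most distant" on the positive side, here P4.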
2.8 Image Mining

Along with the development of digital images and computer storage technologies, huge amounts of digital images are generated and saved every day. Applications of digital images have rapidly penetrated many domains and markets, including commercial and news media photo libraries, scientific and non-photographic image databases, and medical image databases. As a consequence, we face a daunting problem of organizing and accessing these huge amounts of available images. An efficient image retrieval system is highly desired to find images of specific entities from a database. The system is expected to manage a huge collection of images efficiently, respond to users' queries with high speed, and deliver a minimum of irrelevant information (high precision), as well as ensure that relevant information is not overlooked (high recall).
To build such systems, people have tried many different approaches. In the early 1990s, because of the emergence of large image collections, content-based image retrieval (CBIR) was proposed. CBIR computes relevance based on the similarity of visual content/low-level image features such as color histograms, textures, shapes, and spatial layout. However, the problem is that visual similarity is not semantic similarity; there is a gap between low-level visual features and semantic meanings. This so-called semantic gap is the major problem that needs to be solved for most CBIR approaches. For example, a CBIR system may answer a query request for a "red ball" with an image of a "red rose." If we undertake the annotation of images with keywords, a typical way to publish an image data repository is to create a keyword-based query interface addressed to an image database. If all images came with a detailed and accurate description, image retrieval would be convenient based on current powerful pure text search techniques. These search techniques would retrieve the images if their descriptions/annotations contained some combination of the keywords specified by the user. However, the major problem is that most images are not annotated. It is a laborious, error-prone, and subjective process to manually annotate a large collection of images, and many images contain the desired semantic information even though they do not contain the user-specified keywords. Furthermore, keyword-based search is useful especially to a user who knows what keywords are used to index the images and who can therefore easily formulate queries. This approach is problematic, however, when the user does not have a clear goal in mind, does not know what is in the database, and does not know what kind of semantic concepts are involved in the domain.
Image mining is a more challenging research problem than retrieving relevant images in CBIR systems. The goal of image mining is to find an image pattern that is significant for a given set of images and helpful for understanding the relationships between high-level semantic concepts/descriptions and low-level visual features. Our focus is on aspects such as feature selection and image classification.
2.8.1 Feature Selection
Usually, data saved in databases has well-defined semantics, such as numbers or structured data entries. In comparison, data with ill-defined semantics is unstructured data; for example, images, audio, and video are data with ill-defined semantics. In the domain of image processing, images are represented by derived data or features such as color, texture, and shape. Many of these features have multiple values (e.g., color histogram, moment description). When people generate these derived data or features, they generally generate as many features as possible, since they are not aware of which features are more relevant. Therefore, the dimensionality of derived image data is usually very high. Some of the selected features might be duplicated or may not even be relevant to the problem. Including irrelevant or duplicated information is referred to as "noise." Such problems are referred to as the "curse of dimensionality." Feature selection is the research topic of finding an optimal subset of features. In this section, we discuss this curse and feature selection in detail.
We developed a wrapper-based simultaneous feature weighting and clustering algorithm. The clustering algorithm bundles similar image segments together and generates a finite set of visual symbols (i.e., blob-tokens). Based on histogram analysis and chi-square values, we assign the features of image segments different weights instead of removing some of them. Feature weight evaluation is wrapped in a clustering algorithm: in each iteration of the algorithm, the feature weights of image segments are reevaluated based on the clustering result, and the reevaluated feature weights affect the clustering results in the next iteration.
2.8.2 Automatic Image Annotation
Automatic image annotation is research concerned with object recognition, where the effort is directed at recognizing objects in an image and generating descriptions for the image according to the semantics of the objects. If it were possible to produce accurate and complete semantic descriptions for an image, we could store the descriptions in an image database. Based on a textual description, more functionality (e.g., browse, search, and query) of an image DBMS could be implemented easily and efficiently by applying many existing text-based search techniques. Unfortunately, the automatic image annotation problem has not been solved in general, and perhaps this problem is impossible to solve.
However, in certain subdomains it is still possible to obtain some interesting results. Many statistical models have been published for image annotation. Some of these models took feature dimensionality into account and applied singular value decomposition (SVD) or principal component analysis (PCA) to reduce dimension, but none of them considered feature selection or feature weighting. We proposed a new framework for image annotation based on a translation model (TM). In our approach, we applied our weighted feature selection algorithm and embedded it in the image annotation framework. Our weighted feature selection algorithm improves the quality of visual tokens and generates better image annotations.
2.8.3 Image Classification
Image classification is an important area, especially in the medical domain, because it helps manage large medical image databases and has great potential as a diagnostic aid in a real-world clinical setting. We describe our experiments for the ImageCLEF medical image retrieval task. The sizes of the classes in the ImageCLEF medical image datasets are not balanced, which is a really serious problem for all classification algorithms. To solve this problem, we re-sample the data by generating subwindows. The k nearest neighbor (kNN) algorithm, distance-weighted kNN, fuzzy kNN, the nearest prototype classifier, and evidence theory-based kNN are implemented and studied. Results show that evidence-based kNN has the best performance based on classification accuracy.
2.9 Summary

In this chapter, we first provided an overview of the various data mining tasks and techniques and then discussed some of the techniques that we will utilize in this book. These include neural networks, support vector machines, and association rule mining.
Numerous data mining techniques have been designed and developed, and many of them are being utilized in commercial tools. Several of these techniques are variations of some of the basic classification, clustering, and association rule mining techniques. One of the major challenges today is to determine the appropriate techniques for various applications. We still need more benchmarks and performance studies. In addition, the techniques should result in fewer false positives and negatives. Although there is still much to be done, the progress over the past decade is extremely promising.
References

[Agrawal et al., 1993] Agrawal, R., T. Imielinski, A. Swami, Mining Association Rules between Sets of Items in Large Databases, in Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993, pp. 207–216.

[Agrawal et al., 2001] Agrawal, R., C. Aggarwal, V. Prasad, A Tree Projection Algorithm for Generation of Frequent Item Sets, Journal of Parallel and Distributed Computing, Vol. 61, No. 3, 2001, pp. 350–371.

[Agrawal and Srikant, 1994] Agrawal, R. and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, 1994, pp. 487–499.

[Bartlett and Shawe-Taylor, 1999] Bartlett, P. and J. Shawe-Taylor, Generalization Performance of Support Vector Machines and Other Pattern Classifiers, in Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 43–54.

[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000, pp. 93–122.

[Houtsma and Swami, 1995] Houtsma, M. and A. Swami, Set-Oriented Mining of Association Rules in Relational Databases, in Proceedings of the Eleventh International Conference on Data Engineering, Washington, DC, 1995, pp. 25–33.

[Liu et al., 1999] Liu, B., W. Hsu, Y. Ma, Mining Association Rules with Multiple Minimum Supports, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999, pp. 337–341.

[Mitchell, 1997] Mitchell, T. M., Machine Learning, McGraw-Hill, 1997, chap. 4.

[Mobasher et al., 2001] Mobasher, B., H. Dai, T. Luo, M. Nakagawa, Effective Personalization Based on Association Rule Discovery from Web Usage Data, in Proceedings of the ACM Workshop on Web Information and Data Management (WIDM '01), 2001, pp. 9–15.

[Pirolli et al., 1996] Pirolli, P., J. Pitkow, R. Rao, Silk from a Sow's Ear: Extracting Usable Structures from the Web, in Proceedings of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996, pp. 118–125.

[Vapnik, 1995] Vapnik, V. N., The Nature of Statistical Learning Theory, Springer, 1995.

[Vapnik, 1998] Vapnik, V. N., Statistical Learning Theory, Wiley, 1998.

[Vapnik, 1999] Vapnik, V. N., The Nature of Statistical Learning Theory, 2nd Ed., Springer, 1999.

[Yang et al., 2001] Yang, Q., H. Zhang, T. Li, Mining Web Logs for Prediction Models in WWW Caching and Prefetching, in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 26–29, 2001, pp. 473–478.
3
MALWARE
3.1 Introduction

Malware is the term used for malicious software. Malicious software is developed by hackers to steal data and identities, cause harm to computers, and deny legitimate services to users, among other things. Malware has plagued society and the software industry for almost four decades. Some of the early malware includes the Creeper virus of 1970 and the Morris worm of 1988.
As computers became interconnected, the amount of malware developed increased at an alarming rate in the 1990s. Today, with the World Wide Web and so many transactions and activities being carried out on the Internet, the malware problem is causing chaos among computer and network users.
There are various types of malware, including viruses, worms, time and logic bombs, Trojan horses, and spyware. Preliminary results from Symantec published in 2008 suggest that "the release rate of malicious code and other unwanted programs may be exceeding that of legitimate software applications" [Malware, 2011]. CME (Common Malware Enumeration) was "created to provide single, common identifiers to new virus threats and to the most prevalent virus threats in the wild to reduce public confusion during malware incidents" [CME, 2011].
In this chapter, we discuss various types of malware. In this book, we describe the data mining tools we have developed to handle some types of malware. The organization of this chapter is as follows. In Section 3.2, we discuss viruses. In Section 3.3, we discuss worms. Trojan horses are discussed in Section 3.4. Time and logic bombs are discussed in Section 3.5. Botnets are discussed in Section 3.6. Spyware is discussed in Section 3.7. The chapter is summarized in Section 3.8. Figure 3.1 illustrates the concepts discussed in this chapter.
Figure 3.1 Concepts discussed in this chapter
3.2 Viruses

Computer viruses are malware that piggyback onto other executables and are capable of replicating. Viruses can exhibit a wide range of malicious behaviors, ranging from simple annoyance (such as displaying messages) to widespread destruction, such as wiping all the data on the hard drive (e.g., the CIH virus). Viruses are not independent programs; rather, they are code fragments that exist on other binary files. A virus can infect a host machine by replicating itself when it is brought into contact with that machine, such as via a shared network drive, removable media, or email attachment. The replication is done when the virus code is executed and it is permitted to write in memory.
There are two types of viruses based on their replication strategy: nonresident and resident. The nonresident virus does not store itself on the hard drive of the infected computer; it is only attached to an executable file that infects a computer. The virus is activated each time the infected executable is accessed and run. When activated, the virus looks for other victims (e.g., other executables) and infects them. In contrast, resident viruses allocate memory in the computer hard drive, such as the boot sector. These viruses become active every time the infected machine starts.
The earliest computer virus dates back to 1970, with the advent of the Creeper virus detected on ARPANET [SecureList, 2011]. Since then, hundreds of thousands of different viruses have been written, and corresponding antiviruses have been devised to detect and eliminate the viruses from computer systems. Most commercial antivirus products apply a signature matching technique to detect a virus. A virus signature is a unique bit pattern in the virus binary that can accurately identify the virus [Signature, 2011]. Traditionally, virus signatures are generated manually; however, automated signature generation techniques based on data mining have been proposed recently [Masud et al., 2007, 2008].
3.3 Worms

Computer worms are malware, but unlike viruses, they need not attach themselves to other binaries. Worms are capable of propagating themselves to other hosts through network connections. Worms also exhibit a wide range of malicious behavior, such as spamming, phishing, harvesting and sending sensitive information to the worm writer, jamming or slowing down network connections, deleting data from the hard drive, and so on. Worms are independent programs, and they reside in the infected machine by camouflage. Some worms open a backdoor in the infected machine, allowing the worm writer to control the machine and making it a zombie (or bot) for his malicious activities (see Section 3.6).
The earliest computer worm dates back to 1988, programmed by Robert Morris, who unleashed the Morris worm. It infected 10% of the then Internet, and his act resulted in the first conviction in the United States under the Computer Fraud and Abuse Act [Dressler, 2007]. One of the three authors of this book was working in computer security at Honeywell Inc. in Minneapolis at that time and vividly remembers what happened that November day.
Other infamous worms since then include the Melissa worm, unleashed in 1999, which crashed servers; the Mydoom worm, released in 2004, which was the fastest-spreading email worm; and the SQL Slammer worm, released in 2003, which caused a global Internet slowdown.
Commercial antivirus products also detect worms by scanning for worm signatures against the signature database. Although this technique is very effective against regular worms, it is usually not effective against zero-day attacks [Frei et al., 2008] or polymorphic and metamorphic worms. However, recent techniques for worm detection address these problems through automatic signature generation [Kim and Karp, 2004], [Newsome et al., 2005]. Several data mining techniques also exist for detecting different types of worms [Masud et al., 2007, 2008].
3.4 Trojan Horses

Trojan horses have been studied within the context of multilevel databases. They covertly pass information from a high-level process to a low-level process. A good example of a Trojan horse is the manipulation of file locks. According to the Bell and La Padula security policy (discussed in Appendix B), a Secret process cannot directly send data to an Unclassified process, as this would constitute a write-down. However, a malicious Secret process can covertly pass data to an Unclassified process by manipulating file locks as follows. Suppose both processes want to access an unclassified file: the Secret process wants to read from the file, while the Unclassified process can write into the file. However, both processes cannot hold the read and write locks at the same time. Therefore, at time T1, let's assume that the Secret process has the read lock while the Unclassified process attempts to get a write lock. The Unclassified process cannot obtain this lock. This means one bit of information, say 0, is passed to the Unclassified process. At time T2, let's assume the situation does not change; this means another bit of 0 is passed. However, at time T3, let's assume the Secret process does not have the read lock, in which case the Unclassified process can obtain the write lock. This time, one bit of information, 1, is passed. Over time, a classified string, say 0011000011101, could be passed from the Secret process to the Unclassified process.
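The lock-based covert channel can be illustrated with a toy simulation; this is purely illustrative (real lock managers involve timing and contention that this sketch ignores):

```python
def covert_send(bits):
    """Secret process: hold the read lock to signal 0, release it to signal 1.
    Returns the lock-held schedule, one boolean per time tick."""
    return [bit == 0 for bit in bits]

def covert_receive(lock_held_schedule):
    """Unclassified process: attempt the write lock at each tick.
    Failing to acquire it decodes as 0; succeeding decodes as 1."""
    return [0 if held else 1 for held in lock_held_schedule]

secret = [0, 0, 1, 1, 0, 1]
schedule = covert_send(secret)
print(covert_receive(schedule))  # [0, 0, 1, 1, 0, 1]
```

The receiver recovers the classified bit string without any direct write-down, which is exactly why such covert channels are a concern in multilevel systems.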
As stated in [Trojan Horse, 2011], a Trojan horse is software that appears to perform a desirable function for the user but actually carries out a malicious activity. In the previous example, the Trojan horse does have read access to the data object; it is reading from the object on behalf of the user. However, it also carries out a malicious activity by manipulating the locks and sending data covertly to the unclassified user.
3.5 Time and Logic Bombs

In the software paradigm, a time bomb refers to a computer program that stops functioning after a prespecified time or date is reached. This is usually imposed by software companies in beta versions of software so that the software stops functioning after a certain date. An example is Windows Vista Beta 2, which stopped functioning on May 31, 2007 [Vista, 2007].
A logic bomb is a computer program that is intended to perform malicious activities when certain predefined conditions are met. This technique is sometimes injected into viruses or worms to increase their chances of surviving and spreading before getting caught.
An example of a logic bomb is the Fannie Mae bomb of 2008 [Claburn, 2009]. A logic bomb was discovered at the mortgage company Fannie Mae in October 2008. An Indian citizen and IT contractor, Rajendrasinh Babubhai Makwana, who worked in Fannie Mae's Urbana, Maryland, facility, allegedly planted it, and it was set to activate on January 31, 2009, to wipe all of Fannie Mae's 4,000 servers. As stated in [Claburn, 2009], Makwana had been terminated around 1:00 p.m. on October 24, 2008, and planted the bomb while he still had network access. He was indicted in a Maryland court on January 27, 2009, for unauthorized computer access.
3.6 Botnet

A botnet is a network of compromised hosts, or bots, under the control of a human attacker known as the botmaster. The botmaster can issue commands to the bots to perform malicious actions, such as recruiting new bots, launching coordinated DDoS attacks against some hosts, stealing sensitive information from the bot machine, sending mass spam emails, and so on. Thus, botnets have emerged as an enormous threat to the Internet community.
According to [Messmer, 2009], more than 12 million computers in the United States are compromised and controlled by the top 10 notorious botnets. Among them, the highest number of compromised machines is due to the Zeus botnet. Zeus is a kind of Trojan (a malware) whose main purpose is to apply key-logging techniques to steal sensitive data such as login information (passwords, etc.), bank account numbers, and credit card numbers. One of its key-logging techniques is to inject fake HTML forms into online banking login pages to steal login information.
The most prevalent botnets are the IRC botnets [Saha and Gairola, 2005], which have a centralized architecture. These botnets are usually very large and powerful, consisting of thousands of bots [Rajab et al., 2006]. However, their enormous size and centralized architecture also make them vulnerable to detection and demolition. Many approaches for detecting IRC botnets have been proposed recently ([Goebel and Holz, 2007], [Karasaridis et al., 2007], [Livadas et al., 2006], [Rajab et al., 2006]). Another type of botnet is the peer-to-peer (P2P) botnet. These botnets are distributed and much smaller than IRC botnets, so they are more difficult to locate and destroy. Many recent works on P2P botnets analyze their characteristics ([Grizzard et al., 2007], [Group, 2004], [Lemos, 2006]).
3.7 Spyware

As stated in [Spyware, 2011], spyware is a type of malware that can be installed on computers and collects information about users without their knowledge. For example, spyware observes the web sites visited by the user, the emails sent by the user, and, in general, the activities carried out by the user on his or her computer. Spyware is usually hidden from the user. However, sometimes employers install spyware to find out about the computer activities of their employees.
An example of spyware is keylogger (also called keystroke logging) software. As stated in [Keylogger, 2011], keylogging is the action of tracking the keys struck on a keyboard, usually in a covert manner, so that the person using the keyboard is unaware that their actions are being monitored. Another example of spyware is adware, where advertisements pop up on the computer while the person is doing some usually unrelated activity. In this case, the spyware monitors the web sites surfed by the user and carries out targeted marketing using adware.
3.8 Summary

In this chapter, we have provided an overview of malware (also known as malicious software). We discussed various types of malware, such as viruses, worms, time and logic bombs, Trojan horses, botnets, and spyware. As we have stated, malware is causing chaos in society and in the software industry. Malware technology is getting more and more sophisticated. Developers of malware are continuously changing patterns so as not to get caught. Therefore, developing solutions to detect and/or prevent malware has become an urgent need.
In this book, we discuss the tools we have developed to detect malware. In particular, we discuss tools for email worm detection, remote exploit detection, and botnet detection. We also discuss our stream mining tool, which could potentially detect changing malware. These tools are discussed in Parts III through VII of this book. In Chapter 4, we summarize the data mining tools we discussed in our previous book [Awad et al., 2009]. The tools discussed in our current book have been influenced by the tools discussed in [Awad et al., 2009].
References[Awad et al 2009] Awad M L Khan B Thuraisingham LWang Design and Implementation of Data Mining ToolsCRC Press 2009
[CME 2011] httpcmemitreorg
[Claburn 2009] Claburn T Fannie Mae Contractor Indictedfor Logic BombInformationWeek httpwwwinformationweekcomnewssecuritymanagementshowArticlejhtmlarticleID=212903521
[Dressler 2007] Dressler J ldquoUnited States v Morrisrdquo Casesand Materials on Criminal Law St Paul MN ThomsonWest 2007
[Frei et al 2008] Frei S B Tellenbach B Plattner 0-DayPatchmdashExposing Vendors(In)security Performancetechzoomnet Publications httpwwwtechzoomnetpublications0-day-patchindexen
110
[Goebel and Holz 2007] Goebel J and T Holz RishiIdentify Bot Contaminated Hosts by IRC NicknameEvaluation in USENIXHotbots rsquo07 Workshop 2007
[Grizzard et al 2007] Grizzard J B V Sharma CNunnery B B Kang D Dagon Peer-to-Peer BotnetsOverview and Case Study in USENIXHotbots rsquo07Workshop 2007
[Group 2004] LURHQ Threat Intelligence Group Sinit p2pTrojan Analysis LURHQ httpwwwlurhqcomsinithtml
[Karasaridis et al., 2007] Karasaridis, A., B. Rexroad, D. Hoeflin, Wide-Scale Botnet Detection and Characterization, in USENIX HotBots '07 Workshop, 2007.
[Keylogger, 2011] http://en.wikipedia.org/wiki/Keystroke_logging
[Kim and Karp, 2004] Kim, H. A. and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), pp. 271–286.
[Lemos, 2006] Lemos, R., Bot Software Looks to Improve Peerage, http://www.securityfocus.com/news/11390
[Livadas et al., 2006] Livadas, C., B. Walsh, D. Lapsley, T. Strayer, Using Machine Learning Techniques to Identify Botnet Traffic, in 2nd IEEE LCN Workshop on Network Security (WoNS 2006), November 2006.
[Malware, 2011] http://en.wikipedia.org/wiki/Malware
[Masud et al., 2007] Masud, M., L. Khan, B. Thuraisingham, E-mail Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, 2007, pp. 47–61.
[Masud et al., 2008] Masud, M., L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, Information Systems Frontiers, Vol. 10, No. 1, 2008, pp. 33–45.
[Messmer, 2009] Messmer, E., America's 10 Most Wanted Botnets, Network World, July 22, 2009, http://www.networkworld.com/news/2009/072209-botnets.html
[Newsome et al., 2005] Newsome, J., B. Karp, D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, 2005, pp. 226–241.
[Rajab et al., 2006] Rajab, M. A., J. Zarfoss, F. Monrose, A. Terzis, A Multifaceted Approach to Understanding the Botnet Phenomenon, in Proceedings of the 6th ACM SIGCOMM Internet Measurement Conference (IMC), 2006, pp. 41–52.
[Saha and Gairola, 2005] Saha, B. and A. Gairola, Botnet: An Overview, CERT-In White Paper CIWP-2005-05, 2005.
[SecureList, 2011] Securelist.com, Threat Analysis and Information, Kaspersky Labs, http://www.securelist.com/en/threats/detect
[Signature, 2011] Virus Signature, PC Magazine Encyclopedia, http://www.pcmag.com/encyclopedia_term/0,2542,t=virus+signature&i=53969,00.asp
[Spyware, 2011] http://en.wikipedia.org/wiki/Spyware
[Trojan Horse, 2011] http://en.wikipedia.org/wiki/Trojan_horse_(computing)
[Vista, 2007] Windows Vista, http://windows.microsoft.com/en-us/windows-vista/products/home
4
DATA MINING FOR SECURITY APPLICATIONS
4.1 Introduction

Ensuring the integrity of computer networks, both in relation to security and with regard to the institutional life of the nation in general, is a growing concern. Security and defense networks, proprietary research, intellectual property, and data-based market mechanisms that depend on unimpeded and undistorted access can all be severely compromised by malicious intrusions. We need to find the best way to protect these systems. In addition, we need techniques to detect security breaches.
Data mining has many applications in security, including in national security (e.g., surveillance) as well as in cyber security (e.g., virus detection). The threats to national security include attacking buildings and destroying critical infrastructures such as power grids and telecommunication systems [Bolz et al., 2005]. Data mining techniques are being investigated to find out who the suspicious people are and who is capable of carrying out terrorist activities. Cyber security is involved with protecting computer and network systems against corruption due to Trojan horses and viruses. Data mining is also being applied to provide solutions such as intrusion detection and auditing. In this chapter we will focus mainly on data mining for cyber security applications.
To understand the mechanisms to be applied to safeguard the nation and its computers and networks, we need to understand the types of threats. In [Thuraisingham, 2003] we described real-time threats as well as non-real-time threats. A real-time threat is a threat that must be acted upon within a certain time to prevent some catastrophic situation. Note that a non-real-time threat could become a real-time threat over time. For example, one could suspect that a group of terrorists will eventually perform some act of terrorism. However, when we set time bounds, such as that a threat will likely occur, say, before July 1, 2004, then it becomes a real-time threat and we have to take action immediately. If the time bounds are tighter, such as "a threat will occur within two days," then we cannot afford to make any mistakes in our response.
Figure 4.1 Data mining applications in security
There has been a lot of work on applying data mining for both national security and cyber security. Much of the focus of our previous book was on applying data mining for national security [Thuraisingham, 2003]. In this part of the book we discuss data mining for cyber security. In Section 4.2 we discuss data mining for cyber security applications. In particular, we discuss the threats to computers and networks and describe the applications of data mining to detect such threats and attacks. Some of our current research at the University of Texas at Dallas is discussed in Section 4.3. The chapter is summarized in Section 4.4. Figure 4.1 illustrates data mining applications in security.
4.2 Data Mining for Cyber Security

4.2.1 Overview
This section discusses information-related terrorism. By information-related terrorism we mean cyber-terrorism as well as security violations through access control and other means. Trojan horses and viruses are also information-related security violations, which we group into information-related terrorism activities.
Figure 4.2 Cyber security threats
In the next few subsections we discuss various information-related terrorist attacks. In Section 4.2.2 we give an overview of cyber-terrorism and then discuss insider threats and external attacks. Malicious intrusions are the subject of Section 4.2.3. Credit card and identity theft are discussed in Section 4.2.4. Attacks on critical infrastructures are discussed in Section 4.2.5, and data mining for cyber security is discussed in Section 4.2.6. Figure 4.2 illustrates cyber security threats.
4.2.2 Cyber-Terrorism, Insider Threats, and External Attacks
Cyber-terrorism is one of the major terrorist threats posed to our nation today. As we have mentioned earlier, there is now a great deal of information available electronically and on the web. Attacks on our computers as well as networks, databases, and the Internet could be devastating to businesses. It is estimated that cyber-terrorism could cost businesses billions of dollars. For example, consider a banking information system. If terrorists attack such a system and deplete accounts of their funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, which also equates to money in the end. Even a simple power outage at work, through some accident, could cause several hours of productivity loss and, as a result, a major financial loss. Therefore it is critical that our information systems be secure. We discuss various types of cyber-terrorist attacks. One is spreading viruses and Trojan horses that can wipe away files and other important documents; another is intruding upon computer networks.
Note that threats can occur from outside or from inside an organization. Outside attacks are attacks on computers from someone outside the organization. We hear of hackers breaking into computer systems and causing havoc within an organization; there are hackers who spread viruses that cause great damage to the files in various computer systems. But a more sinister problem is the insider threat. Just as with non-information-related attacks, there is an insider threat with information-related attacks. There are people inside an organization who have studied the business practices and who develop schemes to cripple the organization's information assets. These people could be regular employees or even those working at computer centers. The problem is quite serious, as someone may be masquerading as someone else and causing all kinds of damage. In the next few sections we examine how data mining could detect and perhaps prevent such attacks.
4.2.3 Malicious Intrusions
Malicious intrusions may include intruding upon networks, web clients, servers, databases, and operating systems. Many cyber-terrorism attacks are due to malicious intrusions. We hear much about network intrusions; here, intruders try to tap into the networks and get the information that is being transmitted. These intruders may be human intruders or Trojan horses set up by humans. Intrusions can also happen on files. For example, one can masquerade as someone else, log into someone else's computer system, and access the files. Intrusions can also occur on databases. Intruders posing as legitimate users can pose queries, such as SQL queries, and access data that they are not authorized to know.
Essentially, cyber-terrorism includes malicious intrusions as well as sabotage, through malicious intrusions or otherwise. Cyber security consists of security mechanisms that attempt to provide solutions to cyber attacks or cyber-terrorism. When we discuss malicious intrusions or cyber attacks, it may help to think about the non-cyber world, that is, non-information-related terrorism, and then translate those attacks to attacks on computers and networks. For example, a thief could enter a building through a trap door. In the same way, a computer intruder could enter the computer or network through some sort of trap door that has been intentionally built by a malicious insider and left unattended, perhaps through careless design. Another example is a thief entering a bank with a mask and stealing the money. The analogy here is an intruder who masquerades as someone else, legitimately enters the system, and takes all of the information assets; money in the real world translates to information assets in the cyber world. That is, there are many parallels between non-information-related attacks and information-related attacks, and we can proceed to develop countermeasures for both types.
4.2.4 Credit Card Fraud and Identity Theft
We are hearing a lot these days about credit card fraud and identity theft. In the case of credit card fraud, others get hold of a person's credit card and make purchases; by the time the owner of the card finds out, it may be too late. The thief may have left the country by then. A similar problem occurs with telephone calling cards. In fact, this type of attack happened to one of the authors once. Phone calls were being made on her company calling card at airports; someone must have observed, say, the dial tones and then used the card. Fortunately, the telephone company detected the problem and informed the company, and the problem was dealt with immediately.
A more serious theft is identity theft. Here one assumes the identity of another person, for example, by getting hold of the social security number, and essentially carries out all transactions under the other person's name. This could even include selling houses and depositing the income in a fraudulent bank account. By the time the owner finds out, it will be too late; the owner may have lost millions of dollars due to the identity theft.
We need to explore the use of data mining both for credit card fraud detection and for identity theft. There have been some efforts on detecting credit card fraud [Chan, 1999]. We need to start working actively on detecting and preventing identity theft.
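A minimal sketch of the anomaly-detection idea behind such fraud detection (not the distributed classifier approach of [Chan, 1999]): flag a purchase whose amount deviates sharply from the cardholder's spending history. The history values and threshold below are invented for illustration.

```python
from statistics import mean, stdev

def flag_suspicious(history, new_amount, threshold=3.0):
    """Flag a transaction whose amount deviates from the cardholder's
    spending history by more than `threshold` standard deviations
    (a simple z-score test)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical spending history in dollars.
history = [35.0, 42.5, 28.0, 51.0, 39.5, 44.0, 31.5, 47.0]
print(flag_suspicious(history, 45.0))   # typical purchase -> False
print(flag_suspicious(history, 950.0))  # large deviation -> True
```

A deployed system would of course use richer features (merchant, location, time) and a learned model rather than a single z-score.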
Figure 4.3 Attacks on critical infrastructures
4.2.5 Attacks on Critical Infrastructures
Attacks on critical infrastructures could cripple a nation and its economy. Infrastructure attacks include attacks on telecommunication lines; electric power and gas; reservoirs and water supplies; food supplies; and other basic entities that are critical for the operation of a nation.
Attacks on critical infrastructures could occur during any type of attack, whether non-information-related, information-related, or bio-terrorist. For example, one could attack the software that runs the telecommunications industry and close down all the telecommunication lines. Similarly, software that runs the power and gas supplies could be attacked. Attacks could also occur through bombs and explosives; for example, telecommunication lines could be attacked with bombs. Attacking transportation lines such as highways and railway tracks is also an attack on infrastructure.
Infrastructures could also be attacked by natural disasters such as hurricanes and earthquakes. Our main interest here is attacks on infrastructures through malicious means, both information-related and non-information-related. Our goal is to examine data mining and related data management technologies to detect and prevent such infrastructure attacks. Figure 4.3 illustrates attacks on critical infrastructures.
4.2.6 Data Mining for Cyber Security
Data mining is being applied to problems such as intrusion detection and auditing. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace viruses to their perpetrators. Classification may be used to group various cyber attacks and then use the profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks, depending on information learned about terrorists through email and phone conversations. Also, for some threats non-real-time data mining may suffice, whereas for certain other threats, such as network intrusions, we may need real-time data mining. Many researchers are investigating the use of data mining for intrusion detection. Although we need some form of real-time data mining (that is, the results have to be generated in real time), we also need to build models in real time. For example, credit card fraud detection is a form of real-time processing; however, the models there are usually built ahead of time. Building models in real time remains a challenge. Data mining can also be used for analyzing web logs as well as audit trails. Based on the results of the data mining tool, one can then determine whether any unauthorized intrusions have occurred and/or whether any unauthorized queries have been posed.
Other applications of data mining for cyber security include analyzing audit data. One could build a repository or warehouse containing the audit data and then conduct an analysis using various data mining tools to see if there are potential anomalies. For example, there could be a situation where a certain user group accesses the database between 3 a.m. and 5 a.m. It could be that this group is working the night shift, in which case there may be a valid explanation. However, if this group works between 9 a.m. and 5 p.m., then this may be an unusual occurrence. Another example is when a person who always accesses the databases between 1 p.m. and 2 p.m. has, for the past two days, been accessing the database between 1 a.m. and 2 a.m. This could then be flagged as an unusual pattern requiring further investigation.
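The kind of audit-log check just described can be sketched in a few lines. The (user, hour) records below are hypothetical; a real warehouse analysis would mine far richer audit attributes.

```python
from collections import defaultdict

def build_profiles(audit_log):
    """Build each user's set of habitual access hours from
    historical audit records of the form (user, hour)."""
    profiles = defaultdict(set)
    for user, hour in audit_log:
        profiles[user].add(hour)
    return profiles

def flag_unusual(profiles, user, hour):
    """Flag an access that falls outside the user's profile."""
    return hour not in profiles.get(user, set())

# Hypothetical history: 'alice' normally queries between 1 p.m. and 2 p.m.
history = [("alice", 13), ("alice", 13), ("alice", 14), ("bob", 3), ("bob", 4)]
profiles = build_profiles(history)
print(flag_unusual(profiles, "alice", 13))  # habitual hour -> False
print(flag_unusual(profiles, "alice", 1))   # 1 a.m. access -> True
```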
Insider threat analysis is also a problem from a national security as well as a cyber security perspective. That is, those working in a corporation who are considered trusted could commit espionage. Similarly, those with proper access to the computer system could plant Trojan horses and viruses. Catching such terrorists is far more difficult than catching terrorists outside an organization. One may need to monitor the access patterns of all the individuals of a corporation, even system administrators, to see whether they are carrying out cyber-terrorism activities. There is now some research by various groups on applying data mining for such applications.
Figure 4.4 Data mining for cyber security
Although data mining can be used to detect and prevent cyber attacks, data mining also exacerbates some security problems, such as the inference and privacy problems. With data mining techniques, one could infer sensitive associations from legitimate responses. Figure 4.4 illustrates data mining for cyber security. For a more detailed high-level overview, we refer the reader to [Thuraisingham, 2005a] and [Thuraisingham, 2005b].
4.3 Current Research and Development

We are developing a number of tools on data mining for cyber security applications at The University of Texas at Dallas. In our previous book we discussed one such tool for intrusion detection [Awad et al., 2009]. An intrusion can be defined as any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource. As systems become more complex, there are always exploitable weaknesses, whether as a result of design and programming errors or through the use of various "socially engineered" penetration techniques. Computer attacks are split into two categories: host-based attacks and network-based attacks. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines that obtain system call data from an audit process that tracks all system calls made on behalf of each user.
Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, overloading network hosts, and so forth. Network-based attack detection uses network traffic data (i.e., tcpdump) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems.
Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity; these known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks. However, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. The focus of the current state of the art is to reduce false negative and false positive rates.
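The two detector styles can be contrasted with a toy sketch. The signature set, record fields, and thresholds below are all invented for illustration, not drawn from any real IDS.

```python
# Misuse detection relies on known-bad patterns (signatures);
# here, hypothetical (protocol, port) pairs.
SIGNATURES = {("tcp", 31337), ("udp", 6667)}

def misuse_detect(record):
    """Misuse detection: flag traffic matching a known signature."""
    return (record["proto"], record["port"]) in SIGNATURES

def anomaly_detect(record, baseline_rate, threshold=5.0):
    """Anomaly detection: flag traffic whose packet rate deviates
    far from the established baseline for that service."""
    return record["pkts_per_sec"] > threshold * baseline_rate

known_attack = {"proto": "tcp", "port": 31337, "pkts_per_sec": 10}
novel_flood  = {"proto": "tcp", "port": 80,    "pkts_per_sec": 900}
print(misuse_detect(known_attack))        # True: matches a signature
print(misuse_detect(novel_flood))         # False: no signature exists yet
print(anomaly_detect(novel_flood, 20.0))  # True: 900 >> 5 * 20
```

The sketch shows the tradeoff in miniature: the misuse detector misses the novel flood entirely, while the anomaly detector catches it but would also flag any legitimate burst of traffic (a false positive).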
Our current tools discussed in this book include those for email worm detection, malicious code detection, buffer overflow detection, and botnet detection, as well as for analyzing firewall policy rules. Figure 4.5 illustrates the various tools we have developed. Some of these tools are discussed in Parts II through VII of this book. For example, for email worm detection, we examine emails and extract features such as "number of attachments," train data mining tools with techniques such as SVM (support vector machine) or Naïve Bayesian classifiers, and develop a model. Then we test the model and determine whether the email contains a virus/worm or not. We use training and testing datasets posted on various web sites. Similarly, for malicious code detection, we extract n-gram features from both assembly code and binary code. We first train the data mining tool using the SVM technique and then test the model; the classifier determines whether the code is malicious or not. For buffer overflow detection, we assume that malicious messages contain code whereas normal messages contain data. We train SVM and then test to see whether the message contains code or data.
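As a rough sketch of the n-gram extraction step used in malicious code detection, the following counts byte bigrams in a binary and keeps the most frequent ones as candidate features for a classifier such as SVM. The byte sequence is hypothetical, not taken from any malware sample.

```python
from collections import Counter

def ngram_features(code_bytes, n=2, top_k=5):
    """Slide a window of n bytes over the binary and count each
    n-gram; the most frequent n-grams serve as candidate features
    for training a classifier."""
    grams = Counter(code_bytes[i:i + n] for i in range(len(code_bytes) - n + 1))
    return grams.most_common(top_k)

# Hypothetical byte sequence with a repeating pattern.
sample = b"\x90\x90\x31\xc0\x90\x90\x31\xc0\x90\x90"
print(ngram_features(sample, n=2))  # b"\x90\x90" is the most frequent bigram
```

In practice, feature selection (e.g., by information gain) would then pick the most discriminative n-grams across a corpus of benign and malicious executables.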
Figure 4.5 Data mining tools at UT Dallas
4.4 Summary

This chapter has discussed data mining for security applications. We started with a discussion of data mining for cyber security applications and then provided a brief overview of the tools we are developing. We describe some of these tools in Parts II through VII of this book. Note that we will focus mainly on malware detection; however, in Part VII we also discuss tools for insider threat detection, active defense, and real-time data mining.
Data mining for national security as well as for cyber security is a very active research area. Various data mining techniques, including link analysis and association rule mining, are being explored to detect abnormal patterns. Because of data mining, users can now make all kinds of correlations, which also raises privacy concerns. More details on privacy can be found in [Thuraisingham, 2002].
References

[Awad et al., 2009] Awad, M., L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools, CRC Press, 2009.
[Bolz et al., 2005] Bolz, F., K. Dudonis, D. Schulz, The Counterterrorism Handbook: Tactics, Procedures, and Techniques, Third Edition, CRC Press, 2005.
[Chan, 1999] Chan, P., W. Fan, A. Prodromidis, S. Stolfo, Distributed Data Mining in Credit Card Fraud Detection, IEEE Intelligent Systems, Vol. 14, No. 6, 1999, pp. 67–74.
[Thuraisingham, 2002] Thuraisingham, B., Data Mining, National Security, Privacy and Civil Liberties, SIGKDD Explorations, Vol. 4, No. 2, 2002, pp. 1–5.
[Thuraisingham, 2003] Thuraisingham, B., Web Data Mining Technologies and Their Applications in Business Intelligence and Counter-Terrorism, CRC Press, 2003.
[Thuraisingham, 2005a] Thuraisingham, B., Managing Threats to Web Databases and Cyber Systems: Issues, Solutions and Challenges, Kluwer, 2004 (Editors: V. Kumar, J. Srivastava, A. Lazarevic).
[Thuraisingham, 2005b] Thuraisingham, B., Database and Applications Security, CRC Press, 2005.
5
DESIGN AND IMPLEMENTATION OF DATA MINING TOOLS
5.1 Introduction

Data mining is an important process that has been integrated into many industrial, governmental, and academic applications. It is defined as the process of analyzing and summarizing data to uncover new knowledge. Data mining maturity depends on other areas such as data management, artificial intelligence, statistics, and machine learning.
In our previous book [Awad et al., 2009] we concentrated mainly on the classification problem. We applied classification in three critical applications, namely, intrusion detection, WWW prediction, and image classification. Specifically, we strove to improve performance (time and accuracy) by incorporating multiple (two or more) learning models. In intrusion detection we tried to improve the training time, whereas in WWW prediction we studied hybrid models to improve the prediction accuracy. The classification problem is also sometimes referred to as "supervised learning," in which a set of labeled examples is learned by a model, and then a new example with an unknown label is presented to the model for prediction.
Many prediction models have been used, such as Markov models, decision trees, artificial neural networks, support vector machines, association rule mining, and many others. Each of these models has strengths and weaknesses. However, there is a common weakness among all of these techniques: the inability to suit all applications. The reason there is no ideal or perfect classifier is that each of these techniques was initially designed to solve specific problems under certain assumptions.
There are two directions in designing data mining techniques: model complexity and model performance. In model complexity, new data structures, training set reduction techniques, and/or small numbers of adaptable parameters are proposed to simplify computations during learning without compromising the prediction accuracy. In model performance, the goal is to improve the prediction accuracy at the cost of some complication of the design or model. It is evident that there is a tradeoff between prediction performance and model complexity. In this book we present studies of hybrid models to improve the prediction accuracy of data mining algorithms in two important applications, namely, intrusion detection and WWW prediction.
Intrusion detection involves processing and learning a large number of examples to detect intrusions. Such a process becomes computationally costly and impractical when the number of records to train against grows dramatically. Eventually this limits our choice of data mining technique: powerful techniques such as support vector machines (SVMs) will be avoided because of their algorithmic complexity. We propose a hybrid model, based on SVMs and clustering analysis, to overcome this problem. The idea is to apply a reduction technique using clustering analysis to approximate support vectors and thereby speed up the training process of SVMs. We propose a method, namely, clustering trees-based SVM (CT-SVM), to reduce the training set and approximate support vectors. We exploit clustering analysis to generate support vectors to improve the accuracy of the classifier.
Surfing prediction is another important research area upon which many application improvements depend. Applications such as latency reduction, web search, and recommendation systems utilize surfing prediction to improve their performance. Several challenges are present in this area: a low accuracy rate [Pitkow and Pirolli, 1999]; sparsity of the data [Burke, 2002], [Grcar et al., 2005]; a large number of labels, which makes it a complex multi-class problem [Chung et al., 2004]; and underutilization of the domain knowledge. Our goal is to improve the predictive accuracy by combining several powerful classification techniques, namely, SVMs, artificial neural networks (ANNs), and the Markov model. The Markov model is a powerful technique for predicting seen data; however, it cannot predict unseen data. On the other hand, techniques such as SVM and ANN are powerful predictors that can predict not only the seen data but also the unseen data. However, when dealing with large numbers of classes/labels, or when one instance may belong to many classes, their predictive power may decrease. We use Dempster's rule to fuse the prediction outcomes of these models. Such fusion combines the best of the different models and has achieved better accuracy than the individual models.
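The text does not give the exact mass assignments used, so the following is only a minimal sketch of Dempster's rule of combination, fusing hypothetical belief masses from an "SVM" and an "ANN" over two candidate next pages (the frame of discernment also carries a residual mass representing ignorance).

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions over the
    same frame of discernment. Keys are frozensets of hypotheses;
    values are belief masses summing to 1."""
    combined, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass assigned to incompatible pairs
    # Normalize by 1 - K, where K is the total conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

A, B = frozenset({"page_A"}), frozenset({"page_B"})
theta = A | B                                # the whole frame (ignorance)
svm_masses = {A: 0.7, B: 0.2, theta: 0.1}    # hypothetical SVM output
ann_masses = {A: 0.6, B: 0.3, theta: 0.1}    # hypothetical ANN output
fused = combine(svm_masses, ann_masses)
print(fused)  # agreement on page_A is reinforced after fusion
```

Because both classifiers lean toward page_A, the fused mass on page_A exceeds either individual mass, which is the behavior the fusion step relies on.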
Figure 5.1 Data mining applications
In this chapter we discuss the three applications we considered in our previous book, Design and Implementation of Data Mining Tools [Awad et al., 2009]. That book is a useful reference and provides some background information for our current book. The applications are illustrated in Figure 5.1. In Section 5.2 we discuss intrusion detection. WWW surfing prediction is discussed in Section 5.3. Image classification is discussed in Section 5.4. More details on broader applications of data mining, such as data mining for security applications, web data mining, and image/multimedia data mining, can be found in [Awad et al., 2009].
5.2 Intrusion Detection

Security and defense networks, proprietary research, intellectual property, and data-based market mechanisms, which depend on unimpeded and undistorted access, can all be severely compromised by intrusions. We need to find the best way to protect these systems.
An intrusion can be defined as "any set of actions that attempts to compromise the integrity, confidentiality, or availability of a resource" [Heady et al., 1990], [Axelsson, 1999], [Debar et al., 2000]. User authentication (e.g., using passwords or biometrics), avoiding programming errors, and information protection (e.g., encryption) have all been used to protect computer systems. As systems become more complex, there are always exploitable weaknesses due to design and programming errors, or through the use of various "socially engineered" penetration techniques. For example, exploitable "buffer overflow" flaws still exist in some recent system software as a result of programming errors. Elements central to intrusion detection are: resources to be protected in a target system, i.e., user accounts, file systems, and system kernels; models that characterize the "normal" or "legitimate" behavior of these resources; and techniques that compare the actual system activities with the established models, identifying those that are "abnormal" or "intrusive." In pursuit of a secure system, different measures of system behavior have been proposed, based on an ad hoc presumption that normalcy and anomaly (or illegitimacy) will be accurately manifested in the chosen set of system features.
Intrusion detection attempts to detect computer attacks by examining various data records observed through processes on the same network. These attacks are split into two categories: host-based attacks [Anderson et al., 1995], [Axelsson, 1999], [Freeman et al., 2002] and network-based attacks [Ilgun et al., 1995], [Marchette, 1999]. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines that obtain system call data from an audit process that tracks all system calls made on behalf of each user.
Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, and overloading network hosts. Network-based attack detection uses network traffic data (i.e., tcpdump) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems. Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns [McCanne et al., 1989], [Mukkamala et al., 2002]. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity [Ilgun et al., 1995], [Marchette, 1999]; these known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks; however, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. Our research focuses on network-level systems. A significant challenge in data mining is to reduce the false negative and false positive rates; at the same time, we also need to develop a realistic intrusion detection system.
SVM is one of the most successful classification algorithms in the data mining area, but its long training time limits its use. Many applications, such as data mining for bioinformatics and geoinformatics, require the processing of huge datasets, and the training time of SVM is a serious obstacle in processing them. According to [Yu et al., 2003], it would take years to train SVM on a dataset consisting of one million records. Many proposals have been made to enhance SVM training performance [Agarwal, 2002], [Cauwenberghs and Poggio, 2000], either through random selection or through approximation of the marginal classifier [Feng and Mangasarian, 2001]. However, such approaches are still not feasible with large datasets, where even multiple scans of the entire dataset are too expensive to perform, or they lose, through oversimplification, any benefit to be gained through the use of SVM [Yu et al., 2003].
In Part II of this book we propose a new approach for enhancing the training process of SVM when dealing with large training datasets. It is based on the combination of SVM and clustering analysis. The idea is as follows: SVM computes the maximal margin separating data points; hence, only those patterns closest to the margin can affect the computations of that margin, while other points can be discarded without affecting the final result. The points lying close to the margin are called support vectors. We try to approximate these points by applying clustering analysis.
In general, hierarchical clustering analysis based on a dynamically growing self-organizing tree (DGSOT) involves expensive computations, especially if the set of training data is large. However, in our approach we control the growth of the hierarchical tree by allowing tree nodes (support vector nodes) close to the marginal area to grow while halting distant ones. Therefore, the computations of SVM and further clustering analysis are reduced dramatically. Also, to avoid the cost of computations involved in clustering analysis, we train SVM on the nodes of the tree after each phase or iteration, in which a few nodes are added to the tree. Each iteration involves growing the hierarchical tree by adding new children nodes. This could cause a degradation of the accuracy of the resulting classifier. However, we use the support vector set as a priori knowledge to instruct the clustering algorithm to grow support vector nodes and to stop growing non-support vector nodes. By applying this procedure, the accuracy of the classifier improves and the size of the training set is kept to a minimum.
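The CT-SVM algorithm itself is presented in Part II. As a loose illustration of the underlying intuition only (reduce the training set to points likely to be support vectors, i.e., points near the class boundary), consider this crude sketch, which keeps the fraction of each class closest to the opposite class centroid; the 2-D points are hypothetical and the method is far simpler than DGSOT-based clustering.

```python
def reduce_training_set(points, labels, keep_ratio=0.5):
    """Keep only the fraction of each class lying closest to the
    opposite class's centroid, a crude proxy for 'near the margin,'
    where support vectors live. Assumes a binary problem."""
    def centroid(cls):
        pts = [p for p, y in zip(points, labels) if y == cls]
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    cents = {c: centroid(c) for c in set(labels)}
    reduced = []
    for cls in cents:
        other = cents[[c for c in cents if c != cls][0]]
        members = [p for p, y in zip(points, labels) if y == cls]
        members.sort(key=lambda p: dist2(p, other))  # nearest to other class first
        keep = max(1, int(len(members) * keep_ratio))
        reduced.extend((p, cls) for p in members[:keep])
    return reduced

# Two hypothetical 2-D classes; half the points survive reduction.
pts = [(0, 0), (0, 1), (1, 0), (2, 2), (5, 5), (5, 6), (6, 5), (4, 4)]
lbl = [0, 0, 0, 0, 1, 1, 1, 1]
print(len(reduce_training_set(pts, lbl)))  # 4 points remain
```

The boundary-adjacent points (2, 2) and (4, 4) survive while the interior points are discarded, mirroring the idea that only near-margin points influence the SVM solution.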
We report results here with one benchmark dataset, the 1998 DARPA dataset [Lippmann et al., 1998]. Also, we compare our approach with the Rocchio bundling algorithm, proposed for classifying documents by reducing the number of data points [Shih et al., 2003]. Note that the Rocchio bundling method reduces the number of data points before feeding those data points as support vectors to SVM for training. On the other hand, our clustering approach is intertwined with SVM. We have observed that our approach outperforms pure SVM and the Rocchio bundling technique in terms of accuracy, false positive (FP) rate, false negative (FN) rate, and processing time.
The contributions of our work to intrusion detection are as follows:
1. We propose a new support vector selection technique using clustering analysis to reduce the training time of SVM. Here we combine the clustering analysis and SVM training phases.
2. We show analytically the degree to which our approach is asymptotically quicker than pure SVM, and we validate this claim with experimental results.
3. We compare our approach with random selection and Rocchio bundling on a benchmark dataset and demonstrate impressive results in terms of training time, FP (false positive) rate, FN (false negative) rate, and accuracy.
5.3 Web Page Surfing Prediction
Surfing prediction is an important research area upon which many application improvements depend. Applications such as latency reduction, web search, and personalization systems utilize surfing prediction to improve their performance.
Latency of viewing with regard to web documents is an early application of surfing prediction. Web caching and pre-fetching methods have been developed to pre-fetch multiple pages to improve the performance of World Wide Web systems. The fundamental concept behind all these caching algorithms is the ordering of various web documents using some ranking factors, such as the popularity and the size of the document, according to existing knowledge. Pre-fetching the highest-ranking documents results in a significant reduction of latency during document viewing [Chinen and Yamaguchi, 1997], [Duchamp, 1999], [Griffioen and Appleton, 1994], [Teng et al., 2005], [Yang et al., 2001].
Improvements in web search engines can also be achieved using predictive models. Surfers can be viewed as having walked over the entire WWW link structure. The distribution of visits over all WWW pages is computed and used for re-weighting and re-ranking results. Surfer path information is considered more important than the text keywords entered by the surfers; hence, the more accurate the predictive models are, the better the search results will be [Brin and Page, 1998].
In recommendation systems, collaborative filtering (CF) has been applied successfully to find the top k users having the same tastes or interests based on a given target user's records [Yu et al., 2003]. The k-Nearest-Neighbor (kNN) approach is used to compare a user's historical profile and records with the profiles of other users to find the top k similar users. Using Association Rule Mining (ARM), [Mobasher et al., 2001] propose a method that matches an active user session with frequent itemsets and predicts the next page the user is likely to visit. These CF-based techniques suffer from well-known limitations, including scalability and efficiency [Mobasher et al., 2001], [Sarwar et al., 2000]. [Pitkow and Pirolli, 1999] explore pattern extraction and pattern matching based on a Markov model that predicts future surfing paths. Longest Repeating Subsequences (LRS) is proposed to reduce the model complexity (not the predictive accuracy) by focusing on significant surfing patterns.
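The kNN user-matching step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not any cited system's implementation: cosine similarity over page-visit count vectors is one common choice of similarity measure, and the profiles are made-up data.

```python
import math

def top_k_similar(target, profiles, k=2):
    """Rank users by cosine similarity of their page-visit count
    vectors to the target user's vector; return the k most similar."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a)) *
               math.sqrt(sum(y * y for y in b)))
        return num / den if den else 0.0
    ranked = sorted(profiles, key=lambda u: cos(target, profiles[u]),
                    reverse=True)
    return ranked[:k]

# hypothetical page-visit counts over four pages
profiles = {"u1": [5, 0, 3, 0], "u2": [0, 4, 0, 4], "u3": [4, 1, 2, 0]}
print(top_k_similar([6, 0, 2, 0], profiles, k=2))  # ['u1', 'u3']
```

A CF recommender would then predict the target user's next page from the pages these k neighbors visited.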
There are several problems with the current state-of-the-art solutions. First, the predictive accuracy using a proposed solution such as a Markov model is low; for example, the maximum training accuracy is 41% [Pitkow and Pirolli, 1999]. Second, prediction using Association Rule Mining and LRS pattern extraction is done based on choosing the path with the highest probability in the training set; hence, any new surfing path is misclassified, because the probability of such a path occurring in the training set is zero. Third, the sparse nature of the user sessions used in training can result in unreliable predictors [Burke, 2002], [Grcar et al., 2005]. Finally, many of the previous methods have ignored domain knowledge as a means for improving prediction. Domain knowledge plays a key role in improving the predictive accuracy because it can be used to eliminate irrelevant classifiers during prediction or to reduce their effectiveness by assigning them lower weights.
WWW prediction is a multi-class problem, and prediction can resolve into many classes. Most multi-class techniques, such as one-vs-one and one-vs-all, are based on binary classification. Prediction is required to check any new instance against all classes. In WWW prediction, the number of classes is very large (11,700 classes in our experiments). Hence, prediction accuracy is very low [Chung et al., 2004] because it fails to choose the right class. For a given instance, domain knowledge can be used to eliminate irrelevant classes.
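The scale problem is easy to quantify: one-vs-all trains one binary classifier per class, while one-vs-one trains one per pair of classes, so the count grows quadratically. A quick arithmetic sketch for the class count cited above:

```python
def num_binary_classifiers(n_classes):
    """Binary classifiers required by the two standard multi-class schemes."""
    return {"one_vs_all": n_classes,
            "one_vs_one": n_classes * (n_classes - 1) // 2}

counts = num_binary_classifiers(11700)
print(counts["one_vs_all"])  # 11700
print(counts["one_vs_one"])  # 68439150
```

Consulting tens of millions of pairwise classifiers per prediction is impractical, which is why pruning irrelevant classes with domain knowledge pays off.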
We use several classification techniques, namely Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Association Rule Mining (ARM), and the Markov model, in WWW prediction. We propose a hybrid prediction model by combining two or more of them using Dempster's rule. The Markov model is a powerful technique for predicting seen data; however, it cannot predict unseen data. On the other hand, SVM is a powerful technique that can predict not only seen data but also unseen data. However, when dealing with too many classes, or when there is a possibility that one instance may belong to many classes (e.g., a user, after visiting the web pages 1, 2, 3, might go to page 10, while another might go to page 100), SVM predictive power may decrease, because such examples confuse the training process. To overcome these drawbacks of SVM, we extract domain knowledge from the training set and incorporate this knowledge in the testing set to improve the prediction accuracy of SVM by reducing the number of classifiers consulted during prediction.
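Dempster's rule of combination, used above to fuse classifier outputs, can be sketched as follows. The mass assignments for the two hypothetical sources ("Markov" and "SVM") are invented for illustration; the combination logic itself is the standard rule: multiply masses of intersecting focal elements and renormalize by the non-conflicting mass.

```python
def dempster_combine(m1, m2):
    """Combine two basic probability assignments (mass functions) over
    frozenset focal elements using Dempster's rule of combination."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass on contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# hypothetical beliefs about the next page being p10 or p100
P10, P100 = frozenset({"p10"}), frozenset({"p100"})
EITHER = P10 | P100
m_markov = {P10: 0.6, EITHER: 0.4}
m_svm = {P10: 0.5, P100: 0.3, EITHER: 0.2}
fused = dempster_combine(m_markov, m_svm)
print(round(fused[P10], 3))  # 0.756
```

Both sources lean toward p10, so the fused mass on p10 (0.756) exceeds either individual belief, which is the behavior that makes the rule attractive for fusing agreeing classifiers.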
ANN is also a powerful technique that can predict not only seen data but also unseen data. Nonetheless, ANN has shortcomings similar to those of SVM when dealing with too many classes or when one instance may belong to many classes. Furthermore, the design of an ANN becomes complex with a large number of input and output nodes. To overcome these drawbacks of ANN, we employ domain knowledge from the training set and incorporate this knowledge in the testing set by reducing the number of classifiers to consult during prediction. This improves the prediction accuracy and reduces the prediction time.
Our contributions to WWW prediction are as follows:
1. We overcome the drawbacks of SVM and ANN in WWW prediction by extracting domain knowledge and incorporating it in prediction to improve accuracy and prediction time.
2. We propose a hybrid approach for prediction in the WWW. Our approach fuses different combinations of prediction techniques, namely SVM, ANN, and Markov, using Dempster's rule [Lalmas, 1997] to improve the accuracy.
3. We compare our hybrid model with different approaches, namely the Markov model, Association Rule Mining (ARM), Artificial Neural Networks (ANNs), and Support Vector Machines (SVMs), on a standard benchmark dataset and demonstrate the superiority of our method.
5.4 Image Classification
Image classification is about determining the class to which an image belongs. It is an aspect of image data mining. Other image data mining outcomes include determining anomalies in images in the form of change detection, as well as clustering images. In some situations, making links between images may also be useful. One key aspect of image classification is image annotation. Here, the system understands raw images and automatically annotates them. The annotation is essentially a description of the images.
Our contributions to image classification include the following:
• We present a new framework for automatic image annotation.
• We propose a dynamic feature weighting algorithm based on histogram analysis and Chi-square.
• We present an image re-sampling method to solve the imbalanced data problem.
• We present a modified kNN algorithm based on evidence theory.
In our approach, we first annotate images automatically. In particular, we utilize K-means clustering algorithms to cluster image blobs and then make a correlation between the blobs and words. This results in annotated images. Our research has also focused on classifying images using ontologies for geospatial data. Here we classify images using a region growing algorithm and then use high-level concepts in the form of ontologies to classify the regions. Our research on image classification is given in [Awad et al., 2009].
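The blob-word correlation step can be sketched as a simple co-occurrence count: each blob cluster accumulates the caption words it appears with, and the normalized counts act as P(word | blob) for annotating new images. This is an illustrative stand-in for our annotation model, and the cluster ids and caption words are toy data.

```python
from collections import Counter, defaultdict

def annotation_table(training):
    """Count how often each blob cluster co-occurs with each caption
    word; normalized counts approximate P(word | blob)."""
    cooc = defaultdict(Counter)
    for blobs, words in training:
        for blob in blobs:
            cooc[blob].update(words)
    return {b: {w: c / sum(ws.values()) for w, c in ws.items()}
            for b, ws in cooc.items()}

def annotate(blobs, table):
    """Label each blob of a new image with its most likely word."""
    return [max(table[b], key=table[b].get) for b in blobs if b in table]

# toy training pairs: (cluster ids of image blobs, caption words)
train = [([1, 2], ["sky", "grass"]),
         ([1, 3], ["sky", "water"]),
         ([2], ["grass"])]
table = annotation_table(train)
print(annotate([1, 2], table))  # ['sky', 'grass']
```

In the actual system, the cluster ids would come from K-means over blob features rather than being given directly.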
5.5 Summary
In this chapter, we have discussed three applications that were described in [Awad et al., 2009]. We have developed data mining tools for these three applications: intrusion detection, web page surfing prediction, and image classification. They are part of the broader classes of applications: cyber security, web information management, and multimedia/image information management, respectively. In this book, we have taken one topic discussed in our prior book and elaborated on it. In particular, we have described data mining for cyber security and have focused on malware detection.
Future directions will focus on two aspects. One is enhancing the data mining algorithms to address limitations such as false positives and false negatives, as well as to reason with uncertainty. The other is to expand on applying data mining to the broader classes of applications, such as cyber security, multimedia information management, and web information management.
References
[Agarwal, 2002] Agarwal, D. K., Shrinkage Estimator Generalizations of Proximal Support Vector Machines, in Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, 2002, pp. 173–182.
[Anderson et al., 1995] Anderson, D., T. Frivold, A. Valdes, Next-Generation Intrusion Detection Expert System (NIDES): A Summary, Technical Report SRI-CSL-95-07, Computer Science Laboratory, SRI International, Menlo Park, California, May 1995.
[Awad et al., 2009] Awad, M., L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools, CRC Press, 2009.
[Axelsson, 1999] Axelsson, S., Research in Intrusion Detection Systems: A Survey, Technical Report TR 98-17 (revised in 1999), Chalmers University of Technology, Goteborg, Sweden, 1999.
[Brin and Page, 1998] Brin, S., and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in Proceedings of the 7th International WWW Conference, Brisbane, Australia, 1998, pp. 107–117.
[Burke, 2002] Burke, R., Hybrid Recommender Systems: Survey and Experiments, User Modeling and User-Adapted Interaction, Vol. 12, No. 4, 2002, pp. 331–370.
[Cauwenberghs and Poggio, 2000] Cauwenberghs, G., and T. Poggio, Incremental and Decremental Support Vector Machine Learning, in Advances in Neural Information Processing Systems 13: Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, T. K. Leen, T. G. Dietterich, V. Tresp (Eds.), MIT Press, 2001.
[Chinen and Yamaguchi, 1997] Chinen, K., and S. Yamaguchi, An Interactive Prefetching Proxy Server for Improvement of WWW Latency, in Proceedings of the Seventh Annual Conference of the Internet Society (INET '97), Kuala Lumpur, June 1997.
[Chung et al., 2004] Chung, V., C. H. Li, J. Kwok, Dissimilarity Learning for Nominal Data, Pattern Recognition, Vol. 37, No. 7, 2004, pp. 1471–1477.
[Debar et al., 2000] Debar, H., M. Dacier, A. Wespi, A Revised Taxonomy for Intrusion Detection Systems, Annales des Telecommunications, Vol. 55, No. 7–8, 2000, pp. 361–378.
[Duchamp, 1999] Duchamp, D., Prefetching Hyperlinks, in Proceedings of the Second USENIX Symposium on Internet Technologies and Systems (USITS), Boulder, CO, 1999, pp. 127–138.
[Feng and Mangasarian, 2001] Feng, G., and O. L. Mangasarian, Semi-supervised Support Vector Machines for Unlabeled Data Classification, Optimization Methods and Software, Vol. 15, 2001, pp. 29–44.
[Freeman et al., 2002] Freeman, S., A. Bivens, J. Branch, B. Szymanski, Host-Based Intrusion Detection Using User Signatures, in Proceedings of Research Conference, RPI, Troy, NY, October 2002.
[Grcar et al., 2005] Grcar, M., B. Fortuna, D. Mladenic, kNN versus SVM in the Collaborative Filtering Framework, WebKDD '05, August 21, 2005, Chicago, Illinois.
[Griffioen and Appleton, 1994] Griffioen, J., and R. Appleton, Reducing File System Latency Using a Predictive Approach, in Proceedings of the 1994 Summer USENIX Technical Conference, Cambridge, MA.
[Heady et al., 1990] Heady, R., G. Luger, A. Maccabe, M. Servilla, The Architecture of a Network Level Intrusion Detection System, Technical Report TR-CS-1990-20, University of New Mexico, 1990.
[Ilgun et al., 1995] Ilgun, K., R. A. Kemmerer, P. A. Porras, State Transition Analysis: A Rule-Based Intrusion Detection Approach, IEEE Transactions on Software Engineering, Vol. 21, No. 3, 1995, pp. 181–199.
[Lalmas, 1997] Lalmas, M., Dempster-Shafer's Theory of Evidence Applied to Structured Documents: Modelling Uncertainty, in Proceedings of the 20th Annual International ACM SIGIR, Philadelphia, PA, 1997, pp. 110–118.
[Lippmann et al., 1998] Lippmann, R. P., I. Graf, D. Wyschogrod, S. E. Webster, D. J. Weber, S. Gorton, The 1998 DARPA/AFRL Off-Line Intrusion Detection Evaluation, First International Workshop on Recent Advances in Intrusion Detection (RAID), Louvain-la-Neuve, Belgium, 1998.
[Marchette, 1999] Marchette, D., A Statistical Method for Profiling Network Traffic, First USENIX Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, 1999, pp. 119–128.
[Mobasher et al., 2001] Mobasher, B., H. Dai, T. Luo, M. Nakagawa, Effective Personalization Based on Association Rule Discovery from Web Usage Data, in Proceedings of the ACM Workshop on Web Information and Data Management (WIDM '01), 2001, pp. 9–15.
[Mukkamala et al., 2002] Mukkamala, S., G. Janoski, A. Sung, Intrusion Detection: Support Vector Machines and Neural Networks, in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Honolulu, HI, 2002, pp. 1702–1707.
[Pitkow and Pirolli, 1999] Pitkow, J., and P. Pirolli, Mining Longest Repeating Subsequences to Predict World Wide Web Surfing, in Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS '99), Boulder, CO, October 1999, pp. 139–150.
[Sarwar et al., 2000] Sarwar, B. M., G. Karypis, J. Konstan, J. Riedl, Analysis of Recommender Algorithms for E-Commerce, in Proceedings of the 2nd ACM E-Commerce Conference (EC '00), October 2000, Minneapolis, Minnesota, pp. 158–167.
[Shih et al., 2003] Shih, L., J. D. M. Rennie, Y. Chang, D. R. Karger, Text Bundling: Statistics-Based Data Reduction, in Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003, Washington, DC, pp. 696–703.
[Teng et al., 2005] Teng, W.-G., C.-Y. Chang, M.-S. Chen, Integrating Web Caching and Web Prefetching in Client-Side Proxies, IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 5, May 2005, pp. 444–455.
[Yang et al., 2001] Yang, Q., H. Zhang, T. Li, Mining Web Logs for Prediction Models in WWW Caching and Prefetching, in The 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 26–29, 2001, pp. 475–478.
[Yu et al., 2003] Yu, H., J. Yang, J. Han, Classifying Large Data Sets Using SVM with Hierarchical Clusters, in SIGKDD 2003, August 24–27, 2003, Washington, DC, pp. 306–315.
Conclusion to Part I
We have presented various supporting technologies for data mining for malware detection. These include data mining technologies, malware technologies, as well as data mining applications. First, we provided an overview of data mining techniques. Next, we discussed various types of malware. This was followed by a discussion of data mining for security applications. Finally, we provided a summary of the data mining tools we discussed in our previous book, Design and Implementation of Data Mining Tools.
Now that we have provided an overview of supporting technologies, we can discuss the various types of data mining tools we have developed for malware detection. In Part II, we discuss email worm detection tools. In Part III, we discuss data mining tools for detecting malicious executables. In Part IV, we discuss data mining for detecting remote exploits. In Part V, we discuss data mining for botnet detection. In Part VI, we discuss stream mining tools. Finally, in Part VII, we discuss some of the emerging tools, including data mining for insider threat detection and firewall policy analysis.
PART II
DATA MINING FOR EMAIL WORM DETECTION
Introduction to Part II
In this part, we will discuss data mining techniques to detect email worms. Email messages contain a number of different features, such as the total number of words in the message body/subject, the presence/absence of binary attachments, the type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature selection and dimension reduction. This step is necessary to reduce noise and redundancy in the data. The feature selection technique is called Two-Phase Selection (TPS), which is a novel combination of a decision tree and a greedy selection algorithm. The dimension reduction is performed by Principal Component Analysis. Second, the reduced data are used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes, and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. These results have been compared with published results. It is found that the proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.
Part II consists of three chapters: 6, 7, and 8. In Chapter 6, we provide an overview of email worm detection, including a discussion of related work. In Chapter 7, we discuss our tool for email worm detection. In Chapter 8, we analyze the results we have obtained by using our tool.
6
EMAIL WORM DETECTION
6.1 Introduction
An email worm spreads through infected email messages. The worm may be carried by an attachment, or the email may contain links to an infected web site. When the user opens the attachment or clicks the link, the host gets infected immediately. The worm exploits the vulnerable email software in the host machine to send infected emails to addresses stored in the address book. Thus, new machines get infected. Worms bring damage to computers and people in various ways. They may clog the network traffic, cause damage to the system, and make the system unstable or even unusable.
The traditional method of worm detection is signature based. A signature is a unique pattern in the worm body that can identify it as a particular type of worm. Thus, a worm can be detected from its signature. But the problem with this approach is that it involves a significant amount of human intervention and may take a long time (from days to weeks) to discover the signature. Thus, this approach is not useful against "zero-day" attacks of computer worms. Also, signature matching is not effective against polymorphism.
Thus, there is a growing need for a fast and effective detection mechanism that requires no manual intervention. Our work is directed toward automatic and efficient detection of email worms. In our approach, we have developed a two-phase feature selection technique for email worm detection. In this approach, we apply TPS to select the best features using a decision tree and a greedy algorithm. We compare our approach with two baseline techniques. The first baseline approach does not apply any feature reduction; it trains a classifier with the unreduced dataset. The second baseline approach reduces data dimension using principal component analysis (PCA) and trains a classifier with the reduced dataset. It is shown empirically that our TPS approach outperforms the baseline techniques. We also report the feature set that achieves this performance. For the base learning algorithm (i.e., classifier), we use both support vector machine (SVM) and Naïve Bayes (NB). We observe relatively better performance with SVM. Thus, we strongly recommend applying SVM with our TPS process for detecting novel email worms in a feature-based paradigm.
Figure 6.1 Concepts in this chapter. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
The organization of this chapter is as follows. Section 6.2 describes our architecture. Section 6.3 describes related work in automatic email worm detection. Our approach is briefly discussed in Section 6.4. The chapter is summarized in Section 6.5. Figure 6.1 illustrates the concepts in this chapter.
6.2 Architecture
Figure 6.2 Architecture for email worm detection. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
Figure 6.2 illustrates our architecture at a high level. At first, we build a classifier from training data containing both benign and infected emails. Then, unknown emails are tested with the classifier to predict whether they are infected or clean.
The training data consist of both benign and malicious (infected) emails. These emails are called training instances. The training instances go through the feature selection module, where features are extracted and the best features are selected (see Sections 7.3, 7.4). The output of the feature selection module is a feature vector for each training instance. These feature vectors are then sent to the training module to train a classification model (classifier module). We use different classification models, such as support vector machine (SVM), Naïve Bayes (NB), and their combination (see Section 7.5). A new email arriving at the host machine first undergoes the feature extraction module, where the same features selected in the feature selection module are extracted and a feature vector is produced. This feature vector is given as input to the classifier, and the classifier predicts the class (i.e., benign/infected) of the email.
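As an illustration of the feature extraction module, the sketch below maps a raw email to a small numeric feature vector. The dict schema and the particular features (word counts, attachment indicators) are hypothetical, chosen only to mirror the kinds of per-email features discussed in Chapter 7.

```python
def email_features(email):
    """Map a raw email (hypothetical dict schema) to the kind of numeric
    feature vector a classifier consumes: word counts in body/subject,
    attachment count, and a binary executable-attachment flag."""
    body_words = len(email.get("body", "").split())
    subject_words = len(email.get("subject", "").split())
    attachments = email.get("attachments", [])
    has_binary = int(any(a.lower().endswith((".exe", ".scr", ".pif"))
                         for a in attachments))
    return [body_words, subject_words, len(attachments), has_binary]

msg = {"subject": "Re: your document",
       "body": "see the attached file for details",
       "attachments": ["details.pif"]}
print(email_features(msg))  # [6, 3, 1, 1]
```

The same function is applied identically at training time and at prediction time, so a new email is projected into exactly the feature space the classifier was trained on.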
6.3 Related Work
There are different approaches to automating the detection of worms. These approaches are mainly of two types: behavioral and content based. Behavioral approaches analyze the behavior of messages, such as source-destination addresses, attachment types, message frequency, and so forth. Content-based approaches look into the content of the message and try to detect the signature automatically. There are also combined methods that take advantage of both techniques.
An example of behavioral detection is social network analysis [Golbeck and Hendler, 2004], [Newman et al., 2002]. It detects worm-infected emails by creating graphs of a network, where users are represented as nodes and communications between users are represented as edges. A social network is a group of nodes among which there exist edges. Emails that propagate beyond the group boundary are considered to be infected. The drawback of this system is that worms can easily bypass social networks by intelligently choosing recipient lists, by looking at recent emails in the user's outbox.
Another example of a behavioral approach is the application of the Email Mining Toolkit (EMT) [Stolfo et al., 2006]. The EMT computes behavior profiles of user email accounts by analyzing email logs. It uses modeling techniques to achieve high detection rates with very low false positive rates. Statistical analysis of outgoing emails is another behavioral approach [Schultz et al., 2001], [Symantec, 2005]. Statistics collected from the frequency of communication between clients and their mail server, byte sequences in the attachment, and so on are used to predict anomalies in emails, and thus worms are detected.
An example of the content-based approach is the EarlyBird System [Singh et al., 2003]. In this system, statistics on highly repetitive packet contents are gathered. These statistics are analyzed to detect possible infection of host or server machines. This method generates the content signature of a worm without any human intervention. Results reported by this system indicated a very low false positive rate of detection. Other examples are Autograph [Kim and Karp, 2004] and Polygraph [Newsome et al., 2005], developed at Carnegie Mellon University.
There are other approaches to detect early spreading of worms, such as employing a "honeypot." A honeypot [Honeypot, 2006] is a closely monitored decoy computer that attracts attacks for early detection and in-depth adversary analysis. Honeypots are designed not to send out email in normal situations. If a honeypot begins to send out emails after running the attachment of an email, it is determined that this email is an email worm.
Another approach, by [Sidiroglou et al., 2005], employs behavior-based anomaly detection, which is different from signature-based or statistical approaches. Their approach is to open all suspicious attachments inside an instrumented virtual machine, looking for dangerous actions such as writing to the Windows registry, and to flag suspicious messages.
Our work is related to that of [Martin et al., 2005-a]. They report an experiment with email data in which they apply a statistical approach to find an optimum subset of a large set of features to facilitate the classification of outgoing emails and eventually detect novel email worms. However, our approach differs from theirs in that we apply PCA and TPS to reduce noise and redundancy in the data.
6.4 Overview of Our Approach
We apply a feature-based approach to worm detection. A number of features of email messages have been identified in [Martin et al., 2005-a] and are discussed in this chapter. The total number of features is large, and some of them may be redundant or noisy, so we apply two different feature-reduction techniques: a dimension-reduction technique called PCA, and our novel feature-selection technique called TPS, which applies a decision tree and greedy elimination. These features are used to train a classifier to obtain a classification model. We use three different classifiers for this task: SVM, NB, and a combination of SVM and NB, mentioned henceforth as the Series classifier. The Series approach was first proposed by [Martin et al., 2005-b].
We use the dataset of [Martin et al., 2005-a] for evaluation purposes. The original data distribution was unbalanced, so we balance it by rearranging. We divide the dataset into two disjoint subsets: the known worms set, or K-Set, and the novel worms set, or N-Set. The K-Set contains some clean emails and emails infected by five different types of worms. The N-Set contains emails infected by a sixth type of worm but no clean emails. We run a threefold cross validation on the K-Set. At each iteration of the cross validation, we test the accuracy of the trained classifiers on the N-Set. Thus, we obtain two different measures of accuracy, namely, the accuracy of the threefold cross validation on the K-Set and the average accuracy of novel worm detection on the N-Set.
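The evaluation scheme above can be sketched as follows. The nearest-centroid classifier is a deliberately trivial stand-in for SVM/NB, and the toy K-Set and N-Set are invented; what the sketch shows is the split logic: threefold cross validation on the K-Set, with each fold's model also scored on the held-out novel-worm N-Set.

```python
import statistics

def centroid_classifier(train):
    """Train a nearest-centroid stand-in for SVM/NB (illustration only)."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0] * len(x)), x)]
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(x):
        return min(cents, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(cents[y], x)))
    return predict

def threefold_eval(k_set, n_set):
    """3-fold CV on the K-Set; after each fold, also score the
    novel-worm N-Set with the fold's model, then average both."""
    cv_acc, novel_acc = [], []
    for fold in range(3):
        test = k_set[fold::3]
        train = [xy for i, xy in enumerate(k_set) if i % 3 != fold]
        predict = centroid_classifier(train)
        cv_acc.append(sum(predict(x) == y for x, y in test) / len(test))
        novel_acc.append(sum(predict(x) == y for x, y in n_set) / len(n_set))
    return statistics.mean(cv_acc), statistics.mean(novel_acc)

# toy 2-feature data: label 0 = clean, 1 = infected
k_set = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0),
         ([9, 9], 1), ([8, 9], 1), ([9, 8], 1)]
n_set = [([7, 8], 1), ([8, 7], 1)]  # a "sixth", unseen worm type
print(threefold_eval(k_set, n_set))  # (1.0, 1.0)
```

The two returned numbers correspond to the two accuracy measures described in the text: known-worm CV accuracy and average novel-worm detection accuracy.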
Our contributions to this work are as follows. First, we apply two special feature-reduction techniques to remove redundancy and noise from the data. One technique is PCA, and the other is our novel TPS algorithm. PCA is commonly used to extract patterns from high-dimensional data, especially when the data are noisy. It is a simple and nonparametric method. TPS applies the decision tree C4.5 [Quinlan, 1993] for initial selection, and thereafter it applies a greedy elimination technique (see Section 7.4.2, "Two-Phase Feature Selection (TPS)"). Second, we create a balanced dataset, as explained earlier. Finally, we compare the individual performances of NB, SVM, and Series and show empirically that the Series approach proposed by [Martin et al., 2005-b] performs worse than either NB or SVM. Our approach is illustrated in Figure 6.3.
6.5 Summary
In this chapter, we have argued that feature-based approaches for worm detection are superior to the traditional signature-based approaches. Next, we described some related work on email worm detection and then briefly discussed our approach, which uses feature reduction and classification based on PCA, SVM, and NB.
Figure 6.3 Email worm detection using data mining. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
In the future, we are planning to detect worms by combining the feature-based approach with the content-based approach to make detection more robust and efficient. We are also focusing on the statistical properties of the contents of the messages for possible contamination by worms. Our approach is discussed in Chapter 7. Analysis of the results of our approach is given in Chapter 8.
References
[Golbeck and Hendler, 2004] Golbeck, J., and J. Hendler, Reputation Network Analysis for Email Filtering, in Proceedings of CEAS 2004, First Conference on Email and Anti-Spam.
[Honeypot, 2006] Intrusion Detection, Honeypots and Incident Handling Resources, Honeypots.net, http://www.honeypots.net
[Kim and Karp, 2004] Kim, H.-A., and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), San Diego, CA, August 2004, pp. 271–286.
[Martin et al., 2005-a] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21–22, Stanford University, CA.
[Martin et al., 2005-b] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, A Two-Layer Approach for Novel Email Worm Detection, submitted to USENIX Steps on Reducing Unwanted Traffic on the Internet (SRUTI).
[Newman et al., 2002] Newman, M. E. J., S. Forrest, J. Balthrop, Email Networks and the Spread of Computer Viruses, Physical Review E 66, 035101, 2002.
[Newsome et al., 2005] Newsome, J., B. Karp, D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, May 2005.
[Quinlan, 1993] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[Schultz et al., 2001] Schultz, M., E. Eskin, E. Zadok, MEF: Malicious Email Filter, A UNIX Mail Filter That Detects Malicious Windows Executables, in USENIX Annual Technical Conference, FREENIX Track, June 2001.
[Sidiroglou et al., 2005] Sidiroglou, S., J. Ioannidis, A. D. Keromytis, S. J. Stolfo, An Email Worm Vaccine Architecture, in Proceedings of the First International Conference on Information Security Practice and Experience (ISPEC 2005), Singapore, April 11–14, 2005, pp. 97–108.
[Singh et al., 2003] Singh, S., C. Estan, G. Varghese, S. Savage, The EarlyBird System for Real-Time Detection of Unknown Worms, Technical Report CS2003-0761, University of California, San Diego, August 4, 2003.
[Stolfo et al., 2006] Stolfo, S. J., S. Hershkop, C. W. Hu, W. Li, O. Nimeskern, K. Wang, Behavior-Based Modeling and Its Application to Email Analysis, ACM Transactions on Internet Technology (TOIT), February 2006.
[Symantec, 2005] W32.Beagle.BG@mm, http://www.sarc.com/avcenter/venc/data/w32.beagle.bg@mm.html
7
DESIGN OF THE DATA MINING TOOL
7.1 Introduction
As we discussed in Chapter 6, feature-based approaches for worm detection are superior to the traditional signature-based approaches. Our approach for worm detection carries out feature reduction and classification using principal component analysis (PCA), support vector machine (SVM), and Naïve Bayes (NB). In this chapter, we first discuss the features that are used to train classifiers for detecting email worms. Second, we describe our dimension reduction and feature selection techniques. Our proposed two-phase feature selection technique utilizes information gain and a decision tree induction algorithm for feature selection. In the first phase, we build a decision tree using the training data on the whole feature set. The decision tree selects a subset of features, which we call the minimal subset of features. In the second phase, we greedily select additional features and add them to the minimal subset. Finally, we describe the classification techniques, namely Naïve Bayes (NB), support vector machine (SVM), and a combination of NB and SVM.
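The two phases can be sketched as follows. This is an illustrative simplification, not the C4.5-based algorithm of Section 7.4: phase 1 keeps features with positive information gain as a stand-in for the features an induced decision tree would actually test, and phase 2 greedily adds remaining features while a caller-supplied score improves. All data and the scoring function are toy examples.

```python
import math

def info_gain(xs, ys, f):
    """Information gain of binary feature f for the labels ys."""
    def H(labels):
        n, out = len(labels), 0.0
        for c in set(labels):
            p = labels.count(c) / n
            out -= p * math.log2(p)
        return out
    split = {0: [], 1: []}
    for x, y in zip(xs, ys):
        split[x[f]].append(y)
    cond = sum(len(s) / len(ys) * H(s) for s in split.values() if s)
    return H(ys) - cond

def two_phase_select(xs, ys, score, n_extra=1):
    """Phase 1: minimal subset = features with positive gain.
    Phase 2: greedily add remaining features while `score` improves."""
    n = len(xs[0])
    minimal = [f for f in range(n) if info_gain(xs, ys, f) > 1e-12]
    rest = [f for f in range(n) if f not in minimal]
    selected, best = list(minimal), score(minimal)
    for _ in range(n_extra):
        gains = [(score(selected + [f]), f) for f in rest]
        if not gains:
            break
        s, f = max(gains)
        if s > best:
            selected.append(f)
            rest.remove(f)
            best = s
    return sorted(selected)

# toy binary data: feature 0 predicts the label, feature 1 is noise
xs = [[0, 0], [0, 1], [1, 0], [1, 1]]
ys = [0, 0, 1, 1]
subset_score = lambda fs: len(fs) and max(info_gain(xs, ys, f) for f in fs)
print(two_phase_select(xs, ys, score=subset_score))  # [0]
```

On this toy data, the noisy feature 1 has zero gain, so phase 1 selects only feature 0 and phase 2 finds nothing worth adding; the actual algorithm uses classifier accuracy rather than this simple subset score.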
The organization of this chapter is as follows. Our architecture is discussed in Section 7.2. Feature descriptions are given in Section 7.3. Section 7.4 describes feature reduction techniques. Classification techniques are described in Section 7.5. In particular, we provide an overview of the feature selection, dimension reduction, and classification techniques we have used in our tool. The chapter is summarized in Section 7.6. Figure 7.1 illustrates the concepts in this chapter.
Figure 7.1 Concepts in this chapter.
7.2 Architecture

Figure 7.2 illustrates our system architecture, which includes components for feature reduction and classification. The process has two stages: training and classification. Training is performed with collected samples of benign and infected emails, that is, the training data. The training samples are first analyzed, and a set of features is identified (Section 7.3). To reduce the number of features, we apply a feature selection technique called "two-phase feature selection" (Section 7.4). Using the selected set of features, we generate feature vectors for each training sample, and the feature vectors are used to train a classifier (Section 7.5). When a new email needs to be tested, it first goes through a feature extraction module that generates a feature vector. This feature vector is used by the classifier to predict the class of the email, that is, to predict whether the email is clean or infected.
Figure 7.2 Architecture.
7.3 Feature Description

The features are extracted from a repository of outgoing emails collected over a period of two years [Martin et al 2005-a]. These features are categorized into two different groups: per-email features and per-window features. Per-email features are features of a single email, whereas per-window features are features of a collection of emails sent/received within a window of time.
For a detailed description of the features, please refer to [Martin et al 2005-a]. Each of these features is either continuous valued or binary. The value of a binary feature is either 0 or 1, depending on the presence or absence of this feature in a data point. There are a total of 94 features. Here we describe some of them.
7.3.1 Per-Email Features
HTML in body: Whether there is HTML in the email body. This feature is used because a bug in the HTML parser of the email client is a vulnerability that may be exploited by worm writers. It is a binary feature.
Embedded image: Whether there is any embedded image. This is used because a buggy image processor of the email client is also vulnerable to attacks.
Hyperlinks: Whether there are hyperlinks in the email body. Clicking an infected link causes the host to be infected. It is also a binary feature.
Binary attachment: Whether there are any binary attachments. Worms are mainly propagated by binary attachments. This is also a binary feature.
Multipurpose Internet Mail Extension (MIME) type of attachments: There are different MIME types, for example, "application/msword," "application/pdf," "image/gif," "text/plain," and others. Each of these types is used as a binary feature (27 in total).
UNIX "magic number" of file attachments: Sometimes a different MIME type is assigned by the worm writers to evade detection. Magic numbers can accurately detect the MIME type. Each of these types is used as a binary feature (43 in total).
Number of attachments: It is a continuous feature.
Number of words/characters in subject/body: These features are continuous. Most worms choose random text, whereas a user may have certain writing characteristics. Thus these features are sometimes useful for detecting infected emails.
7.3.2 Per-Window Features
Number of emails sent in window: An infected host is supposed to send emails at a faster rate. This is a continuous feature.
Number of unique email recipients/senders: These are also important criteria to distinguish between normal and infected hosts. These are continuous features too.
Average number of words/characters per subject/body, average word length: These features are also useful in distinguishing between normal and viral activity.
Variance in number of words/characters per subject/body, variance in word length: These are also useful properties of email worms.
Ratio of emails to attachments: Usually normal emails do not contain attachments, whereas most infected emails do contain them.
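The per-window features above are simple aggregates over the emails in a window. The sketch below is illustrative only; the dictionary keys ('recipients', 'n_attachments', 'n_body_words') are hypothetical names, not the actual field names used by the tool.

```python
from statistics import mean, pvariance

def per_window_features(emails):
    """Compute a few of the per-window features described above from
    the emails sent within one window of time. Each email is a dict
    with hypothetical keys; the real tool extracts many more fields."""
    n = len(emails)
    words = [e["n_body_words"] for e in emails]
    recipients = {r for e in emails for r in e["recipients"]}
    return {
        "emails_in_window": n,
        "unique_recipients": len(recipients),
        "avg_words_in_body": mean(words),
        "var_words_in_body": pvariance(words),
        "ratio_with_attachment": sum(1 for e in emails if e["n_attachments"] > 0) / n,
    }
```

A classifier would consume these values alongside the per-email features as one combined feature vector.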
7.4 Feature Reduction Techniques

7.4.1 Dimension Reduction
The high dimensionality of data always appears to be a major problem for classification tasks because (a) it increases the running time of the classification algorithms, (b) it increases the chance of overfitting, and (c) a large number of instances is required for learning tasks. We apply principal component analysis (PCA) to obtain a reduced dimensionality of the data in an attempt to eliminate these problems.
PCA finds a reduced set of attributes by projecting the original dimension into a lower dimension. PCA is also capable of discovering hidden patterns in data, thereby increasing classification accuracy. As high-dimensional data contain redundancies and noise, it is much harder for the learning algorithms to find a hypothesis consistent with the training instances. The learned hypothesis is likely to be too complex and susceptible to overfitting. PCA reduces the dimension without losing much information and thus allows the learning algorithms to find a simpler hypothesis that is consistent with the training examples, thereby reducing the chance of overfitting. It should be noted, however, that PCA projects data into a lower dimension in the direction of maximum dispersion. Maximum dispersion of the data does not necessarily imply maximum separation of between-class data and/or maximum concentration of within-class data. If this is the case, then PCA reduction may result in poor performance.
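As an illustration of the projection step, the following minimal PCA sketch (written with NumPy rather than the MATLAB implementation mentioned in Chapter 8) projects feature vectors onto their top-k principal components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project data X (n_samples x n_features) onto its top-k
    principal components; a minimal PCA sketch."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions
    return Xc @ top                          # reduced representation
```

For example, reducing the 94-dimensional worm feature vectors to 25 dimensions would be `pca_reduce(X, 25)`; the first component always carries the most variance, which is exactly the "maximum dispersion" property discussed above.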
7.4.2 Two-Phase Feature Selection (TPS)
Feature selection is different from dimension reduction because it selects a subset of the feature set rather than projecting a combination of features onto a lower dimension. We apply a two-phase feature selection (TPS) process. In Phase I, we build a decision tree from the training data and select the features found at the internal nodes of the tree. In Phase II, we apply a greedy selection algorithm. We combine these two selection processes for the following reasons. Decision tree selection is fast, but the selected features may not be a good choice for a novel dataset; that is, the selected features may not perform well on novel data because the novel data may have a different set of important features. We observed this when we applied a decision tree to the Mydoom.M and VBS.BubbleBoy datasets. That is why we apply another phase of selection, the greedy selection, on top of decision tree selection. Our goal is to determine whether there is a more general feature set that covers all important features. In our experiments, we are able to find such a feature set using greedy selection. There are two reasons why we do not apply only greedy selection. First, it is very slow compared to decision tree selection, because at each iteration we have to modify the data to keep only the selected features and run the classifiers to compute the accuracy. Second, the greedy elimination process may lead to a set of features that is inferior to the decision tree-selected set of features. That is why we keep the decision tree-selected features as the minimal feature set.
7.4.2.1 Phase I. We apply a decision tree as a feature selection tool in Phase I. The main reason for applying a decision tree is that it selects the best attributes according to information gain. Information gain is a very effective metric in selecting features. Information gain can be defined as a measure of the effectiveness of an attribute (i.e., feature) in classifying the training data [Mitchell 1997]. If we split the training data on these attribute values, then information gain gives the measurement of the expected reduction in entropy after the split. The more an attribute can reduce entropy in the training data, the better the attribute is in classifying the data. The information gain of a binary attribute A on a collection of examples S is given by Eq. 7.1:

Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)   (7.1)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v. In our case, each binary attribute has only two possible values (0, 1). The entropy of a subset S is computed using the following equation:

Entropy(S) = −(p(S) / |S|) log2(p(S) / |S|) − (n(S) / |S|) log2(n(S) / |S|)

where p(S) is the number of positive examples in S, n(S) is the total number of negative examples in S, and |S| = p(S) + n(S). Computation of the information gain of a continuous attribute is a little tricky because it has an infinite number of possible values. One approach, followed by [Quinlan 1993], is to find an optimal threshold and split the data into two halves. The optimal threshold is found by searching for the threshold value with the highest information gain within the range of values of this attribute in the dataset.
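The entropy and information gain computations for a binary attribute can be sketched as follows; this is an illustrative implementation of Eq. 7.1, not the J48 code used in the tool:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def info_gain(split_counts):
    """Information gain of a binary attribute (Eq. 7.1).

    `split_counts` maps each attribute value (0 or 1) to a (pos, neg)
    tuple of example counts in that branch of the split."""
    total = sum(p + n for p, n in split_counts.values())
    pos = sum(p for p, _ in split_counts.values())
    neg = sum(n for _, n in split_counts.values())
    gain = entropy(pos, neg)
    for p, n in split_counts.values():
        gain -= (p + n) / total * entropy(p, n)
    return gain
```

A split that perfectly separates the classes has gain equal to the parent entropy, while an uninformative split has gain 0, which is why the tree induction prefers high-gain attributes at each node.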
We use J48, an implementation of C4.5, for building the decision tree. Decision tree algorithms choose the best attribute based on the information gain criterion at each level of recursion. Thus, the final tree actually consists of the most important attributes that can distinguish between the positive and negative instances. The tree is further pruned to reduce the chance of overfitting. Thus, we are able to identify the features that are necessary and the features that are redundant, and use only the necessary features. Surprisingly enough, in our experiments we find that on average only 4.5 features are selected by the decision tree algorithm, and the total number of nodes in the tree is only 11. This indicates that only a few features are important. We have six different datasets for six different worm types. Each dataset is again divided into two subsets: the known worm set, or K-Set, and the novel worm set, or N-Set. We apply threefold cross validation on the K-Set.
7.4.2.2 Phase II. In the second phase, we apply a greedy algorithm to select the best subset of features. We use the feature subset selected in Phase I as the minimal subset (MS). At the beginning of the algorithm, we select all the features from the original set and call it the potential feature set (PFS). At each iteration of the algorithm, we compute the average novel detection accuracy over the six datasets using PFS as the feature set. Then we pick a feature at random from the PFS that is not in MS and eliminate it from the PFS if the elimination does not reduce the novel detection accuracy of any classifier (NB, SVM, Series). If the accuracy drops after elimination, then we do not eliminate the feature, and we add it to MS. In this way, we reduce the PFS and continue until no further elimination is possible. The PFS then contains the most effective subset of features. Although this process is time consuming, we finally come up with a subset of features that can outperform the original set.
Algorithm 7.1 sketches the two-phase feature selection process. At line 2, the decision tree is built using the original feature set FS and the unreduced dataset DFS. At line 3, the set of features selected by the decision tree is stored in the minimal subset MS. Then the potential subset PFS is initialized to the original set FS. Line 5 computes the average novel detection accuracy of the three classifiers. The functions NB-Acc(PFS, DPFS), SVM-Acc(PFS, DPFS), and Series-Acc(PFS, DPFS) return the average novel detection accuracy of NB, SVM, and Series, respectively, using PFS as the feature set.
Algorithm 7.1 Two-Phase Feature Selection

1. Two-Phase-Selection (FS, DFS) returns FeatureSet
   // FS: original set of features
   // DFS: original dataset with FS as the feature set
2. T ← Build-Decision-Tree(FS, DFS)
3. MS ← Feature-Set(T) // minimal subset of features
4. PFS ← FS // potential subset of features
   // compute novel detection accuracy of FS
5. pavg ← (NB-Acc(PFS, DPFS) + SVM-Acc(PFS, DPFS) + Series-Acc(PFS, DPFS)) / 3
6. while PFS ≠ MS do
7.   X ← a randomly chosen feature from PFS that is not in MS
8.   PFS ← PFS − {X}
     // compute novel detection accuracy of PFS
9.   Cavg ← (NB-Acc(PFS, DPFS) + SVM-Acc(PFS, DPFS) + Series-Acc(PFS, DPFS)) / 3
10.  if Cavg ≥ pavg
11.    pavg ← Cavg
12.  else
13.    PFS ← PFS ∪ {X}
14.    MS ← MS ∪ {X}
15.  end if
16. end while
17. return PFS
In the while loop, we randomly choose a feature X such that X ∈ PFS but X ∉ MS and delete it from PFS. The accuracy of the new PFS is calculated. If, after deletion, the accuracy increases or remains the same, then X is redundant, so we remove this feature permanently. Otherwise, if the accuracy drops after deletion, then this feature is essential, so we add it to the minimal set MS (lines 13 and 14). In this way, we either delete a redundant feature or add it to the minimal selection. This is repeated until we have nothing more to select (i.e., MS equals PFS). We return the PFS as the best feature set.
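The greedy elimination loop above can be sketched in Python as follows. The `avg_novel_acc` callable is a stand-in for the expensive train-and-evaluate step over NB, SVM, and Series on the six datasets; its name and signature are hypothetical.

```python
import random

def two_phase_selection(all_features, tree_features, avg_novel_acc, seed=0):
    """Phase II greedy elimination, a sketch of Algorithm 7.1.

    `tree_features`: features chosen by the Phase I decision tree (MS).
    `avg_novel_acc(features)`: returns the average novel detection
    accuracy of the classifiers for the given feature set."""
    rnd = random.Random(seed)
    ms = set(tree_features)        # minimal subset
    pfs = set(all_features)        # potential feature set
    p_avg = avg_novel_acc(pfs)
    while pfs != ms:
        x = rnd.choice(sorted(pfs - ms))   # candidate to eliminate
        pfs.discard(x)
        c_avg = avg_novel_acc(pfs)
        if c_avg >= p_avg:
            p_avg = c_avg          # x was redundant: keep it out
        else:
            pfs.add(x)             # x is essential: restore it
            ms.add(x)              # and pin it in the minimal set
    return pfs
```

Since every iteration either discards a feature permanently or pins it into MS, the loop terminates after at most |FS − MS| evaluations of `avg_novel_acc`.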
7.5 Classification Techniques

Classification is a supervised data mining technique in which a data mining model is first trained with some "ground truth," that is, training data. Each instance (or data point) in the training data is represented as a vector of features, and each training instance is associated with a "class label." The data mining model trained from the training data is called a "classification model," which can be represented as a function f(x): feature vector → class label. This function approximates the feature vector-class label mapping from the training data.
When a test instance with an unknown class label is passed to the classification model, it predicts (i.e., outputs) a class label for the test instance. The accuracy of a classifier is determined by how many unknown instances (instances that were not in the training data) it can classify correctly.
We apply the NB [John and Langley 1995], SVM [Boser et al 1992], and C4.5 decision tree [Quinlan 1993] classifiers in our experiments. We also apply our implementation of the Series classifier [Martin et al 2005-b] to compare its performance with the other classifiers. We briefly describe the Series approach here for the purpose of self-containment.
NB assumes that features are independent of one another. With this assumption, the probability that an instance x = (x1, x2, …, xn) is in class c (c ∈ {1, …, C}) is

P(c | x) ∝ P(c) Πj P(Xj = xj | c)

where xj is the value of the j-th feature of the instance x, P(c) is the prior probability of class c, and P(Xj = xj | c) is the conditional probability that the j-th attribute has the value xj given class c.

So the NB classifier outputs the following class:

c* = argmaxc P(c) Πj P(Xj = xj | c)
NB treats discrete and continuous attributes differently. For each discrete attribute, P(X = x | c) is modeled by a single real number between 0 and 1, which represents the probability that the attribute X will take on the particular value x when the class is c. In contrast, each numeric (or real) attribute is modeled by some continuous probability distribution over the range of that attribute's values. A common assumption, not intrinsic to the NB approach but often made nevertheless, is that within each class the values of numeric attributes are normally distributed. One can represent such a distribution in terms of its mean and standard deviation, and one can efficiently compute the probability of an observed value from such estimates. For continuous attributes we can write

P(X = x | c) = (1 / (√(2π) σc)) exp(−(x − μc)² / (2σc²))

where μc and σc are the mean and standard deviation of the attribute's values within class c.
Smoothing (the m-estimate) is used in NB. We have used the values m = 100 and p = 0.5 while calculating the probability

P(X = x | c) = (nc + m·p) / (n + m)

where nc is the total number of instances for which X = x given class c, and n is the total number of instances for which X = x.
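The two NB likelihood estimates described above (the normal-density model for continuous attributes and the m-estimate for discrete ones) can be sketched as:

```python
from math import exp, pi, sqrt

def gaussian_likelihood(x, mu, sigma):
    """P(X = x | c) for a continuous attribute, assuming the values of
    the attribute within class c are normally distributed with mean
    `mu` and standard deviation `sigma`."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def m_estimate(nc, n, m=100, p=0.5):
    """m-estimate smoothing for a discrete attribute, with the values
    m = 100 and p = 0.5 used in this chapter; nc and n are the counts
    defined in the text."""
    return (nc + m * p) / (n + m)
```

Note how the m-estimate falls back to the prior p when there are no observed instances at all, which prevents zero probabilities from annihilating the NB product.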
SVM can perform either linear or non-linear classification. The linear classifier proposed by [Boser et al 1992] creates a hyperplane that separates the data into two classes with the maximum margin. Given positive and negative training examples, a maximum-margin hyperplane is identified that splits the training examples such that the distance between the hyperplane and the closest examples is maximized. The non-linear SVM is implemented by applying the kernel trick to maximum-margin hyperplanes. The feature space is transformed into a higher dimensional space where the maximum-margin hyperplane is found. This hyperplane may be non-linear in the original feature space. A linear SVM is illustrated in Figure 7.3. The circles are negative instances, and the squares are positive instances. A hyperplane (the bold line) separates the positive instances from the negative ones. All of the instances are at least at a minimal distance (the margin) from the hyperplane. The points that are at a distance exactly equal to the margin from the hyperplane are called the support vectors. As mentioned earlier, the SVM finds the hyperplane that has the maximum margin among all hyperplanes that can separate the instances.
Figure 7.3 Illustration of support vectors and margin of a linear SVM.
In our experiments, we have used the SVM implementation provided at [Chang and Lin 2006]. We also implement the Series, or "two-layer approach," proposed by [Martin et al 2005-b] as a baseline technique. The Series approach works as follows. In the first layer, SVM is applied as a novelty detector. The parameters of SVM are chosen such that it produces almost zero false positives. This means that if SVM classifies an email as infected, then with (almost) 100% probability it is an infected email. If, on the other hand, SVM classifies an email as clean, then it is sent to the second layer for further verification. This is because, with the previously mentioned parameter settings, while SVM reduces the false positive rate, it also increases the false negative rate; so any email classified as negative must be further verified. In the second layer, the NB classifier is applied to confirm whether the suspected emails are really infected. If NB classifies an email as infected, then it is marked as infected; otherwise, it is marked as clean. Figure 7.4 illustrates the Series approach.
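The two-layer decision logic can be sketched independently of any particular SVM or NB implementation. The `predict` interface below is a hypothetical simplification: each layer is any object whose `predict` returns 1 (infected) or 0 (clean).

```python
class SeriesClassifier:
    """Sketch of the Series approach: an SVM tuned for a near-zero
    false positive rate screens first; anything it calls clean is
    re-checked by an NB classifier."""

    def __init__(self, svm, nb):
        self.svm = svm
        self.nb = nb

    def predict(self, x):
        if self.svm.predict(x) == 1:
            return 1                    # layer 1 positives are trusted
        return self.nb.predict(x)       # layer 2 verifies the negatives
```

An email is marked clean only when both layers agree it is clean, which is how the cascade recovers the false negatives that the conservatively tuned SVM would otherwise let through.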
7.6 Summary

In this chapter, we have described the design and implementation of the data mining tools for email worm detection. As we have stated, feature-based methods are superior to signature-based methods for worm detection. Our approach is based on feature extraction. We reduce the dimension of the features by using PCA and then use classification techniques based on SVM and NB for detecting worms. In Chapter 8, we discuss the experiments we carried out and analyze the results obtained.
Figure 7.4 Series combination of SVM and NB classifiers for email worm detection. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
As stated in Chapter 6, as future work we are planning to detect worms by combining the feature-based approach with the content-based approach to make detection more robust and efficient. We will also focus on the statistical properties of the contents of the messages for possible contamination by worms. In addition, we will apply other classification techniques and compare the performance and accuracy of the results.
References

[Boser et al 1992] Boser, B. E., I. M. Guyon, V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler, editor, 5th Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, 1992, pp. 144–152.

[Chang and Lin 2006] Chang, C.-C., and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm

[John and Langley 1995] John, G. H., and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, 1995, pp. 338–345.

[Martin et al 2005-a] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21–22, Stanford University, CA.

[Martin et al 2005-b] Martin, S., A. Sewani, B. Nelson, K. Chen, A. D. Joseph, A Two-Layer Approach for Novel Email Worm Detection, submitted to USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI).

[Mitchell 1997] Mitchell, T., Machine Learning, McGraw-Hill, 1997.

[Quinlan 1993] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
8
EVALUATION AND RESULTS
8.1 Introduction

In Chapter 6 we described email worm detection, and in Chapter 7 we described our data mining tool for email worm detection. In this chapter, we describe the datasets, the experimental setup, and the results of our proposed approach and other baseline techniques.
The dataset contains a collection of 1600 clean and 1200 viral emails, which are divided into six different evaluation sets (Section 8.2). The original feature set contains 94 features. The evaluation compares our two-phase feature selection technique with two other approaches, namely, dimension reduction using PCA and no feature selection or reduction. The performance of three different classifiers has been evaluated on these feature spaces, namely, NB, SVM, and the Series approach (see Table 8.8 for a summary). Therefore, there are nine different combinations of feature set-classifier pairs, such as two-phase feature selection + NB, no feature selection + NB, two-phase feature selection + SVM, and so on. In addition, we compute three different metrics on these datasets for each feature set-classifier pair: classification accuracy, false positive rate, and accuracy in detecting a new type of worm.
The organization of this chapter is as follows. In Section 8.2, we describe the distribution of the datasets used. In Section 8.3, we discuss the experimental setup, including hardware, software, and system parameters. In Section 8.4, we discuss the results obtained from the experiments. The chapter is summarized in Section 8.5. Concepts in this chapter are illustrated in Figure 8.1.
Figure 8.1 Concepts in this chapter.
8.2 Dataset

We have used the worm dataset collected by [Martin et al 2005]. They accumulated several hundred clean and worm emails over a period of two years. All of these emails are outgoing emails. Several features are extracted from these emails, as explained in Section 7.3 ("Feature Description").
There are six types of worms contained in the dataset: VBS.BubbleBoy, W32.Mydoom.M, W32.Sobig.F, W32.Netsky.D, W32.Mydoom.U, and W32.Bagle.F. However, the classification task is binary: clean/infected. The original dataset contains six training and six test sets. Each training set is made up of 400 clean emails and 1000 infected emails, consisting of 200 samples from each of five different worms. The sixth virus is then included in the test set, which contains 1200 clean emails and 200 infected messages. Table 8.1 clarifies this distribution. For ease of representation, we abbreviate the worm names as follows:
• B: VBS.BubbleBoy
• F: W32.Bagle.F
• M: W32.Mydoom.M
• N: W32.Netsky.D
• S: W32.Sobig.F
• U: W32.Mydoom.U
NB, SVM, and the Series classifiers are applied to the original data, the PCA-reduced data, and the TPS-selected data. The decision tree is applied to the original data only.
Table 8.1 Data Distribution from the Original Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
We can easily notice that the original dataset is unbalanced, because the ratio of clean emails to infected emails is 2:5 in the training set, whereas it is 5:1 in the test set. So the results obtained from this dataset may not be reliable. We make it balanced by redistributing the examples. In our distribution, each balanced set contains two subsets. The known worm set, or K-Set, contains 1600 clean email messages, which are the combination of all the clean messages in the original dataset (400 from the training set, 1200 from the test set). The K-Set also contains 1000 infected messages with five types of worms, marked as the "known worms." The N-Set contains 200 infected messages of a sixth type of worm, marked as the "novel worm." Then we apply cross validation on the K-Set. The cross validation is done as follows. We randomly divide the set of 2600 (1600 clean + 1000 viral) messages into three equal-sized subsets such that the ratio of clean messages to viral messages remains the same in all subsets. We take two subsets as the training set and the remaining subset as the test set. This is done three times by rotating the testing and training sets, and we take the average accuracy of the three runs. This accuracy is shown under the column "Acc" in Tables 8.3, 8.5, and 8.6. In addition to testing the accuracy on the test set, we also test the detection accuracy of each of the three learned classifiers on the N-Set and take the average. This accuracy is also averaged over all runs and shown as the novel detection accuracy. Table 8.2 displays the data distribution of our dataset.
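The stratified threefold split described above can be sketched as follows; a simplified illustration that operates on message identifiers rather than the feature vectors the actual tool uses:

```python
import random

def stratified_threefold(clean, viral, seed=0):
    """Divide clean and viral messages into three (nearly) equal folds
    so that the clean:viral ratio is the same in every fold, as in the
    cross validation described above. Returns three lists of
    (message, is_viral) pairs."""
    rnd = random.Random(seed)
    folds = [[] for _ in range(3)]
    for items, is_viral in ((list(clean), False), (list(viral), True)):
        rnd.shuffle(items)                    # shuffle within each class
        for i, msg in enumerate(items):
            folds[i % 3].append((msg, is_viral))  # deal round-robin
    return folds
```

Each run then takes two folds as training data and the remaining fold as test data, rotating three times and averaging the resulting accuracies.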
Table 8.2 Data Distribution from the Redistributed Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
8.3 Experimental Setup

In this section, we describe the experimental setup, including a discussion of the hardware and software utilized. We ran all our experiments on a Windows XP machine with Java version 1.5 installed. For running SVM, we use the LIBSVM package [Chang and Lin 2006].
We use our own C++ implementation of NB. We implement PCA with MATLAB. We use the WEKA machine learning tool [Weka 2006] for the decision tree, with pruning applied.
Parameter settings: The parameter settings for LIBSVM are as follows: the classifier type is C-Support Vector Classification (C-SVC); the kernel is chosen to be the radial basis function (RBF); the values "gamma" = 0.2 and "C" = 1 are chosen.
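For readers reproducing these settings, an equivalent configuration can be written with scikit-learn, whose SVC class wraps LIBSVM's C-SVC implementation; the synthetic data below is purely illustrative and not the worm dataset:

```python
# Sketch of the LIBSVM parameter settings above, using scikit-learn's
# sklearn.svm.SVC (a wrapper around LIBSVM's C-SVC) as a stand-in.
import numpy as np
from sklearn.svm import SVC

# C-SVC with an RBF kernel, gamma = 0.2, C = 1.
clf = SVC(C=1.0, kernel="rbf", gamma=0.2)

# Tiny synthetic two-class problem, standing in for the feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)),   # "clean" cluster
               rng.normal(3.0, 1.0, (20, 5))])  # "infected" cluster
y = np.array([0] * 20 + [1] * 20)
clf.fit(X, y)
```

The equivalent LIBSVM command-line flags would be `-s 0 -t 2 -g 0.2 -c 1`.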
Baseline techniques: We compare our TPS technique with two different feature selection/reduction techniques. Therefore, the competing techniques are the following:
TPS: This is our two-phase feature selection technique.
PCA: Here we reduce the dimension using PCA. With PCA, we reduce the dimension size to 5, 10, 15, …, 90, 94. That is, we vary the target dimension from 5 to 94 with step 5 increments.
No reduction (unreduced): Here the full feature set is used.
Each of these feature sets is used to train three different classifiers, namely NB, SVM, and Series. The decision tree is also trained with the unreduced feature set.
8.4 Results

We discuss the results in three separate subsections. In subsection 8.4.1, we discuss the results obtained from the unreduced data, that is, the data before any reduction or selection is applied. In subsection 8.4.2, we discuss the results obtained from the PCA-reduced data, and in subsection 8.4.3, we discuss the results obtained using the TPS-reduced data.
8.4.1 Results from Unreduced Data
Table 8.3 reports the cross validation accuracy and false positive rate for each set. The cross validation accuracy is shown under the column Acc, and the false positive rate is shown under the column FP. The set names in the row headings are the abbreviated names explained in the "Dataset" section. From the results reported in Table 8.3, we see that SVM achieves the best accuracy among all classifiers, although the difference from the other classifiers is small.
Table 8.4 reports the accuracy of detecting novel worms. We see that SVM is very consistent over all sets, but NB, Series, and the decision tree perform significantly worse on the Mydoom.M dataset.
8.4.2 Results from PCA-Reduced Data
Figure 8.2 shows the results of applying PCA to the original data. The X axis denotes the dimension of the reduced-dimensional data, which has been varied from 5 to 90 with step 5 increments. The last point on the X axis is the unreduced, or original, dimension. Figure 8.2 shows the cross validation accuracy for different dimensions. The chart should be read as follows: a point (x, y) on a given line, say the line for SVM, indicates the cross validation accuracy y of SVM, averaged over all six datasets, where each dataset has been reduced to x dimensions using PCA.
Table 8.3 Comparison of Accuracy (%) and False Positive Rate (%) of Different Classifiers on the Worm Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Table 8.4 Comparison of Novel Detection Accuracy (%) of Different Classifiers on the Worm Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Figure 8.2 Average cross validation accuracy of the three classifiers on lower dimensional data reduced by PCA. (This figure appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.)
Figure 8.2 indicates that at lower dimensions, cross validation accuracy is lower for each of the three classifiers. However, SVM achieves its near-maximum accuracy at dimension 30, and NB and Series reach within 2% of their maximum accuracy at dimension 30 and onward. All classifiers attain their maximum at the highest dimension, 94, which is actually the unreduced data. From this observation we may conclude that PCA is not effective on this dataset in terms of cross validation accuracy. The reason behind this poorer performance on the reduced-dimensional data is possibly the one we mentioned earlier in the subsection "Dimension Reduction": the reduction by PCA is not producing lower dimensional data in which dissimilar class instances are maximally dispersed and similar class instances are maximally concentrated. So the classification accuracy is lower at lower dimensions.
We now present the results at dimension 25, similar to the results presented in the previous subsection. Table 8.5 compares the novel detection accuracy and cross validation accuracy of the different classifiers. We chose this particular dimension because at this dimension all the classifiers seem to be the most balanced in all aspects: cross validation accuracy, false positive and false negative rates, and novel detection accuracy. We conclude that this dimension is the optimal dimension for projection by PCA. From Table 8.5, it is evident that the accuracies of all three classifiers on the PCA-reduced data are lower than their accuracies on the unreduced data. It is possible that some information useful for classification was lost during projection onto a lower dimension.
Table 8.5 Comparison of Cross Validation Accuracy (Acc) and Novel Detection Accuracy (NAcc) among Different Classifiers on the PCA-Reduced Worm Dataset at Dimension 25
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
We see in Table 8.5 that both the accuracy and the novel detection accuracy of NB have dropped significantly from the original dataset. The novel detection accuracy of NB on the Mydoom.M dataset has become 0%, compared to 17.4% in the original set. The novel detection accuracy of SVM on the same dataset has dropped to 30%, compared to 92.4% in the original dataset. So we can conclude that PCA reduction does not help in novel detection.
8.4.3 Results from Two-Phase Selection
Our TPS selects the following features (in no particular order):
Attachment type: binary
MIME (magic) type of attachment: application/msdownload
MIME (magic) type of attachment: application/x-ms-dos-executable
Frequency of emails sent in window
Mean words in body
Mean characters in subject
Number of attachments
Number of From Addresses in window
Ratio of emails with attachments
Variance of attachment size
Variance of words in body
Number of HTML in email
Number of links in email
Number of To Addresses in window
Variance of characters in subject
The first three features actually reflect important characteristics of an infected email. Usually infected emails have a binary attachment, which is a DOS/Windows executable.
Mean/variance of words in body and characters in subject are also considered important symptoms, because infected emails usually contain a random subject or body and thus have an irregular size of body or subject. Number of attachments, ratio of emails with attachments, and number of links in email are usually higher for infected emails. Frequency of emails sent in window and number of To Addresses in window are higher for an infected host, as a compromised host sends infected emails to many addresses and more frequently. Thus, most of the features selected by our algorithm are really practical and useful.
Table 8.6 reports the cross validation accuracy (%) and false positive rate (%) of the three classifiers on the TPS-reduced dataset. We see that both the accuracy and false positive rates are almost the same as for the unreduced dataset. The accuracy on the Mydoom.M dataset (shown at row M) is 99.3% for NB, 99.5% for SVM, and 99.4% for Series. Table 8.7 reports the novel detection accuracy (%) of the three classifiers on the TPS-reduced dataset. We find that the average novel detection accuracy on the TPS-reduced dataset is higher than that of the unreduced dataset. The main reason behind this improvement is the higher accuracy on the Mydoom.M set by NB and Series. The accuracy of NB on this dataset is 37.1% (row M), compared to 17.4% on the unreduced dataset (see Table 8.4, row M). Also, the accuracy of Series on the same is 36.0%, compared to 16.6% on the unreduced dataset (as shown in Table 8.4, row M). However, the accuracy of SVM remains almost the same: 91.7%, compared to 92.4% on the unreduced dataset. In Table 8.8 we summarize the averages from Tables 8.3 through 8.7.
Table 8.6 Cross Validation Accuracy (%) and False Positive (%) of Three Different Classifiers on the TPS-Reduced Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
Table 8.7 Comparison of Novel Detection Accuracy (%) of Different Classifiers on the TPS-Reduced Dataset
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
The first three rows (after the header row) report the cross validation accuracy of all four classifiers that we have used in our experiments. Each row reports the average accuracy on a particular dataset. The first row reports the average accuracy for the unreduced dataset, the second row reports the same for the PCA-reduced dataset, and the third row for the TPS-reduced dataset. We see that the average accuracies are almost the same for the TPS-reduced and the unreduced sets. For example, the average accuracy of NB (shown under column NB) is the same for both, which is 99.2%; the accuracy of SVM (shown under column SVM) is also the same, 99.5%. The average accuracies of these classifiers on the PCA-reduced dataset are 1% to 2% lower. There is no entry under the decision tree column for the PCA-reduced and TPS-reduced datasets because we only test the decision tree on the unreduced dataset.
Table 8.8 Summary of Results (Averages) Obtained from Different Feature-Based Approaches
Source: This table appears in Email Worm Detection Using Data Mining, International Journal of Information Security and Privacy, Vol. 1, No. 4, pp. 47–61, 2007, authored by M. Masud, L. Khan, and B. Thuraisingham. Copyright 2010, IGI Global, www.igi-global.com. Posted by permission of the publisher.
The middle three rows report the average false positive values, and the last three rows report the average novel detection accuracies. We see that the average novel detection accuracy on the TPS-reduced dataset is the highest of all. The average novel detection accuracy of NB on this dataset is 86.7%, compared to 83.6% on the unreduced dataset, which is a 3.1% improvement on average. Also, Series has a novel detection accuracy of 86.3% on the TPS-reduced dataset, compared to 83.1% on the unreduced dataset. Again, it is a 3.2% improvement on average. However, the average accuracy of SVM remains almost the same (only 0.1% difference) on these two datasets. Thus, on average, we have an improvement in novel detection accuracy across
different classifiers on the TPS-reduced dataset. While the TPS-reduced dataset is the best among the three, the best classifier among the four is SVM. It has the highest average accuracy and novel detection accuracy on all datasets, as well as a very low average false positive rate.
8.5 Summary

In this chapter we have discussed the results obtained from testing our data mining tool for email worm detection. We first discussed the datasets we used and the experimental setup. Then we described the results we obtained. We have two important findings from our experiments. First, SVM has the best performance among all four different classifiers: NB, SVM, Series, and decision tree. Second, feature selection using our TPS algorithm achieves the best accuracy, especially in detecting novel worms. Combining these two findings, we conclude that SVM with TPS reduction should work as the best novel worm detection tool on a feature-based dataset.
In the future we would like to extend our work to content-based detection of the email worm by extracting binary-level features from the emails. We would also like to apply other classifiers to the detection task.
References

[Chang and Lin 2006] Chang, C.-C., and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Martin et al. 2005] Martin, S., A. Sewani, B. Nelson, K. Chen, and A. D. Joseph, Analyzing Behavioral Features for Email Classification, in Proceedings of the IEEE Second Conference on Email and Anti-Spam (CEAS 2005), July 21 & 22, Stanford University, CA.
[Weka 2006] Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/~ml/weka
Conclusion to Part II
In this part we discussed our proposed data mining technique to detect email worms. Different features, such as total number of words in message body/subject, presence/absence of binary attachments, types of attachments, and others, are extracted from the emails. Then the number of features is reduced using a Two-Phase Selection (TPS) technique, which is a novel combination of a decision tree and a greedy selection algorithm. We have used different classification techniques, such as Support Vector Machine (SVM), Naïve Bayes, and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. Compared to the baseline approaches, our proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.
In the future we would like to apply our technique to a larger corpus of emails and optimize the feature extraction and selection techniques to make them more scalable to large datasets.
PART III
DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
Introduction to Part III

We present a scalable and multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three different kinds of features at different levels of abstraction. These are binary n-grams, assembly instruction sequences, and dynamic link library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply this technique to a large corpus of real benign and malicious executables. The previously mentioned features are extracted from the corpus data, and a classifier is trained, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge based for several reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. Our model is compared against other feature-based approaches for malicious code detection and found to be
more efficient in terms of detection accuracy and false alarm rate.
Part III consists of three chapters: 9, 10, and 11. Chapter 9 describes our approach to detecting malicious executables. Chapter 10 describes the design and implementation of our data mining tools. Chapter 11 describes our evaluation and results.
9
MALICIOUS EXECUTABLES
9.1 Introduction

Malicious code is a great threat to computers and computer society. Numerous kinds of malicious code wander in the wild. Some of them are mobile, such as worms, and spread through the Internet, causing damage to millions of computers worldwide. Other kinds of malicious code are static, such as viruses, but sometimes deadlier than their mobile counterparts. Malicious code writers usually exploit software vulnerabilities to attack host machines. A number of techniques have been devised by researchers to counter these attacks. Unfortunately, the more successful the researchers become in detecting and preventing the attacks, the more sophisticated the malicious code in the wild appears. Thus, the battle between malicious code writers and researchers is virtually never ending.
One popular technique followed by the antivirus community to detect malicious code is "signature detection." This technique matches the executables against a unique telltale string or byte pattern called a signature, which is used as an identifier for a particular malicious code. Although signature detection techniques are widely used, they are not effective against zero-day attacks (new malicious code), polymorphic attacks (different encryptions of the same
binary), or metamorphic attacks (different code for the same functionality). So there has been a growing need for fast, automated, and efficient detection techniques that are robust to these attacks. As a result, many automated systems [Golbeck and Hendler 2004], [Kolter and Maloof 2004], [Newman et al. 2002], [Newsome et al. 2005] have been developed.
In this chapter we describe our novel hybrid feature retrieval (HFR) model that can detect malicious executables efficiently [Masud et al. 2007-a], [Masud et al. 2007-b]. The organization of this chapter is as follows. Our architecture is discussed in Section 9.2. Related work is given in Section 9.3. Our approach is discussed in Section 9.4. The chapter is summarized in Section 9.5. Figure 9.1 illustrates the concepts in this chapter.
Figure 9.1 Concepts in this chapter.
9.2 Architecture

Figure 9.2 illustrates our architecture for detecting malicious executables. The training data consist of a collection of benign and malicious executables. We extract three different kinds of features (to be explained shortly) from each executable. These extracted features are then analyzed, and only the best discriminative features are selected. Feature vectors are generated from each training instance using the selected feature set. The feature vectors are used to train a classifier. When a new executable needs to be tested, at first the features selected during training are extracted from the executable and a feature vector is generated. This feature vector is classified using the classifier to predict whether it is a benign or malicious executable.
Figure 9.2 Architecture.
In our approach, we extract three different kinds of features from the executables at different levels of abstraction and combine them into one feature set called the hybrid feature set (HFS). These features are used to train a classifier (e.g.,
support vector machine [SVM], decision tree, etc.), which is applied to detect malicious executables. These features are (a) binary n-gram features, (b) derived assembly features (DAFs), and (c) dynamic link library (DLL) call features. Each binary n-gram feature is actually a sequence of n consecutive bytes in a binary executable, extracted using a technique explained in Chapter 10. Binary n-grams reveal the distinguishing byte patterns between benign and malicious executables. Each DAF is a sequence of assembly instructions in an executable and corresponds to one binary n-gram feature. DAFs reveal the distinctive instruction usage patterns between benign and malicious executables. They are extracted from the disassembled executables using our assembly feature retrieval (AFR) algorithm. It should be noted that DAF is different from the assembly n-gram features mentioned in Chapter 10. Assembly n-gram features are not used in HFS because of our finding that DAF performs better than they do. Each DLL call feature actually corresponds to a DLL function call in an executable, extracted from the executable header. These features reveal the distinguishing DLL call patterns between benign and malicious executables. We show empirically that the combination of these three features is always better than any single feature in terms of classification accuracy.
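Conceptually, an HFS feature vector is just a concatenation of presence/absence indicators over the three selected feature lists. The following sketch is illustrative only; the feature values are made up, and the real features are selected by information gain as described in Chapter 10:

```python
def hybrid_feature_vector(exe_ngrams, exe_dafs, exe_dlls,
                          sel_ngrams, sel_dafs, sel_dlls):
    """Build one 0/1 feature vector by concatenating the three feature kinds:
    binary n-grams, derived assembly features, and DLL call features."""
    vector = []
    for selected, present in ((sel_ngrams, exe_ngrams),
                              (sel_dafs, exe_dafs),
                              (sel_dlls, exe_dlls)):
        vector += [1 if f in present else 0 for f in selected]
    return vector

# Hypothetical selected features and one executable's extracted feature sets
sel_ngrams = ["FF210890", "21089000"]
sel_dafs = [("push ebp", "mov ebp,esp")]
sel_dlls = ["KERNEL32.CreateFileA", "WSOCK32.send"]
exe = hybrid_feature_vector({"FF210890"}, set(), {"WSOCK32.send"},
                            sel_ngrams, sel_dafs, sel_dlls)
print(exe)  # [1, 0, 0, 0, 1]
```

The resulting vectors are what the classifier (SVM, decision tree, etc.) is trained on.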
Our work focuses on expanding features at different levels of abstraction rather than using more features at a single level of abstraction. There are two main reasons for this. First, the number of features at a given level of abstraction (e.g., binary) is overwhelmingly large. For example, from our larger dataset we obtain 200 million binary n-gram features. Training with this large number of features is far beyond the capabilities of any practical classifier. That is why we limit
the number of features at a given level of abstraction to an applicable range. Second, we empirically observe the benefit of adding more levels of abstraction to the combined feature set (i.e., HFS). HFS combines features at three levels of abstraction, namely, binary executables, assembly programs, and system API calls. We show that this combination has higher detection accuracy and a lower false alarm rate than the features at any single level of abstraction.
Our technique is related to knowledge management for several reasons. First, we apply our knowledge of binary n-gram features to obtain DAFs. Second, we apply the knowledge obtained from the feature extraction process to select the best features. This is accomplished by extracting all possible binary n-grams from the training data, applying the statistical knowledge corresponding to each n-gram (i.e., its frequency in malicious and benign executables) to compute its information gain [Mitchell 1997], and selecting the best S of them. Finally, we apply another kind of statistical knowledge (presence/absence of a feature in an executable) obtained from the feature extraction process to train classifiers.
Our research contributions are as follows. First, we propose and implement our HFR model, which combines the three kinds of features previously mentioned. Second, we apply a novel idea to extract assembly instruction features using binary n-gram features, implemented with the AFR algorithm. Third, we propose and implement a scalable solution to the n-gram feature extraction and selection problem in general. Our solution works well with limited memory and significantly reduces running time by applying efficient and powerful data structures and algorithms. Thus, it is scalable to a large collection of executables (in the order of thousands)
even with limited main memory and processor speed. Finally, we compare our results against the results of [Kolter and Maloof 2004], who used only the binary n-gram feature, and show that our method achieves better accuracy. We also report the performance/cost trade-off of our method against the method of [Kolter and Maloof 2004]. It should be pointed out here that our main contribution is an efficient feature extraction technique, not a classification technique. We empirically show that the combined feature set (i.e., HFS) extracted using our algorithm performs better than other individual feature sets (such as binary n-grams), regardless of the classifier (e.g., SVM or decision tree) used.
9.3 Related Work

There has been significant research in recent years on detecting malicious executables. There are two mainstream techniques to automate the detection process: behavioral and content based. The behavioral approach is primarily applied to detect mobile malicious code. This technique analyzes network traffic characteristics, such as source-destination ports/IP addresses and various packet-level/flow-level statistics, and application-level characteristics, such as email attachment type and attachment size. Examples of behavioral approaches include social network analysis [Golbeck and Hendler 2004], [Newman et al. 2002] and statistical analysis [Schultz et al. 2001-a]. A data mining-based behavioral approach for detecting email worms has been proposed by [Masud et al. 2007-a]. [Garg et al. 2006] apply the feature extraction technique along with machine learning for masquerade detection. They extract features from user behavior in
GUI-based systems, such as mouse speed, number of clicks per session, and so on. Then the problem is modeled as a binary classification problem and trained and tested with SVM. Our approach is content based rather than behavioral.
The content-based approach analyzes the content of the executable. Some such techniques try to automatically generate signatures from network packet payloads. Examples are EarlyBird [Singh et al. 2003], Autograph [Kim and Karp 2004], and Polygraph [Newsome et al. 2005]. In contrast, our method does not require signature generation or signature matching. Some other content-based techniques extract features from the executables and apply machine learning to detect malicious executables. Examples are given in [Schultz et al. 2001-b] and [Kolter and Maloof 2004]. The work in [Schultz et al. 2001-b] extracts DLL call information using GNU Bin-Utils and character strings using GNU strings from the header of Windows PE executables [Cygnus 1999]. They also use byte sequences as features. We also use byte sequences and DLL call information, but we additionally apply disassembly and use assembly instructions as features. We also extract byte patterns of various lengths (from 2 to 10 bytes), whereas they extract only 2-byte patterns. Similar work is done by [Kolter and Maloof 2004]. They extract binary n-gram features from the binary executables, apply them to different classification methods, and report accuracy. Our model is different from [Kolter and Maloof 2004] in that we extract not only the binary n-grams but also assembly instruction sequences from the disassembled executables, and we gather DLL call information from the program headers. We compare our model's performance only with [Kolter and Maloof 2004] because they report higher accuracy than that given in [Schultz et al. 2001-b].
9.4 Hybrid Feature Retrieval (HFR) Model

Our HFR model is a novel idea in malicious code detection. It extracts useful features from disassembled executables using the information obtained from binary executables. It then combines the assembly features with other features like DLL function calls and binary n-gram features. We have addressed a number of difficult implementation issues and provided efficient, scalable, and practical solutions. The difficulties that we faced during implementation are related to memory limitations and long running times. By using efficient data structures, algorithms, and disk I/O, we are able to implement a fast, scalable, and robust system for malicious code detection. We run our experiments on two datasets with different class distributions and show that a more realistic distribution improves the performance of our model.
Our model also has a few limitations. First, it does not directly handle obfuscated DLL calls or encrypted/packed binaries. There are techniques available for detecting obfuscated DLL calls in the binary [Lakhotia et al. 2005] and for unpacking packed binaries automatically. We may apply these tools for de-obfuscation/decryption and use their output in our model. Although this is not implemented yet, we look forward to integrating these tools with our model in future versions. Second, the current implementation is an offline detection mechanism, which means it cannot be directly deployed on a network to detect malicious code. However, it can detect malicious code in near real time.
We address these issues in our future work and vow to solve these problems. We also propose several modifications to our model. For example, we would like to combine our features with run-time characteristics of the executables. We also propose building a feature database that would store all the features and be updated incrementally. This would save a large amount of training time and memory. Our approach is illustrated in Figure 9.3.
Figure 9.3 Our approach to detecting malicious executables.
9.5 Summary

In this work we have proposed a data mining-based model for malicious code detection. Our technique extracts three different levels of features from executables, namely, binary-level, assembly-level, and API function call-level features. These features then go through a feature selection phase to reduce noise and redundancy in the feature set and generate a manageable-sized set of features. These feature sets are then used to build feature vectors for each training data point. Then a classification model is trained using the training data.
This classification model classifies future instances (i.e., executables) to detect whether they are benign or malicious.
In the future we would like to extend our work in two directions. First, we would like to extract and utilize behavioral features for malware detection. This is because obfuscation against binary patterns may be achieved by polymorphism and metamorphism, but it will be difficult for the malware to obfuscate its behavioral pattern. Second, we would like to make the feature extraction and classification more scalable by applying the cloud computing framework.
References

[Cygnus 1999] GNU Binutils Cygwin, http://sourceware.cygnus.com/cygwin
[Freund and Schapire 1996] Freund, Y., and R. Schapire, Experiments with a New Boosting Algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, 1996, pp. 148–156.
[Garg et al. 2006] Garg, A., R. Rahalkar, S. Upadhyaya, and K. Kwiat, Profiling Users in GUI Based Systems for Masquerade Detection, in Proceedings of the 7th IEEE Information Assurance Workshop (IAWorkshop 2006), IEEE, 2006, pp. 48–54.
[Golbeck and Hendler 2004] Golbeck, J., and J. Hendler, Reputation Network Analysis for Email Filtering, in Proceedings of CEAS 2004, First Conference on Email and Anti-Spam.
[Kim and Karp 2004] Kim, H. A., and B. Karp, Autograph: Toward Automated, Distributed Worm Signature Detection, in Proceedings of the 13th USENIX Security Symposium (Security 2004), San Diego, CA, August 2004, pp. 271–286.
[Kolter and Maloof 2004] Kolter, J. Z., and M. A. Maloof, Learning to Detect Malicious Executables in the Wild, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 470–478.
[Lakhotia et al. 2005] Lakhotia, A., E. U. Kumar, and M. Venable, A Method for Detecting Obfuscated Calls in Malicious Binaries, IEEE Transactions on Software Engineering 31(11): 955–968.
[Masud et al. 2007-a] Masud, M. M., L. Khan, and B. Thuraisingham, Feature-Based Techniques for Auto-Detection of Novel Email Worms, in Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'07), Lecture Notes in Computer Science 4426, Springer, 2007, Bangkok, Thailand, pp. 205–216.
[Masud et al. 2007-b] Masud, M. M., L. Khan, and B. Thuraisingham, A Hybrid Model to Detect Malicious Executables, in Proceedings of the IEEE International Conference on Communication (ICC'07), pp. 1443–1448.
[Mitchell 1997] Mitchell, T., Machine Learning, McGraw-Hill.
[Newman et al. 2002] Newman, M. E. J., S. Forrest, and J. Balthrop, Email Networks and the Spread of Computer Viruses, Physical Review E 66(3): 035101-1–035101-4.
[Newsome et al. 2005] Newsome, J., B. Karp, and D. Song, Polygraph: Automatically Generating Signatures for Polymorphic Worms, in Proceedings of the IEEE Symposium on Security and Privacy, May 2005, Oakland, CA, pp. 226–241.
[Schultz et al. 2001-a] Schultz, M., E. Eskin, and E. Zadok, MEF: Malicious Email Filter, a UNIX Mail Filter That Detects Malicious Windows Executables, in Proceedings of the FREENIX Track, USENIX Annual Technical Conference, June 2001, Boston, MA, pp. 245–252.
[Schultz et al. 2001-b] Schultz, M., E. Eskin, E. Zadok, and S. Stolfo, Data Mining Methods for Detection of New Malicious Executables, in Proceedings of the IEEE Symposium on Security and Privacy, May 2001, Oakland, CA, pp. 38–49.
[Singh et al. 2003] Singh, S., C. Estan, G. Varghese, and S. Savage, The EarlyBird System for Real-Time Detection of Unknown Worms, Technical Report CS2003-0761, University of California at San Diego (UCSD), August 2003.
10
DESIGN OF THE DATA MINING TOOL
10.1 Introduction

In this chapter we describe our data mining tool for detecting malicious executables. It utilizes a feature extraction technique based on n-gram analysis. We first discuss how we extract binary n-gram features from the executables, and then show how we select the best features using information gain. We also discuss the memory and scalability problems associated with n-gram extraction and selection, and how we solve them. Then we describe how the assembly features and dynamic link library (DLL) call features are extracted. Finally, we describe how we combine these three kinds of features and train a classifier using them.
The organization of this chapter is as follows. Feature extraction using n-gram analysis is given in Section 10.2. The hybrid feature retrieval model is discussed in Section 10.3. The chapter is summarized in Section 10.4. Figure 10.1 illustrates the concepts in this chapter.
10.2 Feature Extraction Using n-Gram Analysis

Before going into the details of the process, we illustrate a code snippet in Figure 10.2 from the email worm "Win32.Ainjoe" and use it as a running example throughout the chapter.
Feature extraction using n-gram analysis involves extracting all possible n-grams from the given dataset (training set) and selecting the best n-grams among them. Each such n-gram is a feature. We extend the notion of n-gram from bytes to assembly instructions and DLL function calls. That is, an n-gram may be a sequence of n bytes, n assembly instructions, or n DLL function calls, depending on whether we extract features from binary executables, assembly programs, or DLL call sequences, respectively. Before extracting n-grams, we preprocess the binary executables by converting them to hexdump files and assembly program files, as explained shortly.
Figure 10.1 Concepts in this chapter.
Figure 10.2 Code snippet and DLL call info from the Email-Worm "Win32.Ainjoe". (From M. Masud, L. Khan, and B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
10.2.1 Binary n-Gram Feature
Here the granularity level is a byte. We apply the UNIX hexdump utility to convert the binary executable files into text files, mentioned henceforth as hexdump files, containing the hexadecimal numbers corresponding to each byte of the binary. This process is performed to ensure safe and easy portability of the binary executables. The feature extraction
process consists of two phases: (1) feature collection and (2) feature selection, both of which are explained in the following subsections.
10.2.2 Feature Collection
We collect binary n-grams from the hexdump files. This is illustrated in Example-I.
Example-I
The 4-grams corresponding to the first 6-byte sequence (FF2108900027) from the executable in Figure 10.2 are the 4-byte sliding windows FF210890, 21089000, and 08900027.
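Example-I can be reproduced with a simple sliding window over the byte sequence of the hexdump. This sketch is ours, not the book's code:

```python
def byte_ngrams(hex_string, n):
    """Slide an n-byte window over a hex string (2 hex digits per byte)."""
    byte_list = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    return ["".join(byte_list[i:i + n]) for i in range(len(byte_list) - n + 1)]

print(byte_ngrams("FF2108900027", 4))
# ['FF210890', '21089000', '08900027']
```

A 6-byte sequence yields exactly 6 − 4 + 1 = 3 such 4-grams, matching the example.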
The basic feature collection process runs as follows. At first, we initialize a list L of n-grams to empty. Then we scan each hexdump file by sliding an n-byte window. Each such n-byte sequence is an n-gram. Each n-gram g is associated with two values, p1 and n1, denoting the total number of positive instances (i.e., malicious executables) and negative instances (i.e., benign executables), respectively, that contain g. If g is not found in L, then g is added to L, and p1 and n1 are updated as necessary. If g is already in L, then only p1 and n1 are updated. When all hexdump files have been scanned, L contains all the unique n-grams in the dataset, along with their frequencies in the positive and negative instances. There are several implementation issues related to this basic approach. First, the total number of n-grams may be very large. For example, the total number of 10-grams in our second dataset is 200 million. It may not be possible to store all of them in the computer's main memory. To solve this problem, we store
the n-grams in a disk file F. Second, if L is not sorted, then a linear search is required for each scanned n-gram to test whether it is already in L. If N is the total number of n-grams in the dataset, then the time for collecting all the n-grams would be O(N^2), an impractical amount of time when N = 200 million.
To solve the second problem, we use a data structure called an Adelson-Velsky Landis (AVL) tree [Goodrich and Tamassia 2006] to store the n-grams in memory. An AVL tree is a height-balanced binary search tree. This tree has the property that the absolute difference between the heights of the left subtree and the right subtree of any node is at most 1. If this property is violated during insertion or deletion, a balancing operation is performed, and the tree regains its height-balanced property. It is guaranteed that insertions and deletions are performed in logarithmic time. So, to insert an n-gram in memory, we now need only O(log2 N) searches. Thus, the total running time is reduced to O(N log2 N), making the overall running time about 5 million times faster for N as large as 200 million. Our feature collection algorithm, Extract_Feature, implements these two solutions. It is illustrated in Algorithm 10.1.
Description of the algorithm: the for loop at line 3 runs for each hexdump file in the training set. The inner while loop at line 4 gathers all the n-grams of a file and adds each one to the AVL tree if it is not already there. At line 8, a test is performed to see whether the tree size has exceeded the memory limit (a threshold value). If it has and F is empty, then we save the contents of the tree in F (line 9). If F is not empty, then we merge the contents of the tree with F (line 10). Finally, we delete all the nodes from the tree (line 12).
Algorithm 10.1 The n-Gram Feature Collection Algorithm

Procedure Extract_Feature (B)
B = {B1, B2, …, BK}: all hexdump files
1.  T ← empty tree                          // initialize AVL tree
2.  F ← new file                            // initialize disk file
3.  for each Bi ∈ B do
4.    while not EOF(Bi) do                  // while not end of file
5.      g ← next_ngram(Bi)                  // read next n-gram
6.      T.insert(g)                         // insert into tree and/or update frequencies as necessary
7.    end while
8.    if T.size > Threshold then            // save or merge
9.      if F is empty then F ← T.inorder()  // save tree data in sorted order
10.     else F ← merge(T.inorder(), F)      // merge tree data with file data and save
11.     end if
12.     T ← empty tree                      // release memory
13.   end if
14. end for
The time complexity of Algorithm 10.1 is T = time(n-gram reading and inserting in tree) + time(merging with disk) = O(B log2 K) + O(N), where B is the total size of the training data in bytes, K is the maximum number of nodes of the tree (i.e., the threshold), and N is the total number of n-grams collected. The space complexity is O(K), where K is defined as the maximum number of nodes of the tree.
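A minimal runnable sketch of the counting logic follows. It uses a Python dict in place of the AVL tree (both give keyed lookup; the AVL tree additionally yields sorted in-order output for the disk merge, approximated here with sorted()), and it omits the threshold/disk-merge step, so it illustrates the per-file counting rather than the scalable implementation. The sample data are made up:

```python
def collect_ngrams(hexdump_files, n):
    """For each byte n-gram, count how many malicious (p1) and benign (n1)
    instances contain it. hexdump_files: list of (hex_string, is_malicious)."""
    counts = {}  # n-gram -> [p1, n1]; stands in for the AVL tree T
    for hex_string, is_malicious in hexdump_files:
        byte_list = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
        # Unique n-grams of this file: an instance is counted once per n-gram.
        seen = {"".join(byte_list[i:i + n])
                for i in range(len(byte_list) - n + 1)}
        for g in seen:
            entry = counts.setdefault(g, [0, 0])
            entry[0 if is_malicious else 1] += 1
    return dict(sorted(counts.items()))  # sorted, as an in-order traversal would give

files = [("FF2108900027", True), ("FF210890AA", False)]  # hypothetical corpus
L = collect_ngrams(files, 4)
print(L["FF210890"])  # [1, 1] -- appears in one malicious and one benign file
```

The real implementation periodically flushes these sorted counts to a disk file and merges, keeping memory bounded by the threshold K.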
10.2.3 Feature Selection
If the total number of extracted features is very large, it may not be possible to use all of them for training, for several reasons. First, the memory requirement may be impractical. Second, training may be too slow. Third, a classifier may become confused by a large number of features, because most of them would be noisy, redundant, or irrelevant. So we must choose a small, relevant, and useful subset of features. We choose information gain (IG) as the selection criterion because it is one of the best criteria used in the literature for selecting the best features.
IG can be defined as a measure of the effectiveness of an attribute (i.e., feature) in classifying a training data point [Mitchell 1997]. If we split the training data based on the values of this
222
attribute then IG gives the measurement of the expectedreduction in entropy after the split The more an attribute canreduce entropy in the training data the better the attribute isin classifying the data IG of an attribute A on a collection ofinstances I is given by Eq 101
where
• values(A) is the set of all possible values for attribute A,
• I_v is the subset of I in which all instances have the value A = v,
• p and n are the total numbers of positive and negative instances in I, and
• p_v and n_v are the total numbers of positive and negative instances in I_v.
In our case, each attribute has only two possible values, that is, v ∈ {0, 1}. If an attribute A (i.e., an n-gram) is present in an instance X, then X_A = 1; otherwise it is 0. The entropy of I is computed using Equation 10.2:

Entropy(I) = −(p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))    (10.2)

where I, p, and n are as defined above. Substituting (10.2) into (10.1) and letting t = n + p, we get Equation 10.3:

IG = Entropy(p, n) − ((p0 + n0) / t) Entropy(p0, n0) − ((p1 + n1) / t) Entropy(p1, n1)    (10.3)

where Entropy(x, y) denotes the entropy of a collection containing x positive and y negative instances, and p_v, n_v (v ∈ {0, 1}) are the positive and negative counts in the two subsets.
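The entropy and gain computation described above (the Gain routine invoked by the selection algorithm) can be sketched as a hypothetical helper:

```python
from math import log2

def entropy(p, n):
    """Entropy of a collection with p positive and n negative instances."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0          # a pure (or empty) collection has zero entropy
    pp, pn = p / total, n / total
    return -pp * log2(pp) - pn * log2(pn)

def gain(p0, n0, p1, n1, p, n):
    """Information gain of a binary attribute: the entropy before the split
    minus the weighted entropy of the two subsets (v = 0 and v = 1)."""
    t = p + n
    return (entropy(p, n)
            - ((p0 + n0) / t) * entropy(p0, n0)
            - ((p1 + n1) / t) * entropy(p1, n1))
```

For a perfectly predictive attribute the gain equals the full entropy of the collection; for an attribute that splits positives and negatives evenly, it is zero.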
The next problem is to select the best S features (i.e., n-grams) according to IG. One naïve approach is to sort the n-grams in non-increasing order of IG and select the top S of them, which requires O(N log2 N) time and O(N) main memory. This selection can be accomplished more efficiently using a heap, which requires O(N log2 S) time and O(S) main memory. For S = 500 and N = 200 million, the heap approach is more than 3 times faster and requires about 400,000 times less main memory. A heap is a balanced binary tree with the property that the root of any subtree contains the minimum (or maximum) element in that subtree. We use a min-heap, which always has the minimum value at its root. Algorithm 10.2 sketches the feature selection algorithm. At first the heap is initialized to empty. Then the n-grams (along with their frequencies) are read from disk (line 2) and inserted into the heap (line 5) until the heap size reaches S. After that, we compare the IG of each subsequent n-gram g against the IG of the root. If IG(root) ≥ IG(g), then g is discarded (line 6), since the root has the minimum IG among the kept features. Otherwise, the root is replaced with g (line 7), and the heap property is restored (line 9). The process terminates when there are no more n-grams on disk, at which point the S best n-grams are in the heap.
Algorithm 10.2 The n-Gram Feature Selection Algorithm

Procedure Select_Feature (F, H, p, n)
• F: a disk file containing all n-grams
• H: empty heap
• p: total number of positive examples
• n: total number of negative examples

1. while not EOF(F) do
2.   ⟨g, p1, n1⟩ ← next_ngram(F) // read n-gram with frequency counts
3.   p0 ← p − p1, n0 ← n − n1 // numbers of positive and negative examples not containing g
4.   IG ← Gain(p0, n0, p1, n1, p, n) // using Equation 10.3
5.   if H.size() < S then H.insert(g, IG)
6.   else if IG ≤ H.root.IG then continue // discard lower-gain n-grams
7.   else H.root ← ⟨g, IG⟩ // replace root
8.   end if
9.   H.restore() // restore the heap property
10. end while
The insertion and restoration operations take only O(log2 S) time each, so the total time required is O(N log2 S), with only O(S) main memory. We denote the best S binary features selected using the IG criterion as the binary feature set (BFS).
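The heap-based selection above maps directly onto Python's heapq module; a minimal sketch (the input is assumed to be an iterable of already-scored (IG, n-gram) pairs, e.g. streamed from disk):

```python
import heapq

def select_top_features(scored_ngrams, S):
    """Keep the S n-grams with the highest information gain using a min-heap,
    in O(N log S) time and O(S) memory. `scored_ngrams` yields (ig, ngram)
    pairs; the heap root always holds the smallest IG among those kept."""
    heap = []
    for ig, g in scored_ngrams:
        if len(heap) < S:
            heapq.heappush(heap, (ig, g))
        elif ig > heap[0][0]:              # better than the worst kept feature
            heapq.heapreplace(heap, (ig, g))
        # otherwise discard g: its gain is too low to make the top S
    return sorted(heap, reverse=True)      # best features first
```

heapreplace pops the root and pushes the new element in one O(log S) operation, which is exactly the replace-then-restore step of lines 7 and 9.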
10.2.4 Assembly n-Gram Feature
In this case, the level of granularity is an assembly instruction. First we disassemble all the binary files using a disassembly tool called PEDisassem, which disassembles Windows Portable Executable (PE) files. Besides generating the assembly instructions with opcode and address information, PEDisassem provides useful information such as the list of resources (e.g., cursors) used, the list of DLL functions called, the list of exported functions, and the list of strings inside the code block. To extract assembly n-gram features, we follow a method similar to binary n-gram feature extraction: first we collect all possible n-grams, that is, sequences of n consecutive assembly instructions, and then select the best S of them according to IG. Henceforth we refer to this selected set of features as the assembly feature set (AFS). We face the same difficulties as in binary n-gram extraction, such as limited memory and slow running time, and solve them in the same way. Example-II illustrates the assembly n-gram features.
Example-II
The 2-grams corresponding to the first four assembly instructions in Figure 10.1 are the two-instruction sliding windows:

(jmp dword [ecx]; or byte [eax+14002700], dl)
(or byte [eax+14002700], dl; add byte [esi+1E], dh)
(add byte [esi+1E], dh; inc ebp)
We adopt a standard representation of assembly instructions with the following format: name,param1,param2. Here name is the instruction name (e.g., mov), param1 is the first parameter, and param2 is the second parameter. A parameter may be one of {register, memory, constant}. So the second instruction above, "or byte [eax+14002700], dl", becomes "or,memory,register" in our representation.
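This normalization can be illustrated with a small sketch; the register table and operand-classification rules below are simplified assumptions (a real disassembler listing would need the full x86 operand grammar):

```python
# Hypothetical, abbreviated x86 register set for illustration only.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
             "ax", "bx", "cx", "dx", "al", "bl", "cl", "dl",
             "ah", "bh", "ch", "dh"}

def classify_param(p):
    """Map an operand to one of the three abstract classes used in the text."""
    p = p.strip().lower()
    if not p:
        return ""
    if p in REGISTERS:
        return "register"
    if "[" in p or p.startswith(("byte", "word", "dword")):
        return "memory"        # any bracketed or size-prefixed operand
    return "constant"          # anything else: immediate value, label, etc.

def normalize(instr):
    """'or byte [eax+14002700], dl' -> 'or,memory,register'."""
    name, _, rest = instr.strip().partition(" ")
    params = [classify_param(p) for p in rest.split(",")] if rest else []
    return ",".join([name.lower()] + [p for p in params if p])
```

Normalizing operands this way makes two instructions with different addresses or constants map to the same feature, which is what lets assembly n-grams generalize across executables.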
10.2.5 DLL Function Call Feature
Here the granularity level is a DLL function call. An n-gram of DLL function calls is a sequence of n DLL function calls (possibly with other instructions between two successive calls) in an executable. We extract the information about DLL function calls made by a program from the header of the disassembled file, as illustrated in Figure 10.2. In our experiments we use only 1-grams of DLL calls because the higher grams have poorer performance. We enumerate all the DLL function names used by each of the benign and malicious executables and select the best S of them using information gain. We refer to this feature set as the DLL-call feature set (DFS).
10.3 The Hybrid Feature Retrieval Model

The hybrid feature retrieval (HFR) model extracts and combines three different kinds of features. HFR consists of different phases and components. The feature extraction components have already been discussed in detail. This section gives a brief description of the model.
10.3.1 Description of the Model
The HFR model consists of two phases: a training phase and a test phase. The training phase is shown in Figure 10.3a and the test phase in Figure 10.3b. In the training phase, we extract binary n-gram features (BFS) and DLL call features (DFS) using the approaches explained in this chapter. We then apply the AFR algorithm (explained shortly) to retrieve the derived assembly features (DAFs) that represent the selected binary n-gram features. These three kinds of features are combined into the hybrid feature set (HFS). Please note that DAFs are different from the assembly n-gram features (i.e., the AFS).
Figure 10.3 The Hybrid Feature Retrieval Model: (a) training phase; (b) test phase. (From M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
The AFS is not used in the HFS because of our finding that DAF performs better. We compute the binary feature vector corresponding to the HFS using the technique explained in this chapter and train a classifier using SVM, boosted decision tree, and other classification methods. In the test phase, we scan each test instance and compute the feature vector corresponding to the HFS. This vector is tested against the classifier, which outputs the class prediction (benign or malicious) of the test file.
10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
The AFR algorithm extracts the assembly instruction sequences (i.e., DAFs) corresponding to the binary n-gram features. The main idea is to obtain the complete assembly instruction sequence for a given binary n-gram feature. The rationale behind using DAFs is as follows: a binary n-gram may represent partial information, such as part(s) of one or more assembly instructions, or a string inside the code block. We apply the AFR algorithm to obtain the complete instruction or instruction sequence (i.e., a DAF) corresponding to the partial one. Thus a DAF represents more complete information, which should be more useful in distinguishing malicious from benign executables. However, binary n-grams are still required because they also contain other information, such as string data or important bytes in the program header. The AFR algorithm consists of several steps. In the first step, a linear address matching technique is applied as follows: the offset address of the n-gram in the hexdump file is used to find the instructions at the same offset in the corresponding assembly program file. Based on the offset value, one of three situations may occur:
1. The offset is before the program entry point, so there is no corresponding assembly code for the n-gram. We refer to this address as address before entry point (ABEP).
2. There are some data but no code at that offset. We refer to this address as DATA.
3. There is some code at that offset. We refer to this address as CODE. If the offset is in the middle of an instruction, then we take the whole instruction along with the consecutive instructions within n bytes of it.
In the second step, the best CODE instance is selected from among all CODE instances. We apply a heuristic, called the most distinguishing instruction sequence (MDIS) heuristic, to find the best sequence: we choose the instruction sequence with the highest IG. The AFR algorithm is sketched in Algorithm 10.3. A comprehensive example of the algorithm is given in Appendix A.
Description of the algorithm: line 1 initializes the lists that will contain the assembly sequences. The for loop at line 2 runs for each hexdump file. Each hexdump file is scanned and n-grams are extracted (lines 4 and 5). If an n-gram is in the BFS (lines 6 and 7), then we read the instruction sequence from the corresponding assembly program file at the corresponding address (lines 8 through 10). This sequence is added to the appropriate list (line 12). In this way we collect all the sequences corresponding to each n-gram in the BFS. In phase II, we select the best sequence in each n-gram's list using IG (lines 18 through 21). Finally, we return the best sequences, that is, the DAFs.
Algorithm 10.3 Assembly Feature Retrieval
Procedure Assembly_Feature_Retrieval (G, A, B)
• G = {g1, g2, …, gS}: the selected n-gram features (BFS)
• A = {A1, A2, …, AL}: all assembly files
• B = {B1, B2, …, BL}: all hexdump files
• S = size of the BFS
• L = number of training files
• Qi = a list containing the possible instruction sequences for gi

1. for i = 1 to S do Qi ← empty end for // initialize sequence lists
2. for each Bi ∈ B do // phase I: sequence collection
3.   offset ← 0 // current offset in file
4.   while not EOF(Bi) do // read the whole file
5.     g ← next_ngram(Bi) // read next n-gram
6.     ⟨index, found⟩ ← BinarySearch(G, g) // search for g in G
7.     if found then
8.       q ← an empty sequence
9.       for each instruction r in Ai with address(r) ∈ [offset, offset + n] do
10.        q ← q ∪ r
11.      end for
12.      Qindex ← Qindex ∪ q // add to the sequence list
13.    end if
14.    offset ← offset + 1
15.  end while
16. end for
17. V ← empty list // phase II: sequence selection
18. for i = 1 to S do // for each Qi
19.   q ← t ∈ Qi such that ∀u ∈ Qi, IG(t) ≥ IG(u) // the sequence with the highest IG
20.   V ← V ∪ q
21. end for
22. return V // the DAF sequences
The time complexity of this algorithm is O(nB log2 S), where B is the total size of the training set in bytes, S is the total number of selected binary n-grams, and n is the size of each n-gram in bytes. The space complexity is O(SC), where S is again the total number of selected binary n-grams and C is the average number of assembly sequences found per binary n-gram. The running times and memory requirements of all three algorithms in this chapter are given in Chapter 11.
10.3.3 Feature Vector Computation and Classification
Each feature in a feature set (e.g., HFS, BFS) is a binary feature, meaning its value is either 1 or 0. If the feature is present in an instance (i.e., an executable), then its value is 1; otherwise its value is 0. For each training (or testing) instance, we compute a feature vector, which is a bit vector consisting of the feature values of the corresponding feature set. For example, to compute the feature vector VBFS corresponding to the BFS of a particular instance I, we search for each feature f ∈ BFS in I. If f is found in I, we set VBFS[f] (i.e., the bit corresponding to f) to 1; otherwise we set it to 0. In this way we set or reset each bit in the feature vector. These feature vectors are used by the classifiers for training and testing.
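The bit-vector computation amounts to one set-membership test per selected feature; a minimal sketch (the function and argument names are illustrative):

```python
def feature_vector(feature_set, instance_features):
    """Binary feature vector for one executable: bit i is 1 iff the i-th
    selected feature occurs in the instance. `feature_set` is the ordered
    list of selected features (e.g. the HFS); `instance_features` is the
    collection of n-grams / calls extracted from the instance."""
    present = set(instance_features)   # O(1) membership tests
    return [1 if f in present else 0 for f in feature_set]
```

Keeping `feature_set` in a fixed order is essential: it is what makes bit i mean the same feature in every training and testing vector.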
We apply SVM, Naïve Bayes (NB), boosted decision tree, and other classifiers to the classification task. SVM can perform either linear or non-linear classification. The linear classifier proposed by Vladimir Vapnik creates a hyperplane that separates the data points of the two classes with the maximum margin: a maximum-margin hyperplane splits the training examples into two subsets such that the distance between the hyperplane and its closest data point(s) is maximized. A non-linear SVM [Boser et al., 1992] is implemented by applying the kernel trick to maximum-margin hyperplanes: the feature space is transformed into a higher-dimensional space in which the maximum-margin hyperplane is found. A decision tree contains attribute tests at each internal node and a decision at each leaf node. It classifies an instance
by performing attribute tests from the root down to a decision node. A decision tree is a rule-based classifier, meaning that we can obtain human-readable classification rules from the tree. J48 is an implementation of the C4.5 decision tree algorithm; C4.5 is an extension of the ID3 algorithm invented by Quinlan. A boosting technique called AdaBoost combines multiple classifiers by assigning weights to each of them according to their classification performance. The algorithm starts by assigning equal weights to all training samples, and a model is obtained from these training data. Then each misclassified example's weight is increased, and another model is obtained from the reweighted training data. This is iterated a specified number of times. During classification, each of these models is applied to the test data, and a weighted vote determines the class of the test instance. We use the AdaBoost.M1 algorithm [Freund and Schapire, 1996] on NB and J48. We report only the SVM and Boosted J48 results because they are the best. It should be noted that we do not have a preference for one classifier over another. We report these accuracies in the results in Chapter 11.
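The boosting loop just described can be illustrated with a self-contained sketch that uses one-feature decision stumps as a stand-in base learner (the chapter's actual base learners are NB and J48); labels are assumed to be in {−1, +1}:

```python
from math import exp, log

def stump_predict(feature_idx, x):
    """Decision stump on a binary feature: predict +1 iff the bit is set."""
    return 1 if x[feature_idx] == 1 else -1

def adaboost(X, y, rounds):
    """Binary AdaBoost sketch. X: list of 0/1 feature vectors,
    y: labels in {-1, +1}. Returns a list of (alpha, feature_idx) pairs."""
    m = len(X)
    w = [1.0 / m] * m                      # equal weights to start
    ensemble = []
    for _ in range(rounds):
        # fit the base learner: pick the stump with lowest weighted error
        best_f, best_err = None, float("inf")
        for f in range(len(X[0])):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(f, xi) != yi)
            if err < best_err:
                best_f, best_err = f, err
        if best_err >= 0.5:                # no better than chance: stop
            break
        alpha = 0.5 * log((1 - best_err) / max(best_err, 1e-12))
        ensemble.append((alpha, best_f))
        # reweight: increase the weight of misclassified examples
        w = [wi * exp(-alpha * yi * stump_predict(best_f, xi))
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all models in the ensemble."""
    score = sum(alpha * stump_predict(f, x) for alpha, f in ensemble)
    return 1 if score >= 0 else -1
```

Each round's weight alpha grows as the round's error shrinks, so accurate models dominate the final vote, exactly the weighted-voting behavior described above.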
10.4 Summary

In this chapter we have shown how to efficiently extract features from the training data. We also showed how scalability can be achieved using disk access. We explained the algorithms for feature extraction and feature selection and analyzed their time complexity. Finally, we showed how to combine the feature sets and build the feature vectors. We applied different machine learning techniques, such as SVM, J48, and AdaBoost, for building the
classification model. In the next chapter, we will show how our approach performs on different datasets compared to several baseline techniques.
In the future, we would like to enhance the scalability of our approach by applying a cloud computing framework to the feature extraction and selection tasks. Cloud computing offers a cheap alternative for obtaining more CPU power and much larger disk space, which could be utilized for a much faster feature extraction and selection process. We are also interested in extracting behavioral features from the executables to overcome the problem of binary obfuscation by polymorphic malware.
References

[Boser et al., 1992] Boser, B. E., I. M. Guyon, and V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler, Editor, 5th Annual ACM Workshop on COLT, ACM Press, 1992, pp. 144–152.
[Freund and Schapire, 1996] Freund, Y., and R. E. Schapire, Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the 13th International Conference (ICML), Bari, Italy, 1996, pp. 148–156.
[Goodrich and Tamassia, 2006] Goodrich, M. T., and R. Tamassia, Data Structures and Algorithms in Java, Fourth Edition, John Wiley & Sons, 2006.
[Mitchell, 1997] Mitchell, T., Machine Learning, McGraw-Hill, 1997.
11
EVALUATION AND RESULTS
11.1 Introduction

In this chapter we discuss the experiments and the evaluation process in detail. We use two different datasets with different numbers of instances and class distributions. We compare the features extracted with our approach, namely the hybrid feature set (HFS), with two baseline approaches: (1) the binary feature set (BFS) and (2) the derived assembly feature set (DAF). For classification, we compare the performance of several classifiers on each of these feature sets: Support Vector Machine (SVM), Naïve Bayes (NB), Bayes Net, decision tree, and boosted decision tree. We show the classification accuracy, false positive, and false negative rates for our approach and each of the baseline techniques. We also compare the running times and performance/cost tradeoff of our approach against the baselines.
The organization of this chapter is as follows. Section 11.2 describes the experiments. The datasets are given in Section 11.3. The experimental setup is discussed in Section 11.4. Results are given in Section 11.5. An example run is given in Section 11.6. The chapter is summarized in Section 11.7. Figure 11.1 illustrates the concepts in this chapter.
11.2 Experiments

We design our experiments to run on two different datasets. Each dataset has a different size and distribution of benign and malicious executables. We generate all kinds of n-gram features (e.g., BFS, AFS, DFS) using the techniques explained in Chapter 10. Notice that the BFS corresponds to the features extracted by the method of [Kolter and Maloof 2004]. We also generate the DAF and HFS using our model, as explained in Chapter 10. We test the accuracy of each of the feature sets by applying threefold cross validation, using classifiers such as SVM, decision tree, Naïve Bayes, Bayes Net, and boosted decision tree. Among these classifiers, we obtain the best results with SVM and boosted decision tree, reported in the results section of this chapter. We do not report the other classifiers' results because of space limitations. In addition, we compute the average accuracy, false positive and false negative rates, and receiver operating characteristic (ROC) graphs (using the techniques in [Fawcett 2003]). We also compare the running time and performance/cost tradeoff between HFS and BFS.
Figure 11.1 Concepts in this chapter.
11.3 Dataset

We have two non-disjoint datasets. The first dataset (dataset1) contains a collection of 1,435 executables, 597 of which are benign and 838 malicious. The second dataset (dataset2) contains 2,452 executables: 1,370 benign and 1,082 malicious. So the distribution of dataset1 is 41.6% benign and 58.4% malicious, and that of dataset2 is 55.9% benign and 44.1% malicious. These distributions were chosen intentionally to evaluate the performance of the feature sets in different scenarios. We collected the benign executables from different Windows XP and Windows 2000 machines and the malicious executables from [VX Heavens], which hosts a large collection of malicious executables. The benign executables comprise various applications found in the Windows installation folder (e.g., "C:\Windows") as well as other executables in the default program installation directory (e.g., "C:\Program Files"). The malicious executables comprise viruses, worms, Trojan horses, and back-doors. We select only Win32 Portable Executables in both cases. We would like to experiment with ELF executables in the future.
11.4 Experimental Setup

Our implementation is developed in Java with JDK 1.5. We use the LIBSVM library [Chang and Lin 2006] for running SVM and the Weka ML toolbox [Weka] for running boosted decision tree and the other classifiers. For SVM, we run C-SVC with a polynomial kernel, using gamma = 0.1 and epsilon = 1.0E-12. For boosted decision tree, we run 10 iterations of the AdaBoost algorithm on the C4.5 decision tree algorithm called J48.
We set the parameter S (the number of selected features) to 500 because it is the best value found in our experiments. Most of our experiments are run on two machines: a Sun Solaris machine with 4GB main memory and a 2GHz clock speed, and a Linux machine with 2GB main memory and a 1.8GHz clock speed. The reported running times are based on the latter machine. The disassembly and hex-dump are done only once for all machine executables and the resulting files are stored; we then run our experiments on the stored files.
11.5 Results

In this section we first report and analyze the results obtained by running SVM on the datasets. Later we show the accuracies of Boosted J48. Because the results from Boosted J48 are almost the same as those of SVM, we do not report the analyses based on Boosted J48.
11.5.1 Accuracy
Table 11.1 shows the accuracy of SVM on different feature sets. The columns headed HFS, BFS, and AFS represent the accuracies of the hybrid feature set (our method), the binary feature set (Kolter and Maloof's feature set), and the assembly feature set, respectively. Note that the AFS is different from the DAF (i.e., the derived assembly features) used in the HFS (see Chapter 10 for details). Table 11.1 shows that the classification accuracy of HFS is always better than the other models on both datasets. It is interesting to note that the accuracies for 1-gram BFS are very low in both datasets. This is because a 1-gram is only a 1-byte-long pattern with only 256 different possibilities, so this pattern is not at all useful in distinguishing the malicious executables from the normal ones and would not be used in a practical application. We therefore exclude the 1-gram accuracies when computing the average accuracies (i.e., the last row).
Table 11.1 Classification Accuracy (%) of SVM on Different Feature Sets

Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Average accuracy excluding 1-gram.
11.5.1.1 Dataset1 Here the best accuracy of the hybrid model is for n = 6, which is 97.4% and is the highest among all feature sets. On average, the accuracy of HFS is 1.68% higher than that of BFS and 11.36% higher than that of AFS. The accuracies of AFS are always the lowest. One possible reason for this poor performance is that AFS considers only the CODE part of the executables, so AFS misses any distinguishing pattern carried by the ABEP or DATA parts, and as a result the extracted features perform worse. Moreover, the accuracy of AFS deteriorates greatly for n ≥ 10. This is because longer sequences of instructions are rarer in either class of executables (malicious or benign), so these sequences have less distinguishing power. BFS, on the other hand, considers all parts of the executable, achieving higher accuracy. Finally, HFS considers DLL calls as well as BFS and DAF, so HFS performs better than BFS.
11.5.1.2 Dataset2 Here the differences between the accuracies of HFS and BFS are greater than in dataset1. The average accuracy of HFS is 4.2% higher than that of BFS. The accuracies of AFS are again the lowest. It is interesting to note that HFS improves over BFS (and AFS) in dataset2. Two important conclusions may be drawn from this observation. First, dataset2 is much larger than dataset1 and has a more diverse set of examples. Here HFS performs better than in dataset1, whereas BFS performs worse than in dataset1. This implies that HFS is more robust than BFS on a larger and more diverse set of instances, and thus more applicable to a large, diverse corpus of executables. Second, dataset2 has more benign executables than malicious ones, whereas dataset1 has fewer benign executables. The distribution of dataset2 is more likely in the real world, where benign executables outnumber malicious executables. This implies that HFS is likely to perform better than BFS in a real-world scenario, with a larger number of benign executables in the dataset.
11.5.1.3 Statistical Significance Test We also perform a pair-wise two-tailed t-test on the HFS and BFS accuracies to test whether the differences between them are statistically significant. We exclude the 1-gram accuracies from this test for the reason previously explained. The result of the t-test is summarized in Table 11.2. The t-value shown in the table is the value of t obtained from the accuracies. There are (5 + 5 − 2) = 8 degrees of freedom, since we have five observations in each group and there are two groups (i.e., HFS and BFS). Probability denotes the probability of rejecting the NULL hypothesis (that there is no difference between the HFS and BFS accuracies), while the p-value denotes the probability of accepting the NULL hypothesis. For dataset1 the probability is 99.65% and for dataset2 it is 100.0%. Thus we conclude that the average accuracy of HFS is significantly higher than that of BFS.
Table 11.2 Pair-Wise Two-Tailed t-Test Results Comparing HFS and BFS

                      DATASET1    DATASET2
t-value               8.9         14.6
Degrees of freedom    8           8
Probability           0.9965      1.00
p-value               0.0035      0.0000
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.
11.5.1.4 DLL Call Feature Here we report the accuracies of the DLL function call features (DFS). The 1-gram accuracies are 92.8% for dataset1 and 91.9% for dataset2. The accuracies for higher grams are less than 75%, so we do not report them. The reason behind this poor performance is possibly that there are no distinguishing call sequences that identify the executables as malicious or benign.
11.5.2 ROC Curves
ROC curves plot the true positive rate against the false positive rate of a classifier. Figure 11.2 shows the ROC curves of dataset1 for n = 6 and dataset2 for n = 4, based on SVM testing. The ROC curves for other values of n show similar trends, except for n = 1, where AFS performs better than BFS. It is evident from the curves that HFS always dominates (i.e., has a larger area under the curve than) the other two, and it is more dominant in dataset2. Table 11.3 reports the area under the curve (AUC) for the ROC curves of each of the feature sets. A higher AUC indicates a higher probability that a classifier will predict correctly. Table 11.3 shows that the AUC of HFS is the highest, and that it improves (relative to the other two) in dataset2. This also supports our hypothesis that our model will perform better in the more likely real-world scenario, where benign executables occur more frequently.
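The AUC can be computed directly from classifier scores via its rank-statistic interpretation: the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one. A minimal sketch:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from raw classifier scores.
    scores: real-valued outputs; labels: 1 (positive) or 0 (negative).
    Counts each positive/negative pair once, with ties worth half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

This O(|pos|·|neg|) pairwise form is the simplest statement of AUC; sorting-based implementations achieve the same result in O(N log N).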
Figure 11.2 ROC curves for different feature sets in dataset1 (left) and dataset2 (right). (From M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.)
Table 11.3 Area under the ROC Curve on Different Feature Sets
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Average value excluding 1-gram.
11.5.3 False Positive and False Negative
Table 11.4 reports the false positive and false negative rates (in percent) for each feature set, based on the SVM output. The last row reports the averages; again we exclude the 1-gram values. In dataset1 the average false positive rate of HFS is 4.9%, which is the lowest; in dataset2 this rate is even lower (3.2%). The false positive rate is a measure of the false alarm rate, so our model has the lowest false alarm rate. We also observe that this rate decreases as we increase the number of benign examples, because the classifier becomes more familiar with benign executables and misclassifies fewer of them as malicious. We believe that a large training set with a larger portion of benign executables would eventually push the false positive rate toward zero. The false negative rate is also the lowest for HFS, as reported in Table 11.4.
Table 11.4 False Positive and False Negative Rates on Different Feature Sets
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.
a Average value excluding 1-gram
11.5.4 Running Time
We compare in Table 11.5 the running times (feature extraction, training, and testing) of the different kinds of features (HFS, BFS, AFS) for different values of n. The feature extraction time for HFS and AFS includes the disassembly time, which is 465 seconds (in total) for dataset1 and 865 seconds (in total) for dataset2. Training time is the sum of the feature extraction time, the feature-vector computation time, and the SVM training time. Testing time is the sum of the disassembly time (except for BFS), the feature-vector computation time, and the SVM classification time. Training and testing times based on Boosted J48 have almost the same characteristics, so we do not report them. Table 11.5 also reports the cost factor, the ratio of the time required for HFS relative to BFS.
The column Cost Factor shows this comparison. The average feature extraction times are computed excluding the 1-grams and 2-grams, because these grams are unlikely to be used in practical applications. The boldface cells in the table are of particular interest. From the table we see that the running times for HFS training and testing on dataset1 are 1.17 and 4.87 times higher than those of BFS, respectively. For dataset2 these factors are 1.08 and 4.5, respectively. The average throughput for HFS is 0.6 MB/sec (in both datasets), which may be considered near real-time performance. Finally, we summarize the cost/performance tradeoff in Table 11.6. The column Performance Improvement reports the accuracy improvement of HFS over BFS; the cost factors are shown in the next two columns. If we drop the disassembly time from the testing time (assuming that disassembly is done offline), then the testing cost factor diminishes to 1.0 for both datasets. It is evident from Table 11.6 that the performance/cost tradeoff is better for dataset2 than for dataset1. Again we may infer that our model is likely to perform better on a larger and more realistic dataset. The main bottleneck of our system is the disassembly cost; the testing cost factor is higher because a larger proportion of the testing time is spent on disassembly. We believe this factor can be greatly reduced by optimizing the disassembler, and by noting that disassembly can be done offline.
Table 11.5 Running Times (in seconds)
Source: M. Masud, L. Khan, B. Thuraisingham, A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables, pp. 33–45, Springer. With permission.

a Ratio of time required for HFS to time required for BFS.

b Average feature extraction times, excluding 1-gram and 2-gram.

c Average training/testing times, excluding 1-gram and 2-gram.
Table 11.6 Performance/Cost Tradeoff between HFS and BFS