Detection of Spyware by Mining Executable Files

Chapter 2

Objectives

The main objective of our project is to establish a method for spyware detection using data mining techniques. These techniques are commonly used for information retrieval and classification. We apply them with only one change: the input consists of computer programs rather than text documents.

In this project, binary features are extracted from executable files. A feature reduction method is then used to obtain a subset of the data, which serves as a training set for automatically generating classifiers. The generated classifiers are used to classify new, previously unseen binaries as either legitimate software or spyware. We choose an appropriate value of n to yield high performance and a suitable machine learning algorithm to produce high accuracy.

Project idea

The goal of the project is to detect spyware by using data mining and machine learning. We use the Waikato Environment for Knowledge Analysis (WEKA) to perform the experiments. WEKA is a suite of machine learning algorithms and analysis tools, which is used in practice for solving data mining problems. First, we extract features from the binary files and we then apply a feature reduction method in order to reduce data set complexity. Finally, we convert the reduced feature set into the Attribute Relation File Format (ARFF). ARFF files are ASCII text files that include a set of data instances, each described by a set of features. Figure 2.1 shows the steps involved in our proposed method.

Figure 2.1: Proposed System

We organized our work into the following stages:

1. Data Collection

2. Byte Sequence Generation

3. N-gram Generation

4. Feature Extraction

5. Feature Reduction

6. ARFF Generation

7. Model Training

Step 1: Data Collection

Our data set consists of two classes of binary files: (1) benign files and (2) spyware files.

Step 2: Byte Sequence Generation

This process converts every binary file in each class into a byte sequence. We use xxd, a UNIX utility that produces a hexadecimal dump of a file, for the conversion.
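The conversion can be illustrated with a minimal Python sketch. It assumes that a plain, unwrapped hexadecimal string (similar to what `xxd -p` prints, minus the line breaks) is an acceptable byte sequence. The source does not say which xxd options were used, and the function names `bytes_to_hex` and `file_to_byte_sequence` are hypothetical.

```python
from pathlib import Path


def bytes_to_hex(data: bytes) -> str:
    """Return the bytes as one lowercase hex string, two characters per byte."""
    return data.hex()


def file_to_byte_sequence(path) -> str:
    """Read a binary file and return its contents as a hex string.

    Assumption: a single unwrapped hex string stands in for the xxd output.
    """
    return bytes_to_hex(Path(path).read_bytes())
```

For example, the four bytes `4D 5A 90 00` become the string `4d5a9000`.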

Step 3: N-gram Generation

This process splits the byte sequences into n-grams of a desired size n (namely 4, 5 and 6). An n-gram is a sequence of n elements. The process also ensures that each line contains exactly one n-gram, so that the length of a single line equals n.
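A sketch of n-gram generation in Python follows. Two points are assumptions, since the source does not settle them: n is counted in bytes (so each n-gram is written as 2n hex characters), and the window slides one byte at a time so that consecutive n-grams overlap. The function name `ngrams` and the `step` parameter are hypothetical.

```python
def ngrams(hex_sequence: str, n: int, step: int = 1):
    """Yield byte n-grams from a hex string, each as a hex string of 2*n characters.

    Assumptions: n counts bytes, and the window advances `step` bytes at a time.
    """
    width = 2 * n        # two hex characters per byte
    stride = 2 * step    # keep the window aligned on byte boundaries
    for start in range(0, len(hex_sequence) - width + 1, stride):
        yield hex_sequence[start:start + width]


# One n-gram per line, as described in Step 3.
listing = "\n".join(ngrams("4d5a900003", 4))
```

Setting `step` equal to `n` would give non-overlapping n-grams instead.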

Step 4: Feature Extraction

We extract the features using two different approaches: Common Feature Based Extraction (CFBE) and Frequency Based Feature Extraction (FBFE). Both methods are used to obtain Reduced Feature Sets (RFSs), which are then used to generate the Attribute Relation File Format (ARFF) files.

1. Frequency Based Feature Extraction (FBFE): In FBFE, the frequency of each n-gram in each class is calculated.

2. Common Feature Based Extraction (CFBE): In CFBE, the common n-grams are extracted from each class.

Step 5: Feature Reduction

In FBFE, all n-grams whose frequency falls within a specified range (50-500) are kept and the rest (1-49) are discarded. In CFBE, only one representation of each feature is considered in each class. To obtain the Reduced Feature Sets (RFSs) for CFBE and FBFE, the unique n-grams of both classes are merged.
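Feature extraction and reduction can be sketched in Python as follows. The FBFE part follows the text directly, except that n-grams occurring more than 500 times are also dropped, which the range implies but the text does not state. The CFBE part is only one possible reading: it treats an n-gram as "common" when it occurs in at least `min_files` files of a class and stores it once. All function names and the `min_files` parameter are hypothetical.

```python
from collections import Counter


def fbfe(class_files, low=50, high=500):
    """Keep n-grams whose total count across the class lies in [low, high].

    class_files: list of lists of n-grams, one inner list per file.
    """
    counts = Counter()
    for file_ngrams in class_files:
        counts.update(file_ngrams)
    return {gram for gram, count in counts.items() if low <= count <= high}


def cfbe(class_files, min_files=2):
    """Assumed reading of CFBE: keep each n-gram once if it occurs in at
    least min_files files of the class."""
    file_counts = Counter()
    for file_ngrams in class_files:
        file_counts.update(set(file_ngrams))  # count each n-gram once per file
    return {gram for gram, count in file_counts.items() if count >= min_files}


def reduced_feature_set(benign_features, spyware_features):
    """Merge the unique n-grams of both classes into one sorted list."""
    return sorted(set(benign_features) | set(spyware_features))
```

Each extraction function is applied to the benign and the spyware class separately, and the two resulting sets are then merged into the reduced feature set for that method.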

Step 6: ARFF Generation (Data Set Generation)

This process generates two ARFF databases: a frequency based feature database and a common feature based database. All attributes in the databases are treated as Boolean attributes. The ARFF generation process searches for every n-gram in all byte sequences of a class and assigns each attribute a value of 1 or 0, depending on whether the n-gram is present or not.

Step 7: Model Training

The ARFF file is used as input to WEKA for applying machine learning algorithms. The algorithms used in the experiment are: ZeroR, Naive Bayes, SVM (Support Vector Machines), J48, Random Forest and JRip.

Hardware Requirements

Processor: Pentium, 1.6 GHz or higher

RAM: 128 MB or more

HDD: 40 GB or more

Software Requirements

Platform: Linux OS

Language: JAVA

Editor: G-Edit

Machine Learning Tool: WEKA
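As an illustration of Step 6 (ARFF Generation), the following Python sketch writes a reduced feature set as ARFF text that WEKA can load. It assumes one data instance per executable, Boolean attributes encoded as the nominal set {0,1}, and class labels named benign and spyware. The function name `to_arff`, the relation name and the `ng_` attribute prefix are hypothetical. The prefix is there because hex n-grams may begin with a digit, which is awkward for an attribute name.

```python
def to_arff(features, instances, relation="spyware_ngrams"):
    """Build ARFF text with one {0,1} attribute per n-gram plus a class label.

    features:  ordered list of n-grams (the reduced feature set).
    instances: list of (present_ngrams, label) pairs, one per executable,
               where present_ngrams is the set of n-grams found in that file.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute ng_{gram} {{0,1}}" for gram in features]
    lines += ["@attribute class {benign,spyware}", "", "@data"]
    for present, label in instances:
        # 1 if the n-gram is present in the file, 0 otherwise
        row = ["1" if gram in present else "0" for gram in features]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"
```

Membership is tested against each file's own set of n-grams rather than by substring search on the hex string, so that matches cannot straddle byte boundaries.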
