Machine Learning for Malware Analysis

Mike SlawinskiData Scientist

Introduction - What is Malware?

- Software intended to cause harm or inflict damage on computer systems

- Many different kinds:

- Viruses

- Trojans

- Worms

- Adware/Spyware

- Ransomware

- Rootkits

- Backdoors

- Botnets

Malware Detection - Hashing- Simplest method:

- Compute a fingerprint of the sample (MD5, SHA1, SHA256, …)

- Check for existance of hash in a database of known malicious hashes

- If the hash exists, the file is malicious

- Fast and simple

- Requires work to keep up the database

7578034f6f7cb994c69afdf09fc487d9

Query DB

Malicious Benign

Malware Detection - Signatures

Look for specific strings, byte sequences, … in the file.

If attributes match, the file is likely the piece of malware in question

Signature Example

Problems with Signatures- Can be thought of as an overfit classifier

- No generalization capability to novel threats

- Requires reverse engineers to write new signatures

- Signature may be trivially bypassed by the malware author

Malware Detection - Behavioral Methods- Instead of scanning for signatures, examine what the program does when executed

- Very slow - AV must run the program and extract information about what the sample does

- Malicious samples can “run out the clock” on behavior checks

Scaling Malware Detection- Previously mentioned approaches have difficulty generalizing to new

malware

- New kinds of malware require humans in the loop to reverse-engineer and create new signatures and heuristics for adequate detection

- Can we automate this process with machine learning?

Focus: Windows DLL/EXEs (Portable Executable)

Number of samples submitted to VirusTotal, Jan 29 2017

Portable Executable (PE) Format

Feature Engineering - Static Analysis- What kinds of features can we extract for PE files?- Objective: extract features from the EXE without executing anything- PE-Specific features

- Information about the structure of the PE file

- Strings- Print off all human-readable strings from the binary

- Entropy features- Extract information about the predictability of byte sequences

- Compressed/encrypted data is high entropy

- Disassembly features- Get an idea of what kind of code the sample will execute

PE-Specific Features

https://virustotal.com/en/file/e328b2406d8784e54e77ccc7dbe8e3731891a703e6c21cf7e2f924fa8a42ea5c/analysis/

PE-Specific Features (cont.)

https://virustotal.com/en/file/e328b2406d8784e54e77ccc7dbe8e3731891a703e6c21cf7e2f924fa8a42ea5c/analysis/

PE-Specific Features (cont.)

Feature Engineering - String Features- Extract contiguous runs of ASCII-

printable strings from the binary

- Can see strings used for dialog boxes, user queries, menu items, ...

- Samples trying to obfuscate themselves won’t have many strings

Entropy Features- Interpret the stream of bytes as a time-

series signal

- Compute a sliding-window entropy of the sample

- Information can determine if there are compressed, obfuscated, or encrypted parts of the sample

“Wavelet decomposition of software entropy reveals symptoms of malicious code”. Wojnowicz, et. al. https://arxiv.org/abs/1607.04950

Disassembly Features- Contains information about what will

actually execute

- Disassembly is difficult:

- Hard to get all of the compiled instructions from a sample

- x86 instruction set is variable-length

- Ambiguity about what is executed depending on where one starts interpreting the stream of x86 instructions

Difficulties for Static Analysis- Polymorphic code

- Code that can modify itself as it executes

- Packing- Samples that compress themselves prior to execution, and decompress themselves while executing

- Can hide malicious behavior in a compressed blob of bytes

- Can obscure benign code as well

- Requires expensive implementation of many unpackers (UPX, ASPack, Mew, Mpress, …)

- Disassembly- Malware authors can intentionally make the disassembly difficult to obtain

Modelling - Malicious versus Benign- Boils down to a binary classification task

- N: hundreds of millions of samples

- P: millions of highly sparse features (s=0.9999)

Malware

Benign

Modelling - Training on ~600 million samples

- Strong preference for minibatch methods and fast, compact models

- Logistic regression works very well

- Neural networks coupled with dimensionality reduction techniques are the workhorse

- Tend to combine lasso, dimensionality reduction, and neural networks

Files to Filesystems Question: How else can we leverage hardware optimized for matrix operations?

Answer: Graph Kernels applied to filesystems

Filesystems – interesting topological structure

𝐾 𝐺,𝐻 measures the similarity between G and H, taking into accountboth the topological structure of the trees and their labels.

Idea: construct a map which measures the similarity between graphs G and H, which takes into account both the topological differences of the trees and the label differences.

Upshot: We can measure the similarity between two file systems A and B by measuring the similarity between their labeled tree structure.

𝐾: Γ×Γ → ℝ

Graph Comparison and Vectorization

𝑎𝑏0000

𝑎𝑐0000

𝑐𝑑00

00𝑐𝑒00

𝑎𝑏000

𝑎𝑑000

𝑎𝑒000

Filesystems – interesting topological structureCan leverage GPU hardware in two ways:

• Kernel computations 𝐾: Γ×Γ → ℝ

• Neural Network training on features derived from these kernels

Upshot: The framing a given problem/procedure in terms of matrix algebra translates to massive computational advantages (GPU).

Selected Hardware

AWS G3 instance - four NVIDIA Tesla M60 GPUs

AWS P2 instances - up to16 NVIDIA K80 GPUs

Thank You!

Questions?

Machine Learning for Malware Analysis - GPU...

Documents

Malware Analysis Workshop June 5, 2012 - CCDCOE · Malware Analysis Workshop June 5, 2012 ... •Malware Analysis methods ... •Some malware doesnt run in Norman Sandbox

La primera será malwarebytes anti-malware. Un vez ... · Malwarebytes Anti-Malware Malware ANTI-MALWARE Registros Lista de ignorados Configuración Más herramientas Escáner Protección

An Introduction To Keyloggers, RATS And Malware€¦ · Malware Malware has been a problem for ages, Malware is short form of malicious software. A Malware is basically a program

Metamorphic Malware 1 Metamorphic Malware Research

Intrusion Detection and Malware Analysis - Malware collection

Using optimization algorithms for malware deobfuscationsigurnost.zemris.fer.hr/ns/malware/2010_spasojevic/Diplomski_Spasojevic.pdf · Using optimization algorithms for malware deobfuscation

Reversing & malware analysis training part 8 malware memory forensics

Improving accuracy of malware detection by …...FFRI, Inc. Background – malware and its detection 4 Increasing malware Targeted Attack (Unknown malware) Malware generators Obfuscators

Advanced malware analysis training session6 malware sandbox analysis

Malware Fails Best Bugs in Malware Felix Leder [Malware ... · 1 Malware Fails Best Bugs in Malware Felix Leder [Malware Detection Team] Felix.Leder@norman.com 5. desember 2011 malware

Malware Detection with Malware Images using …cosc.canterbury.ac.nz/research/reports/HonsReps/2018/...Malware Detection with Malware Images using Deep Learning Techniques by Ke HE

Windows Malware Firewall: Remove Windows Malware Firewall

Evaluating Open Source Malware Sandboxes with Linux malware

Malware & Anti-Malware

Ch 12: Covert Malware Launching - samsclass.info · Practical Malware Analysis Ch 12: Covert Malware Launching Last revised: 4-10-17. Hiding Malware • Malware used to be visible

THE EVOLUTION OF MALWARE & FUTURE MALWARE MITIGATION

Reversing & Malware Analysis Training Part 8 - Malware Memory Forensics

Intrusion Detection and Malware Analysis - Malware collection134.2.173.140/lehre/ws10/ids-malware/12-malware-collection.pdf · Intrusion Detection and Malware Analysis Malware collection

Part 4: Malware Functionality Chapter 11: Malware Behavior Chapter 12: Covert Malware Launching Chapter 13: Data Encoding Chapter 14: Malware-focused Network

Intrusion Detection and Malware Analysis - Malware analysis · Intrusion Detection and Malware Analysis Malware analysis Pavel Laskov Wilhelm Schickard Institute for Computer Science