Fourteenforty Research Institute, Inc. FFRI, Inc. http://www.ffri.jp @PacSec 2013 Fighting advanced malware using machine learning Junichi Murakami Director of Advanced Development

Fighting advanced malware using machine learning (English)

DESCRIPTION

This paper introduces behavior-based detection powered by machine learning. As a result, the detection ratio improves dramatically compared with traditional detection.

Needless to say, malware detection is getting harder. Signature-based detection is widely acknowledged to have reached its limit, so most anti-virus vendors combine heuristic, behavioral and reputation-based detection. In targeted attacks, attackers basically use undetectable malware, so reputation-based detection does not work well because it needs earlier victims. Even with heuristic and behavioral detection, the detection ratio is not sufficient. In our research using Metascan, the average detection ratio for the newest malware across anti-virus scanners is about 30% (the best is about 60%).

Heuristic and behavioral detection are built on the knowledge and experience of malware analysts. For example, most analysts know that the following features indicate that a program is malicious:

- A file that imports only VirtualAlloc, VirtualProtect and LoadLibrary and has a strange section name
- An entry point that does not fall within the declared text or code section
- Creating remote threads in a legitimate process such as explorer.exe
- After unpacking, calling OpenMutex and CreateMutex to avoid multiple infections
- Registering itself at auto-start extensibility points such as services and the registry
- Creating a .bat file and trying to delete itself by executing the file with cmd.exe
- Setting a global hook with SetWindowsHookEx to capture keystrokes

Heuristic and behavioral detection are developed from predetermined features like these. Analysts find such features day by day, but this kind of work is not well suited to humans. Therefore we classified programs as malware or benign by applying machine learning to dynamic analysis results. The detection ratio improved dramatically, and numeric scores showed which features are strongly related to malware. The method also revealed features we had never found before. Finally, the outlook and challenges of this method are discussed.

Page 1: Fighting advanced malware using machine learning (English)

Fighting advanced malware using machine learning

Junichi Murakami, Director of Advanced Development

FFRI, Inc. (Fourteenforty Research Institute, Inc.)
http://www.ffri.jp

PacSec 2013

Page 2: Fighting advanced malware using machine learning (English)

Who am I?

• Security Researcher at FFRI, Inc.
– Malware and vulnerability analysis with RCE
– Both Windows and Linux kernel development
• Speaker at
– BlackHat USA/JP, RSA, PacSec, AVAR, etc.
• *NOT* an MS/PhD holder in CS/Math
– Just a user of machine learning

Page 5: Fighting advanced malware using machine learning (English)

Agenda

• Background
• Our approach
• Future work
– Computers vs. Man
– Applying to real time protection
• Conclusion

Page 6: Fighting advanced malware using machine learning (English)

Background – malware and its detection

[Diagram labels: Increasing malware; Targeted Attack (Unknown malware); Generators; Obfuscators; Limitation of pattern matching; other methods: Reputation, Heuristic, Behavioral, Machine learning; Big data]

Page 7: Fighting advanced malware using machine learning (English)

Limitation of signature matching

• Evaluated the TPR of 11 AV products using Metascan
• Used fresh malware (not WildList malware)
• Prepared 2 test sets from different sources and periods
– test-1: 1,000 samples
– test-2: 900 samples

Page 8: Fighting advanced malware using machine learning (English)

Limitation of signature matching

[Bar chart: TPR and FNR (0% to 100%) for AV products A to K and their average (Avg.), each measured on test-1 and test-2]

Page 9: Fighting advanced malware using machine learning (English)

Advantage of (cloud) reputation

• The concept is the same as signature matching (blacklisting)
• Endpoints no longer have to keep HEAVY patterns
• A new pattern is easily propagated to other endpoints

[Diagram: 1. query, 2. check, 3. response]
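The query/check/response flow can be sketched as a hash lookup. This is only an illustration, not any vendor's protocol: the in-memory set `KNOWN_BAD` stands in for the cloud database, and the name `reputation_lookup` is hypothetical.

```python
import hashlib

# Hypothetical stand-in for the cloud-side blacklist (SHA-256 hex digests).
KNOWN_BAD = {hashlib.sha256(b"previously reported sample").hexdigest()}

def reputation_lookup(file_bytes, database=KNOWN_BAD):
    """Return True if the file's hash is already known to be malicious."""
    digest = hashlib.sha256(file_bytes).hexdigest()  # 1. query: the endpoint sends a hash
    return digest in database                        # 2. check, 3. response
```

A sample that nobody has reported yet returns False, however malicious it is.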

Page 10: Fighting advanced malware using machine learning (English)

Disadvantage of (cloud) reputation

[Diagram: 1. query, 2. check, 3. response]

• “Detectable” means that someone has already been attacked
• What if you are the first victim?
• How effective is it against a “Targeted Attack”?

Page 11: Fighting advanced malware using machine learning (English)

Advantage of heuristic/behavioral detection

• Based on pre-determined characteristics or behaviors
– OpenProcess -> WriteProcessMemory -> CreateRemoteThread
– Registering itself to auto start extensibility points
– Disabling Windows Firewall, etc.
• Providing generic logic to detect malware
• Signature-free (eliminates regular scanning and updates)
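A pre-determined behavior such as OpenProcess -> WriteProcessMemory -> CreateRemoteThread can be checked as an ordered subsequence of the observed API calls. The sketch below assumes the calls are available as a plain list of names; `matches_in_order` and `INJECTION_PATTERN` are hypothetical names, and real products use far richer conditions.

```python
INJECTION_PATTERN = ("OpenProcess", "WriteProcessMemory", "CreateRemoteThread")

def matches_in_order(calls, pattern=INJECTION_PATTERN):
    """Return True if every API in `pattern` occurs in `calls` in that order.

    Other calls may appear in between; this is a subsequence test.
    """
    remaining = iter(calls)
    # `name in remaining` consumes the iterator up to the first match,
    # so each pattern element is searched only after the previous one.
    return all(name in remaining for name in pattern)
```

Rules like this are easy to explain, but each one has to be written and maintained by an analyst.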

Page 12: Fighting advanced malware using machine learning (English)

Disadvantage of heuristic/behavioral detection

• Difficult to avoid false positives completely
• Or the user must be asked whether an action should be allowed (user dependent)
• Have to analyze malware and update the logic continuously (not a human task, more suitable for computers)

Page 13: Fighting advanced malware using machine learning (English)

Heuristic Behavioral Test - July 2012

http://chart.av-comparatives.org/chart1.php

• AV-Comparatives has published H/B Test results since 2012
• Behavioral detection hardly contributes to detection (avg: 4.8%)

Page 14: Fighting advanced malware using machine learning (English)

Our approach

• Behavioral-based detection powered by machine learning
• Not new but more practical for the industry
• Easy to try and automate using the open source tools below
– Cuckoo Sandbox
– Jubatus

Page 15: Fighting advanced malware using machine learning (English)

Machine learning-based detection

• Most of the research is done in academia
• Basically, it is a classification problem (task)
• Mainly focusing on a combination of the factors below
– Features: static information, dynamic information, hybrid
– Algorithms: SVM, Naive Bayes, Perceptron, etc.
– Evaluation: TPR/FPR, ROC curve, accuracy, precision, etc.
• Some good results are reported
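The evaluation measures named on this slide all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `rates` and the example counts are illustrative only and are not taken from the slides.

```python
def rates(tp, fn, fp, tn):
    """Compute common evaluation metrics from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),        # malware correctly detected
        "FNR": fn / (tp + fn),        # malware missed
        "FPR": fp / (fp + tn),        # benign files wrongly flagged
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),  # flagged files that really are malware
    }

# Illustrative counts only.
print(rates(tp=80, fn=20, fp=5, tn=95))
```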

Page 16: Fighting advanced malware using machine learning (English)

Overview

[Flow diagram: the original data set (malware and benign) goes through dynamic analysis in Cuckoo Sandbox, then feature selection and conversion to feature vectors; the training set is used to train Jubatus, and the test set is used to test it, yielding TPR: X% and FPR: Y%]

Page 17: Fighting advanced malware using machine learning (English)

How many samples should we use?

• It is a matter of “confidence interval” theory
• It depends on how much margin of error we accept
• All of the hit rates below are 1%
– 1/100 (N=100)
– 10/1,000 (N=1,000)
– 100/10,000 (N=10,000)
• The confidence in calling each of them “1%” is different
– Each of them has a different error

Page 18: Fighting advanced malware using machine learning (English)

How many samples should we use?

• We can estimate the margin of error based on N

[Chart: x-axis N (100, 1,000, 10,000, 100,000, 1,000,000); y-axis “Hit rate containing errors (95% confidence)”, 0.0% to 6.0%]
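To make the dependence on N concrete, the sketch below computes a 95% interval for the three "1%" cases. The slides do not say which interval formula was used; the Wilson score interval here is an assumption, chosen because it behaves well for small rates.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for hits, n in [(1, 100), (10, 1_000), (100, 10_000)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: {low:.2%} to {high:.2%}")
```

All three intervals contain 1%, but the interval narrows as N grows, which is the point of the chart.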

Page 19: Fighting advanced malware using machine learning (English)

Malware and benign files

• Randomly sampled from files we collected ourselves
– Malware: 15,000 (5,000 = training, 10,000 = testing)
– Benign: 15,000 (5,000 = training, 10,000 = testing)
• *Random* is very important
– Different periods (choose 15,000/N samples from every day)
– Different sources
– Never care about file type or malware type
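Sampling an equal share from every day can be sketched as follows. This assumes N is the number of collection days and that files are already grouped by day in a dict; both the data layout and the name `sample_per_day` are hypothetical.

```python
import random

def sample_per_day(files_by_day, total, seed=0):
    """Pick about total/N files at random from each of the N days."""
    rng = random.Random(seed)               # fixed seed keeps the split reproducible
    per_day = total // len(files_by_day)
    chosen = []
    for day in sorted(files_by_day):
        pool = files_by_day[day]
        # Take fewer files if a day does not have enough of them.
        chosen.extend(rng.sample(pool, min(per_day, len(pool))))
    return chosen
```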

Page 20: Fighting advanced malware using machine learning (English)

Cuckoo Sandbox - http://www.cuckoosandbox.org/

• Open source automated malware analysis system
– Sending malware into a virtual machine from a host
– Executing the malware inside the virtual machine
– Monitoring and saving its behaviors at runtime
• API calls, network activity, VT results, etc.
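Once an analysis has finished, the API names can be pulled out of the parsed JSON report. The sketch below assumes the layout `report["behavior"]["processes"][i]["calls"][j]["api"]`; field names can differ between Cuckoo versions, so check them against a real report. The function name `api_names` is hypothetical.

```python
def api_names(report):
    """Collect API names, process by process, from a parsed report dict.

    Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"].
    """
    names = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            names.append(call["api"])
    return names
```

With a report file on disk, the dict would typically come from `json.load`.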

Page 21: Fighting advanced malware using machine learning (English)

API calls

Page 22: Fighting advanced malware using machine learning (English)

Trends of API calls

• Most of the samples finished calling APIs within 1s (or kept calling APIs) -> Used only the API calls made within 5s

[Chart: number of samples that finished calling APIs (y-axis, 0 to 10,000) by elapsed time (x-axis: 1s, 5s, 10s, 15s, 30s, 45s, 60s, 60s+), for malware and goodware]
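The 5-second cutoff amounts to a simple filter over timestamped calls. The sketch assumes each call is available as an (elapsed_seconds, api_name) pair measured from process start; that format and the name `calls_within` are hypothetical.

```python
def calls_within(calls, limit_seconds=5.0):
    """Keep API names whose elapsed time since process start is within the limit.

    `calls` is a list of (elapsed_seconds, api_name) pairs.
    """
    return [api for elapsed, api in calls if elapsed <= limit_seconds]
```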

Page 23: Fighting advanced malware using machine learning (English)

Jubatus - http://jubat.us/en/

• Distributed Online Machine Learning Framework
– Distributed: Scalable
– Online: Not batch, continuous learning
• Open source, LGPL v2.1. Latest release is 0.4.5 (22/07/2013)
• Developed by Preferred Infrastructure, Inc. and NTT Software Innovation Center
• Supports various machine learning tasks
– Classification, Regression, Recommendation, Anomaly Detection
• Easy to use (many language bindings, feature converter, etc.)

Page 24: Fighting advanced malware using machine learning (English)

Feature selection and conversion to FV

[Diagram: the call sequence NtOpenFile, NtWriteFile, NtWriteFile, NtOpenFile, NtWriteFile, NtWriteFile, ... (the diagram also shows NtCreateProcess) is split into overlapping 3-grams: NtOpenFileNtWriteFileNtWriteFile, NtWriteFileNtWriteFileNtOpenFile, NtWriteFileNtOpenFileNtWriteFile, ...]
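Splitting a call sequence into overlapping N-grams takes one sliding window. In this sketch the names are joined with "_", following the feature keys in the model dump later in the deck; the function name `api_ngrams` is hypothetical.

```python
def api_ngrams(calls, n=3):
    """Return overlapping n-grams of API names, joined with "_"."""
    return ["_".join(calls[i:i + n]) for i in range(len(calls) - n + 1)]

sequence = ["NtOpenFile", "NtWriteFile", "NtWriteFile",
            "NtOpenFile", "NtWriteFile", "NtWriteFile"]
print(api_ngrams(sequence))
```

A sequence shorter than n yields no features at all, so very short traces carry no signal under this scheme.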

Page 25: Fighting advanced malware using machine learning (English)

Feature selection and conversion to FV

[Diagram: the 3-grams of each sample are labeled and used to train Jubatus, e.g. NtCreateProcessNtCloseNtClose with label:benign and NtOpenFileNtWriteFileNtWriteFile with label:malware]

Page 26: Fighting advanced malware using machine learning (English)

(Image) Training internal

Calculate and update each FV's weight based on its frequency and label
(the details depend on the algorithm, called AROW; don't ask me :-)

[Diagram: example weights for the malware and benign labels (values shown: 0.75, 0.25, 0.5, 0.0) attached to 3-grams such as NtClose NtClose NtClose; NtWriteFile LdrLoadDll NtDelayExecution; NtCreateProcess NtClose NtClose; NtAllocateVirtualMemory ... ...]
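For readers who want a rough picture of the weight update, here is a sketch of binary AROW (Adaptive Regularization of Weight Vectors, Crammer, Kulesza and Dredze) with a diagonal covariance over sparse features. It is not Jubatus's implementation, which may differ (for example in how it handles multiple labels); the class name `DiagonalArow` and the parameter `r` are choices made for this example.

```python
class DiagonalArow:
    """Binary AROW with a diagonal covariance, over sparse feature dicts."""

    def __init__(self, r=1.0):
        self.r = r       # regularization parameter, r > 0
        self.mean = {}   # weight per feature (implicitly 0.0)
        self.var = {}    # variance per feature (implicitly 1.0)

    def score(self, x):
        """Positive means malware-like, negative means benign-like."""
        return sum(self.mean.get(f, 0.0) * v for f, v in x.items())

    def update(self, x, y):
        """Learn from one sample; y is +1 (malware) or -1 (benign)."""
        m = self.score(x)
        if y * m >= 1.0:
            return  # already classified with enough margin
        confidence = sum(self.var.get(f, 1.0) * v * v for f, v in x.items())
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - y * m) * beta
        for f, v in x.items():
            s = self.var.get(f, 1.0)
            self.mean[f] = self.mean.get(f, 0.0) + alpha * y * s * v
            self.var[f] = s - beta * s * s * v * v  # variance shrinks as evidence accumulates
```

Features seen often get a small variance and therefore change slowly, while rarely seen features still move quickly.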

Page 27: Fighting advanced malware using machine learning (English)

Testing results

• N of N-gram
– 3~5-gram > 2-gram = 6-gram
• Best: 3-gram
– TPR: 72.33% [71.58 ~ 73.07% (95% confidence)]
– FPR: 0.77% [0.60 ~ 0.99% (95% confidence)]
• The result above is an example
– Many combinations of features are available (we used only “API-name” and its sequence)

Page 28: Fighting advanced malware using machine learning (English)

Demo

Page 29: Fighting advanced malware using machine learning (English)

Future work

Page 30: Fighting advanced malware using machine learning (English)

Dumping training model

Investigating weight parameters of classifier

• http://blog.jubat.us/2013/06/classifier.html (Japanese only)

jubalocal_storage_dump.cpp
https://gist.github.com/t-abe/5746333

Page 31: Fighting advanced malware using machine learning (English)

Indicators of malware likeness in API 3-gram

[foo@nolife classifier]$ ./dump --input model --label "malware"

0.181128 api_call$VirtualProtectEx_VirtualProtectEx_VirtualProtectEx@space#log_tf/bin

0.142254 api_call$RegOpenKeyExA_NtOpenKey_NtOpenKey@space#log_tf/bin

0.137144 api_call$NtReadFile_NtReadFile_NtFreeVirtualMemory@space#log_tf/bin

0.134443 api_call$LdrLoadDll_LdrGetProcedureAddress_VirtualProtectEx@space#log_tf/bin

0.130287 api_call$LdrLoadDll_RegOpenKeyExA_NtOpenKey@space#log_tf/bin

0.130287 api_call$DeviceIoControl_LdrLoadDll_RegOpenKeyExA@space#log_tf/bin

0.122363 api_call$VirtualProtectEx_LdrLoadDll_LdrGetProcedureAddress@space#log_tf/bin

0.102545 api_call$NtFreeVirtualMemory_LdrGetDllHandle_NtCreateFile@space#log_tf/bin

0.102485 api_call$RegCloseKey_RegCloseKey_RegCloseKey@space#log_tf/bin

0.0983165 api_call$NtReadFile_NtFreeVirtualMemory_LdrLoadDll@space#log_tf/bin

0.0966545 api_call$NtSetInformationFile_NtReadFile_NtFreeVirtualMemory@space#log_tf/bin

0.094639 api_call$NtMapViewOfSection_NtFreeVirtualMemory_NtOpenKey@space#log_tf/bin

0.0933827 api_call$NtFreeVirtualMemory_LdrLoadDll_LdrGetProcedureAddress@space#log_tf/bin

0.0905402 api_call$DeviceIoControl_DeviceIoControl_NtWriteFile@space#log_tf/bin

0.0903766 api_call$DeviceIoControl_NtWriteFile_NtWriteFile@space#log_tf/bin

0.0884724 api_call$RegOpenKeyExW_RegOpenKeyExW_LdrGetDllHandle@space#log_tf/bin

0.0853282 api_call$LdrLoadDll_LdrLoadDll_LdrLoadDll@space#log_tf/bin

...
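Each dump line pairs a weight with a feature key, so it can be parsed back into the 3-gram it represents. The sketch below relies only on the key format visible in the output above (`api_call$<A>_<B>_<C>@space#log_tf/bin`) and assumes the API names themselves contain no underscore; `parse_dump_line` is a hypothetical helper, not part of the dump tool.

```python
def parse_dump_line(line):
    """Split one dump line into (weight, [API names of the n-gram])."""
    weight_text, key = line.split(None, 1)
    # Keep only the part between "$" and "@", which holds the joined API names.
    ngram = key.split("$", 1)[1].split("@", 1)[0]
    return float(weight_text), ngram.split("_")

print(parse_dump_line(
    "0.142254 api_call$RegOpenKeyExA_NtOpenKey_NtOpenKey@space#log_tf/bin"))
```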

Page 32: Fighting advanced malware using machine learning (English)

Indicators of goodware likeness in API 3-gram

[foo@nolife classifier]$ ./dump --input model --label "goodware"

0.268353 api_call$LdrGetDllHandle_LdrGetDllHandle_ExitProcess@space#log_tf/bin

0.268353 api_call$LdrGetDllHandle_ExitProcess_NtTerminateProcess@space#log_tf/bin

0.259838 api_call$NtWriteFile_LdrGetDllHandle_LdrGetDllHandle@space#log_tf/bin

0.25887 api_call$NtWriteFile_NtWriteFile_LdrGetDllHandle@space#log_tf/bin

0.135514 api_call$NtOpenFile_NtOpenFile_NtCreateFile@space#log_tf/bin

0.122445 api_call$DeviceIoControl_LdrLoadDll_LdrGetProcedureAddress@space#log_tf/bin

0.12242 api_call$DeviceIoControl_DeviceIoControl_LdrGetDllHandle@space#log_tf/bin

0.119231 api_call$GetSystemMetrics_LdrLoadDll_NtCreateMutant@space#log_tf/bin

0.115319 api_call$DeviceIoControl_LdrGetDllHandle_LdrGetProcedureAddress@space#log_tf/bin

0.109306 api_call$LdrGetProcedureAddress_NtOpenKey_LdrLoadDll@space#log_tf/bin

0.105579 api_call$NtReadFile_NtReadFile_NtReadFile@space#log_tf/bin

0.104565 api_call$NtCreateFile_NtCreateFile_NtWriteFile@space#log_tf/bin

0.103304 api_call$RegOpenKeyExA_LdrGetDllHandle_LdrGetProcedureAddress@space#log_tf/bin

0.10306 api_call$VirtualProtectEx_RegOpenKeyExA_LdrGetDllHandle@space#log_tf/bin

0.100701 api_call$NtFreeVirtualMemory_NtFreeVirtualMemory_GetSystemMetrics@space#log_tf/bin

...

Page 33: Fighting advanced malware using machine learning (English)

Computer vs. Man

• “VirtualProtectEx_VirtualProtectEx_VirtualProtectEx” looks like it is related to malware
• How about “RegOpenKeyExA_NtOpenKey_NtOpenKey”?
• Computers might recognize indicators which humans can’t (extremely strong left-brain player)
• Why don’t we cooperate with machines?

Page 34: Fighting advanced malware using machine learning (English)

Using computers

• Generating models using computers
• Checking them and guessing new logic by humans (using our right brain)
• ML-based detection is sometimes difficult to control
– Cannot specify strict conditions to detect
“It is detected because ML said so!”
• Hybrid of ML-based and logic-based would be powerful
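One way to picture the hybrid idea is to let explicit rules decide first and fall back to the ML score, returning a reason with every verdict. This is only a sketch of the concept: the function name, the score scale and the threshold are all hypothetical, and the slides do not describe how the two would actually be combined.

```python
def hybrid_verdict(ml_score, rule_hits, ml_threshold=0.5):
    """Combine an ML score with explicit rule matches and explain the result.

    ml_score: classifier score for the "malware" label (hypothetical scale).
    rule_hits: names of hand-written rules that matched.
    """
    if rule_hits:
        return "malware", "rule matched: " + ", ".join(rule_hits)
    if ml_score >= ml_threshold:
        return "malware", f"ML score {ml_score:.2f} >= {ml_threshold:.2f}"
    return "benign", "no rule matched and ML score below threshold"
```

The returned reason gives an analyst something more specific to review than "ML said so".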

Page 35: Fighting advanced malware using machine learning (English)

Applying to real time protection

• Using static information as features
– We can check a file before its execution
– The performance depends on the features
• Using dynamic information as features
– Malware is already executed
– Sometimes, detection would be too late
– The hybrid detection above might also be useful from this perspective

Page 36: Fighting advanced malware using machine learning (English)

Conclusion

• Traditional pattern matching has reached its limit
• Current behavioral detection hardly contributes to detection
• By applying ML to behavioral detection
– TPR is improved
– Computers recognize new features which humans can’t
– We should make use of them

Page 37: Fighting advanced malware using machine learning (English)

Thank you!

Contact: [email protected] Twitter: @FFRI_Research