Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs

Silvio Cesare and Yang Xiang, Deakin University

Introduction

• Malware is a significant problem.

• 499,811 new malware samples in second half of 2007.

• Static detection dominant technique in AV.

• Polymorphism breaks traditional AV signatures.

• Detecting variants improves signature detection.

• Detecting entire families of malware reduces signature database sizes.

• Current graph-based control flow signatures are effective but inefficient.

Contributions

• Control flow graph based malware detection in real or close to real-time.

• Propose two features for graphs – k-subgraphs and q-grams of decompiled graphs.

• We place the features in a feature vector and use a distance metric to measure similarity.

• Propose more accurate distance function based on minimum matching distance.

Related Work

• String signatures

• Code normalization

• Opcode distributions

• N-grams

• Basic blocks

• API calls

• Data flow and control flow

• Call graphs

• Control flow graphs

Problem Statement

• Collect malware samples in honeypot.

• Compare queries to known malware.

• High similarity identifies variants.

• Known as the software similarity problem.

Our Approach

1. Unpack malware.

2. Build signature, known as birthmark, of control flow graphs.

3. Use distance metrics to compare birthmarks.

4. Use efficient feature vector as birthmark to prefilter possible candidates.

5. Refine using minimum matching distance based distance metric.

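[Figure: The software similarity problem. Program p and Program q are each reduced to a birthmark and the two birthmarks are compared. If they are similar the result is a match; otherwise the programs are reported as different.]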

Unpacking and Static Analysis

• Unpack using application level emulation.

• Disassemble program.

• Reconstruct control flow.

• Decompile control flow graphs to strings via “structuring”.

[Figure: A control flow graph, its structured form, and its string representation. The graph has nodes L_0 to L_7 joined by edges labelled "true".]

Structured form:

    proc() {
    L_0: while (v1 || v2) {
    L_1:     if (v3) {
    L_2:     } else {
    L_4:     }
    L_5: }
    L_7: return;
    }

String representation: W|IEH}R

The k-subgraph feature

• Fixed-size k-subgraphs (k nodes) of control flow graphs (CFGs) are a feature.

• Extract a depth-first spanning tree of each CFG.

• Extract k-subgraphs.

• Apply canonical graph labelling to the k-subgraphs.

• Use the adjacency matrix of each canonical k-subgraph, written as a string, as the feature.
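The canonical labelling step can be illustrated with a minimal sketch. It assumes a k-subgraph is already given as a node list and a directed edge list, and it finds a canonical form by brute force over all node orderings, which is only practical for small k. The function name `canonical_string` and the choice of the lexicographically smallest matrix string are assumptions for illustration, not the labelling used in the original system.

```python
from itertools import permutations

def canonical_string(nodes, edges):
    """Return the smallest adjacency-matrix string over all node orderings.

    Isomorphic subgraphs produce the same string, so the string can be
    used directly as a feature. Brute force: k! orderings are tried.
    """
    edge_set = set(edges)
    best = None
    for order in permutations(nodes):
        # Flatten the adjacency matrix row by row for this ordering.
        bits = "".join(
            "1" if (u, v) in edge_set else "0"
            for u in order
            for v in order
        )
        if best is None or bits < best:
            best = bits
    return best

# Two 3-node subgraphs with the same shape but different node names.
g1 = canonical_string(["a", "b", "c"], [("a", "b"), ("a", "c")])
g2 = canonical_string(["x", "y", "z"], [("z", "x"), ("z", "y")])
# A 3-node chain has a different shape.
g3 = canonical_string(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(g1 == g2, g1 == g3)  # prints: True False
```

Because the string depends only on the shape of the subgraph, two variants that relocate code but keep the same local control flow yield the same feature.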

The control flow q-gram feature

• An alternative approach.

• Decompile control flow graph to string.

• Extract all q-grams (substrings of length q) from the string.

• Each q-gram is a feature.
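As a small illustration of q-gram extraction, the sketch below slides a window of length q over the string form of a control flow graph. The example string is the one shown earlier in the slides; the value q = 4 and the helper name `qgrams` are assumptions chosen for the example.

```python
from collections import Counter

def qgrams(s, q):
    """Return every substring of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

cfg_string = "W|IEH}R"          # string form of a structured control flow graph
grams = qgrams(cfg_string, 4)   # q = 4 is an assumed value
features = Counter(grams)       # q-gram -> number of occurrences

print(grams)  # prints: ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than q yields no q-grams, so very small procedures contribute no features under this scheme.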

Prefiltering

• Feature selection by choosing 500 most frequent features from training set.

• Make a feature vector using one of the two approaches.

• Manhattan distance between feature vectors shows similarity and dissimilarity:

\[ d(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} \lvert p_i - q_i \rvert \]
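The prefiltering steps above can be sketched as follows. This is a minimal illustration, assuming that each program is represented by a list of feature strings and that a vector component counts how often a selected feature occurs; the slides do not say whether counts or presence flags are used. The helper names are hypothetical.

```python
from collections import Counter

def select_features(training_sets, limit=500):
    """Pick the `limit` most frequent features across the training set."""
    totals = Counter()
    for features in training_sets:
        totals.update(features)
    return [feature for feature, _ in totals.most_common(limit)]

def to_vector(features, selected):
    """Count how often each selected feature occurs in one program."""
    counts = Counter(features)
    return [counts[feature] for feature in selected]

def manhattan(p, q):
    """Sum of absolute component differences, as in the formula above."""
    return sum(abs(a - b) for a, b in zip(p, q))

training = [["ab", "bc", "ab", "bc"], ["ab", "cd"]]
selected = select_features(training, limit=2)    # ['ab', 'bc']
p = to_vector(["ab", "ab", "bc"], selected)      # [2, 1]
q = to_vector(["ab", "cd"], selected)            # [1, 0]
print(manhattan(p, q))                           # prints: 2
```

Fixing the vector length at selection time keeps every comparison at a constant cost regardless of program size.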

Indexing and searching the feature vectors

• Group feature vectors from database into buckets.

• Given query vector, find all similar feature vectors.

• Use metric Vantage point tree for similarity search of query q in database D.

• The search returns the set R of vectors r in D whose distance d(r, q) to the query satisfies the threshold t.
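A vantage point tree answers range queries in a metric space by using the triangle inequality to skip subtrees. The sketch below is a generic textbook version and not the Malwise implementation. It assumes a random vantage point, a median split radius, and a query given as a maximum distance `max_dist`, which stands in for whatever bound the threshold t implies.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

class VPTree:
    def __init__(self, points, dist, rng=None):
        self.dist = dist
        self.root = self._build(list(points), rng or random.Random(0))

    def _build(self, points, rng):
        if not points:
            return None
        vantage = points.pop(rng.randrange(len(points)))
        if not points:
            return (vantage, 0.0, None, None)
        # Split the remaining points at the median distance to the vantage point.
        dists = sorted(self.dist(vantage, p) for p in points)
        radius = dists[len(dists) // 2]
        inside = [p for p in points if self.dist(vantage, p) <= radius]
        outside = [p for p in points if self.dist(vantage, p) > radius]
        return (vantage, radius,
                self._build(inside, rng), self._build(outside, rng))

    def range_search(self, query, max_dist):
        """Return every stored point within max_dist of the query."""
        found = []
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vantage, radius, inside, outside = node
            d = self.dist(query, vantage)
            if d <= max_dist:
                found.append(vantage)
            # Triangle inequality: a match p satisfies
            # d - max_dist <= dist(vantage, p) <= d + max_dist.
            if d - max_dist <= radius:
                stack.append(inside)
            if d + max_dist > radius:
                stack.append(outside)
        return found

vectors = [(0, 0), (1, 0), (5, 5), (6, 5), (9, 9)]
tree = VPTree(vectors, manhattan)
print(sorted(tree.range_search((5, 4), 2)))  # prints: [(5, 5), (6, 5)]
```

The pruning is only valid because the Manhattan distance is a metric; a dissimilarity measure that breaks the triangle inequality could cause matches to be missed.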

A distance (dissimilarity) function using the linear sum assignment problem

• More accurate than using vectors, but slower.

• Program distance is between sets of decompiled control flow graphs.

• Make number of strings equal between sets.

• Make bijective mapping between string sets.

• Cost of mapping is string distance.

• Program distance is minimized sum cost.

• Solve as an assignment problem using the Hungarian algorithm.
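The assignment step can be sketched with a compact form of the Hungarian algorithm. This is a standard O(n^3) potentials-based formulation written for illustration, not code from Malwise. It assumes a square cost matrix, which matches the bullet above about making the two string sets equal in size; the function name `hungarian` is hypothetical.

```python
def hungarian(cost):
    """Minimum-cost perfect matching on a square cost matrix.

    Returns (total_cost, assignment) where assignment[row] = column.
    """
    n = len(cost)
    INF = float("inf")
    u = [0] * (n + 1)       # row potentials (1-based)
    v = [0] * (n + 1)       # column potentials (1-based)
    match = [0] * (n + 1)   # match[col] = row matched to col, 0 if free
    way = [0] * (n + 1)     # previous column on the augmenting path
    for row in range(1, n + 1):
        match[0] = row
        col = 0
        min_to = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[col] = True
            cur_row = match[col]
            delta = INF
            next_col = 0
            for j in range(1, n + 1):
                if used[j]:
                    continue
                reduced = cost[cur_row - 1][j - 1] - u[cur_row] - v[j]
                if reduced < min_to[j]:
                    min_to[j] = reduced
                    way[j] = col
                if min_to[j] < delta:
                    delta = min_to[j]
                    next_col = j
            # Shift potentials so the next cheapest edge becomes tight.
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    min_to[j] -= delta
            col = next_col
            if match[col] == 0:
                break
        # Flip the matching along the augmenting path.
        while col:
            prev = way[col]
            match[col] = match[prev]
            col = prev
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[match[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return total, assignment

total, assignment = hungarian([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
print(total, assignment)  # prints: 5 [1, 0, 2]
```

In the malware setting, entry (i, j) of the matrix would hold the string distance between the i-th control flow graph string of one program and the j-th of the other.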

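The distance metric

\[
\begin{aligned}
M_1 &= S(P_1), \qquad M_2 = S(P_2) \\
C &: M_1' \times M_2' \to \mathbb{R} \\
C(a, b) &=
\begin{cases}
ed(a, b), & \text{if } a \in M_1,\ b \in M_2 \\
|b|, & \text{if } a \notin M_1,\ b \in M_2 \\
|a|, & \text{if } a \in M_1,\ b \notin M_2
\end{cases} \\
d &= \sum_{a \in M_1'} C(a, f(a))
\end{aligned}
\]

Here M_1 and M_2 are the sets of strings for programs P_1 and P_2, M_1' and M_2' are those sets padded with dummy elements \varepsilon_j so that both contain the same number of elements, and ed is the string distance between two strings.

Find a bijection f : M_1' → M_2' such that the distance d is minimized.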
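The definition above can be made concrete with a small worked sketch. It assumes that the string distance is Levenshtein edit distance and that the shorter set is padded with dummy elements (represented here by `None`), so that matching a string against a dummy costs the string's length. For brevity the minimum over bijections is found by exhaustive search over permutations, which stands in for the Hungarian algorithm and is only feasible for very small sets.

```python
from itertools import permutations

EPSILON = None  # dummy element used to pad the smaller set

def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def cost(a, b):
    """C(a, b): edit distance, or the string length against a dummy."""
    if a is EPSILON:
        return len(b)
    if b is EPSILON:
        return len(a)
    return edit_distance(a, b)

def program_distance(m1, m2):
    """Minimum total cost over all bijections between the padded sets."""
    size = max(len(m1), len(m2))
    left = list(m1) + [EPSILON] * (size - len(m1))
    right = list(m2) + [EPSILON] * (size - len(m2))
    return min(
        sum(cost(a, b) for a, b in zip(left, perm))
        for perm in permutations(right)
    )

p_strings = ["W|IEH}R", "IEH}R"]
q_strings = ["W|IEH}R", "IER", "R"]
print(program_distance(p_strings, q_strings))  # prints: 3
```

In this example the identical strings pair at cost 0, "IEH}R" pairs with "IER" at cost 2, and the unmatched "R" pairs with a dummy at cost 1.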

Similarity search of malware

• Distance is variant of minimum matching distance.

• Known to be metric.

• Use metric DBM Tree similarity search for query q on prefiltered candidates E.

• Match on any similar sample.

• A sample p in the prefiltered set E matches the query q when the similarity derived from d(p, q) reaches the threshold t.

Implementation

• Built on top of Malwise – a malware and static analysis framework developed over several years.

• Malwise is about 100,000 LOC of C++.

• Modules to implement this paper are about 3,000 LOC of C++.

[Figure: The Malwise malware classification system. Flowchart labels: Samples From Honeypots, Unknown Sample, Packed?, Dynamic Analysis, Emulate, End of Unpacking?, Static Analysis, Classify, Malware Database, From Honeypot?, New Signature, Malicious, Non Malicious.]

Evaluation

• Q-grams produce fewer similarities (false positives) than k-subgraphs between 280 Windows system binaries.

• Database of 10,000 malware.

• Less than 1% false positives against 1601 benign binaries.

• Detected more variants than previous works using known malware families.

• Ran in close to real-time for expected case.

Effectiveness

False Positives

Similarity   K-Subgraphs   Q-Grams
0.0          1302161       2334251
0.1          463170        413667
0.2          356345        40055
0.3          285202        7899
0.4          200326        3790
0.5          129790        327
0.6          46320         11
0.7          10784         0
0.8          5883          0
0.9          19            0
1.0          0             0

Malware Detection Rates

Classification Algorithm      Klez   Netsky   Roron   Frethem
Maximum                       36     49       81      289
Exact                         20     29       17      139
Heuristic Approximate         20     27       43      144
Q-Grams                       20     31       79      226
Optimal Distance              22     46       73      220
Q-Grams + Optimal Distance    20     43       73      217

False Positives with 10,000 Malware

Classification Algorithm      False Positives   FP Percentage
Q-Grams                       10                0.62
Q-Grams + Optimal Distance    7                 0.43

Effectiveness on the Roron malware family

Exact Matching

      ao    b     d     e     g     k     m     q     a
ao    -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b     0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
d     0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
e     0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
g     0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
k     0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
m     0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
q     0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
a     0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -

Heuristic Approximate Matching

      ao    b     d     e     g     k     m     q     a
ao    -     0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
b     0.74  -     0.31  0.34  0.33  0.82  1.00  1.00  0.87
d     0.28  0.29  -     0.50  0.74  0.29  0.29  0.29  0.29
e     0.31  0.34  0.50  -     0.64  0.32  0.34  0.34  0.33
g     0.27  0.33  0.74  0.64  -     0.29  0.33  0.33  0.30
k     0.75  0.82  0.29  0.30  0.29  -     0.82  0.82  0.96
m     0.74  1.00  0.31  0.34  0.33  0.82  -     1.00  0.87
q     0.74  1.00  0.31  0.34  0.33  0.82  1.00  -     0.87
a     0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87  -

Q-Grams

      ao    b     d     e     g     k     m     q     a
ao    -     0.86  0.53  0.64  0.59  0.86  0.86  0.86  0.86
b     0.88  -     0.66  0.76  0.71  0.97  1.00  1.00  0.97
d     0.65  0.72  -     0.88  0.93  0.73  0.72  0.72  0.73
e     0.72  0.80  0.87  -     0.93  0.80  0.80  0.80  0.80
g     0.69  0.77  0.93  0.93  -     0.77  0.77  0.77  0.77
k     0.88  0.97  0.67  0.77  0.72  -     0.97  0.97  0.99
m     0.88  1.00  0.66  0.76  0.71  0.97  -     1.00  0.97
q     0.88  1.00  0.66  0.76  0.71  0.97  1.00  -     0.97
a     0.87  0.97  0.67  0.77  0.72  0.99  0.97  0.97  -

Optimal Distance Using Assignment Problem

      ao    b     d     e     g     k     m     q     a
ao    -     0.86  0.49  0.54  0.50  0.87  0.86  0.86  0.86
b     0.87  -     0.57  0.63  0.62  0.96  1.00  1.00  0.96
d     0.61  0.64  -     0.85  0.91  0.64  0.64  0.64  0.64
e     0.64  0.69  0.85  -     0.90  0.68  0.69  0.69  0.68
g     0.62  0.68  0.91  0.91  -     0.68  0.68  0.68  0.68
k     0.88  0.96  0.58  0.62  0.61  -     0.96  0.96  0.99
m     0.87  1.00  0.57  0.63  0.62  0.96  -     1.00  0.96
q     0.87  1.00  0.57  0.63  0.62  0.96  1.00  -     0.96
a     0.87  0.96  0.58  0.62  0.61  0.99  0.96  0.96  -

Efficiency

Benign and Malicious Processing Time

• Median time for benign executables is 0.06s.

• Median time for malware is 0.84s.

% Samples   Benign Time (s)   Malware Time (s)
10          0.02              0.16
20          0.02              0.28
30          0.03              0.30
40          0.03              0.36
50          0.06              0.84
60          0.09              0.94
70          0.13              0.97
80          0.25              1.03
90          0.56              1.31
100         8.06              585.16

Conclusion

• Malware can be characterised by control flow.

• We built feature vectors and distance metric using k-subgraphs and q-grams.

• We proposed a more effective distance metric using the minimum matching distance.

• We proposed using similarity searches to filter and refine matching samples.

• We implemented and evaluated a system.

• It was effective and efficient.
