Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs
Silvio Cesare and Yang Xiang
Deakin University
Introduction
• Malware is a significant problem.
• 499,811 new malware samples in second half of 2007.
• Static detection dominant technique in AV.
• Polymorphism breaks traditional AV signatures.
• Detecting variants improves signature detection.
• Detecting entire families of malware reduces signature database sizes.
• Current graph-based control flow signatures are effective, but inefficient.
Contributions
• Control flow graph based malware detection in real or close to real-time.
• Propose two features for graphs – k-subgraphs and q-grams of decompiled graphs.
• We place the features in a feature vector and use a distance metric to measure similarity.
• Propose more accurate distance function based on minimum matching distance.
Related Work
• String signatures
• Code normalization
• Opcode distributions
• N-grams
• Basic Blocks
• API Calls
• Data flow and control flow
• Call graphs
• Control Flow Graphs
Problem Statement
• Collect malware samples in honeypot.
• Compare queries to known malware.
• High similarity identifies variants.
• Known as the software similarity problem.
Our Approach
1. Unpack malware.
2. Build signature, known as birthmark, of control flow graphs.
3. Use distance metrics to compare birthmarks.
4. Use efficient feature vector as birthmark to prefilter possible candidates.
5. Refine using a distance metric based on the minimum matching distance.
[Figure: The software similarity problem. Program p and Program q are each reduced to a birthmark; the birthmarks are compared ("Similar?"), giving either "MATCH!" or "Different".]
Unpacking and Static Analysis
• Unpack using application level emulation.
• Disassemble program.
• Reconstruct control flow.
• Decompile control flow graphs to strings via “structuring”.
[Figure: A control flow graph, its structured form, and its string representation. The graph has nodes L_0 to L_7 with edges labelled "true".]

Structured form:
    proc() {
    L_0: while (v1 || v2) {
    L_1:   if (v3) {
    L_2:   } else {
    L_4:   }
    L_5: }
    L_7: return;
    }

String representation: W|IEH}R
The k-subgraph feature
• Fixed-size k-subgraphs (k nodes) in control flow graphs (CFGs) are a feature.
• Extract the depth-first spanning tree of each CFG.
• Extract k-subgraphs.
• Apply canonical graph labelling to the k-subgraphs.
• Use the adjacency matrix of the canonical k-subgraph, written as a string, as the feature.
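The steps above can be illustrated with a small sketch. It is not the authors' implementation: the CFG is a hypothetical dictionary of successor lists, `dfs_preorder` and `k_subgraph_features` are names invented here, and numbering nodes by depth-first visit order is a simplifying assumption that stands in for true canonical graph labelling.

```python
def dfs_preorder(cfg, start, limit):
    """Return up to `limit` nodes reachable from `start`, in DFS preorder."""
    order, seen, stack = [], set(), [start]
    while stack and len(order) < limit:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push successors reversed so the first successor is visited first.
        stack.extend(reversed(cfg.get(node, [])))
    return order


def k_subgraph_features(cfg, k):
    """Emit one adjacency-matrix bit string per k-node subgraph."""
    features = []
    for start in cfg:
        nodes = dfs_preorder(cfg, start, k)
        if len(nodes) < k:
            continue  # fewer than k reachable nodes: no k-subgraph here
        # Assumption: DFS visit order serves as the node numbering.
        index = {node: i for i, node in enumerate(nodes)}
        bits = ["0"] * (k * k)
        for node in nodes:
            for succ in cfg.get(node, []):
                if succ in index:
                    bits[index[node] * k + index[succ]] = "1"
        features.append("".join(bits))
    return features


# Hypothetical four-node CFG, not taken from the slides.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
print(k_subgraph_features(cfg, 3))
```

Each feature is a string of k*k bits read row by row, so two subgraphs with the same shape under this numbering produce the same string and can be counted as the same feature.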
The control flow q-gram feature
• An alternative approach.
• Decompile control flow graph to string.
• Extract all q-gram strings of length q.
• Q-gram string is a feature.
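As a minimal sketch of the q-gram extraction described above: the string "W|IEH}R" is the example from the slides, while the function names and the choice q = 4 are assumptions made for illustration.

```python
from collections import Counter


def qgrams(text, q):
    """Return every contiguous substring of length q, left to right."""
    if q <= 0:
        raise ValueError("q must be positive")
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def qgram_counts(cfg_strings, q):
    """Count q-grams over all control flow graph strings of one program."""
    counts = Counter()
    for text in cfg_strings:
        counts.update(qgrams(text, q))
    return counts


# q = 4 is an assumed value; the slides do not state which q is used.
print(qgrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than q simply contributes no q-grams.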
Prefiltering
• Feature selection by choosing 500 most frequent features from training set.
• Make a feature vector using one of the two approaches.
• Manhattan distance between feature vectors shows similarity and dissimilarity:
\[ d(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} \lvert p_i - q_i \rvert \]
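The prefiltering steps can be sketched as follows. This is an illustration under assumptions: each program is represented by a hypothetical `Counter` of feature counts, and the helper names are invented here. Only the limit of 500 features and the Manhattan distance come from the slides.

```python
from collections import Counter


def select_features(training_counts, limit=500):
    """Pick the `limit` most frequent features across a training set."""
    totals = Counter()
    for counts in training_counts:
        totals.update(counts)
    return [feature for feature, _ in totals.most_common(limit)]


def to_vector(counts, features):
    """Project one program's feature counts onto the selected features."""
    return [counts.get(feature, 0) for feature in features]


def manhattan(p, q):
    """Sum of absolute component differences between two vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical q-gram counts for two programs.
program_p = Counter({"W|IE": 2, "IEH}": 1})
program_q = Counter({"W|IE": 1, "EH}R": 3})
features = ["W|IE", "IEH}", "EH}R"]
print(manhattan(to_vector(program_p, features), to_vector(program_q, features)))  # 5
```

Because every program is projected onto the same fixed feature list, all vectors have the same length and can be compared or indexed directly.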
Indexing and searching the feature vectors
• Group feature vectors from database into buckets.
• Given query vector, find all similar feature vectors.
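• Use a metric vantage point tree for similarity search of query q in database D.
[Equation: the result set R contains each r in D whose similarity to q, computed from the distance d(r, q), satisfies the threshold t.]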
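A vantage point tree answers such range queries by using the triangle inequality to skip whole subtrees. The sketch below is a generic textbook-style version, not the Malwise code. It searches by a distance radius instead of a similarity threshold t, and all names in it are assumptions.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build(points, dist, rng=None):
    """Recursively build a vantage point tree over `points`."""
    points = list(points)
    if not points:
        return None
    rng = rng or random.Random(0)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0, None, None)
    distances = [dist(vantage, p) for p in points]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d <= radius]
    outside = [p for p, d in zip(points, distances) if d > radius]
    return VPNode(vantage, radius, build(inside, dist, rng), build(outside, dist, rng))


def range_search(node, query, max_dist, dist, found=None):
    """Collect every indexed point within `max_dist` of `query`."""
    if found is None:
        found = []
    if node is None:
        return found
    d = dist(node.point, query)
    if d <= max_dist:
        found.append(node.point)
    if d - max_dist <= node.radius:  # query ball can reach the inside region
        range_search(node.inside, query, max_dist, dist, found)
    if d + max_dist > node.radius:   # query ball can reach the outside region
        range_search(node.outside, query, max_dist, dist, found)
    return found


# Hypothetical two-dimensional feature vectors.
vectors = [(0, 0), (1, 0), (5, 5), (2, 2), (9, 1), (1, 1)]
tree = build(vectors, manhattan)
print(sorted(range_search(tree, (1, 1), 2, manhattan)))
```

The two pruning tests are only valid when `dist` is a metric, which is why the slides stress metric distance functions.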
A distance (dissimilarity) function using the linear sum assignment problem
• More accurate than using vectors, but slower.
• Program distance is between sets of decompiled control flow graphs.
• Make number of strings equal between sets.
• Make bijective mapping between string sets.
• Cost of mapping is string distance.
• Program distance is minimized sum cost.
• Solve as an assignment problem using the Hungarian algorithm.
The distance metric
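\[
M_1 = S(P_1), \qquad M_2 = S(P_2)
\]
M_1' and M_2' are M_1 and M_2 padded with dummy elements so that both sets have the same number of elements.
\[
C : M_1' \times M_2' \to \mathbb{R}, \qquad
C(a, b) =
\begin{cases}
ed(a, b) & \text{if } a \in M_1,\ b \in M_2 \\
|b| & \text{if } b \in M_2,\ a \notin M_1 \\
|a| & \text{if } a \in M_1,\ b \notin M_2
\end{cases}
\]
\[
d = \sum_{a \in M_1'} C(a, f(a))
\]
Find a bijection f : M_1' → M_2' such that the distance d is minimized.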
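A small sketch of this minimum matching distance follows. It makes several assumptions: programs are lists of control flow graph strings, the string distance is the Levenshtein edit distance, and a string matched to a padding element costs its own length. For brevity it tries every bijection by brute force, which is a stand-in for the Hungarian algorithm named in the slides and is only practical for a handful of strings.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def mapping_cost(x, y):
    """Cost of pairing x with y; None marks a padding element."""
    if x is None and y is None:
        return 0
    if x is None:
        return len(y)
    if y is None:
        return len(x)
    return edit_distance(x, y)


def program_distance(m1, m2):
    """Minimum total cost over all bijections between the padded sets."""
    size = max(len(m1), len(m2))
    left = list(m1) + [None] * (size - len(m1))
    right = list(m2) + [None] * (size - len(m2))
    # Brute force over every bijection: factorial time, illustration only.
    return min(
        sum(mapping_cost(x, right[j]) for x, j in zip(left, perm))
        for perm in permutations(range(size))
    )


print(program_distance(["WIE}R", "W|R"], ["WIE}R"]))  # 3
```

For realistic set sizes the same cost matrix would be handed to a polynomial-time solver; SciPy's `scipy.optimize.linear_sum_assignment` is one such routine.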
Similarity search of malware
• Distance is variant of minimum matching distance.
• Known to be metric.
• Use metric DBM Tree similarity search for query q on prefiltered candidates E.
• Match on any similar sample.
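[Equation: the query q is reported as a match if some candidate p in E has a similarity to q, computed from the distance d(p, q), that satisfies the threshold t.]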
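The filter-and-refine idea behind this search can be shown with a toy example. Everything in it is an assumption made for illustration: each sample is a single string, the cheap filter is the length difference, the exact refinement is the edit distance, and matching uses a distance radius instead of a similarity threshold t. The actual system prefilters with feature vectors and refines with the minimum matching distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def find_variants(query, database, radius):
    """Return database entries within `radius` edits of `query`."""
    # Filter: the length difference never exceeds the edit distance,
    # so this cheap test cannot discard a true match.
    candidates = [s for s in database if abs(len(s) - len(query)) <= radius]
    # Refine: run the expensive distance only on the surviving candidates.
    return [s for s in candidates if edit_distance(query, s) <= radius]


# Hypothetical database of control flow graph strings.
database = ["W|IEH}R", "WIEH}R", "WR", "W|IE}R|IEH}RW"]
matches = find_variants("W|IEH}R", database, radius=2)
print(matches)       # ['W|IEH}R', 'WIEH}R']
print(bool(matches)) # True: the query matches at least one known sample
```

The query is flagged as soon as any known sample survives both stages, which mirrors the "match on any similar sample" rule.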
Implementation
• Built on top of Malwise – a malware and static analysis framework developed over several years.
• Malwise is about 100,000 LOC of C++.
• Modules to implement this paper are about 3,000 LOC of C++.
[Figure: The Malwise malware classification system. Flowchart elements: Samples From Honeypots, Unknown Sample, Packed, Dynamic Analysis, Emulate, End of Unpacking?, Static Analysis, Classify, Malware Database, From Honeypot?, New Signature, Malicious, Non Malicious.]
Evaluation
• Q-grams produce fewer similarities (false positives) than k-subgraphs between 280 Windows system binaries.
• Database of 10,000 malware.
• Less than 1% false positives against 1601 benign binaries.
• Detected more variants than previous works using known malware families.
• Ran in close to real-time for expected case.
Effectiveness
Similarity  K-Subgraphs  Q-Grams
0.0         1302161      2334251
0.1         463170       413667
0.2         356345       40055
0.3         285202       7899
0.4         200326       3790
0.5         129790       327
0.6         46320        11
0.7         10784        0
0.8         5883         0
0.9         19           0
1.0         0            0
Malware Detection Rates
Classification Algorithm    Klez  Netsky  Roron  Frethem
Maximum                     36    49      81     289
Exact                       20    29      17     139
Heuristic Approximate       20    27      43     144
Q-Grams                     20    31      79     226
Optimal Distance            22    46      73     220
Q-Grams + Optimal Distance  20    43      73     217

False Positives with 10,000 Malware
Classification Algorithm    False Positives  FP Percentage
Q-Grams                     10               0.62
Q-Grams + Optimal Distance  7                0.43
Effectiveness on roron malware family
Exact Matching
      ao     b     d     e     g     k     m     q     a
ao     -  0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b   0.44     -  0.27  0.27  0.27  0.51  1.00  1.00  0.58
d   0.28  0.27     -  0.48  0.56  0.27  0.27  0.27  0.27
e   0.27  0.27  0.48     -  0.59  0.27  0.27  0.27  0.27
g   0.28  0.27  0.56  0.59     -  0.27  0.27  0.27  0.27
k   0.55  0.51  0.27  0.27  0.27     -  0.51  0.51  0.75
m   0.44  1.00  0.27  0.27  0.27  0.51     -  1.00  0.58
q   0.44  1.00  0.27  0.27  0.27  0.51  1.00     -  0.58
a   0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58     -

Heuristic Approximate Matching
      ao     b     d     e     g     k     m     q     a
ao     -  0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
b   0.74     -  0.31  0.34  0.33  0.82  1.00  1.00  0.87
d   0.28  0.29     -  0.50  0.74  0.29  0.29  0.29  0.29
e   0.31  0.34  0.50     -  0.64  0.32  0.34  0.34  0.33
g   0.27  0.33  0.74  0.64     -  0.29  0.33  0.33  0.30
k   0.75  0.82  0.29  0.30  0.29     -  0.82  0.82  0.96
m   0.74  1.00  0.31  0.34  0.33  0.82     -  1.00  0.87
q   0.74  1.00  0.31  0.34  0.33  0.82  1.00     -  0.87
a   0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87     -

Q-Grams
      ao     b     d     e     g     k     m     q     a
ao     -  0.86  0.53  0.64  0.59  0.86  0.86  0.86  0.86
b   0.88     -  0.66  0.76  0.71  0.97  1.00  1.00  0.97
d   0.65  0.72     -  0.88  0.93  0.73  0.72  0.72  0.73
e   0.72  0.80  0.87     -  0.93  0.80  0.80  0.80  0.80
g   0.69  0.77  0.93  0.93     -  0.77  0.77  0.77  0.77
k   0.88  0.97  0.67  0.77  0.72     -  0.97  0.97  0.99
m   0.88  1.00  0.66  0.76  0.71  0.97     -  1.00  0.97
q   0.88  1.00  0.66  0.76  0.71  0.97  1.00     -  0.97
a   0.87  0.97  0.67  0.77  0.72  0.99  0.97  0.97     -

Optimal Distance Using Assignment Problem
      ao     b     d     e     g     k     m     q     a
ao     -  0.86  0.49  0.54  0.50  0.87  0.86  0.86  0.86
b   0.87     -  0.57  0.63  0.62  0.96  1.00  1.00  0.96
d   0.61  0.64     -  0.85  0.91  0.64  0.64  0.64  0.64
e   0.64  0.69  0.85     -  0.90  0.68  0.69  0.69  0.68
g   0.62  0.68  0.91  0.91     -  0.68  0.68  0.68  0.68
k   0.88  0.96  0.58  0.62  0.61     -  0.96  0.96  0.99
m   0.87  1.00  0.57  0.63  0.62  0.96     -  1.00  0.96
q   0.87  1.00  0.57  0.63  0.62  0.96  1.00     -  0.96
a   0.87  0.96  0.58  0.62  0.61  0.99  0.96  0.96     -
Efficiency
Benign and Malicious Processing Time
• Median time for benign executables is 0.06s.
• Median time for malware is 0.84s.

% Samples  Benign Time (s)  Malware Time (s)
10         0.02             0.16
20         0.02             0.28
30         0.03             0.30
40         0.03             0.36
50         0.06             0.84
60         0.09             0.94
70         0.13             0.97
80         0.25             1.03
90         0.56             1.31
100        8.06             585.16
Conclusion
• Malware can be characterised by control flow.
• We built feature vectors and distance metric using k-subgraphs and q-grams.
• We proposed a more effective distance metric using the minimum matching distance.
• We proposed using similarity searches to filter and refine matching samples.
• We implemented and evaluated a system.
• It was effective and efficient.