Machine-Learning Assisted Binary Code Analysis N. Rosenblum, X. Zhu, B. Miller Computer Sciences Department University of Wisconsin - Madison {nater,jerryzhu,bart}@cs.wisc.edu K. Hunt National Security Agency


Machine-Learning Assisted Binary Code Analysis

N. Rosenblum, X. Zhu, B. Miller

Computer Sciences Department

University of Wisconsin - Madison
{nater,jerryzhu,bart}@cs.wisc.edu

K. Hunt

National Security Agency


Rosenblum, Zhu, Miller, Hunt 2 ML Assisted Binary Code Analysis

Supporting Static Binary Analysis

Binary Analysis is a Foundational Technique for Many Areas

Example Uses
• Malware detection
• Vulnerability analysis
• Static and dynamic instrumentation
• Formal verification

Why Analyze Binaries?
• Source code unavailable
– e.g., malware
• Source code is inaccurate
– Compiler transforms structure
• Provides most accurate representation

Code is found through symbol information and parsing

MUCH HARDER without symbols


Many Binaries are Stripped

Stripped binaries lack symbol & debug information

EXAMPLES:
• Malicious programs
• Operating system distributions
• Commercial software packages
• Legacy codes

Standard Approach: Parse from entry point

[Figure: layout of a BINARY: Headers, Code Segment (functions?), Data Segment]


Stripped Binaries Exhibit Gaps

After static parsing, gap regions remain

• Ambiguity from indirect (pointer-based) control flow
• Deliberate call/branch obfuscation
• Gaps in the code segment may not contain code

[Figure: code segment with unparsed gap regions]
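A minimal sketch of how parsing from the entry point leaves gaps, assuming a toy instruction representation. The `Insn` tuple, the instruction kinds, and the sample program are all hypothetical; this is not Dyninst's parser. Direct targets are followed, an indirect call's target is unknown, and whatever is never reached ends up in a gap.

```python
from collections import namedtuple

# Hypothetical toy instruction: address, byte length, kind, direct target (if any).
Insn = namedtuple("Insn", "addr size kind target")

def recursive_parse(insns, entry):
    """Follow direct control flow from `entry`; return the set of covered byte addresses."""
    by_addr = {i.addr: i for i in insns}
    covered, seen, worklist = set(), set(), [entry]
    while worklist:
        addr = worklist.pop()
        while addr in by_addr and addr not in seen:
            insn = by_addr[addr]
            seen.add(addr)
            covered.update(range(insn.addr, insn.addr + insn.size))
            if insn.kind in ("call", "jmp") and insn.target is not None:
                worklist.append(insn.target)   # direct target is statically known
            if insn.kind in ("ret", "jmp", "ijmp"):
                break                          # no fall-through
            addr = insn.addr + insn.size       # fall through (also after call/icall)
    return covered

def find_gaps(covered, start, end):
    """Return maximal [lo, hi) ranges in [start, end) that parsing never reached."""
    gaps, lo = [], None
    for a in range(start, end):
        if a not in covered and lo is None:
            lo = a
        elif a in covered and lo is not None:
            gaps.append((lo, a))
            lo = None
    if lo is not None:
        gaps.append((lo, end))
    return gaps

program = [
    Insn(0, 2, "mov", None),
    Insn(2, 5, "call", 20),
    Insn(7, 2, "icall", None),   # indirect call: target not known statically
    Insn(9, 1, "ret", None),
    Insn(10, 3, "mov", None),    # function reachable only through the indirect call
    Insn(13, 1, "ret", None),
    Insn(20, 1, "ret", None),
]
covered = recursive_parse(program, entry=0)
gaps = find_gaps(covered, 0, 21)
print(gaps)
```

In this toy program the function at address 10 is only called indirectly, so it lands in the same gap as the bytes after it, which may or may not be code.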


Stripped Binaries Exhibit Gaps

Gap contents may vary

String data
• Dialog constants
• Import names
• Other strings

[Figure: code segment gap containing string data]

.__gmon_start__.libc.so.6.stpcpy.strcpy.__divdi3.printf.stdout.strerror.memmove.getopt_long.re_syntax_options.__ctype_b.getenv.__strtol_internal.getpagesize.re_search_2.memcpy.puts.feof.malloc.optarg.btowc._obstack_newchunk.re_match.__ctype_toupper.__xstat64.abort.strrchr._obstack_begin.calloc.re_set_registers.fprintf.


Stripped Binaries Exhibit Gaps

Gap contents may vary

Tables or lists of addresses
• Jump tables
• Virtual function tables
• Data objects

[Figure: code segment gap containing a table of addresses]

0x8022346 0x802434b 0x80243ad 0x80403d0
0x80503d0 0x8052140 0x8053142 0x806000b
0x802321a 0x8023332 0x804132a 0x8050ca0


Stripped Binaries Exhibit Gaps

Gap contents may vary

Code unreachable through standard static parsing
• Function pointers
• Virtual methods
• Obfuscated calls

[Figure: code segment gap containing functions gap_funcA, gap_funcB, gap_funcC]


Stripped Binaries Exhibit Gaps

Gap contents may vary

But… all of these just look like bytes

[Figure: code segment gap shown as raw bytes]

7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e

Every byte in gaps may be the start of a function

How can we find code in gaps?

Previous work (Vigna et al., 2007) augments parsing with simple instruction frequency information

Our approach: Use information in known code to model code in gaps
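To make the "just bytes" point concrete, the short sketch below reads the byte run from this slide as ASCII at every offset. The helper `printable_run` is a hypothetical name introduced here for illustration. The same bytes that could be decoded as instructions also decode cleanly as a string from offset 6 onward, so byte values alone do not settle whether a gap holds code or data.

```python
# Bytes copied from the slide: six leading bytes followed by a 24-byte run.
raw = bytes.fromhex(
    "7a 01 00 fd a2 b3 "
    "74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e"
)

def printable_run(data, start):
    """Length of the run of printable ASCII bytes beginning at `start`."""
    n = 0
    while start + n < len(data) and 0x20 <= data[start + n] < 0x7F:
        n += 1
    return n

# Every offset is a candidate start; record how far an ASCII reading would get.
runs = {off: printable_run(raw, off) for off in range(len(raw))}
text = raw[6:].decode("ascii")
print(runs[0], runs[6], repr(text))
```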


Modeling Binary Code

Problem reduces to finding function entry points

Task: Classifying every byte in a gap as entry point or non-entry point

Two types of features:

• Content: Idiom features of function entry points
– Based on instruction sequences
• Structure: Control flow & conflict features
– Capture relationship of candidate function entry points
– Requires joint assignment over all function entry point candidates


Content-based Features

Entry idioms are common patterns at function entry points

Idioms are preceding and succeeding instruction sequences with wildcards

Entry idioms (shown for candidate C1):
push ebp
push ebp|mov esp,ebp
push ebp|*|sub esp
push ebp|*|mov esp,ebp
*|mov esp,ebp
*|sub 0x8,esp
*|mov 0x8(ebp),eax
PRE nop
PRE ret|nop
PRE pop ebp|*|nop

For each idiom u,

$$f_u(x_i, y_i, P) = \begin{cases} 1 & \text{if } x_i \text{ contains } u \\ 0 & \text{otherwise} \end{cases}$$
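The idiom feature can be sketched as a small matcher. This is an illustration under several assumptions not stated on the slide: instructions are plain strings, `*` matches exactly one instruction, tokens must match an instruction exactly, and `PRE` idioms are matched against the instructions immediately before the candidate. The function names are hypothetical.

```python
def parse_idiom(text):
    """Split e.g. 'PRE ret|nop' into (is_prefix, ['ret', 'nop'])."""
    is_prefix = text.startswith("PRE ")
    body = text[4:] if is_prefix else text
    return is_prefix, [tok.strip() for tok in body.split("|")]

def matches(tokens, insns):
    """True if every token equals the instruction at its position ('*' matches any one)."""
    if len(insns) < len(tokens):
        return False
    return all(t == "*" or t == i for t, i in zip(tokens, insns))

def f_u(idiom, before, after):
    """Unary idiom feature: 1 if the candidate's context contains the idiom, else 0.

    `after`  : instructions decoded starting at the candidate offset.
    `before` : instructions immediately preceding the candidate, in program order.
    """
    is_prefix, tokens = parse_idiom(idiom)
    if is_prefix:
        return int(matches(tokens, before[-len(tokens):]))
    return int(matches(tokens, after))

# Hypothetical candidate: the previous function ends in pop/ret/nop padding,
# and a conventional frame setup starts at the candidate offset.
before = ["pop ebp", "ret", "nop"]
after = ["push ebp", "mov esp,ebp", "sub 0x8,esp"]

IDIOMS = [
    "push ebp",
    "push ebp|mov esp,ebp",
    "push ebp|*|mov esp,ebp",
    "*|mov esp,ebp",
    "*|sub 0x8,esp",
    "PRE nop",
    "PRE ret|nop",
    "PRE pop ebp|*|nop",
]
features = {u: f_u(u, before, after) for u in IDIOMS}
for idiom, value in features.items():
    print(value, idiom)
```

Each candidate offset in a gap would get one such 0/1 vector; the weights attached to these features are what the logistic regression step learns.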


Call Consistency & Overlap

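Call & conflict features relate candidate FEPs over entire gap

$$f_o(x_i, x_j, y_i, y_j, P) = \begin{cases} 1 & y_i = y_j = 1,\ x_i \text{ and } x_j \text{ overlap} \\ 0 & \text{otherwise} \end{cases}$$

$$f_c(x_i, x_j, y_i, y_j, P) = \begin{cases} 1 & y_i = 1,\ y_j = -1,\ x_i \text{ calls } x_j \\ 0 & \text{otherwise} \end{cases}$$

[Figure: candidates C1 to C4 in a gap, labeled y_1 = 1, y_2 = 1, y_3 = -1, y_4 = 1, with the call feature f_c(x_2, x_4, y_2, y_4, P) and the overlap feature f_o(x_1, x_3, y_1, y_3, P) highlighted]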
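The two pairwise features can be written out directly. The sketch below assumes a hypothetical `Candidate` record holding the byte range a candidate's parse would cover and the addresses it calls directly, and it treats "overlap" as intersecting byte ranges, which is a simplification. The program argument P is omitted because everything the features need is carried by the candidates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate FEP: covered byte range [start, end) and direct call targets."""
    start: int
    end: int
    calls: frozenset = frozenset()

def f_o(xi, xj, yi, yj):
    """Overlap feature: fires when both candidates are labeled FEPs but their ranges intersect."""
    overlap = xi.start < xj.end and xj.start < xi.end
    return int(yi == 1 and yj == 1 and overlap)

def f_c(xi, xj, yi, yj):
    """Call-consistency feature: fires when an accepted FEP calls a rejected candidate."""
    return int(yi == 1 and yj == -1 and xj.start in xi.calls)

# Made-up layout: C1 and C3 overlap, and C2 calls C4.
c1 = Candidate(0, 10)
c3 = Candidate(4, 12)
c2 = Candidate(20, 30, frozenset({40}))
c4 = Candidate(40, 50)

print(f_o(c1, c3, 1, 1), f_o(c1, c3, 1, -1))   # conflict fires, then is resolved by rejecting C3
print(f_c(c2, c4, 1, -1), f_c(c2, c4, 1, 1))   # inconsistency fires, then is resolved by accepting C4
```

Because both features fire on undesirable labelings, giving them large negative weights pushes the joint assignment away from overlapping FEPs and from FEPs that call non-FEPs.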


Markov Random Field Formalization

• Joint assignment of y_i ∈ {1, -1} for each FEP x_i in binary P
• Unary idiom features f_u
– Weights λ_u trained through logistic regression
• Binary features f_o (overlap), f_c (call consistency)
– Weights λ_o, λ_c large, negative

$$P(y_{1:n} \mid x_{1:n}, P) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \sum_{u \in \{I\}} \lambda_u f_u(x_i, y_i, P) + \sum_{i,j=1}^{n} \sum_{b \in \{o,c\}} \lambda_b f_b(x_i, x_j, y_i, y_j, P) \right)$$
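A small sketch of what joint assignment means in practice: score every labeling of a four-candidate gap and keep the best one. All numbers are made up. The code assumes the idiom score of a candidate only counts when it is labeled 1, lists each related pair once rather than summing over ordered pairs, and uses brute-force enumeration, which only works for tiny gaps and is not claimed to be the inference method used in the work.

```python
from itertools import product

def log_score(y, unary, overlap_pairs, call_pairs, lam_o=-10.0, lam_c=-10.0):
    """Unnormalized log-probability of a labeling y (each y[i] is 1 or -1)."""
    # Unary term: summed idiom weight of every candidate labeled as an entry point.
    s = sum(unary[i] for i in range(len(y)) if y[i] == 1)
    # Overlap term: penalize two accepted candidates that overlap.
    s += lam_o * sum(1 for i, j in overlap_pairs if y[i] == 1 and y[j] == 1)
    # Call-consistency term: penalize an accepted candidate calling a rejected one.
    s += lam_c * sum(1 for i, j in call_pairs if y[i] == 1 and y[j] == -1)
    return s

def map_assignment(unary, overlap_pairs, call_pairs):
    """Exhaustively search all 2^n labelings for the highest-scoring one."""
    n = len(unary)
    return max(product((1, -1), repeat=n),
               key=lambda y: log_score(y, unary, overlap_pairs, call_pairs))

# Hypothetical gap with candidates C1..C4 (indices 0..3).
unary = [2.0, 1.5, 0.5, -0.2]   # made-up summed idiom weights per candidate
overlap_pairs = [(0, 2)]        # C1 and C3 overlap
call_pairs = [(1, 3)]           # C2 calls C4

best = map_assignment(unary, overlap_pairs, call_pairs)
print(best, log_score(best, unary, overlap_pairs, call_pairs))
```

Taken alone, C3's idiom score is positive and C4's is negative, so a per-candidate classifier would accept C3 and reject C4. The pairwise terms flip both decisions: C3 loses to the stronger overlapping C1, and C4 is accepted because the accepted C2 calls it.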


Experimental Setup

• Large set (hundreds) of binaries from department Linux servers and Windows workstations
• Additional binaries compiled with Intel compiler
• Binaries have full symbol information
• Model implemented as extensions to Dyninst instrumentation library

1. Strip binary copies and parse to obtain training set
2. Select top idiom features by forward feature selection
3. Perform logistic regression to build idiom model
4. Evaluate model on test data from gap regions in Step 1.

• Unstripped copies of binaries provide reference set


Idiom Feature Selection & Training

1. Obtain training data from traditional parse
– Statically reachable functions
– Corpus is hundreds of stripped binaries

2. Use Condor HTC to drive forward feature selection on idioms
– Features: Feat1, Feat2, Feat3, ..., Featk

3. Perform logistic regression on the selected idiom features to obtain model parameters t
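Steps 2 and 3 can be sketched with a toy greedy selector. This is a pure-Python illustration on made-up data, assuming 0/1 labels, training log-loss as the selection criterion, and plain batch gradient descent; none of these choices are taken from the slides. The candidate evaluations inside each round are independent of one another, which is the kind of work a high-throughput system can run in parallel.

```python
import math

def train_logreg(X, y, cols, steps=300, lr=0.5):
    """Fit weights (bias first) on the chosen feature columns by batch gradient descent."""
    w = [0.0] * (len(cols) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            z = sum(wi * fi for wi, fi in zip(w, feats))
            p = 1.0 / (1.0 + math.exp(-z))
            for k, fk in enumerate(feats):
                grad[k] += (p - label) * fk
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def log_loss(X, y, cols, w):
    """Mean negative log-likelihood of the labels under the fitted model."""
    total = 0.0
    for row, label in zip(X, y):
        z = w[0] + sum(wi * row[c] for wi, c in zip(w[1:], cols))
        p = min(max(1.0 / (1.0 + math.exp(-z)), 1e-12), 1 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(X)

def forward_select(X, y, k):
    """Greedily add the feature whose inclusion gives the lowest training log-loss."""
    chosen, remaining = [], list(range(len(X[0])))
    while remaining and len(chosen) < k:
        def loss_with(c):
            cols = chosen + [c]
            return log_loss(X, y, cols, train_logreg(X, y, cols))
        best = min(remaining, key=loss_with)   # each evaluation is independent
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Made-up data: rows are candidates, columns are idiom indicators, y is 1 for true FEPs.
# Column 0 tracks the label exactly, column 1 is noisy, column 2 is uninformative.
X = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0],
     [0, 1, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

first = forward_select(X, y, 1)
weights = train_logreg(X, y, first)
print(first, weights)
```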


Evaluation Data Sets

Compiler   Programs examined   Total training examples (pos+neg)   Total test examples (pos+neg)   Actual number of functions in gaps
GCC        625                 8,412,711                           22,806,449                      85,870
MS VS      443                 8,020,828                           11,231,721                      70,620
ICC        112                 1,364,598                           13,169,487                      47,841

• GNU C Compiler
– Simple, regular function preamble
• MS Visual Studio
– High variation in function entry points
• Intel C Compiler
– Most variation in entry points; highly optimized


Preliminary Results

Comparison of three binary analysis tools:

• Original Dyninst
– Scans for common entry preamble
• IDA Pro Disassembler
– Scans for common entry preamble
– List of Library Fingerprints (Windows)
• Dyninst w/ Model
– Model replaces entry preamble heuristic

False positives (FP) and false negatives (FN) per tool:

Compiler   Orig. Dyninst        IDA Pro              Dyninst w/ Model
           FP       FN          FP       FN          FP      FN
GCC        2,833    2,012       14,576   38,074      403     1,860
MS VS      79,320   65,586      9,044    21,491      725     14,143
ICC        3,786    40,195      14,422   26,970      2,337   16,220
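The FP/FN counts can be turned into precision and recall with a little arithmetic. The sketch below assumes that false negatives are counted against the "actual number of functions in gaps" column of the Evaluation Data Sets table, so that true positives = actual − FN. The slides do not state this explicitly, so the derived figures are an illustration rather than reported results.

```python
# Actual number of functions in gaps, from the Evaluation Data Sets table.
ACTUAL = {"GCC": 85_870, "MS VS": 70_620, "ICC": 47_841}

# (FP, FN) per tool and compiler, from the Preliminary Results table.
RESULTS = {
    "Orig. Dyninst":    {"GCC": (2_833, 2_012),   "MS VS": (79_320, 65_586), "ICC": (3_786, 40_195)},
    "IDA Pro":          {"GCC": (14_576, 38_074), "MS VS": (9_044, 21_491),  "ICC": (14_422, 26_970)},
    "Dyninst w/ Model": {"GCC": (403, 1_860),     "MS VS": (725, 14_143),    "ICC": (2_337, 16_220)},
}

def precision_recall(actual, fp, fn):
    """Assumes every missed function is one of the `actual` gap functions."""
    tp = actual - fn
    return tp / (tp + fp), tp / actual

table = {
    (tool, comp): precision_recall(ACTUAL[comp], fp, fn)
    for tool, rows in RESULTS.items()
    for comp, (fp, fn) in rows.items()
}

for (tool, comp), (p, r) in table.items():
    print(f"{tool:17s} {comp:6s} precision={p:.3f} recall={r:.3f}")
```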


Classifier Comparisons

[Figure: classifier comparison plots, one panel each for GCC, MSVS, and ICC]

Model-based Dyninst extensions outperform vanilla Dyninst and IDA Pro


Model Component Contributions

• Structural information improves classifier accuracy

• Conflict resolution contributes the most

[Figure: contribution of each model component on the ICC test set]


So Far We’ve…

• Framed stripped binary parsing as a machine learning problem

• Combined idiom and structural information to consider gap regions as a whole

• Extended Dyninst with classifier of Function Entry Points in gaps

• Obtained significant improvement in parsing stripped binaries over existing tools

• Shown how the HTC approach makes expensive ML techniques tractable for large scale systems


Future Work: Extensions

• We’d like a precision-recall AUC of 1. How?
– More detailed instruction sequence models (e.g. Hidden Markov Model)
– Additional information sources (e.g. pointer tables)
– Caveat: this is where IDA Pro often goes wrong

• Code provenance
– First task: identify source compiler (needed to choose appropriate model)


Future Work: Targets

• Malicious code
– Lots of hand-coded assembly
– Usually packed (see Kevin Roundy’s talk)

• Obfuscated code
– Obfuscation/deobfuscation arms race
– Signal-based obfuscation is latest salvo
– Cannot trust control flow (e.g. non-returning calls, branch functions, opaque branches)
– Maybe model block-level structural properties?


Backup Slides


Tool Performance Comparison

• Classifier maintains high precision with good recall

• Model performance highly system-dependent

– MS Visual Studio & Intel C Compiler FEPs are highly variable