Self-Learning Anti-Virus Scanner

Arun Lakhotia, Professor
Andrew Walenstein, Assistant Professor
University of Louisiana at Lafayette
www.cacs.louisiana.edu/labs/SRL

2008 AVAR (New Delhi)

Introduction


Director, Software Research Lab

Lab’s focus: Malware Analysis

Graduate level course on Malware Analysis

Six years of AV related research

Issues investigated:
• Metamorphism
• Obfuscation

Alumni in AV Industry

• Prabhat Singh, Nitin Jyoti, Aditya Kapoor, Rachit Kumar (McAfee AVERT)
• Erik Uday Kumar (Authentium)
• Moinuddin Mohammed (Microsoft)
• Prashant Pathak (ex-Symantec)

Funded by: Louisiana Governor’s IT Initiative

Outline


• Attack of Variants: the AV vulnerability is exact match
• Information Retrieval techniques: inexact match
• Adapting IR to AV: accounting for code permutation
• Vilo: a system using IR for AV
• Integrating Vilo into the AV infrastructure
• Self-Learning AV using Vilo

ATTACK OF VARIANTS


Variants vs Family

[Chart: Total Variants vs Total Family per half-year, 02-I through 07-I. Variant counts climb into the hundreds of thousands, while the number of families stays roughly flat (141, 184, 164, 171, 170, 104, 101).]

Source: Symantec Internet Threat Report, XI

Analysis of attacker strategy


Purpose of the attack of variants:
• Denial of service on the AV infrastructure
• Increase the odds of passing through

Weakness exploited: AV systems use exact match over an extract.

Attack strategy: generate just enough variation to beat exact match.

Attacker cost: the cost of generating and distributing variants.

Analyzing attacker cost


• Payload creation is expensive, so the payload must be reused
• Thousands of variants are needed, so generation must be automated
• "General" transformers are expensive, so attackers use specialized, limited transformers
• Hence packers/unpackers

Attacker vulnerability


Automated transformers have limited capability; their output is machine generated, so it must have regular patterns.

Exploiting the attacker's vulnerability: detect the patterns of similarity.

Approach:
• Information Retrieval (this presentation)
• Markov Analysis (other work)

Information Retrieval


IR Basics


The basis of Google and of bioinformatics: organizing a very large corpus of data.

Key idea: inexact match over the whole.
Contrast with AV: exact match over an extract.

IR Problem


[Diagram: an IR system over a document collection takes a query (keywords or a document) and returns related documents.]

IR Steps


Step 1: Convert documents to vectors
1a. Define a method to identify "features" (example: k consecutive words)
1b. Extract all features from all documents
1c. Count features, make a feature vector

Example documents: "Have you wondered when is a rose a rose?", "How about onions", "Onion smell stinks".

Features of the first document: "Have you wondered", "You wondered when", "Wondered when rose", "When rose rose". Counting the occurrences of each corpus feature gives the first document's vector [1, 1, 1, 1, 0, 0].
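Step 1 can be sketched in a few lines of Python. The documents and the choice k = 3 follow the slide's example, though the exact feature list this produces differs slightly from the slide's hand-drawn one:

```python
def k_grams(text, k=3):
    """Lowercase the text, split into words, and return every run of k consecutive words (step 1a)."""
    words = text.lower().replace("?", "").split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def feature_vector(doc, vocabulary, k=3):
    """Count how often each vocabulary feature occurs in the document (step 1c)."""
    grams = k_grams(doc, k)
    return [grams.count(f) for f in vocabulary]

docs = [
    "When is a rose a rose?",
    "How about onions",
    "Onion smell stinks",
]
# Vocabulary = every k-gram seen anywhere in the corpus (step 1b)
vocab = sorted({g for d in docs for g in k_grams(d)})
vectors = [feature_vector(d, vocab) for d in docs]
```

Each document becomes one row of counts over the shared vocabulary, ready for the weighting in Step 2.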

IR Steps


Step 2: Weight the feature vectors, taking the entire corpus into account.
Classical method: W = TF × IDF
• TF = Term Frequency (occurrences of the feature in the document)
• DF = number of documents containing the feature
• IDF = Inverse of DF

Worked example for document v1:

Feature               DF   IDF   TF(v1)   w1 = TF × IDF(v1)
You wondered when      5   1/5      1        1/5
Wondered when rose     7   1/7      2        2/7
When rose rose         8   1/8      5        5/8
How about onions       6   1/6      3        3/6
Onion smell stinks     3   1/3      0        0/3
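The weighting on this slide can be reproduced directly. Note the slide takes IDF as the plain reciprocal 1/DF; classical IR systems often use log(N/DF) instead, but the numbers below follow the slide:

```python
# Numbers from the slide's worked example: five features with document
# frequencies DF and, for document v1, term frequencies TF.
DF = [5, 7, 8, 6, 3]
TF_v1 = [1, 2, 5, 3, 0]

# The slide takes IDF as the plain reciprocal of DF.
IDF = [1 / df for df in DF]

# Classical weighting: w = TF x IDF
w1 = [tf * idf for tf, idf in zip(TF_v1, IDF)]
# w1 == [1/5, 2/7, 5/8, 3/6, 0/3]
```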

IR Steps


Step 3: Compare vectors using cosine similarity:

sim(w1, w2) = (w1 · w2) / (||w1|| ||w2||)

w1 = [0.33, 0.25, 0.66, 0.50]
w2 = [0.44, 0.63, 0.33, 0.00]

sim(w1, w2) = (0.33·0.44 + 0.25·0.63 + 0.66·0.33 + 0.50·0.00)
            / (sqrt(0.33² + 0.25² + 0.66² + 0.50²) · sqrt(0.44² + 0.63² + 0.33² + 0.00²))
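The cosine measure above is a few lines of Python; the vectors are the slide's example:

```python
import math

def cosine_similarity(w1, w2):
    """sim(w1, w2) = (w1 . w2) / (||w1|| * ||w2||)"""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)

# The slide's example vectors
w1 = [0.33, 0.25, 0.66, 0.50]
w2 = [0.44, 0.63, 0.33, 0.00]
sim = cosine_similarity(w1, w2)   # about 0.67
```

A similarity of 1.0 means the weighted feature profiles are identical up to scale; 0.0 means the documents share no weighted features.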

IR Steps


Step 4: Document ranking, using the similarity measure.

[Diagram: a new document is compared against the document collection; matching documents are returned ranked by similarity score, e.g. 0.90, 0.82, 0.76, 0.30.]
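Steps 3 and 4 combine into a small ranking function. The collection, names, and scores below are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank(query, collection):
    """Return (name, similarity) pairs for every document, best match first."""
    scored = ((name, cosine(query, vec)) for name, vec in collection.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative weighted vectors; names are made up.
collection = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 1.0],
}
ranking = rank([1.0, 0.0, 1.0], collection)   # doc_a first, doc_b last
```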

Adapting IR for AV


l2D2: push ecx
      push 4
      pop  ecx
      push ecx
l2D7: rol  edx, 8
      mov  dl, al
      and  dl, 3Fh
      shr  eax, 6
      loop l2D7
      pop  ecx
      call s319
      xchg eax, edx
      stosd
      xchg eax, edx
      inc  [ebp+v4]
      cmp  [ebp+v4], 12h
      jnz  short l305

l144: push ecx
      push 4
      pop  ecx
      push ecx
l149: mov  dl, al
      and  dl, 3Fh
      rol  edx, 8
      shr  ebx, 6
      loop l149
      pop  ecx
      call s52F
      xchg ebx, edx
      stosd
      xchg ebx, edx
      inc  [ebp+v4]
      cmp  [ebp+v4], 12h
      jnz  short l18


Step 0: Map the program to a document by extracting its sequence of operations:

push push pop push rol mov and shr loop pop call xchg stosd xchg inc cmp jnz
push push pop push mov and rol shr loop pop call xchg stosd xchg inc cmp jnz
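Step 0 can be sketched as a small function that reduces a disassembly listing to its mnemonic sequence, dropping labels and operands. The listing format handled here is the simple "label: mnemonic operands" layout shown above, not a general disassembler:

```python
def opcode_sequence(listing):
    """Return the sequence of mnemonics from a simple disassembly listing,
    stripping any leading "label:" prefix and all operands."""
    ops = []
    for line in listing.strip().splitlines():
        line = line.strip()
        if ":" in line:                  # strip a leading "label:" prefix
            line = line.split(":", 1)[1].strip()
        if line:
            ops.append(line.split()[0])  # first remaining token is the mnemonic
    return ops

fragment = """
l2D2: push ecx
      push 4
      pop  ecx
      push ecx
l2D7: rol  edx, 8
      mov  dl, al
"""
seq = opcode_sequence(fragment)   # ['push', 'push', 'pop', 'push', 'rol', 'mov']
```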


Step 1a: Defining features: k-perms

Virus 1: P P O P R M A S L O C X S X I C J
Virus 2: P P O P M A R S L O C X S X I C J

(Each letter abbreviates one operation; the two variants differ only in the order of the R M A block.)

Feature = permutation of k operations


Step 1 example: 3-perms

Virus 1: P P O P R M A S L O C X S X I C J
Virus 2: P P O P M A R S L O C X S X I C J
Virus 3: P P O P M A R S L P O P O C X S X I C J


Step 2: Construct feature vectors (4-perms)

Sequence 1: P O P R M A S L  (windows: POPR, OPRM, PRMA, RMAS, MASL)
Sequence 2: P O P M A R S L  (windows: POPM, OPMA, PMAR, MARS, ARSL)
Sequence 3: M A R S L P O P  (windows: MARS, ARSL, RSLP, SLPO, LPOP)

Because a 4-perm treats any ordering of the same four operations as one feature, PMAR counts as PRMA and MARS counts as RMAS; sequences 1 and 2 therefore share features despite the reordering.
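One natural way to realize the k-perm feature is to canonicalise each window by sorting it, so any permutation of the same k operations maps to a single feature; this sketch assumes that reading of "permutation of k operations":

```python
from collections import Counter

def k_perm_features(ops, k=4):
    """Slide a window of k operations over the sequence and canonicalise each
    window by sorting, so any permutation of the same k operations is one feature."""
    return Counter(tuple(sorted(ops[i:i + k])) for i in range(len(ops) - k + 1))

seq1 = list("POPRMASL")   # sequence 1
seq2 = list("POPMARSL")   # sequence 2: the R/M/A block reordered
f1, f2 = k_perm_features(seq1), k_perm_features(seq2)

# Exact 4-grams PRMA and PMAR differ, but as 4-perms they are the same
# feature, so the two sequences still share features despite the reordering.
shared = set(f1) & set(f2)
```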


Step 3: Compare vectors with cosine similarity (as before).

Step 4: Match the new sample against the collection.

Vilo: System using IR for AV


Vilo Functional View


[Diagram: Vilo compares a new sample against the malware collection and returns matches ranked by similarity, e.g. 0.90, 0.82, 0.76, 0.30.]

Vilo in Action: Query Match


Vilo: Performance


Response time vs database size: search on a generic desktop completes in seconds.

Contrast with:
• Behavior match: minutes
• Graph match: minutes

Vilo Match Accuracy


ROC Curve: True Positive vs False Positive


Vilo in AV Product


AV systems are composed of classifiers. Introduce Vilo as one more classifier.

[Diagram: an AV scanner built from several classifiers, with Vilo added alongside them.]

Self-Learning AV Product


How to get the malware collection?

Solution 1: collect the malware detected by the product itself.

[Diagram: the product's classifiers feed detected samples into Vilo's collection.]
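The self-learning loop of Solution 1 can be sketched as a scanner that remembers whatever its existing classifiers detect, then falls back to Vilo-style inexact matching for variants. All class and method names and the 0.8 threshold are illustrative, not the Vilo product's actual API:

```python
import math

def _cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class SelfLearningScanner:
    """Sketch of Solution 1: an inexact matcher sits beside the existing
    classifiers and learns from whatever they detect."""

    def __init__(self, classifiers, vilo_threshold=0.8):
        self.classifiers = classifiers   # existing exact-match detectors
        self.vilo_db = {}                # learned: sample name -> feature vector
        self.threshold = vilo_threshold

    def scan(self, name, vector):
        # Any existing classifier fires: detect AND feed the sample to Vilo.
        if any(c(vector) for c in self.classifiers):
            self.vilo_db[name] = vector
            return True
        # Otherwise fall back to inexact match over the learned samples.
        return any(_cosine(vector, v) >= self.threshold
                   for v in self.vilo_db.values())
```

Once one exact detector catches a sample, near-duplicates of it are caught by the learned inexact match even though no exact signature for them exists.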


How to get the malware collection?

Solution 2: collect and learn in the cloud.

[Diagram: the product's Vilo classifier connects to a Vilo instance in the Internet cloud.]

Learning in the Cloud


[Diagram: the scanner's classifiers, including Vilo, connect to a Vilo Learner in the Internet cloud, which collects samples and learns on the product's behalf.]

Experience with Vilo-Learning


Vilo-in-the-cloud holds promise:
• Can utilize a cluster of workstations, like Google
• Takes advantage of increasing bandwidth and compute power

Engineering issues to address, to control the growth of the database:
• Forget samples
• Use "signature" feature vector(s) for a family
• Be "selective" about the features to use

Summary


• Weakness of current AV systems: exact match over an extract, exploited by creating large numbers of variants
• Information Retrieval research strength: inexact match over the whole
• VILO demonstrates that IR techniques have promise
• Architecture of a Self-Learning AV system: integrate VILO into existing AV systems and create a feedback mechanism to drive learning
