Discoverer: Automatic Protocol Reverse Engineering from Network Traces. Weidong Cui, Jayanthkumar Kannan, Helen J. Wang (Microsoft Research). USENIX Security (Security '07). Presented by Mike Hsiao, 20080125.


Page 1: Discoverer :  Automatic Protocol Reverse Engineering from Network Traces

Discoverer: Automatic Protocol Reverse Engineering

from Network Traces

Weidong Cui

Jayanthkumar Kannan

Helen J. Wang

Microsoft Research

USENIX Security (Security ‘07)

Presented by Mike Hsiao, 20080125


Outline

1. Introduction

2. Problem Statement: common protocol idioms and the scope of Discoverer

3. Design

4. Evaluation

5. Related Work*

6. Limitations and Future Work

7. Conclusion and Comment


Application-level protocol specifications: usage

Application-level protocol specifications are useful for many security applications:
– intrusion prevention and detection
– deep packet inspection
– protocol analyzers
– penetration testing, which generates network inputs to an application to uncover potential vulnerabilities

Current practice is mostly manual.

Section 1


Discoverer

Discoverer is a tool for automatically reverse engineering the protocol message formats of an application from its network trace.

It operates in a protocol-independent fashion
– by inferring protocol idioms commonly seen in the message formats of many application-level protocols.

It is evaluated on one text protocol (HTTP) and two binary protocols (RPC and CIFS/SMB).


Application-level protocol specifications

Specifications are obtained from documentation or reverse engineered manually, which is time-consuming and error-prone.

“It took the open-source SAMBA project 12 years to manually reverse engineer the Microsoft SMB protocol.”

“Yahoo messenger protocol has also been persistently reverse engineered, despite which, the open source clients regularly require patching to support proprietary changes in the Yahoo protocol.”

– the period between the availability of an official client and an open-source client has been a month


Automatically reverse engineer message formats

Challenges
– Very few hints are available from the network trace (byte streams).
– Protocols differ significantly from each other.
– Protocol message formats are often context-sensitive: earlier fields dictate the parsing of the subsequent part of the message.

The authors dissect the formless byte streams into text and binary segments, or tokens, as a starting point for clustering messages with similar patterns, where each cluster approximates a message format.


Evaluation Metrics

Correctness
– does one inferred format correspond to exactly one true format?

Conciseness
– how many inferred formats is a single true format reflected in?

Coverage
– how many messages are covered by the inferred formats?


Problem Statement: Common Protocol Idioms

Application session
– consists of a series of messages between two hosts that accomplishes a specific task.

Message format specification
– a sequence of fields and their semantics:
  length
  offset (the byte offset of another field)
  pointer (an offset that specifies the index of a field)
  cookie (session-specific opaque data, e.g., a session ID)
  endpoint-address (IP, port)
  set (a group of fields that can be put in an arbitrary order)

Section 2


Common Protocol Idioms: Format Distinguisher

Format Distinguisher (FD)
– It serves to differentiate the format of the subsequent part of the message.
– A message may have a sequence of FD fields, particularly when multiple protocols are encapsulated. E.g., an SMB message is preceded by a NetBIOS header.
– This implies that applications need to scan a message from left to right, decoding an FD field before parsing the subsequent part of the message.
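The left-to-right decoding implied by FD fields can be sketched as follows. This is a minimal illustration on a made-up two-layer protocol; the byte layout and the name `parse_message` are hypothetical and are not taken from SMB, NetBIOS, or the paper.

```python
import struct

# Hypothetical toy protocol, invented for illustration only:
#   byte 0      outer FD: 0x01 = "session" frame, 0x02 = "keepalive" frame
#   session:    byte 1 is an inner FD (0x10 = read, 0x11 = write)
#   read body:  2-byte big-endian file id
#   write body: 2-byte big-endian file id, 1-byte length, payload
def parse_message(data: bytes) -> dict:
    """Parse left to right, decoding each FD before the bytes it governs."""
    outer = data[0]
    if outer == 0x02:
        return {"format": "keepalive"}
    if outer != 0x01:
        raise ValueError(f"unknown outer FD {outer:#x}")

    inner = data[1]
    if inner == 0x10:
        (file_id,) = struct.unpack_from(">H", data, 2)
        return {"format": "session/read", "file_id": file_id}
    if inner == 0x11:
        file_id, length = struct.unpack_from(">HB", data, 2)
        payload = data[5:5 + length]
        return {"format": "session/write", "file_id": file_id, "payload": payload}
    raise ValueError(f"unknown inner FD {inner:#x}")
```

The bytes after each FD cannot be interpreted until that FD has been read, which is why the parser never looks ahead.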


Scope of Discoverer

Derive the message format specification
– not the protocol finite state machine

Assume synchronous protocols
– A message is a consecutive chunk of application-level data sent in one direction.
– one or more TCP or UDP connections; a UDP connection is a pair of unidirectional UDP flows

Focus on applications that do not obfuscate payloads

Do not capture timing semantics
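Under the synchronous-protocol assumption, a message boundary is simply a change of direction. A minimal sketch, assuming the connection has already been reassembled into direction-tagged payload chunks (the function name `split_messages` and the direction labels are illustrative):

```python
from itertools import groupby

def split_messages(chunks):
    """Group consecutive same-direction chunks of one connection into messages.

    `chunks` is a list of (direction, payload_bytes) pairs in arrival order.
    Each maximal run of chunks in one direction becomes one message.
    """
    messages = []
    for direction, run in groupby(chunks, key=lambda c: c[0]):
        messages.append((direction, b"".join(payload for _, payload in run)))
    return messages
```

For example, two client-to-server chunks followed by a server reply yield two messages, with the two client chunks concatenated into one.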


Design: Overview

Cluster messages with the same format together and infer the message format by comparing messages in a single cluster

1. Tokenization and Initial Clustering

2. Recursive Clustering

3. Merging

Section 3


1-1 Tokenization (1/2)

Text
– Identify text bytes by comparing them with the ASCII values of printable characters.
– Consider a sequence of text bytes sandwiched between two binary bytes as a text segment.
– Require the sequence to have a minimum length.
– Use a set of delimiters (e.g., space and tab) to divide a text segment into tokens.


1-1 Tokenization (2/2)

Binary
– The authors simply declare a single binary byte to be a binary token in its own right.
– Error 1: consecutive binary bytes with the ASCII values of printable characters are wrongly marked as a text token.
– Error 2: a text string shorter than the minimum length is wrongly marked as binary tokens.
– Error 3: a text field containing white space characters is wrongly divided into multiple text tokens.
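The tokenization rules can be sketched as below. The minimum length, the exact set of "printable" bytes, and the delimiter set are assumptions chosen for illustration, not the values used by Discoverer; message start and end are also treated as segment boundaries.

```python
import re

MIN_TEXT_LEN = 3                               # assumed threshold
TEXT_RUN = re.compile(rb"[\x20-\x7e\t\r\n]+")  # assumed set of text bytes
DELIMS = re.compile(rb"[ \t\r\n]+")            # assumed delimiter set

def tokenize(msg: bytes):
    """Return a list of ("text", bytes) and ("binary", bytes) tokens."""
    tokens = []
    pos = 0
    for m in TEXT_RUN.finditer(msg):
        if m.end() - m.start() < MIN_TEXT_LEN:
            continue  # too short: its bytes are emitted as binary tokens later
        # Every byte before this text segment is a one-byte binary token.
        tokens += [("binary", msg[i:i + 1]) for i in range(pos, m.start())]
        # Split the text segment on delimiters; the delimiters are dropped.
        tokens += [("text", t) for t in DELIMS.split(m.group()) if t]
        pos = m.end()
    tokens += [("binary", msg[i:i + 1]) for i in range(pos, len(msg))]
    return tokens
```

The sketch reproduces the error cases listed above: `tokenize(b"\x01OK\x02")` yields four binary tokens because `OK` is shorter than the minimum length (Error 2), and the four binary bytes `b"\x41\x42\x43\x44"` come back as the single text token `ABCD` (Error 1).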


1-2 Initial Clustering by Token Patterns

The authors cluster messages based on their token patterns.
– The token pattern assigned to a message is a tuple: (dir, class of token 1, class of token 2, …)
– E.g., (client to server, text, binary, text)

Note that this initial clustering is coarse-grained since messages with different formats may have the same token pattern.
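Initial clustering amounts to grouping messages by their token-pattern tuple. A minimal sketch, assuming each message is already tokenized into `(class, value)` pairs; the function names and direction labels are illustrative:

```python
from collections import defaultdict

def token_pattern(direction, tokens):
    """Build the tuple (dir, class of token 1, class of token 2, ...)."""
    return (direction,) + tuple(cls for cls, _ in tokens)

def initial_clusters(messages):
    """Map each token pattern to the list of messages that share it.

    `messages` is a list of (direction, tokens) pairs, where tokens is a
    list of (class, value) pairs with class "text" or "binary".
    """
    clusters = defaultdict(list)
    for direction, tokens in messages:
        clusters[token_pattern(direction, tokens)].append(tokens)
    return dict(clusters)
```

Two messages with different true formats but the same sequence of token classes land in the same cluster, which is why this step is only a coarse first pass.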


2 Recursive Clustering

The recursive clustering relies on identifying format distinguisher (FD) tokens

To find FD tokens, we need to invoke both format inference and format comparison


2-1 Format Inference

This phase takes as input a set of messages and infers a format that succinctly captures the content of the set of messages.

Property Inference
– The token class is already identified during the tokenization phase.
– Constant and variable tokens can also be easily identified.
– Since the set of messages comes from a single token-pattern cluster, tokens in one message can be directly compared against their counterparts in other messages by simply using the token offset.
– Thus, constant tokens are those that take the same value across the entire set of messages, and variable tokens are those that take more than one value.
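Property inference over one cluster can be sketched as a column-wise comparison. The function name `infer_properties` and the dictionary layout of the result are assumptions for illustration:

```python
def infer_properties(cluster):
    """Infer per-position token properties for one token-pattern cluster.

    `cluster` is a list of messages, each a list of (class, value) tokens;
    all messages must share the same token pattern, so position k in one
    message corresponds to position k in every other message.
    """
    fmt = []
    for column in zip(*cluster):  # tokens at the same offset across messages
        classes = {cls for cls, _ in column}
        if len(classes) != 1:
            raise ValueError("cluster does not share one token pattern")
        values = {val for _, val in column}
        fmt.append({
            "class": classes.pop(),
            "kind": "constant" if len(values) == 1 else "variable",
            "values": values,
        })
    return fmt
```

Because the result depends only on the values present in the cluster, the same token can flip from variable to constant when the cluster is later split into smaller sub-clusters.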


2-1 Format Inference

Semantic Inference
– length
  Intuition: for a specific pair of messages, the difference in the values of a potential length field reflects the difference in the sizes of the messages.
  Potential length field: at most four consecutive binary tokens, or a text token in decimal or hex format.
– offset
  Compare the value difference with the difference of the offsets of some subsequent tokens.
– cookie
  Cookie inference operates at the end of the merging phase, following RolePlayer [3].
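The length-field intuition can be sketched for binary candidates as follows. The sketch assumes big-endian integers at a fixed byte offset and compares against the total message size; both are simplifying assumptions, and the function names are illustrative.

```python
def is_length_field(messages, offset, width):
    """True if the big-endian integer at [offset, offset + width) tracks size.

    For every pair of messages, the difference in field values must equal
    the difference in message sizes. That holds exactly when
    len(message) - value is the same for all messages.
    """
    vals = [int.from_bytes(m[offset:offset + width], "big") for m in messages]
    if len(set(vals)) < 2:
        return False  # a constant token carries no evidence either way
    return len({len(m) - v for m, v in zip(messages, vals)}) == 1

def find_length_fields(messages, max_width=4):
    """Try every span of at most `max_width` consecutive bytes."""
    shortest = min(len(m) for m in messages)
    return [(off, w)
            for off in range(shortest)
            for w in range(1, max_width + 1)
            if off + w <= shortest and is_length_field(messages, off, w)]
```

Several overlapping candidates can pass this check, for instance a wider span whose extra leading bytes are constant, so a fuller implementation would need a rule for choosing among them. The sketch does not attempt that.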


2-2 Format Comparison

Decide whether two inferred message formats are the same:
– token by token
– from left to right

Ideally, two tokens can be considered to match if their semantics match.


2-3 Recursive Clustering by Format Distinguishers

Three criteria determine whether a token is an FD:

1. The number of unique values taken by this token across the set of messages is less than a threshold.

2. (If the first criterion is satisfied) The cluster is divided into sub-clusters by unique token value, and the size of the largest sub-cluster must exceed a threshold. This guarantees a meaningful format inference in at least one sub-cluster.

3. (If the candidate FD passes the second check) Format comparison is invoked across the sub-clusters.


2-3 Recursive Clustering by Format Distinguishers

This process is recursively performed on each of the sub-clusters because a message may have more than one FD token.

They find the next FD token by scanning further down the message towards the right (end) of the message.

The format inference is invoked again on the set of messages in each sub-cluster.

– The inferred token properties and semantics might change because the set of messages has become smaller.
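The recursive splitting can be sketched as below. Only the first two FD criteria are implemented; the third (format comparison across sub-clusters) and the re-run of format inference are left out, and both threshold values are assumptions. Skipping tokens with a single value is also an assumption of this sketch, since such a token cannot split a cluster.

```python
from collections import defaultdict

MAX_UNIQUE_VALUES = 10           # assumed threshold for criterion 1
LARGEST_SUBCLUSTER_THRESHOLD = 2  # assumed threshold for criterion 2

def split_by_fd(cluster, start=0):
    """Recursively split a cluster of equal-length token-value tuples.

    Scans token positions left to right from `start`. The first position
    that passes criteria 1 and 2 is treated as an FD, and each resulting
    sub-cluster is searched for further FDs to its right.
    """
    width = len(cluster[0])
    for pos in range(start, width):
        groups = defaultdict(list)
        for msg in cluster:
            groups[msg[pos]].append(msg)
        if len(groups) < 2 or len(groups) >= MAX_UNIQUE_VALUES:
            continue  # constant token, or too many unique values
        if max(len(g) for g in groups.values()) <= LARGEST_SUBCLUSTER_THRESHOLD:
            continue  # no sub-cluster is large enough for format inference
        result = []
        for sub in groups.values():
            result.extend(split_by_fd(sub, pos + 1))
        return result
    return [cluster]
```

In a trace where the first token takes only the values `GET` and `PUT`, the cluster splits on that token, while a later token with many rare values does not trigger a further split.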


3 Merging with Type-Based Sequence Alignment

The previous phases are conservative, to ensure that the format inference procedure operates correctly on a set of messages of the same format.
– This leads to a new problem: over-classification.
– E.g., an SMB trace with 4M messages can produce 7,000 clusters (formats), while the total number of true formats is 130.


3 Merging with Type-Based Sequence Alignment

Type-based sequence alignment
– It only allows two tokens of the same class (binary or text) to align with each other.
– Two aligned tokens are considered matched if they either have the same semantics or share at least one value.
– Extra gap constraints apply.
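Type-based alignment can be sketched as a Needleman-Wunsch-style global alignment in which tokens of different classes are never allowed to pair up. The scoring values (`match=1`, `gap=-1`) and the function name `align` are assumptions for illustration; the paper's actual scoring, its matching rule, and its extra gap constraints are not reproduced here.

```python
def align(classes_a, classes_b, match=1, gap=-1):
    """Globally align two token-class sequences (e.g. "text" / "binary").

    Two tokens may align only if their classes are equal; otherwise one of
    them must face a gap. Returns a list of (index_in_a, index_in_b) pairs,
    where None marks a gap.
    """
    n, m = len(classes_a), len(classes_b)
    neg_inf = float("-inf")
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = classes_a[i - 1] == classes_b[j - 1]
            diag = score[i - 1][j - 1] + match if same else neg_inf
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and classes_a[i - 1] == classes_b[j - 1]
        if same and score[i][j] == score[i - 1][j - 1] + match:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]
```

Aligned positions would then be checked with the matching rule above (same semantics, or at least one shared value) before two formats are merged.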


An Example: true message from Ethereal


An Example: the final inferred format by Discoverer (1/2)


An Example: the final inferred format by Discoverer (2/2)

The inferred format is a sequence of tokens with token properties (binary vs. text, constant vs. variable) and semantics (e.g., length fields).


Evaluation

5,700 lines of C++ code on Windows

The unoptimized implementation takes about 6-12 hours for a trace of several million messages.

Data Sets
– a honeyfarm site (which responds to unsolicited, mostly malicious traffic); SMB only.
– a busy enterprise (which has diverse and high-volume traffic); HTTP, SMB, RPC.

Section 4


Evaluation Methodology

Correctness
– If a cluster contains messages from more than one true format, then Discoverer will make an incorrect inference.
– For all three protocols, over 90% of clusters contain messages from a single true format.

Conciseness
– A large number of redundant formats hurts the conciseness of the generated protocol specifications.
– Measured as the ratio of the number of inferred formats to the number of true formats followed by their messages (5:1).
– Almost 80% of true formats are scattered across at most five clusters.


Evaluation Methodology (cont’d)

Coverage
– message coverage: the fraction of messages covered by the inferred formats
– format coverage: the fraction of true formats followed by covered messages
– For all three protocols, the message coverage is above 95%, while the format coverage is around 30-40%.
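The metrics can be computed from per-message labels roughly as follows. The input layout, the function name `evaluate`, and the use of `None` for an uncovered message are assumptions of this sketch; it also assumes at least one message is covered.

```python
from collections import defaultdict

def evaluate(assignments, total_true_formats):
    """Compute correctness, conciseness, and both coverage figures.

    `assignments` holds one (cluster_id, true_format) pair per message;
    cluster_id is None when no inferred format covers the message.
    """
    formats_in_cluster = defaultdict(set)
    clusters_of_format = defaultdict(set)
    covered = 0
    for cluster_id, true_fmt in assignments:
        if cluster_id is None:
            continue
        covered += 1
        formats_in_cluster[cluster_id].add(true_fmt)
        clusters_of_format[true_fmt].add(cluster_id)

    pure = sum(1 for fmts in formats_in_cluster.values() if len(fmts) == 1)
    return {
        # fraction of clusters holding messages of a single true format
        "correctness": pure / len(formats_in_cluster),
        # inferred formats per true format followed by covered messages
        "conciseness": len(formats_in_cluster) / len(clusters_of_format),
        "message_coverage": covered / len(assignments),
        "format_coverage": len(clusters_of_format) / total_true_formats,
    }
```

Message coverage can be high while format coverage stays low when a few true formats account for most of the traffic.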


Tunable Parameters


HTTP

The HTTP protocol allows an arbitrary number of “parameter: value” pairs in an arbitrary order.

1. Most messages (more than 99%) fall in the top 1000 true formats. A similar trend holds for RPC and CIFS/SMB.

2. Discoverer inferred 3,926 formats, which covered 5,938,511 out of 5,950,453 messages (99.8%).

3. The covered messages belong to 865 out of 2,696 true formats (32%).
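The two percentages follow directly from the counts on this slide. A short check, with variable names chosen for illustration:

```python
covered_messages, total_messages = 5_938_511, 5_950_453
covered_true_formats, total_true_formats = 865, 2_696

message_coverage = covered_messages / total_messages
format_coverage = covered_true_formats / total_true_formats

print(f"message coverage: {message_coverage:.1%}")  # 99.8%
print(f"format coverage:  {format_coverage:.0%}")   # 32%
```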


Limitations and Future Work

Trace Dependency
– Message formats that never occur in the trace cannot be inferred.
– Certain variable fields may never take more than one value in the trace.

Pre-Defined Semantics
– Only a set of pre-defined semantics can be inferred.

Coalescing Fields
– Unlike text fields, binary fields may offer no clue for delimiting them.
– Only a few approaches are available (e.g., does this byte vary as much as the other one?).

Section 6


Limitations and Future Work (cont’d)

Asynchronous Protocols
– Messages in one direction may be interrupted by those in the other direction.
– Messages in one direction may be delayed, allowing two back-to-back messages in the other direction.

Application Sessions
– Currently, Discoverer analyzes each connection in isolation.

State Machine Inference
– captures the sequences of messages in all sessions in the trace


Conclusion and Comment

Discoverer is a tool that aims to automate the protocol reverse engineering process.

Protocol knowledge is very difficult to model automatically.
– So far, the authors model only field semantics (offset, length, …).
– What about the communication interaction (user intention, …)?