Discoverer: Automatic Protocol Reverse Engineering from Network Traces

Weidong Cui

Jayanthkumar Kannan

Helen J. Wang

Microsoft Research

USENIX Security (Security '07)

Presented by Mike Hsiao, 2008-01-25

2

Outline

1. Introduction

2. Problem Statement: common protocol idioms and the scope of Discoverer

3. Design

4. Evaluation

5. Related Work*

6. Limitations and Future Work

7. Conclusion and Comment

3

Application-level protocol specifications: usage

Application-level protocol specifications are useful for many security applications.
– intrusion prevention and detection
– deep packet inspection
– protocol analyzer
– penetration testing: generates network inputs to an application to uncover potential vulnerabilities

Current practice is mostly manual.

Section 1

4

Discoverer

is a tool for automatically reverse engineering the protocol message formats of an application from its network trace

operates in a protocol-independent fashion
– by inferring protocol idioms commonly seen in the message formats of many application-level protocols

is evaluated on one text protocol (HTTP) and two binary protocols (RPC and CIFS/SMB)

Section 1

5

Application-level protocol specifications

Obtained from documentation or reverse engineered manually
Time-consuming and error-prone

“It took the open-source SAMBA project 12 years to manually reverse engineer the Microsoft SMB protocol.”

“Yahoo messenger protocol has also been persistently reverse engineered, despite which, the open source clients regularly require patching to support proprietary changes in the Yahoo protocol.”

– the period between the availability of an official client and an open-source client has been a month

Section 1

6

Automatically reverse engineer message formats

Challenges
– Very few hints from the network trace (byte streams)
– Protocols are significantly different from each other
– Protocol message formats are often context-sensitive: earlier fields dictate the parsing of the subsequent part of the message

The authors dissect the formless byte streams into text and binary segments or tokens
– as a starting point for clustering messages with similar patterns, where each cluster approximates a message format.

Section 1

7

Evaluation Metrics

Correctness
– does one inferred format correspond to exactly one true format?
Conciseness
– how many inferred formats is a single true format reflected in?
Coverage
– how many messages are covered by the inferred formats?

Section 1

8

Problem Statement: Common Protocol Idioms

Application session
– consists of a series of messages between two hosts that accomplishes a specific task.

Message format specification
– a sequence of fields and their semantics
– length
– offset (byte offset of another field)
– pointer (an offset that specifies the index of a field)
– cookie (session-specific opaque data, e.g., a session ID)
– endpoint-address (IP, port)
– set (a group of fields that can be put in an arbitrary order)

Section 2

9

Common Protocol Idioms: Format Distinguisher

Format Distinguisher (FD)
– It serves to differentiate the format of the subsequent part of the message.
– A message may have a sequence of FD fields, particularly when multiple protocols are encapsulated. E.g., an SMB message begins with a NetBIOS header.
– This implies that applications need to scan a message from left to right, decoding an FD field before parsing the subsequent part of the message.

Section 2

10

Scope of Discoverer

derive the message format specification
– not the protocol finite state machine

assume synchronous protocols
– a message is a consecutive chunk of application-level data sent in one direction
– an application session spans one or more TCP or UDP connections
– a UDP connection is a pair of unidirectional UDP flows

focus on applications that do not obfuscate payloads

do not capture timing semantics
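The message definition above can be sketched in a few lines. This is an illustration only, not the authors' code: the `Packet` tuple, the `"c2s"`/`"s2c"` direction labels, and the function name `split_messages` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (direction, payload), direction is "c2s" or "s2c".
Packet = Tuple[str, bytes]


def split_messages(packets: List[Packet]) -> List[Packet]:
    """Merge consecutive same-direction payloads of one connection into messages.

    Under the synchronous-protocol assumption, a change of direction marks
    a message boundary.
    """
    messages: List[Packet] = []
    for direction, payload in packets:
        if not payload:
            continue  # skip empty segments (e.g., bare ACKs)
        if messages and messages[-1][0] == direction:
            prev_dir, prev_data = messages[-1]
            messages[-1] = (prev_dir, prev_data + payload)
        else:
            messages.append((direction, payload))
    return messages
```

An asynchronous protocol would break this rule, which is why it is listed under limitations later in the deck.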

Section 2

11

Design: Overview

Cluster messages with the same format together and infer the message format by comparing messages in a single cluster

1. Tokenization and Initial Clustering

2. Recursive Clustering

3. Merging

Section 3

14

1-1 Tokenization (1/2)

Text
– Identify text bytes by comparing them with the ASCII values of printable characters
– Consider a sequence of text bytes sandwiched between two binary bytes as a text segment
– Require the sequence to have a minimum length
– Use a set of delimiters (e.g., space and tab) to divide a text segment into tokens

Section 3

15

1-1 Tokenization (2/2)

Binary
– The authors simply declare a single binary byte to be a binary token in its own right.
– Error 1: consecutive binary bytes with ASCII values of printable characters are wrongly marked as a text token
– Error 2: a text string shorter than the minimum length is wrongly marked as binary tokens
– Error 3: a text field consisting of some white space characters is wrongly divided into multiple text tokens
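A minimal sketch of the tokenization rules on the two slides above. The minimum length of 3, the delimiter set (space and tab), and the names `tokenize` and `is_printable` are assumptions for illustration; the slides do not give the actual values. Message boundaries are treated like binary bytes when delimiting a text segment.

```python
import re
from typing import List, Tuple

Token = Tuple[str, bytes]  # (class, value); class is "text" or "binary"

MIN_TEXT_LEN = 3                      # assumed threshold, not from the slides
_DELIMITERS = re.compile(rb"[ \t]+")  # assumed delimiter set: space and tab


def is_printable(byte: int) -> bool:
    """Printable ASCII, plus tab so that it can act as a delimiter."""
    return 0x20 <= byte <= 0x7E or byte == 0x09


def tokenize(message: bytes) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(message):
        # Find the end of the run of printable bytes starting at i.
        j = i
        while j < len(message) and is_printable(message[j]):
            j += 1
        if j - i >= MIN_TEXT_LEN:
            # Long enough: a text segment, split into tokens on delimiters.
            for word in _DELIMITERS.split(message[i:j]):
                if word:
                    tokens.append(("text", word))
            i = j
        else:
            # Too short (or not printable): each byte is its own binary token.
            end = max(j, i + 1)
            for byte in message[i:end]:
                tokens.append(("binary", bytes([byte])))
            i = end
    return tokens
```

The three error cases listed above all apply to this sketch: for example, the two-byte string `ab` in a message is emitted as two binary tokens (Error 2).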

Section 3

16

1-2 Initial Clustering by Token Patterns

The authors cluster messages based on their token patterns.
– The token pattern assigned to a message is a tuple: (dir, class of token 1, class of token 2, …)
– E.g., (client to server, text, binary, text)

Note that this initial clustering is coarse-grained since messages with different formats may have the same token pattern.
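Initial clustering amounts to grouping messages by a tuple key. The sketch below assumes messages are already tokenized; the `Message` type and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, bytes]          # (class, value)
Message = Tuple[str, List[Token]]  # (direction, tokens)
Pattern = Tuple[str, ...]


def token_pattern(message: Message) -> Pattern:
    """Return (dir, class of token 1, class of token 2, ...)."""
    direction, tokens = message
    return (direction,) + tuple(cls for cls, _ in tokens)


def initial_clusters(messages: List[Message]) -> Dict[Pattern, List[Message]]:
    """Group messages that share the same token pattern."""
    clusters: Dict[Pattern, List[Message]] = defaultdict(list)
    for message in messages:
        clusters[token_pattern(message)].append(message)
    return dict(clusters)
```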

Section 3

17

2 Recursive Clustering

The recursive clustering relies on identifying format distinguisher (FD) tokens

To find FD tokens, we need to invoke both format inference and format comparison

Section 3

18

2-1 Format Inference

This phase takes as input a set of messages and infers a format that succinctly captures the content of the set of messages.

Property Inference
– Token class is already identified during the tokenization phase.
– Constant or variable tokens can also be easily identified.
– Since the set of messages comes from a single token-pattern cluster, tokens in one message can be directly compared against their counterparts by simply using the token offset.
– Thus, constant tokens are those that take the same value across the entire set of messages, and variable tokens are those that take more than one value.
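Property inference as described above reduces to counting distinct values per token position. A small sketch, assuming every message in the cluster has the same number of tokens (which holds within a token-pattern cluster); the function name is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, bytes]  # (class, value)


def infer_properties(cluster: List[List[Token]]) -> List[Tuple[str, str]]:
    """Return (class, "constant" | "variable") for each token position."""
    properties = []
    for pos, (cls, _) in enumerate(cluster[0]):
        # Compare the token at this offset across all messages in the cluster.
        values = {message[pos][1] for message in cluster}
        properties.append((cls, "constant" if len(values) == 1 else "variable"))
    return properties
```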

Section 3

19

2-1 Format Inference

Semantic Inference
– length
  intuition: for a specific pair of messages, the difference in the values of potential length fields reflects the difference in the sizes of the messages
  potential length field: at most four consecutive binary tokens, or a text token in decimal or hex format
– offset
  compare the value difference with the difference of the offsets of some subsequent tokens
– cookie
  inference runs at the end of the merging phase, as in RolePlayer [3]
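The length intuition can be illustrated on raw bytes. This is a simplified sketch and not the authors' heuristic: it only tries binary candidates of 1 to 4 bytes (text tokens in decimal or hex are omitted), it assumes the candidate bytes are binary tokens, and the name `find_length_fields` is made up for the example. Requiring "value minus message size" to be the same for every message is equivalent to requiring that, for every pair, the value difference equals the size difference.

```python
from typing import List, Tuple


def find_length_fields(messages: List[bytes],
                       max_width: int = 4) -> List[Tuple[int, int, str]]:
    """Return (offset, width, byte order) of candidate length fields."""
    shortest = min(len(m) for m in messages)
    found = []
    for width in range(1, max_width + 1):
        for offset in range(0, shortest - width + 1):
            for order in ("big", "little"):
                if width == 1 and order == "little":
                    continue  # byte order is irrelevant for one byte
                values = [int.from_bytes(m[offset:offset + width], order)
                          for m in messages]
                if len(set(values)) < 2:
                    continue  # a constant field cannot be confirmed as a length
                # Same (value - size) everywhere <=> pairwise differences agree.
                deltas = {v - len(m) for v, m in zip(values, messages)}
                if len(deltas) == 1:
                    found.append((offset, width, order))
    return found
```

Note that a wider candidate that merely prepends constant bytes to the real field also passes this test, so a fuller implementation would need a rule for picking among overlapping candidates.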

Section 3

20

2-2 Format Comparison

Decide whether two inferred message formats are the same
– token by token
– from left to right

Ideally, two tokens can be considered to match if their semantics match.

Section 3

21

2-3 Recursive Clustering by Format Distinguishers

Three criteria to determine whether a token is an FD

1. The number of unique values taken by this token across the set of messages is less than a threshold.

2. (if the 1st criterion is satisfied) Divide the cluster into sub-clusters by unique token value; the size of the largest sub-cluster must exceed a threshold.
– this guarantees a meaningful format inference in at least one sub-cluster

3. (if the potential FD passes the 2nd criterion) Invoke format comparison across sub-clusters.
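The three criteria can be sketched as a single predicate. Both thresholds are made-up values, and criterion 3 is approximated here by comparing the constant/variable property vectors of the sub-clusters, which is a much cruder comparison than the one the slides describe. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, bytes]  # (class, value)
Message = List[Token]

MAX_UNIQUE_VALUES = 10      # assumed threshold for criterion 1
MIN_LARGEST_SUBCLUSTER = 3  # assumed threshold for criterion 2


def split_on_token(cluster: List[Message], pos: int) -> Dict[bytes, List[Message]]:
    """Divide a cluster into sub-clusters by the value of the token at pos."""
    subs: Dict[bytes, List[Message]] = defaultdict(list)
    for message in cluster:
        subs[message[pos][1]].append(message)
    return dict(subs)


def infer_format(cluster: List[Message]) -> Tuple[Tuple[str, str], ...]:
    """Simplified format: (class, constant/variable) per token position."""
    fmt = []
    for pos, (cls, _) in enumerate(cluster[0]):
        values = {m[pos][1] for m in cluster}
        fmt.append((cls, "constant" if len(values) == 1 else "variable"))
    return tuple(fmt)


def is_format_distinguisher(cluster: List[Message], pos: int) -> bool:
    subs = split_on_token(cluster, pos)
    # Criterion 1: few unique values (a single value cannot distinguish anything).
    if len(subs) < 2 or len(subs) >= MAX_UNIQUE_VALUES:
        return False
    # Criterion 2: the largest sub-cluster is big enough for format inference.
    if max(len(s) for s in subs.values()) <= MIN_LARGEST_SUBCLUSTER:
        return False
    # Criterion 3 (approximation): the sub-clusters have different formats.
    return len({infer_format(s) for s in subs.values()}) > 1
```

Because tiny sub-clusters look all-constant, this approximation over-reports FDs; it only shows the order in which the criteria are applied.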

Section 3

22

2-3 Recursive Clustering by Format Distinguishers

This process is recursively performed on each of the sub-clusters because a message may have more than one FD token.

They find the next FD token by scanning further down the message towards the right (end) of the message.

The format inference is invoked again on the set of messages in each sub-cluster.

– The inferred token properties and semantics might change because the set of messages has become smaller.
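The recursion itself is short once an FD test is available. In this sketch the FD test is passed in as a parameter (`is_fd`, a hypothetical callable), so any implementation of the three criteria can be plugged in; the scan resumes to the right of the FD that was just used.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, bytes]  # (class, value)
Message = List[Token]
FdTest = Callable[[List[Message], int], bool]


def recursive_cluster(cluster: List[Message], is_fd: FdTest,
                      start: int = 0) -> List[List[Message]]:
    """Split a cluster on FD tokens, scanning left to right from `start`."""
    for pos in range(start, len(cluster[0])):
        if is_fd(cluster, pos):
            groups: Dict[bytes, List[Message]] = defaultdict(list)
            for message in cluster:
                groups[message[pos][1]].append(message)
            leaves: List[List[Message]] = []
            for sub in groups.values():
                # The FD test is re-evaluated on each smaller sub-cluster,
                # so inferred properties may change at deeper levels.
                leaves.extend(recursive_cluster(sub, is_fd, pos + 1))
            return leaves
    return [cluster]  # no further FD found: this cluster is a leaf
```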

Section 3

23

3 Merging with Type-Based Sequence Alignment

The previous phases are conservative, to ensure that the format inference procedure operates correctly on a set of messages of the same format.
– this leads to a new problem of over-classification
– E.g., a trace of SMB with 4M messages can yield 7000 clusters (formats), while the total number of true formats is 130.

Section 3

24

3 Merging with Type-Based Sequence Alignment

Type-based sequence alignment
– It only allows two tokens of the same class (binary or text) to align with each other.
– The authors consider two aligned tokens matched if they either have the same semantics or share at least one value.
– Extra gap constraints
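The class constraint can be shown with a standard Needleman-Wunsch alignment in which aligning tokens of different classes is simply forbidden. This is a generic sketch, not the authors' algorithm: the scores are assumed values, and the matching rule (same semantics or shared value) and the extra gap constraints are left out.

```python
from typing import List, Optional, Tuple

ALIGN_SCORE = 2   # assumed reward for aligning two tokens of the same class
GAP_SCORE = -1    # assumed gap penalty
FORBIDDEN = float("-inf")

Pair = Tuple[Optional[int], Optional[int]]  # (index in a, index in b); None = gap


def align(a: List[str], b: List[str]) -> List[Pair]:
    """Globally align two sequences of token classes ("text" / "binary")."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP_SCORE
    for j in range(1, m + 1):
        score[0][j] = j * GAP_SCORE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Type-based constraint: only same-class tokens may align.
            diag = (score[i - 1][j - 1] + ALIGN_SCORE
                    if a[i - 1] == b[j - 1] else FORBIDDEN)
            score[i][j] = max(diag,
                              score[i - 1][j] + GAP_SCORE,
                              score[i][j - 1] + GAP_SCORE)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                and score[i][j] == score[i - 1][j - 1] + ALIGN_SCORE):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP_SCORE:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs
```

Two formats whose aligned tokens all match would then be candidates for merging, which is how the over-classification from the earlier phases gets reduced.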

Section 3

25

An Example: true message from Ethereal

Section 3

26

An Example: the final inferred format by Discoverer (1/2)

Section 3

27

An Example: the final inferred format by Discoverer (2/2)

Section 3

inferred format is a sequence of tokens with token properties (binary vs. text, constant vs. variable) and semantics (e.g., length fields).

28

Evaluation

5,700 lines of C++ code on Windows

The un-optimized implementation takes about 6-12 hours for a trace of several million messages

Data Sets
– a honeyfarm site (which responds to unsolicited, mostly malicious traffic); SMB only.
– a busy enterprise (which has diverse and high-volume traffic); HTTP, SMB, RPC.

Section 4

29

Evaluation Methodology

Correctness
– If a cluster contains messages from more than one true format, Discoverer will make an incorrect inference.
– For all three protocols, over 90% of clusters contain messages from a single true format.
Conciseness
– A large number of redundant formats hurts the conciseness of the generated protocol specifications.
– Measured as the ratio of the number of inferred formats to the number of true formats followed by their messages (5:1).
– Almost 80% of true formats are scattered into at most five clusters.

Section 4

30

Evaluation Methodology (cont’d)

Coverage
– message coverage: the fraction of messages covered by the inferred formats
– format coverage: the fraction of true formats followed by covered messages
– For all three protocols, the message coverage is above 95% while the format coverage is around 30-40%.
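The three metrics can be computed from per-message labels. The sketch assumes each message carries the id of the inferred format (cluster) that covers it, or `None` if uncovered, together with its true format name; this input layout and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Label = Tuple[Optional[int], str]  # (inferred format id or None, true format)


def evaluate(labels: List[Label]) -> Dict[str, float]:
    """Compute correctness, conciseness, and the two coverage figures."""
    true_formats_in: Dict[int, Set[str]] = defaultdict(set)
    for cluster_id, true_format in labels:
        if cluster_id is not None:
            true_formats_in[cluster_id].add(true_format)

    covered_true = set().union(*true_formats_in.values())
    all_true = {true_format for _, true_format in labels}
    covered_messages = sum(1 for cid, _ in labels if cid is not None)
    pure = sum(1 for formats in true_formats_in.values() if len(formats) == 1)

    return {
        # Fraction of clusters whose messages come from a single true format.
        "correctness": pure / len(true_formats_in),
        # Inferred formats per true format followed by covered messages.
        "conciseness": len(true_formats_in) / len(covered_true),
        "message_coverage": covered_messages / len(labels),
        "format_coverage": len(covered_true) / len(all_true),
    }
```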

Section 4

31

Tunable Parameters

Section 4

32

HTTP

Section 4

The HTTP protocol allows an arbitrary number of “parameter: value” pairs in an arbitrary order.

1. Most messages (more than 99%) fall in the top 1,000 true formats; RPC and CIFS/SMB show a similar trend.

2. they inferred 3,926 formats, which covered 5,938,511 out of 5,950,453 messages (99.8%).

3. The covered messages belong to 865 out of 2,696 true formats (32%).
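The percentages on this slide can be checked directly from the reported counts. The variable names are only for the example; the last line is the ratio of inferred formats to covered true formats for HTTP, derived from the same counts.

```python
inferred_formats = 3926
covered_messages = 5_938_511
total_messages = 5_950_453
covered_true_formats = 865
total_true_formats = 2696

message_coverage = covered_messages / total_messages
format_coverage = covered_true_formats / total_true_formats
inferred_per_true = inferred_formats / covered_true_formats

print(f"message coverage: {message_coverage:.1%}")    # 99.8%
print(f"format coverage:  {format_coverage:.0%}")     # 32%
print(f"inferred formats per covered true format: {inferred_per_true:.1f}")  # 4.5
```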

33

Limitations and Future Work

Trace Dependency
– message formats that never occur in the trace cannot be inferred
– certain variable fields never take more than one value in the trace
Pre-Defined Semantics
– Only a set of pre-defined semantics can be inferred.
Coalescing Fields
– Unlike text fields, binary fields may offer no clue for delimiting them.
– Only a few heuristics are available (e.g., does this byte vary as much as the other one?).

Section 6

34

Limitations and Future Work (cont’d)

Asynchronous Protocols
– messages in one direction may be interrupted by those in the other direction
– messages in one direction may be delayed, allowing two back-to-back messages in the other direction
Application Sessions
– Currently, Discoverer analyzes each connection in isolation.
State Machine Inference
– capture the sequences of messages in all sessions in the trace

Section 6

35

Conclusion and Comment

Discoverer is a tool that aims to automate the protocol reverse engineering process.

Protocol knowledge is very difficult to model automatically.
– So far, the authors only model field semantics (offset, length, …).
– How about the communication interaction (user intention, …)?
