
Page 1: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Efficient Memory Utilization on Network Processors

for Deep Packet Inspection

Piti Piyachon

Yan Luo

Electrical and Computer Engineering Department

University of Massachusetts Lowell

Page 2: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Our Contributions
• Study the parallelism of a pattern matching algorithm
• Propose the Bit-Byte Aho-Corasick Deterministic Finite Automaton (DFA)
• Construct a memory model to find the optimal settings that minimize the memory usage of the DFA

Page 3: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

DPI and Pattern Matching
• Deep Packet Inspection
  – Inspect: packet header & payload
  – Detect: computer viruses, worms, spam, etc.
  – Network intrusion detection applications: Bro, Snort, etc.
• Pattern Matching requirements
  1. Match multiple predefined patterns (keywords, or strings) at the same time.
  2. Keywords can be any size.
  3. Keywords can appear anywhere in the payload of a packet.
  4. Match at line speed.
  5. Flexibility to accommodate new rule sets.

Page 4: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Classical Aho-Corasick (AC) DFA: example 1
• A set of keywords: {he, her, him, his}
[State diagram: a goto trie over states 0-6, with state 0 as the start state and edges labeled h, e, r, i, m, s; the accept states mark matches for he, her, him, and his. Failure edges back to state 1 are shown as dashed lines; failure edges back to state 0 are not shown.]

Page 5: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Memory Matrix Model of AC DFA
• Snort (Dec '05): 2733 keywords
• 256 next-state pointers, each 15 bits wide
• > 27,000 states
• Keyword-ID width = 2733 bits
• 27538 x (2733 + 256 x 15) bits = 22 MB
• 22 MB is too big for on-chip RAM
[Memory matrix: one row per state (state# 0 to 27538); each row holds a 2733-bit keyword-ID field and 256 next-state pointers of 15 bits each.]

Page 6: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-AC DFA (Tan-Sherwood's Bit-Split)
• ASCII encoding of the keyword set:
  k0  h e    0110 1000  0110 0101
  k1  h e r  0110 1000  0110 0101  0111 0010
  k2  h i m  0110 1000  0110 1001  0110 1101
  k3  h i s  0110 1000  0110 1001  0111 0011
• Each input byte is split into its 8 bit positions, so 8 bit-DFA are needed.
[Figure: two of the bit-level DFAs. The Bit-1 DFA has partial-match bit strings k0: he = 0,0; k1: her = 0,0,1; k2: him = 0,0,0; k3: his = 0,0,1. The Bit-3 DFA has k0: he = 1,0; k1: her = 1,0,0; k2: him = 1,1,1; k3: his = 1,1,0. Failure edges are not shown.]
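
A sketch of the bit-split step (a simplified reconstruction, not the Tan-Sherwood implementation): each keyword and each input byte is reduced to a stream of single bits, one stream per bit position, and each stream gets its own tiny binary DFA (the build_ac sketch above works unchanged over the alphabet {0, 1}).

    def bit_stream(data: bytes, bit: int):
        """Extract bit position `bit` (0 = least significant) from every byte."""
        return tuple((b >> bit) & 1 for b in data)

    keywords = [b"he", b"her", b"him", b"his"]

    # One set of binary "keywords" per bit position; bits 1 and 3 reproduce the
    # partial-match strings of the Bit-1 and Bit-3 DFAs in the figure.
    for bit in (1, 3):
        print(bit, {kw.decode(): bit_stream(kw, bit) for kw in keywords})
    # bit 1: he=(0,0) her=(0,0,1) him=(0,0,0) his=(0,0,1)
    # bit 3: he=(1,0) her=(1,0,0) him=(1,1,1) his=(1,1,0)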

Page 7: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Memory Matrix of Bit-AC DFA
• Snort (Dec '05): 2733 keywords
• 2 next-state pointers, each 9 bits wide
• 361 states
• Keyword-ID width = 16 bits
• 1368 DFA
• 1368 x 361 x (16 + 2 x 9) bits = 2 MB
[Memory matrix: one row per state (state# 0 to 361); each row holds a 16-bit keyword-ID and 2 next-state pointers of 9 bits each.]
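
The two memory figures follow directly from the matrix dimensions; a quick back-of-the-envelope check using the numbers from these slides:

    MB = 8 * 1024 * 1024   # bits per megabyte

    # Classical AC: 27,538 states x (2733-bit keyword-ID + 256 pointers x 15 bits)
    classical = 27538 * (2733 + 256 * 15)
    # Bit-split AC: 1368 bit-DFAs x 361 states x (16-bit keyword-ID + 2 pointers x 9 bits)
    bit_split = 1368 * 361 * (16 + 2 * 9)

    print(round(classical / MB, 1), round(bit_split / MB, 1))   # ~21.6 MB and ~2.0 MB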

Page 8: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-AC DFA Techniques
• Shrink the width of the keyword-ID
  – From 2733 to 16 bits
  – By dividing the 2733 keywords into 171 subsets, each holding 16 keywords (see the sketch after this slide)
• Reduce the next-state pointers
  – From 256 to 2 pointers
  – By dividing each input byte into 1-bit slices
  – Needs 8 bit-DFA
• Extra benefits
  – The number of states (per DFA) drops from ~27,000 to ~300.
  – The width of a next-state pointer drops from 15 to 9 bits.
• Memory: reduced from 22 MB to 2 MB
• The number of DFA = ?
  – With 171 subsets, each subset has 8 DFA.
  – Total DFA = 171 x 8 = 1,368 DFA
• What can we do better to reduce the memory usage?
[Memory matrix as on the previous slide.]
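
The subset bookkeeping referred to above, as a small sketch (the keyword count is taken from the slides; the split itself is just grouping):

    import math

    K = 2733                      # keywords in the Snort (Dec '05) rule set
    k = 16                        # keywords per subset = keyword-ID width in bits
    bit_dfas_per_subset = 8       # one bit-DFA per bit position of a byte

    n_subsets = math.ceil(K / k)                  # 171 subsets
    total_dfas = n_subsets * bit_dfas_per_subset  # 1,368 DFA
    print(n_subsets, total_dfas)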

Page 9: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Classical AC DFA: example 2
• Keywords: k0 = elements, k1 = parallel, k2 = manage, k3 = memory
• 28 states
• Match found at: k0 (elements) state 8, k1 (parallel) state 16, k2 (manage) state 22, k3 (memory) state 27
[State diagram: one goto chain per keyword, spelled out character by character over states 0-27. Failure edges are not shown.]

Page 10: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Byte-AC DFA
• Keywords sliced 4 bytes at a time (byte0 byte1 byte2 byte3 | byte0 byte1 byte2 byte3):
  k0  elements  e l e m | e n t s
  k1  parallel  p a r a | l l e l
  k2  manage    m a n a | g e
  k3  memory    m e m o | r y
• Considering 4 bytes at a time
• 4 DFA (Byte 0 through Byte 3)
• < 9 states / DFA
• 256 next-state pointers!
• Similar to Dharmapurikar-Lockwood's JACK DFA, ANCS'05
[Figure: the four byte-position DFAs (Byte 0 to Byte 3), each marking the accept state of every keyword's slice (k0-k3); "(any)" edges cover the don't-care positions of the shorter keywords. Failure edges are not shown.]
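
The byte-interleaving step can be sketched as below (a simplified reconstruction that ignores keyword alignment within the B-byte window): slice i of a keyword collects the characters at positions i, i+B, i+2B, ..., and each slice feeds its own AC DFA.

    def byte_slices(keyword: str, B: int = 4):
        """Split a keyword into B interleaved slices (byte0 .. byte{B-1})."""
        return [keyword[i::B] for i in range(B)]

    for kw in ["elements", "parallel", "manage", "memory"]:
        print(kw, byte_slices(kw))
    # elements ['ee', 'ln', 'et', 'ms']
    # parallel ['pl', 'al', 're', 'al']
    # manage   ['mg', 'ae', 'n', 'a']
    # memory   ['mr', 'ey', 'm', 'o']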

Page 11: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-AC DFA
• Keywords sliced 4 bytes at a time, shown with their ASCII bit patterns:
  k0  elements  e l e m | e n t s  0110 0101  0110 1100  0110 0101  0110 1101 | 0110 0101  0110 1110  0111 0100  0111 0011
  k1  parallel  p a r a | l l e l  0111 0000  0110 0001  0111 0010  0110 0001 | 0110 1100  0110 1100  0110 0101  0110 1100
  k2  manage    m a n a | g e      0110 1101  0110 0001  0110 1110  0110 0001 | 0110 0111  0110 0101  xxxx xxxx  xxxx xxxx
  k3  memory    m e m o | r y      0110 1101  0110 0101  0110 1101  0110 1111 | 0111 0010  0111 1001  xxxx xxxx  xxxx xxxx
• 4 bytes at a time
• Each byte is divided into bits.
• 32 DFA (= 4 x 8)
• < 6 states/DFA
• 2 next-state pointers
[Figure: two of the 32 bit-byte DFAs. The (bit 3, Byte 0) DFA, states 0-5, has partial-match bit strings k0: elements = 0,0; k1: parallel = 0,1; k2: manage = 1,0; k3: memory = 1,0. The (bit 6, Byte 2) DFA has k0: elements = 1,1; k1: parallel = 1,1; k2: manage = 1,x; k3: memory = 1,x.]
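
Combining the two decompositions gives 4 x 8 = 32 bit streams per keyword; a minimal sketch under the same slicing assumption as above, with the shorter keywords' don't-care positions simply left off the end of their streams:

    def bit_byte_slices(keyword: str, B: int = 4):
        """One bit stream per (byte position, bit position) pair: B x 8 in total."""
        streams = {}
        for byte_pos in range(B):
            chars = keyword[byte_pos::B]       # the byte-position slice
            for bit in range(8):
                streams[(byte_pos, bit)] = tuple((ord(c) >> bit) & 1 for c in chars)
        return streams

    s = bit_byte_slices("elements")
    print(s[(0, 3)])   # (0, 0): bit 3 of slice 'ee', as in the (bit 3, Byte 0) DFA
    print(s[(2, 6)])   # (1, 1): bit 6 of slice 'et', as in the (bit 6, Byte 2) DFA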

Page 12: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Memory Matrix of Bit-Byte-AC DFA
• Snort (Dec '05): 2733 keywords
• 4 bytes at a time
• < 36 states/DFA
• 2 next-state pointers, each 6 bits wide
• Keyword-ID width = 3 bits
• 29152 DFA (= 911 x 32)
• 29152 x 36 x (3 + 2 x 6) bits = 1.9 MB
[Memory matrix: one row per state (state# 0 to 36); each row holds a 3-bit keyword-ID and 2 next-state pointers of 6 bits each.]
• 1.9 MB is only a little better than 2 MB, because this is not an optimal setting.
• Each DFA has a different number of states, so there is no need to provide the same size of memory matrix for every DFA.

Page 13: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-AC DFA Techniques
• Still keeps the width of the keyword-ID as low as the Bit-DFA.
• Still keeps the next-state pointers as few as the Bit-DFA.
• Reduces the states per DFA by
  – skipping bytes
  – exploiting more shared states than the Bit-DFA
• Results of reducing the states per DFA
  – from ~27,000 to 36 states
  – the width of a next-state pointer drops from 15 to 6 bits.

Page 14: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• 4 bytes (considered) at a time; start with bit 3 of byte 0.
[Keyword table as on the Bit-Byte-AC DFA slide, with only k0 (elements) filled in.]

Page 15: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• At (bit 3, Byte 0), k0 (elements) contributes the bit string 0,0. Its first bit, 0, adds state 1 (edge 0 -0-> 1).
[Partial (bit 3, Byte 0) DFA: states 0 and 1; keyword table with only k0 filled in.]

Page 16: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The second bit of k0, again 0, adds state 2 (edge 1 -0-> 2); state 2 accepts k0.
[Partial (bit 3, Byte 0) DFA: states 0-2; keyword table with only k0 filled in.]

Page 17: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• k1 (parallel) contributes the bit string 0,1 at (bit 3, Byte 0). Its first bit, 0, reuses the existing edge 0 -0-> 1.
[Partial (bit 3, Byte 0) DFA unchanged; keyword table now lists k0 and k1.]

Page 18: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The second bit of k1, 1, adds state 3 (edge 1 -1-> 3); state 3 accepts k1.
[Partial (bit 3, Byte 0) DFA: states 0-3; keyword table lists k0 and k1.]

Page 19: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• k2 (manage) contributes the bit string 1,0. Its first bit, 1, adds state 4 (edge 0 -1-> 4).
[Partial (bit 3, Byte 0) DFA: states 0-4; keyword table lists k0, k1, and k2.]

Page 20: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The second bit of k2, 0, adds state 5 (edge 4 -0-> 5); state 5 accepts k2.
[Partial (bit 3, Byte 0) DFA: states 0-5; keyword table lists k0, k1, and k2.]

Page 21: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• k3 (memory) also contributes the bit string 1,0. Its first bit reuses the existing edge 0 -1-> 4.
[Partial (bit 3, Byte 0) DFA unchanged; keyword table now lists all of k0-k3.]

Page 22: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The second bit of k3 reuses the edge 4 -0-> 5, so state 5 now accepts k3 as well.
[Complete (bit 3, Byte 0) DFA: states 0-5, with k0: elements = 0,0; k1: parallel = 0,1; k2: manage = 1,0; k3: memory = 1,0.]

Page 23: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
[Finished (bit 3, Byte 0) DFA: 6 states (0-5), with accept states for k0 (state 2), k1 (state 3), and k2/k3 (state 5). Failure edges are not shown.]

Page 24: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The (bit 6, Byte 2) DFA is built the same way; there k0: elements = 1,1; k1: parallel = 1,1; k2: manage = 1,x; k3: memory = 1,x.
[Figure: the (bit 6, Byte 2) DFA next to the keyword table.]

Page 25: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Construction of Bit-Byte AC DFA
• The (bit 3, Byte 0) and (bit 6, Byte 2) DFAs are two of the 32 bit-byte DFA that need to be constructed, one per (byte position, bit position) pair.
[Figure: both example DFAs with their partial-match bit strings, next to the keyword table.]
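
The construction slides above amount to one loop over the 32 (byte position, bit position) pairs. A hedged sketch, reusing the build_ac and bit_byte_slices helpers defined earlier in this document (the binary streams play the role of keywords, so the output sets hold bit tuples rather than the original strings):

    def build_bit_byte_dfas(keywords, B=4):
        """One binary AC DFA per (byte position, bit position) pair."""
        dfas = {}
        for byte_pos in range(B):
            for bit in range(8):
                patterns = [bit_byte_slices(kw, B)[(byte_pos, bit)] for kw in keywords]
                dfas[(byte_pos, bit)] = build_ac(patterns)
        return dfas

    dfas = build_bit_byte_dfas(["elements", "parallel", "manage", "memory"])
    print(len(dfas))                  # 32 bit-byte DFAs, as on the slide
    goto, fail, output = dfas[(0, 3)]
    print(len(goto))                  # 6 states for the (bit 3, Byte 0) DFA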

Page 26: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-DFA: Searching
• The input stream is consumed 4 bytes (byte0 byte1 byte2 byte3) at a time; each of the 32 bit-byte DFA tracks its own bit of its own byte position.
[Figure: the example bit-byte DFAs from the construction slides, with an empty input buffer.]

Page 27: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-DFA: Searching
• Input so far: a b 1 2 = 0110 0001 0110 0010 0011 0001 0011 0010. Each DFA consumes one bit per 4-byte chunk.
[Figure: the example bit-byte DFAs; a failure edge is shown as necessary.]

Page 28: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-DFA: Searching
• Input so far: a b 1 2 m e m o = 0110 0001 0110 0010 0011 0001 0011 0010 0110 1101 0110 0101 0110 1101 0110 1111.
[Figure: the example bit-byte DFAs advancing on the bits of the second 4-byte chunk.]

Page 29: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-DFA: Searching
• Input so far: a b 1 2 m e m o r y e f = 0110 0001 0110 0010 0011 0001 0011 0010 0110 1101 0110 0101 0110 1101 0110 1111 0111 0010 0111 1001 0110 0101 0110 0110.
[Figure: the example bit-byte DFAs; a failure edge is shown as necessary.]

Page 30: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Bit-Byte-DFA: Searching
• Match => keyword 'memory'.
• The match counts only when all 32 bit-DFA find it in their own bit streams!
[Figure: the example bit-byte DFAs on the input a b 1 2 m e m o r y e f.]
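
The matching rule on these searching slides can be sketched as a simple intersection: each bit-byte DFA reports the keyword-IDs whose accept state it has just reached, and a keyword counts as found only when every DFA reports it. (A conceptual reconstruction, not the microengine code; the keyword-ID sets below are hypothetical.)

    def match_all(partial_matches):
        """partial_matches: one set of keyword-IDs per bit-byte DFA (32 sets).
        Only keywords reported by every DFA are real matches."""
        found = set(partial_matches[0])
        for pm in partial_matches[1:]:
            found &= set(pm)
        return found

    # 31 DFAs report only 'memory'; one also reports 'manage' as a partial match:
    reports = [{"memory"}] * 31 + [{"memory", "manage"}]
    print(match_all(reports))   # {'memory'}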

Page 31: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Find the optimal settings to minimize memory
• k = keywords per subset
  – The width of the keyword-ID = k bits
  – k = 1, 2, 3, ..., K, where K = the number of keywords in the whole set
  – Snort (Dec. 2005): K = 2733 keywords
• b = bit(s) extracted for each byte
  – b = 1, 2, 4, 8
  – # of next-state pointers = 2^b
  – In example 2: b = 1
  – Beyond b = 8: > 256 next-state pointers
• B = bytes considered at a time
  – B = 1, 2, 3, ...
  – In example 2: B = 4
• Total memory (T) is a function of k, b, and B: T = f(k, b, B)
[Memory matrix as on the Bit-Byte-AC DFA slide.]

Page 32: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

T's Formula
Total memory of all bit-ACs in all subsets:

  T(k, b, B) = \sum_{j=1}^{N_{subset}} \sum_{i=1}^{N_{bitDFA}} \eta_{ij} \, (k + 2^b e_{ij}),
  where N_{subset} = \lceil K / k \rceil and N_{bitDFA} = (8 / b) \cdot B.

Definitions
  K          The number of all keywords in the whole rule set
  k          The number of keywords in each subset: 1, 2, 3, ..., K
  b          GroupedBit: the number of bits grouped to divide the 8 bits of a byte: 1, 2, 4, 8
  B          The number of bytes considered at a time: 1, 2, 3, ...
  2^b        The number of next-state pointers: 2, 4, 16, 256
  \eta_{ij}  The number of states in the i-th bit-level AC in the j-th subset
  e_{ij}     The number of bits used to encode states in the i-th bit-level AC in the j-th subset: e_{ij} = \lceil \log_2 \eta_{ij} \rceil
[Memory matrix: per state, a k-bit keyword-ID and 2^b next-state pointers of e bits each.]
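
A sketch of the memory model as code (assuming the per-DFA state counts \eta_{ij} are already known, e.g. from building the bit-byte DFAs for a given setting; the state_counts helper in the commented sweep is hypothetical):

    import math

    def total_memory_bits(k, b, B, states):
        """T(k, b, B): states[j][i] = number of states in the i-th bit-level AC of
        subset j, with ceil(K/k) subsets and (8 // b) * B bit-level ACs per subset."""
        total = 0
        for subset in states:                          # j = 1 .. N_subset
            for eta in subset:                         # i = 1 .. (8/b) * B
                e = max(1, math.ceil(math.log2(eta)))  # next-state pointer width e_ij
                total += eta * (k + (2 ** b) * e)      # k-bit keyword-ID + 2^b pointers
        return total

    # Brute-force sweep for the optimum (state_counts(k, b, B) would rebuild the DFAs):
    # best = min(((k, b, B) for k in range(1, 2734) for b in (1, 2, 4, 8)
    #             for B in (1, 2, 4, 8, 16, 32)),
    #            key=lambda s: total_memory_bits(*s, state_counts(*s)))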

Page 33: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Find the optimal k
• Each pair of (b, B) has one optimal k that gives a minimal T.
• For the Bit-Byte-AC DFA with b = 2 and B = 16, T_min is reached at k = 12.
[Chart: T (KB) versus k (keywords per subset) for B = 16, b = 2; T ranges roughly from 250 to 410 KB over k = 1 to 100, with T_min at k = 12.]

Page 34: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Find the optimal b
• Each setting of (k, b, B) has its own optimal point; only the optimal settings are compared.
• b = 2 is the best.
[Chart data, T in KB at each series' optimal k:]
          B=1    B=2    B=4    B=8    B=16   B=32
  b=1    406.2  355.7  314.9  295.2  295.2  341.6   (k = 3, 3, 4, 6, 9, 18)
  b=2    395.8  345.4  307.4  289.2  271.9  273.4   (k = 3, 3, 4, 5, 12, 33)
  b=4    773.5  672.9  598.0  562.8  502.6  424.0   (k = 3, 3, 4, 4, 18, 34)

Page 35: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Find the optimal B
• b = 2
• T decreases (non-linearly) as B increases.
• Beyond B = 16, T begins to increase again.
• B = 16 is the best for Snort (Dec '05).
[Chart data, T normalized to the base case B = 1 (395 KB), each point at its optimal k:]
   B      1      2      3      4      5      6      7      8      9     16     32
   T   100.00  87.26  82.38  77.65  77.51  74.53  75.56  73.06  72.17  68.69  69.06
   k      3      3      3      4      4      4      4      5      5     12     32

Page 36: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Comparing with Existing Works
• Tan-Sherwood's, Brodie-Cytron-Taylor's, and ours
• Our Bit-Byte DFA at B = 16
  – The optimal point is at b = 2 and k = 12
  – 272 KB
  – 14% of 2001 KB (Tan's)
  – 4% of 6064 KB (Brodie's)
[Chart data, T in KB for our Bit-Byte DFA versus B, against Tan's 2001.57 KB and Brodie's 6064.27 KB:]
   B      1       2       3       4       5       6       7       8       9       16
   T   395.84  345.41  326.10  307.39  306.83  295.02  299.11  289.19  285.67  271.91

Page 37: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Comparing with Existing Works
• Tan-Sherwood's and ours, at B = 1
• Tan's (on ASIC)
  – 2001 KB
  – k = 16 is not the optimal setting for B = 1.
  – Each bit-DFA uses the same storage capacity, sized to fit the largest one (worst case).
• Ours (on NP)
  – 396 KB < 2001 KB
  – k = 3 is the optimal setting for B = 1.
  – Each bit-DFA uses exactly the memory space needed to hold it.
[Chart as on the previous slide.]

Page 38: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Results with an NP Simulator
• NePSim2
  – An open-source IXP24xx/28xx simulator
• NP architecture based on the IXP2855
  – 16 MicroEngines (MEs)
  – 512 KB
  – 1.4 GHz
• Bit-Byte AC DFA: b = 2, B = 16, k = 12
  – T = 272 KB
  – 5 Gbps

Page 39: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection

Conclusion
• The Bit-Byte DFA model can reduce memory usage by up to 86%.
• Implementing it on an NP uses on-chip memory more efficiently, without wasting space, compared to an ASIC.
• An NP has the flexibility to accommodate
  – the optimal setting of k, b, and B,
  – different sizes of Bit-Byte DFA, and
  – new rule sets in the future (the optimal setting may change).
• The performance (measured with an NP simulator) satisfies line speed up to 5 Gbps throughput.

Page 40: Efficient Memory Utilization  on Network Processors  for Deep Packet Inspection


Thank you

Questions?

[email protected]

[email protected]