Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms
Sailesh Kumar
Advisors: Jon Turner, Patrick Crowley
Committee: Roger Chamberlain, John Lockwood, Bob Morley


Page 1: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

Sailesh Kumar

Advisors: Jon Turner, Patrick Crowley
Committee: Roger Chamberlain, John Lockwood, Bob Morley

Page 2: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

2 - Sailesh Kumar - 04/22/23

Focus on 3 Network Features

In this proposal, we focus on 3 network features:

Packet payload inspection » Network security

Packet header processing » Packet forwarding, classification, etc.

Packet buffering and queuing » QoS

Page 3: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 4: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Delayed Input DFA (D2FA), SIGCOMM'06

Many transitions in a DFA
» 256 transitions per state
» 50+ distinct transitions per state (real-world datasets)
» Need 50+ words per state

Can we reduce the number of transitions in a DFA?

Three rules: a+, b+c, c*d+

[Figure: DFA for the three rules; five states with transitions on a, b, c, d, roughly 4 transitions per state]

Look at state pairs: there are many common transitions. How can we remove them?

Page 5: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Delayed Input DFA (D2FA), SIGCOMM'06

Many transitions in a DFA
» 256 transitions per state
» 50+ distinct transitions per state (real-world datasets)
» Need 50+ words per state

Can we reduce the number of transitions in a DFA?

Three rules: a+, b+c, c*d+

[Figure: the same five-state DFA shown next to an alternative representation that introduces default transitions: fewer transitions, less memory]

Page 6: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


D2FA Operation

[Figure: the DFA and the equivalent D2FA side by side; in the D2FA, most labeled transitions are replaced by a single default transition per state]

Heavy edges are called default transitions. Take a default transition whenever a labeled transition is missing.
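The default-transition mechanics can be sketched in a few lines of Python. The automaton below is a toy invented for illustration, not the exact machine from the figure:

```python
# Toy D2FA: each state keeps a small dict of labeled transitions plus at
# most one default transition. A missing character is resolved by walking
# default edges, which consume no input (hence the extra memory accesses).

labeled = {
    1: {'a': 2, 'b': 1, 'c': 1, 'd': 1},  # root keeps a full transition set
    2: {'b': 3},                          # other states keep only the
    3: {'c': 4},                          # transitions that differ from
    4: {},                                # their default state's
}
default = {1: None, 2: 1, 3: 1, 4: 1}     # "heavy" default edges

def d2fa_step(state, ch):
    """Follow default edges until a labeled transition on ch is found."""
    while ch not in labeled[state]:
        state = default[state]            # one extra memory access per hop
    return labeled[state][ch]

def d2fa_run(data, state=1):
    for ch in data:
        state = d2fa_step(state, ch)
    return state
```

Running `d2fa_run('abc')` walks 1 to 2 to 3 to 4 directly; an input like `'aabc'` takes a default edge from state 2 back to the root before continuing, which is exactly the extra-access cost the next slide discusses.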

Page 7: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


D2FA versus DFA

D2FAs are compact but require multiple memory accesses
» Up to 20x more memory accesses
» Not desirable in an off-chip architecture

Can D2FAs match the performance of DFAs?
» YES!
» Content Addressed D2FAs (CD2FA)

CD2FAs require only one memory access per byte
» Matches the performance of a DFA in a cacheless system
» In systems with a data cache, CD2FAs are 2-3x faster

CD2FAs are 10x more compact than DFAs

Page 8: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Introduction to CD2FA, ANCS'06

How to avoid the multiple memory accesses of D2FAs?
» Avoid the lookup that decides whether the default path must be taken
» Avoid default path traversal

Solution: assign labels to each state; labels contain:
» Characters for which the state has labeled transitions
» Information about all of its default states
» Characters for which its default states have labeled transitions

[Figure: content labels. Node R is found at location R; node U, with label cd,R, is found at hash(c,d,R); node V, with label ab,cd,R, is found at hash(a,b,hash(c,d,R))]
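The addressing trick can be sketched as follows, using a toy hash and the three states R, U, V from the figure; the real scheme packs labels into fixed-size memory words:

```python
# Content-addressed states: a state's memory address is a hash of its
# labeled characters and its default state's address. Knowing V's label
# is enough to compute where its default state U lives; no traversal.

def h(chars, parent_addr):
    """Stand-in for the collision-free hash the scheme assumes."""
    return hash((chars, parent_addr)) & 0xFFFF

R = 0                    # root state at a fixed, well-known address
U = h(('c', 'd'), R)     # label "cd,R":    stored at hash(cd, R)
V = h(('a', 'b'), U)     # label "ab,cd,R": stored at hash(ab, hash(cd, R))

# V's content label carries its default state's characters and root,
# so U's address can be recomputed instead of fetched from memory.
label_V = {'own': ('a', 'b'), 'default_chars': ('c', 'd'), 'root': R}

def default_address(label):
    return h(label['default_chars'], label['root'])
```

`default_address(label_V)` reproduces U's address from V's label alone, which is why a CD2FA never has to walk the default chain.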

Page 9: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Introduction to CD2FA

[Figure: a transition example. Current state: V (label = ab,cd,R), located at hash(a,b,hash(c,d,R)). The labeled-transition table holds entries such as (V,a), (V,b), (X,p), (X,q), (R,a), (R,b), ..., (Z,a), (Z,b), ...; on the input character the machine jumps directly to X (label = pq,lm,Z), located at hash(p,q,hash(l,m,Z)), in a single memory access]

Page 10: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Construction of CD2FA

We seek to keep the content labels small

Twin objectives:
» Ensure that states have few labeled transitions
» Ensure that default paths are as short as possible

Proposed a new heuristic called CRO to construct CD2FAs
» Details in the ANCS'06 paper
» With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient CD2FAs

Page 11: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Memory Mapping in CD2FA

[Figure: the content-label layout from the previous slides. So far we have assumed that hashing is collision-free; in practice hash(a,b,hash(c,d,R)), hash(c,d,R), and hash(p,q,hash(l,m,Z)) may COLLIDE]

Page 12: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Collision-free Memory Mapping

[Figure: four states with labels abc, def, pqr, lmn and four memory locations. Reordering the characters of a label (e.g. abc, edf, mln) yields several candidate hash values per state: add edges for all possible choices]

Page 13: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Bipartite Graph Matching

Bipartite graph
» Left nodes are state content labels
» Right nodes are memory locations
» An edge for every choice of content label
» Map state labels to unique memory locations
» This is a perfect matching problem

With n left and n right nodes
» Need O(log n) random edges
» n = 1M implies we need ~20 edges per node

If we provide slight memory over-provisioning
» We can uniquely map state labels with far fewer edges

In our experiments, we found perfect matchings without memory over-provisioning

[Figure: content labels matched to unique memory addresses]
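The matching step can be sketched with Kuhn's augmenting-path algorithm; the labels and candidate locations below are invented for illustration:

```python
# Perfect-matching search (Kuhn's augmenting paths): each content label
# has a few candidate memory locations (its hash choices); we try to
# assign every label a unique location, evicting and re-routing earlier
# assignments along augmenting paths when necessary.

def find_matching(candidates):
    match = {}  # memory slot -> label currently assigned there

    def try_assign(label, seen):
        for slot in candidates[label]:
            if slot in seen:
                continue
            seen.add(slot)
            # free slot, or the current occupant can move elsewhere
            if slot not in match or try_assign(match[slot], seen):
                match[slot] = label
                return True
        return False

    for label in candidates:
        if not try_assign(label, set()):
            return None  # no perfect matching: over-provision memory
    return {lbl: slot for slot, lbl in match.items()}

# each label hashes to a couple of candidate memory locations
placement = find_matching({'A': [0, 1], 'B': [0], 'C': [1, 2]})
```

Here 'B' forces 'A' off slot 0 and onto slot 1, and 'C' then settles on slot 2; with too few edges (e.g. both 'A' and 'B' offered only slot 0) the function returns None, which is when over-provisioning helps.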

Page 14: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Reg-ex – New Directions

Three key problems with traditional DFA-based reg-ex matching:
» 1. They employ the complete signature to parse input data
– Even if normal data matches only a small prefix portion
– Full signature => large DFA
» 2. Only one active state of execution and no memory of previous matches
– Combinations of partial matches require new DFA states
» 3. Inability to count certain sub-expressions
– E.g. a{1024} will require 1024 DFA states

We aim to address each of these problems in the proposed research

Page 15: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the First Problem

Divide the processing into a fast and a slow path

Split each signature into a prefix and a suffix
» Employ signature prefixes in the fast path
» Upon a match in the fast path, trigger the slow path
» Appropriate splitting can maintain a low triggering rate

Benefits:
» The fast path can employ a composite DFA for all prefixes
– Because the prefixes are small, the composite DFA remains small
– Higher parsing rate
» The slow path uses a separate DFA for each signature
– No state explosion in the slow path
– Due to the low triggering rate, the slow path will not become a bottleneck
» Reduces per-flow state
– The fast path uses a composite DFA, one active state per flow

Page 16: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Fast and Slow Path Processing

Here we assume that an ε fraction of the flows are diverted to the slow path
The fast path stores one DFA state per flow
The slow path may store multiple active states

[Figure: a fast-path automaton with per-flow fast-path state memory (B bits/sec in), diverting an ε fraction of flows to slow-path automata with per-flow slow-path state memory]

Page 17: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Splitting Reg-exes

Splitting can be performed based upon data traces: assign a probability to each NFA state and make a cut so that the slow-path cumulative probability is low

r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c

[Figure: the combined NFA for r1-r3, each state annotated with its match probability from traces (1.0 at the start state, falling to roughly 0.001-0.0008 near the accepting states); the CUT separates the fast-path automaton from the slow-path automata]

Cumulative probability of the slow path = 0.05

Page 18: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Splitting Reg-exes

r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c

After the cut, the fast-path prefixes are:
p1 = .*[gh]d[^g]*g
p2 = .*fa
p3 = .*a[gh]i

The slow path will comprise three separate DFAs, one for each signature
The fast path will contain a composite DFA (14 states)

[Figure: the composite fast-path DFA for p1-p3; notice the start state]

Page 19: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Protection against DoS Attacks

An attacker can attack such a system by sending data that matches the prefixes more often than provisioned
» The slow path will become the bottleneck

Solution: look at the history and determine whether a flow is an attack flow
» Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path
» If a flow has a high anomaly index, send it to a low-rate queue

[Figure: fast-path automaton (B pkts/sec in) with per-flow anomaly counters and a HoL buffer in front of k slow-path automata; slow-path sleep status feeds back to the fast path]
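The anomaly index can be sketched as an exponentially weighted moving average of slow-path triggers. ALPHA and THRESHOLD below are invented for the example; the proposal only specifies "weighted moving average":

```python
# Per-flow anomaly index: an EWMA over a 0/1 signal that records
# whether each packet of the flow triggered the slow path. Flows with
# a persistently high index are demoted to a low-rate queue.

ALPHA = 0.25       # weight of the newest observation (assumed)
THRESHOLD = 0.5    # demotion threshold (assumed)

def update_index(index, triggered):
    return (1 - ALPHA) * index + ALPHA * (1.0 if triggered else 0.0)

idx = 0.0
for hit in [True, True, True, True]:   # a flow that keeps triggering
    idx = update_index(idx, hit)
send_to_low_rate_queue = idx > THRESHOLD
```

A flow that triggers the slow path on every packet converges to an index of 1, while a well-behaved flow decays back toward 0, so occasional legitimate triggers are forgiven.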

Page 20: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Simulation Results

[Figure: three time series over 250 seconds, under no, moderate, and extreme overloading: (1) throughput with no DoS protection, (2) slow-path load relative to the slow path's threshold, (3) per-flow throughput with DoS protection]

Page 21: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the Second Problem

NFA: compact, but O(n) active states
DFA: one active state, but state explosion
» How to avoid state explosion while also keeping the per-flow active-state information small?

Propose a novel machine called a History-based Finite Automaton, or H-FA
» Augment a DFA with a history buffer
» Transitions are taken based on the history buffer contents
» During certain transitions, items are inserted into or removed from the history buffer

Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state

Page 22: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Example of H-FA Construction

NFA: ab[^a]*c; def

[Figure: the NFA (states 0-6) and the corresponding DFA with states {0}, {0,1}, {0,2}, {0,3}, {0,4}, {0,5}, {0,6}, {0,2,4}, {0,2,5}, {0,2,6}]

NFA state 2 is present in 4 DFA states. If we remove NFA state 2 from these DFA states, we are left with just 6 states.

Page 23: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

[Figure: the DFA from the previous slide, and the H-FA obtained by removing NFA state 2 and adding a one-bit history flag. The flag-manipulating transitions are: b sets the flag (flag<=1); a clears it (flag<=0); c with flag=1 reports a match of ab[^a]*c and clears the flag; c with flag=0 is an ordinary transition]

This new machine uses a history flag, in addition to its transitions, to make moves.

Page 24: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

[Figure: an execution trace of the H-FA on input data "c d a b c". The single active state moves through the machine while the history flag is set by b, reset by a, and finally consumed by the last c to report a match]
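In code, the flag machinery for the slide's signatures "ab[^a]*c" and "def" looks roughly like this. The semantics are reconstructed for illustration, not the paper's exact machine:

```python
# Minimal H-FA-style matcher: one DFA-like current state plus a one-bit
# history flag meaning "ab has been seen, and no 'a' since".

def hfa_scan(data):
    state, flag, matches = 'start', False, []
    for ch in data:
        if flag and ch == 'c':
            matches.append('ab[^a]*c')   # flag + 'c' completes rule 1
            flag = False
        if ch == 'a':
            flag = False                 # [^a] violated: clear history
            state = 'a'
        elif ch == 'b' and state == 'a':
            flag = True                  # "ab" completed: set history
            state = 'start'
        elif ch == 'd':
            state = 'd'
        elif ch == 'e' and state == 'd':
            state = 'de'
        elif ch == 'f' and state == 'de':
            matches.append('def')        # rule 2 needs no history
            state = 'start'
        else:
            state = 'start'
    return matches
```

The flag stands in for the NFA state that caused the DFA blow-up: an input such as "abdefc" matches both signatures while the machine keeps only one active state plus one history bit.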

Page 25: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion

k closures will require at most k bits in the history buffer

There are some challenges associated with the efficient implementation of conditional transitions
» We plan to work on these in the proposed research

Page 26: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the Third Problem

Signature: ab[^a]{1024}c; def

[Figure: the H-FA from the previous slides, with the history flag replaced by a counter]

Replace the flag by a counter:
» Replace the flag=1 condition with ctr=1024
» Replace the flag=0 condition with ctr=0
» Increment ctr if ctr>0; reset when ctr reaches 1024

One of the primary goals of the proposed research is to enable efficient implementation of such counter conditions
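A counter version (H-cFA) for the length-restricted rule can be sketched as follows, using N = 4 instead of the slide's 1024 to keep the demo readable, and covering only the ab[^a]{N}c part:

```python
# H-cFA sketch: the one-bit history flag becomes a counter. The counter
# starts when "ab" completes, counts non-'a' characters, and a 'c' seen
# at exactly ctr == N reports a match. Semantics reconstructed for
# illustration; the paper's machine encodes this as counter conditions
# on transitions.

N = 4   # the slide uses 1024

def hcfa_scan(data):
    state = 'start'
    active, ctr = False, 0      # the counter lives in the history buffer
    for ch in data:
        if active and ctr == N and ch == 'c':
            return True         # exactly N non-'a' characters, then 'c'
        if ch == 'a':
            active = False      # [^a] violated: abandon the count
            state = 'a'
            continue
        if active:
            ctr += 1
            if ctr > N:
                active = False  # run longer than {N}: abandon
        if ch == 'b' and state == 'a':
            active, ctr = True, 0   # "ab" completed: start counting
        state = 'start'
    return False
```

Without the counter, a DFA for this rule would need on the order of N extra states; here the same bookkeeping costs one counter in the history buffer.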

Page 27: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Early Results

Source       | # closures, # length restr. | # automata | Composite DFA states | # flags | # counters | H-FA/H-cFA states | Max transitions/char | Total transitions | % space reduction with H-FA | H-FA parsing rate speedup
Cisco64      | 14, 1 | 1 | 132784 | 6  | 0 | 3597 | 2 | 1215450 | 94.69 | -
Cisco64      | 14, 1 | 1 | 132784 | 13 | 0 | 1861 | 8 | 682718  | 96.77 | -
Cisco68      | 19, 1 | 1 | 328664 | 17 | 0 | 2956 | 8 | 1337293 | 97.03 | -
Snort rule 1 | 6, 6  | 3 | 62589  | 5  | 6 | 583  | 8 | 238107  | 97.40 | 3x
Snort rule 2 | 1, 2  | 1 | 12703  | 1  | 2 | 71   | 2 | 27498   | 98.58 | -
Snort rule 3 | 5, 1  | 2 | 4737   | 5  | 1 | 116  | 4 | 46124   | 93.48 | 2x
Linux70      | 11, 0 | 2 | 20662  | 9  | 0 | 1304 | 8 | 546378  | 81.63 | 2x

Page 28: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 29: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


IP Address Lookup

Routing tables at router input ports contain (prefix, next hop) pairs
The address in the packet is compared to the stored prefixes, starting at the left
The prefix that matches the largest number of address bits is the desired match
The packet is forwarded to the specified next hop

Routing table (prefix, next hop):
1*    5
00*   3
01*   5
0*    7
001*  2
011*  3
1011* 4

address: 0110 0100 1000

Page 30: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Address Lookup Using Tries

Prefixes are stored in "alphabetical order" in a tree
Prefixes are "spelled" out by following a path from the top
» green dots mark prefix ends
To find the best prefix, spell out the address in the tree
The last green dot marks the longest matching prefix

address: 0110 0100 1000

[Figure: the binary trie for the routing table above; spelling out the address ends at prefix 011*, next hop 3]
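The walk described above can be sketched directly over the slide's routing table; a minimal binary trie with a "last prefix seen" marker:

```python
# Longest-prefix match over the slide's routing table using a binary
# trie: walk the address bit by bit, remembering the deepest prefix end
# passed on the way down.

routes = {'1': 5, '00': 3, '01': 5, '0': 7, '001': 2, '011': 3, '1011': 4}

# build the trie: each node is a dict with optional '0'/'1' children
# and an optional 'hop' marking a prefix end (the "green dot")
root = {}
for prefix, hop in routes.items():
    node = root
    for bit in prefix:
        node = node.setdefault(bit, {})
    node['hop'] = hop

def lookup(address):
    node, best = root, None
    for bit in address:
        if bit not in node:
            break                       # fell off the trie: stop spelling
        node = node[bit]
        best = node.get('hop', best)    # remember the last prefix end
    return best
```

`lookup('011001001000')` returns 3, matching the slide's example: the walk passes 0* (hop 7), then 01* (hop 5), and finally 011* (hop 3) before falling off the trie.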

Page 31: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Pipelined Trie-based IP-lookup

Tree data structure, prefixes in leaves (leaf pushing)
Process the IP address level by level to find the longest match
Each level is placed in a different pipeline stage → overlap multiple packets

[Figure: a leaf-pushed trie with prefixes P1-P7 (P4 = 10010*), mapped level by level onto pipeline stages]

Stages of different size:
- Require more memory
- The largest stage becomes the bottleneck

Page 32: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Circular Pipeline, ANCS'06

Use a circular pipeline and allow requests to enter/exit at any stage

Mapping:
» Divide the trie into multiple sub-tries
» Map each sub-trie with its root starting at a different stage

Page 33: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Mapping in Circular Pipeline

[Figure: the trie for prefixes 00* P1, 000* P2, 010* P3, 01001* P4, 01011* P5, 011* P6, 110* P7, 111* P8 is divided into three sub-tries. A direct index table handles the first 2 bits of the address and selects the entry stage: 00* enters at pipeline stage 1 (sub-trie 1), 01* at stage 2 (sub-trie 2), 11* at stage 3 (sub-trie 3), and 10* has no match]

Page 34: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Circular Pipeline

Benefits:
» Uniform stage sizes
» Less memory: no over-provisioning is needed in the face of arbitrary trie shapes
» Higher throughput

Page 35: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


New Direction: HEXA

HEXA (History-based Encoding, eXecution and Addressing)
» Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes

If the labels of the path leading to every node are unique, then these labels can be used to identify the node
» In a trie, every node has a unique path starting at the root
» Thus, the labels along the path become the identifier of the node
» Note that these labels need not be explicitly stored

Page 36: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Traditional Implementation

Prefixes: 1* P1, 00* P2, 11* P3, 011* P4, 0100* P5

[Figure: the corresponding trie, nodes numbered 1-9]

Node table (Addr: prefix flag, left child, right child):
1: 0, 2, 3
2: 0, 4, 5
3: 1, NULL, 6
4: 1, NULL, NULL
5: 0, 7, 8
6: 1, NULL, NULL
7: 0, 9, NULL
8: 1, NULL, NULL
9: 1, NULL, NULL

There are nine nodes, so we need 4-bit node identifiers
Total memory = 9 x 9 bits

Page 37: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


HEXA-based Implementation

Prefixes: 1* P1, 00* P2, 11* P3, 011* P4, 0100* P5

Define the HEXA identifier of a node as the path which leads to it from the root:
1. -
2. 0
3. 1
4. 00
5. 01
6. 11
7. 010
8. 011
9. 0100

Notice that these identifiers are unique. Thus, they can potentially be mapped to unique memory addresses.

Page 38: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


HEXA-based Implementation

Use hashing to map the HEXA identifier to a memory address

If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below:

f(-) = 4    f(00) = 2    f(010) = 5
f(0) = 7    f(01) = 8    f(011) = 3
f(1) = 9    f(11) = 1    f(0100) = 6

Addr  Fast path  Prefix
1     1,0,0      P3
2     1,0,0      P2
3     1,0,0      P4
4     0,1,1
5     0,1,0
6     1,0,0      P5
7     0,1,1
8     0,1,1
9     1,0,1      P1

Here we use only 3 bits per node in the fast path
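A lookup over this layout can be sketched as follows. The `f` table hard-codes the minimal perfect hash values from the slide, but the 3-bit encoding used here (has-0-child, has-1-child, plus an out-of-band next hop) is a simplification for readability, not the slide's exact bit layout:

```python
# HEXA sketch for the slide's trie (1* P1, 00* P2, 11* P3, 011* P4,
# 0100* P5): child addresses are recomputed from the path spelled so
# far instead of being stored as log2(n)-bit pointers.

f = {'-': 4, '0': 7, '1': 9, '00': 2, '01': 8, '11': 1,
     '010': 5, '011': 3, '0100': 6}

memory = {
    f['-']:    (True,  True,  None),   # root
    f['0']:    (True,  True,  None),
    f['1']:    (False, True,  'P1'),
    f['00']:   (False, False, 'P2'),
    f['01']:   (True,  True,  None),
    f['11']:   (False, False, 'P3'),
    f['010']:  (True,  False, None),
    f['011']:  (False, False, 'P4'),
    f['0100']: (False, False, 'P5'),
}

def lookup(bits):
    """Longest-prefix match; no child pointers are ever read."""
    has0, has1, hop = memory[f['-']]
    best, path = hop, ''
    for b in bits:
        if (b == '0' and not has0) or (b == '1' and not has1):
            break
        path += b
        has0, has1, hop = memory[f[path]]  # address from the path itself
        if hop is not None:
            best = hop
    return best
```

The node's identity lives entirely in the input history (the path), which is the core HEXA idea: the per-node storage shrinks to a few bits regardless of trie size.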

Page 39: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Devising a One-to-one Mapping

Finding a minimal perfect hash function is difficult
» A one-to-one mapping is essential for HEXA to work

Use discriminator bits
» Append c bits, which we are free to modify, to every HEXA identifier
» Thus a node has 2^c choices of identifier
» Notice that we need to store these c bits, so slightly more than 3 bits per node are needed

With multiple choices of HEXA identifier per node, we can reduce the problem to bipartite graph matching
» We need to find a perfect matching in the graph to map nodes to unique memory locations

Page 40: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Devising a One-to-one Mapping

[Figure: each of the nine input labels (-, 0, 1, 00, 01, 11, 010, 011, 0100) gets four choices of HEXA identifier by prepending the 2-bit discriminators 00, 01, 10, 11. Hashing each choice yields candidate memory locations among addresses 0-8, forming a bipartite graph between the nine nodes and the memory locations, in which a perfect matching is found]

Page 41: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Results

Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching
» Thus 2 bits per node suffice instead of log2(n) bits

[Figure: number of HEXA identifier choices needed versus number of trie nodes (100 to 1M), with no, 1%, 3%, and 10% memory over-provisioning]

Page 42: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Results

Memory comparison to Eatherton's trie

[Figure: fast-path trie memory (MB) versus stride (1-6), with and without HEXA]

In the future:
» Complete evaluation of HEXA-based IP lookup: throughput, die size, and power estimates
» Extend HEXA to string matching and finite automata

Page 43: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 44: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Hash Tables

Suppose our hash function gave us the following values:
» hash("apple") = 5
» hash("watermelon") = 3
» hash("grapes") = 8
» hash("cantaloupe") = 7
» hash("kiwi") = 0
» hash("strawberry") = 9
» hash("mango") = 6
» hash("banana") = 2
» hash("honeydew") = 6

This is called a collision. Now what?

[Figure: buckets 0-9 holding kiwi (0), banana (2), watermelon (3), apple (5), mango (6), cantaloupe (7), grapes (8), strawberry (9); honeydew also hashes to the occupied bucket 6]

Page 45: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Collision Resolution Policies

Linear Probing
» Successively search for the first empty subsequent table entry

Chaining
» Link all collided entries at a bucket as a linked list

Double Hashing
» Uses a second hash function to successively index the table
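Two of these policies can be illustrated in one toy open-addressing table; the mod-based hash functions and the table size are chosen arbitrarily for the demo:

```python
# Toy open-addressing table showing linear probing and double hashing.

SIZE = 11   # prime table size so double-hashing steps cycle the table

def h1(key):
    return key % SIZE

def h2(key):
    return 1 + key % (SIZE - 1)   # secondary step size, never zero

def insert(table, key, policy='linear'):
    for i in range(SIZE):
        step = i if policy == 'linear' else i * h2(key)
        slot = (h1(key) + step) % SIZE
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError('table full')

table = [None] * SIZE
insert(table, 22)              # 22 % 11 == 0: lands in slot 0
insert(table, 33)              # collides at 0; linear probe finds slot 1
insert(table, 44, 'double')    # collides at 0; jumps by h2(44) == 5
```

Linear probing clusters collided keys next to each other (slots 0 and 1 here), while double hashing scatters them, which is why the probe-sequence lengths behave so differently under load.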

Page 46: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Performance Analysis

Average performance is O(1)
However, worst-case performance is O(n)
In fact the likelihood that a key lands at a distance > 1 is pretty high

[Figure: probability of key distance > 1 and > 2 versus load m/n (10-100%). Keys at distance > 1 take twice as long to probe, and those at distance > 2 three times as long, so there is a fairly high probability that throughput is two to three times lower than the peak]

Page 47: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Segmented Hashing, ANCS'05

Uses the power of multiple choices
» proposed earlier by Azar et al.

An N-way segmented hash
» Logically divides the hash table array into N equal segments
» Maps each incoming key onto one bucket from each segment
» Picks the bucket which is either empty or has the minimum number of keys

[Figure: a 4-way segmented hash table; each key maps to one bucket per segment and is inserted into the least-loaded one]
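The insertion rule can be sketched in a few lines; the per-segment mixing function below is a toy stand-in for independent hash functions:

```python
# N-way segmented hash sketch: each key maps to one bucket per segment,
# and is inserted into the bucket currently holding the fewest keys
# (the "power of multiple choices").

N_SEGMENTS, BUCKETS = 4, 8
segments = [[[] for _ in range(BUCKETS)] for _ in range(N_SEGMENTS)]

def bucket_for(key, seg):
    # toy per-segment hash: a multiplicative mix offset by the segment
    return (key * 2654435761 + seg * 40503) % BUCKETS

def insert(key):
    seg = min(range(N_SEGMENTS),
              key=lambda s: len(segments[s][bucket_for(key, s)]))
    segments[seg][bucket_for(key, seg)].append(key)
    return seg

for k in range(20):
    insert(k)
stored = sum(len(b) for seg in segments for b in seg)
longest = max(len(b) for seg in segments for b in seg)
```

With four choices, these 20 keys spread out so that no bucket holds more than one key; with a single segment the same keys would pile three deep in some buckets, which is the distance > 1 problem the previous slide quantified.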

Page 48: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Segmented Hash Performance

More segments improve the probabilistic performance
» With 64 segments, the probability that a key is inserted at distance > 2 is nearly zero even at 100% load
» The improvement in average-case performance is still modest

[Figure: probability of key distance > 1 and > 2 versus load m/n (10-100%) for 1, 4, 8, 16, 32, and 64 segments, on a log scale from 1E-15 to 1]

Page 49: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Adding per Segment Filters

[Figure: a Bloom filter of m_b bits in front of each segment; a key k_i is hashed by h_1 ... h_k into the chosen segment's filter. Since k_i can go to any of 3 candidate buckets, we can select any of those segments and insert the key into the corresponding filter]

Page 50: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Selective Filter Insertion Algorithm

[Figure: the same per-segment Bloom filters; k_i can go to any of 3 candidate buckets]

Insert the key into segment 4, since fewer bits would be set there. Fewer bits set => lower false-positive rate.

With more segments (or more choices), our algorithm sets far fewer bits in the Bloom filters.
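The selection rule can be sketched as follows; the filter size, hash count, and SHA-256-based bit derivation are toy parameters for the demo, not the paper's configuration:

```python
# Selective filter insertion sketch: among the candidate segments,
# insert where the key's Bloom-filter bits would set the fewest *new*
# bits, keeping every filter as sparse (and accurate) as possible.
import hashlib

M_BITS, K_HASHES, N_SEGMENTS = 64, 3, 4
filters = [set() for _ in range(N_SEGMENTS)]   # bit positions set so far

def bloom_bits(key, seg):
    digest = hashlib.sha256(f'{seg}:{key}'.encode()).digest()
    return {digest[i] % M_BITS for i in range(K_HASHES)}

def insert(key, candidates):
    new_bits = {s: len(bloom_bits(key, s) - filters[s]) for s in candidates}
    seg = min(candidates, key=new_bits.get)    # fewest new bits set
    filters[seg] |= bloom_bits(key, seg)
    return seg

for k in range(16):
    insert(k, range(N_SEGMENTS))
```

After insertion, membership in the chosen segment's filter always holds, while the other filters stay untouched; that asymmetry is what keeps the aggregate false-positive rate low.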

Page 51: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Problem with Segmented Hash

The Bloom filter size is proportional to the total number of elements
An O(1) lookup can be maintained even if we omit the Bloom filter of one segment
» With many segments of equal size, this omission does not lead to much reduction in Bloom filter size

An alternative is to use segments of different sizes and omit the Bloom filter of the largest segment
» If the largest segment is, say, 90% of the total memory, this results in a 90% reduction in Bloom filter size
» Peacock hashing utilizes this property

Page 52: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hashing

[Figure: keys k1-k7 from the universe of keys U are placed by hash functions h1-h5 into a series of segments of geometrically decreasing size]

Size of the 1st segment = 1
Size of the 2nd segment = c
Size of the ith segment = c x size of the (i-1)st segment

No element will be discarded until the first (smallest) segment is filled

Page 53: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

Use a Bloom filter for all segments but the largest
» Thus, for c = say 10, the Bloom filters are 10x smaller

Lookup is straightforward
» First consult all Bloom filters
» If none of them reports a membership, look up in the largest segment
» Else look up in the segments which report a membership

To enable deletes we require counting Bloom filters, but the counters can be kept in the slow path

Deletes, however, lead to imbalance in the loading
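The structure and lookup path can be sketched as follows; the segment sizes, the growth factor c = 4, the cascade insert order, and the one-hash "Bloom filter" are all toy simplifications:

```python
# Peacock-hash sketch: Bloom filters guard every segment except the
# largest, so a lookup usually costs one probe into the big segment
# plus a few cheap filter checks.

sizes = [64, 16, 4, 1]                 # largest first; each 4x smaller
segments = [dict() for _ in sizes]     # slot -> (key, value)
filters = [None] + [set() for _ in sizes[1:]]   # no filter on segment 0

def _slot(key, i):
    return hash((key, i)) % sizes[i]

def insert(key, value):
    for i in range(len(sizes)):        # overflow cascades to smaller segments
        slot = _slot(key, i)
        if slot not in segments[i]:
            segments[i][slot] = (key, value)
            if filters[i] is not None:
                filters[i].add(hash(key) % 32)   # toy 1-hash Bloom filter
            return i
    return None                        # discarded: all segments collided

def lookup(key):
    for i in range(1, len(sizes)):     # filtered (small) segments first
        if hash(key) % 32 in filters[i]:
            entry = segments[i].get(_slot(key, i))
            if entry and entry[0] == key:
                return entry[1]
    entry = segments[0].get(_slot(key, 0))     # unfiltered largest segment
    if entry and entry[0] == key:
        return entry[1]
    return None
```

Because the small backup segments hold few keys, their filters stay tiny and sparse; only the filter-free largest segment needs an unconditional probe.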

Page 54: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

A series of "delete and insert" operations may lead to overflow of the smaller segments

[Figure: discard rate (%) per segment over simulation time (sampling interval 1000); once the second phase begins, segments 1-6 start discarding elements]

Page 55: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

Following every delete we perform a re-balancing, i.e. search the smaller segments and move elements to a larger segment if possible

[Figure: the same simulation with re-balancing; the per-segment discard rates are substantially reduced]

Page 56: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Issues and Future Directions

It is not clear how to perform re-balancing efficiently
» In the previous simulation, we use a brute-force approach and search the entire segment, leading to O(n) re-balancing cost

Complicating factors:
» Collision length higher than 1 in some segments
» Double hashing collision policy
» Use of 2-ary hashing may improve the efficiency, but will again complicate the re-balancing

Future research objectives:
» Develop an efficient re-balancing algorithm
» Develop Bloom filters that better utilize the power of multiple choices
» Extend the scheme to memory segments with different bandwidth and access latency

Page 57: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 58: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Packet Buffering and Queuing

The first objective is to extend the multichannel packet buffer architecture to DRAM memories
We also plan to consider memories with different sizes, bandwidths, and access latencies
» Extension of
– Sailesh Kumar, Patrick Crowley, and Jonathan Turner, "Design of Randomized Multichannel Packet Storage for High Performance Routers", Proceedings of IEEE Symposium on High Performance Interconnects (HotI-13), Stanford, August 17-19, 2005.

Work on an NP-specific queuing hardware assist
» Extension of
– Sailesh Kumar, John Maschmeyer, and Patrick Crowley, "Queuing Cache: Exploiting Locality to Ameliorate Packet Queue Contention and Serialization", Proceedings of ACM International Conference on Computing Frontiers (ICCF), Ischia, Italy, May 2-5, 2006.

Page 59: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


The proposed research is expected to take one year

Acknowledgments
» Jon Turner
» Patrick Crowley
» Michela Becchi
» Sarang Dharmapurikar
» John Lockwood
» Roger Chamberlain
» Robert Morley
» Balakrishnan Chandrasekaran
» Michael Mitzenmacher, Harvard Univ.
» George Varghese, UCSD
» Will Eatherton, Cisco
» John Williams, Cisco

Page 60: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Questions???