Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms
Sailesh Kumar
Advisors: Jon Turner, Patrick Crowley
Committee: Roger Chamberlain, John Lockwood, Bob Morley


Page 1: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

Sailesh Kumar

Advisors: Jon Turner, Patrick Crowley
Committee: Roger Chamberlain, John Lockwood, Bob Morley

Page 2: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

2 - Sailesh Kumar - 04/22/23

Focus on 3 Network Features

In this proposal, we focus on 3 network features:

Packet payload inspection » Network security

Packet header processing » Packet forwarding, classification, etc.

Packet buffering and queuing » QoS

Page 3: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 4: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Delayed Input DFA (D2FA), SIGCOMM'06

Many transitions in a DFA
» 256 transitions per state
» 50+ distinct transitions per state (real-world datasets)
» Need 50+ words per state

Can we reduce the number of transitions in a DFA?

Three rules: a+, b+c, c*d+

[Figure: DFA for the three rules; five states with transitions on a, b, c, d, roughly 4 transitions per state]

Look at state pairs: there are many common transitions. How can we remove them?

Page 5: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Delayed Input DFA (D2FA), SIGCOMM'06

Many transitions in a DFA
» 256 transitions per state
» 50+ distinct transitions per state (real-world datasets)
» Need 50+ words per state

Can we reduce the number of transitions in a DFA?

Three rules: a+, b+c, c*d+

[Figure: the same five-state DFA shown next to an alternative representation that introduces default transitions: fewer transitions, less memory]

Page 6: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


D2FA Operation

[Figure: the DFA and the equivalent D2FA side by side; in the D2FA, most labeled transitions are replaced by a single default transition per state]

Heavy edges are called default transitions. Take a default transition whenever a labeled transition is missing.
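The default-transition mechanics can be sketched in a few lines of Python. The automaton below is a toy invented for illustration, not the exact machine from the figure:

```python
# Toy D2FA: each state keeps a small dict of labeled transitions plus at
# most one default transition. A missing character is resolved by walking
# default edges, which consume no input (hence the extra memory accesses).

labeled = {
    1: {'a': 2, 'b': 1, 'c': 1, 'd': 1},  # root keeps a full transition set
    2: {'b': 3},                          # other states keep only the
    3: {'c': 4},                          # transitions that differ from
    4: {},                                # their default state's
}
default = {1: None, 2: 1, 3: 1, 4: 1}     # "heavy" default edges

def d2fa_step(state, ch):
    """Follow default edges until a labeled transition on ch is found."""
    while ch not in labeled[state]:
        state = default[state]            # one extra memory access per hop
    return labeled[state][ch]

def d2fa_run(data, state=1):
    for ch in data:
        state = d2fa_step(state, ch)
    return state
```

Running `d2fa_run('abc')` walks 1 to 2 to 3 to 4 directly; an input like `'aabc'` takes a default edge from state 2 back to the root before continuing, which is exactly the extra-access cost the next slide discusses.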

Page 7: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


D2FA versus DFA

D2FAs are compact but require multiple memory accesses
» Up to 20x more memory accesses
» Not desirable in an off-chip architecture

Can D2FAs match the performance of DFAs?
» YES!
» Content Addressed D2FAs (CD2FA)

CD2FAs require only one memory access per byte
» Matches the performance of a DFA in a cacheless system
» In systems with a data cache, CD2FAs are 2-3x faster

CD2FAs are 10x more compact than DFAs

Page 8: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Introduction to CD2FA, ANCS'06

How to avoid the multiple memory accesses of D2FAs?
» Avoid the lookup that decides whether the default path must be taken
» Avoid default path traversal

Solution: assign labels to each state; labels contain:
» Characters for which the state has labeled transitions
» Information about all of its default states
» Characters for which its default states have labeled transitions

[Figure: content labels. Node R is found at location R; node U, with label cd,R, is found at hash(c,d,R); node V, with label ab,cd,R, is found at hash(a,b,hash(c,d,R))]
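The addressing trick can be sketched as follows, using a toy hash and the three states R, U, V from the figure; the real scheme packs labels into fixed-size memory words:

```python
# Content-addressed states: a state's memory address is a hash of its
# labeled characters and its default state's address. Knowing V's label
# is enough to compute where its default state U lives; no traversal.

def h(chars, parent_addr):
    """Stand-in for the collision-free hash the scheme assumes."""
    return hash((chars, parent_addr)) & 0xFFFF

R = 0                    # root state at a fixed, well-known address
U = h(('c', 'd'), R)     # label "cd,R":    stored at hash(cd, R)
V = h(('a', 'b'), U)     # label "ab,cd,R": stored at hash(ab, hash(cd, R))

# V's content label carries its default state's characters and root,
# so U's address can be recomputed instead of fetched from memory.
label_V = {'own': ('a', 'b'), 'default_chars': ('c', 'd'), 'root': R}

def default_address(label):
    return h(label['default_chars'], label['root'])
```

`default_address(label_V)` reproduces U's address from V's label alone, which is why a CD2FA never has to walk the default chain.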

Page 9: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Introduction to CD2FA

[Figure: a transition example. Current state: V (label = ab,cd,R), located at hash(a,b,hash(c,d,R)). The labeled-transition table holds entries such as (V,a), (V,b), (X,p), (X,q), (R,a), (R,b), ..., (Z,a), (Z,b), ...; on the input character the machine jumps directly to X (label = pq,lm,Z), located at hash(p,q,hash(l,m,Z)), in a single memory access]

Page 10: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Construction of CD2FA

We seek to keep the content labels small

Twin objectives:
» Ensure that states have few labeled transitions
» Ensure that default paths are as short as possible

Proposed a new heuristic called CRO to construct CD2FAs
» Details in the ANCS'06 paper
» With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient CD2FAs

Page 11: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Memory Mapping in CD2FA

[Figure: the content-label layout from the previous slides. So far we have assumed that hashing is collision-free; in practice hash(a,b,hash(c,d,R)), hash(c,d,R), and hash(p,q,hash(l,m,Z)) may COLLIDE]

Page 12: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Collision-free Memory Mapping

[Figure: four states with labels abc, def, pqr, lmn and four memory locations. Reordering the characters of a label (e.g. abc, edf, mln) yields several candidate hash values per state: add edges for all possible choices]

Page 13: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Bipartite Graph Matching

Bipartite graph
» Left nodes are state content labels
» Right nodes are memory locations
» An edge for every choice of content label
» Map state labels to unique memory locations
» This is a perfect matching problem

With n left and n right nodes
» Need O(log n) random edges
» n = 1M implies we need ~20 edges per node

If we provide slight memory over-provisioning
» We can uniquely map state labels with far fewer edges

In our experiments, we found perfect matchings without memory over-provisioning

[Figure: content labels matched to unique memory addresses]
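The matching step can be sketched with Kuhn's augmenting-path algorithm; the labels and candidate locations below are invented for illustration:

```python
# Perfect-matching search (Kuhn's augmenting paths): each content label
# has a few candidate memory locations (its hash choices); we try to
# assign every label a unique location, evicting and re-routing earlier
# assignments along augmenting paths when necessary.

def find_matching(candidates):
    match = {}  # memory slot -> label currently assigned there

    def try_assign(label, seen):
        for slot in candidates[label]:
            if slot in seen:
                continue
            seen.add(slot)
            # free slot, or the current occupant can move elsewhere
            if slot not in match or try_assign(match[slot], seen):
                match[slot] = label
                return True
        return False

    for label in candidates:
        if not try_assign(label, set()):
            return None  # no perfect matching: over-provision memory
    return {lbl: slot for slot, lbl in match.items()}

# each label hashes to a couple of candidate memory locations
placement = find_matching({'A': [0, 1], 'B': [0], 'C': [1, 2]})
```

Here 'B' forces 'A' off slot 0 and onto slot 1, and 'C' then settles on slot 2; with too few edges (e.g. both 'A' and 'B' offered only slot 0) the function returns None, which is when over-provisioning helps.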

Page 14: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Reg-ex – New Directions

Three key problems with traditional DFA-based reg-ex matching:
» 1. They employ the complete signature to parse input data
– Even if normal data matches only a small prefix portion
– Full signature => large DFA
» 2. Only one active state of execution and no memory of previous matches
– Combinations of partial matches require new DFA states
» 3. Inability to count certain sub-expressions
– E.g. a{1024} will require 1024 DFA states

We aim to address each of these problems in the proposed research

Page 15: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the First Problem

Divide the processing into a fast and a slow path

Split each signature into a prefix and a suffix
» Employ signature prefixes in the fast path
» Upon a match in the fast path, trigger the slow path
» Appropriate splitting can maintain a low triggering rate

Benefits:
» The fast path can employ a composite DFA for all prefixes
– Because the prefixes are small, the composite DFA remains small
– Higher parsing rate
» The slow path uses a separate DFA for each signature
– No state explosion in the slow path
– Due to the low triggering rate, the slow path will not become a bottleneck
» Reduces per-flow state
– The fast path uses a composite DFA, one active state per flow

Page 16: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Fast and Slow Path Processing

Here we assume that an ε fraction of the flows are diverted to the slow path
The fast path stores one DFA state per flow
The slow path may store multiple active states

[Figure: a fast-path automaton with per-flow fast-path state memory (B bits/sec in), diverting an ε fraction of flows to slow-path automata with per-flow slow-path state memory]

Page 17: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Splitting Reg-exes

Splitting can be performed based upon data traces: assign a probability to each NFA state and make a cut so that the slow-path cumulative probability is low

r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c

[Figure: the combined NFA for r1-r3, each state annotated with its match probability from traces (1.0 at the start state, falling to roughly 0.001-0.0008 near the accepting states); the CUT separates the fast-path automaton from the slow-path automata]

Cumulative probability of the slow path = 0.05

Page 18: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Splitting Reg-exes

r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c

After the cut, the fast-path prefixes are:
p1 = .*[gh]d[^g]*g
p2 = .*fa
p3 = .*a[gh]i

The slow path will comprise three separate DFAs, one for each signature
The fast path will contain a composite DFA (14 states)

[Figure: the composite fast-path DFA for p1-p3; notice the start state]

Page 19: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Protection against DoS Attacks

An attacker can attack such a system by sending data that matches the prefixes more often than provisioned
» The slow path will become the bottleneck

Solution: look at the history and determine whether a flow is an attack flow
» Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path
» If a flow has a high anomaly index, send it to a low-rate queue

[Figure: fast-path automaton (B pkts/sec in) with per-flow anomaly counters and a HoL buffer in front of k slow-path automata; slow-path sleep status feeds back to the fast path]
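The anomaly index can be sketched as an exponentially weighted moving average of slow-path triggers. ALPHA and THRESHOLD below are invented for the example; the proposal only specifies "weighted moving average":

```python
# Per-flow anomaly index: an EWMA over a 0/1 signal that records
# whether each packet of the flow triggered the slow path. Flows with
# a persistently high index are demoted to a low-rate queue.

ALPHA = 0.25       # weight of the newest observation (assumed)
THRESHOLD = 0.5    # demotion threshold (assumed)

def update_index(index, triggered):
    return (1 - ALPHA) * index + ALPHA * (1.0 if triggered else 0.0)

idx = 0.0
for hit in [True, True, True, True]:   # a flow that keeps triggering
    idx = update_index(idx, hit)
send_to_low_rate_queue = idx > THRESHOLD
```

A flow that triggers the slow path on every packet converges to an index of 1, while a well-behaved flow decays back toward 0, so occasional legitimate triggers are forgiven.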

Page 20: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Simulation Results

[Figure: three time series over 250 seconds, under no, moderate, and extreme overloading: (1) throughput with no DoS protection, (2) slow-path load relative to the slow path's threshold, (3) per-flow throughput with DoS protection]

Page 21: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the Second Problem

NFA: compact, but O(n) active states
DFA: one active state, but state explosion
» How to avoid state explosion while also keeping the per-flow active-state information small?

Propose a novel machine called a History-based Finite Automaton, or H-FA
» Augment a DFA with a history buffer
» Transitions are taken based on the history buffer contents
» During certain transitions, items are inserted into or removed from the history buffer

Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state

Page 22: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Example of H-FA Construction

NFA: ab[^a]*c; def

[Figure: the NFA (states 0-6) and the corresponding DFA with states {0}, {0,1}, {0,2}, {0,3}, {0,4}, {0,5}, {0,6}, {0,2,4}, {0,2,5}, {0,2,6}]

NFA state 2 is present in 4 DFA states. If we remove NFA state 2 from these DFA states, we are left with just 6 states.

Page 23: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

[Figure: the DFA from the previous slide, and the H-FA obtained by removing NFA state 2 and adding a one-bit history flag. The flag-manipulating transitions are: b sets the flag (flag<=1); a clears it (flag<=0); c with flag=1 reports a match of ab[^a]*c and clears the flag; c with flag=0 is an ordinary transition]

This new machine uses a history flag, in addition to its transitions, to make moves.

Page 24: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

[Figure: an execution trace of the H-FA on input data "c d a b c". The single active state moves through the machine while the history flag is set by b, reset by a, and finally consumed by the last c to report a match]
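In code, the flag machinery for the slide's signatures "ab[^a]*c" and "def" looks roughly like this. The semantics are reconstructed for illustration, not the paper's exact machine:

```python
# Minimal H-FA-style matcher: one DFA-like current state plus a one-bit
# history flag meaning "ab has been seen, and no 'a' since".

def hfa_scan(data):
    state, flag, matches = 'start', False, []
    for ch in data:
        if flag and ch == 'c':
            matches.append('ab[^a]*c')   # flag + 'c' completes rule 1
            flag = False
        if ch == 'a':
            flag = False                 # [^a] violated: clear history
            state = 'a'
        elif ch == 'b' and state == 'a':
            flag = True                  # "ab" completed: set history
            state = 'start'
        elif ch == 'd':
            state = 'd'
        elif ch == 'e' and state == 'd':
            state = 'de'
        elif ch == 'f' and state == 'de':
            matches.append('def')        # rule 2 needs no history
            state = 'start'
        else:
            state = 'start'
    return matches
```

The flag stands in for the NFA state that caused the DFA blow-up: an input such as "abdefc" matches both signatures while the machine keeps only one active state plus one history bit.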

Page 25: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


H-FA

In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion

k closures will require at most k bits in the history buffer

There are some challenges associated with the efficient implementation of conditional transitions
» We plan to work on these in the proposed research

Page 26: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Addressing the Third Problem

Signature: ab[^a]{1024}c; def

[Figure: the H-FA from the previous slides, with the history flag replaced by a counter]

Replace the flag by a counter:
» Replace the flag=1 condition with ctr=1024
» Replace the flag=0 condition with ctr=0
» Increment ctr if ctr>0; reset when ctr reaches 1024

One of the primary goals of the proposed research is to enable efficient implementation of such counter conditions
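A counter version (H-cFA) for the length-restricted rule can be sketched as follows, using N = 4 instead of the slide's 1024 to keep the demo readable, and covering only the ab[^a]{N}c part:

```python
# H-cFA sketch: the one-bit history flag becomes a counter. The counter
# starts when "ab" completes, counts non-'a' characters, and a 'c' seen
# at exactly ctr == N reports a match. Semantics reconstructed for
# illustration; the paper's machine encodes this as counter conditions
# on transitions.

N = 4   # the slide uses 1024

def hcfa_scan(data):
    state = 'start'
    active, ctr = False, 0      # the counter lives in the history buffer
    for ch in data:
        if active and ctr == N and ch == 'c':
            return True         # exactly N non-'a' characters, then 'c'
        if ch == 'a':
            active = False      # [^a] violated: abandon the count
            state = 'a'
            continue
        if active:
            ctr += 1
            if ctr > N:
                active = False  # run longer than {N}: abandon
        if ch == 'b' and state == 'a':
            active, ctr = True, 0   # "ab" completed: start counting
        state = 'start'
    return False
```

Without the counter, a DFA for this rule would need on the order of N extra states; here the same bookkeeping costs one counter in the history buffer.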

Page 27: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Early Results

Source       | # closures, # length restr. | # automata | Composite DFA states | # flags | # counters | H-FA/H-cFA states | Max transitions/char | Total transitions | % space reduction with H-FA | H-FA parsing rate speedup
Cisco64      | 14, 1 | 1 | 132784 | 6  | 0 | 3597 | 2 | 1215450 | 94.69 | -
Cisco64      | 14, 1 | 1 | 132784 | 13 | 0 | 1861 | 8 | 682718  | 96.77 | -
Cisco68      | 19, 1 | 1 | 328664 | 17 | 0 | 2956 | 8 | 1337293 | 97.03 | -
Snort rule 1 | 6, 6  | 3 | 62589  | 5  | 6 | 583  | 8 | 238107  | 97.40 | 3x
Snort rule 2 | 1, 2  | 1 | 12703  | 1  | 2 | 71   | 2 | 27498   | 98.58 | -
Snort rule 3 | 5, 1  | 2 | 4737   | 5  | 1 | 116  | 4 | 46124   | 93.48 | 2x
Linux70      | 11, 0 | 2 | 20662  | 9  | 0 | 1304 | 8 | 546378  | 81.63 | 2x

Page 28: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 29: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


IP Address Lookup

Routing tables at router input ports contain (prefix, next hop) pairs
The address in the packet is compared to the stored prefixes, starting at the left
The prefix that matches the largest number of address bits is the desired match
The packet is forwarded to the specified next hop

Routing table (prefix, next hop):
1*    5
00*   3
01*   5
0*    7
001*  2
011*  3
1011* 4

address: 0110 0100 1000

Page 30: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Address Lookup Using Tries

Prefixes are stored in "alphabetical order" in a tree
Prefixes are "spelled" out by following a path from the top
» green dots mark prefix ends
To find the best prefix, spell out the address in the tree
The last green dot marks the longest matching prefix

address: 0110 0100 1000

[Figure: the binary trie for the routing table above; spelling out the address ends at prefix 011*, next hop 3]
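The walk described above can be sketched directly over the slide's routing table; a minimal binary trie with a "last prefix seen" marker:

```python
# Longest-prefix match over the slide's routing table using a binary
# trie: walk the address bit by bit, remembering the deepest prefix end
# passed on the way down.

routes = {'1': 5, '00': 3, '01': 5, '0': 7, '001': 2, '011': 3, '1011': 4}

# build the trie: each node is a dict with optional '0'/'1' children
# and an optional 'hop' marking a prefix end (the "green dot")
root = {}
for prefix, hop in routes.items():
    node = root
    for bit in prefix:
        node = node.setdefault(bit, {})
    node['hop'] = hop

def lookup(address):
    node, best = root, None
    for bit in address:
        if bit not in node:
            break                       # fell off the trie: stop spelling
        node = node[bit]
        best = node.get('hop', best)    # remember the last prefix end
    return best
```

`lookup('011001001000')` returns 3, matching the slide's example: the walk passes 0* (hop 7), then 01* (hop 5), and finally 011* (hop 3) before falling off the trie.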

Page 31: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Pipelined Trie-based IP-lookup

Tree data structure, prefixes in leaves (leaf pushing)
Process the IP address level by level to find the longest match
Each level is placed in a different pipeline stage → overlap multiple packets

[Figure: a leaf-pushed trie with prefixes P1-P7 (P4 = 10010*), mapped level by level onto pipeline stages]

Stages of different size:
- Require more memory
- The largest stage becomes the bottleneck

Page 32: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Circular Pipeline, ANCS'06

Use a circular pipeline and allow requests to enter/exit at any stage

Mapping:
» Divide the trie into multiple sub-tries
» Map each sub-trie with its root starting at a different stage

Page 33: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Mapping in Circular Pipeline

[Figure: the trie for prefixes 00* P1, 000* P2, 010* P3, 01001* P4, 01011* P5, 011* P6, 110* P7, 111* P8 is divided into three sub-tries. A direct index table handles the first 2 bits of the address and selects the entry stage: 00* enters at pipeline stage 1 (sub-trie 1), 01* at stage 2 (sub-trie 2), 11* at stage 3 (sub-trie 3), and 10* has no match]

Page 34: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Circular Pipeline

Benefits:
» Uniform stage sizes
» Less memory: no over-provisioning is needed in the face of arbitrary trie shapes
» Higher throughput

Page 35: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


New Direction: HEXA

HEXA (History-based Encoding, eXecution and Addressing)
» Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes

If the labels of the path leading to every node are unique, then these labels can be used to identify the node
» In a trie, every node has a unique path starting at the root
» Thus, the labels along the path become the identifier of the node
» Note that these labels need not be explicitly stored

Page 36: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Traditional Implementation

Prefixes: 1* P1, 00* P2, 11* P3, 011* P4, 0100* P5

[Figure: the corresponding trie, nodes numbered 1-9]

Node table (Addr: prefix flag, left child, right child):
1: 0, 2, 3
2: 0, 4, 5
3: 1, NULL, 6
4: 1, NULL, NULL
5: 0, 7, 8
6: 1, NULL, NULL
7: 0, 9, NULL
8: 1, NULL, NULL
9: 1, NULL, NULL

There are nine nodes, so we need 4-bit node identifiers
Total memory = 9 x 9 bits

Page 37: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


HEXA-based Implementation

Prefixes: 1* P1, 00* P2, 11* P3, 011* P4, 0100* P5

Define the HEXA identifier of a node as the path which leads to it from the root:
1. -
2. 0
3. 1
4. 00
5. 01
6. 11
7. 010
8. 011
9. 0100

Notice that these identifiers are unique. Thus, they can potentially be mapped to unique memory addresses.

Page 38: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


HEXA-based Implementation

Use hashing to map the HEXA identifier to a memory address

If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below:

f(-) = 4    f(00) = 2    f(010) = 5
f(0) = 7    f(01) = 8    f(011) = 3
f(1) = 9    f(11) = 1    f(0100) = 6

Addr  Fast path  Prefix
1     1,0,0      P3
2     1,0,0      P2
3     1,0,0      P4
4     0,1,1
5     0,1,0
6     1,0,0      P5
7     0,1,1
8     0,1,1
9     1,0,1      P1

Here we use only 3 bits per node in the fast path
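A lookup over this layout can be sketched as follows. The `f` table hard-codes the minimal perfect hash values from the slide, but the 3-bit encoding used here (has-0-child, has-1-child, plus an out-of-band next hop) is a simplification for readability, not the slide's exact bit layout:

```python
# HEXA sketch for the slide's trie (1* P1, 00* P2, 11* P3, 011* P4,
# 0100* P5): child addresses are recomputed from the path spelled so
# far instead of being stored as log2(n)-bit pointers.

f = {'-': 4, '0': 7, '1': 9, '00': 2, '01': 8, '11': 1,
     '010': 5, '011': 3, '0100': 6}

memory = {
    f['-']:    (True,  True,  None),   # root
    f['0']:    (True,  True,  None),
    f['1']:    (False, True,  'P1'),
    f['00']:   (False, False, 'P2'),
    f['01']:   (True,  True,  None),
    f['11']:   (False, False, 'P3'),
    f['010']:  (True,  False, None),
    f['011']:  (False, False, 'P4'),
    f['0100']: (False, False, 'P5'),
}

def lookup(bits):
    """Longest-prefix match; no child pointers are ever read."""
    has0, has1, hop = memory[f['-']]
    best, path = hop, ''
    for b in bits:
        if (b == '0' and not has0) or (b == '1' and not has1):
            break
        path += b
        has0, has1, hop = memory[f[path]]  # address from the path itself
        if hop is not None:
            best = hop
    return best
```

The node's identity lives entirely in the input history (the path), which is the core HEXA idea: the per-node storage shrinks to a few bits regardless of trie size.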

Page 39: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Devising a One-to-one Mapping

Finding a minimal perfect hash function is difficult
» A one-to-one mapping is essential for HEXA to work

Use discriminator bits
» Append c bits, which we are free to modify, to every HEXA identifier
» Thus a node has 2^c choices of identifier
» Notice that we need to store these c bits, so slightly more than 3 bits per node are needed

With multiple choices of HEXA identifier per node, we can reduce the problem to bipartite graph matching
» We need to find a perfect matching in the graph to map nodes to unique memory locations

Page 40: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Devising a One-to-one Mapping

[Figure: each of the nine input labels (-, 0, 1, 00, 01, 11, 010, 011, 0100) gets four choices of HEXA identifier by prepending the 2-bit discriminators 00, 01, 10, 11. Hashing each choice yields candidate memory locations among addresses 0-8, forming a bipartite graph between the nine nodes and the memory locations, in which a perfect matching is found]

Page 41: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Results

Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching
» Thus 2 bits per node suffice instead of log2(n) bits

[Figure: number of HEXA identifier choices needed versus number of trie nodes (100 to 1M), with no, 1%, 3%, and 10% memory over-provisioning]

Page 42: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Initial Results

Memory comparison to Eatherton's trie

[Figure: fast-path trie memory (MB) versus stride (1-6), with and without HEXA]

In the future:
» Complete evaluation of HEXA-based IP lookup: throughput, die size, and power estimates
» Extend HEXA to string matching and finite automata

Page 43: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 44: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Hash Tables

Suppose our hash function gave us the following values:
» hash("apple") = 5
» hash("watermelon") = 3
» hash("grapes") = 8
» hash("cantaloupe") = 7
» hash("kiwi") = 0
» hash("strawberry") = 9
» hash("mango") = 6
» hash("banana") = 2
» hash("honeydew") = 6

This is called a collision. Now what?

[Figure: buckets 0-9 holding kiwi (0), banana (2), watermelon (3), apple (5), mango (6), cantaloupe (7), grapes (8), strawberry (9); honeydew also hashes to the occupied bucket 6]

Page 45: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Collision Resolution Policies

Linear Probing
» Successively search for the first empty subsequent table entry

Chaining
» Link all collided entries at a bucket as a linked list

Double Hashing
» Uses a second hash function to successively index the table
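Two of these policies can be illustrated in one toy open-addressing table; the mod-based hash functions and the table size are chosen arbitrarily for the demo:

```python
# Toy open-addressing table showing linear probing and double hashing.

SIZE = 11   # prime table size so double-hashing steps cycle the table

def h1(key):
    return key % SIZE

def h2(key):
    return 1 + key % (SIZE - 1)   # secondary step size, never zero

def insert(table, key, policy='linear'):
    for i in range(SIZE):
        step = i if policy == 'linear' else i * h2(key)
        slot = (h1(key) + step) % SIZE
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError('table full')

table = [None] * SIZE
insert(table, 22)              # 22 % 11 == 0: lands in slot 0
insert(table, 33)              # collides at 0; linear probe finds slot 1
insert(table, 44, 'double')    # collides at 0; jumps by h2(44) == 5
```

Linear probing clusters collided keys next to each other (slots 0 and 1 here), while double hashing scatters them, which is why the probe-sequence lengths behave so differently under load.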

Page 46: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Performance Analysis

Average performance is O(1)
However, worst-case performance is O(n)
In fact the likelihood that a key lands at a distance > 1 is pretty high

[Figure: probability of key distance > 1 and > 2 versus load m/n (10-100%). Keys at distance > 1 take twice as long to probe, and those at distance > 2 three times as long, so there is a fairly high probability that throughput is two to three times lower than the peak]

Page 47: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Segmented Hashing, ANCS'05

Uses the power of multiple choices
» proposed earlier by Azar et al.

An N-way segmented hash
» Logically divides the hash table array into N equal segments
» Maps each incoming key onto one bucket from each segment
» Picks the bucket which is either empty or has the minimum number of keys

[Figure: a 4-way segmented hash table; each key maps to one bucket per segment and is inserted into the least-loaded one]
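The insertion rule can be sketched in a few lines; the per-segment mixing function below is a toy stand-in for independent hash functions:

```python
# N-way segmented hash sketch: each key maps to one bucket per segment,
# and is inserted into the bucket currently holding the fewest keys
# (the "power of multiple choices").

N_SEGMENTS, BUCKETS = 4, 8
segments = [[[] for _ in range(BUCKETS)] for _ in range(N_SEGMENTS)]

def bucket_for(key, seg):
    # toy per-segment hash: a multiplicative mix offset by the segment
    return (key * 2654435761 + seg * 40503) % BUCKETS

def insert(key):
    seg = min(range(N_SEGMENTS),
              key=lambda s: len(segments[s][bucket_for(key, s)]))
    segments[seg][bucket_for(key, seg)].append(key)
    return seg

for k in range(20):
    insert(k)
stored = sum(len(b) for seg in segments for b in seg)
longest = max(len(b) for seg in segments for b in seg)
```

With four choices, these 20 keys spread out so that no bucket holds more than one key; with a single segment the same keys would pile three deep in some buckets, which is the distance > 1 problem the previous slide quantified.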

Page 48: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Segmented Hash Performance

More segments improve the probabilistic performance
» With 64 segments, the probability that a key is inserted at distance > 2 is nearly zero even at 100% load
» The improvement in average-case performance is still modest

[Figure: probability of key distance > 1 and > 2 versus load m/n (10-100%) for 1, 4, 8, 16, 32, and 64 segments, on a log scale from 1E-15 to 1]

Page 49: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Adding per Segment Filters

[Figure: a Bloom filter of m_b bits in front of each segment; a key k_i is hashed by h_1 ... h_k into the chosen segment's filter. Since k_i can go to any of 3 candidate buckets, we can select any of those segments and insert the key into the corresponding filter]

Page 50: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Selective Filter Insertion Algorithm

[Figure: the same per-segment Bloom filters; k_i can go to any of 3 candidate buckets]

Insert the key into segment 4, since fewer bits would be set there. Fewer bits set => lower false-positive rate.

With more segments (or more choices), our algorithm sets far fewer bits in the Bloom filters.
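The selection rule can be sketched as follows; the filter size, hash count, and SHA-256-based bit derivation are toy parameters for the demo, not the paper's configuration:

```python
# Selective filter insertion sketch: among the candidate segments,
# insert where the key's Bloom-filter bits would set the fewest *new*
# bits, keeping every filter as sparse (and accurate) as possible.
import hashlib

M_BITS, K_HASHES, N_SEGMENTS = 64, 3, 4
filters = [set() for _ in range(N_SEGMENTS)]   # bit positions set so far

def bloom_bits(key, seg):
    digest = hashlib.sha256(f'{seg}:{key}'.encode()).digest()
    return {digest[i] % M_BITS for i in range(K_HASHES)}

def insert(key, candidates):
    new_bits = {s: len(bloom_bits(key, s) - filters[s]) for s in candidates}
    seg = min(candidates, key=new_bits.get)    # fewest new bits set
    filters[seg] |= bloom_bits(key, seg)
    return seg

for k in range(16):
    insert(k, range(N_SEGMENTS))
```

After insertion, membership in the chosen segment's filter always holds, while the other filters stay untouched; that asymmetry is what keeps the aggregate false-positive rate low.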

Page 51: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Problem with Segmented Hash

The Bloom filter size is proportional to the total number of elements
An O(1) lookup can be maintained even if we omit the Bloom filter of one segment
» With many segments of equal size, this omission does not lead to much reduction in Bloom filter size

An alternative is to use segments of different sizes and omit the Bloom filter of the largest segment
» If the largest segment is, say, 90% of the total memory, this results in a 90% reduction in Bloom filter size
» Peacock hashing utilizes this property

Page 52: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hashing

[Figure: keys k1-k7 from the universe of keys U are placed by hash functions h1-h5 into a series of segments of geometrically decreasing size]

Size of the 1st segment = 1
Size of the 2nd segment = c
Size of the ith segment = c x size of the (i-1)st segment

No element will be discarded until the first (smallest) segment is filled

Page 53: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

Use a Bloom filter for all segments but the largest
» Thus, for c = say 10, the Bloom filters are 10x smaller

Lookup is straightforward
» First consult all Bloom filters
» If none of them reports a membership, look up in the largest segment
» Else look up in the segments which report a membership

To enable deletes we require counting Bloom filters, but the counters can be kept in the slow path

Deletes, however, lead to imbalance in the loading
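The structure and lookup path can be sketched as follows; the segment sizes, the growth factor c = 4, the cascade insert order, and the one-hash "Bloom filter" are all toy simplifications:

```python
# Peacock-hash sketch: Bloom filters guard every segment except the
# largest, so a lookup usually costs one probe into the big segment
# plus a few cheap filter checks.

sizes = [64, 16, 4, 1]                 # largest first; each 4x smaller
segments = [dict() for _ in sizes]     # slot -> (key, value)
filters = [None] + [set() for _ in sizes[1:]]   # no filter on segment 0

def _slot(key, i):
    return hash((key, i)) % sizes[i]

def insert(key, value):
    for i in range(len(sizes)):        # overflow cascades to smaller segments
        slot = _slot(key, i)
        if slot not in segments[i]:
            segments[i][slot] = (key, value)
            if filters[i] is not None:
                filters[i].add(hash(key) % 32)   # toy 1-hash Bloom filter
            return i
    return None                        # discarded: all segments collided

def lookup(key):
    for i in range(1, len(sizes)):     # filtered (small) segments first
        if hash(key) % 32 in filters[i]:
            entry = segments[i].get(_slot(key, i))
            if entry and entry[0] == key:
                return entry[1]
    entry = segments[0].get(_slot(key, 0))     # unfiltered largest segment
    if entry and entry[0] == key:
        return entry[1]
    return None
```

Because the small backup segments hold few keys, their filters stay tiny and sparse; only the filter-free largest segment needs an unconditional probe.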

Page 54: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

A series of "delete and insert" operations may lead to overflow of the smaller segments

[Figure: discard rate (%) per segment over simulation time (sampling interval 1000); once the second phase begins, segments 1-6 start discarding elements]

Page 55: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Peacock Hash

Following every delete we perform a re-balancing, i.e. search the smaller segments and move elements to a larger segment if possible

[Figure: the same simulation with re-balancing; the per-segment discard rates are substantially reduced]

Page 56: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Issues and Future Directions

It is not clear how to perform re-balancing efficiently
» In the previous simulation, we use a brute-force approach and search the entire segment, leading to O(n) re-balancing cost

Complicating factors:
» Collision length higher than 1 in some segments
» Double hashing collision policy
» Use of 2-ary hashing may improve the efficiency, but will again complicate the re-balancing

Future research objectives:
» Develop an efficient re-balancing algorithm
» Develop Bloom filters that better utilize the power of multiple choices
» Extend the scheme to memory segments with different bandwidth and access latency

Page 57: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Overview of the Presentation

Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results

IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA

Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing

Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist

Page 58: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Packet Buffering and Queuing

The first objective is to extend the multichannel packet buffer architecture to DRAM memories
We also plan to consider memories with different sizes, bandwidths, and access latencies
» Extension of
– Sailesh Kumar, Patrick Crowley, and Jonathan Turner, "Design of Randomized Multichannel Packet Storage for High Performance Routers", Proceedings of IEEE Symposium on High Performance Interconnects (HotI-13), Stanford, August 17-19, 2005.

Work on an NP-specific queuing hardware assist
» Extension of
– Sailesh Kumar, John Maschmeyer, and Patrick Crowley, "Queuing Cache: Exploiting Locality to Ameliorate Packet Queue Contention and Serialization", Proceedings of ACM International Conference on Computing Frontiers (ICCF), Ischia, Italy, May 2-5, 2006.

Page 59: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


The proposed research is expected to take one year

Acknowledgments
» Jon Turner
» Patrick Crowley
» Michela Becchi
» Sarang Dharmapurikar
» John Lockwood
» Roger Chamberlain
» Robert Morley
» Balakrishnan Chandrasekaran
» Michael Mitzenmacher, Harvard Univ.
» George Varghese, UCSD
» Will Eatherton, Cisco
» John Williams, Cisco

Page 60: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms


Questions???