Guarantee IP Lookup Performance
with FIB Explosion
Tong Yang (ICT), Gaogang Xie (ICT), Yanbiao Li (HNU), Qiaobin Fu (ICT)
Alex X. Liu (MSU), Qi Li (ICT), Laurent Mathy (ULG)
Performance Issue in IP Lookup
FIBs keep growing: about 15% per year; FIB size has reached 512,000 entries.
The 512k bug: in August 2014, Cisco warned that web browsing speeds could slow over the following week as old hardware was upgraded to handle FIBs larger than 512K entries.
[Figure: FIB size over time, now crossing the 512k mark]
Motivation
On-chip vs. off-chip memory: on-chip is about 10 times faster, but limited in size. As FIBs keep growing, they no longer fit on chip, so almost all packets require off-chip accesses.
An ideal IP lookup algorithm combines:
– a constant yet small on-chip memory footprint for the FIB
– a constant yet fast lookup speed (low time complexity)
State-of-the-art
Achieving constant IP lookup time
– TCAM-based
– Trie pipeline using FPGA
– full-expansion
– DIR-24-8
Achieving small memory
– Based on Bloom Filter
– Level compression, path compression
– LC-trie
How to satisfy both constant lookup time and small on-chip memory usage?
SAIL Framework
Observation: almost all packets match prefixes of length 0~24.
Two-dimensional splitting:
– Splitting the lookup process: finding the prefix length vs. finding the next hop
– Splitting by prefix length: 0~24 vs. 25~32
On-chip: bit map arrays for finding the prefix length (lengths 0~24).
Off-chip: next hop arrays for lengths 0~24, plus all tables for lengths 25~32.
[Figure: per-level bit map arrays (on-chip) paired with next hop arrays (off-chip)]
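The per-level layout can be sketched as follows (a toy Python sketch, not the paper's implementation; `B`, `N`, and the helper names are illustrative):

```python
# Sketch: per-level bit map arrays (on-chip in SAIL) paired with
# next-hop arrays (off-chip). B[l][i] == 1 means the trie has a solid
# node for the i-th bit string of length l; N[l][i] holds its next hop.
LEVELS = 4  # toy size; SAIL keeps levels 0..24 on chip

B = [[0] * (1 << l) for l in range(LEVELS + 1)]  # bit map arrays
N = [[0] * (1 << l) for l in range(LEVELS + 1)]  # next-hop arrays

def insert(prefix_bits, length, next_hop):
    """Mark a prefix of the given length as a solid node."""
    B[length][prefix_bits] = 1
    N[length][prefix_bits] = next_hop

def longest_match(addr_bits, addr_len):
    """Scan levels from long to short; return the next hop of the
    longest matching prefix, or None if nothing matches."""
    for l in range(min(addr_len, LEVELS), -1, -1):
        idx = addr_bits >> (addr_len - l)
        if B[l][idx]:
            return N[l][idx]
    return None

insert(0b1, 1, 4)                # 1*/1  -> next hop 4
insert(0b01, 2, 3)               # 01*/2 -> next hop 3
print(longest_match(0b0110, 4))  # matches 01* -> 3
```

Finding the prefix length only touches the bit maps; the next-hop array is read once, after the length is known.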
Splitting
The original trie is split: levels 0~24 hold the short prefixes, levels 25~32 the long prefixes.
Keeping one bit map per level 0~24 on chip costs at most: sum of 2^i for i = 0..24 ≈ 2^25 bits = 4 MB.
How to avoid searching both short and long prefixes?
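The 4 MB figure is just the count of all possible trie nodes in levels 0~24 at one bit each; a quick arithmetic check:

```python
# One bit per possible trie node at each level 0..24:
total_bits = sum(2**i for i in range(25))   # = 2**25 - 1 = 33554431
total_mb = total_bits / 8 / 2**20           # bits -> bytes -> MB
print(total_mb)                             # just under 4.0
```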
Pivot Pushing & Lookup

Example FIB:
prefix      next hop
*/0         6
1*/1        4
01*/2       3
001*/3      3
111*/3      7
0011*/4     1
1110*/4     8
11100*/5    2
001011*/6   9

[Figure: (a) the trie for this FIB (nodes A–H, O), (b) the trie after pivot pushing to level 4, (c) bit maps B0–B4 with next hop arrays N3, N4]

Lookup 001010, pivot level 4:
– B4[001010 >> 2] = 1 → the address hits a node at the pivot level
– N4[2] = 0 → the stored next hop is the marker for a long prefix: continue the lookup off-chip
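The example can be sketched in code (a toy sketch, not the paper's implementation: pivot level 4 instead of 24, a plain dict standing in for the off-chip long-prefix table, next hop 0 used as the pivot marker as in the figure, and the handling of 1110*/4 by leaf-pushing its next hop to level 5 is my assumption):

```python
# Toy SAIL_B-style lookup after pivot pushing, pivot level 4.
PIVOT = 4
B = [[0] * (1 << l) for l in range(PIVOT + 1)]  # bit maps, levels 0..4
N = [[0] * (1 << l) for l in range(PIVOT + 1)]  # next-hop arrays

def add(bits, length, hop):
    B[length][bits] = 1
    N[length][bits] = hop

# The example FIB, lengths 0..4; next hop 0 at the pivot is a marker.
add(0b0, 0, 6)       # */0    -> 6
add(0b1, 1, 4)       # 1*/1   -> 4
add(0b01, 2, 3)      # 01*/2  -> 3
add(0b001, 3, 3)     # 001*/3 -> 3
add(0b111, 3, 7)     # 111*/3 -> 7
add(0b0011, 4, 1)    # 0011*/4 -> 1
add(0b1110, 4, 0)    # marker: descendants of 1110 live off-chip
add(0b0010, 4, 0)    # marker: 001011*/6 lives off-chip

# Off-chip table for prefixes longer than the pivot; 1110*/4's next
# hop 8 is leaf-pushed to 11101 (11100 keeps the /5 entry's 2).
long_table = {(0b11100, 5): 2, (0b11101, 5): 8, (0b001011, 6): 9}

def lookup(addr, addr_len=6):
    for l in range(PIVOT, -1, -1):          # longest level first
        idx = addr >> (addr_len - l)
        if B[l][idx]:
            if l == PIVOT and N[l][idx] == 0:
                # marker hit: search the off-chip long-prefix table
                for length in range(addr_len, PIVOT, -1):
                    hop = long_table.get((addr >> (addr_len - length), length))
                    if hop is not None:
                        return hop
                continue                    # no long match: fall back
            return N[l][idx]
    return None

print(lookup(0b001010))  # marker at B4[0010], no long match -> 001*/3 -> 3
print(lookup(0b001011))  # marker, then off-chip hit 001011*/6 -> 9
```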
Update of SAIL_B

(Same example FIB, trie, and bit maps as on the previous slide.)

– Insert 10*: set B2[10] = 1 (one on-chip memory access)
– Delete 111*: set B3[111] = 0 (one on-chip memory access)
– Changing the next hop of 001*, or inserting 0010*: only the off-chip tables need updating
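A minimal sketch of why these updates cost one on-chip access each (bit map names as in the figure; the off-chip next-hop bookkeeping is omitted):

```python
# Sketch: SAIL_B updates touch at most one on-chip bit map entry.
B2 = [0] * 4           # bit map for level 2 (indices 00..11)
B3 = [0] * 8           # bit map for level 3
B3[0b111] = 1          # 111*/3 is present initially

def insert_prefix(bitmap, bits):
    bitmap[bits] = 1   # one on-chip write (next hop goes off-chip)

def delete_prefix(bitmap, bits):
    bitmap[bits] = 0   # one on-chip write

insert_prefix(B2, 0b10)    # insert 10*  -> B2[10] = 1
delete_prefix(B3, 0b111)   # delete 111* -> B3[111] = 0
print(B2[0b10], B3[0b111])
```

Changing the next hop of an existing prefix, or inserting a prefix whose bit is already set, never touches the bit maps at all.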
Optimization
SAIL_B
– Lookup: 25 on-chip memory accesses in worst case
– Update: 1 on-chip memory access
Lookup Oriented Optimization (SAIL_L)
– Lookup: 2 on-chip memory accesses in worst case
– Update: unbounded, low average update complexity
Update Oriented Optimization (SAIL_U)
– Lookup: 4 on-chip memory accesses in worst case
– Update: 1 on-chip memory access
Extension: SAIL for Multiple FIBs (SAIL_M)
SAIL_L

Prefixes are pushed to levels 16, 24, and 32.

[Figure: SAIL_L lookup flowchart]
– If B16 bit == 0: the next hop is in N16 (resolved on chip)
– Else if B24 bit == 0: the next hop is in N24 (resolved on chip)
– Else: the next hop is in N32 (one off-chip access)
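The decision chain can be sketched as follows (a toy Python sketch under my own assumptions: separate per-level bit map and next-hop arrays, dicts standing in for the level-24 and off-chip level-32 tables, and a hypothetical FIB of 10.0.0.0/8 plus 10.1.2.0/24):

```python
# Sketch of the SAIL_L lookup chain for 32-bit IPv4 addresses:
# at most 2 on-chip accesses (levels 16 and 24), then one off-chip
# access (level 32) for the longest prefixes.
B16 = [0] * (1 << 16)   # bit == 1: longer prefixes exist below
N16 = [0] * (1 << 16)   # next hops for prefixes pushed to level 16
B24 = {}                # dicts stand in for the level-24 arrays
N24 = {}
N32 = {}                # stands in for the off-chip level-32 table

def lookup(addr):
    i16 = addr >> 16
    if B16[i16] == 0:
        return N16[i16]        # resolved by the 1st on-chip access
    i24 = addr >> 8
    if B24.get(i24, 0) == 0:
        return N24.get(i24)    # resolved by the 2nd on-chip access
    return N32.get(addr)       # longest prefixes: one off-chip access

# Hypothetical FIB: 10.0.0.0/8 -> hop 1, plus 10.1.2.0/24 -> hop 2.
base = 10 << 24
for i in range(1 << 8):                 # leaf-push the /8 to level 16
    N16[(base >> 16) + i] = 1
i16 = (base + (1 << 16)) >> 16          # the 10.1.0.0/16 region
B16[i16] = 1                            # longer prefixes under 10.1
for i in range(1 << 8):                 # re-push the /8 to level 24
    N24[(i16 << 8) + i] = 1
N24[(base + (1 << 16) + (2 << 8)) >> 8] = 2   # 10.1.2.0/24 -> hop 2

print(lookup(base + (5 << 16) + 7))             # 10.5.0.7 -> 1
print(lookup(base + (1 << 16) + (2 << 8) + 9))  # 10.1.2.9 -> 2
```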
SAIL_U

• Pushing to levels 6, 12, 18, and 24.
• Adjacent levels are at most 6 apart, so one update affects at most 2^6 = 64 bits in a bitmap array, which fit in a single memory word.
• Thus at most one on-chip memory access is still enough for each update.
SAIL_M

For multiple FIBs (e.g. virtual routers), merge the individual tries into one overlay trie; each overlay node stores one next hop per FIB.

[Figure: (a) Trie 1 (A: 00*, C: 10*, G: 110*), (b) Trie 2 (A: 00*, C: 10*, E: 100*), (c) the overlay trie covering A: 00*, B: 01*, E: 100*, F: 101*, G: 110*, H: 111*]
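The merge step can be sketched as collecting, for every prefix, a next-hop vector with one slot per FIB (a toy sketch; dicts stand in for the tries, and the leaf-pushing shown in the figure is omitted):

```python
# Sketch: merging per-FIB tries into one overlay trie. Each overlay
# node maps (prefix_bits, length) -> [hop_for_fib0, hop_for_fib1, ...].
def build_overlay(fibs):
    """fibs: list of dicts {(bits, length): next_hop}, one per FIB."""
    overlay = {}
    for fib_id, fib in enumerate(fibs):
        for key, hop in fib.items():
            vec = overlay.setdefault(key, [None] * len(fibs))
            vec[fib_id] = hop
    return overlay

trie1 = {(0b00, 2): 'A', (0b10, 2): 'C', (0b110, 3): 'G'}
trie2 = {(0b00, 2): 'A', (0b10, 2): 'C', (0b100, 3): 'E'}
overlay = build_overlay([trie1, trie2])

print(overlay[(0b00, 2)])   # shared prefix: ['A', 'A']
print(overlay[(0b110, 3)])  # only in FIB 0: ['G', None]
```

A lookup then walks the overlay trie once and indexes the resulting vector by FIB id, so the on-chip bit maps are shared by all FIBs.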
SAILs in worst case

Algorithm   On-chip memory   Lookup (on-chip)   Update (on-chip)
SAIL_B      = 4 MB           25                 1
SAIL_L      ≤ 2.13 MB        2                  unbounded
SAIL_U      ≤ 2.03 MB        4                  1
SAIL_M      ≤ 2.13 MB        2                  unbounded

Worst case: 2 off-chip memory accesses per lookup.
Implementations

FPGA: Xilinx ISE 13.2 IDE; Xilinx Virtex-7 device; 8.26 MB on-chip memory
– SAIL_B, SAIL_U, and SAIL_L
Intel CPU: Core i7-3520M, 2.9 GHz; 64 KB L1, 512 KB L2, 4 MB L3; 8 GB DRAM
– SAIL_L and SAIL_M
GPU: NVIDIA Tesla C2075 (1147 MHz, 5376 MB device memory, 448 CUDA cores) with an Intel Xeon E5-2630 (2.30 GHz, 6 cores)
– SAIL_L
Many-core: Tilera TLR4-03680, 36 cores, 256 KB L2 cache per core
– SAIL_L
Evaluation
FIBs
– Real FIB from a tier-1 router in China
– 18 real FIBs from www.ripe.net
Traces
– Real packet traces from the same tier-1 router
– Randomly generated packet traces
– Packet traces generated according to the FIBs
Compared with
– PBF [SIGCOMM 03]
– LC-trie [used in the Linux kernel]
– Tree Bitmap
– Lulea [SIGCOMM 97 best paper]
FPGA Simulation

[Figure: on-chip memory usage (up to 1.2 MB) per FIB (rrc00–rrc15), SAIL_L vs. PBF]

SAIL algorithm   Lookup speed   Throughput
SAIL_B           351 Mpps       112 Gbps
SAIL_U           405 Mpps       130 Gbps
SAIL_L           479 Mpps       153 Gbps
Intel CPU: real FIB and traces

[Figure: lookup speed (Mpps, up to 800) of LC-trie, TreeBitmap, Lulea, and SAIL_L]
Intel CPU: 12 FIBs using prefix-based and random traces

[Figure: lookup speed (Mpps) across the 12 FIBs, prefix-based traffic vs. random traces]
Intel CPU: Update

[Figure: memory accesses per update (0–14) over sequences of updates (×500) for rrc00, rrc01, and rrc03, with per-FIB averages]
GPU: Lookup speed vs. batch size

[Figure: lookup speed (Mpps) per FIB (rrc00–rrc15) for batch sizes 30, 60, and 90]
GPU: Lookup latency vs. batch size

[Figure: lookup latency (microseconds) per FIB (rrc00–rrc15) for batch sizes 30, 60, and 90]
Tilera GX-36: Lookup speed vs. # of cores

[Figure: lookup speed (pps, up to 700M) as the number of cores grows from 2 to 34]
Conclusion

Two-dimensional splitting framework: SAIL
Three optimization algorithms
– SAIL_U, SAIL_L, SAIL_M
– At most 2.13 MB on-chip memory usage
– At most 2 off-chip memory accesses per lookup
Suitable for different platforms
– FPGA, CPU, GPU, many-core
– Up to 673.22~708.71 Mpps
Future work: extending SAIL to IPv6 lookup
Source code of SAIL, LC-trie, Tree Bitmap, and Lulea:
http://fi.ict.ac.cn/firg.php?n=PublicationsAmpTalks.OpenSource

Thanks!
http://fi.ict.ac.cn