TurboFlow: Information Rich Flow Record Generation on … · 2020-02-06 · TurboFlow Architecture:...

Preview:

Citation preview

TurboFlow: Information Rich Flow Record Generation on

Commodity SwitchesJohn Sonchack1, Adam J. Aviv2, Eric Keller3, Jonathan M. Smith1

1University of Pennsylvania, 2USNA, 3University of Colorado

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

2

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebugging

3

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebuggingLow throughput because packets dropped at switch

2!

4

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebuggingCongestion at

switch 2!

5

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebuggingCongestion at

switch 2!

6

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebuggingHosts 2 and 4 are in a botnet!

From botnet 7

Introduction: Network Monitoring with Flow Records

Flow record <srcIp, dstIp,

srcPort, dstPort, arrivalTs,

avgInterArrival, pktsDroppedCt, queueLen, …>

Traffic Engineering SecurityDebugging

3.2$Tb/s >$100$M$packets/s

>$10$M$flows/s

8

Introduction: Network Monitoring with Flow Records

9

Flow Monitoring Switches: Prior Work

Sampled Packets InaccurateSampling Flow

Records

Custom Hardware Offloading

Server Offloading Expensive

Restrictive

Packets (or other records)

Flow Monitoring Switches: Prior Work

10

Sampled Packets InaccurateSampling

Flow Records

Flow Records

Packets (or other records)

Flow Records

Introduction: TurboFlowMain idea: Optimize instead of offload. Q : What can we get out of the programmable hardware in next-generation commodity switches?

Onboard MicroserversProgrammable Forwarding Engines

11

Introduction: TurboFlowMain idea: Optimize instead of offload. Q : What can we get out of the programmable hardware in next-generation commodity switches?

A : Flow record generation for multi-terabit rate traffic without sampling or offloading.

Onboard MicroserversProgrammable Forwarding Engines

12

13

Programmable Forwarding

Engine

MicroserverTurboFlow

Flow Record Generation

Flow Records

Packets

Pre-aggregation

Partial Flow

Records

Introduction: TurboFlow

• Introduction

• Architecture

• Evaluation

• Conclusion

14

Outline

Flow Records

Packets

Programmable Forwarding

Engine

MicroserverTurboFlow

Pre-aggregation

Partial Flow

Records

Flow Record Generation

15

TurboFlow Architecture

Flow Records

Packets

Programmable Forwarding

Engine

MicroserverTurboFlow

Partial Flow

Records

Pre-aggregation

Flow Record Generation

Forwarding Engine

Switch CPU

Background: Programmable Forwarding Engines

16

Stateful VariablesActionMatch

Forwarding Engine

Background: Programmable Forwarding Engines

17

Packet Count

Average Interarrival

Update 3 1 ms …Update 49 8 ms …Update 3 42 ms …

Match Stateful VariablesAction

Flow (IP 5-tuple)

A -> BE -> GF -> G

Switch CPU

Forwarding Engine

Background: Programmable Forwarding Engines

18

Switch CPU

Packet Count

Average Interarrival

Update 3 1 ms …Update 49 8 ms …Update 3 42 ms …

Match Stateful VariablesAction

Flow (IP 5-tuple)

A -> BE -> GF -> G

Table Manager

Rule installation rate: < 10 K / s

Flow arrival rate @ 1 Tb/s:

> 10,000 K / s

Forwarding Engine

Switch CPU

TurboFlow Architecture: Using the FE Efficiently

19

Table Manager

Forwarding Engine

Switch CPU

TurboFlow Architecture: Using the FE Efficiently

20

Current Flow

(IP 5-tuple)Packet Count

Average Interarrival

Match Stateful Variables

Forwarding Engine

Switch CPU

TurboFlow Architecture: Using the FE Efficiently

21

Match Stateful Variables

Flow Key Hash

1234

Current Flow

(IP 5-tuple)Packet Count

Average Interarrival

Table Manager

Forwarding Engine

Switch CPU

TurboFlow Architecture: Using the FE Efficiently

22

Flow Key (IP 5-tuple)

Packet Count

Average Interarrival

A -> B 4 3 ms …C -> D 49 8 ms …E -> F 3 42 ms …Z -> Q 9 10 ms …

Match Stateful Variables

Flow Key Hash

1234

HASH

Current Flow

(IP 5-tuple)Packet Count

Average Interarrival

A -> B 4 3 ms …C -> D 49 8 ms …E -> F 3 42 ms …Z -> Q 9 10 ms …

Tracked Flow: Update Counters

A->B

Forwarding Engine

Switch CPU

TurboFlow Architecture: Using the FE Efficiently

23

Match Stateful Variables

Flow Key Hash

1234

HASH

Tracked Flow: Update Counters

A->B

G->H

Untracked Flow: Replace colliding record, send it to CPU

Z -> Q: 9 10 ms …

Record Aggregator

Tracked Flow

(IP 5-tuple)Packet Count

Average Interarrival

A -> B 4 3 ms …C -> D 49 8 ms …E -> F 3 42 ms …G -> H 1 — …

24

TurboFlow Design

Flow Records

Packets

Programmable Forwarding

Engine

MicroserverTurboFlow

Partial Flow

Records

Pre-aggregation

Flow Record Generation

Count: 12Count: 2Count: 10

TurboFlow Architecture: Using the CPU Efficiently

25

Partial Flow

Records

Flow Records

Key Count

A->B 12

Flow Stats Dictionary

TurboFlow Architecture: Using the CPU Efficiently

Partial Flow

Records

Flow Records

Count: 12Key Count

A->B 12

Flow Stats Dictionary

Count: 2Count: 10

26

Optimization Performance Vs. Baseline

baseline (std::unordered_map)

-

Reduce Pointer Operations

1.64X

Vectorize Key Comparison

3.79X

Batch and Prefetch

4.9X

average of 146 cycles spent per partial flow record.

27

Outline

• Introduction

• Architecture

• Evaluation

• Conclusion

Flow Records

Packets

Programmable Forwarding

Engine

MicroserverTurboFlowOptimized

Flow Record Generation

Pre-aggregation

Partial Flow

Records

28

Implementation and Evaluation

Benchmark Workloads • 10 Gb/s Internet

Router Traces (CAIDA 2015)

• 144 Node Simulated Datacenter Cluster (YAPS simulator)

ImplementationsP4 Switch

(3.2 Tb/s Barefoot Tofino)

P4 SmartNIC (40 Gb/s Netronome NFP)

29

Implementation and Evaluation

Benchmark Workloads • 10 Gb/s Internet

Router Traces (CAIDA 2015)

• 144 Node Simulated Datacenter Cluster (YAPS simulator)

ImplementationsP4 Switch

(3.2 Tb/s Barefoot Tofino)

P4 SmartNIC (40 Gb/s Netronome NFP)

30

Required Average Throughput to Monitor 100 X 10 Gb/s Internet Links

0.0 0.2 0.4 0.6 0.8 1.00

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

31

0.0 0.2 0.4 0.6 0.8 1.00

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

Partial aggregation using 5 MB of FE memory reduces workload by ~4X.

Required Average Throughput to Monitor 100 X 10 Gb/s Internet Links

32

0.0 0.2 0.4 0.6 0.8 1.00

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

1 2 3 4Switch CPU Cores

0

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

std::unordered map Fully Optimized

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

Optimizations improve performance by ~5X.

Required Average Throughput to Monitor 100 X 10 Gb/s Internet Links

33

0.0 0.2 0.4 0.6 0.8 1.00

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

1 2 3 4Switch CPU Cores

0

10

20

30

40

Parti

alFl

owR

ecor

dpe

rSec

ond

(Mill

ions

)

std::unordered map Fully Optimized

No FE AggregationFE Aggregation with 5 MB FE Memory ( 26%)

FE pre-aggregation + optimizations = terabit rate workloads using 1 core and ~26% of FE memory.

Required Average Throughput to Monitor 100 X 10 Gb/s Internet Links

34

Outline

• Introduction

• TurboFlow Design

• Implementation and Evaluation

• Conclusion

In the Paper

Cost analysis

More interesting flow features Pipeline layouts

Psuedocode

Expected worst case analysis

More Evaluation

35

Conclusion (and Thank You for Listening!)

• Flow records are important for monitoring, but difficult to generate at the switch due to high traffic rates.

• TurboFlow is a flow record generator carefully optimized for next generation commodity switch hardware that scales to multi-terabit rate traffic without sampling.

36

Recommended