Feature Rich Flow Monitoring with P4 - Netronome...2. Evict to CPU on collision. Packets MicroFlowGenerator.P4 MicroFlowAggregator.c P4 Hardware CPU Flow Records Pkt Ct. 11 Hash Key

1 ©2017 Open-NFP

Feature Rich Flow Monitoring with P4

John Sonchack University of Pennsylvania

2 ©2017 Open-NFP

Outline

▪  Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪  Benchmarks and Optimizations

3 ©2017 Open-NFP

Outline

▪  Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪  Benchmarks and Optimizations

4 ©2017 Open-NFP

Introduction: Flow Records

▪  A flow record summarizes groups of packets in the same TCP / UDP stream

Flow key (IP 5-tuple)

Flow features (statistics and meta-data)

5 ©2017 Open-NFP

Introduction: Flow Record Applications

▪  Flow records are input to analysis applications

▪  Examples: •  Security: Detect botnets, intrusions,

DoS attacks, port scans •  Traffic management: Traffic

classification, QoS routing, flow scheduling, load balancing

•  Debugging: Performance monitoring, loop and black hole detection

•  More: –  “A survey of network flow

applications” B. Li, et al. –  “An overview of ip flow-based

intrusion detection” A.Sperotto, et al. –  NetFlow/IPFIX applications

Flow records

Information

Reconfiguration

6 ©2017 Open-NFP

Introduction: Flow Record Benefits

▪  Flow records reduce monitoring costs

▪  Important for high coverage monitoring •  visibility into many (or all) links •  costs add up

Packet Headers Flow Records Volume 2GB per second 1.38MB per second

Rate 328k per second 8.9k per second Tablesource:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)

Flow records

Information

Reconfiguration

7 ©2017 Open-NFP

Introduction: Flow Record Richness

▪  An important quality of flow records is their information richness:

Accuracy: Account for every packet and flow on monitored links

Feature richness: provide the features that applications use

8 ©2017 Open-NFP

Introduction: Flow Record Generation

▪  The problem: current approaches trade overhead for information richness

So#wareGenera+on SamplingSwitches DedicatedASICs

+Informa+onrich-Highoverhead

-Reducesaccuracy+Lowoverhead

-Limitedfeatureset+Lowoverhead

Flow recordsSampled flow records Fixed flow records

9 ©2017 Open-NFP

Introduction: our Goal

▪  Flow record generation that is efficient, accurate, and feature rich

+Informa+onrich+Lowoverhead

Thiswork:so#waregenera+on+P4hardwareaccelera+onSo#waregenerators

+Informa+onrich-Highoverhead

Flow recordsFlow records

+

10 ©2017 Open-NFP

Outline

▪  Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record

Generation ▪  Benchmarks and Optimizations

11 ©2017 Open-NFP

Packets

Flow Records

Flow Records

Microflow Records

Packets

P4 Accelerated Flow Record Generation ▪  Main Idea: Use P4 hardware to preprocess packets into

micro flow records that summarize per-flow packet bursts

Totalwork:map8

packetsto3flowrecords

Totalwork:map4micro

flowrecordsto3flowrecords

•  Informa+onrich•  featuresarefully

customizable•  Efficient

•  P4hardwarereducesCPUworkload

So#wareflowrecordgenera+on

P4acceleratedflowrecordgenera+on

12 ©2017 Open-NFP

First attempt: CPU Managed Tables

PacketsFlow Record Generator.P4

Insert

Update

Key Pkt Ct.

1

12

FlowGenerator_controller.c

RemoveInsert

CPU

Flow Records

P4 Hardware Forwarding Table

13 ©2017 Open-NFP

PacketsFlow Record Generator.P4

Insert

Update

Key Pkt Ct.

1

12

FlowGenerator_controller.c

RemoveInsert

CPU

Flow Records

P4 Hardware Forwarding Table

First attempt: CPU Managed Tables

▪  Bottleneck: table update operations ▪  P4 designed for forwarding tables that are not highly dynamic

NotLineRate!

Flowarrivalrate:~50kflowspersec.(10Gbit/sInternetlink)

Controllerruleupdaterate:~200–5kpersecond(hwdep.)

14 ©2017 Open-NFP

A Better Design: Minimal CPU Interaction 1.  Use a flat array, i.e., P4 register, indexed by hash to store records.

Packets

MicroFlowGenerator.P4

MicroFlowAggregator.c

P4 Hardware

CPUFlow Records

Pkt Ct.8

Hash

Key1.1.1.1—>…

52.2.2.2—>…

15 ©2017 Open-NFP

A Better Design: Minimal CPU Interaction 1.  Use a flat array, i.e., P4 register, indexed by hash to store records.

Packets



P4 Hardware

CPUFlow Records

Pkt Ct.8

Hash

Key1.1.1.1—>…

52.2.2.2—>…

16 ©2017 Open-NFP

1.  Use a flat array, i.e., P4 register, indexed by hash to store records.

Packets



P4 Hardware

CPUFlow Records

Pkt Ct.8

Hash

Key1.1.1.1—>…

52.2.2.2—>…

A Better Design: Minimal CPU Interaction

17 ©2017 Open-NFP


Packets



P4 Hardware

CPUFlow Records

Pkt Ct.9

Hash

Key1.1.1.1—>…

52.2.2.2—>…


18 ©2017 Open-NFP


Packets



P4 Hardware

CPUFlow Records

Pkt Ct.11

Hash

Key1.1.1.1—>…

72.2.2.2—>…


19 ©2017 Open-NFP


Packets



P4 Hardware

CPUFlow Records

Pkt Ct.11

Hash

Key1.1.1.1—>…

72.2.2.2—>…

Collision


20 ©2017 Open-NFP

1.  Use a flat array, i.e., P4 register, indexed by hash to store records. 2.  Evict to CPU on collision.

Packets



P4 Hardware

CPUFlow Records

Pkt Ct.11

Hash

Key1.1.1.1—>…

13.3.3.3—>…

Insert

Normal Hash Table

72.2.2.2—>…

Collision


21 ©2017 Open-NFP

1.  Use a flat array, i.e., P4 register, indexed by hash to store records. 2.  Evict to CPU on collision.

Packets



P4 Hardware

CPUFlow Records

Pkt Ct.11

Hash

Key1.1.1.1—>…

13.3.3.3—>…

Insert

Normal Hash Table

72.2.2.2—>…

Timeout


22 ©2017 Open-NFP

A Simple Implementation

23 ©2017 Open-NFP

Packets

Flow Records

Flow Records

Microflow Records

Packets

P4 Accelerated Flow Record Generation ▪  Main Idea: Use P4 hardware to preprocess packets into

micro flow records that summarize per-flow packet bursts

Totalwork:map8

packetsto3flowrecords

Totalwork:map4micro

flowrecordsto3flowrecords

•  Informa+onrich•  featuresarefully

customizable•  Efficient

•  P4HWreducescpuworkload

Commodityserverflowrecordgenera+on

P4acceleratedflowrecordgenera+on

24 ©2017 Open-NFP

Outline

▪  Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪ Benchmarks and Optimizations

25 ©2017 Open-NFP

Benchmarks and Optimizations 1.  How much does the P4 hardware reduce CPU workload? 2.  What is the maximum throughput of the P4 component? 3.  How can we optimize for the NFP-4000?

26 ©2017 Open-NFP

Benchmarks and Optimizations 1.  How much does the P4 hardware reduce CPU workload? 2.  What is the maximum throughput of the P4 component? 3.  How can we optimize for the NFP-4000?

27 ©2017 Open-NFP

Benchmarks and Optimizations 1.  How much does the P4 hardware reduce CPU workload?

Workload:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)

Eachmicroflowrepresents~10packets,onaverage

28 ©2017 Open-NFP


Workload:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)

Eachmicroflowrepresents~10packets,onaverage

Sta$s$c Packetsaggrega+on Microflowaggrega+on

Avg.ratefor10Gbit/slink

~500k ~50k

CPUaggregatorthroughput(percore)

~600k ~600k

totalCPUmonitoringcapacity(percore)

~1x10Gbit/slink ~10x10Gbit/slinks

Packets

Flow Records

Flow Records

Microflow Records

Packets

29 ©2017 Open-NFP


(~10x with 1 MB of P4 HW memory) 2.  What is the maximum throughput of the P4 component? 3.  How can we optimize it for the NFP-4000?

30 ©2017 Open-NFP


(~10x with 1 MB of P4 HW memory) 2.  What is the maximum throughput of the P4 component?

31 ©2017 Open-NFP



packetsize bitrate@10Mpkts/sec

64 ~5Gbit/s

128 ~10Gbit/s

256 ~20Gbit/s

512 ~40Gbit/s

1024 ~80Gbit/s

Averagepacketsizeson10Gbit/sinternetrouterlinks(caida)

32 ©2017 Open-NFP



(~40-80 Gbit/s with average size packets)

packetsize bitrate@10Mpkts/sec

64 ~5Gbit/s

128 ~10Gbit/s

256 ~20Gbit/s

512 ~40Gbit/s

1024 ~80Gbit/s

Averagepacketsizeson10Gbit/sinternetrouterlinks(caida)

33 ©2017 Open-NFP



(~40-80 Gbit/s with average size packets) 3.  How can we optimize for the NFP-4000?

34 ©2017 Open-NFP




Op[miza[on1:finergrainedsemaphores

35 ©2017 Open-NFP




Op[miza[on1:finergrainedsemaphores

36 ©2017 Open-NFP

Core 1

Core 2

Pkt Ct.Key

53.3.3.3—>…

packet.key = 4.4.4.4 —> … packet.hash = 3

1: read key @ 3


2: read key @ 3

3: replace 3

4: update 3

Optimization 1: Fine Grained P4 Semaphores

Q: Are there race conditions?

37 ©2017 Open-NFP

Core 1

Core 2

Pkt Ct.Key

53.3.3.3—>…


1: read key @ 3


2: read key @ 3

3: replace 3

4: update 3


Incorrect!

Q: Are there race conditions? A: Yes.

38 ©2017 Open-NFP


Core 1

Core 2

Pkt Ct.Key

53.3.3.3—>…


1: read key @ 3


3: read key @ 3

2: replace 3

4: replace 3

WAIT

Q: Are there race conditions? A: Yes.

39 ©2017 Open-NFP


Core 1

Core 2

Pkt Ct.Key

53.3.3.3—>…


1: read key @ 3


3: read key @ 3

2: replace 3

4: replace 3

WAIT

40 ©2017 Open-NFP


Core 1

Core 2

Pkt Ct.Key

53.3.3.3—>…

1: read key @ 3

2: replace 3WAIT

Core 3

WAIT

Core N

WAIT

…

41 ©2017 Open-NFP


lock/unlockbycallingaP4ac[on.

Lockasingleentry,nottheen[rearray.

42 ©2017 Open-NFP




1.  Fine grained P4 semaphores

Op[miza[on2:combiningtables

43 ©2017 Open-NFP

Optimization 2: Combining Tables

Version Cycles/packet

Improvement

Baseline 5084 -

44 ©2017 Open-NFP



Improvement

Baseline 5084 -

computehashandgetkeymerged

4191 18%

45 ©2017 Open-NFP



Improvement

Baseline 5084 -


4191 18%

Changingtablesisasexpensiveas~5ememops(register_reads/writes).

46 ©2017 Open-NFP


Canwemergeallthetables?Challenge:

branches.

47 ©2017 Open-NFP


InC,wecouldusecondi[onalassignments.


branches.

48 ©2017 Open-NFP


InP4,wecanuseacondiAonalmask.


branches.

hdp://homolog.us/blogs/blog/2014/12/04/the-en[re-world-of-bit-twiddling-hacks/

Encodeasmodify_field+register_writeinP4.

49 ©2017 Open-NFP


hdp://homolog.us/blogs/blog/2014/12/04/the-en[re-world-of-bit-twiddling-hacks/


Improvement

Baseline 5084 -


4191 18%

singletable 2915 42%

Endresult:~20-30%bederthroughput.

50 ©2017 Open-NFP




1.  Fine grained P4 semaphores 2.  Combine tables, use conditional masks

51 ©2017 Open-NFP

Summary

▪  Flow records are powerful for high coverage, low overhead network monitoring.

▪ Generating them efficiently, without sacrificing information richness, is a challenge.

▪ Our P4 accelerator can increase flow generator capacity by factor of 10 or more, without sacrificing information richness.

▪  Table merging, conditional masks, and P4 accessible semaphores are useful and portable optimization techniques.

▪ Code available (soon) at: https://github.com/jsonch/p4_code ▪  Thank you for attending!

52 ©2017 Open-NFP

THANK YOU!

Documents

Feature Rich Flow Monitoring with P4 - Netronome...2. Evict to CPU on collision. Packets MicroFlowGenerator.P4 MicroFlowAggregator.c P4 Hardware CPU Flow Records Pkt Ct. 11 Hash Key