Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
1 ©2017 Open-NFP
Feature Rich Flow Monitoring with P4
John Sonchack University of Pennsylvania
2 ©2017 Open-NFP
Outline
▪ Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪ Benchmarks and Optimizations
3 ©2017 Open-NFP
Outline
▪ Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪ Benchmarks and Optimizations
4 ©2017 Open-NFP
Introduction: Flow Records
▪ A flow record summarizes groups of packets in the same TCP / UDP stream
Flow key (IP 5-tuple)
Flow features (statistics and meta-data)
5 ©2017 Open-NFP
Introduction: Flow Record Applications
▪ Flow records are input to analysis applications
▪ Examples: • Security: Detect botnets, intrusions,
DoS attacks, port scans • Traffic management: Traffic
classification, QoS routing, flow scheduling, load balancing
• Debugging: Performance monitoring, loop and black hole detection
• More: – “A survey of network flow
applications” B. Li, et al. – “An overview of ip flow-based
intrusion detection” A.Sperotto, et al. – NetFlow/IPFIX applications
Flow records
Information
Reconfiguration
6 ©2017 Open-NFP
Introduction: Flow Record Benefits
▪ Flow records reduce monitoring costs
▪ Important for high coverage monitoring • visibility into many (or all) links • costs add up
Packet Headers Flow Records Volume 2GB per second 1.38MB per second
Rate 328k per second 8.9k per second Tablesource:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)
Flow records
Information
Reconfiguration
7 ©2017 Open-NFP
Introduction: Flow Record Richness
▪ An important quality of flow records is their information richness:
Accuracy: Account for every packet and flow on monitored links
Feature richness: provide the features that applications use
8 ©2017 Open-NFP
Introduction: Flow Record Generation
▪ The problem: current approaches trade overhead for information richness
So#wareGenera+on SamplingSwitches DedicatedASICs
+Informa+onrich-Highoverhead
-Reducesaccuracy+Lowoverhead
-Limitedfeatureset+Lowoverhead
Flow recordsSampled flow records Fixed flow records
9 ©2017 Open-NFP
Introduction: our Goal
▪ Flow record generation that is efficient, accurate, and feature rich
+Informa+onrich+Lowoverhead
Thiswork:so#waregenera+on+P4hardwareaccelera+onSo#waregenerators
+Informa+onrich-Highoverhead
Flow recordsFlow records
+
10 ©2017 Open-NFP
Outline
▪ Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record
Generation ▪ Benchmarks and Optimizations
11 ©2017 Open-NFP
Packets
Flow Records
Flow Records
Microflow Records
Packets
P4 Accelerated Flow Record Generation ▪ Main Idea: Use P4 hardware to preprocess packets into
micro flow records that summarize per-flow packet bursts
Totalwork:map8
packetsto3flowrecords
Totalwork:map4micro
flowrecordsto3flowrecords
• Informa+onrich• featuresarefully
customizable• Efficient
• P4hardwarereducesCPUworkload
So#wareflowrecordgenera+on
P4acceleratedflowrecordgenera+on
12 ©2017 Open-NFP
First attempt: CPU Managed Tables
PacketsFlow Record Generator.P4
Insert
Update
Key Pkt Ct.
1
12
FlowGenerator_controller.c
RemoveInsert
CPU
Flow Records
P4 Hardware Forwarding Table
13 ©2017 Open-NFP
PacketsFlow Record Generator.P4
Insert
Update
Key Pkt Ct.
1
12
FlowGenerator_controller.c
RemoveInsert
CPU
Flow Records
P4 Hardware Forwarding Table
First attempt: CPU Managed Tables
▪ Bottleneck: table update operations ▪ P4 designed for forwarding tables that are not highly dynamic
NotLineRate!
Flowarrivalrate:~50kflowspersec.(10Gbit/sInternetlink)
Controllerruleupdaterate:~200–5kpersecond(hwdep.)
14 ©2017 Open-NFP
A Better Design: Minimal CPU Interaction 1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.8
Hash
Key1.1.1.1—>…
52.2.2.2—>…
15 ©2017 Open-NFP
A Better Design: Minimal CPU Interaction 1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.8
Hash
Key1.1.1.1—>…
52.2.2.2—>…
16 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.8
Hash
Key1.1.1.1—>…
52.2.2.2—>…
A Better Design: Minimal CPU Interaction
17 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.9
Hash
Key1.1.1.1—>…
52.2.2.2—>…
A Better Design: Minimal CPU Interaction
18 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.11
Hash
Key1.1.1.1—>…
72.2.2.2—>…
A Better Design: Minimal CPU Interaction
19 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.11
Hash
Key1.1.1.1—>…
72.2.2.2—>…
Collision
A Better Design: Minimal CPU Interaction
20 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records. 2. Evict to CPU on collision.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.11
Hash
Key1.1.1.1—>…
13.3.3.3—>…
Insert
Normal Hash Table
72.2.2.2—>…
Collision
A Better Design: Minimal CPU Interaction
21 ©2017 Open-NFP
1. Use a flat array, i.e., P4 register, indexed by hash to store records. 2. Evict to CPU on collision.
Packets
MicroFlowGenerator.P4
MicroFlowAggregator.c
P4 Hardware
CPUFlow Records
Pkt Ct.11
Hash
Key1.1.1.1—>…
13.3.3.3—>…
Insert
Normal Hash Table
72.2.2.2—>…
Timeout
A Better Design: Minimal CPU Interaction
22 ©2017 Open-NFP
A Simple Implementation
23 ©2017 Open-NFP
Packets
Flow Records
Flow Records
Microflow Records
Packets
P4 Accelerated Flow Record Generation ▪ Main Idea: Use P4 hardware to preprocess packets into
micro flow records that summarize per-flow packet bursts
Totalwork:map8
packetsto3flowrecords
Totalwork:map4micro
flowrecordsto3flowrecords
• Informa+onrich• featuresarefully
customizable• Efficient
• P4HWreducescpuworkload
Commodityserverflowrecordgenera+on
P4acceleratedflowrecordgenera+on
24 ©2017 Open-NFP
Outline
▪ Introduction: Flow Records ▪ Design and Implementation: P4 Accelerated Flow Record Generation ▪ Benchmarks and Optimizations
25 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload? 2. What is the maximum throughput of the P4 component? 3. How can we optimize for the NFP-4000?
26 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload? 2. What is the maximum throughput of the P4 component? 3. How can we optimize for the NFP-4000?
27 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
Workload:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)
Eachmicroflowrepresents~10packets,onaverage
28 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
Workload:1hour10Gbit/scoreroutertrace(CAIDA 02/2015 Chicago dir. Ah5ps://www.caida.org/data/passive/trace_stats/)
Eachmicroflowrepresents~10packets,onaverage
Sta$s$c Packetsaggrega+on Microflowaggrega+on
Avg.ratefor10Gbit/slink
~500k ~50k
CPUaggregatorthroughput(percore)
~600k ~600k
totalCPUmonitoringcapacity(percore)
~1x10Gbit/slink ~10x10Gbit/slinks
Packets
Flow Records
Flow Records
Microflow Records
Packets
29 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component? 3. How can we optimize it for the NFP-4000?
30 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
31 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
packetsize bitrate@10Mpkts/sec
64 ~5Gbit/s
128 ~10Gbit/s
256 ~20Gbit/s
512 ~40Gbit/s
1024 ~80Gbit/s
Averagepacketsizeson10Gbit/sinternetrouterlinks(caida)
32 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets)
packetsize bitrate@10Mpkts/sec
64 ~5Gbit/s
128 ~10Gbit/s
256 ~20Gbit/s
512 ~40Gbit/s
1024 ~80Gbit/s
Averagepacketsizeson10Gbit/sinternetrouterlinks(caida)
33 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets) 3. How can we optimize for the NFP-4000?
34 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets) 3. How can we optimize for the NFP-4000?
Op[miza[on1:finergrainedsemaphores
35 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets) 3. How can we optimize for the NFP-4000?
Op[miza[on1:finergrainedsemaphores
36 ©2017 Open-NFP
Core 1
Core 2
Pkt Ct.Key
53.3.3.3—>…
packet.key = 4.4.4.4 —> … packet.hash = 3
1: read key @ 3
packet.key = 3.3.3.3 —> … packet.hash = 3
2: read key @ 3
3: replace 3
4: update 3
Optimization 1: Fine Grained P4 Semaphores
Q: Are there race conditions?
37 ©2017 Open-NFP
Core 1
Core 2
Pkt Ct.Key
53.3.3.3—>…
packet.key = 4.4.4.4 —> … packet.hash = 3
1: read key @ 3
packet.key = 3.3.3.3 —> … packet.hash = 3
2: read key @ 3
3: replace 3
4: update 3
Optimization 1: Fine Grained P4 Semaphores
Incorrect!
Q: Are there race conditions? A: Yes.
38 ©2017 Open-NFP
Optimization 1: Fine Grained P4 Semaphores
Core 1
Core 2
Pkt Ct.Key
53.3.3.3—>…
packet.key = 4.4.4.4 —> … packet.hash = 3
1: read key @ 3
packet.key = 3.3.3.3 —> … packet.hash = 3
3: read key @ 3
2: replace 3
4: replace 3
WAIT
Q: Are there race conditions? A: Yes.
39 ©2017 Open-NFP
Optimization 1: Fine Grained P4 Semaphores
Core 1
Core 2
Pkt Ct.Key
53.3.3.3—>…
packet.key = 4.4.4.4 —> … packet.hash = 3
1: read key @ 3
packet.key = 3.3.3.3 —> … packet.hash = 3
3: read key @ 3
2: replace 3
4: replace 3
WAIT
40 ©2017 Open-NFP
Optimization 1: Fine Grained P4 Semaphores
Core 1
Core 2
Pkt Ct.Key
53.3.3.3—>…
1: read key @ 3
2: replace 3WAIT
Core 3
WAIT
Core N
WAIT
…
41 ©2017 Open-NFP
Optimization 1: Fine Grained P4 Semaphores
lock/unlockbycallingaP4ac[on.
Lockasingleentry,nottheen[rearray.
42 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets) 3. How can we optimize for the NFP-4000?
1. Fine grained P4 semaphores
Op[miza[on2:combiningtables
43 ©2017 Open-NFP
Optimization 2: Combining Tables
Version Cycles/packet
Improvement
Baseline 5084 -
44 ©2017 Open-NFP
Optimization 2: Combining Tables
Version Cycles/packet
Improvement
Baseline 5084 -
computehashandgetkeymerged
4191 18%
45 ©2017 Open-NFP
Optimization 2: Combining Tables
Version Cycles/packet
Improvement
Baseline 5084 -
computehashandgetkeymerged
4191 18%
Changingtablesisasexpensiveas~5ememops(register_reads/writes).
46 ©2017 Open-NFP
Optimization 2: Combining Tables
Canwemergeallthetables?Challenge:
branches.
47 ©2017 Open-NFP
Optimization 2: Combining Tables
InC,wecouldusecondi[onalassignments.
Canwemergeallthetables?Challenge:
branches.
48 ©2017 Open-NFP
Optimization 2: Combining Tables
InP4,wecanuseacondiAonalmask.
Canwemergeallthetables?Challenge:
branches.
hdp://homolog.us/blogs/blog/2014/12/04/the-en[re-world-of-bit-twiddling-hacks/
Encodeasmodify_field+register_writeinP4.
49 ©2017 Open-NFP
Optimization 2: Combining Tables
hdp://homolog.us/blogs/blog/2014/12/04/the-en[re-world-of-bit-twiddling-hacks/
Version Cycles/packet
Improvement
Baseline 5084 -
computehashandgetkeymerged
4191 18%
singletable 2915 42%
Endresult:~20-30%bederthroughput.
50 ©2017 Open-NFP
Benchmarks and Optimizations 1. How much does the P4 hardware reduce CPU workload?
(~10x with 1 MB of P4 HW memory) 2. What is the maximum throughput of the P4 component?
(~40-80 Gbit/s with average size packets) 3. How can we optimize for the NFP-4000?
1. Fine grained P4 semaphores 2. Combine tables, use conditional masks
51 ©2017 Open-NFP
Summary
▪ Flow records are powerful for high coverage, low overhead network monitoring.
▪ Generating them efficiently, without sacrificing information richness, is a challenge.
▪ Our P4 accelerator can increase flow generator capacity by factor of 10 or more, without sacrificing information richness.
▪ Table merging, conditional masks, and P4 accessible semaphores are useful and portable optimization techniques.
▪ Code available (soon) at: https://github.com/jsonch/p4_code ▪ Thank you for attending!
52 ©2017 Open-NFP
THANK YOU!