Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Quantized Congestion Control (QCN) Experiences with Ethernet and RoCE Liran Liss Mellanox Technologies
Agenda
• Introduction • Why QCN? • Data plane • Management • QCN in ConnectX-3 and SwitchX • Initial results • Conclusions
© 2013 Mellanox Technologies 2
QCN Introduction
• IEEE standard (802.1Qau) – Provides congestion control for long-lived flows in limited
BW-delay product networks – Part of the Data Center Bridging (DCB) protocol suite
• ETS, PFC, QCN, DCBX
• Conducted at L2 – Similar to IB FECN/BECN congestion management – Independent from L3 congestion control
• E.g., ECN in TCP/DCTCP
• Suitable for HW implementations – Simple – Minimal state
© 2013 Mellanox Technologies 3
Why QCN?
• Prevent congestion spreading – Mitigate victim flows – Important when PFC is used
• Control switch buffer utilization – Less impact on other flows
• Handle bursts better • Mitigate incast
– Improve latency
© 2013 Mellanox Technologies 4
F A B C D E
Y
X
Zzz
Tailored for DCB Environments
• Diverse traffic patterns – Partition-aggregate incast – Long running BW-hungry flows – IPC
• Diverse traffic classes – Best effort lossy traffic – Critical lossless traffic
© 2013 Mellanox Technologies 5
Data Plane
© 2013 Mellanox Technologies 6
NIC NIC Switch
Reaction Point (RP)
Congestion Point (CP)
Congestion Controlled Flow (CCF)
Congestion Notification Messages
(CNMs)
Ethernet subnet
Congestion Point Behavior
• Congestion feedback computed from Switch buffer state – Takes into account both the
degree and trajectory of congestion
• Feedback sent directly to RP
• Statistical sampling – Sampling probability
increases with congestion
© 2013 Mellanox Technologies 7
10%
1% |Fb|
Reflection probability
Qeq Q
Qoff
Fb = -(Qoff + wQdelta)
Qdelta = Q - Qold
Reaction Point Behavior
© 2013 Mellanox Technologies 8
Rate
Time
Target Rate (TR)
Current Rate (CR)
RAI
RHAI
Fast Recovery
Active Increase
Hyper Active Increase
CR1/2(CR+TR) TRTR+RAI CR1/2(CR+TR)
TRTR+i*RAI CR1/2(CR+TR)
CNM received
Rate increases due to timer or byte-count thresholds
Move to AI after 5 rate increase cycles
Move to HAI after both timer and byte-counter completed 5 cycles
TRCR CRCR(1-Gd|Fb|)
Flow Segregation
• Up to 7 Congestion Notification Priority Values (CNPVs) – At least one User Priority dedicated for best effort traffic
• An endpoint may support multiple RPs per CNPV – Grouped into Reaction Point Groups (RPGs) for configuration
• Congestion Controlled Flows (CCFs) determined solely by the RP – By Application – By DMAC – By 5-tuple hash – By QP
• Optional Flow ID may be injected by RP using Congestion Notification tags (CNtag) – Reflected back in CNMs
© 2013 Mellanox Technologies 9
CNM Frame Format
© 2013 Mellanox Technologies 10
Size 6B 6B 2B 2B 2B 2B 2B 2B 8B 2B 2B 2B 6B 2B 0-64B
4b 6b 6b
Field
DA SA
ET = 0x8100
UP + CFI + VID
ET = 0x22E9
Flow ID
ET = 0x22E7
Ver
Rsvd
Fb CPID
QO
ffset
QDelta
Encap UP
Encap DA
MSDU
len
MSDU
VLAN tag (optional)
CN tag (optional)
Congestion Notification Message (CNM)
QCN Management
• Congestion Notification Domains (CNDs) – A connected subset of switches and endpoints
configured to support a CNPV • CNDs must be defended
– Accomplished by remapping CNPVs to non-CNPV priorities
• End to end configuration required – May be configured manually or by DCBX
© 2013 Mellanox Technologies 11
QCN in ConnectX-3 and SwitchX
• ConnectX-3 – HW RP implementation – Reaction to CNMs within usecs – Operation transparent to SW
• A must for offloaded protocols such as RoCE – Thousands of RPs
• SwitchX – CP Active queue management
• Ethernet driver – Multiple flow queues per priority – Selected based on L2/3/4 hash
• RoCE driver – Each QP constitutes its own flow queue
© 2013 Mellanox Technologies 12
Initial Testing
• Experimental setup – 2 ConnectX-3 HCAs – SX1036 switch (SwitchX)
• Manual CND configuration • Traffic pattern
– Both ports of requestor HCA target the same responder port
– Tested the dynamic response to congestion
© 2013 Mellanox Technologies 13
ConnectX-3 initiator
SwitchX (SX1036)
ConnectX-3 responder
QCN Convergence
© 2013 Mellanox Technologies 14
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
0 5 10 15 20 25 30 35 40 45
Flow 1
Flow 2
Aggregate BW
Time [ms]
BW
[Gbp
s]
Flow 2 added
Flow 2 removed
< 2ms
Observations
• Rapid convergence – Under 2ms
• Stable – Small oscillations – Negligible underflow
• >99% utilization of congested link
– No overflows in steady state • No pause frames while enabling PFC/global-pause • No packet loss while disabling pause
© 2013 Mellanox Technologies 15
Conclusions
• QCN offers important benefits to DCB environments – Lossless and lossy flows can happily live together
• Efficient HW implementation in ConnectX-3 and SwitchX – Promising initial results
• Next steps – Multi-flow behavior – Complex setups – Measure real application benefits
© 2013 Mellanox Technologies 16
Thank You
© 2013 Mellanox Technologies
Quantized Congestion Control (QCN) Experiences with Ethernet and RoCEAgendaQCN IntroductionWhy QCN?Tailored for DCB EnvironmentsData PlaneCongestion Point BehaviorReaction Point BehaviorFlow SegregationCNM Frame FormatQCN ManagementQCN in ConnectX-3 and SwitchXInitial TestingQCN ConvergenceObservationsConclusionsThank You