WASHINGTON UNIVERSITY IN ST LOUIS
Substrate Control: Overview
Fred Kuhns
Applied Research Laboratory
Washington University in St. Louis
Fred Kuhns - 04/22/23
Defining Terms and Models
The SPP Node
• Slice instantiation:
  – Allocate a virtual machine (VM) instance on a GPE
  – May request a code option instance, NPE resources and bandwidth
• Share a common set of (global) IP addresses
  – UDP/TCP port space is shared across GPEs/NPEs
• Line card TCAM filters direct traffic
  – Unregistered traffic originating outside the node is sent to the CP.
  – Unregistered traffic originating within the node uses NAT (on the line card).
  – An application may register server ports. This causes a filter to be inserted in the line card directing traffic to a specific GPE.
  – An application must register ports (or tunnels) associated with fast path instances.
• It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
  – Currently only UDP tunnels are supported, but this will be extended to include GRE and possibly others.
[Figure: SPP node block diagram. Labels: Internet; LC (Ingress: map flow to internal destination; Egress: IP route table and ARP; SCD (ARP, NAT)); GPE (planetlab OS, vmx, app, RMP, NMP); NPE (FPx, code option, mi-mux, SRAM, TCAM, SCD). Local delivery/exceptions use an internal UDP tunnel.]
Meta-Interfaces and Tunnels
• A slice fast path (code option instance plus allocated resources) is assumed to sit at one end of a tunnel
  – Currently only UDP tunnels are supported.
  – A UDP tunnel is defined by the 4-tuple
      UDP tunnel: {peer ipaddr, peer port, local ipaddr, local port}
  – Meta-interface or MI: represents a tunnel endpoint as viewed by a slice's fast path router. A meta-interface is defined by the local endpoint's address
      Meta-Interface: {local ipaddr, local UDP port}
• The encapsulated packet is processed by the fast path.
  – The packet is always encapsulated within a tunnel by the substrate.
  – The code option instance processes the encapsulated frame.
• In the SPP context, the slice registers MIs and the substrate manages encapsulation headers:
  – Guards against forging the source address.
  – A filter is installed in the corresponding line card's TCAM to send matching packets to the correct NPE.
  – The NPE's decap module verifies the encapsulation header and provides isolation between slices (based on the local IP and port number values in the tunnel header).
  – Fabric VLANs are used to provide link-level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.
[Figure: fast path (FPx) with meta-interfaces 0 through 6. MI: local tunnel endpoint (UDP), {external ipaddr, udp_port}.]
MI IP Address UDP Port
0 192.168.1.2 6060
1 192.168.1.3 6060
2 192.168.1.2 6061
3 192.168.1.2 6062
4 192.168.1.3 6061
5 192.168.1.3 6062
6 192.168.1.3 6063
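The relationship between a tunnel and a meta-interface can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (UdpTunnel, MetaInterface, rx_mi) and the peer address are hypothetical, and the table simply copies the MI rows listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetaInterface:
    """Local tunnel endpoint: {local ipaddr, local UDP port}."""
    local_ip: str
    local_port: int


@dataclass(frozen=True)
class UdpTunnel:
    """UDP tunnel 4-tuple: {peer ipaddr, peer port, local ipaddr, local port}."""
    peer_ip: str
    peer_port: int
    local_ip: str
    local_port: int

    def meta_interface(self) -> MetaInterface:
        # A meta-interface is just the local half of the tunnel.
        return MetaInterface(self.local_ip, self.local_port)


# MI table copied from the slide (MI index -> local endpoint).
MI_TABLE = {
    0: MetaInterface("192.168.1.2", 6060),
    1: MetaInterface("192.168.1.3", 6060),
    2: MetaInterface("192.168.1.2", 6061),
    3: MetaInterface("192.168.1.2", 6062),
    4: MetaInterface("192.168.1.3", 6061),
    5: MetaInterface("192.168.1.3", 6062),
    6: MetaInterface("192.168.1.3", 6063),
}


def rx_mi(tunnel: UdpTunnel, table=MI_TABLE) -> Optional[int]:
    """Return the MI index whose local endpoint matches the tunnel, or None."""
    target = tunnel.meta_interface()
    for index, mi in table.items():
        if mi == target:
            return index
    return None


# Hypothetical peer 10.0.0.7:7000 sending to local endpoint 192.168.1.3:6061.
print(rx_mi(UdpTunnel("10.0.0.7", 7000, "192.168.1.3", 6061)))
```

Several tunnels with different peers can share one meta-interface in this sketch, since only the local endpoint identifies the MI.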
Lookup Table, TCAM, Use
Lookup filters: Key, Action and Result
• A lookup key is created from the packet's header fields and the receiving meta-interface
  – The code option extracts fields from the encapsulated packet.
  – The substrate adds the receiving meta-interface identifier.
• If no entry is found then the packet's no_route exception attribute is set; otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address)
  – A code option may define additional exception attributes.
• The complete filter specification: {lookup_key, result_vector}
• lookup_key : {RxMI, *copt_key}
  – RxMI : Meta-interface ID on which the packet was received.
  – copt_key : Lookup key defined by the code option. The IPv4 key:
      {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}
• result_vector : {sindx, action[, qid, TxMI, nexthop]}
  – sindx : Stats index.
  – action : Packet disposition, one of {drop, fwd, ld}
    • drop : drop packet
    • fwd : forward packet using next hop value (fwdkey)
    • ld : local delivery, code option instance has local address information??
  – qid : Packet queue.
  – TxMI : Meta-interface used for sending the packet; corresponds to a previously registered local tunnel endpoint. Used to fill in the local address of the outgoing packet tunnel header.
  – nexthop : Tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP port number of the next hop device.
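As a rough illustration of the filter specification, the sketch below packs the IPv4 copt_key into its 112 bits and models a result vector. It is not the NPE implementation: the names (ResultVector, pack_ipv4_key, lookup) are hypothetical, the byte layout simply follows the field order listed above, and an exact-match dictionary stands in for the TCAM (which would also support masks).

```python
import socket
import struct
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResultVector:
    sindx: int                                  # stats index
    action: str                                 # "drop", "fwd" or "ld"
    qid: Optional[int] = None                   # packet queue
    tx_mi: Optional[int] = None                 # output meta-interface
    nexthop: Optional[Tuple[str, int]] = None   # (ip, udp port) of next hop


def pack_ipv4_key(daddr, saddr, sport, dport, tcp_flgs, proto) -> bytes:
    """Pack {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}.

    Assumed layout: fields in the listed order, big endian, no padding.
    """
    return struct.pack("!4s4sHHBB",
                       socket.inet_aton(daddr), socket.inet_aton(saddr),
                       sport, dport, tcp_flgs, proto)


def lookup(filters, rx_mi, copt_key):
    """Exact-match stand-in for the TCAM. Returns (result, no_route)."""
    result = filters.get((rx_mi, copt_key))
    return result, result is None


key = pack_ipv4_key("10.5.2.7", "10.1.1.1", 1234, 80, 0x02, 6)
filters = {(0, key): ResultVector(sindx=5, action="fwd", qid=1, tx_mi=1,
                                  nexthop=("10.50.10.9", 6061))}

print(len(key) * 8)             # key width in bits
print(lookup(filters, 0, key))  # hit: result vector, no_route is False
print(lookup(filters, 1, key))  # miss on a different RxMI: no_route is True
```

The addresses, ports and the next hop in the usage lines are made-up values chosen only to exercise the functions.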
Slice view of the Lookup Key
• When a packet is received the substrate creates a lookup key using the target slice's xsid and the receiving meta-interface. The remaining bits are defined by the code option.
  – xsid' : represents the internal slice ID and may differ from the value of xsid. For implementation efficiency, this is the VLAN identifier assigned to the slice.
  – xmi : internal representation of the meta-interface (MI), an encoding of the received tunnel endpoint.
    • For UDP tunnels this field includes a 4-bit interface id and the 16-bit local UDP port number. The 4-bit id is used as an index into a table of local IP addresses.
• The IPv4 code option defined fields are shown below, where pr is the IP protocol field and tcp is the TCP header flags.
[Figure: lookup key layout. Fields and widths in bits: xsid' (12), xmi (N), slice defined fields (128-N). The user specified lookup key is 4 32-bit words.]
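The xmi encoding for UDP tunnels can be illustrated with a little bit arithmetic. This is a sketch only: the slides give the two field widths (4 and 16 bits) but not their positions, so placing the interface id in the high bits is an assumption, as are the function names and the contents of the local address table.

```python
IFACE_BITS = 4    # index into the table of local IP addresses
PORT_BITS = 16    # local UDP port number


def encode_xmi(iface_id: int, udp_port: int) -> int:
    """Combine interface id and local UDP port into one xmi value.

    Assumed layout: interface id in the high 4 bits, port in the low 16.
    """
    if not 0 <= iface_id < (1 << IFACE_BITS):
        raise ValueError("interface id must fit in 4 bits")
    if not 0 <= udp_port < (1 << PORT_BITS):
        raise ValueError("UDP port must fit in 16 bits")
    return (iface_id << PORT_BITS) | udp_port


def decode_xmi(xmi: int):
    """Split an xmi value back into (interface id, local UDP port)."""
    return xmi >> PORT_BITS, xmi & ((1 << PORT_BITS) - 1)


# Hypothetical table of local IP addresses, indexed by the 4-bit id.
LOCAL_IPS = ["192.168.1.2", "192.168.1.3"]

xmi = encode_xmi(1, 6061)
iface_id, port = decode_xmi(xmi)
print(hex(xmi), LOCAL_IPS[iface_id], port)
```

With this layout the xmi field occupies 20 bits of the key.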
IPv4 TCAM Filter Formats (on NPE)
[Figure: TCAM key and result layouts. Exact bit positions are not recoverable from the extracted text; the labels and notes are listed below.]
Key
• Substrate defined: T flag (T = 0: normal lookup; T = 1: substrate-only lookup), vlan, and if/RX port, which represents the input meta-interface.
• Defined by the IPv4 code option, 112 bits: daddr (32), saddr (32), sport (16), dport (16), tcp/proto (16).
Result, 64 bits
• Fields: TX IP daddr, TX dport, TX sport, flags (D: drop packet; L: local delivery), sindx, qid, QM, Sch and reserved bits.
• The TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (RMP maps miid to tx tunnel params and uses the dport provided by the slice.)
• 20-bit internal qid. (SCD maps the slice's miid to QM and Sch. SCD also maps the slice's qid to the global qid value.)
• Global stats index. (SCD maps the slice's sindx to the global value.)
Slice parameters:
• Key: Input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
• Result: Flags {Drop, GPE}, sindx, Output miid, QID
Lookup
• The parse block makes the copt_key.
• The substrate adds the xsid and xmi fields.
• The substrate uses the TxMI and nexthop fields to construct the encapsulation header.
[Figure: lookup data flow. Labels: packet; decap; annotations {xsid, RxMI}; parse block; key xsid:RxMI:copt_key (xsid', xmi, slice defined fields); Lookup A; result sindx:action:qid:TxMI:nexthop; TxMI:nexthop.]
Version 2 and Multicast
• In version 2 there will be 2 stages to the lookup.
  – Add fanout (count) to lookup B.
• If fanout > 1 then address of fanout, else result vector; chain fanout blocks.
• TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.
• sindex is passed from side A.
• VLAN table in header format and VLAN table in Decap/Parse.
[Figure: two-stage lookup. Labels: packet; decap; annotations {xsid, RxMI}; parse block; lookup_key (xsid', xmi, slice defined fields); Lookup A, result action:sindx:rindx; rindx as result_index, overloaded with fanout address; Lookup B, result sindx:qid:TxMI:nexthop; fanout table with entries qid:TxMI:nexthop.]
Lookup Example
• When a code option is requested the slice is allocated the requested number of TCAM entries; fid ∈ {0, ..., Nf-1}
  – All TCAM operations accept a TCAM entry ID (fid).
  – Entries are listed in priority order, with fid=0 the highest priority and entry Nf-1 the lowest.
• It is up to the slice control path to order the lookup entries.
  – For example, if we have the simple routing database:
      10.10.2.1/32   Local delivery (GPE)
      10.5.2.0/24    NH A
      10.5.1.0/24    NH B
      10.5.0.0/16    NH C
• Then the control software could use the following:
      write_fltr(fid, rxmi, {prefix,width}, action, {qid,TxMI,nexthop})
      write_fltr(0, *, {10.10.2.1, 0xFFFFFFFF}, LD)
      write_fltr(1, *, {10.5.2.0, 0xFFFFFF00}, fwd, {1, 1, NHA})
      write_fltr(2, *, {10.5.1.0, 0xFFFFFF00}, fwd, {2, 2, NHB})
      write_fltr(3, *, {10.5.0.0, 0xFFFF0000}, fwd, {3, 3, NHC})

Desired Route Table (LPM)
prefix          TxMI   nexthop
10.10.2.1/32    0*     Local
10.5.2.0/24     1      NH A
10.5.1.0/24     2      NH B
10.5.0.0/16     3      NH C

Slice Meta-Interfaces
MI   IP Address    UDP Port
0    192.168.1.2   6060
1    10.50.10.2    6061
2    10.50.10.2    6062
3    10.1.1.1      6060

Slice BW Allocations
Interface   BW        ipAddr
0*          BE        192.168.1.2
1           100Mbps   10.50.10.2
2           10Mbps    10.1.1.1

Slice Queue Bindings
QID   Interface   BW     max Bytes
0     0*          -      Local*
1     1           40%    1024
2     1           60%    1024
3     2           100%   1024
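The effect of priority ordering can be checked with a small first-match simulation. This is a sketch, not the substrate API: the SliceTcam class is hypothetical, its write_fltr drops the rxmi wildcard argument for brevity, and the next hops are plain strings. It loads the four filters from the example and returns the lowest-numbered matching entry.

```python
import ipaddress


def ip(text: str) -> int:
    """Dotted-quad string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


class SliceTcam:
    """First-match table: fid 0 has the highest priority."""

    def __init__(self, num_filters: int):
        self.entries = [None] * num_filters

    def write_fltr(self, fid, prefix, mask, action, result=None):
        # Store the masked prefix so that lookup is a single compare.
        self.entries[fid] = (ip(prefix) & mask, mask, action, result)

    def lookup(self, daddr: str):
        addr = ip(daddr)
        for fid, entry in enumerate(self.entries):
            if entry is None:
                continue
            value, mask, action, result = entry
            if (addr & mask) == value:
                return fid, action, result
        return None  # no entry matched: the no_route case


tcam = SliceTcam(4)
tcam.write_fltr(0, "10.10.2.1", 0xFFFFFFFF, "ld")
tcam.write_fltr(1, "10.5.2.0", 0xFFFFFF00, "fwd", (1, 1, "NHA"))
tcam.write_fltr(2, "10.5.1.0", 0xFFFFFF00, "fwd", (2, 2, "NHB"))
tcam.write_fltr(3, "10.5.0.0", 0xFFFF0000, "fwd", (3, 3, "NHC"))

for daddr in ("10.10.2.1", "10.5.2.9", "10.5.7.1", "8.8.8.8"):
    print(daddr, tcam.lookup(daddr))
```

Because the /24 entries sit at lower fids than the /16 entry, an address such as 10.5.2.9 matches NH A rather than NH C; swapping fids 1 and 3 would break longest prefix match.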
Example IPv4 LPM
• In general, for longest prefix match a good strategy is to divide the allocated filters into 32 sets.
• For example, assume 1024 TCAM entries have been allocated and we are using LPM.
  – Divide the filters into 32 sets of 32 filters each and associate a prefix length with each:
      Prefix Width   Filter ID Range
      32             0 - 31
      31             32 - 63
      w              (32-w)*32 + (0...31)
      1              992 - 1023
  – Then for a particular prefix width add it to the appropriate set.
  – Entries within a set are non-overlapping, so their order doesn't matter.
  – This is the scheme used by software written by IDT, the manufacturer of the TCAM we currently use.
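The range formula in the table above is easy to check in code. The helper below is a hypothetical sketch (the name fid_range is not from the slides) that returns the filter IDs reserved for a given prefix width when 1024 entries are split into 32 sets of 32.

```python
SET_SIZE = 32  # filters per prefix-width set (1024 entries / 32 widths)


def fid_range(width: int, set_size: int = SET_SIZE) -> range:
    """Filter IDs reserved for prefixes of the given width (1..32).

    Longer prefixes get lower fids, and therefore higher priority.
    """
    if not 1 <= width <= 32:
        raise ValueError("prefix width must be in 1..32")
    start = (32 - width) * set_size
    return range(start, start + set_size)


for w in (32, 31, 24, 1):
    r = fid_range(w)
    print(w, r.start, r.stop - 1)
```

A /24 route, for example, would be written into one of the 32 slots starting at fid 256.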
Keeping track of TCAM entries
• The substrate will have to manage the mapping of VM TCAM filter IDs to the actual filter IDs.
• VM control software will use a normalized filter index list (it starts at 0 and has the requested number of filter entries).
• The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
• NPU A and B share a common TCAM and index range, so this must be managed across the two xscales.
• Source for managing TCAM entries:
  – See the C++ implementation of the RangeMap class in $WUSRC/range
  – The class will also be used for managing the QID name space.
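The index translation described above can be sketched as follows. This is not the RangeMap class, whose interface is not shown in the slides; SliceIndexMap is a hypothetical stand-in that maps a slice's normalized indices 0..n-1 onto one or more blocks of real TCAM indices, which need not be contiguous. The block addresses in the usage lines are made up.

```python
class SliceIndexMap:
    """Maps normalized per-VM indices onto blocks of real indices."""

    def __init__(self, blocks):
        # blocks: list of (start, size) pairs of real TCAM indices.
        self.blocks = list(blocks)
        self.size = sum(size for _, size in self.blocks)

    def to_real(self, vm_index: int) -> int:
        """Translate a per-VM index into the actual TCAM index."""
        if not 0 <= vm_index < self.size:
            raise IndexError("index outside the slice's allocation")
        offset = vm_index
        for start, size in self.blocks:
            if offset < size:
                return start + offset
            offset -= size
        raise AssertionError("unreachable: index was range-checked")


# Hypothetical allocation: 1024 entries split across two blocks.
fid_map = SliceIndexMap([(2048, 512), (4096, 512)])
print(fid_map.to_real(0), fid_map.to_real(511), fid_map.to_real(512))
```

The range check is what keeps a slice from addressing entries outside its own allocation; the same structure could serve the QID name space.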
Control Software:Resource Management
[Figure: control software components. Labels: CP (SRM: System Resource Manager, Resource DB, SNM); GPE (planetlab OS root context, vnet, RMP, NMP, vmx, control); NPE (FPx, FPk, SRAM, TCAM, SCD); LC (SCD, TCAM, MUX); SP; node components not in hub (switch, GPEs, Development Hosts).]
• Exception and local delivery traffic includes a shim header with RxMI.
• Support fast path configuration via the PLC.
Partitioning of (substrate) Responsibilities
• Virtual Machine (slice control SW): application logic, code option specific control and data operations.
  – traditional PlanetLab slice operations
  – manages code option specific lookup tables, stats, memory and configuration blocks
  – implements the interface with the fast path for exception and local delivery traffic
• vnet
  – flow isolation: filtering traffic through the Linux kernel
  – add support for VLAN-based filtering and port reservation
• Resource Manager Proxy (aka Local Resource Manager)
  – all VM commands are issued to the RMP
    • the RMP is able to validate the command sender (authenticate)
    • enforces access restrictions (authorize)
    • decouples VMs from substrate control entities. That is, it maps exported abstractions and interfaces to specific hardware and software interfaces.
  – verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading, as part of ensuring isolation and security.
  – in tandem with the SRM, implements device independent logic
• System Resource Manager
  – device independent logic
  – responsible for implementing and enforcing
    • system resource abstractions
    • resource isolation and allocation policies
    • facilitating SNM: implementing PlanetLab compatible behavior and abstractions
• Substrate Control Daemon
  – intermediary between VM and code option instances (vouches for VM)
  – enforces policies on resource allocations and isolation in the control plane
  – implements device dependent logic
Responsibilities
[Figure: tables kept by each control component. The diagram layout is not recoverable from the extracted text; the labels are listed below, grouped on a best-effort basis.]
SRM (the "Decider")
• System tables
  – Interfaces: ifn:{type,ipaddr,linkBW,availBW}...
  – NPE Table: id:{addr,BW/Port,copts,fltrs,sram,Qs}...
  – endpoint (port) maps: resvMap, availMap, usedMaps, xsidMap
  – VLAN maps: range:{start,end}, vlanid:xsid...
• Per Slice Tables
  – xsid, vlan, plab sliceID
  – meta-ifaces: mi:endpoint...
  – endpoints: id:{type,ipaddr,port,proto,board,bw}...
  – gpe: board id, BW
  – NPE (allocated): board ID, BW, sram {start,size}, #flts, #Qs, #Stats
SCD (NPE)
• Per Slice data
  – Slice Maps: xsid:{qidMap,FidMap,statsMap}
  – Slice SRAM Assignments: xsid:{sram_start,sram_size}
  – Interface BW
• Tables in data path (ranges are not required to be contiguous)
  – VLAN Table: vlan, copt:sram_addr
  – SRAM: base, xsid:size, xsid:offset
  – Lookup Table: xsid:range (fid to "real" indx)
  – Queue Params: xsid:range (qid to "real" indx)
  – Stats Table: xsid:range (sid to "real" indx)
  – Per interface scheduler and rate limits
  – HF Control Block? code option control blocks?
GPE / RMP
• endpoint (port) maps: servMap, resvMap
• control IP BW maps??
• request allocation, make allocation

RMP Responsibilities
• Translate slice MI to local endpoint. Either call the SRM or cache mappings.
• Add xsid to the subMsg header.
• Pass through identifiers mapped by the SCD: qid, fid and stats.
• Pass through relative queue weights; the SCD maps them to global weights.

SCD Responsibilities
• Translate slice specific indices to global indices: qid, fid and stats.
• Knows the location of all tables.
• Interprets commands to add, remove and modify entries in data path tables.
• Knows the per slice interface BW allocation and maps relative queue weight to global weight.
• Each interface scheduler is assigned (by the SRM) a max rate.
Queuing and allocating Interface Bandwidth
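Simple Queuing Example
[Figure: two slices sharing one physical interface. On the NPE, FP slice1 has queues q10, q11, ..., q1n' (qid in 0...n-1) and FP slice2 has queues q20, q21, ..., q2m' (qid in 0...m-1); each slice's queues are served by wrr with bandwidths BW11 and BW21. The two slices are combined by a second wrr onto the LC interface (ipAddr, linkBW), where BW11 + BW21 = BW1. GPE FP1 and GPE FP2 are also shown.]
• Slice Interface and Queue Allocations: {Port, BW, QList}, QList = {{qid, weight, threshold}, ...}
• Physical Port (Interface) Attributes: {ifn, type, ipaddr, linkBW, availBW}
  – ifn : Interface number
  – type : {Internet, Peering}
• Operations:
  – get_interfaces()
  – get_ifattrs(ifn)
  – get_ifpeer(ifn)
  – alloc_ifbw(ifn, xsid, bw)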
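A small numeric sketch may help with the two levels of sharing. The slides do not give the exact weight mapping, so this assumes plain proportional sharing: a queue's bandwidth is its relative weight times the slice's interface allocation, and slice allocations on an interface must not exceed the bandwidth available there. The function names and the numbers are illustrative only.

```python
def queue_bandwidths(slice_bw_mbps, rel_weights):
    """Split a slice's interface allocation across its queues.

    rel_weights maps qid -> relative weight; shares are proportional
    (an assumption, not the SCD's documented mapping).
    """
    total = sum(rel_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {qid: slice_bw_mbps * w / total for qid, w in rel_weights.items()}


def allocations_fit(avail_bw_mbps, slice_allocations):
    """True if the per-slice allocations fit within the interface."""
    return sum(slice_allocations.values()) <= avail_bw_mbps


# A slice allocated 100 Mbps on an interface, with queues weighted 40 and 60.
print(queue_bandwidths(100, {1: 40, 2: 60}))

# Two slices sharing a 100 Mbps interface: BW11 + BW21 = BW1.
print(allocations_fit(100, {"slice1": 60, "slice2": 40}))
```

Under this assumption a slice only ever states weights relative to its own queues, and the substrate scales them by the slice's share of the interface.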
Substrate Message Format
Substrate Message
• mlen: Total message length, including the header.
• mid: Message ID, used to support synchronous message processing.
• cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates substrate context.
• cmd: Command to execute or a return code.
• The 4 header fields are each 16 bits.
• body: 0 or more bytes of command data.
[Figure: message layout. msg header of four 16-bit fields (bits 0 to 15 each), drawn in the order mlen, mid, cmd, cid, followed by body: 0-N (B).]
• Assumes a simple command-response (two-way) messaging framework, but will support one-way schemes.
• Supports asynchronous communications using a message ID.
• The command field is overloaded for the return code.
• Every server is expected to implement a simple Version command (cmd == 0) which returns the server's ID and version number as two 32-bit fields.
  – The primary use is for monitoring the health of servers and debugging.
  – All other command values are unique only to a particular server.
• Uses UDP as the transport protocol.
• All commands are expected to be idempotent.
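The header can be packed and parsed with Python's struct module, as in the sketch below. The field order on the wire is an assumption: it follows the order in which the header diagram lists the fields (mlen, mid, cmd, cid), big endian. The helper names and the server ID and version values are hypothetical.

```python
import struct

# Four 16-bit fields, network byte order. Order assumed from the diagram.
HEADER = struct.Struct("!HHHH")   # mlen, mid, cmd, cid

CMD_VERSION = 0  # every server implements Version as cmd == 0


def pack_msg(mid: int, cmd: int, cid: int, body: bytes = b"") -> bytes:
    """Build a substrate message; mlen covers header plus body."""
    return HEADER.pack(HEADER.size + len(body), mid, cmd, cid) + body


def unpack_msg(data: bytes):
    """Return (mid, cmd, cid, body) after checking the length field."""
    mlen, mid, cmd, cid = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match the datagram length")
    return mid, cmd, cid, data[HEADER.size:]


# Version request in substrate context (cid 0) and a made-up reply
# carrying server ID 42 and version 1 as two 32-bit fields.
request = pack_msg(mid=7, cmd=CMD_VERSION, cid=0)
reply = pack_msg(mid=7, cmd=0, cid=0, body=struct.pack("!II", 42, 1))

mid, rc, cid, body = unpack_msg(reply)
print(request.hex(), mid, rc, struct.unpack("!II", body))
```

Matching the reply's mid against the request's is what lets a client pair responses with outstanding commands over UDP.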
Overview
• In the interface specifications I provide a C-like description of the operations and results.
• The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application-level library.
• The arguments are to be encoded into the message body in the order they are given, using network byte order (big endian) and without padding.
• All commands result in one of:
  1. No return response: one-way call semantics.
  2. An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
  3. The command completes and does not indicate an error to the message framework. The message result code then indicates success, and the message body contains any result data.
Example Message
• A slice with an xsid of 0x10 requests the allocation of a global UDP (protocol number 17) port for the local IP address 128.252.130.34 (hex 0x80FC8222).
  – Assume the alloc_port command ID is 4.
      port = alloc_port(0x80FC8222, 0, 17)
  – The call lets the system assign the next available port number.
• The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.

Command Message (hex)
  F | 1
  4 | 10
  80 FC 82 22
  00 00 11

Reply Message (hex)
  F | 1
  0 | 10
  80 FC 82 22
  13 BA 11
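The two messages above can be reproduced byte for byte with a short script. As before, this is a sketch under assumptions: the header field order (mlen, mid, cmd, cid) is read off the diagrams, the body layout (32-bit address, 16-bit port, 8-bit protocol, no padding) is inferred from the example bytes, and the function names are hypothetical.

```python
import socket
import struct


def alloc_port_body(ipaddr: str, port: int, proto: int) -> bytes:
    """Encode (ipaddr, port, proto) in network byte order, no padding."""
    return struct.pack("!4sHB", socket.inet_aton(ipaddr), port, proto)


def message(mid: int, cmd: int, cid: int, body: bytes) -> bytes:
    """Prefix the body with the four 16-bit header fields."""
    return struct.pack("!HHHH", 8 + len(body), mid, cmd, cid) + body


CMD_ALLOC_PORT = 4   # command ID assumed in the example
XSID = 0x10          # slice context
UDP = 17             # IP protocol number for UDP

# Request: port 0 asks the system to pick the next available port.
command = message(1, CMD_ALLOC_PORT, XSID,
                  alloc_port_body("128.252.130.34", 0, UDP))

# Reply: return code 0 in the cmd field, allocated port 5050 in the body.
reply = message(1, 0, XSID, alloc_port_body("128.252.130.34", 5050, UDP))

print(command.hex())
print(reply.hex())
```

Both messages are 15 bytes long (8 header plus 7 body), which is the F in the mlen field.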
NAT
• Problem:
  – UDP, TCP: 2 or more GPEs attempt to use the same global IP, Port and Proto
  – ICMP: ???
[Equations: garbled in extraction and not recoverable. The surviving symbols are BW, W, w, MTU and min, with indices i and j.]