High Performance Switches and Routers: Theory and Practice
Sigcomm 99, August 30, 1999, Harvard University
Nick McKeown and Balaji Prabhakar
Departments of Electrical Engineering and Computer Science
Copyright 1999. All Rights Reserved
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
Basic Architectural Components
[Figure: the control plane (routing, reservation, admission control, congestion control, policing) sits above the datapath, which performs per-packet processing: switching and output scheduling.]
Basic Architectural Components
Datapath: per-packet processing
[Figure: each arriving packet passes through three stages: (1) a forwarding decision, made by consulting a forwarding table; (2) the interconnect, which carries the packet to its outgoing port; (3) output scheduling.]
Where high performance packet switches are used
[Figure: the Internet core (carrier-class core routers, ATM switches, Frame Relay switches), edge routers, and enterprise WAN access and enterprise campus switches.]
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.
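The four steps above can be sketched in a few lines. This is an illustrative sketch only; the VC-table contents and cell fields are made up for the example.

```python
# VC table: indexed directly by the incoming VCI; each entry holds
# the (outgoing port, outgoing VCI) pair.
vc_table = {}
vc_table[42] = (3, 17)   # cells arriving with VCI 42 leave port 3 carrying VCI 17

def forward_cell(cell):
    """Lookup the cell's VCI, rewrite it, and return the outgoing port."""
    out_port, out_vci = vc_table[cell["vci"]]  # 1. lookup VCI/VPI in VC table
    cell["vci"] = out_vci                      # 2. replace old VCI with new
    return out_port                            # 3. forward to outgoing interface

cell = {"vci": 42, "payload": b"..."}
port = forward_cell(cell)
```

Because the VCI is a small, dense label negotiated at connection setup, the table can literally be memory indexed by VCI, which is what makes the lookup a single memory reference.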
Ethernet Switch
• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.
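The learn/lookup/flood behaviour above fits in a few lines. A minimal sketch, with invented MAC names and a fixed port set:

```python
forwarding_table = {}      # maps MAC address -> port, learned from source addresses
ALL_PORTS = {1, 2, 3, 4}

def handle_frame(sa, da, in_port):
    """Return the set of ports the frame is sent out of."""
    forwarding_table[sa] = in_port       # learn SA of incoming frame
    if da in forwarding_table:           # DA known: forward to the correct port
        return {forwarding_table[da]}
    return ALL_PORTS - {in_port}         # DA unknown: flood (not back out the input)

# Host A (port 1) sends to unknown host B: the frame is flooded.
out = handle_frame(sa="A", da="B", in_port=1)
# B replies from port 2: the switch has learned A, so the reply goes to port 1 only.
out2 = handle_frame(sa="B", da="A", in_port=2)
```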
IP Router
• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.
• Decrement TTL and update the header checksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
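The TTL/checksum step is the one piece of per-packet header rewriting in IPv4, and it can be done incrementally rather than by summing the whole header again (in the style of RFC 1624: HC' = ~(~HC + ~m + m')). A sketch; for brevity it assumes the protocol byte sharing the TTL's 16-bit header word is zero, which is an artificial simplification.

```python
def ones_complement_add(a, b):
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def full_checksum(words):
    """Checksum computed from scratch over the header words (slow path)."""
    s = 0
    for w in words:
        s = ones_complement_add(s, w)
    return ~s & 0xFFFF

def decrement_ttl(ttl, checksum):
    """Return (new ttl, new checksum) after decrementing TTL by one.
    The TTL is the high byte of its header word; the low (protocol)
    byte is assumed zero here to keep the example short."""
    m_old = ttl << 8                       # old 16-bit word containing TTL
    m_new = (ttl - 1) << 8                 # new word
    hc = ~checksum & 0xFFFF
    hc = ones_complement_add(hc, ~m_old & 0xFFFF)
    hc = ones_complement_add(hc, m_new)
    return ttl - 1, ~hc & 0xFFFF

# Example header words (checksum field itself left as zero while summing).
words = [0x4500, 0x0054, 0x1c46, 0x4000, 64 << 8, 0x0000,
         0xac10, 0x0a63, 0xac10, 0x0a0c]
cs = full_checksum(words)
new_ttl, new_cs = decrement_ttl(64, cs)
```

The incremental form matters at high speed: only one header word changed, so the router touches two 16-bit values instead of re-reading the whole header.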
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
First-Generation IP Routers
[Figure: a shared backplane connects a central CPU and buffer memory to several line interfaces (MAC + DMA); every packet crosses the bus to the CPU and back.]
Second-Generation IP Routers
[Figure: each line card gains its own local buffer memory, so most packets move card-to-card across the shared bus without visiting the central CPU and buffer memory.]
Third-Generation Switches/Routers
[Figure: line cards (MAC, local buffer memory) and a CPU card attach to a switched backplane; a crossbar replaces the shared bus between line interfaces.]
Fourth-Generation Switches/Routers: Clustering and Multistage
[Figure: multiple switch/router chassis (ports 1-32) clustered around a multistage interconnect to form one larger system.]
Packet Switches: References
• J. Giacopelli, M. Littlewood, W.D. Sincoskie “Sunshine: A high performance self-routing broadband packet switch architecture”, ISS ‘90.
• J. S. Turner “Design of a Broadcast packet switching network”, IEEE Trans Comm, June 1988, pp. 734-743.
• C. Partridge et al. “A Fifty Gigabit per second IP Router”, IEEE Trans Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro Magazine, Jan-Feb 1997.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Basic Architectural Components
Datapath: per-packet processing
[Figure: the three datapath stages again: (1) forwarding decision via the forwarding table; (2) interconnect; (3) output scheduling.]
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
ATM and MPLS Switches
Direct Lookup
[Figure: the incoming VCI is used directly as the memory address; the data read out is the (Port, VCI) pair.]
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Bridges and Ethernet Switches
Associative Lookups
[Figure: the 48-bit search key is presented to an associative memory (CAM), which returns a hit signal, a log2(N)-bit address, and the associated data.]
Advantages:
• Simple
Disadvantages:
• Slow
• High power
• Small
• Expensive
Bridges and Ethernet Switches
Hashing
[Figure: the 48-bit search key is hashed down to a 16-bit memory address; the memory returns the associated data and a hit signal.]
Lookups Using Hashing
An example
[Figure: a CRC-16 hash of the 48-bit search key selects a memory address; each address holds a linked list of the entries (e.g. #1-#4) that hashed to it.]
Lookups Using Hashing
Performance of simple example

E_R = (1/2) * [ 1 + 1 / (1 - (1 - 1/N)^M) ]

Where:
E_R = Expected number of memory references
M   = Number of memory addresses in table
N   = Number of linked lists (here M = N)
Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small
Disadvantages
• Non-deterministic lookup time
• Inefficient use of memory
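The expected/worst-case gap above is easy to see empirically. A Monte Carlo sketch: Python's built-in hash stands in for the CRC-16 of the slides, and the table sizes are arbitrary.

```python
import random

def build_table(addresses, n_buckets):
    """Hash each address into one of n_buckets chained lists."""
    buckets = [[] for _ in range(n_buckets)]
    for a in addresses:
        buckets[hash(a) % n_buckets].append(a)   # linked list per bucket
    return buckets

def lookup_refs(buckets, a):
    """Memory references a successful lookup walks down the chain."""
    chain = buckets[hash(a) % len(buckets)]
    return chain.index(a) + 1

random.seed(7)
N = M = 4096                                     # buckets = entries (load factor 1)
addrs = random.sample(range(1 << 48), M)         # M random 48-bit addresses
table = build_table(addrs, N)
avg = sum(lookup_refs(table, a) for a in addrs) / M
worst = max(lookup_refs(table, a) for a in addrs)
```

At load factor 1 the average lands near 1.5 references, but the longest chain is several times that, which is exactly the non-deterministic lookup time the slide warns about.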
Trees and Tries
[Figure: a binary search tree over N entries, branching on < / > comparisons with depth log2(N); and a binary search trie, branching 0/1 on successive address bits, shown storing 010 and 111.]
Trees and Tries
Multiway tries
[Figure: a 16-ary search trie; each node holds sixteen (4-bit value, pointer) pairs, shown storing 0000 1111 0000 and 1111 1111 1111.]
Trees and Tries
Multiway tries

Degree of  # Mem        # Nodes   Total Memory  Fraction
Tree       References   (x10^6)   (Mbytes)      Wasted (%)
2          48           1.09      4.3           49
4          24           0.53      4.3           73
8          16           0.35      5.6           86
16         12           0.25      8.3           93
64         8            0.17      21            98
256        6            0.12      64            99.5

E_n = 1 + sum_{i=1..L} D^i * [1 - (1 - D^-i)^N]
E_w = D*E_n - (E_n - 1) - N

Where:
D   = Degree of tree
L   = Number of layers/references
N   = Number of entries in table
E_n = Expected number of nodes
E_w = Expected amount of wasted memory (node slots allocated but unused)

Table produced from 2^15 randomly generated 48-bit addresses
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Caching Addresses
[Figure: the second-generation architecture again; cached entries in each line card's local memory serve the fast path, while misses cross the bus to the CPU and full table on the slow path.]
Caching Addresses
LAN: average flow < 40 packets
WAN: huge number of flows
[Figure: cache hit rate (0-100%) with a cache sized at 10% of the full table.]
IP Routers
Class-based addresses
[Figure: the IP address space divided into Class A, B, C, and D regions; a routing-table entry such as "212.17.9.0 -> Port 4" is found by an exact match on the Class C network of destination 212.17.9.4.]
IP Routers
CIDR
[Figure: the class-based address line (0 to 2^32-1, classes A-D) versus the classless view, where prefixes such as 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0, including 128.9.16.14), and 142.12/19 can sit anywhere on the line.]
IP Routers
CIDR
[Figure: nested prefixes on the address line from 0 to 2^32-1: 128.9/16 contains 128.9.16/20 and 128.9.176/20; 128.9.16/20 in turn contains 128.9.19/24 and 128.9.25/24; address 128.9.16.14 falls inside both 128.9/16 and 128.9.16/20.]

Most specific route = "longest matching prefix"
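The rule can be read off directly, if slowly: scan every prefix and keep the longest one that covers the address. A linear-scan sketch over the prefixes from the figure:

```python
def ip(s):
    """Dotted quad -> 32-bit integer."""
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

prefixes = [                      # (value, length) pairs from the figure
    (ip("128.9.0.0"), 16),
    (ip("128.9.16.0"), 20),
    (ip("128.9.176.0"), 20),
    (ip("128.9.19.0"), 24),
    (ip("128.9.25.0"), 24),
]

def longest_prefix_match(addr):
    """Return the (value, length) of the most specific covering prefix."""
    best = None
    for value, length in prefixes:
        mask = ~((1 << (32 - length)) - 1) & 0xFFFFFFFF
        if (addr & mask) == value and (best is None or length > best[1]):
            best = (value, length)
    return best

# 128.9.16.14 is covered by 128.9/16 and 128.9.16/20; the /20 is most specific.
match = longest_prefix_match(ip("128.9.16.14"))
```

Everything in the lookup sections that follows is about replacing this O(number of prefixes) scan with structures that answer the same question in a handful of memory references.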
IP Routers
Metrics for Lookups
[Figure: a small prefix-to-port table (128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24, 128.9.25/24, 142.12/19, 65/8) being searched for 128.9.16.14.]
• Lookup time
• Storage space
• Update time
• Preprocessing time
IP Router Lookup
IPv4 unicast destination-address based lookup
[Figure: the incoming packet's header enters the forwarding engine, which computes the next hop by looking up the destination address in the forwarding table.]
Need more than IPv4 unicast lookups
• Multicast
  • PIM SM
    – Longest prefix matching on the source and group address
    – Try (S,G), followed by (*,G), followed by (*,*,RP)
    – Check incoming interface
  • DVMRP:
    – Incoming interface check followed by (S,G) lookup
• IPv6
  • 128-bit destination address field
  • Exact address architecture not yet known
Lookup Performance Required
Gigabit Ethernet (84B packets): 1.49 Mpps
Line Line Rate Pkt size=40B Pkt size=240B
T1 1.5Mbps 4.68 Kpps 0.78 Kpps
OC3 155Mbps 480 Kpps 80 Kpps
OC12 622Mbps 1.94 Mpps 323 Kpps
OC48 2.5Gbps 7.81 Mpps 1.3 Mpps
OC192 10 Gbps 31.25 Mpps 5.21 Mpps
Size of the Routing Table
[Figure: number of routing-table prefixes versus time.]
Source: http://www.telstra.net/ops/bgptable.html
Ternary CAMs
[Figure: an associative memory of (value, mask) pairs, e.g. 10.0.0.0 / 255.0.0.0 -> R1; 10.1.0.0 / 255.255.0.0 -> R2; 10.1.1.0 / 255.255.255.0 -> R3; 10.1.3.0 / 255.255.255.0 -> R4; 10.1.3.1 / 255.255.255.255 -> R4. All entries are compared at once and a priority encoder selects the next hop of the first (longest) matching entry.]
Binary Tries
Example Prefixes:
a) 00001  b) 00010  c) 00011  d) 001  e) 0101
f) 011    g) 100    h) 1010   i) 1100 j) 11110000
[Figure: the ten prefixes stored in a binary trie, branching on 0 to the left and 1 to the right.]
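The example prefixes can be loaded into a small binary trie directly; lookup walks the address bit by bit and remembers the deepest node that carried a stored prefix. A dict-based sketch (illustrative only, not a memory-efficient layout):

```python
prefixes = {                  # name -> prefix bit-string, as in the example
    "a": "00001", "b": "00010", "c": "00011", "d": "001",
    "e": "0101",  "f": "011",   "g": "100",   "h": "1010",
    "i": "1100",  "j": "11110000",
}

def make_node():
    return {"0": None, "1": None, "name": None}

root = make_node()
for name, bits in prefixes.items():          # insert each prefix
    node = root
    for b in bits:
        if node[b] is None:
            node[b] = make_node()
        node = node[b]
    node["name"] = name                      # mark the prefix's node

def lpm(bits):
    """Longest matching prefix of an address bit-string, or None."""
    node, best = root, None
    for b in bits:
        node = node[b]
        if node is None:
            break
        if node["name"] is not None:
            best = node["name"]              # deepest stored prefix so far
    return best
```

One memory reference per address bit is the cost being attacked by Patricia trees (skip one-way chains) and multiway tries (consume several bits per step).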
Patricia Tree
Example Prefixes:
a) 00001  b) 00010  c) 00011  d) 001  e) 0101
f) 011    g) 100    h) 1010   i) 1100 j) 11110000
[Figure: the same prefixes in a Patricia tree; one-way branches are collapsed, e.g. a skip of 5 bits on the path to j.]
Patricia Tree
Advantages:
• General solution
• Extensible to wider fields
Disadvantages:
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space

Avoid backtracking by storing the intermediate best-matched prefix (Dynamic Prefix Tries).
40K entries: 2 MB data structure with 0.3-0.5 Mpps [O(W)]
Binary search on trie levels
[Figure: a trie with levels 0 through 29 marked; instead of walking from the root, the search for P probes a middle level (e.g. level 8) and binary-searches toward the longest level containing a match.]
Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Example Prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example Addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
[Figure: per-length hash tables for lengths 8, 12, 16, 24; length 8 holds "10", length 16 holds "10.1, 10.2", and length 24 holds "10.1.1, 10.1.2, 10.2.3".]
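A simplified sketch of the scheme (in the spirit of Waldvogel et al., with details reduced): one hash table per prefix length, plus "marker" entries at shorter lengths so the binary search knows when to try longer prefixes. Each marker stores its precomputed best-matching prefix so the search never backtracks.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

LENGTHS = [8, 16, 24]                          # sorted lengths with real prefixes
PREFIXES = [("10.0.0.0", 8), ("10.1.0.0", 16), ("10.1.1.0", 24),
            ("10.1.2.0", 24), ("10.2.3.0", 24)]

def best_match(key_bits, plen):
    """Longest real prefix matching a plen-bit key (naive scan, build-time only)."""
    best = None
    for value, l in PREFIXES:
        if l <= plen and (ip(value) >> (32 - l)) == (key_bits >> (plen - l)):
            if best is None or l > best[1]:
                best = (value, l)
    return best

tables = {l: {} for l in LENGTHS}
for value, plen in PREFIXES:                   # real prefix entries
    tables[plen][ip(value) >> (32 - plen)] = (value, plen)
for value, plen in PREFIXES:                   # markers at shorter lengths
    for shorter in [l for l in LENGTHS if l < plen]:
        key = ip(value) >> (32 - shorter)
        if key not in tables[shorter]:
            tables[shorter][key] = best_match(key, shorter)

def lookup(addr):
    lo, hi, best = 0, len(LENGTHS) - 1, None
    while lo <= hi:                            # binary search over lengths
        mid = (lo + hi) // 2
        plen = LENGTHS[mid]
        key = addr >> (32 - plen)
        if key in tables[plen]:
            if tables[plen][key] is not None:
                best = tables[plen][key]       # real prefix, or marker's best match
            lo = mid + 1                       # hit: try longer lengths
        else:
            hi = mid - 1                       # miss: try shorter lengths
    return best
```

Address 10.2.4.8 shows why markers carry a best match: the /16 probe hits only the marker "10.2", and without the stored answer the search would wrongly end with nothing instead of 10/8.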
Binary search on trie levels
Advantages:
• Scalable to IPv6.
Disadvantages:
• Multiple hashed memory accesses.
• Updates are complex.

33K entries: 1.4 MB data structure with 1.2-2.2 Mpps [O(log W)]
Compacting Forwarding Tables
[Figure: the trie's leaf level represented as a bit vector (e.g. 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1), with a 1 marking each position where a new next-hop interval begins.]
Compacting Forwarding Tables
[Figure: the bit vector split into chunks (10001010, 11100010, 10000010, 10110100, 11000000); a codeword array stores per-chunk entries such as (R1, 0), (R2, 3), (R3, 7), (R4, 9), (R5, 0), and a base-index array (0, 13) carries the running counts across groups of chunks.]
Compacting Forwarding Tables
Advantages:
• Extremely small data structure - can fit in cache.
Disadvantages:
• Scalability to larger tables?
• Updates are complex.

33K entries: 160 KB data structure with average 2 Mpps [O(W/k)]
Multi-bit Tries
[Figure: the 16-ary search trie again; each node holds sixteen (4-bit value, pointer) pairs, so each memory reference consumes four address bits.]
Compressed Tries
[Figure: trie levels L8, L16, and L24.]
Only 3 memory accesses
Routing Lookups in Hardware
[Figure: histogram of number of prefixes versus prefix length in a backbone routing table.]
Most prefixes are 24-bits or shorter
Routing Lookups in Hardware
Prefixes up to 24-bits
[Figure: for destination 142.19.6.14, the top 24 bits (142.19.6) index directly into a 2^24-entry table; each entry holds a one-bit flag and the next hop.]
Routing Lookups in Hardware
Prefixes above 24-bits
[Figure: for destination 128.3.72.44, the entry indexed by the top 24 bits (128.3.72) holds a flag and a pointer (base) into a second table; the low 8 bits (44) offset from that base to select among the next hops for prefixes longer than 24 bits.]
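A toy sketch of the two-table scheme described above (the DIR-24-8 approach of Gupta et al.): dicts stand in for the DRAM tables, and routes must be added shortest-prefix-first so that longer prefixes overwrite the expanded entries of shorter ones, an ordering assumption the real scheme handles more carefully.

```python
tbl_24 = {}        # stands in for the 2^24-entry first table
tbl_long = {}      # second table, indexed by (pointer, last 8 bits)
next_ptr = [0]     # allocator for second-table blocks

def add_route(prefix, plen, next_hop):
    """prefix is a 32-bit int; add shorter prefixes before longer ones."""
    if plen <= 24:
        base = (prefix >> (32 - plen)) << (24 - plen)
        for i in range(1 << (24 - plen)):        # expand to all covered /24s
            tbl_24[base | i] = (0, next_hop)     # flag 0: entry is a next hop
    else:
        idx24 = prefix >> 8
        entry = tbl_24.get(idx24)
        if entry is None or entry[0] == 0:       # allocate a 256-slot block
            ptr = next_ptr[0]; next_ptr[0] += 1
            default = entry[1] if entry else None
            for b in range(256):                 # pre-fill with the covering hop
                tbl_long[(ptr, b)] = default
            tbl_24[idx24] = (1, ptr)             # flag 1: entry is a pointer
        ptr = tbl_24[idx24][1]
        base = prefix & 0xFF
        for i in range(1 << (32 - plen)):
            tbl_long[(ptr, base | i)] = next_hop

def lookup(addr):
    flag, val = tbl_24.get(addr >> 8, (0, None))  # first memory access
    if flag == 0:
        return val
    return tbl_long[(val, addr & 0xFF)]           # second access, rare in practice

def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

add_route(ip("142.19.6.0"), 24, "A")
add_route(ip("128.3.72.0"), 24, "B")
add_route(ip("128.3.72.128"), 25, "C")
```

Because most prefixes are /24 or shorter, almost every lookup finishes in the single first-table access; only addresses under a longer-than-24 prefix pay for the second access.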
Routing Lookups in Hardware
Prefixes up to n-bits
[Figure: the generalization: a 2^N-entry first table handles prefixes up to N bits; entries (e.g. i and j) covering prefixes longer than N bits point to dedicated 2^M-entry blocks of next hops, handling prefixes up to N+M bits.]
Routing Lookups in Hardware
Advantages:
• 20 Mpps with 50 ns DRAM
• Easy to implement in hardware
Disadvantages:
• Large memory required (9-33 MB)
• Depends on prefix-length distribution.

Various compression schemes can be employed to decrease the storage requirements: e.g. carefully chosen variable-length strides, bitmap compression, etc.
IP Router Lookups: References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink. “Small Forwarding Tables for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.
• B. Lampson, V. Srinivasan, G. Varghese. “IP lookups using multiway and multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner. “Scalable high speed IP routing lookups”, Sigcomm 1997, pp 25-36.
• P. Gupta, S. Lin, N. McKeown. “Routing lookups in hardware at memory access speeds”, Infocom 1998, pp 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson. “Fast address lookup for Internet routers”, IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G.Varghese. “Fast IP lookups using controlled prefix expansion”, Sigmetrics, June 1998.
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Providing Value-Added Services
Some examples
• Differentiated services
  – Regard traffic from Autonomous System #33 as `platinum grade'
• Access Control Lists
  – Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate limit WWW traffic from sub-interface #739 to 10 Mbps
• Policy-based Routing
  – Route all voice traffic through the ATM network
Packet Classification
[Figure: the incoming packet's header enters the forwarding engine; a classifier (policy database) of (predicate, action) rules determines the action applied to the packet.]
Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.
          Field 1             Field 2           ...  Field k  Action
Rule 1    152.163.190.69/21   152.163.80.11/32  ...  UDP      A1
Rule 2    152.168.3.0/24      152.163.0.0/16    ...  TCP      A2
...       ...                 ...               ...  ...      ...
Rule N    152.168.0.0/16      152.0.0.0/8       ...  ANY      An
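The problem statement above has an immediate (if slow) solution: sequential evaluation, checking rules in priority order and returning the first match. A sketch using the example rules; the packet values below are invented.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

RULES = [  # (src value/len, dst value/len, proto or None for ANY, action)
    (("152.163.190.69", 21), ("152.163.80.11", 32), "UDP", "A1"),
    (("152.168.3.0", 24), ("152.163.0.0", 16), "TCP", "A2"),
    (("152.168.0.0", 16), ("152.0.0.0", 8), None, "An"),
]

def prefix_match(addr, value, plen):
    """True if addr agrees with value on the top plen bits."""
    return (addr ^ ip(value)) >> (32 - plen) == 0

def classify(src, dst, proto):
    for (sv, sl), (dv, dl), p, action in RULES:   # highest priority first
        if prefix_match(src, sv, sl) and prefix_match(dst, dv, dl) \
                and (p is None or p == proto):
            return action                          # first match wins
    return "default"
```

This is the "Sequential Evaluation" row of the scheme-comparison tables that follow: tiny storage, arbitrary numbers of fields, but classification time linear in N.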
Geometric Interpretation in 2D
[Figure: rules R1-R7 drawn as rectangles in the (Field #1, Field #2) plane, e.g. (144.24/16, 64/24); a packet is a point such as P1 = (128.16.46.23, *) or P2, and the rules whose rectangles cover that point are its matches.]
Proposed Schemes

Scheme: Sequential Evaluation
Pros: Small storage, scales well with number of fields.
Cons: Slow classification rates.

Scheme: Ternary CAMs
Pros: Single-cycle classification.
Cons: Cost, density, power consumption.

Scheme: Grid of Tries (Srinivasan et al [Sigcomm 98])
Pros: Small storage requirements and fast lookup rates for two fields. Suitable for big classifiers.
Cons: Not easily extendible to more than two fields.
Proposed Schemes (Contd.)

Scheme: Crossproducting (Srinivasan et al [Sigcomm 98])
Pros: Fast accesses. Suitable for multiple fields.
Cons: Large memory requirements. Suitable without caching only for classifiers with fewer than 50 rules.

Scheme: Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
Pros: Suitable for multiple fields.
Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.
Proposed Schemes (Contd.)

Scheme: Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
Pros: Suitable for multiple fields. Small memory requirements. Good update time.
Cons: Large preprocessing time.

Scheme: Tuple Space Search (Srinivasan et al [Sigcomm 99])
Pros: Suitable for multiple fields. The basic scheme has good update times and memory requirements.
Cons: Classification rate can be low. Requires perfect hashing for determinism.

Scheme: Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
Pros: Fast accesses. Suitable for multiple fields. Reasonable memory requirements for real-life classifiers.
Cons: Large preprocessing time and memory requirements for large classifiers.
Grid of Tries
[Figure: a trie on dimension 1 whose nodes point into tries on dimension 2 holding rules R1-R7.]
Grid of Tries
Advantages:
• Good solution for two dimensions
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions

20K entries: 2 MB data structure with 9 memory accesses [at most 2W]
Classification using Bit Parallelism
[Figure: each dimension yields a bitmap over rules R1-R4 marking which rules match in that dimension; ANDing the per-dimension bitmaps gives the set of matching rules.]
Classification using Bit Parallelism
Advantages:
• Good solution for multiple dimensions for small classifiers
Disadvantages:
• Large memory bandwidth
• Hardware optimized

512 rules: 1 Mpps with a single FPGA and five 128 KB SRAM chips.
Classification Using Multiple Fields
Recursive Flow Classification
[Figure: the packet header's fields F1...Fn (2^S = 2^128 possible headers) are reduced in stages through memory lookups (2^64, then 2^24) down to one of 2^T = 2^12 actions.]
Packet Classification: References
• T.V. Lakshman, D. Stiliadis. “High speed policy based packet forwarding using efficient multi-dimensional range matching”, Sigcomm 1998, pp 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel. “Fast and scalable layer 4 switching”, Sigcomm 1998, pp 203-214.
• V. Srinivasan, G. Varghese, S. Suri. “Fast packet classification using tuple space search”, to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, “Packet classification using hierarchical intelligent cuttings”, Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, “Packet classification on multiple fields”, Sigcomm 1999.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Combining input and output queues
  – Other non-blocking fabrics
  – Multicast traffic
Basic Architectural Components
Datapath: per-packet processing
[Figure: the three datapath stages again: (1) forwarding decision via the forwarding table; (2) interconnect; (3) output scheduling.]
Interconnects
Two basic techniques
[Figure: input queueing, usually with a non-blocking switch fabric (e.g. crossbar), versus output queueing, usually with a fast bus.]
Interconnects
Output Queueing
[Figure: with individual output queues (ports 1...N), each output memory needs bandwidth (N+1)R; with a centralized shared memory, the memory needs bandwidth 2NR.]
Output Queueing
The "ideal"
[Figure: cells destined to the same output (labelled 1 and 2) arrive simultaneously; an ideal output-queued switch delivers them all to the output's queue immediately, so only the output link limits performance.]
Output Queueing
How fast can we make centralized shared memory?
[Figure: N ports share one memory over a 200-byte-wide bus of 5 ns SRAM.]
• 5 ns per memory operation
• Two memory operations per packet
• Therefore, up to 160 Gb/s
• In practice, closer to 80 Gb/s
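The 160 Gb/s bullet is just the bus width divided by two memory operations per packet:

```python
bus_width_bits = 200 * 8       # 200-byte-wide bus into the shared memory
sram_cycle = 5e-9              # one memory operation every 5 ns
ops_per_packet = 2             # each packet is written once and read once

# Peak aggregate line rate the shared memory can support, in bits/second.
peak = bus_width_bits / (ops_per_packet * sram_cycle)
```

Anything that halves the effective bus utilization (segmentation overheads, bank conflicts, refresh) eats directly into this figure, which is why the practical number quoted is nearer 80 Gb/s.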
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Interconnects
Input Queueing with Crossbar
[Figure: per-input queues feed a crossbar whose configuration is set each cell time by a scheduler; memory bandwidth is only 2R.]
Input Queueing
Head of Line Blocking
[Figure: delay versus load; with FIFO input queues, delay diverges at 58.6% load rather than at 100%.]
Head of Line Blocking
[Figure: animation over several cell times: a cell at the head of an input FIFO, blocked because its output is busy, prevents cells behind it from reaching idle outputs.]
Input Queueing
Virtual output queues
[Figure: each input keeps a separate queue per output, eliminating head-of-line blocking.]
Input Queues
Virtual Output Queues
[Figure: delay versus load; with virtual output queues, the delay curve extends to 100% load.]
Input Queueing
[Figure: VOQs and crossbar with scheduler; memory bandwidth is still 2R, but the scheduler can be quite complex!]
Input Queueing
Scheduling
[Figure: input i (i = 1...m) holds queues Q(i,1)...Q(i,n) fed by arrival process A_i(t); outputs 1...n have departure processes D_1(t)...D_n(t). Each cell time the scheduler must pick a matching M between inputs and outputs.]
Input Queueing
Scheduling
[Figure: a request graph between inputs 1-4 and outputs 1-4 with queue occupancies as edge weights, and one possible bipartite matching of weight 18.]
Question: Maximum weight or maximum size?
Input Queueing
Scheduling
• Maximum Size
  – Maximizes instantaneous throughput
  – Does it maximize long-term throughput?
• Maximum Weight
  – Can clear most backlogged queues
  – But does it sacrifice long-term throughput?
Input Queueing
Scheduling
[Figure: a 2x2 example: depending on which matching is chosen each cell time, some queues between inputs 1-2 and outputs 1-2 can be persistently starved.]
Input Queueing
Longest Queue First or Oldest Cell First
[Figure: a 4x4 example with one queue of weight 10 and others of weight 1; taking weight to be queue length (LQF) or waiting time (OCF), the maximum-weight matching achieves 100% throughput.]
Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: average VOQ occupancy is flat across VOQs under uniform traffic, and uneven under non-uniform traffic.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
Wave Front Arbiter
[Figure: a 4x4 request matrix; diagonals ("waves") are resolved in sequence so that each input and each output is matched at most once.]
Wave Front Arbiter
[Figure: an example set of requests and the resulting match.]
Wave Front Arbiter
Implementation
[Figure: a 4x4 array of combinational logic blocks, cells (1,1)...(4,4); the arbitration decision ripples diagonally through the array.]
Wave Front Arbiter
Wrapped WFA (WWFA)
[Figure: requests and match; wrapping the diagonals means N steps instead of 2N-1.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
Parallel Iterative Matching
[Figure: iteration #1 for a 4x4 example: each input sends requests to the outputs it has cells for; each output randomly grants one request; each input randomly accepts one grant. Unmatched inputs and outputs repeat the process in iteration #2.]
Parallel Iterative Matching
Maximal is not Maximum
[Figure: a 4x4 request pattern for which PIM can converge to a maximal match that is smaller than the maximum match.]
Parallel Iterative Matching
Analytical Results

Number of iterations to converge:
E[C] <= log2(N)
E[U_i] <= N^2 / 4^i

Where:
C   = number of iterations required to resolve connections
N   = number of ports
U_i = number of unresolved connections after iteration i
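The three-phase request/grant/accept round is short enough to simulate directly. A toy sketch (ports and requests are invented; randomness is seeded for repeatability):

```python
import random

def pim(requests, rng):
    """requests: set of (input, output) pairs; returns (match, iterations)."""
    match_in, match_out = {}, {}
    iterations = 0
    while True:
        # Requests from still-unmatched inputs to still-unmatched outputs.
        live = [(i, o) for (i, o) in requests
                if i not in match_in and o not in match_out]
        if not live:
            return match_in, iterations        # maximal match reached
        iterations += 1
        grants = {}                            # output -> requesting inputs
        for i, o in live:
            grants.setdefault(o, []).append(i)
        # Grant phase: each output grants one request at random.
        granted = {o: rng.choice(ins) for o, ins in grants.items()}
        accepts = {}                           # input -> granting outputs
        for o, i in granted.items():
            accepts.setdefault(i, []).append(o)
        # Accept phase: each input accepts one grant at random.
        for i, outs in accepts.items():
            o = rng.choice(outs)
            match_in[i] = o
            match_out[o] = i
```

Running it on the requests {(1,1), (1,2), (2,1)} shows "maximal is not maximum": if input 1 happens to accept output 1, input 2 can never be matched, so the final match has size 1 even though a size-2 match exists.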
Parallel Iterative Matching
[Figure: worked example of PIM over several cell times.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
iSLIP
[Figure: iterations #1 and #2 for a 4x4 example; like PIM, but the grant and accept choices are made by round-robin pointers instead of at random.]
iSLIP
Properties
• Random under low load
• TDM under high load
• Lowest priority to MRU (most recently used)
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average in fewer than log2(N)
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
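A toy sketch of the iSLIP grant/accept pointer mechanics. It is simplified relative to the real algorithm (notably, real iSLIP only updates pointers for grants accepted in the first iteration), but it shows the key rule: pointers advance one position past a match only when the grant is accepted, which desynchronizes the outputs.

```python
def islip(requests, n, iterations=4):
    """requests: set of (input, output) pairs; returns the match as a dict."""
    grant_ptr = [0] * n                  # per-output round-robin pointer
    accept_ptr = [0] * n                 # per-input round-robin pointer
    match_in, match_out = {}, {}
    for _ in range(iterations):
        # Grant phase: each unmatched output grants the first requesting,
        # unmatched input at or after its pointer.
        grants = {}                      # input -> granting outputs
        for o in range(n):
            if o in match_out:
                continue
            for k in range(n):
                i = (grant_ptr[o] + k) % n
                if (i, o) in requests and i not in match_in:
                    grants.setdefault(i, []).append(o)
                    break
        # Accept phase: each input accepts the granting output closest
        # at or after its pointer; pointers move only past accepted grants.
        for i, outs in grants.items():
            best = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            match_in[i] = best
            match_out[best] = i
            accept_ptr[i] = (best + 1) % n
            grant_ptr[best] = (i + 1) % n
    return match_in
```

With full uniform requests on a 2x2 switch the pointers settle into the TDM-like pattern the properties slide describes: every input and output is matched every cell time.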
iSLIP
[Figure: worked example of iSLIP over several cell times.]
iSLIP
Implementation
[Figure: N programmable priority encoders for the grant decisions and N for the accept decisions; each arbiter keeps log2(N) bits of pointer state.]
Input Queueing: References
• M. Karol et al. “Input vs Output Queueing on a Space-Division Packet Switch”, IEEE Trans Comm., Dec 1987, pp. 1347-1356.
• Y. Tamir, “Symmetric Crossbar arbiters for VLSI communication switches”, IEEE Trans Parallel and Dist Sys., Jan 1993, pp.13-27.
• T. Anderson et al. “High-Speed Switch Scheduling for Local Area Networks”, ACM Trans Comp Sys., Nov 1993, pp. 319-352.
• N. McKeown, “The iSLIP scheduling algorithm for Input-Queued Switches”, IEEE Trans Networking, April 1999, pp. 188-201.
• C. Lund et al. “Fair prioritized scheduling in an input-buffered switch”, Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkitikul et al. “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, IEEE Infocom 98, April 1998.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Other Non-Blocking Fabrics
Clos Network
[Figure: a three-stage Clos network.]
Other Non-Blocking Fabrics
Clos Network
Expansion factor required = 2 - 1/N (but still blocking for multicast)
Other Non-Blocking Fabrics
Self-Routing Networks
[Figure: an 8-port banyan network between inputs 000-111 and outputs 000-111; each stage routes on one bit of the destination address.]
Other Non-Blocking Fabrics
Self-Routing Networks
[Figure: a Batcher sorting network followed by a self-routing banyan network: the non-blocking Batcher-Banyan network, shown sorting and delivering cells addressed 0-7.]
• The fabric can be used as a scheduler.
• The Batcher-Banyan network is blocking for multicast.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Speedup
• Context
  – input-queued switches
  – output-queued switches
  – the speedup problem
• Early approaches
• Algorithms
• Implementation considerations
Speedup: Context
[Figure: a generic switch; the placement of memory gives output-queued, input-queued, or combined input- and output-queued (CIOQ) switches.]
Output-queued switches
Best delay and throughput performance
- Possible to erect "bandwidth firewalls" between sessions
Main problem
- Requires high fabric speedup (S = N)
Unsuitable for high-speed switching
Input-queued switches
Big advantage
- Speedup of one is sufficient
Main problem
- Can't guarantee delay due to input contention
Overcoming input contention: use higher speedup
A Comparison
Memory speeds for a 32x32 switch

             Output-queued             Input-queued
Line Rate    Memory BW   Access Time   Memory BW   Access Time
                         Per cell                  Per cell
100 Mb/s     3.3 Gb/s    128 ns        200 Mb/s    2.12 us
1 Gb/s       33 Gb/s     12.8 ns       2 Gb/s      212 ns
2.5 Gb/s     82.5 Gb/s   5.12 ns       5 Gb/s      84.8 ns
10 Gb/s      330 Gb/s    1.28 ns       20 Gb/s     21.2 ns
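The table's entries follow from two formulas: output queueing needs memory bandwidth (N+1)R (N writes plus one read per cell time), input queueing needs 2R (one write, one read). The access times assume 53-byte cells, which appears to be the cell size used here.

```python
N = 32                      # 32x32 switch
CELL_BITS = 53 * 8          # 53-byte (ATM-sized) cells assumed

def mem_requirements(line_rate):
    """Return (OQ bandwidth, OQ access time, IQ bandwidth, IQ access time)."""
    oq_bw = (N + 1) * line_rate      # N simultaneous writes + 1 read
    iq_bw = 2 * line_rate            # 1 write + 1 read
    return oq_bw, CELL_BITS / oq_bw, iq_bw, CELL_BITS / iq_bw

oq_bw, oq_t, iq_bw, iq_t = mem_requirements(10e9)   # the 10 Gb/s row
```

At 10 Gb/s the output-queued memory must cycle roughly every 1.3 ns, which is what makes pure output queueing untenable at high line rates and motivates the speedup compromise that follows.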
The Speedup Problem
Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch
- close to the cost of an IQ switch
Essential for high speed QoS switching
Some Early Approaches
Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, "friendly correlated")
- obtain mean throughput and delays, bounds on tails
- analyze different fabrics (crossbar, multistage, etc)
Numerical Methods
- use actual and simulated traffic traces
- run different algorithms
- set the "speedup dial" at various values
The findings
Very tantalizing ...
- under different settings (traffic, loading, algorithm, etc)
- and even for varying switch sizes
A speedup of between 2 and 5 was sufficient!
Using Speedup
[Figure: with speedup, the fabric runs faster than the line rate, so within one cell time a cell can cross the fabric and wait in an output queue rather than at its input.]
Intuition
Bernoulli IID inputs:
Speedup = 1: fabric throughput = 0.58
Speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25
Intuition (continued)
Bernoulli IID inputs:
Speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35
Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75
Issues
Need hard guarantees
- exact, not average
Robustness
- realistic, even adversarial, traffic; not friendly Bernoulli IID
The Ideal Solution
[Figure: replace a speedup-N output-queued switch with a CIOQ switch of speedup << N between the same inputs and outputs.]
Question: Can we find
- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch size and traffic pattern?
What is exact mimicking?
Apply same inputs to an OQ and a CIOQ switch
- packet by packet
Obtain same outputs
- packet by packet
Algorithm - MUCF
Key concept: urgency value
- urgency = departure time - present time
MUCF
The algorithm:
- Outputs try to get their most urgent packets.
- Inputs grant to the output whose packet is most urgent; ties broken by port number.
- Loser outputs try for their next most urgent packet.
- The algorithm terminates when no more matchings are possible.
Stable Marriage Problem
[Figure: men (Pedro, John, Bill) matched to women (Maria, Hillary, Monica).]
Men = Outputs
Women = Inputs
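The analogy can be made concrete with a Gale-Shapley sketch: men (outputs) propose in preference order, exactly as MUCF's outputs try for packets in urgency order. The preference lists below are invented for illustration.

```python
def stable_match(men_pref, women_pref):
    """Gale-Shapley: men_pref/women_pref map a name to an ordered
    preference list. Returns a stable matching as {woman: man}."""
    free = list(men_pref)                        # men yet to be matched
    next_idx = {m: 0 for m in men_pref}          # next woman each man proposes to
    engaged = {}                                 # woman -> man
    rank = {w: {m: r for r, m in enumerate(p)} for w, p in women_pref.items()}
    while free:
        m = free.pop()
        w = men_pref[m][next_idx[m]]
        next_idx[m] += 1
        if w not in engaged:
            engaged[w] = m                       # w accepts her first proposal
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])              # w trades up; old partner re-enters
            engaged[w] = m
        else:
            free.append(m)                       # rejected; m proposes again later
    return engaged

men_pref = {"Pedro": ["Maria", "Hillary", "Monica"],
            "John": ["Maria", "Monica", "Hillary"],
            "Bill": ["Hillary", "Maria", "Monica"]}
women_pref = {"Maria": ["John", "Pedro", "Bill"],
              "Hillary": ["Pedro", "Bill", "John"],
              "Monica": ["Bill", "John", "Pedro"]}
matching = stable_match(men_pref, women_pref)
```

Stability is the property the speedup proofs lean on: in the final matching, no output and input both prefer each other to their assigned partners, i.e. no more-urgent packet is left stranded by a less-urgent one.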
An example
Observation: only two reasons a packet doesn't get to its output
- input contention, output contention
- this is why a speedup of 2 works!!
What does this get us?
A speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF.
What about non-FIFO OQ switches? E.g. WFQ, strict priority.
Other results
To exactly emulate an NxN OQ switch:
- A speedup of 2 - 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N).
- Input traffic patterns can be absolutely arbitrary.
- The emulated OQ switch may use any "monotone" scheduling policy, e.g. FIFO, LIFO, strict priority, WFQ, etc.
What gives?
Complexity of the algorithms
- Extra hardware for processing
- Extra run time (time complexity)
What is the benefit?
- Reduced memory bandwidth requirements
Tradeoff: memory for processing
- Moore's Law supports this tradeoff
Implementation - a closer look
Main sources of difficulty
- Estimating urgency, etc - info is distributed
- Matching process - too many iterations?
Estimating urgency depends on what is being emulated
- Like taking a ticket to hold a place in a queue
- FIFO, Strict priorities - no problem
- WFQ, etc - problems
(and communicating this info among I/ps and O/ps)
Implementation (contd)
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations in switching = N
- With high probability, and on average, approximately log(N)
- Worst-case number of iterations for the SMP = N^2
Other Work
Relax the stringent requirement of exact emulation
- Least Occupied Output First Algorithm (LOOFA)
- Keeps outputs always busy if there are packets for them
- By time-stamping packets, it also exactly mimics an OQ switch
Disallow arbitrary inputs
- E.g., leaky bucket constrained
- Obtain worst-case delay bounds
References for speedup
- Y. Oie et al., "Effect of speedup in nonblocking packet switch", ICC '89.
- A. L. Gupta, N. D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", INFOCOM '91.
- S.-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol. 17, no. 6, 1999.
- B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol. 35, 1999.
- P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol. 17, no. 6, 1999.
- A. Charny, "Providing QoS guarantees in input buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic
Multicast Switching
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics
Multicasting
(Figure: a multicast cell arriving at one input port is delivered to several output ports of the switch)
Crossbar fabrics: Method 1
Copy networks: a copy network followed by unicast switching
- Increased hardware, increased input contention
Method 2: Use the copying properties of the crossbar fabric
No fanout-splitting: easy, but low throughput
Fanout-splitting: higher throughput, but not as simple; leaves "residue"
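The difference between the two disciplines can be sketched in a few lines. This is an illustrative model of one cell time, not a hardware scheduler; the set-based fanout bookkeeping is an assumption.

```python
# One cell time in a crossbar serving multicast cells, with and without
# fanout-splitting. cells[i] is the set of output ports still owed a
# copy by the head-of-line cell at input i.

def one_cell_time(cells, split):
    busy = set()                        # outputs already claimed this cell time
    for i, fanout in enumerate(cells):
        if not fanout:
            continue
        if split:
            served = fanout - busy      # send to whichever outputs are free
        else:                           # all-or-nothing: wait for full fanout
            served = fanout if not (fanout & busy) else set()
        busy |= served
        cells[i] = fanout - served      # leftover copies are the "residue"
    return busy

# With splitting, input 1's cell delivers its copy to output 3 right away;
# without splitting it must wait until outputs 1 and 3 are free together.
a = [{1, 2}, {1, 3}]
one_cell_time(a, split=True)    # a becomes [set(), {1}]
b = [{1, 2}, {1, 3}]
one_cell_time(b, split=False)   # b becomes [set(), {1, 3}]
```

The residue left at input 1 under splitting is exactly what the next slides discuss placing.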
The effect of fanout-splitting
Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic
Placement of residue
Key question: How should outputs grant requests (and hence decide the placement of residue)?
Residue and throughput
Result: Concentrating residue brings more new work forward, and hence leads to higher throughput.
But there are fairness problems to deal with.
This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.
Multicasting and Tetris
(Figure: multicast cells drawn as Tetris-like blocks spanning output ports 1-5, queued at input ports 1-5; the leftover residue is highlighted)
Multicasting and Tetris
(Figure: the same Tetris-like picture, with the residue concentrated on as few input ports as possible)
Replication by recycling
Main idea: make two copies at a time using a binary tree, with the input at the root and all possible destination outputs at the leaves.
(Figure: binary copy tree rooted at the input; the labelled nodes a-e, x, y show cells being copied two at a time toward the destination outputs)
Replication by recycling (cont’d)
(Block diagram: Receive, Recycle Network, Resequence, and Transmit stages, with an Output Table)
Scalable to large fanouts. Needs resequencing at the outputs and introduces variable delays.
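The copy-tree idea can be sketched as follows. This is purely illustrative (the function name and the halve-the-fanout rule are assumptions); it just shows that recycling reaches every leaf in about log2(fanout) passes, which is why it scales to large fanouts.

```python
# Replication by recycling: each pass through the switch makes at most
# two copies of a cell (one level of the binary copy tree), and copies
# whose fanout set is not yet a single output are recycled.

def recycle_passes(fanout_sets):
    """Return the number of passes until every copy targets one output."""
    passes = 0
    cells = [f for f in fanout_sets if f]
    while any(len(f) > 1 for f in cells):
        nxt = []
        for f in cells:
            if len(f) == 1:
                nxt.append(f)                         # already a unicast copy
            else:
                f = sorted(f)
                half = len(f) // 2
                nxt += [set(f[:half]), set(f[half:])]  # two copies, recycled
        cells = nxt
        passes += 1
    return passes
```

For a fanout of 5 this takes 3 passes (ceil(log2 5)); a unicast cell needs none.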
References for Multicasting
• J. Hayes et al., "Performance analysis of a multicast switch", IEEE Trans. on Communications, vol. 39, April 1991.
• B. Prabhakar et al., "Tetris models for multicast switches", Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
• B. Prabhakar et al., "Multicast scheduling for input-queued switches", IEEE JSAC, 1997.
• J. Turner, "An optimal nonblocking multicast virtual circuit switch", INFOCOM, 1994.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Output Scheduling
• What is output scheduling?
• How is it done?
• Practical Considerations
Output Scheduling
(Figure: a scheduler at the output port serving several flow queues)
Allocating output bandwidth and controlling packet delay
Output Scheduling
FIFO
Fair Queueing
Motivation
• FIFO is natural but gives poor QoS
– bursty flows increase delays for others
– hence it cannot guarantee delays
• Need round-robin scheduling of packets
– Fair Queueing
– Weighted Fair Queueing, Generalized Processor Sharing
Fair queueing: Main issues
• Level of granularity
– packet-by-packet? (favors long packets)
– bit-by-bit? (ideal, but very complicated)
• Packet Generalized Processor Sharing (PGPS)
– serves packet-by-packet
– and imitates bit-by-bit schedule within a tolerance
How does WFQ work?
Weights: WR = 1, WG = 5, WP = 2
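The weight-proportional service can be sketched with virtual finish times. This is a hedged, simplified illustration: the weights WR = 1, WG = 5, WP = 2 come from the slide, but the packet sizes and the assumption that everything arrives at time 0 are made up, and the sketch omits the exact virtual-time update of true PGPS.

```python
# Simplified packet-by-packet WFQ: each flow's packet gets a virtual
# finish time F = previous_F_of_flow + size / weight, and packets are
# served in increasing F.

import heapq

def wfq_order(packets, weights):
    """packets: list of (flow, size) in arrival order, all assumed to
    arrive at time 0. Returns the flows in service order."""
    finish = {f: 0.0 for f in weights}
    heap = []
    for seq, (flow, size) in enumerate(packets):
        finish[flow] += size / weights[flow]
        heapq.heappush(heap, (finish[flow], seq, flow))
    return [flow for _, _, flow in sorted(heap)]

order = wfq_order(
    [("R", 100), ("G", 100), ("G", 100), ("P", 100)],
    {"R": 1, "G": 5, "P": 2})
```

G's finish times (20, then 40) beat P's (50) and R's (100), so the heavier-weighted flow is served first even though R's packet arrived earlier.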
Delay guarantees
• Theorem
If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
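As a concrete instance of the theorem, the single-node fluid version of the bound can be written down directly. This is a sketch using the standard leaky-bucket notation, which is not defined on the slide: \(\sigma_i\) is session \(i\)'s burst size, \(\rho_i\) its token rate, and \(g_i\) its guaranteed GPS service rate.

```latex
% Session i is (\sigma_i, \rho_i) leaky-bucket constrained and is
% guaranteed a GPS rate g_i \ge \rho_i at the node; its worst-case
% fluid delay is then bounded by the burst divided by that rate:
D_i^{*} \;\le\; \frac{\sigma_i}{g_i}, \qquad \rho_i \le g_i .
```

The packetized (PGPS/WFQ) and multiple-node versions add packet-length terms to this bound, as in the Parekh-Gallager papers cited at the end of this section.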
Practical considerations
• For every packet, the scheduler needs to
– classify it into the right flow queue and maintain a linked-list
for each flow
– schedule it for departure
• The complexity of both is O(log [# of flows])
– first is hard to overcome
– second can be overcome by DRR
Deficit Round Robin
(Figure: DRR example with quantum size 500; per-flow queues hold packets of sizes such as 50, 700, 250 and 400, 600, and each queue's deficit counter, incremented by the quantum every round, determines which head packets can be sent)
Good approximation of FQ
Much simpler to implement
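The quantum-and-deficit bookkeeping can be sketched as follows, using the slide's quantum of 500 bytes; the function name and example packet sizes are illustrative.

```python
# Minimal Deficit Round Robin: each round, a backlogged queue's deficit
# counter grows by the quantum, and the queue may send head packets as
# long as each fits within the remaining counter.

from collections import deque

def drr(queues, quantum=500):
    """queues: list of deques of packet sizes (bytes). Returns the
    transmissions as (queue_index, size) in order."""
    deficit = [0] * len(queues)
    sent = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0          # an empty queue keeps no credit
                continue
            deficit[i] += quantum
            while q and q[0] <= deficit[i]:
                size = q.popleft()
                deficit[i] -= size
                sent.append((i, size))
    return sent
```

A queue with a 700-byte head packet must wait one round to accumulate enough deficit, which is how DRR stays fair to flows with small packets without sorting timestamps.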
But...
• WFQ is still very hard to implement
– classification is a problem
– needs to maintain too much state information
– doesn’t scale well
Strict Priorities and Diff Serv
• Classify flows into priority classes
– maintain only per-class queues
– perform FIFO within each class
– avoid “curse of dimensionality”
Diff Serv
• A framework for providing differentiated QoS
– set Type of Service (ToS) bits in packet headers
– this classifies packets into classes
– routers maintain per-class queues
– condition traffic at network edges to conform to class requirements
May still need queue management inside the network
References for O/p Scheduling
- A. Demers et al., "Analysis and simulation of a fair queueing algorithm", ACM SIGCOMM, 1989.
- A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single node case", IEEE/ACM Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case", IEEE/ACM Trans. on Networking, August 1993.
- M. Shreedhar, G. Varghese, "Efficient Fair Queueing using Deficit Round Robin", ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), "Differentiated Services: Operational Model and Definitions", Internet Draft, 1998.
Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness
Tail Drop Queue Management: Lock-Out
(Figure: a queue full to the max queue length, occupied by a few flows that lock the others out)
Tail Drop Queue Management
• Drop packets only when the queue is full
– long steady-state delay
– global synchronization
– bias against bursty traffic
Global Synchronization
(Figure: queue occupancy oscillating in lockstep below the max queue length)
Bias Against Bursty Traffic
(Figure: a burst arriving at a nearly full queue is disproportionately dropped at the max queue length)
Alternative Queue Management Schemes
• Drop from front on full queue
• Drop at random on full queue
Both solve the lock-out problem; both still have the full-queues problem.
Active Queue Management: Goals
• Solve lock-out and full-queue problems
– no lock-out behavior
– no global synchronization
– no bias against bursty flows
• Provide better QoS at a router
– low steady-state delay
– lower packet dropping
Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness
Random Early Detection (RED)
if qavg < minth: admit every packet
else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth)/(maxth - minth)
else (qavg > maxth): drop every incoming packet
(Figure: a queue holding packets P1 ... Pk, with thresholds minth and maxth marked and the average queue size qavg)
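The drop rule on this slide transcribes directly to code. This sketch follows the slide's simplified variant, where the drop probability ramps from 0 to 1 over [minth, maxth]; real RED caps the ramp at a maximum probability maxp and spaces drops between marked packets.

```python
# RED drop decision as stated on the slide, driven by the average
# queue size qavg rather than the instantaneous queue length.

import random

def red_drop(qavg, minth, maxth, rng=random.random):
    """Return True if the incoming packet should be dropped."""
    if qavg < minth:
        return False                            # admit every packet
    if qavg <= maxth:
        p = (qavg - minth) / (maxth - minth)    # probability ramps linearly
        return rng() < p                        # probabilistic early drop
    return True                                 # drop every packet
```

Passing a deterministic `rng` makes the probabilistic branch easy to test; in a router, qavg would be an EWMA of the queue length updated on each arrival.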
Effectiveness of RED: Lock-Out
• Packets are randomly dropped
• Each flow has the same probability of being discarded
Effectiveness of RED: Full-Queue
• Drop packets probabilistically in anticipation of congestion (not when the queue is full)
• Use qavg to decide the packet dropping probability: allows instantaneous bursts
• Randomness avoids global synchronization
What QoS does RED Provide?
• Lower buffer delay: good interactive service – qavg is controlled to be small
• Given responsive flows: packet dropping is reduced– early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation
Unresponsive or aggressive flows
• Don’t properly back off during congestion
• Take away bandwidth from TCP compatible flows
• Monopolize buffer space
Control Unresponsive Flows
• Some active queue management schemes identify and penalize unresponsive flows with a bit of extra work:
– RED with penalty box
– Flow RED (FRED)
– Stabilized RED (SRED)
Active Queue ManagementReferences
• B. Braden et al., "Recommendations on queue management and congestion avoidance in the Internet", RFC 2309, 1998.
• S. Floyd, V. Jacobson, "Random early detection gateways for congestion avoidance", IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, "Dynamics of random early detection", ACM SIGCOMM, 1997.
• T. Ott et al., "SRED: Stabilized RED", INFOCOM, 1999.
• S. Floyd, K. Fall, "Router mechanisms to support end-to-end congestion control", LBL technical report, 1997.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Basic Architectural Components
Control: routing, reservation / admission control, congestion control, policing
Datapath (per-packet processing): switching, output scheduling
Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (lookup in the forwarding table)
2. Interconnect
3. Output scheduling