Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
Packet Switches
Packet switches
- In a circuit switch, the path of a sample is determined at the time of connection establishment; no sample header is needed (its position in the frame identifies it).
- In a packet switch, packets carry a destination field or label; the destination port must be looked up on the fly.
- Datagram switches: lookup based on the entire destination address (longest-prefix match).
- Cell or label switches: lookup based on VCIs or labels.
- L2 switches, L3 switches, L4-L7 switches.
- The key difference is in the lookup function (i.e., filtering), not in the switching (i.e., not in the forwarding).
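A minimal sketch of the datagram lookup: a linear-scan longest-prefix match over a small forwarding table. The table layout and names are illustrative, not from the slides; real routers use tries or TCAMs rather than a linear scan.

```python
def longest_prefix_match(table, addr, width=32):
    """table: list of (prefix, prefix_len, out_port); addr: integer IPv4 address.
    Returns the out_port of the longest matching prefix, or None."""
    best_len, best_port = -1, None
    for prefix, plen, port in table:
        mask = ((1 << plen) - 1) << (width - plen) if plen else 0
        if (addr & mask) == (prefix & mask) and plen > best_len:
            best_len, best_port = plen, port
    return best_port

# hypothetical routes: 10.0.0.0/8 -> port 1, 10.1.0.0/16 -> port 2
table = [(0x0A000000, 8, 1), (0x0A010000, 16, 2)]
```

E.g., 10.1.2.3 matches both routes, but the /16 is longer, so the cell goes to port 2.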
Shared Memory Switches
- Dual-ported RAM; incoming cells are converted from serial to parallel.
- Elegant, but memory speeds and port counts don't scale.
- Output buffering: 100% throughput under heavy load, and buffer use is minimized.
- E.g.: CNET Prelude, Hitachi shared-buffer switch, AT&T GCNS-2000.
Shared memory fabrics: more…
- Memory interface hardware is expensive => many "ports" share fewer memory interfaces (e.g., dual-ported memory).
- Separate low-speed bus lines for the controller.
Shared Medium Switches
- Share a medium (i.e., bus, ring, etc.) instead of memory.
- The medium has to be N times as fast as a port.
- Address filters and output buffers must run at the medium speed too!
- TDM + round robin.
- E.g.: IBM PARIS & plaNET switches, Fore Forerunner ASX-100, NEC ATOM.
Fully Interconnected Switches
- Full interconnection: broadcast + address filters.
- Multicasting is natural; output queuing.
- All hardware runs at the same speed => scalable, but quadratic growth of buffers/filters.
- The Knockout switch (AT&T) reduced the number of buffers: a fixed L (= 8) buffers per output plus a tournament method to eliminate packets, at a small residual packet-loss rate (about 1 in a million).
- E.g.: Fujitsu bus matrix, GTE SPANet.
Crossbar: “Switched” interconnections
- 2N media (i.e., buses), BUT… use "switches" between each input and output bus instead of broadcasting.
- Total number of "paths" required = N+M; number of switching points (crosspoints) = N×M.
- Arbitration/scheduling is needed to deal with output-port contention.
Multi-Stage Fabrics
- A compromise between pure time division and pure space division; attempts to combine the advantages of each:
  - lower cost from time division,
  - higher performance from space division.
- Technique: limited sharing. E.g.: the Banyan switch.
- Features: scalable; self-routing (i.e., no central controller); packet queues allowed, but not required.
- Note: multi-stage switches share the "crosspoints", which have now become the "expensive" resource…
Multi-stage switches: fewer crosspoints
Issue: output & internal blocking…
Banyan Switch Fabric (Contd)
- Basic building block = a 2×2 switch, with its outputs labelled 0/1.
- Can be synchronous or asynchronous; asynchronous => packets can arrive at arbitrary times.
- A synchronous banyan offers TWICE the effective throughput!
- Worst case: all inputs receive packets with the same label.
Switch fabric element
- Goal: "self-routing" fabrics; build complicated fabrics from simple elements.
- Routing rule: if the routing bit is 0, send the packet to the upper output; else send it to the lower output.
- If both packets want the same output, buffer or drop one.
Multi-stage Interconnects (MINs): Banyan
- Key: reduce the number of crosspoints relative to a crossbar.
- 8×8 banyan: recursive design.
  - Use the first bit to route the cell through the first stage, to either the upper or the lower 4×4 network.
  - Use the last 2 bits to route the cell through the 4×4 network to the appropriate output port.
- Self-routing: the output address completely specifies the route through the network (aka digit-controlled routing).
- Simple elements; scalable; parallel routing; all elements run at the same speed.
- E.g.: Bellcore Sunshine, Alcatel DN 1100.
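Digit-controlled routing can be sketched in code for the shuffle-exchange (omega) form of the banyan; the function names are illustrative. At each stage, the cell's position is perfect-shuffled and its low bit is replaced by the next destination digit (0 = upper output, 1 = lower), so after log2 N stages the cell sits at its destination.

```python
def omega_route(inp, dest, nbits):
    """Trace a cell through an nbits-stage shuffle-exchange (omega) banyan.
    Returns the element-output position after each stage; the last is dest."""
    mask = (1 << nbits) - 1
    pos, path = inp, []
    for k in range(nbits):
        pos = ((pos << 1) | (pos >> (nbits - 1))) & mask  # perfect shuffle
        bit = (dest >> (nbits - 1 - k)) & 1               # next destination digit
        pos = (pos & ~1) | bit                            # exchange: 0=upper, 1=lower
        path.append(pos)
    return path

def has_internal_blocking(cells, nbits):
    """cells: list of (input, dest) pairs. Two cells block if they need the
    same element output at the same stage."""
    seen = set()
    for inp, dest in cells:
        for k, pos in enumerate(omega_route(inp, dest, nbits)):
            if (k, pos) in seen:
                return True
            seen.add((k, pos))
    return False
```

Two cells headed for the same output always collide in the final stage, and some patterns with distinct outputs still collide earlier, which is the internal blocking the later slides discuss.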
Banyan Fabric: another view…
Banyan
- The simplest self-routing recursive fabric.
- Two packets wanting the same output => output blocking.
- In a banyan, packets may block even when headed to different outputs => internal blocking! Unlike a crossbar, because the banyan has fewer crosspoints.
- However, feasible non-blocking schedules exist => pre-sort and shuffle packets to reach such non-blocking schedules.
Non-Blocking Batcher-Banyan
[Figure: an 8×8 Batcher sorting network followed by a self-routing banyan; cells labelled with destination ports are progressively sorted by the merge stages and then delivered without internal blocking.]
Batcher Sorter + Self-Routing Network
- The fabric can be used as a scheduler.
- The Batcher-banyan network is blocking for multicast.
Blocking in Banyan Switches: Sorting
- Blocking can be avoided by choosing the order in which packets appear at the input ports.
- If we can:
  - present packets at the inputs sorted by output,
  - "trap" duplicates (i.e., cells going to the same output port),
  - remove gaps (concentrate),
  - precede the banyan with a perfect-shuffle stage,
  then there is no internal blocking.
- For example, with inputs [X, 010, 010, X, 011, X, X, X]:
  sort => [010, 010, 011, X, X, X, X, X];
  trap duplicates => [010, 011, X, X, X, X, X, X];
  shuffle => [010, X, 011, X, X, X, X, X].
- Need sort, trap, and shuffle networks.
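The sort / trap / concentrate pipeline can be sketched functionally, as a software stand-in for what the Batcher sorter and trap network do in hardware; the names are illustrative.

```python
def sort_trap_concentrate(cells):
    """cells: list of destination ports, with None marking an idle input.
    Returns the cell vector after sorting, trapping duplicate destinations,
    and concentrating (removing gaps), padded back to the switch size."""
    live = sorted(c for c in cells if c is not None)  # Batcher sort stage
    trapped = []
    for c in live:                                    # trap duplicates
        if not trapped or trapped[-1] != c:
            trapped.append(c)
    return trapped + [None] * (len(cells) - len(trapped))

# inputs [X, 2, 2, X, 3, X, X, X] -> [2, 3, X, X, X, X, X, X]
```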
Sorting using Merging
- Build sorters from merge networks: assume we can merge two sorted lists, then sort pairwise, merge, and recurse.
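The sort-pairwise / merge / recurse construction is Batcher's odd-even mergesort. The compare-exchange sequence below is data-independent, which is what makes it realizable as a hardware sorting network; this recursive sketch assumes a power-of-two input size.

```python
def compare_exchange(a, i, j):
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]

def odd_even_merge(a, lo, hi, r):
    """Merge the run a[lo..hi], whose two halves are sorted, at stride r."""
    step = r * 2
    if step < hi - lo:
        odd_even_merge(a, lo, hi, step)       # merge the even subsequence
        odd_even_merge(a, lo + r, hi, step)   # merge the odd subsequence
        for i in range(lo + r, hi - r, step):
            compare_exchange(a, i, i + r)
    else:
        compare_exchange(a, lo, lo + r)

def odd_even_merge_sort(a, lo=0, hi=None):
    """Batcher's odd-even mergesort: sort each half, then merge recursively."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        odd_even_merge_sort(a, lo, mid)
        odd_even_merge_sort(a, mid + 1, hi)
        odd_even_merge(a, lo, hi, 1)
```

Because every compare-exchange happens at a fixed position regardless of the data, each one maps to a 2×2 sorting element in the Batcher fabric.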
Putting it together: Batcher-Banyan
Scaling Banyan Networks: Challenges
1. Batcher-banyan networks of significant size are physically limited by achievable circuit density and by the number of input/output pins per integrated circuit. When several boards are interconnected, interconnection complexity and power dissipation constrain the number of boards.
2. The entire set of N cells must be synchronized at every stage.
3. Large sizes increase the difficulty of reliability and repairability.
4. All modifications that maximize the throughput of space-division networks increase the implementation complexity.
Other Non-Blocking Fabrics: Clos Network
Other Non-Blocking Fabrics: Clos Network
- Expansion factor required = 2 − 1/N (but still blocking for multicast).
Blocking and Buffering
Blocking in packet switches
- Packet switches can have both internal and output blocking:
  - internal: no path to the output;
  - output: the trunk is unavailable.
- Unlike in a circuit switch, we cannot predict whether packets will block (why? packet arrivals are not scheduled in advance).
- If a packet is blocked => it must be either buffered or dropped.
Dealing with blocking in packet switches
- Over-provisioning: make internal links much faster than the inputs.
- Buffers: at the input or the output.
- Backpressure: if the switch fabric doesn't have buffers, prevent a packet from entering until a path is available.
- Parallel switch fabrics: increase the effective switching capacity.
Blocking in Banyan Fabric
Buffering: where?
- Input
- Output
- Internal
- Re-circulating
Queuing: input, output buffers
Switch Fabrics: Buffered crossbar
- What happens if packets at two inputs both want to go to the same output?
- One can be deferred in an input buffer.
- Or, buffer at the crosspoints: this requires a complex arbiter.
Queuing: Two basic practical techniques
- Input queueing: usually a non-blocking switch fabric (e.g., crossbar).
- Output queueing: usually a fast bus.
Queuing: Output Queueing
- Individual output queues: memory b/w = (N+1)·R per queue.
- Centralized shared memory: memory b/w = 2N·R.
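The two bandwidth figures follow from counting worst-case memory operations per cell time; a trivial sketch (the function name and Gb/s units are illustrative):

```python
def oq_memory_bw(n_ports, line_rate):
    """Worst-case memory bandwidth for an output-queued switch with
    n_ports ports, each at line_rate."""
    # Individual output queues: all N inputs may write to the same queue
    # in one cell time while it also reads one cell out -> (N+1)*R per queue.
    per_queue = (n_ports + 1) * line_rate
    # Centralized shared memory: N writes plus N reads per cell time -> 2N*R.
    shared = 2 * n_ports * line_rate
    return per_queue, shared
```

E.g., 8 ports at 10 Gb/s need 90 Gb/s per output queue, or 160 Gb/s for a centralized shared memory.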
Output Queuing
Input Queuing
Input Queueing: Head-of-Line Blocking
[Figure: average delay vs. offered load for FIFO input queues; the delay diverges at about 58.6% load, well short of 100%.]
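The 58.6% figure is the classic saturation throughput of FIFO input queues under uniform traffic (≈ 2 − √2 as N → ∞). A small Monte Carlo sketch reproduces it; all names are illustrative, and saturated arrivals are assumed.

```python
import random
from collections import deque

def hol_throughput(n=16, slots=20000, seed=1):
    """Estimate per-port throughput of an n-port FIFO-input-queued switch
    under saturation: every input receives one cell per slot with a uniform
    random destination, and each output serves one randomly chosen
    head-of-line contender per slot."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(n)]
    delivered = 0
    for _ in range(slots):
        for q in queues:                      # saturated arrivals
            q.append(rng.randrange(n))
        contenders = {}
        for i, q in enumerate(queues):        # only HOL cells compete
            contenders.setdefault(q[0], []).append(i)
        for out, inputs in contenders.items():
            queues[rng.choice(inputs)].popleft()
            delivered += 1
    return delivered / (n * slots)
```

For n = 16 the estimate lands near 0.6, and it approaches 0.586 as n grows.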
Solution: Input Queueing with Virtual Output Queues (VOQs)
Head-of-Line (HOL) in Input Queuing
Input Queues with Virtual Output Queues
[Figure: average delay vs. offered load; with VOQs the switch can be driven up to 100% load.]
Input Queueing
- Memory b/w = 2R.
- The scheduler can be quite complex!
Input Queueing: Scheduling
[Figure: an m×n switch with a virtual output queue Q(i,j) at each input i for each output j; arrivals A_i(t) at the inputs, departures D_j(t) at the outputs. Each cell time, the scheduler computes a matching M between inputs and outputs.]
Input Queueing Scheduling: Example
[Figure: VOQ occupancies drawn as a request graph from inputs 1-4 to outputs 1-4, with edge weights equal to the queue lengths; the scheduler chooses a bipartite matching (the one shown has weight 18).]
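For intuition, the maximum-weight matching in such a request graph can be computed by brute force over all input-to-output assignments. This O(N!) sketch is purely illustrative (fine at N = 4; real schedulers use faster algorithms), and the queue lengths below are hypothetical, not the slide's.

```python
from itertools import permutations

def max_weight_match(w):
    """w[i][j] = weight (e.g., VOQ length) of the edge input i -> output j.
    Returns (best_weight, assignment), where assignment[i] is input i's output."""
    n = len(w)
    best, best_m = -1, None
    for perm in permutations(range(n)):
        s = sum(w[i][perm[i]] for i in range(n))
        if s > best:
            best, best_m = s, perm
    return best, best_m

# hypothetical 3x3 queue lengths
weights = [[10, 1, 0],
           [1, 1, 0],
           [0, 0, 5]]
```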
Input Queueing: Longest Queue First or Oldest Cell First
[Figure: a 4×4 example where the maximum-weight matching prefers two queues of weight 10 over several queues of weight 1.]
- Maximum-weight matching with weight = queue length (LQF) or waiting time (OCF) => 100% throughput.
Input Queueing: Scheduling
- Maximum size matching: maximizes instantaneous throughput. But does it maximize long-term throughput?
- Maximum weight matching: can clear the most backlogged queues. But does it sacrifice long-term throughput?
Input Queuing: Why is serving long/old queues better than serving the maximum number of queues?
- When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
- When traffic is non-uniform, some queues become longer than others.
- A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: average occupancy per VOQ — flat under uniform traffic, skewed under non-uniform traffic.]
Input Queueing: Practical Algorithms
- Maximal size algorithms: Wave Front Arbiter (WFA), Parallel Iterative Matching (PIM), iSLIP.
- Maximal weight algorithms: Fair Access Round Robin (FARR), Longest Port First (LPF).
iSLIP
[Figure: a 4×4 example of one iSLIP iteration: (1) inputs send Requests to the outputs they have cells for; (2) each output Grants the requesting input nearest its round-robin pointer; (3) each input Accepts the granting output nearest its own round-robin pointer, producing the match.]
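One request-grant-accept iteration can be sketched as below; the names are illustrative. In iSLIP the pointers advance only when a grant is accepted in the first iteration, which is the case this single-iteration sketch models.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One iSLIP iteration for an n x n switch.
    requests[i]: set of outputs that input i has cells for.
    grant_ptr[j], accept_ptr[i]: round-robin pointers, updated on accept.
    Returns the list of (input, output) pairs matched this iteration."""
    n = len(requests)
    # Grant phase: each output grants the requesting input nearest its pointer.
    grants = {}
    for out in range(n):
        reqs = [i for i in range(n) if out in requests[i]]
        for k in range(n):
            cand = (grant_ptr[out] + k) % n
            if cand in reqs:
                grants.setdefault(cand, []).append(out)
                break
    # Accept phase: each input accepts the granting output nearest its pointer.
    matched = []
    for inp, outs in grants.items():
        for k in range(n):
            cand = (accept_ptr[inp] + k) % n
            if cand in outs:
                matched.append((inp, cand))
                grant_ptr[cand] = (inp + 1) % n   # pointer moves past the winner
                accept_ptr[inp] = (cand + 1) % n
                break
    return matched
```

Advancing each pointer one past the just-served port is what desynchronizes the arbiters over successive cell times and drives the TDM-like behaviour under heavy load.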
iSLIP Properties
- Behaves randomly under low load; degenerates to TDM under high load.
- Gives lowest priority to the most recently used (MRU) port.
- With 1 iteration: fair to outputs.
- Converges in at most N iterations; on average <= log2 N.
- Implementation: N priority encoders.
- Up to 100% throughput for uniform traffic.
iSLIP Implementation
[Figure: hardware sketch — N programmable priority encoders implement the Grant arbiters and another N the Accept arbiters; each decision and each pointer state is log2 N bits wide.]
Throughput results
Theory:
- Input queueing (IQ): 58% [Karol, 1987].
- IQ + VOQ, maximum weight matching: 100% [M et al., 1995].
- IQ + VOQ, randomized algorithms: 100% [Tassiulas, 1998].
- IQ + VOQ, maximal size matching with a speedup of two: 100% [Dai & Prabhakar, 2000].
Practice:
- IQ + VOQ, sub-maximal size matching (e.g., PIM, iSLIP).
- Different weight functions, incomplete information, pipelining.
- Various heuristics, distributed algorithms, and amounts of speedup: 100% [Various].
Speedup: Context
[Figure: a generic switch with memory stages at the inputs and outputs.]
The placement of memory gives:
- output-queued switches,
- input-queued switches,
- combined input- and output-queued switches.
Output-queued switches
- Best delay and throughput performance; possible to erect "bandwidth firewalls" between sessions.
- Main problem: requires high fabric speedup (S = N).
- Hence unsuitable for high-speed switching.
Input-queued switches
- Big advantage: a speedup of one is sufficient.
- Main problem: can't guarantee delay due to input contention.
- Overcoming input contention: use a higher speedup.
The Speedup Problem
Find a compromise: 1 < speedup << N
- to get the performance of an OQ switch,
- at close to the cost of an IQ switch.
Essential for high-speed QoS switching.
Intuition
With Bernoulli IID inputs:
- Speedup = 1: fabric throughput = 0.58.
- Speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25.
Intuition (continued)
With Bernoulli IID inputs:
- Speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35.
- Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75.