High Performance Switches and Routers: Theory and Practice
Sigcomm 99, August 30, 1999, Harvard University
Nick McKeown and Balaji Prabhakar
Departments of Electrical Engineering and Computer Science
Copyright 1999. All Rights Reserved
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
Basic Architectural Components
[Figure: the control plane (routing, reservation, admission control, congestion control, policing) sits above the datapath, which performs per-packet processing: switching and output scheduling.]
Basic Architectural Components
Datapath: per-packet processing
[Figure: each arriving packet passes through three stages: (1) a forwarding decision, made by consulting a forwarding table; (2) the interconnect, which carries the packet to its outgoing port; (3) output scheduling.]
Where high performance packet switches are used
[Figure: the Internet core (carrier-class core routers, ATM switches, Frame Relay switches), edge routers, and enterprise WAN access and enterprise campus switches.]
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.
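The four steps above can be sketched in a few lines. This is an illustrative sketch only; the VC-table contents and cell fields are made up for the example.

```python
# VC table: indexed directly by the incoming VCI; each entry holds
# the (outgoing port, outgoing VCI) pair.
vc_table = {}
vc_table[42] = (3, 17)   # cells arriving with VCI 42 leave port 3 carrying VCI 17

def forward_cell(cell):
    """Lookup the cell's VCI, rewrite it, and return the outgoing port."""
    out_port, out_vci = vc_table[cell["vci"]]  # 1. lookup VCI/VPI in VC table
    cell["vci"] = out_vci                      # 2. replace old VCI with new
    return out_port                            # 3. forward to outgoing interface

cell = {"vci": 42, "payload": b"..."}
port = forward_cell(cell)
```

Because the VCI is a small, dense label negotiated at connection setup, the table can literally be memory indexed by VCI, which is what makes the lookup a single memory reference.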
Ethernet Switch
• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.
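The learn/lookup/flood behaviour above fits in a few lines. A minimal sketch, with invented MAC names and a fixed port set:

```python
forwarding_table = {}      # maps MAC address -> port, learned from source addresses
ALL_PORTS = {1, 2, 3, 4}

def handle_frame(sa, da, in_port):
    """Return the set of ports the frame is sent out of."""
    forwarding_table[sa] = in_port       # learn SA of incoming frame
    if da in forwarding_table:           # DA known: forward to the correct port
        return {forwarding_table[da]}
    return ALL_PORTS - {in_port}         # DA unknown: flood (not back out the input)

# Host A (port 1) sends to unknown host B: the frame is flooded.
out = handle_frame(sa="A", da="B", in_port=1)
# B replies from port 2: the switch has learned A, so the reply goes to port 1 only.
out2 = handle_frame(sa="B", da="A", in_port=2)
```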
IP Router
• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.
• Decrement TTL and update the header checksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
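The TTL/checksum step is the one piece of per-packet header rewriting in IPv4, and it can be done incrementally rather than by summing the whole header again (in the style of RFC 1624: HC' = ~(~HC + ~m + m')). A sketch; for brevity it assumes the protocol byte sharing the TTL's 16-bit header word is zero, which is an artificial simplification.

```python
def ones_complement_add(a, b):
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def full_checksum(words):
    """Checksum computed from scratch over the header words (slow path)."""
    s = 0
    for w in words:
        s = ones_complement_add(s, w)
    return ~s & 0xFFFF

def decrement_ttl(ttl, checksum):
    """Return (new ttl, new checksum) after decrementing TTL by one.
    The TTL is the high byte of its header word; the low (protocol)
    byte is assumed zero here to keep the example short."""
    m_old = ttl << 8                       # old 16-bit word containing TTL
    m_new = (ttl - 1) << 8                 # new word
    hc = ~checksum & 0xFFFF
    hc = ones_complement_add(hc, ~m_old & 0xFFFF)
    hc = ones_complement_add(hc, m_new)
    return ttl - 1, ~hc & 0xFFFF

# Example header words (checksum field itself left as zero while summing).
words = [0x4500, 0x0054, 0x1c46, 0x4000, 64 << 8, 0x0000,
         0xac10, 0x0a63, 0xac10, 0x0a0c]
cs = full_checksum(words)
new_ttl, new_cs = decrement_ttl(64, cs)
```

The incremental form matters at high speed: only one header word changed, so the router touches two 16-bit values instead of re-reading the whole header.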
Introduction: What is a Packet Switch?
• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers
First-Generation IP Routers
[Figure: a shared backplane connects a central CPU and buffer memory to several line interfaces (MAC + DMA); every packet crosses the bus to the CPU and back.]
Second-Generation IP Routers
[Figure: each line card gains its own local buffer memory, so most packets move card-to-card across the shared bus without visiting the central CPU and buffer memory.]
Third-Generation Switches/Routers
[Figure: line cards (MAC, local buffer memory) and a CPU card attach to a switched backplane; a crossbar replaces the shared bus between line interfaces.]
Fourth-Generation Switches/Routers: Clustering and Multistage
[Figure: multiple switch/router chassis (ports 1-32) clustered around a multistage interconnect to form one larger system.]
Packet Switches: References
• J. Giacopelli, M. Littlewood, W.D. Sincoskie “Sunshine: A high performance self-routing broadband packet switch architecture”, ISS ‘90.
• J. S. Turner “Design of a Broadcast packet switching network”, IEEE Trans Comm, June 1988, pp. 734-743.
• C. Partridge et al. “A Fifty Gigabit per second IP Router”, IEEE Trans Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro Magazine, Jan-Feb 1997.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Basic Architectural Components
Datapath: per-packet processing
[Figure: the three datapath stages again: (1) forwarding decision via the forwarding table; (2) interconnect; (3) output scheduling.]
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
ATM and MPLS Switches
Direct Lookup
[Figure: the incoming VCI is used directly as the memory address; the data read out is the (Port, VCI) pair.]
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Bridges and Ethernet Switches
Associative Lookups
[Figure: the 48-bit search key is presented to an associative memory (CAM), which returns a hit signal, a log2(N)-bit address, and the associated data.]
Advantages:
• Simple
Disadvantages:
• Slow
• High power
• Small
• Expensive
Bridges and Ethernet Switches
Hashing
[Figure: the 48-bit search key is hashed down to a 16-bit memory address; the memory returns the associated data and a hit signal.]
Lookups Using Hashing
An example
[Figure: a CRC-16 hash of the 48-bit search key selects a memory address; each address holds a linked list of the entries (e.g. #1-#4) that hashed to it.]
Lookups Using Hashing
Performance of simple example

E_R = (1/2) * [ 1 + 1 / (1 - (1 - 1/N)^M) ]

Where:
E_R = Expected number of memory references
M   = Number of memory addresses in table
N   = Number of linked lists (here M = N)
Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small
Disadvantages
• Non-deterministic lookup time
• Inefficient use of memory
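The expected/worst-case gap above is easy to see empirically. A Monte Carlo sketch: Python's built-in hash stands in for the CRC-16 of the slides, and the table sizes are arbitrary.

```python
import random

def build_table(addresses, n_buckets):
    """Hash each address into one of n_buckets chained lists."""
    buckets = [[] for _ in range(n_buckets)]
    for a in addresses:
        buckets[hash(a) % n_buckets].append(a)   # linked list per bucket
    return buckets

def lookup_refs(buckets, a):
    """Memory references a successful lookup walks down the chain."""
    chain = buckets[hash(a) % len(buckets)]
    return chain.index(a) + 1

random.seed(7)
N = M = 4096                                     # buckets = entries (load factor 1)
addrs = random.sample(range(1 << 48), M)         # M random 48-bit addresses
table = build_table(addrs, N)
avg = sum(lookup_refs(table, a) for a in addrs) / M
worst = max(lookup_refs(table, a) for a in addrs)
```

At load factor 1 the average lands near 1.5 references, but the longest chain is several times that, which is exactly the non-deterministic lookup time the slide warns about.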
Trees and Tries
[Figure: a binary search tree over N entries, branching on < / > comparisons with depth log2(N); and a binary search trie, branching 0/1 on successive address bits, shown storing 010 and 111.]
Trees and Tries
Multiway tries
[Figure: a 16-ary search trie; each node holds sixteen (4-bit value, pointer) pairs, shown storing 0000 1111 0000 and 1111 1111 1111.]
Trees and Tries
Multiway tries

Degree of  # Mem        # Nodes   Total Memory  Fraction
Tree       References   (x10^6)   (Mbytes)      Wasted (%)
2          48           1.09      4.3           49
4          24           0.53      4.3           73
8          16           0.35      5.6           86
16         12           0.25      8.3           93
64         8            0.17      21            98
256        6            0.12      64            99.5

E_n = 1 + sum_{i=1..L} D^i * [1 - (1 - D^-i)^N]
E_w = D*E_n - (E_n - 1) - N

Where:
D   = Degree of tree
L   = Number of layers/references
N   = Number of entries in table
E_n = Expected number of nodes
E_w = Expected amount of wasted memory (node slots allocated but unused)

Table produced from 2^15 randomly generated 48-bit addresses
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Caching Addresses
[Figure: the second-generation architecture again; cached entries in each line card's local memory serve the fast path, while misses cross the bus to the CPU and full table on the slow path.]
Caching Addresses
LAN: average flow < 40 packets
WAN: huge number of flows
[Figure: cache hit rate (0-100%) with a cache sized at 10% of the full table.]
IP Routers
Class-based addresses
[Figure: the IP address space divided into Class A, B, C, and D regions; a routing-table entry such as "212.17.9.0 -> Port 4" is found by an exact match on the Class C network of destination 212.17.9.4.]
IP Routers
CIDR
[Figure: the class-based address line (0 to 2^32-1, classes A-D) versus the classless view, where prefixes such as 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0, including 128.9.16.14), and 142.12/19 can sit anywhere on the line.]
IP Routers
CIDR
[Figure: nested prefixes on the address line from 0 to 2^32-1: 128.9/16 contains 128.9.16/20 and 128.9.176/20; 128.9.16/20 in turn contains 128.9.19/24 and 128.9.25/24; address 128.9.16.14 falls inside both 128.9/16 and 128.9.16/20.]

Most specific route = "longest matching prefix"
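The rule can be read off directly, if slowly: scan every prefix and keep the longest one that covers the address. A linear-scan sketch over the prefixes from the figure:

```python
def ip(s):
    """Dotted quad -> 32-bit integer."""
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

prefixes = [                      # (value, length) pairs from the figure
    (ip("128.9.0.0"), 16),
    (ip("128.9.16.0"), 20),
    (ip("128.9.176.0"), 20),
    (ip("128.9.19.0"), 24),
    (ip("128.9.25.0"), 24),
]

def longest_prefix_match(addr):
    """Return the (value, length) of the most specific covering prefix."""
    best = None
    for value, length in prefixes:
        mask = ~((1 << (32 - length)) - 1) & 0xFFFFFFFF
        if (addr & mask) == value and (best is None or length > best[1]):
            best = (value, length)
    return best

# 128.9.16.14 is covered by 128.9/16 and 128.9.16/20; the /20 is most specific.
match = longest_prefix_match(ip("128.9.16.14"))
```

Everything in the lookup sections that follows is about replacing this O(number of prefixes) scan with structures that answer the same question in a handful of memory references.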
IP Routers
Metrics for Lookups
[Figure: a small prefix-to-port table (128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24, 128.9.25/24, 142.12/19, 65/8) being searched for 128.9.16.14.]
• Lookup time
• Storage space
• Update time
• Preprocessing time
IP Router Lookup
IPv4 unicast destination-address based lookup
[Figure: the incoming packet's header enters the forwarding engine, which computes the next hop by looking up the destination address in the forwarding table.]
Need more than IPv4 unicast lookups
• Multicast
  • PIM SM
    – Longest prefix matching on the source and group address
    – Try (S,G), followed by (*,G), followed by (*,*,RP)
    – Check incoming interface
  • DVMRP:
    – Incoming interface check followed by (S,G) lookup
• IPv6
  • 128-bit destination address field
  • Exact address architecture not yet known
Lookup Performance Required
Gigabit Ethernet (84B packets): 1.49 Mpps
Line Line Rate Pkt size=40B Pkt size=240B
T1 1.5Mbps 4.68 Kpps 0.78 Kpps
OC3 155Mbps 480 Kpps 80 Kpps
OC12 622Mbps 1.94 Mpps 323 Kpps
OC48 2.5Gbps 7.81 Mpps 1.3 Mpps
OC192 10 Gbps 31.25 Mpps 5.21 Mpps
Size of the Routing Table
[Figure: number of routing-table prefixes versus time.]
Source: http://www.telstra.net/ops/bgptable.html
Ternary CAMs
[Figure: an associative memory of (value, mask) pairs, e.g. 10.0.0.0 / 255.0.0.0 -> R1; 10.1.0.0 / 255.255.0.0 -> R2; 10.1.1.0 / 255.255.255.0 -> R3; 10.1.3.0 / 255.255.255.0 -> R4; 10.1.3.1 / 255.255.255.255 -> R4. All entries are compared at once and a priority encoder selects the next hop of the first (longest) matching entry.]
Binary Tries
Example Prefixes:
a) 00001  b) 00010  c) 00011  d) 001  e) 0101
f) 011    g) 100    h) 1010   i) 1100 j) 11110000
[Figure: the ten prefixes stored in a binary trie, branching on 0 to the left and 1 to the right.]
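The example prefixes can be loaded into a small binary trie directly; lookup walks the address bit by bit and remembers the deepest node that carried a stored prefix. A dict-based sketch (illustrative only, not a memory-efficient layout):

```python
prefixes = {                  # name -> prefix bit-string, as in the example
    "a": "00001", "b": "00010", "c": "00011", "d": "001",
    "e": "0101",  "f": "011",   "g": "100",   "h": "1010",
    "i": "1100",  "j": "11110000",
}

def make_node():
    return {"0": None, "1": None, "name": None}

root = make_node()
for name, bits in prefixes.items():          # insert each prefix
    node = root
    for b in bits:
        if node[b] is None:
            node[b] = make_node()
        node = node[b]
    node["name"] = name                      # mark the prefix's node

def lpm(bits):
    """Longest matching prefix of an address bit-string, or None."""
    node, best = root, None
    for b in bits:
        node = node[b]
        if node is None:
            break
        if node["name"] is not None:
            best = node["name"]              # deepest stored prefix so far
    return best
```

One memory reference per address bit is the cost being attacked by Patricia trees (skip one-way chains) and multiway tries (consume several bits per step).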
Patricia Tree
Example Prefixes:
a) 00001  b) 00010  c) 00011  d) 001  e) 0101
f) 011    g) 100    h) 1010   i) 1100 j) 11110000
[Figure: the same prefixes in a Patricia tree; one-way branches are collapsed, e.g. a skip of 5 bits on the path to j.]
Patricia Tree
Advantages:
• General solution
• Extensible to wider fields
Disadvantages:
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space

Avoid backtracking by storing the intermediate best-matched prefix (Dynamic Prefix Tries).
40K entries: 2 MB data structure with 0.3-0.5 Mpps [O(W)]
Binary search on trie levels
[Figure: a trie with levels 0 through 29 marked; instead of walking from the root, the search for P probes a middle level (e.g. level 8) and binary-searches toward the longest level containing a match.]
Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Example Prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example Addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
[Figure: per-length hash tables for lengths 8, 12, 16, 24; length 8 holds "10", length 16 holds "10.1, 10.2", and length 24 holds "10.1.1, 10.1.2, 10.2.3".]
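A simplified sketch of the scheme (in the spirit of Waldvogel et al., with details reduced): one hash table per prefix length, plus "marker" entries at shorter lengths so the binary search knows when to try longer prefixes. Each marker stores its precomputed best-matching prefix so the search never backtracks.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

LENGTHS = [8, 16, 24]                          # sorted lengths with real prefixes
PREFIXES = [("10.0.0.0", 8), ("10.1.0.0", 16), ("10.1.1.0", 24),
            ("10.1.2.0", 24), ("10.2.3.0", 24)]

def best_match(key_bits, plen):
    """Longest real prefix matching a plen-bit key (naive scan, build-time only)."""
    best = None
    for value, l in PREFIXES:
        if l <= plen and (ip(value) >> (32 - l)) == (key_bits >> (plen - l)):
            if best is None or l > best[1]:
                best = (value, l)
    return best

tables = {l: {} for l in LENGTHS}
for value, plen in PREFIXES:                   # real prefix entries
    tables[plen][ip(value) >> (32 - plen)] = (value, plen)
for value, plen in PREFIXES:                   # markers at shorter lengths
    for shorter in [l for l in LENGTHS if l < plen]:
        key = ip(value) >> (32 - shorter)
        if key not in tables[shorter]:
            tables[shorter][key] = best_match(key, shorter)

def lookup(addr):
    lo, hi, best = 0, len(LENGTHS) - 1, None
    while lo <= hi:                            # binary search over lengths
        mid = (lo + hi) // 2
        plen = LENGTHS[mid]
        key = addr >> (32 - plen)
        if key in tables[plen]:
            if tables[plen][key] is not None:
                best = tables[plen][key]       # real prefix, or marker's best match
            lo = mid + 1                       # hit: try longer lengths
        else:
            hi = mid - 1                       # miss: try shorter lengths
    return best
```

Address 10.2.4.8 shows why markers carry a best match: the /16 probe hits only the marker "10.2", and without the stored answer the search would wrongly end with nothing instead of 10/8.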
Binary search on trie levels
Advantages:
• Scalable to IPv6.
Disadvantages:
• Multiple hashed memory accesses.
• Updates are complex.

33K entries: 1.4 MB data structure with 1.2-2.2 Mpps [O(log W)]
Compacting Forwarding Tables
[Figure: the trie's leaf level represented as a bit vector (e.g. 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1), with a 1 marking each position where a new next-hop interval begins.]
Compacting Forwarding Tables
[Figure: the bit vector split into chunks (10001010, 11100010, 10000010, 10110100, 11000000); a codeword array stores per-chunk entries such as (R1, 0), (R2, 3), (R3, 7), (R4, 9), (R5, 0), and a base-index array (0, 13) carries the running counts across groups of chunks.]
Compacting Forwarding Tables
Advantages:
• Extremely small data structure - can fit in cache.
Disadvantages:
• Scalability to larger tables?
• Updates are complex.

33K entries: 160 KB data structure with average 2 Mpps [O(W/k)]
Multi-bit Tries
[Figure: the 16-ary search trie again; each node holds sixteen (4-bit value, pointer) pairs, so each memory reference consumes four address bits.]
Compressed Tries
[Figure: trie levels L8, L16, and L24.]
Only 3 memory accesses
Routing Lookups in Hardware
[Figure: histogram of number of prefixes versus prefix length in a backbone routing table.]
Most prefixes are 24-bits or shorter
Routing Lookups in Hardware
Prefixes up to 24-bits
[Figure: for destination 142.19.6.14, the top 24 bits (142.19.6) index directly into a 2^24-entry table; each entry holds a one-bit flag and the next hop.]
Routing Lookups in Hardware
Prefixes above 24-bits
[Figure: for destination 128.3.72.44, the entry indexed by the top 24 bits (128.3.72) holds a flag and a pointer (base) into a second table; the low 8 bits (44) offset from that base to select among the next hops for prefixes longer than 24 bits.]
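A toy sketch of the two-table scheme described above (the DIR-24-8 approach of Gupta et al.): dicts stand in for the DRAM tables, and routes must be added shortest-prefix-first so that longer prefixes overwrite the expanded entries of shorter ones, an ordering assumption the real scheme handles more carefully.

```python
tbl_24 = {}        # stands in for the 2^24-entry first table
tbl_long = {}      # second table, indexed by (pointer, last 8 bits)
next_ptr = [0]     # allocator for second-table blocks

def add_route(prefix, plen, next_hop):
    """prefix is a 32-bit int; add shorter prefixes before longer ones."""
    if plen <= 24:
        base = (prefix >> (32 - plen)) << (24 - plen)
        for i in range(1 << (24 - plen)):        # expand to all covered /24s
            tbl_24[base | i] = (0, next_hop)     # flag 0: entry is a next hop
    else:
        idx24 = prefix >> 8
        entry = tbl_24.get(idx24)
        if entry is None or entry[0] == 0:       # allocate a 256-slot block
            ptr = next_ptr[0]; next_ptr[0] += 1
            default = entry[1] if entry else None
            for b in range(256):                 # pre-fill with the covering hop
                tbl_long[(ptr, b)] = default
            tbl_24[idx24] = (1, ptr)             # flag 1: entry is a pointer
        ptr = tbl_24[idx24][1]
        base = prefix & 0xFF
        for i in range(1 << (32 - plen)):
            tbl_long[(ptr, base | i)] = next_hop

def lookup(addr):
    flag, val = tbl_24.get(addr >> 8, (0, None))  # first memory access
    if flag == 0:
        return val
    return tbl_long[(val, addr & 0xFF)]           # second access, rare in practice

def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

add_route(ip("142.19.6.0"), 24, "A")
add_route(ip("128.3.72.0"), 24, "B")
add_route(ip("128.3.72.128"), 25, "C")
```

Because most prefixes are /24 or shorter, almost every lookup finishes in the single first-table access; only addresses under a longer-than-24 prefix pay for the second access.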
Routing Lookups in Hardware
Prefixes up to n-bits
[Figure: the generalization: a 2^N-entry first table handles prefixes up to N bits; entries (e.g. i and j) covering prefixes longer than N bits point to dedicated 2^M-entry blocks of next hops, handling prefixes up to N+M bits.]
Routing Lookups in Hardware
Advantages:
• 20 Mpps with 50 ns DRAM
• Easy to implement in hardware
Disadvantages:
• Large memory required (9-33 MB)
• Depends on prefix-length distribution.

Various compression schemes can be employed to decrease the storage requirements: e.g. carefully chosen variable-length strides, bitmap compression, etc.
IP Router Lookups: References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink. “Small Forwarding Tables for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.
• B. Lampson, V. Srinivasan, G. Varghese. “IP lookups using multiway and multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner. “Scalable high speed IP routing lookups”, Sigcomm 1997, pp 25-36.
• P. Gupta, S. Lin, N. McKeown. “Routing lookups in hardware at memory access speeds”, Infocom 1998, pp 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson. “Fast address lookup for Internet routers”, IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G.Varghese. “Fast IP lookups using controlled prefix expansion”, Sigmetrics, June 1998.
Forwarding Decisions
• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification
Providing Value-Added Services
Some examples
• Differentiated services
  – Regard traffic from Autonomous System #33 as `platinum grade'
• Access Control Lists
  – Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate limit WWW traffic from sub-interface #739 to 10 Mbps
• Policy-based Routing
  – Route all voice traffic through the ATM network
Packet Classification
[Figure: the incoming packet's header enters the forwarding engine; a classifier (policy database) of (predicate, action) rules determines the action applied to the packet.]
Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.
          Field 1             Field 2           ...  Field k  Action
Rule 1    152.163.190.69/21   152.163.80.11/32  ...  UDP      A1
Rule 2    152.168.3.0/24      152.163.0.0/16    ...  TCP      A2
...       ...                 ...               ...  ...      ...
Rule N    152.168.0.0/16      152.0.0.0/8       ...  ANY      An
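The problem statement above has an immediate (if slow) solution: sequential evaluation, checking rules in priority order and returning the first match. A sketch using the example rules; the packet values below are invented.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

RULES = [  # (src value/len, dst value/len, proto or None for ANY, action)
    (("152.163.190.69", 21), ("152.163.80.11", 32), "UDP", "A1"),
    (("152.168.3.0", 24), ("152.163.0.0", 16), "TCP", "A2"),
    (("152.168.0.0", 16), ("152.0.0.0", 8), None, "An"),
]

def prefix_match(addr, value, plen):
    """True if addr agrees with value on the top plen bits."""
    return (addr ^ ip(value)) >> (32 - plen) == 0

def classify(src, dst, proto):
    for (sv, sl), (dv, dl), p, action in RULES:   # highest priority first
        if prefix_match(src, sv, sl) and prefix_match(dst, dv, dl) \
                and (p is None or p == proto):
            return action                          # first match wins
    return "default"
```

This is the "Sequential Evaluation" row of the scheme-comparison tables that follow: tiny storage, arbitrary numbers of fields, but classification time linear in N.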
Geometric Interpretation in 2D
[Figure: rules R1-R7 drawn as rectangles in the (Field #1, Field #2) plane, e.g. (144.24/16, 64/24); a packet is a point such as P1 = (128.16.46.23, *) or P2, and the rules whose rectangles cover that point are its matches.]
Proposed Schemes

Scheme: Sequential Evaluation
Pros: Small storage, scales well with number of fields.
Cons: Slow classification rates.

Scheme: Ternary CAMs
Pros: Single-cycle classification.
Cons: Cost, density, power consumption.

Scheme: Grid of Tries (Srinivasan et al [Sigcomm 98])
Pros: Small storage requirements and fast lookup rates for two fields. Suitable for big classifiers.
Cons: Not easily extendible to more than two fields.
Proposed Schemes (Contd.)

Scheme: Crossproducting (Srinivasan et al [Sigcomm 98])
Pros: Fast accesses. Suitable for multiple fields.
Cons: Large memory requirements. Suitable without caching only for classifiers with fewer than 50 rules.

Scheme: Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
Pros: Suitable for multiple fields.
Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.
Proposed Schemes (Contd.)

Scheme: Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
Pros: Suitable for multiple fields. Small memory requirements. Good update time.
Cons: Large preprocessing time.

Scheme: Tuple Space Search (Srinivasan et al [Sigcomm 99])
Pros: Suitable for multiple fields. The basic scheme has good update times and memory requirements.
Cons: Classification rate can be low. Requires perfect hashing for determinism.

Scheme: Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
Pros: Fast accesses. Suitable for multiple fields. Reasonable memory requirements for real-life classifiers.
Cons: Large preprocessing time and memory requirements for large classifiers.
Grid of Tries
[Figure: a trie on dimension 1 whose nodes point into tries on dimension 2 holding rules R1-R7.]
Grid of Tries
Advantages:
• Good solution for two dimensions
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions

20K entries: 2 MB data structure with 9 memory accesses [at most 2W]
Classification using Bit Parallelism
[Figure: each dimension yields a bitmap over rules R1-R4 marking which rules match in that dimension; ANDing the per-dimension bitmaps gives the set of matching rules.]
Classification using Bit Parallelism
Advantages:
• Good solution for multiple dimensions for small classifiers
Disadvantages:
• Large memory bandwidth
• Hardware optimized

512 rules: 1 Mpps with a single FPGA and five 128 KB SRAM chips.
Classification Using Multiple Fields
Recursive Flow Classification
[Figure: the packet header's fields F1...Fn (2^S = 2^128 possible headers) are reduced in stages through memory lookups (2^64, then 2^24) down to one of 2^T = 2^12 actions.]
Packet Classification: References
• T.V. Lakshman, D. Stiliadis. “High speed policy based packet forwarding using efficient multi-dimensional range matching”, Sigcomm 1998, pp 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel. “Fast and scalable layer 4 switching”, Sigcomm 1998, pp 203-214.
• V. Srinivasan, G. Varghese, S. Suri. “Fast packet classification using tuple space search”, to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, “Packet classification using hierarchical intelligent cuttings”, Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, “Packet classification on multiple fields”, Sigcomm 1999.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Combining input and output queues
  – Other non-blocking fabrics
  – Multicast traffic
Basic Architectural Components
Datapath: per-packet processing
[Figure: the three datapath stages again: (1) forwarding decision via the forwarding table; (2) interconnect; (3) output scheduling.]
Interconnects
Two basic techniques
[Figure: input queueing, usually with a non-blocking switch fabric (e.g. crossbar), versus output queueing, usually with a fast bus.]
Interconnects
Output Queueing
[Figure: with individual output queues (ports 1...N), each output memory needs bandwidth (N+1)R; with a centralized shared memory, the memory needs bandwidth 2NR.]
Output Queueing
The "ideal"
[Figure: cells destined to the same output (labelled 1 and 2) arrive simultaneously; an ideal output-queued switch delivers them all to the output's queue immediately, so only the output link limits performance.]
Output Queueing
How fast can we make centralized shared memory?
[Figure: N ports share one memory over a 200-byte-wide bus of 5 ns SRAM.]
• 5 ns per memory operation
• Two memory operations per packet
• Therefore, up to 160 Gb/s
• In practice, closer to 80 Gb/s
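The 160 Gb/s bullet is just the bus width divided by two memory operations per packet:

```python
bus_width_bits = 200 * 8       # 200-byte-wide bus into the shared memory
sram_cycle = 5e-9              # one memory operation every 5 ns
ops_per_packet = 2             # each packet is written once and read once

# Peak aggregate line rate the shared memory can support, in bits/second.
peak = bus_width_bits / (ops_per_packet * sram_cycle)
```

Anything that halves the effective bus utilization (segmentation overheads, bank conflicts, refresh) eats directly into this figure, which is why the practical number quoted is nearer 80 Gb/s.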
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Interconnects
Input Queueing with Crossbar
[Figure: per-input queues feed a crossbar whose configuration is set each cell time by a scheduler; memory bandwidth is only 2R.]
Input Queueing
Head of Line Blocking
[Figure: delay versus load; with FIFO input queues, delay diverges at 58.6% load rather than at 100%.]
Head of Line Blocking
[Figure: animation over several cell times: a cell at the head of an input FIFO, blocked because its output is busy, prevents cells behind it from reaching idle outputs.]
Input Queueing
Virtual output queues
[Figure: each input keeps a separate queue per output, eliminating head-of-line blocking.]
Input Queues
Virtual Output Queues
[Figure: delay versus load; with virtual output queues, the delay curve extends to 100% load.]
Input Queueing
[Figure: VOQs and crossbar with scheduler; memory bandwidth is still 2R, but the scheduler can be quite complex!]
Input Queueing
Scheduling
[Figure: input i (i = 1...m) holds queues Q(i,1)...Q(i,n) fed by arrival process A_i(t); outputs 1...n have departure processes D_1(t)...D_n(t). Each cell time the scheduler must pick a matching M between inputs and outputs.]
Input Queueing
Scheduling
[Figure: a request graph between inputs 1-4 and outputs 1-4 with queue occupancies as edge weights, and one possible bipartite matching of weight 18.]
Question: Maximum weight or maximum size?
Input Queueing
Scheduling
• Maximum Size
  – Maximizes instantaneous throughput
  – Does it maximize long-term throughput?
• Maximum Weight
  – Can clear most backlogged queues
  – But does it sacrifice long-term throughput?
Input Queueing
Scheduling
[Figure: a 2x2 example: depending on which matching is chosen each cell time, some queues between inputs 1-2 and outputs 1-2 can be persistently starved.]
Input Queueing
Longest Queue First or Oldest Cell First
[Figure: a 4x4 example with one queue of weight 10 and others of weight 1; taking weight to be queue length (LQF) or waiting time (OCF), the maximum-weight matching achieves 100% throughput.]
Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: average VOQ occupancy is flat across VOQs under uniform traffic, and uneven under non-uniform traffic.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
Wave Front Arbiter
[Figure: a 4x4 request matrix; diagonals ("waves") are resolved in sequence so that each input and each output is matched at most once.]
Wave Front Arbiter
[Figure: an example set of requests and the resulting match.]
Wave Front Arbiter
Implementation
[Figure: a 4x4 array of combinational logic blocks, cells (1,1)...(4,4); the arbitration decision ripples diagonally through the array.]
Wave Front Arbiter
Wrapped WFA (WWFA)
[Figure: requests and match; wrapping the diagonals means N steps instead of 2N-1.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
Parallel Iterative Matching
[Figure: iteration #1 for a 4x4 example: each input sends requests to the outputs it has cells for; each output randomly grants one request; each input randomly accepts one grant. Unmatched inputs and outputs repeat the process in iteration #2.]
Parallel Iterative Matching
Maximal is not Maximum
[Figure: a 4x4 request pattern for which PIM can converge to a maximal match that is smaller than the maximum match.]
Parallel Iterative Matching
Analytical Results

Number of iterations to converge:
E[C] <= log2(N)
E[U_i] <= N^2 / 4^i

Where:
C   = number of iterations required to resolve connections
N   = number of ports
U_i = number of unresolved connections after iteration i
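The three-phase request/grant/accept round is short enough to simulate directly. A toy sketch (ports and requests are invented; randomness is seeded for repeatability):

```python
import random

def pim(requests, rng):
    """requests: set of (input, output) pairs; returns (match, iterations)."""
    match_in, match_out = {}, {}
    iterations = 0
    while True:
        # Requests from still-unmatched inputs to still-unmatched outputs.
        live = [(i, o) for (i, o) in requests
                if i not in match_in and o not in match_out]
        if not live:
            return match_in, iterations        # maximal match reached
        iterations += 1
        grants = {}                            # output -> requesting inputs
        for i, o in live:
            grants.setdefault(o, []).append(i)
        # Grant phase: each output grants one request at random.
        granted = {o: rng.choice(ins) for o, ins in grants.items()}
        accepts = {}                           # input -> granting outputs
        for o, i in granted.items():
            accepts.setdefault(i, []).append(o)
        # Accept phase: each input accepts one grant at random.
        for i, outs in accepts.items():
            o = rng.choice(outs)
            match_in[i] = o
            match_out[o] = i
```

Running it on the requests {(1,1), (1,2), (2,1)} shows "maximal is not maximum": if input 1 happens to accept output 1, input 2 can never be matched, so the final match has size 1 even though a size-2 match exists.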
Parallel Iterative Matching
[Figure: worked example of PIM over several cell times.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
  – Wave Front Arbiter (WFA)
  – Parallel Iterative Matching (PIM)
  – iSLIP
• Maximal Weight Algorithms
  – Fair Access Round Robin (FARR)
  – Longest Port First (LPF)
iSLIP
[Figure: iterations #1 and #2 for a 4x4 example; like PIM, but the grant and accept choices are made by round-robin pointers instead of at random.]
iSLIP
Properties
• Random under low load
• TDM under high load
• Lowest priority to MRU (most recently used)
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average in fewer than log2(N)
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
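A toy sketch of the iSLIP grant/accept pointer mechanics. It is simplified relative to the real algorithm (notably, real iSLIP only updates pointers for grants accepted in the first iteration), but it shows the key rule: pointers advance one position past a match only when the grant is accepted, which desynchronizes the outputs.

```python
def islip(requests, n, iterations=4):
    """requests: set of (input, output) pairs; returns the match as a dict."""
    grant_ptr = [0] * n                  # per-output round-robin pointer
    accept_ptr = [0] * n                 # per-input round-robin pointer
    match_in, match_out = {}, {}
    for _ in range(iterations):
        # Grant phase: each unmatched output grants the first requesting,
        # unmatched input at or after its pointer.
        grants = {}                      # input -> granting outputs
        for o in range(n):
            if o in match_out:
                continue
            for k in range(n):
                i = (grant_ptr[o] + k) % n
                if (i, o) in requests and i not in match_in:
                    grants.setdefault(i, []).append(o)
                    break
        # Accept phase: each input accepts the granting output closest
        # at or after its pointer; pointers move only past accepted grants.
        for i, outs in grants.items():
            best = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            match_in[i] = best
            match_out[best] = i
            accept_ptr[i] = (best + 1) % n
            grant_ptr[best] = (i + 1) % n
    return match_in
```

With full uniform requests on a 2x2 switch the pointers settle into the TDM-like pattern the properties slide describes: every input and output is matched every cell time.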
iSLIP
[Figure: worked example of iSLIP over several cell times.]
iSLIP
Implementation
[Figure: N programmable priority encoders for the grant decisions and N for the accept decisions; each arbiter keeps log2(N) bits of pointer state.]
Input Queueing: References
• M. Karol et al. “Input vs Output Queueing on a Space-Division Packet Switch”, IEEE Trans Comm., Dec 1987, pp. 1347-1356.
• Y. Tamir, “Symmetric Crossbar arbiters for VLSI communication switches”, IEEE Trans Parallel and Dist Sys., Jan 1993, pp.13-27.
• T. Anderson et al. “High-Speed Switch Scheduling for Local Area Networks”, ACM Trans Comp Sys., Nov 1993, pp. 319-352.
• N. McKeown, “The iSLIP scheduling algorithm for Input-Queued Switches”, IEEE Trans Networking, April 1999, pp. 188-201.
• C. Lund et al. “Fair prioritized scheduling in an input-buffered switch”, Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkitikul et al. “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, IEEE Infocom 98, April 1998.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Other Non-Blocking Fabrics
Clos Network
[Figure: a three-stage Clos network.]
Other Non-Blocking Fabrics
Clos Network
Expansion factor required = 2 - 1/N (but still blocking for multicast)
Other Non-Blocking Fabrics
Self-Routing Networks
[Figure: an 8-port banyan network between inputs 000-111 and outputs 000-111; each stage routes on one bit of the destination address.]
Other Non-Blocking Fabrics
Self-Routing Networks
[Figure: a Batcher sorting network followed by a self-routing banyan network: the non-blocking Batcher-Banyan network, shown sorting and delivering cells addressed 0-7.]
• The fabric can be used as a scheduler.
• The Batcher-Banyan network is blocking for multicast.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
  – Scheduling algorithms
  – Other non-blocking fabrics
  – Combining input and output queues
  – Multicast traffic
Speedup
• Context
  – input-queued switches
  – output-queued switches
  – the speedup problem
• Early approaches
• Algorithms
• Implementation considerations
Speedup: Context
[Figure: a generic switch; the placement of memory gives output-queued, input-queued, or combined input- and output-queued (CIOQ) switches.]
Output-queued switches
Best delay and throughput performance
- Possible to erect "bandwidth firewalls" between sessions
Main problem
- Requires high fabric speedup (S = N)
Unsuitable for high-speed switching
Input-queued switches
Big advantage
- Speedup of one is sufficient
Main problem
- Can't guarantee delay due to input contention
Overcoming input contention: use higher speedup
A Comparison
Memory speeds for a 32x32 switch

             Output-queued             Input-queued
Line Rate    Memory BW   Access Time   Memory BW   Access Time
                         Per cell                  Per cell
100 Mb/s     3.3 Gb/s    128 ns        200 Mb/s    2.12 us
1 Gb/s       33 Gb/s     12.8 ns       2 Gb/s      212 ns
2.5 Gb/s     82.5 Gb/s   5.12 ns       5 Gb/s      84.8 ns
10 Gb/s      330 Gb/s    1.28 ns       20 Gb/s     21.2 ns
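The table's entries follow from two formulas: output queueing needs memory bandwidth (N+1)R (N writes plus one read per cell time), input queueing needs 2R (one write, one read). The access times assume 53-byte cells, which appears to be the cell size used here.

```python
N = 32                      # 32x32 switch
CELL_BITS = 53 * 8          # 53-byte (ATM-sized) cells assumed

def mem_requirements(line_rate):
    """Return (OQ bandwidth, OQ access time, IQ bandwidth, IQ access time)."""
    oq_bw = (N + 1) * line_rate      # N simultaneous writes + 1 read
    iq_bw = 2 * line_rate            # 1 write + 1 read
    return oq_bw, CELL_BITS / oq_bw, iq_bw, CELL_BITS / iq_bw

oq_bw, oq_t, iq_bw, iq_t = mem_requirements(10e9)   # the 10 Gb/s row
```

At 10 Gb/s the output-queued memory must cycle roughly every 1.3 ns, which is what makes pure output queueing untenable at high line rates and motivates the speedup compromise that follows.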
The Speedup Problem
Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch
- close to the cost of an IQ switch
Essential for high speed QoS switching
Some Early Approaches
Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, "friendly correlated")
- obtain mean throughput and delays, bounds on tails
- analyze different fabrics (crossbar, multistage, etc)
Numerical Methods
- use actual and simulated traffic traces
- run different algorithms
- set the "speedup dial" at various values
The findings
Very tantalizing ...
- under different settings (traffic, loading, algorithm, etc)
- and even for varying switch sizes
A speedup of between 2 and 5 was sufficient!
Using Speedup
[Figure: with speedup, the fabric runs faster than the line rate, so within one cell time a cell can cross the fabric and wait in an output queue rather than at its input.]
Intuition
Bernoulli IID inputs:
Speedup = 1: fabric throughput = 0.58
Speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25
Intuition (continued)
Bernoulli IID inputs:
Speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35
Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75
Issues
Need hard guarantees
- exact, not average
Robustness
- realistic, even adversarial, traffic; not friendly Bernoulli IID
The Ideal Solution
[Figure: replace a speedup-N output-queued switch with a CIOQ switch of speedup << N between the same inputs and outputs.]
Question: Can we find
- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch size and traffic pattern?
What is exact mimicking?
Apply same inputs to an OQ and a CIOQ switch
- packet by packet
Obtain same outputs
- packet by packet
Algorithm - MUCF
Key concept: urgency value
- urgency = departure time - present time
MUCF
The algorithm:
- Outputs try to get their most urgent packets.
- Inputs grant to the output whose packet is most urgent; ties broken by port number.
- Loser outputs try for their next most urgent packet.
- The algorithm terminates when no more matchings are possible.
Stable Marriage Problem
[Figure: men (Pedro, John, Bill) matched to women (Maria, Hillary, Monica).]
Men = Outputs
Women = Inputs
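The analogy can be made concrete with a Gale-Shapley sketch: men (outputs) propose in preference order, exactly as MUCF's outputs try for packets in urgency order. The preference lists below are invented for illustration.

```python
def stable_match(men_pref, women_pref):
    """Gale-Shapley: men_pref/women_pref map a name to an ordered
    preference list. Returns a stable matching as {woman: man}."""
    free = list(men_pref)                        # men yet to be matched
    next_idx = {m: 0 for m in men_pref}          # next woman each man proposes to
    engaged = {}                                 # woman -> man
    rank = {w: {m: r for r, m in enumerate(p)} for w, p in women_pref.items()}
    while free:
        m = free.pop()
        w = men_pref[m][next_idx[m]]
        next_idx[m] += 1
        if w not in engaged:
            engaged[w] = m                       # w accepts her first proposal
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])              # w trades up; old partner re-enters
            engaged[w] = m
        else:
            free.append(m)                       # rejected; m proposes again later
    return engaged

men_pref = {"Pedro": ["Maria", "Hillary", "Monica"],
            "John": ["Maria", "Monica", "Hillary"],
            "Bill": ["Hillary", "Maria", "Monica"]}
women_pref = {"Maria": ["John", "Pedro", "Bill"],
              "Hillary": ["Pedro", "Bill", "John"],
              "Monica": ["Bill", "John", "Pedro"]}
matching = stable_match(men_pref, women_pref)
```

Stability is the property the speedup proofs lean on: in the final matching, no output and input both prefer each other to their assigned partners, i.e. no more-urgent packet is left stranded by a less-urgent one.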
An example
Observation: only two reasons a packet doesn't get to its output
- input contention, output contention
- this is why a speedup of 2 works!!
What does this get us?
A speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF.
What about non-FIFO OQ switches? E.g. WFQ, strict priority.
Other results
To exactly emulate an NxN OQ switch:
- A speedup of 2 - 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N).
- Input traffic patterns can be absolutely arbitrary.
- The emulated OQ switch may use any "monotone" scheduling policy, e.g. FIFO, LIFO, strict priority, WFQ, etc.
What gives?
Complexity of the algorithms
- Extra hardware for processing
- Extra run time (time complexity)
What is the benefit?
- Reduced memory bandwidth requirements
Tradeoff: memory for processing
- Moore's Law supports this tradeoff
Implementation - a closer look
Main sources of difficulty
- Estimating urgency, etc - info is distributed
- Matching process - too many iterations?
Estimating urgency depends on what is being emulated
- Like taking a ticket to hold a place in a queue
- FIFO, Strict priorities - no problem
- WFQ, etc - problems
(and communicating this info among I/ps and O/ps)
Implementation (contd)
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations in switching = N
- With high probability, and on average, approximately log(N)
- Worst-case number of iterations for the SMP = N^2
Other Work
Relax the stringent requirement of exact emulation
- Least Occupied Output First Algorithm (LOOFA)
- Keeps outputs always busy if there are packets for them
- By time-stamping packets, it also exactly mimics an OQ switch
Disallow arbitrary inputs
- E.g., leaky bucket constrained
- Obtain worst-case delay bounds
References for speedup
- Y. Oie et al., "Effect of speedup in nonblocking packet switch", ICC '89.
- A. L. Gupta, N. D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", INFOCOM '91.
- S.-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol. 17, no. 6, 1999.
- B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol. 35, 1999.
- P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol. 17, no. 6, 1999.
- A. Charny, "Providing QoS guarantees in input buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic
Multicast Switching
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics
Multicasting
(Figure: a multicast cell arriving at one input port is delivered to several output ports of the switch)
Crossbar fabrics: Method 1
Copy networks: a copy network followed by unicast switching
- Increased hardware, increased input contention
Method 2: Use the copying properties of the crossbar fabric
No fanout-splitting: easy, but low throughput
Fanout-splitting: higher throughput, but not as simple; leaves "residue"
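The difference between the two disciplines can be sketched in a few lines. This is an illustrative model of one cell time, not a hardware scheduler; the set-based fanout bookkeeping is an assumption.

```python
# One cell time in a crossbar serving multicast cells, with and without
# fanout-splitting. cells[i] is the set of output ports still owed a
# copy by the head-of-line cell at input i.

def one_cell_time(cells, split):
    busy = set()                        # outputs already claimed this cell time
    for i, fanout in enumerate(cells):
        if not fanout:
            continue
        if split:
            served = fanout - busy      # send to whichever outputs are free
        else:                           # all-or-nothing: wait for full fanout
            served = fanout if not (fanout & busy) else set()
        busy |= served
        cells[i] = fanout - served      # leftover copies are the "residue"
    return busy

# With splitting, input 1's cell delivers its copy to output 3 right away;
# without splitting it must wait until outputs 1 and 3 are free together.
a = [{1, 2}, {1, 3}]
one_cell_time(a, split=True)    # a becomes [set(), {1}]
b = [{1, 2}, {1, 3}]
one_cell_time(b, split=False)   # b becomes [set(), {1, 3}]
```

The residue left at input 1 under splitting is exactly what the next slides discuss placing.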
The effect of fanout-splitting
Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic
Placement of residue
Key question: How should outputs grant requests (and hence decide the placement of residue)?
Residue and throughput
Result: Concentrating residue brings more new work forward, and hence leads to higher throughput.
But there are fairness problems to deal with.
This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.
Multicasting and Tetris
(Figure: multicast cells drawn as Tetris-like blocks spanning output ports 1-5, queued at input ports 1-5; the leftover residue is highlighted)
Multicasting and Tetris
(Figure: the same Tetris-like picture, with the residue concentrated on as few input ports as possible)
Replication by recycling
Main idea: make two copies at a time using a binary tree, with the input at the root and all possible destination outputs at the leaves.
(Figure: binary copy tree rooted at the input; the labelled nodes a-e, x, y show cells being copied two at a time toward the destination outputs)
Replication by recycling (cont’d)
(Block diagram: Receive, Recycle Network, Resequence, and Transmit stages, with an Output Table)
Scalable to large fanouts. Needs resequencing at the outputs and introduces variable delays.
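The copy-tree idea can be sketched as follows. This is purely illustrative (the function name and the halve-the-fanout rule are assumptions); it just shows that recycling reaches every leaf in about log2(fanout) passes, which is why it scales to large fanouts.

```python
# Replication by recycling: each pass through the switch makes at most
# two copies of a cell (one level of the binary copy tree), and copies
# whose fanout set is not yet a single output are recycled.

def recycle_passes(fanout_sets):
    """Return the number of passes until every copy targets one output."""
    passes = 0
    cells = [f for f in fanout_sets if f]
    while any(len(f) > 1 for f in cells):
        nxt = []
        for f in cells:
            if len(f) == 1:
                nxt.append(f)                         # already a unicast copy
            else:
                f = sorted(f)
                half = len(f) // 2
                nxt += [set(f[:half]), set(f[half:])]  # two copies, recycled
        cells = nxt
        passes += 1
    return passes
```

For a fanout of 5 this takes 3 passes (ceil(log2 5)); a unicast cell needs none.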
References for Multicasting
• J. Hayes et al., "Performance analysis of a multicast switch", IEEE Trans. on Communications, vol. 39, April 1991.
• B. Prabhakar et al., "Tetris models for multicast switches", Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
• B. Prabhakar et al., "Multicast scheduling for input-queued switches", IEEE JSAC, 1997.
• J. Turner, "An optimal nonblocking multicast virtual circuit switch", INFOCOM, 1994.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Output Scheduling
• What is output scheduling?
• How is it done?
• Practical Considerations
Output Scheduling
(Figure: a scheduler at the output port serving several flow queues)
Allocating output bandwidth and controlling packet delay
Output Scheduling
FIFO
Fair Queueing
Motivation
• FIFO is natural but gives poor QoS
– bursty flows increase delays for others
– hence it cannot guarantee delays
• Need round-robin scheduling of packets
– Fair Queueing
– Weighted Fair Queueing, Generalized Processor Sharing
Fair queueing: Main issues
• Level of granularity
– packet-by-packet? (favors long packets)
– bit-by-bit? (ideal, but very complicated)
• Packet Generalized Processor Sharing (PGPS)
– serves packet-by-packet
– and imitates bit-by-bit schedule within a tolerance
How does WFQ work?
Weights: WR = 1, WG = 5, WP = 2
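The weight-proportional service can be sketched with virtual finish times. This is a hedged, simplified illustration: the weights WR = 1, WG = 5, WP = 2 come from the slide, but the packet sizes and the assumption that everything arrives at time 0 are made up, and the sketch omits the exact virtual-time update of true PGPS.

```python
# Simplified packet-by-packet WFQ: each flow's packet gets a virtual
# finish time F = previous_F_of_flow + size / weight, and packets are
# served in increasing F.

import heapq

def wfq_order(packets, weights):
    """packets: list of (flow, size) in arrival order, all assumed to
    arrive at time 0. Returns the flows in service order."""
    finish = {f: 0.0 for f in weights}
    heap = []
    for seq, (flow, size) in enumerate(packets):
        finish[flow] += size / weights[flow]
        heapq.heappush(heap, (finish[flow], seq, flow))
    return [flow for _, _, flow in sorted(heap)]

order = wfq_order(
    [("R", 100), ("G", 100), ("G", 100), ("P", 100)],
    {"R": 1, "G": 5, "P": 2})
```

G's finish times (20, then 40) beat P's (50) and R's (100), so the heavier-weighted flow is served first even though R's packet arrived earlier.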
Delay guarantees
• Theorem
If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
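As a concrete instance of the theorem, the single-node fluid version of the bound can be written down directly. This is a sketch using the standard leaky-bucket notation, which is not defined on the slide: \(\sigma_i\) is session \(i\)'s burst size, \(\rho_i\) its token rate, and \(g_i\) its guaranteed GPS service rate.

```latex
% Session i is (\sigma_i, \rho_i) leaky-bucket constrained and is
% guaranteed a GPS rate g_i \ge \rho_i at the node; its worst-case
% fluid delay is then bounded by the burst divided by that rate:
D_i^{*} \;\le\; \frac{\sigma_i}{g_i}, \qquad \rho_i \le g_i .
```

The packetized (PGPS/WFQ) and multiple-node versions add packet-length terms to this bound, as in the Parekh-Gallager papers cited at the end of this section.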
Practical considerations
• For every packet, the scheduler needs to
– classify it into the right flow queue and maintain a linked-list
for each flow
– schedule it for departure
• The complexity of both is O(log [# of flows])
– first is hard to overcome
– second can be overcome by DRR
Deficit Round Robin
(Figure: DRR example with quantum size 500; per-flow queues hold packets of sizes such as 50, 700, 250 and 400, 600, and each queue's deficit counter, incremented by the quantum every round, determines which head packets can be sent)
Good approximation of FQ
Much simpler to implement
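The quantum-and-deficit bookkeeping can be sketched as follows, using the slide's quantum of 500 bytes; the function name and example packet sizes are illustrative.

```python
# Minimal Deficit Round Robin: each round, a backlogged queue's deficit
# counter grows by the quantum, and the queue may send head packets as
# long as each fits within the remaining counter.

from collections import deque

def drr(queues, quantum=500):
    """queues: list of deques of packet sizes (bytes). Returns the
    transmissions as (queue_index, size) in order."""
    deficit = [0] * len(queues)
    sent = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0          # an empty queue keeps no credit
                continue
            deficit[i] += quantum
            while q and q[0] <= deficit[i]:
                size = q.popleft()
                deficit[i] -= size
                sent.append((i, size))
    return sent
```

A queue with a 700-byte head packet must wait one round to accumulate enough deficit, which is how DRR stays fair to flows with small packets without sorting timestamps.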
But...
• WFQ is still very hard to implement
– classification is a problem
– needs to maintain too much state information
– doesn’t scale well
Strict Priorities and Diff Serv
• Classify flows into priority classes
– maintain only per-class queues
– perform FIFO within each class
– avoid “curse of dimensionality”
Diff Serv
• A framework for providing differentiated QoS
– set Type of Service (ToS) bits in packet headers
– this classifies packets into classes
– routers maintain per-class queues
– condition traffic at network edges to conform to class requirements
May still need queue management inside the network
References for O/p Scheduling
- A. Demers et al., "Analysis and simulation of a fair queueing algorithm", ACM SIGCOMM, 1989.
- A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single node case", IEEE/ACM Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case", IEEE/ACM Trans. on Networking, August 1993.
- M. Shreedhar, G. Varghese, "Efficient Fair Queueing using Deficit Round Robin", ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), "Differentiated Services: Operational Model and Definitions", Internet Draft, 1998.
Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness
Tail Drop Queue Management: Lock-Out
(Figure: a queue full to the max queue length, occupied by a few flows that lock the others out)
Tail Drop Queue Management
• Drop packets only when the queue is full
– long steady-state delay
– global synchronization
– bias against bursty traffic
Global Synchronization
(Figure: queue occupancy oscillating in lockstep below the max queue length)
Bias Against Bursty Traffic
(Figure: a burst arriving at a nearly full queue is disproportionately dropped at the max queue length)
Alternative Queue Management Schemes
• Drop from front on full queue
• Drop at random on full queue
Both solve the lock-out problem; both still have the full-queues problem.
Active Queue Management: Goals
• Solve lock-out and full-queue problems
– no lock-out behavior
– no global synchronization
– no bias against bursty flows
• Provide better QoS at a router
– low steady-state delay
– lower packet dropping
Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness
Random Early Detection (RED)
if qavg < minth: admit every packet
else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth)/(maxth - minth)
else (qavg > maxth): drop every incoming packet
(Figure: a queue holding packets P1 ... Pk, with thresholds minth and maxth marked and the average queue size qavg)
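The drop rule on this slide transcribes directly to code. This sketch follows the slide's simplified variant, where the drop probability ramps from 0 to 1 over [minth, maxth]; real RED caps the ramp at a maximum probability maxp and spaces drops between marked packets.

```python
# RED drop decision as stated on the slide, driven by the average
# queue size qavg rather than the instantaneous queue length.

import random

def red_drop(qavg, minth, maxth, rng=random.random):
    """Return True if the incoming packet should be dropped."""
    if qavg < minth:
        return False                            # admit every packet
    if qavg <= maxth:
        p = (qavg - minth) / (maxth - minth)    # probability ramps linearly
        return rng() < p                        # probabilistic early drop
    return True                                 # drop every packet
```

Passing a deterministic `rng` makes the probabilistic branch easy to test; in a router, qavg would be an EWMA of the queue length updated on each arrival.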
Effectiveness of RED: Lock-Out
• Packets are randomly dropped
• Each flow has the same probability of being discarded
Effectiveness of RED: Full-Queue
• Drop packets probabilistically in anticipation of congestion (not when the queue is full)
• Use qavg to decide the packet dropping probability: allows instantaneous bursts
• Randomness avoids global synchronization
What QoS does RED Provide?
• Lower buffer delay: good interactive service – qavg is controlled to be small
• Given responsive flows: packet dropping is reduced– early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation
Unresponsive or aggressive flows
• Don’t properly back off during congestion
• Take away bandwidth from TCP compatible flows
• Monopolize buffer space
Control Unresponsive Flows
• Some active queue management schemes identify and penalize unresponsive flows with a bit of extra work:
– RED with penalty box
– Flow RED (FRED)
– Stabilized RED (SRED)
Active Queue ManagementReferences
• B. Braden et al., "Recommendations on queue management and congestion avoidance in the Internet", RFC 2309, 1998.
• S. Floyd, V. Jacobson, "Random early detection gateways for congestion avoidance", IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, "Dynamics of random early detection", ACM SIGCOMM, 1997.
• T. Ott et al., "SRED: Stabilized RED", INFOCOM, 1999.
• S. Floyd, K. Fall, "Router mechanisms to support end-to-end congestion control", LBL technical report, 1997.
Tutorial Outline
• Introduction: What is a Packet Switch?
• Packet Lookup and Classification: Where does a packet go next?
• Switching Fabrics: How does the packet get there?
• Output Scheduling: When should the packet leave?
Basic Architectural Components
Control: routing, reservation / admission control, congestion control, policing
Datapath (per-packet processing): switching, output scheduling
Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (lookup in the forwarding table)
2. Interconnect
3. Output scheduling