Towards Simple, High-performance Input-Queued Switch Schedulers. Devavrat Shah, Stanford University. Berkeley, Dec 5. Joint work with Paolo Giaccone and Balaji Prabhakar.


Page 1: Towards Simple, High-performance Input-Queued Switch Schedulers

Towards Simple, High-performance Input-Queued Switch Schedulers

Devavrat Shah, Stanford University

Berkeley, Dec 5

Joint work with Paolo Giaccone and Balaji Prabhakar

Page 2: Towards Simple, High-performance Input-Queued Switch Schedulers

Outline

• Description of input-queued switches
• Scheduling
  – the problem
  – some history
• Simple, high-performance schedulers
  – Laura
  – Serena
  – Apsara
• Conclusions

Page 3: Towards Simple, High-performance Input-Queued Switch Schedulers

The Input-Queued (IQ) Switch Architecture

• N inputs, N outputs (in fig, N = 3)
• Time is slotted
  – at most one packet can arrive per time-slot at each input
• Equal sized cells/packets
• Buffers only at inputs
• Use a crossbar for switching packets

Page 4: Towards Simple, High-performance Input-Queued Switch Schedulers

Scheduling

• The crossbar imposes these constraints in each time-slot:
  – only one packet can be transferred to each output
  – only one packet can be transferred from each input
• The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
  – i.e. determine which output will receive a packet from which input in each time slot
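The two crossbar constraints can be made concrete with a small check. This is an illustrative sketch only: the list representation (`schedule[i]` is the output matched to input `i`, or `None` if the input idles) and the name `is_valid_schedule` are assumptions, not part of the talk.

```python
def is_valid_schedule(schedule, n):
    """Check the crossbar constraints for an n x n switch.

    schedule[i] is the output matched to input i, or None if input i idles.
    Storing one entry per input already enforces "one packet from each input";
    the loop enforces "one packet to each output".
    """
    if len(schedule) != n:
        return False
    used_outputs = set()
    for out in schedule:
        if out is None:
            continue
        if not 0 <= out < n or out in used_outputs:
            return False
        used_outputs.add(out)
    return True


print(is_valid_schedule([1, 2, 0], 3))  # inputs 0, 1, 2 send to outputs 1, 2, 0
print(is_valid_schedule([1, 1, 0], 3))  # inputs 0 and 1 both claim output 1
```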

Page 5: Towards Simple, High-performance Input-Queued Switch Schedulers


Background to switch scheduling

1. [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)

2. [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.

Page 6: Towards Simple, High-performance Input-Queued Switch Schedulers

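Basic Switch Model

[Figure: N × N switch. Arrivals A11(t), ..., A1N(t), ..., AN1(t), ..., ANN(t) enter the input queues with occupancies L11(t), ..., LNN(t); the schedule S(t) connects inputs 1..N to outputs 1..N, producing departures D1(t), ..., DN(t).]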

Page 7: Towards Simple, High-performance Input-Queued Switch Schedulers

Some definitions

1. Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij} = E[A_{ij}(t)]$. If $\sum_i \lambda_{ij} < 1$ and $\sum_j \lambda_{ij} < 1$, we say the traffic is "admissible."

2. Service matrix: $S = [s_{ij}]$, where $s_{ij} \in \{0, 1\}$ and $S$ is a permutation matrix.

3. Queue occupancies: $L_{11}(t), \ldots, L_{NN}(t)$
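The admissibility condition can be checked directly from a matrix of arrival rates. A minimal sketch, assuming the rates are given as a list of rows and that admissibility means every row sum and every column sum is strictly below 1; `is_admissible` is a hypothetical helper name.

```python
def is_admissible(rates):
    """Return True if every row sum and column sum of the rate matrix is < 1.

    rates[i][j] is the mean arrival rate at input i for output j.
    """
    n = len(rates)
    rows_ok = all(sum(row) < 1 for row in rates)
    cols_ok = all(sum(rates[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok


print(is_admissible([[0.5, 0.3], [0.3, 0.5]]))  # no input or output is overloaded
print(is_admissible([[0.6, 0.1], [0.5, 0.1]]))  # output 0 is offered more than 1
```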

Page 8: Towards Simple, High-performance Input-Queued Switch Schedulers


More background on theory

[Anderson et al. 1993] A schedule is equivalent to finding a matching in a bipartite graph induced by input and output nodes

Page 9: Towards Simple, High-performance Input-Queued Switch Schedulers

Background

[McKeown et al. 1995]
(a) Maximum size match does not give 100% throughput.
(b) But maximum weight match can, where the weight can be queue length or the age of a cell.

[Figure: bipartite graph with edge weights and its maximum weight matching (MWM)]

Page 10: Towards Simple, High-performance Input-Queued Switch Schedulers

Maximum Weight Matching

• Maximum weight matching (MWM)
  – 100% throughput
  – provable delay bounds for i.i.d. Bernoulli admissible traffic
  – but finding the MWM is like solving a network-flow problem whose complexity is $O(N^3)$, which is too complex for high-speed networks
• We seek to approximate maximum weight matching
• Our goal:
  – obtain a simply implementable approximation to MWM that performs competitively with MWM
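To fix what "weight of a matching" means, the sketch below computes the MWM by exhaustive search over all N! permutations, using queue lengths as weights. This is only a reference for tiny switches and is not the O(N^3) network-flow method mentioned above; the names `matching_weight` and `mwm_brute_force` and the example queue lengths are assumptions made for illustration.

```python
from itertools import permutations


def matching_weight(match, q):
    """Sum of q[i][match[i]] over inputs i; q[i][j] is the VOQ length (i, j)."""
    return sum(q[i][match[i]] for i in range(len(match)))


def mwm_brute_force(q):
    """Return the heaviest of the N! matchings. Usable only for very small N."""
    n = len(q)
    return max(permutations(range(n)), key=lambda m: matching_weight(m, q))


q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
best = mwm_brute_force(q)
print(best, matching_weight(best, q))  # input 0 -> output 1, 1 -> 0, 2 -> 2
```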

Page 11: Towards Simple, High-performance Input-Queued Switch Schedulers

Approximating MWM

• Two performance measures
  – throughput
  – delay
• We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay

Page 12: Towards Simple, High-performance Input-Queued Switch Schedulers

Methods of Approximation

• Randomization
  – well-known method for simplifying implementation
• Using information in packet arrivals
  – since queue-sizes grow due to arrivals, and arrival times are a source of randomness
• Hardware parallelism
  – yields an efficient search procedure

Page 13: Towards Simple, High-performance Input-Queued Switch Schedulers

Randomization

• The main idea of randomized algorithms is
  – to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state

Page 14: Towards Simple, High-performance Input-Queued Switch Schedulers

An Illustrative Example

• Find the oldest person from a population of 1 billion
• Deterministic algorithm: linear search
  – has a complexity of 1 billion
• A randomized version: find the oldest of 30 randomly chosen people
  – has a complexity of 30 (ignoring complexity of random sampling)
• Performance
  – linear search will find the absolute oldest person (rank = 1)
  – if R is the person found by the randomized algorithm, we can make statements like P(R has rank < 100 million) $= 1 - (9/10)^{30} > 0.95$
  – thus, we can say that the performance of the randomized algorithm is very good with a high probability
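The probability quoted for the sampling example can be reproduced with one line of arithmetic. The sketch assumes the 30 people are drawn independently and uniformly, so each draw misses the oldest 10% (100 million out of 1 billion) with probability 0.9; `prob_top_fraction` is a hypothetical helper name.

```python
def prob_top_fraction(samples, fraction):
    """P(the best of `samples` independent uniform draws is in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples


# Top 100 million out of 1 billion is the top 10% of the population.
p = prob_top_fraction(30, 0.1)
print(round(p, 4))  # roughly 0.958, above the 0.95 quoted in the example
```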

Page 15: Towards Simple, High-performance Input-Queued Switch Schedulers

Randomizing Iterative Schemes

• Often, we want to perform some operation iteratively
• Example: find the oldest person each year
• Say in 2001 you choose 30 people at random
  – and store the identity of the oldest person in memory
  – in 2002 you choose 29 new people at random
  – let R be the oldest person from these 29 + 1 = 30 people
• P(R has rank < 100 million) $= 1 - (9/10)^{59}$, up from $1 - (9/10)^{30}$ without memory
• or, P(R has rank < 50 million)
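The gain from memory can be worked out under simple assumptions: samples are independent and uniform, and relative ages do not change from one year to the next, so the remembered person plus the 29 new people behave like the best of 30 + 29 = 59 draws. Under those assumptions (which are not stated on the slide):

```latex
\begin{align*}
P(\text{R has rank} < 10^{8})
  &= 1 - \left(1 - \tfrac{10^{8}}{10^{9}}\right)^{59}
   = 1 - 0.9^{59} \approx 0.998, \\
P(\text{R has rank} < 5 \times 10^{7})
  &= 1 - \left(1 - \tfrac{5 \times 10^{7}}{10^{9}}\right)^{59}
   = 1 - 0.95^{59} \approx 0.952 .
\end{align*}
```

So for the same per-year sampling cost, memory either raises the confidence for the 100 million bound or keeps roughly 95% confidence for a bound that is twice as tight.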

Page 16: Towards Simple, High-performance Input-Queued Switch Schedulers


Back to Switch Scheduling: Randomizing MWM

• Choose d matchings at random and use the heaviest one as the schedule

• Ideally we would like to have small d. However:

• Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
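The memoryless randomized scheme above can be sketched as follows. The code assumes that "at random" means uniformly over all N! matchings and that weights are queue lengths; the function names are hypothetical. It illustrates the scheme that the theorem says is not sufficient for 100% throughput.

```python
import random


def weight(match, q):
    """Weight of a matching: sum of queue lengths q[i][match[i]]."""
    return sum(q[i][match[i]] for i in range(len(match)))


def random_matching(n, rng):
    """A uniformly random permutation of the outputs 0..n-1."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def heaviest_of_d(q, d, rng):
    """Draw d random matchings and keep the heaviest; no memory across slots."""
    n = len(q)
    candidates = [random_matching(n, rng) for _ in range(d)]
    return max(candidates, key=lambda m: weight(m, q))


rng = random.Random(1)
q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
schedule = heaviest_of_d(q, d=3, rng=rng)
print(schedule, weight(schedule, q))
```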

Page 17: Towards Simple, High-performance Input-Queued Switch Schedulers


Proof

Page 18: Towards Simple, High-performance Input-Queued Switch Schedulers

Simulation Scenario

• Switch size: 32 × 32
• Input traffic (shown for a 4 × 4 switch)
  – Bernoulli i.i.d. inputs
  – diagonal load matrix:

\[
\begin{pmatrix}
x & y & 0 & 0 \\
0 & x & y & 0 \\
0 & 0 & x & y \\
y & 0 & 0 & x
\end{pmatrix}
\]

• normalized load = x + y < 1
• x = 2y

Page 19: Towards Simple, High-performance Input-Queued Switch Schedulers

[Figure: mean IQ length (log scale) versus normalized load under diagonal traffic, for MWM, R32 and R1]

Page 20: Towards Simple, High-performance Input-Queued Switch Schedulers

Crucial Observation

• The state of the switch changes due to arrivals & departures
• Between consecutive time slots, a queue’s length can change by at most 1
  – hence a heavy matching tends to stay heavy
• Therefore
  – “remembering” a heavy matching should help in improving the performance

Page 21: Towards Simple, High-performance Input-Queued Switch Schedulers

Tassiulas’ Algorithm

• [Tassiulas 1998] proposed the following algorithm based on this observation:
  – let S(t-1) be the matching used at time t-1
  – let R(t) be a matching chosen uniformly at random
  – and let S(t) be the heavier of R(t) and S(t-1)
• This gives 100% throughput!
  – note the boost in throughput is due to the use of memory
• But delays are very large
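The three steps of the algorithm translate almost line for line into code. A minimal sketch, assuming queue lengths as weights and that a tie keeps the previous matching (the slide does not say how ties are broken); `tassiulas_step` is a hypothetical name.

```python
import random


def weight(match, q):
    """Weight of a matching: sum of queue lengths q[i][match[i]]."""
    return sum(q[i][match[i]] for i in range(len(match)))


def tassiulas_step(prev, q, rng):
    """Return S(t): the heavier of S(t-1) and a uniformly random matching R(t)."""
    r = list(range(len(q)))
    rng.shuffle(r)
    # Weights are evaluated with the current queue lengths q.
    return r if weight(r, q) > weight(prev, q) else prev


rng = random.Random(0)
q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
s = [0, 1, 2]
for _ in range(20):
    s = tassiulas_step(s, q, rng)
print(s, weight(s, q))
```

With the queue lengths held fixed, as in this toy run, the weight of S(t) can never decrease, which is the effect of memory the slide points to.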

Page 22: Towards Simple, High-performance Input-Queued Switch Schedulers

[Figure: mean IQ length (log scale) versus normalized load under diagonal traffic, for MWM and Tassiulas]

Page 23: Towards Simple, High-performance Input-Queued Switch Schedulers

Derandomization

• Let G be a fully-connected graph where each node is one of the N! possible schedules
• Construct a Hamiltonian walk, H(t), on G
  – H(t) cycles through the nodes of G
• At any time t
  – let R(t) = H(t mod N!)
  – and let S(t) be the heavier of R(t) and S(t-1)
  – this also has 100% throughput, but delays are large (derandomization will be useful later)
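Because G is fully connected, any fixed ordering of the N! matchings, repeated forever, is a valid Hamiltonian walk. The sketch below uses the lexicographic order produced by `itertools.permutations`; that choice of walk, the tie rule (keep S(t-1)), and the function names are assumptions made for illustration.

```python
from itertools import permutations


def weight(match, q):
    """Weight of a matching: sum of queue lengths q[i][match[i]]."""
    return sum(q[i][match[i]] for i in range(len(match)))


def hamiltonian_walk(n):
    """Yield all n! matchings in a fixed order, then start over, forever."""
    while True:
        yield from permutations(range(n))


def derandomized_step(prev, q, walk):
    """Return S(t): the heavier of S(t-1) and the next matching on the walk."""
    r = next(walk)
    return r if weight(r, q) > weight(prev, q) else prev


q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
walk = hamiltonian_walk(3)
s = (0, 1, 2)
for _ in range(6):  # one full cycle of 3! = 6 matchings
    s = derandomized_step(s, q, walk)
print(s, weight(s, q))
```

With the queue lengths frozen, one full cycle of the walk is enough to land on the maximum weight matching; in a running switch the queues change while the walk proceeds, which is why the guarantee takes the form of a bounded gap to the maximum weight.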

Page 24: Towards Simple, High-performance Input-Queued Switch Schedulers

Stability

• Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures $W_B(t) \ge W^*(t) - c$ for every t. Then B is stable.
• Theorem: $W_{DER}(t) \ge W^*(t) - 2N \cdot N!$. Therefore, it is stable.

Page 25: Towards Simple, High-performance Input-Queued Switch Schedulers

Delay

• These simple approximations of MWM yield 100% throughput, but delays are large
• To obtain good delays we’ll present three different algorithms which use the following features:
  – selective remembrance -- Laura
  – information in the arrivals -- Serena
  – hardware parallelism -- Apsara

Page 26: Towards Simple, High-performance Input-Queued Switch Schedulers

Laura

[Diagram: S(t-1) and R(t) enter COMP, which outputs S(t); S(t) is fed back as S(t-1) at the next time slot]

Tassiulas
• COMP = Maximum
• R(t): uniform sample

Laura
• COMP = Merge, which picks the best edges of the two matchings
• R(t): non-uniform sample

Page 27: Towards Simple, High-performance Input-Queued Switch Schedulers

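Merging Procedure

[Figure: matchings S(t-1) and R drawn on the same bipartite graph with their edge weights, and the merged matching S(t)]

• W(S(t-1)) = 160
• W(R) = 150
• Weight difference on the first cycle: 10 - 40 + 10 - 30 + 10 - 50 = -90
• Weight difference on the second cycle: 70 - 10 + 60 - 20 = 100
• W(S(t)) = 250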
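The merge can be sketched as follows: the union of two perfect matchings splits into disjoint cycles that alternate between edges of S(t-1) and edges of R, and on each cycle the merge keeps whichever matching's edges are heavier. The code is an illustration under that reading; the function names, the tie rule (keep S(t-1)), and the placement of the weights in the 5 x 5 example are assumptions chosen to reproduce the totals on the slide (160, 150, and 250).

```python
def weight(match, w):
    """Weight of a matching: sum of w[i][match[i]] over inputs i."""
    return sum(w[i][match[i]] for i in range(len(match)))


def merge(s, r, w):
    """Merge matchings s and r (lists: input -> output) cycle by cycle."""
    n = len(s)
    r_inv = [0] * n            # r_inv[o] = input that r connects to output o
    for i, o in enumerate(r):
        r_inv[o] = i
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle = []             # inputs on the alternating cycle through start
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[s[i]]    # follow an s edge out, then an r edge back
        ws = sum(w[i][s[i]] for i in cycle)
        wr = sum(w[i][r[i]] for i in cycle)
        pick = s if ws >= wr else r
        for i in cycle:
            merged[i] = pick[i]
    return merged


w = [[10, 40, 0, 0, 0],
     [0, 10, 30, 0, 0],
     [50, 0, 10, 0, 0],
     [0, 0, 0, 70, 10],
     [0, 0, 0, 20, 60]]
s_prev = [0, 1, 2, 3, 4]       # weight 10 + 10 + 10 + 70 + 60 = 160
r = [1, 2, 0, 4, 3]            # weight 40 + 30 + 50 + 10 + 20 = 150
s_new = merge(s_prev, r, w)
print(s_new, weight(s_new, w))
```

The merged matching is never lighter than either input, which is why Merge can replace the Maximum comparison used by Tassiulas' algorithm.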

Page 28: Towards Simple, High-performance Input-Queued Switch Schedulers

Throughput

• Theorem:
  – LAURA is stable under any admissible Bernoulli i.i.d. input traffic.

Page 29: Towards Simple, High-performance Input-Queued Switch Schedulers

Average Backlog via Simulation

• Switch size: N = 32
• Length of VOQ: QMAX = 10000
• Comparison with
  – iSLIP, iLQF, MUCS, RPA and MWM

Page 30: Towards Simple, High-performance Input-Queued Switch Schedulers

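Simulation

• Traffic matrices (patterns shown for a 6 × 6 switch; each is scaled by a normalizing constant)
  – uniform
  – diagonal
  – sparse
  – logdiagonal

\[
T_U \propto
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\qquad
T_D \propto
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\qquad
T_S \propto
\begin{pmatrix}
2 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 2 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 2 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 2 \\
1 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}
\]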

Page 31: Towards Simple, High-performance Input-Queued Switch Schedulers


Laura: Diagonal traffic

Page 32: Towards Simple, High-performance Input-Queued Switch Schedulers


Laura: Sparse traffic

Page 33: Towards Simple, High-performance Input-Queued Switch Schedulers

Serena

• Since an increase in queue sizes is due to arrivals
• And arrivals are a source of randomness
• Use arrivals to generate a random matching

Page 34: Towards Simple, High-performance Input-Queued Switch Schedulers

Serena

[Diagram: S(t-1) and R(t), the matching generated using arrivals, are combined by Merge to give S(t); S(t) is fed back as S(t-1) at the next time slot]
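One way to turn a slot's arrivals into a full matching R(t) is sketched below. The slide only says that R(t) is generated using arrivals; the specific rules here are assumptions: when several arrivals are destined for the same output, keep the input with the longest queue to that output, and then pair the remaining inputs with the remaining outputs in index order. `arrival_matching` is a hypothetical name.

```python
def arrival_matching(arrivals, q):
    """Build a full matching from one slot's arrivals.

    arrivals[i] is the destination output of the cell that arrived at
    input i in this slot, or None if nothing arrived there.
    q[i][j] is the queue length at input i for output j.
    """
    n = len(q)
    best_input = {}                      # output -> chosen input
    for i, out in enumerate(arrivals):
        if out is None:
            continue
        j = best_input.get(out)
        if j is None or q[i][out] > q[j][out]:
            best_input[out] = i          # keep the heaviest contending edge
    match = [None] * n
    for out, i in best_input.items():
        match[i] = out
    # Complete the partial matching: unmatched inputs take the unused outputs.
    free_outputs = [o for o in range(n) if o not in best_input]
    for i in range(n):
        if match[i] is None:
            match[i] = free_outputs.pop(0)
    return match


q = [[2, 5, 0],
     [1, 7, 3],
     [4, 0, 6]]
print(arrival_matching([1, 1, None], q))  # inputs 0 and 1 both have a cell for output 1
```

The result is always a full matching, so it can be passed straight to the Merge step together with S(t-1).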

Page 35: Towards Simple, High-performance Input-Queued Switch Schedulers

Merging Procedure

[Figure: the arrival graph and the matching R derived from it (Arr-R), the previous matching S(t-1), and the merged matching S(t), each drawn with its edge weights]

• W(S(t-1)) = 209
• W(R) = 121
• W(S(t)) = 243

Page 36: Towards Simple, High-performance Input-Queued Switch Schedulers

Throughput

• Theorem:
  – SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern

Page 37: Towards Simple, High-performance Input-Queued Switch Schedulers


Serena: Diagonal traffic

Page 38: Towards Simple, High-performance Input-Queued Switch Schedulers


Apsara

• One way to obtain MWM is to search the space of all N! matchings

• A natural approximation: If S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)

• It turns out that there is a convenient way of defining neighbors (both for theory and for practice)

Page 39: Towards Simple, High-performance Input-Queued Switch Schedulers

Neighbors

• Neighbors differ from S(t) in ONLY TWO edges (for all values of N)

[Figure: example for a 3 x 3 switch showing S(t) and its neighbors]
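A matching that differs from S(t) in exactly two edges is obtained by picking two inputs and swapping their outputs, which gives one neighbor per pair of inputs. A minimal sketch of that construction; the list representation and the name `neighbors` are assumptions.

```python
from itertools import combinations


def neighbors(match):
    """All matchings obtained from `match` by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(match)), 2):
        nb = list(match)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result


for nb in neighbors([0, 1, 2]):   # the 3 x 3 case: three neighbors
    print(nb)
```

For an N x N switch this yields N(N-1)/2 neighbors, each of which can be evaluated independently of the others.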

Page 40: Towards Simple, High-performance Input-Queued Switch Schedulers

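Apsara

[Diagram: the neighbors N1, N2, ..., Nk of S(t-1), generated in parallel, together with S(t-1) and H(t) from the Hamiltonian walk, enter MAX, which outputs S(t); S(t) is fed back as S(t-1) at the next time slot]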
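One Apsara step, as drawn in the diagram, takes the maximum over S(t-1), its neighbors, and the current matching H(t) of the Hamiltonian walk. The sketch below evaluates the candidates in a sequential loop, whereas the talk has them generated in parallel in hardware; the function names and the tie rule (prefer S(t-1)) are assumptions.

```python
from itertools import combinations


def weight(match, q):
    """Weight of a matching: sum of queue lengths q[i][match[i]]."""
    return sum(q[i][match[i]] for i in range(len(match)))


def apsara_step(prev, q, h):
    """Return the heaviest of S(t-1), its two-edge neighbors, and H(t)."""
    candidates = [list(prev), list(h)]
    for a, b in combinations(range(len(prev)), 2):
        nb = list(prev)
        nb[a], nb[b] = nb[b], nb[a]    # swap the outputs of inputs a and b
        candidates.append(nb)
    # max() returns the first maximal element, so ties keep S(t-1).
    return max(candidates, key=lambda m: weight(m, q))


q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
s = apsara_step([0, 1, 2], q, h=(2, 1, 0))
print(s, weight(s, q))
```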

Page 41: Towards Simple, High-performance Input-Queued Switch Schedulers

Apsara: Throughput

• Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic (stability is due to the Hamiltonian matching).
• Also, note that W(S(t)) >= W(S(t-1),t)
• Theorem: If W(S(t)) = W(S(t-1),t) then W(S(t)) >= 0.5 W*(t) (this is not enough to ensure stability)

Page 42: Towards Simple, High-performance Input-Queued Switch Schedulers


Apsara: Diagonal traffic

Page 43: Towards Simple, High-performance Input-Queued Switch Schedulers

Limited Parallelism

• The Apsara algorithm searches over $\binom{N}{2}$ neighbors in parallel
• If space is limited to $K < \binom{N}{2}$ modules, then search over a randomly chosen subset of size K from all $\binom{N}{2}$ neighbors
• And there are other (good) deterministic ways of searching a smaller neighborhood of matchings
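The randomized variant can be sketched by sampling K of the input pairs and evaluating only those neighbors. The code assumes uniform sampling without replacement and, for brevity, leaves out the Hamiltonian-walk candidate H(t); `limited_apsara_step` is a hypothetical name.

```python
import random
from itertools import combinations


def weight(match, q):
    """Weight of a matching: sum of queue lengths q[i][match[i]]."""
    return sum(q[i][match[i]] for i in range(len(match)))


def limited_apsara_step(prev, q, k, rng):
    """Return the heaviest of S(t-1) and k randomly chosen neighbors."""
    pairs = list(combinations(range(len(prev)), 2))
    chosen = rng.sample(pairs, min(k, len(pairs)))
    best = list(prev)
    for a, b in chosen:
        nb = list(prev)
        nb[a], nb[b] = nb[b], nb[a]    # neighbor: swap outputs of inputs a, b
        if weight(nb, q) > weight(best, q):
            best = nb
    return best


q = [[1, 9, 0],
     [8, 2, 0],
     [0, 3, 4]]
rng = random.Random(0)
print(limited_apsara_step([0, 1, 2], q, k=1, rng=rng))
print(limited_apsara_step([0, 1, 2], q, k=3, rng=rng))  # k covers all 3 neighbors
```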

Page 44: Towards Simple, High-performance Input-Queued Switch Schedulers


Apsara: Limited parallelism

Page 45: Towards Simple, High-performance Input-Queued Switch Schedulers


Diagonal traffic

Page 46: Towards Simple, High-performance Input-Queued Switch Schedulers

Conclusions

• We have presented novel scheduling algorithms for input-queued switches
  – Laura
  – Serena
  – Apsara
• They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm

Page 47: Towards Simple, High-performance Input-Queued Switch Schedulers

References

1. L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM, 1998.

2. D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. of Hot Interconnects, 2001.

3. P. Giaccone, D. Shah and B. Prabhakar, “An implementable parallel scheduler for input-queued switches,” Proc. of Hot Interconnects, 2001.

4. P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM’02.

5. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.

Page 48: Towards Simple, High-performance Input-Queued Switch Schedulers


Uniform traffic

Page 49: Towards Simple, High-performance Input-Queued Switch Schedulers

LogDiagonal traffic

Maximum throughput (load 0.99)

Algorithm    Maximum throughput
MWM          0.99
MaxLAURA     0.99
LAURA        0.99
iSLIP        0.84
iLQF         0.97
MUCS         0.99
RPA          0.98