
1CIST560 by M. Hamdi

Packet Scheduling/Arbitration in Virtual Output Queues:

Maximal Matching Algorithms

(Part II)

2CIST560 by M. Hamdi

Pointer Desynchronization

• Performance: RRM < iSlip < FIRM

• Difference only in updating pointers

• Observation: iSlip and FIRM can effectively desynchronize their output pointers

• Pointer desynchronization works best when it is forced rather than left to emerge on its own

3CIST560 by M. Hamdi

Static Round Robin Matching (SRR): To Achieve FULL Desynchronization

• Initialization. The input pointers are set to 0. The output pointers are set to some initial pattern such that there is no duplication among the pointers.

• The 3 steps of one iteration are:
  – Request. Each input sends a request to every output for which it has a queued cell.
  – Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer to the highest priority element of the round-robin schedule is always incremented by one (modulo N), whether there is a grant or not.

4CIST560 by M. Hamdi

SRR (Cont’d)

– Accept. If an input receives a grant, it accepts the one that appears next in a fixed round-robin schedule starting from the highest priority element. The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted one.

• In DSRR (an improved version of SRR), the input pointers are also desynchronized.

• Rotating DSRR (RDSRR):
  – Unfairness among inputs under a special traffic model.
  – Outputs search in clockwise and anti-clockwise directions alternately to decide grants.
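The three SRR steps can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `srr_iteration` and the boolean request matrix are assumptions made for the example, and only a single iteration per time slot is modeled.

```python
def srr_iteration(requests, in_ptr, out_ptr):
    """One SRR iteration (request, grant, accept) on an N x N switch.

    requests[i][j] is True when input i has a cell queued for output j.
    in_ptr and out_ptr hold the round-robin pointers and are updated in place.
    Returns a dict {input: output} with the accepted matches.
    """
    n = len(requests)
    grants = {}  # input -> list of outputs that granted it
    for j in range(n):
        # Grant: scan the inputs round-robin, starting at the output pointer.
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
        # SRR rule: the output pointer always advances by one (mod N),
        # whether or not a grant was issued.
        out_ptr[j] = (out_ptr[j] + 1) % n
    match = {}
    for i, granted in grants.items():
        # Accept: scan the outputs round-robin, starting at the input pointer.
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if j in granted:
                match[i] = j
                in_ptr[i] = (j + 1) % n  # one location beyond the accepted one
                break
    return match


# Hypothetical 4 x 4 example: every VOQ is backlogged and the output
# pointers start from a duplicate-free pattern.
requests = [[True] * 4 for _ in range(4)]
in_ptr, out_ptr = [0] * 4, [0, 1, 2, 3]
match = srr_iteration(requests, in_ptr, out_ptr)
```

Because the output pointers start distinct and all advance by one in lockstep, they stay distinct in every later time slot, which is the forced desynchronization the slides refer to.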


5CIST560 by M. Hamdi

Simulation Results

[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic. Curves: iSlip, FIRM, SRR, DSRR, RDSRR]

6CIST560 by M. Hamdi

Simulation Results

[Figure: relative average delay vs. normalized load, 32x32 switch under uniform bursty traffic. Curves: iSlip, FIRM, SRR, DSRR, RDSRR]

7CIST560 by M. Hamdi

Simulation Results

[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic. Curves: iSlip, FIRM, SRR, DSRR, RDSRR]

8CIST560 by M. Hamdi

Simulation Results

[Figure: average delay (log scale) vs. normalized load, 32x32 switch under unbalanced traffic. Curves: iSlip, FIRM, SRR, DSRR, RDSRR]

9CIST560 by M. Hamdi

Stability Property

• A VOQ switch is considered stable if it approaches a steady state where the expected length of each VOQ is bounded. If it is stable, 100% throughput can be achieved under any admissible traffic pattern.

• RDSRR is more stable than iSlip and FIRM under various traffic patterns.

10CIST560 by M. Hamdi

Stability Property (Cont’d)

[Figure: throughput vs. normalized load (0.8 to 1), 32x32 switch under unbalanced traffic. Curves: iSlip, FIRM, RDSRR, Output]

11CIST560 by M. Hamdi

3-Phase & 2-Phase Algorithms

• iSlip & FIRM are 3-phase algorithms: Request-Grant-Accept

• DRRM is a 2-phase algorithm: Grant-Accept
  – Each input sends one grant
  – Each output sends one accept

• 2-FIRM is the 2-phase version of FIRM

12CIST560 by M. Hamdi

DRRM (Dual Round Robin Matching)

13CIST560 by M. Hamdi

3-Phase & 2-Phase Algorithms

[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic. Curves: iSlip, DRRM, FIRM, 2-FIRM]

14CIST560 by M. Hamdi

3-Phase & 2-Phase Algorithms

[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic. Curves: iSlip, DRRM, FIRM, 2-FIRM]

15CIST560 by M. Hamdi

3-Phase & 2-Phase Algorithms

• In the general case, the traffic pattern changes from time to time

• When the temporary non-uniformity is on the input side, the 3-phase scheme performs better

• When the temporary non-uniformity is on the output side, the 2-phase scheme performs better

16CIST560 by M. Hamdi

2-stage Maximum Size Matching Algorithm: Description

• The 2-stage algorithm works in the following way:
  1. The pointers at both input and output sides are kept fully desynchronized.
  2. In each iteration, there are 3 steps:

Step 1: Each input sends a request to every output for which it has a queued cell.

Step 2: Each input selects one VOQ, the one that appears next starting from its highest priority output, and sends a grant to that output. Each output selects one of the requests received in step 1, the one that appears next starting from its highest priority input, and sends a grant to that input. OutputCount = number of outputs receiving grants from inputs. InputCount = number of inputs receiving grants from outputs.

17CIST560 by M. Hamdi

2-stage Maximum Size Matching Algorithm: Description

• Step 3: If OutputCount ≥ InputCount, each output selects one among the grants received in step 2, the one that appears next starting from its highest priority input, and sends an accept.

Else, each input selects one among the grants received in step 2, the one that appears next starting from its highest priority output, and sends an accept.

• In simple words, the algorithm decides in each time slot whether to use the 2-phase or the 3-phase scheme, based on which one can make more matches.
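The decision made in steps 2 and 3 can be sketched as below. This is an illustrative sketch under stated assumptions: the names `rr_pick` and `two_stage_match` are invented for the example, ties between the two counts go to the 2-phase branch (the slides do not make the tie rule recoverable), and pointer updates are left out because the slides do not describe them.

```python
def rr_pick(candidates, ptr, n):
    """Return the first index at or after ptr (mod n) found in candidates."""
    for k in range(n):
        idx = (ptr + k) % n
        if idx in candidates:
            return idx
    return None


def two_stage_match(requests, in_ptr, out_ptr):
    """One iteration of the 2-stage scheme; returns {input: output}."""
    n = len(requests)
    # Step 2 (2-phase side): each input grants one of its non-empty VOQs.
    in_grant = {}  # input -> output
    for i in range(n):
        j = rr_pick({j for j in range(n) if requests[i][j]}, in_ptr[i], n)
        if j is not None:
            in_grant[i] = j
    # Step 2 (3-phase side): each output grants one requesting input.
    out_grant = {}  # output -> input
    for j in range(n):
        i = rr_pick({i for i in range(n) if requests[i][j]}, out_ptr[j], n)
        if i is not None:
            out_grant[j] = i
    output_count = len(set(in_grant.values()))   # outputs that received a grant
    input_count = len(set(out_grant.values()))   # inputs that received a grant
    match = {}
    if output_count >= input_count:  # tie handling is an assumption
        # Step 3, 2-phase branch: each granted output accepts one input.
        for j in set(in_grant.values()):
            senders = {i for i, jj in in_grant.items() if jj == j}
            match[rr_pick(senders, out_ptr[j], n)] = j
    else:
        # Step 3, 3-phase branch: each granted input accepts one output.
        for i in set(out_grant.values()):
            senders = {j for j, ii in out_grant.items() if ii == i}
            match[i] = rr_pick(senders, in_ptr[i], n)
    return match


# Hypothetical 3 x 3 case: inputs 1 and 2 only have cells for output 0.
requests = [[True, True, True],
            [True, False, False],
            [True, False, False]]
match = two_stage_match(requests, in_ptr=[1, 2, 0], out_ptr=[0, 1, 2])
```

In this example every output would grant input 0, so the 3-phase branch could match only one pair, while the 2-phase branch matches two.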

18CIST560 by M. Hamdi

2-stage Maximum Size Matching Algorithm: Hardware Implementation

[Figure: block diagram. The state of the input queues (N^2 bits) feeds N grant arbiters and N accept arbiters (1, 2, ..., N), whose results go to the decision register. An output counter and an input counter feed a comparator; 2 physical lines from the comparator select between the 1st group of inputs and the 2nd group of inputs.]

19CIST560 by M. Hamdi

Performance Evaluation: Simulation Study (Uniform Traffic)

[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic (1 iteration). Curves: iSlip, FIRM, 2-Stage, SRR, Output]

20CIST560 by M. Hamdi

Performance Evaluation: Simulation Study

Load                                 0.5    0.6    0.7    0.8    0.9    0.95   0.99

2-stage over iSlip
  Improvement Percentage             67%    196%   81%    58%    60%    84%    43%
  Normalized Improvement Percentage  40%    66%    45%    37%    37%    46%    30%
  Improvement Factor                 1.67   2.96   1.81   1.58   1.60   1.84   1.43

SRR over iSlip
  Improvement Percentage             7%     75%    92%    54%    59%    83%    43%
  Normalized Improvement Percentage  7%     43%    48%    35%    37%    45%    30%
  Improvement Factor                 1.07   1.75   1.92   1.54   1.59   1.83   1.43
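The slides do not define the three improvement metrics, but the tabulated values are consistent with the relations sketched below: factor = baseline delay / new delay, percentage = (factor - 1) x 100, and normalized percentage = (1 - 1/factor) x 100. These formulas are inferred from the numbers and are an assumption, as is the function name `improvement_metrics`.

```python
def improvement_metrics(delay_baseline, delay_new):
    """Return (factor, percentage, normalized percentage) for two delays.

    Inferred definitions (assumptions, not stated in the slides):
      factor     = delay_baseline / delay_new
      percentage = (factor - 1) * 100
      normalized = (1 - 1 / factor) * 100
    """
    factor = delay_baseline / delay_new
    percentage = (factor - 1) * 100
    normalized = (1 - 1 / factor) * 100
    return factor, percentage, normalized


# Check against the load 0.6 column for 2-stage over iSlip (factor 2.96).
factor, percentage, normalized = improvement_metrics(2.96, 1.0)
```

Under these definitions the normalized percentage is bounded by 100%, which matches how the hotspot results saturate at 100% while the raw percentage grows very large.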

21CIST560 by M. Hamdi

Performance Evaluation: Simulation Study (Bursty Traffic)

[Figure: relative average delay vs. normalized load, 32x32 switch under uniform bursty traffic (1 iteration). Curves: iSlip, FIRM, 2-Stage, SRR, Output]

22CIST560 by M. Hamdi

Performance Evaluation: Simulation Study

Load                                 0.63   0.7    0.75   0.8    0.85   0.9

2-stage over iSlip
  Improvement Percentage             213%   96%    70%    46%    28%    16%
  Normalized Improvement Percentage  68%    49%    41%    31%    22%    14%
  Improvement Factor                 3.13   1.96   1.70   1.46   1.28   1.16

SRR over iSlip
  Improvement Percentage             89%    56%    46%    33%    22%    14%
  Normalized Improvement Percentage  47%    36%    32%    25%    18%    12%
  Improvement Factor                 1.89   1.56   1.46   1.33   1.22   1.14

23CIST560 by M. Hamdi

Performance Evaluation: Simulation Study (Hotspot Traffic)

[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic (1 iteration). Curves: iSlip, FIRM, 2-Stage, SRR, Output]

24CIST560 by M. Hamdi

Performance Evaluation: Simulation Study

Load                                 0.31   0.38   0.43      0.46      0.50

2-stage over iSlip
  Improvement Percentage             26%    56%    101626%   160469%   81633%
  Normalized Improvement Percentage  21%    36%    100%      100%      100%
  Improvement Factor                 1.26   1.56   1017.26   1605.69   817.33

SRR over iSlip
  Improvement Percentage             5%     9%     56177%    74631%    19618%
  Normalized Improvement Percentage  5%     8%     99%       100%      99%
  Improvement Factor                 1.05   1.09   562.77    747.31    197.18

25CIST560 by M. Hamdi

Performance Evaluation: Simulation Study (Unbalanced Traffic)

[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under cross-shaped traffic (1 iteration). Curves: iSlip, FIRM, 2-Stage, SRR, Output]

26CIST560 by M. Hamdi

Performance Evaluation: Simulation Study

Load                                 0.5    0.6    0.7    0.8    0.9    0.95     0.99

2-stage over iSlip
  Improvement Percentage             12%    39%    53%    142%   552%   8040%    3351%
  Normalized Improvement Percentage  11%    28%    35%    59%    85%    99%      97%
  Improvement Factor                 1.12   1.39   1.53   2.42   6.52   81.40    34.51

SRR over iSlip
  Improvement Percentage             4%     35%    74%    225%   843%   11494%   3499%
  Normalized Improvement Percentage  4%     26%    43%    69%    89%    99%      97%
  Improvement Factor                 1.04   1.35   1.74   3.25   9.43   115.94   35.99

27CIST560 by M. Hamdi

A new algorithm – RDESRR

• Real Desynchronized Round Robin Model (RDESRR)
• Based on the 2-phase RRM model (Request and Grant)
• Adds a small shared memory that every output can read and write (called Share Bits)
• The size of the memory is 1 bit per input
• If the bit is set, the corresponding input has already been granted by an output
• If the bit is not set, the output may grant to the corresponding input port

28CIST560 by M. Hamdi

RDESRR Conceptual model

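[Figure: conceptual model of a 4 x 4 switch. Inputs 0-3 connect to outputs 0-3; each output has a round-robin pointer over inputs 0-3, and all outputs share a 4-entry Share Bits memory (one bit per input).]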

29CIST560 by M. Hamdi

RDESRR model

• 2 phases only

• Request. Each input sends a request to every output for which it has a queued cell.

• Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output checks whether the corresponding share bit is set. If it is not set, the output sets the bit and notifies the input that its request was granted. Otherwise, the output looks at the next request, until all requests have been examined. The pointer gi to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.
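The grant phase with share bits can be sketched as follows. This is a minimal software model under assumptions: the function name `rdesrr_grant` is invented, outputs are processed in index order to stand in for the first-come-first-served access to the shared memory, and an output that finds all of its requesters already taken leaves its pointer unchanged.

```python
def rdesrr_grant(requests, out_ptr):
    """RDESRR request/grant for one time slot; returns {input: output}.

    requests[i][j] is True when input i has a cell queued for output j.
    out_ptr[j] is the round-robin grant pointer of output j (updated in place).
    """
    n = len(requests)
    share_bits = [False] * n  # one bit per input, reset every time slot
    match = {}
    for j in range(n):  # assumed arbitration order among outputs
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if requests[i][j] and not share_bits[i]:
                share_bits[i] = True          # claim the input
                match[i] = j
                out_ptr[j] = (i + 1) % n      # one beyond the granted input
                break
        # If no request was granted, out_ptr[j] stays unchanged.
    return match


# Hypothetical 4 x 4 example with every VOQ backlogged and all pointers at 0.
requests = [[True] * 4 for _ in range(4)]
out_ptr = [0, 0, 0, 0]
match = rdesrr_grant(requests, out_ptr)
```

Since an input can be granted by at most one output, no accept phase is needed, and the pointers of the outputs end up distinct after a fully loaded slot.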

30CIST560 by M. Hamdi

RDESRR Demo - Request

Step 1: Request

[Figure: 4 x 4 example, inputs 0-3 sending requests to outputs 0-3]

31CIST560 by M. Hamdi

RDESRR Demo – Add a share memory in Output

Step 2: Grant

[Figure: outputs 0-3 with their round-robin pointers and the 4-entry Share Bits memory]

• Add a small shared memory that every output can read and write (called Share Bits)

32CIST560 by M. Hamdi

RDESRR Demo – Output checks the share bits

Step 2: Grant

[Figure: an output looking up the share bit of the input it wants to grant]

• The output checks whether the corresponding bit is set

33CIST560 by M. Hamdi

RDESRR Demo – When share bit is occupied

Step 2: Grant

[Figure: one share bit already set by an earlier output]

• If the bit is not set, the output sets the bit and notifies the input that its request was granted
• The share bits are allocated First Come First Serve

34CIST560 by M. Hamdi

RDESRR Demo – Output looks for next request

Step 2: Grant

[Figure: an output skipping an input whose share bit is set and moving to its next request]

• If the bit is set, the output looks at the next request, until all requests have been examined

35CIST560 by M. Hamdi

RDESRR Demo – All share bits are allocated

Step 2: Grant

[Figure: all four share bits set, one grant per input]

• Fully allocating the share bits results in a grant for every requesting input

36CIST560 by M. Hamdi

RDESRR Demo – Pointer update/Share bit reset

[Figure: output pointers after the update, with the Share Bits memory cleared]

• The pointer gi to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input
• If no request is received, the pointer stays unchanged
• Share bits are also reset

37CIST560 by M. Hamdi

SIM Results

• Run the test for 32x32 ports in SIM using –l 1000000

RDESRR

Load   Total Latency   Avg Match Size
0.1    0.0588          3.1958
0.2    0.1447          6.3938
0.3    0.2686          9.5947
0.4    0.4501          12.7940
0.5    0.7198          15.9960
0.6    1.1398          19.1980
0.7    1.8636          22.3961
0.8    3.2619          25.5986
0.9    7.5087          28.8003
1.0    715.5900        31.9850

38CIST560 by M. Hamdi

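Input Queueing: Longest Queue First or Oldest Cell First

[Figure: 4 x 4 bipartite request graph with edge weights (10, 1, 1, 1, 1, 10), from which a maximum weight matching is chosen]

Weight = {Queue Length, Waiting Time}; maximum weight matching gives 100% throughput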

39CIST560 by M. Hamdi

Input Queueing: Why is serving long/old queues better than serving the maximum number of queues?

• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.

• When traffic is non-uniform, some queues become longer than others.

• A good algorithm keeps the queue lengths matched, and services a large number of queues.

[Figure: two plots of average occupancy vs. VOQ #, one for uniform traffic and one for non-uniform traffic]

40CIST560 by M. Hamdi

Maximum/Maximal Weight Matching

• 100% throughput for admissible traffic (uniform or non-uniform)

• Maximum Weight Matching
  – OCF (Oldest Cell First): w = cell waiting time
  – LQF (Longest Queue First): w = input queue occupancy
  – LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port

• Maximal Weight Matching (practical algorithms)
  – iOCF
  – iLQF
  – iLPF (comparators in the critical path of iLQF are removed)

41CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iLQF

• Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.

• Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly.

• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
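The request, grant and accept steps of iLQF can be sketched as below. This is a simplified software model, not the hardware arbiter: the function name `ilqf`, the queue-length matrix and the optional `rng` argument are assumptions for the example, and the request word is modeled as a plain integer.

```python
import random


def ilqf(queue_len, iterations=1, rng=random):
    """Iterative Longest Queue First; returns {input: output}.

    queue_len[i][j] is the number of cells queued at input i for output j.
    Ties are broken at random, as in the description above.
    """
    n = len(queue_len)
    match = {}            # matched inputs -> outputs
    matched_out = set()
    for _ in range(iterations):
        # Request + Grant: each unmatched output picks its largest request.
        grants = {}       # input -> outputs that granted it
        for j in range(n):
            if j in matched_out:
                continue
            reqs = [(queue_len[i][j], i) for i in range(n)
                    if i not in match and queue_len[i][j] > 0]
            if not reqs:
                continue
            best = max(q for q, _ in reqs)
            i = rng.choice([i for q, i in reqs if q == best])
            grants.setdefault(i, []).append(j)
        if not grants:
            break         # nothing left to match
        # Accept: each granted input accepts the grant for its longest queue.
        for i, outs in grants.items():
            best = max(queue_len[i][j] for j in outs)
            j = rng.choice([j for j in outs if queue_len[i][j] == best])
            match[i] = j
            matched_out.add(j)
    return match


# Hypothetical 3 x 3 occupancy matrix without ties.
queue_len = [[5, 1, 0],
             [4, 0, 2],
             [0, 3, 1]]
match = ilqf(queue_len, iterations=1)
```

In this example the longest queue (input 0 to output 0, 5 cells) is matched in the first iteration, in line with Property 1 of the algorithm.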

42CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iLQF

• The i-LQF algorithm has the following properties:

• Property 1. Independent of the number of iterations, the longest input queue is always served.

• Property 2. As with i-SLIP, the algorithm converges in at most logN iterations.

• Property 3. For an inadmissible offered load, an input queue may be starved.

43CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iOCF

• The i-OCF algorithm works in a similar fashion to iLQF, and has the following properties:

• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.

• Property 2. As with i-LQF, the algorithm converges in at most logN iterations.

• Property 3. No input queue can be starved indefinitely.

• Property 4. It is difficult to keep time stamps on the cells.

44CIST560 by M. Hamdi

iLQF - Implementation

45CIST560 by M. Hamdi

iLPF - Implementation

Complicated hardware

46CIST560 by M. Hamdi

Other research efforts

• Packet-based arbitration
• Exhaustive-based arbitration
• Numerous other efforts

47CIST560 by M. Hamdi

Packet Scheduling/Arbitration in Virtual Output Queues: Randomized Algorithms and Others

48CIST560 by M. Hamdi

Input-Queued Packet Switch

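[Figure: N x N input-queued switch. Inputs 1..N hold VOQs with occupancy X_{i,j} and arrival rates λ_{1,1}, ..., λ_{i,j}, ..., λ_{N,N}; a scheduler configures the crossbar that connects inputs to outputs 1..N.]

Admissible traffic: $\sum_i \lambda_{i,j} < 1$ for every output j, and $\sum_j \lambda_{i,j} < 1$ for every input i.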

49CIST560 by M. Hamdi

Bipartite Graph and Matrix

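[Figure: a bipartite request graph between inputs 1-3 and outputs 1-3, shown next to the equivalent 3 x 3 request matrix with rows (0 1 1), (1 1 1), (0 0 1)]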

50CIST560 by M. Hamdi

Stability of Scheduling

Definition:

Let Xi,j(t) be the number of packets queued at input i for output j at time-slot t.

Then an algorithm is stable iff:

$E\left[\sum_{i,j} X_{i,j}(t)\right] < \infty$

51CIST560 by M. Hamdi

Maximum Matching in VOQ Architecture

[Figure: a 4 x 4 request graph with edge weights (8, 6, 4, 2, 1, 3, 1), shown with its maximum size matching and with its maximum weight matching (edges of weight 8, 6 and 4)]

52CIST560 by M. Hamdi

Complexity of Maximum Matchings

• Maximum Size/Cardinality Matchings:
  – It is not a stable algorithm
  – Algorithm by Dinic: O(N^{5/2})

• Maximum Weight Matchings:
  – Algorithm by Kuhn: O(N^3 log N)
  – It is a stable algorithm

• In general:
  – Hard to implement in hardware (they do not lend themselves to a simple hardware implementation, and this is not because of their serial time complexity)
  – Slooooow.

53CIST560 by M. Hamdi

Maximal Matching Algorithms

• Maximal matching algorithms are heuristic algorithms that try to approximate MSM or MWM.

• In general, maximal matching is much simpler to implement (Not because of its time complexity), and has a much faster running time.

• A maximal size matching is at least half the size of a maximum size matching.

• A maximal weight matching (built greedily, heaviest edge first) has at least half the weight of a maximum weight matching.

54CIST560 by M. Hamdi

Maximal Size Matching Algorithm: Performance and Properties

• Can have 100% throughput under uniform traffic

• They converge in log N iterations to a maximal size matching

• Their performance can be quite good (close to an ideal Output Queued Switch) with multiple iterations

• The best iterative maximal size matching algorithm takes O(N^2 log N) serial or O(log N) parallel time steps.

• If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical).

55CIST560 by M. Hamdi

Implementation of the parallel maximal matching algorithms

[Figure: block diagram. The state of the input queues (N^2 bits) feeds N grant arbiters and N request arbiters (1, 2, ..., N), whose results are written to the decision register.]

56CIST560 by M. Hamdi

Small Differences (in implementation) between RRM, iSlip & FIRM, But Large Difference in Performance

Pointer update rules:

Pointer  Event               RRM                              iSlip       FIRM
Input    No grant            unchanged                        unchanged   unchanged
Input    Granted             one location beyond the accepted one (all three)
Output   No request          unchanged                        unchanged   unchanged
Output   Grant accepted      one location beyond the granted one (all three)
Output   Grant not accepted  one location beyond the          unchanged   the granted one
                             previously granted one

57CIST560 by M. Hamdi

Maximum/Maximal Weight Matching

• 100% throughput for admissible traffic (uniform or non-uniform)

• Maximum Weight Matching
  – OCF (Oldest Cell First): w = cell waiting time
  – LQF (Longest Queue First): w = input queue occupancy
  – LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port

• Maximal Weight Matching (practical iterative algorithms)

• Make these maximal weight matching algorithms operate like iSLIP
  – iOCF
  – iLQF
  – iLPF

58CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iLQF

• Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.

• Grant. If an unmatched output receives any requests, it chooses the largest valued request (has the longest queue). Ties are broken randomly.

• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request (has the longest queue). Ties are broken randomly.

59CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iLQF

• The i-LQF algorithm has the following properties:

• Property 1. Independent of the number of iterations, the longest input queue is always served.

• Property 2. As with i-SLIP, the algorithm converges in at most logN iterations.

• Property 3. For an inadmissible offered load, an input queue may be starved.

• Property 4. It is a stable algorithm.

60CIST560 by M. Hamdi

Maximal Weight Matching Algorithms: iOCF

• The i-OCF algorithm works in a similar fashion to iLQF, and has the following properties:

• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.

• Property 2. As with i-LQF, the algorithm converges in at most logN iterations.

• Property 3. No input queue can be starved indefinitely.

• Property 4. It is difficult to keep time stamps on the cells.

61CIST560 by M. Hamdi

Can we do better than maximal matchings using Randomized Algorithms?

62CIST560 by M. Hamdi

Motivation

• Networking problems suffer from the “curse of dimensionality”
  – algorithmic solutions do not scale well

• Typical causes
  – size: large number of users or large number of I/O
  – time: very high speeds of operation

• A good deterministic algorithm exists (Max Flow), but …
  – it requires too large a data structure
  – it needs state information, and “state” is too big
  – it “starts from scratch” in each iteration

63CIST560 by M. Hamdi

Randomization

• Randomized algorithms have frequently been used in situations where the state space is very large (e.g., the N! different ways of connecting inputs to outputs)

• Randomized algorithms
  – are a powerful way of approximating
  – it is often possible to randomize deterministic algorithms
  – this simplifies the implementation while retaining a (surprisingly) high level of performance

• The main idea is
  – to simplify the decision-making process
  – by basing decisions upon a small, randomly chosen sample of the state
  – rather than upon the complete state

64CIST560 by M. Hamdi

An Illustrative Example

Find the largest element of a set S of size 1 billion

• Deterministic algorithm: linear search
  – has a complexity of 1 billion

• The randomized version: find the largest of 10 randomly chosen samples
  – has a complexity of 10
  – (note: this ignores the complexity of choosing 10 random samples)

• Performance
  – linear search will find the absolute largest element
  – if R is the element found by the randomized algorithm, we can make statements like
    P(R is larger than at least 100 million elements of S) = $1 - (1/10)^{10}$
    since R falls among the smallest 10% of S only if all 10 samples do
  – thus, we can say that the performance of the randomized algorithm is very good with a high probability
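The probability statement can be checked numerically. The sketch below assumes the 10 samples are drawn independently and uniformly (with replacement, a close approximation for a set of 1 billion); the function names are invented for the example. It evaluates two related events: R is not among the smallest 10% of S, and R is among the largest 10% of S.

```python
def prob_above_bottom(fraction, samples):
    """P(the largest of `samples` draws is not in the bottom `fraction`).

    The largest draw is in the bottom fraction only if every draw is.
    """
    return 1 - fraction ** samples


def prob_in_top(fraction, samples):
    """P(the largest of `samples` draws is in the top `fraction`).

    This fails only if every draw misses the top fraction.
    """
    return 1 - (1 - fraction) ** samples


# 100 million out of 1 billion is a fraction of 0.1; 10 samples are drawn.
p_above_bottom = prob_above_bottom(0.1, 10)
p_in_top = prob_in_top(0.1, 10)
```

With these assumptions the first probability is within 1e-10 of 1, while the second is roughly 0.65, so the strength of the guarantee depends on which of the two events is meant.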

65CIST560 by M. Hamdi

Randomizing Iterative Schemes (e.g., iSLIP)

• Often, we want to perform some operation iteratively

• Example: find the heaviest matching in a switch in every time slot

• Since, in each time slot
  – at most one packet can arrive at each input
  – and, at most one packet can depart from each output
  the size of the queues, or the “state” of the switch, doesn’t change by much between successive time slots; so, a matching that was heavy at time t will quite likely continue to be heavy at time t+1

• This suggests that
  – knowing a heavy matching at time t should help in determining a heavy matching at time t+1
  – there is no need to start from scratch in each time slot

66CIST560 by M. Hamdi

Summarizing Randomized Algorithms

• Randomized algorithms can help simplify the implementation
  – by reducing the amount of work in each iteration

• If the state of the system doesn’t change by much between iterations, then
  – we can reduce the work even further by carrying information between iterations

• The big pay-off is that, even though it is an approximation, the performance of a randomized scheme can be surprisingly good

67CIST560 by M. Hamdi

Randomized Scheduling Algorithms: Example

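• Consider a 3 x 3 input-queued switch
  – input traffic is Bernoulli IID with λij = α/3 for all i, j, and α < 1:

$$\Lambda = \begin{pmatrix} \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 \end{pmatrix} = \frac{\alpha}{3}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

  – This is admissible
  – note: there are a total of 6 (= 3!) possible service matrices:

$$\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix},\ \begin{pmatrix}0&1&0\\1&0&0\\0&0&1\end{pmatrix},\ \begin{pmatrix}1&0&0\\0&0&1\\0&1&0\end{pmatrix},\ \begin{pmatrix}0&0&1\\1&0&0\\0&1&0\end{pmatrix},\ \begin{pmatrix}0&1&0\\0&0&1\\1&0&0\end{pmatrix},\ \begin{pmatrix}0&0&1\\0&1&0\\1&0&0\end{pmatrix}$$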

68CIST560 by M. Hamdi

Random Scheduling Algorithms

• In time slot n, let S(n) be equal to one of the 6 possible matchings, chosen independently and uniformly at random (u.a.r.)

• Stability of Random
  – Consider L11(n), the number of packets in VOQ11
    • arrivals to VOQ11 occur according to A11(n), which is Bernoulli IID
    • input rate = λ11 = α/3
    • this queue gets served whenever the service matrix connects input 1 to output 1
    • there are 2 service matrices that connect input 1 to output 1
    • since Random chooses service matrices u.a.r., input 1 is connected to output 1 for a fraction of time = 2/6 = 1/3, which is the service rate between input 1 and output 1
  – E(L11(n)) < ∞ iff λ11 < 1/3, i.e. α < 1

• This random algorithm is stable under this uniform traffic.
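The service rate of 1/3 can be verified by enumerating the service matrices. The sketch below is a small check, with the function name `service_rate` and the 0-based port numbering chosen for the example; it counts the permutations that connect a given input to a given output.

```python
from fractions import Fraction
from itertools import permutations


def service_rate(n, i, j):
    """Fraction of the n! matchings that connect input i to output j.

    A matching is a permutation p with p[i] = output served by input i.
    """
    perms = list(permutations(range(n)))
    hits = sum(1 for p in perms if p[i] == j)
    return Fraction(hits, len(perms))


# Input 1 to output 1 of the 3 x 3 switch (index 0 here).
rate = service_rate(3, 0, 0)

# Under uniform traffic the arrival rate alpha/3 stays below the service
# rate for every alpha < 1.
alpha = Fraction(9, 10)
uniform_ok = alpha / 3 < rate
```

The same count gives 1/N for an N x N switch, which is why a purely random schedule cannot keep up once a single VOQ receives more than 1/N of the line rate.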

69CIST560 by M. Hamdi

Random Scheduling Algorithms

• Instability of Random

• Now suppose λii = α for all i and λij = 0 for i ≠ j
  – clearly, this is admissible traffic for all α < 1
  – but, under Random, the service rate at VOQ11 is 1/3 at best
  – hence VOQ11 and the switch will be unstable as soon as α > 1/3

• Stability (or 100% throughput) means it is stable under all admissible traffic!

70CIST560 by M. Hamdi

Simulation Scenario

• Switch Size: 32 x 32

• Input Traffic (shown for a 4 x 4 switch)
  – diagonal load matrix:

$$\Lambda = \begin{pmatrix} x & y & 0 & 0 \\ 0 & x & y & 0 \\ 0 & 0 & x & y \\ y & 0 & 0 & x \end{pmatrix}$$

  – normalized load = x + y < 1
  – x = 2y

• It is a good test-case

71CIST560 by M. Hamdi

Obvious Randomized Schemes

• Choose a matching at random and use it as the schedule: this doesn’t give 100% throughput (already shown)

• Choose 2 matchings at random and use the heavier one as the schedule

• Choose N matchings at random and use the heaviest one as the schedule

None of these can give 100% throughput!

72CIST560 by M. Hamdi

[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic. Curves: MWM, R32, R1]

73CIST560 by M. Hamdi

Bounds on Maximum Throughput

74CIST560 by M. Hamdi

Iterative Randomized Scheme (Tassiulas)

• Say M is the matching used at time t

• Let R be a new matching chosen uniformly at random (u.a.r.) among the N! different matchings

• At time t+1, use the heavier of M and R

• Complexity is very low: O(1) iterations

• This gives 100% throughput!
  – note: the boost in throughput is due to memory (saving previous matchings)

• But, delays are very large
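One time slot of this scheme can be sketched as follows. It is a minimal illustration under assumptions: matchings are stored as permutations (`matching[i]` is the output served by input i), weights are VOQ lengths, and the names `weight` and `tassiulas_step` are invented for the example.

```python
import random


def weight(matching, queue_len):
    """Total queue length served by a matching given as a permutation."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def tassiulas_step(prev, queue_len, rng=random):
    """Keep the heavier of the previous matching and one random matching."""
    candidate = list(range(len(prev)))
    rng.shuffle(candidate)  # uniformly random permutation
    if weight(candidate, queue_len) > weight(prev, queue_len):
        return candidate
    return prev


# Hypothetical 3 x 3 queue state, held fixed here to isolate the effect
# of remembering the previous matching.
queue_len = [[3, 0, 1],
             [0, 2, 5],
             [4, 1, 0]]
rng = random.Random(0)
schedule = [0, 1, 2]
history = [weight(schedule, queue_len)]
for _ in range(50):
    schedule = tassiulas_step(schedule, queue_len, rng)
    history.append(weight(schedule, queue_len))
```

With a fixed queue state the weight of the schedule never decreases from slot to slot; in a running switch the queues change slowly, so a heavy matching tends to stay heavy.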

75CIST560 by M. Hamdi

[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic. Curves: MWM, Tassiulas]

76CIST560 by M. Hamdi

Observations for Improvement

• Most of the weight of a matching is carried in a small number of edges

• Hence, remember edges, not matchings

• We can have 100% throughput under all admissible traffic.

77CIST560 by M. Hamdi

[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic. Curves: MWM, R32, M32, R1, M1, Tassiulas]

78CIST560 by M. Hamdi

Finer Observations

• Let M be schedule used at time t

• Choose a “good’’ random matching R

• M’ = Merge(M,R)

• M’ includes best edges from M and R

• Use M’ as schedule at time t+1

• The above procedure yields an algorithm called LAURA

• There are many other small variations to this algorithm.

79CIST560 by M. Hamdi

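Merging Procedure

[Figure: merging the current matching X (W(X)=12) with a random matching R (W(R)=10) on a 4 x 4 switch. Per-cycle weight differences between X and R, such as 3-1+2-2=2 and 2-1+2-4=-1, decide which edges are kept; the merged matching M has W(M)=13.]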
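A merge of two perfect matchings can be sketched as below. The slides do not spell out the procedure, so this is one common way to do it and should be read as an assumption: the union of the two matchings splits into alternating cycles, and on each cycle the heavier side is kept. Matchings are stored as permutations, and the name `merge` and the weight matrix `w` are chosen for the example.

```python
def merge(m, r, w):
    """Merge perfect matchings m and r (permutations) using weights w.

    m[i] / r[i] is the output matched to input i; w[i][j] is the edge weight.
    On every alternating cycle of the union of m and r, the edges of the
    heavier matching are kept, so the result weighs at least as much as both.
    """
    n = len(m)
    r_inv = [0] * n
    for i, j in enumerate(r):
        r_inv[j] = i
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Walk the cycle: follow the m-edge to an output, then the r-edge
        # from that output back to an input.
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[m[i]]
        w_m = sum(w[i][m[i]] for i in cycle)
        w_r = sum(w[i][r[i]] for i in cycle)
        source = m if w_m >= w_r else r
        for i in cycle:
            merged[i] = source[i]
    return merged


# Hypothetical 4 x 4 example with two cycles.
w = [[3, 1, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 2, 2],
     [0, 0, 4, 1]]
m = [0, 1, 2, 3]   # weight 8
r = [1, 0, 3, 2]   # weight 9
merged = merge(m, r, w)
```

Here the first cycle keeps the edges of `m` and the second keeps the edges of `r`, so the merged matching is heavier than either input.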

80CIST560 by M. Hamdi

[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic. Curves: MWM, M-LAURA, LAURA, iLQF, Tassiulas]

81CIST560 by M. Hamdi

Can we avoid having schedulers altogether !!!

82CIST560 by M. Hamdi

Recap: Two Successive Scaling Problems

OQ routers:
+ work-conserving (QoS)
- memory bandwidth = (N+1)R

IQ routers:
+ memory bandwidth = 2R
- arbitration complexity (Bipartite Matching)

83CIST560 by M. Hamdi

IQ Arbitration Complexity

Today: 64 ports at 10Gbps, 64-byte cells.

• Arbitration Time = 64 bytes / 10Gbps = 51.2ns

• Request/Grant Communication BW = 17.5Gbps

Scaling to 160Gbps:
• Arbitration Time = 3.2ns
• Request/Grant Communication BW = 280Gbps

Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
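The arbitration times follow from the cell transmission time, since the scheduler must produce one matching per cell slot. A short check, with the function name `cell_time_ns` chosen for the example:

```python
def cell_time_ns(cell_bytes, line_rate_gbps):
    """Time to transmit one cell, in nanoseconds.

    bits / (Gbit/s) gives nanoseconds directly.
    """
    return cell_bytes * 8 / line_rate_gbps


today = cell_time_ns(64, 10)     # 64-byte cells at 10 Gbps
scaled = cell_time_ns(64, 160)   # same cells at 160 Gbps
```

Raising the line rate by 16x shrinks the time budget by 16x, which is why the slide lists a larger cell size as one of the two ways to scale.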

84CIST560 by M. Hamdi

Desirable Characteristics for Router Architecture

Ideal: OQ
• 100% throughput
• Minimum delay
• Maintains packet order

Necessary: able to regularly connect any input to any output

What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...

85CIST560 by M. Hamdi

Round-Robin Scheduling

• Uniform & non-bursty traffic => 100% throughput

• Problem: traffic is non-uniform & bursty

86CIST560 by M. Hamdi

Two-Stage Switch (I)

[Figure: two-stage switch. External inputs 1..N, a first round-robin stage, internal inputs 1..N, a second round-robin stage, external outputs 1..N.]

87CIST560 by M. Hamdi

Two-Stage Switch (I)

[Figure: the same two-stage switch, with the first round-robin stage labeled Load Balancing. External inputs 1..N, internal inputs 1..N, second round-robin stage, external outputs 1..N.]

88CIST560 by M. Hamdi

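Two-Stage Switch Characteristics

• 100% throughput
• Problem: unbounded mis-sequencing

[Figure: external inputs 1..N, a cyclic shift, internal inputs 1..N, a second cyclic shift, external outputs 1..N; cells 1 and 2 of one flow take different paths and can arrive out of order]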

89CIST560 by M. Hamdi

Two-Stage Switch (II)

[Figure: external inputs 1..N (input i) pass through a flow splitter and load balancer into VOQs, then a first-stage round-robin to internal outputs/inputs 1..N (j), then VOQs and a second-stage round-robin to external outputs 1..N (k). Flows are labeled F_ik.]

New: N^3 instead of N^2

90CIST560 by M. Hamdi

Expanding VOQ Structure

Solution: expand VOQ structure by distinguishing among switch inputs


91CIST560 by M. Hamdi

What is being done in practice (Cisco for example)

• They want schedulers that achieve 100% throughput and very low delay (Like MWM)

• They want it to be as simple as iSLIP in terms of hardware implementation

• Is there any solution to this?

92CIST560 by M. Hamdi

Typical Performance of ISLIP-like Algorithms

PIM with 4 iterations

93CIST560 by M. Hamdi

What is being done in practice (Cisco for example)

Company    Switching Capacity       Switch Architecture    Fabric Overspeed
Agere      40 Gbit/s-2.5 Tbit/s     Arbitrated crossbar    2x
AMCC       20-160 Gbit/s            Shared memory          1.0x
AMCC       40 Gbit/s-1.2 Tbit/s     Arbitrated crossbar    1-2x
Broadcom   40-640 Gbit/s            Buffered crossbar      1-4x
Cisco      40-320 Gbit/s            Arbitrated crossbar    2x