Page 1: DUKE UNIVERSITY Self-Tuned Congestion Control for Multiprocessor Networks Shubhendu S. Mukherjee Shubu.Mukherjee@compaq.com VSSAD, Alpha Development Group

DUKE UNIVERSITY

Self-Tuned Congestion Control for Multiprocessor Networks

Shubhendu S. Mukherjee
Shubu.Mukherjee@compaq.com
VSSAD, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts

Mithuna Thottethodi, Alvin R. Lebeck
{mithuna,alvy}@cs.duke.edu
Department of Computer Sciences
Duke University, Durham, North Carolina

Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January, 2001

Slide 2

Network Saturation

[Figure: normalized accepted throughput (flits/node/cycle, 0.00 to 0.40) versus packet regeneration interval (cycles, 0 to 100) for the Butterfly and Random traffic patterns.]

Slide 3

Why Network Saturation?

[Figure: router diagram]

• Tree saturation
• Deadlock cycles
• New packets block older packets
• Backpressure takes 1000s of cycles to propagate back

Slide 4

Why Do We Care?

• Computation power per router is increasing
  – More aggressive speculation
  – Simultaneous Multithreading
  – Chip Multiprocessors
• “Unstable” behavior makes designers very nervous

[Figure: a router with attached CPUs]

Slide 5

So, what’s the solution?

• Throttle
  – stop injecting packets when you hit a “threshold”
  – “threshold” = % full network buffers
• But
  – Local estimate of threshold insufficient
  – Saturation point differs for communication patterns
• Questions
  – How do we collect global estimate of % full network buffers?
  – How do we “tune” the threshold to different patterns?

Slide 6

Outline

• Overview
• Multiprocessor Network Basics
  – Deadlocks & virtual channels
  – Adaptive routing & Duato’s theory
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

Slide 7

A Multiprocessor Network

[Figure: network diagram with a labeled router]

Slide 8

Deadlock Avoidance

[Figure: two panels, “Deadlocked” and “Virtual Channels (red & yellow)”, each showing nodes 1 to 4 with the labels 1 3, 2 4, 3 1, and 4 2.]

Slide 9

Virtual Channels (VC)

[Figure: nodes 1 to 4 with the labels 1 3, 2 4, 3 1, and 4 2; caption “One Buffer Per VC”.]

Logically, red and yellow networks (deadlock-free)

Slide 10

Duato’s Theory

• Adaptive network for high performance
  – deadlock-prone
• Deadlock-free network when adaptive network deadlocks
  – drop down to deadlock-free when router is congested
• Implemented with different virtual channels
  – adaptive virtual channels
  – deadlock-free virtual channels (escape channels)
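
As a rough illustration of the fallback described on this slide, the sketch below picks a virtual channel for a packet at one router: any free adaptive virtual channel is preferred, and the escape channel is used only when no adaptive channel is free. The function name, the boolean free-list inputs, and the return convention are assumptions made for this example; the slides do not give an implementation.

```python
from typing import Optional, Sequence, Tuple


def select_virtual_channel(
    adaptive_free: Sequence[bool],
    escape_free: bool,
) -> Optional[Tuple[str, int]]:
    """Pick a virtual channel for the next hop (hypothetical helper).

    adaptive_free[i] is True when adaptive virtual channel i is free.
    escape_free is True when the deadlock-free (escape) channel is free.
    """
    # Prefer the adaptive channels: they give the routing freedom
    # that the slide associates with high performance.
    for index, is_free in enumerate(adaptive_free):
        if is_free:
            return ("adaptive", index)
    # All adaptive channels are busy: drop down to the escape channel.
    if escape_free:
        return ("escape", 0)
    # Nothing is free, so the packet waits at this router.
    return None
```

For example, `select_virtual_channel([False, False], True)` falls back to the escape channel, while `select_virtual_channel([False, True], True)` stays on an adaptive channel.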

Slide 11

Outline

• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

Slide 12

Global Estimate of Congestion

• % of full buffers in entire network
  – more & more buffers occupied when network saturates
  – throttle network when % full buffers cross threshold
• Advantages
  – simple aggregation
  – empirical observation: works well
• Disadvantages
  – doesn’t detect localized congestion
  – threshold differs for communication patterns (we solve this)
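
The throttling rule on this slide can be sketched as a simple comparison. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the strict "greater than" comparison is an assumption about what "cross threshold" means. The total of 3072 buffers used in the usage note comes from the Tuning Parameters slide.

```python
def percent_full(full_buffers: int, total_buffers: int) -> float:
    """Global congestion estimate: percentage of network buffers that are full."""
    return 100.0 * full_buffers / total_buffers


def should_throttle(
    full_buffers: int,
    total_buffers: int,
    threshold_percent: float,
) -> bool:
    """Stop injecting new packets once the global estimate exceeds the threshold.

    Assumption: "cross threshold" is read as strictly greater than.
    """
    return percent_full(full_buffers, total_buffers) > threshold_percent
```

With 3072 buffers and a 1% threshold, 40 full buffers (about 1.3%) would trigger throttling, while 20 full buffers (about 0.65%) would not.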

Slide 13

Gather Global Information

• Global Information
  – % full network buffers in an “interval”
  – % packets or flits delivered during an “interval”
• Constraint
  – gather time << backpressure buildup time (1000s of cycles)
• Mechanisms
  – piggybacking
  – meta-packets
  – side-band signal

Slide 14

Sideband: Dimension-wise Aggregation

• Each hop takes h cycles on the sideband
• After 2 hops, aggregation in one dimension done
• 2 such phases
• Total gather time = 2 * 2 * h = 4h cycles
• For k-ary n-cubes, gather time (g) = n * k * h / 2
• For a 16x16 network, g = 2 * 16 * 2 / 2 = 32 cycles

[Figure: the two aggregation phases, Phase 1 and Phase 2]
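
The gather-time formula on this slide can be checked with a few lines of code. The function name is an assumption for illustration, and integer division assumes that n * k * h is even, which holds for every configuration quoted in the slides.

```python
def gather_time(n: int, k: int, h: int) -> int:
    """Gather time g = n * k * h / 2 for a k-ary n-cube.

    n: number of dimensions
    k: nodes per dimension
    h: sideband latency per hop, in cycles
    Assumes n * k * h is even so that the division is exact.
    """
    return (n * k * h) // 2


# The 16x16 torus from the slides, for the three hop latencies
# that appear in the delayed-collection experiment.
for hop_latency in (2, 3, 6):
    print(f"h={hop_latency}: g={gather_time(2, 16, hop_latency)} cycles")
```

For h = 2, 3, and 6 this gives g = 32, 48, and 96 cycles, matching the values labeled in the delayed-collection results. With k = 4 and n = 2 it gives 4h, the total quoted on this slide.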

Slide 15

Outline

• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

Slide 16

Dynamic Detection of Threshold (Hill Climbing)

[Figure: throughput versus % full buffers, with points A, B, and C marked on the curve.]

Action on the threshold:

                              Currently throttling?
Drop in bandwidth > 25%?      Yes          No
No                            Increment    No Change
Yes                           Decrement    Decrement

… we may still creep into saturation (later)
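
The decision table on this slide can be written as one tuning step. This is a sketch under stated assumptions rather than the authors' code: the function name is hypothetical, the bandwidth drop is measured against the previous tuning period, and the threshold is clamped to the range 0 to 100 percent. The default increment of 1% and decrement of 4% are the values listed on the Tuning Parameters slide.

```python
def tune_threshold(
    threshold: float,
    currently_throttling: bool,
    throughput: float,
    previous_throughput: float,
    increment: float = 1.0,
    decrement: float = 4.0,
    drop_fraction: float = 0.25,
) -> float:
    """Apply one hill-climbing step to the throttle threshold (in percent)."""
    # "Drop in bandwidth > 25%": throughput fell below 75% of the
    # previous period (assumed reference point).
    dropped = throughput < (1.0 - drop_fraction) * previous_throughput
    if dropped:
        # A large drop means we overshot: back off, throttling or not.
        return max(0.0, threshold - decrement)
    if currently_throttling:
        # No drop while throttling: probe a higher threshold.
        return min(100.0, threshold + increment)
    # No drop and not throttling: leave the threshold alone.
    return threshold
```

The decrement is larger than the increment, so the threshold backs away from saturation faster than it climbs toward it.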

Slide 17

Summary of Approach

• Global Knowledge of a Network
  – Collect % full network buffers and overall throughput
  – Dimension-wise aggregation, g-cycle snapshots
  – Aggregation via sideband signals
• Dynamically detect throttling threshold
  – Threshold = % of full network buffers
  – Self-tuned using hill climbing
  – Reset if hill climbing fails

Slide 18

Outline

• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

Slide 19

Methodology

• Flitsim 2.0 Simulator (Pinkston’s group at USC)
  – warmup for 10k cycles, simulate for 50k cycles
• Network architecture
  – 16x16 two-dimensional torus (16-ary, 2-cube)
  – Full-duplex links
  – Packet size = 16 flits
  – Wormhole routing
  – Deadlock avoidance (paper has deadlock recovery results)
• Router architecture
  – 3 virtual channels per physical channel
  – Each virtual channel buffer holds 8 flits
  – 1 cycle central arbitration, 1 cycle switching

Slide 20

Input Traffic

• Packet Generation Frequency
  – “attempt” to send one packet per packet regeneration interval
• Traffic Patterns
  – Random destination
  – Perfect Shuffle: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_{n-2} a_{n-3} \ldots a_0 a_{n-1}$
  – Butterfly: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_0 a_{n-2} \ldots a_1 a_{n-1}$
  – Bit Reversal: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_0 a_1 \ldots a_{n-2} a_{n-1}$
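
The three permutation patterns on this slide map a source node address to a destination by rearranging its bits. The sketch below is one way to compute them; the function names are assumptions, and n = 8 in the usage note assumes that the 256 nodes of the 16x16 network are numbered with 8-bit addresses.

```python
def perfect_shuffle(src: int, n: int) -> int:
    """Rotate the n-bit address left by one: the top bit wraps to the bottom."""
    msb = (src >> (n - 1)) & 1
    return ((src << 1) & ((1 << n) - 1)) | msb


def butterfly(src: int, n: int) -> int:
    """Swap the most and least significant bits; middle bits are unchanged."""
    msb = (src >> (n - 1)) & 1
    lsb = src & 1
    # Clear the top and bottom bit positions, keeping the middle bits.
    middle = src & ~((1 << (n - 1)) | 1)
    return middle | (lsb << (n - 1)) | msb


def bit_reversal(src: int, n: int) -> int:
    """Reverse the order of all n address bits."""
    result = 0
    for i in range(n):
        # Take bits from least to most significant and push them in from the right.
        result = (result << 1) | ((src >> i) & 1)
    return result
```

With n = 8, node `0b10000001` sends to node `0b00000011` under perfect shuffle, and node `0b00001011` sends to node `0b11010000` under bit reversal.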

Slide 21

Throttling Algorithms

• Base
  – no throttling
• ALO (At Least One)
  – Lopez, Martinez, and Duato, ICPP, August, 1998
  – Throttling based on local estimation of congestion
  – Inject new packet only if
    – “useful” physical channel has all virtual channels free, or
    – at least one virtual channel on every “useful” channel is free
• Tune (this work)
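
The ALO injection test can be sketched as a predicate over the virtual channels of the "useful" physical channels. This is an illustration only: the function name and the nested-list input are assumptions, and the first condition is read here as "at least one useful physical channel has all of its virtual channels free", which the slide wording leaves implicit.

```python
from typing import Sequence


def alo_allows_injection(useful_channels: Sequence[Sequence[bool]]) -> bool:
    """Local ALO check (hypothetical helper).

    useful_channels[c][v] is True when virtual channel v of useful
    physical channel c is free.
    """
    if not useful_channels:
        # No useful channel at all: nothing to inject onto.
        return False
    # Condition 1 (assumed reading): some useful channel is completely free.
    some_channel_completely_free = any(all(vcs) for vcs in useful_channels)
    # Condition 2: every useful channel has at least one free virtual channel.
    every_channel_has_free_vc = all(any(vcs) for vcs in useful_channels)
    return some_channel_completely_free or every_channel_has_free_vc
```

Because the test uses only the state of the local router's output channels, it illustrates the "local estimation" that the slides contrast with Tune's global estimate.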

Slide 22

Tuning Parameters

• Total number of network buffers = 256 * 3 * 4 = 3072
• Gather time (g) = n * k * h / 2 = 32 cycles
• Sideband communication latency (h) = 2 cycles
• Sideband communication bandwidth = 25 bits (!)
  – # network buffers = 3072 = 12 bits
  – max throughput = g * 256 * 1 = 8192 = 13 bits
• Tuning frequency = once every 96 cycles
• Initial threshold value = 1% ~= 30 buffers
• Threshold increment = 1%, decrement = 4%
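
The arithmetic on this slide can be reproduced directly. The constant names are assumptions for illustration: the factor 4 is read as four physical channels per node in the 2D torus, and the factor 1 as one flit per node per cycle. Bit widths are computed as ceil(log2(N)), which is the number of bits that covers the values 0 to N - 1.

```python
import math

NODES = 16 * 16                   # 16x16 torus
VCS_PER_PHYSICAL_CHANNEL = 3      # from the Methodology slide
PHYSICAL_CHANNELS_PER_NODE = 4    # assumed meaning of the factor 4
GATHER_TIME = 32                  # g, in cycles
FLITS_PER_NODE_PER_CYCLE = 1      # assumed meaning of the factor 1

total_buffers = NODES * VCS_PER_PHYSICAL_CHANNEL * PHYSICAL_CHANNELS_PER_NODE
max_throughput = GATHER_TIME * NODES * FLITS_PER_NODE_PER_CYCLE

# Bits needed on the sideband for each aggregated quantity.
buffer_bits = math.ceil(math.log2(total_buffers))
throughput_bits = math.ceil(math.log2(max_throughput))
sideband_bits = buffer_bits + throughput_bits

# Initial threshold of 1%, rounded down to whole buffers.
initial_threshold_buffers = total_buffers // 100

print(total_buffers, max_throughput, sideband_bits, initial_threshold_buffers)
```

Under these assumptions the script reproduces the slide's figures: 3072 buffers (12 bits), a maximum of 8192 flits per gather period (13 bits), a 25-bit sideband, and an initial threshold of 30 buffers.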

Slide 23

Random Pattern

[Figure: two plots against packet regeneration interval (cycles, 0 to 100) for Tune, Base, and ALO. Left: normalized accepted throughput (flits/node/cycle, 0 to 0.35). Right: average latency (cycles, 10 to 1000).]

Beyond saturation point, Tune outperforms ALO and Base

Slide 24

Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles)

[Figure: two plots against packet regeneration interval (cycles, 0 to 100) for g=32 (h=2), g=48 (h=3), and g=96 (h=6). Left: normalized accepted throughput (flits/node/cycle, 0 to 0.35). Right: average latency (cycles, 10 to 1000).]

Tune fairly insensitive to delayed collection of information

Slide 25

Static Threshold Choice

[Figure: accepted throughput (flits/node/cycle, 0 to 0.4) versus packet regeneration interval (cycles, 0 to 100) for Static Threshold = 250, Static Threshold = 50, Tune, and Base; panels for Uniform Random and Butterfly traffic.]

• Optimal thresholds different for random and butterfly
• Tune performs close to the best static threshold

Slide 26

With Bursty Load Tune outperforms ALO

[Figure: normalized throughput (flits/node/cycle, 0 to 0.4) versus time (cycles, 10000 to 60000) for Tune, Base, and ALO; the bursts are labeled random, bit reversal, shuffle, and butterfly.]

Slide 27

Avoiding Local Maxima

• What if steady decrease in bandwidth < 25%?
  – potential to “creep” into saturation
• Solution: remember global maxima
  – max = maximum throughput seen in any tuning period
  – $N_{max}$ = number of full buffers at max
  – $T_{max}$ = threshold at max
  – Reset threshold to min($T_{max}$, $N_{max}$) if throughput < 50% max
• If “r” consecutive resets don’t fix the problem, then restart
  – hypothesis: communication pattern has changed
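
The reset rule on this slide can be sketched as a small state tracker. This is an illustration under several assumptions, not the authors' implementation: the class and field names are hypothetical, the threshold and the buffer count are both expressed in buffers, the value of "r" is a placeholder, and a restart is modeled as forgetting the remembered maximum and returning to the initial threshold of 30 buffers from the Tuning Parameters slide.

```python
from dataclasses import dataclass


@dataclass
class ResetTracker:
    initial_threshold: int = 30   # 1% of 3072 buffers
    restart_after: int = 3        # "r"; placeholder value, not from the slides
    max_throughput: float = 0.0   # max
    full_buffers_at_max: int = 0  # N_max
    threshold_at_max: int = 0     # T_max
    consecutive_resets: int = 0

    def update(self, throughput: float, full_buffers: int, threshold: int) -> int:
        """Return the threshold to use for the next tuning period."""
        if throughput > self.max_throughput:
            # New best period: remember where it happened.
            self.max_throughput = throughput
            self.full_buffers_at_max = full_buffers
            self.threshold_at_max = threshold
        if throughput >= 0.5 * self.max_throughput:
            # Healthy period: keep the hill-climbing threshold.
            self.consecutive_resets = 0
            return threshold
        if self.consecutive_resets >= self.restart_after:
            # r resets did not help: assume the pattern changed and restart.
            self.max_throughput = 0.0
            self.full_buffers_at_max = 0
            self.threshold_at_max = 0
            self.consecutive_resets = 0
            return self.initial_threshold
        # Throughput fell below 50% of max: reset to min(T_max, N_max).
        self.consecutive_resets += 1
        return min(self.threshold_at_max, self.full_buffers_at_max)
```

In use, the value returned by `update` would replace the hill-climbing threshold before the next tuning period, so the two mechanisms work on the same threshold variable.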

Slide 28

Threshold Reset Necessary

Packet Regeneration Interval = 10 cycles

[Figure: two plots against time (cycles, 10000 to 60000) for Hill Climbing and Hill Climbing + Local Maxima. Top: accepted throughput (flits/node/cycle, 0 to 0.4). Bottom: threshold (0 to 1600).]

Slide 29

Summary

• Network Saturation is a severe problem
  – advent of powerful processors, SMT, and CMPs
  – “unstable” behavior makes designers nervous
• We propose throttling based on global knowledge
  – aggregate global knowledge (% full buffers, throughput)
  – throttle when % full buffers exceed threshold
  – tune threshold for communication patterns & offered load