Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect

Jungju Oh, Alenka Zajic, Milos Prvulovic


Contents
• Introduction
• Hybrid Network
  – Low-Latency Transmission Line Ring
  – Traffic Steering
• Evaluation
• Result
• Conclusion


Introduction
• On-chip communication latency is increasing
• Broadcast interconnect
  – Insufficient bandwidth and excessive delay for many-core
  – Growing core counts → contention
  – Growing core counts → longer wire → larger wire capacitance → longer delay
  – Unfavorable wire delay with technology scaling
• Packet-switched on-chip network (OCN)
  + Short links → fast communication between adjacent nodes
  + Scalable aggregated bandwidth
  – Packets travel many links and pipelined routers
  – Growing core counts → increasing hop counts/latency for far-apart cores

[Figure: delay for 1 mm of wire (ns) vs. technology (nm). Source: ITRS 2012.]


Motivation
• Switched on-chip network
  – Good latency for local traffic, but not for long-distance traffic
  – Much more local than long-distance traffic
• Broadcast interconnect
  – Avoids routing latency even for long-distance traffic
  – Cannot handle much traffic

[Figure: latency and share of traffic (0% to 16%) vs. distance (hops).]


Hybrid Network
• Exploit the strengths
  – Broadcast on Transmission Line (TL): low latency
  – Switched on-chip network: throughput
• … and alleviate the weaknesses
  – Limited TL throughput: use the TL only for critical and/or long-distance traffic
  – High switching overhead for long-distance traffic: use the TL
• Two critical components to this work
  – Transmission Line Broadcast Interconnect: the why and the how
  – Traffic Steering: which messages use which interconnect


Transmission Line
• Why Transmission Line?
  – Extremely fast propagation
    • Uses an electromagnetic wave for signal propagation
      – 0.0075 ns/mm (unrepeated wire: 0.54 ns/mm)
  – Not affected by technology scaling
  – But expensive in terms of metal area (20 µm-wide vs. 0.135 µm global wire)
    • Limited throughput
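
To put the per-millimeter figures in perspective, the following minimal sketch compares them and converts the TL figure into reach per clock cycle. The 1 GHz clock is the one quoted in the evaluation setup; the 40 mm length in the last line is a hypothetical value chosen only for illustration.

```python
# Per-millimeter propagation delays quoted on the slide.
TL_NS_PER_MM = 0.0075    # transmission line
WIRE_NS_PER_MM = 0.54    # unrepeated conventional wire
CLOCK_GHZ = 1.0          # core clock from the evaluation setup


def tl_delay_ns(length_mm: float) -> float:
    """Propagation delay over a TL of the given length (constant wave velocity)."""
    return length_mm * TL_NS_PER_MM


def tl_reach_mm_per_cycle(clock_ghz: float = CLOCK_GHZ) -> float:
    """Distance a signal covers on the TL within one clock cycle."""
    cycle_ns = 1.0 / clock_ghz
    return cycle_ns / TL_NS_PER_MM


# How much faster the TL is than the unrepeated wire, per millimeter.
per_mm_ratio = WIRE_NS_PER_MM / TL_NS_PER_MM

print(f"per-mm advantage: {per_mm_ratio:.0f}x")
print(f"reach per cycle:  {tl_reach_mm_per_cycle():.1f} mm")
print(f"40 mm (hypothetical) TL delay: {tl_delay_ns(40.0):.2f} ns")
```

Only the TL delay is scaled with length here, because a wave travels at constant velocity; the delay of an unrepeated RC wire does not grow linearly, so the sketch compares the two only at the quoted per-millimeter point.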

[Figure: cross-section comparing a transmission line (TL, with ground) and a traditional global wire; dimension labels: 4.193 µm, 4.571 µm, 8.457 µm, 4.1 µm, 16 µm vs. 0.135 µm.]


Transmission Line Ring
• Transmission Line
  – Extremely fast propagation
  – But expensive in terms of metal area
• Why Ring?
  – Minimizes overall TL cost
  – Allows fast arbitration (token passing)


Unidirectional Transmission Line Ring
• Two major problems with TL caused by many connections in many-core
  – Attenuation of signal (power split at connections)
  – Signal reflections/reverberations (discontinuity at connections)
  – Signal needs to stay stronger than sum of noise and reverberations!
• Unidirectional Transmission Line (UTL) ring makes it easy to design
  – Chained directional couplers in a ring shape
  – Control of attenuation
  – Almost no reflected signal
• Directional Coupler
  – Two TL lines running in parallel


Unidirectional Transmission Line Ring
• Directional Coupler
  – Two TL lines running in parallel
  – Signal into one end ①
    • Most comes out on other end ②
    • But some is transferred (EM-coupled) to same direction on other line ③
  – Directivity: (almost) no signal on ④
  – Chain couplers using one line, use the other to connect transmitters/receivers

[Figure: directional coupler on the transmission line with ports ① to ④ (④ marked ×); Core 1 (Rx1, Tx1) and Core 2 (Rx2, Tx2) attach through the coupled line.]


Using the UTL Ring
• Simple receiver/transmitter
  – Simple modulation: on-off keying
  – 1 bit = one or more consecutive pulses
• How fast can we transfer?
  – Depends on available spectrum of the transmission medium
  – UTL coupler: 20–60 GHz
  – 40 GHz clock, 2 pulses/bit → 20 Gbps
• Transmitter
  – PLL (pulses)
  – Pass-gate (on/off pulses)
  – Amplifier (impedance matching)
• Receiver
  – Pulse detector
  – Shift register (collect high rate bits)
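
The data-rate arithmetic and the on-off keying scheme can be sketched as follows. This is an illustrative model, not the circuit from the talk: the function names are hypothetical, and the decoder's rule (a bit is 1 if any pulse in its group was detected) is an assumption.

```python
def data_rate_gbps(clock_ghz: float, pulses_per_bit: int) -> float:
    """Bits per nanosecond when each bit occupies pulses_per_bit clock periods."""
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 bit turns the pulse train on for its slots, a 0 bit leaves it off."""
    slots = []
    for bit in bits:
        slots.extend([bit] * pulses_per_bit)
    return slots


def ook_decode(slots, pulses_per_bit=2):
    """Recover bits from pulse slots; assumes a bit is 1 if any pulse in its group is seen."""
    bits = []
    for start in range(0, len(slots), pulses_per_bit):
        group = slots[start:start + pulses_per_bit]
        bits.append(1 if any(group) else 0)
    return bits


rate = data_rate_gbps(clock_ghz=40, pulses_per_bit=2)   # 40 GHz, 2 pulses/bit
pulses = ook_encode([1, 0, 1])
recovered = ook_decode(pulses)
print(rate, pulses, recovered)
```

Using more than one pulse per bit trades throughput for detection margin, which is why the slide's 40 GHz clock yields 20 Gbps rather than 40 Gbps.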

[Figure: block diagrams of the transmitter (PLL, Amp, Data) and the receiver (Detector, Shift register, Data).]


Traffic Steering
• Which packet should use which network?
• Static steering
  – E.g. >8 hops go to TL, rest goes on mesh
  – Lacks adaptivity
    • When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring
    • When traffic is high, the ring can become saturated

[Figure: latency vs. distance for the OCN and the TL.]
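
A static steering rule of this kind could look like the sketch below. It assumes row-major tile numbering on the 8×8 mesh and minimal (Manhattan-distance) routing; both are assumptions made for illustration, as are the function names.

```python
MESH_WIDTH = 8  # 8x8 mesh, as in the evaluation setup


def mesh_hops(src: int, dst: int, width: int = MESH_WIDTH) -> int:
    """Hop count between two tiles, assuming row-major tile IDs and minimal routing."""
    src_x, src_y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    return abs(src_x - dst_x) + abs(src_y - dst_y)


def steer_static(src: int, dst: int, hop_threshold: int = 8) -> str:
    """Fixed rule: packets travelling more than hop_threshold hops use the TL ring."""
    return "TL" if mesh_hops(src, dst) > hop_threshold else "mesh"


print(mesh_hops(0, 63), steer_static(0, 63))   # opposite corners of the mesh
print(mesh_hops(0, 36), steer_static(0, 36))   # exactly at the threshold
```

Because the cutoff is a constant, the rule cannot react to load: an 8-hop packet stays on the mesh even when the ring is idle, and 9-hop packets keep going to the ring even when it is saturated.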


Adaptive Steering
• Ring-Affinity Score
  – More hops → more benefit from using the ring
  – Non-critical packet → no benefit
  – Ring Affinity Score = latency difference plus criticality adjustment
• Threshold
  – Score above threshold → use ring
  – Adjust threshold to prevent ring bandwidth saturation
    • Too much traffic on the ring → queuing delays → all benefit disappears


Ring-Affinity Score
• Score = latency benefit + criticality adjustment
• Criticality adjustment
  – Constant penalty to non-critical coherence messages for simplification
• Latency benefit = (latency estimate for mesh) − (latency estimate for UTL ring)
• How to get the latency estimate for the mesh?
  – Depends on packet’s hop count, mesh network congestion
    • Tried using just hop count times router latency, not good enough!
  – Small cache in each node, stores recent latencies for given hop count
    • E.g. 8x8 mesh → 15 hop counts → 15 sets in the latency cache
    • Each set keeps the most recently observed latencies
    • Predictor chooses between using just the most recent latency, the average of the latest latencies, or the average of all latencies in the set
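
The latency cache and the score can be sketched as below. This is an assumption-laden illustration rather than the authors' design: the set size, the "recent" window, the penalty value, and all names are hypothetical, and the caller picks the estimator via `mode` because the slide does not say how the predictor chooses.

```python
from collections import deque
from statistics import mean


class MeshLatencyCache:
    """Per-hop-count history of observed mesh latencies, in cycles."""

    def __init__(self, entries_per_set: int = 4, recent_window: int = 2):
        # Both sizes are hypothetical; the slide does not state them.
        self.entries_per_set = entries_per_set
        self.recent_window = recent_window
        self.sets = {}  # hop count -> deque of latencies, oldest first

    def record(self, hops: int, latency: float) -> None:
        """Store an observed latency; the oldest entry of a full set is dropped."""
        history = self.sets.setdefault(hops, deque(maxlen=self.entries_per_set))
        history.append(latency)

    def estimate(self, hops: int, mode: str = "last"):
        """Estimate from the most recent value, the latest few, or the whole set."""
        history = self.sets.get(hops)
        if not history:
            return None  # nothing observed yet for this hop count
        if mode == "last":
            return history[-1]
        if mode == "recent":
            return mean(list(history)[-self.recent_window:])
        if mode == "all":
            return mean(history)
        raise ValueError(f"unknown mode: {mode}")


def ring_affinity_score(mesh_latency, ring_latency, critical, noncritical_penalty=10.0):
    """Latency benefit of the ring plus a constant penalty for non-critical packets."""
    benefit = mesh_latency - ring_latency
    adjustment = 0.0 if critical else -noncritical_penalty
    return benefit + adjustment


cache = MeshLatencyCache(entries_per_set=3, recent_window=2)
for observed in (10, 20, 30, 40):
    cache.record(5, observed)          # four observations for 5-hop packets
print(cache.estimate(5, "last"), cache.estimate(5, "recent"), cache.estimate(5, "all"))
print(ring_affinity_score(mesh_latency=40, ring_latency=15, critical=False))
```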


Ring Affinity Scoring
• Estimating the latency for the UTL ring
  – How long to transmit? Easy.
  – How long to get the token?
    • We see everything on the ring!
      – Can remember who sent the last few packets, and when
      – We know how far away the token is (last sender)
      – We can estimate how “fast” it “moves”

    • Example: 7 nodes in 10 cycles (0.7 nodes/cycle)
      – If the token is 30 nodes away, the estimated wait is about 43 cycles (30 / 0.7)

• Detailed equations and explanations are in the paper
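
The token-wait estimate from the example can be sketched as follows. The detailed equations are in the paper; this is only a simplified reading of the slide, assuming nodes are numbered in ring order, the token sits at the last sender, and no time has passed since that packet was sent. The function names and the 64-node ring size are illustrative.

```python
def ring_distance(src: int, dst: int, ring_size: int) -> int:
    """Nodes from src to dst along the single direction of the ring."""
    return (dst - src) % ring_size


def estimate_token_wait(prev_sender, prev_cycle, last_sender, last_cycle, me, ring_size):
    """Estimated cycles until the token reaches node `me`, or None if unknown."""
    d_k = ring_distance(prev_sender, last_sender, ring_size)  # nodes the token moved
    t_k = last_cycle - prev_cycle                             # cycles it took
    if d_k == 0 or t_k <= 0:
        return None  # not enough information to infer a token speed
    speed = d_k / t_k                                         # nodes per cycle
    distance_to_me = ring_distance(last_sender, me, ring_size)
    return distance_to_me / speed


# Slide example: core 3 sent at cycle 10, core 10 sent at cycle 20,
# so d_k = 7 and t_k = 10 (0.7 nodes/cycle). Node 40 is 30 nodes past core 10.
wait = estimate_token_wait(prev_sender=3, prev_cycle=10,
                           last_sender=10, last_cycle=20,
                           me=40, ring_size=64)
print(f"{wait:.1f} cycles")
```

The wait is the distance divided by the token speed (30 / 0.7, about 43 cycles); multiplying the two would not yield a quantity in cycles.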

[Figure: ring example. Core 3 sent a packet on the ring at cycle 10; Core 10 sent a packet on the ring at cycle 20; d_k = 7, t_k = 10.]


Threshold and Re-steering
• Threshold adjusted to manage UTL ring utilization
  – Low enough to avoid excessive queuing
  – But high enough not to waste the ring throughput
  – Target utilizations around 75% tend to work well
• Threshold Management
  – Packet steered to ring when its score exceeds the threshold
  – Increase the threshold when ring utilization is higher than desired
  – Decrease the threshold if ring utilization is too low
• Re-Steering
  – Sudden burst of high-scoring packets…
    • Threshold adaptation takes a while
    • Meanwhile, ring packets have very long latencies
  – If ring-steered packet sits in queue too long, re-steer to the mesh
    • How long is too long?
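
The threshold feedback loop and the re-steering check can be sketched as below. Only the 75% target comes from the slide; the tolerance band, the step size, the queue-time limit, and all names are hypothetical placeholders, and the slide leaves "how long is too long" open.

```python
class RingThreshold:
    """Keeps ring utilization near a target by moving the steering threshold."""

    def __init__(self, threshold=0.0, target=0.75, band=0.05, step=1.0):
        self.threshold = threshold
        self.target = target    # desired ring utilization (around 75% per the slide)
        self.band = band        # hypothetical dead zone around the target
        self.step = step        # hypothetical adjustment per update

    def update(self, utilization: float) -> float:
        """Raise the threshold when the ring is too busy, lower it when underused."""
        if utilization > self.target + self.band:
            self.threshold += self.step
        elif utilization < self.target - self.band:
            self.threshold -= self.step
        return self.threshold

    def use_ring(self, score: float) -> bool:
        """A packet is steered to the ring when its score exceeds the threshold."""
        return score > self.threshold


def should_resteer(cycles_in_queue: int, max_wait_cycles: int = 50) -> bool:
    """Send a ring-steered packet back to the mesh once it has waited too long."""
    return cycles_in_queue > max_wait_cycles


ctrl = RingThreshold(threshold=10.0)
after_busy = ctrl.update(0.90)   # ring too busy: threshold goes up
after_ok = ctrl.update(0.75)     # on target: threshold unchanged
after_idle = ctrl.update(0.50)   # ring underused: threshold goes down
print(after_busy, after_ok, after_idle, ctrl.use_ring(12.0), should_resteer(60))
```

Re-steering covers the interval in which the threshold has not yet caught up with a burst, so it acts on individual queued packets rather than on the threshold itself.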


Evaluation
• Simulated using SESC
  – 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2
  – 8×8 mesh (switched NoC) with 128-bit link width, 8 VC (24 buffers)
• Applications from PARSEC 3, SPLASH-2 benchmark suites
  – Half of the applications show <20% improvement with ideal interconnect
  – Focus analysis on on-chip latency sensitive applications

[Figure: normalized execution time (0.0 to 1.0) per application: raytrace, ocean-nc, lu-nc, ocean, streamcls., x264, radiosity, barnes, lu-cn, blacksch., water-sp, cholesky, fft, canneal, volrend, bodytrack, water-nsq, ferret, fmm, radix.]


Speedup

[Figure: speedup (0.9 to 1.4) per application: barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., gmean. Labeled gmean: 1.14×.]


Speedup

[Figure: speedup (0.9 to 1.4) per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., gmean) for Mesh+TL, Cmesh+TL, Mesh, and Cmesh.]

• 4-concentrated mesh + UTL Ring
  – 8.7% improvement: 1.13× → 1.23×


Speedup

[Figure: speedup (0.9 to 1.4) per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., gmean) for Mesh+TL, Cmesh+TL, Flat+TL, Mesh, Cmesh, and Flat.]

• 4-concentrated mesh + UTL Ring
  – 8.7% improvement: 1.13× → 1.23×
• Flattened Butterfly + UTL Ring
  – 5.7% improvement: 1.10× → 1.16×


Summary
• Increasing core counts worsens on-chip latency
• Unidirectional Transmission Line Ring
  – Low-latency
  – But limited throughput
• Use UTL Ring with switched interconnect synergistically
  – UTL Ring for low latency
  – Switched interconnect for throughput
• Adaptive traffic steering enables judicious use of the ring
  – Proposed traffic steering provides 14% performance improvement


Thank you!


Result: Latency Reduction of UTL Ring
• UTL Ring latency is 55% lower than the mesh
  – Lower latency than advanced interconnects
  – >44% latency reduction over concentrated mesh and flattened butterfly
  – But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits)

[Figure: normalized packet latency (0.0 to 1.0) per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., Avg.) for Cmesh, Flat, and TL. Labeled values: 44.3%, 43.9%.]


Result: Speedup vs. Mesh Alone
• Performs slightly better than advanced on-chip network
  – 1.14 (Mesh + UTL ring)
  – vs. 1.13 (concentrated mesh) and 1.10 (flattened butterfly)

[Figure: speedup (1 to 1.3) per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., gmean) for Cmesh, Flat, and TL. Labeled gmean values: 1.14×, 1.10×, 1.13×.]


Adaptive vs Non-Adaptive Steering
• Non-adaptive random steering
  – 0.63× slowdown on application (ocean-nc) with high on-chip traffic
  – 1.02× speedup if 30% of packets use UTL Ring randomly (RND30)
  – 0.96× slowdown if 50% (RND50)
• Adaptive traffic steering
  – 1.14× speedup (up to 1.20× with 64 Gbps configuration)
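
For reference, the random baseline (RND30, RND50) amounts to a coin flip per packet. The sketch below is an illustration with hypothetical names and a fixed seed for repeatability; it ignores hop count, criticality, and ring load, which is exactly what the adaptive scheme takes into account.

```python
import random


def steer_random(rng: random.Random, ring_fraction: float) -> str:
    """Send a packet to the ring with probability ring_fraction, regardless of its properties."""
    return "TL" if rng.random() < ring_fraction else "mesh"


rng = random.Random(0)                      # fixed seed so the run is repeatable
choices = [steer_random(rng, 0.30) for _ in range(10_000)]   # RND30-style policy
ring_share = choices.count("TL") / len(choices)
print(f"share of packets on the ring: {ring_share:.3f}")
```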

[Figure: speedup (0.80 to 1.30) per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls., gmean) for RND50-16G, RND30-16G, TS-16G, TS-32G, and TS-64G; one region is annotated "slowdown".]