GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip Mukund Ramakrishna, Paul V. Gratz & Alex Sprintson Department of Electrical and Computer Engineering Texas A&M University


Page 1: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Mukund Ramakrishna, Paul V. Gratz & Alex Sprintson

Department of Electrical and Computer Engineering

Texas A&M University

Page 2: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Networks-on-Chip

• Moore’s Law is putting more and more transistors on the chip.
• NoCs scale better than traditional interconnects.
• High interconnect latencies translate into idle processor core cycles and wasted power.

Figures: Tilera Tile64; Intel Single-Chip Cloud Computer (ISSCC 2010)

Page 3: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Routing in NoCs

• Real workloads are unbalanced in nature
• Oblivious routing (DOR)
  – Tends to exacerbate congestion
• Adaptive routing:
  – Tries to avoid congested spots
  – Classified on the basis of awareness:
    • Local adaptive
    • Regionally aware
    • Globally aware

Figure: fft benchmark from SPLASH-2 under DOR. Darker arrows = higher congestion.

Page 4: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Outline

• Introduction
• Motivation for Global Awareness
• Related Work
• Global Congestion Awareness (GCA):
  – Route Computation
  – Information Propagation
• Implementation
• Evaluation
• Conclusion and Future Work

Page 5: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Local Congestion

• Local adaptive
  – Measure local congestion metric (free VCs, free buffers)
  – Greedy local decisions due to poor visibility

Figure: mesh with source S and destination D; regions of low, moderate, and high congestion; paths shown: Optimal, Local adaptive.

Page 6: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Regional Awareness

• Regionally aware (Gratz et al. HPCA ’08, Ma et al. ISCA ’11)
  – Aggregated congestion of all nodes in a dimension
  – Noisy information degrades performance

Figure: mesh with source S and destination D; regions of low, moderate, and high congestion; paths shown: Optimal, RCA-1D.

Page 7: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Ideally …

• On a per-destination basis:
  – Evaluate end-to-end delay along all minimal paths to destination
  – Pick path with least delay

Figure: mesh with source S and destination D; regions of low, moderate, and high congestion; path shown: Optimal.
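The ideal policy above can be sketched as a brute-force search. This is only an illustration, not the mechanism proposed in the talk: the function names, the (x, y) node coordinates, and the `delay` dictionary keyed by directed link are all assumptions made for the example.

```python
def minimal_paths(src, dst):
    """Yield every minimal (shortest-hop) path from src to dst in a 2D mesh.

    Nodes are hypothetical (x, y) tuples; a path is a list of nodes.
    """
    if src == dst:
        yield [src]
        return
    x, y = src
    dx, dy = dst
    if x != dx:  # take one productive hop in the X dimension
        nxt = (x + (1 if dx > x else -1), y)
        for rest in minimal_paths(nxt, dst):
            yield [src] + rest
    if y != dy:  # take one productive hop in the Y dimension
        nxt = (x, y + (1 if dy > y else -1))
        for rest in minimal_paths(nxt, dst):
            yield [src] + rest


def best_path(src, dst, delay):
    """Return the minimal path with the least summed link delay.

    delay maps a directed link (u, v) to its delay; unknown links count as 0.
    """
    def cost(path):
        return sum(delay.get((u, v), 0) for u, v in zip(path, path[1:]))
    return min(minimal_paths(src, dst), key=cost)
```

The number of minimal paths grows combinatorially with distance, which is why exhaustive enumeration like this is impractical inside a router.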

Page 8: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Global Awareness

• Earlier schemes utilize a separate congestion monitoring network (Manevich et al. DSD ’11, Ramanujam et al. ANCS ’10)
• Increased network complexity
• Slow route calculation mechanism in DAR
• Challenges:
  – Low-overhead dissemination technique
  – Limited resources for storage and computation

Page 9: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Outline

• Introduction
• Motivation for Global Awareness
• Global Congestion Awareness (GCA):
  – Route Computation
  – Information Propagation
• Implementation
• Evaluation
• Conclusion and Future Work

Page 10: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

GCA: Bird’s eye view

1. Congestion information is conveyed via piggybacking onto the header flits

2. Every node builds a “map” representing the congestion of the network

3. Optimal path is calculated using a shortest path graph algorithm in each router

Page 11: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Packet-level

• “Piggybacking” of congestion information in header flits (Zhang et al. PrimeAsia ’10, Chen et al. NOCS ’12)
• Back-annotation appends congestion information for link in opposite direction
  – Direction of flit traversal: Black
  – Congestion information appended: Red
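A minimal sketch of back-annotated piggybacking, under one reading of the slide: a flit leaving a node through some port carries that node's congestion metric for its output link in the opposite direction, which is the link a downstream node would use when routing back through it. The `Router` class, its field names, and the dictionary-shaped annotation are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field

OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


@dataclass
class Router:
    node: tuple                                  # hypothetical (x, y) coordinate
    local: dict = field(default_factory=dict)    # output port -> local congestion metric
    cmap: dict = field(default_factory=dict)     # (node, port) -> last value heard

    def send(self, out_port):
        """Annotation attached to a header flit leaving through out_port."""
        back_port = OPPOSITE[out_port]
        return {"link": (self.node, back_port),
                "congestion": self.local.get(back_port, 0)}

    def receive(self, annotation):
        """Record the piggybacked value in this router's congestion map."""
        self.cmap[annotation["link"]] = annotation["congestion"]
```

For example, a flit sent east by node (1, 1) tells its eastern neighbour how congested (1, 1)'s west-bound link is.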

Page 12: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Router Micro-Architecture

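Figure: router block diagram. Labels: input ports N, E, W, S, In, each with virtual channels VC-1 … VC-n; VC Allocator; XB Allocator; X; Routing Unit with Congestion Map, Route Compute Hardware, and Optimal Output Port Table; Header Modification; Local Congestion Values; Traffic Vector; Destination Node; output ports N, E, W, S, Ej.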

Page 14: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Router Micro-Architecture

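Figure: route-compute datapath. Labels: two adders (+), one comparator (<); inputs P1, P2, d1, d2, l1, l2; outputs Pout, Dout; bus width annotation 4.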

Page 15: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Route Computation

• Node marked 0 is source

• Number on link denotes congestion

• Number in node denotes shortest path cost to that node

• Letter denotes optimal output port.

Page 16: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Route Computation (contd.)

• At most two feeder nodes for every node
• Pick the feeder node with the least-cost path
• For nodes a hop away:
  – Cost = congestion of connecting link

Page 17: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Route Computation (contd.)

• Simple “add and compare” step
• Example: top-left node
  – From East port: 3 + 1 = 4
  – From South port: 8 + 0 = 8
• Cost assigned = 4

Page 18: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Route Computation (contd.)

• Every iteration flows outward

• Every quadrant computed in parallel

• Re-evaluate only the downstream sub-graph every update
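The add-and-compare sweep described on the route computation slides can be sketched for a single quadrant as follows. This is a software illustration under stated assumptions, not the hardware design: the source sits at (0, 0), the quadrant extends in +x and +y, `h_cost` and `v_cost` are hypothetical dictionaries holding the congestion of each horizontal and vertical link, and the port labels "X" and "Y" stand in for the two possible first-hop output ports.

```python
def quadrant_routes(width, height, h_cost, v_cost):
    """Least-cost value and first-hop port for every node in one quadrant.

    h_cost[(x, y)] is the congestion of link (x, y) -> (x + 1, y);
    v_cost[(x, y)] is the congestion of link (x, y) -> (x, y + 1).
    """
    cost = {(0, 0): 0}
    port = {}
    for x in range(width):           # sweep outward from the source
        for y in range(height):
            if (x, y) == (0, 0):
                continue
            options = []
            if x > 0:                # feeder node in the X dimension
                f = (x - 1, y)
                options.append((cost[f] + h_cost[f], port.get(f, "X")))
            if y > 0:                # feeder node in the Y dimension
                f = (x, y - 1)
                options.append((cost[f] + v_cost[f], port.get(f, "Y")))
            # add-and-compare: keep the cheaper feeder and inherit its port
            cost[(x, y)], port[(x, y)] = min(options)
    return cost, port
```

With link values chosen to mirror the slide's example sums (3 + 1 = 4 versus 8 + 0 = 8), the far corner of a 2 x 2 quadrant is assigned cost 4 and inherits the first-hop port of the cheaper feeder. Ties are broken by port label here, which is an arbitrary choice for the sketch.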

Page 19: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Caveats

• Infrequent updates of distant links
  – Scale the weights so that distant information is less important
  – For a link i with congestion ci, which is n hops away from the local node, we calculate its weight as
• Staleness of congestion information
  – Entries untouched for n cycles are faded
  – Fade towards a nominal value in steps of x (empirically determined)
  – For large networks, fade one entry every n/L cycles, where L is the number of links
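The fading rule can be sketched as below. This is a simplified software model, assuming per-entry timestamps in place of the flag array used in the hardware; the names `cmap`, `last_update`, and `nominal` are hypothetical, and restarting an entry's timer after each fade step is an assumption of the sketch.

```python
def fade(value, nominal, step):
    """Move value one step towards nominal without overshooting it."""
    if value > nominal:
        return max(nominal, value - step)
    if value < nominal:
        return min(nominal, value + step)
    return value


def fade_stale(cmap, last_update, now, n, nominal, step=1):
    """Fade every congestion-map entry that has been untouched for n cycles."""
    for link, value in cmap.items():
        if now - last_update[link] >= n:
            cmap[link] = fade(value, nominal, step)
            last_update[link] = now  # assumption: restart the staleness timer
```

With n = 100 and x = 1 (the values listed in the simulation parameters), an entry that stays untouched drifts towards the nominal value by one unit every 100 cycles.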

Page 20: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Limited GCA (LGCA)

• Constrain the visibility to a smaller window

• Store information only for nodes k hops or less away

• Reduces storage overhead at the cost of a slight performance penalty relative to GCA
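As a rough illustration of the window, the helper below lists the nodes a router would keep information for when visibility is limited to k hops. The function name, the (x, y) coordinates, and the square `size` x `size` mesh are assumptions for the example; the slides do not specify how the window is represented.

```python
def visible_nodes(local, size, k):
    """Nodes of a size x size mesh within k hops (Manhattan distance) of local."""
    lx, ly = local
    return [(x, y)
            for x in range(size)
            for y in range(size)
            if abs(x - lx) + abs(y - ly) <= k]
```

For a corner node of an 8 x 8 mesh with k = 4, this keeps 15 of the 64 nodes (including the node itself), which gives a feel for the storage saving.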

Page 21: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Implementation

• Storage overhead:
  – Congestion map:
    • 3-bit congestion metric
  – Optimal output port table:
    • 1 bit per node because of minimal routing
    • No storage for nodes in same dimension
  – Flag array for fading mechanism:
    • One bit per entry in the congestion map
• Synthesis results:
  – 16-node computation and storage circuit
  – 2 GHz at 45 nm and less than 1% area overhead
  – No impact on router critical path
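A back-of-the-envelope estimate of the per-router storage listed above, for an N x N mesh. Several things here are assumptions rather than facts from the slides: that the congestion map holds one entry per directed link of the whole mesh, and that "nodes in same dimension" means nodes sharing a row or column with the local node. The function name and returned keys are hypothetical.

```python
def storage_bits(n, metric_bits=3):
    """Estimate per-router storage (in bits) for an n x n mesh."""
    directed_links = 4 * n * (n - 1)       # 2n(n-1) channels, two directions each
    map_bits = directed_links * metric_bits
    port_table_bits = (n - 1) ** 2         # nodes not in the local row or column
    flag_bits = directed_links             # one fade flag per map entry
    return {"map": map_bits,
            "port_table": port_table_bits,
            "flags": flag_bits,
            "total": map_bits + port_table_bits + flag_bits}
```

Under these assumptions a 16-node (4 x 4) mesh needs about 200 bits per router.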

Page 22: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

• Simulation carried out in a cycle-accurate C++ simulator [1]

Simulation parameters: characteristics of simulated design

Characteristic          | Realistic workloads               | Synthetic traffic
Topology                | 7x7 2D mesh                       | 8x8 2D mesh
Router uArch            | Two-stage speculative (both)
Per-hop latency         | 3 cycles: 2 cycles in router, 1 cycle to cross channel (both)
Virtual channels/port   | 8 (both)
Flit buffers/VC         | 5 (both)
Traffic workload        | SPLASH-2 traces                   | Random, Transpose, Bit-complement
Duration of simulation  | 10 million cycles or end of trace | 10000 warm-up cycles followed by 100000 packets
Scaling factor (w)      | 0.25 (both)
Fading (x, n)           | x = 1 unit; n = 100 cycles (both)

[1] S. Prabhu, B. Grot, P. Gratz, and J. Hu, “Ocin_tsim - DVFS Aware Simulator for NoCs,” in Proc. SAW-1, Jan 2010.

Page 23: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 24: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Improvement due to GCA (average):
- DOR: 45%
- Local: 26%
- RCA-1D: 15%
- DAR: 8%

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 25: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Outliers:
• DAR better than GCA on inherently static workloads (fft, radix)
• Statistical traffic distribution enables better performance
• GCA better than DAR on all other workloads

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 26: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

LGCA performance:
• Close to GCA on most workloads
• lu is an exception
• Overall average slightly worse than GCA but still better than other competing algorithms

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 27: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Conclusion

• Proposed a novel adaptive routing mechanism which uses global congestion information to perform per-hop routing in on-chip networks.

• Uses back-annotated piggybacking to propagate congestion information which alleviates the issue of overheads

• Light-weight implementation of the shortest path computation

• GCA improves average packet latency, on average for the SPLASH-2 suite of benchmarks:
  – By 26% against local adaptive
  – By 15% against RCA-1D
  – By 8% against DAR

Page 28: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Thank You

Page 29: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

BACKUP

Page 30: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Minimal Adaptive Routing

• Model
  – Adaptive routing along minimal paths

Figure: mesh with source S and destination D.

Page 31: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Fading

• Entries untouched for n cycles are faded
• Fade towards the gray value in steps of x, an empirically determined parameter
  – Black = extremely congested
  – White = uncongested
  – Gray = middle value
• For large networks, stagger the fading mechanism
  – Fade one entry every n/L cycles, where L is the number of links

Page 32: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Synthetic Traffic

Figures: average packet latency vs. injection rate (flits/node/cycle) in three panels: Uniform Random, Transpose, Bit-complement; series: xydor, local, RCA, LGCA, GCA.

Page 33: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Network Sensitivity Experiments

• Variation of two parameters:
  – Vary VC count
    • No variation in relative performance
  – Vary the mesh size
    • Performs better for larger meshes
• For both experiments, we simulate Transpose traffic

Page 34: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Network Size

Figures: average packet latency vs. injection rate (% of flits/node/cycle) in two panels; series: local, LGCA, GCA.
• 8 x 8 Mesh (annotation: 16 %)
• 16 x 16 Mesh (annotation: 21 %)

Page 35: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Congestion Information Scaling

• Scale the weights so that distant information is less important in making routing decision

• Degree of scaling determined by an empirical constant w (0< w < 1).

• For a link i with congestion ci, which is n hops away from the local node, we calculate its weight as

Page 36: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Number of steps

• For a 2-D mesh, the number of steps required lies in the range:
  – The upper bound is always
  – If N is even, the lower bound is
  – If N is odd, the lower bound is

Page 37: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Empirical parameters

• Scaling factor w:
  – w = 0.25
    • GCA: links beyond 4 hops are assigned a constant scaling factor of 0.25
    • LGCA: links beyond 4 hops are not stored, as k = 4
• Fading mechanism:
  – n = 100 cycles
  – x = 1 unit

Page 38: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Challenges in global awareness

• Dissemination of congestion information
  – Low overhead
  – Account for staleness
• Limited storage in on-chip routers
  – Exponential number of paths to each destination
• Limited hardware resources for computations

Page 39: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Future Work

• Congestion prediction: proactive adaptive routing instead of reactive adaptive routing

• Stability analysis: Does the algorithm thrash between different paths for some traffic patterns?

• Effect of imperfect congestion state representation

Page 40: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Back-annotation

• Without back-annotation: for each outgoing flit, the node appends the congestion metric for the link in the same direction.
• With back-annotation: for each outgoing flit, the node appends the congestion metric for the link in the opposite direction.

Figures: two meshes with source S and destination D; legend: packet traversal direction, congestion information direction.

Page 41: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Multi-region

• Network partitioned into four quadrants

• Each quadrant runs a benchmark as shown

• Isolated traffic regions emulate virtual machine-like scenario

Page 42: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

Multi-region

• Local adaptive is unaffected due to its lack of visibility
• RCA’s performance suffers due to noise through aggregation
• GCA maintains fine-grained information
  • Helps avoid noise and perform better than RCA

Figure: average packet latency (relative to DOR) for lu, waternsq, ocean, raytrace, and geomean; series: local_uniregion, local_multiregion, RCA_uniregion, RCA_multiregion, GCA_uniregion, GCA_multiregion.

Page 43: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Improvement over local adaptive:
• GCA: 26% average, 86% best case
• LGCA: 23% average, 82% best case

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 44: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Improvement over RCA-1D:
• GCA: 15% average, 51% best case
• LGCA: 11% average, 38% best case

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.

Page 45: GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip

SPLASH-2

Improvement over DAR:
• GCA: 8% average, 53% best case
• LGCA: 4% average, 41% best case

Figure: average packet latency relative to DOR for the SPLASH-2 benchmarks (water spatial, fft, lu, water nsquared, radix, raytrace, ocean, barnes) and the overall geomean; series: local, RCA, DAR, GCA, LGCA.