Page 1: Cost Effective Centralized Adaptive Routing for Networks on Chip

May 2, 2011

A Cost Effective Centralized Adaptive Routing for Networks on Chip

Ran Manevich*, Israel Cidon*, Avinoam Kolodny*, Isask’har (Zigi) Walter* and Shmuel Wimer#

*Technion – Israel Institute of Technology, QNoC Research Group

#Bar-Ilan University

Page 2: Cost Effective Centralized Adaptive Routing for Networks on Chip

Networks-on-Chip (NoCs)

Page 3: Cost Effective Centralized Adaptive Routing for Networks on Chip

Global traffic information is essential to make the right decision!

Page 4: Cost Effective Centralized Adaptive Routing for Networks on Chip

Adaptive Routing in NoCs – Local vs. Global Information

[Figure: 2D mesh NoC with links marked as low, medium, or high congestion. The source is at the upper left corner and the destination at the bottom right corner. Callout: I CAN MAKE IT!!!]

A packet routed from the upper left to the bottom right corner using local congestion information.

The same packet routed using global information.

Page 5: Cost Effective Centralized Adaptive Routing for Networks on Chip

Route Selection - ATDOR

ATDOR - Adaptive Toggle Dimension Ordered Routing

Keep it simple! Centralized selection: XY or YX.

Routing tables in sources. One bit per destination.

The option with the less congested bottleneck link is preferred.
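A minimal sketch of this selection, assuming nodes are addressed by (x, y) mesh coordinates and link loads are kept in a dictionary; the names `route_links`, `bottleneck`, and `select_bit` are illustrative and do not come from the slides. It builds the XY and YX routes, finds the most loaded link on each, and returns the single routing-table bit for that destination. Keeping XY on a tie is an assumption.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]    # (x, y) mesh coordinates (assumed addressing)
Link = Tuple[Node, Node]  # directed link between neighbouring nodes


def _walk(start: Node, end: Node, axis: int) -> List[Node]:
    """Nodes visited when moving from start to end along one axis (start excluded)."""
    pos = list(start)
    step = 1 if end[axis] > start[axis] else -1
    visited = []
    while pos[axis] != end[axis]:
        pos[axis] += step
        visited.append((pos[0], pos[1]))
    return visited


def route_links(src: Node, dst: Node, order: str) -> List[Link]:
    """Links of the dimension-ordered route; order is "XY" or "YX"."""
    if order not in ("XY", "YX"):
        raise ValueError("order must be 'XY' or 'YX'")
    first, second = (0, 1) if order == "XY" else (1, 0)
    corner = list(src)
    corner[first] = dst[first]  # turning point of the route
    turn = (corner[0], corner[1])
    hops = [src] + _walk(src, turn, first) + _walk(turn, dst, second)
    return list(zip(hops, hops[1:]))


def bottleneck(load: Dict[Link, float], links: List[Link]) -> float:
    """Load of the most loaded link on a route (unlisted links count as idle)."""
    return max((load.get(link, 0.0) for link in links), default=0.0)


def select_bit(src: Node, dst: Node, load: Dict[Link, float]) -> int:
    """Routing-table bit for one destination: 0 = XY, 1 = YX (ties keep XY)."""
    xy = bottleneck(load, route_links(src, dst, "XY"))
    yx = bottleneck(load, route_links(src, dst, "YX"))
    return 1 if yx < xy else 0


# One congested link on the XY route makes YX the preferred option.
load = {((1, 0), (2, 0)): 0.9}
print(select_bit((0, 0), (2, 1), load))
```

This plain comparison has no switching margin and no lock on repeated changes, so it only illustrates the "less congested bottleneck" idea.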

Page 6: Cost Effective Centralized Adaptive Routing for Networks on Chip

ATDOR Illustration 1

Five identical flows, 100 MB/s each.

Links modeled as M/M/1 queues. Delay of a single link:

$$D_{LINK} = \frac{\text{Traffic}}{\text{Capacity} - \text{Traffic}}$$

Link capacity is 210 MB/s.

Initial routing - XY
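As a rough worked example, the sketch below evaluates the per-link delay expression, read here as Traffic / (Capacity - Traffic), for one and for two of the 100 MB/s flows sharing a 210 MB/s link. The function name is illustrative, and the result is left unitless because the slides do not give the scale factor that maps it to nanoseconds.

```python
def link_delay(traffic_mbps: float, capacity_mbps: float) -> float:
    """M/M/1-style link delay, Traffic / (Capacity - Traffic), in arbitrary units."""
    if not 0 <= traffic_mbps < capacity_mbps:
        raise ValueError("traffic must be non-negative and below link capacity")
    return traffic_mbps / (capacity_mbps - traffic_mbps)


CAPACITY = 210.0  # MB/s, from the slide
FLOW = 100.0      # MB/s per flow, from the slide

one_flow = link_delay(1 * FLOW, CAPACITY)   # 100 / 110
two_flows = link_delay(2 * FLOW, CAPACITY)  # 200 / 10

print(f"one flow on a link:  {one_flow:.3f}")   # 0.909
print(f"two flows on a link: {two_flows:.3f}")  # 20.000
```

Doubling the traffic on a link raises the modeled delay by a factor of 22, which is why moving a single flow off a shared link can change the average delay so sharply.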

Page 7: Cost Effective Centralized Adaptive Routing for Networks on Chip

Centralized Routing – How?

• Option 1 – Continuous calculation of optimal routing for the active sessions:

Achievable load balancing

Speed and computation complexity

System complexity

Page 8: Cost Effective Centralized Adaptive Routing for Networks on Chip

Centralized Routing – How?

• Option 2 – Iterative serial selection between XY and YX, based on traffic load measurements, for all source-destination pairs:

Achievable load balancing

Speed and computation complexity

System complexity

Page 9: Cost Effective Centralized Adaptive Routing for Networks on Chip

ATDOR illustration 1

[Figure: one panel per control step, each showing the re-routed flow and the average delay.]

Step 1: re-routed flow 1->15.

Step 2: re-routed flow 2->8. Average delay: 37 ns.

Step 3: re-routed flow 2->15. Average delay: 22 ns.

Page 10: Cost Effective Centralized Adaptive Routing for Networks on Chip

What did we just see?

For each flow we:

1. Calculated the better route.

2. Updated the routing table of the source.

3. Waited for the update to take effect and measured the global traffic load.

Performing steps 1-3 for each flow is slow and not scalable.

Steps 2 and 3 are unified for all destinations of a single source:

Achievable load balancing

Speed and computation complexity

Scalability
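The unification of steps 2 and 3 per source can be sketched as follows. This is an illustration under assumptions, not the authors' controller: `control_iteration`, `measure_loads`, and `better_route` are hypothetical names, and the decision rule is passed in as a callback. The point it shows is that the number of wait-and-measure rounds grows with the number of sources rather than the number of flows.

```python
from typing import Callable, Dict

Route = str               # "XY" or "YX"
Table = Dict[int, Route]  # destination id -> route, one table per source


def control_iteration(
    tables: Dict[int, Table],
    measure_loads: Callable[[], object],
    better_route: Callable[[int, int, Route, object], Route],
) -> int:
    """Run one pass over all sources; return the number of load measurements taken."""
    measurements = 0
    for src in sorted(tables):
        # Step 3: one global load snapshot per source, taken after the
        # previous source's table update has taken effect.
        loads = measure_loads()
        measurements += 1
        # Step 1 for every destination of this source, all on the same snapshot.
        updated = {
            dst: better_route(src, dst, current, loads)
            for dst, current in tables[src].items()
        }
        # Step 2: a single routing-table update for the whole source.
        tables[src] = updated
    return measurements


# Stand-in decision rule that always toggles, used only to exercise the loop.
def toggle(src: int, dst: int, current: Route, loads: object) -> Route:
    return "YX" if current == "XY" else "XY"


tables = {1: {14: "XY", 15: "XY", 16: "XY"}, 2: {8: "XY", 15: "XY"}}
waits = control_iteration(tables, measure_loads=lambda: None, better_route=toggle)
print(waits)  # 2 measurement rounds for 5 flows
```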

Page 11: Cost Effective Centralized Adaptive Routing for Networks on Chip

Back to illustration 1

[Figure: one panel per control step, each showing the re-routed flows and the average delay.]

Step 1: re-routed flow 1->15.

Step 2: re-routed flows 2->8 and 2->15.

Step 3: re-routed flow 4->15.

Step 4: re-routed flow 1->15.

Step 5: re-routed flows 2->8 and 2->15.

Average delay readings legible in the panels: 22 ns, 22 ns, 22 ns.

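Page 12: Cost Effective Centralized Adaptive Routing for Networks on Chip

Problem #1

Changing the routing may increase congestion and cause fluctuations.

Solution: Change the routing only if the alternative is better by the margin α, 0 < α < 1:

$$\text{if } CurrentRoute = XY:\quad NextRoute = \begin{cases} YX & \text{if } \mathrm{MAX}[Load_{YX}] \le \alpha \cdot \mathrm{MAX}[Load_{XY}] \\ XY & \text{if } \mathrm{MAX}[Load_{YX}] > \alpha \cdot \mathrm{MAX}[Load_{XY}] \end{cases}$$

$$\text{else if } CurrentRoute = YX:\quad NextRoute = \begin{cases} XY & \text{if } \mathrm{MAX}[Load_{XY}] \le \alpha \cdot \mathrm{MAX}[Load_{YX}] \\ YX & \text{if } \mathrm{MAX}[Load_{XY}] > \alpha \cdot \mathrm{MAX}[Load_{YX}] \end{cases}$$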
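A small sketch of the margin rule, reading it as "switch only when the most loaded link of the alternative is at most α times the most loaded link of the current route". The function name and the example loads are illustrative; the "at most" comparison is taken as the complement of the ">" case shown on the slide.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float, alpha: float) -> str:
    """Return the next route ("XY" or "YX") under the margin alpha, 0 < alpha < 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if current == "XY":
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current route must be 'XY' or 'YX'")


ALPHA = 0.75

# YX is better, but not by the required margin (80 > 0.75 * 100): stay on XY.
print(next_route("XY", max_load_xy=100, max_load_yx=80, alpha=ALPHA))  # XY

# YX is better by the margin (70 <= 75): switch.
print(next_route("XY", max_load_xy=100, max_load_yx=70, alpha=ALPHA))  # YX

# From YX, the same loads do not trigger a switch back (100 > 0.75 * 70).
print(next_route("YX", max_load_xy=100, max_load_yx=70, alpha=ALPHA))  # YX
```

The third call shows the hysteresis: once a flow has moved, the same measurements cannot move it straight back.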

Page 13: Cost Effective Centralized Adaptive Routing for Networks on Chip

ATDOR illustration 2

[Figure: one panel per control step, each showing the re-routed flows and the average delay.]

Step 1: re-routed flows 1->14, 1->15 and 1->16.

Step 2: re-routed flows 1->14, 1->15 and 1->16.

Step 3: re-routed flows 1->14, 1->15 and 1->16.

Page 14: Cost Effective Centralized Adaptive Routing for Networks on Chip

Problem #2

Coupling among flows sharing the same source.

Solution: Re-routing counters $C_{I,J}$ count the routing changes of the flow from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, the routing of $F_{I,J}$ is locked. A possible definition of the limits $L_{I,J}$:

$$L_{I,J} = (I + J) \bmod 3$$
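The counter-and-lock mechanism can be sketched as below, taking the limit as (I + J) mod 3; the plus sign is a reading of the garbled formula, and it agrees with the "R. Changes Left" values listed for flows 1->14, 1->15 and 1->16 in illustration 2. The class and method names are illustrative.

```python
from collections import defaultdict


def reroute_limit(src: int, dst: int) -> int:
    """Limit L[I,J] on routing changes for the flow from src to dst."""
    return (src + dst) % 3


class RerouteLocks:
    """Per-flow re-routing counters C[I,J]; a flow locks once its limit is reached."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)  # (src, dst) -> routing changes so far

    def changes_left(self, src: int, dst: int) -> int:
        return reroute_limit(src, dst) - self.counts[(src, dst)]

    def try_reroute(self, src: int, dst: int) -> bool:
        """Record a routing change if the flow is not locked; return whether it happened."""
        if self.changes_left(src, dst) <= 0:
            return False
        self.counts[(src, dst)] += 1
        return True


locks = RerouteLocks()
for dst in (14, 15, 16):
    print(f"1->{dst}: {locks.changes_left(1, dst)} changes left")
# 1->14: 0 changes left
# 1->15: 1 changes left
# 1->16: 2 changes left

print(locks.try_reroute(1, 15))  # True, the single allowed change
print(locks.try_reroute(1, 15))  # False, flow 1->15 is now locked
```

Because flows from the same source get different limits, they stop toggling at different steps, which breaks the coupling between them.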

Page 15: Cost Effective Centralized Adaptive Routing for Networks on Chip

Back to illustration 2

[Figure: three panels, each listing the routing changes left per flow and the average delay.]

Panel 1, R. changes left: 1->16: 2, 1->15: 1, 1->14: 0.

Panel 2, R. changes left: 1->16: 1, 1->15: 0, 1->14: 0. Average delay: 73 ns.

Panel 3, R. changes left: 1->16: 0, 1->15: 0, 2->14: 0. Average delay: 22 ns.

$$L_{I,J} = (I + J) \bmod 3$$

Page 16: Cost Effective Centralized Adaptive Routing for Networks on Chip

Bring it all together

[Figure: five panels, each listing the routing changes left per flow and the average delay.]

Panel 1, R. changes left: 1->15: 1, 2->8: 1, 2->15: 2, 4->15: 1.

Panel 2, R. changes left: 1->15: 0, 2->8: 1, 2->15: 2, 4->15: 1.

Panel 3, R. changes left: 1->15: 0, 2->8: 0, 2->15: 1, 4->15: 1.

Panel 4, R. changes left: 1->15: 0, 2->8: 0, 2->15: 1, 4->15: 0.

Panel 5, R. changes left: 1->15: 0, 2->8: 0, 2->15: 0, 4->15: 0.

Average delay readings legible in the panels, in order of appearance: 22 ns, 22 ns, 14 ns.

$$L_{I,J} = (I + J) \bmod 3$$

Page 17: Cost Effective Centralized Adaptive Routing for Networks on Chip

Centralized Adaptive Routing for NoCs - Architecture

Traffic load measurements aggregation into Traffic Load Maps.

Routing control.

Local traffic load measurements inside the routers.

Page 18: Cost Effective Centralized Adaptive Routing for Networks on Chip

Load Measurements Aggregation

An illustration of the aggregation of load values in a 4X4 2D mesh.

A congestion value is written to each traffic load map every clock cycle.

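Page 19: Cost Effective Centralized Adaptive Routing for Networks on Chip

ATDOR – Route Selection Circuit

• Combinatorial pipelined implementation.

Result every ATDOR clock cycle.

Maximally loaded links of the two alternatives are compared. Next route:

$$\text{if } CurrentRoute = XY:\quad NextRoute = \begin{cases} YX & \text{if } \mathrm{MAX}[Load_{YX}] \le \alpha \cdot \mathrm{MAX}[Load_{XY}] \\ XY & \text{if } \mathrm{MAX}[Load_{YX}] > \alpha \cdot \mathrm{MAX}[Load_{XY}] \end{cases}$$

$$\text{else if } CurrentRoute = YX:\quad NextRoute = \begin{cases} XY & \text{if } \mathrm{MAX}[Load_{XY}] \le \alpha \cdot \mathrm{MAX}[Load_{YX}] \\ YX & \text{if } \mathrm{MAX}[Load_{XY}] > \alpha \cdot \mathrm{MAX}[Load_{YX}] \end{cases}$$

$$0 < \alpha < 1$$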
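The slides do not say how the multiplication by α is realized in the circuit. One possibility, shown here purely as an assumption, is to restrict α to values of the form 1 - 2^(-k), such as the 3/4 and 15/16 used in the control iteration experiments, so that α times a load becomes a shift and a subtract on integer load values. The function names are illustrative.

```python
def scale_by_alpha(load: int, shift: int) -> int:
    """Approximate alpha * load for alpha = 1 - 2**-shift with a shift and a subtract.

    The subtracted term is floored, so the result rounds up when load is not
    a multiple of 2**shift.
    """
    if load < 0 or shift < 1:
        raise ValueError("load must be non-negative and shift at least 1")
    return load - (load >> shift)


def should_toggle(max_current: int, max_alternative: int, shift: int) -> bool:
    """True when the alternative route's bottleneck load is within the margin."""
    return max_alternative <= scale_by_alpha(max_current, shift)


print(scale_by_alpha(100, 2))  # alpha = 3/4   -> 75
print(scale_by_alpha(160, 4))  # alpha = 15/16 -> 150
print(should_toggle(100, 75, 2))  # True
print(should_toggle(100, 76, 2))  # False
```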

Page 20: Cost Effective Centralized Adaptive Routing for Networks on Chip

Hardware Requirements

The whole mechanism was implemented on a Virtex-5 xc5vlx50t FPGA.

Estimated area for the 45 nm technology node.

Per-router hardware overheads in % for a NoC with typical-size (50 KGates) virtual channel routers.

Page 21: Cost Effective Centralized Adaptive Routing for Networks on Chip

Average Packet Delay – Uniform Traffic

• Average delay vs. average link load normalized to link capacity. 8X8 2D mesh. Uniform traffic pattern.

Page 22: Cost Effective Centralized Adaptive Routing for Networks on Chip

Average Packet Delay – Transpose Traffic

• Average delay vs. average link load normalized to link capacity. 8X8 2D mesh. Transpose traffic pattern.

Page 23: Cost Effective Centralized Adaptive Routing for Networks on Chip

Average Packet Delay – Hotspot Traffic

• Average delay vs. average link load normalized to link capacity. 8X8 2D mesh. Traffic pattern with 4 hotspots.

Page 24: Cost Effective Centralized Adaptive Routing for Networks on Chip

Control Iteration Duration

• Number of re-routed flows vs. time.

• 8X8 2D mesh, ATDOR clock of 100 MHz.

Panels: α = 15/16 and α = 3/4.
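For a sense of scale, the back-of-the-envelope sketch below estimates how long the route selection circuit alone needs to evaluate every source-destination pair once. It assumes one result per ATDOR clock cycle, as stated for the route selection circuit, and that a pass covers each ordered pair exactly once. It ignores pipeline fill and the time spent waiting for table updates and load measurements, so it is a lower bound rather than the measured iteration duration. The function name is illustrative.

```python
def selection_pass_time_us(mesh_side: int, clock_mhz: float) -> float:
    """Time in microseconds to produce one route decision per ordered node pair."""
    nodes = mesh_side * mesh_side
    pairs = nodes * (nodes - 1)  # ordered source-destination pairs
    return pairs / clock_mhz     # one decision per cycle, clock_mhz cycles per microsecond


pairs_8x8 = 64 * 63
print(pairs_8x8)                                   # 4032
print(f"{selection_pass_time_us(8, 100.0):.2f}")   # 40.32
```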

Page 25: Cost Effective Centralized Adaptive Routing for Networks on Chip

CMP DNUCA - Architecture

• 8X8 CMP DNUCA (Dynamic Non Uniform Cache Array) with 8 CPUs and 56 cache banks:

Page 26: Cost Effective Centralized Adaptive Routing for Networks on Chip

CMP DNUCA – Saturation Throughput

• Saturation throughput - Splash 2 and Parsec benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:

Page 27: Cost Effective Centralized Adaptive Routing for Networks on Chip

Conclusions

• Centralized adaptive routing is feasible for NoCs.

ATDOR: Centralized selection between XY and YX for each source-destination pair.

Hardware overhead: <4% of a typical 8X8 NoC.

Average saturation throughput improvement:

                                 Vs. RCA   Vs. O1TURN
Synthetic Patterns               12.1%     19.3%
Splash 2 and Parsec Benchmarks   12.8%     22.8%