Throughput-Effective On-Chip Networks for Manycore Accelerators


Ali Bakhoda, John Kim¹ and Tor M. Aamodt
¹KAIST, Korea


Manycore Accelerators and NoC

• Manycore accelerators
– Prevalent example: high-end GPUs
– Tens of thousands of threads running at the same time
– Bulk Synchronous Parallel programming style
– 3 of the top 5 supercomputers (based on the Nov. 2010 Top500 list)
• Primary goal: higher application-level throughput
• NoC in accelerators
– Needs a different perspective from CPUs
– Not very well studied in this context

The Need for Throughput-Effective NoCs

[Figure: inverse chip area, (Chip Area)^-1 in 1/mm² (y-axis, 0.0012 to 0.0020), versus average throughput in IPC (x-axis, 190 to 310). Contour lines mark constant throughput per area from 0.30 to 0.55 IPC/mm². Arrows indicate the directions of less area and higher throughput. Plotted point: Ideal NoC.]

Throughput-Effective design: Improves application level performance per unit chip area
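The metric behind the contour lines is simply throughput divided by chip area. The sketch below is a minimal illustration of that arithmetic; the function name and the two design points are hypothetical and are not taken from the slides.

```python
def throughput_per_area(ipc, chip_area_mm2):
    """Application-level throughput per unit chip area, in IPC/mm^2."""
    if chip_area_mm2 <= 0:
        raise ValueError("chip area must be positive")
    return ipc / chip_area_mm2


# Hypothetical design points (name, average IPC, chip area in mm^2).
# These numbers are made up for illustration only.
designs = [
    ("A", 230.0, 576.0),  # lower throughput, smaller chip
    ("B", 260.0, 700.0),  # higher throughput, larger chip
]

# The more throughput-effective design has the higher IPC/mm^2,
# which is not necessarily the one with the higher raw IPC.
best = max(designs, key=lambda d: throughput_per_area(d[1], d[2]))
```

With these made-up numbers, design A wins on IPC/mm² even though design B has the higher raw IPC, which is the trade-off the plot is meant to expose.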


Contributions

• Study impact of NoC on application-level performance
– Traditional improvements (router latency reduction): minimal impact on application-level performance
– Increasing channel width: high performance gain + high area cost
– Consider application-level throughput per unit area of NoC
• Throughput correlated with injection rate of few nodes
– Many-to-few-to-many traffic pattern
• Propose Throughput-Effective NoC design
– Checkerboard network
– Multi-port router structure

Outline

• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion

Accelerator Overview

[Figure: accelerator block diagram. Compute cores connect through the network-on-chip to memory controller nodes (MC+L2), each attached to GDDR memory. Other labels in the diagram: Dispatch Queue, Mem Miss, Waiting Queue.]

Baseline Network

• Mesh with MCs at periphery of the chip
– Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip
– Simple and scalable
• Dimension Order Routing
• Virtual Channel Flow Control
• 4-cycle routers

[Figure: compute cores connected through the network-on-chip to MC+L2 nodes, each attached to GDDR memory.]

Finding a Balanced Design

[Figure: application-level throughput and application-level throughput per area (y-axis, 0.50 to 1.00) versus the bandwidth limit of an ideal interconnect, expressed as a fraction of off-chip DRAM bandwidth (x-axis, 0.2 to 1.6). The bisection bandwidth of the baseline mesh is marked on the plot.]

Gap between Balanced Mesh and Ideal NoC

[Figure: inverse chip area, (Chip Area)^-1 in 1/mm² (y-axis, 0.0012 to 0.0020), versus average throughput in IPC (x-axis, 190 to 310), with contour lines of constant throughput per area from 0.30 to 0.55 IPC/mm². Arrows indicate the directions of less area and higher throughput. Plotted points: Ideal NoC, Balanced Mesh.]


NoC properties in ManyCore Accelerators

• Router latency has minimal impact on application-level throughput
– Aggressive 1-cycle routers instead of 4-cycle routers: only 2.3% application-level speedup
• Channel bandwidth is very important
– 27% speedup by doubling BW
– But quadratic area increase

[Figure: bar chart of HM speedup (y-axis, 0% to 20%) for 1-cycle routers and for 2x BW.]

2x Channel Bandwidth

[Figure: inverse chip area, (Chip Area)^-1 in 1/mm² (y-axis, 0.0012 to 0.0020), versus average throughput in IPC (x-axis, 190 to 310), with contour lines of constant throughput per area from 0.30 to 0.55 IPC/mm². Arrows indicate the directions of less area and higher throughput. Plotted points: Ideal NoC, Balanced Mesh, 2x BW.]

Many-to-Few-to-Many Traffic Pattern

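[Figure: many-to-few-to-many traffic. Compute cores C0, C1, C2, ..., Cn send requests through the request network to memory controllers MC0, MC1, ..., MCm, which send replies back to the cores through the reply network. The MC injection bandwidth is labeled.]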


Throughput-Effective Network design

Throughput-Effective
• Reduce Area
– Checkerboard Routing
– Channel Slicing
• Increase Performance
– Checkerboard Placement
– Multi-Port routers at MCs

Checkerboard Routing: Half-Routers

• Half-routers
– No turns allowed at half-routers
– Limited connectivity
– Saves ~50% of router crossbar area
• Full-routers: normal routers with complete connectivity
• Use half-routers at every other node

[Figure: half-router connectivity among the Injection, Ejection, North, South, East and West ports, next to a checkerboard mesh; legend: Half Router, Full Router.]
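One way to picture the "no turns" restriction is to list the input-to-output connections each router type keeps. The sketch below is an illustration under stated assumptions, not the authors' router description: it assumes a half-router keeps only straight-through paths plus injection and ejection, that a full router is a complete 5x5 crossbar, and that the number of connections is a crude proxy for crossbar area. All names are hypothetical.

```python
from itertools import product

DIRECTIONS = ("north", "south", "east", "west")
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def full_router_connections():
    """Every input-to-output pair of a 5x5 crossbar (assumed full connectivity)."""
    inputs = DIRECTIONS + ("injection",)
    outputs = DIRECTIONS + ("ejection",)
    return set(product(inputs, outputs))


def half_router_connections():
    """Straight-through paths plus injection and ejection; no turns."""
    connections = set()
    for side in DIRECTIONS:
        # A packet entering on one side may only continue to the opposite side.
        connections.add((side, OPPOSITE[side]))
        # Locally injected packets may leave in any direction.
        connections.add(("injection", side))
        # Packets arriving from any direction may be ejected locally.
        connections.add((side, "ejection"))
    return connections


half = half_router_connections()
full = full_router_connections()
ratio = len(half) / len(full)  # fraction of connections a half-router keeps
```

Under these assumptions a half-router keeps 12 of 25 connections, which is in line with the roughly 50% crossbar saving quoted above.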

Solution: Routing Restriction (1)

• Routing from a full-router to a half-router that is:
– An odd number of columns away
– Not in the same row
• Solution: Use YX routing instead of XY routing in this case

[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]

Solution: Routing Restriction (2)

• Routing from a half-router to a half-router that is:
– An even number of columns away
– Not in the same row
• Solution: needs two turns
– (1) To an intermediate full-router using YX
– (2) To the destination using XY
• Requires an extra VC to avoid deadlock

[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]

Routing Restriction (3)

• Routing from a full-router to a full-router that is an odd number of columns away
• We avoid this case by using a different MC placement (next 2 slides)

[Figure: checkerboard mesh diagram; legend: Half Router, Full Router.]
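The three routing restrictions can be collected into one small route-selection sketch. It is an illustration only: it assumes a 2D mesh in which nodes with odd x + y are half-routers, it picks the intermediate full-router as the source's vertical neighbour toward the destination (one possible choice, not necessarily the one used in the talk), and it ignores virtual channels. All function names are hypothetical.

```python
def is_half(node):
    """Assumed checkerboard layout: odd x + y marks a half-router."""
    x, y = node
    return (x + y) % 2 == 1


def _line(a, b):
    """Nodes from a to b inclusive; a and b share a row or a column."""
    (ax, ay), (bx, by) = a, b
    if ax == bx:
        step = 1 if by >= ay else -1
        return [(ax, y) for y in range(ay, by + step, step)]
    step = 1 if bx >= ax else -1
    return [(x, ay) for x in range(ax, bx + step, step)]


def _via(src, turn, dst):
    """Path that goes straight to the turn node, then straight to dst."""
    return _line(src, turn) + _line(turn, dst)[1:]


def checkerboard_route(src, dst):
    (sx, sy), (dx, dy) = src, dst
    if sx == dx or sy == dy:
        return _line(src, dst)            # no turn needed
    xy_turn, yx_turn = (dx, sy), (sx, dy)
    if not is_half(xy_turn):
        return _via(src, xy_turn, dst)    # ordinary XY routing
    if not is_half(yx_turn):
        return _via(src, yx_turn, dst)    # restriction (1): fall back to YX
    if is_half(src) and is_half(dst):
        # Restriction (2): two turns through an intermediate full-router.
        mid = (sx, sy + (1 if dy > sy else -1))
        return _line(src, mid) + checkerboard_route(mid, dst)[1:]
    # Restriction (3): full-router to full-router, odd number of columns away.
    raise ValueError("not routable; avoided by the MC placement")


def turns(path):
    """Nodes at which the path changes direction."""
    out = []
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            out.append(b)
    return out
```

For example, `checkerboard_route((0, 0), (3, 2))` falls back to YX and turns at the full-router `(0, 2)`, while `checkerboard_route((1, 0), (3, 2))` takes two turns, at `(1, 1)` and `(3, 1)`. Every route produced this way is minimal and turns only at full-routers.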




Placement of MCs

• Exploit many-to-few: place the MCs at half-router nodes
– Half-routers can communicate with all nodes with no penalty
– Common case for BSP: compute cores communicate with MCs, not each other

[CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al.
[ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.

[Figure: checkerboard mesh showing the MC placement; legend: Half Router, Compute Core Router, Memory Controller Router.]




Multi-port routers at MCs

• Reduce the bottleneck at the few nodes
• Increase terminal BW of the few nodes
– Increase the injection ports of MC routers
– Minimal area overhead (~1% in total NoC area)
– Speedups of up to 25%
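A back-of-the-envelope sketch can show why the few MC nodes become the bottleneck in a many-to-few-to-many pattern and why extra injection ports help. Everything numeric here is hypothetical: the core and MC counts, the read rate, the reply size, and the assumption that each injection port accepts one flit per cycle are illustrative choices, not values from the slides.

```python
def mc_reply_load(n_cores, n_mcs, reads_per_core_per_cycle, reply_flits):
    """Average reply flits per cycle each MC must inject,
    assuming reads are spread evenly over the MCs."""
    total_reply_flits = n_cores * reads_per_core_per_cycle * reply_flits
    return total_reply_flits / n_mcs


def mc_injection_utilization(load_flits_per_cycle, injection_ports=1):
    """Demand divided by terminal bandwidth, assuming each injection
    port accepts one flit per cycle."""
    if injection_ports < 1:
        raise ValueError("need at least one injection port")
    return load_flits_per_cycle / injection_ports


# Hypothetical configuration: many cores funnel replies through few MCs.
load = mc_reply_load(n_cores=28, n_mcs=8,
                     reads_per_core_per_cycle=0.1, reply_flits=4)

one_port = mc_injection_utilization(load, injection_ports=1)   # above 1.0: saturated
two_ports = mc_injection_utilization(load, injection_ports=2)  # below 1.0
```

In this made-up configuration each MC would need to inject about 1.4 flits per cycle, more than a single port can carry, while a second injection port brings the demand back under the terminal bandwidth.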





Methodology

• Compute simulation: GPGPU-Sim (2.2.1b)
• NoC simulation: Booksim-2
– Integrated into GPGPU-Sim as network simulator
• Area estimations: Orion 2.0
• Benchmarks: 24 CUDA applications, including the Rodinia benchmarks

Results

• Combination of
– Checkerboard routing and placement
– Channel slicing
– Multi-port routers at MCs
• Overall HM speedup of 17% across 24 benchmarks over the balanced baseline
• Total NoC area reduction of 43%

[Figure: per-benchmark speedup (y-axis, -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD, and HM. Benchmarks are grouped as Low Speedup / Low Traffic, Low Speedup / High Traffic, and High Speedup / High Traffic.]
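The slides do not spell out how the HM figure is computed. Assuming it denotes the harmonic mean of per-benchmark speedups, the following sketch shows the calculation with Python's standard library; the speedup values are made up for illustration and are not results from the talk.

```python
from statistics import harmonic_mean, mean


def hm_speedup(speedups):
    """Harmonic mean of per-benchmark speedups (1.0 means no change)."""
    return harmonic_mean(speedups)


# Hypothetical per-benchmark speedups over a baseline.
example_speedups = [1.0, 1.5, 3.0]

hm = hm_speedup(example_speedups)  # 3 / (1/1.0 + 1/1.5 + 1/3.0)
am = mean(example_speedups)        # arithmetic mean, for comparison
```

The harmonic mean is never larger than the arithmetic mean, so a single benchmark with a very large speedup pulls the summary up less than it would in a plain average.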

Throughput-Effective NoC

[Figure: inverse chip area, (Chip Area)^-1 in 1/mm² (y-axis, 0.0012 to 0.0020), versus average throughput in IPC (x-axis, 190 to 310), with contour lines of constant throughput per area from 0.30 to 0.55 IPC/mm². Arrows indicate the directions of less area and higher throughput. Plotted points: Ideal NoC, Balanced Mesh, 2x BW, Thr. Eff.]

Summary

• Throughput-Effective design: consider system-level performance impact + area impact of NoC
• Observations
– NoC BW is more important than latency in accelerators
– Many-to-few-to-many traffic pattern
• Throughput-Effective NoC for accelerators
– Checkerboard
– Multi-port MC routers
– Channel slicing

Thank you


Backups…


Channel Slicing – Double networks

• Divide the single network into two physical networks
– Each new network: half the bisection BW of the original network
– Overall bisection BW: constant
• Saves area
– Quadratic dependency of crossbar area on channel BW
• Increases serialization latency
– But compute accelerators are not sensitive to latency
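The area and latency trade-off of channel slicing follows from simple arithmetic. The sketch below assumes that crossbar area scales with the square of (ports x channel width) and that a packet needs ceil(packet bits / channel width) cycles to serialize; the 128-bit width, 5 ports and 512-bit packet are hypothetical values chosen only to make the numbers concrete.

```python
import math


def crossbar_area(channel_width_bits, ports=5):
    """Relative crossbar area, assuming area grows with (ports * width)^2."""
    return (ports * channel_width_bits) ** 2


def serialization_cycles(packet_bits, channel_width_bits):
    """Cycles needed to push one packet through a channel of the given width."""
    return math.ceil(packet_bits / channel_width_bits)


WIDTH = 128        # hypothetical channel width of the single network, in bits
PACKET_BITS = 512  # hypothetical packet size, in bits

single_area = crossbar_area(WIDTH)
sliced_area = 2 * crossbar_area(WIDTH // 2)  # two networks, each half as wide

single_latency = serialization_cycles(PACKET_BITS, WIDTH)
sliced_latency = serialization_cycles(PACKET_BITS, WIDTH // 2)
```

Under this model the two sliced crossbars together take half the area of the single wide crossbar, while serialization latency doubles from 4 to 8 cycles. Only the crossbar term is modelled here, so the result is not directly comparable to a total NoC area figure.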

Results

• Memory controller placement
– HM speedup of 13% over the balanced baseline design

[Figure: per-benchmark speedup (y-axis, -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD, and HM, with an inset mesh diagram; legend: Compute Core Router.]

Results

• Checkerboard routing
– Less than 1% performance loss compared to DOR with same resources
– Reduces total router area by 14.2%

[Figure: relative performance per benchmark (y-axis, 70% to 120%), with an inset mesh diagram; legend: Half Router, Compute Core Router.]

Results

• Channel slicing
– Average change in performance < 1%
– NoC area reduction of 37%

[Figure: per-benchmark speedup (y-axis, -7% to 14%), with an inset mesh diagram; legend: Half Router, Compute Core Router.]

Top 5 systems

TOP 5 Systems - 11/2010
1. Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, Nvidia GPU, FT-1000 8C
2. Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz
3. Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU
4. TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
5. Hopper - Cray XE6 12-core 2.1 GHz

http://www.top500.org/system/10587

Alternative MC placement example


Many-to-Few-to-Many Traffic Pattern

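[Figure: many-to-few-to-many traffic with labeled bandwidths. Compute cores C0, C1, C2, ..., Cn send requests through the request network to memory controllers MC0, MC1, ..., MCm, which send replies back to the cores through the reply network. Labels: core output bandwidth, MC input bandwidth, MC output bandwidth, core input bandwidth.]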
