Throughput-Effective On-Chip Networks for Manycore Accelerators Ali Bakhoda, John Kim¹ and Tor M. Aamodt ¹KAIST, Korea


Page 1: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

Throughput-Effective On-Chip Networks for Manycore Accelerators

Ali Bakhoda, John Kim¹ and Tor M. Aamodt
¹KAIST, Korea

Page 2: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

2

Manycore Accelerators and NoC

Manycore accelerators
• Prevalent example: high-end GPUs
• 10s of thousands of threads running at the same time
• Bulk Synchronous Parallel programming style
• 3 of the top 5 supercomputers (based on the Nov. 2010 Top500 list)

Primary goal: higher application-level throughput

NoC in accelerators
• Needs a different perspective from CPUs
• Not very well studied in this context

Page 3: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

3

The Need for Throughput-Effective NoCs

[Figure: (Chip Area)⁻¹ [1/mm²] (y-axis, 0.0012 to 0.0020) versus Average Throughput [IPC] (x-axis, 190 to 310), with iso-lines of constant throughput per area from 0.30 to 0.55 IPC/mm². The Ideal NoC point is marked; arrows indicate LESS AREA and HIGHER THROUGHPUT.]

Throughput-Effective design: Improves application level performance per unit chip area
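The metric behind the iso-lines on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the three design points are hypothetical and are not measurements from the talk. It shows that two designs with different IPC and area can sit on the same IPC/mm² iso-line, while a design with less area at similar throughput moves to a higher one.

```python
def throughput_per_area(ipc, chip_area_mm2):
    """Application-level throughput per unit chip area, in IPC/mm^2."""
    if chip_area_mm2 <= 0:
        raise ValueError("chip area must be positive")
    return ipc / chip_area_mm2


# Hypothetical design points, chosen only to illustrate the metric.
designs = {
    "A": {"ipc": 230.0, "area_mm2": 575.0},
    "B": {"ipc": 260.0, "area_mm2": 650.0},
    "C": {"ipc": 250.0, "area_mm2": 500.0},
}

scores = {
    name: throughput_per_area(d["ipc"], d["area_mm2"])
    for name, d in designs.items()
}

# A and B lie on the same iso-line; C is the most throughput-effective.
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.2f} IPC/mm^2")
print("most throughput-effective:", best)
```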

Page 4: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

4

Contributions

• Study the impact of the NoC on application-level performance
• Traditional improvements (router latency reduction): minimal impact on application-level performance
• Increasing channel width: high performance gain, but high area cost
  – Consider application-level throughput per unit area of the NoC
• Throughput is correlated with the injection rate of a few nodes
  – Many-to-few-to-many traffic pattern
• Propose a Throughput-Effective NoC design
  – Checkerboard network
  – Multi-port router structure

Page 5: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

5

Outline

• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion

Page 6: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

6

Accelerator Overview

[Figure: accelerator block diagram. Compute cores connect through the Network-On-Chip to MC+L2 nodes, each attached to GDDR memory. Other labels in the figure: DispatchQueue, MemMiss, WaitingQueue.]

Page 7: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

7

Baseline Network
• Mesh with MCs at the periphery of the chip
  – Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip
  – Simple and scalable
• Dimension Order Routing
• Virtual Channel Flow Control
• 4-cycle routers

[Figure: compute cores connected through the Network-On-Chip to MC+L2 nodes, each attached to GDDR memory.]

Page 8: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

8

Finding a Balanced Design

[Figure: Application Level Throughput and Application Level Throughput/Area (y-axis, 0.50 to 1.00) versus Bandwidth Limit of Ideal Interconnect, as a fraction of off-chip DRAM bandwidth (x-axis, 0.2 to 1.6). The bisection bandwidth of the baseline mesh is marked.]

Page 9: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

9

Gap between Balanced Mesh and Ideal NoC

[Figure: (Chip Area)⁻¹ [1/mm²] (y-axis, 0.0012 to 0.0020) versus Average Throughput [IPC] (x-axis, 190 to 310), with iso-lines from 0.30 to 0.55 IPC/mm². Points shown: Balanced Mesh and Ideal NoC; arrows indicate LESS AREA and HIGHER THROUGHPUT.]

Page 10: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

10

Outline

• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion

Page 11: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

11

NoC properties in ManyCore Accelerators

• Router latency has minimal impact on application-level throughput
  – Aggressive 1-cycle routers instead of 4-cycle routers
  – Only 2.3% application-level speedup
• Channel bandwidth is very important
  – 27% speedup by doubling BW
  – But quadratic area increase

[Figure: bar chart of HM Speedup (axis 0% to 20%) for 1-Cycle Routers and 2x BW.]

Page 12: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

12

2x Channel Bandwidth

[Figure: (Chip Area)⁻¹ [1/mm²] (y-axis, 0.0012 to 0.0020) versus Average Throughput [IPC] (x-axis, 190 to 310), with iso-lines from 0.30 to 0.55 IPC/mm². Points shown: Balanced Mesh, 2x BW and Ideal NoC; arrows indicate LESS AREA and HIGHER THROUGHPUT.]

Page 13: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

13

Many-to-Few-to-Many Traffic Pattern

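[Figure: cores C0, C1, C2, ..., Cn send through the request network to memory controllers MC0, MC1, ..., MCm, which send through the reply network back to cores C0, C1, C2, ..., Cn. The MC injection bandwidth is highlighted.]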
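A rough back-of-the-envelope model shows why MC injection bandwidth becomes the bottleneck in this pattern. The sketch below is an illustration only: the function names, the core and MC counts, the request rate and the packet sizes are all assumptions, and it assumes read requests are spread evenly over the MCs.

```python
def core_request_load(reads_per_core_per_cycle, request_bytes):
    """Bytes per cycle one core injects into the request network."""
    return reads_per_core_per_cycle * request_bytes


def mc_reply_load(num_cores, num_mcs, reads_per_core_per_cycle, reply_bytes):
    """Bytes per cycle one MC must inject into the reply network,
    assuming reads are distributed evenly across the MCs."""
    total_reads_per_cycle = num_cores * reads_per_core_per_cycle
    return total_reads_per_cycle / num_mcs * reply_bytes


# Hypothetical configuration (not taken from the slides).
NUM_CORES = 28
NUM_MCS = 8
READS_PER_CORE = 0.0625   # read requests per core per cycle
REQUEST_BYTES = 8         # small read-request packet
REPLY_BYTES = 64          # larger reply packet carrying data

per_core = core_request_load(READS_PER_CORE, REQUEST_BYTES)
per_mc = mc_reply_load(NUM_CORES, NUM_MCS, READS_PER_CORE, REPLY_BYTES)

print(f"each core injects {per_core} B/cycle")
print(f"each MC injects   {per_mc} B/cycle")
print(f"MC-to-core injection ratio: {per_mc / per_core:.0f}x")
```

Under these assumed numbers each MC has to inject far more traffic than any single core, which is the "few" side of the many-to-few-to-many pattern.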

Page 14: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

14

Outline

• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion

Page 15: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

15

Throughput-Effective Network design

Throughput-Effective
• Reduce Area
  – Checkerboard Routing
  – Channel Slicing
• Increase Performance
  – Checkerboard Placement
  – Multi-Port routers at MCs

Page 16: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

16

Checkerboard Routing: Half-Routers

Half-Routers
• No turns allowed at half-routers
• Limited connectivity
• Saves ~50% of router crossbar area

Full-Routers: normal routers with complete connectivity

Use half-routers at every other node

[Figure: Half-Router Connectivity among the Injection, Ejection, North, South, East and West ports; mesh legend: Half Router, Full Router.]

Page 17: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

17

Solution: Routing Restriction (1)

• Routing from a full-router to a half-router that is:
  – An odd number of columns away
  – Not in the same row
• Solution: use YX routing instead of XY routing in this case

[Figure: mesh illustrating this case; legend: Half Router, Full Router.]

Page 18: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

18

Solution: Routing Restriction (2)

• Routing from a half-router to a half-router that is:
  – An even number of columns away
  – Not in the same row
• Solution: needs two turns
  – (1) To an intermediate full-router using YX
  – (2) To the destination using XY
• Requires an extra VC to avoid deadlock

[Figure: mesh illustrating this case; legend: Half Router, Full Router.]

Page 19: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

19

Routing Restriction (3)
• Full-routers that are an odd number of columns away
  – We avoid this case by using a different MC placement (next 2 slides)

[Figure: mesh illustrating this case; legend: Half Router, Full Router.]
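The three routing restrictions can be captured in one route-selection sketch. This is a hedged illustration, not the authors' implementation: it assumes mesh nodes are addressed as (x, y), that half-routers sit where x + y is odd, that half-routers can inject, eject and pass traffic straight through, and it picks the intermediate full-router as the node one row from the source toward the destination. All function names are hypothetical.

```python
def is_half(x, y):
    # Assumption: half-routers occupy nodes where (x + y) is odd.
    return (x + y) % 2 == 1


def dor_path(src, dst, order):
    """Dimension-order path from src to dst; order is "XY" or "YX"."""
    x, y = src
    path = [src]
    for dim in order:
        if dim == "X":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path


def turns(path):
    """Nodes at which the path changes direction."""
    out = []
    for a, b, c in zip(path, path[1:], path[2:]):
        if (b[0] - a[0], b[1] - a[1]) != (c[0] - b[0], c[1] - b[1]):
            out.append(b)
    return out


def checkerboard_route(src, dst):
    (xs, ys), (xd, yd) = src, dst
    if xs == xd or ys == yd:
        return dor_path(src, dst, "XY")      # straight line, no turn
    if not is_half(xd, ys):
        return dor_path(src, dst, "XY")      # XY turn lands on a full-router
    if not is_half(xs, yd):
        return dor_path(src, dst, "YX")      # restriction (1): fall back to YX
    if is_half(*src) and is_half(*dst):
        # Restriction (2): two turns via an intermediate full-router.
        mid = (xs, ys + (1 if yd > ys else -1))
        return dor_path(src, mid, "YX") + dor_path(mid, dst, "XY")[1:]
    # Restriction (3): full-router to full-router, odd columns apart.
    raise ValueError("not routable here; avoided by MC placement")


print(checkerboard_route((0, 0), (1, 2)))  # full -> half, odd columns: YX
print(checkerboard_route((1, 0), (3, 2)))  # half -> half, even columns: two turns
```

In this sketch the first call yields a YX path that turns at the full-router (0, 2), and the second turns at the full-routers (1, 1) and (3, 1). The extra virtual channel needed for deadlock avoidance in the two-turn case is not modelled.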

Page 20: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

20

Throughput-Effective Network design

Throughput-Effective
• Reduce Area
  – Checkerboard Routing
  – Channel Slicing
• Increase Performance
  – Checkerboard Placement
  – Multi-Port routers at MCs

Page 21: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

21

Placement of MCs

• Exploit many-to-few: place the MCs at half-router nodes
  – Half-routers can communicate with all nodes with no penalty
  – Common case for BSP: compute cores communicate with MCs, not each other

[CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al.
[ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.

[Figure: mesh showing the MC placement; legend: Half Router, Compute Core Router, Memory Controller Router.]

Page 22: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

22

Throughput-Effective Network design

Throughput-Effective
• Reduce Area
  – Checkerboard Routing
  – Channel Slicing
• Increase Performance
  – Checkerboard Placement
  – Multi-Port routers at MCs

Page 23: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

23

Multi-port routers at MCs

• Reduce the bottleneck at the few nodes
• Increase terminal BW of the few nodes
  – Increase the injection ports of MC routers
  – Minimal area overhead (~1% in total NoC area)
  – Speedups of up to 25%

Page 24: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

24

Throughput-Effective Network design

Throughput-Effective
• Reduce Area
  – Checkerboard Routing
  – Channel Slicing
• Increase Performance
  – Checkerboard Placement
  – Multi-Port routers at MCs

Page 25: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

25

Outline

• Introduction
• Baseline architecture
• NoC properties in accelerators
• Throughput-Effective NoC design
• Experimental results
• Conclusion

Page 26: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

26

Methodology

• Compute simulation: GPGPU-Sim (2.2.1b)
• NoC simulation: Booksim-2
  – Integrated into GPGPU-Sim as network simulator
• Area estimations: Orion 2.0
• Benchmarks: 24 CUDA applications including the Rodinia benchmarks

Page 27: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

27

Results
• Combination of
  – Checkerboard routing and placement
  – Channel slicing
  – Multi-port routers at MCs
• Overall HM speedup of 17% across 24 benchmarks over the balanced baseline
• Total NoC area reduction of 43%

[Figure: per-benchmark Speedup (axis -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM, grouped as Low Speedup/Low Traffic, Low Speedup/High Traffic and High Speedup/High Traffic.]
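For readers unfamiliar with the "HM" bar, the sketch below shows one way such a summary figure can be computed, assuming HM denotes the harmonic mean of per-benchmark speedup ratios. The IPC values are made up for illustration and the function name is hypothetical.

```python
from statistics import harmonic_mean


def hm_speedup(baseline_ipc, new_ipc):
    """Harmonic mean of per-benchmark speedup ratios, minus 1.

    Returns 0.17 for a 17% harmonic-mean speedup.
    """
    ratios = [new / base for base, new in zip(baseline_ipc, new_ipc)]
    return harmonic_mean(ratios) - 1.0


# Made-up IPC values for three benchmarks (not results from the talk).
baseline = [100.0, 100.0, 100.0]
improved = [100.0, 150.0, 300.0]

hm = hm_speedup(baseline, improved)
am = sum(n / b for b, n in zip(baseline, improved)) / len(baseline) - 1.0

print(f"harmonic-mean speedup:   {hm:.1%}")
print(f"arithmetic-mean speedup: {am:.1%}")
```

The harmonic mean is pulled less by a few very large speedups than the arithmetic mean, so it gives a more conservative summary across benchmarks.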

Page 28: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

28

Throughput-Effective NoC

[Figure: (Chip Area)⁻¹ [1/mm²] (y-axis, 0.0012 to 0.0020) versus Average Throughput [IPC] (x-axis, 190 to 310), with iso-lines from 0.30 to 0.55 IPC/mm². Points shown: Balanced Mesh, 2x BW, Thr. Eff. and Ideal NoC; arrows indicate LESS AREA and HIGHER THROUGHPUT.]

Page 29: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

29

Summary

• Throughput-Effective design: consider system-level performance impact + area impact of the NoC
• Observations
  – NoC BW is more important than latency in accelerators
  – Many-to-few-to-many traffic pattern
• Throughput-Effective NoC for accelerators
  – Checkerboard
  – Multi-port MC routers
  – Channel slicing

Page 30: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

Thank you

Page 31: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

31

Backups…

Page 32: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

32

Channel Slicing – Double networks

• Divide the single network into two physical networks
  – Each new network: half the bisection BW of the original network
  – Overall bisection BW: constant
• Saves area
  – Quadratic dependency of crossbar area on channel BW
• Increases serialization latency
  – But compute accelerators are not sensitive to latency
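The area and latency trade-off of channel slicing can be checked with a toy model. This sketch is illustrative only: it assumes crossbar area scales with the square of (ports × channel width), uses hypothetical channel and packet sizes, and ignores buffers, allocators and wiring, so it is not an area estimate for the design in the talk.

```python
def crossbar_area(channel_width_bits, ports=5, k=1.0):
    """Toy crossbar area model: quadratic in channel width."""
    return k * (ports * channel_width_bits) ** 2


def serialization_cycles(packet_bits, channel_width_bits):
    """Cycles needed to push one packet through a channel (ceiling division)."""
    return -(-packet_bits // channel_width_bits)


# Hypothetical sizes, chosen for illustration.
FULL_WIDTH = 128      # bits, single network
SLICED_WIDTH = 64     # bits, each of the two sliced networks
PACKET_BITS = 512

single_area = crossbar_area(FULL_WIDTH)
sliced_area = 2 * crossbar_area(SLICED_WIDTH)

print("crossbar area ratio (sliced / single):", sliced_area / single_area)
print("serialization cycles, single network:",
      serialization_cycles(PACKET_BITS, FULL_WIDTH))
print("serialization cycles, sliced network:",
      serialization_cycles(PACKET_BITS, SLICED_WIDTH))
```

In this model, two half-width networks keep the total channel bandwidth constant while halving the crossbar area, at the cost of doubling the serialization latency of each packet.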

Page 33: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

33

Results

• Memory controller placement
  – HM speedup of 13% over the balanced baseline design

[Figure: per-benchmark Speedup (axis -20% to 80%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM; mesh legend: Compute Core Router, Memory Controller Router.]

Page 34: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

34

Results
• Checkerboard routing
  – Less than 1% performance loss compared to DOR with the same resources
  – Reduces total router area by 14.2%

[Figure: per-benchmark Relative Performance (axis 70% to 120%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM; mesh legend: Half Router, Compute Core Router, Memory Controller Router.]

Page 35: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

35

Results
• Channel slicing
  – Average change in performance < 1%
  – NoC area reduction of 37%

[Figure: per-benchmark Speedup (axis -7% to 14%) for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD and HM; mesh legend: Half Router, Compute Core Router, Memory Controller Router.]

Page 36: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

36

Top 5 systems

TOP 5 Systems - 11/2010
1 Tianhe-1A - NUDT TH MPP, X5670 2.93Ghz 6C, Nvidia GPU, FT-1000 8C
2 Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz
3 Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU
4 TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
5 Hopper - Cray XE6 12-core 2.1 GHz

Page 37: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

37

Alternative MC placement example

Page 38: Throughput-Effective On-Chip Networks for  Manycore  Accelerators

38

Many-to-Few-to-Many Traffic Pattern

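[Figure: cores C0, C1, C2, ..., Cn send through the request network to memory controllers MC0, MC1, ..., MCm, which send through the reply network back to cores C0, C1, C2, ..., Cn. Labelled bandwidths: Core output bandwidth, MC input bandwidth, MC output bandwidth, Core input bandwidth.]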