A Massively Parallel, Hybrid Dataflow/von Neumann Architecture
Yoav Etsion
November 11, 2015
Massively Parallel Computing

• CUDA/OpenCL are gaining traction in high-performance computing (HPC)
  – Same code; different data
• GPUs deliver better FLOPS per Watt
  – Available in mobile systems and supercomputers
• But… GPGPUs still suffer from von Neumann inefficiencies
von Neumann Inefficiencies

• Fetch/Decode/Issue each instruction
  – Even though most instructions come from loops
• Explicit storage needed for communicating values between instructions
  – Register file; stack
  – Data travels between execution units and storage

[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA'10]

Component           Power [%]
Inst. fetch         33%
Pipeline registers  22%
Data cache          19%
Register file       10%
Control             10%
ALU                  6%
Quantifying Inefficiencies: Instruction Pipeline

• Every instruction is fetched, decoded and issued
  – Very wasteful: most of the execution time is spent in (tight) loops
• Avg. pipeline power consumption:
  – NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA'13]
Quantifying Inefficiencies: Register File

• Communication via bulletin board
  – 40% of values are only read once [Gebhart et al., ISCA'11]
• Avg. register file power consumption:
  – NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: >15% of processor power [Leng et al., ISCA'13]
Alternatives to von Neumann: Dataflow/Spatial Computing

• Processor is a grid of functional units
• Computation graph is mapped to the grid
  – Statically, at compile time
• No energy wasted on the pipeline
  – Instructions are statically mapped to nodes
• No energy wasted on RF and data transfers
  – No centralized register file needed
  – Saves static power and area (128KB on Fermi)
Spatial/Dataflow Computing

    int temp1 = a[threadId] * b[threadId];
    int temp2 = 5 * temp1;
    if (temp2 > 255) {
        temp2 = temp2 >> 3;
        result[threadId] = temp2;
    } else {
        result[threadId] = temp2;
    }

[Dataflow graph: S_LOAD1/S_LOAD2 load a[threadIdx] and b[threadIdx], feeding ALU1_mul; IMM_5 and ALU2_mul compute 5*temp1; ALU3_icmp compares against IMM_256; ALU4_ashl shifts by IMM_3; the if_then/if_else branches merge at JOIN1, and S_STORE3/S_STORE4 write result]
SGMF: A Massively Multithreaded Dataflow Architecture

• Every thread is a flow through the dataflow graph
• Many threads execute (flow) in parallel
Execution Overview: Dynamic Dataflow

• Each flow/thread is associated with a token
• Execute an operation when its tokens match
• Parallelism is determined by the number of tokens in the system

[Figure: token matching at the grid nodes; out-of-order (OoO) LD/ST units]
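The token-matching rule can be sketched in a few lines of C. This is a minimal illustration of the dynamic-dataflow idea, not the SGMF hardware: `Token`, `MatchTable`, `deliver`, and the direct-mapped slot scheme are our invented names, and a real design matches tags in hardware at each functional unit.

```c
#include <stdbool.h>
#include <stddef.h>

/* A token carries a value tagged with the thread (flow) it belongs to. */
typedef struct { int tag; int value; } Token;

#define SLOTS 8  /* capacity of the matching table (token buffer) */

/* One two-input functional unit: an operand waits here until its
 * partner operand with the same tag arrives. */
typedef struct {
    bool left_present[SLOTS], right_present[SLOTS];
    int  left_val[SLOTS], right_val[SLOTS];
} MatchTable;

/* Deliver one operand; returns true and writes *out when both operands
 * for this tag are present, i.e. the node "fires". */
static bool deliver(MatchTable *mt, Token t, bool is_left,
                    int (*op)(int, int), int *out) {
    size_t s = (size_t)t.tag % SLOTS;  /* direct-mapped for simplicity */
    if (is_left) { mt->left_present[s]  = true; mt->left_val[s]  = t.value; }
    else         { mt->right_present[s] = true; mt->right_val[s] = t.value; }
    if (mt->left_present[s] && mt->right_present[s]) {
        *out = op(mt->left_val[s], mt->right_val[s]);
        mt->left_present[s] = mt->right_present[s] = false;
        return true;   /* tokens matched: fire and free the slot */
    }
    return false;      /* wait for the partner token */
}

static int mul(int a, int b) { return a * b; }
```

Note how operands from different threads can arrive interleaved; each node fires only when its own tag's pair completes, which is what lets many flows share one grid concurrently.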
10November 11, 2015
DESIGN ISSUESA Massively Multithreaded Dataflow Processor
10
Multithreading Design Issues: Preventing Deadlocks

• Imbalanced out-of-order memory responses may trigger deadlocks

[Figure: deadlock due to limited buffer space at the OoO LD/ST units]

Solution: load-store units limit bypassing to the size of the token buffer
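The throttling rule can be sketched as follows; this is our own minimal model of the idea, not the paper's RTL, and `LdstUnit`, `may_return`, and `TOKEN_BUF_SLOTS` are illustrative names. A response may bypass older outstanding loads only while every bypassed response is still guaranteed a slot in the consumer's token buffer; otherwise bypassers could fill the buffer and block the in-order response everyone is waiting on.

```c
#include <stdbool.h>

#define TOKEN_BUF_SLOTS 8  /* size of the consumer's token buffer */

typedef struct {
    int oldest_pending;   /* tag of the oldest load still in flight */
    int bypasses_in_use;  /* responses already returned ahead of it */
} LdstUnit;

/* Returns true if a response with this tag may be delivered now. */
static bool may_return(const LdstUnit *u, int tag) {
    if (tag == u->oldest_pending)
        return true;                      /* in-order: always safe */
    /* Out-of-order: allowed only while the bypassed responses fit in
     * the token buffer, so the oldest response can never be blocked. */
    return u->bypasses_in_use < TOKEN_BUF_SLOTS;
}

static void deliver_response(LdstUnit *u, int tag) {
    if (tag == u->oldest_pending)
        u->oldest_pending++;              /* retire in order */
    else
        u->bypasses_in_use++;             /* consume a bypass slot */
}
```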
Design Issues: Variable Path Lengths

• Short paths must wait for long paths

[Figure: a graph of multiply and add nodes over inputs a, b, c, x producing f; a bubble forms on the shorter path]

Solution: equalize the paths' lengths
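The fix amounts to simple arithmetic: the shorter path into a join needs as many extra buffer stages as the path-length difference. The sketch below assumes one-cycle nodes and single-token channels, and `buffers_needed` is an invented name; for example, if a*x reaches an adder through one multiply stage while b arrives directly, b's edge needs one buffer stage.

```c
/* A join node consumes one token from each input path per firing.
 * If one path has more pipeline stages than the other, tokens on the
 * short path arrive early and stall the producer unless extra buffer
 * stages delay them; this computes how many stages equalize the paths. */
static int buffers_needed(int long_path_stages, int short_path_stages) {
    int diff = long_path_stages - short_path_stages;
    return diff > 0 ? diff : 0;  /* already balanced: none needed */
}
```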
A Massively Multithreaded Dataflow Processor
ARCHITECTURE
Architecture Overview

Heterogeneous grid of tiles:
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join ops
4. Special tiles: deal with non-pipelined operations

Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file
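The tile mix can be summarized in a small sketch. The enum, node kinds, and mapping below are our own illustration (the paper does not give this classification verbatim); in particular, we assume a non-pipelined operation such as division would land on a special tile.

```c
/* Illustrative model of the heterogeneous SGMF grid; names are ours. */
typedef enum {
    TILE_COMPUTE,  /* arithmetic/logic, very similar to a CUDA core */
    TILE_LDST,     /* buffers and throttles memory traffic */
    TILE_CONTROL,  /* pipeline buffering and join operations */
    TILE_SPECIAL   /* non-pipelined operations */
} TileType;

/* Dataflow-node kinds, following the earlier example graph. */
typedef enum { NODE_ALU, NODE_LOAD, NODE_STORE, NODE_JOIN, NODE_DIV } NodeKind;

/* Which tile class would host each node kind (our mapping guess). */
static TileType tile_for(NodeKind k) {
    switch (k) {
    case NODE_ALU:   return TILE_COMPUTE;
    case NODE_LOAD:
    case NODE_STORE: return TILE_LDST;
    case NODE_JOIN:  return TILE_CONTROL;
    default:         return TILE_SPECIAL;  /* e.g. NODE_DIV */
    }
}
```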
A Massively Multithreaded Dataflow Processor
EVALUATION
Methodology

• The main HW blocks were implemented in Verilog and synthesized to a 65nm process
  – Validate timing and connectivity
  – Estimate area and power consumption
  – One SGMF core synthesized at 65nm is 54.3mm²
  – Scaled down to 40nm, each SGMF core would occupy 21.18mm²
  – An NVIDIA Fermi GTX480 card (40nm) occupies 529mm²
• Cycle-accurate simulations based on GPGPU-Sim
  – We integrated the synthesis results into the GPGPU-Sim/Wattch power model
• Benchmarks from the Rodinia suite
  – CUDA kernels, compiled for SGMF
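As a sanity check on the scaling claim, ideal dimensional scaling shrinks area by the square of the feature-size ratio; applying (40/65)² to the 65nm figure gives about 20.6mm², close to the 21.18mm² the slides quote (the small gap presumably comes from a slightly different scaling factor). The helper name below is ours.

```c
/* Ideal (dimensional) area scaling between process nodes:
 * area scales with the square of the feature-size ratio. */
static double scale_area(double area_mm2, double from_nm, double to_nm) {
    double r = to_nm / from_nm;
    return area_mm2 * r * r;   /* e.g. 54.3 * (40/65)^2 ~= 20.6 */
}
```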
Single-Core System: SGMF vs. Fermi – Performance

[Chart: speedup over Fermi (y-axis: Speedup [x], 0-8) across Rodinia benchmarks (BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN, Average) for 1 to 64 tokens]
Single-Core System: Energy Savings

[Chart: energy efficiency (y-axis: Insts. per Joule [x], 0-7) across the same benchmarks for 1 to 64 tokens]
Conclusions

• von Neumann engines have inherent inefficiencies
  – Throughput computing can benefit from dataflow/spatial computing
• SGMF can potentially achieve much better performance/power than current GPGPUs
  – Almost 2x speedup (average) and 50% energy savings
  – Need to tune the memory system
• Greatly motivates further research
  – Compilation, place & route, connectivity, …
Thank you!
Questions?