A Massively Parallel, Hybrid Dataflow/von Neumann Architecture
Yoav Etsion, November 11, 2015




Page 1:

A Massively Parallel, Hybrid Dataflow/von Neumann Architecture

Yoav Etsion

November 11, 2015

Page 2:

Massively Parallel Computing

• CUDA/OpenCL are gaining traction in high-performance computing (HPC)
  – Same code; different data

• GPUs deliver better FLOPS per Watt
  – Available in mobile systems and supercomputers

• But… GPGPUs still suffer from von-Neumann inefficiencies


Page 3:

von-Neumann inefficiencies

• Fetch/Decode/Issue each instruction
  – Even though most instructions come from loops

• Explicit storage needed for communicating values between instructions
  – Register file; stack
  – Data travels between execution units and storage


[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA’10]

Component: Inst. fetch | Pipeline registers | Data cache | Register file | Control | ALU
Power [%]:     33%     |        22%         |    19%     |      10%      |   10%   |  6%

Page 4:

Quantifying inefficiencies: instruction pipeline

• Every instruction is fetched, decoded and issued
  – Very wasteful: most of the execution time is spent in (tight) loops

• Avg. pipeline power consumption:
  – NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA’10]
  – NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA’13]


Page 5:

Quantifying Inefficiencies: Register File

• Communication via bulletin board
  – 40% of values are only read once [Gebhart et al., ISCA’11]

• Avg. register file power consumption:
  – NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA’10]
  – NVIDIA Fermi: >15% of processor power [Leng et al., ISCA’13]


Page 6:

Alternatives to von-Neumann: dataflow/spatial computing

• Processor is a grid of functional units
• Computation graph is mapped to the grid
  – Statically, at compile time

• No energy wasted on the pipeline
  – Instructions are statically mapped to nodes

• No energy wasted on the RF and data transfers
  – No centralized register file needed
  – Saves static power and area (128KB on Fermi)


Page 7:

Spatial/Dataflow Computing


int temp1 = a[threadId] * b[threadId];
int temp2 = 5 * temp1;
if (temp2 > 255) {
    temp2 = temp2 >> 3;
    result[threadId] = temp2;
} else {
    result[threadId] = temp2;
}

[Dataflow graph: S_LOAD1/S_LOAD2 read a[threadIdx] and b[threadIdx]; ALU1_mul and ALU2_mul (with IMM_5) compute the products; ALU3_icmp compares against IMM_256; ALU4_ashl shifts by IMM_3; the if_then/if_else paths merge at JOIN1 and feed S_STORE3/S_STORE4, which write result]

Page 8:

SGMF: A Massively Multithreaded Dataflow Architecture

• Every thread is a flow through the dataflow graph
• Many threads execute (flow) in parallel


Page 9:

Execution Overview: Dynamic Dataflow

• Each flow/thread is associated with a token
• An operation executes when its input tokens match
• Parallelism is determined by the number of tokens in the system

[Figure: grid with token matching at the functional units and out-of-order (OoO) LD/ST units]

Page 10:

DESIGN ISSUES
A Massively Multithreaded Dataflow Processor


Page 11:

Multithreading Design Issues: Preventing Deadlocks

• Imbalanced out-of-order memory responses may trigger deadlocks
  – Deadlock due to limited buffer space

[Figure: OoO LD/ST units feeding limited token buffers]

• Solution: load/store units limit bypassing to the size of the token buffer

Page 12:

Design issues: Variable path lengths

• Short paths must wait for long paths

[Figure: unbalanced dataflow graph with inputs a, b, c feeding multiply and add nodes; the short path stalls and creates a bubble]

• Solution: equalize paths’ lengths

Page 13:

ARCHITECTURE
A Massively Multithreaded Dataflow Processor


Page 14:

Architecture overview

Heterogeneous grid of tiles:
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join ops
4. Special tiles: deal with non-pipelined operations

Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file


Page 15:

Architecture overview


Page 16:

EVALUATION
A Massively Multithreaded Dataflow Processor


Page 17:

Methodology

The main HW blocks were implemented in Verilog and synthesized to a 65nm process
– Validate timing and connectivity
– Estimate area and power consumption
– The size of one SGMF core synthesized in the 65nm process is 54.3mm²
– When scaled down to 40nm, each SGMF core would occupy 21.18mm²
– An NVIDIA Fermi GTX480 card (40nm) occupies 529mm²

Cycle-accurate simulations based on GPGPU-Sim
– We integrated the synthesis results into the GPGPU-Sim/Wattch power model

Benchmarks from the Rodinia suite
– CUDA kernels, compiled for SGMF


Page 18:

Single core system: SGMF vs. Fermi – Performance

[Bar chart: speedup over Fermi (y-axis: Speedup [x], 0 to 8) per benchmark BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN, and Average, with series for 1, 2, 4, 8, 16, 32, and 64 tokens]

Page 19:

Single core system: Energy savings

[Bar chart: energy efficiency relative to Fermi (y-axis: Insts. per Joule [x], 0 to 7) per benchmark BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN, and Average, with series for 1, 2, 4, 8, 16, 32, and 64 tokens]

Page 20:

Conclusions

• von-Neumann engines have inherent inefficiencies
  – Throughput computing can benefit from dataflow/spatial computing

• SGMF can potentially achieve much better performance/power than current GPGPUs
  – Almost 2x speedup (average) and 50% energy savings
  – Need to tune the memory system

• Greatly motivates further research
  – Compilation, place & route, connectivity, …


Page 21:

Thank you!

Questions?