A Massively Parallel, Hybrid Dataflow/von Neumann Architecture
Yoav Etsion
November 11, 2015
Massively Parallel Computing

• CUDA/OpenCL are gaining traction in high-performance computing (HPC)
  – Same code; different data
• GPUs deliver better FLOPS per Watt
  – Available in mobile systems and supercomputers
• But… GPGPUs still suffer from von Neumann inefficiencies
von Neumann Inefficiencies

• Fetch/Decode/Issue each instruction
  – Even though most instructions come from loops
• Explicit storage needed for communicating values between instructions
  – Register file; stack
  – Data travels between execution units and storage

[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA'10]

Component           Power [%]
Inst. fetch         33%
Pipeline registers  22%
Data cache          19%
Register file       10%
Control             10%
ALU                  6%
Quantifying Inefficiencies: Instruction Pipeline

• Every instruction is fetched, decoded and issued
  – Very wasteful: most of the execution time is spent in (tight) loops
• Avg. pipeline power consumption:
  – NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA'13]
Quantifying Inefficiencies: Register File

• Communication via bulletin board
  – 40% of values are only read once [Gebhart et al., ISCA'11]
• Avg. register file power consumption:
  – NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: >15% of processor power [Leng et al., ISCA'13]
Alternatives to von Neumann: Dataflow/Spatial Computing

• Processor is a grid of functional units
• Computation graph is mapped to the grid
  – Statically, at compile time
• No energy wasted on the pipeline
  – Instructions are statically mapped to nodes
• No energy wasted on RF and data transfers
  – No centralized register file needed
  – Saves static power and area (128KB on Fermi)
Spatial/Dataflow Computing

    int temp1 = a[threadId] * b[threadId];
    int temp2 = 5 * temp1;
    if (temp2 > 255) {
        temp2 = temp2 >> 3;
        result[threadId] = temp2;
    } else {
        result[threadId] = temp2;
    }

[Dataflow graph: S_LOAD1/S_LOAD2 load a[threadIdx] and b[threadIdx], feeding ALU1_mul; IMM_5 and ALU2_mul compute 5*temp1; ALU3_icmp compares against IMM_256; ALU4_ashl shifts by IMM_3; the if_then/if_else branches merge at JOIN1, and S_STORE3/S_STORE4 write result]
SGMF: A Massively Multithreaded Dataflow Architecture

• Every thread is a flow through the dataflow graph
• Many threads execute (flow) in parallel
Execution Overview: Dynamic Dataflow

• Each flow/thread is associated with a token
• Execute an operation when its tokens match
• Parallelism is determined by the number of tokens in the system

[Figure: token matching at the grid nodes; out-of-order (OoO) LD/ST units]
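The token-matching rule can be sketched in a few lines of C. This is a minimal illustration of the dynamic-dataflow idea, not the SGMF hardware: `Token`, `MatchTable`, `deliver`, and the direct-mapped slot scheme are our invented names, and a real design matches tags in hardware at each functional unit.

```c
#include <stdbool.h>
#include <stddef.h>

/* A token carries a value tagged with the thread (flow) it belongs to. */
typedef struct { int tag; int value; } Token;

#define SLOTS 8  /* capacity of the matching table (token buffer) */

/* One two-input functional unit: an operand waits here until its
 * partner operand with the same tag arrives. */
typedef struct {
    bool left_present[SLOTS], right_present[SLOTS];
    int  left_val[SLOTS], right_val[SLOTS];
} MatchTable;

/* Deliver one operand; returns true and writes *out when both operands
 * for this tag are present, i.e. the node "fires". */
static bool deliver(MatchTable *mt, Token t, bool is_left,
                    int (*op)(int, int), int *out) {
    size_t s = (size_t)t.tag % SLOTS;  /* direct-mapped for simplicity */
    if (is_left) { mt->left_present[s]  = true; mt->left_val[s]  = t.value; }
    else         { mt->right_present[s] = true; mt->right_val[s] = t.value; }
    if (mt->left_present[s] && mt->right_present[s]) {
        *out = op(mt->left_val[s], mt->right_val[s]);
        mt->left_present[s] = mt->right_present[s] = false;
        return true;   /* tokens matched: fire and free the slot */
    }
    return false;      /* wait for the partner token */
}

static int mul(int a, int b) { return a * b; }
```

Note how operands from different threads can arrive interleaved; each node fires only when its own tag's pair completes, which is what lets many flows share one grid concurrently.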
10November 11, 2015
DESIGN ISSUESA Massively Multithreaded Dataflow Processor
10
Multithreading Design Issues: Preventing Deadlocks

• Imbalanced out-of-order memory responses may trigger deadlocks

[Figure: deadlock due to limited buffer space at the OoO LD/ST units]

Solution: load-store units limit bypassing to the size of the token buffer
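The throttling rule can be sketched as follows; this is our own minimal model of the idea, not the paper's RTL, and `LdstUnit`, `may_return`, and `TOKEN_BUF_SLOTS` are illustrative names. A response may bypass older outstanding loads only while every bypassed response is still guaranteed a slot in the consumer's token buffer; otherwise bypassers could fill the buffer and block the in-order response everyone is waiting on.

```c
#include <stdbool.h>

#define TOKEN_BUF_SLOTS 8  /* size of the consumer's token buffer */

typedef struct {
    int oldest_pending;   /* tag of the oldest load still in flight */
    int bypasses_in_use;  /* responses already returned ahead of it */
} LdstUnit;

/* Returns true if a response with this tag may be delivered now. */
static bool may_return(const LdstUnit *u, int tag) {
    if (tag == u->oldest_pending)
        return true;                      /* in-order: always safe */
    /* Out-of-order: allowed only while the bypassed responses fit in
     * the token buffer, so the oldest response can never be blocked. */
    return u->bypasses_in_use < TOKEN_BUF_SLOTS;
}

static void deliver_response(LdstUnit *u, int tag) {
    if (tag == u->oldest_pending)
        u->oldest_pending++;              /* retire in order */
    else
        u->bypasses_in_use++;             /* consume a bypass slot */
}
```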
Design Issues: Variable Path Lengths

• Short paths must wait for long paths

[Figure: a graph of multiply and add nodes over inputs a, b, c, x producing f; a bubble forms on the shorter path]

Solution: equalize the paths' lengths
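The fix amounts to simple arithmetic: the shorter path into a join needs as many extra buffer stages as the path-length difference. The sketch below assumes one-cycle nodes and single-token channels, and `buffers_needed` is an invented name; for example, if a*x reaches an adder through one multiply stage while b arrives directly, b's edge needs one buffer stage.

```c
/* A join node consumes one token from each input path per firing.
 * If one path has more pipeline stages than the other, tokens on the
 * short path arrive early and stall the producer unless extra buffer
 * stages delay them; this computes how many stages equalize the paths. */
static int buffers_needed(int long_path_stages, int short_path_stages) {
    int diff = long_path_stages - short_path_stages;
    return diff > 0 ? diff : 0;  /* already balanced: none needed */
}
```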
A Massively Multithreaded Dataflow Processor
ARCHITECTURE
Architecture Overview

Heterogeneous grid of tiles:
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join ops
4. Special tiles: deal with non-pipelined operations

Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file
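The tile mix can be summarized in a small sketch. The enum, node kinds, and mapping below are our own illustration (the paper does not give this classification verbatim); in particular, we assume a non-pipelined operation such as division would land on a special tile.

```c
/* Illustrative model of the heterogeneous SGMF grid; names are ours. */
typedef enum {
    TILE_COMPUTE,  /* arithmetic/logic, very similar to a CUDA core */
    TILE_LDST,     /* buffers and throttles memory traffic */
    TILE_CONTROL,  /* pipeline buffering and join operations */
    TILE_SPECIAL   /* non-pipelined operations */
} TileType;

/* Dataflow-node kinds, following the earlier example graph. */
typedef enum { NODE_ALU, NODE_LOAD, NODE_STORE, NODE_JOIN, NODE_DIV } NodeKind;

/* Which tile class would host each node kind (our mapping guess). */
static TileType tile_for(NodeKind k) {
    switch (k) {
    case NODE_ALU:   return TILE_COMPUTE;
    case NODE_LOAD:
    case NODE_STORE: return TILE_LDST;
    case NODE_JOIN:  return TILE_CONTROL;
    default:         return TILE_SPECIAL;  /* e.g. NODE_DIV */
    }
}
```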
A Massively Multithreaded Dataflow Processor
EVALUATION
Methodology

• The main HW blocks were implemented in Verilog and synthesized to a 65nm process
  – Validate timing and connectivity
  – Estimate area and power consumption
  – One SGMF core synthesized at 65nm is 54.3mm²
  – Scaled down to 40nm, each SGMF core would occupy 21.18mm²
  – An NVIDIA Fermi GTX480 card (40nm) occupies 529mm²
• Cycle-accurate simulations based on GPGPU-Sim
  – We integrated the synthesis results into the GPGPU-Sim/Wattch power model
• Benchmarks from the Rodinia suite
  – CUDA kernels, compiled for SGMF
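As a sanity check on the scaling claim, ideal dimensional scaling shrinks area by the square of the feature-size ratio; applying (40/65)² to the 65nm figure gives about 20.6mm², close to the 21.18mm² the slides quote (the small gap presumably comes from a slightly different scaling factor). The helper name below is ours.

```c
/* Ideal (dimensional) area scaling between process nodes:
 * area scales with the square of the feature-size ratio. */
static double scale_area(double area_mm2, double from_nm, double to_nm) {
    double r = to_nm / from_nm;
    return area_mm2 * r * r;   /* e.g. 54.3 * (40/65)^2 ~= 20.6 */
}
```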
Single-Core System: SGMF vs. Fermi – Performance

[Chart: speedup over Fermi (y-axis: Speedup [x], 0-8) across Rodinia benchmarks (BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN, Average) for 1 to 64 tokens]
Single-Core System: Energy Savings

[Chart: energy efficiency (y-axis: Insts. per Joule [x], 0-7) across the same benchmarks for 1 to 64 tokens]
Conclusions

• von Neumann engines have inherent inefficiencies
  – Throughput computing can benefit from dataflow/spatial computing
• SGMF can potentially achieve much better performance/power than current GPGPUs
  – Almost 2x speedup (average) and 50% energy savings
  – Need to tune the memory system
• Greatly motivates further research
  – Compilation, place & route, connectivity, …
Thank you!
Questions?