A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Ali Mustafa Zaidi, David Greaves
{ali-mustafa.zaidi, david.greaves}@cl.cam.ac.uk
University of Cambridge Computer Laboratory


The Dark Silicon Problem

Amdahl's Law + Utilization Wall = Dark Silicon

[Figure labels: 2.1 GHz @ 90 nm (80 W): 18%; 5.2 GHz @ 45 nm (80 W): 7%; 7.3 GHz @ 32 nm (80 W): 3%]

45 nm → 8 nm (32x resources)
● CPU: 3.5x, GPU: 2.4x (Conservative)
● CPU: 7.9x, GPU: 2.7x (ITRS)

Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", IEEE Micro, 2012.

Need for both high 'sequential' performance AND very high energy efficiency.
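The Amdahl's Law term above can be made concrete with a short calculation. The following is a minimal sketch, assuming the standard form of Amdahl's Law; the function name `amdahl_speedup` and the 90% parallel fraction are illustrative assumptions, not values taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_units: float) -> float:
    """Overall speedup when only `parallel_fraction` of the work scales with `n_units`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Hypothetical workload: 90% of the work is parallel, run on 32x the resources.
speedup_32x = amdahl_speedup(0.9, 32)

# Upper bound as resources grow without limit: 1 / serial_fraction.
speedup_limit = 1.0 / (1.0 - 0.9)
```

In this hypothetical case the serial 10% keeps the speedup below 8x even with 32x the resources, which is one way to read the need for high sequential performance.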

The Dark Silicon Problem

Amdahl's Law + Utilization Wall = Dark Silicon

Can we achieve superscalar performance without superscalar overheads?

[Figure: Power Dissipation vs. Relative Performance, with regions labelled "Conventional" and "Spatial?"]

Superscalar Processors:
● Only option for sequential performance!
● Power scales exponentially with performance

Custom Hardware:
● 10-1000x efficiency!
● Not for sequential code!

[Image labels: ARM A5 SoC; Custom Video Decoder]

Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", IEEE Micro, 2012.

Solution: Spatial Architectures?

● Custom Hardware, FPGAs, CGRAs, MPPAs, etc.
● Advantages
– Scalable, decentralized architectures with short, point-to-point wiring.
– High computational density
– 10-1000x energy efficiency & performance.
● Issues
– Poor programmability: often requires low-level hardware knowledge
– Limited amenability: poor performance on sequential, irregular, or complex control-flow code.
● Examples
– Conservation Cores: performance ≈ in-order MIPS 24KE core
– Phoenix CASH hardware: performance 30% less than a 4-way OOO core.

Solution: Spatial Architectures?

● Key reasons for the high performance of complex, OOO superscalars:
– Aggressive control-flow speculation
– Dynamic, out-of-order execution scheduling
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!

[Figure: Control-Data Flow Graph. Node labels: Start, i = 0, = A[i], > 0 (T/F), foo(), i++, < 100 (T/F), bar(), End]

McFarlin et al., "Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?", ASPLOS '13

Solution: Spatial Architectures!

Our Solution

Instead of: CDFG IR + Compile-time Execution Scheduling
We employ: VSFG IR + Dataflow Execution Model

[Figure: Control-Data Flow Graph. Node labels: Start, i = 0, = A[i], > 0 (T/F), foo(), i++, < 100 (T/F), bar(), End]

Value State Flow Graph

● Hierarchical Dataflow Graph
– Instead of [Basic Blocks + Control Flow], we have [Nested Subgraphs + Dataflow]
– Functions → nested subgraphs
– Loops → tail-recursive functions.
● Dataflow execution of operations
– Multiple subgraphs may execute concurrently in dataflow order (unlike basic blocks).
– Exposes Multiple Flows of Control!

[Figure: VSFG: Value-State Flow Graph. Node labels: i = 0, A, STATE_IN, = A[i], > 0, foo(), F/T/P, i++, < 100, inPred, "Next iteration of 'for' loop", F/T/P, bar(), STATE_OUT]
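The "loops → tail-recursive functions" idea can be illustrated in ordinary software terms. The sketch below assumes the figure depicts a loop over `A` that calls `foo()` when `A[i] > 0`, followed by `bar()`; the names `for_loop`, `program`, and the tuple used to stand in for the STATE_IN/STATE_OUT edge are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple

State = Tuple[str, ...]  # stand-in for the value carried on the state edge


def for_loop(i: int, state: State, A: Sequence[int]) -> State:
    """Loop subgraph written as a tail-recursive function (A must be non-empty)."""
    if A[i] > 0:
        # foo() consumes the incoming state and produces a new one.
        state = state + (f"foo({i})",)
    i_next = i + 1
    if i_next < len(A):
        # Tail call: the next iteration of the 'for' loop.
        return for_loop(i_next, state, A)
    # Loop exit: the state flows out of the subgraph.
    return state


def program(A: Sequence[int]) -> State:
    """Top-level subgraph: i = 0, then the loop, then bar()."""
    state_in: State = ()
    state_after_loop = for_loop(0, state_in, A)
    return state_after_loop + ("bar()",)


trace = program([3, -1, 0, 7])
```

Because every side effect is threaded through an explicit state value, the only ordering constraints left are data dependences, which is what allows independent subgraphs to run concurrently.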

Value State Flow Graph

● Infinite DAG
– Loops represented as tail recursion
– Branches represented via if-conversion
– Enables aggressive speculation!
● No single 'Flow of Control'
– Instead, control is implemented via 'Boolean Predicate Expressions'.
– Logic minimization can simplify expressions, facilitating Control Dependence Analysis!

[Figure: VSFG: Value-State Flow Graph. Node labels: i = 0, A, STATE_IN, = A[i], > 0, foo(), F/T/P, i++, < 100, inPred, "Next iteration of 'for' loop", F/T/P, bar(), STATE_OUT]

Value State Flow Graph

● Hierarchical Dataflow Graph
– Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').
– 'Flattening' loop tail-call subgraphs → loop unrolling/pipelining.
– Multiple loops in a loop-nest may be unrolled independently to expose ILP

[Figure: VSFG: Value-State Flow Graph. Node labels: i = 0, A, STATE_IN, = A[i], > 0, foo(), F/T/P, i++, < 100, inPred, "Next iteration of 'for' loop", F/T/P, bar(), STATE_OUT]
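Flattening a loop's tail call can be pictured as inlining one copy of the callee into the caller. The sketch below is a software analogy under stated assumptions: `body`, `rolled`, and `unrolled_once` are hypothetical names, and the loop body (summing the positive elements of a non-empty `A`) is chosen only for illustration. It shows that the flattened form computes the same result; in hardware the two body copies would be separate subgraphs that can overlap.

```python
from typing import Sequence


def body(i: int, acc: int, A: Sequence[int]) -> int:
    """One iteration of the loop body: accumulate positive elements."""
    return acc + A[i] if A[i] > 0 else acc


def rolled(i: int, acc: int, A: Sequence[int]) -> int:
    """Loop as a tail-recursive function, one iteration per call."""
    acc = body(i, acc, A)
    if i + 1 < len(A):
        return rolled(i + 1, acc, A)  # tail call: next iteration
    return acc


def unrolled_once(i: int, acc: int, A: Sequence[int]) -> int:
    """Same loop with one tail call flattened: two iterations per call."""
    acc = body(i, acc, A)                # iteration i
    if i + 1 < len(A):
        acc = body(i + 1, acc, A)        # flattened copy: iteration i + 1
        if i + 2 < len(A):
            return unrolled_once(i + 2, acc, A)
    return acc


data = [3, -1, 0, 7, 2]
result_rolled = rolled(0, 0, data)
result_unrolled = unrolled_once(0, 0, data)
```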

[Figure-only slides: VSFG: Value-State Flow Graph]

High Level Synthesis Case Study

Toolchain: Any High Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

Example LLVM IR input:
    %1 = mul i32 %x, %y
    %2 = srem i32 %1, %z
    %3 = icmp slt i32 %2, %1

Hardware Oriented Dataflow IR

LLVM IR:
    %1 = mul i32 %x, %y        ; <i32>
    %2 = srem i32 %1, %z       ; <i32>
    %3 = icmp slt i32 %2, %1   ; <i1>

Petri Net based Low Level Dataflow IR:
● Registers → Petri Net Places
● Instructions → Petri Net Transitions

[Figure: Value-State Flow Graph of the LLVM IR above: mul (inputs x, y) produces %1; srem (inputs %1, z) produces %2; icmp (inputs %2, %1) produces %3]

Hardware Oriented Dataflow IR

LLVM IR:
    %1 = mul i32 %x, %y        ; <i32>
    %2 = srem i32 %1, %z       ; <i32>
    %3 = icmp slt i32 %2, %1   ; <i1>

Petri Net based Low Level Dataflow IR:
● Registers → Petri Net Places
● Instructions → Petri Net Transitions

Equivalent Bluespec Code:
    FIFOF#(int) x      <- mkFIFOF1;
    FIFOF#(int) y      <- mkFIFOF1;
    FIFOF#(int) z      <- mkFIFOF1;
    FIFOF#(int) srem_1 <- mkFIFOF1;
    FIFOF#(int) icmp_1 <- mkFIFOF1;
    FIFOF#(int) icmp_2 <- mkFIFOF1;
    FIFOF#(int) out_3  <- mkFIFOF1;

    // %1 = mul i32 %x, %y; the result feeds both srem and icmp
    rule mul_inst;
        let val1 = x.first; x.deq;
        let val2 = y.first; y.deq;
        let rslt = val1 * val2;
        srem_1.enq (rslt);
        icmp_1.enq (rslt);
    endrule

    // %2 = srem i32 %1, %z
    rule srem_inst;
        let val1 = srem_1.first; srem_1.deq;
        let val2 = z.first; z.deq;
        let rslt = val1 % val2;
        icmp_2.enq (rslt);
    endrule
    ...
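The rule-and-FIFO structure above behaves like a small Petri net: a rule fires only when all of its input FIFOs hold a token and its output FIFOs have space. The sketch below simulates that firing discipline in software. It is an illustration under assumptions: the `Fifo` class, the rule functions, and in particular `icmp_inst` (which is elided on the slide) are hypothetical, and the simulation is not cycle-accurate. The `srem` helper follows LLVM's `srem` convention, where the remainder takes the sign of the dividend, which differs from Python's `%` for negative operands.

```python
class Fifo:
    """One-entry FIFO: a place that holds at most one token."""

    def __init__(self):
        self.items = []

    def not_empty(self):
        return len(self.items) == 1

    def not_full(self):
        return len(self.items) == 0

    def enq(self, value):
        assert self.not_full()
        self.items.append(value)

    def deq(self):
        assert self.not_empty()
        return self.items.pop()


def srem(a, b):
    """Signed remainder with the sign of the dividend (LLVM srem convention)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def mul_inst(f):
    if f["x"].not_empty() and f["y"].not_empty() and f["srem_1"].not_full() and f["icmp_1"].not_full():
        rslt = f["x"].deq() * f["y"].deq()
        f["srem_1"].enq(rslt)
        f["icmp_1"].enq(rslt)
        return True
    return False


def srem_inst(f):
    if f["srem_1"].not_empty() and f["z"].not_empty() and f["icmp_2"].not_full():
        f["icmp_2"].enq(srem(f["srem_1"].deq(), f["z"].deq()))
        return True
    return False


def icmp_inst(f):
    # Assumed shape of the elided rule: %3 = icmp slt %2, %1.
    if f["icmp_2"].not_empty() and f["icmp_1"].not_empty() and f["out_3"].not_full():
        f["out_3"].enq(f["icmp_2"].deq() < f["icmp_1"].deq())
        return True
    return False


# Rules are deliberately listed in "wrong" program order: firing follows data availability.
RULES = [("icmp_inst", icmp_inst), ("srem_inst", srem_inst), ("mul_inst", mul_inst)]


def run(x, y, z):
    f = {name: Fifo() for name in ("x", "y", "z", "srem_1", "icmp_1", "icmp_2", "out_3")}
    f["x"].enq(x)
    f["y"].enq(y)
    f["z"].enq(z)
    fired = []
    progress = True
    while progress:
        progress = False
        for name, rule in RULES:
            if rule(f):
                fired.append(name)
                progress = True
    return f["out_3"].deq(), fired


result, firing_order = run(6, 7, 5)
```

Even though the rules are scanned in reverse order, they fire in dependence order (mul, then srem, then icmp), which is the dataflow execution model the slides describe.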

High Level Synthesis Case Study

● LegUp
– LLVM 2.9
– O2
– No LTO, no LTI
– No Op Chaining
– Statically Scheduled CFG
● Our Toolchain
– LLVM 2.6
– O2
– No LTO, no LTI
– No Op Chaining
– Dynamically Scheduled VSFG
● Performance and energy evaluation by comparing with:
– LegUp HLS tool & Altera Nios II/f processor, implemented on an Altera Stratix IV GX FPGA.
– Nehalem Core i7 (Sniper interval simulator from Intel).
– In all cases, memory access latency is assumed to be 1 cycle.

Performance (Cycle Counts)

Normalised to LegUp

[Figure: bar chart of cycle counts normalised to LegUp. Benchmarks: epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, GEOMEAN. Series: LegUp (CDFG), VSFG_0, VSFG_1, VSFG_3. Y axis: 0 to 1.2. Bar labels: .99, .81, .68, 1.07, .97, .74, 1.08, .80, .88, .25, .72, .66, .87, .66, .69, .97, .68, .65]

Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)

[Figure panels: Matrix Transpose (x1k cycles); adpcm (x1k cycles); dfsin (x1k cycles); Neural Net Simulator (x1M cycles)]

Frequency & Delay

[Figure: Frequency. Benchmarks: epic*, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, bimpa, GEOMEAN. Y axis: 0 to 450. Annotation: Nios II/f @ 250MHz. Bar labels: 392, 117, 204, 124, 184, 125, 185, 124, 167, 345, 109, 218, 167, 225, 169, 184, 127, 182]

[Figure: Delay. Same benchmarks. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Y axis: 0 to 1.4. Bar labels: 1.12, .87, .64, .80, .80, .54, 1.09, .78, .81, .42, .00, .70, .68, .65, .00, 1.14, .00, .68]

Power & Speculation Overheads

[Figure: power per benchmark (epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips). Series: LegUp, VSFG_0_Eff, VSFG_1_Eff, VSFG_3_Eff. Y axis: 0 to 12. Power estimation assuming 250MHz operating frequency]

[Figure: misspeculated activity (bits) and useful activity (bits) per benchmark (epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips). Y axis: 0 to 6]

Normalized Energy

[Figure: normalized energy per benchmark (epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, GEOMEAN). Series: LegUp, VSFG_0, VSFG_1, VSFG_3, Nios. Log-scale Y axis: 0.1 to 100]
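A normalized energy figure of this kind can be derived from power and cycle counts at a fixed clock. The sketch below is a minimal illustration, assuming energy is average power multiplied by execution time and using the 250 MHz operating frequency mentioned for the power estimates; the function names and all power and cycle values are hypothetical and are not taken from the slides.

```python
def energy_joules(power_watts: float, cycles: int, freq_hz: float = 250e6) -> float:
    """Energy = average power x execution time, with time = cycles / frequency."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return power_watts * (cycles / freq_hz)


def normalised_energy(design, baseline) -> float:
    """Energy of `design` relative to `baseline`; each is a (power_watts, cycles) pair."""
    return energy_joules(*design) / energy_joules(*baseline)


# Hypothetical numbers: the design draws 3x the power but needs 2/3 of the cycles.
ratio = normalised_energy(design=(3.0, 1000), baseline=(1.0, 1500))
```

With these made-up inputs the design still uses about twice the baseline energy, showing how a cycle-count win can be outweighed by higher power.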

Sources of Energy Inefficiency

● Energy cost comparison:
– vs Nios II/f: 0.25x (GEOMEAN)
– vs LegUp: 3-4x (GEOMEAN)
● Overheads of speculation
– A balance between speculation & predication must be found for efficiency & performance
● Part of power dissipation is proportional to area
– Clock gating for predicated regions to reduce dynamic power
  ● (consider asynchronous circuits)
– Power gating for predicated regions to reduce static power?
– Selective loop unrolling.

Current Performance Limitations

● 35% better performance than statically scheduled CFG, without any optimizations:
– Improvements due to dynamic scheduling, Multiple Flows of Control (MFC) & Control Dependence Analysis (CDA)
– Unrolling helps, but speed-up saturates quickly.
● Further improvements possible:
– Balance between predication & speculation, to improve speed-up without unrolling (thus reducing area and energy costs)
– The state-edge is on the critical path: it limits both unrolling & MFC.
  ● Last remnant of the 'sequential' nature of the program.
● Frequency scaling limited by memory interconnect
– Partition memory & pipeline the memory access tree

Implicit Parallelism & State-edge Partitioning

Assertion: Implicit (deterministic) parallel programming models are essentially means of partitioning the state-edge.

[Figure: techniques arranged along two axes, "Increasing Programmer / Compiler Effort" and "Increasing Runtime Effort": Alias Analysis, Speculative Loads, OpenMP, OpenCL, Sieve C++, SpMT / TLS, Dynamic OOO LSQ]


Thank you for listening!

Questions &/or

Comments?

Overcoming Control-Flow with the VSFG

[Figure: Control-Data Flow Graph and Value State Flow Graph of the same loop, side by side. CDFG node labels: Start, i = 0, = A[i], > 0 (T/F), foo(), i++, < 100 (T/F), bar(), End. VSFG node labels: i = 0, A, STATE_IN, = A[i], > 0, foo(), F/T/P, i++, < 100, inPred, "Next iteration of 'for' loop", F/T/P, bar(), STATE_OUT]

Performance (Cycle Counts)

● Cycle counts normalized to LegUp results
● VSFG implemented with all loops unrolled 0, 1, and 3 times
● Full Speculation: all subgraphs (except loops) triggered without predicates

[Figure: Cycle Counts with Full Speculation. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Y axis: 0 to 1.6]

Performance (Cycle Counts)

Predication: only one block will execute
Speculation: both blocks execute, but only one result is chosen

[Figure: Cycle Counts with Full Speculation. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Y axis: 0 to 1.6]
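The difference between predication and speculation can be shown with two small helper functions. This is a sketch under assumptions: `predicated` and `speculated` are hypothetical names, the blocks are modelled as side-effect-free Python callables, and the `executed` list exists only to record which blocks ran.

```python
from typing import Callable, List, Tuple


def predicated(pred: bool, then_block: Callable[[], int],
               else_block: Callable[[], int]) -> Tuple[int, List[str]]:
    """Predication: wait for the predicate, then run only the selected block."""
    executed = []
    if pred:
        executed.append("then")
        result = then_block()
    else:
        executed.append("else")
        result = else_block()
    return result, executed


def speculated(pred: bool, then_block: Callable[[], int],
               else_block: Callable[[], int]) -> Tuple[int, List[str]]:
    """Speculation: run both blocks regardless of the predicate, then pick one result."""
    then_value = then_block()
    else_value = else_block()
    executed = ["then", "else"]
    result = then_value if pred else else_value  # multiplexer: only one result is chosen
    return result, executed


a = 5
pred_result = predicated(a > 0, lambda: a * 2, lambda: a - 1)
spec_result = speculated(a > 0, lambda: a * 2, lambda: a - 1)
```

Both forms return the same value; speculation does extra work on the block that is not chosen, but the blocks no longer have to wait for the predicate before starting.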

Performance (Cycle Counts)

[Figure: two charts, "Cycle Counts with Full Speculation" and "Cycle Counts with Predicated Subgraphs". Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Y axis: 0 to 1.6]

Performance (Cycle Counts)

Cycle counts per benchmark (Core i7 / Nios II/f / LegUp / VSFG_0 / VSFG_1 / VSFG_3):
small_bimpa: 39664956 / 373347552 / 142386696 / 114361494 / 98179648 / 97430648
dfsin: 104953 / 1420558 / 105773 / 72007 / 71896 / 71896
epic: 200174 / 3399634 / 1078444 / 1062436 / 528218 / 265170
adpcm: 42662 / 119794 / 71349 / 57860 / 51580 / 51186

Performance (Cycle Counts)

Cycle counts per benchmark (Core i7 / Nios II/f / LegUp / VSFG_0 / VSFG_1 / VSFG_3):
dfadd: 15994 / 16441 / 2391 / 1999 / 1590 / 1574
dfdiv: 15120 / 36487 / 3029 / 3235 / 2825 / 2639
dfmul: 14072 / 7074 / 941 / 916 / 671 / 625
mips**: 29998 / 31082 / 13414 / 14489 / 13438 / 12953
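The normalisation to LegUp and the geometric mean used in the earlier charts can be reproduced from cycle counts like these. The sketch below uses the LegUp and VSFG values labelled on the dfadd, dfdiv, dfmul and mips** charts; the function names are illustrative, and because only four benchmarks are included, the geometric mean it computes is not the GEOMEAN reported on the slides.

```python
import math

# Cycle counts as labelled on the dfadd, dfdiv, dfmul and mips** charts.
CYCLES = {
    "dfadd": {"LegUp": 2391, "VSFG_0": 1999, "VSFG_1": 1590, "VSFG_3": 1574},
    "dfdiv": {"LegUp": 3029, "VSFG_0": 3235, "VSFG_1": 2825, "VSFG_3": 2639},
    "dfmul": {"LegUp": 941, "VSFG_0": 916, "VSFG_1": 671, "VSFG_3": 625},
    "mips": {"LegUp": 13414, "VSFG_0": 14489, "VSFG_1": 13438, "VSFG_3": 12953},
}


def normalise(cycles, baseline="LegUp"):
    """Divide every configuration's cycle count by the baseline's, per benchmark."""
    return {
        bench: {config: count / row[baseline] for config, count in row.items()}
        for bench, row in cycles.items()
    }


def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


normalised = normalise(CYCLES)
geomean_vsfg3 = geomean(row["VSFG_3"] for row in normalised.values())
```

For example, dfadd with VSFG_3 normalises to roughly 0.66 of the LegUp cycle count, while dfdiv with VSFG_0 comes out slightly above 1, meaning it is slower than LegUp without unrolling.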

Understanding OOO Performance

● Control flow is the primary constraint on ILP
– Wall (1991): Conventional processors limited to ILP of 4-8!
  ● Single flow of control
  ● Branch prediction (95%+ accuracy)
– Lam & Wilson (1993), Mak & Mycroft (2009): 10x ILP possible, with:
  ● Control Dependence Analysis (CDA)
  ● Multiple Flows of Control (MFC)
● Custom hardware has very limited speculation
– Single flow of control
– If-conversion & hyperblock formation for forward branches.
– No acceleration of backwards branches!

[Figure: Control-Data Flow Graph. Node labels: Start, i = 0, = A[i], > 0 (T/F), foo(), i++, < 100 (T/F), bar(), End]