Spatial Computation
Mihai Budiu, CMU CS
Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze
Ph.D. Thesis defense, December 8, 2003
SCS
2
Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware.
3
Thesis Statement
Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
4
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
5
CPU Problems
• Complexity
• Power
• Global Signals
• Limited ILP
6
Design Complexity
from Michael Flynn's FCRC 2003 talk (source: S. Malik, orig. Sematech)
[Chart, 1981–2009: logic transistors/chip grow at 58%/year while design productivity (transistors/staff-month) grows at only 21%/year. Design time: CAD productivity favors FPL.]
7
Communication vs. Computation
gate delay: 5ps; wire delay: 20ps
Power consumption on wires is also dominant
8
Our Approach: ASH
Application-Specific Hardware
9
Resource Binding Time
[Diagram: CPU vs. ASH — the times at which programs are bound to hardware resources differ.]
10
Hardware Interface
CPU: software above the ISA, hardware below it.
ASH: software above a virtual ISA, hardware (gates) below it.
11
Application-Specific Hardware
C program → Compiler → Dataflow IR → Reconfigurable/custom hw
12
Contributions
[Diagram spanning theory and systems: compilation, computer architecture, reconfigurable computing, embedded systems, asynchronous circuits, high-level synthesis, dataflow machines, nanotechnology.]
13
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
14
Computation = Dataflow
• Operations → functional units
• Variables → wires
• No interpretation

Programs:
x = a & 7;
...
y = x >> 2;

Circuits: [dataflow graph: a and 7 feed &, producing x; x and 2 feed >>, producing y]
15
Basic Operation
[A functional unit (+) with data, valid, and ack signals and an output latch.]
16
Asynchronous Computation
[Animation, steps 1–8: data flows through a chain of + units; each transfer uses the data/valid/ack handshake between latches.]
17
Distributed Control Logic
[Diagram: instead of a global FSM, rdy/ack asynchronous control between units — short, local wires.]
18
Forward Branches
if (x > 0) y = -x;
else y = b*x;
[Circuit: both − and * are always evaluated; the > comparison drives a multiplexer that selects y. The multiplier is on the critical path.]
Conditionals → Speculation
19
Control Flow → Data Flow
[Primitives: Merge (label) combines data from alternative paths; Gateway admits data under a predicate; Split (branch) steers data according to predicate p.]
20
Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Circuit: merges feed i and sum around back edges; +1 and < 100 update and test i; * and + accumulate sum; when the predicate fails, sum is steered to ret.]
21
Predication and Side-Effects
[A Load unit with addr, data, pred, and token ports; tokens flow to memory.]
• no speculation
• sequencing of side-effects
22
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
23
Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
24
Availability Dataflow Analysis
y = a*b;
...
if (x) {
    ...
    ... = a*b;   /* available: reuse y */
}
25
Dataflow Analysis Is Conservative
if (x) {
    ...
    y = a*b;
}
...
... = a*b;   /* y? — available only if x was true */
26
Static Instantiation, Dynamic Evaluation
flag = false;
if (x) {
...
y = a*b;
flag = true;
}
...
... = flag ? y : a*b;
27
SIDE Register Promotion Impact
[Charts: percent reduction in dynamic stores (top, "%st promo" vs. "%st PRE", 0–30%, one bar at 53%) and loads (bottom, "% ld promo" vs. "% ld PRE", 0–45%) across Mediabench (adpcm, gsm, epic, mpeg2, jpeg, pegwit, g721, pgp, rasta, mesa) and SpecInt (099.go through 300.twolf) benchmarks.]
28
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
29
Performance Evaluation
[Setup: ASH connected through an LSQ with limited bandwidth to L1 (8K), L2 (1/4M), and memory; baseline CPU: 4-way OOO with the same memory hierarchy.]
Assumption: all operations have the same latency.
30
Media Kernels, vs 4-way OOO
[Chart: "times faster" than the 4-way OOO per Mediabench kernel (adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, rasta), mostly 0.5×–3×; a few bars exceed the scale (annotated 12, 5.8, 5.8).]
31
Media Kernels, IPC
[Chart: base (OOO) IPC vs. ASH IPC per kernel; base IPC stays under 4, while ASH IPC reaches about 25.]
32
Speed-up IPC Correlation
[Chart: per-kernel speed-up and IPC ratio ("times bigger") track each other closely, mostly within 0–10 (one value at 12).]
33
Low-Level Evaluation
C → CASH core → Verilog back-end → Synopsys/Cadence place & route → ASIC
180nm std. cell library, 2V (~1999 technology).
Results shown so far (and all results in the thesis) use the CASH core; the next two slides use the full ASIC flow.
34
Area
[Chart: synthesized area per kernel, 0–12 square mm. Reference: P4 in 180nm has 217 mm².]
35
Power
vs. 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6W
[Power ratio ("times smaller than OOO"): adpcm_d 70, g721_d 41, g721_e 41, gsm_d 129, gsm_e 147, jpeg_d 94, mpeg2_d 121, mpeg2_e 136, pegwit_d 303, pegwit_e 303.]
36
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
37
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
38
Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Circuit as before, with a pipelined multiplier (8 stages); animation starts at cycle=1.]
39
[Animation, cycles 2–5: successive iterations (i=0, i=1, ...) flow through the multiplier pipeline; pipeline balancing is needed.]
43
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
44
This Is Obvious!
ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good).
45
SpecInt95, ASH vs 4-way OOO
[Chart: percent slower/faster than the 4-way OOO on SpecInt95 (099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex), ranging from about −50% to +30%.]
Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
[Circuit: i+1 and the < comparison are ANDed with !exception to form the loop predicate.]
Predicted not taken: effectively a noop for the CPU! The result is available before the inputs.
Predicted taken.
[The ASH critical path goes through the exception check; the CPU critical path does not.]
47
SpecInt95, perfect prediction
[Chart: percent slower/faster on SpecInt95, baseline vs. perfect prediction, ranging from about −60% to +60%; no data for some benchmarks.]
ASH Problems
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is "far"
• Fully static:
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient
• ...
49
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
50
Outline
Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
51
Strengths
ASH: low power; simple verification?; specialized to app.; unlimited ILP; simple hardware; no fixed window.
CPU: economies of scale; highly optimized; branch prediction; control speculation; full-dataflow; global signals/decisions.
52
Conclusions
• Compiling “around the ISA” is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.
53
Backup Slides
• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic Critical Path
• Memory PRE
• Critical path analysis
• CPU + ASH
54
Control Logic
[C-elements and a register generate rdy_out/ack_out from rdy_in/ack_in, latching data_in to data_out.]
55
Last-Arrival Events
[A + unit with data, valid, and ack signals.]
• The event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges
56
Dynamic Critical Path
1. Start from the last node
2. Trace back along last-arrival edges
3. Some edges may repeat
57
Critical Paths
if (x > 0) y = -x;
else y = b*x;
[Circuit: the critical path runs through the multiplier even when the "then" side is selected.]
58
Lenient Operations
if (x > 0) y = -x;
else y = b*x;
[A lenient multiplexer produces y as soon as the selected input is known.]
Solve the problem of unbalanced paths.
59
Pipelining (continued)
[Animation, cycles 6–7: i's loop runs ahead of sum's loop through the long-latency multiplier pipe; the predicate ack edge is on the critical path.]
62
Pipeline Balancing
[A decoupling FIFO between i's loop and sum's loop takes the predicate ack edge off the critical path (cycle=7).]
64
Register Promotion
*p=… (p1);  …=*p (p2)   ⇒   *p=… (p1);  …=*p (p2 ∧ ¬p1)
The load is executed only if the store is not.
65
Register Promotion (2)
*p=… (p1);  …=*p (p2)   ⇒   *p=… (p1);  …=*p (false)
• When p2 ⇒ p1 the load becomes dead...
• ...i.e., when the store dominates the load in the CFG
66
≈ PRE
…=*p (p1);  …=*p (p2)   ⇒   …=*p (p1 ∨ p2)
This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
67
Store-store (1)
*p=… (p1);  *p=… (p2)   ⇒   *p=… (p1 ∧ ¬p2);  *p=… (p2)
• When p1 ⇒ p2 the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG
68
Store-store (2)
*p=… (p1);  *p=… (p2)   ⇒   *p=… (p1 ∧ ¬p2);  *p=… (p2)
• The token edge is eliminated, but...
• ...the transitive closure of tokens is preserved
69
A Code Fragment
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    Y[i] = X[j].q;
}
SpecINT95: 124.m88ksim, init_processor, stylized
70
Dynamic Critical Path
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Annotated circuit: the load predicate, loop predicate, sizeof(X[j]) increment, and the definition of X[j].r lie on the dynamic critical path.]
71
MIPS gcc Code
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

LOOP:
L1: beq   $v0,$a1,EXIT  ; X[j].r == i
L2: addiu $v1,$v1,20    ; &X[j+1].r
L3: lw    $v0,0($v1)    ; X[j+1].r
L4: addiu $a0,$a0,1     ; j++
L5: bne   $v0,$a3,LOOP  ; X[j+1].r == 0xF
EXIT:

L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence.
72
If Branch Prediction Correct
L1 → L2 → L3 → L5 → L1: the superscalar is issue-limited! 2 cycles/iteration sustained.
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF
EXIT:
73
Critical Path with Prediction
Loads are not speculative.
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
74
Prediction + Load Speculation
~4 cycles! The load is not pipelined (self-anti-dependence through its ack edge).
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
75
OOO Pipe Snapshot
[Pipeline occupancy table (stages IF, DA, EX, WB, CT): several copies of L1, L2, L3, L5 in flight at once, enabled by register renaming.]
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF
EXIT:
76
Unrolling?
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
[Annotation: when 1 iteration.]
77
Ideal Architecture
[CPU handles low-ILP computation + OS + VM; ASH handles high-ILP computation; both connect to Memory.]