Idempotent Code Generation: Implementation, Analysis, and Evaluation
Marc de Kruijf, Karthikeyan Sankaralingam
CGO 2013, Shenzhen
Example

source code:

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Example

assembly code:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3
[Figure: things that can go wrong mid-execution: faults, exceptions (e.g. on a load), mis-speculations]
Example

assembly code (as above)

BAD STUFF HAPPENS!

Example

assembly code (as above)

R0 and R1 are unmodified: just re-execute!
convention: use checkpoints/buffers

It’s Idempotent!
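To make the "just re-execute" idea concrete at the source level, here is a minimal sketch. `sum` is the function from the slides; `sum_and_clear` is a hypothetical counter-example added only for contrast (it is not from the talk), and the `assert` precondition is likewise an addition.

```c
#include <assert.h>

/* From the slides: reads its inputs but never overwrites them, so running
   it again (for example after a failure part-way through) gives the same
   result. */
int sum(int *array, int len) {
    assert(len >= 0);
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += array[i];
    return x;
}

/* Hypothetical counter-example: each element is read and then overwritten,
   so a second run no longer sees the original inputs. */
int sum_and_clear(int *array, int len) {
    assert(len >= 0);
    int x = 0;
    for (int i = 0; i < len; ++i) {
        x += array[i];
        array[i] = 0;   /* clobbers an input */
    }
    return x;
}
```

Calling `sum` twice on the same array returns the same value both times; calling `sum_and_clear` twice does not, because the first call destroyed what the second call needed to read.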
idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}

=
Idempotent Region Construction
previously… in PLDI ’12

idempotent regions ALL THE TIME

[Figure: before / after]
Idempotent Code Generation
now… in CGO ’13

how do we get from here...

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

to here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3
not here (this is not idempotent) ...

      R2 = load [R1]
      R1 = 0
LOOP: R4 = load [R0 + R2]
      R1 = add R1, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R1
and not here (this is slow) ...

      R3 = R1
      R2 = load [R3]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3
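The difference between the idempotent variant and the one that reuses R1 as the accumulator can be checked with a small simulation. This is a sketch under assumptions, not the authors' tooling: registers are modelled as an `int` array, memory as a small `int` array indexed by address, a failure is modelled by an instruction budget running out, and the memory contents, the `STEP` macro, and all function names are made up for illustration.

```c
#include <assert.h>

enum { R0, R1, R2, R3, R4, NREGS };

/* Execute one instruction unless the budget is exhausted, in which case
   return 0 to model a failure in the middle of the region. */
#define STEP(stmt) do { if (budget-- <= 0) return 0; stmt; } while (0)

/* Idempotent variant: inputs R0 and R1 are never overwritten. */
int run_idempotent(int r[NREGS], const int *mem, int budget) {
    STEP(r[R2] = mem[r[R1]]);
    STEP(r[R3] = 0);
    do {
        STEP(r[R4] = mem[r[R0] + r[R2]]);
        STEP(r[R3] = r[R3] + r[R4]);
        STEP(r[R2] = r[R2] - 1);
    } while (r[R2] != 0);               /* bnez R2, LOOP */
    return 1;                           /* result is in R3 */
}

/* Non-idempotent variant: the accumulator clobbers input R1. */
int run_clobbering(int r[NREGS], const int *mem, int budget) {
    STEP(r[R2] = mem[r[R1]]);
    STEP(r[R1] = 0);
    do {
        STEP(r[R4] = mem[r[R0] + r[R2]]);
        STEP(r[R1] = r[R1] + r[R4]);
        STEP(r[R2] = r[R2] - 1);
    } while (r[R2] != 0);               /* bnez R2, LOOP */
    return 1;                           /* result is in R1 */
}

/* Fail after 4 instructions, then re-execute from the top WITHOUT
   restoring any register. Returns the value left in register `res`. */
int fail_then_reexecute(int (*run)(int *, const int *, int), int res) {
    /* mem[4] holds len = 3; the loop reads mem[3], mem[2], mem[1]. */
    static const int mem[8] = {0, 1, 2, 2, 3, 0, 0, 0};
    int r[NREGS] = {0};
    r[R0] = 0;                          /* base address of the array */
    r[R1] = 4;                          /* address of len */
    int done = run(r, mem, 4);          /* fails inside the loop */
    assert(!done);
    done = run(r, mem, 1000);           /* just re-execute */
    assert(done);
    return r[res];
}
```

With these inputs the correct sum is 5. `fail_then_reexecute(run_idempotent, R3)` still produces 5, while `fail_then_reexecute(run_clobbering, R1)` produces 3: the partial sum left in R1 is misread as the address of `len` on the second attempt.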
Idempotent Code Generation
applications to prior work

[Figure: faults, exceptions (e.g. on a load), mis-speculations]

Hampton & Asanović, ICS ’06; De Kruijf & Sankaralingam, MICRO ’11; Menon et al., ISCA ’12
Kim et al., TOPLAS ’06; Zhang et al., ASPLOS ’13
De Kruijf et al., ISCA ’10; Feng et al., MICRO ’11; De Kruijf et al., PLDI ’12
Idempotent Code Generation
executive summary

(1) how do we generate efficient idempotent code?
    not covered in this talk
    algorithms made available in source code form:
    http://research.cs.wisc.edu/vertical/iCompiler

(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more
Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects
Analysis
(a) idempotent region size

[Figure: overhead (y-axis) vs. region size (x-axis)]

small regions, overhead rising:
- number of inputs increasing
- likelihood of spills growing

large regions, overhead falling:
- maximum spill cost reached
- amortized over more instructions
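One way to read the falling part of this curve is a back-of-the-envelope model. It is an illustration only and does not appear in the talk: the symbols $n$ (region size in instructions), $c(n)$ (extra instructions such as spills and copies needed to keep the region idempotent), and $C_{\max}$ (the "maximum spill cost") are assumptions introduced here.

```latex
\[
\mathrm{overhead}(n) = \frac{c(n)}{n},
\qquad
c(n) \le C_{\max}
\;\Longrightarrow\;
\mathrm{overhead}(n) \le \frac{C_{\max}}{n} \to 0
\quad (n \to \infty).
\]
```

Under this reading, overhead can rise while $c(n)$ still grows faster than $n$ (more inputs, more spills), and it must fall once $c(n)$ saturates and the fixed cost is spread over more instructions.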
Analysis
(b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1   idempotent? NO!
    ADD R1, R2 -> R3   idempotent? YES!

(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)

(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
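The two-address vs. three-address point in (1) can be checked by executing each form twice, as a re-execution after a failure would. A minimal sketch, with registers modelled as plain `int` variables and all function names hypothetical:

```c
#include <assert.h>

/* Two-address form, ADD R1, R2 -> R1: the destination is also a source. */
void add_two_address(int *r1, const int *r2) {
    *r1 = *r1 + *r2;
}

/* Three-address form, ADD R1, R2 -> R3: the sources are left untouched. */
void add_three_address(const int *r1, const int *r2, int *r3) {
    *r3 = *r1 + *r2;
}

/* Execute the two-address ADD twice and return the final R1. */
int reexecute_two_address(int r1, int r2) {
    add_two_address(&r1, &r2);
    add_two_address(&r1, &r2);          /* re-execution sees a modified R1 */
    return r1;
}

/* Execute the three-address ADD twice and return the final R3. */
int reexecute_three_address(int r1, int r2) {
    int r3 = 0;
    add_three_address(&r1, &r2, &r3);
    int first = r3;
    add_three_address(&r1, &r2, &r3);   /* re-execution sees the same inputs */
    assert(r3 == first);
    return r3;
}
```

Starting from R1 = 1 and R2 = 2, the two-address form ends with 5 after two executions instead of 3, while the three-address form ends with 3 no matter how often it is repeated.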
Analysis
(c) control flow side-effects

x = ...
... = f(x)
y = ...

[Figure: region boundaries, x’s live interval, and x’s “shadow interval” given no side-effects]

[Figure: the same code; x’s “shadow interval” given side-effects]
Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects
Evaluation
methodology

measurements
– performance overhead: dynamic instruction count
  – for x86, using PIN
  – for ARM, using gem5
– region size: instructions between boundaries (path length)

benchmarks
– SPEC 2006, PARSEC, and Parboil suites
Evaluation
(a) idempotent region size

[Figure: overhead (y-axis, 0% to 50%) vs. region size (x-axis). Labels: “YOU ARE HERE (baseline: typically 10-30 instructions)”, “10+ instructions”, “13.1% (geometric mean)”, “?”]

[Figure: same axes. Labels: “detection latency”, “13.1%”, “??”]

[Figure: same axes. Labels: “detection latency”, “re-execution time”, “0.06%”, “13.1%”, “11.1%”, “?”]
Evaluation
(b) ISA characteristics

[Figure: percentage overhead (0 to 20) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; series: x86-64, ARMv7]

Three-address support matters more for FP benchmarks
Register-memory matters more for integer benchmarks
Evaluation
(c) control flow side-effects

[Figure: percentage overhead (0 to 40) for SPEC INT, SPEC FP, PARSEC, Parboil, namd, libquantum, and OVERALL; series: no side-effects, side-effects]

substantial only in two cases; insubstantial otherwise
intuition: typically compiler already spills for control flow divergence
Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
Conclusions

(a) region size – matters a lot; large regions are ideal if recovery is infrequent
    overheads approach zero as regions grow

(b) instruction set – matters when region sizes must be small
    overheads drop below 10% only with careful co-design

(c) control flow side-effects – generally do not matter
    supporting control flow side-effects is not expensive
Conclusions

code generation and static analysis algorithms
http://research.cs.wisc.edu/vertical/iCompiler

applicability not limited to architecture design
see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”

thank you!
Back-up Slides

ISA Characteristics
more registers isn’t always enough

C code:
  x = 0;
  if (y > 0) x = 1;
  z = x + y;

machine code:
  R0 = 0
  if (R1 > 0)
    R0 = 1
  R2 = R0 + R1

machine code (R0 not overwritten):
  R0 = 0
  if (R1 > 0)
    R3 = 1
  else
    R3 = R0
  R2 = R3 + R1

need an extra instruction no matter what
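The two machine-code shapes above can be written out in C to confirm that they compute the same `z` and to show where the extra instruction lands. This is a sketch that assumes the point of the slide is that R0 must not be overwritten once written (for example because a region boundary follows `R0 = 0`); the function names are hypothetical and registers are modelled as local variables.

```c
#include <assert.h>

/* Direct translation: R0 is written a second time on the taken path. */
int z_overwriting(int y) {
    int r0, r1 = y, r2;
    r0 = 0;
    if (r1 > 0)
        r0 = 1;          /* second write to R0 */
    r2 = r0 + r1;
    return r2;
}

/* No register is overwritten; the not-taken path pays for a copy. */
int z_no_overwrite(int y) {
    int r0, r1 = y, r2, r3;
    r0 = 0;
    if (r1 > 0)
        r3 = 1;
    else
        r3 = r0;         /* the extra instruction */
    r2 = r3 + r1;
    return r2;
}
```

Having a spare register (R3) removes the overwrite but not the copy, which is the sense in which more registers are not always enough.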
ISA Characteristics
idempotence vs. fewer registers

[Figure: percentage overhead (0 to 14); bars: 14-GPR, 12-GPR, 10-GPR, baseline; annotation: no idempotence, #GPR reduced from 16]

data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)
Very Large Regions
how do we get there?

Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; hurts loops

Problem #2: loop optimizations
– boundaries in loops are bad for everyone (next slides)
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures
– awareness of array access patterns can help (next slides)

Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above
Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:
  for (i = 0; i < X; i++) { ... }

CFG + SSA:
  i0 = φ(0, i1)
  i1 = i0 + 1
  if (i1 < X)

machine code:
  R0 = 0
  R0 = R0 + 1
  if (R0 < X)

NO BOUNDARIES = NO PROBLEM

machine code:
  R1 = 0
  R0 = R1
  R1 = R0 + 1
  if (R1 < X)

– “redundant” copy
– extra boundary (pressure)
39
Re: Problem #3 (array access patterns)
[x] = a;b = [x];[x] = c;
[x] = a;b = a;[x] = c;
non-clobber antidependences… GONE!
PLDI ‘12 algorithm makes this simplifying assumption:
cheap for scalars, expensive for arrays
Very Large Regions
40
Re: Problem #3 (array access patterns)
not really practical for large arraysbut if we don’t do it, non-clobber antidependences remain
solution: handle potential non-clobbers in a post-pass(same way we deal with loop clobbers in static analysis)
// initialize:int[100] array;memset(&array, 100*4, 0);// accumulate:for (...) array[i] += foo(i);
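A runnable version of the initialize/accumulate pattern shows why the antidependence on `array[i]` is harmless when the initialization is part of the same region. This is a sketch under assumptions: `foo` is a made-up stand-in for the slide's `foo(i)`, the loop bounds are filled in as the full array, and treating each function as one region is an illustration rather than what the compiler necessarily does.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* Hypothetical stand-in for foo(i) on the slide. */
int foo(int i) { return i + 1; }

/* Initialize and accumulate as ONE region: every element is written by
   memset before the loop reads it, so the read-then-write in the loop
   does not clobber a region input. */
void init_and_accumulate(int array[N]) {
    memset(array, 0, N * sizeof array[0]);
    for (int i = 0; i < N; ++i)
        array[i] += foo(i);
}

/* Accumulate alone as a region: array[i] is read before it is written,
   so re-executing the region accumulates a second time. */
void accumulate_only(int array[N]) {
    for (int i = 0; i < N; ++i)
        array[i] += foo(i);
}
```

Running `init_and_accumulate` twice leaves `array[9] == 10`, as a single run would. Running `accumulate_only` twice on a zeroed array leaves `array[9] == 20`.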