Idempotent Code Generation: Implementation, Analysis, and Evaluation

Marc de Kruijf, Karthikeyan Sankaralingam

CGO 2013, Shenzhen


Example

source code:

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

Example

assembly code:

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

[Figure: icons labeled faults, exceptions, and mis-speculations]

Example

assembly code:

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

BAD STUFF HAPPENS!

Example

assembly code:

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

R0 and R1 are unmodified

just re-execute!

convention: use checkpoints/buffers
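The re-execution argument above can be checked directly in C. The sketch below is illustrative only: `sum` mirrors the slide's function, while `sum_in_place` is a hypothetical contrast case (not from the talk) that overwrites one of its own inputs.

```c
/* Idempotent: the inputs (array, len) are only read, so re-running the
 * function from the top after a failure recomputes the same result. */
int sum(const int *array, int len) {
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += array[i];
    return x;
}

/* Hypothetical non-idempotent variant (an assumption, added for contrast):
 * it accumulates into array[0], which is an input, so a second run starts
 * from modified state. Requires len >= 1. */
int sum_in_place(int *array, int len) {
    for (int i = 1; i < len; ++i)
        array[0] += array[i];
    return array[0];
}
```

Calling `sum` twice on the same array returns the same value both times. Calling `sum_in_place` twice does not, which is the situation where a conventional scheme falls back on checkpoints or buffers.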

It’s Idempotent!

idempoh… what…?

    int sum(int *data, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += data[i];
      return x;
    }

Idempotent Region Construction
previously… in PLDI ’12

idempotent regions ALL THE TIME

[Figure: before and after]

Idempotent Code Generation
now… in CGO ’13

how do we get from here...

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

Idempotent Code Generation
now… in CGO ’13

to here...

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

Idempotent Code Generation
now… in CGO ’13

not here (this is not idempotent: the accumulator overwrites the input R1) ...

          R2 = load [R1]
          R1 = 0
    LOOP: R4 = load [R0 + R2]
          R1 = add R1, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R1
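To see why reusing R1 for the accumulator breaks re-execution, the loop can be modeled as a toy register machine. This is a minimal sketch under stated assumptions, not the authors' tooling: `run_sum`, `DEMO_MEM`, the memory layout, and the single injected failure are all hypothetical, and the loop test sits at the top rather than at the bottom as in the slide's `bnez`.

```c
/* Toy model: R0 = array base, R1 = address of len, R2 = counter, R4 = temp.
 * `acc` selects the register holding x: 3 for the idempotent allocation,
 * 1 for the allocation that overwrites the input R1. */
enum { NREG = 5 };

/* Assumed memory image: mem[0] = len = 3, elements in mem[1..3].
 * mem[4] exists only so the erroneous re-execution stays in bounds. */
const int DEMO_MEM[5] = {3, 5, 4, 2, 7};

/* If fault_at >= 0, one failure is detected at the top of loop iteration
 * fault_at. Recovery jumps back to the region entry WITHOUT restoring any
 * registers, which is only safe when the region is idempotent. */
int run_sum(const int *mem, int r0, int r1, int acc, int fault_at) {
    int R[NREG] = {0};
    int faulted = 0;
    R[0] = r0;
    R[1] = r1;
restart:
    R[2] = mem[R[1]];                 /* R2 = load [R1] */
    R[acc] = 0;                       /* x = 0 */
    for (int iter = 0; R[2] != 0; ++iter) {
        if (!faulted && iter == fault_at) {
            faulted = 1;
            goto restart;             /* just re-execute */
        }
        R[4] = mem[R[0] + R[2]];      /* R4 = load [R0 + R2] */
        R[acc] += R[4];               /* x = add x, R4 */
        R[2] -= 1;                    /* R2 = sub R2, 1 */
    }
    return R[acc];
}
```

With `DEMO_MEM`, `r0 = 0`, and `r1 = 0`, the allocation with `acc = 3` returns 11 whether or not a failure is injected at iteration 1. The allocation with `acc = 1` returns 11 without a failure but 18 with one, because the restart reloads the length through an R1 that now holds a partial sum.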

Idempotent Code Generation
now… in CGO ’13

and not here (this is slow) ...

          R3 = R1
          R2 = load [R3]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

Idempotent Code Generation
now… in CGO ’13

here...

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

Idempotent Code Generation
applications to prior work

[Figure: icons labeled faults, exceptions, and mis-speculations]

Hampton & Asanović, ICS ’06; De Kruijf & Sankaralingam, MICRO ’11; Menon et al., ISCA ’12

Kim et al., TOPLAS ’06; Zhang et al., ASPLOS ’13

De Kruijf et al., ISCA ’10; Feng et al., MICRO ’11; De Kruijf et al., PLDI ’12

Idempotent Code Generation
executive summary

(1) how do we generate efficient idempotent code?
    not covered in this talk
    algorithms made available in source code form: http://research.cs.wisc.edu/vertical/iCompiler

(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation

(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Analysis
(a) idempotent region size

[Figure: overhead (y-axis) vs. region size (x-axis)]

- number of inputs increasing
- likelihood of spills growing

- maximum spill cost reached
- amortized over more instructions

Analysis
(b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 -> R3    idempotent? YES!

(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)

(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
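The two-address versus three-address point can be illustrated by modeling each ADD as a C function over register variables. This is only a sketch: the function names are hypothetical, and pointers stand in for registers.

```c
/* Two-address form, ADD R1, R2 -> R1: the destination is also a source,
 * so the instruction overwrites one of its own inputs. */
void add_two_address(int *r1, const int *r2) {
    *r1 = *r1 + *r2;
}

/* Three-address form, ADD R1, R2 -> R3: both sources are left intact,
 * so executing it again produces the same R3. */
void add_three_address(const int *r1, const int *r2, int *r3) {
    *r3 = *r1 + *r2;
}
```

Starting from R1 = 1 and R2 = 2, running the three-address form twice leaves R3 = 3 both times, while running the two-address form twice leaves R1 = 5 instead of 3. On a two-address ISA the compiler therefore needs an extra copy to keep the original R1 alive.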

Analysis
(c) control flow side-effects

    x = ...
    ... = f(x)
    y = ...

[Figure: region boundaries, x’s live interval, and x’s “shadow interval” given no side-effects]

Analysis
(c) control flow side-effects

    x = ...
    ... = f(x)
    y = ...

[Figure: region boundaries, x’s live interval, and x’s “shadow interval” given side-effects]

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation
(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Evaluation
methodology

measurements
– performance overhead: dynamic instruction count
  – for x86, using PIN
  – for ARM, using gem5
– region size: instructions between boundaries (path length)

benchmarks
– SPEC 2006, PARSEC, and Parboil suites

Evaluation
(a) idempotent region size

[Figure: overhead (y-axis, 0% to 50%) vs. region size (x-axis)]

YOU ARE HERE (baseline: typically 10-30 instructions)

10+ instructions: 13.1% (geometric mean)

Evaluation
(a) idempotent region size

[Figure: overhead (y-axis, 0% to 50%) vs. region size (x-axis), annotated with detection latency; data label: 13.1%]

Evaluation
(a) idempotent region size

[Figure: overhead (y-axis, 0% to 50%) vs. region size (x-axis), annotated with detection latency and re-execution time; data labels: 13.1%, 11.1%, 0.06%]

Evaluation
(b) ISA characteristics

[Figure: bar chart of percentage overhead (0 to 20) for x86-64 and ARMv7 across SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL]

Three-address support matters more for FP benchmarks
Register-memory matters more for integer benchmarks

Evaluation
(c) control flow side-effects

[Figure: bar chart of percentage overhead (0 to 40), no side-effects vs. side-effects, across SPEC INT, SPEC FP, PARSEC, Parboil, namd, libquantum, and OVERALL]

substantial only in two cases; insubstantial otherwise
intuition: typically compiler already spills for control flow divergence

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation

Conclusions

(a) region size – matters a lot; large regions are ideal if recovery is infrequent
    overheads approach zero as regions grow

(b) instruction set – matters when region sizes must be small
    overheads drop below 10% only with careful co-design

(c) control flow side-effects – generally do not matter
    supporting control flow side-effects is not expensive

Conclusions

code generation and static analysis algorithms
http://research.cs.wisc.edu/vertical/iCompiler

applicability not limited to architecture design
see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”

thank you!

Back-up Slides

ISA Characteristics
more registers isn’t always enough

C code:

    x = 0;
    if (y > 0)
      x = 1;
    z = x + y;

machine code:

    R0 = 0
    if (R1 > 0)
      R0 = 1
    R2 = R0 + R1

ISA Characteristics
more registers isn’t always enough

C code:

    x = 0;
    if (y > 0)
      x = 1;
    z = x + y;

machine code:

    R0 = 0
    if (R1 > 0)
      R3 = 1
    else
      R3 = R0
    R2 = R3 + R1

need an extra instruction no matter what

ISA Characteristics
idempotence vs. fewer registers

[Figure: bar chart of percentage overhead (0 to 14) for 14-GPR, 12-GPR, 10-GPR, and baseline]

no idempotence, #GPR reduced from 16

data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)

Very Large Regions
how do we get there?

Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; hurts loops

Problem #2: loop optimizations
– boundaries in loops are bad for everyone (next slides)
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures
– awareness of array access patterns can help (next slides)

Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:

    for (i = 0; i < X; i++) {
      ...
    }

CFG + SSA:

    i0 = φ(0, i1)
    i1 = i0 + 1
    if (i1 < X)

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:

    for (i = 0; i < X; i++) {
      ...
    }

machine code:

    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

NO BOUNDARIES = NO PROBLEM

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:

    for (i = 0; i < X; i++) {
      ...
    }

machine code:

    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:

    for (i = 0; i < X; i++) {
      ...
    }

machine code:

    R1 = 0
    R0 = R1
    R1 = R0 + 1
    if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

Very Large Regions
Re: Problem #3 (array access patterns)

PLDI ’12 algorithm makes this simplifying assumption:

    [x] = a;
    b = [x];
    [x] = c;

becomes

    [x] = a;
    b = a;
    [x] = c;

non-clobber antidependences… GONE!

cheap for scalars, expensive for arrays
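The rewrite of `b = [x]` into `b = a` is ordinary store-to-load forwarding, and can be written out in C for the scalar case. The sketch below is an illustration under assumptions: the function names are hypothetical and a pointer stands in for the memory location `[x]`.

```c
/* Original pattern: store, load back, store again. The load of *x followed
 * by the second store is an antidependence, but not a clobber, because *x
 * was written inside the region before it was read. */
int store_load_store(int *x, int a, int c) {
    *x = a;
    int b = *x;
    *x = c;
    return b;
}

/* After forwarding the stored value to its use: the load is gone, and the
 * antidependence on *x goes with it. */
int store_forwarded(int *x, int a, int c) {
    *x = a;
    int b = a;
    *x = c;
    return b;
}
```

Both functions return `a` and leave `c` in `*x`, so the rewrite preserves behavior. For a single scalar it costs nothing, whereas doing the same for every element of a large array would mean tracking a forwarded value per element.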

Very Large Regions
Re: Problem #3 (array access patterns)

    // initialize:
    int array[100];
    memset(array, 0, sizeof(array));
    // accumulate:
    for (...) array[i] += foo(i);

not really practical for large arrays
but if we don’t do it, non-clobber antidependences remain

solution: handle potential non-clobbers in a post-pass
(same way we deal with loop clobbers in static analysis)