Pragmatic Optimization in Modern Programming - Demystifying the Compiler

Page 1: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

1

PRAGMATIC OPTIMIZATION
IN MODERN PROGRAMMING: DEMYSTIFYING A COMPILER

Created by Marina (geek) Kolpakova for UNN / 2015-2016

Page 2: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

2

COURSE TOPICS

- Ordering optimization approaches
- Demystifying a compiler
- Mastering compiler optimizations
- Modern computer architecture concepts

Page 3: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

3

OUTLINE

- Compilation trajectory
- Intermediate language
- Dealing with local variables
- Link-time and whole-program optimization
- Optimization levels
- Compiler optimization taxonomies
  - Classic
  - Scope
  - Code pattern
- How to get feedback from optimization?
- Compiler optimization challenges
- Summary

Page 4: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

4

EXECUTABLE GENERATION PHASES

Pre-processing. Pre-process, but don't compile.

    gcc -E test.cc
    cl /E test.cc

Compilation. Compile, but don't assemble.

    gcc -S test.cc
    cl /FA test.cc

Assembling. Assemble, but don't link.

    gcc -c test.cc
    cl /c test.cc

Linking. Link object files to generate the executable.

    gcc test.cc
    cl test.cc

Page 5: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 1

COMPILATION TRAJECTORY

Lexical Analysis
scans the source code as a stream of characters, converting it into lexemes (tokens).

Syntax Analysis
takes the tokens produced by lexical analysis as input and generates a syntax tree. Source code grammar (syntactical correctness) is checked here.
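As a sketch of what lexical analysis produces (not any real compiler's lexer; all names here are illustrative), a hand-rolled tokenizer can split an arithmetic expression into identifier, number and operator tokens:

```c
#include <ctype.h>

/* Minimal sketch of lexical analysis: classify the lexemes of an
 * expression like "a + 42 * b" and return the token count. */
enum tok { TOK_IDENT, TOK_NUM, TOK_OP };

int tokenize(const char *src, enum tok *out, int max)
{
    int n = 0;
    while (*src && n < max) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (isalpha((unsigned char)*src)) {          /* identifier lexeme */
            while (isalnum((unsigned char)*src)) src++;
            out[n++] = TOK_IDENT;
        } else if (isdigit((unsigned char)*src)) {   /* number lexeme */
            while (isdigit((unsigned char)*src)) src++;
            out[n++] = TOK_NUM;
        } else {                                     /* single-char operator */
            src++;
            out[n++] = TOK_OP;
        }
    }
    return n;
}
```

The syntax analyzer would then consume exactly this token stream to build the tree.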

Page 6: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 2

COMPILATION TRAJECTORY

Semantic Analysis
checks whether the constructed syntax tree follows the language rules (including type checking).

Intermediate Code Generation
builds a program representation for some abstract machine. It sits in between the high-level language and the target machine language.

Page 7: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 3

COMPILATION TRAJECTORY

Code Optimization
optimizes the intermediate code (e.g., redundancy elimination).

Code Generation
takes the optimized representation of the intermediate code and maps it to the target machine language.

Page 8: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

6

FRONTEND AND BACKEND

- Only a backend is required to support a new machine.
- Only a frontend is required to support a new language.
- Most optimizations resemble each other for all targets and can be applied in between the frontend and backend.

Page 9: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 1

INTERMEDIATE LANGUAGE

Optimization techniques become much easier to conduct at the level of intermediate code. Modern compilers usually use 2 levels of intermediate representation (IR).

Page 10: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 2

INTERMEDIATE LANGUAGE

High Level IR
is close to the source and can be easily generated from the source code. Some code optimizations are possible. It is not very suitable for target machine optimization.

Low Level IR
is close to the target machine and is used for machine-dependent optimizations: register allocation, instruction selection, peephole optimization.

Page 11: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 3

INTERMEDIATE LANGUAGE

- Language-specific, to be used for JIT compilation later: Java byte code, .NET CLI, NVIDIA PTX.
- Language-independent, like three-(four-)address code (similar to a classic RISC ISA).

    a = b + c * d + c * d;

Three-Address Code (TAC):

    r1 = c * d; r2 = b + r1; r3 = r2 + r1; a = r3

Here each ri is an abstract register.
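The three-address sequence can be checked against the source expression directly. Written out as C (a sketch; the function names are illustrative only), it also makes the common subexpression elimination visible: c*d is computed once and reused:

```c
/* Sketch: evaluate a = b + c*d + c*d directly and via the slide's
 * three-address code. */
int direct(int b, int c, int d)
{
    return b + c*d + c*d;
}

int via_tac(int b, int c, int d)
{
    int r1 = c * d;    /* r1 = c * d   (computed once, reused twice) */
    int r2 = b + r1;   /* r2 = b + r1 */
    int r3 = r2 + r1;  /* r3 = r2 + r1 */
    return r3;         /* a  = r3 */
}
```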

Page 12: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 4

THREE-ADDRESS CODE

Quadruples have four fields:

    Op  arg1  arg2  result
    *   c     d     r1
    +   b     r1    r2
    +   r2    r1    r3
    =   r3          a

Triples (or indirect triples) have three fields:

    Op  arg1  arg2
    *   c     d
    +   b     (0)
    +   (1)   (0)
    =   (2)

Page 13: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 5

INTERMEDIATE LANGUAGE

Provides a frontend-independent code representation.

- GNU Compiler Collection: GENERIC and GIMPLE (-fdump-tree-all, -fdump-tree-optimized, -fdump-tree-ssa, -fdump-rtl-all)
- clang and other LLVM-based compilers: LLVM IR (-emit-llvm)
- Visual Studio cl.exe: CIL (C Intermediate Language)

Page 14: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 6

INTERMEDIATE LANGUAGE

    uint32_t gray2rgba_v1(uint8_t c)
    {
        return c + (c<<8) + (c<<16) + (c<<24);
    }

    $ clang -Os -S -emit-llvm test.c -o test.ll
    $ cat test.ll

    define i32 @gray2rgba_v1(i8 zeroext %c) #0 {
      %1 = zext i8 %c to i32
      %2 = mul i32 %1, 16843009
      ret i32 %2
    }

    gray2rgba_v1:
      movzbl %dil, %eax
      imull  $16843009, %eax, %eax
      ret
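The multiply by 16843009 that clang derived is the byte-replication constant 0x01010101. A quick check (function names here are illustrative, not from the slide) that it matches the shift-and-add form for every gray value:

```c
#include <stdint.h>

/* Sketch: the shift-and-add form from the slide, written with an
 * explicit unsigned widening, and the multiply clang reduced it to.
 * 0x01010101 replicates the gray byte into all four channels. */
uint32_t gray2rgba_shifts(uint8_t c)
{
    uint32_t u = c;
    return u + (u << 8) + (u << 16) + (u << 24);
}

uint32_t gray2rgba_mul(uint8_t c)
{
    return (uint32_t)c * 16843009u;  /* 16843009 == 0x01010101 */
}
```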

Page 15: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

8 . 1

DEALING WITH LOCAL VARIABLES

The compiler doesn't care how many variables are used in the code; register allocation is done after the IR transformations.

    for( ; j <= roi.width - 4; j += 4 )
    {
        uchar t0 = tab[src[j]];
        uchar t1 = tab[src[j+1]];
        dst[j] = t0; dst[j+1] = t1;
        t0 = tab[src[j+2]];
        t1 = tab[src[j+3]];
        dst[j+2] = t0; dst[j+3] = t1;
    }

Page 16: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

8 . 2

DEALING WITH LOCAL VARIABLES

    .lr.ph4:                                 ; preds = %0, %.lr.ph4
      %indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [ 0, %0 ]
      %6 = getelementptr inbounds i8* %src, i64 %indvars.iv5
      %7 = load i8* %6, align 1, !tbaa !1
      %8 = zext i8 %7 to i64
      %9 = getelementptr inbounds i8* %tab, i64 %8
      %10 = load i8* %9, align 1, !tbaa !1
      %11 = or i64 %indvars.iv5, 1
      %12 = getelementptr inbounds i8* %src, i64 %11
      %13 = load i8* %12, align 1, !tbaa !1
      %14 = zext i8 %13 to i64
      %15 = getelementptr inbounds i8* %tab, i64 %14
      %16 = load i8* %15, align 1, !tbaa !1
      %17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5
      store i8 %10, i8* %17, align 1, !tbaa !1
      %18 = getelementptr inbounds i8* %dst, i64 %11
      store i8 %16, i8* %18, align 1, !tbaa !1
      %19 = or i64 %indvars.iv5, 2
      ; ...
      %28 = zext i8 %27 to i64
      %29 = getelementptr inbounds i8* %tab, i64 %28
      %30 = load i8* %29, align 1, !tbaa !1
      %31 = getelementptr inbounds i8* %dst, i64 %19
      store i8 %24, i8* %31, align 1, !tbaa !1
      %32 = getelementptr inbounds i8* %dst, i64 %25
      store i8 %30, i8* %32, align 1, !tbaa !1
      %indvars.iv.next6 = add nuw nsw i64 %indvars.iv5, 4
      %33 = trunc i64 %indvars.iv.next6 to i32
      %34 = icmp sgt i32 %33, %1
      br i1 %34, label %..preheader_crit_edge, label %.lr.ph4

Page 17: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

9 . 1

LINK-TIME OPTIMIZATION (LTO)

Perform inter-procedural optimizations during linking.

Most compilers support this feature:

- clang (-flto)
- gcc (-flto), starting with 4.9
- cl.exe (/GL, /LTCG)
- ...

Page 18: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

9 . 2

WHOPR: WHOLE PROGRAM OPTIMIZATION

1. Compile each source file separately, adding extra information to the object file.
2. Analyze the information collected from all object files.
3. Perform a second optimization phase to generate object files.
4. Link the final binary.

- Eliminates even more redundant code.
- Compilation is better parallelized on multi-core systems.

Page 19: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 1

OPTIMIZATION LEVELS

-O0 (the default): No optimization
generates unoptimized code but has the fastest compilation time.

-O1: Moderate optimization
optimizes reasonably well but does not degrade compilation time significantly.

-O2: Full optimization
generates highly optimized code and has the slowest compilation time.

Page 20: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 2

OPTIMIZATION LEVELS

-O3: Aggressive optimization
employs more aggressive automatic inlining of subprograms within a unit and attempts to vectorize.

-Os: Optimizes with a focus on program size
enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

Page 21: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 3

ENABLED OPTIMIZATIONS: GCC -O0
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fcommon -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs

Page 22: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 4

ENABLED OPTIMIZATIONS: GCC -O1
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fdefer-pop -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fident -fif-conversion -fif-conversion2 -finline -finline-atomics -finline-functions-called-once -fipa-profile -fipa-pure-const -fipa-reference -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs

Page 23: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 5

ENABLED OPTIMIZATIONS: GCC -O2
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper

Page 24: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 6

ENABLED OPTIMIZATIONS: GCC -O3
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-loop-vectorize -ftree-parallelize-loops= -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funswitch-loops -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper

Page 25: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

11

CLASSIC COMPILER OPTIMIZATION TAXONOMY

Machine independent: applicable across a broad range of machines.
1. Eliminate redundant computations, dead code
2. Reduce running time and space
3. Decrease the ratio of overhead to real work
4. Specialize code on a context
5. Enable other optimizations

Machine dependent: capitalize on specific machine properties.
1. Manage or hide latency
2. Manage resources (registers, stack)
3. Improve mapping from IR to the concrete machine
4. Use some exotic instructions (e.g. VLDM)

Page 26: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 1

SCOPE COMPILER OPTIMIZATION TAXONOMY

Interprocedural optimizations
consider the whole translation unit; involve analysis of dataflow and dependency graphs.

Intraprocedural optimizations
consider the whole procedure; involve analysis of dataflow and dependency graphs.

Page 27: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 2

SCOPE COMPILER OPTIMIZATION TAXONOMY

Global optimizations
consider the innermost code block together with its context. Loop optimizations belong here.

Local optimizations
consider a single block; the analysis is limited to it.

Peephole optimizations
map one or more consecutive operations from the IR to machine code.

Page 28: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 3

INTERPROCEDURAL OPTIMIZATIONS (IPO)

Look at all routines in a translation unit in order to make optimizations across routine boundaries, including but not limited to inlining and cloning. Also called Interprocedural Analysis (IPA).

The compiler can move, optimize, restructure and delete code between procedures, and even between different source files if LTO is enabled.

- Inlining: replacing a subroutine call with the replicated code of it
- Cloning: optimizing logic in a copied subroutine for a particular call
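What inlining buys can be written out by hand. In this sketch (the function names are illustrative), replacing the call with the callee's body exposes the constant arguments to folding:

```c
/* Sketch of inlining: gray() is small and static, so the compiler can
 * substitute its body into the caller and fold the constants. */
static int gray(int r, int g, int b)
{
    return (r + g + b) / 3;
}

int caller_with_call(void)
{
    return gray(30, 60, 90);    /* a real call before inlining */
}

int caller_inlined(void)
{
    return (30 + 60 + 90) / 3;  /* what the optimizer effectively emits: 60 */
}
```

Cloning goes one step further: the copied callee itself is specialized for the known arguments of a particular call site.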

Page 29: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

13

PATTERN COMPILER OPTIMIZATION TAXONOMY

- Dependency chains (linear code)
- Branches
- Loop bodies
  - Single loop
  - Loop and branch
  - Multi-loop
- Function calls to subroutines

Page 30: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 1

HOW TO GET OPTIMIZATION FEEDBACK?

Check the wall time of your application.
- If a compiler has done its job well, you'll see performance improvements.

Dump the assembly of your code (and/or the IL).
- Ensure instruction and register scheduling.
- Check for extra operations and register spills.

See the compiler optimization report.
- All the compilers have some support for it.
- Some of them are able to generate very detailed reports about loop unrolling, auto-vectorization, VLIW slot scheduling, etc.

Page 31: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 2

COMMONLY CONSIDERED METRICS

Wall(-clock) time
is a human perception of the span of time from the start to the completion of a task.

Power consumption
is the electrical energy which is consumed to complete a task.

Processor time (or runtime)
is the total execution time during which a processor was dedicated to a task (i.e. executing instructions of that task).
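Processor time can be sampled portably with clock() from <time.h>, which counts CPU time dedicated to this process rather than elapsed wall time. A minimal sketch (the helper names are illustrative):

```c
#include <time.h>

/* Sketch: measure the processor time a task consumes. */
double cpu_seconds(void (*task)(void))
{
    clock_t t0 = clock();
    task();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* A trivial CPU-bound task to measure; volatile keeps the loop alive. */
void spin(void)
{
    volatile long i;
    for (i = 0; i < 1000000; i++)
        ;
}
```

On a loaded machine wall time and processor time diverge: a task waiting on I/O accumulates wall time but almost no processor time.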

Page 32: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 3

DUMPING ASSEMBLY

Assembly is a must-have to check the compiler, but it is rarely used to write low-level code.

    $ gcc code.c -S -o asm.s

- Assembly writing is the least portable optimization.
- Inline assembly limits compiler optimizations.
- Assembly does not give an overwhelming speedup nowadays.
- Sometimes it is needed to overcome compiler bugs and optimization limitations.

Page 33: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 4

EXAMPLE: GCC FEEDBACK OPTIONS

Enables optimization information printing:

    -fopt-info
    -fopt-info-<optimized/missed/note/all>
    -fopt-info-all-<ipa/loop/inline/vec/optall>
    -fopt-info=filename

Controls the amount of debugging output the scheduler prints on targets that use instruction scheduling:

    -fopt-info -fsched-verbose=n

Controls the amount of output from the auto-vectorizer:

    -ftree-vectorizer-verbose=n

Page 34: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 5

EXAMPLES: GCC FEEDBACK OPTIONS

Outputs all optimization info to stderr:

    gcc -O3 -fopt-info

Outputs a missed-optimization report from all the passes to missed.txt:

    gcc -O3 -fopt-info-missed=missed.txt

Outputs information about missed optimizations as well as optimized locations from all the inlining passes to inline.txt:

    gcc -O3 -fopt-info-inline-optimized-missed=inline.txt

Page 35: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 6

GCC FEEDBACK EXAMPLE

    ./src/box.cc:193:9: note: loop vectorized
    ./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:96:9: note: loop vectorized
    ./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:51:9: note: loop vectorized
    ./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:193:9: note: loop with 7 iterations completely unrolled
    ./src/box.cc:32:13: note: loop with 7 iterations completely unrolled
    ./src/box.cc:96:9: note: loop with 15 iterations completely unrolled
    ./src/box.cc:51:9: note: loop with 15 iterations completely unrolled
    ./src/box.cc:584:9: note: loop vectorized
    ./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:482:9: note: loop vectorized
    ./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:463:5: note: loop vectorized
    ./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment

Page 36: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 1

POINTER ALIASING

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2* *yp;
    }

ARE THEY ALWAYS EQUAL?

Page 37: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 2

POINTER ALIASING

What if..

    int main(int argc, char** argv)
    {
        int i = 5, j = 5;
        twiddle1(&i, &i);
        twiddle2(&j, &j);

        printf("twiddle1 result is %d\n", i);
        printf("twiddle2 result is %d\n", j);
    }

twiddle1's result is 20, while twiddle2's result is 15.

Page 38: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 3

POINTER ALIASING

Aliasing refers to the situation where the same memory location can be accessed using different names.

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2* *yp;
    }

Page 39: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 4

STRICT ALIASING ASSUMPTION

Strict aliasing is an assumption, made by a C (or C++) compiler, that dereferenced pointers to objects of different types never refer to the same memory location.

This assumption enables more aggressive optimization (gcc makes it starting from -O2), but a programmer has to follow the strict aliasing rules to get code working correctly.
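Under the strict aliasing rules, inspecting an object through a pointer of an unrelated type is undefined behavior; the conventional safe alternative is memcpy, which optimizing compilers reduce to a plain register move. A sketch (the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: strict-aliasing-safe type punning. Copying the bytes with
 * memcpy is defined behavior, unlike casting &f to uint32_t*. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```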

Page 40: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 5

STRICT ALIASING ASSUMPTION

    void check(int32_t *h, int64_t *k)
    {
        *h = 5;
        *k = 6;
        printf("%d\n", *h);
    }

    void main()
    {
        int64_t k;
        check((int32_t *)&k, &k);
    }

    gcc -O1 test.c -o test; ./test

results in 6.

    gcc -O2 test.c -o test; ./test

results in 5.

Page 41: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 6

POINTER ALIASING: MISSED OPPORTUNITIES

- The compiler freely schedules arithmetic, but often preserves the order of memory dereferences.
- The compiler is limited in redundancy elimination.
- The compiler is limited in loop unrolling.
- The compiler is limited in auto-vectorization.
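C99's restrict qualifier is the standard way to hand the missing aliasing information to the compiler: it promises that, within the function, the pointed-to objects are not accessed through any other pointer, so a twiddle1-style body may legally be optimized into the twiddle2 form. A sketch (the function name is illustrative):

```c
/* Sketch: with restrict the programmer asserts xp and yp never alias,
 * so the compiler may keep *yp in a register and merge the two adds.
 * Calling this with aliasing pointers would be undefined behavior. */
void twiddle3(int *restrict xp, int *restrict yp)
{
    *xp += *yp;
    *xp += *yp;
}
```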

Page 42: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 1

FUNCTION CALLS

    int callee();
    int caller()
    {
        return callee() + callee();
    }

    int callee();
    int caller()
    {
        return 2*callee();
    }

ARE THEY EQUAL?
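They are equal only if callee() has no side effects. A sketch with a stateful callee (names are illustrative) shows exactly where the two forms diverge, and why the compiler cannot merge the calls without knowing the callee is pure:

```c
/* Sketch: a side-effecting callee makes the two forms differ. */
static int counter;

static int callee(void)
{
    return ++counter;            /* hidden state: not a pure function */
}

int caller_two_calls(void)
{
    counter = 0;
    return callee() + callee();  /* 1 + 2 = 3 */
}

int caller_one_call(void)
{
    counter = 0;
    return 2 * callee();         /* 2 * 1 = 2 */
}
```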

Page 43: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 2

FUNCTION CALLS

    int callee(int i);
    int caller()
    {
        int s=0, i=0;
        for ( ; i < N; i++)
            s += callee(i);
        return s;
    }

    int callee(int i);
    int caller()
    {
        int s0=0, s1=0, i=0;
        for ( ; i < N; i+=2)
        {
            s0 += callee(i);
            s1 += callee(i+1);
        }
        return s0 + s1;
    }

ARE THEY EQUAL?

Page 44: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 3

PURE FUNCTIONS

A pure function is a function for which both of the following statements are true:

1. The function always evaluates the same result given the same argument value(s). The function result must not depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor on any external input from I/O devices.

2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.

Page 45: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 4

PURE FUNCTIONS

- Pure functions are much easier to optimize. Expressing ideas in code as pure functions simplifies the compiler's life.
- Most functions from math.h are not pure (they set/clear floating point flags and conditions, and throw floating point exceptions).
- Use the constexpr keyword in C++11 to hint to the compiler that a function could be evaluated at compile time.
- Use the static keyword to help the compiler see all the usages of a function (and perform aggressive inlining, or even deduce whether the function is pure or not).
- Neither constexpr nor static guarantees that a function is pure, but they give the compiler some hints.
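GCC and clang also accept explicit purity annotations: __attribute__((const)) (result depends on the arguments only) and __attribute__((pure)) (may read global memory, but has no side effects). A sketch with illustrative function names:

```c
/* Sketch: purity hints understood by GCC and clang. With these
 * attributes the compiler may merge or hoist repeated calls. */
__attribute__((const))
static int square(int x)
{
    return x * x;                /* depends on the argument only */
}

__attribute__((pure))
static int sum(const int *a, int n)
{
    int s = 0, i;
    for (i = 0; i < n; i++)      /* reads memory, but no side effects */
        s += a[i];
    return s;
}
```

The attributes are promises, not checks: annotating a function that actually has side effects leads to miscompilation.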

Page 46: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 5

FUNCTION CALLS: MISSED OPPORTUNITIES

If the compiler fails to inline a function body:

- it is limited in redundancy elimination
- there is some overhead on function calls
- inlining is crucial for function calls from loops
- many other optimizations aren't performed for this fragment of code
  - loop unrolling
  - auto-vectorization
  - etc
- potential bloating of code and stack

Page 47: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

17

FLOATING POINT

Floating point arithmetic is not associative, so A+(B+C) != (A+B)+C. A compiler is very conservative about floating point!

    void associativityCheck(void)
    {
        double x = 3.1415926535897931;
        double a = 1.0e15;
        double b = -(1.0e15 - 1.0);
        printf("%f %f\n", x*(a + b), x*a + x*b);
    }

    $ gcc check.c -o check; ./check
    3.141593 3.000000

Such a situation is known as catastrophic cancellation.
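A smaller instance of the same non-associativity, using nothing but decimal fractions that are inexact in binary, shows why the compiler keeps the programmer's evaluation order by default:

```c
/* Sketch: reassociating double addition changes the result. */
int reassociation_changes_result(void)
{
    double lhs = (0.1 + 0.2) + 0.3;  /* 0.6000000000000001 */
    double rhs = 0.1 + (0.2 + 0.3);  /* 0.6 */
    return lhs != rhs;               /* 1: the two orders differ */
}
```

Flags like gcc's -ffast-math lift this conservatism by permitting reassociation, at the cost of bit-exact reproducibility.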

Page 48: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

18

MORE OPTIMIZATION CHALLENGES

- Branches inside a loop
- Exceptions
- Accesses to storage-type global variables
- Inline assembly
- The volatile keyword

Page 49: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 1

SUMMARY

- Source code goes through lexical, syntax, and semantic analysis, as well as IR generation and optimization, before producing target machine code.
- The backend/frontend split simplifies compiler development.
- An intermediate language makes compiler optimizations reusable across a broad range of languages and targets.
- IL can be language-specific or language-independent.
- Triples and quadruples are widely used as language-independent IR.
- All the compiler optimizations are done on IR.
- Register allocation comes after IR optimization; local-variable reuse is pointless nowadays!

Page 50: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 2

SUMMARY

- LTO allows performing optimizations during linking.
- WHOPR allows globally optimizing the whole binary.
- A compiler usually supports multiple optimization levels.
- Compiler optimizations are split into machine-dependent and machine-independent groups.
- By scope, compiler optimizations are split into interprocedural, intraprocedural, global, local and peephole.
- The most common targets are dependency chains, branches, loops.
- Compiler optimization is a multi-phase iterative process.
- Performing one optimization enables many others.
- Most optimizations need a certain order of application.

Page 51: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 3

SUMMARY

- Checking wall time, assembly, or the optimizer's report are the most common ways to get optimization feedback.
- Wall time is the most important metric to optimize.
- Assembly is a must-have to check the compiler, but it is rarely used to write low-level code.
- Inspect the optimizer's report to demystify its "habits".
- Stick to the strict aliasing rules.
- Clean code is not enough.. Write pure code!
- Compilers are usually very conservative optimizing floating point math.

Page 52: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

20

THE END

Marina Kolpakova / 2015-2016