Pragmatic Optimization in Modern Programming - Demystifying the Compiler

Page 1: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

1

PRAGMATIC OPTIMIZATION
IN MODERN PROGRAMMING: DEMYSTIFYING A COMPILER

Created by Marina (geek) Kolpakova for UNN / 2015-2016

Page 2: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

2

COURSE TOPICS

- Ordering optimization approaches
- Demystifying a compiler
- Mastering compiler optimizations
- Modern computer architecture concepts

Page 3: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

3

OUTLINE

- Compilation trajectory
- Intermediate language
- Dealing with local variables
- Link-time and whole-program optimization
- Optimization levels
- Compiler optimization taxonomies
  - Classic
  - Scope
  - Code pattern
- How to get feedback from optimization?
- Compiler optimization challenges
- Summary

Page 4: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

4

EXECUTABLE GENERATION PHASES

Pre-processing. Pre-process, but don't compile.

    gcc -E test.cc
    cl /E test.cc

Compilation. Compile, but don't assemble.

    gcc -S test.cc
    cl /FA test.cc

Assembling. Assemble, but don't link.

    gcc -c test.cc
    cl /c test.cc

Linking. Link object files to generate the executable.

    gcc test.cc
    cl test.cc

Page 5: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 1

COMPILATION TRAJECTORY

Lexical Analysis
scans the source code as a stream of characters, converting it into lexemes (tokens).

Syntax Analysis
takes the tokens produced by lexical analysis as input and generates a syntax tree. Source code grammar (syntactical correctness) is checked here.
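As a sketch of what lexical analysis produces (not any real compiler's lexer; all names here are illustrative), a hand-rolled tokenizer can split an arithmetic expression into identifier, number and operator tokens:

```c
#include <ctype.h>

/* Minimal sketch of lexical analysis: classify the lexemes of an
 * expression like "a + 42 * b" and return the token count. */
enum tok { TOK_IDENT, TOK_NUM, TOK_OP };

int tokenize(const char *src, enum tok *out, int max)
{
    int n = 0;
    while (*src && n < max) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (isalpha((unsigned char)*src)) {          /* identifier lexeme */
            while (isalnum((unsigned char)*src)) src++;
            out[n++] = TOK_IDENT;
        } else if (isdigit((unsigned char)*src)) {   /* number lexeme */
            while (isdigit((unsigned char)*src)) src++;
            out[n++] = TOK_NUM;
        } else {                                     /* single-char operator */
            src++;
            out[n++] = TOK_OP;
        }
    }
    return n;
}
```

The syntax analyzer would then consume exactly this token stream to build the tree.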

Page 6: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 2

COMPILATION TRAJECTORY

Semantic Analysis
checks whether the constructed syntax tree follows the language rules (including type checking).

Intermediate Code Generation
builds a program representation for some abstract machine. It sits in between the high-level language and the target machine language.

Page 7: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

5 . 3

COMPILATION TRAJECTORY

Code Optimization
optimizes the intermediate code (e.g., redundancy elimination).

Code Generation
takes the optimized representation of the intermediate code and maps it to the target machine language.

Page 8: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

6

FRONTEND AND BACKEND

- Only a backend is required to support a new machine.
- Only a frontend is required to support a new language.
- Most optimizations resemble each other for all targets and can be applied in between the frontend and backend.

Page 9: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 1

INTERMEDIATE LANGUAGE

Optimization techniques become much easier to conduct at the level of intermediate code. Modern compilers usually use 2 levels of intermediate representation (IR).

Page 10: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 2

INTERMEDIATE LANGUAGE

High Level IR
is close to the source and can be easily generated from the source code. Some code optimizations are possible. It is not very suitable for target machine optimization.

Low Level IR
is close to the target machine and is used for machine-dependent optimizations: register allocation, instruction selection, peephole optimization.

Page 11: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 3

INTERMEDIATE LANGUAGE

- Language-specific, to be used for JIT compilation later: Java byte code, .NET CLI, NVIDIA PTX.
- Language-independent, like three-(four-)address code (similar to a classic RISC ISA).

    a = b + c * d + c * d;

Three-Address Code (TAC):

    r1 = c * d; r2 = b + r1; r3 = r2 + r1; a = r3

Here each ri is an abstract register.
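The three-address sequence can be checked against the source expression directly. Written out as C (a sketch; the function names are illustrative only), it also makes the common subexpression elimination visible: c*d is computed once and reused:

```c
/* Sketch: evaluate a = b + c*d + c*d directly and via the slide's
 * three-address code. */
int direct(int b, int c, int d)
{
    return b + c*d + c*d;
}

int via_tac(int b, int c, int d)
{
    int r1 = c * d;    /* r1 = c * d   (computed once, reused twice) */
    int r2 = b + r1;   /* r2 = b + r1 */
    int r3 = r2 + r1;  /* r3 = r2 + r1 */
    return r3;         /* a  = r3 */
}
```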

Page 12: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 4

THREE-ADDRESS CODE

Quadruples have four fields:

    Op  arg1  arg2  result
    *   c     d     r1
    +   b     r1    r2
    +   r2    r1    r3
    =   r3          a

Triples (or indirect triples) have three fields:

    Op  arg1  arg2
    *   c     d
    +   b     (0)
    +   (1)   (0)
    =   (2)

Page 13: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 5

INTERMEDIATE LANGUAGE

Provides a frontend-independent code representation.

- GNU Compiler Collection: GENERIC and GIMPLE (-fdump-tree-all, -fdump-tree-optimized, -fdump-tree-ssa, -fdump-rtl-all)
- clang and other LLVM-based compilers: LLVM IR (-emit-llvm)
- Visual Studio cl.exe: CIL (C Intermediate Language)

Page 14: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 6

INTERMEDIATE LANGUAGE

    uint32_t gray2rgba_v1(uint8_t c)
    {
        return c + (c<<8) + (c<<16) + (c<<24);
    }

    $ clang -Os -S -emit-llvm test.c -o test.ll
    $ cat test.ll

    define i32 @gray2rgba_v1(i8 zeroext %c) #0 {
      %1 = zext i8 %c to i32
      %2 = mul i32 %1, 16843009
      ret i32 %2
    }

    gray2rgba_v1:
      movzbl %dil, %eax
      imull  $16843009, %eax, %eax
      ret
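The multiply by 16843009 that clang derived is the byte-replication constant 0x01010101. A quick check (function names here are illustrative, not from the slide) that it matches the shift-and-add form for every gray value:

```c
#include <stdint.h>

/* Sketch: the shift-and-add form from the slide, written with an
 * explicit unsigned widening, and the multiply clang reduced it to.
 * 0x01010101 replicates the gray byte into all four channels. */
uint32_t gray2rgba_shifts(uint8_t c)
{
    uint32_t u = c;
    return u + (u << 8) + (u << 16) + (u << 24);
}

uint32_t gray2rgba_mul(uint8_t c)
{
    return (uint32_t)c * 16843009u;  /* 16843009 == 0x01010101 */
}
```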

Page 15: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

8 . 1

DEALING WITH LOCAL VARIABLES

The compiler doesn't care how many variables are used in the code; register allocation is done after the IR transformations.

    for( ; j <= roi.width - 4; j += 4 )
    {
        uchar t0 = tab[src[j]];
        uchar t1 = tab[src[j+1]];
        dst[j] = t0; dst[j+1] = t1;
        t0 = tab[src[j+2]];
        t1 = tab[src[j+3]];
        dst[j+2] = t0; dst[j+3] = t1;
    }

Page 16: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

8 . 2

DEALING WITH LOCAL VARIABLES

    .lr.ph4:                                 ; preds = %0, %.lr.ph4
      %indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [ 0, %0 ]
      %6 = getelementptr inbounds i8* %src, i64 %indvars.iv5
      %7 = load i8* %6, align 1, !tbaa !1
      %8 = zext i8 %7 to i64
      %9 = getelementptr inbounds i8* %tab, i64 %8
      %10 = load i8* %9, align 1, !tbaa !1
      %11 = or i64 %indvars.iv5, 1
      %12 = getelementptr inbounds i8* %src, i64 %11
      %13 = load i8* %12, align 1, !tbaa !1
      %14 = zext i8 %13 to i64
      %15 = getelementptr inbounds i8* %tab, i64 %14
      %16 = load i8* %15, align 1, !tbaa !1
      %17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5
      store i8 %10, i8* %17, align 1, !tbaa !1
      %18 = getelementptr inbounds i8* %dst, i64 %11
      store i8 %16, i8* %18, align 1, !tbaa !1
      %19 = or i64 %indvars.iv5, 2
      ; ...
      %28 = zext i8 %27 to i64
      %29 = getelementptr inbounds i8* %tab, i64 %28
      %30 = load i8* %29, align 1, !tbaa !1
      %31 = getelementptr inbounds i8* %dst, i64 %19
      store i8 %24, i8* %31, align 1, !tbaa !1
      %32 = getelementptr inbounds i8* %dst, i64 %25
      store i8 %30, i8* %32, align 1, !tbaa !1
      %indvars.iv.next6 = add nuw nsw i64 %indvars.iv5, 4
      %33 = trunc i64 %indvars.iv.next6 to i32
      %34 = icmp sgt i32 %33, %1
      br i1 %34, label %..preheader_crit_edge, label %.lr.ph4

Page 17: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

9 . 1

LINK-TIME OPTIMIZATION (LTO)

Perform inter-procedural optimizations during linking.

Most compilers support this feature:

- clang (-flto)
- gcc (-flto), starting with 4.9
- cl.exe (/GL, /LTCG)
- ...

Page 18: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

9 . 2

WHOPR: WHOLE PROGRAM OPTIMIZATION

1. Compile each source file separately, adding extra information to the object file.
2. Analyze the information collected from all object files.
3. Perform a second optimization phase to generate object files.
4. Link the final binary.

- Eliminates even more redundant code.
- Compilation is better parallelized on multi-core systems.

Page 19: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 1

OPTIMIZATION LEVELS

-O0 (the default): No optimization
generates unoptimized code but has the fastest compilation time.

-O1: Moderate optimization
optimizes reasonably well but does not degrade compilation time significantly.

-O2: Full optimization
generates highly optimized code and has the slowest compilation time.

Page 20: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 2

OPTIMIZATION LEVELS

-O3: Aggressive optimization
employs more aggressive automatic inlining of subprograms within a unit and attempts to vectorize.

-Os: Optimizes with a focus on program size
enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

Page 21: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 3

ENABLED OPTIMIZATIONS: GCC -O0
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fcommon -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs

Page 22: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 4

ENABLED OPTIMIZATIONS: GCC -O1
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fdefer-pop -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fident -fif-conversion -fif-conversion2 -finline -finline-atomics -finline-functions-called-once -fipa-profile -fipa-pure-const -fipa-reference -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs

Page 23: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 5

ENABLED OPTIMIZATIONS: GCC -O2
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper

Page 24: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

10 . 6

ENABLED OPTIMIZATIONS: GCC -O3
GNU C version 4.9.2 (x86_64-linux-gnu)

    $ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s

options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-loop-vectorize -ftree-parallelize-loops= -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funswitch-loops -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper

Page 25: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

11

CLASSIC COMPILER OPTIMIZATION TAXONOMY

Machine independent: applicable across a broad range of machines.
1. Eliminate redundant computations, dead code
2. Reduce running time and space
3. Decrease the ratio of overhead to real work
4. Specialize code on a context
5. Enable other optimizations

Machine dependent: capitalize on specific machine properties.
1. Manage or hide latency
2. Manage resources (registers, stack)
3. Improve mapping from IR to the concrete machine
4. Use some exotic instructions (e.g. VLDM)

Page 26: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 1

SCOPE COMPILER OPTIMIZATION TAXONOMY

Interprocedural optimizations
consider the whole translation unit; involve analysis of dataflow and dependency graphs.

Intraprocedural optimizations
consider the whole procedure; involve analysis of dataflow and dependency graphs.

Page 27: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 2

SCOPE COMPILER OPTIMIZATION TAXONOMY

Global optimizations
consider the innermost code block together with its context. Loop optimizations belong here.

Local optimizations
consider a single block; the analysis is limited to it.

Peephole optimizations
map one or more consecutive operations from the IR to machine code.

Page 28: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

12 . 3

INTERPROCEDURAL OPTIMIZATIONS (IPO)

Look at all routines in a translation unit in order to make optimizations across routine boundaries, including but not limited to inlining and cloning. Also called Interprocedural Analysis (IPA).

The compiler can move, optimize, restructure and delete code between procedures, and even between different source files if LTO is enabled.

- Inlining: replacing a subroutine call with the replicated code of it
- Cloning: optimizing logic in a copied subroutine for a particular call
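What inlining buys can be written out by hand. In this sketch (the function names are illustrative), replacing the call with the callee's body exposes the constant arguments to folding:

```c
/* Sketch of inlining: gray() is small and static, so the compiler can
 * substitute its body into the caller and fold the constants. */
static int gray(int r, int g, int b)
{
    return (r + g + b) / 3;
}

int caller_with_call(void)
{
    return gray(30, 60, 90);    /* a real call before inlining */
}

int caller_inlined(void)
{
    return (30 + 60 + 90) / 3;  /* what the optimizer effectively emits: 60 */
}
```

Cloning goes one step further: the copied callee itself is specialized for the known arguments of a particular call site.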

Page 29: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

13

PATTERN COMPILER OPTIMIZATION TAXONOMY

- Dependency chains (linear code)
- Branches
- Loop bodies
  - Single loop
  - Loop and branch
  - Multi-loop
- Function calls to subroutines

Page 30: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 1

HOW TO GET OPTIMIZATION FEEDBACK?

Check the wall time of your application.
- If a compiler has done its job well, you'll see performance improvements.

Dump the assembly of your code (and/or the IL).
- Ensure instruction and register scheduling.
- Check for extra operations and register spills.

See the compiler optimization report.
- All the compilers have some support for it.
- Some of them are able to generate very detailed reports about loop unrolling, auto-vectorization, VLIW slot scheduling, etc.

Page 31: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 2

COMMONLY CONSIDERED METRICS

Wall(-clock) time
is a human perception of the span of time from the start to the completion of a task.

Power consumption
is the electrical energy which is consumed to complete a task.

Processor time (or runtime)
is the total execution time during which a processor was dedicated to a task (i.e. executing instructions of that task).
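Processor time can be sampled portably with clock() from <time.h>, which counts CPU time dedicated to this process rather than elapsed wall time. A minimal sketch (the helper names are illustrative):

```c
#include <time.h>

/* Sketch: measure the processor time a task consumes. */
double cpu_seconds(void (*task)(void))
{
    clock_t t0 = clock();
    task();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* A trivial CPU-bound task to measure; volatile keeps the loop alive. */
void spin(void)
{
    volatile long i;
    for (i = 0; i < 1000000; i++)
        ;
}
```

On a loaded machine wall time and processor time diverge: a task waiting on I/O accumulates wall time but almost no processor time.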

Page 32: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 3

DUMPING ASSEMBLY

Assembly is a must-have to check the compiler, but it is rarely used to write low-level code.

    $ gcc code.c -S -o asm.s

- Assembly writing is the least portable optimization.
- Inline assembly limits compiler optimizations.
- Assembly does not give an overwhelming speedup nowadays.
- Sometimes it is needed to overcome compiler bugs and optimization limitations.

Page 33: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 4

EXAMPLE: GCC FEEDBACK OPTIONS

Enables optimization information printing:

    -fopt-info
    -fopt-info-<optimized/missed/note/all>
    -fopt-info-all-<ipa/loop/inline/vec/optall>
    -fopt-info=filename

Controls the amount of debugging output the scheduler prints on targets that use instruction scheduling:

    -fopt-info -fsched-verbose=n

Controls the amount of output from the auto-vectorizer:

    -ftree-vectorizer-verbose=n

Page 34: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 5

EXAMPLES: GCC FEEDBACK OPTIONS

Outputs all optimization info to stderr:

    gcc -O3 -fopt-info

Outputs a missed-optimization report from all the passes to missed.txt:

    gcc -O3 -fopt-info-missed=missed.txt

Outputs information about missed optimizations as well as optimized locations from all the inlining passes to inline.txt:

    gcc -O3 -fopt-info-inline-optimized-missed=inline.txt

Page 35: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

14 . 6

GCC FEEDBACK EXAMPLE

    ./src/box.cc:193:9: note: loop vectorized
    ./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:96:9: note: loop vectorized
    ./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:51:9: note: loop vectorized
    ./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:193:9: note: loop with 7 iterations completely unrolled
    ./src/box.cc:32:13: note: loop with 7 iterations completely unrolled
    ./src/box.cc:96:9: note: loop with 15 iterations completely unrolled
    ./src/box.cc:51:9: note: loop with 15 iterations completely unrolled
    ./src/box.cc:584:9: note: loop vectorized
    ./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:482:9: note: loop vectorized
    ./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment
    ./src/box.cc:463:5: note: loop vectorized
    ./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing
    ./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment

Page 36: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 1

POINTER ALIASING

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2* *yp;
    }

ARE THEY ALWAYS EQUAL?

Page 37: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 2

POINTER ALIASING

What if..

    int main(int argc, char** argv)
    {
        int i = 5, j = 5;
        twiddle1(&i, &i);
        twiddle2(&j, &j);

        printf("twiddle1 result is %d\n", i);
        printf("twiddle2 result is %d\n", j);
    }

twiddle1's result is 20, while twiddle2's result is 15.

Page 38: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 3

POINTER ALIASING

Aliasing refers to the situation where the same memory location can be accessed using different names.

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2* *yp;
    }

Page 39: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 4

STRICT ALIASING ASSUMPTION

Strict aliasing is an assumption, made by a C (or C++) compiler, that dereferenced pointers to objects of different types never refer to the same memory location.

This assumption enables more aggressive optimization (gcc makes it starting from -O2), but a programmer has to follow the strict aliasing rules to get code working correctly.
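Under the strict aliasing rules, inspecting an object through a pointer of an unrelated type is undefined behavior; the conventional safe alternative is memcpy, which optimizing compilers reduce to a plain register move. A sketch (the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: strict-aliasing-safe type punning. Copying the bytes with
 * memcpy is defined behavior, unlike casting &f to uint32_t*. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```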

Page 40: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 5

STRICT ALIASING ASSUMPTION

    void check(int32_t *h, int64_t *k)
    {
        *h = 5;
        *k = 6;
        printf("%d\n", *h);
    }

    void main()
    {
        int64_t k;
        check((int32_t *)&k, &k);
    }

    gcc -O1 test.c -o test; ./test

results in 6.

    gcc -O2 test.c -o test; ./test

results in 5.

Page 41: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

15 . 6

POINTER ALIASING: MISSED OPPORTUNITIES

- The compiler freely schedules arithmetic, but often preserves the order of memory dereferences.
- The compiler is limited in redundancy elimination.
- The compiler is limited in loop unrolling.
- The compiler is limited in auto-vectorization.
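C99's restrict qualifier is the standard way to hand the missing aliasing information to the compiler: it promises that, within the function, the pointed-to objects are not accessed through any other pointer, so a twiddle1-style body may legally be optimized into the twiddle2 form. A sketch (the function name is illustrative):

```c
/* Sketch: with restrict the programmer asserts xp and yp never alias,
 * so the compiler may keep *yp in a register and merge the two adds.
 * Calling this with aliasing pointers would be undefined behavior. */
void twiddle3(int *restrict xp, int *restrict yp)
{
    *xp += *yp;
    *xp += *yp;
}
```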

Page 42: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 1

FUNCTION CALLS

    int callee();
    int caller()
    {
        return callee() + callee();
    }

    int callee();
    int caller()
    {
        return 2*callee();
    }

ARE THEY EQUAL?
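They are equal only if callee() has no side effects. A sketch with a stateful callee (names are illustrative) shows exactly where the two forms diverge, and why the compiler cannot merge the calls without knowing the callee is pure:

```c
/* Sketch: a side-effecting callee makes the two forms differ. */
static int counter;

static int callee(void)
{
    return ++counter;            /* hidden state: not a pure function */
}

int caller_two_calls(void)
{
    counter = 0;
    return callee() + callee();  /* 1 + 2 = 3 */
}

int caller_one_call(void)
{
    counter = 0;
    return 2 * callee();         /* 2 * 1 = 2 */
}
```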

Page 43: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 2

FUNCTION CALLS

    int callee(int i);
    int caller()
    {
        int s=0, i=0;
        for ( ; i < N; i++)
            s += callee(i);
        return s;
    }

    int callee(int i);
    int caller()
    {
        int s0=0, s1=0, i=0;
        for ( ; i < N; i+=2)
        {
            s0 += callee(i);
            s1 += callee(i+1);
        }
        return s0 + s1;
    }

ARE THEY EQUAL?

Page 44: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 3

PURE FUNCTIONS

A pure function is a function for which both of the following statements are true:

1. The function always evaluates the same result given the same argument value(s). The function result must not depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor on any external input from I/O devices.

2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.

Page 45: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 4

PURE FUNCTIONS

- Pure functions are much easier to optimize. Expressing ideas in code as pure functions simplifies the compiler's life.
- Most functions from math.h are not pure (they set/clear floating point flags and conditions, and throw floating point exceptions).
- Use the constexpr keyword in C++11 to hint to the compiler that a function could be evaluated at compile time.
- Use the static keyword to help the compiler see all the usages of a function (and perform aggressive inlining, or even deduce whether the function is pure or not).
- Neither constexpr nor static guarantees that a function is pure, but they give the compiler some hints.
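GCC and clang also accept explicit purity annotations: __attribute__((const)) (result depends on the arguments only) and __attribute__((pure)) (may read global memory, but has no side effects). A sketch with illustrative function names:

```c
/* Sketch: purity hints understood by GCC and clang. With these
 * attributes the compiler may merge or hoist repeated calls. */
__attribute__((const))
static int square(int x)
{
    return x * x;                /* depends on the argument only */
}

__attribute__((pure))
static int sum(const int *a, int n)
{
    int s = 0, i;
    for (i = 0; i < n; i++)      /* reads memory, but no side effects */
        s += a[i];
    return s;
}
```

The attributes are promises, not checks: annotating a function that actually has side effects leads to miscompilation.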

Page 46: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

16 . 5

FUNCTION CALLS: MISSED OPPORTUNITIES

If the compiler fails to inline a function body:

- it is limited in redundancy elimination
- there is some overhead on function calls
- inlining is crucial for function calls from loops
- many other optimizations aren't performed for this fragment of code
  - loop unrolling
  - auto-vectorization
  - etc
- potential bloating of code and stack

Page 47: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

17

FLOATING POINT

Floating point arithmetic is not associative, so A+(B+C) != (A+B)+C. A compiler is very conservative about floating point!

    void associativityCheck(void)
    {
        double x = 3.1415926535897931;
        double a = 1.0e15;
        double b = -(1.0e15 - 1.0);
        printf("%f %f\n", x*(a + b), x*a + x*b);
    }

    $ gcc check.c -o check; ./check
    3.141593 3.000000

Such a situation is known as catastrophic cancellation.
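A smaller instance of the same non-associativity, using nothing but decimal fractions that are inexact in binary, shows why the compiler keeps the programmer's evaluation order by default:

```c
/* Sketch: reassociating double addition changes the result. */
int reassociation_changes_result(void)
{
    double lhs = (0.1 + 0.2) + 0.3;  /* 0.6000000000000001 */
    double rhs = 0.1 + (0.2 + 0.3);  /* 0.6 */
    return lhs != rhs;               /* 1: the two orders differ */
}
```

Flags like gcc's -ffast-math lift this conservatism by permitting reassociation, at the cost of bit-exact reproducibility.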

Page 48: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

18

MORE OPTIMIZATION CHALLENGES

- Branches inside a loop
- Exceptions
- Accesses to storage-type global variables
- Inline assembly
- The volatile keyword

Page 49: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 1

SUMMARY

- Source code goes through lexical, syntax, and semantic analysis, as well as IR generation and optimization, before producing target machine code.
- The backend/frontend split simplifies compiler development.
- An intermediate language makes compiler optimizations reusable across a broad range of languages and targets.
- IL can be language-specific or language-independent.
- Triples and quadruples are widely used as language-independent IR.
- All the compiler optimizations are done on IR.
- Register allocation comes after IR optimization; local-variable reuse is pointless nowadays!

Page 50: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 2

SUMMARY

- LTO allows performing optimizations during linking.
- WHOPR allows globally optimizing the whole binary.
- A compiler usually supports multiple optimization levels.
- Compiler optimizations are split into machine-dependent and machine-independent groups.
- By scope, compiler optimizations are split into interprocedural, intraprocedural, global, local and peephole.
- The most common targets are dependency chains, branches, loops.
- Compiler optimization is a multi-phase iterative process.
- Performing one optimization enables many others.
- Most optimizations need a certain order of application.

Page 51: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

19 . 3

SUMMARY

- Checking wall time, assembly, or the optimizer's report are the most common ways to get optimization feedback.
- Wall time is the most important metric to optimize.
- Assembly is a must-have to check the compiler, but it is rarely used to write low-level code.
- Inspect the optimizer's report to demystify its "habits".
- Stick to the strict aliasing rules.
- Clean code is not enough.. Write pure code!
- Compilers are usually very conservative optimizing floating point math.

Page 52: Pragmatic Optimization in Modern Programming - Demystifying the Compiler

20

THE END

Marina Kolpakova / 2015-2016