Profile Guided Optimizations in Visual C++ 2005
Andrew Pardoe, Phoenix Team (C++ Optimizer)
What do optimizers do?

    int setArray(int a, int *array)
    {
        int x;
        for (x = 0; x < a; ++x)
            array[x] = 0;
        return x;
    }

- The compiler knows nothing about the value of 'a'
- The compiler knows nothing about the array's alignment
- The compiler doesn't look at all the source files together
- The compiler doesn't know how the program will execute
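To make the point concrete, here is a sketch of the kind of specialization that knowing typical values of 'a' would enable. This is purely illustrative: the bulk-clear decision and the threshold of 64 are assumptions for the example, not actual VC++ codegen.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: if a profile showed 'a' is almost always large,
// a PGO-style compiler could pick a bulk-clear fast path over the
// scalar loop. The threshold (64) is illustrative only.
int setArray(int a, int *array)
{
    if (a >= 64) {
        std::memset(array, 0, a * sizeof(int));  // hot, profiled path
    } else {
        for (int x = 0; x < a; ++x)              // cold fallback
            array[x] = 0;
    }
    return a;
}
```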
What is PGO (pronounced PoGO)?
- A "profile" details a program's behavior in a specific scenario
- Profile-guided optimizations use the profile to guide the optimizer for that given scenario
- PGO tells the optimizer which areas of the application were most frequently executed
- This information lets the optimizer be more selective in optimizing the program
- PGO has its own set of optimizations as well as improving traditional optimizations
Example of a PGO win
- Compiler optimizations make assumptions based on static analysis and standard heuristics
- For example, we assume that a loop executes multiple times:

    for (p = list; *p; p = p->next) {
        p->f = sqrt(F);
    }

- The optimizer would hoist the loop-invariant call to sqrt(F):

    tmp = sqrt(F);
    for (p = list; *p; p = p->next) {
        p->f = tmp;
    }

- If the profile shows that p is typically zero (the loop body rarely executes), we will not hoist the call
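A runnable version of the hoisted form of the slide's loop. I use `p` rather than the slide's `*p` as the loop condition so the list walk is well-defined, and treat F as a plain loop-invariant argument:

```cpp
#include <cassert>
#include <cmath>

struct Node { double f; Node *next; };

// The hoisted transformation written out: sqrt is called once before
// the loop instead of once per iteration, with identical results.
void setAll(Node *list, double F)
{
    double tmp = std::sqrt(F);           // hoisted loop-invariant call
    for (Node *p = list; p; p = p->next)
        p->f = tmp;
}
```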
How is PGO used?

[Diagram: source code is compiled and linked into an instrumented binary containing PGO probes; running scenarios against the instrumented binary produces a profile; recompiling the source code with that profile produces the optimized binary.]
How is PGO used?
- PGO is built on top of Link-Time Code Generation
- Must link object files twice: once for the instrumented build, once for the optimized build
- Can be used on almost all native code
  - exe, dll, lib
  - COM/MFC
  - Windows services
- Cannot be used on system or managed code
  - Drivers or kernel-mode code
  - No code compiled with /CLR
- Incorrect scenarios could cause worse optimizations!
PGO profile gathering
- Two major themes of PGO profile gathering:
  - Identify "hot paths" in program execution and optimize to make these paths perform well; likewise, identify "cold paths" to separate cold code (or dead code) from hot code
  - Identify "typical" values, such as switch values, loop induction variables, and targets of indirect calls, and optimize code for these values
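As an illustration of optimizing for a "typical" switch value, here is a hand-written sketch of what value profiling enables: peeling the hot case out ahead of the general dispatch. The opcodes and the choice of hot value are invented for the example.

```cpp
#include <cassert>

// Illustrative sketch of value profiling on a switch: if the profile
// says 'op' is almost always OP_ADD, the hot case can be tested first,
// ahead of the general switch/jump table. Opcodes are invented.
enum Op { OP_ADD, OP_SUB, OP_MUL };

int eval(Op op, int a, int b)
{
    if (op == OP_ADD)        // hot value peeled by (hypothetical) PGO
        return a + b;
    switch (op) {            // cold cases keep the general dispatch
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        default:     return 0;
    }
}
```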
PGO main optimizations: inlining
- Improved inlining heuristics
- Inline based on frequency of call, not function size or depth of call stack
  - "Hot" call sites: inline aggressively
  - "Cold" call sites: only inline if there are other optimization opportunities (such as folding)
  - "Dead" call sites: only inline the trivial cases
PGO main optimizations: inlining
- Speculative inlining: used for virtual call speculation
  - Indirect calls are profiled to find typical targets
  - An indirect call heavily biased toward certain target(s) can be multi-versioned
  - The new sequence contains direct call(s) to typical target(s), which can be inlined
- Partial inlining: only inline the portions of the callee we execute
  - If the cold code is called, call the non-inlined function
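The following is a hand-written sketch of what partial inlining achieves (the compiler performs the split automatically; the function names and shapes here are invented): a small hot early-out that is cheap to inline at the call site, with the cold work left behind an out-of-line call.

```cpp
#include <cassert>
#include <string>

std::string formatError(int code);   // cold path, not inlined

// Hot path is trivial and can be inlined at each call site;
// the cold path stays a real call into formatError.
inline std::string statusText(int code)
{
    if (code == 0)                   // hot path: inlined
        return "ok";
    return formatError(code);        // cold path: out-of-line call
}

std::string formatError(int code)    // cold code, kept out of line
{
    return "error " + std::to_string(code);
}
```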
PGO main optimizations: code size
- The choice of favoring size versus speed is made on a per-function basis
- Program execution should be dominated by functions optimized for speed; less-frequently used functions should be small
- PGO computes a dynamic instruction count for each profiled function; inlining effects are taken into account
- Functions are sorted in descending order by count: functions in the upper 99% of total dynamic instruction count are optimized for speed, the others are compressed
- In large applications (Vista, SQL) most functions are optimized for size
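The 99% policy described above can be sketched as a small classification routine. This is an illustration of the stated policy, not the compiler's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of the slide's policy: sort functions by dynamic instruction
// count and mark them for speed optimization until 99% of the total
// count is covered; the remainder are optimized for size.
std::set<std::string> speedOptimized(
    std::vector<std::pair<std::string, long long>> funcs)
{
    std::sort(funcs.begin(), funcs.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    long long total = 0;
    for (const auto &f : funcs) total += f.second;
    std::set<std::string> speed;
    long long covered = 0;
    for (const auto &f : funcs) {
        if (covered >= total * 99 / 100) break;  // rest: optimize for size
        speed.insert(f.first);
        covered += f.second;
    }
    return speed;
}
```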
PGO main optimizations: locality
- Reorder the code to "fall through" wherever possible
  - Intra-function layout reorders basic blocks so that the major trace falls through whenever possible
  - Inter-function layout tries to place frequent caller-callee pairs near one another in the image
- Extract "dead" code from the .text section and put it in a remote section of the image
  - Dead code can be entire functions that are never called, or basic blocks inside a function
- The penalty for being wrong is very large, so the profile must be accurate!
What code benefits most?
- C++ programs: many virtual calls can be inlined once the target is determined through profiling
- Large applications where size and speed are important
- Code with frequent branches that are difficult to predict at compile time
- Code which can be separated by profiling into "hot" and "cold" blocks to help instruction cache locality
- Code for which you know the typical usage patterns and can produce accurate profiling scenarios
Scenario 1
- Customer compiles with /O2 and gets pretty good performance, but wants to take advantage of advanced optimizations like LTCG and PGO
- Code is tested by the dev team throughout the development cycle using unit and bug-regression tests
- Customer has done performance measurements of the code, but has no automated tests to measure performance; they believe it can improve
- Is this customer ready to try PGO? Probably not.
Scenario 2
- Customer has well-defined performance goals and tests set up to measure performance
- Customer knows typical usage patterns for the application
- Application is being built with LTCG
- Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations
- Is this customer ready to use PGO? Maybe…
Scenario 3
- Customer has well-defined performance goals and tests set up to measure performance
- Customer knows typical usage patterns for the application
- Application is being built with LTCG
- Application spends most of its time in branches and calls
- Application is fairly large and makes use of inheritance
- Is this customer ready to use PGO? Definitely.
Scenario 4
- Customer has a build lab and wants to enable PGO in nightly builds, but profiling every night seems too expensive
- Solution: PGO Incremental Update (PGU)
  - Avoids running profile scenarios at every build
  - PGU uses "stale" profile data; you can check in profile data and refresh it weekly
  - PGU restricts optimizations: functions which have changed will not be optimized
  - Effects of localized changes are usually negligible
PGO sweeper
- Some scenarios are difficult to collect profile data for:
  - The profile scenario may not begin and end with application launch and shutdown
  - Some components cannot write a file
  - Some components cannot link to the PGO runtime DLL
- PGO sweeper collects profile data from running instrumented processes
  - It lets you close the currently open .pgc file and create a new one without exiting the instrumented binary
  - You get one .pgc file per run or sweep; you can delete any .pgc files you do not want reflected in your scenario
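A rough sketch of a sweep session. The pgosweep command line shown is from memory and worth checking against the VC++ 2005 documentation; the binary and .pgc file names are illustrative:

```shell
rem Run the instrumented binary, then capture the counts collected so
rem far without exiting it; each sweep writes a fresh .pgc file.
pgosweep app.inst.exe app-scenario1.pgc
rem Later sweeps capture subsequent activity into separate files.
pgosweep app.inst.exe app-scenario2.pgc
```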
PGO Manager
- PGO manager merges profile data from one or more .pgc files into the .pgd file
  - The .pgd file is the main profile database
  - This lets you profile multiple scenarios (.pgc) for a single codebase into one profile database (.pgd)
- PGO manager also lets you generate reports from the .pgd file to check that your scenarios "feel right" in the code
- Information in the reports includes:
  - Module count, function count, arc and value count
  - Static (all) instruction count, dynamic (hot) instruction count
  - Basic block count, average basic block size
  - Function entry count
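A sketch of a typical pgomgr session, merging two scenario runs and then reporting on the database. The file names are illustrative, and the exact switch spellings should be checked against pgomgr's own help output:

```shell
rem Merge two scenario runs into the profile database; /merge:2 counts
rem the second scenario's data twice, weighting it more heavily.
pgomgr /merge scenario1.pgc appname.pgd
pgomgr /merge:2 scenario2.pgc appname.pgd
rem Summarize the merged database to sanity-check the scenarios.
pgomgr /summary appname.pgd
```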
How much performance does PGO get?
- Performance gain is architecture- and application-specific
  - IA64 sees the biggest gains
  - x64 benefits more than x86
  - Large applications benefit more than small: SQL Server saw over 30% gains through PGO
  - Many parts of Windows use PGO to balance size vs. speed
- If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win
- Once your testing is in place, integrating PGO into your build process should be easy
Performance gains over LTCG
Call-graph profiling
- Given this call graph, determine which code paths are hot and which are cold

[Diagram: a call graph over the functions foo, bat, bar, baz, and a.]
Call-graph profiling continued
- Measure the frequency of calls

[Diagram: the same call graph annotated with call frequencies such as 100, 75, 50, 20, 15, and 10 on its edges.]
Call-graph profiling after inlining
- Inline functions based on the call profile
- The highest-frequency calls are (bar, baz) and (bat, bar)

[Diagram: the call graph after inlining those highest-frequency callees, with updated edge frequencies.]
Reordering basic blocks
- Change code layout to improve instruction cache locality

[Diagram: a flow graph in which block A branches to B and C, both leading to D. The execution profile shows the path through C executing 100 times and the path through B only 10 times. The default layout orders the blocks A, B, C, D; the optimized layout orders them A, C, B, D so that the hot path falls through.]
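A toy version of the idea behind profile-driven block layout. This is a simple greedy chain-building sketch, not the VC++ algorithm, and its output can differ from the slide's layout in where cold blocks land:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Greedy trace layout: starting from the entry block, repeatedly fall
// through to the hottest not-yet-placed successor; when there is none,
// pick any remaining block. A simplification of real layout passes.
std::vector<std::string> layoutBlocks(
    const std::vector<std::string> &blocks,
    const std::map<std::pair<std::string, std::string>, int> &edgeCount)
{
    std::set<std::string> unplaced(blocks.begin(), blocks.end());
    std::vector<std::string> layout;
    std::string cur = blocks.front();            // entry block first
    while (true) {
        layout.push_back(cur);
        unplaced.erase(cur);
        if (unplaced.empty()) break;
        std::string next;
        int best = -1;
        for (const std::string &b : unplaced) {  // hottest successor wins
            auto it = edgeCount.find({cur, b});
            if (it != edgeCount.end() && it->second > best) {
                best = it->second;
                next = b;
            }
        }
        if (best < 0)
            next = *unplaced.begin();            // no hot successor left
        cur = next;
    }
    return layout;
}
```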
Speculative inlining of virtual calls
- Profiling shows the dynamic type of object A in function Func was almost always Foo (and almost never Bar)

    class Base { /* … */ virtual void call(); };
    class Foo : Base { /* … */ void call(); };
    class Bar : Base { /* … */ void call(); };

    // Before PGO
    void Func(Base *A)
    {
        // …
        while (true) {
            // …
            A->call();
            // …
        }
    }

    // After PGO
    void Func(Base *A)
    {
        // …
        while (true) {
            // …
            if (type(A) == Foo) {
                // inline of A->call();
            } else {
                A->call();   // virtual dispatch
            }
            // …
        }
    }
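A runnable model of the transformation above. The compiler's cheap internal type test is approximated here with dynamic_cast, and the "inlined" body is written out by hand; class names mirror the slide:

```cpp
#include <cassert>

struct Base { virtual int call() { return 0; } virtual ~Base() {} };
struct Foo : Base { int call() override { return 1; } };
struct Bar : Base { int call() override { return 2; } };

// Guarded devirtualization: the common case is tested for directly
// and its body "inlined"; rare cases fall back to virtual dispatch.
int Func(Base *A)
{
    if (dynamic_cast<Foo *>(A))   // stand-in for "type(A) == Foo"
        return 1;                 // inlined body of Foo::call()
    return A->call();             // rare case: normal virtual dispatch
}
```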
Partial inlining
- Profiling shows that condition Cond favors the left (hot) branch over the right (cold) branch

[Diagram: Basic Block 1 flows into the test of Cond, which branches to Hot Code and Cold Code; both paths rejoin at More Code.]
Partial inlining concluded
- We can inline the hot path, and not the cold path
- We can make different decisions at each call site!

[Diagram: the same flow graph, with the hot path inlined at the call site and the cold path left as a call to the non-inlined function.]
Using PGO (in more detail)

[Diagram of the build flow:
 1. Compile the source code with /GL and optimization options, producing object files.
 2. Link with /LTCG:PGI to produce the instrumented binary; this also creates a .PGD file.
 3. Run the instrumented binary on your scenarios, producing one or more .PGC files.
 4. Link the object files with /LTCG:PGO; the .PGC files are merged into the .PGD file and the optimized binary is produced.]
PGO tips
- The scenarios used to generate the profile data should be real-world scenarios. The scenarios are NOT an attempt to do code coverage.
- Training with scenarios that are not representative of real-world use can result in code that performs worse than if PGO were not used.
- Name the optimized binary something different from the instrumented binary, for example app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything.
- To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
PGO tips
- If you have two scenarios that run for different amounts of time but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them.
- You can use the speed switch to change the speed/size thresholds.
- You can control the inlining threshold with a switch, but use it with care: the values from 0-100 aren't linear.
- Integrate PGO into your build process and update scenarios frequently for the most consistent results and best performance increases.
In summary
- Using PGO is very easy, with four simple steps:
  1. CL to parse the source files:
       cl /c /O2 /GL *.cpp
  2. LINK /PGI to generate the instrumented image (this also generates the .PGD file, the PGO database):
       link /ltcg:pgi /pgd:appname.pgd *.obj *.lib
  3. Run your program on representative scenarios; this generates .PGC files (PGO profile data)
  4. LINK /PGO to generate the optimized image (implicitly uses the generated .PGC files):
       link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
More information
- Matt Pietrek's Under the Hood column from May 2002 has a fantastic explanation of LTCG internals
- Multiple articles on PGO are located on MSDN (the links are long: just search for PGO on MSDN)
- Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN
- Improvements are coming in the new VC++ backend
  - Based on the Phoenix optimization framework
  - Profiling is a major scenario for the Phoenix-based optimizer
  - There will be a talk on Phoenix later today