Profile Guided Optimizations in Visual C++ 2005
Andrew Pardoe, Phoenix Team (C++ Optimizer)
What do optimizers do?

    int setArray(int a, int *array)
    {
        int x;
        for (x = 0; x < a; ++x)
            array[x] = 0;
        return x;
    }

- The compiler knows nothing about the value of 'a'
- The compiler knows nothing about the array's alignment
- The compiler doesn't look at all the source files together
- The compiler doesn't know how the program will execute
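To make the point concrete, here is a sketch of the kind of specialization that knowing typical values of 'a' would enable. This is purely illustrative: the bulk-clear decision and the threshold of 64 are assumptions for the example, not actual VC++ codegen.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: if a profile showed 'a' is almost always large,
// a PGO-style compiler could pick a bulk-clear fast path over the
// scalar loop. The threshold (64) is illustrative only.
int setArray(int a, int *array)
{
    if (a >= 64) {
        std::memset(array, 0, a * sizeof(int));  // hot, profiled path
    } else {
        for (int x = 0; x < a; ++x)              // cold fallback
            array[x] = 0;
    }
    return a;
}
```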
What is PGO (pronounced PoGO)?
- A "profile" details a program's behavior in a specific scenario
- Profile-guided optimizations use the profile to guide the optimizer for that given scenario
- PGO tells the optimizer which areas of the application were most frequently executed
- This information lets the optimizer be more selective in optimizing the program
- PGO has its own set of optimizations as well as improving traditional optimizations
Example of a PGO win
- Compiler optimizations make assumptions based on static analysis and standard heuristics
- For example, we assume that a loop executes multiple times:

    for (p = list; *p; p = p->next) {
        p->f = sqrt(F);
    }

- The optimizer would hoist the loop-invariant call to sqrt(F):

    tmp = sqrt(F);
    for (p = list; *p; p = p->next) {
        p->f = tmp;
    }

- If the profile shows that p is typically zero (the loop body rarely executes), we will not hoist the call
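A runnable version of the hoisted form of the slide's loop. I use `p` rather than the slide's `*p` as the loop condition so the list walk is well-defined, and treat F as a plain loop-invariant argument:

```cpp
#include <cassert>
#include <cmath>

struct Node { double f; Node *next; };

// The hoisted transformation written out: sqrt is called once before
// the loop instead of once per iteration, with identical results.
void setAll(Node *list, double F)
{
    double tmp = std::sqrt(F);           // hoisted loop-invariant call
    for (Node *p = list; p; p = p->next)
        p->f = tmp;
}
```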
How is PGO used?

[Diagram: source code is compiled and linked into an instrumented binary containing PGO probes; running scenarios against the instrumented binary produces a profile; recompiling the source code with that profile produces the optimized binary.]
How is PGO used?
- PGO is built on top of Link-Time Code Generation
- Must link object files twice: once for the instrumented build, once for the optimized build
- Can be used on almost all native code
  - exe, dll, lib
  - COM/MFC
  - Windows services
- Cannot be used on system or managed code
  - Drivers or kernel-mode code
  - No code compiled with /CLR
- Incorrect scenarios could cause worse optimizations!
PGO profile gathering
- Two major themes of PGO profile gathering:
  - Identify "hot paths" in program execution and optimize to make these paths perform well; likewise, identify "cold paths" to separate cold code (or dead code) from hot code
  - Identify "typical" values, such as switch values, loop induction variables, and targets of indirect calls, and optimize code for these values
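As an illustration of optimizing for a "typical" switch value, here is a hand-written sketch of what value profiling enables: peeling the hot case out ahead of the general dispatch. The opcodes and the choice of hot value are invented for the example.

```cpp
#include <cassert>

// Illustrative sketch of value profiling on a switch: if the profile
// says 'op' is almost always OP_ADD, the hot case can be tested first,
// ahead of the general switch/jump table. Opcodes are invented.
enum Op { OP_ADD, OP_SUB, OP_MUL };

int eval(Op op, int a, int b)
{
    if (op == OP_ADD)        // hot value peeled by (hypothetical) PGO
        return a + b;
    switch (op) {            // cold cases keep the general dispatch
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        default:     return 0;
    }
}
```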
PGO main optimizations: inlining
- Improved inlining heuristics
- Inline based on frequency of call, not function size or depth of call stack
  - "Hot" call sites: inline aggressively
  - "Cold" call sites: only inline if there are other optimization opportunities (such as folding)
  - "Dead" call sites: only inline the trivial cases
PGO main optimizations: inlining
- Speculative inlining: used for virtual call speculation
  - Indirect calls are profiled to find typical targets
  - An indirect call heavily biased toward certain target(s) can be multi-versioned
  - The new sequence contains direct call(s) to typical target(s), which can be inlined
- Partial inlining: only inline the portions of the callee we execute
  - If the cold code is called, call the non-inlined function
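The following is a hand-written sketch of what partial inlining achieves (the compiler performs the split automatically; the function names and shapes here are invented): a small hot early-out that is cheap to inline at the call site, with the cold work left behind an out-of-line call.

```cpp
#include <cassert>
#include <string>

std::string formatError(int code);   // cold path, not inlined

// Hot path is trivial and can be inlined at each call site;
// the cold path stays a real call into formatError.
inline std::string statusText(int code)
{
    if (code == 0)                   // hot path: inlined
        return "ok";
    return formatError(code);        // cold path: out-of-line call
}

std::string formatError(int code)    // cold code, kept out of line
{
    return "error " + std::to_string(code);
}
```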
PGO main optimizations: code size
- The choice of favoring size versus speed is made on a per-function basis
- Program execution should be dominated by functions optimized for speed; less-frequently used functions should be small
- PGO computes a dynamic instruction count for each profiled function; inlining effects are taken into account
- Functions are sorted in descending order by count: functions in the upper 99% of total dynamic instruction count are optimized for speed, the others are compressed
- In large applications (Vista, SQL) most functions are optimized for size
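The 99% policy described above can be sketched as a small classification routine. This is an illustration of the stated policy, not the compiler's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of the slide's policy: sort functions by dynamic instruction
// count and mark them for speed optimization until 99% of the total
// count is covered; the remainder are optimized for size.
std::set<std::string> speedOptimized(
    std::vector<std::pair<std::string, long long>> funcs)
{
    std::sort(funcs.begin(), funcs.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    long long total = 0;
    for (const auto &f : funcs) total += f.second;
    std::set<std::string> speed;
    long long covered = 0;
    for (const auto &f : funcs) {
        if (covered >= total * 99 / 100) break;  // rest: optimize for size
        speed.insert(f.first);
        covered += f.second;
    }
    return speed;
}
```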
PGO main optimizations: locality
- Reorder the code to "fall through" wherever possible
  - Intra-function layout reorders basic blocks so that the major trace falls through whenever possible
  - Inter-function layout tries to place frequent caller-callee pairs near one another in the image
- Extract "dead" code from the .text section and put it in a remote section of the image
  - Dead code can be entire functions that are never called, or basic blocks inside a function
- The penalty for being wrong is very large, so the profile must be accurate!
What code benefits most?
- C++ programs: many virtual calls can be inlined once the target is determined through profiling
- Large applications where size and speed are important
- Code with frequent branches that are difficult to predict at compile time
- Code which can be separated by profiling into "hot" and "cold" blocks to help instruction cache locality
- Code for which you know the typical usage patterns and can produce accurate profiling scenarios
Scenario 1
- Customer compiles with /O2 and gets pretty good performance, but wants to take advantage of advanced optimizations like LTCG and PGO
- Code is tested by the dev team throughout the development cycle using unit and bug-regression tests
- Customer has done performance measurements of the code, but has no automated tests to measure performance; they believe it can improve
- Is this customer ready to try PGO? Probably not.
Scenario 2
- Customer has well-defined performance goals and tests set up to measure performance
- Customer knows typical usage patterns for the application
- Application is being built with LTCG
- Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations
- Is this customer ready to use PGO? Maybe…
Scenario 3
- Customer has well-defined performance goals and tests set up to measure performance
- Customer knows typical usage patterns for the application
- Application is being built with LTCG
- Application spends most of its time in branches and calls
- Application is fairly large and makes use of inheritance
- Is this customer ready to use PGO? Definitely.
Scenario 4
- Customer has a build lab and wants to enable PGO in nightly builds, but profiling every night seems too expensive
- Solution: PGO Incremental Update (PGU)
  - Avoids running profile scenarios at every build
  - PGU uses "stale" profile data; you can check in profile data and refresh it weekly
  - PGU restricts optimizations: functions which have changed will not be optimized
  - Effects of localized changes are usually negligible
PGO sweeper
- Some scenarios are difficult to collect profile data for:
  - The profile scenario may not begin and end with application launch and shutdown
  - Some components cannot write a file
  - Some components cannot link to the PGO runtime DLL
- PGO sweeper collects profile data from running instrumented processes
  - It lets you close the currently open .pgc file and create a new one without exiting the instrumented binary
  - You get one .pgc file per run or sweep; you can delete any .pgc files you do not want reflected in your scenario
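A rough sketch of a sweep session. The pgosweep command line shown is from memory and worth checking against the VC++ 2005 documentation; the binary and .pgc file names are illustrative:

```shell
rem Run the instrumented binary, then capture the counts collected so
rem far without exiting it; each sweep writes a fresh .pgc file.
pgosweep app.inst.exe app-scenario1.pgc
rem Later sweeps capture subsequent activity into separate files.
pgosweep app.inst.exe app-scenario2.pgc
```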
PGO Manager
- PGO manager merges profile data from one or more .pgc files into the .pgd file
  - The .pgd file is the main profile database
  - This lets you profile multiple scenarios (.pgc) for a single codebase into one profile database (.pgd)
- PGO manager also lets you generate reports from the .pgd file to check that your scenarios "feel right" in the code
- Information in the reports includes:
  - Module count, function count, arc and value count
  - Static (all) instruction count, dynamic (hot) instruction count
  - Basic block count, average basic block size
  - Function entry count
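A sketch of a typical pgomgr session, merging two scenario runs and then reporting on the database. The file names are illustrative, and the exact switch spellings should be checked against pgomgr's own help output:

```shell
rem Merge two scenario runs into the profile database; /merge:2 counts
rem the second scenario's data twice, weighting it more heavily.
pgomgr /merge scenario1.pgc appname.pgd
pgomgr /merge:2 scenario2.pgc appname.pgd
rem Summarize the merged database to sanity-check the scenarios.
pgomgr /summary appname.pgd
```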
How much performance does PGO get?
- Performance gain is architecture- and application-specific
  - IA64 sees the biggest gains
  - x64 benefits more than x86
  - Large applications benefit more than small: SQL Server saw over 30% gains through PGO
  - Many parts of Windows use PGO to balance size vs. speed
- If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win
- Once your testing is in place, integrating PGO into your build process should be easy
Performance gains over LTCG
Call-graph profiling
- Given this call graph, determine which code paths are hot and which are cold

[Diagram: a call graph over the functions foo, bat, bar, baz, and a.]
Call-graph profiling continued
- Measure the frequency of calls

[Diagram: the same call graph annotated with call frequencies such as 100, 75, 50, 20, 15, and 10 on its edges.]
Call-graph profiling after inlining
- Inline functions based on the call profile
- The highest-frequency calls are (bar, baz) and (bat, bar)

[Diagram: the call graph after inlining those highest-frequency callees, with updated edge frequencies.]
Reordering basic blocks
- Change code layout to improve instruction cache locality

[Diagram: a flow graph in which block A branches to B and C, both leading to D. The execution profile shows the path through C executing 100 times and the path through B only 10 times. The default layout orders the blocks A, B, C, D; the optimized layout orders them A, C, B, D so that the hot path falls through.]
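A toy version of the idea behind profile-driven block layout. This is a simple greedy chain-building sketch, not the VC++ algorithm, and its output can differ from the slide's layout in where cold blocks land:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Greedy trace layout: starting from the entry block, repeatedly fall
// through to the hottest not-yet-placed successor; when there is none,
// pick any remaining block. A simplification of real layout passes.
std::vector<std::string> layoutBlocks(
    const std::vector<std::string> &blocks,
    const std::map<std::pair<std::string, std::string>, int> &edgeCount)
{
    std::set<std::string> unplaced(blocks.begin(), blocks.end());
    std::vector<std::string> layout;
    std::string cur = blocks.front();            // entry block first
    while (true) {
        layout.push_back(cur);
        unplaced.erase(cur);
        if (unplaced.empty()) break;
        std::string next;
        int best = -1;
        for (const std::string &b : unplaced) {  // hottest successor wins
            auto it = edgeCount.find({cur, b});
            if (it != edgeCount.end() && it->second > best) {
                best = it->second;
                next = b;
            }
        }
        if (best < 0)
            next = *unplaced.begin();            // no hot successor left
        cur = next;
    }
    return layout;
}
```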
Speculative inlining of virtual calls
- Profiling shows the dynamic type of object A in function Func was almost always Foo (and almost never Bar)

    class Base { /* … */ virtual void call(); };
    class Foo : Base { /* … */ void call(); };
    class Bar : Base { /* … */ void call(); };

    // Before PGO
    void Func(Base *A)
    {
        // …
        while (true) {
            // …
            A->call();
            // …
        }
    }

    // After PGO
    void Func(Base *A)
    {
        // …
        while (true) {
            // …
            if (type(A) == Foo) {
                // inline of A->call();
            } else {
                A->call();   // virtual dispatch
            }
            // …
        }
    }
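A runnable model of the transformation above. The compiler's cheap internal type test is approximated here with dynamic_cast, and the "inlined" body is written out by hand; class names mirror the slide:

```cpp
#include <cassert>

struct Base { virtual int call() { return 0; } virtual ~Base() {} };
struct Foo : Base { int call() override { return 1; } };
struct Bar : Base { int call() override { return 2; } };

// Guarded devirtualization: the common case is tested for directly
// and its body "inlined"; rare cases fall back to virtual dispatch.
int Func(Base *A)
{
    if (dynamic_cast<Foo *>(A))   // stand-in for "type(A) == Foo"
        return 1;                 // inlined body of Foo::call()
    return A->call();             // rare case: normal virtual dispatch
}
```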
Partial inlining
- Profiling shows that condition Cond favors the left (hot) branch over the right (cold) branch

[Diagram: Basic Block 1 flows into the test of Cond, which branches to Hot Code and Cold Code; both paths rejoin at More Code.]
Partial inlining concluded
- We can inline the hot path, and not the cold path
- We can make different decisions at each call site!

[Diagram: the same flow graph, with the hot path inlined at the call site and the cold path left as a call to the non-inlined function.]
Using PGO (in more detail)

[Diagram of the build flow:
 1. Compile the source code with /GL and optimization options, producing object files.
 2. Link with /LTCG:PGI to produce the instrumented binary; this also creates a .PGD file.
 3. Run the instrumented binary on your scenarios, producing one or more .PGC files.
 4. Link the object files with /LTCG:PGO; the .PGC files are merged into the .PGD file and the optimized binary is produced.]
PGO tips
- The scenarios used to generate the profile data should be real-world scenarios. The scenarios are NOT an attempt to do code coverage.
- Training with scenarios that are not representative of real-world use can result in code that performs worse than if PGO were not used.
- Name the optimized binary something different from the instrumented binary, for example app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything.
- To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
PGO tips
- If you have two scenarios that run for different amounts of time but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them.
- You can use the speed switch to change the speed/size thresholds.
- You can control the inlining threshold with a switch, but use it with care: the values from 0-100 aren't linear.
- Integrate PGO into your build process and update scenarios frequently for the most consistent results and best performance increases.
In summary
- Using PGO is very easy, with four simple steps:
  1. CL to parse the source files:
       cl /c /O2 /GL *.cpp
  2. LINK /PGI to generate the instrumented image (this also generates the .PGD file, the PGO database):
       link /ltcg:pgi /pgd:appname.pgd *.obj *.lib
  3. Run your program on representative scenarios; this generates .PGC files (PGO profile data)
  4. LINK /PGO to generate the optimized image (implicitly uses the generated .PGC files):
       link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
More information
- Matt Pietrek's Under the Hood column from May 2002 has a fantastic explanation of LTCG internals
- Multiple articles on PGO are located on MSDN (the links are long: just search for PGO on MSDN)
- Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN
- Improvements are coming in the new VC++ backend
  - Based on the Phoenix optimization framework
  - Profiling is a major scenario for the Phoenix-based optimizer
  - There will be a talk on Phoenix later today