Cross-Module Optimization
Thomas Lindgren ([email protected])
Overview
• OM - optimization manager
  – Erlang-to-Erlang optimizer (mostly)
  – ~20k lines of Erlang
  – intended to accelerate large applications
• The rest of this talk
  – What does OM do?
  – How well does it work?
OM overview
[Pipeline diagram: source code plus profiling code feed a training execution, which produces annotation trees; per-module passes (higher-order elimination, apply open-coding, outlining, module splitting) and aggregation with other modules, followed by inlining and simplification, yield the production executable.]
Profiling and annotation
• Instrument code with profiling counters
  – standard counters (per function clause, per call site, …)
  – which modules call each other, how often
  – which function is used at apply
• Annotations saved as syntax trees + counters
• Post-training: read counters, decorate annotation trees, optimize the result
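A per-site counter store of this kind can be sketched with an ets table. The module name om_profile and its API below are invented for illustration; they are not OM's actual interface:

```erlang
%% Minimal sketch of profiling counters backed by an ets table.
%% The module name om_profile and its API are hypothetical.
-module(om_profile).
-export([init/0, bump/1, read/1]).

%% Create one public table of {SiteId, Count} entries.
init() ->
    ets:new(om_counters, [named_table, public, set]).

%% Increment the counter for a call site, clause, or
%% {Mod,Func,Arity} key, creating it on first use.
bump(SiteId) ->
    ets:update_counter(om_counters, SiteId, 1, {SiteId, 0}).

%% Read a counter after the training run; 0 if never hit.
read(SiteId) ->
    case ets:lookup(om_counters, SiteId) of
        [{_, N}] -> N;
        []       -> 0
    end.
```

After training, the optimizer would walk the annotation trees and attach the read counter values to the corresponding nodes.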
Per-module optimizations
• Higher-order elimination: replace lists:map, lists:foldl, and others with specialized functions where suitable
• Apply open-coding: replace apply with explicit (open-ended) switch
• Outlining: cold (seldom-executed) clauses are moved out-of-line
• Module splitting: cold code moved into new module
Higher-order elimination
Before:
  lists:map(fun(X) -> X+Y end, Xs)

After:
  lists_map_0(Xs, Y)

where

  lists_map_0([X|A], Y) -> [X+Y | lists_map_0(A, Y)];
  lists_map_0([], _Y) -> [].

(The equivalent is done for most functions in lists.)
Apply open-coding
• apply(M,F,[A1,…,An])
• Profiling reveals that certain {Mod,Func,Arity} tuples are most common
• Switch on likely functions
• Enables inlining of explicit call (e.g., m1:f1(A1,A2))
case {M,F,length(As)} of
{m1,f1,2} ->
[A1,A2] = As,
m1:f1(A1,A2);
…
_ -> apply(M,F,As)
end
(most general case; optimization possible when arity known, when call is local, …)
Outlining
• Move cold function clauses, switch clauses, ... out-of-line
• Reduces function size => more inlining possible
  – outlining + inlining = (structured) partial inlining
• Sometimes improves pattern matching code

Before:
  case read_file(FD,Len) of
    {error,closed} -> …;
    {error,prot} -> …;
    {ok,{incomplete,Data}} -> …;
    {ok,{complete,Data}} -> …;
    X -> ...
  end

After:
  case read_file(FD,Len) of
    {ok,{complete,Data}} -> …;
    Else -> 'OUTLINED'(Else)
  end
Module splitting
• Hot code retained in original module
• Cold functions moved into “cold module”
  – currently: duplicate entire original module
• Calls to cold functions re-routed to cold module
  – outlined function clauses often end up in cold module
• Benefit: reduces hot module size => more aggregation
  – drawback: total code size increases (unimportant?)
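The resulting shape of a split module might look as follows. The module names m and m_cold and the function bodies are invented for illustration:

```erlang
%% Hypothetical sketch of a module after splitting: the hot
%% module keeps frequently-executed code and re-routes cold
%% entry points to the generated cold module.
-module(m).
-export([hot_f/1, cold_g/1]).

%% Hot: retained in the original module.
hot_f(X) ->
    X + 1.

%% Cold: the call is re-routed; m_cold is (currently) a
%% duplicate of the entire original module.
cold_g(X) ->
    m_cold:cold_g(X).
```

Keeping the hot module small is what later lets aggregation pack more hot code together.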
Aggregation
• Optimization across module boundaries
  – but in Erlang, any module can be replaced at any time (“hot code loading”)
• Merge optimized hot modules into aggregates
  – optimize each aggregate aggressively
  – but in Erlang you can replace any module at runtime
  – how to do it?
Hot code loading
• Remote calls m:f(X) logically do the following:
  – lookup module named m
  – lookup function named f/1 in the found module
  – call the found function
• A new version of m can be loaded at any time
  – but occurs seldom in practice (every month? week?)
  – (an aside: OTP further structures code replacement)
  – we do not take advantage of this
Hot code loading (2)
• Inlining of remote calls is not possible
  – what if the inlined module subsequently changes?
  – worse, remote calls are very common
• Merging two modules into one is problematic
  – making remote calls into local calls changes behaviour
  – safe approach: speculate that code has not changed
Hot code loading (3)
• Remote call is rewritten into test + common-case local call + backup remote call
• latest(m) can be implemented in linker
  – initially, always true
  – when new m loaded, becomes always false

m:f(X1,X2)

  becomes

case latest(m) of
  true -> local_f(X1,X2);
  false -> m:f(X1,X2)
end
Aggregation
• Merge modules that call each other often
  – use module-module call profile
  – remote calls are rewritten to use latest(m)
  – aggregation limited by size
• Widely-shared modules (e.g., lists) are engulfed
  – copy engulfed module into the calling module
  – necessary to enable high-quality aggregation without huge aggregates
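Engulfing a widely-shared function such as lists:map/2 amounts to placing a private copy inside the aggregate, so the remote call becomes a local (and inlinable) one. A sketch, with an invented module and function name:

```erlang
%% Hypothetical sketch of engulfing: the aggregate carries its
%% own copy of lists:map/2, turning a remote call into a local one.
-module(aggregate_demo).
-export([map_copy/2]).

%% Private copy of lists:map/2 inside the aggregate.
map_copy(F, [X | Xs]) -> [F(X) | map_copy(F, Xs)];
map_copy(_F, [])      -> [].
```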
Post-aggregation optimization
• Profile-guided inlining
  – consider call sites in order of importance (# calls)
  – total amount of inlining limited by code size increase
  – avoids pitfalls of static inlining: working on wrong code, too conservative for important sites
• Simplification of resulting code
  – dead function removal (occurs due to engulfing, inlining)
  – case-of-case, beta reduction, ...
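As an illustration of case-of-case simplification (the module and function names below are invented), the outer case's branches can be pushed into the inner case, removing the intermediate boolean:

```erlang
%% Hypothetical sketch of the case-of-case simplification.
-module(om_simplify_demo).
-export([before/1, simplified/1]).

%% Before: the outer case scrutinizes the result of an inner case.
before(Xs) ->
    case (case Xs of
              [] -> false;
              _  -> true
          end) of
        true  -> nonempty;
        false -> empty
    end.

%% After: the outer branches are pushed into the inner case and
%% the intermediate true/false values disappear.
simplified(Xs) ->
    case Xs of
        [] -> empty;
        _  -> nonempty
    end.
```

Both functions compute the same result; the simplified form avoids building and immediately re-testing the boolean.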
Results
• Benchmarks: important subsystems of OTP, in daily use
  – decode1: protocol processing “inner loop”
  – beam: beam compiler on lists.erl
  – gen_tcp: small messages over local socket
  – ldapv2: encoding and decoding LDAPv2 ASN.1 PDUs
  – mnesia: realtime database running simple pseudo-HLR
• Benchmark suite freely available from author
Results (2)
• Each benchmark compiled with OM
  – same input used for training and production
  – latest(m) simulated with cheap test
• Each benchmark run 30-40 times for baseline and optimized
  – removed outliers for gen_tcp and mnesia to get more focussed speedup values
Results (3)
benchmark   speedup   notes
decode1     1.12      due to outlining
beam        3.96      h-o elim: 2.94x
gen_tcp     2.54      (2.15 w/ outliers)
ldapv2      1.01
mnesia      1.17      (1.28 w/ outliers)
Conclusions
• Optimization across modules beneficial
• Profile-driven optimization practical and beneficial
• Future work:
  – try real applications (100s-1000s of modules)
  – more optimizations
  – tune optimizations
  – automate reprofiling/recompilation