Adaptive Compilation (results from current work)
Fall 2003
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
COMP 512, Rice University
Roadmap
In the lecture on optimizing for code size (Lecture 17), we looked at some early results on using genetic algorithms (GAs) to pick compilation orders that produce compact programs. In the lecture on adaptive blocksize choice for dgemm with the MIPSPro compiler, we saw more adaptive behavior.
In this lecture:
• More experimental results
• Discussion of search algorithms other than GAs
• Insight into construction & use of adaptive compilers
Adaptive Compilers

We are exploring new organizing principles:
• Explicit objective function (chosen by the user)
• Steering algorithm controls the optimizer and back end
  - Picks & orders methods, chooses parameters
  - Uses multiple trials to explore the solution space
  - Finds a configuration that minimizes the objective function

[Diagram: the steering algorithm varies parameters of the optimizer and back end; the front end remains intact; the objective function evaluates the executable code.]
Adaptive Compilers

These adaptive compilers:
• Adapt to the program, the target, & the objective function
• Use the power of modern computers to improve code quality
• Produce better results than their fixed-sequence siblings
• Are much more expensive (today) than fixed-sequence compilers

They are also a practical way to find many optimization bugs.
What’s the Point?

These compiler configurations matter ...
• Choice of transformations
  - Significant overlap in coverage
  - Each input program has different opportunities
  - Best choice depends on input, target, & objective function
• Order of transformations
  - They create & destroy opportunities for each other
  - We see significant variations (up & down) on simple codes
  - Best choice depends on input, target, & objective function
• We do not know enough, today, to predict a good sequence
• Our prototype systems search the space of sequences
Adaptive Compilation

Research prototype
• Based on the MSCP compiler
• 13 transformations
  - Run in any order (not easy)
  - Many options & variants
• Search-based steering algorithms
  - Hill climber
  - Genetic algorithms
  - Constructive greedy method
  - Exhaustive enumeration
• Objective functions
  - Run-time speed
  - Code space
  - Dynamic bit-transitions
• Experimental tool
  - Exploring applications
  - Learning about the search space
  - Designing better searches
Experimental Results with Compilation Order

Early Experiments (1999, Lecture 17)
• Space, then speed (10-transformation subset)
  - 13% smaller code than the fixed sequence (0 to 41%)
  - Code was generally faster (26% to -25%; 5 of 14 slower)
• Speed, then space
  - 20% faster code than the fixed sequence (best was 26%)
  - Code was generally smaller (0 to 5%)
• Genetic algorithm
  - Evaluate each sequence
  - Replace the worst + 3 at each generation
  - Generate new strings with crossover
  - Apply mutation to all; save the best
• Found the “best” sequence in 200 to 300 generations of size 20
• Register-relative procedure abstraction gets 5% space, -2% speed

The GA took many fewer probes to find the “best” sequence than random sampling did.
NB: 10-of-10 fmin space, optimizing for size
Experience from Early Experiments

Improving the Genetic Algorithm
• Experiments aimed at understanding & improving convergence
• A larger population helps
  - Tried pools from 50 to 1000; work mostly with 50 to 100
  - 100 generations is about right
• Use weighted selection in reproductive choice
  - Fitness scaling to exaggerate late, small improvements
• Crossover
  - 2-point, random crossover
  - Apply “mutate until unique” to each new string
• Variable-length chromosomes
  - With fixed strings, the GA discovers NOPs
  - Varying strings rarely have useless passes

Current experiments use fixed-length (9) strings … a 9-of-13 space
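A GA of this general shape can be sketched as follows. This is a reconstruction under stated assumptions, not the MSCP code: `run_ga` and the `evaluate` callback (which in the real system would compile and run the program with the given pass string) are hypothetical names, and rank-based weighting stands in for the fitness scaling the slides mention.

```python
import random

def run_ga(evaluate, passes, length=9, pool=50, generations=100,
           replace=4, seed=0):
    """GA sketch: fixed-length pass strings, rank-weighted selection,
    2-point crossover, and mutate-until-unique.  `evaluate` returns the
    objective value (e.g., cycles executed) -- lower is fitter."""
    rng = random.Random(seed)
    rand_seq = lambda: "".join(rng.choice(passes) for _ in range(length))
    population = set()
    while len(population) < pool:           # distinct random strings
        population.add(rand_seq())
    population = sorted(population, key=evaluate)      # best first
    for _ in range(generations):
        # rank-weighted selection: fitter strings reproduce more often
        weights = [pool - rank for rank in range(pool)]
        children = []
        while len(children) < replace:
            a, b = rng.choices(population, weights=weights, k=2)
            i, j = sorted(rng.sample(range(length + 1), 2))
            child = a[:i] + b[i:j] + a[j:]             # 2-point crossover
            # "mutate until unique": perturb until the child is new
            while child in population or child in children:
                k = rng.randrange(length)
                child = child[:k] + rng.choice(passes) + child[k + 1:]
            children.append(child)
        # replace the worst `replace` strings with the new children
        population = sorted(population[:pool - replace] + children,
                            key=evaluate)
    return population[0], evaluate(population[0])
```

With `passes="plosn"` and `length=9` this searches the same kind of 9-of-5 string space the slides describe; the pool size, replacement count, and weighting scheme are illustrative choices.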
Improving the Search

Two kinds of experiments
• Enumerations
  - Choose a code & a subspace
  - Evaluate every point in that subspace
• Explorations
  - Choose a code & a search algorithm
  - Run a series of searches

Enumerations are critical for understanding the space.
Explorations are critical for evaluating the search algorithms.
Enumerating FMIN

• One application, FMIN
  - 150 lines of Fortran, 44 basic blocks
  - Exhibited complex behavior in other experiments
• Ran the hill climber from many random starting points
  - Picked the 5 most-used passes from the winners: p, l, o, s, & n
    (a bad way to pick them?)
• Enumerated the 10-of-5 subspace, fmin+plosn
  - 9,765,625 distinct combinations
  - 34,000 to 37,000 experiments/machine/day
  - Produced each sequence & the number of cycles to execute
  - 14 CPU-months to complete
• Results shed light on FMIN & the search space
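Exhaustive enumeration of such a subspace is conceptually simple. A minimal sketch, with a hypothetical `evaluate` standing in for compiling and running the benchmark:

```python
import itertools

def enumerate_subspace(evaluate, passes, length):
    """Evaluate every point in the |passes|**length subspace.

    Returns the full sequence->value table and the best sequence found.
    In the real system, `evaluate` would compile the program with that
    pass string and report cycles executed."""
    table = {}
    for combo in itertools.product(passes, repeat=length):
        seq = "".join(combo)
        table[seq] = evaluate(seq)
    best = min(table, key=table.get)
    return table, best

# The 10-of-5 subspace has 5**10 = 9,765,625 distinct sequences,
# which is why the full enumeration took months of machine time.
```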
Results from Enumerating FMIN

In the set of 9,765,625 sequences (fmin+plosn):
• Best sequence takes 986 cycles
• Seven sequences at 987
• Worst sequence takes 1716
• Unoptimized code takes 1716
• Range is 0% to 43% improvement

With the full set of transformations:
• GA: 822 cycles in 184 generations of 100; 839 in 100 generations of 50
• Hill climber: 843 cycles in 1,355 probes

Local minima
• 258 strict local minima (x < all Hamming-1 points)
• 27,315 non-strict local minima (x ≤ all Hamming-1 points)
• All local minima are improvements
• They come in many shapes
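Given the full enumeration table, counting the two kinds of local minima is a direct check over Hamming-1 neighbors. A sketch (function names are ours; we assume here that the non-strict count includes the strict minima, which the slides do not state explicitly):

```python
def hamming1_neighbors(seq, passes):
    """All sequences differing from `seq` in exactly one position."""
    for i, c in enumerate(seq):
        for p in passes:
            if p != c:
                yield seq[:i] + p + seq[i + 1:]

def classify_minima(fitness, passes):
    """Count strict (f < all Hamming-1 neighbors) and non-strict
    (f <= all Hamming-1 neighbors) local minima.

    `fitness` maps every sequence in a fully enumerated subspace to
    its value -- exactly the table the fmin enumeration produced."""
    strict = nonstrict = 0
    for seq, f in fitness.items():
        nbrs = [fitness[n] for n in hamming1_neighbors(seq, passes)]
        if all(f <= g for g in nbrs):
            nonstrict += 1            # non-strict count includes strict minima
            if all(f < g for g in nbrs):
                strict += 1
    return strict, nonstrict
```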
Results from Enumerating FMIN

Distribution of solutions (fmin+plosn)
[Histogram of solution quality, bucketed at +5%, +10%, and +20% of the best sequence.]
A fair number of “bad” solutions!
Results from Enumerating FMIN

Strict local minima have many shapes (fmin+plosn)
• 132 of these spikes, from loop peeling
[Plot of one minimum and its Hamming-1 points; magnitude is wall height (≥ 1).]
The landscape looks like an asteroid, in 10 dimensions.
Results from Enumerating FMIN

Distribution of local minima (fmin+plosn)
[Histogram bucketed at +5%, +10%, +20%, and +30%; annotation: “1026 goes to 14,879”.]
These are local minima, not all solutions.
Results from Enumerating FMIN

What does the space look like?
• 4-of-5 space, fmin+plosn
• Prefix-vs-suffix plot
[Surface plot of the subspace.]
Looks like noise to me:
• Not convex
• Not differentiable
• Outright hostile to the searches

Does the order of p, l, o, s, & n matter?
Results from Enumerating FMIN

What does the space look like?
This looks better
• Same space, different order
• Troughs, valleys, … (fmin+plosn)
Unfortunately, this surface does not exist: its axes are inconsistent.
Results from Enumerating FMIN

The 4-of-5 prefix-vs-suffix view of fmin+plosn is not convex, not differentiable, and outright hostile to the searches. We still need to search this space!
Searching FMIN

The steering algorithm needs to search the space for good sequences
• Hill climbers
  - Start somewhere (a random point) and walk downhill until stuck
  - To pick the next point, try steepest descent or a random downhill step
• Random probes
  - Generate sequences at random
  - If enough good sequences exist, you’re likely to hit one
• Greedy constructive algorithm
  - Repeat until done: take the best step available
  - In practice: consider prepending & appending one pass
• Genetic algorithm (of course)
  - Combines big steps (crossover) & small steps (mutation)
  - Survival of the fittest and mutate-until-unique make it a search
Hill Climbing in FMIN

Local downhill walk (fmin+plosn)
• Each step moves to the lowest Hamming-1 point
• Reaches minima quickly
  - 12 steps in 10-of-5 for fmin+plosn
• Suggests a two-pronged attack
  - Pick a random point & walk to a minimum
  - Start from many random points
• Rarely finds the best point
Hill Climbing in FMIN

Narrowing the range of “good” to 5% (fmin+plosn)
• Steepest descent (over Hamming-1 points)
  - Gets stuck in local minima
• Random downhill walk (take any lower Hamming-1 point)
  - Gets within 5% of the minimum 88% of the time
  - 7-step average, 18-step worst case
• Best downhill walk
  - Gets within 5% of the minimum 99% of the time
• Worst downhill walk
  - Gets within 5% of the minimum 38% of the time

Good news: we can reliably get within 5%.
Bad news: we cannot get within 1 to 2%.

This suggests that we need to take bigger steps, as the GA does!
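The random downhill walk with multiple restarts can be sketched as follows; this is our reconstruction, with a hypothetical `evaluate` standing in for compile-and-run. Replacing the random choice with `min(scored)` would give the steepest-descent variant.

```python
import random

def downhill_walk(evaluate, passes, length, rng):
    """One walk: start at a random sequence, repeatedly move to a
    randomly chosen strictly-better Hamming-1 neighbor, stop when no
    neighbor improves (a local minimum)."""
    seq = "".join(rng.choice(passes) for _ in range(length))
    cycles = evaluate(seq)
    while True:
        neighbors = [seq[:i] + p + seq[i + 1:]
                     for i in range(length) for p in passes if p != seq[i]]
        scored = [(evaluate(n), n) for n in neighbors]
        better = [t for t in scored if t[0] < cycles]
        if not better:
            return cycles, seq            # stuck: local minimum
        cycles, seq = rng.choice(better)  # random downhill step

def hill_climb(evaluate, passes, length, restarts=50, seed=0):
    """HC-50-style search: restart from many random points, keep the
    best local minimum found."""
    rng = random.Random(seed)
    return min(downhill_walk(evaluate, passes, length, rng)
               for _ in range(restarts))
```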
Searching in FMIN

Greedy constructive algorithm
• Start with the null sequence
• Add the best option to the sequence (evaluate prefixes & suffixes)
• Repeat until the desired length is reached

Ties complicate the matter
• Often, multiple additions have the same fitness value
• Each may lead to a different part of the search space
• How should the algorithm handle this situation?
  - GC-EXH: explore all ties
  - GC-50: pick one at random & go forward (repeat 50 times)
  - GC-fast: use breadth-first, not depth-first, search
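The GC-50 variant can be sketched as follows (our reconstruction; `greedy_step`, `gc50`, and the `evaluate` callback are hypothetical names, with `evaluate` standing in for compile-and-run):

```python
import random

def greedy_step(evaluate, passes, seq, rng):
    """Try each pass as a new prefix and as a new suffix; keep the best,
    breaking ties at random (the GC-50 policy)."""
    candidates = {p + seq for p in passes} | {seq + p for p in passes}
    scored = [(evaluate(c), c) for c in candidates]
    best = min(s for s, _ in scored)
    return rng.choice(sorted(c for s, c in scored if s == best))

def gc50(evaluate, passes, length, trials=50, seed=0):
    """GC-50: run the greedy construction `trials` times with random
    tie-breaking and return the best sequence found."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        seq = ""                          # start from the null sequence
        for _ in range(length):
            seq = greedy_step(evaluate, passes, seq, rng)
        result = (evaluate(seq), seq)
        if best is None or result < best:
            best = result
    return best
```

GC-EXH would instead recurse into every tied candidate, and GC-fast would carry all tied prefixes forward breadth-first.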
Search Results in FMIN

[Plot: instructions executed vs. sequences evaluated, fmin 9-of-13 space.]
• GA-50 did best
• Tie-breaking did not matter in this case; GC was cheap
• HC-50 was cost-effective
• RC-3000 is not bad
• GC-fast beat RC-300
Search Results in FMIN
Data on more programs
• Lots of trends here …
• GA wins (for more effort)
• GC & HC do well for less effort
Search Results

Does tie-breaking ever matter in GC?
• 91,663 sequences tested versus 325 sequences in adpcm
• Small variation in fitness value (see next slide)
Search Results in FMIN
Factor of 282 difference in sequences evaluated
2.2% difference in instructions executed
Road Map for our Project

Our goals
• Short term (now)
  - Characterize the problems, the potential, & the search space
  - Learn to find outstanding sequences quickly (search)
• Medium term (3 to 5 years)
  - Develop practical adaptive compilers for scalar optimization
  - Develop a framework for self-tuning optimizers
  - Develop proxies and estimators for performance (speed)
• Long term (5 to 10 years)
  - Apply these techniques to harder problems: data distribution, parallelization schemes on real codes; compiling for complex environments, like the Grid
  - Build systems that use knowledge from search to make adaptive compilation both practical & routine
Relating Code Properties to Sequences

Looking at the input program might let us predict good sequences
• Lots of work is needed to generalize our understanding
• We want to look at four aspects of the problem
  - Search strategies
  - Objective functions
  - Program characteristics
  - Accumulated experience
• Each may contribute to finding the best sequence efficiently

With enough experience, we may be able to predict good sequences.
Making It Practical

A full-blown search may not be the answer. Many strategies are possible for a production system:
• Best of k trials
• Pick k sequences from a database of many (say 5,000)
  - Might use program properties to pick the sequences
• Incrementalize the search over many compilations
• Find a good program-specific sequence and keep using it
• Find the best sequence in a fixed amount of compile time

This should be fertile ground for research & experimentation.
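The first strategy, best of k trials, is the simplest to sketch. This is an illustrative reconstruction (`best_of_k` and the `evaluate` callback are hypothetical names; `evaluate` stands in for compiling with a sequence and measuring the objective):

```python
import random

def best_of_k(evaluate, passes, length, k=10, seed=0):
    """Best-of-k production strategy: compile with k randomly drawn
    pass sequences and keep whichever one scores best on the
    user's objective function."""
    rng = random.Random(seed)
    best = None
    for _ in range(k):
        seq = "".join(rng.choice(passes) for _ in range(length))
        score = evaluate(seq)
        if best is None or score < best[0]:
            best = (score, seq)
    return best
```

With the same seed, increasing k can only improve the result, which makes the compile-time/quality trade-off easy to tune.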
Another Piece of the Puzzle

Engineering Reconfigurable Compilers
• We built ours (almost) by accident
  - Pass-structured optimizer (no surprise there)
  - Uses a definitive, low-level, linear IR
  - Each pass does its own analysis
  - Nothing is passed along unless it is written in the IR
• This eliminates most of the impediments to reordering
  - Name space for PRE & LCM
  - The OSR algorithm requires constant propagation & code motion
• Need ways to specify interactions & manage them
  - Check potential configurations for conflict
  - Package predetermined subsequences together
• Need ways to use the results of earlier searches to improve the process

(Lots of effort went into building fast components.)
Results from FMIN

Analysis (K-L divergence testing) suggests that we can obtain a good model of the distribution with 1,000 probes of the subspace
• The GA used 6,000 probes in a 14-transformation subspace
• We need to test other subspaces
• Suggests the search engine can build a probabilistic model
  - We are devising an algorithm for this one …

Near-term experiments
• Running another subspace for FMIN
• Running subspaces on other, dissimilar codes
• Speed improvements to the experimental apparatus
  - Goal is 500,000 compiles/day in the lab
  - Will port the framework to the Rice TeraScale Cluster (264 IA-64s)
• Next step is to characterize program properties
Results from FMIN

Can we take advantage of clustering in the good solutions?
• A cluster is a connected set of points, all below a threshold value
• Clustering explains the need for multiple trials
Results from FMIN

Looking at cluster size
• A cluster is a connected set of points, all below a threshold value
• Clustering explains the need for multiple trials
• Cluster numbers and shapes depend on the threshold
  - There is a phase transition between 2% and 2.5% (for FMIN)
• At 10%, there are 13 clusters
  - 12 are singular points involving a “peel” minimum
  - The other one contains the rest of the 10%-good points
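Given the enumeration table, clusters under a threshold can be counted by flood fill over the Hamming-1 graph. A sketch using the definition above (the function name is ours, and in practice the 9.7M-point table would call for a more memory-conscious representation):

```python
from collections import deque

def count_clusters(fitness, passes, threshold):
    """Count connected clusters of 'good' points: sequences whose value
    is below `threshold`, with edges between Hamming-1 neighbors.

    `fitness` maps every sequence in a fully enumerated subspace to
    its value."""
    good = {s for s, f in fitness.items() if f < threshold}
    seen, clusters = set(), 0
    for start in good:
        if start in seen:
            continue
        clusters += 1                     # new, unvisited cluster
        queue = deque([start])
        seen.add(start)
        while queue:                      # breadth-first flood fill
            seq = queue.popleft()
            for i in range(len(seq)):
                for p in passes:
                    nbr = seq[:i] + p + seq[i + 1:]
                    if nbr in good and nbr not in seen:
                        seen.add(nbr)
                        queue.append(nbr)
    return clusters
```

Sweeping `threshold` over the table is one way to locate the phase transition the slides describe.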