Adaptive Compilation (results from current work)
Fall 2003
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
COMP 512, Rice University
Roadmap
In the lecture on optimizing for code size (Lecture 17), we looked at some early results on using genetic algorithms (GAs) to pick compilation orders that produce compact programs. In the lecture on adaptive blocksize choice for dgemm with the MIPSPro compiler, we saw more adaptive behavior.
In this lecture:
• More experimental results
• Discussion of search algorithms other than GAs
• Insight into construction & use of adaptive compilers
Adaptive Compilers

We are exploring new organizing principles:
• Explicit objective function (chosen by the user)
• Steering algorithm controls the optimizer and back end
  - Picks & orders methods, chooses parameters
  - Uses multiple trials to explore the solution space
  - Finds a configuration that minimizes the objective function

[Diagram: the steering algorithm varies parameters of the optimizer and back end; the front end remains intact; the objective function evaluates the executable code.]
Adaptive Compilers

These adaptive compilers:
• Adapt to the program, the target, & the objective function
• Use the power of modern computers to improve code quality
• Produce better results than their fixed-sequence siblings
• Are much more expensive (today) than fixed-sequence compilers

They are also a practical way to find many optimization bugs.
What’s the Point?

These compiler configurations matter ...
• Choice of transformations
  - Significant overlap in coverage
  - Each input program has different opportunities
  - Best choice depends on input, target, & objective function
• Order of transformations
  - They create & destroy opportunities for each other
  - We see significant variations (up & down) on simple codes
  - Best choice depends on input, target, & objective function
• We do not know enough, today, to predict a good sequence
• Our prototype systems search the space of sequences
Adaptive Compilation

Research prototype
• Based on the MSCP compiler
• 13 transformations
  - Run in any order (not easy)
  - Many options & variants
• Search-based steering algorithms
  - Hill climber
  - Genetic algorithms
  - Constructive greedy method
  - Exhaustive enumeration
• Objective functions
  - Run-time speed
  - Code space
  - Dynamic bit-transitions
• Experimental tool
  - Exploring applications
  - Learning about the search space
  - Designing better searches
Experimental Results with Compilation Order

Early Experiments (1999, Lecture 17)
• Space, then speed (10-transformation subset)
  - 13% smaller code than the fixed sequence (0 to 41%)
  - Code was generally faster (26% to -25%; 5 of 14 slower)
• Speed, then space
  - 20% faster code than the fixed sequence (best was 26%)
  - Code was generally smaller (0 to 5%)
• Genetic algorithm
  - Evaluate each sequence
  - Replace the worst + 3 at each generation
  - Generate new strings with crossover
  - Apply mutation to all; save the best
• Found the “best” sequence in 200 to 300 generations of size 20
• Register-relative procedure abstraction gets 5% space, -2% speed

The GA took many fewer probes to find the “best” sequence than random sampling did.
NB: 10-of-10 fmin space, optimizing for size
Experience from Early Experiments

Improving the Genetic Algorithm
• Experiments aimed at understanding & improving convergence
• A larger population helps
  - Tried pools from 50 to 1000; work mostly with 50 to 100
  - 100 generations is about right
• Use weighted selection in reproductive choice
  - Fitness scaling to exaggerate late, small improvements
• Crossover
  - 2-point, random crossover
  - Apply “mutate until unique” to each new string
• Variable-length chromosomes
  - With fixed strings, the GA discovers NOPs
  - Varying strings rarely have useless passes

Current experiments use fixed-length (9) strings … a 9-of-13 space
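A GA of this general shape can be sketched as follows. This is a reconstruction under stated assumptions, not the MSCP code: `run_ga` and the `evaluate` callback (which in the real system would compile and run the program with the given pass string) are hypothetical names, and rank-based weighting stands in for the fitness scaling the slides mention.

```python
import random

def run_ga(evaluate, passes, length=9, pool=50, generations=100,
           replace=4, seed=0):
    """GA sketch: fixed-length pass strings, rank-weighted selection,
    2-point crossover, and mutate-until-unique.  `evaluate` returns the
    objective value (e.g., cycles executed) -- lower is fitter."""
    rng = random.Random(seed)
    rand_seq = lambda: "".join(rng.choice(passes) for _ in range(length))
    population = set()
    while len(population) < pool:           # distinct random strings
        population.add(rand_seq())
    population = sorted(population, key=evaluate)      # best first
    for _ in range(generations):
        # rank-weighted selection: fitter strings reproduce more often
        weights = [pool - rank for rank in range(pool)]
        children = []
        while len(children) < replace:
            a, b = rng.choices(population, weights=weights, k=2)
            i, j = sorted(rng.sample(range(length + 1), 2))
            child = a[:i] + b[i:j] + a[j:]             # 2-point crossover
            # "mutate until unique": perturb until the child is new
            while child in population or child in children:
                k = rng.randrange(length)
                child = child[:k] + rng.choice(passes) + child[k + 1:]
            children.append(child)
        # replace the worst `replace` strings with the new children
        population = sorted(population[:pool - replace] + children,
                            key=evaluate)
    return population[0], evaluate(population[0])
```

With `passes="plosn"` and `length=9` this searches the same kind of 9-of-5 string space the slides describe; the pool size, replacement count, and weighting scheme are illustrative choices.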
Improving the Search

Two kinds of experiments
• Enumerations
  - Choose a code & a subspace
  - Evaluate every point in that subspace
• Explorations
  - Choose a code & a search algorithm
  - Run a series of searches

Enumerations are critical for understanding the space.
Explorations are critical for evaluating the search algorithms.
Enumerating FMIN

• One application, FMIN
  - 150 lines of Fortran, 44 basic blocks
  - Exhibited complex behavior in other experiments
• Ran the hill climber from many random starting points
  - Picked the 5 most-used passes from the winners: p, l, o, s, & n
    (a bad way to pick them?)
• Enumerated the 10-of-5 subspace, fmin+plosn
  - 9,765,625 distinct combinations
  - 34,000 to 37,000 experiments/machine/day
  - Produced each sequence & the number of cycles to execute
  - 14 CPU-months to complete
• Results shed light on FMIN & the search space
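Exhaustive enumeration of such a subspace is conceptually simple. A minimal sketch, with a hypothetical `evaluate` standing in for compiling and running the benchmark:

```python
import itertools

def enumerate_subspace(evaluate, passes, length):
    """Evaluate every point in the |passes|**length subspace.

    Returns the full sequence->value table and the best sequence found.
    In the real system, `evaluate` would compile the program with that
    pass string and report cycles executed."""
    table = {}
    for combo in itertools.product(passes, repeat=length):
        seq = "".join(combo)
        table[seq] = evaluate(seq)
    best = min(table, key=table.get)
    return table, best

# The 10-of-5 subspace has 5**10 = 9,765,625 distinct sequences,
# which is why the full enumeration took months of machine time.
```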
Results from Enumerating FMIN

In the set of 9,765,625 sequences (fmin+plosn):
• Best sequence takes 986 cycles
• Seven sequences at 987
• Worst sequence takes 1716
• Unoptimized code takes 1716
• Range is 0% to 43% improvement

With the full set of transformations:
• GA: 822 cycles in 184 generations of 100; 839 in 100 generations of 50
• Hill climber: 843 cycles in 1,355 probes

Local minima
• 258 strict local minima (x < all Hamming-1 points)
• 27,315 non-strict local minima (x ≤ all Hamming-1 points)
• All local minima are improvements
• They come in many shapes
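Given the full enumeration table, counting the two kinds of local minima is a direct check over Hamming-1 neighbors. A sketch (function names are ours; we assume here that the non-strict count includes the strict minima, which the slides do not state explicitly):

```python
def hamming1_neighbors(seq, passes):
    """All sequences differing from `seq` in exactly one position."""
    for i, c in enumerate(seq):
        for p in passes:
            if p != c:
                yield seq[:i] + p + seq[i + 1:]

def classify_minima(fitness, passes):
    """Count strict (f < all Hamming-1 neighbors) and non-strict
    (f <= all Hamming-1 neighbors) local minima.

    `fitness` maps every sequence in a fully enumerated subspace to
    its value -- exactly the table the fmin enumeration produced."""
    strict = nonstrict = 0
    for seq, f in fitness.items():
        nbrs = [fitness[n] for n in hamming1_neighbors(seq, passes)]
        if all(f <= g for g in nbrs):
            nonstrict += 1            # non-strict count includes strict minima
            if all(f < g for g in nbrs):
                strict += 1
    return strict, nonstrict
```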
Results from Enumerating FMIN

Distribution of solutions (fmin+plosn)
[Histogram of solution quality, bucketed at +5%, +10%, and +20% of the best sequence.]
A fair number of “bad” solutions!
Results from Enumerating FMIN

Strict local minima have many shapes (fmin+plosn)
• 132 of these spikes, from loop peeling
[Plot of one minimum and its Hamming-1 points; magnitude is wall height (≥ 1).]
The landscape looks like an asteroid, in 10 dimensions.
Results from Enumerating FMIN

Distribution of local minima (fmin+plosn)
[Histogram bucketed at +5%, +10%, +20%, and +30%; annotation: “1026 goes to 14,879”.]
These are local minima, not all solutions.
Results from Enumerating FMIN

What does the space look like?
• 4-of-5 space, fmin+plosn
• Prefix-vs-suffix plot
[Surface plot of the subspace.]
Looks like noise to me:
• Not convex
• Not differentiable
• Outright hostile to the searches

Does the order of p, l, o, s, & n matter?
Results from Enumerating FMIN

What does the space look like?
This looks better
• Same space, different order
• Troughs, valleys, … (fmin+plosn)
Unfortunately, this surface does not exist: its axes are inconsistent.
Results from Enumerating FMIN

The 4-of-5 prefix-vs-suffix view of fmin+plosn is not convex, not differentiable, and outright hostile to the searches. We still need to search this space!
Searching FMIN

The steering algorithm needs to search the space for good sequences
• Hill climbers
  - Start somewhere (a random point) and walk downhill until stuck
  - To pick the next point, try steepest descent or a random downhill step
• Random probes
  - Generate sequences at random
  - If enough good sequences exist, you’re likely to hit one
• Greedy constructive algorithm
  - Repeat until done: take the best step available
  - In practice: consider prepending & appending one pass
• Genetic algorithm (of course)
  - Combines big steps (crossover) & small steps (mutation)
  - Survival of the fittest and mutate-until-unique make it a search
Hill Climbing in FMIN

Local downhill walk (fmin+plosn)
• Each step moves to the lowest Hamming-1 point
• Reaches minima quickly
  - 12 steps in 10-of-5 for fmin+plosn
• Suggests a two-pronged attack
  - Pick a random point & walk to a minimum
  - Start from many random points
• Rarely finds the best point
Hill Climbing in FMIN

Narrowing the range of “good” to 5% (fmin+plosn)
• Steepest descent (over Hamming-1 points)
  - Gets stuck in local minima
• Random downhill walk (take any lower Hamming-1 point)
  - Gets within 5% of the minimum 88% of the time
  - 7-step average, 18-step worst case
• Best downhill walk
  - Gets within 5% of the minimum 99% of the time
• Worst downhill walk
  - Gets within 5% of the minimum 38% of the time

Good news: we can reliably get within 5%.
Bad news: we cannot get within 1 to 2%.

This suggests that we need to take bigger steps, as the GA does!
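The random downhill walk with multiple restarts can be sketched as follows; this is our reconstruction, with a hypothetical `evaluate` standing in for compile-and-run. Replacing the random choice with `min(scored)` would give the steepest-descent variant.

```python
import random

def downhill_walk(evaluate, passes, length, rng):
    """One walk: start at a random sequence, repeatedly move to a
    randomly chosen strictly-better Hamming-1 neighbor, stop when no
    neighbor improves (a local minimum)."""
    seq = "".join(rng.choice(passes) for _ in range(length))
    cycles = evaluate(seq)
    while True:
        neighbors = [seq[:i] + p + seq[i + 1:]
                     for i in range(length) for p in passes if p != seq[i]]
        scored = [(evaluate(n), n) for n in neighbors]
        better = [t for t in scored if t[0] < cycles]
        if not better:
            return cycles, seq            # stuck: local minimum
        cycles, seq = rng.choice(better)  # random downhill step

def hill_climb(evaluate, passes, length, restarts=50, seed=0):
    """HC-50-style search: restart from many random points, keep the
    best local minimum found."""
    rng = random.Random(seed)
    return min(downhill_walk(evaluate, passes, length, rng)
               for _ in range(restarts))
```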
Searching in FMIN

Greedy constructive algorithm
• Start with the null sequence
• Add the best option to the sequence (evaluate prefixes & suffixes)
• Repeat until the desired length is reached

Ties complicate the matter
• Often, multiple additions have the same fitness value
• Each may lead to a different part of the search space
• How should the algorithm handle this situation?
  - GC-EXH: explore all ties
  - GC-50: pick one at random & go forward (repeat 50 times)
  - GC-fast: use breadth-first, not depth-first, search
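The GC-50 variant can be sketched as follows (our reconstruction; `greedy_step`, `gc50`, and the `evaluate` callback are hypothetical names, with `evaluate` standing in for compile-and-run):

```python
import random

def greedy_step(evaluate, passes, seq, rng):
    """Try each pass as a new prefix and as a new suffix; keep the best,
    breaking ties at random (the GC-50 policy)."""
    candidates = {p + seq for p in passes} | {seq + p for p in passes}
    scored = [(evaluate(c), c) for c in candidates]
    best = min(s for s, _ in scored)
    return rng.choice(sorted(c for s, c in scored if s == best))

def gc50(evaluate, passes, length, trials=50, seed=0):
    """GC-50: run the greedy construction `trials` times with random
    tie-breaking and return the best sequence found."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        seq = ""                          # start from the null sequence
        for _ in range(length):
            seq = greedy_step(evaluate, passes, seq, rng)
        result = (evaluate(seq), seq)
        if best is None or result < best:
            best = result
    return best
```

GC-EXH would instead recurse into every tied candidate, and GC-fast would carry all tied prefixes forward breadth-first.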
Search Results in FMIN

[Plot: instructions executed vs. sequences evaluated, fmin 9-of-13 space.]
• GA-50 did best
• Tie-breaking did not matter in this case; GC was cheap
• HC-50 was cost-effective
• RC-3000 is not bad
• GC-fast beat RC-300
Search Results in FMIN
Data on more programs
• Lots of trends here …
• GA wins (for more effort)
• GC & HC do well for less effort
Search Results

Does tie-breaking ever matter in GC?
• 91,663 sequences tested versus 325 sequences in adpcm
• Small variation in fitness value (see next slide)
Search Results in FMIN
Factor of 282 difference in sequences evaluated
2.2% difference in instructions executed
Road Map for our Project

Our goals
• Short term (now)
  - Characterize the problems, the potential, & the search space
  - Learn to find outstanding sequences quickly (search)
• Medium term (3 to 5 years)
  - Develop practical adaptive compilers for scalar optimization
  - Develop a framework for self-tuning optimizers
  - Develop proxies and estimators for performance (speed)
• Long term (5 to 10 years)
  - Apply these techniques to harder problems: data distribution, parallelization schemes on real codes; compiling for complex environments, like the Grid
  - Build systems that use knowledge from search to make adaptive compilation both practical & routine
Relating Code Properties to Sequences

Looking at the input program might let us predict good sequences
• Lots of work is needed to generalize our understanding
• We want to look at four aspects of the problem
  - Search strategies
  - Objective functions
  - Program characteristics
  - Accumulated experience
• Each may contribute to finding the best sequence efficiently

With enough experience, we may be able to predict good sequences.
Making It Practical

A full-blown search may not be the answer. Many strategies are possible for a production system:
• Best of k trials
• Pick k sequences from a database of many (say 5,000)
  - Might use program properties to pick the sequences
• Incrementalize the search over many compilations
• Find a good program-specific sequence and keep using it
• Find the best sequence in a fixed amount of compile time

This should be fertile ground for research & experimentation.
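The first strategy, best of k trials, is the simplest to sketch. This is an illustrative reconstruction (`best_of_k` and the `evaluate` callback are hypothetical names; `evaluate` stands in for compiling with a sequence and measuring the objective):

```python
import random

def best_of_k(evaluate, passes, length, k=10, seed=0):
    """Best-of-k production strategy: compile with k randomly drawn
    pass sequences and keep whichever one scores best on the
    user's objective function."""
    rng = random.Random(seed)
    best = None
    for _ in range(k):
        seq = "".join(rng.choice(passes) for _ in range(length))
        score = evaluate(seq)
        if best is None or score < best[0]:
            best = (score, seq)
    return best
```

With the same seed, increasing k can only improve the result, which makes the compile-time/quality trade-off easy to tune.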
Another Piece of the Puzzle

Engineering Reconfigurable Compilers
• We built ours (almost) by accident
  - Pass-structured optimizer (no surprise there)
  - Uses a definitive, low-level, linear IR
  - Each pass does its own analysis
  - Nothing is passed along unless it is written in the IR
• This eliminates most of the impediments to reordering
  - Name space for PRE & LCM
  - The OSR algorithm requires constant propagation & code motion
• Need ways to specify interactions & manage them
  - Check potential configurations for conflict
  - Package predetermined subsequences together
• Need ways to use the results of earlier searches to improve the process

(Lots of effort went into building fast components.)
Results from FMIN

Analysis (K-L divergence testing) suggests that we can obtain a good model of the distribution with 1,000 probes of the subspace
• The GA used 6,000 probes in a 14-transformation subspace
• We need to test other subspaces
• Suggests the search engine can build a probabilistic model
  - We are devising an algorithm for this one …

Near-term experiments
• Running another subspace for FMIN
• Running subspaces on other, dissimilar codes
• Speed improvements to the experimental apparatus
  - Goal is 500,000 compiles/day in the lab
  - Will port the framework to the Rice TeraScale Cluster (264 IA-64s)
• Next step is to characterize program properties
Results from FMIN

Can we take advantage of clustering in the good solutions?
• A cluster is a connected set of points, all below a threshold value
• Clustering explains the need for multiple trials
Results from FMIN

Looking at cluster size
• A cluster is a connected set of points, all below a threshold value
• Clustering explains the need for multiple trials
• Cluster numbers and shapes depend on the threshold
  - There is a phase transition between 2% and 2.5% (for FMIN)
• At 10%, there are 13 clusters
  - 12 are singular points involving a “peel” minimum
  - The other one contains the rest of the 10%-good points
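Given the enumeration table, clusters under a threshold can be counted by flood fill over the Hamming-1 graph. A sketch using the definition above (the function name is ours, and in practice the 9.7M-point table would call for a more memory-conscious representation):

```python
from collections import deque

def count_clusters(fitness, passes, threshold):
    """Count connected clusters of 'good' points: sequences whose value
    is below `threshold`, with edges between Hamming-1 neighbors.

    `fitness` maps every sequence in a fully enumerated subspace to
    its value."""
    good = {s for s, f in fitness.items() if f < threshold}
    seen, clusters = set(), 0
    for start in good:
        if start in seen:
            continue
        clusters += 1                     # new, unvisited cluster
        queue = deque([start])
        seen.add(start)
        while queue:                      # breadth-first flood fill
            seq = queue.popleft()
            for i in range(len(seq)):
                for p in passes:
                    nbr = seq[:i] + p + seq[i + 1:]
                    if nbr in good and nbr not in seen:
                        seen.add(nbr)
                        queue.append(nbr)
    return clusters
```

Sweeping `threshold` over the table is one way to locate the phase transition the slides describe.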