Minimizing the Cost of Iterative Compilation with Active Learning · 2020-02-27 · Minimizing the Cost of Iterative Compilation with Active Learning William F. Ogilvie University

Minimizing the Cost of IterativeCompilation with Active Learning

William F OgilvieUniversity of Edinburgh UK

s0198982smsedacuk

Pavlos PetoumenosUniversity of Edinburgh UK

ppetoumeinfedacuk

Zheng WangLancaster University UKzwanglancasteracuk

Hugh LeatherUniversity of Edinburgh UK

hleatherinfedacuk

AbstractSince performance is not portable between platforms en-gineers must fine-tune heuristics for each processor in turnThis is such a laborious task that high-profile compilers sup-porting many architectures cannot keep up with hardwareinnovation and are actually out-of-date Iterative compilationdriven by machine learning has been shown to be efficientat generating portable optimization models automaticallyHowever good quality models require costly repetitive andextensive training which greatly hinders the wide adoption ofthis powerful technique

In this work we show that much of this cost is spentcollecting training data runtime measurements for differ-ent optimization decisions which contribute little to the fi-nal heuristic Current implementations evaluate randomlychosen often redundant training examples a pre-configuredalmost always excessive number of times ndash a large sourceof wasted effort Our approach optimizes not only the selec-tion of training examples but also the number of samplesper example independently To evaluate we construct 11high-quality models which use a combination of optimizationsettings to predict the runtime of benchmarks from the SPAPTsuite Our novel broadly applicable methodology is able toreduce the training overhead by up to 26x compared to anapproach with a fixed number of sample runs transformingwhat is potentially months of work into days

Categories and Subject Descriptors D34 [ProgrammingLanguages] Processorsmdashcompilers optimization

Keywords Active Learning Compilers Iterative Compila-tion Machine Learning Sequential Analysis

1 IntroductionWith Dennard scaling long dead and technology scaling stop-ping within the next five years [1] performance improve-ments rely increasingly on software improvements especially

compiler and runtime optimizations Accurate heuristics fordeciding the best way to optimize a program are hard to con-struct The space of possible decisions is vast while theireffect on performance is complex and depends on both theapplication and the targeted hardware Manually develop-ing such heuristics takes months or years even for a singletarget meaning that tuning the compiler and runtime heur-istics for each new released processor is unrealistic Evenhigh-quality production compilers are out-of-date [31] andcannot extract the full performance potential of the hardwareThe industry-wide trend towards heterogeneity only servesto make the optimization decision space even more complexmaking effective heuristics near impossible to construct [44]

To overcome this problem iterative compilation [30] wasproposed as a means by which heuristics can be automaticallyproduced without the need for expert involvement Differentoptimization strategies are applied and the effect on speedsize or energy is measured Augmented with machine learn-ing [2] these data are used to train a predictor which selectsthe best optimizations for a particular code and platform Thisapproach not only produces heuristics much faster but it alsooutperforms heuristics crafted by human experts [31 57]The same methodology can be applied to radically differ-ent problem domains compilation [33 55 56] parallelismmapping [22] runtime tuning [15] and hardware-softwareco-design [60]

The time required to create these heuristics while auto-mated is still substantial Researchers have improved uponthis work by removing its reliance on random search andused active learning instead [4 5 39 60] Random search isproblematic because it selects optimization decisions and pro-files the application multiple times under those optimizationsbefore it even knows whether this will actually improve ourknowledge of the decision space In contrast active learningis a methodology which predicts the parts of the decision

space where as much information as possible can be gainedand directs the search towards them

These works represent a substantial leap forward towardsmaking iterative compilation quick and easy However size-able inefficiencies still exist Previous work on iterativecompilation used a fixed sampling plan each unique train-ing instance is repeatedly profiled a set number of timeschosen a priori Repeated measurements are necessary be-cause runtime measurements are inherently noisy

There are many sources of noise encountered for runtimemeasurements The most egregious of which is caused byother user or system processes Such processes compete forresources with our application especially cores caches [41]and memory [59] and they do so in non-deterministic waysIn recent systems the power and thermal walls lead tomore complex interference patterns Intelrsquos Turbo Boostfor example might lower the frequency and the powerconsumption of a process running on a core when othercores wake up [9]

Even ignoring interference from other applications thereare still more sources of noise Memory management mech-anisms such as dynamic memory allocators [26] and garbagecollectors [48] can introduce additional unpredictable over-heads On top of this address space layout randomization andthe physical page allocation mechanism change the logicaland physical memory layout of the application every time it isexecuted potentially affecting the number of conflict missesin the CPU caches and branch mispredictions [17 38] Multi-threaded applications can even force non-deterministic beha-vior on themselves if the scheduler is not set to be perfectlyrepeatable or if small timing changes alter the communica-tion patterns [45] Any IO can have non-repeatable timingsand even changes to the environment variables between runscan shift memory and alter runtimes [38]

Past research has investigated ways to reduce experimentalnoise Typical approaches include overriding the defaultscheduling policy [43 45] using more deterministic memorymanagement mechanisms [26 43 45] avoiding IO orjust minimizing the number of active processes includingservices and daemons Doing so is not always enough ordesirable because of the following reasons First while theydo reduce noise they do not eliminate it Multiple profilingruns are still needed to determine whether noise affects themeasurement significantly Even then the amount of variationmight be too high for optimization heuristics dependenton accurate measurements [32] Secondly modificationsto reduce noise may do so at the expense of altering theruntime behaviour that is meant to be measured or of riskingthe wrong heuristic being learned Heuristics targeting veryspecific low-noise runtime environments may not match wellwhen used in practice For example [16] showed that theruntime variation caused by memory layout changes suchas address space randomization can dwarf the differencesbetween optimizations If address space randomization is

disabled during training or only a single run is taken thenan optimization could be selected which is not optimal onaverage in deployment Instead multiple runs must be usedto smooth out the effects of random layout changes Finallyeven when a low-noise environment would not actuallyalter the heuristic we have found it difficult to convincecompanies that tuning heuristics to an environment differentthan their production one is acceptable In short what isneeded are techniques which perform well in realistic noisyenvironments

Our work aims exactly at handling noise without havingto reduce it and without wasting time on large numbers of re-peated performance measurements Our insight was that eachadditional observation that is each additional performancemeasurement for the same optimization strategy providesdiminishing amounts of information Indeed that extra in-formation quickly reaches zero if there is little experimentalnoise or if the observation fits well with what we alreadyknow about the decision space In other words extra profilingruns for a decision are useful only if they are likely to con-tradict what we predict about that decision Our experimentsconfirm that iterative compilation can be slowed down byusing a fixed sampling plan spending most of its time gettingobservations which provide no additional information

In this paper we introduce a novel active learning tech-nique for iterative compilation which combines sequentialanalysis an approach where the number of samples are notfixed By profiling the application under the same optimiz-ation decision only as long as this improves our knowledgeof the decision space we produce models quickly withoutsacrificing the heuristicrsquos quality Specifically our techniquebegins by taking a single sample runtime for optimizationsthat are deemed to be most profitable to learn from as definedby an active learner As knowledge is built up the algorithmis able to revisit these examples instead of getting new onesThis happens if it determines that they are of continued in-terest that is if it appears that measurement noise has affectedthe data we previously collected on that configuration

To evaluate our approach we create predictors for 11programs from the SPAPT suite [3] These models can predictwith low error the runtime of a particular code given anumber of optimization options that we may want to applyand in this way can find an optimal combination Our resultsshow that we can create a high-quality heuristic on average4x and up to 26x faster than a baseline approach which uses35 samples per run

2 MotivationAs previously stated the research in this paper is based onthe realization that current procedures for creating machinelearning based heuristics do not consider sample size a para-meter for optimization but rather assume it to be a constantvalue fixed a priori Moreover little or no justification isever provided for one chosen sample size over another With

0 10 20 30

Loop i1 Unroll Factor

Loop i2 U

4MAE (ms)

mm compiled with minusO21 sample per point

(a) mean absolute error for sample size of one

0 10 20 30

Loop i1 Unroll FactorLoop i2 U

4MAE (ms)

mm compiled with minusO2optimum samples per point

(b) mean absoulute error for optimal sample size

0 10 20 30

Loop i2 U

Samples

(c) optimal number of samples across the space

Figure 1 Error and sample size for each point in the decision space when using a sample size of one or the optimal sample sizeThe decision is the unroll factor for two loops of the mm kernel of SPAPT For most but not all points a single sample is enough

active iterative learning this need not be the case and we canleverage the knowledge built up by the algorithm over time toadaptively select a more appropriate sample size per examplesignificantly speeding-up training overall

To motivate our work we examined the way iterativecompilation works on the problem of tuning the unrollfactor for two loops i1 and i2 of the matrix multiplicationkernel in the SPAPT suite on the system we describe inSection 4 Using the -O2 optimization level as a baseline wecompiled the kernel multiple times each one with a differentcombination of unroll factors for the two loops Each binarywas then executed 35 times and its runtime measurementsrecorded

In Figure 1a we present the Mean Absolute Error (MAE)we would have incurred had we only taken a single observa-tion This gives us an estimated baseline for the worst errorwe could pay in this space as high as 4ms (5 of the mean)for some binaries but practically zero for many For the lattergetting even a second sample is a waste of effort To estim-ate the potential speed-up we could obtain if we knew howmany samples we should actually take for each optimizationsetting we iterate through the space again but at each pointwe remove samples randomly from the group of 35 sampleswe collected initially We continue reducing the number ofsamples as long as our calculated MAE is below 01ms

Figures 1b and 1c show the error of this adaptive ap-proach across the space and the number of samples neededper configuration to maintain such a small error respectivelyThese figures demonstrate that there is quite considerablestochastic noise in measurements across the space and there-fore that the number of samples needed for a low MAEvaries If we take the naiumlve fixed sampling plan of 35 weneed 35times 30times 30 = 31 500 individual executions whereaswith lsquoperfect knowledgersquo we can incur an error of only 01msat the cost of 15 131 program runs ndash nearly half

0 5 10 15 20 25 30

adi compiled with minusO2

Figure 2 Runtime versus unroll factor for a loop of adiwhen using a sample size of one A relationship betweenunroll factor and runtime is relatively clear despite the noise ndashie stable around 21s until 10 where it climbs steadily andplateaus at 31s for high levels of loop unrolling

This example starts with lsquoperfectrsquo information about eachpoint in the decision space and it removes samples until theaverage runtime starts to deviate from that initially calculatedA real sequential analysis approach on the other hand needsto work the opposite way around start from zero informationabout each point and add samples until the distance betweentheir average runtime and the true mean is minimal Wecannot know this distance without actually taking a large

number of samples but we can approximate it by looking atwhat the rest of the space tells us

Consider Figure 2 where we unroll the i1 loop in theadi benchmark a random number of times and take a singlesample each time Despite the noise we see that there is apattern easily identifiable to the human eye a plateau startingat 21s then climbs and levels off at 31s around a loopunroll factor of 10 We postulate and it can be shown thatpoints in areas where the pattern is clear and which fit well inthat pattern are likely to be already close to their respectivepopulation means The points where we need more samplesare the rest

Our active learning approach for iterative compilationalready uses Machine Learning to discover patterns in thedecision space during the training process We can use thatsame knowledge to determine whether a set of runtimemeasurements for a point in the space fits the local pattern ornot that is whether it is likely to be affected by noise or notgiven what is known about its neighbours

3 Active Learning with Sequential AnalysisPrevious research [18 31] has shown that machine learningcan be used to generate compiler heuristics more accuratelythan human experts However current implementations userandomly collected examples for training and this is problem-atic since randomness often leads to redundancy In contextswhere obtaining data is cheap this is perfectly acceptable butfor us since each example requires compilation and multipleruns to record average performance a lot of effort can bewasted

Active learning [39] is specifically aimed at reducingthe occurrence of these unprofitable evaluations Instead ofblindly selecting training examples in our case binaries com-piled with certain optimizations it generates an intermedi-ate model based on the examples already evaluated Thealgorithm considers a number of potential candidates (optim-ization settings) that it could learn from next and assigns eacha score This score represents the predicted extra informationthat the example will provide It is usually a function of theuncertainty the model has with regards to the predicted valuendash runtime The example with the highest score is lsquolabelledrsquondash compiled and profiled ndash and the information used to up-date the intermediate model This loop continues until somecompletion criterion has been reached

Our work in this paper introduces a novel approach toactive learning which is broadly applicable The only as-sumptions we make are that neighbouring examples in theoptimization design space do not significantly differ in res-ult from one another the majority of the time and that thesignal-to-noise ratio of measurements are such that an ap-proximately accurate model can be fit to the data where thisis in essence no different than techniques which attempt toavoid over-fitting While traditional active learning is usedto reduce only the number of training examples we wish

to reduce the number of samples needed per example Tothe best of our knowledge we are the first to propose com-bining active learning and sequential analysis in this wayPrevious research in this area has ignored sequential analysisThe reason for this is that many implementations of activelearning are greedy [6 47] so learning noisy data on pur-pose will lead to incorrect conclusions being drawn from theintermediate models In particular this will steer the searchtowards the wrong areas of the decision space significantlyreducing further the quality of the final model In Section 33we explain how we overcome this problem

31 Sequential AnalysisTraditionally in active learning the training set (the set ofexamples already seen) and the candidate set a randomsubset of all examples that could be learnt from next are keptdisjoint This makes sense because the information containedin the training set is assumed to be of good quality eachexample will have been evaluated some fixed number oftimes to ameliorate the effect of noise hence there is littleto be gained from revisiting those examples However aswe have demonstrated a fixed sample count is often overlyconservative and wasteful slowing down the training processsubstantially

In order to modify active learning to incorporate sequentialanalysis we change the algorithm such that the initial samplesize is set to one In case of noisy data we need to be able torevisit previously compiled programs so we keep them in thecandidate set ndash see Figure 3 That is to say at each iteration ofthe learning loop the algorithm will consider not only gettinga new example but also whether it is more profitable to try anold one again similar to the multi-armed bandit problem [29]We are able to do this because the particular model we useprovides a scoring function which quantifies the uncertaintythe model has about a particular point in the space givenwhat it knows As knowledge is gained given the shapeof the intermediate model noisy examples or examples incomplex areas of the decision space will begin to lsquostick outrsquoand will be more likely to be visited In both cases with eachiteration of the training loop we select the example wherethe highest amount of information can be extracted

We outline our algorithm in more detail in Alg 1 Thealgorithm begins by constructing a model M with ninit

training examples which have been randomly chosen fromall potential examples F as a seed To generate this initialmodel we obtain some fixed number of observations nobs

for each training example to give the active learner a quickand accurate look at the search space The learning loopthen proceeds whilst the completion criterion has not beensatisfied In Alg 1 this criterion is set to a fixed number oftraining instances but could have been based on for examplewall-clock time or some estimate of error in the final modelestablished through cross-validation [25] At each iterationof the loop the candidate set C combines nc random pointswhich have never been observed before and those examples

LearningAlgorithms

NewObservation

InitialTrainingPoints

Active Learner

IntermediateModel

FinalModel

EvaluateCandidates

RandomUnseen

PreviouslySeen

Figure 3 An overview of our active learning approach To seed an initial model we give the learner some good quality dataWe then choose a single training example from a set of candidate examples We collect data for the chosen example and feedthat back into the algorithm The process repeats until we reach some completion criterion Contrary to existing active learningapproaches we collect our (potentially noisy) training data one observation at a time Visited training examples remain in thecandidate set and can be revisited if getting more observations for them is more profitable than trying a new training example

which have been seen previously but less than nobs times Wechoose the next training example x based upon its predictedusefulness (see Section 33) and we measure its runtime y onemore time We then update the model as well as the requireddata structures It should be noted that this algorithm is easilyparallelized by selecting multiple training examples per loopiteration instead of just one [4]

Tt Tt+1

Stay Prune Grow

Figure 4 This diagram shows the three potential updatesthat are stochastically applied to the Dynamic Tree uponreceiving a new training example xt+1 The tree eitherremains unchanged a leaf node is pruned back so that theparent of the leaf becomes a leaf itself or grown such thattwo new children divide the relevant subspace

32 Dynamic TreesIn regression problems where we wish to estimate the uncer-tainty of a prediction the collective wisdom would be to use aGaussian Process (GP) [46] However GP inference is slowwith O(n3) efficiency for n examples This is problematicparticularly in active learning since each time somethingnew is learned a model needs to be constructed and evaluatedA more efficient model which we leverage in this work isthe relatively new dynamic tree which is based on the clas-sical decision tree model [8] with modifications to includeBayesian inference [10 11] The advantages of the dynamictree for our purposes are

bull its ability to evolve over time as new data come in withoutreconstructing the model from the ground up with eachiterationbull its estimation of uncertainty at any given point in the

space like a GP but without the overhead

bull its avoidance of over-fitting to the training data which isvital since we are learning potentially noisy information

Full details on the model can be obtained from the articleby Taddy et al [51] The brief overview of how it works isas follows The static model used within the dynamic treeframework is a traditional decision tree for regression applic-ations A set of rules recursively partitions the search spaceinto a set of hyper-rectangles such that training exampleswith the same or similar output value are contained withinthe same leaf node The dynamic tree changes over timewhen new information is introduced by a stochastic processthereby avoiding the need to prune at the end At time t a treeTt is derived from the training data (x y)t When new data(xt+1 yt+1) arrives an updated tree Tt+1 is created identicalto Tt except that some mechanism has been randomly chosenfrom three possibilities ndash see Figure 4 The leaf node η(xt+1)containing xt+1 either (1) remains completely unchanged (2)is pruned so that the parent of η(xt+1) becomes a leaf node(3) is grown such that η(xt+1) becomes an internal node totwo new children The choice of transformation is influencedby yt+1 in a posterior distribution This posterior distributiondepends upon the probability of yt+1 given xt+1 Tt and[x y]t hence the dynamic tree is more resilient to noisy datathan other techniques

33 Quantifying UsefulnessThe most crucial part of the active learning loop is estimatingwhich training example from within the pool of potentialcandidates C would be most profitable to learn from next ThedynaTree package for R [21] that we use offers two heuristicsout-of-the-box both well cited in the literature for regressionproblems The first was presented by Mackay [34] and selectsthe candidate where the estimated variance of the outputis maximized relative to the other candidates The secondheuristic by Cohn [13] selects the candidate it calculates willmost reduce the predicted average variance across the spaceTo put this in a more accessible way it selects the exampleit believes will enable the model to best fit what it is alreadyseeing in an attempt to reveal key information that it may bemissing Both are competitive with each other and both solve

Algorithm 1 An active learning algorithm modified to re-duce the number of samples where F contains all trainingexamples that could be chosen ninit and nmax specify theinitial and total number of training examples to record nc thenumber of candidates per iteration and nobs the number ofsamples thought to be needed to reduce the affects of noisein the outputperformance values

1 procedure ACTIVELEARN(F ninit nmax nc nobs)2 X larr sample(F ninit)3 Y larr getObservations(Xnobs)4 M larr dynaTree(XY )5 D larr empty6 for i = ninit nmax do7 C larr sample(F minusXnc)8 for all k isin keys(D) do9 if D[k] lt nobs then C larr C cup k

10 end if11 end for12 xlarr empty13 vmin larrMAX_DOUBLE14 for all c isin C do15 v larr predictAvgModelVariance(M c)16 if v lt vmin then17 vmin larr v18 xlarr c19 end if20 end for21 y larr getObservations(x 1)22 M larr updateModel(Mx y)23 X larr X cup x24 if k isin keys(D) then25 D[k]larr D[k] + 126 else27 D[k]larr 128 end if29 end for30 return M31 end procedure

the greedy search problem discussed previously althoughthe latter is more computationally intensive than the former ndashO(|C|2) versus O(|C|) Despite this we use the latter as ourscoring function since it handles heteroskedasticity non-uniform variance across the space which we assume forincreased robustness more effectively

4 Experimental Setup41 Optimization ProblemWe evaluate our approach by examining how efficientlywe can construct models to solve a classical but complexcompilation problem In particular the problem we considerin this work involves finding the optimal set of compilationparameters for a program The set of parameters includes loopunrolling cache tiling and register tiling factors where eachparameter has a range of possible values unique to each loop

The combination of these parameters results in a massivesearch space where we will need to find a configuration thatleads to a short program runtime Our goal is to build aprogram-specific model that can predict the runtime from thegiven set of optimizations This allows us to quickly searchover a large number of configurations to find out the bestperforming one without compiling and profiling the programwith every single option

42 Platform and BenchmarksPlatform We evaluated our method on a server runningOpenSuse v123 with an Intel Core i7-4770K 4-core CPUat 34GHz The machine contains 16GB of RAM and thecompiler used was gcc v472

Environment We measured time using the C library func-tion clock_gettime() As in previous iterative compilationliterature our machine was restricted to a single user and didnot have any processes running other than those enabled bydefault under a standard OS installation We took no furthersteps to reduce experimental noise such as pinning threads orusing a non-standard memory allocator we decided againstthis to avoid creating an artificial environment which couldalter our findings as discussed previously in Section 1

Benchmarks We used 11 applications taken from theSPAPT suite [3] a collection of search problems (programs)for automatic performance tuning These benchmarks arebased on high-performance computing problems such asstencil codes and dense linear algebra These particular 11were chosen based on an initial prototyping of our algorithmusing data kindly provided by the authors of [4] where onlythese 11 were contained within that initial dataset Theseprograms are sequential implementations where dynamicmemory is allocated it is through the standard malloc()library function again we decided against using a low noiseallocator for reasons previously discussed

Each problem of the SPAPT suite is defined by threeprimary variables ndash kernel input size and tunable config-uration The tunable parameters are further broken down intoa number of integer and binary values with the values giv-ing optimizing code transformations to apply as specifiedabove In our experiments binary flags and input size werenot considered so that a fair comparison could be made withthe related work [4] The precise size of each search space isgiven in Table 1

43 Evaluation MethodologyBaseline Approach Most machine learning in compilersworks use simple constant sampling plans [36 37 49] wherethe number of observations in each sample is fixed ahead oftime Different sizes are chosen in the literature for example[19 23] use 10 [24] uses 20 [42] uses 80 and the workagainst which we compare [4] uses 35 There is no statisticalcriterion that can determine how many observations will besufficient for this purpose However post hoc validation can

be performed for example by calculating the ratio of theConfidence Interval (CI) to the mean and rejecting if thatbreaches some threshold Typically this validation is notpresented in papers if it is done at all When it is donestandard values are to use the 95 confidence and a 1CImean threshold In this paper we compare against aconstant sampling plan of 35 observations as that is whatis used in our comparison work [4] We note that even 35observations is not always enough Across our benchmarkswe found that even though on the majority of examples therewas often very little noise many did not fall into this patternFully 5 of examples broke the threshold Even at a moregenerous 5 threshold we found that with 35 observations05 failed With fewer observations the problem is worseAt 5 observations 33 fail that more generous thresholdand at 2 observations (the minimum to have any statisticalcertainty) 5 fail This finding is corroborated by [32] whichsamples until the threshold is met and discovers that fortiming small code sequences it is sometimes necessary to takehundreds of observations Since active learning is susceptibleto bad data these erroneous training examples can have adetrimental affect on the quality of the learned heuristic andits convergence

Our technique by avoiding a constant sampling plan isable to achieve far better results Indeed it is even able toperform adequately with only one observation per examplein low noise parts of the space and will spend effort withmultiple observations only on those parts of the space thatrequire it

Based on classical methodologies we consider two tech-niques to be in competition with our own For both we get afixed number of observations for each training example andthe candidate set is kept disjoint from past training examplesThe first technique uses the average of the runtimes recordedover 35 observations per single training example as in [4]The second technique records a single execution per exampleIn this way we can compare how our approach fairs in rela-tion to both very low and relatively high accuracy per trainingpoint in terms of estimating mean runtime

In order to provide the best evaluation possible we com-pare our methodology to the active learning approach byBalaprakash et al [4] in particular we use the same bench-mark suite model parameters and accuracy metric they do

Evaluation Metrics Our evaluation examines the efficiencyof model construction and more specifically the evolutionof the model error over training time for each one of the 11benchmarks and the three different active learning approachesWe quantify the accuracy of the models produced by eachapproach using the Root Mean Squared Error of the predictedruntimes (1) For each data point in a test set of n instancesthe runtime predicted by the current model yt is compared tothe observed mean runtime yt as follows

RMSE =

radicsumni=1 (yi minus yi)

We measure the training time in each experiment as thecumulative compilation and runtimes of any executables usedin training The overhead of updating the Dynamic Treesis not measured as it is a small part of the overall trainingoverhead and is near constant for all evaluated approaches

44 Algorithm and Model ParametersFor each kernel the goal is to produce a model capable ofestimating mean serial code runtime for any set of optimiza-tion settings To this end we used the following parametersin constructing our model and overall learning algorithm

With respect to Alg 1 we start by seeding the algorithmwith just 5 random examples ninit For each of these werecord 35 observations nobs to calculate a mean runtimeFor the Dynamic Tree model we employ the R dynaTreepackage We use an entirely default configuration except thenumber of particles N is set to 5000 In each iteration ofthe loop we consider 500 random and new candidate traininginstances nc

The completion criterion for the experiments was set suchthat the maximum size of the training set nmax does notexceed 2500 All experiments were repeated ten times withnew random seeds The results reported in Section 5 are allaveraged over ten experimental runs

45 Description of the DatasetsTo collect the data for our experiments we profile each pro-gram with 10000 distinct randomly selected configurationsFor each one we record its mean runtime determined byaveraging 35 separate executions per example and its com-pilation time Per experiment we randomly mark 7500 ofthem as available for use in training while we test the modelon the remaining 2500 examples

The feature values of each data point which is to saythe values which make each example distinct from oneanother were all normalized through scaling and centringto transform them into something similar to the StandardNormal Distribution a common practice in machine learningwork where features are not all on comparative scales

5 Experimental ResultsIn this Section we first show that our approach can greatlyspeed-up the learning process by reducing the cost of profilingby up to 26x as compared to a baseline approach that uses 35observations for each data point We then provide a detailedanalysis of our results for each program in turn

51 Overall AnalysisTo evaluate the overall efficiency of our proposed method-ology versus the baseline active learning approach [4] we

Table 1 Lowest common RMS error achieved by both approaches profiling time needed to reach this error level and speed-upfor all 11 benchmarks

benchmark search space lowest common RMSE cost of the baseline cost of our approach speed-up(sec) (sec)

adi 378times 1014 0087 262times 104 908times 104 029atax 257times 1012 0097 333times 103 239times 102 1393

bicgkernel 583times 108 0065 135times 104 376times 103 359correlation 378times 1014 0589 5746 813 707

dgemv3 133times 1027 0067 175times 102 744 2352gemver 114times 1016 0342 299times 103 115times 102 2600hessian 195times 107 0006 576times 103 156times 103 369jacobi 195times 107 0076 304times 103 857times 102 355

lu 583times 108 0013 257times 103 709times 102 362mm 318times 109 0042 987times 104 889times 104 111mvt 195times 107 0002 259times 103 220times 103 118

geometric mean 397

Table 2 This table gives an indication of the spread of the variance and 95 confidence interval relative to the mean for allbenchmarks tested the latter is given for two sample sizes 5 and 35 observations The values shown illustrate that althoughnoise can be low for many benchmarks it is high for others

benchmark variance 35-sample 95 CI mean 5-sample 95 CI meanmin mean max min mean max min mean max

adi 844times 10minus10 234times 10minus3 014 410times 10minus6 225times 10minus3 005 277times 10minus6 001 016atax 754times 10minus10 972times 10minus5 003 222times 10minus5 231times 10minus3 006 179times 10minus5 001 025

bicgkernel 206times 10minus10 106times 10minus4 005 117times 10minus5 152times 10minus3 007 102times 10minus5 464times 10minus3 029correlation 227times 10minus10 042 802 213times 10minus5 003 034 442times 10minus6 013 241

dgemv3 115times 10minus9 560times 10minus5 003 331times 10minus5 225times 10minus3 008 224times 10minus5 001 028gemver 119times 10minus9 591times 10minus3 047 118times 10minus5 481times 10minus3 010 934times 10minus6 002 042hessian 235times 10minus11 103times 10minus6 199times 10minus4 389times 10minus5 133times 10minus3 006 163times 10minus5 415times 10minus3 024jacobi 254times 10minus10 120times 10minus4 009 132times 10minus5 129times 10minus3 009 412times 10minus6 383times 10minus3 039

lu 184times 10minus11 845times 10minus7 109times 10minus4 203times 10minus5 689times 10minus4 002 576times 10minus6 210times 10minus3 011mm 276times 10minus10 487times 10minus6 131times 10minus3 226times 10minus5 744times 10minus4 002 136times 10minus5 237times 10minus3 009mvt 997times 10minus12 107times 10minus8 787times 10minus6 629times 10minus5 828times 10minus4 003 398times 10minus5 244times 10minus3 011

n0123456789

101112131415

rofilin

Figure 5 Reduction of profiling overhead compared to abaseline approach

measured the time needed for both techniques to first reach acommon lowest average error

In particular Table 1 shows for each benchmark what thislowest common average error is and how many seconds ittook to collect the profiling data needed to reach this levelfor the competing methods Figure 5 presents graphicallythe acceleration achieved by our approach In all but onebenchmark our algorithm is faster at reaching the lowestaverage error Specifically our methodology is able to reduce

the overhead for 10 benchmarks by up to 26x The onlybenchmark in which our approach fails to reduce the overheadis adi However the difference in errors between the twotechniques is comparable within a few thousandths of asecond on average Summarizing our approach outperformsthe baseline by achieving on average a 4x reduction ofthe profiling overhead which translates to saving weeks ofcompute time in practice for many compilation problems

52 Detailed AnalysisIn this Section we present our findings in more detail Fig-ures 6andash6f show the Root Mean Squared Error (1) againstevaluation time (cumulative profiling and compilation costin seconds) averaged over 10 runs for several representativeresults To make a fair comparison each graph shows therange of time over which all three sampling plans are simul-taneously active in processing up to 2500 training instancesWhat follows is a qualitative summary of those resultsadi Figure 6a gives error against time for the three differentsampling techniques we evaluated for benchmark adi Itseems self-evident that there is some considerable noise inthe underlying data since a single observation per trainingexample plateaus in error fairly quickly and cannot achievethe same results as the other two methods Although our

variable observation approach is also unable to keep up witha high fixed number of observations per example it doesachieve comparably low error throughoutatax bicgkernel The data of benchmark atax in Figure6b is quite different to that in Figure 6a and appears torepresent a case where the underlying noise in performancemeasurements is comparatively low This is exemplified bythe fact that one sample per unique instance is enough todo well and indeed our technique appears to detect thiscompare these plot-lines to the 35 observations approach andwe see a good example of how much time can potentially besaved by our technique over this baseline The bicgkernelexperiments follow the same sort of patterncorrelation Figure 6c showing the results of the benchmarkcorrelation is interesting since the error remains highregardless of sampling technique Data from Table 2 pointsat a potential reason for this that we outline below As inFigure 6a we see that there must be noise present since oneobservation does even worse Our approach is not quite asgood as using a large number of observations per data pointbut is competitive and within a few hundredths of a secondin terms of average error by the end of the displayed timedgemv3 gemver hessian In Figure 6d our variable ap-proach is much faster than the classical method and thesimple but potentially noisy variant similarly for the resultsof dgemv3 and hessianjacobi lu mm mvt The data for the jacobi benchmark(Figure 6e) which is also generally representative of lu areinteresting since they show our algorithm to be slightly toocautious but still much more efficient than a fixed samplingplan The mm benchmark gives a graph akin to that of mvtshowing our approach as giving slight speed-ups over theclassical methodology

Table 2 details the distributions of the runtimes measuredduring our experiments and in particular the spread of thevariance and confidence intervals relative to the means Thelevel of noise across this set of benchmarks varies acrossapplications Moreover the variance is not constant across allparts of the space for even a single benchmark in isolationsome parts of the space suffer from extreme noise Anadaptive algorithm such as ours is necessary to make thebest of these conditions

Correlation shows very high noise and achieves a 7xlearning speed-up whereas gemver has lower noise but gaveus the highest learning speed-up ndash 26x We think this isbecause our experiments capped the number of executionsat 35 but many points in correlationrsquos space need moredata This limits the maximum speed-up that can be attainedGemver in contrast to correlation has fewer points forwhich 35 observations are inadequate For adi where thespeed-up runs counter to our expectations we ran longerexperiments but the outcome did not change We believe thatthis is due to the shape of the noisy regions in the space Wewill investigate these in future work

6 Related WorkOur work lies at the intersection of optimization modelingand active learning No existing work has used sequentialanalysis and active learning to reduce the overhead of iterativecompilation

Analytic Modeling Analytic models have been widely usedto tackle complex optimization problems such as auto-parallelization [7 40] runtime estimation [12 27 58] andtask mappings [28] A particular problem with them howeveris that the model has to be re-tuned for each new targetedhardware [50]

Predictive Modeling Predictive modeling has been shownto be useful in the optimization of both sequential and par-allel programs [14 23 52 53] Its great advantage is thatit can adapt to changing platforms as it has no a priori as-sumptions about their behavior but it is expensive to trainThere are many studies showing it outperforms human basedapproaches [19 24 31 54 61] Prior work for machine learn-ing in iterative compilation [20] often uses random samplingor exhaustive search to collect training examples The processof collecting these examples can be expensive taking severalmonths in practice With active learning our approach cansignificantly reduce the overhead of collecting these trainingdata accelerating the process of tuning optimization heurist-ics using machine learning

Active Learning for Systems Optimization Active learn-ing has recently emerged as a viable means for constructingheuristics for systems optimization Zuluaga et al [60] pro-posed an active learning algorithm to select parameters in amulti-objective problem Balaprakash et al [4 5] used act-ive learning to find optimizations for both CPU and GPUscientific codes and Ogilvie et al [39] proposed its use toconstruct models to map programs in a CPUndashGPU mixedplatform In all these works however each training examplewas profiled a fixed number of times in order to compute anaverage performance Our work advances this prior work bydynamically adjusting the number of profiling runs as neededwhich significantly reduces the training overhead

Program Runtime Variation The work by Mazouz et al[35] shows that parallel program execution time could varyto various extents on different platforms Thus the number ofprofiling runs needed for statistical soundness varies from oneplatform to the other Leather et al [32] proposed a statisticalmethod to determine the number of times to profile a programbut for the much simpler problem of determining the bestperforming binary version of a program during a randomsearch of the optimization space In their work binarieswhose confidence interval of the runtime does not overlapwith that of the best performing binary are not revisited

0 20000 40000 60000 80000

Results for adi benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

variable observations

(a) adi

0 1000 2000 3000

Results for atax benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

(b) atax

0 200 400 600 800 1000

Results for correlation benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

(c) correlation

0 500 1000 1500 2000 2500 3000

Results for gemver benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

(d) gemver

0 500 1000 1500 2000 2500 3000

Results for jacobi benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

(e) jacobi

0 500 1000 1500 2000 2500

Results for mvt benchmark

Evaluation Time (s)

Root M

eanminus

Square

all observations

one observation

(f) mvt

Figure 6 RMS error over time for six of our benchmarks for three different approaches one observation 35 observations andvariable observations per training point

7 Conclusions and Future WorkWhile we need good software optimization heuristics to fullyexploit the performance potential of our hardware it is be-coming increasingly unrealistic to hand-tune them with expertknowledge for every hardware architecture they have to targetThe result is that current compilers use out-of-date optimiza-tion strategies ultimately leading to sub-optimal binaries Toalleviate this problem previous research has proposed ma-chine learning to automate the heuristic generation processExisting implementations are unnecessarily slow They selecttheir training data randomly much of which carry little usefulinformation despite being time consuming to acquire Activelearning approaches have tackled this inefficiency but theirinflexible sampling plan still causes them to collect trainingdata with little useful information

In this paper we present a unique approach broadly applic-able to heuristic generation It combines sequential analysiswhich reduces the observations per training example togetherwith active learning which reduces training examples overallto greatly accelerate learning We demonstrate our approach

by comparing it with a baseline 35-sample technique whichcreates software optimization models derived through activelearning alone Our approach achieves an average speed-upof 4x and up to 26x without significant penalties to the finalheuristic quality

We intend to test the bounds of our technique by artificiallyintroducing noise into the system to see how robustly itperforms in extreme cases Success would allow our strategiesto be used in heavily loaded multi-user environments Thiswould have been interesting in this paper and is somethingof an omission that was pointed out to us in a review As suchwe leave it to future work

AcknowledgmentsThis work was partly supported by the UK Engineeringand Physical Sciences Research Council (EPSRC) undergrants EPL0000551 (ALEA) EPM01567X1 (SANDeRs)EPM0158231 and EPM0157931 (DIVIDEND) We wouldlike to thank Dr Balaprakash of Argonne National Laborat-ory for his kind help in providing us the initial data for ourresearch

References[1] 2015 international technology roadmap for semicon-

ductors httpwwwsemiconductorsorgmain2015_international_technology_roadmap_for_semiconductors_itrs Retrieved 080916

[2] F Agakov E Bonilla J Cavazos B Franke G Fursin M F POrsquoBoyle J Thomson M Toussaint and C K I WilliamsUsing Machine Learning to Focus Iterative Optimization InCGO 2006

[3] P Balaprakash S M Wild and B Norris SPAPT SearchProblems in Automatic Performance Tuning In ICCS 2012

[4] P Balaprakash R B Gramacy and S M Wild Active-Learning-Based Surrogate Models for Empirical PerformanceTuning In CLUSTER 2013

[5] P Balaprakash K Rupp A Mametjanov R B Gramacy P DHovland and S M Wild Empirical Performance Modeling ofGPU Kernels Using Active Learning In ParCo 2013

[6] M-F Balcan A Beygelzimer and J Langford AgnosticActive Learning In ICML 2006

[7] U Bondhugula A Hartono J Ramanujam and P SadayappanA Practical Automatic Polyhedral Parallelizer and LocalityOptimizer In PLDI 2008

[8] L Breiman J H Friedman R A Olshen and C J StoneClassification and regression trees Wadsworth and Brooks1984

[9] J Charles P Jassi N S Ananth A Sadat and A FedorovaEvaluation of the Intel Core i7 Turbo Boost feature In IISWC2009

[10] H A Chipman E I George and R E McCulloch BayesianCART Model Search Journal of the American StatisticalAssociation 93 1998

[11] H A Chipman E I George and R E McCulloch BayesianTreed Models Machine Learning 48 2002

[12] M Clement and M Quinn Analytical Performance Predictionon Multicomputers In SC 1993

[13] D A Cohn Neural Network Exploration Using OptimalExperiment Design Neural Networks 9(6) 1996

[14] K D Cooper P J Schielke and D Subramanian Optimizingfor Reduced Code Space using Genetic Algorithms In LCTES1999

[15] C Cummins P Petoumenos M Steuwer and H LeatherAutotuning OpenCL Workgroup Size for Stencil PatternsarXiv preprint arXiv151102490 2015

[16] C Curtsinger and E D Berger STABILIZER StatisticallySound Performance Evaluation In ASPLOS 2013

[17] A B de Oliveira J-C Petkovich and S Fischmeister Howmuch does memory layout impact performance A wide studyIn REPRODUCE 2014 2014

[18] C Dubach T Jones E Bonilla G Fursin and M F POrsquoBoyle Portable Compiler Optimisation Across EmbeddedPrograms and Microarchitectures using Machine Learning InMICRO 2009

[19] M K Emani et al Smart Adaptive Mapping of Parallelism inthe Presence of External Workload In CGO 2013

[20] G Fursin et al Milepost GCC Machine Learning EnabledSelf-tuning Compiler International Journal of Parallel Pro-gramming 39(3) 2011

[21] R B Gramacy and M A Taddy dynaTree Dynamic Treesfor Learning and Design httpfacultychicagoboothedurobertgramacydynaTreehtml 2011 R packageRetrieved 022916

[22] D Grewe Z Wang and M F OrsquoBoyle Portable Mapping ofData Parallel Programs to OpenCL for Heterogeneous SystemsIn CGO 2013

[23] D Grewe Z Wang and M F P OrsquoBoyle OpenCL TaskPartitioning in the Presence of GPU Contention In LCPC2013

[24] D Grewe et al A Workload-Aware Mapping Approach ForData-Parallel Programs In HiPEAC 2011

[25] T Hastie R Tibshirani and J Friendman The Elements ofStatistical Learning Data Mining Inference and Predictionchapter 7 Springer 2 edition 2009

[26] J Herter P Backes F Haupenthal and J Reineke CAMAA predictable cache-aware memory allocator In Euromicro2011

[27] S Hong and H Kim An Analytical Model for a GPUArchitecture with Memory-level and Threadndashlevel ParallelismAwareness In ISCA 2009

[28] A H Hormati Y Choi M Kudlur R Rabbah T Mudge andS Mahlke Flextream Adaptive Compilation of StreamingApplications for Heterogeneous Architectures In PACT 2009

[29] M N Katehakis and J Arthur F Veinott The multi-armedbandit problem Decomposition and computation Mathematicsof Operations Research 12(2) 1987

[30] P M W Knijnenburg T Kisuki and M F P OrsquoBoyleIterative Compilation 2002

[31] S Kulkarni and J Cavazos Mitigating the Compiler Optim-ization Phase-Ordering Problem using Machine Learning InOOPSLA 2012

[32] H Leather M F P OrsquoBoyle and B Worton Raced profilesefficient selection of competing compiler optimizations InLCTES 2009

[33] H Leather E Bonilla and M OrsquoBoyle Automatic FeatureGeneration for Machine Learningndashbased Optimising Compila-tion ACM TACO 2014

[34] D J C MacKay Information-Based Objective Functions forActive Data Selection Neural Computation 4 1992

[35] A Mazouz S A A Touati and D Barthou Study ofVariations of Native Program Execution Times on Multi-CoreArchitectures In CISIS 2010

[36] A Monsifrot F Bodin and R Quiniou A Machine LearningApproach to Automatic Production of Compiler Heuristics InAIMSA 2002

[37] E Moss P Utgoff J Cavazos D Precup D StefanovicC Brodley and D Scheeff Learning to schedule straight-linecode Advances in Neural Information Processing Systems1997

[38] T Mytkowicz A Diwan M Hauswirth and P F SweeneyProducing Wrong Data Without Doing Anything ObviouslyWrong In ASPLOS XIV 2009

[39] W F Ogilvie et al Fast Automatic Heuristic ConstructionUsing Active Learning In LCPC 2014

[40] E Park J Cavazos L-N Pouchet C Bastoul A Cohen andP Sadayappan Predictive Modeling in a Polyhedral Optimiza-tion Space International Journal of Parallel Programming 41(5) 2013

[41] P Petoumenos G Keramidas H Zeffer S Kaxiras andE Hagersten STATSHARE A Statistical Model for ManagingCache Sharing via Decay In MoBS 2006

[42] P Petoumenos L Mukhanov Z Wang H Leather and D SNikolopoulos Power Capping What Works What Does NotIn ICPADS 2015

[43] L-N Pouchet Polybench The polyhedral benchmark suiteURL httppolybench sourceforge net 2012

[44] J Power A Basu J Gu S Puthoor B M Beckmann M DHill S K Reinhardt and D A Wood Heterogeneous SystemCoherence for Integrated CPUndashGPU Systems In MICRO2013

[45] K K Pusukuri R Gupta and L N Bhuyan Thread Tran-quilizer Dynamically reducing performance variation ACMTACO 2012

[46] C E Rasmussen and C K I Williams Gaussian Processesfor Machine Learning The MIT Press 2006

[47] B Settles Active Learning Morgan and Claypool 2012

[48] F Siebert Constant-Time Root Scanning for DeterministicGarbage Collection In CC 2001

[49] M Stephenson S Amarasinghe M Martin and U-MOrsquoReilly Meta Optimization Improving Compiler Heuristicswith Machine Learning In PLDI 2003

[51] M A Taddy R B Gramacy and N G Polson Dynamic Treesfor Learning and Design Journal of the American StatisticsAssociation 106(493) 2009

[52] Z Wang and M F OrsquoBoyle Mapping Parallelism to Multi-cores A Machine Learning Based Approach In PPoPP 2009

[53] Z Wang and M F OrsquoBoyle Partitioning Streaming Parallelismfor Multi-cores A Machine Learning Based Approach InPACT 2010

[54] Z Wang and M F P OrsquoBoyle Using Machine Learning toPartition Streaming Programs ACM TACO 2013

[55] Z Wang D Powel B Franke and M F OrsquoBoyle Exploitationof GPUs for the Parallelisation of Probably Parallel LegacyCode In CC 2014

[56] Z Wang G Tournavitis B Franke and M F P OrsquoboyleIntegrating Profile-driven Parallelism Detection and Machine-learning-based Mapping ACM TACO 2014

[57] Y Wen Z Wang and M OrsquoBoyle Smart multi-task schedul-ing for OpenCL programs on CPUGPU heterogeneous plat-forms In IEEE HiPC 2014

[58] R Wilhelm et al The Worst-Case Execution-Time Problem ndashOverview of Methods and Survey of Tools ACM TECS 2008

[59] S Zhuravlev S Blagodurov and A Fedorova Address-ing Shared Resource Contention in Multicore Processors viaScheduling In ASPLOS XV 2010

[60] M Zuluaga A Krause G Sergent and M Puumlschel ActiveLearning for MultindashObjective Optimization In ICML 2013

[61] M Zuluaga et al ldquoSmartrdquo Design Space Sampling to PredictParetondashOptimal Solutions In LCTES 2012

Introduction

Motivation

Active Learning with Sequential Analysis

Sequential Analysis

Dynamic Trees

Quantifying Usefulness

Experimental Setup

Optimization Problem

Platform and Benchmarks

Evaluation Methodology

Algorithm and Model Parameters

Description of the Datasets

Experimental Results

Overall Analysis

Detailed Analysis

Related Work

Conclusions and Future Work