
OpenTuner: An Extensible Framework for Program Autotuning

Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, Saman Amarasinghe
Massachusetts Institute of Technology
Cambridge, MA
{jansel, skamil, kalyan, jrk, jbosboom, unamay, saman}@csail.mit.edu

ABSTRACT

Program autotuning has been shown to achieve better or more portable performance in a number of domains. However, autotuners themselves are rarely portable between projects, for a number of reasons: using a domain-informed search space representation is critical to achieving good results; search spaces can be intractably large and require advanced machine learning techniques; and the landscape of search spaces can vary greatly between different problems, sometimes requiring domain-specific search techniques to explore efficiently.

This paper introduces OpenTuner, a new open source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy to use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously; techniques that perform well will dynamically be allocated a larger proportion of tests. We demonstrate the efficacy and generality of OpenTuner by building autotuners for 7 distinct projects and 16 total benchmarks, showing speedups over prior techniques of these projects of up to 2.8× with little programmer effort.

1. INTRODUCTION

Program autotuning is increasingly being used in domains such as high performance computing and graphics to optimize programs. Program autotuning augments traditional human-guided optimization by offloading some or all of the search for an optimal program implementation to an automated search technique.
Rather than optimizing a program directly, the programmer expresses a search space of possible implementations and optimizations. Autotuning can often make the optimization process more efficient, as autotuners are able to search larger spaces than is possible by hand. Autotuning also provides performance portability, as the autotuning process can easily be re-run on new machines which require different sets of optimizations. Finally, multi-objective autotuning can be used to trade off between performance and accuracy, or other criteria such as energy consumption and memory usage, and provide programs which meet given performance or quality of service targets.

While the practice of autotuning has increased in popularity, autotuners themselves often remain relatively simple and project specific. There are three main challenges which make the development of autotuning frameworks difficult.

The first challenge is using the right configuration representation for the problem. Configurations can contain parameters that vary from a single integer for a block size to a much more complex type such as an expression tree representing a set of instructions. The creator of the autotuner must find ways to represent their complex domain-specific data structures and constraints.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. PACT'14, August 24-27, 2014, Edmonton, AB, Canada. Copyright 2014 ACM 978-1-4503-2809-8/14/08 ...$15.00. http://dx.doi.org/10.1145/2628071.2628092]
When these data structures are naively mapped to simpler representations, such as a point in high dimensional space, locality information is lost, which makes the search problem much more difficult. Picking the right representation for the search space is critical to having an effective autotuner. To date, all autotuners that have used a representation other than the simplest ones have had custom project-specific representations.

The second challenge is the size of the valid configuration space. While some prior autotuners have worked hard to prune the configuration space, we have found that for many problems excessive search space pruning will miss out on non-intuitive good configurations. We believe providing all the valid configurations of these search spaces is better than artificially constraining search spaces and possibly missing optimal solutions. Search spaces can be very large, up to 10^3600 possible configurations for one of our benchmarks. Full exhaustive search of such a space will not complete in human lifetimes! Thus, intelligent machine learning techniques are required to seek out a good result with a small number of experiments.

The third challenge is the landscape of the configuration space. If the configuration space is a monotonic function, a search technique biased towards this type of search space (such as a hill climber) will be able to find the optimal configuration. If the search space is discontinuous and haphazard, an evolutionary algorithm may perform better. However, in practice search spaces are much more complex, with discontinuities, high dimensionality, plateaus, and hills, with some of the configuration parameters strongly coupled and some others independent from each other. A search technique that is optimal in one type of configuration space may fail to locate an adequate configuration in another. It is difficult to provide a robust system that performs well in a variety of situations.
Additionally, many application domains will have domain-specific search techniques (such as scheduling or blocking heuristics) which may be critical to finding an optimal solution efficiently. This has caused most prior autotuners to use customized search techniques tailored to their specific problem. This requires machine learning expertise in addition to the individual domain expertise to build an autotuner for a system. We believe that this is one of the main reasons that, while autotuners are recognized as critical for performance optimization, they have not seen commodity adoption.

In this paper we present OpenTuner, a new framework for building domain-specific program autotuners. OpenTuner features an extensible configuration and technique representation able to support complex and user-defined data types and custom search heuristics. It contains a library of predefined data types and search techniques to make it easy to set up a new project. Thus, OpenTuner solves the custom configuration problem by providing not only a library of data types that will be sufficient for most projects, but also extensible data types that can be used to support more complex domain-specific representations when needed.

A core concept in OpenTuner is the use of ensembles of search techniques. Many search techniques (both built in and user-defined) are run at the same time, each testing candidate configurations. Techniques which perform well by finding better configurations are allocated larger budgets of tests to run, while techniques which perform poorly are allocated fewer tests or disabled entirely. Techniques are able to share results using a common results database to constructively help each other in finding an optimal solution; for example, evolutionary algorithms add results from other techniques as new members of their population.
To allocate tests between techniques we use an optimal solution to the multi-armed bandit problem using area under the curve credit assignment. Ensembles of techniques solve the large and complex search space problem by providing both robust solutions to many types of large search spaces and a way to seamlessly incorporate domain-specific search techniques.

1.1 Contributions

This paper makes the following contributions:

  • To the best of our knowledge, OpenTuner is the first to introduce a general framework to describe complex search spaces for program autotuning.

  • OpenTuner introduces the concept of ensembles of search techniques to program autotuning, allowing many search techniques to work together to find an optimal solution.

  • OpenTuner provides more sophisticated search techniques than typical program autotuners. This enables expanded uses of program autotuning to solve more complex search problems and pushes the state of the art forward in program autotuning in a way that can easily be adopted by other projects.

  • We demonstrate the versatility of our framework by building autotuners for 7 distinct projects and demonstrate the effectiveness of the system with 16 total benchmarks, showing speedups over existing techniques of up to 2.8×.

  • We show that OpenTuner is able to succeed both in massively large search spaces, exceeding 10^3600 possible configurations in size, and in smaller search spaces using less than 2% of the tests required for exhaustive search.

2. RELATED WORK

  Package              Domain                  Search Method
  Active Harmony [31]  Runtime System          Nelder-Mead
  ATLAS [34]           Dense Linear Algebra    Exhaustive
  FFTW [14]            Fast Fourier Transform  Exhaustive/Dynamic Prog.
  Insieme [19]         Compiler                Differential Evolution
  OSKI [33]            Sparse Linear Algebra   Exhaustive+Heuristic
  PATUS [9]            Stencil Computations    Nelder-Mead or Evolutionary
  PetaBricks [4]       Programming Language    Bottom-up Evolutionary
  Sepya [21]           Stencil Computations    Random-Restart Gradient Ascent
  SPIRAL [28]          DSP Algorithms          Pareto Active Learning

Figure 1: Summary of selected related projects using autotuning

A number of offline empirical autotuning frameworks have been developed for building efficient, portable libraries in specific domains; selected projects and the techniques they use are summarized in Figure 1. ATLAS [34] utilizes empirical autotuning to produce an optimized matrix multiply routine. FFTW [14] uses empirical autotuning to combine solvers for FFTs. Other autotuning systems include SPIRAL [28] for digital signal processing, PATUS [9] and Sepya [21] for stencil computations, and OSKI [33] for sparse matrix kernels.

The area of iterative compilation contains many projects that use different machine learning techniques to optimize lower level compiler optimizations [2, 15, 26, 1]. These projects change both the order in which compiler passes are applied and the types of passes that are applied.

In the dynamic autotuning space, there have been a number of systems developed [18, 17, 22, 6, 8, 5] that focus on creating applications that can monitor and automatically tune themselves to optimize a particular objective. Many of these systems employ a control-systems-based autotuner that operates on a linear model of the application being tuned. For example, PowerDial [18] converts static configuration parameters that already exist in a program into dynamic knobs that can be tuned at runtime, with the goal of trading QoS guarantees for meeting performance and power usage goals.
The system uses an offline learning stage to construct a linear model of the choice configuration space, which can be subsequently tuned using a linear control system. The system employs the Heartbeat framework [16] to provide feedback to the control system. A similar technique is employed in [17], where a simpler heuristic-based controller dynamically adjusts the degree of loop perforation performed on a target application to trade QoS for performance.

[Figure 2: Overview of the major components in the OpenTuner framework. The search driver and search techniques read results from, and write desired results to, the results database; the measurement driver, wrapping the user-defined measurement function, reads desired results and writes results back. Both sides access configurations through the configuration manipulator.]

Related to our Mario benchmark, Murphy [23] presents algorithms for playing NES games based on example human inputs. In contrast, this paper's Mario example has no knowledge of the game except for pixels moved to the right.

3. THE OPENTUNER FRAMEWORK

Our terminology reflects that the autotuning problem is cast as a search problem. The search space is made up of configurations, which are concrete assignments of a set of parameters. Parameters can be primitive, such as an integer, or complex, such as a permutation of a list. When the performance, output accuracy, or other metrics of a configuration are measured (typically by running it in a domain-specific way), we call this measurement a result. Search techniques are methods for exploring the search space and make requests for measurement called desired results. Search techniques can change configurations using a user-defined configuration manipulator, which also includes parameters corresponding directly to the parameters in the configuration. Some parameters include manipulators, which are opaque functions that make stochastic changes to a specific parameter in a configuration.

Figure 2 provides an overview of the major components in OpenTuner.
The search process includes techniques, which use the user-defined configuration manipulator in order to read and write configurations. The measurement processes evaluate candidate configurations using a user-defined measurement function. These two components communicate exclusively through a results database used to record all results collected during the tuning process, as well as providing the ability to perform multiple measurements in parallel.

3.1 OpenTuner Usage

To implement an autotuner with OpenTuner, first, the user must define the search space by creating a configuration manipulator. This configuration manipulator includes a set of parameter objects which OpenTuner will search over. Second, the user must define a run function which evaluates the fitness of a given configuration in the search space to produce a result. These must be implemented in a small Python program in order to interface with the OpenTuner API.

Figure 3 shows an example of using OpenTuner to search over the space of GCC compiler flags in order to minimize the execution time of the resulting program.

   1  import opentuner
   2  from opentuner import ConfigurationManipulator
   3  from opentuner import EnumParameter
   4  from opentuner import IntegerParameter
   5  from opentuner import MeasurementInterface
   6  from opentuner import Result
   7
   8  GCC_FLAGS = [
   9    'align-functions', 'align-jumps', 'align-labels',
  10    'branch-count-reg', 'branch-probabilities',
  11    # ... (176 total)
  12  ]
  13
  14  # (name, min, max)
  15  GCC_PARAMS = [
  16    ('early-inlining-insns', 0, 1000),
  17    ('gcse-cost-distance-ratio', 0, 100),
  18    # ... (145 total)
  19  ]
  20
  21  class GccFlagsTuner(MeasurementInterface):
  22
  23    def manipulator(self):
  24      """
  25      Define the search space by creating a
  26      ConfigurationManipulator
  27      """
  28      manipulator = ConfigurationManipulator()
  29      manipulator.add_parameter(
  30        IntegerParameter('opt_level', 0, 3))
  31      for flag in GCC_FLAGS:
  32        manipulator.add_parameter(
  33          EnumParameter(flag,
  34            ['on', 'off', 'default']))
  35      for param, min, max in GCC_PARAMS:
  36        manipulator.add_parameter(
  37          IntegerParameter(param, min, max))
  38      return manipulator
  39
  40    def run(self, desired_result, input, limit):
  41      """
  42      Compile and run a given configuration then
  43      return performance
  44      """
  45      cfg = desired_result.configuration.data
  46      gcc_cmd = 'g++ raytracer.cpp -o ./tmp.bin'
  47      gcc_cmd += ' -O{0}'.format(cfg['opt_level'])
  48      for flag in GCC_FLAGS:
  49        if cfg[flag] == 'on':
  50          gcc_cmd += ' -f{0}'.format(flag)
  51        elif cfg[flag] == 'off':
  52          gcc_cmd += ' -fno-{0}'.format(flag)
  53      for param, min, max in GCC_PARAMS:
  54        gcc_cmd += ' --param {0}={1}'.format(
  55          param, cfg[param])
  56
  57      compile_result = self.call_program(gcc_cmd)
  58      assert compile_result['returncode'] == 0
  59      run_result = self.call_program('./tmp.bin')
  60      assert run_result['returncode'] == 0
  61      return Result(time=run_result['time'])
  62
  63  if __name__ == '__main__':
  64    argparser = opentuner.default_argparser()
  65    GccFlagsTuner.main(argparser.parse_args())

Figure 3: GCC/G++ flags autotuner using OpenTuner.

In Section 4, we present results on an expanded version of this example which obtains up to 2.8× speedup over -O3.

This example tunes three types of flags to GCC. First it chooses between the four optimization levels -O0, -O1, -O2, -O3. Second, for the 176 flags listed on line 8, it decides between turning the flag on (with -fFLAG), off (with -fno-FLAG), or omitting the flag to allow the default value to take precedence. Including the default value as a choice is not necessary for completeness, but speeds up convergence and results in shorter command lines.
Finally, it assigns a bounded integer value to the 145 parameters on line 15 with the --param NAME=VALUE command line option.

The method manipulator (line 23) is called once at startup and creates a ConfigurationManipulator object which defines the search space of GCC flags. All accesses to configurations by search techniques are done through the configuration manipulator. For the optimization level, an IntegerParameter between 0 and 3 is created. For each flag, an EnumParameter is created which can take the values on, off, and default. Finally, for the remaining bounded GCC parameters, an IntegerParameter is created with the appropriate range.

The method run (line 40) implements the measurement function for configurations. First, the configuration is realized as a specific command line to g++. Next, this g++ command line is run to produce an executable, tmp.bin, which is then run using call_program. call_program is a convenience function which runs and measures the execution time of the given program. Finally, a Result is constructed and returned, which is a database record type containing many optional fields such as time, accuracy, and energy. By default OpenTuner minimizes the time field, but this can be customized.

3.2 Search Techniques

To provide robust search, OpenTuner includes techniques that can handle many types of search spaces and runs a collection of search techniques at the same time. Techniques which perform well are allocated more tests, while techniques which perform poorly are allocated fewer tests. Techniques share results through the results database, so that improvements made by one technique can benefit other techniques. This sharing occurs in technique-specific ways; for example, evolutionary techniques add results found by other techniques as members of their population. OpenTuner techniques are meant to be extended.
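The allocation-and-sharing loop just described can be sketched as a small sliding-window bandit. Everything below (the class, the toy "techniques", and the constants) is illustrative and is not OpenTuner's API; the scoring rule follows the AUC Bandit formula described in Section 3.2.1.

```python
import math
import random

class BanditEnsemble:
    """Illustrative sliding-window AUC bandit (not OpenTuner's API).

    Each "technique" is a function returning a candidate value to test.
    Techniques earn credit when their test sets a new global best.
    """

    def __init__(self, techniques, window=50, C=0.05):
        self.techniques = techniques
        self.window = window      # |H|: sliding history length
        self.C = C                # exploration/exploitation constant
        self.history = []         # (technique_name, was_new_best) pairs
        self.best = float('inf')

    def auc(self, name):
        # Area-under-the-curve credit: an upward step for each new
        # global best found by this technique, scaled to at most 1.
        uses = [hit for (t, hit) in self.history if t == name]
        if not uses:
            return 0.0
        n = len(uses)
        return 2.0 / (n * (n + 1)) * sum(i * hit for i, hit in enumerate(uses, 1))

    def score(self, name):
        h = len(self.history)
        uses = sum(1 for (t, _) in self.history if t == name)
        if uses == 0:
            return float('inf')   # try each technique at least once
        return self.auc(name) + self.C * math.sqrt(2 * math.log(h) / uses)

    def step(self, measure):
        name = max(self.techniques, key=self.score)
        result = measure(self.techniques[name]())
        hit = result < self.best
        self.best = min(self.best, result)
        self.history.append((name, hit))
        self.history = self.history[-self.window:]  # slide the window
        return name, result

random.seed(0)
ensemble = BanditEnsemble({
    'random':    lambda: random.uniform(-10, 10),
    'hillclimb': lambda: random.gauss(0, 0.5),   # stand-in for a local search
})
for _ in range(200):
    ensemble.step(lambda x: abs(x - 1.0))        # minimize |x - 1|
print(round(ensemble.best, 3))
```

Techniques that keep producing new global bests accumulate AUC credit and receive more tests; the square-root term guarantees that no technique in the window is starved entirely.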
Users can define custom techniques which implement domain-specific heuristics and add them to ensembles of pre-defined techniques.

Ensembles of techniques are created by instantiating a meta technique, which is a technique made up of a collection of other techniques. The OpenTuner search driver interacts with a single root technique, which is typically a meta technique. When the meta technique gets allocated tests, it incrementally decides how to divide these tests among its sub-techniques. OpenTuner contains an extensible class hierarchy of techniques and meta techniques, which can be combined together and used in autotuners.

3.2.1 AUC Bandit Meta Technique

In addition to a number of simple meta techniques, such as round robin, OpenTuner's core meta technique used in the results is the multi-armed bandit with sliding window, area under the curve credit assignment (AUC Bandit) meta technique. A similar technique was used in [25] in the different context of online operator selection. It is based on an optimal solution to the multi-armed bandit problem [12]. The multi-armed bandit problem is the problem of picking levers to pull on a slot machine with many arms, each with an unknown payout probability. The sliding window makes the technique only consider a subset of history (the length of the window) when making decisions. This meta technique encapsulates a fundamental trade-off between exploitation (using the best known technique) and exploration (estimating the performance of all techniques).

The AUC Bandit meta technique assigns each test to the technique t defined by the formula

  arg max_t ( AUC_t + C * sqrt(2 lg |H| / H_t) )

where |H| is the length of the sliding history window, H_t is the number of times the technique has been used in that history window, C is a constant controlling the exploration/exploitation trade-off, and AUC_t is the credit assignment term quantifying the performance of the technique in the sliding window.
The second term in the equation is the exploration term, which gets smaller the more often a technique is used.

The area under the curve credit assignment mechanism, based on [13], draws a curve by looking at the history for a specific technique and considering only whether a technique yielded a new global best or not. If the technique yielded a new global best, an upward line is drawn; otherwise a flat line is drawn. The area under this curve (scaled to a maximum value of 1) is the total credit attributed to the technique. This credit assignment mechanism can be described more precisely by the formula

  AUC_t = (2 / (|V_t| (|V_t| + 1))) * sum_{i=1}^{|V_t|} i * V_{t,i}

where V_t is the list of uses of t in the sliding window history, and V_{t,i} is 1 if using technique t the ith time in the history resulted in a speedup, otherwise 0.

3.2.2 Other Techniques

OpenTuner includes implementations of the following techniques: differential evolution; many variants of Nelder-Mead search and Torczon hillclimbers; a number of evolutionary mutation techniques; pattern search; particle swarm optimization; and random search. OpenTuner also includes a bandit mutation technique which uses the same AUC Bandit method to decide which manipulator function, across all parameters, to call on the best known configuration. These techniques span a range of strategies and are each biased to perform best in different types of search spaces. They also each contain many settings which can be configured to change their behavior. Each technique has been modified so that with some probability it will use information found by other techniques if those techniques have discovered a better configuration.

The default meta technique, used for the results in this paper and meant to be robust, uses an AUC Bandit meta technique to combine greedy mutation, differential evolution, and two hill climber instances.

3.3 Configuration Manipulator

The configuration manipulator provides a layer of abstraction between the search techniques and the raw configuration structure.
It is primarily responsible for managing a list of parameter objects, each of which can be used by search techniques to read and write parts of the underlying configuration.

  Parameter
  ├── Primitive
  │   ├── Integer
  │   ├── Float
  │   └── ScaledNumeric
  │       ├── LogInteger
  │       ├── LogFloat
  │       └── PowerOfTwo
  └── Complex
      ├── Boolean
      ├── Switch
      ├── Enum
      ├── Permutation
      │   └── Schedule
      └── Selector

Figure 4: Hierarchy of built-in parameter types. User-defined types can be added at any point below Primitive or Complex in the tree.

The default implementation of the configuration manipulator uses a fixed list of parameters and stores the configuration as a dictionary from parameter name to a parameter-dependent data type. The configuration manipulator can be extended by the user either to change the underlying data structure used for configurations or to support a dynamic list of parameters that is dependent on the configuration instance.

3.3.1 Parameter Types

OpenTuner has a hierarchy of built-in parameter types, shown in Figure 4. Each parameter type is responsible for interfacing between the raw representation of a parameter in the configuration and the standardized view of that parameter presented to the search techniques. Parameter types can be extended both to change the underlying representation and to change the abstraction provided to search techniques, causing a parameter to be searched in different ways.

From the viewpoint of search techniques there are two main types of parameters, each of which provides a different abstraction to the search techniques:

Primitive parameters present a view to search techniques of a numeric value with an upper and lower bound. These upper and lower bounds can be dependent on the configuration instance.

The built-in parameter types Float and LogFloat (and similarly Integer and LogInteger) both have identical representations in the configuration, but present a different view of the underlying value to the search techniques. Float is presented directly to search techniques, while LogFloat presents a log scaled view of the underlying parameter to search techniques.
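The Float/LogFloat distinction can be pictured with a small sketch of the log-scaled view a parameter might present to search techniques. The helper below is hypothetical, not OpenTuner's internal representation:

```python
import math

class LogScaledView:
    """Illustrative log-scaled view of a bounded numeric parameter.

    The raw value is stored unchanged; search techniques see and move
    the value on a log2 scale, so halving and doubling the raw value
    are steps of equal magnitude (-1 and +1 in the scaled view).
    """

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def to_search_space(self, raw):
        return math.log2(raw)

    def from_search_space(self, scaled):
        # Clamp the result back into the parameter's legal range.
        return min(self.hi, max(self.lo, 2.0 ** scaled))

block_size = LogScaledView(1, 4096)
s = block_size.to_search_space(256)                 # 8.0
assert block_size.from_search_space(s + 1) == 512   # doubling = +1 step
assert block_size.from_search_space(s - 1) == 128   # halving  = -1 step
```

A PowerOfTwo parameter can be seen as the same idea with the scaled value restricted to integers.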
To a search technique, halving and doubling a log scaled parameter are changes of equal magnitude. Log scaled variants of parameters are often better for parameters such as block sizes, where fixed changes in value have diminishing effects the larger the parameter becomes. PowerOfTwo is a commonly used special case, similar to LogInteger, where the legal values of the parameter are restricted to powers of two.

Complex parameters present a more opaque view to search techniques. Complex parameters have a variable set of manipulation operators (manipulators) which make stochastic changes to the underlying parameter. These manipulators are arbitrary functions defined on the parameter which can make high level, type-dependent changes. Complex parameters are meant to be easily extended to add domain-specific structures to the search space. New complex parameters define custom manipulation operators for use by the configuration manipulators.

The built-in parameter types Boolean, Switch, and Enum could theoretically also be represented as primitive parameters, since they each can be translated directly to a small integer representation. However, in the context of search techniques they make more sense as complex parameters. This is because, for primitive parameters, search techniques will attempt to follow gradients. These parameter types are unordered collections of values for which no gradients exist. Thus, the complex parameter abstraction is a more efficient representation to search over.

The Permutation parameter type assigns an order to a given list of values and has manipulators which make various types of random changes to the permutation. A Schedule parameter is a Permutation with a set of dependencies that limit the legal order.
Schedules are implemented as a permutation that gets topologically sorted after each change. Finally, a Selector parameter is a special type of tree which is used to define a mapping from an integer input to an enumerated value type.

In addition to these primary primitive and complex abstractions for parameter types, there are a number of derived ways that search techniques will interact with parameters in order to more precisely convey intent. These are additional methods on parameters which contain default implementations for both primitive and complex parameter types. These methods can optionally be overridden for specific parameter types to improve search techniques. Parameter types will work without these methods being overridden, but implementing them can improve results.

As an example, a common operation in many search techniques is to add the difference between configurations A and B to configuration C. This is used both in differential evolution and in many hill climbers. Complex parameters have a default implementation of this intent which compares the value of the parameter in the three configurations: if A = B, then there is no difference and the result is C; similarly, if B = C, then A is returned; otherwise a change should be made, so random manipulators are called. This works in general, but for individual parameter types there are often better interpretations. For example, with permutations one could calculate the positional movement of each item in the list and calculate a new permutation by applying these movements again.

3.4 Objectives

OpenTuner supports multiple user-defined objectives. Result records have fields for time, accuracy, energy, size, confidence, and user defined data. The default objective is to minimize time.
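Operationally, an objective is a comparison between result records. The sketch below is illustrative only; it does not reproduce OpenTuner's actual Objective classes or Result schema, but shows the default minimize-time comparison alongside a threshold-accuracy-then-time comparison:

```python
from collections import namedtuple

# Illustrative result record; OpenTuner's actual Result is a database
# row with fields such as time, accuracy, energy, and size.
Result = namedtuple('Result', ['time', 'accuracy'])

class MinimizeTime:
    """Objective: smaller time is better."""
    def better_than(self, a, b):
        return a.time < b.time

class ThresholdAccuracyMinimizeTime:
    """Objective: meet an accuracy threshold, then minimize time."""
    def __init__(self, target):
        self.target = target
    def better_than(self, a, b):
        a_ok, b_ok = a.accuracy >= self.target, b.accuracy >= self.target
        if a_ok != b_ok:
            return a_ok    # a wins if only a meets the threshold
        return a.time < b.time

fast_sloppy = Result(time=1.0, accuracy=0.80)
slow_exact  = Result(time=3.0, accuracy=0.99)

assert MinimizeTime().better_than(fast_sloppy, slow_exact)
assert ThresholdAccuracyMinimizeTime(0.95).better_than(slow_exact, fast_sloppy)
```

The two objectives rank the same pair of results in opposite orders, which is exactly why the comparison is user-definable.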
Many other objectives are supported, such as: maximize accuracy; threshold accuracy while minimizing time; and maximize accuracy then minimize size. The user can easily define their own objective by defining comparison operators and display methods on a subclass of Objective.

3.5 Search Driver and Measurement

OpenTuner is divided into two submodules, search and measurement. The search driver and measurement driver in each of these modules orchestrate most of the framework of the search process. These two modules communicate only through the results database. The measurement module is minimal by design and is primarily a wrapper around the user-defined measurement function which creates results from configurations.

This division between search and measurement is motivated by a number of different factors:

  • To allow parallelism between multiple search and measurement processes, possibly across different machines. Parallelism is most important in the measurement processes, since in most autotuning applications measurement costs dominate. To allow for parallelism, the search driver will make multiple requests for desired results without waiting for each request to be fulfilled. If a specific technique is blocked waiting for results, other techniques in the ensemble will be used to fill out requests to prevent idle time.

  • The separation of the measurement module is desirable to support online learning and sideline learning. In these setups, autotuning is not done before deployment of an application, but is done online as an application is running or during idle time. Since the measurement module is minimal by design, it can be replaced by a domain-specific online learning module which periodically examines the database to decide which configuration to use and records performance back to the database.
  • Finally, in many embedded or mobile settings which require constrained measurement environments, it is desirable to have a minimal measurement module which can easily be re-implemented in other languages without needing to modify the majority of the OpenTuner framework.

3.6 Results Database

The results database is a fully featured SQL database. All major database types are supported, and SQLite is used if the user has not configured a database type, so that no setup is required. It allows different techniques to query and share results in a variety of ways and is useful for introspection about the performance of techniques across large numbers of runs of the autotuner.

4. EXPERIMENTAL RESULTS

  Project        Benchmark  Possible Configurations
  GCC/G++ Flags  all        10^806
  Halide         Blur       10^25
  Halide         Wavelet    10^32
  Halide         Bilateral  10^176
  HPL            n/a        10^9.9
  PetaBricks     Poisson    10^3657
  PetaBricks     Sort       10^90
  PetaBricks     Strassen   10^188
  PetaBricks     TriSolve   10^1559
  Stencil        all        10^6.5
  Unitary        n/a        10^21
  Mario          n/a        10^6328

Figure 5: Search space sizes in number of possible configurations, as represented in OpenTuner.

We validated OpenTuner by using it to implement autotuners for seven distinct projects. This section describes these seven projects and the autotuners we implemented, and presents results comparing to prior practices in each project. Figure 5 lists, for each benchmark, the number of distinct configurations that can be generated by OpenTuner. This measure is not perfect, because some configurations may be semantically equivalent and the search space depends on the representation chosen in OpenTuner. It does, however, provide a sense of the relative size of each search space, which is useful as a first approximation of tuning difficulty. For many of these benchmarks, the size of the search space makes exhaustive search intractable.
Thus, we compare to the best performance obtained by the benchmark authors when optimality is impossible to quantify.

For each benchmark, we keep the environment as constant as possible, but do not use the same environment for all benchmarks, due to particular requirements of each benchmark, and because one (the stencil benchmark) is based on data from an environment outside our control.

4.1 GCC/G++ Flags

The GCC/G++ flags autotuner is described in detail in Section 3.1. There are a number of features that were omitted from the earlier example code for simplicity, which are included in the full version of the autotuner.

First, we added error checking to gracefully handle the compiler or the output program hanging, crashing, running out of memory, or otherwise going wrong. Our tests uncovered a number of bugs in GCC which triggered internal compiler errors, and we implemented code to detect, diagnose, and avoid error-causing sets of flags. We are submitting bug reports for these crashes to the GCC developers.

Second, instead of using a fixed list of flags and parameters (which the example does for simplicity), our full autotuner automatically extracts the supported flags from g++ --help=optimizers. Parameters and legal ranges are extracted automatically from params.def in the GCC source code.

Additionally, there were a number of smaller features, such as: time limits to abort slow tests which will not be optimal; use of LogInteger parameter types for some values; a save_final_config method to output the final flags; and many command line options to control autotuner behavior.

We ran experiments using gcc 4.7.3-1ubuntu1, on an 8-core, 2-socket Xeon E5320. We allowed flags such as -ffast-math which can change rounding or NaN behavior of floating-point numbers and have small impacts on program results.
We still observe speedups with these flags removed. For target programs to optimize we used: a fast Fourier transform in C, fft.c, taken from the SPLASH2 [35] benchmark suite; a C++ template matrix multiply, matrixmultiply.cpp, written by Xiang Fan [11] (version 3); a C++ ray tracer, raytracer.cpp, taken from the scratchpixel website [27]; and a genetic algorithm to solve the traveling salesman problem in C++, tsp_ga.cpp, by Kristoffer Nordkvist [24], which we modified to run deterministically. These programs were chosen to span a range from highly optimized codes, like fft.c, which contains cache-aware tiling and threading, to less optimized codes, like matrixmultiply.cpp, which contains only a transpose of one of the inputs.

Figure 6 shows the performance of autotuning GCC flags on these four sample programs. Final speedups ranged from 1.15x for FFT to 2.82x for matrix multiply. Figure 7 shows a separate experiment to try to understand which flags contributed the most to the speedups obtained. Full GCC command lines found contained over 250 options and are difficult to understand by hand.

Figure 6: GCC/G++ Flags: Execution time (lower is better) as a function of autotuning time, for (a) fft.c, (b) matrixmultiply.cpp, (c) raytracer.cpp, and (d) tsp_ga.cpp, each compared against -O1/-O2/-O3. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles. Note that in (b) the O1/O2/O3 lines and in (c) the O2/O3 lines are on top of each other and may be difficult to see.
To approximate the importance of each flag, we measured the slowdown resulting from removing it from the autotuned GCC command line. We normalize all slowdowns to sum to 100%. This experiment was conducted on a single run of the autotuner. This measurement of importance is imperfect because flags can interact in complex ways and are not independent. We note that the total slowdown from removing each flag independently is larger than the slowdown of removing all flags.

The benchmarks show interesting behavior and varying sets of optimal flags. The matrix multiply benchmark reaches optimal performance through a few large steps, and its speedup is dominated by three flags: -fno-exceptions, -fwrapv, and -funsafe-math-optimizations. On the other hand, TSP takes many small, incremental steps, and its speedup is spread over a large number of flags with smaller effects. The FFT benchmark is heavily hand-optimized and performs more poorly at -O3 than at -O2; the flags found for it primarily disable, rather than enable, various compiler optimizations which interact poorly with its hand-written optimizations. While there are some patterns, each benchmark requires a different set of flags to get the best performance.

4.2 Halide

Halide [29, 30] is a domain-specific language and compiler for image processing and computational photography, specifically targeted towards image processing pipelines (graphs) that contain many stages. Halide separates the expression of the kernels and pipeline itself from the pipeline's schedule, which defines the order of execution and placement of data by which it is mapped to machine execution. The schedule dictates how the Halide compiler synthesizes code for a given pipeline. This allows expert programmers to easily define and explore the space of complex schedules which result in high performance. To date, production uses of Halide have relied on hand-tuned schedules written by experts.
The autotuning problem in Halide is to automatically synthesize schedules and search for high-performance configurations of a given pipeline on a given architecture. The Halide project previously integrated a custom autotuner to automatically search over the space of schedules, but it was removed because it became too complex to maintain and was difficult to use in practice.¹

¹ Unfortunately, the original Halide autotuner cannot be used as a baseline to compare against, because the Halide language and compiler have changed too much since its removal to make a direct comparison meaningful.

(a) fft.c
Flag (Importance)
-fno-tree-vectorize (25.0%)
-funroll-loops (6.7%)
-fno-jump-tables (5.4%)
-fno-inline (5.0%)
-fno-ipa-pure-const (4.8%)
-fno-tree-cselim (4.7%)
-fno-rerun-cse-after-loop (4.5%)
-fno-tree-forwprop (4.0%)
-fno-tree-tail-merge (4.0%)
-fno-cprop-registers (3.2%)
264 other flags (32.7%)

(b) matrixmultiply.cpp
Flag (Importance)
-fno-exceptions (28.5%)
-fwrapv (27.8%)
-funsafe-math-optimizations (27.0%)
param=large-stack-frame=65 (1.1%)
-fschedule-insns2 (1.0%)
-funroll-loops (0.9%)
-fno-ivopts (0.8%)
param=sccvn-max-scc-size=2995 (0.7%)
param=max-sched-extend-regions-iters=2 (0.6%)
param=slp-max-insns-in-bb=1786 (0.6%)
272 other flags (10.9%)

(c) raytracer.cpp
Flag (Importance)
-funsafe-math-optimizations (21.1%)
-ffinite-math-only (9.4%)
-frename-registers (3.9%)
-fwhole-program (2.8%)
param=selsched-insns-to-rename=2 (2.4%)
-fno-tree-dominator-opts (2.3%)
param=min-crossjump-insns=17 (2.2%)
param=max-crossjump-edges=31 (2.1%)
param=sched-state-edge-prob-cutoff=17 (2.0%)
param=sms-loop-average-count-threshold=4 (1.8%)
265 other flags (49.9%)

(d) tsp_ga.cpp
Flag (Importance)
-freorder-blocks-and-partition (2.8%)
-funroll-all-loops (1.9%)
param=omega-max-geqs=64 (1.8%)
param=predictable-branch-outcome=2 (1.5%)
param=min-insn-to-prefetch-ratio=36 (1.5%)
-fno-rename-registers (1.4%)
param=max-unswitch-insns=200 (1.4%)
param=omega-max-keys=2000 (1.4%)
param=max-delay-slot-live-search=83 (1.3%)
param=prefetch-latency=50 (1.3%)
279 other flags (83.7%)

Figure 7: Top 10 most important GCC flags for each benchmark. Importance is defined as the slowdown when the flag is removed from the final configuration. Importance is normalized to sum to 100%.

Figure 8: Halide: Execution time (lower is better) as a function of autotuning time, for (a) Blur, (b) Wavelet, and (c) Bilateral, each compared against the hand-optimized schedule. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

Halide's model of schedules is based on defining three things for each function in a pipeline:

1. In what order pixels within the stage should be evaluated.
2. At what granularity pixels should be interleaved between producer and consumer stages.
3. At what granularity pixels should be stored for reuse.

In the Halide language, these choices are specified by applying schedule operators to each function. The composition of potentially many schedule operators for each function together defines the organization of computations and data for a whole pipeline. For example, the best hand-tuned schedule (against which we compare our autotuner) for a two-stage separable blur pipeline is written as follows:

blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);

blur_x(x, y) and blur_y(x, y) are the Halide functions which make up the pipeline. The scheduling operators available to the autotuner are the following:

- split introduces a new variable and loop nest by adding a layer of blocking. Recursive splits make the search space theoretically infinite; we limit the number of splits to at most 4 per dimension per function, which is sufficient in practice.
We represent each of these splits as a PowerOfTwoParameter, where setting the size of the split to 1 corresponds to not using the split operator.

- parallel, vectorize, and unroll cause the loop nest associated with a given variable in the function to be executed in parallel (over multiple threads), vectorized (e.g., with SSE), or unrolled. OpenTuner represents each of these operators as a parameter per variable per function (including variables introduced by splits). The parallel operator is a BooleanParameter, while vectorize and unroll are PowerOfTwoParameters like splits.

- reorder / reorder_storage take a list of variables and reorganize the loop nest order or storage order for those variables. We encode a single reorder and a single reorder_storage for each function. We represent reorder_storage as a PermutationParameter over the original (pre-split) variables in the function. The reorder operator for each function is a ScheduleParameter over all the post-split variables, with the permutation constraint that the split children of a single variable are ordered in a fixed order relative to each other, to remove redundancy in the choice space.

Together, these operators define the order of execution (as a perfectly nested loop) and storage layout within the domain of each function, independently. Finally, the granularity of interleaving and storage between stages is specified by two more operators:

- compute_at specifies the granularity at which a given function should be computed within its consumers.

- store_at specifies the granularity at which a given function should be stored for reuse by its consumers.
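Several of the choices above (split sizes, vectorize widths, unroll factors) are naturally expressed as power-of-two parameters. The following is a minimal sketch of such a parameter, assuming a simple random-search/mutation interface; it is illustrative only, not OpenTuner's actual implementation:

```python
import random

class PowerOfTwoParameter:
    """Sketch of a power-of-two tuning parameter: legal values are
    2^0 .. 2^max_exp, and mutation steps the exponent by one.
    (Hypothetical class, not OpenTuner's real API.)"""

    def __init__(self, name, max_exp):
        self.name = name
        self.max_exp = max_exp

    def random_value(self, rng=random):
        # Uniform over exponents, so small and large sizes are equally likely.
        return 2 ** rng.randint(0, self.max_exp)

    def mutate(self, value, rng=random):
        # Step the exponent up or down by one, clamped to the legal range.
        exp = value.bit_length() - 1
        exp = min(self.max_exp, max(0, exp + rng.choice([-1, 1])))
        return 2 ** exp

# A split size of 1 corresponds to not applying the split operator at all.
split = PowerOfTwoParameter("blur_y.split.y", max_exp=6)
assert split.random_value() in {1, 2, 4, 8, 16, 32, 64}
```

Searching over exponents rather than raw integers keeps mutations meaningful: a single step halves or doubles a blocking size instead of perturbing it by one.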
Figure 9: High Performance Linpack: Execution time (lower is better) as a function of autotuning time, compared against the vendor-optimized configuration. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

Both granularity choices are specified as a level in the loop nest of a downstream function whose scope encloses all uses of the producer. These producer-consumer granularity choices are the most difficult parameters to encode and search, because the valid choices for these parameters are inter-dependent with both the granularity and order choices for all other functions in the pipeline. Naively encoding these parameters without respecting this interdependence creates, for each valid schedule, numerous invalid schedules which are meaningless in Halide's schedule language and rejected by the compiler.

The result of all these operators is a schedule space that is an exceedingly complex tree; transformations can add or delete subtrees, for example. Searching over such a space is far more difficult when it must be encoded in simple integer parameters, so we take advantage of OpenTuner's ability to encode more complex parameter types. We represented the set of all compute and storage interleaving choices for a whole pipeline as a single instance of a custom, domain-specific parameter type. The parameter extends a ScheduleParameter, encoding open and close pairs for each scope in the scheduled pipeline's loop nest, with a correction step which projects any invalid permutations back into the space of meaningful schedules.

Figure 8 presents results for three benchmarks drawn from Halide's standard demo applications: a two-stage blur, an inverse Daubechies wavelet transform, and a fast bilateral filter using the bilateral grid.
For all three of these examples, OpenTuner is able to create schedules that match or beat the best hand-optimized schedules shipping with the Halide source code. Results were collected on three Ivy Bridge generation Intel processors² using the latest release of Halide.

² Blur was tuned on a Core i5-3550, wavelet on a Core i7-3770, and bilateral grid on 12 cores of a dual-socket Xeon E5-2695 v2 server.

4.3 High Performance Linpack

The High Performance Linpack benchmark [10] is used to evaluate the floating point performance of machines ranging from small multiprocessors to large-scale clusters, and is the evaluation criterion for the Top 500 [32] supercomputer benchmark. The benchmark measures the speed of solving a large random dense linear system of equations using distributed memory. Achieving optimal performance requires tuning about fifteen parameters, including matrix block sizes and algorithmic parameters. To assist in tuning, HPL includes a built-in autotuner that uses exhaustive search over user-provided discrete values of the parameters.

We run HPL on a 2.93 GHz Intel Sandy Bridge quad-core machine running Linux kernel 3.2.0, compiled with GCC 4.5 and using the Intel Math Kernel Library (MKL) 11.0 for optimized math operations. For comparison purposes, we evaluate performance relative to Intel's optimized HPL implementation³. We encode the input tuning parameters for HPL as naively as possible, without using any machine-specific knowledge. For most parameters, we utilize EnumParameter or SwitchParameter, as they generally represent discrete choices in the algorithm used.
The major parameter that controls performance is the blocksize of the matrix; we represent this as an IntegerParameter to give the autotuner as much freedom as possible when searching. Another major parameter controls the distribution of the matrix onto the processors; we represent this by enumerating all 2D decompositions possible for the number of processors on the machine.

Figure 9 shows the results of 30 tuning runs using OpenTuner, compared with the vendor-provided performance. The median performance across runs, after 1200 seconds of autotuning, exceeds the performance of Intel's optimized parameters. Overall, OpenTuner obtains a best performance of 86.5% of theoretical peak performance on this machine, while exploring a minuscule amount of the overall search space. Furthermore, the blocksize chosen is not a power of two, and is generally a value unlikely to be guessed for use in hand-tuning.

4.4 PetaBricks

PetaBricks [3] is an implicitly parallel language and compiler which incorporates the concept of algorithmic choice into the language. The PetaBricks language provides a framework for the programmer to describe multiple ways of solving a problem, while allowing the autotuner to determine which combination of ways is best for the user's situation. The search space of PetaBricks programs is both over low-level optimizations and over different algorithms. The autotuner is used to synthesize poly-algorithms which weave many individual algorithms together by switching dynamically between them at recursive call sites.

The primary components in the search space for PetaBricks programs are algorithmic selectors, which are used to synthesize instances of poly-algorithms from algorithmic choices in the PetaBricks language. A selector is used to map input sizes to algorithmic choices, and is represented by a list of cutoffs and choices.
As an example, the selector [InsertionSort, 500, QuickSort, 1000, MergeSort] would correspond to synthesizing the function:

void Sort(List& list) {
  if (list.length < 500)
    InsertionSort(list);
  else if (list.length < 1000)
    QuickSort(list);
  else
    MergeSort(list);
}

where QuickSort and MergeSort recursively call Sort, so the program dynamically switches between sorting methods as recursive calls are made on smaller and smaller lists. We used the general SelectorParameter to represent this choice type, which internally keeps track of the order of the algorithmic choices and the cutoffs. PetaBricks programs contain many algorithmic selectors and a large number of other parameter types, such as block sizes, thread counts, iteration counts, and program-specific parameters.

³ Available at http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

Figure 10: PetaBricks: Execution time (lower is better) as a function of autotuning time, for (a) Poisson, (b) Sort, (c) Strassen, and (d) Tridiagonal Solver, each compared against the PetaBricks autotuner. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

Results using OpenTuner compared to the built-in PetaBricks autotuner are shown in Figure 10.
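The cutoff/choice semantics of a selector can be captured in a few lines; the following is an illustrative sketch (a hypothetical helper mirroring the Sort example, not PetaBricks or OpenTuner code):

```python
import bisect

def select(selector, n):
    """Interpret a flattened selector list
    [choice0, cutoff0, choice1, cutoff1, ..., choiceK]:
    return the algorithmic choice for input size n.
    The choice at index i applies when cutoff(i-1) <= n < cutoff(i)."""
    choices = selector[0::2]   # every even slot is a choice
    cutoffs = selector[1::2]   # every odd slot is a cutoff
    # bisect_right finds how many cutoffs are <= n, which is
    # exactly the index of the choice that applies.
    return choices[bisect.bisect_right(cutoffs, n)]

sel = ["InsertionSort", 500, "QuickSort", 1000, "MergeSort"]
print(select(sel, 300))   # InsertionSort
print(select(sel, 700))   # QuickSort
print(select(sel, 5000))  # MergeSort
```

A mutation on such a parameter can either shift a cutoff or swap the algorithms in two slots, which is why a dedicated SelectorParameter type is more effective than encoding the list as independent integers.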
The PetaBricks autotuner uses a different strategy that starts with tests on very small problem inputs and incrementally works up to full-sized inputs [4]. In all cases, the autotuners arrive at similar solutions, and for Strassen, the exact same solution. For Sort and Tridiagonal Solver, OpenTuner beats the native PetaBricks autotuner, while for Poisson the PetaBricks autotuner arrives at a better solution, but has much higher variance.

The Poisson equation solver (Figure 10a) presents the most difficult search space. The search space for Poisson in PetaBricks is described in detail in [7]. It is a variable accuracy benchmark where the goal of the autotuner is to find a solution that provides 8 digits of accuracy while minimizing time. All points in Figure 10a satisfy the accuracy target, so we do not display accuracy. OpenTuner uses the ThresholdAccuracyMinimizeTime objective described in Section 3.4. The Poisson search space selects between direct solvers, iterative solvers, and multigrid solvers, where the shape of the multigrid V-cycle/W-cycle is defined by the autotuner. The optimal solution is a poly-algorithm composed of multigrid W-cycles. However, it is difficult to obtain 8 digits of accuracy with randomly generated multigrid cycle shapes, though it is easy with a direct solver (which solves the problem exactly). This creates a large plateau, shown near 0.16, which is difficult for the autotuners to improve upon. The native PetaBricks autotuner is less affected by this plateau because it constructs algorithms incrementally bottom-up; however, the use of these smaller input sizes causes larger variance, as mistakes early on get amplified.

4.5 Stencil

In [20], the authors describe a generalized system for autotuning memory-bound stencil computations on modern multicore machines and GPUs.
By composing domain-specific transformations, the authors explore a large space of implementations for each kernel; the original autotuning methodology involves exhaustive search over thousands of implementations for each kernel.

We obtained the raw execution data, courtesy of the authors, and use OpenTuner instead of exhaustive search on the data from a Nehalem-class 2.66 GHz Intel Xeon machine running Linux 2.6. We compare against the optimal performance obtained by the original autotuning system through exhaustive search.

Figure 11: Stencil: Execution time (lower is better) as a function of tests, for (a) Laplacian and (b) Divergence, each compared against the exhaustive-search optimum. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

The search space for this problem involves searching for parameters for the parallel decomposition, cache and thread blocking, and loop unrolling for each kernel; to limit the impact of control flow and cache misalignment, these parameters depend on one another (for example, the loop unrolling will be a small integer divisor of the thread blocking). We encode these parameters as PowerOfTwoParameters, but ensure that invalid combinations are discarded.

Figure 11 shows the results of using OpenTuner for the Laplacian and divergence kernel benchmarks, showing the median performance obtained over 30 trials as a function of the number of tests (results for the gradient kernel are similar, so we omit them for brevity). OpenTuner is able to obtain peak performance on Laplacian after testing fewer than 35 candidate implementations, and on divergence after fewer than 25; thus, using OpenTuner, less than 2% of the search space needs to be explored to reach optimal performance.
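The inter-parameter constraint described above (loop unrolling must divide the thread blocking) can be enforced by filtering the cross product of power-of-two values; a minimal sketch with hypothetical exponent bounds, not the benchmark's actual parameter ranges:

```python
def valid_stencil_configs(max_block_exp=5, max_unroll_exp=3):
    """Yield (thread_block, unroll) pairs over powers of two,
    keeping only pairs where the unroll factor divides the
    thread blocking, as the dependency in the text requires.
    (Bounds here are hypothetical.)"""
    for b in (2 ** i for i in range(max_block_exp + 1)):
        for u in (2 ** j for j in range(max_unroll_exp + 1)):
            if b % u == 0:  # discard invalid combinations
                yield (b, u)

configs = list(valid_stencil_configs())
assert all(b % u == 0 for b, u in configs)
```

Discarding invalid points before measurement keeps every test meaningful, which matters when only a few dozen tests are needed to reach the optimum.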
These results show that even for problems where exhaustive search is tractable (though it may take days), OpenTuner can drastically improve convergence to the optimal performance with little programmer effort.

4.6 Unitary Matrices

As a somewhat different example, we use OpenTuner on a problem from physics, namely the quantum control problem of synthesizing unitary matrices in SU(2) in optimal time, using a finite control set composed of rotations around two non-parallel axes. (Such rotations generate the complete space SU(2).)

Figure 12: Unitary: Accuracy (higher is better) as a function of autotuning time, for easy and hard problem instances. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

Unlike the other examples, which use OpenTuner as a traditional autotuner to optimize a program, the Unitary example uses OpenTuner to perform a search over the problem space as a subroutine at runtime in a program. The problem has a fixed set of operators (or controls), represented as matrices, and the goal is to find a sequence of operators that, when multiplied together, produce a given target matrix. The objective function is an accuracy value defined as a function of the distance of the product of the sequence to the goal (also called the trace fidelity).

Figure 12 shows the performance of the Unitary example on both easy and hard instances of the problem. For both types of problem instance, OpenTuner is able to meet the accuracy target within the first few seconds.
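The trace-fidelity objective can be computed directly from 2x2 complex matrices. The sketch below uses a z-axis rotation as an illustrative control; the rotation helper and angles are assumptions for the example, not the benchmark's actual control set:

```python
import cmath
import math

def rot_z(theta):
    """Rotation about the z axis in SU(2): diag(e^{-i t/2}, e^{i t/2})."""
    return [[cmath.exp(-1j * theta / 2), 0],
            [0, cmath.exp(1j * theta / 2)]]

def matmul(a, b):
    """2x2 complex matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace_fidelity(u, target):
    """|Tr(target^dagger u)| / 2: 1.0 when u matches the target
    up to global phase, smaller as the product drifts away."""
    dag = [[target[j][i].conjugate() for j in range(2)] for i in range(2)]
    p = matmul(dag, u)
    return abs(p[0][0] + p[1][1]) / 2

# Two quarter-turns about z compose to a half-turn, so fidelity is 1.
u = matmul(rot_z(math.pi / 2), rot_z(math.pi / 2))
print(round(trace_fidelity(u, rot_z(math.pi)), 6))  # 1.0
```

A search technique would score each candidate operator sequence with trace_fidelity and stop once the accuracy target is met.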
This example shows that OpenTuner can be used for more types of search problems than just program autotuning.

4.7 Mario

To demonstrate OpenTuner's versatility, we used the system to learn to play Super Mario Bros., a Nintendo Entertainment System game in which the player, as Mario, runs to the right, jumping over pits and onto enemies' heads to reach the flagpole at the end of each level. We use OpenTuner to search for a sequence of button presses that completes the game's first level. The button presses are written to a file and played back in the NES emulator FCEUX. The emulator is configured to evaluate candidate input sequences faster than real time. OpenTuner evaluates 96 input sequences in parallel in different instances of the emulator. A Lua script running in the emulator detects when Mario dies or reaches the flagpole by monitoring a location in the emulated memory. At that time, it reports the number of pixels Mario has moved to the right, which is the objective function OpenTuner tries to maximize. The game is deterministic, so only one trial is necessary to evaluate this function. OpenTuner does not interpret the game's graphics or read the emulated memory; the objective function's value is the only information OpenTuner receives.

Figure 13: Mario: Pixels Mario moves to the right (higher is better) as a function of autotuning time. Reaching pixel 3161 (the flagpole) completes the level. Aggregated performance of 30 runs of OpenTuner; error bars indicate median, 20th, and 80th percentiles.

Horizontal movement is controlled by many EnumParameters choosing to run or walk left or right or to not move, each paired with an IntParameter specifying the duration of the movement in frames. As the goal is to move right, the search space is given a 3-to-1 right-moving bias.
Jumping is controlled by IntParameters specifying on which frames to jump and how long to hold the jump button.

Figure 13 shows the performance of the Mario example. OpenTuner rapidly reaches 1500 pixels because many initial (mostly random) configurations will make it that far into the level. The bottlenecks evident in the figure correspond to pits which Mario must jump over, which require coordinating all four kinds of parameter: Mario must jump on the correct frame for a long enough duration, having built up speed from walking or running right for a long enough duration, and then must not return left into the pit. Performance is aided by a "ratcheting" effect: due to the NES's memory limits, portions of the level that have scrolled off the screen are quickly overwritten with new portions, so the game only allows Mario to move left one screen (up to 256 pixels) from his furthest-right position. The final sequences complete the level after between 3500 and 5000 input actions.

5. CONCLUSIONS

We have shown OpenTuner, a new framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully customizable configuration representations and an extensible technique representation to allow for domain-specific techniques. OpenTuner introduces the concept of ensembles of search techniques in autotuning, which allow many search techniques to work together to find an optimal solution and provide a more robust search than a single technique alone.

While implementing these seven autotuners in OpenTuner, the biggest lesson we learned reinforced a core message of this paper: the need for domain-specific representations and domain-specific search techniques in autotuning. As an example, the initial version of the PetaBricks autotuner we implemented just used a point in high-dimensional space as the configuration representation. This generic mapping of the search space did not work at all.
It produced final configurations an order of magnitude slower than the results presented from our autotuner that uses selector parameter types. Similarly, Halide's search space strongly motivates domain-specific techniques that make large coordinated jumps, for example, swapping scheduling operators on x and y across all functions in the program. We were able to add domain-specific representations and techniques to OpenTuner at a fraction of the time and effort of building a fully custom system for that project. OpenTuner was able to seamlessly integrate the techniques with its ensemble approach.

OpenTuner is free and open source⁴ and, as the community adds more techniques and representations to this flexible framework, there will be less of a need to create new representations or techniques for each project; we hope that the system will work out-of-the-box for most creators of autotuners. OpenTuner pushes the state of the art forward in program autotuning in a way that can easily be adopted by other projects. We hope that OpenTuner will be an enabling technology that will allow the expanded use of program autotuning, both in more domains and by expanding the role of program autotuning in existing domains.

6. ACKNOWLEDGEMENTS

We would like to thank Clarice Aiello for contributing the Unitary benchmark program. We gratefully acknowledge Connelly Barnes for helpful discussions and bug fixes related to autotuning the Halide project.

This work is partially supported by DOE award DE-SC0005288 and DOD DARPA award HR0011-10-9-0009. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

7. REFERENCES

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams, "Using machine learning to focus iterative optimization," in CGO '06, 2006, pp. 295-305.
[2] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S.
W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, "Finding effective compilation sequences," in LCTES '04, 2004, pp. 231-239.
[3] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe, "PetaBricks: A language and compiler for algorithmic choice," in PLDI, Dublin, Ireland, Jun 2009.
[4] J. Ansel, M. Pacula, S. Amarasinghe, and U.-M. O'Reilly, "An efficient evolutionary algorithm for solving bottom up problems," in Annual Conference on Genetic and Evolutionary Computation, Dublin, Ireland, July 2011.
[5] W. Baek and T. Chilimbi, "Green: A framework for supporting energy-conscious programming using controlled approximation," in PLDI, June 2010.
[6] V. Bhat, M. Parashar, H. Liu, M. Khandekar, N. Kandasamy, and S. Abdelwahed, "Enabling self-managing applications using model-based online control strategies," in International Conference on Autonomic Computing, Washington, DC, 2006.

⁴ Available from http://opentuner.org/

[7] C. Chan, J. Ansel, Y. L. Wong, S. Amarasinghe, and A. Edelman, "Autotuning multigrid with PetaBricks," in Supercomputing, Portland, OR, Nov 2009.
[8] F. Chang and V. Karamcheti, "A framework for automatic adaptation of tunable distributed applications," Cluster Computing, vol. 4, March 2001.
[9] M. Christen, O. Schenk, and H. Burkhart, "Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures," in IPDPS. IEEE, 2011.
[10] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK Benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803-820, 2003.
[11] X. Fan, "Optimize your code: Matrix multiplication," https://tinyurl.com/kuvzbp9, 2009.
[12] A. Fialho, L. Da Costa, M. Schoenauer, and M. Sebag, "Analyzing bandit-based adaptive operator selection mechanisms," Annals of Mathematics and Artificial Intelligence, Special Issue on Learning and Intelligent Optimization, 2010.
[13] A. Fialho, R. Ros, M. Schoenauer, and M.
Sebag, "Comparison-based adaptive strategy selection with bandits in differential evolution," in PPSN '10, ser. LNCS, R. S. et al., Ed., vol. 6238. Springer, September 2010.
[14] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, February 2005.
[15] G. Fursin, C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, B. Mendelson, E. Bonilla, J. Thomson, H. Leather, C. Williams, M. O'Boyle, P. Barnard, E. Ashton, E. Courtois, and F. Bodin, "MILEPOST GCC: machine learning based research compiler," in Proceedings of the GCC Developers' Summit, Jul 2008.
[16] H. Hoffmann, J. Eastep, M. D. Santambrogio, J. E. Miller, and A. Agarwal, "Application heartbeats: a generic interface for specifying program performance and goals in autonomous computing environments," in ICAC, New York, NY, 2010.
[17] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, "Using code perforation to improve performance, reduce energy consumption, and respond to failures," Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2209-042, Sep 2009.
[18] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, "Power-aware computing with dynamic knobs," in ASPLOS, 2011.
[19] H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, and H. Moritsch, "A multi-objective auto-tuning framework for parallel codes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12, 2012.
[20] S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams, "An auto-tuning framework for parallel multicore stencil computations," in IPDPS '10, 2010, pp. 1-12.
[21] S. A. Kamil, "Productive high performance parallel programming with auto-tuned domain-specific embedded languages," Ph.D. dissertation, EECS Department, University of California, Berkeley, Jan 2013.
[22] G. Karsai, A. Ledeczi, J. Sztipanovits, G. Peceli, G. Simon, and T.
Kovacshazy, "An approach to self-adaptive software based on supervisory control," in International Workshop on Self-adaptive Software, 2001.
[23] T. Murphy VII, "The first level of Super Mario Bros. is easy with lexicographic orderings and time travel," April 2013.
[24] K. Nordkvist, "Solving TSP with a genetic algorithm in C++," https://tinyurl.com/lq3uqlh, 2012.
[25] M. Pacula, J. Ansel, S. Amarasinghe, and U.-M. O'Reilly, "Hyperparameter tuning in bandit-based adaptive operator selection," in European Conference on the Applications of Evolutionary Computation, Malaga, Spain, Apr 2012.
[26] E. Park, L.-N. Pouchet, J. Cavazos, A. Cohen, and P. Sadayappan, "Predictive modeling in a polyhedral optimization space," in CGO '11, April 2011, pp. 119-129.
[27] S. Pixel, "3D Basic Lessons: Writing a simple raytracer," https://tinyurl.com/lp8ncnw, 2012.
[28] M. Puschel, J. M. F. Moura, B. Singer, J. Xiong, J. R. Johnson, D. A. Padua, M. M. Veloso, and R. W. Johnson, "Spiral: A generator for platform-adapted libraries of signal processing algorithms," IJHPCA, vol. 18, no. 1, 2004.
[29] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand, "Decoupling algorithms from schedules for easy optimization of image processing pipelines," ACM Trans. Graph., vol. 31, no. 4, pp. 32:1-32:12, Jul. 2012.
[30] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '13. New York, NY, USA: ACM, 2013, pp. 519-530.
[31] C. Tapus, I.-H. Chung, and J. K. Hollingsworth, "Active Harmony: Towards automated performance tuning," in Proceedings of the Conference on High Performance Networking and Computing, 2003.
[32] Top500, "Top 500 supercomputer sites," http://www.top500.org/, 2010.
[33] R. Vuduc, J. W. Demmel, and K. A.
Yelick, "OSKI: A library of automatically tuned sparse matrix kernels," in Scientific Discovery through Advanced Computing Conference, San Francisco, CA, June 2005.
[34] R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in Supercomputing, Washington, DC, 1998.
[35] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in International Symposium on Computer Architecture, June 1995.
