Neural Network Assisted Tile Size Selection
Mohammed Rahman, Louis-Noël Pouchet and P. Sadayappan
Dept. of Computer Science and Engineering, Ohio State University
June 22, 2010
iWAPT 2010 Workshop, Berkeley, USA
Introduction: iWAPT’10
Overview
Situation:
- New advances in parametric tiling → more user code to be tuned
- The problem of tile size selection is complex and unsolved!

Our approach:
- Use machine learning to create a predictor of tile size performance, for a specific program
- Rely on the shape of the performance distribution to extract promising subspaces for empirical search
- Outcome: < 2% of the space traversed → 90+% of the maximal speedup achieved
Problem Statement: iWAPT’10
Tiling
- Tiling partitions the computation into blocks
- Note: we consider only rectangular tiling here
- For tiling to be applied, such a partitioning must be legal
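As an illustration (not taken from the slides), a rectangular tiling of a simple 2-D loop nest can be sketched as follows; `Ti` and `Tj` are the tile sizes along each dimension:

```python
# Sketch: rectangular tiling of a 2-D loop nest. The untiled version
# visits all (i, j); the tiled version visits the same points block by
# block, which improves data reuse in the cache.

def untiled(N, visit):
    for i in range(N):
        for j in range(N):
            visit(i, j)

def tiled(N, Ti, Tj, visit):
    # Outer loops enumerate tile origins; inner loops sweep one Ti x Tj block.
    for ii in range(0, N, Ti):
        for jj in range(0, N, Tj):
            for i in range(ii, min(ii + Ti, N)):
                for j in range(jj, min(jj + Tj, N)):
                    visit(i, j)
```

Both traversals visit exactly the same iteration points; only the order differs. Legality means proving that this reordering preserves all data dependences.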
Parametric Tiling
Automatic parametric tiling [ICS’09, CGO’10]:
- Produces code where the tile dimensions are parameters
- Seamlessly finds/applies all transformations required to make the code tileable
- Actual tile sizes are given at run-time
  - Very useful for tile size selection (no need to recompile)
- Recent progress has generalized the approach:
  - Operates on arbitrary affine-control loops (imperfectly nested)
  - Produces good-quality code
  - Even exposes pipeline parallelism if needed
  - Software (from OSU): Pluto, PrimeTile/DynTile/PTile
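To make the run-time aspect concrete, here is a hypothetical sketch (not code generated by PrimeTile/DynTile/PTile): because the tile sizes are ordinary run-time inputs, an empirical search can time many configurations within one process, with no recompilation:

```python
# Hypothetical sketch: with parametric tiling, tile sizes are run-time
# inputs, so one binary can be re-run and timed with many (Ti, Tj)
# configurations during empirical search.
import time

def tiled_kernel(A, Ti, Tj):
    # Stand-in for a parametrically tiled kernel: block-wise 2-D sum.
    n, m = len(A), len(A[0])
    total = 0.0
    for ii in range(0, n, Ti):
        for jj in range(0, m, Tj):
            for i in range(ii, min(ii + Ti, n)):
                for j in range(jj, min(jj + Tj, m)):
                    total += A[i][j]
    return total

def time_tile_sizes(A, configs):
    # One process, many tile sizes: the benefit of run-time parameters.
    results = {}
    for Ti, Tj in configs:
        t0 = time.perf_counter()
        tiled_kernel(A, Ti, Tj)
        results[(Ti, Tj)] = time.perf_counter() - t0
    return results

A = [[1.0] * 64 for _ in range(64)]
timings = time_tile_sizes(A, [(4, 4), (8, 32), (32, 8), (64, 64)])
```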
Tile Size Selection
Problem: how do we select the tile sizes that give the best performance?

The answer depends on many interacting factors:
- data reuse within the execution of a tile;
- data reuse between tiles;
- the layout in memory of the data used in a tile;
- the relative penalty of misses at each level of the memory hierarchy, which is machine-dependent;
- the cache replacement policy;
- the interaction with other units, such as prefetching;
- the interaction with vectorization, to enable a profitable steady state for the vectorized loop(s);
- ...
Performance Distribution

[Figure: performance distribution of fdtd-2d and syr2k over the tile size space, in two panels.
Left panel, fdtd-2d: performance distribution with tile size configurations
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds).
Right panel, dsyr2k: performance distribution with tile size configurations (same axes).]
- Search space: 10648 possible tile sizes
  - Each of Ti, Tj, Tk is drawn from {1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64, 100, 128, 150, 200, 256, 300, 400, 500, 600}
- Machine: Core i7 (1 thread)
- Two "standard" distribution shapes
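The space size follows from taking every combination of the 22 candidate values across the three tile dimensions; a quick check:

```python
# The 22 candidate tile-size values from the slide, used for each of
# the three tile dimensions Ti, Tj, Tk.
candidates = [1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64,
              100, 128, 150, 200, 256, 300, 400, 500, 600]

# The full (Ti, Tj, Tk) space is the Cartesian cube of this set.
space_size = len(candidates) ** 3
print(len(candidates), space_size)  # 22 candidates -> 22^3 = 10648 configurations
```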
Performance Prediction: iWAPT’10
Objectives
Correlate execution time with tile sizes
- (Static) performance models do exist...
- ... but fail to capture the interplay between all hardware components
- Usually better suited for well-known problems (e.g., uniform reuse + square tiles)
- Another view: pruning the space of poorly performing tile sizes

Our approach:
- Build a neural network to model the performance distribution
- Focus directly on the execution time
- ANN dedicated to a specific program + dataset size
Neural Network
Layout:
- Fully connected multi-layer perceptron (MLP)
- Input layer: the tile sizes (Ti, Tj, Tk)
- Output layer: predicted execution time
- One hidden layer of 30 neurons
- Uses the Stuttgart Neural Network Simulator (SNNS) library

Training:
- Select 5% (530 tuples) from the search space of 10648
- Run the program on the machine with the tile sizes specified by the tuples
- Train with resilient back-propagation (Rprop), using the actual execution time for each tuple
- Standard 10% cross-validation procedure
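A minimal stand-in for this predictor is sketched below: a fully connected MLP with one 30-neuron hidden layer mapping (Ti, Tj, Tk) to a predicted time. The slides use SNNS with Rprop; this sketch uses plain NumPy, synthetic data, and vanilla gradient descent, purely for illustration:

```python
# Toy MLP regressor: 3 inputs -> 30 tanh hidden neurons -> 1 linear output.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for measured (normalized tile sizes, execution time) pairs.
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X ** 2).sum(axis=1, keepdims=True)   # fake "execution time"

W1 = rng.normal(0, 0.5, (3, 30)); b1 = np.zeros(30)
W2 = rng.normal(0, 0.5, (30, 1)); b2 = np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, pred0 = forward(X)
loss0 = float(((pred0 - y) ** 2).mean())

lr = 0.05
for _ in range(2000):
    H, pred = forward(X)
    err = pred - y                          # gradient of MSE w.r.t. output
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)        # back-prop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred1 = forward(X)
loss1 = float(((pred1 - y) ** 2).mean())
print(loss0, loss1)  # training should reduce the mean squared error
```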
Performance Prediction [1/2]
[Figure: predicted versus actual execution time, in two panels.
Left panel, fdtd-2d: predicted versus actual performance
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds; series: ExTime (Actual), ExTime (Predicted)).
Right panel, dsyr2k: predicted versus actual performance (same axes and series).]
Performance Prediction [2/2]
[Figure: predicted versus actual execution time, in two panels.
Left panel, lu: predicted versus actual performance
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds; series: ExTime (Actual), ExTime (Predicted)).
Right panel, dgemm: predicted versus actual performance (same axes and series).]
Discussions
- For trmm, lu, 2d-jacobi, syr2k and doitgen, more than 90% of the search space is predicted with less than 10% deviation from the actual execution time
- Overall, 80% or more of the space is predicted with less than 10% deviation
- The deviation is usually smaller for the best tile sizes

→ These ANNs are able to model the performance distribution

Openings:
- Program classifier w.r.t. the performance distribution
- Training: avoid over-fitting the training points?
Tile Size Selection: iWAPT’10
Selecting the Best Tile Size
The performance distribution can drive the empirical search to focus on promising subspaces.

Tile size selection:
- A random approach has huge variability on some distribution shapes
- Exhaustive search is likely not needed
- Need for an intermediate solution:
  - Low number of empirical runs
  - Good convergence, good variability
  - General enough to work on arbitrary user codes
Overview of the Algorithm
1. Generate a parametrically tiled code
2. Randomly select x% of the tile size space, and run those sizes on the machine
3. Train an ANN on this data
4. Use the ANN to predict the performance of the entire space
5. Collect the y tile sizes that are predicted best and have not already been run
6. Run the y tile sizes on the machine; output the best one found
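The steps above can be sketched as follows; `measure` and `train_predictor` are hypothetical stand-ins for the actual program runs and the trained ANN (here a toy 1-nearest-neighbour model):

```python
# Hypothetical sketch of the six-step ANN-guided search.
import itertools
import random

def ann_guided_search(space, measure, train_predictor, x_frac=0.02, y=50, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly select x% of the tile size space and measure it.
    sample = rng.sample(space, max(1, int(x_frac * len(space))))
    measured = {t: measure(t) for t in sample}
    # Step 3: train a predictor on the measured (tile size, time) pairs.
    predict = train_predictor(measured)
    # Steps 4-5: rank the rest of the space by predicted time, keep the y best.
    candidates = sorted((t for t in space if t not in measured), key=predict)[:y]
    # Step 6: measure those candidates and output the best tile size found.
    for t in candidates:
        measured[t] = measure(t)
    return min(measured, key=measured.get)

# Toy usage: a synthetic "execution time", minimal at tile size (16, 8, 32).
vals = [1, 2, 4, 8, 16, 32, 64, 128]
space = list(itertools.product(vals, repeat=3))
measure = lambda t: sum((a - b) ** 2 for a, b in zip(t, (16, 8, 32)))

def train_predictor(measured):
    # Toy model: predict the time of the nearest measured neighbour.
    pts = list(measured.items())
    def predict(t):
        return min(pts, key=lambda pv: sum((a - b) ** 2 for a, b in zip(t, pv[0])))[1]
    return predict

best = ann_guided_search(space, measure, train_predictor)
```

By construction the returned tile size is at least as good as the best point in the initial random sample, since the final minimum is taken over a superset of the sampled measurements.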
Experimental Setup
- Studied various kernels (perfectly/imperfectly nested, BLAS & stencils)
- Focused only on single-threaded execution, on an Intel Core i7
- Comparison: simple random search (R) versus ANN search (ANN)
- Each experiment repeated 100 times, for various sampling rates
Experimental Results (y = 50)

Rate  Method       doitgen  gemm    syr2k   lu      2d-jacobi  fdtd-2d
1%    R-best       100%     99.86%  98.15%  99.89%  99.91%     97.75%
      R-average    98.71%   96.29%  94.80%  92.19%  94.10%     84.15%
      R-worst      95.35%   69.64%  89.81%  40.63%  17.69%     31.02%
      ANN-best     100%     99.86%  100%    100%    99.91%     100%
      ANN-average  98.89%   96.35%  96.01%  92.62%  98.51%     84.50%
      ANN-worst    97.26%   82.93%  89.79%  79.68%  94.23%     66.53%
2%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.71%   96.42%  94.80%  92.87%  97.60%     84.10%
      R-worst      86.49%   67.89%  88.20%  45.29%  55.98%     27.30%
      ANN-best     100%     99.86%  100%    100%    100%       100%
      ANN-average  98.89%   96.76%  96.69%  95.34%  98.55%     88.61%
      ANN-worst    97.26%   89.83%  89.65%  85.80%  94.17%     60.65%
3%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.77%   96.47%  94.80%  94.27%  98.39%     85.47%
      R-worst      94.89%   63.58%  87.99%  61.24%  84.54%     47.99%
      ANN-best     99.97%   99.86%  100%    100%    100%       100%
      ANN-average  98.93%   97.14%  97.17%  95.34%  98.74%     91.45%
      ANN-worst    97.64%   91.01%  92.27%  85.80%  94.50%     63.34%
4%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.80%   96.65%  94.93%  92.19%  98.41%     85.55%
      R-worst      96.86%   69.73%  88.57%  52.03%  82.47%     43.74%
      ANN-best     100%     99.86%  100%    100%    100%       100%
      ANN-average  98.99%   97.67%  97.20%  95.79%  98.90%     93.55%
      ANN-worst    98.28%   93.65%  92.66%  85.80%  94.50%     79.26%
Some Related Work
Epshteyn et al. [LCPC’05]:
- Search-oriented contribution
- Uses regression curves to approximate the performance distribution
- Uses active learning to select good candidates for empirical evaluation
- Good results for BLAS kernels

Yuki et al. [CGO’10]:
- Aims at selecting/combining between different static models
- Uses program features to characterize accesses, to train an ANN
- Results demonstrated for matrix-like kernels
Conclusions and Future Work
ANN is a candidate approach to connect tile sizes with performance:
- Good prediction quality
- Deviation usually smaller for the good points
- Combined search heuristic proposed:
  - Strong variability improvement over the naive random approach
  - 90+% efficiency using < 2% of the space; likely can be improved further

Future work:
- Generalization!
  - Categorize benchmarks regarding the performance distribution shape
  - Dataset size
- Do not try to fit the random samples during training
- Reduce the training time
- Problem: ANN configuration
Acknowledgements: iWAPT’10
Acknowledgements
This work was funded in part by the U.S. National Science Foundation through award 0926688 and the Defense Advanced Research Projects Agency through AFRL Contract FA8650-09-C-7915. The opinions and findings in this document do not necessarily reflect the views of either the United States Government or the Ohio State University.
Outline: Introduction · Problem Statement · Performance Prediction · Tile Size Selection · Acknowledgements