Neural Network Assisted Tile Size Selection
Mohammed Rahman, Louis-Noël Pouchet and P. Sadayappan
Dept. of Computer Science and Engineering, Ohio State University
June 22, 2010
iWAPT 2010 Workshop, Berkeley, USA
Introduction: iWAPT’10
Overview
Situation:
- New advances in parametric tiling → more user code to be tuned
- The problem of tile size selection is complex and unsolved!

Our approach:
- Use machine learning to create a predictor of tile size performance, for a specific program
- Rely on the shape of the performance distribution to extract promising subspaces for empirical search
- Outcome: < 2% of the space traversed → 90+% of the maximal speedup achieved
Problem Statement: iWAPT’10
Tiling
- Tiling partitions the computation into blocks
- Note: we consider only rectangular tiling here
- For tiling to be applied, such a partitioning must be legal
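As an illustration (not taken from the slides), a rectangular tiling of a simple 2-D loop nest can be sketched as follows; `Ti` and `Tj` are the tile sizes along each dimension:

```python
# Sketch: rectangular tiling of a 2-D loop nest. The untiled version
# visits all (i, j); the tiled version visits the same points block by
# block, which improves data reuse in the cache.

def untiled(N, visit):
    for i in range(N):
        for j in range(N):
            visit(i, j)

def tiled(N, Ti, Tj, visit):
    # Outer loops enumerate tile origins; inner loops sweep one Ti x Tj block.
    for ii in range(0, N, Ti):
        for jj in range(0, N, Tj):
            for i in range(ii, min(ii + Ti, N)):
                for j in range(jj, min(jj + Tj, N)):
                    visit(i, j)
```

Both traversals visit exactly the same iteration points; only the order differs. Legality means proving that this reordering preserves all data dependences.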
Parametric Tiling
Automatic parametric tiling [ICS’09, CGO’10]:
- Produces code where the tile dimensions are parameters
- Seamlessly finds/applies all transformations required to make the code tileable
- Actual tile sizes are given at run-time
  - Very useful for tile size selection (no need to recompile)
- Recent progress has generalized the approach:
  - Operates on arbitrary affine-control loops (imperfectly nested)
  - Produces good-quality code
  - Even exposes pipeline parallelism if needed
  - Software (from OSU): Pluto, PrimeTile/DynTile/PTile
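To make the run-time aspect concrete, here is a hypothetical sketch (not code generated by PrimeTile/DynTile/PTile): because the tile sizes are ordinary run-time inputs, an empirical search can time many configurations within one process, with no recompilation:

```python
# Hypothetical sketch: with parametric tiling, tile sizes are run-time
# inputs, so one binary can be re-run and timed with many (Ti, Tj)
# configurations during empirical search.
import time

def tiled_kernel(A, Ti, Tj):
    # Stand-in for a parametrically tiled kernel: block-wise 2-D sum.
    n, m = len(A), len(A[0])
    total = 0.0
    for ii in range(0, n, Ti):
        for jj in range(0, m, Tj):
            for i in range(ii, min(ii + Ti, n)):
                for j in range(jj, min(jj + Tj, m)):
                    total += A[i][j]
    return total

def time_tile_sizes(A, configs):
    # One process, many tile sizes: the benefit of run-time parameters.
    results = {}
    for Ti, Tj in configs:
        t0 = time.perf_counter()
        tiled_kernel(A, Ti, Tj)
        results[(Ti, Tj)] = time.perf_counter() - t0
    return results

A = [[1.0] * 64 for _ in range(64)]
timings = time_tile_sizes(A, [(4, 4), (8, 32), (32, 8), (64, 64)])
```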
Tile Size Selection
Problem: how do we select the tile sizes that give the best performance?

The answer depends on many interacting factors:
- data reuse within the execution of a tile;
- data reuse between tiles;
- the layout in memory of the data used in a tile;
- the relative penalty of misses at each level of the memory hierarchy, which is machine-dependent;
- the cache replacement policy;
- the interaction with other units, such as prefetching;
- the interaction with vectorization, to enable a profitable steady state for the vectorized loop(s);
- ...
Performance Distribution

[Figure: performance distribution of fdtd-2d and syr2k over the tile size space, in two panels.
Left panel, fdtd-2d: performance distribution with tile size configurations
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds).
Right panel, dsyr2k: performance distribution with tile size configurations (same axes).]
- Search space: 10648 possible tile sizes
  - Each of Ti, Tj, Tk is drawn from {1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64, 100, 128, 150, 200, 256, 300, 400, 500, 600}
- Machine: Core i7 (1 thread)
- Two "standard" distribution shapes
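The space size follows from taking every combination of the 22 candidate values across the three tile dimensions; a quick check:

```python
# The 22 candidate tile-size values from the slide, used for each of
# the three tile dimensions Ti, Tj, Tk.
candidates = [1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64,
              100, 128, 150, 200, 256, 300, 400, 500, 600]

# The full (Ti, Tj, Tk) space is the Cartesian cube of this set.
space_size = len(candidates) ** 3
print(len(candidates), space_size)  # 22 candidates -> 22^3 = 10648 configurations
```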
Performance Prediction: iWAPT’10
Objectives
Correlate execution time with tile sizes
- (Static) performance models do exist...
- ... but fail to capture the interplay between all hardware components
- Usually better suited for well-known problems (e.g., uniform reuse + square tiles)
- Another view: pruning the space of poorly performing tile sizes

Our approach:
- Build a neural network to model the performance distribution
- Focus directly on the execution time
- ANN dedicated to a specific program + dataset size
Neural Network
Layout:
- Fully connected multi-layer perceptron (MLP)
- Input layer: the tile sizes (Ti, Tj, Tk)
- Output layer: predicted execution time
- One hidden layer of 30 neurons
- Uses the Stuttgart Neural Network Simulator (SNNS) library

Training:
- Select 5% (530 tuples) from the search space of 10648
- Run the program on the machine with the tile sizes specified by the tuples
- Train with resilient back-propagation (Rprop), using the actual execution time for each tuple
- Standard 10% cross-validation procedure
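A minimal stand-in for this predictor is sketched below: a fully connected MLP with one 30-neuron hidden layer mapping (Ti, Tj, Tk) to a predicted time. The slides use SNNS with Rprop; this sketch uses plain NumPy, synthetic data, and vanilla gradient descent, purely for illustration:

```python
# Toy MLP regressor: 3 inputs -> 30 tanh hidden neurons -> 1 linear output.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for measured (normalized tile sizes, execution time) pairs.
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X ** 2).sum(axis=1, keepdims=True)   # fake "execution time"

W1 = rng.normal(0, 0.5, (3, 30)); b1 = np.zeros(30)
W2 = rng.normal(0, 0.5, (30, 1)); b2 = np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, pred0 = forward(X)
loss0 = float(((pred0 - y) ** 2).mean())

lr = 0.05
for _ in range(2000):
    H, pred = forward(X)
    err = pred - y                          # gradient of MSE w.r.t. output
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)        # back-prop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred1 = forward(X)
loss1 = float(((pred1 - y) ** 2).mean())
print(loss0, loss1)  # training should reduce the mean squared error
```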
Performance Prediction [1/2]
[Figure: predicted versus actual execution time, in two panels.
Left panel, fdtd-2d: predicted versus actual performance
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds; series: ExTime (Actual), ExTime (Predicted)).
Right panel, dsyr2k: predicted versus actual performance (same axes and series).]
Performance Prediction [2/2]
[Figure: predicted versus actual execution time, in two panels.
Left panel, lu: predicted versus actual performance
(x-axis: tile sizes Ti:Tj:Tk; y-axis: execution time in seconds; series: ExTime (Actual), ExTime (Predicted)).
Right panel, dgemm: predicted versus actual performance (same axes and series).]
Discussions
- For trmm, lu, 2d-jacobi, syr2k and doitgen, more than 90% of the search space is predicted with less than 10% deviation from the actual execution time
- Overall, 80% or more of the space is predicted with less than 10% deviation
- The deviation is usually smaller for the best tile sizes

→ These ANNs are able to model the performance distribution

Openings:
- Program classifier w.r.t. the performance distribution
- Training: avoid over-fitting the training points?
Tile Size Selection: iWAPT’10
Selecting the Best Tile Size
The performance distribution can drive the empirical search to focus on promising subspaces.

Tile size selection:
- A random approach has huge variability on some distribution shapes
- Exhaustive search is likely not needed
- Need for an intermediate solution:
  - Low number of empirical runs
  - Good convergence, good variability
  - General enough to work on arbitrary user codes
Overview of the Algorithm
1. Generate a parametrically tiled code
2. Randomly select x% of the tile size space, and run those sizes on the machine
3. Train an ANN on this data
4. Use the ANN to predict the performance of the entire space
5. Collect the y tile sizes that are predicted best and have not already been run
6. Run the y tile sizes on the machine; output the best one found
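The steps above can be sketched as follows; `measure` and `train_predictor` are hypothetical stand-ins for the actual program runs and the trained ANN (here a toy 1-nearest-neighbour model):

```python
# Hypothetical sketch of the six-step ANN-guided search.
import itertools
import random

def ann_guided_search(space, measure, train_predictor, x_frac=0.02, y=50, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly select x% of the tile size space and measure it.
    sample = rng.sample(space, max(1, int(x_frac * len(space))))
    measured = {t: measure(t) for t in sample}
    # Step 3: train a predictor on the measured (tile size, time) pairs.
    predict = train_predictor(measured)
    # Steps 4-5: rank the rest of the space by predicted time, keep the y best.
    candidates = sorted((t for t in space if t not in measured), key=predict)[:y]
    # Step 6: measure those candidates and output the best tile size found.
    for t in candidates:
        measured[t] = measure(t)
    return min(measured, key=measured.get)

# Toy usage: a synthetic "execution time", minimal at tile size (16, 8, 32).
vals = [1, 2, 4, 8, 16, 32, 64, 128]
space = list(itertools.product(vals, repeat=3))
measure = lambda t: sum((a - b) ** 2 for a, b in zip(t, (16, 8, 32)))

def train_predictor(measured):
    # Toy model: predict the time of the nearest measured neighbour.
    pts = list(measured.items())
    def predict(t):
        return min(pts, key=lambda pv: sum((a - b) ** 2 for a, b in zip(t, pv[0])))[1]
    return predict

best = ann_guided_search(space, measure, train_predictor)
```

By construction the returned tile size is at least as good as the best point in the initial random sample, since the final minimum is taken over a superset of the sampled measurements.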
Experimental Setup
- Studied various kernels (perfectly/imperfectly nested, BLAS & stencils)
- Focused only on single-threaded execution, on an Intel Core i7
- Comparison: simple random search (R) versus ANN search (ANN)
- Each experiment repeated 100 times, for various sampling rates
Experimental Results (y = 50)

Rate  Method       doitgen  gemm    syr2k   lu      2d-jacobi  fdtd-2d
1%    R-best       100%     99.86%  98.15%  99.89%  99.91%     97.75%
      R-average    98.71%   96.29%  94.80%  92.19%  94.10%     84.15%
      R-worst      95.35%   69.64%  89.81%  40.63%  17.69%     31.02%
      ANN-best     100%     99.86%  100%    100%    99.91%     100%
      ANN-average  98.89%   96.35%  96.01%  92.62%  98.51%     84.50%
      ANN-worst    97.26%   82.93%  89.79%  79.68%  94.23%     66.53%
2%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.71%   96.42%  94.80%  92.87%  97.60%     84.10%
      R-worst      86.49%   67.89%  88.20%  45.29%  55.98%     27.30%
      ANN-best     100%     99.86%  100%    100%    100%       100%
      ANN-average  98.89%   96.76%  96.69%  95.34%  98.55%     88.61%
      ANN-worst    97.26%   89.83%  89.65%  85.80%  94.17%     60.65%
3%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.77%   96.47%  94.80%  94.27%  98.39%     85.47%
      R-worst      94.89%   63.58%  87.99%  61.24%  84.54%     47.99%
      ANN-best     99.97%   99.86%  100%    100%    100%       100%
      ANN-average  98.93%   97.14%  97.17%  95.34%  98.74%     91.45%
      ANN-worst    97.64%   91.01%  92.27%  85.80%  94.50%     63.34%
4%    R-best       99.97%   99.86%  98.71%  99.89%  100%       100%
      R-average    98.80%   96.65%  94.93%  92.19%  98.41%     85.55%
      R-worst      96.86%   69.73%  88.57%  52.03%  82.47%     43.74%
      ANN-best     100%     99.86%  100%    100%    100%       100%
      ANN-average  98.99%   97.67%  97.20%  95.79%  98.90%     93.55%
      ANN-worst    98.28%   93.65%  92.66%  85.80%  94.50%     79.26%
Some Related Work
Epshteyn et al. [LCPC’05]:
- Search-oriented contribution
- Uses regression curves to approximate the performance distribution
- Uses active learning to select good candidates for empirical evaluation
- Good results for BLAS kernels

Yuki et al. [CGO’10]:
- Aims at selecting/combining between different static models
- Uses program features to characterize accesses, to train an ANN
- Results demonstrated for matrix-like kernels
Conclusions and Future Work
ANN is a candidate approach to connect tile sizes with performance:
- Good prediction quality
- Deviation usually smaller for the good points
- Combined search heuristic proposed:
  - Strong variability improvement over the naive random approach
  - 90+% efficiency using < 2% of the space; likely can be improved further

Future work:
- Generalization!
  - Categorize benchmarks regarding the performance distribution shape
  - Dataset size
- Do not try to fit the random samples during training
- Reduce the training time
- Problem: ANN configuration
Acknowledgements: iWAPT’10
Acknowledgements
This work was funded in part by the U.S. National Science Foundation through award 0926688 and the Defense Advanced Research Projects Agency through AFRL Contract FA8650-09-C-7915. The opinions and findings in this document do not necessarily reflect the views of either the United States Government or the Ohio State University.
Outline: Introduction · Problem Statement · Performance Prediction · Tile Size Selection · Acknowledgements