Language and Compiler Support for Auto-Tuning Variable-Accuracy Algorithms
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe
MIT - CSAIL
April 4, 2011
Jason Ansel (MIT) PetaBricks April 4, 2011 1 / 30
Outline
1 Motivating Example
2 PetaBricks Language Overview
3 Variable Accuracy
4 Autotuner
5 Results
6 Conclusions
A motivating example
How would you write a fast sorting algorithm?
Insertion sort? Quick sort? Merge sort? Radix sort?
Binary tree sort, Bitonic sort, Bubble sort, Bucket sort, Burstsort, Cocktail sort, Comb sort, Counting sort, Distribution sort, Flashsort, Heapsort, Introsort, Library sort, Odd-even sort, Postman sort, Samplesort, Selection sort, Shell sort, Stooge sort, Strand sort, Timsort?
Poly-algorithms
std::stable_sort
/usr/include/c++/4.5.2/bits/stl_algo.h, lines 3350-3367
std::sort
/usr/include/c++/4.5.2/bits/stl_algo.h, lines 2163-2167
Why 16? Why 15?
Dates back to at least 2000 (Jun 2000 SGI release)
Still in current C++ STL shipped with GCC
10+ years of _S_threshold = 16
Is 15 the right number?
The best cutoff (CO) changes
Depends on competing costs:
Cost of computation (< operator, call overhead, etc.)
Cost of communication (swaps)
Cache behavior (misses, prefetcher, locality)
Sorting 100000 doubles with std::stable_sort:
CO ≈ 200 optimal on a Phenom 905e (15% speedup over CO = 15)
CO ≈ 400 optimal on an Opteron 6168 (15% speedup over CO = 15)
CO ≈ 500 optimal on a Xeon E5320 (34% speedup over CO = 15)
CO ≈ 700 optimal on a Xeon X5460 (25% speedup over CO = 15)
The compiler's hands are tied; it is stuck with 15
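The cutoff the slide questions can instead be made a tunable parameter. A minimal C++ sketch (not the STL's actual code): a merge sort that falls back to insertion sort below a configurable `kCutoff`, the knob an autotuner would search over per machine.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tunable cutoff; std::sort hard-codes the equivalent at 16.
static int kCutoff = 16;

// Insertion sort for small ranges [lo, hi).
static void insertionSort(std::vector<double>& a, int lo, int hi) {
    for (int i = lo + 1; i < hi; ++i) {
        double v = a[i];
        int j = i;
        while (j > lo && a[j - 1] > v) { a[j] = a[j - 1]; --j; }
        a[j] = v;
    }
}

// Merge sort that switches to insertion sort below the tunable cutoff.
static void hybridSort(std::vector<double>& a, int lo, int hi) {
    if (hi - lo <= kCutoff) { insertionSort(a, lo, hi); return; }
    int mid = lo + (hi - lo) / 2;
    hybridSort(a, lo, mid);
    hybridSort(a, mid, hi);
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

An autotuner would time `hybridSort` over candidate `kCutoff` values (e.g. 15 through 700, per the measurements above) and keep the fastest for the target machine.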
Back to our motivating example
How would you write a fast sorting algorithm?
Insertion sort? Quick sort? Merge sort? Radix sort?
Binary tree sort, Bitonic sort, Bubble sort, Bucket sort, Burstsort, Cocktail sort, Comb sort, Counting sort, Distribution sort, Flashsort, Heapsort, Introsort, Library sort, Odd-even sort, Postman sort, Samplesort, Selection sort, Shell sort, Stooge sort, Strand sort, Timsort?
Poly-algorithms
Answer
It depends!
Autotuned parallel sorting algorithms
On a Xeon E7340 (2× 4 cores):
1 Insertion sort below 600
2 Quick sort below 1420
3 2-way parallel merge sort
On a Sun Fire T200 Niagara (8 cores):
1 16-way merge sort below 75
2 8-way merge sort below 1461
3 4-way merge sort below 2400
4 2-way parallel merge sort
235% slowdown running Niagara algorithm on the Xeon
8% slowdown running Xeon algorithm on the Niagara
Need a way to express these algorithmic choices to enable autotuning
Algorithmic choices
Language
either {
  InsertionSort(out, in);
} or {
  QuickSort(out, in);
} or {
  MergeSort(out, in);
} or {
  RadixSort(out, in);
}
⇒
Representation
Decision tree synthesized by our evolutionary algorithm (EA)
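For a sense of what the synthesized decision tree might flatten to, here is a plain C++ sketch using the Xeon E7340 thresholds from the sort results; `chooseSort` and the string labels are illustrative, not actual PetaBricks output.

```cpp
#include <cstddef>
#include <string>

// Hypothetical flattening of a tuned decision tree for sorting.
// The thresholds (600, 1420) are the autotuned Xeon E7340 cutoffs
// from the earlier slide; on another machine the EA would pick others.
std::string chooseSort(std::size_t n) {
    if (n < 600)  return "InsertionSort";
    if (n < 1420) return "QuickSort";
    return "ParallelMergeSort";
}
```

The point is that the tree itself, not just its thresholds, is a tuning target: the EA may add, remove, or reorder branches.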
The PetaBricks language
Choices expressed in the language
High level algorithmic choices
Dependency-based synthesized outer control flow
Parallelization strategy
Programs automatically adapt to their environment
Tuned using our bottom-up evolutionary algorithm
Offline autotuner or always-on online autotuner
Variable accuracy algorithms
Many problems don’t have a single correct answer
Soft computing
Approximation algorithms for NP-hard problems
DSP algorithms
Different grid resolutions
Data precisions
Iterative algorithms
Choosing convergence criteria
Variable accuracy example
Example
...
for (int i = 0; i < 100; ++i) {
  SORIteration(tmp);
}
...
Competing objectives of performance and accuracy
Must maximize performance while meeting accuracy targets
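To make "accuracy" concrete: a metric in the spirit of `MyRMSError` (used in the next slides) could be the root-mean-square distance between a computed and a reference solution. This C++ helper is a hypothetical stand-in, not the PetaBricks metric interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative accuracy metric: RMS error between a computed
// solution and a reference solution of the same length.
double rmsError(const std::vector<double>& got,
                const std::vector<double>& want) {
    double sum = 0.0;
    for (std::size_t i = 0; i < got.size(); ++i) {
        double d = got[i] - want[i];
        sum += d * d;
    }
    return std::sqrt(sum / got.size());
}
```

The autotuner then maximizes speed subject to this value staying within the requested accuracy target.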
Accuracy metrics and for_enough loops
Language
accuracy_metric MyRMSError
...
for_enough {
  SORIteration(tmp);
}
⇒
Representation
Function from problem size to number of iterations, synthesized by our EA
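A hedged sketch of that representation: a piecewise-constant function from problem size to iteration count, which the generated code would consult where the source says `for_enough`. The breakpoints below are invented for illustration, not tuned values.

```cpp
#include <cstddef>

// Hypothetical synthesized mapping: problem size -> SOR iterations.
// The EA chooses the breakpoints and counts; these are placeholders.
int iterationsFor(std::size_t n) {
    if (n <= 64)   return 10;
    if (n <= 1024) return 50;
    return 200;
}

// Generated code would expand the for_enough body roughly as:
//   for (int i = 0; i < iterationsFor(n); ++i) SORIteration(tmp);
```

This is why the same source program can meet different accuracy targets: only the synthesized function changes.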
Accuracy variables
Language
accuracy_metric MyRMSError
accuracy_variable k
...
for (int i = 0; i < k; ++i) {
  SORIteration(tmp);
}
⇒
Representation
Function from problem size to k, synthesized by our EA
Variable accuracy and algorithmic choices
Language
accuracy_metric MyRMSError
...
either {
  for_enough {
    SORIteration(tmp);
  }
} or {
  Multigrid(tmp);
} or {
  DirectSolve(tmp);
}
Traditional evolution algorithm
Initial population 72.7s 10.5s 4.1s 31.2s Cost = 118.5
Generation 2 4.2s 5.1s 2.6s 13.2s Cost = 25.1
Generation 3 2.8s 0.1s 3.8s 2.3s Cost = 9.0
Generation 4 0.3s 0.1s 0.4s 2.4s Cost = 3.2
The cost of autotuning is front-loaded in the initial (unfit) population
We could speed up tuning if we start with a faster initial population
Key insight
Smaller input sizes can be used to form better initial population
Bottom-up evolutionary algorithm
Train on input size 1, to form initial population for:
Train on input size 2, to form initial population for:
Train on input size 8, to form initial population for:
Train on input size 16, to form initial population for:
Train on input size 32, to form initial population for:
Train on input size 64
Naturally exploits optimal substructure of problems
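The schedule above can be sketched as a loop over doubling input sizes, seeding each round's initial population with the previous round's survivors; `Population` and `tuneOneSize` are placeholders for the real EA machinery, not PetaBricks APIs.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a set of candidate configurations.
using Population = std::vector<int>;

// Placeholder for one EA round at input size n, starting from `seed`.
// A real implementation would mutate, time, and select candidates here.
Population tuneOneSize(std::size_t n, Population seed) {
    seed.push_back(static_cast<int>(n));  // placeholder for evolved configs
    return seed;
}

// Bottom-up tuning: each size's winners seed the next, larger size,
// so expensive large-input runs start from an already-fit population.
Population bottomUpTune(std::size_t maxSize) {
    Population pop;  // empty seed => random initial population
    for (std::size_t n = 1; n <= maxSize; n *= 2)
        pop = tuneOneSize(n, pop);
    return pop;
}
```

This is also where the optimal-substructure point bites: good small-input algorithms are typically the base cases of good large-input ones.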
Variable accuracy autotuner
Generation i
⇒ Generation i + 1
Partition accuracy space into discrete levels
Prune population to have a fixed number of representatives from each level
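One plausible rendering of that pruning step in C++: bucket candidates by the highest accuracy level they reach, then keep only the k fastest per bucket. The `Candidate` struct and the bucketing rule are assumptions for illustration, not the paper's exact algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// A candidate configuration with its measured time and accuracy.
struct Candidate { double time; double accuracy; };

// Keep the k fastest candidates at each discrete accuracy level.
// `levels` holds ascending accuracy thresholds defining the levels.
std::vector<Candidate> prune(std::vector<Candidate> pop,
                             const std::vector<double>& levels,
                             std::size_t k) {
    std::map<std::size_t, std::vector<Candidate>> buckets;
    for (const auto& c : pop) {
        // Assign to the highest level whose threshold the candidate meets.
        std::size_t lvl = 0;
        for (std::size_t i = 0; i < levels.size(); ++i)
            if (c.accuracy >= levels[i]) lvl = i;
        buckets[lvl].push_back(c);
    }
    std::vector<Candidate> kept;
    for (auto& kv : buckets) {
        auto& cs = kv.second;
        std::sort(cs.begin(), cs.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.time < b.time;
                  });
        if (cs.size() > k) cs.resize(k);
        kept.insert(kept.end(), cs.begin(), cs.end());
    }
    return kept;
}
```

Keeping representatives at every level, not just the current target, lets one tuning run serve many accuracy requirements.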
Changing accuracy requirements
[Plots: speedup (×) vs. input size, one curve per target accuracy level, for six benchmarks: Image Compression (levels 0.3-2.0), Clustering (0.05-0.95), Bin Packing (1.01-1.4), 3D Helmholtz (10^1-10^9), 2D Poisson (10^1-10^9), Preconditioner (0.0-3.0)]
Multigrid choice space
Pseudo code
accuracy_metric MyRMSError
...
either {
  for_enough {
    SORIteration(tmp);
  }
} or {
  Multigrid(tmp);
} or {
  DirectSolve(tmp);
}
[Diagram: grid size (16-128) vs. time, showing SOR iterations, recursive coarsening between grid levels, and a direct solve at the coarsest level]
Autotuned V-cycle shapes
[Diagram: tuned V-cycle shapes over grid sizes 16-2048 for accuracy levels 10^1, 10^3, 10^5, and 10^7]
Autotuned bin packing algorithms
Conclusions
Motivating goal of PetaBricks
Make programs future-proof by allowing them to adapt to their environment.
We can do better than hard coded constants!
Thanks!
Questions?
http://projects.csail.mit.edu/petabricks/