
Language and Compiler Support for Auto-Tuning Variable-Accuracy Algorithms

Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe

MIT - CSAIL

April 4, 2011

Jason Ansel (MIT) PetaBricks April 4, 2011 1 / 30

Outline

1 Motivating Example

2 PetaBricks Language Overview

3 Variable Accuracy

4 Autotuner

5 Results

6 Conclusions


A motivating example

How would you write a fast sorting algorithm?

Insertion sort? Quick sort? Merge sort? Radix sort?

Binary tree sort, Bitonic sort, Bubble sort, Bucket sort, Burstsort, Cocktail sort, Comb sort, Counting sort, Distribution sort, Flashsort, Heapsort, Introsort, Library sort, Odd-even sort, Postman sort, Samplesort, Selection sort, Shell sort, Stooge sort, Strand sort, Timsort?

Poly-algorithms


std::stable_sort

/usr/include/c++/4.5.2/bits/stl_algo.h, lines 3350-3367


std::sort

/usr/include/c++/4.5.2/bits/stl_algo.h, lines 2163-2167

Why 16? Why 15?

Dates back to at least 2000 (June 2000 SGI release)

Still in the current C++ STL shipped with GCC

10+ years of _S_threshold = 16


Is 15 the right number?

The best cutoff (CO) changes

Depends on competing costs:

Cost of computation (< operator, call overhead, etc.)

Cost of communication (swaps)

Cache behavior (misses, prefetcher, locality)

Sorting 100,000 doubles with std::stable_sort:

CO ≈ 200 optimal on a Phenom 905e (15% speedup over CO = 15)

CO ≈ 400 optimal on an Opteron 6168 (15% speedup over CO = 15)

CO ≈ 500 optimal on a Xeon E5320 (34% speedup over CO = 15)

CO ≈ 700 optimal on a Xeon X5460 (25% speedup over CO = 15)

The compiler's hands are tied; it is stuck with 15

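The machine-dependent cutoffs above can be found empirically rather than hard-coded. A minimal sketch in Python (illustrative only; `hybrid_sort`, the candidate cutoffs, and the timing harness are not from the talk):

```python
import random
import time

def hybrid_sort(a, cutoff):
    """Merge sort that falls back to insertion sort below `cutoff`,
    the tunable threshold discussed above (the STL hard-codes 15)."""
    if len(a) <= cutoff:
        # Insertion sort: cheap on small inputs despite O(n^2) worst case.
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a
    mid = len(a) // 2
    left = hybrid_sort(a[:mid], cutoff)
    right = hybrid_sort(a[mid:], cutoff)
    # Merge the two sorted halves.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def best_cutoff(n, candidates):
    """Time each candidate cutoff on the same random input; the
    winner is machine-dependent, which is why a fixed 15 loses."""
    data = [random.random() for _ in range(n)]
    timings = {}
    for c in candidates:
        start = time.perf_counter()
        hybrid_sort(list(data), c)
        timings[c] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

Running `best_cutoff` on different machines would report different winners, mirroring the Phenom/Opteron/Xeon numbers above.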

Back to our motivating example

How would you write a fast sorting algorithm?

Insertion sort? Quick sort? Merge sort? Radix sort?

Binary tree sort, Bitonic sort, Bubble sort, Bucket sort, Burstsort, Cocktail sort, Comb sort, Counting sort, Distribution sort, Flashsort, Heapsort, Introsort, Library sort, Odd-even sort, Postman sort, Samplesort, Selection sort, Shell sort, Stooge sort, Strand sort, Timsort?

Poly-algorithms

Answer

It depends!


Autotuned parallel sorting algorithms

On a Xeon E7340 (2× 4 cores):
1. Insertion sort below 600
2. Quick sort below 1420
3. 2-way parallel merge sort

On a Sun Fire T200 Niagara (8 cores):
1. 16-way merge sort below 75
2. 8-way merge sort below 1461
3. 4-way merge sort below 2400
4. 2-way parallel merge sort

235% slowdown running Niagara algorithm on the Xeon

8% slowdown running Xeon algorithm on the Niagara

Need a way to express these algorithmic choices to enable autotuning



Algorithmic choices

Language

    either {
      InsertionSort(out, in);
    } or {
      QuickSort(out, in);
    } or {
      MergeSort(out, in);
    } or {
      RadixSort(out, in);
    }

Representation

Decision tree synthesized by our evolutionary algorithm (EA)

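The synthesized decision tree can be sketched concretely: a tuned configuration is a list of (threshold, algorithm) pairs over input size. This Python sketch is illustrative, not PetaBricks output; the kernels are stand-ins, and the thresholds echo the Xeon E7340 configuration shown earlier:

```python
def insertion_sort(a):
    # Stand-in for a real insertion-sort kernel.
    out = list(a)
    for i in range(1, len(out)):
        x, j = out[i], i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out

def quick_sort(a):
    return sorted(a)   # stand-in kernel

def merge_sort(a):
    return sorted(a)   # stand-in kernel

# A tuned configuration: try each (threshold, algorithm) in order,
# dispatching on input size; the EA searches over these thresholds
# and the algorithm ordering.
DECISION_TREE = [(600, insertion_sort), (1420, quick_sort)]
FALLBACK = merge_sort

def tuned_sort(a):
    for threshold, algorithm in DECISION_TREE:
        if len(a) < threshold:
            return algorithm(a)
    return FALLBACK(a)
```

Swapping in a different `DECISION_TREE` retargets the same program to a different machine, which is exactly what the autotuner automates.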

The PetaBricks language

Choices expressed in the language

High-level algorithmic choices

Dependency-based synthesized outer control flow

Parallelization strategy

Programs automatically adapt to their environment

Tuned using our bottom-up evolutionary algorithm

Offline autotuner or always-on online autotuner



Variable accuracy algorithms

Many problems don’t have a single correct answer

Soft computing

Approximation algorithms for NP-hard problems

DSP algorithms

Different grid resolutions

Data precisions

Iterative algorithms

Choosing convergence criteria


Variable accuracy example

Example

    ...
    for (int i = 0; i < 100; ++i) {
      SORIteration(tmp);
    }
    ...

Competing objectives of performance and accuracy

Must maximize performance while meeting accuracy targets


Accuracy metrics and for_enough loops

Language

    accuracy_metric MyRMSError
    ...
    for_enough {
      SORIteration(tmp);
    }

Representation

Function from problem size to number of iterations, synthesized by our EA


Accuracy variables

Language

    accuracy_metric MyRMSError
    accuracy_variable k
    ...
    for (int i = 0; i < k; ++i) {
      SORIteration(tmp);
    }

Representation

Function from problem size to k, synthesized by our EA

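What the synthesized function does for an accuracy variable like k can be sketched as a search for the cheapest value meeting the accuracy target. Everything here (`run`, `accuracy_metric`, the toy workload in the test) is a hypothetical stand-in, not the PetaBricks implementation:

```python
def tune_k(run, accuracy_metric, target, k_candidates):
    """Return the smallest k whose output meets the accuracy target
    (fewest iterations = fastest), falling back to maximum effort.
    `run(k)` executes the loop body k times; `accuracy_metric` scores
    its output, higher being more accurate."""
    for k in sorted(k_candidates):
        if accuracy_metric(run(k)) >= target:
            return k
    return max(k_candidates)
```

For example, with a toy `run(k)` summing the geometric series 1 + 1/2 + ... toward its limit 2, and accuracy measured as negative absolute error, `tune_k` picks the smallest k inside the tolerance.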

Variable accuracy and algorithmic choices

Language

    accuracy_metric MyRMSError
    ...
    either {
      for_enough {
        SORIteration(tmp);
      }
    } or {
      Multigrid(tmp);
    } or {
      DirectSolve(tmp);
    }




Traditional evolution algorithm

Initial population 72.7s 10.5s 4.1s 31.2s Cost = 118.5

Generation 2 4.2s 5.1s 2.6s 13.2s Cost = 25.1

Generation 3 2.8s 0.1s 3.8s 2.3s Cost = 9.0

Generation 4 0.3s 0.1s 0.4s 2.4s Cost = 3.2

Cost of autotuning front-loaded in initial (unfit) population

We could speed up tuning if we start with a faster initial population

Key insight

Smaller input sizes can be used to form better initial population

Jason Ansel (MIT) PetaBricks April 4, 2011 19 / 30


Bottom-up evolutionary algorithm

Train on input size 1, to form initial population for:

Train on input size 2, to form initial population for:

Train on input size 4, to form initial population for:

Train on input size 8, to form initial population for:

Train on input size 16, to form initial population for:

Train on input size 32, to form initial population for:

Train on input size 64

Naturally exploits optimal substructure of problems

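The bottom-up training loop above can be sketched as follows: tune on tiny inputs first and let each tuned population seed the next (doubled) input size, so the expensive large-input generations start from already-fit candidates. `evaluate` and `mutate` are hypothetical stand-ins for the real fitness measurement and EA operators:

```python
def bottom_up_tune(evaluate, mutate, initial_population,
                   max_size=64, generations_per_size=3):
    """Tune on input size 1 first, doubling up to max_size; each tuned
    population seeds the next size, so large (slow-to-measure) inputs
    start from fit candidates instead of a random population."""
    population = list(initial_population)
    size = 1
    while size <= max_size:
        for _ in range(generations_per_size):
            # Rank by measured cost on the current input size.
            ranked = sorted(population, key=lambda cfg: evaluate(cfg, size))
            survivors = ranked[: max(1, len(ranked) // 2)]
            # Refill the population by mutating the survivors.
            population = survivors + [mutate(cfg) for cfg in survivors]
        size *= 2   # tuned population becomes the seed for the next size
    return min(population, key=lambda cfg: evaluate(cfg, max_size))
```

With a toy fitness where the best configuration tracks the input size, the returned configuration ends up far closer to the target than the initial population, illustrating how small-input training pre-adapts the candidates.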

Variable accuracy autotuner

Generation i

⇒ Generation i + 1

Partition accuracy space into discrete levels

Prune population to have a fixed number of representatives from each level

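The pruning step can be sketched as bucketing candidates by the accuracy level they reach and keeping only the fastest few per bucket, so the population always spans the full accuracy range. Candidates are reduced to (time, accuracy) pairs here; in reality each would be a full algorithm configuration. Illustrative only:

```python
def prune_population(candidates, level_bounds, keep_per_level=2):
    """candidates: (time, accuracy) pairs; level_bounds: ascending
    accuracy thresholds defining the discrete levels."""
    buckets = {i: [] for i in range(len(level_bounds))}
    for time_taken, accuracy in candidates:
        # Assign to the highest accuracy level this candidate reaches.
        met = [i for i, bound in enumerate(level_bounds) if accuracy >= bound]
        if met:
            buckets[max(met)].append((time_taken, accuracy))
        # Candidates below every threshold are discarded.
    survivors = []
    for level in sorted(buckets):
        # Keep the fastest few representatives of each level.
        survivors.extend(sorted(buckets[level])[:keep_per_level])
    return survivors
```

Keeping representatives at every level (rather than only the globally fastest) is what lets the tuner trade accuracy for speed later, when a different target is requested.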


Changing accuracy requirements

[Figure: speedup (×) vs. input size (10 to 10,000) for Image Compression; one curve per accuracy level: 2.0, 1.5, 1.0, 0.8, 0.6, 0.3.]

Changing accuracy requirements

[Figure: speedup (×) vs. input size for six benchmarks, one curve per accuracy level: Clustering (levels 0.05 to 0.95), Bin Packing (levels 1.01 to 1.4), Image Compression (levels 0.3 to 2.0), 3D Helmholtz (levels 10^1 to 10^9), 2D Poisson (levels 10^1 to 10^9), and Preconditioner (levels 0.0 to 3.0).]

Multigrid choice space

Pseudo code

    accuracy_metric MyRMSError
    ...
    either {
      for_enough {
        SORIteration(tmp);
      }
    } or {
      Multigrid(tmp);
    } or {
      DirectSolve(tmp);
    }


[Figure: the multigrid choice space plotted over time and grid size (16, 32, 64, 128), with regions of SOR Iteration and a Direct Solve at the smallest grid size.]

Autotuned V-cycle shapes

[Figure: autotuned multigrid V-cycle shapes for accuracy targets 10^1, 10^3, 10^5, and 10^7, over grid sizes 16 to 2048.]

Autotuned bin packing algorithms



Conclusions

Motivating goal of PetaBricks

Make programs future-proof by allowing them to adapt to their environment.

We can do better than hard coded constants!


Thanks!

Questions?

http://projects.csail.mit.edu/petabricks/
