AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS
Chirag Dave and Rudolf Eigenmann, Purdue University
GOALS
• Automatic parallelization without loss of performance
  – Use automatic detection of parallelism
  – Automatic parallelization tends to be overzealous
  – Remove overhead-inducing parallelism
  – Ensure no performance loss over the original program
• Generic tuning framework
  – Empirical approach: use program execution to measure benefits
  – Offline tuning
AUTO vs. MANUAL PARALLELIZATION
• Hand parallelization: the source program is parallelized manually, at significant development time, and the user then tunes the program for performance
• Auto parallelization: a parallelizing compiler turns the source program into a parallel program; state-of-the-art auto-parallelization runs in the order of minutes
AUTO-PARALLELISM OVERHEAD

    int foo()
    {
      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma omp parallel for private(j, t)
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t*b[j])/2.0;
        }
      }
    }

• Fork/join overheads: a thread team is forked and joined on every outer iteration
• Load balancing of the work in the parallel section
• Loop-level parallelism
NEED FOR AUTOMATIC TUNING
• Identify, at compile time, the optimization strategy that gives maximum performance
• Beneficial parallelism– Which loops to parallelize– Parallel loop coverage
OUR APPROACH
• Find the best combination of loops to parallelize
• Offline tuning
• Decisions based on actual execution time
CETUS: VERSION GENERATION
SEARCH SPACE NAVIGATION
• Search space → the set of parallelizable loops
• Generic Tuning Algorithm– Capture Interaction– Use program execution time as decision metric
• COMBINED ELIMINATION– Each loop is an on/off optimization– Selective parallelization
• Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330.
TUNING ALGORITHM
• Batch elimination: considers the effect of each optimization separately; eliminates instantly
• Iterative elimination: considers interactions; each elimination creates a new base case, so tuning takes more time
• Combined elimination: considers interactions among a subset; iterates over the smaller subset and performs batch elimination
CETUNE INTERFACE
    int foo()
    {
      #pragma cetus parallel…
      for (i = 0; i < 50; i++) {
        t = a[i];
        a[i+50] = t + (a[i+50] + b[i])/2.0;
      }

      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma cetus parallel…
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t*b[j])/2.0;
        }
      }
    }
    cetus -ompGen -tune-ompGen="1,1"    parallelize both loops
    cetus -ompGen -tune-ompGen="1,0"
    cetus -ompGen -tune-ompGen="0,1"    parallelize one loop, serialize the other
    cetus -ompGen -tune-ompGen="0,0"    serialize both loops
EMPIRICAL MEASUREMENT
1. Automatic parallelization using Cetus yields the start configuration
2. Version generation from the input source code using the tuner input
3. Back-end code generation
4. Runtime performance measurement on the train data set
5. Decision based on RIP (relative improvement percentage)
6. Move to the next point in the search space; repeat until the final configuration is reached
Platform: ICC back-end compiler on an Intel Xeon dual quad-core
RESULTS
CONTRIBUTIONS
• Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance
• Finding profitable parallelism can be done with a generic tuning method
• The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections
• On a set of NAS and SPEC OMP 2001 benchmarks, the auto-parallelized and tuned versions match or exceed the performance of the original serial or parallel programs
THANK YOU!