AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS
Chirag Dave and Rudolf Eigenmann, Purdue University
GOALS
• Automatic parallelization without loss of performance
  – Use automatic detection of parallelism
  – Automatic parallelization tends to be overzealous
  – Remove overhead-inducing parallelism
  – Ensure no performance loss over the original program
• Generic tuning framework
  – Empirical approach: use program execution to measure benefits
  – Offline tuning
AUTO vs. MANUAL PARALLELIZATION
• Hand parallelization: the source program is parallelized manually, at significant development time, and the user then tunes the program for performance
• Auto parallelization: a parallelizing compiler turns the source program into a parallel program; state-of-the-art auto-parallelization runs in the order of minutes
AUTO-PARALLELISM OVERHEAD

    int foo()
    {
      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma omp parallel for private(j, t)
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t*b[j])/2.0;
        }
      }
    }

• Fork/join overheads: a thread team is forked and joined on every outer iteration
• Load balancing of the work in the parallel section
• Loop-level parallelism
NEED FOR AUTOMATIC TUNING
• Identify, at compile time, the optimization strategy that gives maximum performance
• Beneficial parallelism– Which loops to parallelize– Parallel loop coverage
OUR APPROACH
• Find the best combination of loops to parallelize
• Offline tuning
• Decisions based on actual execution time
CETUS: VERSION GENERATION
SEARCH SPACE NAVIGATION
• Search space → the set of parallelizable loops
• Generic Tuning Algorithm– Capture Interaction– Use program execution time as decision metric
• COMBINED ELIMINATION– Each loop is an on/off optimization– Selective parallelization
• Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330.
TUNING ALGORITHM
• Batch elimination: considers the effect of each optimization separately; eliminates instantly
• Iterative elimination: considers interactions; each elimination creates a new base case, so tuning takes more time
• Combined elimination: considers interactions among a subset; iterates over the smaller subset and performs batch elimination
CETUNE INTERFACE
    int foo()
    {
      #pragma cetus parallel…
      for (i = 0; i < 50; i++) {
        t = a[i];
        a[i+50] = t + (a[i+50] + b[i])/2.0;
      }

      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma cetus parallel…
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t*b[j])/2.0;
        }
      }
    }
    cetus -ompGen -tune-ompGen="1,1"    parallelize both loops
    cetus -ompGen -tune-ompGen="1,0"
    cetus -ompGen -tune-ompGen="0,1"    parallelize one loop, serialize the other
    cetus -ompGen -tune-ompGen="0,0"    serialize both loops
EMPIRICAL MEASUREMENT
1. Automatic parallelization using Cetus yields the start configuration
2. Version generation from the input source code using the tuner input
3. Back-end code generation
4. Runtime performance measurement on the train data set
5. Decision based on RIP (relative improvement percentage)
6. Move to the next point in the search space; repeat until the final configuration is reached
Platform: ICC back-end compiler on an Intel Xeon dual quad-core
RESULTS
CONTRIBUTIONS
• Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance
• Finding profitable parallelism can be done with a generic tuning method
• The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections
• On a set of NAS and SPEC OMP 2001 benchmarks, the auto-parallelized and tuned versions match or exceed the performance of the original serial or parallel programs
THANK YOU!