OpenMP to CUDA

Mapping OpenMP to the Stream Programming Model

Hu Ming, Zhang Fangzhou, Yue Kun

Objective
1. Study how the parallel mechanisms in OpenMP map to the stream programming model (CUDA).
2. Identify which parts are suitable for translation.
3. Analyze typical scientific applications.

Outline
OpenMP vs CUDA: Execution model
OpenMP vs CUDA: Semantics
OpenMP vs CUDA: Performance
Analysis of benchmarks

OpenMP vs CUDA Execution Model

OpenMP vs CUDA Semantics

Parallel construct: parallel

Worksharing constructs: loop, sections, single

Master and synchronization constructs: critical, barrier, taskwait, atomic, flush, ordered

Data environment: shared, private, firstprivate, lastprivate, reduction, copyin, copyprivate
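As an illustration of how these categories combine in practice, the following minimal sketch uses a worksharing loop together with the reduction data-environment clause. The function name sum_squares is hypothetical and not taken from the slides; the code gives the same result whether or not OpenMP is enabled at compile time.

```c
#include <assert.h>

/* Hypothetical example: sum of 0^2 + 1^2 + ... + (n-1)^2.
 * The loop is a worksharing construct; "reduction(+:sum)" gives each
 * thread a private partial sum and adds the partial sums at the end. */
long sum_squares(int n)
{
    assert(n >= 0);
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        long sq = (long)i * i;   /* declared in the loop body, so private */
        sum += sq;
    }
    return sum;
}
```

Every iteration does the same work on different data, which is the kind of uniform, highly parallel pattern the later slides describe as a good fit for a stream model.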

OpenMP vs CUDA Semantics

#include <omp.h>

int main(void)
{
    int x;

    x = 0;

    /* every thread in the team increments the shared x, one thread at a time */
    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    }
    /* end of parallel section */

    return 0;
}
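The critical section above protects a single read-modify-write of x. One possible way to think about mapping such an update to a stream program is to express it as an atomic operation, since CUDA provides atomic functions such as atomicAdd for exactly this kind of update. The sketch below shows the OpenMP side of that idea using the atomic construct. It is an assumption about one reasonable mapping, not necessarily the one the authors used, and the function name count_with_atomic is hypothetical.

```c
#include <assert.h>

/* Hypothetical example: each iteration increments a shared counter.
 * "#pragma omp atomic" protects only the single update of x, which is
 * finer grained than a critical section around the same statement. */
int count_with_atomic(int iterations)
{
    assert(iterations >= 0);
    int x = 0;

    #pragma omp parallel for shared(x)
    for (int i = 0; i < iterations; i++) {
        #pragma omp atomic
        x += 1;
    }
    return x;   /* equals iterations: no increment is lost */
}
```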

OpenMP vs CUDA Semantics

#pragma omp for ordered [clauses...]
    (loop region)
    #pragma omp ordered
        structured_block
    (end of loop region)
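A concrete instance of the ordered pattern may help. In this minimal sketch the squaring work can run concurrently, while the ordered block runs in sequential iteration order, so results are appended in order. The function name ordered_fill and the static schedule are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical example: fill out[0..n-1] with squares, appending in
 * iteration order. Only the ordered block is serialized. */
void ordered_fill(int *out, int n)
{
    assert(out != 0 && n >= 0);
    int pos = 0;   /* shared; only touched inside the ordered block */

    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < n; i++) {
        int sq = i * i;          /* may execute concurrently */

        #pragma omp ordered
        {
            out[pos] = sq;       /* executes in order i = 0, 1, 2, ... */
            pos++;
        }
    }
}
```

Because the ordered block forces iterations to take turns, it introduces a sequential dependence between threads, which is worth keeping in mind when judging how well such a loop fits a data-parallel target.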

OpenMP vs CUDA Semantics

Most of the directives and clauses can be mapped onto stream programs.

OpenMP vs CUDA Performance

CUDA: lightweight hardware threads; data-centric processing model; simple control logic; inefficient at handling branches

OpenMP: OS-level threads; thread-centric parallel processing model; the work of each thread can be complicated

Map those constructs that have large parallelism and uniform processing among threads.
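To make "large parallelism and uniform processing" concrete, the sketch below emulates, in plain C, how the iterations of a parallel loop could be laid out over a grid of blocks and threads in the CUDA style. The two nested loops stand in for the hardware's block and thread indices. All names (saxpy_body, saxpy_as_grid, block_dim) are hypothetical, and this is an illustration of the indexing idea only, not generated CUDA code.

```c
#include <assert.h>

/* Hypothetical stand-in for a kernel body: one loop iteration per "thread". */
static void saxpy_body(int i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* Emulates a grid of blocks with block_dim threads each.
 * Iteration i is handled by (block, thread) with
 * i = block * block_dim + thread. */
void saxpy_as_grid(int n, float a, const float *x, float *y, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;   /* ceil(n / block_dim) */

    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int i = block * block_dim + thread;
            if (i < n)           /* guard: the last block may be partly idle */
                saxpy_body(i, a, x, y);
        }
    }
}
```

Every "thread" executes the same body with a single bounds check as its only branch, which matches the simple control logic that the slides attribute to the CUDA side.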

OpenMP vs CUDA Performance

Not suitable:
single, sections: they have small parallelism and different processing among threads
master: parallelism is 1
barrier, taskwait: demand that all threads be grouped into one block
lastprivate: processing is not uniform among threads
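The lastprivate case can be seen in a small example. In the sketch below, every iteration writes its private copy of x, but only the thread that runs the sequentially last iteration copies its value back to the original variable. This is one reading of why the slide calls the processing non-uniform. The function name last_iteration_value is hypothetical.

```c
#include <assert.h>

/* Hypothetical example: returns the value assigned by the sequentially
 * last iteration, i.e. 2 * (n - 1). */
int last_iteration_value(int n)
{
    assert(n > 0);
    int x = -1;

    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < n; i++) {
        x = 2 * i;   /* each thread writes its own private x */
    }
    return x;        /* value from iteration i = n - 1 */
}
```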

OpenMP vs CUDA

To understand whether it is reasonable to translate an OpenMP program into a CUDA program, we should analyze the application’s pattern.

Conclusion
1. A majority of scientific applications are suitable to be mapped to the stream programming model.
2. Heterogeneous architectures that use both a CPU and a GPU will become more common.

Comments:
1. This paper’s work is mainly analysis.
2. We think more real applications should be considered, not just benchmarks.
3. Automatic translation of OpenMP programs to CUDA programs may be possible.
