Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.1: Accelerate Now! Frank Feinbube + Teaching Team

OpenHPI - Parallel Programming Concepts - Week 4


Week 4 in the OpenHPI course on parallel programming concepts is about GPU-based parallelism. Find the whole course at http://bit.ly/1l3uD4h.


Page 1: OpenHPI - Parallel Programming Concepts - Week 4

Parallel Programming Concepts

OpenHPI Course

Week 4 : Accelerators

Unit 4.1: Accelerate Now!

Frank Feinbube + Teaching Team

Page 2: OpenHPI - Parallel Programming Concepts - Week 4

Summary: Week 3

■ Short overview of shared memory parallel programming ideas

■ Different levels of abstractions

□ Process model, thread model, task model

■ Threads for concurrency and parallelization

□ Standardized POSIX interface

□ Java / .NET concurrency functionality

■ Tasks for concurrency and parallelization

□ OpenMP for C / C++, Java, .NET, Cilk, …

■ Functional language constructs for implicit parallelism

■ PGAS languages for NUMA optimization


Specialized languages help the programmer to achieve speedup. What about correspondingly specialized parallel hardware?

OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger

Page 3: OpenHPI - Parallel Programming Concepts - Week 4

What is specialized hardware?

OpenHPI | Parallel Programming Concepts | Frank Feinbube


Page 4: OpenHPI - Parallel Programming Concepts - Week 4


What is specialized hardware?

■ Graphics cards: speed up rendering tasks

■ Sound cards: play sounds

■ Physics cards?

What else could be sped up by hardware?

■ Encryption

■ Compression

■ Codec parsing

■ Protocols and formats: XML

■ Text, regular expression matching

■ …

Page 5: OpenHPI - Parallel Programming Concepts - Week 4

Wide Variety of Accelerators


Fastest supercomputer: Tianhe-2 (MilkyWay-2); 2nd: Titan

Page 6: OpenHPI - Parallel Programming Concepts - Week 4

Wide Variety of Applications


Fluids, NBody, RadixSort

Page 7: OpenHPI - Parallel Programming Concepts - Week 4

Wide Variety of Application Domains


Bioinformatics, Computational Chemistry, Computational Finance, Computational Fluid Dynamics, Computational Structural Mechanics, Data Science, Defense, Electronic Design Automation, Imaging & Computer Vision, Medical Imaging, Numerical Analytics, Weather and Climate

http://www.nvidia.com/object/gpu-applications-domain.html

Page 8: OpenHPI - Parallel Programming Concepts - Week 4

What’s in it for me?


Page 9: OpenHPI - Parallel Programming Concepts - Week 4

Short Term View: Cheap Performance

Performance

Energy / Price

■ Cheap to buy and to maintain

■ GFLOPS per watt: Fermi 1,5 / Kepler 5 / Maxwell 15 (2014)

(Chart: execution time in milliseconds vs. problem size in number of Sudoku places, for an Intel E8500 CPU, an AMD R800 GPU and an NVIDIA GT200 GPU; lower means faster.)

GPU: Graphics Processing Unit (the "CPU" of a graphics card)

Page 10: OpenHPI - Parallel Programming Concepts - Week 4

A single Maxwell GPU will have more performance than the fastest supercomputer of 2001. Your computer / notebook benefits as well (thanks to GPUs).

What is 15 GFLOPS? 15 000 000 000 floating point operations per second.


Page 11: OpenHPI - Parallel Programming Concepts - Week 4

Middle Term View: Even More Performance


Page 12: OpenHPI - Parallel Programming Concepts - Week 4

Middle Term View: Even More Performance


Page 13: OpenHPI - Parallel Programming Concepts - Week 4

Long Term View: Acceleration Everywhere

Dealing with massively multi-core:

■ Accelerators (APUs) that accompany common general-purpose CPUs (Hybrid Systems)

Hybrid Systems (Accelerators + CPUs):

■ GPU Compute Devices: High Performance Computing (the top 2 supercomputers are accelerator-based!), business servers, home/desktop computers, mobile and embedded systems

■ Special-Purpose Accelerators: (de)compression, XML parsing, (en|de)cryption, regular expression matching


Page 14: OpenHPI - Parallel Programming Concepts - Week 4

How do they get so fast?


Page 15: OpenHPI - Parallel Programming Concepts - Week 4

Three Ways Of Doing Anything Faster [Pfister]

■ Work harder (clock speed)

→ Power wall problem

→ Memory wall problem

■ Work smarter (optimization, caching)

→ ILP wall problem

→ Memory wall problem

■ Get help (parallelization)

□ More cores per single CPU

□ Software needs to exploit them in the right way

→ Memory wall problem

(Diagram: a problem being distributed across the cores of a multi-core CPU.)


Page 16: OpenHPI - Parallel Programming Concepts - Week 4

Accelerators bypass the walls

■ CPUs are general purpose

□ Need to support all types of software / programming models

□ Need to support a large variety of legacy software

That makes it hard to do something against the walls

■ Accelerators are special purpose

□ Need to support only a limited subset of software /

programming models

□ Legacy software is usually even more limited and strict

That lessens the impact of the walls (memory & power wall)

Stronger parallelization and better speedup possible


Page 17: OpenHPI - Parallel Programming Concepts - Week 4

CPU and Accelerator trends

CPU

Evolving towards throughput computing

Motivated by energy-efficient performance

Accelerator

Evolving towards general-purpose computing

Motivated by higher quality graphics and

data-parallel programming

(Diagram: programmability vs. throughput performance. CPUs evolve from multi-threading over multi-core to many-core; accelerators evolve from fixed function over partially programmable to fully programmable. The two converge, e.g. NVIDIA Kepler and Intel MIC.)

Page 18: OpenHPI - Parallel Programming Concepts - Week 4

Task Parallelism and Data Parallelism


(Diagram: input data flows through parallel processing to result data, contrasting „CPU-style" task parallelism with „GPU-style" data parallelism.)

Page 19: OpenHPI - Parallel Programming Concepts - Week 4

Flynn's Taxonomy (1966)

■ Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimensions:

□ Single Instruction, Single Data (SISD): one instruction stream processes one data item per step

□ Single Instruction, Multiple Data (SIMD): one instruction stream processes multiple data items per step

□ Multiple Instruction, Single Data (MISD): multiple instruction streams process the same data item

□ Multiple Instruction, Multiple Data (MIMD): multiple instruction streams process multiple data items

Page 20: OpenHPI - Parallel Programming Concepts - Week 4

Parallel Programming Concepts

OpenHPI Course

Week 4 : Accelerators

Unit 4.2: Accelerator Technology

Frank Feinbube + Teaching Team

Page 21: OpenHPI - Parallel Programming Concepts - Week 4

Flynn's Taxonomy (1966)

■ Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimensions:

□ Single Instruction, Single Data (SISD): one instruction stream processes one data item per step

□ Single Instruction, Multiple Data (SIMD): one instruction stream processes multiple data items per step

□ Multiple Instruction, Single Data (MISD): multiple instruction streams process the same data item

□ Multiple Instruction, Multiple Data (MIMD): multiple instruction streams process multiple data items

Page 22: OpenHPI - Parallel Programming Concepts - Week 4

Why SIMD?


Page 23: OpenHPI - Parallel Programming Concepts - Week 4

History of GPU Computing

• 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing

Fixed Function Graphic Pipelines

• Since 2001: APIs for vertex shading, pixel shading and access to textures; DirectX 9

Programmable Real-Time Graphics

• 2006: NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10

Unified Graphics and Computing Processors


Page 24: OpenHPI - Parallel Programming Concepts - Week 4

From Fixed Function Pipeline to Programmable Shading Stages

GPUs pre 2006: Vertex Shader → Pixel Shader

DirectX 10 Geometry Shader idea: Vertex Shader → Geometry Shader → Pixel Shader ("We would need a bigger card :/")

NVIDIA G80 solution: a unified Programmable Shader (Vertex/Geometry/Pixel). This design allows: smaller card / more power

Page 25: OpenHPI - Parallel Programming Concepts - Week 4

History of GPU Computing

• 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing

Fixed Function Graphic Pipelines

• Since 2001: APIs for vertex shading, pixel shading and access to textures; DirectX 9

Programmable Real-Time Graphics

• 2006: NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10

Unified Graphics and Computing Processors

• Compute problems as native graphics operations; algorithms as shaders; data in textures

General Purpose GPU (GPGPU)


Page 26: OpenHPI - Parallel Programming Concepts - Week 4

General Purpose GPU (GPGPU)

(GPGPU idea: the compute problem is expressed as native graphics operations; the algorithm becomes a shader, the data is stored in textures, and each invocation of the programmable shader (vertex/geometry/pixel) computes one element of the result texture.)

Example: mark every element that is greater than the average, i.e. test each value $B$ against

$B > \frac{1}{n}\sum_{i=1}^{n} x_i$

Data texture:   2.7  1.8  2.8  1.8  2.8  4.5  9.0  4.5
Result texture: 0    0    0    0    0    1    1    1

(1 wherever the value exceeds the mean of about 3.74)

Page 27: OpenHPI - Parallel Programming Concepts - Week 4

History of GPU Computing

• 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing

Fixed Function Graphic Pipelines

• Since 2001: APIs for vertex shading, pixel shading and access to textures; DirectX 9

Programmable Real-Time Graphics

• 2006: NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10

Unified Graphics and Computing Processors

• Compute problems as native graphics operations; algorithms as shaders; data in textures

General Purpose GPU (GPGPU)

• CUDA programming; programmable shaders; load and store instructions; barriers; atomics

GPU Computing


Page 28: OpenHPI - Parallel Programming Concepts - Week 4

What’s the difference to CPUs?


Page 29: OpenHPI - Parallel Programming Concepts - Week 4

Task Parallelism and Data Parallelism


(Diagram: input data flows through parallel processing to result data, contrasting „CPU-style" task parallelism with „GPU-style" data parallelism.)

Page 30: OpenHPI - Parallel Programming Concepts - Week 4

CPU vs. GPU Architecture

■ CPU („multi-core"): a few heavyweight threads; branch prediction

■ GPU („many-core"): 1000+ light-weight threads; memory latency hiding

(Diagram: the CPU spends its area on control logic and cache for a few processing elements; the GPU packs in many small processing elements; both are attached to DRAM.)

Page 31: OpenHPI - Parallel Programming Concepts - Week 4

GPU Threads are different

GPU threads

■ Execute exactly the same instruction (line of code)

□ They share the program counter

No branching!

Memory Latency Hiding

■ GPU holds thousands of threads

■ If some access the slow memory, others will run while they wait

■ No context switch

□ Enough registers to store all the threads all the time


Page 32: OpenHPI - Parallel Programming Concepts - Week 4

What does it look like?


Page 33: OpenHPI - Parallel Programming Concepts - Week 4

GF100


GPU Hardware in Detail

Page 34: OpenHPI - Parallel Programming Concepts - Week 4

GF100


L2 Cache

Page 35: OpenHPI - Parallel Programming Concepts - Week 4

GF100


Page 36: OpenHPI - Parallel Programming Concepts - Week 4

GF100


Page 37: OpenHPI - Parallel Programming Concepts - Week 4

GF100


Page 38: OpenHPI - Parallel Programming Concepts - Week 4

GF100


Page 39: OpenHPI - Parallel Programming Concepts - Week 4

It’s a jungle out there!


Page 40: OpenHPI - Parallel Programming Concepts - Week 4

GPU Computing Platforms (Excerpt)

AMD

R700, R800, R900,

HD 7000, HD 8000, Rx 200

NVIDIA

G80, G92, GT200, GF100, GK110

GeForce, Quadro, Tesla, ION


Page 41: OpenHPI - Parallel Programming Concepts - Week 4

Compute Capability by version

Plus: varying amounts of cores, global memory sizes, bandwidth,

clock speeds (core, memory), bus width, memory access penalties …

                                 1.0     1.1     1.2     1.3     2.x      3.0        3.5
double precision float ops       No      No      No      Yes     Yes      Yes        Yes
caches                           No      No      No      No      Yes      Yes        Yes
max # concurrent kernels         1       1       1       1       8        8          8
Dynamic Parallelism              No      No      No      No      No       No         Yes
max # threads per block          512     512     512     512     1024     1024       1024
max # warps per MP               24      24      32      32      48       64         64
max # threads per MP             768     768     1024    1024    1536     2048       2048
register count (32 bit)          8192    8192    16384   16384   32768    65536      65536
max shared mem per MP            16KB    16KB    16KB    16KB    16/48KB  16/32/48KB 16/32/48KB
# shared memory banks            16      16      16      16      32       32         32

Page 42: OpenHPI - Parallel Programming Concepts - Week 4

Intel Xeon Phi: Hardware

60 cores based on the P54C architecture (Pentium)

■ > 1.0 GHz clock speed; 64-bit x86 instructions + SIMD

■ 30 MB L2 cache in total (512 KB per core) + 64 KB L1 per core (cache coherent)

■ 8 (to 32) GB of GDDR5 memory

■ 4 hardware threads per core (240 logical cores)

□ Not Hyper-Threading as on multi-core CPUs

□ Think graphics-card hardware threads

□ Only one thread runs at a time → memory latency hiding

□ Threads are switched after each instruction!

→ use 120 or 240 threads for the 60 cores

■ 512-bit wide VPU with the new KCi ISA

■ No support for MMX, SSE or AVX

■ Can handle 8 double precision / 16 single precision floats

■ Always structured in vectors with 16 elements


Page 43: OpenHPI - Parallel Programming Concepts - Week 4

Intel Xeon Phi: Operating System

■ Minimal, embedded Linux

■ Linux Standard Base (LSB) core libraries

■ Implements a Busybox minimal shell environment


Page 44: OpenHPI - Parallel Programming Concepts - Week 4

Parallel Programming Concepts

OpenHPI Course

Week 4 : Accelerators

Unit 4.3: Open Compute Language (OpenCL)

Frank Feinbube + Teaching Team

Page 45: OpenHPI - Parallel Programming Concepts - Week 4

Simple Example: Vector Addition

$\vec{c} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} = \;?$

Page 46: OpenHPI - Parallel Programming Concepts - Week 4

Open Compute Language (OpenCL)


How OpenCL came about:

■ Apple: was tired of recoding for many-core CPUs and GPUs; pushed vendors to standardize; wrote a draft "straw man" API

■ AMD / ATI: merged, needed commonality across products

■ NVIDIA: GPU vendor, wants to steal market share from the CPU

■ Intel: CPU vendor, wants to steal market share from the GPU

■ The Khronos Compute Group was formed; further members include Ericsson, Nokia, IBM, Sony, Blizzard and Texas Instruments

Page 47: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Platform Model

■ OpenCL exposes CPUs, GPUs, and other Accelerators as “devices”

■ Each “device” contains one or more “compute units”, i.e. cores, SMs,...

■ Each “compute unit” contains one or more SIMD “processing elements”


Page 48: OpenHPI - Parallel Programming Concepts - Week 4

Terminology

Level       CPU                                 OpenCL
Platform    SMP system                          Compute Device
            Processor                           Compute Unit
            Core                                Processing Element
Memory      Main Memory                         Global and Constant Memory
            -                                   Local Memory
            Registers, Thread Local Storage     Registers, Private Memory
Execution   Process                             Index Range (NDRange)
            -                                   Work Group
            Thread                              Work Items (+ Kernels)

Page 49: OpenHPI - Parallel Programming Concepts - Week 4

The BIG idea behind OpenCL

OpenCL execution model: execute a kernel at each point in a problem domain.

E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.


(Side by side on the slide: the same computation written as traditional loops and as data-parallel OpenCL.)

Page 50: OpenHPI - Parallel Programming Concepts - Week 4

Building and Executing OpenCL Code

(Diagram: the code of one or more kernels is compiled once for the GPU, yielding a GPU binary representation, and once for the CPU, yielding a CPU binary representation. OpenCL distinguishes Kernel, Program and Device.)

OpenCL code must be prepared to deal with much greater hardware diversity (features are optional and may not be supported on all devices) → compile code that is tailored to the device configuration.

Page 51: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Execution Model

An OpenCL kernel is executed by an array of work items.

■ All work items run the same code (SPMD)

■ Each work item has an index that it uses to compute memory

addresses and make control decisions

(Diagram: work item ids 0 1 2 3 4 5 6 7, each mapped to a thread.)

Page 52: OpenHPI - Parallel Programming Concepts - Week 4

Work Groups: Scalable Cooperation

Divide monolithic work item array into work groups

■ Work items within a work group cooperate via shared

memory, atomic operations and barrier synchronization

■ Work items in different work groups cannot cooperate

(Diagram: work item ids 0-7 form work group 0, ids 8-15 form work group 1.)

Page 53: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Execution Model

■ Parallel work is submitted to devices by launching kernels

■ Kernels run over global dimension index ranges (NDRange), broken up

into “work groups”, and “work items”

■ Work items executing within the same work group can synchronize with

each other with barriers or memory fences

■ Work items in different work groups can’t sync with each other, except

by launching a new kernel


Page 54: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Execution Model

An example of an NDRange index space showing work-items, their global IDs

and their mapping onto the pair of work-group and local IDs.


Page 55: OpenHPI - Parallel Programming Concepts - Week 4

Terminology

Level       CPU                                 OpenCL
Platform    SMP system                          Compute Device
            Processor                           Compute Unit
            Core                                Processing Element
Memory      Main Memory                         Global and Constant Memory
            -                                   Local Memory
            Registers, Thread Local Storage     Registers, Private Memory
Execution   Process                             Index Range (NDRange)
            -                                   Work Group
            Thread                              Work Items (+ Kernels)

Page 56: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Memory Architecture

■ Private: per work item

■ Local: shared within a work group

■ Global / Constant: visible to all work groups

■ Host Memory: on the CPU


Page 57: OpenHPI - Parallel Programming Concepts - Week 4

Terminology

Level       CPU                                 OpenCL
Platform    SMP system                          Compute Device
            Processor                           Compute Unit
            Core                                Processing Element
Memory      Main Memory                         Global and Constant Memory
            -                                   Local Memory
            Registers, Thread Local Storage     Registers, Private Memory
Execution   Process                             Index Range (NDRange)
            -                                   Work Group
            Thread                              Work Items (+ Kernels)

Page 58: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Memory Architecture

■ Memory management is explicit:

you must move data from host → global → local… and back

Memory Type       Keyword      Description / Characteristics
Global Memory     __global     Shared by all work items; read/write; may be cached (modern GPUs), else slow; huge
Private Memory    __private    For local variables; per work item; may be mapped onto global memory (arrays on GPUs)
Local Memory      __local      Shared between work items of a work group; may be mapped onto global memory (not on GPUs), else fast; small
Constant Memory   __constant   Read-only, cached; GPUs add a special kind: texture memory

Page 59: OpenHPI - Parallel Programming Concepts - Week 4

OpenCL Work Item Code

A subset of ISO C99 - without some C99 features

■ headers, function pointers, recursion, variable length arrays,

and bit fields

A superset of ISO C99 with additions for

■ Work-items and workgroups

■ Vector types (2,4,8,16): endian safe, aligned at vector length

■ Image types mapped to texture memory

■ Synchronization

■ Address space qualifiers

Also includes a large set of built-in functions for image manipulation,

work-item manipulation, specialized math routines, vectors, etc.


Page 60: OpenHPI - Parallel Programming Concepts - Week 4

Vector Addition: Kernel

■ Kernel body is instantiated once for each work item, each getting a unique index

» Code that actually executes on target devices


Page 61: OpenHPI - Parallel Programming Concepts - Week 4

Vector Addition: Host Program


[5]

Page 62: OpenHPI - Parallel Programming Concepts - Week 4

Vector Addition: Host Program


[5]

„standard" overhead for an OpenCL program

Page 63: OpenHPI - Parallel Programming Concepts - Week 4

Development Support

Software development kits: NVIDIA and AMD; Windows and Linux

Special libraries: AMD Core Math Library, BLAS and FFT libraries by NVIDIA, OpenNL for numerics and CULA for linear algebra; NVIDIA Performance Primitives library: a collection of common GPU-accelerated algorithms

Profiling and debugging tools:

■ NVIDIA's Parallel Nsight for Microsoft Visual Studio

■ AMD's ATI Stream Profiler

■ AMD's Stream KernelAnalyzer: displays GPU assembler code, detects execution bottlenecks

■ gDEBugger (platform-independent)

Big knowledge bases with tutorials, examples, articles, show cases, and

developer forums


Page 64: OpenHPI - Parallel Programming Concepts - Week 4

Nsight


Page 65: OpenHPI - Parallel Programming Concepts - Week 4

Parallel Programming Concepts

OpenHPI Course

Week 4 : Accelerators

Unit 4.4: Optimizations

Frank Feinbube + Teaching Team

Page 66: OpenHPI - Parallel Programming Concepts - Week 4

The Power of GPU Computing

(Chart: execution time in milliseconds vs. problem size in number of Sudoku places for the Intel E8500 CPU, AMD R800 GPU and NVIDIA GT200 GPU; less is better. Big performance gains for small problem sizes.)

Page 67: OpenHPI - Parallel Programming Concepts - Week 4

The Power of GPU Computing

(Chart: the same measurement for large problem sizes, up to 600,000 Sudoku places; less is better. Only small/moderate performance gains for large problem sizes → further optimizations needed.)

Page 68: OpenHPI - Parallel Programming Concepts - Week 4

Best Practices for Performance Tuning

• Algorithm Design: asynchronous, recompute, simple

• Memory Transfer: chaining, overlap transfer & compute

• Control Flow: divergent branching, predication

• Memory Types: local memory as cache, a rare resource

• Memory Access: coalescing, bank conflicts

• Sizing: execution size, evaluation

• Instructions: shifting, fused multiply, vector types

• Precision: native math functions, build options

Page 69: OpenHPI - Parallel Programming Concepts - Week 4

Divergent Branching and Predication

Divergent Branching

■ Flow control instruction (if, switch, do, for, while) can result in

different execution paths

Data parallel execution → varying execution paths will be serialized

Threads converge back to same execution path after completion

Branch Predication

■ Instructions are associated with a per-thread condition code (predicate)

□ All instructions are scheduled for execution

□ Predicate true: executed normally

□ Predicate false: do not write results, do not evaluate addresses, do

not read operands

■ Compiler may use branch predication for if or switch statements

■ Unroll loops yourself (or use #pragma unroll for NVIDIA)


Page 70: OpenHPI - Parallel Programming Concepts - Week 4

Use Caching: Local, Texture, Constant

Local Memory

■ Memory latency roughly 100x lower than global memory latency

■ Small, no coalescing problems, prone to memory bank conflicts

Texture Memory

■ 2-dimensionally cached, read-only

■ Can be used to avoid uncoalesced loads from global memory

■ Used with the image data type

Constant Memory

■ Linear cache, read-only, 64 KB

■ as fast as reading from a register for the same address

■ Can be used for big lists of input arguments

(Illustration: texture cache addresses arranged in 2D blocks, e.g. 0-3, 64-67, 128-131, 192-195.)

Page 71: OpenHPI - Parallel Programming Concepts - Week 4

Sizing: What is the right execution layout?

■ Local work item count should be a multiple of native execution

size (NVIDIA 32, AMD 64, MIC 16), but not too big

■ Number of work groups should be multiple of the number of

multiprocessors (hundreds or thousands of work groups)

■ Can be configured in 1-, 2- or 3-dimensional layout: consider

access patterns and caching

■ Balance between latency hiding

and resource utilization

■ Experimenting is

required!


[4]

Page 72: OpenHPI - Parallel Programming Concepts - Week 4

Instructions and Precision

■ Single precision floats provide best performance

■ Use shift operations to avoid expensive division and modulo calculations

■ Special compiler flags

■ AMD has native vector type implementation; NVIDIA is scalar

■ Use the native math library whenever speed trumps precision

Functions                                                              Throughput
single-precision floating-point add, multiply, and multiply-add        8 operations per clock cycle
single-precision reciprocal, reciprocal square root, native_logf(x)    2 operations per clock cycle
native_sin, native_cos, native_exp                                     1 operation per clock cycle

Page 73: OpenHPI - Parallel Programming Concepts - Week 4

Coalesced Memory Accesses

Simple Access Pattern

■ Can be fetched in a single 64-byte

transaction (red rectangle)

■ Could also be permuted *

Sequential but Misaligned Access

■ Fall into single 128-byte segment:

single 128-byte transaction,

else: 64-byte transaction + 32-

byte transaction *

Strided Accesses

■ Depending on stride from 1 (here)

up to 16 transactions *

* 16 transactions with compute capability 1.1


[6]

Page 74: OpenHPI - Parallel Programming Concepts - Week 4

Intel Xeon Phi: OpenCL

■ What's different to GPUs?

□ Thread granularity is bigger + no need for local shared memory

■ Work Groups are mapped to Threads

□ 240 OpenCL hardware threads handle workgroups

□ More than 1000 Work Groups recommended

□ Each thread executes one work group

■ Implicit vectorization of the innermost loop by the compiler

□ 16 elements per vector → dimension zero must be divisible by 16, otherwise scalar execution (good work group size = 16)

The compiler conceptually wraps the kernel body in loops over the work group and vectorizes the innermost one:

__kernel void ABC() {
    for (int i = 0; i < get_local_size(2); i++)
        for (int j = 0; j < get_local_size(1); j++)
            for (int k = 0; k < get_local_size(0); k++)  // dimension zero of the NDRange
                Kernel_Body;
}

Page 75: OpenHPI - Parallel Programming Concepts - Week 4

Intel Xeon Phi: OpenCL

■ Non uniform branching in a work group has significant overhead

■ Non vector size aligned memory access has significant overhead

■ Non linear access patterns have significant overhead

■ Manual prefetching can be advantageous

■ No hardware support for barriers

■ Further reading: Intel® SDK for OpenCL Applications XE 2013

Optimization Guide for Linux OS


(The alignment and access-pattern penalties apply especially to dimension zero.)

Page 76: OpenHPI - Parallel Programming Concepts - Week 4

Parallel Programming Concepts

OpenHPI Course

Week 4 : Accelerators

Unit 4.5: Future Trends

Frank Feinbube + Teaching Team

Page 77: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Platforms

WebCL [Draft] http://www.khronos.org/webcl/

■ JavaScript binding to OpenCL

■ Heterogeneous Parallel Computing (CPUs + GPU)

within Web Browsers

■ Enables compute intense programs

like physics engines, video editing…

■ Currently only available with add-ons

(Node.js, Firefox, WebKit)

Android installable client driver extension (ICD)

■ Enables OpenCL implementations to be discovered and loaded

as a shared object on Android systems.


Page 78: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Applications: Dealing with Unstructured Grids

Too coarse vs. too fine

Page 79: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Applications: Dealing with Unstructured Grids

Fixed grid vs. dynamic grid


Page 80: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Applications: Dynamic Parallelism


CPU manages execution vs. GPU manages execution

Page 81: OpenHPI - Parallel Programming Concepts - Week 4


Page 82: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Programming Models: OpenACC GPU Computing


Copy arrays to GPU

Parallelize loop with GPU kernels

Data automatically copied back at end of region

Page 83: OpenHPI - Parallel Programming Concepts - Week 4

Towards new Programming Models: OpenACC GPU Computing


Page 84: OpenHPI - Parallel Programming Concepts - Week 4

Hybrid System

(Diagram: a hybrid system with two multi-core CPUs connected via QPI, each with its own RAM, several GPUs with dedicated GPU RAM, and a MIC card.)

Page 85: OpenHPI - Parallel Programming Concepts - Week 4

Summary: Week 4

■ Accelerators promise big speedups for data parallel applications

□ SIMD execution model (no branching)

□ Memory latency hiding with 1000s of light-weight threads

■ Enormous diversity → OpenCL as a standard programming model for all

□ Uniform terms: Compute Device, Compute Unit, Processing Element

□ Idea: Loop parallelism with index ranges

□ Kernels are written in C; compiled at runtime; executed in parallel

□ Complex memory hierarchy, overhead to copy data from CPU

■ Getting fast is easy, getting faster is hard

□ Best practices for accelerators

□ Knowledge about hardware characteristics necessary

■ Future: faster, more features, more platforms, better programmability


Multi-core, many-core, … What if my computational problem still demands more power?
