OpenCL Ryan Renna


Page 1: OpenCL

OpenCL
Ryan Renna

Page 2: Overview

Overview

Introduction
History
Anatomy of OpenCL
Execution Model
Memory Model
Implementation
Applications
The Future

2

Page 3: Goals

Goals

Knowledge that is transferable to all APIs

Overview of concepts rather than API-specific terminology

Avoid coding examples as much as possible

3

Page 4: Introduction

Introduction

Page 5: What is OpenCL?

What is OpenCL?

A Language: Open Computing Language, and it's C-like! Execute code across mixed platforms consisting of CPUs, GPUs and other processors.

An API: Runs on the "Host" to manipulate and control OpenCL objects and code. Deals with devices as abstract processing units.

5

Page 6: Why Use GPUs?

Why Use GPUs?

Modern GPUs are made up of highly parallel processing units, which have been named "Stream Processors".

Modern PCs all have dedicated GPUs, which sit idle for most day-to-day processing.

This strategy is known as "General-Purpose Computation on Graphical Processing Units", or GPGPU.

6

Page 7: The Stream Processor

Any device capable of Stream Processing; related to SIMD.

Given a set of data (the stream), a series of functions (called kernel functions) is applied to each element.

On-chip memory is used to minimize external memory bandwidth.

The Stream Processor

Did you know: The Cell processor, developed by Toshiba, Sony & IBM, is a Stream Processor?

7

Page 8: Streams

Streams

Most commonly 2D grids (Textures)

Maps well to Matrix Algebra, Image Processing, Physics simulations, etc

Did you know: The latest ATI card has 1600 individual Stream Processors?

8

Page 9: Kernel Functions

Kernel Functions

Traditional sequential method:

for (int i = 0; i < 100 * 4; i++) {
    result[i] = source0[i] + source1[i];
}

The same process, using the kernel "vector_sum":

for (int el = 0; el < 100; el++) {
    vector_sum(result[el], source0[el], source1[el]);
}

9
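For reference, a minimal sketch of what a "vector_sum" kernel could look like in OpenCL C. The slide does not show its body, so the float4 element type and the argument order are assumptions; the point is that the loop disappears and each work-item handles one element.

__kernel void vector_sum(__global float4* result,
                         __global const float4* source0,
                         __global const float4* source1)
{
    // One work-item per 4-component element; the global id replaces the loop counter.
    int el = get_global_id(0);
    result[el] = source0[el] + source1[el];
}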

Page 10: An "Open" Computing Language

An “Open” Computing Language

Multiple CPU machines with multiple GPUs, all from different vendors, can work together.

10

Page 11: History

History

Page 12: GPGPU

GPGPU

General-Purpose Computation on Graphical Processing Units

Coined in 2002, with the rise of using GPUs for non-graphics applications

Hardware-specific GPGPU APIs have been created:

CUDA – NVidia, 2007
Close To Metal – ATI, 2006

12


Page 14: The Next Step

The next step

OpenCL:

Developed by Apple

Collaborated with AMD, Intel, IBM and NVidia to refine the proposal

Submitted to the Khronos Group; the specification for OpenCL 1.0 was finished 5 months later

14

Page 15: You may remember me from such open standards as…

You may remember me from such open standards as…

OpenGL – 2D and 3D graphics API

OpenAL – 3D audio API

OpenGL ES – OpenGL for embedded systems. Used in all smartphones.

Collada – XML-based schema for storing 3D assets.

15

Page 16: Anatomy of OpenCL

Anatomy of OpenCL

Page 17: API – Platform Layer

API – Platform Layer

Compute Device – a processor that executes data-parallel programs. Contains Compute Units.

Compute Unit – a processing element, e.g. a core of a CPU.

Queues – submit work to a compute device. Can be in-order or out-of-order.

Context – a collection of compute devices. Enables memory sharing across devices.

Host – container of Contexts. Represents the computer itself.

17
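As a rough, hedged illustration of these platform-layer objects (this host snippet is mine, not from the slides), a program might enumerate the devices on a platform and ask how many compute units each one exposes:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices = 0;

    // Platform layer: find a platform, then the compute devices it contains.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; i++) {
        cl_uint units = 0;
        // A device reports its compute units (e.g. CPU cores or GPU multiprocessors).
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        printf("Device %u: %u compute units\n", i, units);
    }
    return 0;
}

(On Mac OS X the header is OpenCL/opencl.h rather than CL/cl.h.)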

Page 18: Host Example

Host Example

A host computer with one device group (a Context) containing:

A dual-core CPU – Queue 1, with 2 Compute Units

A GPU with 8 Stream Processors – Queue 2, with 8 Compute Units

(Diagram: Host → Context → one queue per device, each queue feeding its device's Compute Units.)

18

Page 19: API – Runtime Layer

API – Runtime Layer

Memory Objects
Buffers – blocks of memory, accessed as arrays, pointers or structs
Images – 2D or 3D images

Executable Objects
Kernel – a data-parallel function that is executed by a compute device
Program – a group of kernels and functions

Synchronization: Events

Caveat: Each image can be read or written in a kernel, but not both.

19
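As a hedged example of the event objects mentioned above (the variable names are placeholders, not from the slides), the host can attach an event to an enqueued kernel and synchronize on it:

cl_event kernel_done;

// Enqueue the kernel and capture a completion event.
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, local_work_size,
                             0, NULL, &kernel_done);

// Either block the host until the kernel finishes...
err = clWaitForEvents(1, &kernel_done);

// ...or make a later command wait on the event instead.
err = clEnqueueReadBuffer(cmd_queue, out_mem, CL_TRUE, 0, out_size, out_ptr,
                          1, &kernel_done, NULL);

clReleaseEvent(kernel_done);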

Page 20: Example Flow

Example Flow

Compile Code – a Program (a collection of Kernels) is built into CPU & GPU binaries

Create Data & Arguments – Memory Objects (Buffers and Images) are created for the Compute Device

Send to Execution – work is submitted to the device through an In-Order or Out-of-Order Queue

20

Page 21: Execution Model of OpenCL

Execution Model of OpenCL

Page 22: N-D Space

The N-dimensional computation domain is called the N-D Space. It defines the total number of elements of execution, i.e. the Global Dimensions.

Each element of execution, representing an instance of a kernel, is called a work-item.

Work-items are grouped into local workgroups, whose size is defined by the Local Dimensions.

N-D Space

22

Page 23: Work-Items

Global work-items don’t belong to a workgroup and run in parallel independently (no synchronization)

Local work-items can be synchronized within a workgroup, and share workgroup memory

Each work-item runs as its own thread. Thousands of lightweight threads can be running at a time, and they are managed by the device.

Each work-item is assigned a unique global id and a local id within its workgroup, and each workgroup is assigned a workgroup id (see the sketch below).

Work-Items

23
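A minimal OpenCL C sketch of those ids as seen from inside a kernel (the kernel itself is illustrative, not from the slides):

__kernel void show_ids(__global int* out)
{
    size_t gid   = get_global_id(0);   // unique id across the whole N-D space
    size_t lid   = get_local_id(0);    // id within this work-item's workgroup
    size_t group = get_group_id(0);    // id of the workgroup itself
    size_t lsize = get_local_size(0);  // work-items per workgroup in dimension 0

    // The global id can always be reconstructed from the workgroup and local ids.
    out[gid] = (int)(group * lsize + lid);
}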

Page 24: Example – Image Filter

Example – Image Filter

Executed on a 128 x 128 image, our Global Dimensions are 128, 128. We will have 16,384 work-items in total.

We can then define Local Dimensions of, for example, 16, 16 (the local dimensions must divide the global dimensions evenly, so a value like 30 would not work for 128).

Since workgroups are executed together, and work-items can only be synchronized within workgroups, picking your Global and Local Dimensions is problem specific.

If we asked for the local id of the work-item with global id 17 in the first dimension, we'd receive 1, as it's the second work-item of the second workgroup (17 = 16 + 1).

24
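A hedged host-side sketch of how these dimensions might be passed to OpenCL for the 128 x 128 example (the queue and kernel variables are assumed to exist already):

size_t global_work_size[2] = {128, 128};  // one work-item per pixel: 16,384 in total
size_t local_work_size[2]  = {16, 16};    // 256 work-items per workgroup

err = clEnqueueNDRangeKernel(cmd_queue, kernel, 2, NULL,
                             global_work_size, local_work_size,
                             0, NULL, NULL);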

Page 25: Memory Model of OpenCL

Memory Model of OpenCL

Page 26: Memory Model

Memory Model

Private – per work-item

Local – shared within a workgroup

Global/Constant – per device, not synchronized

Host Memory – on the host

(Diagram: Host Memory lives on the Host; Global/Constant Memory on the Compute Device; each Compute Unit has its own Local Memory; each Work Item has its own Private memory.)

26
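A small OpenCL C sketch (mine, not from the slides) showing how these memory regions appear as address-space qualifiers inside a kernel:

__kernel void memory_regions(__global float* data,      // global memory: visible per device
                             __constant float* coeffs,  // constant memory: read-only
                             __local float* scratch)    // local memory: shared per workgroup
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float x = data[gid];          // 'x' lives in private memory: per work-item
    scratch[lid] = x * coeffs[0];

    // Work-items in the same workgroup can synchronize around local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    data[gid] = scratch[lid];
}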

Page 27: Intermission

Intermission

27

Page 28: Implementation

Implementation

Page 29: Identifying Parallelizable Routines

Key thoughts: work-items should be independent of each other.

Work-items within a workgroup can share data, but separate workgroups cannot depend on each other's results.

Find tasks that are independent and highly repeated; pay attention to loops (see the sketch after this slide).

Transferring data over the PCI bus has overhead, so parallelization is only justified for large data sets or for work with heavy mathematical computation.

Identifying Parallelizable Routines

29
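As a hedged illustration of the loop advice (these example loops are mine, not from the slides): the first loop below parallelizes cleanly because every iteration is independent, while the second does not, because each iteration depends on the previous one.

// Independent iterations: each i could become its own work-item.
for (int i = 0; i < n; i++) {
    out[i] = a[i] * b[i];
}

// Loop-carried dependency: iteration i needs the result of iteration i - 1,
// so it cannot be split naively across work-items.
for (int i = 1; i < n; i++) {
    out[i] = out[i - 1] + a[i];
}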

Page 30: An Example – Class Average

30

An Example – Class Average

Let’s imagine we were writing an application that computed the class average

There are two tasks we'd need to perform:

Compute the final grade for each student

Obtain a class average by averaging the final grades

Page 31: An Example – Class Average

Let's imagine we were writing an application that computed the class average

There are two tasks we'd need to perform:

Compute the final grade for each student – Parallelizable

Obtain a class average by averaging the final grades – Non-Parallelizable

31

An Example – Class Average

Page 32: Pseudo Code

Pseudo Code

32

foreach (student in class) {
    grades = student.getGrades();
    sum = 0;
    count = 0;
    foreach (grade in grades) {
        sum += grade;
        count++;
    }
    student.averageGrade = sum / count;
}

Compute the final grade for each student

Page 33: Pseudo Code

foreach (student in class) {
    grades = student.getGrades();
    sum = 0;
    count = 0;
    foreach (grade in grades) {
        sum += grade;
        count++;
    }
    student.averageGrade = sum / count;
}

Pseudo Code

33

This code can be isolated as a kernel:

__kernel void calcGrade(__global const float* input, __global float* output)
{
    int i = get_global_id(0);
    // Do work on class[i]
}
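A hedged sketch of what that kernel body might look like if every student's grades were flattened into one float array with a fixed number of grades per student (the data layout and the NUM_GRADES constant are assumptions made for illustration):

#define NUM_GRADES 10  // assumed: every student has the same number of grades

__kernel void calcGrade(__global const float* input,  // NUM_GRADES values per student
                        __global float* output)       // one final grade per student
{
    int i = get_global_id(0);   // one work-item per student

    float sum = 0.0f;
    for (int g = 0; g < NUM_GRADES; g++) {
        sum += input[i * NUM_GRADES + g];
    }
    output[i] = sum / NUM_GRADES;
}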

Page 34: Determining the Data Dimensions

First decide how to represent your problem; this determines the dimensionality of your Global and Local Dimensions.

Global dimensions are problem-specific

Local dimensions are algorithm-specific

Local dimensions must have the same number of dimensions as Global.

Local dimensions must divide the global space evenly

Passing NULL as a workgroup size argument will let OpenCL pick the most efficient setup, but no synchronization will be possible between work-items

34

Determining the Data Dimensions
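For example, a hedged host-side call that lets the implementation choose the workgroup size (the queue and kernel variables are assumed to exist):

size_t global_work_size[2] = {128, 128};

// Passing NULL for the local work size lets OpenCL pick the workgroup layout,
// at the cost of not being able to synchronize work-items within a workgroup.
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 2, NULL,
                             global_work_size, NULL,
                             0, NULL, NULL);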

Page 35: Execution Steps

An OpenCL calculation needs to perform 6 key steps:

Initialization
Allocate Resources
Creating Programs/Kernels
Execution
Read the Result(s)
Clean Up

Execution Steps

Warning! Code Ahead

35

Page 36: Initialization

Store Kernel in string/char array

Initialization

const char* kernel_source =
    "__kernel void calcGrade(__global const float* input, __global float* output) \n"
    "{                                                                             \n"
    "    int i = get_global_id(0);                                                 \n"
    "    // Do work on class[i]                                                    \n"
    "}                                                                             \n";

36

Page 37: Initialization

Selecting a device and creating a context in which to run the calculation

Initialization

cl_int err;
cl_context context;
cl_device_id devices;
cl_command_queue cmd_queue;

// clGetDeviceIDs also takes a platform id as its first argument;
// NULL asks the implementation for its default platform.
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
context = clCreateContext(NULL, 1, &devices, NULL, NULL, &err);
cmd_queue = clCreateCommandQueue(context, devices, 0, NULL);

37

Page 38: Allocation

Allocate the memory/storage that will be used on the device, and push the data to the device

Allocation

// Create a read-only device buffer, then copy the host array 'values' into it.
cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, atom_buffer_size, NULL, NULL);

err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0, atom_buffer_size,
                           (void*)values, 0, NULL, NULL);

38

Page 39: Program/Kernel Creation

Programs and kernels are read in from source and compiled into binaries

Program/Kernel Creation

cl_program program[1];
cl_kernel kernel[1];

program[0] = clCreateProgramWithSource(context, 1, (const char**)&kernel_source,
                                       NULL, &err);

// clBuildProgram takes (program, num_devices, device_list, options, pfn_notify, user_data).
err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);

kernel[0] = clCreateKernel(program[0], "calcGrade", &err);

39

Page 40: Execution

Arguments to the kernel are set and the kernel is executed on all data

Execution

size_t global_work_size[1], local_work_size[1];
global_work_size[0] = x;       // total number of work-items
local_work_size[0]  = x / 2;   // must divide the global size evenly

// Kernel arguments are set from cl_mem objects, not raw host pointers.
err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &ax_mem);

err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL,
                             global_work_size, local_work_size,
                             0, NULL, NULL);

40

Page 41: Read the Result(s)

We read back the results to the Host

Read the Result(s)

// Blocking read: copy the device buffer back into the host array 'val'.
err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0, grid_buffer_size,
                          val, 0, NULL, NULL);

41

Note: If we were working on images, the function clEnqueueReadImage() would be called instead.
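A hedged sketch of that image variant (the image object, its dimensions and the host pointer are assumptions):

size_t origin[3] = {0, 0, 0};            // start at the top-left corner
size_t region[3] = {width, height, 1};   // read the whole 2D image (depth = 1)

err = clEnqueueReadImage(cmd_queue, img_mem, CL_TRUE, origin, region,
                         0, 0,            // let OpenCL compute the row/slice pitch
                         host_pixels, 0, NULL, NULL);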

Page 42: Clean Up

Clean up memory and release all OpenCL objects. You can check each object's OpenCL reference count and ensure it equals zero

Clean Up

clReleaseMemObject(ax_mem);   // also release any memory objects created earlier
clReleaseKernel(kernel[0]);
clReleaseProgram(program[0]);
clReleaseCommandQueue(cmd_queue);
clReleaseContext(context);

42

Page 43: Advanced Techniques

Instead of using the first GPU found, we could create a context out of all OpenCL devices, or dynamically pick the dimensions and devices that would perform best.

Debugging can be done more efficiently on the CPU than on a GPU; printf calls will work inside a kernel there.

Advanced Techniques

43

Page 44: Applications

Applications

Page 45: Applications

Applications

Raytracing
Weather forecasting, climate research
Physics simulations
Computational finance
Computer vision
Signal processing, speech processing
Cryptography / cryptanalysis
Neural networks
Database operations
…Many more!

45

Page 46: The Future

The Future

Page 47: OpenGL Interoperability

OpenGL Interoperability

OpenCL + OpenGL: efficient inter-API communication

OpenCL efficiently shares resources with OpenGL (doesn't copy)

OpenCL objects can be created from OpenGL objects

OpenGL 4.0 has been designed to align both standards to work closely together

Example implementation:
Vertex and image data generated with OpenCL
Rendered with OpenGL
Post-processed with OpenCL kernels

47
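A hedged sketch of that sharing path using the cl_khr_gl_sharing extension (the GL buffer id, the GL-sharing context setup and the kernel are assumed to exist; creating the shared context itself is platform-specific and omitted):

// Requires <CL/cl_gl.h> and a context created with GL-sharing properties.
cl_mem cl_vbo = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, gl_vbo, &err);

// OpenGL must not touch the buffer while OpenCL writes to it.
err = clEnqueueAcquireGLObjects(cmd_queue, 1, &cl_vbo, 0, NULL, NULL);

err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &cl_vbo);
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);

// Hand the buffer back to OpenGL for rendering.
err = clEnqueueReleaseGLObjects(cmd_queue, 1, &cl_vbo, 0, NULL, NULL);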

Page 48: Competitor

Competitor

DirectCompute by Microsoft
Bundled with DirectX 11
Requires a DX10 or DX11 graphics card
Requires Windows Vista or 7
Close to OpenCL feature-wise

Internet Explorer 9 and Firefox 3.7 both use DirectX to speed up DOM tree rendering (Windows only)

48

Page 49: Overview

Overview

With OpenCL:

Leverage CPUs, GPUs and other processors to accelerate parallel computation

Get dramatic speedups for computationally intensive applications

Write accelerated, portable code across different devices and architectures

49

Page 50: Getting Started…

Getting Started…

ATI Stream SDK – support for OpenCL/OpenGL and OpenCL/DirectX interoperability
http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx

CUDA Toolkit
http://developer.nvidia.com/object/cuda_3_0_downloads.html

OpenCL.NET – OpenCL wrapper for .NET languages
http://www.hoopoe-cloud.com/Solutions/OpenCL.NET/Default.aspx

50

Page 51: The End? No… The Beginning

The End? No… The Beginning

51

Page 52: References

References

52

http://www.macresearch.org/opencl_episode1
http://developer.amd.com/GPU/ATISTREAMSDK/pages/TutorialOpenCL.aspx
http://en.wikipedia.org/wiki/Stream_Processing
http://techreport.com/articles.x/11211
http://www.geeks3d.com/20100115/gpu-computing-geforce-and-radeon-opencl-test-part-1/
http://gpgpu.org/about
http://developer.apple.com/Mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/WhatisOpenCL/WhatisOpenCL.html
http://www.khronos.org/developers/library/overview/opencl_overview.pdf
http://gpgpu.org/wp/wp-content/uploads/2009/09/C1-OpenCL-API.pdf