© 2009 Nokia V1-OpenCLEnbeddedProfilePresentation.ppt / 2009-02-26 / JyrkiLeskelä


OpenCL Embedded Profile

Presentation for Multicore Expo, 16 March 2009
V0.3 Improved draft – Still need some work

Kari Pulli, Nokia Research Center
Jyrki Leskelä, Nokia Devices R&D / Technology Renewal


OpenCL Embedded Profile - Basics


OpenCL Relation to Khronos Embedded Ecosystem


OpenCL 1.0 Embedded Profile One-Slider


Embedded Profile Main Differences

The embedded profile is defined to be a subset for each version of OpenCL:

• Online compiler is optional

• No 64-bit integers (long, ulong), including the corresponding vector types

• Float 2D/3D images can only be used with nearest neighbor sampling

• The macro __EMBEDDED_PROFILE__ is added to the language, and a CL_PLATFORM_PROFILE query returns the string EMBEDDED_PROFILE if the OpenCL implementation supports only the embedded profile.

• Minimum requirements for constant buffer size, object allocation size, constant argument count and local memory size are scaled down.

• Image support and floating-point support are aligned with OpenGL ES 2.0 texture requirements

Extensions of the full profile can be applied to the embedded profile
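As a minimal sketch of the profile check described above: a host program would typically obtain the profile string by calling clGetPlatformInfo with CL_PLATFORM_PROFILE and then compare it against the two strings the specification defines. The helper name is_embedded_only is hypothetical, and the string is passed in directly so that the sketch does not depend on an OpenCL installation.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the profile string reported by the
 * platform says that only the embedded profile is supported, 0 otherwise.
 * The string is assumed to come from clGetPlatformInfo(platform,
 * CL_PLATFORM_PROFILE, ...); a full-profile platform reports "FULL_PROFILE". */
int is_embedded_only(const char *profile_string)
{
    if (profile_string == NULL) {
        return 0;   /* query failed or was never made */
    }
    return strcmp(profile_string, "EMBEDDED_PROFILE") == 0;
}
```

An application could use the result to choose between a kernel variant that relies on full-profile features and one restricted to the embedded subset.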


Floating Point Numbers in Embedded Profile

• INF and NAN values for floats are not mandated

• Accuracy requirements of some single precision floating-point operations are relaxed from full profile:

• x / y <= 3 ulp

• exp <= 4 ulp

• log <= 4 ulp

• Float add, sub, mul and mad may round toward zero, giving an error <= 1 ulp, to accommodate strict HW area budgets.

• Denormalized numbers for the half float data type can be flushed to zero.

• The precision of conversions from normalized integers is <= 2 ulp for the embedded profile (instead of <= 1.5 ulp)
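To make the ulp bounds above concrete, the following sketch measures the error of a single-precision result against a double-precision reference in units of ulp. It is an illustration under assumptions: the function names are hypothetical, the ulp is approximated as the spacing to the next larger float at the reference value (the OpenCL specification defines ulp more carefully), and only finite, normal inputs are handled.

```c
#include <stdint.h>
#include <string.h>

/* Spacing between |x| and the next larger float, used here as "one ulp at x".
 * Assumes x is finite; infinities and NaN are not handled. */
static double ulp_of(float x)
{
    float ax = x < 0.0f ? -x : x;
    uint32_t bits;
    float next;
    memcpy(&bits, &ax, sizeof bits);
    bits += 1;                          /* next representable float above ax */
    memcpy(&next, &bits, sizeof next);
    return (double)next - (double)ax;
}

/* Error of `computed` relative to a higher-precision `reference`, in ulps. */
double ulp_error(float computed, double reference)
{
    double diff = (double)computed - reference;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff / ulp_of((float)reference);
}

/* Returns 1 if the result is within max_ulp of the reference,
 * e.g. max_ulp = 3.0 for x / y in the embedded profile. */
int within_ulp(float computed, double reference, double max_ulp)
{
    return ulp_error(computed, reference) <= max_ulp;
}
```

A test suite could call within_ulp with 3.0 for division and 4.0 for exp and log, matching the relaxed bounds listed on the slide.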


Image Support in Embedded Profile

• Image support is an optional feature within an OpenCL device

• If images are supported, the minimum requirements for the supported image capabilities are lowered to the level of OpenGL ES 2.0 textures

• Kernel must be able to read >= 8 simultaneous image objects

• Kernel must be able to write >= 1 simultaneous image object

• Width and height of 2D image >= 2048

• Number of samplers >= 8

• Image formats are similar to corresponding OpenGL ES 2.0 texture formats

• Support for 3D images is optional for embedded implementations
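The minimums listed above can be checked mechanically. The sketch below assumes the device limits have already been read into a plain struct, for example with clGetDeviceInfo and the queries named in the comments; the struct image_caps and the function meets_embedded_image_minimums are hypothetical names introduced only for this illustration.

```c
/* Hypothetical container for limits reported by an OpenCL device. */
typedef struct {
    unsigned      max_read_image_args;   /* CL_DEVICE_MAX_READ_IMAGE_ARGS  */
    unsigned      max_write_image_args;  /* CL_DEVICE_MAX_WRITE_IMAGE_ARGS */
    unsigned long image2d_max_width;     /* CL_DEVICE_IMAGE2D_MAX_WIDTH    */
    unsigned long image2d_max_height;    /* CL_DEVICE_IMAGE2D_MAX_HEIGHT   */
    unsigned      max_samplers;          /* CL_DEVICE_MAX_SAMPLERS         */
} image_caps;

/* Returns 1 if the limits satisfy the embedded-profile minimums from the
 * slide: 8 readable images, 1 writable image, 2048 x 2048 2D images and
 * 8 samplers. Only meaningful when the device reports image support. */
int meets_embedded_image_minimums(const image_caps *caps)
{
    return caps->max_read_image_args  >= 8
        && caps->max_write_image_args >= 1
        && caps->image2d_max_width    >= 2048
        && caps->image2d_max_height   >= 2048
        && caps->max_samplers         >= 8;
}
```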


Potential Mobile Device Use-Cases

• Image post-processing and enhancement

• Image editing software

• Compatibility for devices lacking high-end imaging HW

• Machine vision, Local media search, Augmented reality

• Support emerging new coding schemes quickly
  • For example web-originated media codecs

• Streaming math/algorithm libraries

• Physics modeling

• Gaming engines and WOW effects


Potential Benefits for Mobile Devices

• Easier programming in a heterogeneous processor environment
  • Instead of learning different programming methods for CPU, GPU, DSP
  • The OpenCL framework also handles event queuing

• Code developed once will run on future hardware
  • If the application conforms to the specification, it will run
  • The OpenCL computing model will be relatively easy to virtualize

• Area- and energy-constrained embedded devices
  • Computing power of each computing device is close to the "sweet spot"
  • Allocating the workload to multiple computing devices is valuable


Example Case 1: Split computation


Split computation: Image Post Processing

[Diagram: the host application issues CL API calls; the camera image goes through OpenCL post-processing on both the CPU and the GPU, with CL buffers carrying the data on to rendering.]


Image Post-Processing Kernel Program

/* 2D convolution over a uchar4 image; one work-item per output pixel.
   kernel_dim is assumed to be odd. */
__kernel void convolution( __global const uchar4 *srcdata,
                           __global uchar4 *destdata,
                           __global const float *filter,  /* "kernel" is a reserved word */
                           float kernel_multiplier,
                           float kernel_bias,
                           int kernel_dim )
{
    int x = get_global_id( 0 ), y = get_global_id( 1 );
    int sizex = get_global_size( 0 ), sizey = get_global_size( 1 );
    int half_kernel = kernel_dim / 2;
    float4 sum = (float4)( 0.0f );  /* accumulate in float, starting from zero */

    for( int j = y - half_kernel, kj = 0; j <= y + half_kernel; j++, kj++ ) {
        if( ( j >= 0 ) && ( j < sizey ) ) {          /* skip rows outside the image */
            for( int i = x - half_kernel, ki = 0; i <= x + half_kernel; i++, ki++ ) {
                if( ( i >= 0 ) && ( i < sizex ) ) {  /* skip columns outside the image */
                    sum += convert_float4( srcdata[ j * sizex + i ] )
                           * filter[ kj * kernel_dim + ki ];
                }
            }
        }
    }
    sum = sum * kernel_multiplier + kernel_bias;
    destdata[ y * sizex + x ] = convert_uchar4_sat( sum );
}
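A plain C reference of the same convolution can be useful for checking kernel output on the host, or for processing the CPU's share of the rows when the image is split between devices. The sketch below is an illustration and not code from the presentation: it handles a single 8-bit channel instead of uchar4, the names convolve_rows and saturate_u8 are hypothetical, and kernel_dim is assumed to be odd.

```c
#include <stdint.h>

/* Clamp to 0..255 and truncate toward zero, similar in spirit to the
 * saturating conversion at the end of the OpenCL kernel. */
static uint8_t saturate_u8(float v)
{
    if (v <= 0.0f) return 0;
    if (v >= 255.0f) return 255;
    return (uint8_t)v;
}

/* Convolve rows [row_begin, row_end) of a single-channel image.
 * Taps that fall outside the image are skipped, as in the kernel. */
void convolve_rows(const uint8_t *src, uint8_t *dst, int width, int height,
                   int row_begin, int row_end,
                   const float *filter, int kernel_dim,
                   float multiplier, float bias)
{
    int half = kernel_dim / 2;   /* kernel_dim is assumed odd */
    for (int y = row_begin; y < row_end; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int kj = 0; kj < kernel_dim; kj++) {
                int j = y - half + kj;
                if (j < 0 || j >= height) continue;
                for (int ki = 0; ki < kernel_dim; ki++) {
                    int i = x - half + ki;
                    if (i < 0 || i >= width) continue;
                    sum += (float)src[j * width + i]
                           * filter[kj * kernel_dim + ki];
                }
            }
            dst[y * width + x] = saturate_u8(sum * multiplier + bias);
        }
    }
}
```

Calling it once per row range (for example rows 0 to h/2 on one device and h/2 to h on another) mirrors the row-wise split discussed in this example case.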


Split computation: Speedup

• t_cpu is the time to process the task with only the CPU, t_gpu is the time to process the task with only the GPU, and t_gpuif is the time to transfer the data between the CPU and the GPU (the transfer is modeled to be CPU bound).

• In this case, the speed-optimal workload split between CPU and GPU would yield the following execution time:

[Equation: t_split as a function of t_cpu, t_gpu and t_gpuif]

Example: t_gpu = k t_cpu, k ∈ 0.5 … 1.5

t_gpuif = 0.1 t_cpu

Comparison of total execution times:

[Plot: execution time t (in units of t_cpu, axis from 0 to 1.8) versus k, with curves for t_cpu, t_gpu and t_split]
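As a rough illustration of how such a split can be computed, the sketch below uses one simple model that is an assumption of this example and not necessarily the formula used on the slide. A fraction a of the work goes to the GPU; the GPU path takes a (t_gpu + t_gpuif), the CPU path takes (1 - a) t_cpu plus the CPU-bound transfer a t_gpuif, and the split is chosen so that both paths finish together. This gives a = t_cpu / (t_cpu + t_gpu). The names split_plan and plan_split are hypothetical.

```c
#include <assert.h>

/* Result of the assumed split model. */
typedef struct {
    double gpu_share;   /* fraction of the work sent to the GPU */
    double t_split;     /* total execution time with the split   */
} split_plan;

/* Assumed model (not taken from the slide):
 *   GPU path: gpu_share * (t_gpu + t_gpuif)
 *   CPU path: (1 - gpu_share) * t_cpu + gpu_share * t_gpuif
 * Both paths are equal when gpu_share = t_cpu / (t_cpu + t_gpu). */
split_plan plan_split(double t_cpu, double t_gpu, double t_gpuif)
{
    split_plan plan;
    assert(t_cpu > 0.0 && t_gpu > 0.0 && t_gpuif >= 0.0);
    plan.gpu_share = t_cpu / (t_cpu + t_gpu);
    plan.t_split = plan.gpu_share * (t_gpu + t_gpuif);
    return plan;
}
```

With the slide's example values at k = 1 (t_gpu = t_cpu, t_gpuif = 0.1 t_cpu), this model sends half of the work to the GPU and finishes in 0.55 t_cpu.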


Split computation: Energy efficiency

• t_cpu, t_gpu and t_gpuif are as on the previous slide.

• p_cpu, p_gpu and p_gpuif are the average battery power drain of CPU execution, GPU execution and data transfer between CPU and GPU, respectively.

• p_split is the average power drain when the computation is time-optimally split between CPU and GPU. C_split is the corresponding consumed battery capacity, the product of power and time.

\[
p_{split} = p_{cpu} + \frac{p_{gpu}\, t_{gpu}}{t_{gpu} + t_{gpuif}} + \frac{p_{gpuif}\, t_{gpuif}}{t_{gpu} + t_{gpuif}}, \qquad C_{split} = p_{split}\, t_{split}
\]

Example: t_gpu = k t_cpu, k ∈ 0.5 … 1.5

t_gpuif = 0.1 t_cpu

p_gpu = 0.5 p_cpu

p_gpuif = 0.1 p_cpu

Total consumption of battery capacity:

[Plot: consumed battery capacity C (in units of C_cpu, axis from 0 to 1.2) versus k from 0.5 to 1.5, with curves for C_cpu, C_gpu and C_split]
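The energy bookkeeping can be sketched in a few lines of C. The sketch reads the slide's definitions as follows, which is an interpretation rather than a confirmed formula: the CPU draws p_cpu for the whole split run, while the GPU side alternates between computing and transferring in the proportion t_gpu : t_gpuif, so its power is a time-weighted average. The function names are hypothetical, and t_split is passed in rather than derived.

```c
#include <assert.h>

/* Average power during a split run: full CPU power plus the time-weighted
 * average of GPU compute power and CPU-GPU transfer power. */
double split_power(double p_cpu, double p_gpu, double p_gpuif,
                   double t_gpu, double t_gpuif)
{
    assert(t_gpu + t_gpuif > 0.0);
    return p_cpu + (p_gpu * t_gpu + p_gpuif * t_gpuif) / (t_gpu + t_gpuif);
}

/* Consumed battery capacity: average power multiplied by execution time. */
double split_capacity(double p_split, double t_split)
{
    return p_split * t_split;
}
```

With the slide's example ratios at k = 1 and all quantities expressed relative to the CPU-only case, split_power(1.0, 0.5, 0.1, 1.0, 0.1) is about 1.46, so the split only saves energy if t_split is below roughly 0.68 t_cpu.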


More Example Cases


Pipelining: Mixing computation and graphics

[Diagram: on a device with DSP, CPU and GPU, the host application issues CL API calls and GL API calls; an OpenCL fractal animation stage produces a texture for OpenGL ES 2.0 rendering, with data held in CL buffers, a GL texture and a GL renderbuffer.]


Multimedia Frameworks: OpenMAX environment

More portability by using OpenCL in some hotspots

Diagram Copyright © 2009 Khronos Group


Summary


• OpenCL 1.0 Embedded Profile is a subset of the full profile
  • Not an "ES" specification of its own

• Easier programming of heterogeneous multiprocessors
  • Fast multiprocessor code without portability hassle

• Speedups and energy efficiency via parallelism
  • Parallelize a uniform task to different processors
  • Split pipeline stages to different processors


Demo


Demo: Magnification Lens

• Internal development environment for evaluating the OpenCL Embedded Profile

• Early pilot version only
  • No conformance test coverage at the moment

• Runs on
  • N810 (OMAP2420 CPU)
  • Zoom MDK (OMAP3430 CPU+SIMD+DSP)

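• The lens effect is a mapping of the original image f(x,y) into a modified image g(x,y) as a piecewise continuous function

\[
g(x, y) =
\begin{cases}
f(x, y), & r \ge R_o \\[4pt]
f\!\left(x_c + (x - x_c)\left(1 - \left(1 - \frac{1}{M}\right)\frac{R_o - r}{R_o - R_i}\right),\;
         y_c + (y - y_c)\left(1 - \left(1 - \frac{1}{M}\right)\frac{R_o - r}{R_o - R_i}\right)\right), & R_i < r < R_o \\[4pt]
f\!\left(x_c + \frac{x - x_c}{M},\; y_c + \frac{y - y_c}{M}\right), & r \le R_i
\end{cases}
\qquad
r = \sqrt{(x - x_c)^2 + (y - y_c)^2}
\]

where R_o and R_i are the outer and inner boundaries of the lens frame, (x_c, y_c) is the center point of the lens, and M is the magnification factor in the center area of the lens.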
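A lens mapping of this kind can be sketched as a function that returns, for each output pixel, the source coordinates to sample. The sketch assumes the usual reading of such a lens: no change outside the outer radius, a scale of 1/M inside the inner radius, and a linear blend between the two across the lens frame so that the mapping stays continuous. The names point2, approx_sqrt and lens_source are hypothetical. A few Newton iterations stand in for the square root only to keep the sketch free of library dependencies; real code would call sqrt, or the sqrt built-in inside an OpenCL kernel.

```c
typedef struct { double x, y; } point2;

/* Newton iteration for the square root (stand-in for sqrt, see lead-in). */
static double approx_sqrt(double v)
{
    double r;
    if (v <= 0.0) return 0.0;
    r = v > 1.0 ? v : 1.0;              /* start at or above the root */
    for (int i = 0; i < 60; i++) {
        r = 0.5 * (r + v / r);
    }
    return r;
}

/* Source coordinates in f for output pixel (x, y) of g.
 * (xc, yc): lens center, r_in / r_out: inner and outer frame radius,
 * mag: magnification M in the center area. Assumes r_out > r_in, mag > 0. */
point2 lens_source(double x, double y, double xc, double yc,
                   double r_in, double r_out, double mag)
{
    double dx = x - xc, dy = y - yc;
    double r = approx_sqrt(dx * dx + dy * dy);
    double scale;
    point2 p;

    if (r >= r_out) {
        scale = 1.0;                    /* outside the lens: unchanged */
    } else if (r <= r_in) {
        scale = 1.0 / mag;              /* center area: magnified by M */
    } else {                            /* frame: blend from 1/M to 1  */
        scale = 1.0 - (1.0 - 1.0 / mag) * (r_out - r) / (r_out - r_in);
    }
    p.x = xc + dx * scale;
    p.y = yc + dy * scale;
    return p;
}
```

A kernel would evaluate this per work-item and write f at the returned coordinates into g(x, y), typically with nearest-neighbor or bilinear sampling.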