© 2009 Nokia V1-OpenCLEnbeddedProfilePresentation.ppt / 2009-02-26 / JyrkiLeskelä
OpenCL Embedded Profile
Presentation for Multicore Expo, 16 March 2009
V0.3 Improved draft (still needs some work)
Kari Pulli, Nokia Research Center
Jyrki Leskelä, Nokia Devices R&D / Technology Renewal
OpenCL Embedded Profile - Basics
OpenCL Relation to Khronos Embedded Ecosystem
OpenCL 1.0 Embedded Profile One-Slider
Embedded Profile Main Differences
The embedded profile is defined to be a subset for each version of OpenCL:
• Online compiler is optional
• No 64-bit integers (long, ulong) or their vector types
• Float 2D/3D images can only be used with nearest neighbor sampling
• The macro __EMBEDDED_PROFILE__ is added to the language, and the CL_PLATFORM_PROFILE query returns the string EMBEDDED_PROFILE if the OpenCL implementation supports only the embedded profile.
• Minimum requirements for constant buffer size, object allocation size, constant argument count and local memory size are scaled down.
• Image support and floating-point support are aligned with OpenGL ES 2.0 texture requirements
The extensions of the full profile can also be applied to the embedded profile
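A host application can branch on the profile string before relying on full-profile features. The sketch below only classifies the string; the helper name is hypothetical, and on a real system the string would be obtained with clGetPlatformInfo and the CL_PLATFORM_PROFILE query rather than passed in by hand.

```c
#include <string.h>

/* Profile strings that OpenCL defines for the CL_PLATFORM_PROFILE query. */
#define PROFILE_FULL     "FULL_PROFILE"
#define PROFILE_EMBEDDED "EMBEDDED_PROFILE"

/*
 * Hypothetical helper (not part of the OpenCL API): returns 1 if the given
 * profile string names the embedded profile, 0 otherwise.
 * The string is assumed to come from
 *   clGetPlatformInfo(platform, CL_PLATFORM_PROFILE, size, buffer, NULL);
 */
int is_embedded_profile(const char *profile)
{
    return profile != NULL && strcmp(profile, PROFILE_EMBEDDED) == 0;
}
```

An application could use the result to skip code paths that need 64-bit integers or an online compiler.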
Floating Point Numbers in Embedded Profile
• INF and NAN values for floats are not mandated
• Accuracy requirements of some single precision floating-point operations are relaxed from full profile:
• x / y <= 3 ulp
• exp <= 4 ulp
• log <= 4 ulp
• Float add, sub, mul and mad can be rounded toward zero, giving an error <= 1 ulp; this relaxation reflects tight hardware area budgets.
• Denormalized numbers for the half float data type can be flushed to zero.
• The precision of conversions from normalized integers is <= 2 ulp for the embedded profile (instead of <= 1.5 ulp)
Image Support in Embedded Profile
• Image support is an optional feature within an OpenCL device
• If Images are supported, the minimum requirements for the supported image capabilities are lowered to the level of OpenGL ES 2.0 textures
• A kernel must be able to read >= 8 image objects simultaneously
• A kernel must be able to write >= 1 image object simultaneously
• Maximum supported width and height of a 2D image >= 2048
• Number of samplers >= 8
• Image formats are similar to corresponding OpenGL ES 2.0 texture formats
• Support for 3D images is optional for embedded implementations
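The minimums listed above can be checked against what a device reports. The sketch below assumes the values have already been read with clGetDeviceInfo (the matching query names are given in the comments); the struct and the function are hypothetical and exist only for illustration.

```c
/* Hypothetical container for values that clGetDeviceInfo would report. */
struct image_caps {
    int           image_support;   /* CL_DEVICE_IMAGE_SUPPORT */
    unsigned      max_read_args;   /* CL_DEVICE_MAX_READ_IMAGE_ARGS */
    unsigned      max_write_args;  /* CL_DEVICE_MAX_WRITE_IMAGE_ARGS */
    unsigned long max_2d_width;    /* CL_DEVICE_IMAGE2D_MAX_WIDTH */
    unsigned long max_2d_height;   /* CL_DEVICE_IMAGE2D_MAX_HEIGHT */
    unsigned      max_samplers;    /* CL_DEVICE_MAX_SAMPLERS */
};

/* Returns 1 if the reported capabilities satisfy the embedded-profile
 * minimums from the slide, 0 otherwise. Images are optional, so a device
 * without image support passes trivially. */
int meets_embedded_image_minimums(const struct image_caps *c)
{
    if (!c->image_support)
        return 1;
    return c->max_read_args  >= 8 &&
           c->max_write_args >= 1 &&
           c->max_2d_width   >= 2048 &&
           c->max_2d_height  >= 2048 &&
           c->max_samplers   >= 8;
}
```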
Potential Mobile Device Use-Cases
• Image post-processing and enhancement
• Image editing software
• Compatibility for devices lacking high-end imaging HW
• Machine vision, Local media search, Augmented reality
• Support emerging new coding schemes quickly
  • For example, web-originated media codecs
• Streaming math/algorithm libraries
• Physics modeling
• Gaming engines and WOW effects
Potential Benefits for Mobile Devices
• Easier programming in a heterogeneous processor environment
  • No need to learn different programming methods for CPU, GPU and DSP
• The OpenCL framework also handles event queuing
• Code developed once will run on future hardware
  • If the application conforms to the specification, it will run
• The OpenCL computing model will be relatively easy to virtualize
• Area- and energy-constrained embedded devices
  • Computing power of each computing device is close to the "sweet spot"
• Allocating the workload to multiple computing devices is valuable
Example Case 1: Split computation
Split computation: Image Post Processing
[Diagram. Labels: CPU, GPU, Host Application, CL API Calls, Camera Image, OpenCL Post-Processing (two blocks), CL Buffer (two), Render]
Image Post-Processing Kernel Program

__kernel void convolution( __global const uchar4 *srcdata,
                           __global uchar4 *destdata,
                           __global const float *coeffs,  /* renamed from "kernel", a reserved word in OpenCL C */
                           float kernel_multiplier,
                           float kernel_bias,
                           int kernel_dim )               /* assumed to be odd */
{
    /* One work-item per output pixel. */
    int x = get_global_id(0), y = get_global_id(1);
    int sizex = get_global_size(0), sizey = get_global_size(1);
    int half_kernel = kernel_dim / 2;
    float4 sum = (float4)(0.0f);   /* accumulate in float, starting from zero */

    for( int j = y - half_kernel, kj = 0; j <= y + half_kernel; j++, kj++ )
    {
        if( ( j >= 0 ) && ( j < sizey ) )   /* skip rows outside the image */
        {
            for( int i = x - half_kernel, ki = 0; i <= x + half_kernel; i++, ki++ )
            {
                if( ( i >= 0 ) && ( i < sizex ) )   /* skip columns outside the image */
                {
                    sum += convert_float4( srcdata[ j * sizex + i ] ) * coeffs[ kj * kernel_dim + ki ];
                }
            }
        }
    }
    sum = sum * kernel_multiplier + kernel_bias;
    destdata[ y * sizex + x ] = convert_uchar4_sat( sum );
}
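For checking kernel output on the host, a plain C reference of the same convolution can be handy. The sketch below is an assumption-laden simplification, not the presenters' code: it handles a single 8-bit channel instead of uchar4, assumes an odd kernel_dim, skips taps that fall outside the image, and mimics the saturating conversion by clamping to 0..255 and truncating. All names are hypothetical.

```c
/* Clamp to 0..255 and truncate toward zero, approximating a saturating
 * float-to-uchar conversion. */
static unsigned char saturate_u8(float v)
{
    if (v <= 0.0f)   return 0;
    if (v >= 255.0f) return 255;
    return (unsigned char)v;
}

/* Single-channel reference convolution over a sizex-by-sizey image.
 * coeffs holds kernel_dim * kernel_dim weights in row-major order. */
void convolve_ref(const unsigned char *src, unsigned char *dst,
                  int sizex, int sizey,
                  const float *coeffs, int kernel_dim,
                  float multiplier, float bias)
{
    int half = kernel_dim / 2;
    for (int y = 0; y < sizey; y++) {
        for (int x = 0; x < sizex; x++) {
            float sum = 0.0f;
            for (int kj = 0; kj < kernel_dim; kj++) {
                int j = y - half + kj;
                if (j < 0 || j >= sizey) continue;      /* row outside image */
                for (int ki = 0; ki < kernel_dim; ki++) {
                    int i = x - half + ki;
                    if (i < 0 || i >= sizex) continue;  /* column outside image */
                    sum += (float)src[j * sizex + i] * coeffs[kj * kernel_dim + ki];
                }
            }
            dst[y * sizex + x] = saturate_u8(sum * multiplier + bias);
        }
    }
}
```

With a kernel whose only non-zero weight is a 1 at the center, the output should equal the input, which makes a quick sanity test.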
Split computation: Speedup
• tcpu is the time to process the task on the CPU alone, tgpu is the time to process it on the GPU alone, and tgpuif is the time to transfer the data between CPU and GPU (the transfer is modeled as CPU bound).
• In this case, the speed-optimal workload split between CPU and GPU would yield the following execution time:
Example: tgpu = k tcpu , k ∈ 0.5 … 1.5
tgpuif = 0.1 tcpu
Comparison of total execution times:
[Equation: tsplit as a function of tcpu, tgpu and tgpuif]
[Plot: execution time t, in units of tcpu, versus k; curves tcpu, tgpu and tsplit]
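To make the idea of a speed-optimal split concrete, here is a small sketch under an assumed model that is not taken from the slide: a fraction a of the work goes to the GPU path, which then takes a * (t_gpu + t_gpuif), while the CPU spends a * t_gpuif on the transfer plus (1 - a) * t_cpu on its own share. Both sides finish together when a = t_cpu / (t_cpu + t_gpu). The function names are hypothetical.

```c
/* Fraction of the work sent to the GPU so that CPU and GPU finish together.
 * Assumed model (not from the slide):
 *   GPU path time = a * (t_gpu + t_gpuif)
 *   CPU time      = (1 - a) * t_cpu + a * t_gpuif
 * Setting both equal gives a = t_cpu / (t_cpu + t_gpu). */
double gpu_fraction(double t_cpu, double t_gpu)
{
    return t_cpu / (t_cpu + t_gpu);
}

/* Total execution time of the balanced split under the same assumed model. */
double split_time(double t_cpu, double t_gpu, double t_gpuif)
{
    double a = gpu_fraction(t_cpu, t_gpu);
    return a * (t_gpu + t_gpuif);
}
```

Under this model the split beats CPU-only execution whenever t_gpuif is smaller than t_cpu. With the example values t_gpu = t_cpu and t_gpuif = 0.1 t_cpu, the split time is 0.55 t_cpu.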
Split computation: Energy efficiency
• tcpu, tgpu and tgpuif from the previous slide.
• pcpu, pgpu and pgpuif are the average battery power drains of CPU execution, GPU execution and data transfer between CPU and GPU, respectively.
• psplit is the average power drain when the computation is time-optimally split between CPU and GPU. Csplit is the corresponding battery capacity consumed, the product of power and time.
Example: tgpu = k tcpu , k ∈ 0.5 … 1.5
tgpuif = 0.1 tcpu
pgpu = 0.5 pcpu
pgpuif = 0.1 pcpu
Total consumption of battery capacity:
$$
p_{split} = p_{cpu} + \frac{p_{gpu}\, t_{gpu}}{t_{gpu} + t_{gpuif}} + \frac{p_{gpuif}\, t_{gpuif}}{t_{gpu} + t_{gpuif}}, \qquad C_{split} = p_{split}\, t_{split}
$$

[Plot: battery capacity consumed C, in units of Ccpu, versus k from 0.5 to 1.5; curves Ccpu, Cgpu and Csplit]
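The capacity estimate can be sketched in a few lines of C. The sketch assumes that the average power during a split run is the CPU power plus the GPU and transfer powers, each weighted by its share of the GPU-path time t_gpu + t_gpuif; that reading is an assumption. The split time is passed in rather than computed, and the function names are hypothetical.

```c
/* Average power drain of a split run, assuming the CPU is busy throughout
 * and the GPU path alternates between compute and transfer in proportion
 * to t_gpu and t_gpuif. */
double split_power(double p_cpu, double p_gpu, double p_gpuif,
                   double t_gpu, double t_gpuif)
{
    double gpu_path = t_gpu + t_gpuif;
    return p_cpu
         + p_gpu   * t_gpu   / gpu_path
         + p_gpuif * t_gpuif / gpu_path;
}

/* Battery capacity consumed: average power times execution time. */
double split_capacity(double p_split, double t_split)
{
    return p_split * t_split;
}
```

With the example values (p_cpu = 1, p_gpu = 0.5, p_gpuif = 0.1, t_gpu = 1, t_gpuif = 0.1, all relative to the CPU), split_power gives about 1.46 p_cpu.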
More Example Cases
Pipelining: Mixing computation and graphics
[Diagram. Labels: DSP, CPU, GPU, Host Application, CL API Calls, GL API Calls, OpenCL Fractal Anim., Texture, OpenGL ES 2.0 Rendering, CL Buffer (two), GL Texture, GL Renderbuffer]
Multimedia Frameworks: OpenMAX environment
More portability by using OpenCL in some hotspots
Diagram Copyright © 2009 Khronos Group
Summary
• OpenCL 1.0 Embedded Profile is a subset of the full profile
  • Not an "ES" specification of its own
• Easier programming of heterogeneous multiprocessor systems
  • Fast multiprocessor code without portability hassle
• Speedups and energy efficiency via parallelism
  • Parallelize a uniform task across different processors
  • Split pipeline stages across different processors
Demo
Demo: Magnification Lens
• Internal development environment for evaluating the OpenCL Embedded Profile
• Early pilot version only
  • No conformance test coverage at the moment
• Runs on
  • N810 (OMAP2420 CPU)
  • Zoom MDK (OMAP3430 CPU+SIMD+DSP)
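• The lens effect maps the original image f(x,y) into the modified image g(x,y) as a piecewise continuous function:

$$
g(x,y) =
\begin{cases}
f(x,\, y), & r \ge R_o \\
f\left( x_c + (x - x_c)\left(1 - \left(1 - \frac{1}{M}\right)\frac{R_o - r}{R_o - R_i}\right),\; y_c + (y - y_c)\left(1 - \left(1 - \frac{1}{M}\right)\frac{R_o - r}{R_o - R_i}\right) \right), & R_i \le r < R_o \\
f\left( x_c + \frac{x - x_c}{M},\; y_c + \frac{y - y_c}{M} \right), & r < R_i
\end{cases}
\qquad r = \sqrt{(x - x_c)^2 + (y - y_c)^2}
$$

where $R_o$ and $R_i$ are the outer and inner boundaries of the lens frame, $(x_c, y_c)$ is the center point of the lens, and $M$ is the magnification factor in the center area of the lens.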