
APARAPI
Java™ platform’s ‘Write Once Run Anywhere’® now includes the GPU

Gary Frost, AMD PMTS, Java Runtime Team
June 2011


AGENDA

The age of heterogeneous computing is here

The supercomputer in your desktop/laptop

Why Java ™?

Current GPU programming options for Java developers

Are developers likely to adopt emerging Java OpenCL™/CUDA™ bindings?

Aparapi

– What it is

– How it works

Performance

Examples/Demos

Proposed Enhancements

Future work


THE AGE OF HETEROGENEOUS COMPUTE IS HERE

GPUs originally developed to accelerate graphics operations

Early adopters repurposed their GPUs for ‘general compute’ by performing ‘unnatural acts’ with shader APIs

OpenGL allowed shaders/textures to be compiled and executed via extensions

OpenCL™/GLSL/CUDA™ standardized/formalized how to express GPU compute and simplified host programming

New programming models are emerging and lowering barriers to adoption


THE SUPERCOMPUTER IN YOUR DESKTOP

Some interesting tidbits from http://www.top500.org/

– November 2000

“ASCI White is new #1 with 4.9 TFlops on the Linpack”

http://www.top500.org/lists/2000/11

– November 2002

“3.2 TFlops are needed to enter the top 10”

http://www.top500.org/lists/2002/11

May 2011

– AMD Radeon™ 6990: 5.1 TFlops single-precision performance http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6990/Pages/amd-radeon-hd-6990-overview.aspx#3

WHY JAVA?

One of the most widely used programming languages

– http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

Established in domains likely to benefit from heterogeneous compute

– Big Data, Search, Hadoop+Pig, Finance, GIS, Oil & Gas

Even if applications are not implemented in Java, they may still run on the Java Virtual Machine (JVM)

– JRuby, Jython, Scala, Clojure, Quercus (PHP)

Acts as a good proxy/indicator for enablement of other runtimes/interpreters

– JavaScript, Flash, .NET, PHP, Python, Ruby, Dalvik?

[Pie chart: TIOBE programming language popularity — Java 18.16%, C 16.17%, C++ 9.14%, C# 7.54%, PHP 6.51%, Objective-C 5.01%, Python 4.58%, Other 32.89%]


GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS

Emerging Java GPU APIs require coding a ‘Kernel’ in a domain-specific language:

    // JOCL/OpenCL kernel code
    __kernel void squares(__global const float *in, __global float *out){
        int gid = get_global_id(0);
        out[gid] = in[gid] * in[gid];
    }

As well as writing the Java ‘host’ CPU-based code to:

– Initialize the data
– Select/initialize the execution device
– Allocate or define memory buffers for args/parameters
– Compile the ‘Kernel’ for the selected device
– Enqueue/send arg buffers to the device
– Execute the kernel
– Read result buffers back from the device
– Clean up (release buffers/queues/device handles)
– Use the results

    import static org.jocl.CL.*;
    import org.jocl.*;

    public class Sample {
        public static void main(String args[]) {
            // Create input and output data
            int size = 10;
            float inArray[] = new float[size];
            float outArray[] = new float[size];
            for (int i = 0; i < size; i++) {
                inArray[i] = i;
            }
            Pointer in = Pointer.to(inArray);
            Pointer out = Pointer.to(outArray);

            // Obtain the platform IDs and initialize the context properties
            cl_platform_id platforms[] = new cl_platform_id[1];
            clGetPlatformIDs(1, platforms, null);
            cl_context_properties contextProperties = new cl_context_properties();
            contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

            // Create an OpenCL context on a GPU device
            cl_context context = clCreateContextFromType(contextProperties,
                    CL_DEVICE_TYPE_GPU, null, null, null);

            // Obtain the cl_device_id for the first device
            cl_device_id devices[] = new cl_device_id[1];
            clGetContextInfo(context, CL_CONTEXT_DEVICES, Sizeof.cl_device_id,
                    Pointer.to(devices), null);

            // Create a command-queue
            cl_command_queue commandQueue =
                    clCreateCommandQueue(context, devices[0], 0, null);

            // Allocate the memory objects for the input and output data
            cl_mem inMem = clCreateBuffer(context,
                    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                    Sizeof.cl_float * size, in, null);
            cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
                    Sizeof.cl_float * size, null, null);

            // Create the program from the source code
            cl_program program = clCreateProgramWithSource(context, 1,
                    new String[]{
                        "__kernel void sampleKernel(" +
                        "    __global const float *in," +
                        "    __global float *out){" +
                        "  int gid = get_global_id(0);" +
                        "  out[gid] = in[gid] * in[gid];" +
                        "}"
                    }, null, null);

            // Build the program
            clBuildProgram(program, 0, null, null, null, null);

            // Create and extract a reference to the kernel
            cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

            // Set the arguments for the kernel
            clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inMem));
            clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outMem));

            // Execute the kernel
            clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                    new long[]{inArray.length}, null, 0, null, null);

            // Read the output data
            clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
                    outArray.length * Sizeof.cl_float, out, 0, null, null);

            // Release kernel, program, and memory objects
            clReleaseMemObject(inMem);
            clReleaseMemObject(outMem);
            clReleaseKernel(kernel);
            clReleaseProgram(program);
            clReleaseCommandQueue(commandQueue);
            clReleaseContext(context);

            for (float f : outArray) {
                System.out.printf("%5.2f, ", f);
            }
        }
    }


ARE DEVELOPERS LIKELY TO ADOPT EMERGING JAVA OPENCL/CUDA BINDINGS?

Some will

– Early adopters

– Prepared to learn new languages

– Motivated to squeeze all the performance they can from available compute devices

– Prepared to implement algorithms both in Java and in CUDA/OpenCL

Many won’t

– OpenCL/CUDA C99 heritage likely to disenfranchise Java developers

Many walked away from C/C++ or possibly never encountered it at all (due to CS education shifts)

Difficulties exposing low-level concepts (such as the GPU memory model) to developers who have ‘moved on’ and just expect the JVM to ‘do the right thing’

Who pays for retraining of Java developers?

– Notion of writing code twice (once for Java execution, again for GPU/APU) is alien

Where’s my ‘Write Once, Run Anywhere’?


WHAT IS APARAPI?

An API for expressing data parallel workloads in Java

– Developer extends a Kernel base class

– Compiles to Java bytecode using existing tool chain

– Uses existing/familiar Java tool chain to debug the logic of their Kernel implementations

A runtime component capable of either:

– Executing Kernel via a Java Thread Pool

– Converting Kernel bytecode to OpenCL and executing on GPU

MyKernel.class → Platform supports OpenCL? → Bytecode can be converted to OpenCL?

– Yes to both: convert the bytecode to OpenCL and execute the OpenCL kernel on the GPU
– No to either: execute the Kernel using a Java thread pool
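
As a sketch of how this fallback surfaces in code (assuming the later open-sourced Aparapi API, where setExecutionMode()/getExecutionMode() exist on Kernel; the in/out arrays are hypothetical):

    import com.amd.aparapi.Kernel;

    final float[] in = new float[1024];
    final float[] out = new float[1024];
    Kernel kernel = new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            out[i] = in[i] * in[i];
        }
    };
    // Request the GPU; Aparapi silently falls back to a Java thread pool
    // (JTP) if no OpenCL device is found or the bytecode cannot be converted.
    kernel.setExecutionMode(Kernel.EXECUTION_MODE.GPU);
    kernel.execute(1024);
    System.out.println("Executed as: " + kernel.getExecutionMode());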


AN EMBARRASSINGLY PARALLEL USE CASE

First let’s revisit our earlier code example

– Calculate square[0..size] for a given input in[0..size]

    final int[] square = new int[size];
    final int[] in = new int[size]; // populating in[0..size] omitted

    for (int i=0; i<size; i++){
        square[i] = in[i] * in[i];
    }

Note that the order we traverse the loop is unimportant

Ideally Java would provide a way to indicate that the body of the loop need not be executed sequentially

Something like a parallel-for?

    parallel-for (int i=0; i<size; i++){
        square[i] = in[i] * in[i];
    }

However, we don’t want to modify the language, compiler or tool chain.


REFACTORING OUR EXAMPLE TO USE APARAPI

    final int[] square = new int[size];
    final int[] in = new int[size]; // populating in[0..size] omitted

    for (int i=0; i<size; i++){
        square[i] = in[i] * in[i];
    }

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            square[i] = in[i] * in[i];
        }
    }.execute(size);


EXPRESSING DATA PARALLELISM IN APARAPI

What happens when we call execute(n)?

    Kernel kernel = new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            square[i] = in[i] * in[i];
        }
    };

kernel.execute(size);

Is this the first execution?

– Yes: Do we have OpenCL, and can the bytecode be converted to OpenCL? If both succeed, convert the bytecode to OpenCL
– With OpenCL available (from this or a previous execution): execute the OpenCL kernel on the GPU
– Otherwise: execute the Kernel using a Java thread pool


FIRST CALL OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE

Reload the class file via a class loader and locate all methods and fields

For ‘run()’ method and all methods reachable from ‘run()’

– Convert method bytecode to an IR

Expression trees

Conditional sequences analyzed and converted to if{}, if{}else{} and for{} constructs

– Create a list of fields accessed by the bytecode

Note the access type (read/write/read+write)

Accessed fields will be turned into args and passed to generated OpenCL

Create an OpenCL buffer for each accessed primitive array (read, write or readwrite)

– Create and Compile OpenCL

Bail and revert to Java Thread Pool if we encounter any issues in previous steps
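
A hypothetical example of a kernel that would fail this conversion (object allocation and String handling sit outside the supported bytecode subset, so the runtime would bail to the thread pool; this snippet is not from the deck):

    new Kernel(){
        @Override public void run(){
            // Object allocation + virtual calls: not convertible to OpenCL,
            // so Aparapi reverts to the Java thread pool for this kernel.
            StringBuilder sb = new StringBuilder("id=");
            sb.append(getGlobalId());
            out[getGlobalId()] = sb.length();
        }
    }.execute(size);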


ALL CALLS OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE

Lock any accessed primitive arrays (so the garbage collector doesn’t move or collect them)

For each field readable by the kernel:

– If field is an array → enqueue a buffer write

– If field is scalar → set kernel arg value

Enqueue Kernel execution

For each array writeable by the kernel:

– Enqueue a buffer read

Wait for all enqueued requests to complete

Unlock accessed primitive arrays


KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS NOT AN OPTION

Create a thread pool

One thread per core

Clone Kernel once for each thread

Each Kernel accessed exclusively from a single thread

Each Kernel loops globalSize/threadCount times

Update globalId, localId, groupSize, globalSize on Kernel instance

Executes run() method on Kernel instance

Wait for all threads to complete
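
A minimal sketch of that dispatch loop (not Aparapi’s actual implementation; MyKernel, its clone(), and setGlobalId() are hypothetical stand-ins, and the block-per-thread split assumes globalSize divides evenly):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    void executeOnPool(final MyKernel kernel, final int globalSize)
            throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors();
        final ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch done = new CountDownLatch(threads);
        final int perThread = globalSize / threads;
        for (int t = 0; t < threads; t++) {
            final MyKernel clone = kernel.clone(); // one clone per thread
            final int start = t * perThread;
            pool.execute(new Runnable(){
                public void run(){
                    for (int id = start; id < start + perThread; id++) {
                        clone.setGlobalId(id); // update ids before each call
                        clone.run();           // invoke the kernel body
                    }
                    done.countDown();
                }
            });
        }
        done.await(); // wait for all threads to complete
        pool.shutdown();
    }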


ADOPTION CHALLENGES (APARAPI VS EMERGING JAVA GPU BINDINGS)

Task | Emerging GPU bindings | Aparapi
Learn OpenCL/CUDA | DIFFICULT | N/A
Locate potential data parallel opportunities | MEDIUM | MEDIUM
Refactor existing code/data structures | MEDIUM | MEDIUM
Create Kernel code | DIFFICULT | EASY
Create code to coordinate execution and buffer transfers | MEDIUM | N/A
Identify GPU performance bottlenecks | DIFFICULT | DIFFICULT
Iterate code/debug algorithm logic | DIFFICULT | MEDIUM
Solve build/deployment issues | DIFFICULT | MEDIUM


MANDELBROT EXAMPLE

    new Kernel(){
        @Override public void run() {
            int gid = getGlobalId();
            float x = (((gid % w) - (w / 2)) / (float) w); // x {-1.0 .. +1.0}
            float y = (((gid / w) - (h / 2)) / (float) h); // y {-1.0 .. +1.0}
            float zx = x, zy = y, new_zx = 0f;
            int count = 0;
            while (count < maxIterations && zx * zx + zy * zy < 8) {
                new_zx = zx * zx - zy * zy + x;
                zy = 2 * zx * zy + y;
                zx = new_zx;
                count++;
            }
            rgb[gid] = palette[count];
        }
    }.execute(width * height);


EXPRESSING DATA PARALLELISM IN JAVA WITH APARAPI BY EXTENDING KERNEL

    class SquareKernel extends Kernel{
        final int[] in, square;
        public SquareKernel(final int[] in){
            this.in = in;
            this.square = new int[in.length];
        }
        @Override public void run(){
            int i = getGlobalId();
            square[i] = in[i] * in[i];
        }
        public int[] square(){
            execute(in.length);
            return square;
        }
    }

    int[] in = new int[size];
    SquareKernel squareKernel = new SquareKernel(in);
    // populating in[0..size] omitted
    int[] result = squareKernel.square();

square() method ‘wraps’ the execution mechanics

Provides a more natural Java API


EXPRESSING DATA PARALLELISM IN APARAPI USING PROPOSED JAVA 8 LAMBDAS

JSR 335 ‘Project Lambda’ proposes addition of ‘lambda’ expressions to Java 8. http://cr.openjdk.java.net/~briangoetz/lambda/lambda-state-3.html

How we expect Aparapi will make use of the proposed Java 8 extensions

    final int[] square = new int[size];
    final int[] in = new int[size]; // populating in[0..size] omitted

    Kernel.execute(size, #{ i -> square[i] = in[i] * in[i]; });

HOW APARAPI EXECUTES ON THE GPU

At runtime Aparapi converts Java bytecode to OpenCL

OpenCL compiler converts OpenCL to device specific ISA (for GPU/APU)

GPU is comprised of multiple SIMD (Single Instruction, Multiple Data) cores

SIMD performance stems from executing the same instruction on different data at the same time

– Think single program counter shared across multiple threads

– All SIMDs executing at the same time (in lock-step)

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            int temp = in[i] * 2;
            temp = temp + 1;
            out[i] = temp;
        }
    }.execute(4);

All four lanes execute the same instruction stream in lockstep:

i=0: int temp = in[0]*2;  temp = temp+1;  out[0] = temp
i=1: int temp = in[1]*2;  temp = temp+1;  out[1] = temp
i=2: int temp = in[2]*2;  temp = temp+1;  out[2] = temp
i=3: int temp = in[3]*2;  temp = temp+1;  out[3] = temp


DEVELOPER IS RESPONSIBLE FOR ENSURING PROBLEM IS DATA PARALLEL

Data dependencies may violate the ‘in any order’ contract:

    for (int i=1; i<100; i++){
        out[i] = out[i-1] + in[i];
    }

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            out[i] = out[i-1] + in[i];
        }
    }.execute(100);

out[i-1] refers to a value resulting from a previous iteration, which may not have been evaluated yet

Loops mutating shared data will need to be refactored or will necessitate atomic operations:

    for (int i=0; i<100; i++){
        sum += in[i];
    }

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            sum += in[i];
        }
    }.execute(100);

sum += in[i] causes a race condition

Almost certainly will not be atomic when translated to OpenCL

Not safe in multi-threaded Java either
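
If atomicity is the route taken, later Aparapi releases expose an atomicAdd intrinsic; a sketch assuming that API (the sum is held in a one-element array so the kernel can mutate it):

    final int[] sum = new int[1];
    new Kernel(){
        @Override public void run(){
            // Serializes the adds, trading parallel throughput for correctness.
            atomicAdd(sum, 0, in[getGlobalId()]);
        }
    }.execute(100);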


SOMETIMES WE CAN REFACTOR TO RECOVER SOME PARALLELISM

    for (int i=0; i<100; i++){
        sum += in[i];
    }

Naïve kernel (race on sum):

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            sum += in[i];
        }
    }.execute(100);

Refactored: ten work-items each sum a disjoint slice into partial[], then a serial pass totals the partials:

    new Kernel(){
        @Override public void run(){
            int n = getGlobalId();
            for (int i=0; i<10; i++){
                partial[n] += data[n*10 + i];
            }
        }
    }.execute(10);

    for (int i=0; i<10; i++){
        sum += partial[i];
    }

Equivalent serial Java for comparison:

    for (int n=0; n<10; n++){
        for (int i=0; i<10; i++){
            partial[n] += data[n*10 + i];
        }
    }
    for (int i=0; i<10; i++){
        sum += partial[i];
    }

TRY TO AVOID BRANCHING WHEREVER POSSIBLE

SIMD performance impacted when code contains branches

– To stay in lockstep SIMDs must process both the 'then' and 'else' blocks

– Use result of 'condition' to predicate instructions (conditionally mask to a no-op)

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            int temp = in[i] * 2;
            if (i % 2 == 0)
                temp = temp + 1;
            else
                temp = temp - 1;
            out[i] = temp;
        }
    }.execute(4);

Each lane executes both predicated paths, masking the one its condition disables:

i=0: int temp = in[0]*2;  <c> = (0%2==0);  if <c> temp=temp+1;  if <!c> temp=temp-1;  out[0]=temp
i=1: int temp = in[1]*2;  <c> = (1%2==0);  if <c> temp=temp+1;  if <!c> temp=temp-1;  out[1]=temp
i=2: int temp = in[2]*2;  <c> = (2%2==0);  if <c> temp=temp+1;  if <!c> temp=temp-1;  out[2]=temp
i=3: int temp = in[3]*2;  <c> = (3%2==0);  if <c> temp=temp+1;  if <!c> temp=temp-1;  out[3]=temp
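
One common mitigation (not from the deck) is to replace the branch with arithmetic every lane executes anyway; a sketch for this example:

    new Kernel(){
        @Override public void run(){
            int i = getGlobalId();
            int temp = in[i] * 2;
            temp += 1 - 2 * (i % 2); // +1 when i is even, -1 when i is odd
            out[i] = temp;
        }
    }.execute(4);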


CHARACTERISTICS OF IDEAL DATA PARALLEL WORKLOADS

Code which iterates over large arrays of primitives

– 32/64 bit data types preferred

– Where the order of iteration is not critical

Avoid data dependencies between iterations

– Each iteration contains sequential code (few branches)

Good balance between data size (low) and compute (high)

– Transfer of data to/from the GPU can be costly

Although APUs likely to mitigate this over time

– Trivial compute often not worth the transfer cost

– May still benefit by freeing up CPU for other work

[Chart: compute (y-axis) vs. data size (x-axis) — the ideal workload pairs high compute with a data size that fits within GPU memory]
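
As a rough, hypothetical back-of-the-envelope (assuming ~6 GB/s sustained host-to-device bandwidth over PCIe 2.0 x16): a 64 MB array of floats (16M elements) costs roughly 10 ms per direction to transfer, so a kernel performing a single multiply per element cannot recoup the copy time, whereas a workload performing hundreds or thousands of operations per transferred element amortizes it easily.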


APARAPI NBODY EXAMPLE

NBody is a common OpenCL/CUDA benchmark/demo

– For each particle/body

Calculate its new position based on the gravitational force exerted on it by every other body

Essentially an N² problem

– If we double the number of bodies, we perform four times the positional calculations

Following charts compare

– Naïve Java version (single loop)

– Aparapi version using Java Thread Pool

– Aparapi version running on the GPU (ATI Radeon™ 5870)
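
The deck doesn’t show the kernel source; below is a sketch of the inner loop in the spirit of the later open-sourced Aparapi NBody sample (xyz/vxyz as flattened coordinate/velocity arrays and n, mass, delT, espSqr as captured constants are assumptions here):

    new Kernel(){
        @Override public void run(){
            int body = getGlobalId();
            float x = xyz[body*3], y = xyz[body*3+1], z = xyz[body*3+2];
            float ax = 0f, ay = 0f, az = 0f;
            // Accumulate acceleration from every other body
            // (N work per item, N*N in total).
            for (int i = 0; i < n; i++){
                float dx = xyz[i*3]   - x;
                float dy = xyz[i*3+1] - y;
                float dz = xyz[i*3+2] - z;
                float invDist = rsqrt(dx*dx + dy*dy + dz*dz + espSqr);
                float s = mass * invDist * invDist * invDist;
                ax += s * dx; ay += s * dy; az += s * dz;
            }
            // Integrate position and velocity over one timestep.
            xyz[body*3]   = x + vxyz[body*3]   * delT + 0.5f * ax * delT * delT;
            xyz[body*3+1] = y + vxyz[body*3+1] * delT + 0.5f * ay * delT * delT;
            xyz[body*3+2] = z + vxyz[body*3+2] * delT + 0.5f * az * delT * delT;
            vxyz[body*3]   += ax * delT;
            vxyz[body*3+1] += ay * delT;
            vxyz[body*3+2] += az * delT;
        }
    }.execute(n);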


APARAPI NBODY PERFORMANCE (FRAME RATE VS. NUMBER OF BODIES)

Bodies | Java Single Thread (fps) | Aparapi Thread Pool (fps) | Aparapi GPU (fps)
1k | 80.42 | 260.8 | 670.2
2k | 19.96 | 72.67 | 389.12
4k | 5.19 | 19.37 | 186.05
8k | 1.29 | 5.47 | 79.87
16k | 0.32 | 1.45 | 34.24
32k | 0.08 | 0.38 | 12.18
64k | 0.02 | 0.1 | 3.57
128k | 0.01 | 0.02 | 0.94


NBODY PERFORMANCE: POSITION CALCULATIONS PER µSEC VS. NUMBER OF BODIES

Bodies | Java Single Thread | Aparapi Thread Pool | Aparapi GPU
1k | 84 | 273 | 702
2k | 83 | 304 | 1632
4k | 83 | 313 | 3146
8k | 86 | 367 | 5360
16k | 86 | 388 | 9190
32k | 86 | 407 | 13078
64k | 86 | 412 | 15663
128k | 86 | 412 | 16101

(Position calculations per µs.)


APARAPI EXPLICIT BUFFER MANAGEMENT

This code demonstrates a fairly common pattern: a Kernel executed inside a loop.

    int[] buffer = new int[HUGE];
    int[] unusedBuffer = new int[HUGE];

    Kernel k = new Kernel(){
        @Override public void run(){
            // mutates buffer contents
            // no reference to unusedBuffer
        }
    };

    for (int i=0; i<1000; i++){
        // Transfer buffer to GPU
        k.execute(HUGE);
        // Transfer buffer from GPU
    }

Although Aparapi analyzes kernel methods to optimize host buffer transfer requests, it has no knowledge of buffer accesses from the enclosing loop. Aparapi must assume that the buffer is modified between invocations.

This forces (possibly unnecessary) buffer copies to and from the device for each invocation of Kernel.execute(int), as the comments above show.



APARAPI EXPLICIT BUFFER MANAGEMENT

Using the new explicit buffer management APIs

int [] buffer = new int[HUGE];

    Kernel k = new Kernel(){
        @Override public void run(){
            // mutates buffer contents
        }
    };

    k.setExplicit(true);
    k.put(buffer);
    for (int i=0; i<1000; i++){
        k.execute(HUGE);
    }
    k.get(buffer);

Developer takes control (of all buffer transfers) by marking the kernel as explicit

Then coordinates when/if transfers take place

Here we save 999 buffer writes and 999 buffer reads


APARAPI EXPLICIT BUFFER MANAGEMENT

A possible alternative might be to expose the ‘host’ code to Aparapi

int [] buffer = new int[HUGE];

    Kernel k = new Kernel(){
        @Override public void run(){
            // mutates buffer contents
        }
        @Override public void host(){
            for (int i=0; i<1000; i++){
                execute(HUGE);
            }
        }
    };
    k.host();

Developer exposes the host code to Aparapi by overriding the host() method.

By analyzing the bytecode of host(), Aparapi can determine when/if buffers are mutated and can ‘inject’ appropriate put()/get() requests behind the scenes.


APARAPI BITONIC SORT WITH EXPLICIT BUFFER MANAGEMENT

Bitonic mergesort is a parallel-friendly ‘in place’ sorting algorithm

– http://en.wikipedia.org/wiki/Bitonic_sorter

On 10/18/2010 the following post appeared on Aparapi forums

“Aparapi 140x slower than single thread Java?! what am I doing wrong?”

– Source code (for Bitonic Sort) was included in the post

An Aparapi Kernel (for each sort pass) executed inside a Java loop.

Aparapi was forcing unnecessary buffer copies.

Following chart compares :

– Single threaded Java version

– Aparapi/GPU version without explicit buffer management (default AUTO mode)

– Aparapi/GPU version with recent explicit buffer management feature enabled.

Both Aparapi versions run on an ATI Radeon™ 5870.
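
A sketch of the pattern in question (BitonicPassKernel and its setPass() are hypothetical stand-ins; the forum poster’s code differed), showing where explicit mode removes the per-pass copies:

    int[] data = new int[n]; // n a power of two; filling omitted
    BitonicPassKernel k = new BitonicPassKernel(data); // hypothetical kernel
    k.setExplicit(true);
    k.put(data);                        // one write to the device
    for (int stage = 2; stage <= n; stage <<= 1){
        for (int dist = stage >> 1; dist > 0; dist >>= 1){
            k.setPass(stage, dist);     // scalar args only: no buffer copy
            k.execute(n);               // each pass reuses the device buffer
        }
    }
    k.get(data);                        // one read back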


EXPLICIT BUFFER MANAGEMENT EFFECT ON BITONIC SORT IMPLEMENTATION

Integers | Java Single Thread (ms) | GPU AUTO (ms) | GPU EXPLICIT (ms)
16k | 13 | 117 | 17
32k | 21 | 137 | 19
64k | 36 | 164 | 23
128k | 69 | 215 | 25
256k | 142 | 332 | 34
512k | 296 | 495 | 54
1024k | 632 | 850 | 97
2048k | 1525 | 1462 | 165
4096k | 3235 | 2855 | 337


PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

A Java developer implementing an ‘nbody’ solution would probably define a class for each particle:

    public class Particle{
        private int x, y, z;
        private String name;
        private Color color;
        // ...
    }

… would make all fields private and limit access via setters/getters:

    public void setX(int x){ this.x = x; }
    public int getX(){ return this.x; }
    // same for y, z, name etc.

… and expect to create a Kernel to update positions for an array of such particles

    Particle[] particles = new Particle[1024];
    ParticleKernel kernel = new ParticleKernel(particles);
    while (displaying){
        kernel.execute(particles.length);
        updateDisplayPositions(particles);
    }
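
The deck references ParticleKernel without defining it; a hypothetical sketch of what the proposal would let a developer write:

    class ParticleKernel extends Kernel{
        final Particle[] particles;
        ParticleKernel(Particle[] particles){ this.particles = particles; }
        @Override public void run(){
            Particle p = particles[getGlobalId()];
            // Field accesses via getters/setters would be tracked by Aparapi
            // and mapped onto extracted parallel arrays.
            p.setX(p.getX() + 1); // position update elided
        }
    }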


PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

Unfortunately the current ‘alpha’ version of Aparapi would fail to convert this kernel to OpenCL

Would fall back to using a Thread Pool.

Aparapi currently requires that the previous code be refactored so that data is held in primitive arrays:

    int[] x = new int[1024];
    int[] y = new int[1024];
    int[] z = new int[1024];
    Color[] color = new Color[1024];
    String[] name = new String[1024];
    Positioner.position(x, y, z);

This is clearly a potential ‘barrier to adoption’


PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

Proposed enhancement will allow Aparapi Kernels to access arrays (or array based collections) of objects

At runtime Aparapi:

– Tracks all fields accessed via objects reachable from Kernel.run()

– Extracts the data from these fields into a parallel-array form

– Executes a Kernel using the parallel-array form

– Writes the data back into the original object array

These extra steps do impact performance (compared with refactored data parallel form)

– However, we can still demonstrate performance gains over non-Aparapi versions
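
Conceptually, the extract/execute/write-back cycle amounts to something like this sketch (not the proposed implementation; xs and the getter/setter names are illustrative):

    int n = particles.length;
    int[] xs = new int[n];
    for (int i = 0; i < n; i++){
        xs[i] = particles[i].getX();      // extract tracked field
    }
    // ... execute the kernel against the parallel array xs ...
    for (int i = 0; i < n; i++){
        particles[i].setX(xs[i]);         // write results back
    }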


FUTURE WORK

Sync with ‘project lambda’ (Java 8) and allow kernels to be represented as lambda expressions

Continue to investigate automatic extraction of buffer transfers from object collections

Hand more explicit control to ‘power users’

– Explicit buffer (or even sub buffer) transfers

– Expose local memory and barriers

Open Source

– Aiming for Q3 Open Source release of Aparapi

– License TBD, probably BSD variant

– Still reviewing hosting options

– Encourage community contributions


SIMILAR INTERESTING/RELATED WORK

Tidepowerd

– Offers a similar solution for .NET

– NVIDIA cards only at present

http://www.tidepowerd.com/

java-gpu

– An open source project for extracting kernels from nested loops

– Extracts code structure from bytecode

– Creates CUDA behind the scenes

http://code.google.com/p/java-gpu/

GRAPHITE-OpenCL

– Auto detect data parallel loops in gcc compiler and generate OpenCL + host code for those loops

http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=2010-GCC-Summit-Proceedings.pdf


SUMMARY

APUs/GPUs offer unprecedented performance for the appropriate workload

Don’t assume everything can/should execute on the APU/GPU

Profile your Java code to uncover potential parallel opportunities

Aparapi provides an ideal framework for executing data-parallel code on the GPU

Find out more about Aparapi at http://developer.amd.com/Aparapi

Participate in the upcoming Aparapi Open Source community


QUESTIONS


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, AMD Radeon, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group, Inc.

NVIDIA, the NVIDIA logo, and CUDA are trademarks or registered trademarks of NVIDIA Corporation.

Java, JVM, JDK and “Write Once, Run Anywhere” are trademarks of Oracle and/or its affiliates.

© 2011 Advanced Micro Devices, Inc. All rights reserved.