Sep 11, 2009
Automatic Transformation of Applications onto GPUs and GPU
Clusters
PhD Candidacy presentation: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University
Outline of Contents
• Introduction
  • Accelerators, GPGPU and GPU clusters
  • Difficulty of GPU programming
• Current Work
  • Translation system for enabling data mining applications on GPUs
  • Automatic translation of data mining applications from MATLAB to GPUs
  • Automatic code generation for data mining on clusters with GPU support
• Future Work
  • Optimize shared memory usage
  • Support more categories of MATLAB programs
  • Extend the FREERIDE framework
Introduction
• Accelerators, GPGPU and GPU clusters
  – Multi-core architectures are increasingly popular in high-performance computing
  – Examples: GPUs, the Cell processor, FPGAs
  – GPUs have a good performance/price ratio
• Difficulty of programming
  – How do we program a cluster with accelerators on each node?
Our Approach
• Provide high-level support for programming emerging high-end configurations
  • High-level language as input
  • Target code for particular devices as output
• Focus on specific application classes
  • Simplifies code generation
Specific Implementations

[Diagram: input languages (C/C++, MATLAB, …) are mapped to target platforms (GPU, GPU cluster, …).]
Translation System for Enabling Data Mining Applications on GPUs
• Focus on data mining applications
  • Common structure
• Translation system
  • Simple user input
  • Optimization for GPU execution
Complications of GPU Programming
• The user must thoroughly understand the GPU architecture and the programming model of the language
• Must deal with memory allocation
• Must develop and optimize strategies for data movement between memory hierarchies (for example, the shared memory on the GPU)
Parallel Data Mining
• Common structure of data mining applications (adapted from FREERIDE):

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Basic Idea of Data Mining on GPU
• Parallel shared-memory programming
• 3 steps involved in the main loop
  • Reading data
  • Computing update
  • Writing back update
GREENRIDE: Translation System for Enabling Data Mining Applications on GPUs
• User input
• Code analyzer
  • Analysis of variables (variable type and size)
  • Analysis of reduction functions (sequential code from the user)
• Code generator (generates the CUDA code and the C++ code invoking the kernel function)
• Optimization
Architecture of the System

[Diagram: user input (variable information, reduction functions, optional functions) feeds the Code Analyzer (in LLVM) and the Variable Analyzer; these pass variable access patterns and combination operations to the Code Generator, which emits the host program (data copy and thread grid configuration) and the kernel functions, compiled together into the executable.]
User Input
• A sequential reduction function
• Optional functions (initialization function, combination function, …)
• Values of variables / sizes of arrays
• Variables to be used in the reduction function
Analysis of Sequential Code
• Identify the parallel loop
  • One major loop is parallelized
• Get the access features of each variable
• Figure out the data to be replicated
• Get the operator for global combination
• Calculate the size of shared memory to use and which data to keep in shared memory
Get the Access Features for Each Variable
• Use LLVM to generate the intermediate representation
• Analyze the intermediate representation and collect information for each pointer
  • Pointers used in "store" operations are output pointers
  • The other pointers in the argument list are input variables
  • Pointers that do not appear in the argument list are temporary storage
Data to be Replicated
• Variables that need to be replicated for each thread
  • Variables to be written
  • Different threads may access the same object
  • Need combination at the end
Generate CUDA and C++/C Code
• Revisit the common structure of data mining applications (adapted from FREERIDE):

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Generate CUDA and C++/C Code
• Memory allocation and copy
• Reduction functions and global combination
• Optimization
  • Using shared memory on the GPU
Memory Allocation and Copy
• Device memory allocation, copying between memory hierarchies, and replication for threads
• Variables that are written need a replica for each thread

[Diagram: arrays A and B laid out with per-thread copies for threads T0, T1, …, T63.]
Reduction Functions
• Global function
  • Invokes the kernel reduction and the global combination function
  • Invokes user-specified initialization and combination functions, if any
• Kernel reduction function
  • Generated out of the original sequential code
  • Divides the main loop by block number and thread number
  • Rewrites the access indices
  • …
• Global combination
Program Structure of the Generated CUDA Code
• Host program:

  Copy_data(HOST_TO_DEVICE);
  Device_reduct();
  Copy_data(DEVICE_TO_HOST);
  Combination();

• Device function:

  Device_reduct()
  {
      Reduction();
      __syncthreads();
      Combination();
      __syncthreads();
  }
Optimizations
• Use shared memory on the GPU (almost a "must"!)
• If provided, use user-specified initialization functions and combination functions
• Specify variables that are allocated once and reused by multiple invocations of the kernel
• …
Deal with Shared Memory
• Use simple heuristics
• Size = length * sizeof(type) * thread_info
  – length: size of the array
  – type: char, int, or float
  – thread_info: whether it is copied to each thread
• Sort the arrays according to Size
• Mark each array as shared until the size exceeds the limit of shared memory
Other Optimizations
• Reduce memory allocation and copy overhead
  • Arrays shared by multiple iterations can be allocated and copied only once
• User-defined combination function
  • Possibly reduces the amount of data to combine
Experiments
• Applications: popular data mining applications
  • K-means clustering
  • EM clustering
  • PCA
• Compare the performance of manual CUDA code and automatically generated code
• The effect of optimization, especially the use of shared memory
Experiment Results (K-means)
Results on GeForce 8800GTX and GeForce 9800GX2

[Charts: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 256*256), for manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]
Experiment Results (EM)
[Chart: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 256*256), for manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]
Experiment Results (PCA)
[Chart: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 128*64), for computing, computing with copy, optimized computing, and optimized computing with copy.]
GMAT-DM: Automatic Transformation from MATLAB for GPUs

[Diagram: MATLAB code → OCTAVE parser → GMAT-DM → C code → GREENRIDE → CUDA code]
Main Issues in GMAT-DM
• Matrix manipulation
  • Inner representation as 1-D arrays
  • Type and shape information based on user input
• Basic problem: matrix operation translation
  • Matrix multiplication chain
• Function combination
• Other optimizations
Matrix Multiplication Chain

A * B * C, with A: 6x100, B: 100x2, C: 2x1

1. (A*B)*C — instruction count: 1212; intermediate storage: 12
2. A*(B*C) — instruction count: 800; intermediate storage: 100

Use weighted metrics to determine the multiplication chain order.
Function Combination

Before combination (two separate loops; note z and g must be initialized to MAX_VALUE and 0 respectively for the comparison to work):

  for (i = 0; i < maxRow; i++)
      for (j = 0; j < k; j++) {
          /* code to compute distances */
          d[i][j] = sqrt(distances);
      }
  for (int i1 = 0; i1 < maxRow; i1++) {
      z[i1] = MAX_VALUE;
      g[i1] = 0;
      for (int i2 = 0; i2 < k; i2++) {
          if (z[i1] > d[i1][i2]) {
              z[i1] = d[i1][i2];
              g[i1] = i2;
          }
      }
  }

After combination (the intermediate array d becomes scalars):

  float temp_0, temp_z;
  int temp_g;
  for (i = 0; i < maxRow; i++) {
      temp_z = MAX_VALUE;
      temp_g = 0;
      for (j = 0; j < k; j++) {
          /* code to compute distances */
          temp_0 = sqrt(distances);
          if (temp_z > temp_0) {
              temp_z = temp_0;
              temp_g = j;
          }
      }
      z[i] = temp_z;
      g[i] = temp_g;
  }
Other Optimizations
• Avoid memory allocation overhead: use index transformation
  • Row extraction
  • Matrix transposition
• Rename when a single element is accessed through many iterations
Experiment Results (K-means)
[Chart: speedup vs. thread_number * block_number (64*1 to 256*256). Series: C, unoptimized, optimized, C with copy, unoptimized with copy, optimized with copy.]
Experiment Results (EM)
E phase and M phase

[Charts: speedup vs. thread_number * block_number (64*1 to 256*256). E-phase series: C, unoptimized, optimized, optimized-with-copy. M-phase series: C, unoptimized, optimized-1, optimized-2, optimized-2-with-copy.]
Experiment Results (PCA)
[Chart: speedup vs. thread_number * block_number (128*1 to 128*64). Series: MATLAB, C, MATLAB-with copy, C-with copy.]
AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support

Revisit the common structure of data mining applications (adapted from FREERIDE); the reduction loop is offloaded to the GPU:

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Code Generation System
Experiment Results (K-means)
[Charts: results on a 1.5 GB data set and a 3 GB data set.]
Experiment Results (PCA)
[Charts: 64M rows with 4 components, and 2M rows with 64 components.]
Optimize Shared Memory Usage
• Use constant and texture memory
  • Cached and read-only
• Integer programming model
• Loop transformation
Integer Programming Model

Variables: x[i][j], 0 ≤ i ≤ m, 0 ≤ j ≤ n: whether variable i is kept in shared memory at loop j

Objective function: maximize
  z = Σ_{i ∈ 1..m, j ∈ 1..n} (Acc[i][j] − 1) * x[i][j] * Wp[i][j]

Constraints:
  For i in 1..m: Σ_{j ∈ 1..n} x[i][j] * W[i][j] * Sizep[i][j] ≤ S
  For j in 1..n, for i in 1..nLive[j]:
    x[live[j][i].start][j] = x[live[j][i].start + 1][j] = … = x[live[j][i].end][j]
Loop Transformation
• Loop fusion
  – Reduces intermediate memory
  – For loops with the same size, where variable[k] is temporary storage:

  Loop 1 (size[1] = n): Active[1][k] = 1 …
  Loop 2 (size[2] = n): Active[2][k] = 1 …
Loop Transformation
• Loop fission
  – Fits more data in shared memory
  – Should not violate data dependences

  for (int r = 0; r < n; r++) {       // Loop 1
      ...
      for (int c = 0; c < m; c++)     // Loop 2
          i[c] ...;
      for (int c = 0; c < m; c++)     // Loop 3
          j[c] ...;
      ...
  }
Loop Transformation
• Loop switching
  – Sort all the loops in decreasing order of iteration counts
  – Enables better usage of shared memory
  – Should not violate dependences
Support More Categories of MATLAB Programs
• Include library functions and determine thread grid configuration
• Deal with sparse matrix computation
• Investigate basic functions in the SPM library and parallelization of ICA
Include Library Functions and Determine Thread Grid Configuration
• 2-D blocking or 1-D blocking
• Use library functions
  • Depends on the input data
  • Might need separate kernels to use different configurations
  • Needs a model to evaluate the overhead of multiple kernels
Deal with Sparse Matrix Computation
• Represent sparse matrices on the GPU
  • Coordinate representation
  • Diagonal representation (easy to combine with the computing context)
  • Might use Sparse Accelerator (evaluate efficiency of shared memory usage)
• Representation conversion
  • Better done on the CPU
Investigate Basic Functions in the SPM Library and Parallelization of ICA
• Enrich GMAT-DM
  • Provide GPU support for more functions and operations in MATLAB, especially sparse matrix computation
• Study possible optimizations to communication and synchronization for SPM and ICA
  • For clusters or GPUs, depending on my progress and time
Extend FREERIDE Framework
• Convert MATLAB programs into C++ and CUDA code in the FREERIDE framework
  – Based on GMAT-DM and AUTO-GC
• Cost model to determine device configuration:

  cost[i] = (mem[i] * LATENCY + comp[i] + shared[i]) * Iter[i] + shared[i] * LATENCY
  total_cost = Σ_{i=1..m} cost[i] / number_iterations
  cost_combine = Σ_{i ∈ output} size[i] * total_threads
  If total_cost + cost_combine > T then use GPU, else use CPU
Conclusion
• Contributions
  – GREENRIDE: translating data mining applications onto GPUs
  – GMAT-DM: automatic translation of data mining applications from MATLAB to GPUs
  – AUTO-GC: automatic code generation for FREERIDE with GPU support
• Proposed work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend the FREERIDE framework
Sep 11, 2009
Thank you!
• Questions?