Automatic Transformation of Applications onto GPUs and GPU Clusters
PhD Candidacy Presentation: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University, Sep 11, 2009


Page 1

Sep 11, 2009

Automatic Transformation of Applications onto GPUs and GPU

Clusters

PhD Candidacy presentation: Wenjing Ma

Advisor: Dr Gagan Agrawal

The Ohio State University

Page 2

Outline of Contents
• Introduction
  – Accelerators, GPGPU, and GPU clusters
  – Difficulty of GPU programming
• Current Work
  – Translation system for enabling data mining applications on GPUs
  – Automatic translation of data mining applications from MATLAB to GPUs
  – Automatic code generation for data mining on clusters with GPU support
• Future Work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend FREERIDE framework

Page 3

Introduction
• Accelerators, GPGPU, and GPU clusters
  – Multi-core architectures are more and more popular in high-performance computing
  – GPU, Cell processor, FPGA
  – GPUs have a good performance/price ratio
• Difficulty of programming
  – How to program a cluster with accelerators on each node?

Page 4

Our Approach
• Provide high-level support for programming emerging high-end configurations
  – High-level language as input
  – Target code for particular devices as output
• Focus on specific application classes
  – Simplifies code generation

Page 5

Specific Implementations

[Diagram: input languages (C/C++, MATLAB, ...) on one side, target platforms (GPU, GPU cluster, ...) on the other.]

Page 6

Outline of Contents
• Introduction
• Current Work
  – Translation system for enabling data mining applications on GPUs
  – Automatic translation of data mining applications from MATLAB to GPUs
  – Automatic code generation for data mining on clusters with GPU support
• Future Work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend FREERIDE framework

Page 7

Translation System for Enabling Data Mining Applications on GPUs
• Focus on data mining applications
  – Common structure
• Translation system
  – Simple user input
• Optimization for GPU execution

Page 8

Complications of GPU Programming
• The user has to thoroughly understand the GPU architecture and the programming model of the language
• Has to deal with memory allocation
• Needs to develop and optimize strategies for data movement between memory hierarchies (an example is the shared memory on the GPU)

Page 9

Parallel Data Mining
• Common structure of data mining applications (adapted from FREERIDE):

  while () {
    /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;
    }
    Finalize();
  }
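
For concreteness, here is a minimal sketch of how k-means clustering (one of the applications evaluated later) can be written in this generalized reduction form; the function and variable names are illustrative and not GREENRIDE's actual interface.

  /* Hypothetical sketch: the k-means assignment step as a FREERIDE-style
   * generalized reduction. */
  void kmeans_reduce(const float *points, int n, int dim,
                     const float *centers, int k,
                     float *reduc_sum,   /* k*dim accumulators (Reduc) */
                     int   *reduc_cnt)   /* k accumulators     (Reduc) */
  {
      for (int e = 0; e < n; e++) {                /* Foreach (element e) */
          int best = 0;
          float best_d = 1e30f;
          for (int c = 0; c < k; c++) {            /* (i, val) = process(e) */
              float d = 0.0f;
              for (int j = 0; j < dim; j++) {
                  float diff = points[e * dim + j] - centers[c * dim + j];
                  d += diff * diff;
              }
              if (d < best_d) { best_d = d; best = c; }
          }
          for (int j = 0; j < dim; j++)            /* Reduc(i) = Reduc(i) op val */
              reduc_sum[best * dim + j] += points[e * dim + j];
          reduc_cnt[best] += 1;
      }
      /* Finalize(): recompute centers from reduc_sum / reduc_cnt, then the
       * outer while() loop checks for convergence. */
  }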

Page 10

Basic Idea of Data Mining on GPU
• Parallel shared-memory programming
• 3 steps involved in the main loop:
  – Reading data
  – Computing update
  – Writing back update

Page 11

GREENRIDE: Translation System for Enabling Data Mining Applications on GPUs
• User input
• Code analyzer
  – Analysis of variables (variable type and size)
  – Analysis of reduction functions (sequential code from the user)
• Code generator (generates the CUDA code and the C++ code invoking the kernel function)
• Optimization

Page 12

Architecture of the System

[System diagram: User input (variable information, reduction functions, optional functions) feeds the Code Analyzer (in LLVM) and the Variable Analyzer, which derive variable access patterns and combination operations for the Code Generator; the Code Generator emits the host program (data copy and thread grid configuration) and the kernel functions, which are built into the executable.]

Page 13

User Input
• A sequential reduction function
• Optional functions (initialization function, combination function, ...)
• Values of variables / sizes of arrays
• Variables to be used in the reduction function
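
GREENRIDE's actual input syntax is not shown in the slides; purely as an illustration, the user input for k-means might amount to a handful of size declarations plus the prototype of the sequential reduction function. Everything below is a hypothetical sketch.

  /* Hypothetical sketch of user input for k-means. */

  enum { N = 1000000, DIM = 4, K = 16 };   /* values of variables / array sizes (assumed) */

  /* Variables used in the reduction function:
   *   points    : input,  N * DIM floats
   *   centers   : input,  K * DIM floats
   *   reduc_sum : output, K * DIM floats, combined across threads with "+"
   *   reduc_cnt : output, K ints,         combined across threads with "+"   */
  void kmeans_reduce(const float *points, int n, int dim,
                     const float *centers, int k,
                     float *reduc_sum, int *reduc_cnt);

  /* The body of kmeans_reduce() is the ordinary sequential C code shown earlier;
   * the translator analyzes it and generates the CUDA kernel and host code. */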

Page 14

Analysis of Sequential Code
• Identify the parallel loop
  – One major loop is parallelized
• Get the access features of each variable
• Figure out the data to be replicated
• Get the operator for global combination
• Calculate the size of shared memory to use and which data to keep in shared memory

Page 15

Get the Access Features for Each Variable
• Use LLVM to generate the intermediate representation
• Analyze the intermediate representation and get the information of each pointer:
  – Trace the pointers used in "store" operations; these are output pointers
  – The other pointers in the argument list are input variables
  – The pointers that don't appear in the argument list are temporary storage
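
Applied to the hypothetical kmeans_reduce() sketch from earlier, the classification would come out as annotated below; the rules are the ones on this slide, while the function itself remains illustrative.

  /* Pointer classification for the hypothetical kmeans_reduce() under the rules
   * above: store targets -> output, other pointer arguments -> input,
   * pointers not in the argument list -> temporary storage. */
  #include <stdlib.h>

  void kmeans_reduce(const float *points,   /* argument, never stored to   -> input  */
                     int n, int dim,
                     const float *centers,  /* argument, never stored to   -> input  */
                     int k,
                     float *reduc_sum,      /* target of store operations  -> output */
                     int   *reduc_cnt)      /* target of store operations  -> output */
  {
      float *scratch = (float *) malloc(k * sizeof(float)); /* not an argument -> temporary storage */
      /* ... reduction body as in the earlier sketch ... */
      free(scratch);
  }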

Page 16

Data to be Replicated
• Variables that need to be replicated for each thread:
  – Variables to be written
  – Different threads may access the same object
  – Need combination at the end

Page 17

Generate CUDA and C++/C Code
• Revisit the common structure of data mining applications (adapted from FREERIDE):

  while () {
    /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;
    }
    Finalize();
  }

Page 18

Generate CUDA and C++/C Code
• Memory allocation and copy
• Reduction functions and global combination
• Optimization
  – Using shared memory on GPU

Page 19

Memory Allocation and Copy
• Device memory allocation, copying between memory hierarchies, and replication for threads
• The output needs replication for each thread

[Figure: arrays A and B laid out in device memory with one copy per thread, interleaved as T0, T1, T2, ..., T63, T0, T1, ...]
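
A minimal host-side sketch of what such per-thread replication could look like in CUDA; reduc_size and total_threads are assumed parameters, and the layout choice is illustrative rather than the generated code.

  // Hypothetical sketch: allocate one private copy of the reduction array per
  // thread on the device so that threads never write to the same element.
  #include <cuda_runtime.h>

  float *alloc_replicated(int reduc_size, int total_threads)
  {
      float *d_reduc = NULL;
      size_t bytes = (size_t)reduc_size * total_threads * sizeof(float);
      cudaMalloc((void **)&d_reduc, bytes);   // one copy of Reduc per thread
      cudaMemset(d_reduc, 0, bytes);          // initialize all copies
      return d_reduc;
  }

  // Inside the kernel, thread t updates only its own copy, e.g.
  //   d_reduc[i * total_threads + t] += val;   // interleaved layout, as in the figure
  // or
  //   d_reduc[t * reduc_size + i]   += val;   // blocked layout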

Page 20

Reduction Functions
• Global function
  – Invokes the kernel reduction and the global combination function
  – Invokes user-specified initialization and combination functions, if any
• Kernel reduction function
  – Generated out of the original sequential code
  – Divide the main loop by block number and thread number
  – Rewrite the access indices
  – ...
• Global combination
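
A minimal sketch of what such a generated kernel could look like for the k-means example, assuming the interleaved per-thread replication sketched above; this is illustrative, not GREENRIDE's actual output.

  // Hypothetical sketch of a generated kernel: the element loop is divided among
  // blocks and threads, and the access indices of the reduction arrays are
  // rewritten so that each thread updates its private copy.
  __global__ void kmeans_reduce_kernel(const float *points, int n, int dim,
                                       const float *centers, int k,
                                       float *reduc_sum, int *reduc_cnt)
  {
      int tid    = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;         // total number of threads

      for (int e = tid; e < n; e += stride) {      // main loop divided among threads
          int best = 0;
          float best_d = 1e30f;
          for (int c = 0; c < k; c++) {
              float d = 0.0f;
              for (int j = 0; j < dim; j++) {
                  float diff = points[e * dim + j] - centers[c * dim + j];
                  d += diff * diff;
              }
              if (d < best_d) { best_d = d; best = c; }
          }
          for (int j = 0; j < dim; j++)            // rewritten indices: one copy per thread
              reduc_sum[(best * dim + j) * stride + tid] += points[e * dim + j];
          reduc_cnt[best * stride + tid] += 1;
      }
      // A global combination step (not shown) then folds the per-thread copies into
      // one result, and Finalize() runs on the host.
  }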

Page 21

Program Structure of the Generated CUDA Code
• Host program:

  Copy_data(HOST_TO_DEVICE);
  Device_reduct();
  Copy_data(DEVICE_TO_HOST);
  Combination();

• Device function:

  Device_reduct()
  {
    Reduction();
    __syncthreads();
    Combination();
    __syncthreads();
  }

Page 22

Optimizations
• Use shared memory on GPU (almost a "must"!)
• If provided, use user-specified initialization functions and combination functions
• Specify variables that are allocated once and reused by multiple invocations of the kernel
• ...

Page 23

Deal with Shared Memory
• Use simple heuristics:
  – Size = length * sizeof(type) * thread_info
      length: size of the array
      type: char, int, or float
      thread_info: whether it's copied to each thread
  – Sort the arrays according to Size
  – Mark each array as shared until the size exceeds the limit of shared memory
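
A minimal host-side sketch of this greedy heuristic. The ArrayInfo fields are assumptions, thread_info is treated as a replication factor, and the 16 KB limit simply matches the per-multiprocessor shared memory of the GPUs used in the experiments; none of this is GREENRIDE's actual data structure.

  // Hypothetical sketch of the shared-memory heuristic described above.
  #include <algorithm>
  #include <vector>

  struct ArrayInfo {
      const char *name;
      size_t length;        // number of elements
      size_t elem_size;     // sizeof(type): char, int, float, ...
      size_t thread_info;   // replication factor: threads per block if replicated, else 1
      bool   in_shared;
  };

  void mark_shared_arrays(std::vector<ArrayInfo> &arrays,
                          size_t shared_limit = 16 * 1024)    // assumed 16 KB limit
  {
      auto size_of = [](const ArrayInfo &a) {
          return a.length * a.elem_size * a.thread_info;      // Size = length * sizeof(type) * thread_info
      };
      // Sort by Size (smallest first here, so that more arrays fit; the slide does
      // not specify the direction).
      std::sort(arrays.begin(), arrays.end(),
                [&](const ArrayInfo &a, const ArrayInfo &b) { return size_of(a) < size_of(b); });
      size_t used = 0;
      for (ArrayInfo &a : arrays) {       // mark arrays as shared until the limit is exceeded
          a.in_shared = false;
          if (used + size_of(a) <= shared_limit) {
              a.in_shared = true;
              used += size_of(a);
          }
      }
  }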

Page 24

Other Optimizations
• Reduce memory allocation and copy overhead
  – Arrays shared by multiple iterations can be allocated and copied only once
• User-defined combination function
  – Possibly reduces the amount of data to combine

Page 25

Experiments
• Applications: popular data mining applications
  – K-means clustering
  – EM clustering
  – PCA
• Compare the performance of manual CUDA code and automatically generated code
• The effect of optimization, especially the use of shared memory

Page 26

Experiment Results (K-means)

[Charts: speedup over the sequential CPU version vs. thread_number * block_number (64*1 up to 256*256) on GeForce 8800GTX and GeForce 9800GX2, comparing manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]

Page 27

Experiment Results (EM)

[Chart: speedup over the sequential CPU version vs. thread_number * block_number (64*1 up to 256*256), comparing manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]

Page 28

Experiment Results (PCA)

[Chart: speedup over the sequential CPU version vs. thread_number * block_number (64*1 up to 128*64), comparing computing, computing with copy, optimized computing, and optimized computing with copy.]

Page 29

Outline of Contents
• Introduction
• Current Work
  – Translation system for enabling data mining applications on GPUs
  – Automatic translation of data mining applications from MATLAB to GPUs
  – Automatic code generation for data mining on clusters with GPU support
• Future Work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend FREERIDE framework

Page 30

GMAT-DM: Automatic Transformation from MATLAB for GPUs

[System flow: MATLAB code → GMAT-DM (built on the OCTAVE parser) → C code → GREENRIDE → CUDA code]

Page 31

Main Issues in GMAT-DM
• Matrix manipulation
  – Internal representation as 1-D arrays (see the sketch below)
  – Type and shape information based on user input
• Basic problem: matrix operation translation
  – Matrix multiplication chain
• Function combination
• Other optimizations
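
As a small illustration of the 1-D representation, a matrix can be stored as a flat array with an index macro; the macro below is hypothetical and uses row-major order (a column-major mapping, MATLAB's native order, would swap the roles of the indices).

  /* Hypothetical sketch: an element of a matrix stored as a flat 1-D array. */
  #define IDX(i, j, ncols) ((i) * (ncols) + (j))   /* row-major mapping */

  float get_elem(const float *A, int ncols, int i, int j)
  {
      return A[IDX(i, j, ncols)];
  }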

Page 32

Matrix Multiplication Chain

A * B * C, with A: 6*100, B: 100*2, C: 2*1

1. (A*B)*C: instruction count 1212, intermediate storage 12
2. A*(B*C): instruction count 800, intermediate storage 100

Use weighted metrics to determine the multiplication chain.
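
A minimal sketch of how the two parenthesizations can be costed and compared with a weighted metric; the weights are assumptions, not GMAT-DM's actual heuristic.

  // Hypothetical sketch: compare (A*B)*C against A*(B*C) for A:6x100, B:100x2, C:2x1.
  // Multiplying a p x q matrix by a q x r matrix costs p*q*r multiply-adds and
  // produces a p x r intermediate result.
  #include <cstdio>

  struct Cost { long instructions; long intermediate; };

  Cost chain_left(int p, int q, int r, int s)   // (A*B)*C
  {
      return { (long)p * q * r + (long)p * r * s, (long)p * r };
  }

  Cost chain_right(int p, int q, int r, int s)  // A*(B*C)
  {
      return { (long)q * r * s + (long)p * q * s, (long)q * s };
  }

  int main()
  {
      Cost l = chain_left(6, 100, 2, 1);   // 1200 + 12 = 1212 instructions, 12 storage
      Cost r = chain_right(6, 100, 2, 1);  //  200 + 600 = 800 instructions, 100 storage
      double w_inst = 1.0, w_store = 1.0;  // assumed weights
      double score_l = w_inst * l.instructions + w_store * l.intermediate;
      double score_r = w_inst * r.instructions + w_store * r.intermediate;
      printf("%s is cheaper under these weights\n",
             score_l < score_r ? "(A*B)*C" : "A*(B*C)");
      return 0;
  }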

Page 33

Function Combination

Before combination (two separate loops with an intermediate array d):

  for (i = 0; i < maxRow; i++)
    for (j = 0; j < k; j++) {
      /* code to compute distances */
      d[i][j] = sqrt(distances);
    }
  for (int i1 = 0; i1 < maxRow; i1++) {
    z[i1] = MAX_VALUE;
    g[i1] = 0;
    for (int i2 = 0; i2 < k; i2++) {
      if (z[i1] > d[i1][i2]) {
        z[i1] = d[i1][i2];
        g[i1] = i2;
      }
    }
  }

After combination (single loop, intermediate array replaced by scalars):

  float temp_0, temp_z;
  int temp_g;
  for (i = 0; i < maxRow; i++) {
    temp_z = MAX_VALUE;
    temp_g = 0;
    for (j = 0; j < k; j++) {
      /* code to compute distances */
      temp_0 = sqrt(distances);
      if (temp_z > temp_0) {
        temp_z = temp_0;
        temp_g = j;
      }
    }
    z[i] = temp_z;
    g[i] = temp_g;
  }

Page 34

Other Optimizations
• Avoid memory allocation overhead: use index transformation
  – Row extraction
  – Matrix transposition
• Rename when a single element is accessed through many iterations

Page 35

Experiment Results (K-means)

[Chart: speedup vs. thread_number * block_number (64*1 up to 256*256), comparing C, unoptimized, optimized, C with copy, unoptimized with copy, and optimized with copy.]

Page 36

Experiment Results (EM)

[Charts for the E phase and the M phase: speedup vs. thread_number * block_number (64*1 up to 256*256). One panel compares C, unoptimized, optimized, and optimized-with-copy; the other compares C, unoptimized, optimized-1, optimized-2, and optimized-2-with-copy.]

Page 37

Experiment Results (PCA)

[Chart: speedup vs. thread_number * block_number (128*1 up to 128*64), comparing MATLAB, C, MATLAB-with copy, and C-with copy.]

Page 38

Outline of Contents
• Introduction
• Current Work
  – Translation system for enabling data mining applications on GPUs
  – Automatic translation of data mining applications from MATLAB to GPUs
  – Automatic code generation for data mining on clusters with GPU support
• Future Work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend FREERIDE framework

Page 39

AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support

Revisit the common structure of data mining applications (adapted from FREERIDE); the reduction loop is the part offloaded to the GPU:

  While () {
    /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;    ← GPU
    }
    Finalize();
  }

Page 40

Code Generation System

Page 41

Experiment Results (K-means)

[Charts for the 1.5GB data set and the 3GB data set.]

Page 42

Experiment Results (PCA)

[Charts for 64M rows with 4 components and 2M rows with 64 components.]

Page 43

Outline of Contents
• Introduction
• Current Work
  – Translation system for enabling data mining applications on GPUs
  – Automatic translation of data mining applications from MATLAB to GPUs
  – Automatic code generation for data mining on clusters with GPU support
• Future Work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend FREERIDE framework

Page 44

Optimize Shared Memory Usage
• Use constant and texture memory
  – Cached and read-only
• Integer programming model
• Loop transformation
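
As a small illustration of the first point, read-only data such as the cluster centers could live in constant memory, which is cached; the sketch below is hypothetical (names and the size bound are assumptions), not part of the proposed system.

  // Hypothetical sketch: a small read-only array kept in cached constant memory.
  #include <cuda_runtime.h>

  #define MAX_CENTERS (16 * 4)                    // assumed upper bound on k * dim
  __constant__ float c_centers[MAX_CENTERS];      // read-only, cached on the GPU

  void upload_centers(const float *h_centers, int count)
  {
      // Copy from host memory into the constant-memory symbol.
      cudaMemcpyToSymbol(c_centers, h_centers, count * sizeof(float));
  }

  // Kernels then read c_centers[...] directly instead of a global-memory pointer.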

Page 45

Integer Programming Model

Variables: x[i][j], 0 ≤ i ≤ m, 0 ≤ j ≤ n, indicating whether variable i is kept in shared memory at loop j.

Objective function: maximize
  z = Σ_{i ∈ 1..m, j ∈ 1..n} (Acc[i][j] − 1) ∗ x[i][j] ∗ Wp[i][j]

Constraints:
  For i in 1..m:  Σ_{j ∈ 1..n} x[i][j] ∗ W[i][j] ∗ Sizep[i][j] ≤ S
  For j in 1..n, for i in 1..nLive[j]:
    x[live[j][i].start][j] = x[live[j][i].start + 1][j] = ... = x[live[j][i].end][j]
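
To make the model concrete, here is a toy brute-force evaluation of the objective and the capacity constraint for a tiny made-up instance; the live-range equality constraints are omitted, and all numbers are purely illustrative.

  // Hypothetical toy example: enumerate all 0/1 assignments of x[i][j] for
  // m = 2 variables and n = 2 loops, and keep the best feasible one.
  #include <cstdio>

  int main()
  {
      const int m = 2, n = 2;
      double Acc[m][n]   = {{4, 2}, {8, 1}};    // made-up access counts
      double Wp[m][n]    = {{1, 1}, {1, 1}};    // made-up objective weights
      double W[m][n]     = {{1, 1}, {1, 1}};    // made-up constraint weights
      double Sizep[m][n] = {{6, 6}, {10, 10}};  // made-up sizes (KB)
      double S = 12;                            // shared-memory capacity (KB)

      double best_z = -1;
      int best_mask = 0;
      for (int mask = 0; mask < (1 << (m * n)); mask++) {
          int x[m][n];
          for (int i = 0; i < m; i++)
              for (int j = 0; j < n; j++)
                  x[i][j] = (mask >> (i * n + j)) & 1;

          bool feasible = true;                 // capacity constraint, as on the slide
          for (int i = 0; i < m && feasible; i++) {
              double used = 0;
              for (int j = 0; j < n; j++) used += x[i][j] * W[i][j] * Sizep[i][j];
              if (used > S) feasible = false;
          }
          if (!feasible) continue;

          double z = 0;                         // objective value of this assignment
          for (int i = 0; i < m; i++)
              for (int j = 0; j < n; j++)
                  z += (Acc[i][j] - 1) * x[i][j] * Wp[i][j];
          if (z > best_z) { best_z = z; best_mask = mask; }
      }
      printf("best objective %.1f, assignment mask 0x%x\n", best_z, best_mask);
      return 0;
  }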

Page 46

Loop Transformation
• Loop fusion
  – Reduce intermediate memory
  – For loops with the same size, where variable[k] is temporary storage:

  Loop 1 (size[1] = n): Active[1][k] = 1 ...
  Loop 2 (size[2] = n): Active[2][k] = 1 ...

Page 47

Loop Transformation
• Loop fission
  – Fit more data in shared memory
  – Should not violate data dependences

  for (int r = 0; r < n; r++) {      // Loop 1
      ...
      for (int c = 0; c < m; c++)    // Loop 2
          i[c] ... ;
      for (int c = 0; c < m; c++)    // Loop 3
          j[c] ... ;
      ...
  }
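
A sketch of what fission of Loop 1 could produce when the elided code allows it: the outer loop is split so that Loop 2 and Loop 3 each sit in their own copy of the outer loop, and only one of i[] and j[] needs to be resident in shared memory at a time (illustrative only; the ellipses stand for the code elided on the slide).

  for (int r = 0; r < n; r++) {      // Loop 1a
      ...
      for (int c = 0; c < m; c++)    // Loop 2
          i[c] ... ;
  }
  for (int r = 0; r < n; r++) {      // Loop 1b
      for (int c = 0; c < m; c++)    // Loop 3
          j[c] ... ;
      ...
  }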

Page 48

Loop Transformation
• Loop switching
  – Sort all the loops in decreasing order of iteration counts
  – Enables better usage of shared memory
  – Should not violate dependences

Page 49

Support More Categories of MATLAB Programs
• Include library functions and determine the thread grid configuration
• Deal with sparse matrix computation
• Investigate basic functions in the SPM library and parallelization of ICA

Page 50

Include Library Functions and Determine the Thread Grid Configuration
• 2-D blocking or 1-D blocking
• Use library functions
• The choice depends on the input data
• Might need separate kernels to use different configurations
• Need a model to evaluate the overhead of multiple kernels

Page 51

Deal with Sparse Matrix Computation
• Represent sparse matrices on the GPU
  – Coordinate representation
  – Diagonal representation (easy to combine with the computing context)
  – Might use Sparse Accelerator (evaluate efficiency of shared memory usage)
• Representation conversion
  – Better done on the CPU
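
For reference, a minimal sketch of the coordinate (COO) representation mentioned above, with a host-side sparse matrix-vector product; this is the standard textbook format, not a design specific to this work.

  // Coordinate (COO) sparse-matrix representation: each nonzero is stored as a
  // (row, column, value) triple.
  #include <vector>

  struct CooMatrix {
      int rows, cols;
      std::vector<int>   row_idx;   // row index of each nonzero
      std::vector<int>   col_idx;   // column index of each nonzero
      std::vector<float> val;       // value of each nonzero
  };

  // y = A * x, host-side reference version.
  void coo_spmv(const CooMatrix &A, const std::vector<float> &x, std::vector<float> &y)
  {
      y.assign(A.rows, 0.0f);
      for (size_t k = 0; k < A.val.size(); k++)
          y[A.row_idx[k]] += A.val[k] * x[A.col_idx[k]];
  }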

Page 52

Investigate Basic Functions in the SPM Library and Parallelization of ICA
• Enrich GMAT-DM
  – Provide GPU support for more functions and operations in MATLAB, especially sparse matrix computation
• Study possible optimizations of communication and synchronization for SPM and ICA
  – For clusters or GPUs, depending on my progress and time

Page 53

Extend FREERIDE Framework
• Convert MATLAB programs into C++ and CUDA code in the FREERIDE framework
  – Based on GMAT-DM and AUTO-GC
• Cost model to determine device configuration:

  cost[i] = (mem[i] ∗ LATENCY + comp[i] + shared[i]) ∗ Iter[i] + shared[i] ∗ LATENCY
  total_cost = Σ_{i = 1..m} cost[i] / number_iterations
  cost_combine = Σ_{i ∈ output} size[i] ∗ total_threads
  If total_cost + cost_combine > T, then use the GPU; else use the CPU
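
A minimal sketch of this decision rule as host-side code, assuming the per-loop estimates are already available; all names mirror the formulas above, and everything else (types, units) is illustrative.

  // Hypothetical sketch of the proposed cost model for choosing GPU vs. CPU.
  #include <vector>

  struct LoopEstimate {
      double mem;      // estimated global-memory accesses of loop i
      double comp;     // estimated compute operations of loop i
      double shared;   // estimated shared-memory accesses of loop i
      double iter;     // Iter[i]: iteration count of loop i
  };

  bool use_gpu(const std::vector<LoopEstimate> &loops,      // the m loops
               const std::vector<double> &output_sizes,     // size[i] for output variables
               double number_iterations, double total_threads,
               double LATENCY, double T)
  {
      double total_cost = 0.0;
      for (const LoopEstimate &l : loops)
          total_cost += (l.mem * LATENCY + l.comp + l.shared) * l.iter
                        + l.shared * LATENCY;
      total_cost /= number_iterations;

      double cost_combine = 0.0;        // cost of combining per-thread output copies
      for (double s : output_sizes)
          cost_combine += s * total_threads;

      return total_cost + cost_combine > T;    // as on the slide: enough work -> GPU
  }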

Page 54

Conclusion
• Contributions
  – GREENRIDE: Translating data mining applications onto GPUs
  – GMAT-DM: Automatic translation of data mining applications from MATLAB to GPUs
  – AUTO-GC: Automatic code generation for FREERIDE with GPU support
• Proposed work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend the FREERIDE framework

Page 55

Thank you!

• Questions?